The Dimensionality Reduction Problem

Besides feature selection based on mathematical or domain knowledge, which we discussed in feature engineering, an important way to mitigate the curse of dimensionality is dimensionality reduction.

Overview

First, Figure 1 visualizes a data distribution to give an intuitive sense of what dimensionality reduction is: sample points from the original high-dimensional space are easier to learn from in the low-dimensional embedded subspace.

Figure 1: Schematic diagram of a low-dimensional embedding

Recall that kernel methods (kernel functions) and feature combination (feature engineering) map data from a low-dimensional to a high-dimensional space so that samples with different properties become easier to separate. The purpose of dimensionality reduction is similar, to make samples easier to separate, but the transformation runs in the opposite direction: high-dimensional data is mapped into a low-dimensional space that retains the key information, reducing computational complexity and noise.

The core goal of dimensionality reduction is to maximize the data variance (preserve information) or, equivalently, to minimize information loss. Typical applications include data visualization (e.g., 3D → 2D), feature compression (e.g., reducing the dimensionality of sensor data), and redundancy removal (e.g., eliminating collinear features). The main approaches are linear and nonlinear dimensionality reduction; the most commonly used method is Principal Component Analysis (PCA).

Before introducing PCA, consider the following question: for sample points in an orthogonal attribute space, how can all of the samples be represented appropriately by a hyperplane (the high-dimensional generalization of a line)? Intuitively, if such a hyperplane exists, it should have two properties. Minimal reconstruction error: every sample point lies sufficiently close to the hyperplane. Maximum separability: the projections of the sample points onto the hyperplane are spread apart as much as possible. In essence, the principal components are a set of mutually uncorrelated vectors obtained by a linear transformation, i.e., orthogonal vectors.

Figure 2: To spread the projections of all samples as far apart as possible (along the red line in the figure), the variance of the projected points must be maximized

Therefore, the core idea of PCA is to find orthogonal projection directions (the principal components) that maximize the variance of the projected data (the red-line direction in Figure 2). Dimensionality reduction then amounts to finding k basis vectors such that the variance of the sample projections along each basis vector is as large as possible while the covariance between them is as small as possible. The PCA procedure is described below in terms of linear algebra.

- Step 1: Data standardization

- Centering (zero mean): shift each sample by the feature-wise mean, x_i ← x_i − μ, where μ is the mean of the samples.

Figure [Schematic diagram of data standardization]: Centering shifts the data mean to zero and removes differences in scale, providing a baseline for the covariance computation.
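As a minimal sketch of this step (using a small made-up matrix X, with samples in rows and features in columns), centering and optional rescaling can be done in NumPy:

```python
import numpy as np

# Made-up example data: 5 samples, 2 features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

X_centered = X - X.mean(axis=0)               # zero-mean each feature
X_standardized = X_centered / X.std(axis=0)   # optionally also rescale to unit variance

print(X_centered.mean(axis=0))                # approximately [0, 0]
```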

- Step 2: Compute the covariance matrix, which describes the correlations between the dimensions of the data.

Figure [Schematic diagram of the covariance matrix]: The covariance matrix reflects the correlations between data dimensions; the off-diagonal elements indicate the strength of the linear relationships.
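Continuing the sketch above (same made-up matrix X), the covariance matrix can be computed with np.cov:

```python
import numpy as np

# Same made-up example data as in Step 1.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)

# Sample covariance matrix C = X_c^T X_c / (n - 1); rowvar=False because samples are rows.
C = np.cov(X_centered, rowvar=False)
print(C)   # off-diagonal entries measure the linear coupling between features
```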

- Step 3: Eigenvalue decomposition, i.e., solve for the eigenvalues and eigenvectors of the covariance matrix C: C v = λ v

where the eigenvalue (λ) gives the variance of the data along the corresponding principal component direction, and the eigenvector (v) gives the projection direction of that principal component.

Figure [Schematic diagram of eigenvalue decomposition]: The eigenvectors (arrow directions) are the principal component directions of the data, and the eigenvalues (arrow lengths) indicate the variance along those directions.
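A sketch of the eigendecomposition on the same made-up matrix; np.linalg.eigh is used because the covariance matrix is symmetric:

```python
import numpy as np

# Same made-up example data as before.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
C = np.cov(X - X.mean(axis=0), rowvar=False)

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)    # variance captured along each principal direction
print(eigenvectors)   # columns are the corresponding projection directions
```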

- Step 4: Select the principal components

- Sort the eigenvalues in descending order and take the eigenvectors of the k largest eigenvalues to form the projection matrix W.

- Dimensionality-reduced data: Z = X_c W, where X_c is the centered data matrix.

Figure [Schematic diagram of principal component projection]: Projecting the data onto the first principal component direction (PC1) reduces it from 2D to 1D while preserving as much of the original variance as possible.
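Putting the selection and projection together on the same made-up matrix (here k = 1, i.e., 2D to 1D):

```python
import numpy as np

# Same made-up example data as before.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

k = 1
order = np.argsort(eigenvalues)[::-1]     # indices of eigenvalues, largest first
W = eigenvectors[:, order[:k]]            # projection matrix: top-k eigenvectors as columns
Z = X_centered @ W                        # reduced data, shape (n_samples, k)
print(Z)
```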

3. Mathematical properties

- Variance retention: the cumulative variance contribution of the first k principal components, (λ_1 + ... + λ_k) / (λ_1 + ... + λ_d).

- Orthogonality: the principal component directions are mutually orthogonal (guaranteed by the symmetry of the covariance matrix).
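Both properties can be checked numerically; the following sketch reuses the made-up matrix from the step-by-step examples:

```python
import numpy as np

# Same made-up example data as before.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X - X.mean(axis=0), rowvar=False))
eigenvalues = np.sort(eigenvalues)[::-1]

k = 1
variance_retention = eigenvalues[:k].sum() / eigenvalues.sum()
print(variance_retention)   # cumulative variance contribution of the first k PCs

# Orthogonality: the eigenvector matrix is orthonormal, so V^T V = I.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))   # True
```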

Case 1: Water quality parameter visualization and anomaly detection

- Background: A water utility company monitors 10 water quality indicators (pH, dissolved oxygen, turbidity, etc.); the data are high-dimensional and redundant.

- Task: Use PCA to reduce the data to 2D for visualization and rapid anomaly identification.

- Implementation steps:

1. Standardize the 10-dimensional data and compute the covariance matrix.

2. Extract the first 2 principal components (cumulative variance contribution > 85%).

3. Draw a 2D scatter plot and flag anomalous samples (points far from the main cluster); a code sketch follows this case.

- Effects:

- Operations and maintenance staff can quickly localize pollution events (e.g., an industrial wastewater leak).

- Data storage is reduced by 80%, improving analysis efficiency.

Figure [PCA visualization of water quality data] (https://via.placeholder.com/400x300?text=Water+Quality+PCA)
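The case data are not reproduced here, so the sketch below substitutes a synthetic 10-dimensional matrix with a few injected outliers to illustrate the workflow (standardize, project to 2 components, flag points far from the main cluster); the variable names and the distance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical stand-in for the 10-dimensional water quality data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:5] += 6.0   # inject a few artificial "pollution events"

Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Flag samples far from the main cluster in the 2D principal-component plane.
dist = np.linalg.norm(Z - Z.mean(axis=0), axis=1)
outliers = np.where(dist > dist.mean() + 3 * dist.std())[0]
print(outliers)   # indices of candidate anomalies
```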

Case 2: Compressing water supply network sensor data

- Background: 200 pressure sensors are deployed across the urban water supply network; the data are high-dimensional and collinear.

- Task: Compress the data to 20 dimensions with PCA for predicting pipeline leakage.

- Implementation steps:

1. Standardize the sensor data and compute the covariance matrix.

2. Select the top 20 principal components (retaining 95% of the variance).

3. Train a random forest leakage prediction model on the dimensionality-reduced data; a code sketch follows this case.

- Effects:

- Model training time is reduced by 60% and accuracy improves by 12% (redundant noise is removed).

- Key principal components are found to correspond to pipe aging in specific areas.
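As above, the sensor data are not available, so the following sketch uses a synthetic 200-dimensional matrix and a made-up leak label purely to illustrate the pipeline (standardize, compress to 20 principal components, fit a random forest):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for 200 pressure-sensor readings and a binary leak label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))
y = (X[:, :3].sum(axis=1) > 0).astype(int)   # synthetic leak / no-leak label

# Standardize, compress to 20 principal components, then fit the random forest.
model = make_pipeline(StandardScaler(), PCA(n_components=20),
                      RandomForestClassifier(random_state=0))
model.fit(X, y)
print(model.score(X, y))   # training accuracy of this illustrative pipeline
```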

PCA pseudocode
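The original pseudocode figure is not reproduced in this snapshot; the outline below is a compact sketch of the same procedure, following the steps described earlier:

```python
import numpy as np

def pca(X, k):
    """Sketch of PCA: reduce X (n_samples x d) to k dimensions."""
    X_centered = X - X.mean(axis=0)                  # Step 1: center the data
    C = np.cov(X_centered, rowvar=False)             # Step 2: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)    # Step 3: eigendecomposition
    order = np.argsort(eigenvalues)[::-1]            # Step 4: sort by descending eigenvalue
    W = eigenvectors[:, order[:k]]                    #         top-k eigenvectors form W
    return X_centered @ W                             #         project: Z = X_c W
```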

2. Limitations of PCA

- Only suitable for data with linear structure.

- Sensitive to outliers (the data should be cleaned beforehand).

5. Hands-on practice (code example)

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Smart water dataset (example: 10-dimensional water quality data).
# load_water_quality_data() stands in for the project's own data loader (not defined here).
X = load_water_quality_data()

# Reduce the data to 2D with PCA
pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# Visualization
plt.scatter(Z[:, 0], Z[:, 1])
plt.xlabel('PC1 ({:.1f}% of variance)'.format(pca.explained_variance_ratio_[0] * 100))
plt.ylabel('PC2 ({:.1f}% of variance)'.format(pca.explained_variance_ratio_[1] * 100))
plt.title('PCA visualization of water quality data')
plt.show()
```