ECOM90025
Advanced Data Analysis

Linear model

Interactions: capturing heterogeneous effects between variables. An interaction means that, in a statistical model, the joint effect of two or more variables on the dependent variable (Y) is not simply the sum of their separate effects. Suppose we want to study the impact of promotions and discounts on product sales. A promotion alone may increase sales, and a discount alone may also increase sales, but running a promotion and offering a discount at the same time may lift sales by more than the sum of the two individual effects. That extra effect is the interaction.

Logistic Regression:

A statistical method commonly used for classification problems, especially when the dependent variable (the variable to be predicted) has only two outcomes, such as "yes" or "no". The goal of logistic regression is to predict the probability that an event occurs, based on one or more independent variables (also called predictors). The target variable is binary (usually coded 0 and 1). The model uses the logistic (sigmoid) function, which maps any real number into the interval (0, 1), so the output can be interpreted as the probability of the event occurring.

p = 1 / (1 + e^(-xβ)), or equivalently log(p / (1 - p)) = xβ.

Putting the log-odds on the left-hand side of the model lets logistic regression apply linear-modelling techniques to what is essentially a nonlinear relationship between the predictors and a probability. It also keeps the model's output naturally bounded between 0 and 1, exactly matching the range of a probability.

Deviance measures the discrepancy between the observed values and the model's predictions. An R²-type value is often used to measure how well the model fits the data.
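
A minimal sketch of fitting a logistic regression and reading the deviance and a deviance-based (McFadden) pseudo-R² with statsmodels; the simulated data and variable names are illustrative assumptions, not from the course.

import numpy as np
import statsmodels.api as sm

# Illustrative data: one predictor x, binary outcome y (assumed names).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))      # sigmoid maps any real number into (0, 1)
y = rng.binomial(1, p)

X = sm.add_constant(x)
res = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# Deviance compares the fitted model with the observed data;
# McFadden's pseudo-R^2 = 1 - deviance / null deviance.
pseudo_r2 = 1 - res.deviance / res.null_deviance
print(res.params, res.deviance, pseudo_r2)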

Uncertainty

Standard Error (SE)

Describes the precision of an estimated statistic such as a mean, a proportion, or a regression coefficient. The standard error measures the expected variation (standard deviation) of the sample estimate around the true population value. Put simply, it tells us how much the statistic would fluctuate around the population value if we repeatedly drew samples from the same population.

Intuitive explanation

Imagine a bucket full of ping-pong balls, each with a number written on it. You want to know the average of the numbers on all the balls in the bucket (a value you usually cannot compute directly unless you empty the bucket and count every ball). So you randomly draw some balls and compute the average of their numbers. That average is an estimate of the population average. If you repeat the sampling many times, you will get a slightly different average each time. How much these averages fluctuate around the true population mean is what the standard error measures.

Estimate: an estimate is the specific value obtained from sample data using an estimator. If the estimator is a formula, the estimate is the result of plugging the sample data into that formula.

Standard Deviation

The standard deviation is a statistic that measures the dispersion of a set of data: how far the data points deviate from the mean of the data set. The larger the standard deviation, the more spread out the data; the smaller it is, the more concentrated the data.

Standard Error

The standard error is the standard deviation of a statistic, most commonly the standard deviation of the sample mean.

Heteroskedasticity

Heteroskedasticity means that the variance of the model's residuals changes as the independent variables change.

Bootstrap

The bootstrap is a very effective statistical technique for estimating the standard error of a sample statistic (such as the mean, median, or a regression coefficient). It does not rely on strict parametric distributional assumptions, which is why it is so widely used in practice. The bootstrap simulates the data generating process (DGP) and the sampling distribution by resampling the existing data, and from that estimates the variability and uncertainty of essentially any statistic.

Basic principles of the bootstrap method

Data resampling: sample with replacement from the original sample (of size n), drawing samples of the same size many times. This is the bootstrap. Each bootstrap sample may pick some observations more than once and miss others entirely.

Compute the statistic: on each bootstrap sample, compute the statistic of interest (mean, variance, regression coefficient, etc.).

Estimate the distribution: the statistics computed across a large number of bootstrap samples form an empirical distribution of the statistic, which carries information about its variability and estimation error.

Standard error: the standard error of the statistic can be read directly off this empirical distribution, as the standard deviation of the bootstrap values of the statistic.

Steps for computing a bootstrap standard error

The following is a simplified example of the bootstrap calculation:
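
A minimal sketch of the procedure, assuming we want the bootstrap standard error of a sample mean; the data are simulated purely for illustration.

import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=200)   # illustrative sample
B = 2000                                      # number of bootstrap replications

boot_stats = np.empty(B)
for b in range(B):
    # resample with replacement, same size as the original sample
    sample = rng.choice(data, size=data.size, replace=True)
    boot_stats[b] = sample.mean()             # statistic of interest

# bootstrap SE = standard deviation of the statistic over the replications
boot_se = boot_stats.std(ddof=1)
print(boot_se)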

Advantages of the bootstrap

Model flexibility: the bootstrap does not rely on strict distributional assumptions, so it applies to a wide range of statistical models and data types.

Easy to implement: it only requires resampling the existing data and recomputing the statistic.

Wide applicability: it can estimate the standard error of many statistics, including but not limited to the mean, median, variance, and correlation coefficient.

Things to note

Despite these advantages, the bootstrap can be unstable when the sample size is very small or the data are highly skewed, and the reliability of the estimates may suffer. The bootstrap is also computationally expensive, especially when many resamples are needed.

In short, the bootstrap is a powerful tool that provides accurate error estimates for statistical inference and makes the analysis more robust.

1. Standard error: the standard error provides a measure of the uncertainty of an estimator in the absence of heteroskedasticity.

The standard error measures the accuracy of an estimator (such as a sample mean or a regression coefficient): it describes how much the estimate would fluctuate around the true value of the population parameter across repeated samples from the same population. It is a key indicator of estimation precision and is commonly used to build confidence intervals and run hypothesis tests.

2. Heteroskedasticity: the presence of heteroskedasticity forces us to reconsider how to measure the standard error of an estimator, because the standard error may change as the independent variables in the model change (in code, results_ols.HC3_se gives heteroskedasticity-robust standard errors).

Heteroskedasticity means that, in a regression model, the variance of the residuals is not constant but changes with one or more of the model's variables. Its presence affects the computation of standard errors, and therefore the accuracy of confidence intervals and hypothesis tests. Under heteroskedasticity, conventional standard errors may understate or overstate the actual uncertainty.
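
The HC3_se attribute referred to above is from statsmodels; a minimal sketch (simulated data, hypothetical variable names) comparing conventional and HC3 robust standard errors:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
# error variance grows with x -> heteroskedasticity
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)
results_ols = sm.OLS(y, X).fit()

print(results_ols.bse)      # conventional standard errors
print(results_ols.HC3_se)   # heteroskedasticity-robust (HC3) standard errors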

3. Bootstrap standard errors: bootstrap standard errors provide a flexible way to estimate standard errors, especially when traditional assumptions (such as homoskedasticity) do not hold.

When the data show complex variability such as heteroskedasticity, or when the model does not satisfy the standard assumptions of traditional methods, conventional standard-error estimates may no longer be valid. Bootstrap standard errors then become a reliable alternative. The bootstrap repeatedly draws random samples from the data set (with replacement) and computes the estimator on each sample, building an empirical description of the estimator's distribution. Because it does not rely on strict distributional assumptions, it can give valid standard-error estimates in a much wider range of situations.

1. Cross Validation

Cross-validation is a method for assessing the performance of a statistical model, used especially in machine learning and data mining. By splitting the data set into several subsets used in turn for training and validation, it reduces the model's generalisation error on unseen data. Common variants include:

K-fold cross-validation: split the data set into K equal subsets; each time, train the model on K-1 subsets and test it on the remaining one. Repeat K times, choosing a different test subset each time.

Leave-one-out cross-validation (LOOCV): a special case of K-fold cross-validation in which K equals the total number of observations, so exactly one observation is held out as the test set each time.
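
A minimal K-fold cross-validation sketch with scikit-learn (5 folds, a plain linear regression); the simulated data are a stand-in for real data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# each fold: train on K-1 subsets, score on the held-out subset
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print(scores, scores.mean())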

3. Benjamini-Hochberg method

The Benjamini-Hochberg (B-H) procedure is a method for controlling the false discovery rate (FDR). It is widely used for multiple hypothesis testing, especially in studies that run large numbers of independent or correlated tests, such as gene-expression data analysis. Its purpose is to reduce the probability of making false discoveries across many comparisons.

False discovery rate (FDR)

The false discovery rate is the proportion of false rejections (null hypotheses that are actually true but were incorrectly declared false) among all rejected null hypotheses (all tests that came back significant).

Benjamini-Hochberg steps

Sort the p-values: order all the p-values in the study from smallest to largest.

Compute the thresholds: for each p-value, compute Threshold = (i / m) × Q, where i is the rank of the current p-value (starting from 1), m is the total number of hypothesis tests, and Q is the chosen FDR level (common values are 0.05 or 0.1).

Pick the largest qualifying p-value: find the largest p-value such that p_i ≤ (i / m) × Q. All hypotheses whose p-values are less than or equal to this one are rejected.
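
A minimal sketch of the B-H cut-off, assuming a vector of p-values and an FDR level Q = 0.05; the example p-values are made up.

import numpy as np

def bh_rejections(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected by the B-H procedure."""
    pvals = np.asarray(pvals)
    m = pvals.size
    order = np.argsort(pvals)                  # sort p-values from small to large
    thresholds = (np.arange(1, m + 1) / m) * q
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below))          # largest rank i with p_(i) <= (i/m)Q
        reject[order[: k + 1]] = True          # reject every hypothesis up to that rank
    return reject

print(bh_rejections([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74]))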

Why use Benjamini-Hochberg

This method gives higher statistical power (the ability to detect true effects) than the traditional Bonferroni correction, which strictly controls the family-wise error rate. On large data sets in particular, the Bonferroni method is too conservative and can leave many genuine discoveries unrecognised.

4. K-fold out-of-sample (OOS) validation

K-fold out-of-sample validation usually refers to applying K-fold cross-validation to time-series data or settings where future values must be predicted, with "out-of-sample" stressing that the validation data come after the training data in time. It suits applications such as market-trend forecasting and financial analysis, and better mimics how the model would perform in practice.

5. Forward Stepwise Regression

Forward stepwise regression is a model-building technique that starts from a model with no predictors and adds variables one at a time, at each step adding the variable that improves the model the most. After each addition, the model is re-evaluated to check whether performance actually improved. It is a greedy algorithm for variable selection and is especially helpful when there are many candidate variables, since it simplifies the model and reduces both the computational burden and the risk of overfitting.

Backward stepwise regression

Starting point: backward stepwise regression starts from the full model containing all candidate variables.

Process: check the importance of each variable step by step and remove those that do not contribute substantially to the model.

Problems:

Multicollinearity: when the variables in the model are highly correlated, it is hard to assess the importance of any single variable accurately.

High dimensionality: when there are many variables, starting from the full model can be inefficient and computationally expensive.

Forward stepwise regression

Starting point: forward stepwise regression starts from a model with no predictors, which can be seen as an intercept-only model.

Process: add variables one at a time, each time adding the variable that performs best in the current model.

Characteristics: it is a greedy algorithm, making the choice that looks best at each step without necessarily considering the long-run consequences.
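
A minimal greedy forward-selection sketch, using plain OLS and in-sample R² as the improvement criterion for simplicity (in practice AICc or a validation score would be used); data and function names are illustrative.

import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y, max_vars=None):
    """Greedily add the column that most improves in-sample R^2."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    best_r2 = 0.0
    while remaining and (max_vars is None or len(selected) < max_vars):
        # try each remaining variable on top of the current model
        trial_r2 = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().rsquared
                    for j in remaining}
        j_best = max(trial_r2, key=trial_r2.get)
        if trial_r2[j_best] <= best_r2:        # no further improvement
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_r2 = trial_r2[j_best]
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=200)
print(forward_stepwise(X, y, max_vars=3))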

Binary outcome

Logistic Regression + LASSO, Interactions! Cross Validation

1. LASSO

LASSO is a technique for handling high-dimensional data and selecting important variables. It introduces a penalty term that prevents overfitting, making it easier to tell which variables are signal (useful) and which are noise (useless).

Suppose we want to estimate the coefficients β of a linear regression; maximum likelihood (MLE) chooses the β that brings the model's predictions as close as possible to the observed values.

2. Scaling

Meaning: scaling is a common data-preprocessing technique that transforms the data so that different features are on similar scales. This is very important for models such as LASSO, because features with large values can have a disproportionate influence on a regularised model. Scaling/standardisation is critical when using LASSO and variants such as Elastic Net, since the penalty is very sensitive to the magnitude of each feature. The usual choices are standardisation (subtract the mean, divide by the standard deviation) or normalisation (rescale the data to lie between 0 and 1).

Example: if one feature ranges from 0 to 1 while another ranges from 0 to 1000, feeding them into LASSO without scaling can bias the model towards the feature with the larger range. Standardisation (e.g. z-scores) solves this.

3. Model selection via AICc: very useful when choosing the regularisation parameter (such as LASSO's λ).

Meaning: AICc is a small-sample correction of the Akaike information criterion. In model selection, AICc helps pick the best model (one that explains the data while avoiding overfitting).

AIC is widely used for model comparison: the model with the smallest AIC is chosen because it offers the best balance between fit and complexity.

4. Logistic Lasso

Meaning: the logistic lasso applies LASSO (Least Absolute Shrinkage and Selection Operator) regularisation to logistic regression. Logistic regression is commonly used for binary classification problems, such as predicting whether an event will occur. Introducing LASSO into logistic regression applies an L1 penalty to the coefficients, which performs variable selection and controls model complexity. L1 regularisation tends to set some coefficients exactly to zero, so those features are dropped from the model entirely, simplifying it. This is especially useful when the number of features exceeds the number of observations, and it also helps with multicollinearity. The logistic lasso is a powerful tool wherever feature selection is needed: it handles high-dimensional data effectively and lowers the risk of overfitting. With a suitable choice of the regularisation parameter λ, it strikes a good balance between model complexity and predictive performance.

Example: if a semiconductor company wants to predict whether each chip passes inspection (pass vs. fail), a logistic lasso model can select the most influential features and make the prediction.
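
A minimal scikit-learn sketch of a logistic lasso (L1-penalised logistic regression); the chip-quality framing and feature names are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 50))                 # e.g. 50 production measurements per chip
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=500) > 0).astype(int)  # pass / fail

Xs = StandardScaler().fit_transform(X)         # scaling matters for penalised models
# C is the inverse regularisation strength: smaller C -> stronger L1 penalty
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xs, y)

print((clf.coef_ != 0).sum(), "features kept out of", X.shape[1])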

5. Cross-Validation: when applying LASSO or other regularisation methods, cross-validation helps determine the optimal regularisation strength.

Meaning: cross-validation is used to choose the optimal regularisation parameter (alpha) of the LASSO model. It is very practical because it evaluates the effect of the regularisation parameter on multiple held-out subsets of the data, with the aim of finding the penalty strength that makes the model generalise best.

Example: with K-fold cross-validation, the performance of LASSO models over a grid of λ values can be evaluated and the best λ chosen.

6. Elastic Net

Meaning: a complete Elastic Net example covers initialising the model parameters, fitting the data, predicting, computing the mean squared error (MSE), and reporting the model's results. It performs variable selection through L1 regularisation while using L2 regularisation to stabilise the coefficients, which makes it well suited to data with multicollinearity or with more features than observations, because it can keep a group of similarly influential features rather than selecting only one of them.

Example: semiconductor data may contain several highly correlated production parameters. An elastic net can retain the combined effect of these related features, whereas LASSO might keep only one of them.
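
A minimal Elastic Net sketch following the steps described above (initialise, fit, predict, compute the MSE); the data and parameter values are illustrative assumptions.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 20))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=300)   # two highly correlated features
y = 3 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# alpha = overall penalty strength, l1_ratio = mix of L1 (sparsity) and L2 (stability)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_tr, y_tr)

mse = mean_squared_error(y_te, model.predict(X_te))
print(model.coef_[:2], mse)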

Connections

Together, these techniques and concepts underpin effective data analysis and predictive modelling in data-science projects. From data preprocessing (scaling), to model selection (AICc, cross-validation), to applying specific models (logistic lasso, Elastic Net), each step is an important part of building an efficient, interpretable model. In a high-tech field such as the semiconductor industry these methods are particularly valuable because they help improve production efficiency and product quality.

Difference between the lasso and the logistic lasso:

The lasso is mainly used in linear regression models and is designed to predict a continuous outcome.

The logistic lasso is used in logistic regression models and suits classification problems, especially binary classification; it is designed to predict categorical outcomes (such as 0 and 1).

The core difference is the type of problem they handle (regression vs. classification) and the corresponding loss function. The lasso estimates linear relationships with a continuous output, while the logistic lasso estimates classification outcomes, especially predicted probabilities, which involves the log-odds (logit).

Browser Data

Working with multiple datasets

Description: refers to managing and integrating the multiple data sets involved in an analysis. This includes merging, joining, and tidying the data to ensure consistency and usability.

Where it sits: usually the initial stage of a data-analysis project, processing the raw data in preparation for later analysis, modelling, and validation.

2. Pivoting

Description: pivoting is a technique for converting data from long format to wide format, or vice versa. It is very useful for data cleaning, preprocessing, and subsequent analysis and visualisation.

Where it sits: usually done after combining multiple data sets, to reshape the data into a form better suited to the analysis at hand.
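
A minimal pandas sketch of pivoting between long and wide formats; the column names are made up for illustration.

import pandas as pd

long = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "site": ["news", "shop", "news", "shop"],
    "visits": [3, 1, 0, 5],
})

# long -> wide: one row per user, one column per site
wide = long.pivot(index="user", columns="site", values="visits")

# wide -> long again
back = wide.reset_index().melt(id_vars="user", value_name="visits")
print(wide)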

3. GIGO (Garbage In, Garbage Out)

Description: a warning that if the input data are of poor quality, the output will be limited by that quality too. It underlines the importance of data cleaning and validation.

Where it sits: this principle matters throughout data processing and analysis, particularly after data integration and pivoting, to make sure the data are accurate and reliable before any further analysis.

4. LASSO it

Description: this usually means applying LASSO regression for feature selection and regularisation, to improve the model's generalisation and interpretability.

Where it sits: carried out after data cleaning and preparation, during the modelling and prediction stage, to handle feature selection in high-dimensional data sets.

5. Cross validation

Description: choose K-1 of the subsets as training data and the remaining subset as validation data. This step cycles K times, each time with a different subset as the validation data, so that every subset gets one turn as validation data. In each cycle the selected training data are used to fit the model and the validation data are used to test its performance, typically via statistics such as deviance, R², or accuracy. Where it sits: carried out after a preliminary model has been built (for example after applying LASSO) to verify its validity. Once all K cycles are finished, the validation results are usually averaged to give an overall performance measure, which can then be used to choose the best model parameters or the model itself.

Summary

Model: linear model or generalised linear model such as the Logistic regression model.

Aim: control for overparameterisation.

Method: LASSO, Ridge or ElasticNet.

Choice of penalty parameter: AICc (AIC, BIC) or cross-validation.

Multinomial Logistic Regression Model:

Multinomial logistic regression is a statistical model for classification problems in which the dependent variable has multiple categories. Unlike ordinary binary logistic regression, it can handle a dependent variable with more than two levels, making it well suited to multi-class classification problems.

Lasso

cv

Divide-and-Conquer:

Use stratified cross-validation (StratifiedKFold) together with L1 (lasso) regularisation, tuning the regularisation strength (the C value) and computing the model's performance under each setting, to fit and evaluate the multinomial logistic regression.
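
A minimal sketch of that fitting-and-tuning loop: stratified K-fold splits, an L1-penalised multinomial logistic regression, and a grid of C values (the inverse penalty strength); the simulated data are stand-ins.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for C in [0.01, 0.1, 1.0]:
    # saga supports the L1 penalty with a multinomial objective
    clf = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=5000)
    scores = cross_val_score(clf, X, y, cv=skf)
    print(C, scores.mean())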

First image:

Continuous: problems with a continuous outcome use regression; y is a continuous value predicted by a regression model. For the categorical cases the model gives the probability that y_j equals 1; β is the coefficient vector, and each category has its own coefficients.

Binary: two-class problems, such as Y = 0 or 1, use the logistic regression model.

Multinomial: multi-class problems, where y takes one of several categories k = 1, 2, ..., K, require multinomial logistic regression.

Each category has its own coefficients β: every category has its own parameter vector β_k, i.e. its own regression coefficients.

Has a log-likelihood: each of these models estimates its parameters by maximum likelihood (through the log-likelihood function).

Can use Lasso: LASSO regression can be used to sparsify these coefficients (i.e. regularise the model).

AIC/BIC: out-of-sample model fit: AIC (Akaike information criterion) and BIC (Bayesian information criterion) are model-selection criteria used mainly to assess how well the model fits out-of-sample data.

Cross-validation: with k-fold cross-validation, the data are split into training, validation, and test sets to ensure the model does not overfit. The validation set is used for tuning and the test set for the final evaluation.

Find best model on training data: first find the best model on the training data.

Then evaluate on the test set: after finding the best model, check its performance on the test data.

Second image:

Multinomial: the form of multinomial logistic regression. For each y_{i,j}, the probability P(y_j = 1 | x) is computed with the softmax function.

Lasso: LASSO regularisation in multinomial logistic regression; the objective applies an L1 penalty to the coefficients (a sparsity penalty added to the loss function) to obtain a more parsimonious model.

Some penalty given to all categories: in multinomial logistic regression, LASSO applies the same penalty to the parameters of every category, to control the model's complexity.

skf = StratifiedKFold: use stratified k-fold cross-validation for model validation; this ensures that the proportion of each class within every fold matches the original data set.

Box plot showing the distribution and outliers of model performance across the different test folds.

Week 5:

Econometrics is more concerned with causal relationships, while machine learning focuses on prediction. Only when a causal relationship holds can we make counterfactual predictions.

Definition of a randomised controlled trial: an RCT is a research design in which the allocation to control and treatment groups is random. Such a design helps researchers control the other variables in the experiment so that they can focus on the effect of the specific variable of interest on the outcome.

A/B testing: randomised controlled trials (RCTs) are also known as A/B tests, common in textbooks and in business optimisation. With this method, researchers compare the outcomes of two groups (control group A and treatment group B) after they receive different treatments (or no treatment).

Treatment effect (TE) and treatment variable (d):

The treatment effect (TE) is the effect observed after some treatment (a drug, an educational intervention, a policy change, etc.) is applied to the study units in a randomised controlled trial (RCT) or an observational study. The purpose of estimating treatment effects is to assess the actual impact of the intervention.

The treatment variable (d) can be discrete or continuous, for example whether someone participates in a programme, a change in the selling price, or the dose of a new drug.

Completely randomised design: an experimental design in which participants are assigned to the control or treatment group entirely at random, to keep the experimental results fair and accurate.

Average treatment effect (ATE):

Here the ATE is the expected change in the outcome variable y when the treatment variable d changes from 0 to 1. d must be independent of all other factors for the ATE to be estimated accurately.

Standard error

Sanity check:

Is the key able to link data? See if the identifiers line up across the data sources.

How many variables?

How many observations?

Data Wrangling

This sub-section follows the textbook stuff.

Let's collect some key variables into a new data frame, P.

ATE (Average Treatment Effect): the average difference in outcomes between the treatment group and the control group across the whole study sample. It is the core measure of the effectiveness of a treatment or policy intervention.

Reweight: reweighting is a statistical method for adjusting the weights of the sample data so that the treatment and control groups have similar distributions on key covariates. This reduces sample-selection bias and improves the accuracy of the ATE estimate. Ensuring the two groups look alike on the important covariates is an effective way to control for confounding variables when estimating the average treatment effect.

Balance check: a step carried out before the treatment-effect analysis to verify that the treatment and control groups have similar distributions of the observed covariates before the intervention. If the two groups already differ markedly beforehand, the estimated treatment effect may be biased.

Regression for control: a technique that uses regression analysis to control for confounding variables and so estimate the treatment effect accurately. Including the key covariates in the regression model allows the independent effect of the treatment to be estimated.

Week 6

Difference-in-Differences (DiD):

Assumes the groups are comparable, without too much heterogeneity. Difference-in-differences controls for endogeneity by comparing changes before and after the treatment, and between the treatment and control groups.
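
A minimal difference-in-differences sketch as a regression with a treated × post interaction (statsmodels formula API); the simulated panel and the effect size are purely illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),          # 1 = treatment group
    "post": rng.integers(0, 2, n),             # 1 = after the intervention
})
true_effect = 2.0
df["y"] = (1.0 + 0.5 * df.treated + 0.8 * df.post
           + true_effect * df.treated * df.post + rng.normal(size=n))

# the coefficient on treated:post is the difference-in-differences estimate
res = smf.ols("y ~ treated * post", data=df).fit()
print(res.params["treated:post"])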

Search engine marketing (SEM): the market being investigated.

Differences

Individual effects

GIGO (Garbage In, Garbage Out): a common term in programming and data analysis, meaning that if the input data are of poor quality, the output will also be of poor quality.

Regression Discontinuity Design (RDD)

Simple Model

Forcing variable: determines whether a unit receives the treatment, for example the threshold at which a policy change or regulation takes effect.

Simple model: the model assumes that as the forcing variable crosses the threshold, the outcome jumps noticeably, so a local treatment effect can be observed near the threshold.

Local treatment effect: what RDD identifies is a conditional average treatment effect (conditional ATE) for individuals whose forcing variable is close to the threshold (near zero).

Rank Score

Sub-sample means

LOWESS: using sub-sample means alone is not perfect, because there may be local trends.

A non-parametric method that uses local regressions and sample statistics to fit a curve.

It suits local trends in the data and can be regarded as a form of moving average.

Local linear regression: a window threshold such as h = 3 is used in local regressions like LOWESS; it defines the distance threshold that determines how much of the data influences each local fit.

Two-stage Least Squares (2SLS) and Instrumental Variables

Endogeneity: the explanatory variables (the independent variables) are correlated with the error term of the model.

Week 7:

Conditional Ignorability (CI)

Linear Treatment Effects (LTE) model

EconML is a Python package designed and built by Microsoft Research's ALICE project team to estimate heterogeneous treatment effects from observational data using machine-learning techniques. The package combines state-of-the-art machine learning with econometrics and automates complex causal-inference problems.

Synthetic controls

Matrix Completion

Week 8:

Clustering:

Applications:

Dividing a collection of documents into topics: commonly used in text mining and information retrieval to identify the topics or concepts in a large set of documents.

Segmenting shoppers by preferences or price elasticity: used in market analysis for price discrimination or market segmentation, by understanding different consumers' purchase preferences or how strongly they respond to price changes.

Grouping voters by characteristics: the example mentioned is Cambridge Analytica; this kind of analysis can be used in political campaigns to tailor more targeted political ads or activities from voter data.

Online recommendation systems: clustering can be used in recommender systems to suggest relevant content or products based on users' past behaviour and preferences.

K-means normal mixture:

This method identifies distinct groups in the data by assuming each group is normally distributed; the clusters correspond to the components of this mixture of normal distributions.

K-Means (Algorithm 17): K-means is a popular clustering algorithm that groups data points into a pre-specified number of clusters by minimising the sum of squared distances from each point to its nearest cluster centre.

Data application (protein data): clustering is often applied to protein data to understand the relationships between, or classification of, different proteins.

LOOCV (leave-one-out cross-validation): a cross-validation technique for evaluating model performance. In the clustering context it can be used to choose the best number of clusters.

Elbow: the elbow method is commonly used to pick the number of clusters in K-means, by looking for the "elbow point" in the plot of the within-cluster sum of squared errors (SSE) against the number of clusters.
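
A minimal elbow-method sketch: fit K-means for a range of K and inspect the within-cluster sum of squares (inertia_ in scikit-learn); the data are simulated blobs rather than the course data.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ = within-cluster sum of squared distances (SSE); look for the "elbow"
    print(k, round(km.inertia_, 1))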

Principal Component Analysis (PCA):

The goal of PCA is to map the original data into a new coordinate system via a linear transformation, reducing the dimensionality while retaining as much of the main information as possible. The core idea is to find a new set of orthogonal axes (the principal components), which are the directions of greatest variability in the data. Projecting the data onto these components reduces the dimensionality while preserving as much of the original variance as possible.

Main steps

Standardise the data: before running PCA, the data usually need to be standardised (or normalised) so that every feature contributes on an equal footing to the principal components. Otherwise, features with larger values have a larger influence on the analysis.

Compute the covariance matrix: the covariance matrix captures the linear relationships between the features. It is n × n (where n is the number of features), and each entry is the covariance between two features.

Find the eigenvalues and eigenvectors of the covariance matrix: the principal components are the eigenvectors of the covariance matrix and point in the directions of largest variance in the data. Each eigenvalue gives the amount of variance explained by the corresponding eigenvector; the larger the eigenvalue, the more important the direction.

Select the principal components: PCA reduces dimensionality by keeping the top k eigenvectors with the largest eigenvalues. The number of components k is chosen by the user, usually so that the retained components explain most of the variance.

Transform the data: the new low-dimensional data are obtained by projecting the original data onto the selected components. The transformed data set keeps the main information of the original data in fewer dimensions.

Illustration: likely refers to a diagram of PCA that helps explain how principal components are extracted from the raw data.

Algorithm 18 (PCA): a dimension-reduction technique that finds the main directions of variation in the data by computing the eigenvectors and eigenvalues of its covariance matrix.
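
A minimal sketch of the steps above with scikit-learn: standardise, fit PCA, inspect the explained variance, and project onto the first k components; the iris data set is just a convenient stand-in.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # stand-in data set
Xs = StandardScaler().fit_transform(X)        # step 1: standardise the features

pca = PCA(n_components=2).fit(Xs)             # steps 2-4: covariance, eigenvectors, pick k
print(pca.explained_variance_ratio_)          # share of variance explained by each PC

Z = pca.transform(Xs)                         # step 5: project onto the chosen components
print(Z.shape)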

Week 9:

Principal component analysis (PCA), principal component regression (PCR), and partial least squares regression (PLS) address multicollinearity problems in high-dimensional data.

PCA and Scree Plot (Elbow)

Principal Component (Lasso) Regressions

PCA, bottom-up: the bottom-up view starts from the principal components and asks how they explain the features in the data set. By transposing the loading matrix we can see each feature's weight in every principal component, and hence each feature's influence on the components. Top-down: start from the overall picture and gradually refine down to the details.

PC regression (principal component regression): this method works top-down, using the PCA results to analyse how the data are classified and distributed, in order to reduce multicollinearity.

Partial Least Squares (PLS): a dimension-reduction regression method suited to multidimensional data that can handle collinearity among the independent variables. In multiple regression it reduces the dimensionality of the explanatory variables by finding projection directions between the explanatory variables and the response, while retaining as much of the variance relevant to the response as possible.

A distinctive feature of PLS is that its stepwise regression on residuals can be seen as a "boosting"-like method: at each step, the residuals from the previous step are used as the new response variable, gradually improving the fit.

Marginal Regression: a stepwise technique used to test each variable's marginal contribution to the model.
Partial Least Squares (PLS): PLS repeats marginal regression (MR) on the residuals obtained at the previous step, modelling by considering the covariance between the explanatory variables and the response simultaneously.
Cross-Validation: a technique for model evaluation that ensures the model generalises well and guards against overfitting.

Week 10:

Nonparametric modelling: it does not assume the data follow any particular distribution or model form and is suited to complex data with nonlinear relationships. Common nonparametric methods include kernel regression, K-nearest neighbours, and decision trees. Because nonparametric models are highly flexible, they easily overfit the data. Overfitting means the model performs very well on the training data but poorly on test or new data, usually because it has captured noise rather than the true structure of the data. Bayesian nonparametrics is an exception: by introducing prior information, Bayesian methods can mitigate overfitting to some extent.

Curse of dimensionality: as the dimensionality of the data increases, the data become sparser, and the model needs more data to capture complex relationships.

Semi-parametric models: combine the strengths of parametric and nonparametric methods, handling nonlinear relationships while keeping the model interpretable.

Pruning is a technique for decision trees that reduces the tree's complexity and prevents overfitting. By cutting away unnecessary branches (i.e. reducing the depth of the model), pruning simplifies the tree and makes it generalise better.

Pre-pruning: while the tree is growing, set conditions that limit its depth and stop splits early.

Post-pruning: let the tree grow fully, then prune back branches according to some criterion to improve generalisation.

XGBoost (Extreme Gradient Boosting) is a boosting algorithm that improves model performance by adding weak learners (usually decision trees) one at a time. Each new model corrects the errors of the previous ones, gradually improving the overall predictions.

Classification And Regression Trees (CART):

The main drawback of this algorithm is its tendency to overfit, especially when the tree grows deep. Overfitting is usually avoided by pruning, capping the tree depth, or setting a minimum number of observations required for a split. In practice, CART is usually combined with ensemble methods such as random forests and boosted trees to improve predictive performance and generalisation.

Non-parametric bootstrap

Bagging (bootstrap aggregating): an effective variance-reduction technique; by averaging the predictions of many models it improves the model's ability to generalise. It works by resampling the training data and training multiple models on the resamples.

Random Forests (RF): an ensemble learning method that uses many decision trees for prediction.

When a decision tree is built, branches are split according to the principle of loss minimisation: at each split, the algorithm looks for the feature and threshold that most reduce the impurity or error of the node.

Commonly used impurity measures include:

Gini impurity: mainly used for classification; it measures the probability that an observation in the node is misclassified. The split chosen is the one that reduces the Gini index the most.

Information gain: based on entropy; the split chosen is the one that maximises the information gain.

Mean squared error (MSE): in regression problems, the tree chooses split points by minimising the mean squared error.

Mean Squared Error (MSE)

1. Bootstrap (resampling with replacement)

The bootstrap samples with replacement from the original data set: after each draw the observation is put back, so the same observation can be drawn more than once.

Applications:

In economics, to obtain standard errors of estimators (for example heteroskedasticity-robust standard errors).

In the Random Forest model, to build the individual trees.

2. Random Trees

What is a random tree? It is a decision tree that has randomness injected into its construction.

Ways of introducing randomness:

Randomly selecting features: at each split while growing the tree, only a random subset of the features is considered, rather than all of them.

Bootstrap sampling of the data: sample the data with replacement and build the tree on the resampled data.

3. Random Forest (RF)

A random forest is an ensemble of many random trees; typically a large number of trees (for example 100) is grown.

Prediction: the predictions of all the trees are averaged to give the final prediction.

If every tree in the random forest is trained on a bootstrap sample, the method is called Bootstrap Aggregating (Bagging).

4. Properties of random forests

High accuracy: because a random forest combines the predictions of many trees, it is stable and predicts accurately.

Hard to interpret: random forest models are often regarded as black boxes, because their complexity and randomness make it very hard to explain any individual prediction.
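
A minimal random-forest sketch: many bootstrap-trained trees with a random subset of features per split, predictions averaged; the simulated regression data are a placeholder.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(800, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=800)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# 100 trees, each grown on a bootstrap sample, random feature subset at each split
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))     # out-of-sample R^2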

Week 11:

1. Text Preprocessing / Tokenization

Before text analysis, the text data need to be preprocessed. Preprocessing typically includes removing punctuation, converting to lowercase, removing stop words (such as "the" and "and"), and tokenization (splitting the text into individual words or phrases).

Tokenization is the process of splitting text into words or phrases, and it is the first step of text analysis.

2. Word Clouds

A word cloud is a visualisation tool that displays the frequency or weight of each word in the text data using words of different sizes and colours. The more frequent a word, the larger and more prominently it is usually displayed.

Word clouds help quickly identify the most common words in the text, making it easier to understand the themes or trends in the data.

3. Topic Models

LDA (Latent Dirichlet Allocation): LDA is a common topic-model algorithm for discovering the latent topics in text data. By representing each document as a probability distribution over topics, it identifies the topics that commonly occur in the documents.

The goal of topic models is to classify large amounts of text automatically and help the analyst understand its main content.

4. Embeddings

An embedding converts words or documents into numeric vectors that machine-learning models can work with. Common word-embedding methods include Word2Vec, GloVe, and BERT.

Through embeddings, the semantic information in text is turned into a numeric representation that is better suited to tasks such as classification and clustering.

1. Tokenization(分词)
1. Tokenization (word segmentation)

分词是将文本数据拆分为独立的词或词组的过程。通过分词,可以更方便地对文本进行分析和处理。
Tokenization is the process of splitting text data into independent words or phrases. Through word segmentation, text can be analyzed and processed more conveniently.

在分词之后,可以计算每个词的出现频率,也就是词的数量(Count of Words)。
After tokenization, we can count how often each word occurs, i.e. the word counts (Count of Words).
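A short sketch of counting words after tokenization, using scikit-learn's CountVectorizer (one common choice; the toy documents are invented).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the product is good",
        "the service is not good",
        "good product, good price"]           # invented toy documents

vectorizer = CountVectorizer()                # tokenizes and lowercases by default
counts = vectorizer.fit_transform(docs)       # document-term count matrix

print(vectorizer.get_feature_names_out())     # the vocabulary (features)
print(counts.toarray())                       # one row per document, one column per word
```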

2. Bag-of-Words词袋模型
2. Bag-of-Words Model

词袋模型是一种简单的文本表示方法。它忽略词的顺序,只关注词的出现次数。
The bag-of-words model is a simple text representation method. It ignores the order of words and only focuses on the number of times they appear.

优点:
Advantages:

构建简单:只需计算每个词的出现频率。
Simple to build: just count the frequency of each word.

易于解释:结果就是一个词频向量,表示文本中每个词的数量。
Easy to interpret: the result is a word frequency vector, representing the number of each word in the text.

缺点:
Disadvantages:

特征过多:由于每个词都是一个特征,因此词汇量越大,生成的特征也就越多,可能导致维度过高的问题。
Too many features: Since each word is a feature, the larger the vocabulary, the more features are generated, which may lead to the problem of excessive dimensionality.

忽略上下文:词袋模型只关注词的频率,完全忽略了词的顺序和上下文关系。
Ignoring context: The bag-of-words model only focuses on the frequency of words and completely ignores the order and context of words.

忽略多词短语的含义:词袋模型无法区分由多个词组成的短语。例如,“good”“not good”在词袋模型中会被分开处理,这样模型就无法理解这两个词组合的实际含义。
Ignoring the meaning of multi-word phrases: a bag-of-words model cannot distinguish phrases made up of several words. For example, "good" and "not good" are treated separately, so the model cannot understand the actual meaning of the combination.

在文本分析中,单词或短语的组合被称为n-grams,其中 n 表示组合中的单词数量。
In text analysis, combinations of words or phrases are called n-grams , where n represents the number of words in the combination.

术语解释:
Terminology:

Unigram:单词或单个词的组合,表示 n=1。例如,“good”就是一个unigram
Unigram: a single word (n = 1). For example, "good" is a unigram.

Bigram:由两个连续单词组成的组合,表示 n=2。例如,“very good”“not good”bigrams
Bigram : A combination of two consecutive words, indicating n=2 . For example, "very good" and "not good" are bigrams .

Trigram:由三个连续单词组成的组合,表示 n=3。例如,“not very good”是一个trigram
Trigram : A combination of three consecutive words, indicating n=3 . For example, "not very good" is a trigram .

应用:
Applications:

使用n-grams可以帮助模型捕捉更丰富的上下文信息。例如,在情感分析中,“not good”“good”虽然都包含“good”,但因为前者是负面表达,因此将其作为bigram能保留“not”“good”的组合含义。
Using n-grams helps the model capture richer contextual information. For example, in sentiment analysis, "not good" and "good" both contain "good", but the former is a negative expression; keeping "not good" as a bigram preserves the combined meaning of "not" and "good".

通过增加n的值,可以让模型理解更多上下文,但n越大,可能生成的特征数也越多,计算复杂度也随之增加。
By increasing the value of n , the model can understand more context, but the larger n is, the more features may be generated, and the computational complexity also increases.

通常在文本处理中,unigramsbigrams甚至trigrams都被用来保留词与词之间的关系,以提高模型在理解短语和语境上的表现。
Usually in text processing, unigrams , bigrams and even trigrams are used to preserve the relationship between words to improve the performance of the model in understanding phrases and context.
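As a hedged illustration, CountVectorizer's ngram_range parameter can keep unigrams and bigrams together, so a phrase like "not good" becomes its own feature; the example documents are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the food was good", "the food was not good"]   # invented examples

# ngram_range=(1, 2) keeps unigrams and bigrams, so "not good" is its own feature
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
```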

接下来讨论了在将文本数据输入模型之前的预处理步骤,并给出了一些常见的方法和注意事项。
Next, the preprocessing steps before inputting text data into the model are discussed, and some common methods and considerations are given.

文本预处理的原则:
Principles of text preprocessing:

在将文本数据用于建模之前,通常需要对其进行预处理,以便更好地清洗和规范化数据。这些步骤包括:
Before text data can be used for modeling, it often needs to be preprocessed to better clean and normalize the data. These steps include:

去除停用词(Stopwords):停用词是一些常见词汇(如“the”“is”),它们通常对文本语义贡献较小,可以在分析前去除。
Remove stop words ( Stopwords ): Stop words are some common words (such as "the" , "is" ), which usually contribute little to the semantics of the text and can be removed before analysis.

小写转换(Lowercase):将所有文本转换为小写,有助于减少因大小写不同而重复出现的词。
Lowercase conversion ( Lowercase ): Convert all text to lowercase, helping to reduce repeated words due to different case.

词干提取(Stemming):将单词还原为词干形式,去掉不同的词缀。例如,将“running”还原为“run”
Stemming: reduce words to their stem form by stripping affixes. For example, "running" becomes "run".

n-grams:根据需要,生成n-grams来捕捉多词短语的含义,如“not good”
n-grams : As needed, n-grams are generated to capture the meaning of multi-word phrases, such as "not good" .

去除标点符号(Punctuation):删除标点符号以简化文本。
Punctuation : Remove punctuation to simplify text.

视具体情况而定(Depends on Context):
Depends on Context :

以上预处理步骤是否需要执行,取决于具体的文本分析任务和上下文。不同的任务对文本预处理的需求有所不同,因此在应用这些步骤时应根据具体情况进行调整。
Whether the above preprocessing steps need to be performed depends on the specific text analysis task and context. Different tasks have different requirements for text preprocessing, so the application of these steps should be tailored to the specific situation.
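A minimal preprocessing sketch, assuming NLTK is installed (the stop-word list requires nltk.download("stopwords")); as noted above, which steps to apply depends on the task, so every step here is optional.

```python
import re
from nltk.corpus import stopwords        # needs: nltk.download("stopwords")
from nltk.stem import PorterStemmer

def preprocess(text, remove_stopwords=True, stem=True):
    text = text.lower()                              # lowercase conversion
    text = re.sub(r"[^\w\s]", " ", text)             # remove punctuation
    tokens = text.split()                            # simple whitespace tokenization
    if remove_stopwords:
        stop = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop]
    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

print(preprocess("The runners were running, but not very quickly!"))
```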

主题模型(Topic Models)概述:
Overview of Topic Models :

无监督学习(Unsupervised Learning):主题模型是一种无监督学习方法,它不需要标签即可运行,主要用于从大规模文本数据中自动发现隐藏的主题。
Unsupervised Learning : Topic model is an unsupervised learning method that does not require labels to run and is mainly used to automatically discover hidden topics from large-scale text data.

聚类词汇到相似的主题中:主题模型会将词汇聚类到我们指定数量的主题中。每个主题由具有相似语义的词汇组成。用户需要预先指定主题的数量(例如,10个主题)。
Cluster words into similar topics: Topic models cluster words into a number of topics that we specify. Each topic consists of words with similar semantics. The user needs to specify the number of topics in advance (for example, 10 topics).

词汇可以属于多个主题:在主题模型中,词汇可以同时属于多个主题,但在不同的主题中其权重(或概率)不同。这种灵活性使得模型可以更好地捕捉词汇在不同语境中的多重含义。
A word can belong to multiple topics: In a topic model, a word can belong to multiple topics at the same time, but its weight (or probability) is different in different topics. This flexibility allows the model to better capture the multiple meanings of words in different contexts.

主题权重或强度(Weaning of a Topic):
Topic weight or strength ("weaning of a topic"):

你提到的“weaning of a topic”可能指的是主题强度或主题权重,即每个主题在文档中的相对重要性。通常,主题模型会为每个文档分配一个主题分布,这个分布表明该文档与各个主题的关联强度。
The phrase "weaning of a topic" probably refers to topic strength or topic weight, i.e. the relative importance of each topic within a document. Typically, a topic model assigns each document a topic distribution, and this distribution indicates how strongly the document is associated with each topic.

理解主题权重:为了理解一个文档中某个主题的强度,我们可以查看该主题中高频出现的词汇以及这些词的权重分布。通常,模型输出每个主题下的高频词及其在主题中的重要性,从而帮助我们解释和理解主题的含义。
Understanding topic weights: To understand the strength of a topic in a document, we can look at the words that occur frequently in the topic and the weight distribution of those words. Usually, the model outputs high-frequency words under each topic and their importance in the topic, thereby helping us interpret and understand the meaning of the topic.

如何找到主题权重或解释主题:
How to find topic weights or interpret topics:

查看高频词:每个主题由其高频词汇定义,例如,LDA模型输出的主题通常会附带该主题的前若干高频词。这些词帮助我们定义或解释主题的语义。
Look at the high-frequency words: each topic is defined by its high-frequency words. For example, the topics output by an LDA model usually come with the top high-frequency words for that topic, and these words help us define or interpret the topic's meaning.

文档-主题分布:主题模型输出每个文档在不同主题上的分布权重。对于某一文档,其主题分布表示文档内容在不同主题上的重要性。
Document-topic distribution: the topic model outputs, for each document, a distribution of weights over the topics. For a given document, this distribution represents how important each topic is to the document's content.

可视化工具:一些工具如pyLDAvis,可以帮助可视化每个主题的权重分布及其相似性,从而更直观地理解主题之间的关系。
Visualization tools: Some tools, such as pyLDAvis , can help visualize the weight distribution of each topic and its similarities, thereby more intuitively understanding the relationship between topics.
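A small sketch of fitting an LDA topic model with scikit-learn and reading off each topic's top words and each document's topic weights; the corpus and the choice of two topics are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks fell as interest rates rose",
        "the central bank raised interest rates again",
        "the team won the final match",
        "the coach praised the players after the match"]    # invented toy corpus

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # number of topics chosen in advance
doc_topic = lda.fit_transform(X)          # document-topic weights, one row per document

# Top words per topic: these high-frequency words are what we use to label each topic
words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"Topic {k}:", [words[i] for i in top])

print(doc_topic.round(2))                 # topic strength/weight of each document
```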

BERTBidirectional Encoder Representations from Transformers一种用于自然语言处理(NLP)的预训练模型。它采用了双向Transformer架构,可以捕捉上下文中的词汇信息,并在多个NLP任务中取得了非常好的效果。
BERT ( Bidirectional Encoder Representations from Transformers ) is a pre-trained model for natural language processing ( NLP ) . It adopts a bidirectional Transformer architecture that can capture lexical information in context and achieves very good results in multiple NLP tasks.

LDALatent Dirichlet Allocation,隐狄利克雷分配): LDA 是一种用于主题建模的无监督学习方法,它从文档中提取潜在的主题,并将文档表示为多个主题的混合。
LDA (Latent Dirichlet Allocation): LDA is an unsupervised learning method for topic modeling; it extracts latent topics from documents and represents each document as a mixture of topics.

主题模型 LDA PCA(主成分分析) 存在一定的相似性。 PCA 中,我们使用主成分得分 vik来表示数据在不同主成分方向上的贡献,而 LDA 中使用ωik表示文档 iii 在不同主题上的权重("topic score")。PCA 的旋转矩阵φk类似于 LDA 中的主题概率向量θk
Topic models such as LDA share a certain similarity with PCA (principal component analysis). In PCA, the principal component scores \(v_{ik}\) represent the contribution of observation \(i\) in the direction of each principal component, while in LDA \(\omega_{ik}\) represents the weight of document \(i\) on each topic (the "topic score"). The rotation matrix \(\varphi_k\) of PCA is analogous to the topic probability vector \(\theta_k\) in LDA.

Daniel

12周:
Week 12 :

Discuss the differences between supervised and unsupervised learning. How can you tell what type of model it is? What types of models fall into which category?

监督学习是指在有标签的训练数据上训练模型。每个训练样本都包含输入特征(X)和对应的目标输出(Y,如果数据集中有明确的目标标签(Y),并且模型的目标是学习如何从特征输入(X)预测目标输出(Y),那么它就是监督学习模型。监督学习主要用于分类(Classification)和回归(Regression)任务。 Linear Regression, Logistic Regression, Decision Trees, Random Forest
Supervised learning refers to training a model on labeled training data: each training sample contains input features (X) and a corresponding target output (Y). If the dataset has explicit target labels (Y) and the goal of the model is to learn how to predict the target output (Y) from the feature inputs (X), then it is a supervised learning model. Supervised learning is mainly used for classification and regression tasks. Examples: Linear Regression, Logistic Regression, Decision Trees, Random Forest.

无监督学习是指在没有标签的数据上训练模型。模型的任务是通过输入数据发现数据的内在结构或模式,而无需已知的输出标签。无监督学习主要用于聚类(Clustering)和降维(Dimensionality Reduction
Unsupervised learning refers to training a model on unlabeled data. The model's task is to discover the intrinsic structure or patterns in the input data without known output labels. Unsupervised learning is mainly used for clustering (Clustering) and dimensionality reduction (Dimensionality Reduction).

### 1. 如何使用词袋模型来预测个人是否就业?
### 1. How to use the bag-of-words model to predict whether an individual is employed?

这个问题描述的是一个数据集,其中唯一的特征是与每个标签相关联的文档(如个人的简历或求职信)。标签可能是反映个人是否就业。题目问的是如何使用这些文档(在文本分析中,采用词袋模型)来预测个人是否就业,并讨论使用词袋模型在训练数据集上进行交叉验证时可能遇到的问题,以及如何在实践中处理这些问题。
This question describes a dataset in which the only feature is the document attached to each label (such as an individual's resume or cover letter); the label might indicate whether the individual is employed. The question asks how to use these documents (in a text-analysis setting, with a bag-of-words representation) to predict whether an individual is employed, what problems may arise when cross-validating the bag-of-words model on the training data, and how these problems are handled in practice.

**词袋模型Bag of Words, BOW** 是一种常用的文本表示方法,它忽略了文本中单词的顺序,而是将文本表示为单个词汇的集合,并统计每个词汇在文本中出现的次数或是否出现。具体步骤如下:
**Bag of Words (BOW)** is a commonly used text representation. It ignores the order of words in the text and instead represents the text as a collection of individual words, recording for each word how many times it appears (or whether it appears at all). The specific steps are as follows:

1. **文本预处理**
1. **Text preprocessing**:

- 将所有简历或求职信转换为纯文本。
- Convert all resumes or cover letters to plain text.

- 对文本进行**分词**,可以使用自然语言处理工具,如 NLTK SpaCy 来完成分词。
- **Tokenize** the text; natural language processing tools such as NLTK or SpaCy can be used for this.

- 可选择进一步进行**停用词去除****小写化****词干化**等处理,以减少词汇表的冗余。
- Optionally apply further steps such as **stop-word removal**, **lowercasing**, and **stemming** to reduce redundancy in the vocabulary.

2. **构建词袋模型**
2. **Build the bag-of-words model**:

- 为每个文档创建一个词汇表,词汇表中的每个单词作为特征。
- Create a vocabulary for each document, with each word in the vocabulary as a feature.

- 统计每个单词在文档中出现的次数,形成一个**词频矩阵**(词频矩阵中每一行对应一个文档,每一列对应一个单词,值表示该单词在文档中的出现频率)。
- Count how many times each word appears in each document to form a **word frequency matrix** (each row corresponds to a document, each column to a word, and each value is the frequency of that word in the document).

3. **应用机器学习算法**
3. **Apply a machine learning algorithm**:

- 使用词频矩阵作为输入特征,将其与是否就业的标签配对,形成一个标准的监督学习问题。
- Use the word frequency matrix as the input features and pair it with the employment labels, giving a standard supervised learning problem.

- 选择一个分类模型(如**逻辑回归****支持向量机****随机森林**等),并使用这些文档的词袋表示来训练模型。
- Choose a classification model (such as **logistic regression**, a **support vector machine**, or a **random forest**) and train it on the bag-of-words representation of the documents.

4. **模型预测**
4. **Model prediction**:

- 一旦模型训练完成,可以输入一个新的简历或求职信的词袋表示,模型会根据文本中的单词来预测该个人是否就业。
- Once the model is trained, the bag-of-words representation of a new resume or cover letter can be fed in, and the model predicts whether that individual is employed based on the words in the text.
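Putting the steps above together, a minimal sketch of the workflow might look like the following; the cover-letter snippets and labels are invented, and logistic regression is just one possible classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented cover-letter snippets and employment labels (1 = employed, 0 = not)
letters = ["experienced analyst with python and sql skills",
           "seeking my first role, eager to learn",
           "managed a team of data scientists for five years",
           "recent graduate with limited work experience"]
employed = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    letters, employed, test_size=0.5, stratify=employed, random_state=0)

# The vectorizer turns each letter into word counts; logistic regression is the classifier
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)               # vocabulary is learned from the training letters only

print(model.predict(X_test))              # predicted employment status for unseen letters
```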

### 2. 使用词袋模型进行交叉验证时可能遇到的问题及处理方式
### 2. Possible problems and solutions when using the bag-of-words model for cross-validation

**交叉验证(Cross-Validation** 是用于评估机器学习模型性能的技术。使用词袋模型进行交叉验证时,可能遇到以下问题:
**Cross-Validation** is a technique for evaluating the performance of machine learning models. When using a bag-of-words model with cross-validation, the following problems may arise:

#### **问题 1:稀疏性问题**
#### **Problem 1: Sparsity**

词袋模型通常会生成一个非常高维且稀疏的矩阵,因为文本中可能包含成千上万的不同单词,而每个文档只包含其中的小部分。因此,词频矩阵会包含大量的零值,这会导致训练模型的效率较低,并且可能造成模型的过拟合。
Bag-of-words models typically produce a very high-dimensional and sparse matrix, since a text may contain thousands of different words, and each document contains only a small subset of them. Therefore, the word frequency matrix will contain a large number of zero values, which will result in less efficient training of the model and may cause overfitting of the model.

**处理方式**
**How to handle it**:

- **降维技术**:可以使用降维技术,如**主成分分析(PCA****奇异值分解(SVD**,来降低矩阵的维度。
- **Dimensionality reduction**: techniques such as **principal component analysis (PCA)** or **singular value decomposition (SVD)** can be used to reduce the dimensionality of the matrix.

- **特征选择**:通过选择频率最高的特征(如**TF-IDF** 或者**词汇出现的频次阈值**),来减少特征的数量。
- **Feature selection**: reduce the number of features by keeping only the most informative ones (e.g. via **TF-IDF** weighting or a minimum word-frequency threshold).

#### **问题 2:数据泄漏**
#### **Problem 2: Data leakage**

在交叉验证过程中,可能会出现数据泄漏的问题,即训练集中包含了测试集的信息。例如,如果在构建词袋时没有先划分训练集和测试集,而是使用整个数据集构建词汇表,这可能导致模型过拟合。
During cross-validation, data leakage can occur, meaning the training set contains information from the test set. For example, if the vocabulary for the bag of words is built from the entire dataset rather than after splitting into training and test sets, this may cause the model to overfit.

**处理方式**
**How to handle it**:

- **先划分数据,再构建词袋**:为了防止数据泄漏,应先将数据集分成训练集和测试集,然后只用训练集来构建词汇表和词袋模型。随后在测试集上进行交叉验证时,词汇表应保持不变。
- **Split the data first, then build the bag of words**: to prevent data leakage, split the dataset into training and test sets first, and build the vocabulary and bag-of-words model from the training set only; the vocabulary then stays fixed when evaluating on the test set (or on each validation fold), as sketched below.
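One common way to enforce this in practice is to put the vectorizer inside a pipeline, so that each cross-validation split rebuilds the vocabulary from its training fold only; the toy documents below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented labeled documents (same spirit as the employment example above)
letters = ["experienced analyst with python and sql skills",
           "seeking my first role, eager to learn",
           "managed a team of data scientists for five years",
           "recent graduate with limited work experience"]
employed = [1, 0, 1, 0]

pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Because the vectorizer sits inside the pipeline, each CV split rebuilds the
# vocabulary from its training fold only, so no test-fold words leak in.
scores = cross_val_score(pipe, letters, employed, cv=2)
print(scores)
```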

#### **问题 3:模型的高方差问题**
#### **Problem 3: High model variance**

由于词袋模型可能生成大量特征,特别是当词汇表过大时,模型可能会表现出高方差问题,即在训练集上表现良好,但在测试集上表现较差。
Since bag-of-words models may generate a large number of features, especially when the vocabulary is too large, the model may exhibit high variance issues, i.e. perform well on the training set but perform poorly on the test set.

**处理方式**
**How to handle it**:

- **正则化**:可以在训练模型时加入正则化项(如 L1 L2 正则化)以防止过拟合。
- **Regularization**: add a regularization term (such as L1 or L2) when training the model to prevent overfitting.

- **TF-IDF 转换**:与单纯的词频表示相比,使用**TF-IDF(词频-逆文档频率)**可以减少常见词对模型的干扰,并提高模型的泛化能力。
- **TF-IDF transformation**: compared with raw word counts, **TF-IDF (term frequency-inverse document frequency)** reduces the influence of very common words and improves the model's ability to generalize; see the short sketch below.
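A brief sketch of the TF-IDF alternative to raw counts, using scikit-learn's TfidfVectorizer on invented documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the product is good",
        "the product is not good",
        "excellent product"]                 # invented documents

# TF-IDF downweights words that appear in almost every document (e.g. "product")
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))
```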

### 总结
### Summary

- **如何使用词袋模型预测是否就业**:通过文本预处理、词袋表示、构建词频矩阵并使用机器学习模型进行预测。
- **How to use the bag-of-words model to predict employment**: preprocess the text, build the bag-of-words representation and word frequency matrix, and make predictions with a machine learning model.

- **交叉验证时的挑战**:主要包括稀疏性问题、数据泄漏和模型高方差问题。可以通过降维、正则化和TF-IDF等方法来解决这些问题。
- **Challenges during cross-validation**: mainly sparsity, data leakage, and high model variance; these can be addressed with methods such as dimensionality reduction, regularization, and TF-IDF.

3, 假设您有一个带标签的数据集,其中唯一的特征是附加到每个标签的文档。为了进行讨论,您可以假设标签可以是个人是否受雇,文档可以是他们的简历/求职信。你如何使用这封信(在带有词袋表示的文本分析上下文中)来预测一个人是否被雇用?
3. Suppose you have a labeled dataset where the only features are the documents attached to each label. For the sake of discussion, you can assume that the label could be whether the individual is employed or not, and the document could be their resume/cover letter. How do you use this letter (in the context of text analysis with a bag-of-words representation ) to predict whether a person is hired?

此外,当尝试在训练数据集上交叉验证数据时,使用词袋格式可能会遇到哪些问题?实践中通常如何处理?
Also, what issues might be encountered using the bag-of-words format when trying to cross-validate data on a training dataset ? How is it usually handled in practice?

题目要求针对几种不同的模型,给出可以调节的超参数的示例,并说明如何调节这些超参数。以下是各模型的超参数示例及其调节方法:
The question requires giving examples of hyperparameters that can be adjusted for several different models and explaining how to adjust these hyperparameters. The following are examples of hyperparameters for each model and how to adjust them:

### 1. **逻辑回归(Logistic Regression**
### 1. **Logistic Regression**

- **可调节超参数**:正则化参数 \(C\)
- **Adjustable hyperparameter**: regularization parameter \(C\)

- **解释**\(C\) 控制正则化的强度,它是惩罚项的倒数。当 \(C\) 值较小时,正则化效果较强,防止过拟合;当 \(C\) 值较大时,正则化效果较弱。
- **Explanation**: \(C\) controls the strength of regularization; it is the inverse of the penalty strength. When \(C\) is small, regularization is strong and guards against overfitting; when \(C\) is large, regularization is weak.

- **调节方法**:通过网格搜索(Grid Search)或随机搜索(Random Search)来寻找最佳 \(C\) 值。通常会在对数尺度下选择 \(C\) 的多个可能值进行交叉验证。
- **Tuning method**: find the best value of \(C\) via grid search (Grid Search) or random search (Random Search); typically several candidate values of \(C\) on a logarithmic scale are compared using cross-validation, as in the sketch below.
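A minimal sketch of tuning \(C\) on a logarithmic grid with cross-validation; the data are simulated and the grid values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate values of C on a logarithmic scale
param_grid = {"C": np.logspace(-3, 3, 7)}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)      # the C value that performed best under cross-validation
```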

### 2. **决策树(Decision Trees**
### 2. ** Decision Trees **

- **可调节超参数**:树的最大深度(max_depth
- **Adjustable hyperparameter**: maximum depth of the tree (max_depth)

- **解释**:最大深度决定了树的最大层数,限制了树的复杂度。如果树的深度过大,模型可能会过拟合;如果树的深度过小,模型可能会欠拟合。
- ** Explanation ** : The maximum depth determines the maximum number of layers of the tree and limits the complexity of the tree. If the depth of the tree is too large, the model may be overfitted; if the depth of the tree is too small, the model may be underfitted.

- **调节方法**:通过调整 `max_depth` 的值来控制树的复杂度。使用交叉验证进行网格搜索,测试不同的深度以确定最优深度。
- **Tuning method**: control the complexity of the tree by adjusting `max_depth`. Use grid search with cross-validation, testing different depths to determine the optimal one.

### 3. **随机森林(Random Forests**
### 3. ** Random Forests **

- **可调节超参数**:树的数量(n_estimators
- **Adjustable hyperparameter**: number of trees (n_estimators)

- **解释**`n_estimators` 控制森林中树的数量。更多的树通常会提高模型的准确性,但也会增加计算成本。
- **Explanation**: `n_estimators` controls the number of trees in the forest. More trees generally improve accuracy but also increase computational cost.

- **调节方法**:通过逐步增加 `n_estimators` 的值来找到收敛点,即当增加树的数量不再显著提高模型性能时停止。可以结合交叉验证来选择最优的树的数量。
- **Tuning method**: gradually increase `n_estimators` until performance converges, i.e. stop once adding more trees no longer meaningfully improves the model; cross-validation (or a validation set, as in the sketch below) can be used to pick the final number of trees.
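One way to look for this convergence point is to grow the same forest incrementally with warm_start and watch a validation score level off; the data and tree counts below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# warm_start=True keeps the already-grown trees and only adds new ones,
# so we can watch validation accuracy level off as the forest grows.
rf = RandomForestClassifier(warm_start=True, random_state=0)
for n in [25, 50, 100, 200, 400]:
    rf.n_estimators = n
    rf.fit(X_train, y_train)
    print(n, round(rf.score(X_val, y_val), 3))
```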

### 4. **偏最小二乘法(Partial Least Squares, PLS**
### 4. **Partial Least Squares (PLS)**

- **可调节超参数**:成分数(n_components
- **Adjustable hyperparameter**: number of components (n_components)

- **解释**`n_components` 决定了在模型中要使用的潜在变量(成分)的数量。选择适当的成分数可以在解释方差与模型复杂度之间找到平衡。
- **Explanation**: `n_components` determines the number of latent variables (components) used in the model. Choosing an appropriate number of components balances explained variance against model complexity.

- **调节方法**:通过交叉验证测试不同的成分数量,从而找到可以最好地解释数据的成分数。
- **Tuning method**: test different numbers of components via cross-validation to find the number that best explains the data.

### 总结:
### Summary:

对于每种模型,超参数的调节都是通过方法如网格搜索、随机搜索或交叉验证来实现的,目的是找到能够优化模型性能的最佳超参数组合。
For each model, hyperparameter tuning is achieved through methods such as grid search, random search, or cross-validation, with the goal of finding the best combination of hyperparameters that optimizes model performance.

4/假设数据分析师试图预测澳大利亚现金利率,这是澳大利亚名义利率的衡量标准。在任何给定时期内,现金利率都可能下降、保持不变或上升。 分析师有 100 个观察值和 200 个特征。他们首先对完整数据集进行逻辑套索以调整逆正则化惩罚,从而选择分类特征。用于回归的特征是根据该惩罚来提取的。然后,将数据集分为 80% 的训练数据集和 20% 的测试数据集,根据最低均方误差计算样本外性能的最终度量。 上述规范至少存在两个问题。指出并解释您将如何纠正这些问题。
4. Suppose a data analyst is trying to predict the Australian cash rate, a measure of nominal interest rates in Australia. In any given period, the cash rate can fall, stay the same, or rise. The analyst has 100 observations and 200 features. They first run a logistic lasso on the full dataset to tune the inverse regularization penalty and thereby select the features for classification. The features used in the regression are extracted on the basis of this penalty. The dataset is then split into an 80% training set and a 20% test set, and the final measure of out-of-sample performance is computed as the lowest mean squared error. There are at least two problems with the above specification. Identify them and explain how you would correct them.

在上述数据分析规范中,存在至少两个问题。以下是这两个问题的解释及其潜在的解决方案:
There are at least two problems with the above data analysis specifications. Here's an explanation of both problems and their potential solutions:

### 1. **问题一:特征选择过程中的数据泄漏**
### 1. **Problem 1: Data leakage during feature selection**

数据分析师在对**完整数据集**进行逻辑回归的 Lasso 正则化(Lasso 回归)以选择特征。这种方法存在**数据泄漏(data leakage)**的风险,因为特征选择是在整个数据集上进行的,而不是仅在训练集上进行。也就是说,测试集的信息已经被提前用于特征选择,这样在进行最终的模型评估时,测试集的表现会被高估。
The data analyst performs lasso-regularized logistic regression on the **full data set** to select features. This approach carries a risk of **data leakage**, because feature selection is performed on the entire dataset rather than only on the training set. In other words, information from the test set has already been used in feature selection, so the test-set performance will be overestimated in the final model evaluation.

#### **解决方案**:
#### **Solution**:

特征选择应该**仅在训练集上进行**,而不是在整个数据集上。具体步骤是:
Feature selection should be done only on the training set, not on the entire data set. The specific steps are:

1. 先将数据集划分为**训练集和测试集**。
1. First divide the data set into **training set and test set**.

2. 在训练集上使用 Lasso 正则化进行特征选择,并提取重要特征。
2. Use Lasso regularization for feature selection on the training set and extract important features.

3. 然后,使用这些选择出来的特征在训练集上拟合模型,并使用测试集进行评估。
3. Then, use these selected features to fit the model on the training set and use the test set for evaluation.

通过仅在训练集上进行特征选择,可以确保模型评估的公正性,避免过度拟合和对模型性能的错误估计。
By performing feature selection only on the training set, you can ensure the fairness of model evaluation and avoid overfitting and misestimation of model performance.
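A sketch of the corrected workflow under these assumptions: the data are simulated (100 observations, 200 features, as in the question), the label is treated as binary for simplicity, and an L1-penalised logistic regression inside a pipeline does the feature selection using the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Simulated stand-in for the question's setting: 100 observations, 200 features,
# with the outcome treated as binary for simplicity.
X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)

# Split FIRST, so the test set plays no role in feature selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# L1-penalised (lasso) logistic regression selects features using training data only;
# the selector and the final classifier both live inside one pipeline.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
model = make_pipeline(selector, LogisticRegression(max_iter=1000))

model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # honest out-of-sample accuracy
```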

### 2. **问题二:特征数量远大于样本数量(高维问题)**
### 2. **Problem 2: The number of features is much larger than the number of samples (high-dimensional problem)**

分析师使用的特征数量(200 个)远大于观察值数量(100 个),这属于**高维数据问题**。在这种情况下,模型可能会容易过拟合,因为特征数量过多,模型可以在训练数据上拟合得非常好,但在新数据上表现不佳(泛化性能差)。
The number of features used by the analyst (200) is much larger than the number of observations (100), which is a **high-dimensional data problem**. In this case, the model may easily overfit because the number of features is too large and the model can fit very well on the training data but perform poorly on new data (poor generalization performance).

#### **解决方案**:
#### **Solution**:

可以通过以下方法来缓解高维问题:
High-dimensional problems can be alleviated by:

1. **减少特征维度**:除了使用 Lasso 正则化之外,还可以尝试其他维度缩减技术,如**主成分分析(PCA)** 或 **特征选择算法**,以减少特征数量。
1. **Reduce the feature dimension**: in addition to lasso regularization, other dimensionality-reduction techniques, such as **principal component analysis (PCA)** or **feature selection algorithms**, can be tried to reduce the number of features.

2. **交叉验证**:为了确保模型的稳定性和防止过拟合,应该在训练数据上使用**交叉验证(cross-validation)**来选择合适的模型超参数(如正则化强度)。交叉验证有助于评估模型的泛化性能,并选择合适的复杂度参数。
2. **Cross-validation**: to ensure model stability and prevent overfitting, **cross-validation** should be used on the training data to choose appropriate model hyperparameters (such as the regularization strength). Cross-validation helps assess the model's generalization performance and select a suitable complexity parameter.

### 结论:
### Conclusion:

1. **纠正数据泄漏**:在特征选择过程中,仅在训练集上进行 Lasso 正则化,而不使用整个数据集,从而避免数据泄漏。
1. **Correction of data leakage**: During the feature selection process, Lasso regularization is only performed on the training set instead of using the entire data set, thereby avoiding data leakage.

2. **解决高维问题**:通过进一步的降维或特征选择来减少特征数量,并在训练集上进行交叉验证以防止过拟合。
2. **Solve high-dimensional problems**: Reduce the number of features through further dimensionality reduction or feature selection, and perform cross-validation on the training set to prevent overfitting.