Regularization
Regularization, also known as shrinkage methods, is a technique used to reduce the variance of a model at the expense of a slight increase in bias. We will cover three types of regularization methods:
- Ridge regression
- Lasso regression
- Elastic net regression
Ridge Regression
Recall that for an MLR, ordinary least squares (OLS) finds the regression coefficient estimates ($\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$) by minimizing the SSE.

As the number of predictors ($p$) increases, the model becomes more flexible, so the variance of the OLS fit grows and overfitting becomes a concern; shrinking the coefficient estimates helps counter this.
Ridge regression estimates the coefficients by minimizing the SSE with an additional restriction:

$$\sum_{j=1}^{p} \beta_j^2 \le c$$

where $c \ge 0$ is a constant; the smaller $c$ is, the more the estimates are shrunk towards 0.
We can depict this idea for the case of $p = 2$:
- The purple cross marks the $\hat{\beta}_1$ and $\hat{\beta}_2$ values under OLS, i.e. where SSE is minimized.
- The black contour illustrates all the $\beta_1$ and $\beta_2$ coordinates that result in one specific value of SSE. The further the contour is from the purple cross, the larger the SSE value.
- The yellow circle denotes the region $\beta_1^2 + \beta_2^2 \le c$.
- The black contour touches the edge of the yellow circle, signifying the point where SSE is minimized while restricted by $\beta_1^2 + \beta_2^2 \le c$. This green point denotes the ridge estimates.
- This demonstrates ridge estimates (green point) that have shrunk towards 0, relative to the OLS estimates (purple cross).
Mathematically, ridge regression seeks to minimize

$$\text{SSE} + \lambda \sum_{j=1}^{p} \beta_j^2$$

where $\lambda \ge 0$ is a hyperparameter that controls the amount of shrinkage.
- When $\lambda = 0$, there is no shrinkage penalty, so the ridge estimates are the same as the OLS estimates.
- As $\lambda \to \infty$, the estimates (except the intercept's) will approach 0 in order to minimize the expression, which results in the null model.
Scaling Predictors
When predictors have different scales, their corresponding coefficient estimates will also have different scales to compensate. This leads to unfair shrinkage, as larger-scale coefficients contribute more to the penalty term.
For example, if a distance predictor is expressed in kilometers rather than meters, its values are 1,000 times smaller, so its coefficient estimate must be 1,000 times larger to produce the same fit; that larger coefficient then contributes far more to the penalty term even though its predictive role is unchanged.
To address this issue, predictors should be scaled or standardized before applying ridge regression or any other regularization method. Since scaling divides a variable by its standard deviation, it puts all predictors on the same scale.
Implementing Ridge Regression in R
We use the `glmnet()` function from the glmnet package to run ridge regression. Its key arguments are:

- `x`: The predictor values in a matrix object.
- `y`: The column of target values.
- `family`: Specify `"gaussian"` to confirm SSE as part of the expression to be minimized.
- `lambda`: The hyperparameter that controls shrinkage.
- `alpha`: Specify `0` for ridge regression.
Here is sample code:
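A minimal sketch of such a call, assuming the predictor matrix `Xmat` described next and a placeholder response vector `target` (the same target used by "mlr.mod8"):

```r
# Minimal sketch: ridge regression with glmnet (lambda = 0 gives the OLS fit)
library(glmnet)

rid.mod1 <- glmnet(
  x = Xmat,             # predictor values as a matrix
  y = target,           # column of target values (placeholder name)
  family = "gaussian",  # minimize SSE plus the shrinkage penalty
  lambda = 0,           # no shrinkage
  alpha = 0             # alpha = 0 selects the ridge penalty
)
```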
The `glmnet()` function requires the predictor values to be supplied as a matrix object. In the sample above, a matrix object named "Xmat" was created beforehand, which includes the same predictors as "mlr.mod8".
Here is a peek at the first 6 rows of "Xmat":
Notice the first column, which represents what we've been calling $x_0$, the column of 1's paired with the intercept. Also, because factor variables cannot be passed directly to the `glmnet()` function, the factor levels appear as dummy variable columns, which means binarization was performed.
You might be wondering about scaling the values in the matrix. This will be taken care of by `glmnet()`; by default, the `standardize` argument will handle the scaling, so no additional action is needed.
Interpreting Ridge Regression Output
With `lambda` set to 0 for "rid.mod1", the ridge estimates will be the same as the OLS estimates. Here is an example of the output:
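Output like this is obtained by calling `coef()` on the fitted object, for example:

```r
# View the coefficient estimates of the fitted ridge model
coef(rid.mod1)
```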
The corresponding prediction function is:

where:

- $x_1$ is `Sepal.Width`
- $x_2$ is the dummy variable for versicolor
- $x_3$ is the dummy variable for virginica
There are a couple of technical issues to address regarding this output:
- Two Intercept Rows: You might encounter output with two (Intercept) rows, where one doesn't have an estimate. This occurs because the first column of the "Xmat" matrix is redundant. By default, the `glmnet()` function includes an intercept term, hence the first column of "Xmat" is essentially including another intercept term. As a result, there's no estimate for the second (Intercept) row. If this appears on the exam, just ignore the row with no estimate.

  Alternatively, you might see output like this:

  In this particular case, it doesn't mean the model has no intercept. It reveals the same information as the previous output but neglects to report the intercept estimate.

  The exam could also provide output that has no such issue, like this:

  To simplify, we'll use this version going forward.
- Discrepancies in Estimates: When comparing these ridge estimates (same as OLS) to those from "mlr.mod8", you might notice that the numbers are close but not exactly the same. This discrepancy is due to computational precision. To get the estimates to match, we can set the `thresh` argument to a much smaller positive number, as sketched below. However, this should be irrelevant to the exam.
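For illustration only, a refit with a tighter convergence threshold might look like this; the value 1e-12 is an arbitrary example, not one taken from the original analysis:

```r
# Sketch: tighten the convergence threshold so the lambda = 0 ridge estimates
# line up more closely with the OLS estimates (glmnet's default thresh is 1e-7)
rid.mod1 <- glmnet(
  x = Xmat, y = target, family = "gaussian",
  lambda = 0, alpha = 0,
  thresh = 1e-12
)
```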
As the hyperparameter lambda increases, the estimates move closer to 0, reducing the influence of the predictors. Here are examples with two other values (see the sketch after this list):
- `lambda` = 2
- `lambda` = 100
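These fits could be produced along the following lines (a sketch; the object names are illustrative):

```r
# Refit the ridge regression with larger lambda values and inspect the shrinkage
rid.l2   <- glmnet(x = Xmat, y = target, family = "gaussian", lambda = 2,   alpha = 0)
rid.l100 <- glmnet(x = Xmat, y = target, family = "gaussian", lambda = 100, alpha = 0)
coef(rid.l2)
coef(rid.l100)
```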
The process of finding the best value for lambda is called tuning. While it is possible to tune lambda using test RMSE, cross-validation is more commonly used for regularization models.
Here are a couple of key facts about ridge regression:
- If $\lambda$ is finite, practically none of the ridge estimates will equal 0, meaning ridge regression will not remove predictors.
- Ridge regression is useful in high-dimensional settings, as shrinkage leaves only the predictors that truly matter with coefficient estimates far enough from 0.
Cross-Validation and One-Standard-Error Rule
Cross-validation is a technique used to validate or compare models, particularly when tuning hyperparameters. It addresses the issue of using a separate validation/test set, which reduces the size of the training set and may result in suboptimal model fits. Cross-validation is performed on the training set.
In the context of the exam, cross-validation is preferred over a separate validation/test set when the goal is to tune hyperparameters. However, a validation/test set is still common, typically for validating or comparing models rather than tuning hyperparameters.
k-Fold Cross-Validation Procedure
1. Suppose we have 15 observations that we randomly divide into 5 folds with 3 observations each.
2. Fit a model (say ridge with `lambda` = 5) to the 12 observations not in the first fold.
3. Use the fitted equation from step 2 and the first fold's 3 observations to calculate an accuracy measure, say RMSE. This is considered a test RMSE since the fold was excluded from the fit.
4. Repeat steps 2 and 3 such that the remaining 4 folds are each treated like the first: fit the model to the 12 observations not in that fold and calculate the test RMSE using the 3 observations in that fold.
5. Average the 5 test RMSE values to obtain the CV error, which is an accuracy measure for the specified hyperparameter value (`lambda` = 5).
6. Repeat steps 2 to 5 with different hyperparameter values (e.g., `lambda` = 10) to determine which value results in the lowest CV error. (See the sketch after this list for a hand-rolled version of these steps.)
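To make the procedure concrete, here is a minimal hand-rolled sketch of 5-fold cross-validation for a ridge fit with `lambda` = 5. It assumes the `Xmat` and `target` objects used earlier; in practice the `cv.glmnet()` function introduced next does all of this automatically:

```r
set.seed(1)                               # make the random fold assignment reproducible
k     <- 5
n     <- nrow(Xmat)
folds <- sample(rep(1:k, length.out = n)) # step 1: randomly assign observations to folds

fold.rmse <- numeric(k)
for (j in 1:k) {
  in.fold <- folds == j
  fit  <- glmnet(x = Xmat[!in.fold, ], y = target[!in.fold],
                 family = "gaussian", lambda = 5, alpha = 0)  # step 2: fit without fold j
  pred <- predict(fit, newx = Xmat[in.fold, ])                # step 3: predict on fold j
  fold.rmse[j] <- sqrt(mean((target[in.fold] - pred)^2))      # test RMSE for fold j
}
cv.error <- mean(fold.rmse)               # step 5: average the test RMSEs into the CV error
```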
Implementing Cross-Validation in R
In R, the `cv.glmnet()` function performs the entire cross-validation procedure. Here is sample code:
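A sketch of what that call might look like, again using the assumed `Xmat` and `target` objects (the object name `cv.rid` is illustrative):

```r
set.seed(42)              # set before the random division into folds; the number is arbitrary
cv.rid <- cv.glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  alpha  = 0              # ridge; candidate lambda values are generated automatically
)
```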
We set the seed before introducing the randomness of dividing the dataset into folds. The seed number is arbitrary. The `cv.glmnet()` function will generate its own candidate lambda values if not specified. The number of folds can be changed but defaults to 10 and should not be an important detail for the exam.
To obtain the lambda value that resulted in the lowest CV error from the cross-validation results, use code like this:
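For instance (a sketch using the hypothetical `cv.rid` object from above):

```r
# Lambda value with the lowest CV error
cv.rid$lambda.min
```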
The value seems rather close to 0. If, say, this value happens to be the smallest among the candidate values, then it suggests that the cross-validation preferred the most flexible ridge regression, and the ridge estimates should be close to the OLS estimates. This would indicate that the MLR was not very flexible to begin with, as reducing flexibility with a larger lambda was not preferred.
Let's now run a ridge regression with `lambda` = 0.054, which is reasonably close to the one with the lowest CV error:
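A sketch of that fit (the name "rid.mod2" matches the model summary table later in this section):

```r
# Ridge regression with the tuned lambda value
rid.mod2 <- glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  lambda = 0.054,
  alpha  = 0
)
```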
This yields the following coefficient estimates:
When compared to "rid.mod1", the last two estimates actually move away from 0, despite the higher lambda. That is possible; the restricted quantity is the sum $\sum_{j=1}^{p} \beta_j^2$ as a whole, so although the total shrinks, individual estimates can still move away from 0.
Interpreting Cross-Validation Results
A cross-validation plot can help visualize the results. Imagine a generic dataset with 10 predictors and 30 candidate lambda values, resulting in:
- The vertical axis represents the CV error.
- The horizontal axis shows the natural log of the lambda candidates, with flexibility decreasing from left to right. Using the natural log simply helps to visualize a wide range of lambda values more easily.
- The point with the minimum CV error corresponds to the left dashed line. From the plot, the approximate value of this lambda can be read off the horizontal axis.
- The numbers at the top of the plot represent the number of predictors for select levels of flexibility, paralleling the horizontal axis. Ridge regression does not remove predictors when lambda is finite, so the number of predictors remains constant at 10 throughout the plotted range.
One-Standard-Error Rule
Instead of choosing the lambda with the minimum CV error, the one-standard-error rule can be applied:
- Each CV error has a standard error, represented by the interval above and below the point.
- Consider all CV errors within one standard error of the minimum CV error to be close enough to the minimum.
- Select the rightmost point within this interval, which is the least flexible option with roughly the same error quality. In the plot above, this point corresponds to the right dashed line, and its approximate lambda can again be read off the horizontal axis.

The exact lambda values for both the minimum CV error and the one-standard-error rule can be found from the `cv.glmnet` object:
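For example (a sketch using the hypothetical `cv.rid` object):

```r
cv.rid$lambda.min   # lambda with the minimum CV error
cv.rid$lambda.1se   # lambda chosen by the one-standard-error rule
```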
Lasso Regression
Least Absolute Shrinkage and Selection Operator (LASSO) regression is a close cousin to ridge regression. While minimizing the SSE, lasso regression imposes the restriction of

$$\sum_{j=1}^{p} |\beta_j| \le c$$

for some constant $c \ge 0$.
Let's visualize for the case where $p = 2$:
- The purple cross marks the $\hat{\beta}_1$ and $\hat{\beta}_2$ values under OLS.
- The yellow diamond denotes the region $|\beta_1| + |\beta_2| \le c$.
- The black contour touches the edge of the yellow diamond, signifying the point where SSE is minimized while restricted by $|\beta_1| + |\beta_2| \le c$. This green point denotes the lasso estimates.
Mathematically, lasso regression seeks to minimize:

$$\text{SSE} + \lambda \sum_{j=1}^{p} |\beta_j|$$

Just like with ridge regression, when $\lambda = 0$ the lasso estimates equal the OLS estimates, and as $\lambda \to \infty$ the estimates (other than the intercept's) approach 0, resulting in the null model.
Key Difference between Lasso and Ridge
Lasso regression can set coefficient estimates to exactly 0 for a finite $\lambda$, effectively removing predictors from the model; that is, lasso regression performs variable selection.
Lasso and ridge regressions are similar in many ways, including their ability to work in high-dimensional settings. However, the crucial difference in variable selection makes lasso regression a useful tool for simplifying models and focusing on the most important predictors.
Implementing Lasso Regression in R
To model a lasso regression using the `glmnet()` function, set the `alpha` argument to 1. Here is sample code:
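A sketch, reusing the assumed `Xmat` and `target` objects; the object name `las.mod1` is illustrative, and a decreasing sequence of lambda values is supplied so the comparison below can be produced from one call:

```r
# Lasso regression: alpha = 1 selects the absolute-value penalty
las.mod1 <- glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  lambda = c(0.1, 0.02, 0),  # glmnet prefers a decreasing lambda sequence
  alpha  = 1
)
```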
Let's compare lasso regressions with different lambda values:
- `lambda` = 0 (OLS)
- `lambda` = 0.02
- `lambda` = 0.1
As lambda increases, more coefficient estimates shrink to 0. These lasso regressions seem to prefer keeping the interaction terms, ignoring the hierarchical principle.
It is not surprising that the estimates for `lambda` = 0 are exactly the same as the estimates of "rid.mod1", as they both represent the OLS estimates.
Cross-Validation for Lasso Regression
To perform cross-validation for lasso regression, set `alpha` as 1 in the `cv.glmnet()` function. Here is sample code:
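A sketch (the object name `cv.las` is illustrative):

```r
set.seed(42)
cv.las <- cv.glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  alpha  = 1            # lasso
)
cv.las$lambda.min       # lambda with the lowest CV error
```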
The lambda value that produces the lowest CV error is approximately 0.0013.
Here is sample code to fit a lasso regression with `lambda` = 0.0013:
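A sketch of that fit (the name "las.mod2" matches the model summary table later in this section):

```r
las.mod2 <- glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  lambda = 0.0013,
  alpha  = 1
)
```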
The resulting coefficient estimates are:
None of the estimates are 0, and so all predictors are retained.
Interpreting Cross-Validation Results for Lasso Regression
Consider a generic dataset with 10 predictors and 30 candidate lambda values:
- The left dashed line corresponds to the lambda with the lowest CV error (approximately 0.72).
- The right dashed line corresponds to the lambda using the one-standard-error rule (approximately 1.28).
- The number of predictors decreases as lambda increases.
- For the lambda with the lowest CV error, it's unclear whether 3, 4, or 5 predictors are retained.
- The one-standard-error rule leads to just 3 predictors, providing a simpler model with comparable performance.
Elastic Net Regression
Elastic net regression is a hybrid of ridge and lasso regression, combining their respective restrictions on the SSE. The restriction in elastic net regression is a weighted average of the restrictions in ridge and lasso regressions, controlled by $\alpha$:

$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \le c$$

where:

- $\alpha$ is known as the mixing coefficient.
- The first sum is the quantity from lasso regression.
- The second sum is the quantity from ridge regression.

If $\alpha = 1$, the restriction reduces to the lasso restriction; if $\alpha = 0$, it reduces to the ridge restriction. As $\alpha$ increases from 0 to 1, the penalty behaves more and more like the lasso penalty, making it increasingly likely that some coefficient estimates are set to 0.

Mathematically, elastic net regression seeks to minimize:

$$\text{SSE} + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right]$$

Both $\lambda$ and $\alpha$ are hyperparameters that can be tuned, typically with cross-validation.
Implementing Elastic Net in R
The `alpha` argument in the `glmnet()` function is how we can implement elastic net regressions. Here is sample code:
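A sketch of the two fits summarized in the comparison table below; the alpha values are the arbitrary choices described next, and the lambda values come from cross-validation runs at each alpha:

```r
# Elastic net fits: alpha strictly between 0 (ridge) and 1 (lasso)
net.mod1 <- glmnet(x = Xmat, y = target, family = "gaussian",
                   lambda = 0.0148, alpha = 0.35)
net.mod2 <- glmnet(x = Xmat, y = target, family = "gaussian",
                   lambda = 0.0106, alpha = 0.65)
```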
The alpha values were chosen arbitrarily, while the lambda values were obtained from cross-validation after setting the alpha value. The coefficient estimates for the two models are shown below, but there appears to be little difference between them in this specific case.
- "net.mod1"
- "net.mod2"
Comparing the Models
Here is a summary of the `glmnet` objects we made:
| Model | alpha | lambda | Training RMSE |
|---|---|---|---|
| rid.mod2 | 0 | 0.0540 | 0.4360 |
| net.mod1 | 0.35 | 0.0148 | 0.4322 |
| net.mod2 | 0.65 | 0.0106 | 0.4321 |
| las.mod2 | 1 | 0.0013 | 0.4309 |
When comparing regularized regressions with different alpha values, it is important to note that there is no single flexibility measure that ties them together. While the training RMSE decreases as alpha increases in this specific instance, this does not imply a general statement about the flexibility of ridge, lasso, and elastic net regressions as a whole. We may argue that "las.mod2" is more flexible than "rid.mod2" based on their training RMSE's, but that is a model-specific comparison and should not be generalized to all lasso and ridge regressions.
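For reference, a training RMSE like those in the table could be computed along these lines (a sketch; `Xmat` and `target` are the assumed training objects used throughout):

```r
# Training RMSE for one of the fitted glmnet models, e.g. "rid.mod2"
pred <- predict(rid.mod2, newx = Xmat)
sqrt(mean((target - pred)^2))
```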
Comparing Regularization Procedures
This section focuses on the core similarities and differences between three regularization methods: ridge, lasso, and elastic net regressions. We will discuss four key points to understand the characteristics and use cases of these methods.
Reducing Model Flexibility and Overfitting
All three regularization methods aim to reduce model flexibility from a set of predictors. By doing so, they reduce model variance and discourage overfitting. This is particularly useful when dealing with complex models that are prone to capturing noise in the data.
Handling High-Dimensional Data
Ridge, lasso, and elastic net regressions are suitable when faced with high-dimensional data, i.e., when the number of predictors is large. A large enough $\lambda$ keeps the coefficient estimates in check, so the model's flexibility remains under control even with many predictors.
Variable Selection
One key difference between ridge regression and the other two methods is variable selection. Ridge regression practically never removes predictors with a finite $\lambda$, whereas lasso and elastic net regressions can set coefficient estimates to exactly 0, removing the corresponding predictors.

In the case of elastic net, the mixing coefficient $\alpha$ must be greater than 0 for variable selection to be possible; the closer $\alpha$ is to 1, the more the fit behaves like a lasso and the more readily predictors are dropped.
However, it's important to consider the implications of variable selection, particularly when dealing with categorical variables. Excluding dummy variables leads to merging levels of a factor, which may result in a model that is challenging to interpret. Allowing an algorithm to decide which levels to merge might not result in a simple or easily interpretable model. It is crucial to carefully evaluate the consequences of variable selection in the context of your specific problem and the interpretability requirements of your model.
Hierarchical Principle
Recall that the hierarchical principle states that if an interaction term is included in a model, the individual terms that make up the interaction should also be included. Ridge regression will never violate this principle because it cannot remove predictors. However, both elastic net and lasso have the potential to violate the hierarchical principle.
When minimizing the "SSE plus penalty" expression in elastic net and lasso, all coefficients have the opportunity to be shrunk to zero. Hence, it is possible to remove an individual term associated with an interaction term before removing the interaction term itself.