Regularization
Regularization, also known as shrinkage methods, is a technique used to reduce the variance of a model at the expense of a slight increase in bias. We will cover three types of regularization methods:
- Ridge regression
- Lasso regression
- Elastic net regression
Ridge Regression
Recall that for an MLR, ordinary least squares (OLS) finds the regression coefficient estimates ($\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$) by minimizing the SSE.

As the number of predictors ($p$) increases, the model becomes more flexible, so the variance of the OLS fit grows and overfitting becomes a concern; shrinking the coefficient estimates helps counter this.
Ridge regression estimates the coefficients by minimizing the SSE with an additional restriction:

$$\sum_{j=1}^{p} \beta_j^2 \le c$$

where $c \ge 0$ is a constant; the smaller $c$ is, the more the estimates are shrunk towards 0.
We can depict this idea for the case of $p = 2$:
- The purple cross marks the $\hat{\beta}_1$ and $\hat{\beta}_2$ values under OLS, i.e. where SSE is minimized.
- The black contour illustrates all the $\beta_1$ and $\beta_2$ coordinates that result in one specific value of SSE. The further the contour is from the purple cross, the larger the SSE value.
- The yellow circle denotes the region $\beta_1^2 + \beta_2^2 \le c$.
- The black contour touches the edge of the yellow circle, signifying the point where SSE is minimized while restricted by $\beta_1^2 + \beta_2^2 \le c$. This green point denotes the ridge estimates.
- This demonstrates ridge estimates (green point) that have shrunk towards 0, relative to the OLS estimates (purple cross).
Mathematically, ridge regression seeks to minimize

$$\text{SSE} + \lambda \sum_{j=1}^{p} \beta_j^2$$

where $\lambda \ge 0$ is a hyperparameter that controls the amount of shrinkage.
- When $\lambda = 0$, there is no shrinkage penalty, so the ridge estimates are the same as the OLS estimates.
- As $\lambda \to \infty$, the estimates (except the intercept's) will approach 0 in order to minimize the expression, which results in the null model.
Scaling Predictors
When predictors have different scales, their corresponding coefficient estimates will also have different scales to compensate. This leads to unfair shrinkage, as larger-scale coefficients contribute more to the penalty term.
For example, if a distance predictor is expressed in kilometers rather than meters, its values are 1,000 times smaller, so its coefficient estimate must be 1,000 times larger to produce the same fit; that larger coefficient then contributes far more to the penalty term even though its predictive role is unchanged.
To address this issue, predictors should be scaled or standardized before applying ridge regression or any other regularization method. Since scaling divides a variable by its standard deviation, it puts all predictors on the same scale.
Implementing Ridge Regression in R
We use the `glmnet()` function from the glmnet package to run ridge regression. Its key arguments are:

- `x`: The predictor values in a matrix object.
- `y`: The column of target values.
- `family`: Specify `"gaussian"` to confirm SSE as part of the expression to be minimized.
- `lambda`: The hyperparameter that controls shrinkage.
- `alpha`: Specify `0` for ridge regression.
Here is sample code:
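A minimal sketch of such a call, assuming the predictor matrix `Xmat` described next and a placeholder response vector `target` (the same target used by "mlr.mod8"):

```r
# Minimal sketch: ridge regression with glmnet (lambda = 0 gives the OLS fit)
library(glmnet)

rid.mod1 <- glmnet(
  x = Xmat,             # predictor values as a matrix
  y = target,           # column of target values (placeholder name)
  family = "gaussian",  # minimize SSE plus the shrinkage penalty
  lambda = 0,           # no shrinkage
  alpha = 0             # alpha = 0 selects the ridge penalty
)
```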
The `glmnet()` function requires the predictor values to be supplied as a matrix object. In the sample above, a matrix object named "Xmat" was created beforehand, which includes the same predictors as "mlr.mod8".
Here is a peek at the first 6 rows of "Xmat":
Notice the first column, which represents what we've been calling $x_0$, the column of 1's paired with the intercept. Also, because factor variables cannot be passed directly to the `glmnet()` function, the factor levels appear as dummy variable columns, which means binarization was performed.
You might be wondering about scaling the values in the matrix. This will be taken care of by `glmnet()`; by default, the `standardize` argument will handle the scaling, so no additional action is needed.
Interpreting Ridge Regression Output
With `lambda` set to 0 for "rid.mod1", the ridge estimates will be the same as the OLS estimates. Here is an example of the output:
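Output like this is obtained by calling `coef()` on the fitted object, for example:

```r
# View the coefficient estimates of the fitted ridge model
coef(rid.mod1)
```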
The corresponding prediction function is:

where:

- $x_1$ is `Sepal.Width`
- $x_2$ is the dummy variable for versicolor
- $x_3$ is the dummy variable for virginica
There are a couple of technical issues to address regarding this output:
- Two Intercept Rows: You might encounter output with two (Intercept) rows, where one doesn't have an estimate. This occurs because the first column of the "Xmat" matrix is redundant. By default, the `glmnet()` function includes an intercept term, hence the first column of "Xmat" is essentially including another intercept term. As a result, there's no estimate for the second (Intercept) row. If this appears on the exam, just ignore the row with no estimate.

  Alternatively, you might see output like this:

  In this particular case, it doesn't mean the model has no intercept. It reveals the same information as the previous output but neglects to report the intercept estimate.

  The exam could also provide output that has no such issue, like this:

  To simplify, we'll use this version going forward.
- Discrepancies in Estimates: When comparing these ridge estimates (same as OLS) to those from "mlr.mod8", you might notice that the numbers are close but not exactly the same. This discrepancy is due to computational precision. To get the estimates to match, we can set the `thresh` argument to a much smaller positive number, as sketched below. However, this should be irrelevant to the exam.
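For illustration only, a refit with a tighter convergence threshold might look like this; the value 1e-12 is an arbitrary example, not one taken from the original analysis:

```r
# Sketch: tighten the convergence threshold so the lambda = 0 ridge estimates
# line up more closely with the OLS estimates (glmnet's default thresh is 1e-7)
rid.mod1 <- glmnet(
  x = Xmat, y = target, family = "gaussian",
  lambda = 0, alpha = 0,
  thresh = 1e-12
)
```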
As the hyperparameter lambda increases, the estimates move closer to 0, reducing the influence of the predictors. Here are examples with two other values (see the sketch after this list):
- `lambda` = 2
- `lambda` = 100
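These fits could be produced along the following lines (a sketch; the object names are illustrative):

```r
# Refit the ridge regression with larger lambda values and inspect the shrinkage
rid.l2   <- glmnet(x = Xmat, y = target, family = "gaussian", lambda = 2,   alpha = 0)
rid.l100 <- glmnet(x = Xmat, y = target, family = "gaussian", lambda = 100, alpha = 0)
coef(rid.l2)
coef(rid.l100)
```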
The process of finding the best value for lambda is called tuning. While it is possible to tune lambda using test RMSE, cross-validation is more commonly used for regularization models.
Here are a couple of key facts about ridge regression:
- If $\lambda$ is finite, practically none of the ridge estimates will equal 0, meaning ridge regression will not remove predictors.
- Ridge regression is useful in high-dimensional settings, as shrinkage leaves only the predictors that truly matter with coefficient estimates far enough from 0.
Cross-Validation and One-Standard-Error Rule
Cross-validation is a technique used to validate or compare models, particularly when tuning hyperparameters. It addresses the issue of using a separate validation/test set, which reduces the size of the training set and may result in suboptimal model fits. Cross-validation is performed on the training set.
In the context of the exam, cross-validation is preferred over a separate validation/test set when the goal is to tune hyperparameters. However, a validation/test set is still common, typically for validating or comparing models rather than tuning hyperparameters.
k-Fold Cross-Validation Procedure
1. Suppose we have 15 observations that we randomly divide into 5 folds with 3 observations each.
2. Fit a model (say ridge with `lambda` = 5) to the 12 observations not in the first fold.
3. Use the fitted equation from step 2 and the first fold's 3 observations to calculate an accuracy measure, say RMSE. This is considered a test RMSE since the fold was excluded from the fit.
4. Repeat steps 2 and 3 such that the remaining 4 folds are each treated like the first: fit the model to the 12 observations not in that fold and calculate the test RMSE using the 3 observations in that fold.
5. Average the 5 test RMSE values to obtain the CV error, which is an accuracy measure for the specified hyperparameter value (`lambda` = 5).
6. Repeat steps 2 to 5 with different hyperparameter values (e.g., `lambda` = 10) to determine which value results in the lowest CV error. (See the sketch after this list for a hand-rolled version of these steps.)
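To make the procedure concrete, here is a minimal hand-rolled sketch of 5-fold cross-validation for a ridge fit with `lambda` = 5. It assumes the `Xmat` and `target` objects used earlier; in practice the `cv.glmnet()` function introduced next does all of this automatically:

```r
set.seed(1)                               # make the random fold assignment reproducible
k     <- 5
n     <- nrow(Xmat)
folds <- sample(rep(1:k, length.out = n)) # step 1: randomly assign observations to folds

fold.rmse <- numeric(k)
for (j in 1:k) {
  in.fold <- folds == j
  fit  <- glmnet(x = Xmat[!in.fold, ], y = target[!in.fold],
                 family = "gaussian", lambda = 5, alpha = 0)  # step 2: fit without fold j
  pred <- predict(fit, newx = Xmat[in.fold, ])                # step 3: predict on fold j
  fold.rmse[j] <- sqrt(mean((target[in.fold] - pred)^2))      # test RMSE for fold j
}
cv.error <- mean(fold.rmse)               # step 5: average the test RMSEs into the CV error
```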
Implementing Cross-Validation in R
In R, the `cv.glmnet()` function performs the entire cross-validation procedure. Here is sample code:
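A sketch of what that call might look like, again using the assumed `Xmat` and `target` objects (the object name `cv.rid` is illustrative):

```r
set.seed(42)              # set before the random division into folds; the number is arbitrary
cv.rid <- cv.glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  alpha  = 0              # ridge; candidate lambda values are generated automatically
)
```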
We set the seed before introducing the randomness of dividing the dataset into folds. The seed number is arbitrary. The `cv.glmnet()` function will generate its own candidate lambda values if not specified. The number of folds can be changed but defaults to 10 and should not be an important detail for the exam.
To obtain the lambda value that resulted in the lowest CV error from the cross-validation results, use code like this:
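For instance (a sketch using the hypothetical `cv.rid` object from above):

```r
# Lambda value with the lowest CV error
cv.rid$lambda.min
```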
The value seems rather close to 0. If, say, this value happens to be the smallest among the candidate values, then it suggests that the cross-validation preferred the most flexible ridge regression, and the ridge estimates should be close to the OLS estimates. This would indicate that the MLR was not very flexible to begin with, as reducing flexibility with a larger lambda was not preferred.
Let's now run a ridge regression with `lambda` = 0.054, which is reasonably close to the one with the lowest CV error:
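A sketch of that fit (the name "rid.mod2" matches the model summary table later in this section):

```r
# Ridge regression with the tuned lambda value
rid.mod2 <- glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  lambda = 0.054,
  alpha  = 0
)
```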
This yields the following coefficient estimates:
When compared to "rid.mod1", the last two estimates actually move away from 0, despite the higher lambda. That is possible; the restricted quantity is the sum $\sum_{j=1}^{p} \beta_j^2$ as a whole, so although the total shrinks, individual estimates can still move away from 0.
Interpreting Cross-Validation Results
A cross-validation plot can help visualize the results. Imagine a generic dataset with 10 predictors and 30 candidate lambda values, resulting in:
- The vertical axis represents the CV error.
- The horizontal axis shows the natural log of the lambda candidates, with flexibility decreasing from left to right. Using the natural log simply helps to visualize a wide range of lambda values more easily.
- The point with the minimum CV error corresponds to the left dashed line. From the plot, the approximate value of this lambda can be read off the horizontal axis.
- The numbers at the top of the plot represent the number of predictors for select levels of flexibility, paralleling the horizontal axis. Ridge regression does not remove predictors when lambda is finite, so the number of predictors remains constant at 10 throughout the plotted range.
One-Standard-Error Rule
Instead of choosing the lambda with the minimum CV error, the one-standard-error rule can be applied:
- Each CV error has a standard error, represented by the interval above and below the point.
- Consider all CV errors within one standard error of the minimum CV error to be close enough to the minimum.
- Select the rightmost point within this interval, which is the least flexible option with roughly the same error quality. In the plot above, this point corresponds to the right dashed line, and its approximate lambda can again be read off the horizontal axis.

The exact lambda values for both the minimum CV error and the one-standard-error rule can be found from the `cv.glmnet` object:
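For example (a sketch using the hypothetical `cv.rid` object):

```r
cv.rid$lambda.min   # lambda with the minimum CV error
cv.rid$lambda.1se   # lambda chosen by the one-standard-error rule
```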
Lasso Regression
Least Absolute Shrinkage and Selection Operator (LASSO) regression is a close cousin to ridge regression. While minimizing the SSE, lasso regression imposes the restriction of

$$\sum_{j=1}^{p} |\beta_j| \le c$$

for some constant $c \ge 0$.
Let's visualize for the case where $p = 2$:
- The purple cross marks the $\hat{\beta}_1$ and $\hat{\beta}_2$ values under OLS.
- The yellow diamond denotes the region $|\beta_1| + |\beta_2| \le c$.
- The black contour touches the edge of the yellow diamond, signifying the point where SSE is minimized while restricted by $|\beta_1| + |\beta_2| \le c$. This green point denotes the lasso estimates.
Mathematically, lasso regression seeks to minimize:

$$\text{SSE} + \lambda \sum_{j=1}^{p} |\beta_j|$$

Just like with ridge regression, when $\lambda = 0$ the lasso estimates equal the OLS estimates, and as $\lambda \to \infty$ the estimates (other than the intercept's) approach 0, resulting in the null model.
Key Difference between Lasso and Ridge
Lasso regression can set coefficient estimates to exactly 0 for a finite $\lambda$, effectively removing predictors from the model; that is, lasso regression performs variable selection.
Lasso and ridge regressions are similar in many ways, including their ability to work in high-dimensional settings. However, the crucial difference in variable selection makes lasso regression a useful tool for simplifying models and focusing on the most important predictors.
Implementing Lasso Regression in R
To model a lasso regression using the `glmnet()` function, set the `alpha` argument to 1. Here is sample code:
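A sketch, reusing the assumed `Xmat` and `target` objects; the object name `las.mod1` is illustrative, and a decreasing sequence of lambda values is supplied so the comparison below can be produced from one call:

```r
# Lasso regression: alpha = 1 selects the absolute-value penalty
las.mod1 <- glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  lambda = c(0.1, 0.02, 0),  # glmnet prefers a decreasing lambda sequence
  alpha  = 1
)
```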
Let's compare lasso regressions with different lambda values:
- `lambda` = 0 (OLS)
- `lambda` = 0.02
- `lambda` = 0.1
As lambda increases, more coefficient estimates shrink to 0. These lasso regressions seem to prefer keeping the interaction terms, ignoring the hierarchical principle.
It is not surprising that the estimates for `lambda` = 0 are exactly the same as the estimates of "rid.mod1", as they both represent the OLS estimates.
Cross-Validation for Lasso Regression
To perform cross-validation for lasso regression, set `alpha` as 1 in the `cv.glmnet()` function. Here is sample code:
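A sketch (the object name `cv.las` is illustrative):

```r
set.seed(42)
cv.las <- cv.glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  alpha  = 1            # lasso
)
cv.las$lambda.min       # lambda with the lowest CV error
```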
The lambda value that produces the lowest CV error is approximately 0.0013.
Here is sample code to fit a lasso regression with `lambda` = 0.0013:
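A sketch of that fit (the name "las.mod2" matches the model summary table later in this section):

```r
las.mod2 <- glmnet(
  x = Xmat, y = target,
  family = "gaussian",
  lambda = 0.0013,
  alpha  = 1
)
```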
The resulting coefficient estimates are:
None of the estimates are 0, and so all predictors are retained.
Interpreting Cross-Validation Results for Lasso Regression
Consider a generic dataset with 10 predictors and 30 candidate lambda values:
- The left dashed line corresponds to the lambda with the lowest CV error (approximately 0.72).
- The right dashed line corresponds to the lambda using the one-standard-error rule (approximately 1.28).
- The number of predictors decreases as lambda increases.
- For the lambda with the lowest CV error, it's unclear whether 3, 4, or 5 predictors are retained.
- The one-standard-error rule leads to just 3 predictors, providing a simpler model with comparable performance.
Elastic Net Regression
Elastic net regression is a hybrid of ridge and lasso regression, combining their respective restrictions on the SSE. The restriction in elastic net regression is a weighted average of the restrictions in ridge and lasso regressions, controlled by $\alpha$:

$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \le c$$

where:

- $\alpha$ is known as the mixing coefficient.
- The first sum is the quantity from lasso regression.
- The second sum is the quantity from ridge regression.

If $\alpha = 1$, the restriction reduces to the lasso restriction; if $\alpha = 0$, it reduces to the ridge restriction. As $\alpha$ increases from 0 to 1, the penalty behaves more and more like the lasso penalty, making it increasingly likely that some coefficient estimates are set to 0.

Mathematically, elastic net regression seeks to minimize:

$$\text{SSE} + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right]$$

Both $\lambda$ and $\alpha$ are hyperparameters that can be tuned, typically with cross-validation.
Implementing Elastic Net in R
The `alpha` argument in the `glmnet()` function is how we can implement elastic net regressions. Here is sample code:
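A sketch of the two fits summarized in the comparison table below; the alpha values are the arbitrary choices described next, and the lambda values come from cross-validation runs at each alpha:

```r
# Elastic net fits: alpha strictly between 0 (ridge) and 1 (lasso)
net.mod1 <- glmnet(x = Xmat, y = target, family = "gaussian",
                   lambda = 0.0148, alpha = 0.35)
net.mod2 <- glmnet(x = Xmat, y = target, family = "gaussian",
                   lambda = 0.0106, alpha = 0.65)
```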
The alpha values were chosen arbitrarily, while the lambda values were obtained from cross-validation after setting the alpha value. The coefficient estimates for the two models are shown below, but there appears to be little difference between them in this specific case.
- "net.mod1"
- "net.mod2"
Comparing the Models
Here is a summary of the `glmnet` objects we made:
| Model | alpha | lambda | Training RMSE |
|---|---|---|---|
| rid.mod2 | 0 | 0.0540 | 0.4360 |
| net.mod1 | 0.35 | 0.0148 | 0.4322 |
| net.mod2 | 0.65 | 0.0106 | 0.4321 |
| las.mod2 | 1 | 0.0013 | 0.4309 |
When comparing regularized regressions with different alpha values, it is important to note that there is no single flexibility measure that ties them together. While the training RMSE decreases as alpha increases in this specific instance, this does not imply a general statement about the flexibility of ridge, lasso, and elastic net regressions as a whole. We may argue that "las.mod2" is more flexible than "rid.mod2" based on their training RMSE's, but that is a model-specific comparison and should not be generalized to all lasso and ridge regressions.
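For reference, a training RMSE like those in the table could be computed along these lines (a sketch; `Xmat` and `target` are the assumed training objects used throughout):

```r
# Training RMSE for one of the fitted glmnet models, e.g. "rid.mod2"
pred <- predict(rid.mod2, newx = Xmat)
sqrt(mean((target - pred)^2))
```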
Comparing Regularization Procedures
This section focuses on the core similarities and differences between three regularization methods: ridge, lasso, and elastic net regressions. We will discuss four key points to understand the characteristics and use cases of these methods.
Reducing Model Flexibility and Overfitting
All three regularization methods aim to reduce model flexibility from a set of predictors. By doing so, they reduce model variance and discourage overfitting. This is particularly useful when dealing with complex models that are prone to capturing noise in the data.
Handling High-Dimensional Data
Ridge, lasso, and elastic net regressions are suitable when faced with high-dimensional data, i.e., when the number of predictors is large. A large enough $\lambda$ keeps the coefficient estimates in check, so the model's flexibility remains under control even with many predictors.
Variable Selection
One key difference between ridge regression and the other two methods is variable selection. Ridge regression practically never removes predictors with a finite $\lambda$, whereas lasso and elastic net regressions can set coefficient estimates to exactly 0, removing the corresponding predictors.

In the case of elastic net, the mixing coefficient $\alpha$ must be greater than 0 for variable selection to be possible; the closer $\alpha$ is to 1, the more the fit behaves like a lasso and the more readily predictors are dropped.
However, it's important to consider the implications of variable selection, particularly when dealing with categorical variables. Excluding dummy variables leads to merging levels of a factor, which may result in a model that is challenging to interpret. Allowing an algorithm to decide which levels to merge might not result in a simple or easily interpretable model. It is crucial to carefully evaluate the consequences of variable selection in the context of your specific problem and the interpretability requirements of your model.
Hierarchical Principle
Recall that the hierarchical principle states that if an interaction term is included in a model, the individual terms that make up the interaction should also be included. Ridge regression will never violate this principle because it cannot remove predictors. However, both elastic net and lasso have the potential to violate the hierarchical principle.
When minimizing the "SSE plus penalty" expression in elastic net and lasso, all coefficients have the opportunity to be shrunk to zero. Hence, it is possible to remove an individual term associated with an interaction term before removing the interaction term itself.