I believe everyone who is familiar with machine learning knows about ridge regression (also called \(L_2\)-regularized least squares, or linear regression with a Gaussian prior) and PCA. In fact, these are concepts we learn at the very beginning of our study of machine learning.
Why would we want to study the relationship between ridge regression and PCA? The question arises naturally from linear regression.
1.Linear regression
If \(X\) has full column rank, then \(\beta\) has a closed-form solution
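\[\hat{\beta} = (X^TX)^{-1}X^TY.\]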
However, if \(rank(X)<p\), then \(X^TX\) is not invertible and we cannot get an analytical solution for \(\beta\). In general, there are two ways to deal with this problem.
- we can apply PCA to \(X\) and then run linear regression on the principal components.
- we can add a regularization term (\(L_1\) or \(L_2\)) to the loss function.
2.PCA
\(X_{PCA}\) can be obtained from the SVD of \(X\); specifically,
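\[X = U\Sigma V^T,\]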
where \(\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_p)\) is a diagonal matrix whose entries are the singular values, ordered so that \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p\).
If we want to keep the \(k\) largest principal components of \(X\), a natural choice (following the usual principal-component-regression convention) is to take
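\[X_{PCA} = XV_k = U_k\Sigma_k,\]
where \(U_k\), \(\Sigma_k\), and \(V_k\) keep only the first \(k\) singular directions.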
If we plug \(X_{PCA}\) in place of \(X\) in \(\hat{Y}=X\hat{\beta}=X(X^TX)^{-1}X^TY\), then we have
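\[\hat{Y}_{PCA} = X_{PCA}(X_{PCA}^TX_{PCA})^{-1}X_{PCA}^TY = U_kU_k^TY = \sum_{j=1}^{k}u_ju_j^TY,\]
where \(u_j\) denotes the \(j\)-th column of \(U\).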
3.Ridge regression
For ridge regression, our solution is
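\[\beta_{ridge} = (X^TX + \lambda I)^{-1}X^TY.\]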
Plugging \(\beta_{ridge}\) and \(X=U\Sigma V^T\) into the fitted values \(\hat{Y}_{ridge}=X\beta_{ridge}\), we have
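\[\hat{Y}_{ridge} = U\Sigma(\Sigma^2+\lambda I)^{-1}\Sigma U^TY = \sum_{j=1}^{p}\frac{\sigma_j^2}{\sigma_j^2+\lambda}u_ju_j^TY.\]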
4.Comparison
When we put these two expressions for \(\hat{Y}\) together, we can see their relationship clearly.
- For ridge regression, the regularization parameter \(\lambda\) has a different impact on different singular directions: when \(\sigma_j\) is small, \(\frac{\sigma_j^2}{\sigma_j^2+\lambda}\) is also very small, whereas \(\frac{\sigma_j^2}{\sigma_j^2+\lambda}\) tends to 1 when \(\sigma_j\) is large.
- For PCA regression, the weight is exactly 1 on the \(k\) directions with the largest singular values and exactly 0 on all the others.
Therefore, ridge regression is in fact a soft version of PCA regression. Both are intended to deal with multicollinearity in order to improve the model fit.
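As a sanity check, here is a minimal numpy sketch (the synthetic data, the sizes \(n, p, k\), and the penalty \(\lambda\) are purely illustrative) that computes the PCA-regression and ridge fitted values both directly and through the singular-value formulas above, and verifies that the two forms agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, lam = 50, 5, 3, 2.0          # illustrative sizes and penalty

X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, 1))

# Thin SVD: X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# --- PCA (principal component) regression: keep the k largest components ---
X_pca = X @ Vt[:k].T                              # scores on the first k components
Y_hat_pca = X_pca @ np.linalg.solve(X_pca.T @ X_pca, X_pca.T @ Y)
# Equivalent form: hard 0/1 weights on the singular directions
Y_hat_pca_svd = U[:, :k] @ (U[:, :k].T @ Y)

# --- Ridge regression ---
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
Y_hat_ridge = X @ beta_ridge
# Equivalent form: soft weights s_j^2 / (s_j^2 + lambda) on the same directions
weights = s**2 / (s**2 + lam)
Y_hat_ridge_svd = U @ (weights[:, None] * (U.T @ Y))

print(np.allclose(Y_hat_pca, Y_hat_pca_svd))      # True
print(np.allclose(Y_hat_ridge, Y_hat_ridge_svd))  # True
```

The two printed checks confirm the comparison above: PCA regression applies hard 0/1 weights to the directions \(u_j\), while ridge regression shrinks each direction by the soft factor \(\sigma_j^2/(\sigma_j^2+\lambda)\).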