Difference between Primal, Dual and Kernel Ridge Regression

What is the difference between Primal, Dual and Kernel Ridge Regression? People use all three, and because of the different notation everyone uses in different sources, it is hard for me to follow.

Can someone tell me in simple words what the difference between these three is? In addition, what are the advantages and disadvantages of each, and what is their complexity?

regression kernel-trick ridge-regression

— Jim Blum

Short answer: there is no difference between Primal and Dual - it's only about the way of arriving at the solution. Kernel ridge regression is essentially the same as usual ridge regression, but uses the kernel trick to go non-linear.

Linear Regression

First of all, the usual Least Squares Linear Regression tries to fit a straight line to the set of data points in such a way that the sum of squared errors is minimal.

We parametrize the best-fit line with $\mathbf w$, and for each data point $(\mathbf x_i, y_i)$ we want $\mathbf w^T \mathbf x_i \approx y_i$. Let $e_i = y_i - \mathbf w^T \mathbf x_i$ be the error - the distance between the true and predicted values. So our goal is to minimize the sum of squared errors $\sum e_i^2 = \| \mathbf e \|^2 = \| X \mathbf w - \mathbf y \|^2$, where $X = \begin{bmatrix} — \mathbf x_1 \,— \\ — \mathbf x_2 \,— \\ \vdots \\ — \mathbf x_n \,— \end{bmatrix}$ is a data matrix with each $\mathbf x_i$ being a row, and $\mathbf y = (y_1 , \ ... \ , y_n)$ is a vector with all the $y_i$.

Thus, the objective is $\min\limits_{\mathbf w} \| X \mathbf w - \mathbf y \|^2$, and the solution is $\mathbf w = (X^T X)^{-1} X^T \mathbf y$ (known as the "Normal Equation").

For a new unseen data point $\mathbf x$ we predict its target value $\hat y$ as $\hat y = \mathbf w^T \mathbf x$.
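
As a rough illustration of the Normal Equation, here is a minimal NumPy sketch. The synthetic data, the sizes `n` and `d`, and the names `w_true` and `x_new` are assumptions made up for the example, not part of the answer above. It solves the linear system $(X^T X)\,\mathbf w = X^T \mathbf y$ rather than forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): n = 50 points, d = 3 features.
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction for a new, unseen point x: y_hat = w^T x.
x_new = np.array([0.2, -0.1, 1.0])
y_hat = w @ x_new
```

Both `w` and `w_lstsq` should agree up to floating-point error, and both should land close to `w_true` because the noise is small.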

Ridge Regression

When there are many correlated variables in a linear regression model, the coefficients $\mathbf w$ can become poorly determined and have a lot of variance. One solution to this problem is to restrict the weights $\mathbf w$ so that they don't exceed some budget $C$. This is equivalent to using $L_2$-regularization, also known as "weight decay": it decreases the variance at the cost of sometimes missing the correct results (i.e. by introducing some bias).

The objective now becomes $\min\limits_{\mathbf w} \| X \mathbf w - \mathbf y \|^2 + \lambda \, \| \mathbf w \|^2$, with $\lambda$ being the regularization parameter. By going through the math, we get the following solution: $\mathbf w = (X^T X + \lambda \, I )^{-1} X^T \mathbf y$. It's very similar to the usual linear regression, but here we add $\lambda$ to each diagonal element of $X^T X$.
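
To make the effect of adding $\lambda$ to the diagonal concrete, here is a small sketch of the closed-form ridge solution. The helper name `ridge_primal` and the two nearly collinear features are assumptions chosen for illustration, to mimic the "many correlated variables" situation.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)

# Two nearly collinear features (illustrative assumption).
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)

w_ols = ridge_primal(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
w_ridge = ridge_primal(X, y, lam=1.0)  # lam > 0 shrinks the weights

print("norm of OLS weights:  ", np.linalg.norm(w_ols))
print("norm of ridge weights:", np.linalg.norm(w_ridge))
```

With collinear columns the unregularized weights tend to be large and unstable, while the ridge weights have a smaller norm; the norm of the ridge solution never grows as $\lambda$ increases.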

Note that we can rewrite $\mathbf w$ as $\mathbf w = X^T \, (X X^T + \lambda \, I)^{-1} \mathbf y$ (see here for details). For a new unseen data point $\mathbf x$ we predict its target value $\hat y$ as $\hat y = \mathbf x^T \mathbf w = \mathbf x^T X^T \, (X X^T + \lambda \, I)^{-1} \mathbf y$. Let $\boldsymbol \alpha = (X X^T + \lambda \, I)^{-1} \mathbf y$. Then $\hat y = \mathbf x^T X^T \boldsymbol \alpha = \sum\limits_{i=1}^{n} \alpha_i \cdot \mathbf x^T \mathbf x_i$.
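
The rewrite follows from the identity $X^T (X X^T + \lambda I) = (X^T X + \lambda I) X^T$: multiplying by the two inverses on either side gives $(X^T X + \lambda I)^{-1} X^T = X^T (X X^T + \lambda I)^{-1}$. A quick numerical check of this, as a sketch on random data (the sizes and $\lambda$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 20, 5, 0.5  # illustrative sizes and regularization strength
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# First form: solve a d x d system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Second form: solve an n x n system for alpha, then w = X^T alpha.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha

# Prediction for a new point, written as a weighted sum of inner products.
x_new = rng.normal(size=d)
y_hat_primal = x_new @ w_primal
y_hat_dual = sum(a * (x_new @ x_i) for a, x_i in zip(alpha, X))
```

The two forms give the same weights and the same prediction. The practical difference is only the size of the system being solved: $d \times d$ for the first form (number of features) versus $n \times n$ for the second (number of data points).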

Ridge Regression Dual Form

We can take a different look at our objective and define the following quadratic programming problem:

$\min\limits_{\mathbf e, \mathbf w} \sum\limits_{i = 1}^n e_i^2$ s.t. $e_i = y_i - \mathbf w^T \mathbf x_i$ for $i = 1 \, .. \, n$ and $\| \mathbf w \|^2 \leqslant C$ .

It's the same objective, but expressed somewhat differently, and here the constraint on the size of $\mathbf w$ is explicit. To solve it, we define the Lagrangian $\mathcal L_p(\mathbf w, \mathbf e ; C)$ - this is the primal form that contains the primal variables $\mathbf w$ and $\mathbf e$. Then we optimize it w.r.t. $\mathbf e$ and $\mathbf w$. To get the dual formulation, we substitute the resulting $\mathbf e$ and $\mathbf w$ back into $\mathcal L_p(\mathbf w, \mathbf e ; C)$.

So, $\mathcal L_p(\mathbf w, \mathbf e ; C) = \| \mathbf e \|^2 + \boldsymbol \beta^T (\mathbf y - X \mathbf w - \mathbf e) + \lambda \, (\| \mathbf w \|^2 - C)$. By taking derivatives w.r.t. $\mathbf w$ and $\mathbf e$ and setting them to zero, we obtain $\mathbf e = \cfrac{1}{2} \boldsymbol \beta$ and $\mathbf w = \cfrac{1}{2 \lambda} X^T \boldsymbol \beta$. By letting $\boldsymbol \alpha = \cfrac{1}{2 \lambda} \boldsymbol \beta$, and putting $\mathbf e$ and $\mathbf w$ back into $\mathcal L_p(\mathbf w, \mathbf e ; C)$, we get the dual Lagrangian $\mathcal L_d(\boldsymbol \alpha, \lambda; C) = -\lambda^2 \| \boldsymbol \alpha \|^2 + 2 \lambda \, \boldsymbol \alpha^T \mathbf y - \lambda \| X^T \boldsymbol \alpha \|^2 - \lambda C$. If we take a derivative w.r.t. $\boldsymbol \alpha$ and set it to zero, we get $\boldsymbol \alpha = (XX^T + \lambda I)^{-1} \mathbf y$ - the same answer as for usual Kernel Ridge Regression (with $K = XX^T$). There's no need to take a derivative w.r.t. $\lambda$ - it depends on $C$, which is a regularization parameter - and that makes $\lambda$ a regularization parameter as well.

Next, put $\boldsymbol \alpha$ to the primal form solution for $\mathbf w$ , and get $\mathbf w = \cfrac{1}{2 \lambda} X^T \boldsymbol \beta = X^T \boldsymbol \alpha$ . Thus, the dual form gives the same solution as usual Ridge Regression, and it's just a different way to come to the same solution.
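
As a sanity check on the dual derivation, one can verify numerically that $\boldsymbol \alpha = (XX^T + \lambda I)^{-1} \mathbf y$ is a stationary point of the dual objective. The sketch below assumes the dual has the form $-\lambda^2 \| \boldsymbol \alpha \|^2 + 2 \lambda \, \boldsymbol \alpha^T \mathbf y - \lambda \| X^T \boldsymbol \alpha \|^2 - \lambda C$ (note the squared norm in the third term); the data, $\lambda$ and $C$ are arbitrary illustrative values, and the function names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam, C = 15, 4, 0.7, 1.0  # illustrative values; C only shifts the objective
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def dual_objective(alpha):
    """L_d(alpha) for fixed lam, with the norm of X^T alpha squared."""
    return (-lam**2 * alpha @ alpha
            + 2 * lam * alpha @ y
            - lam * np.sum((X.T @ alpha) ** 2)
            - lam * C)

def dual_gradient(alpha):
    """Analytic gradient of L_d with respect to alpha."""
    return -2 * lam**2 * alpha + 2 * lam * y - 2 * lam * X @ (X.T @ alpha)

# Candidate solution from the derivation.
alpha_star = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
grad_norm = np.linalg.norm(dual_gradient(alpha_star))

# L_d is concave in alpha, so a perturbed alpha should give a lower value.
perturbed = alpha_star + 1e-2 * rng.normal(size=n)
```

The gradient vanishes at `alpha_star` up to floating-point error, and since the objective is concave in $\boldsymbol \alpha$ (for $\lambda > 0$), this stationary point is its maximum.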

Kernel Ridge Regression

Kernels are used to calculate the inner product of two vectors in some feature space without even visiting it. We can view a kernel $k$ as $k(\mathbf x_1, \mathbf x_2) = \phi(\mathbf x_1)^T \phi(\mathbf x_2)$, although we don't know what $\phi(\cdot)$ is - we only know it exists. There are many kernels, e.g. RBF, Polynomial, etc.
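
For a few simple kernels the feature map can actually be written down, which makes the idea easy to check by hand. The sketch below uses the homogeneous polynomial kernel of degree 2 on 2-D inputs, $k(\mathbf x, \mathbf z) = (\mathbf x^T \mathbf z)^2$, whose feature map is $\phi(\mathbf x) = (x_1^2, \sqrt 2 \, x_1 x_2, x_2^2)$. This particular kernel and the function names are chosen for illustration; for the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel side is computable.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x^T z)^2."""
    return (x @ z) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D inputs with phi(x)^T phi(z) = (x^T z)^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_value = poly2_kernel(x, z)                   # computed in the input space
inner = poly2_features(x) @ poly2_features(z)  # computed in the feature space
```

Here $\mathbf x^T \mathbf z = 3 - 2 = 1$, so both `k_value` and `inner` equal 1 (up to rounding), but the kernel gets there without ever building $\phi$.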

We can use kernels to make our Ridge Regression non-linear. Suppose we have a kernel $k(\mathbf x_1, \mathbf x_2) = \phi(\mathbf x_1)^T \phi(\mathbf x_2)$ . Let $\Phi(X)$ be a matrix where each row is $\phi(\mathbf x_i)$ , i.e. $\Phi(X) = \begin{bmatrix} — \phi(\mathbf x_1) \,— \\ — \phi(\mathbf x_2) \,— \\ \vdots \\ — \phi(\mathbf x_n) \,— \end{bmatrix}$

Now we can just take the solution for Ridge Regression and replace every $X$ with $\Phi(X)$ : $\mathbf w = \Phi(X)^T \, (\Phi(X) \Phi(X)^T + \lambda \, I)^{-1} \mathbf y$ . For a new unseen data point $\mathbf x$ we predict its target value $\hat y$ as $\hat y= \mathbf \phi(\mathbf x)^T \Phi(X)^T \, (\Phi(X) \Phi(X)^T + \lambda \, I)^{-1} \mathbf y$ .

First, we can replace $\Phi(X) \Phi(X)^T$ by a matrix $K$, calculated as $(K)_{ij} = k(\mathbf x_i, \mathbf x_j)$. Then, $\phi(\mathbf x)^T \Phi(X)^T$ is a row vector whose $i$-th entry is $\phi(\mathbf x)^T \phi(\mathbf x_i) = k(\mathbf x, \mathbf x_i)$. So here we managed to express every dot product of the problem in terms of kernels.

Finally, by letting $\boldsymbol \alpha = (K + \lambda \, I)^{-1} \mathbf y$ (as previously), we obtain $\hat y = \sum\limits_{i = 1}^n \alpha_i k(\mathbf x, \mathbf x_i)$.
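
Putting the pieces together, here is a minimal sketch of Kernel Ridge Regression with an RBF kernel, $k(\mathbf a, \mathbf b) = \exp(-\gamma \| \mathbf a - \mathbf b \|^2)$. The function names (`rbf_kernel`, `krr_fit`, `krr_predict`), the noisy sine data and the values of $\lambda$ and $\gamma$ are all assumptions for illustration, not a reference implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Matrix of k(a, b) = exp(-gamma * ||a - b||^2) for rows a of A, b of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def krr_fit(X, y, lam, gamma):
    """Dual coefficients alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma):
    """y_hat(x) = sum_i alpha_i k(x, x_i) for each row x of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(4)

# Noisy samples of a sine curve (illustrative, clearly non-linear target).
X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

alpha = krr_fit(X, y, lam=0.1, gamma=1.0)
X_test = np.linspace(-2.5, 2.5, 5)[:, None]
y_pred = krr_predict(X, alpha, X_test, gamma=1.0)
```

Note that fitting only ever touches the $n \times n$ matrix $K$, and prediction only needs kernel evaluations between the new point and the training points; $\phi$ never appears.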

References

Machine Learning I class at TU Berlin
Elements of Statistical Learning, http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://0agr.ru/wiki/index.php/Normal_Equation
http://stat.wikia.com/wiki/Kernel_Ridge_Regression
http://stat.rutgers.edu/home/tzhang/papers/ml02_dual.pdf
http://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf
http://www.cs.nyu.edu/~mohri/mls/lecture_8.pdf

— Alexey Grigorev

I am impressed by the well-organized discussion. However, your early reference to "outliers" confused me. It appears the weights $w$ apply to the variables rather than the cases, so how exactly would ridge regression help make the solution robust to outlying cases, as suggested by the illustration?

— whuber

Excellent answer, Alexey (though I wouldn't call it "simple words")! +1 with no questions asked. You like to write in LaTeX, don't you?

— Aleksandr Blekh

I suspect you might be confusing some basic things here. AFAIK, ridge regression is neither a response to nor a way of coping with "noisy observations." OLS already does that. Ridge regression is a tool used to cope with near-collinearity among regressors. Those phenomena are completely different from noise in the dependent variable.

— whuber

+1 whuber. Alexey you are right it is overfitting -ie too many parameters for the available data - not really noise. [ and add enough dimensions for fixed sample size and 'any' data set becomes collinear]. So a better 2-d picture for RR would be all the points clustered around (0,1) with a single point at (1,0) ['justifying' the slope parameter]. See ESL fig 3.9,page 67 web.stanford.edu/~hastie/local.ftp/Springer/OLD/…. also look at primal cost function: to increase weight by 1 unit, error must decrease by $1/\lambda$ unit

— seanv507

I believe you meant add $\lambda$ to diagonal elements of $X^TX$ not subtract(?) in the ridge regression section. I applied the edit.

— Heteroskedastic Jim