Sırt regresyonunu uygulama:

Ridge Regression'ı bir Python / C modülünde uyguluyorum ve bu "küçük" problemle karşılaştım. Fikir, etkin serbestlik derecelerini aşağı yukarı eşit aralıklarla ( "İstatistiksel Öğrenmenin Öğeleri" sayfa 65'teki çizim gibi ) örneklemek istiyorum, yani örnek:

d f (λ) = \sum_{i = 1}^{p} \frac{d_{i}^{2}}{d_{i}^{2} + λ},

$\mathrm{df}(\lambda)=\sum_{i=1}^{p}\frac{d_i^2}{d_i^2+\lambda},$

d_{i}^{2}

$d_i^2$

X^{T} X

$X^TX$

d f (λ_{max}) \approx 0

$\mathrm{df}(\lambda_{\max})\approx 0$ ila

d f (λ_{min}) = p

$\mathrm{df}(\lambda_{\min})=p$ . İlk sınırı belirlemenin kolay bir yolu,

λ_{max} = \sum_{i}^{p} d_{i}^{2} / c

$\lambda_{\max}=\sum_i^p d_i^2/c$ (

varsayarak)izin vermektir

λ_{max} ≫ d_{i}^{2}

$\lambda_{\max} \gg d_i^2$ ; burada

c

$c$ küçük bir sabittir ve yaklaşık olarak istediğiniz minimum serbestlik derecesini temsil eder örnek (örn.

c = 0.1

$c=0.1$ ). İkinci sınır elbette

λ_{min} = 0

$\lambda_{\min}=0$ .

Başlık anlaşılacağı gibi, daha sonra, bir numune için gereken $\lambda$ den $\lambda_{\min}$ için $\lambda_{\max}$ gibi bazı ölçekte $\mathrm{df}(\lambda)$ içinde, mesela, (yaklaşık olarak) örneklenir $0.1$ den aralıklarla $c$ için $p$ kolay olduğu ... Bunu yapmanın yolu nedir? Bir Newton-Raphson yöntemi kullanarak her için denklemini çözmeyi düşündüm , ancak bu özellikle büyük olduğunda çok fazla yineleme ekleyecektir . Herhangi bir öneri? $\mathrm{df}(\lambda)$ $\lambda$ $p$

ridge-regression

— Néstor
kaynak

Bu fonksiyon

azalan dışbükey bir rasyonel fonksiyonudur . Kökler, özellikle ikili bir ızgara üzerinden seçilirse, bulmak çok hızlı olmalıdır.

λ \geq 0

$\lambda \geq 0$

— kardinal

@cardinal, muhtemelen haklısın. Ancak, mümkünse, bazı "varsayılan" ızgara olup olmadığını bilmek istiyorum. Örneğin, yaparak bir ızgara elde çalıştı

, burada

ve bazı serbestlik dereceleri için oldukça iyi çalıştı, ancak

λ = l o g (s) λ_{m a x} / l o g (s_{m a x})

$\lambda=log(s)\lambda_{max}/log(s_{max})$

s = (1, 2, . . ., s_{m a x})

$s=(1,2,...,s_{max})$

, patladı. Bu bana belki

için ızgara seçmek için bazı düzgün bir yol olduğunu merak etti

d f (λ) \to p

$df(\lambda)\to p$

λ

$\lambda$ soruyorum nedir 'ın,. Bu mevcut değilse, ben de bilmek mutlu olurdu (Newton-Rapson yöntemi mutlu "daha iyi bir yol var" bilerek kodumu mutlu bırakabilirsiniz gibi).

— Néstor

Karşılaştığınız potansiyel zorluklar hakkında daha iyi bir fikir edinmek için,

tipik ve en kötü durum değerleri nelerdir? Özdeğer dağılımı hakkında önceden bildiğiniz bir şey var mı ?

p

$p$

— kardinal

@ kardinal, tipik

uygulama değerleri

için

ila

arasında değişir , ama mümkün olduğunca genel yapmak istiyorum. Özdeğer dağılımı hakkında, pek değil.

, sütunlarında her zaman dik olmayan yordayıcılar içeren bir matristir.

p

$p$

15

$15$

40

$40$

X

$X$

— Néstor

Newton-Raphson tipik olarak kökleri bulur

içinde doğruluğu

için

adımlarında

ve küçük değerler

; neredeyse hiç

adımdan fazla. Daha büyük değerler için, bazen

adıma kadar ihtiyaç duyulur. Her adım

hesaplamaları gerektirdiğinden , toplam hesaplama miktarı önemsizdir. Gerçekten de, iyi bir başlangıç değeri seçilirse adım sayısı

bağlı görünmemektedir (eğer tüm

10^{- 12}

$10^{-12}$

3

$3$

4

$4$

p = 40

$p=40$

d f (λ)

$df(\lambda)$

6

$6$

30

$30$

O (p)

$O(p)$

p

$p$

d_{i}

$d_i$ equal their mean).

— whuber

Yanıtlar:

This is a long answer. So, let's give a short-story version of it here.

There's no nice algebraic solution to this root-finding problem, so we need a numerical algorithm.
The function $\mathrm{df}(\lambda)$ has lots of nice properties. We can harness these to create a specialized version of Newton's method for this problem with guaranteed monotonic convergence to each root.
Even brain-dead R code absent any attempts at optimization can compute a grid of size 100 with $p = 100\,000$ in a few of seconds. Carefully written C code would reduce this by at least 2–3 orders of magnitude.

There are two schemes given below to guarantee monotonic convergence. One uses bounds shown below, which seem to help save a Newton step or two on occasion.

Example: $p = 100\,000$ and a uniform grid for the degrees of freedom of size 100. The eigenvalues are Pareto-distributed, hence highly skewed. Below are tables of the number of Newton steps to find each root.

# Table of Newton iterations per root.
# Without using lower-bound check.
  1  3  4  5  6 
  1 28 65  5  1 
# Table with lower-bound check.
  1  2  3 
  1 14 85

There won't be a closed-form solution for this, in general, but there is a lot of structure present which can be used to produce very effective and safe solutions using standard root-finding methods.

Before digging too deeply into things, let's collect some properties and consequences of the function

d f (λ) = \sum_{i = 1}^{p} \frac{d_{i}^{2}}{d_{i}^{2} + λ} .

$\newcommand{\df}{\mathrm{df}} \df(\lambda) = \sum_{i=1}^p \frac{d_i^2}{d_i^2 + \lambda} \>.$

Property 0: $\df$ is a rational function of $\lambda$ . (This is apparent from the definition.)
Consequence 0: No general algebraic solution will exist for finding the root $\df(\lambda) - y = 0$ . This is because there is an equivalent polynomial root-finding problem of degree $p$ and so if $p$ is not extremely small (i.e., less than five), no general solution will exist. So, we'll need a numerical method.

Property 1: The function $\df$ is convex and decreasing on $\lambda \geq 0$ . (Take derivatives.)
Consequence 1(a): Newton's root-finding algorithm will behave very nicely in this situation. Let $y$ be the desired degrees of freedom and $\lambda_0$ the corresponding root, i.e., $y = \df(\lambda_0)$ . In particular, if we start out with any initial value $\lambda_1 < \lambda_0$ (so, $\df(\lambda_1) > y$ ), then the sequence of Newton-step iterations $\lambda_1,\lambda_2,\ldots$ will converge monotonically to the unique solution $\lambda_0$ .
Consequence 1(b): Furthermore, if we were to start out with $\lambda_1 > \lambda_0$ , then the first step would yield $\lambda_2 \leq \lambda_0$ , from whence it will monotonically increase to the solution by the previous consequence (see caveat below). Intuitively, this last fact follows because if we start to the right of the root, the derivative is "too" shallow due to the convexity of $\df$ and so the first Newton step will take us somewhere to the left of the root. NB Since $\df$ is not in general convex for negative $\lambda$ , this provides a strong reason to prefer starting to the left of the desired root. Otherwise, we need to double check that the Newton step hasn't resulted in a negative value for the estimated root, which may place us somewhere in a nonconvex portion of $\df$ .
Consequence 1(c): Once we've found the root for some $y_1$ and are then searching for the root from some $y_2 < y_1$ , using $\lambda_1$ such that $\df(\lambda_1) = y_1$ as our initial guess guarantees we start to the left of the second root. So, our convergence is guaranteed to be monotonic from there.

Mülkiyet 2 : "Güvenli" başlangıç noktaları vermek için makul sınırlar vardır. Konveksite argümanlarını ve Jensen eşitsizliğini kullanarak, aşağıdaki sınırlara sahibiz

\frac{p}{1 + \frac{λ}{p} \sum d_{i}^{- 2}} \leq d f (λ) \leq \frac{p \sum_{i} d_{i}^{2}}{\sum_{i} d_{i}^{2} + p λ} .

$\frac{p}{1+ \frac{\lambda}{p}\sum d_i^{-2}} \leq \df(\lambda) \leq \frac{p \sum_i d_i^2}{\sum_i d_i^2 + p \lambda} \>.$ Consequence 2: This tells us that the root

λ_{0}

$\lambda_0$ satisfying

d f (λ_{0}) = y

$\df(\lambda_0) = y$ obeys

\begin{matrix} (⋆) & \frac{1}{\frac{1}{p} \sum_{i} d_{i}^{- 2}} (\frac{p - y}{y}) \leq λ_{0} \leq (\frac{1}{p} \sum_{i} d_{i}^{2}) (\frac{p - y}{y}) . \end{matrix}

$\frac{1}{\frac{1}{p}\sum_i d_i^{-2}}\left(\frac{p - y}{y}\right) \leq \lambda_0 \leq \left(\frac{1}{p}\sum_i d_i^2\right) \left(\frac{p - y}{y}\right) \>. \tag{$\star$}$ So, up to a common constant, we've sandwiched the root in between the harmonic and arithmetic means of the

d_{i}^{2}

$d_i^2$ .

This assumes that $d_i > 0$ for all $i$ . If this is not the case, then the same bound holds by considering only the positive $d_i$ and replacing $p$ by the number of positive $d_i$ . NB: Since $\df(0) = p$ assuming all $d_i > 0$ , then $y \in (0,p]$ , whence the bounds are always nontrivial (e.g., the lower bound is always nonnegative).

Here is a plot of a "typical" example of $\df(\lambda)$ with $p = 400$ . We've superimposed a grid of size 10 for the degrees of freedom. These are the horizontal lines in the plot. The vertical green lines correspond to the lower bound in $(\star)$ .

Example dof plot with grid and bounds

An algorithm and some example R code

A very efficient algorithm given a grid of desired degrees of freedom $y_1, \ldots y_n$ in $(0,p]$ is to sort them in decreasing order and then sequentially find the root of each, using the previous root as the starting point for the following one. We can refine this further by checking if each root is greater than the lower bound for the next root, and, if not, we can start the next iteration at the lower bound instead.

Here is some example code in R, with no attempts made to optimize it. As seen below, it is still quite fast even though R is—to put it politely—horrifingly, awfully, terribly slow at loops.

# Newton's step for finding solutions to regularization dof.

dof <- function(lambda, d) { sum(1/(1+lambda / (d[d>0])^2)) }
dof.prime <- function(lambda, d) { -sum(1/(d[d>0]+lambda / d[d>0])^2) }

newton.step <- function(lambda, y, d)
{ lambda - (dof(lambda,d)-y)/dof.prime(lambda,d) }

# Full Newton step; Finds the root of y = dof(lambda, d).
newton <- function(y, d, lambda = NA, tol=1e-10, smart.start=T)
{
    if( is.na(lambda) || smart.start )
        lambda <- max(ifelse(is.na(lambda),0,lambda), (sum(d>0)/y-1)/mean(1/(d[d>0])^2))
    iter <- 0
    yn   <- Inf
    while( abs(y-yn) > tol )
    {
        lambda <- max(0, newton.step(lambda, y, d)) # max = pedantically safe
        yn <- dof(lambda,d)
        iter = iter + 1
    }
    return(list(lambda=lambda, dof=y, iter=iter, err=abs(y-yn)))
}

Below is the final full algorithm which takes a grid of points, and a vector of the $d_i$ (not $d_i^2$ !).

newton.grid <- function(ygrid, d, lambda=NA, tol=1e-10, smart.start=TRUE)
{
    p <- sum(d>0)
    if( any(d < 0) || all(d==0) || any(ygrid > p) 
        || any(ygrid <= 0) || (!is.na(lambda) && lambda < 0) )
        stop("Don't try to fool me. That's not nice. Give me valid inputs, please.")
    ygrid <- sort(ygrid, decreasing=TRUE)
    out    <- data.frame()
    lambda <- NA
    for(y in ygrid)
    {
        out <- rbind(out, newton(y,d,lambda, smart.start=smart.start))
        lambda <- out$lambda[nrow(out)]
    }
    out
}

Sample function call

set.seed(17)
p <- 100000
d <- sqrt(sort(exp(rexp(p, 10)),decr=T))
ygrid <- p*(1:100)/100
# Should take ten seconds or so.
out <- newton.grid(ygrid,d)

— cardinal
kaynak

Favoriting the question so I can refer back to this answer. Thanks for posting this detailed analysis, cardinal.

— Macro

Amazing answer :-), thanks a lot cardinal for the suggestions AND the answer.

— Néstor

In addition, a couple of methods exist that will calculate the complete regularization path efficiently:

The above are all R packages, as you are using Python, scikit-learn contains implementations for ridge, lasso and elastic net.

— sebp
kaynak

The ols function in the R rms package can use numerical optimization to find the optimum penalty using effective AIC. But you have to provide the maximum penalty which is not always easy.

— Frank Harrell

A possible alternative according to the source below seems to be:

The closed form solution: $df(\lambda) = tr(X(X^{\top} X + \lambda I_{p})^{-1}X^{\top})$

Should you be using the normal equation as the solver or computing the variance-covariance estimate, you should already have computed $(X^{\top}X + \lambda I_{p})^{-1}$ . This approach works best if you are estimating the coefficients at the various $\lambda$ .

Source: https://onlinecourses.science.psu.edu/stat857/node/155

— José Bayoán Santiago Calderón
kaynak