Step-by-Step Linear Algebra Computation of Least Squares Regression



As a preamble to a question about linear mixed models in R, and to share as a reference for beginner/intermediate statistics aficionados, I decided to post, as an independent "Q&A style" thread, the step-by-step calculation of the coefficients and predicted values of a simple linear regression.

The example is based on the R built-in dataset mtcars, set up with the miles per gallon consumed by a vehicle as the dependent variable, regressed over the weight of the car (continuous variable) and the number of cylinders as a factor with three levels (4, 6, or 8), without interactions.
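In R notation, the fit in question would be along these lines (a sketch; the detailed calculation is the subject of the answers below):

# The model under discussion: mpg regressed on weight and a three-level
# cylinder factor, with no interaction; the -1 drops the common intercept
# so that each cylinder level gets its own intercept
fit <- lm(mpg ~ wt + as.factor(cyl) - 1, data = mtcars)
coef(fit)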

EDIT: If you are interested in this question, you will find a thoroughly detailed and satisfying answer in this post by Matthew Drury outside CV.


"El ile hesaplama" derken, ne arıyorsun? Parametre tahminleri ve benzeri işlemler için bir dizi nispeten basit adım göstermek oldukça kolaydır (örneğin, Gram-Schmidt ortogonalizasyonu yoluyla veya SWEEP operatörleri tarafından); o (ve diğer pek çok istatistik paketi) QR ayrıştırmasını kullanır (sitedeki bazı yayınlarda ele alınmıştır - QR ayrıştırması ile ilgili bir araştırma bir kaç
mesajın

Yes. I believe that this was very nicely addressed in the answer by M.D. I should probably edit my post, perhaps emphasizing the geometric approach behind my answer - column space, projection matrix...
Antoni Parellada

Yep! @Matthew Drury Do you want me to erase that line in the OP, or update the link?
Antoni Parellada

Not sure if you have this link, but this is closely related, and I really love J.M's answer. stats.stackexchange.com/questions/1829/…
Haitao Du

Answers:



Note: I posted an expanded version of this answer on my website.

Would you kindly consider posting a similar answer with the actual R engine exposed?

Sure! Down the rabbit hole we go.

The first layer is lm, the interface exposed to the R programmer. You can look at its source by simply typing lm at the R console. The majority of it (like the majority of most production-level code) is busy with checking of inputs, setting of object attributes, and throwing of errors; but this line sticks out

lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
                ...)
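One quick sanity check before going deeper: you can reproduce lm's coefficients by calling this workhorse directly. A minimal sketch, building the design matrix by hand with model.matrix and using the model from the question (without the -1, for brevity):

# Sketch: reproduce lm's fit by calling lm.fit with an explicit design matrix
X <- model.matrix(mpg ~ wt + as.factor(cyl), data = mtcars)
fit_direct  <- lm.fit(X, mtcars$mpg)
fit_formula <- lm(mpg ~ wt + as.factor(cyl), data = mtcars)
all.equal(fit_direct$coefficients, coef(fit_formula))  # should be TRUE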

As just demonstrated, lm.fit is another R function that you can call yourself. While lm conveniently works with formulas and data frames, lm.fit wants matrices, so that's one level of abstraction removed. Checking the source for lm.fit, we find more busywork, and the following really interesting line

z <- .Call(C_Cdqrls, x, y, tol, FALSE)

Now we are getting somewhere. .Call is R's way of calling into C code. There is a C function, C_Cdqrls in the R source somewhere, and we need to find it. Here it is.
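(As an aside, you can invoke this entry point from the console as well; a sketch, assuming a version of R recent enough that the symbol is registered in the stats namespace:)

# Sketch: calling the internal C routine that lm.fit uses
# (stats:::C_Cdqrls is unexported; this is for exploration only)
X <- model.matrix(mpg ~ wt + as.factor(cyl), data = mtcars)
z <- .Call(stats:::C_Cdqrls, X, mtcars$mpg, 1e-7, FALSE)
z$coefficients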

Looking at the C function, again, we find mostly bounds checking, error cleanup, and busy work. But this line is different

F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
        REAL(coefficients), REAL(residuals), REAL(effects),
        &rank, INTEGER(pivot), REAL(qraux), work);

So now we are on our third language: R has called C, which is calling into fortran. Here's the fortran code.

The first comment tells it all

c     dqrfit is a subroutine to compute least squares solutions
c     to the system
c
c     (1)               x * b = y

(Interestingly, it looks like the name of this routine was changed at some point, but someone forgot to update the comment.) So we're finally at the point where we can do some linear algebra and actually solve the system of equations. This is the sort of thing fortran is really good at, which explains why we passed through so many layers to get here.

The comment also explains what the code is going to do

c     on return
c
c        x      contains the output array from dqrdc2.
c               namely the qr decomposition of x stored in
c               compact form.

So fortran is going to solve the system by finding the QR decomposition.

The first thing that happens, and by far the most important, is

call dqrdc2(x,n,n,p,tol,k,qraux,jpvt,work)

This calls the fortran function dqrdc2 on our input matrix x. What is this?

c     dqrfit uses the linpack routines dqrdc and dqrsl.

We finally made it to linpack. Linpack is a fortran linear algebra library that has been around since the 70s. Most serious linear algebra eventually finds its way to linpack. In our case, we are using the function dqrdc2

c     dqrdc2 uses householder transformations to compute the qr
c     factorization of an n by p matrix x.
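You can poke at this compact form from the R console; a small sketch (R's qr() with the default LAPACK = FALSE is documented to use this same modified LINPACK routine):

# Sketch: the compact QR object produced by dqrdc2
X <- model.matrix(mpg ~ wt + as.factor(cyl), data = mtcars)
qr_obj <- qr(X)    # list holding $qr, $qraux, $rank, $pivot
Q <- qr.Q(qr_obj)  # explicit (thin) orthogonal factor
R <- qr.R(qr_obj)  # explicit upper triangular factor
all.equal(Q %*% R, X, check.attributes = FALSE)  # TRUE when no pivoting occurs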

Once we have factored $X = QR$, where $Q$ is orthogonal and $R$ is upper triangular, you can solve the linear equations for regression

$$X^t X \beta = X^t y$$

very easily. Indeed

$$X^t X = R^t Q^t Q R = R^t R$$

so the whole system becomes

$$R^t R \beta = R^t Q^t y$$

but $R$ is upper triangular and has the same rank as $X^t X$, so as long as our problem is well posed, it is full rank, and we may as well just solve the reduced system

$$R \beta = Q^t y$$

But here's the awesome thing. $R$ is upper triangular, so the last linear equation here is just $\mathrm{constant} \times \beta_n = \mathrm{constant}$, so solving for $\beta_n$ is trivial. You can then go up the rows, one by one, and substitute in the $\beta$s you already know, each time getting a simple one-variable linear equation to solve. So, once you have $Q$ and $R$, the whole thing collapses to what is called backwards substitution, which is easy. You can read about this in more detail here, where an explicit small example is fully worked out.
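To make this concrete, here is a sketch of that last step in R: form $Q^t y$ from the factors that qr() returns, back substitute through $R$ with backsolve, and compare against lm:

# Sketch: solving R beta = Q^t y by back substitution
X <- model.matrix(mpg ~ wt + as.factor(cyl), data = mtcars)
y <- mtcars$mpg
qr_obj <- qr(X)
beta <- backsolve(qr.R(qr_obj), crossprod(qr.Q(qr_obj), y))
all.equal(as.numeric(beta),
          as.numeric(coef(lm(mpg ~ wt + as.factor(cyl), data = mtcars))))
# should be TRUE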


This was the most fun mathematical / coding short essay one can imagine. I know next to nothing about coding, but your "tour" through the guts of a seemingly innocuous R function was truly eye-opening. Excellent writing! Since "kindly" did the trick... Could you kindly consider this one as a related challenge? :-)
Antoni Parellada

+1 I hadn't seen this before, nice summary. Just to add a little bit of information in case @Antoni isn't familiar with Householder transformations: it's essentially a linear transformation that allows you to zero out one part of the R matrix you're trying to achieve without mucking up parts you've already dealt with (as long as you go through it in the right order), making it ideal for transforming matrices to upper triangular form (Givens rotations do a similar job and are perhaps easier to visualize, but are a little slower). As you build R, you must at the same time construct Q.
Glen_b -Reinstate Monica
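(To illustrate the comment above, here is a sketch of a single Householder reflection in R, zeroing out everything below the first entry of a vector, which is exactly the operation applied column by column to reach upper triangular form:)

# Sketch: one Householder reflection H = I - 2 v v^t / (v^t v),
# chosen so that H %*% x is zero below its first entry
x <- c(3, 1, 2)
v <- x
v[1] <- v[1] + sign(x[1]) * sqrt(sum(x^2))
H <- diag(length(x)) - 2 * tcrossprod(v) / sum(v^2)
round(H %*% x, 10)  # (-sqrt(14), 0, 0): everything below the pivot is zeroed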

Matthew (+1), I suggest you start or end your post with a link to your much more detailed write-up madrury.github.io/jekyll/update/2016/07/20/lm-in-R.html.
amoeba says Reinstate Monica

-1 for chickening out and not going down to machine code.
S. Kolassa - Reinstate Monica

(Sorry, just kidding ;-)
S. Kolassa - Reinstate Monica


The actual step-by-step calculations in R are beautifully described in the answer by Matthew Drury in this same thread. In this answer I want to walk through the process of proving to oneself, with a simple example, that the results in R can be reached following the linear algebra of projections onto the column space and the concept of perpendicular (dot product) errors, illustrated in different posts, and nicely explained by Dr. Strang in Linear Algebra and Its Applications, and readily accessible here.

In order to estimate the coefficients $\beta$ in the regression,

$$\text{mpg} = \text{intercept}_{(cyl=4)} + \beta_1\,\text{weight} + D_1\,\text{intercept}_{(cyl=6)} + D_2\,\text{intercept}_{(cyl=8)} \tag{*}$$

with $D_1$ and $D_2$ representing dummy variables with values $\{0,1\}$, we first would need to include in the design matrix ($X$) the dummy coding for the number of cylinders, as follows:

attach(mtcars)
x1 <- wt

# Dummy coding for the three-level cylinder factor (4, 6, or 8 cylinders)
x2 <- cyl; x2[x2 == 4] <- 1; x2[!x2 == 1] <- 0
x3 <- cyl; x3[x3 == 6] <- 1; x3[!x3 == 1] <- 0
x4 <- cyl; x4[x4 == 8] <- 1; x4[!x4 == 1] <- 0

X <- cbind(x1, x2, x3, x4)
colnames(X) <- c('wt', '4cyl', '6cyl', '8cyl')

head(X)
        wt 4cyl 6cyl 8cyl
[1,] 2.620    0    1    0
[2,] 2.875    0    1    0
[3,] 2.320    1    0    0
[4,] 3.215    0    1    0
[5,] 3.440    0    0    1
[6,] 3.460    0    1    0

If the design matrix had to strictly parallel equation (*) above, where the first intercept corresponds to cars of four cylinders (as in an lm call without a -1), it would require a first column of just ones; but we'll derive the same results without this intercept column.
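For comparison, here is a sketch of that alternative design with the column of ones (reusing x1, x3, x4 from above); it reproduces the default lm parameterization, with four-cylinder cars absorbed into the intercept:

# Design matrix strictly paralleling equation (*): intercept column of ones,
# with the 4-cylinder group as the reference level
X_int <- cbind(1, x1, x3, x4)
colnames(X_int) <- c('(Intercept)', 'wt', '6cyl', '8cyl')
solve(t(X_int) %*% X_int) %*% t(X_int) %*% mpg
# matches coef(lm(mpg ~ wt + as.factor(cyl)))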

Continuing then, to calculate the coefficients ($\beta$) we project the vector of the dependent variable values onto the column space of the vectors constituting the design matrix. The linear algebra is $\text{ProjMatrix} = (X^TX)^{-1}X^T$, which multiplied by the vector of the dependent variable gives $[\text{ProjMatrix}]\,[y] = [\text{RegrCoefs}]$, or $(X^TX)^{-1}X^Ty = \beta$:

X_tr_X_inv <- solve(t(X) %*% X)   # (X^T X)^{-1}
Proj_M <- X_tr_X_inv %*% t(X)     # (X^T X)^{-1} X^T
Proj_M %*% mpg                    # the regression coefficients

          [,1]
wt   -3.205613
4cyl 33.990794
6cyl 29.735212
8cyl 27.919934

Identical to: coef(lm(mpg ~ wt + as.factor(cyl)-1)).

Finally, to calculate the predicted values, we will need the hat matrix, which is defined as $\text{HatMatrix} = X(X^TX)^{-1}X^T$. This is readily calculated as:

HAT <- X %*% X_tr_X_inv %*% t(X)

And the estimated ($\hat{y}$) values as $X(X^TX)^{-1}X^Ty$; in this case: y_hat <- HAT %*% mpg, which gives identical values to:

cyl <- as.factor(cyl); OLS <- lm(mpg ~ wt + cyl); predict(OLS):

y_hat <- as.numeric(y_hat)
predicted <- as.numeric(predict(OLS))
all.equal(y_hat,predicted)
[1] TRUE

In general, in numerical computing, I believe it is best to solve the linear equation instead of computing the inverse matrix. So, I think beta = solve(t(X) %*% X, t(X) %*% y) is in practice more accurate than solve(t(X) %*% X) %*% t(X) %*% y.
Matthew Drury
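(A quick sketch of that comparison on the thread's model; the two agree here, but solving the system is the numerically safer habit:)

# Solving the normal equations vs. forming the inverse explicitly
X <- model.matrix(mpg ~ wt + as.factor(cyl) - 1, data = mtcars)
y <- mtcars$mpg
beta1 <- solve(t(X) %*% X, t(X) %*% y)     # solve the linear system
beta2 <- solve(t(X) %*% X) %*% t(X) %*% y  # explicit inverse
all.equal(beta1, beta2)  # should be TRUE (to numerical tolerance)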

R doesn't do it that way - it uses a QR decomposition. If you are going to describe the algorithm used on a computer, I doubt anyone uses the one you show.
Reinstate Monica - G. Simpson

Not after the algorithm, just trying to understand the linear algebra underpinnings.
Antoni Parellada

@AntoniParellada Even in that case, I still find thinking in terms of linear equations more illuminating in many situations.
Matthew Drury

Given the peripheral relationship of this thread to our site's objectives, while seeing value in illustrating the use of R for important calculations, I would like to suggest that you consider turning it into a contribution to our blog.
whuber
Licensed under cc by-sa 3.0 with attribution required.