Which loss function is correct for logistic regression?


31

I read about two versions of the loss function for logistic regression. Which of them is correct, and why?

  1. From Machine Learning, Zhou Z.H. (in Chinese), with $\beta = (w, b)$ and $\beta^T x = w^T x + b$:

    $$l(\beta) = \sum_{i=1}^{m}\Big(-y_i\beta^T x_i + \ln\big(1+e^{\beta^T x_i}\big)\Big) \tag{1}$$

  2. From my college course, with $z_i = y_i f(x_i) = y_i(w^T x_i + b)$:

    $$L(z_i) = \log\big(1+e^{-z_i}\big) \tag{2}$$


I know that the first one is an accumulation over all samples and the second one is for a single sample, but I am more curious about the difference in the form of the two loss functions. Somehow I have a feeling that they are equivalent.

Answers:


31

The relationship is as follows: $l(\beta) = \sum_i L(z_i)$.

Define the logistic function as $f(z) = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}$. It has the property that $f(-z) = 1-f(z)$. Or in other words:

$$\frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}}.$$

If you take the reciprocal of both sides, then take the log you get:

$$\ln(1+e^{z}) = \ln(1+e^{-z}) + z.$$

Subtract z from both sides and you should see this:

$$-y_i\beta^T x_i + \ln\big(1+e^{y_i\beta^T x_i}\big) = L(z_i).$$
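The log identity used in this step, $\ln(1+e^{z}) - z = \ln(1+e^{-z})$, can be checked numerically. The snippet below is only a quick sanity check, not part of the original answer; the helper name `softplus` and the grid of test values are assumptions chosen for illustration.

```python
import numpy as np

def softplus(z):
    """Numerically stable ln(1 + e^z), computed as log(e^0 + e^z)."""
    return np.logaddexp(0.0, z)

# Hypothetical grid of scores z; any real values would do.
z = np.linspace(-30.0, 30.0, 121)

lhs = softplus(z) - z   # ln(1 + e^z) - z
rhs = softplus(-z)      # ln(1 + e^{-z})

# Largest absolute disagreement between the two sides over the grid.
max_gap = float(np.max(np.abs(lhs - rhs)))
print(max_gap)
```

The gap should be at the level of floating-point rounding, which is consistent with the two sides being the same function of $z$.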

Edit:

At the moment I am re-reading this answer and am confused about how I got $-y_i\beta^T x_i + \ln(1+e^{\beta^T x_i})$ to be equal to $-y_i\beta^T x_i + \ln(1+e^{y_i\beta^T x_i})$. Perhaps there's a typo in the original question.

Edit 2:

In the case that there wasn't a typo in the original question, @ManelMorales appears to be correct to draw attention to the fact that, when $y \in \{-1, 1\}$, the probability mass function can be written as $P(Y_i = y_i) = f(y_i\beta^T x_i)$, due to the property that $f(-z) = 1-f(z)$. I am re-writing it differently here, because he introduces a new equivocation on the notation $z_i$. The rest follows by taking the negative log-likelihood for each $y$ coding. See his answer below for more details.


42

OP mistakenly believes the relationship between these two functions is due to the number of samples (i.e. single vs all). However, the actual difference is simply how we select our training labels.

In the case of binary classification we may assign the labels y=±1 or y=0,1.

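As it has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z) = 1-\sigma(z)$ and $\sigma(z) \in (0,1)$ as $z \to \pm\infty$. If we pick the labels $y = 0, 1$ we may assign

$$P(y=1 \mid z) = \sigma(z) = \frac{1}{1+e^{-z}}, \qquad P(y=0 \mid z) = 1-\sigma(z) = \frac{1}{1+e^{z}}$$

which can be written more compactly as $P(y \mid z) = \sigma(z)^{y}\,(1-\sigma(z))^{1-y}$.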

It is easier to maximize the log-likelihood. Maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For m samples {xi,yi}, after taking the natural logarithm and some simplification, we will find out:

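$$l(z) = -\log\Big(\prod_{i}^{m} P(y_i \mid z_i)\Big) = -\sum_{i}^{m} \log\big(P(y_i \mid z_i)\big) = \sum_{i}^{m} \Big(-y_i z_i + \log\big(1+e^{z_i}\big)\Big)$$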
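The "some simplification" step can be spelled out for a single sample. This is a sketch under the assumptions already stated in this answer (labels $y \in \{0, 1\}$ and $\sigma(z) = 1/(1+e^{-z})$); it uses the two facts $\log\sigma(z) = z - \log(1+e^{z})$ and $\log(1-\sigma(z)) = -\log(1+e^{z})$.

```latex
\begin{aligned}
-\log P(y \mid z)
  &= -\log\Big(\sigma(z)^{y}\,\big(1-\sigma(z)\big)^{1-y}\Big) \\
  &= -y\,\log\sigma(z) - (1-y)\,\log\big(1-\sigma(z)\big) \\
  &= -y\,\big(z - \log(1+e^{z})\big) + (1-y)\,\log(1+e^{z}) \\
  &= -y\,z + \log(1+e^{z}).
\end{aligned}
```

Summing the last line over the $m$ samples gives the per-sample terms of $l(z)$.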

Full derivation and additional information can be found on this jupyter notebook. On the other hand, we may have instead used the labels y=±1. It is pretty obvious then that we can assign

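$$P(y \mid z) = \sigma(yz).$$

It is also obvious that $P(y=0 \mid z) = P(y=-1 \mid z) = \sigma(-z)$. Following the same steps as before we minimize in this case the loss function

$$L(z) = -\log\Big(\prod_{j}^{m} P(y_j \mid z_j)\Big) = -\sum_{j}^{m} \log\big(P(y_j \mid z_j)\big) = \sum_{j}^{m} \log\big(1+e^{-y_j z_j}\big)$$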

Where the last step follows after we take the reciprocal which is induced by the negative sign. While we should not equate these two forms, given that in each form y takes different values, nevertheless these two are equivalent:

$$-y_i z_i + \log\big(1+e^{z_i}\big) \equiv \log\big(1+e^{-y_i z_i}\big)$$

The case $y_i = 1$ is trivial to show. If $y_i \neq 1$, then $y_i = 0$ on the left hand side and $y_i = -1$ on the right hand side.
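The equivalence of the two per-sample forms can also be checked numerically. The sketch below assumes the label recoding $y^{\pm} = 2y^{01} - 1$ (so $0 \mapsto -1$ and $1 \mapsto +1$); the random scores and all variable names are hypothetical and only serve as test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores z_i = w^T x_i + b and labels coded as {0, 1}.
z = rng.normal(size=1000)
y01 = rng.integers(0, 2, size=1000)

# The same labels recoded to {-1, +1}.
ypm = 2 * y01 - 1

# Form with y in {0, 1}:  -y z + log(1 + e^z)
loss_01 = -y01 * z + np.logaddexp(0.0, z)

# Form with y in {-1, +1}:  log(1 + e^{-y z})
loss_pm = np.logaddexp(0.0, -ypm * z)

# Largest per-sample disagreement between the two forms.
max_gap = float(np.max(np.abs(loss_01 - loss_pm)))
print(max_gap)
```

The per-sample losses agree up to rounding, so the two formulations differ only in how the labels are coded.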

While there may be fundamental reasons as to why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is for practical considerations. In the former we can use the property $\partial\sigma(z)/\partial z = \sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2 l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
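As an illustration of that last point, here is a minimal sketch of the gradient and Hessian of the $y \in \{0, 1\}$ form with respect to the weights. It assumes $z = Xw$ with the bias absorbed into $X$ as a constant column, and uses the standard results $\nabla l = X^T(\sigma(Xw) - y)$ and $\nabla^2 l = X^T \operatorname{diag}\big(\sigma(1-\sigma)\big) X$. The function names and the random data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Sum over samples of -y z + log(1 + e^z), with z = X w."""
    z = X @ w
    return np.sum(-y * z + np.logaddexp(0.0, z))

def grad(w, X, y):
    """Gradient X^T (sigma(z) - y)."""
    return X.T @ (sigmoid(X @ w) - y)

def hessian(w, X):
    """Hessian X^T diag(sigma(z) (1 - sigma(z))) X."""
    s = sigmoid(X @ w)
    return X.T @ (X * (s * (1.0 - s))[:, None])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # hypothetical features
y = rng.integers(0, 2, size=50).astype(float)  # labels in {0, 1}
w = rng.normal(size=3)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
num_grad = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
grad_gap = float(np.max(np.abs(num_grad - grad(w, X, y))))

# A non-negative smallest eigenvalue is consistent with convexity.
min_eig = float(np.linalg.eigvalsh(hessian(w, X)).min())
print(grad_gap, min_eig)
```

Because every weight $\sigma(1-\sigma)$ on the diagonal is positive, the Hessian is positive semidefinite for any $w$, which is the convexity property mentioned above.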


Is logistic loss function convex?
user85361

2
Log reg $l(z)$ IS convex, but not $\alpha$-convex. Thus we can't place a bound on how long gradient descent takes to converge. We can adjust the form of $l$ to make it strongly convex by adding a regularization term: with positive constant $\lambda$ define our new function to be $l'(z) = l(z) + \lambda z^2$ s.t. $l'(z)$ is $\lambda$-strongly convex and we can now prove the convergence bound of $l'$. Unfortunately, we are now minimizing a different function! Luckily, we can show that the value of the optimum of the regularized function is close to the value of the optimum of the original.
Manuel Morales

The notebook you referred has gone, I got another proof: statlect.com/fundamentals-of-statistics/…
Domi.Zhang

2
I found this to be the most helpful answer.
mohit6up

@ManuelMorales Do you have a link to the regularized function's optimum value being close to the original?
Mark

19

I learned the loss function for logistic regression as follows.

Logistic regression performs binary classification, and so the label outputs are binary, 0 or 1. Let P(y=1|x) be the probability that the binary output y is 1 given the input feature vector x. The coefficients w are the weights that the algorithm is trying to learn.

$$P(y=1 \mid x) = \frac{1}{1+e^{-w^T x}}$$

Because logistic regression is binary, the probability P(y=0|x) is simply 1 minus the term above.

$$P(y=0 \mid x) = 1 - \frac{1}{1+e^{-w^T x}}$$

The loss function J(w) is the sum of (A) the output y=1 multiplied by P(y=1) and (B) the output y=0 multiplied by P(y=0) for one training example, summed over m training examples.

$$J(w) = \sum_{i=1}^{m} \Big[\, y^{(i)} \log P(y=1) + \big(1-y^{(i)}\big) \log P(y=0) \Big]$$

where $y^{(i)}$ indicates the $i$-th label in your training data. If a training instance has a label of 1, then $y^{(i)} = 1$, leaving the left summand in place but making the right summand with $1-y^{(i)}$ become 0. On the other hand, if a training instance has $y = 0$, then the right summand with the term $1-y^{(i)}$ remains in place, but the left summand becomes 0. Log probability is used for ease of calculation.

If we then replace P(y=1) and P(y=0) with the earlier expressions, then we get:

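$$J(w) = \sum_{i=1}^{m} \Big[\, y^{(i)} \log\Big(\frac{1}{1+e^{-w^T x}}\Big) + \big(1-y^{(i)}\big) \log\Big(1 - \frac{1}{1+e^{-w^T x}}\Big) \Big]$$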

You can read more about this form in these Stanford lecture notes.
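The expression for $J(w)$ can be evaluated directly. The sketch below is only an illustration under stated assumptions: the function name `log_likelihood`, the tiny data set, and the choice $w = 0$ are all hypothetical, and each row of `X` plays the role of one $x$. Note that $J(w)$ as written is a sum of log-probabilities, so a quantity to minimize would be its negative.

```python
import numpy as np

def log_likelihood(w, X, y):
    """J(w): sum of y*log P(y=1|x) + (1-y)*log P(y=0|x) over the rows of X."""
    p1 = 1.0 / (1.0 + np.exp(-(X @ w)))  # P(y=1 | x)
    p0 = 1.0 - p1                        # P(y=0 | x)
    return np.sum(y * np.log(p1) + (1.0 - y) * np.log(p0))

# Hypothetical toy data: three samples, two features.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])

# With w = 0 every probability is 0.5, so J(w) = 3 * log(0.5).
w = np.zeros(2)
ll = float(log_likelihood(w, X, y))
neg_ll = -ll  # the corresponding loss to minimize
print(ll, neg_ll)
```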


This answer also provides some relevant perspective here.
GeoMatt22

6
The expression you have is not a loss (to be minimized), but rather a log-likelihood (to be maximized).
xenocyon

2
@xenocyon true - this same formulation is typically written with a negative sign applied to the full summation.
Alex Klibisz

1

Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.

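$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}\big(h_\theta(x^{(i)}),\, y^{(i)}\big)$$

$$\mathrm{Cost}\big(h_\theta(x), y\big) = -\log\big(h_\theta(x)\big) \quad \text{if } y = 1$$

$$\mathrm{Cost}\big(h_\theta(x), y\big) = -\log\big(1-h_\theta(x)\big) \quad \text{if } y = 0$$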

When we put them together we have:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \Big[\, y^{(i)} \log\big(h_\theta(x^{(i)})\big) + \big(1-y^{(i)}\big) \log\big(1-h_\theta(x^{(i)})\big) \Big]$$

Multiplying by $y$ and $(1-y)$ in the above equation is a sneaky trick that lets us use the same equation to solve for both the $y=1$ and $y=0$ cases. If $y=0$, the first term cancels out. If $y=1$, the second term cancels out. In both cases we only perform the operation we need to perform.

If you don't want to use a for loop, you can try a vectorized form of the equation above

$$h = g(X\theta)$$

$$J(\theta) = \frac{1}{m}\cdot\Big(-y^T \log(h) - (1-y)^T \log(1-h)\Big)$$
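A minimal NumPy sketch of both versions, assuming $g$ is the sigmoid and that `X` holds one sample per row. The function names `cost_loop` and `cost_vectorized` and the random test data are hypothetical, introduced only to compare the loop with the vectorized form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_loop(theta, X, y):
    """Cross-entropy cost computed sample by sample with a for loop."""
    m = len(y)
    total = 0.0
    for i in range(m):
        h_i = sigmoid(X[i] @ theta)
        total += y[i] * np.log(h_i) + (1 - y[i]) * np.log(1 - h_i)
    return -total / m

def cost_vectorized(theta, X, y):
    """Same cost as (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h))."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-(y @ np.log(h)) - (1 - y) @ np.log(1 - h)) / m

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                   # hypothetical features
y = rng.integers(0, 2, size=20).astype(float)  # labels in {0, 1}
theta = rng.normal(size=3)

c_loop = float(cost_loop(theta, X, y))
c_vec = float(cost_vectorized(theta, X, y))
print(c_loop, c_vec)
```

Both functions compute the same number; the vectorized one simply replaces the explicit sum with two dot products.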

The entire explanation can be viewed on Machine Learning Cheatsheet.

Licensed under cc by-sa 3.0 with attribution required.