Logistic Regression and Softmax Algorithm

by allenlu2007


First, the bias trick: the bias term can be folded into the weight vector, removing b as a separate parameter.


This simplifies things as follows: W' = [W, b],  xi' = [xi, 1]'
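A minimal NumPy sketch of the bias trick (the particular numbers are made up for illustration):

```python
import numpy as np

# Append b to w and 1 to x, so that w'x + b becomes a single dot product.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([3.0, 4.0])

w_aug = np.append(w, b)    # w' = [w, b]
x_aug = np.append(x, 1.0)  # x' = [x, 1]

score = w_aug @ x_aug      # equals w @ x + b
```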



MLE Model for Logistic Regression

As noted in the previous post, the loss function of logistic regression is in fact the maximum likelihood estimator.


Note that here we use y = -1/+1 instead of 0/+1.

The loss function is defined as log(Prob).  (Note: for linear regression, log(prob) = – (y – w'x)^2, i.e. least squares; dl/dw = 0 has a closed form.)
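To make the closed-form contrast concrete, here is a sketch of the normal equations for linear regression (the toy data is made up):

```python
import numpy as np

# dl/dw = 0 for least squares gives the normal equations: (X'X) w = X'y.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])     # one feature plus a bias column
y = np.array([2.0, 4.0, 6.0])  # exactly y = 2*x with zero intercept

w = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form solution
```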



The gradient of the log-likelihood is:


If y = -1,   g = -sigma(w'xi) * xi

If y = +1,   g = (1 - sigma(w'xi)) * xi

Logistic regression has no closed-form solution; only a numerical solution is possible.
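A minimal numerical sketch, assuming y in {-1, +1} and plain gradient ascent on the log-likelihood (function names and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Gradient ascent on the log-likelihood, labels y in {-1, +1}.

    No closed form exists, so we iterate. X is m x n with a bias
    column already appended (bias trick)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # per-sample gradient: (1 - sigma(y_i * w'x_i)) * y_i * x_i
        margins = y * (X @ w)
        w += lr * X.T @ ((1.0 - sigmoid(margins)) * y) / len(y)
    return w

# toy 1-D data with a bias column
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = fit_logistic(X, y)
preds = np.sign(X @ w)
```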



So l(w) is a convex function!


If instead y = 0 or +1:




If y(i) = 0  —>   h(x(i))*x(i)

If y(i) = 1  —>  [h(x(i)) - 1]*x(i)

J differs from the L above by a minus sign.
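The two cases collapse into the single expression (h(x) - y)*x; a quick sketch with made-up numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With labels y in {0, 1}, the per-sample gradient of J is (h(x) - y) * x,
# where h(x) = sigma(w'x). Both cases above are instances of this form.
w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
h = sigmoid(w @ x)

g_y0 = h * x            # case y = 0
g_y1 = (h - 1.0) * x    # case y = 1
```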


MAP for Logistic Regression (Regularization)

The above is MLE: arg max P(y | w, X). How about P(y | w, X) p(w)?  (Both w and X are merely parameters, not random variables.)

But if we adopt the Bayesian view and give w a prior distribution (X remains a parameter),

P(w | y, X) ∝ P(y | w, X) * P(w), so log P(w | y, X) = log P(y | w, X) + log P(w) + const = the original loss function + log P(w)

Assuming w has a Gaussian distribution, log P(w) is essentially the L2 norm of w. In other words, the new loss function

= the original loss function + regularization.

The probabilistic approach thus unifies the plain loss function and the regularized loss function.
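A sketch of the MAP objective under the Gaussian-prior assumption (here `lam` stands in for 1/(2*sigma^2); the names and data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior(w, X, y, lam):
    # MLE part: negative log-likelihood, labels y in {-1, +1}
    nll = -np.sum(np.log(sigmoid(y * (X @ w))))
    # prior part: a Gaussian prior on w gives an L2 penalty (up to a constant)
    return nll + lam * np.dot(w, w)

X = np.array([[-1.0, 1.0],
              [ 1.0, 1.0]])
y = np.array([-1.0, 1.0])
w = np.array([1.0, 0.0])

unregularized = neg_log_posterior(w, X, y, lam=0.0)
regularized = neg_log_posterior(w, X, y, lam=1.0)  # adds lam * ||w||^2
```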


Multi-class Classification and Softmax Algorithm

How do we generalize logistic regression to more than two outcomes?






At first glance, multi-class classification seems hard to cast as linear or logistic regression.

Because these outcomes have no numerical relationship to one another, we need probability to handle them.

Clearly a single straight line cannot separate the different classes. The most direct approach is to run n separate logistic regressions (n being the number of classes), as in the figure above. The dimension of the parameter space is then n times the original (theta1, theta2, theta3, etc.).

Wrong way: (one w to fit xa, xb, xc, when in truth there is only one x; assuming x is nx1, w is also nx1)

P(y in class 1) = 1/(1+exp(-w. xa))

P(y in class 2) = 1/(1+exp(-w. xb))

P(y in class 3) = 1/(1+exp(-w. xc))


Right way I (as in the figure above): use w(k) to fit one class of x.  x is nx1 -> w(1), w(2), w(3) are also nx1 -> W is nxk.

The biggest issue is the need to relabel the data for each of the separate classifications.

P(y in class 1) = 1/(1+exp(-w(1). x))

P(y in class 2) = 1/(1+exp(-w(2). x))

P(y in class 3) = 1/(1+exp(-w(3). x))

The probabilities do not sum to 1!  So we need to renormalize.
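A quick numerical illustration of the problem (the weights are chosen arbitrarily):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three independent per-class sigmoids: their outputs need not sum to 1,
# which is exactly why renormalization is needed.
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])   # n=2 features, k=3 classes (columns)
x = np.array([2.0, 1.0])

p = sigmoid(W.T @ x)   # one "probability" per class
total = p.sum()        # generally != 1
```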


Right way II, which simplifies the math further:

P(y in class 1) = exp(w(1). x) /sum[exp(w(k). x)]

P(y in class 2) = exp(w(2). x) /sum[exp(w(k). x)]

P(y in class 3) = exp(w(3). x) /sum[exp(w(k). x)]

The probabilities do sum to 1 after normalization.
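A sketch of the normalized version (the max-shift is a standard numerical-stability trick, not part of the math above; weights are illustrative):

```python
import numpy as np

def softmax_probs(W, x):
    """Softmax over k classes: P(y = j | x) = exp(w(j). x) / sum_k exp(w(k). x).

    W is n x k (one column per class), x is n x 1. Subtracting the max
    score before exponentiating cancels in the ratio but avoids overflow."""
    scores = W.T @ x
    scores = scores - scores.max()  # stability shift
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])   # n=2 features, k=3 classes
x = np.array([2.0, 1.0])
p = softmax_probs(W, x)            # sums to 1 by construction
```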


How does softmax reduce to logistic regression?  Take k = 2 (classes 1 and 2):


P(y in class 1) = exp(w(1). x) / [exp(w(1). x) + exp(w(2). x)] = 1 / [1 + exp(-w. x)]

P(y in class 2) = exp(w(2). x) / [exp(w(1). x) + exp(w(2). x)] = exp(-w. x) / [1 + exp(-w. x)]

where w = w(1) - w(2), so w is still nx1.   We don’t need to define W = [w(1), w(2)] (nx2) in this case.
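A numerical check of the k = 2 reduction, with illustrative values for w(1), w(2), and x:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Softmax over two classes equals a single sigmoid with w = w(1) - w(2).
w1 = np.array([1.0, -0.5])
w2 = np.array([0.2, 0.3])
x = np.array([1.5, 2.0])

e1, e2 = np.exp(w1 @ x), np.exp(w2 @ x)
p1_softmax = e1 / (e1 + e2)          # softmax probability of class 1

w = w1 - w2                          # still n x 1
p1_sigmoid = sigmoid(w @ x)          # 1 / (1 + exp(-w. x))
```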