### Logistic Regression and Softmax Algorithm

#### by allenlu2007

First, the bias trick: the explicit bias term can be absorbed into the weights.

The model simplifies as follows: W' = [W, b], xi' = [xi, 1]'.
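A minimal NumPy sketch of this bias trick (the array values here are illustrative assumptions):

```python
import numpy as np

# Bias trick: append a constant 1 to each input so the bias b
# is absorbed into the weight vector.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # two samples, two features (illustrative)
W = np.array([0.5, -0.5])        # original weights
b = 0.1                          # original bias

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # xi' = [xi, 1]
W_aug = np.append(W, b)                           # W' = [W, b]

# W'.xi' reproduces W.xi + b exactly
assert np.allclose(X_aug @ W_aug, X @ W + b)
```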

### MLE Model for Logistic Regression

As mentioned in the previous post, the loss function of logistic regression is in fact the maximum likelihood estimator.

Note that y = -1/+1 is used here instead of 0/+1.

The loss function is defined as log(Prob). (Note: for linear regression, log(prob) = -(y - w'x)^2, which is exactly least squares; dl/dw = 0 has a closed-form solution.)
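The closed-form claim for linear regression can be checked numerically: setting dl/dw = 0 for the squared loss gives the normal equations w = (X'X)^(-1) X'y. A minimal sketch with synthetic data (all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.5])      # illustrative "true" weights
y = X @ w_true + 0.01 * rng.normal(size=100)  # targets with small noise

# Normal equations: solve (X'X) w = X'y in closed form
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
# w_hat recovers w_true up to the small noise
```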

Therefore, the gradient of the log-likelihood is:

**If y = -1, g = -sigma(w'xi) * xi**

**If y = +1, g = (1 - sigma(w'xi)) * xi**

**Logistic regression has no closed-form solution; only a numerical solution is possible.**

Fortunately, the negative log-likelihood l(w) is a convex function, so numerical optimization converges to the global optimum!
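Since there is no closed form, here is a numerical sketch: gradient ascent on the log-likelihood with y in {-1, +1}, using the per-sample gradient y * sigma(-y * w'x) * x, which equals the two cases above. The synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])           # illustrative "true" weights
# labels in {-1, +1}, sampled from the logistic model
y = np.where(rng.random(200) < sigmoid(X @ w_true), 1.0, -1.0)

w = np.zeros(2)
lr = 0.5
for _ in range(1000):
    # per-sample gradient y_i * sigma(-y_i * w'x_i) * x_i, averaged;
    # equals (1 - sigma(w'x)) x for y = +1 and -sigma(w'x) x for y = -1
    g = (y * sigmoid(-y * (X @ w))) @ X / len(y)
    w += lr * g   # ascend the log-likelihood
```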

### If y = 0 or +1

**If y(i) = 0 → g = h(x(i)) * x(i)**

**If y(i) = 1 → g = [h(x(i)) - 1] * x(i)**

J differs from the L above by exactly a minus sign.
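A quick numerical check that the 0/+1 gradients above are exactly the y = ±1 gradients with a minus sign (the weight and input values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.3, -0.7])    # illustrative weights
x = np.array([1.0, 2.0])     # illustrative input
h = sigmoid(w @ x)

# Gradient of J (negative log-likelihood) per sample: (h(x) - y) * x
g_y0 = (h - 0.0) * x         # y = 0  ->  h(x) * x
g_y1 = (h - 1.0) * x         # y = 1  ->  [h(x) - 1] * x

# The y = ±1 formulation gives the log-likelihood gradient L:
g_plus  = (1 - sigmoid(w @ x)) * x   # y = +1
g_minus = -sigmoid(w @ x) * x        # y = -1

# J differs from L by exactly a minus sign
assert np.allclose(g_y1, -g_plus)
assert np.allclose(g_y0, -g_minus)
```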

### MAP for Logistic Regression (Regularization)

The above is MLE: arg max P(y | w, X). What about maximizing P(y | w, X) p(w)? (In MLE, both w and X are merely parameters, not random variables.)

But if we adopt the Bayesian view and give w a prior distribution (X remains a parameter), then:

P(w | y, X) ∝ P(y | w, X) * P(w), so log P(w | y, X) = log P(y | w, X) + log P(w) + const = the original loss function + log P(w).

Assuming w has a Gaussian distribution, log P(w) is essentially the (negative) L2 norm of w. In other words, the new loss function

= the original loss function + regularization.

The probabilistic approach thus unifies the plain loss function and the regularized loss function in a single framework.
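A sketch of the MAP gradient with y in {0, 1}: the Gaussian prior contributes a lam * w term on top of the MLE gradient X'(h - y). The synthetic data, learning rate, and lam value are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_nll_l2(w, X, y, lam):
    # gradient of negative log-likelihood (y in {0,1}) plus the
    # Gaussian-prior (L2) term: X'(h - y) + lam * w
    return X.T @ (sigmoid(X @ w) - y) + lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (rng.random(200) < sigmoid(X @ np.array([3.0, -3.0]))).astype(float)

def fit(lam, steps=1000, lr=0.5):
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * grad_nll_l2(w, X, y, lam) / len(y)
    return w

w_mle = fit(lam=0.0)    # plain MLE
w_map = fit(lam=50.0)   # MAP: strong Gaussian prior shrinks w toward 0
```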

### Multi-class Classification and Softmax Algorithm

How do we generalize logistic regression to more than two outcomes?

At first glance, multi-class classification seems hard to cast as linear or logistic regression.

Because the outcomes have no inherent numerical relationship, they must be handled with probabilities.

Clearly a single straight line cannot separate the different classes. The most direct approach is to run n separate logistic regressions (n = the number of classes), as in the figure above. The dimension of the parameter space then becomes n times the original (theta1, theta2, theta3, etc.).

Wrong way: (one w to fit xa, xb, xc, when in truth there is only one x; assuming x is nx1, w is also nx1)

P(y in class 1) = 1/(1 + exp(w·xa))

P(y in class 2) = 1/(1 + exp(w·xb))

P(y in class 3) = 1/(1 + exp(w·xc))

Right way I: (as in the figure above) use w(k) to fit one class of x. x is nx1 → w(1), w(2), w(3) are each nx1 → W is nxk.

The biggest issue is that the data must be relabeled (one-vs-rest) for each classifier.

P(y in class 1) = 1/(1 + exp(-w(1)·x))

P(y in class 2) = 1/(1 + exp(-w(2)·x))

P(y in class 3) = 1/(1 + exp(-w(3)·x))

The probabilities do not sum to 1! Renormalization is therefore needed.

Right way II further simplifies the math:

P(y in class 1) = exp(w(1)·x) / sum_k[exp(w(k)·x)]

P(y in class 2) = exp(w(2)·x) / sum_k[exp(w(k)·x)]

P(y in class 3) = exp(w(3)·x) / sum_k[exp(w(k)·x)]

The probabilities do sum to 1 after this normalization.
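A minimal softmax sketch confirming the normalization (the weights are illustrative; subtracting the max before exponentiating is a standard numerical-stability trick not discussed above):

```python
import numpy as np

def softmax_probs(W, x):
    # W: k x n matrix whose rows are w(1)..w(k); x: n-vector
    z = W @ x
    z = z - z.max()          # stability: shift does not change the ratios
    e = np.exp(z)
    return e / e.sum()       # P(y in class k) = exp(w(k)·x) / sum exp(w(j)·x)

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 1.0]])  # three classes, illustrative weights
x = np.array([0.5, -0.5])
p = softmax_probs(W, x)
assert np.isclose(p.sum(), 1.0)   # probabilities sum to 1
```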

How does softmax reduce to logistic regression? Take k = 1 and 2:

P(y in class 1) = exp(w(1)·x) / [exp(w(1)·x) + exp(w(2)·x)] = 1 / [1 + exp(-w·x)]

P(y in class 2) = exp(w(2)·x) / [exp(w(1)·x) + exp(w(2)·x)] = exp(-w·x) / [1 + exp(-w·x)]

where w = w(1) - w(2), so w is still nx1. We don't need to define W = [w(1), w(2)] (nx2) in this case.
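A quick numerical check of this reduction, with illustrative weights: the two-class softmax probability equals the sigmoid of (w(1) - w(2))·x.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1 = np.array([0.8, -0.3])   # illustrative class-1 weights
w2 = np.array([-0.2, 0.5])   # illustrative class-2 weights
x = np.array([1.0, 2.0])

# two-class softmax
p1 = np.exp(w1 @ x) / (np.exp(w1 @ x) + np.exp(w2 @ x))

# logistic regression with w = w(1) - w(2)
w = w1 - w2
assert np.isclose(p1, sigmoid(w @ x))
```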