Perceptron Learning Algorithm and Stochastic Gradient Descent

by allenlu2007

Please refer to 林軒田’s lecture.


Think this through carefully: the following captures the essence of machine learning!

Step 1: Given a hypothesis (linear hypothesis, logistic hypothesis, etc.), how do we optimize w (=> Algorithm A)?

==> w is essentially a linear combination of the (xi, yi):  w = sum( alpha_i yi xi ) for classification, and w = sum( alpha_i error(i) xi ) for linear regression,

where error(i) = y(i) - f(x(i)) = y(i) - w' x(i)


Here we can introduce the kernel idea:  x -> phi(x), w -> w_new

w_new = sum( alpha_i yi phi(xi) )


Step 2: How do we reassemble the hypothesis function, H or f(x)?

==> The simplest case is f(x) = w'x = sum( beta_i yi xi' x )

With a kernel:

==> f(x) = sum( beta_i yi phi(xi)' phi(x) ).  phi(xi)' phi(x) is a measure of similarity!!  The linear case is just the inner product; a Gaussian kernel (RBF) also works.
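To make the kernel form concrete, here is a small illustrative sketch (the function names and toy data are mine, not from the lecture): f(x) is computed purely from stored points, labels, and weights beta, with the kernel as a pluggable similarity measure.

```python
import math

def linear_kernel(xi, x):
    # Plain inner product: similarity in the original feature space.
    return sum(a * b for a, b in zip(xi, x))

def rbf_kernel(xi, x, gamma=1.0):
    # Gaussian (RBF) kernel: similarity decays with squared distance.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, x)))

def f(x, points, labels, beta, kernel=linear_kernel):
    # f(x) = sum_i beta_i * y_i * K(x_i, x) -- the kernel form of w'x
    return sum(b * yi * kernel(xi, x)
               for b, yi, xi in zip(beta, labels, points))
```

Swapping `linear_kernel` for `rbf_kernel` changes the similarity measure without touching the rest of the prediction code.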


The key is beta (the weighting), which is determined from w or w_new during learning. On-line learning can then modify it dynamically.

If beta(i) = 0, the point is not a support vector (i.e. unimportant!), so SVM is a special case.

SVM, as well as the L1 error, induces sparsity ==> this saves computation effort at testing and cross-validation time!!!!


For linear regression:

f(x) = w' x = sum( beta (yi - w' xi) xi' x ) = sum( beta yi xi' x - beta (w' xi) xi' x )  ==>  Need more derivation to confirm the linear regression case!!







I like the perceptron learning algorithm (PLA) idea.

It’s very geometric and intuitive!!!!

It also lets us compare PLA, logistic regression, and SVM!!!


Gradient descent, especially when applied to logistic regression, can be regarded as a soft PLA!

Actually, not gradient descent, but stochastic gradient descent.


General form:


w(n+1) = w(n) + (error) * (yn zn)


w_pla = sum( beta_n (yn zn) ),  n from 1 to N (data points), where beta_n is the mistake-correction count
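A minimal PLA sketch, assuming a toy 2-D dataset (all names here are my own): each mistake adds yn zn to w and increments beta_n, so by construction w = sum( beta_n yn zn ).

```python
def pla(points, labels, max_epochs=100):
    """Perceptron learning: on each mistake, w <- w + y_n * z_n."""
    d = len(points[0])
    w = [0.0] * d
    beta = [0] * len(points)          # mistake-correction count per point
    for _ in range(max_epochs):
        mistakes = 0
        for n, (z, y) in enumerate(zip(points, labels)):
            if y * sum(wi * zi for wi, zi in zip(w, z)) <= 0:  # misclassified
                w = [wi + y * zi for wi, zi in zip(w, z)]      # correct it
                beta[n] += 1
                mistakes += 1
        if mistakes == 0:             # separable data: stop at zero mistakes
            break
    return w, beta
```

The returned beta reproduces w exactly as sum( beta_n yn zn ), which is what makes the kernel substitution possible later.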


Logistic Regression

w(n+1) = w(n) + soft_error * (yn zn)


w_logisticregression = sum( soft_error_n * (yn zn) ), where the soft error is theta(-yn w' zn) and theta is the logistic function
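A sketch of the stochastic-gradient update for logistic regression, under the assumption that the soft error is the logistic function theta(-yn w' zn), which is the standard form for SGD on logistic loss (step size and names are my own):

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def logistic_sgd(points, labels, eta=0.1, steps=2000, seed=0):
    """SGD for logistic regression: w <- w + eta * theta(-y w'z) * (y z).
    The soft error theta(-y w'z) never fully reaches 0 or 1 (soft PLA)."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for _ in range(steps):
        n = rng.randrange(len(points))            # pick one example at random
        z, y = points[n], labels[n]
        soft_error = sigmoid(-y * sum(wi * zi for wi, zi in zip(w, z)))
        w = [wi + eta * soft_error * y * zi for wi, zi in zip(w, z)]
    return w
```

Compare with PLA: the hard 0/1 mistake indicator is replaced by the graded soft error, but the update direction yn zn is identical.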

Note that linear regression is actually the same (stochastic gradient descent):

w_lin(n+1) = w_lin(n) + alpha * ( error ) * zn

where error = ( y(n) – w_lin(n)’ zn ) 
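The same SGD pattern for linear regression, as a sketch (the step size alpha, toy data, and names are my own choices): the residual y(n) - w' zn plays the role of the error.

```python
import random

def linreg_sgd(points, targets, alpha=0.05, steps=5000, seed=0):
    """SGD for squared error: w <- w + alpha * (y_n - w'z_n) * z_n."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for _ in range(steps):
        n = rng.randrange(len(points))
        z, y = points[n], targets[n]
        error = y - sum(wi * zi for wi, zi in zip(w, z))   # residual
        w = [wi + alpha * error * zi for wi, zi in zip(w, z)]
    return w
```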


w_svm = sum( alpha(n) * (yn zn) ),  n from 1 to N (data points), but only over the support vectors

In other words, non-support vectors do not affect w at all.


w_pla = sum(beta_n* (yn zn))

PLA represents w with the points it tends to get wrong. SVM represents w with the support vectors, which are likewise the error-prone points!


f(x) = sum( a_i yi (xi' x) ) + b


w => a linear combination of the yn zn (weighted by beta_n); the closer yn zn is to the boundary, the larger beta_n


This also links to the representer theorem and the kernel method.

For any L2-regularized cost/loss function expressed in terms of the zn * yn (i.e. a linear model),

w* = a linear combination of the zn * yn.

==> So the kernel trick applies!
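A sketch of how the representer theorem is used in practice, with kernel ridge regression as the example (this example is mine, not from the lecture; the closed form (K + lam*I) beta = y is the standard one): since w* = sum_n beta_n zn, we can solve for beta directly and never form w.

```python
def solve(A, b):
    """Gaussian elimination without pivoting (fine for the PD matrix K + lam*I)."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for i in range(n):
        for j in range(i + 1, n):
            m = A[j][i] / A[i][i]
            for k in range(i, n):
                A[j][k] -= m * A[i][k]
            b[j] -= m * b[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][k] * x[k] for k in range(i + 1, n))) / A[i][i]
    return x

def kernel_ridge(points, y, kernel, lam=0.1):
    """Representer theorem: w* = sum_n beta_n z_n, so solve for beta directly:
    (K + lam*I) beta = y, where K_ij = kernel(z_i, z_j)."""
    n = len(points)
    K = [[kernel(points[i], points[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def predict(x, points, beta, kernel):
    # f(x) = sum_n beta_n * K(z_n, x) -- never forms w explicitly
    return sum(b * kernel(z, x) for b, z in zip(beta, points))
```

With a linear kernel this reduces to ordinary ridge regression; with an RBF kernel the same beta-space computation fits in the (implicit) feature space.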


Does L1 regularization not work? Right! Because the proof relies on the perpendicular component being zero, i.e. the tangent property of the L2 ball (derivative = 0).

With L1, however, the optimum sits at a vertex (simplex property). So the characteristic of L1 regularization is sparsity rather than derivative = 0!!!


L2-regularized === kernel!!!

L1-regularized does NOT go with the kernel trick!!!


Kernel can be applied to:

SVM (L2 regularization is inherent)

Linear regression with L2 (ridge regression)

Logistic regression with L2

PLA with L2
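For instance, PLA can be kernelized directly because w is a linear combination of the yn zn: store the mistake counts beta_n instead of w. A toy sketch (all names are mine):

```python
def kernel_pla(points, labels, kernel, epochs=100):
    """Kernelized PLA: keep the mistake counts beta_n instead of w.
    Prediction uses f(x) = sum_n beta_n * y_n * K(z_n, x)."""
    beta = [0] * len(points)

    def f(x):
        return sum(b * y * kernel(z, x)
                   for b, y, z in zip(beta, labels, points))

    for _ in range(epochs):
        mistakes = 0
        for n, (z, y) in enumerate(zip(points, labels)):
            if y * f(z) <= 0:         # mistake: bump this point's weight
                beta[n] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return beta
```

With the plain inner product as the kernel this is exactly PLA; swapping in an RBF kernel gives a nonlinear perceptron at no extra conceptual cost.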