Optimization Algorithm Classification

by allenlu2007

 

There are three basic classes:

 

0th order algorithm (no derivative needed; only f(x), where x is a vector and f is an R^n -> R mapping)

Alternating optimization:

1. Examples: coordinate descent; EM algorithm; matrix factorization via alternating gradient descent

2. Alternate between two (or more) orthogonal (or uncorrelated) directions, zig-zagging to reduce the error; see the sketch after this list.

3. Pro and Con:  

Pro: easy; the error never increases, so convergence (to a local minimum) is guaranteed

Con: cannot be used on non-smooth functions; convergence is very slow
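Below is a minimal coordinate descent sketch in Python on an illustrative two-variable quadratic f(x, y) = x^2 + 2y^2 + xy; the function, starting point, and iteration count are assumptions chosen only to show the alternating updates.

```python
# Minimal coordinate descent on f(x, y) = x^2 + 2*y^2 + x*y (illustrative example).
# Each pass fixes one variable and minimizes exactly over the other, so f never increases.
def coordinate_descent(x, y, n_iters=20):
    for _ in range(n_iters):
        x = -y / 2.0   # argmin over x with y fixed: df/dx = 2x + y = 0
        y = -x / 4.0   # argmin over y with x fixed: df/dy = 4y + x = 0
    return x, y

x_opt, y_opt = coordinate_descent(x=3.0, y=-2.0)
print(x_opt, y_opt)    # both approach 0, the minimum of this convex example
```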

 

1st order algorithm (only needs f(x) and f'(x); no 2nd order derivative or Hessian required)

1. Examples: GD, SGD, functional gradient descent, and variants

More advanced methods add momentum terms to avoid zig-zagging; a sketch with a momentum term appears after this list.

 

2. The learning rate is the key. Choosing the step by an exact line search along the negative gradient gives steepest (gradient) descent.

3. Pro and Con:

Pro: easy; converges

Con: convergence is still slow
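Below is a minimal sketch of gradient descent with a momentum term on an illustrative, poorly conditioned quadratic; the matrix A, learning rate, and momentum coefficient are assumed values used only to show how the velocity term damps the zig-zag.

```python
import numpy as np

# Gradient descent with momentum on f(w) = 0.5 * w^T A w, an elongated (ellipsoidal)
# error surface. The velocity v accumulates past gradients, damping the zig-zag.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])          # ill-conditioned quadratic (illustrative)

def grad(w):
    return A @ w                    # f'(w)

w = np.array([1.0, 1.0])
v = np.zeros_like(w)                # momentum "velocity"
lr, beta = 0.05, 0.9                # learning rate and momentum coefficient (assumed values)

for _ in range(200):
    v = beta * v - lr * grad(w)
    w = w + v

print(w)                            # close to [0, 0], the optimum
```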

 

2nd order algorithm (needs f(x), f'(x), and f''(x))

1. Newton's method (equivalent to steepest descent after a linear transformation that turns the ellipsoidal error surface into a circular one) -> needs the complete Hessian -> for a true N-dim ellipsoidal error surface, Newton's method needs only one step to reach the optimum (see the one-step sketch at the end of this section)!

2. L-BFGS

3. Hessian-free optimizer -> conjugate gradient method

For a true N-dim ellipsoidal error surface, the conjugate gradient method requires N steps to reach the optimal point (see the sketch below)!

 

The Hessian-free method (e.g. the conjugate gradient method) approximates the Hessian matrix (in a sense), so it is a 2nd order method.

The momentum method is a 1st order method with a modification term. The two are different; see Hinton's lecture.
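Below is a minimal sketch of the conjugate gradient method that Hessian-free optimization relies on: on an N-dimensional quadratic it reaches the exact optimum in at most N steps, and it touches the Hessian only through matrix-vector products. The matrix, vector, dimension, and seed are illustrative assumptions.

```python
import numpy as np

# Conjugate gradient on an N-dim quadratic f(x) = 0.5 x^T A x - b^T x.
# In exact arithmetic it reaches the minimizer A^{-1} b in at most N steps, and it
# uses A only through products A @ p (the Hessian-free idea; A is formed here just
# to build the toy problem).
N = 5
rng = np.random.default_rng(0)
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)         # symmetric positive definite "Hessian" (toy)
b = rng.standard_normal(N)

x = np.zeros(N)
r = b - A @ x                       # residual = negative gradient
p = r.copy()                        # first search direction
for _ in range(N):                  # exactly N conjugate steps
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

print(np.allclose(x, np.linalg.solve(A, b)))   # True: optimum reached in N steps
```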

 

4. Pro and Con

Pro: fast

Con: may not converge
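Below is a minimal sketch of the one-step Newton property on a purely quadratic (ellipsoidal) error surface; the Hessian H, the vector b, and the starting point are illustrative assumptions.

```python
import numpy as np

# On a purely quadratic error surface f(x) = 0.5 x^T H x - b^T x,
# a single Newton step x - H^{-1} f'(x) lands exactly on the optimum H^{-1} b.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # positive definite Hessian (illustrative)
b = np.array([1.0, 2.0])

def grad(x):
    return H @ x - b                # f'(x)

x0 = np.array([10.0, -7.0])              # arbitrary starting point
x1 = x0 - np.linalg.solve(H, grad(x0))   # one Newton step

print(np.allclose(x1, np.linalg.solve(H, b)))   # True: optimum in one step
```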

 

 

 

0th order Optimization Algorithm Example

K-means algorithm (unsupervised clustering algorithm)

First, the objective/cost function J(c, u) is given below. There are two sets of parameters to be optimized: c(i) and u(k).

[Image: K-means objective J(c, u) = sum_i || x(i) - u_c(i) ||^2]

 

The two sets of parameters can be optimized alternately. Does this converge to the global optimum, and how would one prove it, via convexity? -> Wrong: J is non-convex. But K-means always converges to a local minimum.

Based on Boyd, quadratic functions seem to always be globally optimizable -> Wrong! J is quadratic in u only with the assignments c fixed; the joint problem over (c, u) is non-convex, so only a local minimum is guaranteed.


It can be shown that every cluster-assignment step reduces J (or leaves it unchanged), and every centroid-update step likewise reduces J, so the algorithm must converge to a local minimum. But because J is not convex, convergence to the global minimum is not guaranteed. In practice, K-means is run from several random initial centroids to avoid a bad local minimum.
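Below is a minimal K-means sketch that alternates the two updates and prints J after each pass to show that it never increases; the toy two-blob data, the random seed, and K are assumptions for illustration.

```python
import numpy as np

# K-means as alternating optimization of J(c, u) = sum_i || x(i) - u_c(i) ||^2:
# step (1) re-assigns each point to its nearest centroid, step (2) moves each
# centroid to the mean of its points. Both steps can only lower J.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),     # toy data: two blobs (illustrative)
               rng.normal(3, 0.5, (50, 2))])
K = 2
u = X[rng.choice(len(X), K, replace=False)]      # random initial centroids

for step in range(10):
    d = ((X[:, None, :] - u[None, :, :]) ** 2).sum(axis=2)
    c = d.argmin(axis=1)                                        # (1) cluster assignment
    u = np.array([X[c == k].mean(axis=0) for k in range(K)])    # (2) move centroids
    J = ((X - u[c]) ** 2).sum()
    print(step, J)                               # J never increases across iterations
```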

From Andrew Ng’s note: cs229.stanford.edu/notes/cs229-notes7a.ps

In the figure below, the upper right shows the global optimum and the lower right a local optimum. If the initial points are badly chosen, the algorithm gets stuck at the local optimum.

[Image from Ng's notes: K-means clustering results under different initializations]

 

 

Collaborative Filtering Algorithm: Low-Rank Matrix Factorization: Recommendation System

Movie features -> personal preferences
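Below is a minimal sketch of low-rank matrix factorization for a recommendation setting, fit by alternating least squares: fix the movie-feature factor and solve for the user-preference factor, then swap. The toy ratings matrix, the rank, and the regularization strength are illustrative assumptions.

```python
import numpy as np

# Low-rank factorization R ~ U @ V.T fit by alternating least squares:
# fix V (movie features) and solve a ridge-regularized least squares for U
# (user preferences), then fix U and solve for V.
rng = np.random.default_rng(2)
R = rng.integers(1, 6, size=(6, 5)).astype(float)   # toy user x movie ratings (fully observed)
rank, lam = 2, 0.1                                   # latent dimension and ridge term (assumed)

U = rng.standard_normal((R.shape[0], rank))
V = rng.standard_normal((R.shape[1], rank))
reg = lam * np.eye(rank)

for _ in range(50):
    U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T   # update user-preference factor
    V = np.linalg.solve(U.T @ U + reg, U.T @ R).T     # update movie-feature factor

print(((R - U @ V.T) ** 2).sum())   # reconstruction error shrinks as the factors alternate
```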

 

 
