Optimization Algorithm Classification

by allenlu2007




0th order algorithm (no derivative required, only f(x), where x is a vector, an R^n -> R mapping)

Alternating optimization:

1. Examples: coordinate descent; the EM algorithm; matrix factorization via alternating updates

2. Alternate between two (or more) orthogonal (or uncorrelated) directions, zig-zagging to reduce the error.

3. Pro and Con:  

Pro: simple; the error is non-increasing at every step, so convergence (to a local optimum) is guaranteed

Con: can fail on non-smooth functions; convergence is often very slow
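The alternating (coordinate-wise) idea above can be sketched as follows. This is a minimal, hypothetical example on a convex quadratic, where the exact 1-D minimization along each coordinate has a closed form; the matrix A and vector b are made up for illustration.

```python
import numpy as np

# Coordinate-descent sketch on a strictly convex quadratic
# f(x) = 0.5 x^T A x - b^T x  (A symmetric positive definite).
# Each pass minimizes f exactly along one coordinate at a time,
# so f is non-increasing ("zig-zag" toward the minimum).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
for sweep in range(100):
    for i in range(len(x)):
        # Exact 1-D minimization in coordinate i (closed form for a quadratic)
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

x_star = np.linalg.solve(A, b)   # true minimizer, for comparison
print(np.allclose(x, x_star))
```

Note how each inner update only moves one coordinate; this is exactly the "zig and zag" behavior, and also why convergence can be slow when the coordinates are strongly coupled.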


1st order algorithm (needs only f(x) and f'(x); no 2nd order derivative or Hessian required)

1. Example: GD, SGD, functional gradient descent, and variants

More advanced methods add momentum terms to avoid zig-zagging.


2. The learning rate is the key.  Choosing the step size along the negative gradient by exact line search is called steepest descent.

3. Pro and Con:

Pro: simple; converges under mild conditions

Con: convergence is still slow
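A minimal sketch of plain gradient descent versus the momentum variant mentioned above, on a made-up elongated quadratic bowl (the function, learning rate, and momentum coefficient are all assumed values for illustration):

```python
import numpy as np

# Hypothetical elongated bowl f(x) = 0.5*(x0^2 + 25*x1^2).
# Plain GD zig-zags across the narrow valley; adding a momentum
# (velocity) term damps the zig-zag and speeds up progress.
def grad(x):
    return np.array([x[0], 25.0 * x[1]])

lr = 0.03                       # learning rate (assumed value)
beta = 0.9                      # momentum coefficient (assumed value)
x_gd = np.array([5.0, 1.0])
x_mom = np.array([5.0, 1.0])
v = np.zeros(2)

for _ in range(200):
    x_gd = x_gd - lr * grad(x_gd)      # plain gradient descent
    v = beta * v - lr * grad(x_mom)    # accumulate velocity
    x_mom = x_mom + v                  # momentum update

print(np.linalg.norm(x_gd), np.linalg.norm(x_mom))
```

With these settings the momentum iterate ends up much closer to the optimum (the origin) than plain GD after the same number of steps.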


2nd order algorithm (needs f(x), f'(x), and f''(x))

1. Newton's method (equivalent to steepest descent after a linear transformation that turns the ellipsoidal error surface into a circular one) -> needs the complete Hessian. -> For a true N-dim ellipsoidal (quadratic) error surface, Newton's method needs only one step to reach the optimum!!!
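The one-step claim is easy to verify on a quadratic. In this sketch the Hessian H and vector b are arbitrary illustrative values; for f(x) = 0.5 x^T H x - b^T x, one Newton step from any starting point lands exactly on the minimizer H^{-1} b:

```python
import numpy as np

# On a quadratic (ellipsoid) error surface, the Hessian is constant,
# and the Newton step x - H^{-1} grad f(x) solves the problem exactly.
H = np.array([[10.0, 2.0],
              [2.0, 1.0]])         # Hessian (symmetric positive definite)
b = np.array([1.0, -1.0])

x0 = np.array([7.0, -3.0])         # arbitrary starting point
g = H @ x0 - b                     # gradient at x0
x1 = x0 - np.linalg.solve(H, g)    # one Newton step

print(np.allclose(x1, np.linalg.solve(H, b)))  # optimal in a single step
```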


2. Hessian-free optimizer -> the conjugate gradient method

For a true N-dim ellipsoidal error surface, the conjugate gradient method requires at most N steps to reach the optimal point!!!


The Hessian-free method (e.g. conjugate gradient) approximates the Hessian implicitly (in a sense), so it is a 2nd order method.

The momentum method is a 1st order method with a modification term.  The two are different; see Hinton's lecture.
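A minimal sketch of (linear) conjugate gradient on a random quadratic, illustrating both claims above: the algorithm touches the Hessian only through matrix-vector products A @ p (the "Hessian-free" ingredient), and in exact arithmetic it reaches the optimum of an N-dimensional quadratic in at most N steps. The random problem setup is made up for illustration.

```python
import numpy as np

# Minimize f(x) = 0.5 x^T A x - b^T x with conjugate gradient.
rng = np.random.default_rng(0)
N = 5
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)    # random well-conditioned SPD "Hessian"
b = rng.standard_normal(N)

x = np.zeros(N)
r = b - A @ x                  # residual = negative gradient
p = r.copy()                   # first search direction
for _ in range(N):             # at most N conjugate steps
    Ap = A @ p                 # only Hessian-vector products needed
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p       # new direction, conjugate to the old ones
    r = r_new

print(np.allclose(x, np.linalg.solve(A, b)))  # optimum after N steps
```

Contrast this with the momentum method above, which also combines the current gradient with the previous direction but uses a fixed coefficient rather than the conjugacy condition.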


3.  Pro and Con

Pro: fast

Con: may not converge (e.g. when the Hessian is indefinite or the start is far from the optimum)




0th order Optimization Algorithm Example

K-means algorithm (unsupervised clustering algorithm)

First, the objective/cost function J(c, u) is given below.  There are two sets of parameters to optimize: the assignments c(i) and the centroids u(k).

J(c, u) = sum_i || x(i) - u_{c(i)} ||^2

The two parameter sets can be optimized separately (alternately).  Will this converge to the global optimum, and how would one prove it — convexity? -> Wrong, J is non-convex.   But K-means always converges to a local minimum. 

Based on Boyd, it might seem that quadratic functions can always be globally optimized -> Wrong!  J is quadratic in u only for fixed assignments c; the joint problem is non-convex, so only a local minimum is guaranteed.


One can show that every cluster-assignment step reduces J, and likewise every centroid-update step reduces J, so the algorithm must converge to a local minimum.  But since J is not convex, convergence to the global minimum is not guaranteed.  In practice one runs several random initial centroids to avoid bad local minima.

From Andrew Ng’s note: cs229.stanford.edu/notes/cs229-notes7a.ps

In the figures below, the upper right is the global optimum and the lower right is a local optimum.  With bad initial points, the algorithm gets stuck at a local optimum.
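The two alternating steps can be sketched as follows. The data here are hypothetical (two synthetic Gaussian blobs), and the inline assertion checks the key property argued above: J never increases across iterations.

```python
import numpy as np

# Minimal K-means sketch showing the 0th order alternating structure:
# the assignment step and the centroid step each reduce the
# distortion J, so J converges to a local minimum.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),     # synthetic cluster A
               rng.normal(5, 0.5, (50, 2))])    # synthetic cluster B
K = 2
mu = X[rng.choice(len(X), K, replace=False)]    # random initial centroids

def distortion(X, c, mu):
    return np.sum((X - mu[c]) ** 2)

J_prev = np.inf
for _ in range(50):
    # Step 1: assign each point to its nearest centroid (reduces J)
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    c = d.argmin(axis=1)
    # Step 2: move each centroid to the mean of its cluster (reduces J)
    for k in range(K):
        if np.any(c == k):
            mu[k] = X[c == k].mean(axis=0)
    J = distortion(X, c, mu)
    assert J <= J_prev + 1e-9      # J is non-increasing
    J_prev = J
```

Running this several times with different seeds mimics the random-restart strategy described above.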




Collaborative Filtering Algorithm: Low-Rank Matrix Factorization (Recommendation Systems)

Movie features -> personal preferences
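This, too, has the alternating structure: fix the movie-feature matrix and solve for user preferences, then fix the preferences and solve for movie features. A minimal sketch via alternating least squares, where the ratings matrix, rank, and ridge term are all assumed values for illustration (real recommenders work with sparse, partially observed ratings):

```python
import numpy as np

# Alternating least squares for low-rank factorization R ≈ U V^T.
# Each half-step is a ridge regression and reduces the squared error.
rng = np.random.default_rng(2)
n_users, n_movies, rank = 20, 15, 3
R = rng.random((n_users, n_movies))      # toy dense ratings matrix

U = rng.standard_normal((n_users, rank))   # user preferences
V = rng.standard_normal((n_movies, rank))  # movie features
lam = 0.1                                  # ridge term (assumed value)

for _ in range(30):
    # Fix V, solve a regularized least squares for U
    G = V.T @ V + lam * np.eye(rank)
    U = np.linalg.solve(G, V.T @ R.T).T
    # Fix U, solve a regularized least squares for V
    G = U.T @ U + lam * np.eye(rank)
    V = np.linalg.solve(G, U.T @ R).T

err = np.linalg.norm(R - U @ V.T)
print(err)   # reconstruction error after the alternating updates
```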