### Expectation Maximization Algorithm

#### by allenlu2007

ML estimation and MAP estimation look, at first glance, like two different types of estimation: ML estimation is classical estimation without prior information, while MAP is Bayesian estimation with prior information.

Although ML estimation can be derived from MAP estimation (assuming the unknown has a delta distribution or a uniform distribution) and vice versa, this is only a theoretical derivation with little practical use.
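Concretely, the MAP-to-ML direction is just the uniform-prior special case:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(y \mid \theta)\, p(\theta) = \arg\max_{\theta}\, p(y \mid \theta) = \hat{\theta}_{ML} \quad \text{when } p(\theta) \text{ is uniform}$$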

Consider a scenario that starts with ML estimation of theta (for example, a mean and a variance). This yields a distribution for theta, which can then be refined iteratively to get better results.

There are two possible approaches: (1) follow the time sequence and add one more sample per iteration; (2) fix the total number of samples but iterate over them many times.

Simply put, although this starts as MLE parameter estimation, it effectively turns into MAP during the iteration.

Several problems with EM:

1. It may converge to a local maximum.

2. It may overfit; running too many iterations can cause problems for an ML system.

------------------ The explanation above is wrong ------------------

Although the explanation above is not the original intent of the EM algorithm, after watching Daphne Koller's probabilistic graphical model (PGM) lectures I was surprised to find that belief propagation (loopy BP in PGM) has a similar idea. In particular, a turbo code decoder uses the bits from decoder1 as prior information for decoder2, turning ML into MAP and vice versa. The difference is that it uses two encoders and decoders to decode iteratively, instead of self-looping!

Whether one can self-loop from ML to MAP may need further checking.

However, in some cases the ML likelihood equation is difficult to compute directly.

When is it difficult to compute?

1. Multinomial distributions?

2. Hidden or latent variables?

The key is iteration! It is possible to use an iterative approach.

Some examples:

1. Binomial (see the two-coin sketch after this list)

2. Communication detection
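As a concrete instance of the binomial example, here is a minimal sketch of the classic two-coin EM problem: each trial tosses one of two biased coins (which one is hidden) ten times, and EM recovers both head probabilities. The data and initial guesses are illustrative choices of mine, not from the post:

```python
import numpy as np
from scipy.stats import binom

# Heads observed in 10 tosses per trial; which coin was used is hidden.
heads = np.array([5, 9, 8, 4, 7])
n_tosses = 10
theta_A, theta_B = 0.6, 0.5          # initial guesses for the two coins

for _ in range(50):
    # E-step: posterior probability that each trial used coin A
    # (assuming a uniform prior over the two coins)
    like_A = binom.pmf(heads, n_tosses, theta_A)
    like_B = binom.pmf(heads, n_tosses, theta_B)
    w_A = like_A / (like_A + like_B)
    w_B = 1.0 - w_A
    # M-step: weighted maximum-likelihood update of each head probability
    theta_A = np.sum(w_A * heads) / np.sum(w_A * n_tosses)
    theta_B = np.sum(w_B * heads) / np.sum(w_B * n_tosses)

print(theta_A, theta_B)              # converges to roughly 0.80 and 0.52
```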

The key point is to introduce the complete data x, where:

(1) $Y = T(X)$ is a deterministic function of $X$, e.g., $Y = (X_1 + X_2)/2$ or $Y = \sqrt{X_1^2 + X_2^2}$.

(2) $X$ is the complete and correct data.

Step 1: Pick an initial guess for $\theta$.

Step 2: Assume $\theta$ is correct; given the observed y, calculate the conditional pdf $p(x \mid y, \theta)$.

Step 3: Choose the new $\theta$ that maximizes the expected $\log p(x \mid \theta)$, i.e., the Q function. We have to maximize the *expected* $\log p(x \mid \theta)$ because we do not actually know x (it may be a hidden or latent variable).
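A minimal Python sketch of these three steps as a generic loop; the e_step and m_step callables are my own placeholders for the problem-specific parts:

```python
import numpy as np

def em(y, theta0, e_step, m_step, max_iter=100, tol=1e-9):
    """Generic EM loop.

    e_step(y, theta): Step 2, the expected complete-data sufficient
        statistics E[t(x) | y, theta], which determine the Q function.
    m_step(stats):    Step 3, the theta that maximizes Q.
    """
    theta = theta0                      # Step 1: initial guess
    for _ in range(max_iter):
        stats = e_step(y, theta)        # Step 2 (E-step)
        theta_new = m_step(stats)       # Step 3 (M-step)
        converged = np.max(np.abs(np.asarray(theta_new)
                                  - np.asarray(theta))) < tol
        theta = theta_new
        if converged:
            break
    return theta
```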

Iterative approaches:

1. EM

2. Kalman filter??

Example

EX1: Normal distribution with known variance $\sigma^2$.

Observed data $y = (y_1, y_2, \ldots, y_m)$; missing data $(y_{m+1}, \ldots, y_n)$.

Complete data $x = (y_1, y_2, \ldots, y_n)$.

The complete-data log-likelihood and the Q function are (here the unknown parameter $\phi$ is the mean $\mu$, since $\sigma^2$ is known):

$$\log L_c(\phi) = \log g_c(x; \phi)$$

$$Q(\phi, \phi^{(t)}) = E_{\phi^{(t)}}\big[\log L_c(\phi) \mid y\big], \qquad \frac{\partial Q}{\partial \phi} = \frac{1}{\sigma^2}\Big[(y_1 + \cdots + y_m) + E_{\phi^{(t)}}\Big[\sum_{i=m+1}^{n} x_i \,\Big|\, y\Big] - n\phi\Big]$$

Expectation part:

$$E_{\phi^{(t)}}\Big[\sum_{i=m+1}^{n} x_i \,\Big|\, y\Big] = (n-m)\,\mu^{(t)}, \qquad s_1^{(t)} = \sum_{i=1}^{m} y_i + (n-m)\,\mu^{(t)}$$

Maximization part (set $\partial Q / \partial \phi = 0$):

$$\mu^{(t+1)} = \frac{s_1^{(t)}}{n} = \frac{\sum_{i=1}^{m} y_i + (n-m)\,\mu^{(t)}}{n}$$

After enough iterations, $\mu^{(t+1)} = \mu^{(t)} = \mu$, so

$$\mu = \frac{1}{m}\sum_{i=1}^{m} y_i$$

The missing data turn out to be useless here, since the samples are i.i.d.!
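A quick numerical sanity check of this fixed point; the true mean, variance, and sample sizes below are illustrative assumptions, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 20, 30, 1.0                 # n - m = 10 samples are missing
y_obs = rng.normal(2.0, sigma, size=m)    # observed y_1..y_m, true mean 2.0

mu = 0.0                                  # initial guess mu^(0)
for _ in range(1000):
    # E-step: each missing sample's conditional expectation is the current mu
    s1 = y_obs.sum() + (n - m) * mu
    # M-step: mu^(t+1) = s1 / n
    mu_new = s1 / n
    if abs(mu_new - mu) < 1e-12:
        break
    mu = mu_new

# The fixed point is just the observed-sample mean: the missing data add nothing.
print(mu, y_obs.mean())
```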

Applications:

Missing data

Gaussian clustering (see the GMM sketch after this list)

Hidden Markov model (HMM)

Gaussian clustering and hidden Markov chains (same idea?)

Eye diagram estimation?

DC offset estimation with data?

Clock recovery with data (CDR), bang-bang?

CDR with missing pulse?
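For the Gaussian clustering application above, here is a minimal sketch of EM for a 1-D two-component Gaussian mixture; the data and initialization are illustrative choices of mine, not from the post:

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D data drawn from two Gaussian clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

pi = 0.5                                  # mixing weight of component 0
mu = np.array([-1.0, 1.0])                # initial means
sd = np.array([1.0, 1.0])                 # initial std devs

for _ in range(200):
    # E-step: responsibility of each component for each point
    p0 = pi * norm.pdf(x, mu[0], sd[0])
    p1 = (1 - pi) * norm.pdf(x, mu[1], sd[1])
    r0 = p0 / (p0 + p1)
    r1 = 1.0 - r0
    # M-step: weighted updates of mixing weight, means, and std devs
    pi = r0.mean()
    mu = np.array([np.sum(r0 * x) / r0.sum(), np.sum(r1 * x) / r1.sum()])
    sd = np.sqrt(np.array([np.sum(r0 * (x - mu[0])**2) / r0.sum(),
                           np.sum(r1 * (x - mu[1])**2) / r1.sum()]))

print(pi, mu, sd)   # should recover roughly 0.4, [-2, 3], [1, 1]
```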