Excellent article: https://arxiv.org/pdf/1606.05908.pdf
One thing always puzzles me in machine learning.
A neural network (NN) is deterministic: essentially a nonlinear mapping/function y = G(x).
A probabilistic graphical model (PGM) is probabilistic: it estimates various probabilities (joint, conditional) with a graph model.
Why can we solve machine learning problems such as image classification and image generation using both techniques?
To be more specific, a machine learning problem can be viewed in a general probabilistic framework.
Machine learning learns from a pile of known big data and uses it to infer new data; it is fundamentally a probabilistic problem.
For example, the (image) classification problem is deterministic on the surface, but it is more adequate to describe it using probability.
Why, then, can an NN, a deterministic mapping, be so widely used in machine learning?
We can look at this question through discriminative and generative problems/models.
A discriminative problem estimates the posterior probability (and distribution) p(Y|X),
where Y is the label and X is the input image or feature vector.
X has a much higher dimension than Y (binary, or 10-1000 categories).
In a discriminative model, we estimate p(Y|X) directly. How do we do that using an NN?
(a) Inference: the NN can be viewed as preprocessing that transforms X into a deterministic and mostly lower-dimensional vector, which is suitable for a linear classifier (e.g. logistic or softmax).
The NN does not directly produce the posterior probability (and distribution)!
It is the final logistic or softmax layer that delivers the posterior probability (and distribution). See https://taweihuang.hpd.io/2017/03/21/mlbayes/ or Andrew Ng's paper http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf for the role of logistic regression in discriminative classifiers.
(b) Training: training does not care about the global data distribution; the point is to make the training posterior accuracy p(Yi|Xi) as high as possible without overfitting. Cost function = min(training error) + regularization loss.
In summary, in a discriminative problem the NN can be viewed as preprocessing that maps the (very high-dimensional) X distribution to a (low-dimensional) linearly separable distribution! The final logistic or softmax layer then delivers the posterior probability (and distribution), as in the sketch below.
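A minimal sketch of both roles (PyTorch; the layer sizes and MNIST-like shapes are illustrative assumptions, not from the article): the NN body does the high-to-low-dimensional preprocessing, only the last softmax turns the deterministic features into a posterior p(Y|X), and training minimizes cross-entropy (training error) plus weight decay (regularization loss).

```python
import torch
import torch.nn as nn

num_pixels, num_classes = 784, 10          # hypothetical MNIST-like sizes

feature_extractor = nn.Sequential(         # (a) preprocessing: high dim -> low dim
    nn.Linear(num_pixels, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),
)
head = nn.Linear(32, num_classes)          # linear classifier on the features

x = torch.randn(8, num_pixels)             # dummy image batch
y = torch.randint(0, num_classes, (8,))    # dummy labels

logits = head(feature_extractor(x))
posterior = torch.softmax(logits, dim=-1)  # the softmax delivers p(Y|X)

# (b) training: cost = training error + regularization loss
loss = nn.functional.cross_entropy(logits, y)
optimizer = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(head.parameters()),
    lr=0.1, weight_decay=1e-4,             # weight decay = regularization loss
)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```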
A generative problem estimates a joint probability distribution: P(X), or P(X, Y), or P(X|Y), or P(Y|X) = P(X, Y)/P(X).
P(X): e.g., after seeing a pile of unlabeled images (Xi), find the X distribution in very high dimension, or generate a new image based on that distribution (unsupervised learning).
P(X, Y): e.g., after seeing a pile of labeled images (Xi, Yi), find the (X, Y) joint distribution in very high dimension, or generate a new image from P(X|Y) based on the label (supervised learning).
What role does the NN play in a generative problem? Exactly the opposite of the discriminative one: it maps a simple (low-dimensional) distribution Z (normal or uniform) through G(Z) to a (very high-dimensional) complicated distribution Pg(Z), which should approximate the X distribution P(X) as closely as possible!!
The hardest part of a generative problem is training, because it is hard to define the error of a high-dimensional (image) distribution based on individual samples (by contrast, the (labeled) discriminative problem easily defines the low-dimensional (classification) error from training samples, which are individually labeled and summable)!
There are two common generative models:
1. Variational approach: variational autoencoder.
2. GAN approach: define a G and a D, and use the GAN to iteratively approach the sampled PDF.
(a) Training: the goal is to make Pg(Z) as close to P(X) as possible, i.e., to minimize the distance (or divergence) between Pg(Z) and P(X).
Cost function = divergence(Pg(Z), P(X)) ==> Wrong!! The problem is that we do not have the distribution P(X)!
But we do have samples of X, so instead we should maximize the log likelihood of P(X) by optimizing θ as follows.
This is the general formula, where P(X|z;θ) can be any probability distribution:
P(X) = ∫ P(X|z;θ) P(z) dz = E_{z∼P(z)}[ P(X|z;θ) ]   (1)
(1) can be rewritten (by applying Bayes rule to P(z|X)) as:
log P(X) − D[Q(z|X) ‖ P(z|X)] = E_{z∼Q}[ log P(X|z) ] − D[Q(z|X) ‖ P(z)]   (5)
Eq. (5) is the core of the variational autoencoder!
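For reference, a short sketch of how (5) follows from (1), in the notation of the Doersch tutorial: start from the KL divergence between Q(z|X) and the true posterior P(z|X), then substitute Bayes rule P(z|X) = P(X|z)P(z)/P(X) and rearrange.

```latex
\begin{align*}
D\big[Q(z|X)\,\|\,P(z|X)\big]
  &= \mathbb{E}_{z\sim Q}\big[\log Q(z|X)-\log P(z|X)\big] \\
  &= \mathbb{E}_{z\sim Q}\big[\log Q(z|X)-\log P(X|z)-\log P(z)\big]+\log P(X) \\
\Rightarrow\ \log P(X) - D\big[Q(z|X)\,\|\,P(z|X)\big]
  &= \mathbb{E}_{z\sim Q}\big[\log P(X|z)\big]-D\big[Q(z|X)\,\|\,P(z)\big]
\end{align*}
```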
P(X|z) is the “decoder network”, but it is not the deterministic NN decoder of a conventional autoencoder!
Q(z|X) is naturally the “encoder network”.
The cost function is the RHS of (5).
The first term is the reconstruction (or generation) loss.
Physical insight: the variational approach assumes P(X|z) is a normal distribution. Taking the log and minimizing the expectation over Q is equivalent to minimizing an L2 norm, as in the left part of the figure below: X -> Q -> μ/Σ -> Sample -> P -> L2 norm.
The second term is the latent (or regularization) loss.
Physical insight: it constrains Q(z|X) to stay close to N(0, 1).
As in the left part of the figure: X -> Q -> μ/Σ -> KL (for Gaussians this KL term has a closed form, so no sampling is needed to compute it).
BTW, for back propagation to work, there must be no sampling node on the back-prop path. The graph therefore has to be transformed into the right part of the figure (the reparameterization trick), as in the sketch below!!
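A minimal sketch of the full training objective with the reparameterization trick (PyTorch; `encoder` and `decoder` are assumed modules, with `encoder` returning (μ, log σ²) and `decoder` returning the mean of a Gaussian P(X|z)):

```python
import torch

def vae_loss(x, encoder, decoder):
    """Negative of the RHS of Eq. (5): reconstruction loss + latent (KL) loss."""
    mu, log_var = encoder(x)                 # Q(z|X) parameters
    eps = torch.randn_like(mu)               # the sampling node, off the back-prop path
    z = mu + torch.exp(0.5 * log_var) * eps  # reparameterization trick
    x_hat = decoder(z)                       # mean of Gaussian P(X|z)
    recon = ((x - x_hat) ** 2).sum(dim=-1)   # first term: L2 reconstruction loss
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1)  # second term
    return (recon + kl).mean()               # minimizing this maximizes the ELBO
```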
??? Are Q(z|X) and P(X|z) deterministic functions or distributions in the encoder/decoder ???
Ans: if Q and P are deterministic NNs, the PDFs of Q(z|X) and P(X|z) are delta functions.
But if the NN has transition probabilities, random dropout, or noise injection, then Q(z|X) and P(X|z) can be approximated by normal distributions.
(b) Inference: inference in a generative model is trivial: draw a (new) sample from P(X_new_image) or P(X_new_image | car). Just randomly generate a sample from the simple (low-dimensional) distribution, and let the NN map it to a (high-dimensional) image or speech.
So in generative-model inference (in the variational approach), the NN's role is to transform a probability distribution. Variational inference only needs the decoder NN P, which maps the normal distribution to a distribution approximating P(X), from low dimension to high dimension. This is exactly the opposite of the discriminative model.
The encoder Q, however, maps the X distribution to a normal distribution, i.e., from high dimension to low dimension, which is very similar to the discriminative model. A sketch of the inference step follows.
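Continuing the hypothetical `decoder` from the sketch above (`latent_dim` is an assumed latent size):

```python
import torch

latent_dim = 32                    # assumed latent size
z = torch.randn(1, latent_dim)     # sample from the simple low-dim prior N(0, I)
x_new = decoder(z)                 # decoder NN maps it to a high-dim image ~ P(X)
```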
In summary, the NN's roles:
Discriminative model:
Training model: the NN maps a high-dimensional distribution to a low-dimensional LINEARLY SEPARABLE distribution!
Training cost function: minimal discriminative error (classification error)!
Inference model: same as the training model; the NN maps a high-dimensional distribution to a low-dimensional linearly separable distribution.
Generative model (Variational):
Training model: the encoder NN maps the high-dimensional distribution to a low-dimensional pre-determined distribution (normal distribution) + the decoder NN maps the low-dimensional distribution to a high-dimensional distribution.
Training cost function: maximum likelihood => generation loss + regularization loss.
Inference: the decoder NN maps the low-dimensional (normal) distribution to a high-dimensional distribution.
Generative model (GAN):
Training model: the discriminative NN maps the high-dimensional distribution to a low-dimensional binary separable distribution + the generative NN maps the low-dimensional distribution to a high-dimensional distribution (see the one-step sketch below).
Training cost function: maximum likelihood => minimize divergence!
Inference: the generative NN maps the low-dimensional (uniform or normal) distribution to a high-dimensional distribution.
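A one-step GAN training sketch matching this summary (PyTorch; `G`, `D`, and the optimizers are assumed, with D ending in a sigmoid so its output is a binary posterior):

```python
import torch
import torch.nn.functional as F

def gan_step(x_real, G, D, opt_g, opt_d, z_dim):
    batch = x_real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    z = torch.randn(batch, z_dim)            # low-dim noise sample
    x_fake = G(z)                            # generative NN: low dim -> high dim

    # Discriminative NN: binary posterior, real -> 1, fake -> 0.
    d_loss = F.binary_cross_entropy(D(x_real), ones) \
           + F.binary_cross_entropy(D(x_fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: make D output "real" on fakes (non-saturating loss); the
    # alternating game implicitly minimizes a divergence between Pg(Z) and P(X).
    g_loss = F.binary_cross_entropy(D(x_fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```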
Whether the generative learning is supervised or unsupervised, the main goal is to find the very high-dimensional PDF of P(X) or P(X, Y), i.e., density estimation. In practice, nobody cares about the distribution itself! What matters is the generated samples!
Unlike the discriminative problem (many to one or a few), the generative problem is typically a one-to-many, and thus harder, problem.
The key is to find the (very high-dimensional) distribution P(X) (not just a single probability, since it lives in a very high dimension; this is not a simple classification problem).
The final result of a generative problem is typically a 2D graph of P(X, y) or P(X|y), or lots of pictures with different parameters.
For the discriminative type of problem, we only need to add a logistic or softmax function to convert the deterministic NN output into a probability.
How about the other way around? If the problem is purely probabilistic, e.g., we want to get/generate the PDF of a random variable (vector), how can an NN come to the rescue?
It is not difficult: we use a latent random variable Z whose PDF is a normal, uniform, or any other distribution.
We define Y = G(Z) to get the final PDF, where G() is implemented with an NN (which could be a very complicated function mapping).
The PDF of Y is very complicated and most likely intractable.
However, we can use learning methods to make the PDF of Y approximate a given distribution by the following methods.
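A toy illustration of this pushforward idea (NumPy; the nonlinear G here is a hypothetical stand-in for a trained NN): even when p(Y) has no tractable closed form, sampling from it is trivial.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)           # simple latent: Z ~ N(0, 1)
y = np.tanh(2.0 * z) + 0.1 * z**2      # hypothetical nonlinear mapping G(Z)

# p(Y) has no simple closed form, but we can sample it and estimate its shape:
density, edges = np.histogram(y, bins=50, density=True)
```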