Deep Learning and Machine Learning

by allenlu2007


While going through machine learning lectures earlier, I was mainly focused on PGM (probabilistic graphical models) — lectures by Koller, Andrew Ng, and the MLSS lecturers. A few observations:

(1) PGM is mathematically elegant. Simple cases such as HMMs and junction trees have systematic solutions for inference or learning (learning being itself a kind of inference). For complex cases, belief propagation can still produce useful results (e.g. turbo-code decoding, Markov random fields for image processing).

(2) PGM — or specifically, HMM with GMM — also gives very good results for speech recognition, OCR, etc. The conventional method, feature extraction (SIFT, HOG) plus a classifier, also works fine for object detection/recognition in still images.

(3) What I want to say is that PGM (e.g. HMM or chain/tree graphical models) is fundamentally about finding the internal structure of the information (for speech, the correlation between neighboring syllables). The closer the graphical model is to the real structure, the better the recognition. The more accurate the model or features, the better — in other words, it is model/feature based, and feature extraction matters a lot. (Any relationship with generative/discriminative models?) For example, we often hear about the 14 feature points of fingerprint recognition, or the 33 feature points of face detection. These "feature points" are exactly feature extraction, hand-crafted by feature engineers. The performance (detection accuracy) is mostly determined by the feature extraction: if it is not done well, no amount of subsequent learning will help. Feature extraction is the pain point.
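As a toy sketch of this "hand-crafted feature + classifier" pipeline: the feature below (row contrast minus column contrast) is entirely made up for illustration, but it shows the point — once the feature is right, the classifier can be a trivial threshold.

```python
def row_col_contrast(img):
    """Hand-crafted (hypothetical) feature: row variance minus column variance."""
    n = len(img)
    rows = [sum(r) / n for r in img]
    cols = [sum(img[i][j] for i in range(n)) / n for j in range(n)]
    row_var = sum((v - sum(rows) / n) ** 2 for v in rows)
    col_var = sum((v - sum(cols) / n) ** 2 for v in cols)
    return row_var - col_var

def classify(img):
    # Trivial threshold classifier on the single hand-crafted feature.
    return "horizontal" if row_col_contrast(img) > 0 else "vertical"

horizontal = [[1, 1, 1],
              [0, 0, 0],
              [1, 1, 1]]   # bright/dark rows
vertical   = [[1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]]   # bright/dark columns

print(classify(horizontal))  # horizontal
print(classify(vertical))    # vertical
```

If the feature were chosen badly (say, overall brightness, which is identical for both images here), no classifier downstream could recover the distinction — which is exactly the pain point above.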

(4) Unfortunately, real applications can be much more complicated — for example, Kinect gesture detection (video, with vast amounts of information), computer vision in general (similar), and natural language recognition (with lots of background noise and irregular scenarios). Sometimes good feature extraction is hard to arrive at intuitively, or is too complex to be useful.

(5) This suggests another approach: the human brain. The brain does not work this way at all — it does not build a different model/feature set for each application. E.g. for speech recognition, the brain is surely not running an HMM-GMM model. On the contrary, the brain seems to use the same model for every application: layered structures, differing only in how many layers, how many nodes, and so on. The model seems to be always the same — or perhaps does not matter much?

==> The explanation above is not quite right. PGM and deep learning are not mutually exclusive concepts — they are even similar ones. Deep learning has a multilayer structure and can also be viewed as a kind of graphical model, mostly a DAG (directed acyclic graph), though recurrent networks (networks with loops) are attracting more and more research lately. The main difference between deep learning and an ordinary Bayesian network is the introduction of nonlinear components such as the sigmoid, together with the notions of training/optimization. PGM itself starts from probability (a joint pdf or conditional pdf) and can be linear or nonlinear; in the linear state-space case it becomes the HMM, and the difference from deep learning is exactly linear vs. nonlinear (when a recurrent neural network is unfolded in time, its graph looks just like an HMM's). Nonlinearity is THE key concept of neural networks. Nonlinearity can compress high dimensions together, and only multilayer nonlinearity can handle very complicated problems such as image recognition. These problems are not linearly separable (even in high dimension). See Yann LeCun's lectures for more explanation.

The early neural network, the perceptron (Rosenblatt, 1958), was exactly like this: no hidden units; features extracted by hand; hard decisions; learning only adjusts the weights (the features are decided by hand, the weights are learned — essentially very little learning happens). Features by hand, weights by learning — but for many problems no feasible weights exist. Simple, but only useful for feature-rich data sets.

Perceptron classification is really the same as logistic regression — basically a discriminative model? Still used on feature-rich data. If the wrong features are chosen, the subsequent weight learning is basically useless.
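A minimal perceptron sketch, learning the AND function (which IS linearly separable) with the classic perceptron update rule; all numbers here are just for illustration:

```python
def step(z):
    # Hard-decision activation: fire (1) only when the weighted sum is positive.
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=10, lr=1.0):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in samples:
            pred = step(w[0] * x1 + w[1] * x2 + b)
            err = y - pred          # perceptron rule: w += lr * err * x
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
preds = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in AND]
print(preds)  # [0, 0, 0, 1]
```

Note that the inputs (x1, x2) here play the role of already-extracted features: the learning only tunes the weights, which is why a bad feature choice cannot be fixed at this stage.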

The perceptron can only be trained on linearly separable data sets. For non-linearly-separable cases, even something as simple as y = XOR(x1, x2) has no solution. The only ways out are (1) a higher dimension plus a nonlinear function, e.g. SVM; or (2) more layers of functions (deep learning).
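The "more layers" fix can be shown directly: two layers compute XOR where one perceptron cannot. The weights below are hand-picked (not learned) just to demonstrate that the function is representable — hidden unit h1 computes OR, h2 computes NAND, and the output ANDs them:

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Layer 1 (hidden): two linear threshold units.
    h1 = step(x1 + x2 - 0.5)        # OR(x1, x2)
    h2 = step(-x1 - x2 + 1.5)       # NAND(x1, x2)
    # Layer 2 (output): AND of the hidden units.
    return step(h1 + h2 - 1.5)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 1, 1, 0]
```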


Then around 1980 hidden layers were introduced, together with a learning algorithm (backpropagation). The field heated up for a while, but still could not solve general problems. Basically it only delivered results with no interpretable process (features obtained by learning are quite mysterious). It seemed to be pure trial and error (how many layers, how many nodes), and quickly went out of fashion.

A few problems with neural networks: (1) training (gradient descent) is very computationally intensive; (2) there are many local minima/maxima. These problems already exist in shallow learning (only 1-2 hidden layers), and get worse in deep learning (many hidden layers, closer to the human brain).
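A sketch of gradient descent, the workhorse (and bottleneck) behind both problems above. On a convex function it finds the global minimum; on the non-convex losses of multilayer nets, the same update rule can only promise a local one. The toy objective here is hypothetical:

```python
def grad_descent(df, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient df of some loss function."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

# Toy convex loss f(x) = (x - 3)^2, so df/dx = 2 * (x - 3); minimum at x = 3.
x_min = grad_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```

In a real network, `x` is replaced by millions of weights and `df` is computed layer by layer via backpropagation — hence the heavy computation, and the local minima come from the nonlinearity making the loss non-convex.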

* But recent years seem to have brought breakthroughs — deep learning with convolutional neural networks (CNN), or DNN, or DBN? Why the progress? Partly it relates to improved computation power, but what are the theoretical advances? Do they also use PGM concepts? Apparently yes.


Why does deep learning have an advantage over shallow learning? Because it resembles the human brain?

Shallow learning, with only 1-2 hidden layers, seems able to achieve only simple feature extraction such as light/dark contrast; adding more training samples just leads to overtraining. Deep learning can do multi-level feature extraction: e.g. layer 1 picks up light/dark, layer 2 picks up stripes, layer 3 ... More importantly, each layer can be trained and fine-tuned separately. This seems much closer to how vision works.

See the following article.



Is deep learning also a PGM, and also a generative model?

Is deep learning a Bayesian network (DAG: directed acyclic graph)? Or a Markov random field (undirected graph)?

It can be thought of as another topology besides chains and trees — a bipartite layered model — which happens to suit feature extraction well. Training uses belief nets, to keep the problem tractable. The states are binary with sigmoid activation. Suitable for high-dimension problems.
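A sketch of the mechanics of such a bipartite layered model (e.g. an RBM-style layer): given binary visible units, each hidden unit turns on with a sigmoid probability. The weights and biases below are made-up numbers, just to show the computation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, b):
    # p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j])
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

v = [1, 0, 1]                       # visible layer (binary states)
W = [[0.5, -0.2],
     [0.3, 0.8],
     [-0.1, 0.4]]                   # visible-to-hidden weights (hypothetical)
b = [0.0, 0.1]                      # hidden biases (hypothetical)

p = hidden_probs(v, W, b)
print([round(x, 3) for x in p])     # [0.599, 0.574]
```

The bipartite structure is what keeps this tractable: hidden units are conditionally independent given the visible layer, so each probability is a single sigmoid.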

Is deep learning a classification algorithm that goes one step beyond SVM?

* Both SVM and deep learning are used to attack high-dimension problems. SVM uses different kernel functions; deep learning uses a layered structure (N*M*...).
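The kernel idea can be sketched on the XOR data from earlier: in the raw (x1, x2) plane no line separates the classes, but lifting into a higher dimension with one nonlinear feature x1*x2 (a tiny stand-in for a kernel feature map) makes a simple linear threshold work. The lifted separating plane below is hand-picked for illustration:

```python
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def phi(x1, x2):
    # Lift 2D input into 3D: (x1, x2, x1*x2)
    return (x1, x2, x1 * x2)

def lifted_classify(x1, x2):
    # Linear decision in the lifted space: z1 + z2 - 2*z3 > 0.5
    z1, z2, z3 = phi(x1, x2)
    return 1 if z1 + z2 - 2 * z3 > 0.5 else 0

print([lifted_classify(x1, x2) for (x1, x2), _ in XOR])  # [0, 1, 1, 0]
```

A real SVM never computes phi explicitly — the kernel trick evaluates inner products in the lifted space directly — but the geometric picture is the same: nonlinear map up, linear separation there.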

* Deep learning can be used for supervised or unsupervised learning.

SVM is basically supervised learning. See the figure below.

* Deep learning is basically still a generative model. Does it find the joint PDF? Besides classification, it can also be used to discover data structure (unsupervised learning? clustering?).

Is SVM a discriminative model? In the end it only cares about P(y|x) for classification questions.

A nice lecture series on Coursera by Geoffrey Hinton (deep learning pioneer).


What is Learning?

Learning can be roughly divided into three classes: (1) supervised learning, used for regression or classification; (2) unsupervised learning, used for clustering or data-structure exploration (PCA, ...); (3) reinforcement learning (finding the optimal policy for a distant outcome).

The most direct kind, and the one most connected to other disciplines, is supervised learning (similar to the mathematics used in control, optimization, filtering (Kalman), etc.).

From a Bayesian point of view, "learning" is a pseudo-problem: learning is just an inference problem. Except that ...

For a (Gen1) neural network, learning means finding the optimal weights — the same class of problem as optimization, just with more complicated nonlinear functions. After learning, inference is performed.

Yann LeCun has a nice presentation on the MIT web site explaining deep vs. shallow learning: SVM and kernel methods are not deep; why deep learning is needed, and the trade-offs.

* Learning -> pushing down an energy function -> Hamiltonian?


Deep Learning and Graph Model

* Not mutually exclusive, but not identical either. Deep learning also has factor graphs; it can be a directed acyclic graph (Bayesian) or an undirected graph (cycles allowed? yes — recurrent neural nets).


Not Deep Learning!

* No hidden layer

* 1 hidden layer (no feature layer)

* SVM or kernel methods

* Classification tree

* In graphical models: HMM (?) and trees (?)

* Most graphical models are not deep, but most deep learning models can be formulated as factor graphs.


Deep Learning

* more than 2 hidden layers in a neural net

* trades time for parallel space

* mostly non-convex optimization (supervised learning)

* Can do unsupervised learning

* Highly related to energy functions? Hamiltonian?


The slide below is from M. Ranzato. It gives a nice classification of the commonly used machine learning toolboxes along the supervised/unsupervised and deep/shallow axes. Most of the toolboxes we used before (SVM, GMM, Perceptron) are shallow learning.