Dimension Reduction and Neural Network Insight

by allenlu2007



The article above is an excellent piece for understanding what a neural network actually does.

It uses dimension reduction as a tool to understand what each layer of a neural network produces.

Because a neural network is a very high-dimensional algorithm, inspecting the results directly is impossible. But combined with dimension reduction, we can visualize the effect of every layer!!!

This is similar to my earlier idea: interpret what machine learning, or even deep learning, does through differential geometry.

But this article points to a more concrete direction: using dimension reduction as a tool for visualization.

Setting the theory aside for now, let's first see how it works. Of course this is not only for neural networks; it should also work for SVM or other machine learning methods that map data to higher dimensions.


First question: is dimension reduction supervised learning or unsupervised learning?

At first glance, unsupervised learning: the goal is to reduce dimension, and no labels are needed. Like a clustering algorithm, it purely explores the internal structure of the data itself. For example, PCA finds the eigenvectors of the data matrix.

But from the standpoint of interpretation, of judging which kind of dimension reduction is better, or of optimizing a parameter, we must look at the final result. In other words, many dimension reductions are followed by a classifier (supervised learning), i.e.

Dimension reduction (unsupervised learning) + Classifier (supervised learning)

The optimization must consider the final classifier's result. The same holds for much of unsupervised learning: judging an unsupervised result means looking at the performance of the overall system (with the classifier attached).


Dimension Reduction Technique

Linear Dimension Reduction (PCA, Factor Analysis, ICA)

Reference: http://slideplayer.com/slide/10455933/

My suggestion is to try them all! Don't start with a preconceived idea of which kind your data belongs to.

The most famous is PCA (principal component analysis): find the eigenvectors with the largest variance. Mathematically it is also very elegant, computed with the SVD (singular value decomposition)!

Geometrically, it is a linear projection onto the lower-dimensional hyperplane spanned by the eigenvectors. It mainly preserves linear structure, and also preserves the data's total variance (information)!
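A small numpy sketch of this SVD view of PCA: center the data, take the SVD, and the right singular vectors are the principal directions; projecting onto the top two gives the lower-dimensional hyperplane. (The synthetic data here is just for illustration.)

```python
# Sketch: PCA computed directly with the SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xc = X - X.mean(axis=0)                                  # center the data

# SVD of the centered data: right singular vectors = principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = S**2 / (len(X) - 1)  # eigenvalues of the covariance matrix

# Linear projection onto the hyperplane spanned by the top-2 eigenvectors.
X2 = Xc @ Vt[:2].T
print(X2.shape)  # (200, 2)
```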





PCA can be viewed as one projection-based algorithm to a lower dimension, whose main objective is maximum variance (information).

Of course, a projection can have different objectives. For example, project the following 2D data onto a 1D line.


Maximizing variance (PCA) means projecting onto the y-axis, but that completely loses the two-mode information.

From the keep-the-two-modes angle: e.g., in a cocktail party we want to pick out two people's voices. Suppose the left half-plane is labeled "+" and the right half-plane "-", with a classifier after the projection. The best projection is then onto the x-axis.

Of course, the data above is one extreme, chosen to show that maximum variance (PCA) and multi-modality (ICA?) give completely different results.

If the data itself has a dumbbell-shaped distribution, then maximum variance (PCA) and multi-modality (ICA?) give exactly the same projection!
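The extreme case can be reproduced in a few lines of numpy (synthetic data of my own construction): two modes separated along x, but with the largest variance along y. PCA picks the y direction and collapses the modes; projecting onto x keeps them separable.

```python
# Sketch: maximum-variance projection vs. mode-preserving projection.
import numpy as np

rng = np.random.default_rng(1)
# Left mode ("+" label) and right mode ("-" label), stretched along y.
left = np.column_stack([rng.normal(-2, 0.3, 300), rng.normal(0, 5, 300)])
right = np.column_stack([rng.normal(+2, 0.3, 300), rng.normal(0, 5, 300)])
X = np.vstack([left, right])
Xc = X - X.mean(axis=0)

# Top principal direction (largest variance) is dominated by y.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]
print(abs(pc1))  # y component dominates -> the two modes merge

# Projecting onto x instead cleanly separates the "+" and "-" half-planes.
proj_x = X[:, 0]
print((proj_x[:300] < 0).mean(), (proj_x[300:] > 0).mean())
```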


ICA – Maximize modality? 



The standard claim is that ICA suits the cocktail party problem of separating two voices, as in the figure below. But Andrew Ng's CS2?? uses SVD (PCA) to separate two voices (or one music track).
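A toy cocktail-party sketch with scikit-learn's FastICA (the library and the synthetic signals are my assumptions): two source signals are linearly mixed at two "microphones", then unmixed by ICA.

```python
# Sketch: ICA separating two mixed source signals (cocktail party toy).
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)             # "voice" 1: sinusoid
s2 = np.sign(np.sin(3 * t))    # "voice" 2: square wave
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix (two microphones)
X = S @ A.T                              # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)  # recovered sources (up to scale and order)
print(S_hat.shape)
```

ICA recovers the sources only up to permutation and scaling, which is why the check below uses absolute correlations.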


Cardio PhysioNet detection!!

Baby cry detection!!  Using MATLAB



Linear dimension reduction works best when the data features are linearly correlated (e.g., PCA suits a multivariate Gaussian distribution; ICA suits a multivariate non-Gaussian distribution).

But when the data itself is highly nonlinear, they may not apply, just as a linear classifier does not suit a nonlinear data structure. For example:

* Camera vision (3D projected onto a 2D surface): high-dimensional data sits on a nonlinear manifold.

* Neural network (the structure itself contains nonlinear elements).

* 2D shadow manifold, etc.


Two approaches: (i) Kernel trick; (ii) Nonlinear dimension reduction!

PCA plus the kernel trick is like SVM plus the kernel trick: both are ways of extending a linear machine to a nonlinear data set.


PCA + kernel trick is, in effect, a manifold method, analogous to:

– SVM + kernel trick for nonlinear data

– SVM + kernel trick for manifold data


The other approach is nonlinear dimension reduction.

A natural method is manifold learning, which is one kind of nonlinear dimension reduction.


Nonlinear Dimension Reduction

Isomap learning (an optimization that preserves "distance")

First, look at MDS (Multi-Dimensional Scaling): dimension reduction that preserves pair-wise distances.

Ref: http://colah.github.io/posts/2014-10-Visualizing-MNIST/

Let’s be a bit more precise. For any two MNIST data points, x_i and x_j, there are two notions of distance between them. One is the distance between them in the original space and one is the distance between them in our visualization (reduced dimension). We will use d*_ij to denote the distance between x_i and x_j in the original space and d_ij to denote the distance between x_i and x_j in our visualization. Now we can define a cost:

C = Σ_{i≠j} (d*_ij − d_ij)²
This value describes how bad a visualization is. It basically says: “It’s bad for distances to not be the same. In fact, it’s quadratically bad.” If it’s high, it means that distances are dissimilar to the original space. If it’s small, it means they are similar. If it is zero, we have a ‘perfect’ embedding.
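The cost above is easy to write down directly; here is a plain-numpy sketch (my own helper names) that computes it for an original data set and a candidate embedding:

```python
# Sketch: the pairwise-distance (MDS) cost of a visualization.
import numpy as np

def pairwise_dist(X):
    """All pairwise Euclidean distances as an (n, n) matrix."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def mds_cost(X_orig, X_embed):
    """Sum over pairs of (d*_ij - d_ij)^2; each pair counted once."""
    d_star = pairwise_dist(X_orig)   # distances in the original space
    d = pairwise_dist(X_embed)       # distances in the visualization
    return ((d_star - d) ** 2).sum() / 2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
print(mds_cost(X, X))  # identical distances -> zero: a 'perfect' embedding
```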



MDS is an isomapping. Another mapping is the Sammon mapping.

If we use the following cost function instead, the optimization emphasizes local structure over global structure, since each term is weighted by the inverse of the original distance:

C = Σ_{i≠j} (d*_ij − d_ij)² / d*_ij

Another approach takes an energy view: like attracts, unlike repels, on a graph (V, E).
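A minimal sketch of this energy view, assuming networkx: build a k-nearest-neighbor graph (edges act as attracting springs, non-edges effectively repel) and embed it in 2D with a force-directed spring layout.

```python
# Sketch: graph-based embedding via a force-directed (energy) layout.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated clusters in 5D.
X = np.vstack([rng.normal(0, 0.3, (30, 5)), rng.normal(3, 0.3, (30, 5))])

G = nx.Graph()
for i in range(len(X)):
    d = np.linalg.norm(X - X[i], axis=1)
    for j in np.argsort(d)[1:4]:  # connect each point to its 3 neighbors
        G.add_edge(i, int(j))

pos = nx.spring_layout(G, seed=0)  # 2D energy-minimizing positions
print(len(pos))                    # one 2D position per point
```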



Nonlinear (Low-Dimensional) Embedding (i.e. Dimension Reduction) – Preserve (Neighbor) Topology

The Isomap approach, described above, tries to preserve the distance between pairs of points (minimize the distance differences) while reducing the dimension.

This is one kind of geometric optimization. An even more basic geometric property is topology (distance is a more advanced property, tied to a metric). What are topological properties? There are global and local ones; in machine learning the local ones usually matter more. Which local properties are metric-free? We can use neighborhood counts as features (edges, vertices, faces). F + V − E = 2 for a zero-hole topology.
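The Euler relation F + V − E = 2 is easy to verify on the zero-hole (sphere-like) solids; a quick check:

```python
# Check F + V - E = 2 for genus-0 (zero-hole) polyhedra.
solids = {
    "tetrahedron": (4, 6, 4),    # (V, E, F)
    "cube":        (8, 12, 6),
    "octahedron":  (6, 12, 8),
}
for name, (V, E, F) in solids.items():
    print(name, F + V - E)  # 2 for every sphere-like surface
```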



The most famous neighbor algorithm is probably k-nearest neighbors (some book's top-10 algorithm list seems to include KNN). KNN is a clustering algorithm. (Wrong! K-means is the unsupervised clustering algorithm!! KNN is a supervised classification or regression algorithm.)


NCA (Neighborhood Component Analysis)

Reference: http://www.wikicoursenote.com/wiki/Learning_a_Nonlinear_Embedding_by_Preserving_Class_Neighborhood_Structure


NCA also comes in linear and nonlinear forms. The linear form is defined as above: find the matrix W. But if W is instead the weight factors of a nonlinear network, NCA becomes nonlinear NCA.

RBM + NCA reaches about 99% accuracy on MNIST (very good; before deep learning this was about the best).
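The linear form of NCA is available in scikit-learn as `NeighborhoodComponentsAnalysis`; here is a sketch on the small digits data set (not MNIST, and without the RBM, so the numbers here are not comparable to the 99% figure above):

```python
# Sketch: linear NCA (learn the matrix W) followed by KNN classification.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import (KNeighborsClassifier,
                               NeighborhoodComponentsAnalysis)
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nca_knn = Pipeline([
    ("nca", NeighborhoodComponentsAnalysis(n_components=2, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
nca_knn.fit(X_train, y_train)
print(nca_knn.score(X_test, y_test))  # accuracy after reducing to 2D
```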

The 2D dimension-reduction result is shown below.



Next is a neighborhood notion based on a statistical method, explained below.

T-distributed Stochastic Neighbor Embedding (t-SNE)

The final technique I wish to introduce is the t-Distributed Stochastic Neighbor Embedding (t-SNE). This technique is extremely popular in the deep learning community. Unfortunately, t-SNE’s cost function involves some non-trivial mathematical machinery and requires some significant effort to understand.

But, roughly, what t-SNE tries to optimize for is preserving the topology of the data. For every point, it constructs a notion of which other points are its ‘neighbors,’ trying to make all points have the same number of neighbors. Then it tries to embed them so that those points all have the same number of neighbors.

In some ways, t-SNE is a lot like the graph based visualization. But instead of just having points be neighbors (if there’s an edge) or not neighbors (if there isn’t an edge), t-SNE has a continuous spectrum of having points be neighbors to different extents.

t-SNE is often very successful at revealing clusters and subclusters in data.
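A sketch with scikit-learn's `TSNE` on a small subset of the digits data (subset size and perplexity are my choices; t-SNE is a visualization tool, so there is no single "correct" output):

```python
# Sketch: t-SNE embedding of digit images into 2D for visualization.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # small subset: t-SNE is relatively slow

emb = TSNE(n_components=2, perplexity=30,
           random_state=0).fit_transform(X)
print(emb.shape)  # (500, 2): one 2D point per image, clusters per digit
```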




My suggestion, again: try them all (PCA, ICA, NCA, factor analysis, linear and nonlinear)! Don't start with a preconceived idea of which kind your data belongs to.