LeNet CNN Example: MNIST Handwritten Digit Recognition

by allenlu2007

MNIST digits (LeCun et al., 1998) originated from an automatic ZIP code OCR system developed for the US post office.

There are 50K training images, 10K validation images, and 10K test images in total.

Each image is a 28×28 grey-scale pixel image, and the task is to decide which of the ten digits 0, 1, …, 9 it represents.

 

Many different algorithms have been tried, but deep-learning CNNs perform remarkably well.

See “The MNIST Database”: http://yann.lecun.com/exdb/mnist/

Linear classifier:  8% ~ 12% error

K-nearest-neighbor: 1.x% ~ 5% error

SVM: 0.6% ~ 1.4% error

(Conventional) Neural Nets: 1% ~ 5% error

Convolutional Nets: 0.3% ~ 1% error

 

Caffe includes a complete example (examples/01-learning-lenet.ipynb) that uses a CNN (LeNet) to solve MNIST, summarized as follows.

First, the LeNet architecture:

Each stage consists of a convolution (filter) layer + a nonlinear (sigmoid or rectified-linear) layer + a pooling (local-max subsampling) layer (a contrast-normalization layer arguably belongs here as well). The network ends with a fully connected layer for classification (a bipartite graph: a perceptron); the final loss layer simply converts the scalar scores into probabilities (the softmax function, see the Wikipedia entry on softmax) and picks the class with the highest probability (a hard decision).

 

[Figures: LeNet architecture]

Solving with LeNet in Python (Example)

The LeNet architecture (defined in lenet_auto_train.prototxt) is as follows. But what are the dimensions?

import caffe
from caffe import layers as L
from caffe import params as P

def lenet(lmdb, batch_size):
    # our version of LeNet: a series of linear and simple nonlinear transformations
    n = caffe.NetSpec()
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1./255), ntop=2)
    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.ip1 = L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.ip1, in_place=True)
    n.ip2 = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
    return n.to_proto()
    
with open('examples/mnist/lenet_auto_train.prototxt', 'w') as f:
    f.write(str(lenet('examples/mnist/mnist_train_lmdb', 64)))
    
with open('examples/mnist/lenet_auto_test.prototxt', 'w') as f:
    f.write(str(lenet('examples/mnist/mnist_test_lmdb', 100)))

Layer: data -> conv1 -> pool1 -> conv2 -> pool2 -> fully connected 1 ->  relu1 -> fully connected 2 -> loss (softmax label)

Dimensions: 28×28 → 20×24×24 → 20×12×12 → 50×8×8 → 50×4×4 → 500 → 500 → 10 (explained below).

xavier is the initialization type of the weight filler; see this article for details.
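As a rough numpy sketch of the idea (the helper name is mine, not Caffe's): the xavier filler draws uniform weights with variance 1/fan_in.

```python
import numpy as np

def xavier_uniform(fan_in, shape, rng=None):
    # Uniform with variance 1/fan_in, i.e. bounds of +/- sqrt(3 / fan_in),
    # matching the idea behind Caffe's "xavier" weight filler.
    if rng is None:
        rng = np.random.default_rng(0)
    bound = np.sqrt(3.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

# conv1 above: fan_in = 1 input channel * 5 * 5 kernel = 25
w_conv1 = xavier_uniform(25, (20, 1, 5, 5))
```

Keeping the weight variance tied to fan_in keeps activations at roughly the same scale from layer to layer, which helps gradients propagate early in training.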

 

Q1: Why is there no nonlinearity after conv1 and conv2? Do pool1 and pool2 (local max) themselves provide a nonlinearity?

Pooling should not really count as a nonlinearity; even max pooling (as opposed to average pooling) is only a very weak one.

Viewed as purely linear operations, conv1 and conv2 could be merged into a single new conv layer of higher dimension, since the composition of linear operations is linear.

In practice, though, keeping conv1 and conv2 separate greatly shrinks the parameter space (on the order of f1 + f2 instead of f1 × f2 parameters).

So conv1 and conv2 can also be viewed as a factored version of a single higher-dimensional layer.
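The merging claim is easy to sanity-check in 1-D with numpy: without a nonlinearity in between, two convolutions collapse exactly into one convolution by the composed filter (the signal and filter values below are arbitrary):

```python
import numpy as np

x = np.random.default_rng(1).normal(size=50)  # an arbitrary 1-D signal
f1 = np.array([1.0, 2.0, 3.0])                # first linear filter
f2 = np.array([0.5, -1.0, 0.25])              # second linear filter

# Applying f1 then f2 is identical to one pass with the composed filter f1*f2.
two_layers = np.convolve(np.convolve(x, f1), f2)
one_layer = np.convolve(x, np.convolve(f1, f2))
print(np.allclose(two_layers, one_layer))  # True
```

This is exactly why the nonlinearity between layers matters: without it, stacking conv layers adds parameters but no expressive power.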

 

Q3: Is a local contrast layer missing?

Probably because the images are already black-and-white, or because the preprocessed MNIST database already has good contrast.

 

Q2: Why is the nonlinearity layer (ReLU) placed only after ip1?

 

 

 

Learning Parameters and SGD with Momentum

These are defined in lenet_auto_solver.prototxt:

# The train/test net protocol buffer definition
train_net: "examples/mnist/lenet_auto_train.prototxt"
test_net: "examples/mnist/lenet_auto_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
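To make these settings concrete, here is a small sketch (function names are mine) of Caffe's "inv" learning-rate policy and the momentum/weight-decay update that these solver parameters drive:

```python
def inv_lr(base_lr, gamma, power, it):
    # Caffe's "inv" policy: lr = base_lr * (1 + gamma * iter) ** (-power)
    return base_lr * (1.0 + gamma * it) ** (-power)

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    # SGD with momentum and L2 weight decay, for a single parameter:
    # the velocity v accumulates a decaying history of past gradients.
    v = momentum * v - lr * (grad + weight_decay * w)
    return w + v, v

print(inv_lr(0.01, 0.0001, 0.75, 0))      # 0.01 at iteration 0
print(inv_lr(0.01, 0.0001, 0.75, 10000))  # ~0.0059 by max_iter
```

So with gamma = 0.0001 and power = 0.75 the learning rate decays smoothly from 0.01 to roughly 0.006 over the 10,000 iterations.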
 


 

Data Dimension and Parameter Space Dimension

The first thing is to confirm each layer's dimensions, i.e., the data/feature size stored at each layer (the blob dimensions): the data path.

 
# each output is (batch size, feature dim, spatial dim)
[(k, v.data.shape) for k, v in solver.net.blobs.items()]
[('data', (64, 1, 28, 28)),
 ('label', (64,)),
 ('conv1', (64, 20, 24, 24)),
 ('pool1', (64, 20, 12, 12)),
 ('conv2', (64, 50, 8, 8)),
 ('pool2', (64, 50, 4, 4)),
 ('ip1', (64, 500)),
 ('ip2', (64, 10)),
 ('loss', ())]

k is the blob name. Note that relu1 operates in place, so its output shares ip1's blob and no separate relu1 entry appears.

The first number in v.data.shape is the batch size (64), used for mini-batch SGD during training; loss is the loss of each mini-batch.

The second number is the feature dimension (the number of colors in the figure above); the last two numbers are the spatial dimensions (the pixel block size of one color in the figure above).

data: 1 channel (gray level); 28×28 spatial → 784 dims

conv1: 20 channels (20 FIR filters); 24×24 spatial (pixel-like) → 11520 dims

pool1: 20 channels (same as above, smaller spatial dimension); 12×12 spatial (subsampling) → 2880 dims

conv2: 50 channels (50 FIR filters); 8×8 spatial → 3200 dims

pool2: 50 channels (same as above); 4×4 spatial → 800 dims

ip1: 500-dim vector

ip2: 10-dim vector
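These spatial sizes all follow from the standard formula out = (in − kernel + 2·pad)/stride + 1 (there is no padding anywhere in this net); a quick check:

```python
def out_size(in_size, kernel, stride=1, pad=0):
    # Spatial output size of a convolution or pooling layer.
    return (in_size + 2 * pad - kernel) // stride + 1

s = 28
s = out_size(s, 5)     # conv1: 24
s = out_size(s, 2, 2)  # pool1: 12
s = out_size(s, 5)     # conv2: 8
s = out_size(s, 2, 2)  # pool2: 4
print(s)  # 4, so flattening pool2 gives 50 * 4 * 4 = 800 inputs to ip1
```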

———————————————————————

 

Next comes the key part: the parameter space (the FIR filter taps), the control path.

 
# just print the weight sizes (not biases)
[(k, v[0].data.shape) for k, v in solver.net.params.items()]
[('conv1', (20, 1, 5, 5)),
 ('conv2', (50, 20, 5, 5)),
 ('ip1', (500, 800)),
 ('ip2', (10, 500))]

conv1: 20 FIR filters with 5×5 taps each. Note these are shared-weight filters, so all "block pixels" use the same weights (20 sets of them, one per filter) → parameter space = 20×1×5×5 = 500 dims.

conv2: 50 FIR filters with 5×5 taps, but each filter spans all 20 input channels (that is where the 20 comes from), so the parameter space is 50×20×5×5 = 25000 dims, not just 50×5×5 = 1250.

ip1: the 800 inputs are the flattened pool2 blob (50×4×4 = 800), giving 500×800 weights; ip2 likewise has 10×500 weights.
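A quick tally of the weight shapes printed above makes the counts explicit:

```python
import numpy as np

# Weight shapes from solver.net.params above.
shapes = {
    'conv1': (20, 1, 5, 5),   # 20 filters over 1 input channel
    'conv2': (50, 20, 5, 5),  # each of the 50 filters spans all 20 channels
    'ip1':   (500, 800),      # 800 = 50 channels * 4 * 4 from pool2
    'ip2':   (10, 500),
}
counts = {k: int(np.prod(v)) for k, v in shapes.items()}
print(counts)  # {'conv1': 500, 'conv2': 25000, 'ip1': 400000, 'ip2': 5000}
```

Note how the fully connected ip1 dominates the parameter count; the shared-weight conv layers are comparatively tiny.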

 

Results and How to Debug

After training (i.e., once all parameters have converged), how do we debug?

Step 1: Run the training data and test data through forward propagation:

solver.net.forward()  # train net
solver.test_nets[0].forward()  # test net (there can be more than one)

 

First, look at the training (input) data (8 images) and check whether the decoded labels (digits) are correct.

# we use a little trick to tile the first eight images
imshow(solver.net.blobs['data'].data[:8, 0].transpose(1, 0, 2).reshape(28, 8*28), cmap='gray')
print solver.net.blobs['label'].data[:8]
 
[ 5.  0.  4.  1.  9.  2.  1.  3.]
 
[Figure: the first eight training images]
The input data are the images above: 28×28 (×8) for the first 8 images.
The output labels are the numbers above.
Now check the test data in the same way:
imshow(solver.test_nets[0].blobs['data'].data[:8, 0].transpose(1, 0, 2).reshape(28, 8*28), cmap='gray')
print solver.test_nets[0].blobs['label'].data[:8]
 
[ 7.  2.  1.  0.  4.  1.  4.  9.]
 
[Figure: the first eight test images]

 

Step 2: Visualize the FIR filters

Let’s take one step of (minibatch) SGD and see what happens.

solver.step(1) 


Do we have gradients propagating through our filters? Let's see the updates to the first layer, shown here as a 4×5 grid of 5×5 filters.

(A 20-cell grid of 5×5 filters, i.e., the parameter space, not the feature space!)

As noted earlier, conv1 has 20 FIR (5×5) filters, visualized below.

Is this the converged result, or the result after the first mini-batch of training? It should be just the first mini-batch result, since solver.step(1) ran only one iteration.


imshow(solver.net.params['conv1'][0].diff[:, 0].reshape(4, 5, 5, 5)
       .transpose(0, 2, 1, 3).reshape(4*5, 5*5), cmap='gray')
 
[Figure: updates to the conv1 filters, shown as a 4×5 grid of 5×5 filters]

 Something is happening. Let’s run the net for a while, keeping track of a few things as it goes.

 

Step 3: Visualize the intermediate data/features

 

%%time
niter = 200
test_interval = 25
# losses will also be stored in the log
train_loss = zeros(niter)
test_acc = zeros(int(np.ceil(niter / test_interval)))
output = zeros((niter, 8, 10))

# the main solver loop
for it in range(niter):
    solver.step(1)  # SGD by Caffe
    
    # store the train loss
    train_loss[it] = solver.net.blobs['loss'].data
    
    # store the output on the first test batch
    # (start the forward pass at conv1 to avoid loading new data)
    solver.test_nets[0].forward(start='conv1')
    output[it] = solver.test_nets[0].blobs['ip2'].data[:8]
    
    # run a full test every so often
    # (Caffe can also do this for us and write to a log, but we show here
    #  how to do it directly in Python, where more complicated things are easier.)
    if it % test_interval == 0:
        print 'Iteration', it, 'testing...'
        correct = 0
        for test_it in range(100):
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['ip2'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        test_acc[it // test_interval] = correct / 1e4

First, look at the loss function and test accuracy:

_, ax1 = subplots()
ax2 = ax1.twinx()
ax1.plot(arange(niter), train_loss)
ax2.plot(test_interval * arange(len(test_acc)), test_acc, 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy') 
 
About 95% accuracy, which still isn't great; some fine-tuning is probably needed. Still, getting this on the first pass seems quite good.

Let's look at a few examples.

Below are the input data and the outputs, though not the final labels but the ip2 values as above. ip2 is a 10-dimensional vector of scores; we add an iteration dimension as follows.

 

for i in range(8):
    figure(figsize=(2, 2))
    imshow(solver.test_nets[0].blobs['data'].data[i, 0], cmap='gray')
    figure(figsize=(10, 2))
    imshow(output[:50, i].T, interpolation='nearest', cmap='gray')
    xlabel('iteration')
    ylabel('label')

[Figures: each of the eight test digits, paired with its raw ip2 outputs over the first 50 iterations]

Note that the last case looks like both a 9 and a 4. The ip2 outputs are raw scores; we can convert them into a probability vector (using exp and a partition function), i.e., the softmax function (see the Wikipedia entry on softmax), to see more clearly how they map to the final label.

for i in range(8):
    figure(figsize=(2, 2))
    imshow(solver.test_nets[0].blobs['data'].data[i, 0], cmap='gray')
    figure(figsize=(10, 2))
    imshow(exp(output[:50, i].T) / exp(output[:50, i].T).sum(0), interpolation='nearest', cmap='gray')
    xlabel('iteration')
    ylabel('label')
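The exp/normalize expression in the loop above is exactly the softmax; isolated as a small numpy function (the max subtraction is the usual trick for numerical stability):

```python
import numpy as np

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.round(3))  # [0.659 0.242 0.099]
```

The output is a valid probability vector (non-negative, sums to 1), and argmax of the probabilities gives the hard-decision label.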

[Figures: each of the eight test digits, paired with its softmax probability trajectory over the first 50 iterations]

 
