# Supervised Learning

In this tutorial, we're going to learn how to define a model, and train it using a supervised approach, to solve a multiclass classification task. Some of the material here is based on this existing tutorial.

The tutorial demonstrates how to:

• pre-process the (train and test) data, to facilitate learning
• describe a model to solve a classification task
• choose a loss function to minimize
• define a sampling procedure (stochastic, mini-batches), and apply one of several optimization techniques to train the model's parameters
• estimate the model's performance on unseen (test) data

Each of these 5 steps is accompanied by a script, provided on GitHub, on this page:

• 1_data.lua
• 2_model.lua
• 3_loss.lua
• 4_train.lua
• 5_test.lua

A top script, doall.lua, is also provided to run the complete procedure at once.

At the end of each section, I propose a couple of exercises, which are mostly intended to make you modify the code, and get a good idea of the effect of each parameter on the global procedure. Although the exercises are proposed at the end of each section, they should be done after you've read the complete tutorial, as they (almost) all require you to run the doall.lua script, to get training results.

The complete dataset is big, and we don't have time to play with the full set in this short tutorial session. The script doall.lua comes with a -size flag, which you should set to small, to only use 10,000 training samples.

The example scripts provided are quite verbose, on purpose. Instead of relying on opaque classes, dataset creation and the training loop are basically exposed right here. Although a bit challenging at first, it should help new users quickly become independent, and able to tweak the code for their own problems.

On top of the scripts above, I provide an extra script, A_slicing.lua, which should help you understand how tensor/array slicing works in Torch (if you're a Matlab user, you should already be familiar with the concept, so it's just a matter of syntax).

You can now follow these steps, in order:

## Step 1: Data

The code for this section is in 1_data.lua. Run it like this:

th -i 1_data.lua

This will give you an interpreter to play with the data once it's loaded/preprocessed.

For this tutorial, we'll be using the Street View House Numbers (SVHN) dataset (http://ufldl.stanford.edu/housenumbers/). SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.

Overview of the dataset:

• 10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9 and '0' has label 10.
• 73257 digits for training, 26032 digits for testing, and 531131 additional, somewhat less difficult samples, to use as extra training data
• Comes in two formats:
    • Original images with character level bounding boxes.
    • MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides).

We will be using the second format. In terms of dimensionality:

• the inputs (images) are 3x32x32
• the outputs (targets) are 10-dimensional
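The label convention above (digit '0' is stored as label 10) is easy to trip over when reading predictions. Here is a minimal sketch in plain Lua; `labelToDigit` and `digitToLabel` are hypothetical helper names, not part of the provided scripts:

```lua
-- Hypothetical helpers (not in the tutorial scripts): convert between
-- SVHN labels (1..10, where 10 stands for the digit 0) and actual digits.
function labelToDigit(label)
   return label % 10
end

function digitToLabel(digit)
   if digit == 0 then return 10 end
   return digit
end

-- Number of values in one 3x32x32 input image once flattened
ninputs = 3 * 32 * 32

print(labelToDigit(10), labelToDigit(7), digitToLabel(0), ninputs)
```

The same flattened size is what a linear model or an MLP would need as its input dimension.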

In this first section, we are going to preprocess the data to facilitate training.

The script provided automatically retrieves the dataset, so all we have to do is load it:

-- We load the dataset from disk, and re-arrange it to be compatible
-- with Torch's representation. Matlab uses a column-major representation,
-- Torch is row-major, so we just have to transpose the data.

-- Note: the data, in X, is 4-d: the 1st dim indexes the samples, the 2nd
-- dim indexes the color channels (RGB), and the last two dims index the
-- height and width of the samples.

-- train_file and test_file are the dataset paths defined earlier in 1_data.lua
loaded = torch.load(train_file,'ascii')
trainData = {
   data = loaded.X:transpose(3,4),
   labels = loaded.y[1],
   size = function() return (#trainData.data)[1] end
}

loaded = torch.load(test_file,'ascii')
testData = {
   data = loaded.X:transpose(3,4),
   labels = loaded.y[1],
   size = function() return (#testData.data)[1] end
}

Preprocessing requires a floating point representation (the original data is stored as bytes). Types can be easily converted in Torch, in general by doing: dst = src:type('torch.TypeTensor'), where Type is one of 'Float', 'Double', 'Byte', 'Int', ... Shortcuts are provided for simplicity (float(), double(), cuda(), ...):

trainData.data = trainData.data:float()
testData.data = testData.data:float()

We now preprocess the data. Preprocessing is crucial when applying pretty much any kind of machine learning algorithm.

For natural images, we use several intuitive tricks:

• images are mapped into YUV space, to separate luminance information from color information
• the luminance channel (Y) is locally normalized, using a contrastive normalization operator: for each neighborhood, defined by a Gaussian kernel, the mean is suppressed, and the standard deviation is normalized to one.
• color channels are normalized globally, across the entire dataset; as a result, each color component has zero mean and unit standard deviation across the dataset.

-- Convert all images to YUV
print '==> preprocessing data: colorspace RGB -> YUV'
for i = 1,trainData:size() do
   trainData.data[i] = image.rgb2yuv(trainData.data[i])
end
for i = 1,testData:size() do
   testData.data[i] = image.rgb2yuv(testData.data[i])
end

-- Name channels for convenience
channels = {'y','u','v'}

-- Normalize each channel, and store mean/std
-- per channel. These values are important, as they are part of
-- the trainable parameters. At test time, test data will be normalized
-- using these values.

print '==> preprocessing data: normalize each feature (channel) globally'
mean = {}
std = {}
for i,channel in ipairs(channels) do
   -- normalize each channel globally:
   mean[i] = trainData.data[{ {},i,{},{} }]:mean()
   std[i] = trainData.data[{ {},i,{},{} }]:std()
   trainData.data[{ {},i,{},{} }]:add(-mean[i])
   trainData.data[{ {},i,{},{} }]:div(std[i])
end

-- Normalize test data, using the training means/stds
for i,channel in ipairs(channels) do
   -- normalize each channel globally:
   testData.data[{ {},i,{},{} }]:add(-mean[i])
   testData.data[{ {},i,{},{} }]:div(std[i])
end

-- Local normalization
print '==> preprocessing data: normalize Y (luminance) channel locally'

-- Define the normalization neighborhood:
neighborhood = image.gaussian1D(7)

-- Define our local normalization operator (It is an actual nn module,
-- which could be inserted into a trainable model):
normalization = nn.SpatialContrastiveNormalization(1, neighborhood):float()

-- Normalize all Y channels locally:
for i = 1,trainData:size() do
   trainData.data[{ i,{1},{},{} }] = normalization(trainData.data[{ i,{1},{},{} }])
end
for i = 1,testData:size() do
   testData.data[{ i,{1},{},{} }] = normalization(testData.data[{ i,{1},{},{} }])
end
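
The key point of the global step is that the mean and standard deviation are estimated on the training data only, then reused unchanged on the test data. The following plain-Lua sketch (no Torch required) shows this on a handful of made-up numbers; `meanStd` and `normalize` are hypothetical helpers, and dividing by n-1 in the standard deviation is an assumption of this sketch:

```lua
-- Plain-Lua sketch of global normalization on made-up values.
function meanStd(values)
   local n = #values
   local sum = 0
   for _, v in ipairs(values) do sum = sum + v end
   local mean = sum / n
   local sq = 0
   for _, v in ipairs(values) do sq = sq + (v - mean)^2 end
   -- unbiased estimate (divide by n-1), an assumption of this sketch
   return mean, math.sqrt(sq / (n - 1))
end

function normalize(values, mean, std)
   local out = {}
   for i, v in ipairs(values) do out[i] = (v - mean) / std end
   return out
end

trainValues = {2, 4, 4, 4, 5, 5, 7, 9}
testValues  = {3, 6}

-- statistics are estimated on the training set only...
trainMean, trainStd = meanStd(trainValues)

-- ...and reused, unchanged, on both sets
trainNorm = normalize(trainValues, trainMean, trainStd)
testNorm  = normalize(testValues, trainMean, trainStd)

-- mean and standard deviation of the normalized training values
print(meanStd(trainNorm))
```

The normalized training values come out with a mean close to 0 and a standard deviation close to 1, while the test values are only approximately normalized, since they were never used to estimate the statistics.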

At this stage, it's good practice to verify that data is properly normalized:

for i,channel in ipairs(channels) do
   trainMean = trainData.data[{ {},i }]:mean()
   trainStd = trainData.data[{ {},i }]:std()

   testMean = testData.data[{ {},i }]:mean()
   testStd = testData.data[{ {},i }]:std()

   print('training data, '..channel..'-channel, mean: ' .. trainMean)
   print('training data, '..channel..'-channel, standard deviation: ' .. trainStd)

   print('test data, '..channel..'-channel, mean: ' .. testMean)
   print('test data, '..channel..'-channel, standard deviation: ' .. testStd)
end


We can then get an idea of how the preprocessing transformed the data by displaying it:

-- Visualization is quite easy, using itorch.image() in an iTorch notebook
-- (outside of a notebook, image.display() plays the same role).

first256Samples_y = trainData.data[{ {1,256},1 }]
first256Samples_u = trainData.data[{ {1,256},2 }]
first256Samples_v = trainData.data[{ {1,256},3 }]
itorch.image(first256Samples_y)
itorch.image(first256Samples_u)
itorch.image(first256Samples_v)

### Exercise:

This is not the only kind of normalization! Data can be normalized in different manners, for instance, by normalizing individual features across the dataset (in this case, the pixels). Try these different normalizations, and see the impact they have on the training convergence.

## Step 2: Model Definition

The code for this section is in 2_model.lua. Run it like this:

th -i 2_model.lua -model linear
th -i 2_model.lua -model mlp
th -i 2_model.lua -model convnet

In this file, we describe three different models: convolutional neural networks (CNNs, or ConvNets), multi-layer neural networks (MLPs), and a simple linear model (which becomes a logistic regression if used with a negative log-likelihood loss).

Linear regression is the simplest type of model. It is parametrized by a weight matrix W, and a bias vector b. Mathematically, it can be written as:

$$y^n = Wx^n + b$$

Using the nn package, describing ConvNets, MLPs and other forms of sequential trainable models is really easy. All we have to do is create a top-level wrapper, which is going to be a sequential module, and then append modules to it. Implementing a simple linear model is therefore trivial:

model = nn.Sequential()
model:add( nn.Linear(ninputs, noutputs) )

A slightly more complicated model is the multi-layer neural network (MLP). This model is parametrized by two weight matrices, and two bias vectors:

$$y^n = W_2 \, \text{sigmoid}(W_1 x^n + b_1) + b_2$$

where the function sigmoid is typically the symmetric hyperbolic tangent function. Again, in Torch:

model = nn.Sequential()
model:add(nn.Linear(ninputs,nhiddens))
model:add(nn.Tanh())
model:add(nn.Linear(nhiddens,noutputs))

Compared to the linear regression model, the 2-layer neural network can learn arbitrary non-linear mappings between its inputs and outputs. In practice, it can be quite hard to train fully-connected MLPs to classify natural images.
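
To make the two-layer formula concrete, here is a minimal plain-Lua sketch of the forward pass (no Torch required). The helper names `affine`, `tanhAll` and `mlpForward` are hypothetical, and the weights are arbitrary illustrative values, not trained ones:

```lua
-- Plain-Lua sketch of y = W2 * tanh(W1 * x + b1) + b2.

-- affine map: returns W * x + b, with W stored as a table of rows
function affine(W, b, x)
   local y = {}
   for i = 1, #W do
      local s = b[i]
      for j = 1, #x do s = s + W[i][j] * x[j] end
      y[i] = s
   end
   return y
end

-- pointwise tanh, written with math.exp so it runs on any Lua version
function tanhAll(v)
   local out = {}
   for i = 1, #v do
      local e = math.exp(2 * v[i])
      out[i] = (e - 1) / (e + 1)
   end
   return out
end

function mlpForward(x, W1, b1, W2, b2)
   return affine(W2, b2, tanhAll(affine(W1, b1, x)))
end

-- 2 inputs -> 2 hidden units -> 1 output (arbitrary values)
W1 = {{1, 0}, {0, 1}}
b1 = {0, 0}
W2 = {{1, -1}}
b2 = {0.5}

y = mlpForward({0, 0}, W1, b1, W2, b2)
print(y[1])  -- both hidden units output tanh(0) = 0, so only b2 remains: 0.5
```

Dropping the `tanhAll` call collapses the two affine maps into a single one, which is why the non-linearity is what separates the MLP from the linear model.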

Convolutional Networks are a particular form of MLP, which was tailored to efficiently learn to classify images. Convolutional Networks are trainable architectures composed of multiple stages. The input and output of each stage are sets of arrays called feature maps. For example, if the input is a color image, each feature map would be a 2D array containing a color channel of the input image (for an audio input each feature map would be a 1D array, and for a video or volumetric image, it would be a 3D array). At the output, each feature map represents a particular feature extracted at all locations on the input. Each stage is composed of three layers: a filter bank layer, a non-linearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two or three such 3-layer stages, followed by a classification module. Each layer type is now described for the case of image recognition.

Trainable hierarchical vision models, and more generally image processing algorithms, are usually expressed as sequences of operations or transformations. They can be well described by a modular approach, in which each module processes an input image bank and produces a new bank. The figure above is a nice graphical illustration of this approach. Each module requires the previous bank to be fully (or at least partially) available before computing its output. This causality prevents simple parallelism from being implemented across modules. However, parallelism can easily be introduced within a module, and at several levels, depending on the kind of underlying operations. These forms of parallelism are exploited in Torch7.

Typical ConvNets rely on a few basic modules:

• Filter bank layer: the input is a 3D array with $n_1$ 2D feature maps of size $n_2 \times n_3$. Each component is denoted $x_{ijk}$, and each feature map is denoted $x_i$. The output is also a 3D array, $y$, composed of $m_1$ feature maps of size $m_2 \times m_3$. A trainable filter (kernel) $k_{ij}$ in the filter bank has size $l_1 \times l_2$ and connects input feature map $x_i$ to output feature map $y_j$. The module computes $y_j = b_j + \sum_i k_{ij} * x_i$, where $*$ is the 2D discrete convolution operator and $b_j$ is a trainable bias parameter. Each filter detects a particular feature at every location on the input. Hence spatially translating the input of a feature detection layer will translate the output but leave it otherwise unchanged.

• Non-Linearity Layer: In traditional ConvNets this simply consists of a pointwise tanh() sigmoid function applied to each site $(ijk)$. However, recent implementations have used more sophisticated non-linearities. A useful one for natural image recognition is the rectified sigmoid $R_{abs}$: $\text{abs}(g_i \cdot \tanh(\cdot))$, where $g_i$ is a trainable gain parameter. The rectified sigmoid is sometimes followed by a subtractive and divisive local normalization $N$, which enforces local competition between adjacent features in a feature map, and between features at the same spatial location.

• Feature Pooling Layer: This layer treats each feature map separately. In its simplest instance, it computes the average values over a neighborhood in each feature map. Recent work has shown that more selective poolings, based on the LP-norm, tend to work best, with P=2, or P=inf (also known as max pooling). The neighborhoods are stepped by a stride larger than 1 (but smaller than or equal to the pooling neighborhood). This results in a reduced-resolution output feature map which is robust to small variations in the location of features in the previous layer. The average operation is sometimes replaced by a max operation ($P_M$). Traditional ConvNets use a pointwise tanh() after the pooling layer, but more recent models do not. Some ConvNets dispense with the separate pooling layer entirely, but use strides larger than one in the filter bank layer to reduce the resolution. In some recent versions of ConvNets, the pooling also pools similar features at the same location, in addition to the same feature at nearby locations.
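
The pooling variants described above differ only in how they reduce each neighborhood to one number. The following plain-Lua sketch (no Torch required) pools a made-up 4x4 feature map with 2x2 windows and a stride of 2; `pool`, `average`, `l2norm` and `maximum` are hypothetical helpers rather than nn modules, and the L2 version is taken here to be the square root of the sum of squares, which is an assumption of this sketch:

```lua
-- Plain-Lua sketch of pooling on a single feature map.
function pool(map, size, stride, reduce)
   local out = {}
   for i = 1, #map - size + 1, stride do
      local row = {}
      for j = 1, #map[1] - size + 1, stride do
         -- gather the size x size window whose top-left corner is (i,j)
         local window = {}
         for di = 0, size - 1 do
            for dj = 0, size - 1 do
               window[#window + 1] = map[i + di][j + dj]
            end
         end
         row[#row + 1] = reduce(window)
      end
      out[#out + 1] = row
   end
   return out
end

function average(w)
   local s = 0
   for _, v in ipairs(w) do s = s + v end
   return s / #w
end

function l2norm(w)   -- P = 2
   local s = 0
   for _, v in ipairs(w) do s = s + v * v end
   return math.sqrt(s)
end

function maximum(w)  -- P = inf
   local m = w[1]
   for _, v in ipairs(w) do m = math.max(m, v) end
   return m
end

map = {
   {3, 4, 0, 0},
   {0, 0, 0, 2},
   {1, 1, 6, 8},
   {1, 1, 0, 0},
}

avgOut = pool(map, 2, 2, average)
l2Out  = pool(map, 2, 2, l2norm)
maxOut = pool(map, 2, 2, maximum)

-- top-left window {3,4,0,0} under the three reductions
print(avgOut[1][1], l2Out[1][1], maxOut[1][1])
```

Each output is a 2x2 map: the resolution is halved in both directions, which is the reduced-resolution effect described in the pooling bullet.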

Here is an example of ConvNet that we will use in this tutorial:

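-- parameters
nstates = {16,256,128}
fanin = {1,4}
filtsize = 5
poolsize = 2
normkernel = image.gaussian1D(7)

-- Container:
model = nn.Sequential()

-- stage 1 : filter bank -> squashing -> L2 pooling -> normalization
model:add(nn.SpatialConvolutionMap(nn.tables.random(nfeats, nstates[1], fanin[1]), filtsize, filtsize))
model:add(nn.Tanh())
model:add(nn.SpatialLPPooling(nstates[1],2,poolsize,poolsize,poolsize,poolsize))
model:add(nn.SpatialSubtractiveNormalization(nstates[1], normkernel))

-- stage 2 : filter bank -> squashing -> L2 pooling -> normalization
model:add(nn.SpatialConvolutionMap(nn.tables.random(nstates[1], nstates[2], fanin[2]), filtsize, filtsize))
model:add(nn.Tanh())
model:add(nn.SpatialLPPooling(nstates[2],2,poolsize,poolsize,poolsize,poolsize))
model:add(nn.SpatialSubtractiveNormalization(nstates[2], normkernel))

-- stage 3 : standard 2-layer neural network
model:add(nn.Reshape(nstates[2]*filtsize*filtsize))
model:add(nn.Linear(nstates[2]*filtsize*filtsize, nstates[3]))
model:add(nn.Tanh())
model:add(nn.Linear(nstates[3], noutputs))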

• the input has 3 feature maps, each 32x32 pixels. It is the convention for all nn.Spatial* layers to work on 3D arrays, with the first dimension indexing different features (here normalized YUV), and the next two dimensions indexing the height and width of the image/map.

• the first layer applies 16 filters to the input map (choosing randomly among its different layers [see fanin parameter]), each being 5x5. The receptive field of this first layer is 5x5, and the maps produced by it are therefore 16x28x28. This linear transform is then followed by a non-linearity (tanh), and an L2-pooling function, which pools regions of size 2x2, and uses a stride of 2x2. The result of that operation is a 16x14x14 array, which represents a 14x14 map of 16-dimensional feature vectors. The receptive field of each unit at this stage is 7x7.

• the second layer is very much analogous to the first, except that now the 16-dim feature maps are projected into 256-dim maps, with a sparse, random connection table (fan-in of 4): each unit in the output array is influenced by a 4x5x5 neighborhood of features in the previous layer. That layer therefore has 4x256x5x5 trainable kernel weights (and 256 biases). The result of the complete layer (conv+pooling) is a 256x5x5 array.

• at this stage, the 5x5 array of 256-dimensional feature vectors is flattened into a 6400-dimensional vector, which we feed to a two-layer neural net. The final prediction (10-dimensional distribution over classes) is influenced by a 32x32 neighborhood of input variables (YUV pixels).

• recent work (Jarrett et al.) has demonstrated the advantage of locally normalizing sets of internal features, at each stage of the model. The use of smoother pooling functions, such as the L2 norm instead of the harsher max-pooling, has also been shown to yield better generalization (Sermanet et al.). We use these two ingredients in this model.

• one other remark: it is typically not a good idea to use fully connected layers, in internal layers. In general, favoring large numbers of features (over-completeness) over density of connections helps achieve better results (empirical evidence of this was reported in several papers, as in Hadsell et al.). The SpatialConvolutionMap module accepts tables of connectivities (maps) that allows one to create arbitrarily sparse connections between two layers. A couple of standard maps/tables are provided in nn.tables.
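
The map sizes quoted in the bullets above follow from two simple rules, which this plain-Lua sketch spells out (no Torch required). `afterConv` and `afterPool` are hypothetical helper names, and the sketch assumes unpadded convolutions with a stride of 1 and pooling windows that do not overlap:

```lua
-- Plain-Lua sketch: spatial sizes through the ConvNet described above.
-- An unpadded convolution with a k x k filter shrinks each side by k-1;
-- pooling with window p and stride p divides each side by p.
function afterConv(n, k) return n - k + 1 end
function afterPool(n, p) return math.floor(n / p) end

nstates  = {16, 256, 128}
fanin    = {1, 4}
filtsize = 5
poolsize = 2

stage1conv = afterConv(32, filtsize)          -- stage 1 filter bank
stage1pool = afterPool(stage1conv, poolsize)  -- stage 1 pooling
stage2conv = afterConv(stage1pool, filtsize)  -- stage 2 filter bank
stage2pool = afterPool(stage2conv, poolsize)  -- stage 2 pooling

-- size of the vector fed to the final two-layer network
flattened = nstates[2] * stage2pool * stage2pool

-- kernel weights in the second filter bank (fan-in x outputs x 5 x 5)
stage2weights = fanin[2] * nstates[2] * filtsize * filtsize

print(stage1conv, stage1pool, stage2pool, flattened, stage2weights)
```

The printed values are 28, 14, 5, 6400 and 25600, matching the 16x28x28, 16x14x14 and 256x5x5 arrays and the 6400-dimensional vector mentioned above. Changing `filtsize` or `poolsize` here is a quick way to check what the Reshape and Linear sizes in stage 3 would have to become.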

### Exercises:

The number of meta-parameters to adjust can be daunting at first. Try to get a feeling for the influence of these parameters on the learning convergence:

• going from the MLP to a ConvNet of similar size (you will need to think a little bit about the equivalence between the ConvNet states and the MLP states)

• replacing the 2-layer MLP on top of the ConvNet by a simpler linear classifier

• replacing the L2-pooling function by a max-pooling

• replacing the two-layer ConvNet by a single layer ConvNet with a much larger pooling area (to conserve the size of the receptive field)

## Step 3: Loss Function

Now that we have a model, we need to define a loss function to be minimized, across the entire training set:

$$L = \sum_n l(y^n,t^n)$$

One of the simplest loss functions we can minimize is the mean-square error between the predictions (outputs of the model), and the groundtruth labels, across the entire dataset:

$$l(y^n,t^n) = \frac{1}{2} \sum_i (y_i^n - t_i^n)^2$$

or, in Torch:

criterion = nn.MSECriterion()

The MSE loss is typically not a good one for classification, as it forces the model to exactly predict the values imposed by the targets (labels).

Instead, a more commonly used, probabilistic objective is the negative log-likelihood. To minimize a negative log-likelihood, we first need to turn the predictions of our models into properly normalized log-probabilities. For the linear model, this is achieved by feeding the output units into a softmax function, which turns the linear regression into a logistic regression:

$$P(Y=i|x^n,W,b) = \text{softmax}(Wx^n+b)_i = \frac{ e^{W_i x^n+b_i} }{ \sum_j e^{W_j x^n+b_j} }$$

As we're interested in classification, the final prediction is then achieved by taking the argmax of this distribution:

$$y^n = \arg\max_i P(Y=i|x^n,W,b)$$

in which case the output y is a scalar.

More generally, the output of any model can be turned into normalized log-probabilities, by stacking a softmax function on top. So given any of the models defined above, we can simply do:

model:add( nn.LogSoftMax() )

We want to maximize the likelihood of the correct (target) class, for each sample in the dataset. This is equivalent to minimizing the negative log-likelihood (NLL), or minimizing the cross-entropy between the predictions of our model and the targets (training data). Mathematically, the per-sample loss can be defined as:

$$l(x^n,t^n) = -\log(P(Y=t^n|x^n,W,b))$$

Given that our model already produces log-probabilities (thanks to the softmax), the loss is quite straightforward to estimate. In Torch, we use the ClassNLLCriterion, which expects its input as being a vector of log-probabilities, and the target as being an integer pointing to the correct class:

criterion = nn.ClassNLLCriterion()
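
To see what the log-softmax and the negative log-likelihood compute, here is a minimal plain-Lua sketch on a single score vector (no Torch required). `logSoftMax`, `classNLL` and `argmax` are hypothetical helpers that only mimic the math above; they are not the nn implementations, and the scores are made up:

```lua
-- Plain-Lua sketch of log-softmax followed by the per-sample NLL.
function logSoftMax(scores)
   -- subtract the largest score before exponentiating, for numerical stability
   local max = scores[1]
   for _, s in ipairs(scores) do max = math.max(max, s) end
   local sum = 0
   for _, s in ipairs(scores) do sum = sum + math.exp(s - max) end
   local logZ = max + math.log(sum)
   local out = {}
   for i, s in ipairs(scores) do out[i] = s - logZ end
   return out
end

-- the loss is minus the log-probability assigned to the target class
function classNLL(logProbs, target)
   return -logProbs[target]
end

function argmax(v)
   local best = 1
   for i = 2, #v do
      if v[i] > v[best] then best = i end
   end
   return best
end

scores = {1.0, 3.0, 0.5}
logProbs = logSoftMax(scores)

print(argmax(logProbs))        -- index of the most likely class
print(classNLL(logProbs, 2))   -- small loss: class 2 has the highest score
print(classNLL(logProbs, 3))   -- larger loss: class 3 has the lowest score
```

Note that the target is an integer class index, as with ClassNLLCriterion, rather than a 10-dimensional vector.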

Finally, another type of classification loss is the multi-class margin loss, which is closer to the well-known SVM loss. This loss function doesn't require normalized outputs, and can be implemented like this:

criterion = nn.MultiMarginCriterion()

The margin loss typically works on par with the negative log-likelihood. I haven't tested this thoroughly, so it's time for more exercises.
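
For comparison with the NLL, here is a plain-Lua sketch of a multi-class margin loss (no Torch required). It assumes the common hinge form, with a margin of 1 and the sum of violations divided by the number of classes; `multiMargin` is a hypothetical helper and the scores are made up:

```lua
-- Plain-Lua sketch of a multi-class margin (hinge) loss for one sample.
-- Every wrong class whose score comes within `margin` of the target's
-- score contributes to the loss.
function multiMargin(scores, target, margin)
   margin = margin or 1
   local loss = 0
   for i, s in ipairs(scores) do
      if i ~= target then
         loss = loss + math.max(0, margin - scores[target] + s)
      end
   end
   return loss / #scores
end

print(multiMargin({1, 4, 2}, 2))    -- target wins by more than the margin
print(multiMargin({1, 2, 2.5}, 2))  -- class 3 outscores the target
```

Unlike the NLL, this loss is exactly zero once the correct class wins by the margin, so its gradients vanish on well-classified samples. That difference in gradient scale is worth keeping in mind when choosing a learning rate.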

### Exercises:

The obvious exercise now is to play with these different loss functions, and see how they affect convergence. In particular try to:

• swap the loss from NLL to MultiMargin, and if it doesn't work as well, think a little bit more about the scaling of the gradients, and whether you should rescale the learning rate.

## Step 4: Training Procedure

We now have some training data, a model to train, and a loss function to minimize. We define a training procedure, which you will find in this file: 4_train.lua.

A very important aspect of supervised training of non-linear models (ConvNets and MLPs) is the fact that the optimization problem is not convex anymore. This reinforces the need for a stochastic estimation of gradients, which has been shown to produce much better generalization results for several problems.

In this example, we show how the optimization algorithm can be easily set to either L-BFGS, CG, SGD or ASGD. In practice, it's very important to start with a few epochs of pure SGD, before switching to L-BFGS or ASGD (if switching at all). The intuition for that is related to the non-convex nature of the problem: at the very beginning of training (random initialization), the landscape might be highly non-convex, and no assumption should be made about the shape of the energy function. Often, SGD is the best we can do. Later on, batch methods (L-BFGS, CG) can be used more safely.

Interestingly, in the case of large convex problems, stochasticity is also very important, as it allows much faster (rough) convergence. Several works have explored these techniques, in particular, this recent paper from Byrd/Nocedal, and work on pure stochastic gradient descent by Bottou.

Here is our full training function, which demonstrates that you can switch the optimization you're using at runtime (if you want to), and also modify the batch size you're using at run time. You can do all these things because we create the evaluation closure each time we create a new batch. If the batch size is 1, then the method is purely stochastic. If the batch size is set to the complete dataset, then the method is a pure batch method.

-- classes
classes = {'1','2','3','4','5','6','7','8','9','0'}

-- This matrix records the current confusion across classes
confusion = optim.ConfusionMatrix(classes)

-- Log results to files
trainLogger = optim.Logger(paths.concat(opt.save, 'train.log'))
testLogger = optim.Logger(paths.concat(opt.save, 'test.log'))

-- Retrieve parameters and gradients:
-- this extracts and flattens all the trainable parameters of the model
-- into a 1-dim vector
if model then
   parameters,gradParameters = model:getParameters()
end

-- Training function
function train()

   -- epoch tracker
   epoch = epoch or 1

   -- local vars
   local time = sys.clock()

   -- shuffle at each epoch
   shuffle = torch.randperm(trsize)

   -- do one epoch
   print('==> doing epoch on training data:')
   print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
   for t = 1,trainData:size(),opt.batchSize do
      -- disp progress
      xlua.progress(t, trainData:size())

      -- create mini batch
      local inputs = {}
      local targets = {}
      for i = t,math.min(t+opt.batchSize-1,trainData:size()) do
         -- load new sample
         local input = trainData.data[shuffle[i]]:double()
         local target = trainData.labels[shuffle[i]]
         table.insert(inputs, input)
         table.insert(targets, target)
      end

      -- create closure to evaluate f(X) and df/dX
      local feval = function(x)
         -- get new parameters
         if x ~= parameters then
            parameters:copy(x)
         end

         -- reset gradients
         gradParameters:zero()

         -- f is the average of all criterions
         local f = 0

         -- evaluate function for complete mini batch
         for i = 1,#inputs do
            -- estimate f
            local output = model:forward(inputs[i])
            local err = criterion:forward(output, targets[i])
            f = f + err

            -- estimate df/dW
            local df_do = criterion:backward(output, targets[i])
            model:backward(inputs[i], df_do)

            -- update confusion
            confusion:add(output, targets[i])
         end

         -- normalize gradients and f(X)
         gradParameters:div(#inputs)
         f = f/#inputs

         -- return f and df/dX
         return f,gradParameters
      end

      -- optimize on current mini-batch
      if opt.optimization == 'CG' then
         config = config or {maxIter = opt.maxIter}
         optim.cg(feval, parameters, config)

      elseif opt.optimization == 'LBFGS' then
         config = config or {learningRate = opt.learningRate,
                             maxIter = opt.maxIter,
                             nCorrection = 10}
         optim.lbfgs(feval, parameters, config)

      elseif opt.optimization == 'SGD' then
         config = config or {learningRate = opt.learningRate,
                             weightDecay = opt.weightDecay,
                             momentum = opt.momentum,
                             learningRateDecay = 5e-7}
         optim.sgd(feval, parameters, config)

      elseif opt.optimization == 'ASGD' then
         config = config or {eta0 = opt.learningRate,
                             t0 = trsize * opt.t0}
         _,_,average = optim.asgd(feval, parameters, config)

      else
         error('unknown optimization method')
      end
   end

   -- time taken
   time = sys.clock() - time
   time = time / trainData:size()
   print("==> time to learn 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)

   -- update logger/plot (before the confusion matrix is reset)
   trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
   if opt.plot then
      trainLogger:style{['% mean class accuracy (train set)'] = '-'}
      trainLogger:plot()
   end

   -- save/log current net
   local filename = paths.concat(opt.save, 'model.net')
   os.execute('mkdir -p ' .. sys.dirname(filename))
   print('==> saving model to '..filename)
   torch.save(filename, model)

   -- next epoch
   confusion:zero()
   epoch = epoch + 1
end


We could then run the training procedure like this:

while true do
   train()
end
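
The pattern at the heart of train() is a closure that returns the loss and its gradient for the current mini-batch, handed to an optimizer that updates the parameters in place. The plain-Lua sketch below reproduces that pattern on a toy least-squares problem (no Torch required). `sgdStep` is a hypothetical stand-in for an optimizer, not optim.sgd itself, and the data, learning rate and number of epochs are made up:

```lua
-- Plain-Lua sketch of mini-batch SGD driven by a closure.
function sgdStep(feval, params, learningRate)
   local f, grad = feval(params)
   for i = 1, #params do
      params[i] = params[i] - learningRate * grad[i]
   end
   return f
end

-- toy data generated by y = 2*x + 1
xs = {-2, -1, 0, 1, 2, 3}
ys = {}
for i, x in ipairs(xs) do ys[i] = 2 * x + 1 end

params = {0, 0}        -- {slope, intercept}
batchSize = 2
learningRate = 0.1

for epoch = 1, 1000 do
   for t = 1, #xs, batchSize do
      local last = math.min(t + batchSize - 1, #xs)

      -- the closure is rebuilt for every mini-batch, as in train()
      local feval = function(p)
         local f, grad = 0, {0, 0}
         local n = last - t + 1
         for i = t, last do
            local err = p[1] * xs[i] + p[2] - ys[i]
            f = f + 0.5 * err * err
            grad[1] = grad[1] + err * xs[i]
            grad[2] = grad[2] + err
         end
         -- average the loss and the gradient over the mini-batch
         return f / n, {grad[1] / n, grad[2] / n}
      end

      sgdStep(feval, params, learningRate)
   end
end

print(params[1], params[2])
```

The estimated slope and intercept end up close to 2 and 1. Setting `batchSize` to 1 makes the procedure purely stochastic, and setting it to `#xs` turns it into a pure batch method, without touching the optimizer.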

### Exercises:

So, a bit on purpose, I've given you this blob of training code with rather few explanations. Try to understand what's going on, to do the following things:

• modify the batch size (and possibly the learning rate) and observe the impact on training accuracy, and test accuracy (generalization)

• change the optimization method, and in particular, try to start with L-BFGS from the very first epoch. What happens then?

## Step 5: Test the Model

A common thing to do is to test the model's performance while we train it. Usually, this test is done on a subset of the training data, that is kept for validation. Here we simply define the test procedure on the available test set:

function test()
   -- local vars
   local time = sys.clock()

   -- averaged param use?
   if average then
      cachedparams = parameters:clone()
      parameters:copy(average)
   end

   -- test over test data
   print('==> testing on test set:')
   for t = 1,testData:size() do
      -- disp progress
      xlua.progress(t, testData:size())

      -- get new sample
      local input = testData.data[t]:double()
      local target = testData.labels[t]

      -- test sample
      local pred = model:forward(input)
      confusion:add(pred, target)
   end

   -- timing
   time = sys.clock() - time
   time = time / testData:size()
   print("==> time to test 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)

   -- update log/plot (before the confusion matrix is reset)
   testLogger:add{['% mean class accuracy (test set)'] = confusion.totalValid * 100}
   if opt.plot then
      testLogger:style{['% mean class accuracy (test set)'] = '-'}
      testLogger:plot()
   end

   -- averaged param use?
   if average then
      -- restore parameters
      parameters:copy(cachedparams)
   end

   -- next iteration:
   confusion:zero()
end


The train/test procedure now looks like this:

while true do
   train()
   test()
end

### Exercises:

As mentioned above, validation is the proper (and only!) way to train a model and estimate how well it does on unseen data:

• modify the code above to extract a subset of the training data to use for validation

• once you have that, add a stopping condition to the script, such that it terminates once the validation error starts rising above a certain threshold. This is called early-stopping.
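
As a starting point for the second exercise, here is a plain-Lua sketch of one possible stopping rule (no Torch required). `earlyStopEpoch` is a hypothetical helper, the validation errors are made up, and the rule used (stop once the error exceeds the best value seen so far by more than a threshold) is only one reasonable reading of the exercise:

```lua
-- Plain-Lua sketch of early stopping on a list of per-epoch validation errors.
-- Returns the epoch at which training would stop, and the best error seen.
function earlyStopEpoch(validationErrors, threshold)
   local best = math.huge
   for epoch, err in ipairs(validationErrors) do
      if err < best then
         best = err
      elseif err > best + threshold then
         return epoch, best
      end
   end
   -- the error never rose above the threshold: run to the last epoch
   return #validationErrors, best
end

errors = {0.40, 0.31, 0.25, 0.26, 0.24, 0.27, 0.30, 0.33}
stopEpoch, bestError = earlyStopEpoch(errors, 0.05)
print(stopEpoch, bestError)
```

In the real script, the same test would sit inside the train/test loop, with the validation error computed after each epoch, and the model saved whenever a new best value is reached.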

## All Done!

The final step of course, is to run doall.lua, which will train the model over the entire training set. By default, it uses the basic training set size (about 70,000 samples). If you use the flag: -size extra, you will obtain state-of-the-art results (in a couple of days of course!).

### Final Exercise

If time allows, you can try to replace this dataset by other datasets, such as MNIST, which you should already have working (from day 1). Try to think about what you have to change/adapt to work with other types of images (non RGB, binary, infrared?).

## Tips, going further

### Tips and tricks for MLP training

There are several hyper-parameters in the above code, which are not (and, generally speaking, cannot be) optimized by gradient descent. The design of outer-loop algorithms for optimizing them is a topic of ongoing research. Over the last 25 years, researchers have devised various rules of thumb for choosing them. A very good overview of these tricks can be found in Efficient BackProp by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. Here, we summarize the same issues, with an emphasis on the parameters and techniques that we actually used in our code.

### Tips and Tricks: Nonlinearity

Which non-linear activation function should you use in a neural network? Two of the most common ones are the logistic sigmoid and the tanh functions. For reasons explained in Section 4.4 of Efficient BackProp, nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean inputs to the next layer (which is a desirable property). Empirically, we have observed that the tanh has better convergence properties.

### Tips and Tricks: Weight initialization

At initialization we want the weights to be small enough around the origin so that the activation function operates near its linear regime, where gradients are the largest. Otherwise, the gradient signal used for learning is attenuated by each layer as it is propagated from the classifier towards the inputs. Proper weight initialization is implemented in all the modules provided in nn, so you don't have to worry about it. Each module has a reset() method, which initializes the parameter with a uniform distribution that takes into account the fanin/fanout of the module. It's called by default when you create a new module, but you can call it at any time to reset the weights.
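
To give an idea of what such an initialization looks like, here is a plain-Lua sketch that draws weights uniformly in a range scaled by the fan-in (no Torch required). `initWeights` is a hypothetical helper, and the bound 1/sqrt(fanin) is an assumption of this sketch rather than a statement about what every nn module does:

```lua
-- Plain-Lua sketch of fan-in scaled uniform initialization.
function initWeights(noutputs, ninputs)
   local bound = 1 / math.sqrt(ninputs)
   local W = {}
   for i = 1, noutputs do
      W[i] = {}
      for j = 1, ninputs do
         -- math.random() is uniform in [0,1); rescale to [-bound, bound)
         W[i][j] = (2 * math.random() - 1) * bound
      end
   end
   return W, bound
end

math.randomseed(1)
W, bound = initWeights(10, 400)
print(bound)  -- 1/sqrt(400) = 0.05
```

The more inputs a unit has, the smaller each weight is drawn, so that the weighted sum stays in the near-linear regime of the tanh.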

### Tips and Tricks: Learning Rate

Optimization by stochastic gradient descent is very sensitive to the step size, or learning rate. There is a great deal of literature on how to choose the learning rate, and how to change it during optimization. A good heuristic is to decay the learning rate as lr_0/(1+t*decay), where t counts the samples seen so far and decay is set inversely proportional to the number of samples you want to see with an almost flat learning rate before it starts to decay appreciably.
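The schedule above is easy to tabulate. This is a small sketch in plain Lua; the function name `learning_rate` and the values of `lr0` and `decay` are illustrative assumptions, not the settings used in the tutorial scripts.

```lua
-- Learning rate after t samples, following lr_0 / (1 + t * decay)
local function learning_rate(lr0, decay, t)
   return lr0 / (1 + t * decay)
end

local lr0 = 1e-3     -- initial learning rate (illustrative)
local decay = 1e-4   -- the rate is halved after 1/decay = 10,000 samples

for _, t in ipairs({0, 1000, 10000, 100000}) do
   print(t, learning_rate(lr0, decay, t))
end
```

As long as t is much smaller than 1/decay the rate stays almost flat; once t reaches 1/decay it has dropped to half its initial value, and it keeps shrinking roughly like 1/t after that.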

Section 4.7 of Efficient BackProp details procedures for choosing a learning rate for each parameter (weight) in the network, and for adapting these rates based on the error of the classifier.

### Tips and Tricks: Number of hidden units

The number of hidden units that gives best results is dataset-dependent. Generally speaking, the more complicated the input distribution is, the more capacity the network will require to model it, and so the larger the number of hidden units that will be needed.

### Tips and Tricks: Norm Regularization

Typical values to try for the L1/L2 regularization parameter are 10^-2 or 10^-3. It is usually only useful to regularize the topmost layers of the MLP (closest to the classifier), if not the classifier only. L2 regularization is really easy to implement. optim.sgd provides an implementation, but it applies globally to all the parameters, which is typically not a good idea. Instead, after each call to optim.sgd, you can simply apply the regularization to the subset of weights of interest:

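require 'nn'
require 'optim'

-- model (layer sizes are illustrative, they are not given in the original listing):
model = nn.Sequential()
model:add(nn.Linear(100, 200))
model:add(nn.Tanh())
model:add(nn.Linear(200, 10))

-- weights to regularize (third module, i.e. the layer closest to the classifier):
reg = {}
reg[1] = model:get(3).weight
reg[2] = model:get(3).bias

-- L2 coefficient, applied at every step (illustrative value):
weightDecay = 1e-3

-- optimization:
while true do
   -- ...
   optim.sgd(...)

   -- after each optimization step (gradient descent), regularize weights
   for _,w in ipairs(reg) do
      -- shrink each selected tensor towards zero: w <- w - weightDecay * w
      w:add(-weightDecay, w)
   end
end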

### Tips and tricks for ConvNet training

ConvNets are especially tricky to train, as they add even more hyper-parameters than a standard MLP. While the usual rules of thumb for learning rates and regularization constants still apply, the following should be kept in mind when optimizing ConvNets.

#### Number of filters

Since feature map size decreases with depth, layers near the input tend to have fewer filters, while layers higher up can have many more. In fact, to equalize computation at each layer, the product of the number of feature maps and the number of pixel positions is typically picked to be roughly constant across layers. Preserving the information about the input would require the total number of activations (number of feature maps times number of pixel positions) to be non-decreasing from one layer to the next (of course, we can hope to get away with less when doing supervised learning). The number of feature maps directly controls capacity, so the right choice depends on the number of available examples and the complexity of the task.
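The "roughly constant product" rule can be checked with a few lines of arithmetic. The sketch below is plain Lua; the three-layer configuration and the helper `activations` are made up for illustration and do not describe the network used in this tutorial.

```lua
-- Hypothetical three-layer ConvNet: each entry gives the number of feature
-- maps and the spatial size of each map (all numbers are made up).
local layers = {
   {maps = 16,  width = 32, height = 32},
   {maps = 64,  width = 16, height = 16},
   {maps = 256, width = 8,  height = 8},
}

-- total number of activations: feature maps times pixel positions
local function activations(layer)
   return layer.maps * layer.width * layer.height
end

for i, layer in ipairs(layers) do
   print(string.format('layer %d: %d maps of %dx%d = %d activations',
      i, layer.maps, layer.width, layer.height, activations(layer)))
end
```

Here each 2x2 reduction of the map size is compensated by four times as many feature maps, so the number of activations stays the same from layer to layer.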

#### Filter Shape

Common filter shapes found in the literature vary greatly, usually based on the dataset. Best results on MNIST-sized images (28x28) are usually obtained with first-layer filters in the 5x5 range, while natural image datasets (often with hundreds of pixels in each dimension) tend to use larger first-layer filters, of shape 7x7 to 12x12.

The trick is thus to find the right level of "granularity" (i.e. filter shapes) in order to create abstractions at the proper scale, given a particular dataset.

It's also possible to use multiscale receptive fields, which give the ConvNet a much larger receptive field while keeping its computational complexity low. This type of procedure was proposed for scene parsing (where context is crucial to recognize objects) in this paper.

#### Pooling Shape

Typical values for pooling are 2x2. Very large input images may warrant 4x4 pooling in the lower layers. Keep in mind, however, that this will reduce the dimension of the signal by a factor of 16, and may result in throwing away too much information. In general, the pooling region is independent of the stride at which you discard information. In Torch, all the pooling modules (L2, average, max) have separate parameters for the pooling size and the strides, for example:

nn.SpatialMaxPooling(pool_x, pool_y, stride_x, stride_y)
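To see how the pooling size and the stride interact, the output size along one dimension can be computed by hand. The following plain Lua sketch assumes no padding and that incomplete windows at the border are dropped (rounding down); the helper `pooled_size` is hypothetical and only mirrors that arithmetic.

```lua
-- Output size along one dimension for a pooling window of size `pool`
-- moved with step `stride` over `input` positions.
-- Assumes no padding, and that incomplete windows at the border are dropped.
local function pooled_size(input, pool, stride)
   return math.floor((input - pool) / stride) + 1
end

print('2x2 pooling, stride 2:', pooled_size(32, 2, 2))
print('4x4 pooling, stride 4:', pooled_size(32, 4, 4))
print('3x3 pooling, stride 2:', pooled_size(32, 3, 2))
```

Only the stride determines how much the signal is downsampled: a 3x3 window moved with a stride of 2 pools over overlapping regions, yet discards about as many positions as non-overlapping 2x2 pooling.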