Install on Mac
Follow http://deeplearning.net/tutorial/gettingstarted.html step by step.

Getting Started

These tutorials do not attempt to make up for a graduate or undergraduate course in machine learning, but we do give a rapid overview of some important concepts (and notation) to make sure that we're on the same page. You'll also need to download the datasets mentioned in this chapter in order to run the example code of the upcoming tutorials.

Download

On each learning algorithm page, you will be able to download the corresponding files. If you want to download all of them at the same time, you can clone the git repository of the tutorial:

    git clone git://github.com/lisa-lab/DeepLearningTutorials.git

Datasets

MNIST Dataset
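The dataset is distributed as a pickled, gzipped file (mnist.pkl.gz) containing a (train_set, valid_set, test_set) tuple, each set being a pair of (inputs, labels) arrays. As a minimal sketch, assuming that file sits in the working directory, loading it looks like this:

    import cPickle
    import gzip

    # load the pickled MNIST dataset; each set is an (inputs, labels) pair
    f = gzip.open('mnist.pkl.gz', 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()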
The data has to be stored as floats on the GPU (the right dtype for storing on the GPU is given by theano.config.floatX). To get around this shortcoming for the labels, we store them as floats and then cast them to int (see the sketch below).

Note: If you are running your code on the GPU and the dataset you are using is too large to fit in memory, the code will crash, so you cannot simply place the whole dataset in a shared variable. You can, however, store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you have gone through the chunk, update the values it stores. This way you minimize the number of data transfers between CPU memory and GPU memory.

Notation

Dataset notation

We label data sets as $\mathcal{D}$. When the distinction is important, we indicate the train, validation, and test sets as $\mathcal{D}_{train}$, $\mathcal{D}_{valid}$ and $\mathcal{D}_{test}$. The validation set is used to perform model selection and hyper-parameter selection, whereas the test set is used to evaluate the final generalization error. The tutorials mostly deal with classification problems, where each data set $\mathcal{D}$ is an indexed set of pairs $(x^{(i)}, y^{(i)})$: $x^{(i)} \in \mathbb{R}^D$ is the $i$-th training example and $y^{(i)} \in \{0, \dots, L\}$ is the label assigned to it.
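To make the GPU note above concrete, here is a sketch close to the tutorial's shared_dataset helper, showing the float-then-cast trick for the labels (treat it as a sketch, not the exact original code):

    import numpy
    import theano
    import theano.tensor as T

    def shared_dataset(data_xy):
        """Place a dataset into shared variables so Theano can copy it
        to the GPU in a single transfer."""
        data_x, data_y = data_xy
        # everything stored on the GPU must have dtype floatX ...
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
        # ... but the labels are needed as ints, so we cast the shared
        # float variable at the point of use
        return shared_x, T.cast(shared_y, 'int32')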
Math Conventions

Upper-case symbols such as $W$ refer to matrices unless specified otherwise; $W_{ij}$ is the element at the $i$-th row and $j$-th column of $W$. Lower-case symbols such as $b$ refer to vectors; $b_i$ is the $i$-th element of $b$.

List of Symbols and Acronyms

$D$: number of input dimensions; $L$: number of labels; $\theta$: the set of all parameters for a given model; $f_\theta(x)$ (often just $f(x)$): the classification function associated with a model $P(Y \mid x, \theta)$, defined as $\operatorname{argmax}_k P(Y=k \mid x, \theta)$; NLL: negative log-likelihood.
A Primer on Supervised Optimization for Deep Learning

What's exciting about Deep Learning is largely the use of unsupervised learning of deep networks. But supervised learning also plays an important role. The utility of unsupervised pre-training is often evaluated on the basis of what performance can be achieved after supervised fine-tuning. This chapter reviews the basics of supervised learning for classification models, and covers the minibatch stochastic gradient descent algorithm that is used to fine-tune many of the models in the Deep Learning Tutorials. Have a look at these introductory course notes on gradient-based learning for more basics on the notion of optimizing a training criterion using the gradient.

Learning a Classifier

Loss Function

Zero-One Loss

The models presented in these deep learning tutorials are mostly used for classification. The objective in training a classifier is to minimize the number of errors (zero-one loss) on unseen examples. If $f: \mathbb{R}^D \to \{0, \dots, L\}$ is the prediction function, this loss can be written as:

$$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \ne y^{(i)}}$$

where either $\mathcal{D}$ is the training set (during training) or $\mathcal{D} \cap \mathcal{D}_{train} = \emptyset$ (to avoid biasing the evaluation of validation or test error), and $I$ is the indicator function, equal to 1 when its condition holds and 0 otherwise. In this tutorial, $f$ is defined as:

$$f(x) = \operatorname{argmax}_k P(Y = k \mid x, \theta)$$

In Python, using Theano, this can be written as:

    # zero_one_loss is a Theano variable representing a symbolic
    # expression of the zero-one loss; T.neq counts the examples whose
    # most probable predicted class differs from the true label y
    zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y))

Negative Log-Likelihood Loss

Since the zero-one loss is not differentiable, optimizing it for large models (thousands or millions of parameters) is prohibitively expensive (computationally). We thus maximize the log-likelihood of our classifier given all the labels in a training set. The likelihood of the correct class is not the same as the number of right predictions, but from the point of view of a randomly initialized classifier they are pretty similar. Remember that likelihood and zero-one loss are different objectives; you should see that they are correlated on the validation set, but sometimes one will rise while the other falls, or vice versa. Since we usually speak in terms of minimizing a loss function, learning will thus attempt to minimize the negative log-likelihood (NLL), defined as:

$$NLL(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)$$

The NLL of our classifier is a differentiable surrogate for the zero-one loss, and we use the gradient of this function over our training data as a supervised learning signal for deep learning of a classifier. This can be computed using the following line of code:

    # NLL is a symbolic variable; to get the actual value of NLL, this
    # symbolic expression has to be compiled into a Theano function
    NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
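That compilation step might look like the sketch below, assuming x and y are the symbolic input variables from which p_y_given_x was built (those names are illustrative, not fixed by the text above):

    import theano
    import theano.tensor as T

    # hypothetical compilation of the symbolic NLL into a callable;
    # x and y are assumed to be the graph's symbolic inputs
    calc_nll = theano.function(inputs=[x, y], outputs=NLL)
    # calc_nll(x_batch, y_batch) now returns the NLL as a plain number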
Note: T.arange(y.shape[0]) is a vector of integers [0, 1, 2, ..., len(y)-1]. Indexing a matrix M by the two vectors [0, 1, ..., K], [a, b, ..., k] returns the elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this syntax to retrieve the log-probabilities of the correct labels, y.

Stochastic Gradient Descent

What is ordinary gradient descent? It is a simple algorithm in which we repeatedly take small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent we consider that the training data is rolled into the loss function. The pseudocode of this algorithm can then be described as:
    # GRADIENT DESCENT
    while True:
        loss = f(params)
        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params
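As a runnable toy instance of this loop (an illustration, not from the tutorial), here is gradient descent on the quadratic loss f(p) = (p - 3)^2, whose gradient is 2(p - 3):

    # toy example: minimize f(p) = (p - 3)**2 by gradient descent
    learning_rate = 0.1
    params = 0.0
    while True:
        loss = (params - 3.0) ** 2
        d_loss_wrt_params = 2.0 * (params - 3.0)  # analytic gradient
        params -= learning_rate * d_loss_wrt_params
        if abs(d_loss_wrt_params) < 1e-6:  # stopping condition
            break
    # params is now very close to the minimizer, 3.0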
Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, we estimate the gradient from just a single example at a time.

    # STOCHASTIC GRADIENT DESCENT
    for (x_i, y_i) in training_set:
        # imagine an infinite generator
        # that may repeat examples (if there is only a finite training set)
        loss = f(params, x_i, y_i)
        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called "minibatches". Minibatch SGD works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.

    # MINIBATCH STOCHASTIC GRADIENT DESCENT
    # note this is still not batch gradient descent: there, an update
    # takes place only after all examples have been visited
    for (x_batch, y_batch) in train_batches:
        # imagine an infinite generator
        # that may repeat examples
        loss = f(params, x_batch, y_batch)
        d_loss_wrt_params = ...  # compute gradient using theano
        params -= learning_rate * d_loss_wrt_params
        if <stopping condition is met>:
            return params

There is a tradeoff in the choice of the minibatch size B: with large B, time is wasted reducing the variance of the gradient estimator when it could be better spent on additional gradient steps, and the optimal value is model-, dataset- and hardware-dependent. These tutorials set it to 20, a fairly arbitrary but harmless choice.

Note: If you are training for a fixed number of epochs, the minibatch size becomes important because it controls the number of updates done to your parameters. Training the same model for 10 epochs using a batch size of 1 yields completely different results compared to training for the same 10 epochs but with a batch size of 20. Keep this in mind when switching between batch sizes, and be prepared to tweak all the other parameters according to the batch size used.

All the code blocks above show pseudocode. Implementing such an algorithm in Theano can be done as follows:

    # Minibatch Stochastic Gradient Descent (MSGD)

    # assume loss is a symbolic description of the loss function given
    # the symbolic variables params (shared variable), x_batch, y_batch

    # compute gradient of loss with respect to params
    d_loss_wrt_params = T.grad(loss, params)

    # compile the MSGD step into a theano function; params is updated in
    # place because it is a shared variable that persists across calls
    updates = [(params, params - learning_rate * d_loss_wrt_params)]
    MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

    for (x_batch, y_batch) in train_batches:
        # here x_batch and y_batch are elements of train_batches and
        # therefore numpy arrays; calling MSGD also updates the params
        print('Current loss is ', MSGD(x_batch, y_batch))
        if stopping_condition_is_met:
            return params

Regularization

There is more to machine learning than optimization. When we train our model from data we are trying to prepare it to do well on new examples, not the ones it has already seen. The training loop above for MSGD does not take this into account, and may overfit the training examples. A way to combat overfitting is through regularization. There are several techniques for regularization; the ones we will explain here are L1/L2 regularization and early-stopping.

L1 and L2 regularization involve adding an extra term to the loss function, which penalizes certain parameter configurations.
Formally, if our loss function is:

$$NLL(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)$$

then the regularized loss will be:

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)$$

or, in our case:

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p$$

where

$$\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{1/p}$$

which is the $L_p$ norm of $\theta$. $\lambda$ is a hyper-parameter controlling the relative importance of the regularization term. Commonly used values for $p$ are 1 and 2, hence the L1/L2 nomenclature; if $p = 2$, the regularizer is also called "weight decay".

In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that the network models). More intuitively, the two terms (NLL and $R(\theta)$) correspond to modelling the data well and to having "simple" or "smooth" solutions, respectively, so minimizing their sum trades off fit to the training data against "generality" of the solution. Note that the fact that a solution is "simple" does not mean that it will generalize well. Empirically, it was found that performing such regularization in the context of neural networks helps with generalization, especially on small datasets.

The code block below shows how to compute the loss in Python when it contains both an L1 regularization term weighted by $\lambda_1$ and an L2 regularization term weighted by $\lambda_2$:
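A minimal sketch, assuming the model's parameters are collected in a single Theano variable param, and that lambda_1 and lambda_2 are Python floats:

    # symbolic Theano variable that represents the L1 regularization term
    L1 = T.sum(abs(param))

    # symbolic Theano variable that represents the squared L2 term
    L2_sqr = T.sum(param ** 2)

    # the regularized loss
    loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr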
Early-Stopping

Early-stopping combats overfitting by monitoring the model's performance on a validation set. A validation set is a set of examples that we never use for gradient descent, but which is also not a part of the test set. The validation examples are considered to be representative of future test examples. We can use them during training because they are not part of the test set. If the model's performance ceases to improve sufficiently on the validation set, or even degrades with further optimization, then the heuristic implemented here gives up on much further optimization. The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use of a strategy based on a geometrically increasing amount of patience.

    # early-stopping parameters
    patience = 5000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience/2)
                                   # go through this many minibatches before
                                   # checking the network on the validation
                                   # set; in this case we check every epoch

    best_params = None
    best_validation_loss = numpy.inf
    test_score = 0.
    start_time = time.clock()

    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        # Report "1" for first epoch, "n_epochs" for last epoch
        epoch = epoch + 1
        for minibatch_index in xrange(n_train_batches):

            d_loss_wrt_params = ...  # compute gradient
            params -= learning_rate * d_loss_wrt_params  # gradient descent

            # iteration number. We want it to start at 0.
            iter = (epoch - 1) * n_train_batches + minibatch_index
            # note that if we do `iter % validation_frequency` it will be
            # true for iter = 0 which we do not want. We want it true for
            # iter = validation_frequency - 1.
            if (iter + 1) % validation_frequency == 0:

                this_validation_loss = ...  # compute zero-one loss on validation set

                if this_validation_loss < best_validation_loss:

                    # improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * improvement_threshold:
                        patience = max(patience, iter * patience_increase)

                    best_params = copy.deepcopy(params)
                    best_validation_loss = this_validation_loss

            if patience <= iter:
                done_looping = True
                break

    # POSTCONDITION:
    # best_params refers to the best out-of-sample parameters observed
    # during the optimization

If we run out of batches of training data before running out of patience, then we just go back to the beginning of the training set and repeat.

Note: validation_frequency should always be smaller than patience, so that the code checks the model's performance at least twice before running out of patience. This is the reason we used the formulation validation_frequency = min(value, patience/2.).

Note: This algorithm could possibly be improved by using a test of statistical significance rather than the simple comparison when deciding whether to increase the patience.

Testing

After the loop exits, the best_params variable refers to the best-performing model on the validation set. If we repeat this procedure for another model class, or even another random initialization, we should use the same train/valid/test split of the data, and we will get other best-performing models. If we have to choose what the best model class or the best initialization was, we compare the best_validation_loss for each model.
When we have finally chosen the model we think is the best (on validation data), we report that model's test set performance. That is the performance we expect on unseen examples.

Theano/Python Tips

Loading and Saving Models

When you're doing experiments, it can take hours (sometimes days!) for gradient descent to find the best parameters. You will want to save those weights once you find them. You may also want to save your current-best estimates as the search progresses.

Pickle the numpy ndarrays from your shared variables

The best way to save/archive your model's parameters is to use pickle or deepcopy on the ndarray objects. For example, if your parameters are in shared variables w, v, and u, then your save command should look something like:
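    import cPickle

    save_file = open('path', 'wb')  # 'path' is a placeholder filename;
                                    # this will overwrite current contents
    # protocol -1 (HIGHEST_PROTOCOL) gives much more efficient storage
    # than numpy's default
    cPickle.dump(w.get_value(borrow=True), save_file, -1)
    cPickle.dump(v.get_value(borrow=True), save_file, -1)
    cPickle.dump(u.get_value(borrow=True), save_file, -1)
    save_file.close()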
Then later, you can load your data back like this:

    save_file = open('path')
    w.set_value(cPickle.load(save_file), borrow=True)
    v.set_value(cPickle.load(save_file), borrow=True)
    u.set_value(cPickle.load(save_file), borrow=True)

This technique is a bit verbose, but it is tried and true. You will be able to load your data and render it in matplotlib without trouble, years after saving it.

DO NOT pickle your training or test functions for long-term storage

Theano functions are compatible with Python's deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update your Theano installation and one of its internals changes, you may not be able to un-pickle your model. Theano is still in active development, and the internal APIs are subject to change. So to be on the safe side, do not pickle your entire training or testing functions for long-term storage. The pickle mechanism is aimed at short-term storage, such as a temp file, or a copy to another machine in a distributed job. Read more about serialization in Theano, or Python's pickling.

Plotting Intermediate Results

Visualizations can be very powerful tools for understanding what your model or training algorithm is doing. You might be tempted to insert matplotlib plotting commands, or PIL image-rendering commands, into your model-training script. However, later you will observe something interesting in one of those pre-rendered images and want to investigate something that isn't clear from the pictures. You'll wish you had saved the original model.

If you have enough disk space, your training script should save intermediate models, and a visualization script should process those saved models. You already have a model-saving function, right? Just use it again to save these intermediate models.

Libraries you'll want to know about: Python Imaging Library (PIL), matplotlib.
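As a small sketch of that checkpointing advice (the helper name save_params and the filename pattern are illustrative, not from the tutorial), intermediate models can be saved by reusing the pickling routine above:

    import cPickle

    def save_params(path, params):
        """Pickle each shared variable's ndarray to `path`."""
        save_file = open(path, 'wb')
        for p in params:  # e.g. params = [w, v, u]
            cPickle.dump(p.get_value(borrow=True), save_file, -1)
        save_file.close()

    # inside the training loop, e.g. whenever validation improves:
    # save_params('model_epoch_%d.pkl' % epoch, [w, v, u])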
Subpages (13):
1 Logistic Regression
2 Multi-layer Perceptron
3 Convolutional Neural Network (LeNet)
4 Denoising Autoencoders
5 Stacked Denoising Autoencoders
6 Restricted Boltzmann Machine (RBM)
7 Deep Belief Networks
8 Hybrid Monte-Carlo Sampling
9 Recurrent Neural Networks
10 LSTM Networks for Sentiment Analysis
11 Modeling and generating sequences of polyphonic music with the RNN-RBM
12 Miscellaneous
13 References