Install on mac
Follow http://deeplearning.net/tutorial/gettingstarted.html step by step.

## Getting Started

These tutorials do not attempt to make up for a graduate or undergraduate course in machine learning, but we do make a rapid overview of some important concepts (and notation) to make sure that we’re on the same page. You’ll also need to download the datasets mentioned in this chapter in order to run the example code of the upcoming tutorials.

## Download

On each learning algorithm page, you will be able to download the corresponding files. If you want to download all of them at the same time, you can clone the git repository of the tutorial: `git clone git://github.com/lisa-lab/DeepLearningTutorials.git`

## Datasets

## MNIST Dataset
The data has to be stored as floats on the GPU ( the right
Note: If you are running your code on the GPU and the dataset you are using is too large to fit in memory, the code will crash. In such a case you should not store the whole dataset in a shared variable. You can, however, store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you have gone through the chunk, update the values it stores. This way you minimize the number of data transfers between CPU memory and GPU memory.

## Notation

## Dataset notation

We label data sets as $\mathcal{D}$. When the distinction is important, we indicate train, validation, and test sets as $\mathcal{D}_{train}$, $\mathcal{D}_{valid}$ and $\mathcal{D}_{test}$. The validation set is used to perform model selection and hyper-parameter selection, whereas the test set is used to evaluate the final generalization error and compare different algorithms in an unbiased way.

The tutorials mostly deal with classification problems, where each data set $\mathcal{D}$ is an indexed set of pairs $(x^{(i)}, y^{(i)})$. We use superscripts to distinguish training set examples: $x^{(i)} \in \mathbb{R}^D$ is thus the i-th training example of dimensionality $D$. Similarly, $y^{(i)} \in \{0, \ldots, L-1\}$ is the i-th label assigned to input $x^{(i)}$. It is straightforward to extend these examples to ones where $y^{(i)}$ has other types (e.g. Gaussian for regression, or groups of multinomials for predicting multiple symbols).

## Math Conventions

- $W$: upper-case symbols refer to a matrix unless specified otherwise
- $W_{ij}$: element at i-th row and j-th column of matrix $W$
- $W_{i \cdot}$, $W_i$: vector, i-th row of matrix $W$
- $W_{\cdot j}$: vector, j-th column of matrix $W$
- $b$: lower-case symbols refer to a vector unless specified otherwise
- $b_i$: i-th element of vector $b$
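As a small illustration of the indexing conventions above, the sketch below uses a nested Python list as a stand-in for a matrix. The names `W`, `row` and `column` and the numeric values are invented for the example, and Python counts rows and columns from 0.

```python
# A 2x3 matrix stored as a list of rows (illustrative values only).
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

i, j = 1, 2

element = W[i][j]             # element at row i, column j
row = W[i]                    # the i-th row, as a vector
column = [r[j] for r in W]    # the j-th column, as a vector

print(element, row, column)
```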
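## List of Symbols and acronyms

- $D$: number of input dimensions.
- $D_h^{(i)}$: number of hidden units in the $i$-th layer.
- $f_{\theta}(x)$, $f(x)$: classification function associated with a model $P(Y|x,\theta)$, defined as $\operatorname{argmax}_k P(Y=k|x,\theta)$, i.e. the **most probable class**. Note that we will often drop the $\theta$ subscript.
- $L$: number of labels.
- $\mathcal{L}(\theta, \mathcal{D})$: log-likelihood on data set $\mathcal{D}$ of the model defined by parameters $\theta$.
- $\ell(\theta, \mathcal{D})$: empirical loss of the prediction function $f$ parameterized by $\theta$ on data set $\mathcal{D}$.
- NLL: negative log-likelihood
- $\theta$: set of all parameters for a given model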
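The classification function in the list above picks the label with the highest predicted probability. A minimal sketch in plain Python, assuming the model's output for one input is available as a list of class probabilities (the helper name `predict` and the numbers are hypothetical):

```python
def predict(class_probabilities):
    """Return the index k with the largest probability P(Y=k | x)."""
    return max(range(len(class_probabilities)),
               key=lambda k: class_probabilities[k])

# Hypothetical model output for one input x with three labels.
p_y_given_x = [0.1, 0.7, 0.2]
print(predict(p_y_given_x))
```

When several classes tie for the maximum, this sketch returns the lowest index among them.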
## A Primer on Supervised Optimization for Deep Learning

What’s exciting about Deep Learning is largely the use of unsupervised learning of deep networks. But supervised learning also plays an important role. The utility of unsupervised

## Learning a Classifier

## Loss Function

## Zero-One Loss

The models presented in these deep learning tutorials are mostly used for classification. The objective in training a classifier is to minimize the number of errors (zero-one loss) on unseen examples. If $f: \mathbb{R}^D \rightarrow \{0, \ldots, L-1\}$ is the prediction function, then this loss can be written as:

$$\ell_{0,1} = \sum_{i=1}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$$

where either $\mathcal{D}$ is the training set (during training) or $\mathcal{D} \cap \mathcal{D}_{train} = \emptyset$ (to avoid biasing the evaluation of validation or test error). $I$ is the indicator function defined as:

$$I_x = \begin{cases} 1 & \text{if } x \text{ is True} \\ 0 & \text{otherwise} \end{cases}$$

In this tutorial, $f$ is defined as:

$$f(x) = \operatorname{argmax}_k P(Y=k | x, \theta)$$

In Python, using Theano, this can be written as:

```
# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero-one loss.
# neq is used because this is a 'loss': it counts the y's that
# do not match the class with maximum probability.
# axis=1 takes the argmax per example (one row of p_y_given_x each).
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))
```

## Negative Log-Likelihood Loss

Since the zero-one loss is not differentiable, optimizing it for large models (thousands or millions of parameters) is prohibitively expensive (computationally). We thus maximize the log-likelihood of our classifier given all the labels in a training set:

$$\mathcal{L}(\theta, \mathcal{D}) = \sum_{i=1}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$$

The likelihood of the correct class is not the same as the number of right predictions, but from the point of view of a randomly initialized classifier they are pretty similar. Remember that likelihood and zero-one loss are different objectives; you should see that they are correlated on the validation set, but sometimes one will rise while the other falls, or vice versa.

Since we usually speak in terms of minimizing a loss function, learning will thus attempt to minimize the negative log-likelihood (NLL), defined as:

$$NLL(\theta, \mathcal{D}) = -\sum_{i=1}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$$

The NLL of our classifier is a differentiable surrogate for the zero-one loss, and we use the gradient of this function over our training data as a supervised learning signal for deep learning of a classifier. This can be computed using the following line of code: `NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])`

Note: `T.arange(y.shape[0])` is a vector of integers [0, 1, 2, ..., len(y)-1]. Indexing a matrix M by the two vectors [0, 1, ..., K], [a, b, ..., k] returns the elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this syntax to retrieve the log-probability of the correct labels, y.

## Stochastic Gradient Descent

What is ordinary gradient descent? It is a simple algorithm in which we repeatedly make small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent we consider that the training data is rolled into the loss function. Then the pseudocode of this algorithm can be described as:

```
# GRADIENT DESCENT
while True:
    loss = f(params)
    d_loss_wrt_params = ...  # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```

Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, we estimate the gradient from just a single example at a time.

```
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
    # imagine an infinite generator
    # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ...  # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```

The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called “minibatches”. Minibatch SGD works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.

```
for (x_batch, y_batch) in train_batches:
    # (this is not full-batch descent, where the update takes place only
    # after all training examples have been visited)
    # imagine an infinite generator
    # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ...  # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```

There is a tradeoff in the choice of the minibatch size $B$. The reduction of variance and use of SIMD instructions helps most when increasing $B$ from 1 to 2, but the marginal improvement fades rapidly to nothing. With large $B$, time is wasted in reducing the variance of the gradient estimator; that time would be better spent on additional gradient steps. An optimal $B$ is model-, dataset-, and hardware-dependent, and can be anywhere from 1 to maybe several hundreds. In the tutorial we set it to 20, but this choice is almost arbitrary (though harmless).
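To make the minibatch loop concrete, here is a minimal runnable sketch in plain Python rather than Theano. It assumes a one-dimensional binary logistic model and synthetic data, neither of which comes from the tutorial; the helper names (`nll_and_grad`, `zero_one_loss`) and all constants are illustrative. It minimizes the mean NLL by minibatch SGD and then counts zero-one errors.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_and_grad(params, batch):
    """Mean negative log-likelihood of the logistic model and its gradient."""
    w, b = params
    loss, dw, db = 0.0, 0.0, 0.0
    for x, y in batch:
        p = sigmoid(w * x + b)                 # P(Y=1 | x, theta)
        loss -= math.log(p if y == 1 else 1.0 - p)
        dw += (p - y) * x                      # d NLL / d w for this example
        db += (p - y)                          # d NLL / d b for this example
    n = len(batch)
    return loss / n, (dw / n, db / n)

def zero_one_loss(params, data):
    """Number of examples whose predicted class differs from the label."""
    w, b = params
    return sum(int((sigmoid(w * x + b) > 0.5) != (y == 1)) for x, y in data)

# Synthetic data (an assumption): class 1 centred at +1.5, class 0 at -1.5.
random.seed(0)
data = [(random.gauss(1.5 if y == 1 else -1.5, 1.0), y) for y in [0, 1] * 100]

batch_size = 20
learning_rate = 0.1
params = [0.0, 0.0]

initial_nll, _ = nll_and_grad(params, data)
for epoch in range(10):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        _, grad = nll_and_grad(params, batch)  # gradient estimate from one minibatch
        params = [p - learning_rate * g for p, g in zip(params, grad)]

final_nll, _ = nll_and_grad(params, data)
errors = zero_one_loss(params, data)
print(initial_nll, final_nll, errors)
```

Changing `batch_size` while keeping the number of epochs fixed changes the number of parameter updates, so the learning rate usually needs retuning alongside it.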
Note: If you are training for a fixed number of epochs, the minibatch size becomes important because it controls the number of updates done to your parameters. Training the same model for 10 epochs using a batch size of 1 yields completely different results compared to training for the same 10 epochs but with a batch size of 20. Keep this in mind when switching between batch sizes and be prepared to tweak all the other parameters according to the batch size used.

All code blocks above show pseudocode of what the algorithm looks like. Implementing such an algorithm in Theano can be done as follows:

```
# Minibatch Stochastic Gradient Descent (MSGD)
# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;
# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
# (updates are used because params is a shared variable that
# persists across function calls)
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
```

## Regularization

There is more to machine learning than optimization. When we train our model from data we are trying to prepare it to do well on new, unseen examples.

L1 and L2 regularization involve adding an extra term to the loss function, which penalizes certain parameter configurations. Formally, if our loss function is:

$$NLL(\theta, \mathcal{D}) = -\sum_{i=1}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$$

then the regularized loss will be:

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)$$

or, in our case

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p$$

where

$$\|\theta\|_p = \left( \sum_{j=1}^{|\theta|} |\theta_j|^p \right)^{1/p}$$

which is the $L_p$ norm of $\theta$. $\lambda$ is a hyper-parameter which controls the relative importance of the regularization term. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p=2, then the regularizer is also called “weight decay”.

In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that the network models). More intuitively, the two terms (NLL and $R(\theta)$) correspond to modelling the data well (NLL) and having “simple” or “smooth” solutions ($R(\theta)$). Thus, minimizing the sum of both will, in theory, correspond to finding the right trade-off between the fit to the training data and the “generality” of the solution that is found. To follow Occam’s razor principle, this minimization should find us the simplest solution (as measured by our simplicity criterion) that fits the training data.

Note that the fact that a solution is “simple” does not mean that it will generalize well. Empirically, it was found that performing such regularization in the context of neural networks helps with generalization, especially on small datasets. In Python, the loss is computed as the NLL plus an L1 regularization term weighted by $\lambda_1$ and an L2 regularization term weighted by $\lambda_2$.
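A regularized loss with both penalty terms could be sketched as follows. This is plain Python rather than Theano; the function name `regularized_loss`, the flat parameter list and the numeric values are assumptions made for the example, and the L2 term is taken to be the squared L2 norm, matching the $p=2$ case above.

```python
def regularized_loss(nll, params, lambda_1, lambda_2):
    """Return NLL + lambda_1 * ||theta||_1 + lambda_2 * ||theta||_2^2."""
    l1 = sum(abs(p) for p in params)        # L1 norm of the parameters
    l2_sqr = sum(p ** 2 for p in params)    # squared L2 norm of the parameters
    return nll + lambda_1 * l1 + lambda_2 * l2_sqr

# Hypothetical values: an NLL of 2.0 and three parameters.
loss = regularized_loss(2.0, [0.5, -1.0, 2.0], lambda_1=0.01, lambda_2=0.001)
print(loss)
```

Setting either weight to zero switches the corresponding penalty off, which is a convenient way to compare regularizers on the validation set.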
## Early-Stopping

Early-stopping combats overfitting by monitoring the model’s performance on a validation set.

The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use of a strategy based on a geometrically increasing amount of patience.

```
import copy
import timeit

import numpy

# early-stopping parameters
patience = 5000  # look at this many examples regardless
patience_increase = 2  # wait this much longer when a new best is found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
# go through this many minibatches before checking the network
# on the validation set; in this case we check every epoch
validation_frequency = min(n_train_batches, patience // 2)

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = timeit.default_timer()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in range(n_train_batches):

        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params  # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ...  # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                # keep the best model seen so far
                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        # stop once patience is exhausted
        if patience <= iter:
            done_looping = True
            break
```

If we run out of batches of training data before running out of patience, then we just go back to the beginning of the training set and repeat.

Note: This algorithm could possibly be improved by using a test of statistical significance rather than the simple comparison, when deciding whether to increase the patience.
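The patience rule can be exercised without any training by feeding it a list of validation losses. The sketch below is a simplification under stated assumptions: it validates at every iteration (no `validation_frequency`), the function name `early_stopping` is hypothetical, and the loss values are invented.

```python
def early_stopping(validation_losses, patience=4, patience_increase=2,
                   improvement_threshold=0.995):
    """Scan one validation loss per iteration.

    Returns (best_iter, best_loss, stop_iter).
    """
    best_loss = float("inf")
    best_iter = -1
    stop_iter = len(validation_losses) - 1
    for it, loss in enumerate(validation_losses):
        if loss < best_loss:
            # Extend patience only for a significant relative improvement.
            if loss < best_loss * improvement_threshold:
                patience = max(patience, it * patience_increase)
            best_loss, best_iter = loss, it
        if patience <= it:
            stop_iter = it
            break
    return best_iter, best_loss, stop_iter

# Invented losses: improvement until iteration 3, then slow overfitting.
losses = [0.50, 0.40, 0.35, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36]
print(early_stopping(losses))
```

Because patience grows to twice the iteration index of the last significant improvement, a model that improves late is given proportionally more time before training stops.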
## Testing

After the loop exits, the best_params variable refers to the best-performing model on the validation set. If we repeat this procedure for another model class, or even another random initialization, we should use the same train/valid/test split of the data, and get other best-performing models. If we have to choose what the best model class or the best initialization was, we compare the best_validation_loss for each model. When we have finally chosen the model we think is the best (on validation data), we report that model’s test set performance. That is the performance we expect on unseen examples.

## Theano/Python Tips

## Loading and Saving Models

When you’re doing experiments, it can take hours (sometimes days!) for gradient-descent to find the best parameters. You will want to save those weights once you find them. You may also want to save your current-best estimates as the search progresses.
The best way to save/archive your model’s parameters is to use pickle or deepcopy the ndarray objects. So for example, if your parameters are in shared variables `w`, `v`, `u`, you can pickle the arrays returned by `w.get_value(borrow=True)`, `v.get_value(borrow=True)` and `u.get_value(borrow=True)`, one after another, into a file opened in binary mode.

Then later, you can load your data back like this (on Python 2, use `cPickle` in place of `pickle`):

`save_file = open('path', 'rb')`
`w.set_value(pickle.load(save_file), borrow=True)`
`v.set_value(pickle.load(save_file), borrow=True)`
`u.set_value(pickle.load(save_file), borrow=True)`

This technique is a bit verbose, but it is tried and true. You will be able to load your data and render it in matplotlib without trouble, years after saving it.
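The save-then-load round trip can be tried without Theano. In this sketch plain Python lists stand in for the arrays held by the shared variables, and the temporary file path is an assumption made for the example; with real shared variables you would dump the result of `get_value()` and pass each loaded array to `set_value()`.

```python
import os
import pickle
import tempfile

# Plain lists stand in for the parameter arrays of w, v and u.
w, v, u = [0.1, 0.2], [0.3], [0.4, 0.5, 0.6]

path = os.path.join(tempfile.mkdtemp(), "params.pkl")

# Save: dump each parameter in a fixed order.
with open(path, "wb") as save_file:
    for value in (w, v, u):
        pickle.dump(value, save_file, protocol=pickle.HIGHEST_PROTOCOL)

# Load: read the parameters back in the same order they were written.
with open(path, "rb") as save_file:
    w_loaded = pickle.load(save_file)
    v_loaded = pickle.load(save_file)
    u_loaded = pickle.load(save_file)

print(w_loaded, v_loaded, u_loaded)
```

The order of `pickle.load` calls must match the order of the `pickle.dump` calls, since each call reads the next object from the file.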
Theano functions are compatible with Python’s deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update your Theano folder and one of the internals changes, then you may not be able to un-pickle your model. Theano is still in active development, and the internal APIs are subject to change. So to be on the safe side, do not pickle your entire training or testing functions for long-term storage. The pickle mechanism is aimed at short-term storage, such as a temp file, or a copy to another machine in a distributed job. Read more about serialization in Theano, or Python’s pickling.

## Plotting Intermediate Results

Visualizations can be very powerful tools for understanding what your model or training algorithm is doing. You might be tempted to insert

You already have a model-saving function, right? Just use it again to save these intermediate models. Libraries you’ll want to know about: Python Imaging Library (PIL), matplotlib.

