Deep Learning

Batch normalization:
Batch normalization can be applied at the output of each layer you choose. It normalizes that output using the batch mean mu and standard deviation sigma, then passes the normalized values through a learnable linear transform with no activation function. The motivation: if one layer produces a large output, the largeness cascades through the subsequent layers and distorts the network's outcome.
In general, normalization scales the data to a similar range and helps avoid exploding gradients; the same idea underlies batch normalization.
The statistics are computed on a per-batch basis, hence the name batch norm.
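A minimal numpy sketch of the normalize-then-linear-transform step described above (the shapes, eps, and parameter names are illustrative, not a real library API):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features) activations from one layer."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale/shift, no activation

x = 10.0 * np.random.randn(32, 4) + 5.0    # a batch of large, shifted activations
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gamma=1 and beta=0 the output has roughly zero mean and unit variance per feature; during training gamma and beta are learned, so the network can still recover the original scale if that helps.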

- n-grams or char-level
- calibrate the decision threshold on a validation set and reuse that threshold for the test set
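A sketch of that calibration step (using F1 as the selection metric is an assumption, and the data is made up):

```python
import numpy as np

def calibrate_threshold(val_scores, val_labels):
    """Pick the cutoff that maximizes F1 on the validation set."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 19):
        pred = val_scores >= t
        tp = np.sum(pred & (val_labels == 1))
        fp = np.sum(pred & (val_labels == 0))
        fn = np.sum(~pred & (val_labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

val_scores = np.array([0.1, 0.4, 0.6, 0.9])   # toy validation scores
val_labels = np.array([0, 0, 1, 1])
t = calibrate_threshold(val_scores, val_labels)

test_scores = np.array([0.2, 0.7])
test_pred = test_scores >= t                  # reuse the validation threshold at test time
```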

model.compile(loss='categorical_crossentropy', optimizer='adam')  # for next-character prediction
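A minimal sketch of where that compile call might sit in a next-character model (the LSTM architecture, vocabulary size, and sequence length are assumptions, not from these notes):

```python
from tensorflow import keras

vocab_size, seq_len = 27, 40   # assumed: 26 letters + space, 40-char context
model = keras.Sequential([
    keras.Input(shape=(seq_len, vocab_size)),              # one-hot encoded characters
    keras.layers.LSTM(128),
    keras.layers.Dense(vocab_size, activation='softmax'),  # next-char distribution
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

Note that categorical_crossentropy expects one-hot targets; with integer-coded next characters, sparse_categorical_crossentropy would be used instead.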

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning

The newest paper on this is probably "Do Deep Nets Really Need to be Deep?". While its angle is different, the main point is exactly the same: you can train a shallow network to imitate a deep one, but first you need to train the deep network and collect its predictions. Those predictions then become the labels, and the second model attempts to learn the mapping from input to the first model's outputs.
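A toy sketch of that two-step recipe, with sklearn MLPs standing in for the networks (the data, sizes, and layer counts are all made up):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # a nonlinear toy target

# Step 1: train the deep "teacher" on the real labels.
teacher = MLPClassifier(hidden_layer_sizes=(64, 64, 64),
                        max_iter=500, random_state=0).fit(X, y)

# Step 2: the teacher's predicted probabilities become the training labels.
soft_labels = teacher.predict_proba(X)[:, 1]

# Step 3: a shallow "student" learns to reproduce the teacher's outputs.
student = MLPRegressor(hidden_layer_sizes=(16,),
                       max_iter=2000, random_state=0).fit(X, soft_labels)
student_pred = (student.predict(X) > 0.5).astype(int)
```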

Based on their study, neurons work like this: they build a dictionary of bases, and each new item is approximated by a sparse sum of those bases.

Here, out of 64 bases only 3 are used to approximate the new item:
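A numpy sketch of that idea: a made-up dictionary of 64 unit-norm bases, a signal built from 3 of them, and a greedy (orthogonal-matching-pursuit-style) sparse approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 64))
D /= np.linalg.norm(D, axis=0)                # dictionary of 64 unit-norm bases

true_idx = [3, 17, 42]                        # the signal uses only 3 bases
x = D[:, true_idx] @ np.array([1.5, -2.0, 0.7])

residual, chosen = x.copy(), []
for _ in range(3):
    chosen.append(int(np.argmax(np.abs(D.T @ residual))))    # best-matching basis
    coef, *_ = np.linalg.lstsq(D[:, chosen], x, rcond=None)  # refit on chosen set
    residual = x - D[:, chosen] @ coef
```

In real sparse coding the dictionary itself is also learned from the unlabeled data, not fixed in advance as it is here.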

Sparse coding allows learning useful features from unlabeled data, of which there is an effectively infinite amount (e.g., from the internet).

Sparse coding is very closely related to ICA (Independent Component Analysis).
Andrew Ng uses ICA these days rather than sparse coding.

or learn feature hierarchies

where DBN is Deep Belief Network
The same holds when applying the same technique to other data sets and modalities. Letting unsupervised features be learned hierarchically, as below, works much better than hand-engineering nice features.


This Stanford feature-learning technique quickly passed previous benchmarks in various fields by high margins:

Technical challenge: Scaling up!

Maxims: it is not who has the best algorithm, it is who has the most data. An utterly complex algorithm loses to an inferior one that has been trained on more and more data. In Silicon Valley you see simple algorithms like logistic regression being used, but because they see far more data than the alternatives, they outperform other supervised algorithms.

For unsupervised learning it is not how much data you have trained on that matters, but rather how many features you have learned.

To speed up: divide the neural network across different machines, take advantage of multiple cores within each machine, and propagate updates only once the accumulated change exceeds a threshold delta p. This scheme is robust to machine failure.
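A toy sketch of that update rule, in the spirit of a parameter-server setup (the class names and the norm-based threshold test are assumptions):

```python
import numpy as np

class ParameterServer:
    """Holds the global weights; workers push accumulated updates."""
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def push(self, delta):
        self.weights += delta

class Worker:
    """Accumulates local gradient steps and only communicates with the
    server when the accumulated change exceeds the threshold (delta p)."""
    def __init__(self, server, threshold=0.5, lr=0.1):
        self.server, self.threshold, self.lr = server, threshold, lr
        self.pending = np.zeros_like(server.weights)

    def step(self, grad):
        self.pending -= self.lr * grad                    # local SGD step
        if np.linalg.norm(self.pending) >= self.threshold:
            self.server.push(self.pending)                # communicate
            self.pending[:] = 0.0

ps = ParameterServer(dim=3)
w = Worker(ps, threshold=0.5, lr=0.1)
for _ in range(10):
    w.step(np.array([1.0, 0.0, 0.0]))   # constant toy gradient
```

A real system would run many workers in separate processes; a single worker here just illustrates the "only communicate when the accumulated change is large enough" rule.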
