Neural Networks

To re-read:
Siamese networks: twin networks with shared weights that produce a similarity score for two inputs
Transformer model ("Attention Is All You Need")
Neural memory networks
Neural Turing machines
Capsule networks

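Sigmoid (binary / multi-label) cross entropy on a logit x with label y in {0, 1}; true multi-class cross entropy would use softmax instead:
x * (1-y) - log(sigmoid(x)) == -[(1-y)*log(1-sigmoid(x)) + y*log(sigmoid(x))]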
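The identity above follows from log(1 - sigmoid(x)) = -x + log(sigmoid(x)). A minimal numerical check, assuming a single logit x and a label y of 0 or 1 (the function names are made up for this sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_from_probability(x, y):
    # Textbook form: -[(1-y)*log(1-p) + y*log(p)] with p = sigmoid(x).
    p = sigmoid(x)
    return -((1 - y) * math.log(1 - p) + y * math.log(p))

def bce_from_logit(x, y):
    # Rearranged form from the note: x*(1-y) - log(sigmoid(x)).
    return x * (1 - y) - math.log(sigmoid(x))

# Both forms should agree for any moderate logit and either label.
for x in (-3.0, -0.5, 0.0, 2.0):
    for y in (0.0, 1.0):
        assert math.isclose(bce_from_probability(x, y), bce_from_logit(x, y), rel_tol=1e-9)
```

For very large |x| neither form is numerically safe as written; libraries typically use a further rearranged, overflow-free version.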

Single-layer neural network
Can only learn linearly separable functions (classes split by a plane), e.g. AND, OR.

Cannot learn non-linearly separable functions, e.g. XOR.

In higher dimensions the boundary is a hyperplane (some inputs fall on one side of the plane, the rest on the other side).

For example, with a threshold function in a neuron: output 1 if sum_i w_i * input_i (the inner product of the weight and input vectors) > some threshold, otherwise output 0.
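The separability claim can be sketched with a single threshold neuron trained by the classic perceptron rule. This is an illustrative sketch, not a reference implementation: the learning rate, epoch count, zero initialization, and folding the threshold into a bias term are all assumptions.

```python
def step(weights, bias, inputs):
    """Threshold neuron: output 1 if w . x + bias > 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total + bias > 0 else 0

def train_perceptron(samples, epochs=25, lr=1.0):
    """Perceptron rule: nudge weights by lr * (target - output) * input."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - step(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

def accuracy(samples, weights, bias):
    hits = sum(step(weights, bias, x) == t for x, t in samples)
    return hits / len(samples)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [(x, x[0] & x[1]) for x in INPUTS]
OR = [(x, x[0] | x[1]) for x in INPUTS]
XOR = [(x, x[0] ^ x[1]) for x in INPUTS]

for name, data in (("AND", AND), ("OR", OR), ("XOR", XOR)):
    w, b = train_perceptron(data)
    print(name, accuracy(data, w, b))
```

AND and OR reach full accuracy because a separating line exists; for XOR no line separates the classes, so accuracy stays below 1 however long the neuron trains.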

Total-weight-preserving adjustment: weight is redistributed among a neuron's inputs while the total stays constant. Why?
How should the training set be mixed so that the neurons train properly? If we just present items of two different classes in alternation, the weights will oscillate between those classes.
Some neurons may start with such odd initial weights that they never win any competition against the other neurons in a one-layer classification task.
There is always a chance of getting stuck in a local minimum; a little randomization in the weight adjustment might help.

Recurrent Neural Network

"Linguistic Regularities in Continuous Space Word Representations", by Microsoft Research
The RNN generates a vector for each word, using the RNN Language Model toolkit
trained on Penn Treebank POS tags
Word projections to vector space range from 80 dimensions (RNN-80) up to 1600 dimensions

Hopfield Network

Good for emulating human associative (content-addressable) memory: a stored pattern is recalled from a partial or noisy cue.
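The memory-like behaviour can be sketched with a tiny Hopfield network. The assumptions here are mine: bipolar (+1/-1) units, Hebbian outer-product weights with a zero diagonal, asynchronous updates in index order, and two hand-picked orthogonal patterns.

```python
def train_hopfield(patterns):
    """Hebbian rule: w[i][j] = sum over patterns of p[i] * p[j], zero diagonal."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights

def recall(weights, state, sweeps=5):
    """Asynchronously set each unit to the sign of its weighted input."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            state[i] = 1 if field >= 0 else -1
    return state

pattern_a = [1, -1, 1, -1, 1, -1, 1, -1]
pattern_b = [1, 1, 1, 1, -1, -1, -1, -1]
weights = train_hopfield([pattern_a, pattern_b])

noisy = list(pattern_a)
noisy[0] = -noisy[0]           # corrupt one unit of the cue
print(recall(weights, noisy))  # settles back onto pattern_a
```

Capacity is limited: storing many or strongly correlated patterns in a network this small would produce spurious recalls.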

Interview Questions

L1 vs L2 norm
The sigmoid function has range (0, 1), whereas the ReLU function has range [0, ∞). Hence the sigmoid can be used to model a probability, whereas ReLU can be used to model a non-negative real number.
ReLU can be used in Restricted Boltzmann machine to model real/integer valued inputs.

The gradient of the sigmoid or tanh function vanishes as |x| grows, in either direction. The gradient of the ReLU function, however, stays at 1 for every x > 0, so it does not vanish as x increases.

ReLUs are much simpler computationally. The forward and backward passes through an ReLU are both just a simple if statement.

Sigmoid activations are easier to saturate. There is a comparatively narrow interval of inputs for which the sigmoid's derivative is sufficiently nonzero. In other words, once a sigmoid reaches either the left or right plateau, it is almost meaningless to make a backward pass through it, since the derivative is very close to 0. On the other hand, ReLUs only saturate when the input is less than 0. And even this saturation can be eliminated by using leaky ReLUs. For very deep networks, saturation hampers learning, and so ReLUs provide a nice workaround.
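A small sketch of the saturation argument, comparing the two derivatives at a few hand-picked inputs (the sample points are arbitrary, and the ReLU derivative at exactly 0 is taken to be 0 by convention):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s); peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # 1 for active units, 0 otherwise.
    return 1.0 if x > 0 else 0.0

for x in (-10.0, -2.0, 0.5, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

Because the sigmoid derivative never exceeds 0.25, a backward pass through many stacked sigmoid layers multiplies many small factors together, which is the vanishing-gradient effect in very deep networks.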

ReLU is smoothly approximated by softplus: ln(1+e^x)

Leaky ReLU: x if x > 0 else 0.01 * x
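The three variants side by side, as a minimal sketch (the 0.01 slope is the one from the note; the function names are chosen for illustration):

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    # Smooth approximation of ReLU: ln(1 + e^x).
    return math.log1p(math.exp(x))

def leaky_relu(x, slope=0.01):
    # Keeps a small gradient for negative inputs instead of a hard zero.
    return x if x > 0 else slope * x

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  relu={relu(x):.4f}  softplus={softplus(x):.4f}  leaky={leaky_relu(x):.4f}")
```

Softplus differs most from ReLU around x = 0 (where it equals ln 2) and converges to it as |x| grows.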

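Biological plausibility: one-sided, rather than the antisymmetry of tanh.

Efficient gradient propagation: fewer vanishing-gradient problems than with sigmoid or tanh, since the gradient is exactly 1 for active units (exploding gradients can still occur).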

Potential problems
Non-differentiable at zero: however, it is differentiable everywhere else, and in practice the derivative at 0 is simply set to 0 or 1.

The other benefit of ReLUs is sparsity. Sparsity arises when the pre-activation a <= 0, because the unit then outputs exactly 0. The more such units exist in a layer, the sparser the resulting representation. Sigmoids, on the other hand, always produce some non-zero value, which results in dense representations. Sparse representations seem to be more beneficial than dense ones.
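A toy illustration of the sparsity point, using a hand-picked list of pre-activations (the values and the `sparsity` helper are assumptions made for this sketch):

```python
import math

def sparsity(activations):
    """Fraction of units whose output is exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.7, -1.1]

relu_out = [max(0.0, a) for a in pre_activations]
sigmoid_out = [1.0 / (1.0 + math.exp(-a)) for a in pre_activations]

print("ReLU sparsity:   ", sparsity(relu_out))
print("sigmoid sparsity:", sparsity(sigmoid_out))
```

Every unit with a <= 0 contributes an exact zero under ReLU, while the sigmoid outputs are all strictly positive.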

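Maxout: a unit computes several linear pieces of its input and sends only the max to the next layer.
Max pooling: in CNNs, the max is taken over each patch of a feature map, and the per-patch maxes are assembled into the downsampled output.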
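The difference can be sketched in a few lines. Both functions are simplified illustrations: the maxout unit takes its weight sets as plain lists, and the pooling assumes non-overlapping 2x2 patches on a feature map with even height and width.

```python
def maxout(inputs, weight_sets, biases):
    """One maxout unit: max over k learned affine pieces w_k . x + b_k."""
    return max(
        sum(w * x for w, x in zip(ws, inputs)) + b
        for ws, b in zip(weight_sets, biases)
    )

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling over a 2D list (even height and width)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

# Maxout picks between learned linear functions of the same input vector.
print(maxout([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))

# Max pooling picks the largest value inside each spatial patch.
fmap = [[1, 2, 5, 6],
        [3, 4, 7, 8],
        [9, 1, 0, 2],
        [3, 4, 5, 1]]
print(max_pool_2x2(fmap))
```

Maxout is an activation function with trainable parameters; max pooling has no parameters and only downsamples.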

LSTM vs GRU: the GRU combines the LSTM's forget and input gates into a single "update gate". It also merges the cell state and hidden state, and makes some other changes.
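A scalar sketch of one GRU step, to make the single update gate and the merged state concrete. Everything here is illustrative: real GRUs use weight matrices rather than scalars, the parameter names and values are made up, and the (1 - z) * h_prev + z * h_tilde convention is one of two in common use (some references swap the roles of z and 1 - z).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step on scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state; the reset gate scales how much of the past is used.
    h_tilde = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # A single gate both forgets (1 - z) and writes (z); no separate cell state.
    return (1.0 - z) * h_prev + z * h_tilde

# Hypothetical parameter values, chosen only for the demonstration.
params = {"w_z": 0.5, "u_z": -0.3, "b_z": 0.0,
          "w_r": 0.8, "u_r": 0.1, "b_r": 0.0,
          "w_h": 1.2, "u_h": 0.7, "b_h": 0.0}

h = 0.0
for x in (1.0, -0.5, 0.25):
    h = gru_step(x, h, params)
    print(f"x={x:5.2f}  h={h:.4f}")
```

With a strongly negative update-gate bias, z is close to 0 and the step leaves the hidden state almost unchanged. An LSTM would need its forget gate open and its input gate closed to get the same effect.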