# Introduction to Multi-Layer Perceptrons (Feedforward Neural Networks)

## Multi-Layer Neural Networks

An MLP (Multi-Layer Perceptron), or multi-layer neural network, defines a family of functions. Let us first consider the most classical case of a single hidden layer neural network, mapping a $d$-vector to an $m$-vector (e.g. for regression):

$$g(x) = b + W \tanh(c + V x)$$

where $x$ is a $d$-vector (the input), $V$ is a $k \times d$ matrix (called input-to-hidden weights), $c$ is a $k$-vector (called hidden unit offsets or hidden unit biases), $b$ is an $m$-vector (called output unit offsets or output unit biases), and $W$ is an $m \times k$ matrix (called hidden-to-output weights).

The vector-valued function $h(x) = \tanh(c + V x)$ is called the output of the hidden layer. Note that, in the above network, the output is an affine transformation of the hidden layer; some network architectures apply a further non-linearity to it. The elements of the hidden layer are called hidden units.
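
As a minimal sketch of the single hidden layer network described above, the following plain-Python code computes the hidden layer output and the affine output. It assumes tanh hidden units and uses the names `V`, `c`, `W`, `b` for the input-to-hidden weights, hidden biases, hidden-to-output weights and output biases; the helper names and the numeric values are illustrative only, and lists are used instead of an array library to keep the example dependency-free.

```python
import math

def affine(weights, bias, vec):
    """Return bias + weights @ vec, with weights given as a list of rows."""
    return [b_i + sum(w_ij * v_j for w_ij, v_j in zip(row, vec))
            for row, b_i in zip(weights, bias)]

def hidden(x, V, c):
    """Hidden layer output h(x) = tanh(c + V x)."""
    return [math.tanh(a) for a in affine(V, c, x)]

def mlp(x, V, c, W, b):
    """Network output g(x) = b + W h(x): affine in the hidden layer."""
    return affine(W, b, hidden(x, V, c))

# Illustrative sizes: d = 3 inputs, k = 2 hidden units, m = 1 output.
V = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.4]]
c = [0.0, 0.1]
W = [[1.0, -1.0]]
b = [0.5]

x = [1.0, 2.0, 3.0]
print(mlp(x, V, c, W, b))
```

Changing the number of rows of `V` (and of columns of `W`) changes the number of hidden units without affecting the input or output dimensions.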

The kind of operation computed by $h(x)$ above can be applied to $h(x)$ itself, but with different parameters (different biases and weights). This would give rise to a feedforward multi-layer network with two hidden layers. More generally, one can build a deep neural network by stacking more such layers. Each of these layers may have a different dimension ($k$ above). A common variant is to have skip connections, i.e., a layer can take as input not only the layer at the previous level but also some of the lower layers.
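
The stacking idea can be sketched as follows, again in plain Python with tanh layers. The function names are hypothetical, and implementing a skip connection by concatenating the input to the previous layer's output is only one possible realization, chosen here as an assumption for illustration.

```python
import math

def tanh_layer(weights, bias, vec):
    """One hidden layer: tanh(bias + weights @ vec)."""
    return [math.tanh(b_i + sum(w * v for w, v in zip(row, vec)))
            for row, b_i in zip(weights, bias)]

def stack(x, params):
    """Apply tanh layers in sequence; params is a list of (weights, bias)."""
    h = x
    for weights, bias in params:
        h = tanh_layer(weights, bias, h)
    return h

def stack_with_skip(x, first, second):
    """Two tanh layers where the second layer also sees the raw input."""
    h1 = tanh_layer(first[0], first[1], x)
    return tanh_layer(second[0], second[1], h1 + x)  # list concatenation

x = [1.0, -1.0]
first = ([[0.5, -0.5], [0.1, 0.2], [0.0, 1.0]], [0.0, 0.0, 0.1])  # 2 -> 3 units
second = ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.1, -0.1])      # 3 -> 2 units
second_skip = ([[1.0, 0.0, -1.0, 0.3, 0.0],
                [0.5, 0.5, 0.5, 0.0, -0.3]], [0.1, -0.1])         # (3 + 2) -> 2 units

print(stack(x, [first, second]))
print(stack_with_skip(x, first, second_skip))
```

With the extra skip weights set to zero, the second network reduces to the plain two-layer stack.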

## Most Common Training Criteria and Output Non-Linearities

Let $f(x) = r(g(x))$, with $r$ representing the output non-linearity function. In supervised learning, the output $f(x)$ can be compared with a target value $y$ through a loss functional $L(f, (x, y))$. Here are common loss functionals, with the associated output non-linearity:

• for ordinary (L2) regression: no non-linearity ($r(a) = a$), squared loss $L(f, (x, y)) = \|f(x) - y\|^2 = \sum_i (f_i(x) - y_i)^2$.
• for median (L1) regression: no non-linearity ($r(a) = a$), absolute value loss $L(f, (x, y)) = |f(x) - y|_1 = \sum_i |f_i(x) - y_i|$.
• for 2-way probabilistic classification: sigmoid non-linearity ($r(a) = \mathrm{sigmoid}(a) = 1/(1 + e^{-a})$, applied element by element), and cross-entropy loss $L(f, (x, y)) = -y \log f(x) - (1 - y) \log(1 - f(x))$ for binary $y$. Note that the sigmoid output $f(x)$ is in the $(0, 1)$ interval, and corresponds to an estimator of $P(y = 1 \mid x)$. The predicted class is 1 if $f(x) > \frac{1}{2}$.
• for multiple binary probabilistic classification: each output element is treated as above.
• for 2-way hard classification with hinge loss: no non-linearity ($r(a) = a$) and the hinge loss is $L(f, (x, y)) = \max(0, 1 - (2y - 1) f(x))$ (again for binary $y$). This is the SVM classifier loss.
• the above can be generalized to multiple classes by separately considering the binary classification of each class against the others.
• for multi-way probabilistic classification: softmax non-linearity ($r_i(a) = e^{a_i} / \sum_j e^{a_j}$, with one output per class) with the negative log-likelihood loss $L(f, (x, y)) = -\log f_y(x)$. Note that $\sum_i f_i(x) = 1$ and $0 < f_i(x) < 1$. Note also that this is equivalent to the cross-entropy loss in the 2-class case (the output for one of the classes is actually redundant).
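
The losses listed above can be written down directly. The sketch below is illustrative (function names are assumptions, binary targets are taken in {0, 1}, and the softmax is shifted by its maximum for numerical stability, a standard trick not discussed in the text). It also checks numerically that the 2-class softmax with negative log-likelihood agrees with the sigmoid with cross-entropy applied to the difference of the two scores.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(a):
    """Softmax of a list, shifted by max(a) for numerical stability."""
    top = max(a)
    exps = [math.exp(a_i - top) for a_i in a]
    total = sum(exps)
    return [e / total for e in exps]

def squared_loss(f, y):
    return sum((f_i - y_i) ** 2 for f_i, y_i in zip(f, y))

def absolute_loss(f, y):
    return sum(abs(f_i - y_i) for f_i, y_i in zip(f, y))

def cross_entropy(f, y):
    """Binary cross-entropy for a sigmoid output f in (0, 1), y in {0, 1}."""
    return -y * math.log(f) - (1 - y) * math.log(1.0 - f)

def hinge_loss(f, y):
    """Hinge loss for a raw score f; y in {0, 1} is mapped to -1 / +1."""
    return max(0.0, 1.0 - (2 * y - 1) * f)

def nll(p, y):
    """Negative log-likelihood of class index y under probabilities p."""
    return -math.log(p[y])

# Two-class check: softmax over (a0, a1) gives P(class 1) = sigmoid(a1 - a0),
# so the NLL equals the cross-entropy computed on the score difference.
a = [0.3, 1.7]
p = softmax(a)
for y in (0, 1):
    print(y, nll(p, y), cross_entropy(sigmoid(a[1] - a[0]), y))
```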

## The Back-Propagation Algorithm

We just apply the recursive gradient computation algorithm seen previously to the graph formed naturally by the MLP, with one node for each input unit, hidden unit and output unit. Note that each parameter (weight or bias) also corresponds to a node, and the final node of the graph is the scalar loss.

Let us formalize a notation for MLPs with more than one hidden layer. Let us denote by $h_i$ the output vector of the $i$-th layer, starting with $h_0 = x$ (the input), and finishing with a special output layer $h_K$, which produces the prediction or output of the network.

With tanh units in the hidden layers, we have (in matrix-vector notation):

For $k = 1$ to $K - 1$:

$$h_k = \tanh(b_k + W_k h_{k-1})$$

where $b_k$ is a vector of biases and $W_k$ is a matrix of weights connecting layer $k-1$ to layer $k$. The scalar computation associated with a single unit $i$ of layer $k$ is

$$h_{k,i} = \tanh\Big(b_{k,i} + \sum_j W_{k,i,j} \, h_{k-1,j}\Big)$$

In the case of a probabilistic classifier, we would then have a softmax output layer, e.g.,

$$p = h_K = \mathrm{softmax}(b_K + W_K h_{K-1})$$

where we used $p$ to denote the output because it is a vector indicating a probability distribution over classes. And the loss is

$$L = -\log p_y$$

where $y$ is the target class, i.e., we want to maximize $p_y = P(Y = y \mid x)$, an estimator of the conditional probability of class $y$ given input $x$.
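
The forward computation just described (tanh hidden layers, softmax output layer, negative log-likelihood loss) can be sketched as below. The `forward` function, the list-of-pairs parameter layout and the numeric values are assumptions made for illustration; the code keeps every layer output so that a later gradient computation could reuse them.

```python
import math

def affine(W, b, h):
    """Return b + W h, with W given as a list of rows."""
    return [b_i + sum(w * v for w, v in zip(row, h)) for row, b_i in zip(W, b)]

def softmax(a):
    top = max(a)  # shift for numerical stability
    exps = [math.exp(a_i - top) for a_i in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """layers = [(W_1, b_1), ..., (W_K, b_K)]; returns [h_0, ..., h_K]."""
    hs = [x]
    for k, (W, b) in enumerate(layers, start=1):
        a = affine(W, b, hs[-1])
        if k < len(layers):
            hs.append([math.tanh(a_i) for a_i in a])  # hidden layer
        else:
            hs.append(softmax(a))                      # output layer
    return hs

# Illustrative network: 2 inputs -> 3 tanh units -> 3 classes (K = 2).
layers = [
    ([[0.5, -0.5], [0.1, 0.2], [0.0, 1.0]], [0.0, 0.0, 0.1]),
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 1.0, 0.0]], [0.0, 0.1, -0.1]),
]
x, y = [1.0, -1.0], 2

hs = forward(x, layers)
p = hs[-1]
loss = -math.log(p[y])
print(p, loss)
```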

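Let us now see how the recursive application of the chain rule in flow graphs is instantiated in this structure. First of all, let us denote

$$a_k = b_k + W_k h_{k-1}$$

(the argument of the non-linearity at each level) and note (from a small derivation) that

$$\frac{\partial (-\log p_y)}{\partial a_{K,i}} = p_i - 1_{y=i}$$

and that

$$\frac{\partial \tanh(u)}{\partial u} = 1 - \tanh(u)^2.$$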
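
For reference, the small derivation can be spelled out as follows, writing $a_{K,i}$ for the $i$-th pre-softmax sum and $p_i = e^{a_{K,i}} / \sum_j e^{a_{K,j}}$ for the corresponding softmax output (notation assumed here for this sketch). Taking the logarithm of the softmax gives

$$-\log p_y = -a_{K,y} + \log \sum_j e^{a_{K,j}}$$

and differentiating with respect to $a_{K,i}$ yields

$$\frac{\partial (-\log p_y)}{\partial a_{K,i}} = -1_{y=i} + \frac{e^{a_{K,i}}}{\sum_j e^{a_{K,j}}} = p_i - 1_{y=i}.$$

For the hidden units, writing $\tanh(u) = \sinh(u) / \cosh(u)$ and applying the quotient rule gives

$$\frac{\partial \tanh(u)}{\partial u} = \frac{\cosh(u)^2 - \sinh(u)^2}{\cosh(u)^2} = 1 - \tanh(u)^2.$$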

Now let us apply the back-propagation recipe in the corresponding flow graph. Each parameter (each weight and each bias) is a node, and each neuron potential $a_{k,i}$ and each neuron output $h_{k,i}$ is also a node.

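• starting at the output node:

$$\frac{\partial L}{\partial L} = 1$$

• then compute the gradient with respect to each pre-softmax sum $a_{K,i}$:

$$\frac{\partial L}{\partial a_{K,i}} = \frac{\partial L}{\partial L} \frac{\partial L}{\partial a_{K,i}} = p_i - 1_{y=i}$$

• We can now repeat the same recipe for each layer. For $k = K$ down to 1:

  • obtain trivially the gradient with respect to biases:

$$\frac{\partial L}{\partial b_{k,i}} = \frac{\partial L}{\partial a_{k,i}} \frac{\partial a_{k,i}}{\partial b_{k,i}} = \frac{\partial L}{\partial a_{k,i}}$$

  • compute the gradient with respect to weights:

$$\frac{\partial L}{\partial W_{k,i,j}} = \frac{\partial L}{\partial a_{k,i}} \frac{\partial a_{k,i}}{\partial W_{k,i,j}} = \frac{\partial L}{\partial a_{k,i}} \, h_{k-1,j}$$

  • back-propagate the gradient into the lower layer, if $k > 1$:

$$\frac{\partial L}{\partial h_{k-1,j}} = \sum_i \frac{\partial L}{\partial a_{k,i}} \frac{\partial a_{k,i}}{\partial h_{k-1,j}} = \sum_i \frac{\partial L}{\partial a_{k,i}} \, W_{k,i,j}$$

$$\frac{\partial L}{\partial a_{k-1,j}} = \frac{\partial L}{\partial h_{k-1,j}} \frac{\partial h_{k-1,j}}{\partial a_{k-1,j}} = \frac{\partial L}{\partial h_{k-1,j}} \, (1 - h_{k-1,j}^2)$$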
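
The back-propagation recipe for a tanh network with a softmax output and negative log-likelihood loss can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions (parameters stored as a list of `(W, b)` pairs, hypothetical function names, made-up numeric values), not a reference one. The last lines compare one analytic weight gradient against a central finite difference as a sanity check.

```python
import math

def affine(W, b, h):
    return [b_i + sum(w * v for w, v in zip(row, h)) for row, b_i in zip(W, b)]

def softmax(a):
    top = max(a)
    exps = [math.exp(a_i - top) for a_i in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """Return [h_0, ..., h_K]: tanh hidden layers, softmax output layer."""
    hs = [x]
    for k, (W, b) in enumerate(layers, start=1):
        a = affine(W, b, hs[-1])
        hs.append(softmax(a) if k == len(layers) else [math.tanh(v) for v in a])
    return hs

def loss(x, y, layers):
    return -math.log(forward(x, layers)[-1][y])

def backprop(x, y, layers):
    """Return [(dL/dW_k, dL/db_k) for k = 1..K] for L = -log p_y."""
    hs = forward(x, layers)
    p = hs[-1]
    # Gradient with respect to the pre-softmax sums: p_i - 1_{y=i}.
    grad_a = [p_i - (1.0 if i == y else 0.0) for i, p_i in enumerate(p)]
    grads = [None] * len(layers)
    for k in range(len(layers), 0, -1):
        W, _ = layers[k - 1]
        h_prev = hs[k - 1]
        grad_b = list(grad_a)                                  # dL/db_k
        grad_W = [[g_i * h_j for h_j in h_prev] for g_i in grad_a]  # dL/dW_k
        grads[k - 1] = (grad_W, grad_b)
        if k > 1:
            # dL/dh_{k-1,j} = sum_i dL/da_{k,i} * W_{k,i,j}
            grad_h = [sum(grad_a[i] * W[i][j] for i in range(len(W)))
                      for j in range(len(h_prev))]
            # dL/da_{k-1,j} = dL/dh_{k-1,j} * (1 - h_{k-1,j}^2)
            grad_a = [g * (1.0 - h * h) for g, h in zip(grad_h, h_prev)]
    return grads

# Illustrative network: 2 inputs -> 3 tanh -> 2 tanh -> 3 classes (K = 3).
layers = [
    ([[0.5, -0.5], [0.1, 0.2], [0.0, 1.0]], [0.0, 0.0, 0.1]),
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.1, -0.1]),
    ([[1.0, -1.0], [0.0, 0.5], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
]
x, y = [1.0, -1.0], 2
grads = backprop(x, y, layers)

# Central finite difference on one first-layer weight.
eps = 1e-6
W1 = layers[0][0]
W1[0][1] += eps
up = loss(x, y, layers)
W1[0][1] -= 2 * eps
down = loss(x, y, layers)
W1[0][1] += eps  # restore the original value
numeric = (up - down) / (2 * eps)
print(grads[0][0][0][1], numeric)
```

The two printed numbers should agree closely; a large mismatch usually points to an indexing error in the backward loop.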

## Logistic Regression

Logistic regression is a special case of the MLP with no hidden layer (the input is directly connected to the output) and the cross-entropy (sigmoid output) or negative log-likelihood (softmax output) loss. It corresponds to a probabilistic linear classifier, and the training criterion is convex in terms of the parameters (which guarantees that every local minimum is a global minimum).
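
A minimal sketch of binary logistic regression (sigmoid output, cross-entropy loss) trained by full-batch gradient descent is given below. The choice of optimizer, the learning rate, the number of steps and the toy data are all assumptions made for illustration. The gradient uses the fact that, for a sigmoid output with cross-entropy loss, the derivative of the loss with respect to the linear score is the prediction minus the target.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(w, b, x):
    """Estimate of P(y = 1 | x) from the linear score b + w . x."""
    return sigmoid(b + sum(w_j * x_j for w_j, x_j in zip(w, x)))

def mean_cross_entropy(w, b, data):
    total = 0.0
    for x, y in data:
        f = predict(w, b, x)
        total += -y * math.log(f) - (1 - y) * math.log(1.0 - f)
    return total / len(data)

def gradient_step(w, b, data, lr):
    """One full-batch gradient descent step on the mean cross-entropy."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in data:
        err = predict(w, b, x) - y  # dL/d(score) for sigmoid + cross-entropy
        grad_b += err
        for j, x_j in enumerate(x):
            grad_w[j] += err * x_j
    n = len(data)
    new_w = [w_j - lr * g / n for w_j, g in zip(w, grad_w)]
    return new_w, b - lr * grad_b / n

# Made-up toy data: the label follows the first input feature.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([0.2, 0.9], 0),
        ([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.8, 0.1], 1)]

w, b = [0.0, 0.0], 0.0
losses = []
for _ in range(200):
    losses.append(mean_cross_entropy(w, b, data))
    w, b = gradient_step(w, b, data, lr=0.5)

print(losses[0], losses[-1])
```

Because the criterion is convex, a sufficiently small learning rate makes the recorded loss decrease at every step.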

# Training Multi-Layer Neural Networks

Many algorithms have been proposed to train multi-layer neural networks but the most commonly used ones are gradient-based.

Two fundamental issues guide the various strategies employed in training MLPs:

• training as efficiently as possible, i.e., getting training error down as quickly as possible while avoiding getting stuck in narrow valleys or even local minima of the cost function,
• controlling capacity so as to achieve the largest capacity that avoids overfitting, i.e., to minimize generalization error.