Entrepreneurials‎ > ‎Jobs‎ > ‎Interview‎ > ‎


Best practices

Word embeddings

Word embeddings are arguably the most widely known best practice in the recent history of NLP. It is well-known that using pre-trained embeddings helps (Kim, 2014) [12]. The optimal dimensionality of word embeddings is mostly task-dependent: a smaller dimensionality works better for more syntactic tasks such as named entity recognition (Melamud et al., 2016) [44] or part-of-speech (POS) tagging (Plank et al., 2016) [32], while a larger dimensionality is more useful for more semantic tasks such as sentiment analysis (Ruder et al., 2016) [45].


While we will not reach the depths of computer vision for a while, neural networks in NLP have become progressively deeper. State-of-the-art approaches now regularly use deep Bi-LSTMs, typically consisting of 3-4 layers, e.g. for POS tagging (Plank et al., 2016) and semantic role labelling (He et al., 2017) [33]. Models for some tasks can be even deeper, cf. Google's NMT model with 8 encoder and 8 decoder layers (Wu et al., 2016) [20]. In most cases, however, performance improvements of making the model deeper than 2 layers are minimal (Reimers & Gurevych, 2017) [46].

These observations hold for most sequence tagging and structured prediction problems. For classification, deep or very deep models perform well only with character-level input and shallow word-level models are still the state-of-the-art (Zhang et al., 2015; Conneau et al., 2016; Le et al., 2017) [282930].

Layer connections

For training deep neural networks, some tricks are essential to avoid the vanishing gradient problem. Different layers and connections have been proposed. Here, we will discuss three: i) highway layers, ii) residual connections, and iii) dense connections.

Highway layers   Highway layers (Srivastava et al., 2015) [1] are inspired by the gates of an LSTM. First let us assume a one-layer MLP, which applies an affine transformation followed by a non-linearity g to its input x:


A highway layer then computes the following function instead:

x . t + (1-t) . σ(wx+b)


where  is elementwise multiplication, t=σ(WTx+bT) is called the transform gate, and (1t) is called the carry gate. As we can see, highway layers are similar to the gates of an LSTM in that they adaptively carry some dimensions of the input directly to the output.  What happens is that when the transform gate is 1, we pass through our activation (H) and suppress the carry gate (since it will be 0). When the carry gate is 1, we pass through the unmodified input (x), while the activation is suppressed. It's like Kalman filter with weights.

Highway layers have been used pre-dominantly to achieve state-of-the-art results for language modelling (Kim et al., 2016; Jozefowicz et al., 2016; Zilly et al., 2017) [234], but have also been used for other tasks such as speech recognition (Zhang et al., 2016) [5]. Sristava's page contains more information and code regarding highway layers.

Residual connections   Residual connections (He et al., 2016) [6] have been first proposed for computer vision and were the main factor for winning ImageNet 2016. Residual connections are even more straightforward than highway layers and learn the following function:

x + σ(wx + b)


which simply adds the input of the current layer to its output via a short-cut connection. This simple modification mitigates the vanishing gradient problem, as the model can default to using the identity function if the layer is not beneficial.

Dense connections   Rather than just adding layers from each layer to the next, dense connections (Huang et al., 2017) [7] (best paper award at CVPR 2017) add direct connections from each layer to all subsequent layers. Let us augment the layer output hh and layer input xx with indices llindicating the current layer. Dense connections then feed the concatenated output from all previous layers as input to the current layer:


where [;][⋅;⋅] represents concatenation. Dense connections have been successfully used in computer vision. They have also found to be useful for Multi-Task Learning of different NLP tasks (Ruder et al., 2017) [49], while a residual variant that uses summation has been shown to consistently outperform residual connections for neural machine translation (Britz et al., 2017) [27].

batch normalization: At each layer, before applying activation function, normalize the output by subtracting mean and division by std of the batch


While batch normalisation in computer vision has made other regularizers obsolete in most applications, dropout (Srivasta et al., 2014) [8] is still the go-to regularizer for deep neural networks in NLP. A dropout rate of 0.5 has been shown to be effective in most scenarios (Kim, 2014). In recent years, variations of dropout such as adaptive (Ba & Frey, 2013) [9] and evolutional dropout (Li et al., 2016) [10] have been proposed, but none of these have found wide adoption in the community. The main problem hindering dropout in NLP has been that it could not be applied to recurrent connections, as the aggregating dropout masks would effectively zero out embeddings over time.

Recurrent dropout   Recurrent dropout (Gal & Ghahramani, 2016) [11] addresses this issue by applying the same dropout mask across timesteps at layer ll. This avoids amplifying the dropout noise along the sequence and leads to effective regularization for sequence models. Recurrent dropout has been used for instance to achieve state-of-the-art results in semantic role labelling (He et al., 2017) and language modelling (Melis et al., 2017) [34].

Multi-task learning

If additional data is available, multi-task learning (MTL) can often be used to improve performance on the target task. Have a look this blog postfor more information on MTL.

Auxiliary objectives   We can often find auxiliary objectives that are useful for the task we care about (Ruder, 2017) [13]. While we can already predict surrounding words in order to pre-train word embeddings (Mikolov et al., 2013), we can also use this as an auxiliary objective during training (Rei, 2017) [35]. A similar objective has also been used by (Ramachandran et al., 2016) [36] for sequence-to-sequence models.

Task-specific layers   While the standard approach to MTL for NLP is hard parameter sharing, it is beneficial to allow the model to learn task-specific layers. This can be done by placing the output layer of one task at a lower level (Søgaard & Goldberg, 2016) [47]. Another way is to induce private and shared subspaces (Liu et al., 2017; Ruder et al., 2017) [4849].

-- In sequence labeling, alogn with providing a label for each token, produce prev and next tokens also: mutual learning of language model and semantic role labeling such as NER

in seq2seq encoder/decoder, All parameters in a encoder lstm or decoder lstm are pretrained, either from the source- side (light red) or target-side (light blue) language model. Otherwise, they are randomly initialized.

redisual connection in layers. Attention over all encoder layers concatenated to decoder output


Softmax attention Attention: a = Softmax(W(o + u))       https://arxiv.org/pdf/1409.0473.pdf     attentio is basically a softmax on all the lstm outputs that tells us with what probability each step is important. context vector ci is calculated as an average of the previous states weighted with the attention scores

 Attention is most commonly used in sequence-to-sequence models to attend to encoder states, but can also be used in any sequence model to look back at past states. Using attention, we obtain a context vector cici based on hidden states s1,,sms1,…,sm that can be used together with the current hidden state hihi for prediction. The context vector cici at position is calculated as an average of the previous states weighted with the attention scores aiai:

The attention function fatt(hi,sj)fatt(hi,sj) calculates an unnormalized alignment score between the current hidden state hihi and the previous hidden state sjsj. In the following, we will discuss four attention variants: i) additive attention, ii) multiplicative attention, iii) self-attention, and iv) key-value attention.

Additive attention   The original attention mechanism (Bahdanau et al., 2015) [15] uses a one-hidden layer feed-forward network to calculate the attention alignment:


where vava and WaWa are learned attention parameters. Analogously, we can also use matrices W1W1 and W2W2 to learn separate transformations for hihiand sjsj respectively, which are then summed:

fatt(hi,sj)=vatanh(W1hi+W2sj)f                            fatt(hi,sj)= va⊤tanh(W1hi+W2sj)

Multiplicative attention   Multiplicative attention (Luong et al., 2015) [16] simplifies the attention operation by calculating the following function:

fatt(hi,sj)=hiWasj                f                                          fatt(hi,sj)=hi⊤ * Wa * sj

Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication. Both variants perform similar for small dimensionality dhdh of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale fatt(hi,sj)fatt(hi,sj) by 1/dh1/dh (Vaswani et al., 2017) [17].

Attention cannot only be used to attend to encoder or previous hidden states, but also to obtain a distribution over other features, such as the word embeddings of a text as used for reading comprehension (Kadlec et al., 2017) [37]. However, attention is not directly applicable to classification tasks that do not require additional information, such as sentiment analysis. In such models, the final hidden state of an LSTM or an aggregation function such as max pooling or averaging is often used to obtain a sentence representation.

Self-attention   Without any additional information, however, we can still extract relevant aspects from the sentence by allowing it to attend to itself using self-attention (Lin et al., 2017) [18]. Self-attention, also called intra-attention has been used successfully in a variety of tasks including reading comprehension (Cheng et al., 2016) [38], textual entailment (Parikh et al., 2016) [39], and abstractive summarization (Paulus et al., 2017) [40].

We can simplify additive attention to compute the unnormalized alignment score for each hidden state hihi:


In matrix form, for hidden states H=h1,,hnH=h1,…,hn we can calculate the attention vector aa and the final sentence representation cc as follows:


Rather than only extracting one vector, we can perform several hops of attention by using a matrix VaVa instead of vava, which allows us to extract an attention matrix AA:


In practice, we enforce the following orthogonality constraint to penalize redundancy and encourage diversity in the attention vectors in the form of the squared Frobenius norm:


A similar multi-head attention is also used by Vaswani et al. (2017).

Key-value attention   Finally, key-value attention (Daniluk et al., 2017) [19] is a recent attention variant that separates form from function by keeping separate vectors for the attention calculation. It has also been found useful for different document modelling tasks (Liu & Lapata, 2017) [41]. Specifically, key-value attention splits each hidden vector hihi into a key kiki and a value vivi[ki;vi]=hi[ki;vi]=hi. The keys are used for calculating the attention distribution aiai using additive attention:


where LL is the length of the attention window and 11 is a vector of ones. The values are then used to obtain the context representation cici:


The context cici is used together with the current value vivi for prediction.


The optimization algorithm and scheme is often one of the parts of the model that is used as-is and treated as a black-box. Sometimes, even slight changes to the algorithm, e.g. reducing the β2β2 value in Adam (Dozat & Manning, 2017) [50] can make a large difference to the optimization behaviour.

Optimization algorithm   Adam (Kingma & Ba, 2015) [21] is one of the most popular and widely used optimization algorithms and often the go-to optimizer for NLP researchers. It is often thought that Adam clearly outperforms vanilla stochastic gradient descent (SGD). However, while it converges much faster than SGD, it has been observed that SGD with learning rate annealing slightly outperforms Adam (Wu et al., 2016). Recent work furthermore shows that SGD with properly tuned momentum outperforms Adam (Zhang et al., 2017) [42].

Optimization scheme   While Adam internally tunes the learning rate for every parameter (Ruder, 2016) [22], we can explicitly use SGD-style annealing with Adam. In particular, we can perform learning rate annealing with restarts: We set a learning rate and train the model until convergence. We then halve the learning rate and restart by loading the previous best model. In Adam's case, this causes the optimizer to forget its per-parameter learning rates and start fresh. Denkowski & Neubig (2017) [23] show that Adam with 2 restarts and learning rate annealing is faster and performs better than SGD with annealing.


Combining multiple models into an ensemble by averaging their predictions is a proven strategy to improve model performance. While predicting with an ensemble is expensive at test time, recent advances in distillation allow us to compress an expensive ensemble into a much smaller model (Hinton et al., 2015; Kuncoro et al., 2016; Kim & Rush, 2016) [242526].

Ensembling is an important way to ensure that results are still reliable if the diversity of the evaluated models increases (Denkowski & Neubig, 2017). While ensembling different checkpoints of a model has been shown to be effective (Jean et al., 2015; Sennrich et al., 2016) [5152], it comes at the cost of model diversity. Cyclical learning rates can help to mitigate this effect (Huang et al., 2017) [53]. However, if resources are available, we prefer to ensemble multiple independently trained models to maximize model diversity.

Hyperparameter optimization

Rather than pre-defining or using off-the-shelf hyperparameters, simply tuning the hyperparameters of our model can yield significant improvements over baselines. Recent advances in Bayesian Optimization have made it an ideal tool for the black-box optimization of hyperparameters in neural networks (Snoek et al., 2012) [56] and far more efficient than the widely used grid search. Automatic tuning of hyperparameters of an LSTM has led to state-of-the-art results in language modeling, outperforming models that are far more complex (Melis et al., 2017).

LSTM tricks

Learning the initial state   We generally initialize the initial LSTM states with a 00 vector. Instead of fixing the initial state, we can learn it like any other parameter, which can improve performance and is also recommended by Hinton. Refer to this blog post for a Tensorflow implementation.

Tying input and output embeddings   Input and output embeddings account for the largest number of parameters in the LSTM model. If the LSTM predicts words as in language modelling, input and output parameters can be shared (Inan et al., 2016; Press & Wolf, 2017) [5455]. This is particularly useful on small datasets that do not allow to learn a large number of parameters.

Gradient norm clipping   One way to decrease the risk of exploding gradients is to clip their maximum value (Mikolov, 2012) [57]. This, however, does not improve performance consistently (Reimers & Gurevych, 2017). Rather than clipping each gradient independently, clipping the global norm of the gradient (Pascanu et al., 2013) [58] yields more significant improvements (a Tensorflow implementation can be found here).

Down-projection   To reduce the number of output parameters further, the hidden state of the LSTM can be projected to a smaller size. This is useful particularly for tasks with a large number of outputs, such as language modelling (Melis et al., 2017).

Task-specific best practices

In the following, we will discuss task-specific best practices. Most of these perform best for a particular type of task. Some of them might still be applied to other tasks, but should be validated before. We will discuss the following tasks: classification, sequence labelling, natural language generation (NLG), and -- as a special case of NLG -- neural machine translation.


More so than for sequence tasks, where CNNs have only recently found application due to more efficient convolutional operations, CNNs have been popular for classification tasks in NLP. The following best practices relate to CNNs and capture some of their optimal hyperparameter choices.

CNN filters   Combining filter sizes near the optimal filter size, e.g. (3,4,5) performs best (Kim, 2014; Kim et al., 2016). The optimal number of feature maps is in the range of 50-600 (Zhang & Wallace, 2015) [59].

Aggregation function   1-max-pooling outperforms average-pooling and kk-max pooling (Zhang & Wallace, 2015).

Sequence labelling

Sequence labelling is ubiquitous in NLP. While many of the existing best practices are with regard to a particular part of the model architecture, the following guidelines discuss choices for the model's output and prediction stage.

Tagging scheme   For some tasks, which can assign labels to segments of texts, different tagging schemes are possible. These are: BIO, which marks the first token in a segment with a B- tag, all remaining tokens in the span with an I-tag, and tokens outside of segments with an O- tag; IOB, which is similar to BIO, but only uses B- if the previous token is of the same class but not part of the segment; and IOBES, which in addition distinguishes between single-token entities (S-) and the last token in a segment (E-). Using IOBES and BIO yield similar performance (Lample et al., 2017)

CRF output layer   If there are any dependencies between outputs, such as in named entity recognition the final softmax layer can be replaced with a linear-chain conditional random field (CRF). This has been shown to yield consistent improvements for tasks that require the modelling of constraints (Huang et al., 2015; Max & Hovy, 2016; Lample et al., 2016) [606162].

Constrained decoding   Rather than using a CRF output layer, constrained decoding can be used as an alternative approach to reject erroneous sequences, i.e. such that do not produce valid BIO transitions (He et al., 2017). Constrained decoding has the advantage that arbitrary constraints can be enforced this way, e.g. task-specific or syntactic constraints.

Natural language generation

Most of the existing best practices can be applied to natural language generation (NLG). In fact, many of the tips presented so far stem from advances in language modelling, the most prototypical NLP task.

Modelling coverage   Repetition is a big problem in many NLG tasks as current models do not have a good way of remembering what outputs they already produced. Modelling coverage explicitly in the model is a good way of addressing this issue. A checklist can be used if it is known in advances, which entities should be mentioned in the output, e.g. ingredients in recipes (Kiddon et al., 2016) [63]. If attention is used, we can keep track of a coverage vector cici, which is the sum of attention distributions atat over previous time steps (Tu et al., 2016; See et al., 2017) [6465]:


This vector captures how much attention we have paid to all words in the source. We can now condition additive attention additionally on this coverage vector in order to encourage our model not to attend to the same words repeatedly:


In addition, we can add an auxiliary loss that captures the task-specific attention behaviour that we would like to elicit: For NMT, we would like to have a roughly one-to-one alignment; we thus penalize the model if the final coverage vector is more or less than one at every index (Tu et al., 2016). For summarization, we only want to penalize the model if it repeatedly attends to the same location (See et al., 2017).

Neural machine translation

While neural machine translation (NMT) is an instance of NLG, NMT receives so much attention that many methods have been developed specifically for the task. Similarly, many best practices or hyperparameter choices apply exclusively to it.

Embedding dimensionality   2048-dimensional embeddings yield the best performance, but only do so by a small margin. Even 128-dimensional embeddings perform surprisingly well and converge almost twice as quickly (Britz et al., 2017).

Encoder and decoder depth   The encoder does not need to be deeper than 242−4 layers. Deeper models outperform shallower ones, but more than 44 layers is not necessary for the decoder (Britz et al., 2017).

Directionality   Bidirectional encoders outperform unidirectional ones by a small margin. Sutskever et al., (2014) [67] proposed to reverse the source sequence to reduce the number of long-term dependencies. Reversing the source sequence in unidirectional encoders outperforms its non-reversed counter-part (Britz et al., 2017).

Beam search strategy   Medium beam sizes around 1010 with length normalization penalty of 1.01.0 (Wu et al., 2016) yield the best performance (Britz et al., 2017).

Sub-word translation   Senrich et al. (2016) [66] proposed to split words into sub-words based on a byte-pair encoding (BPE). BPE iteratively merges frequent symbol pairs, which eventually results in frequent character n-grams being merged into a single symbol, thereby effectively eliminating out-of-vocabulary-words. While it was originally meant to handle rare words, a model with sub-word units outperforms full-word systems across the board, with 32,000 being an effective vocabulary size for sub-word units (Denkowski & Neubig, 2017).


I hope this post was helpful in kick-starting your learning of a new NLP task. Even if you have already been familiar with most of these, I hope that you still learnt something new or refreshed your knowledge of useful tips.

I am sure that I have forgotten many best practices that deserve to be on this list. Similarly, there are many tasks such as parsing, information extraction, etc., which I do not know enough about to give recommendations. If you have a best practice that should be on this list, do let me know in the comments below. Please provide at least one reference and your handle for attribution. If this gets very collaborative, I might open a GitHub repository rather than collecting feedback here (I won't be able to accept PRs submitted directly to the generated HTML source of this article)

Subpages (1): Papers NLP