| year | citations | main authors | title |
| --- | --- | --- | --- |
| 2015 | 5,697 | Bahdanau, Cho, Bengio | Neural Machine Translation by Jointly Learning to Align and Translate (attention 2015) |

For machine translation:
Stanford nlp attention video
- Recent models for neural machine translation often belong to a family of encoder–decoders: they encode a source sentence into a fixed-length vector from which a decoder generates a translation.
- We conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture.
- We propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
In the current Encoder–Decoder frameworks:
- an encoder reads the input sentence, a sequence of vectors x = (x_1, ..., x_{T_x}), into a vector c; e.g. with an RNN, h_t = f(x_t, h_{t-1}) and c = q({h_1, ..., h_{T_x}}) (a non-linearity over the outputs of all time steps), where f and q are non-linear functions
- the decoder is trained to predict the next word y_t given the context vector c and all the previously predicted words {y_1, ..., y_{t-1}}. In other words, it defines a probability over the translation y = (y_1, ..., y_{T_y}) by decomposing the joint probability into ordered conditionals: p(y) = prod_{t=1}^{T_y} p(y_t | {y_1, ..., y_{t-1}}, c). With an RNN, each conditional probability is modeled as p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c), where g is a nonlinear, potentially multi-layered function that outputs the probability of y_t, and s_t is the hidden state of the RNN.
TLDR: in this basic setup the encoder is a forward RNN, c is a non-linearity over all of its outputs, and the decoder is another forward RNN conditioned on c.
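A minimal sketch of this fixed-length-vector setup (not the paper's code; the hidden size, vocab sizes, and the choice of q as "take the last hidden state" are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BasicEncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        # forward RNN encoder: h_t = f(x_t, h_{t-1})
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # encode the whole source sentence
        enc_outputs, h_last = self.encoder(self.src_emb(src_ids))
        # c = q({h_1, ..., h_Tx}); here q just keeps the last hidden state,
        # so the entire source sentence must fit in this one fixed-length vector
        c = h_last
        # decoder predicts y_t from c and the previous (teacher-forced) target words
        dec_outputs, _ = self.decoder(self.tgt_emb(tgt_ids), c)
        return self.out(dec_outputs)  # logits for p(y_t | y_1..y_{t-1}, c)

# usage with hypothetical sizes:
# model = BasicEncoderDecoder(8000, 8000)
# logits = model(src_batch, tgt_batch)  # both LongTensor of shape (batch, seq_len)
```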
Proposed encoder–decoder (with attention)
Encoder is just a simple bidirectional RNN (e.g. a bi-LSTM); the forward and backward hidden states for each source word are concatenated into an annotation h_j
decoder is an RNN that attends over these annotations at each output step i: an alignment model scores e_ij = a(s_{i-1}, h_j), the weights alpha_ij are a softmax over the e_ij, and the context c_i = sum_j alpha_ij h_j feeds into the next decoder state s_i and the output p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i), as sketched below
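A rough sketch of the additive alignment model and context computation, assuming annotations of size 2*enc_hidden from the bidirectional encoder; the layer names W_a, U_a, v_a and the attn_dim size are illustrative, not the paper's released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style alignment model a(s_{i-1}, h_j) as a small feed-forward net."""
    def __init__(self, dec_hidden, enc_hidden, attn_dim=128):
        super().__init__()
        self.W_a = nn.Linear(dec_hidden, attn_dim, bias=False)      # projects decoder state s_{i-1}
        self.U_a = nn.Linear(2 * enc_hidden, attn_dim, bias=False)  # projects annotation h_j (fwd+bwd concat)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, dec_hidden); annotations: (batch, T_x, 2*enc_hidden)
        scores = self.v_a(torch.tanh(
            self.W_a(s_prev).unsqueeze(1) + self.U_a(annotations)))  # e_ij: (batch, T_x, 1)
        alpha = F.softmax(scores, dim=1)             # alignment weights alpha_ij over source positions
        context = (alpha * annotations).sum(dim=1)   # c_i = sum_j alpha_ij * h_j
        return context, alpha.squeeze(-1)

# Per decoder step i (schematically): c_i, alpha_i = attn(s_prev, annotations);
# then s_i = f(s_prev, y_{i-1}, c_i) and y_i is drawn from g(y_{i-1}, s_i, c_i).
```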