
attention

year: 2015     citations: 5,697
main authors: Bahdanau, Cho, Bengio
title: Neural Machine Translation by Jointly Learning to Align and Translate (attention, 2015)

For machine translation:

Stanford nlp attention video

- Recent models for neural machine translation often belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a decoder generates a translation.
- We conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture.
- We propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

In the existing encoder–decoder frameworks:
- an encoder reads the input sentence, a sequence of vectors x = (x_1, ..., x_{T_x}), into a vector c,
e.g. with an RNN such that
h_t = f(x_t, h_{t-1})
and
c = q({h_1, ..., h_{T_x}})     (a non-linear function of all the encoder hidden states)     where f and q are non-linear functions
- the decoder is trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, ..., y_{t'-1}}.
In other words, the decoder defines a probability over the translation y = (y_1, ..., y_{T_y}) by decomposing the joint probability into the ordered conditionals
p(y) = prod_{t=1}^{T_y} p(y_t | {y_1, ..., y_{t-1}}, c)
With an RNN, each conditional probability is modeled as
p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)
where g is a nonlinear, potentially multi-layered function that outputs the probability of y_t, and s_t is the hidden state of the RNN.

TLDR: the encoder is a forward LSTM/RNN, c is one non-linear function of all its outputs, and the decoder is another forward LSTM/RNN conditioned on that single vector (sketched below).
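As a rough picture of what that bottleneck looks like in code, here is a minimal NumPy sketch (the toy sizes and a vanilla tanh RNN standing in for the LSTM are my own assumptions, not the paper's setup): f is the recurrent update and q just keeps the last hidden state, so the whole source sentence gets squeezed into one fixed-length vector c.

```python
import numpy as np

def rnn_encoder(xs, Wx, Wh, b):
    """Vanilla RNN over the source vectors: h_t = tanh(Wx x_t + Wh h_{t-1} + b)."""
    h = np.zeros(Wh.shape[0])
    hs = []
    for x in xs:                      # x_1 ... x_{T_x}
        h = np.tanh(Wx @ x + Wh @ h + b)
        hs.append(h)
    return hs                         # all hidden states h_1 ... h_{T_x}

def q_last(hs):
    """q(.) in the basic framework: keep only the last hidden state."""
    return hs[-1]                     # c = h_{T_x}, the fixed-length bottleneck

# toy dimensions (assumptions, not from the paper)
d_in, d_h, T_x = 4, 8, 5
rng = np.random.default_rng(0)
xs = [rng.normal(size=d_in) for _ in range(T_x)]
Wx = rng.normal(scale=0.1, size=(d_h, d_in))
Wh = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)

c = q_last(rnn_encoder(xs, Wx, Wh, b))
print(c.shape)                        # (8,) -- one vector, regardless of T_x
```

However long the input sentence is, the decoder only ever sees those few numbers in c, which is exactly the conjectured bottleneck.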

The proposed encoder–decoder (with attention)

The encoder is just a simple bi-LSTM (a bidirectional RNN in the paper): the annotation h_j for word j is the concatenation of the forward and backward hidden states, so it summarizes both the preceding and the following words.
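A rough, self-contained NumPy sketch of that encoder (plain tanh cells instead of LSTM/GRU, toy sizes of my own): one RNN runs left-to-right, another right-to-left, and each annotation concatenates the two states at that position.

```python
import numpy as np

def rnn_states(xs, Wx, Wh, b):
    """All hidden states of a vanilla RNN: h_t = tanh(Wx x_t + Wh h_{t-1} + b)."""
    h, hs = np.zeros(Wh.shape[0]), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        hs.append(h)
    return hs

def bidirectional_annotations(xs, fwd, bwd):
    """Annotation h_j = [forward state at j ; backward state at j]."""
    forward = rnn_states(xs, *fwd)
    backward = rnn_states(xs[::-1], *bwd)[::-1]   # run right-to-left, then re-align
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

# toy dimensions (assumptions, not from the paper)
rng = np.random.default_rng(1)
d_in, d_h, T_x = 4, 8, 5
make = lambda: (rng.normal(scale=0.1, size=(d_h, d_in)),
                rng.normal(scale=0.1, size=(d_h, d_h)),
                np.zeros(d_h))
xs = [rng.normal(size=d_in) for _ in range(T_x)]
annotations = bidirectional_annotations(xs, make(), make())
print(len(annotations), annotations[0].shape)     # 5 annotations of size 2*d_h = 16
```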

The decoder is an RNN that, instead of one fixed c, gets a new context vector c_i at every output step:
- each conditional probability becomes p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
- the decoder state is updated as s_i = f(s_{i-1}, y_{i-1}, c_i)
- the context vector is a weighted sum of the encoder annotations: c_i = sum_j alpha_{ij} h_j
- the weights alpha_{ij} = softmax_j(e_{ij}), where e_{ij} = a(s_{i-1}, h_j) is an alignment model (a small feed-forward network) scoring how well the inputs around position j match the output at position i; it is trained jointly with the rest of the model
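A minimal NumPy sketch of that attention step, following the additive scoring form e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j) from the paper; the variable names and toy dimensions here are my own choices, not the authors' code.

```python
import numpy as np

def softmax(e):
    e = e - e.max()                               # numerical stability
    w = np.exp(e)
    return w / w.sum()

def attention_context(s_prev, annotations, Wa, Ua, va):
    """Additive attention: score every annotation against the previous decoder
    state, softmax into alignment weights, and return the weighted sum c_i."""
    H = np.stack(annotations)                     # (T_x, d_ann)
    e = np.tanh(s_prev @ Wa.T + H @ Ua.T) @ va    # one score e_ij per source position
    alpha = softmax(e)                            # soft alignment weights alpha_ij
    return alpha @ H, alpha                       # c_i = sum_j alpha_ij h_j

# toy dimensions (assumptions, not from the paper)
rng = np.random.default_rng(2)
d_s, d_ann, d_align, T_x = 8, 16, 10, 5
annotations = [rng.normal(size=d_ann) for _ in range(T_x)]
s_prev = rng.normal(size=d_s)
Wa = rng.normal(scale=0.1, size=(d_align, d_s))
Ua = rng.normal(scale=0.1, size=(d_align, d_ann))
va = rng.normal(scale=0.1, size=d_align)

c_i, alpha = attention_context(s_prev, annotations, Wa, Ua, va)
print(c_i.shape, alpha.round(2))                  # a different c_i at every decoder step
```

Because alpha is recomputed from s_{i-1} at every step, each target word gets its own context vector c_i instead of sharing one fixed c.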