Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer, arXiv 2015
TLDR; Scheduled sampling improves the quality of sequence generation by making the model more robust to its own prediction mistakes. Use inverse sigmoid decay for the sampling schedule.
One of the issues with training RNNs for sequence prediction is the mismatch between training and inference: during training the model is conditioned on the ground-truth previous token, but at inference it is conditioned on its own most likely previous prediction. A single prediction error can therefore compound through the entire generated sequence. One way to deal with this is beam search, which maintains several probable sequences in memory. Beam search produces better sequences, but it only mitigates the problem, since every candidate is still built from the model's own, possibly wrong, predictions.
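As a reminder of what that looks like, here is a toy beam search sketch in Python (the `log_probs_fn` interface and all names are my own illustration, not from the paper):

```python
import heapq
import numpy as np

def beam_search(log_probs_fn, start_token, end_token, beam_size=3, max_len=20):
    """Keep the `beam_size` most probable partial sequences at every step.

    log_probs_fn(prefix) is assumed to return log-probabilities over the
    vocabulary for the next token given the prefix.
    """
    beams = [(0.0, [start_token])]  # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:           # finished hypotheses carry over unchanged
                candidates.append((score, seq))
                continue
            log_probs = log_probs_fn(seq)
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((score + log_probs[tok], seq + [int(tok)]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]   # best complete sequence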
The authors propose a "curriculum learning" approach that forces the model to deal with mistakes. This is interesting since error correction is baked into the model.
During training, a sampling mechanism randomly decides at each time step whether to use the ground-truth previous token or the model's own previous prediction as the next input.
The sampling variable epsilon_i (the probability of using the ground truth) starts near 1 and is decayed over the course of training with a linear, exponential, or inverse sigmoid schedule, so the model gradually shifts from fully guided training toward the inference-time setting.
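A minimal sketch of this per-step coin flip and the inverse sigmoid schedule (the formula epsilon_i = k / (k + exp(i / k)) is from the paper; the value k=1000 and the function names are my own illustration):

```python
import numpy as np

def inverse_sigmoid_decay(step, k=1000.0):
    """Probability of feeding the ground-truth token at training step `step`.

    Starts near 1 and decays toward 0; larger k means a slower decay.
    """
    return k / (k + np.exp(step / k))

def choose_decoder_input(ground_truth_token, model_prediction, step, rng=np.random):
    """Scheduled sampling: per time step, pick the next input to the decoder."""
    epsilon = inverse_sigmoid_decay(step)
    if rng.random() < epsilon:
        return ground_truth_token   # teacher forcing (dominates early in training)
    return model_prediction         # the model's own, possibly wrong, prediction
```

The coin flip is done independently at every time step, so a single training sequence mixes ground-truth and model-generated inputs.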
- Image captioning experiment: trained on MSCOCO, 75k images for training and 5k for the dev set.
- Each image has 5 possible captions, one of which is chosen at random.
- Images are preprocessed by a pretrained CNN.
- Word generation is done with an LSTM(512); the vocabulary size is 8857.
- Used inverse sigmoid decay
This approach led the team to first place in the 2015 MSCOCO captioning challenge.
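A rough PyTorch sketch of how such a decoder with scheduled sampling could be wired up; only LSTM(512) and the 8857-word vocabulary come from the notes, everything else (the feature projection, sampling from the softmax, all names) is my own assumption:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of an LSTM(512) caption decoder trained with scheduled sampling."""
    def __init__(self, vocab_size=8857, hidden=512, cnn_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_proj = nn.Linear(cnn_dim, hidden)   # pretrained CNN features -> initial state
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, cnn_features, captions, epsilon):
        h = torch.tanh(self.init_proj(cnn_features))
        c = torch.zeros_like(h)
        inp, logits = captions[:, 0], []
        for t in range(1, captions.size(1)):
            h, c = self.lstm(self.embed(inp), (h, c))
            step_logits = self.out(h)
            logits.append(step_logits)
            # Scheduled sampling: per example, keep the true word with probability epsilon,
            # otherwise feed a word sampled from the model's own output distribution.
            use_truth = torch.rand(captions.size(0), device=captions.device) < epsilon
            sampled = torch.multinomial(step_logits.softmax(-1), 1).squeeze(1)
            inp = torch.where(use_truth, captions[:, t], sampled)
        return torch.stack(logits, dim=1)   # (batch, seq_len - 1, vocab)
```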
The parsing experiment maps a sentence onto a (linearized) parse tree. Unlike image captioning, the task is much more deterministic, "uni-modal": there is generally only one correct parse tree per sentence (an example target format is sketched after the list below).
- One layer LSTM(512)
- Words as embeddings of size 512
- Attention mechanism
- Inverse sigmoid decay
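To make the target format concrete, a made-up example of a sentence and a linearized parse tree used as the output sequence (the exact bracketing and tagging conventions in the paper may differ):

```python
# Encoder input: the sentence as a token sequence.
sentence = ["John", "has", "a", "dog", "."]

# Decoder target: the parse tree flattened into a single token sequence
# (illustrative linearization; closing brackets are tagged with their category).
linearized_tree = ["(S", "(NP", "NNP", ")NP", "(VP", "VBZ",
                   "(NP", "DT", "NN", ")NP", ")VP", ".", ")S"]
```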
- Speech recognition experiment: two layers of LSTM(250).
- The baseline needed 14 training epochs; with scheduled sampling, 9 epochs were enough.
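For completeness, a stacked recurrent model of that size is a one-liner in most frameworks (PyTorch here; the input feature size is a placeholder, and the notes do not say how the previous output is fed back):

```python
import torch.nn as nn

# Two stacked LSTM layers with 250 units each; input_size=40 is a placeholder
# for whatever acoustic features the model consumes (not specified in these notes).
speech_model = nn.LSTM(input_size=40, hidden_size=250, num_layers=2, batch_first=True)
```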