# OpenNMT-py: Open-Source Neural Machine Translation

This is a [PyTorch](https://github.com/pytorch/pytorch)
port of [OpenNMT](https://github.com/OpenNMT/OpenNMT),
an open-source (MIT) neural machine translation system. It is designed to be research-friendly, making it easy to try out new ideas in translation, summarization, image-to-text, morphology, and many other domains.


OpenNMT-py is run as a collaborative open-source project. It is currently maintained by [Sasha Rush](http://github.com/srush) (Cambridge, MA), [Ben Peters](http://github.com/bpopeters) (Saarbrücken), and [Jianyu Zhan](http://github.com/jianyuzhan) (Shenzhen). The original code was written by [Adam Lerer](http://github.com/adamlerer) (NYC). The codebase is nearing a stable 0.1 release; we currently recommend forking if you need stable code.

We love contributions. Please consult the Issues page for any posts tagged [Contributions Welcome](https://github.com/OpenNMT/OpenNMT-py/issues?q=is%3Aissue+is%3Aopen+label%3A%22contributions+welcome%22).

<center style="padding: 40px"><img width="70%" src="http://opennmt.github.io/simple-attn.png" /></center>


Table of Contents
=================

 * [Requirements](#requirements)
 * [Features](#features)
 * [Quickstart](#quickstart)
 * [Advanced](#advanced)
 * [Citation](#citation)

## Requirements

```bash
pip install -r requirements.txt
```
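
If you are starting from a fresh machine, the full setup might look like the sketch below. This is only a suggested sequence; it assumes `git` is available and that PyTorch itself is installed separately, following the instructions at https://pytorch.org:

```bash
# Grab the code and install the Python dependencies listed in requirements.txt.
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip install -r requirements.txt
```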


## Features

The following OpenNMT features are implemented:

- multi-layer bidirectional RNNs with attention and dropout
- data preprocessing
- saving and loading from checkpoints
- inference (translation) with batching and beam search
- context gate
- multiple source and target RNN (LSTM/GRU) types and attention (dot-product/MLP) types
- TensorBoard/Crayon logging
- source word features

Beta features (committed):

- multi-GPU training
- image-to-text processing
- the Transformer model ("Attention Is All You Need")
- copy and coverage attention
- structured attention
- Conv2Conv (convolutional sequence-to-sequence) model
- SRU ("Training RNNs as Fast as CNNs")
- inference-time loss functions

## Quickstart

### Step 1: Preprocess the data

```bash
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
```

We will be working with some example data in the `data/` folder.

The data consists of parallel source (`src`) and target (`tgt`) files containing one sentence per line, with tokens separated by a space:

* `src-train.txt`
* `tgt-train.txt`
* `src-val.txt`
* `tgt-val.txt`

Validation files are required and are used to evaluate the convergence of the training. They usually contain no more than 5,000 sentences.
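
To get a feel for the expected format, you can peek at the example files; each line of a `src` file should align with the same line of the corresponding `tgt` file. A quick check with standard shell tools:

```bash
# Show the first two source and target training sentences.
head -n 2 data/src-train.txt data/tgt-train.txt

# Parallel files must have the same number of lines.
wc -l data/src-train.txt data/tgt-train.txt
```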


After running the preprocessing, the following files are generated:

* `demo.src.dict`: dictionary of source vocabulary to index mappings.
* `demo.tgt.dict`: dictionary of target vocabulary to index mappings.
* `demo.train.pt`: serialized PyTorch file containing the vocabulary and the training and validation data.


Internally, the system never touches the words themselves, but uses these indices.
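
As a quick sanity check, you can list what preprocessing produced and, assuming the `*.dict` files are plain text with one `word index` pair per line (an assumption about this version of the preprocessing output, not something the recipe above guarantees), peek at the most frequent entries:

```bash
# List the generated artifacts.
ls -lh data/demo*

# Inspect the first few source-vocabulary entries (assumes plain-text .dict files).
head -n 10 data/demo.src.dict
```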

### Step 2: Train the model

```bash
python train.py -data data/demo -save_model demo-model
```

The main training command is quite simple. Minimally, it takes a data file
and a save file. This will run the default model, which consists of a
2-layer LSTM with 500 hidden units for both the encoder and the decoder. You
can also add `-gpuid 1` to use (say) GPU 1.
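
If you want to deviate from the defaults, the sketch below shows the general shape of a customized run. The flag names here (`-layers`, `-rnn_size`, `-epochs`) are assumptions based on common OpenNMT-py options and may differ between versions, so check `python train.py -h` for the options in your checkout:

```bash
# A hypothetical customized run: explicit model size, epoch count, and GPU.
python train.py -data data/demo -save_model demo-model \
    -layers 2 -rnn_size 500 -epochs 13 -gpuid 1
```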

### Step 3: Translate

```bash
python translate.py -model demo-model_epochX_PPL.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
```

Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into `pred.txt`.

!!! note "Note"
    The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example, you can download millions of parallel sentences for [translation](http://www.statmt.org/wmt16/translation-task.html) or [summarization](https://github.com/harvardnlp/sent-summary).
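
To eyeball the output, you can line up each test sentence with its prediction using plain shell tools:

```bash
# Show the first five source sentences alongside their predictions (tab-separated).
paste data/src-test.txt pred.txt | head -n 5
```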

## Full Translation Example

The example below uses the Moses tokenizer (http://www.statmt.org/moses/) to prepare the data and the Moses `multi-bleu.perl` script for evaluation. First, download these tools:

```bash
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
```
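
As an illustration of how the tokenizer is invoked (the same pattern used in the WMT'16 example below), you could tokenize a raw English text file, here called `my-corpus.en` as a placeholder, like this:

```bash
# -a: aggressive hyphen splitting, -no-escape: no HTML escaping, -l en: English, -q: quiet.
# my-corpus.en is a placeholder for your own raw text, one sentence per line.
perl tokenizer.perl -a -no-escape -l en -q < my-corpus.en > my-corpus.en.atok
```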

## WMT'16 Multimodal Translation: Multi30k (de-en)

The following is an example of training for the WMT'16 Multimodal Translation task (http://www.statmt.org/wmt16/multimodal-task.html).

### 0) Download the data.

```bash
mkdir -p data/multi30k
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz && tar -xf training.tar.gz -C data/multi30k && rm training.tar.gz
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz && tar -xf validation.tar.gz -C data/multi30k && rm validation.tar.gz
wget https://staff.fnwi.uva.nl/d.elliott/wmt16/mmt16_task1_test.tgz && tar -xf mmt16_task1_test.tgz -C data/multi30k && rm mmt16_task1_test.tgz
```

### 1) Preprocess the data.

```bash
# Delete the last line of the validation and training files.
for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
# Tokenize all files with the Moses tokenizer.
for l in en de; do for f in data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
# Build the OpenNMT-py dataset, lowercasing everything (-lower).
python preprocess.py -train_src data/multi30k/train.en.atok -train_tgt data/multi30k/train.de.atok -valid_src data/multi30k/val.en.atok -valid_tgt data/multi30k/val.de.atok -save_data data/multi30k.atok.low -lower
```
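
Before training, it is worth checking that tokenization kept the corpora parallel; each source/target pair of `.atok` files should have the same number of lines:

```bash
# Line counts of the tokenized training and validation files.
wc -l data/multi30k/train.*.atok data/multi30k/val.*.atok
```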

### 2) Train the model.

```bash
python train.py -data data/multi30k.atok.low -save_model multi30k_model -gpuid 0
```

### 3) Translate sentences.

```bash
python translate.py -gpu 0 -model multi30k_model_*_e13.pt -src data/multi30k/test.en.atok -tgt data/multi30k/test.de.atok -replace_unk -verbose -output multi30k.test.pred.atok
```

### 4) Evaluate.

```bash
perl tools/multi-bleu.perl data/multi30k/test.de.atok < multi30k.test.pred.atok
```
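
Because preprocessing used `-lower`, the model's output is lowercased while the reference file is not, which can depress the score. One option (a suggestion, not part of the original recipe) is to compute case-insensitive BLEU with the script's `-lc` flag:

```bash
# Case-insensitive BLEU, to avoid penalizing the lowercased system output.
perl tools/multi-bleu.perl -lc data/multi30k/test.de.atok < multi30k.test.pred.atok
```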

## Pretrained Models

The following pretrained models can be downloaded and used with `translate.py`, as shown in the example below. (These were trained with an older version of the code; they will be updated soon.)

- [onmt_model_en_de_200k](https://drive.google.com/file/d/0B6N7tANPyVeBWE9WazRYaUd2QTg/view?usp=sharing): an English-German translation model based on the 200k-sentence dataset at [OpenNMT/IntegrationTesting](https://github.com/OpenNMT/IntegrationTesting/tree/master/data). Perplexity: 20.
- onmt_model_en_fr_b1M (coming soon): an English-French model trained on benchmark-1M. Perplexity: 4.85.
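
Usage is the same as in Step 3 of the Quickstart. For example, assuming the English-German download unpacks to a checkpoint named `onmt_model_en_de_200k.pt` (the exact filename inside the download may differ), translation would look like:

```bash
# my-input.en is a placeholder for your own tokenized English text, one sentence per line.
python translate.py -model onmt_model_en_de_200k.pt -src my-input.en -output pred.de -replace_unk -verbose
```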


## Citation

[OpenNMT technical report](https://doi.org/10.18653/v1/P17-4012)

```
@inproceedings{opennmt,
  author    = {Guillaume Klein and
               Yoon Kim and
               Yuntian Deng and
               Jean Senellart and
               Alexander M. Rush},
  title     = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/P17-4012},
  doi       = {10.18653/v1/P17-4012}
}
```