
Commit f7a1d48 (root commit; 0 parents)

move into separate repository


65 files changed: +7457 −0 lines

.gitignore (+109)

```
# repo-specific stuff
pred.txt
multi-bleu.perl
*.pt
\#*#
.idea
*.sublime-*
.DS_Store
data/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
```

.travis.yml (+34)

```yaml
language: python
python:
  - "2.7"
  - "3.5"

install:
  - sudo apt-get update
  - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - hash -r
  - conda config --set always_yes yes --set changeps1 no
  - conda update -q conda
  # Useful for debugging any issues with conda
  - conda info -a
  # freeze the supported pytorch version for consistency
  - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION pytorch=0.2.0 -c soumith
  - source activate test-environment
  # use requirements.txt for dependencies
  - pip install -r requirements.txt
  - python setup.py install

script:
  - python -m unittest discover
  - python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data /tmp/data -src_vocab_size 1000 -tgt_vocab_size 1000
  - head data/src-test.txt > /tmp/src-test.txt; python translate.py -model test/test_model.pt -src /tmp/src-test.txt -verbose
  - head data/src-val.txt > /tmp/src-val.txt; head data/tgt-val.txt > /tmp/tgt-val.txt; python preprocess.py -train_src /tmp/src-val.txt -train_tgt /tmp/tgt-val.txt -valid_src /tmp/src-val.txt -valid_tgt /tmp/tgt-val.txt -save_data /tmp/q -src_vocab_size 1000 -tgt_vocab_size 1000; python train.py -data /tmp/q -rnn_size 2 -batch_size 10 -word_vec_size 5 -report_every 5 -rnn_size 10 -epochs 1
  - python translate.py -model test/test_model2.pt -src data/morph/src.valid -verbose -batch_size 10 -beam_size 10 -tgt data/morph/tgt.valid -out /tmp/trans; diff data/morph/tgt.valid /tmp/trans

matrix:
  include:
    - env: LINT_CHECK
      python: "2.7"
      install: pip install flake8
      script: flake8
```

CONTRIBUTORS.md (+11)

OpenNMT-py is a community-developed project and we love developer contributions.

Before sending a PR, please run through this checklist first:

- Please run `tools/pull_request_chk.sh` and fix any errors. When adding new functionality, also add tests to this script. Included checks:
  1. flake8 check for coding style;
  2. unittest;
  3. continuous integration tests listed in `.travis.yml`.
- When adding or modifying a class constructor, please name the arguments in the same style as the corresponding superclass in PyTorch.
- If your change is based on a paper, please include a clear comment and reference in the code.
- If your function takes or returns tensor arguments, please include assertions to document the sizes. See `GlobalAttention.py` for examples, and the sketch below.
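As a rough illustration of the kind of size assertion meant here (a minimal sketch with hypothetical names, not the actual `GlobalAttention.py` code):

```python
import torch

def dot_score(h_t, h_s):
    """Unnormalized dot-product attention scores.

    h_t: (batch, tgt_len, dim), h_s: (batch, src_len, dim)
    returns: (batch, tgt_len, src_len)
    """
    # Assertions documenting the expected sizes, per the guideline above.
    tgt_batch, tgt_len, tgt_dim = h_t.size()
    src_batch, src_len, src_dim = h_s.size()
    assert tgt_batch == src_batch, "batch sizes must match"
    assert tgt_dim == src_dim, "hidden dimensions must match"

    return torch.bmm(h_t, h_s.transpose(1, 2))
```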

Dockerfile (+2)

```dockerfile
FROM pytorch/pytorch:latest
RUN git clone https://github.com/OpenNMT/OpenNMT-py.git && cd OpenNMT-py && pip install -r requirements.txt && python setup.py install
```

LICENSE.md (+22)

This software is derived from the OpenNMT project at https://github.com/OpenNMT/OpenNMT.

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md (+185)
# OpenNMT-py: Open-Source Neural Machine Translation

This is a [PyTorch](https://github.com/pytorch/pytorch) port of [OpenNMT](https://github.com/OpenNMT/OpenNMT), an open-source (MIT) neural machine translation system. It is designed to be research-friendly for trying out new ideas in translation, summarization, image-to-text, morphology, and many other domains.

OpenNMT-py is run as a collaborative open-source project. It is currently maintained by [Sasha Rush](http://github.com/srush) (Cambridge, MA), [Ben Peters](http://github.com/bpopeters) (Saarbrücken), and [Jianyu Zhan](http://github.com/jianyuzhan) (Shenzhen). The original code was written by [Adam Lerer](http://github.com/adamlerer) (NYC). The codebase is nearing a stable 0.1 version; we currently recommend forking if you want stable code.

We love contributions. Please consult the Issues page for any posts tagged [Contributions Welcome](https://github.com/OpenNMT/OpenNMT-py/issues?q=is%3Aissue+is%3Aopen+label%3A%22contributions+welcome%22).

<center style="padding: 40px"><img width="70%" src="http://opennmt.github.io/simple-attn.png" /></center>

Table of Contents
=================

* [Requirements](#requirements)
* [Features](#features)
* [Quickstart](#quickstart)
* [Advanced](#advanced)
* [Citation](#citation)

## Requirements

```bash
pip install -r requirements.txt
```

## Features

The following OpenNMT features are implemented:

- multi-layer bidirectional RNNs with attention and dropout
- data preprocessing
- saving and loading from checkpoints
- inference (translation) with batching and beam search
- context gate
- multiple source and target RNN (lstm/gru) types and attention (dotprod/mlp) types (see the sketch after this list)
- TensorBoard/Crayon logging
- source word features
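For readers unfamiliar with the two attention score functions named above, here is a schematic sketch of the "dotprod" (Luong-style) and "mlp" (Bahdanau-style) scoring; the shapes and layer names are illustrative assumptions, not the repo's actual classes:

```python
import torch
import torch.nn as nn

dim, batch, src_len = 500, 32, 20
h_t = torch.randn(batch, dim)            # decoder state at one target step
h_s = torch.randn(batch, src_len, dim)   # encoder states for the source

# "dotprod" scoring: a plain dot product against each source position.
dot_scores = torch.bmm(h_s, h_t.unsqueeze(2)).squeeze(2)      # (batch, src_len)

# "mlp" scoring: a small feed-forward network over both states.
W = nn.Linear(2 * dim, dim)
v = nn.Linear(dim, 1, bias=False)
h_t_exp = h_t.unsqueeze(1).expand(-1, src_len, -1)            # (batch, src_len, dim)
mlp_scores = v(torch.tanh(W(torch.cat([h_t_exp, h_s], 2)))).squeeze(2)
```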
Beta Features (committed):
- multi-GPU
- image-to-text processing
- "Attention is all you need"
- copy, coverage
- structured attention
- Conv2Conv convolution model
- SRU "RNNs faster than CNN" paper
- inference-time loss functions

## Quickstart

## Step 1: Preprocess the data

```bash
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
```

We will be working with some example data in the `data/` folder.

The data consists of parallel source (`src`) and target (`tgt`) files containing one sentence per line, with tokens separated by spaces:

* `src-train.txt`
* `tgt-train.txt`
* `src-val.txt`
* `tgt-val.txt`

Validation files are required and are used to evaluate the convergence of the training. They usually contain no more than 5000 sentences.

After running the preprocessing, the following files are generated:

* `demo.src.dict`: dictionary of source vocab-to-index mappings.
* `demo.tgt.dict`: dictionary of target vocab-to-index mappings.
* `demo.train.pt`: serialized PyTorch file containing vocabulary, training, and validation data.

Internally the system never touches the words themselves, but uses these indices.
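As a toy illustration of that word-to-index mapping (the concept only, not OpenNMT-py's actual data structures):

```python
# Toy version of the vocabulary built during preprocessing: every token is
# replaced by an integer index, and special symbols get reserved slots.
vocab = {'<unk>': 0, '<pad>': 1, '<s>': 2, '</s>': 3}

def numericalize(sentence, vocab):
    """Map each token of a whitespace-tokenized sentence to its index."""
    ids = []
    for tok in sentence.split():
        if tok not in vocab:
            vocab[tok] = len(vocab)   # grow the vocab for unseen tokens
        ids.append(vocab[tok])
    return ids

print(numericalize("the cat sat", vocab))   # -> [4, 5, 6]
```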
## Step 2: Train the model

```bash
python train.py -data data/demo -save_model demo-model
```

The main train command is quite simple. Minimally it takes a data file and a save file. This will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and decoder. You can also add `-gpuid 1` to use (say) GPU 1.
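For intuition, the default architecture corresponds roughly to the following PyTorch modules (a sketch of the shapes only; the real model classes add embeddings, attention, and a generator, and the dropout value here is an assumption):

```python
import torch.nn as nn

hidden_size, emb_size, num_layers = 500, 500, 2

# Rough shape of the default encoder/decoder RNNs described above.
encoder_rnn = nn.LSTM(emb_size, hidden_size, num_layers=num_layers, dropout=0.3)
decoder_rnn = nn.LSTM(emb_size, hidden_size, num_layers=num_layers, dropout=0.3)
```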
## Step 3: Translate

```bash
python translate.py -model demo-model_epochX_PPL.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
```

Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into `pred.txt`.

!!! note "Note"
    The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example you can download millions of parallel sentences for [translation](http://www.statmt.org/wmt16/translation-task.html) or [summarization](https://github.com/harvardnlp/sent-summary).
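For intuition, here is a generic beam-search sketch (the concept `translate.py` implements, not its actual code; `step_fn` is a hypothetical callback returning scored continuations):

```python
def beam_search(step_fn, beam_size, max_len, eos=3):
    """Keep the `beam_size` best partial hypotheses at each step.

    step_fn(prefix) -> list of (token_id, log_prob) continuations.
    """
    beams = [([], 0.0)]   # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:          # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for tok, lp in step_fn(seq):
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]
```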
## Some useful tools

## Full Translation Example

The example below uses the Moses tokenizer (http://www.statmt.org/moses/) to prepare the data and the Moses BLEU script for evaluation.

```bash
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
```
## WMT'16 Multimodal Translation: Multi30k (de-en)

An example of training for the WMT'16 Multimodal Translation task (http://www.statmt.org/wmt16/multimodal-task.html).

### 0) Download the data.

```bash
mkdir -p data/multi30k
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz && tar -xf training.tar.gz -C data/multi30k && rm training.tar.gz
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz && tar -xf validation.tar.gz -C data/multi30k && rm validation.tar.gz
wget https://staff.fnwi.uva.nl/d.elliott/wmt16/mmt16_task1_test.tgz && tar -xf mmt16_task1_test.tgz -C data/multi30k && rm mmt16_task1_test.tgz
```

### 1) Preprocess the data.

```bash
# Delete the last line of the validation and training files.
for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
# Tokenize with the Moses tokenizer downloaded above.
for l in en de; do for f in data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
python preprocess.py -train_src data/multi30k/train.en.atok -train_tgt data/multi30k/train.de.atok -valid_src data/multi30k/val.en.atok -valid_tgt data/multi30k/val.de.atok -save_data data/multi30k.atok.low -lower
```

### 2) Train the model.

```bash
python train.py -data data/multi30k.atok.low -save_model multi30k_model -gpuid 0
```

### 3) Translate sentences.

```bash
python translate.py -gpu 0 -model multi30k_model_*_e13.pt -src data/multi30k/test.en.atok -tgt data/multi30k/test.de.atok -replace_unk -verbose -output multi30k.test.pred.atok
```

### 4) Evaluate.

```bash
perl tools/multi-bleu.perl data/multi30k/test.de.atok < multi30k.test.pred.atok
```
## Pretrained Models

The following pretrained models can be downloaded and used with `translate.py` (these were trained with an older version of the code; they will be updated soon).

- [onmt_model_en_de_200k](https://drive.google.com/file/d/0B6N7tANPyVeBWE9WazRYaUd2QTg/view?usp=sharing): An English-German translation model based on the 200k sentence dataset at [OpenNMT/IntegrationTesting](https://github.com/OpenNMT/IntegrationTesting/tree/master/data). Perplexity: 20.
- onmt_model_en_fr_b1M (coming soon): An English-French model trained on benchmark-1M. Perplexity: 4.85.

## Citation

[OpenNMT technical report](https://doi.org/10.18653/v1/P17-4012)

```
@inproceedings{opennmt,
  author    = {Guillaume Klein and
               Yoon Kim and
               Yuntian Deng and
               Jean Senellart and
               Alexander M. Rush},
  title     = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/P17-4012},
  doi       = {10.18653/v1/P17-4012}
}
```

docs/README.md (+28)

[MkDocs](http://www.mkdocs.org/) is used to generate the documentation at http://opennmt.net/OpenNMT/.

If you want to preview and deploy the documentation, continue reading the next sections.

## Installation

```bash
pip install mkdocs mkdocs-material python-markdown-math
```

## Workflow

1. Edit the Markdown documentation in `docs/`
2. Preview the documentation locally with `mkdocs serve`
3. Commit your documentation changes
4. Generate and deploy the static website on the `gh-pages` branch with `mkdocs gh-deploy` (if you are testing on a fork, don't forget to configure the remote with the `-r` option)

## Tips

### Adding pages

Update the main configuration file `mkdocs.yml`.

### Generating the options listing

```bash
./docs/options/generate.sh
```
