Robust NMT for L2 Disfluency

Silver dataset creation project using Wikiann data for Named Entity Recognition (NER).

Step-by-step Instructions

First, generate errors from clean parallel data: qsub scripts/gen_errors.sh [train|dev|test] [typo|runon|paraphrase]

python3 scripts/gen_errors.py \
    -s [path/to/clean/file/prefix] \
    -t [path/to/output/file/prefix] \
    -l [path/to/log/file] \
    --compound-errors \
    -e typo runon paraphrase \
    -p 0.4 0.5 0.7 \
    --tgt-lang es
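
To make the role of `-e` and `-p` concrete, here is a minimal sketch of per-type error injection. It is not the code in `scripts/gen_errors.py`: the function names (`add_typo`, `add_runon`, `corrupt`) and the specific noise rules are assumptions for illustration, and `paraphrase` is left out because it would need a separate model. The sketch assumes each error type is applied independently with its own probability, and that compound errors mean more than one type may hit the same sentence.

```python
import random
import re


def add_typo(sentence, rng):
    """Swap two adjacent characters inside one randomly chosen word."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def add_runon(sentence, rng):
    """Remove sentence-internal end punctuation so sentences run together."""
    return re.sub(r"[.!?]\s+", " ", sentence)


# Hypothetical registry; the real script may implement these differently.
ERROR_FUNCS = {"typo": add_typo, "runon": add_runon}


def corrupt(sentence, error_types, probs, compound=True, seed=None):
    """Apply each error type with its own probability.

    With compound=False, at most one error type is applied per sentence.
    """
    rng = random.Random(seed)
    for name, p in zip(error_types, probs):
        if rng.random() < p:
            sentence = ERROR_FUNCS[name](sentence, rng)
            if not compound:
                break
    return sentence
```

A call such as `corrupt("I went home. It was late.", ["typo", "runon"], [0.4, 0.5], seed=0)` would then return the sentence with zero, one, or both error types applied, depending on the random draws.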

Second (optional), combine different error types into one dataset: qsub scripts/move_data.sh

python scripts/move_data.py \
    --ratio 4 \
    --output_dir [path/to/output/directory] \
    --key [experiment-name] \
    --errors article nounnum prep sva typo runon paraphrase
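
The merging step can be pictured with the short sketch below. It is an assumption-laden illustration, not the logic of `scripts/move_data.py`: the function name `combine_parallel` is hypothetical, and the role of `--ratio` is not documented in this README, so the sketch leaves it out. It only shows how several per-error-type parallel datasets could be merged and shuffled without breaking source/target alignment.

```python
import random


def combine_parallel(datasets, seed=0):
    """Merge per-error-type lists of (source, target) pairs and shuffle them.

    datasets maps an error-type name (e.g. "typo") to a list of sentence pairs.
    Keeping each pair as one tuple guarantees that source and target
    sentences stay aligned after shuffling.
    """
    merged = [pair for pairs in datasets.values() for pair in pairs]
    random.Random(seed).shuffle(merged)
    return merged
```

Fixing the seed keeps the merged dataset reproducible across runs, which helps when comparing experiments that differ only in the selected error types.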

Third, combine error data with clean data into the required format: ./scripts/combine_data.sh

Fourth, train a Transformer NMT model with fairseq: qsub scripts/clean_bpe.sh [0|1] [0|1] [0|1] [0|1]
```
# 1. Tokenize data
# 2. Fairseq preprocess data
# 3. Fairseq train
# 4. Fairseq evaluate
```
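
The four stages listed above can be sketched as a small dry-run driver. This is not the content of `scripts/clean_bpe.sh`. It assumes, without confirmation from the script itself, that the four `[0|1]` arguments switch the four stages on or off in order. The function name `build_pipeline`, the directory names, the source language `en`, and the training hyperparameters are placeholders. The tokenization command is a stand-in, since the tokenizer is not specified here. `fairseq-preprocess`, `fairseq-train`, and `fairseq-generate` are fairseq's standard command-line tools.

```python
def build_pipeline(flags, data_dir="data", bin_dir="data-bin",
                   ckpt_dir="checkpoints", src="en", tgt="es"):
    """Return the commands for the enabled stages, in order.

    flags holds four 0/1 values: tokenize, preprocess, train, evaluate.
    Nothing is executed; each command is returned as an argument list.
    """
    if len(flags) != 4:
        raise ValueError("expected four 0/1 flags")
    steps = [
        # 1. Tokenize (placeholder: the real tokenizer is not specified here).
        ["echo", "tokenize", data_dir],
        # 2. Binarize the tokenized text for fairseq.
        ["fairseq-preprocess", "--source-lang", src, "--target-lang", tgt,
         "--trainpref", f"{data_dir}/train", "--validpref", f"{data_dir}/dev",
         "--testpref", f"{data_dir}/test", "--destdir", bin_dir],
        # 3. Train a Transformer model (hyperparameters are illustrative only).
        ["fairseq-train", bin_dir, "--arch", "transformer",
         "--optimizer", "adam", "--lr", "5e-4", "--max-tokens", "4096",
         "--save-dir", ckpt_dir],
        # 4. Decode the test set with the best checkpoint.
        ["fairseq-generate", bin_dir,
         "--path", f"{ckpt_dir}/checkpoint_best.pt", "--beam", "5"],
    ]
    return [cmd for flag, cmd in zip(flags, steps) if int(flag) == 1]
```

For example, `build_pipeline([0, 1, 1, 0])` returns only the preprocess and train commands. Each returned list could be passed to `subprocess.run(cmd, check=True)` to execute it.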

(All data and models are stored on the CLSP grid: /export/c11/sli136/l2mt/)

stellalisy/L2MT

Folders and files

Latest commit

History

Repository files navigation

Robust NMT for L2 Disfluency

Step-by-step Instruction

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages