Few Shot System for Multilingual GEC

This is the repo for the paper "Explanation based In-Context Demonstrations Retrieval for Multilingual Grammatical Error Correction".

How to use

Configuration

Installation requires pytorch environment. After pytorch is installed, please use

pip install -r requirements.txt

to install packages.

Install spacy language packages through:

bash scripts/install_spacy.sh

Directory Setting

Please get permission and download the raw GEC data into /data/xxx/datasets/multilingual_raw directory. Our work have a recommend directory setting:

/data/xxx (your working directory)
|--- FewShotGEC   (This repo)
|--- models      (HuggingFace Models)
|--- datasets
   |--- multilingual_raw (Get permission and download the raw data, and rename to settings as in data/dataset_processor/settings.py)
   |--- multilingual (Processed dataset)

You can also use your own directory settings but you need to adjust configs/directory.json and data/dataset_processor/settings and all .yaml experimental configurations.

Prepare Data

To process datasets into standard ones, you should use scripts in our repo to convert them.

python data/dataset_processor/en_conll14.py
python data/dataset_processor/zh_hsk.py
python data/dataset_processor/zh_nlpcc18.py
python data/dataset_processor/de_falko_merlin.py
python data/dataset_processor/ru_rulec.py
python data/dataset_processor/et_estgec.py

Few-Shot baselines

We offer the scripts for all procedures including database construction, retrieval, inital detection and few-shot process. Please refer to scripts/pipeline folder.

For example, you can run random-selection few-shot prediction by

bash scripts/pipeline/infer_icl_random.sh experiments/icl_llama31/conll14.yaml

You can also run the ICL database (with grammatical error explanations, GEE) construction by

bash scripts/pipeline/prepare_icl.sh experiments/icl_llama31/conll14.yaml

Then you can get the dataset with GEE.

Note that all the experiment shell scripts should be executed with a .yaml configuration file.

For running baseline methods and proposed methods, you can refer to:

scripts/pipeline_baseline.sh
scripts/pipeline_main.sh

For reproducing, we offer the tasks that are used in main results:

bash experiments/task_llama31.sh
bash experiments/task_llama_baseline_methods.sh
...

Brief view for the repo

We have ICL scripts in the root directory, including icl.py, icl_prepare.py and icl_retrieval.py.
And the data/ include the datasets and the database implementation.
The main experiments configuration files are saved in experiments

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
configs		configs
data		data
evaluators		evaluators
experiments		experiments
llm		llm
scripts		scripts
utils		utils
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
deepspeed_zero.yaml		deepspeed_zero.yaml
evaluate_gec.py		evaluate_gec.py
generate.py		generate.py
icl.py		icl.py
icl_prepare.py		icl_prepare.py
icl_retrieval.py		icl_retrieval.py
requirements.txt		requirements.txt
sft.py		sft.py
yaml_load.py		yaml_load.py
yaml_save.py		yaml_save.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Few Shot System for Multilingual GEC

How to use

Configuration

Directory Setting

Prepare Data

Few-Shot baselines

Brief view for the repo

About

Releases

Packages

Languages

License

GMago-LeWay/FewShotGEC

Folders and files

Latest commit

History

Repository files navigation

Few Shot System for Multilingual GEC

How to use

Configuration

Directory Setting

Prepare Data

Few-Shot baselines

Brief view for the repo

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages