GEC-EB: Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting

Hannan Cao, Wenmian Yang, Hwee Tou Ng. Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting. In EACL 2023.

The program is tested under pytorch 1.7.1, CUDA version 11.7

Download required data and install required software

1.1. Generate the C4 200M synthetic data by following https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction

1.2. Download the NUCLE, FCE, CLang8, W&I , CoNLL-2013 and CoNLL-2014 ;

1.3. Install the fairseq inside the fairseq folder
```
cd fairseq
pip3 install --editable ./
```

Note all the scripts are inside train-scripts folder.

Pretrain and Train the Transformer-big model

2.1 Pass the path for target sentences into tok+bpe+pre.sh to generate the bpe using the subword_nmt package (https://github.com/rsennrich/subword-nmt)
```
./tok+bpe+pre.sh
```
2.2 Use apply_bpe.sh and create-dict-preprocess.sh to preprocess the pre-training data.
```
./apple_bpe.sh path/to/pretrain/data/folder
./create-dict-preprocess.sh path/to/bpe-ed/data/folder
```
2.3 Pretrain the model with pretrain.sh
```
./pretrain.sh
```
2.4 Preprocess the training data
```
./apple_bpe.sh path/to/train/data/folder
./preprocess.sh path/to/bpe-ed/data/folder
```
2.5 Train the model with train.sh
```
./train.sh model/train preprocessed/train/data path/to/pretrained/checkpoint
```

Generate augmented sentence

3.1. Use downloaded checkpoint to make predicitions on the training set (need to specify):

./predict.sh 0 path/to/source/training/sentence "candidate_data" path/to/downloaded/checkpoint output/directory

3.2. Generate candidate sentences from the prediction result, move the candidate files to respective folders (e.g. neg-1, neg-2, neg-3, neg-4, neg-5 are the respective folders and assume original training and validation sentences are stored in pos folder):

python generate_candidates.py --root_path previous/used/output/directory --candidate_name test.nbest.tok.candidate_data
mkdir pos
mkdir neg-1
mkdir neg-2
mkdir neg-3
mkdir neg-4
mkdir neg-5
mkdir pos-data
mkdir neg-1-data
mkdir neg-2-data
mkdir neg-3-data
mkdir neg-4-data
mkdir neg-5-data
mv candi.1 neg-1/train.tgt
mv candi.2 neg-2/train.tgt
mv candi.3 neg-3/train.tgt
mv candi.4 neg-4/train.tgt
mv candi.5 neg-5/train.tgt

3.3. Copy the train.src, valid.src and valid.tgt to neg-1, neg-2, neg-3, neg-4, neg-5 folders

5.4. Create the count for the number of candidates:

python valid_count.py --candi_path path/to/your/source/training/data/folder --file_name output/file/name --count numner/of/candidates/you/selected

3.5. Pass neg-1, neg-2, ..., neg-5 folders to apply_bpe.py to process the data

./apply_bpe.sh /path/to/neg-1/folder

3.6. Combine augmented data together (e.g. combine 5 together):

python multi_target.py --source /path/to/processed/pos/train.tgt/file /path/to/processed/neg-1/train.tgt/file ... /path/to/processed neg-5/train.tgt/file /path/to/count/file" \
					--target /path/to/processed/pos/train.src/file \
					--max 750 \
					--ratio 750 \
					--out path/to/output/data/folder

3.7. Binarize the data using preprocess.sh

./preprocess.sh path/to/output/data/folder

Train the model using DM method.

./conll_run.sh 0 path/to/save/finetuned/checkpoint
./bea_run.sh 0 path/to/save/finetuned/checkpoint

Make prediction with predict.sh

./predict.sh 0 path/to/test/set randome/name path/to/finetuned/weight output/directory

Use M2 scorer to evaluate the result of CoNLL-2014 test set, and evaluate the result on BEA-2019 test set by submitting the prediction result to colab:https://competitions.codalab.org/competitions/20228#participate-get-data

Citation

If you found our paper or code useful, please cite as:

@inproceedings{cao-etal-2022-eb,
    title = "Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting",
    author = "Cao, Hannan  and
      Yang, Wenmian  and
      Ng, Hwee Tou",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    year = "2023",
}

License

The source code is licensed under GNU GPL 3.0 (see License) for non-commercial use. For commercial use of this code, separate commercial licensing is also available. Please contact Prof. Hwee Tou Ng ([email protected]).

nusnlp / gec_eb Goto Github PK

gec_eb's Introduction

GEC-EB: Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting

Citation

License

gec_eb's People

Contributors

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent