
gector's Introduction

GECToR – Grammatical Error Correction: Tag, Not Rewrite

This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:

GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)

It is mainly based on AllenNLP and transformers.

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.
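
For example, a clean setup inside a virtual environment (assuming python3.7 is available on your PATH) might look like this:

python3.7 -m venv gector-env
source gector-env/bin/activate
pip install -r requirements.txt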

Datasets

All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the following command:

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
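
For illustration, assuming SOURCE and TARGET are plain-text parallel files with one whitespace-tokenized sentence per line (the errorful source and its corrected target on matching lines), a concrete run might look like:

python utils/preprocess_data.py -s data/train_source.txt \
                                -t data/train_target.txt \
                                -o data/train.tagged

The resulting OUTPUT_FILE stores each source token joined with its edit tag (e.g. $KEEP) via the SEPL|||SEPR separator; samples of this format appear in the issues further down.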

Pretrained models

Pretrained encoder   Confidence bias   Min error prob   CoNLL-2014 (test)   BEA-2019 (test)
BERT [link]          0.1               0.41             61.0                68.0
RoBERTa [link]       0.2               0.5              64.0                71.8
XLNet [link]         0.2               0.5              63.2                71.2

Note: The scores in the table differ from those reported in the paper because a later version of the transformers library is used. To reproduce the results reported in the paper, use this version of the repository.

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

There are many parameters you can specify; among them:

  • cold_steps_count - the number of epochs during which only the last linear layer is trained
  • transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert} - the encoder model
  • tn_prob - probability of sampling sentences with no errors; helps to balance precision/recall
  • pieces_per_token - maximum number of subwords per token; helps avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
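
As an illustration only (the flag values below are taken from commands that appear in the issues further down, not from official recommendations), a training run could be launched as:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR \
                --vocab_path data/output_vocabulary \
                --transformer_model roberta --special_tokens_fix 1 \
                --cold_steps_count 2 --tn_prob 0 --pieces_per_token 5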

Training parameters

All the parameters we use for training and evaluation are described here.

Model inference

To run your model on an input file, use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE

Notable parameters:

  • min_error_probability - minimum error probability (as in the paper)
  • additional_confidence - confidence bias (as in the paper)
  • special_tokens_fix - needed to reproduce some of the reported results of the pretrained models
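
For example, a plausible invocation for the pretrained RoBERTa checkpoint from the table above, assuming it downloads as roberta_1_gector.th (as referenced in the issues below) and requires --special_tokens_fix 1, and using the table's confidence bias and minimum error probability, would be:

python predict.py --model_path roberta_1_gector.th \
                  --vocab_path data/output_vocabulary \
                  --input_file INPUT_FILE --output_file OUTPUT_FILE \
                  --transformer_model roberta --special_tokens_fix 1 \
                  --additional_confidence 0.2 --min_error_probability 0.5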

For evaluation use M^2Scorer and ERRANT.
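
As a rough sketch only (script names and paths depend on how you install the scorers), evaluation typically looks like:

# CoNLL-2014: official M^2 scorer, comparing the system output against the gold M2 file
python m2scorer.py SYSTEM_OUTPUT GOLD_M2

# BEA-2019 dev: ERRANT, first aligning source and hypothesis into M2, then comparing to the reference
errant_parallel -orig SOURCE_FILE -cor SYSTEM_OUTPUT -out hyp.m2
errant_compare -hyp hyp.m2 -ref REFERENCE_M2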

Text Simplification

This repository also implements the code of the following paper:

Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)

For data preprocessing, training, and testing, the same interface as for GEC can be used. In both the training and evaluation stages, utils/filter_brackets.py is used to remove noise. During inference, we use the --normalize flag.

Model                SARI (TurkCorpus)   SARI (ASSET)   FKGL
TST-FINAL [link]     39.9                40.3           7.65
TST-FINAL + tweaks   41.0                42.7           7.61

Inference tweaks parameters:

iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04
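
Assuming each of these tweaks maps to a predict.py flag of the same name (not verified here) and using the --normalize flag mentioned above, an inference call might look like:

python predict.py --model_path TST_MODEL_PATH --vocab_path VOCAB_PATH \
                  --input_file INPUT_FILE --output_file OUTPUT_FILE \
                  --normalize \
                  --iteration_count 2 \
                  --additional_keep_confidence=-0.68 \
                  --additional_del_confidence=-0.84 \
                  --min_error_probability 0.04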

For evaluation use EASSE package.

Note: The scores in the table are very close to those in the paper but do not fully match them, for two reasons:

  • in the paper, we reported the average scores of 4 models trained with different seeds;
  • we merged the codebases for the GEC and Text Simplification tasks and updated them to a newer version of the transformers library.

Notable works based on GECToR

  • Vanilla PyTorch implementation of GECToR with AMP and distributed support by DeepSpeed [code]
  • Improving Sequence Tagging approach for Grammatical Error Correction task [paper][code]
  • LM-Critic: Language Models for Unsupervised Grammatical Error Correction [paper][code]

Citation

If you find this work useful for your research, please cite our papers:

GECToR – Grammatical Error Correction: Tag, Not Rewrite

@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn  and
      Atrasevych, Vitaliy  and
      Chernodub, Artem  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{\_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{\_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}

Text Simplification by Tagging

@inproceedings{omelianchuk-etal-2021-text,
    title = "{T}ext {S}implification by {T}agging",
    author = "Omelianchuk, Kostiantyn  and
      Raheja, Vipul  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.bea-1.2",
    pages = "11--25",
    abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}

gector's People

Contributors

achernodub, aneeshbhatb, komelianchuk, makstarnavskyi, simoneliasen, skurzhanskyi, tagucci


gector's Issues

custom dataset training

Hi,

If I wish to train the model on my own dataset, how do I prepare the data? Kindly guide me in preparing the dataset.

The model starts training from scratch during stage 2, 3

I have replicated your XLNet-based result for stage 1, but it seems that the model starts training from scratch during stage 2, even though the pretrain argument is set to the best model from stage 1. Am I missing something? I am new to the AllenNLP library. Here is the command I used for stage 2:
python train.py --train_set "./data/wi.train.tagged" --dev_set "./data/wi.dev.tagged" --model_dir "./modelxlnet_stage2/" --vocab_path=data/output_vocabulary --skip_correct=1 --tp_prob 1 --tn_prob 0 --special_tokens_fix=0 --cold_steps_count=2 --n_epoch 20 --transformer_model=xlnet --updates_per_epoch=0 --batch_size=64 --accumulation_size=2 --pretrain_folder ./modelxlnet_stage1/ --pretrain best

Can I resume training from where I left off by loading the last state of the model?

I want to confirm what the code below actually does (does it load the state of the model that we pass to the pretrain argument?). Also, a snapshot of the model is stored in the model dir after each epoch, so can we stop pretraining partway and then pick up again from where we left off by using the code below to load the last snapshot of the model being pretrained (for the purpose of pretraining non-continuously, with some breaks)?

if args.pretrain:
    model.load_state_dict(torch.load(os.path.join(args.pretrain_folder, args.pretrain + '.th')))

If I am right, which file should I pass to the "pretrain" argument above in order to continue training?
Options: model_state_epoch.th or training_state_epoch.th

Reproducing the result of CONLL2014(test) using XLNet

Hi,

Thank you very much for the great work. I have been trying to reproduce the result for CoNLL-2014. I obtained the following results for stage 1, stage 2, stage 3, and the Inf. Tweaks stage:
[screenshot of the reproduced scores]

I observe that for the first three stages, the difference between the reproduced result and the original result in the paper is not very large. However, for the Inf. Tweaks stage the difference is quite large. I am using the following script:

python predict.py --model_path=model_weights/xlnet-2-gpu/stage-3/best.th --vocab_path=data/output_vocabulary --input_file=COLL14/conll14st-test.original --output_file=output/xlnet-stage-3.txt --transformer_model=xlnet --special_tokens_fix=0 --additional_confidence=0.35 --min_error_probability=0.66

May I know how to solve this problem?

Thank you in advance.

About the dataset of stage 1

I have processed the data for stage 1. I got 44,326,735 pairs, and I found that GECToR uses 9,000,000 pairs in its experiments. Do you sample the training data in your experiments?

Any other embeddings?

Hi!

I am wondering, is there an easy way to make use of other word embeddings? I am mostly interested in those with an OOV (out-of-vocabulary) feature, like FastText, or randomly initialized character-level embeddings. Or are only transformer embeddings supported?

The reason is that I am trying to use your model for a typo-correction task (which is a subtask of grammar correction), so I have a lot of OOV words in my dataset, and I think that using character-level embeddings (or embeddings with OOV support) may increase the accuracy of the model.

If I wanted to conduct an experiment on custom data - how do I go about it?

I wanted to test this architecture with custom pretraining and training sets. As I see in the codebase, there is a single common training script. According to the paper, training was done in 3 stages. So is it correct to assume that the common script train.py was run 3 times, each time starting from the previous stage's snapshot?

Word Embedding and Projection Layer Shape Mismatch

I'm getting the following error:

File "train.py", line 308, in <module>
   main(args)
 File "train.py", line 141, in main
   model.load_state_dict(torch.load(os.path.join(args.pretrain_folder, args.pretrain + '.th')))
 File "/h/asabet/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
   self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Seq2Labels:
   size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([28996, 768]) from checkpoint, the shape in current model is torch.Size([28997, 768]).
   size mismatch for tag_labels_projection_layer._module.weight: copying a param with shape torch.Size([5002, 768]) from checkpoint, the shape in current model is torch.Size([1002, 768]).
   size mismatch for tag_labels_projection_layer._module.bias: copying a param with shape torch.Size([5002]) from checkpoint, the shape in current model is torch.Size([1002]).

after running
python3 train.py --train_set ./data/train --dev_set ./data/dev --model_dir models/ --pretrain bert_0_gector --pretrain_folder models/ --transformer_model bert
based on pretrained weights from https://grammarly-nlp-data-public.s3.amazonaws.com/gector/bert_0_gector.th. What's the source of the mismatch, and how do I fix it?

hyperparameters during inference

Hi,

Thank you for the novel idea and great work. I was trying to learn from and reproduce the inference results with the provided pretrained RoBERTa and XLNet models. However, referring to the paper, I had some confusion about the hyperparameters during inference. In the appendix, does the confidence bias refer to --additional_confidence? I tried to reproduce this result with XLNet using --min_error_probability 0.66 --additional_confidence 0.35:

CUDA_VISIBLE_DEVICES='0' python predict.py --model_path pretrained_models/xlnet_0_gector.th --vocab_path data/output_vocabulary --input_file 'data/wi+locness/m2/ABCN.dev.gold.bea19.original' --output_file output/xlnet.ABCN.dev.bea19.corr.min_error_0.66.add_conf_0.35.txt --transformer_model xlnet --special_tokens_fix 0 --min_error_probability 0.66 --additional_confidence 0.35
Produced overall corrections: 2438

However, with the XLNet model on the BEA-2019 dev set I only obtained the following result. Could you provide the inference commands for the released pretrained RoBERTa and XLNet models? It would be very helpful to have them in the appendix as well.

Span-Based Correction
TP FP FN Prec Rec F0.5
2523 1295 4938 0.6608 0.3382 0.5549

Span-Based Detection
TP FP FN Prec Rec F0.5
2851 981 4781 0.744 0.3736 0.6209

Token-Based Detection
TP FP FN Prec Rec F0.5
3415 802 5581 0.8098 0.3796 0.6602

Thank you!

Inconsistency of the amount of data

Hi,
A problem occurred when I tried to re-implement your experiment. In your paper's Table 1, there is a list of the datasets you used. I downloaded these from the official BEA-2019 shared task and found that the number of sentences is not the same as yours. Is there some special preprocessing step, or another reason? By the way, I extracted the sentences from the M2 files.

Installation error in allennlp==0.8.4

Hello,

I am running into an error when installing the gector requirement allennlp==0.8.4. More specifically, there seems to be an error when installing the gevent library:

x86_64-linux-gnu-gcc: error: src/gevent/libev/corecext.c: No such file or directory
x86_64-linux-gnu-gcc: fatal error: no input files
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Is there a workaround for this bug? The latest version of allennlp installs without problems, but then I encounter dependency issues with gector.

Thank you,

Jonas

How to reproduce the XLNet 72.4 f0.5-score

I use the command

"python predict.py --model_path pretained_model/xlnet_0_gector.th
--input_file ABCN.bea19.test.orig
--output_file bea19_test.txt --transformer_model xlnet --special_tokens_fix=0"
Then I submit the result to the CodaLab. It shows the Span-level correction F0.5 is 65.35.

Am I using the correct command? If I want to get 72.4, how should I set up the predict command?

The pretrained models are too heavy to run online.

The pretrained models are very large. Not having the proper hardware (GPUs), I tried to run them on Colab, but they are too heavy to be uploaded online. Is there any other alternative (if I run on Colab) in order to reproduce the results?

help!

Please help me solve this error (also, do we need to preprocess the official .m2 files downloaded from the BEA-2019 website?).
[screenshot of the error]

Start Training Freezes?

A training run with the following parameters:

python3 train.py --train_set $train_set \
                --dev_set $dev_set \
                --model_dir $model_dir \
                --transformer_model 'bert' \
                --tune_bert 1 \
                --n_epoch 4 \
                --cold_steps_count 1 \
                --accumulation_size 4 \
                --updates_per_epoch 10000 \
                --tn_prob 0 \
                --tp_prob 1 \
                --pretrain_folder models \
                --pretrain bert_0_gector \
                --skip_correct 0 \
                --skip_complex 0 \
                --max_len 64 \
                --batch_size 64 \
                --tag_strategy keep_one \
                --cold_lr 1e-3 \
                --lr 1e-5 \
                --predictor_dropout 0.0 \
                --lowercase_tokens 0 \
                --pieces_per_token 5 \
                --vocab_path data/output_vocabulary \
                --label_smoothing 0.0

on a dataset with 1000 lines freezes at this point https://imgur.com/a/qobJxdc. What could be the cause?

Low iteration speed

Hi!
A problem occurred when I tried to pretrain the model on multiple (8) GPUs. It seems that the utilization per GPU is low (10%), and the iteration speed is only about 8 s/it. Is there something wrong with my configuration?
By the way, when I used PyTorch 1.3.0 it always returned a segmentation fault, so I upgraded PyTorch to 1.5.0 and the fault disappeared. But it then failed with the "StopIteration" error mentioned in huggingface/transformers#4189, so I followed the advice there and installed transformers 2.10.0 from source. Could this be related to the low iteration speed?
The GPU devices I use are Nvidia GTX 1080Ti.

state_dict mismatch when loading from pretrained?

I'm trying to fine-tune on my own dataset from the pretrained model you provide (bert_0_gector.pt), with the following training arguments:

python3 train.py --train_set data/train \
        --dev_set data/dev \
        --model_dir models \
        --batch_size 100 \
        --cold_steps_count 1 \
        --pretrain_folder models \
        --pretrain bert_0_gector \
        --n_epoch 20

and get the following error:

WARNING:root:vocabulary serialization directory models/vocabulary is not empty
Data is loaded
Traceback (most recent call last):
  File "train.py", line 303, in <module>
    main(args)
  File "train.py", line 141, in main
    model.load_state_dict(torch.load(os.path.join(args.pretrain_folder, args.pretrain + '.th')))
  File "/h/asabet/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Seq2Labels:
	size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([28996, 768]) from checkpoint, the shape in current model is torch.Size([50266, 768]).
	size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 768]) from checkpoint, the shape in current model is torch.Size([514, 768]).
	size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 768]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for tag_labels_projection_layer._module.weight: copying a param with shape torch.Size([5002, 768]) from checkpoint, the shape in current model is torch.Size([1002, 768]).
	size mismatch for tag_labels_projection_layer._module.bias: copying a param with shape torch.Size([5002]) from checkpoint, the shape in current model is torch.Size([1002]).

Why would a mismatch show up if I'm just trying to load from the pretrained model?

F0.5 is almost zero after training the model

Thanks for the interesting work.
I have trained the model, stage 1, on the first part of the PIE data (a1). After shuffling and splitting the data, I used the following command:

python train.py --train_set "./data/a1/train.txt" --dev_set "./data/a1/dev.txt" --model_dir "./modelxlnet/" --vocab_path=data/output_vocabulary --skip_correct=1 --tn_prob=0 --special_tokens_fix=0 --cold_steps_count=2 --transformer_model=xlnet --updates_per_epoch=10000 --batch_size=64 --accumulation_size=4 --cold_steps_count=2.

I have also used the vocab_path from your project's ./data/output_vocabulary.
But after training for more than 5 epochs, I am still getting output that is almost identical to the input file.
Here is the command I used for inference:

python predict.py --model_path "./modelxlnet/model_state_epoch_2.th" --subset valid --min_error_probability 0 --additional_confidence 0 --batch_size=64 --iteration_count=5 --transformer_model=xlnet --special_tokens_fix=0

And here is the ERRANT evaluation result :
[screenshot of the ERRANT evaluation scores]

Am I missing something? Thanks for your time.

Could we delete the token '$START' ?

Hi, thanks for your wonderful code.

I wonder if we can delete the token '$START' when preprocessing the dataset, because I found that this token doesn't seem to make much sense.

size mismatch for text_field_embedder.token_embedder_bert.bert_model.word_embedding.weight

By running
python predict.py --model_path xlnet_0_gector.th --vocab_path data/output_vocabulary/ --input_file xxx --output_file xxx --transformer_model xlnet

I get

size mismatch for text_field_embedder.token_embedder_bert.bert_model.word_embedding.weight: copying a param with shape torch.Size([32000, 768]) from checkpoint, the shape in current model is torch.Size([32006, 768]).

But when I use RoBERTa (which is the default setting), the run is fine.
Thanks!

Getting Segmentation fault

Hi,

I've been trying to use the trained model (roberta_1_gector.th) on a GPU instance using this command: "python3.6 predict.py --model_path models/roberta_1_gector.th --vocab_path data/output_vocabulary/ --input_file ../data/ip.txt --output_file ../op.txt --min_error_probability 0.2 --additional_confidence 0.5"

But it is giving me this error trace:

Fatal Python error: Segmentation fault
Current thread 0x00007fbdf39c9700 (most recent call first):
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 97 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_bert.py", line 160 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_bert.py", line 617 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 181 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_utils.py", line 406 in from_pretrained
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_auto.py", line 156 in from_pretrained
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/bert_token_embedder.py", line 29 in load
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/bert_token_embedder.py", line 254 in init
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/gec_model.py", line 176 in _get_embbeder
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/gec_model.py", line 88 in init
File "predict.py", line 46 in main
File "predict.py", line 120 in
Segmentation fault

preprocessing limits

Hi,

I am trying to reproduce the results for the BEA-2019 shared task. May I know what kind of pre-training data you used for the initial training? Also, when I try to construct a large preprocessed dataset from the errorify repository, I get stuck at around 800,000 pairs of parallel corpora. What pretraining corpora did you use to produce the RoBERTa score?


Tokenization problem

Thanks for the tool and models, the accuracy is pretty impressive! One thing I did notice was that the way the input was split into tokens causes some problems. predict.py is using python's string.split() to get an array of tokens, but string.split() doesn't separate punctuation from a word.

This causes problems like the following.
Original:

We need to be aware of cultural difference.

Corrected:

We need to be aware of cultural difference.sssss

or

Original:

I like to think about problems and ways to solve.

Corrected:

I like to think about problems and ways to solve. them

I changed the string.split() to spaCy's tokenizer to get the array of tokens and it's working much better.
Original:

We need to be aware of cultural difference.

Corrected:

We need to be aware of cultural differences .

or

Original:

I like to think about problems and ways to solve.

Corrected:

I like to think about problems and ways to solve them .

Of course, loading spaCy's model reduces speed (that's why I didn't submit a PR), so maybe there's a better option for tokenization. Just wanted to share the problem with string.split().
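
For reference, a minimal tokenizer-only replacement along the lines described above could use spaCy's blank English pipeline; treat this as a sketch rather than the exact change that was made:

from spacy.lang.en import English

nlp = English()  # tokenizer-only pipeline, no statistical model is downloaded or loaded

def tokenize(line):
    # Splits off punctuation, e.g. "cultural difference." -> ["cultural", "difference", "."]
    return [tok.text for tok in nlp(line)]

Because the blank pipeline skips the statistical model, it should also mitigate the speed concern mentioned above.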

Can't train the model

I preprocessed the data as described in the README; the preprocessed train dataset looks as follows:

  • $STARTSEPL|||SEPR$KEEP DecemberSEPL|||SEPR$KEEP 12thSEPL|||SEPR$KEEP
  • $STARTSEPL|||SEPR$KEEP PrincipalSEPL|||SEPR$KEEP mr.SEPL|||SEPR$KEEP robertsonSEPL|||SEPR$KEEP

I think the preprocessing phase is right.

Then I use the following script, run.sh, to train the model:
CUDA_VISIBLE_DEVICES=0 python train.py --train_set ./data/fce-train.pre \
--dev_set ./data/conll-dev.pre \
--model_dir ./model \
--transformer_model bert

but it reported an error, as follows:
18545it [00:01, 9993.01it/s]
Data is loaded
run.sh: line 4: 731383 Segmentation fault (core dumped) CUDA_VISIBLE_DEVICES=0 python train.py --train_set ./data/fce-train.pre --dev_set ./data/conll-dev.pre --model_dir ./model --transformer_model bert

My environment is CUDA 9.2 and PyTorch 1.3.1.
Does anyone know how to solve this problem, or is there something wrong with my process?

Training loss decreases, but model doesn't learn

I'm training the BERT gector model (using train.py) on an edit dataset similar to those used in the gector paper (i.e. NUCLE 3.3 or CoNLL-2014), but the model's predictions degenerate to predicting $KEEP for every token. This minimizes the loss, since most of the labels are $KEEP, but doesn't induce any learning in the model. Usually this is solved by re-weighting the class losses to correct for class imbalance, but that wasn't done in your implementation.

How did you originally resolve this?
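
For context, the class re-weighting mentioned above usually amounts to passing per-label weights to the cross-entropy loss. A generic PyTorch sketch follows; the label count and $KEEP index are made up for illustration, and this is not GECToR's actual loss code:

import torch
import torch.nn.functional as F

num_labels = 5002                    # illustrative size of the tag vocabulary
keep_index = 0                       # assumed index of the $KEEP label
weights = torch.ones(num_labels)
weights[keep_index] = 0.1            # down-weight the dominant $KEEP class

logits = torch.randn(8, num_labels)  # fake batch of per-token logits
labels = torch.zeros(8, dtype=torch.long)
loss = F.cross_entropy(logits, labels, weight=weights)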

The size of raw dataset is 0

Thank you for the magnificent article and code. I'm sorry if my question looks silly.

I have downloaded just one dataset (fce.train.gold.bea19.m2, plus the dev and test sets) and moved the files to the data directory; when I then start training, I get the error 'The size of raw dataset is 0'.

My questions:

Is it required to rename any dataset files (including the parallel and synthetic datasets) before moving them to the data directory?

Also, can I run the whole training process on a single GPU?

How to run inference with pretrained models?

I am trying to run gector using BERT, and I am likely missing something very trivial.

I ran inference from bash with:

python predict.py --model_path .\bert_0_gector.th --vocab_path .\data\output_vocabulary\ --input_file .\ABCN.test.bea19.orig --output_file foo

Execution crashes with:

I0728 17:10:37.239058 26000 file_utils.py:40] PyTorch version 1.3.0 available.
I0728 17:10:38.438563 26000 vocabulary.py:306] Loading token dictionary from .\data\output_vocabulary\.
I0728 17:10:39.341149 26000 tokenization_utils.py:398] loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at C:\Users\davide.fiocco\.cache\torch\transformers\d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
I0728 17:10:39.342149 26000 tokenization_utils.py:398] loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at C:\Users\davide.fiocco\.cache\torch\transformers\b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
I0728 17:10:39.432337 26000 tokenization_utils.py:548] Adding $START to the vocabulary
I0728 17:10:39.908694 26000 configuration_utils.py:160] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at C:\Users\davide.fiocco\.cache\torch\transformers\e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
I0728 17:10:39.909693 26000 configuration_utils.py:177] Model config {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 1,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 1,
  "use_bfloat16": false,
  "vocab_size": 50265
}

I0728 17:10:40.361510 26000 modeling_utils.py:401] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin from cache at C:\Users\davide.fiocco\.cache\torch\transformers\228756ed15b6d200d7cb45aaef08c087e2706f54cb912863d2efe07c89584eb7.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e
E0728 17:10:44.443636 26000 vocabulary.py:632] Namespace: d_tags
E0728 17:10:44.444651 26000 vocabulary.py:633] Token: INCORRECT
Traceback (most recent call last):
  File "predict.py", line 116, in <module>
    main(args)
  File "predict.py", line 43, in main
    weigths=args.weights)
  File "C:\Users\davide.fiocco\Projects\gector\gector\gec_model.py", line 89, in __init__
    confidence=self.confidence
  File "C:\Users\davide.fiocco\Projects\gector\gector\seq2labels_model.py", line 74, in __init__
    namespace=detect_namespace)
  File "C:\Users\davide.fiocco\Anaconda3\envs\gector\lib\site-packages\allennlp\data\vocabulary.py", line 630, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'

There are quite a few things that I couldn't understand from the README file:

  1. Should I specify an existing output file?
  2. Why does gector import RoBERTa if I have specified a pretrained BERT model?
  3. Why the crash in the allennlp library?

Ultimately, I would like to:

  • For an input sentence, see the transformation tags
  • For an incorrect sentence, compute iteratively its corrected version

Thanks for any guidance on this...!

Is there a tokenizer problem when using pretrained RoBerta?

Thanks for sharing your excellent work.

The RoBERTa model uses a byte-level BPE, and the vocabulary loaded from the pretrained RoBERTa model looks like this:
[screenshot of the vocabulary]

Note that there is a special unicode character 'Ġ' before the token 'the'; 'Ġ' encodes a leading space.
For example, 'Ġthe' is indexed as 5 in the vocabulary, and 'the' is indexed as 627.

But when processing the data, all lines are simply split with 'split()', which means the word 'the' gets encoded as 627 instead of 5.

So I reckon there are lots of tokens that are not indexed correctly, although the stage-one pre-training could alleviate this problem.

Train and Dev Set for CoNLL-2014 for Stage II, III

Hi!

I have seen that the paper mentions Lang-8, FCE, NUCLE, and W&I+LOCNESS being used for stage II and stage III.
[screenshot of the relevant table]
I believe this should be the case for the BEA-2019 test set.

For the case of CoNLL-2014, do we still use Lang-8, FCE, NUCLE, and W&I+LOCNESS for stage II, and W&I+LOCNESS for stage III as well? Is the CoNLL-2014 dataset only involved in the final testing?

role of tokenizer in the prediction stage

Hi,

I am curious about the role of tokenizer.

As far as I know, the model only uses the labels as its vocabulary set in both the training and prediction stages.

Therefore, the vocab file of each model is in fact not used here; instead, the vocabulary extracted from the data is used.

If my understanding is right, what is the role of the tokenizer (especially in the prediction stage)?

Extending functionalities: Adding Contextual Spell Check/Correction capabilities

Hi,

I wanted your suggestions for adding extra functionality:

  1. Adding contextual, academic spell correction along with grammar error suggestions.

  2. How can I approach grammatical error correction in the academic (mainly K-12) domain, where textual data may contain equations, mathematical encodings, LaTeX, HTML (tables, lists, etc.), XML, and so on?

Could you help me with this, in terms of feasibility and approach?

Roberta's Performance on BEA19-dev set and CONLL14-test set

Hi,

I have reproduced RoBERTa's results on the BEA-2019 dev set and the CoNLL-2014 test set. I discovered that for the BEA-2019 dev set, its performance is higher than the result presented in the paper (about 4-5% higher). However, for the CoNLL-2014 test set, its performance is lower than the paper's result (about 3-4% lower).
[screenshot of the reproduced scores]
I have used the following scripts:

For Stage 1: (I used two GPUs for training)

python3 train.py --train_set=PIE/pie-9m-train-tagged.txt --dev_set=PIE/pie-9m-dev-tagged.txt --model_dir=model_weights/roberta-2-gpu/stage-1 --cold_steps_count=2 --accumulation_size=2 --updates_per_epoch=10000 --tn_prob=0 --tp_prob=1 --transformer_model=roberta --special_tokens_fix=1 --tune_bert=1 --skip_correct=1 --skip_complex=0 --n_epoch=20 --patience=3 --max_len=50 --batch_size=64 --tag_strategy=keep_one --cold_steps_count=0 --cold_lr=1e-3 --lr=1e-5 --predictor_dropout=0.0 --lowercase_tokens=0 --pieces_per_token=5 --vocab_path=data/output_vocabulary --label_smoothing=0.0

For Stage 2:(I used 1 GPU for training)

python3 train.py --train_set=BEA19/bea19-train-tagged.txt --dev_set=BEA19/bea19-dev-tagged.txt --model_dir=model_weights/roberta-2-gpu/stage-2 --cold_steps_count=2 --accumulation_size=2 --updates_per_epoch=0 --tn_prob=0 --tp_prob=1 --transformer_model=roberta --special_tokens_fix=1 --tune_bert=1 --skip_correct=1 --skip_complex=0 --n_epoch=20 --patience=3 --max_len=50 --batch_size=64 --tag_strategy=keep_one --cold_steps_count=0 --cold_lr=1e-3 --lr=1e-5 --predictor_dropout=0.0 --lowercase_tokens=0 --pieces_per_token=5 --vocab_path=data/output_vocabulary --label_smoothing=0.0 --pretrain_folder=model_weights/roberta-2-gpu/stage-1 --pretrain=best

For Stage3:(I used 1 GPU for training)

"python3 train.py --train_set=Stage-3-data/train-tagged.txt --dev_set=Stage-3-data/dev-tagged.txt --model_dir=model_weights/roberta-2-gpu/stage-3 --cold_steps_count=0 --accumulation_size=2 --updates_per_epoch=0 --tn_prob=1 --tp_prob=1 --transformer_model=roberta --special_tokens_fix=1 --tune_bert=1 --skip_correct=1 --skip_complex=0 --n_epoch=20 --patience=3 --max_len=50 --batch_size=64 --tag_strategy=keep_one --cold_steps_count=0 --cold_lr=1e-3 --lr=1e-5 --predictor_dropout=0.0 --lowercase_tokens=0 --pieces_per_token=5 --vocab_path=data/output_vocabulary --label_smoothing=0.0 --pretrain_folder=model_weights/roberta-2-gpu/stage-2 --pretrain=best"

In Stage 2, the model trained for 12 epochs before stopping, and in Stage 3 it ran for 6 epochs.
May I know what the problem might be?

What kind of data format do you use?

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE

What are SOURCE and TARGET? All the datasets your paper mentioned can be obtained in M2 format, but how can I get the SOURCE and TARGET files?

xlnet_base_cased

Hi!
When I run predict.py with the transformer model set to xlnet, 3 files get downloaded internally. Are those 3 files the xlnet-base-cased model? If so, I have already downloaded the cased model. Where in the code do I need to make a change so that it loads the model from the folder containing the downloaded files?

RuntimeError: Error(s) in loading state_dict for Seq2Labels

I am unable to resolve this issue, observed while loading the pretrained xlnet_0_gector model:

RuntimeError: Error(s) in loading state_dict for Seq2Labels:
size mismatch for tag_labels_projection_layer._module.weight: copying a param with shape torch.Size([5002, 768]) from checkpoint, the shape in current model is torch.Size([5482, 768]).
size mismatch for tag_labels_projection_layer._module.bias: copying a param with shape torch.Size([5002]) from checkpoint, the shape in current model is torch.Size([5482]).

Discard the last label when predicting?

In the function postprocess_batch of gec_model.py,
'for i in range(length):' seems to discard the last label in idxs, since len(idxs) == len(tokens) + 1 == length + 1
(idxs holds the labels for [$START] + tokens).

Parameters of Stage III

Hi!
I've managed to re-implement your paper's BERT result for training stage II. But after I finished training stage III, it seems that the improvement is not so obvious (F0.5 from 57.02 to 59.06). I don't know if I got the hyper-parameter settings right, so could you please provide your BERT settings for stage III? Specifically:

skip_correct, tn_prob, tp_prob, cold_steps_count, tune_bert, n_epoch

Questions about preprocessing data, [RoBERTa model]

Hi,

I found that when preprocessing data for the RoBERTa model, there is a difference between this repository's code and the official RoBERTa code.

This repository's preprocessing procedure is:

  1. Split the sentence into tokens by whitespace
  2. Apply BPE to every token

The official code's preprocessing procedure is:

  1. Split the sentence into tokens with the self.pat regex
  2. Apply BPE to every token

The main difference is how the sentence is segmented.

For example (ignoring the $START token):

1. The original sentence is: "That is impossible ."
2. This repository's BPE result is "[1711, 354, 11850, 31497, 4]", i.e. ['That', 'is', 'imp', 'ossible', '.'] before encoding.
3. The official code's BPE result is "[1711, 16, 4703, 479]", i.e. ['That', 'Ġis', 'Ġimpossible', 'Ġ.'] before encoding, where 'Ġ' is the byte-encoded space.

So I think this difference will affect the RoBERTa model's performance.

Could you give me some explanation? Thank you.
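
The difference can be checked with the transformers tokenizer; the expected outputs in the comments are taken from the token lists quoted above, so treat this as an illustrative check rather than part of the repository:

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
sent = "That is impossible ."

# Whole-sentence BPE: inner words carry the leading-space marker 'Ġ'
print(tok.tokenize(sent))
# ['That', 'Ġis', 'Ġimpossible', 'Ġ.']

# Per-token BPE after str.split(), as in the preprocessing described above
print([tok.tokenize(w) for w in sent.split()])
# [['That'], ['is'], ['imp', 'ossible'], ['.']]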

CUDA out of memory error

Hello folks. I'm trying to fine-tune the XLNet model on some arbitrary data. The command I'm using is:

python3 train.py --train_set=boost_train_pre --dev_set=boost_dev_pre --model_dir=XLNET_FT --vocab_path data/output_vocabulary/ --pretrain_folder=/gector_models/ --pretrain=xlnet_0_gector --n_epoch=10 --transformer_model=xlnet --special_tokens_fix=0

However, when I do this, I run into a CUDA out-of-memory error. I'm using a GeForce RTX 2080 Ti. Is there something wrong with my parameters?

How did you decide which type of token-level transformations to include in the edit space?

Hi, thanks for the amazing paper and recording at BEA 15, ACL 2020!

I was wondering how you decided which types of token-level transformations to include in the edit space, for example the 1167 token-dependent APPEND and 3802 REPLACE transformations. Is it a dataset-driven process, e.g. determined by ranking the most frequent transformations in CoNLL-2014 and selecting the top 5000?

Many thanks in advance!
