
gector's Introduction

GECToR – Grammatical Error Correction: Tag, Not Rewrite

This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:

GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)

It is mainly based on AllenNLP and transformers.

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.
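
For example, a clean setup inside a virtual environment (assuming python3.7 is available on your PATH) might look like this:

python3.7 -m venv gector-env
source gector-env/bin/activate
pip install -r requirements.txt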

Datasets

All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the following command:

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
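
For illustration, assuming SOURCE and TARGET are plain-text parallel files with one whitespace-tokenized sentence per line (the errorful source and its corrected target on matching lines), a concrete run might look like:

python utils/preprocess_data.py -s data/train_source.txt \
                                -t data/train_target.txt \
                                -o data/train.tagged

The resulting OUTPUT_FILE stores each source token joined with its edit tag (e.g. $KEEP) via the SEPL|||SEPR separator; samples of this format appear in the issues further down.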

Pretrained models

Pretrained encoder   Confidence bias   Min error prob   CoNLL-2014 (test)   BEA-2019 (test)
BERT [link]          0.1               0.41             61.0                68.0
RoBERTa [link]       0.2               0.5              64.0                71.8
XLNet [link]         0.2               0.5              63.2                71.2

Note: The scores in the table differ from those reported in the paper because a later version of the transformers library is used. To reproduce the results reported in the paper, use this version of the repository.

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

There are many parameters you can specify; among them:

  • cold_steps_count - the number of epochs during which only the last linear layer is trained
  • transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert} - the encoder model
  • tn_prob - probability of sampling sentences with no errors; helps to balance precision/recall
  • pieces_per_token - maximum number of subwords per token; helps avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
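
As an illustration only (the flag values below are taken from commands that appear in the issues further down, not from official recommendations), a training run could be launched as:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR \
                --vocab_path data/output_vocabulary \
                --transformer_model roberta --special_tokens_fix 1 \
                --cold_steps_count 2 --tn_prob 0 --pieces_per_token 5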

Training parameters

All the parameters we use for training and evaluation are described here.

Model inference

To run your model on an input file, use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE

Notable parameters:

  • min_error_probability - minimum error probability (as in the paper)
  • additional_confidence - confidence bias (as in the paper)
  • special_tokens_fix - needed to reproduce some of the reported results of the pretrained models
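
For example, a plausible invocation for the pretrained RoBERTa checkpoint from the table above, assuming it downloads as roberta_1_gector.th (as referenced in the issues below) and requires --special_tokens_fix 1, and using the table's confidence bias and minimum error probability, would be:

python predict.py --model_path roberta_1_gector.th \
                  --vocab_path data/output_vocabulary \
                  --input_file INPUT_FILE --output_file OUTPUT_FILE \
                  --transformer_model roberta --special_tokens_fix 1 \
                  --additional_confidence 0.2 --min_error_probability 0.5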

For evaluation use M^2Scorer and ERRANT.
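
As a rough sketch only (script names and paths depend on how you install the scorers), evaluation typically looks like:

# CoNLL-2014: official M^2 scorer, comparing the system output against the gold M2 file
python m2scorer.py SYSTEM_OUTPUT GOLD_M2

# BEA-2019 dev: ERRANT, first aligning source and hypothesis into M2, then comparing to the reference
errant_parallel -orig SOURCE_FILE -cor SYSTEM_OUTPUT -out hyp.m2
errant_compare -hyp hyp.m2 -ref REFERENCE_M2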

Text Simplification

This repository also implements the code of the following paper:

Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)

For data preprocessing, training, and testing, the same interface as for GEC can be used. In both the training and evaluation stages, utils/filter_brackets.py is used to remove noise. During inference, we use the --normalize flag.

Model                SARI (TurkCorpus)   SARI (ASSET)   FKGL
TST-FINAL [link]     39.9                40.3           7.65
TST-FINAL + tweaks   41.0                42.7           7.61

Inference tweaks parameters:

iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04
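
Assuming each of these tweaks maps to a predict.py flag of the same name (not verified here) and using the --normalize flag mentioned above, an inference call might look like:

python predict.py --model_path TST_MODEL_PATH --vocab_path VOCAB_PATH \
                  --input_file INPUT_FILE --output_file OUTPUT_FILE \
                  --normalize \
                  --iteration_count 2 \
                  --additional_keep_confidence=-0.68 \
                  --additional_del_confidence=-0.84 \
                  --min_error_probability 0.04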

For evaluation use EASSE package.

Note: The scores in the table are very close to those in the paper but do not fully match them, for two reasons:

  • in the paper, we reported the average scores of 4 models trained with different seeds;
  • we merged the codebases for the GEC and Text Simplification tasks and updated them to a newer version of the transformers library.

Notable works based on GECToR

  • Vanilla PyTorch implementation of GECToR with AMP and distributed support by DeepSpeed [code]
  • Improving Sequence Tagging approach for Grammatical Error Correction task [paper][code]
  • LM-Critic: Language Models for Unsupervised Grammatical Error Correction [paper][code]

Citation

If you find this work useful for your research, please cite our papers:

GECToR – Grammatical Error Correction: Tag, Not Rewrite

@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn  and
      Atrasevych, Vitaliy  and
      Chernodub, Artem  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{\_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{\_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}

Text Simplification by Tagging

@inproceedings{omelianchuk-etal-2021-text,
    title = "{T}ext {S}implification by {T}agging",
    author = "Omelianchuk, Kostiantyn  and
      Raheja, Vipul  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.bea-1.2",
    pages = "11--25",
    abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}

gector's People

Contributors

achernodub, aneeshbhatb, komelianchuk, makstarnavskyi, simoneliasen, skurzhanskyi, tagucci


gector's Issues

custom dataset training

Hi,

If I wish to train the model on my own dataset, how do I prepare the data? Kindly guide me in preparing the dataset.

The model starts training from scratch during stage 2, 3

I have replicated your XLNet-based result for stage 1, but it seems that the model starts training from scratch during stage 2, even though the pretrain argument is set to the best model from stage 1. Am I missing something? I am new to the AllenNLP library. Here is the command I used for stage 2:
python train.py --train_set "./data/wi.train.tagged" --dev_set "./data/wi.dev.tagged" --model_dir "./modelxlnet_stage2/" --vocab_path=data/output_vocabulary --skip_correct=1 --tp_prob 1 --tn_prob 0 --special_tokens_fix=0 --cold_steps_count=2 --n_epoch 20 --transformer_model=xlnet --updates_per_epoch=0 --batch_size=64 --accumulation_size=2 --pretrain_folder ./modelxlnet_stage1/ --pretrain best

Can I resume training from where I left off by loading the last state of the model?

I want to confirm what the code below actually does (does it load the state of the model that we pass to the pretrain argument?). Also, a snapshot of the model is stored in the model dir after each epoch, so can we stop pretraining partway and then pick up again from where we left off by using the code below to load the last snapshot of the model being pretrained (for the purpose of pretraining non-continuously, with some breaks)?

if args.pretrain:
    model.load_state_dict(torch.load(os.path.join(args.pretrain_folder, args.pretrain + '.th')))

If I am right, which file should I pass to the "pretrain" argument above in order to continue training?
Options: model_state_epoch.th or training_state_epoch.th

Reproducing the result of CONLL2014(test) using XLNet

Hi,

Thank you very much for the great work. I have been trying to reproduce the result for CoNLL-2014. I obtained the following results for stage 1, stage 2, stage 3, and the Inf. Tweaks stage:
[screenshot of the reproduced scores]

I observe that for the first three stages, the difference between the reproduced result and the original result in the paper is not very large. However, for the Inf. Tweaks stage the difference is quite large. I am using the following script:

python predict.py --model_path=model_weights/xlnet-2-gpu/stage-3/best.th --vocab_path=data/output_vocabulary --input_file=COLL14/conll14st-test.original --output_file=output/xlnet-stage-3.txt --transformer_model=xlnet --special_tokens_fix=0 --additional_confidence=0.35 --min_error_probability=0.66

May I know how to solve this problem?

Thank you in advance.

About the dataset of stage 1

I have processed the data for stage 1. I got 44,326,735 pairs, and I found that GECToR uses 9,000,000 pairs in its experiments. Do you sample the training data in your experiments?

Any other embeddings?

Hi!

I am wondering, is there an easy way to make use of other word embeddings? I am mostly interested in those with an OOV (out-of-vocabulary) feature, like FastText, or randomly initialized character-level embeddings. Or are only transformer embeddings supported?

The reason is that I am trying to use your model for a typo-correction task (which is a subtask of grammar correction), so I have a lot of OOV words in my dataset, and I think that using character-level embeddings (or embeddings with OOV support) may increase the accuracy of the model.

If I wanted to conduct an experiment on custom data - how do I go about it?

I wanted to test this architecture with custom pretraining and training sets. As I see in the codebase, there is a single common training script. According to the paper, training was done in 3 stages. So is it correct to assume that the common script train.py was run 3 times, each time starting from the previous stage's snapshot?

Word Embedding and Projection Layer Shape Mismatch

I'm getting the following error:

File "train.py", line 308, in <module>
   main(args)
 File "train.py", line 141, in main
   model.load_state_dict(torch.load(os.path.join(args.pretrain_folder, args.pretrain + '.th')))
 File "/h/asabet/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
   self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Seq2Labels:
   size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([28996, 768]) from checkpoint, the shape in current model is torch.Size([28997, 768]).
   size mismatch for tag_labels_projection_layer._module.weight: copying a param with shape torch.Size([5002, 768]) from checkpoint, the shape in current model is torch.Size([1002, 768]).
   size mismatch for tag_labels_projection_layer._module.bias: copying a param with shape torch.Size([5002]) from checkpoint, the shape in current model is torch.Size([1002]).

after running
python3 train.py --train_set ./data/train --dev_set ./data/dev --model_dir models/ --pretrain bert_0_gector --pretrain_folder models/ --transformer_model bert
based on pretrained weights from https://grammarly-nlp-data-public.s3.amazonaws.com/gector/bert_0_gector.th. What's the source of the mismatch, and how do I fix it?

hyperparameters during inference

Hi,

Thank you for the novel idea and great work. I was trying to learn from and reproduce the inference results with the provided pretrained RoBERTa and XLNet models. However, referring to the paper, I had some confusion about the hyperparameters during inference. In the appendix, does the confidence bias refer to --additional_confidence? I tried to reproduce this result with XLNet using --min_error_probability 0.66 --additional_confidence 0.35:

CUDA_VISIBLE_DEVICES='0' python predict.py --model_path pretrained_models/xlnet_0_gector.th --vocab_path data/output_vocabulary --input_file 'data/wi+locness/m2/ABCN.dev.gold.bea19.original' --output_file output/xlnet.ABCN.dev.bea19.corr.min_error_0.66.add_conf_0.35.txt --transformer_model xlnet --special_tokens_fix 0 --min_error_probability 0.66 --additional_confidence 0.35
Produced overall corrections: 2438

However, with the XLNet model on the BEA-2019 dev set I only obtained the following result. Could you provide the inference commands for the released pretrained RoBERTa and XLNet models? It would be very helpful to have them in the appendix as well.

Span-Based Correction
TP FP FN Prec Rec F0.5
2523 1295 4938 0.6608 0.3382 0.5549

Span-Based Detection
TP FP FN Prec Rec F0.5
2851 981 4781 0.744 0.3736 0.6209

Token-Based Detection
TP FP FN Prec Rec F0.5
3415 802 5581 0.8098 0.3796 0.6602

Thank you!

Inconsistency of the amount of data

Hi,
A problem occurred when I tried to re-implement your experiment. In your paper's Table 1, there is a list of the datasets you used. I downloaded these from the official BEA-2019 shared task and found that the number of sentences is not the same as yours. Is there some special preprocessing step, or another reason? By the way, I extracted the sentences from the M2 files.

Installation error in allennlp==0.8.4

Hello,

I am running into an error when installing the gector requirement allennlp==0.8.4. More specifically, there seems to be an error when installing the gevent library:

x86_64-linux-gnu-gcc: error: src/gevent/libev/corecext.c: No such file or directory
x86_64-linux-gnu-gcc: fatal error: no input files
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Is there a workaround for this bug? The latest version of allennlp installs without problems, but then I encounter dependency issues with gector.

Thank you,

Jonas

How to reproduce the XLNet 72.4 f0.5-score

I use the command

"python predict.py --model_path pretained_model/xlnet_0_gector.th
--input_file ABCN.bea19.test.orig
--output_file bea19_test.txt --transformer_model xlnet --special_tokens_fix=0"
Then I submit the result to the CodaLab. It shows the Span-level correction F0.5 is 65.35.

Am I using the correct command? If I want to get 72.4, how should I set up the predict command?

The pretrained models are too heavy to run online.

The pretrained models are very large. Not having the proper hardware (GPUs), I tried to run them on Colab, but they are too heavy to be uploaded online. Is there any other alternative (if I run on Colab) in order to reproduce the results?

help!

Please help me solve this error (also, do we need to preprocess the official .m2 files downloaded from the BEA-2019 website?).
[screenshot of the error]

Start Training Freezes?

A training run with the following parameters:

python3 train.py --train_set $train_set \
                --dev_set $dev_set \
                --model_dir $model_dir \
                --transformer_model 'bert' \
                --tune_bert 1 \
                --n_epoch 4 \
                --cold_steps_count 1 \
                --accumulation_size 4 \
                --updates_per_epoch 10000 \
                --tn_prob 0 \
                --tp_prob 1 \
                --pretrain_folder models \
                --pretrain bert_0_gector \
                --skip_correct 0 \
                --skip_complex 0 \
                --max_len 64 \
                --batch_size 64 \
                --tag_strategy keep_one \
                --cold_lr 1e-3 \
                --lr 1e-5 \
                --predictor_dropout 0.0 \
                --lowercase_tokens 0 \
                --pieces_per_token 5 \
                --vocab_path data/output_vocabulary \
                --label_smoothing 0.0

on a dataset with 1000 lines freezes at this point https://imgur.com/a/qobJxdc. What could be the cause?

Low iteration speed

Hi!
A problem occurred when I tried to pretrain the model on multiple (8) GPUs. It seems that the utilization per GPU is low (10%), and the iteration speed is only about 8 s/it. Is there something wrong with my configuration?
By the way, when I used PyTorch 1.3.0 it always returned a segmentation fault, so I upgraded PyTorch to 1.5.0 and the fault disappeared. But it then failed with the "StopIteration" error mentioned in huggingface/transformers#4189, so I followed the advice there and installed transformers 2.10.0 from source. Could this be related to the low iteration speed?
The GPU devices I use are Nvidia GTX 1080Ti.

state_dict mismatch when loading from pretrained?

I'm trying to fine-tune on my own dataset from the pretrained model you provide (bert_0_gector.pt), with the following training arguments:

python3 train.py --train_set data/train \
        --dev_set data/dev \
        --model_dir models \
        --batch_size 100 \
        --cold_steps_count 1 \
        --pretrain_folder models \
        --pretrain bert_0_gector \
        --n_epoch 20

and get the following error:

WARNING:root:vocabulary serialization directory models/vocabulary is not empty
Data is loaded
Traceback (most recent call last):
  File "train.py", line 303, in <module>
    main(args)
  File "train.py", line 141, in main
    model.load_state_dict(torch.load(os.path.join(args.pretrain_folder, args.pretrain + '.th')))
  File "/h/asabet/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Seq2Labels:
	size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([28996, 768]) from checkpoint, the shape in current model is torch.Size([50266, 768]).
	size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 768]) from checkpoint, the shape in current model is torch.Size([514, 768]).
	size mismatch for text_field_embedder.token_embedder_bert.bert_model.embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 768]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for tag_labels_projection_layer._module.weight: copying a param with shape torch.Size([5002, 768]) from checkpoint, the shape in current model is torch.Size([1002, 768]).
	size mismatch for tag_labels_projection_layer._module.bias: copying a param with shape torch.Size([5002]) from checkpoint, the shape in current model is torch.Size([1002]).

Why would a mismatch show up if I'm just trying to load from the pretrained model?

F0.5 is almost zero after training the model

Thanks for the interesting work.
I have trained the model, stage 1, on the first part of the PIE data (a1). After shuffling and splitting the data, I used the following command:

python train.py --train_set "./data/a1/train.txt" --dev_set "./data/a1/dev.txt" --model_dir "./modelxlnet/" --vocab_path=data/output_vocabulary --skip_correct=1 --tn_prob=0 --special_tokens_fix=0 --cold_steps_count=2 --transformer_model=xlnet --updates_per_epoch=10000 --batch_size=64 --accumulation_size=4 --cold_steps_count=2.

I have also used the vocab_path from your project's ./data/output_vocabulary.
But after training for more than 5 epochs, I am still getting output that is almost identical to the input file.
Here is the command I used for inference:

python predict.py --model_path "./modelxlnet/model_state_epoch_2.th" --subset valid --min_error_probability 0 --additional_confidence 0 --batch_size=64 --iteration_count=5 --transformer_model=xlnet --special_tokens_fix=0

And here is the ERRANT evaluation result :
[screenshot of the ERRANT evaluation scores]

Am I missing something? Thanks for your time.

Could we delete the token '$START' ?

Hi, thanks for your wonderful code.

I wonder if we can delete the token '$START' when preprocessing the dataset, because I found that this token doesn't seem to make much sense.

size mismatch for text_field_embedder.token_embedder_bert.bert_model.word_embedding.weight

By running
python predict.py --model_path xlnet_0_gector.th --vocab_path data/output_vocabulary/ --input_file xxx --output_file xxx --transformer_model xlnet

I get

size mismatch for text_field_embedder.token_embedder_bert.bert_model.word_embedding.weight: copying a param with shape torch.Size([32000, 768]) from checkpoint, the shape in current model is torch.Size([32006, 768]).

But when I use RoBERTa (which is the default setting), the run is fine.
Thanks!

Getting Segmentation fault

Hi,

I've been trying to use the trained model (roberta_1_gector.th) on a GPU instance using this command: "python3.6 predict.py --model_path models/roberta_1_gector.th --vocab_path data/output_vocabulary/ --input_file ../data/ip.txt --output_file ../op.txt --min_error_probability 0.2 --additional_confidence 0.5"

But it is giving me this error trace:

Fatal Python error: Segmentation fault
Current thread 0x00007fbdf39c9700 (most recent call first):
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 97 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_bert.py", line 160 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_bert.py", line 617 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 181 in init
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_utils.py", line 406 in from_pretrained
File "/data/home/ubuntu/brijesh_intern/cqi_pip_venv/lib/python3.6/site-packages/transformers/modeling_auto.py", line 156 in from_pretrained
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/bert_token_embedder.py", line 29 in load
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/bert_token_embedder.py", line 254 in init
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/gec_model.py", line 176 in _get_embbeder
File "/data/home/ubuntu/brijesh_intern/grammarly_gector/gector/gector/gec_model.py", line 88 in init
File "predict.py", line 46 in main
File "predict.py", line 120 in
Segmentation fault

preprocessing limits

Hi,

I am trying to reproduce the results for the BEA-2019 shared task. May I know what kind of pre-training data you used for the initial training? Also, when I try to construct a large preprocessed dataset from the errorify repository, I get stuck at around 800,000 pairs of parallel corpora. What pretraining corpora did you use to produce the RoBERTa score?


Tokenization problem

Thanks for the tool and models, the accuracy is pretty impressive! One thing I did notice was that the way the input was split into tokens causes some problems. predict.py is using python's string.split() to get an array of tokens, but string.split() doesn't separate punctuation from a word.

This causes problems like the following.
Original:

We need to be aware of cultural difference.

Corrected:

We need to be aware of cultural difference.sssss

or

Original:

I like to think about problems and ways to solve.

Corrected:

I like to think about problems and ways to solve. them

I changed the string.split() to spaCy's tokenizer to get the array of tokens and it's working much better.
Original:

We need to be aware of cultural difference.

Corrected:

We need to be aware of cultural differences .

or

Original:

I like to think about problems and ways to solve.

Corrected:

I like to think about problems and ways to solve them .

Of course, loading spaCy's model reduces speed (that's why I didn't submit a PR), so maybe there's a better option for tokenization. Just wanted to share the problem with string.split().
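
For reference, a minimal tokenizer-only replacement along the lines described above could use spaCy's blank English pipeline; treat this as a sketch rather than the exact change that was made:

from spacy.lang.en import English

nlp = English()  # tokenizer-only pipeline, no statistical model is downloaded or loaded

def tokenize(line):
    # Splits off punctuation, e.g. "cultural difference." -> ["cultural", "difference", "."]
    return [tok.text for tok in nlp(line)]

Because the blank pipeline skips the statistical model, it should also mitigate the speed concern mentioned above.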

Can't train the model

I preprocessed the data as described in the README; the preprocessed train dataset looks as follows:

  • $STARTSEPL|||SEPR$KEEP DecemberSEPL|||SEPR$KEEP 12thSEPL|||SEPR$KEEP
  • $STARTSEPL|||SEPR$KEEP PrincipalSEPL|||SEPR$KEEP mr.SEPL|||SEPR$KEEP robertsonSEPL|||SEPR$KEEP

I think the preprocessing phase is right.

Then I use the following script, run.sh, to train the model:
CUDA_VISIBLE_DEVICES=0 python train.py --train_set ./data/fce-train.pre \
--dev_set ./data/conll-dev.pre \
--model_dir ./model \
--transformer_model bert

but it reported an error, as follows:
18545it [00:01, 9993.01it/s]
Data is loaded
run.sh: line 4: 731383 Segmentation fault (core dumped) CUDA_VISIBLE_DEVICES=0 python train.py --train_set ./data/fce-train.pre --dev_set ./data/conll-dev.pre --model_dir ./model --transformer_model bert

My environment is CUDA 9.2 and PyTorch 1.3.1.
Does anyone know how to solve this problem, or is there something wrong with my process?

Training loss decreases, but model doesn't learn

I'm training the BERT gector model (using train.py) on an edit dataset similar to those used in the gector paper (i.e. NUCLE 3.3 or CoNLL-2014), but the model's predictions degenerate to predicting $KEEP for every token. This minimizes the loss, since most of the labels are $KEEP, but doesn't induce any learning in the model. Usually this is solved by re-weighting the class losses to correct for class imbalance, but that wasn't done in your implementation.

How did you originally resolve this?
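
For context, the class re-weighting mentioned above usually amounts to passing per-label weights to the cross-entropy loss. A generic PyTorch sketch follows; the label count and $KEEP index are made up for illustration, and this is not GECToR's actual loss code:

import torch
import torch.nn.functional as F

num_labels = 5002                    # illustrative size of the tag vocabulary
keep_index = 0                       # assumed index of the $KEEP label
weights = torch.ones(num_labels)
weights[keep_index] = 0.1            # down-weight the dominant $KEEP class

logits = torch.randn(8, num_labels)  # fake batch of per-token logits
labels = torch.zeros(8, dtype=torch.long)
loss = F.cross_entropy(logits, labels, weight=weights)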

The size of raw dataset is 0

Thank you for the magnificent article and code. I'm sorry if my question looks silly.

I have downloaded just one dataset (fce.train.gold.bea19.m2, plus the dev and test sets) and moved the files to the data directory; when I then start training, I get the error 'The size of raw dataset is 0'.

My questions:

Is it required to rename any dataset files (including the parallel and synthetic datasets) before moving them to the data directory?

Also, can I run the whole training process on a single GPU?

How to run inference with pretrained models?

I am trying to run gector using BERT, and I am likely missing something very trivial.

I ran inference from bash with:

python predict.py --model_path .\bert_0_gector.th --vocab_path .\data\output_vocabulary\ --input_file .\ABCN.test.bea19.orig --output_file foo

Execution crashes with:

I0728 17:10:37.239058 26000 file_utils.py:40] PyTorch version 1.3.0 available.
I0728 17:10:38.438563 26000 vocabulary.py:306] Loading token dictionary from .\data\output_vocabulary\.
I0728 17:10:39.341149 26000 tokenization_utils.py:398] loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at C:\Users\davide.fiocco\.cache\torch\transformers\d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
I0728 17:10:39.342149 26000 tokenization_utils.py:398] loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at C:\Users\davide.fiocco\.cache\torch\transformers\b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
I0728 17:10:39.432337 26000 tokenization_utils.py:548] Adding $START to the vocabulary
I0728 17:10:39.908694 26000 configuration_utils.py:160] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at C:\Users\davide.fiocco\.cache\torch\transformers\e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
I0728 17:10:39.909693 26000 configuration_utils.py:177] Model config {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 1,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 1,
  "use_bfloat16": false,
  "vocab_size": 50265
}

I0728 17:10:40.361510 26000 modeling_utils.py:401] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin from cache at C:\Users\davide.fiocco\.cache\torch\transformers\228756ed15b6d200d7cb45aaef08c087e2706f54cb912863d2efe07c89584eb7.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e
E0728 17:10:44.443636 26000 vocabulary.py:632] Namespace: d_tags
E0728 17:10:44.444651 26000 vocabulary.py:633] Token: INCORRECT
Traceback (most recent call last):
  File "predict.py", line 116, in <module>
    main(args)
  File "predict.py", line 43, in main
    weigths=args.weights)
  File "C:\Users\davide.fiocco\Projects\gector\gector\gec_model.py", line 89, in __init__
    confidence=self.confidence
  File "C:\Users\davide.fiocco\Projects\gector\gector\seq2labels_model.py", line 74, in __init__
    namespace=detect_namespace)
  File "C:\Users\davide.fiocco\Anaconda3\envs\gector\lib\site-packages\allennlp\data\vocabulary.py", line 630, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'

There are quite a few things that I couldn't understand from the README file:

  1. Should I specify an existing output file?
  2. Why does gector import RoBERTa if I have specified a pretrained BERT model?
  3. Why the crash in the allennlp library?

Ultimately, I would like to:

  • For an input sentence, see the transformation tags
  • For an incorrect sentence, compute iteratively its corrected version

Thanks for any guidance on this...!

Is there a tokenizer problem when using pretrained RoBerta?

Thanks for sharing your excellent work.

The RoBERTa model uses a byte-level BPE, and the vocabulary loaded from the pretrained RoBERTa model looks like this:
[screenshot of the vocabulary]

Note that there is a special unicode character 'Ġ' before the token 'the'; 'Ġ' encodes a leading space.
For example, 'Ġthe' is indexed as 5 in the vocabulary, and 'the' is indexed as 627.

But when processing the data, all lines are simply split with 'split()', which means the word 'the' gets encoded as 627 instead of 5.

So I reckon there are lots of tokens that are not indexed correctly, although the stage-one pre-training could alleviate this problem.

Train and Dev Set for CoNLL-2014 for Stage II, III

Hi!

I have seen that the paper mentions Lang-8, FCE, NUCLE, and W&I+LOCNESS being used for stage II and stage III.
[screenshot of the relevant table]
I believe this should be the case for the BEA-2019 test set.

For the case of CoNLL-2014, do we still use Lang-8, FCE, NUCLE, and W&I+LOCNESS for stage II, and W&I+LOCNESS for stage III as well? Is the CoNLL-2014 dataset only involved in the final testing?

role of tokenizer in the prediction stage

Hi,

I am curious about the role of tokenizer.

As far as I know, the model only uses the labels as its vocabulary set in both the training and prediction stages.

Therefore, the vocab file of each model is in fact not used here; instead, the vocabulary extracted from the data is used.

If my understanding is right, what is the role of the tokenizer (especially in the prediction stage)?

Extending functionalities: Adding Contextual Spell Check/Correction capabilities

Hi,

I wanted your suggestions for adding extra functionality:

  1. Adding contextual, academic spell correction along with grammar error suggestions.

  2. How can I approach grammatical error correction in the academic (mainly K-12) domain, where textual data may contain equations, mathematical encodings, LaTeX, HTML (tables, lists, etc.), XML, and so on?

Could you help me with this, in terms of feasibility and approach?

Roberta's Performance on BEA19-dev set and CONLL14-test set

Hi,

I have reproduced RoBERTa's results on the BEA-2019 dev set and the CoNLL-2014 test set. I discovered that for the BEA-2019 dev set, its performance is higher than the result presented in the paper (about 4-5% higher). However, for the CoNLL-2014 test set, its performance is lower than the paper's result (about 3-4% lower).
[screenshot of the reproduced scores]
I have used the following scripts:

For Stage 1: (I used two GPUs for training)

python3 train.py --train_set=PIE/pie-9m-train-tagged.txt --dev_set=PIE/pie-9m-dev-tagged.txt --model_dir=model_weights/roberta-2-gpu/stage-1 --cold_steps_count=2 --accumulation_size=2 --updates_per_epoch=10000 --tn_prob=0 --tp_prob=1 --transformer_model=roberta --special_tokens_fix=1 --tune_bert=1 --skip_correct=1 --skip_complex=0 --n_epoch=20 --patience=3 --max_len=50 --batch_size=64 --tag_strategy=keep_one --cold_steps_count=0 --cold_lr=1e-3 --lr=1e-5 --predictor_dropout=0.0 --lowercase_tokens=0 --pieces_per_token=5 --vocab_path=data/output_vocabulary --label_smoothing=0.0

For Stage 2:(I used 1 GPU for training)

python3 train.py --train_set=BEA19/bea19-train-tagged.txt --dev_set=BEA19/bea19-dev-tagged.txt --model_dir=model_weights/roberta-2-gpu/stage-2 --cold_steps_count=2 --accumulation_size=2 --updates_per_epoch=0 --tn_prob=0 --tp_prob=1 --transformer_model=roberta --special_tokens_fix=1 --tune_bert=1 --skip_correct=1 --skip_complex=0 --n_epoch=20 --patience=3 --max_len=50 --batch_size=64 --tag_strategy=keep_one --cold_steps_count=0 --cold_lr=1e-3 --lr=1e-5 --predictor_dropout=0.0 --lowercase_tokens=0 --pieces_per_token=5 --vocab_path=data/output_vocabulary --label_smoothing=0.0 --pretrain_folder=model_weights/roberta-2-gpu/stage-1 --pretrain=best

For Stage3:(I used 1 GPU for training)

"python3 train.py --train_set=Stage-3-data/train-tagged.txt --dev_set=Stage-3-data/dev-tagged.txt --model_dir=model_weights/roberta-2-gpu/stage-3 --cold_steps_count=0 --accumulation_size=2 --updates_per_epoch=0 --tn_prob=1 --tp_prob=1 --transformer_model=roberta --special_tokens_fix=1 --tune_bert=1 --skip_correct=1 --skip_complex=0 --n_epoch=20 --patience=3 --max_len=50 --batch_size=64 --tag_strategy=keep_one --cold_steps_count=0 --cold_lr=1e-3 --lr=1e-5 --predictor_dropout=0.0 --lowercase_tokens=0 --pieces_per_token=5 --vocab_path=data/output_vocabulary --label_smoothing=0.0 --pretrain_folder=model_weights/roberta-2-gpu/stage-2 --pretrain=best"

In Stage 2, the model trained for 12 epochs before stopping, and in Stage 3 it ran for 6 epochs.
May I know what the problem might be?

What kind of data format do you use?

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE

What are SOURCE and TARGET? All the datasets your paper mentioned can be obtained in M2 format, but how can I get the SOURCE and TARGET files?

xlnet_base_cased

Hi!
When I run predict.py with the transformer model set to xlnet, 3 files get downloaded internally. Are those 3 files the xlnet-base-cased model? If so, I have already downloaded the cased model. Where in the code do I need to make a change so that it loads the model from the folder containing the downloaded files?

RuntimeError: Error(s) in loading state_dict for Seq2Labels

I am unable to resolve this issue, observed while loading the pretrained xlnet_0_gector model:

RuntimeError: Error(s) in loading state_dict for Seq2Labels:
size mismatch for tag_labels_projection_layer._module.weight: copying a param with shape torch.Size([5002, 768]) from checkpoint, the shape in current model is torch.Size([5482, 768]).
size mismatch for tag_labels_projection_layer._module.bias: copying a param with shape torch.Size([5002]) from checkpoint, the shape in current model is torch.Size([5482]).

Discard the last label when predicting?

In the function postprocess_batch of gec_model.py,
'for i in range(length):' seems to discard the last label in idxs, since len(idxs) == len(tokens) + 1 == length + 1
(idxs holds the labels for [$START] + tokens).

Parameters of Stage III

Hi!
I've managed to re-implement your paper's BERT result for training stage II. But after I finished training stage III, it seems that the improvement is not so obvious (F0.5 from 57.02 to 59.06). I don't know if I got the hyper-parameter settings right, so could you please provide your BERT settings for stage III? Specifically:

skip_correct, tn_prob, tp_prob, cold_steps_count, tune_bert, n_epoch

Questions about preprocessing data, [RoBERTa model]

Hi,

I found that when preprocessing data for the RoBERTa model, there is a difference between this repository's code and the official RoBERTa code.

This repository's preprocessing procedure is:

  1. Split the sentence into tokens by whitespace
  2. Apply BPE to every token

The official code's preprocessing procedure is:

  1. Split the sentence into tokens with the self.pat regex
  2. Apply BPE to every token

The main difference is how the sentence is segmented.

For example (ignoring the $START token):

1. The original sentence is: "That is impossible ."
2. This repository's BPE result is "[1711, 354, 11850, 31497, 4]", i.e. ['That', 'is', 'imp', 'ossible', '.'] before encoding.
3. The official code's BPE result is "[1711, 16, 4703, 479]", i.e. ['That', 'Ġis', 'Ġimpossible', 'Ġ.'] before encoding, where 'Ġ' is the byte-encoded space.

So I think this difference will affect the RoBERTa model's performance.

Could you give me some explanation? Thank you.
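
The difference can be checked with the transformers tokenizer; the expected outputs in the comments are taken from the token lists quoted above, so treat this as an illustrative check rather than part of the repository:

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
sent = "That is impossible ."

# Whole-sentence BPE: inner words carry the leading-space marker 'Ġ'
print(tok.tokenize(sent))
# ['That', 'Ġis', 'Ġimpossible', 'Ġ.']

# Per-token BPE after str.split(), as in the preprocessing described above
print([tok.tokenize(w) for w in sent.split()])
# [['That'], ['is'], ['imp', 'ossible'], ['.']]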

CUDA out of memory error

Hello folks. I'm trying to fine-tune the XLNet model on some arbitrary data. The command I'm using is:

python3 train.py --train_set=boost_train_pre --dev_set=boost_dev_pre --model_dir=XLNET_FT --vocab_path data/output_vocabulary/ --pretrain_folder=/gector_models/ --pretrain=xlnet_0_gector --n_epoch=10 --transformer_model=xlnet --special_tokens_fix=0

However, when I do this, I run into a CUDA out-of-memory error. I'm using a GeForce RTX 2080 Ti. Is there something wrong with my parameters?

How did you decide which type of token-level transformations to include in the edit space?

Hi, thanks for the amazing paper and recording at BEA 15, ACL 2020!

I was wondering how you decided which types of token-level transformations to include in the edit space, for example the 1167 token-dependent APPEND and 3802 REPLACE transformations. Is it a dataset-driven process, e.g. determined by ranking the most frequent transformations in CoNLL-2014 and selecting the top 5000?

Many thanks in advance!
