errant's People

Contributors

chrisjbryant, maxbachmann, sam-writer

errant's Issues

spacy 1.9.0

Hi,
The pip install doesn't work, and neither does installing from source.
The problem is the outdated spaCy dependency: pip can't build wheels for it, and importing errant throws errors. Manually updating spaCy seems to solve it (I have yet to use errant extensively with this setup, so I might still find that it didn't).
python 3.7.3
spacy 2.2.4
gcc 8.3 if relevant

"AttributeError: 'English' object has no attribute 'tagger'" when running the "Quick Start" code from the API section of README.md

How can I eliminate the error "AttributeError: 'English' object has no attribute 'tagger'"? I tried several different spaCy data models, but none of them worked.

code:
import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)

U type when there is a correction

Using the most up-to-date GitHub clone, we have found that ERRANT output (e.g. on W&I, as attached) sometimes assigns the unnecessary (U:) type to an error even though the edit has a correction and should therefore be a replacement (R:) type.

See for example:
S The rich people will buy a car but the poor people always need to use a bus or taxi .
A 0 2|||U:DET|||Rich|||REQUIRED|||-NONE-|||0

ABCN.dev.gold.bea19.m2.txt

Clarification on Spacy 2.0

In the documentation you mention that spaCy 2.0 is less compatible with ERRANT. What is the nature of this incompatibility, and are there any pointers on what can be done to correct for it?

Handling Missing Annotations on certain sentence

I am not able to generate M2 files when annotations are missing for certain sentences for some of the annotators. Choosing orig==annotated has its own side effects. Am I missing something?

Edits missed for a Substitute -> Delete -> Substitute sequence

Hi,
I am running into the following issue with this source/target pair:

source: In the article mrom The the New York Times.
target: In the article from The New York Times.

The edit mrom -> from is missed by ERRANT. The output from ERRANT was:

["Orig: [4, 6, 'The the'], Cor: [4, 5, 'The'], Type: 'U:DET'"]

On digging a little, this seems to be an issue with all alignments of the following form:

Input: w1 w2 w3
Output: w4 w5
such that w3.lower() == w5.lower()

Alignment Sequence: S w1 -> w4, D w2 -> "", S w3 -> w5

Then the edit "w1" -> "w4" is missed, and "w2 w3" -> "w5" is generated by errant.en.merger.process_seq
Example:

source: "In thir the"
target: "On The"
Errant Output: ["Orig: [1, 3, 'Thir the'], Cor: [1, 2, 'The'], Type: 'U:NOUN'"]
# Missing In -> On
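For reference, here is a minimal sketch that reproduces the problem through the Python API (assuming errant.load('en') works in your environment); the expected edits are In -> On and thir -> The, but only the U:NOUN edit is returned:

import errant

# Minimal reproduction sketch for the S -> D -> S merging issue described above.
annotator = errant.load('en')
orig = annotator.parse('In thir the')
cor = annotator.parse('On The')
for e in annotator.annotate(orig, cor):
    # Expected: one edit for In -> On and one for thir the -> The,
    # but the first substitution is currently dropped by the merger.
    print(e.o_start, e.o_end, e.o_str, '->', e.c_start, e.c_end, e.c_str, e.type)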

About speed up

Hi dear author, ERRANT is an excellent tool, and I'm very happy to see that the character-level cost in the sentence alignment function is now computed with the much faster python-Levenshtein library instead of Python's native difflib.SequenceMatcher, which makes ERRANT 3x faster.
I would like to know whether there are other potential optimisations that could increase the speed.
Can you give me some clues? Thank you very much!

spaCy >= 2.0 support

Hi Chris, thanks for the big 2.0 updates!

This is regarding the following section of the README

Note: ERRANT does not support spaCy 2 at this time. spaCy 2 POS tags are slightly different from spaCy 1 POS tags and so ERRANT rules, which were designed for spaCy 1, may not always work with spaCy 2.

Since Python can't handle having multiple versions of a given library in a single project, and we need to use features that were introduced after spaCy 2.0, we currently have to keep ERRANT isolated in a separate service that we talk to over HTTP. This is not ideal. Since ERRANT now supports passing in a spaCy nlp object, it seems like adding support for spaCy >= 2.0 would not be too difficult.

Specifically, I think we could check nlp._meta['spacy_version']. If the spaCy version is less than 2.0, nlp._meta doesn't exist; at 2.0 and above, it gives us the exact spaCy version. For the current purpose, just testing is_spacy_2_or_above = bool(getattr(nlp, "_meta", False)) should be enough. Then the quickest fix would be to map the 2.0 tags to 1.9 tags if is_spacy_2_or_above.
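A minimal sketch of that check (hypothetical helper names, not part of ERRANT):

def is_spacy_2_or_above(nlp):
    # spaCy 1.x Language objects have no _meta attribute, while 2.x+ store
    # model metadata there, so its mere presence answers the yes/no question.
    return bool(getattr(nlp, "_meta", False))

def spacy_version(nlp):
    # Only meaningful on 2.x+, where _meta includes 'spacy_version'; returns None on 1.x.
    meta = getattr(nlp, "_meta", None)
    return meta.get("spacy_version") if meta else None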

Is this acceptable? If not, is there some other path to supporting spacy 2.0+? Thank you!

EDIT: we are happy to work on this, we'd just like to find an approach that you would approve.

Not comparing the actual correction tokens between hypothesis and reference edits in compare_m2.py

  • In compare_m2.py, the edits for a coder obtained from the extract_edits() function are in the form of (start,end):category.

  • While comparing the extracted edits for the hypothesis and gold corrections in the compareEdits() function here:

    if h_edit in ref_edits.keys():

    and in the lines below:

    # On occasion, multiple tokens at same span.
    for h_cat in ref_edits[h_edit]: # Use ref dict for TP
        tp += 1
        # Each dict value [TP, FP, FN]
        if h_cat in cat_dict.keys():
            cat_dict[h_cat][0] += 1
        else:
            cat_dict[h_cat] = [1, 0, 0]
  • The edits are first being compared based on their (start,end) and then they are checked to see whether their error categories match.
  • If the (start,end) span and the error category of a hypothesis edit and a reference edit are equal, the edit is counted as a true positive.
  • Consider the case below:
    Source sentence: With the risk of being genetically disorder , many individuals have done the decision to undergo genetic testing .
    Hypothesis sentence: With the risk of being genetically disordered , many individuals have done the decision to undergo genetic testing .
    Gold correction: With the risk of having genetic disorders , many individuals have made the decision to undergo genetic testing .
  • In this case, the hypothesis edit is (6,7):R:NOUN:NUM and the reference edit is also (6,7):R:NOUN:NUM. Their (start,end) spans and error categories are the same, so the edit is counted as a true positive.
  • As far as I understand, since we are not comparing the actual correction tokens ('disordered' vs 'disorders'), doesn't this inflate the number of true positives? Is there any reasoning behind comparing only the (start,end) and error category of the edits that I am missing?
  • Would it be better if the corrected tokens in the hypothesis edit and the reference edit were also compared before counting a true positive? (A sketch of this is given after this list.)
    Thanks.
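A minimal sketch of that stricter matching (illustrative only, not compare_m2.py's actual code), where edits are keyed by (start, end, correction) so that (6, 7, 'disordered') and (6, 7, 'disorders') no longer match even though both are R:NOUN:NUM:

def count_true_positives(hyp_edits, ref_edits):
    # hyp_edits / ref_edits: dicts mapping (start, end, cor_str) -> [categories]
    tp = 0
    for h_edit in hyp_edits:
        if h_edit in ref_edits:
            # Keep the existing behaviour of counting one TP per reference category.
            tp += len(ref_edits[h_edit])
    return tp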

Running setup.py install for murmurhash ... error

Hi Chris,

Thanks for your updating new packages!

This is regarding an install error: when I install your package, both from pip and from source, it gives me the following error messages:

Installing collected packages: numpy, murmurhash, cymem, preshed, wrapt, tqdm, toolz, cytoolz, plac, six, dill, termcolor, pathlib, thinc, pip, ujson, idna, urllib3, chardet, certifi, requests, regex, webencodings, html5lib, wcwidth, ftfy, spacy, nltk, python-Levenshtein, errant
Running setup.py install for murmurhash ... error
ERROR: Command errored out with exit status 1:
command: /Users/helen/errant/errant/errant_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"'; __file__='"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-record-ysbyt_yn/install-record.txt --single-version-externally-managed --compile --install-headers /Users/helen/errant/errant/errant_env/include/site/python3.6/murmurhash
cwd: /private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/
Complete output (36 lines):
running install
running build
running build_py
creating build
creating build/lib.macosx-10.7-x86_64-3.6
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/about.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash/tests
copying murmurhash/tests/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/tests
copying murmurhash/tests/test_import.py -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/tests
copying murmurhash/mrmr.pyx -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/__init__.pxd -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
copying murmurhash/mrmr.pxd -> build/lib.macosx-10.7-x86_64-3.6/murmurhash
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash/include
creating build/lib.macosx-10.7-x86_64-3.6/murmurhash/include/murmurhash
copying murmurhash/include/murmurhash/MurmurHash2.h -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/include/murmurhash
copying murmurhash/include/murmurhash/MurmurHash3.h -> build/lib.macosx-10.7-x86_64-3.6/murmurhash/include/murmurhash
running build_ext
building 'murmurhash.mrmr' extension
creating build/temp.macosx-10.7-x86_64-3.6
creating build/temp.macosx-10.7-x86_64-3.6/murmurhash
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include/python3.6m -I/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/murmurhash/include -I/Users/helen/errant/errant/errant_env/include -I/Users/helen/anaconda3/include/python3.6m -c murmurhash/mrmr.cpp -o build/temp.macosx-10.7-x86_64-3.6/murmurhash/mrmr.o -O3 -Wno-strict-prototypes -Wno-unused-function
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include/python3.6m -I/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/murmurhash/include -I/Users/helen/errant/errant/errant_env/include -I/Users/helen/anaconda3/include/python3.6m -c murmurhash/MurmurHash2.cpp -o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash2.o -O3 -Wno-strict-prototypes -Wno-unused-function
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include -arch x86_64 -I/Users/helen/anaconda3/include/python3.6m -I/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/murmurhash/include -I/Users/helen/errant/errant/errant_env/include -I/Users/helen/anaconda3/include/python3.6m -c murmurhash/MurmurHash3.cpp -o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash3.o -O3 -Wno-strict-prototypes -Wno-unused-function
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
g++ -bundle -undefined dynamic_lookup -L/Users/helen/anaconda3/lib -arch x86_64 -L/Users/helen/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/murmurhash/mrmr.o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash2.o build/temp.macosx-10.7-x86_64-3.6/murmurhash/MurmurHash3.o -o build/lib.macosx-10.7-x86_64-3.6/murmurhash/mrmr.cpython-36m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /Users/helen/errant/errant/errant_env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"'; __file__='"'"'/private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-install-972fo3pe/murmurhash/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/1_/s5yrcv056kq02m5_6v6pv23c0000gn/T/pip-record-ysbyt_yn/install-record.txt --single-version-externally-managed --compile --install-headers /Users/helen/errant/errant/errant_env/include/site/python3.6/murmurhash
Check the logs for full command output.

I searched for the error on Google and tried the suggested solutions, such as updating Xcode and reinstalling the command line tools, but it didn't work. I wonder if you could give me some advice? Thanks in advance for your help!

Questions about evaluating duplicate corrections

Hi, I have a question about duplicate corrections.

errant_parallel sometimes produces duplicate corrections, e.g.:

echo "If you want to actally know somebody you can spend the whole day with that person or place but if you do not , you do not even speak to that person or even go there . " > orig.txt
echo "If you want to actually know somebody , you can spend the whole day with that person or place , but if you do not , you do not even speak to that person or even go there . " > sys.txt
echo "If you want to actually get to know someone , or something , you can spend the whole day with that person , or place , and if you do not , you would n't have reason to even speak to that person , or even go there . " > ref.txt
errant_parallel -orig orig.txt -cor sys.txt -out hyp.m2
errant_parallel -orig orig.txt -cor ref.txt -out ref.m2
errant_compare -hyp hyp.m2 -ref ref.m2

(The above is line 612 of JFLEG-dev. The reference is the first annotation.)
In the above case, errant_compare shows

=========== Span-Based Correction ============
TP      FP      FN      Prec    Rec     F0.5
4       0       9       1.0     0.3077  0.6897
==============================================

However, hyp.m2 has only three corrections, so TP=4 is strange.

  • hyp.m2
S If you want to actally know somebody you can spend the whole day with that person or place but if you do not , you do not even speak to that person or even go there .
A 4 5|||R:SPELL|||actually|||REQUIRED|||-NONE-|||0
A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 18 18|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0

The reason for this is duplicate corrections in the reference:
ref.m2 contains the line A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0 twice.
(I don't know why such a duplication appears.)

  • ref.m2
S If you want to actally know somebody you can spend the whole day with that person or place but if you do not , you do not even speak to that person or even go there .
A 4 5|||R:SPELL|||actually|||REQUIRED|||-NONE-|||0
A 5 5|||M:VERB|||get to|||REQUIRED|||-NONE-|||0
A 6 7|||R:NOUN|||someone|||REQUIRED|||-NONE-|||0
A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 7 7|||M:CONJ|||or|||REQUIRED|||-NONE-|||0
A 7 7|||M:NOUN|||something|||REQUIRED|||-NONE-|||0
A 7 7|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 16 16|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 18 18|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 18 19|||R:CONJ|||and|||REQUIRED|||-NONE-|||0
A 25 27|||R:OTHER|||would n't have|||REQUIRED|||-NONE-|||0
A 27 27|||M:OTHER|||reason to|||REQUIRED|||-NONE-|||0
A 32 32|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0

During errant_compare, coder_dict[coder][(7, 7, ',')] has multiple values: ['M:PUNCT', 'M:PUNCT'].
This adds two true positives to the evaluation score, because ref_edits[h_edit] has two values (here).

Is this expected?
Personally, I do not think it is desirable for the number of TPs to exceed the number of edits in a hypothesis.
Possible solutions would be to:

  • Prevent errant.Annotator.annotate() from outputting duplicate corrections (a de-duplication sketch is given below).
  • Ensure that the coder_dict variable in errant.commands.compare_m2.py only has a single value per edit (it is currently a list).
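A minimal sketch of the first option (a hypothetical helper, not part of ERRANT), which drops exact duplicate edits after annotation:

def dedup_edits(edits):
    # Remove edits that share the same span, correction string and type.
    seen = set()
    unique = []
    for e in edits:
        key = (e.o_start, e.o_end, e.c_str, e.type)
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

# e.g. edits = dedup_edits(annotator.annotate(orig, cor))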

Thank you for developing ERRANT!
(As an aside, I am building an API-based errant_compare and noticed this problem because my results did not match the official ones.)

Add a test to check spacy and errant work properly

Hi!
I have found a bug which is pretty difficult to replicate: in certain cases (especially if you re-install spaCy after installing errant), errant will "apparently" work, giving feedback on the sentences it is given... but in reality it won't, and it fails to recognise most of the mistakes.

Would it be possible to add a simple test with a few basic sentence pairs, e.g. "He go home." -> "He goes home.", on which errant is evaluated, so that after installation one can check whether spaCy is working?

I know that, especially with spaCy 2.x, the results won't always be the same... but I still think this kind of feedback could be useful to check that errant is working "reasonably" well together with spaCy.

If that is okay, I can make a PR with a new folder and file tests/test_errant_base.py, with 10-20 simple sentence pairs, where I check how many of the mistakes are correctly recognised by errant.
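A minimal sketch of what such a test could look like (the file name, sentence pairs, expected categories and pass threshold are all illustrative assumptions, not an agreed design):

# tests/test_errant_base.py (sketch)
import errant

PAIRS = [
    ("He go home .", "He goes home .", "R:VERB:SVA"),
    # ... 10-20 more simple pairs would go here ...
]

def test_errant_base():
    annotator = errant.load("en")
    hits = 0
    for orig, cor, expected_type in PAIRS:
        edits = annotator.annotate(annotator.parse(orig), annotator.parse(cor))
        if any(e.type == expected_type for e in edits):
            hits += 1
    # Allow some slack, since tagging differs between spaCy versions.
    assert hits >= 0.8 * len(PAIRS)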

Parallel_to_m2 is not working

Hello Chris, I'm trying to convert my parallel dataset into M2 format, so in a Colab cell I ran:

import errant
!errant_parallel -orig D5-src.txt -cor D5-trg.txt -out /out_m2.m2

and the output I got is:
Loading resources... Processing parallel files...

Am I doing something wrong?
Note: I am using Google Colab and my dataset is in Arabic.

Simulate Errors

Can errant simulate errors in a sentence instead of correcting them? I am trying to build a dataset with ground-truth original transcripts/text and corresponding errorful text.

Expose errant_compare functionality via the API

It would be great if the functionality in the errant_compare command were available for invocation as an API call, so it could be used for things like early stopping when training GEC models.

I've looked through the compare_m2 file, and it doesn't look like it would be all that much work to refactor things so that everything works the same way, but it also becomes possible to import a function that returns a dict with the computed scores instead of printing them. If this is the kind of thing you'd be willing to accept a PR for, I'd be happy to give it a go myself sometime in the next couple of weeks. If not, it would be super awesome if you were able to get to it at some point.
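In the meantime, here is a minimal sketch of a workaround that shells out to the errant_compare CLI and parses the printed span-based table into a dict (fragile by design, since it depends on the current output format):

import subprocess

def errant_compare_scores(hyp_m2, ref_m2):
    # Run the errant_compare CLI and parse its printed span-based table,
    # i.e. the 'TP FP FN Prec Rec F0.5' header followed by a line of numbers.
    out = subprocess.run(
        ["errant_compare", "-hyp", hyp_m2, "-ref", ref_m2],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = [line.strip() for line in out.splitlines() if line.strip()]
    header = next(i for i, line in enumerate(lines) if line.startswith("TP"))
    values = [float(v) for v in lines[header + 1].split()]
    return dict(zip(["tp", "fp", "fn", "precision", "recall", "f0.5"], values))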

API Quickstart script not working - Please update with fix provided

Hi @chrisjbryant , the API quickstart script below is not working.

import errant

annotator = errant.load('en')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)

Error:

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

After python3 -m spacy download en_core_web_sm, it says

Successfully installed en_core_web_sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
You do not have sufficient privilege to perform this operation.
✘ Couldn't link model to 'en'
Creating a symlink in spacy/data failed. Make sure you have the required
permissions and try re-running the command as admin, or use a virtualenv. You
can still import the model as a module and call its load() method, or create the
symlink manually.
C:\Users\xxx\anaconda3\envs\chat-langchain\lib\site-packages\en_core_web_sm
--> C:\Users\xxx\anaconda3\envs\chat-langchain\lib\site-packages\spacy\data\en
⚠ Download successful but linking failed
Creating a shortcut link for 'en' didn't work (maybe you don't have admin
permissions?), but you can still load the model via its full package name: nlp =
spacy.load('en_core_web_sm')

I had to update the code as below before it worked.

import errant
import spacy
import spacy.cli 

# spacy.cli.download("en_core_web_md")
nlp = spacy.load('en_core_web_md')
annotator = errant.load('en', nlp)
# annotator = errant.load('en_core_web_md')
orig = annotator.parse('This are gramamtical sentence .')
cor = annotator.parse('This is a grammatical sentence .')
edits = annotator.annotate(orig, cor)
for e in edits:
    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)

Implementation issue

Dear Chris :)

I have applied ERRANT to the FCE test set and got unexpected results:

=========== Span-Based Correction ============
TP      FP      FN      Prec    Rec     F0.5
2       15503   4547    0.0001  0.0004  0.0002
=======================================

I followed the implementation exactly as described in the documentation, as follows:

  1. I applied errant_parallel as errant_parallel -orig m2Scripts/orig_sentes.txt -cor m2Scripts/corec_sentes.txt -out m2Scripts/output.m2, with the files as in GitHub. Actually, I'm confused about the format of the parallel corrected text file.

  2. For errant_m2, I ran errant_m2 -auto m2Scripts/output.m2 -out m2Scripts/auto_output.

  3. The last step was errant_compare: errant_compare -hyp m2Scripts/auto_output -ref m2Scripts/fce.test.m2. The results were:

=========== Span-Based Correction ============
TP      FP      FN      Prec    Rec     F0.5
2       15503   4547    0.0001  0.0004  0.0002
=======================================

Could you please help me fix this issue?

Kind regards
Aiman Solyman

Errant parse method not working

I'm facing a problem with errant's parse method. I know it uses the spaCy tagger, and I have the English spaCy model ready, so I'm not sure what the problem might be.
Thanks in advance
(screenshot attached: Capture)

Install error with bdist_wheel

Some default Python installations are missing the wheel package or ship an older version of it.

This raises an error when installing errant: error: invalid command 'bdist_wheel'

Although ERRANT was actually installed successfully and you can ignore this error, the fix is simply to install/upgrade the wheel package in your python venv before you install errant:
pip3 install -U wheel

SequenceMatcher is slower than python-Levenshtein

I was adapting this code for our private use and noticed that using the python-Levenshtein package is about 100x faster than SequenceMatcher's ratio.

Levenshtein.ratio(A, B) gets you the same result.
I understand that this library is more for offline benchmarking use, but it doesn't hurt to be faster 😉 .
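A quick way to see the difference for yourself (a rough micro-benchmark sketch; absolute numbers will vary by machine and string length):

import timeit
from difflib import SequenceMatcher

import Levenshtein  # from the python-Levenshtein package

a = "This are gramamtical sentence ."
b = "This is a grammatical sentence ."

t_difflib = timeit.timeit(lambda: SequenceMatcher(None, a, b).ratio(), number=10000)
t_lev = timeit.timeit(lambda: Levenshtein.ratio(a, b), number=10000)
print(f"difflib: {t_difflib:.3f}s   python-Levenshtein: {t_lev:.3f}s")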

By the way, can you explain the rationale for the custom substitution cost function? Is there an example of how using it changes the alignment path taken?

Is there any way to further improve the method of summarizing error types?

There are some sentences for which I noticed that the assigned error type is not accurate enough.

I noticed that ERRANT uses the sm-sized spaCy model, and I intended to replace it with a larger model, but the improvement does not seem significant. Is there any other way to improve its accuracy?

By the way, thank you for providing this tool; it is very useful!

Licensing concerns

Hi, thanks for the awesome library.

Is there any way to remove python-Levenshtein from the dependencies? It's licensed under GPLv2, which is not compatible with errant's MIT license.

Evaluating Neural Network model using Errant

Dear @ALL

I have had a look through your documentation, but I'm still confused about how to evaluate my neural network GEC model. As I understand it, I have to run the model over the test set (i.e. correct it), then build a new M2 file using the errant_parallel command. The last step is to use errant_compare with the span-based correction scores to get the F0.5 score.

Is this correct?
What is the optimal way to evaluate my NN model using Errant?

Regards,

Wrong format for incorr_sentences.txt

I used errant to preprocess the Oscar Tamil Dataset.

The source m2 file looks like this.

S முன்னாள் ஜனாதிபதி மஹிந்த ராஜபக்ஷவினால் முன்னெடுக்கப்பட்ட போராட்டம் உட்பட வேலைநிறுத்த போராட்டங்களுக்கான நிதி அனுசரணையை சீனாவே வழங்கி நாட்டையும் அரசாங்கத்தையும் நெருக்கடிக்குள்ளாக்க முயல்கிறது என சமூக நலன்புரி பிரதி அமைச்சர் ரஞ்சன் ராமநாயக்க தெரிவித்தார்.
A 0 1|||R:OTHER|||முன்னாழ்|||REQUIRED|||-NONE-|||0
A 2 3|||R:OTHER|||மஹிண்த|||REQUIRED|||-NONE-|||0
A 7 8|||R:OTHER|||வேளைணிறுத்த|||REQUIRED|||-NONE-|||0
A 8 9|||R:NOUN|||போராட்டங்களுக்காண|||REQUIRED|||-NONE-|||0
A 9 10|||R:OTHER|||ணிதி|||REQUIRED|||-NONE-|||0
A 10 11|||R:NOUN|||அநுசரநையை|||REQUIRED|||-NONE-|||0
A 15 16|||R:NOUN|||ணெருக்கடிக்குல்ளாக்க|||REQUIRED|||-NONE-|||0
A 19 20|||R:OTHER|||ணலந்புரி|||REQUIRED|||-NONE-|||0
A 23 24|||R:NOUN|||ராமனாயக்க|||REQUIRED|||-NONE-|||0
A 24 25|||R:OTHER|||தெரிவித்தார்|||REQUIRED|||-NONE-|||0

The corresponding generated section of corr_sentences.txt looks like this.

S முன்னாள் ஜனாதிபதி மஹிந்த ராஜபக்ஷவினால் முன்னெடுக்கப்பட்ட போராட்டம் உட்பட வேலைநிறுத்த போராட்டங்களுக்கான நிதி அனுசரணையை சீனாவே வழங்கி நாட்டையும் அரசாங்கத்தையும் நெருக்கடிக்குள்ளாக்க முயல்கிறது என சமூக நலன்புரி பிரதி அமைச்சர் ரஞ்சன் ராமநாயக்க தெரிவித்தார்.
A 0 1|||R:OTHER|||முன்னாழ்|||REQUIRED|||-NONE-|||0
A 2 3|||R:OTHER|||மஹிண்த|||REQUIRED|||-NONE-|||0
A 7 8|||R:OTHER|||வேளைணிறுத்த|||REQUIRED|||-NONE-|||0
A 8 9|||R:NOUN|||போராட்டங்களுக்காண|||REQUIRED|||-NONE-|||0
A 9 10|||R:OTHER|||ணிதி|||REQUIRED|||-NONE-|||0
A 10 11|||R:NOUN|||அநுசரநையை|||REQUIRED|||-NONE-|||0
A 15 16|||R:NOUN|||ணெருக்கடிக்குல்ளாக்க|||REQUIRED|||-NONE-|||0
A 19 20|||R:OTHER|||ணலந்புரி|||REQUIRED|||-NONE-|||0
A 23 24|||R:NOUN|||ராமனாயக்க|||REQUIRED|||-NONE-|||0
A 24 25|||R:OTHER|||தெரிவித்தார்|||REQUIRED|||-NONE-|||0

The corresponding section of incorr_sentences.txt looks like this.

S முன்னாள் ஜனாதிபதி மஹிந்த ராஜபக்ஷவினால் முன்னெடுக்கப்பட்ட போராட்டம் உட்பட வேலைநிறுத்த போராட்டங்களுக்கான நிதி அனுசரணையை சீனாவே வழங்கி நாட்டையும் அரசாங்கத்தையும் நெருக்கடிக்குள்ளாக்க முயல்கிறது என சமூக நலன்புரி பிரதி அமைச்சர் ரஞ்சன் ராமநாயக்க தெரிவித்தார்.
0 1|||R:OTHER|||முன்னாழ்|||REQUIRED|||-NONE-|||0
A work 3|||R:OTHER|||மஹிண்த|||REQUIRED|||-NONE-|||0
7 8|||R:OTHER|||வேளைணிறுத்த|||REQUIRED|||-NONE-|||0
Badly do my 9|||R:NOUN|||போராட்டங்களுக்காண|||REQUIRED|||-NONE-|||0
9 10|||R:OTHER|||ணிதி|||REQUIRED|||-NONE-|||0
A English 10 11|||R:NOUN|||அநுசரநையை|||REQUIRED|||-NONE-|||0
up relatively 15 16|||R:NOUN|||ணெருக்கடிக்குல்ளாக்க|||REQUIRED|||-NONE-|||0
19 20|||R:OTHER|||ணலந்புரி|||REQUIRED|||-NONE-|||0
Change 23 24|||R:NOUN|||ராமனாயக்க|||REQUIRED|||-NONE-|||0
24 25|||R:OTHER|||தெரிவித்தார்|||REQUIRED|||-NONE-|||0

The first correction line doesn't start with A. The word "work" appears randomly after A in the second correction line, even though the corresponding sentence in the source file does not contain the word "work". Other lines show similar patterns.

Merge Casing Issue

Hello Chris,
I am working on Errant for Czech and found the following line problematic:

if start == 0 and (len(o) == 1 and c[0].text[0].isupper()) or \

The issue is that the condition can be True even if start != 0: since and binds more tightly than or, the whole expression is True whenever (len(c) == 1 and o[0].text[0].isupper()) evaluates to True.

In that case, the return on lines 66-67 will omit "the preceding part of combo".
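The operator precedence problem can be seen in isolation (a small illustration, not ERRANT code):

# `and` binds more tightly than `or`, so `A and B or C` parses as `(A and B) or C`.
start = 5
original_condition = start == 0 and False or True   # True, despite start != 0
fixed_condition = start == 0 and (False or True)     # False, as intended
print(original_condition, fixed_condition)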

I suppose that the fix is to enforce start == 0 by adding a pair of brackets:

if start == 0 and ((len(o) == 1 and c[0].text[0].isupper()) or \
                    (len(c) == 1 and o[0].text[0].isupper())):

OSError: [E053] Could not read meta.json from en\meta.json

Traceback (most recent call last):
  File "C:/Users/ITJaylon/Desktop/errant/errant/test.py", line 3, in <module>
    annotator = errant.load('en')
  File "C:\Users\ITJaylon\Desktop\errant\errant\__init__.py", line 16, in load
    nlp = nlp or spacy.load(lang, disable=["ner"])
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\util.py", line 172, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\util.py", line 198, in load_model_from_path
    meta = get_model_meta(model_path)
  File "E:\Anaconda\envs\errant_env\lib\site-packages\spacy\util.py", line 253, in get_model_meta
    raise IOError(Errors.E053.format(path=meta_path))
OSError: [E053] Could not read meta.json from en\meta.json

Process finished with exit code 1

spacy tokenizer speed

There is one line of code in the parse function:
text = self.nlp.tokenizer.tokens_from_list(text.split())

Why not just use nlp.tokenizer(text) directly? That would really accelerate the tokenizing process.
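For what it's worth, the two calls are not interchangeable; presumably tokens_from_list is used to preserve the given whitespace tokenization. A small sketch of the difference (assuming a spaCy 1.x/2.x pipeline, where tokens_from_list still exists):

import spacy

nlp = spacy.load('en_core_web_sm')
text = 'M2 files are pre-tokenized , e.g. like this .'

# What ERRANT does: trust the existing whitespace tokenization.
doc_pretok = nlp.tokenizer.tokens_from_list(text.split())

# The faster alternative: re-tokenize the raw string, which may split tokens
# differently (e.g. around hyphens or punctuation) and so change edit spans.
doc_retok = nlp.tokenizer(text)

print([t.text for t in doc_pretok])
print([t.text for t in doc_retok])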

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2490: ordinal not in range(128)

Hi 😊

I've encountered a problem while using errant:

I think there is a conflict between the Python and spaCy versions, and I couldn't fix it:

Python 3.6.13 |Anaconda, Inc.| (default, Jun  4 2021, 14:25:59)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import errant
>>> errant.load('en')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/envs/errant200/lib/python3.6/site-packages/errant/__init__.py", line 19, in load
    classifier = import_module("errant.%s.classifier" % lang)
  File "/root/anaconda3/envs/errant200/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/anaconda3/envs/errant200/lib/python3.6/site-packages/errant/en/classifier.py", line 40, in <module>

I use

  • Ubuntu 20.04
  • python 3.6.13
  • spacy 1.9.0

Thank you
