Comments (10)
Hi!
What is the mismatch in the number of sentences?
Can you provide the same table with the number of sentences for all datasets?
from gector.
For the lang8 learner corpus (ver. en 1.0), I got 1037561 sentences, which is 90217 more than yours.
By the way, for PIE, is the dev set also split off from the 9M sentences, or do you use an authentic corpus's dev set? Do you follow the 98/2 rule to split all datasets?
from gector.
| | PIE | lang8_en_1.0 | NUCLE_3.3 | CLC-FCE | W&I+locness |
|---|---|---|---|---|---|
| sentences | 9000000 | 1037561 | 56958 | 34490 | 34304 |
| errorful sents | 100% | 48.1% | 38.0% | 62.4% | 66.2% |
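For anyone who wants to recompute these two rows on their own copy of a corpus, here is a minimal sketch. It assumes the data is available as two aligned lists of source and target sentences, and that "errorful" means the source differs from its target; the function name `corpus_stats` is made up for illustration and is not part of the gector repository.

```python
def corpus_stats(sources, targets):
    """Return (sentence count, share of pairs whose source differs from its target)."""
    if len(sources) != len(targets):
        raise ValueError("source and target must have the same number of lines")
    total = len(sources)
    # Assumption: a pair counts as errorful when the two sides are not identical.
    errorful = sum(1 for s, t in zip(sources, targets) if s.strip() != t.strip())
    share = errorful / total if total else 0.0
    return total, share


# Toy data, not taken from any of the corpora in the table.
src = ["He go to school .", "I like tea .", "She have a cat ."]
tgt = ["He goes to school .", "I like tea .", "She has a cat ."]
total, share = corpus_stats(src, tgt)
print(f"sentences: {total}, errorful sents: {share:.1%}")
```

Differences in deduplication or in how empty lines are handled could also shift the counts, so the numbers may not match exactly even on the same dump.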
from gector.
Hi. I am a young researcher. As part of my internship program, I have to improve on the latest GEC approaches, one of which is yours. I have read your paper and found it very interesting, but I am having some difficulty working out how to proceed with improving it. Could you please guide me? I have to focus mainly on preposition errors.
from gector.
@Jason3900 I see. So the main inconsistency is the number of sentences in the lang8 dataset. We used a dump which was probably requested a couple of years ago (so it might be slightly smaller). I've requested a new one. Once I get the data I will recheck the size. But I think more data is even better in this case, so it should not affect the final results much.
Regarding the % of errorful sentences for W&I+locness - you are right, I've rechecked and it's equal to 66.2%.
Thank you for these observations!
from gector.
Hi, @shikha10799
Great to hear about your interest.
For example, you can focus on correcting only prepositions with the GECToR model. In order to do this, you need to refactor the preprocessing script and remove all tags related to other errors.
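A rough sketch of what such tag filtering might look like, assuming the tags follow the `$KEEP` / `$DELETE` / `$APPEND_x` / `$REPLACE_x` naming from the paper and are available as one tag per source token. The function name `keep_preposition_edits`, the short `PREPOSITIONS` set, and the list-based layout are illustrative assumptions, not the repository's actual preprocessing code.

```python
# Illustrative, incomplete preposition list (an assumption for this sketch).
PREPOSITIONS = {"in", "on", "at", "to", "for", "of", "with", "by", "from", "about"}


def keep_preposition_edits(tokens, tags, prepositions=PREPOSITIONS):
    """Replace every tag that does not insert, substitute, or delete a preposition with $KEEP."""
    filtered = []
    for token, tag in zip(tokens, tags):
        if tag.startswith(("$APPEND_", "$REPLACE_")):
            # The word being inserted or substituted follows the first underscore.
            word = tag.split("_", 1)[1].lower()
            keep = word in prepositions
        elif tag == "$DELETE":
            # $DELETE carries no word, so look at the source token instead.
            keep = token.lower() in prepositions
        else:
            keep = False
        filtered.append(tag if keep else "$KEEP")
    return filtered


tokens = ["He", "lives", "on", "Paris", "yesterday"]
tags = ["$KEEP", "$REPLACE_lived", "$REPLACE_in", "$KEEP", "$DELETE"]
print(keep_preposition_edits(tokens, tags))
```

Note that neutralising the other tags leaves those errors uncorrected in the training targets, which is the intended effect if the model should only learn preposition edits.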
from gector.
Thanks for the reply! I got it.
And what about the dataset split? Does training stage I use 2% of the PIE synthetic data as the dev set?
from gector.
@Jason3900 Yes, we used a 98/2 split for all stages. So for the PIE dataset, we just split the 9M sentences 98/2 as well.
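For illustration, a 98/2 split of sentence pairs could be sketched as below. Whether the original split was shuffled, and with which seed, is not stated in this thread, so the shuffling, the `seed` parameter, and the function name `split_98_2` are assumptions.

```python
import random


def split_98_2(pairs, dev_fraction=0.02, seed=0):
    """Shuffle (source, target) pairs and return (train, dev) lists."""
    pairs = list(pairs)
    # Assumption: a seeded shuffle, so that the split is reproducible.
    random.Random(seed).shuffle(pairs)
    n_dev = int(round(len(pairs) * dev_fraction))
    return pairs[n_dev:], pairs[:n_dev]


# Toy data: 100 dummy pairs.
pairs = [(f"src {i}", f"tgt {i}") for i in range(100)]
train, dev = split_98_2(pairs)
print(len(train), len(dev))
```

With 9M pairs, a 2% dev set would hold about 180,000 pairs.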
from gector.
Thanks a lot!
from gector.
Oh! It was really nice to get a reply from you. Thanks for your suggestions.
from gector.