Comments (10)

komelianchuk avatar komelianchuk commented on May 20, 2024

Hi!
What is the mismatch in the number of sentences?
Can you provide the same table with the number of sentences for all datasets?

from gector.

Jason3900 avatar Jason3900 commented on May 20, 2024

For the lang8 learner corpus (ver. en 1.0), I got 1037561 sentences, which is 90217 more than yours.
By the way, for PIE, is the dev dataset also split from the 9M sentences, or did you use the dev sets of the authentic corpora? Do you follow the 98/2 rule to split all datasets?

Jason3900 avatar Jason3900 commented on May 20, 2024

                PIE      lang8_en_1.0  NUCLE_3.3  CLC-FCE  W&I+locness
sentences       9000000  1037561       56958      34490    34304
errorful sents  100%     48.1%         38.0%      62.4%    66.2%
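
For anyone who wants to reproduce the "errorful sents" row on their own copy of the data, here is a minimal sketch. It assumes that a sentence counts as errorful when its source differs from its correction; that definition and the function name `errorful_share` are my assumptions, not something confirmed in this thread.

```python
def errorful_share(pairs):
    """Percentage of (source, target) pairs where the target differs from the source.

    Assumption: "errorful" means source != target after stripping whitespace.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    changed = sum(1 for src, tgt in pairs if src.strip() != tgt.strip())
    return 100.0 * changed / len(pairs)


# Toy data, not taken from any of the corpora above.
toy_pairs = [
    ("He go to school .", "He goes to school ."),
    ("I like tea .", "I like tea ."),
    ("She is good in math .", "She is good at math ."),
    ("We are here .", "We are here ."),
]
toy_share = errorful_share(toy_pairs)
```

On the toy data, two of the four pairs differ, so the share is 50%.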

shikha10799 avatar shikha10799 commented on May 20, 2024

Hi. I am a young researcher. As part of my internship program, I have to make improvements to recent GEC approaches (one of which is yours). I have read your paper and it seems very interesting. However, I am having some difficulty working out how to proceed with improving it. Could you please guide me? I mainly have to focus on preposition errors.

komelianchuk avatar komelianchuk commented on May 20, 2024

@Jason3900 I see. So the main inconsistency is the number of sentences in the lang8 dataset. We used a dump that was probably requested a couple of years ago, so it might be slightly smaller. I've requested a new one. Once I get the data, I will recheck the size. But I think more data is even better in this case, so it should not affect the final results much.

Regarding the % of errorful sentences for W&I+locness: you are right, I've rechecked and it's equal to 66.2%.
Thank you for these observations!

komelianchuk avatar komelianchuk commented on May 20, 2024

Hi, @shikha10799
Great to hear about your interest.
For example, you can focus on correcting only prepositions with the GECToR model. To do this, you need to refactor the preprocessing script and remove all tags related to other errors.
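
As a rough illustration of the tag filtering idea, here is a minimal sketch, not the actual preprocessing script. It assumes the tagged data can be viewed as (token, tag) pairs with tags in the `$KEEP` / `$DELETE` / `$APPEND_x` / `$REPLACE_x` style; the preposition list and the function names are hypothetical.

```python
# Hypothetical, deliberately small preposition list; extend it for real use.
PREPOSITIONS = {"in", "on", "at", "to", "for", "of", "with", "by", "from", "about"}

KEEP = "$KEEP"


def is_preposition_edit(token, tag):
    """Return True if the tag describes an edit that involves a preposition."""
    if tag == "$DELETE":
        # Deleting the source token counts only if that token is a preposition.
        return token.lower() in PREPOSITIONS
    for prefix in ("$APPEND_", "$REPLACE_"):
        if tag.startswith(prefix):
            # Appending or substituting counts only if the new word is a preposition.
            return tag[len(prefix):].lower() in PREPOSITIONS
    return False


def keep_preposition_tags(tagged):
    """Replace every tag that is not a preposition edit with $KEEP."""
    return [
        (token, tag if is_preposition_edit(token, tag) else KEEP)
        for token, tag in tagged
    ]


example = [
    ("He", "$KEEP"),
    ("go", "$TRANSFORM_VERB_VB_VBZ"),
    ("in", "$REPLACE_to"),
    ("school", "$KEEP"),
    ("at", "$DELETE"),
    ("yesterday", "$APPEND_."),
]
filtered = keep_preposition_tags(example)
```

Note that after this filtering the tags no longer reproduce the full gold correction, only its preposition edits, so the model is trained to leave other errors untouched.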

Jason3900 avatar Jason3900 commented on May 20, 2024

Thanks for the reply! I got it.
And what about the dataset split? Does training stage I use 2% of the PIE synthetic data as the dev set?

komelianchuk avatar komelianchuk commented on May 20, 2024

@Jason3900 yes, we used a 98/2 split for all stages. So for the PIE dataset, we just split the 9M sentences 98/2 as well.
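
A 98/2 split like this can be sketched as follows. This is only an illustration: whether the data was shuffled before splitting, the seed, and the name `split_train_dev` are assumptions rather than details from this thread.

```python
import random


def split_train_dev(pairs, dev_fraction=0.02, seed=0):
    """Shuffle sentence pairs and hold out `dev_fraction` of them as a dev set.

    Assumptions: shuffling before the split and the fixed seed are
    illustrative choices, not confirmed details.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy data standing in for (source, target) sentence pairs.
toy_pairs = [(f"source {i}", f"target {i}") for i in range(1000)]
train, dev = split_train_dev(toy_pairs)
```

With 1000 toy pairs this gives 980 training pairs and 20 dev pairs. A fixed seed keeps the split reproducible across runs.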

Jason3900 avatar Jason3900 commented on May 20, 2024

Thanks a lot!

shikha10799 avatar shikha10799 commented on May 20, 2024

Oh! It was really nice to get a reply from you. Thanks for your suggestions.
