Hi, a problem occurred when I tried to re-implement your experiment. In your

Inconsistency of the amount of data (gector issue, CLOSED)

grammarly commented on May 20, 2024

Inconsistency of the amount of data

from gector.

Comments (10)

komelianchuk commented on May 20, 2024

Hi!
What is the mismatch in the number of sentences?
Can you provide the same table with the number of sentences for all datasets?

Jason3900 commented on May 20, 2024

> Hi!
> What is the mismatch in the number of sentences?
> Can you provide the same table with the number of sentences for all datasets?

For the lang8 learner corpus (ver. en 1.0), I got 1037561 sentences, which is 90217 more than yours.
By the way, for PIE, is the dev set also split off from the 9M sentences, or do you use the dev sets of the authentic corpora? And do you follow the 98/2 rule to split every dataset?

Jason3900 commented on May 20, 2024

> Hi!
> What is the mismatch in the number of sentences?
> Can you provide the same table with the number of sentences for all datasets?

|  | PIE | lang8_en_1.0 | NUCLE_3.3 | CLC-FCE | W&I+locness |
| --- | --- | --- | --- | --- | --- |
| Sentences | 9000000 | 1037561 | 56958 | 34490 | 34304 |
| Errorful sentences | 100% | 48.1% | 38.0% | 62.4% | 66.2% |
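
For anyone comparing their own counts against these numbers, here is a minimal sketch of how the two rows could be computed from a parallel corpus. It assumes aligned (source, target) sentence pairs and treats a pair as errorful when the two sides differ after whitespace normalisation. The function name `corpus_stats` and the toy data are hypothetical and not part of the gector repository, and the authors may have counted differently.

```python
from typing import Iterable, Tuple


def corpus_stats(pairs: Iterable[Tuple[str, str]]) -> Tuple[int, float]:
    """Return (number of sentence pairs, percentage of errorful pairs).

    Assumption: a pair is "errorful" when the source differs from the
    target after collapsing whitespace. Other definitions are possible.
    """
    total = 0
    errorful = 0
    for source, target in pairs:
        total += 1
        if " ".join(source.split()) != " ".join(target.split()):
            errorful += 1
    percent = 100.0 * errorful / total if total else 0.0
    return total, percent


# Toy data for illustration only; not taken from any real dataset.
toy_pairs = [
    ("She go to school .", "She goes to school ."),
    ("I like tea .", "I like tea ."),
    ("He is good in math .", "He is good at math ."),
    ("They are here .", "They are  here ."),  # differs only in spacing
]
total, percent = corpus_stats(toy_pairs)
print(total, round(percent, 1))
```

Differences in deduplication, empty-line handling, or sentence splitting can shift both numbers, so it is worth stating those choices when reporting counts.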

shikha10799 commented on May 20, 2024

Hi. I am a young researcher. As part of my internship program, I have to make some improvements to the latest GEC approaches, one of which is yours. I have read your paper and it seems very interesting, but I am having some difficulty working out how to proceed with improving it. Can you please guide me? I mainly have to focus on preposition errors.

komelianchuk commented on May 20, 2024

@Jason3900 I see. So the main inconsistency is the number of sentences in the lang8 dataset. We used a dump that was probably requested a couple of years ago (so it might be slightly smaller). I've requested a new one. Once I get the data I will recheck the size. But I think more data is even better in this case, so it should not affect the final results much.

Regarding the percentage of errorful sentences for W&I+locness: you are right, I've rechecked and it's equal to 66.2%.
Thank you for these observations!

komelianchuk commented on May 20, 2024

Hi, @shikha10799
Great to hear about your interest.
For example, you can focus on correcting only prepositions with the GECToR model. In order to do this, you need to refactor the preprocessing script and remove all tags related to other errors.
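
As a rough illustration of that idea, here is a minimal sketch that keeps only preposition-related edit tags and resets everything else to "keep". It assumes token-level tags in the style described in the GECToR paper (`$KEEP`, `$DELETE`, `$APPEND_x`, `$REPLACE_x`, plus `$TRANSFORM_...` tags). The preposition list and the helper names `is_preposition_edit` and `keep_only_preposition_tags` are hypothetical, and this is not the repository's actual preprocessing code.

```python
from typing import List, Tuple

# Assumed, deliberately small preposition list; extend as needed.
PREPOSITIONS = {
    "about", "at", "by", "for", "from", "in", "into", "of", "on", "to", "with",
}
KEEP = "$KEEP"  # assumed name of the "no edit" tag


def is_preposition_edit(token: str, tag: str) -> bool:
    """Decide whether a (token, tag) pair is an edit involving a preposition."""
    if tag == "$DELETE":
        # Deleting a token counts only if the deleted token is a preposition.
        return token.lower() in PREPOSITIONS
    for prefix in ("$APPEND_", "$REPLACE_"):
        if tag.startswith(prefix):
            # Appending or substituting counts if the new word is a preposition.
            return tag[len(prefix):].lower() in PREPOSITIONS
    return False  # $KEEP, $TRANSFORM_... and anything else


def keep_only_preposition_tags(
    tagged: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Replace every non-preposition edit tag with the keep tag."""
    return [
        (token, tag if is_preposition_edit(token, tag) else KEEP)
        for token, tag in tagged
    ]


# Toy example: a verb-form edit and a preposition edit in one sentence.
example = [
    ("She", "$KEEP"),
    ("go", "$TRANSFORM_VERB_VB_VBZ"),
    ("in", "$REPLACE_to"),
    ("school", "$KEEP"),
]
filtered = keep_only_preposition_tags(example)
print(filtered)
```

Note that after such filtering the training targets still contain the other, now untagged errors, so the model learns to leave them alone.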

Jason3900 commented on May 20, 2024

> @Jason3900 I see. So the main inconsistency is the number of sentences in the lang8 dataset. We used a dump that was probably requested a couple of years ago (so it might be slightly smaller). I've requested a new one. Once I get the data I will recheck the size. But I think more data is even better in this case, so it should not affect the final results much.
>
> Regarding the percentage of errorful sentences for W&I+locness: you are right, I've rechecked and it's equal to 66.2%.
> Thank you for these observations!

Thanks for the reply! I got it.
And what about the dataset split? Does training stage I use 2% of the PIE synthetic data as the dev set?

komelianchuk commented on May 20, 2024

@Jason3900 yes, we used a 98/2 split for all stages. So for the PIE dataset, we just split the 9M sentences 98/2 as well.
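
For readers who want to reproduce such a split, a minimal sketch follows. The thread only states the 98/2 ratio; the shuffling, the fixed seed, and the function name `split_train_dev` are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(
    items: Sequence[T], dev_fraction: float = 0.02, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items and hold out `dev_fraction` of them as a dev set.

    Assumption: a random split with a fixed seed. The items can be
    sentences or (source, target) pairs, so both sides stay aligned.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    dev_size = int(round(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


# Toy usage: 1000 placeholder sentences give 980 train and 20 dev items.
train, dev = split_train_dev([f"sentence {i}" for i in range(1000)])
print(len(train), len(dev))
```

Splitting (source, target) pairs rather than two separate files avoids misaligning the parallel data.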

Jason3900 commented on May 20, 2024

> @Jason3900 yes, we used a 98/2 split for all stages. So for the PIE dataset, we just split the 9M sentences 98/2 as well.

Thanks a lot!

shikha10799 commented on May 20, 2024

> Hi, @shikha10799
> Great to hear about your interest.
> For example, you can focus on correcting only prepositions with the GECToR model. In order to do this, you need to refactor the preprocessing script and remove all tags related to other errors.

Oh! It was really nice to get a reply from you. Thanks for your suggestions.
