
Comments (4)

RajK853 commented on June 19, 2024

Interface

The initial version of the environment is registered with the environment-id gec-v0. It uses ANSI escape codes to render the current state with highlighted text, as shown below:

[Image: example rendering of the gec-v0 environment state with highlighted tokens, rewards and labels]

If a token has a label other than the $KEEP label, that token and its reward value are highlighted in green and its label is highlighted in red.

$KEEP labels are not shown beside their tokens.


RajK853 commented on June 19, 2024

Clean text

Quotation mark

The Lang-8 dataset seems to use `` instead of " for quotation marks.

...
Line 798 S The title is `` closer `` .
Line 799 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||
...

Our processing script will replace `` with " and normalize other characters.

raw_text = 'The title is `` closer `` .'
text = clean_text(raw_text)                # 'The title is " closer " .'
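
The clean_text helper belongs to our processing script. A minimal sketch of the quote normalization could look like the following; the replacement table is an assumption for illustration, and the actual script may normalize additional characters.

# Hypothetical sketch of the normalization step; the real clean_text in
# drl-gec may handle more characters than shown here.
QUOTE_MAP = {
    "``": '"',   # LaTeX-style opening quotes
    "''": '"',   # LaTeX-style closing quotes
    "`": "'",    # single backtick to apostrophe
}

def clean_text(text):
    """Normalize quotation characters in a whitespace-tokenized sentence."""
    tokens = [QUOTE_MAP.get(token, token) for token in text.split()]
    return " ".join(tokens)

clean_text('The title is `` closer `` .')   # 'The title is " closer " .'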

Ellipsis

The Lang-8 dataset contains lots of sentences with the ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words. Some of these examples are therefore incomplete sentences that do not make much sense, so we can remove the 35,947 examples (approx. 3% of the total data) containing an ellipsis.


RajK853 commented on June 19, 2024

Data Preparation

Data Format

The training datasets are available in the M2 format.

The example below is a sample from the Lang-8 training dataset annotated by 4 annotators.

S So , I think if we have to go somewhere on foot , we must put our hat .
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||0
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||1
A 4 5|||R:OTHER|||when|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||2
A 17 18|||R:NOUN:NUM|||hats|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||3

Our goal is to convert this data from the M2 format into a JSON file containing the input text and its references, as shown below.

{
    "text" : "So , I think if we have to go somewhere on foot , we must put our hat .",
    "references": [
      "So , I think if we have to go somewhere on foot , we must put on our hat .",
      "So , I think when we have to go somewhere on foot , we must put on our hats ."
    ]
  }

Note that we end up with only 2 distinct references from the 4 annotators because the edits from annotators 0, 1 and 3 all produce the same reference (the 1st one).
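
The conversion can be sketched as below; the helper names and the (start index, end index, replacement) edit layout are assumptions for illustration, and the actual conversion script may differ.

def apply_edits(tokens, edits):
    """Apply M2 edits to a token list, right to left so earlier offsets stay valid."""
    corrected = list(tokens)
    for start, end, replacement in sorted(edits, reverse=True):
        if start == -1:   # noop annotation, nothing to change
            continue
        corrected[start:end] = replacement.split() if replacement != "-NONE-" else []
    return corrected

def build_references(source_text, edits_by_annotator):
    """Collect one reference per annotator and keep only the distinct ones."""
    tokens = source_text.split()
    references = []
    for edits in edits_by_annotator.values():
        reference = " ".join(apply_edits(tokens, edits))
        if reference not in references:   # deduplicate identical references
            references.append(reference)
    return {"text": source_text, "references": references}

# Edits from the M2 sample above, grouped by annotator id
edits_by_annotator = {
    0: [(16, 16, "on")],
    1: [(16, 16, "on")],
    2: [(4, 5, "when"), (16, 16, "on"), (17, 18, "hats")],
    3: [(16, 16, "on")],
}
source = "So , I think if we have to go somewhere on foot , we must put our hat ."
build_references(source, edits_by_annotator)   # yields the 2 references shown above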

Data Cleaning

We perform the following data cleaning techniques while converting the data from M2 to JSON:

Filter based on the number of tokens

In the Lang-8 dataset, there are some short sentences as shown below:

Line 370 S Why ?
Line 371 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0

Similarly, we would also like to filter out very long sentences as they can cause huge GPU usage spikes during batch training.

We remove an example unless
$$N_{min} \le N_{token} \le N_{max}$$
where
$N_{token} = \text{Number of tokens}$,
$N_{min} = \text{Minimum number of tokens}$,
$N_{max} = \text{Maximum number of tokens}$
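
A minimal sketch of this filter, with placeholder thresholds:

def keep_by_length(text, min_tokens=3, max_tokens=50):
    """Keep an example only if its token count lies within [min_tokens, max_tokens].

    The threshold values here are placeholders; the actual limits are configurable.
    """
    n_tokens = len(text.split())
    return min_tokens <= n_tokens <= max_tokens

keep_by_length("Why ?")   # False with min_tokens=3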

Filter based on proper reference sentence

In the English language, a proper sentence follows these rules:

  1. The starting token is capitalized.
  2. The sentence ends with one of the following tokens: ., !, ?, "

If any of the references does not fulfil both conditions, we discard the example.
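
A sketch of this check, assuming whitespace tokenization (the function names are placeholders):

SENTENCE_END_TOKENS = {".", "!", "?", '"'}

def is_proper_sentence(text):
    """First token capitalized and last token a valid sentence terminator."""
    tokens = text.split()
    return bool(tokens) and tokens[0][0].isupper() and tokens[-1] in SENTENCE_END_TOKENS

def keep_by_proper_references(references):
    # Discard the example if any reference fails the check.
    return all(is_proper_sentence(reference) for reference in references)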

Filter based on source-reference similarity

In the Lang-8 training dataset, some edits are so extreme that even a human may not be able to reconstruct the reference sentence from the given source text.

Line 11217 S I think a few days later I can get right .
Line 11218 A 2 2|||M:PREP|||in|||REQUIRED|||-NONE-|||0
Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0
Line 11220 A 5 7|||R:OTHER|||will be fine . ( ``|||REQUIRED|||-NONE-|||0
Line 11221 A 10 11|||R:OTHER|||`` sounds awkward and unclear )|||REQUIRED|||-NONE-|||0

If we apply the edits to the source text above, we get the following reference:

{
    "text": "I think a few days later I can get right .",
    "references": [
        "I think in a few daysI will be fine . ( \" can get right \" sounds awkward and unclear )"
    ]
}

Please note that the "days" and "I" tokens are merged together in the reference because of the faulty annotation in the edit where the annotator forgot to put whitespace between them.

Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0

These sorts of examples can be filtered out by checking the similarity between the source and reference tokens as follows:
$$\frac{1}{N_{refs}} \sum_{i=1}^{N_{refs}} similarity(tokens_{source}, tokens_{reference_i}) \ge S_{min}$$
where
$N_{refs} = \text{Number of references}$
$S_{min} = \text{Minimum similarity value}$
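
One way to compute this, using difflib's SequenceMatcher over token lists as the similarity function (the actual metric and threshold used in the script may differ):

from difflib import SequenceMatcher

def keep_by_similarity(source, references, min_similarity=0.7):
    """Keep the example if the mean source-reference similarity is at least min_similarity.

    min_similarity is a placeholder threshold.
    """
    source_tokens = source.split()
    scores = [
        SequenceMatcher(None, source_tokens, reference.split()).ratio()
        for reference in references
    ]
    return sum(scores) / len(scores) >= min_similarity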

Filter based on ellipsis in source

The Lang-8 dataset contains lots of sentences with the ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words. Some of these examples are therefore incomplete sentences that do not make much sense, so we remove any example containing an ellipsis.
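
Since the data is whitespace-tokenized, the check can be as simple as this sketch:

def contains_ellipsis(text):
    """Detect the tokenized ellipsis pattern '. . .' (and the untokenized '...')."""
    return ". . ." in text or "..." in text

contains_ellipsis("For example , racing games , action games , puzzle games and more . . .")   # True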

Other cleanings

We perform the following further cleaning steps during the conversion:

  1. Clean source and reference texts by normalizing characters, e.g. replacing ` with ' and `` with ".
  2. Correct the spelling errors in the source text before generating the references.


RajK853 commented on June 19, 2024

Parenthetical texts

Parenthetical text gives extra contextual information, so removing it should not make the sentence grammatically incorrect.

Meena studied (all night) for the grammar test.
Meena studied for the grammar test.

In the Lang-8 dataset, there are some edits that add parenthetical elements, such as in this example (lines 4579 - 4582):

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .

It would be unreasonable to expect a model to correct the text by adding parenthetical elements as in the above example. To deal with this issue, we remove the parenthetical elements from all the texts.

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .
cleaned = For example , today I ordered some clothes online .
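
A minimal sketch of this removal with a regular expression (the real cleaning code may handle nested or unbalanced parentheses differently):

import re

PAREN_PATTERN = re.compile(r"\([^()]*\)")   # a single, non-nested ( ... ) span

def remove_parentheticals(text):
    """Drop parenthetical spans and collapse the leftover whitespace."""
    return " ".join(PAREN_PATTERN.sub("", text).split())

remove_parentheticals(
    'For example , today I ordered some clothes online ( you do n\'t say " internet shop " ) .'
)
# 'For example , today I ordered some clothes online .'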

