
Comments (4)

RajK853 commented on June 19, 2024

Interface

The initial version of the environment is registered with the environment-id gec-v0. It uses ANSI escape codes to render the current state with highlighted text, as shown below:

[Image: example rendering of the gec-v0 environment state with highlighted tokens, rewards and labels]

If a token has a label other than the $KEEP label, that token and its reward value are highlighted in green and its label is highlighted in red.

$KEEP labels are not shown beside their tokens.


RajK853 commented on June 19, 2024

Clean text

Quotation mark

The Lang-8 dataset seems to use `` instead of " for quotation marks.

...
Line 798 S The title is `` closer `` .
Line 799 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||
...

Our processing script will replace `` with " and normalize other characters.

raw_text = 'The title is `` closer `` .'
text = clean_text(raw_text)                # 'The title is " closer " .'
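
The clean_text helper belongs to our processing script. A minimal sketch of the quote normalization could look like the following; the replacement table is an assumption for illustration, and the actual script may normalize additional characters.

# Hypothetical sketch of the normalization step; the real clean_text in
# drl-gec may handle more characters than shown here.
QUOTE_MAP = {
    "``": '"',   # LaTeX-style opening quotes
    "''": '"',   # LaTeX-style closing quotes
    "`": "'",    # single backtick to apostrophe
}

def clean_text(text):
    """Normalize quotation characters in a whitespace-tokenized sentence."""
    tokens = [QUOTE_MAP.get(token, token) for token in text.split()]
    return " ".join(tokens)

clean_text('The title is `` closer `` .')   # 'The title is " closer " .'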

Ellipsis

The Lang-8 dataset contains lots of sentences with the ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words. Some of these examples are therefore incomplete sentences that do not make much sense, so we can remove the 35,947 examples (approx. 3% of the total data) containing an ellipsis.


RajK853 commented on June 19, 2024

Data Preparation

Data Format

The training datasets are available in the M2 format.

The example below is a sample from the Lang-8 training dataset annotated by 4 annotators.

S So , I think if we have to go somewhere on foot , we must put our hat .
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||0
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||1
A 4 5|||R:OTHER|||when|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||2
A 17 18|||R:NOUN:NUM|||hats|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||3

Our goal is to convert this data from the M2 format into a JSON file containing the input text and its references, as shown below.

{
    "text" : "So , I think if we have to go somewhere on foot , we must put our hat .",
    "references": [
      "So , I think if we have to go somewhere on foot , we must put on our hat .",
      "So , I think when we have to go somewhere on foot , we must put on our hats ."
    ]
  }

Note that we end up with only 2 distinct references from the 4 annotators because the edits from annotators 0, 1 and 3 all produce the same reference (the 1st one).
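
The conversion can be sketched as below; the helper names and the (start index, end index, replacement) edit layout are assumptions for illustration, and the actual conversion script may differ.

def apply_edits(tokens, edits):
    """Apply M2 edits to a token list, right to left so earlier offsets stay valid."""
    corrected = list(tokens)
    for start, end, replacement in sorted(edits, reverse=True):
        if start == -1:   # noop annotation, nothing to change
            continue
        corrected[start:end] = replacement.split() if replacement != "-NONE-" else []
    return corrected

def build_references(source_text, edits_by_annotator):
    """Collect one reference per annotator and keep only the distinct ones."""
    tokens = source_text.split()
    references = []
    for edits in edits_by_annotator.values():
        reference = " ".join(apply_edits(tokens, edits))
        if reference not in references:   # deduplicate identical references
            references.append(reference)
    return {"text": source_text, "references": references}

# Edits from the M2 sample above, grouped by annotator id
edits_by_annotator = {
    0: [(16, 16, "on")],
    1: [(16, 16, "on")],
    2: [(4, 5, "when"), (16, 16, "on"), (17, 18, "hats")],
    3: [(16, 16, "on")],
}
source = "So , I think if we have to go somewhere on foot , we must put our hat ."
build_references(source, edits_by_annotator)   # yields the 2 references shown above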

Data Cleaning

We perform the following data cleaning techniques while converting the data from M2 to JSON:

Filter based on the number of tokens

In the Lang-8 dataset, there are some short sentences as shown below:

Line 370 S Why ?
Line 371 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0

Similarly, we would also like to filter out very long sentences as they can cause huge GPU usage spikes during batch training.

We remove an example unless
$$N_{min} \le N_{token} \le N_{max}$$
where
$N_{token} = \text{Number of tokens}$,
$N_{min} = \text{Minimum number of tokens}$,
$N_{max} = \text{Maximum number of tokens}$
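
A minimal sketch of this filter, with placeholder thresholds:

def keep_by_length(text, min_tokens=3, max_tokens=50):
    """Keep an example only if its token count lies within [min_tokens, max_tokens].

    The threshold values here are placeholders; the actual limits are configurable.
    """
    n_tokens = len(text.split())
    return min_tokens <= n_tokens <= max_tokens

keep_by_length("Why ?")   # False with min_tokens=3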

Filter based on proper reference sentence

In the English language, a proper sentence follows these rules:

  1. The starting token is capitalized.
  2. The sentence ends with one of the following tokens: ., !, ?, "

If any of the references does not fulfil both conditions, we discard the example.
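
A sketch of this check, assuming whitespace tokenization (the function names are placeholders):

SENTENCE_END_TOKENS = {".", "!", "?", '"'}

def is_proper_sentence(text):
    """First token capitalized and last token a valid sentence terminator."""
    tokens = text.split()
    return bool(tokens) and tokens[0][0].isupper() and tokens[-1] in SENTENCE_END_TOKENS

def keep_by_proper_references(references):
    # Discard the example if any reference fails the check.
    return all(is_proper_sentence(reference) for reference in references)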

Filter based on source-reference similarity

In the Lang-8 training dataset, some edits are so extreme that even a human may not be able to reconstruct the reference sentence from the given source text.

Line 11217 S I think a few days later I can get right .
Line 11218 A 2 2|||M:PREP|||in|||REQUIRED|||-NONE-|||0
Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0
Line 11220 A 5 7|||R:OTHER|||will be fine . ( ``|||REQUIRED|||-NONE-|||0
Line 11221 A 10 11|||R:OTHER|||`` sounds awkward and unclear )|||REQUIRED|||-NONE-|||0

If we apply the edits to the source text above, we get the following reference:

{
    "text": "I think a few days later I can get right .",
    "references": [
        "I think in a few daysI will be fine . ( \" can get right \" sounds awkward and unclear )"
    ]
}

Please note that the "days" and "I" tokens are merged together in the reference because of the faulty annotation in the edit where the annotator forgot to put whitespace between them.

Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0

These sorts of examples can be filtered out by checking the similarity between the source and reference tokens as follows:
$$\frac{1}{N_{refs}} \sum_{i=1}^{N_{refs}} similarity(tokens_{source}, tokens_{reference_i}) \ge S_{min}$$
where
$N_{refs} = \text{Number of references}$
$S_{min} = \text{Minimum similarity value}$
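
One way to compute this, using difflib's SequenceMatcher over token lists as the similarity function (the actual metric and threshold used in the script may differ):

from difflib import SequenceMatcher

def keep_by_similarity(source, references, min_similarity=0.7):
    """Keep the example if the mean source-reference similarity is at least min_similarity.

    min_similarity is a placeholder threshold.
    """
    source_tokens = source.split()
    scores = [
        SequenceMatcher(None, source_tokens, reference.split()).ratio()
        for reference in references
    ]
    return sum(scores) / len(scores) >= min_similarity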

Filter based on ellipsis in source

The Lang-8 dataset contains lots of sentences with the ellipsis (. . .).

...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...

An ellipsis marks the omission of one or more words. Some of these examples are therefore incomplete sentences that do not make much sense, so we remove any example containing an ellipsis.
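
Since the data is whitespace-tokenized, the check can be as simple as this sketch:

def contains_ellipsis(text):
    """Detect the tokenized ellipsis pattern '. . .' (and the untokenized '...')."""
    return ". . ." in text or "..." in text

contains_ellipsis("For example , racing games , action games , puzzle games and more . . .")   # True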

Other cleanings

We perform the following further cleaning steps during the conversion:

  1. Clean source and reference texts by normalizing characters, e.g. replacing ` with ' and `` with ".
  2. Correct the spelling errors in the source text before generating the references.


RajK853 commented on June 19, 2024

Parenthetical texts

Parenthetical text gives extra contextual information, so removing it should not make the sentence grammatically incorrect.

Meena studied (all night) for the grammar test.
Meena studied for the grammar test.

In the Lang-8 dataset, there are some edits that add parenthetical elements, such as in this example (lines 4579 - 4582):

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .

It would be unreasonable to expect a model to correct the text by adding parenthetical elements as in the above example. To deal with this issue, we remove the parenthetical elements from all the texts.

text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .
cleaned = For example , today I ordered some clothes online .
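
A minimal sketch of this removal with a regular expression (the real cleaning code may handle nested or unbalanced parentheses differently):

import re

PAREN_PATTERN = re.compile(r"\([^()]*\)")   # a single, non-nested ( ... ) span

def remove_parentheticals(text):
    """Drop parenthetical spans and collapse the leftover whitespace."""
    return " ".join(PAREN_PATTERN.sub("", text).split())

remove_parentheticals(
    'For example , today I ordered some clothes online ( you do n\'t say " internet shop " ) .'
)
# 'For example , today I ordered some clothes online .'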

