Interface
The initial version of the environment is registered with the environment ID gec-v0. It uses ANSI escape codes to visualize the current state with highlighted text.
If a token has a label other than $KEEP, the token and its reward value are highlighted in green and its label is highlighted in red. $KEEP labels are not shown beside their tokens.
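The highlighting rule described above can be sketched roughly as follows. This is a minimal illustration, not the environment's actual renderer: the function name render_state, the layout of the reward and label next to each token, and the example label $APPEND_on are all assumptions made for this sketch.

```python
# ANSI escape codes for foreground colors (standard SGR sequences).
GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"


def render_state(tokens, labels, rewards, keep_label="$KEEP"):
    """Return one line of text with non-$KEEP tokens highlighted.

    Hypothetical layout: "<token> (<reward>)" in green, "[<label>]" in red.
    Tokens labeled with keep_label are printed plainly, without a label.
    """
    parts = []
    for token, label, reward in zip(tokens, labels, rewards):
        if label == keep_label:
            parts.append(token)
        else:
            highlighted = f"{GREEN}{token} ({reward:+.1f}){RESET}"
            parts.append(f"{highlighted} {RED}[{label}]{RESET}")
    return " ".join(parts)


if __name__ == "__main__":
    # "$APPEND_on" is an illustrative label name, not taken from the project.
    print(render_state(["put", "our", "hat"],
                       ["$APPEND_on", "$KEEP", "$KEEP"],
                       [1.0, 0.0, 0.0]))
```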
from drl-gec.
Clean text
Quotation mark
The Lang-8 dataset seems to use `` instead of " for quotation marks.
...
Line 798 S The title is `` closer `` .
Line 799 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||
...
Our processing script will replace `` with " and normalize other characters.
raw_text = 'The title is `` closer `` .'
text = clean_text(raw_text) # 'The title is " closer " .'
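A minimal sketch of what such a clean_text function could look like is given below. Only the function name and the `` to " replacement come from the text above; the set of other characters that get normalized is an assumption for illustration.

```python
# Hypothetical character map: curly quotes to their ASCII equivalents.
_CHAR_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def clean_text(text):
    """Normalize quotation marks and collapse repeated whitespace."""
    # Replace the two-character quote tokens with a plain double quote.
    text = text.replace("``", '"').replace("''", '"')
    # Map the remaining single-character variants.
    text = text.translate(_CHAR_MAP)
    # Collapse runs of whitespace left over from the replacements.
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_text("The title is `` closer `` ."))
```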
Ellipsis
The Lang-8 dataset contains lots of sentences with the ellipsis (. . .).
...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...
An ellipsis marks the omission of a word or words, so some of these examples are incomplete sentences that do not make much sense. We can therefore remove the 35,947 examples (approx. 3% of the total data) that contain an ellipsis.
Data Preparation
Data Format
The training datasets are available in the M2 format.
The example below is a sample from the Lang-8 training dataset with edits from 4 annotators (IDs 0 to 3 in the last field of each A line).
S So , I think if we have to go somewhere on foot , we must put our hat .
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||0
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||1
A 4 5|||R:OTHER|||when|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||2
A 17 18|||R:NOUN:NUM|||hats|||REQUIRED|||-NONE-|||2
A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||3
Our goal is to process these data from the M2 format to generate a JSON file with input text and its references as shown below.
{
"text" : "So , I think if we have to go somewhere on foot , we must put our hat .",
"references": [
"So , I think if we have to go somewhere on foot , we must put on our hat .",
"So , I think when we have to go somewhere on foot , we must put on our hats ."
]
}
Note that we obtain only 2 distinct references from the 4 annotators because the edits from annotators 0, 1 and 3 produce the same reference (the 1st one).
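The conversion from one M2 block to the JSON structure above can be sketched as follows. This is an illustrative outline rather than the project's processing script: the function names are made up for this sketch, and it assumes that the last field of every A line is an integer annotator ID and that noop edits leave the source unchanged.

```python
import json
from collections import defaultdict


def parse_m2_block(block):
    """Split one M2 block into source tokens and edits grouped by annotator."""
    lines = block.strip().splitlines()
    source = lines[0][2:].split()          # drop the leading "S "
    edits = defaultdict(list)
    for line in lines[1:]:
        if not line.startswith("A "):
            continue
        fields = line[2:].split("|||")
        start, end = map(int, fields[0].split())
        annotator = int(fields[-1])        # assumes an integer annotator ID
        edits[annotator]                   # register annotator even for noop
        if fields[1] != "noop":
            edits[annotator].append((start, end, fields[2]))
    return source, edits


def apply_edits(source, edits):
    """Apply edits right to left so earlier token offsets stay valid."""
    tokens = list(source)
    for start, end, correction in sorted(edits, reverse=True):
        tokens[start:end] = correction.split()
    return " ".join(tokens)


def m2_block_to_example(block):
    """Build {"text": ..., "references": [...]} with duplicate references removed."""
    source, edits = parse_m2_block(block)
    references = []
    for annotator in sorted(edits):
        reference = apply_edits(source, edits[annotator])
        if reference not in references:
            references.append(reference)
    return {"text": " ".join(source), "references": references}


if __name__ == "__main__":
    sample = "\n".join([
        "S So , I think if we have to go somewhere on foot , we must put our hat .",
        "A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||0",
        "A 4 5|||R:OTHER|||when|||REQUIRED|||-NONE-|||2",
        "A 16 16|||M:PREP|||on|||REQUIRED|||-NONE-|||2",
        "A 17 18|||R:NOUN:NUM|||hats|||REQUIRED|||-NONE-|||2",
    ])
    print(json.dumps(m2_block_to_example(sample), indent=2))
```

Applying the edits from right to left avoids having to shift the offsets of later edits after an insertion such as "A 16 16".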
Data Cleaning
We perform the following data cleaning techniques while converting the data from M2 to JSON:
Filter based on the number of tokens
In the Lang-8 dataset, there are some short sentences as shown below:
Line 370 S Why ?
Line 371 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
Similarly, we also filter out very long sentences, as they can cause huge GPU usage spikes during batch training.
We remove an example if the number of tokens in its source sentence is below a minimum threshold or above a maximum threshold.
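A token-count filter of this kind could look like the sketch below. The threshold values MIN_TOKENS and MAX_TOKENS are placeholders chosen for illustration; the actual limits are not stated here.

```python
# Hypothetical thresholds; the real limits are not given in the text.
MIN_TOKENS = 3
MAX_TOKENS = 50


def has_valid_length(text, min_tokens=MIN_TOKENS, max_tokens=MAX_TOKENS):
    """Return True if the whitespace-tokenized text is within the limits."""
    num_tokens = len(text.split())
    return min_tokens <= num_tokens <= max_tokens


if __name__ == "__main__":
    print(has_valid_length("Why ?"))                          # too short
    print(has_valid_length('The title is " closer " .'))      # within limits
```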
Filter based on proper reference sentence
In the English language, a proper sentence follows these rules:
- The starting token is capitalized.
- The sentence ends with one of the following four tokens: . ! ? "
If any of the references does not fulfil the above conditions, we discard the example.
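The two rules above can be checked with a few lines of code. The sketch below is an assumption about how such a check might be written; in particular, it treats "capitalized" as "the first character of the first token is an uppercase letter", so a reference that starts with a digit or a quotation mark is rejected.

```python
# Tokens that may end a proper sentence, as listed in the rules.
END_TOKENS = {".", "!", "?", '"'}


def is_proper_sentence(text):
    """Check that the first token is capitalized and the last token ends a sentence."""
    tokens = text.split()
    if not tokens:
        return False
    return tokens[0][0].isupper() and tokens[-1] in END_TOKENS


def has_proper_references(references):
    """An example is kept only if every reference is a proper sentence."""
    return all(is_proper_sentence(reference) for reference in references)


if __name__ == "__main__":
    print(is_proper_sentence("Why ?"))
    print(is_proper_sentence("I will be fine . ( sounds awkward )"))
```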
Filter based on source-reference similarity
In the Lang-8 training dataset, some edits are so extreme that even a human may not be able to obtain the reference sentence from the given source text.
Line 11217 S I think a few days later I can get right .
Line 11218 A 2 2|||M:PREP|||in|||REQUIRED|||-NONE-|||0
Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0
Line 11220 A 5 7|||R:OTHER|||will be fine . ( ``|||REQUIRED|||-NONE-|||0
Line 11221 A 10 11|||R:OTHER|||`` sounds awkward and unclear )|||REQUIRED|||-NONE-|||0
If we apply the edits to the source text above, we get the following reference:
{
"text": "I think a few days later I can get right .",
"references": [
"I think in a few daysI will be fine . ( \" can get right \" sounds awkward and unclear )"
]
}
Please note that the "days" and "I" tokens are merged in the reference because of a faulty annotation: the annotator forgot to put whitespace between them in the edit below.
Line 11219 A 4 5|||R:NOUN|||daysI|||REQUIRED|||-NONE-|||0
These sorts of examples can be filtered out by checking the similarity between the source and reference tokens and discarding the examples whose similarity is too low.
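Since the exact similarity measure is not reproduced here, the sketch below swaps in a stand-in: the token-level ratio from Python's difflib.SequenceMatcher. Both the choice of measure and the MIN_SIMILARITY threshold are assumptions for illustration, not the project's definition.

```python
from difflib import SequenceMatcher

# Hypothetical threshold; the real value is not given in the text.
MIN_SIMILARITY = 0.7


def token_similarity(source, reference):
    """Similarity in [0, 1] between two whitespace-tokenized sentences."""
    matcher = SequenceMatcher(None, source.split(), reference.split(),
                              autojunk=False)
    return matcher.ratio()


def is_similar_enough(source, references, threshold=MIN_SIMILARITY):
    """Keep an example only if every reference stays close to the source."""
    return all(token_similarity(source, reference) >= threshold
               for reference in references)


if __name__ == "__main__":
    source = "I think a few days later I can get right ."
    reference = ('I think in a few daysI will be fine . ( " can get right " '
                 "sounds awkward and unclear )")
    print(round(token_similarity(source, reference), 3))
```

With this stand-in, the extreme example above scores well below the placeholder threshold, while a reference that only inserts one token into the source scores close to 1.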
Filter based on ellipsis in source
The Lang-8 dataset contains lots of sentences with the ellipsis (. . .).
...
Line 4012672 S For example , racing games , action games , puzzle games and more . . .
Line 4012673 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
...
An ellipsis marks the omission of a word or words, so some of these examples are incomplete sentences that do not make much sense. We therefore remove any example whose source contains an ellipsis.
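Detecting the ellipsis in a tokenized source can be done with a small regular expression. The sketch below is one possible way to do it; also matching the single-character ellipsis (U+2026) is an extra assumption beyond the ". . ." form shown above.

```python
import re

# Three dots with optional whitespace between them, or the single
# ellipsis character (the latter is an assumption of this sketch).
ELLIPSIS_PATTERN = re.compile(r"\.\s*\.\s*\.|\u2026")


def contains_ellipsis(text):
    """Return True if the text contains an ellipsis in any of the matched forms."""
    return ELLIPSIS_PATTERN.search(text) is not None


if __name__ == "__main__":
    print(contains_ellipsis(
        "For example , racing games , action games , puzzle games and more . . ."))
```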
Other cleanings
We perform the following further cleaning steps during the conversion:
- Clean the source and reference texts by normalizing characters, such as ’ to ' or '' to ".
- Correct the spelling errors in the source text before generating the references.
Parenthetical texts
Parenthetical texts give extra context, and removing them should not make the sentence grammatically incorrect.
Meena studied (all night) for the grammar test.
Meena studied for the grammar test.
In the Lang-8 dataset, there are some edits that add parenthetical elements, such as in this example (lines 4579 - 4582):
text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .
It would be unreasonable to request a model to correct the text by adding parenthetical elements as in the above example. To deal with this issue, we remove the parenthetical elements from all the texts.
text = For example , today I ordered some clothes on the internet shop !
reference = For example , today I ordered some clothes online ( you do n't say " internet shop " ) .
cleaned = For example , today I ordered some clothes online .
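Removing the parenthetical elements can be sketched with a regular expression, as below. The function name remove_parentheticals and the handling of nested parentheses (repeatedly stripping the innermost pair) are assumptions of this sketch, not necessarily what the processing script does.

```python
import re

# An innermost "( ... )" group together with the whitespace before it.
PAREN_PATTERN = re.compile(r"\s*\([^()]*\)")


def remove_parentheticals(text):
    """Strip parenthetical elements, including nested ones, from the text."""
    previous = None
    # Repeat until nothing changes so that nested groups are removed too.
    while previous != text:
        previous = text
        text = PAREN_PATTERN.sub("", text)
    # Collapse any whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    reference = ("For example , today I ordered some clothes online "
                 '( you do n\'t say " internet shop " ) .')
    print(remove_parentheticals(reference))
```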