CompLex

This site holds the data associated with the paper: CompLex — A New Corpus for Lexical Complexity Prediction from Likert Scale Data

available at: https://www.aclweb.org/anthology/2020.readi-1.9/

We are currently running a shared task on Lexical Complexity Prediction via SemEval using this data. As such, the data will be released according to the following release schedule:

  • Trial data available: July 31, 2020

  • Training data available: September 4, 2020

  • Test data available/Evaluation starts: January 11, 2021

See the task website for further information: https://sites.google.com/view/lcpsharedtask2021

Currently, the trial and test data are available. The trial data comprises 99 MWEs (29 bible, 33 biomed and 37 europarl) and 421 single-word instances (143 bible, 135 biomed and 143 europarl).

The training data comprises 1,517 MWEs (505 bible, 514 biomed and 498 europarl) and 7,662 single-word instances (2,574 bible, 2,576 biomed and 2,512 europarl). The training data is compressed and encrypted using 7-Zip (see: https://www.7-zip.org/), which is available for Windows and Linux. To decompress the archive you will need a password, which you can obtain by registering for the shared task.
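As a rough sketch of the decompression step, the following Python helper builds and runs a 7-Zip extraction command. It assumes that a `7z` executable is on your PATH (the binary name varies between installations), and the archive name, password and output directory shown are placeholders, not the real ones.

```python
import subprocess
from pathlib import Path


def build_7z_command(archive: Path, password: str, out_dir: Path) -> list[str]:
    """Build the 7-Zip command line that extracts an encrypted archive."""
    # "x" extracts with full paths; the -p (password) and -o (output
    # directory) switches take their value with no space in between.
    return ["7z", "x", str(archive), f"-p{password}", f"-o{out_dir}"]


def extract_archive(archive: Path, password: str, out_dir: Path) -> None:
    """Run 7-Zip and raise CalledProcessError if the extraction fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_7z_command(archive, password, out_dir), check=True)


# Hypothetical usage (placeholder file name and password):
# extract_archive(Path("train.7z"), "password-from-registration", Path("data"))
```

Passing the arguments as a list avoids shell quoting problems if the password contains special characters.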

The data is arranged by token and sorted by complexity (i.e., instances with the same token appear together in groups, and the groups are sorted by the complexity of the lowest-scored item). Tokens that appear in one partition will not appear in another partition. We have deliberately included more than one instance of a token where possible to identify places where the context affects the complexity of a word. Consider, for example, the following two sentences from the trial data:

  • We now have a proposal on the table which lays down strict emissions values for the next ten years and which simultaneously creates clarity and incentives for technological innovations.
  • Mr President, in coordination with other groups, I would like to table an oral amendment concerning the draft bill in the Duma to ignore certain rulings of the European Court of Human Rights.

When "table" is used as a noun in the first sentence (albeit in an abstract sense), it receives a score of 0.01, indicating that it is very easily understood. However, when it is used as a verb in the second sentence, it is scored at 0.23, indicating that it is more difficult to comprehend.
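The arrangement described above (instances grouped by token, groups ordered by their lowest score) can be sketched in Python as follows. This is only an illustration: the row layout, the ascending order, the sorting inside each group and the function names are assumptions, and the "river" row is invented. Only the two "table" scores come from the trial data example.

```python
from collections import defaultdict

# Hypothetical rows of (token, complexity). The two "table" scores are the
# ones quoted above; the "river" row is made up for illustration.
rows = [
    ("table", 0.23),
    ("river", 0.10),
    ("table", 0.01),
]


def group_and_sort(rows):
    """Group instances by token, then order groups by their lowest score."""
    groups = defaultdict(list)
    for token, score in rows:
        groups[token].append(score)
    # Sort the scores inside each group (an assumption), then sort the
    # groups by their minimum score in ascending order (also an assumption).
    return sorted(
        ((token, sorted(scores)) for token, scores in groups.items()),
        key=lambda item: item[1][0],
    )


def context_spread(rows):
    """Return max minus min complexity for tokens occurring more than once."""
    return {
        token: round(scores[-1] - scores[0], 4)
        for token, scores in group_and_sort(rows)
        if len(scores) > 1
    }
```

A large spread for a token, such as the gap between the two uses of "table", points to instances where context changes the perceived complexity.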

The data comes from three sources: biblical text, biomedical articles and proceedings of the European Parliament. These sources were selected because they contain a natural mixture of common language and difficult-to-understand expressions, while each contains vastly different domain-specific vocabulary. Systems must submit on all three sub-corpora.

In addition to the single-word task, we have also annotated multi-word expressions. These form a second track for our shared task. Systems that submit to both the single-word and multi-word tracks will additionally be evaluated on their joint scores across both corpora.
