Starting kit

Starting kit for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection.

The code draws from the LSCDetection repository.

Code

Under code/ we provide an implementation of the two baselines for the shared task:

  1. normalized frequency difference (FD)
  2. count vectors with column intersection and cosine distance (CNT+CI+CD)

FD first calculates the frequency of each target word in each of the two corpora, normalizes it by the total corpus frequency, and then takes the absolute difference between the two values as a measure of change. CNT+CI+CD first learns count-vector representations from each of the two corpora, then aligns them by intersecting their columns, and measures change as the cosine distance between a target word's two vectors. Find more information on these models in Schlechtweg et al. (2019), listed under References.
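As a rough illustration of the two measures described above, the following Python sketch computes both for a single target word. It is not the code under code/: the function names, the symmetric context window of size 2, the use of the total token count as the normalizer, and the use of each corpus's vocabulary as its column set are all assumptions made for this example.

```python
import math
from collections import Counter


def frequency_difference(corpus1, corpus2, target):
    """FD: absolute difference of the target's normalized frequency."""
    counts1 = Counter(tok for sent in corpus1 for tok in sent)
    counts2 = Counter(tok for sent in corpus2 for tok in sent)
    # Assumption: "total corpus frequency" is the total number of tokens.
    total1 = sum(counts1.values())
    total2 = sum(counts2.values())
    return abs(counts1[target] / total1 - counts2[target] / total2)


def count_vector(corpus, target, window=2):
    """CNT: co-occurrence counts of context words around the target."""
    vec = Counter()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[sent[j]] += 1
    return vec


def cosine_distance(vec1, vec2, columns):
    """CD: one minus the cosine similarity, computed over the given columns."""
    dot = sum(vec1[c] * vec2[c] for c in columns)
    norm1 = math.sqrt(sum(vec1[c] ** 2 for c in columns))
    norm2 = math.sqrt(sum(vec2[c] ** 2 for c in columns))
    if norm1 == 0 or norm2 == 0:
        return float("nan")  # target has no usable context in one corpus
    return 1.0 - dot / (norm1 * norm2)


def cnt_ci_cd(corpus1, corpus2, target, window=2):
    """CNT+CI+CD for one target word (hypothetical helper)."""
    vec1 = count_vector(corpus1, target, window)
    vec2 = count_vector(corpus2, target, window)
    # CI: keep only columns (context words) that exist in both corpora.
    vocab1 = {tok for sent in corpus1 for tok in sent}
    vocab2 = {tok for sent in corpus2 for tok in sent}
    return cosine_distance(vec1, vec2, vocab1 & vocab2)
```

Both entry points expect each corpus as a list of sentences, where each sentence is a list of lemmas. The corpora are traversed more than once, so they should be lists and not one-shot iterators.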

The script run.sh will run FD and CNT+CI+CD on the trial data. For this, assuming you are working on a UNIX-based system, first make the script executable with

chmod 755 run.sh

Then execute

bash -e run.sh

The script will unzip the data, iterate over the corpora of each language, learn matrices, store them under matrices/ and write the results for the trial targets under results/. It also produces answer files for tasks 1 and 2 in the required submission format and stores them under results/. This works as follows: FD and CNT+CI+CD predict change values for the target words. These values provide the ranking for task 2. For task 1, the target words are then assigned to two classes depending on whether their predicted change value exceeds a specified threshold.

If the script throws errors, you might need to install the Python dependencies: pip3 install -r requirements.txt.
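The step from predicted change values to the two kinds of answers can be sketched as follows. This is only an illustration under assumptions: the function names, the 1/0 class labels, the example scores, and the threshold value are made up here and are not taken from run.sh.

```python
def rank_targets(change_scores):
    """Order target words from highest to lowest predicted change (task 2)."""
    return sorted(change_scores, key=change_scores.get, reverse=True)


def classify_targets(change_scores, threshold):
    """Assign 1 if the predicted change exceeds the threshold, else 0 (task 1)."""
    return {word: int(score > threshold) for word, score in change_scores.items()}


# Made-up change values for three placeholder target words.
scores = {"word_a": 0.42, "word_b": 0.05, "word_c": 0.17}

ranking = rank_targets(scores)
classes = classify_targets(scores, threshold=0.1)  # hypothetical threshold
```

The same change values feed both outputs, so the choice of threshold affects only the task 1 classes and leaves the task 2 ranking unchanged.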

Trial Data

We provide trial data in trial_data_public.zip. For each language, it contains:

  • trial target words for which predictions can be submitted in the practice phase (targets/)
  • the true classification of the trial target words for task 1 in the practice phase, i.e., the file against which submissions will be scored in the practice phase (truth/task1/)
  • the true ranking of the trial target words for task 2 in the practice phase (truth/task2/)
  • a sample submission for the trial target words in the above-specified format (answer.zip/)
  • two trial corpora from which you may predict change scores for the trial target words (corpora/)

Important: The scores in truth/task1/ and truth/task2/ are not meaningful as they were randomly assigned.

You can start by uploading the zipped answer folder to the system to check the submission and evaluation format. Find more information on the submission format on the shared task website.

Trial Corpora

The trial corpora under corpora/ are gzipped samples from the corpora that will be used in the evaluation phase. For each language, two time-specific corpora are provided. Participants are required to predict the lexical semantic change of the target words between these two corpora. Each line contains one sentence and has the form

lemma1 lemma2 lemma3...

Sentences have been randomly shuffled. The corpora have the same format as the ones which will be used in the evaluation phase. Find more information about the corpora on the shared task website.
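Given the format described above (gzipped text, one sentence per line, lemmas separated by whitespace), a corpus file could be read with a small generator like the one below. The function name and the UTF-8 encoding are assumptions for this sketch.

```python
import gzip


def read_corpus(path):
    """Yield each sentence of a gzipped corpus file as a list of lemmas."""
    # Assumption: the files are UTF-8 encoded text.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            lemmas = line.split()
            if lemmas:  # skip empty lines
                yield lemmas
```

Called with the path of one of the gzipped files under corpora/, it produces one list of lemmas per sentence. Wrapping the call in list(...) keeps the whole corpus in memory when several passes are needed.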

References

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy. ACL.
