CMCE at SemEval-2020 Task 1: Clustering on Manifolds of Contextualized Embeddings to Detect Historical Meaning Shifts

This repository contains the code to reproduce the experiments in our paper, "CMCE at SemEval-2020 Task 1: Clustering on Manifolds of Contextualized Embeddings to Detect Historical Meaning Shifts", which measures and detects semantic change for the SemEval-2020 challenge.

Please follow the instructions on the SemEval-2020 Task 1 website to obtain the corpus data.

CMCE

System implementation for Task 1 of the SemEval-2020 challenge.

To use the scripts as given, create the following folder structure and place the corresponding corpora in it.

semeval2020-task1/
    data/
        main_task_data/
            corpora/
                english/
                    corpus1/
                        corpus1.txt.gz
                    corpus2/
                        corpus2.txt.gz
                german/
                    corpus1/
                        corpus1.txt.gz
                    corpus2/
                        corpus2.txt.gz
                latin/
                    corpus1/
                        corpus1.txt.gz
                    corpus2/
                        corpus2.txt.gz
                swedish/
                    corpus1/
                        corpus1.txt.gz
                    corpus2/
                        corpus2.txt.gz
            targets/
                english.txt
                german.txt
                latin.txt
                swedish.txt
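
Before running the scripts, it can help to verify that the layout above is complete. The following is a minimal sketch using only the standard library; the helper names expected_paths and missing_paths are hypothetical and not part of this repository.

```python
from pathlib import Path

# Languages and corpus names taken from the folder structure above.
LANGUAGES = ("english", "german", "latin", "swedish")
CORPORA = ("corpus1", "corpus2")


def expected_paths(root):
    """Return every corpus and target file the layout above calls for."""
    base = Path(root) / "data" / "main_task_data"
    paths = [
        base / "corpora" / lang / corpus / f"{corpus}.txt.gz"
        for lang in LANGUAGES
        for corpus in CORPORA
    ]
    paths += [base / "targets" / f"{lang}.txt" for lang in LANGUAGES]
    return paths


def missing_paths(root):
    """Return the expected files that do not exist under root."""
    return [p for p in expected_paths(root) if not p.is_file()]


if __name__ == "__main__":
    # Assumes the script is started from the directory that contains
    # the semeval2020-task1/ checkout.
    for path in missing_paths("semeval2020-task1"):
        print("missing:", path)
```

An empty output means all eight corpus files and four target lists are in place.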

The system implemented in this repository follows four main steps:

  1. Computing BERT or XLM-R embeddings for each occurrence of a target word
  2. Compressing the original embeddings with an autoencoder (the "auto embeddings")
  3. Using UMAP and HDBSCAN to perform unsupervised clustering
  4. Computing the answers for the challenge from the resulting clusters
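
To illustrate step 4, here is a minimal sketch of how cluster labels from the two corpora could be turned into a binary and a graded change score. It is not necessarily the scoring used in main_semeval.py: the function names, the thresholds min_present and max_absent, and the choice of Jensen-Shannon divergence for the graded score are all assumptions made for illustration. The only library behaviour relied on is that HDBSCAN labels noise points with -1.

```python
import math
from collections import Counter

NOISE = -1  # HDBSCAN assigns -1 to points it leaves unclustered


def sense_distribution(labels, senses):
    """Relative frequency of each cluster (sense) among non-noise labels."""
    counts = Counter(label for label in labels if label != NOISE)
    total = sum(counts.values())
    if total == 0:
        # No clustered occurrences: fall back to a uniform distribution.
        return [1.0 / len(senses)] * len(senses)
    return [counts[s] / total for s in senses]


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence, bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def score_target(labels_c1, labels_c2, min_present=2, max_absent=0):
    """Return (binary_change, graded_change) for one target word.

    Assumption: a sense counts as gained or lost when it has at most
    max_absent occurrences in one corpus and at least min_present in
    the other. Both thresholds are hypothetical defaults.
    """
    senses = sorted((set(labels_c1) | set(labels_c2)) - {NOISE})
    if not senses:
        return 0, 0.0
    c1, c2 = Counter(labels_c1), Counter(labels_c2)
    changed = any(
        (c1[s] <= max_absent and c2[s] >= min_present)
        or (c2[s] <= max_absent and c1[s] >= min_present)
        for s in senses
    )
    graded = jensen_shannon_divergence(
        sense_distribution(labels_c1, senses),
        sense_distribution(labels_c2, senses),
    )
    return int(changed), graded
```

Here labels_c1 and labels_c2 are the cluster labels of a word's occurrences in corpus1 and corpus2. Ranking the target words by the graded score would give a candidate answer for subtask 2.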

To compute the embeddings, run either compute_bert_embeddings.py or compute_xlmr_embeddings.py from the main folder, setting the language and the corpus you need.

Then run autoembed_data.py, specifying the language and the embedding type you used.

Finally, compute the task answers by running main_semeval.py. Make sure to set the same embedding type that you used in the earlier steps (everything defaults to BERT embeddings).

To compare the results against the gold data, upload the generated submission to the challenge webpage.

Reference

@inproceedings{rother-etal-2020-cmce,
    title = "{CMCE} at {S}em{E}val-2020 Task 1: Clustering on Manifolds of Contextualized Embeddings to Detect Historical Meaning Shifts",
    author = "Rother, David  and
      Haider, Thomas  and
      Eger, Steffen",
    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
    month = dec,
    year = "2020",
    address = "Barcelona (online)",
    publisher = "International Committee for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.semeval-1.22",
    pages = "187--193",
    abstract = "This paper describes the system Clustering on Manifolds of Contextualized Embeddings (CMCE) submitted to the SemEval-2020 Task 1 on Unsupervised Lexical Semantic Change Detection. Subtask 1 asks to identify whether or not a word gained/lost a sense across two time periods. Subtask 2 is about computing a ranking of words according to the amount of change their senses underwent. Our system uses contextualized word embeddings from MBERT, whose dimensionality we reduce with an autoencoder and the UMAP algorithm, to be able to use a wider array of clustering algorithms that can automatically determine the number of clusters. We use Hierarchical Density Based Clustering (HDBSCAN) and compare it to Gaussian Mixture Models (GMMs) and other clustering algorithms. Remarkably, with only 10 dimensional MBERT embeddings (reduced from the original size of 768), our submitted model performs best on subtask 1 for English and ranks third in subtask 2 for English. In addition to describing our system, we discuss our hyperparameter configurations and examine why our system lags behind for the other languages involved in the shared task (German, Swedish, Latin). Our code is available at https://github.com/DavidRother/semeval2020-task1",
}

semeval2020-task1's Issues

Missing module (easy to fix)

I got the following error:

python3 compute_bert_embeddings.py
...
  File "compute_bert_embeddings.py", line 2, in <module>
    from semeval2020.data_loader.sentence_loader import SentenceLoader
  File "/home/.../semeval2020/data_loader/__init__.py", line 1, in <module>
    from semeval2020.data_loader import lazy_embedding_loader, sentence_loader
  File "/home/.../semeval2020/data_loader/lazy_embedding_loader.py", line 4, in <module>
    from semeval2020.factory_hub import abstract_data_loader, data_loader_factory
  File "/home/.../semeval2020/factory_hub/__init__.py", line 2, in <module>
    import semeval2020.evaluation
ModuleNotFoundError: No module named 'semeval2020.evaluation'

If the offending line (import semeval2020.evaluation in semeval2020/factory_hub/__init__.py) is deleted, the script works.
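
To check which module is actually missing before editing any file, a small standard-library probe can help. This is only a sketch: the helper name module_available is hypothetical, and it assumes the repository root is on the Python path.

```python
import importlib.util


def module_available(name):
    """Return True if the dotted module name can be located.

    find_spec imports parent packages, so a missing parent raises
    ImportError; a missing leaf module yields None instead.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Module names taken from the traceback above.
    for name in ("semeval2020.data_loader", "semeval2020.evaluation"):
        status = "found" if module_available(name) else "MISSING"
        print(name, status)
```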
