
muda's Introduction

Multilingual Discourse-Aware (MuDA) Benchmark

The Multilingual Discourse-Aware (MuDA) Benchmark is a comprehensive suite of taggers and evaluators aimed at advancing the field of context-aware Machine Translation (MT).

Traditional translation quality metrics output uninterpretable scores and fail to accurately measure performance on context-aware discourse phenomena. MuDA takes a different direction, relying on neural syntactic and morphological analysers to measure the performance of translation models on specific words and discourse phenomena.

The MuDA taggers currently support 14 language pairs (see this directory), and new languages can easily be added.

Installation

The tagger relies on PyTorch (<1.10) to run its models. If you want to run these models, first install PyTorch; you can find instructions for your system here.

For example, to install PyTorch on a Linux system with CUDA support in a conda environment, run:

conda install pytorch==1.9.1 torchvision==0.10.1 torchaudio==0.9.1 cudatoolkit=11.3 -c pytorch -c conda-forge

Then, to install the rest of the dependencies, run:

pip install -r requirements.txt

Example Usage

To tag an existing dataset and extract the tags for later use, run the following command:

python muda/main.py \
    --src /path/to/src \
    --tgt /path/to/tgt \
    --docids /path/to/docids \
    --dump-tags /tmp/maia_ende.tags \
    --tgt-lang "$lang"

To evaluate models on a particular dataset (reporting per-tag metrics such as precision and recall), run:

python muda/main.py \
    --src /path/to/src \
    --tgt /path/to/tgt \
    --docids /path/to/docids \
    --hyps /path/to/hyps.m1 /path/to/hyps.m2 \
    --tgt-lang "$lang"

Note that MuDA relies on a docids file, which contains the same number of lines as the src/tgt files and where each line gives the document ID to which the corresponding source/target sentence belongs.
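
For example, a docids file for a three-line corpus could look like this (the sentences and document IDs are illustrative):

/path/to/src:
    Where is the ball?
    He kicked it away.
    Good morning, everyone.

/path/to/docids:
    doc1
    doc1
    doc2

Here the first two sentences belong to the same document (doc1), so the first can serve as context when tagging the second, while the third sentence starts a new document.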

muda's People

Contributors

nightingal3, coderpat, kayoyin

Stargazers

Paweł Mąka, Sweta Agrawal, Leonardo Emili, Gabriele Sarti, Matt Post

Watchers

Vlad Niculae, Graham Neubig

muda's Issues

Separate language-specific taggers into different files

Currently, the taggers for all supported languages live in the same file, muda/tagger.py. However, each language's tagger should live in a separate file, since this adds encapsulation and makes it easier to add taggers for new languages.

I suggest adding them in a new folder, e.g. muda/langs/pt_tagger.py.
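
As a sketch, each per-language file could subclass a shared base tagger (the class names and attributes below are illustrative, not the current API):

# muda/langs/pt_tagger.py
from muda.tagger import Tagger  # assumed shared base class

class PortugueseTagger(Tagger):
    lang = "pt"

    # Language-specific resources, e.g. pronoun formality classes.
    formality_classes = {"t_class": {"tu"}, "v_class": {"você"}}

A muda/langs/__init__.py registry mapping language codes to tagger classes would then let main.py pick the right tagger from --tgt-lang.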

Keep track of reference chain in MuDA

Currently, we track whether a sentence has any links to the previous sentence with booleans. If we could keep track of these references explicitly, we could automatically change them to create contrastive datasets for certain phenomena to measure a model's context sensitivity. This could be difficult to do, but would remove the dependence on non-contextual baselines.
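
As a sketch of what explicit tracking could look like (all names here are hypothetical), each link could record which token refers back to which antecedent:

from dataclasses import dataclass

@dataclass
class ReferenceLink:
    # One link in a cross-sentential reference chain (illustrative).
    phenomenon: str            # e.g. "pronouns", "lexical_cohesion"
    sent_idx: int              # index of the current sentence in the document
    token_idx: int             # token that refers back
    antecedent_sent_idx: int   # earlier sentence holding the antecedent
    antecedent_token_idx: int  # position of the antecedent token

A contrastive example could then be generated by perturbing the antecedent token and checking whether the model's translation of the referring token changes.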

Coref error

I am having some problems when running the script. I created my environment using the muda_env.yml file. When I test it on a small test document, I run into a "coref error". It seems to me that it is related to multi-token tokenisation (for example the Spanish word "al" is tokenized as "a" and "el"). See below for an example.
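
For context, this is how Stanza's multi-word token expansion behaves on Spanish contractions (a minimal sketch using the public Stanza API):

import stanza

stanza.download("es")  # fetch models on first use
nlp = stanza.Pipeline("es", processors="tokenize,mwt")
doc = nlp("Fuimos al cine.")
for token in doc.sentences[0].tokens:
    # "al" is one surface token expanded into two words, "a" + "el",
    # so character offsets no longer map one-to-one onto words.
    print(token.text, "->", [word.text for word in token.words])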

I'd be grateful for your thoughts on this!

This is the command I used: PYTHONPATH=/home/getalp/nakhlem/MuDA python muda/main.py --src my_data/text.en --tgt my_data/text.es --docids my_data/text.docids --dump-tags my_data/test_enes_muda-env-yaml.tags --tgt-lang "es"

And this is the full message:

2024-01-10 16:53:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 22.0MB/s]                                                 
2024-01-10 16:53:33 INFO: Loading these models for language: en (English):
=================================
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |
=================================

2024-01-10 16:53:33 INFO: Using device: cuda
2024-01-10 16:53:33 INFO: Loading: tokenize
2024-01-10 16:53:35 INFO: Loading: pos
2024-01-10 16:53:36 INFO: Loading: lemma
2024-01-10 16:53:36 INFO: Loading: depparse
2024-01-10 16:53:36 INFO: Done loading processors!
2024-01-10 16:53:36 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 21.0MB/s]                                                 
2024-01-10 16:53:37 WARNING: Language es package default expects mwt, which has been added
2024-01-10 16:53:38 INFO: Loading these models for language: es (Spanish):
===============================
| Processor | Package         |
-------------------------------
| tokenize  | ancora          |
| mwt       | ancora          |
| pos       | ancora_charlm   |
| lemma     | ancora_nocharlm |
| depparse  | ancora_charlm   |
===============================

2024-01-10 16:53:38 INFO: Using device: cuda
2024-01-10 16:53:38 INFO: Loading: tokenize
2024-01-10 16:53:38 INFO: Loading: mwt
2024-01-10 16:53:38 INFO: Loading: pos
2024-01-10 16:53:38 INFO: Loading: lemma
2024-01-10 16:53:38 INFO: Loading: depparse
2024-01-10 16:53:38 INFO: Done loading processors!
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Due to multiword token expansion or an alignment issue, the original text has been replaced by space-separated expanded tokens.
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Varios', 'enmascarados', 'irrumpieron', 'en', 'el', 'estudio', 'de', 'el', 'canal', 'público', 'TC', 'durante', 'una', 'emisión', ',', 'obligando', 'a', 'el', 'personal', 'a', 'tirar', 'se', 'a', 'el', 'suelo', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['A', 'el', 'menos', '10', 'personas', 'han', 'muerto', 'desde', 'que', 'el', 'lunes', 'se', 'declarara', 'el', 'estado', 'de', 'excepción', 'en', 'Ecuador', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Este', 'se', 'declaró', 'después', 'de', 'que', 'un', 'famoso', 'gánster', 'desapareciera', 'de', 'su', 'celda', 'en', 'prisión', '.', 'No', 'está', 'claro', 'si', 'el', 'incidente', 'en', 'el', 'estudio', 'de', 'televisión', 'de', 'Guayaquil', 'está', 'relacionado', 'con', 'la', 'desaparición', 'de', 'una', 'prisión', 'de', 'la', 'misma', 'ciudad', 'de', 'el', 'jefe', 'de', 'la', 'banda', 'de', 'los', 'Choneros', ',', 'Adolfo', 'Macías', 'Villamar', ',', 'o', 'Fito', ',', 'como', 'es', 'más', 'conocido', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['En', 'el', 'vecino', 'Perú', ',', 'el', 'gobierno', 'ordenó', 'el', 'despliegue', 'inmediato', 'de', 'una', 'fuerza', 'policial', 'en', 'la', 'frontera', 'para', 'evitar', 'que', 'la', 'inestabilidad', 'se', 'extienda', 'a', 'el', 'país', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Ecuador', 'es', 'uno', 'de', 'los', 'principales', 'exportadores', 'de', 'plátano', 'de', 'el', 'mundo', ',', 'pero', 'también', 'exporta', 'petróleo', ',', 'café', ',', 'cacao', ',', 'camarones', 'y', 'productos', 'pesqueros', '.', 'El', 'aumento', 'de', 'la', 'violencia', 'en', 'el', 'país', 'andino', ',', 'dentro', 'y', 'fuera', 'de', 'sus', 'prisiones', ',', 'se', 'ha', 'vinculado', 'a', 'los', 'enfrentamientos', 'entre', 'cárteles', 'de', 'la', 'droga', ',', 'tanto', 'extranjeros', 'como', 'locales', ',', 'por', 'el', 'control', 'de', 'las', 'rutas', 'de', 'la', 'cocaína', 'hacia', 'Estados', 'Unidos', 'y', 'Europa', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
Loading the dataset...
Extracting: 9it [00:00, 23.15it/s]
Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
coref error
coref error

Originally posted by @MariamNakhle in #16 (comment)

Installation issues

I had a lot of issues with installation; I'm noting them here in case you want to fix them or in case others run into the same ones.

  • I started by creating a Python environment: python3 -m venv muda; . muda/bin/activate
  • There are two requirements files, in the root dir and under muda/. I installed both, starting with the one under muda.
  • I had to manually install overrides==3.1.0 to fix one set of errors
  • I had to manually downgrade to pydantic==1.7.4 to fix another set of errors

After that, I was able to get the program to run; the full sequence of commands is summarized below.
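
Putting the steps above together, the sequence that worked for me was roughly the following (versions as noted above):

python3 -m venv muda
. muda/bin/activate
pip install -r muda/requirements.txt
pip install -r requirements.txt
pip install overrides==3.1.0
pip install pydantic==1.7.4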

Add basic unit tests for each phenomenon

Currently, the tagger works but can be a bit flaky and might fail on weird edge cases.

To ensure both that the code isn't broken and that it produces the expected tags, we could add unit tests that assert the tagger's output for specific input sentences we design.

These tests would almost certainly have to be language-specific, which could be a pain, so I think this is only mid-priority.
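
For example, a test along these lines (the tagger entry point here is hypothetical; the real interface may differ) could pin down an expected pronoun tag:

# test_taggers.py -- illustrative only; the Tagger API below is assumed.
from muda.tagger import Tagger

def test_es_pronoun_tag():
    tagger = Tagger(tgt_lang="es")
    src = ["She saw him yesterday."]
    tgt = ["Ella lo vio ayer."]
    tags = tagger.tag(src=src, tgt=tgt, docids=["doc1"])
    # "Ella" and "lo" should be tagged as context-dependent pronouns.
    assert "pronouns" in tags[0]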

Fix _verb_formality tagging

Currently, due to the refactor, verb formality tagging is disabled for the languages that support it.
We should revisit these functions and make them compatible.

Add new UX to interpret MuDA results

Currently, MuDA relies on compare-mt to analyse the performance of contextual models. However, this delegation both makes MuDA harder to use and brings a lot of extra information to users who just want MuDA scores.

Ideally, MuDA should have an interface that takes the tags (for both the reference and the MT system) and outputs a score.
This is connected to #2
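
As a sketch of what that interface could compute (the function and data layout are hypothetical, not MuDA's actual tag format), per-tag precision and recall reduce to comparing tagged-token sets between reference and hypothesis:

def per_tag_prf(ref_tags, hyp_tags):
    # ref_tags / hyp_tags: one set of (token, tag) pairs per sentence.
    tp = fp = fn = 0
    for ref, hyp in zip(ref_tags, hyp_tags):
        tp += len(ref & hyp)   # tags in both reference and hypothesis
        fp += len(hyp - ref)   # tags the system produced spuriously
        fn += len(ref - hyp)   # tags the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}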
