
muda's Introduction

Multilingual Discourse-Aware (MuDA) Benchmark

The Multilingual Discourse-Aware (MuDA) Benchmark is a comprehensive suite of taggers and evaluators aimed at advancing the field of context-aware Machine Translation (MT).

Traditional translation quality metrics output uninterpretable scores and fail to accurately measure performance on context-aware discourse phenomena. MuDA takes a different direction, relying on neural syntactic and morphological analysers to measure the performance of translation models on specific words and discourse phenomena.

The MuDA taggers currently support 14 language pairs (see this directory), and new languages can easily be added.

Installation

The tagger relies on PyTorch (<1.10) to run its models. If you want to run these models, first install PyTorch; you can find instructions for your system here.

For example, to install PyTorch on a Linux system with CUDA support in a conda environment, run:

conda install pytorch==1.9.1 torchvision==0.10.1 torchaudio==0.9.1 cudatoolkit=11.3 -c pytorch -c conda-forge

Then, to install the rest of the dependencies, run:

pip install -r requirements.txt

Example Usage

To tag an existing dataset and extract the tags for later use, run the following command:

python muda/main.py \
    --src /path/to/src \
    --tgt /path/to/tgt \
    --docids /path/to/docids \
    --dump-tags /tmp/maia_ende.tags \
    --tgt-lang "$lang"

To evaluate models on a particular dataset (reporting per-tag metrics such as precision and recall), run:

python muda/main.py \
    --src /path/to/src \
    --tgt /path/to/tgt \
    --docids /path/to/docids \
    --hyps /path/to/hyps.m1 /path/to/hyps.m2 \
    --tgt-lang "$lang"

Note that MuDA relies on a docids file, which contains the same number of lines as the src/tgt files and where each line gives the document ID to which the corresponding source/target sentence belongs.
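
For example, a docids file for a three-line corpus could look like this (the sentences and document IDs are illustrative):

/path/to/src:
    Where is the ball?
    He kicked it away.
    Good morning, everyone.

/path/to/docids:
    doc1
    doc1
    doc2

Here the first two sentences belong to the same document (doc1), so the first can serve as context when tagging the second, while the third sentence starts a new document.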

muda's People

Contributors

nightingal3, coderpat, kayoyin

Stargazers

Paweł Mąka, Sweta Agrawal, Leonardo Emili, Gabriele Sarti, Matt Post

Watchers

Vlad Niculae, Graham Neubig

muda's Issues

Separate language-specific taggers into different files

Currently, the taggers for all supported languages live in the same file, muda/tagger.py. However, each language's tagger should live in a separate file, since this adds encapsulation and makes it easier to add taggers for new languages.

I suggest adding them in a new folder, e.g. muda/langs/pt_tagger.py.
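
As a sketch, each per-language file could subclass a shared base tagger (the class names and attributes below are illustrative, not the current API):

# muda/langs/pt_tagger.py
from muda.tagger import Tagger  # assumed shared base class

class PortugueseTagger(Tagger):
    lang = "pt"

    # Language-specific resources, e.g. pronoun formality classes.
    formality_classes = {"t_class": {"tu"}, "v_class": {"você"}}

A muda/langs/__init__.py registry mapping language codes to tagger classes would then let main.py pick the right tagger from --tgt-lang.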

Keep track of reference chain in MuDA

Currently, we track whether a sentence has any links to the previous sentence with booleans. If we could keep track of these references explicitly, we could automatically change them to create contrastive datasets for certain phenomena to measure a model's context sensitivity. This could be difficult to do, but would remove the dependence on non-contextual baselines.
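
As a sketch of what explicit tracking could look like (all names here are hypothetical), each link could record which token refers back to which antecedent:

from dataclasses import dataclass

@dataclass
class ReferenceLink:
    # One link in a cross-sentential reference chain (illustrative).
    phenomenon: str            # e.g. "pronouns", "lexical_cohesion"
    sent_idx: int              # index of the current sentence in the document
    token_idx: int             # token that refers back
    antecedent_sent_idx: int   # earlier sentence holding the antecedent
    antecedent_token_idx: int  # position of the antecedent token

A contrastive example could then be generated by perturbing the antecedent token and checking whether the model's translation of the referring token changes.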

Coref error

I am having some problems when running the script. I created my environment using the muda_env.yml file. When I test it on a small test document, I run into a "coref error". It seems to me that it is related to multi-token tokenisation (for example the Spanish word "al" is tokenized as "a" and "el"). See below for an example.
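
For context, this is how Stanza's multi-word token expansion behaves on Spanish contractions (a minimal sketch using the public Stanza API):

import stanza

stanza.download("es")  # fetch models on first use
nlp = stanza.Pipeline("es", processors="tokenize,mwt")
doc = nlp("Fuimos al cine.")
for token in doc.sentences[0].tokens:
    # "al" is one surface token expanded into two words, "a" + "el",
    # so character offsets no longer map one-to-one onto words.
    print(token.text, "->", [word.text for word in token.words])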

I'd be grateful for your thoughts on this!

This is the command I used: PYTHONPATH=/home/getalp/nakhlem/MuDA python muda/main.py --src my_data/text.en --tgt my_data/text.es --docids my_data/text.docids --dump-tags my_data/test_enes_muda-env-yaml.tags --tgt-lang "es"

And this is the full message:

2024-01-10 16:53:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 22.0MB/s]                                                 
2024-01-10 16:53:33 INFO: Loading these models for language: en (English):
=================================
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |
=================================

2024-01-10 16:53:33 INFO: Using device: cuda
2024-01-10 16:53:33 INFO: Loading: tokenize
2024-01-10 16:53:35 INFO: Loading: pos
2024-01-10 16:53:36 INFO: Loading: lemma
2024-01-10 16:53:36 INFO: Loading: depparse
2024-01-10 16:53:36 INFO: Done loading processors!
2024-01-10 16:53:36 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 21.0MB/s]                                                 
2024-01-10 16:53:37 WARNING: Language es package default expects mwt, which has been added
2024-01-10 16:53:38 INFO: Loading these models for language: es (Spanish):
===============================
| Processor | Package         |
-------------------------------
| tokenize  | ancora          |
| mwt       | ancora          |
| pos       | ancora_charlm   |
| lemma     | ancora_nocharlm |
| depparse  | ancora_charlm   |
===============================

2024-01-10 16:53:38 INFO: Using device: cuda
2024-01-10 16:53:38 INFO: Loading: tokenize
2024-01-10 16:53:38 INFO: Loading: mwt
2024-01-10 16:53:38 INFO: Loading: pos
2024-01-10 16:53:38 INFO: Loading: lemma
2024-01-10 16:53:38 INFO: Loading: depparse
2024-01-10 16:53:38 INFO: Done loading processors!
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Due to multiword token expansion or an alignment issue, the original text has been replaced by space-separated expanded tokens.
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Varios', 'enmascarados', 'irrumpieron', 'en', 'el', 'estudio', 'de', 'el', 'canal', 'público', 'TC', 'durante', 'una', 'emisión', ',', 'obligando', 'a', 'el', 'personal', 'a', 'tirar', 'se', 'a', 'el', 'suelo', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['A', 'el', 'menos', '10', 'personas', 'han', 'muerto', 'desde', 'que', 'el', 'lunes', 'se', 'declarara', 'el', 'estado', 'de', 'excepción', 'en', 'Ecuador', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Este', 'se', 'declaró', 'después', 'de', 'que', 'un', 'famoso', 'gánster', 'desapareciera', 'de', 'su', 'celda', 'en', 'prisión', '.', 'No', 'está', 'claro', 'si', 'el', 'incidente', 'en', 'el', 'estudio', 'de', 'televisión', 'de', 'Guayaquil', 'está', 'relacionado', 'con', 'la', 'desaparición', 'de', 'una', 'prisión', 'de', 'la', 'misma', 'ciudad', 'de', 'el', 'jefe', 'de', 'la', 'banda', 'de', 'los', 'Choneros', ',', 'Adolfo', 'Macías', 'Villamar', ',', 'o', 'Fito', ',', 'como', 'es', 'más', 'conocido', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['En', 'el', 'vecino', 'Perú', ',', 'el', 'gobierno', 'ordenó', 'el', 'despliegue', 'inmediato', 'de', 'una', 'fuerza', 'policial', 'en', 'la', 'frontera', 'para', 'evitar', 'que', 'la', 'inestabilidad', 'se', 'extienda', 'a', 'el', 'país', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Ecuador', 'es', 'uno', 'de', 'los', 'principales', 'exportadores', 'de', 'plátano', 'de', 'el', 'mundo', ',', 'pero', 'también', 'exporta', 'petróleo', ',', 'café', ',', 'cacao', ',', 'camarones', 'y', 'productos', 'pesqueros', '.', 'El', 'aumento', 'de', 'la', 'violencia', 'en', 'el', 'país', 'andino', ',', 'dentro', 'y', 'fuera', 'de', 'sus', 'prisiones', ',', 'se', 'ha', 'vinculado', 'a', 'los', 'enfrentamientos', 'entre', 'cárteles', 'de', 'la', 'droga', ',', 'tanto', 'extranjeros', 'como', 'locales', ',', 'por', 'el', 'control', 'de', 'las', 'rutas', 'de', 'la', 'cocaína', 'hacia', 'Estados', 'Unidos', 'y', 'Europa', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
Loading the dataset...
Extracting: 9it [00:00, 23.15it/s]
Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
coref error
coref error

Originally posted by @MariamNakhle in #16 (comment)

Installation issues

I had a lot of issues with installation; I'm noting them here in case you want to fix them or in case others run into the same ones.

  • I started by creating a Python environment: python3 -m venv muda; . muda/bin/activate
  • There are two requirements files, in the root dir and under muda/. I installed both, starting with the one under muda.
  • I had to manually install overrides==3.1.0 to fix one set of errors
  • I had to manually downgrade to pydantic==1.7.4 to fix another set of errors

After that, I was able to get the program to run; the full sequence of commands is summarized below.
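
Putting the steps above together, the sequence that worked for me was roughly the following (versions as noted above):

python3 -m venv muda
. muda/bin/activate
pip install -r muda/requirements.txt
pip install -r requirements.txt
pip install overrides==3.1.0
pip install pydantic==1.7.4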

Add basic unit tests for each phenomenon

Currently, the tagger works but can be a bit flaky and might fail on weird edge cases.

To ensure both that the code isn't broken and that it produces the expected tags, we could add unit tests that assert the tagger's output for specific input sentences we design.

These tests would almost certainly have to be language-specific, which could be a pain, so I think this is only mid-priority.
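
For example, a test along these lines (the tagger entry point here is hypothetical; the real interface may differ) could pin down an expected pronoun tag:

# test_taggers.py -- illustrative only; the Tagger API below is assumed.
from muda.tagger import Tagger

def test_es_pronoun_tag():
    tagger = Tagger(tgt_lang="es")
    src = ["She saw him yesterday."]
    tgt = ["Ella lo vio ayer."]
    tags = tagger.tag(src=src, tgt=tgt, docids=["doc1"])
    # "Ella" and "lo" should be tagged as context-dependent pronouns.
    assert "pronouns" in tags[0]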

Fix _verb_formality tagging

Currently, due to the refactor, verb formality tagging is disabled for the languages that support it.
We should revisit these functions and make them compatible.

Add new UX to interpret MuDA results

Currently, MuDA relies on compare-mt to analyse the performance of contextual models. However, this delegation both makes MuDA harder to use and brings a lot of extra information to users who just want MuDA scores.

Ideally, MuDA should have an interface that takes the tags (for both the reference and the MT system) and outputs a score.
This is connected to #2
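
As a sketch of what that interface could compute (the function and data layout are hypothetical, not MuDA's actual tag format), per-tag precision and recall reduce to comparing tagged-token sets between reference and hypothesis:

def per_tag_prf(ref_tags, hyp_tags):
    # ref_tags / hyp_tags: one set of (token, tag) pairs per sentence.
    tp = fp = fn = 0
    for ref, hyp in zip(ref_tags, hyp_tags):
        tp += len(ref & hyp)   # tags in both reference and hypothesis
        fp += len(hyp - ref)   # tags the system produced spuriously
        fn += len(ref - hyp)   # tags the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}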
