🪐 LiBERTus - A Multilingual Language Model for Ancient and Historical Languages

Submission to Task 1 (Constrained) of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. The system is built by first pretraining a multilingual language model and then finetuning it for a downstream task. The submissions for Phases 1 and 2 of the Shared Task are in the submission_p1 and submission_p2 directories.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using weasel run [name]. Commands are only re-run if their inputs have changed.

Command	Description
`create-pretraining`	Create corpus for multilingual LM pretraining
`create-vocab`	Train a tokenizer to create a vocabulary
`pretrain-model`	Pretrain a multilingual LM from a corpus
`pretrain-model-from-checkpoint`	Pretrain a multilingual LM from a corpus based on a checkpoint
`upload-to-hf`	Upload pretrained model and corresponding tokenizer to the HuggingFace repository
`convert-to-spacy-merged`	Convert CoNLL-U files into spaCy format for finetuning
`convert-to-spacy`	Convert CoNLL-U files into spaCy format for finetuning
`finetune-tok2vec-model`	Finetune a tok2vec model given training and validation corpora
`finetune-trf-model`	Finetune a transformer model given training and validation corpora
`finetune-with-merged-corpus`	Finetune a transformer model on the combined training and validation corpora
`package-model`	Package model and upload to HuggingFace
`evaluate-model-dev`	Evaluate a model on the validation set
`plot-figures`	Plot figures for the writeup
`setup-test`	Install models from HuggingFace via pip
`download-models-locally`	Download models from HuggingFace
`get-test-results`	Get results from the test file
`zip-results-p1`	Zip the results into a single file for submission (Phase 1)
`zip-results-p2`	Zip the results into a single file for submission (Phase 2)
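
The note that commands are only re-run if their inputs have changed can be illustrated with a small sketch. This is not Weasel's implementation; it only shows the general idea of comparing content hashes of input files against a recorded state. The helper names (`file_digest`, `needs_rerun`, `record_run`) and the `state.json` file are hypothetical and exist only for this illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return a content hash of a single input file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_state(state_file: Path) -> Dict[str, Dict[str, str]]:
    """Load recorded hashes per command, or an empty state if none exist."""
    if state_file.exists():
        return json.loads(state_file.read_text(encoding="utf-8"))
    return {}


def needs_rerun(command: str, inputs: List[Path], state_file: Path) -> bool:
    """A command needs to run if its current input hashes differ from the record."""
    current = {str(p): file_digest(p) for p in inputs}
    return load_state(state_file).get(command) != current


def record_run(command: str, inputs: List[Path], state_file: Path) -> None:
    """Store the input hashes after a command has run (hypothetical state file)."""
    state = load_state(state_file)
    state[command] = {str(p): file_digest(p) for p in inputs}
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")


# Demonstration on a throwaway directory; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.txt"
    state_file = Path(tmp) / "state.json"

    corpus.write_text("arma virumque cano", encoding="utf-8")
    first = needs_rerun("create-vocab", [corpus], state_file)   # nothing recorded yet
    record_run("create-vocab", [corpus], state_file)
    second = needs_rerun("create-vocab", [corpus], state_file)  # inputs unchanged
    corpus.write_text("arma virumque cano Troiae", encoding="utf-8")
    third = needs_rerun("create-vocab", [corpus], state_file)   # inputs changed
```

In practice this bookkeeping is handled by Weasel itself; see the Weasel documentation for how it tracks command inputs and outputs.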

⏭ Workflows

The following workflows are defined by the project. They can be executed using weasel run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

Workflow	Steps
`pretrain`	`create-pretraining` → `create-vocab` → `pretrain-model`
`finetune`	`convert-to-spacy` → `finetune-trf-model` → `evaluate-model-dev`
`experiment-merged`	`convert-to-spacy-merged` → `finetune-with-merged-corpus`
`experiment-sampling`	`create-vocab` → `pretrain-model`
`make-submission-p1`	`setup-test` → `get-test-results` → `zip-results-p1`
`make-submission-p2`	`download-models-locally` → `zip-results-p2`
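
As a rough sketch of what "run the specified commands in order" means, the snippet below expands a workflow name into the individual `weasel run` invocations listed in the table above. The `WORKFLOWS` mapping and the `expand_workflow` helper are hypothetical; only the command names and the `weasel run [name]` form come from this README. It prints the commands instead of executing them.

```python
import shlex
from typing import Dict, List

# Copied from the workflow table; only two workflows are shown for brevity.
WORKFLOWS: Dict[str, List[str]] = {
    "pretrain": ["create-pretraining", "create-vocab", "pretrain-model"],
    "finetune": ["convert-to-spacy", "finetune-trf-model", "evaluate-model-dev"],
}


def expand_workflow(name: str) -> List[List[str]]:
    """Return one `weasel run <command>` argument list per step, in order."""
    return [["weasel", "run", step] for step in WORKFLOWS[name]]


# Print the equivalent shell commands for the finetune workflow.
for argv in expand_workflow("finetune"):
    print(shlex.join(argv))
```

Running `weasel run finetune` directly is the intended route; the expansion is only meant to make the step ordering explicit.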

🗂 Assets

The following assets are defined by the project. They can be fetched by running weasel assets in the project directory.

File	Source	Description
`assets/train/`	Git	CoNLL-U training datasets for Task 0 (morphology/lemma/POS)
`assets/dev/`	Git	CoNLL-U validation datasets for Task 0 (morphology/lemma/POS)
`assets/test/`	Git	CoNLL-U test datasets for Task 0 (morphology/lemma/POS)
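
For readers unfamiliar with CoNLL-U, the following is a minimal sketch of reading the morphology, lemma, and POS columns using only the standard library. It assumes the standard ten-column CoNLL-U layout; the sample sentence is made up for illustration and is not taken from the shared task data, and the helper names (`read_conllu`, `parse_feats`) are hypothetical. The project itself converts these files to spaCy format with the `convert-to-spacy` command.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts; blank lines separate sentences."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Made-up two-token sample, built with explicit tabs to avoid whitespace issues.
SAMPLE = "\n".join([
    "# text = Gallia est",
    "\t".join(["1", "Gallia", "Gallia", "PROPN", "_", "Case=Nom|Number=Sing", "_", "_", "_", "_"]),
    "\t".join(["2", "est", "sum", "AUX", "_", "Mood=Ind|Number=Sing", "_", "_", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], parse_feats(token["feats"]))
```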

📄 Cite

If you used any of the code or the models, don't forget to cite:

@inproceedings{miranda-2024-allen,
    title = "{A}llen Institute for {AI} @ {SIGTYP} 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages",
    author = "Miranda, Lester",
    booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
    month = mar,
    year = "2024",
    address = "St. Julian's, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigtyp-1.18",
    pages = "151--159",
}
