niger-volta-lti / yoruba-text

Yorùbá language training text for NLP, ASR and TTS tasks

License: GNU General Public License v3.0

Python 100.00%

african-languages natural-language-processing diacritization machine-translation training-dataset nlp yoruba tts asr nlp-datasets

yoruba-text's Introduction

Yorùbá text

This repository contains fully diacritized Yorùbá text, converted to Unicode Normalization Form C (NFC), in which each diacritized character is composed into a single precomposed code point wherever one exists. The conversion uses the following code:

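import unicodedata

def convert_to_NFC(filename, outfilename):
    # Read the input as UTF-8 and compose base letters with their diacritics.
    with open(filename, encoding='utf-8') as f:
        text = unicodedata.normalize('NFC', f.read())
    with open(outfilename, 'w', encoding='utf-8') as f:
        f.write(text)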

Sources:

Sources yet to be scraped and cleaned

Social Media sources:

Text has been gathered with permission from online sources and lightly preprocessed for use in NLP, TTS, and ASR applications. Note that some of the sentences may contain errors; please submit a pull request if you have corrections!

Resources

Bibtex

If you want to cite this repo in your work, please use:

@misc{Orife_yoruba-text_2018,
author = {Orife, Iroro and Fasubaa, Timilehin and Wahab, Olamilekan},
month = {1},
title = {{yoruba-text}},
url = {https://github.com/Niger-Volta-LTI/yoruba-text},
year = {2018}
}

yoruba-text's Issues

Add JW300 Text

Add JW300 text (for ADR & NMT pairs)

Reference: http://opus.nlpl.eu/JW300.php

$ opus_read -d JW300 -s yo -t en -wm moses -w jw300.yo jw300.en

Validate and convert new texts to NFC

Validate and convert new texts to NFC.

Use Ìrànlọ́wọ́.

Check if text is NFC (code example)
Normalize text to NFC (code example)
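
The two steps above can be sketched with the Python standard library alone. This is a minimal illustration and not the Ìrànlọ́wọ́ API; the helper names `is_nfc` and `to_nfc` are hypothetical.

```python
import unicodedata

def is_nfc(text):
    """Return True if the text is already in NFC (normalizing changes nothing)."""
    return unicodedata.normalize("NFC", text) == text

def to_nfc(text):
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)

# "é" written as a base letter plus a combining acute accent (decomposed form).
decomposed = "e\u0301"
composed = to_nfc(decomposed)

print(is_nfc(decomposed), len(decomposed))  # False 2
print(is_nfc(composed), len(composed))      # True 1
```

Some Yorùbá characters, such as ọ́, have no single precomposed code point, so their NFC form still contains a combining mark; comparing against the normalized string handles both cases.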

Scrape second Yorùbá Bible version

Currently the Bible corpus only comprises scraped text from the first version. A comparison of the first chapter of Genesis shows that the two versions differ in a way that perhaps makes them non-redundant for ADR or NMT training.

Ní ìbẹ̀rẹ̀ ohun gbogbo Ọlọ́run dá àwọn ọ̀run àti ayé. 2 Ayé sì wà ní rúdurùdu, ó sì ṣófo, òkùnkùn sì wà lójú ibú omi, Ẹ̀mí Ọlọ́run sì ń rábàbà lójú omi.

Ní ìbẹ̀rẹ̀, nígbà tí Ọlọrun dá ọ̀run ati ayé, 2 ayé rí júujùu, ó sì ṣófo. Ibú omi bo gbogbo ayé, gbogbo rẹ̀ ṣókùnkùn biribiri, ẹ̀mí Ọlọrun sì ń rábàbà lójú omi.

Experiments are necessary to avoid adding redundant text to the ADR training corpus, but the second corpus is definitely worth the scraping effort.

Provide a script to cleanly download and normalize text

Rather than the current system, where each sub-corpus lives in its own folder with its own code, create a top-level downloads.sh that can re-assemble the sub-corpora.

Separately, have the downloaded and pre-processed sub-corpora ready to be referenced from the ADR and NMT repos as submodules, etc.

Fully scrape the book of mormon

Scrape the rest of the Book of Mormon.

https://www.lds.org/study/scriptures/dc-testament/title-page?lang=yor

https://www.lds.org/study/scriptures/bofm/title-page?lang=yor

@Timilehin

Move test_yoruba_diacritic_removal.py to a utils or scripts folder

Move test_yoruba_diacritic_removal.py to a utils or scripts folder. This file needs to be better organized, since two sub-corpora depend on it to recreate their clean (vs. raw scraped) training texts.

This will give consistent access via imports to its functionality, which includes converting to NFC, stripping accents, and splitting a long sentence on a symbol such as a comma or colon.
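
As a rough sketch of what such importable utilities might look like (the function names and behavior here are assumptions, not the contents of test_yoruba_diacritic_removal.py):

```python
import re
import unicodedata

def strip_accents(text):
    """Remove every combining mark. Assumption: this drops under-dots as well
    as tone marks, which may or may not match the repository's own script."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def split_on_symbols(sentence, symbols=",:;"):
    """Split a long sentence on the given punctuation, dropping empty pieces."""
    pattern = "[" + re.escape(symbols) + "]"
    return [part.strip() for part in re.split(pattern, sentence) if part.strip()]

print(strip_accents("Yorùbá"))  # Yoruba
print(split_on_symbols("A fijó gba Awà; a fìjà gba Awà"))
```

Decomposing to NFD first means the same code works whether the input is NFC or not, since every diacritic becomes a separate combining character before filtering.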

[FIX] training and test text from Tolúwaṣẹ lang-id task

From this commit it appears that either there is overlapping text between the training and test sets (fix!) or that we did not combine the text from this repo correctly. In either case, we need to rename the text file so that we can trace its origin better.

Since we are going to shuffle all sentences anyway,

combine all available text, do sort | uniq
call it something reasonable, related to the Tolúwaṣẹ language word-id task!
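
A minimal sketch of the combine-and-deduplicate step, plus a check for train/test overlap, might look like this. The function names are hypothetical, and sorting is by Unicode code point, so the order can differ from a locale-aware `sort`.

```python
def unique_sorted_lines(paths):
    """Roughly `cat files | sort | uniq`: unique, non-empty, stripped lines."""
    lines = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            lines.update(line.strip() for line in f if line.strip())
    return sorted(lines)

def overlap(train_lines, test_lines):
    """Return the sentences that appear in both the training and test sets."""
    return sorted(set(train_lines) & set(test_lines))

# Example usage with hypothetical file names:
# combined = unique_sorted_lines(["train.txt", "test.txt"])
# leaked = overlap(unique_sorted_lines(["train.txt"]),
#                  unique_sorted_lines(["test.txt"]))
```

Normalizing every line to NFC before deduplicating would also catch sentences that differ only in how their diacritics are encoded.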

Add OSCAR corpus

Add https://oscar-corpus.com, common crawl from the BBC to the working corpus for ADR and other monolingual tasks

Language | Words original | Size original | File original | Words deduplicated | Size deduplicated | File deduplicated
Yoruba   | 8,906          | 55K           | yo.txt.gz     | 3,518              | 27K               | yo_dedup.txt.gz

Separate òwe corpus into EN & YO

Separate òwe corpus into individual EN & YO files.

The format typically follows this repeating pattern:

YO
EN1 - literal translation
EN2 - proverbial (non literal) translation
EMPTY LINE

For example:

A di gàárì sílẹ̀ ewúrẹ́ ńyọjú; ẹrù ìran rẹ̀ ni?
We prepare the saddle, and the goat presents itself; is it a burden for the lineage of goats?
Goats that know their place do not offer their backs to be saddled. This is a variant of A gbé gàárì ọmọ ewúrẹ́ ńrojú . . .

A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni?
You have been crowned a king, and yet you make good-luck charms; would you be crowned God?
Being crowned a king is about the best fortune a mortal could hope for.

A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, bí a ò bá jà, bí a bá ti gba Awà, kò tán bí?
By dancing we take possession of Awà; through fighting we take possession of Awà; if we neither dance nor fight, but take possession of Awà anyway, is the result not the same?
Why make a huge production of a matter that is easily taken care of?

How the text is separated is up to you. One idea is to have 2 files for each òwe:
yo_001.txt and en_001.txt (the latter includes both translations on 2 separate lines, as above ☝️).

Then for text preparation, we can just aggregate all the YO text. For Machine Translation, we can create a parallel YO-EN corpus by taking the literal or proverbial EN translations.
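
One possible way to parse the repeating pattern is sketched below. It assumes blocks are separated by blank lines and contain exactly three lines; anything else is set aside for manual review, since the pattern is only described as typical. The function name `parse_owe` is hypothetical.

```python
def parse_owe(text):
    """Split the corpus into (yo, literal_en, proverbial_en) triples.

    Returns (entries, irregular), where irregular holds blocks that do not
    have exactly three lines.
    """
    entries, irregular = [], []
    block = []
    # The extra "" at the end flushes the final block.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line:
            block.append(line)
            continue
        if len(block) == 3:
            entries.append(tuple(block))
        elif block:
            irregular.append(block)
        block = []
    return entries, irregular

sample = (
    "A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni?\n"
    "You have been crowned a king, and yet you make good-luck charms; "
    "would you be crowned God?\n"
    "Being crowned a king is about the best fortune a mortal could hope for.\n"
)
entries, irregular = parse_owe(sample)
yo_lines = [yo for yo, _, _ in entries]                      # monolingual YO text
literal_pairs = [(yo, lit) for yo, lit, _ in entries]        # YO-EN, literal
proverbial_pairs = [(yo, prov) for yo, _, prov in entries]   # YO-EN, proverbial
```

Writing `yo_lines` and one of the pair lists to files would give either the per-òwe layout or a single parallel corpus, depending on preference.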

Please feel free to add any questions or clarifications.

verses to bible text

Is it possible to add the verse numbers to the Bible text?

Scrape partially diacritized text

how to produce BENCHMARK_DISTRIBUTION ?

In dataset_scorer.py, what is BENCHMARK_DISTRIBUTION, and how can it be generated?
Thank you.

[CLEANUP] unclean text

Right now, OCR texts (Aaro Meta, Ogboju, etc.) suffer from errors intrinsic to the OCR process (non-Yorùbá characters, inconsistencies from sentence to sentence that don't reflect a human author's errors, etc.).

Téò fi j€ k'Áwéòjọbí é káyò rẹ̀ d€lé€?
Tòô burú ju tìẹ yìí lọ
Bòyá aya rẹ léô ṣàìsí
Bôyá ọkọ rẹ lô yúnsẹ̀ Òrìṣà

This means that cleaning the text requires line-by-line human supervision. To avoid confusing users, we will move the OCR texts into their own folders and indicate in the README that we're not using these texts YET for any tasks, i.e. work in progress 🚧 👷 🚧
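
Human review is still needed, but a simple filter could help prioritize which lines to look at first. The sketch below flags characters that are unlikely in Yorùbá text. The allowed set of combining marks is an assumption (grave, acute, macron, dot below, vertical line below), and non-ASCII punctuation such as curly quotes would also be flagged.

```python
import unicodedata

# Assumption: the combining marks expected in Yorùbá orthography.
ALLOWED_MARKS = {"\u0300", "\u0301", "\u0304", "\u0323", "\u0329"}

def suspicious_chars(line):
    """Return the set of characters in the line that look like OCR debris."""
    bad = set()
    # NFD separates base letters from their diacritics so each can be checked.
    for ch in unicodedata.normalize("NFD", line):
        if ch.isascii() or ch in ALLOWED_MARKS:
            continue
        bad.add(ch)
    return bad

print(suspicious_chars("Téò fi j€ k'Áwéòjọbí é káyò rẹ̀ d€lé€?"))  # {'€'}
```

Lines for which the function returns an empty set are not guaranteed to be correct; they only pass this particular character check.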