yoruba-text's Introduction

Yorùbá text

This repository contains fully diacritized Yorùbá text converted to Unicode Normalization Form C (NFC, canonical composition), in which a base letter and its diacritics are composed into a single code point wherever a precomposed form exists. The conversion uses the following code:

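import unicodedata

def convert_to_NFC(filename, outfilename):
    # Read the input as UTF-8 and compose letters with their diacritics (NFC)
    with open(filename, encoding='utf-8') as f:
        text = unicodedata.normalize('NFC', f.read())
    with open(outfilename, 'w', encoding='utf-8') as f:
        f.write(text)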
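To illustrate what NFC does to Yorùbá characters, here is a minimal sketch. The helper name to_nfc is an assumption made for this example and is not part of the repository. It shows that some letters collapse to one code point, while others keep a combining tone mark because Unicode has no precomposed form for them.

```python
import unicodedata

def to_nfc(text):
    """Return `text` in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)

# "e" + combining acute has a precomposed form, so NFC yields one code point.
decomposed_e = "e\u0301"

# "o" + combining acute + combining dot below: NFC reorders the marks and
# composes "o" with the dot below (U+1ECD). The acute stays a combining mark
# because Unicode has no precomposed "o with dot below and acute".
decomposed_o = "o\u0301\u0323"

print([hex(ord(c)) for c in to_nfc(decomposed_e)])  # ['0xe9']
print([hex(ord(c)) for c in to_nfc(decomposed_o)])  # ['0x1ecd', '0x301']
```

A practical consequence is that the length of an NFC string does not always equal the number of visible letters, so code that counts or indexes characters should not assume one code point per letter.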

Sources:

Sources yet to be scraped and cleaned

Social Media sources:

Text has been gathered with permission from online sources and lightly preprocessed for use in NLP, TTS, and ASR applications. Note that some of the sentences may contain errors. Please submit a pull request if you have corrections!

Resources

Bibtex

If you want to cite this repo in your work, please use:

@misc{Orife_yoruba-text_2018,
author = {Orife, Iroro and Fasubaa, Timilehin and Wahab, Olamilekan},
month = {1},
title = {{yoruba-text}},
url = {https://github.com/Niger-Volta-LTI/yoruba-text},
year = {2018}
}

yoruba-text's People

Contributors

dadelani, olamyy, ruohoruotsi, timilehin

yoruba-text's Issues

Provide a script to cleanly download and normalize text

Rather than the current system, in which each sub-corpus lives in its own folder with its own code, create a top-level downloads.sh that can reassemble the sub-corpora.

Separately, have the downloaded and pre-processed sub-corpora ready to be referenced from the ADR and NMT repos as submodules, etc.

[CLEANUP] unclean text

Right now, the OCR texts (Aaro Meta, Ogboju, etc.) suffer from errors intrinsic to the OCR process: non-Yorùbá characters, and sentence-to-sentence inconsistencies that do not reflect a human author's errors. For example:

Téò fi j€ k'Áwéòjọbí é káyò rẹ̀ d€lé€?
Tòô burú ju tìẹ yìí lọ
Bòyá aya rẹ léô ṣàìsí
Bôyá ọkọ rẹ lô yúnsẹ̀ Òrìṣà

This means that cleaning the text requires line-by-line human supervision. To avoid confusing users, we will move the OCR texts into their own folders and indicate in the README that we are not YET using these texts for any tasks, i.e. they are a work in progress 🚧 👷 🚧
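Human review could be prioritized by first flagging lines that contain characters unlikely to occur in Yorùbá text. The sketch below is only an illustration: the function name suspicious_chars and the allow-lists are assumptions, and the allow-lists are deliberately small (loanwords and proper names may legitimately use other letters). It cannot replace line-by-line checking, since OCR errors that produce valid characters go undetected.

```python
import unicodedata

# Assumption: a small allow-list for illustration; extend it for real data.
ALLOWED_BASE = set("abdefghijklmnoprstuwy")
# Combining grave, acute, macron, dot below, vertical line below.
ALLOWED_MARKS = {"\u0300", "\u0301", "\u0304", "\u0323", "\u0329"}
ALLOWED_OTHER = set(" .,;:?!'\"-()0123456789\n")

def suspicious_chars(line):
    """Return the characters in `line` that fall outside the allow-lists."""
    bad = []
    # NFD splits each letter into its base letter and combining marks,
    # so an unexpected mark such as a circumflex shows up on its own.
    for ch in unicodedata.normalize("NFD", line):
        if ch.lower() in ALLOWED_BASE or ch in ALLOWED_MARKS or ch in ALLOWED_OTHER:
            continue
        bad.append(ch)
    return bad

for line in ["Tòô burú ju tìẹ yìí lọ", "Ní ìbẹ̀rẹ̀ ohun gbogbo"]:
    flagged = suspicious_chars(line)
    if flagged:
        print("REVIEW:", line, [hex(ord(c)) for c in flagged])
```

With the first sample line, the circumflex on "ô" is reported as U+0302, while the second line passes without any flags.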

[FIX] training and test text from Tolúwaṣẹ lang-id task

From this commit, it appears that either the training and test text overlap (fix!) or we did not combine the text from this repo correctly. In either case, we need to rename the text file so that we can trace its origin better.

Since we are going to shuffle all sentences anyway,

  • combine all available text, do sort | uniq
  • call it something reasonable, related to the Tolúwaṣẹ language word-id task!
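The "combine, then sort | uniq" step, plus a check for train/test overlap, could be sketched in Python as follows. The function names merge_unique_lines and overlap are hypothetical, and normalizing to NFC before comparing is an assumption, made so that visually identical sentences with different encodings count as duplicates.

```python
import unicodedata

def merge_unique_lines(*corpora):
    """Combine iterables of lines, NFC-normalize, drop blanks and duplicates, sort."""
    seen = set()
    for lines in corpora:
        for line in lines:
            line = unicodedata.normalize("NFC", line.strip())
            if line:
                seen.add(line)
    # Sorted by code point, which may differ from a locale-aware `sort`.
    return sorted(seen)

def overlap(train_lines, test_lines):
    """Return the sentences that appear in both the training and test text."""
    train = set(merge_unique_lines(train_lines))
    test = set(merge_unique_lines(test_lines))
    return sorted(train & test)

train = ["b\n", "a\n", "a\n"]
test = ["a", "c"]
print(merge_unique_lines(train, test))  # ['a', 'b', 'c']
print(overlap(train, test))             # ['a']
```

Running overlap on the existing training and test files would show whether the two sets really share sentences before anything is merged or renamed.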

Move test_yoruba_diacritic_removal.py to a utils or scripts folder

Move test_yoruba_diacritic_removal.py to a utils or scripts folder. The file needs to be better organized because two sub-corpora depend on it to recreate their clean (vs. raw scraped) training texts.

This will give consistent access, via imports, to its functionality, which includes converting to NFC, stripping accents, and splitting a long sentence on a symbol such as a comma or colon.
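For readers unfamiliar with these operations, here is a rough sketch of what such importable utilities might look like. It is not the contents of test_yoruba_diacritic_removal.py. The names strip_accents and split_on_punctuation, and the default symbol set, are assumptions for illustration.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove all combining marks (tone marks and dots below) from `text`."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers non-spacing marks such as acute, grave, dot below.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def split_on_punctuation(sentence, symbols=",:;"):
    """Split a long sentence on any of `symbols`, dropping empty pieces."""
    parts = re.split("[" + re.escape(symbols) + "]", sentence)
    return [p.strip() for p in parts if p.strip()]

print(strip_accents("Yorùbá"))           # Yoruba
print(split_on_punctuation("a, b: c"))   # ['a', 'b', 'c']
```

Note that this version of strip_accents also removes the dot below, so ẹ, ọ and ṣ become e, o and s. Whether that is desired depends on the task.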

Separate òwe corpus into EN & YO

Separate òwe corpus into individual EN & YO files.

The format typically follows this repeating pattern:

YO
EN1 - literal translation
EN2 - proverbial (non literal) translation
EMPTY LINE

For example:

A di gàárì sílẹ̀ ewúrẹ́ ńyọjú; ẹrù ìran rẹ̀ ni?
We prepare the saddle, and the goat presents itself; is it a burden for the lineage of goats?
Goats that know their place do not offer their backs to be saddled. This is a variant of A gbé gàárì ọmọ ewúrẹ́ ńrojú . . .

A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni?
You have been crowned a king, and yet you make good-luck charms; would you be crowned God?
Being crowned a king is about the best fortune a mortal could hope for.

A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, bí a ò bá jà, bí a bá ti gba Awà, kò tán bí?
By dancing we take possession of Awà; through fighting we take possession of Awà; if we neither dance nor fight, but take possession of Awà anyway, is the result not the same?
Why make a huge production of a matter that is easily taken care of?

How the text is separated is up to you. One idea is to have two files for each òwe: yo_001.txt and en_001.txt (the latter holding both translations on two separate lines, as above ☝️).

Then for text preparation, we can just aggregate all the YO text. For Machine Translation, we can create a parallel YO-EN corpus by taking the literal or proverbial EN translations.
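As a starting point, the repeating YO / EN1 / EN2 / empty-line pattern could be parsed with a sketch like the one below. The function name parse_owe and the dictionary keys are hypothetical. Because the format is only "typically" followed, blocks that do not have exactly three lines are returned separately for manual inspection rather than guessed at.

```python
import re

def parse_owe(text):
    """Split the òwe corpus into records; return (records, malformed_blocks)."""
    records, malformed = [], []
    # Entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) == 3:
            records.append({
                "yo": lines[0],
                "en_literal": lines[1],
                "en_proverbial": lines[2],
            })
        elif lines:
            malformed.append(lines)
    return records, malformed

sample = "YO1\nLIT1\nPROV1\n\nYO2\nLIT2\nPROV2\n\nORPHAN\n"
records, malformed = parse_owe(sample)

# Monolingual YO text for text preparation.
yo_lines = [r["yo"] for r in records]
# Parallel YO-EN pairs for machine translation, using the literal translation.
parallel = [(r["yo"], r["en_literal"]) for r in records]
print(yo_lines, parallel, malformed)
```

From these records it is straightforward to write per-proverb files or two aligned files, whichever layout is chosen.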

Please feel free to add any questions or clarifications.

Scrape second Yorùbá Bible version

Currently the Bible corpus comprises only text scraped from the first version. Judging by the first chapter of Genesis, the two versions differ in ways that may make them non-redundant for ADR or NMT training.

Ní ìbẹ̀rẹ̀ ohun gbogbo Ọlọ́run dá àwọn ọ̀run àti ayé. 2 Ayé sì wà ní rúdurùdu, ó sì ṣófo, òkùnkùn sì wà lójú ibú omi, Ẹ̀mí Ọlọ́run sì ń rábàbà lójú omi.
Ní ìbẹ̀rẹ̀, nígbà tí Ọlọrun dá ọ̀run ati ayé, 2 ayé rí júujùu, ó sì ṣófo. Ibú omi bo gbogbo ayé, gbogbo rẹ̀ ṣókùnkùn biribiri, ẹ̀mí Ọlọrun sì ń rábàbà lójú omi.
  1. https://www.bible.com/versions/911-ycb-bibeli-mimo-ni-ede-yoruba-de-ni
  2. https://www.bible.com/versions/207-bm-yoruba-bible

Experiments are needed to avoid adding redundant text to the ADR training corpus, but the second corpus is definitely worth the scraping effort.
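One simple, admittedly crude way to start such an experiment is to measure how many word types two aligned verses share. The sketch below uses Jaccard similarity over token sets. The function names token_set and jaccard are assumptions, and this metric is just one possible choice, not a method prescribed by this repo.

```python
import unicodedata

def token_set(text):
    """Lowercase, NFC-normalize, and split `text` into a set of word tokens."""
    text = unicodedata.normalize("NFC", text).lower()
    return {tok.strip(".,;:?!") for tok in text.split()} - {""}

def jaccard(a, b):
    """Jaccard similarity of the token sets of `a` and `b` (1.0 = identical sets)."""
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

verse_v1 = "Ní ìbẹ̀rẹ̀ ohun gbogbo Ọlọ́run dá àwọn ọ̀run àti ayé."
verse_v2 = "Ní ìbẹ̀rẹ̀, nígbà tí Ọlọrun dá ọ̀run ati ayé,"
print(round(jaccard(verse_v1, verse_v2), 2))
```

Because tokens are compared with their diacritics, spellings such as Ọlọ́run and Ọlọrun count as different words, which is relevant for ADR since the two versions mark tone differently.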

Add OSCAR corpus

Add the OSCAR corpus (https://oscar-corpus.com), Common Crawl text from the BBC, to the working corpus for ADR and other monolingual tasks.

Language | Words original | Size original | File original | Words deduplicated | Size deduplicated | File deduplicated
Yoruba   | 8,906          | 55K           | yo.txt.gz     | 3,518              | 27K               | yo_dedup.txt.gz
