
abkhaz-nlp-data-pipeline's People

Contributors

bachstelze, danielinux7, plkmoi


abkhaz-nlp-data-pipeline's Issues

Automate Project board

This needs some research; GitHub Actions could probably be used.

  1. When a task is created and added/moved to the To do, In progress, or Done columns in the project, it should automatically be added to the current milestone (sprint).
  2. When a task is created and added/moved to the Backlog column in the project, it should automatically be removed from the current milestone (sprint).

The solution is a customized action:
https://github.com/danielinux7/Multilingual-Parallel-Corpus/blob/master/.github/workflows/sprint.yml
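
For illustration, a hypothetical Python sketch (not the linked workflow, which is YAML) of the API call such an automation boils down to; the repo path, issue number, and milestone number below are placeholders:

import os
import requests

API = "https://api.github.com/repos/danielinux7/Multilingual-Parallel-Corpus"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

def set_milestone(issue_number, milestone_number):
    """Pass a milestone number to assign it, or None to remove it."""
    resp = requests.patch(f"{API}/issues/{issue_number}",
                          headers=HEADERS,
                          json={"milestone": milestone_number})
    resp.raise_for_status()

set_milestone(42, 3)     # card moved to To do / In progress / Done
set_milestone(42, None)  # card moved back to Backlog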

--punctuation usage with join_corpus script

How can I use this option to filter unwanted sentences out of a parallel corpus file? Could you give an example?

  --punctuation      We use the punctuation criteria as filter in such way
                     that each translation have the same order of sentence
                     signs. The sentence signs are ".:!?0-9…()[]«»".
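
To illustrate the criterion (a minimal sketch, not the script's actual code): both sides must contain the same sentence signs in the same order.

import re

# The sign set from the help text above.
SIGNS = re.compile(r"[.:!?0-9…()\[\]«»]")

def same_signs(src, tgt):
    return SIGNS.findall(src) == SIGNS.findall(tgt)

print(same_signs("Иҟоума?", "Он есть?"))  # True: both end in "?"
print(same_signs("Иҟоуп.", "Он есть?"))   # False: "." vs "?" -> filtered out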

Multilingual dictionary parsing

It would be a good start for other low-resource languages to parse translation dictionaries.
If there are only dictionary images, then we should start with OCR support.
Which translation dictionaries could be used?

  • Adyghe OCR #16

NauLinux in Abkhazian

Description
NauLinux is a distribution with some parts localized into the Abkhazian language: GNOME, Firefox, OpenOffice.org and StarDict.

Problems
The localization exists only in this distribution, and the distribution has been discontinued.

Solution

Export all the locale files into the latest versions of the localized programs.

Tasks:

References:
http://www.linux-ink.ru/projects/

Common Voice 4

Description

Common Voice is Mozilla's initiative to help teach machines how real people speak.

Problems

Most of the data used by large corporations isn't available to the majority of people. We think that stifles innovation. That's why we've launched Common Voice, a project to help make voice recognition open and accessible to everyone.

Solution

  • 4 for recording.
  • 4 for listening.

References:
https://commonvoice.mozilla.org/ab

Abkhazian Arabic

A corpus can be made by aligning the Abkhazian translation of the Quran with the Arabic Quran.

Draft alignment

The bilingual ab-ru files in the draft could be aligned with a current translation model. The existing document-alignment tools are unmaintained, so we could also customize NLTK's alignment from scratch.
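
One possible starting point, a sketch assuming NLTK's length-based Gale-Church aligner; the sentence lists are placeholders, and a translation model could re-score the proposed pairs afterwards:

from nltk.translate import gale_church

ab_sents = ["...", "..."]  # sentence-split Abkhazian draft (placeholder)
ru_sents = ["...", "..."]  # sentence-split Russian draft (placeholder)

# Gale-Church aligns by sentence length (here: character counts).
pairs = gale_church.align_blocks([len(s) for s in ab_sents],
                                 [len(s) for s in ru_sents])
for i, j in pairs:
    print(ab_sents[i], "|||", ru_sents[j])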

Add capability to join_corpus tool

The following enhancements (a rough interface sketch follows the list):

  • Pass the dictionaries, the synonym dictionaries and the parallel corpus as command-line arguments.
  • Ability to generate paraphrases without splitting the data into train, test and validation sets.
  • Ability to import join_corpus into a Python script and use its functionality.
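
A hypothetical sketch of what the requested interface could look like; the flag names are illustrative, not the script's current ones:

import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="join_corpus.py")
    parser.add_argument("--dictionary", nargs="+", default=[],
                        help="dictionary files")
    parser.add_argument("--synonyms", nargs="+", default=[],
                        help="synonym dictionaries")
    parser.add_argument("--no-split", action="store_true",
                        help="paraphrase without train/test/validation split")
    parser.add_argument("corpus", help="parallel corpus file")
    return parser

# Importable use (third bullet): a caller builds the parser and passes
# explicit arguments instead of reading sys.argv.
args = build_parser().parse_args(["--no-split", "corpus.txt"])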

Ab-Ru corpus correction

An NMT model is quite sensitive to incorrect training data. The current filtration yields many cases which can be corrected in the original files. This way we minimize the side effects of wrong alignments or sentence tokenization, and we can use more data with less filtration.

Possible correction steps are:

  • Analyze the verbose output of the joining script.
  • Get the difference lines of the bifixed corpus (a difflib sketch follows the list).
  • Score the parallel sentences with a current translation model.
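
For the second step, a minimal sketch using difflib, assuming the original and the bifixed corpus are line-aligned text files; the file names are placeholders:

import difflib

with open("corpus.txt", encoding="utf-8") as a, \
     open("corpus.bifixed.txt", encoding="utf-8") as b:
    diff = difflib.unified_diff(a.readlines(), b.readlines(),
                                fromfile="original", tofile="bifixed")

# Only the changed lines are candidates for correction in the originals.
print("".join(diff))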

Paraphrase start and end words

The current implementation only looks for words with a surrounding space. This should be extended to other possible boundaries, such as no token at all (start or end of sentence) or sentence signs; see the sketch below.
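
A minimal sketch of the suggested fix, not the script's current code: match the word between any boundary (start/end of string, space, or a sentence sign) instead of requiring literal spaces.

import re

BOUNDARY = r"[ .:!?…()\[\]«»]"

def find_word(word, sentence):
    pattern = rf"(?:^|{BOUNDARY}){re.escape(word)}(?:$|{BOUNDARY})"
    return re.search(pattern, sentence)

print(bool(find_word("ауп", "Абас ауп.")))  # True: followed by "."
print(bool(find_word("ауп", "Абас ауп")))   # True: end of sentence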

Corpus generation with back-translation

The current generation script can't mix in back-translation data. This data could also be filtered like the human translations, but it shouldn't be used for the paraphrase generation. Furthermore, a labeling and scaling factor would be handy (a sketch follows the list below).
This development step includes:

  • generate back-translations of the Abkhazian monolingual corpus
  • implement the features in the join corpus script
  • test the corpus with a trained model
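
A minimal sketch of the labeling and scaling idea; the function name, the tag, and the scale factor are illustrative:

def mix_corpora(human_pairs, synthetic_pairs, scale=3, tag="_bt "):
    """Upsample human pairs and label back-translated sources."""
    mixed = list(human_pairs) * scale                            # scaling
    mixed += [(tag + src, tgt) for src, tgt in synthetic_pairs]  # labeling
    return mixed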

Punctuation filter

Some parallel sentences, like the Abkhazian "Иашарыла умҩа қәҵоуп." and the Russian "Кто ещё такой, как ты?", don't share the same punctuation marks. 65667b1 adds a hard punctuation filter to the join corpus script. The issue was initially described in #11.
A more sensitive implementation should be described in the sprint planning.

generate_synphrases wrong paraphrase (postfix)

I generated from ab/abaza.org:
– Ашәуа филологцәа аузыжьҭуа иреиҳау ҵараиурҭас иҳамоу Ҟарачы-Черқьессктәи аҳәынҭқарратә университет ауп, ақалақь Каачаевск иҟоу. Иахьазы абас еиԥш иҟоу акадрқәа ҳазҭо абри аҵараиурҭа мацароуп.
– Ашәуа филологцәа аузыжьҭуа иреиҳау ҵараиурҭас иҳамоу Ҟарачы-Черқьессктәи аҳәынҭқарратә университет ауп, ақалақь Каачаевск иҟоу. Иахьазы абас гыгшәыгшәа иҟоу акадрқәа ҳазҭо абри аҵараиурҭа мацароуп.
The paraphrase is incorrect:
"like": еиԥш, -ҵас, -шәа (-ҵас and -шәа are postfixes)
гыгшәы́гшәа: like an animal (гыгшәы́г + -шәа)

Wordnet compatibility

We are currently using language-specific formats for Abkhazian and Russian. If we transform them into the WordNet format, then we could use our tools for other languages.
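
For illustration, this is what the format buys us through NLTK's WordNet interface (English here as a stand-in; an Abkhazian or Russian wordnet would plug into the same API):

from nltk.corpus import wordnet as wn

def synonyms(word):
    """Collect synonym lemmas across all synsets of a word."""
    return {lemma.name() for synset in wn.synsets(word)
            for lemma in synset.lemmas()}

print(synonyms("corpus"))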

Set up translation API server

Set up a translation API server on GCS and hook it to the web interface.
Either checkpoints or saved models should be used!

Text search

  • Found a text from the pre-election campaign for deputies, 01.06.2021.

Training methods to enhance NMT models

Possible methods to enhance NMT models:

  • Domain adaptation: tagging the beginning of the source sentence with its domain tag, e.g. _news, _politics, _bible, _quran, _other (sketched below).
  • Different punctuation filtering of source and target sentences using the join_corpus tool.
  • Using sentencepiece dropout (sketched below): https://github.com/google/sentencepiece#subword-regularization-and-bpe-dropout
  • Using paraphrases: the script utils/parallel_paraphrasing.py can generate the plain paraphrases.
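
A minimal sketch of the first and third items; the model file name is a placeholder:

import sentencepiece as spm

# Domain adaptation: prepend the domain tag to the source sentence.
def tag_source(sentence, domain):
    return f"_{domain} {sentence}"

# Subword regularization: sample a different segmentation each epoch
# (see the linked sentencepiece docs).
sp = spm.SentencePieceProcessor(model_file="ab.model")  # placeholder
pieces = sp.encode(tag_source("Аԥсны", "news"), out_type=str,
                   enable_sampling=True, alpha=0.1, nbest_size=-1)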

Common Voice 16

Description

Common Voice is Mozilla's initiative to help teach machines how real people speak.

Problems

Most of the data used by large corporations isn't available to the majority of people. We think that stifles innovation. That's why we've launched Common Voice, a project to help make voice recognition open and accessible to everyone.

Solution

  • 16 for recording.
  • 16 for listening.

References:
https://commonvoice.mozilla.org/ab

Enlarge the corpus with a Russian-Abkhazian dictionary

We can read the PDF file with pdfminer and extract the text after installing it with pip3 install pdfminer.six:

import pickle
from pdfminer import high_level

# Extracting the text will take some minutes.
text = high_level.extract_text('Russian-abkhazian_dictionary.pdf')

# Save the text to a pickle so we extract it only once.
with open("extracted_text.p", "wb") as f:
    pickle.dump(text, f)

The extracted text can now easily be loaded with:

import pickle

# Load the dictionary text back from the pickle file.
with open("extracted_text.p", "rb") as f:
    extracted = pickle.load(f)

# TODO adapt the parsing to the dictionary structure
lines = extracted.splitlines()

# TODO parse the lines into parallel text
for line in lines:
    print(line)

FileNotFoundError in join_corpus script

When using my own corpus.txt file with the command-line example, I get the following error:

cd Multilingual-Parallel-Corpus/tools
python3 join_corpus.py --dictionary --paraphrase 1 0.7 2.25 10 50 1 0 0 /content/corpus.txt
[Errno 2] No such file or directory: 'Multilingual-Parallel-Corpus/tools'
/content/Multilingual-Parallel-Corpus/tools
Traceback (most recent call last):
  File "join_corpus.py", line 302, in <module>
    ab_text_train = io.open(folder+current_date+'_corpus_abkhaz.train',"w+", encoding="utf-8")
FileNotFoundError: [Errno 2] No such file or directory: 'joined_translation_data/08-25-2020_corpus_abkhaz.train'
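
A likely cause (my assumption from the traceback): the script writes into a joined_translation_data/ folder relative to the working directory, and that folder doesn't exist yet. Creating it before running the script avoids the error:

import os
os.makedirs("joined_translation_data", exist_ok=True)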

Possible resources

We want to crawl a bigger corpus once we have a usable translation model. There are many possible websites which we could use:

The questions about quality and technical problems are still open and should be investigated during the crawling process.

R&D

  • Studying the Abkhazian language.
  • Studying TensorFlow.

Replace the Latin letters with their Cyrillic equivalents !important

I came across a serious issue in the dictionary; the rest of the corpus should also be checked.
In the Abkhazian dictionary text, the Latin letter a (U+0041, U+0061) is used instead of the Cyrillic letter а (U+0410, U+0430).
All letters that look like Latin letters should be checked and, if needed, replaced with their Cyrillic equivalents!
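
A minimal sketch of such a replacement; the homoglyph map below is illustrative, not exhaustive:

# Replace common Latin homoglyphs with their Cyrillic equivalents.
HOMOGLYPHS = str.maketrans({
    "a": "а", "A": "А",  # U+0061 -> U+0430, U+0041 -> U+0410
    "e": "е", "E": "Е",
    "o": "о", "O": "О",
    "p": "р", "P": "Р",
    "c": "с", "C": "С",
    "x": "х", "X": "Х",
})

def fix_homoglyphs(text):
    return text.translate(HOMOGLYPHS)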

Training Ru-Ab model

  • Filter out text by scoring the parallel corpus and the synthetic one.
  • Separate the high-quality and low-quality parallel corpus based on punctuation matching.
  • Shared sentencepiece for ru and ab (a training sketch follows the list).

The first and second subtasks are back in the backlog.
The third subtask has been done; testing is needed on the new BLEU data.
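
For the third subtask, a minimal sketch of training one shared sentencepiece model on both sides; the file names and vocabulary size are placeholders:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.ru,corpus.ab",  # placeholders: one file per language
    model_prefix="ru_ab_shared",
    vocab_size=8000,
)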

MASS training

Pretraining a whole transformer encoder and decoder with MASS (masked sequence-to-sequence pretraining).

Python script to identify mismatched punctuation

Problem Description
For Neural machine translation, it is important to have identical punctuation marks on both source and target text.

Solution
A Python script to which I can pass a TSV file; when I run it, I should get back two TSV files: one with the text that has identical punctuation, and a second with the text that has mismatched punctuation.

Resources
This code could be helpful to look at: 65667b1
You can try one of the TSV files in the ab-ru folder.
Once the script is ready, it should be pushed to the utils folder in this repository.
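
A minimal sketch of the requested script, assuming a two-column TSV (source, target) and the sign set from the join_corpus help text; the output file names are illustrative:

import csv
import re
import sys

SIGNS = re.compile(r"[.:!?0-9…()\[\]«»]")

def signs(text):
    return SIGNS.findall(text)

with open(sys.argv[1], encoding="utf-8") as tsv, \
     open("matched.tsv", "w", encoding="utf-8") as ok, \
     open("mismatched.tsv", "w", encoding="utf-8") as bad:
    for row in csv.reader(tsv, delimiter="\t"):
        if len(row) < 2:
            continue
        # Same punctuation marks in the same order on both sides.
        out = ok if signs(row[0]) == signs(row[1]) else bad
        out.write("\t".join(row) + "\n")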

Sponsor

Research who could sponsor a powerful computer for our projects.
https://lambdalabs.com/gpu-workstations/vector/customize
Possible sponsors:

  1. https://www.undp.org/
  2. https://en.unesco.org/

Usage:
Train NMT models released under the public CC0 license; the focus is on low-resource languages.
Current language pair (ab-ru): https://www.kaggle.com/nartaa/abrutemp
Grant_CSSP.zip

Update:
I applied for a grant at UNDP and am waiting for a response; the resources will also be directed to enlarging the text corpus.

Testing the integrity of the ab-ru Transformer model

I keep updating the model on Kaggle, both the data I am using and the latest model; here are the links:
https://www.kaggle.com/nartaa/ab-ru-transformer
https://www.kaggle.com/nartaa/abrutransformer
https://www.kaggle.com/nartaa/abrudata

This needs testing and QA; could you double-check my work?
Here is the Python code I am using to detokenize:

import io
import re
from mosestokenizer import MosesDetokenizer

# Read the SentencePiece-tokenized model output.
with io.open("/content/tgt-test.txt", "r", encoding="utf-8") as in_pre:
    pre_list = in_pre.readlines()

# Undo SentencePiece: drop the spaces between pieces, then turn the
# "▁" word-boundary markers back into spaces.
for i, item in enumerate(pre_list):
    pre_list[i] = re.sub(r" ", "", item)
    pre_list[i] = re.sub(r"▁", " ", pre_list[i])

# Moses-detokenize and lowercase every line.
out_pre = ""
with MosesDetokenizer("ru") as detokenize:
    for item in pre_list:
        out_pre += detokenize(item.strip().split(" ")).lower() + "\n"

with open("/content/tgt-test-dec.txt", "w", encoding="utf-8") as f_pre:
    f_pre.writelines(out_pre)

Dictionary lists for rare words

The rare-word counting from #32 should be usable for the dictionary lists: we would calculate the common words after the paraphrasing and construct dictionary lists for the remaining rare words. A counting sketch follows.
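
A minimal sketch of the counting step; the corpus file name and the frequency threshold are illustrative:

from collections import Counter

with open("corpus.ab", encoding="utf-8") as f:
    counts = Counter(f.read().split())

# Words below the threshold go into the dictionary lists.
rare_words = [word for word, count in counts.items() if count < 3]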

The readable writing of paraphrases

The output of the paraphrase generation isn't readable for larger files. A good format would be:

One original sentence...
   one initial word --> exchanged word
   another word --> swapped word

Another primary string...
   one initial word --> changed word
   some other word --> switched word
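
A minimal sketch of a writer for this format, assuming each record is an original sentence plus its (old word, new word) swaps; the names are illustrative:

def write_readable(records, path):
    with open(path, "w", encoding="utf-8") as out:
        for sentence, swaps in records:
            out.write(sentence + "\n")
            for old, new in swaps:
                out.write(f"   {old} --> {new}\n")
            out.write("\n")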

generate_synphrases: fewer sentences than count

I added a few lines of code to generate an output file, but the paraphrase count printed on the terminal and the number of lines in the file differ.
I tried ab/abaza.org; I got an output file with 2428 lines, but a paraphrase count of 3091.
