
abkhaz-nlp-data-pipeline's People

Contributors

bachstelze, danielinux7, plkmoi


abkhaz-nlp-data-pipeline's Issues

Automate Project board

This needs some research; GitHub Actions could probably be used.

  1. When a task is created and added/moved to the To do, In progress, or Done columns in the project, it should automatically be added to the current milestone (sprint).
  2. When a task is created and added/moved to the Backlog column in the project, it should automatically be removed from the current milestone (sprint).

The solution is a customized action:
https://github.com/danielinux7/Multilingual-Parallel-Corpus/blob/master/.github/workflows/sprint.yml
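
For illustration, a hypothetical Python sketch (not the linked workflow, which is YAML) of the API call such an automation boils down to; the repo path, issue number, and milestone number below are placeholders:

import os
import requests

API = "https://api.github.com/repos/danielinux7/Multilingual-Parallel-Corpus"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

def set_milestone(issue_number, milestone_number):
    """Pass a milestone number to assign it, or None to remove it."""
    resp = requests.patch(f"{API}/issues/{issue_number}",
                          headers=HEADERS,
                          json={"milestone": milestone_number})
    resp.raise_for_status()

set_milestone(42, 3)     # card moved to To do / In progress / Done
set_milestone(42, None)  # card moved back to Backlog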

--punctuation usage with join_corpus script

How can I use this option to filter unwanted sentences out of a parallel corpus file? Could you give an example?

  --punctuation      We use the punctuation criteria as filter in such way
                     that each translation have the same order of sentence
                     signs. The sentence signs are ".:!?0-9…()[]«»".
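
To illustrate the criterion (a minimal sketch, not the script's actual code): both sides must contain the same sentence signs in the same order.

import re

# The sign set from the help text above.
SIGNS = re.compile(r"[.:!?0-9…()\[\]«»]")

def same_signs(src, tgt):
    return SIGNS.findall(src) == SIGNS.findall(tgt)

print(same_signs("Иҟоума?", "Он есть?"))  # True: both end in "?"
print(same_signs("Иҟоуп.", "Он есть?"))   # False: "." vs "?" -> filtered out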

Multilingual dictionary parsing

It would be a good start for other low-resource languages to parse translation dictionaries.
If there are only dictionary images, then we should start with OCR support.
Which translation dictionaries could be used?

  • Adyghe OCR #16

NauLinux in Abkhazian

Description
NauLinux is a distribution with some parts localized into the Abkhazian language: GNOME, Firefox, OpenOffice.org and StarDict.

Problems
The localization exists only in this distribution, and the distribution has been discontinued.

Solution

Export all the locale files into the latest versions of the localized programs.

Tasks:

References:
http://www.linux-ink.ru/projects/

Common Voice 4

Description

Common Voice is Mozilla's initiative to help teach machines how real people speak.

Problems

Most of the data used by large corporations isn't available to the majority of people. We think that stifles innovation. That's why we've launched Common Voice, a project to help make voice recognition open and accessible to everyone.

Solution

  • 4 for recording.
  • 4 for listening.

References:
https://commonvoice.mozilla.org/ab

Abkhazian Arabic

A corpus can be made by aligning the Abkhazian translation of the Quran with the Arabic Quran.

Draft alignment

The bilingual ab-ru files in the draft could be aligned with a current translation model. The existing document-alignment tools are unmaintained, so we could also customize NLTK's alignment from scratch.
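
One possible starting point, a sketch assuming NLTK's length-based Gale-Church aligner; the sentence lists are placeholders, and a translation model could re-score the proposed pairs afterwards:

from nltk.translate import gale_church

ab_sents = ["...", "..."]  # sentence-split Abkhazian draft (placeholder)
ru_sents = ["...", "..."]  # sentence-split Russian draft (placeholder)

# Gale-Church aligns by sentence length (here: character counts).
pairs = gale_church.align_blocks([len(s) for s in ab_sents],
                                 [len(s) for s in ru_sents])
for i, j in pairs:
    print(ab_sents[i], "|||", ru_sents[j])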

Add capability to join_corpus tool

The following enhancements (a rough interface sketch follows the list):

  • Pass the dictionaries, the synonym dictionaries and the parallel corpus as command-line arguments.
  • Ability to generate paraphrases without splitting the data into train, test and validation sets.
  • Ability to import join_corpus into a Python script and use its functionality.
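
A hypothetical sketch of what the requested interface could look like; the flag names are illustrative, not the script's current ones:

import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="join_corpus.py")
    parser.add_argument("--dictionary", nargs="+", default=[],
                        help="dictionary files")
    parser.add_argument("--synonyms", nargs="+", default=[],
                        help="synonym dictionaries")
    parser.add_argument("--no-split", action="store_true",
                        help="paraphrase without train/test/validation split")
    parser.add_argument("corpus", help="parallel corpus file")
    return parser

# Importable use (third bullet): a caller builds the parser and passes
# explicit arguments instead of reading sys.argv.
args = build_parser().parse_args(["--no-split", "corpus.txt"])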

Ab-Ru corpus correction

An NMT model is quite sensitive to incorrect training data. The current filtration yields many cases which can be corrected in the original files. This way we minimize the side effects of wrong alignments or sentence tokenization, and we can use more data with less filtration.

Possible correction steps are:

  • Analyze the verbose output of the joining script.
  • Get the difference lines of the bifixed corpus (a difflib sketch follows the list).
  • Score the parallel sentences with a current translation model.
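
For the second step, a minimal sketch using difflib, assuming the original and the bifixed corpus are line-aligned text files; the file names are placeholders:

import difflib

with open("corpus.txt", encoding="utf-8") as a, \
     open("corpus.bifixed.txt", encoding="utf-8") as b:
    diff = difflib.unified_diff(a.readlines(), b.readlines(),
                                fromfile="original", tofile="bifixed")

# Only the changed lines are candidates for correction in the originals.
print("".join(diff))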

Paraphrase start and end words

The current implementation only looks for words with a surrounding space. This should be extended to other possible boundaries, such as no token at all (start or end of sentence) or sentence signs; see the sketch below.
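
A minimal sketch of the suggested fix, not the script's current code: match the word between any boundary (start/end of string, space, or a sentence sign) instead of requiring literal spaces.

import re

BOUNDARY = r"[ .:!?…()\[\]«»]"

def find_word(word, sentence):
    pattern = rf"(?:^|{BOUNDARY}){re.escape(word)}(?:$|{BOUNDARY})"
    return re.search(pattern, sentence)

print(bool(find_word("ауп", "Абас ауп.")))  # True: followed by "."
print(bool(find_word("ауп", "Абас ауп")))   # True: end of sentence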

Corpus generation with back-translation

The current generation script can't mix in back-translation data. This data could also be filtered like the human translations, but it shouldn't be used for the paraphrase generation. Furthermore, a labeling and scaling factor would be handy (a sketch follows the list below).
This development step includes:

  • generate back-translations of the Abkhazian monolingual corpus
  • implement the features in the join corpus script
  • test the corpus with a trained model
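
A minimal sketch of the labeling and scaling idea; the function name, the tag, and the scale factor are illustrative:

def mix_corpora(human_pairs, synthetic_pairs, scale=3, tag="_bt "):
    """Upsample human pairs and label back-translated sources."""
    mixed = list(human_pairs) * scale                            # scaling
    mixed += [(tag + src, tgt) for src, tgt in synthetic_pairs]  # labeling
    return mixed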

Punctuation filter

Some parallel sentences, like the Abkhazian "Иашарыла умҩа қәҵоуп." and the Russian "Кто ещё такой, как ты?", don't share the same punctuation marks. 65667b1 adds a hard punctuation filter to the join corpus script. The issue was initially described in #11.
A more sensitive implementation should be described in the sprint planning.

generate_synphrases wrong paraphrase (postfix)

I generated from ab/abaza.org:
– Ашәуа филологцәа аузыжьҭуа иреиҳау ҵараиурҭас иҳамоу Ҟарачы-Черқьессктәи аҳәынҭқарратә университет ауп, ақалақь Каачаевск иҟоу. Иахьазы абас еиԥш иҟоу акадрқәа ҳазҭо абри аҵараиурҭа мацароуп.
– Ашәуа филологцәа аузыжьҭуа иреиҳау ҵараиурҭас иҳамоу Ҟарачы-Черқьессктәи аҳәынҭқарратә университет ауп, ақалақь Каачаевск иҟоу. Иахьазы абас гыгшәыгшәа иҟоу акадрқәа ҳазҭо абри аҵараиурҭа мацароуп.
The paraphrase is incorrect:
"like": еиԥш, -ҵас, -шәа (-ҵас and -шәа are postfixes)
гыгшәы́гшәа: like an animal (гыгшәы́г + -шәа)

Wordnet compatibility

We are currently using language-specific formats for Abkhazian and Russian. If we transform them into the WordNet format, then we could use our tools for other languages.
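
For illustration, this is what the format buys us through NLTK's WordNet interface (English here as a stand-in; an Abkhazian or Russian wordnet would plug into the same API):

from nltk.corpus import wordnet as wn

def synonyms(word):
    """Collect synonym lemmas across all synsets of a word."""
    return {lemma.name() for synset in wn.synsets(word)
            for lemma in synset.lemmas()}

print(synonyms("corpus"))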

Set up translation API server

Set up a translation API server on GCS and hook it to the web interface.
Either checkpoints or saved models should be used!

Text search

  • Found a text from the pre-election campaign for deputies, 01.06.2021.

Training methods to enhance NMT models

Possible methods to enhance NMT models:

  • Domain adaptation: tagging the beginning of the source sentence with its domain tag, e.g. _news, _politics, _bible, _quran, _other (sketched below).
  • Different punctuation filtering of source and target sentences using the join_corpus tool.
  • Using sentencepiece dropout (sketched below): https://github.com/google/sentencepiece#subword-regularization-and-bpe-dropout
  • Using paraphrases: the script utils/parallel_paraphrasing.py can generate the plain paraphrases.
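
A minimal sketch of the first and third items; the model file name is a placeholder:

import sentencepiece as spm

# Domain adaptation: prepend the domain tag to the source sentence.
def tag_source(sentence, domain):
    return f"_{domain} {sentence}"

# Subword regularization: sample a different segmentation each epoch
# (see the linked sentencepiece docs).
sp = spm.SentencePieceProcessor(model_file="ab.model")  # placeholder
pieces = sp.encode(tag_source("Аԥсны", "news"), out_type=str,
                   enable_sampling=True, alpha=0.1, nbest_size=-1)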

Common Voice 16

Description

Common Voice is Mozilla's initiative to help teach machines how real people speak.

Problems

Most of the data used by large corporations isn't available to the majority of people. We think that stifles innovation. That's why we've launched Common Voice, a project to help make voice recognition open and accessible to everyone.

Solution

  • 16 for recording.
  • 16 for listening.

References:
https://commonvoice.mozilla.org/ab

Enlarge the corpus with a Russian-Abkhazian dictionary

We can read the PDF file with pdfminer and extract the text after installing it with pip3 install pdfminer.six:

import pickle
from pdfminer import high_level

# Extracting the text will take some minutes.
text = high_level.extract_text('Russian-abkhazian_dictionary.pdf')

# Save the text to a pickle so we extract it only once.
with open("extracted_text.p", "wb") as f:
    pickle.dump(text, f)

The extracted text can now easily be loaded with:

import pickle

# Load the dictionary text back from the pickle file.
with open("extracted_text.p", "rb") as f:
    extracted = pickle.load(f)

# TODO adapt the parsing to the dictionary structure
lines = extracted.splitlines()

# TODO parse the lines into parallel text
for line in lines:
    print(line)

FileNotFoundError in join_corpus script

When using my own corpus.txt file with the command-line example, I get the following error:

cd Multilingual-Parallel-Corpus/tools
python3 join_corpus.py --dictionary --paraphrase 1 0.7 2.25 10 50 1 0 0 /content/corpus.txt
[Errno 2] No such file or directory: 'Multilingual-Parallel-Corpus/tools'
/content/Multilingual-Parallel-Corpus/tools
Traceback (most recent call last):
  File "join_corpus.py", line 302, in <module>
    ab_text_train = io.open(folder+current_date+'_corpus_abkhaz.train',"w+", encoding="utf-8")
FileNotFoundError: [Errno 2] No such file or directory: 'joined_translation_data/08-25-2020_corpus_abkhaz.train'
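
A likely cause (my assumption from the traceback): the script writes into a joined_translation_data/ folder relative to the working directory, and that folder doesn't exist yet. Creating it before running the script avoids the error:

import os
os.makedirs("joined_translation_data", exist_ok=True)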

Possible resources

We want to crawl a bigger corpus once we have a usable translation model. There are many possible websites which we could use:

The questions about quality and technical problems are still open and should be investigated during the crawling process.

R&D

  • Studying the Abkhazian language.
  • Studying TensorFlow.

Replace the Latin letters with their Cyrillic equivalents !important

I came across a serious issue in the dictionary; the rest of the corpus should also be checked.
In the Abkhazian dictionary text, the Latin letter a (U+0041, U+0061) is used instead of the Cyrillic letter а (U+0410, U+0430).
All letters that look like Latin letters should be checked and, if needed, replaced with their Cyrillic equivalents!
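
A minimal sketch of such a replacement; the homoglyph map below is illustrative, not exhaustive:

# Replace common Latin homoglyphs with their Cyrillic equivalents.
HOMOGLYPHS = str.maketrans({
    "a": "а", "A": "А",  # U+0061 -> U+0430, U+0041 -> U+0410
    "e": "е", "E": "Е",
    "o": "о", "O": "О",
    "p": "р", "P": "Р",
    "c": "с", "C": "С",
    "x": "х", "X": "Х",
})

def fix_homoglyphs(text):
    return text.translate(HOMOGLYPHS)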

Training Ru-Ab model

  • Filter out text by scoring the parallel corpus and the synthetic one.
  • Separate the high-quality and low-quality parallel corpus based on punctuation matching.
  • Shared sentencepiece for ru and ab (a training sketch follows the list).

The first and second subtasks are back in the backlog.
The third subtask has been done; testing is needed on the new BLEU data.
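
For the third subtask, a minimal sketch of training one shared sentencepiece model on both sides; the file names and vocabulary size are placeholders:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.ru,corpus.ab",  # placeholders: one file per language
    model_prefix="ru_ab_shared",
    vocab_size=8000,
)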

MASS training

Pretraining a whole transformer encoder and decoder with MASS (masked sequence-to-sequence pretraining).

Python script to identify mismatched punctuation

Problem Description
For Neural machine translation, it is important to have identical punctuation marks on both source and target text.

Solution
A Python script to which I can pass a TSV file; when I run it, I should get back two TSV files: one with the text that has identical punctuation, and a second with the text that has mismatched punctuation.

Resources
This code could be helpful to look at: 65667b1
You can try one of the TSV files in the ab-ru folder.
Once the script is ready, it should be pushed to the utils folder in this repository.
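
A minimal sketch of the requested script, assuming a two-column TSV (source, target) and the sign set from the join_corpus help text; the output file names are illustrative:

import csv
import re
import sys

SIGNS = re.compile(r"[.:!?0-9…()\[\]«»]")

def signs(text):
    return SIGNS.findall(text)

with open(sys.argv[1], encoding="utf-8") as tsv, \
     open("matched.tsv", "w", encoding="utf-8") as ok, \
     open("mismatched.tsv", "w", encoding="utf-8") as bad:
    for row in csv.reader(tsv, delimiter="\t"):
        if len(row) < 2:
            continue
        # Same punctuation marks in the same order on both sides.
        out = ok if signs(row[0]) == signs(row[1]) else bad
        out.write("\t".join(row) + "\n")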

Sponsor

Research who could sponsor a powerful computer for our projects.
https://lambdalabs.com/gpu-workstations/vector/customize
Possible sponsors:

  1. https://www.undp.org/
  2. https://en.unesco.org/

Usage:
Train NMT models released under the public CC0 license; the focus is on low-resource languages.
Current language pair (ab-ru): https://www.kaggle.com/nartaa/abrutemp
Grant_CSSP.zip

Update:
I applied for a grant at UNDP and am waiting for a response; the resources will also be directed to enlarging the text corpus.

Testing the integrity of the ab-ru Transformer model

I keep updating the model on Kaggle, both the data I am using and the latest model; here are the links:
https://www.kaggle.com/nartaa/ab-ru-transformer
https://www.kaggle.com/nartaa/abrutransformer
https://www.kaggle.com/nartaa/abrudata

This needs testing and QA; could you double-check my work?
Here is the Python code I am using to detokenize:

import io
import re
from mosestokenizer import MosesDetokenizer

# Read the SentencePiece-tokenized model output.
with io.open("/content/tgt-test.txt", "r", encoding="utf-8") as in_pre:
    pre_list = in_pre.readlines()

# Undo SentencePiece: drop the spaces between pieces, then turn the
# "▁" word-boundary markers back into spaces.
for i, item in enumerate(pre_list):
    pre_list[i] = re.sub(r" ", "", item)
    pre_list[i] = re.sub(r"▁", " ", pre_list[i])

# Moses-detokenize and lowercase every line.
out_pre = ""
with MosesDetokenizer("ru") as detokenize:
    for item in pre_list:
        out_pre += detokenize(item.strip().split(" ")).lower() + "\n"

with open("/content/tgt-test-dec.txt", "w", encoding="utf-8") as f_pre:
    f_pre.writelines(out_pre)

Dictionary lists for rare words

The rare-word counting from #32 should be usable for the dictionary lists: we would calculate the common words after the paraphrasing and construct dictionary lists for the remaining rare words. A counting sketch follows.
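
A minimal sketch of the counting step; the corpus file name and the frequency threshold are illustrative:

from collections import Counter

with open("corpus.ab", encoding="utf-8") as f:
    counts = Counter(f.read().split())

# Words below the threshold go into the dictionary lists.
rare_words = [word for word, count in counts.items() if count < 3]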

The readable writing of paraphrases

The output of the paraphrase generation isn't readable for larger files. A good format would be:

One original sentence...
   one initial word --> exchanged word
   another word --> swapped word

Another primary string...
   one initial word --> changed word
   some other word --> switched word
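
A minimal sketch of a writer for this format, assuming each record is an original sentence plus its (old word, new word) swaps; the names are illustrative:

def write_readable(records, path):
    with open(path, "w", encoding="utf-8") as out:
        for sentence, swaps in records:
            out.write(sentence + "\n")
            for old, new in swaps:
                out.write(f"   {old} --> {new}\n")
            out.write("\n")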

generate_synphrases: fewer sentences than count

I added a few lines of code to generate an output file, but the paraphrase count printed on the terminal and the number of lines in the file differ.
I tried ab/abaza.org; I got an output file with 2428 lines, but a paraphrase count of 3091.
