iranlowo's Issues

Corpus Loading Features

I had some free time this week and I was able to pen down some features I'm hoping we'll be able to include. These are:

  1. A class for handling all forms of scraping. The API for this feature could be an interface that other scrapers are built on. We could leverage either bs4 or scrapy. I'm thinking something like:
import scrapy

class BaseScraper(scrapy.Spider):
    def __init__(self, name, urls, **kwargs):
        super(BaseScraper, self).__init__(name, **kwargs)
        self.urls = urls

    def parse_urls(self):
        # Do something to the URLs before starting
        pass

    def parse(self, response):
        # Crawling logic
        pass

Then a scraper like the Bibeli scraper can use this class:

class BibeliScraper(BaseScraper):
    # Logic goes here
    pass

The major advantage here is reusability: anyone can build their own Yorùbá scraper with a minimal amount of work.
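For concreteness, a minimal sketch of how such a scraper could be run end to end with scrapy's stock CrawlerProcess; the output file and URL are illustrative, and the FEEDS setting assumes a reasonably recent scrapy:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"FEEDS": {"bibeli.json": {"format": "json"}}})
process.crawl(BibeliScraper, name="bibeli", urls=["https://example.com/bibeli"])
process.start()  # blocks until the crawl finishes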

  2. Corpus class and DirectoryCorpus class (inspired by gensim)
    This would be a class that can load various formats of Yorùbá corpora through a single API. It should support:
  • Streaming files
  • Reading various file formats: txt, gzip, csv
  • Validating a file format. For example, if a user loads an Owe file, it should validate that the file's contents conform to that format.
  • Preprocessing while reading.
  • Generating random text

A commit for this is available here
The interface is described below:

import os
from os import walk

from gensim import interfaces  # assumed: source of the CorpusABC used below
from iranlowo.preprocessing import normalize_diacritics_text  # assumed import path


class Corpus(interfaces.CorpusABC):
    def __init__(self, path=None, text=None, stream=False, fformat='txt', cformat=None, labels=False, preprocess=None):
        """

        Args:
            path:
            text:
        """
        self.path = path
        self.text = text
        self.labels = labels
        self.stream = stream
        self.fformat = fformat
        self.cformat = cformat
        self.preprocess = preprocess
        if not self.preprocess:
            self.preprocess = [normalize_diacritics_text]
        self.data = self.read_file_filename_or_text(text=text) if text else self.read_file_filename_or_text()
        self.validate_format()

    def __iter__(self):
        for line in self.data:
            yield line

    def __len__(self):
        return len(self.data)

    @staticmethod
    def save_corpus(fname, corpus, id2word=None, metadata=False):
        pass

    def streamfile(self, fobj):
        pass

    def read_file_filename_or_text(self, f=None, text=None):
        """
        Read corpus data from an open file object, a file path, or raw text.

        Returns:
            The corpus data as an iterable of lines.
        """
        pass

    def handle_preprocessing(self, text):
        if callable(self.preprocess):
            return self.preprocess(text)
        if isinstance(self.preprocess, list):
            for technique in self.preprocess:
                text = technique(text)
            return text

    def validate_format(self):
        """
        Check that the loaded data conforms to the declared corpus format
        (cformat), e.g. that an Owe file has the expected layout.
        """


    def generate(self, size):
        """
        Generate random text in the declared corpus format.

        Args:
            size: Amount of text (e.g. number of lines) to generate.
        """
        if not self.cformat:
            raise ValueError("You need to specify a format for generating random text")


class DirectoryCorpus(Corpus):
    def __init__(self, path, **kwargs):
        self.path_dir = path
        # os.walk yields (dirpath, dirnames, filenames) triples
        walked = list(walk(self.path_dir))
        self.depth = walked[0][0]
        self.dirnames = walked[0][1]
        self.flist = walked[0][2]
        self.path = list(self.read_files())
        super(DirectoryCorpus, self).__init__(path=self.path, **kwargs)

    def read_files(self):
        for path in self.flist:
            yield os.path.join(self.path_dir, path)
  3. Loaders: These would be responsible for loading corpora made available by iranlowo. They should return a Corpus object.
class OweLoader(DirectoryCorpus):
    def __init__(self, path, **kwargs):
        super(OweLoader, self).__init__(path=path, **kwargs)

I imagine a downside of these features is that they might make the project bloated, but I think the benefits would outweigh that downside.
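To make the intended ergonomics concrete, here is a hedged usage sketch; the module path, file names, and directory layout are assumptions, not the final API:

from iranlowo.corpus import Corpus, DirectoryCorpus  # assumed module path

# A single file, streamed line by line with the default diacritic normalization
corpus = Corpus(path="data/owe.txt", stream=True, fformat="txt", cformat="owe")
for line in corpus:
    print(line)

# A whole directory: every file underneath becomes part of one corpus
dir_corpus = DirectoryCorpus(path="data/owe/")
print(len(dir_corpus))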

Outstanding task before submission to PyPI

Time to get this wrapped up and out into the world:

  • Need to start using it in a few places, to show some example usage (scraping, NFC → NFD)
  • Usage also in Yorùbá text and/or in ADR?
  • Have PyPI credentials, so go the rest of the way to pip install-ability! Do we need Python 3.7 installed to test? Find out from the tox people.

Pre-filter words whose diacritic forms are not in the dictionary

Pre-filter words whose non-diacritized word-forms are not in the dictionary before asking the model to do ADR. This way we can get more predictable results and error messages for out-of-vocabulary (OOV) words.

If the model sees a word like elerindodo, validate that this word's diacritic form exists in the dictionary and return an error message if it doesn't. Since the model doesn't know about elerindodo, it can just say so, rather than confuse users by returning the "top probability word", which may be something random like aláǹtakùn!
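A minimal sketch of the pre-filter, assuming a set of known non-diacritized word-forms; strip_diacritics, prefilter, and the dictionary contents are hypothetical stand-ins:

import unicodedata

def strip_diacritics(word):
    """Reduce a word to its bare form: NFD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def prefilter(words, known_forms):
    """Split input into words the model may diacritize and OOV words to report."""
    in_vocab, oov = [], []
    for word in words:
        (in_vocab if strip_diacritics(word) in known_forms else oov).append(word)
    return in_vocab, oov

known_forms = {"elerindodo"}  # illustrative dictionary of bare word-forms
in_vocab, oov = prefilter(["elerindodo", "zzzz"], known_forms)
# -> in_vocab == ["elerindodo"], oov == ["zzzz"]; warn the user about oov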

Language Identification Helper

I'm proposing a language identification helper module that can:

  1. Be used to build language ID models using any of the rule-based or learning algorithms available for the task.
  2. Be used to identify languages.

Proposing a usage similar to:

from iranlowo.language import LanguageIdentifier

lang_a = 'eng'
lang_b = 'yor'
lang_a_corpus = 'path_to_corpus'
lang_b_corpus = 'path_to_corpus'

lang_model = LanguageIdentifier(langs=[lang_a, lang_b], corpus=[lang_a_corpus, lang_b_corpus], **kwargs)
lang_model.build(algo='', epoch=epoch, batch=batch, **kwargs)
lang_model.save('save_path')

Then this model can be loaded and used to identify languages like:

from iranlowo.language import identify_language, load_model

language_id_model = load_model('save_path')

language_id = identify_language(language_id_model, 'text')
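As one possible baseline for the rule-based path, a sketch of character-trigram language identification; the profile size, scoring, and corpus file names are assumptions, not the proposed LanguageIdentifier internals:

from collections import Counter

def trigram_profile(text, top_k=300):
    """Collect the most frequent character trigrams of a text."""
    padded = f"  {text.lower()}  "
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {g for g, _ in grams.most_common(top_k)}

def identify(text, profiles):
    """Return the language whose trigram profile overlaps the text the most."""
    text_profile = trigram_profile(text)
    return max(profiles, key=lambda lang: len(text_profile & profiles[lang]))

profiles = {
    "yor": trigram_profile(open("yor_corpus.txt", encoding="utf-8").read()),
    "eng": trigram_profile(open("eng_corpus.txt", encoding="utf-8").read()),
}
print(identify("Báwo ni o ṣe wà?", profiles))  # expected: "yor"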

Tokenizer feature

I'm thinking about including a tokenizer class in the project. The API could look like:

from iranlowo.tokenizer import Tokenizer

text = "some text"
word_tokens = Tokenizer(text).word_tokenize()
sentence_tokens = Tokenizer(text).sentence_tokenize()
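Under the hood, a first cut could be as simple as two regular expressions; this is a naive sketch, not a final design, and the patterns (including the combining-mark range for Yorùbá diacritics) are assumptions:

import re

class Tokenizer:
    def __init__(self, text):
        self.text = text

    def word_tokenize(self):
        # Keep runs of word characters plus combining diacritics together
        return re.findall(r"[\w\u0300-\u036f]+", self.text)

    def sentence_tokenize(self):
        # Split after sentence-final punctuation followed by whitespace
        return [s for s in re.split(r"(?<=[.!?])\s+", self.text.strip()) if s]

print(Tokenizer("Báwo ni? Mo wà dáadáa.").word_tokenize())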

[ADD] a module to do ADR

Add a module so that ADR works right after pip installing the package.

Ideas include wrapping the OpenNMT-py REST service setup, but via the CLI, or hosting the prebuilt model somewhere sensible and reliable and pulling it down via wget as part of the pip install.

This will let mavens or technical folks start using the tool easily, without "extra" infrastructure.
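For the wget-style option, a hedged sketch of download-on-first-use; the URL, file name, and cache location are placeholders, not a decided host:

import os
import urllib.request

MODEL_URL = "https://example.org/iranlowo/yo_adr.pt"   # hypothetical host
MODEL_PATH = os.path.expanduser("~/.iranlowo/yo_adr.pt")

def ensure_model():
    """Fetch the pre-trained ADR model once, instead of bundling it in the package."""
    if not os.path.exists(MODEL_PATH):
        os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    return MODEL_PATH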

[ADD] a more intuitive confidence value

[ADD] a more intuitive confidence value to the results than just the PRED AVG SCORE, i.e. the negative log-likelihood, or the perplexity, aka exp(-PRED AVG SCORE).

It will help users understand the quality of the results better if they know and can interpret the model's confidence. Perplexity is nice but perhaps not ideal for the lay technical person. Yet.

It also gives developers who are using it a better way to reason about when to discard results and warn that they are not worth much. 🤔
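One candidate mapping, sketched under the assumption that PRED AVG SCORE is the average per-token log-likelihood: exp(score) is then the geometric-mean token probability, a value in (0, 1] that reads naturally as a confidence:

import math

def confidence(pred_avg_score):
    """Map the average log-likelihood to a (0, 1] confidence value."""
    return math.exp(pred_avg_score)   # e.g. -0.105 -> ~0.90

def perplexity(pred_avg_score):
    """The current, less intuitive view of the same number."""
    return math.exp(-pred_avg_score)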

Runtime error when trying to use .diacritize_text()

I'm using Windows:
Edition Windows 10 Pro
Version 21H2
Installed on ‎9/‎2/‎2022
OS build 19044.2006

After pip freeze > requirements.txt, here are the libraries and their versions:

  • click==8.1.3
  • deep-translator==1.8.3
  • keras==2.9.0
  • Keras-Preprocessing==1.1.2
  • nltk==3.7
  • numpy==1.23.1
  • python==3.10.6
  • pandas==1.4.3
  • regex==2022.7.25
  • scikit-learn==1.0.2
  • scipy==1.9.0
  • streamlit==1.11.1
  • tensorboard==2.9.1
  • tensorboard-data-server==0.6.1
  • tensorboard-plugin-wit==1.8.1
  • tensorflow==2.9.2
  • tensorflow-estimator==2.9.0
  • tensorflow-io-gcs-filesystem==0.26.0

undefined symbol: _ZN3re23RE2C1ERKSs

import iranlowo.adr as ralo
fails with:

--------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-13-52d83c4cc378> in <module>()
----> 1 import iranlowo.adr as ralo

10 frames
/usr/local/lib/python3.6/dist-packages/iranlowo/__init__.py in <module>()
----> 1 from iranlowo import adr

/usr/local/lib/python3.6/dist-packages/iranlowo/adr.py in <module>()
     10 from argparse import Namespace
     11 from collections import defaultdict
---> 12 from onmt.translate.translator import build_translator
     13 from onmt.utils.parse import ArgumentParser
     14 

/usr/local/lib/python3.6/dist-packages/onmt/translate/__init__.py in <module>()
      1 """ Modules for translation """
----> 2 from onmt.translate.translator import Translator
      3 from onmt.translate.translation import Translation, TranslationBuilder
      4 from onmt.translate.beam import Beam, GNMTGlobalScorer
      5 from onmt.translate.beam_search import BeamSearch

/usr/local/lib/python3.6/dist-packages/onmt/translate/translator.py in <module>()
     10 import torch
     11 
---> 12 import onmt.model_builder
     13 import onmt.translate.beam
     14 import onmt.inputters as inputters

/usr/local/lib/python3.6/dist-packages/onmt/model_builder.py in <module>()
      8 from torch.nn.init import xavier_uniform_
      9 
---> 10 import onmt.inputters as inputters
     11 import onmt.modules
     12 from onmt.encoders import str2enc

/usr/local/lib/python3.6/dist-packages/onmt/inputters/__init__.py in <module>()
      4 e.g., from a line of text to a sequence of embeddings.
      5 """
----> 6 from onmt.inputters.inputter import \
      7     load_old_vocab, get_fields, OrderedIterator, \
      8     build_vocab, old_style_vocab, filter_example

/usr/local/lib/python3.6/dist-packages/onmt/inputters/inputter.py in <module>()
      9 
     10 import torch
---> 11 import torchtext.data
     12 from torchtext.data import Field
     13 from torchtext.vocab import Vocab

/usr/local/lib/python3.6/dist-packages/torchtext/__init__.py in <module>()
     40 
     41 
---> 42 _init_extension()
     43 
     44 

/usr/local/lib/python3.6/dist-packages/torchtext/__init__.py in _init_extension()
     36     if ext_specs is None:
     37         raise ImportError("torchtext C++ Extension is not found.")
---> 38     torch.ops.load_library(ext_specs.origin)
     39     torch.classes.load_library(ext_specs.origin)
     40 

/usr/local/lib/python3.6/dist-packages/torch/_ops.py in load_library(self, path)
    103             # static (global) initialization code in order to register custom
    104             # operators with the JIT.
--> 105             ctypes.CDLL(path)
    106         self.loaded_libraries.add(path)
    107 

/usr/lib/python3.6/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    346 
    347         if handle is None:
--> 348             self._handle = _dlopen(self._name, mode)
    349         else:
    350             self._handle = handle

OSError: /usr/local/lib/python3.6/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZN3re23RE2C1ERKSs

Improve BIG file dependencies

We need to fix the big file dependencies in this project:

  • The pre-trained ADR model (binary) is an 88MB file living in the model folder. This makes for a very heavy upload/download from PyPI.
  • The torch dependency in requirements.txt by default pulls down the GPU version of torch. This makes integration with Heroku and RTD difficult or impossible because of hard size limits. It would be better to integrate and use a CPU-only version. Is this compatible with Travis CI and requirements.txt?

To facilitate all this:

  • All the ADR pre-trained models live in this Bintray artifactory.
  • Find some clever way (or a post-install script) to download them locally as needed; see the sketch after this list.
  • The upside is that the iranlowo download is fast and small, and you can then separately pull down the models to do inference/prediction.
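One hedged way to wire up the post-install download is a custom install command in setup.py; the model URL is a placeholder for the real Bintray artifact path, and the whole snippet is a sketch rather than a decided design:

import urllib.request
from setuptools import setup
from setuptools.command.install import install

class InstallWithModels(install):
    def run(self):
        install.run(self)  # run the normal install first
        # Placeholder URL: substitute the real Bintray artifact path
        urllib.request.urlretrieve(
            "https://dl.bintray.com/<org>/<repo>/yo_adr.pt", "yo_adr.pt"
        )

setup(
    name="iranlowo",
    cmdclass={"install": InstallWithModels},
)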

[RM] torchtext as a top-level package

Remove torchtext as a top-level package. torchtext code was added to the repo as a workaround because:

  • OpenNMT requires the latest torchtext, which must be installed with pip install git+https://github.com/pytorch/text (torchtext=0.4.0)
  • The latest torchtext cannot be installed with pip install torchtext, because on PyPI torchtext=0.3.1 is the latest, and using it causes OpenNMT-based predictions to BREAK!

Since we can neither accelerate nor predict when torchtext=0.4.0 will land on PyPI, we fold that dependency into a top-level package and keep this issue open as a reminder to clean up once the correct version is released.
