iranlowo's Issues

Corpus Loading Features

I had some free time this week and I was able to pen down some features I'm hoping we'll be able to include. These are:

  1. A class for handling all forms of scraping. The API for this feature could be an interface that other scrapers are built on. We could leverage either bs4 or scrapy. I'm thinking something like:
import scrapy

class BaseScraper(scrapy.Spider):
    def __init__(self, name, urls, **kwargs):
        super(BaseScraper, self).__init__(name, **kwargs)
        self.urls = urls

    def parse_urls(self):
        # Do something to the URLs before starting
        pass

    def parse(self, response):
        # Crawling logic
        pass

Then a scraper like the Bibeli scraper can use this class:

class BibeliScraper(BaseScraper):
    # Logic goes here
    pass

The major advantage here is reusability: anyone can build their own Yorùbá scraper with a minimal amount of work.
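For concreteness, a minimal sketch of how such a scraper could be run end to end with scrapy's stock CrawlerProcess; the output file and URL are illustrative, and the FEEDS setting assumes a reasonably recent scrapy:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"FEEDS": {"bibeli.json": {"format": "json"}}})
process.crawl(BibeliScraper, name="bibeli", urls=["https://example.com/bibeli"])
process.start()  # blocks until the crawl finishes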

  2. Corpus class and DirectoryCorpus class (inspired by gensim)
    This would be a class that can load various formats of Yorùbá corpora through a single API. It should support:
  • Streaming files
  • Reading various file formats: txt, gzip, csv
  • Validating a file format. For example, if a user loads an Owe file, it should validate that the file's contents conform to that format.
  • Preprocessing while reading.
  • Generating random text

A commit for this is available here
The interface is described below:

import os
from os import walk

from gensim import interfaces  # assumed: source of the CorpusABC used below
from iranlowo.preprocessing import normalize_diacritics_text  # assumed import path


class Corpus(interfaces.CorpusABC):
    def __init__(self, path=None, text=None, stream=False, fformat='txt', cformat=None, labels=False, preprocess=None):
        """

        Args:
            path:
            text:
        """
        self.path = path
        self.text = text
        self.labels = labels
        self.stream = stream
        self.fformat = fformat
        self.cformat = cformat
        self.preprocess = preprocess
        if not self.preprocess:
            self.preprocess = [normalize_diacritics_text]
        self.data = self.read_file_filename_or_text(text=text) if text else self.read_file_filename_or_text()
        self.validate_format()

    def __iter__(self):
        for line in self.data:
            yield line

    def __len__(self):
        return len(self.data)

    @staticmethod
    def save_corpus(fname, corpus, id2word=None, metadata=False):
        pass

    def streamfile(self, fobj):
        pass

    def read_file_filename_or_text(self, f=None, text=None):
        """
        Read corpus data from an open file object, a file path, or raw text.

        Returns:
            The corpus data as an iterable of lines.
        """
        pass

    def handle_preprocessing(self, text):
        if callable(self.preprocess):
            return self.preprocess(text)
        if isinstance(self.preprocess, list):
            for technique in self.preprocess:
                text = technique(text)
            return text

    def validate_format(self):
        """
        Check that the loaded data conforms to the declared corpus format
        (cformat), e.g. that an Owe file has the expected layout.
        """


    def generate(self, size):
        """
        Generate random text in the declared corpus format.

        Args:
            size: Amount of text (e.g. number of lines) to generate.
        """
        if not self.cformat:
            raise ValueError("You need to specify a format for generating random text")


class DirectoryCorpus(Corpus):
    def __init__(self, path, **kwargs):
        self.path_dir = path
        # os.walk yields (dirpath, dirnames, filenames) triples
        walked = list(walk(self.path_dir))
        self.depth = walked[0][0]
        self.dirnames = walked[0][1]
        self.flist = walked[0][2]
        self.path = list(self.read_files())
        super(DirectoryCorpus, self).__init__(path=self.path, **kwargs)

    def read_files(self):
        for path in self.flist:
            yield os.path.join(self.path_dir, path)
  3. Loaders: These would be responsible for loading corpora made available by iranlowo. They should return a Corpus object.
class OweLoader(DirectoryCorpus):
    def __init__(self, path, **kwargs):
        super(OweLoader, self).__init__(path=path, **kwargs)

I imagine a downside of these features is that they might make the project bloated, but I think the benefits would outweigh that downside.
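To make the intended ergonomics concrete, here is a hedged usage sketch; the module path, file names, and directory layout are assumptions, not the final API:

from iranlowo.corpus import Corpus, DirectoryCorpus  # assumed module path

# A single file, streamed line by line with the default diacritic normalization
corpus = Corpus(path="data/owe.txt", stream=True, fformat="txt", cformat="owe")
for line in corpus:
    print(line)

# A whole directory: every file underneath becomes part of one corpus
dir_corpus = DirectoryCorpus(path="data/owe/")
print(len(dir_corpus))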

Outstanding task before submission to PyPI

Time to get this wrapped up and out into the world:

  • Need to start using it in a few places, to show some example usage (scraping, NFC → NFD)
  • Usage also in Yorùbá text and/or in ADR?
  • Have PyPI credentials, so go the rest of the way to pip install-ability! Do we need Python 3.7 installed to test? Find out from the tox people.

Pre-filter words whose diacritic forms are not in the dictionary

Pre-filter words whose non-diacritized word-forms are not in the dictionary before asking the model to do ADR. This way we can get more predictable results and error messages for out-of-vocabulary (OOV) words.

If the model sees a word like elerindodo, validate that this word's diacritic form exists in the dictionary and return an error message if it doesn't. Since the model doesn't know about elerindodo, it can just say so, rather than confuse users by returning the "top probability word", which may be something random like aláǹtakùn!
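A minimal sketch of the pre-filter, assuming a set of known non-diacritized word-forms; strip_diacritics, prefilter, and the dictionary contents are hypothetical stand-ins:

import unicodedata

def strip_diacritics(word):
    """Reduce a word to its bare form: NFD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def prefilter(words, known_forms):
    """Split input into words the model may diacritize and OOV words to report."""
    in_vocab, oov = [], []
    for word in words:
        (in_vocab if strip_diacritics(word) in known_forms else oov).append(word)
    return in_vocab, oov

known_forms = {"elerindodo"}  # illustrative dictionary of bare word-forms
in_vocab, oov = prefilter(["elerindodo", "zzzz"], known_forms)
# -> in_vocab == ["elerindodo"], oov == ["zzzz"]; warn the user about oov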

Language Identification Helper

I'm proposing a language identification helper module that can:

  1. Be used to build language ID models using any of the rule-based or learning algorithms available for the task.
  2. Be used to identify languages.

Proposing a usage similar to:

from iranlowo.language import LanguageIdentifier

lang_a = 'eng'
lang_b = 'yor'
lang_a_corpus = 'path_to_corpus'
lang_b_corpus = 'path_to_corpus'

lang_model = LanguageIdentifier(langs=[lang_a, lang_b], corpus=[lang_a_corpus, lang_b_corpus], **kwargs)
lang_model.build(algo='', epoch=epoch, batch=batch, **kwargs)
lang_model.save('save_path')

Then this model can be loaded and used to identify languages like:

from iranlowo.language import identify_language, load_model

language_id_model = load_model('save_path')

language_id = identify_language(language_id_model, 'text')
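As one possible baseline for the rule-based path, a sketch of character-trigram language identification; the profile size, scoring, and corpus file names are assumptions, not the proposed LanguageIdentifier internals:

from collections import Counter

def trigram_profile(text, top_k=300):
    """Collect the most frequent character trigrams of a text."""
    padded = f"  {text.lower()}  "
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {g for g, _ in grams.most_common(top_k)}

def identify(text, profiles):
    """Return the language whose trigram profile overlaps the text the most."""
    text_profile = trigram_profile(text)
    return max(profiles, key=lambda lang: len(text_profile & profiles[lang]))

profiles = {
    "yor": trigram_profile(open("yor_corpus.txt", encoding="utf-8").read()),
    "eng": trigram_profile(open("eng_corpus.txt", encoding="utf-8").read()),
}
print(identify("Báwo ni o ṣe wà?", profiles))  # expected: "yor"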

Tokenizer feature

I'm thinking about including a tokenizer class in the project. The API could look like:

from iranlowo.tokenizer import Tokenizer

text = "some text"
word_tokens = Tokenizer(text).word_tokenize()
sentence_tokens = Tokenizer(text).sentence_tokenize()
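Under the hood, a first cut could be as simple as two regular expressions; this is a naive sketch, not a final design, and the patterns (including the combining-mark range for Yorùbá diacritics) are assumptions:

import re

class Tokenizer:
    def __init__(self, text):
        self.text = text

    def word_tokenize(self):
        # Keep runs of word characters plus combining diacritics together
        return re.findall(r"[\w\u0300-\u036f]+", self.text)

    def sentence_tokenize(self):
        # Split after sentence-final punctuation followed by whitespace
        return [s for s in re.split(r"(?<=[.!?])\s+", self.text.strip()) if s]

print(Tokenizer("Báwo ni? Mo wà dáadáa.").word_tokenize())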

[ADD] a module to do ADR

Add a module so that ADR works right after pip installing the package.

Ideas include wrapping the OpenNMT-py REST service setup, but via the CLI, or hosting the prebuilt model somewhere sensible and reliable and pulling it down via wget as part of the pip install.

This will let mavens or technical folks start using the tool easily, without "extra" infrastructure.
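For the wget-style option, a hedged sketch of download-on-first-use; the URL, file name, and cache location are placeholders, not a decided host:

import os
import urllib.request

MODEL_URL = "https://example.org/iranlowo/yo_adr.pt"   # hypothetical host
MODEL_PATH = os.path.expanduser("~/.iranlowo/yo_adr.pt")

def ensure_model():
    """Fetch the pre-trained ADR model once, instead of bundling it in the package."""
    if not os.path.exists(MODEL_PATH):
        os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    return MODEL_PATH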

[ADD] a more intuitive confidence value

[ADD] a more intuitive confidence value to the results than just the PRED AVG SCORE, i.e. the negative log-likelihood, or the perplexity, aka exp(-PRED AVG SCORE).

It will help users understand the quality of the results better if they know and can interpret the model's confidence. Perplexity is nice but perhaps not ideal for the lay technical person. Yet.

It also gives developers who are using it a better way to reason about when to discard results and warn that they are not worth much. 🤔
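One candidate mapping, sketched under the assumption that PRED AVG SCORE is the average per-token log-likelihood: exp(score) is then the geometric-mean token probability, a value in (0, 1] that reads naturally as a confidence:

import math

def confidence(pred_avg_score):
    """Map the average log-likelihood to a (0, 1] confidence value."""
    return math.exp(pred_avg_score)   # e.g. -0.105 -> ~0.90

def perplexity(pred_avg_score):
    """The current, less intuitive view of the same number."""
    return math.exp(-pred_avg_score)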

Runtime error when trying to use .diacritize_text()

I'm using Windows:
Edition Windows 10 Pro
Version 21H2
Installed on ‎9/‎2/‎2022
OS build 19044.2006

After pip freeze > requirements.txt, here are the libraries and their versions:

  • click==8.1.3
  • deep-translator==1.8.3
  • keras==2.9.0
  • Keras-Preprocessing==1.1.2
  • nltk==3.7
  • numpy==1.23.1
  • python==3.10.6
  • pandas==1.4.3
  • regex==2022.7.25
  • scikit-learn==1.0.2
  • scipy==1.9.0
  • streamlit==1.11.1
  • tensorboard==2.9.1
  • tensorboard-data-server==0.6.1
  • tensorboard-plugin-wit==1.8.1
  • tensorflow==2.9.2
  • tensorflow-estimator==2.9.0
  • tensorflow-io-gcs-filesystem==0.26.0

undefined symbol: _ZN3re23RE2C1ERKSs

import iranlowo.adr as ralo
fails with:

--------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-13-52d83c4cc378> in <module>()
----> 1 import iranlowo.adr as ralo

10 frames
/usr/local/lib/python3.6/dist-packages/iranlowo/__init__.py in <module>()
----> 1 from iranlowo import adr

/usr/local/lib/python3.6/dist-packages/iranlowo/adr.py in <module>()
     10 from argparse import Namespace
     11 from collections import defaultdict
---> 12 from onmt.translate.translator import build_translator
     13 from onmt.utils.parse import ArgumentParser
     14 

/usr/local/lib/python3.6/dist-packages/onmt/translate/__init__.py in <module>()
      1 """ Modules for translation """
----> 2 from onmt.translate.translator import Translator
      3 from onmt.translate.translation import Translation, TranslationBuilder
      4 from onmt.translate.beam import Beam, GNMTGlobalScorer
      5 from onmt.translate.beam_search import BeamSearch

/usr/local/lib/python3.6/dist-packages/onmt/translate/translator.py in <module>()
     10 import torch
     11 
---> 12 import onmt.model_builder
     13 import onmt.translate.beam
     14 import onmt.inputters as inputters

/usr/local/lib/python3.6/dist-packages/onmt/model_builder.py in <module>()
      8 from torch.nn.init import xavier_uniform_
      9 
---> 10 import onmt.inputters as inputters
     11 import onmt.modules
     12 from onmt.encoders import str2enc

/usr/local/lib/python3.6/dist-packages/onmt/inputters/__init__.py in <module>()
      4 e.g., from a line of text to a sequence of embeddings.
      5 """
----> 6 from onmt.inputters.inputter import \
      7     load_old_vocab, get_fields, OrderedIterator, \
      8     build_vocab, old_style_vocab, filter_example

/usr/local/lib/python3.6/dist-packages/onmt/inputters/inputter.py in <module>()
      9 
     10 import torch
---> 11 import torchtext.data
     12 from torchtext.data import Field
     13 from torchtext.vocab import Vocab

/usr/local/lib/python3.6/dist-packages/torchtext/__init__.py in <module>()
     40 
     41 
---> 42 _init_extension()
     43 
     44 

/usr/local/lib/python3.6/dist-packages/torchtext/__init__.py in _init_extension()
     36     if ext_specs is None:
     37         raise ImportError("torchtext C++ Extension is not found.")
---> 38     torch.ops.load_library(ext_specs.origin)
     39     torch.classes.load_library(ext_specs.origin)
     40 

/usr/local/lib/python3.6/dist-packages/torch/_ops.py in load_library(self, path)
    103             # static (global) initialization code in order to register custom
    104             # operators with the JIT.
--> 105             ctypes.CDLL(path)
    106         self.loaded_libraries.add(path)
    107 

/usr/lib/python3.6/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    346 
    347         if handle is None:
--> 348             self._handle = _dlopen(self._name, mode)
    349         else:
    350             self._handle = handle

OSError: /usr/local/lib/python3.6/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZN3re23RE2C1ERKSs

Improve BIG file dependencies

We need to fix the big file dependencies in this project:

  • The pre-trained ADR model (binary) is an 88MB file living in the model folder. This makes for a very heavy upload/download from PyPI.
  • The torch dependency in requirements.txt by default pulls down the GPU version of torch. This makes integration with Heroku and RTD difficult or impossible because of hard size limits. It would be better to integrate and use a CPU-only version. Is this compatible with Travis CI and requirements.txt?

To facilitate all this:

  • All the ADR pre-trained models live in this Bintray artifactory.
  • Find some clever way (or a post-install script) to download them locally as needed; see the sketch after this list.
  • The upside is that the iranlowo download is fast and small, and you can then separately pull down the models to do inference/prediction.
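One hedged way to wire up the post-install download is a custom install command in setup.py; the model URL is a placeholder for the real Bintray artifact path, and the whole snippet is a sketch rather than a decided design:

import urllib.request
from setuptools import setup
from setuptools.command.install import install

class InstallWithModels(install):
    def run(self):
        install.run(self)  # run the normal install first
        # Placeholder URL: substitute the real Bintray artifact path
        urllib.request.urlretrieve(
            "https://dl.bintray.com/<org>/<repo>/yo_adr.pt", "yo_adr.pt"
        )

setup(
    name="iranlowo",
    cmdclass={"install": InstallWithModels},
)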

[RM] torchtext as a top-level package

Remove torchtext as a top-level package. torchtext code was added to the repo as a workaround because:

  • OpenNMT requires the latest torchtext, which must be installed with pip install git+https://github.com/pytorch/text (torchtext=0.4.0)
  • The latest torchtext cannot be installed with pip install torchtext, because on PyPI torchtext=0.3.1 is the latest, and using it causes OpenNMT-based predictions to BREAK!

Since we can neither accelerate nor predict when torchtext=0.4.0 will land on PyPI, we fold that dependency into a top-level package and keep this issue open as a reminder to clean up once the correct version is released.
