niger-volta-lti / iranlowo
Ìrànlọ́wọ́ is a utility library for analysis & (pre)processing of Yorùbá text → https://pypi.org/project/iranlowo
License: MIT License
I had some free time this week and I was able to pen down some features I'm hoping we'll be able to include. These are:
import scrapy

class BaseScrapper(scrapy.Spider):
    def __init__(self, name, urls, **kwargs):
        super(BaseScrapper, self).__init__(name, **kwargs)
        self.urls = urls

    def parse_urls(self):
        # Do something to the URLs before starting
        pass

    def parse(self, response):
        # Crawling logic
        pass
Then a scrapper like the Bibeli scrapper can use this class:
class BibeliScrapper(BaseScrapper):
    # Logic goes here
    pass
The major advantage here is reusability: anyone can build their own Yorùbá scraper with a minimal amount of work.
The corpus class should be able to load an Owe file and validate that the content of the file conforms to that format. A commit for this is available here.
The interface is described below:
import os
from os import walk

# Assumed imports: the interface extends gensim's CorpusABC, and the default
# preprocessing step is assumed to be iranlowo's diacritic normalizer.
from gensim import interfaces

from iranlowo.preprocessing import normalize_diacritics_text


class Corpus(interfaces.CorpusABC):
    def __init__(self, path=None, text=None, stream=False, fformat='txt', cformat=None, labels=False, preprocess=None):
        """
        Args:
            path: path to the corpus file to read from.
            text: raw text to build the corpus from, instead of reading from path.
        """
        self.path = path
        self.text = text
        self.labels = labels
        self.stream = stream
        self.fformat = fformat
        self.cformat = cformat
        self.preprocess = preprocess
        if not self.preprocess:
            self.preprocess = [normalize_diacritics_text]
        self.data = self.read_file_filename_or_text(text=text) if text else self.read_file_filename_or_text()
        self.validate_format()

    def __iter__(self):
        for line in self.data:
            yield line

    def __len__(self):
        return len(self.data)

    @staticmethod
    def save_corpus(fname, corpus, id2word=None, metadata=False):
        pass

    def streamfile(self, fobj):
        pass

    def read_file_filename_or_text(self, f=None, text=None):
        """
        Returns: the corpus contents read from a file object, a path, or raw text.
        """
        pass

    def handle_preprocessing(self, text):
        if callable(self.preprocess):
            return self.preprocess(text)
        if isinstance(self.preprocess, list):
            for technique in self.preprocess:
                text = technique(text)
            return text

    def validate_format(self):
        """
        Returns: whether the corpus content conforms to the declared cformat.
        """

    def generate(self, size):
        """
        Args:
            size: amount of random text to generate.
        Returns: generated text in the corpus format.
        """
        if not self.cformat:
            raise ValueError("You need to specify a format for generating random text")
class DirectoryCorpus(Corpus):
    def __init__(self, path, **kwargs):
        self.path_dir = path
        walked = list(walk(self.path_dir))
        # os.walk yields (dirpath, dirnames, filenames) triples.
        self.depth = walked[0][0]
        self.dirnames = walked[0][1]
        self.flist = walked[0][2]
        self.path = list(self.read_files())
        super(DirectoryCorpus, self).__init__(path=self.path, **kwargs)

    def read_files(self):
        for path in self.flist:
            yield os.path.join(self.path_dir, path)
Dataset loaders will live under iranlowo. They should return a Corpus object:

class OweLoader(DirectoryCorpus):
    def __init__(self, path, **kwargs):
        super(OweLoader, self).__init__(path=path, **kwargs)
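For illustration, here is a minimal usage sketch of the proposed interface; the module path, file paths, and keyword values below are placeholders, not part of the actual API:

from iranlowo.corpus import Corpus, OweLoader  # hypothetical module path

# Build a corpus from a single text file and iterate over its lines.
corpus = Corpus(path="data/owe.txt", fformat="txt")
print(len(corpus))
for line in corpus:
    print(line)

# Build a corpus over a whole directory of files, e.g. the Owe (proverbs) dataset.
owe = OweLoader(path="data/owe/")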
I imagine a downside of these features is that they might make the project bloated, but I think the benefits would outweigh this downside.
Time to get this wrapped up and out into the world:
tox: make sure the test suite runs under tox so that other people can run it.
Since we are using tox, get the Travis CI badge going to report build status on the README:
https://docs.travis-ci.com/user/status-images/
https://stackoverflow.com/questions/19810386/showing-travis-build-status-in-github-repo
OpenNMT/OpenNMT-py#1261 (comment)
https://pypi.org/project/OpenNMT-py/
OpenNMT-py has finally been published on PyPI; refactor and reduce source code dependencies to just a pip install OpenNMT-py.
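Concretely, once the PyPI package is a declared dependency, the entry points that iranlowo/adr.py already imports from onmt keep working unchanged, with no vendored OpenNMT source in the repo. A sketch of what that looks like:

# requirements.txt gains a single line: OpenNMT-py
# and the existing imports in iranlowo/adr.py stay as they are:
from onmt.translate.translator import build_translator
from onmt.utils.parse import ArgumentParser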
Pre-filter words whose non-diacritized word-forms are not in the dictionary before asking the model to do ADR. This way we can get more predictable results and error messages for out-of-vocabulary (OOV) words.
If the model sees a word like elerindodo, validate that this word's diacritic form exists in the dictionary and return an error message if it doesn't! This way, since the model doesn't know about elerindodo, it can just say so, rather than confuse users by returning the "top probability word", which may be a random thing like aláǹtakùn!
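A rough sketch of what this pre-filter could look like; strip_diacritics and the dictionary argument are stand-ins for whatever helpers the library actually exposes:

import unicodedata

def strip_diacritics(word):
    # Stand-in helper: drop combining marks to get the non-diacritized form.
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(c))

def prefilter_for_adr(words, dictionary):
    """Split input into words the model can handle and out-of-vocabulary words.

    dictionary is assumed to be a set of known (diacritized) Yorùbá word-forms.
    """
    known = {strip_diacritics(w) for w in dictionary}
    in_vocab = [w for w in words if strip_diacritics(w) in known]
    oov = [w for w in words if strip_diacritics(w) not in known]
    return in_vocab, oov

# in_vocab goes to the ADR model; oov words trigger a clear error message
# instead of a random "top probability" guess.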
I'm proposing a language identification helper module that can:
Proposing a usage similar to:
from iranlowo.language import LanguageIdentifier
lang_a = 'eng'
lang_b = 'yor'
lang_a_corpus = 'path_to_corpus'
lang_b_corpus = 'path_to_corpus'
lang_model = LanguageIdentifier(langs=[lang_a, lang_b], corpus=[lang_a_corpus, lang_b_corpus], **kwargs)
lang_model.build(algo='', epoch=epoch, batch=batch, **kwargs)
lang_model.save('save_path')
Then this model can be loaded and used to identify languages like:
from iranlowo.language import identify_language, load_model
language_id_model = load_model('save_path')
language_id = identify_language(language_id_model, 'text')
Thinking about including a tokenizer class in the project.
I'm thinking the API could look like:
from iranlowo.tokenizer import Tokenizer
text = "some text"
word_tokens = Tokenizer(text).word_tokenize()
sentence_tokens = Tokenizer(text).sentence_tokenize()
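A minimal, purely illustrative sketch of such a class (regex-based; the real implementation would need Yorùbá-aware handling of punctuation and abbreviations):

import re

class Tokenizer:
    def __init__(self, text):
        self.text = text

    def word_tokenize(self):
        # Words (keeping apostrophes and diacritics) or single punctuation marks.
        return re.findall(r"[\w'’]+|[^\w\s]", self.text, re.UNICODE)

    def sentence_tokenize(self):
        # Naive split on sentence-final punctuation followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", self.text) if s.strip()]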
Add a module to do ADR right after pip installing it.
Ideas include wrapping the OpenNMT-py REST service setup, but via the CLI.
Host the prebuilt model somewhere sensible and reliable, and pull it down via wget as part of the pip install.
This will permit mavens or technical folks to start using the tool easily without "extra" infrastructure.
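A sketch of the "pull the model down on first use" idea; the URL and cache path below are placeholders, not a real hosting location:

import os
import urllib.request

MODEL_URL = "https://example.org/models/yoruba_adr.pt"        # placeholder URL
MODEL_PATH = os.path.expanduser("~/.iranlowo/yoruba_adr.pt")   # placeholder cache path

def ensure_adr_model():
    """Download the prebuilt ADR model if it is not cached locally yet."""
    if not os.path.exists(MODEL_PATH):
        os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    return MODEL_PATH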
[ADD] a more intuitive confidence value to the results than just the PRED AVG SCORE, i.e. the negative log-likelihood, or the perplexity, aka exp(-PRED AVG SCORE).
It will help users understand the quality of the results better if they know and can understand the model's confidence. Perplexity is nice but perhaps not ideal for the lay technical person. Yet.
It also gives developers who are using the library a better way to know when to discard results and warn that they are not worth much. 🤔
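For reference, the conversion is tiny. Assuming PRED AVG SCORE is the average per-token log-likelihood that OpenNMT reports, a sketch:

import math

def confidence_measures(pred_avg_score):
    """Turn PRED AVG SCORE (average per-token log-likelihood) into friendlier numbers."""
    perplexity = math.exp(-pred_avg_score)
    avg_token_probability = math.exp(pred_avg_score)  # a rough 0-1 "confidence"
    return perplexity, avg_token_probability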
Add a module that does Yorùbá TS website scraping. Right now candidates are:
I'm using a Windows OS:
Edition Windows 10 Pro
Version 21H2
Installed on 9/2/2022
OS build 19044.2006
After running pip freeze > requirements.txt, here are the libraries and their versions:
Can we simplify CI/CD with GitHub Actions instead of using Travis CI?
I haven't yet explored GitHub Actions, but I'm looking to make PR merging, testing, and deployment of artifacts to PyPI much smoother.
import iranlowo.adr as ralo
fails with errors:
--------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-13-52d83c4cc378> in <module>()
----> 1 import iranlowo.adr as ralo
10 frames
/usr/local/lib/python3.6/dist-packages/iranlowo/__init__.py in <module>()
----> 1 from iranlowo import adr
/usr/local/lib/python3.6/dist-packages/iranlowo/adr.py in <module>()
10 from argparse import Namespace
11 from collections import defaultdict
---> 12 from onmt.translate.translator import build_translator
13 from onmt.utils.parse import ArgumentParser
14
/usr/local/lib/python3.6/dist-packages/onmt/translate/__init__.py in <module>()
1 """ Modules for translation """
----> 2 from onmt.translate.translator import Translator
3 from onmt.translate.translation import Translation, TranslationBuilder
4 from onmt.translate.beam import Beam, GNMTGlobalScorer
5 from onmt.translate.beam_search import BeamSearch
/usr/local/lib/python3.6/dist-packages/onmt/translate/translator.py in <module>()
10 import torch
11
---> 12 import onmt.model_builder
13 import onmt.translate.beam
14 import onmt.inputters as inputters
/usr/local/lib/python3.6/dist-packages/onmt/model_builder.py in <module>()
8 from torch.nn.init import xavier_uniform_
9
---> 10 import onmt.inputters as inputters
11 import onmt.modules
12 from onmt.encoders import str2enc
/usr/local/lib/python3.6/dist-packages/onmt/inputters/__init__.py in <module>()
4 e.g., from a line of text to a sequence of embeddings.
5 """
----> 6 from onmt.inputters.inputter import \
7 load_old_vocab, get_fields, OrderedIterator, \
8 build_vocab, old_style_vocab, filter_example
/usr/local/lib/python3.6/dist-packages/onmt/inputters/inputter.py in <module>()
9
10 import torch
---> 11 import torchtext.data
12 from torchtext.data import Field
13 from torchtext.vocab import Vocab
/usr/local/lib/python3.6/dist-packages/torchtext/__init__.py in <module>()
40
41
---> 42 _init_extension()
43
44
/usr/local/lib/python3.6/dist-packages/torchtext/__init__.py in _init_extension()
36 if ext_specs is None:
37 raise ImportError("torchtext C++ Extension is not found.")
---> 38 torch.ops.load_library(ext_specs.origin)
39 torch.classes.load_library(ext_specs.origin)
40
/usr/local/lib/python3.6/dist-packages/torch/_ops.py in load_library(self, path)
103 # static (global) initialization code in order to register custom
104 # operators with the JIT.
--> 105 ctypes.CDLL(path)
106 self.loaded_libraries.add(path)
107
/usr/lib/python3.6/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
346
347 if handle is None:
--> 348 self._handle = _dlopen(self._name, mode)
349 else:
350 self._handle = handle
OSError: /usr/local/lib/python3.6/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZN3re23RE2C1ERKSs
We need to fix the big file dependencies in this project:
The torch dependency in requirements.txt by default pulls down the GPU version of torch. This makes integration with Heroku and RTD difficult/impossible because of hard size limits. It would be better to integrate and use a CPU-only version. Is this compatible with Travis CI and requirements.txt?
To facilitate all this:
Remove torchtext as a top-level package. The torchtext code was added to the repo as a workaround because torchtext=0.4.0 is only available via pip install git+https://github.com/pytorch/text, and torchtext cannot be installed with pip install torchtext because on PyPI torchtext=0.3.1 is the latest. Using 0.3.1 causes OpenNMT-based predictions to BREAK! Since we cannot accelerate nor predict when torchtext=0.4.0 will be on PyPI, we fold that dependency into a top-level package and carry on with this issue as a reminder to clean up once the correct version is ready.
Write up a short, intuitive (no math!) description of the current ADR model's confidence measures, negative log-likelihood & perplexity, for Wúràọlá's blog post.
Use the feedback to implement Issue #6 → [ADD] a more intuitive confidence value