isi-nlp / nlcodec
Natural Language EnCoder-Decoder: word, char, bpe etc
Home Page: https://isi-nlp.github.io/nlcodec
License: Apache License 2.0
Currently, the char, word, bpe, and class schemes are supported.
TODO: add a byte scheme; the char scheme operates on Unicode code points, whereas the byte scheme operates on UTF-8 bytes.
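To make the distinction concrete, here is a minimal sketch in plain Python (it does not use nlcodec itself) that splits the same string into Unicode code points and into UTF-8 bytes. The variable names are illustrative only.

```python
# Illustrative only: plain Python, not the nlcodec API.
text = "na\u00efve"  # "naive" with a diaeresis on the i (U+00EF)

# Char-level view: one unit per Unicode code point.
code_points = [ord(ch) for ch in text]

# Byte-level view: one unit per UTF-8 byte; non-ASCII characters expand.
utf8_bytes = list(text.encode("utf-8"))

print(len(code_points), code_points)  # 5 units
print(len(utf8_bytes), utf8_bytes)    # 6 units, since U+00EF needs two bytes
```

A byte scheme therefore has a fixed base vocabulary of at most 256 types, at the cost of longer sequences for non-ASCII text.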
Challenges:
<s> </s> <cls> <pad>
), then it'd be genius!
Here is how to reproduce:
echo "a1212121212b a1212121212b" | nlcodec learn -l bpe -vs 100 -mf 1 -m /dev/stdout
One would think that BPEScheme.name == "bpe" should resolve to True, given
class BPEScheme(CharScheme):
    ...
    @property
    @classmethod
    def name(cls):
        return "bpe"
but...
Traceback (most recent call last):
  File "/Users/ljferrer/miniconda3/envs/rtg/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/ljferrer/miniconda3/envs/rtg/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/ljferrer/Documents/PycharmProjects/rtg/rtg/data/dataset.py", line 215, in __call__
    record = [tokr(col) for col, tokr in zip(record, self.tokenizers)]
  File "/Users/ljferrer/Documents/PycharmProjects/rtg/rtg/data/dataset.py", line 215, in <listcomp>
    record = [tokr(col) for col, tokr in zip(record, self.tokenizers)]
  File "/Users/ljferrer/Documents/PycharmProjects/rtg/rtg/data/codec.py", line 171, in encode_as_ids
    if self.codec.name == "bpe" and split_ratio > 0:
TypeError: 'classmethod' object is not callable
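The behaviour above can be reproduced without nlcodec. The sketch below uses stand-in classes (Broken, CharScheme and BPEScheme here are hypothetical minimal versions, not the library's real classes) to show why stacking @property on @classmethod fails, and one possible workaround using a plain class attribute. It is not necessarily the fix the project adopted.

```python
class Broken:
    @property
    @classmethod
    def name(cls):
        return "bpe"


# Accessed on the class, the attribute is the property object, not a string,
# so a comparison with "bpe" is False.
class_level_is_str = isinstance(Broken.name, str)

# Accessed on an instance, property calls its getter, which is a classmethod
# object and therefore not callable.
try:
    Broken().name
    instance_error = None
except TypeError as err:
    instance_error = str(err)


# One possible workaround: a plain class attribute, overridden per subclass.
class CharScheme:
    name = "char"


class BPEScheme(CharScheme):
    name = "bpe"


print(class_level_is_str, instance_error)
print(BPEScheme.name == "bpe", BPEScheme().name == "bpe")
```

A class attribute reads the same way from the class and from instances, which is what the comparison in codec.py expects.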
nlcodec doesn't install pyspark but requires it. Installing pyspark separately worked.
  File "/home1/jain593/.conda/envs/nmt_toolkits_rtg/lib/python3.7/site-packages/nlcodec/term_freq.py", line 10, in <module>
    from nlcodec import spark as spark_util
  File "/home1/jain593/.conda/envs/nmt_toolkits_rtg/lib/python3.7/site-packages/nlcodec/spark.py", line 12, in <module>
    from pyspark.sql import DataFrame, SparkSession
ModuleNotFoundError: No module named 'pyspark'
related to #1
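One common way to handle an optional dependency like this is to import it lazily and fail with an actionable message. The sketch below is only an illustration of that pattern; require_optional is a hypothetical helper, not part of nlcodec, and the project may instead have chosen to declare pyspark as a dependency.

```python
import importlib


def require_optional(module_name: str, install_hint: str):
    """Import a module on demand; raise a helpful error if it is missing.

    Hypothetical helper for illustration, not an nlcodec function.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:  # ModuleNotFoundError is a subclass
        raise ImportError(
            f"'{module_name}' is required for this feature; try: {install_hint}"
        ) from err
```

A module that needs Spark only for some code paths could then call require_optional("pyspark", "pip install pyspark") inside those functions rather than at import time.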
Use __slots__ for data classes such as nodes in the linked list, trie, etc.
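As a rough illustration of the idea, the sketch below defines a doubly linked list node with __slots__. The Node class and its field names are assumptions made for this example, not nlcodec's actual classes.

```python
class Node:
    """Hypothetical linked-list node; field names are illustrative."""

    # Declaring __slots__ removes the per-instance __dict__, which saves
    # memory when millions of small nodes are alive at once.
    __slots__ = ("value", "prev", "next")

    def __init__(self, value, prev=None, next=None):
        self.value = value
        self.prev = prev
        self.next = next


head = Node("a")
tail = Node("b", prev=head)
head.next = tail
```

The trade-off is that instances can no longer take attributes that are not listed in __slots__.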
Traceback (most recent call last):
  File "/Users/tg/miniconda3/envs/rtg/bin/nlcodec", line 5, in <module>
    from nlcodec.__main__ import mainnlcodec
ImportError: cannot import name 'mainnlcodec' from 'nlcodec.__main__' (/Users/tg/miniconda3/envs/rtg/lib/python3.7/site-packages/nlcodec/__main__.py)
Caused by: missing a comma
Lines 38 to 39 in b4fcc78
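The referenced lines are not shown here, but a missing comma between two adjacent string literals is a well-known Python pitfall: the literals are silently concatenated. The sketch below uses hypothetical entry-point strings (the second entry and its module are made up) to show how a fused name like the one in the error can arise.

```python
# Hypothetical entry-point strings, for illustration only.
scripts_buggy = [
    "nlcodec=nlcodec.__main__:main"  # <- missing comma here
    "nlcodec-demo=nlcodec.demo:main",
]

scripts_fixed = [
    "nlcodec=nlcodec.__main__:main",
    "nlcodec-demo=nlcodec.demo:main",
]

# Adjacent literals are joined into one string, so the list has one item
# and the function name runs into the next script name.
print(len(scripts_buggy), scripts_buggy[0])
print(len(scripts_fixed))
```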
A frequency count of reserved types helps in computing imbalance.
Keep certain characters separate; don't merge them even if there is sufficient frequency
Watch out: be language agnostic. Use the Unicode tables to figure out digit/punctuation annotation.
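A language-agnostic check along these lines can lean on the Unicode general categories exposed by Python's unicodedata module. The sketch below is one possible approach; char_class and its labels are hypothetical names, not part of nlcodec.

```python
import unicodedata


def char_class(ch: str) -> str:
    """Classify one character by its Unicode general category.

    Hypothetical helper: "Nd" is decimal digit, any "P*" is punctuation.
    """
    category = unicodedata.category(ch)
    if category == "Nd":
        return "digit"
    if category.startswith("P"):
        return "punct"
    return "other"


# Works across scripts: ASCII, Devanagari digit five, CJK full stop.
for ch in ["7", "\u096b", ".", "\u3002", "a"]:
    print(repr(ch), char_class(ch))
```

A merge step could then skip any pair whose characters fall into the "digit" or "punct" classes, regardless of script.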
  File "...nlcodec/codec.py", line 151, in __init__
    assert len(self.idx_to_str) == len(self.str_to_idx)
Clearly, this is a dumb assumption.
In BPE, we have more than one "legal or valid" way of encoding strings to int seqs.
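To illustrate the point about multiple valid encodings, the sketch below enumerates every way a string can be segmented with a toy vocabulary. The vocabulary and the segmentations function are made up for this example and do not reflect nlcodec's internals.

```python
from typing import Dict, List


def segmentations(text: str, vocab: Dict[str, int]) -> List[List[int]]:
    """Return every id sequence whose pieces concatenate to `text`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            # Try this piece first, then segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([vocab[piece]] + rest)
    return results


toy_vocab = {"a": 0, "b": 1, "ab": 2}  # toy vocabulary, illustrative only
print(segmentations("ab", toy_vocab))  # [[0, 1], [2]]
```

Both id sequences decode to the same string, so a one-to-one relationship between strings and id sequences cannot be assumed.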
https://github.com/thammegowda/bpepp/blob/aff1164e84ec2b5cee28aa9d97c1eb71e8ca1a09/bpepp.py#L557
Don't drop seqs; use them by slightly editing them.
Example:
4796 comed▁ 1 2316 318 202
4797 welcomed▁ 1 2316 585 4796
related to https://github.com/isi-nlp/rtg/issues/197
Often we will have to encode a huge dataset (say, training data for NMT) in one go.
Encoding can be trivially parallelized using multiprocessing, with the help of a pool of worker processes.
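A minimal sketch of this idea, using the standard library's multiprocessing.Pool, is shown below. The toy vocabulary, encode_line and encode_parallel are hypothetical stand-ins for a real codec, not nlcodec's API.

```python
from multiprocessing import Pool
from typing import Dict, List

# Hypothetical toy vocabulary; a real codec would load its own model.
VOCAB: Dict[str, int] = {"<unk>": 0, "hello": 1, "world": 2}


def encode_line(line: str) -> List[int]:
    """Encode one line; defined at module level so workers can pickle it."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in line.split()]


def encode_parallel(lines: List[str], n_procs: int = 1,
                    chunksize: int = 1000) -> List[List[int]]:
    """Encode lines with a pool of worker processes, preserving order."""
    if n_procs <= 1:
        # Serial fallback avoids process start-up cost for small inputs.
        return [encode_line(line) for line in lines]
    with Pool(processes=n_procs) as pool:
        # Pool.map returns results in input order; chunksize batches the
        # work to reduce inter-process communication overhead.
        return pool.map(encode_line, lines, chunksize=chunksize)
```

When calling encode_parallel with n_procs greater than 1 from a script, the call should sit under an `if __name__ == "__main__":` guard so that platforms which spawn fresh interpreters do not re-run it in each worker.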
The signature should be something like: scheme.fit(files: List[Path], min_freq: int = 1, save_at: Path = None) -> List[int]
Also, for future work, think about adding a new set of types to the vocab. The model can insert a few rows with random weights or with the average of the remaining rows.