isi-nlp / nlcodec

Natural Language EnCoder-Decoder: word, char, bpe etc

Home Page: https://isi-nlp.github.io/nlcodec

License: Apache License 2.0

Python 100.00%

nlp bpe text preprocessing

nlcodec's Issues

Support byte fallback in BPE

Instead of mapping out-of-vocabulary (OOV) characters to UNK, support byte fallback.
A character coverage parameter (e.g. 99.95%) can be used to decide which portion of characters (the remaining 1 - 0.9995) falls back to bytes.
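
A minimal sketch of the idea, not nlcodec's implementation: pick the most frequent characters up to the coverage threshold, and spell every other character as UTF-8 byte tokens. The function names and the `<0xNN>` token spelling are assumptions made for illustration.

```python
from collections import Counter

def covered_chars(text: str, coverage: float = 0.9995) -> set:
    """Most frequent characters whose counts together reach `coverage` of all characters."""
    counts = Counter(text)
    total = sum(counts.values())
    keep, seen = set(), 0
    for ch, n in counts.most_common():
        if seen >= coverage * total:
            break  # threshold reached; everything after this is "rare"
        keep.add(ch)
        seen += n
    return keep

def encode_with_fallback(text: str, keep: set) -> list:
    """Keep covered characters as-is; spell the rest as byte tokens (hypothetical format)."""
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
        else:
            out.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return out
```

With this scheme the vocabulary needs at most 256 extra byte types, and no input character is ever mapped to UNK.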

add byte encoding scheme

Currently, the char, word, bpe, and class schemes are supported.

TODO: add a byte scheme.
The char scheme operates on Unicode code points, whereas the byte scheme would operate on UTF-8 bytes.

Challenges:

Can this work with 1-byte integers? Can sequences be stored in a byte array?
It would be fairly easy to represent sequences using 2-byte ints, but if it can work using only 1-byte ints while still keeping 4 extra special types (such as <s> </s> <cls> <pad>), that would be genius!
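
One possible answer, sketched under assumptions: well-formed UTF-8 never contains the byte values 0xC0, 0xC1, or 0xF5 through 0xFF, so four of those values could be borrowed for the special types while everything still fits in 1-byte ints. The specific assignments and function names below are hypothetical.

```python
# Byte values 0xF5..0xFF never occur in well-formed UTF-8, so they are free
# to carry reserved types. These particular assignments are hypothetical.
RESERVED = {"<pad>": 0xFC, "<s>": 0xFD, "</s>": 0xFE, "<cls>": 0xFF}
RESERVED_IDS = set(RESERVED.values())

def encode_bytes(text: str, add_bos_eos: bool = True) -> bytes:
    """Encode text as a byte array, optionally wrapped in <s> ... </s>."""
    ids = text.encode("utf-8")
    if add_bos_eos:
        ids = bytes([RESERVED["<s>"]]) + ids + bytes([RESERVED["</s>"]])
    return ids

def decode_bytes(ids: bytes) -> str:
    """Drop reserved ids, then decode the remaining bytes as UTF-8."""
    payload = bytes(b for b in ids if b not in RESERVED_IDS)
    return payload.decode("utf-8", errors="replace")
```

A sequence is then a plain `bytes` object, so the whole dataset could live in a byte array.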

Speed up max bigram lookup using MaxHeap
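
A rough sketch of how such a lookup might work, using the standard `heapq` module with lazy invalidation: stale entries stay in the heap and are skipped when popped. The class name and interface are assumptions, not existing nlcodec code.

```python
import heapq

class MaxBigramHeap:
    """Max-heap over bigram counts; outdated heap entries are skipped lazily."""

    def __init__(self, counts: dict):
        self.counts = dict(counts)
        # heapq is a min-heap, so counts are stored negated
        self.heap = [(-n, bigram) for bigram, n in self.counts.items()]
        heapq.heapify(self.heap)

    def update(self, bigram, new_count: int):
        """Record a changed count; the old heap entry becomes stale."""
        self.counts[bigram] = new_count
        if new_count > 0:
            heapq.heappush(self.heap, (-new_count, bigram))

    def pop_max(self):
        """Return (bigram, count) with the highest current count, or None."""
        while self.heap:
            neg, bigram = heapq.heappop(self.heap)
            if -neg > 0 and self.counts.get(bigram, 0) == -neg:
                del self.counts[bigram]
                return bigram, -neg
        return None
```

Each update costs O(log n) instead of a full scan over all bigrams to find the maximum.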

BPE corner case bug: repeated codes corrupt the index

here is how to reproduce

echo "a1212121212b a1212121212b" | nlcodec learn -l bpe -vs 100 -mf 1 -m /dev/stdout

Classmethod object is not callable

One would think that BPEScheme.name == "bpe" should evaluate to True, given

class BPEScheme(CharScheme):
...
    @property
    @classmethod
    def name(cls):
        return "bpe"

but...

Traceback (most recent call last):
  File "/Users/ljferrer/miniconda3/envs/rtg/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/ljferrer/miniconda3/envs/rtg/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/ljferrer/Documents/PycharmProjects/rtg/rtg/data/dataset.py", line 215, in __call__
    record = [tokr(col) for col, tokr in zip(record, self.tokenizers)]
  File "/Users/ljferrer/Documents/PycharmProjects/rtg/rtg/data/dataset.py", line 215, in <listcomp>
    record = [tokr(col) for col, tokr in zip(record, self.tokenizers)]
  File "/Users/ljferrer/Documents/PycharmProjects/rtg/rtg/data/codec.py", line 171, in encode_as_ids
    if self.codec.name == "bpe" and split_ratio > 0:
TypeError: 'classmethod' object is not callable
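
The traceback is consistent with how the two decorators stack: `property` calls the function it wraps directly, and here that function is a `classmethod` object, which is not callable on Python 3.7. Accessed on the class itself, `BPEScheme.name` returns the property object, so the comparison with "bpe" is False. One simple way around it, sketched with stand-in classes rather than the real ones, is a plain class attribute:

```python
class CharScheme:
    # plain class attribute: readable on the class and on its instances
    name = "char"

class BPEScheme(CharScheme):
    # overrides the parent's value; no decorators involved
    name = "bpe"
```

This works identically on every Python 3 version, whereas stacking `classmethod` and `property` behaves differently across releases.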

Option for NFKC normalization

save the flag and normalization type in model file (during training)
apply it consistently during encoding
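
A minimal sketch of both steps, assuming the flag is stored in a JSON header line. That header layout is hypothetical and is not nlcodec's model file format; `unicodedata.normalize` is the standard-library call.

```python
import json
import unicodedata

def make_header(normalize: bool, form: str = "NFKC") -> str:
    """Serialize the normalization settings at training time (hypothetical layout)."""
    return json.dumps({"normalize": normalize, "norm_form": form})

def apply_norm(text: str, header: str) -> str:
    """Apply the stored settings at encoding time, so both stages agree."""
    meta = json.loads(header)
    if meta.get("normalize"):
        return unicodedata.normalize(meta["norm_form"], text)
    return text
```

Reading the form from the model file, instead of from a command-line flag at encoding time, is what keeps training and encoding consistent.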

Requires pyspark but doesn't install it

nlcodec doesn't install pyspark but requires it. Installing pyspark separately worked.

  File "/home1/jain593/.conda/envs/nmt_toolkits_rtg/lib/python3.7/site-packages/nlcodec/term_freq.py", line 10, in <module>
    from nlcodec import spark as spark_util
  File "/home1/jain593/.conda/envs/nmt_toolkits_rtg/lib/python3.7/site-packages/nlcodec/spark.py", line 12, in <module>
    from pyspark.sql import DataFrame, SparkSession
ModuleNotFoundError: No module named 'pyspark'
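
Besides declaring pyspark as a dependency, one option is to import it only when the Spark code path is actually used. A sketch of such a lazy import helper follows; the helper name is an assumption and it is not part of nlcodec.

```python
import importlib

def require_optional(module_name: str):
    """Import an optional dependency on demand, with an actionable error message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        msg = (f"'{module_name}' is required for this feature but is not installed. "
               f"Try: pip install {module_name}")
        raise ImportError(msg) from e
```

Calling `require_optional("pyspark")` inside the function that needs Spark would let the rest of the package import cleanly without it.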

support separate vocab file argument; don't move or overwrite vocab level1

character coverage 99.995% instead of min count 20

related to #1

option to use mutual information instead of freq count for merging

Setup Travis CI build

https://github.com/thammegowda/mtdata/blob/master/.travis.yml

CLI interface to learn BPE vocabulary from term-frequency input

performance improvement by using slots

Use __slots__ for data classes such as Nodes in linkedlist, trie, etc
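
For illustration, a linked-list node with `__slots__` might look like the following; the field names are assumptions and not the actual nlcodec classes.

```python
class Node:
    """Doubly linked list node; __slots__ removes the per-instance __dict__."""
    __slots__ = ("val", "prev", "next")

    def __init__(self, val, prev=None, next=None):
        self.val = val
        self.prev = prev
        self.next = next
```

Instances take less memory and attribute access is slightly faster, at the cost of not being able to attach arbitrary new attributes.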

Bug: nlcodec CLI is broken

Traceback (most recent call last):
  File "/Users/tg/miniconda3/envs/rtg/bin/nlcodec", line 5, in <module>
    from nlcodec.__main__ import mainnlcodec
ImportError: cannot import name 'mainnlcodec' from 'nlcodec.__main__' (/Users/tg/miniconda3/envs/rtg/lib/python3.7/site-packages/nlcodec/__main__.py)

Caused by: missing a comma

nlcodec/setup.py

Lines 38 to 39 in b4fcc78

 'nlcodec=nlcodec.__main__:main' 

 'nlcodec-freq=nlcodec.term_freq:main'
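
Why a missing comma produces the odd name `mainnlcodec`: Python silently concatenates adjacent string literals, so the two entry points collapse into one string. A small demonstration, with list names chosen for illustration:

```python
broken = [
    'nlcodec=nlcodec.__main__:main'      # no comma: merges with the next literal
    'nlcodec-freq=nlcodec.term_freq:main'
]
fixed = [
    'nlcodec=nlcodec.__main__:main',
    'nlcodec-freq=nlcodec.term_freq:main',
]
print(len(broken), len(fixed))
print(broken[0])
```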

BPE: Save the Count of Reserved types: UNK, SPACE

Frequency counts of reserved types help in computing imbalance

BPE: don't merge categories

Keep certain characters separate; don't merge them even if there is sufficient frequency

digits
punctuation
dates, months, years
... anything else?

Watch out: be language agnostic. Use the Unicode table to figure out the digit/punctuation annotation.
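
A language-agnostic check could lean on Unicode general categories from the standard `unicodedata` module. The sketch below is one possible reading of the rule: any `N*` category counts as a digit, any `P*` category as punctuation, and a merge is blocked if either side contains such a character. The function names are assumptions.

```python
import unicodedata

def char_class(ch: str) -> str:
    """Classify a character by its Unicode general category."""
    cat = unicodedata.category(ch)
    if cat.startswith("N"):   # Nd, Nl, No: decimal digits and other numerics
        return "digit"
    if cat.startswith("P"):   # Pc, Pd, Ps, Pe, Pi, Pf, Po: punctuation
        return "punct"
    return "other"

def can_merge(left: str, right: str) -> bool:
    """Allow a merge only if neither side contains digits or punctuation."""
    return all(char_class(c) == "other" for c in left + right)
```

Note that symbols such as `$` or `+` fall under the `S*` categories and would need their own rule if they should stay separate too.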

Assumption that there is one-to-one between string and int seqs for BPE is wrong

  File "...nlcodec/codec.py", line 151, in __init__
    assert len(self.idx_to_str) == len(self.str_to_idx)

Clearly, this is a dumb assumption.
In BPE, there is more than one "legal or valid" way of encoding a string as an int sequence.

learn task: instead of dropping seqs with 2+ repeated codes, replace the 2+ repeated codes with just 2 and use it.

https://github.com/thammegowda/bpepp/blob/aff1164e84ec2b5cee28aa9d97c1eb71e8ca1a09/bpepp.py#L557

Don't drop seqs;
use them after slightly editing them.
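
A sketch of that edit, assuming "repeated codes" means a run of the same code appearing consecutively; if the linked code treats repeats differently, this would need adjusting. The function name is hypothetical.

```python
from itertools import groupby

def cap_repeats(seq, max_run: int = 2):
    """Shorten every run of identical consecutive items to at most `max_run` items."""
    out = []
    for item, run in groupby(seq):
        run_len = sum(1 for _ in run)
        out.extend([item] * min(max_run, run_len))
    return out
```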

remove child entry when it is completely covered by a composite

example

4796	comed▁	1	2316	318 202
4797	welcomed▁	1	2316	585 4796

char-coverage : support count instead of percentage

Given a greedy segmentation, have a % chance to break into a suboptimal segmentation

Speed up encoding using multiprocessing

Often we will have to encode a huge dataset (say training data for NMT) in one go.
Encoding can be trivially parallelized using multiprocessing, with the help of a pool of worker processes.
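
A minimal sketch with the standard `multiprocessing.Pool`. The encoder here is a toy stand-in, and the function names and defaults are assumptions rather than nlcodec's API.

```python
from multiprocessing import Pool

def toy_encode(line: str) -> list:
    """Stand-in for a real encoder: maps each character to its code point."""
    return [ord(ch) for ch in line]

def encode_parallel(lines, encode_fn=toy_encode, n_procs: int = 1, chunksize: int = 1000):
    """Encode lines in input order; fall back to a plain loop for a single process."""
    if n_procs <= 1:
        return [encode_fn(line) for line in lines]
    with Pool(n_procs) as pool:
        # imap preserves input order and hands out work in chunks
        return list(pool.imap(encode_fn, lines, chunksize=chunksize))
```

When calling this with `n_procs` greater than 1, `encode_fn` has to be a picklable top-level function, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard.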

Fit or shrink an existing vocab to a given dataset

Accept a list of files
compute term frequencies
Eliminate types with zero counts
Preserve reserved types even if they have zero counts
Save the resulting model at a given file path
Return index mapping between old and new, so we can go back to model and shrink embedding tables

Signature should be something like: scheme.fit(files: List[Path], min_freq: int = 1, save_at: Path = None) -> List[int]

Also, as future work, think about adding a new set of types to the vocab. The model can insert a few rows with random weights or the average of the remaining rows.
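
The shrinking steps listed above could be sketched roughly as follows. This version works on an in-memory token stream instead of files and does not save a model; the function name and arguments are assumptions, not the proposed `scheme.fit` itself.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

def shrink_vocab(vocab: List[str], tokens: Iterable[str], reserved: Set[str],
                 min_freq: int = 1) -> Tuple[List[str], List[int]]:
    """Drop types seen fewer than `min_freq` times, but always keep reserved types.

    Returns the new vocab and a list where position new_id holds the old_id.
    """
    freqs = Counter(tokens)
    keep_ids = [i for i, t in enumerate(vocab)
                if t in reserved or freqs[t] >= min_freq]
    new_vocab = [vocab[i] for i in keep_ids]
    return new_vocab, keep_ids
```

The returned index list can be used to select the surviving rows of an embedding table, for example `old_embeddings[keep_ids]` with an array library.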
