rsennrich / subword-nmt Goto Github PK

View Code? Open in Web Editor NEW

2.2K 54.0 464.0 242 KB

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

License: MIT License

Python 100.00%

neural-machine-translation segmentation machine-translation nmt subword-units bpe

subword-nmt's People

Contributors

Stargazers

Watchers

Forkers

orhanf kyunghyuncho ml-lab xiliangsong lemaoliu redreamality fbougares he-ro superxiaoqiang casperkaae jiangnandexue vitaka gphuang tnq177 lixiangnlp chagge aagohary chenb67 julianser panyang stevenlol timwee anoopkunchukuttan odashi miradel51 winggyn zshwuhan kastnerkyle xuerenlv davidtranno1 alvations djher heshizhu jiejiang wximo surafelml ugermann vyraun jzhang533 shihuaxing aaronlifenghan almath123 solertis kspar chrishokamp mboyanov mjpost qiu19qq kobikun makai281 aj-av cosmmb jiangnanhugo zzmjohn cshanbo unbabel shaohuikuang jvdbogae graehl scitator chiraaglala premjithb soldni frankiegu zabin10 gugray ucaslyc kovalevfm lishiji1992 loic-mt codedecde ttslr kevinduh lvyli hfxunlp nimcho justinchiu nananabi ehsanasgari derpferd gladuo kaixin-wu mingfengwuye edufierro xvshiting karendeng xibin hairy-crab xmb-cipher proyag deeppavlov bardetadrien vgoklani chenglongchen zaenal1981abidin lkfo415579 obo daiwk eminemrain czhiming

subword-nmt's Issues

byte pair encoding

Hi, i want to know why do you use byte pair encoding for code generation task to further sub-tokenise it ? can't we proceed further directly to nematus without byte-pair encoding?

You mentioned the vocabulary size is the number of characters + # bpe_operations. Does that mean the network is based on characters and bpes, not (words and bpes)? For example, for the following text which is segmented using bpes, what's the vocabulary size?

e.g.,
low lo@@ w@@ e@@ r

If we count characters and bpes: 7
Or if we count words and bpes: 5

how to restore the original encoding from BPE encoding after translation?

Adding some documentation to the statistics updates

I'm not sure whether this is helpful but I wrote some comments on the update_pair_statistics function: https://github.com/alvations/subword-nmt/blob/patch-2/learn_bpe.py

@rsennrich if you find it's useful. I'll create a PR for merging otherwise, I'll leave it in my fork so that I don't get stunned for 10-15 mins every time I revisit the function ;P

chrF fails on short references.

Running the command:

./chrF.py --ref <(echo Irre.) --hyp <(echo Fantastisch.)

leads to this error:

Traceback (most recent call last):
  File "./chrF.py", line 135, in <module>
    main(args)
  File "./chrF.py", line 122, in main
    chrf, precision, recall = f1(correct, total, total_ref, args.ngram, args.beta)
  File "./chrF.py", line 98, in f1
    recall += (correct[i] + smooth) / (total_ref[i] + smooth)
ZeroDivisionError: division by zero

This is due to the fact, that the reference contains no 6-gram.
Changing line 96 to

if total_hyp[i] + smooth and total_ref[i] + smooth:

fixed the problem for me, but I am not sure if this is the right way to do it.

Anyway thank you for sharing your implementation of chrF!

Different BPE models produced based on python version and/or stdin/argparse

I'm using the latest github clone from this repository and I've noticed the following (Note that I removed the shebang line from learn_bpe.py so that it does not default to /usr/bin/python):

python2 learn_bpe.py -i input.en -s 1000 -o py2.argparse
sort py2.argparse > py2.argparse.sorted

cat input.en | python2 learn_bpe.py -s 1000 > py2.stdin
sort py2.stdin > py2.stdin.sorted

python3 learn_bpe.py -i input.en -s 1000 -o py3.argparse
sort py3.argparse > py3.argparse.sorted

cat input.en | python3 learn_bpe.py -s 1000 > py3.stdin
sort py3.stdin > py3.stdin.sorted

sha1sum *sorted
a8d78085206049c4ba8398e174e3467055b4b1f5  py2.argparse.sorted
a8d78085206049c4ba8398e174e3467055b4b1f5  py2.stdin.sorted
2e56fefcb09f9f6d829774fb230dd08d139ae711  py3.argparse.sorted
72c59258c73930284032e41fd1f6ffd848822f76  py3.stdin.sorted

I also attached the input.en file (as input.en.txt)
input.en.txt

too many @ in the result

I trained two vocabularies with about 900M Chinese-English materials, and then coded two data sets (900M training set and 500K test set) with these two Chinese-English vocabularies.

The training set can get normal results, but there are many @ in the test set.

Before using subword-nmt for bpe, I had participled Chinese.

The corresponding instructions are as follows:

python learn_joint_bpe_and_vocab.py --input data/train.en data/train.zh -s 32000 -o data/bpe32k --write-vocabulary data/vocab.en data/vocab.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/train.en > data/corpus.32k.en
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/train.zh > data/corpus.32k.zh
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/valid.zh > data/aval_bpe_enzh.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/valid.en > data/aval_bpe_enzh.en

The result is shown in the figure.

I don't know where the problem is. Please help me to answer it. Thank you very much.

bash: ./subword-nmt/subword-nmt/apply_bpe.py: No such file or directory

Hello, I am getting this error if I try to preprocess german data? I am curious if this works only for english?

Pip version bug - ModuleNotFoundError: No module named 'learn_bpe'

Installed with pip and python3 venv, calling the subword-nmt modules throws this error.

subword-nmt learn-bpe -s 3000 < input.txt

Traceback (most recent call last):
File "/<>/venv/bin/subword-nmt", line 7, in <module>
    from subword_nmt.subword_nmt import main
File "/<>/venv/lib/python3.6/site-packages/subword_nmt/subword_nmt.py", line 9, in <module>
    from learn_bpe import learn_bpe
ModuleNotFoundError: No module named 'learn_bpe'

I think we have to change the import to .learn_bpe

apply_bpe.py repeats last character twice (if not EOL symbol)

To reproduce:

$ echo -n hello world | python apply_bpe.py --codes bpe_codes.txt
hel@@ lo worldd

I know that having EOL before EOF is a good thing and everyone better do it, but no one is protected from it.

apply_bpe gives fewer segments than before

Hi, I have been using apply_bpe from October 2016. I tested a recent copy of apply_bpe and the number of segments are significantly lower than before. I am using exactly the same settings and codes as before. Has something changed significantly in the algorithm for applying the codes that may be causing this issue?

Motivation for the change in final `<\w>` addition in BPE

@rsennrich What's the motivation for change in the <\w> addition in the "tuplization" of the word at https://github.com/rsennrich/subword-nmt/blob/master/learn_bpe.py#L188 ?

Previous, it was:

"stuff!" -> ('s', 't', 'u', 'f', 'f', '!', '</w>')

Now it's:

"stuff!" -> ('s', 't', 'u', 'f', 'f', '!</w>')

encoding issue?

Hi,
I have been using this code for sometime but lately I've been getting this issue.

no pair has frequency >= 2. Stopping

What's strange about this is that it works with some input files whereas it fails with the above message for some files (although these are exactly in the same format as required)

What exactly is the problem? Also, any ideas about how to fix this? I have already tried the solution suggested in #29 but that didn't help much.

thanks!

Problem with a large corpus

I have a large corpus, around 40GB of text. I install subword-nmt via pip and try to make the dictionary with subword-nmt command line and it takes forever to finish. I just wonder whether there any solution for that situation?

Removing the BPE

In case, if I want to remove a generated BPE dictionary in a text file (supposing that I am not provided the original training/dev/test data). Can I do it?

Frequency threshold?

Hi,
Would it make sense to add a minimal frequency threshold instead of a maximal number of symbols? That way we could make sure that the symbols are being seens at least N times during NMT training.

results on newstest2014

Hi,

I read your paper of BPE, that's a great work. I just wondering where is the result of newstest2014, seems I couldn't find the reported BLEU score. Am I missing anything?

Thanks.

learn_bpe.py generates an invalid bpe file

When training with learn_bpe.py and then trying to use the generated file, I get the following error:

Error: invalid line in BPE codes file: e
The line should exist of exactly two subword units, separated by whitespace

This error message is not very useful as it does not indicate on what line the error is.

what is num_operation?!

./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}

what is num_operations and how to set it?

I think it maybe the number of tokens in the final vocabulary, so I executed the above command using 30000 for 'num_operations' parameter on a machine with 32GB of memory.
After two days, it's still running while takes all the available memory (32GB) + 15GB of swap space.
Is it normal?!

apply_bpe.py produces extra lines on specific file / codecs module

not sure it's highly relevant, but it may be of help if someone runs into this again. when running apply_bpe.py on a specific file with 1.9 million lines, the script outputs 4 extra lines. after a few hours of investigation it boils down to the codecs module: for line in args.input: produces the extra 4 lines. replacing codecs with io solves the issue.

unfortunately the extra lines are at random positions in the middle and once i worked around this i didn't spend extra time on finding where they come from and what may be anomalous in the data to trigger this.

i'm happy provide the offending file on request.

How to use 'glossaries' field?

Hi @rsennrich ,

Can you please shed some light on how to use the glossaries field?
What file format? How should it be structured, etc?

Thanks,
mzeid

replace_pair in learn_bpe.py throws exception for strings that end with escape char (backslash)

pair 1920: \pccsrv \ -> \pccsrv\ (frequency 2798)
Traceback (most recent call last):
File "./learn_bpe.py", line 201, in
changes = replace_pair(most_frequent, sorted_vocab, indices)
File "./learn_bpe.py", line 145, in replace_pair
new_word = pattern.sub(pair_str, new_word)
File "/usr/lib/python2.7/re.py", line 273, in _subx
template = _compile_repl(template, pattern)
File "/usr/lib/python2.7/re.py", line 260, in _compile_repl
raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)

Test case fails on fresh clone of repo.

I am running into a test case failure on a fresh clone of the repo. It's only a method signature error but I just wanted to confirm whether the test case failure to safe to disregard ?

Steps to reproduce.

$ git clone https://github.com/rsennrich/subword-nmt.git
$ cd subword-nmt
$ python test/test_glossaries.py 
======================================================================
ERROR: test_multiple_glossaries (__main__.TestBPESegmentMethod)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_glossaries.py", line 116, in test_multiple_glossaries
    self._run_test_case(test_case)
  File "/home/prastog3/.local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "test/test_glossaries.py", line 108, in _run_test_case
    out = self.bpe.segment(orig)
  File "test/test_glossaries.py", line 108, in _run_test_case
    out = self.bpe.segment(orig)
  File "/home/prastog3/anaconda/lib/python2.7/bdb.py", line 49, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/prastog3/anaconda/lib/python2.7/bdb.py", line 68, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit

----------------------------------------------------------------------
Ran 9 tests in 43.061s

FAILED (errors=1)

support

In the code generation we have to give declation+docstring as x.train and body as y.train. So, i am following your code and i did byte pair encoding as i am new to this field i really want to show you my work so that you can tell me whether i am going in a right direction or not.

As directed .. this is my vocab file for declaration and description which i got after using the shell command

this is the vocab file of body

this is the declaration_description.bpe file

this is the body.bpe file

so , can u please guide me whether i am going in the right direction or not.
Thank you so much in advance 👍

New version of subword-nmt can't handle certain sentences

There are several sentences in WMT dataset that used to work fine, but currently break, I was able to narrow down the problem

Double space and special caracters:

echo "and &#93;  . &apos;" | python $BPEROOT/learn_bpe.py -s 10
#version: 0.2
Traceback (most recent call last):
  File "subword-nmt/learn_bpe.py", line 254, in <module>
    main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
  File "subword-nmt/learn_bpe.py", line 199, in main
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
  File "subword-nmt/learn_bpe.py", line 199, in <listcomp>
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
IndexError: string index out of range

Empty string:

$ echo "hello world
> " | python $BPEROOT/learn_bpe.py -s 10
#version: 0.2
Traceback (most recent call last):
  File "subword-nmt/learn_bpe.py", line 254, in <module>
    main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
  File "subword-nmt/learn_bpe.py", line 199, in main
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
  File "subword-nmt/learn_bpe.py", line 199, in <listcomp>
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
IndexError: string index out of range

Both of those worked with older versions of subword-nmt, e.g. with this one: 8247488

Apply.bpe is giving duplicates in Vocabulary result file (German-English)

I'm trying to replicate English-German benchmarking using this link: Tensorflow Seq-to-seq. When I closely look at result file it has duplicates like 'Flank@@' 'flank@@', 'Check-In' 'check-in'. I removed these duplicates by sorting and filtering but somehow my bleu scores get messed up (test and dev bleu scores show 0.0). I'm thinking it because I modified my vocabulary file it's reading it incorrectly. Attached is the vocab file.
vocab.bpe.32000.de.txt

Desired Final Vocabulary size

Hi Rico,

Let say that I want to have the final vocabulary size of the input corpus is V_e. Would it be possible using the current implementation?

For example, Google wordpieces is trained to select "D" subwords that maximize the language-model likelihood of the training data.

Thank you!

apply_bpe.py doubles empty lines

Hi,
it seems apply_bpe.py duplicates empty lines, minimal example:

echo -e '\n\n' | wc -l
3

and twice as many with the script.

echo -e '\n\n' | ./subword-nmt/apply_bpe.py -c bpe.codes | wc -l
6

Can you reproduce this?

Subword for Arabic

Hi @rsennrich! this is quite an interesting work! I'm interested in applying this subword units (BPE) approach for translation from Arabic to English. I see that this approach seems to work only for romanized text encodings. Are there any ways to apply it for Arabic text?

My idea is to generate these subword units for Arabic and English and train Ar-En NMT system using TensorFlow. I know the scripts work for English but I'm not sure about how to achieve the same subword unit generation for Arabic.

Vocabulary size

Hi Rico

First, thanks a lot for releasing your code!

I have a few questions w.r.t. to the implementation/usage

I tried to learn a model/codes on the french europarl corpus and encode the same text using the learned model. When I use 5000 or 10000 symbols i get a vocabulary size of 5464 and 10435 respectively on the encoded text. (checked using the script get_vocab on the encoded text. The trained models/code files do contain 5000 or 10000 lines as expected). I this intended behaviour or am I doing something wrong?
Whats the best way to process the encoded text downstream of apply_bpe.py. Should i just encoded as i would normally with a space separated tokenized text - Specifically I'm thinking how you go about extracting the vocabulary, do you get that from the model/codes file or extract that from the encoded text?

BR
Casper Sønderby

UnicodeEncodeError when using subword-nmt learn-bpe with verbose mode

My train text data is in Chinese. and it reports UnicodeEncodeError when using subword-nmt learn-bpecommand with verbose mode on, while it works fine with learn_bpe.py script.
Then I check the code and find out the reason.

    # python 2/3 compatibility
    if sys.version_info < (3, 0):
        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
    else:
        sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
        sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

It seems that command line subwor-nmt learn-bpe doesn't run the above codes, then sys.stderr used by verbose mode (see below) would be the default system stderr，which encodes unicode with "ascii" encoding.

        if verbose:
            sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))

Is it suitable for any language?

Is it suitable for any language? I want use it in Chinese, is it correct ?

README typo

README has a small typo. Flag --write-vocabulary is listed as --write_vocabulary for sample command to learn joint bpe

Adding to pypi

Have you considered packaging this up as a module and submitting to the Python package index? It's not too time-consuming to do and would make it very easy to include the code in downstream projects (including incorporating it as an explicit dependency).

Post Processing

Hi, can you please guide me how to do post processing on the output files generated by nematus of code generaton and code documentation?
@rsennrich

Thank you :)

BPE and FST

I have recently found out this post by google about the transliteration models implemented with FST encoder/decoder used on the Gboard keyboard. I'm not sure which is the difference between BPE presented here and the FST used by Google. Any hint, thank you?

The given sed pattern may not be enough to remove @@

Hi,

An MT system can generate hypotheses ending with a subword token which are not cleaned by the given sed pattern s/\@\@ //g:

echo 'foo@@ bar blah blah tot@@' | sed 's/@@ //g'
foobar blah blah tot@@

# Below seems to fix this by handling 0 or more spaces at the end
echo 'foo@@ bar blah blah tot@@' | sed 's/@@ *//g'
foobar blah blah tot
echo '

IndexError: tuple index out of range

When I use the following command for BPE operation：
./apply_bpe.py -c bpe.train.mn < train.mn-zh.mn
Terminal output error message：
Traceback (most recent call last):
File "./apply_bpe.py", line 313, in
bpe = BPE(args.codes, args.merges, args.separator, vocabulary, args.glossaries)
File "./apply_bpe.py", line 45, in init
self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in self.bpe_codes.items()])
File "./apply_bpe.py", line 45, in
self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in self.bpe_codes.items()])
IndexError: tuple index out of range
I would like to ask what is the reason for this.
Looking forward to your advice or answers.
Best regards,

yapingzhao

How to identify the subunits in an encoded text

I understand that by removing the @@ symbols I get back to the input text, but how can I identify the smallest subunits in the processed text?

If for example I have di@@ rect, How can I figure out the smallest subunits, as I understand it, it could be {di, rect}, {d, i, rect}, {d, i, re, ct} and so on, since I don'nt know which part of di and which part of rect belongs to the subunit, and which part is unknown to the tokenizer.
How do I know what part of a word which contains is part of the binary pair, and which part is the rest of the word?

I'm sorry, if I just got the overall concept wrong, but I can't figure this out.

Number of merge operations

Hello,
Could you please tell me, given a corpus, how do we decide on the number of merge operations? Should we use the same number of operations on both the source and the target sides?

Thank you.

Importing and using learn_bpe and apply_bpe from a Python shell

Hi, Rico:
This may be a newbie question, so apologies in advance.
An undergrad student of mine and I are putting together a Python script that does a bunch of things. It learns a BPE and applies it to some files (Python is a nice way for things to work both in her Windows laptop and in my GNU/Linux machines).
I am currently launching ("subword-nmt apply-bpe" and "subword-nmt learn-bpe") via subprocesses.
Is there a way to import this functionality into a Python script?
I have seen that from a Python shell I can type "from subword_nmt import learn_bpe, apply_bpe", but then if I try something like
infile=open("test.tok.true.es")
outfile=open("/tmp/zzz","w")
learn_bpe(infile,outfile,10000)
I get a "'module' object is not callable" type error.
Thanks a million in advance.

`cache` parameter in the apply_bpe.py script not being used

About Programmatically usage

I'm trying to use the package programatically. I'm doing

    from subword_nmt.apply_bpe import BPE, read_vocabulary
     # read/write files as UTF-8
    bpe_codes_fin = codecs.open(bpe_codes, encoding='utf-8')
    bpe_vocab_fin = codecs.open(bpe_vocab, encoding='utf-8')
    vocabulary = read_vocabulary(bpe_vocab_fin, vocabulary_threshold)

    bpe = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
    codes = bpe.process_line(line)

Is that correct? Also, I'm not sure of the vocabulary_threshold, since I do not see any default value. Is there any one?

Thank you.

Final vocabulary size is not equal to character vocabulary plus num_operations ?

For this fake corpus
when engage what
Its character vocabulary size is 7 (e a h w n g t ).
Lean BPE by two num_operations, and apply it with the two generated codes (wh and en), we get:
wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
The final vocabulary size is 7 (a@@ wh@@ g@@ e t en en@@ ), not 9.
Do I calculate it wrong?

In my opinion, the equation Final vocabulary size = character vocabulary + num_operations based on the assumption that every merge operation generates one new token.
But in this case, the merge operation of e and n, generates two token en and en@@ in the encoded text, and this phenomenon is totally unpredictable. To make sure there is no unknown word, the final vocabulary size should be 18 ??
(e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@ )
I am really confused !
How to generate the final vocabulary, and how to control its size exactly ?

no pair has frequency >= 2. Stopping

python -m learn_joint_bpe_and_vocab --input corpus.en corpus.ch -s 30000 -o bpe.codes --write-vocabulary bpe.vocab.en bpe.vocab.ch
After running the above command, the following hints appear：no pair has frequency >= 2. Stopping。
I don't understand what this message means. I hope to give it a answer, thank you.

Chinese word segmentation

Can this tool be used to do Chinese word segmentation?
Thank you very much!

For languages that not share an alphabet, like chinese and english, should I train the shared bpe model or train their own bpe model separately?

Serious bug in apply_bpe (mutable default parameter)

I think there is a serious bug in apply_bpe.py resulting from having a mutable default parameter (cache={}) in the encode function (https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py#L126).

I guess it's fine if you use the script as a command-line tool, but if I use multiple instances of the BPE class directly, I observe the following behavior:

In [1]: from apply_bpe import BPE

In [2]: bpe_de = BPE(open("/path/to/german.codes"))

In [3]: bpe_en = BPE(open("/path/to/english.codes"))

In [4]: bpe_de.segment("bewundere")
Out[4]: 'bewunder@@ e'

In [5]: bpe_en.segment("bewundere")
Out[5]: 'bewunder@@ e'

During the first segment call the cache dictionary is modified during encode, and the same instance of cache is then used in the second call of encode, essentially overwriting the codes loaded from file.

If I make the following simple modification

def encode(..., cache=None):
    if cache is None:
        cache = {}
    ...

I get the expected behavior:

In [1]: from apply_bpe import BPE

In [2]: bpe_de = BPE(open("/path/to/german.codes"))

In [3]: bpe_en = BPE(open("/path/to/english.codes"))

In [4]: bpe_de.segment("bewundere")
Out[4]: 'bewunder@@ e'

In [5]: bpe_en.segment("bewundere")
Out[5]: 'be@@ w@@ un@@ d@@ ere'

Subtract characters

Below a feature suggestion path.

The number of operations specified with -s leads to very different vocabulary sizes, due to the number of unique characters to start with. A value of 49500 creates small vocabularies for language that use Latin alphabet, but easily 70000 or so for Chinese. So, it would be good to subtract the number of unique characters from the number of symbols being generated.

+++ b/subword_nmt/learn_bpe.py
@@ -56,6 +56,9 @@ def create_parser(subparsers=None):
parser.add_argument('--dict-input', action="store_true",
help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
parser.add_argument(

   '--total-symbols', '-t', action="store_true",

   help="subtract number of characters from the symbols to be generated")

parser.add_argument(
'--verbose', '-v', action="store_true",
help="verbose mode.")

@@ -197,7 +200,7 @@ def prune_stats(stats, big_stats, threshold):
big_stats[item] = freq

-def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
+def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False, total_symbols=False):
"""Learn num_symbols BPE operations from vocabulary, and write to outfile.
"""

@@ -211,6 +214,16 @@ def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_d

 stats, indices = get_pair_statistics(sorted_vocab)
 big_stats = copy.deepcopy(stats)

if total_symbols:
```
   uniq_char = defaultdict(int)
```
```
   for word in vocab:
```
```
       prev_char = word[0]
```
```
       for char in word[1:]:
```
```
           uniq_char[char] += 1
```

   print('Number of characters: {0}'.format(len(uniq_char)))

```
   num_symbols -= len(uniq_char)
```
threshold is inspired by Zipfian assumption, but should only affect speed
threshold = max(stats.values()) / 10
for i in range(num_symbols):
@@ -270,4 +283,4 @@ if name == 'main':
if args.output.name != '':
args.output = codecs.open(args.output.name, 'w', encoding='utf-8')

learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)

learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input, total_symbols=args.total_symbols)

Skip special tokens

Hi there, I want to add some special tokens into my corpus, such as , and keep it fixed. But apply bpe will tokenize it as <@@ UN@@ K@@ >.
Is there a way to skip some special tokens?
Thanks.

rsennrich / subword-nmt Goto Github PK

subword-nmt's People

Contributors

Stargazers

Watchers

Forkers

subword-nmt's Issues

threshold is inspired by Zipfian assumption, but should only affect speed

Recommend Projects

Recommend Topics

Recommend Org