
subword-nmt's People

Contributors

aagohary, alvations, bastings, dmesq, jsenellart, lfurrer, mboyanov, noe, obo, ozancaglayan, proyag, rsennrich, vprov


subword-nmt's Issues

apply_bpe gives fewer segments than before

Hi, I have been using apply_bpe since October 2016. I tested a recent copy of apply_bpe and the number of segments is significantly lower than before. I am using exactly the same settings and codes as before. Has something changed significantly in the algorithm for applying the codes that may be causing this?

Adding to pypi

Have you considered packaging this up as a module and submitting to the Python package index? It's not too time-consuming to do and would make it very easy to include the code in downstream projects (including incorporating it as an explicit dependency).

Subword for Arabic

Hi @rsennrich! This is quite interesting work! I'm interested in applying this subword-unit (BPE) approach to translation from Arabic to English. I see that this approach seems to work only for romanized text encodings. Is there a way to apply it to Arabic text?

My idea is to generate these subword units for Arabic and English and train an Ar-En NMT system using TensorFlow. I know the scripts work for English, but I'm not sure how to achieve the same subword unit generation for Arabic.

About programmatic usage

I'm trying to use the package programmatically. I'm doing:

    import codecs
    from subword_nmt.apply_bpe import BPE, read_vocabulary

    # read/write files as UTF-8
    bpe_codes_fin = codecs.open(bpe_codes, encoding='utf-8')
    bpe_vocab_fin = codecs.open(bpe_vocab, encoding='utf-8')
    vocabulary = read_vocabulary(bpe_vocab_fin, vocabulary_threshold)

    bpe = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
    codes = bpe.process_line(line)

Is that correct? Also, I'm not sure about vocabulary_threshold, since I don't see any default value. Is there one?

Thank you.
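For reference, here is a self-contained sketch of this kind of programmatic usage. The file names and the threshold value 50 are only placeholders (50 mirrors the --vocabulary-threshold 50 used in command-line examples elsewhere in these issues), not documented defaults:

    import codecs
    from subword_nmt.apply_bpe import BPE, read_vocabulary

    # paths below are placeholders; read/write files as UTF-8
    with codecs.open('codes.bpe', encoding='utf-8') as bpe_codes_fin, \
         codecs.open('vocab.txt', encoding='utf-8') as bpe_vocab_fin:
        # keep only vocabulary entries seen at least 50 times (placeholder value)
        vocabulary = read_vocabulary(bpe_vocab_fin, 50)
        bpe = BPE(bpe_codes_fin, merges=-1, separator='@@',
                  vocab=vocabulary, glossaries=None)

    # the BPE object can be reused after the files are closed
    print(bpe.process_line('this is a test sentence'))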

BPE and FST

I recently found this post by Google about the transliteration models implemented with an FST encoder/decoder used in the Gboard keyboard. I'm not sure what the difference is between the BPE presented here and the FST approach used by Google. Any hints? Thank you.

encoding issue?

Hi,
I have been using this code for some time, but lately I've been getting this issue:

no pair has frequency >= 2. Stopping

What's strange is that it works with some input files, whereas it fails with the above message for others (although they are in exactly the same format as required).

What exactly is the problem? Also, any ideas about how to fix this? I have already tried the solution suggested in #29 but that didn't help much.

thanks!

chrF fails on short references.

Running the command:

./chrF.py --ref <(echo Irre.) --hyp <(echo Fantastisch.)

leads to this error:

Traceback (most recent call last):
  File "./chrF.py", line 135, in <module>
    main(args)
  File "./chrF.py", line 122, in main
    chrf, precision, recall = f1(correct, total, total_ref, args.ngram, args.beta)
  File "./chrF.py", line 98, in f1
    recall += (correct[i] + smooth) / (total_ref[i] + smooth)
ZeroDivisionError: division by zero

This is due to the fact that the reference contains no 6-grams.
Changing line 96 to

if total_hyp[i] + smooth and total_ref[i] + smooth:

fixed the problem for me, but I am not sure if this is the right way to do it.
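For illustration, here is a hedged sketch (not the actual chrF.py code) of a division-safe variant of that loop: it simply skips n-gram orders for which the hypothesis or reference contains no n-grams and averages over the remaining orders.

    def safe_f1(correct, total_hyp, total_ref, max_ngram, beta, smooth=0.0):
        precision, recall, n_effective = 0.0, 0.0, 0
        for i in range(max_ngram):
            # skip orders where a (smoothed) denominator would be zero,
            # e.g. a reference shorter than the n-gram order
            if total_hyp[i] + smooth == 0 or total_ref[i] + smooth == 0:
                continue
            precision += (correct[i] + smooth) / (total_hyp[i] + smooth)
            recall += (correct[i] + smooth) / (total_ref[i] + smooth)
            n_effective += 1
        if n_effective == 0:
            return 0.0, 0.0, 0.0
        precision /= n_effective
        recall /= n_effective
        if precision + recall == 0.0:
            return 0.0, precision, recall
        chrf = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return chrf, precision, recall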

Anyway thank you for sharing your implementation of chrF!

Desired Final Vocabulary size

Hi Rico,

Let's say I want the final vocabulary size of the input corpus to be V_e. Would that be possible with the current implementation?

For example, Google's wordpiece model is trained to select D subwords that maximize the language-model likelihood of the training data.

Thank you!

The given sed pattern may not be enough to remove @@

Hi,

An MT system can generate hypotheses ending with a subword token, which is not cleaned up by the given sed pattern s/\@\@ //g:

echo 'foo@@ bar blah blah tot@@' | sed 's/@@ //g'
foobar blah blah tot@@

# Below seems to fix this by handling 0 or more spaces at the end
echo 'foo@@ bar blah blah tot@@' | sed 's/@@ *//g'
foobar blah blah tot

Number of merge operations

Hello,
Could you please tell me, given a corpus, how do we decide on the number of merge operations? Should we use the same number of operations on both the source and the target sides?

Thank you.

Pip version bug - ModuleNotFoundError: No module named 'learn_bpe'

Installed with pip in a python3 venv; calling the subword-nmt module throws this error.

subword-nmt learn-bpe -s 3000 < input.txt

Traceback (most recent call last):
File "/<>/venv/bin/subword-nmt", line 7, in <module>
    from subword_nmt.subword_nmt import main
File "/<>/venv/lib/python3.6/site-packages/subword_nmt/subword_nmt.py", line 9, in <module>
    from learn_bpe import learn_bpe
ModuleNotFoundError: No module named 'learn_bpe'

I think we have to change the import to .learn_bpe.
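A hedged sketch of the usual pattern for this (not necessarily the fix that was eventually merged): prefer the package-relative import, and fall back to the plain import when the file is run as a standalone script.

    # inside subword_nmt/subword_nmt.py (sketch)
    try:
        from .learn_bpe import learn_bpe      # works when imported as a package
        from .apply_bpe import BPE
    except (ImportError, ValueError):
        from learn_bpe import learn_bpe       # works when run as a plain script
        from apply_bpe import BPE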

What is num_operations?!

./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}

What is num_operations and how should I set it?

I think it may be the number of tokens in the final vocabulary, so I executed the above command with 30000 for the num_operations parameter on a machine with 32GB of memory.
After two days, it is still running while taking all the available memory (32GB) plus 15GB of swap space.
Is this normal?!

Skip special tokens

Hi there, I want to add some special tokens to my corpus, such as <UNK>, and keep them fixed. But apply_bpe will tokenize it as <@@ UN@@ K@@ >.
Is there a way to skip some special tokens?
Thanks.
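One possible approach, sketched below under the assumption that the glossaries argument of the BPE class (visible in other snippets in these issues) protects matching tokens from segmentation; the file name and the token are placeholders.

    import codecs
    from subword_nmt.apply_bpe import BPE

    with codecs.open('codes.bpe', encoding='utf-8') as codes:
        # tokens listed in `glossaries` are left unsegmented (assumption)
        bpe = BPE(codes, glossaries=['<UNK>'])

    print(bpe.process_line('an <UNK> token should stay intact'))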

byte pair encoding

Hi, I want to know why you use byte pair encoding for the code generation task to further sub-tokenize the data. Can't we proceed directly to Nematus without byte pair encoding?

replace_pair in learn_bpe.py throws exception for strings that end with escape char (backslash)

pair 1920: \pccsrv \ -> \pccsrv\ (frequency 2798)
Traceback (most recent call last):
File "./learn_bpe.py", line 201, in
changes = replace_pair(most_frequent, sorted_vocab, indices)
File "./learn_bpe.py", line 145, in replace_pair
new_word = pattern.sub(pair_str, new_word)
File "/usr/lib/python2.7/re.py", line 273, in _subx
template = _compile_repl(template, pattern)
File "/usr/lib/python2.7/re.py", line 260, in _compile_repl
raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
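To make the failure mode concrete, here is a hedged sketch (not the project's actual fix): when the replacement string ends with a backslash, Python's re.sub tries to parse it as an escape sequence and raises exactly this kind of error; passing a function as the replacement sidesteps that parsing.

    import re

    pair = ('\\pccsrv', '\\')                # the merged pair from the log above
    pair_str = ''.join(pair)                 # '\pccsrv\'
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')

    word = 'foo \\pccsrv \\ bar'
    # pattern.sub(pair_str, word) would raise the kind of escape error shown above;
    # a callable replacement is inserted literally and is therefore safe
    new_word = pattern.sub(lambda _: pair_str, word)
    print(new_word)                          # foo \pccsrv\ bar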

learn_bpe.py generates an invalid bpe file

When training with learn_bpe.py and then trying to use the generated file, I get the following error:

Error: invalid line in BPE codes file: e
The line should exist of exactly two subword units, separated by whitespace

This error message is not very useful, as it does not indicate which line the error is on.

Post Processing

Hi, can you please guide me on how to do post-processing on the output files generated by Nematus for code generation and code documentation?
@rsennrich

Thank you :)

support

For code generation, we have to give declaration+docstring as x.train and the body as y.train. So I am following your code and I did byte pair encoding. As I am new to this field, I really want to show you my work so that you can tell me whether I am going in the right direction or not.

As directed, this is my vocab file for the declaration and description, which I got after using the shell command:

[screenshot omitted]

This is the vocab file of the body:
[screenshot omitted]

This is the declaration_description.bpe file:
[screenshot omitted]

This is the body.bpe file:
[screenshot omitted]

So, can you please guide me on whether I am going in the right direction or not?
Thank you so much in advance 👍

Serious bug in apply_bpe (mutable default parameter)

I think there is a serious bug in apply_bpe.py resulting from having a mutable default parameter (cache={}) in the encode function (https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py#L126).

I guess it's fine if you use the script as a command-line tool, but if I use multiple instances of the BPE class directly, I observe the following behavior:

In [1]: from apply_bpe import BPE

In [2]: bpe_de = BPE(open("/path/to/german.codes"))

In [3]: bpe_en = BPE(open("/path/to/english.codes"))

In [4]: bpe_de.segment("bewundere")
Out[4]: 'bewunder@@ e'

In [5]: bpe_en.segment("bewundere")
Out[5]: 'bewunder@@ e'

During the first segment call the cache dictionary is modified during encode, and the same instance of cache is then used in the second call of encode, essentially overwriting the codes loaded from file.

If I make the following simple modification

def encode(..., cache=None):
    if cache is None:
        cache = {}
    ...

I get the expected behavior:

In [1]: from apply_bpe import BPE

In [2]: bpe_de = BPE(open("/path/to/german.codes"))

In [3]: bpe_en = BPE(open("/path/to/english.codes"))

In [4]: bpe_de.segment("bewundere")
Out[4]: 'bewunder@@ e'

In [5]: bpe_en.segment("bewundere")
Out[5]: 'be@@ w@@ un@@ d@@ ere'

See also http://effbot.org/zone/default-values.htm

Vocabulary size

Hi Rico

First, thanks a lot for releasing your code!

I have a few questions w.r.t. to the implementation/usage

  1. I tried to learn a model/codes on the French Europarl corpus and encode the same text using the learned model. When I use 5000 or 10000 symbols, I get a vocabulary size of 5464 or 10435, respectively, on the encoded text (checked using the get_vocab script on the encoded text; the trained model/codes files do contain 5000 or 10000 lines as expected). Is this intended behaviour, or am I doing something wrong?
  2. What's the best way to process the encoded text downstream of apply_bpe.py? Should I just encode it as I normally would with space-separated tokenized text? Specifically, I'm wondering how you go about extracting the vocabulary: do you get it from the model/codes file, or extract it from the encoded text?

BR
Casper Sønderby

Final vocabulary size is not equal to character vocabulary plus num_operations ?

For this fake corpus:
when engage what
its character vocabulary size is 7 (e a h w n g t).
Learn BPE with two merge operations and apply it with the two generated codes (wh and en); we get:
wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
The final vocabulary size is 7 (a@@ wh@@ g@@ e t en en@@), not 9.
Am I calculating it wrong?

In my opinion, the equation final vocabulary size = character vocabulary + num_operations is based on the assumption that every merge operation generates one new token.
But in this case, the merge operation of e and n generates two tokens, en and en@@, in the encoded text, and this phenomenon is totally unpredictable. To make sure there are no unknown words, shouldn't the final vocabulary size be 18?
(e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@ )
I am really confused!
How do I generate the final vocabulary, and how do I control its size exactly?
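For what it's worth, the applied-BPE vocabulary in the example above can be counted directly from the segmented text; the tiny sketch below (which only mirrors what the get_vocab script does conceptually, it is not that script) confirms the count of 7 types:

    from collections import Counter

    segmented = 'wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t'
    vocab = Counter(segmented.split())
    print(len(vocab), sorted(vocab))
    # 7 ['a@@', 'e', 'en', 'en@@', 'g@@', 't', 'wh@@']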

Vocabulary size / convention

Hi Rico,

You mentioned that the vocabulary size is the number of characters + the number of BPE operations. Does that mean the network vocabulary is based on characters and BPE units, not words and BPE units? For example, for the following text, which is segmented using BPE, what's the vocabulary size?

e.g.,
low lo@@ w@@ e@@ r

If we count characters and bpes: 7
Or if we count words and bpes: 5

Importing and using learn_bpe and apply_bpe from a Python shell

Hi, Rico:
This may be a newbie question, so apologies in advance.
An undergrad student of mine and I are putting together a Python script that does a bunch of things. It learns a BPE model and applies it to some files (Python is a nice way for things to work both on her Windows laptop and on my GNU/Linux machines).
I am currently launching "subword-nmt apply-bpe" and "subword-nmt learn-bpe" via subprocesses.
Is there a way to import this functionality into a Python script?
I have seen that from a Python shell I can type "from subword_nmt import learn_bpe, apply_bpe", but then if I try something like

    infile = open("test.tok.true.es")
    outfile = open("/tmp/zzz", "w")
    learn_bpe(infile, outfile, 10000)

I get a "'module' object is not callable" type of error.
Thanks a million in advance.
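A hedged sketch of what seems to be the intended usage, assuming that learn_bpe and apply_bpe are modules and that the function/class of the same name lives inside each of them (as the pip-version traceback earlier in these issues suggests); the file names are the ones from the question:

    import codecs
    from subword_nmt import learn_bpe, apply_bpe

    # learn_bpe is a module; the function of the same name lives inside it
    with codecs.open('test.tok.true.es', encoding='utf-8') as infile, \
         codecs.open('/tmp/zzz', 'w', encoding='utf-8') as outfile:
        learn_bpe.learn_bpe(infile, outfile, 10000)

    with codecs.open('/tmp/zzz', encoding='utf-8') as codes:
        bpe = apply_bpe.BPE(codes)

    print(bpe.process_line('esto es una prueba'))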

Problem with a large corpus

I have a large corpus, around 40GB of text. I installed subword-nmt via pip and tried to build the dictionary with the subword-nmt command line, but it takes forever to finish. I just wonder whether there is any solution for this situation?

New version of subword-nmt can't handle certain sentences

There are several sentences in the WMT dataset that used to work fine but currently break. I was able to narrow down the problem:

  1. Double spaces and special characters:
echo "and &#93;  . &apos;" | python $BPEROOT/learn_bpe.py -s 10
#version: 0.2
Traceback (most recent call last):
  File "subword-nmt/learn_bpe.py", line 254, in <module>
    main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
  File "subword-nmt/learn_bpe.py", line 199, in main
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
  File "subword-nmt/learn_bpe.py", line 199, in <listcomp>
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
IndexError: string index out of range

  2. Empty string:
$ echo "hello world
> " | python $BPEROOT/learn_bpe.py -s 10
#version: 0.2
Traceback (most recent call last):
  File "subword-nmt/learn_bpe.py", line 254, in <module>
    main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
  File "subword-nmt/learn_bpe.py", line 199, in main
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
  File "subword-nmt/learn_bpe.py", line 199, in <listcomp>
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
IndexError: string index out of range

Both of those worked with older versions of subword-nmt, e.g. with this one: 8247488
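To make the failure mode concrete, here is a hedged sketch of the kind of guard that avoids the IndexError (an illustration only, not the repository's fix): empty "words" produced by double spaces or empty lines yield empty tuples, so x[-1] fails unless they are filtered out first.

    raw_vocab = {tuple('hello'): 2, tuple(''): 1}   # tuple('') == () reproduces x[-1] failing
    vocab = dict(
        (tuple(x[:-1]) + (x[-1] + '</w>',), y)
        for (x, y) in raw_vocab.items()
        if x                                        # skip empty words
    )
    print(vocab)                                    # {('h', 'e', 'l', 'l', 'o</w>'): 2}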

Different BPE models produced based on python version and/or stdin/argparse

I'm using the latest github clone from this repository and I've noticed the following (Note that I removed the shebang line from learn_bpe.py so that it does not default to /usr/bin/python):

python2 learn_bpe.py -i input.en -s 1000 -o py2.argparse
sort py2.argparse > py2.argparse.sorted

cat input.en | python2 learn_bpe.py -s 1000 > py2.stdin
sort py2.stdin > py2.stdin.sorted

python3 learn_bpe.py -i input.en -s 1000 -o py3.argparse
sort py3.argparse > py3.argparse.sorted

cat input.en | python3 learn_bpe.py -s 1000 > py3.stdin
sort py3.stdin > py3.stdin.sorted

sha1sum *sorted
a8d78085206049c4ba8398e174e3467055b4b1f5  py2.argparse.sorted
a8d78085206049c4ba8398e174e3467055b4b1f5  py2.stdin.sorted
2e56fefcb09f9f6d829774fb230dd08d139ae711  py3.argparse.sorted
72c59258c73930284032e41fd1f6ffd848822f76  py3.stdin.sorted

I also attached the input.en file (as input.en.txt)
input.en.txt

results on newstest2014

Hi,

I read your BPE paper; it's great work. I was just wondering where the results on newstest2014 are; I couldn't seem to find the reported BLEU score. Am I missing something?

Thanks.

README typo

README has a small typo: the flag --write-vocabulary is listed as --write_vocabulary in the sample command for learning a joint BPE.

Removing the BPE

If I want to undo a generated BPE segmentation in a text file (supposing that I am not provided with the original training/dev/test data), can I do it?
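Undoing the segmentation only needs the segmented text itself, since the '@@ ' continuation markers fully determine how tokens are stitched back together; a minimal sketch, assuming the default '@@' separator:

    import re

    def undo_bpe(line, separator='@@'):
        # remove 'xx@@ yy' joins and a bare trailing separator at end of line
        return re.sub(re.escape(separator) + r'( |$)', '', line)

    print(undo_bpe('di@@ rect transl@@ ation@@'))   # -> 'direct translation'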

How to identify the subunits in an encoded text

I understand that by removing the @@ symbols I get back to the input text, but how can I identify the smallest subword units in the processed text?

If, for example, I have di@@ rect, how can I figure out the smallest subunits? As I understand it, it could be {di, rect}, {d, i, rect}, {d, i, re, ct} and so on, since I don't know which part of di and which part of rect belongs to a learned subunit, and which part is unknown to the tokenizer.
How do I know which part of a word is part of a merged pair, and which part is just the rest of the word?

I'm sorry if I just got the overall concept wrong, but I can't figure this out.

Test case fails on fresh clone of repo.

I am running into a test case failure on a fresh clone of the repo. It's only a method signature error, but I just wanted to confirm whether the test case failure is safe to disregard.

Steps to reproduce.

$ git clone https://github.com/rsennrich/subword-nmt.git
$ cd subword-nmt
$ python test/test_glossaries.py 
======================================================================
ERROR: test_multiple_glossaries (__main__.TestBPESegmentMethod)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_glossaries.py", line 116, in test_multiple_glossaries
    self._run_test_case(test_case)
  File "/home/prastog3/.local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "test/test_glossaries.py", line 108, in _run_test_case
    out = self.bpe.segment(orig)
  File "test/test_glossaries.py", line 108, in _run_test_case
    out = self.bpe.segment(orig)
  File "/home/prastog3/anaconda/lib/python2.7/bdb.py", line 49, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/prastog3/anaconda/lib/python2.7/bdb.py", line 68, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit

----------------------------------------------------------------------
Ran 9 tests in 43.061s

FAILED (errors=1)

apply_bpe.py doubles empty lines

Hi,
it seems apply_bpe.py duplicates empty lines, minimal example:

echo -e '\n\n' | wc -l
3

and twice as many with the script.

echo -e '\n\n' | ./subword-nmt/apply_bpe.py -c bpe.codes | wc -l
6

Can you reproduce this?

IndexError: tuple index out of range

When I use the following command for BPE operation:
./apply_bpe.py -c bpe.train.mn < train.mn-zh.mn
The terminal outputs this error message:
Traceback (most recent call last):
File "./apply_bpe.py", line 313, in
bpe = BPE(args.codes, args.merges, args.separator, vocabulary, args.glossaries)
File "./apply_bpe.py", line 45, in init
self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in self.bpe_codes.items()])
File "./apply_bpe.py", line 45, in
self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in self.bpe_codes.items()])
IndexError: tuple index out of range
I would like to ask what the reason for this is.
Looking forward to your advice or answers.
Best regards,

yapingzhao

no pair has frequency >= 2. Stopping

python -m learn_joint_bpe_and_vocab --input corpus.en corpus.ch -s 30000 -o bpe.codes --write-vocabulary bpe.vocab.en bpe.vocab.ch
After running the above command, the following message appears: "no pair has frequency >= 2. Stopping".
I don't understand what this message means. I hope you can give me an answer, thank you.

UnicodeEncodeError when using subword-nmt learn-bpe with verbose mode

My training text data is in Chinese, and it reports a UnicodeEncodeError when using the subword-nmt learn-bpe command with verbose mode on, while it works fine with the learn_bpe.py script.
Then I checked the code and found the reason.

    # python 2/3 compatibility
    if sys.version_info < (3, 0):
        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
    else:
        sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
        sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

It seems that the command-line subword-nmt learn-bpe doesn't run the above code, so the sys.stderr used by verbose mode (see below) is the default system stderr, which encodes Unicode with the "ascii" codec.

        if verbose:
            sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))

apply_bpe is giving duplicates in the vocabulary result file (German-English)

I'm trying to replicate English-German benchmarking using this link: Tensorflow Seq-to-seq. When I look closely at the result file, it has duplicates like 'Flank@@'/'flank@@' and 'Check-In'/'check-in'. I removed these duplicates by sorting and filtering, but somehow my BLEU scores got messed up (test and dev BLEU scores show 0.0). I'm thinking that because I modified my vocabulary file, it is being read incorrectly. Attached is the vocab file.
vocab.bpe.32000.de.txt

Subtract characters

Below is a feature suggestion patch.

The number of operations specified with -s leads to very different vocabulary sizes, due to the number of unique characters to start with. A value of 49500 creates small vocabularies for languages that use the Latin alphabet, but easily 70000 or so for Chinese. So, it would be good to subtract the number of unique characters from the number of symbols being generated.

+++ b/subword_nmt/learn_bpe.py
@@ -56,6 +56,9 @@ def create_parser(subparsers=None):
     parser.add_argument('--dict-input', action="store_true",
         help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
     parser.add_argument(
+        '--total-symbols', '-t', action="store_true",
+        help="subtract number of characters from the symbols to be generated")
+    parser.add_argument(
         '--verbose', '-v', action="store_true",
         help="verbose mode.")

@@ -197,7 +200,7 @@ def prune_stats(stats, big_stats, threshold):
             big_stats[item] = freq

-def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
+def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False, total_symbols=False):
     """Learn num_symbols BPE operations from vocabulary, and write to outfile.
     """

@@ -211,6 +214,16 @@ def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_d
     stats, indices = get_pair_statistics(sorted_vocab)
     big_stats = copy.deepcopy(stats)
+    if total_symbols:
+        uniq_char = defaultdict(int)
+        for word in vocab:
+            prev_char = word[0]
+            for char in word[1:]:
+                uniq_char[char] += 1
+        print('Number of characters: {0}'.format(len(uniq_char)))
+        num_symbols -= len(uniq_char)
+
     # threshold is inspired by Zipfian assumption, but should only affect speed
     threshold = max(stats.values()) / 10
     for i in range(num_symbols):
@@ -270,4 +283,4 @@ if __name__ == '__main__':
     if args.output.name != '<stdout>':
         args.output = codecs.open(args.output.name, 'w', encoding='utf-8')
-    learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
+    learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input, total_symbols=args.total_symbols)

too many @ in the result

I trained two vocabularies on about 900M of Chinese-English material, and then encoded two data sets (the 900M training set and a 500K test set) with these two Chinese-English vocabularies.

The training set gives normal results, but there are many @@ in the test set.

Before using subword-nmt for BPE, I had already segmented the Chinese text into words.

The corresponding commands are as follows:

python learn_joint_bpe_and_vocab.py --input data/train.en data/train.zh -s 32000 -o data/bpe32k --write-vocabulary data/vocab.en data/vocab.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/train.en > data/corpus.32k.en
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/train.zh > data/corpus.32k.zh
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/valid.zh > data/aval_bpe_enzh.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/valid.en > data/aval_bpe_enzh.en

The result is shown in the screenshots below.

[screenshots of the segmented output omitted]

I don't know where the problem is. Please help me figure it out. Thank you very much.

apply_bpe.py produces extra lines on specific file / codecs module

Not sure this is highly relevant, but it may be of help if someone runs into this again. When running apply_bpe.py on a specific file with 1.9 million lines, the script outputs 4 extra lines. After a few hours of investigation, it boils down to the codecs module: for line in args.input: produces the 4 extra lines. Replacing codecs with io solves the issue.

Unfortunately the extra lines are at random positions in the middle, and once I worked around this I didn't spend extra time on finding where they come from and what may be anomalous in the data to trigger this.

I'm happy to provide the offending file on request.

Frequency threshold?

Hi,
Would it make sense to add a minimal frequency threshold instead of a maximal number of symbols? That way we could make sure that the symbols are seen at least N times during NMT training.
