rsennrich / subword-nmt Goto Github PK
View Code? Open in Web Editor NEWUnsupervised Word Segmentation for Neural Machine Translation and Text Generation
License: MIT License
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
License: MIT License
Hi, I have been using apply_bpe
from October 2016. I tested a recent copy of apply_bpe
and the number of segments are significantly lower than before. I am using exactly the same settings and codes as before. Has something changed significantly in the algorithm for applying the codes that may be causing this issue?
Have you considered packaging this up as a module and submitting to the Python package index? It's not too time-consuming to do and would make it very easy to include the code in downstream projects (including incorporating it as an explicit dependency).
Hi @rsennrich! this is quite an interesting work! I'm interested in applying this subword units (BPE) approach for translation from Arabic to English. I see that this approach seems to work only for romanized
text encodings. Are there any ways to apply it for Arabic text?
My idea is to generate these subword units for Arabic and English and train Ar-En NMT system using TensorFlow. I know the scripts work for English but I'm not sure about how to achieve the same subword unit generation for Arabic
.
I'm trying to use the package programatically. I'm doing
from subword_nmt.apply_bpe import BPE, read_vocabulary
# read/write files as UTF-8
bpe_codes_fin = codecs.open(bpe_codes, encoding='utf-8')
bpe_vocab_fin = codecs.open(bpe_vocab, encoding='utf-8')
vocabulary = read_vocabulary(bpe_vocab_fin, vocabulary_threshold)
bpe = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
codes = bpe.process_line(line)
Is that correct? Also, I'm not sure of the vocabulary_threshold
, since I do not see any default value. Is there any one?
Thank you.
Hi @rsennrich ,
Can you please shed some light on how to use the glossaries field?
What file format? How should it be structured, etc?
Thanks,
mzeid
Hi,
I have been using this code for sometime but lately I've been getting this issue.
no pair has frequency >= 2. Stopping
What's strange about this is that it works with some input files whereas it fails with the above message for some files (although these are exactly in the same format as required)
What exactly is the problem? Also, any ideas about how to fix this? I have already tried the solution suggested in #29 but that didn't help much.
thanks!
Running the command:
./chrF.py --ref <(echo Irre.) --hyp <(echo Fantastisch.)
leads to this error:
Traceback (most recent call last):
File "./chrF.py", line 135, in <module>
main(args)
File "./chrF.py", line 122, in main
chrf, precision, recall = f1(correct, total, total_ref, args.ngram, args.beta)
File "./chrF.py", line 98, in f1
recall += (correct[i] + smooth) / (total_ref[i] + smooth)
ZeroDivisionError: division by zero
This is due to the fact, that the reference contains no 6-gram.
Changing line 96 to
if total_hyp[i] + smooth and total_ref[i] + smooth:
fixed the problem for me, but I am not sure if this is the right way to do it.
Anyway thank you for sharing your implementation of chrF!
Hi Rico,
Let say that I want to have the final vocabulary size of the input corpus is V_e. Would it be possible using the current implementation?
For example, Google wordpieces is trained to select "D" subwords that maximize the language-model likelihood of the training data.
Thank you!
I'm not sure whether this is helpful but I wrote some comments on the update_pair_statistics
function: https://github.com/alvations/subword-nmt/blob/patch-2/learn_bpe.py
@rsennrich if you find it's useful. I'll create a PR for merging otherwise, I'll leave it in my fork so that I don't get stunned for 10-15 mins every time I revisit the function ;P
Hi,
An MT system can generate hypotheses ending with a subword token which are not cleaned by the given sed pattern s/\@\@ //g
:
echo 'foo@@ bar blah blah tot@@' | sed 's/@@ //g'
foobar blah blah tot@@
# Below seems to fix this by handling 0 or more spaces at the end
echo 'foo@@ bar blah blah tot@@' | sed 's/@@ *//g'
foobar blah blah tot
echo '
Hello,
Could you please tell me, given a corpus, how do we decide on the number of merge operations? Should we use the same number of operations on both the source and the target sides?
Thank you.
Installed with pip and python3 venv, calling the subword-nmt modules throws this error.
subword-nmt learn-bpe -s 3000 < input.txt
Traceback (most recent call last):
File "/<>/venv/bin/subword-nmt", line 7, in <module>
from subword_nmt.subword_nmt import main
File "/<>/venv/lib/python3.6/site-packages/subword_nmt/subword_nmt.py", line 9, in <module>
from learn_bpe import learn_bpe
ModuleNotFoundError: No module named 'learn_bpe'
I think we have to change the import to .learn_bpe
./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
what is num_operations and how to set it?
I think it maybe the number of tokens in the final vocabulary, so I executed the above command using 30000 for 'num_operations' parameter on a machine with 32GB of memory.
After two days, it's still running while takes all the available memory (32GB) + 15GB of swap space.
Is it normal?!
Hi there, I want to add some special tokens into my corpus, such as , and keep it fixed. But apply bpe will tokenize it as <@@ UN@@ K@@ >
.
Is there a way to skip some special tokens?
Thanks.
Hi, i want to know why do you use byte pair encoding for code generation task to further sub-tokenise it ? can't we proceed further directly to nematus without byte-pair encoding?
pair 1920: \pccsrv \ -> \pccsrv\ (frequency 2798)
Traceback (most recent call last):
File "./learn_bpe.py", line 201, in
changes = replace_pair(most_frequent, sorted_vocab, indices)
File "./learn_bpe.py", line 145, in replace_pair
new_word = pattern.sub(pair_str, new_word)
File "/usr/lib/python2.7/re.py", line 273, in _subx
template = _compile_repl(template, pattern)
File "/usr/lib/python2.7/re.py", line 260, in _compile_repl
raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
Hello, I am getting this error if I try to preprocess german data? I am curious if this works only for english?
When training with learn_bpe.py
and then trying to use the generated file, I get the following error:
Error: invalid line in BPE codes file: e
The line should exist of exactly two subword units, separated by whitespace
This error message is not very useful as it does not indicate on what line the error is.
@rsennrich What's the motivation for change in the <\w>
addition in the "tuplization" of the word at https://github.com/rsennrich/subword-nmt/blob/master/learn_bpe.py#L188 ?
Previous, it was:
"stuff!" -> ('s', 't', 'u', 'f', 'f', '!', '</w>')
Now it's:
"stuff!" -> ('s', 't', 'u', 'f', 'f', '!</w>')
Hi, can you please guide me how to do post processing on the output files generated by nematus of code generaton and code documentation?
@rsennrich
Thank you :)
To reproduce:
$ echo -n hello world | python apply_bpe.py --codes bpe_codes.txt
hel@@ lo worldd
I know that having EOL before EOF is a good thing and everyone better do it, but no one is protected from it.
In the code generation we have to give declation+docstring as x.train and body as y.train. So, i am following your code and i did byte pair encoding as i am new to this field i really want to show you my work so that you can tell me whether i am going in a right direction or not.
As directed .. this is my vocab file for declaration and description which i got after using the shell command
this is the vocab file of body
this is the declaration_description.bpe file
so , can u please guide me whether i am going in the right direction or not.
Thank you so much in advance 👍
I think there is a serious bug in apply_bpe.py
resulting from having a mutable default parameter (cache={}
) in the encode
function (https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py#L126).
I guess it's fine if you use the script as a command-line tool, but if I use multiple instances of the BPE
class directly, I observe the following behavior:
In [1]: from apply_bpe import BPE
In [2]: bpe_de = BPE(open("/path/to/german.codes"))
In [3]: bpe_en = BPE(open("/path/to/english.codes"))
In [4]: bpe_de.segment("bewundere")
Out[4]: 'bewunder@@ e'
In [5]: bpe_en.segment("bewundere")
Out[5]: 'bewunder@@ e'
During the first segment
call the cache
dictionary is modified during encode
, and the same instance of cache
is then used in the second call of encode
, essentially overwriting the codes loaded from file.
If I make the following simple modification
def encode(..., cache=None):
if cache is None:
cache = {}
...
I get the expected behavior:
In [1]: from apply_bpe import BPE
In [2]: bpe_de = BPE(open("/path/to/german.codes"))
In [3]: bpe_en = BPE(open("/path/to/english.codes"))
In [4]: bpe_de.segment("bewundere")
Out[4]: 'bewunder@@ e'
In [5]: bpe_en.segment("bewundere")
Out[5]: 'be@@ w@@ un@@ d@@ ere'
Hi Rico
First, thanks a lot for releasing your code!
I have a few questions w.r.t. to the implementation/usage
get_vocab
on the encoded text. The trained models/code files do contain 5000 or 10000 lines as expected). I this intended behaviour or am I doing something wrong?apply_bpe.py
. Should i just encoded as i would normally with a space separated tokenized text - Specifically I'm thinking how you go about extracting the vocabulary, do you get that from the model/codes file or extract that from the encoded text?BR
Casper Sønderby
For this fake corpus
when engage what
Its character vocabulary size is 7 (e a h w n g t
).
Lean BPE by two num_operations, and apply it with the two generated codes (wh
and en
), we get:
wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
The final vocabulary size is 7 (a@@ wh@@ g@@ e t en en@@
), not 9.
Do I calculate it wrong?
In my opinion, the equation Final vocabulary size = character vocabulary + num_operations
based on the assumption that every merge operation generates one new token.
But in this case, the merge operation of e
and n
, generates two token en
and en@@
in the encoded text, and this phenomenon is totally unpredictable. To make sure there is no unknown word, the final vocabulary size should be 18 ??
(e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@
)
I am really confused !
How to generate the final vocabulary, and how to control its size exactly ?
Hi Reco,
You mentioned the vocabulary size is the number of characters + # bpe_operations. Does that mean the network is based on characters and bpes, not (words and bpes)? For example, for the following text which is segmented using bpes, what's the vocabulary size?
e.g.,
low lo@@ w@@ e@@ r
If we count characters and bpes: 7
Or if we count words and bpes: 5
Hi, Rico:
This may be a newbie question, so apologies in advance.
An undergrad student of mine and I are putting together a Python script that does a bunch of things. It learns a BPE and applies it to some files (Python is a nice way for things to work both in her Windows laptop and in my GNU/Linux machines).
I am currently launching ("subword-nmt apply-bpe" and "subword-nmt learn-bpe") via subprocesses.
Is there a way to import this functionality into a Python script?
I have seen that from a Python shell I can type "from subword_nmt import learn_bpe, apply_bpe", but then if I try something like
infile=open("test.tok.true.es")
outfile=open("/tmp/zzz","w")
learn_bpe(infile,outfile,10000)
I get a "'module' object is not callable" type error.
Thanks a million in advance.
I have a large corpus, around 40GB of text. I install subword-nmt via pip and try to make the dictionary with subword-nmt command line and it takes forever to finish. I just wonder whether there any solution for that situation?
There are several sentences in WMT dataset that used to work fine, but currently break, I was able to narrow down the problem
echo "and ] . '" | python $BPEROOT/learn_bpe.py -s 10
#version: 0.2
Traceback (most recent call last):
File "subword-nmt/learn_bpe.py", line 254, in <module>
main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
File "subword-nmt/learn_bpe.py", line 199, in main
vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
File "subword-nmt/learn_bpe.py", line 199, in <listcomp>
vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
IndexError: string index out of range
$ echo "hello world
> " | python $BPEROOT/learn_bpe.py -s 10
#version: 0.2
Traceback (most recent call last):
File "subword-nmt/learn_bpe.py", line 254, in <module>
main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
File "subword-nmt/learn_bpe.py", line 199, in main
vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
File "subword-nmt/learn_bpe.py", line 199, in <listcomp>
vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
IndexError: string index out of range
Both of those worked with older versions of subword-nmt, e.g. with this one: 8247488
I'm using the latest github clone from this repository and I've noticed the following (Note that I removed the shebang line from learn_bpe.py
so that it does not default to /usr/bin/python
):
python2 learn_bpe.py -i input.en -s 1000 -o py2.argparse
sort py2.argparse > py2.argparse.sorted
cat input.en | python2 learn_bpe.py -s 1000 > py2.stdin
sort py2.stdin > py2.stdin.sorted
python3 learn_bpe.py -i input.en -s 1000 -o py3.argparse
sort py3.argparse > py3.argparse.sorted
cat input.en | python3 learn_bpe.py -s 1000 > py3.stdin
sort py3.stdin > py3.stdin.sorted
sha1sum *sorted
a8d78085206049c4ba8398e174e3467055b4b1f5 py2.argparse.sorted
a8d78085206049c4ba8398e174e3467055b4b1f5 py2.stdin.sorted
2e56fefcb09f9f6d829774fb230dd08d139ae711 py3.argparse.sorted
72c59258c73930284032e41fd1f6ffd848822f76 py3.stdin.sorted
I also attached the input.en
file (as input.en.txt)
input.en.txt
Is it suitable for any language? I want use it in Chinese, is it correct ?
Hi,
I read your paper of BPE, that's a great work. I just wondering where is the result of newstest2014, seems I couldn't find the reported BLEU score. Am I missing anything?
Thanks.
README has a small typo. Flag --write-vocabulary is listed as --write_vocabulary for sample command to learn joint bpe
In case, if I want to remove a generated BPE dictionary in a text file (supposing that I am not provided the original training/dev/test data). Can I do it?
I understand that by removing the @@
symbols I get back to the input text, but how can I identify the smallest subunits in the processed text?
If for example I have di@@ rect
, How can I figure out the smallest subunits, as I understand it, it could be {di
, rect
}, {d
, i
, rect
}, {d
, i
, re
, ct
} and so on, since I don'nt know which part of di
and which part of rect
belongs to the subunit, and which part is unknown to the tokenizer.
How do I know what part of a word which contains is part of the binary pair, and which part is the rest of the word?
I'm sorry, if I just got the overall concept wrong, but I can't figure this out.
Can this tool be used to do Chinese word segmentation?
Thank you very much!
I am running into a test case failure on a fresh clone of the repo. It's only a method signature error but I just wanted to confirm whether the test case failure to safe to disregard ?
Steps to reproduce.
$ git clone https://github.com/rsennrich/subword-nmt.git
$ cd subword-nmt
$ python test/test_glossaries.py
======================================================================
ERROR: test_multiple_glossaries (__main__.TestBPESegmentMethod)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/test_glossaries.py", line 116, in test_multiple_glossaries
self._run_test_case(test_case)
File "/home/prastog3/.local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
return func(*args, **keywargs)
File "test/test_glossaries.py", line 108, in _run_test_case
out = self.bpe.segment(orig)
File "test/test_glossaries.py", line 108, in _run_test_case
out = self.bpe.segment(orig)
File "/home/prastog3/anaconda/lib/python2.7/bdb.py", line 49, in trace_dispatch
return self.dispatch_line(frame)
File "/home/prastog3/anaconda/lib/python2.7/bdb.py", line 68, in dispatch_line
if self.quitting: raise BdbQuit
BdbQuit
----------------------------------------------------------------------
Ran 9 tests in 43.061s
FAILED (errors=1)
Hi,
it seems apply_bpe.py
duplicates empty lines, minimal example:
echo -e '\n\n' | wc -l
3
and twice as many with the script.
echo -e '\n\n' | ./subword-nmt/apply_bpe.py -c bpe.codes | wc -l
6
Can you reproduce this?
When I use the following command for BPE operation:
./apply_bpe.py -c bpe.train.mn < train.mn-zh.mn
Terminal output error message:
Traceback (most recent call last):
File "./apply_bpe.py", line 313, in
bpe = BPE(args.codes, args.merges, args.separator, vocabulary, args.glossaries)
File "./apply_bpe.py", line 45, in init
self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in self.bpe_codes.items()])
File "./apply_bpe.py", line 45, in
self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in self.bpe_codes.items()])
IndexError: tuple index out of range
I would like to ask what is the reason for this.
Looking forward to your advice or answers.
Best regards,
yapingzhao
python -m learn_joint_bpe_and_vocab --input corpus.en corpus.ch -s 30000 -o bpe.codes --write-vocabulary bpe.vocab.en bpe.vocab.ch
After running the above command, the following hints appear:no pair has frequency >= 2. Stopping。
I don't understand what this message means. I hope to give it a answer, thank you.
My train text data is in Chinese. and it reports UnicodeEncodeError when using subword-nmt learn-bpe
command with verbose mode on, while it works fine with learn_bpe.py script.
Then I check the code and find out the reason.
# python 2/3 compatibility
if sys.version_info < (3, 0):
sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
else:
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
It seems that command line subwor-nmt learn-bpe
doesn't run the above codes, then sys.stderr
used by verbose mode (see below) would be the default system stderr,which encodes unicode with "ascii" encoding.
if verbose:
sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))
I'm trying to replicate English-German benchmarking using this link: Tensorflow Seq-to-seq. When I closely look at result file it has duplicates like 'Flank@@' 'flank@@', 'Check-In' 'check-in'. I removed these duplicates by sorting and filtering but somehow my bleu scores get messed up (test and dev bleu scores show 0.0). I'm thinking it because I modified my vocabulary file it's reading it incorrectly. Attached is the vocab file.
vocab.bpe.32000.de.txt
Below a feature suggestion path.
The number of operations specified with -s leads to very different vocabulary sizes, due to the number of unique characters to start with. A value of 49500 creates small vocabularies for language that use Latin alphabet, but easily 70000 or so for Chinese. So, it would be good to subtract the number of unique characters from the number of symbols being generated.
+++ b/subword_nmt/learn_bpe.py
@@ -56,6 +56,9 @@ def create_parser(subparsers=None):
parser.add_argument('--dict-input', action="store_true",
help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
parser.add_argument(
'--total-symbols', '-t', action="store_true",
help="subtract number of characters from the symbols to be generated")
@@ -197,7 +200,7 @@ def prune_stats(stats, big_stats, threshold):
big_stats[item] = freq
-def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
+def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False, total_symbols=False):
"""Learn num_symbols BPE operations from vocabulary, and write to outfile.
"""
@@ -211,6 +214,16 @@ def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_d
stats, indices = get_pair_statistics(sorted_vocab)
big_stats = copy.deepcopy(stats)
uniq_char = defaultdict(int)
for word in vocab:
prev_char = word[0]
for char in word[1:]:
uniq_char[char] += 1
print('Number of characters: {0}'.format(len(uniq_char)))
num_symbols -= len(uniq_char)
I trained two vocabularies with about 900M Chinese-English materials, and then coded two data sets (900M training set and 500K test set) with these two Chinese-English vocabularies.
The training set can get normal results, but there are many @ in the test set.
Before using subword-nmt for bpe, I had participled Chinese.
The corresponding instructions are as follows:
python learn_joint_bpe_and_vocab.py --input data/train.en data/train.zh -s 32000 -o data/bpe32k --write-vocabulary data/vocab.en data/vocab.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/train.en > data/corpus.32k.en
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/train.zh > data/corpus.32k.zh
python apply_bpe.py --vocabulary data/vocab.zh --vocabulary-threshold 50 -c data/bpe32k < data/valid.zh > data/aval_bpe_enzh.zh
python apply_bpe.py --vocabulary data/vocab.en --vocabulary-threshold 50 -c data/bpe32k < data/valid.en > data/aval_bpe_enzh.en
The result is shown in the figure.
I don't know where the problem is. Please help me to answer it. Thank you very much.
not sure it's highly relevant, but it may be of help if someone runs into this again. when running apply_bpe.py
on a specific file with 1.9 million lines, the script outputs 4 extra lines. after a few hours of investigation it boils down to the codecs
module: for line in args.input:
produces the extra 4 lines. replacing codecs
with io
solves the issue.
unfortunately the extra lines are at random positions in the middle and once i worked around this i didn't spend extra time on finding where they come from and what may be anomalous in the data to trigger this.
i'm happy provide the offending file on request.
Hi,
Would it make sense to add a minimal frequency threshold instead of a maximal number of symbols? That way we could make sure that the symbols are being seens at least N times during NMT training.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.