glample / fastBPE
Fast BPE
License: MIT License
I am using fastBPE as part of a larger pipeline. It works fine on my PC.
But when I run it on a SageMaker instance, I get the following error:
fast: fast.cc:464: void readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string >, unsigned int, pair_hash>&, std::unordered_map<std::__cxx11::basic_string, std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string > >&): Assertion `codes.find(pair) == codes.end()' failed.
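The assertion text says that readCodes met a pair that was already in its table, which suggests the codes file lists the same pair more than once. A minimal sketch for checking a codes file for that, assuming one merge per line in the form `left right count`; `find_duplicate_pairs` is a hypothetical helper, not part of fastBPE:

```python
from collections import Counter


def find_duplicate_pairs(lines):
    """Return {(left, right): occurrences} for pairs listed more than once."""
    seen = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seen[(fields[0], fields[1])] += 1
    return {pair: n for pair, n in seen.items() if n > 1}


# Small in-memory example; the third line repeats the first pair.
sample = ["a b 1", "b c 1", "a b 0"]
print(find_duplicate_pairs(sample))
```

For a file on disk, an open file handle can be passed directly, since iterating over it yields lines.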
I'm trying to install fastBPE on Windows 10 as a dependency for another package (CTRL),
but I keep running into the following issue:
fastbpe\fastbpe\fastbpe\fastBPE.hpp(18): fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit status 2
I tried to follow [these instructions](https://github.com/pytorch/fairseq/issues/1224),
but I cannot find the files the tutorial describes (most notably, there is no cygwinccompiler.py file under distutils to patch).
I am new to Python, so I may be getting something completely basic wrong. The version I have installed is Python 3.8.
Any help would be much appreciated.
Thanks
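`unistd.h` is a POSIX header that the MSVC toolchain does not provide, which is why `cl.exe` stops at the include. A small sketch that lists such includes in a source file before attempting a build; the header set is an illustrative assumption rather than an exhaustive list, and `find_posix_includes` is a hypothetical helper:

```python
import re

# Illustrative, non-exhaustive set of POSIX headers that MSVC does not ship.
POSIX_ONLY = {"unistd.h", "sys/mman.h", "pthread.h", "dirent.h"}

# Matches both #include <header> and #include "header".
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def find_posix_includes(source_text):
    """Return the POSIX-only headers included by source_text, in order."""
    return [h for h in INCLUDE_RE.findall(source_text) if h in POSIX_ONLY]


example = '#include <sys/mman.h>\n#include <string>\n#include <unistd.h>\n'
print(find_posix_includes(example))
```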
Hi,
Why do we need to provide a vocab when applying BPE to the valid/test data sets?
I am having issues after installing the Python API and trying to import the library. I am able to compile it with the g++ command in the readme and the ./fast commands work fine. I then install the Python API following the instructions and it seems to install fine. However when I go to import the library I keep getting the same error:
ImportError: dlopen(/anaconda3/lib/python3.6/site-packages/fastBPE-0.0.0-py3.6-macosx-10.7-x86_64.egg/fastBPE.cpython-36m-darwin.so, 2): Symbol not found: __ZTINSt6thread6_StateE
Referenced from: /anaconda3/lib/python3.6/site-packages/fastBPE-0.0.0-py3.6-macosx-10.7-x86_64.egg/fastBPE.cpython-36m-darwin.so
Expected in: flat namespace
in /anaconda3/lib/python3.6/site-packages/fastBPE-0.0.0-py3.6-macosx-10.7-x86_64.egg/fastBPE.cpython-36m-darwin.so
I have tried just about everything I can think of, including a new Python virtual environment and a new conda virtual environment, but the same error persists. This is my Python configuration:
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 29 2018, 19:04:46)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
gcc -v:
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
g++ -v:
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Any help would be appreciated.
I tried to install fastBPE with "pip3 install fastBPE", but errors occurred. Here are the logs:
Collecting fastBPE
Using cached fastBPE-0.1.0.tar.gz (35 kB)
Preparing metadata (setup.py) ... done
Building wheels for collected packages: fastBPE
Building wheel for fastBPE (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
package init file 'fastBPE\__init__.py' not found (or not a regular file)
running build_ext
building 'fastBPE' extension
creating build
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\fastBPE
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /TpfastBPE/fastBPE.cpp /Fobuild\temp.win-amd64-3.9\Release\fastBPE/fastBPE.obj -std=c++11 -Ofast -pthread
cl : Command line warning D9025 : overriding '/Os' with '/Ot'
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
cl : Command line warning D9002 : ignoring unknown option '-Of'
cl : Command line warning D9002 : ignoring unknown option '-Oa'
cl : Command line warning D9002 : ignoring unknown option '-pthread'
fastBPE.cpp
C:\Users\leehsiang\AppData\Local\Temp\pip-install-kj3mgi5k\fastbpe_a8c8008d8dc444068a0ab7d1b2517139\fastBPE\fastBPE.hpp(15): fatal error C1083: Cannot open include file: 'sys/mman.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for fastBPE
Running setup.py clean for fastBPE
Failed to build fastBPE
WARNING: Ignoring invalid distribution -umpy (c:\users\leehsiang\appdata\local\programs\python\python39\lib\site-packages)
WARNING: Ignoring invalid distribution - (c:\users\leehsiang\appdata\local\programs\python\python39\lib\site-packages)
Installing collected packages: fastBPE
Running setup.py install for fastBPE ... error
error: subprocess-exited-with-error
× Running setup.py install for fastBPE did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running install
running build
running build_py
package init file 'fastBPE\__init__.py' not found (or not a regular file)
running build_ext
building 'fastBPE' extension
creating build
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
creating build\temp.win-amd64-3.9\Release\fastBPE
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Users\leehsiang\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /TpfastBPE/fastBPE.cpp /Fobuild\temp.win-amd64-3.9\Release\fastBPE/fastBPE.obj -std=c++11 -Ofast -pthread
cl : Command line warning D9025 : overriding '/Os' with '/Ot'
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
cl : Command line warning D9002 : ignoring unknown option '-Of'
cl : Command line warning D9002 : ignoring unknown option '-Oa'
cl : Command line warning D9002 : ignoring unknown option '-pthread'
fastBPE.cpp
C:\Users\leehsiang\AppData\Local\Temp\pip-install-kj3mgi5k\fastbpe_a8c8008d8dc444068a0ab7d1b2517139\fastBPE\fastBPE.hpp(15): fatal error C1083: Cannot open include file: 'sys/mman.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> fastBPE
Could anyone help me?
Loading codes from data/processed/XLM_en_zh/50k/codes ...
fast: fastBPE/fastBPE.hpp:458: void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::basic_string, std::basic_string >, unsigned int, fastBPE::pair_hash>&, std::unordered_map<std::basic_string, std::pair<std::basic_string, std::basic_string > >&): Assertion `codes.find(pair) == codes.end()' failed.
I did this, but the same error still occurs.
What should I do?
In order to use this lib (or downstream libs) interactively or at run-time, fastBPE should support sending a single line without loading the initial model each time.
See facebookresearch/LASER#46 and facebookresearch/UnsupervisedMT#27
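On the Python side, one way to avoid reloading is to build the encoder once and reuse it for every incoming line. A minimal sketch of that pattern, assuming an encoder object that exposes `apply(list_of_sentences)`; `LineEncoder`, `make_encoder` and `FakeEncoder` are hypothetical names used only for illustration:

```python
class LineEncoder:
    """Builds the underlying encoder on first use, then reuses it."""

    def __init__(self, make_encoder):
        self._make_encoder = make_encoder  # zero-argument factory (assumption)
        self._encoder = None

    def encode(self, line):
        if self._encoder is None:
            self._encoder = self._make_encoder()  # model is loaded once, here
        return self._encoder.apply([line])[0]


class FakeEncoder:
    """Stand-in for a real encoder so the sketch runs without fastBPE."""

    loads = 0  # counts how many times the "model" was loaded

    def __init__(self):
        FakeEncoder.loads += 1

    def apply(self, sentences):
        return [s.upper() for s in sentences]


enc = LineEncoder(FakeEncoder)
print(enc.encode("first line"), "|", enc.encode("second line"))
print("loads:", FakeEncoder.loads)
```

With the real library, the factory would wrap whatever call constructs the encoder from the codes and vocabulary paths.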
Not sure what's going on. I thought this was working a few weeks ago when I tried it on Ubuntu 18 in WSL2, but now on native Ubuntu 20 it's not working.
OS (e.g., Linux): ubuntu 20.04
python 3.8
pip3 install fastBPE
Collecting fastBPE
Using cached fastBPE-0.1.0.tar.gz (35 kB)
Building wheels for collected packages: fastBPE
Building wheel for fastBPE (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gutktu1d/fastBPE/setup.py'"'"'; __file__='"'"'/tmp/pip-install-gutktu1d/fastBPE/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-hlyefd0c
cwd: /tmp/pip-install-gutktu1d/fastBPE/
Complete output (12 lines):
running bdist_wheel
running build
running build_py
package init file 'fastBPE/__init__.py' not found (or not a regular file)
running build_ext
building 'fastBPE' extension
creating build
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/fastBPE
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -IfastBPE -I/usr/include/python3.8 -c fastBPE/fastBPE.cpp -o build/temp.linux-x86_64-3.8/fastBPE/fastBPE.o -std=c++11 -Ofast -pthread
unable to execute 'x86_64-linux-gnu-gcc': No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for fastBPE
Running setup.py clean for fastBPE
Failed to build fastBPE
Installing collected packages: fastBPE
Running setup.py install for fastBPE ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gutktu1d/fastBPE/setup.py'"'"'; __file__='"'"'/tmp/pip-install-gutktu1d/fastBPE/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h1vszop2/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/mike/.local/include/python3.8/fastBPE
cwd: /tmp/pip-install-gutktu1d/fastBPE/
Complete output (12 lines):
running install
running build
running build_py
package init file 'fastBPE/__init__.py' not found (or not a regular file)
running build_ext
building 'fastBPE' extension
creating build
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/fastBPE
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -IfastBPE -I/usr/include/python3.8 -c fastBPE/fastBPE.cpp -o build/temp.linux-x86_64-3.8/fastBPE/fastBPE.o -std=c++11 -Ofast -pthread
unable to execute 'x86_64-linux-gnu-gcc': No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gutktu1d/fastBPE/setup.py'"'"'; __file__='"'"'/tmp/pip-install-gutktu1d/fastBPE/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h1vszop2/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/mike/.local/include/python3.8/fastBPE Check the logs for full command output.
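The line `unable to execute 'x86_64-linux-gnu-gcc': No such file or directory` suggests that the C/C++ compiler is not installed on this machine, rather than a problem in fastBPE itself. A quick sketch for checking which compilers are on `PATH` before running pip; the list of names is an assumption and may need adjusting on other platforms:

```python
import shutil


def find_compilers(names=("x86_64-linux-gnu-gcc", "gcc", "g++", "cc", "c++")):
    """Map each compiler name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_compilers().items():
    print(f"{name}: {path or 'not found'}")
```

On Ubuntu these executables are normally provided by the `build-essential` package.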
Hi, nice work first of all!
Even though the output while loading the codes and vocab files can be helpful, I'd prefer it to be toggled by a verbose command-line parameter. Otherwise you can't directly use stdout while using the Python module.
That's the output:
Loading vocabulary from bpe.30000.vocab ...
Read 1258523173 words (30708 unique) from vocabulary file.
Loading codes from bpe.30000.codes ...
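Until such a flag exists, one workaround is to redirect the process-level file descriptor around the call, because text printed by native code bypasses Python's `sys.stdout` and `sys.stderr` objects. A minimal sketch, assuming the messages are written to file descriptor 2 (stderr); `redirect_fd` is a hypothetical helper, and the `os.write` call merely stands in for the library's output:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def redirect_fd(fd, target_path=os.devnull):
    """Temporarily point OS-level file descriptor `fd` at `target_path`."""
    saved = os.dup(fd)  # keep a copy of the original descriptor
    target = os.open(target_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.dup2(target, fd)
        yield
    finally:
        os.dup2(saved, fd)  # restore the original descriptor
        os.close(saved)
        os.close(target)


def demo():
    """Write to fd 2 while redirected and return what was captured."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "native.log")
        with redirect_fd(2, log_path):
            os.write(2, b"Loading codes ...\n")  # stands in for native output
        with open(log_path, "rb") as f:
            return f.read()


print(demo())
```

Leaving `target_path` at its default discards the output instead of capturing it.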
When building fastBPE on devfair with:
g++ -std=c++11 -pthread -O3 fast.cc -o fast
I get the following error:
fast.cc: In function ‘int safeOpen(const char*, int, mode_t)’:
fast.cc:44:51: error: ‘fp’ was not declared in this scope
fprintf(stderr, "Cannot open text file %s\n", fp);
^
I have found that applying the same codes and vocabulary with fastBPE and with subword-nmt gives me different BPE output:
Using fastBPE
hoy quiero que te qu@@ ede &@@ apo@@ s@@ ; a dormir
this song is gonna make you mad
Using subword-nmt
ho@@ y qui@@ ero que te que@@ de &@@ apo@@ s@@ ; a dor@@ mir
th@@ is son@@ g is gon@@ na make you mad
Both runs use the same codes and vocabulary, with minimal adaptation for the latter package. My understanding was that BPE implementations should behave almost identically.
I have asked the subword-nmt author as well: rsennrich/subword-nmt#76
Hi, your setup tries to handle the case where Cython is not available, but that path does not work. You should either declare Cython as a dependency and raise an error when it is missing, or provide the missing file.
Line 6 in 1fd3318
extension = 'cpp'
which leads to "fastBPE/fastBPE." + extension
but there is no fastBPE/fastBPE.cpp file.
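The "raise an error" option could look roughly like this. It is a sketch of the selection logic only, not the project's actual `setup.py`; `pick_extension_source` and its parameters are hypothetical:

```python
import os


def pick_extension_source(base_path, have_cython):
    """Return the source file to build, or raise if it cannot be built."""
    if have_cython:
        return base_path + ".pyx"  # Cython generates the C++ file itself
    cpp_path = base_path + ".cpp"
    if not os.path.exists(cpp_path):
        raise RuntimeError(
            f"{cpp_path} is missing and Cython is not installed; "
            "install Cython or ship the generated .cpp file"
        )
    return cpp_path


try:
    print(pick_extension_source("no_such_dir/fastBPE", have_cython=False))
except RuntimeError as err:
    print("error:", err)
```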
Adapting the number of threads (commit c520726) causes a compilation error on my Mac.
fast.cc:614:36: error: variable length array of non-POD element type 'unordered_map<string, string>' (aka 'unordered_map<basic_string<char, char_traits<char>, allocator<char> >, basic_string<char,
char_traits<char>, allocator<char> > >')
unordered_map<string, string> bpe[kThreads];
Reverting to a fixed number of threads (10) fixes it.
I use Apple LLVM version 8.0.0 (clang-800.0.42.1) (Target: x86_64-apple-darwin16.7.0)
Hey,
learnbpe is failing on this line if there's not enough available memory. I tried to run the script on my machine with 8 GB of RAM and it threw:
Loading vocabulary from alice_in_wonderland.txt ...
Read 26443 words (5293 unique) from text file.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Running the same on a server with ~512 GB doesn't fail. Is there any way of fixing this so BPE can be run locally?
I am wondering on what basis the number 40000 or 60000 is chosen. Is it a rough estimate of the vocabulary size of the corpora?
Thanks.
/home/dina/code/fastBPE/fast learnbpe 500 /home/dina/CodeGen/data/test_dataset/cpp.sa-cl.tok.shuf.50gb > /home/dina/CodeGen/data/test_dataset/cpp.sa-cl.codes
Loading vocabulary from /home/dina/CodeGen/data/test_dataset/cpp.sa-cl.tok.shuf.50gb ...
Read 311990 words (17104 unique) from text file.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
There are different ways byte pair encoding could be applied.
import fastBPE

bpe = fastBPE.fastBPE(codes_path, vocab_path)
bpe.apply(["Roasted barramundi fish", "Centrally managed over a client-server architecture"])
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    bpe = fastBPE.fastBPE(codes_path, vocab_path)
NameError: name 'codes_path' is not defined
My environment is Linux @myleott
I am working on a remote Ubuntu server using SSH. Python 3.6. I am unable to install fastBPE and get this error:
error: command 'gcc' failed with exit status 1
As a result, I'm unable to execute my code. I am unable to install this with pip or conda as well. Please help.
Sorry, this should have been filed as an issue on microsoft/MASS.
I am trying to run bash file "get-data-nmt.sh".
Compiling fastBPE...
g++: error: fast.cc: No such file or directory
g++: fatal error: no input file
It looks like the build command for fastBPE has changed:
from:
g++ -std=c++11 -pthread -O3 fast.cc -o fast
to:
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
please take this into consideration.
I use fastBPE to apply BPE because it is faster, but I want to use the original codes file learned with subword-nmt. I found a clear difference between the two results. Is it because fastBPE cannot use a codes file learned with subword-nmt?
Hello,
I am trying to use FastBPE for unsupervised NMT.
After learning the codes with
./fast learnbpe 60000 ../data/cloze.txt ../data/natural.txt > codes
when I call applybpe as below, the output I get in clozebpe is identical to ../data/cloze.txt.
This isn't the expected behavior, right? How do I split the words in my input text into subword units?
The instructions provided here give a working solution for setting up fastBPE on Windows 10. Fairseq with GPU is possible on Windows 10.
Hello, as title says, what's the licensing for this code? Thanks!
Why is the limitVocab function necessary? And is it correctly implemented? I see that the query string might have a kTokenDelim attached but then the lookup in vocab makes no sense. What is going on here?
An error occurred when running pip install fastBPE on Windows 10 with Python 3.6.
The method you provide, constraining the number of codes as in "./fast learnbpe 40000 train.de train.en > codes", only adjusts the size of the vocab indirectly.
My questions are:
Thanks!!
Hello guys,
I applied getvocab to a French text with the following command:
./fast getvocab marie_claire.txt > new_vocab
However, I have seen a bug (if it is a bug!): some tokens are duplicated, with the second copy written with a line break. Here is an example (just a short extract of the full vocab output):
You can see et and de in the example above. Furthermore, the vocab starts exactly as reported: a line break, a space and the frequency (2439). Is this still a bug?
Here is the French text:
wget -O marie_claire.txt http://www.gutenberg.org/cache/epub/58501/pg58501.txt
Any ideas?
Thanks a lot for your help :)
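One possible cause, offered as an assumption rather than a confirmed diagnosis: if the downloaded file uses Windows line endings and the tokenizer splits only on spaces and newlines, a token followed by a carriage return would be counted separately from the plain token and would print with what looks like a line break. A short sketch to test that hypothesis and to normalize the text; both function names are hypothetical:

```python
def count_carriage_returns(text):
    """Number of carriage-return characters in text."""
    return text.count("\r")


def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Two lines ending in CRLF, as a stand-in for the downloaded file.
sample = "et de\r\net\r\n"
print(count_carriage_returns(sample))
print(normalize_newlines(sample).split())
```

When reading the real file for this check, open it in binary mode or with `newline=""`, because Python's default text mode already translates line endings.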
I am installing into a Docker image based on Ubuntu 18.04. Installing the Python package requires fastBPE/fastBPE.cpp, which is missing.
See below:
~/fastBPE# python setup.py install
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
running install
running bdist_egg
running egg_info
writing fastBPE.egg-info/PKG-INFO
writing top-level names to fastBPE.egg-info/top_level.txt
writing dependency_links to fastBPE.egg-info/dependency_links.txt
package init file 'fastBPE/__init__.py' not found (or not a regular file)
reading manifest file 'fastBPE.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'fastBPE.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fastBPE' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-_wncS2/python2.7-2.7.15=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/include/python2.7 -c fastBPE/fastBPE.cpp -o build/temp.linux-x86_64-2.7/fastBPE/fastBPE.o -std=c++11 -Ofast -pthread
x86_64-linux-gnu-gcc: error: fastBPE/fastBPE.cpp: No such file or directory
x86_64-linux-gnu-gcc: fatal error: no input files
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
g++: error: fastBPE/main.cc: No such file or directory
g++: fatal error: no input files
compilation terminated.
The workaround suggested under pytorch/fairseq issue #1224 is also not working.
I followed @zjplab's instructions:
1. Downloaded the fastBPE repo
2. Added "if msc_ver >= '1900': return ['vcruntime140']" in distutils\cygwinccompiler.py
3. Created the files mman.c and mman.h with the code that you shared
4. Replaced #include <sys/mman.h> with #include "mman.c" in fastBPE.hpp
I'm getting this error:
'thread' was not declared in this scope
28 | const size_t kThreads = max(1, min(10, int(thread::hardware_concurrency())))
I tried to debug this and searched for the issue on other sites, but was not able to find a solution. Could you please assist? Full output:
python setup.py build --compiler=mingw32
running build
running build_py
running build_ext
building 'fastBPE' extension
C:\MinGW\bin\gcc.exe -mdll -O -Wall -IfastBPE -IC:\Users\aa\AppData\Local\Programs\Python\Python37\include -IC:\Users\aa\AppData\Local\Programs\Python\Python37\include -c fastBPE/fastBPE.cpp -o build\temp.win-amd64-3.7\Release\fastbpe\fastbpe.o -std=c++11 -Ofast -pthread
In file included from fastBPE/fastBPE.cpp:653:
fastBPE/fastBPE.hpp:28:44: error: 'thread' was not declared in this scope
   28 | const size_t kThreads = max(1, min(10, int(thread::hardware_concurrency())));
fastBPE/fastBPE.hpp:20:1: note: 'std::thread' is defined in header '<thread>'; did you forget to '#include <thread>'?
fastBPE/fastBPE.hpp: In function 'void fastBPE::readText(const char*, std::unordered_map<std::__cxx11::basic_string, unsigned int>&)':
fastBPE/fastBPE.hpp:83:27: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'uint64_t' {aka 'long long unsigned int'} [-Wformat=]
   83 | fprintf(stderr, "Read %lu words (%lu unique) from text file.\n", total,
fastBPE/fastBPE.hpp:83:38: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'std::unordered_map<std::__cxx11::basic_string, unsigned int>::size_type' {aka 'unsigned int'} [-Wformat=]
   83 | fprintf(stderr, "Read %lu words (%lu unique) from text file.\n", total,
   84 | word_count.size());
fastBPE/fastBPE.hpp: In function 'void fastBPE::outputText(const char*, const char*, std::unordered_map<std::__cxx11::basic_string, std::__cxx11::basic_string >&)':
fastBPE/fastBPE.hpp:136:7: error: 'ftruncate' was not declared in this scope; did you mean 'strncat'?
  136 | if (ftruncate(fdOut, out_size) < 0) {
fastBPE/fastBPE.hpp:137:65: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
  137 | fprintf(stderr, "Couldn't truncate output file %s to size %lu\n", fpo,
  138 | out_size);
fastBPE/fastBPE.hpp:149:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'long long unsigned int' [-Wformat=]
  149 | fprintf(stderr, "Modified %lu words from text file.\n", p.second);
fastBPE/fastBPE.hpp: In function 'void fastBPE::readVocab(const char*, std::unordered_map<std::__cxx11::basic_string, unsigned int>&)':
fastBPE/fastBPE.hpp:438:27: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'uint64_t' {aka 'long long unsigned int'} [-Wformat=]
  438 | fprintf(stderr, "Read %lu words (%lu unique) from vocabulary file.\n", total,
fastBPE/fastBPE.hpp:438:38: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'std::unordered_map<std::__cxx11::basic_string, unsigned int>::size_type' {aka 'unsigned int'} [-Wformat=]
  438 | fprintf(stderr, "Read %lu words (%lu unique) from vocabulary file.\n", total,
  439 | vocab.size());
fastBPE/fastBPE.hpp: In function 'void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string >, unsigned int, fastBPE::pair_hash>&, std::unordered_map<std::__cxx11::basic_string, std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string > >&)':
fastBPE/fastBPE.hpp:462:27: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'std::unordered_map<std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string >, unsigned int, fastBPE::pair_hash>::size_type' {aka 'unsigned int'} [-Wformat=]
  462 | fprintf(stderr, "Read %lu codes from the codes file.\n", codes.size());
fastBPE/fastBPE.hpp: In function 'void fastBPE::applybpe(const char*, const char*, const char*, const char*)':
fastBPE/fastBPE.hpp:607:37: error: size of array 'bpe' is not an integral constant-expression
  607 | unordered_map<string, string> bpe[kThreads];
fastBPE/fastBPE.hpp:608:10: error: 'thread' was not declared in this scope
  608 | vector<thread> threads;
fastBPE/fastBPE.hpp:608:10: note: 'std::thread' is defined in header '<thread>'; did you forget to '#include <thread>'?
fastBPE/fastBPE.hpp:608:16: error: template argument 1 is invalid
  608 | vector<thread> threads;
fastBPE/fastBPE.hpp:608:16: error: template argument 2 is invalid
fastBPE/fastBPE.hpp:610:13: error: request for member 'emplace_back' in 'threads', which is of non-class type 'int'
  610 | threads.emplace_back(
fastBPE/fastBPE.hpp:623:14: error: invalid types 'int[size_t {aka unsigned int}]' for array subscript
  623 | threads[i].join();
error: command 'C:\MinGW\bin\gcc.exe' failed with exit status 1
It would be worth providing a tutorial on training a cross-lingual model (classification, etc.) using fastText with BPE preprocessing. It's not exactly clear to me how this would work in practice for a given input training set and a BPE model, say the 93langs.fcodes and 93langs.fvocab files (the ones provided with Facebook's LASER bi-LSTM model). In my case I would like to use BPE in combination with a simpler fastText supervised classifier model.
The corpus is 4 GB and the machine has 16 GB of memory.
I guess the process is killed because memory is full.
How can I deal with this without reducing ncodes?
I executed:
./fast learnbpe 2000 ../../CORPUS/train_data/train.en ../../CORPUS/train_data/train.te ../../CORPUS/train_data/en_te_codes
Error:
fast: fastBPE/main.cc:30: int main(int, char**): Assertion `argc == 4 || argc == 5' failed.
Aborted (core dumped)
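The assertion `argc == 4 || argc == 5` means the binary accepts the program name plus three or four arguments, so learnbpe takes the number of codes and one or two input files. The command above passes a third path, which gives `argc == 6`. In the other learnbpe examples on this page the codes are written to standard output and redirected with `>`. A sketch that builds such a call from Python; `build_learnbpe_argv` and `learn_codes` are hypothetical helpers, and `./fast` is assumed to be the compiled binary:

```python
import subprocess


def build_learnbpe_argv(fast_path, n_codes, inputs):
    """Return the argv list for `fast learnbpe` with one or two input files."""
    if not 1 <= len(inputs) <= 2:
        raise ValueError("learnbpe expects one or two input files")
    return [fast_path, "learnbpe", str(n_codes), *inputs]


def learn_codes(fast_path, n_codes, inputs, codes_path):
    """Run learnbpe and write its standard output to codes_path."""
    argv = build_learnbpe_argv(fast_path, n_codes, inputs)
    with open(codes_path, "w", encoding="utf-8") as out:
        subprocess.run(argv, stdout=out, check=True)


# Only the argument list is built here; nothing is executed.
print(build_learnbpe_argv("./fast", 2000, ["train.en", "train.te"]))
```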
Hi,
Any chance you'll create a Python API for learnbpe as well?
running install
running bdist_egg
running egg_info
writing fastBPE.egg-info\PKG-INFO
writing dependency_links to fastBPE.egg-info\dependency_links.txt
writing top-level names to fastBPE.egg-info\top_level.txt
package init file 'fastBPE\__init__.py' not found (or not a regular file)
reading manifest file 'fastBPE.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'fastBPE.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
running build_ext
building 'fastBPE' extension
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MT -IfastBPE -IC:\Users\liug\Anaconda2\envs\py3\include -IC:\Users\liug\Anaconda2\envs\py3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.7.2\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /TpfastBPE/fastBPE.cpp /Fobuild\temp.win-amd64-3.7\Release\fastBPE/fastBPE.obj -std=c++11 -Ofast -pthread
cl : Command line warning D9025 : overriding '/Os' with '/Ot'
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
cl : Command line warning D9002 : ignoring unknown option '-Of'
cl : Command line warning D9002 : ignoring unknown option '-Oa'
cl : Command line warning D9002 : ignoring unknown option '-pthread'
fastBPE.cpp
F:\lab\写作辅助项目\fastBPE-master\fastBPE\fastBPE.hpp(15): fatal error C1083: Cannot open include file: 'sys/mman.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\bin\HostX86\x64\cl.exe' failed with exit status 2
Does the file mman.h exist in the GitHub package?
Hi,
I was looking at the output of the learnbpe command. For example, I have a file containing the following tokens:
a b c d e f ab bc cd de ef
Running getvocab reports 11 tokens (11 unique), which is understandable.
Then I run learnbpe with ncodes set to 11 on this file
and get the following output:
a b 1
b c 1
d e 1
e f 1
c d 1
a b 0
a b 0
a b 0
a b 0
a b 0
a b 0
I am not able to understand what the output signifies.
Best,
Pranay
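Each output line is one learned merge: the two symbols that were joined, followed by how often that pair occurred when it was chosen. A toy version of the textbook BPE learning loop makes this visible on the same input. This is an illustrative sketch, not fastBPE's code: it ignores end-of-word markers, breaks ties by pair value, and stops once no pairs remain, whereas the output above keeps printing lines with count 0.

```python
from collections import Counter


def learn_bpe(words, n_codes):
    """Return up to n_codes merges as (left, right, count) tuples."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(n_codes):
        # Count every adjacent symbol pair across the current vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is a single symbol, nothing left to merge
        (left, right), count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((left, right, count))
        # Rewrite each word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    out.append(left + right)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges


corpus = "a b c d e f ab bc cd de ef".split()
for left, right, count in learn_bpe(corpus, 11):
    print(left, right, count)
```

On this input only five pairs exist, each seen once, so asking for 11 codes cannot yield more than five meaningful merges.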
Hi,
I have a question about the learnbpe operation. The example in the README.md learns the BPE codes jointly for en and de, and then applies the codes to en and de separately:
./fast learnbpe 40000 train.de train.en > codes
./fast applybpe train.de.40000 train.de codes
./fast applybpe train.en.40000 train.en codes
Here are my questions:
What's the purpose of jointly learning BPE codes for en and de? If en and de do not share embeddings in the NMT system, is it more reasonable to learn BPE codes for en and de separately?
What's the difference between the number 40000 in learnbpe and in applybpe?
Thanks~
Thanks a lot for this wonderful BPE toolkit.
I notice that in subword-nmt there is an argument named 'glossaries', https://github.com/rsennrich/subword-nmt/blob/18a5c87046d15290a1b7d947449052aa6d2b47cc/subword_nmt/apply_bpe.py#L158 . Words matching any of the words/regex provided in glossaries will not be affected.
Is it possible to add this argument to fastBPE, so that the user can specify glossaries in a file of words or regexes? Thanks a lot for your time and consideration.
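In the meantime, the behaviour can be approximated outside the library by leaving matching words untouched and segmenting only the rest. A minimal sketch of that idea; `apply_with_glossaries` and `toy_segment` are hypothetical, the segmenter is a stand-in for a real per-word BPE call, and the matching rule (full-word regex match) does not claim to reproduce subword-nmt's exact semantics:

```python
import re


def apply_with_glossaries(sentence, segment_word, glossaries):
    """Segment each word unless it fully matches one of the glossary regexes."""
    patterns = [re.compile(g) for g in glossaries]
    out = []
    for word in sentence.split():
        if any(p.fullmatch(word) for p in patterns):
            out.append(word)  # protected word: left untouched
        else:
            out.append(segment_word(word))
    return " ".join(out)


def toy_segment(word):
    """Stand-in segmenter: splits every word after its first character."""
    return word if len(word) < 2 else word[0] + "@@ " + word[1:]


print(apply_with_glossaries("visit USA in 2019", toy_segment, [r"USA", r"\d+"]))
```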
Hello guys,
I have below issue when I run python setup.py install
.
Warning: passing language='c++' to cythonize() is deprecated. Instead, put "# distutils: language=c++" in your .pyx or .pxd file(s) running install running bdist_egg running egg_info writing fastBPE.egg-info/PKG-INFO writing dependency_links to fastBPE.egg-info/dependency_links.txt writing requirements to fastBPE.egg-info/requires.txt writing top-level names to fastBPE.egg-info/top_level.txt package init file 'fastBPE/__init__.py' not found (or not a regular file) reading manifest file 'fastBPE.egg-info/SOURCES.txt' writing manifest file 'fastBPE.egg-info/SOURCES.txt' installing library code to build/bdist.macosx-10.7-x86_64/egg running install_lib running build_py running build_ext building 'fastBPE' extension gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -IfastBPE -I/anaconda3/include/python3.7m -c fastBPE/fastBPE.cpp -o build/temp.macosx-10.7-x86_64-3.7/fastBPE/fastBPE.o -std=c++11 -Ofast -pthread warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found] fastBPE/fastBPE.cpp:629:10: fatal error: 'ios' file not found #include "ios" ^~~~~ 1 warning and 1 error generated.
Could anyone help me solve this problem?
System: anaconda3, Python 3.7.3
Thanks a lot.
Hi, great work on this. Do you have any plans to add support for BPE-dropout (https://arxiv.org/pdf/1910.13267.pdf)?
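For readers unfamiliar with the idea, here is a toy sketch of BPE-dropout as I understand it from the paper: at every merge step, each candidate merge is skipped with probability p, and the best remaining merge is applied. This is an illustration in plain Python, not fastBPE code; the function name bpe_segment and the tiny merge table are made up for the example.

```python
import random


def bpe_segment(word, merges, dropout=0.0, rng=random):
    """Segment `word` with an ordered list of merges, optionally with dropout.

    `merges` is a list of symbol pairs, highest priority first.
    With dropout=0.0 this is plain greedy BPE; with dropout=1.0 no merge
    is ever applied and the word stays split into characters.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect every applicable merge, dropping each with probability `dropout`.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
no_dropout = bpe_segment("lower", toy_merges)
full_dropout = bpe_segment("lower", toy_merges, dropout=1.0)
```

Values of dropout strictly between 0 and 1 give a different random segmentation on each call, which is the regularization effect the paper describes.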
Hi,
I am getting the following error when I try to install fastBPE on Windows 10. Can you please help me out?
Regards
Nagaraju
ERROR: Failed building wheel for fastBPE
Running setup.py clean for fastBPE
Failed to build fastBPE
Installing collected packages: fastBPE
Running setup.py install for fastBPE ... error
ERROR: Command errored out with exit status 1:
command: 'C:\ProgramData\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\nb185041\AppData\Local\Temp\pip-install-db26d_b_\fastBPE\setup.py'"'"'; __file__='"'"'C:\Users\nb185041\AppData\Local\Temp\pip-install-db26d_b_\fastBPE\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\nb185041\AppData\Local\Temp\pip-record-wqjcib3y\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\ProgramData\Anaconda3\Include\fastBPE'
cwd: C:\Users\nb185041\AppData\Local\Temp\pip-install-db26d_b_\fastBPE
Complete output (19 lines):
running install
running build
running build_py
package init file 'fastBPE\__init__.py' not found (or not a regular file)
running build_ext
building 'fastBPE' extension
creating build
creating build\temp.win-amd64-3.8
creating build\temp.win-amd64-3.8\Release
creating build\temp.win-amd64-3.8\Release\fastBPE
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IfastBPE -IC:\ProgramData\Anaconda3\include -IC:\ProgramData\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /TpfastBPE/fastBPE.cpp /Fobuild\temp.win-amd64-3.8\Release\fastBPE/fastBPE.obj -std=c++11 -Ofast -pthread
cl : Command line warning D9025 : overriding '/Os' with '/Ot'
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
cl : Command line warning D9002 : ignoring unknown option '-Of'
cl : Command line warning D9002 : ignoring unknown option '-Oa'
cl : Command line warning D9002 : ignoring unknown option '-pthread'
fastBPE.cpp
C:\Users\nb185041\AppData\Local\Temp\pip-install-db26d_b_\fastBPE\fastBPE\fastBPE.hpp(15): fatal error C1083: Cannot open include file: 'sys/mman.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\bin\HostX86\x64\cl.exe' failed with exit status 2
----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\ProgramData\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\nb185041\AppData\Local\Temp\pip-install-db26d_b_\fastBPE\setup.py'"'"'; __file__='"'"'C:\Users\nb185041\AppData\Local\Temp\pip-install-db26d_b_\fastBPE\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\nb185041\AppData\Local\Temp\pip-record-wqjcib3y\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\ProgramData\Anaconda3\Include\fastBPE' Check the logs for full command output.
(base) C:\Users\nb185041\fairseq>
I have been trying to do that but couldn't figure it out.
I'm currently using subword-nmt to run BPE programmatically in Python. Its implementation mostly matches the fastBPE command line (applybpe, etc.), like here:
# fastBPE/fast applybpe ofn ifn bpe_codes bpe_vocab
from subword_nmt.apply_bpe import BPE
bpe = BPE(bpe_codes, merges=-1, separator='@@', vocab=bpe_vocab, glossaries=None)
codes = bpe.process_line(line)
My question is whether a SWIG wrapper exposing the C++ headers to Python would make sense as an alternative to using subword-nmt.
Thank you.
Would it be possible to publish a 0.1.0 release on GitHub? It's available on PyPI, but the tarballs are different. Thanks!
TL;DR: fastBPE/fastBPE.cpp missing <=> (cython not installed => install crash)
According to setup.py:
try:
    from Cython.Build import cythonize
except ImportError:
    use_cython = False
else:
    use_cython = True
if use_cython:
    extension = 'pyx'
else:
    extension = 'cpp'
So if from Cython.Build import cythonize fails, extension = 'cpp'.
Then at line 22 we see "fastBPE/fastBPE." + extension, so the build expects fastBPE/fastBPE.cpp.
According to your GitHub repository, this file does not exist.
It's the same as this: facebookresearch/LASER#41
Thanks
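One possible way to turn this crash into a clear message is sketched below: check for the generated .cpp file before falling back to it. This is only an illustration of the idea, not the project's actual setup.py; the helper name choose_source and its parameters are hypothetical.

```python
import os


def choose_source(base="fastBPE/fastBPE", cython_available=True, exists=os.path.exists):
    """Pick the extension source file, failing early with a clear message.

    `base` mirrors the path quoted from setup.py; `exists` is injectable
    so the logic can be exercised without touching the filesystem.
    """
    if cython_available:
        # With Cython installed, build from the .pyx source.
        return base + ".pyx"
    cpp = base + ".cpp"
    if not exists(cpp):
        # Without Cython and without a pre-generated .cpp, nothing can be built.
        raise RuntimeError(
            f"{cpp} not found and Cython is not installed; "
            "install Cython (pip install cython) and retry."
        )
    return cpp


with_cython = choose_source(cython_available=True)
try:
    choose_source(cython_available=False, exists=lambda path: False)
    message = None
except RuntimeError as err:
    message = str(err)
```

In the meantime, installing Cython before running the install avoids the 'cpp' branch entirely.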