Comments (4)
tiktoken is supposed to be much faster than tokenizers for BPE tokenizers.
Proof please.
Also proof that the difference in speed is actually relevant in real-world use cases.
If tokenization takes 0.1% of the time in real workloads (which it roughly does in a lot of current LLM scenarios, though I may not be understanding all use cases), then even infinitely large speedups are kind of irrelevant.
Here is an example of a good case where the problem was explained properly, which enabled us to solve it.
#1413 (comment)
FWIW, there are a lot of libraries claiming to be faster than X, some even claiming to be much faster than tiktoken.
I am actually interested in deep diving a bit into the potential reasons why we are slower, and updating our implementation accordingly as long as we don't break anything, or otherwise shipping a new, faster BPE. As @Narsil mentioned, benchmarks are tricky to get right, and I am surprised that the thread count does not help tokenizers much in the provided bench.
I'll see what I can do once I have a bit less on my plate!
FYI @itazap
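One plausible explanation for the thread-count observation, though this is my assumption rather than something verified against the linked bench: tokenizers only parallelises encode_batch, so per-string encode calls never see the extra threads. A minimal sketch of the distinction:

import os
os.environ["RAYON_NUM_THREADS"] = "8"  # Rayon thread pool size; assumed to need setting before import

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")
texts = ["some example text"] * 1000

# Per-string encode: each call runs on a single thread,
# so the thread count has no effect on this loop.
ids = [tok.encode(t).ids for t in texts]

# encode_batch parallelises across the batch via Rayon,
# which is where the thread count actually matters.
batch_encodings = tok.encode_batch(texts)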
Mmm, not sure, no. I think working on a more efficient version of our BPE tokenizer that does not support word ids etc. would be more worthwhile, TBH.
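To make the potential overhead concrete, here is a minimal comparison of what each library hands back; both snippets use the public APIs of the two libraries, and the printed values are only illustrative:

import tiktoken
from tokenizers import Tokenizer

# tiktoken's encode returns a bare list of token ids and nothing else.
enc = tiktoken.encoding_for_model("gpt2")
print(enc.encode("hello world"))  # e.g. [31373, 995]

# tokenizers builds a full Encoding that also tracks offsets, word ids,
# attention masks, etc.: bookkeeping a slimmed-down BPE could skip.
tok = Tokenizer.from_pretrained("gpt2")
encoding = tok.encode("hello world")
print(encoding.ids, encoding.offsets, encoding.word_ids)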
Proof please.
Source: the tiktoken GitHub repository.
See also this quick benchmark I just ran myself:
# Run in IPython/Jupyter ('%timeit' is an IPython magic).
import tiktoken
from transformers import GPT2TokenizerFast

tt_tokeniser = tiktoken.encoding_for_model('gpt2')
tok_tokeniser = GPT2TokenizerFast.from_pretrained('gpt2')

text = "OpenAI's `tiktoken` tokeniser benchmarks as faster than Hugging Face's tokeniser."

%timeit tt_tokeniser.encode(text)   # 14 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit tok_tokeniser.encode(text)  # 56.5 µs ± 625 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
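For anyone reproducing this outside IPython, the same measurement can be run as a standalone script with Python's timeit module (timings will of course vary by machine):

import timeit

import tiktoken
from transformers import GPT2TokenizerFast

tt_tokeniser = tiktoken.encoding_for_model('gpt2')
tok_tokeniser = GPT2TokenizerFast.from_pretrained('gpt2')

text = "OpenAI's `tiktoken` tokeniser benchmarks as faster than Hugging Face's tokeniser."

n = 10_000  # iterations per measurement
tt = timeit.timeit(lambda: tt_tokeniser.encode(text), number=n) / n
hf = timeit.timeit(lambda: tok_tokeniser.encode(text), number=n) / n
print(f"tiktoken:   {tt * 1e6:.1f} µs per encode")
print(f"tokenizers: {hf * 1e6:.1f} µs per encode")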
Also proof that the difference in speed is actually relevant in real-world use cases.
If tokenization takes 0.1% of the time in real workloads (which it roughly does in a lot of current LLM scenarios, though I may not be understanding all use cases), then even infinitely large speedups are kind of irrelevant.
I created a semantic chunking library called semchunk, and the biggest bottleneck right now is the tokeniser, because semchunk has to repeatedly count the number of tokens in texts. This is just one use case, and when I am chunking extremely large text datasets (10GB+), that cost adds up very quickly.
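As a rough illustration of the access pattern (the splitting logic here is a hypothetical bisection, not semchunk's actual implementation), the token count sits in the hot loop:

import tiktoken

enc = tiktoken.encoding_for_model("gpt2")

def count_tokens(text: str) -> int:
    # Called once per candidate chunk, over and over while splitting.
    return len(enc.encode(text))

def chunk(text: str, max_tokens: int) -> list[str]:
    # Recursively bisect until every piece fits the token budget.
    if count_tokens(text) <= max_tokens:
        return [text]
    mid = len(text) // 2
    return chunk(text[:mid], max_tokens) + chunk(text[mid:], max_tokens)

# Over a 10GB+ corpus this means millions of encode calls,
# so per-call tokeniser latency dominates the total runtime.
chunks = chunk("some very long document " * 200, max_tokens=64)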
The fact that OpenAI created tiktoken in the first place suggests that Hugging Face's tokenisers were missing something they felt was necessary. The fact that they produced benchmarks of the tokeniser's speed further suggests that tokenisation speed is meaningful to them.
I am sure I am not alone in this.
Mmm, not sure, no. I think working on a more efficient version of our BPE tokenizer that does not support word ids etc. would be more worthwhile, TBH.
This would actually be my preference, @Narsil @ArthurZucker. If, however, you are not interested in pursuing that at this time, I'm happy to close this issue and instead work on publishing a faster implementation myself that relies on tiktoken.