Comments (12)
I'm with the majority here - I would also recommend we do not use a standard tokenizer across models.
In practice we like to say "oh Claude has 200K tokens and GPT has 128K tokens" as if they are the same thing, but each model uses a different tokenizer, so it really should be "oh Claude has 200K tokens using tokenizer X and GPT has 128K tokens using tokenizer Y".
Because we are doing length evaluation, we'll need to get as close as possible to the tokenizer used by the model (ideally the same one).
Because we won't have all the tokenizers, we'll need to have a backup or default and adjust as we get more information.
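To make the point concrete, here is a toy sketch showing how the same text yields different token counts under different tokenizers. The two "tokenizers" below are deliberately simplified stand-ins, not real model tokenizers:

```python
# Two toy stand-ins for model tokenizers. Real tokenizers (e.g. the
# BPE tokenizers behind Claude and GPT-4) differ in vocabulary and
# merge rules; these are simplified purely for illustration.
def tokenizer_x(text):
    return text.split()  # word-level splitting

def tokenizer_y(text):
    return [text[i:i + 4] for i in range(0, len(text), 4)]  # 4-char chunks

text = "the quick brown fox jumps over the lazy dog"
print(len(tokenizer_x(text)))  # 9 tokens under tokenizer X
print(len(tokenizer_y(text)))  # 11 tokens under tokenizer Y
```

So "200K tokens" under tokenizer X and "200K tokens" under tokenizer Y describe different amounts of text.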
from llmtest_needleinahaystack.
@prabha-git, this subject is one of my major interests, and I will be speaking about it at PyCon SK next week, so I have many charts like this lying around. 🙂
Don't you think that if we use cl100k_base for Gemini, for example, something like this may happen? Because the Gemini tokenizer is more efficient at encoding information, will we never test Gemini's full depth?
I am new to this discussion, but if I read @pavelkraleu correctly, an inherent problem would exist with using a general tokenizer: we could overshoot or undershoot the pin's placement within the context window by using a tokenizer different from the one the model uses.
This seems to create a couple of problems, though. Suppose a great new model, BibbityBop, is released; if BibbityBop doesn't use a tokenizer that is already implemented in NIAH, then the above scenario would happen. Wouldn't this by extension mean that the pin placement in the context window could never be perfectly accurate unless the exact tokenizer the model uses were used?
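A toy sketch of that drift, with both tokenizers as simplified stand-ins: the pin is placed at 50% depth as measured by one tokenizer, then its depth is re-measured with another:

```python
# Toy illustration of depth drift: the needle is placed at 50% depth
# according to a word-level tokenizer, but a character-chunk tokenizer
# measures a very different depth when the text's token density is
# uneven. Both tokenizers are simplified stand-ins for real ones.
def tok_a(text):
    return text.split()  # word-level

def tok_b(text):
    return [text[i:i + 3] for i in range(0, len(text), 3)]  # 3-char chunks

# Haystack with short words up front and long words at the back.
haystack = " ".join(["a"] * 50 + ["elephant"] * 50)

# Insert the needle at 50% depth according to tok_a.
words = tok_a(haystack)
mid = len(words) // 2
with_needle = " ".join(words[:mid] + ["NEEDLE"] + words[mid:])

# Re-measure the needle's depth according to tok_b.
chunk_index = with_needle.index("NEEDLE") // 3
depth = chunk_index / len(tok_b(with_needle))
print(f"depth by tok_b: {depth:.0%}")  # well under 50%
```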
> I'm with the majority here - I would also recommend we do not use a standard tokenizer across models.
> In practice we like to say "oh Claude has 200K tokens and GPT has 128K tokens" as if they are the same thing, but each model uses a different tokenizer, so it really should be "oh Claude has 200K tokens using tokenizer X and GPT has 128K tokens using tokenizer Y".
> Because we are doing length evaluation, we'll need to get as close as possible to the tokenizer used by the model (ideally the same one).
> Because we won't have all the tokenizers, we'll need to have a backup or default and adjust as we get more information.

Sounds good. I have seen people who don't realize that these models don't use the same tokenizer. But yeah, with a standardized tokenizer, we may not get the same length on the x-axis as the model's max context window.
Thanks for the discussion, will close this issue.
Thanks @prabha-git, it was great discussing this topic. I liked the articulation in terms of graphs and visual images. Please keep suggesting issues and initiating discussions 👍
Hi @prabha-git, I would like to disagree with using a standard tokenizer.
Tokenizers are coupled with models. Visit https://tiktokenizer.vercel.app/ and see that the tokens generated for GPT-2 vs GPT-4 are different. They have different vocabularies, which differ in merges as well as vocabulary size.
I understand that it is not a level playing field. But it is not something we can solve: the whole training of a model is done using tokens from a specific tokenizer.
Model inference will fail or behave randomly if we provide tokens from a tokenizer that is not compatible with it.
I would like to close the issue, as this is not feasible.
@kedarchandrayan we can't choose a tokenizer for any pre-trained transformer model, so I am not proposing that :)
Let me clarify this in detail, hope my explanation effectively conveys my points.
In our project, we utilize the tokenizer corresponding to the model under test, as demonstrated in the following example for Anthropic Models:
self.tokenizer = AnthropicModel().get_tokenizer()
Why is this tokenizer necessary for our project? The insert_needle method evaluates the context length and the depth at which the needle is inserted on the encoded string (using a tokenizer). Effectively, the tokenizer is used to measure length. We decode the text after the needle is inserted and send it for inference.
However, this approach introduces a bias, especially noticeable in longer contexts. For instance, if we set the context size to 1M tokens, Anthropic's model might encompass approximately 3M characters, whereas OpenAI's model might include around 3.2M characters of your document.
Therefore, I propose using the same tokenizer for measuring the context length and determining the depth at which to insert the needle.
Additionally, for some models, like Gemini Pro, accessing their specific tokenizer isn't possible, so we must rely on publicly available tokenizers anyway.
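A minimal sketch of the insertion step described above. The `encode`/`decode` callables stand in for a real tokenizer's API; the names are illustrative, not the repo's exact code:

```python
# Sketch of inserting a needle at a given depth percent in token
# space, then decoding back to text. The encode/decode pair stands
# in for a real tokenizer (e.g. one returned by a model's
# get_tokenizer()).
def insert_needle(context, needle, depth_percent, encode, decode):
    ctx_tokens = encode(context)
    needle_tokens = encode(needle)
    insert_at = int(len(ctx_tokens) * depth_percent / 100)
    combined = ctx_tokens[:insert_at] + needle_tokens + ctx_tokens[insert_at:]
    return decode(combined)

# Toy word-level tokenizer for demonstration.
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

print(insert_needle("one two three four", "NEEDLE", 50, encode, decode))
# -> one two NEEDLE three four
```

The same depth percent applied with a different tokenizer's `encode` would place the needle at a different character position, which is the bias discussed here.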
@prabha-git The bias which you are pointing to appears when you think from a human point of view (character length). For the model, the input is always tokenized. We are testing the ability of a model to find a needle in a haystack of tokens; that's why keeping everything in terms of native tokens is more intuitive from the model's perspective.
When you say that 1M tokens means 3M characters for Anthropic's model and 3.2M characters for OpenAI's model, you are comparing them from a human perspective. For the models these are 1M tokens. They were trained to work with tokens like these.
Following are some subtle points which I want to make:
- The tokenizer is also used for limiting the context to a context length in encode_and_trim. If we use a standard tokenizer, this can result in a context which has a different number of tokens than the maximum allowed tokens for that model.
- Placing a needle at a depth percent according to a standard tokenizer might insert the needle at a different position compared to the native tokenizer from the model.
- From a model's perspective, it makes more sense to insert the needle at the correct depth percent according to its own tokenizer. This point becomes even more relevant in the case of multi-needle testing, where there are multiple depths to consider.
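A sketch of the trimming step mentioned above (names modeled on, but not copied from, the repo's encode_and_trim):

```python
# Sketch of trimming a context to a model's maximum token count:
# encode, cut, decode. With a standard tokenizer, the decoded result
# may re-encode to a different count under the model's own tokenizer,
# which is the concern raised above.
def encode_and_trim(context, max_tokens, encode, decode):
    tokens = encode(context)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
    return decode(tokens)

encode = lambda s: s.split()  # toy word-level tokenizer
decode = lambda toks: " ".join(toks)

print(encode_and_trim("a b c d e", 3, encode, decode))  # -> a b c
```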
I agree with the point on Gemini Pro. We will have to think of something else. That is a problem to solve.
Please let me know your opinion.
Thank you, @prabha-git, for initiating this discussion. It's crucial we address this issue before we proceed with integrating models like Gemini, for which we lack a tokenizer.
If I understand you correctly:
- @prabha-git believes that the discrepancy when using a standard tokenizer across all models is acceptable.
- @kedarchandrayan believes that the discrepancy when using a standard tokenizer across all models is not acceptable.
The following chart I've created illustrates the percentage difference in the number of tokens between cl100k_base (GPT-3.5/4) and various other tokenizers when representing the same text translated into multiple languages.
From the chart above, we can interpret that:
- Gemini Pro requires approximately 15% fewer tokens to represent the same text, compared to GPT-3.5/4.
- Mistral and other models require approximately 15% more tokens to represent the same text, compared to GPT-3.5/4.
Let's explore whether a 15% discrepancy is tolerable, or if there exists an alternative solution that could reduce this error to 0%.
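The comparison behind such a chart can be sketched as follows. The two tokenizers below are toy stand-ins; a real comparison would use cl100k_base against other models' tokenizers:

```python
# Percent difference in token count between a baseline tokenizer and
# another tokenizer for the same text. The tokenizers here are toy
# stand-ins, purely to show the calculation.
def token_count_diff_pct(text, baseline_encode, other_encode):
    base = len(baseline_encode(text))
    other = len(other_encode(text))
    return 100 * (other - base) / base

baseline = lambda s: s.split()                               # word-level
other = lambda s: [s[i:i + 3] for i in range(0, len(s), 3)]  # 3-char chunks

print(f"{token_count_diff_pct('the quick brown fox', baseline, other):+.0f}%")
```

A negative result would mean the other tokenizer is more efficient than the baseline (as with Gemini Pro above), a positive one less efficient (as with Mistral).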
@pavelkraleu - that's cool that you compared across different models.
I think I am making a rather simple point. Considering this is a testing framework for various LLMs, do we want to measure all models with the same yardstick, or with the yardstick provided by each model provider? If we are just testing one model and the intention is NOT to compare the results with other models, then it is fine to use the tokenizer provided by the model.
If we are using a tokenizer provided by the model provider, comparisons like this would be misleading in a larger context window. (The image is from this post.)
I think we don't need to worry about which tokenizer a specific model is using. We measure the context length and depth using a standard tokenizer and pass the context to the LLM for inference.
If it finds the needle correctly, it is green for the given context and depth. And of course, we highlight in the documentation that we are using a standardized tokenizer. Thoughts?
Exactly, @douglasdlewis2012, that's what I think will be happening 🙂
However, we are not completely lost.
All the LLM APIs I know return something like prompt_tokens and completion_tokens, so I think we can work around it and count tokens differently.
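One way to use those reported counts is a calibration step, sketched here under assumptions: `api_prompt_tokens` is a hypothetical stand-in for reading `usage.prompt_tokens` from a real API response, and the tokenizers are toys.

```python
# Sketch: calibrate a local standard tokenizer against the token
# count the provider's API reports, and use the ratio to correct
# local measurements. `fake_api` is a hypothetical stand-in for a
# real call whose response exposes usage.prompt_tokens.
def calibration_ratio(probe_text, local_encode, api_prompt_tokens):
    local = len(local_encode(probe_text))
    remote = api_prompt_tokens(probe_text)
    return remote / local  # multiply local token counts by this

local_encode = lambda s: s.split()                 # toy local tokenizer
fake_api = lambda s: round(len(s.split()) * 0.85)  # model ~15% more efficient

ratio = calibration_ratio("word " * 200, local_encode, fake_api)
print(f"calibration ratio: {ratio:.2f}")  # ~0.85
```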
> @prabha-git, this subject is one of my major interests, and I will be speaking about it at PyCon SK next week, so I have many charts like this lying around. 🙂
> Don't you think that if we use cl100k_base for Gemini, for example, something like this may happen? Because the Gemini tokenizer is more efficient at encoding information, will we never test Gemini's full depth?

Wow, cool! Do they upload the presentations to YouTube? Will look it up :)