gkamradt / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
License: Other
Before proceeding with the implementation, I would like to reach a consensus.
To ensure a fair and consistent evaluation of different language models, I propose standardizing our tokenizer.
Currently, we use different tokenizers: cl100k_base (tiktoken) for OpenAI models and an unspecified tokenizer from Anthropic. This lack of standardization introduces bias, as tokenizers vary in how they split text into units, which affects context-length calculations. The difference is less significant for shorter contexts but becomes more pronounced as context length increases.
I recommend adopting cl100k_base as our standard tokenizer due to its open-source availability. This will create a level playing field for model comparisons.
Using the same tokenizer would not affect the integrity of the benchmark, since in this project the tokenizer is only used to measure the context length and find the insertion depth for the needle.
From my testing, Anthropic uses more tokens to represent the same text; code is in this colab.
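As a toy illustration of why this matters (str.split and list below are stand-ins for real tokenizers such as cl100k_base, not the project's actual code): the same text yields very different "context lengths" under different tokenizers, which shifts where a needle at, say, 50% depth actually lands.

```python
def token_length(text, tokenize):
    """Context length in 'tokens' under a given tokenizer."""
    return len(tokenize(text))

text = "The quick brown fox jumps over the lazy dog"

word_tokens = token_length(text, str.split)  # whitespace stand-in tokenizer
char_tokens = token_length(text, list)       # character-level stand-in

print(word_tokens, char_tokens)  # 9 43
```

Two tokenizers disagree by almost 5x on the same string, so a "2000-token" context is not the same amount of text across providers.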
Could this be implemented?
For example 'localhost:7860/v1' could be one of these custom addresses one could enter as a command line argument for running such tests on local models that are set up with an OpenAI-like endpoint but of course are running off a different address.
Thank you
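A sketch of how such a flag might be parsed (the flag name --api_base and its default are assumptions for illustration, not the repo's actual interface):

```python
import argparse

# Hypothetical CLI flag for an OpenAI-compatible endpoint; the default points
# at the official API, and a local server address can be passed instead.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--api_base",
    default="https://api.openai.com/v1",
    help="OpenAI-compatible endpoint, e.g. http://localhost:7860/v1",
)

args = parser.parse_args(["--api_base", "http://localhost:7860/v1"])
print(args.api_base)  # http://localhost:7860/v1
```

The parsed value would then be passed through to the OpenAI client's base_url parameter.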
Hey @rlancemartin, I'm running the command from the readme right now:
needlehaystack.run_test --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider openai --model_name "gpt-4-0125-preview" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]'
The eval kicks off fine but then errors out on the first tests with
langsmith.utils.LangSmithNotFoundError: Dataset multi-needle-eval-pizza-3 not found
I switched --eval_set multi-needle-eval-pizza
to --eval_set multi-needle-eval-pizza-3
(per the blog post), but that didn't fix the issue.
Have an idea of what's going on?
The Anthropic python library needs a version update to the latest version.
Attachment: Anthropic_prompt.txt
Hi @gkamradt ,
that's a fantastic repository! I was just wondering under which conditions one could reuse the code and materials provided here. Would you mind adding a license file? If you're new to licensing and/or wonder which license to use, you can read more in this blog post: https://focalplane.biologists.com/2023/05/06/if-you-license-it-itll-be-harder-to-steal-it-why-we-should-license-our-work/
Thanks!
Best,
Robert
There are a few optimizations we can make for the LLM tester; I laid them out below:
- The results_exists method checks if a task has finished for a specific context length and document depth by iterating over every file in results/. This can be optimized by looking for a specific file, since we know the file-name format being used.
- The insert_needle method finds the most recent "." token in the context and inserts the needle right after it. This search is done with a while loop that always overwrites tokens_new_context, which can be large. An optimization, which won't give much of a performance boost but is still worth doing, is indexing directly to the "." token after the search is complete.
- The read_context_files method recomputes the token length of the whole context for every file it has appended. Instead, we can measure only the newest file's content to avoid tokenizing the same pieces of text.
- Switch from asyncio.gather(*tasks) to async with asyncio.TaskGroup() as tg, as suggested here.
- Pathlib is much more elegant than os.path; let's replace os.path with Pathlib.
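The read_context_files point can be sketched like this (names are illustrative and str.split stands in for the real tokenizer): count only the tokens of each newly appended file instead of re-tokenizing the accumulated context.

```python
# Incremental token counting: each file is tokenized exactly once, so the
# total work is linear in the context size instead of quadratic.

def read_context_incremental(files, tokenize=str.split, max_tokens=100):
    context, total_tokens = "", 0
    for text in files:
        context += text
        total_tokens += len(tokenize(text))  # tokenize only the new file
        if total_tokens >= max_tokens:
            break
    return context, total_tokens

ctx, n = read_context_incremental(["one two three ", "four five ", "six "], max_tokens=5)
print(n)  # 5
```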
Does it support an Azure OpenAI key? If so, how do I set it?
Thank you very much.
We need to significantly improve testing of this project; having a Docker image is the first step.
- Add a Dockerfile based on the official Python image.
- Bump tiktoken to the newest version, 0.6.0, which does not require compilation on ARM.
Example:
FROM python:3.12
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENTRYPOINT ["python", "main.py"]
CMD []
Is there a BibTeX entry we can use to cite this work?
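Not an official citation, but a @misc template one could adapt until the author confirms the details (the key, title wording, and year here are assumptions based on the repository):

```latex
@misc{kamradt2023llmtest,
  author       = {Kamradt, Greg},
  title        = {LLMTest_NeedleInAHaystack: Pressure Testing LLMs},
  year         = {2023},
  howpublished = {\url{https://github.com/gkamradt/LLMTest_NeedleInAHaystack}}
}
```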
Steps to reproduce
Environment: Conda, Python 3.9
pip install needlehaystack
```
needlehaystack.run_test --provider anthropic --model_name "claude-2.1" --document_depth_percents "[50]" --context_lengths "[2000]"
/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/langchain/callbacks/__init__.py:37: LangChainDeprecationWarning: Importing this callback from langchain is deprecated. Importing it from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead:
from langchain_community.callbacks import base
.
To install langchain-community run pip install -U langchain-community
.
warnings.warn(
Traceback (most recent call last):
File "/Users/samsaha2/miniconda3/envs/nih/bin/needlehaystack.run_test", line 5, in
from needlehaystack.run import main
File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/__init__.py", line 1, in
from .llm_needle_haystack_tester import LLMNeedleHaystackTester
File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/llm_needle_haystack_tester.py", line 10, in
from .providers import ModelProvider
File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/__init__.py", line 1, in
from .anthropic import Anthropic
File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/anthropic.py", line 12, in
from .model import ModelProvider
File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/model.py", line 5, in
class ModelProvider(ABC):
File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/model.py", line 10, in ModelProvider
def generate_prompt(self, context: str, retrieval_question: str) -> str | list[dict[str, str]]: ...
TypeError: unsupported operand type(s) for |: 'type' and 'types.GenericAlias'
```

Running pip install -U langchain-community creates a conflict:

```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
needlehaystack 0.1.0 requires langchain-core==0.1.26, but you have langchain-core 0.1.36 which is incompatible.
```
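The TypeError comes from the PEP 604 union `str | list[dict[str, str]]`, which is evaluated at class-definition time and only supported on Python >= 3.10. A postponed-evaluation import is one sketch of a workaround for 3.9 (not necessarily the fix the maintainers would choose):

```python
# With `from __future__ import annotations`, annotations are stored as plain
# strings, so the `|` operator is never applied to `str` and `list[...]` at
# import time and the class definition succeeds on Python 3.9.
from __future__ import annotations
from abc import ABC

class ModelProvider(ABC):
    def generate_prompt(self, context: str, retrieval_question: str) -> str | list[dict[str, str]]: ...

print(ModelProvider.generate_prompt.__annotations__["return"])  # str | list[dict[str, str]]
```

Alternatively, the package metadata could simply require Python >= 3.10.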
We provide devs with a make command for cleaning pycache and destroying their venv. I think it would be nice to include a command for deleting results/ and contexts/, since I myself found it useful. This command would not be a prerequisite of destroy and would stand by itself, so devs don't accidentally lose their analysis.
Currently, the provider and evaluator models are given hard-coded model configurations (temperature, max_new_tokens). These should be exposed in the constructor so users can modify them if need be.
This needs to be configurable either as a run-time parameter or a constant. TBD.
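A hypothetical sketch of the constructor change (class and parameter names are assumptions, not the repo's actual code): keep the old hard-coded values as defaults, but let callers override them.

```python
# Generation settings become user-overridable constructor arguments instead
# of module-level constants.

class OpenAIProvider:
    def __init__(self, model_name: str, temperature: float = 0.0,
                 max_new_tokens: int = 300):
        self.model_name = model_name
        self.generation_kwargs = {
            "temperature": temperature,
            "max_tokens": max_new_tokens,
        }

p = OpenAIProvider("gpt-4-0125-preview", temperature=0.7)
print(p.generation_kwargs["temperature"])  # 0.7
```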
Add an env-based base_url argument to AsyncOpenAI in the provider and ChatOpenAI in the evaluator.
This will support all inference frameworks that expose an OpenAI-compatible API, like ollama, TGI, etc.
I really like this kind of benchmark. It would be interesting to make generalized versions of this, where there are a variable number of needles inserted. These could be unrelated independent needles, or they could be related. For example you could imagine 4 needles:
A implies B
B implies C, D
D implies E.
B is true
Then you could test the "related" needles, to ensure that all of them were detected and the relationship is understood. (What might A be? What about D?)
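The four needles above can be checked with a toy forward-chaining pass, an illustration of what "the relationship is understood" would mean mechanically: given only the fact B, C, D, and E become derivable, while A does not (the rule "A implies B" does not run backwards).

```python
# Forward chaining over the related needles: premise -> consequences.
rules = {"A": ["B"], "B": ["C", "D"], "D": ["E"]}
facts = {"B"}

changed = True
while changed:
    changed = False
    for premise, consequences in rules.items():
        if premise in facts:
            for c in consequences:
                if c not in facts:
                    facts.add(c)
                    changed = True

print(sorted(facts))  # ['B', 'C', 'D', 'E']
```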
Curious what you think about this. If you're interested in a feature like this and willing to accept a pull request, I could find the time to try implementing it. If you have a style guide preference or anything like that, please let me know.
New environment variables:
- NIAH_MODEL_API_KEY - API key for interacting with the model. Depending on the provider, this gets used appropriately with the correct SDK.
- NIAH_EVALUATOR_API_KEY - API key to use if the openai evaluation strategy is used.
Users should be able to run tests following these steps:
pip install needlehaystack
needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"
Since most needle-in-a-haystack tests inject a line into a pre-defined book text (which can be part of the model's original training data), it can be hypothesized that the LLM is simply "smelling" for something that does not fit the context.
So, is it possible to create a "haystack" that is a mix of multiple articles, or just a list of one-liners, such that the model cannot guess?
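A sketch of the proposed alternative (all strings are placeholders): build the haystack from unrelated one-liners and insert the needle at a random position, so a stylistic mismatch no longer gives it away.

```python
import random

lines = [f"Fact {i}: an unrelated one-liner." for i in range(10)]
needle = "The needle sentence to retrieve."
random.seed(0)

haystack = lines[:]
haystack.insert(random.randrange(len(haystack) + 1), needle)
print(len(haystack), needle in haystack)  # 11 True
```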
Hi, I was very impressed after seeing retrieval performance measurements of LLM using Needle In A Haystack! Meanwhile, I was curious about what results would be shown in open-source models.
The experiment with Mistral-7B-Instruct-v0.2 showed very poor performance. Something felt strange, so I analyzed the model's responses and found that in most cases the needle was found correctly, but the evaluator gave a very low score because of the additional explanations. My questions are the following:
The following is an example of a model response that received a score of 1.
[Score 1 Response]
The best thing to do in San Francisco, according to the author, is to eat a sandwich and sit in Dolores Park on a sunny day. However,
the author also warns that as the world becomes more addictive, living a normal life will become increasingly difficult and the two
senses of "normal" will be driven further apart. The author suggests that people will have to figure out for themselves what to avoid
and how, and even that may not be enough as existing things may become more addictive. The author also mentions that most
people he knows have problems with Internet addiction and that he avoids having an iPhone to limit his access to the Internet. The
author's latest trick for avoiding addiction is taking long hikes.
This project contains many files that do not have the correct end-of-file setting.
- Set up pre-commit in this repository.
- Add the end-of-file-fixer hook.
- Run pre-commit in CI so that it operates on all branches. It appears there is an official pre-commit GitHub Action, but the example provided only runs on the main branch.
We might consider adding more hooks in separate issues.
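A minimal .pre-commit-config.yaml covering the end-of-file-fixer hook might look like this (the rev pin is an assumption):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: end-of-file-fixer
```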
The code below, used to get the tokenizer, conflicts with the ModelProvider implementation Anthropic: the provider class shadows the anthropic SDK class of the same name, so the call re-enters the provider's own constructor.
self.enc = Anthropic().get_tokenizer()
Traceback (most recent call last):
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 66, in
main()
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 59, in main
args.model_to_test = get_model_to_test(args)
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 43, in get_model_to_test
return Anthropic(api_key=args.api_key)
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
self.enc = Anthropic().get_tokenizer()
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
self.enc = Anthropic().get_tokenizer()
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
self.enc = Anthropic().get_tokenizer()
[Previous line repeated 485 more times]
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 24, in __init__
self.model = AsyncAnthropic(api_key=self.api_key)
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/anthropic/_client.py", line 372, in __init__
super().__init__(
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/anthropic/_base_client.py", line 1190, in __init__
self._client = http_client or httpx.AsyncClient(
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_client.py", line 1397, in __init__
self._transport = self._init_transport(
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_client.py", line 1445, in _init_transport
return AsyncHTTPTransport(
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_transports/default.py", line 272, in __init__
ssl_context = create_ssl_context(verify=verify, cert=cert, trust_env=trust_env)
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 51, in create_ssl_context
return SSLConfig(
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 75, in __init__
self.ssl_context = self.load_ssl_context()
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 87, in load_ssl_context
return self.load_ssl_context_verify()
File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 142, in load_ssl_context_verify
if ca_bundle_path.is_file():
File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1322, in is_file
return S_ISREG(self.stat().st_mode)
File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1097, in stat
return self._accessor.stat(self, follow_symlinks=follow_symlinks)
File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 632, in __fspath__
return str(self)
RecursionError: maximum recursion depth exceeded
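A minimal reproduction of the recursion (names assumed, the real classes are in the SDK and the provider module): because the provider class is itself called Anthropic, the name Anthropic inside __init__ resolves to the provider, not the SDK client, and the constructor re-enters itself.

```python
class Anthropic:  # stands in for the provider class that shadows the SDK name
    def get_tokenizer(self):
        return object()

    def __init__(self):
        self.enc = Anthropic().get_tokenizer()  # re-enters this __init__

try:
    Anthropic()
    reproduced = False
except RecursionError:
    reproduced = True
print(reproduced)  # True
```

One fix sketch is to alias the SDK import so the name is unambiguous, e.g. `from anthropic import Anthropic as AnthropicSDK` and then `self.enc = AnthropicSDK().get_tokenizer()`.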
def generate_prompt(self, context: str, retrieval_question: str) -> str | list[dict[str, str]]:
    """
    Generates a structured prompt for querying the model, based on a given context and retrieval question.

    Args:
        context (str): The context or background information relevant to the question.
        retrieval_question (str): The specific question to be answered by the model.

    Returns:
        list[dict[str, str]]: A list of dictionaries representing the structured prompt, including roles and content for system and user messages.
    """
    return [
        {
            "role": "system",
            "content": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
        },
        {
            "role": "user",
            "content": context
        },
        {
            "role": "user",
            "content": f"{retrieval_question} Don't give information outside the document or repeat your findings"
        }
    ]
You are a helpful AI bot that answers questions for a user. Keep your response short and direct
Human: <context>
{context}
</context>
{retrieval_question} Don't give information outside the document or repeat your findings
Assistant: Here is the most relevant sentence in the context:
def generate_prompt(self, context: str, retrieval_question: str) -> tuple[str, list[dict[str, str]]]:
    '''
    Prepares a chat-formatted prompt

    Args:
        context (str): The needle in a haystack context
        retrieval_question (str): The needle retrieval question

    Returns:
        tuple[str, list[dict[str, str]]]: prompt encoded as last message, and chat history
    '''
    return (
        f"{retrieval_question} Don't give information outside the document or repeat your findings",
        [
            {
                "role": "System",
                "message": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
            },
            {
                "role": "User",
                "message": context
            }
        ]
    )