
llmtest_needleinahaystack's People

Contributors

arkadyark-cohere, eltociear, gkamradt, kedarchandrayan, lazarohurtado, pavelkraleu, prabha-git, rlancemartin


llmtest_needleinahaystack's Issues

Different prompts in providers - I just wonder why Cohere doesn't include "Don't give information outside the document or repeat your findings", and whether that makes any difference.

  1. For OpenAI
def generate_prompt(self, context: str, retrieval_question: str) -> str | list[dict[str, str]]:
        """
        Generates a structured prompt for querying the model, based on a given context and retrieval question.

        Args:
            context (str): The context or background information relevant to the question.
            retrieval_question (str): The specific question to be answered by the model.

        Returns:
            list[dict[str, str]]: A list of dictionaries representing the structured prompt, including roles and content for system and user messages.
        """
        return [{
                "role": "system",
                "content": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
            },
            {
                "role": "user",
                "content": context
            },
            {
                "role": "user",
                "content": f"{retrieval_question} Don't give information outside the document or repeat your findings"
            }]
  2. For Anthropic
You are a helpful AI bot that answers questions for a user. Keep your response short and direct

Human: <context>
{context}
</context>

{retrieval_question} Don't give information outside the document or repeat your findings

Assistant: Here is the most relevant sentence in the context:
  3. For Cohere
    def generate_prompt(self, context: str, retrieval_question: str) -> tuple[str, list[dict[str, str]]]:
        '''
        Prepares a chat-formatted prompt
        Args:
            context (str): The needle in a haystack context
            retrieval_question (str): The needle retrieval question

        Returns:
            tuple[str, list[dict[str, str]]]: prompt encoded as last message, and chat history

        '''
        return (
            f"{retrieval_question} Don't give information outside the document or repeat your findings", 
            [{
                "role": "System",
                "message": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
            },
            {
                "role": "User",
                "message": context
            }]
        )

multi-needle-eval-pizza-3 dataset not found

Hey @rlancemartin , I'm running the command on the readme right now

needlehaystack.run_test --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider openai --model_name "gpt-4-0125-preview" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]'

The eval kicks off fine but then errors out on the first test with

langsmith.utils.LangSmithNotFoundError: Dataset multi-needle-eval-pizza-3 not found

I switched --eval_set multi-needle-eval-pizza to --eval_set multi-needle-eval-pizza-3 (per the blog post), but that didn't fix the issue.

Have an idea of what's going on?
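If the dataset simply doesn't exist in your LangSmith workspace yet, one workaround may be to create it yourself before running the test. A minimal sketch, assuming the langsmith client's create_dataset / create_example API; the example input/output keys below are placeholders, not necessarily the schema the evaluator expects:

from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

# Create the dataset the evaluator is looking for.
dataset = client.create_dataset(dataset_name="multi-needle-eval-pizza-3")

# Add an example; the keys here are illustrative only.
client.create_example(
    inputs={"question": "What are the three most delicious pizza toppings?"},
    outputs={"answer": "Figs, prosciutto, and goat cheese."},
    dataset_id=dataset.id,
)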

I was wondering about the evaluation method

Hi, I was very impressed after seeing the retrieval performance measurements of LLMs using Needle In A Haystack! I was also curious about what results open-source models would show.

My experiment with Mistral-7B-Instruct-v0.2 showed very poor performance. Something felt off, so I analyzed the model's responses and found that in most cases the needle was actually found, but the evaluator gave a very low score because of the additional explanations around it. My questions are the following:

  • I think it is sufficient for evaluating retrieval performance that the needle sentence is included in the generation. Is Needle In A Haystack's evaluation criterion that the model must generate the exact needle sentence without any other explanation? (A rough substring check is sketched after the example response below.)
  • If so, is it possible to accurately evaluate a non-instruction-tuned model that keeps generating until it hits the context-length limit on Needle In A Haystack?

The following is an example of a model response that received a score of 1.

[Score 1 Response]
The best thing to do in San Francisco, according to the author, is to eat a sandwich and sit in Dolores Park on a sunny day. However, the author also warns that as the world becomes more addictive, living a normal life will become increasingly difficult and the two senses of "normal" will be driven further apart. The author suggests that people will have to figure out for themselves what to avoid and how, and even that may not be enough as existing things may become more addictive. The author also mentions that most people he knows have problems with Internet addiction and that he avoids having an iPhone to limit his access to the Internet. The author's latest trick for avoiding addiction is taking long hikes.
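For what it's worth, a crude recall check that only asks whether the needle text appears in the generation might look like this (a sketch; the normalization and function names are mine, not the repo's):

import re

def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()

def needle_recalled(response: str, needle: str) -> bool:
    """Return True if the needle sentence appears anywhere in the model's response."""
    return _normalize(needle) in _normalize(response)

A response like the Score 1 example above would pass this check as long as it contains the needle verbatim, even though an LLM judge penalizes the extra commentary.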

Question: Can the Haystack have variations?

Since most needle-in-a-haystack tests inject a line into a pre-defined book text (which may well be part of the model's training data), it can be hypothesized that the LLM is simply "smelling" for something that does not fit the context.
So, is it possible to create a "haystack" that is a mix of multiple articles, or just a list of one-liners, so that the model cannot simply guess?
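To make the question concrete, here is a rough sketch of how such a mixed haystack could be built; the file paths and naive sentence splitting are placeholders, not how the repo currently assembles its context:

import random
from pathlib import Path

def build_mixed_haystack(paths: list[str], seed: int = 0) -> str:
    """Pool sentences from several documents and shuffle them, so the needle
    is no longer the only line that 'smells' out of place."""
    sentences: list[str] = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        sentences.extend(s.strip() for s in text.split(".") if s.strip())
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."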

[Feature Proposal] Multi-needle in a haystack

I really like this kind of benchmark. It would be interesting to make generalized versions of it, where a variable number of needles are inserted. These could be unrelated, independent needles, or they could be related. For example, you could imagine 4 needles:

A implies B
B implies C, D
D implies E.
B is true

Then you could test the "related" needles, to ensure that all of them were detected and the relationship is understood. (What might A be? What about D?)

Curious what you think about this. If you're interested in a feature like this and willing to accept a pull request, I could find the time to try implementing it. If you have a style guide preference or anything like that, please let me know.
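To sketch what I have in mind, multiple needles could be spread through the context at evenly spaced depths, roughly like this (character-level splitting is used for brevity; the real tester works on tokens and snaps to sentence boundaries):

def insert_needles(context: str, needles: list[str]) -> str:
    """Insert each needle at an evenly spaced depth in the context."""
    result = context
    for i, needle in enumerate(needles, start=1):
        depth = i / (len(needles) + 1)  # e.g. 4 needles land at 20%, 40%, 60%, 80%
        pos = int(len(result) * depth)
        result = result[:pos] + " " + needle + " " + result[pos:]
    return result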

Code optimizations

There are a few optimizations we can make for the llm tester and I laid them out below:

  • the results_exists method checks if a task has finished for a specific context length and document depth by iterating over every file in results/. This can be optimized by looking for a specific file since we know the file name formatting being used.
  • the insert_needle method finds the most recent . token in the context and inserts the needle right after it. This search is done with a while loop that repeatedly overwrites tokens_new_context, which can be large. An optimization, which won't give much of a performance boost but is still worth doing, is indexing directly to the . token after the search is complete.
  • the read_context_files method finds the length of the context, in tokens, for every file it has appended. Instead, we can find the length of the newest file's content to avoid tokenizing the same pieces of text.
  • Moving from asyncio.gather(*tasks) to using async with asyncio.TaskGroup() as tg, as suggested here (a minimal sketch follows this list)
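A minimal sketch of the last point, assuming Python 3.11+; evaluate_and_report is a stand-in name for the tester's per-combination coroutine, not the repo's actual function:

import asyncio

async def evaluate_and_report(context_length: int, depth_percent: float) -> None:
    """Stand-in for the real coroutine: build context, call the model, evaluate."""
    await asyncio.sleep(0)

async def run_all(combos: list[tuple[int, float]]) -> None:
    # TaskGroup awaits every task and cancels the rest if one fails,
    # unlike a bare asyncio.gather(*tasks).
    async with asyncio.TaskGroup() as tg:
        for context_length, depth_percent in combos:
            tg.create_task(evaluate_and_report(context_length, depth_percent))

asyncio.run(run_all([(2000, 50.0), (4000, 50.0)]))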

Replace os.path with Pathlib

Pathlib is much more elegant than os.path; let's replace it.

Implementation

  1. Find all occurrences of os.path
  2. Replace os.path with Pathlib
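For illustration, a typical before/after (the exact call sites in the repo will differ):

import os
from pathlib import Path

# Before: os.path style
results_dir = os.path.join("results", "gpt-4")
if not os.path.exists(results_dir):
    os.makedirs(results_dir)

# After: pathlib style
results_dir = Path("results") / "gpt-4"
results_dir.mkdir(parents=True, exist_ok=True)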

Standard Tokenizer

Before proceeding with the implementation, I would like to reach a consensus.

To ensure a fair and consistent evaluation of different language models, I propose standardizing our tokenizer.

Currently, we use different tokenizers, including cl100k_base (tiktoken) for OpenAI models and an unspecified tokenizer from Anthropic. This lack of standardization introduces bias, as tokenizers vary in how they split text into units, thus affecting context length calculations.

I recommend adopting cl100k_base as our standard tokenizer due to its open-source availability. This will create a level playing field for model comparisons. The difference between tokenizers is less significant for shorter contexts but becomes more pronounced as context length increases.

Using the same tokenizer would not affect the integrity of the test, since in this project the tokenizer is only used to measure the context length and to find the depth for the needle.

Results from my testing: the Anthropic tokenizer uses more tokens to represent text of the same length. Code is in this colab.

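For reference, counting tokens with cl100k_base is a one-liner with tiktoken (the sample text is arbitrary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog."
print(len(enc.encode(text)))  # context length as the harness would measure it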

Install pre-commit with end-of-file-fixer

This project contains many files that do not have the correct end-of-file setting.

Implementation

  1. Install pre-commit in this repository.
  2. Add the end-of-file-fixer hook.
  3. Add a GitHub Action to run pre-commit, which operates on all branches.

It appears there is an official pre-commit GitHub Action. The example provided only runs on the main branch.
We might consider adding more hooks in separate issues.

Add Makefile target for resetting run results

We provide devs with a make command for cleaning pycache and destroying their venv. I think it would be nice to include a command for deleting results/ and contexts/, since I found it useful myself. This command would not be a prerequisite of destroy and would stand by itself, so devs don't accidentally lose their analysis.

Implement Docker for testing

We need to significantly improve testing of this project; having a Docker image is the first step.

Implementation

  1. Create a Dockerfile based on the official Python image
  2. Update tiktoken to the newest version, 0.6.0 - it does not require compilation on ARM
  3. Install requirements.txt
  4. Document in the README how to run the project from Docker

Example:

FROM python:3.12

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "main.py"]
CMD []

Possibility to specify custom API endpoint address?

Could this be implemented?
For example, 'localhost:7860/v1' could be one such custom address, passed as a command-line argument, for running these tests on local models that expose an OpenAI-compatible endpoint but run at a different address.

Thank you

  • Elliott
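For context, the current openai Python SDK already accepts a base_url, so supporting this may mostly be a matter of plumbing a new CLI flag through to the provider. A sketch of the client side, with the values being placeholders:

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:7860/v1",  # local OpenAI-compatible server
    api_key="not-needed-for-local",       # many local servers ignore the key
)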

Model kwargs support

Currently, the provider and evaluator models are passed hard-coded model configurations (temperature, max_new_tokens). These should be exposed in the constructor so users can modify them if need be.
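Something along these lines, with the names and defaults as placeholders rather than the repo's actual signature:

DEFAULT_MODEL_KWARGS = {"temperature": 0, "max_tokens": 300}

class OpenAIProvider:
    """Illustrative provider; not the repo's real class."""
    def __init__(self, model_name: str, model_kwargs: dict | None = None):
        self.model_name = model_name
        # Caller-supplied kwargs override the hard-coded defaults.
        self.model_kwargs = {**DEFAULT_MODEL_KWARGS, **(model_kwargs or {})}

provider = OpenAIProvider("gpt-4-0125-preview", model_kwargs={"temperature": 0.2})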

Convert the repository to a PyPI package

Users should be able to run tests following these steps:

  • Install the package
pip install needlehaystack
  • Run Test
needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"

Anthropic Naming Conflict Error

The code below, which gets the tokenizer, conflicts with the ModelProvider implementation Anthropic:

self.enc = Anthropic().get_tokenizer()

Traceback (most recent call last):
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 66, in <module>
    main()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 59, in main
    args.model_to_test = get_model_to_test(args)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 43, in get_model_to_test
    return Anthropic(api_key=args.api_key)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
    self.enc = Anthropic().get_tokenizer()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
    self.enc = Anthropic().get_tokenizer()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
    self.enc = Anthropic().get_tokenizer()
  [Previous line repeated 485 more times]
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 24, in __init__
    self.model = AsyncAnthropic(api_key=self.api_key)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/anthropic/_client.py", line 372, in __init__
    super().__init__(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/anthropic/_base_client.py", line 1190, in __init__
    self._client = http_client or httpx.AsyncClient(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_client.py", line 1397, in __init__
    self._transport = self._init_transport(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_client.py", line 1445, in _init_transport
    return AsyncHTTPTransport(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_transports/default.py", line 272, in __init__
    ssl_context = create_ssl_context(verify=verify, cert=cert, trust_env=trust_env)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 51, in create_ssl_context
    return SSLConfig(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 75, in __init__
    self.ssl_context = self.load_ssl_context()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 87, in load_ssl_context
    return self.load_ssl_context_verify()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 142, in load_ssl_context_verify
    if ca_bundle_path.is_file():
  File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1322, in is_file
    return S_ISREG(self.stat().st_mode)
  File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1097, in stat
    return self._accessor.stat(self, follow_symlinks=follow_symlinks)
  File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 632, in __fspath__
    return str(self)
RecursionError: maximum recursion depth exceeded
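One plausible fix is to alias the SDK import so it no longer shadows the provider class of the same name; a simplified sketch of providers/anthropic.py:

from anthropic import AsyncAnthropic
from anthropic import Anthropic as AnthropicClient  # alias avoids shadowing the class below

class Anthropic:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.model = AsyncAnthropic(api_key=self.api_key)
        # Resolves to the SDK client, not this class, so no infinite recursion.
        self.enc = AnthropicClient(api_key=self.api_key).get_tokenizer()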

Azure OpenAI key

Does it support an Azure OpenAI key? If it does, how do I set it?
Thank you very much.
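As far as I can tell it is not supported out of the box; an Azure-backed provider would likely need the SDK's Azure client instead of the regular OpenAI one. A sketch of the client setup (endpoint, key, and api_version are placeholders):

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<AZURE_OPENAI_API_KEY>",                           # placeholder
    api_version="2024-02-01",                                   # placeholder version
)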

Does it run at all? Basic commands fail to run as per the README.

Steps to reproduce (environment: Conda, Python 3.9):

pip install needlehaystack

needlehaystack.run_test --provider anthropic --model_name "claude-2.1" --document_depth_percents "[50]" --context_lengths "[2000]"

/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/langchain/callbacks/__init__.py:37: LangChainDeprecationWarning: Importing this callback from langchain is deprecated. Importing it from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead:

from langchain_community.callbacks import base.

To install langchain-community run pip install -U langchain-community.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/samsaha2/miniconda3/envs/nih/bin/needlehaystack.run_test", line 5, in <module>
    from needlehaystack.run import main
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/__init__.py", line 1, in <module>
    from .llm_needle_haystack_tester import LLMNeedleHaystackTester
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/llm_needle_haystack_tester.py", line 10, in <module>
    from .providers import ModelProvider
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/__init__.py", line 1, in <module>
    from .anthropic import Anthropic
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/anthropic.py", line 12, in <module>
    from .model import ModelProvider
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/model.py", line 5, in <module>
    class ModelProvider(ABC):
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/model.py", line 10, in ModelProvider
    def generate_prompt(self, context: str, retrieval_question: str) -> str | list[dict[str, str]]: ...
TypeError: unsupported operand type(s) for |: 'type' and 'types.GenericAlias'

Running pip install -U langchain-community creates a conflict:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
needlehaystack 0.1.0 requires langchain-core==0.1.26, but you have langchain-core 0.1.36 which is incompatible.
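The TypeError comes from the PEP 604 union syntax (str | list[...]) in the type hints, which is only valid at runtime on Python 3.10+, while the environment above is Python 3.9. If the maintainers wanted to support 3.9, the annotation could be made compatible in either of two ways, sketched here:

# Option 1: postpone evaluation of annotations (must be the first statement in the module)
from __future__ import annotations

# Option 2: use typing.Union explicitly
from typing import Union

def generate_prompt(context: str, retrieval_question: str) -> Union[str, list[dict[str, str]]]:
    ...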
