
llmtest_needleinahaystack's People

Contributors

arkadyark-cohere, eltociear, gkamradt, kedarchandrayan, lazarohurtado, pavelkraleu, prabha-git, rlancemartin


llmtest_needleinahaystack's Issues

Different prompts in providers - I just wonder why Cohere doesn't include "Don't give information outside the document or repeat your findings", and whether that makes any difference.

  1. For OpenAI
def generate_prompt(self, context: str, retrieval_question: str) -> str | list[dict[str, str]]:
        """
        Generates a structured prompt for querying the model, based on a given context and retrieval question.

        Args:
            context (str): The context or background information relevant to the question.
            retrieval_question (str): The specific question to be answered by the model.

        Returns:
            list[dict[str, str]]: A list of dictionaries representing the structured prompt, including roles and content for system and user messages.
        """
        return [{
                "role": "system",
                "content": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
            },
            {
                "role": "user",
                "content": context
            },
            {
                "role": "user",
                "content": f"{retrieval_question} Don't give information outside the document or repeat your findings"
            }]
  2. For Anthropic
You are a helpful AI bot that answers questions for a user. Keep your response short and direct

Human: <context>
{context}
</context>

{retrieval_question} Don't give information outside the document or repeat your findings

Assistant: Here is the most relevant sentence in the context:
  3. For Cohere
    def generate_prompt(self, context: str, retrieval_question: str) -> tuple[str, list[dict[str, str]]]:
        '''
        Prepares a chat-formatted prompt
        Args:
            context (str): The needle in a haystack context
            retrieval_question (str): The needle retrieval question

        Returns:
            tuple[str, list[dict[str, str]]]: prompt encoded as last message, and chat history

        '''
        return (
            f"{retrieval_question} Don't give information outside the document or repeat your findings", 
            [{
                "role": "System",
                "message": "You are a helpful AI bot that answers questions for a user. Keep your response short and direct"
            },
            {
                "role": "User",
                "message": context
            }]
        )

multi-needle-eval-pizza-3 dataset not found

Hey @rlancemartin , I'm running the command on the readme right now

needlehaystack.run_test --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider openai --model_name "gpt-4-0125-preview" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]'

The eval kicks off fine but then errors out on the first test with

langsmith.utils.LangSmithNotFoundError: Dataset multi-needle-eval-pizza-3 not found

I switched --eval_set multi-needle-eval-pizza to --eval_set multi-needle-eval-pizza-3 (per the blog post), but that didn't fix the issue.

Have an idea of what's going on?
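If the dataset simply doesn't exist in your LangSmith workspace yet, one workaround may be to create it yourself before running the test. A minimal sketch, assuming the langsmith client's create_dataset / create_example API; the example input/output keys below are placeholders, not necessarily the schema the evaluator expects:

from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

# Create the dataset the evaluator is looking for.
dataset = client.create_dataset(dataset_name="multi-needle-eval-pizza-3")

# Add an example; the keys here are illustrative only.
client.create_example(
    inputs={"question": "What are the three most delicious pizza toppings?"},
    outputs={"answer": "Figs, prosciutto, and goat cheese."},
    dataset_id=dataset.id,
)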

I was wondering about the evaluation method

Hi, I was very impressed after seeing the retrieval performance measurements of LLMs using Needle In A Haystack! I was also curious about what results open-source models would show.

My experiment with Mistral-7B-Instruct-v0.2 showed very poor performance. Something felt off, so I analyzed the model's responses and found that in most cases the needle was actually found, but the evaluator gave a very low score because of the additional explanations around it. My questions are the following:

  • I think it is sufficient for evaluating retrieval performance that the needle sentence is included in the generation. Is Needle In A Haystack's evaluation criterion that the model must generate the exact needle sentence without any other explanation? (A rough substring check is sketched after the example response below.)
  • If so, is it possible to accurately evaluate a non-instruction-tuned model that keeps generating until it hits the context-length limit on Needle In A Haystack?

The following is an example of a model response that received a score of 1.

[Score 1 Response]
The best thing to do in San Francisco, according to the author, is to eat a sandwich and sit in Dolores Park on a sunny day. However, the author also warns that as the world becomes more addictive, living a normal life will become increasingly difficult and the two senses of "normal" will be driven further apart. The author suggests that people will have to figure out for themselves what to avoid and how, and even that may not be enough as existing things may become more addictive. The author also mentions that most people he knows have problems with Internet addiction and that he avoids having an iPhone to limit his access to the Internet. The author's latest trick for avoiding addiction is taking long hikes.
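For what it's worth, a crude recall check that only asks whether the needle text appears in the generation might look like this (a sketch; the normalization and function names are mine, not the repo's):

import re

def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()

def needle_recalled(response: str, needle: str) -> bool:
    """Return True if the needle sentence appears anywhere in the model's response."""
    return _normalize(needle) in _normalize(response)

A response like the Score 1 example above would pass this check as long as it contains the needle verbatim, even though an LLM judge penalizes the extra commentary.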

Question: Can the Haystack have variations?

Since most needle-in-a-haystack tests inject a line into a pre-defined book text (which may well be part of the model's training data), it can be hypothesized that the LLM is simply "smelling" for something that does not fit the context.
So, is it possible to create a "haystack" that is a mix of multiple articles, or just a list of one-liners, so that the model cannot simply guess?
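To make the question concrete, here is a rough sketch of how such a mixed haystack could be built; the file paths and naive sentence splitting are placeholders, not how the repo currently assembles its context:

import random
from pathlib import Path

def build_mixed_haystack(paths: list[str], seed: int = 0) -> str:
    """Pool sentences from several documents and shuffle them, so the needle
    is no longer the only line that 'smells' out of place."""
    sentences: list[str] = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        sentences.extend(s.strip() for s in text.split(".") if s.strip())
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."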

[Feature Proposal] Multi-needle in a haystack

I really like this kind of benchmark. It would be interesting to make generalized versions of it, where a variable number of needles are inserted. These could be unrelated, independent needles, or they could be related. For example, you could imagine 4 needles:

A implies B
B implies C, D
D implies E.
B is true

Then you could test the "related" needles, to ensure that all of them were detected and the relationship is understood. (What might A be? What about D?)

Curious what you think about this. If you're interested in a feature like this and willing to accept a pull request, I could find the time to try implementing it. If you have a style guide preference or anything like that, please let me know.
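To sketch what I have in mind, multiple needles could be spread through the context at evenly spaced depths, roughly like this (character-level splitting is used for brevity; the real tester works on tokens and snaps to sentence boundaries):

def insert_needles(context: str, needles: list[str]) -> str:
    """Insert each needle at an evenly spaced depth in the context."""
    result = context
    for i, needle in enumerate(needles, start=1):
        depth = i / (len(needles) + 1)  # e.g. 4 needles land at 20%, 40%, 60%, 80%
        pos = int(len(result) * depth)
        result = result[:pos] + " " + needle + " " + result[pos:]
    return result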

Code optimizations

There are a few optimizations we can make for the llm tester and I laid them out below:

  • the results_exists method checks if a task has finished for a specific context length and document depth by iterating over every file in results/. This can be optimized by looking for a specific file since we know the file name formatting being used.
  • the insert_needle method finds the most recent . token in the context and inserts the needle right after it. This search is done with a while loop that repeatedly overwrites tokens_new_context, which can be large. An optimization, which won't give much of a performance boost but is still worth doing, is indexing directly to the . token after the search is complete.
  • the read_context_files method finds the length of the context, in tokens, for every file it has appended. Instead, we can find the length of the newest file's content to avoid tokenizing the same pieces of text.
  • Moving from asyncio.gather(*tasks) to using async with asyncio.TaskGroup() as tg, as suggested here (a minimal sketch follows this list)
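A minimal sketch of the last point, assuming Python 3.11+; evaluate_and_report is a stand-in name for the tester's per-combination coroutine, not the repo's actual function:

import asyncio

async def evaluate_and_report(context_length: int, depth_percent: float) -> None:
    """Stand-in for the real coroutine: build context, call the model, evaluate."""
    await asyncio.sleep(0)

async def run_all(combos: list[tuple[int, float]]) -> None:
    # TaskGroup awaits every task and cancels the rest if one fails,
    # unlike a bare asyncio.gather(*tasks).
    async with asyncio.TaskGroup() as tg:
        for context_length, depth_percent in combos:
            tg.create_task(evaluate_and_report(context_length, depth_percent))

asyncio.run(run_all([(2000, 50.0), (4000, 50.0)]))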

Replace os.path with Pathlib

Pathlib is much more elegant than os.path; let's replace it.

Implementation

  1. Find all occurrences of os.path
  2. Replace os.path with Pathlib
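For illustration, a typical before/after (the exact call sites in the repo will differ):

import os
from pathlib import Path

# Before: os.path style
results_dir = os.path.join("results", "gpt-4")
if not os.path.exists(results_dir):
    os.makedirs(results_dir)

# After: pathlib style
results_dir = Path("results") / "gpt-4"
results_dir.mkdir(parents=True, exist_ok=True)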

Standard Tokenizer

Before proceeding with the implementation, I would like to reach a consensus.

To ensure a fair and consistent evaluation of different language models, I propose standardizing our tokenizer.

Currently, we use different tokenizers, including cl100k_base (tiktoken) for OpenAI models and an unspecified tokenizer from Anthropic. This lack of standardization introduces bias, as tokenizers vary in how they split text into units, thus affecting context length calculations.

I recommend adopting cl100k_base as our standard tokenizer due to its open-source availability. This will create a level playing field for model comparisons. The difference between tokenizers is less significant for shorter contexts but becomes more pronounced as context length increases.

Using the same tokenizer would not affect the integrity of the test, since in this project the tokenizer is only used to measure the context length and to find the depth for the needle.

Results from my testing: the Anthropic tokenizer uses more tokens to represent text of the same length. Code is in this colab.

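For reference, counting tokens with cl100k_base is a one-liner with tiktoken (the sample text is arbitrary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog."
print(len(enc.encode(text)))  # context length as the harness would measure it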

Install pre-commit with end-of-file-fixer

This project contains many files that do not have the correct end-of-file setting.

Implementation

  1. Install pre-commit in this repository.
  2. Add the end-of-file-fixer hook.
  3. Add a GitHub Action to run pre-commit, which operates on all branches.

It appears there is an official pre-commit GitHub Action. The example provided only runs on the main branch.
We might consider adding more hooks in separate issues.

Add Makefile target for resetting run results

We provide devs with a make command for cleaning pycache and destroying their venv. I think it would be nice to include a command for deleting results/ and contexts/, since I found it useful myself. This command would not be a prerequisite of destroy and would stand by itself, so devs don't accidentally lose their analysis.

Implement Docker for testing

We need to significantly improve testing of this project; having a Docker image is the first step.

Implementation

  1. Create a Dockerfile based on the official Python image
  2. Update tiktoken to the newest version, 0.6.0 - it does not require compilation on ARM
  3. Install requirements.txt
  4. Document in the README how to run the project from Docker

Example:

FROM python:3.12

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "main.py"]
CMD []

Possibility to specify custom API endpoint address?

Could this be implemented?
For example, 'localhost:7860/v1' could be one such custom address, passed as a command-line argument, for running these tests on local models that expose an OpenAI-compatible endpoint but run at a different address.

Thank you

  • Elliott
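For context, the current openai Python SDK already accepts a base_url, so supporting this may mostly be a matter of plumbing a new CLI flag through to the provider. A sketch of the client side, with the values being placeholders:

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:7860/v1",  # local OpenAI-compatible server
    api_key="not-needed-for-local",       # many local servers ignore the key
)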

Model kwargs support

Currently, the provider and evaluator models are passed hard-coded model configurations (temperature, max_new_tokens). These should be exposed in the constructor so users can modify them if need be.
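Something along these lines, with the names and defaults as placeholders rather than the repo's actual signature:

DEFAULT_MODEL_KWARGS = {"temperature": 0, "max_tokens": 300}

class OpenAIProvider:
    """Illustrative provider; not the repo's real class."""
    def __init__(self, model_name: str, model_kwargs: dict | None = None):
        self.model_name = model_name
        # Caller-supplied kwargs override the hard-coded defaults.
        self.model_kwargs = {**DEFAULT_MODEL_KWARGS, **(model_kwargs or {})}

provider = OpenAIProvider("gpt-4-0125-preview", model_kwargs={"temperature": 0.2})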

Convert the repository to a PyPI package

Users should be able to run tests following these steps:

  • Install the package
pip install needlehaystack
  • Run Test
needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"

Anthropic Naming Conflict Error

The code below, which gets the tokenizer, conflicts with the ModelProvider implementation Anthropic:

self.enc = Anthropic().get_tokenizer()

Traceback (most recent call last):
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 66, in <module>
    main()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 59, in main
    args.model_to_test = get_model_to_test(args)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/main.py", line 43, in get_model_to_test
    return Anthropic(api_key=args.api_key)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
    self.enc = Anthropic().get_tokenizer()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
    self.enc = Anthropic().get_tokenizer()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 25, in __init__
    self.enc = Anthropic().get_tokenizer()
  [Previous line repeated 485 more times]
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/src/providers/anthropic.py", line 24, in __init__
    self.model = AsyncAnthropic(api_key=self.api_key)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/anthropic/_client.py", line 372, in __init__
    super().__init__(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/anthropic/_base_client.py", line 1190, in __init__
    self._client = http_client or httpx.AsyncClient(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_client.py", line 1397, in __init__
    self._transport = self._init_transport(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_client.py", line 1445, in _init_transport
    return AsyncHTTPTransport(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_transports/default.py", line 272, in __init__
    ssl_context = create_ssl_context(verify=verify, cert=cert, trust_env=trust_env)
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 51, in create_ssl_context
    return SSLConfig(
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 75, in __init__
    self.ssl_context = self.load_ssl_context()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 87, in load_ssl_context
    return self.load_ssl_context_verify()
  File "/Users/prabha.arivalagan/Documents/github/LLMTest_NeedleInAHaystack/venv/lib/python3.10/site-packages/httpx/_config.py", line 142, in load_ssl_context_verify
    if ca_bundle_path.is_file():
  File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1322, in is_file
    return S_ISREG(self.stat().st_mode)
  File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1097, in stat
    return self._accessor.stat(self, follow_symlinks=follow_symlinks)
  File "/opt/homebrew/Cellar/python@3.10/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 632, in __fspath__
    return str(self)
RecursionError: maximum recursion depth exceeded
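One plausible fix is to alias the SDK import so it no longer shadows the provider class of the same name; a simplified sketch of providers/anthropic.py:

from anthropic import AsyncAnthropic
from anthropic import Anthropic as AnthropicClient  # alias avoids shadowing the class below

class Anthropic:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.model = AsyncAnthropic(api_key=self.api_key)
        # Resolves to the SDK client, not this class, so no infinite recursion.
        self.enc = AnthropicClient(api_key=self.api_key).get_tokenizer()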

Azure OpenAI key

Does it support an Azure OpenAI key? If it does, how do I set it?
Thank you very much.
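As far as I can tell it is not supported out of the box; an Azure-backed provider would likely need the SDK's Azure client instead of the regular OpenAI one. A sketch of the client setup (endpoint, key, and api_version are placeholders):

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<AZURE_OPENAI_API_KEY>",                           # placeholder
    api_version="2024-02-01",                                   # placeholder version
)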

Does it run at all? Basic commands fail to run as per the README.

Steps to reproduce (environment: Conda, Python 3.9):

pip install needlehaystack

needlehaystack.run_test --provider anthropic --model_name "claude-2.1" --document_depth_percents "[50]" --context_lengths "[2000]"

/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/langchain/callbacks/__init__.py:37: LangChainDeprecationWarning: Importing this callback from langchain is deprecated. Importing it from langchain will no longer be supported as of langchain==0.2.0. Please import from langchain-community instead:

from langchain_community.callbacks import base.

To install langchain-community run pip install -U langchain-community.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/samsaha2/miniconda3/envs/nih/bin/needlehaystack.run_test", line 5, in <module>
    from needlehaystack.run import main
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/__init__.py", line 1, in <module>
    from .llm_needle_haystack_tester import LLMNeedleHaystackTester
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/llm_needle_haystack_tester.py", line 10, in <module>
    from .providers import ModelProvider
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/__init__.py", line 1, in <module>
    from .anthropic import Anthropic
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/anthropic.py", line 12, in <module>
    from .model import ModelProvider
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/model.py", line 5, in <module>
    class ModelProvider(ABC):
  File "/Users/samsaha2/miniconda3/envs/nih/lib/python3.9/site-packages/needlehaystack/providers/model.py", line 10, in ModelProvider
    def generate_prompt(self, context: str, retrieval_question: str) -> str | list[dict[str, str]]: ...
TypeError: unsupported operand type(s) for |: 'type' and 'types.GenericAlias'

Running pip install -U langchain-community creates a conflict:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
needlehaystack 0.1.0 requires langchain-core==0.1.26, but you have langchain-core 0.1.36 which is incompatible.
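The TypeError comes from the PEP 604 union syntax (str | list[...]) in the type hints, which is only valid at runtime on Python 3.10+, while the environment above is Python 3.9. If the maintainers wanted to support 3.9, the annotation could be made compatible in either of two ways, sketched here:

# Option 1: postpone evaluation of annotations (must be the first statement in the module)
from __future__ import annotations

# Option 2: use typing.Union explicitly
from typing import Union

def generate_prompt(context: str, retrieval_question: str) -> Union[str, list[dict[str, str]]]:
    ...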
