
eval's Introduction

⚡♾️ FastREPL

Fast Run-Eval-Polish Loop for LLM Applications.

Quickstart

Let's say we have this existing system:

import openai

context = """
The first step is to decide what to work on. The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work.
In practice you don't have to worry much about the third criterion. Ambitious people are if anything already too conservative about it. So all you need to do is find something you have an aptitude for and great interest in.
"""

def run_qa(question: str) -> str:
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": f"Answer in less than 30 words. Use the following context if needed: {context}",
            },
            {"role": "user", "content": question},
        ],
    )["choices"][0]["message"]["content"]

We already have a fixed context. Now, let's ask some questions. local_runner is used here to run the function locally with threads and progress tracking; a remote_runner will be available to run the same thing in the cloud.

import fastrepl

# https://huggingface.co/datasets/repllabs/questions_how_to_do_great_work
questions = [
    "how to do great work?.",
    "How can curiosity be nurtured and utilized to drive great work?",
    "How does the author suggest finding something to work on?",
    "How did Van Dyck's painting differ from Daniel Mytens' version and what message did it convey?",
]
contexts = [[context]] * len(questions)

runner = fastrepl.local_runner(fn=run_qa)
ds = runner.run(args_list=[(q,) for q in questions], output_feature="answer")

ds = ds.add_column("question", questions)
ds = ds.add_column("contexts", contexts)
# fastrepl.Dataset({
#     features: ['answer', 'question', 'contexts'],
#     num_rows: 4
# })

Now, let's use one of our evaluators to evaluate the dataset. Note that we run it 5 times so we can check that the results are consistent.

evaluator = fastrepl.RAGEvaluator(node=fastrepl.RAGAS(metric="Faithfulness"))

ds = fastrepl.local_runner(evaluator=evaluator, dataset=ds).run(num=5)
# ds["result"]
# [[0.25, 0.0, 0.25, 0.25, 0.5],
#  [0.5, 0.5, 0.5, 0.75, 0.875],
#  [0.66, 0.66, 0.66, 0.66, 0.66],
#  [1.0, 1.0, 1.0, 1.0, 1.0]]

It looks like we are getting quite good results. If we increase the number of samples a bit, we can obtain a reliable evaluation of the entire system. We will keep working on providing better evaluations.
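Since ds["result"] holds five scores per row, a simple way to collapse them into a single per-question number (and a rough score for the whole system) is to average them. A minimal sketch, assuming the structure shown above; this aggregation is illustrative, not part of the fastrepl API:

# Average the five repeated runs per question, then across questions.
per_question = [sum(runs) / len(runs) for runs in ds["result"]]
overall = sum(per_question) / len(per_question)
print(per_question)  # e.g. [0.25, 0.625, 0.66, 1.0]
print(overall)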

Detailed documentation is available at https://docs.fastrepl.com.

Contributing

Any kind of contribution is welcome.

eval's People

Contributors: yujonglee, dependabot[bot], krrishdholakia, marcklingen

eval's Issues

Human evaluation should be applied to a portion of the dataset

  • Conflict resolver: if consensus fails, we should collect those samples and hand them to a human (see the sketch below)
  • Vibe check https://www.latent.space/p/mosaic-mpt-7b

    The vibe-based eval cannot be underrated. … One of our evals was just having a bunch of prompts and watching the answers as the models trained and see if they change. Honestly, I don’t really believe that any of these eval metrics capture what we care about. One of our prompts was “suggest games for a 3-year-old and a 7-year-old to play” and that was a lot more valuable to see how the answer changed during the course of training. — Jonathan Frankle
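A minimal sketch of the conflict-resolver part, assuming the repeated scores from the Quickstart are available as ds["result"] (the threshold, helper name, and queue format are illustrative, not an existing fastrepl API):

# Queue samples for human review when the repeated runs disagree too much.
DISAGREEMENT_THRESHOLD = 0.3  # illustrative value

def needs_human_review(runs: list[float]) -> bool:
    return max(runs) - min(runs) > DISAGREEMENT_THRESHOLD

human_queue = [
    {"question": q, "answer": a, "runs": r}
    for q, a, r in zip(ds["question"], ds["answer"], ds["result"])
    if needs_human_review(r)
]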

Implement browser based human-eval

Using PyScript might be interesting, but it is not really needed.
We can set up two endpoints: one for serving the UI and one for receiving the eval result (e.g., as a form).
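A minimal sketch of the two-endpoint idea using FastAPI (the framework choice, route names, and form field are assumptions, not an existing fastrepl feature):

from fastapi import FastAPI, Form
from fastapi.responses import HTMLResponse

app = FastAPI()

@app.get("/", response_class=HTMLResponse)
def serve_ui():
    # Endpoint 1: serve a bare-bones evaluation UI.
    return """
    <form action="/result" method="post">
      <p>...sample to evaluate...</p>
      <input type="text" name="label" />
      <button type="submit">Submit</button>
    </form>
    """

@app.post("/result")
def receive_result(label: str = Form(...)):
    # Endpoint 2: receive the human eval result as a form submission.
    return {"received": label}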

Add compare-based model eval

Rather than asking an LLM for a direct evaluation (by giving a score), try giving it a reference and asking for a comparison. This helps reduce noise.
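A minimal sketch of what this could look like, reusing the openai.ChatCompletion style from the Quickstart (the prompt wording and helper name are illustrative):

import openai

def compare_to_reference(question: str, answer: str, reference: str) -> str:
    # Ask the judge for a comparison instead of an absolute score.
    res = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a judge. Given a question, a candidate answer (A) "
                "and a reference answer (B), reply with exactly one letter: "
                "'A' if A is better, 'B' if B is better, or 'T' for a tie.",
            },
            {
                "role": "user",
                "content": f"Question: {question}\nA: {answer}\nB: {reference}",
            },
        ],
        max_tokens=1,
    )
    return res["choices"][0]["message"]["content"].strip()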

Add Grader, similar to Classifier

Given the responses from gpt-3.5-turbo and another model, GPT-4 was prompted to score both out of 10 and explain its ratings.

they concatenate the prompt, CoT, news article, and summary and ask the LLM to output a score between 1 to 5.

Needs some work on logit_bias. Also, the return value is no longer a string, which needs some thought.

Maybe we should abstract Grader and Classifier into an EvaluationHead and share some code. In terms of CoT, that makes sense.
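A minimal sketch of a Grader sharing an EvaluationHead base with Classifier, using logit_bias to pin the single output token to the digits 1-5 so the result can be returned as an int (the class layout, token biasing, and prompt handling here are a possible design, not existing fastrepl code):

import openai
import tiktoken
from abc import ABC, abstractmethod

class EvaluationHead(ABC):
    # Shared interface: turn a prompt into a structured evaluation result.
    @abstractmethod
    def compute(self, prompt: str):
        ...

class Grader(EvaluationHead):
    # Scores a response on a 1-5 scale; the return value is an int, not a string.
    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.model = model
        enc = tiktoken.encoding_for_model(model)
        # Strongly bias the single output token toward the digit tokens "1".."5".
        self.logit_bias = {enc.encode(str(i))[0]: 100 for i in range(1, 6)}

    def compute(self, prompt: str) -> int:
        res = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1,
            logit_bias=self.logit_bias,
        )
        return int(res["choices"][0]["message"]["content"])

A Classifier could follow the same pattern, biasing toward its label tokens and mapping the completion back to a label string.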

Fix max_token=2 for togetherAI

tests/evaluation/test_with_yelp_review.py::test_llm_grading_head[togethercomputer/llama-2-70b-chat-references1]
  /Users/yujonglee/dev/fastrepl/fastrepl/fastrepl/warnings.py:24: UnknownLLMExceptionWarning: ValueError: Traceback (most recent call last):
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 1755, in exception_type
      error_response = json.loads(error_str)
                       ^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/json/__init__.py", line 346, in loads
      return _default_decoder.decode(s)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/json/decoder.py", line 337, in decode
      obj, end = self.raw_decode(s, idx=_w(s, 0).end())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/json/decoder.py", line 355, in raw_decode
      raise JSONDecodeError("Expecting value", s, err.value) from None
  json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/Users/yujonglee/dev/fastrepl/fastrepl/fastrepl/llm.py", line 138, in _completion
      result = litellm.gpt_cache.completion(  # pragma: no cover
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/gptcache/adapter/openai.py", line 100, in create
      return adapt(
             ^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/gptcache/adapter/adapter.py", line 238, in adapt
      llm_data = time_cal(
                 ^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/gptcache/utils/time.py", line 9, in inner
      res = func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/gpt_cache.py", line 12, in _llm_handler
      return litellm.completion(*llm_args, **llm_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 565, in wrapper
      raise e
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 526, in wrapper
      result = original_function(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/timeout.py", line 44, in wrapper
      result = future.result(timeout=local_timeout_duration)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/concurrent/futures/_base.py", line 456, in result
      return self.__get_result()
             ^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
      raise self._exception
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/timeout.py", line 33, in async_func
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/main.py", line 825, in completion
      raise exception_type(
            ^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 1826, in exception_type
      raise original_exception
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/main.py", line 559, in completion
      model_response = together_ai.completion(
                       ^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/llms/together_ai.py", line 110, in completion
      model_response["choices"][0]["message"]["content"] = completion_response
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/openai/openai_object.py", line 71, in __setitem__
      raise ValueError(
  ValueError: You cannot set content to an empty string. We interpret empty strings as None in requests.You may set {
    "content": "default",
    "role": "assistant",
    "logprobs": null
  }.content = None to delete the property
   | https://docs.fastrepl.com/miscellaneous/warnings_and_errors#unknownllmexception

This is not a problem on the LiteLLM side.

Better text formatting on thought generation

Implement consensus mechanism

Here, consensus means running multiple evaluators and aggregating a result from them. The multiple evaluators should be derived from a single one, differing only in example position or model.

This can be useful for handling all kinds of bias.
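A minimal sketch of that idea: derive variants of a single evaluator by shuffling the few-shot example order and swapping the model, then aggregate by majority vote (the evaluate(sample, model=..., examples=...) signature is hypothetical):

import random
from collections import Counter

def derive_variants(evaluate, examples, models):
    # Same evaluator logic, but each variant sees a different example order and model.
    variants = []
    for model in models:
        shuffled = random.sample(examples, len(examples))
        variants.append(
            lambda sample, m=model, ex=shuffled: evaluate(sample, model=m, examples=ex)
        )
    return variants

def consensus(variants, sample):
    # Majority vote across variants; None means consensus failed,
    # so the sample can be handed to human review (see the issue above).
    votes = Counter(variant(sample) for variant in variants)
    label, count = votes.most_common(1)[0]
    return label if count > len(variants) / 2 else None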

Adapt prompt from papers

For example,

Avoid any position biases and ensure that the
order in which the responses were presented does not influence your decision. Do not allow
the length of the responses to influence your evaluation. Do not favor certain names of
the assistants. Be as objective as possible.

This results in more consistent evaluations.
See https://arxiv.org/pdf/2306.05685.pdf

Better handle positional bias

Currently, we do some shuffling, but there are more places where shuffling can be applied (for example, when mapping labels).

Also, when we do #36, we can do things like:

we can evaluate the same pair of responses twice while swapping their order. If the same response is preferred in both orders, we mark it as a win; else, it’s a tie.

The same applies to classification. We can pass an option for how many times to try.
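A minimal sketch of the swap-and-compare rule quoted above (the judge callable, which returns "A" or "B" for the pair in the order given, is an assumption):

def compare_both_orders(judge, response_a: str, response_b: str) -> str:
    first = judge(response_a, response_b)   # "A" means the first argument won
    second = judge(response_b, response_a)  # same pair, order swapped

    # A response wins only if it is preferred in both orders; otherwise it's a tie.
    if first == "A" and second == "B":
        return "A wins"
    if first == "B" and second == "A":
        return "B wins"
    return "tie"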
