
eval's Introduction

⚡♾️ FastREPL

Fast Run-Eval-Polish Loop for LLM Applications.

Quickstart

Let's say we have this existing system:

import openai

context = """
The first step is to decide what to work on. The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work.
In practice you don't have to worry much about the third criterion. Ambitious people are if anything already too conservative about it. So all you need to do is find something you have an aptitude for and great interest in.
"""

def run_qa(question: str) -> str:
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": f"Answer in less than 30 words. Use the following context if needed: {context}",
            },
            {"role": "user", "content": question},
        ],
    )["choices"][0]["message"]["content"]

We already have a fixed context. Now, let's ask some questions. local_runner is used here to run the function locally with threads and progress tracking; a remote_runner will be available to run the same thing in the cloud.

import fastrepl

# https://huggingface.co/datasets/repllabs/questions_how_to_do_great_work
questions = [
    "how to do great work?.",
    "How can curiosity be nurtured and utilized to drive great work?",
    "How does the author suggest finding something to work on?",
    "How did Van Dyck's painting differ from Daniel Mytens' version and what message did it convey?",
]
contexts = [[context]] * len(questions)

runner = fastrepl.local_runner(fn=run_qa)
ds = runner.run(args_list=[(q,) for q in questions], output_feature="answer")

ds = ds.add_column("question", questions)
ds = ds.add_column("contexts", contexts)
# fastrepl.Dataset({
#     features: ['answer', 'question', 'contexts'],
#     num_rows: 4
# })

Now, let's use one of our evaluators to evaluate the dataset. Note that we run it 5 times so we can check that the results are consistent.

evaluator = fastrepl.RAGEvaluator(node=fastrepl.RAGAS(metric="Faithfulness"))

ds = fastrepl.local_runner(evaluator=evaluator, dataset=ds).run(num=5)
# ds["result"]
# [[0.25, 0.0, 0.25, 0.25, 0.5],
#  [0.5, 0.5, 0.5, 0.75, 0.875],
#  [0.66, 0.66, 0.66, 0.66, 0.66],
#  [1.0, 1.0, 1.0, 1.0, 1.0]]

It looks like we are getting quite good results. If we increase the number of samples a bit, we can obtain a reliable evaluation of the entire system. We will keep working on providing better evaluations.
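Since ds["result"] holds five scores per row, a simple way to collapse them into a single per-question number (and a rough score for the whole system) is to average them. A minimal sketch, assuming the structure shown above; this aggregation is illustrative, not part of the fastrepl API:

# Average the five repeated runs per question, then across questions.
per_question = [sum(runs) / len(runs) for runs in ds["result"]]
overall = sum(per_question) / len(per_question)
print(per_question)  # e.g. [0.25, 0.625, 0.66, 1.0]
print(overall)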

Detailed documentation is available at https://docs.fastrepl.com.

Contributing

Any kind of contribution is welcome.

eval's People

Contributors: yujonglee, dependabot[bot], krrishdholakia, marcklingen

eval's Issues

Human evaluation should be applied to a portion of the dataset

  • Conflict resolver: if consensus fails, we should collect those samples and hand them to a human (see the sketch below)
  • Vibe check https://www.latent.space/p/mosaic-mpt-7b

    The vibe-based eval cannot be underrated. … One of our evals was just having a bunch of prompts and watching the answers as the models trained and see if they change. Honestly, I don’t really believe that any of these eval metrics capture what we care about. One of our prompts was “suggest games for a 3-year-old and a 7-year-old to play” and that was a lot more valuable to see how the answer changed during the course of training. — Jonathan Frankle
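A minimal sketch of the conflict-resolver part, assuming the repeated scores from the Quickstart are available as ds["result"] (the threshold, helper name, and queue format are illustrative, not an existing fastrepl API):

# Queue samples for human review when the repeated runs disagree too much.
DISAGREEMENT_THRESHOLD = 0.3  # illustrative value

def needs_human_review(runs: list[float]) -> bool:
    return max(runs) - min(runs) > DISAGREEMENT_THRESHOLD

human_queue = [
    {"question": q, "answer": a, "runs": r}
    for q, a, r in zip(ds["question"], ds["answer"], ds["result"])
    if needs_human_review(r)
]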

Implement browser based human-eval

Using PyScript might be interesting, but it is not really needed.
We can set up two endpoints: one for serving the UI and one for receiving the eval result (e.g., as a form).
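A minimal sketch of the two-endpoint idea using FastAPI (the framework choice, route names, and form field are assumptions, not an existing fastrepl feature):

from fastapi import FastAPI, Form
from fastapi.responses import HTMLResponse

app = FastAPI()

@app.get("/", response_class=HTMLResponse)
def serve_ui():
    # Endpoint 1: serve a bare-bones evaluation UI.
    return """
    <form action="/result" method="post">
      <p>...sample to evaluate...</p>
      <input type="text" name="label" />
      <button type="submit">Submit</button>
    </form>
    """

@app.post("/result")
def receive_result(label: str = Form(...)):
    # Endpoint 2: receive the human eval result as a form submission.
    return {"received": label}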

Add compare-based model eval

Rather than asking an LLM for a direct evaluation (by giving a score), try giving it a reference and asking for a comparison. This helps reduce noise.
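A minimal sketch of what this could look like, reusing the openai.ChatCompletion style from the Quickstart (the prompt wording and helper name are illustrative):

import openai

def compare_to_reference(question: str, answer: str, reference: str) -> str:
    # Ask the judge for a comparison instead of an absolute score.
    res = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a judge. Given a question, a candidate answer (A) "
                "and a reference answer (B), reply with exactly one letter: "
                "'A' if A is better, 'B' if B is better, or 'T' for a tie.",
            },
            {
                "role": "user",
                "content": f"Question: {question}\nA: {answer}\nB: {reference}",
            },
        ],
        max_tokens=1,
    )
    return res["choices"][0]["message"]["content"].strip()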

Add Grader, similar to Classifier

Given the responses from gpt-3.5-turbo and another model, GPT-4 was prompted to score both out of 10 and explain its ratings.

they concatenate the prompt, CoT, news article, and summary and ask the LLM to output a score between 1 to 5.

Needs some work on logit_bias. Also, the return value is no longer a string, which needs some thought.

Maybe we should abstract Grader and Classifier into an EvaluationHead and share some code. In terms of CoT, that makes sense.
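A minimal sketch of a Grader sharing an EvaluationHead base with Classifier, using logit_bias to pin the single output token to the digits 1-5 so the result can be returned as an int (the class layout, token biasing, and prompt handling here are a possible design, not existing fastrepl code):

import openai
import tiktoken
from abc import ABC, abstractmethod

class EvaluationHead(ABC):
    # Shared interface: turn a prompt into a structured evaluation result.
    @abstractmethod
    def compute(self, prompt: str):
        ...

class Grader(EvaluationHead):
    # Scores a response on a 1-5 scale; the return value is an int, not a string.
    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.model = model
        enc = tiktoken.encoding_for_model(model)
        # Strongly bias the single output token toward the digit tokens "1".."5".
        self.logit_bias = {enc.encode(str(i))[0]: 100 for i in range(1, 6)}

    def compute(self, prompt: str) -> int:
        res = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1,
            logit_bias=self.logit_bias,
        )
        return int(res["choices"][0]["message"]["content"])

A Classifier could follow the same pattern, biasing toward its label tokens and mapping the completion back to a label string.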

Fix max_token=2 for togetherAI

tests/evaluation/test_with_yelp_review.py::test_llm_grading_head[togethercomputer/llama-2-70b-chat-references1]
  /Users/yujonglee/dev/fastrepl/fastrepl/fastrepl/warnings.py:24: UnknownLLMExceptionWarning: ValueError: Traceback (most recent call last):
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 1755, in exception_type
      error_response = json.loads(error_str)
                       ^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/json/__init__.py", line 346, in loads
      return _default_decoder.decode(s)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/json/decoder.py", line 337, in decode
      obj, end = self.raw_decode(s, idx=_w(s, 0).end())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/json/decoder.py", line 355, in raw_decode
      raise JSONDecodeError("Expecting value", s, err.value) from None
  json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/Users/yujonglee/dev/fastrepl/fastrepl/fastrepl/llm.py", line 138, in _completion
      result = litellm.gpt_cache.completion(  # pragma: no cover
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/gptcache/adapter/openai.py", line 100, in create
      return adapt(
             ^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/gptcache/adapter/adapter.py", line 238, in adapt
      llm_data = time_cal(
                 ^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/gptcache/utils/time.py", line 9, in inner
      res = func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/gpt_cache.py", line 12, in _llm_handler
      return litellm.completion(*llm_args, **llm_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 565, in wrapper
      raise e
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 526, in wrapper
      result = original_function(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/timeout.py", line 44, in wrapper
      result = future.result(timeout=local_timeout_duration)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/concurrent/futures/_base.py", line 456, in result
      return self.__get_result()
             ^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/.pyenv/versions/3.11.3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
      raise self._exception
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/timeout.py", line 33, in async_func
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/main.py", line 825, in completion
      raise exception_type(
            ^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/utils.py", line 1826, in exception_type
      raise original_exception
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/main.py", line 559, in completion
      model_response = together_ai.completion(
                       ^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/litellm/llms/together_ai.py", line 110, in completion
      model_response["choices"][0]["message"]["content"] = completion_response
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
    File "/Users/yujonglee/dev/fastrepl/fastrepl/.venv/lib/python3.11/site-packages/openai/openai_object.py", line 71, in __setitem__
      raise ValueError(
  ValueError: You cannot set content to an empty string. We interpret empty strings as None in requests.You may set {
    "content": "default",
    "role": "assistant",
    "logprobs": null
  }.content = None to delete the property
   | https://docs.fastrepl.com/miscellaneous/warnings_and_errors#unknownllmexception

This is not a problem on the LiteLLM side.

Better text formatting on thought generation

Implement consensus mechanism

Here, consensus means running multiple evaluators and aggregating a result from them. The multiple evaluators should be derived from a single one, differing only in example position or model.

This can be useful for handling all kinds of bias.
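A minimal sketch of that idea: derive variants of a single evaluator by shuffling the few-shot example order and swapping the model, then aggregate by majority vote (the evaluate(sample, model=..., examples=...) signature is hypothetical):

import random
from collections import Counter

def derive_variants(evaluate, examples, models):
    # Same evaluator logic, but each variant sees a different example order and model.
    variants = []
    for model in models:
        shuffled = random.sample(examples, len(examples))
        variants.append(
            lambda sample, m=model, ex=shuffled: evaluate(sample, model=m, examples=ex)
        )
    return variants

def consensus(variants, sample):
    # Majority vote across variants; None means consensus failed,
    # so the sample can be handed to human review (see the issue above).
    votes = Counter(variant(sample) for variant in variants)
    label, count = votes.most_common(1)[0]
    return label if count > len(variants) / 2 else None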

Adapt prompt from papers

For example,

Avoid any position biases and ensure that the
order in which the responses were presented does not influence your decision. Do not allow
the length of the responses to influence your evaluation. Do not favor certain names of
the assistants. Be as objective as possible.

This results in more consistent evaluations.
See https://arxiv.org/pdf/2306.05685.pdf

Better handle positional bias

Currently, we do some shuffling, but there are more places where shuffling can be applied (for example, when mapping labels).

Also, when we do #36, we can do things like:

we can evaluate the same pair of responses twice while swapping their order. If the same response is preferred in both orders, we mark it as a win; else, it’s a tie.

The same applies to classification. We can pass an option for how many times to try.
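A minimal sketch of the swap-and-compare rule quoted above (the judge callable, which returns "A" or "B" for the pair in the order given, is an assumption):

def compare_both_orders(judge, response_a: str, response_b: str) -> str:
    first = judge(response_a, response_b)   # "A" means the first argument won
    second = judge(response_b, response_a)  # same pair, order swapped

    # A response wins only if it is preferred in both orders; otherwise it's a tie.
    if first == "A" and second == "B":
        return "A wins"
    if first == "B" and second == "A":
        return "B wins"
    return "tie"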
