
zeno-build's People

Contributors

amangupt01, cabreraalex, krrishdholakia, lindiatjuatja, macabdul9, maksimstw, neubig, pogayo, sparkier, zhaochenyang20, zorazrw, zwhe99


zeno-build's Issues

Create chatbots demo page

We now have chatbots implemented as of #30

We should:

  • Create a publicly accessible page demonstrating the interesting findings we get by using Zeno Build to compare chatbots
  • Add a link to this page from the main README

This will require finishing several sub-issues, which we can create and link to this one.

Make it possible to explore model confidences

Most generative models can provide an uncertainty level, and it would be interesting to explore questions such as the correlation between model confidence and accuracy.

In order to do this, we would first need to modify the generate functions (such as the one in chat_generate.py) to output model uncertainties as well as strings.

This is certainly possible for Hugging Face models, and may be possible for API-based models.

Once these confidences are returned by the generate function, they would need to be passed on to the dataframe that is fed into Zeno in each individual example, such as the chatbot or summarization examples. For reference, the dataframe for the chatbot example is constructed here.
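As a sketch of what the generate functions could return (the `GenerationOutput` type and its fields are hypothetical, not the actual zeno-build API), a length-normalized confidence can be derived from per-token log-probabilities:

```python
import math
from dataclasses import dataclass


@dataclass
class GenerationOutput:
    """Hypothetical return type: generated text plus per-token log-probs."""

    text: str
    token_logprobs: list  # one log-probability per generated token

    @property
    def confidence(self) -> float:
        # Geometric mean of token probabilities (length-normalized),
        # so short and long outputs are comparable.
        if not self.token_logprobs:
            return 0.0
        return math.exp(sum(self.token_logprobs) / len(self.token_logprobs))
```

A column of these confidence values could then be added to the dataframe alongside each output string.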

ChatBots: Support uni_eval

For the chatbots demo, we could support other evaluation metrics such as uni_eval for dialog evaluation, which may give us better insights.

Task: Create folder with Zeno functions

Many Zeno functions will be reusable across tasks, and can be passed into the visualize function.

For example, metrics like text length and unique-word count could be useful distill functions.

We might want to categorize this folder by task or data type.
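As a sketch of the kind of reusable functions this folder could hold (names are hypothetical; in practice these would be wrapped as Zeno distill functions operating on the dataframe):

```python
def text_length(text: str) -> int:
    """Number of characters in the text."""
    return len(text)


def unique_word_ratio(text: str) -> float:
    """Fraction of distinct words; low values suggest repetitive output."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0
```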

Think of a good way to display model names in Zeno

Currently model names are displayed as model0, model1, but it'd be nice to have a better way of displaying them.

One suggestion is to display all of the non-constant parameters.
For example in the chatbots example, the prompt, model, and temperature are variable, so we could display those three parameters.

Core: Deduplicate Metrics

Currently metrics are implemented in multiple places. This is a potential source of bugs and inconsistency, so we should deduplicate them and rely on the Zeno implementations.

Core: Debug duplicate models

For some reason, there are duplicate models in the results. This should be investigated.

For instance, there are 10 models in the "results" file, but only 4 after deduplication.

Visualizing 10 models
-1960601404797368550 {'training_dataset': 'sst2', 'base_model': 'bert-base-uncased', 'learning_rate': 7.032465170166586e-05, 'num_train_epochs': 3, 'weight_decay': 0.0013326587635397158, 'bias': 0.9619960134236489}
8087707616790777082 {'training_dataset': 'imdb', 'base_model': 'distilbert-base-uncased', 'learning_rate': 0.0007441349947622346, 'num_train_epochs': 2, 'weight_decay': 0.0022321073814882274, 'bias': 0.4729424283280248}
974984753089123939 {'training_dataset': 'imdb', 'base_model': 'bert-base-uncased', 'learning_rate': 4.1464852686965755e-05, 'num_train_epochs': 1, 'weight_decay': 0.0021863797480360337, 'bias': 0.010710576206724776}
-5709593449973764166 {'training_dataset': 'imdb', 'base_model': 'distilbert-base-uncased', 'learning_rate': 0.0007188594167931795, 'num_train_epochs': 4, 'weight_decay': 0.002204406220406967, 'bias': 0.17853136775181744}
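One way to confirm whether these are true duplicates is to key models by a canonical serialization of their parameters rather than by hash; a sketch (function name hypothetical):

```python
import json


def dedup_models(models):
    """Group model IDs by a canonical serialization of their parameters.

    models: mapping of model ID -> parameter dict. Returns one
    representative ID per distinct parameter configuration.
    """
    seen = {}
    for model_id, params in models.items():
        key = json.dumps(params, sort_keys=True)  # order-independent key
        seen.setdefault(key, model_id)
    return seen
```

If this yields 4 configurations from the 10 entries in the results file, the extra entries have identical parameters and the ID generation (rather than the experiment grid) is the likely culprit.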

Add an exhaustive optimizer

In some experiments we just want to do a complete search over the entire search space and enumerate all of the possibilities.

Currently we only have the RandomOptimizer, but we should also have an ExhaustiveOptimizer that does an exhaustive search.

This optimizer will be incompatible with Float search spaces, as they cannot be enumerated.
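A minimal sketch of the enumeration logic (names hypothetical; the real ExhaustiveOptimizer would plug into the existing optimizer interface), taking the product over discrete dimensions and rejecting non-enumerable ones:

```python
import itertools


def exhaustive_configs(space):
    """Enumerate every combination of discrete parameter values.

    space: mapping of parameter name -> list of candidate values.
    Raises TypeError for non-enumerable (e.g. Float) dimensions.
    """
    for name, values in space.items():
        if not isinstance(values, (list, tuple)):
            raise TypeError(f"dimension {name!r} is not enumerable")
    names = list(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(space[n] for n in names))]
```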

Support multi-GPU inference for the Hugging Face provider

For locally hosted models from Hugging Face, it would be good to support multi-GPU inference, including:

  • Model parallelism to make sure that larger models fit in memory
  • Data parallelism for improved inference speed

Currently inference is handled by the Hugging Face provider, in `generate_from_huggingface`. Any code to support multi-GPU inference would have to be added there. Contributions are welcome!
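For the model-parallel side, Hugging Face models can typically be sharded across GPUs with `device_map="auto"` (via the accelerate library). The data-parallel side is mostly scheduling; a sketch of the shard logic (function name hypothetical):

```python
def shard_prompts(prompts, num_devices):
    """Round-robin partition of prompts across devices; each shard would
    be sent to a separate model replica for data-parallel inference."""
    shards = [[] for _ in range(num_devices)]
    for i, prompt in enumerate(prompts):
        shards[i % num_devices].append(prompt)
    return shards
```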

Error report: beginner's problem

I'm not sure what to do about this. I made the following changes to the original file and ran the code in VS Code. do_prediction works, but do_visualization does not.

I modified these hyperparameters:
[screenshot]

[screenshot]
I removed the `l`, turning it into:
[screenshot]
I changed the address:
[screenshot]
I added this code at the very beginning of the file:
[screenshot]

The earlier error no longer appears, but now I don't know how to solve this one:
[screenshot]

Auto-generate documentation

Currently we do not have any automatically generated documentation, including API docs. It would be nice to have this.

The main Zeno page has this, so perhaps we could use the same method.

openai_utils.py throws Unclosed connector Error

I was using openai_utils.py for ChatGPT inference. It always worked fine for the first couple hundred samples, and then it always crashed with the following error. I tried lowering request_per_minute, but the problem persists.

Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x000001FC0B9CFD10>
 92%|████████████████████████████████████████████████████████████████████████      | 1360/1473 [10:27<00:52,  2.17it/s]
Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x000001FC0C137AF0>, 482899.734), (<aiohttp.client_proto.ResponseHandler object at 0x000001FC0C2A5550>, 482900.89)]']
connector: <aiohttp.connector.TCPConnector object at 0x000001FC0B9CFD50>
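The "Unclosed client session" warning usually means an aiohttp.ClientSession was garbage-collected without being closed, e.g. when an exception skips the close() call. Wrapping the session in `async with` (or closing it in a `finally` block) guarantees cleanup; a sketch of the pattern, using a stand-in class so it runs without aiohttp:

```python
import asyncio


class FakeSession:
    """Stand-in for aiohttp.ClientSession, to illustrate the pattern."""

    def __init__(self):
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc_info):
        self.closed = True  # close() runs even if requests raised


async def run_requests():
    async with FakeSession() as session:
        pass  # issue the API requests here
    return session.closed  # the session is closed on exit
```

Whether this also explains the crash at sample 1360/1473 would need separate investigation; the warning itself only reports the missing cleanup.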

Create email address

Some people may feel more comfortable getting in touch through private channels such as email. Perhaps we should create an email address for Zeno Build, or for Zeno in general.

Task: Chat Your Data

We can create an example task for "chat your data", similar to the one implemented in LangChain.

ChatBots: Find interesting and catchy trends

Find at least 3 (ideally 5) interesting trends that can be demonstrated by browsing the results.
Write these up in a doc, e.g. a Google Doc or Notion page, so they can be posted as tweets.

Make modeling library dependencies conditional

Currently libraries such as OpenAI, Cohere, and Hugging Face are imported unconditionally in the core library code. However, it would be better for at least OpenAI and Cohere to be required only when they are actually used. To do this, we can use dynamic imports and warn users that they need to install the missing library.
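A sketch of the dynamic-import pattern (function name hypothetical):

```python
import importlib


def optional_import(module_name, pip_name):
    """Import a provider library on demand, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name!r} is required for this provider; "
            f"install it with `pip install {pip_name}`"
        ) from err
```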

UX: Model names are opaque numbers

In the text classification demo, right now it seems that the model names are just numbers, so it's hard to tell which model is which.

[screenshot: Screen Shot 2023-04-19 at 6 13 08 PM]

We should think about model naming. Here are some ideas:

  1. "model1", "model2", "model3". Not very useful, but if the parameters can be found somewhere in the Zeno interface (is this a possibility?) then we could look up each model.
  2. Find all parameters that vary between models, and print out a model name consisting of the concatenation of the parameters. For example, if we have "learning_rate", "training_data", and "batch_size", where "batch_size" is constant across all models, we could make the name "learning_rate=xxx,training_data=yyy".
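Option 2 can be implemented generically; a sketch (function name hypothetical):

```python
def model_name(params, all_params):
    """Name a model by the parameters that actually vary across models."""
    varying = [k for k in params
               if len({str(p.get(k)) for p in all_params}) > 1]
    return ",".join(f"{k}={params[k]}" for k in varying) or "default"
```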

Error report

I ran the code in the directory 'zeno-build-main\examples\text_classification\main', but PyCharm keeps reporting errors like this, and I don't know how to solve it:
[screenshot]

Add search space that supports multiple experiments

Right now Zeno Build supports CombinatorialSearchSpace, which takes the cross product between all parameter configurations.

However, in many cases it's common to run multiple experiments, where you explore some part of the experiment space in the first experiment, and another part of the search space in the second experiment.

A current workaround is to create two different configuration files and either choose which one to use or run both sequentially. Another option is to specify this directly in the search space, with something like:

space = CompositeSearchSpace([
   CombinatorialSearchSpace({...}),  # experiment 1
   CombinatorialSearchSpace({...}),  # experiment 2
])
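A minimal sketch of how CompositeSearchSpace could work (the class body and the FixedSpace stub are hypothetical illustrations, not the zeno-build API):

```python
import random


class CompositeSearchSpace:
    """Union of sub-spaces: sampling picks a sub-space, then delegates."""

    def __init__(self, spaces, weights=None):
        self.spaces = spaces
        self.weights = weights  # optional probability of each sub-space

    def sample(self):
        (space,) = random.choices(self.spaces, weights=self.weights)
        return space.sample()


class FixedSpace:
    """Stand-in sub-space that always yields one configuration."""

    def __init__(self, config):
        self.config = config

    def sample(self):
        return dict(self.config)
```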

Doc: Main README

We need to write up the main README. I can take a first stab at it.

Core: Consolidate code

We now have three example tasks, so we can start consolidating the code to reduce copy-pasting across them.

Zeno error: DataFrame columns must be unique for orient='records'.

When running Zeno and trying to visualize results, I get the following difficult-to-understand error.
@cabreraalex any ideas about how to fix this?

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 429, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
    await self.app(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/zeno/server.py", line 116, in get_filtered_table
    return zeno.get_filtered_table(req)
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/zeno/backend.py", line 587, in get_filtered_table
    return filt_df[[str(col) for col in req.columns]].to_json(orient="records")
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/core/generic.py", line 2532, in to_json
    return json.to_json(
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/io/json/_json.py", line 181, in to_json
    s = writer(
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/io/json/_json.py", line 237, in __init__
    self._format_axes()
  File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/io/json/_json.py", line 301, in _format_axes
    raise ValueError(
ValueError: DataFrame columns must be unique for orient='records'.
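The error means the dataframe fed to Zeno has two columns with the same name, so pandas refuses to serialize it with orient="records". A quick way to find and drop the offending columns:

```python
import pandas as pd

# to_json(orient="records") requires unique column names; find the
# offenders, then drop (or rename) the duplicates before visualizing.
df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
dupes = df.columns[df.columns.duplicated()].tolist()
deduped = df.loc[:, ~df.columns.duplicated()]  # keeps the first "a"
```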

ChatBots: Improve example display

Currently the chatbot example display looks like this:

[screenshot: Screen Shot 2023-05-06 at 2 11 10 PM]

There are two problems:

  1. All system outputs are listed as "negative" for some reason, which is strange.
  2. It is only displaying one previous utterance, although in most cases two previous utterances worth of context are available.
