
athina-evals's Introduction

Documentation | Athina SDK Demo Video | Athina Platform Demo Video →

Overview

Athina is an open-source library with plug-and-play preset evals, designed to help engineers systematically improve the reliability and performance of their LLM applications through eval-driven development.



Why you need evals

Evaluations (evals) play a crucial role in assessing the performance of LLM responses, especially when scaling from prototyping to production.

They are akin to unit tests for LLM applications, allowing developers to:

  • Catch and prevent hallucinations and bad outputs
  • Measure model performance
  • Run quantifiable experiments against ambiguous, unstructured text data
  • A/B test different models and prompts rapidly
  • Detect regressions before they get to production
  • Monitor production data with confidence
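
To make the unit-test analogy concrete, here is a minimal sketch in plain Python. It does not use the Athina API: the `response_mentions` helper and the sample cases are hypothetical, and a real eval would usually be LLM-graded rather than keyword-based.

```python
def response_mentions(response: str, required_terms: list[str]) -> bool:
    """Pass only if every required term appears in the response (case-insensitive)."""
    text = response.lower()
    return all(term.lower() in text for term in required_terms)


# Hypothetical golden dataset: each case pairs a response with terms it must contain.
golden_cases = [
    {"response": "Paris is the capital of France.", "required_terms": ["Paris"]},
    {"response": "I am not sure.", "required_terms": ["Berlin"]},
]

results = [response_mentions(c["response"], c["required_terms"]) for c in golden_cases]
failed = [i for i, ok in enumerate(results) if not ok]
print(f"{len(failed)} of {len(golden_cases)} cases failed: {failed}")
```

Running a check like this on every prompt or model change is what lets a regression surface before it reaches production.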

🔴 Problem: Flaws with Current LLM Developer Workflows

The journey from a demo AI to a reliable production application is not easy.

Developers usually start iterating on performance by manually inspecting the outputs. Eventually they progress to using spreadsheets, CSVs, or evaluating against a golden dataset.

Each method has drawbacks and requires its own tooling and evaluation methods.

A lot of manual effort is required to set up good infrastructure for running evals: creating a dataset, reviewing the responses, writing evals, building internal tooling and dashboards, and tracking experiment parameters and metrics for the historical record.

Eventually every LLM developer realizes the indispensable need for evals and an infrastructure to consistently run and track iterations to improve performance and reliability systematically.


🟢 Solution: Athina Evals

Github | Watch Demo Video | Docs

Athina is an open-source library that offers a system for eval-driven development, overcoming the limitations of traditional workflows.

Our solution allows for rapid experimentation and customizable evaluators with consistent metrics.

Here’s why this is better than building in-house eval infrastructure:

  • Plug-and-Play Preset Evals: Ready-to-use evals for immediate application
  • Integrated Dashboard: For tracking experiments and inspecting the results in a web UI.
  • Custom Evaluators: A flexible framework to craft tailored evals.
  • Consistent Metrics: Uniform evaluation standards across all stages. Evaluate your model in dev and prod using a consistent set of metrics.
  • Historical Record: Automatic tracking of every prompt iteration.
  • Quick Start: Easy 5-minute setup.

Here’s a demo video.



Quick Start

The easiest way to get started is to use one of our Example Notebooks as a starting point.

To get started with Athina Evals:

1. Install the athina package

pip install athina

2. Set your API keys

If you are using the Python SDK, you can set the API keys like this:

import os

from athina.keys import AthinaApiKey, OpenAiApiKey

# Read the keys from environment variables rather than hard-coding them
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))

If you are using the CLI, run athina init and enter the API keys when prompted.
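
Because `os.getenv` returns `None` when a variable is unset, it can help to fail early before handing keys to the SDK. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper, not part of the athina package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes both variables are exported in your shell):
# OpenAiApiKey.set_key(require_env("OPENAI_API_KEY"))
# AthinaApiKey.set_key(require_env("ATHINA_API_KEY"))
```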


3. Load your dataset

The example below loads a JSON file. You can also load data from a CSV file or a Python dictionary.

from athina.loaders import RagLoader

dataset = RagLoader().load_json(json_filepath)
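
If you do not have a dataset file yet, the sketch below writes a small one and sanity-checks it with the standard library only. The field names `query`, `context`, and `response` are assumptions about the shape a RAG dataset takes; confirm the exact names in the loader documentation before relying on them.

```python
import json
import tempfile
from pathlib import Path

# Assumed record shape for a RAG dataset; confirm the field names in the docs.
records = [
    {
        "query": "What is the capital of France?",
        "context": ["France is a country in Europe. Its capital is Paris."],
        "response": "The capital of France is Paris.",
    }
]

# Write the records to a JSON file that could be passed as json_filepath.
json_filepath = Path(tempfile.mkdtemp()) / "rag_dataset.json"
json_filepath.write_text(json.dumps(records, indent=2), encoding="utf-8")

# Sanity-check the file before loading it: every record needs the same keys.
loaded = json.loads(json_filepath.read_text(encoding="utf-8"))
required_keys = {"query", "context", "response"}
missing = [i for i, row in enumerate(loaded) if not required_keys.issubset(row)]
print(f"{len(loaded)} records, {len(missing)} with missing fields")
```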

4. Run an eval on the dataset:

from athina.evals import DoesResponseAnswerQuery

DoesResponseAnswerQuery().run_batch(data=dataset)

For more detailed guides, you can follow the links below to get started running evals using Athina.


Preset Evals

You can use our preset evaluators to add evaluation to your dev stack rapidly.

Here are the preset evaluators in this library:

RAG Evals

These evals are useful for evaluating LLM applications with Retrieval Augmented Generation (RAG).


We have also built other evaluators that are not yet part of this library (but will be soon). You can find more information about these in our documentation.

Summarization Accuracy Evals:

These evals are useful for evaluating LLM-powered summarization performance.


More Evals



Custom Evals

See this page for more information on how to write your own custom evals.
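
As a rough illustration of what a custom eval boils down to, the sketch below defines a rule-based check in plain Python. It deliberately does not use Athina's base classes or interfaces: `EvalResult`, `MaxLengthEval`, and the `response` field are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    reason: str


class MaxLengthEval:
    """Hypothetical custom eval: fail responses longer than a word limit."""

    def __init__(self, max_words: int = 50):
        self.max_words = max_words

    def run(self, response: str) -> EvalResult:
        n_words = len(response.split())
        if n_words > self.max_words:
            return EvalResult(False, f"{n_words} words exceeds limit of {self.max_words}")
        return EvalResult(True, f"{n_words} words is within limit of {self.max_words}")

    def run_batch(self, data: list[dict]) -> list[EvalResult]:
        """Apply the eval to each row's 'response' field."""
        return [self.run(row["response"]) for row in data]


results = MaxLengthEval(max_words=5).run_batch(
    [{"response": "Paris."}, {"response": "The capital city of France is Paris."}]
)
for r in results:
    print(r.passed, r.reason)
```

Returning a reason alongside the pass/fail flag makes failures much easier to review in bulk.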



Why should I use Athina's Evals instead of writing my own?

You could build your own eval system from scratch, but here's why Athina might be better for you:

  • Athina provides you with plug-and-play preset evals that have been well-tested
  • Athina evals can run on both development and production, giving you consistent metrics for evaluating model performance and drift.
  • Athina removes the need for your team to write boilerplate loaders, implement LLM calls, normalize data formats, etc.
  • Athina offers a modular, extensible framework for writing and running evals
  • Athina calculates analytics like pass rate and flakiness, and allows you to batch run evals against live production data or dev datasets
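
To give a feel for the analytics mentioned above, here is a minimal sketch of pass rate and flakiness in plain Python. The definitions are assumptions for illustration (flakiness is taken to be the fraction of datapoints whose outcome changes across repeated runs) and may differ from how Athina computes them.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of datapoints that passed in a single run."""
    return sum(results) / len(results) if results else 0.0


def flakiness(runs: list[list[bool]]) -> float:
    """Fraction of datapoints whose pass/fail outcome differs across runs."""
    if not runs or not runs[0]:
        return 0.0
    per_datapoint = list(zip(*runs))  # regroup results by datapoint
    unstable = sum(1 for outcomes in per_datapoint if len(set(outcomes)) > 1)
    return unstable / len(per_datapoint)


# Three repeated runs of the same eval over the same four datapoints.
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, True],
]
print([pass_rate(r) for r in runs])
print(flakiness(runs))
```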



Need Production Monitoring and Evals? We've got you covered...

  • Athina eval runs are written to the Athina Dashboard automatically, so you can view results and analytics in a beautiful UI.
  • Athina tracks your experiments automatically, so you can view a historical record of previous eval runs.
  • Athina calculates analytics segmented at every level possible, so you can view and compare your model performance at very granular levels.

Athina Observe Platform

About Athina

Athina is building an end-to-end LLM monitoring and evaluation platform.

Website | Demo Video

Contact us at [email protected] for any questions about the eval library.

athina-evals's People

Contributors

akshat-g, shivsak, vivek-athina
