LangCheck

Simple, Pythonic building blocks to evaluate LLM applications.

Install | Examples | Quickstart | Docs | 日本語 | 中文 | Deutsch

Install

pip install langcheck

Having installation issues? See the FAQ.

Examples

Evaluate Text

Use LangCheck's suite of metrics to evaluate LLM-generated text.

import langcheck

# Generate text with any LLM library
generated_outputs = [
    'Black cat the',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]

# Check text quality and get results as a DataFrame (threshold is optional)
langcheck.metrics.fluency(generated_outputs) > 0.5

[Screenshot: MetricValueWithThreshold results as a DataFrame]
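In a real application, generated_outputs would come from your LLM library of choice. For example, with the openai package (a sketch; the model name is an arbitrary assumption):

import langcheck
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompts = ['Describe a black cat.', 'Where is the cat sitting?']
generated_outputs = [
    client.chat.completions.create(
        model='gpt-4o-mini',  # hypothetical model choice
        messages=[{'role': 'user', 'content': prompt}],
    ).choices[0].message.content
    for prompt in prompts
]

# Evaluate the generated text as before
langcheck.metrics.fluency(generated_outputs) > 0.5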

It's easy to turn LangCheck metrics into unit tests: just use assert:

assert langcheck.metrics.fluency(generated_outputs) > 0.5

LangCheck includes several types of metrics to evaluate LLM applications. Some examples:

| Type of Metric | Examples | Languages |
|---|---|---|
| Reference-Free Text Quality Metrics | toxicity(generated_outputs), sentiment(generated_outputs), ai_disclaimer_similarity(generated_outputs) | EN, JA, ZH, DE |
| Reference-Based Text Quality Metrics | semantic_similarity(generated_outputs, reference_outputs), rouge2(generated_outputs, reference_outputs) | EN, JA, ZH, DE |
| Source-Based Text Quality Metrics | factual_consistency(generated_outputs, sources) | EN, JA, ZH, DE |
| Text Structure Metrics | is_float(generated_outputs, min=0, max=None), is_json_object(generated_outputs) | All Languages |
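For example, reference-based and source-based metrics take the expected outputs or source documents alongside the generated text. A minimal sketch (the reference and source strings here are hypothetical):

import langcheck

generated_outputs = ['The black cat is sitting on the fence']
reference_outputs = ['A black cat sat on the fence']        # hypothetical references
sources = ['The report describes a black cat on a fence.']  # hypothetical sources

# Compare generated text against reference outputs
langcheck.metrics.semantic_similarity(generated_outputs, reference_outputs)

# Check that generated text is consistent with its sources
langcheck.metrics.factual_consistency(generated_outputs, sources)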

Visualize Metrics

LangCheck comes with built-in, interactive visualizations of metrics.

# Choose some metrics
fluency_values = langcheck.metrics.fluency(generated_outputs)
sentiment_values = langcheck.metrics.sentiment(generated_outputs)

# Interactive scatter plot of one metric
fluency_values.scatter()

[Screenshot: scatter plot for one metric]

# Interactive scatter plot of two metrics
langcheck.plot.scatter(fluency_values, sentiment_values)

[Screenshot: scatter plot for two metrics]

# Interactive histogram of a single metric
fluency_values.histogram()

[Screenshot: histogram for one metric]

Augment Data

Text augmentations can automatically generate reworded prompts, typos, gender changes, and more to evaluate model robustness.

For example, to measure how the model responds to different genders:

male_prompts = langcheck.augment.gender(prompts, to_gender='male')
female_prompts = langcheck.augment.gender(prompts, to_gender='female')

male_generated_outputs = [my_llm_app(prompt) for prompt in male_prompts]
female_generated_outputs = [my_llm_app(prompt) for prompt in female_prompts]

langcheck.metrics.sentiment(male_generated_outputs)
langcheck.metrics.sentiment(female_generated_outputs)
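To compare the two runs, you can pull the per-output scores out of the returned metric values. A minimal sketch, assuming the returned object exposes its scores as metric_values:

male_sentiment = langcheck.metrics.sentiment(male_generated_outputs)
female_sentiment = langcheck.metrics.sentiment(female_generated_outputs)

# metric_values is assumed to hold one score per output
male_avg = sum(male_sentiment.metric_values) / len(male_sentiment.metric_values)
female_avg = sum(female_sentiment.metric_values) / len(female_sentiment.metric_values)

# A large gap between the averages may indicate gender bias in the app's responses
print(f'Male avg sentiment: {male_avg:.2f}, female avg sentiment: {female_avg:.2f}')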

Unit Testing

You can write test cases for your LLM application using LangCheck metrics.

For example, if you only have a list of prompts to test against:

import json

import langcheck
from langcheck.utils import load_json

# Run the LLM application once to generate text
prompts = load_json('test_prompts.json')
generated_outputs = [my_llm_app(prompt) for prompt in prompts]

# Unit tests
def test_toxicity():
    assert langcheck.metrics.toxicity(generated_outputs) < 0.1

def test_fluency():
    assert langcheck.metrics.fluency(generated_outputs) > 0.9

def test_json_structure():
    assert langcheck.metrics.validation_fn(
        generated_outputs, lambda x: 'myKey' in json.loads(x)).all()
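These tests run under pytest as-is. If you'd rather get one test case per output instead of a single aggregate assertion, a parametrized variant is one option (a sketch, not part of LangCheck's API; it relies on metrics also accepting a single string, as in the Guardrails example below):

import pytest

@pytest.mark.parametrize('output', generated_outputs)
def test_toxicity_per_output(output):
    assert langcheck.metrics.toxicity(output) < 0.1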

Monitoring

You can monitor the quality of your LLM outputs in production with LangCheck metrics.

Just save the outputs and pass them into LangCheck.

from langcheck.utils import load_json

production_outputs = load_json('llm_logs_2023_10_02.json')['outputs']

# Evaluate and display toxic outputs in production logs
langcheck.metrics.toxicity(production_outputs) > 0.75

# Or if your app outputs structured text
langcheck.metrics.is_json_array(production_outputs)
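To act on the results instead of just displaying them, you can collect the offending outputs for review. A minimal sketch, assuming the returned metric value exposes per-output scores as metric_values:

toxicity_values = langcheck.metrics.toxicity(production_outputs)

# Collect outputs whose toxicity score exceeds the threshold
flagged_outputs = [
    output
    for output, score in zip(production_outputs, toxicity_values.metric_values)
    if score > 0.75
]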

Guardrails

You can provide guardrails on LLM outputs with LangCheck metrics.

Just filter candidate outputs through LangCheck.

# Get a candidate output from the LLM app
raw_output = my_llm_app(random_user_prompt)

# Filter the output before it reaches the user
while langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
    raw_output = my_llm_app(random_user_prompt)
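In practice, you may want to cap the number of retries so a stubborn prompt can't loop forever. A minimal sketch (max_retries and the fallback message are hypothetical, not part of LangCheck):

max_retries = 3

# Regenerate until the output is clean or the retry budget is exhausted
for attempt in range(max_retries + 1):
    raw_output = my_llm_app(random_user_prompt)
    if not langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
        break
else:
    # No acceptable output was produced; fall back to a canned response
    raw_output = "Sorry, I can't help with that."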
