
langcheck's Introduction

LangCheck Logo


Simple, Pythonic building blocks to evaluate LLM applications.

Install · Examples · Quickstart · Docs · 日本語 · 中文 · Deutsch

Install

# Install English metrics only
pip install langcheck

# Install English and Japanese metrics
pip install langcheck[ja]

# Install metrics for all languages (requires pip 21.2+)
pip install --upgrade pip
pip install langcheck[all]

Having installation issues? See the FAQ.

Examples

Evaluate Text

Use LangCheck's suite of metrics to evaluate LLM-generated text.

import langcheck

# Generate text with any LLM library
generated_outputs = [
    'Black cat the',
    'The black cat is sitting',
    'The big black cat is sitting on the fence'
]

# Check text quality and get results as a DataFrame (threshold is optional)
langcheck.metrics.fluency(generated_outputs) > 0.5

MetricValueWithThreshold screenshot

It's easy to turn LangCheck metrics into unit tests; just use assert:

assert langcheck.metrics.fluency(generated_outputs) > 0.5

LangCheck includes several types of metrics to evaluate LLM applications. Some examples:

Type of Metric                       | Examples                                                                                                 | Languages
Reference-Free Text Quality Metrics  | toxicity(generated_outputs), sentiment(generated_outputs), ai_disclaimer_similarity(generated_outputs)  | EN, JA, ZH, DE
Reference-Based Text Quality Metrics | semantic_similarity(generated_outputs, reference_outputs), rouge2(generated_outputs, reference_outputs) | EN, JA, ZH, DE
Source-Based Text Quality Metrics    | factual_consistency(generated_outputs, sources)                                                         | EN, JA, ZH, DE
Text Structure Metrics               | is_float(generated_outputs, min=0, max=None), is_json_object(generated_outputs)                         | All Languages
Pairwise Text Quality Metrics        | pairwise_comparison(generated_outputs_a, generated_outputs_b, prompts)                                  | EN, JA
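
For instance, here is how a few of these metric types are called (a minimal sketch; the reference_outputs and sources lists are made up for illustration):

import langcheck

generated_outputs = ['The big black cat is sitting on the fence']
reference_outputs = ['A black cat sits on the fence']  # hypothetical reference texts
sources = ['The report notes a black cat sitting on the fence.']  # hypothetical source documents

# Reference-based: n-gram overlap with the reference text
langcheck.metrics.rouge2(generated_outputs, reference_outputs)

# Source-based: is the output factually consistent with its source?
langcheck.metrics.factual_consistency(generated_outputs, sources)

# Text structure: does each output parse as a JSON object?
langcheck.metrics.is_json_object(generated_outputs)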

Visualize Metrics

LangCheck comes with built-in, interactive visualizations of metrics.

# Choose some metrics
fluency_values = langcheck.metrics.fluency(generated_outputs)
sentiment_values = langcheck.metrics.sentiment(generated_outputs)

# Interactive scatter plot of one metric
fluency_values.scatter()

Scatter plot for one metric

# Interactive scatter plot of two metrics
langcheck.plot.scatter(fluency_values, sentiment_values)

Scatter plot for two metrics

# Interactive histogram of a single metric
fluency_values.histogram()

Histogram for one metric

Augment Data

Text augmentations can automatically generate reworded prompts, typos, gender changes, and more to evaluate model robustness.

For example, to measure how the model responds to different genders:

male_prompts = langcheck.augment.gender(prompts, to_gender='male')
female_prompts = langcheck.augment.gender(prompts, to_gender='female')

male_generated_outputs = [my_llm_app(prompt) for prompt in male_prompts]
female_generated_outputs = [my_llm_app(prompt) for prompt in female_prompts]

langcheck.metrics.sentiment(male_generated_outputs)
langcheck.metrics.sentiment(female_generated_outputs)
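
To compare the two groups, one option is to look at the average scores (a sketch; it assumes the returned MetricValue exposes its per-output scores as a metric_values list):

from statistics import mean

male_sentiment = langcheck.metrics.sentiment(male_generated_outputs)
female_sentiment = langcheck.metrics.sentiment(female_generated_outputs)

# metric_values is assumed to be the list of per-output scores
gap = abs(mean(male_sentiment.metric_values) - mean(female_sentiment.metric_values))
assert gap < 0.1  # flag a large sentiment gap as potential gender bias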

Unit Testing

You can write test cases for your LLM application using LangCheck metrics.

For example, if you only have a list of prompts to test against:

import json

import langcheck
from langcheck.utils import load_json

# Run the LLM application once to generate text
prompts = load_json('test_prompts.json')
generated_outputs = [my_llm_app(prompt) for prompt in prompts]

# Unit tests
def test_toxicity():
    assert langcheck.metrics.toxicity(generated_outputs) < 0.1

def test_fluency():
    assert langcheck.metrics.fluency(generated_outputs) > 0.9

def test_json_structure():
    assert langcheck.metrics.validation_fn(
        generated_outputs, lambda x: 'myKey' in json.loads(x)).all()

Monitoring

You can monitor the quality of your LLM outputs in production with LangCheck metrics.

Just save the outputs and pass them into LangCheck.

from langcheck.utils import load_json

production_outputs = load_json('llm_logs_2023_10_02.json')['outputs']

# Evaluate and display toxic outputs in production logs
langcheck.metrics.toxicity(production_outputs) > 0.75

# Or if your app outputs structured text
langcheck.metrics.is_json_array(production_outputs)
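
To turn this into an alert, here is a minimal sketch (it assumes the thresholded result is truthy only when every output passes, mirroring the assert-based unit tests above; send_alert is your own notification hook, not part of LangCheck):

# Fail loudly if any logged output crosses the toxicity threshold
toxicity_check = langcheck.metrics.toxicity(production_outputs) < 0.25

if not toxicity_check:
    # At least one production output was flagged as toxic
    send_alert('Toxic outputs detected in llm_logs_2023_10_02.json')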

Guardrails

You can provide guardrails on LLM outputs with LangCheck metrics.

Just filter candidate outputs through LangCheck.

# Get a candidate output from the LLM app
raw_output = my_llm_app(random_user_prompt)

# Filter the output before it reaches the user
while langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
    raw_output = my_llm_app(random_user_prompt)
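
The loop above retries indefinitely; here is a sketch with a retry cap so a stubborn prompt can't stall the app (max_retries and the fallback message are illustrative choices, not part of LangCheck):

max_retries = 3

for _ in range(max_retries):
    raw_output = my_llm_app(random_user_prompt)
    if not langcheck.metrics.contains_any_strings(raw_output, blacklist_words).any():
        break  # the output is clean, ship it
else:
    # Every attempt contained a blacklisted word, so fall back to a canned reply
    raw_output = "Sorry, I can't help with that request."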

langcheck's People

Contributors

alnusjaponica, bioerrorlog, ischender, kennysong, liwii, shibuiwilliam, thanhchinhbk, vela-zz, yosukehigashi


langcheck's Issues

Installing tensorflow breaks the transformers library in `test_factual_consistency()`

Ran into this pretty surprising bug while running tests in an environment that had TF installed. To reproduce:

# Create a fresh venv
python -m venv .venv
source .venv/bin/activate

# Install langcheck
python -m pip install --upgrade pip
python -m pip install -e .[dev]

# Tests pass
python -m pytest -s -vv -m "not optional"

# Install TensorFlow
pip install tensorflow

# Tests fail
python -m pytest -s -vv -m "not optional"

Relevant versions:

  • Python 3.9.2
  • TensorFlow 2.14.0
  • Keras 2.14.0
  • Transformers 4.22.1

See error message below:

========================================================================================================== FAILURES ===========================================================================================================
____________________________________________________________________________________ test_factual_consistency[generated_outputs0-sources0] ____________________________________________________________________________________

self = <module 'transformers.models.marian' from '/home/kennysong/langcheck/.venv/lib/python3.9/site-packages/transformers/models/marian/__init__.py'>, module_name = 'modeling_tf_marian'

    def _get_module(self, module_name: str):
        try:
>           return importlib.import_module("." + module_name, self.__name__)

.venv/lib/python3.9/site-packages/transformers/utils/import_utils.py:1031: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
.venv/lib/python3.9/site-packages/transformers/models/marian/modeling_tf_marian.py:33: in <module>
    from ...modeling_tf_utils import (
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """TF general model utils."""
    
    import functools
    import gc
    import inspect
    import json
    import os
    import pickle
    import re
    import warnings
    from collections.abc import Mapping
    from pathlib import Path
    from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Union
    
    import h5py
    import numpy as np
    import tensorflow as tf
    from tensorflow.python.keras import backend as K
    from tensorflow.python.keras.engine import data_adapter
    from tensorflow.python.keras.engine.keras_tensor import KerasTensor
    from tensorflow.python.keras.saving import hdf5_format
    
    from huggingface_hub import Repository, list_repo_files
>   from keras.saving.hdf5_format import save_attributes_to_hdf5_group
E   ModuleNotFoundError: No module named 'keras.saving.hdf5_format'

.venv/lib/python3.9/site-packages/transformers/modeling_tf_utils.py:39: ModuleNotFoundError

The above exception was the direct cause of the following exception:

generated_outputs = ['東京は日本の首都です。', '地球は平面です。'], sources = ['東京は日本の首都です。', '地球は球体です。']

    @pytest.mark.parametrize(
        'generated_outputs,sources',
        [(['東京は日本の首都です。', '地球は平面です。'], ['東京は日本の首都です。', '地球は球体です。'])])
    def test_factual_consistency(generated_outputs, sources):
>       eval_value = factual_consistency(generated_outputs, sources)

tests/eval/ja/test_source_based_text_quality.py:16: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/langcheck/eval/ja/source_based_text_quality.py:89: in factual_consistency
    _factual_consistency_translation_pipeline = pipeline(
.venv/lib/python3.9/site-packages/transformers/pipelines/__init__.py:702: in pipeline
    framework, model = infer_framework_load_model(
.venv/lib/python3.9/site-packages/transformers/pipelines/base.py:233: in infer_framework_load_model
    _class = getattr(transformers_module, f"TF{architecture}", None)
.venv/lib/python3.9/site-packages/transformers/utils/import_utils.py:1022: in __getattr__
    value = getattr(module, name)
.venv/lib/python3.9/site-packages/transformers/utils/import_utils.py:1021: in __getattr__
    module = self._get_module(self._class_to_module[name])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <module 'transformers.models.marian' from '/home/kennysong/langcheck/.venv/lib/python3.9/site-packages/transformers/models/marian/__init__.py'>, module_name = 'modeling_tf_marian'

    def _get_module(self, module_name: str):
        try:
            return importlib.import_module("." + module_name, self.__name__)
        except Exception as e:
>           raise RuntimeError(
                f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
                f" traceback):\n{e}"
            ) from e
E           RuntimeError: Failed to import transformers.models.marian.modeling_tf_marian because of the following error (look up to see its traceback):
E           No module named 'keras.saving.hdf5_format'

.venv/lib/python3.9/site-packages/transformers/utils/import_utils.py:1033: RuntimeError

Add "refusal to answer" metric

Quite similar to the ai_disclaimer_similarity metric, but for identifying LLM outputs like "I don't have enough information" or "I don't know".
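
Until such a metric exists, one rough workaround is to reuse semantic_similarity against a canonical refusal phrase (a sketch; the phrase and the 0.8 threshold are illustrative choices):

import langcheck

refusal_phrase = "I don't have enough information to answer that question."
refusal_references = [refusal_phrase] * len(generated_outputs)

# Outputs that are highly similar to the refusal phrase are likely refusals
refusal_similarity = langcheck.metrics.semantic_similarity(
    generated_outputs, refusal_references)
refusal_similarity > 0.8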

Visualize thresholds when plotting MetricValueWithThresholds

When plotting a MetricValueWithThreshold value like (toxicity_values > 0.5).scatter(), we should draw a vertical line in the chart where the threshold is.

We could do this for the double scatter plot (langcheck.plot.scatter(toxicity_values > 0.5, sentiment_values > 0.5)) and the histogram as well ((toxicity_values > 0.5).histogram()).
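
As a standalone illustration of the idea (plain matplotlib, not LangCheck's actual plotting code; the scores and the 0.5 threshold are made up):

import matplotlib.pyplot as plt

scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]  # made-up metric values
threshold = 0.5

plt.hist(scores, bins=10)
plt.axvline(threshold, color='red', linestyle='--', label=f'threshold = {threshold}')
plt.legend()
plt.show()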

Installation problem on Python 3.10 on Apple Silicon Mac

Copying a message I received outside of GitHub. It looks like we haven't fully solved the Rust compiler issue.


On Python 3.10.11:

> python --version
Python 3.10.11

I already have tokenizers installed:

> pip freeze | grep token
tokenizers==0.15.0

and yet, when I run

> pip install langcheck

it still tries to build tokenizers from source.
I guess it comes from:

Collecting tokenizers!=0.11.3,<0.13,>=0.11.1 (from transformers>=4.6->langcheck)

since I have transformers==4.35.2 in that environment.

So I end up with

error: `cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib -- -C 'link-args=-undefined dynamic_lookup -Wl,-install_name,@rpath/tokenizers.cpython-310-darwin.so'` failed with code 101
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

The confusing part here is that if I run

> pip install -U transformers

I still get

Requirement already satisfied: transformers in /opt/homebrew/Caskroom/miniforge/base/envs/truesilicon/lib/python3.10/site-packages (4.35.2)

which is actually the current version: https://github.com/huggingface/transformers/tags

Where is 'transformers >= 4.6' in the dependencies coming from?

Consider computing German fluency without translating to English

As per the comment thread in #69:

At the moment, we translate to English before computing fluency. German and English are similar enough that, in the tests I ran, this works better than using the not-so-great models available specifically for German (e.g., something not very grammatical would get a 0.3... fluency score if translated, and a 0.9 using the German model).

It could be useful to find or train models that calculate fluency for German without translating, since translation inevitably loses some of the grammatical signal.

Follow up `use_async` option

#104 introduced asynchronous API calls for the OpenAI-based evaluators, and there are some follow-up tasks:

  • Skip the second API call (score calculation) when the unstructured assessment fails.
  • Add tests for the use_async option.
  • Show a progress bar when using async calls (see the sketch below).
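
For the progress-bar item, one common pattern is to wrap asyncio.as_completed with tqdm (a generic sketch, not LangCheck's actual OpenAI client code; evaluate_one is a placeholder coroutine):

import asyncio

from tqdm import tqdm

async def evaluate_one(output: str) -> float:
    ...  # placeholder for a single async API call

async def evaluate_all(outputs: list[str]) -> list[float]:
    tasks = [asyncio.create_task(evaluate_one(o)) for o in outputs]
    scores = []
    # The bar advances as each API call finishes, not in submission order
    for task in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        scores.append(await task)
    return scores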

Create langcheck.utils.detect_language()

Hello,
I have a question about the following test code.

  • Question: Is it possible to treat different languages uniformly?
    1. Automatically detect languages (e.g., EN & JA).
    2. Unify the toxicity threshold value across different languages (e.g., 0.2 for both).

import langcheck

generated_outputs = [
    '適度な**は健康に良いとされています。',
    '適度な**は健康に悪いとされています。',
    '過度の**は健康に良いとされています。',
    '過度の**は健康に悪いとされています。',
    'Moderate exercise is good for your health.',
    'Moderate exercise is bad for your health.',
    'Excessive exercise is good for your health.',
    'Excessive exercise is bad for your health.',
]

# Toxicity
display(langcheck.metrics.ja.toxicity(generated_outputs) < 0.2)
display(langcheck.metrics.en.toxicity(generated_outputs) < 0.2)

Thank you in advance.

LangCheck_Toxicity screenshot
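
Until a built-in langcheck.utils.detect_language() exists, one workaround is to split the outputs with a third-party detector (a sketch using the langdetect package; the routing and the 0.2 threshold mirror the question above):

import langcheck
from langdetect import detect

ja_outputs = [o for o in generated_outputs if detect(o) == 'ja']
en_outputs = [o for o in generated_outputs if detect(o) == 'en']

# Apply the same threshold to each language-specific metric
display(langcheck.metrics.ja.toxicity(ja_outputs) < 0.2)
display(langcheck.metrics.en.toxicity(en_outputs) < 0.2)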

Introduce 係り受け解析 (Dependency Analysis) - based Japanese reading ease scores

In the survey of Japanese reading ease scores, there are more sophisticated metrics that require 係り受け解析 (dependency analysis). The paper recommends using multiple scores to measure reading ease along multiple dimensions, so we should implement these metrics, too.

These are the libraries we could use:
https://github.com/megagonlabs/ginza
https://github.com/ku-nlp/knp/
https://taku910.github.io/cabocha/

These libraries could also be used for input augmentation of Japanese texts.

Pin versions for HuggingFace models

As mentioned in #67 (comment), it would be ideal to pin the versions of models downloaded from HF so the scores don't unexpectedly change.

We can explicitly increment model versions in LangCheck releases.
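
For reference, the Hugging Face from_pretrained / pipeline APIs accept a revision argument that pins a model to a specific commit (a sketch; the model name and commit hash are placeholders):

from transformers import pipeline

# Pinning to a commit hash keeps the downloaded weights stable across releases
classifier = pipeline(
    'text-classification',
    model='some-org/some-toxicity-model',  # placeholder model name
    revision='abc123def4567890',           # placeholder commit hash
)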

Document maximum allowed input lengths for each metric

Because many of our metrics rely on local models with a limited context length, they can fail when an input string is too long. For example, here is how a subset of our Japanese metrics (local versions) behave on a very long input string.

langcheck.metrics.ja.toxicity(long_str)  # Fails
langcheck.metrics.ja.fluency(long_str)  # Fails
langcheck.metrics.ja.sentiment(long_str)  # Fails
langcheck.metrics.ja.tateishi_ono_yamada_reading_ease(long_str)  # Succeeds
langcheck.metrics.ja.semantic_similarity(long_str, long_str)  # Succeeds
langcheck.metrics.ja.rougeL(long_str, long_str)  # Succeeds

We should document this, and gracefully handle cases where the input string is too long.
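
Until that's handled gracefully, a simple workaround is to truncate inputs before scoring (a sketch; the 512-character cap is arbitrary, and character counts only approximate token-based model limits):

MAX_CHARS = 512  # arbitrary cap; real limits are token-based and model-specific

truncated = long_str[:MAX_CHARS]
langcheck.metrics.ja.toxicity(truncated)  # should now fit within typical context lengths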
