
๐Ÿ˜ Testing Language Models for Memorization of Tabular Datasets

PyPI - Version Python License tests Documentation


Tabmemcheck is an open-source Python library that tests language models for the memorization of tabular datasets.

Features:

  • Test GPT-3.5, GPT-4, and other LLMs for memorization of tabular datasets.
  • Supports both chat models and (base) language models. In chat mode, the prompts are tailored to GPT-3.5 and GPT-4; for other LLMs, we recommend testing the base model.
  • Based entirely on prompts (no access to the probability distribution over tokens ('logprobs') is required).
  • The submodule tabmemcheck.datasets allows loading tabular datasets in different versions (original, perturbed, task, statistical).

The different tests are described in a NeurIPS'23 workshop paper. We also used this package for our COLM'24 paper "Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models".

Installation

pip install tabmemcheck

Then use import tabmemcheck to import the Python package.

Overview

The package provides four different tests for verbatim memorization of a tabular dataset (header test, row completion test, feature completion test, first token test).

It also provides additional heuristics to assess what an LLM knows about a tabular dataset (for example, does the LLM know the names of the features in the dataset?).

The header test asks the LLM to complete the initial rows of a CSV file.

header_prompt, header_completion, response = tabmemcheck.header_test('uci-wine.csv', 'gpt-3.5-turbo-0613', completion_length=350)

Header Test

Here, we see that gpt-3.5-turbo-0613 can complete the initial rows of the UCI Wine dataset. The function output visualizes the Levenshtein string distance between the actual dataset and the model completion.
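The Levenshtein distance used for this comparison can be sketched in a few lines (a minimal pure-Python version for illustration; the package's own visualization logic may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A completion that reproduces the dataset row verbatim has distance 0.
print(levenshtein("14.23,1.71,2.43", "14.23,1.71,2.43"))  # 0
print(levenshtein("kitten", "sitting"))  # 3
```

A small distance between the model completion and the true rows is the signal of verbatim memorization that the tests look for.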

The row completion test asks the LLM to complete random rows of a CSV file.

rows, responses = tabmemcheck.row_completion_test('iris.csv', 'gpt-4-0125-preview', num_queries=25)

Row Completion Test

Here, we see that gpt-4-0125-preview can complete random rows of the Iris dataset. The function output again visualizes the Levenshtein string distance between the actual dataset rows and the model completions.

The feature completion test asks the LLM to complete the values of a specific feature in the dataset.

feature_values, responses = tabmemcheck.feature_completion_test('titanic-train.csv', 'gpt-3.5-turbo-0125', feature_name='Name', num_queries=25)

Feature Completion Test

Here, we see that gpt-3.5-turbo-0125 can complete the names of the passengers in the Kaggle Titanic dataset. The function output again visualizes the Levenshtein string distance between the feature values in the dataset and the model completions.

The first token test asks the LLM to complete the first token in the next row of a CSV file.

tabmemcheck.first_token_test('adult-train.csv', 'gpt-3.5-turbo-0125', num_queries=100)
First Token Test: 37/100 exact matches.
First Token Test Baseline (Matches of most common first token): 50/100.

Here, the test provides no evidence of memorization of the Adult Income dataset in gpt-3.5-turbo-0125.
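The baseline above (always guessing the most common first token) can be reproduced in a few lines; first_token_baseline and the token lists are hypothetical names for illustration, not part of the package API:

```python
from collections import Counter

def first_token_baseline(true_tokens, predicted_tokens):
    """Count the model's exact matches and compare against always
    guessing the most common first token in the dataset."""
    model_matches = sum(t == p for t, p in zip(true_tokens, predicted_tokens))
    most_common, _ = Counter(true_tokens).most_common(1)[0]
    baseline_matches = sum(t == most_common for t in true_tokens)
    return model_matches, baseline_matches

true = ["25", "38", "25", "44", "25"]
pred = ["25", "38", "30", "44", "18"]
print(first_token_baseline(true, pred))  # (3, 3)
```

Only when the model clearly beats this baseline does the test count as evidence of memorization.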

A key feature of this package is that the implemented prompts allow the completion tests to be run not only with (base) language models but also with chat models (specifically, GPT-3.5 and GPT-4).

There is also a simple way to run all the different tests and generate a small report.

tabmemcheck.run_all_tests("adult-test.csv", "gpt-4-0613")

Documentation

The API documentation of the package is available here.

Testing your own LLM

To test your own LLM, simply implement tabmemcheck.LLM_Interface.

@dataclass
class LLM_Interface:
    """Generic interface to a language model."""

    # If True, the tests use chat_completion; otherwise they use completion.
    chat_mode: bool = False

    def completion(self, prompt: str, temperature: float, max_tokens: int) -> str:
        """Send a query to a language model.

        :param prompt: The prompt (string) to send to the model.
        :param temperature: The sampling temperature.
        :param max_tokens: The maximum number of tokens to generate.

        Returns:
            str: The model response.
        """
        raise NotImplementedError

    def chat_completion(self, messages, temperature: float, max_tokens: int) -> str:
        """Send a query to a chat model.

        :param messages: The messages to send to the model. We use the OpenAI format.
        :param temperature: The sampling temperature.
        :param max_tokens: The maximum number of tokens to generate.

        Returns:
            str: The model response.
        """
        raise NotImplementedError
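For example, a toy chat-mode implementation might look like the sketch below. EchoLLM and its echo behavior are placeholders for a call to your model's inference API; the LLM_Interface stub is inlined only so the sketch is self-contained (in practice, subclass tabmemcheck.LLM_Interface directly):

```python
from dataclasses import dataclass

# Stand-in for tabmemcheck.LLM_Interface, inlined for a self-contained sketch.
@dataclass
class LLM_Interface:
    chat_mode: bool = False

@dataclass
class EchoLLM(LLM_Interface):
    """Toy chat model that simply echoes the last user message."""
    chat_mode: bool = True

    def chat_completion(self, messages, temperature: float, max_tokens: int) -> str:
        # messages follow the OpenAI format: [{"role": ..., "content": ...}, ...]
        # Replace this line with a call to your model's inference API.
        return messages[-1]["content"]

llm = EchoLLM()
print(llm.chat_completion([{"role": "user", "content": "5.1,3.5,"}], 0.0, 50))  # 5.1,3.5,
```

An instance of your class can then be passed to the test functions in place of a model name string.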

Limitations

The tests provided in this package do not guarantee that the LLM has not seen or memorized the data. In particular, an LLM might have memorized data that cannot be extracted from it via prompting.

Citation

If you find this code useful in your research, please consider citing our research papers.

@inproceedings{bordt2024colm,
  title={Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models},
  author={Bordt, Sebastian and Nori, Harsha and Rodrigues, Vanessa and Nushi, Besmira and Caruana, Rich},
  booktitle={Conference on Language Modeling (COLM)},
  year={2024}
}

@inproceedings{bordt2023testing,
  title={Elephants Never Forget: Testing Language Models for Memorization of Tabular Data},
  author={Bordt, Sebastian and Nori, Harsha and Caruana, Rich},
  booktitle={NeurIPS 2023 Second Table Representation Learning Workshop},
  year={2023}
}

References

Chang et al., "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4", EMNLP 2023

Carlini et al., "Extracting Training Data from Large Language Models", USENIX Security Symposium 2021

Carlini et al., "The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks", USENIX Security Symposium 2019
