
Comments (9)

MentalGear commented on August 30, 2024

I have used "assertions" above since it's the term many testing frameworks use. However, if you concur with #10, "evaluations" is the one that makes more sense.


typpo commented on August 30, 2024

Hi @MentalGear, really appreciate the feedback. Your thoughts/first impressions have been valuable. Here are my thoughts/some additional context:

This project began as a way to manually compare outputs side-by-side. That's why I tend to call it an "evaluation" or "eval" instead of just a test suite, because it's possible to run one without test cases. In my experience, a manual eval can still be useful for at-a-glance quality checks. I usually like to start with manual side-by-side comparison, and then fill in test cases.

Prompt testing is really important, but I don't want to reinvent the wheel when many pure test frameworks already exist. For this reason, yesterday I added Jest and Mocha integration examples and will spin out these plugins into their own standalone libraries when I get the chance.

So it would be good to think about the value-add we can provide over a traditional test suite. I think providing generalized assertion functions that the user can import to their existing test workflow is a nice way to enforce a baseline of prompt quality in CI.
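
For illustration, a rough sketch of what that could look like inside an existing Jest suite. The gradeSimilarity helper and callModel wrapper below are hypothetical stand-ins (not confirmed promptfoo exports), so treat this as a shape rather than a working integration:

// prompt-quality.test.ts (sketch only; imported helper names are hypothetical)
import { gradeSimilarity } from 'prompt-assertions'; // imagined standalone assertion library
import { callModel } from './llm';                   // the user's own LLM provider wrapper

test('translation prompt stays close to the reference answer', async () => {
  const output = await callModel('Translate from English to French: Hello World');

  // gradeSimilarity is assumed to return a score in [0, 1]; enforce a baseline in CI
  const score = await gradeSimilarity(output, 'Bonjour le monde');
  expect(score).toBeGreaterThan(0.8);
});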

On the other hand, the "matrix" view is great for more in-depth comparison and investigation of prompt possibilities. Non-specialized test frameworks are not going to implement this anytime soon. For this reason I think that our toolkit could focus on these areas:

  • Prompt tuning & evaluation (by evaluation I mean the more experimental/exploratory side of "prompt engineering")
  • Prompt version control & management
  • Automatic testing, selection, deployment of prompts

All this aside, your suggestion of a unified format is a good idea (much cleaner) and I've already started working on it 😄 I also agree "vars" might be better labeled as something else, "refs" or maybe even just "tests"? Even if there are no __expected values, each row in vars can be considered a test case that needs to be manually reviewed.
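
To make the "each row is a test case" framing concrete, here's a minimal sketch of such a vars file, assuming the CSV layout with an optional __expected column (the second row has no expected value, so it would only be reviewed manually in the matrix view):

phraseToTranslate,__expected
Hello World,Bonjour le monde
Good morning,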

What do you think of all this?


MentalGear commented on August 30, 2024

Thanks for the response, I'm glad it provides value! 🙂

Unified format
In the new format it'd still be useful if one could point to a .csv file or a directory for prompts and refs, for example when ingesting large data sources.
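
Purely as an illustration of what I mean (field names and paths are made up):

// Sketch only: a unified-format entry that references files instead of inlining the data
const translationSuite = {
  name: 'Translation Prompt',
  prompts: './prompts/',            // a directory of prompt templates
  vars: './data/translations.csv',  // a large var/ref data source ingested from a CSV
};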

New Var Designation
For me "tests" as the overall term doesn't quite feel right since the new name should cover 2 different concepts:

  • vars that change the overall prompt
  • __expected which tests the prompt outcome - and indeed could be called test

So a more general term would be appropriate. ATM I can't think of something more fitting than references/refs, maybe you can?

Scope
Yes, I agree that we shouldn't duplicate what other testing frameworks have already built, but rather integrate with or build upon them.

Roadmap
The roadmap outline looks very intriguing indeed! I agree it would add quite a bit of value.
Regarding the last 2 points, would they require a hosted infra approach?

The question I have then re: Scope and Roadmap:
How would promptfoo better serve its mission/roadmap?
As a plugin for testing frameworks, or by using a testing framework internally for the classic assertion part?


MentalGear commented on August 30, 2024

Having slept on it, I can see your perspective for wanting to call it a "test(case)".

Naming Approach

Discovering the proper nomenclature that allows us to define a clean and intuitive structure requires, I think, some decomposition of what's generally referred to as simply a "prompt".

Markup
Let's ask ourselves first:
What's the markup of a dynamically composable prompt?
There's a corpus (body or template) and interchangeable parts (vars).

Ergo:
Prompt = Body + Vars

What can be tested?

  • Body
  • Body with Var Interpolation

What do we test against?

  • expectation
    • exact match
    • within parameters
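
To make the decomposition concrete, here is a small sketch (all values and helper logic are illustrative):

// Sketch: Prompt = Body + Vars
const body = 'Translate from English to French: {phraseToTranslate}'; // corpus / template
const vars = { phraseToTranslate: 'Hello World' };                    // interchangeable parts

// Body with var interpolation: the string that actually reaches the model
const prompt = body.replace('{phraseToTranslate}', vars.phraseToTranslate);
// => "Translate from English to French: Hello World"

// Two kinds of expectation, checked against a (pretend) model output:
const output = 'Bonjour le monde';
const exactMatch = output === 'Bonjour le monde'; // expectation: exact match
const similarityScore = 0.92;                     // e.g. a cosine similarity from an embedding model
const withinParameters = similarityScore >= 0.8;  // expectation: within parameters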

Usage
In what context will it be used now and later on? If we want to provide automated selection & deployment later on, IMO it'd be weird to deploy something called a test to production.

Fitting Term

Considering all this, I think a more all-encompassing term would be something akin to (prompt) configuration or config.

Unified Format (update)

In light of this, I have an updated format proposal:

Signature

interface PromptConfig {
    name: string
    config: {
        body: string[] // can also be path to directory or file
        vars?: Variable[] // can also be path to directory or file
        testsOverall?: Test[]
        transformations?: string[] // ex: concat to help with reducing tokens, or stringify for expected format
    }
}

interface Variable {
    [key: string]: any
    tests?: Test[]
}

interface Test {
    method: "exact" | "fn" | "grade" | "similar"
    expect?: string
    value?: string
    fn?: string
    threshold?: string
}

Example Use

const promptConfigs: PromptConfig[] = [
    {
        name: "Translation Prompt",

        config: {
            body: [
                "Translate from English to French: {phraseToTranslate}",
                "{example} \nAs an expert translator: Translate from English to French: {phraseToTranslate}",
                "English to French: {example}\n English to French: {phraseToTranslate}",
            ],

            testsOverall: [
                {
                    method: "grade",
                    expect: "Do not contain spelling errors",
                },

                {
                    method: "grade",
                    expect: "Do not mention that translation was done by a language model",
                },

                // {
                //     method: "fn",
                //     fn: "JSON.parse(output)?.result !== undefined",
                //     value: "true",
                // },
            ],

            vars: [
                {
                    phraseToTranslate: "Hello World",
                    tests: [
                        {
                            method: "similar",
                            threshold: "0.8",
                            expect: "Bonjour le monde!",
                        },
                    ],
                },

                {
                    phraseToTranslate: "Hello World",
                    example: "Example: Input: Hello, \nOutput: Bonjour",

                    tests: [
                        {
                            method: "fn",
                            fn: "output.toLowerCase().includes(${value})",
                            value: "bonjour",
                        },

                        {
                            method: "fn",
                            fn: "output.toLowerCase().includes(${value})",
                            value: "monde",
                        },
                    ],
                },
            ],

            transformations: [],
        },
    },
]

Looking forward to your thoughts ! 🙂


typpo commented on August 30, 2024

FYI: I edited this post after its original submission

Based on your original suggestion, this is what I have on a local branch right now. After playing around with it a bit, I think your original point is correct that calling each prompt/provider/vars combo a "Test Case" and the whole thing a "Test Suite" might make the most sense as it's familiar terminology that everyone understands.

I also like your idea of supporting global assertions (for example, one could imagine a global assertion that fails any case that begins with "As an AI language model...")

export interface Assertion {
  type: 'equality' | 'function' | 'similarity' | 'llm-rubric';
  value?: string;
  threshold?: number;
  provider?: ApiProvider; // Some assertions require an LLM provider
}

// Each test case is graded pass/fail.  A test case represents a unique input to the LLM after substituting `vars` in the prompt.
export interface TestCase {
  name?: string;
  vars?: Record<string, string>;
  assert?: Assertion[];
  prompt?: PromptConfig;
  grading?: GradingConfig;
}

// The test suite defines the "knobs" that we are tuning in prompt engineering: providers and prompts
export interface TestSuite {
  providers: ApiProvider[];
  prompts: string[];
  tests: TestCase[];
  defaultProperties?: Omit<TestCase, 'name'>;
}

Here's an example config object:

{
  "providers": ["openai:chat"],
  "prompts": "./prompts.txt",
  "defaultProperties": {
    "assert": [{ "type": "fn", "value": "output.length > 0" }]
  },
  "tests": [
    {
      "name": "Test case 1",
      "vars": { "body": "Hello world" },
      "assert": [
        { "type": "fn", "value": "output.includes('foo')" },
        { "type": "similarity", "value": "foobar", "threshold": 0.9 },
        {
          "type": "grade",
          "rubric": "foo bar",
          "provider": "openai:chat:gpt-4"
        }
      ]
    },
    {
      "name": "Test case 2",
      "providers": ["openai:chat:gpt-4"],
      "prompts": ["./prompt1.txt", "./prompt2.txt"],
      "vars": { "var1": "value 1", "var2": "value 2", "var3": "value 3" },
      "assert": [
        { "type": "fn", "value": "output.includes('foo')" },
        { "type": "similarity", "value": "foobar", "threshold": 0.9 }
      ]
    },
    { "name": "Test case 3", "prompts": ["./prompt1.txt", "./prompt2.txt"] },
    {
      "name": "Test case 4",
      "prompt": "Four {{var1}} and {{var2}} years ago, our {{var3}} brought forth...",
      "vars": { "var1": "value 1", "var2": "value 2", "var3": "value 3" },
      "assert": [
        { "type": "fn", "value": "output.includes('foo')" },
        { "type": "similarity", "value": "foobar", "threshold": 0.9 }
      ]
    }
  ]
}
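
To illustrate the global assertion idea mentioned above, the defaultProperties block could carry a rule like the following, reusing the fn style from the example (just a sketch):

"defaultProperties": {
  "assert": [
    { "type": "fn", "value": "!output.startsWith('As an AI language model')" }
  ]
}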


MentalGear commented on August 30, 2024

Thanks for your work on it, Ian! I'll provide you detailed feedback by tmrw. :)


MentalGear commented on August 30, 2024

Nice work on the restructured interface - the config is a good deal more compact now!

Here's my feedback. Fair warning ahead: it's quite a bit. I hope you don't see it as criticism but rather as high regard for the project's potential and an investment in building a solid core:

  • In your example, prompts can be defined within TestSuite, and then again per each TestCase. I'm not sure what that means. Are the prompts in TestSuite globals for each TestCase?

    • also "prompts" is used in some places, while "prompt" is used in others. Would you like to support both? Otherwise prompts:string[] seems the most consistent.
    • TestCase has a prompt?: PromptConfig; prop which I don't see a definition for. In the example, prompt(s) is mostly of string or string[] type.
  • there's no name prop in TestSuite. I'd find it crucial to have one to indicate what this suite/set of candidates tests for. For example: "Find the best Translation Prompt", "Find the optimal chat bot response template".

  • the name prop in TestCase I find less important, and it might rather be called desc?. Normally, you could see from the prompt/vars what is tested for, but if it's a complex prompt, an additional desc field might help. Example: desc: "This configuration checks if added outlier examples significantly improve the result".

  • defaultProperties:

    • If there's only assert within this node, we might just add assert on the top level. Or are there more values to come?
    • I find defaultProperties a sub-optimal name as a default is something that's normally overwritten, not extended on. IMO this could imply that the (assert) props set here are overwritten when the same prop is defined in an item. That's why I initially used the name testsOverall, which could also be assertsGeneral to make the intent unambiguously clear.
  • TestCase has grading?: GradingConfig;. GradingConfig is not defined, but since you mention "Each test case is graded pass/fail.", I assume it's pass/fail (boolean).

    • However: what about test outcomes that don't have pass/fail values, like "semantic similarity", if we look towards adding something like auto-selection of prompts? An overall test score might be useful: it could be calculated by summing the outcomes of non-binary asserts ("cos sim") together with the pass/fail binaries (one point each) to give a better indicator of performance (see the sketch below).
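
A sketch of that scoring idea, with purely illustrative field names:

// Sketch: sum binary pass/fail asserts (1 or 0) with non-binary ones (e.g. cosine similarity)
interface AssertOutcome {
  passed?: boolean;     // set for binary asserts (exact match, fn, grade)
  similarity?: number;  // set for non-binary asserts, e.g. cos sim in [0, 1]
}

function overallScore(outcomes: AssertOutcome[]): number {
  return outcomes.reduce((sum, outcome) => {
    if (outcome.similarity !== undefined) return sum + outcome.similarity; // non-binary adds its value
    return sum + (outcome.passed ? 1 : 0);                                 // binary adds one point or zero
  }, 0);
}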

PromptConfig vs TestCase

Though my original preference was "test" indeed, my view has evolved since then.

While test is the most commonly used term in test frameworks, from what I gathered from your roadmap outline, promptfoo aspires to be more than simply a testing framework by striving for prompt auto-selection and deployment (something not done with normal testing libs).

When considering these future use cases, "tests" doesn't capture what promptfoo does.

PromptConfig feels like a much more future-proof, intuitive and all-encompassing term in the context of the future features you indicated.

Therefore, here's my adjusted proposal:

export interface PromptConfigSuite { // was TestSuite
  name: string,
  providers: ApiProvider[];
  configs: PromptConfig[];
  asserts?: Assertion[];
}

export interface PromptConfig {  // was TestCase
  desc?: string;
  prompt: PromptConfig;
  vars?: Record<string, string>;
  asserts?: Assertion[];
  score: number // overall score from all asserts 
}

export interface Assertion { // or simply Test
  type: 'equality' | 'function' | 'similarity' | 'llm-rubric';
  value?: string;
  threshold?: number;
  provider?: ApiProvider; // Some assertions require an LLM provider,
  score: number // pass = 1, fail = 0, otherwise cos sim to result
}


typpo commented on August 30, 2024

Thanks for these thoughts.

  • To address the confusion on prompt vs prompts, LLM prompts are defined for each TestSuite, whereas prompt in TestCase is just misc prompt modifiers (like prefix and suffix). The full definitions in the PR are:

    export interface GradingConfig {
      prompt?: string;
      provider?: string | ApiProvider;
    }
    
    export interface PromptConfig {
      prefix?: string;
      suffix?: string;
    }

    I agree that prompt is a little confusing here, maybe promptModifiers would be clearer.
    I've combined these into a single TestCase.options property.

  • I've just added optional description props to both TestCase and TestSuite

  • I renamed this to defaultTest to make it clearer, and for lack of a better term. This is useful not only for carrying asserts over, but also for things like default vars and options configuration (see the sketch after this list).

  • So this is a bit separate from score vs pass/fail. GradingConfig is actually just for llm-rubric-type asserts for now. I can either name it more specifically, or maybe the llm-rubric config can live on the assertion itself (but this makes it harder to configure across the board). I've combined this into a simpler options property on test case.

  • Regarding naming, these are internal types for now. The one that is most problematic is TestSuite, because maybe at the top level promptfoo will do more than just test. I'm comfortable punting on that for now, and I think TestCase is a familiar-enough abstract way to just say "example input".
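
For instance, the defaultProperties block from the earlier example would become a defaultTest that can also seed shared vars and options (a sketch based on this discussion, not necessarily the final shipped shape):

"defaultTest": {
  "vars": { "language": "French" },
  "options": { "prefix": "You are a professional translator.\n" },
  "assert": [{ "type": "fn", "value": "output.length > 0" }]
}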

The good news is that this is all very nearly done, and the docs get a major facelift as well!


MentalGear commented on August 30, 2024

Thanks for your responses. A bit of a late reply since GitHub's new 2FA gave me some trouble.

As the ol' saying goes: "naming is one of the hardest things in programming" 😄.

Yeah, the new interface looks real good. Also, a nice bonus to see the addition of contains-json which should be really handy.

Regarding the score config, my thought was that (suggested) prompts could self-optimize when testing a grounded information approach (generating answers from a knowledge base).

