ijyliu / anlp23-project
An empirical study of the costs and practicalities of prompt engineering techniques on standard and novel benchmarks
Limit claims of generalizability - we are really only assessing GRE and code tasks, and only baseline methods rather than ones highly specialized to the task at hand.
Should I attempt to modify the methods for GRE? I am inclined to think not.
In picking methods, attention should probably be given to whether the methods claim to be generally useful or are intended for specific tasks.
Interesting paper and eval:
How can we optimally pick language model temperatures for a user's query?
We want to elicit situations where creativity versus precision is important.
Look for hard numbers or numerical reasoning?
Though it's not possible to set a temperature with GPT-4 (only possible for davinci, etc.), we can approximate this with prompting or custom instructions.
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
In fiction, the best characters make mistakes! Can NLP tell us something about this? What do we know about narrator reliability? How can we use NLP on fiction examples likely to bear on this?
Misconceptions to try: https://en.wikipedia.org/wiki/List_of_common_misconceptions
Identify the positive and negative aspects of a business based on review text.
Attempt to list the most relevant of these for a particular user, based on the text of their own prior reviews.
Method: GPT-3.5+
Data: Yelp, Google Maps, Other. Potentially consider application to any kind of reviews - apps, movies, products, companies on Glassdoor, etc.
Can we use these to create a model good at taking actions?
Can models anticipate tasks they would find hard, and generate examples for those cases?
Say, in the prompt: Only cite Wikipedia, not X
https://browse.arxiv.org/pdf/2210.00720.pdf
More reasoning steps and complexity improve performance on math word problems and non-math reasoning. Focuses on older (2022 and before) models and chain-of-thought prompting. Interestingly, the gains are larger on simple questions and larger models. Prompt complexity is indeed associated with output complexity. Some measures of output complexity are line breaks, periods, "step i", and semicolons. Described as "annotation efficient" as opposed to other few-shot methods - just provide complex examples.
https://browse.arxiv.org/pdf/2302.12822.pdf
Complexity of exemplars is considered critical to the quality of chain-of-thought prompting. However, this paper finds that complexity gains are larger on more complex (not simple) questions. For simple questions, simple examples can work better - disagreeing with the Fu paper, in which complex examples always do better.
https://browse.arxiv.org/pdf/2309.04269.pdf Finds the optimal level of entity density for human readers of summaries - the middle of the chain of density, balancing informativeness and readability. Consider adding entity density as a measure of complexity. Single task and domain (summarization).
https://aclanthology.org/2023.findings-acl.591.pdf Not much here except the idea of the presence of novel n-grams as a measure of complexity. Seems dicey to use for my purposes, though.
https://aclanthology.org/2023.acl-srw.1.pdf For the summarization task, prompting can improve readability and usability as measured by Flesch Reading Ease. Prompt: "Please give me a layman / simple / simplified and understandable / easy-to-comprehend / straightforward / general audience summary of X". Looks at named entity hallucination by prompt type and readability by prompt type, but does not directly look at the correlation of readability and accuracy.
https://browse.arxiv.org/pdf/2309.05454.pdf Story generation and simplification tasks; targets responses to Flesch-Kincaid reading levels via some prompting techniques.
https://browse.arxiv.org/pdf/2209.12356.pdf More summary evals. Says automatic and human judgements of quality often don't go together. Human judgements as the gold standard are very important.
Make the LLM generate and answer verification questions. Experiment with having it generate them with original answer, or after. Also experiment with time questions are to be answered.
Is it translating under the hood?
Does splitting tasks in foreign languages into translation then task then translation improve performance?
p.30 https://arxiv.org/pdf/2303.08774.pdf
Contamination is very serious for the GRE and less serious for the SAT. For these exams it's extremely hard to find questions not posted online.
We know that GSM8K's test set is NOT in the GPT-4 training data.
It would be nice to look at the most common original method paper evaluations. I think GSM8K is the most common but unclear about others used
Last letter is a big one
CommonSenseQA is also a good source, but might be more contaminated
Winogrande, StrategyQA also common
It would be nice to have a reading comprehension or summarization task as well
Check https://paperswithcode.com/datasets
Possibly best to finalize this after finalizing list of methods
Be sure to state in the paper how many methods used the chosen evals!
Look at failures of less capable models?
Are there parallels between model results? Do these earlier failures tell us something?
Does this help reduce length and complexity?
Seems like this can be effective, at least for summarization: https://aclanthology.org/2023.acl-srw.1.pdf
One implementation option is automated retrieval - find questions in a training set with embeddings closest to the test question, measured in Euclidean distance.
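A minimal sketch of that retrieval step, assuming precomputed embedding vectors (the toy arrays and the function name `retrieve_nearest` are illustrative stand-ins; real embeddings would come from an embedding model):

```python
import numpy as np

def retrieve_nearest(test_emb, train_embs, k=3):
    """Indices of the k training questions whose embeddings are
    closest to the test question in Euclidean distance."""
    dists = np.linalg.norm(train_embs - test_emb, axis=1)
    return np.argsort(dists)[:k]

# Toy stand-in embeddings, not real model output
train_embs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.9, 0.1]])
test_emb = np.array([1.0, 0.05])
print(retrieve_nearest(test_emb, train_embs, k=2))
```

The k retrieved training questions (with their solutions) would then be pasted into the prompt as few-shot exemplars.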
Emails often have a query, a response, and, importantly, a thank you or other grading of the quality of the response.
Note one interesting technique may be somehow feeding the model the query and the evaluation/thanks messages and having the model fill in the middle, the response.
Look into bidirectional encoders (BERT) as a way to learn from both sides of input/fill in the middle?
One feasible structure for automated prompting is having more capable LLMs manage less capable ones that do smaller sub-tasks.
Should the most capable LLMs supervise, or should the roles be reversed? Is there some better structure?
Potential benefits of 'management' - parallelization of tasks, getting around context window limitations for each subtask
How do we divide tasks into subtasks? Is some other NLP method appropriate?
When do we evaluate output from a subtask? Right after it finishes?
See https://www.promptingguide.ai/techniques/ape as a potential method
Look at organizational or management texts and see if GPT managerial delegation lines up with their criteria. Try fine-tuning or prompt engineering accordingly.
So much of real-world work depends on asking the correct questions and raising issues, but there has been little evaluation of LLMs' question-generating capacity. Most performance benchmarks focus on answering, much like an academic setting.
A collection of GPT-4 failures
https://github.com/openai/evals/tree/main/evals/registry/data
Possible additions: git?
Compile insights from GitHub issues reporting bugs/problems into an actionable list of items for repository maintainers to create an FAQ section from. Consider including potential answers.
Potentially search for sentences ending in question marks.
Use on a public repository with substantial discussion.
Method: GPT-3.5+, some regex detection of questions
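A rough sketch of the regex question-detection step (the pattern and example text are illustrative; real issue text would need more cleanup, e.g. stripping code blocks):

```python
import re

def extract_questions(text):
    """Pull out sentences ending in a question mark from issue text."""
    # Match runs of non-sentence-ending characters followed by a '?'
    candidates = re.findall(r"[^.!?]*\?", text)
    return [c.strip() for c in candidates if c.strip()]

issue = "The build fails on Windows. Has anyone seen this? Is there a workaround?"
print(extract_questions(issue))
```

The extracted questions could then be clustered or deduplicated before being passed to GPT-3.5+ for answer drafting.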
Length imposes several costs: potential for quality to decline (see Roose); extra time to review output and check each step in real-world scenarios, when some steps may be obvious to human reviewers; token costs.
First assessment of some of these techniques since launch, when underlying LLMs have changed, and the costs with them.
Of course, costs of techniques may change after this paper. However, they are likely to get higher - marginal benefits are likely to fall as models advance. So this paper establishes a minimum baseline.
There are so many techniques out there that it is hard to compare them even on baseline performance. It's even harder to make a full cost/benefit analysis of whether they should be used. This paper makes a start on that.
Some light discussion of human costs... to advocate for an automated prompt engineering solution: https://arxiv.org/pdf/2302.12246.pdf
https://dl.acm.org/doi/pdf/10.1145/3491102.3517582 discusses chaining and some of the cost challenges - increased complexity, limitations on creativity/randomness
https://en.wikipedia.org/wiki/Prompt_engineering
I even chose relatively easy to implement/low cost methods!
Probably real-world/professional, rather than academic tasks. Coding/Leetcode, then perhaps some very generalized standard tests (All GRE Sections)
Models: the few major ones: GPT-4, Claude, Bard. If time allows, assess on GPT-3.5 or older models closer to the original ones that motivated these prompt engineering techniques.
For some of the simpler metrics, in limited domains, it may be possible to use reported accuracy scores on the domain dataset coming from the original paper. It may also be possible to use prompts, LLM responses, and correct responses provided along with original papers. However, the more complex metrics will require human evaluation.
Accuracy at the point a human is confident in the answer (modelling the real world)
Accuracy (according to the papers initially presenting the technique, in whatever domain it was tested)
Length of the entire interaction in tokens
Length of the entire interaction in tokens relative to the length of a human/solved out/generally accepted as correct answer. How much is the engineering stretching the interaction out? Is this stretching adding value/improving accuracy?
Length of the entire interaction in time (seconds)
Cost of this length
Human assessment of need for specialized knowledge/difficulty of implementation of the technique. This could be task specific (done for some real world example questions/prompting scenarios), or it could be done overall based on a simple description of the technique. Perhaps a balance of both is best.
Human assessment of output complexity (ease of evaluating results). This could be task specific (done for some real world example questions/prompting scenarios), or it could be done overall based on a simple description of the technique. Perhaps a balance of both is best.
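The token-cost metric above can be sketched as a simple calculation (the per-1K-token prices here are hypothetical placeholders; real API pricing varies by model and changes over time):

```python
# Hypothetical per-1K-token prices, NOT real current API pricing
PRICE_PER_1K = {"prompt": 0.03, "completion": 0.06}

def interaction_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one interaction given its token counts."""
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] \
        + (completion_tokens / 1000) * PRICE_PER_1K["completion"]

print(interaction_cost(500, 1000))
```

Summing this over all interactions a technique requires (including any verification or sub-task calls) gives its total monetary cost per question.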
Hard to tell what methods are actually the most popular.
I expect things to change over time... in fact, they have changed since many of these techniques came on the scene. However, I only expect relative costs to increase as models get better.
Automated prompt engineering can certainly decrease at least some of the costs. https://arxiv.org/pdf/2211.01910.pdf. We did not consider it because of the difficulty of implementation and limited resources available.
Systematic empirical evaluation of correctness.
The prompt engineering guide contains a lot of techniques: https://www.promptingguide.ai/
Here are some unintuitive ones:
As a note, there seem to be small typos in many of the prompts in the guide - I wonder if this affects performance...
Additionally, examples such as the one at the bottom of https://www.promptingguide.ai/techniques/fewshot seem to be passed by GPT-4 now. The chain-of-thought process to the correct answer now triggers automatically. However, if you tell the model to be concise in its response, it often fails. And obviously, if we use smaller/older models, there will probably be more failures.
Some of these techniques greatly increase response time. Is this a tradeoff worth making? There's also the problem of having more text to verify/read through when one gets a result, but I don't think this is a huge deal.
Table to consider
Popular, easy, and powerful:
Popular, moderately hard, powerful
Less popular, easy, less powerful
Popular basic methods:
With these we will get a good representation of the biggest methods. GSM8K and Last Letter have been used as datasets for a good chunk of them.
If time allows:
This can be done as a comparison of metrics between the methods and direct prompting
Use a t-test for a difference of sample means, or potentially a permutation test, or a bootstrap. Perhaps try all of them for robustness.
CIs might be the cleanest approach, as they would allow you to quickly calculate the side-metrics based on the accuracy rates.
Many papers in the literature do not include any statistical inference!
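A minimal sketch of the permutation-test option, assuming per-question 0/1 correctness arrays for a technique versus direct prompting (the function name and data are illustrative):

```python
import numpy as np

def permutation_test(acc_a, acc_b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean accuracy.
    acc_a, acc_b: arrays of 0/1 per-question correctness.
    Returns (observed difference, p-value)."""
    rng = np.random.default_rng(seed)
    acc_a, acc_b = np.asarray(acc_a, float), np.asarray(acc_b, float)
    observed = acc_a.mean() - acc_b.mean()
    pooled = np.concatenate([acc_a, acc_b])
    n = len(acc_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel questions at random
        diff = pooled[:n].mean() - pooled[n:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_perm
```

The same per-question arrays could feed a t-test or a bootstrap CI, so trying all three for robustness is cheap once the correctness data is collected.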
We would feed the documents with the most of each topic, or the most common words of each topic, to GPT
It's more flexible than a standard classifier as it could potentially invent labels
Is there some sort of wisdom in the crowd when we have multiple models or multiple queries addressing the same task or question?
I think the answer is yes, because "Active-Prompt" seems to work. https://www.promptingguide.ai/techniques/activeprompt
What can we say about independence?
Create a long list of prompt engineering techniques and evaluate their online popularity. Be very careful with minor string differences - "chain-of-thought" versus "chain of thought" versus "CoT", etc.
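A minimal normalization sketch for counting mentions (the alias table here is a hypothetical stand-in; a real one would cover the full list of techniques and their abbreviations):

```python
import re

# Hypothetical alias table mapping surface forms to canonical names
ALIASES = {
    "cot": "chain of thought",
}

def canonicalize(term):
    """Normalize case, hyphens/underscores, and whitespace before counting."""
    key = re.sub(r"[-_]+", " ", term.strip().lower())
    key = re.sub(r"\s+", " ", key)
    return ALIASES.get(key, key)

print(canonicalize("Chain-of-Thought"))
print(canonicalize("CoT"))
```

All mention counts would then be aggregated by canonical name rather than raw string.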
Just use the search function to look for number of posts, or scrape posts
Here are the largest communities:
https://www.reddit.com/r/PromptEngineering/
https://www.reddit.com/r/ChatGPT/
https://www.reddit.com/r/ChatGPTPromptGenius/
https://www.reddit.com/r/PromptDesign/
https://trends.google.com/trends/explore?geo=US&hl=en-US
Just check the number of citations on the original paper. Potentially do citations per month since release to adjust for recency.
I intend to compare several measurements of characteristics of LLM responses with each other, and with response accuracy.
Length/Number of words generated. Punctuation. Vocabulary complexity, distinctive terms in correct/incorrect responses, distinctive terms in responses and correct answers. Paragraph size. Sentence length. Readability. Syntactic complexity. Presence of words that indicate ambiguity.
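As a sketch of one of these readability measures, here is the standard Flesch Reading Ease formula with a crude vowel-group syllable heuristic (a real implementation would use a proper syllable dictionary or an existing readability library):

```python
import re

def flesch_reading_ease(text):
    """Approximate Flesch Reading Ease; higher = easier to read.
    Syllable counts use a rough vowel-group heuristic."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```

Running this over prompts, responses, and reference answers would give one comparable complexity number per text.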
A motivation for length is the case that LLMs do stray off topic, as was the case with Sydney/Bing...
Summary statistics for all metrics overall, by model, by model and task type.
Run these on the prompt, the response, and a sourced correct answer.
Compare lengths of GPT answers versus lengths of solved out, correct answers. Look at this in percentage terms (or some other way) to account for different lengths on different questions. Are answers systematically more accurate when GPT answers are longer than established answers (meaning the model took its time and added more steps as necessary)? This could tell us something about step-by-step in general, rather than specific techniques. We can then analyze the quality of specific techniques in terms of their value add per extra token of response length/normalized to response length. Why do we care about length? Shorter = easier to verify, less expensive via pricing, etc.
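The length-ratio comparison above can be sketched as follows (the records here are invented placeholders, not real data):

```python
from statistics import mean

def length_ratio_by_correctness(records):
    """Mean ratio of LLM answer length to reference answer length,
    split by correctness.
    records: iterable of (answer_tokens, reference_tokens, correct)."""
    correct = [a / r for a, r, ok in records if ok]
    wrong = [a / r for a, r, ok in records if not ok]
    return mean(correct), mean(wrong)

# Hypothetical per-question records, not real measurements
records = [(120, 80, True), (60, 70, False), (200, 90, True), (50, 60, False)]
print(length_ratio_by_correctness(records))
```

If the mean ratio is systematically higher for correct answers, that would be (weak) evidence that stretching out the response adds value.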
Is there a connection to accuracy? Can we use speed to assign confidence? How can we account for serious confounding issues related to server/connection speed, etc.?
Number and attributes of quotations and citations. Tool and plug-in usage. Package usage within code. Package use seems to be subject to a large amount of hallucination.
Where in a response do errors typically occur? The beginning, middle, or end? Does this vary by type of prompt?
Despite the known importance of these problems, systematic evaluation of them seems to be lacking; most papers focus solely on accuracy.
LLMs as classifiers
LLMs and covariate selection
Simply state that the LLM can revert to a previous step at any time in the initial prompt?
Somewhere in the second or later steps, explicitly note returning to a different choice as an option?
Complexity measured by Flesch-Kincaid and a wide range of other metrics.