ijyliu / anlp23-project
An empirical study of the costs and practicalities of prompt engineering techniques on standard and novel benchmarks
Limit claims of generalizability - we are really only assessing GRE and code tasks, and only baseline methods rather than ones highly specialized to the task at hand.
Should I attempt to modify the methods for GRE? I am inclined to think not.
In picking methods, attention should probably be given to whether the methods claim to be generally useful or are intended for specific tasks.
Interesting paper and eval:
How can we optimally pick language model temperatures for a user's query?
We want to elicit situations where creativity versus precision is important.
Look for hard numbers or numerical reasoning?
Though it's not possible to set a temperature with GPT-4 (only possible for davinci, etc.), we can approximate this with prompting or custom instructions.
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
In fiction, the best characters make mistakes! Can NLP tell us something about this? What do we know about narrator reliability? How can we use NLP on fiction examples likely to bear on this?
Misconceptions to try: https://en.wikipedia.org/wiki/List_of_common_misconceptions
Identify the positive and negative aspects of a business based on review text.
Attempt to list the most relevant of these for a particular user, based on the text of their own prior reviews.
Method: GPT-3.5+
Data: Yelp, Google Maps, Other. Potentially consider application to any kind of reviews - apps, movies, products, companies on Glassdoor, etc.
Can we use these to create a model good at taking actions?
Can models anticipate tasks they would find hard, and generate examples for those cases?
Say, in the prompt: Only cite Wikipedia, not X
https://browse.arxiv.org/pdf/2210.00720.pdf
More reasoning steps and complexity improve performance on math word problems and non-math reasoning. Focuses on older (2022 and before) models and chain-of-thought prompting. Interestingly, the gains are larger on simple questions and larger models. Prompt complexity is indeed associated with output complexity. Some measures of output complexity are line breaks, periods, "step i", and semicolons. Described as "annotation efficient" as opposed to other few-shot methods - just provide complex examples.
https://browse.arxiv.org/pdf/2302.12822.pdf
Complexity of exemplars is considered critical to the quality of chain-of-thought prompting. However, this paper finds that complexity gains are larger on more complex (not simple) questions. For simple questions, simple examples can work better - disagreeing with the Fu paper, in which complex examples always do better.
https://browse.arxiv.org/pdf/2309.04269.pdf Finds the optimal level of entity density for human readers of summaries - the middle of the chain of density, balancing informativeness and readability. Consider adding entity density as a measure of complexity. Single task and domain (summarization).
https://aclanthology.org/2023.findings-acl.591.pdf Not much here except the idea of the presence of novel n-grams as a measure of complexity. Seems dicey to use for my purposes, though.
https://aclanthology.org/2023.acl-srw.1.pdf For the summarization task, prompting can improve readability and usability as measured by Flesch Reading Ease. Prompt: "Please give me a layman / simple / simplified and understandable / easy-to-comprehend / straightforward / general audience summary of X". Looks at named entity hallucination by prompt type and readability by prompt type, but does not directly look at the correlation of readability and accuracy.
https://browse.arxiv.org/pdf/2309.05454.pdf Story generation and simplification tasks; targets responses to Flesch-Kincaid reading levels via some prompting techniques.
https://browse.arxiv.org/pdf/2209.12356.pdf More summary evals. Says automatic and human judgements of quality often don't go together. Human judgements as the gold standard are very important.
Make the LLM generate and answer verification questions. Experiment with having it generate them with original answer, or after. Also experiment with time questions are to be answered.
Is it translating under the hood?
Does splitting tasks in foreign languages into translation then task then translation improve performance?
p.30 https://arxiv.org/pdf/2303.08774.pdf
Contamination is very serious for the GRE and less serious for the SAT. For these exams it's extremely hard to find questions not posted online.
We know that GSM8K's test set is NOT in the GPT-4 training data.
It would be nice to look at the most common original method paper evaluations. I think GSM8K is the most common but unclear about others used
Last letter is a big one
CommonSenseQA is also a good source, but might be more contaminated
Winogrande, StrategyQA also common
It would be nice to have a reading comprehension or summarization task as well
Check https://paperswithcode.com/datasets
Possibly best to finalize this after finalizing list of methods
Be sure to state in the paper how many methods used the chosen evals!
Look at failures of less capable models?
Are there parallels between model results? Do these earlier failures tell us something?
Does this help reduce length and complexity?
Seems like this can be effective, at least for summarization: https://aclanthology.org/2023.acl-srw.1.pdf
One implementation option is automated retrieval - find questions in a training set with embeddings closest to the test question, measured in Euclidean distance.
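A minimal sketch of that retrieval step, assuming precomputed embedding vectors (the toy arrays and the function name `retrieve_nearest` are illustrative stand-ins; real embeddings would come from an embedding model):

```python
import numpy as np

def retrieve_nearest(test_emb, train_embs, k=3):
    """Indices of the k training questions whose embeddings are
    closest to the test question in Euclidean distance."""
    dists = np.linalg.norm(train_embs - test_emb, axis=1)
    return np.argsort(dists)[:k]

# Toy stand-in embeddings, not real model output
train_embs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.9, 0.1]])
test_emb = np.array([1.0, 0.05])
print(retrieve_nearest(test_emb, train_embs, k=2))
```

The k retrieved training questions (with their solutions) would then be pasted into the prompt as few-shot exemplars.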
Emails often have a query, a response, and, importantly, a thank you or other grading of the quality of the response.
Note one interesting technique may be somehow feeding the model the query and the evaluation/thanks messages and having the model fill in the middle, the response.
Look into bidirectional encoders (BERT) as a way to learn from both sides of input/fill in the middle?
One feasible structure for automated prompting is having more capable LLMs manage less capable ones that do smaller sub-tasks.
Should the most capable LLMs supervise, or should the roles be reversed? Is there some better structure?
Potential benefits of 'management' - parallelization of tasks, getting around context window limitations for each subtask
How do we divide tasks into subtasks? Is some other NLP method appropriate?
When do we evaluate output from a subtask? Right after it finishes?
See https://www.promptingguide.ai/techniques/ape as a potential method
Look at organizational or management texts and see if GPT managerial delegation lines up with their criteria. Try fine-tuning or prompt engineering accordingly.
So much of real-world work depends on asking the correct questions and raising issues, but there has been little evaluation of LLMs' question-generating capacity. Most performance benchmarks focus on answering, much like an academic setting.
A collection of GPT-4 failures
https://github.com/openai/evals/tree/main/evals/registry/data
Possible additions: git?
Compile insights from GitHub issues reporting bugs/problems into an actionable list of items for repository maintainers to create an FAQ section from. Consider including potential answers.
Potentially search for sentences ending in question marks.
Use on a public repository with substantial discussion.
Method: GPT-3.5+, some regex detection of questions
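A rough sketch of the regex question-detection step (the pattern and example text are illustrative; real issue text would need more cleanup, e.g. stripping code blocks):

```python
import re

def extract_questions(text):
    """Pull out sentences ending in a question mark from issue text."""
    # Match runs of non-sentence-ending characters followed by a '?'
    candidates = re.findall(r"[^.!?]*\?", text)
    return [c.strip() for c in candidates if c.strip()]

issue = "The build fails on Windows. Has anyone seen this? Is there a workaround?"
print(extract_questions(issue))
```

The extracted questions could then be clustered or deduplicated before being passed to GPT-3.5+ for answer drafting.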
Length imposes several costs: potential for quality to decline (see Roose); extra time to review output and check each step in real-world scenarios, when some steps may be obvious to human reviewers; token costs.
First assessment of some of these techniques since launch, when underlying LLMs have changed, and the costs with them.
Of course, costs of techniques may change after this paper. However, they are likely to get higher - marginal benefits are likely to fall as models advance. So this paper establishes a minimum baseline.
There are so many techniques out there that it is hard to compare them even on baseline performance. It's even harder to make a full cost/benefit analysis of whether they should be used. This paper makes a start on that.
Some light discussion of human costs... to advocate for an automated prompt engineering solution: https://arxiv.org/pdf/2302.12246.pdf
https://dl.acm.org/doi/pdf/10.1145/3491102.3517582 discusses chaining and some of the cost challenges - increased complexity, limitations on creativity/randomness
https://en.wikipedia.org/wiki/Prompt_engineering
I even chose relatively easy to implement/low cost methods!
Probably real-world/professional, rather than academic tasks. Coding/Leetcode, then perhaps some very generalized standard tests (All GRE Sections)
Models: the few major ones: GPT-4, Claude, Bard. If time allows, assess on GPT-3.5 or older models closer to the original ones that motivated these prompt engineering techniques.
For some of the simpler metrics, in limited domains, it may be possible to use reported accuracy scores on the domain dataset coming from the original paper. It may also be possible to use prompts, LLM responses, and correct responses provided along with original papers. However, the more complex metrics will require human evaluation.
Accuracy at the point a human is confident in the answer (modelling the real world)
Accuracy (according to the papers initially presenting the technique, in whatever domain it was tested)
Length of the entire interaction in tokens
Length of the entire interaction in tokens relative to the length of a human/solved out/generally accepted as correct answer. How much is the engineering stretching the interaction out? Is this stretching adding value/improving accuracy?
Length of the entire interaction in time (seconds)
Cost of this length
Human assessment of need for specialized knowledge/difficulty of implementation of the technique. This could be task specific (done for some real world example questions/prompting scenarios), or it could be done overall based on a simple description of the technique. Perhaps a balance of both is best.
Human assessment of output complexity (ease of evaluating results). This could be task specific (done for some real world example questions/prompting scenarios), or it could be done overall based on a simple description of the technique. Perhaps a balance of both is best.
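The token-cost metric above can be sketched as a simple calculation (the per-1K-token prices here are hypothetical placeholders; real API pricing varies by model and changes over time):

```python
# Hypothetical per-1K-token prices, NOT real current API pricing
PRICE_PER_1K = {"prompt": 0.03, "completion": 0.06}

def interaction_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one interaction given its token counts."""
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] \
        + (completion_tokens / 1000) * PRICE_PER_1K["completion"]

print(interaction_cost(500, 1000))
```

Summing this over all interactions a technique requires (including any verification or sub-task calls) gives its total monetary cost per question.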
Hard to tell what methods are actually the most popular.
I expect things to change over time... in fact, they have changed since many of these techniques came on the scene. However, I only expect relative costs to increase as models get better.
Automated prompt engineering can certainly decrease at least some of the costs. https://arxiv.org/pdf/2211.01910.pdf. We did not consider it because of the difficulty of implementation and limited resources available.
Systematic empirical evaluation of correctness.
The prompt engineering guide contains a lot of techniques: https://www.promptingguide.ai/
Here are some unintuitive ones:
As a note, there seem to be small typos in many of the prompts in the guide - I wonder if this affects performance...
Additionally, examples such as the one at the bottom of https://www.promptingguide.ai/techniques/fewshot seem to be passed by GPT-4 now. The chain-of-thought process to the correct answer now triggers automatically. However, if you tell the model to be concise in its response, it often fails. And obviously, if we use smaller/older models, there will probably be more failures.
Some of these techniques greatly increase response time. Is this a tradeoff worth making? There's also the problem of having more text to verify/read through when one gets a result, but I don't think this is a huge deal.
Table to consider
Popular, easy, and powerful:
Popular, moderately hard, powerful
Less popular, easy, less powerful
Popular basic methods:
With these we will get a good representation of the biggest methods. GSM8K and Last Letter have been used as datasets for a good chunk of them.
If time allows:
This can be done as a comparison of metrics between the methods and direct prompting
Use a t-test for a difference of sample means, or potentially a permutation test, or a bootstrap. Perhaps try all of them for robustness.
CIs might be the cleanest approach, as they would allow you to quickly calculate the side-metrics based on the accuracy rates.
Many papers in the literature do not include any statistical inference!
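A minimal sketch of the permutation-test option, assuming per-question 0/1 correctness arrays for a technique versus direct prompting (the function name and data are illustrative):

```python
import numpy as np

def permutation_test(acc_a, acc_b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean accuracy.
    acc_a, acc_b: arrays of 0/1 per-question correctness.
    Returns (observed difference, p-value)."""
    rng = np.random.default_rng(seed)
    acc_a, acc_b = np.asarray(acc_a, float), np.asarray(acc_b, float)
    observed = acc_a.mean() - acc_b.mean()
    pooled = np.concatenate([acc_a, acc_b])
    n = len(acc_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel questions at random
        diff = pooled[:n].mean() - pooled[n:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_perm
```

The same per-question arrays could feed a t-test or a bootstrap CI, so trying all three for robustness is cheap once the correctness data is collected.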
We would feed the documents with the most of each topic, or the most common words of each topic, to GPT
It's more flexible than a standard classifier as it could potentially invent labels
Is there some sort of wisdom in the crowd when we have multiple models or multiple queries addressing the same task or question?
I think the answer is yes, because "Active-Prompt" seems to work. https://www.promptingguide.ai/techniques/activeprompt
What can we say about independence?
Create a long list of prompt engineering techniques and evaluate their online popularity. Be very careful with minor string differences - "chain-of-thought" versus "chain of thought" versus "CoT", etc.
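A minimal normalization sketch for counting mentions (the alias table here is a hypothetical stand-in; a real one would cover the full list of techniques and their abbreviations):

```python
import re

# Hypothetical alias table mapping surface forms to canonical names
ALIASES = {
    "cot": "chain of thought",
}

def canonicalize(term):
    """Normalize case, hyphens/underscores, and whitespace before counting."""
    key = re.sub(r"[-_]+", " ", term.strip().lower())
    key = re.sub(r"\s+", " ", key)
    return ALIASES.get(key, key)

print(canonicalize("Chain-of-Thought"))
print(canonicalize("CoT"))
```

All mention counts would then be aggregated by canonical name rather than raw string.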
Just use the search function to look for number of posts, or scrape posts
Here are the largest communities:
https://www.reddit.com/r/PromptEngineering/
https://www.reddit.com/r/ChatGPT/
https://www.reddit.com/r/ChatGPTPromptGenius/
https://www.reddit.com/r/PromptDesign/
https://trends.google.com/trends/explore?geo=US&hl=en-US
Just check the number of citations on the original paper. Potentially do citations per month since release to adjust for recency.
I intend to compare several measurements of characteristics of LLM responses with each other, and with response accuracy.
Length/Number of words generated. Punctuation. Vocabulary complexity, distinctive terms in correct/incorrect responses, distinctive terms in responses and correct answers. Paragraph size. Sentence length. Readability. Syntactic complexity. Presence of words that indicate ambiguity.
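As a sketch of one of these readability measures, here is the standard Flesch Reading Ease formula with a crude vowel-group syllable heuristic (a real implementation would use a proper syllable dictionary or an existing readability library):

```python
import re

def flesch_reading_ease(text):
    """Approximate Flesch Reading Ease; higher = easier to read.
    Syllable counts use a rough vowel-group heuristic."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```

Running this over prompts, responses, and reference answers would give one comparable complexity number per text.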
A motivation for length is the case that LLMs do stray off topic, as was the case with Sydney/Bing...
Summary statistics for all metrics overall, by model, by model and task type.
Run these on the prompt, the response, and a sourced correct answer.
Compare lengths of GPT answers versus lengths of solved out, correct answers. Look at this in percentage terms (or some other way) to account for different lengths on different questions. Are answers systematically more accurate when GPT answers are longer than established answers (meaning the model took its time and added more steps as necessary)? This could tell us something about step-by-step in general, rather than specific techniques. We can then analyze the quality of specific techniques in terms of their value add per extra token of response length/normalized to response length. Why do we care about length? Shorter = easier to verify, less expensive via pricing, etc.
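The length-ratio comparison above can be sketched as follows (the records here are invented placeholders, not real data):

```python
from statistics import mean

def length_ratio_by_correctness(records):
    """Mean ratio of LLM answer length to reference answer length,
    split by correctness.
    records: iterable of (answer_tokens, reference_tokens, correct)."""
    correct = [a / r for a, r, ok in records if ok]
    wrong = [a / r for a, r, ok in records if not ok]
    return mean(correct), mean(wrong)

# Hypothetical per-question records, not real measurements
records = [(120, 80, True), (60, 70, False), (200, 90, True), (50, 60, False)]
print(length_ratio_by_correctness(records))
```

If the mean ratio is systematically higher for correct answers, that would be (weak) evidence that stretching out the response adds value.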
Is there a connection to accuracy? Can we use speed to assign confidence? How can we account for serious confounding issues related to server/connection speed, etc.?
Number and attributes of quotations and citations. Tool and plug-in usage. Package usage within code. Package use seems to be subject to a large amount of hallucination.
Where in a response do errors typically occur? The beginning, middle, or end? Does this vary by type of prompt?
Despite the known importance of these problems, systematic evaluation of them seems to be lacking; most papers focus solely on accuracy.
LLMs as classifiers
LLMs and covariate selection
Simply state that the LLM can revert to a previous step at any time in the initial prompt?
Somewhere in the second or later steps, explicitly note returning to a different choice as an option?
Complexity measured by Flesch-Kincaid and a wide range of other metrics.