anlp23-project's Issues

Note on Generalizability/External Validity

Limit claims of generalizability: we are really only assessing GRE and coding tasks, and only baseline methods rather than ones specialized to the task at hand.

Should I attempt to modify the methods for GRE? I am inclined to think not.

In picking methods, attention should probably be given to whether the methods claim to be generally useful or are intended for specific tasks.

Automatic Choice of Language Model Temperatures

How can we optimally pick language model temperatures for a user's query?

We want to elicit situations where creativity versus precision is important.

Look for hard numbers or numerical reasoning?

Though temperature can't be set directly for GPT-4 in the ChatGPT interface (it can for davinci and similar models via the API), we can approximate this with prompting or custom instructions.
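A minimal sketch of one possible heuristic, assuming we simply route queries by whether they look like they need numerical precision or creativity; the keyword patterns and temperature values are illustrative placeholders, not tuned settings.

```python
import re

def suggest_temperature(query: str) -> float:
    """Heuristic temperature suggestion: lower for queries that look like they
    need numerical precision, higher for open-ended/creative asks.
    The keyword lists and values are illustrative placeholders."""
    precision_signals = [
        r"\d",                                            # any digit in the query
        r"\b(calculate|compute|solve|prove|exact|how many)\b",
    ]
    creative_signals = [
        r"\b(story|poem|brainstorm|imagine|creative|ideas?)\b",
    ]
    q = query.lower()
    if any(re.search(p, q) for p in precision_signals):
        return 0.2   # favor precision
    if any(re.search(p, q) for p in creative_signals):
        return 1.0   # favor creativity
    return 0.7       # default middle ground

print(suggest_temperature("How many prime numbers are below 100?"))  # 0.2
print(suggest_temperature("Write a short poem about autumn."))       # 1.0
```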

Top Positive and Negative Features of Businesses Based on Reviews

Identify the positive and negative aspects of a business based on review text.

Attempt to list the most relevant of these for a particular user, based on the text of their own prior reviews.

Method: GPT-3.5+

Data: Yelp, Google Maps, Other. Potentially consider application to any kind of reviews - apps, movies, products, companies on Glassdoor, etc.
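A rough sketch of how this could look, assuming the pre-1.0 openai Python package (openai.ChatCompletion); the prompt wording, model choice, and function name are illustrative rather than a settled design.

```python
import openai  # assumes the pre-1.0 openai package; newer versions use a different client interface

def business_pros_cons(review_texts, model="gpt-3.5-turbo"):
    """Ask a chat model to summarize positive and negative aspects mentioned
    in a batch of reviews. The prompt wording is illustrative."""
    joined = "\n---\n".join(review_texts)
    messages = [
        {"role": "system",
         "content": "You extract the top positive and negative aspects of a business from customer reviews."},
        {"role": "user",
         "content": f"Reviews:\n{joined}\n\nList the top 3 positive and top 3 negative aspects, "
                    "with a short supporting quote for each."},
    ]
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    return response["choices"][0]["message"]["content"]

# Example (requires openai.api_key to be set):
# print(business_pros_cons(["Great coffee but slow service.", "Friendly staff, tiny parking lot."]))
```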

Data Augmentation

Can models anticipate tasks they would find hard, and generate examples for those cases?

Complexity Based Prompting

https://browse.arxiv.org/pdf/2210.00720.pdf

More reasoning steps and complexity improve performance on math word problems and non-math reasoning. The paper focuses on older (2022 and before) models and chain-of-thought prompting. Interestingly, the gains are larger on simple questions and larger models. Prompt complexity is indeed associated with output complexity; some measures of output complexity are linebreaks, periods, "step i" markers, and semicolons. The method is described as "annotation efficient" relative to other few-shot methods - just provide complex examples.

https://browse.arxiv.org/pdf/2302.12822.pdf

Complexity of exemplars is considered critical to the quality of chain-of-thought prompting. However, this paper finds that complexity gains are larger on more complex (not simpler) questions; for simple questions, simple examples can work better - disagreeing with the Fu et al. paper, in which complex examples always do better.
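A small sketch of the surface proxies for output complexity mentioned above (linebreaks, periods, "step i" markers, semicolons); the regex for step markers is my own assumption about how to operationalize "step i".

```python
import re

def output_complexity(text: str) -> dict:
    """Count surface proxies for reasoning-chain complexity:
    linebreaks, periods, 'step i' markers, semicolons."""
    return {
        "linebreaks": text.count("\n"),
        "periods": text.count("."),
        "step_markers": len(re.findall(r"\bstep\s*\d+\b", text, flags=re.IGNORECASE)),
        "semicolons": text.count(";"),
    }

example = "Step 1: add the apples.\nStep 2: subtract the ones eaten; 5 - 2 = 3."
print(output_complexity(example))
# {'linebreaks': 1, 'periods': 2, 'step_markers': 2, 'semicolons': 1}
```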

Similar Papers Concerning Summarization

  • https://browse.arxiv.org/pdf/2309.04269.pdf An optimal level of entity density for human readers of summaries is found - the middle of the chain of density - balancing informativeness and readability. Consider adding entity density as a measure of complexity. Single task and domain (summarization).

  • https://aclanthology.org/2023.findings-acl.591.pdf Not much here except the idea of the presence of novel n-grams as a measure of complexity; it seems dicey to use for my purposes, though.

  • https://aclanthology.org/2023.acl-srw.1.pdf For the summarization task, prompting can improve readability and usability as measured by Flesch Reading Ease. Prompt: "Please give me a layman / simple / simplified and understandable / easy-to-comprehend / straightforward / general audience summary of X". Looks at named entity hallucination by prompt type and readability by prompt type, but does not directly look at the correlation between readability and accuracy.

  • https://browse.arxiv.org/pdf/2309.05454.pdf Story generation and simplification tasks; targets responses to Flesch-Kincaid reading levels via some prompting techniques.

  • https://browse.arxiv.org/pdf/2209.12356.pdf More summary evaluations. Says automatic and human judgements of quality often don't go together; human judgements as the gold standard are very important.
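If readability and entity density are adopted as complexity measures, a sketch like the following could compute them, assuming textstat for Flesch Reading Ease and spaCy for named entities; defining entity density as entities per token is my reading of the chain-of-density setup, not something fixed yet.

```python
import spacy     # pip install spacy && python -m spacy download en_core_web_sm
import textstat  # pip install textstat

nlp = spacy.load("en_core_web_sm")

def summary_stats(summary: str) -> dict:
    """Readability plus a simple entity-density measure (named entities per token)."""
    doc = nlp(summary)
    n_tokens = len(doc)
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(summary),
        "entity_density": len(doc.ents) / n_tokens if n_tokens else 0.0,
    }

print(summary_stats("Apple opened a new campus in Austin, Texas in 2022."))
```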

Choice of Evaluation

See p. 30 of https://arxiv.org/pdf/2303.08774.pdf (the GPT-4 technical report).

Contamination is very serious for the GRE and less serious for the SAT. For these exams it's extremely hard to find questions not posted online.

We know that GSM8K's test set is NOT in the GPT-4 training data.

It would be nice to look at the evaluations most commonly used in the original method papers. I think GSM8K is the most common, but it's unclear which others are widely used.

Last Letter Concatenation is a big one.

CommonSenseQA is also a good source, but might be more contaminated

Winogrande, StrategyQA also common

It would be nice to have a reading comprehension or summarization task as well

Check https://paperswithcode.com/datasets

Possibly best to finalize this after finalizing list of methods

Be sure to state in the paper how many methods used the chosen evals!

Emails as LLM Fine Tuning or Prompting Data

Emails often have a query, a response, and, importantly, a thank you or other grading of the quality of the response.

One interesting technique may be feeding the model the query and the evaluation/thanks messages and having the model fill in the middle: the response.

Look into bidirectional encoders (BERT) as a way to learn from both sides of input/fill in the middle?
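A sketch of how such fill-in-the-middle training pairs might be structured; the thread format (query/response/evaluation keys) is a hypothetical placeholder for whatever an email parser would produce.

```python
def make_fill_in_middle_example(thread: dict) -> dict:
    """Turn an email thread into a (context, target) pair where the model sees
    the query and the follow-up evaluation ('thanks') and must produce the
    response in between. The thread keys are hypothetical."""
    context = (
        f"QUERY:\n{thread['query']}\n\n"
        f"FOLLOW-UP FROM SENDER:\n{thread['evaluation']}\n\n"
        "RESPONSE THAT PROMPTED THIS FOLLOW-UP:"
    )
    return {"input": context, "output": thread["response"]}

example_thread = {
    "query": "Could you send me the Q3 revenue numbers by region?",
    "response": "Attached is the regional Q3 revenue breakdown.",
    "evaluation": "Thanks, this is exactly what I needed!",
}
print(make_fill_in_middle_example(example_thread))
```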

More Capable LLMs Managing Less Capable LLMs

One feasible structure for automated prompting is having more capable LLMs manage less capable ones that do smaller sub-tasks.

Should the most capable LLMs supervise, or should the roles be reversed? Is there some better structure?

Potential benefits of 'management' - parallelization of tasks, getting around context window limitations for each subtask

How do we divide tasks into subtasks? Is some other NLP method appropriate?

When do we evaluate output from a subtask? Right after it finishes?

See https://www.promptingguide.ai/techniques/ape as a potential method

Look at organizational or management texts and see if GPT's managerial delegation lines up with established criteria. Try fine-tuning or prompt engineering toward those criteria.
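A minimal sketch of the manager/worker structure, again assuming the pre-1.0 openai package; the delegation prompts, the JSON contract between manager and workers, and the model choices are all illustrative assumptions rather than a tested design.

```python
import json
import openai  # assumes the pre-1.0 openai package; newer versions use a different client interface

def ask(model, system, user, temperature=0):
    """One chat turn; thin wrapper around the chat completions endpoint."""
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=temperature,
    )
    return resp["choices"][0]["message"]["content"]

def manage(task, manager_model="gpt-4", worker_model="gpt-3.5-turbo"):
    """A more capable 'manager' model splits the task into subtasks, a cheaper
    'worker' model handles each one, and the manager assembles the results."""
    plan = ask(
        manager_model,
        "You are a project manager. Split the user's task into 2-4 self-contained "
        "subtasks and return ONLY a JSON list of strings.",
        task,
    )
    subtasks = json.loads(plan)  # assumes the manager actually returned valid JSON
    results = [ask(worker_model, "Complete the subtask concisely.", s) for s in subtasks]
    return ask(
        manager_model,
        "Combine the subtask results into one coherent answer to the original task.",
        f"Task: {task}\n\nSubtask results:\n" + "\n\n".join(results),
    )

# Example (requires openai.api_key to be set):
# print(manage("Write a short report comparing three prompt engineering techniques."))
```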

Asking the Right Questions

So much of real-world work depends on asking the right questions and raising issues, but there has been little evaluation of LLMs' question-generating capacity. Most performance benchmarks focus on answering, much like an academic setting.

Use Text of GitHub Issues to Automatically Create Suggested FAQs

Compile insights from GitHub issues reporting bugs/problems into an actionable list of items for repository maintainers to create an FAQ section from. Consider including potential answers.

Potentially search for sentences ending in question marks.

Use on a public repository with substantial discussion.

Method: GPT-3.5+, some regex detection of questions
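A sketch of the regex question-detection step; the sentence splitting is deliberately naive and would need refinement for markdown, code blocks, etc.

```python
import re

def extract_questions(issue_text: str):
    """Pull out sentences ending in a question mark from a GitHub issue body."""
    # Split on sentence-ending punctuation, keeping the delimiter.
    sentences = re.findall(r"[^.!?\n]+[.!?]", issue_text)
    return [s.strip() for s in sentences if s.strip().endswith("?")]

issue = ("The build fails on Windows. Is there a supported Python version? "
         "Also, how do I regenerate the config file?")
print(extract_questions(issue))
# ['Is there a supported Python version?', 'Also, how do I regenerate the config file?']
```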

The Cost of Prompt Engineering

Length imposes several costs: potential for quality to decline (see Roose); extra time to review output and check each step in real-world scenarios, when some steps may be obvious to human reviewers; and token costs.

Motivation

This would be the first assessment of some of these techniques since their launch, over which time the underlying LLMs have changed, and their costs with them.

Of course, costs of techniques may change after this paper. However, they are likely to get higher - marginal benefits are likely to fall as models advance. So this paper establishes a minimum baseline.

There are so many techniques out there and it is hard to compare them even on baseline performance. It's even harder to make a full cost/benefit analysis as to whether they should be used. This paper begins that work.

Prior Literature

Some light discussion of human costs... to advocate for an automated prompt engineering solution: https://arxiv.org/pdf/2302.12246.pdf

https://dl.acm.org/doi/pdf/10.1145/3491102.3517582 discusses chaining and some of the cost challenges - increased complexity, limitations on creativity/randomness

Methods to Assess

  • Zero-Shot Baseline
  • Few-Shot Prompting. On page 4 of https://arxiv.org/pdf/2005.14165.pdf we see a comparison of the returns to in-context learning by model size. But that paper is from a time of smaller models (and indeed I don't think the newest GPT-4 tech report even considers few-shot/chain-of-thought)! Also, https://arxiv.org/pdf/2202.12837.pdf shows that the correctness of the examples is not important - instead, information about the distribution of the label space, the input text distribution, and the output format is key.
  • Manually provided chain-of-thought prompting. It was at one time claimed that this method could produce superior performance: https://arxiv.org/pdf/2210.03493.pdf. Note that some advanced LMs now do CoT automatically.
  • Chain-of-verification prompting: https://arxiv.org/pdf/2309.11495.pdf
    (To do: finish going through the repo and also look up the most popular techniques online...)
  • Generated Knowledge Prompting
  • Directional Stimulus Prompting

https://en.wikipedia.org/wiki/Prompt_engineering

I even chose relatively easy to implement/low cost methods!
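For concreteness, a sketch of how the baseline prompt formats differ; the exemplars and the question are illustrative, and the zero-shot CoT trigger is the standard "Let's think step by step" phrase.

```python
QUESTION = ("A bakery sells 48 muffins in the morning and half as many in the "
            "afternoon. How many muffins does it sell in total?")

# Zero-shot baseline: just the question.
zero_shot = QUESTION

# Few-shot: prepend worked examples (exemplar content is illustrative).
few_shot = (
    "Q: A pencil costs 2 dollars. How much do 3 pencils cost?\nA: 6\n\n"
    "Q: Tom has 5 apples and eats 2. How many are left?\nA: 3\n\n"
    f"Q: {QUESTION}\nA:"
)

# Manual chain-of-thought: exemplars include the reasoning, not just the answer.
manual_cot = (
    "Q: Tom has 5 apples and eats 2. How many are left?\n"
    "A: Tom starts with 5 apples. He eats 2, so 5 - 2 = 3 remain. The answer is 3.\n\n"
    f"Q: {QUESTION}\nA:"
)

# Zero-shot chain-of-thought: append the standard trigger phrase.
zero_shot_cot = f"Q: {QUESTION}\nA: Let's think step by step."

print(zero_shot_cot)
```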

Data

Probably real-world/professional rather than academic tasks: coding/LeetCode, then perhaps some very generalized standardized tests (all GRE sections).

Models: the major ones - GPT-4, Claude, Bard. If time allows, assess on GPT-3.5 or older models closer to the ones that originally motivated these prompt engineering techniques.

For some of the simpler metrics, in limited domains, it may be possible to use reported accuracy scores on the domain dataset coming from the original paper. It may also be possible to use prompts, LLM responses, and correct responses provided along with original papers. However, the more complex metrics will require human evaluation.

Metrics

Accuracy at the point a human is confident in the answer (modelling the real world)

Accuracy (according to the papers initially presenting the technique, in whatever domain it was tested)

Length of the entire interaction in tokens

Length of the entire interaction in tokens relative to the length of a human/solved out/generally accepted as correct answer. How much is the engineering stretching the interaction out? Is this stretching adding value/improving accuracy?

Length of the entire interaction in time (seconds)

Cost of this length

Human assessment of need for specialized knowledge/difficulty of implementation of the technique. This could be task specific (done for some real world example questions/prompting scenarios), or it could be done overall based on a simple description of the technique. Perhaps a balance of both is best.

Human assessment of output complexity (ease of evaluating results). This could be task specific (done for some real world example questions/prompting scenarios), or it could be done overall based on a simple description of the technique. Perhaps a balance of both is best.
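For the token-length and cost metrics, a sketch using tiktoken; the per-1,000-token prices are placeholders and should be replaced with the current pricing for whichever model is being evaluated.

```python
import tiktoken  # pip install tiktoken

def interaction_cost(prompt: str, response: str, model: str = "gpt-4",
                     usd_per_1k_prompt: float = 0.03, usd_per_1k_completion: float = 0.06):
    """Token length and dollar cost of one interaction.
    The per-1k-token prices are placeholders, not authoritative."""
    enc = tiktoken.encoding_for_model(model)
    n_prompt = len(enc.encode(prompt))
    n_completion = len(enc.encode(response))
    cost = n_prompt / 1000 * usd_per_1k_prompt + n_completion / 1000 * usd_per_1k_completion
    return {"prompt_tokens": n_prompt, "completion_tokens": n_completion, "usd": round(cost, 5)}

print(interaction_cost("What is 17 * 23?", "17 * 23 = 391."))
```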

Limitations

Hard to tell what methods are actually the most popular.

I expect things to change over time... in fact, they have changed since many of these techniques came on the scene. However, I only expect relative costs to increase as models get better.

Automated prompt engineering can certainly decrease at least some of the costs. https://arxiv.org/pdf/2211.01910.pdf. We did not consider it because of the difficulty of implementation and limited resources available.

What parts of prompt engineering really work?

Systematic empirical evaluation of correctness.

The prompt engineering guide contains a lot of techniques: https://www.promptingguide.ai/

Here are some unintuitive ones:

As a note, there seem to be small typos in many of the prompts in the guide - I wonder if this affects performance...

Additionally, examples such as the one at the bottom of https://www.promptingguide.ai/techniques/fewshot now seem to be passed by GPT-4: the chain-of-thought process leading to the correct answer triggers automatically. However, if you tell the model to be concise in its response, it often fails. And obviously if we use smaller/older models, there will probably be more failures.

Some of these techniques greatly increase response time. Is this a tradeoff worth making? There's also the problem of having more text to verify/read through when one gets a result, but I don't think this is a huge deal.

Choice of prompt engineering methods

Table to consider: [image]

Popular, easy, and powerful:

  • Zero-Shot Chain-of-Thought (Original)
  • APE Improved Zero-Shot Chain-of-Thought

Popular, moderately hard, powerful

  • Tree-of-thought (maybe allowing the prompt-only version, as the only code implementation I could find was a task-specific Sudoku solver)

Less popular, easy, less powerful

  • Self-refine
  • Least-to-most prompting

Popular basic methods:

  • Manual few shot
  • Manual (few-shot) chain-of-thought

With these we will get a good representation of the biggest methods. GSM8K and Last Letter have been used as datasets for a good chunk of them.

If time allows:

  • Automatic chain-of-thought

Add hypothesis testing for all metrics

This can be done as a comparison of metrics between the methods and direct prompting

Use a t-test for a difference of sample means, or potentially a permutation test or a bootstrap. Perhaps try all of them for robustness.

Confidence intervals might be the cleanest approach, as they would allow quickly calculating the side metrics based on the accuracy rates.

Many papers in the literature do not include any statistical inference!
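A sketch of what this could look like on per-question 0/1 correctness data, using a t-test plus a bootstrap CI for the accuracy difference; the arrays below are synthetic stand-ins for real evaluation results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 0/1 correctness per question for direct prompting vs. a prompt engineering method.
# Synthetic stand-ins for real evaluation results.
direct = rng.binomial(1, 0.55, size=200)
method = rng.binomial(1, 0.65, size=200)

# t-test for a difference in mean accuracy
t_stat, p_value = stats.ttest_ind(method, direct)

# bootstrap 95% CI for the accuracy difference
diffs = [rng.choice(method, len(method)).mean() - rng.choice(direct, len(direct)).mean()
         for _ in range(5000)]
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, "
      f"95% CI for accuracy gain: [{ci_low:.3f}, {ci_high:.3f}]")
```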

Finding the Most Common Prompt Engineering Techniques

Create a long list of prompt engineering techniques and evaluate their online popularity. Be very careful with minor string differences - "chain-of-thought" versus "chain of thought" versus "CoT", etc.
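A sketch of variant normalization before counting mentions; the canonical names and variant patterns are a small illustrative subset and would need to be extended.

```python
import re

# Map surface variants to a canonical technique name before counting mentions.
CANONICAL = {
    "chain-of-thought": [r"chain[- ]of[- ]thought", r"\bcot\b"],
    "tree-of-thought": [r"tree[- ]of[- ]thoughts?", r"\btot\b"],
    "few-shot prompting": [r"few[- ]shot"],
    "self-refine": [r"self[- ]refine"],
}

def count_mentions(posts):
    counts = {name: 0 for name in CANONICAL}
    for post in posts:
        text = post.lower()
        for name, patterns in CANONICAL.items():
            if any(re.search(p, text) for p in patterns):
                counts[name] += 1
    return counts

print(count_mentions(["CoT beats plain prompting",
                      "I tried chain of thought and tree-of-thoughts"]))
# {'chain-of-thought': 2, 'tree-of-thought': 1, 'few-shot prompting': 0, 'self-refine': 0}
```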

Reddit

Just use the search function to look for number of posts, or scrape posts

Here are the largest communities:

https://www.reddit.com/r/PromptEngineering/

https://www.reddit.com/r/ChatGPT/

https://www.reddit.com/r/ChatGPTPromptGenius/

https://www.reddit.com/r/PromptDesign/

Google Trends

https://trends.google.com/trends/explore?geo=US&hl=en-US

Google Scholar

Just check the number of citations on the original paper. Potentially do citations per month since release to adjust for recency.

Comparing LLM Metrics With Each Other and With Response Accuracy

I intend to compare several measurements of characteristics of LLM responses with each other, and with response accuracy.

Basic Metrics

  • Length/number of words generated
  • Punctuation
  • Vocabulary complexity; distinctive terms in correct vs. incorrect responses; distinctive terms in responses vs. correct answers
  • Paragraph size
  • Sentence length
  • Readability
  • Syntactic complexity
  • Presence of words that indicate ambiguity

A motivation for looking at length is that LLMs do stray off topic, as was the case with Bing's "Sydney" persona...

Summary statistics for all metrics overall, by model, by model and task type.

Run these on the prompt, the response, and a sourced correct answer.

Subtopics

Compare lengths of GPT answers versus lengths of solved out, correct answers. Look at this in percentage terms (or some other way) to account for different lengths on different questions. Are answers systematically more accurate when GPT answers are longer than established answers (meaning the model took its time and added more steps as necessary)? This could tell us something about step-by-step in general, rather than specific techniques. We can then analyze the quality of specific techniques in terms of their value add per extra token of response length/normalized to response length. Why do we care about length? Shorter = easier to verify, less expensive via pricing, etc.
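A sketch of that length-ratio analysis on a hypothetical results table (the column names and values are made up):

```python
import pandas as pd

# Hypothetical evaluation results: one row per question.
df = pd.DataFrame({
    "question_id": [1, 2, 3, 4],
    "gpt_answer_len": [120, 45, 300, 80],   # tokens in the model's answer
    "reference_len": [100, 60, 150, 80],    # tokens in the solved-out correct answer
    "correct": [1, 0, 1, 1],
})

# Length of the model's answer relative to the reference answer, in ratio terms.
df["length_ratio"] = df["gpt_answer_len"] / df["reference_len"]

# Is accuracy systematically higher when the model 'stretches out' its answer?
print(df.groupby(df["length_ratio"] > 1.0)["correct"].mean())
print(df[["length_ratio", "correct"]].corr())
```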

Response Speed

Is there a connection to accuracy? Can we use speed to assign confidence? How can we account for serious confounding issues related to server/connection speed, etc.?

Relying on External Sources

Number and attributes of quotations and citations. Tool and plug-in usage. Package usage within code. Package use seems to be subject to a large amount of hallucination.

Locating Errors

Where in a response do errors typically occur? The beginning, middle, or end? Does this vary by type of prompt?

Motivation

Despite the known importance of these problems, systematic evaluation of them seems to be lacking; most papers seem to focus solely on accuracy.

Tree of thoughts needs to allow backtracking?

Simply state that the LLM can revert to a previous step at any time in the initial prompt?

Somewhere in the second or later steps, explicitly note returning to a different choice as an option?
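One possible wording for the backtracking instruction in the initial prompt; this is untested and purely illustrative.

```python
# An untested, illustrative phrasing of the 'backtracking allowed' instruction
# for the initial tree-of-thought prompt.
TOT_BACKTRACK_PROMPT = (
    "Explore the problem as a tree of intermediate thoughts. At each step, propose "
    "up to 3 candidate next thoughts and rate each as 'sure', 'maybe', or 'impossible'. "
    "If every candidate at the current step is rated 'impossible', explicitly say "
    "'BACKTRACK' and return to the most recent earlier step that still has an "
    "unexplored 'sure' or 'maybe' candidate, then continue from there."
)
```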
