expectedparrot / edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

Home Page: https://docs.expectedparrot.com

License: MIT License

Languages: Python 98.92%, Makefile 0.84%, Jupyter Notebook 0.24%
Topics: anthropic, data-labeling, deepinfra, domain-specific-language, experiments, llama2, llm, llm-agent, llm-framework, llm-inference

edsl's Introduction

Expected Parrot Domain-Specific Language


The Expected Parrot Domain-Specific Language (EDSL) package lets you conduct computational social science and market research with AI. Use it to design surveys and experiments, simulate responses with large language models, and perform data labeling and other research tasks. EDSL comes with built-in methods for analyzing, visualizing and sharing your results.

🔗 Links

💡 Contributions, Feature Requests & Bugs

Interested in contributing? Want us to add a new feature? Found a nasty bug that you would like us to squash? Please send us an email at [email protected] or message us on our Discord server.

💻 Getting started

EDSL is compatible with Python 3.9 - 3.11.

pip install edsl

See https://www.expectedparrot.com/getting-started/ for examples and tutorials. Read the docs at http://docs.expectedparrot.com.

🔧 Dependencies

API keys for LLMs that you want to use, stored in a .env file
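
For example, a minimal .env file in your working directory might look like the following. The variable names here are illustrative: OPENAI_API_KEY is the conventional name for OpenAI, and the others are assumptions -- include only the providers you plan to use.

OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
DEEP_INFRA_API_KEY=your-deepinfra-key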

edsl's People

Contributors

apostolosfilippas, barabazs, johnjosephhorton, nurv, rbyh, zer0dss


edsl's Issues

to_csv only saves the second model's responses

When I do response.to_csv('xyz.csv') (response being a Results object, I think), it only saves the responses from the model that is run second. For example:

responses = q.by(scenarios).by(m4).by(m35).run()   # in this case the csv only has the responses from 3.5
responses = q.by(scenarios).by(m35).by(m4).run()   # in this case it only has the responses from 4

Fix .env, version, and db

  • .env should live in cwd
  • Version of the package should be retrievable quickly through code (see the sketch after this list)
  • Objects stored in the db should have a package version column -- this will make our lives easier in db migrations
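
A minimal sketch of retrieving the installed package version quickly through code, using only the Python standard library (this assumes the package is installed under the name "edsl"):

from importlib.metadata import version

print(version("edsl"))  # e.g. "0.1.8"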

Feature Request: Create a method for adding a set of parameterized questions to the same survey

We should create a method for adding a set of parameterized questions all to the same survey. Scenarios does not currently allow this -- when you add a parameterized Question to a Survey and then do survey.by(scenarios) to insert parameter values you create a new Job for each parameter value, i.e., the single-question survey is administered separately for each parameter value.

For example, this method creates a set of parameterized questions that we can then add to a survey using the add_question() method (or Survey(questions=[...]) construction) in order to have them all within a survey:

from edsl.questions import QuestionLinearScale

def add_product_rating_questions(survey):
    # PRODUCTS is assumed to be a list of product names defined elsewhere.
    for product in PRODUCTS:
        question_name = f"rating_{product.replace(' ', '_').replace('-', '').lower()}"
        question_text = f"Rate how desirable {product} is to you right now, on a scale from 1 to 7:"
        options = list(range(1, 8))  # match the 1-7 scale in the question text
        survey.add_question(QuestionLinearScale(question_name=question_name, question_text=question_text, question_options=options))

Conceptually, what we want is a "vertical" application of question scenarios within a survey.
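
For illustration, a hypothetical use of the helper above. The PRODUCTS list, the empty starting survey, and the assumption that add_question mutates the survey in place are all just for this sketch:

from edsl import Survey

PRODUCTS = ["snow shovel", "umbrella", "sunscreen"]  # hypothetical product list

survey = Survey(questions=[])
add_product_rating_questions(survey)  # assumes add_question mutates the survey in place
results = survey.run()                # one survey containing all the rating questions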

Create a capability to allow multiple runs of the same Job cached independently

Problem: When you run a Job that is identical to a Job that has already been run, you get the cached result. This is not desired if you want to use cached results but also see multiple runs. The cache is essentially over-eager. We want to pass a free parameter that will trigger re-runs and cache those new results. Right now we are hacking this by slightly adjusting the temperature, but this is not ideal because it is not an identical Job.
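
A minimal, hypothetical sketch of what independent caching of repeated runs could look like. The cache structure, job_hash, run_fn, and the run_index parameter are all illustrative, not the current EDSL API:

import hashlib
import json

def job_hash(job_description: dict) -> str:
    # Stable stand-in for however EDSL identifies an identical Job.
    return hashlib.sha256(json.dumps(job_description, sort_keys=True).encode()).hexdigest()

def cached_run(run_fn, job_description: dict, cache: dict, run_index: int = 0):
    # Keying the cache on (job, run_index) lets identical Jobs be re-run and
    # cached independently, instead of always returning the first cached result.
    key = (job_hash(job_description), run_index)
    if key not in cache:
        cache[key] = run_fn(job_description)
    return cache[key]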

Report re-factor goals

  • Code re-factored into smaller modules with unit tests
  • Notebook-based documentation
  • Some form of testing of image outputs
  • Discoverability in "main" edsl
  • Automated "short names" functionality
  • Output of files (not just HTML)

print() not working in 0.1.8

There is no error generated, but no content is printed.
To reproduce the issue with different question types:

from edsl.questions import QuestionLinearScale, QuestionMultipleChoice

q = QuestionLinearScale(
    question_name = "exercise",
    question_text = "How many times do you typically exercise each week?",
    question_options = [0,1,2,3,4,5,6,7]
)
r = q.run()
r.print()

q = QuestionMultipleChoice(
    question_name = "exercise",
    question_text = "How often do you typically exercise each week?",
    question_options = ["Never","Sometimes","Often"]
)
r = q.run()
r.print()

Re-factor questions to not use Pydantic

Background

We are unable to install edsl on Google Colab due to some problems with the Pydantic package. Rather than work around it, we might consider re-factoring Questions to not use Pydantic. This Gist shows what Questions and QuestionMultipleChoice would look like without Pydantic for validation:

What a change looks like

https://gist.github.com/johnjosephhorton/9cb426fd349c9328a0071cf23510126d

Cons

  • Probably a day of work to re-factor
  • Our validators would not be as good, most likely
  • Some of our FastAPI work would be lost

Pros

  • One less dependency
  • We could likely install on Colab
  • Substantial code simplicity (avoid weird meta-class stuff)

Re-factor Questions

  • Refactor Question.py so init is not called
  • Write custom validators for answers & questions - start with MC & free text
  • Move prompt creation out of Question.py to separate classes; refactor interviewing accordingly
  • Turn over system prompt and fine-grained prompt control to the user
  • Use Metaclass for the question registry

A Plan for Handling API / Task Exceptions

A few considerations:

  • The process from Job ---> model call is quite complex, going through several main edsl objects. This is of course by design, but it makes understanding where failures are happening difficult.
  • We do not want one failure to end the job---as long as forward progress can still be made with uncompleted tasks, we should continue. We should only stop when the queue of remaining tasks that can be done is empty.
  • try/except blocks are probably not helpful early on when we are trying to learn the nature of exceptions. We should catch no exceptions and let them surface.
  • Caching is very helpful in actually getting things done because partial failures are surmountable
  • When a task fails, all tasks that depend on that task should fail as well, to save time.
  • We need to communicate back to the user the nature of failures and suggest improvements
  • We should do lots of pre-LLM call checks to limit number of failures that happen at model-level

Strategy

  • Add real edsl notebooks to the integration test
  • Make lots of simulated jobs with complex dependencies, inject exceptions at different points, and test that they fail appropriately (see the sketch below)
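
A minimal, framework-free sketch of the kind of simulated job such a test might use: tasks with dependencies, an exception injected at one point, and dependent tasks failed fast without being run. All of the names here are hypothetical, not EDSL APIs, and tasks are assumed to be listed in dependency order:

class SimulatedTask:
    def __init__(self, name, depends_on=(), fail=False):
        self.name = name
        self.depends_on = list(depends_on)
        self.fail = fail          # inject an exception at this point
        self.status = "pending"   # pending -> done / failed / skipped

def run_simulated_job(tasks):
    """Run tasks in order, failing fast (status 'skipped') any task whose dependency failed."""
    by_name = {t.name: t for t in tasks}
    for t in tasks:
        if any(by_name[d].status in ("failed", "skipped") for d in t.depends_on):
            t.status = "skipped"  # dependents of a failed task are not run, to save time
            continue
        try:
            if t.fail:
                raise RuntimeError(f"injected failure in {t.name}")
            t.status = "done"
        except RuntimeError:
            t.status = "failed"
    return {t.name: t.status for t in tasks}

# Example: b fails, so c (which depends on b) is skipped, while d still completes.
tasks = [
    SimulatedTask("a"),
    SimulatedTask("b", depends_on=["a"], fail=True),
    SimulatedTask("c", depends_on=["b"]),
    SimulatedTask("d", depends_on=["a"]),
]
assert run_simulated_job(tasks) == {"a": "done", "b": "failed", "c": "skipped", "d": "done"}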

JSON error using default LLM for free text question


Code to reproduce:

from edsl.questions import QuestionFreeText
q_example = QuestionFreeText(
    question_name = "examples",
    question_text = "Draft 3 different summary sections of resumes for software engineers."
)
r_example = q_example.run()

Prompt Re-Factor PEP

We need to give users fine-grained control over prompts, allowing them to experiment as they see fit, while still providing sensible defaults. Also, prompts are data and should probably not be stored in the repository long-term. Finally, users need detailed insight into precisely which prompts are used.

This note describes a proposed plan of re-factoring.

Prompt class

  • There will be a base class with common functionality, such as rendering a template, combining two prompts, and so on.
  • It will be sub-classed into different kinds of prompts:
    • question_data (the actual rendered text of a user question, which is often a Jinja2 template)
    • question_instructions (e.g., describing to the model how to answer a multiple-choice question)
    • agent_instructions (e.g., "You are a helpful agent taking a survey...")
    • past_question_memory
    • agent_data (e.g., "Your traits are...")
  • These component types will have corresponding attributes that child classes must have; e.g., a question_instructions component must have attributes such as:
    • the model it is appropriate for
    • the question_type, and so on
    These are enforced by a metaclass at registration and/or an ABC.

Prompt Registry for standard prompts

  • A metaclass records the creation of all prompt classes and stores them in a trie-ish data structure
  • The registry supports look-up based on prompt type, question type, model, etc., returning what it can by matching keywords; e.g., get_classes(question_type="multiple-choice") will return all Prompt classes for multiple-choice, regardless of other parameters (see the sketch below).
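
A minimal sketch of what such a metaclass-backed registry with keyword look-up could look like. Class names and attributes here are illustrative, and a plain list stands in for the trie-ish structure:

class RegisterPromptsMeta(type):
    """Records every Prompt subclass so it can be looked up by its attributes."""
    _registry = []

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        if bases:  # skip the abstract base class itself
            mcs._registry.append(cls)
        return cls

    @classmethod
    def get_classes(mcs, **filters):
        """Return all registered Prompt classes whose attributes match the keyword filters."""
        return [cls for cls in mcs._registry
                if all(getattr(cls, key, None) == value for key, value in filters.items())]

class Prompt(metaclass=RegisterPromptsMeta):
    component_type = None

class MultipleChoiceInstructions(Prompt):
    component_type = "question_instructions"
    question_type = "multiple-choice"
    model = "gpt-4"

# Keyword-based look-up, as in get_classes(question_type="multiple-choice")
assert RegisterPromptsMeta.get_classes(question_type="multiple-choice") == [MultipleChoiceInstructions]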

PromptCollection

These various prompt fragments are combined at 'run time' making use of the model, question type, scenario and so on. The user can optionally control this themselves.

User-defined prompts

Users can add their own prompts to the registry at run-time by sub-classing Prompt.

When a survey is run:

  • Prompts are determined at run time by looking at the question being asked, the model, and so on.
  • Users can add their own list of prompts to run e.g., q.run(custom_prompts = [p1, p2])
  • User prompts are used first; ties are broken with some notion of precedence, tbd
  • An unused entry in custom_prompts throws a warning
  • If the prompt text is a template, it is rendered before being sent to the LLM
  • Users can add a parameter to run() to indicate that, if multiple prompts fit, all of them should be run.

Rendered prompts are stored in the Result object

  • With a results object, users will be able to do results.select("user_prompt") and see the user prompt, for example.

General cleanup and re-factor ideas

  • Create a 'Prompt' class and allow questions to have model-specific prompts
  • Remove HTML / form-based display from Questions as largely unused
  • Remove the notion of survey and question UUIDs
  • Move "docx" method to edsl[extra]
  • Move all plotting / output methods to edsl[extra]
  • Refactor "type" in Question to "question_type" as type is a reserved keyword in python, which causes trouble
  • ??

Questions: Consolidated improvement ideas

  • QuestionFunctional and Question.__add__ need work
  • Modify question tests to use the class registry (similar to what is done with the question integration test, but without the API call). Reason: we want to be able to easily modify our tests.
  • We should always allow both {'answer': 1} and {'answer': '1'} to go through. See QuestionExtract in the integration tests, where I think this is failing.
  • integration/test_questions.py should
    • Test serialization / deserialization
    • Test the examples w/ all registered models
    • Introspect code and try leaving out values to make sure it throws exceptions
  • Deprecate non-responses
  • Proposal: Eliminate short_names_dict and instead have a ShortNames object that takes a question as input and returns a dictionary. Thoughts @apostolosfilippas ?
  • Get consistent on comments treatment
  • Descriptors could be made DRYer/better
  • Answer validation could be made DRYer/better

Cache Survey responses collectively

Currently we cache LLM responses individually but not collectively for a full Survey. We should also store Survey responses together by default.

Add a dry_run() method to show prompts, survey logic details, estimated cost, etc., before job starts

Method for showing prompts could look like this:

def print_full_prompt(question, agent, scenario=None):
    """Prints the full prompt and system prompt for the given question and agent."""
    if scenario is not None:
        system_prompt = agent.construct_system_prompt()
        case_prompt = question.get_prompt(scenario=scenario)

        print(f"Prompt for Scenario: {scenario}")
        print(f"System Prompt: {system_prompt}")
        print(f"Scenario Prompt: {case_prompt}\n")
    else:
        system_prompt = agent.construct_system_prompt(question)
        prompt = question.get_prompt()

        print(f"Prompt: {prompt}")
        print(f"System Prompt: {system_prompt}")

Example usage:

from edsl.agents import Agent
from edsl.questions import QuestionMultipleChoice
from edsl.scenarios import Scenario

agent = Agent(traits={"age": 44, "gender": "female"})

question = QuestionMultipleChoice(
    question_text="Do you enjoy {{activity}}?",
    question_options=["Yes", "No"],
    question_name="activities",
)

activities = ["drafting surveys", "taking surveys", "thinking about survey software"]
scenarios = [Scenario({"activity": a}) for a in activities]

for scenario in scenarios:
    print_full_prompt(question, agent, scenario)

Need to consider how this works with future methods for seeding questions.
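
On the estimated-cost piece of dry_run(), a very rough, hypothetical sketch; the characters-per-token heuristic and the price are placeholders supplied by the caller, not real pricing data:

def estimate_cost(prompts, price_per_1k_tokens):
    # Rough heuristic: roughly 4 characters per token.
    total_tokens = sum(len(p) // 4 for p in prompts)
    return total_tokens / 1000 * price_per_1k_tokens

prompts = ["How do you feel about Mondays?", "How did you feel about Saturdays?"]
print(f"Estimated cost: ~${estimate_cost(prompts, price_per_1k_tokens=0.01):.4f}")  # placeholder price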

QuestionFreeText generating JSON errors

QuestionFreeText is generating JSON errors when the model provides list responses (even without "list" specified in the question_text).
This is happening with GPT-3.5 and GPT-4.


Model / Infra coverage

  • Google AI Studio
  • Google Model Garden
  • OpenAI
  • Mistral
  • Deep Infra
    • Llama models
    • Mixtral
  • Hugging Face
  • Open Router
  • Azure
    • Llama models
    • OpenAI models
  • AWS Bedrock
    • Llama Models
    • Claude V2

Making async robust

  • I'm not sure we can use 'repair' with an API call like we have in the past without adding it to the event loop
  • Use the task group context manager (Nope)

Feature Request: compose_questions should accommodate 3+ questions; fix syntax for moving data to the 3rd+ question

Currently, compose_questions only allows for combining 2 questions. So if you want to do 3, then you have to do this:

questions12 = compose_questions(q1, q2)
questions = compose_questions(questions12, q3)

Not that big a deal for 3 questions, but bad if the number of questions gets bigger.

Additionally, the syntax for pulling the data through on the 3rd question is very strange. When using compose_questions() on two questions, you can use the name of the 1st question as a variable in the second. However, when you want to add a 3rd question (as I did above), you need to combine the names of the first question and the second question. See the example below; I left an inline comment at the problem.

from edsl.questions import QuestionFreeText, QuestionMultipleChoice

# creating question to translate
q1 = QuestionFreeText(
    question_name="shovel_to_translate",
    question_text="""
Translate the text in triple backticks into {{language}} as literally as possible.
This text is not about you in any way; you must just translate it and nothing else.
Do not include any other text besides the direct translation.
'''
A hardware store has been selling snow shovels for $15.
The morning after a large snowstorm, the store {{store_action}} ${{new_price}}.
You are a {{politics}}.
How would you rate this action?
'''
""",
)

q2 = QuestionFreeText(
    question_name="shovel_to_english",
    question_text="""
Translate the text in triple backticks into English as literally as possible.
Do not include any other text besides the direct translation.
'''
{{shovel_to_translate}}
'''
""",
)

q3 = QuestionMultipleChoice(
    question_name="final_shovel_question",
    question_text="""
#### THIS IS WEIRD, I THINK IT SHOULD JUST LET ME DO {{shovel_to_translate}} FOR CLARITY ###
{{shovel_to_translate_shovel_to_english}}
""",
    question_options=["Completely Fair", "Acceptable", "Unfair", "Very Unfair"],
)

QuestionLinearScale is not printing answer comment field

An answer "comment" field is automatically captured for each agent response. See the common prompt:

Return a valid JSON formatted like this, selecting only the number of the option: 
{"answer": <put answer code here>, "comment": "<put explanation here>"}

We can print the answer comment for other question types (e.g., multiple choice) but not for linear scale. We do see that the comment exists in the db.

Improvements

  • Add repair function to the event loop (see Could not Load JSON)
  • Create example for functional question type
  • Figure out budget issue

from integration/test_questions.py

Now running: budget
Error running budget: Error running job. Exception: Answer key 'answer' must be of type dict (got 2)..
Now running: checkbox
Now running: extract
Could not load JSON. Trying to repair.
{"answer": {'name': 'Moby Dick', 'profession': 'truck driver'}}
Error running extract: Error running job. Exception: '_asyncio.Task' object is not subscriptable.
Now running: free_text
Now running: functional
Error running functional: type object 'QuestionFunctional' has no attribute 'example'
Now running: list
Now running: multiple_choice
Now running: numerical
Now running: rank
Now running: likert_five
Now running: linear_scale
Now running: top_k
Now running: yes_no

What should we be saving to DB?

For now, we're only saving LLM calls. Results could be a good candidate. User-facing objects should have an easy way to be retrieved from the db.

Results 'user_prompt' fields are sometimes printing prompts for different questions

The results include a field prompt.<question_name>_user_prompt that shows the prompt for the relevant question. This field is question-specific and should be identical for each agent. However, when a survey has multiple questions and agents, we observe that the field shows different question texts. For example, the following code can produce incorrect user prompts, where the wrong question text is shown:

from edsl import Agent, Survey
from edsl import QuestionMultipleChoice

q1 = QuestionMultipleChoice(
    question_name = "mon",
    question_text = "How do you feel about Mondays?",
    question_options = ["Bad", "OK", "Good"]
)

q2 = QuestionMultipleChoice(
    question_name = "sat",
    question_text = "How did you feel about Saturdays?",
    question_options = ["Bad", "OK", "Good"]
)

personas = ["You like sunny days.", "You like rainy days."]

agents = [Agent({"persona":p}) for p in personas]

survey = Survey([q1,q2])

results = survey.by(agents).run()

results.select('prompt.mon_user_prompt','prompt.sat_user_prompt').print()

Note: This problem does not always happen -- the above code may show correct prompts!

Add Survey Method

  • Combines questions with add
  • Check for question_name conflicts (see the sketch after this list)
  • Store the question_name version of rules and re-apply
  • If any "first survey" EoS rules w/ priority > 1, ask user what to do

Answer randomization

Create an option on multiple choice questions to try all possible orderings of the answer options (or a random subset) to deal with potential ordering effects. Also add a way for the user to track the order in the output data frame.
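
A minimal sketch of the idea: generate all orderings of the options (or a random subset), administer the question once per ordering, and record the ordering so it can be written to the output data. Names here are illustrative:

import random
from itertools import permutations

def option_orderings(options, n_random=None, seed=0):
    """Return all orderings of the options, or a random subset of size n_random."""
    all_orders = [list(p) for p in permutations(options)]
    if n_random is not None:
        random.Random(seed).shuffle(all_orders)
        all_orders = all_orders[:n_random]
    return all_orders

options = ["Bad", "OK", "Good"]
for order in option_orderings(options, n_random=2):
    # Each ordering would become one administration of the question; storing
    # `order` alongside the response lets the user track ordering effects.
    print(order)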

Feature Request: Add more LLM models

Models to add:

  • Mistral
  • Llama
  • Gemini
  • Claude V2

Other changes

  • Metaclass for automatic class registry
  • Really improve the documentation so people can see how to add their own models
  • Get advice from @apostolosfilippas on how we make the config file gracefully handle (many) tokens

Speed!

  • Use asyncio to conduct_interviews (see the sketch after this list)
  • Compute the within-survey dependency DAG & execute questions optimally
  • Multiplex API keys (if available)
  • Instrumentation to track token/usage rates
  • Consider creating a local API service to measure through-put
  • Create a queue-based system so we can accommodate jobs from different users
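
A minimal sketch of the asyncio pattern implied by the first bullet: run interviews concurrently, bounded by a semaphore to respect rate limits. run_interview is a stand-in coroutine, not the real EDSL function:

import asyncio

async def run_interview(interview_id: int) -> str:
    """Stand-in for a single agent/survey interview that awaits LLM calls."""
    await asyncio.sleep(0.1)  # placeholder for awaiting model responses
    return f"result for interview {interview_id}"

async def conduct_interviews(interview_ids, max_concurrent: int = 10):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(i):
        async with semaphore:
            return await run_interview(i)

    return await asyncio.gather(*(bounded(i) for i in interview_ids))

results = asyncio.run(conduct_interviews(range(25)))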

SQL Improvements / bugs

  • The row number of the pandas dataframe is becoming the column name after a transpose, which is not desired


  • Do we like the name self?

  • Does id make sense? index?

  • sample_number/replication/run?
