expectedparrot / edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

Home Page: https://docs.expectedparrot.com

License: MIT License

Languages: Python 98.92%, Makefile 0.84%, Jupyter Notebook 0.24%
Topics: anthropic, data-labeling, deepinfra, domain-specific-language, experiments, llama2, llm, llm-agent, llm-framework, llm-inference

edsl's Introduction

Expected Parrot Domain-Specific Language


The Expected Parrot Domain-Specific Language (EDSL) package lets you conduct computational social science and market research with AI. Use it to design surveys and experiments, simulate responses with large language models, and perform data labeling and other research tasks. EDSL comes with built-in methods for analyzing, visualizing and sharing your results.

🔗 Links

💡 Contributions, Feature Requests & Bugs

Interested in contributing? Want us to add a new feature? Found a nasty bug that you would like us to squash? Please send us an email at [email protected] or message us on our Discord server.

💻 Getting started

EDSL is compatible with Python 3.9 - 3.11.

pip install edsl

See https://www.expectedparrot.com/getting-started/ for examples and tutorials. Read the docs at http://docs.expectedparrot.com.

🔧 Dependencies

API keys for LLMs that you want to use, stored in a .env file
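
For example, a minimal .env file in your working directory might look like the following. The variable names here are illustrative: OPENAI_API_KEY is the conventional name for OpenAI, and the others are assumptions -- include only the providers you plan to use.

OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
DEEP_INFRA_API_KEY=your-deepinfra-key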

edsl's People

Contributors

apostolosfilippas, barabazs, johnjosephhorton, nurv, rbyh, zer0dss


edsl's Issues

to_csv only saves the second model's responses

When I do response.to_csv('xyz.csv') (response being a Results object, I think), it only saves the responses from the model that is run second. For example:

responses = q.by(scenarios).by(m4).by(m35).run()   # in this case the csv only has the responses from 3.5
responses = q.by(scenarios).by(m35).by(m4).run()   # in this case it only has the responses from 4

Fix .env, version, and db

  • .env should live in cwd
  • Version of the package should be retrievable quickly through code (see the sketch after this list)
  • Objects stored in the db should have a package version column -- this will make our lives easier in db migrations
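
A minimal sketch of retrieving the installed package version quickly through code, using only the Python standard library (this assumes the package is installed under the name "edsl"):

from importlib.metadata import version

print(version("edsl"))  # e.g. "0.1.8"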

Feature Request: Create a method for adding a set of parameterized questions to the same survey

We should create a method for adding a set of parameterized questions all to the same survey. Scenarios does not currently allow this -- when you add a parameterized Question to a Survey and then do survey.by(scenarios) to insert parameter values you create a new Job for each parameter value, i.e., the single-question survey is administered separately for each parameter value.

For example, this method creates a set of parameterized questions that we can then add to a survey using the add_question() method (or Survey(questions=[...]) construction) in order to have them all within a survey:

from edsl.questions import QuestionLinearScale

def add_product_rating_questions(survey):
    # PRODUCTS is assumed to be a list of product names defined elsewhere.
    for product in PRODUCTS:
        question_name = f"rating_{product.replace(' ', '_').replace('-', '').lower()}"
        question_text = f"Rate how desirable {product} is to you right now, on a scale from 1 to 7:"
        options = list(range(1, 8))  # match the 1-7 scale in the question text
        survey.add_question(QuestionLinearScale(question_name=question_name, question_text=question_text, question_options=options))

Conceptually, what we want is a "vertical" application of question scenarios within a survey.
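
For illustration, a hypothetical use of the helper above. The PRODUCTS list, the empty starting survey, and the assumption that add_question mutates the survey in place are all just for this sketch:

from edsl import Survey

PRODUCTS = ["snow shovel", "umbrella", "sunscreen"]  # hypothetical product list

survey = Survey(questions=[])
add_product_rating_questions(survey)  # assumes add_question mutates the survey in place
results = survey.run()                # one survey containing all the rating questions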

Create a capability to allow multiple runs of the same Job cached independently

Problem: When you run a Job that is identical to a Job that has already been run, you get the cached result. This is not desired if you want to use cached results but also see multiple runs. The cache is essentially over-eager. We want to pass a free parameter that will trigger re-runs and cache those new results. Right now we are hacking this by slightly adjusting the temperature, but this is not ideal because it is not an identical Job.
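
A minimal, hypothetical sketch of what independent caching of repeated runs could look like. The cache structure, job_hash, run_fn, and the run_index parameter are all illustrative, not the current EDSL API:

import hashlib
import json

def job_hash(job_description: dict) -> str:
    # Stable stand-in for however EDSL identifies an identical Job.
    return hashlib.sha256(json.dumps(job_description, sort_keys=True).encode()).hexdigest()

def cached_run(run_fn, job_description: dict, cache: dict, run_index: int = 0):
    # Keying the cache on (job, run_index) lets identical Jobs be re-run and
    # cached independently, instead of always returning the first cached result.
    key = (job_hash(job_description), run_index)
    if key not in cache:
        cache[key] = run_fn(job_description)
    return cache[key]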

Report re-factor goals

  • Code re-factored into smaller modules with unit tests
  • Notebook-based documentation
  • Some form of testing of image outputs
  • Discoverability in "main" edsl
  • Automated "short names" functionality
  • Output of files (not just HTML)

print() not working in 0.1.8

There is no error generated, but no content is printed.
To reproduce the issue with different question types:

from edsl.questions import QuestionLinearScale, QuestionMultipleChoice

q = QuestionLinearScale(
    question_name = "exercise",
    question_text = "How many times do you typically exercise each week?",
    question_options = [0,1,2,3,4,5,6,7]
)
r = q.run()
r.print()

q = QuestionMultipleChoice(
    question_name = "exercise",
    question_text = "How often do you typically exercise each week?",
    question_options = ["Never","Sometimes","Often"]
)
r = q.run()
r.print()

Re-factor questions to not use Pydantic

Background

We are unable to install edsl on Google Colab due to some problems with the Pydantic package. Rather than work around it, we might consider re-factoring Questions to not use Pydantic. This Gist shows what Questions and QuestionMultipleChoice would look like without Pydantic for validation:

What a change looks like

https://gist.github.com/johnjosephhorton/9cb426fd349c9328a0071cf23510126d

Cons

  • Probably a day of work to re-factor
  • Our validators would not be as good, most likely
  • Some of our FastAPI work would be lost

Pros

  • One less dependency
  • We could likely install on Colab
  • Substantial code simplicity (avoid weird meta-class stuff)

Re-factor Questions

  • Refactor Question.py so init is not called
  • Write custom validators for answers & questions - start with MC & free text
  • Move prompt creation out of Question.py to separate classes; refactor interviewing accordingly
  • Turn over system prompt and fine-grained prompt control to the user
  • Use Metaclass for the question registry

A Plan for Handling API / Task Exceptions

A few considerations:

  • The process from Job ---> model call is quite complex, going through several main edsl objects. This is of course by design, but it makes understanding where failures are happening difficult.
  • We do not want one failure to end the job---as long as forward progress can still be made with uncompleted tasks, we should continue. We should only stop when the queue of remaining tasks that can be done is empty.
  • try/except blocks are probably not helpful early on when we are trying to learn the nature of exceptions. We should catch no exceptions and let them surface.
  • Caching is very helpful in actually getting things done because partial failures are surmountable
  • When a task fails, all tasks that depend on that task should fail as well, to save time.
  • We need to communicate back to the user the nature of failures and suggest improvements
  • We should do lots of pre-LLM call checks to limit number of failures that happen at model-level

Strategy

  • Add real edsl notebooks to the integration test
  • Make lots of simulated jobs with complex dependencies, inject exceptions at different points, and test that they fail appropriately (see the sketch below)
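
A minimal, framework-free sketch of the kind of simulated job such a test might use: tasks with dependencies, an exception injected at one point, and dependent tasks failed fast without being run. All of the names here are hypothetical, not EDSL APIs, and tasks are assumed to be listed in dependency order:

class SimulatedTask:
    def __init__(self, name, depends_on=(), fail=False):
        self.name = name
        self.depends_on = list(depends_on)
        self.fail = fail          # inject an exception at this point
        self.status = "pending"   # pending -> done / failed / skipped

def run_simulated_job(tasks):
    """Run tasks in order, failing fast (status 'skipped') any task whose dependency failed."""
    by_name = {t.name: t for t in tasks}
    for t in tasks:
        if any(by_name[d].status in ("failed", "skipped") for d in t.depends_on):
            t.status = "skipped"  # dependents of a failed task are not run, to save time
            continue
        try:
            if t.fail:
                raise RuntimeError(f"injected failure in {t.name}")
            t.status = "done"
        except RuntimeError:
            t.status = "failed"
    return {t.name: t.status for t in tasks}

# Example: b fails, so c (which depends on b) is skipped, while d still completes.
tasks = [
    SimulatedTask("a"),
    SimulatedTask("b", depends_on=["a"], fail=True),
    SimulatedTask("c", depends_on=["b"]),
    SimulatedTask("d", depends_on=["a"]),
]
assert run_simulated_job(tasks) == {"a": "done", "b": "failed", "c": "skipped", "d": "done"}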

JSON error using default LLM for free text question


Code to reproduce:

from edsl.questions import QuestionFreeText
q_example = QuestionFreeText(
    question_name = "examples",
    question_text = "Draft 3 different summary sections of resumes for software engineers."
)
r_example = q_example.run()

Prompt Re-Factor PEP

We need to give users fine-grained control over prompts, allowing them to experiment as they see fit, while still providing sensible defaults. Also, prompts are data and should probably not be stored in the repository long-term. Finally, users need detailed insight into precisely which prompts are used.

This note describes a proposed plan of re-factoring.

Prompt class

  • There will be a base class with common functionality, such as rendering a template, combining two prompts, and so on.
  • It will be sub-classed into different kinds of prompts:
    • question_data (the actual rendered text of a user question, which is often a Jinja2 template)
    • question_instructions (e.g., describing to the model how to answer a multiple-choice question)
    • agent_instructions (e.g., "You are a helpful agent taking a survey...")
    • past_question_memory
    • agent_data (e.g., "Your traits are...")
  • These component types will have corresponding attributes that child classes must have; e.g., a question_instructions component must have attributes such as:
    • the model it is appropriate for
    • the question_type, and so on
    These are enforced by a metaclass at registration and/or an ABC.

Prompt Registry for standard prompts

  • A metaclass records the creation of all prompt classes and stores them in a trie-ish data structure
  • The registry supports look-up based on prompt type, question type, model, etc., returning what it can by matching keywords; e.g., get_classes(question_type="multiple-choice") will return all Prompt classes for multiple-choice, regardless of other parameters (see the sketch below).
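
A minimal sketch of what such a metaclass-backed registry with keyword look-up could look like. Class names and attributes here are illustrative, and a plain list stands in for the trie-ish structure:

class RegisterPromptsMeta(type):
    """Records every Prompt subclass so it can be looked up by its attributes."""
    _registry = []

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        if bases:  # skip the abstract base class itself
            mcs._registry.append(cls)
        return cls

    @classmethod
    def get_classes(mcs, **filters):
        """Return all registered Prompt classes whose attributes match the keyword filters."""
        return [cls for cls in mcs._registry
                if all(getattr(cls, key, None) == value for key, value in filters.items())]

class Prompt(metaclass=RegisterPromptsMeta):
    component_type = None

class MultipleChoiceInstructions(Prompt):
    component_type = "question_instructions"
    question_type = "multiple-choice"
    model = "gpt-4"

# Keyword-based look-up, as in get_classes(question_type="multiple-choice")
assert RegisterPromptsMeta.get_classes(question_type="multiple-choice") == [MultipleChoiceInstructions]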

PromptCollection

These various prompt fragments are combined at 'run time' making use of the model, question type, scenario and so on. The user can optionally control this themselves.

User-defined prompts

Users can add their own prompts to the registry at run-time by sub-classing Prompt.

When a survey is run:

  • Prompts are determined at run time by looking at the question being asked, the model, and so on.
  • Users can add their own list of prompts to run e.g., q.run(custom_prompts = [p1, p2])
  • User prompts are used first; ties are broken with some notion of precedence, tbd
  • An unused entry in custom_prompts throws a warning
  • If the prompt text is a template, it is rendered before being sent to the LLM
  • Users can add a parameter to run() to indicate that, if multiple prompts fit, all of them should be run.

Rendered prompts are stored in the Result object

  • With a results object, users will be able to do results.select("user_prompt") and see the user prompt, for example.

General cleanup and re-factor ideas

  • Create a 'Prompt' class and allow questions to have model-specific prompts
  • Remove HTML / form-based display from Questions as largely unused
  • Remove the notion of survey and question UUIDs
  • Move "docx" method to edsl[extra]
  • Move all plotting / output methods to edsl[extra]
  • Refactor "type" in Question to "question_type" as type is a reserved keyword in python, which causes trouble
  • ??

Questions: Consolidated improvement ideas

  • QuestionFunctional and Question.__add__ need work
  • Modify question tests to use the class registry (similar to what is done with the question integration test, but without the API call). Reason: we want to be able to easily modify our tests.
  • We should always allow both {'answer': 1} and {'answer': '1'} to go through. See QuestionExtract in the integration tests, where I think this is failing.
  • integration/test_questions.py should
    • Test serialization / deserialization
    • Test the examples w/ all registered models
    • Introspect code and try leaving out values to make sure it throws exceptions
  • Deprecate non-responses
  • Proposal: Eliminate short_names_dict and instead have a ShortNames object that takes a question as input and returns a dictionary. Thoughts @apostolosfilippas ?
  • Get consistent on comments treatment
  • Descriptors could be made DRYer/better
  • Answer validation could be made DRYer/better

Cache Survey responses collectively

Currently we cache LLM responses individually but not collectively for a full Survey. We should also store Survey responses together by default.

Add a dry_run() method to show prompts, survey logic details, estimated cost, etc., before job starts

Method for showing prompts could look like this:

def print_full_prompt(question, agent, scenario=None):
    """Prints the full prompt and system prompt for the given question and agent."""
    if scenario is not None:
        system_prompt = agent.construct_system_prompt()
        case_prompt = question.get_prompt(scenario=scenario)

        print(f"Prompt for Scenario: {scenario}")
        print(f"System Prompt: {system_prompt}")
        print(f"Scenario Prompt: {case_prompt}\n")
    else:
        system_prompt = agent.construct_system_prompt(question)
        prompt = question.get_prompt()

        print(f"Prompt: {prompt}")
        print(f"System Prompt: {system_prompt}")

Example usage:

from edsl.agents import Agent
from edsl.questions import QuestionMultipleChoice
from edsl.scenarios import Scenario

agent = Agent(traits={"age": 44, "gender": "female"})

question = QuestionMultipleChoice(
    question_text="Do you enjoy {{activity}}?",
    question_options=["Yes", "No"],
    question_name="activities",
)

activities = ["drafting surveys", "taking surveys", "thinking about survey software"]
scenarios = [Scenario({"activity": a}) for a in activities]

for scenario in scenarios:
    print_full_prompt(question, agent, scenario)

Need to consider how this works with future methods for seeding questions.
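
On the estimated-cost piece of dry_run(), a very rough, hypothetical sketch; the characters-per-token heuristic and the price are placeholders supplied by the caller, not real pricing data:

def estimate_cost(prompts, price_per_1k_tokens):
    # Rough heuristic: roughly 4 characters per token.
    total_tokens = sum(len(p) // 4 for p in prompts)
    return total_tokens / 1000 * price_per_1k_tokens

prompts = ["How do you feel about Mondays?", "How did you feel about Saturdays?"]
print(f"Estimated cost: ~${estimate_cost(prompts, price_per_1k_tokens=0.01):.4f}")  # placeholder price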

QuestionFreeText generating JSON errors

QuestionFreeText is generating JSON errors when the model provides list responses (even without "list" specified in the question_text).
This is happening with GPT-3.5 and GPT-4.


Model / Infra coverage

  • Google AI Studio
  • Google Model Garden
  • OpenAI
  • Mistral
  • Deep Infra
    • Llama models
    • Mixtral
  • Hugging Face
  • Open Router
  • Azure
    • Llama models
    • OpenAI models
  • AWS Bedrock
    • Llama Models
    • Claude V2

Making async robust

  • I'm not sure we can use 'repair' with an API call like we have in the past without adding it to the event loop
  • Use the task group context manager (Nope)

Feature Request: compose_questions should accommodate 3+ questions; fix syntax for moving data to the 3rd+ question

Currently, compose_questions only allows for combining 2 questions. So if you want to do 3, then you have to do this:

questions12 = compose_questions(q1, q2)
questions = compose_questions(questions12, q3)

Not that big a deal for 3 questions, but bad if the number of questions gets bigger.

Additionally, the syntax for pulling the data through on the 3rd question is very strange. When using compose_questions() on two questions, you can use the name of the 1st question as a variable in the second. However, when you want to add a 3rd question (as I did above), you need to combine the names of the first question and the second question. See the example below; I left an inline comment at the problem.

from edsl.questions import QuestionFreeText, QuestionMultipleChoice

# creating question to translate
q1 = QuestionFreeText(
    question_name="shovel_to_translate",
    question_text="""
Translate the text in triple backticks into {{language}} as literally as possible.
This text is not about you in any way; you must just translate it and nothing else.
Do not include any other text besides the direct translation.
'''
A hardware store has been selling snow shovels for $15.
The morning after a large snowstorm, the store {{store_action}} ${{new_price}}.
You are a {{politics}}.
How would you rate this action?
'''
""",
)

q2 = QuestionFreeText(
    question_name="shovel_to_english",
    question_text="""
Translate the text in triple backticks into English as literally as possible.
Do not include any other text besides the direct translation.
'''
{{shovel_to_translate}}
'''
""",
)

q3 = QuestionMultipleChoice(
    question_name="final_shovel_question",
    question_text="""
#### THIS IS WEIRD, I THINK IT SHOULD JUST LET ME DO {{shovel_to_translate}} FOR CLARITY ###
{{shovel_to_translate_shovel_to_english}}
""",
    question_options=["Completely Fair", "Acceptable", "Unfair", "Very Unfair"],
)

QuestionLinearScale is not printing answer comment field

An answer "comment" field is automatically captured for each agent response. See the common prompt:

Return a valid JSON formatted like this, selecting only the number of the option: 
{"answer": <put answer code here>, "comment": "<put explanation here>"}

We can print the answer comment for other question types (e.g., multiple choice) but not for linear scale. We do see that the comment exists in the db.

Improvements

  • Add repair function to the event loop (see Could not Load JSON)
  • Create example for functional question type
  • Figure out budget issue

from integration/test_questions.py

Now running: budget
Error running budget: Error running job. Exception: Answer key 'answer' must be of type dict (got 2)..
Now running: checkbox
Now running: extract
Could not load JSON. Trying to repair.
{"answer": {'name': 'Moby Dick', 'profession': 'truck driver'}}
Error running extract: Error running job. Exception: '_asyncio.Task' object is not subscriptable.
Now running: free_text
Now running: functional
Error running functional: type object 'QuestionFunctional' has no attribute 'example'
Now running: list
Now running: multiple_choice
Now running: numerical
Now running: rank
Now running: likert_five
Now running: linear_scale
Now running: top_k
Now running: yes_no

What should we be saving to DB?

For now, we're only saving LLM calls. Results could be a good candidate. User-facing objects should have an easy way to be retrieved from the db.

Results 'user_prompt' fields are sometimes printing prompts for different questions

The results include a field prompt.<question_name>_user_prompt that shows the prompt for the relevant question. This field is question-specific and should be identical for each agent. However, when a survey has multiple questions and agents, we observe that the field shows different question texts. For example, the following code can produce incorrect user prompts, where the wrong question text is shown:

from edsl import Agent, Survey
from edsl import QuestionMultipleChoice

q1 = QuestionMultipleChoice(
    question_name = "mon",
    question_text = "How do you feel about Mondays?",
    question_options = ["Bad", "OK", "Good"]
)

q2 = QuestionMultipleChoice(
    question_name = "sat",
    question_text = "How did you feel about Saturdays?",
    question_options = ["Bad", "OK", "Good"]
)

personas = ["You like sunny days.", "You like rainy days."]

agents = [Agent({"persona":p}) for p in personas]

survey = Survey([q1,q2])

results = survey.by(agents).run()

results.select('prompt.mon_user_prompt','prompt.sat_user_prompt').print()

Note: This problem does not always happen -- the above code may show correct prompts!

Add Survey Method

  • Combines questions with add
  • Check for question_name conflicts (see the sketch after this list)
  • Store the question_name version of rules and re-apply
  • If any "first survey" EoS rules w/ priority > 1, ask user what to do

Answer randomization

Create an option on multiple choice questions to try all possible orderings of the answer options (or a random subset) to deal with potential ordering effects. Also add a way for the user to track the order in the output data frame.
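
A minimal sketch of the idea: generate all orderings of the options (or a random subset), administer the question once per ordering, and record the ordering so it can be written to the output data. Names here are illustrative:

import random
from itertools import permutations

def option_orderings(options, n_random=None, seed=0):
    """Return all orderings of the options, or a random subset of size n_random."""
    all_orders = [list(p) for p in permutations(options)]
    if n_random is not None:
        random.Random(seed).shuffle(all_orders)
        all_orders = all_orders[:n_random]
    return all_orders

options = ["Bad", "OK", "Good"]
for order in option_orderings(options, n_random=2):
    # Each ordering would become one administration of the question; storing
    # `order` alongside the response lets the user track ordering effects.
    print(order)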

Feature Request: Add more LLM models

Models to add:

  • Mistral
  • Llama
  • Gemini
  • Claude V2

Other changes

  • Metaclass for automatic class registry
  • Really improve the documentation so people can see how to add their own models
  • Get advice from @apostolosfilippas on how we make the config file gracefully handle (many) tokens

Speed!

  • Use asyncio to conduct_interviews (see the sketch after this list)
  • Compute the within-survey dependency DAG & execute questions optimally
  • Multiplex API keys (if available)
  • Instrumentation to track token/usage rates
  • Consider creating a local API service to measure through-put
  • Create a queue-based system so we can accommodate jobs from different users
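
A minimal sketch of the asyncio pattern implied by the first bullet: run interviews concurrently, bounded by a semaphore to respect rate limits. run_interview is a stand-in coroutine, not the real EDSL function:

import asyncio

async def run_interview(interview_id: int) -> str:
    """Stand-in for a single agent/survey interview that awaits LLM calls."""
    await asyncio.sleep(0.1)  # placeholder for awaiting model responses
    return f"result for interview {interview_id}"

async def conduct_interviews(interview_ids, max_concurrent: int = 10):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(i):
        async with semaphore:
            return await run_interview(i)

    return await asyncio.gather(*(bounded(i) for i in interview_ids))

results = asyncio.run(conduct_interviews(range(25)))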

SQL Improvements / bugs

  • The row number of the pandas dataframe is becoming the column name after a transpose, which is not desired


  • Do we like the name self?

  • Does id make sense? index?

  • sample_number/replication/run?
