Giter Club home page Giter Club logo

essential-takehome's Introduction

Question Labeling

Reasoning

The goal of the labeling is to not only provide obervability into the level of complexity that the agent can handle, but also

There are two major goals for question labeling:

  1. Provide observability into the level of complexity the agent can handle
  2. Give insight into how to improve the agent.

In other words, we don't want to just know if the agent is good at it's job. We also want to know where the agent is failing and how to make it better.

To achieve this, we need to take a step back and understand the tasks presented. What are the common tasks within data analysis? How do they influence the difficulty of the task? Can we clealy communicate why one task is easy and the other hard?

I created a rubric for labeling that highlights 5 common tasks within analysis. For a given quesiton, each task is scored with 1 (easy), 2 (easy), or 3 (hard). A question is rated easy, medium, or hard based on aggregating the rating of each task.

  • 5: Easy (minimum score, all 1's)
  • 6-10: Medium
  • 11++: Hard

Identifying and scoring sub-tasks helps bring objectivity to the labeling process. We want an approach that can scale to more questions and be performed by more people. While the labeling task does require expertise in data analysis, the scoring system should minimize disagreement between labelers.

This rubric also allows for deeper insights into the agent's performance. As a benchmark, we can identify if the agent's behavior has changed. Run n may have the same overall score as n - 1, but that top level score may be hiding large changes in performance that neutralize each other. A balanced rubric helps prevent overfitting.

The additional dimensionality also allows for targeted improvements. It's much easier to develop a strategy for improving the agent's ability to perform transformations then stumble around looking for an overall 10% performance boost.

A question's difficulty is heavily influenced by the underlying data model. One of the primary goals of ETL pipelines is to structure the data to simplify queries. This is traditionally defined as denormalizing a transactional data model (OLTP) into one that is designed for analytic workloads (OLAP; Star Schema is a common design pattern). A question that is highly aligned with the data model is easy to answer; a question that is at odds with the data model is hard to answer.

Rubric

  • Quantity of datasets: Merging datasets are the most common source of bugs within data modeling. A join operation can return the same quantity of rows as the underlying set, remove rows from the underlying set, or it might increase the rows. The latter, when done unintentionally, is known as a fan out.
    • Easy: Question can be answered with a single table/dataframe. No joins required.
    • Medium: Multiple datasets combined with simple join logic.
    • Hard: Complex join logic; e.g. aggregation required prior to join to avoid fan out.
  • Use of Fractions: Aggregations are foundational to data analysis, and fractions must be aggregated with the correct order. Consider conversion rate; User A has 2 visits and 1 conversion User B has 3 visits and 1 conversion. If we calculate the ratio and then add, we end up with (0.5 + 0.33)/2 = 0.375, which is incorrect. The total visists is 5, and the total conversions is 2. Our formula should be (1 + 1) / (2 + 3) = 0.4.
    • Easy: No fractions are required.
    • Medium: Simple aggregation of two measures that are combined in the final step.
    • Hard: Combining derirved measures (aggregate in step 2, then again in step 3, then perform division). Distinct counts are also challenging; they are inherenly an O(n) operation (you must traverse the entire vector) and can not be rolled up.
  • Windowing and Ordinality of data: Windowing is applying a function to a row that depends on other rows. Most aggregations are naive to order; sum(1, 3) returns the same result as sum(3, 1). The simplest example is row numbers; for a given set of dimensions, this record is the nth value. The value of the previous row is required to determine the current row. In session_clickstream_data.csv, the ordinal column is a row number calculation. Rolling mean is another good example; we determine the row's mean based on a neighboring subset of data.
    • Easy: No windowing required.
    • Medium: Simple windowing logic; e.g. row number, rolling sum.
    • Hard: Applying window logic on derived data sets, funneling logic (do A, then B, then C).
  • Transformations Required: It is very common to need to mutate values to make them easier to use. This might be changing the data type (casting year, month, and day columns as a date) or defining a dimensional hierarchy to facilitate rollups.
    • Easy: No transformations are required.
    • Medium: Simple projections performed in the first step.
    • Hard: Transformations required on derived data.
  • Scope and Clarity of Request: Does the prompt clearly define a successful response? Does the agent need to make assumptions or infer the prompt's intent? "What is the MoM change in revenue for November 2023?" is clearly scoped. "How is revenue?" is very vaugue and requires the model to infer the intent to answer well.
    • Easy: Prompt identifies answer as being a scaler or one-dimensional vector.
    • Medium: Answer requires multiple dimensions, modeled as a dataframe.
    • Hard: Prompt does not identify the scope of a successful answer; agent must first define possible metrics to be calculated.

Local Setup

  • Install Poetry.
  • Create .env file and add OPEN_AI_API_KEY and ANTHROPIC_API_KEY.
  • Install dependancies: poetry install.
  • Launch environment: poetry shell, or source .venv/bin/activate if you know the location of your venv.
  • python cli.py --help to list availible commands
  • python cli.py eval-all to run full evaluation suite. Choose LLM using -m flag; see --help for options.
  • python cli.py eval -q question.json -a answer.txt to run single question. Pass in files using -q and -a arguments.

essential-takehome's People

Contributors

aaronbannin avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.