vanna's Introduction

Vanna

Vanna is an MIT-licensed open-source Python RAG (Retrieval-Augmented Generation) framework for SQL generation and related functionality.

How Vanna works

Vanna works in two easy steps:

  1. Train a RAG "model" on your data.
  2. Ask questions in natural language, and get back SQL queries that can be set up to run automatically against your database.

If you don't know what RAG is, don't worry -- you don't need to know how it works under the hood to use Vanna. You just need to know that you "train" a model, which stores some metadata, and that you then use it to "ask" questions.

See the base class for more details on how this works under the hood.

User Interfaces

These are some of the user interfaces that we've built using Vanna. You can use these as-is or as a starting point for your own custom interface.

Getting started

See the documentation for specifics on your desired database, LLM, etc.

If you want to get a feel for how it works after training, you can try this Colab notebook.

Install

pip install vanna

A number of optional packages can also be installed; see the documentation for details.

Import

See the documentation if you're customizing the LLM or vector database.

# The import statement will vary depending on your LLM and vector database. This is an example for OpenAI + ChromaDB

from vanna.openai.openai_chat import OpenAI_Chat
from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4-...'})

# See the documentation for other options

Training

You may or may not need to run these vn.train commands depending on your use case. See the documentation for more details.

These statements are shown to give you a feel for how it works.

Train with DDL Statements

DDL statements contain information about the table names, columns, data types, and relationships in your database.

vn.train(ddl="""
    CREATE TABLE IF NOT EXISTS my_table (
        id INT PRIMARY KEY,
        name VARCHAR(100),
        age INT
    )
""")

Train with Documentation

Sometimes you may want to add documentation about your business terminology or definitions.

vn.train(documentation="Our business defines XYZ as ...")

Train with SQL

You can also add SQL queries to your training data. This is useful if you have some queries already laying around. You can just copy and paste those from your editor to begin generating new SQL.

vn.train(sql="SELECT name, age FROM my_table WHERE name = 'John Doe'")

Asking questions

vn.ask("What are the top 10 customers by sales?")

You'll get SQL:

SELECT c.c_name AS customer_name,
       SUM(l.l_extendedprice * (1 - l.l_discount)) AS total_sales
FROM snowflake_sample_data.tpch_sf1.lineitem l
JOIN snowflake_sample_data.tpch_sf1.orders o
  ON l.l_orderkey = o.o_orderkey
JOIN snowflake_sample_data.tpch_sf1.customer c
  ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales DESC
LIMIT 10;

If you've connected to a database, you'll get the table:

   CUSTOMER_NAME        TOTAL_SALES
0  Customer#000143500   6757566.0218
1  Customer#000095257   6294115.3340
2  Customer#000087115   6184649.5176
3  Customer#000131113   6080943.8305
4  Customer#000134380   6075141.9635
5  Customer#000103834   6059770.3232
6  Customer#000069682   6057779.0348
7  Customer#000102022   6039653.6335
8  Customer#000098587   6027021.5855
9  Customer#000064660   5905659.6159

You'll also get an automated Plotly chart.

RAG vs. Fine-Tuning

RAG

  • Portable across LLMs
  • Easy to remove training data if any of it becomes obsolete
  • Much cheaper to run than fine-tuning
  • More future-proof -- if a better LLM comes out, you can just swap it out

Fine-Tuning

  • Good if you need to minimize tokens in the prompt
  • Slow to get started
  • Expensive to train and run (generally)

Why Vanna?

  1. High accuracy on complex datasets.
    • Vanna’s capabilities are tied to the training data you give it
    • More training data means better accuracy for large and complex datasets
  2. Secure and private.
    • Your database contents are never sent to the LLM or the vector database
    • SQL execution happens in your local environment
  3. Self learning.
    • If using Vanna via Jupyter, you can choose to "auto-train" it on the queries that were successfully executed
    • If using other interfaces, you can have the interface prompt the user for feedback on the results
    • Correct question-to-SQL pairs are stored for future reference and make future results more accurate
  4. Supports any SQL database.
    • The package allows you to connect to any SQL database that you can otherwise connect to with Python
  5. Choose your front end.
    • Most people start in a Jupyter Notebook.
    • Expose to your end users via Slackbot, web app, Streamlit app, or a custom front end.
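The "self learning" loop described in point 3 could be sketched as follows. This is an illustrative outline, not the actual implementation; the `vn` and `run_sql` interfaces here are hypothetical stand-ins.

```python
def ask_and_learn(vn, question, run_sql):
    """Hypothetical auto-train loop: store the question/SQL pair
    only when the generated SQL executed successfully."""
    sql = vn.generate_sql(question)
    try:
        df = run_sql(sql)
    except Exception:
        return None                          # failed SQL is not added to training data
    vn.train(question=question, sql=sql)     # correct pair stored for future reference
    return df
```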

Extending Vanna

Vanna is designed to connect to any database, LLM, and vector database. There's a VannaBase abstract base class that defines some basic functionality. The package provides implementations for use with OpenAI and ChromaDB. You can easily extend Vanna to use your own LLM or vector database. See the documentation for more details.
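As an illustration of how that extension point works, here is a minimal, self-contained sketch. The `LLMBase` class below is a simplified stand-in for `VannaBase`, and `submit_prompt` is used only as an example method name; consult the documentation for the real abstract interface.

```python
from abc import ABC, abstractmethod

# Simplified stand-in for the VannaBase abstract class -- the real
# interface defines more methods; see the documentation.
class LLMBase(ABC):
    @abstractmethod
    def submit_prompt(self, prompt: str) -> str:
        """Send a prompt to the LLM and return its response."""

class MyCustomLLM(LLMBase):
    """A custom LLM backend; swap in calls to your own model here."""
    def submit_prompt(self, prompt: str) -> str:
        # A real implementation would call your LLM's API here.
        return f"-- SQL for: {prompt}"

llm = MyCustomLLM()
print(llm.submit_prompt("top 10 customers"))
```

Mixing a class like this into `MyVanna` (as shown in the Import section) is how you would swap out OpenAI or ChromaDB for your own backends.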



vanna's Issues

Don't return True/False

Instead of returning True/False, output nothing when status.success is true; otherwise, throw an exception with status.message.
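A sketch of the proposed behavior; the `Status` class here is a hypothetical stand-in for whatever status object the API returns.

```python
class Status:
    """Hypothetical status object with success flag and message."""
    def __init__(self, success: bool, message: str = ""):
        self.success = success
        self.message = message

def check_status(status: Status) -> None:
    """Return nothing on success; raise with the status message otherwise."""
    if not status.success:
        raise Exception(status.message)

check_status(Status(True))   # succeeds silently
try:
    check_status(Status(False, "training failed"))
except Exception as e:
    print(e)
```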

print key dataset stats

Can we print out some key stats for a dataset, like name, description, training questions, asked questions, successful questions, visibility, users, admins, firstquestiondate, lastquestiondate, etc.?

vn.get_training_data

vn.get_training_data() should return

id
type (question-sql, ddl, documentation)
data
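A hypothetical sketch of that return shape; all field values here are made up for illustration.

```python
# Illustrative return shape for vn.get_training_data() -- a list of
# records, each with an id, a type, and the associated data.
training_data = [
    {"id": "td-1", "type": "question-sql",
     "data": {"question": "Who are the top customers?", "sql": "SELECT ..."}},
    {"id": "td-2", "type": "ddl",
     "data": "CREATE TABLE customers (id INT)"},
    {"id": "td-3", "type": "documentation",
     "data": "XYZ is defined as ..."},
]
```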

bootstrap - automated one line training + results

Can we implement a bootstrapping one line "agent"? For example -

conn = snowflake connection
vn.set_dataset('dataset')
vn.bootstrap()

where bootstrap does the following -

  1. gets the DDL and stores it
  2. gets historical queries and stores them along with generated questions
  3. generates 10 questions
  4. generates SQL for those 10 questions
  5. runs the SQL and prints the results, charts, etc.
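The steps above could be sketched roughly like this; every method on `vn` and `conn` here is hypothetical, named only to mirror the list.

```python
def bootstrap(vn, conn, n_questions: int = 10):
    """Hypothetical one-line training agent; all vn.*/conn.* methods
    are illustrative, not real Vanna API."""
    for ddl in conn.get_ddl():                        # 1. get DDL and store it
        vn.train(ddl=ddl)
    for sql in conn.get_query_history():              # 2. store historical queries
        vn.train(sql=sql)                             #    (questions auto-generated)
    questions = vn.generate_questions(n=n_questions)  # 3. generate questions
    for q in questions:                               # 4. generate SQL for each
        sql = vn.generate_sql(q)
        print(conn.run_sql(sql))                      # 5. run and print results
```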

add user to database returns False

Can't add a user to a dataset; it returns the error "you can only add a user to your own organisation". This is for an organisation that I just created and must have ownership rights to.

TypeError when running `vn.ask()`

Encountered the following error when saving the outputs of vn.ask() to four objects (query, df, plot, qns).

This error does not appear if the parameter print_results = False is passed.

Flow diagrams for SQL

Can we create mermaid charts that do flow diagrams for how a SQL statement gets executed and the different entities involved?

Add GH action for NB example runs

Add GH action that:

  1. Runs on every push to the PR
  2. Runs a notebook: https://github.com/marketplace/actions/run-notebook
  3. Converts the notebook into docs using nbconvert

The required ENV variables should be fed from GH secrets into the action's context (find their values in Slack):

VANNA_API_KEY=xxx
VANNA_MODEL=xxx
SNOWFLAKE_ACCOUNT=xxx
SNOWFLAKE_USERNAME=xxx
SNOWFLAKE_PASSWORD=xxx
SNOWFLAKE_DATABASE=xxx

Cache should only be for trained SQL

When we do vn.generate_sql, we pull from the cache if the SQL already exists. However, if the SQL is flagged or otherwise not in the training set, we should bypass the cache.
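A sketch of that cache policy; the function and parameter names are hypothetical.

```python
def generate_sql_cached(question, cache, training_set, generate_sql):
    """Serve from the cache only when the cached SQL is still in the
    training set; otherwise regenerate and refresh the cache."""
    cached = cache.get(question)
    if cached is not None and cached in training_set:
        return cached                 # trusted: cached AND trained
    sql = generate_sql(question)      # flagged/untrained: bypass the cache
    cache[question] = sql
    return sql
```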

Confidence score for generated SQL

Is there a way to get a confidence score in terms of how likely the SQL is to be correct? Or whether there are enough similar queries / etc to give Vanna enough context to generate the SQL?

Maybe this could be calculated via embedding distances?
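One way the embedding-distance idea could be sketched: score a question by its mean cosine similarity to the k nearest training questions. This is an illustrative heuristic, not anything Vanna currently implements.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def confidence_score(question_emb, training_embs, k=3):
    """Mean cosine similarity to the k most similar training embeddings;
    closer to 1.0 suggests more relevant context is available."""
    sims = sorted((cosine_similarity(question_emb, e) for e in training_embs),
                  reverse=True)
    return sum(sims[:k]) / min(k, len(sims))
```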

Vanna generated documentation

Can we have Vanna generate documentation for tables and columns automatically? For example:

vn.generate_docs(entity='table', name='<tablename>') would generate a docstring for a particular table, and
vn.generate_docs(entity='column', name='<columnname>') would generate a docstring for a particular column

and perhaps there could be a flag on the table call to also generate docs for cols within that table automatically?

Training multiple queries at once (bulk training)

The ability to send in

  1. a JSON of question / SQL pairs, or
  2. a SQL file full of semicolon delimited SQL queries

and have Vanna automatically train against a dataset. For option 2, the questions would need to be auto-generated as well.
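Both inputs could be handled along these lines; the function names and the `vn` interface are hypothetical, and the SQL split is naive (a real implementation might use sqlparse to handle semicolons inside string literals).

```python
import json

def train_bulk_json(vn, path):
    """Train from a JSON file of [{"question": ..., "sql": ...}, ...] pairs."""
    with open(path) as f:
        for pair in json.load(f):
            vn.train(question=pair["question"], sql=pair["sql"])

def train_bulk_sql(vn, path):
    """Train from a file of semicolon-delimited SQL queries;
    the question for each would be auto-generated."""
    with open(path) as f:
        statements = [s.strip() for s in f.read().split(";") if s.strip()]
    for sql in statements:
        vn.train(sql=sql)
```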

Restrict the characters that can be in a dataset name

In order to avoid confusion and also to make the dataset name url safe, on input of the dataset name we should:

  • make it lowercase
  • replace spaces with a hyphen -
  • replace special characters with hyphen or remove it altogether

There should be a deterministic mapping of the input dataset string to the actual dataset name so that users can do vn.set_dataset('my WEirD dataset name!') and it will still work
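The deterministic mapping described above could be sketched like this; the function name is illustrative.

```python
import re

def normalize_dataset_name(name: str) -> str:
    """Deterministically map any input string to a URL-safe dataset name."""
    name = name.lower()
    name = re.sub(r"\s+", "-", name)             # spaces -> hyphens
    name = re.sub(r"[^a-z0-9-]", "", name)       # drop other special characters
    return re.sub(r"-+", "-", name).strip("-")   # collapse repeated hyphens

print(normalize_dataset_name("my WEirD dataset name!"))  # my-weird-dataset-name
```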

Make vn.train generic

vn.train should take in

question: str or None
sql: str or None
ddl: str or None
documentation: str or None
json_file: str or None
sql_file: str or None

  • If just question, throw an error and print out example usage
  • If just sql, do vn.generate_question to generate the question and then vn.add_sql
  • If just ddl, do vn.add_ddl
  • If just documentation, do vn.store_documentation
  • If just json_file, read the JSON file using pd.read_json and then iterate through the rows, using the question and sql columns to do vn.add_sql
  • If just sql_file, use sqlparse to separate the SQL statements. Anything that's a CREATE TABLE should go into vn.add_ddl, and other statements should go through vn.generate_question and then vn.add_sql

All parameter defaults should be None, and if the user passes in any invalid combination of parameters, an exception should be raised.
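The parameter validation could be sketched as below; the add_* handlers named in the issue are omitted, and this skeleton just returns which parameters were provided.

```python
def train(question=None, sql=None, ddl=None, documentation=None,
          json_file=None, sql_file=None):
    """Validate argument combinations for a generic train() entry point."""
    args = dict(question=question, sql=sql, ddl=ddl,
                documentation=documentation,
                json_file=json_file, sql_file=sql_file)
    provided = {k for k, v in args.items() if v is not None}
    if provided == {"question"}:
        raise ValueError("question alone is invalid; example usage: "
                         "train(question='...', sql='SELECT ...')")
    valid = [{"sql"}, {"question", "sql"}, {"ddl"},
             {"documentation"}, {"json_file"}, {"sql_file"}]
    if provided not in valid:
        raise ValueError(f"invalid parameter combination: {sorted(provided)}")
    return sorted(provided)   # a real implementation would dispatch here
```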

automatically get DDL using DB connection

Instead of manually putting in the DDL using store_ddl(), can Vanna automatically get the DDL directly from the database if provided the connection string, and add each table separately? At least for Snowflake, BigQuery, and Postgres?
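A minimal sketch of the idea using SQLite, where the DDL is stored in sqlite_master; Snowflake, BigQuery, and Postgres would instead query their information_schema or GET_DDL equivalents. The function name is illustrative.

```python
import sqlite3

def get_ddl_statements(conn: sqlite3.Connection):
    """Pull each table's DDL straight from the database (SQLite shown)."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
for ddl in get_ddl_statements(conn):
    print(ddl)   # each statement could then be passed to vn.train(ddl=...)
```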

vn.generate_meta_description

This function would take in a question and use the training data as a reference to answer questions about the data instead of returning SQL.

Rename dataset to model

Model is a lot easier for people to understand.

  • create_model
  • train
  • get_training_data
  • delete_model
  • update_model_visibility
  • etc

Plotly chart arising from `vn.ask()`

The chart generated by vn.ask() cannot be replotted if it is not to the user's liking, but it should be. Otherwise, users may have to reshape the resulting df and do the plotting themselves, which somewhat diminishes the value Vanna provides.

Use Vanna with CSV files

Should we make a vn.use_df function that loads data into SQLite and connects to it, so that you can run Vanna on dataframes that you might have brought in via CSV or some other method?
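A stdlib-only sketch of the idea, loading a CSV straight into an in-memory SQLite table; the function name is hypothetical, and a real vn.use_df would presumably take a pandas DataFrame instead.

```python
import csv
import sqlite3

def use_csv(path: str, table: str = "df") -> sqlite3.Connection:
    """Load a CSV into an in-memory SQLite table so SQL can run against it."""
    conn = sqlite3.connect(":memory:")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})',
                         reader)
    return conn
```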

More informative results after running vn.train()

Running vn.train() returns True regardless of whether the SQL that is trained is correct or not. It also does not show whether the question that is being trained already exists. If it does, then what does vn.train() do? This raises the following issues:

  1. Returning True is not informative. It gives the impression that the SQL trained is correct but it might not be. I was able to train on erroneous SQL queries and it returned True as well.
  2. In what scenarios will False be returned?
  3. Are existing questions and their SQL code overwritten when vn.train() is run with an existing question?
