vanna's Introduction

Vanna

Vanna is an MIT-licensed open-source Python RAG (Retrieval-Augmented Generation) framework for SQL generation and related functionality.

How Vanna works

Vanna works in two easy steps:

  1. Train a RAG "model" on your data.
  2. Ask questions in natural language, and get back SQL queries that can be set up to run automatically against your database.

If you don't know what RAG is, don't worry -- you don't need to know how it works under the hood to use Vanna. You just need to know that you "train" a model, which stores some metadata, and that you then use it to "ask" questions.

See the base class for more details on how this works under the hood.

User Interfaces

These are some of the user interfaces that we've built using Vanna. You can use these as-is or as a starting point for your own custom interface.

Getting started

See the documentation for specifics on your desired database, LLM, etc.

If you want to get a feel for how it works after training, you can try this Colab notebook.

Install

pip install vanna

A number of optional packages can also be installed; see the documentation for details.

Import

See the documentation if you're customizing the LLM or vector database.

# The import statement will vary depending on your LLM and vector database. This is an example for OpenAI + ChromaDB

from vanna.openai.openai_chat import OpenAI_Chat
from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4-...'})

# See the documentation for other options

Training

You may or may not need to run these vn.train commands depending on your use case. See the documentation for more details.

These statements are shown to give you a feel for how it works.

Train with DDL Statements

DDL statements contain information about the table names, columns, data types, and relationships in your database.

vn.train(ddl="""
    CREATE TABLE IF NOT EXISTS my_table (
        id INT PRIMARY KEY,
        name VARCHAR(100),
        age INT
    )
""")

Train with Documentation

Sometimes you may want to add documentation about your business terminology or definitions.

vn.train(documentation="Our business defines XYZ as ...")

Train with SQL

You can also add SQL queries to your training data. This is useful if you have some queries already laying around. You can just copy and paste those from your editor to begin generating new SQL.

vn.train(sql="SELECT name, age FROM my_table WHERE name = 'John Doe'")

Asking questions

vn.ask("What are the top 10 customers by sales?")

You'll get SQL:

SELECT c.c_name AS customer_name,
       SUM(l.l_extendedprice * (1 - l.l_discount)) AS total_sales
FROM snowflake_sample_data.tpch_sf1.lineitem l
JOIN snowflake_sample_data.tpch_sf1.orders o
  ON l.l_orderkey = o.o_orderkey
JOIN snowflake_sample_data.tpch_sf1.customer c
  ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales DESC
LIMIT 10;

If you've connected to a database, you'll get the table:

   CUSTOMER_NAME        TOTAL_SALES
0  Customer#000143500   6757566.0218
1  Customer#000095257   6294115.3340
2  Customer#000087115   6184649.5176
3  Customer#000131113   6080943.8305
4  Customer#000134380   6075141.9635
5  Customer#000103834   6059770.3232
6  Customer#000069682   6057779.0348
7  Customer#000102022   6039653.6335
8  Customer#000098587   6027021.5855
9  Customer#000064660   5905659.6159

You'll also get an automated Plotly chart.

RAG vs. Fine-Tuning

RAG

  • Portable across LLMs
  • Easy to remove training data if any of it becomes obsolete
  • Much cheaper to run than fine-tuning
  • More future-proof -- if a better LLM comes out, you can just swap it out

Fine-Tuning

  • Good if you need to minimize tokens in the prompt
  • Slow to get started
  • Expensive to train and run (generally)

Why Vanna?

  1. High accuracy on complex datasets.
    • Vanna’s capabilities are tied to the training data you give it
    • More training data means better accuracy for large and complex datasets
  2. Secure and private.
    • Your database contents are never sent to the LLM or the vector database
    • SQL execution happens in your local environment
  3. Self learning.
    • If using Vanna via Jupyter, you can choose to "auto-train" it on the queries that were successfully executed
    • If using other interfaces, you can have the interface prompt the user for feedback on the results
    • Correct question-to-SQL pairs are stored for future reference and make future results more accurate
  4. Supports any SQL database.
    • The package allows you to connect to any SQL database that you can otherwise connect to with Python
  5. Choose your front end.
    • Most people start in a Jupyter Notebook.
    • Expose to your end users via Slackbot, web app, Streamlit app, or a custom front end.
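The "self learning" loop described in point 3 could be sketched as follows. This is an illustrative outline, not the actual implementation; the `vn` and `run_sql` interfaces here are hypothetical stand-ins.

```python
def ask_and_learn(vn, question, run_sql):
    """Hypothetical auto-train loop: store the question/SQL pair
    only when the generated SQL executed successfully."""
    sql = vn.generate_sql(question)
    try:
        df = run_sql(sql)
    except Exception:
        return None                          # failed SQL is not added to training data
    vn.train(question=question, sql=sql)     # correct pair stored for future reference
    return df
```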

Extending Vanna

Vanna is designed to connect to any database, LLM, and vector database. There's a VannaBase abstract base class that defines some basic functionality. The package provides implementations for use with OpenAI and ChromaDB. You can easily extend Vanna to use your own LLM or vector database. See the documentation for more details.
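As an illustration of how that extension point works, here is a minimal, self-contained sketch. The `LLMBase` class below is a simplified stand-in for `VannaBase`, and `submit_prompt` is used only as an example method name; consult the documentation for the real abstract interface.

```python
from abc import ABC, abstractmethod

# Simplified stand-in for the VannaBase abstract class -- the real
# interface defines more methods; see the documentation.
class LLMBase(ABC):
    @abstractmethod
    def submit_prompt(self, prompt: str) -> str:
        """Send a prompt to the LLM and return its response."""

class MyCustomLLM(LLMBase):
    """A custom LLM backend; swap in calls to your own model here."""
    def submit_prompt(self, prompt: str) -> str:
        # A real implementation would call your LLM's API here.
        return f"-- SQL for: {prompt}"

llm = MyCustomLLM()
print(llm.submit_prompt("top 10 customers"))
```

Mixing a class like this into `MyVanna` (as shown in the Import section) is how you would swap out OpenAI or ChromaDB for your own backends.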



vanna's Issues

Don't return True/False

Instead of returning True/False, output nothing when status.success is true; otherwise, throw an exception with status.message.
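A sketch of the proposed behavior; the `Status` class here is a hypothetical stand-in for whatever status object the API returns.

```python
class Status:
    """Hypothetical status object with success flag and message."""
    def __init__(self, success: bool, message: str = ""):
        self.success = success
        self.message = message

def check_status(status: Status) -> None:
    """Return nothing on success; raise with the status message otherwise."""
    if not status.success:
        raise Exception(status.message)

check_status(Status(True))   # succeeds silently
try:
    check_status(Status(False, "training failed"))
except Exception as e:
    print(e)
```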

print key dataset stats

Can we print out some key stats for a dataset, like name, description, training questions, asked questions, successful questions, visibility, users, admins, firstquestiondate, lastquestiondate, etc.?

vn.get_training_data

vn.get_training_data() should return

id
type (question-sql, ddl, documentation)
data
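A hypothetical sketch of that return shape; all field values here are made up for illustration.

```python
# Illustrative return shape for vn.get_training_data() -- a list of
# records, each with an id, a type, and the associated data.
training_data = [
    {"id": "td-1", "type": "question-sql",
     "data": {"question": "Who are the top customers?", "sql": "SELECT ..."}},
    {"id": "td-2", "type": "ddl",
     "data": "CREATE TABLE customers (id INT)"},
    {"id": "td-3", "type": "documentation",
     "data": "XYZ is defined as ..."},
]
```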

bootstrap - automated one line training + results

Can we implement a bootstrapping one line "agent"? For example -

conn = snowflake connection
vn.set_dataset('dataset')
vn.bootstrap()

where bootstrap does the following -

  1. gets the DDL and stores it
  2. gets historical queries and stores them along with generated questions
  3. generates 10 questions
  4. generates SQL for those 10 questions
  5. runs the SQL and prints the results, charts, etc.
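The steps above could be sketched roughly like this; every method on `vn` and `conn` here is hypothetical, named only to mirror the list.

```python
def bootstrap(vn, conn, n_questions: int = 10):
    """Hypothetical one-line training agent; all vn.*/conn.* methods
    are illustrative, not real Vanna API."""
    for ddl in conn.get_ddl():                        # 1. get DDL and store it
        vn.train(ddl=ddl)
    for sql in conn.get_query_history():              # 2. store historical queries
        vn.train(sql=sql)                             #    (questions auto-generated)
    questions = vn.generate_questions(n=n_questions)  # 3. generate questions
    for q in questions:                               # 4. generate SQL for each
        sql = vn.generate_sql(q)
        print(conn.run_sql(sql))                      # 5. run and print results
```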

add user to database returns False

Can't add a user to a dataset; it returns the error "you can only add a user to your own organisation". This is for an organisation that I just created and must have ownership rights to.

TypeError when running `vn.ask()`

Encountered the following error when saving the outputs of vn.ask() to four objects (query, df, plot, qns).

This error does not appear if the parameter print_results = False is passed.

Flow diagrams for SQL

Can we create mermaid charts that do flow diagrams for how a SQL statement gets executed and the different entities involved?

Add GH action for NB example runs

Add GH action that:

  1. Runs on every push to the PR
  2. Runs a notebook: https://github.com/marketplace/actions/run-notebook
  3. Converts the notebook into docs using nbconvert

The required ENV variables should be fed from GH secrets into the action's context (find their values in Slack):

VANNA_API_KEY=xxx
VANNA_MODEL=xxx
SNOWFLAKE_ACCOUNT=xxx
SNOWFLAKE_USERNAME=xxx
SNOWFLAKE_PASSWORD=xxx
SNOWFLAKE_DATABASE=xxx

Cache should only be for trained SQL

When we do vn.generate_sql, we pull from the cache if the SQL already exists. However, if the SQL is flagged or otherwise not in the training set, we should bypass the cache.
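A sketch of that cache policy; the function and parameter names are hypothetical.

```python
def generate_sql_cached(question, cache, training_set, generate_sql):
    """Serve from the cache only when the cached SQL is still in the
    training set; otherwise regenerate and refresh the cache."""
    cached = cache.get(question)
    if cached is not None and cached in training_set:
        return cached                 # trusted: cached AND trained
    sql = generate_sql(question)      # flagged/untrained: bypass the cache
    cache[question] = sql
    return sql
```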

Confidence score for generated SQL

Is there a way to get a confidence score in terms of how likely the SQL is to be correct? Or whether there are enough similar queries / etc to give Vanna enough context to generate the SQL?

Maybe this could be calculated via embedding distances?
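One way the embedding-distance idea could be sketched: score a question by its mean cosine similarity to the k nearest training questions. This is an illustrative heuristic, not anything Vanna currently implements.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def confidence_score(question_emb, training_embs, k=3):
    """Mean cosine similarity to the k most similar training embeddings;
    closer to 1.0 suggests more relevant context is available."""
    sims = sorted((cosine_similarity(question_emb, e) for e in training_embs),
                  reverse=True)
    return sum(sims[:k]) / min(k, len(sims))
```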

Vanna generated documentation

Can we have Vanna generate documentation for tables and columns automatically? For example:

vn.generate_docs(entity='table', name='<tablename>') would generate a docstring for a particular table, and
vn.generate_docs(entity='column', name='<columnname>') would generate a docstring for a particular column

and perhaps there could be a flag on the table call to also generate docs for cols within that table automatically?

Training multiple queries at once (bulk training)

The ability to send in

  1. a JSON of question / SQL pairs, or
  2. a SQL file full of semicolon delimited SQL queries

and have Vanna automatically train against a dataset. For option 2, the questions would need to be auto-generated as well.
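Both inputs could be handled along these lines; the function names and the `vn` interface are hypothetical, and the SQL split is naive (a real implementation might use sqlparse to handle semicolons inside string literals).

```python
import json

def train_bulk_json(vn, path):
    """Train from a JSON file of [{"question": ..., "sql": ...}, ...] pairs."""
    with open(path) as f:
        for pair in json.load(f):
            vn.train(question=pair["question"], sql=pair["sql"])

def train_bulk_sql(vn, path):
    """Train from a file of semicolon-delimited SQL queries;
    the question for each would be auto-generated."""
    with open(path) as f:
        statements = [s.strip() for s in f.read().split(";") if s.strip()]
    for sql in statements:
        vn.train(sql=sql)
```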

Restrict the characters that can be in a dataset name

In order to avoid confusion and also to make the dataset name url safe, on input of the dataset name we should:

  • make it lowercase
  • replace spaces with a hyphen -
  • replace special characters with hyphen or remove it altogether

There should be a deterministic mapping of the input dataset string to the actual dataset name so that users can do vn.set_dataset('my WEirD dataset name!') and it will still work
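The deterministic mapping described above could be sketched like this; the function name is illustrative.

```python
import re

def normalize_dataset_name(name: str) -> str:
    """Deterministically map any input string to a URL-safe dataset name."""
    name = name.lower()
    name = re.sub(r"\s+", "-", name)             # spaces -> hyphens
    name = re.sub(r"[^a-z0-9-]", "", name)       # drop other special characters
    return re.sub(r"-+", "-", name).strip("-")   # collapse repeated hyphens

print(normalize_dataset_name("my WEirD dataset name!"))  # my-weird-dataset-name
```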

Make vn.train generic

vn.train should take in

question: str or None
sql: str or None
ddl: str or None
documentation: str or None
json_file: str or None
sql_file: str or None

  • If just question, throw an error and print out example usage
  • If just sql, do vn.generate_question to generate the question and then vn.add_sql
  • If just ddl, do vn.add_ddl
  • If just documentation, do vn.store_documentation
  • If just json_file, read the JSON file using pd.read_json and then iterate through the rows, using the question and sql columns to do vn.add_sql
  • If just sql_file, use sqlparse to separate the SQL statements. Anything that's a CREATE TABLE should go into vn.add_ddl, and other statements should go through vn.generate_question and then vn.add_sql

All parameter defaults should be None, and if the user passes in any invalid combination of parameters, an exception should be raised.
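The parameter validation could be sketched as below; the add_* handlers named in the issue are omitted, and this skeleton just returns which parameters were provided.

```python
def train(question=None, sql=None, ddl=None, documentation=None,
          json_file=None, sql_file=None):
    """Validate argument combinations for a generic train() entry point."""
    args = dict(question=question, sql=sql, ddl=ddl,
                documentation=documentation,
                json_file=json_file, sql_file=sql_file)
    provided = {k for k, v in args.items() if v is not None}
    if provided == {"question"}:
        raise ValueError("question alone is invalid; example usage: "
                         "train(question='...', sql='SELECT ...')")
    valid = [{"sql"}, {"question", "sql"}, {"ddl"},
             {"documentation"}, {"json_file"}, {"sql_file"}]
    if provided not in valid:
        raise ValueError(f"invalid parameter combination: {sorted(provided)}")
    return sorted(provided)   # a real implementation would dispatch here
```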

automatically get DDL using DB connection

Instead of manually putting in the DDL using store_ddl(), can Vanna automatically get the DDL directly from the database if provided the connection string, and add each table separately? At least for Snowflake, BigQuery, and Postgres?
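A minimal sketch of the idea using SQLite, where the DDL is stored in sqlite_master; Snowflake, BigQuery, and Postgres would instead query their information_schema or GET_DDL equivalents. The function name is illustrative.

```python
import sqlite3

def get_ddl_statements(conn: sqlite3.Connection):
    """Pull each table's DDL straight from the database (SQLite shown)."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
for ddl in get_ddl_statements(conn):
    print(ddl)   # each statement could then be passed to vn.train(ddl=...)
```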

vn.generate_meta_description

This function would take in a question and use the training data as a reference to answer questions about the data instead of returning SQL.

Rename dataset to model

Model is a lot easier for people to understand.

  • create_model
  • train
  • get_training_data
  • delete_model
  • update_model_visibility
  • etc

Plotly chart arising from `vn.ask()`

The chart generated by vn.ask() cannot be replotted if it is not to the user's liking, but it should be. Otherwise, users may have to reshape the resulting df and do the plotting themselves, which somewhat diminishes the value Vanna provides.

Use Vanna with CSV files

Should we make a vn.use_df function that loads data into SQLite and connects to it, so that you can run Vanna on dataframes that you might have brought in via CSV or some other method?
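A stdlib-only sketch of the idea, loading a CSV straight into an in-memory SQLite table; the function name is hypothetical, and a real vn.use_df would presumably take a pandas DataFrame instead.

```python
import csv
import sqlite3

def use_csv(path: str, table: str = "df") -> sqlite3.Connection:
    """Load a CSV into an in-memory SQLite table so SQL can run against it."""
    conn = sqlite3.connect(":memory:")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})',
                         reader)
    return conn
```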

More informative results after running vn.train()

Running vn.train() returns True regardless of whether the SQL that is trained is correct or not. It also does not show whether the question that is being trained already exists. If it does, then what does vn.train() do? This raises the following issues:

  1. Returning True is not informative. It gives the impression that the SQL trained is correct but it might not be. I was able to train on erroneous SQL queries and it returned True as well.
  2. In what scenarios will False be returned?
  3. Are existing questions and their SQL code overwritten when vn.train() is run with an existing question?
