lancedb / vectordb-recipes

High quality resources & applications for LLMs, multi-modal models and VectorDBs

License: Apache License 2.0

JavaScript 0.12% TypeScript 0.40% CSS 0.02% Jupyter Notebook 97.31% Python 2.15%

ai deep-learning llms machine-learning multimodal agents embeddings fine-tuning gpt langchain

vectordb-recipes's Introduction

VectorDB-recipes

Dive into building GenAI applications! This repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects.

These are built using LanceDB, a free, open-source, serverless vector database that requires no setup.
It integrates with the Python data ecosystem, so you can use it in existing data pipelines built on pandas, Arrow, Pydantic, and similar tools.
LanceDB also has a native TypeScript SDK, which lets you run vector search in serverless functions!

Join our community for support - Discord • Twitter

This repository is divided into 3 sections:

Examples - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes!
Applications - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools
Tutorials - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.

Examples

Applied examples that get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes! Examples are available as:

Colab notebooks - build the application in stages, allowing you to inspect results at every intermediate stage.
Python scripts - for cases where you'd like to use the file or snippets directly in your application.
JS/TS scripts - some examples are written using LanceDB's native JS library! These scripts and snippets can also be integrated directly into your web applications.

If you're looking for in-depth, tutorial-like examples, check out the Tutorials section!

Example	Notebook & Scripts	Read The Blog!

Youtube transcript search bot
Langchain: Code Docs QA bot
Databricks DBRX Website Bot
CLI-based SDK Manual Chatbot with Phidata
TransformersJS Embedding example
Inbuilt Hybrid Search
Audio Search
Multi-lingual search
Hybrid search BM25 & lancedb
Search Within Images
Accelerate Vector Search Applications Using OpenVINO
Multimodal CLIP: DiffusionDB
Multimodal CLIP: Youtube videos
Multimodal Image + Text Search
Movie Recommender
Product Recommender
Arxiv paper recommender
Improve RAG with Re-ranking
Improve RAG with FLARE
Improve RAG with HyDE
Improve RAG with LOTR
Advanced RAG: Parent Document Retriever
Query Expansion and Reranker
RAG Fusion
Contextual-Compression-with-RAG
Instruct-Multitask
Evaluating Prompts with Prompttools
AI Agents: Reducing Hallucination
AI Trends Searcher with CrewAI
SuperAgent Autogen
Sentiment Analysis : Analysing Hotel Reviews
Facial Recognition
Imagebind demo app

Projects & Applications

These are ready-to-use applications built using the LanceDB serverless vector database. You can explore these open-source projects, use parts of them in your own projects, or build your applications on top of them.

Project Name	Description	Screenshot
YOLOExplorer	Iterate on your YOLO / CV datasets using SQL, Vector semantic search, and more within seconds
Website Chatbot (Deployable Vercel Template)	Create a chatbot from the sitemap of any website or docs of your choice. Built using the vectordb serverless native JavaScript package.
Chat with multiple URL/website	Conversational AI for any website with Mistral, BGE embeddings & LanceDB
Talk with Youtube Video using GPT4 Vision API	Talk with Youtube Video using GPT4 Vision API and Langchain
Talk with Podcast	Talk with Youtube Podcast using Ollama and insanely-fast-whisper
Talk with Wikipedia	Talk with Wikipedia Pages
Talk with Github	Talk with Github Codespaces using Qwen1.5
Document Chat with Langroid	Talk with your Documents using Langroid
Hr chatbot	HR chatbot - ask your personal queries using a zero-shot ReAct agent & tools
Advanced Chatbot with Parler TTS	This chatbot app uses LanceDB hybrid search, FTS & a reranker with the Parler TTS library.
Multi-Modal Search Engine	Create a multi-modal search engine app to search images using either images or text
Multimodal Myntra Fashion Search Engine	This app uses OpenAI's CLIP to make a search engine that can understand and deal with both written words and pictures.
Multilingual-RAG	Multilingual RAG with Cohere embeddings, supporting 100+ languages
Fastapi RAG template	FastAPI-based RAG template with WebSocket support
GTE MLX RAG	MLX-based RAG model using the LanceDB API

Tutorials

Looking to get started with LLMs, vectorDBs, and the world of Generative AI? These in-depth tutorials and courses cover these concepts, with practical follow-along Colabs where possible.

Tutorial	Interactive Environment	Blog Link

Build RAG from Scratch
Local RAG from Scratch with Llama3
A Primer on Text Chunking and its Types
Langchain LlamaIndex Chunking
NER powered Semantic Search
Product Quantization: Compress High Dimensional Vectors
Corrective RAG with Langgraph
LLMs, RAG, & the missing storage layer for AI
Fine-Tuning LLM using PEFT & QLoRA
Context-Aware Chatbot using Llama 2 & LanceDB
Better RAG with FLARE

🌟 New! 🌟 Applied GenAI and VectorDB course on Udacity
Learn about GenAI and vectorDBs using LanceDB in the recently launched Udacity course.

Contributing Examples

If you're working on some cool applications that you'd like to add to this repo, please open a PR!

vectordb-recipes's Issues

Add tests

Pin openai versions

@kaushal07wick
https://github.com/lancedb/vectordb-recipes/tree/main/examples/Code-Documentation-QA-Bot

Add linting

Black & isort. Pre-commit actions if it's not disruptive.

arxiv-CLIP/FTS search example

[Examples Code Test] Build nodejs native module from source (main branch)

It doesn't seem like we are actually building our binary in the CI, but we need to investigate whether we are.

Update Lambda examples to use S3 Express

This is currently blocked by apache/arrow-rs#5140

We need:

New API documentation from AWS to know what needs to be updated
arrow-rs has to be updated
arrow-rs needs to be released
datafusion needs to update
lance can then update

To test things out, we can just fork arrow-rs and datafusion to create a new custom build.

Invalid argument error: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.

Heyo me again 👯

I created a little helper file for my use case with LanceDB, but when I run my example I get an error that doesn't help:

[Error: Invalid argument error: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.]

Here is my code:

lanceDb-retriver.ts

import { OpenAIEmbeddingFunction, connect, } from 'vectordb';
const dbPath = 'assets/db'
let embedFunction;

export interface IngestOptions {
    table: string;
    data: Array<Record<string, unknown>>;
}

export interface RetriveOptions {
    query: string;
    table: string;
    limit?: number;
    filter?: string;
    select?: Array<string>;
}

export interface DeleteOptions {
    table: string;
    filter: string;
}

export interface UpdateOptions {
    table: string;
    data: Record<string, unknown>[]
}

export async function useLocalEmbedding() {
    const { pipeline } = await import('@xenova/transformers');
    const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

    const embed_fun: any = {};
    embed_fun.sourceColumn = 'text';
    embed_fun.embed = async function (batch: string[]) {
        let result = [];
        for (let text of batch) {
            const res = await pipe(text, { pooling: 'mean', normalize: true });
            result.push(Array.from(res['data']));
        }
        return result;
    }

    embedFunction = embed_fun;
}

export function useOpenAiEmbedding(apiKey: string, sourceColumn = 'pageContent') {
    embedFunction = new OpenAIEmbeddingFunction(sourceColumn, apiKey)
}

export async function update(options: UpdateOptions) {
    try {
        const db = await connect(dbPath)

        if ((await db.tableNames()).includes(options.table)) {
            const tbl = await db.openTable(options.table, embedFunction)
            await tbl.overwrite(options.data)
        } else {
            throw new Error("Table does not exist")
        }
    } catch (e) {
        console.error(e);
        throw e;
    }
}

export async function remove(options: DeleteOptions) {
    try {
        const db = await connect(dbPath)

        if ((await db.tableNames()).includes(options.table)) {
            const tbl = await db.openTable(options.table, embedFunction)
            await tbl.delete(options.filter)
        } else {
            throw new Error("Table does not exist")
        }
    } catch (e) {
        console.error(e);
        throw e;
    }
}

export async function ingest(options: IngestOptions) {
    try {
        const db = await connect(dbPath)
        if ((await db.tableNames()).includes(options.table)) {
            const tbl = await db.openTable(options.table, embedFunction)
            await tbl.overwrite(options.data)
        } else {
            await db.createTable(options.table, options.data, embedFunction)
        }
    }
    catch (e) {
        console.error(e);
        throw e;
    }
}

export async function retrive(options: RetriveOptions) {
    try {
        const db = await connect(dbPath)

        if ((await db.tableNames()).includes(options.table)) {
            const tbl = await db.openTable(options.table, embedFunction)
            const build = tbl.search(options.query);

            if (options.filter) {
                build.filter(options.filter)
            }

            if (options.select) {
                build.select(options.select)
            }

            if (options.limit) {
                build.limit(options.limit)
            }

            const results = await build.execute();
            return results;
        } else {
            throw new Error("Table does not exist")
        }
    }
    catch (e) {
        console.error(e);
        throw e;
    }
}

And I call it from another file:

test.ts

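import dotenv from 'dotenv';
// The helper functions used below come from the file shown above.
import { ingest, retrive, useOpenAiEmbedding } from './lanceDb-retriver';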
dotenv.config();
const apiKey = process.env.OPENAI_API_KEY;

async function main() {
    console.time('ingest');
    const data = [
        {
            id: 1,
            metadata: {
                title: "Lorem Ipsum Document",
                author: "John Doe",
                date: "2023-09-20"
            },
            pageContent: "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed ac ipsum nec justo consequat dignissim. Nulla facilisi. Integer gravida tincidunt turpis eget iaculis."
        },
        {
            id: 2,
            metadata: {
                title: "Technical Report on AI Ethics",
                author: "Jane Smith",
                date: "2023-09-21"
            },
            pageContent: "This document provides an overview of the ethical considerations surrounding artificial intelligence. It covers topics such as bias in machine learning, data privacy, and responsible AI development."
        },
    ];
    useOpenAiEmbedding(apiKey);
    await ingest({
        data,
        table: 'vectors'
    })

    const retriveData = await retrive({
        table: 'vectors',
        query: 'what is lorem?'
    });

    console.log(retriveData);
    console.timeEnd('ingest');
}

main();
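One possible cause, not confirmed in this thread, is the nested `metadata` object: string fields inside a nested struct may end up dictionary-encoded, with a different dictionary per batch. A minimal sketch of a workaround under that assumption flattens each record into top-level columns before ingesting. The helper name `flattenRecord` and the `_` separator are hypothetical.

```typescript
type Row = Record<string, unknown>;

// Hypothetical helper: lifts nested plain-object fields into top-level columns,
// e.g. { metadata: { title: "x" } } becomes { metadata_title: "x" }.
// Only plain object literals are flattened; arrays, typed arrays, dates,
// and null values are copied through unchanged.
function flattenRecord(record: Row, prefix = "", separator = "_"): Row {
  const flat: Row = {};
  for (const [key, value] of Object.entries(record)) {
    const column = prefix ? `${prefix}${separator}${key}` : key;
    const isPlainObject =
      typeof value === "object" &&
      value !== null &&
      Object.getPrototypeOf(value) === Object.prototype;
    if (isPlainObject) {
      Object.assign(flat, flattenRecord(value as Row, column, separator));
    } else {
      flat[column] = value;
    }
  }
  return flat;
}

// Sample row shaped like the data in this issue.
const sample = {
  id: 1,
  metadata: { title: "Lorem Ipsum Document", author: "John Doe", date: "2023-09-20" },
  pageContent: "Lorem ipsum dolor sit amet",
};
console.log(flattenRecord(sample));
```

With the code from this issue, the call would become `await ingest({ table: 'vectors', data: data.map((row) => flattenRecord(row)) })`. The arrow function matters: passing `flattenRecord` directly to `map` would feed the array index in as the prefix.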

Add JS top-level catches to help users debug possible errors
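A minimal sketch of what such a top-level catch could look like. The wrapper name `runMain` and its injectable `report` parameter are hypothetical, not something the examples currently use; the idea is only that the entry point's rejection is caught once, with context, instead of surfacing as an unhandled rejection.

```typescript
// Hypothetical helper: runs an async entry point and reports any failure
// with its stack trace. Resolves to true on success and false on failure.
async function runMain(
  main: () => Promise<void>,
  report: (message: string) => void = console.error,
): Promise<boolean> {
  try {
    await main();
    return true;
  } catch (err) {
    const detail = err instanceof Error ? (err.stack ?? err.message) : String(err);
    report(`Example failed: ${detail}`);
    return false;
  }
}

// Stand-in entry point that fails the way a missing table might.
async function main(): Promise<void> {
  throw new Error("Table does not exist");
}

runMain(main).then((ok) => console.log(ok ? "done" : "failed"));
```

In a Node script the final line could instead set `process.exitCode = 1` on failure, which marks the run as failed without cutting off pending output.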

Tag all examples

Add [Beginner, Intermediate, Advanced] tags for all examples
Try to maintain the ordering: beginner followed by intermediate
Group topics together, for example like this:

Add javascript notebooks and modify readmes to link to these notebooks

Convert all to_df calls to to_pandas
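The rename is mechanical, so it lends itself to a scripted pass. A minimal sketch, assuming the calls appear as zero-argument `.to_df()` method calls in script sources or notebook cell text; `renameToPandas` is a hypothetical helper, and any call that passes arguments is left untouched for manual review.

```typescript
// Hypothetical codemod: rewrites zero-argument `.to_df()` calls to
// `.to_pandas()` and reports how many replacements were made.
function renameToPandas(source: string): { text: string; count: number } {
  let count = 0;
  const text = source.replace(/\.to_df\(\s*\)/g, () => {
    count += 1;
    return ".to_pandas()";
  });
  return { text, count };
}

// Example input: a two-line Python snippet held as a string.
const before = "df = tbl.search(query).limit(5).to_df()\nprint(df)";
console.log(renameToPandas(before));
```

Returning the count makes it easy to list which files actually changed before committing.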

[User Feedback]: Recipes need better branding, content depth & better introduction.

A power user provided the following critical feedback:

Who is this repo aimed at? Add better intro for each of the tables to set the right expectations.
A user might come expecting tutorials with a lot of handholding. Instead 2/3 tables are mostly PoCs and Standalone applications. They expect users to know a bit more than "what are LLMs, vectorDBs, & RAGs respectively". We've recently added a 3rd table that is aimed more at "Lengthy tutorials" that are actually introductory but no one can deduce them from the table titles at a glance.
Why should someone use this repo as opposed to others, like the OpenAI Cookbook?
Missing prominent backlinks to LanceDB, and it doesn't say much about the value props (serverless, no setup or auth, native JS, etc.), so a user landing directly on this repo has no incentive to try it out.
(This was actually done on purpose initially to get users to try the examples sooner, but maybe it's better to find a middle ground)

chat with any website app broken

I followed the readme, but keep getting 422 errors:

-> % curl 'http://localhost:7860/run/predict' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Content-Type: application/json' \
  -H 'Cookie: PGADMIN_LANGUAGE=en; _ga=GA1.1.919721918.1702334574; _ga_R1FN4KJKJH=GS1.1.1702334574.1.1.1702334813.0.0.0' \
  -H 'Origin: http://localhost:7860' \
  -H 'Pragma: no-cache' \
  -H 'Referer: http://localhost:7860/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-GPC: 1' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' \
  -H 'dnt: 1' \
  -H 'sec-ch-ua: "Not_A Brand";v="8", "Chromium";v="120", "Brave";v="120"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  --data-raw '{"data":["https://lancedb.com/ & https://blog.lancedb.com/context-aware-chatbot-using-llama-2-lancedb-as-vector-database-4d771d95c755"],"event_data":null,"fn_index":0,"session_hash":"81i86315img"}' \
  --compressed
{"detail":[{"type":"missing","loc":["body","event_id"],"msg":"Field required","input":{"data":["https://lancedb.com/ & https://blog.lancedb.com/context-aware-chatbot-using-llama-2-lancedb-as-vector-database-4d771d95c755"],"event_data":null,"fn_index":0,"session_hash":"81i86315img"},"url":"https://errors.pydantic.dev/2.4/v/missing"}]}%

openai.Configuration is not a constructor

Heyo, your implementation of the OpenAIEmbeddingFunction seems to fail when I try to run it like this:

import { OpenAIEmbeddingFunction, connect } from 'vectordb';
import dotenv from 'dotenv';
dotenv.config();

const dbPath = 'assets/db/lancedb'
const apiKey = process.env.OPENAI_API_KEY
let embedFunction = new OpenAIEmbeddingFunction('info', apiKey)

I get this error:

        const configuration = new openai.Configuration({
                              ^
TypeError: openai.Configuration is not a constructor
    at new OpenAIEmbeddingFunction (...node_modules\vectordb\dist\embedding\openai.js:37:31)

Product recommender example with Instacart dataset

Monthly Recipes audit

Jan -

Things to do:

Every example should work without errors
The requirements need to be present in the examples directory
In case there is a Colab, make sure the first cell installs all requirements
The links to colabs and blogs should actually work

NOTE: When auditing, make sure each example gets tested in a separate env. Otherwise, missing-dependency errors won't be caught in some cases. This can be automated via a script or something similar.
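The isolation requirement in the note can be scripted. A rough sketch, assuming a POSIX shell, a `requirements.txt` inside each example directory, and a Python entry script; the helper `auditCommands`, the `.venv-audit` directory name, and the `main.py` default are all hypothetical. It only builds the command list, so the plan can be reviewed before anything is executed.

```typescript
// Builds the shell commands that test one example in its own virtual
// environment, so a dependency missing from requirements.txt fails loudly.
// Paths are not shell-quoted; directory names with spaces would need quoting.
function auditCommands(exampleDir: string, entryScript = "main.py"): string[] {
  const dir = exampleDir.replace(/\/+$/, ""); // drop trailing slashes
  const venv = `${dir}/.venv-audit`;
  return [
    `python3 -m venv ${venv}`,
    `${venv}/bin/pip install -r ${dir}/requirements.txt`,
    `${venv}/bin/python ${dir}/${entryScript}`,
    `rm -rf ${venv}`,
  ];
}

// Print one chained command line per example directory.
for (const dir of ["examples/Code-Documentation-QA-Bot"]) {
  console.log(auditCommands(dir).join(" && "));
}
```

Because each example gets a fresh environment, packages installed for one example cannot mask a missing requirement in another.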