
embedchainjs's Introduction

embedchainjs


embedchain is a framework to easily create LLM-powered bots over any dataset. embedchainjs is the JavaScript version of embedchain. If you want the Python version, check out embedchain-python.

🤝 Let's Talk Embedchain!

Schedule a Feedback Session with Taranjeet, the founder, to discuss any issues, provide feedback, or explore improvements.

How it works

It abstracts the entire process of loading a dataset, chunking it, creating embeddings, and storing them in a vector database.

You can add one or more datasets using the .add and .addLocal functions, and then use the .query function to find an answer from the added datasets.

If you want to create a Naval Ravikant bot which has 2 of his blog posts, as well as a question and answer pair you supply, all you need to do is add the links to the blog posts and the QnA pair and embedchain will create a bot for you.

const dotenv = require("dotenv");
dotenv.config();
const { App } = require("embedchain");

//Run the app commands inside an async function only
async function testApp() {
  const navalChatBot = await App();

  // Embed Online Resources
  await navalChatBot.add("web_page", "https://nav.al/feedback");
  await navalChatBot.add("web_page", "https://nav.al/agi");
  await navalChatBot.add(
    "pdf_file",
    "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf"
  );

  // Embed Local Resources
  await navalChatBot.addLocal("qna_pair", [
    "Who is Naval Ravikant?",
    "Naval Ravikant is an Indian-American entrepreneur and investor.",
  ]);

  const result = await navalChatBot.query(
    "What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"
  );
  console.log(result);
  // answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
}

testApp();

Getting Started

Installation

  • First make sure that you have the package installed. If not, then install it using npm
npm install embedchain && npm install -S openai@^3.3.0
  • Currently, embedchain is only compatible with openai 3.x, not the latest 4.x. Make sure you install the right version; otherwise you will see the ChromaDB error TypeError: OpenAIApi.Configuration is not a constructor

  • Make sure that the dotenv package is installed and that your OPENAI_API_KEY is set in a file called .env in the root folder. You can install dotenv with

npm install dotenv
  • Download and install Docker on your device by visiting this link. You will need this to run Chroma vector database on your machine.

  • Run the following commands to set up the Chroma container in Docker

git clone https://github.com/chroma-core/chroma.git
cd chroma
docker-compose up -d --build
  • Once Chroma container has been set up, run it inside Docker

Usage

  • We use OpenAI's embedding model to create embeddings for chunks and the ChatGPT API as the LLM to get an answer given the relevant docs. Make sure that you have an OpenAI account and an API key. If you don't have an API key, you can create one by visiting this link.

  • Once you have the API key, set it in an environment variable called OPENAI_API_KEY

# Set this inside your .env file
OPENAI_API_KEY="sk-xxxx"
  • Load the environment variables inside your .js file using the following commands
const dotenv = require("dotenv");
dotenv.config();
  • Next, import the App class from embedchain and use the .add function to add any dataset.
  • Now your app is created. You can use the .query function to get the answer for any query.
const dotenv = require("dotenv");
dotenv.config();
const { App } = require("embedchain");

async function testApp() {
  const navalChatBot = await App();

  // Embed Online Resources
  await navalChatBot.add("web_page", "https://nav.al/feedback");
  await navalChatBot.add("web_page", "https://nav.al/agi");
  await navalChatBot.add(
    "pdf_file",
    "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf"
  );

  // Embed Local Resources
  await navalChatBot.addLocal("qna_pair", [
    "Who is Naval Ravikant?",
    "Naval Ravikant is an Indian-American entrepreneur and investor.",
  ]);

  const result = await navalChatBot.query(
    "What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"
  );
  console.log(result);
  // answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
}

testApp();
  • If your script or app already has another identifier named App, you can alias the import:
const { App: EmbedChainApp } = require("embedchain");

// or

const { App: ECApp } = require("embedchain");
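
A missing API key tends to surface later as a less obvious error, so it can help to check for it right after loading the environment. The following is a minimal sketch; `requireEnv` is a hypothetical helper for illustration and is not part of embedchain.

```javascript
// Hypothetical helper (not part of embedchain): fail fast when a required
// environment variable is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(
      `Missing environment variable ${name}; set it in your .env file`
    );
  }
  return value.trim();
}

// Usage, after dotenv.config() has run:
// const apiKey = requireEnv("OPENAI_API_KEY");
```

Passing the environment as a second argument keeps the helper easy to test without touching the real process environment.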

Formats supported

We support the following formats:

PDF File

To add any pdf file, use the data_type as pdf_file. Eg:

await app.add("pdf_file", "a_valid_url_where_pdf_file_can_be_accessed");

Web Page

To add any web page, use the data_type as web_page. Eg:

await app.add("web_page", "a_valid_web_page_url");

QnA Pair

To supply your own QnA pair, use the data_type as qna_pair and pass a two-element array of question and answer. Eg:

await app.addLocal("qna_pair", ["Question", "Answer"]);

More Formats coming soon

  • If you want to add any other format, please create an issue and we will add it to the list of supported formats.

Testing

Before you consume valuable tokens, you should make sure that the embedding works and that the query retrieves the correct document from the database.

For this you can use the dryRun method.

Following the example above, add this to your script:

const dryRunResult = await navalChatBot.dryRun(
  "What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"
);
console.log(dryRunResult);

'''
Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
terms of the unseen. And I think that’s critical. That is what humans do uniquely that no other creature, no other computer, no other intelligence—biological or artificial—that we have ever encountered does. And not only do we do it uniquely, but if we were to meet an alien species that also had the power to generate these good explanations, there is no explanation that they could generate that we could not understand. We are maximally capable of understanding. There is no concept out there that is possible in this physical reality that a human being, given sufficient time and resources and
Query: What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?
Helpful Answer:
'''

The embedding is confirmed to work as expected. It returns the right document, even if the question is asked slightly differently. No prompt tokens have been consumed.

The dry run will still consume tokens to embed your query, but it is only ~1/15 of the prompt.
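
To make the dry-run output easier to read, here is a rough sketch of how such a prompt could be assembled from retrieved chunks without calling the LLM. The instruction text and the `Query:` / `Helpful Answer:` labels are copied from the output above; `buildPrompt` itself is a hypothetical helper, not embedchain's actual implementation.

```javascript
// Instruction text copied from the dry-run output shown above.
const PROMPT_INSTRUCTIONS =
  "Use the following pieces of context to answer the query at the end. " +
  "If you don't know the answer, just say that you don't know, " +
  "don't try to make up an answer.";

// Hypothetical helper: combine retrieved chunks and the query into one prompt
// string. No LLM call is made, so no prompt tokens are consumed.
function buildPrompt(contextChunks, query) {
  return [
    PROMPT_INSTRUCTIONS,
    contextChunks.join("\n"),
    `Query: ${query}`,
    "Helpful Answer:",
  ].join("\n");
}

console.log(buildPrompt(["<retrieved chunk>"], "Who is Naval Ravikant?"));
```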

How does it work?

Creating a chatbot over any dataset requires the following steps:

  • load the data
  • create meaningful chunks
  • create embeddings for each chunk
  • store the chunks in vector database
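
As an illustration of the chunking step, the sketch below splits text into fixed-size, overlapping chunks. This is one common strategy, not necessarily the one embedchain uses; the function name and the default sizes are assumptions chosen for the example.

```javascript
// Split text into chunks of at most `chunkSize` characters. Consecutive chunks
// share `overlap` characters so text cut at a boundary keeps some context.
// The default sizes are illustrative assumptions, not embedchain's settings.
function chunkText(text, chunkSize = 1000, overlap = 100) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 1));
```

Each chunk would then be embedded and stored together with the chunk text, so that the text can be returned when its embedding matches a query.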

Whenever a user asks a query, the following process finds the answer:

  • create the embedding for query
  • find similar documents for this query from vector database
  • pass similar documents as context to LLM to get the final answer.
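
The query-time steps can be sketched in memory as follows. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the array scan is a stand-in for the vector database lookup; both are assumptions made so the example runs without any external service.

```javascript
// Toy stand-in for an embedding model: a bag-of-words term-count vector.
// A real setup would call an embedding API instead.
function embed(text) {
  const counts = new Map();
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse term-count vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [token, count] of a) dot += count * (b.get(token) ?? 0);
  const norm = (v) =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// In-memory stand-in for the vector database: rank chunks by similarity
// to the query and return the best `topK`.
function findSimilar(query, chunks, topK = 1) {
  const queryVector = embed(query);
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryVector, embed(chunk)),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((entry) => entry.chunk);
}
```

The chunks returned by `findSimilar` are what would be passed to the LLM as context alongside the original query.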

Loading the dataset and then querying it involves multiple steps, and each step has its own nuances.

  • How should I chunk the data? What is a meaningful chunk size?
  • How should I create embeddings for each chunk? Which embedding model should I use?
  • How should I store the chunks in vector database? Which vector database should I use?
  • Should I store meta data along with the embeddings?
  • How should I find similar documents for a query? Which ranking model should I use?

These questions may be trivial for some, but for many of us answering them well takes research, experimentation, and time.

embedchain is a framework which takes care of all these nuances and provides a simple interface to create bots over any dataset.

In the first release, we are making it easy for anyone to get a chatbot over any dataset up and running in less than a minute. All you need to do is create an app instance, add the datasets using the .add function, and then use the .query function to get the relevant answer.

Tech Stack

embedchain is built on the following stack:

Team

Author

Maintainer

Citation

If you utilize this repository, please consider citing it with:

@misc{embedchain,
  author = {Taranjeet Singh},
  title = {Embedchain: Framework to easily create LLM powered bots over any dataset},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/embedchain/embedchainjs}},
}


embedchainjs's Issues

pdf read

Thanks for creating such sophisticated software.
I ran into an issue with the "hello world" example on Fedora Linux:

All data from https://nav.al/feedback already exists in the database.
Successfully saved https://nav.al/agi. Total chunks count: 6
/home/rudix/Desktop/llm/node_modules/pdfjs-dist/build/pdf.js:1709
      data: structuredClone(obj, transfer ? {
            ^

ReferenceError: structuredClone is not defined

Is it possible to get more "debugging" info of the response ... like tokens used ?

Telemetry breaks embedchain in Nextjs API routes

I am using the following:

I have an API route that imports embedchain with the following folder structure:

  • node_modules/
  • app/
    -- api/
    --- pdfchat/
    ---- route.ts <--- embedchain imported here
    -- page.tsx
  • next.config.js
  • package.json

Here is my route.ts file:

const dotenv = require("dotenv");
dotenv.config();

const { App } = require("embedchain");

export const maxDuration = 120;
export const runtime = "nodejs";

export const dynamic = 'force-dynamic';
dotenv.config();

export async function POST(req: Request) {
  try {
    const {
      messages,
      doc_url,
    } = await req.json()

    const navalChatBot = await App();
    
    await navalChatBot.add('pdf_file', doc_url);
    
    const userMessage = messages[messages.length - 1].content;
    
    const result = await navalChatBot.query(userMessage);

    return result


  } catch (error) {
    console.error(error);
  }
}

However, when this route receives a request and invokes the EmbedChainApp it throws a runtime error as follows:

 ⨯ unhandledRejection: Error: ENOENT: no such file or directory, open '/Users/adham/Developer/rna/chat-etf-bot/.next/server/package.json'
    at Object.openSync (node:fs:603:3)
    at Object.readFileSync (node:fs:471:35)
    at EmbedChainApp.eval (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/embedchain.js:308:47)
    at Generator.next (<anonymous>)
    at eval (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/embedchain.js:59:71)
    at new Promise (<anonymous>)
    at __awaiter (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/embedchain.js:41:12)
    at EmbedChainApp.sendTelemetryEvent (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/embedchain.js:301:16)
    at new EmbedChain (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/embedchain.js:93:14)
    at new EmbedChainApp (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/embedchain.js:344:1)
    at eval (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/index.js:35:21)
    at Generator.next (<anonymous>)
    at eval (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/index.js:26:71)
    at new Promise (<anonymous>)
    at __awaiter (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/index.js:8:12)
    at App (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]_@[email protected]_@[email protected]_@pinecone-databa_q3mszccedqc6bzmo3cdgeljek4/node_modules/embedchain/dist/index.js:34:17)
    at doc_embedder (webpack-internal:///(rsc)/./app/api/chat/related-docs/doc_embedder.js:6:25)
    at POST (webpack-internal:///(rsc)/./app/api/chat/related-docs/route.ts:25:82)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /Users/adham/Developer/rna/chat-etf-bot/node_modules/.pnpm/[email protected]_@[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-route.runtime.dev.js:6:62623 {
  errno: -2,
  syscall: 'open',
  code: 'ENOENT',
  path: '/Users/adham/Developer/rna/chat-etf-bot/.next/server/package.json'
}
Error: Setting up fake worker failed: "Cannot find module './pdf.worker.js'
Require stack:
- /Users/adham/Developer/rna/chat-etf-bot/.next/server/vendor-chunks/[email protected]".
    at eval (webpack-internal:///(rsc)/./node_modules/.pnpm/[email protected]/node_modules/pdfjs-dist/build/pdf.js:2116:58)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Upon a deeper look, I noticed that the sendTelemetryEvent function is trying to read the contents of the package.json of the embedchain module. However, the way the path to the package.json file is declared with __dirname causes a problem. https://github.com/embedchain/embedchainjs/blob/a833b3988a4bb5b5d22cfd96851572f3cc26de24/embedchain/embedchain.ts#L265

The __dirname variable in this case points to the root directory of the Next.js project, not the directory of the npm module. This means that the package.json file of the npm module won't be found, and fs.readFileSync will throw that error.

Is it possible to use absolute path instead here? Right now this library can't be used in Nextjs APIs 😞
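
One defensive option, sketched here as an assumption and not as the library's actual fix, is to treat the version lookup used for telemetry as best-effort, so that a missing package.json cannot crash the caller. `readPackageVersion` is a hypothetical helper name.

```javascript
const fs = require("fs");

// Hypothetical helper: read a package version for telemetry without letting
// a missing or unreadable package.json throw into the caller.
function readPackageVersion(packageJsonPath, fallback = "unknown") {
  try {
    const { version } = JSON.parse(fs.readFileSync(packageJsonPath, "utf8"));
    return typeof version === "string" ? version : fallback;
  } catch (error) {
    // ENOENT (as in the trace above) or invalid JSON: telemetry is optional,
    // so fall back instead of failing.
    return fallback;
  }
}

console.log(readPackageVersion("/path/that/does/not/exist/package.json"));
```

This does not make the path resolution correct under a bundler; it only keeps an optional feature from breaking the main code path.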

Make it more configurable?

It would be nice to pass different models here https://github.com/embedchain/embedchainjs/blob/ff5c5fef4bf3dc7ae3b44d28142e42f596829b14/embedchain/embedchain.ts#L146 e.g. gpt-4

Also, I am thinking about deploying a ChromaDB instance with CloudFormation https://docs.trychroma.com/deployment?lang=js

So it would be nice not to have to rely on it being on the same machine the JS is running on, i.e. :8000 https://github.com/embedchain/embedchainjs/blob/ff5c5fef4bf3dc7ae3b44d28142e42f596829b14/embedchain/vectordb/ChromaDb.ts#L21

For now I could create my own DB instance I suppose? Something like

async function App(db) {
  const navalChatBot = new EmbedChain(db);
  // ... Other initialization logic
  return navalChatBot;
}

// Run the app commands inside an async function only
async function testApp() {
  const myDB = new MyCustomDB();
  const navalChatBot = await App(myDB);
  // ... add and query as usual
}
Embedding from raw text/json

Hi! Thanks for building this :)
I see you support embeddings like Websites and PDF which require extra processing to convert into a digestible format, so I'm wondering why the simplest of embedding (aka raw text) is not supported? Or am I missing something?

Thanks,
David

Support for youtube_video format

Hi,
I would like to test how well AI can capture what is interesting from a YouTuber's channel.
Do you have plans to add this?

naval_chat_bot.add limitations

This project looks interesting. I have a few quick questions.
Suppose we have thousands of PDF files, URLs, and videos. What are the limitations on adding datasets? Is it required to call naval_chat_bot.add for all of these resources every time the app initializes?

Add configuration parameter to App()

Hi,

I would like to instantiate several bots in the same application.
For this to work, the minimum is to be able to specify a (Chroma) collection name.

Other config parameters like "chroma server" & "port" and "OpenAI .env key" would help to cover multi-instance cases.

WYT?
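
As a sketch of what such a configuration could look like: every option name below (`collectionName`, `chromaHost`, `chromaPort`, `openAiApiKey`) is hypothetical, and the default collection name is a placeholder. None of this is an existing embedchain API; only port 8000 comes from the linked ChromaDb.ts reference.

```javascript
// Hypothetical option names; App() does not necessarily accept any of them.
const DEFAULT_APP_CONFIG = Object.freeze({
  collectionName: "default_collection", // placeholder value
  chromaHost: "localhost",
  chromaPort: 8000,
  openAiApiKey: undefined, // would fall back to process.env.OPENAI_API_KEY
});

// Merge caller overrides over the defaults, rejecting misspelled options.
function resolveAppConfig(overrides = {}) {
  const unknown = Object.keys(overrides).filter(
    (key) => !(key in DEFAULT_APP_CONFIG)
  );
  if (unknown.length > 0) {
    throw new TypeError(`Unknown option(s): ${unknown.join(", ")}`);
  }
  return { ...DEFAULT_APP_CONFIG, ...overrides };
}

// Two bots in one application would differ only in their overrides.
console.log(resolveAppConfig({ collectionName: "bot_a" }));
console.log(resolveAppConfig({ collectionName: "bot_b", chromaPort: 8001 }));
```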

Add maintainers from community

  • The Embedchain team will develop each feature first in Python and then port it to JavaScript. So the JavaScript SDK will lag in terms of features, but we want to keep this gap as small as possible.
  • Community is asking to be added as moderators

Community asks / features to add

  • typescript support
  • support for youtube loader

Load an existing database?

Tried setting a different dir for chroma but seems to go back to updating the default /db location.

'vectordb': {
    'provider': 'chroma',
    'config': {
        'collection_name': 'test_app2',
        'dir': 'db1',
        'allow_reset': True
    },
},

If I start new file using same config for 'vectordb' and app name, it does not see previous embedded vectors.

Announcing 2 new maintainers

Onboarding @sahilyadav902 and @cachho as the official maintainers for Embedchain Javascript package.

Really liked the way Sahil and Cachho have been contributing to Embedchain, so this move made natural sense.

Together, let's make embedchain better and the best framework for devs to create LLM bots.

[CHORE] Discrepancy between JavaScript and Python package

This issue is there to track features that might exist in python, that are not part of this module yet.

Classes:

  • OpenSourceApp
  • PersonApp
  • Config

Methods:

  • Dry run
  • Chat
  • count
  • reset

Formats:

  • Text
  • Docs

Misc:

  • check for duplicate ids in add
  • table of contents in readme
  • unit tests
  • show new chunks when adding a dataset
  • streaming responses

Question about Installing 'embedchain' with 'openai' Version Compatibility

Hello,

I am trying to install the 'embedchain' package in my project, and the documentation specifies that it requires 'openai' version '^3.3.0'. However, when I attempt to install it as per the instructions, I encounter dependency resolution issues with 'openai'.

Here are the steps I've taken:

  1. In my project's package.json file, I've specified the OpenAI version as "openai": "^3.3.0" to ensure compatibility.
  2. I've deleted the node_modules folder and package-lock.json file in my project directory to start fresh.
  3. I've cleared the npm cache using the command 'npm cache clean --force'.

Despite these steps, I still encounter issues when running 'npm install embedchain', which tries to install OpenAI version 4.3.1, causing dependency conflicts.

npm WARN ERESOLVE overriding peer dependency
npm WARN While resolving: [email protected]
npm WARN Found: peerOptional openai@"^3.0.0 | ^4.0.0" from the root project
npm WARN 
npm WARN Could not resolve dependency:
npm WARN peerOptional openai@"^3.0.0 | ^4.0.0" from the root project

Is there any specific guidance or workaround for installing 'embedchain' with 'openai' version '^3.3.0'? Are there any known compatibility issues with certain versions of 'openai' that I should be aware of? I'd appreciate any assistance or advice you can provide on this matter.

Thank you!

support for mistral

llm:
  provider: huggingface
  config:
    model: 'mistralai/Mistral-7B-Instruct-v0.2'
    top_p: 0.5
embedder:
  provider: huggingface
  config:
    model: 'sentence-transformers/all-mpnet-base-v2'
[FEATURE] Reset function

Since this package is in client-mode, we need a function to reset the chroma db. Otherwise, you have to go into the docker container, and delete the files, which is a mess.

Error: Please install openai as a dependency with, e.g. `yarn add openai`

This looks like a promising project.

I was attempting to run the example in the readme. When started, I get the following error:

node index.js                                         
Telemetry: Attempt 1 failed with status: 415
Telemetry: Attempt 2 failed with status: 415
/embedchainjs/node_modules/chromadb/dist/cjs/chromadb.cjs:2986
      throw new Error(
            ^

Error: Please install openai as a dependency with, e.g. `yarn add openai`

I have installed the npm dependencies, entered the OpenAI token, and I have Chroma running in Docker.

Any tips?
