
chroma's Introduction


Chroma - the open-source embedding database.
The fastest way to build Python or JavaScript LLM apps with memory!

Discord | License | Docs | Homepage

pip install chromadb # python client
# for javascript, npm install chromadb!
# for client-server mode, chroma run --path /chroma_db_path

The core API is only 4 functions (run our 💡 Google Colab or Replit template):

import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on these!
    ids=["doc1", "doc2"], # unique for each doc
)

# Query/search 2 most similar results. You can also .get by id
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)

Features

  • Simple: Fully-typed, fully-tested, fully-documented == happiness
  • Integrations: 🦜️🔗 LangChain (Python and JS), 🦙 LlamaIndex, and more soon
  • Dev, Test, Prod: the same API that runs in your python notebook, scales to your cluster
  • Feature-rich: Queries, filtering, density estimation and more
  • Free & Open Source: Apache 2.0 Licensed

Use case: ChatGPT for ______

For example, the "Chat your data" use case:

  1. Add documents to your database. You can pass in your own embeddings, embedding function, or let Chroma embed them for you.
  2. Query relevant documents with natural language.
  3. Compose documents into the context window of an LLM like GPT-3 for additional summarization or analysis.
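The steps above can be sketched in plain Python. This is a minimal, hypothetical illustration of step 3; in a real app the document texts would come from `collection.query()` and the prompt would be sent to your LLM of choice:

```python
# Hypothetical sketch: compose retrieved documents into an LLM prompt.
# build_prompt is an illustrative helper, not part of the Chroma API.

def build_prompt(question: str, documents: list[str]) -> str:
    """Stuff retrieved documents into a simple question-answering prompt."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is Chroma?",
    ["Chroma is an embedding database."],  # would come from collection.query()
)
```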

Embeddings?

What are embeddings?

  • Read the guide from OpenAI
  • Literal: Embedding something turns it from image/text/audio into a list of numbers. 🖼️ or 📄 => [1.2, 2.1, ....]. This process makes documents "understandable" to a machine learning model.
  • By analogy: An embedding represents the essence of a document. This enables documents and queries with the same essence to be "near" each other and therefore easy to find.
  • Technical: An embedding is the latent-space position of a document at a layer of a deep neural network. For models trained specifically to embed data, this is the last layer.
  • A small example: Suppose you search your photos for "famous bridge in San Francisco". By embedding the query and comparing it to the embeddings of your photos and their metadata, Chroma should return photos of the Golden Gate Bridge.

Embeddings databases (also known as vector databases) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database. By default, Chroma uses Sentence Transformers to embed for you but you can also use OpenAI embeddings, Cohere (multilingual) embeddings, or your own.
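The nearest-neighbor idea can be made concrete with a tiny sketch. This is a toy illustration (brute-force cosine similarity over two 2-dimensional vectors), not how Chroma's ANN index works internally:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], embeddings: dict[str, list[float]]) -> str:
    """Return the id whose stored embedding is most similar to the query."""
    return max(embeddings, key=lambda k: cosine_similarity(query, embeddings[k]))

# toy 2-d "embeddings"; real embeddings have hundreds of dimensions
photos = {"golden_gate": [0.9, 0.1], "cat": [0.1, 0.95]}
nearest([0.8, 0.2], photos)  # "golden_gate"
```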

Get involved

Chroma is a rapidly developing project. We welcome PR contributors and ideas for how to improve the project.

Release Cadence: We currently release new tagged versions of the PyPI and npm packages on Mondays. Hotfixes go out at any time during the week.

License

Apache 2.0

chroma's People

Contributors

adjectiveallison, alabasteraxe, alw3ys, atroyn, beggers, cakecrusher, codetheweb, dglazkov, floleuerer, fr0th, grishick, hammadb, ibratoev, ishiihara, jeffchuber, laserbear, levand, naynaly10, nicolasgere, nlsfnr, patcher9, perryrobinson, sai-suraj-27, sanketkedia, satyam-79, shivankar-p, swyxio, tazarov, tonisives, weiligu


chroma's Issues

Metrics/Logs/Tracers for the User

When the user deploys this service, they will want to:

  • capture some level of logging
  • understand performance of the service
  • capture any bugs that are happening

We should offer some smart defaults around this.

Fix up READMEs

There are currently 3 separate READMEs. None of these give the right instructions for setup, testing, or usage.

"get_nearest_neighbor" can return nan neighbor classes - breaks server implementation

get_nearest_neighbor can return nan for the labels if we happen to query unlabeled datasets. This breaks FastAPI's JSON serialization with the error ValueError: Out of range float values are not JSON compliant.

We should replace the nans or handle them accordingly. The current workaround when using the server is to avoid querying unlabeled datasets in get_nearest_neighbor calls, like so (which is likely what is desired anyway):

chroma_client.get_nearest_neighbors(embedding, n_results=N, where={'dataset':'training'})
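One possible fix is to sanitize results before serialization. A minimal sketch (the helper name is illustrative, not an existing Chroma function):

```python
import math

def sanitize(value):
    """Recursively replace NaN/inf floats with None so JSON serialization
    succeeds. None serializes to JSON null, which FastAPI handles fine."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    return value

sanitize({"labels": [1.0, float("nan")]})  # {'labels': [1.0, None]}
```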

Data formatting/storage for inference and label data

Currently we store data in the COCO JSON format. That looks like this:

{
    "annotations": [{
        "bbox": [-0.88869536, -2.7152133, 127.68482208, 62.74983978],
        "category_id": 43,
        "category_name": "knife"
    }]
}

More here on that format: https://roboflow.com/formats/coco-json
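One way to think about the "broken out into their own columns" question is a flattening pass. A hedged sketch (the function and column names are illustrative, assuming COCO-style bbox = [x, y, width, height]):

```python
def coco_to_rows(coco: dict) -> list[dict]:
    """Flatten COCO-style annotations into flat rows for columnar storage."""
    return [
        {
            "category_id": ann["category_id"],
            "category_name": ann.get("category_name"),
            "x": ann["bbox"][0],
            "y": ann["bbox"][1],
            "width": ann["bbox"][2],
            "height": ann["bbox"][3],
        }
        for ann in coco.get("annotations", [])
    ]
```

Keeping the raw annotation dict alongside the flattened columns would let us return to the user exactly what they gave us.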

Questions

  • What data format will our users' results be in? Presumably whatever is native to wherever they run inference/logging.
  • What properties actually matter to us? Should things be broken out into their own columns? Should we be able to store and return to the user exactly what they gave us, even so?

Next steps
Review various data formats. Look further at roboflow, see if the labeling providers have some documentation. Consider both vision as well as NLP applications.

Relax strict requirements pins?

Chroma uses strict requirement pins, which means that pip will only install it as part of an application whose requirement pins are exactly the same. As Chroma is a framework that will (hopefully!) be included in many other applications, having strict pins severely limits that uptake by making it automatically incompatible with applications that have even a minor version difference of something as common as FastAPI, requests, or Pydantic.

If it's of interest: at Prefect we found this article very helpful in thinking through the implications of pinning versions, and we have generally left our frameworks (which are intended to be used in other software) pinned only to lower bounds that we know are compatible with our usage.
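A sketch of what lower-bound pins might look like; the package names and version numbers here are illustrative, not Chroma's actual dependency list:

```text
# requirements.txt: known-compatible minimums, no strict == pins
fastapi>=0.85
pydantic>=1.9
requests>=2.28
```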

API Design

This version is meant to get a handle on the external and internal APIs, not yet to define all the specifics of what inputs they take, if they do single vs batch mode, etc.

External API

  • init - set metadata to be used for logging and fetching for this session
  • log - log new embeddings and their metadata
  • fetch - get data to label

Internal API

  • init - loads db if exists
  • load database
  • save database - run every time the database changes so we never rely on in-memory state?
  • load index
  • save index
  • generate ANN index - runs hnswlib
  • use ANN - mhb and other calculations use the index from ANN for speed
  • run mhb - runs Mahalanobis distance
  • generate distances - generates distances based on mhb and the index and saves them to the db
  • save new embedding+metadata datapoints to db
  • [future] run umap
  • internal set of things that log does
  • internal set of things that fetch does
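The internal surface above might be sketched as a class skeleton. This is a hypothetical shape (JSON-on-disk stands in for the real db/index storage; names are illustrative):

```python
import json
from pathlib import Path

class ChromaInternal:
    """Hypothetical skeleton of the internal API: init loads the db if it
    exists; every mutation saves so we never rely on memory alone."""

    def __init__(self, root: str = ".chroma"):
        self.root = Path(root)
        self.db_path = self.root / "db.json"
        self.records: list[dict] = []
        self.load_database()  # init: load db if it exists

    def load_database(self) -> None:
        if self.db_path.exists():
            self.records = json.loads(self.db_path.read_text())

    def save_database(self) -> None:
        # run every time the database changes
        self.root.mkdir(parents=True, exist_ok=True)
        self.db_path.write_text(json.dumps(self.records))

    def add(self, embedding: list[float], metadata: dict) -> None:
        """Save a new embedding+metadata datapoint to the db."""
        self.records.append({"embedding": embedding, "metadata": metadata})
        self.save_database()
```

The index-side methods (generate ANN index, run mhb, generate distances) would hang off the same class, each persisting its artifact the same way.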

File store proposal

  • store the db and index in a .chroma directory at the level the python function was executed from

Questions

  • when do we regenerate the index? - on every new batch of datapoints?
  • when do we re-run mhb and regenerate distance values? - on every new batch of datapoints?

Document hand rolled (parquet/duckdb/hnswlib) customer cycle proposal

Based on our conversations so far, write up a possible round-trip customer data plan for MVP.

Should include:

  • customer data logging
  • transfer data to chroma
  • build ovoid indexes on training data
  • build ANN indexes on training data
  • ingest prod data
  • compute recommendations for prod -> training inclusion

Updating the index

One of the known potential issues with using hnswlib vs other more "full-featured" options is updating the index without recomputing the whole index.

This is actually important because updating the index is a core part of what we do.

Deleting from an index is also useful because users may upload data accidentally and want to remove it without starting completely over.

Move to upstream HNSWlib

The features we needed (filtering) have now been merged in upstream, so we no longer need our fork.

Prototype caching for ANN indexes

Depending on our ANN lib (e.g. Hnsw) we'll have an artifact computed for a set of rows. That has to:

  • live on disk
  • be associated to the input set
  • load into memory for live queries
  • unload after disuse

"Multi-space"

Right now Chroma is built as a "single-space" embedding store. Meaning it stores 1 embedding space at a time (from 1 layer, 1 trained model, 1 app).

Think about what it would take to make it "multi-space".

If we stick with duckdb+parquet this probably also means various kinds of parquet partitioning.

Telemetry back to Chroma

We need to be able to report on how many users (anonymously) are using Chroma. This is going to be an opt-out part of the setup flow. We should make it very easy for the user to tell what we are sending back so they can verify we are only sending back very lightweight fully-anonymized usage and no user data ever.

We may want to add telemetry to both the client and the server.

If we do add telemetry to the client - we need to make sure that our requests are not blocking.

To enable opt-out we are going to have to add a .env file to the application.

Should we send events directly to the downstream data store or use our own tracking domain, e.g. https://telemetry.trychroma.com?
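A minimal sketch of the opt-out mechanism, assuming an environment variable toggle (the variable name `CHROMA_TELEMETRY` and the in-memory sink are illustrative, not the real implementation):

```python
import os

TELEMETRY_EVENTS: list[dict] = []  # stand-in for the real downstream sink

def telemetry_enabled() -> bool:
    # Opt-out: telemetry is on unless the user disables it.
    # CHROMA_TELEMETRY is a hypothetical setting name.
    return os.environ.get("CHROMA_TELEMETRY", "1") != "0"

def capture(event_name: str) -> None:
    """Record an anonymous, payload-free event. A real client would send this
    from a background/daemon thread so the call never blocks user code."""
    if telemetry_enabled():
        TELEMETRY_EVENTS.append({"event": event_name})
```

Keeping the payload to an event name only makes it easy for users to verify nothing sensitive is sent.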

Worker/Queues

Unless we want to keep relying on the user to manually trigger index formation and run other downstream analytics, we want an internal worker/queue system for running these ourselves based on triggers and schedules.

Data storage for parquet files

If we stay with parquet, I imagine we will want to support moving these files off the local machines they are built on and off to object storage on whatever cloud the user is using.

Audit and resolve coco JSON export integrity

Without real auditing, I already know the counts are off in the objects set. I don't plan to go crazy with the audit, but just some sanity checks before asking Anton and Jeff to review.

The new files will be converted into parquet for testing.

DB Backups

Any production service needs a way to configure regular backups. If we go with duckdb/parquet this could be as simple as a bucket policy to create an archive of any new or updated files every 24h.

Make core algorithms more efficient

The core algorithms have a lot of redundant computation we could avoid by modularizing them further and caching results as they become available.

Interface formats

  • We want to make our "default" APIs use generic Python data structures like list, dict, etc. Why? Because we do not want to put the burden on the user to learn Apache Arrow, or to add extra dependencies like Pandas to their application if they aren't using them already.
  • When users want a speed up - they can move to a more opinionated and optimized data structure.
  • That being said, we can still use a format internally and standardize around it.
  • Apache Arrow is attractive as that candidate
  • Arrow also gives us direct access to Apache Arrow Flight and Apache Arrow Flight SQL
  • DuckDB and ClickHouse both support direct import and query of Apache Arrow files, and I assume data structures too, though this needs to be tested.
  • We could add other endpoints to our client or API like fetch, fetch_arrow, fetch_df, fetch_numpy to give users more flexibility in the format they get back. The client or API would handle this transform probably at that endpoint... all wrapped around fetch_arrow if we use that internally.
  • Because users will mainly use the client in the deployed context, but we also want to support direct import of chroma_core for in-memory notebook usage, we will want to offer these various fetch_* options at both layers.
  • The client can always talk to the API via insert_arrow, fetch_arrow however.
  • There is the additional question of moving data over-the-wire and what format we use. With Arrow, we could use Arrow Flight (which uses protobufs under the hood, IIRC), or we could use something else.
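The fetch_* layering described above can be sketched with plain Python structures. This is a hypothetical shape (the function names mirror the proposal; the stand-in data and row-to-column pivot are illustrative):

```python
def fetch() -> list[dict]:
    """Default API: plain Python lists/dicts, no extra dependencies."""
    # stand-in data; a real fetch would read from the store
    return [{"id": "doc1", "score": 0.9}, {"id": "doc2", "score": 0.4}]

def fetch_columns() -> dict:
    """Columnar view of the same rows. A fetch_arrow / fetch_df / fetch_numpy
    endpoint would wrap the same underlying call and convert at the boundary,
    so only users who want the optimized format pay for its dependency."""
    rows = fetch()
    return {key: [row[key] for row in rows] for key in rows[0]} if rows else {}
```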

A Collection should store its UUID

Collections should store their UUIDs and pass them through read/write paths rather than every path requiring a lookup from collection_name -> collection_uuid.
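A minimal sketch of the idea, assuming a dataclass-style handle (the class shape and method are illustrative, not the actual Chroma implementation):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Collection:
    """Hypothetical collection handle that carries its own UUID, so read/write
    paths can pass it along instead of re-resolving name -> uuid each time."""
    name: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def add(self, ids: list[str], documents: list[str]) -> dict:
        # downstream calls receive self.id directly; no name lookup needed
        return {"collection_uuid": self.id, "count": len(ids)}
```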

Move away from JSON/REST

The current client-server bridge uses http+json+rest because it was easy, but certainly not because it was fast or ideal. Explore alternatives to increase speed.

Unit tests fail in GitHub Actions on the macOS platform

GitHub Actions uses older versions of macOS (11 and 12), and runs only on Intel-based Macs.

The tests run fine when running locally on an M1.

Given that the dev team is mostly on newer M1 Macs, without immediate access to an Intel-based Mac to reproduce the problem, we are going to "solve" the test failure by removing macOS as a test platform target. Given the error message, I also strongly suspect the problem is in the GitHub Actions environment rather than it actually being a Mac-related issue.

Creating this issue so we can circle back if we ever decide it's important to get unit tests running on Intel Macs again.

uuid package is for Python 2

One of the requirements for the package is uuid, which is a 2006-era package supporting Python 2. uuid is now part of the standard library.

It should be removed from requirements 🙏

Making default behavior saving to disk instead of clearing it

The default behavior is clearing the directory on exit instead of persisting it:

def __del__(self):
    print("Exiting: Cleaning up .chroma directory")
    self._idx.reset()

This contradicts the documentation:

By default Chroma uses an in-memory database, which gets persisted on exit and loaded on start (if it exists). This is fine for many experimental / prototyping workloads, limited by your machine's memory.

Suggestion:

  1. Clarify in the docs that the default behavior is clearing the directory.
  2. Or better, make saving to disk the default behavior.
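Option 2 could look roughly like this. A hedged sketch (JSON-on-disk stands in for the real index/db; the class and flag names are illustrative):

```python
import json
from pathlib import Path

class PersistentStore:
    """Sketch of persist-by-default: save on close instead of wiping .chroma.
    Users who want the current wipe-on-exit behavior opt in with persist=False."""

    def __init__(self, directory: str, persist: bool = True):
        self.path = Path(directory) / "store.json"
        self.persist = persist
        # load on start if a previous run persisted data
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def close(self) -> None:
        # would be called from __del__ or an atexit hook in a real client
        if self.persist:
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.data))
        else:
            self.data = {}  # opt-in reset: the current default behavior
```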

Add get_or_create support

Current issue:

If I'm using create_collection in a notebook cell (or in another Python function), the code will fail on the second run because the collection already exists. This forces users to change their code from create_collection to get_collection, or to use an or statement like this:

repos = chroma_client.get_collection(name="my_repos") or chroma_client.create_collection(name="my_repos")

Proposed solution

Inspired by Rails ActiveRecord's find_or_create_by, Chroma could support a get_or_create function that creates the collection OR returns it if it already exists.

If the collection exists, Chroma prints a notice. This will help avoid issues in which users write into someone else's collection, or write data into a collection thinking it's brand new when it already has data in it.
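A rough sketch of the proposed behavior, using a minimal in-memory stand-in for the client (the stub is illustrative; only the get_or_create semantics matter):

```python
class Client:
    """Minimal in-memory stand-in for a Chroma client, for illustration."""

    def __init__(self):
        self._collections: dict[str, dict] = {}

    def create_collection(self, name: str) -> dict:
        if name in self._collections:
            raise ValueError(f"collection {name!r} already exists")
        self._collections[name] = {"name": name, "documents": []}
        return self._collections[name]

    def get_or_create_collection(self, name: str) -> dict:
        """Return the collection if it exists (with a notice), else create it."""
        if name in self._collections:
            print(f"Notice: collection {name!r} already exists; returning it")
            return self._collections[name]
        return self.create_collection(name)
```

With this, re-running a notebook cell containing `get_or_create_collection("my_repos")` is idempotent.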

Fix class-based cluster representative sampling

#87 introduces the core algorithms, but representative sampling using class labels is currently broken.

This is probably because hnswlib filtering doesn't let us get a connected graph, and so ANN search fails. We should understand how / when this fails. There are possibly easy fixes, i.e. generating a separate index per dataset per class, or increasing the edge factor, but the real solution is to understand how to use hnsw properly in our context.

Rename space_key

space_key is dumb. The idea in theory makes sense... there is an embedding space, and there is a key to denote that unique embedding space.

But I think it's very confusing, and DX includes good naming! (hardest problem, blah blah)

Other candidates

  • environment
  • scope
  • ....

Multi-user

One concern about using DuckDb and parquet is maintaining correctness even when potentially many requests are coming in per second to add new embeddings to the production data space.

The other concern is multiple users in the org querying or pulling data from a service at the same time.

Will this work? Will there be collisions?

chroma-ui

We don't know exactly what users want yet in a frontend, but here are some very general ideas

  • View what "spaces" you have
  • View underlying data in the browser
  • Run SQL queries against those datasets in the browser

Down the road

  • View projections

Step 1 would be to figure out a sketch of how we would package it up.

Discussion: in-memory and chroma client-server --> sharing code?

chroma-client is mainly responsible for the public API and ferrying data to the backend.
chroma-server is mainly responsible for storing data and running computations.

Inside chroma there are 2 wolves:

  • Wolf A: chroma running entirely in the python session of the user
  • Wolf B: chroma running in a client-server way where the user sends data to a session managed outside of their current python session.

Wolf B is fairly obvious... there is chroma-client and chroma-server and they work together.

Wolf A, however... does chroma-client have extra functionality to handle the in-memory use case? Or is there separate code that is shared between chroma-server and chroma-client?

I think some code has to be shared... the code for "doing the maths" - and possibly more, things like data formats, etc. How should this be structured?

Remove non-user functions from the API

Currently the API has a bunch of functions the user shouldn't call. The one that stands out most to me is where_with_model_space which is purely a utility function.

Currently I've marked all those I think should be removed with "⚠️ This method should not be used directly." in the docstring.

UUIDs and Strings

In several places, we convert UUID objects back and forth from and to strings. This isn't great, and it's hard to keep track of what they should be, when.

We should add types to make sure they are what we expect, and then refactor to try to do this conversion as little as possible, or preferably, not at all.

An example:

collection_uuid, embedding=embeddings, metadata=metadatas, documents=documents, ids=ids

It's ambiguous here what type collection_uuid should be.
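One possible pattern is to convert once at the API boundary and keep everything past it typed as uuid.UUID. A hedged sketch (the helper and endpoint names are illustrative):

```python
import uuid

def parse_collection_uuid(value: "uuid.UUID | str") -> uuid.UUID:
    """Convert once at the boundary; past this point, always uuid.UUID."""
    return value if isinstance(value, uuid.UUID) else uuid.UUID(value)

def add(collection_uuid: uuid.UUID, ids: list[str]) -> dict:
    # internal path: the type annotation makes the expectation unambiguous
    assert isinstance(collection_uuid, uuid.UUID)
    # stringify only when leaving the typed core (e.g. for JSON responses)
    return {"collection_uuid": str(collection_uuid), "count": len(ids)}
```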

Storing custom generated scores

If we generate something like a custom quality score for a giant set of embeddings... do we update in place (discouraged in ClickHouse) or put the scores into a new table?
