Giter Club home page Giter Club logo

Comments (9)

tazarov avatar tazarov commented on September 4, 2024 1

@Alexie-Z-Yevich, thanks for your explanation. Let me ask a clarifying question. If your app is multi-tenant, e.g., each user owns their collection or DB, why do you need to query across all users' knowledgebases? Is that some admin functionality?

from chroma.

Alexie-Z-Yevich avatar Alexie-Z-Yevich commented on September 4, 2024 1

@tazarov It has effectively solved my problem. Thank you very much! ❤❤❤

from chroma.

tazarov avatar tazarov commented on September 4, 2024 1

@Alexie-Z-Yevich, glad to be able to help. If you have any further feel free to jump into Discord. We have a great community there and live interactions will make for speedy answers :)

from chroma.

Alexie-Z-Yevich avatar Alexie-Z-Yevich commented on September 4, 2024 1

@tazarov I don't know if I understand your meaning correctly. As you can see, whether I use keys or values as labels, metadata returns not a list, but the key I last stored.
My partial code is as follows:

chroma_client = chromadb.PersistentClient(path="../vector_store/admin/")
collections = chroma_client.get_or_create_collection(user_name)
print(collections.get())
sources = collections.metadata  # or collections.metadata['A1']
print(sources)

t1
t2

from chroma.

Alexie-Z-Yevich avatar Alexie-Z-Yevich commented on September 4, 2024

@Alexie-Z-Yevich, thanks for your explanation. Let me ask a clarifying question. If your app is multi-tenant, e.g., each user owns their collection or DB, why do you need to query across all users' knowledgebases? Is that some admin functionality?

Not all data uploaded by users is queried. Of course, your viewpoint also reminds me of another possibility: the company's managers can upload some public information for reference.
What I actually mean is: if a user uploads multiple documents, I want to use a list to store all the documents they have uploaded, so that the front-end can prompt the user which documents they have uploaded. For example, if he uploads a medical related document for the first time and a game related document for the second time, he can choose which document he wants to recall from to avoid mistakenly finding medical related content when searching for game content.
At the same time, it is also possible to easily delete corresponding documents based on their uploaded document names or other identifiers, and add some restrictions (such as a maximum of 10 100mb documents per user), making it easy to maintain a multi-user usage scenario.

from chroma.

tazarov avatar tazarov commented on September 4, 2024

@Alexie-Z-Yevich, the simplest solution to your problem can be document categories which can be implemented in the following ways:

Metadata, single collection per user:

each time a user uploads a document they specify what caregory it falls into. You chunk the document (if necessary) and to each chunk, you attach metadata with the corresponding category.

You query data using:

collection.query(query_texts=[user_query_input], where={"category":user_query_category}

The list of those categories you can keep either outside of Chroma in a separate relational DB or within Chroma - 1) have a metadata document that gets updated each time a new category is added or removed (this is hacky, but can work) and 2) Query for all categories each time, which is what you do in right now I believe.

Doing collection.get() each time you list the user's categories or even the individual documents, can be expensive, but perhaps you can utilize a cache to store that info

Collection per user per category:

Each type of document goes into a separate collection. Each collection has metadata associated with the category it represents.

You can create collections like so:

client.get_or_create_collection(f"{user_id}-{category}", metadata={"category":category}) # make sure to sanitize user_id and category

Then, when you present the list of categories to the user:

user_categories = [f"{c.metadata['category']}" for c in client.list_collections() if c.name.startswith(f"{user}")]

The list_collections operation is much cheaper than collection.get()

from chroma.

Alexie-Z-Yevich avatar Alexie-Z-Yevich commented on September 4, 2024

@tazarov Sorry, I still have a question. Can't multiple collections be searched simultaneously? Although category distinguishes major categories, each user will have many collections, which makes it very difficult to recall if there are two texts in different collections.💦💦
Still using my previous games and medical examples, if it's a medical game, do I need to recall 3 items in the game category and then recall 3 items from the medical category? (That way, my prompt will become very long because I can't extract 3 out of 6 data points)
My original intention was that when a user uploads a document, there will be a label for the document, which defaults to recalling n pieces of data from all the documents uploaded by the user. Users can view their historical uploaded documents and choose to delete them or recall only some documents (rather than all). This scenario is more similar to a multiple-choice box.
At the same time, based on your first answer😂, my question has been further strengthened: is it possible to set up an admin role to add some enterprise documents to the bottom layer of all users without displaying them on the user panel. Although it is possible to upload the document for each user role and hide the relevant tags upon return, this would result in a huge waste of data being present in each role's database.

from chroma.

Alexie-Z-Yevich avatar Alexie-Z-Yevich commented on September 4, 2024

I'm sorry I have so many questions🙏🙏🙏

from chroma.

tazarov avatar tazarov commented on September 4, 2024

@Alexie-Z-Yevich, you can totally do concurrent queries across multiple collections. You can even do them in parallel/simultaneously if you'd like to reduce user latency.

I get your point that there can be an overlap between categories, and that does not necessarily need to translate into redundant copies. Let's take my first approach from above - Metadata, single collection per user:

Example:

collection.query(query_texts=[user_input_query],where={"medical":True,"game":True}

The key to the above query is that for each document, you store all the categories it belongs to as boolean metadata fields (this makes for speedier filters).

The above example can also be applied for the multi-collection approach where you can pick to store the document in only one collection, but add metadata with the other category it also matches against then at query time you can add those additional categories as above example.

Let me know if the above makes sense in your context.

Regarding the admin role question - I guess there is more than one way to achieve this. With the single collection approach, you could add extra categories to search that are only related to corporate documents. For the multi-collection one, you can conversely search either a single collection with corporate docs and filter based on category or have multiple collections (e.g. one per major category) of corporate docs that you search in parallel to the user-selected collections.

from chroma.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.