Comments (2)

dosubot commented on June 16, 2024

To address your concerns about saving progress every 100 files and implementing multithreading for adding entries to the vector database, here's a concise guide:

  1. Saving Progress Every 100 Files:

    • Implement a counter in your file processing loop. After processing each file, increment this counter.
    • Once the counter hits 100, trigger a save operation to persist the current state, then reset the counter.
  2. Multithreading for Vector Database Entries:

    • Use a thread-safe approach to allow multiple threads to add entries without data corruption.
    • Implement a task queue where each task represents a file or batch of files to process.
    • Create worker threads that pull tasks from the queue and process them, ensuring that any method modifying the vector database is thread-safe.
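The two steps above can be combined in one small helper. This is a minimal sketch, not part of langchain-chatchat: `CheckpointCounter`, `file_done`, `flush`, and the `save` callback are hypothetical names, and `save` stands in for whatever call persists your vector store. A lock guards the counter so several worker threads can report finished files safely.

```python
import threading


class CheckpointCounter:
    """Counts processed files and calls `save` once every `every` files."""

    def __init__(self, save, every=100):
        self._save = save            # hypothetical callback that persists state
        self._every = every
        self._count = 0
        self._lock = threading.Lock()

    def file_done(self):
        """Call after each processed file; safe to call from many threads."""
        with self._lock:
            self._count += 1
            if self._count >= self._every:
                self._save()
                self._count = 0      # reset after saving

    def flush(self):
        """Save whatever is left in the final partial batch."""
        with self._lock:
            if self._count:
                self._save()
                self._count = 0


# Demo: 250 files trigger saves at 100 and 200, and flush() covers the last 50.
saves = []
counter = CheckpointCounter(save=lambda: saves.append("saved"), every=100)
for _ in range(250):
    counter.file_done()
counter.flush()
print(len(saves))  # prints 3
```

Because `save` runs while the lock is held, other workers pause during a save. That keeps the saved state consistent at the cost of a short stall every 100 files.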

Example Implementation:

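import threading
from queue import Queue

SAVE_EVERY = 100  # persist progress after this many processed files

def process_files(queue, vs_name, state):
    while True:
        task = queue.get()
        if task is None:  # sentinel: no more work for this worker
            queue.task_done()
            break
        name, embeddings = task
        try:
            # Add your file processing logic here, including updating the vector database.
            # Hold the lock around any update that is not thread-safe on its own.
            with state["lock"]:
                state["count"] += 1
                if state["count"] % SAVE_EVERY == 0:
                    pass  # Save the vector store (vs_name) here
        finally:
            queue.task_done()

def main():
    files_to_process = [...]  # Your list of files
    queue = Queue()
    state = {"lock": threading.Lock(), "count": 0}

    # Load tasks into the queue
    for file in files_to_process:
        embeddings = load_local_embeddings()  # Placeholder: your function to load embeddings for a file
        queue.put((file, embeddings))

    # Start worker threads
    num_worker_threads = 10
    threads = []
    for i in range(num_worker_threads):
        t = threading.Thread(target=process_files, args=(queue, 'your_vector_store_name', state))
        t.start()
        threads.append(t)

    # One sentinel per worker, so no thread blocks forever on an empty queue
    for _ in threads:
        queue.put(None)

    queue.join()  # Wait for all tasks to be processed
    for t in threads:
        t.join()
    # Save once more here so the last partial batch is not lost

if __name__ == "__main__":
    main()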

This setup uses a Queue for managing tasks and worker threads for concurrent processing. Ensure your file processing and vector database update logic is placed within the process_files function. Adjust the number of worker threads based on your system's capabilities and the workload. This approach will help you process files more efficiently and save progress periodically to mitigate data loss risks.
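Periodic saves only reduce lost work if the next run can tell which files were already handled. Here is a minimal single-threaded sketch of that idea, assuming a JSON progress file; `load_done`, `save_done`, `run`, and `progress.json` are hypothetical names rather than anything provided by langchain-chatchat, and `process` stands in for your own per-file logic.

```python
import json
import os
import tempfile


def load_done(progress_path):
    """Return the set of file names recorded as processed by earlier runs."""
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path, encoding="utf-8") as f:
        return set(json.load(f))


def save_done(progress_path, done):
    """Write the progress file atomically so a crash never leaves it half-written."""
    directory = os.path.dirname(os.path.abspath(progress_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, progress_path)


def run(files, progress_path, process, every=100):
    """Process files not yet recorded as done; return how many were processed."""
    done = load_done(progress_path)
    pending = [name for name in files if name not in done]
    for i, name in enumerate(pending, start=1):
        process(name)  # hypothetical per-file work, e.g. embed and add to the store
        done.add(name)
        if i % every == 0:
            save_done(progress_path, done)
    save_done(progress_path, done)  # keep the final partial batch
    return len(pending)


# Demo with a temporary directory and a no-op `process`.
with tempfile.TemporaryDirectory() as workdir:
    progress = os.path.join(workdir, "progress.json")
    names = [f"doc_{i}.txt" for i in range(250)]
    first = run(names, progress, process=lambda name: None)
    second = run(names, progress, process=lambda name: None)
    print(first, second)  # prints: 250 0
```

The second call finds every file already recorded and does nothing. In real use, save the vector store at the same points as the progress file so the two cannot drift apart.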


from langchain-chatchat.

KevinFanng commented on June 16, 2024

python init_database.py -help
usage: init_database.py [-h] [-r] [--create-tables] [--clear-tables] [--import-db IMPORT_DB] [-u] [-i] [--prune-db] [--prune-folder] [-n KB_NAME [KB_NAME ...]]
                        [-e EMBED_MODEL]

please specify only one operate method once time.

options:
  -h, --help            show this help message and exit
  -r, --recreate-vs     recreate vector store. use this option if you have copied document files to the content folder, but vector store has not been populated
                        or DEFAUL_VS_TYPE/EMBEDDING_MODEL changed.
  --create-tables       create empty tables if not existed
  --clear-tables        create empty tables, or drop the database tables before recreate vector stores
  --import-db IMPORT_DB
                        import tables from specified sqlite database
  -u, --update-in-db    update vector store for files exist in database. use this option if you want to recreate vectors for files exist in db and skip files
                        exist in local folder only.
  -i, --increment       update vector store for files exist in local folder and not exist in database. use this option if you want to create vectors
                        incrementally.
  --prune-db            delete docs in database that not existed in local folder. it is used to delete database docs after user deleted some doc files in file
                        browser
  --prune-folder        delete doc files in local folder that not existed in database. is is used to free local disk space by delete unused doc files.
  -n KB_NAME [KB_NAME ...], --kb-name KB_NAME [KB_NAME ...]
                        specify knowledge base names to operate on. default is all folders exist in KB_ROOT_PATH.
  -e EMBED_MODEL, --embed-model EMBED_MODEL
                        specify embeddings model.
