Comments (2)
To address your concerns about saving progress every 100 files and implementing multithreading for adding entries to the vector database, here's a concise guide:
- Saving Progress Every 100 Files:
  - Implement a counter in your file processing loop. After processing each file, increment this counter.
  - Once the counter hits 100, trigger a save operation to persist the current state, then reset the counter.
- Multithreading for Vector Database Entries:
  - Use a thread-safe approach to allow multiple threads to add entries without data corruption.
  - Implement a task queue where each task represents a file or batch of files to process.
  - Create worker threads that pull tasks from the queue and process them, ensuring that any method modifying the vector database is thread-safe.
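The counter described in the first item can be sketched on its own, before any threading is involved. This is only an illustration: `process_with_checkpoints`, `process_file`, and `save_progress` are hypothetical names, not functions from langchain-chatchat, and you would pass in whatever your project uses to embed a file and to persist the vector store.

```python
def process_with_checkpoints(files, process_file, save_progress, save_every=100):
    """Process files in order and call save_progress after every `save_every` files.

    process_file and save_progress are caller-supplied callables (hypothetical
    placeholders for your own embedding and persistence logic).
    """
    done = []      # files processed so far
    pending = 0    # files processed since the last save
    for path in files:
        process_file(path)
        done.append(path)
        pending += 1
        if pending >= save_every:
            save_progress(done)   # persist the current state
            pending = 0           # reset the counter
    if pending:
        save_progress(done)       # flush the final partial batch
    return done
```

The final flush matters: without it, a run of 250 files would persist only the first 200.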
Example Implementation:
import threading
from queue import Empty, Queue


def process_files(queue, vs_name):
    """Worker: pull tasks from the queue until none are left."""
    while True:
        try:
            # get_nowait() avoids the race in `while not queue.empty(): queue.get()`,
            # where another worker can take the last task and leave this one blocked.
            task = queue.get_nowait()
        except Empty:
            break
        name, embeddings = task
        # Add your file processing logic here, including updating the vector database
        queue.task_done()


def main():
    files_to_process = [...]  # Your list of files
    queue = Queue()
    # Load tasks into the queue
    for file in files_to_process:
        embeddings = load_local_embeddings()  # Placeholder: your function to load embeddings for a file
        queue.put((file, embeddings))
    # Start worker threads
    num_worker_threads = 10
    for i in range(num_worker_threads):
        t = threading.Thread(target=process_files, args=(queue, 'your_vector_store_name'))
        t.start()
    queue.join()  # Wait for all tasks to be processed


if __name__ == "__main__":
    main()
This setup uses a Queue for managing tasks and worker threads for concurrent processing. Place your file processing and vector database update logic inside the process_files function, and adjust the number of worker threads to your system's capabilities and the workload. Note that the example only covers the concurrent processing: to save progress periodically, add the counter described earlier to the workers and guard both the counter and the save operation with a lock, so that a crash costs you only the files processed since the last save.
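One way to combine the two ideas is to put the lock and the counter in a small wrapper around the vector store, so every worker goes through the same guarded path. The sketch below is an assumption-laden illustration, not the project's implementation: `CheckpointingStore`, `add_entry`, and `save` are hypothetical names standing in for whatever your vector store exposes for adding an entry and persisting to disk.

```python
import threading
from queue import Empty, Queue


class CheckpointingStore:
    """Hypothetical wrapper: serializes writes and saves every `save_every` additions."""

    def __init__(self, add_entry, save, save_every=100):
        self._add_entry = add_entry    # caller-supplied: adds one entry to the store
        self._save = save              # caller-supplied: persists the store
        self._save_every = save_every
        self._lock = threading.Lock()
        self._since_save = 0

    def add(self, name, embeddings):
        with self._lock:               # only one thread writes or saves at a time
            self._add_entry(name, embeddings)
            self._since_save += 1
            if self._since_save >= self._save_every:
                self._save()
                self._since_save = 0

    def flush(self):
        with self._lock:
            if self._since_save:
                self._save()
                self._since_save = 0


def worker(tasks, store):
    while True:
        try:
            name, embeddings = tasks.get_nowait()
        except Empty:
            return                     # queue drained, worker exits
        try:
            store.add(name, embeddings)
        finally:
            tasks.task_done()


def run(items, store, num_workers=4):
    tasks = Queue()
    for item in items:
        tasks.put(item)
    threads = [threading.Thread(target=worker, args=(tasks, store))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    store.flush()                      # save whatever is left after the last checkpoint
```

Because the add and the save happen under the same lock, a save never sees a half-written entry. The trade-off is that writes are fully serialized, so the threads only help if the expensive work (loading or computing embeddings) happens outside `store.add`.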
Sources
- server/knowledge_base/kb_cache/faiss_cache.py
- configs/kb_config.py.example
- server/api.py
- server/knowledge_base/model/kb_document_model.py
- startup.py
- server/knowledge_base/kb_cache/base.py
- tests/api/test_kb_api.py
- document_loaders/FilteredCSVloader.py
- server/chat/file_chat.py
- document_loaders/mydocloader.py
- server/chat/chat.py
- server/chat/agent_chat.py
from langchain-chatchat.
python init_database.py -help
usage: init_database.py [-h] [-r] [--create-tables] [--clear-tables] [--import-db IMPORT_DB] [-u] [-i] [--prune-db] [--prune-folder]
                        [-n KB_NAME [KB_NAME ...]] [-e EMBED_MODEL]

please specify only one operate method once time.

options:
  -h, --help            show this help message and exit
  -r, --recreate-vs     recreate vector store. use this option if you have copied document files to the content folder, but vector store has not
                        been populated or DEFAUL_VS_TYPE/EMBEDDING_MODEL changed.
  --create-tables       create empty tables if not existed
  --clear-tables        create empty tables, or drop the database tables before recreate vector stores
  --import-db IMPORT_DB
                        import tables from specified sqlite database
  -u, --update-in-db    update vector store for files exist in database. use this option if you want to recreate vectors for files exist in db
                        and skip files exist in local folder only.
  -i, --increment       update vector store for files exist in local folder and not exist in database. use this option if you want to create
                        vectors incrementally.
  --prune-db            delete docs in database that not existed in local folder. it is used to delete database docs after user deleted some
                        doc files in file browser
  --prune-folder        delete doc files in local folder that not existed in database. is is used to free local disk space by delete unused
                        doc files.
  -n KB_NAME [KB_NAME ...], --kb-name KB_NAME [KB_NAME ...]
                        specify knowledge base names to operate on. default is all folders exist in KB_ROOT_PATH.
  -e EMBED_MODEL, --embed-model EMBED_MODEL
                        specify embeddings model.