daethyra / build-ragai
AI-powered Python components and notebooks for leveraging Large Language Models from OpenAI and Hugging Face.
License: Other
New intention: specifically supply well-documented custom wrapper classes in modules for building LLMs
-> LangChain Ecosystem updates (v0.1)
I did not personally find anything worth checking beyond what Assistant Architect flagged.
Suggested, not necessary:
The use of the transformers library for the ASR pipeline seems appropriate. However, ensure that the specific model (openai/whisper-large-v2) and the parameters (chunk_length_s, return_timestamps) are supported by the library version you are using.
The sliding window concept is a sensible approach for handling real-time audio data. However, there seems to be an inconsistency in how the sliding window is managed after transcription. After the first 30 seconds are transcribed, the remaining part of the window should be retained, not entirely reset.
The sliding window length check (if len(self.sliding_window) >= 16000 * self.asr_pipeline.task.config.chunk_size_ms / 1000:) appears to be incorrectly using chunk_size_ms. You should confirm the existence and correct usage of this attribute in the transformers documentation.
The script performs error handling and logging for file operations and stream activities, which is good practice.
Consider adding more specific error handling around the ASR pipeline's processing, as this could fail or raise exceptions not currently caught.
The logic for handling log files could be improved. For example, the check for the log file's existence and writability is repetitive and could be simplified.
The method create_new_log_file does not handle potential exceptions that might occur during file operations.
Ensure that resources like the PyAudio stream are appropriately closed or released in all scenarios, including exceptions.
Argument parsing is correctly implemented.
The logging setup within a file context (with open(...)) is unnecessary, as logging.basicConfig handles file operations internally.
The check asr_app.asr_pipeline.is_running() might not be valid. The pipeline object from transformers does not typically have an is_running method. Verify this based on the library's documentation.
Good use of try-except blocks to handle unexpected errors and keyboard interrupts.
Consider logging the exception details for better debugging.
The script ensures that resources are closed in the finally block, which is a good practice.
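The point about retaining the remainder of the sliding window can be sketched as follows. This is a minimal illustration, assuming 16 kHz mono samples held in a plain list; SlidingAudioWindow and its method names are hypothetical and are not taken from the script under review.

```python
class SlidingAudioWindow:
    """Buffer audio samples and hand out fixed-size chunks for transcription."""

    def __init__(self, sample_rate=16_000, chunk_seconds=30):
        # Number of samples that make up one chunk (30 s at 16 kHz by default).
        self.chunk_samples = sample_rate * chunk_seconds
        self._samples = []

    def extend(self, samples):
        """Append newly captured samples to the window."""
        self._samples.extend(samples)

    def pop_ready_chunks(self):
        """Return every complete chunk and keep the leftover samples buffered."""
        chunks = []
        while len(self._samples) >= self.chunk_samples:
            chunks.append(self._samples[: self.chunk_samples])
            # Keep the tail instead of resetting the whole window.
            self._samples = self._samples[self.chunk_samples :]
        return chunks

    def __len__(self):
        return len(self._samples)
```

Each chunk returned by pop_ready_chunks() would be passed to the ASR pipeline, while the samples that arrived after the chunk boundary stay in the buffer for the next pass.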
README.md is unfinished
llm_utilikit
Modules like json and dotenv are imported but not used. Consider removing them if they are not necessary.
The usage of config['ENDING_CAPTION'] in the main function seems incorrect. You might need to use ending_caption instead, as it is the variable holding the relevant environment variable value.
In save_to_csv, you open csvfile as a new file every time. If csvfile is passed as an argument, it should be used instead of opening a new file. Also, consider parameterizing the write mode ('a' for appending) to allow for more flexibility.
While you have a try-except block, it might be useful to continue processing other images even if one fails. Consider moving the try-except inside the loop to handle errors on a per-image basis.
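The last two suggestions can be sketched together. This is only an illustration under stated assumptions: caption_fn stands in for whatever the script uses to caption an image, and the signatures of save_to_csv and caption_images are hypothetical rather than the module's actual ones.

```python
import csv
import logging

logger = logging.getLogger(__name__)


def save_to_csv(csvfile, rows, mode="a"):
    """Write rows to the path passed in as csvfile; appends by default."""
    with open(csvfile, mode, newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def caption_images(image_paths, caption_fn):
    """Caption each image, continuing past failures on a per-image basis."""
    results, failures = [], []
    for path in image_paths:
        try:
            results.append((path, caption_fn(path)))
        except Exception:
            # Log the traceback and move on to the next image.
            logger.exception("Captioning failed for %s", path)
            failures.append(path)
    return results, failures
```

With the try-except inside the loop, one unreadable image ends up in failures and the remaining images are still processed.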
import shutil

from git import Repo
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.embeddings.jina import JinaEmbeddings
from langchain.vectorstores.chroma import Chroma


def clone_repository(repo_url, repo_path):
    """
    Clones a git repository to the specified path.
    """
    repo = Repo.clone_from(repo_url, to_path=repo_path)
    return repo


def load_code_files(repo_path):
    """
    Loads code files from the specified repository path using LanguageParser.
    """
    loader = GenericLoader.from_filesystem(
        repo_path,
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language=Language.PYTHON),
    )
    documents = loader.load()
    return documents


def split_documents(documents):
    """
    Splits the documents into chunks using RecursiveCharacterTextSplitter.
    """
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
    )
    chunks = splitter.split_documents(documents)
    return chunks


def embed_chunks(chunks, chromadb_path):
    """
    Embeds the chunks using JinaEmbeddings and stores them in Chroma.
    """
    # JinaEmbeddings is assumed to pick up its credentials from the environment.
    embeddings = JinaEmbeddings()
    # persist_directory tells Chroma where to write the index on disk.
    vectorstore = Chroma.from_documents(
        chunks, embeddings, persist_directory=chromadb_path
    )
    return vectorstore


def save_vectorstore(vectorstore):
    """
    Saves the vectorstore to ChromaDB.
    """
    # The Chroma wrapper has no save() method. persist() flushes to the
    # persist_directory; this assumes a LangChain/chromadb combination that
    # still exposes it (newer chromadb releases persist automatically).
    vectorstore.persist()


def cleanup_repository(repo_path):
    """
    Cleans up the cloned repository.
    """
    repo = Repo(repo_path)
    repo.close()
    # os.remove() only deletes files; a cloned repo is a directory tree.
    shutil.rmtree(repo_path)


def prepare_vector_db(repo_url, repo_path, chromadb_path):
    """
    Prepares a vector database for similarity searching for RAG over code.
    """
    # Clone the repository
    clone_repository(repo_url, repo_path)
    # Load the code files
    documents = load_code_files(repo_path)
    # Split the documents into chunks
    chunks = split_documents(documents)
    # Embed the chunks
    vectorstore = embed_chunks(chunks, chromadb_path)
    # Save the vectorstore to ChromaDB
    save_vectorstore(vectorstore)
    # Clean up the cloned repository
    cleanup_repository(repo_path)
Before release 1:
{} = replace_me
"What is the idiomatic way to {do the thing you want to do}
in {language in question}?"
Absolutely, here's a more developer-oriented introduction:
"Welcome to the LLM-Utilikit repository. This toolkit is a collection of prompts and components designed to streamline your work with Large Language Models (LLMs). It's built with developers in mind, providing pre-configured prompts and back-end modules to help you get up and running quickly with OpenAI, LangChain, Hugging Face, or Pinecone. The LLM-Utilikit is open-source, so feel free to contribute and help us improve it. Let's build something amazing together with the power of LLMs."
Need
Must ensure all notebooks in langchain/ subdir actually work
I'll need to personally split commands based on role
https://github.com/f/awesome-chatgpt-prompts/blob/main/prompts.csv
conv_html_to_markdown.py
Solution:
Import glob: At the top of conv_html_to_markdown.py, add import glob to use the glob module for file pattern matching.
Update load_json Function:
import glob
import json


def load_json_files(pattern):
    """
    Load data from multiple JSON files matching a pattern.

    Args:
        pattern (str): Glob pattern to match files.
    Returns:
        list: Aggregated data from all matched files.
    """
    aggregated_data = []
    for file_path in glob.glob(pattern):
        with open(file_path, "r", encoding="utf-8") as file:
            # extend() assumes each file holds a JSON list at the top level.
            aggregated_data.extend(json.load(file))
    return aggregated_data


def main():
    # ... existing code ...
    try:
        # Load data from all output JSON files
        original_data = load_json_files("output-*.json")
        # ... rest of the existing code ...
Already collected, @Daethyra: look in personal "Building AI" list to find the right ones.
Please watch this video and summarize its main points in bullet points. Use clear and concise language. Provide relevant examples and explanations from the video. Include the following information:
The topic and purpose of the video
The main arguments or claims made by the speaker
The evidence or support provided for the arguments or claims
The speaker’s tone and attitude towards the topic
The intended audience and message of the video
Prompt =
Request: Create a detailed and informative markdown guide that will serve as a reference for the next AI assistant. This guide should encapsulate the essence of our ongoing project and articulate specific steps taken and future actions required.
Purpose: To provide the next AI assistant with comprehensive context and clear understanding about the project related to web scraping LangChain documentation.
Requirements for the Guide:
Project Overview: Summarize the initial objective - downloading LangChain documentation and the evolution towards developing a web scraping script.
Developed Tools:
Describe the enhanced web scraping script, highlighting its purpose, key features, and user instructions.
Mention the JSON file template, its role in the project, and user responsibilities for its utilization.
Next Steps for the User: Outline specific actions the user must take, including populating the JSON template with URLs and running the script.
Ethical and Legal Considerations: Emphasize the importance of adhering to legal and ethical web scraping practices, including the compliance with robots.txt of target websites.
Monitoring and Troubleshooting: Suggest steps for monitoring the script's output and handling potential issues.
Format: Markdown, for readability and structured documentation.
Goal: To ensure seamless continuity and understanding for the next AI assistant, enabling them to provide effective and pertinent assistance to the user.
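The robots.txt requirement in the prompt above could be checked before scraping with Python's standard library. The sketch below is an illustration only: the helper names and the example rules are hypothetical, and a real scraper would fetch each site's own robots.txt rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, parser, user_agent="*"):
    """Keep only the URLs that the parsed rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

URLs from the JSON template would be passed through allowed_urls() first, so that disallowed pages never reach the scraping step.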
I collected my Gists that I believe would be helpful in tweaking the current file base of AA4LLM
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma


class DocumentRetrievalChatbot:
    def __init__(self, pdf_directory: str, persist_directory: str = "./chroma_db"):
        self.pdf_directory = pdf_directory
        self.persist_directory = persist_directory
        self.db = self._initialize_chroma_db()
        self.chain = self._initialize_chat_chain()

    def _initialize_chroma_db(self):
        # PyPDFLoader reads a single file; the directory loader walks a folder of PDFs.
        loader = PyPDFDirectoryLoader(self.pdf_directory, recursive=True)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
        docs = text_splitter.split_documents(documents)
        embedding_function = OpenAIEmbeddings()
        db = Chroma.from_documents(
            docs, embedding_function, persist_directory=self.persist_directory
        )
        return db

    def _initialize_chat_chain(self):
        # Prompt -> chat model -> plain string. The prompt wording is illustrative.
        template = ChatPromptTemplate.from_template(
            "Answer the question using only this context:\n{context}\n\n"
            "Question: {question}"
        )
        chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
        return template | chat | StrOutputParser()

    def get_responses(self, query: str, top_k: int = 5) -> str:
        # Build the retriever from the vector store instead of instantiating
        # VectorStoreRetriever directly.
        retriever = self.db.as_retriever(search_kwargs={"k": top_k})
        retrieved_docs = retriever.get_relevant_documents(query)
        responses = []
        for doc in retrieved_docs:
            # One answer per retrieved chunk, as in the original loop.
            response = self.chain.invoke(
                {"context": doc.page_content, "question": query}
            )
            responses.append(response)
        return " ".join(responses)

    def run_query_loop(self):
        while True:
            query = input("Enter your query (or 'q' to quit): ")
            if query.lower() == "q":
                break
            response = self.get_responses(query)
            print("Response:", response)


if __name__ == "__main__":
    pdf_directory = "data/"
    bot = DocumentRetrievalChatbot(pdf_directory)
    bot.run_query_loop()
I'm not sure what code should be carried over. Much of the repo's content is monolithic, packing too much functionality into a few classes per file.
Since this repo is moving to a simpler, more understandable structure, I will likely not use much of the codebase. Most of it does not work, and most of it is also poorly written with no clear vision in mind.
The OpenAI directory doesn't contain much code anymore. Only two files are currently available.
.env
langchain, openai, transformers
See the prompt quoted below:
[Note regarding future user inputs]:"""
I will use brackets, '[]' to specify either a literal command(like, [PROCEED]) OR a context-conveyor followed by a string(like, [User message]:"Sample text").
"""
[PROCEED]
Traceback for "flake8 .\src\llm_utilikit\langchain":
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:1:1: F401 'langchain.memory.ConversationBufferWindowMemory' imported but unused
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:8:80: E501 line too long (98 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:11:80: E501 line too long (84 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:15:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:21:14: F821 undefined name 'langchain'
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.ChatPromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.HumanMessagePromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.SystemMessagePromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:24:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:25:80: E501 line too long (108 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:26:80: E501 line too long (124 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:27:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:33:80: E501 line too long (104 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:36:80: E501 line too long (98 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:37:80: E501 line too long (94 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:38:80: E501 line too long (83 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:40:80: E501 line too long (124 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:45:80: E501 line too long (89 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:46:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:5:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:8:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:10:80: E501 line too long (118 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:11:80: E501 line too long (135 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:14:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:17:80: E501 line too long (90 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:40:51: F821 undefined name 'question'
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_chat.py:15:80: E501 line too long (87 > 79 characters)
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:1:80: E501 line too long (82 > 79 characters)
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:2:1: F811 redefinition of unused 'StreamlitChatMessageHistory' from line 1
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:20:80: E501 line too long (86 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:2:80: E501 line too long (99 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:3:80: E501 line too long (102 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:4:80: E501 line too long (100 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:5:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:15:1: F401 'langchain.agents.agent_toolkits.create_conversational_retrieval_agent' imported but unused
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:15:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:34:44: F821 undefined name 'memory_key'
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:34:60: F821 undefined name 'llm'
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:9:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:15:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:22:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:29:80: E501 line too long (83 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:16:80: E501 line too long (89 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:36:80: E501 line too long (133 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:37:80: E501 line too long (92 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:49:80: E501 line too long (86 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:56:31: F821 undefined name 'retry_if_value_error'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:64:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:67:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:122:46: F821 undefined name 'hub'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:122:80: E501 line too long (82 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:142:80: E501 line too long (149 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:151:36: F821 undefined name 'cosine_similarity'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:161:80: E501 line too long (114 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:162:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:33:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:36:80: E501 line too long (110 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:44:64: W605 invalid escape sequence '\ '
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:44:65: W291 trailing whitespace
.\src\llm_utilikit\langchain\rag-with-agents\pdf_only\query_local_docs.py:34:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\pdf_only\query_local_docs.py:56:80: E501 line too long (88 > 79 characters)
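Most entries in the report above are E501 line-length warnings, while the F-prefixed codes (F401, F811, F821) point at unused imports and undefined names. A small script along these lines could separate the two groups. It is a sketch that assumes flake8's default path:row:col: code message output format, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches flake8's default "path:row:col: CODE message" output line.
FLAKE8_LINE = re.compile(
    r"^(?P<path>.+?):(?P<row>\d+):(?P<col>\d+): (?P<code>[A-Z]\d+) (?P<message>.*)$"
)


def parse_report(report):
    """Yield (path, row, code, message) for each recognizable report line."""
    for line in report.splitlines():
        match = FLAKE8_LINE.match(line.strip())
        if match:
            yield (
                match.group("path"),
                int(match.group("row")),
                match.group("code"),
                match.group("message"),
            )


def count_codes(report):
    """Count how often each flake8 code appears in the report."""
    return Counter(code for _, _, code, _ in parse_report(report))


def pyflakes_findings(report):
    """Return only the F-prefixed findings, which flag likely bugs rather than style."""
    return [entry for entry in parse_report(report) if entry[2].startswith("F")]
```

Running pyflakes_findings() on the saved report would give a short list of the undefined names and unused imports to fix first, leaving the line-length warnings for a formatter.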
query_local_docs.py
qa_local_docs.py / run_qa_local_docs.py
I'm not sure what code should be carried over. Much of the repo's content is monolithic, packing too much functionality into a few classes per file.
Since this repo is moving to a simpler, more understandable structure, I will likely not use much of the codebase. Most of it does not work, and most of it is also poorly written with no clear vision in mind.
https://smith.langchain.com/hub/rlm/map-prompt?organizationId=0f7461cf-206f-5c85-aa8d-48c6c48bafc5
RAG Prompt/Chain that should be added to AA4LLM's code examples