daethyra / build-ragai

AI-powered Python components and notebooks for leveraging Large Language Models from OpenAI and Hugging Face.

License: Other

Python 1.47% Jupyter Notebook 98.53%

large-language-models prompt-examples openai ai artificial-intelligence artificial-intelligence-projects machine-learning prompt-template huggingface langchain

build-ragai's Issues

Repurpose repo

New intention: specifically supply well-documented custom wrapper classes in modules for building LLMs
-> LangChain Ecosystem updates (v0.1)

transcribe_microphone.py

I did not personally find anything worth checking beyond what Assistant Architect flagged.

transcribe_microphone.py

Initialization of ASR Pipeline:

Suggested, not necessary:
The use of the transformers library for the ASR pipeline seems appropriate. However, ensure that the specific model (openai/whisper-large-v2) and the parameters (chunk_length_s, return_timestamps) are supported by the library version you are using.

Audio Processing Logic:

The sliding window concept is a sensible approach for handling real-time audio data. However, the window is managed inconsistently after transcription: once the first 30 seconds are transcribed, the remaining part of the window should be retained, not entirely reset.
The sliding window length check (if len(self.sliding_window) >= 16000 * self.asr_pipeline.task.config.chunk_size_ms / 1000:) appears to be incorrectly using chunk_size_ms. You should confirm the existence and correct usage of this attribute in the transformers documentation.
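
A minimal sketch of the retention behaviour described above, assuming 16 kHz mono audio and a 30-second window. The class name `SlidingAudioBuffer` and its methods are hypothetical and are not taken from transcribe_microphone.py; plain lists stand in for audio samples.

```python
from typing import List, Optional

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
WINDOW_SECONDS = 30    # assumed transcription window length


class SlidingAudioBuffer:
    """Accumulates samples and hands out full windows, keeping the remainder."""

    def __init__(self, window_samples: int = SAMPLE_RATE * WINDOW_SECONDS):
        self.window_samples = window_samples
        self.samples: List[float] = []

    def extend(self, chunk: List[float]) -> None:
        """Append newly captured samples to the buffer."""
        self.samples.extend(chunk)

    def pop_window(self) -> Optional[List[float]]:
        """Return the oldest full window, or None if there is not enough audio."""
        if len(self.samples) < self.window_samples:
            return None
        window = self.samples[: self.window_samples]
        # Keep everything after the transcribed window instead of resetting.
        self.samples = self.samples[self.window_samples:]
        return window
```

The length check here uses a sample count computed from known constants, which avoids depending on a pipeline attribute that may not exist.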

Error Handling:

The script performs error handling and logging for file operations and stream activities, which is good practice.
Consider adding more specific error handling around the ASR pipeline's processing, as this could fail or raise exceptions not currently caught.

Logging and File Writing:

The logic for handling log files could be improved. For example, the check for the log file's existence and writability is repetitive and could be simplified.
The method create_new_log_file does not handle potential exceptions that might occur during file operations.

Resource Management:

Ensure that resources like the PyAudio stream are appropriately closed or released in all scenarios, including exceptions.

run.py

Argument Parsing and Logging:

Argument parsing is correctly implemented.
The logging setup within a file context (with open(...)) is unnecessary, as logging.basicConfig handles file operations internally.
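
As a sketch of the simpler setup, `logging.basicConfig` can be given the file name directly and will open and manage the file itself. The helper name `configure_logging` and the format string are illustrative assumptions, not the script's actual code.

```python
import logging


def configure_logging(log_path: str, level: int = logging.INFO) -> None:
    """Send log records to log_path; basicConfig opens and manages the file."""
    logging.basicConfig(
        filename=log_path,
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers from an earlier call (Python 3.8+)
    )
```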

ASR Pipeline and Stream Checks:

The check asr_app.asr_pipeline.is_running() might not be valid. The pipeline object from transformers does not typically have an is_running method. Verify this based on the library's documentation.

Exception Handling:

Good use of try-except blocks to handle unexpected errors and keyboard interrupts.
Consider logging the exception details for better debugging.
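
One way to record exception details is `logger.exception`, which logs at ERROR level and appends the traceback. The wrapper below is a hypothetical sketch; `run_safely` and the logger name are assumptions rather than names from run.py.

```python
import logging

logger = logging.getLogger("run")


def run_safely(task) -> bool:
    """Run task(); on failure, log the full traceback and return False."""
    try:
        task()
        return True
    except KeyboardInterrupt:
        logger.info("Interrupted by user.")
        return False
    except Exception:
        # logger.exception must be called from inside an except block.
        logger.exception("Unexpected error while running the task")
        return False
```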

Resource Management:

The script ensures that resources are closed in the finally block, which is a good practice.

System prompt

Add Notebook: rag-in-langchain

Add ref and explanation of llm_utilikit directory

README.md is unfinished

Requires a reference and explanation for each sub directory of llm_utilikit

integrable_image_captioner.py

Leftover work: integrable_image_captioner.py:

Code Improvements:

Unused Imports:

Modules like json and dotenv are imported but not used. Consider removing them if they are not necessary.

Global Variable Usage:

The usage of config['ENDING_CAPTION'] in the main function seems incorrect. You might need to use ending_caption instead, as it is the variable holding the relevant environment variable value.

CSV File Handling:

In save_to_csv, you open csvfile as a new file every time. If csvfile is passed as an argument, it should be used instead of opening a new file. Also, consider parameterizing the write mode ('a' for appending) to allow for more flexibility.
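
A possible shape for this, assuming `save_to_csv` receives rows plus either a path or an already-open text file. The signature is a guess at the intent and does not reproduce the original function.

```python
import csv
from typing import IO, Iterable, Sequence, Union


def save_to_csv(
    rows: Iterable[Sequence[str]],
    csvfile: Union[str, IO[str]],
    mode: str = "a",
) -> None:
    """Write rows to csvfile, which may be a path or an open text file."""
    if isinstance(csvfile, str):
        # A path was given: open it here with the requested mode.
        with open(csvfile, mode, newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
    else:
        # An open file object was given: reuse it; the caller closes it.
        csv.writer(csvfile).writerows(rows)
```

With `mode` defaulting to `"a"`, existing callers keep the append behaviour, while a caller that wants a fresh file can pass `mode="w"`.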

Error Handling in Main:

While you have a try-except block, it might be useful to continue processing other images even if one fails. Consider moving the try-except inside the loop to handle errors on a per-image basis.
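
The per-image handling could look roughly like the sketch below. `caption_all` and the `caption_image` callable are hypothetical stand-ins for whatever the script uses to caption a single image.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def caption_all(
    image_paths: Iterable[str],
    caption_image: Callable[[str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Caption each image; collect failures instead of aborting the batch."""
    captions: Dict[str, str] = {}
    failed: List[str] = []
    for path in image_paths:
        try:
            captions[path] = caption_image(path)
        except Exception:
            # One bad image should not stop the remaining ones.
            logger.exception("Failed to caption %s", path)
            failed.append(path)
    return captions, failed
```

Returning the failed paths alongside the captions lets the caller decide whether to retry them or only report them.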

Jina embeddings + vector store module

import shutil

from git import Repo
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.embeddings.jina import JinaEmbeddings
from langchain.vectorstores.chroma import Chroma

def clone_repository(repo_url, repo_path):
    """
    Clones a git repository to the specified path.
    """
    repo = Repo.clone_from(repo_url, to_path=repo_path)
    return repo

def load_code_files(repo_path):
    """
    Loads code files from the specified repository path using LanguageParser.
    """
    loader = GenericLoader.from_filesystem(
        repo_path,
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language=Language.PYTHON),
    )
    documents = loader.load()
    return documents

def split_documents(documents):
    """
    Splits the documents into chunks using RecursiveCharacterTextSplitter.
    """
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
    )
    chunks = splitter.split_documents(documents)
    return chunks

def embed_chunks(chunks, chromadb_path):
    """
    Embeds the chunks using JinaEmbeddings and stores them in a Chroma
    collection backed by chromadb_path.
    """
    # JinaEmbeddings expects a Jina API key to be configured.
    embeddings = JinaEmbeddings()
    vectorstore = Chroma.from_documents(
        chunks, embeddings, persist_directory=chromadb_path
    )
    return vectorstore

def save_vectorstore(vectorstore):
    """
    Saves the vectorstore to ChromaDB.
    """
    # Chroma exposes persist(), not save(). Recent chromadb releases write to
    # persist_directory automatically, so this call may be redundant there.
    vectorstore.persist()

def cleanup_repository(repo_path):
    """
    Cleans up the cloned repository.
    """
    repo = Repo(repo_path)
    repo.close()
    # repo_path is a directory, so os.remove would fail; remove the whole tree.
    shutil.rmtree(repo_path)

def prepare_vector_db(repo_url, repo_path, chromadb_path):
    """
    Prepares a vector database for similarity searching for RAG over code.
    """
    # Clone the repository
    clone_repository(repo_url, repo_path)

    # Load the code files
    documents = load_code_files(repo_path)

    # Split the documents into chunks
    chunks = split_documents(documents)

    # Embed the chunks into a Chroma collection stored at chromadb_path
    vectorstore = embed_chunks(chunks, chromadb_path)

    # Save the vectorstore to ChromaDB
    save_vectorstore(vectorstore)

    # Clean up the cloned repository
    cleanup_repository(repo_path)

Release 1 requirements

Before release 1:

Run `pdm init` and ensure the pyproject.toml looks good.
Take all of the prompts of all role types and move them into a single master sheet, still separate from the cheatsheet

ST Prompt 1 - User Role

{} = replace_me

"What is the idiomatic way to {do the thing you want to do}
in {language in question}?"

Bing intro

Absolutely, here's a more developer-oriented introduction:

"Welcome to the LLM-Utilikit repository. This toolkit is a collection of prompts and components designed to streamline your work with Large Language Models (LLMs). It's built with developers in mind, providing pre-configured prompts and back-end modules to help you get up and running quickly with OpenAI, LangChain, Hugging Face, or Pinecone. The LLM-Utilikit is open-source, so feel free to contribute and help us improve it. Let's build something amazing together with the power of LLMs."

Collaboration preparation

Need

issue template
'collaborating' markdown doc

'docs/' subdir needs README

Subdir: "notebooks": unchecked

Must ensure all notebooks in langchain/ subdir actually work

Always point to GPT-3.5-Turbo-1106

Hard-code the model name into code examples for reproducibility
consider updating versions every snapshot release of GPT-3.5-Turbo-{MMDD}

Abstract OpenAI utility

Outsource the API calls

Ensure logging, retries
Follow this todo list
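
A minimal, library-agnostic sketch of the logging and retry requirement, using exponential backoff. `call_with_retries` and its parameters are hypothetical; the real utility would wrap the actual OpenAI call and should probably retry only on errors known to be transient.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying with exponential backoff and logging failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so tests can substitute a no-op and avoid real waiting.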

Entire subdir: unchecked | requires meticulous review

Assistant Architect Hallucination - Resolve: LangChain - Tracing

Check my Gists for more useful notebooks

transcribe_tasks.py

fine_tune_sequence_classification_model.py

Need README for docs/jupyter_notebooks

https://github.com/Daethyra/LLM-Utilikit/tree/v1.0.21/docs/jupyter_notebooks

Double check "continued education" directory for useful learning

define, "useful"

Copy/Credit this prompt CSV

I'll need to personally split commands based on role

https://github.com/f/awesome-chatgpt-prompts/blob/main/prompts.csv

Read all JSON files via Glob

Updates to how output files are named cause errors during the execution of `conv_html_to_markdown.py`.

Solution:

Import glob: At the top of conv_html_to_markdown.py, add import glob to use the glob module for file pattern matching.
Update load_json Function:

Rename it to load_json_files to reflect its new functionality.
Use glob.glob to find all files matching the output-*.json pattern.
Iterate over these files, load their contents, and aggregate the data.

import glob
import json

def load_json_files(pattern):
    """
    Load data from multiple JSON files matching a pattern.

    Args:
        pattern (str): Glob pattern to match files.

    Returns:
        list: Aggregated data from all matched files.
    """
    aggregated_data = []
    for file_path in glob.glob(pattern):
        with open(file_path, "r", encoding="utf-8") as file:
            aggregated_data.extend(json.load(file))
    return aggregated_data

def main():
    # ... existing code ...
    try:
        # Load data from all output JSON files
        original_data = load_json_files("output-*.json")
        # ... rest of the existing code ...

Double check both "prompt" documents for quality and clarity

AA4LLM Document Contents Review

Add 'awesome' lists

Already collected, @Daethyra: look in personal "Building AI" list to find the right ones.

Subdir: "end2end": ✅

Idea for prompt: YouTube video summarization

Please watch this video and summarize its main points in bullet points. Use clear and concise language. Provide relevant examples and explanations from the video. Include the following information:

The topic and purpose of the video
The main arguments or claims made by the speaker
The evidence or support provided for the arguments or claims
The speaker’s tone and attitude towards the topic
The intended audience and message of the video

video = "https://youtu.be/2F9itktands?si=DXyOmSHePtEip_lO"

ChatGPT Context reset + contextual continuity

Prompt =

Request: Create a detailed and informative markdown guide that will serve as a reference for the next AI assistant. This guide should encapsulate the essence of our ongoing project and articulate specific steps taken and future actions required.

Purpose: To provide the next AI assistant with comprehensive context and clear understanding about the project related to web scraping LangChain documentation.

Requirements for the Guide:

Project Overview: Summarize the initial objective - downloading LangChain documentation and the evolution towards developing a web scraping script.
Developed Tools:
Describe the enhanced web scraping script, highlighting its purpose, key features, and user instructions.
Mention the JSON file template, its role in the project, and user responsibilities for its utilization.
Next Steps for the User: Outline specific actions the user must take, including populating the JSON template with URLs and running the script.
Ethical and Legal Considerations: Emphasize the importance of adhering to legal and ethical web scraping practices, including the compliance with robots.txt of target websites.
Monitoring and Troubleshooting: Suggest steps for monitoring the script's output and handling potential issues.
Format: Markdown, for readability and structured documentation.

Goal: To ensure seamless continuity and understanding for the next AI assistant, enabling them to provide effective and pertinent assistance to the user.

Subdir: "codesnippets": unchecked

Needs directory organization

needs folder organization for
-- multi-shot examples
-- user examples
-- system examples

Sift through Gists

I collected my Gists that I believe would be helpful in tweaking the current file base of AA4LLM

Gists-LLM-Utilikit-Free_Code_Examples.zip

Suggestion for `query_local_docs.py` code refactorization

! Contains hallucinated code !

Minimum one instance:
-> from langchain.retrievers import VectorStoreRetriever

import os
from typing import List, Tuple
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitters import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.retrievers import VectorStoreRetriever
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate

class DocumentRetrievalChatbot:
    def __init__(self, pdf_directory: str, persist_directory: str = "./chroma_db"):
        self.pdf_directory = pdf_directory
        self.persist_directory = persist_directory
        self.db = self._initialize_chroma_db()
        self.retriever = VectorStoreRetriever(self.db)
        self.chat = self._initialize_chat_model()

    def _initialize_chroma_db(self):
        loader = PyPDFLoader(self.pdf_directory, recursive=True)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
        docs = text_splitter.split_documents(documents)
        embedding_function = OpenAIEmbeddings()
        db = Chroma.from_documents(docs, embedding_function, persist_directory=self.persist_directory)
        return db

    def _initialize_chat_model(self):
        output_parser = StrOutputParser()
        template = ChatPromptTemplate()
        chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
        return chat

    def get_responses(self, query: str, top_k: int = 5) -> str:
        retrieved_docs = self.retriever.retrieve(query, top_k=top_k)
        responses = []
        for doc in retrieved_docs:
            response = self.chat.generate_response(query, doc.page_content)
            responses.append(response.text)
        return " ".join(responses)

    def run_query_loop(self):
        while True:
            query = input("Enter your query (or 'q' to quit): ")
            if query.lower() == "q":
                break
            response = self.get_responses(query)
            print("Response:", response)

if __name__ == "__main__":
    pdf_directory = "data/"
    bot = DocumentRetrievalChatbot(pdf_directory)
    bot.run_query_loop()

Review all LangChain code for merge-ability

I'm not sure what code should be carried over. Much of the repo's contents are monolithic, packing too much functionality into a few classes per file.
Since this repo is moving to a simpler, more understandable structure, I will likely not use much of the existing codebase. Most of it does not work, and most of it is also poorly written with no clear vision in mind.

Pyproject.toml missing requirements

Use PDM to add safetensors and
fk I forgot the other one

Ensure code works: "OpenAI"

The OpenAI directory doesn't contain much code anymore. Only two files currently available.

`query_local_docs.py` does not return LLM Response | Add LangSmith tracing v2

return LLM query-response
Add LangSmith tracing v2 via .env
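
A rough sketch of wiring this up, assuming the `.env` file holds the standard LangSmith variables `LANGCHAIN_TRACING_V2`, `LANGCHAIN_API_KEY` and, optionally, `LANGCHAIN_PROJECT`. The helpers `read_env_file` and `enable_langsmith_tracing` are hypothetical; the small stdlib parser only keeps the example self-contained, and the project may prefer python-dotenv.

```python
import os
from typing import Dict


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def enable_langsmith_tracing(env_path: str = ".env") -> bool:
    """Export LangSmith settings from env_path; return True if tracing is on."""
    settings = read_env_file(env_path)
    for key in ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT"):
        if key in settings:
            os.environ[key] = settings[key]
    return (
        os.environ.get("LANGCHAIN_TRACING_V2", "").lower() == "true"
        and bool(os.environ.get("LANGCHAIN_API_KEY"))
    )
```

The variables need to be in the environment before any chain runs, so a call like this would belong near the top of `query_local_docs.py`.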

Click here

Reviewed article and copied prompts on 11-28-23

Brackets to control "role"-type

See the prompt quoted below:

[Note regarding future user inputs]:"""
I will use brackets, '[]' to specify either a literal command(like, [PROCEED]) OR a context-conveyor followed by a string(like, [User message]:"Sample text").
"""

[PROCEED]

Add notebook: privacy-rag-over-code

flake8 results

Output of "flake8 .\src\llm_utilikit\langchain":

.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:1:1: F401 'langchain.memory.ConversationBufferWindowMemory' imported but unused
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:8:80: E501 line too long (98 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:11:80: E501 line too long (84 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:15:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:21:14: F821 undefined name 'langchain'
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.ChatPromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.HumanMessagePromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.SystemMessagePromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:24:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:25:80: E501 line too long (108 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:26:80: E501 line too long (124 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:27:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:33:80: E501 line too long (104 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:36:80: E501 line too long (98 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:37:80: E501 line too long (94 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:38:80: E501 line too long (83 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:40:80: E501 line too long (124 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:45:80: E501 line too long (89 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:46:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:5:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:8:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:10:80: E501 line too long (118 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:11:80: E501 line too long (135 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:14:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:17:80: E501 line too long (90 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:40:51: F821 undefined name 'question'
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_chat.py:15:80: E501 line too long (87 > 79 characters)
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:1:80: E501 line too long (82 > 79 characters)
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:2:1: F811 redefinition of unused 'StreamlitChatMessageHistory' from line 1
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:20:80: E501 line too long (86 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:2:80: E501 line too long (99 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:3:80: E501 line too long (102 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:4:80: E501 line too long (100 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:5:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:15:1: F401 'langchain.agents.agent_toolkits.create_conversational_retrieval_agent' imported but unused
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:15:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:34:44: F821 undefined name 'memory_key'
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:34:60: F821 undefined name 'llm'
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:9:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:15:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:22:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:29:80: E501 line too long (83 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:16:80: E501 line too long (89 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:36:80: E501 line too long (133 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:37:80: E501 line too long (92 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:49:80: E501 line too long (86 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:56:31: F821 undefined name 'retry_if_value_error'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:64:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:67:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:122:46: F821 undefined name 'hub'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:122:80: E501 line too long (82 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:142:80: E501 line too long (149 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:151:36: F821 undefined name 'cosine_similarity'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:161:80: E501 line too long (114 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:162:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:33:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:36:80: E501 line too long (110 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:44:64: W605 invalid escape sequence '\ '
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:44:65: W291 trailing whitespace
.\src\llm_utilikit\langchain\rag-with-agents\pdf_only\query_local_docs.py:34:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\pdf_only\query_local_docs.py:56:80: E501 line too long (88 > 79 characters)

Review Blind Programming subdirectory

Subdir: "rag-with-agents": unchecked

query_local_docs.py
qa_local_docs.py / run_qa_local_docs.py
