daethyra / build-ragai

AI-powered Python components and notebooks for leveraging Large Language Models from OpenAI and Hugging Face.

License: Other

Languages: Jupyter Notebook 98.53%, Python 1.47%
Topics: ai, artificial-intelligence, artificial-intelligence-projects, components, framework, huggingface, jupyter-notebooks, langchain, langserve, langsmith, large-language-models, machine-learning, natural-language-processing, openai, prompt-examples, prompt-template, python, rag, retrieval-augmented-generation, transformers

build-ragai's Introduction

Hi there 👋 I'm Daethyra (pronounced: duh-thear-uh)

A bit about me...

  • 🏳️‍⚧️ Pronouns: she/her
  • 🔭 I'm currently working on transitioning from cybersecurity into software development.
  • 🌱 I'm currently building FreeStream, a Streamlit multi-page app with various chatbots for many use cases.
  • 👯 I love collaborating on open-source projects with a vision to make people's lives better.
  • 🤔 More specifically, I want to build cool stuff for others that automates the mundane.
  • 💬 Ask me about my favorite video game, or which games I've been playing recently.
  • ⚡ Fun fact: I love Star Wars! Guess my favorite trilogy.

Tech Stack

Python, JavaScript, TypeScript, Pandas, NumPy, Scikit-Learn, LangChain, PyTorch, TensorFlow, React, Next.js, Tailwind CSS, Flask, FastAPI, Jupyter, HTML5, CSS, Markdown, PDM, Git, GitHub, Docker, Windows, Linux, MongoDB, Google Cloud, AWS, Azure


Notable Projects

FreeStream

Description: A web application offering free access to Claude Opus and GPT-4 through the different chatbot architectures I've set up. The first is focused on retrieval-augmented generation and requires you to upload files for the AI to generate answers from. The second is, so far, a general-purpose chatbot; the benefit of using FreeStream is that there are no chat-length limits, and you can drop in your choice of large language model from foundational model providers OpenAI, Anthropic, and Google.

Build-RAGAI

Description: A collection of Jupyter notebooks and Python components that leverage LangChain, OpenAI, and Transformers for building generative AI applications, providing reusable code snippets, tutorials, and end-to-end examples.



build-ragai's People

Contributors

daethyra, dependabot[bot]


build-ragai's Issues

Suggestion for `query_local_docs.py` code refactoring

! Contains hallucinated code !

  • At least one instance:
    -> from langchain.retrievers import VectorStoreRetriever
# NOTE: the original suggestion imported `VectorStoreRetriever` from
# `langchain.retrievers`; that class does not exist there. The supported
# pattern is `vectorstore.as_retriever()`, used below.
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

def _format_docs(docs):
    """Join retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)

class DocumentRetrievalChatbot:
    def __init__(self, pdf_directory: str, persist_directory: str = "./chroma_db"):
        self.pdf_directory = pdf_directory
        self.persist_directory = persist_directory
        self.db = self._initialize_chroma_db()
        self.retriever = self.db.as_retriever(search_kwargs={"k": 5})
        self.chain = self._initialize_chain()

    def _initialize_chroma_db(self):
        # PyPDFLoader takes a single file; PyPDFDirectoryLoader walks a directory of PDFs.
        loader = PyPDFDirectoryLoader(self.pdf_directory)
        documents = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
        docs = text_splitter.split_documents(documents)

        embedding_function = OpenAIEmbeddings()
        db = Chroma.from_documents(
            docs, embedding_function, persist_directory=self.persist_directory
        )
        return db

    def _initialize_chain(self):
        prompt = ChatPromptTemplate.from_template(
            "Answer the question using only the context below.\n\n"
            "Context:\n{context}\n\nQuestion: {question}"
        )
        chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
        return (
            {"context": self.retriever | _format_docs, "question": RunnablePassthrough()}
            | prompt
            | chat
            | StrOutputParser()
        )

    def get_response(self, query: str) -> str:
        return self.chain.invoke(query)

    def run_query_loop(self):
        while True:
            query = input("Enter your query (or 'q' to quit): ")
            if query.lower() == "q":
                break
            print("Response:", self.get_response(query))

if __name__ == "__main__":
    pdf_directory = "data/"
    bot = DocumentRetrievalChatbot(pdf_directory)
    bot.run_query_loop()

integrable_image_captioner.py

Leftover work: integrable_image_captioner.py:

Code Improvements:

Unused Imports:

Modules like json and dotenv are imported but not used. Consider removing them if they are not necessary.

Global Variable Usage:

The usage of config['ENDING_CAPTION'] in the main function seems incorrect. You might need to use ending_caption instead, as it is the variable holding the relevant environment variable value.

CSV File Handling:

In save_to_csv, you open csvfile as a new file every time. If csvfile is passed as an argument, it should be used instead of opening a new file. Also, consider parameterizing the write mode ('a' for appending) to allow for more flexibility.
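
A minimal sketch of that fix, assuming rows of (image_name, caption) tuples and that save_to_csv may receive either a path or an already-open file object (the original signature isn't shown here):

import csv

def save_to_csv(rows, csvfile, mode="a"):
    """Write caption rows, reusing an open file handle if one was passed in."""
    if hasattr(csvfile, "write"):
        # Caller passed an already-open file object: use it directly.
        csv.writer(csvfile).writerows(rows)
    else:
        # Caller passed a path: open it with the requested (parameterized) mode.
        with open(csvfile, mode, newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)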

Error Handling in Main:

While you have a try-except block, it might be useful to continue processing other images even if one fails. Consider moving the try-except inside the loop to handle errors on a per-image basis.
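
In sketch form, with a hypothetical caption_image helper standing in for the real captioning call:

import logging

results = []
for image_path in image_paths:  # image_paths assumed from the surrounding script
    try:
        caption = caption_image(image_path)  # hypothetical helper
        results.append((image_path, caption))
    except Exception as exc:
        # Log and continue so one bad image doesn't abort the whole batch.
        logging.error("Failed to caption %s: %s", image_path, exc)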

AA4LLM Document Contents Review

ST Prompt 1 - User Role

{} = replace_me

"What is the idiomatic way to {do the thing you want to do}
in {language in question}?"

Idea for prompt: YouTube video summarization

Please watch this video and summarize its main points in bullet points. Use clear and concise language. Provide relevant examples and explanations from the video. Include the following information:

  • The topic and purpose of the video

  • The main arguments or claims made by the speaker

  • The evidence or support provided for the arguments or claims

  • The speaker’s tone and attitude towards the topic

  • The intended audience and message of the video

video = "https://youtu.be/2F9itktands?si=DXyOmSHePtEip_lO"

Review all Transformers code for merge-ability

I'm not sure what code should be carried over. Much of the repo's contents are monolithic, cramming excessive amounts of functionality into a few classes per file.
Since this repo is moving to a simpler, more understandable structure, I will likely not make use of much of the codebase. Not only does most of it not work, but most of it is also poorly written with no clear vision in mind.

Release 1 requirements

Before release 1:

  • Run `pdm init` and ensure the pyproject.toml looks good.
  • Take all of the prompts of all role types and move them into a single master sheet, still separate from the cheatsheet.

Brackets to control "role"-type

See the prompt quoted below:

[Note regarding future user inputs]: """
I will use brackets, '[]', to specify either a literal command (like [PROCEED]) OR a context-conveyor followed by a string (like [User message]: "Sample text").
"""

[PROCEED]

Review all LangChain code for merge-ability

I'm not sure what code should be carried over. Much of the repo's contents are monolithic, cramming excessive amounts of functionality into a few classes per file.
Since this repo is moving to a simpler, more understandable structure, I will likely not make use of much of the codebase. Not only does most of it not work, but most of it is also poorly written with no clear vision in mind.

flake8 results

Output of "flake8 .\src\llm_utilikit\langchain":

.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:1:1: F401 'langchain.memory.ConversationBufferWindowMemory' imported but unused
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:8:80: E501 line too long (98 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:11:80: E501 line too long (84 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:15:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\bufferwindow_memory.py:21:14: F821 undefined name 'langchain'
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.ChatPromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.HumanMessagePromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:3:1: F401 'langchain.prompts.chat.SystemMessagePromptTemplate' imported but unused
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:24:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:25:80: E501 line too long (108 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:26:80: E501 line too long (124 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:27:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:33:80: E501 line too long (104 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:36:80: E501 line too long (98 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:37:80: E501 line too long (94 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:38:80: E501 line too long (83 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:40:80: E501 line too long (124 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:45:80: E501 line too long (89 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\chatopenai.py:46:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:5:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:8:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:10:80: E501 line too long (118 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:11:80: E501 line too long (135 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:14:80: E501 line too long (95 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:17:80: E501 line too long (90 > 79 characters)
.\src\llm_utilikit\langchain\codesnippets\multi_queryvector_retrieval.py:40:51: F821 undefined name 'question'
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_chat.py:15:80: E501 line too long (87 > 79 characters)
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:1:80: E501 line too long (82 > 79 characters)
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:2:1: F811 redefinition of unused 'StreamlitChatMessageHistory' from line 1
.\src\llm_utilikit\langchain\end2end\chatbots\streamlit\st_with_memory.py:20:80: E501 line too long (86 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:2:80: E501 line too long (99 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:3:80: E501 line too long (102 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:4:80: E501 line too long (100 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:5:80: E501 line too long (93 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:15:1: F401 'langchain.agents.agent_toolkits.create_conversational_retrieval_agent' imported but unused
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:15:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:34:44: F821 undefined name 'memory_key'
.\src\llm_utilikit\langchain\end2end\rag\faiss_retriever.py:34:60: F821 undefined name 'llm'
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:9:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:15:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:22:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\end2end\rag\pinecone\application.py:29:80: E501 line too long (83 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:16:80: E501 line too long (89 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:36:80: E501 line too long (133 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:37:80: E501 line too long (92 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:49:80: E501 line too long (86 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:56:31: F821 undefined name 'retry_if_value_error'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:64:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:67:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:122:46: F821 undefined name 'hub'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:122:80: E501 line too long (82 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:142:80: E501 line too long (149 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:151:36: F821 undefined name 'cosine_similarity'
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:161:80: E501 line too long (114 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\qa_local_docs.py:162:80: E501 line too long (85 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:33:80: E501 line too long (88 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:36:80: E501 line too long (110 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:44:64: W605 invalid escape sequence '\ '
.\src\llm_utilikit\langchain\rag-with-agents\directoryloader\run_qa_local_docs.py:44:65: W291 trailing whitespace
.\src\llm_utilikit\langchain\rag-with-agents\pdf_only\query_local_docs.py:34:80: E501 line too long (81 > 79 characters)
.\src\llm_utilikit\langchain\rag-with-agents\pdf_only\query_local_docs.py:56:80: E501 line too long (88 > 79 characters)

ChatGPT Context reset + contextual continuity

Prompt =

Request: Create a detailed and informative markdown guide that will serve as a reference for the next AI assistant. This guide should encapsulate the essence of our ongoing project and articulate specific steps taken and future actions required.

Purpose: To provide the next AI assistant with comprehensive context and clear understanding about the project related to web scraping LangChain documentation.

Requirements for the Guide:

  • Project Overview: Summarize the initial objective (downloading LangChain documentation) and the evolution toward developing a web scraping script.
  • Developed Tools:
    - Describe the enhanced web scraping script, highlighting its purpose, key features, and user instructions.
    - Mention the JSON file template, its role in the project, and user responsibilities for its utilization.
  • Next Steps for the User: Outline specific actions the user must take, including populating the JSON template with URLs and running the script.
  • Ethical and Legal Considerations: Emphasize the importance of adhering to legal and ethical web scraping practices, including compliance with the robots.txt of target websites.
  • Monitoring and Troubleshooting: Suggest steps for monitoring the script's output and handling potential issues.
  • Format: Markdown, for readability and structured documentation.

Goal: To ensure seamless continuity and understanding for the next AI assistant, enabling them to provide effective and pertinent assistance to the user.

transcribe_microphone.py

I did not personally find anything else worth checking beyond what Assistant Architect did.

transcribe_microphone.py

Initialization of ASR Pipeline:

Suggested, not necessary:
The use of the transformers library for the ASR pipeline seems appropriate. However, ensure that the specific model (openai/whisper-large-v2) and the parameters (chunk_length_s, return_timestamps) are supported by the library version you are using.

Audio Processing Logic:

The sliding window concept is a sensible approach for handling real-time audio data. However, there seems to be inconsistency in the way the sliding window is managed after transcription. After the first 30 seconds are transcribed, the remaining part of the window should be retained, not entirely reset.
The sliding window length check (if len(self.sliding_window) >= 16000 * self.asr_pipeline.task.config.chunk_size_ms / 1000:) appears to be incorrectly using chunk_size_ms. You should confirm the existence and correct usage of this attribute in the transformers documentation.
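
A minimal sketch of the retained-remainder behavior, assuming 16 kHz audio in a NumPy buffer (the 30-second chunk length comes from the script; everything else here is illustrative):

import numpy as np

SAMPLE_RATE = 16000
CHUNK_SAMPLES = SAMPLE_RATE * 30  # 30 seconds of audio

def drain_window(sliding_window: np.ndarray, transcribe) -> np.ndarray:
    """Transcribe full 30 s chunks and keep the untranscribed tail."""
    while len(sliding_window) >= CHUNK_SAMPLES:
        transcribe(sliding_window[:CHUNK_SAMPLES])
        # Retain the remainder instead of resetting the whole window.
        sliding_window = sliding_window[CHUNK_SAMPLES:]
    return sliding_window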

Error Handling:

The script performs error handling and logging for file operations and stream activities, which is good practice.
Consider adding more specific error handling around the ASR pipeline's processing, as this could fail or raise exceptions not currently caught.
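
Sketched, with the pipeline call shape and exception types as assumptions:

import logging

try:
    result = asr_pipeline(audio_chunk)  # call shape assumed from the script
except (RuntimeError, ValueError) as exc:
    # Catch pipeline-level failures instead of letting them escape the loop.
    logging.error("ASR pipeline failed on this chunk: %s", exc)
    result = None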

Logging and File Writing:

The logic for handling log files could be improved. For example, the check for the log file's existence and writability is repetitive and could be simplified.
The method create_new_log_file does not handle potential exceptions that might occur during file operations.
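
One way to fold both points together (create_new_log_file's real signature is an assumption):

import logging
from pathlib import Path

def create_new_log_file(path: str) -> Path:
    """Create or reuse the log file, surfacing I/O problems explicitly."""
    log_path = Path(path)
    try:
        # touch() creates the file if missing, replacing repeated
        # existence/writability checks with a single call.
        log_path.touch(exist_ok=True)
    except OSError as exc:
        logging.error("Could not create log file %s: %s", path, exc)
        raise
    return log_path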

Resource Management:

Ensure that resources like the PyAudio stream are appropriately closed or released in all scenarios, including exceptions.

run.py

Argument Parsing and Logging:

Argument parsing is correctly implemented.
The logging setup within a file context (with open(...)) is unnecessary, as logging.basicConfig handles file operations internally.
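
For example (the log file name is assumed), basicConfig manages the file on its own:

import logging

# No surrounding `with open(...)` block is needed; basicConfig opens
# and manages the log file internally.
logging.basicConfig(
    filename="transcription.log",  # assumed name
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)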

ASR Pipeline and Stream Checks:

The check asr_app.asr_pipeline.is_running() might not be valid. The pipeline object from transformers does not typically have an is_running method. Verify this based on the library's documentation.

Exception Handling:

Good use of try-except blocks to handle unexpected errors and keyboard interrupts.
Consider logging the exception details for better debugging.

Resource Management:

The script ensures that resources are closed in the finally block, which is a good practice.

Jina embeddings + vector store module

import shutil

from git import Repo
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.embeddings.jina import JinaEmbeddings
from langchain.vectorstores.chroma import Chroma

def clone_repository(repo_url, repo_path):
    """
    Clones a git repository to the specified path.
    """
    repo = Repo.clone_from(repo_url, to_path=repo_path)
    return repo

def load_code_files(repo_path):
    """
    Loads Python files from the specified repository path using LanguageParser.
    """
    loader = GenericLoader.from_filesystem(
        repo_path,
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language=Language.PYTHON),
    )
    documents = loader.load()
    return documents

def split_documents(documents):
    """
    Splits the documents into chunks using RecursiveCharacterTextSplitter.
    """
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
    )
    chunks = splitter.split_documents(documents)
    return chunks

def embed_chunks(chunks, chromadb_path):
    """
    Embeds the chunks using JinaEmbeddings and persists them to ChromaDB.
    """
    embeddings = JinaEmbeddings()
    # Chroma has no `save` method; it persists to disk when created with a
    # persist_directory and `persist()` is called.
    vectorstore = Chroma.from_documents(
        chunks, embeddings, persist_directory=chromadb_path
    )
    vectorstore.persist()
    return vectorstore

def cleanup_repository(repo_path):
    """
    Removes the cloned repository directory.
    """
    # os.remove only handles single files; a cloned repo is a directory tree.
    shutil.rmtree(repo_path)

def prepare_vector_db(repo_url, repo_path, chromadb_path):
    """
    Prepares a vector database for similarity searching for RAG over code.
    """
    # Clone the repository
    clone_repository(repo_url, repo_path)

    # Load the code files
    documents = load_code_files(repo_path)

    # Split the documents into chunks
    chunks = split_documents(documents)

    # Embed the chunks and persist them to ChromaDB
    embed_chunks(chunks, chromadb_path)

    # Clean up the cloned repository
    cleanup_repository(repo_path)
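
Hypothetical usage (the URL and paths are placeholders):

if __name__ == "__main__":
    prepare_vector_db(
        repo_url="https://github.com/daethyra/build-ragai.git",
        repo_path="./tmp_repo",
        chromadb_path="./chroma_db",
    )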

Repurpose repo

New intention: specifically, supply well-documented custom wrapper classes in modules for building LLM applications
-> LangChain Ecosystem updates (v0.1)

Read *all* JSON files via Glob

Updates to how output files are named cause errors to be thrown during the execution of conv_html_to_markdown.py.

Solution:

  1. Import glob: At the top of conv_html_to_markdown.py, add import glob to use the glob module for file pattern matching.

  2. Update load_json Function:

  • Rename it to load_json_files to reflect its new functionality.
  • Use glob.glob to find all files matching the output-*.json pattern.
  • Iterate over these files, load their contents, and aggregate the data.
import glob
import json

def load_json_files(pattern):
    """
    Load data from multiple JSON files matching a pattern.

    Args:
        pattern (str): Glob pattern to match files.

    Returns:
        list: Aggregated data from all matched files.
    """
    aggregated_data = []
    for file_path in glob.glob(pattern):
        with open(file_path, "r", encoding="utf-8") as file:
            aggregated_data.extend(json.load(file))
    return aggregated_data

def main():
    # ... existing code ...
    try:
        # Load data from all output JSON files
        original_data = load_json_files("output-*.json")
        # ... rest of the existing code ...
Bing intro

Absolutely, here's a more developer-oriented introduction:

"Welcome to the LLM-Utilikit repository. This toolkit is a collection of prompts and components designed to streamline your work with Large Language Models (LLMs). It's built with developers in mind, providing pre-configured prompts and back-end modules to help you get up and running quickly with OpenAI, LangChain, Hugging Face, or Pinecone. The LLM-Utilikit is open-source, so feel free to contribute and help us improve it. Let's build something amazing together with the power of LLMs."
