codeqai


Search your codebase semantically or chat with it from the CLI. Keep the vector database up to date with the latest code changes at high speed. 100% local support without any data leaks.
Built with langchain, treesitter, sentence-transformers, instructor-embedding, faiss, llama.cpp, Ollama, Streamlit.

✨ Features

  • 🔎 Semantic code search
  • 💬 GPT-like chat with your codebase
  • ⚙️ Synchronize vector store and latest code changes with ease
  • 💻 100% local embeddings and LLMs
    • sentence-transformers, instructor-embeddings, llama.cpp, Ollama
  • 🌐 OpenAI, Azure OpenAI and Anthropic
  • 🌳 Treesitter integration

Note

There will be better results if the code is well documented. You might consider doc-comments-ai for code documentation generation.

🚀 Usage

Start semantic search:

codeqai search

Start chat dialog:

codeqai chat

Synchronize vector store with current git checkout:

codeqai sync

Start Streamlit app:

codeqai app

Note

On first usage, the repository will be indexed with the configured embeddings model, which might take a while.

📋 Requirements

  • Python >=3.9,<3.12

📦 Installation

Install in an isolated environment with pipx:

pipx install codeqai

⚠ Make sure pipx is using Python >=3.9,<3.12.
To specify the Python version explicitly with pipx, activate the desired Python version (e.g. with pyenv shell 3.X.X) and install with:

pipx install codeqai --python $(which python)

If you are still facing issues using pipx, you can also install directly from PyPI with:

pip install codeqai

However, it is recommended to use pipx to benefit from isolated environments for the dependencies.
Visit the Troubleshooting section for solutions to known issues during installation.

Note

Some packages are not installed by default. On first usage you will be asked to install faiss-cpu or faiss-gpu. faiss-gpu is recommended if your hardware supports CUDA 7.5+. If local embeddings and LLMs are used, you will additionally be asked to install sentence-transformers, instructor, or llama.cpp.

🔧 Configuration

At first usage or by running

codeqai configure

the configuration process is initiated, where the embeddings model and LLM can be chosen.

Important

If you want to change the embeddings model in the configuration later, delete the cached files in ~/.cache/codeqai. The vector store files are then created again with the newly configured embeddings model. This is necessary since similarity search does not work if the models differ.

๐ŸŒ Remote models

If remote models are used, the following environment variables are required. If they are already set, they will be used; otherwise you will be prompted to enter them, and they are then stored in ~/.config/codeqai/.env.

OpenAI

export OPENAI_API_KEY="your OpenAI api key"

Azure OpenAI

export OPENAI_API_TYPE="azure"
export AZURE_OPENAI_ENDPOINT="https://<your-endpoint>.openai.azure.com/"
export OPENAI_API_KEY="your Azure OpenAI api key"
export OPENAI_API_VERSION="2023-05-15"

Anthropic

export ANTHROPIC_API_KEY="your Anthropic api key"

Note

To change the environment variables later, edit ~/.config/codeqai/.env manually.

📚 Supported Languages

  • Python
  • TypeScript
  • JavaScript
  • Java
  • Rust
  • Kotlin
  • Go
  • C++
  • C
  • C#
  • Ruby

💡 How it works

The entire git repo is parsed with treesitter to extract all methods with their documentation, which are saved to a local FAISS vector database using either sentence-transformers, instructor-embeddings or OpenAI's text-embedding-ada-002.
The vector database is saved to a file on your system and loaded again on subsequent usage. Afterwards it is possible to do semantic search on the codebase based on the embeddings model.
To chat with the codebase locally, llama.cpp or Ollama is used by specifying the desired model. With llama.cpp the specified model needs to be available on the system in advance; with Ollama the Ollama container with the desired model needs to be running locally in advance on port 11434. OpenAI or Azure OpenAI can also be used as remote chat models. For synchronization of recent changes in the repository, the git commit hash of each file, along with the vector IDs, is saved to a cache. When synchronizing the vector database with the latest git state, the cached commit hashes are compared to the current git hash of each file in the repository. If the hashes differ, the related vectors are deleted from the database and inserted again after recreating the vector embeddings.
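The hash-comparison step of the synchronization can be sketched roughly like this (a simplified illustration of the idea, not codeqai's actual implementation; names are hypothetical):

```python
def files_to_resync(cached_hashes: dict, current_hashes: dict) -> list:
    """Return files whose cached commit hash differs from the current one.

    cached_hashes:  {file_path: commit_hash} stored from the last indexing run
    current_hashes: {file_path: commit_hash} from the current git checkout
    """
    stale = []
    for path, commit in current_hashes.items():
        if cached_hashes.get(path) != commit:
            # Hash changed (or file is new): its vectors must be deleted
            # from the database and recreated from fresh embeddings.
            stale.append(path)
    return stale


cached = {"a.py": "111", "b.py": "222"}
current = {"a.py": "111", "b.py": "333", "c.py": "444"}
print(files_to_resync(cached, current))  # ['b.py', 'c.py']
```

Files whose hash is unchanged are skipped entirely, which is what keeps `codeqai sync` fast on large repositories.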

❓ FAQ

Where do I get models for llama.cpp?

Install the huggingface-cli and download your desired model from the model hub. For example

huggingface-cli download TheBloke/CodeLlama-13B-Python-GGUF codellama-13b-python.Q5_K_M.gguf

will download the codellama-13b-python.Q5_K_M model. After the download has finished, the absolute path of the model's .gguf file is printed to the console.

Important

llama.cpp compatible models must be in the .gguf format.

🛟 Troubleshooting

  • During installation with pipx

    pip failed to build package: tiktoken
    
    Some possibly relevant errors from pip install:
      error: subprocess-exited-with-error
      error: can't find Rust compiler
    

Make sure the Rust compiler is installed on your system.

  • During installation of faiss

    × Building wheel for faiss-cpu (pyproject.toml) did not run successfully.
    │ exit code: 1
    ╰─> [12 lines of output]
        running bdist_wheel
        ...
    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for faiss-cpu
    Failed to build faiss-cpu
    ERROR: Could not build wheels for faiss-cpu, which is required to install pyproject.toml-based projects
    

    Make sure codeqai is installed with Python <3.12. There is no faiss wheel available yet for Python 3.12.

🌟 Contributing

If you are missing a feature or facing a bug, don't hesitate to open an issue or raise a PR. Any kind of contribution is highly appreciated!

codeqai's People

Contributors

bhargavnova, dependabot[bot], fynnfluegge, nisarg1112, shreyahegde18, yenif


codeqai's Issues

Error running `codeqai search`, `app` and `chat`: Unexpected keyword argument `token` in `INSTRUCTOR._load_sbert_model()`

Description

When attempting to run the codeqai app command on my project directory, I encountered a TypeError related to an unexpected keyword argument 'token' in the INSTRUCTOR._load_sbert_model() method. This occurred after configuring codeqai to use local embedding models (Instructor-Large) and selecting gpt-4 as the remote LLM for chat functionalities.

Steps to Reproduce

  1. Installed codeqai using pip.
  2. Ran codeqai configure and configured the tool as follows:
    • Selected "y" for using local embedding models.
    • Chose "Instructor-Large" for the local embeddings model.
    • Selected "N" for using local chat models and chose "OpenAI" with "gpt-4" as the remote LLM.
  3. Attempted to start the codeqai search by running codeqai search in the terminal.
  4. Encountered the following error:
  Traceback (most recent call last):
    File "/usr/local/bin/codeqai", line 8, in <module>
      sys.exit(main())
               ^^^^^^
    File "/usr/local/lib/python3.11/site-packages/codeqai/__main__.py", line 5, in main
      app.run()
    File "/usr/local/lib/python3.11/site-packages/codeqai/app.py", line 121, in run
      embeddings_model = Embeddings(
                         ^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/codeqai/embeddings.py", line 42, in __init__
      self.embeddings = HuggingFaceInstructEmbeddings()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/langchain_community/embeddings/huggingface.py", line 149, in __init__
      self.client = INSTRUCTOR(
                    ^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 194, in __init__
      modules = self._load_sbert_model(
                ^^^^^^^^^^^^^^^^^^^^^^^
  TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'

Expected Behavior

I expected the codeqai search to launch successfully and allow me to interact with my codebase through the bash terminal.

Actual Behavior

The application failed to start due to a TypeError in the INSTRUCTOR._load_sbert_model() method.

Environment

  • codeqai version: 0.0.14
  • langchain-community version: 0.0.17
  • sentence-transformers version: 2.3.1
  • Python version: 3.11
  • Operating System: Linux c9b1c6e240f6 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 GNU/Linux (Docker Container)

Additional Context

The issue seems to be related to the integration between codeqai, the langchain-community package, and sentence-transformers. Given that all components are up to date, it appears there might be an incompatibility or a bug in the way codeqai is utilizing the sentence-transformers library, specifically with the INSTRUCTOR model configuration.


Throwing Error When Reading .venv

Installed and tried to run on my current VS Code project, but since I have a virtual environment set up, it threw an error. I was able to set up and run with a new version of the same repository without issues. Is there a way I can skip the .venv or other files/folders on ingestion?
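Since codeqai indexes the current git checkout, adding `.venv` to `.gitignore` should normally keep it out of the index. An explicit path filter, were one added, could look like the following sketch (hypothetical; `should_index` and `IGNORED_DIRS` are not an existing codeqai option):

```python
from pathlib import Path

# Hypothetical ignore set for illustration only.
IGNORED_DIRS = {".venv", "venv", ".git", "node_modules", "__pycache__"}

def should_index(path: str) -> bool:
    """Return False for files living under any ignored directory."""
    return not any(part in IGNORED_DIRS for part in Path(path).parts)

print(should_index("src/app.py"))      # True
print(should_index(".venv/lib/x.py"))  # False
```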

Assertion error in faiss

Very cool project, trying to get it to work I face some issue -

Using local embeddings (INSTRUCTOR_Transformer) and a llama.cpp model, any search/chat ends up in the following assertion:

load INSTRUCTOR_Transformer
max_seq_length  512
🔎 Enter a search pattern: preprocessing
⠹ 🤖 Processing...Traceback (most recent call last):
  File "/Users/dinari/.local/bin/codeqai", line 10, in <module>
    sys.exit(main())
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/codeqai/__main__.py", line 5, in main
    app.run()
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/codeqai/app.py", line 177, in run
    similarity_result = vector_store.similarity_search(search_pattern)
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/codeqai/vector_store.py", line 131, in similarity_search
    return self.db.similarity_search(query, k=4)
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py", line 544, in similarity_search
    docs_and_scores = self.similarity_search_with_score(
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py", line 417, in similarity_search_with_score
    docs = self.similarity_search_with_score_by_vector(
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py", line 302, in similarity_search_with_score_by_vector
    scores, indices = self.index.search(vector, k if filter is None else fetch_k)
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/faiss/class_wrappers.py", line 329, in replacement_search
    assert d == self.d
AssertionError

Using Apple Silicon (arm64) arch.

Move `pytest` to dev dependency group

pytest is currently defined in default dependency group:

[tool.poetry.dependencies]
python = "^3.9"
tiktoken = "^0.4.0"
yaspin = "^3.0.0"
pytest = "^7.4.0"

Should be moved to a separate dev dependency group to exclude it from build.

Wrong line numbers in Semantic search result

The code snippets displayed as the result of the semantic search have wrong line numbers. The line numbers should match those in the corresponding file. Currently, line numbers always start at 1.


This can be fixed by finding the occurrence of the code snippet in the file. The file name is present in the metadata of the vector search result.
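The suggested fix can be sketched as follows (a minimal illustration; `find_snippet_start_line` is a hypothetical helper, not part of codeqai):

```python
def find_snippet_start_line(file_content: str, snippet: str) -> int:
    """Locate a code snippet inside its source file and return the 1-based
    line number where it starts; fall back to 1 (the current behavior)
    when the snippet is not found verbatim."""
    index = file_content.find(snippet)
    if index == -1:
        return 1
    # Count the newlines before the match to recover the real line number.
    return file_content.count("\n", 0, index) + 1


source = "import os\n\n\ndef main():\n    pass\n"
print(find_snippet_start_line(source, "def main():"))  # 4
```

The file path needed to read `file_content` is available in the metadata of the vector search result, as noted above.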

Indexing Error with codeqai on Conda Environment: Continuous Indexing Without Completion

While using the codeqai tool within a conda environment, I encountered an issue during the indexing process where it continuously attempts to index without completion. This problem occurred when I tried to utilize codeqai's search functionality in my project directory. Specifically, the error IndexError: list index out of range was thrown, indicating an issue with handling the document vector indexing. Below are the detailed steps to reproduce, along with the specific environment setup.

Steps to Reproduce:

  1. Installed codeqai using pip within a conda environment.
  2. Ran codeqai configure and configured the tool with the following settings:
    • Selected "y" for using local embedding models.
    • Chose "Instructor-Large" for the local embedding model.
    • Selected "N" for using local chat models and chose "OpenAI" with "gpt-4" as the remote LLM.
  3. Attempted to start the codeqai search by navigating to my project directory (2-006) that includes .m, .mat, and .txt files, then running codeqai search in the terminal.
  4. Received a message indicating no vector store was found for 2-006 and that initial indexing may take a few minutes. Shortly after, the indexing process started but then failed with an IndexError: list index out of range.

Expected Behavior:

The indexing process should be completed, allowing for subsequent searches within the codebase using codeqai.

Actual Behavior:

The application failed to complete the indexing process due to an IndexError in the vector indexing step, specifically indicating a problem with handling the document vectors.

Environment:

  • codeqai version: 0.0.14
  • langchain-community version: 0.0.17
  • sentence-transformers version: 2.3.1
  • Python version: 3.11
  • Conda version: 4.12.0
  • Operating System: Windows (with Conda environment)

Full Terminal Output and Error

{GenericDirectory>}conda activate condaqai-env

(condaqai-env) {GenericDirectory>}codeqai search
Not a git repository. Exiting.

(condaqai-env) {GenericDirectory>}ls
'ls' is not recognized as an internal or external command,
operable program or batch file.

(condaqai-env) {GenericDirectory>}cd 2-006

(condaqai-env) {GenericDirectory}\2-006>codeqai search
No vector store found for 2-006. Initial indexing may take a few minutes.
⠋ 💾 Indexing vector store...Traceback (most recent call last):
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\Scripts\codeqai.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\__main__.py", line 5, in main
    app.run()
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\app.py", line 146, in run
    vector_store.index_documents(documents)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\vector_store.py", line 34, in index_documents
    self.db = FAISS.from_documents(documents, self.embeddings)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_core\vectorstores.py", line 508, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_community\vectorstores\faiss.py", line 960, in from_texts
    return cls.__from(
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_community\vectorstores\faiss.py", line 919, in __from
    index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: list index out of range
⠴ 💾 Indexing vector store...

Additional Context:

This issue seems to stem from the vector indexing process within the langchain-community package, possibly due to an empty or malformed document set being processed for vectorization. Given the configuration steps and the use of a conda environment, there might be specific dependencies or configurations that contribute to this problem.
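Since MATLAB files (.m/.mat) are not among the supported languages listed above, the parser likely produces an empty document list, and `faiss.IndexFlatL2(len(embeddings[0]))` then fails with the opaque `IndexError`. A guard before indexing would surface this clearly (a hypothetical sketch, not codeqai's actual code):

```python
def guard_documents(documents):
    """Fail with an actionable message instead of the opaque IndexError
    raised by faiss.IndexFlatL2(len(embeddings[0])) on an empty list."""
    if not documents:
        raise ValueError(
            "No parseable source files found in this directory - "
            "codeqai only indexes files in its supported languages."
        )
    return documents
```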

Various issues and fixes on Windows

Hi, I tried to run your project using Windows 11 Powershell 7.4 and ran into various issues. I was able to debug some of them, so I thought I'd jot down the steps I took:

1) pipx run --spec codeqai codeqai configure

This didn't work for me, the setup launched but Codeqai was unavailable after. (My understanding of pipx run is that it's a temporary, run-once sandbox venv only.)

pipx install codeqai, followed by codeqai configure worked instead.

2) UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1137: character maps to

Whenever you're using open(), I believe you should add encoding='utf-8'. This solves this issue.
For example: with open(env_path, "w", encoding='utf-8') as env_f: in app.py.

3) Command '['C:\\Users\\<USER>\\.local\\pipx\\venvs\\codeqai\\Scripts\\python.exe', '-m', 'pip', 'install', 'faiss-gpu (Only if your system supports CUDA))']' returned non-zero exit status 1.

Unless I'm mistaken, on line 170 of vector_store.py, you're passing the literal string faiss-gpu (Only if your system supports CUDA) to pip install. You'd want faiss-gpu instead. However, I still couldn't install faiss-gpu as it returned a no compatible packages error. faiss-cpu worked fine.

4) When I reran codeqai search/sync/etc I get "IndexError: list index out of range". in C:\Users\<USER>\.local\pipx\venvs\codeqai\lib\site-packages\codeqai\vector_store.py", line 34,.

This seems to be because "documents" is empty. Going back to app.py, the files var after files = repo.load_files() has an array of docs, but documents is empty after documents = codeparser.parse_code_files(files).

After some debugging, this seems to be because treesitterNodes in codeparser.py is empty by line 36. However, programming_language has content (Language.JAVASCRIPT /n Language.JAVASCRIPT), TreesitterMethodNode has <codeqai.treesitter.treesitter_js.TreesitterJavascript object at ...> (x2), and file_bytes also has the expected file data.

I'm unfamiliar with Treesitter to be able to debug any further as to why treesitter_parser.parse(file_bytes) is returning an empty array in this case.

Hope this can help.

P.S.
Didn't include this in the list as it may be my local ENV, but for some reason I was unable to run codeqai via pipx in python 3.10.5. It repeatedly wanted to use pyenv-win 3.9.6, even though that was nowhere on my system. I had to install 3.9.6 to be able to continue. This may be a local env issue from an old installation however.

add python-dotenv

The OpenAI and Azure API keys should be isolated in the pipx installation. The user should be prompted to enter the necessary environment variables if they are not available in the isolated environment. These variables should be stored in an .env file.
Use https://github.com/theskumar/python-dotenv
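The requested behavior, prompt once and persist to ~/.config/codeqai/.env, can be sketched with stdlib-only code like the following (in the real fix, python-dotenv's `load_dotenv` and `set_key` would replace the hand-rolled .env handling; all names here are illustrative):

```python
import os
from pathlib import Path

def load_env_file(path: Path) -> dict:
    """Minimal .env reader (python-dotenv's load_dotenv would replace this)."""
    env = {}
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if "=" in line and not line.lstrip().startswith("#"):
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip().strip('"')
    return env

def ensure_env_var(path: Path, name: str, prompt=input) -> str:
    """Return the variable from the process environment or the .env file,
    prompting the user and persisting the answer if it is missing."""
    value = os.environ.get(name) or load_env_file(path).get(name)
    if value is None:
        value = prompt(f"Enter {name}: ")
        path.parent.mkdir(parents=True, exist_ok=True)
        # Explicit utf-8 avoids the Windows charmap errors reported above.
        with open(path, "a", encoding="utf-8") as f:
            f.write(f'{name}="{value}"\n')
    return value
```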

Unknown Issue after entering API Key

Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Scripts\codeqai.exe\__main__.py", line 7, in <module>
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\codeqai\__main__.py", line 5, in main
    app.run()
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\codeqai\app.py", line 108, in run
    repo_name = repo.get_git_root(os.getcwd()).split("/")[-1]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\codeqai\repo.py", line 8, in get_git_root
    git_repo = Repo(path, search_parent_directories=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\git\repo\base.py", line 276, in __init__
    raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: C:\Users\pierr

Azure OpenAI Issue

Hi,

My Setup is as follows regarding Embedding and Chat LLM:

[?] Which local embeddings model do you want to use?:
Instructor-Large
[?] Do you want to use local chat models? (y/N): N
[?] Which remote LLM do you want to use?:
Azure-OpenAI

In such a setup, simple questions about the codebase give responses like below:

I'm sorry, I cannot determine the answer to your question as there is not enough context provided to identify a specific codebase.
Can you provide more information or code snippets?

Is there anything wrong with the above setup?

Changing embeddings model should delete faiss index

If an embeddings model is configured after the current repo was already indexed with faiss, the related faiss index should be deleted from the .cache/ folder automatically. Afterwards it can be recreated with the newly configured model.
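The proposed cleanup could look roughly like this (a sketch; the actual cache layout and file names under ~/.cache/codeqai may differ):

```python
from pathlib import Path

def delete_cached_index(cache_dir: Path, repo_name: str) -> list:
    """Delete cached index files for a repo so the vector store is rebuilt
    with the newly configured embeddings model on the next run."""
    removed = []
    for file in cache_dir.glob(f"{repo_name}*"):
        file.unlink()
        removed.append(file.name)
    return sorted(removed)
```

Deleting the index is required because a FAISS index created with one embeddings model has a fixed vector dimension; querying it with vectors from a different model trips the `assert d == self.d` seen in the assertion-error issue above.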
