Code Indexer Loop

Code Indexer Loop is a Python library designed to index and retrieve code snippets.

It builds on the indexing utilities of the LlamaIndex library and uses the multi-language tree-sitter library to parse code in many popular programming languages. tiktoken is used to right-size retrieval based on the number of tokens, and LangChain is used to obtain embeddings (defaults to OpenAI's text-embedding-ada-002) and store them in an embedded ChromaDB vector database. watchdog continuously updates the index based on file system events.

Read the launch blog post for more details about why we've built this!

Installation:

Use pip to install Code Indexer Loop from PyPI.

pip install code-indexer-loop

Usage:

  1. Import necessary modules:
     from code_indexer_loop.api import CodeIndexer
  2. Create a CodeIndexer object and have it watch for changes:
     indexer = CodeIndexer(src_dir="path/to/code/", watch=True)
  3. Use .query to perform a search query:
     query = "pandas"
     print(indexer.query(query)[0:30])

Note: make sure the OPENAI_API_KEY environment variable is set. This is needed for generating the embeddings.

You can also use indexer.query_nodes to get the nodes of a query or indexer.query_documents to receive the entire source code files.

Note that if you edit any of the source code files in src_dir, the indexer efficiently re-indexes those files using watchdog and an MD5-based caching mechanism. This keeps the embeddings up to date every time you query the index.
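To illustrate the idea behind hash-based caching, here is a minimal sketch using only the standard library. It is not the library's actual implementation: the function names `file_md5` and `changed_files` and the in-memory `cache` dict are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_files(paths, cache):
    """Return the paths whose content hash differs from the cached one.

    `cache` (hypothetical) maps str(path) -> last seen digest and is
    updated in place, so a second call with unchanged files returns [].
    """
    changed = []
    for path in paths:
        digest = file_md5(path)
        if cache.get(str(path)) != digest:
            cache[str(path)] = digest
            changed.append(path)
    return changed
```

Only the files returned by `changed_files` would need new embeddings; files whose digest is unchanged can be skipped even if a file system event fired for them.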

Examples

Check out the basic_usage notebook for a quick overview of the API.

Token limits

You can configure token limits for the chunks through the CodeIndexer constructor:

indexer = CodeIndexer(
    src_dir="path/to/code/",
    watch=True,
    target_chunk_tokens=300,
    max_chunk_tokens=1000,
    enforce_max_chunk_tokens=False,
    coalesce=50,
    token_model="gpt-4",
)

You can choose whether max_chunk_tokens is enforced. If it is, the indexer raises an exception when no semantic parse respects max_chunk_tokens.

The coalesce argument sets the limit up to which smaller chunks are combined into a single chunk, which avoids producing many very small chunks. The unit for coalesce is also tokens.
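One plausible reading of coalescing can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `coalesce_chunks` is a hypothetical helper, and `count_tokens` is a whitespace-based stand-in for a real tokenizer such as tiktoken.

```python
def count_tokens(text):
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


def coalesce_chunks(chunks, coalesce=50, count=count_tokens):
    """Merge consecutive chunks until each merged chunk exceeds `coalesce` tokens.

    Chunks are concatenated unchanged, so joining the result still
    reproduces the original text.
    """
    merged = []
    current = ""
    for chunk in chunks:
        current += chunk
        if count(current) > coalesce:
            merged.append(current)
            current = ""
    if current:
        # Trailing remainder that never reached the threshold.
        merged.append(current)
    return merged
```

With a larger `coalesce` value, more small neighbouring chunks end up in the same merged chunk, which trades retrieval granularity for fewer, more substantial chunks.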

tree-sitter

Using tree-sitter for parsing, the chunks are broken only at valid node-level string positions in the source file. This avoids breaking up e.g. function and class definitions.
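The node-boundary idea can be sketched for Python source alone. Note the swap: this example uses Python's built-in `ast` module instead of tree-sitter, cuts only at top-level statements, and ignores token limits. The function name `split_at_top_level_nodes` is hypothetical.

```python
import ast


def split_at_top_level_nodes(source):
    """Split Python source so every cut falls at the start of a top-level statement."""
    lines = source.splitlines(keepends=True)
    starts = set()
    for node in ast.parse(source).body:
        decorators = getattr(node, "decorator_list", [])
        # A decorated definition starts at its first decorator, not at `def`.
        starts.add(min([node.lineno] + [d.lineno for d in decorators]))
    # Convert 1-based line numbers to 0-based cut points; always begin at 0.
    cuts = sorted({0} | {s - 1 for s in starts}) + [len(lines)]
    # Blank lines and comments between statements stay with the preceding chunk.
    return ["".join(lines[a:b]) for a, b in zip(cuts, cuts[1:]) if a < b]
```

Because the chunks are contiguous slices of the original lines, no function or class body is cut in the middle, and concatenating the chunks gives back the input unchanged.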

Supported languages:

C, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript

Note: we're mainly testing Python support. Use other languages at your own peril.

Contributing

Pull requests are welcome. Please make sure to update tests as appropriate. Use tools provided within dev dependencies to maintain the code standard.

Tests

Run the unit tests by invoking pytest in the root.

License

Please see the LICENSE file provided with the source code.

Attribution

We'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic here and here. The implementation in code_indexer_loop is modified from their original implementation, mainly to limit based on tokens instead of characters and to achieve perfect document reconstruction ("".join(chunks) == original_source_code).
