SemanTweet Search

SemanTweet Search lets you search all the tweets in your Twitter archive by semantic similarity. A demo is available here.

It preprocesses your tweets, generates embeddings using OpenAI's small or large embedding model, stores the data and embeddings in the LanceDB vector database, and provides a web interface to search and view the results.

You can also pre-filter by time, likes, retweets, media-only tweets, or link-only tweets before running the semantic search.

Pre-filtering with SQL operations not only narrows the results but also shrinks the vector search space, which speeds up the search.
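As a rough illustration of why pre-filtering helps, the sketch below applies a metadata predicate first and computes similarity only for the rows that survive. It is plain Python and does not use LanceDB; the row layout, the `likes` field, and the `prefiltered_search` helper are assumptions made for the example, not the project's actual schema or code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def prefiltered_search(rows, query_vec, predicate, k=3):
    """Apply the metadata filter first, then rank only the survivors."""
    candidates = [r for r in rows if predicate(r)]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["vector"], query_vec),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical rows; the real table schema may differ.
tweets = [
    {"text": "learning rust", "likes": 2, "vector": [0.9, 0.1]},
    {"text": "rust borrow checker tips", "likes": 40, "vector": [0.8, 0.2]},
    {"text": "pasta recipe", "likes": 55, "vector": [0.1, 0.9]},
]

# Only tweets with at least 10 likes are compared against the query vector.
hits = prefiltered_search(tweets, [1.0, 0.0], lambda r: r["likes"] >= 10, k=1)
print(hits[0]["text"])
```

The similarity function runs on two rows instead of three here; on a large archive the same idea removes most of the comparisons.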

You can additionally use or edit projector.py together with the TensorFlow projector to visualize your tweets using the t-SNE algorithm, as shown here.
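The TensorFlow projector loads embeddings from tab-separated files: one file with a vector per row and a metadata file with one label per row. The sketch below shows one way to produce those two files' contents. It is not the repository's projector.py; the `to_projector_tsv` helper and the sample data are assumptions for illustration, and it assumes a single-column metadata file, which takes no header row.

```python
import csv
import io

def to_projector_tsv(vectors, labels):
    """Return (vectors_tsv, metadata_tsv) strings, one row per tweet."""
    if len(vectors) != len(labels):
        raise ValueError("need exactly one label per vector")
    vec_buf, meta_buf = io.StringIO(), io.StringIO()
    vec_writer = csv.writer(vec_buf, delimiter="\t", lineterminator="\n")
    meta_writer = csv.writer(meta_buf, delimiter="\t", lineterminator="\n")
    for vec, label in zip(vectors, labels):
        vec_writer.writerow(vec)
        # Collapse tabs and newlines inside a tweet so each label stays on one row.
        meta_writer.writerow([" ".join(label.split())])
    return vec_buf.getvalue(), meta_buf.getvalue()

vectors_tsv, metadata_tsv = to_projector_tsv(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    ["first tweet", "second\ttweet\nwith breaks"],
)
```

Writing the two strings to files such as vectors.tsv and metadata.tsv (names chosen here, not taken from the repo) gives something that can be uploaded through the projector's load dialog.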

UPDATE

  • (25/3/2024) Added support for bge-small-en-v1.5 embeddings. No API key is required. The model ranks 29th on the MTEB leaderboard, is 130 MB in size, produces 384-dimensional embeddings, and has a sequence length of 512. Note that it is not multilingual.

    Technically, many other embeddings from sentence-transformers can be used as well. Refer to the LanceDB docs here.

  • (24/3/2024) Added CLIP-based image search over the tweets_media folder. See app_image_search.py after installing the requirements, that is, after completing step 5. You do not need to run the setup bash scripts for this; it is a standalone program. The first run takes roughly 5-10 minutes because it creates the embeddings. Subsequent runs are instant.

This is also available as a standalone lightweight repository, Embeddit, where you can use any folder of images.

Technologies Used:

  • Twitter archive for data
  • OpenAI text-embedding-3-large embeddings by default for semantic search
  • LanceDB for vector search and SQL operations
  • Hybrid search provided by LanceDB, combining BM25 + embedding search
  • Flask for server
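To give a feel for what combining a BM25 ranking with an embedding ranking can look like, here is a minimal sketch of reciprocal rank fusion, one common way to merge two ranked lists. This is an illustrative technique chosen for the example and is not necessarily how LanceDB fuses results; the tweet ids and both input orderings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; ids ranked high in many lists win."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["t3", "t1", "t7"]  # hypothetical BM25 order
vector_hits = ["t1", "t5", "t3"]   # hypothetical embedding order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Ids that appear in both lists ("t1" and "t3") rise above ids that appear in only one, which is the behaviour hybrid search is meant to provide.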

Prerequisites

  • Python 3.x
  • OpenAI API key
  • Twitter archive data

Installation

  1. Clone the repository:

    git clone https://github.com/sankalp1999/semantweet-search.git
    
    cd semantweet-search/
    
  2. Download your Twitter archive (takes 2 days to be available)

    Go to: More (3 dot button) > Settings and Privacy > Your Account > Download an archive of your data.

    Extract it. Put the extracted folder at the root of this project and rename it to twitter-archive.

  3. Create a virtual environment:

    python3 -m venv venv
    

    Make sure you do this at the root of project.

  4. Activate the virtual environment:

    • For Unix/Linux:
      source venv/bin/activate
      
    • For Windows:
      venv\Scripts\activate
      
  5. Install the required dependencies:

    pip install -r requirements.txt
    

    If you want to try out image vector search, run the command below. It is not included in requirements.txt because it downloads a 620 MB model, which not everyone will want by default.

    pip install open_clip_torch
    
  6. (Not required if using sentence transformers) Set up your OpenAI API key as an environment variable:

    export OPENAI_API_KEY=your_api_key
    
  7. By default, this repo uses OpenAI's text-embedding-3-large model.

    You can switch to text-embedding-3-small instead. The change must be made in two places.

  8. Run the setup script:

    For OpenAI:

    chmod +x run_scripts.sh
    ./run_scripts.sh

    For sentence-transformers:

    chmod +x run_sentence_tf_scripts.sh
    ./run_sentence_tf_scripts.sh

    If you use sentence-transformers, also uncomment lines 15 and 16 in app.py:

    # db = lancedb.connect("data/bge_embeddings")
    # table = db.open_table("bge_table")

  9. Start the application:

    python app.py
    

    or

    flask run
    

Enjoy!

Flow of the program

graph TD
    A[Twitter Archive Data] --> B[preprocess_tweets_one.py]
    B --> C[Preprocessed Tweets CSV]
    C --> D[async_openai_embedding_two.py]
    D --> E[Embeddings CSV]
    E --> F[create_lance_db_table_openai_three.py]
    F --> G[LanceDB Database]
    G --> H[Web Interface]
    H --> I[Search Tweets]
    I --> J[View Results]

The OpenAI embedding flow consists of the following steps:

  1. preprocess_tweets_one.py: This script preprocesses the tweets from the Twitter archive, extracting relevant information and saving it to a CSV file.

  2. async_openai_embedding_two.py: This script reads the preprocessed tweets from the CSV file, generates embeddings using OpenAI's embedding model asynchronously, and saves the embeddings to a new CSV file.

  3. create_lance_db_table_openai_three.py: This script reads the generated embeddings from the CSV file, creates a LanceDB table using the specified schema, and stores the data in the database.

The run_scripts.sh script automates the execution of these steps in the correct order.
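The asynchronous step can be pictured roughly as follows: several batches are sent concurrently, with a cap on how many requests are in flight, and the results are reassembled in input order. This is a sketch under stated assumptions, not the contents of async_openai_embedding_two.py; `fake_embed` stands in for a real API call, and `embed_all` and `max_concurrency` are hypothetical names.

```python
import asyncio

async def fake_embed(batch):
    """Stand-in for a network call to an embedding API (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return [[float(len(text))] for text in batch]

async def embed_all(batches, embed, max_concurrency=4):
    """Embed all batches concurrently, at most max_concurrency at a time."""
    gate = asyncio.Semaphore(max_concurrency)

    async def worker(batch):
        async with gate:
            return await embed(batch)

    # gather preserves input order, so results line up with the batches.
    per_batch = await asyncio.gather(*(worker(b) for b in batches))
    return [vec for batch_vecs in per_batch for vec in batch_vecs]

vectors = asyncio.run(embed_all([["hi", "hello"], ["hey"]], fake_embed))
print(vectors)
```

Swapping `fake_embed` for a coroutine that calls a real embedding service keeps the rest of the structure unchanged.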

Additional Notes

  • The project uses the text-embedding-3-large model by default. You can change the model by modifying the MODEL_NAME variable.

  • The batch size for generating embeddings is set to 32 to stay within the token limit. Adjust the batch size if needed.

  • The LanceDB database is stored in the data/openai_db directory.

  • The project also includes a synchronous version of the OpenAI embedding generation script (create_openai_embedding_sync_two.py), which can be used as an alternative to the asynchronous version.
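The batching mentioned in the notes above can be sketched as a small generator that slices the tweet list into groups of at most 32. The `batched` helper and the sample data are assumptions for illustration; the project's scripts may split their input differently.

```python
def batched(items, size=32):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 70 hypothetical tweets split into batches of 32, 32, and 6.
tweets = [f"tweet {i}" for i in range(70)]
sizes = [len(batch) for batch in batched(tweets)]
print(sizes)
```

A smaller `size` lowers the number of tokens per request at the cost of more requests, which is the trade-off to consider when adjusting the batch size.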

License

This project is licensed under the MIT License. See the LICENSE file for details.
