torchserve-embedder-encoder's Introduction

Running all-MiniLM-L6-v2 with TorchServe

This repository contains everything needed to deploy a production-ready service for computing sentence similarity embeddings using the model all-MiniLM-L6-v2 and TorchServe. Those embeddings can then be used in combination with a vector database like Pinecone, Milvus, Weaviate or Qdrant.

This repository was created because the TorchServe repository had no simple example of deploying a Hugging Face model for sentence similarity. The closest resources I could find covered sequence classification, generation, question answering, and token classification, as you can check here.

More details are provided in this article.

Get Started

There are two ways to test the server: running it as a process or as a Docker container.

Deploy with Docker

First, make sure you have Docker installed.

docker run -p 8080:8080 -it ghcr.io/alexgseymour/torch-serve-embedder-encoder-x86:latest

Then, go to Usage to check how to use the service.

Note that the command can be further optimized for production by using the settings described in the TorchServe documentation.

Also note that with this Docker image, inference runs on the CPU with basic capabilities only, i.e., without leveraging any CPU extension, even if a GPU is available. To leverage your GPU or specific CPU capabilities, adapt the Dockerfile according to the TorchServe documentation.

Deploy as a process

First, make sure you have Python 3 and Java 11+ installed.

make serve

Then, go to Usage to check how to use the service. If you have a GPU available, the server will use it for faster inference.
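Whichever way the server is started, it may not answer requests right away. The following is a minimal Python sketch for waiting until it is ready. It assumes that TorchServe's standard health endpoint, /ping, is reachable on the same port 8080 as the inference API and returns a JSON body with "status": "Healthy". The helper names is_healthy and wait_until_ready, the timeouts, and the injectable fetch parameter are illustrative choices, not part of this repository.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(body):
    """Return True if a /ping response body reports a healthy server."""
    try:
        return json.loads(body).get("status") == "Healthy"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def wait_until_ready(url="http://127.0.0.1:8080/ping",
                     timeout_s=60.0, interval_s=2.0, fetch=None):
    """Poll the ping endpoint until it reports healthy or the timeout expires.

    `fetch` is a hypothetical hook taking a URL and returning the response
    body as text; it defaults to a plain HTTP GET.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.read().decode("utf-8")

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if is_healthy(fetch(url)):
                return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False
```

Calling wait_until_ready() before sending the first request avoids connection errors while the server is still starting.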

Usage

This should start a server locally that you can query with a curl command like the following:

curl --location 'http://127.0.0.1:8080/predictions/my_model' \
--header 'Content-Type: application/json' \
--data '{
    "input": ["hello, how are you?", "hi, what is up?"]
}'

You should get an output similar to:

[
  [
    0.019096793606877327,
    0.03446517512202263,
    0.09162796288728714,
    0.0701652243733406,
    -0.029946573078632355,
    ...
  ],
  [
    -0.06470940262079239,
    -0.03830110654234886,
    0.013061972334980965,
    -0.0003482792235445231,
    ...
  ]
]
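The same request can be sent from Python, and the two returned vectors compared directly. The sketch below uses only the standard library and assumes the endpoint and JSON body from the curl example above, and a response shaped like the sample output (one list of floats per input sentence). The helper names build_request, embed, and cosine_similarity are illustrative and not part of this repository.

```python
import json
import math
import urllib.request

# Endpoint taken from the curl example above.
URL = "http://127.0.0.1:8080/predictions/my_model"


def build_request(sentences, url=URL):
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"input": list(sentences)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(sentences, url=URL):
    """Send the sentences to the server and return one embedding per sentence."""
    with urllib.request.urlopen(build_request(sentences, url), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With the server running, embed(["hello, how are you?", "hi, what is up?"]) should return two vectors, and passing them to cosine_similarity gives a score between -1 and 1 where higher means more similar. This is the kind of comparison a vector database performs at scale.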

Future Work

  • TorchServe uses a Java server to expose the API, which is heavy. According to the Docker stats, the service takes about 4GB of RAM. I will look into using Go bindings to reduce the memory footprint.

Acknowledgments

Many thanks to:

  • Stane Aurelius, who wrote a great post detailing how to deploy a model with TorchServe. I highly recommend reading it.

  • The community publishing models on Hugging Face, and particularly the team who produced and shared the all-MiniLM-L6-v2 model.

  • The Microsoft Research team who produced the paper about MiniLM.

  • The Hugging Face team, who host the models and make them easily available to everyone.

License

The code in this repository is licensed under the MIT license.

Contributors

clems4ever, alexgseymour, owaiskhan9654
