Justicio

Justicio is a question-answering assistant that answers user questions about the official state gazette of Spain: Boletín Oficial del Estado (BOE).

TL;DR: All BOE articles are embedded as vectors and stored in a vector database. When a question is asked, it is embedded in the same latent space and used as a query to retrieve the most relevant text from the vector database. The retrieved pieces of text are then sent to the LLM to construct an answer.

Service

We currently run a free service for users: Justicio

You can test it at no charge! Please send us any feedback you have.

How it works under the hood

Flow

  1. All BOE articles are converted into embeddings and stored in an embedding database. This process runs at startup and then once every day.
  2. The user writes (using natural language) any question related to the BOE as input to the system.
  3. The backend service processes the input request (user question), transforms the question into an embedding, and sends the generated embedding as a query to the embedding database.
  4. The embedding database returns documents that most closely match the query.
  5. The most similar documents returned by the embedding database are added to the input query as context. Then a request with all the information is sent to the LLM API model.
  6. The LLM API model returns a natural language answer to the user's question.
  7. The user receives an AI-generated response output.
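
The steps above can be sketched end to end in a few lines. This is only a toy illustration, not the project's code: the bag-of-words `embed` function stands in for a real sentence-transformer model, `ToyVectorStore` stands in for the embedding database, `echo_llm` stands in for the LLM API, and the three sample texts are made-up placeholders rather than real BOE content.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for the embedding database (hypothetical class)."""

    def __init__(self) -> None:
        self._rows = []  # list of (embedding, text) pairs

    def add(self, text: str) -> None:
        # Step 1: embed each document and store it.
        self._rows.append((embed(text), text))

    def search(self, question: str, top_k: int = 2) -> list:
        # Steps 3-4: embed the question and return the closest documents.
        query = embed(question)
        ranked = sorted(self._rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


def answer(question: str, store: ToyVectorStore, llm) -> str:
    # Steps 5-7: add retrieved documents as context and call the LLM.
    context = "\n".join(store.search(question))
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def echo_llm(prompt: str) -> str:
    """Stub LLM that returns the prompt unchanged (no real API call)."""
    return prompt


store = ToyVectorStore()
store.add("Royal Decree on minimum wage for 2023.")
store.add("Order regulating fishing quotas in the Mediterranean.")
store.add("Resolution publishing the public holidays calendar.")

top = store.search("What is the minimum wage?", top_k=1)
reply = answer("What is the minimum wage?", store, echo_llm)
```

In a real deployment each stand-in would be replaced by the corresponding component described in the next section.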

Components

Backend service

The backend is the web service at the center of the system, and it performs most of the tasks:

  • Process the input requests from the user.
  • Transform the input text into embeddings.
  • Send requests to the embeddings database to get the most similar embeddings.
  • Send requests to the LLM API model to generate the response.
  • Save the traces.
  • Handle input/output exceptions.

Embedding/Vector database

Loading data

We download the BOE documents and break them into small chunks of text (e.g. 1200 characters). Each text chunk is transformed into an embedding (e.g. a dense numerical vector with 768 dimensions). Some additional metadata is also stored with the vectors so that we can pre- or post-filter the search results. Check the code
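
A minimal sketch of the chunking step, assuming fixed-size character windows. The `Chunk` class, the `split_into_chunks` helper, the `overlap` parameter, and the metadata keys are illustrative assumptions; the project's actual splitter may work differently.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def split_into_chunks(text: str, doc_metadata: dict,
                      chunk_size: int = 1200, overlap: int = 100) -> list:
    """Split text into character windows of at most chunk_size characters.

    Consecutive chunks share `overlap` characters (an assumption, not stated
    in the source) so that a sentence cut at a boundary appears in both.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Copy the document metadata onto every chunk for later filtering.
        chunks.append(Chunk(piece, {**doc_metadata, "start": start}))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical metadata keys, used here only for illustration.
chunks = split_into_chunks("x" * 3000, {"source": "BOE", "date": "2024-01-02"})
```

Each chunk carries a copy of the document metadata, which is what makes pre- or post-filtering by fields such as the publication date possible.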

The BOE is updated every day, so we need to run an ETL job every day to retrieve the new documents, transform them into embeddings, link the metadata, and store them in the embedding database. Check the code
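
The daily job can be sketched as a loop over the days not yet loaded. Everything here is hypothetical: `pending_days`, `run_daily_etl`, and the `fetch`, `embed`, and `store` arguments are placeholders for the project's own downloader, embedding model, and database client.

```python
from datetime import date, timedelta


def pending_days(last_loaded: date, today: date) -> list:
    """Return every day after last_loaded up to and including today."""
    days = []
    current = last_loaded + timedelta(days=1)
    while current <= today:
        days.append(current)
        current += timedelta(days=1)
    return days


def run_daily_etl(last_loaded: date, today: date, fetch, embed, store: list) -> date:
    """Extract, transform, and load the documents for each pending day."""
    for day in pending_days(last_loaded, today):
        for doc in fetch(day):  # extract: documents published that day
            store.append({
                "vector": embed(doc),      # transform: text to embedding
                "text": doc,
                "date": day.isoformat(),   # metadata linked to the vector
            })
    # The caller persists the returned date as the new "last loaded" marker.
    return max(last_loaded, today)


# Stubs standing in for the real downloader and embedding model.
store = []
new_marker = run_daily_etl(
    date(2024, 1, 1), date(2024, 1, 3),
    fetch=lambda day: [f"document published on {day.isoformat()}"],
    embed=lambda text: [float(len(text))],
    store=store,
)
```

Tracking the last loaded day means a job that was skipped for a few days catches up on the next run instead of leaving a gap.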

Reading data

The database exposes APIs to transform the input question into a vector and to run an ANN (Approximate Nearest Neighbour) search against all the vectors in the database. Check the code

There are different types of search (semantic search, keyword search, or hybrid search).
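
As an illustration of hybrid search, one common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The source does not say which fusion method Justicio uses, so this is a sketch of the general technique; the function name, the document ids, and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1; higher total scores rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["a", "b", "c"]  # ids ranked by vector similarity
keyword_hits = ["b", "d", "a"]   # ids ranked by keyword match
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Documents that rank well in both lists (here `b` and `a`) end up ahead of documents that appear in only one.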

The ANN search can use different similarity metrics (cosine similarity, Euclidean distance, or dot product).
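
The three measures named above can be written in plain Python to show how they differ. This is an illustrative sketch with made-up two-dimensional vectors; a vector database computes the same quantities over much larger vectors using an index.

```python
import math


def dot_product(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list, b: list) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list, b: list) -> float:
    denominator = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if denominator == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / denominator


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way as the query, but longer
orthogonal = [0.0, 1.0]      # unrelated direction

cos_same = cosine_similarity(query, same_direction)
dot_same = dot_product(query, same_direction)
dist_same = euclidean_distance(query, same_direction)
cos_orth = cosine_similarity(query, orthogonal)
```

Cosine similarity ignores vector length, while the dot product and Euclidean distance do not. When all vectors are normalized to unit length, the three measures produce the same ranking.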

Embedding Model

The text in BOE is written in Spanish, so we need a sentence transformer model that is fine-tuned using Spanish datasets. We are experimenting with these models.

More info: https://www.newsletter.swirlai.com/p/sai-notes-07-what-is-a-vector-database

LLM API Model

It is a Large Language Model (LLM) that generates answers to user questions based on the context, that is, the most similar documents returned by the embedding database.
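
A minimal sketch of how the retrieved documents might be packed into the request sent to the LLM. The `build_prompt` helper, the prompt wording, and the `max_chars` budget are assumptions for illustration, not the project's actual prompt.

```python
def build_prompt(question: str, documents: list, max_chars: int = 4000) -> str:
    """Build one prompt string from the question and the retrieved documents.

    Documents are added in ranked order until the character budget is used
    up, so the most similar ones are kept when space runs out.
    """
    parts = []
    used = 0
    for index, doc in enumerate(documents, start=1):
        entry = f"[{index}] {doc.strip()}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


prompt = build_prompt("q", ["aaaa", "bbbb", "cccc"], max_chars=16)
```

Numbering the context entries makes it possible to ask the model to cite which retrieved passage supports its answer.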

Tools

Deploy your own service

See the deployment_guide.md file.

Want to help develop the project?

Contributions are welcome! Please contact us to see how you can help.

justicio's People

Contributors

bukosabino, ntkog, llop00, ruben-quilez
