epeters3 / gospel-search

Python 84.98% JavaScript 0.81% Dockerfile 1.19% Shell 0.75% TypeScript 12.27%

gospel-search's Issues

Support extracting/embedding only segments that do not yet exist

That is, do not re-extract and re-embed the entire dataset on every crawl. Only extract and embed segments that do not yet exist in the database.
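
One way to make "does not yet exist" checkable is to give every segment a deterministic ID and skip IDs the database already holds. The following is a minimal sketch, not the repo's implementation: `segment_id`, `select_new_segments`, the hashing scheme, and the `url`/`text` keys are all assumptions made for illustration.

```python
import hashlib


def segment_id(source_url: str, text: str) -> str:
    """Derive a stable ID from the segment's source URL and text (hypothetical scheme)."""
    digest = hashlib.sha256(f"{source_url}\n{text}".encode("utf-8"))
    return digest.hexdigest()


def select_new_segments(crawled, existing_ids):
    """Return only the crawled segments whose ID is not already stored."""
    seen = set(existing_ids)
    new_segments = []
    for seg in crawled:
        sid = segment_id(seg["url"], seg["text"])
        if sid not in seen:
            seen.add(sid)  # also drops duplicates within a single crawl
            new_segments.append({**seg, "id": sid})
    return new_segments
```

Only the segments returned by `select_new_segments` would then need to be embedded and written, so the cost of a crawl scales with what changed rather than with the size of the dataset.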

Use Poetry for Python package management

Pressing `<enter>` when the search bar is focused should trigger a search

Combine `extract` and `embed` stages into one

Use a more modern embedding model

For example, all-mpnet-base-v2 from sentence-transformers [1], or BAAI/bge-m3 from Hugging Face.

Create a small evaluation dataset

(Note: the evaluate sub-package already in this repo can serve as a starting point.)

This is so we can algorithmically compare how system variations affect ranking performance, such as:

- A different, more modern sentence embedding model, like all-mpnet-base-v2
- Embedding the whole segment, rather than taking the mean of the sentence embeddings for the segment
- A larger value of k
- Using a nearest-neighbor (NN) lookup for the whole search solution, rather than reranking the top k results returned by Elasticsearch (which uses BM25)
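
To compare variations like these on a small labeled dataset, each variation's ranked results can be scored against relevance judgments. Below is a minimal sketch using two common ranking metrics, mean reciprocal rank (MRR) and recall@k. The function names and the `run`/`judgments` dictionary shapes are assumptions for illustration and are not taken from the existing evaluate sub-package.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant results that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, judgments, k=10):
    """Average both metrics over all judged queries.

    run:       query -> ranked list of segment IDs from one system variation
    judgments: query -> set of segment IDs judged relevant
    """
    if not judgments:
        raise ValueError("need at least one judged query")
    queries = list(judgments)
    mrr = sum(reciprocal_rank(run.get(q, []), judgments[q]) for q in queries)
    recall = sum(recall_at_k(run.get(q, []), judgments[q], k) for q in queries)
    return {"mrr": mrr / len(queries), f"recall@{k}": recall / len(queries)}
```

Running `evaluate` once per variation against the same `judgments` gives numbers that can be compared directly.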

If going the human evaluation route, one option is to collect pairwise rankings and aggregate them into a full ranking with the Bradley-Terry model (source).
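
As a rough illustration of that aggregation step, the sketch below fits Bradley-Terry strengths from a list of `(winner, loser)` judgments using the standard iterative (minorization-maximization) update. The function name and input format are assumptions. Note that in this plain maximum-likelihood form, an item that never wins ends up with strength 0; smoothing for that case is not shown.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict item -> strength, normalized to sum to 1.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)  # frozenset({i, j}) -> number of comparisons
    items = set()
    for winner, loser in comparisons:
        if winner == loser:
            continue  # a self-comparison carries no information
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))
    if not items:
        return {}

    strengths = {i: 1.0 / len(items) for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum n_ij / (p_i + p_j) over every opponent j that i has faced.
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom else strengths[i]
        total = sum(updated.values())
        updated = {i: s / total for i, s in updated.items()}
        delta = max(abs(updated[i] - strengths[i]) for i in items)
        strengths = updated
        if delta < tol:
            break
    return strengths
```

Sorting the items by their fitted strength, highest first, yields the full ranking.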

Make QA cite its sources

The sources should be clickable/tappable, rendering the source content in a modal or tooltip, with a button to view the full source in a new page.

Simplify Architecture

By refactoring to perform incremental updates instead of whole-system re-ingests, we can eliminate the need for an intermediate datastore (i.e. MongoDB) and ingest directly from the Gospel Library site into the search engine. Also, by replacing Elasticsearch with a vector database, we can eliminate the need for a re-ranking service. This would simplify the Data Transformation Pipeline to:

sequenceDiagram
    actor Operator
    Operator->>Worker: PUT /ingest
    Vector DB->>Worker: current state
    Note over Worker: determine which pages <br/> haven't been ingested yet
    Worker->>Gospel Library: requests
    Gospel Library->>Worker: web pages
    Note over Worker: extract segments and embed
    Worker->>Vector DB: new segments and embeddings
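
The worker's side of this diagram could look roughly like the sketch below. It is a sketch under assumptions only: `ingest` and all of its parameters are hypothetical, and the fetching, extraction, embedding, and storage steps are passed in as callables so that no particular HTTP client, embedding model, or vector database API is implied.

```python
def ingest(all_page_urls, ingested_urls, fetch_page, extract_segments, embed, store):
    """Ingest only the pages that the vector DB does not already contain.

    all_page_urls:    every page URL known from the Gospel Library site
    ingested_urls:    URLs already present in the vector DB ("current state")
    fetch_page:       callable(url) -> page content
    extract_segments: callable(page content) -> list of segment texts
    embed:            callable(segment text) -> embedding vector
    store:            callable(url, segments, vectors) that writes to the vector DB
    """
    already = set(ingested_urls)
    pending = [url for url in all_page_urls if url not in already]
    for url in pending:
        page = fetch_page(url)
        segments = extract_segments(page)
        vectors = [embed(segment) for segment in segments]
        store(url, segments, vectors)
    return pending  # the pages that were ingested in this run
```

Because nothing is written anywhere except the vector DB, no intermediate datastore appears in the flow.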

And would simplify the front-end application stack to:

sequenceDiagram
    actor User
    User->>Proxy Server: GET /
    Proxy Server->>User: client app
    User->>Proxy Server: GET /api/search
    Proxy Server->>Vector DB: search query
    Vector DB->>Proxy Server: top-k segments
    Proxy Server->>User: search results

When choosing a vector DB solution, it must meet these requirements:

- Supports keyword search
- Supports semantic search via vector search
- Supports metadata filtering (e.g. filter for only BoM segments)
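
To make the three requirements concrete, here is a toy in-memory sketch of a query that combines them: a metadata filter, a keyword score, and a vector similarity score blended with a weight `alpha`. It does not represent any real vector database's API. The segment shape, the `work` metadata key, and the simple term-overlap keyword score (standing in for something like BM25) are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query terms that appear in the text (a crude keyword match)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms) if terms else 0.0


def hybrid_search(segments, query, query_vector, filters=None, alpha=0.5, k=3):
    """Return the IDs of the top-k segments that pass the metadata filters."""
    filters = filters or {}
    scored = []
    for seg in segments:
        # Metadata filtering: skip segments that fail any requested key/value.
        if any(seg["metadata"].get(key) != value for key, value in filters.items()):
            continue
        score = (
            alpha * keyword_score(query, seg["text"])
            + (1 - alpha) * cosine(query_vector, seg["vector"])
        )
        scored.append((score, seg["id"]))
    scored.sort(reverse=True)
    return [seg_id for _, seg_id in scored[:k]]
```

A candidate vector DB would need to offer equivalents of all three pieces natively, since this design removes the separate re-ranking service that could otherwise supply them.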

gospel-search's Issues

Convert the UI to TypeScript

Support extracting/embedding only segments that do not yet exist

Use Poetry for Python package management

Pressing `<enter>` when the search bar is focused should trigger a search

Combine `extract` and `embed` stages into one

Use a more modern embedding model

Create a small evaluation dataset

Make QA cite its sources

Simplify Architecture
