gospel-search's Issues
Convert the UI to Typescript
Support extracting/embedding only segments that do not yet exist
Meaning, don't extract and embed the entire dataset on every crawl. Only extract and embed what doesn't yet exist in the database.
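One way this could work is to give every segment a stable ID and skip anything whose ID is already stored. The sketch below is only an illustration under assumptions: `segment_id`, `select_new_segments`, and the hash-of-URL-and-text ID scheme are hypothetical and not taken from this repo.

```python
import hashlib
from typing import Iterable


def segment_id(source_url: str, text: str) -> str:
    """Stable ID for a segment: a hash of its URL and text (assumed scheme)."""
    digest = hashlib.sha256(f"{source_url}\n{text}".encode("utf-8"))
    return digest.hexdigest()


def select_new_segments(
    crawled: Iterable[tuple[str, str]], existing_ids: set[str]
) -> list[tuple[str, str, str]]:
    """Return (id, url, text) for crawled segments that are not stored yet."""
    seen = set(existing_ids)
    new_segments = []
    for url, text in crawled:
        sid = segment_id(url, text)
        if sid in seen:
            continue  # already in the database, or a duplicate within this crawl
        seen.add(sid)
        new_segments.append((sid, url, text))
    return new_segments
```

Only the segments returned by `select_new_segments` would then need to be embedded and written to the database.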
Use Poetry for Python package management
Pressing `<enter>` when the search bar is focused should trigger a search
Combine `extract` and `embed` stages into one
Use a more modern embedding model
Like `all-mpnet-base-v2` from sentence-transformers [1], or `BAAI/bge-m3` from HuggingFace.
Create a small evaluation dataset
(Note: I already have the `evaluate` sub-package in this repo as a starting point.)
This is so we can algorithmically compare the ranking performance of system variations, such as:
- A different, more modern sentence embedding model, like `all-mpnet-base-v2`
- Embedding the whole segment, rather than taking the mean of the sentence embeddings for the segment
- A larger value of `k`
- Using a NN lookup for the whole search solution, rather than reranking the top `k` results returned by ElasticSearch (which uses BM25)
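A comparison like this could be scored with standard ranking metrics. The sketch below assumes the evaluation dataset maps each query to the set of segment IDs judged relevant, and that each system variation is wrapped as a function from a query to a ranked list of IDs. The names `evaluate`, `search`, and `dataset` are hypothetical and do not describe the existing `evaluate` sub-package.

```python
from typing import Callable


def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant segments that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def evaluate(
    search: Callable[[str], list[str]],
    dataset: dict[str, set[str]],
    k: int = 10,
) -> dict[str, float]:
    """Average MRR and recall@k of one system variation over all queries."""
    rr, recall = [], []
    for query, relevant in dataset.items():
        ranked = search(query)
        rr.append(reciprocal_rank(ranked, relevant))
        recall.append(recall_at_k(ranked, relevant, k))
    n = len(dataset) or 1  # avoid dividing by zero on an empty dataset
    return {"mrr": sum(rr) / n, f"recall@{k}": sum(recall) / n}
```

Running `evaluate` once per variation (different model, different `k`, NN lookup versus reranking) over the same dataset gives directly comparable numbers.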
If going the human evaluation route, one option is to collect pairwise rankings and aggregate them into a full ranking with the Bradley-Terry model (source).
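As a rough illustration of the aggregation step, the sketch below fits Bradley-Terry strengths with the classic iterative minorization-maximization update, where each item's strength becomes its win count divided by the sum of `n_ij / (p_i + p_j)` over its opponents. The function name and the `(winner, loser)` input format are assumptions made for this example.

```python
from collections import defaultdict


def bradley_terry(
    comparisons: list[tuple[str, str]], iterations: int = 200
) -> dict[str, float]:
    """Estimate item strengths from (winner, loser) pairs; strengths sum to 1."""
    wins: dict[str, int] = defaultdict(int)
    pair_counts: dict[frozenset, int] = defaultdict(int)
    items: set[str] = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strengths = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            # Items that were never compared keep their current strength.
            updated[i] = wins[i] / denom if denom else strengths[i]
        total = sum(updated.values())
        strengths = {i: s / total for i, s in updated.items()}
    return strengths
```

Sorting the items by their fitted strength, highest first, yields the full ranking. An item that never wins a comparison ends up with strength 0 under this plain maximum-likelihood fit.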
Make QA cite its sources
The sources should be clickable/tappable, rendering the source content in a modal or tooltip, with a button to view the full source in a new page.
Simplify Architecture
By refactoring to perform incremental updates instead of whole-system re-ingests, we can eliminate the need for an intermediate datastore (i.e. MongoDB), and ingest directly from the Gospel Library site to the search engine. Also, by refactoring to use a vector database instead of ElasticSearch, we can eliminate the need for a re-ranking service. This would simplify the Data Transformation Pipeline to:
sequenceDiagram
actor Operator
Operator->>Worker: PUT /ingest
Vector DB->>Worker: current state
Note over Worker: determine which pages <br/> haven't been ingested yet
Worker->>Gospel Library: requests
Gospel Library->>Worker: web pages
Note over Worker: extract segments and embed
Worker->>Vector DB: new segments and embeddings
And would simplify the front-end application stack to:
sequenceDiagram
actor User
User->>Proxy Server: GET /
Proxy Server->>User: client app
User->>Proxy Server: GET /api/search
Proxy Server->>Vector DB: search query
Vector DB->>Proxy Server: top-k segments
Proxy Server->>User: search results
When choosing a Vector DB solution, it must meet these requirements:
- Supports keyword search
- Supports semantic search via vector search
- Supports metadata filtering (e.g. filter for only BoM segments)
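To make the three requirements concrete, here is a toy in-memory sketch that combines a metadata filter, a keyword score, and a vector similarity score in one query. It is not the API of any particular vector database: `Segment`, `hybrid_search`, the `alpha` weighting, and the `volume` metadata key are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Segment:
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, or 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms found in the text (a stand-in for BM25)."""
    terms = set(query.lower().split())
    return len(terms & set(text.lower().split())) / len(terms) if terms else 0.0


def hybrid_search(
    segments: list[Segment],
    query: str,
    query_embedding: list[float],
    filters: Optional[dict[str, str]] = None,
    alpha: float = 0.5,
    k: int = 10,
) -> list[Segment]:
    """Filter by metadata, then rank by a weighted mix of both scores."""
    filters = filters or {}
    candidates = [
        s for s in segments
        if all(s.metadata.get(key) == value for key, value in filters.items())
    ]
    scored = [
        (alpha * cosine(query_embedding, s.embedding)
         + (1 - alpha) * keyword_score(query, s.text), s)
        for s in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]
```

A candidate solution would need to offer an equivalent of each of the three steps natively, so that the proxy server can issue a single query per search.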