
search_autocomplete's Introduction

Text Search, Fuzzy Search, Semantic...oh my!

Now with autocomplete suggestions!


Build your own text-based search engine with real-time smart autocomplete suggestions! The smart autocomplete takes into account spelling mistakes, word variations, and more! Search on precise text, keywords, and semantics of your own data - all powered by Rockset. This project implements the following standard text search concepts:

  • Wildcard Techniques
  • Tokenization
  • Term Frequency (TF) with weights
  • Ngrams
  • Levenshtein Distance
  • Vector Search

Check out these slides for more information on these concepts.

To build the autocomplete & text search on titles & keywords, you will need an account with Rockset. To build the semantic search (optional), you will need an account on both OpenAI and Rockset to get an API key for both platforms. Thankfully, API keys are available on the free versions of both platforms. To create an account on OpenAI go here and to create an account on Rockset go here.



Step 1: Setup Collection in Rockset

Rockset already has a public dataset of book titles, descriptions, and embeddings! Follow the steps below to set up this collection correctly:

  1. In the Rockset Console, go to the "Collections" tab and then select "Create a Collection"
  2. Scroll down and select "Public Datasets"
  3. Click the "Book Embeddings Dataset" then "Start"
  4. Once the preview has loaded, click "Next"
  5. There is a default ingest transformation, but we'll need to make a few additions for our text search use case. The ingest transformation is a powerful tool available in Rockset. It allows you to execute a SQL query on all incoming data before it is stored in Rockset. For this project, we'll need to tokenize and create ngrams on the text we plan to search. Use the ingest transformation below:
SELECT
    TOKENIZE(title, 'en_US') AS title_tokens, -- tokenizing the title
    NGRAMS(LOWER(title), 3) AS title_ngrams, -- creating ngrams of the title
    TOKENIZE(description, 'en_US') AS description_tokens, -- tokenizing the description
    title,
    series,
    author,
    TRY_CAST(rating AS float) AS rating,
    description,
    language,
    TRY_CAST(isbn AS integer) AS isbn,
    genres,
    characters,
    bookFormat AS book_format,
    edition,
    TRY_CAST(pages AS int) AS page_count,
    publisher,
    publishDate AS publish_date,
    firstPublishDate AS first_publish_date,
    awards,
    TRY_CAST(numRatings AS int) AS num_ratings,
    ratingsByStars AS ratings_by_stars,
    TRY_CAST(likedPercent AS float) AS liked_percent,
    setting,
    coverImg AS cover_image,
    TRY_CAST(bbeScore AS int) AS bbe_score,
    TRY_CAST(bbeVotes AS int) AS bbe_votes, -- source field is bbeVotes
    TRY_CAST(price AS float) AS price,
    VECTOR_ENFORCE(embedding, 1536, 'float') AS book_embedding
FROM
    _input
WHERE
    title IS NOT NULL
  6. On the next page, enter a workspace name and a collection name. I used workspace=Text-Search and collection=Books.
  7. Finally, click "Create" and wait for the data to ingest into your Rockset collection. This will only take a few minutes.
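
To get a feel for what the ingest transformation stores, here is a rough client-side sketch of tokenization and trigram generation. It assumes that NGRAMS(text, 3) produces overlapping character trigrams and that tokenization lowercases and splits on non-alphanumeric characters; Rockset's locale-aware TOKENIZE may behave differently. The helper names charNgrams, simpleTokenize, and sharedNgrams are made up for this illustration.

```javascript
// Approximation only: Rockset's TOKENIZE and NGRAMS may differ in detail.

// Overlapping character n-grams of the lowercased text (assumed behavior).
function charNgrams(text, n) {
    const chars = Array.from(text.toLowerCase());
    const grams = [];
    for (let i = 0; i + n <= chars.length; i++) {
        grams.push(chars.slice(i, i + n).join(""));
    }
    return grams;
}

// Naive tokenizer: lowercase, split on anything that is not a-z or 0-9.
function simpleTokenize(text) {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// N-grams that two strings have in common, in the order they appear in `a`.
function sharedNgrams(a, b, n) {
    const inB = new Set(charNgrams(b, n));
    return [...new Set(charNgrams(a, n))].filter((g) => inB.has(g));
}

console.log(charNgrams("Dune", 3));              // [ 'dun', 'une' ]
console.log(simpleTokenize("The Hobbit"));       // [ 'the', 'hobbit' ]
console.log(sharedNgrams("hobit", "hobbit", 3)); // [ 'hob', 'bit' ]
```

The last line hints at why ngrams help with spelling mistakes: the misspelled "hobit" shares no whole token with "hobbit", but it still shares two trigrams.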



Step 2 (only for Semantic Search): Build an IVF Index

To run semantic search on the embeddings in the public dataset, we need to build a special IVF index. This can be done with the following query:

CREATE
	SIMILARITY INDEX text_search_book_embed
ON
	FIELD "Text-Search".Books:book_embedding DIMENSION 1536 AS 'faiss::IVF256,Flat';

Run the query below to check the status of the index. Proceed when the status is Ready.

SELECT
    index_status, *
FROM
    _system.similarity_index
WHERE
    name = 'text_search_book_embed'

For more information, check out Rockset's Vector Search documentation.



Step 3: Create 3 Query Lambdas

Rockset's patented Query Lambdas are named, parameterized SQL queries stored in Rockset that can be executed from a dedicated REST endpoint. In the Query Editor in Rockset, save the following SQL queries as Query Lambdas. We will call them later from our webpage. Each Query Lambda below requires a parameter named search_query of type string.

Save the following query as a Query Lambda named searchTitles under the Text-Search workspace:

script {{{
export function levenshteinDistance(str1, str2) {
    // https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows
    const len1 = str1.length;
    const len2 = str2.length;

    let prevRow = Array.from({ length: len2 + 1 }, (_, i) => i);

    for (let i = 1; i <= len1; i++) {
        let currentRow = [i];
        for (let j = 1; j <= len2; j++) {
            const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
            currentRow[j] = Math.min(
                prevRow[j] + 1,        // deletion
                currentRow[j - 1] + 1, // insertion
                prevRow[j - 1] + cost  // substitution
            );
        }
        prevRow = currentRow;
    }
    return prevRow[len2];
}
}}}

(
    SELECT
        title,
        'exact' as score,
        'exact' as distance,
        'exact' as hybrid_score,
        num_ratings
    FROM
        "Text-Search".Books
    WHERE
        LOWER(title) LIKE CONCAT(LOWER(:search_query), '%')
    ORDER BY
        num_ratings desc
    LIMIT
        10
)
UNION
(
    SELECT
        title,
        score() as score,
        _script.levenshteinDistance(title, :search_query) as distance,
        score() - 0.05 * _script.levenshteinDistance(title, :search_query) as hybrid_score,
        num_ratings
    FROM
        "Text-Search".Books
    WHERE
        search(
                CONTAINS(title_tokens, :search_query),
                BOOST(
                    0.5,
                    CONTAINS(
                        title_ngrams,
                        ARRAY_JOIN(NGRAMS(LOWER(:search_query), 3), ' ')
                    )
                )
            ) OPTION(match_all = false)
    ORDER BY
        hybrid_score desc, num_ratings desc
    LIMIT
        10
)
ORDER BY
    hybrid_score desc, num_ratings desc
LIMIT
    10
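
The ranking in the second half of searchTitles can be tried out locally. The sketch below reuses the Levenshtein function from the script block and applies the same formula as the hybrid_score column (search score minus 0.05 per edit). The candidate titles and their score values are made up for illustration; in the real query they come from score().

```javascript
// Same two-row algorithm as the UDF in the searchTitles script block.
function levenshteinDistance(str1, str2) {
    const len1 = str1.length;
    const len2 = str2.length;
    let prevRow = Array.from({ length: len2 + 1 }, (_, i) => i);
    for (let i = 1; i <= len1; i++) {
        const currentRow = [i];
        for (let j = 1; j <= len2; j++) {
            const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
            currentRow[j] = Math.min(
                prevRow[j] + 1,        // deletion
                currentRow[j - 1] + 1, // insertion
                prevRow[j - 1] + cost  // substitution
            );
        }
        prevRow = currentRow;
    }
    return prevRow[len2];
}

// Mirrors: score() - 0.05 * levenshteinDistance(title, :search_query)
function hybridScore(searchScore, title, query) {
    return searchScore - 0.05 * levenshteinDistance(title, query);
}

// Hypothetical candidates; the score values are NOT real Rockset output.
const query = "harry poter";
const candidates = [
    { title: "harry potter and the goblet of fire", score: 1.0 },
    { title: "harry potter", score: 1.0 },
];

const ranked = candidates
    .map((c) => ({ ...c, hybrid: hybridScore(c.score, c.title, query) }))
    .sort((a, b) => b.hybrid - a.hybrid);

console.log(ranked.map((c) => c.title)); // closest title first
```

With equal search scores, the shorter title wins because it is only one edit away from the query. Note that the distance is case-sensitive as written, so "Dune" and "dune" count as one edit apart; lowercasing both arguments before comparing would remove that effect.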

Save the following query as a Query Lambda named searchKeywords under the Text-Search workspace:

SELECT
    title,
    score() as score,
    num_ratings,
    description
FROM
    "Text-Search".Books
WHERE
    search(
        CONTAINS(title_tokens, :search_query),
        CONTAINS(description_tokens, :search_query)
    ) OPTION(match_all = false)
ORDER BY
    score desc,
    num_ratings desc
LIMIT
    10

Save the following query as a Query Lambda named searchSemantic under the Text-Search workspace:

SELECT
    title,
    APPROX_DOT_PRODUCT(
        JSON_PARSE(:search_query),
        book_embedding
    ) as similarity,
    num_ratings,
    description
FROM
    "Text-Search".Books HINT(access_path=index_similarity_search)
ORDER BY
    similarity DESC
LIMIT
    10
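
Because search_query is a string parameter, the query embedding has to be passed as JSON text, which JSON_PARSE turns back into an array inside the query. The sketch below shows that round trip, an exact dot product for comparison with what APPROX_DOT_PRODUCT approximates, and one possible way to obtain the embedding from OpenAI. The model name text-embedding-ada-002 is an assumption based on the 1536 dimensions used above; the source does not say which model produced the dataset's embeddings. All function names are hypothetical.

```javascript
// Exact dot product of two equal-length vectors. APPROX_DOT_PRODUCT returns
// an approximation of this value computed with the help of the IVF index.
function dotProduct(a, b) {
    if (a.length !== b.length) {
        throw new Error("vectors must have the same length");
    }
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// The embedding travels as a JSON string in the search_query parameter.
function embeddingToParameter(embedding) {
    return JSON.stringify(embedding);
}

// Sketch of fetching an embedding from OpenAI's embeddings endpoint.
// Assumption: the model must match the one used for the stored embeddings.
async function embedQuery(text, openaiApiKey) {
    const response = await fetch("https://api.openai.com/v1/embeddings", {
        method: "POST",
        headers: {
            "Authorization": `Bearer ${openaiApiKey}`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify({ input: text, model: "text-embedding-ada-002" }),
    });
    if (!response.ok) {
        throw new Error(`OpenAI request failed: ${response.status}`);
    }
    const data = await response.json();
    return data.data[0].embedding; // array of numbers
}

// Toy 3-dimensional vectors just to show the arithmetic.
console.log(dotProduct([1, 2, 3], [4, 5, 6]));     // 32
console.log(embeddingToParameter([0.1, -0.2, 3])); // [0.1,-0.2,3]
```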



Step 4: Create an API Key & Grab your Region

Create an API key in the API Keys tab of the Rockset Console. The region can be found in the dropdown menu at the top of the page. For more information, refer to Rockset's API Reference.



Step 5: Update script.js

Before running the webpage, check lines 2-8:

const rocksetApiKey = "YOUR_ROCKSET_API_KEY"; // UPDATE WITH YOUR ROCKSET API KEY
const apiServer = "YOUR_ROCKSET_REGION_URL"; // UPDATE WITH YOUR ROCKSET REGION URL (ex: "https://api.usw2a1.rockset.com")
const qlWorkspace = 'Text-Search'; // UPDATE if not the same
const qlName_titles = 'searchTitles'; // UPDATE if not the same
const qlName_keywords = 'searchKeywords'; // UPDATE if not the same
const qlName_semantic = 'searchSemantic'; // UPDATE if not the same
const openaiApiKey = "YOUR_OPENAI_API_KEY"; // UPDATE WITH YOUR OPENAI API KEY (only for semantic search)
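
For reference, here is a minimal sketch of how these values could be combined into a Query Lambda call. It is not the project's actual script.js code. It assumes Rockset's REST route for executing a Query Lambda by tag, the automatically maintained latest tag, and the ApiKey authorization scheme; check the API Reference for your account before relying on it. The function names are hypothetical.

```javascript
// Builds the URL and fetch options for executing a Query Lambda.
// Assumed route: /v1/orgs/self/ws/{workspace}/lambdas/{name}/tags/{tag}
function buildQueryLambdaRequest(apiServer, apiKey, workspace, lambdaName, searchQuery) {
    const url =
        `${apiServer}/v1/orgs/self/ws/${encodeURIComponent(workspace)}` +
        `/lambdas/${encodeURIComponent(lambdaName)}/tags/latest`;
    return {
        url,
        options: {
            method: "POST",
            headers: {
                "Authorization": `ApiKey ${apiKey}`,
                "Content-Type": "application/json",
            },
            // search_query is the string parameter defined on each Query Lambda.
            body: JSON.stringify({
                parameters: [
                    { name: "search_query", type: "string", value: searchQuery },
                ],
            }),
        },
    };
}

// Sends the request and returns the result rows (assumed to be in `results`).
async function runQueryLambda(apiServer, apiKey, workspace, lambdaName, searchQuery) {
    const { url, options } = buildQueryLambdaRequest(
        apiServer, apiKey, workspace, lambdaName, searchQuery
    );
    const response = await fetch(url, options);
    if (!response.ok) {
        throw new Error(`Query Lambda request failed: ${response.status}`);
    }
    const data = await response.json();
    return data.results;
}

const example = buildQueryLambdaRequest(
    "https://api.usw2a1.rockset.com", "YOUR_ROCKSET_API_KEY",
    "Text-Search", "searchTitles", "harry pot"
);
console.log(example.url);
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and body without spending any queries.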



Step 6: Run the .html file and start searching!

Now you're ready to search!



Other Considerations

  • Rockset's newer EDIT_DISTANCE function can be used in place of the Levenshtein UDF. Check out the official Rockset documentation for more details.
  • Consider adding metadata filtering to your queries by updating the WHERE clause. For example, we can update the searchSemantic Query Lambda to add filtering on price and publisher name:
SELECT
    title,
    APPROX_DOT_PRODUCT(
        JSON_PARSE(:search_query),
        book_embedding
    ) as similarity,
    num_ratings,
    description
FROM
    "Text-Search".Books HINT(access_path=index_similarity_search)
WHERE
    price BETWEEN 5 AND 25
    AND publisher IN ('Scholastic Inc.', 'Pocket', 'Wildside Press')
ORDER BY
    similarity DESC
LIMIT
    10
