blog's Introduction

👋 Hi there, I'm Clément (a.k.a. kero).

I'm one of the co-founders of Meilisearch, as well as the Tech Lead.

Currently I live in Paris, but I was raised in Bordeaux, a city in France known for its wine production.

I've been interested in computer science since my childhood. At 15 or so I got interested in hacking, and more specifically in RAT (Remote Access Tool) malware. I had fun infecting computers and stealing the passwords of my friends, but shhhh, don't tell them. Around 17 I built my first website, CartelAmple, with Laravel and jQuery: a website where users had their own public page exposing things they liked.

I moved to Paris and started École 42 after high school, in 2013. It was my first time doing backend programming, I mean working with C and really understanding what was happening in my terminal. I really liked it.

My final internship was at Veepee, a major French e-commerce company, where I met Quentin & Thomas, and where we first started working on the project that would later become Meili.

And that's it! That's what you should know about me if we're going to work together, I think.

Oh, and also: I'm a huge Apple fanboy, and love video games.

blog's Issues

Using Rust and the GitHub CI for my blog

This blog post is about how I created my first-ever blog website, and you are currently reading its first post. It is hosted on GitHub Pages and uses the GitHub CI extensively.

Before creating this website by hand, I considered the Hey.com blog hosting service. Its one advantage is that getting your thoughts online doesn't take a week the first time: sending an email publishes your post.

However, there are reasons why I didn't choose world.hey.com for my blog: I don't want to be locked into the platform, and I need code highlighting to be supported.

fn main() {
    println!("That's so much better than world.hey.com...");
}

So, I decided to take inspiration from Utterances and use GitHub issues to hold the articles of my blog. The first message of any issue labeled as article is displayed on the blog. Note that since I am the only one with triage rights, I can label any issue as an article, even someone else's, and it will be correctly displayed on my blog with the right profile picture and name. Whenever an issue is edited, GitHub triggers the CI and regenerates the static HTML pages. The GitHub iOS app is excellent for editing issues with markdown, adding links, images, code blocks...

It was a great surprise how easy it was to bring this blog to life. The GitHub API is complete and easy to use. As I am building the static page generation in Rust, I used the octocrab library for convenience. The most exciting part is that GitHub exposes a complete HTML rendering of your issues' content, including the code blocks' tokens with the proper syntax highlighting classes. The Starry Night library was quite helpful in generating the colors for these tokens!
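
For illustration, here is roughly what fetching the article issues could look like with octocrab. This is a hedged sketch, not the blog's actual code: the repository path is a placeholder, and the label filtering is done client-side for simplicity.

use octocrab::params::State;

#[tokio::main]
async fn main() -> octocrab::Result<()> {
    // Placeholder repository path.
    let page = octocrab::instance()
        .issues("Kerollmops", "blog")
        .list()
        .state(State::All)
        .per_page(100)
        .send()
        .await?;

    // Keep only the issues labeled as articles.
    for issue in page.items.into_iter().filter(|i| i.labels.iter().any(|l| l.name == "article")) {
        println!("article: {}", issue.title);
    }

    Ok(())
}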

Once I retrieved the issue HTML content, I used the Askama templating library to generate the static pages. Pages are accessible via the snake-case-encoded issue title. But that raised a new issue: whenever I update an article's title, the page link breaks. I read around on the internet and discovered that it is possible to create plain HTML redirection pages. Cool, right?! The GitHub API also exposes the renaming events, so I used them to generate redirection pages from every previous issue title to the current one!
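
For illustration, a redirection page can be as small as a meta-refresh document rendered with Askama. This is a hedged sketch with a made-up template and field names, not the blog's actual template:

use askama::Template;

// A tiny inline template: the browser is redirected to `target` immediately.
#[derive(Template)]
#[template(
    source = r#"<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="refresh" content="0; url={{ target }}">
    <link rel="canonical" href="{{ target }}">
  </head>
</html>"#,
    ext = "html"
)]
struct RedirectPage<'a> {
    target: &'a str,
}

fn main() {
    let page = RedirectPage { target: "/using_rust_and_the_github_ci_for_my_blog" };
    // Write this string to `old_title/index.html` so the old URL keeps working.
    println!("{}", page.render().unwrap());
}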

Now, it's time to make it pretty. I used Bootstrap because I wanted to avoid learning a new CSS tool and because I find it better to use boring technology when it perfectly does the job. By using version 5.3, you even get dark/light mode out of the box. However, Bootstrap is extensive, and I needed to use PurgeCSS to remove more than 224 KB of unused CSS rules. By doing so, I achieved 100% on every performance gauge of the Google PageSpeed website 👌

This blog website is available on GitHub. Host it under your GitHub account and get a pretty blog website of your own: create issues and edit them to see them come alive. There are still some improvements to make, like using the repository description as the meta description for search engines, fixing some issues I have with the images proxied by camo, or replacing the @ references in the author bio text with links...

How Meilisearch Updates a Database of Millions of Vector Embeddings in Under a Minute

This is part 4 of a series of blog posts. You can find part 1, part 2, and part 3 on the blog.
This blog post assumes that you've already read parts 1 and 2.

In this blog post, we’ll talk about how we implemented incremental indexing in Arroy.
By incremental, we mean that when inserting or deleting items in a tree, we can update only the required parts instead of rebuilding everything.
At Meilisearch, that’s essential because our users usually send updates regularly, and their content constantly changes.
Rebuilding everything from scratch is very costly and won't give a good experience; that's why we already spent a lot of time in our latest release making all the data structures support incremental indexing. Arroy being one of them, we HAD TO implement it as well!

The Theory

The following schema shows a high-level view of what should happen when adding and deleting documents into an existing tree.
As a reminder, arroy is a vector store based on LMDB, an embedded transactional memory-mapped key-value store. It stores your vectors in trees composed of three kinds of nodes (roughly sketched in the snippet after this list):

  • The split nodes - Store the information needed to decide whether an item ID belongs to their left or their right side.
  • The descendant nodes - Store multiple item IDs.
  • The item nodes - A single vector each.
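
To make this concrete, here is a rough, illustrative Rust enum. These are not arroy's actual type definitions; the names and fields are assumptions for the sake of the example.

// Illustrative only: real split nodes store a hyperplane normal, and real
// descendant nodes store a RoaringBitmap of item IDs.
enum Node {
    // Knows how to send an item to its left or right child.
    SplitPlaneNormal { normal: Vec<f32>, left: u32, right: u32 },
    // A leaf grouping several item IDs.
    Descendants { items: Vec<u32> },
    // A single user-provided vector.
    Item { vector: Vec<f32> },
}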

For the following examples, we'll consider that a descendant node cannot contain more than two items and that the closer two IDs are, the closer the corresponding items are in space.
For example, item 1 is close to item 2 but far from item 10.

First step before inserting elements in the tree

Here, we’re trying to insert items 2 and 7 and delete items 3 and 11.
Following what we do on the search, we split the items to insert between the left and right split nodes.
But the deleted items are no longer present in the database! We don't have their associated vector, and therefore we'll need to iterate over all the leaves to find and delete them.
Fortunately, since we use roaring bitmaps, this list doesn't consume much memory, and since we never update it, we can share it with every thread without ever copying it 🎉

Second step inserting elements in the tree

Notice how the items to insert were balanced between the left and right split nodes.
In the next step, we’ll follow the same process and split both lists of items to insert on the split node.

Third step inserting elements in the tree

It’s on this step that all the good stuff happens. From the left to the right:

  • The first descendant node gets its item 3 removed, and item 2 added.
  • The second descendant node gets too big and must be replaced by a new split node.
  • The item 8 becomes a descendant containing both the items 7 and 8.
  • After deleting item 10 in the last descendant node, we replace it with a direct link to an item to reduce the number of nodes in the tree (and improve search time by reducing the number of lookups).

Last step inserting elements in the tree

Now that you've seen all the steps of our incremental indexing process, let's take note of a few things that happened:

  • In the first three steps, we had to read all of the tree nodes, the number of which is, by default, roughly equal to the total number of items in the database.
  • In the last step, we had to write only four new tree nodes, whereas previously we would have rewritten the whole tree. The number of writes, which are the slowest operations in databases, is reduced to the bare minimum.
  • The process works on a single tree at a time, which means we can still multi-thread the computation across trees.
  • The IDs aren’t sequential anymore, which doesn't play well with the ID generation strategy described in Writing the tree in parallel.

How Did We Fix the ID Generation?

At the start of the indexing process, we need to gather all the existing tree IDs.
The information we need to generate new nodes is:

  • The total number of tree nodes used to know when to stop building trees.
  • The « holes » in the used IDs that we want to re-use. This fragmentation is produced when we edit the trees.
  • The highest ID existing in the trees to generate the next new IDs.

Selecting the next ID in the Roaring Bitmap

The « simple » idea is to have a counter for the total number of tree nodes: a counter that we can increment to generate fresh new IDs. The most challenging part is picking an available ID shared between threads without big mutexes. It took me quite some time to find a solution for this last point; I implemented a whole complex solution (150 LOC) that I ended up rewriting entirely from scratch, in one hour, into an easier and more efficient one. We're going to see all the ideas we went through before reaching the final solution.

Sharing an Iterator Over the Available IDs

The first and most straightforward idea that comes to mind is to create and share an iterator over all the available IDs.

let mut available_ids = available_ids.into_iter();

roots.into_par_iter().for_each(|root| {
    // stuff
    let next_id = available_ids.next();
    // stuff
});

If you're familiar with Rust, you'll immediately notice that the for_each closure cannot capture a mutable reference to available_ids, because we use the into_par_iter method from rayon, which runs the closure on multiple threads at the same time. This means we can't call .next() on a shared iterator.
By using a mutex (short for mutual exclusion), we can make it compile:

let available_ids = Mutex::new(available_ids.into_iter());

roots.into_par_iter().for_each(|root| {
    // stuff
    let next_id = available_ids.lock().unwrap().next();
    // stuff
});

That is safe and will yield the right results, but won’t scale well as all the threads have to wait for each other on the lock.

Let Each Thread Own Its List of IDs

When you need to remove a lock, a common solution is to split the work beforehand so each thread can work to its full capacity without ever having to know what’s happening on the other threads. That’s the solution I implemented initially. The idea is to explode the roaring bitmap into as many smaller roaring bitmaps as there are trees to update.

The following function can split a roaring bitmap into n sub-bitmaps of equal size:

pub fn split_ids_in(available: RoaringBitmap, n: usize) -> Vec<RoaringBitmap> {
    // Define the number of elements per sub-bitmap
    let chunk_size = available.len() / n as u64;
    let mut ret = Vec::new();

    let mut iter = available.into_iter();
    for _ in 0..(n - 1) {
        // Create a roaring bitmap of `chunk_size` elements from the iterator without consuming it
        let available: RoaringBitmap = iter.by_ref().take(chunk_size as usize).collect();
        ret.push(available);
    }
    // The last sub-bitmap contains everything remaining
    ret.push(iter.collect());

    ret
}

With this function available, it's then trivial to use one bitmap per tree to update with the zip method on the parallel iterators of Rayon:

let available_ids = split_ids_in(available_ids, roots.len());

roots.into_par_iter().zip(available_ids).for_each(|(root, available_ids)| {
    let mut available_ids = available_ids.into_iter();
    // stuff
    let next_id = available_ids.next();
    // Here, we can use the available_ids without any locks or inter-thread synchronization
    // stuff
});

This works well while we're updating the already existing root trees.

But later on, we may want to create new trees from scratch, and we can't know in advance how many new trees we'll need to create. That's an issue because, without this information, we don't know how many sub-bitmaps to create. At this point, I couldn't see an easy fix to this problem, but I assumed that we rarely create a lot of new trees and that new trees use a lot of IDs. This means giving all the remaining IDs to the first new tree would probably work out (though it's hard to be 100% sure without monitoring it in production).

The Final Solution

The previous solution was complex and did not even work perfectly.
While I was scrolling through the documentation of Roaring Bitmaps, I saw the select method and immediately understood how I could make the initial approach work.
This method lets you efficiently get the values of your bitmap by index (a runnable version follows the list).
For example, with the following values in a bitmap: 13, 14, 15, 98, 99, 100:

  • select(0) returns 13.
  • select(3) returns 98.
  • select(5) returns 100.
  • select(6) returns None.
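
The same behavior, as a tiny runnable snippet using the roaring crate:

use roaring::RoaringBitmap;

fn main() {
    let bitmap: RoaringBitmap = [13, 14, 15, 98, 99, 100].into_iter().collect();

    // select(n) returns the nth smallest value, or None past the end.
    assert_eq!(bitmap.select(0), Some(13));
    assert_eq!(bitmap.select(3), Some(98));
    assert_eq!(bitmap.select(5), Some(100));
    assert_eq!(bitmap.select(6), None);
}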

With that in mind, if we share the bitmap with all threads in read-only and share the index in the bitmap as an atomic number, we can have multiple threads get available IDs at the same time with lock-free synchronization. Once the select method returns None, it means we can stop picking IDs from the bitmap and instead simply generate new IDs from scratch with the old method.
Even if it's super fast, calling the select method still takes some time, so once a thread gets a None returned from the method, we're going to update an atomic boolean telling us if there are still values to look at in the bitmap. If not, we can skip calling select and directly generate a new ID.

With that in mind, the cool ConcurrentNodeIds structure we showed you before in part 2 got a little bit more complex, but it still lets us generate IDs fairly, without any locks!

/// A concurrent ID generator that will never return the same ID twice.
pub struct ConcurrentNodeIds {
    /// The current tree node ID we should use if no other IDs are available.
    current: AtomicU32,
    /// The total number of tree node IDs used.
    used: AtomicU64,

    /// A list of IDs to exhaust before picking IDs from `current`.
    available: RoaringBitmap,
    /// The current Nth ID to select in the bitmap.
    select_in_bitmap: AtomicU32,
    /// Tells if you should look in the roaring bitmap or if all the IDs are already exhausted.
    look_into_bitmap: AtomicBool,
}

impl ConcurrentNodeIds {
    /// Creates an ID generator returning unique IDs, avoiding the specified used IDs.
    pub fn new(used: RoaringBitmap) -> ConcurrentNodeIds {
        let last_id = used.max().map_or(0, |id| id + 1);
        let used_ids = used.len();
        let available = RoaringBitmap::from_sorted_iter(0..last_id).unwrap() - used;

        ConcurrentNodeIds {
            current: AtomicU32::new(last_id),
            used: AtomicU64::new(used_ids),
            select_in_bitmap: AtomicU32::new(0),
            look_into_bitmap: AtomicBool::new(!available.is_empty()),
            available,
        }
    }

    /// Returns a new unique ID and increases the count of IDs used.
    pub fn next(&self) -> Result<u32> {
        // Count every ID we hand out, as the doc comment above promises.
        self.used.fetch_add(1, Ordering::Relaxed);
        if self.look_into_bitmap.load(Ordering::Relaxed) {
            let current = self.select_in_bitmap.fetch_add(1, Ordering::Relaxed);
            match self.available.select(current) {
                Some(id) => Ok(id),
                None => {
                    self.look_into_bitmap.store(false, Ordering::Relaxed);
                    Ok(self.current.fetch_add(1, Ordering::Relaxed))
                }
            }
        } else {
            Ok(self.current.fetch_add(1, Ordering::Relaxed))
        }
    }
}
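
For a quick sense of how it is used, here is a hedged sketch in the spirit of the earlier snippets (the surrounding tree-building code is omitted and the variable names are hypothetical):

// Gather the IDs already used by the existing trees, then share a single
// generator with all the Rayon worker threads: no mutex, only atomics.
let node_ids = ConcurrentNodeIds::new(used_tree_ids);

roots.into_par_iter().for_each(|root| {
    // Reuses a hole from the bitmap while some are left, then counts up.
    let next_id = node_ids.next().unwrap();
    // ... write the new tree node under `next_id` ...
});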

Performance in Practice

I haven’t run a lot of benchmarks yet (I would like to, though, and may update this post later), but on my benchmarks with a few hundred thousand items, here are the results I get:

Performance improvements shown on a graph

  • On average, we get more than a 10x speedup after three batches of items indexed.
  • The little bumps we observe afterward that make us go from ~700ms to ~3s are due to the creation of new trees as the number of items increases.
  • I expect this benchmark to represent a worst-case scenario:
    • All my item IDs are randomly generated, which means there are very few duplicates/updates but always new insertions and more writing.
    • My vectors are also randomly generated with a uniform distribution, which means there shouldn’t be clusters as we observe in a real-life dataset.

This feature is now used on production data with millions of documents, and we've seen numbers in the range of 9,000 items added to a 2M-item database in under a minute.

Fillit in C compared with the Rust version

The Game Rules

The Input

The program takes a list of at most twenty-six Tetriminos. Each Tetrimino is described on four lines, and an empty line separates consecutive Tetriminos.

A Tetrimino description must respect the following rules:

  • Exactly four lines of four characters followed by newlines
  • A Tetrimino is a classic Tetris piece composed of four blocks
  • Each character must be either a # when the tile corresponds to one of the four blocks or a . when the tile is empty
  • Each block must touch another one with at least one of its edges.

Here are examples of valid Tetriminos:

....  ....  ####  ....  .##.  ....  .#..  ....  ....
..##  ....  ....  ....  ..##  .##.  ###.  ##..  .##.
..#.  ..##  ....  ##..  ....  ##..  ....  #...  ..#.
..#.  ..##  ....  ##..  ....  ....  ....  #...  ..#.

Here are examples of invalid Tetriminos:

####  ...#  ##...  #.  ....  ..##  ####  ,,,,  .HH.
...#  ..#.  ##...  ##  ....  ....  ####  ####  HH..
....  .#..  ....   #.  ....  ....  ####  ,,,,  ....
....  #...  ....       ....  ##..  ####  ,,,,  ....

As each Tetrimino occupies only four of the sixteen available squares, it's possible to describe the same Tetrimino in several ways. However, for this game, a rotated Tetrimino is considered a different Tetrimino from the original. This means no rotation is allowed when arranging a Tetrimino with the others.

Those four Tetriminos are, therefore, completely distinct:

##..  .###  ....  ....  ....
#...  ...#  ...#  ....  .##.
#...  ....  ...#  #...  .##.
....  ....  ..##  ###.  ....
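
To make the format concrete, here is a small parsing sketch in Rust (not the original project code; the function name is made up). It reads the four lines of a piece into a u16 bitmask, which is also handy for the bitfield tricks mentioned later, and rejects pieces that break the rules above, treating the touching rule as requiring the four blocks to form one connected piece:

fn parse_tetrimino(lines: [&str; 4]) -> Option<u16> {
    // Bit i * 4 + j is set when line i, column j is a '#'.
    let mut mask: u16 = 0;
    for (i, line) in lines.iter().enumerate() {
        if line.len() != 4 {
            return None;
        }
        for (j, c) in line.chars().enumerate() {
            match c {
                '#' => mask |= 1 << (i * 4 + j),
                '.' => {}
                _ => return None,
            }
        }
    }
    // Exactly four blocks.
    if mask.count_ones() != 4 {
        return None;
    }
    // Count the touching edges between blocks; four blocks on a grid form one
    // connected piece if and only if there are at least three such edges.
    let mut edges = 0;
    for i in 0..4 {
        for j in 0..4 {
            if mask & (1 << (i * 4 + j)) != 0 {
                if j < 3 && mask & (1 << (i * 4 + j + 1)) != 0 { edges += 1; }
                if i < 3 && mask & (1 << ((i + 1) * 4 + j)) != 0 { edges += 1; }
            }
        }
    }
    if edges >= 3 { Some(mask) } else { None }
}

fn main() {
    // First valid example above, and first invalid one (five blocks).
    assert!(parse_tetrimino(["....", "..##", "..#.", "..#."]).is_some());
    assert!(parse_tetrimino(["####", "...#", "....", "...."]).is_none());
}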

Find the Smallest Square

This game aims to arrange the Tetriminos to form the smallest possible square. Tetriminos are placed in the order in which they appear in the file. Of the various possible solutions for achieving the smallest square, the one where each Tetrimino is placed as far up and as far to the left as possible will be chosen.

Your program must display the smallest solution square on the standard output. To identify each Tetrimino in the solution square, you will assign an uppercase letter (starting with A) to that Tetrimino in the order they appear in the description file.

Here is a link to a valid input file and to the only expected solution. You can find a lot of different input files along with the associated solutions on the fastest program's repository.


Context

  • One of my last performance projects at 42
  • A challenge with more than 300 participants
  • Staff team participating

Code in Rust (still compiles after all this time).
Code in C (no longer compiles after all this time).

Technical

  • Introduce the subject and rules (backtracking)
  • Explain man(2) function limitations
  • Using bitfields
  • Computing the smallest map sqrt(x)
  • Difficulty of the different maps and pieces

Algorithm tricks

  • Always writing a piece after the last position of the same piece
  • Recursively testing, storing, and writing the output
  • Using chars in a bit map

Conclusion

  • Best Results in C
  • Best Results in Rust

Talk About Other Ideas

  • Using GPU (I tried, but it was bad, too many draw calls)
  • Multithreading
  • SIMD

Asking Questions

  • Do you have better ideas?
  • Why is C slower? Why is code generation bad for C?
  • Can you beat me?

Meilisearch Expands Search Power with Arroy's Filtered Disk ANN

This is part 3 of a series of blog posts. You can find part 1 and part 2 on my blog.

We can store vectors in arroy and efficiently compute the search tree nodes, but we still need some features to make it usable in Meilisearch. Our full-text search engine has great support for filtering; selecting subsets of documents is an essential feature to support. For instance, one of our biggest clients needs to filter the metadata of over 100 million YouTube videos and their associated image embeddings to select videos released within specific time frames, such as a single day or week. This represents the scalability and responsiveness challenges we aim to address with our filtering system, making it a perfect use case to tailor our developments.

Meilisearch supports the following operators on the filterable attributes of your documents: <, <=, =, !=, >=, and >. Internally, we extensively use RoaringBitmaps, which are well-optimized sets of integers that support fast binary operations like unions and intersections. When the engine receives a user request with filters, it first computes the subset of documents matching the filter, which is then fed to the search algorithms, the ones that rank documents by quality. This subset is represented by a RoaringBitmap.
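
As a rough illustration of why RoaringBitmaps fit this job (illustrative numbers, not Meilisearch's actual internals), a range filter boils down to a fast set intersection:

use roaring::RoaringBitmap;

fn main() {
    // Hypothetical postings: documents released after the start of the time frame,
    // and documents released before its end.
    let released_after_start: RoaringBitmap = (100..10_000).collect();
    let released_before_end: RoaringBitmap = (0..5_000).collect();

    // The subset matching `start <= release_date < end` is a single intersection.
    let candidates = &released_after_start & &released_before_end;
    assert_eq!(candidates.len(), 4_900);
}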

How was it Working Before Arroy?

Ranking only the subset of filtered documents has been working well since 2018, but now that we have a new data structure to search on, we need to look at how to implement it there. The engine had already supported the vector store feature for months, but it was inefficient: we were using an in-memory HNSW, deserializing the whole data structure into memory, then searching for the nearest neighbors of the target vector, which returned an iterator.

fn vector_store(db: Database, subset: &RoaringBitmap, target_vec: &[f32]) -> Vec<DocumentID> {
    // This takes a lot of time and memory.
    let hnsw = db.deserialize_hnsw();

    let mut output = Vec::new();
    for (vec_id, _vec, _dist) in hnsw.nearest_neighbors(target_vec) {
        let doc_id = db.document_associated_to_vec(vec_id).unwrap();
        if !output.contains(&doc_id) {
            output.push(doc_id);
            if output.len() == 20 { break }
        }
    }

    output
}

You may wonder why we are retrieving the document ID associated with the vector ID: Meilisearch has supported multiple vectors per document since the beginning of the vector store feature. This is unfortunate because we must do a lookup for every vector we iterate on, and we need to maintain this lookup table. The iterator can also end up iterating over the whole vector dataset when the filtered subset is small, e.g., document.user_id = 32. We want the document operations to be atomic and consistent, so we would have to store the HNSW on disk and avoid maintaining synchronization between the LMDB transactions and this separate vector store data structure. Oh! And the library doesn't support incremental insertion: we must reconstruct the HNSW in memory from scratch every time we insert a single vector.

Integrating Arroy in Meilisearch

As we worked on updating Meilisearch to include the new vector store, arroy, we tried out mob programming for the first time. This is where we all code together at the same time. It might sound like it would slow us down, but it actually made us super productive! By tackling problems together right as they came up, we got arroy fitted into Meilisearch much faster than if we'd worked alone.

Arroy is different because it doesn’t return an iterator to give back search results. Now, our search engine is smarter and can figure out the exact number of results it needs to give back, even when there are filters to consider. This teamwork improved our search tool and showed us that working together is key when facing big tech challenges.

Filtering While Searching in Arroy

You can find a description of the arroy internal data structure in part 1 of this series. Here is a list of the different kinds of nodes you can find:

  • The item nodes. The original vectors that the user provides. The small points on the left.
  • The normal nodes, also called split planes. They represent hyperplanes that split subsets of item nodes in half.
  • The descendant nodes. They are the tree leaves composed of the item IDs you'll find if you follow a certain normal node path.

[Figure: split-plane combined schema]

A typical search first loads all the trees' root normal nodes into a binary heap, associated with an infinite distance. Remember that there are a lot of randomly generated trees and, therefore, a lot of entry points. The binary heap is ordered by ascending distance; the closest nodes found are popped first.

In the search algorithm, we pop the nearest item from this heap. If it's a normal node, we fetch the left and right sides of the hyperplane:

  • If we find normal nodes again, we push them back into the queue, associated with the positive and negative distances of the targeted/queried vector from the hyperplane.
  • If we find descendant nodes, we do not push them into the queue but rather directly add them to the potential output list as they represent the nearest vectors we have found.

You probably see where I'm going; this is where the magic happens. We modified arroy to store the list of descendants in RoaringBitmaps instead of a raw list of integers. This is another improvement over the original Spotify library, as those lists weigh less. Doing an intersection with the filtered subset is now trivial, as sketched below.
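
Here is a sketch of the idea with hypothetical names: when the search reaches a descendant node, its bitmap of item IDs is intersected with the pre-filtered candidates before anything is added to the output.

use roaring::RoaringBitmap;

// Both sets are compressed bitmaps, so this intersection stays cheap even when
// the descendants list or the filtered subset is large.
fn keep_allowed(descendants: &RoaringBitmap, candidates: &RoaringBitmap) -> RoaringBitmap {
    descendants & candidates
}

fn main() {
    let descendants: RoaringBitmap = [3, 7, 8, 10].into_iter().collect();
    let candidates: RoaringBitmap = [7, 10, 42].into_iter().collect();
    assert_eq!(keep_allowed(&descendants, &candidates).len(), 2);
}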

However, there is still an issue: the vector IDs are not the document IDs, and after executing the filters, Meilisearch only knows about documents. Iterating over the lookup table I mentioned before to construct the final bitmap of all the vector IDs corresponding to the filtered documents would not be efficient when many documents are part of the subset, e.g., document.user_id != 32. I do not recommend using an O(n) algorithm in the search functions.

Using Multiple Indexes for Efficient Filtering

Fortunately, we had developed a fun arroy feature that wasn't meant to be used this way: support for multiple indexes in a single LMDB database. We originally developed the multiple indexes feature so that we only had to open a single LMDB database to store different vector types. Yes! In Meilisearch v1.6, you can describe the different embedders that live in a single index, i.e., vectors with different numbers of dimensions and distance functions stored in a single document. Defining multiple vectors for a single document associated with the same embedder is also possible.

The indexes are identified by a u16. This feature lets us make the algorithm even more efficient than the previous HNSW solution. Using one store per embedder and per vector of a document is interesting because we can now identify the vectors using the document ID: no more lookup database and vector IDs. The vector IDs become the document IDs, so we can use the output of the filters directly to filter the arroy indexes.

The search algorithm is different on the Meilisearch side. We request the nearest neighbors from every arroy index, sort the results by document ID so we can deduplicate them and not return the same document multiple times, sort them again by distance, and then return only the top twenty documents. It may seem complex, but we are talking about twenty documents multiplied by the number of vectors in a single document. Usually, users will have a single vector.

fn vector_search(
    rtxn: &RoTxn,
    database: Database,
    embedder_index: u8,
    limit: usize,
    candidates: &RoaringBitmap,
    target_vector: &[f32],
) -> Vec<(DocumentId, f32)> {
    // The index represents the embedder index shifted and
    // is later combined with the arroy index. There is one arroy
    // index per vector of an embedder in a document.
    let index = (embedder_index as u16) << 8;
    let readers: Vec<_> = (0..=u8::MAX)
        .map(|k| index | (k as u16))
        // Stop at the first arroy index that doesn't exist.
        .map_while(|index| arroy::Reader::open(rtxn, index, database).ok())
        .collect();

    let mut results = Vec::new();
    for reader in &readers {
        let nns = reader.nns_by_vector(rtxn, target_vector, limit, None, Some(candidates)).unwrap();
        results.extend(nns);
    }

    // Documents can have multiple vectors. We store the different vectors
    // in different arroy indexes, so we must make sure we don't return several
    // nearest-neighbor vectors that correspond to the same document.
    results.sort_unstable_by_key(|(doc_id, _)| *doc_id);
    results.dedup_by_key(|(doc_id, _)| *doc_id);

    // Sort the documents back by distance and keep only the requested number.
    results.sort_unstable_by(|(_, a), (_, b)| a.total_cmp(b));
    results.truncate(limit);
    results
}

The Design Philosophy Behind These Improvements

We're trying to make Meilisearch flexible enough for everyone's needs. Whether your documents are powered by a single embedding, like the kind OpenAI's nifty tools generate, or mixing it up with text and image embeddings, as many e-commerce sites do, we want your search experience to be seamless. We haven't forgotten you for the power users working with multiple embeddings from multi-modal embedders. Our latest improvements ensure that everyone gets to update and fine-tune their search indexes incrementally, making the whole process as smooth as butter.

Big thanks to everyone: Wrapping this up, a huge shout-out to the squad – @dureuill, @irevoire, and @ManyTheFish – for their genius and grunt work that brought our ideas to life. And watch out for @irevoire's upcoming article, where he'll explain how we've implemented incremental indexing in Arroy—meaning you can add new vectors without rebuilding everything from scratch. More on that soon!

You can comment about this article on Lobste.rs, Hacker News, the Rust Subreddit, or X (formerly Twitter).

Spotify-Inspired: Elevating Meilisearch with Hybrid Search and Rust

This is part 1 of a series of blog posts. You can find part 2 and part 3 on my blog.

Uniting Keyword and Semantic Search

At Meilisearch, we are working hard on hybrid search, mixing the widespread keyword search algorithm with the not-so-common semantic search. The former is very good at exact matching, while the latter is better at finding things you can only describe.

I will explain why we even talk about Meilisearch/arroy and why it is important to us. Semantic search was something new to us. The principle is quite simple: documents are associated with vectors (lists of floats), and the closer the vectors are to each other, the closer the documents are semantically. When a user queries the engine, you compute the embedding (vector) of the query and do an approximate nearest neighbor search to get the n closest items to the user query. Fun fact: after all this time, Annoy is still among the top competitors, so imagine when we finally release Arroy.

Seems simple, right? Do you really think this series of blog posts will talk about a dead simple subject? Join us on a journey as @irevoire and I (@Kerollmops), with the assistance of @dureuill, work on porting this cutting-edge library to Rust.

Optimizing Storage and Search for High-Dimension Embeddings

The embeddings have somewhere between 768 and 1536 dimensions. We have customers at Meilisearch who want to store over 100M of them. Embeddings that were not reduced by any dimensionality reduction algorithm use 32-bit floats, which means storing this data will take between 286 GiB and 572 GiB, depending on the number of dimensions.

Yes, it fits in RAM, but at what cost? Storing on disk is much cheaper. Oh! And that is only the raw vectors. I assure you that doing an approximate nearest neighbors search among all vectors in O(n) would be quite slow. So we decided to store them on disk. We looked at many different solutions before choosing Spotify/Annoy. Initially, we used the HNSW from the instant-distance Rust crate, another data structure, to do ANN (approximate nearest neighbors) searches. However, it is not disk-based; everything lives in memory, which is impractical.

Spotify's Hyperplane Trees for Efficient ANNs

Spotify worked on a cool C++ library to search in enormous sets of vectors. The algorithm generates multiple trees that refer to the vectors. The tree nodes are random hyperplanes that split subsets of vectors in half: the root node splits the whole set of vectors in half, and the left and right branches recursively split their subsets again until the subsets are small enough. When we perform an ANN search, we go through the root of every tree and decide whether to go to the left or the right of each hyperplane, depending on which side of it the queried vector falls. The advantage is that every node and vector is directly stored in a memory-mapped file.

[Figure: split-plane combined schema]

At the end of a tree, we find the descendant nodes. Those define the final list of leaf vectors that ended up on this side of the recursive split nodes above; it is a list of the unsigned integer IDs the user provides. Annoy represents them as slices of u32s, but we decided to go with RoaringBitmaps to reduce their size. Annoy cannot compress them because of a fun constraint: all nodes, user leaf vectors, split nodes, and descendants, must have the same size to be accessible using offsets on disk.

The above image shows how a split plane can be represented. A hyperplane splits a subset of points, here in two dimensions, but imagine that with 768 to 1536 dimensions. Each hyperplane creates two subsets of points that are recursively split by other hyperplanes. Once the number of points to split is small enough, we create a descendant node containing the item IDs corresponding to these points. Furthermore, the embeddings of the points are never duplicated in memory; we refer to them by their IDs.
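
Here is a toy version of that split rule, just for intuition: a plain dot product with the hyperplane's normal decides the side. The real code uses SIMD distance functions and stores more information per split node.

// Which side of the hyperplane does this vector fall on?
// The hyperplane is described by its normal; the sign of the dot product decides.
fn side(normal: &[f32], vector: &[f32]) -> bool {
    let dot: f32 = normal.iter().zip(vector).map(|(n, v)| n * v).sum();
    dot >= 0.0
}

fn main() {
    let normal = [1.0, 0.0];
    assert!(side(&normal, &[0.5, 2.0]));   // right side of the vertical hyperplane
    assert!(!side(&normal, &[-0.5, 2.0])); // left side
}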

Adapting Annoy to LMDB

So, if it works so well, why must we port it to Rust? One, because I follow the Rewrite it in Rust dogma 😛; two, because this is a C++ library with a lot of horrifying hacks and undefined behaviors; and three, because Meilisearch is based on LMDB, an atomic, transaction-based, memory-mapped key-value store. Moreover, the Annoy maintainers have wanted to use LMDB since 2015 but have yet to achieve it; it requires a lot of time to change the data structures accordingly.

LMDB uses a BTree to order the entries, so it takes more space for the intermediate data structures than Annoy, which directly identifies vectors using their offset in the file. Another advantage of using this key-value store is managing incremental updates: it is easier to insert and remove vectors. Suppose you insert a vector identified by a high 32-bit integer in Annoy. In that case, it will allocate a lot of space to store it at the dedicated offset, increasing the file size if necessary.

In Meilisearch and Arroy, we use heed, a typed LMDB Rust wrapper, to avoid undefined behavior, bugs, and stuff related to C/C++ libraries. We, therefore, use &mut and & extensively to be sure that we are not modifying the key-value store entries while keeping references inside it, and we make sure that the read-write transactions can't be referenced or sent between threads. But that will be for another story.

We first thought using this key-value store would make Arroy slower than Annoy. However, LMDB writes its pages in memory before writing them to disk, which seems much faster than directly writing into mutable memory-mapped areas. On the other hand, LMDB doesn't guarantee the memory alignment of the values, unlike Annoy's file format; we'll talk about that later.

Optimizing Vector Handling with SIMD

The most CPU-intensive task that Arroy was supposed to do was to try to split clouds of vectors in half. This requires computing a massive number of distances in hot loops. But we read the embeddings from a memory-mapped disk. What could go wrong?

use std::mem;

use bytemuck::cast_slice_mut;

fn align_floats(unaligned_bytes: &[u8]) -> Vec<f32> {
    // Allocate an aligned buffer and copy the raw bytes into it (this is the memmove).
    let count = unaligned_bytes.len() / mem::size_of::<f32>();
    let mut floats = vec![f32::NAN; count];
    cast_slice_mut(&mut floats).copy_from_slice(unaligned_bytes);
    floats
}

We used Instruments on macOS to profile our program and discovered that a lot of time was spent moving stuff around, i.e., in _platform_memmove. The reason is that reading unaligned floats as a &[f32] is undefined behavior, so we were copying our floats into an aligned memory area, as shown above. Cost: an allocation per read plus a memmove call.

A big memmove

While porting the distance functions from C++ to Rust, we used the Qdrant SIMD-optimized distance functions without modifying them beforehand. Then, being aware of the performance cost of copying unaligned memory, we decided to execute the distance functions directly on the unaligned floats, making sure never to represent them as a &[f32], as that would be undefined behavior. The functions take raw slices of bytes and deal with them using pointers and SIMD functions. We unlocked performance: the distance functions themselves were slower, but this more than compensated for the memmoves and allocations!
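
Here is a scalar sketch of the idea, without the SIMD and assuming little-endian f32 bytes: the bytes are decoded on the fly, so no &[f32] is ever created and no aligned copy is needed.

fn dot_product_unaligned(a: &[u8], b: &[u8]) -> f32 {
    // Decode each 4-byte chunk into an f32 on the fly; no aligned buffer, no memmove.
    a.chunks_exact(4)
        .zip(b.chunks_exact(4))
        .map(|(a, b)| {
            let a = f32::from_le_bytes(a.try_into().unwrap());
            let b = f32::from_le_bytes(b.try_into().unwrap());
            a * b
        })
        .sum()
}

fn main() {
    let a: Vec<u8> = [1.0f32, 2.0, 3.0].iter().flat_map(|f| f.to_le_bytes()).collect();
    let b: Vec<u8> = [4.0f32, 5.0, 6.0].iter().flat_map(|f| f.to_le_bytes()).collect();
    assert_eq!(dot_product_unaligned(&a, &b), 32.0);
}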

No more memmove

Similarly to the memmove calls, we can see that the _platform_memcmp function takes a lot of time here. The reason is that LMDB uses this standard function to compare key bytes and decide whether a key is lexicographically located above or below another key in the tree. It is used whenever we read or write in LMDB. @irevoire drastically reduced the size of the keys and saw an important gain in performance. We also tried using MDB_INTEGERKEY, which makes LMDB compare keys using the computer's endianness, but it was complex to use and did not show significant performance gains.

Upcoming Challenges

While porting this cool algorithm to Rust and LMDB, we were missing one of the most important features: Multithreading the tree building. The main reason for this missing feature is LMDB, which does not support concurrent writes to disk. It is part of the beauty of this library; writing is deterministic. We already know LMDB very well, and we will explain how we leveraged the power of LMDB in part 2 of this series and how we beat the Spotify library.

In addition to matching the current Annoy features, we needed more for Meilisearch. We implemented Microsoft's Filtered-DiskANN feature in Arroy. By providing the subset of item IDs we want to retrieve, we avoid searching the whole trees to fetch the approximate nearest neighbors. We will talk about this in a soon-to-be-released article.

In Meilisearch v1.6, we've optimized performance for updating only document parts, as with vote counts or view numbers. The described single-threaded version of Arroy constructs tree nodes anew for each vector adjustment. @irevoire will explore Arroy's incremental indexing in his next article, a capability not offered by Annoy.

You can comment about this article on Lobste.rs, Hacker News, the Rust Subreddit, or X (formerly Twitter).

Multithreading and Memory-Mapping: Refining ANN Performance with Arroy

This is part 2 of a series of blog posts. You can find part 1 and part 3 on my blog.

Wouldn't it be great to show you how a single-threaded, memory-mapped key-value store can be more efficient than a hand-written memory-mapped solution? I faced issues while porting the Spotify disk-based Approximate Nearest Neighbors library to Rust and, more specifically, to LMDB. Those issues were primarily due to LMDB and memory safety. Here is the story.

To remind you, Arroy is a library that stores embeddings (vectors of floats) on disk. Some data structures are generated on top of those vectors, which look like trees governed by normals used to recursively split the vectors dataset into subdomains. But you can read more about this in part 1 of this series.

How does Annoy Generate the Tree Nodes?

Annoy, the Spotify library, stores two kinds of nodes on disk: the item nodes (the ones containing the embeddings) and the other nodes, which we will call tree nodes in this article. The advantage of doing this is twofold:

  • The program's memory usage is low and optimized by the OS as all the nodes live in a memory-mapped file.
  • The concept is simple: Access a node by using its ID. The library will find its offset on disk by using a simple multiplication.

However, there are downsides to doing that. All nodes, the items with embeddings along with the tree nodes, must have the same size, and if a user wants to identify its embedding using the ID 2^31, the library will grow the file to that ID multiplied by the node size. For example, with vectors of 768 dimensions, storing a single item with the ID 2^31 will generate a file of more than 6.5 TiB.

When generating the final data structure, the tree nodes are all written to disk in the same file as the user items containing the embeddings. The tree-building processes run in parallel, one process per tree, and therefore require a lot of synchronization when defining a newly created node ID and the memory area reserved for it. Most importantly, when the memory-mapped file is too small to accept more nodes, only a single thread must grow the file, so a mutex is used on top of the mutable memory map and around a node_id variable. One interesting property of tree generation is that it only requires the original user items with embeddings.

The Challenges We Encountered when Porting it to LMDB

A fun fact is that it is the first time in a long time that I have written C++, and the first time I asked ChatGPT to code for me because I was not confident in doing C++ and feared falling into some segfault. I needed a small program to deserialize embeddings from stdin and give them to Annoy. The chatbot's code was mostly effective, but it omitted a critical empty vector check, which led to a segmentation fault...

The main obstacle to porting it to LMDB is that writing into this key-value store is single-threaded: it doesn't support concurrent write operations, in order to maintain a correctly balanced BTree. Fun incoming!

Reading the User Item Nodes from Different Threads

We have used LMDB at Meilisearch since the beginning. It is a well-maintained key-value store used in Firefox and maintained by the OpenLDAP developers. It is memory-mapped and does not maintain a user-land cache of the entries in memory; instead, it gives you a pointer to the memory-mapped area on disk. The main advantage is that you can keep a pointer to this area for as long as your read transaction lives. Oh yeah! Because it is a transactional key-value store that supports atomic commits, too!

But tree generation only needs to read the user items, not the generated tree nodes. We previously saw that LMDB gives direct pointers into the memory-mapped file without maintaining any intermediate cache (that could be invalidated). There is another quirk with LMDB: you cannot share a read-only transaction between multiple threads, i.e., RoTxn: !Sync, and you cannot move a write transaction between threads, i.e., RwTxn: !Send + !Sync. Oh! And there is no way to create a read transaction on uncommitted changes. This is an issue because we want to generate the tree data structures in the same transaction that stores the items.

But magic is everywhere, starting with the following small and fun data structure. The principle is to keep the pointers to the internal user items with embeddings in a Vec<*const u8>. Thanks to Rust, we can ensure, at compile time, that the pointers will live long enough by keeping the lifetime in the struct. Getting the &'t RoTxn from the &'t RwTxn through Deref also ensures that we cannot modify the database while references into it are alive, as that would require a &'t mut RwTxn. According to the lead developer of LMDB, it is safe to share those pointers between threads, which is why I implemented Sync for this structure.

pub struct ImmutableItems<'t, D> {
    item_ids: RoaringBitmap,
    constant_length: Option<usize>,
    offsets: Vec<*const u8>,
    _marker: marker::PhantomData<(&'t (), D)>,
}

impl<'t, D: Distance> ImmutableItems<'t, D> {
    pub fn new(rtxn: &'t RoTxn, database: Database<D>, index: u16) -> heed::Result<Self> {
        let mut item_ids = RoaringBitmap::new();
        let mut constant_length = None;
        let mut offsets = Vec::new();

        for result in database.items()? {
            let (item_id, bytes) = result?;
            assert_eq!(*constant_length.get_or_insert(bytes.len()), bytes.len());
            item_ids.push(item_id);
            offsets.push(bytes.as_ptr());
        }

        Ok(ImmutableItems { item_ids, constant_length, offsets, _marker: marker::PhantomData })
    }

    pub fn get(&self, item_id: ItemId) -> heed::Result<Option<Leaf<'t, D>>> {
        // An unknown item must not be mapped to another item's offset.
        if !self.item_ids.contains(item_id) {
            return Ok(None);
        }
        let len = match self.constant_length {
            Some(len) => len,
            None => return Ok(None),
        };
        // rank() is the number of IDs <= item_id, so rank() - 1 is the offset index.
        let ptr = match self.item_ids.rank(item_id).checked_sub(1).and_then(|offset| self.offsets.get(offset as usize)) {
            Some(ptr) => *ptr,
            None => return Ok(None),
        };
        let bytes = unsafe { slice::from_raw_parts(ptr, len) };
        ItemCodec::bytes_decode(bytes).map(Some)
    }
}

unsafe impl<D> Sync for ImmutableItems<'_, D> {}

You have probably also noticed another fun optimization in this simplified version of the data structure: we know that the user-item nodes have a constant length, so I decided to store it only once, halving the size of the offsets vector. Considering that our objective is to store 100M vectors and that this vector lives in memory, we shrunk its size from 1526 MiB to 763 MiB, which is not much, but better than nothing.

Writing the Tree Nodes in Parallel

Ok! Now that we know how to store pointers to the items and share them between threads without any user-side synchronization, we need to generate the tree nodes using them. We already knew how to deal with this LMDB limitation at Meilisearch and decided to implement the same workaround: to write in parallel into LMDB, write into different, independent files and merge everything afterward. This leverages the share-nothing principle and isolates the algorithms. Compared to the original C++ code (look at the .lock/unlock calls), it reduces the synchronization points to a single line in our code: the atomically increasing global tree node ID.

pub fn next_node_id(&self) -> u64 {
    self.0.fetch_add(1, Ordering::Relaxed)
}

Our functions that generate the normal/split nodes based on the user-item nodes now simply write the nodes into independent files. The nodes are appended to a file, and the on-disk offsets and bounds are stored in a vector, a single usize per node. Using Rayon, we run all tree-generation functions in parallel and, once they complete, retrieve the files and boundaries to write the uniquely identified nodes into LMDB sequentially. The sketch below illustrates the shape of this pattern.
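
Here is a simplified sketch of the pattern with stand-in types (not arroy's real code): every tree is built in parallel into its own buffer, and the only shared state is the atomic node ID counter. The real implementation appends to independent temporary files and keeps offsets, but the share-nothing shape is the same.

use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

// Stand-in for a tree root whose nodes must be generated; the real code
// produces split and descendant nodes from the user items.
struct RootToBuild;

impl RootToBuild {
    fn generate_nodes(&self) -> Vec<Vec<u8>> {
        vec![vec![0u8; 16]] // dummy serialized nodes
    }
}

struct ConcurrentNodeIds(AtomicU64);

impl ConcurrentNodeIds {
    fn next_node_id(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

// Each thread builds its own list of (node_id, serialized_node) pairs;
// the atomic counter is the single synchronization point.
fn build_trees(roots: Vec<RootToBuild>, ids: &ConcurrentNodeIds) -> Vec<Vec<(u64, Vec<u8>)>> {
    roots
        .into_par_iter()
        .map(|root| {
            root.generate_nodes()
                .into_iter()
                .map(|bytes| (ids.next_node_id(), bytes))
                .collect::<Vec<(u64, Vec<u8>)>>()
        })
        .collect()
}

fn main() {
    let ids = ConcurrentNodeIds(AtomicU64::new(0));
    let buffers = build_trees(vec![RootToBuild, RootToBuild], &ids);
    // Afterward, a single thread would iterate over these buffers and write
    // the uniquely identified nodes into LMDB sequentially.
    assert_eq!(buffers.len(), 2);
}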

Performance Comparison

Our objective at Meilisearch is to support 100M embeddings of around 768 dimensions. If we store those as f32 vectors without any dimensionality reduction, it amounts to 100M * 768 * 4 bytes = 307 GB, in other words, 286 GiB just to store the raw vectors, without any internal tree nodes, i.e., no way to search in them efficiently.

If you don't specify the number of trees to generate, the algorithm will continue to create trees while the number of tree nodes is smaller than the number of item vectors. At the end, there must be roughly the same number of tree nodes and items.

Discovering the Limit of Vectors We Can Index

Arroy and Annoy use memory mapping extensively but in slightly different ways. In a previous article from @dureuill, we saw that operating systems do not have infinite memory-map space. So, let's dive into performance results.

I noticed something strange when running Arroy with 25M vectors of 768 dimensions. The CPUs were barely used, and the program seemed to read the SSD/NVMe a lot, too much 🤔 The tree generation algorithm splits the vector space into subdomains containing fewer and fewer vectors. Still, it must first find the appropriate normal to split the entire domain, and for that, it randomly selects many items. Unfortunately, my machine only has 63 GiB of memory, and indexing those 25M items requires more than 72 GiB. Annoy was struggling in the same way.

We cannot index 25M vectors of 768 dimensions with 63 GiB

After investigating, I understood why the whole swap space and the memory-mapping limits were reached. Not only are the item nodes bigger than 768 * 4 bytes, because we store the norm and other stuff alongside the vectors, but in the case of Arroy, LMDB also needs to maintain BTree data structures around the entries, and those take memory-mapped space too. Both programs request random item nodes that are unavailable in memory, so the OS fetches them from the disk, which takes time. Oh, and every single thread is doing that in parallel. The CPUs simply end up waiting on the disk.

So, after some bisecting, I found the point where arroy successfully uses all of the memory without being bound by the disk speed: it can index 15.625M vectors on a 63 GiB machine. You can see on this htop screenshot that the disk read speed is at zero, as all of the vectors fit in RAM, that arroy is writing the tree nodes to disk, and that the CPUs are doing their best. It took less than seven hours to process.

Successfully indexing 15.6M vectors of 768 dimensions with 63 GiB

But... Annoy cannot index this same number of documents. It suffers from the same issue we saw before: high disk reads and low CPU usage. But why? I was puzzled, because the nodes have the same format, the number of item vectors to index is the same, and the algorithm is a straight port. So, what is the difference between the two solutions?

For those who looked into the C++ codebase, probably tipped off by the memory-mapping issue described earlier, you may have noticed this slight difference: Arroy writes the generated tree nodes into separate raw files, whereas Annoy reserves space in the memory-mapped file and writes directly into it. Because of this, the OS needs to keep much more space in the memory-mapped area and is forced to evict the item nodes from the cache to keep the just-written tree nodes hot, slowing down the system for no reason, as the tree nodes are not needed during generation.

So, after even more bisecting to find Annoy's limit on a 63 GiB machine, I discovered that it could index roughly 10M vectors, in five hours.

The Share-Nothing Principle for the Win

So Arroy can index more vectors on the same machine (Annoy tops out at roughly 36% fewer), but how long does it take? Wouldn't it be faster to write into the same file in parallel instead of copying all those tree nodes in a single-threaded way? This will be a much shorter paragraph, as I only ran some small tests, but Arroy is always faster!

         number of vectors   number of threads   building time
Annoy    96k                 1                   5min 38s
arroy    96k                 1                   3min 45s
Annoy    96k                 12                  1min 9s
arroy    96k                 12                  54s
Annoy    10M                 12                  5h 40min
arroy    10M                 12                  4h 17min
Annoy    15.625M             12                  --
arroy    15.625M             12                  7h 22min

Now, you will probably tell me that I would need around 400 GiB of memory to index 100M vectors, and you are probably right, but @irevoire will talk about incremental indexing in a future article. I did a linear regression for fun: with those 12 threads and the required memory, I predicted it would take one day, 22 hours, and 43 minutes 😬. Oh! As we also use this vector store in Meilisearch, we need to provide ways to filter this data structure at search time. This is the subject of the next article in this series.

You can comment about this article on Lobste.rs, Hacker News, the Rust Subreddit, or X (formerly Twitter).
