meilisearch / meilisearch
A lightning-fast search API that fits effortlessly into your apps, websites, and workflow
Home Page: https://www.meilisearch.com
License: MIT License
...with all keys and values.
Currently the fst map and the values associated with it (i.e. document id, attribute and position) are stored in two different files. This is error-prone: a user can ask to open an fst map dump with the wrong values dump and get wrong results when searching into it.
We must make this kind of mistake impossible; creating a single file by concatenating the current ones could be a valid solution.
Currently the file format that stores the document indexes associated with the words in the fst map puts an index at the start of the file, followed by all the corresponding document indexes. You can read more about the format in the deep-dive.
This flat data structure does not allow us to "stream" the data to disk directly when merging indexes, for example, because we need to know the final index to write it ahead.
I propose to invert the order and store the index and its size at the end of the file. This way we would be able to "stream" the document indexes to the file and only keep the offsets in RAM. Once all words have been traversed and written append-only to the file, we can write all the offsets followed by the size of this index (all the offsets).
To read this file and the document indexes it contains, we read the last 8 bytes to know the size of the index (the offsets) and use this index, like before, to know which document indexes are associated with an id.
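A minimal sketch of this footer-index layout, using in-memory buffers instead of real files (the exact encoding, little-endian `u64` offsets here, is an assumption):

```rust
use std::convert::TryInto;

// Write payloads append-only, then the offsets index, then the size of that
// index in the last 8 bytes (little-endian), as proposed above.
fn write_file(payloads: &[&[u8]]) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut offsets = Vec::new();
    for p in payloads {
        offsets.push(buf.len() as u64); // remember where each payload starts
        buf.extend_from_slice(p);       // stream the payload to the "file"
    }
    let index_start = buf.len();
    for off in &offsets {
        buf.extend_from_slice(&off.to_le_bytes());
    }
    let index_len = (buf.len() - index_start) as u64;
    buf.extend_from_slice(&index_len.to_le_bytes());
    buf
}

// Read the last 8 bytes to locate the offsets index, then recover payloads.
fn read_file(buf: &[u8]) -> Vec<Vec<u8>> {
    let len = buf.len();
    let index_len = u64::from_le_bytes(buf[len - 8..].try_into().unwrap()) as usize;
    let index_start = len - 8 - index_len;
    let offsets: Vec<usize> = buf[index_start..len - 8]
        .chunks(8)
        .map(|c| u64::from_le_bytes(c.try_into().unwrap()) as usize)
        .collect();
    let mut out = Vec::new();
    for (i, &start) in offsets.iter().enumerate() {
        let end = if i + 1 < offsets.len() { offsets[i + 1] } else { index_start };
        out.push(buf[start..end].to_vec());
    }
    out
}

fn main() {
    let file = write_file(&[b"abc", b"defg"]);
    assert_eq!(read_file(&file), vec![b"abc".to_vec(), b"defg".to_vec()]);
}
```

The writer never needs to seek backwards, so document indexes can be streamed out during a merge.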
Currently the underlying key-value store is RocksDB; this is one of the most important dependencies we have. It allows the search engine to:
But RocksDB is a C++ library that does not expose detailed error types, therefore it will never be possible to solve issue #24.
Another problem is that it multiplies the number of tools we need to compile the library: we need a C++ compiler, make and cmake.
The solution to these problems is to use a pure Rust library; we are thinking of sled. If we move to it, it will be much easier for us to maintain the sled code base, making it a better fit for our use cases.
For the moment sled does not have many of the features we need to make the switch:
Currently `WordArea` and `Attribute` panic if the arguments are greater than the internal limit.
This is an issue when texts are too long, for example: the serializer must know when a `WordArea` cannot be constructed because the words are too far apart.
We must change the constructor to be able to catch possible errors; we will keep a `pub(crate)` unchecked constructor that panics, just for tests.
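A sketch of what the fallible constructor could look like. The internal limit and the single-field layout are assumptions for illustration, not the actual `WordArea` internals:

```rust
#[derive(Debug, PartialEq)]
struct WordArea(u32);

#[derive(Debug, PartialEq)]
struct WordAreaError;

impl WordArea {
    // Assumed internal limit, not MeiliDB's actual value.
    const CHAR_INDEX_LIMIT: u32 = 1 << 22;

    // Fallible constructor: callers (e.g. the serializer) must handle
    // texts whose words fall outside the representable range.
    fn new(char_index: u32) -> Result<WordArea, WordAreaError> {
        if char_index >= Self::CHAR_INDEX_LIMIT {
            Err(WordAreaError)
        } else {
            Ok(WordArea(char_index))
        }
    }

    // Unchecked constructor, would be kept pub(crate) for tests only.
    fn new_unchecked(char_index: u32) -> WordArea {
        WordArea(char_index)
    }
}

fn main() {
    assert!(WordArea::new(42).is_ok());
    assert!(WordArea::new(1 << 23).is_err());
    let _ = WordArea::new_unchecked(0);
}
```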
We need a better `Debug` implementation for `Schema` than the current one.
We could display all the attribute names along with their numbers and properties, using the map `Formatter` helper.
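A sketch of such an implementation with the `debug_map` helper. The `Schema` fields shown here (name mapped to number, stored and indexed flags) are assumptions about its layout:

```rust
use std::collections::BTreeMap;
use std::fmt;

struct Schema {
    // name -> (attribute number, stored, indexed); assumed layout
    attributes: BTreeMap<String, (u16, bool, bool)>,
}

impl fmt::Debug for Schema {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // debug_map renders `{"name": "props", ...}` with proper indentation
        // under the `{:#?}` alternate flag, for free.
        f.debug_map()
            .entries(self.attributes.iter().map(|(name, &(num, stored, indexed))| {
                (name, format!("#{} stored={} indexed={}", num, stored, indexed))
            }))
            .finish()
    }
}

fn main() {
    let mut attributes = BTreeMap::new();
    attributes.insert("title".to_string(), (0u16, true, true));
    let schema = Schema { attributes };
    let out = format!("{:?}", schema);
    assert!(out.contains("title"));
}
```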
The current deserialization time of the "data-index" entry is around 500ms, which is too much.
Currently the "data-index" is serialized and deserialized using serde; that is where the CPU time is spent.
Currently the most CPU-consuming parts of MeiliDB are the criteria: they take almost 40% of the total time (for 9 million fields).
We spotted something interesting in the time measurements: one of our criteria takes much more time than the others even though it only sums one `Match` property.
The `sum_of_typos` criterion takes 5.41ms to sort 97360 documents while `sum_of_words_attribute` takes 19.76ms for the exact same number of documents.
```
97360 total documents to classify
626460 total matches to classify
query_all took 87.45 ms
criterion SumOfTypos, documents group of size 97360
criterion SumOfTypos sort took 5.41 ms
criterion NumberOfWords, documents group of size 97360
criterion NumberOfWords sort took 5.36 ms
criterion WordsProximity, documents group of size 97360
criterion WordsProximity sort took 7.24 ms
criterion SumOfWordsAttribute, documents group of size 97360
criterion SumOfWordsAttribute sort took 19.76 ms
criterion SumOfWordsPosition, documents group of size 33657
criterion SumOfWordsPosition sort took 10.48 ms
criterion Exact, documents group of size 16708
criterion Exact sort took 1.96 ms
criterion DocumentId, documents group of size 16708
criterion DocumentId sort took 1.98 ms
```
They are nearly the same algorithms; the only difference is that the `attribute` field is not at the start of `Match`. It is probably padded, making the CPU cache unhappy.
```rust
struct Match {
    query_index: u32,
    distance: u8,
    attribute: u16,
    word_index: u32,
    is_exact: bool,
}
```
So we thought about data-oriented design: putting related data in the same memory space to make the CPU cache happy.
All of these fields will probably be stored in separate vectors, where each vector represents one property.
All of the document matches' properties will live in the same five vectors; everything becomes a matter of indexes and slices.
```rust
// OLD algorithm struct
struct Match {
    query_index: u32,
    distance: u8,
    attribute: u16,
    word_index: u32,
    is_exact: bool,
}

let matches: Vec<Match>;

// NEW algorithm struct
struct Matches {
    query_index: Vec<u32>,
    distance: Vec<u8>,
    attribute: Vec<u16>,
    word_index: Vec<u32>,
    is_exact: Vec<bool>,
}
```
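A sketch of how the struct-of-arrays layout would be used. A criterion that only needs one property (here, summing `attribute`, a hypothetical stand-in for the real criteria) touches one dense vector and stays cache-friendly:

```rust
#[derive(Default)]
struct Matches {
    query_index: Vec<u32>,
    distance: Vec<u8>,
    attribute: Vec<u16>,
    word_index: Vec<u32>,
    is_exact: Vec<bool>,
}

impl Matches {
    // Every property is pushed into its own vector at the same index.
    fn push(&mut self, query_index: u32, distance: u8, attribute: u16,
            word_index: u32, is_exact: bool) {
        self.query_index.push(query_index);
        self.distance.push(distance);
        self.attribute.push(attribute);
        self.word_index.push(word_index);
        self.is_exact.push(is_exact);
    }

    // A sum-style criterion reads a single contiguous vector: no padding,
    // no unrelated fields pulled into the cache.
    fn sum_of_attributes(&self) -> u64 {
        self.attribute.iter().map(|&a| a as u64).sum()
    }
}

fn main() {
    let mut matches = Matches::default();
    matches.push(0, 0, 1, 0, true);
    matches.push(1, 1, 3, 2, false);
    assert_eq!(matches.sum_of_attributes(), 4);
}
```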
On a search with one or multiple words in the query, the user can choose what happens if a query word is not present in the document or if the engine has not found enough documents. This means making a query word optional or not.
I propose 4 possibilities:
RocksDB has an interesting feature called column families. If you don't specify which column family to use when putting a value, it uses the "default" column family.
We could change this behaviour by using the name of the index to be updated or queried as the column family name. It is possible to create an SST file for a specific column family.
Currently, for highlights, the engine returns the field name, the start of the occurrence and its length.
e.g.
```json
"matches": [
  {
    "attribute": "xxx",
    "start": 62,
    "length": 7
  }
]
```
It would be great to have the content of the field with the highlights visible, with a configurable text before and after each highlight. That would help many use cases, like rendering HTML, printing on a terminal, etc. But keeping the start and length of a match is also interesting in some cases.
e.g.
```json
"_highlightResult": {
  "attribute_name": {
    "value": "I am the text of attribute_name and the match is <em>chocolate</em>",
    "matchedWords": [
      "chocolate"
    ],
    "matchesPositions": [
      { "start": 50, "length": 9 }
    ],
    "prefixTag": "<em>",
    "postfixTag": "</em>"
  }
},
```
It is an interesting feature to allow a user to query the engine by specifying the exact range of documents they want; it can be useful for pagination. Without it, the user has to work around the limitation by querying enough documents to reach the upper bound of the range they actually need.
By specifying the range, the engine can skip a lot of computation inside the bucket sort algorithm.
...that can be based on Michael Nitschinger's blog post about tokenization in Rust.
For the moment the `Schema::attribute_name` function has an awful complexity, see for yourself:
We can probably use a `LinkedHashMap` to be able to iterate over it; as its name suggests, it is a linked list that provides this functionality.
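Without pulling in the crate, the same property, O(1) lookup by name plus iteration in declaration order, can be sketched with a `Vec` for order and a `HashMap` for lookup. The `Attributes` type and its methods are illustrative, not the actual Schema API:

```rust
use std::collections::HashMap;

struct Attributes {
    order: Vec<String>,          // declaration order, for iteration
    index: HashMap<String, u16>, // name -> attribute number, for O(1) lookup
}

impl Attributes {
    fn new() -> Attributes {
        Attributes { order: Vec::new(), index: HashMap::new() }
    }

    fn insert(&mut self, name: &str) -> u16 {
        let number = self.order.len() as u16;
        self.order.push(name.to_string());
        self.index.insert(name.to_string(), number);
        number
    }

    // O(1) instead of scanning every attribute linearly.
    fn attribute_number(&self, name: &str) -> Option<u16> {
        self.index.get(name).copied()
    }
}

fn main() {
    let mut attrs = Attributes::new();
    attrs.insert("id");
    attrs.insert("title");
    assert_eq!(attrs.attribute_number("title"), Some(1));
    assert_eq!(attrs.order, vec!["id".to_string(), "title".to_string()]);
}
```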
Thanks to a previous issue (#47), a Schema can be built from a toml file.
The proposal for this issue is to handle the JSON format like the Toml format.
An example of the JSON schema:
```json
{
  "identifier": "id",
  "attributes": {
    "id": {
      "stored": true
    },
    "title": {
      "stored": true,
      "indexed": true
    },
    "description": {
      "stored": true,
      "indexed": true
    },
    "image": {
      "indexed": true
    }
  }
}
```
In other words, dump the index files. It is fairly easy to dump the `fst::Map` and the `doc_indexes`, but we need to dump `RocksDB` too, as an `SST` file if possible.
https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F
It seems that we will be forced to iterate over the `keys` and select the ones that are present in the `metas`. But if it is possible to remove keys while we do operations on the index (differences of negative and positive doc indexes), this should not be necessary and an easy big dump could be sufficient.
The real advantage of doing so is that we could give the new index to other search engines and keep a state even if the engine crashes. Ideally the state should be a unique file (e.g. a `tar.sz` file, tar combined with snappy).
It would be better to produce an SST file at index time: this kind of file is useful when the key-values can be generated ahead of time and the boot of the program that will query them must be fast.
https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files
Implement Algolia's distinct feature on the `RankedStream` struct using a `HashMap` and a counter. The `attributeForDistinct` will be given to the `RankedStreamBuilder`.
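A minimal sketch of the HashMap-plus-counter idea: count how many documents of each distinct key have already been yielded and drop the ones over the limit. The `distinct_by` function is illustrative, not the actual `RankedStream` API:

```rust
use std::collections::HashMap;

fn distinct_by<T, K, F>(items: Vec<T>, limit: usize, key: F) -> Vec<T>
where
    K: std::hash::Hash + Eq,
    F: Fn(&T) -> K,
{
    let mut seen: HashMap<K, usize> = HashMap::new();
    items
        .into_iter()
        .filter(|item| {
            let count = seen.entry(key(item)).or_insert(0);
            *count += 1;
            *count <= limit // keep at most `limit` documents per distinct key
        })
        .collect()
}

fn main() {
    // (groupId, documentId): keep at most 2 documents per group.
    let docs = vec![(1, "a"), (1, "b"), (1, "c"), (2, "d")];
    let kept = distinct_by(docs, 2, |&(group, _)| group);
    assert_eq!(kept, vec![(1, "a"), (1, "b"), (2, "d")]);
}
```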
This way it will be possible to run multiple instances of the same database and only test the query functions, the ones that do not affect the database content.
It can be done in the `http` server or in the `raptor` library.
Using the `log` crate can be a good start.
It is important for the library user to know what went wrong when something fails 😃
But it is also important for us to know the source of an error.
There is a big problem for the moment: we use RocksDB, which is a wrapper around a C++ library, so errors are just simple strings. We do not have any information on the kind of error. Is it an IO error, a logic error? Does the key that I want to remove not exist, or is it wrongly spelled?
We can probably create a workaround in the form of an error enum wrapping the RocksDB error String...
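A sketch of that workaround: an enum that keeps our own error kinds structured while wrapping the opaque RocksDB string as-is. The variant names are assumptions:

```rust
use std::fmt;
use std::io;

#[derive(Debug)]
enum Error {
    Io(io::Error),
    MissingKey(String),
    Rocksdb(String), // RocksDB only gives us a String, wrap it unchanged
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "io error: {}", e),
            Error::MissingKey(key) => write!(f, "missing key: {}", key),
            Error::Rocksdb(msg) => write!(f, "rocksdb error: {}", msg),
        }
    }
}

// `?` on io operations converts automatically into our error type.
impl From<io::Error> for Error {
    fn from(e: io::Error) -> Error {
        Error::Io(e)
    }
}

fn main() {
    let err = Error::Rocksdb("Invalid argument".to_string());
    assert_eq!(err.to_string(), "rocksdb error: Invalid argument");
}
```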
The problem that we faced while implementing this database was saving the blobs generated by our engine. The engine can work with positive and negative blobs, as explained in the deep-dive.
These blobs are currently written to the filesystem. The problem is that the order of the files must be saved; you can read more about why in issue #15. Additionally, the positive blobs are associated with an sst file, a file that contains all the key-values of the newly added documents, a format RocksDB can be populated with.
One of the features that the final database must have is being fed with updates (addition, deletion, update) while it is running. The big problem was: how can we ensure that the modifications received by the database are saved durably (according to the ACID properties)?
A temporary solution could be to use an existing key-value store like RocksDB, which already follows some of the ACID properties. It would be used to store the key-values of the newly indexed documents and the index files (`.idx`, `.map`), using column families as a separation.
When an update is made, we first store the index files under an incremented id as key, update the document key-values and set the current key id to the newly created one. We have a little problem here: ingesting an sst file cannot be done inside a transaction.
Currently only struct serialization is supported to create an update; in the future it would be great to serialize HashMaps, for example. This way we would allow users to send custom objects without needing to recompile.
The current schema does not know anything about the document's unique identifier; we must resolve that issue first (#22), so that users are able to insert documents without needing to compute the id themselves.
On a search with one or multiple words in the query, the user can choose whether each word can be searched as a prefix or not.
I propose 3 possibilities:
The current `CappedBTreeMap::insert` function pops out the smallest value when the capacity is reached, but it doesn't test whether the inserted one is smaller than the ones already in the tree, so the popped/returned value is not the smallest one.
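A sketch of a corrected capped insert: insert first, then pop the minimum, so a value smaller than everything already stored is the one that gets popped back out. `CappedBTreeSet` over plain `u32` values is a simplification of the actual map:

```rust
use std::collections::BTreeSet;

struct CappedBTreeSet {
    capacity: usize,
    inner: BTreeSet<u32>,
}

impl CappedBTreeSet {
    fn new(capacity: usize) -> CappedBTreeSet {
        CappedBTreeSet { capacity, inner: BTreeSet::new() }
    }

    // Returns the popped smallest value when the capacity is exceeded.
    // Inserting before popping guarantees the popped value really is
    // the smallest, even if it is the value we just inserted.
    fn insert(&mut self, value: u32) -> Option<u32> {
        self.inner.insert(value);
        if self.inner.len() > self.capacity {
            let smallest = *self.inner.iter().next().unwrap();
            self.inner.remove(&smallest);
            Some(smallest)
        } else {
            None
        }
    }
}

fn main() {
    let mut set = CappedBTreeSet::new(2);
    assert_eq!(set.insert(5), None);
    assert_eq!(set.insert(7), None);
    // 3 is smaller than everything stored: it must be popped back out.
    assert_eq!(set.insert(3), Some(3));
    // 9 is bigger: the current smallest (5) is popped instead.
    assert_eq!(set.insert(9), Some(5));
}
```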
Currently the document ranking doesn't seem to take into account whether a query word is at the end of the query or not. When a word is in the middle of a query, or is not followed by a space, this word is more likely to be finished/in its exact form.
For example, if a user queries `"cat and dog"`, we must understand that the word `"cat"` is a finished word, and documents containing the exact representation must be more interesting than documents containing words starting with `"cat"`, like `"catalog"` or `"catherine"`.
This is already one of the criteria used by default, but it is nearly the last criterion in the list (link to the code).
For the moment, the search statistics and analytics made on the server are not specific enough. We cannot know whether the search "pants" is unique or whether it was preceded by "p", then "pa", then "pan", then "pant" and finally "pants". This means we are unable to find the most searched words for an index.
It would be great to add a `sessionId` to the search request headers to keep a history of what a user searches. The goal is not to spy on users but just to have enough information to know what the final search is.
The search responses are only composed of an array of documents and, for each document, the list of matches. We might need more information for debugging, metrics, etc.
Here is some information that could be interesting to find in a response.
Process:
Pagination:
Debugging:
It will give better performance.
Currently, if a query word is 4 characters or fewer, a Levenshtein automaton is created with an allowed distance of 0; it is used only to match prefixes, and it is much slower than combining the `stream` and `starts_with` map methods.
The best way to make this possible is probably to define an enum that is either a `stream` + `starts_with` combination or a `Levenshtein` automaton. We would keep the `Levenshtein` pool, but it would live under this wrapper.
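A sketch of the wrapper's shape: short words get a cheap prefix matcher, longer ones a Levenshtein matcher. The real implementation would dispatch to fst streams and pooled automata; here a naive distance function stands in, since only the enum structure is the point:

```rust
enum WordMatcher {
    StartsWith(String),
    Levenshtein { word: String, max_distance: u32 },
}

// Naive dynamic-programming edit distance, standing in for the automaton.
fn levenshtein(a: &str, b: &str) -> u32 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i as u32 + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

impl WordMatcher {
    fn new(query: &str) -> WordMatcher {
        // 4 characters or fewer: prefix match only, distance 0.
        if query.chars().count() <= 4 {
            WordMatcher::StartsWith(query.to_string())
        } else {
            WordMatcher::Levenshtein { word: query.to_string(), max_distance: 1 }
        }
    }

    fn matches(&self, candidate: &str) -> bool {
        match self {
            WordMatcher::StartsWith(prefix) => candidate.starts_with(prefix),
            WordMatcher::Levenshtein { word, max_distance } => {
                levenshtein(word, candidate) <= *max_distance
            }
        }
    }
}

fn main() {
    assert!(WordMatcher::new("cat").matches("catalog"));
    assert!(WordMatcher::new("chocolat").matches("chocolate"));
    assert!(!WordMatcher::new("chocolat").matches("vanilla"));
}
```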
We must be able to create a Schema from a file. The format could be toml, for example.
It could be something of this form; the order in which the attributes are declared is important.
There is a pull request about that in the toml repository.
```toml
identifier = "id"

[attributes."id"]
stored = true

[attributes."title"]
stored = true
indexed = true

[attributes."description"]
stored = true
indexed = true

[attributes."image"]
stored = true
```
...that will allow us to store nodes with values in a simple memory buffer, removing most of the cache misses while iterating over the tree.
A good way for nodes to point to other nodes is to use node indexes into the memory buffer instead of raw pointers; this allows us to grow the node pool.
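A minimal sketch of such an index-based node pool. Because `NodeId` is an index into the backing `Vec`, the buffer can reallocate as it grows without invalidating any reference between nodes (the node contents here are placeholders):

```rust
#[derive(Copy, Clone, PartialEq, Debug)]
struct NodeId(usize);

struct Node {
    value: u32,
    children: Vec<NodeId>, // indexes into the pool, never raw pointers
}

struct NodePool {
    nodes: Vec<Node>, // one contiguous buffer holds every node
}

impl NodePool {
    fn new() -> NodePool {
        NodePool { nodes: Vec::new() }
    }

    fn alloc(&mut self, value: u32) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node { value, children: Vec::new() });
        id
    }

    fn add_child(&mut self, parent: NodeId, child: NodeId) {
        self.nodes[parent.0].children.push(child);
    }

    fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.0]
    }
}

fn main() {
    let mut pool = NodePool::new();
    let root = pool.alloc(1);
    let child = pool.alloc(2);
    pool.add_child(root, child);
    let first_child = pool.get(root).children[0];
    assert_eq!(pool.get(first_child).value, 2);
}
```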
The engine uses a lot of magic numbers, mostly during the indexing phase: the typo tolerance configuration, the number of words indexed in a field, the biggest possible proximity, etc.
This issue should allow the end user to configure all these parameters.
One possibility is to store them in the config already saved in the KV store. The search and indexing config parts could be separated.
The distinct feature (see #12) allows de-duplicating objects based on a generic function.
The implementation keeps the possibility to choose the number of results returned per object family.
The proposition is to allow users to choose between 3 options:
`attributeForDistinct = [groupId]`
`attributeForAutoDistinct = [color]`
```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:1, id:12, name: t-shirt marine, color:blue }
{ groupId:1, id:13, name: t-shirt marine, color:white }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:15, name: t-shirt, color:blue }
{ groupId:2, id:16, name: t-shirt, color:marine }
{ groupId:2, id:17, name: t-shirt, color:red }
```
Query: T-shirt marine
```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:2, id:14, name: t-shirt, color:marine }
```

Simply distinguished by `groupId`.

```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:16, name: t-shirt, color:marine }
```

Because the `marine` color is not present in `groupId:1` but is in `groupId:2`.

```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:1, id:12, name: t-shirt marine, color:blue }
{ groupId:1, id:13, name: t-shirt marine, color:white }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:16, name: t-shirt, color:marine }
```

Because `marine` matches once in the `color` attribute.
We must think about a way to communicate with the database to insert, delete and update documents.
The database will be listening on a TCP port and should be queryable from any language (i.e. Rust, Ruby, Python, Javascript...). The protocol must be versioned.
We thought that the first protocol to implement could be a full-text one that follows some of the Redis protocol specifications. This way it will be easy to implement in other languages.
But a second version of the protocol could be oriented toward something else, more like a binary format for example.
The query distinct function calls the user-given distinct function once inside the raw bucket sort pass and another time inside the final aggregation loop.
The user-given distinct function is always CPU-consuming, because it fetches document fields from the key-value store.
It would be interesting to store the outputs of the first distinct function calls.
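A sketch of that caching idea: the first pass fills a `HashMap` keyed by document id, and the aggregation pass reuses it instead of calling the expensive user function again. The `CachedDistinct` type and its `String` keys are illustrative:

```rust
use std::collections::HashMap;

struct CachedDistinct<F> {
    func: F,                      // user-given distinct function
    cache: HashMap<u64, String>,  // doc id -> already-computed distinct key
}

impl<F: Fn(u64) -> String> CachedDistinct<F> {
    fn new(func: F) -> CachedDistinct<F> {
        CachedDistinct { func, cache: HashMap::new() }
    }

    // Only the first call per document id pays the key-value store cost.
    fn key(&mut self, doc_id: u64) -> &str {
        let func = &self.func;
        self.cache.entry(doc_id).or_insert_with(|| func(doc_id))
    }
}

fn main() {
    use std::cell::Cell;
    let calls = Cell::new(0);
    let mut distinct = CachedDistinct::new(|doc_id| {
        calls.set(calls.get() + 1); // simulate the expensive store fetch
        format!("group-{}", doc_id % 2)
    });
    assert_eq!(distinct.key(4), "group-0");
    assert_eq!(distinct.key(4), "group-0"); // second call hits the cache
    assert_eq!(calls.get(), 1);
}
```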
This way, reading this entry will require only one allocation, of a known size.
The `PositiveBlob` is composed of an `fst::Map` and a `DocIndexes` struct; these two structures are one after the other in memory. We could use a shared `Vec` to construct the `Map` and the `DocIndexes`.
Currently the schema knows which fields of the document should be indexed and stored. Using the `SchemaBuilder` struct we declare each field in order of importance, and the document id is unknown to the finally built schema.
The advantage of this solution is that it will be impossible to serialize a document that does not have an id: the serializer will not be able to generate the appropriate key, and therefore not be able to store the document in the database.
We must absolutely keep some kind of "raw" API that will allow us to store a document under any id and with any attributes.
How should we modify the `SchemaBuilder` API? What if the user wants the id to be stored and indexed?
We will probably need to transform the id (i.e. hash it) to be able to save it in the database; when a user needs to identify a document (i.e. to delete it), they will refer to its original id, not the transformed one.
We must find a better dataset. A set of summaries of some important Wikipedia pages, for example.
Create a `RankedStreamBuilder` with settings and stuff to customise the sorting algorithm's settings.
This builder produces a `RankedStream` which streams the results using a `StreamWithState` (with the levenshtein state) and sorts the results with some possible optimisations.
The internal `StreamWithState` returns each matching word (with the right levenshtein distance) associated with the document id, the word's index in the query, the attribute where the word is located and the position of the word in this attribute.
This is raw data that must be sorted to match the following rules (for proximity, `George Clooney` is better than `George word Clooney`).
The first step to follow the rules is to collect results (document id, word index in the query, attribute and position) in a collection to help sort them.
But these rules must be applied using bucket sorting: if, after sorting the results using the first rule, there are matches that are equal, use the next rule to separate them, and so on whenever equality is detected.
The typo rule is not the most complex to implement: sort documents according to the levenshtein distance only. If there are multiple words, take the maximum of the minimum query-word distances.
The words rule is straightforward: the index of the word in the query is returned in the results of the internal `StreamWithState` stream. This information can be used to know which query word a result refers to, and so simply count the number of different matching words (`[1, 2, 3]` is better than `[1, 2]` or `[1, 1, 1]`).
The proximity rule can be implemented by sorting the results by attribute first, by the word index in the query next, and finally by the index in the attribute. Once the results for each document are sorted, sort documents by the smallest sum of distances between each query word.
The attribute rule has already been fulfilled by the previous rule application.
The words-position rule has already been granted by the previous sorting: select the different word indexes in the query, get the one that is nearest to the start of the attribute, and rank documents based on these values (lower is better).
The exact rule is not defined for the moment.
The best way to implement all of these rules is to create a collection for each document that matches something. Documents can be added as the search progresses; the collections contain structures with the index of the word in the query, followed by the distance of the match, the attribute in which the word matched and finally the index of this word in the attribute.
The `RankedStream` stream can be set up to return a maximum number of results, so optimisations are possible.
The internal `StreamWithState` stream gives results in lexicographic order; each result (word) is associated with a given number of document ids, positions... This information is used to rank the results.
The order in which the internal stream gives results is useful to know whether better matching words can still appear in the future.
If the stream has a limited number of results to return, some results are already known to be "frozen" ones: no result can be better than these.
For example: a document that contains exactly the query word (with a levenshtein distance of zero) at the first position of the first attribute cannot be moved out; it is called "frozen" and can be returned immediately.
Another example: for a one-word query, the internal stream gives matching words in lexicographic order. So once the words matching exactly (with a levenshtein distance of zero) have been returned, and there are enough of them, there is no reason to ask the internal stream for more results, since the next ones will have a bigger distance; the currently accumulated results can already be sorted and returned.
It could be easier for the user to create an update without caring about whether it is a positive or a negative one.
It must be an update without any distinction: the user should be able to insert document field updates, remove document fields or completely remove documents.
What is incremental indexing?
Currently the engine takes an index composed of 3 files: one for the finite state transducer (`.fst`), one for the list of document ids associated with the index where each word matches (a `DocIndex`) (`.idx`) and one that represents the key-values used to feed RocksDB (`.sst`).
This kind of index is immutable, and that is the power of the data structure; you can read more in the fst repo. I will use the term "index" in the following lines to refer to this batch of files that work together.
A use case that should/must be managed by this library is the possibility to add and remove documents in real time without needing to re-index all the documents. This feature is really important when new results must be exposed within a small amount of time.
Another great feature to have is a way to return only the documents that should be yielded. An API like the following would be perfect.
```rust
fn filter_function(doc: DocumentId, view: &DatabaseView<D>) -> bool {
    // ...
}

let builder = view.query_builder()?;
let builder = builder.with_filter(filter_function);
let documents = builder.query("chateau", 0..20);
```
We must rethink how the query builder system works.
For the moment a user can create a `QueryBuilder` without any parameter; if they need to apply a distinct rule on it, they can create a `DistinctQueryBuilder` from it. This is a new type with its own, really custom, query function.
If we want to introduce a new way to filter documents, we will multiply the number of types and it will be hard to combine kinds of filters (i.e. filter, distinct). An idea could be to introduce a `Query` trait that every filter implements; this way we will be able to introduce query filters that wrap other query filters.
We need to be careful: the query filter must always be applied before the query distinct, for example. A type restriction could be implemented.
If the user has a database with documents translated into multiple languages, the best practice is to create one index per language. In this case, it would be great to let the end user choose a default language for each index. This will apply the tokenizer and the default stop words for this language during indexing.
Of course, this option should be totally optional. By default, no stop words will be used; the tokenizer will split on spaces and use default sentence separators like dots, commas, etc.
After some time experimenting with a basic sorting method that sorts all documents at once, it appears that it was not the best idea.
The new algorithm should use a bucket sort system, a sort that is only applied to sub-groups, criterion by criterion. Buckets are sorted by a criterion and split; each sub-bucket is then sorted again and split by the next criterion. The advantage of this technique is that only the minimum number of buckets is processed: the buckets that will not contain the returned documents are never sorted.
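The bucket sort described above can be sketched as a recursion over criteria: sort by one criterion, split into groups of equal rank, and only recurse into the groups needed to fill the requested number of results. Documents are plain `u32` values here and the criteria are toy ranking functions, both stand-ins for the real types:

```rust
type Criterion = fn(&u32) -> u32;

fn bucket_sort(docs: &mut [u32], criteria: &[Criterion], limit: usize) {
    let Some((first, rest)) = criteria.split_first() else { return };
    docs.sort_by_key(|d| first(d)); // stable sort by the current criterion
    // Walk the groups of equal rank, recursing only until `limit` is covered.
    let mut start = 0;
    while start < docs.len() && start < limit {
        let rank = first(&docs[start]);
        let mut end = start;
        while end < docs.len() && first(&docs[end]) == rank {
            end += 1;
        }
        // Refine this bucket with the remaining criteria.
        bucket_sort(&mut docs[start..end], rest, limit - start);
        start = end;
    }
    // Buckets entirely past `limit` are never touched by the next criteria.
}

fn main() {
    // First criterion: parity; second criterion: the value itself.
    let criteria: [Criterion; 2] = [|d| d % 2, |d| *d];
    let mut docs = vec![5, 2, 9, 4, 1];
    bucket_sort(&mut docs, &criteria, 5);
    assert_eq!(docs, vec![2, 4, 1, 5, 9]);
}
```

Only the buckets that can contribute to the first `limit` results pay the cost of the later, more expensive criteria.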
Soft updates are updates that can be ingested by the database without any CPU overhead.
Unlike heavy updates, they don't carry any fst-docids entry, therefore no blob merge is needed and entry modifications can be seen in realtime.
Obviously, soft updates can only be done on non-indexed attributes.
Allowing this can be useful for custom ranking rules using non-indexed attributes that change frequently, like a `number_of_clicks` attribute, for example.
The search engine paginates by 20 documents by default. This value should be configurable.
The current update ingestion implementation recomputes the reverse index.
This is an important CPU cost, mostly when adding a small number of documents.
We can't deal with this anymore!
We must be able to accept updates of any size and make them available to search with minimal overhead.
We thought about something similar to database caching algorithms: putting the most recently used entries in memory to provide rapid access to them. When too many entries are cached, a compaction is performed (i.e. some of sled's future strategy).
So, basically, when a user proposes some document update/insertion/deletion, we will not put every document attribute update inside an SST file, paying the cost of keeping words-docindex ordered in RAM and constructing a new FST.
Inserting words-docindex and document attribute values directly into the key-value store is the solution. Internally the database will extract the words of the indexed attributes and update the special "revindex" keyspace, dedicated to reverse index data, storing the `DocIndex` associated to words as key-values, like the current FST does.
The advantage of a key-value store is that, like a BTree and an FST, keys are ordered, making `O(log n)` unions possible between these data structures at search time.
Obviously, we should find a rule to trigger compaction; it must be based on the latency involved in doing the union, probably related to the number of words/doc-indexes stored as raw keys. A compaction will be an update of the FST, integrating the "revindex" words-docindex into the FST.