meilisearch / meilisearch
A lightning-fast search API that fits effortlessly into your apps, websites, and workflow
Home Page: https://www.meilisearch.com
License: MIT License
...with all keys and values.
Currently the fst map and the values associated with it (i.e. document id, attribute and position) are stored in two different files. This is error-prone: a user can ask to open an fst map dump with the wrong values dump and get wrong results when searching into it.
We must make this kind of mistake impossible; creating a single file by concatenating the current ones could be a valid solution.
Currently the file format that stores the document indexes associated with the words in the fst map puts an index at the start of the file, followed by all the corresponding document indexes. You can read more about the format in the deep-dive.
This flat data structure does not allow us to "stream" the data to disk directly when merging indexes, for example, because we need to know the final index to write it ahead.
I propose to invert the order and store the index and its size at the end of the file. This way we would be able to "stream" the document indexes to the file and only keep the offsets in RAM. Once all words have been traversed and written append-only to the file, we can write all the offsets followed by the size of this index (all the offsets).
To read this file and the document indexes it contains, we read the last 8 bytes to know the size of the index (the offsets) and use this index, like before, to know which document indexes are associated with an id.
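A minimal sketch of this footer-index layout, using in-memory buffers instead of real files (the exact encoding, little-endian `u64` offsets here, is an assumption):

```rust
use std::convert::TryInto;

// Write payloads append-only, then the offsets index, then the size of that
// index in the last 8 bytes (little-endian), as proposed above.
fn write_file(payloads: &[&[u8]]) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut offsets = Vec::new();
    for p in payloads {
        offsets.push(buf.len() as u64); // remember where each payload starts
        buf.extend_from_slice(p);       // stream the payload to the "file"
    }
    let index_start = buf.len();
    for off in &offsets {
        buf.extend_from_slice(&off.to_le_bytes());
    }
    let index_len = (buf.len() - index_start) as u64;
    buf.extend_from_slice(&index_len.to_le_bytes());
    buf
}

// Read the last 8 bytes to locate the offsets index, then recover payloads.
fn read_file(buf: &[u8]) -> Vec<Vec<u8>> {
    let len = buf.len();
    let index_len = u64::from_le_bytes(buf[len - 8..].try_into().unwrap()) as usize;
    let index_start = len - 8 - index_len;
    let offsets: Vec<usize> = buf[index_start..len - 8]
        .chunks(8)
        .map(|c| u64::from_le_bytes(c.try_into().unwrap()) as usize)
        .collect();
    let mut out = Vec::new();
    for (i, &start) in offsets.iter().enumerate() {
        let end = if i + 1 < offsets.len() { offsets[i + 1] } else { index_start };
        out.push(buf[start..end].to_vec());
    }
    out
}

fn main() {
    let file = write_file(&[b"abc", b"defg"]);
    assert_eq!(read_file(&file), vec![b"abc".to_vec(), b"defg".to_vec()]);
}
```

The writer never needs to seek backwards, so document indexes can be streamed out during a merge.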
Currently the underlying key-value store is RocksDB; this is one of the most important dependencies we have. It allows the search engine to:
But RocksDB is a C++ library that does not expose detailed error types, therefore it will never be possible to solve issue #24.
Another problem is that it multiplies the number of tools we need to compile the library: we need a C++ compiler, make and cmake.
The solution to these problems is to use a pure Rust library; we are thinking of sled. If we move to it, it will be much easier for us to maintain the sled code base, making it a better fit for our use cases.
For the moment sled does not have many of the features we need to make the switch:
Currently `WordArea` and `Attribute` panic if the arguments are greater than the internal limit.
This is an issue when texts are too long, for example: the serializer must know when a `WordArea` cannot be constructed because the words are too far apart.
We must change the constructor to be able to catch possible errors; we will keep a `pub(crate)` unchecked constructor that panics, just for tests.
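A sketch of what the fallible constructor could look like. The internal limit and the single-field layout are assumptions for illustration, not the actual `WordArea` internals:

```rust
#[derive(Debug, PartialEq)]
struct WordArea(u32);

#[derive(Debug, PartialEq)]
struct WordAreaError;

impl WordArea {
    // Assumed internal limit, not MeiliDB's actual value.
    const CHAR_INDEX_LIMIT: u32 = 1 << 22;

    // Fallible constructor: callers (e.g. the serializer) must handle
    // texts whose words fall outside the representable range.
    fn new(char_index: u32) -> Result<WordArea, WordAreaError> {
        if char_index >= Self::CHAR_INDEX_LIMIT {
            Err(WordAreaError)
        } else {
            Ok(WordArea(char_index))
        }
    }

    // Unchecked constructor, would be kept pub(crate) for tests only.
    fn new_unchecked(char_index: u32) -> WordArea {
        WordArea(char_index)
    }
}

fn main() {
    assert!(WordArea::new(42).is_ok());
    assert!(WordArea::new(1 << 23).is_err());
    let _ = WordArea::new_unchecked(0);
}
```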
We need a better `Debug` implementation for `Schema` than the current one.
We could display all the attribute names along with their numbers and properties, using the map `Formatter` helper.
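A sketch of such an implementation with the `debug_map` helper. The `Schema` fields shown here (name mapped to number, stored and indexed flags) are assumptions about its layout:

```rust
use std::collections::BTreeMap;
use std::fmt;

struct Schema {
    // name -> (attribute number, stored, indexed); assumed layout
    attributes: BTreeMap<String, (u16, bool, bool)>,
}

impl fmt::Debug for Schema {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // debug_map renders `{"name": "props", ...}` with proper indentation
        // under the `{:#?}` alternate flag, for free.
        f.debug_map()
            .entries(self.attributes.iter().map(|(name, &(num, stored, indexed))| {
                (name, format!("#{} stored={} indexed={}", num, stored, indexed))
            }))
            .finish()
    }
}

fn main() {
    let mut attributes = BTreeMap::new();
    attributes.insert("title".to_string(), (0u16, true, true));
    let schema = Schema { attributes };
    let out = format!("{:?}", schema);
    assert!(out.contains("title"));
}
```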
The current deserialization time of the "data-index" entry is around 500ms, which is too much.
Currently the "data-index" is serialized and deserialized using serde; that is where the CPU time is spent.
Currently the most CPU-consuming parts of MeiliDB are the criteria: they take almost 40% of the total time (for 9 million fields).
We spotted something interesting in the time measurements: one of our criteria takes much more time than the others even though it only sums one `Match` property.
The `sum_of_typos` criterion takes 5.41ms to sort 97360 documents while `sum_of_words_attribute` takes 19.76ms for the exact same number of documents.
```
97360 total documents to classify
626460 total matches to classify
query_all took 87.45 ms
criterion SumOfTypos, documents group of size 97360
criterion SumOfTypos sort took 5.41 ms
criterion NumberOfWords, documents group of size 97360
criterion NumberOfWords sort took 5.36 ms
criterion WordsProximity, documents group of size 97360
criterion WordsProximity sort took 7.24 ms
criterion SumOfWordsAttribute, documents group of size 97360
criterion SumOfWordsAttribute sort took 19.76 ms
criterion SumOfWordsPosition, documents group of size 33657
criterion SumOfWordsPosition sort took 10.48 ms
criterion Exact, documents group of size 16708
criterion Exact sort took 1.96 ms
criterion DocumentId, documents group of size 16708
criterion DocumentId sort took 1.98 ms
```
They are nearly the same algorithms; the only difference is that the `attribute` field is not at the start of `Match`. It is probably padded, making the CPU cache unhappy.
```rust
struct Match {
    query_index: u32,
    distance: u8,
    attribute: u16,
    word_index: u32,
    is_exact: bool,
}
```
So we thought about data-oriented design: putting related data in the same memory space to make the CPU cache happy.
All of these fields will probably be stored in separate vectors, where each vector represents one property.
All of the document matches' properties will live in the same five vectors; everything becomes a matter of indexes and slices.
```rust
// OLD algorithm struct
struct Match {
    query_index: u32,
    distance: u8,
    attribute: u16,
    word_index: u32,
    is_exact: bool,
}

let matches: Vec<Match>;

// NEW algorithm struct
struct Matches {
    query_index: Vec<u32>,
    distance: Vec<u8>,
    attribute: Vec<u16>,
    word_index: Vec<u32>,
    is_exact: Vec<bool>,
}
```
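A sketch of how the struct-of-arrays layout would be used. A criterion that only needs one property (here, summing `attribute`, a hypothetical stand-in for the real criteria) touches one dense vector and stays cache-friendly:

```rust
#[derive(Default)]
struct Matches {
    query_index: Vec<u32>,
    distance: Vec<u8>,
    attribute: Vec<u16>,
    word_index: Vec<u32>,
    is_exact: Vec<bool>,
}

impl Matches {
    // Every property is pushed into its own vector at the same index.
    fn push(&mut self, query_index: u32, distance: u8, attribute: u16,
            word_index: u32, is_exact: bool) {
        self.query_index.push(query_index);
        self.distance.push(distance);
        self.attribute.push(attribute);
        self.word_index.push(word_index);
        self.is_exact.push(is_exact);
    }

    // A sum-style criterion reads a single contiguous vector: no padding,
    // no unrelated fields pulled into the cache.
    fn sum_of_attributes(&self) -> u64 {
        self.attribute.iter().map(|&a| a as u64).sum()
    }
}

fn main() {
    let mut matches = Matches::default();
    matches.push(0, 0, 1, 0, true);
    matches.push(1, 1, 3, 2, false);
    assert_eq!(matches.sum_of_attributes(), 4);
}
```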
On a search with one or multiple words in the query, the user can choose what happens if a query word is not present in the document or if the engine has not found enough documents. This means making a query word optional or not.
I propose 4 possibilities:
RocksDB has an interesting feature called column families. If you don't specify which column family to use when putting a value, it uses the "default" column family.
We could change this behaviour by using the name of the index to be updated or queried as the column family name. It is possible to create an SST file for a specific column family.
Currently, for highlights, the engine returns the field name, the start of the occurrence and its length.
e.g.
```json
"matches": [
  {
    "attribute": "xxx",
    "start": 62,
    "length": 7
  }
]
```
It would be great to have the content of the field with the highlights visible, with a configurable text before and after each highlight. That would help many use cases, like rendering HTML, printing on a terminal, etc. But keeping the start and length of a match is also interesting in some cases.
e.g.
```json
"_highlightResult": {
  "attribute_name": {
    "value": "I am the text of attribute_name and the match is <em>chocolate</em>",
    "matchedWords": [
      "chocolate"
    ],
    "matchesPositions": [
      { "start": 50, "length": 9 }
    ],
    "prefixTag": "<em>",
    "postfixTag": "</em>"
  }
},
```
It is an interesting feature to allow a user to query the engine by specifying the exact range of documents they want; it can be useful for pagination. Without it, the user has to work around the limitation by querying enough documents to reach the upper bound of the range they actually need.
By specifying the range, the engine can skip a lot of computation inside the bucket sort algorithm.
...that can be based on Michael Nitschinger's blog post about tokenization in Rust.
For the moment the `Schema::attribute_name` function has an awful complexity, see for yourself:
We can probably use a `LinkedHashMap` to be able to iterate over it; as its name suggests, it is a linked list that provides this functionality.
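Without pulling in the crate, the same property, O(1) lookup by name plus iteration in declaration order, can be sketched with a `Vec` for order and a `HashMap` for lookup. The `Attributes` type and its methods are illustrative, not the actual Schema API:

```rust
use std::collections::HashMap;

struct Attributes {
    order: Vec<String>,          // declaration order, for iteration
    index: HashMap<String, u16>, // name -> attribute number, for O(1) lookup
}

impl Attributes {
    fn new() -> Attributes {
        Attributes { order: Vec::new(), index: HashMap::new() }
    }

    fn insert(&mut self, name: &str) -> u16 {
        let number = self.order.len() as u16;
        self.order.push(name.to_string());
        self.index.insert(name.to_string(), number);
        number
    }

    // O(1) instead of scanning every attribute linearly.
    fn attribute_number(&self, name: &str) -> Option<u16> {
        self.index.get(name).copied()
    }
}

fn main() {
    let mut attrs = Attributes::new();
    attrs.insert("id");
    attrs.insert("title");
    assert_eq!(attrs.attribute_number("title"), Some(1));
    assert_eq!(attrs.order, vec!["id".to_string(), "title".to_string()]);
}
```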
Thanks to a previous issue (#47), a Schema can be built from a toml file.
The proposal for this issue is to handle the JSON format like the Toml format.
An example of the JSON schema:
```json
{
  "identifier": "id",
  "attributes": {
    "id": {
      "stored": true
    },
    "title": {
      "stored": true,
      "indexed": true
    },
    "description": {
      "stored": true,
      "indexed": true
    },
    "image": {
      "indexed": true
    }
  }
}
```
In other words, dump the index files. It is fairly easy to dump the `fst::Map` and the `doc_indexes`, but we need to dump `RocksDB` too, as an `SST` file if possible.
https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F
It seems that we will be forced to iterate over the `keys` and select the ones that are present in the `metas`. But if it is possible to remove keys while we do operations on the index (differences of negative and positive doc indexes), this should not be necessary and an easy big dump could be sufficient.
The real advantage of doing so is that we could give the new index to other search engines and keep a state even if the engine crashes. Ideally the state should be a unique file (e.g. a `tar.sz` file, tar combined with snappy).
It would be better to produce an SST file at index time: this kind of file is useful when the key-values can be generated ahead of time and the boot of the program that will query them must be fast.
https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files
Implement Algolia's distinct feature on the `RankedStream` struct using a `HashMap` and a counter. The `attributeForDistinct` will be given to the `RankedStreamBuilder`.
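A minimal sketch of the HashMap-plus-counter idea: count how many documents of each distinct key have already been yielded and drop the ones over the limit. The `distinct_by` function is illustrative, not the actual `RankedStream` API:

```rust
use std::collections::HashMap;

fn distinct_by<T, K, F>(items: Vec<T>, limit: usize, key: F) -> Vec<T>
where
    K: std::hash::Hash + Eq,
    F: Fn(&T) -> K,
{
    let mut seen: HashMap<K, usize> = HashMap::new();
    items
        .into_iter()
        .filter(|item| {
            let count = seen.entry(key(item)).or_insert(0);
            *count += 1;
            *count <= limit // keep at most `limit` documents per distinct key
        })
        .collect()
}

fn main() {
    // (groupId, documentId): keep at most 2 documents per group.
    let docs = vec![(1, "a"), (1, "b"), (1, "c"), (2, "d")];
    let kept = distinct_by(docs, 2, |&(group, _)| group);
    assert_eq!(kept, vec![(1, "a"), (1, "b"), (2, "d")]);
}
```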
This way it will be possible to run multiple instances of the same database and only test the query functions, the ones that do not affect the database content.
It can be done in the `http` server or in the `raptor` library.
Using the `log` crate can be a good start.
It is important for the library user to know what went wrong when something fails 😃
But it is also important for us to know the source of an error.
There is a big problem for the moment: we use RocksDB, which is a wrapper around a C++ library, so errors are just simple strings. We do not have any information on the kind of error. Is it an IO error, a logic error? Does the key that I want to remove not exist, or is it wrongly spelled?
We can probably create a workaround in the form of an error enum wrapping the RocksDB error String...
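A sketch of that workaround: an enum that keeps our own error kinds structured while wrapping the opaque RocksDB string as-is. The variant names are assumptions:

```rust
use std::fmt;
use std::io;

#[derive(Debug)]
enum Error {
    Io(io::Error),
    MissingKey(String),
    Rocksdb(String), // RocksDB only gives us a String, wrap it unchanged
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "io error: {}", e),
            Error::MissingKey(key) => write!(f, "missing key: {}", key),
            Error::Rocksdb(msg) => write!(f, "rocksdb error: {}", msg),
        }
    }
}

// `?` on io operations converts automatically into our error type.
impl From<io::Error> for Error {
    fn from(e: io::Error) -> Error {
        Error::Io(e)
    }
}

fn main() {
    let err = Error::Rocksdb("Invalid argument".to_string());
    assert_eq!(err.to_string(), "rocksdb error: Invalid argument");
}
```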
The problem that we faced while implementing this database was saving the blobs generated by our engine. The engine can work with positive and negative blobs, as explained in the deep-dive.
These blobs are currently written to the filesystem. The problem is that the order of the files must be saved; you can read more about why in issue #15. Additionally, the positive blobs are associated with an sst file, a file that contains all the key-values of the newly added documents, a format RocksDB can be populated with.
One of the features that the final database must have is being fed with updates (addition, deletion, update) while it is running. The big problem was: how can we ensure that the modifications received by the database are saved durably (according to the ACID properties)?
A temporary solution could be to use an existing key-value store like RocksDB, which already follows some of the ACID properties. It would be used to store the key-values of the newly indexed documents and the index files (`.idx`, `.map`), using column families as a separation.
When an update is made, we first store the index files under an incremented id as key, update the document key-values and set the current key id to the newly created one. We have a little problem here: ingesting an sst file cannot be done inside a transaction.
Currently only struct serialization is supported to create an update; in the future it would be great to serialize HashMaps, for example. This way we would allow users to send custom objects without needing to recompile.
The current schema does not know anything about the document's unique identifier; we must resolve that issue first (#22), so that users are able to insert documents without needing to compute the id themselves.
On a search with one or multiple words in the query, the user can choose whether each word can be searched as a prefix or not.
I propose 3 possibilities:
The current `CappedBTreeMap::insert` function pops out the smallest value when the capacity is reached, but it doesn't test whether the inserted one is smaller than the ones already in the tree, so the popped/returned value is not the smallest one.
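A sketch of a corrected capped insert: insert first, then pop the minimum, so a value smaller than everything already stored is the one that gets popped back out. `CappedBTreeSet` over plain `u32` values is a simplification of the actual map:

```rust
use std::collections::BTreeSet;

struct CappedBTreeSet {
    capacity: usize,
    inner: BTreeSet<u32>,
}

impl CappedBTreeSet {
    fn new(capacity: usize) -> CappedBTreeSet {
        CappedBTreeSet { capacity, inner: BTreeSet::new() }
    }

    // Returns the popped smallest value when the capacity is exceeded.
    // Inserting before popping guarantees the popped value really is
    // the smallest, even if it is the value we just inserted.
    fn insert(&mut self, value: u32) -> Option<u32> {
        self.inner.insert(value);
        if self.inner.len() > self.capacity {
            let smallest = *self.inner.iter().next().unwrap();
            self.inner.remove(&smallest);
            Some(smallest)
        } else {
            None
        }
    }
}

fn main() {
    let mut set = CappedBTreeSet::new(2);
    assert_eq!(set.insert(5), None);
    assert_eq!(set.insert(7), None);
    // 3 is smaller than everything stored: it must be popped back out.
    assert_eq!(set.insert(3), Some(3));
    // 9 is bigger: the current smallest (5) is popped instead.
    assert_eq!(set.insert(9), Some(5));
}
```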
Currently the document ranking doesn't seem to take into account whether a query word is at the end of the query or not. When a word is in the middle of a query, or is not followed by a space, this word is more likely to be finished/in its exact form.
For example, if a user queries `"cat and dog"`, we must understand that the word `"cat"` is a finished word, and documents containing the exact representation must be more interesting than documents containing words starting with `"cat"`, like `"catalog"` or `"catherine"`.
This is already one of the criteria used by default, but it is nearly the last criterion in the list (link to the code).
For the moment, the search statistics and analytics made on the server are not specific enough. We cannot know whether the search "pants" is unique or whether it was preceded by "p", then "pa", then "pan", then "pant" and finally "pants". This means we are unable to find the most searched words for an index.
It would be great to add a `sessionId` to the search request headers to keep a history of what a user searches. The goal is not to spy on users but just to have enough information to know what the final search is.
The search responses are only composed of an array of documents and, for each document, the list of matches. We might need more information for debugging, metrics, etc.
Here is some information that could be interesting to find in a response.
Process:
Pagination:
Debugging:
It will give better performance.
Currently, if a query word is 4 characters or fewer, a Levenshtein automaton is created with an allowed distance of 0; it is used only to match prefixes, and it is much slower than combining the `stream` and `starts_with` map methods.
The best way to make this possible is probably to define an enum that is either a `stream` + `starts_with` combination or a `Levenshtein` automaton. We would keep the `Levenshtein` pool, but it would live under this wrapper.
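A sketch of the wrapper's shape: short words get a cheap prefix matcher, longer ones a Levenshtein matcher. The real implementation would dispatch to fst streams and pooled automata; here a naive distance function stands in, since only the enum structure is the point:

```rust
enum WordMatcher {
    StartsWith(String),
    Levenshtein { word: String, max_distance: u32 },
}

// Naive dynamic-programming edit distance, standing in for the automaton.
fn levenshtein(a: &str, b: &str) -> u32 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i as u32 + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

impl WordMatcher {
    fn new(query: &str) -> WordMatcher {
        // 4 characters or fewer: prefix match only, distance 0.
        if query.chars().count() <= 4 {
            WordMatcher::StartsWith(query.to_string())
        } else {
            WordMatcher::Levenshtein { word: query.to_string(), max_distance: 1 }
        }
    }

    fn matches(&self, candidate: &str) -> bool {
        match self {
            WordMatcher::StartsWith(prefix) => candidate.starts_with(prefix),
            WordMatcher::Levenshtein { word, max_distance } => {
                levenshtein(word, candidate) <= *max_distance
            }
        }
    }
}

fn main() {
    assert!(WordMatcher::new("cat").matches("catalog"));
    assert!(WordMatcher::new("chocolat").matches("chocolate"));
    assert!(!WordMatcher::new("chocolat").matches("vanilla"));
}
```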
We must be able to create a Schema from a file. The format could be toml, for example.
It could be something of this form; the order in which the attributes are declared is important.
There is a pull request about that in the toml repository.
```toml
identifier = "id"

[attributes."id"]
stored = true

[attributes."title"]
stored = true
indexed = true

[attributes."description"]
stored = true
indexed = true

[attributes."image"]
stored = true
```
...that will allow us to store nodes with values in a simple memory buffer, removing most of the cache misses while iterating over the tree.
A good way for nodes to point to other nodes is to use node indexes into the memory buffer instead of raw pointers; this allows us to grow the node pool.
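A minimal sketch of such an index-based node pool. Because `NodeId` is an index into the backing `Vec`, the buffer can reallocate as it grows without invalidating any reference between nodes (the node contents here are placeholders):

```rust
#[derive(Copy, Clone, PartialEq, Debug)]
struct NodeId(usize);

struct Node {
    value: u32,
    children: Vec<NodeId>, // indexes into the pool, never raw pointers
}

struct NodePool {
    nodes: Vec<Node>, // one contiguous buffer holds every node
}

impl NodePool {
    fn new() -> NodePool {
        NodePool { nodes: Vec::new() }
    }

    fn alloc(&mut self, value: u32) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node { value, children: Vec::new() });
        id
    }

    fn add_child(&mut self, parent: NodeId, child: NodeId) {
        self.nodes[parent.0].children.push(child);
    }

    fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.0]
    }
}

fn main() {
    let mut pool = NodePool::new();
    let root = pool.alloc(1);
    let child = pool.alloc(2);
    pool.add_child(root, child);
    let first_child = pool.get(root).children[0];
    assert_eq!(pool.get(first_child).value, 2);
}
```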
The engine uses a lot of magic numbers, mostly during the indexing phase: the typo tolerance configuration, the number of words indexed in a field, the biggest possible proximity, etc.
This issue should allow the end user to configure all these parameters.
One possibility is to store them in the config already saved in the KV store. The search and indexing config parts could be separated.
The distinct feature (see #12) allows de-duplicating objects based on a generic function.
The implementation keeps the possibility to choose the number of results returned per object family.
The proposition is to allow users to choose between 3 options:
`attributeForDistinct = [groupId]`
`attributeForAutoDistinct = [color]`
```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:1, id:12, name: t-shirt marine, color:blue }
{ groupId:1, id:13, name: t-shirt marine, color:white }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:15, name: t-shirt, color:blue }
{ groupId:2, id:16, name: t-shirt, color:marine }
{ groupId:2, id:17, name: t-shirt, color:red }
```
Query: T-shirt marine
```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:2, id:14, name: t-shirt, color:marine }
```

Simply distinguished by `groupId`.

```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:16, name: t-shirt, color:marine }
```

Because the `marine` color is not present in `groupId:1` but is in `groupId:2`.

```
{ groupId:1, id:11, name: t-shirt marine, color:red }
{ groupId:1, id:12, name: t-shirt marine, color:blue }
{ groupId:1, id:13, name: t-shirt marine, color:white }
{ groupId:2, id:14, name: t-shirt, color:marine }
{ groupId:2, id:16, name: t-shirt, color:marine }
```

Because `marine` matches once in the `color` attribute.
We must think about a way to communicate with the database to insert, delete and update documents.
The database will be listening on a TCP port and should be queryable from any language (i.e. Rust, Ruby, Python, Javascript...). The protocol must be versioned.
We thought that the first protocol to implement could be a full-text one that follows some of the Redis protocol specifications. This way it will be easy to implement in other languages.
But a second version of the protocol could be oriented toward something else, more like a binary format for example.
The query distinct function calls the user-given distinct function once inside the raw bucket sort pass and another time inside the final aggregation loop.
The user-given distinct function is always CPU-consuming, because it fetches document fields from the key-value store.
It would be interesting to store the outputs of the first distinct function calls.
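A sketch of that caching idea: the first pass fills a `HashMap` keyed by document id, and the aggregation pass reuses it instead of calling the expensive user function again. The `CachedDistinct` type and its `String` keys are illustrative:

```rust
use std::collections::HashMap;

struct CachedDistinct<F> {
    func: F,                      // user-given distinct function
    cache: HashMap<u64, String>,  // doc id -> already-computed distinct key
}

impl<F: Fn(u64) -> String> CachedDistinct<F> {
    fn new(func: F) -> CachedDistinct<F> {
        CachedDistinct { func, cache: HashMap::new() }
    }

    // Only the first call per document id pays the key-value store cost.
    fn key(&mut self, doc_id: u64) -> &str {
        let func = &self.func;
        self.cache.entry(doc_id).or_insert_with(|| func(doc_id))
    }
}

fn main() {
    use std::cell::Cell;
    let calls = Cell::new(0);
    let mut distinct = CachedDistinct::new(|doc_id| {
        calls.set(calls.get() + 1); // simulate the expensive store fetch
        format!("group-{}", doc_id % 2)
    });
    assert_eq!(distinct.key(4), "group-0");
    assert_eq!(distinct.key(4), "group-0"); // second call hits the cache
    assert_eq!(calls.get(), 1);
}
```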
This way, reading this entry will require only one allocation, of a known size.
The `PositiveBlob` is composed of an `fst::Map` and a `DocIndexes` struct; these two structures are one after the other in memory. We could use a shared `Vec` to construct the `Map` and the `DocIndexes`.
Currently the schema knows which fields of the document should be indexed and stored. Using the `SchemaBuilder` struct we declare each field in order of importance, and the document id is unknown to the finally built schema.
The advantage of this solution is that it will be impossible to serialize a document that does not have an id: the serializer will not be able to generate the appropriate key, and therefore not be able to store the document in the database.
We must absolutely keep some kind of "raw" API that will allow us to store a document under any id and with any attributes.
How should we modify the `SchemaBuilder` API? What if the user wants the id to be stored and indexed?
We will probably need to transform the id (i.e. hash it) to be able to save it in the database; when a user needs to identify a document (i.e. to delete it), they will refer to its original id, not the transformed one.
We must find a better dataset. A set of summaries of some important Wikipedia pages, for example.
Create a `RankedStreamBuilder` with settings and stuff to customise the sorting algorithm's settings.
This builder produces a `RankedStream` which streams the results using a `StreamWithState` (with the levenshtein state) and sorts the results with some possible optimisations.
The internal `StreamWithState` returns each matching word (with the right levenshtein distance) associated with the document id, the word's index in the query, the attribute where the word is located and the position of the word in this attribute.
This is raw data that must be sorted to match the following rules (for proximity, `George Clooney` is better than `George word Clooney`).
The first step to follow the rules is to collect results (document id, word index in the query, attribute and position) in a collection to help sort them.
But these rules must be applied using bucket sorting: if, after sorting the results using the first rule, there are matches that are equal, use the next rule to separate them, and so on whenever equality is detected.
The typo rule is not the most complex to implement: sort documents according to the levenshtein distance only. If there are multiple words, take the maximum of the minimum query-word distances.
The words rule is straightforward: the index of the word in the query is returned in the results of the internal `StreamWithState` stream. This information can be used to know which query word a result refers to, and so simply count the number of different matching words (`[1, 2, 3]` is better than `[1, 2]` or `[1, 1, 1]`).
The proximity rule can be implemented by sorting the results by attribute first, by the word index in the query next, and finally by the index in the attribute. Once the results for each document are sorted, sort documents by the smallest sum of distances between each query word.
The attribute rule has already been fulfilled by the previous rule application.
The words-position rule has already been granted by the previous sorting: select the different word indexes in the query, get the one that is nearest to the start of the attribute, and rank documents based on these values (lower is better).
The exact rule is not defined for the moment.
The best way to implement all of these rules is to create a collection for each document that matches something. Documents can be added as the search progresses; the collections contain structures with the index of the word in the query, followed by the distance of the match, the attribute in which the word matched and finally the index of this word in the attribute.
The `RankedStream` stream can be set up to return a maximum number of results, so optimisations are possible.
The internal `StreamWithState` stream gives results in lexicographic order; each result (word) is associated with a given number of document ids, positions... This information is used to rank the results.
The order in which the internal stream gives results is useful to know whether better matching words can still appear in the future.
If the stream has a limited number of results to return, some results are already known to be "frozen" ones: no result can be better than these.
For example: a document that contains exactly the query word (with a levenshtein distance of zero) at the first position of the first attribute cannot be moved out; it is called "frozen" and can be returned immediately.
Another example: for a one-word query, the internal stream gives matching words in lexicographic order. So once the words matching exactly (with a levenshtein distance of zero) have been returned, and there are enough of them, there is no reason to ask the internal stream for more results, since the next ones will have a bigger distance; the currently accumulated results can already be sorted and returned.
It could be easier for the user to create an update without caring about whether it is a positive or a negative one.
It must be an update without any distinction: the user should be able to insert document field updates, remove document fields or completely remove documents.
What is incremental indexing?
Currently the engine takes an index composed of 3 files: one for the finite state transducer (`.fst`), one for the list of document ids associated with the index where each word matches (a `DocIndex`) (`.idx`) and one that represents the key-values used to feed RocksDB (`.sst`).
This kind of index is immutable, and that is the power of the data structure; you can read more in the fst repo. I will use the term "index" in the following lines to refer to this batch of files that work together.
A use case that should/must be managed by this library is the possibility to add and remove documents in real time without needing to re-index all the documents. This feature is really important when new results must be exposed within a small amount of time.
Another great feature to have is a way to return only the documents that should be yielded. An API like the following would be perfect.
```rust
fn filter_function(doc: DocumentId, view: &DatabaseView<D>) -> bool {
    // ...
}

let builder = view.query_builder()?;
let builder = builder.with_filter(filter_function);
let documents = builder.query("chateau", 0..20);
```
We must rethink how the query builder system works.
For the moment a user can create a `QueryBuilder` without any parameter; if they need to apply a distinct rule on it, they can create a `DistinctQueryBuilder` from it. This is a new type with its own, really custom, query function.
If we want to introduce a new way to filter documents, we will multiply the number of types and it will be hard to combine kinds of filters (i.e. filter, distinct). An idea could be to introduce a `Query` trait that every filter implements; this way we will be able to introduce query filters that wrap other query filters.
We need to be careful: the query filter must always be applied before the query distinct, for example. A type restriction could be implemented.
If the user has a database with documents translated into multiple languages, the best practice is to create one index per language. In this case, it would be great to let the end user choose a default language for each index. This will apply the tokenizer and the default stop words for this language during indexing.
Of course, this option should be totally optional. By default, no stop words will be used; the tokenizer will split on spaces and use default sentence separators like dots, commas, etc.
After some time experimenting with a basic sorting method that sorts all documents at once, it appears that it was not the best idea.
The new algorithm should use a bucket sort system, a sort that is only applied to sub-groups, criterion by criterion. Buckets are sorted by a criterion and split; each sub-bucket is then sorted again and split by the next criterion. The advantage of this technique is that only the minimum number of buckets is processed: the buckets that will not contain the returned documents are never sorted.
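The bucket sort described above can be sketched as a recursion over criteria: sort by one criterion, split into groups of equal rank, and only recurse into the groups needed to fill the requested number of results. Documents are plain `u32` values here and the criteria are toy ranking functions, both stand-ins for the real types:

```rust
type Criterion = fn(&u32) -> u32;

fn bucket_sort(docs: &mut [u32], criteria: &[Criterion], limit: usize) {
    let Some((first, rest)) = criteria.split_first() else { return };
    docs.sort_by_key(|d| first(d)); // stable sort by the current criterion
    // Walk the groups of equal rank, recursing only until `limit` is covered.
    let mut start = 0;
    while start < docs.len() && start < limit {
        let rank = first(&docs[start]);
        let mut end = start;
        while end < docs.len() && first(&docs[end]) == rank {
            end += 1;
        }
        // Refine this bucket with the remaining criteria.
        bucket_sort(&mut docs[start..end], rest, limit - start);
        start = end;
    }
    // Buckets entirely past `limit` are never touched by the next criteria.
}

fn main() {
    // First criterion: parity; second criterion: the value itself.
    let criteria: [Criterion; 2] = [|d| d % 2, |d| *d];
    let mut docs = vec![5, 2, 9, 4, 1];
    bucket_sort(&mut docs, &criteria, 5);
    assert_eq!(docs, vec![2, 4, 1, 5, 9]);
}
```

Only the buckets that can contribute to the first `limit` results pay the cost of the later, more expensive criteria.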
Soft updates are updates that can be ingested by the database without any CPU overhead.
Unlike heavy updates, they don't carry any fst-docids entry, therefore no blob merge is needed and entry modifications can be seen in realtime.
Obviously, soft updates can only be done on non-indexed attributes.
Allowing this can be useful for custom ranking rules using non-indexed attributes that change frequently, like a `number_of_clicks` attribute, for example.
The search engine paginates by 20 documents by default. This value should be configurable.
The current update ingestion implementation recomputes the reverse index.
This is an important CPU cost, mostly when adding a small number of documents.
We can't deal with this anymore!
We must be able to accept updates of any size and make them available to search with minimal overhead.
We thought about something similar to database caching algorithms: putting the most recently used entries in memory to provide rapid access to them. When too many entries are cached, a compaction is performed (i.e. some of sled's future strategy).
So, basically, when a user proposes some document update/insertion/deletion, we will not put every document attribute update inside an SST file, paying the cost of keeping words-docindex ordered in RAM and constructing a new FST.
Inserting words-docindex and document attribute values directly into the key-value store is the solution. Internally the database will extract the words of the indexed attributes and update the special "revindex" keyspace, dedicated to reverse index data, storing the `DocIndex` associated to words as key-values, like the current FST does.
The advantage of a key-value store is that, like a BTree and an FST, keys are ordered, making `O(log n)` unions possible between these data structures at search time.
Obviously, we should find a rule to trigger compaction; it must be based on the latency involved in doing the union, probably related to the number of words/doc-indexes stored as raw keys. A compaction will be an update of the FST, integrating the "revindex" words-docindex into the FST.