quickwit-oss / quickwit
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
Home Page: https://quickwit.io
License: Other
Indexes a Newline Delimited JSON (NDJSON) dataset located at input-path or read from stdin. The data is appended to the target index specified by index-path unless overwrite is passed. input-path can be a file or a directory. In the latter case, the directory is traversed recursively. Local and remote datasets (S3) are supported. By default, the process uses 4 threads and 1GiB of memory per thread. The num-threads and heap-size options allow for customizing those settings.
quickwit index
--index-path <path>
[--input-path <path>]
[--num-threads <num threads>]
[--heap-size <num bytes>]
[--overwrite]
--index-path
(string) Specifies the location of the target index.
--input-path
(string) Specifies the location of the NDJSON-formatted source dataset.
--num-threads
(integer) Number of allocated threads.
--heap-size
(integer) Amount of allocated memory.
--overwrite
(boolean) Overwrites existing data.
Indexing a local dataset
quickwit index --index-path s3://quickwit-indexes/nginx --input-path nginx.json
Indexing a remote dataset
quickwit index --index-path s3://quickwit-indexes/nginx --input-path s3://quickwit-datasets/nginx.json
Indexing a dataset from stdin
cat nginx.json | quickwit index --index-path s3://quickwit-indexes/nginx
quickwit index --index-path s3://quickwit-indexes/nginx < nginx.json
Reindexing a dataset
quickwit index --index-path s3://quickwit-indexes/nginx --input-path nginx.json --overwrite
Customizing the resources allocated to the process
quickwit index --index-path s3://quickwit-indexes/nginx --input-path nginx.json --num-threads 8 --heap-size 16GiB
Before completing, the command displays a usage example of quickwit search for the target index (see #4 for more details about the search command).
guilload@modern14 ~> quickwit index --index-path s3://quickwit-indexes/nginx < nginx.json
Indexed 12345 docs in 3s.
You can now query your index with `quickwit search --index-path s3://quickwit-indexes/nginx --query "barack obama"`
The command creates and uploads splits to index-path and updates the meta store file. When appending data, merging newly generated splits with "older" splits is out of scope for this iteration. However, the command should create a new split every 5 million documents. The command should also be able to generate and upload two splits concurrently. We want a Quickwit split to be comprised of a single Tantivy segment, so we should wait for Tantivy to merge all the segments before uploading a split. The command should be recoverable, i.e. we should be able to hit Ctrl+C to interrupt the process and then resume right where we left off. Tantivy has a mechanism to embed metadata within a commit, which should help to implement this behavior. Finally, the command should display some statistics (number of docs indexed, throughput, ...) as the indexing progresses.
Assignee(s), please complete this paragraph further as you're going through the implementation.
Assignee(s), please describe in a few sentences how you intend to test this command.
Quickwit needs a way to keep track of the list of splits available for a given index. Currently the metastore's implementation is backed by PostgreSQL.
However, deploying PostgreSQL might be a cumbersome requirement for the user who just wants an index for a single large static dataset in their datalake, or who wants to test out whether Quickwit's search is working as advertised before throwing themselves into deploying multiple services.
For this reason, there will be different implementations of the metastore, the most constrained one being stored directly on S3.
Note that some of these operations imply some interaction with the storage. This is not the role of the meta store, and the correct implementation of its client application is assumed. The meta store semantics, however, need to make this coordination possible... (See section below)
Quickwit needs a way to ensure that we can cleanup unused files, and this process needs to be resilient to any fail-stop failures.
Because our metastores are typically more constrained than a filesystem (e.g. object storage), we cannot rely on a write-ahead log. We also do not want to rely on a strategy that would require listing the files in the store, because:
Instead, we rely on atomically shifting the status of splits.
A split goes through the following lifecycle:
Building -stage-> Staged -publish-> Published -markAsDelete-> MarkAsDeleted -delete-> Deleted
Building splits and deleted splits are not states that are registered in the metastore, and should have no durable left-over anywhere.
We maintain the following invariant:
If a split has files in the storage, it MUST be registered in the meta store, and its state can be either
Before creating any file, we need to stage the split. If there is a failure, upon recovery we schedule all of the staged splits for deletion.
A client may not necessarily remove files from storage right after marking a split as deleted. A CLI client may delete files right away, but a more serious deployment should probably only delete these files after a period of time, to give running search queries time to finish.
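The lifecycle above can be read as a small state machine. The following is a minimal sketch, not the metastore's actual code: SplitState, SplitOp and apply are hypothetical names, and only the states that are registered in the metastore are modeled. Transitions the text does not mention (for example, marking a staged split as deleted) are rejected here, which is an assumption.

```rust
/// States registered in the metastore. Building and Deleted are not durable,
/// so they are not represented here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SplitState {
    Staged,
    Published,
    MarkedAsDeleted,
}

/// Operations a client can request on a registered split (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SplitOp {
    Publish,
    MarkAsDeleted,
    Delete,
}

/// Returns the state after applying `op`, `Ok(None)` when the split is removed
/// from the metastore, or `Err` when the transition is not allowed.
pub fn apply(state: SplitState, op: SplitOp) -> Result<Option<SplitState>, String> {
    use SplitOp::*;
    use SplitState::*;
    match (state, op) {
        // Publishing is idempotent: an already published split stays published.
        (Staged, Publish) | (Published, Publish) => Ok(Some(Published)),
        // Marking as deleted is idempotent as well.
        (Published, MarkAsDeleted) | (MarkedAsDeleted, MarkAsDeleted) => Ok(Some(MarkedAsDeleted)),
        // Only staged or marked-as-deleted splits can be deleted.
        (Staged, Delete) | (MarkedAsDeleted, Delete) => Ok(None),
        (state, op) => Err(format!("cannot apply {:?} to a split in state {:?}", op, state)),
    }
}
```

With this model, a recovery procedure would call apply(state, SplitOp::Delete) on every staged split, and deleting a published split is refused.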
/// Stages a split.
/// A split needs to be staged BEFORE uploading any of its files to the storage.
/// The SplitId is returned for convenience, but it is not generated
/// by the metastore. In fact, it is supplied by the client and is present in the split manifest.
fn stage_split(split_manifest: SplitManifest) -> Result<SplitId, StageError>;
enum StageError {
InvalidManifest(...),
ExistingSplitId(...),
InternalError(...),
}
/// A split manifest carries all meta data about a split
/// and its files.
struct SplitManifest {
/// the index it belongs to or is destined to belong to.
index_id: String,
/// The set of information required to open the split,
/// and possibly allocate the split.
metadata: SplitMetaData,
/// The list of files associated to the split.
files: Vec<ManifestEntry>,
}
struct SplitMetaData {
/// Split uri. In spirit, this uri should be self sufficient
/// to identify a split.
/// In reality, some information may be implicitly configured
/// in the store uri resolver, such as the S3 region.
split_uri: String,
/// Number of records (or documents) in the split
num_records: u64,
/// Weight in bytes of the split
size_in_bytes: u64,
/// If a time field is available, the min / max timestamp in the segment.
time_range: Option<Range<i64>>,
/// Number of merges this segment has been subjected to during its lifetime.
generation: usize,
}
/// Records a split as published.
///
/// This API is typically used by an indexer who needs to publish a new split.
/// At this point, the split files are assumed to have already been uploaded.
/// The metastore only updates the state of the split from staging to published.
///
/// It has two side effects:
/// - it makes the split visible to other clients via the ListSplit API.
/// - it guards eventual recovery procedure from removing the split files.
///
/// Unless specified otherwise in the implementation, the metastore DOES NOT
/// do any check on the presence, integrity or validity of the metadata of this split.
///
/// A split successfully published MUST eventually be visible to all clients.
/// Stronger consistency semantics should be documented in the implementation.
///
/// If the split is already published, this API call returns a success.
fn publish_split(split: SplitId) -> Result<(), PublishError>;
enum PublishError {
DoesNotExist(SplitId),
SplitIsNotStaged {
split: SplitId,
previous_state: State
},
InternalError(...)
}
/// Returns the list of published splits intersecting with the given time_range.
/// Regardless of the timerange filter, if a split has no timestamp it is always returned.
/// Splits are returned in any order.
fn list_splits(index: IndexId, state: State, time_range: Option<TimeRange>) -> Result<Vec<SplitMeta>, ListSplitsError>;
enum ListSplitsError {
IndexDoesNotExist(IndexId),
InternalError(...)
}
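As an illustration of the filtering rule described for list_splits, here is a minimal sketch. It assumes half-open Range<i64> values like the time_range field of SplitMetaData; the function name split_matches is hypothetical and not part of the proposed API.

```rust
use std::ops::Range;

/// Returns true when a split should be listed for the given query range.
pub fn split_matches(split_range: Option<&Range<i64>>, query: Option<&Range<i64>>) -> bool {
    match (split_range, query) {
        // No time range filter: every split matches.
        (_, None) => true,
        // A split without timestamps is always returned, whatever the filter.
        (None, Some(_)) => true,
        // Two half-open ranges intersect when each one starts before the other ends.
        (Some(split), Some(query)) => split.start < query.end && query.start < split.end,
    }
}
```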
List splits is used:
/// Marks a split as deleted.
/// This function is successful if a split was already marked as deleted.
fn mark_as_deleted(split_id: SplitId) -> Result<(), MarkAsDeletedError>;
enum MarkAsDeletedError {
DoesNotExist(SplitId),
InternalError(...)
}
///
/// This function only accepts splits that are in the staged or marked-as-deleted state.
fn delete_split(split_id: SplitId) -> Result<(), DeleteSplitError>;
enum DeleteSplitError {
DoesNotExist(...),
Forbidden(...), // deleting a published split is forbidden
InternalError(...)
}
We simply serialize all of the splits associated to a single index into one large file.
We do not enforce any lock, and having several concurrent writers can lead to corruption (similarly to several datalake meta stores).
The file has the following format.
{
"index": {
// Possible index metas
},
"splits": {
<split_id_1>: {
"metadata": <SplitMetaData>,
"files": [<manifest_entries>],
}
}
}
Right now the role of DocMapperType is a bit difficult. Its only role is to help serialize. Right now the type is too visible.
We could hide it more, or even remove it and rely on typetag.
https://github.com/dtolnay/typetag
Right now it happens before merging, so its content is incorrect.
Let's quote Elasticsearch doc:
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.
As mentioned here, tantivy already provides a schema, which defines the set of fields for a given index and how they are stored. The user has yet to define how input documents (typically in JSON) are converted to the data that will be in the index.
Tantivy already provides some useful information in the schema about how data needs to be stored (see indexing options). But we may need something more to handle other cases:
I would do it in tantivy but we can start a "dumb" implementation for quickwit's need.
The mapping is a json (or yaml?) file and sufficient to create the index schema.
To start with, we can just use tantivy schema as it will do the job.
{
"store_source": true,
"ignore_unknown_fields": true,
"default_search_fields": [
"body", "title"
],
"properties": [
{
"source_name": "timestamp",
"target_name": "timestamp2",
"type": "u64",
"indexed": true,
"stored": true
},
{
"source_name": "severity_text",
"type": "text",
"record": "basic",
"tokenizer": "raw",
"stored": true
},
{
"source_name": "resources.label",
"type": "my_custom_text"
},
{
"source_name": "meta",
"type": "array<organization>"
},
{
"source_name": ""
}
],
"field_types": [
{
"name": "my_custom_text",
"type": "text",
"analyzer": "my_analyzer"
},
{
"name": "organization",
"type": "object",
"properties": [
{
"source_name": "name",
"type": "text"
},
{
"source_name": "created_at",
"type": "u64"
}
]
}
]
}
We talked about having 3 types of doc mapping but I don't think this is relevant. A mapping is specified as a JSON or YAML file and that's all. But we can also provide a basic template that works with the OpenTelemetry JSON format.
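struct DocMapper {
    // `str` is unsized and cannot be stored in a Vec; owned Strings are used instead.
    meta: Vec<String>,
    properties: Vec<FieldMapper>,
}
impl DocMapper {
    /// Builds a Document from a JSON document according to the mapping.
    pub fn doc_from_json(&self, doc_json: &str) -> Result<Document, MappingError> {
        todo!() // body not specified yet
    }
}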
We want
To run on each PR.
Describe the bug
On MS Windows 10, since the standard path separator is \, usage of Path::join creates indices that cannot be found on *nix platforms.
Steps to reproduce the behavior:
cargo run -- new --index-uri s3://quickwit-dev/evance/data --no-timestamp-field --doc-mapper-type wikipedia
s3://quickwit-dev/evance\data\quickwit.json
Expected behavior
Index manipulation should be interoperable across all platforms we support
System configuration:
Windows 10,
Rustup 1.24.3 31/05/2021
Rustc 1.52.1 09/05/2021
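One possible direction, shown as a minimal sketch rather than the actual fix: build URIs with an explicit '/' separator instead of Path::join, so the result does not depend on the host platform. The helper name join_uri is hypothetical.

```rust
/// Joins a base URI and a relative key with '/', independently of the
/// platform's path separator.
pub fn join_uri(base: &str, child: &str) -> String {
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        child.trim_start_matches('/')
    )
}
```

On any platform, join_uri("s3://quickwit-dev/evance/data", "quickwit.json") yields s3://quickwit-dev/evance/data/quickwit.json.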
Users should be able to search on an interval with a single bound (t > start_time) instead of being forced to define an entire interval.
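A minimal sketch of what a single-bound filter could look like, using the standard library's Bound type. TimeFilter and its methods are hypothetical names, not an existing Quickwit API.

```rust
use std::ops::{Bound, RangeBounds};

/// A time filter where either bound may be omitted.
pub struct TimeFilter {
    pub start: Bound<i64>,
    pub end: Bound<i64>,
}

impl TimeFilter {
    /// Builds the filter `t > start_time`, with no upper bound.
    pub fn after(start_time: i64) -> Self {
        TimeFilter {
            start: Bound::Excluded(start_time),
            end: Bound::Unbounded,
        }
    }

    /// Returns true when `timestamp` falls within the filter.
    pub fn matches(&self, timestamp: i64) -> bool {
        // A pair of bounds implements RangeBounds, which provides `contains`.
        (self.start, self.end).contains(&timestamp)
    }
}
```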
Currently, the index command of quickwit-cli can only be used with --index-uri file://. Any other protocol raises an UnsuportedProtocol error.
We need to implement a Metastore that interacts with the S3 storage interface.
We could also make SingleFileMetastore support the S3 protocol.
The current code looks very AWS-centric. What about MinIO support for those who don't want to be tied to a cloud provider?
Term frequencies are only useful for
These are either impossible today or fairly rare.
We could reduce the amount of downloaded data if we changed the tantivy codec to stop interleaving doc frequency blocks and doc id blocks in the posting list (provided we do not observe a performance regression).
The warm_postings could then take an IndexRecordOption argument instead of need_position.
The following command
cargo run -- new --index-uri file:///home/fulmicoton/datasets/wikipedia-idx --no-timestamp-field --doc-mapper-type wikipeda
creates an index with the default mapper.
This is more of a vague idea than a fully fleshed-out one.
I am a huge fan of this extension:
https://marketplace.visualstudio.com/items?itemName=quicktype.quicktype
The idea is as follows...
When creating a model in your favorite language, you paste an example json object and the extension creates the struct (or Class) automatically, with a lot of boilerplate.
You can then tweak it to your preference.
It is a massive timesaver and avoids a lot of typo related friction.
We could use the same idea in the CLI for instance, by allowing the user to copy paste a schema... or even sniff it from the clipboard...
We could also offer a webapp that helps creating the quickwit.json file with a good looking mapping straight from json.
cat wiki-1000.json | RUST_LOG=debug cargo run -- index --index-uri file:///home/fulmicoton/datasets/wikipedia-idx --heap-size 1000000000 --num-threads 2
fails with
Jun 04 13:33:06.374 DEBUG quickwit_cli: indexing index_uri=file:///home/fulmicoton/datasets/wikipedia-idx input_uri=None temp_dir=/tmp num_threads=2 heap_size=1000000000 overwrite=false
Please enter your new line delimited json documents.
[quickwit-core/src/indexing/split_finalizer.rs:76] "he" = "he"
Jun 04 13:33:06.376 INFO tantivy::indexer::segment_updater: save metas
Jun 04 13:33:06.377 DEBUG tantivy::directory::mmap_directory: Atomic Write ".managed.json"
Jun 04 13:33:06.379 DEBUG tantivy::directory::mmap_directory: Atomic Write "meta.json"
Jun 04 13:33:06.380 DEBUG tantivy::indexer::segment_updater: Saved metas Ok("{\n \"index_settings\": {\n \"sort_by_field\": null\n },\n \"segments\": [],\n \"schema\": [],\n \"opstamp\": 0\n}")
Jun 04 13:33:07.149 DEBUG tantivy::directory::mmap_directory: Releasing lock ".tantivy-writer.lock"
Command failed: Could not get the index_id
The URI resolution scheme for the file system can have some important risk implications.
We also want to handle file:///var/local and the incorrect but common file://var/local format.
We need to make sure someone cannot access files that are not under the root directory, using the "../" scheme for instance.
Generally speaking, we should put some thought into it.
UPDATE:
We want to enforce that the paths accessed by the FileStorage are child paths of the FileStorage root.
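A minimal sketch of such a check, assuming a deliberately strict policy: any ".." component, absolute path or Windows prefix is refused. The function name resolve_under_root is hypothetical, and symbolic links are not considered here.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolves `relative` under `root`, refusing anything that could escape it.
pub fn resolve_under_root(root: &Path, relative: &Path) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    for component in relative.components() {
        match component {
            // Plain file or directory names are appended to the root.
            Component::Normal(part) => resolved.push(part),
            // "." does not change the location.
            Component::CurDir => {}
            // "..", absolute paths and Windows prefixes are all rejected.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}
```

Rejecting every ".." is stricter than necessary (a/../b stays under the root), but it avoids having to reason about normalization.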
The trait does not explain the point of staging, mark_split_as_deleted, etc.
A reader could think for instance that delete_split would remove the files associated to the split in the storage.
The searchers receiving a root request need to dispatch the leaf work among the search servers.
We want to handle the peer discovery problem using the SWIM gossip algorithm.
Before plugging the peer discovery, please submit a documented trait describing the public API that will be consumed by the root handler.
This is only relevant after the merge of #18.
We should handle "file://" as a valid protocol and allow people to store and consume files on their local file system.
I got this
cargo r index --index-uri file:///Users/fmassot/Documents/quickwit/repos/quickwit/my-index --input-path wiki-articles-1000.json
Documents: 10000 Errors: 0 Splits: 1 Dataset Size: 11MB Throughput: 11.24MB/s
Documents: 10000 Errors: 0 Splits: 1 Dataset Size: 11MB Throughput: 5.62MB/s
Documents: 10000 Errors: 0 Splits: 1 Dataset Size: 11MB Throughput: 3.75MB/s
Documents: 10000 Errors: 0 Splits: 1 Dataset Size: 11MB Throughput: 2.81MB/s
I don't understand the first constant number.
This is clearly not a bug, but I expect some users will make the same mistake as me.
I expected the index-uri to be an absolute URI, but if you have only two slashes it will be relative.
So having file://folder1/folder2/folder3/folder4/wikipedia will create 5 folders because I forgot the slash.
I suggest supporting only absolute paths.
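To make the suggestion concrete, here is a minimal sketch of a parser that only accepts absolute file URIs. The function name parse_file_uri and the error messages are hypothetical; this is not how the CLI currently behaves.

```rust
/// Extracts the path of a `file://` URI, accepting only absolute paths.
pub fn parse_file_uri(uri: &str) -> Result<&str, String> {
    let path = uri
        .strip_prefix("file://")
        .ok_or_else(|| format!("not a file URI: {}", uri))?;
    if path.starts_with('/') {
        Ok(path)
    } else {
        // Two slashes followed by a name: most likely a forgotten third slash.
        Err(format!("expected an absolute path such as file:///{}", path))
    }
}
```

With this behavior, file://folder1/folder2 is reported as an error instead of silently creating relative folders.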
Currently, we are doing the node discovery by polling DNS entries in Kubernetes.
However, it takes a few seconds after a new node is added before it becomes available.
So, we will try to create a sample project to consider if it is possible to do the node discovery more easily using P2P technology.
quickwit new ...
creates a new index.
quickwit new <index name>
--index-path <path>
{--timestamp-field <field name> | --no-timestamp-field}
[--overwrite]
--index-path <path>
(string) Defines where the index is created.
--timestamp-field <field name>
(string) Creates a time-series index.
--no-timestamp-field
(boolean) Creates a non time-series index.
--overwrite
(boolean) Overwrites existing index.
The timestamp-field and no-timestamp-field options are mutually exclusive.
quickwit new <index name> --index-path <path> --timestamp-field <field name>
quickwit new nginx --index-path s3://quickwit-indexes/nginx --timestamp-field ts
quickwit new <index name> --index-path <path> --no-timestamp-field
quickwit new nginx --index-path s3://quickwit-indexes/nginx --no-timestamp-field
Before completing, the subcommand displays a usage example of quickwit index for the index that was just created (see #3 for more details about the index subcommand).
guilload@modern14 ~> quickwit new nginx --index-path s3://quickwit-indexes/nginx --timestamp-field ts
Creating new index at s3://quickwit-indexes/nginx...
Done!
You can now [...] using `quickwit index --index-path s3://quickwit-indexes/nginx --input-path <path to file or directory>`
Assignee(s), please describe in a few sentences how you intend to implement and test this feature.
Ideally, the meta store should not require a call to open.
The current interface is meant to be backed by a more serious backend in the future (e.g. PostgreSQL, FoundationDB, or whatnot).
open is a quirk associated with the single file implementation.
It is not actually necessary. We could use the internal state as a cache and lazily "open" the index upon a call if it is not loaded yet.
You can probably remove the method from the trait, make the current implementation private, and lazily call it from methods like read, etc.
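The lazy "open" idea could look roughly like the sketch below. It is only an illustration: LazyMetastore, its loader closure and list_split_ids are hypothetical, and the real metastore state is richer than a list of split ids.

```rust
use std::sync::Mutex;

/// Hypothetical single-file metastore that loads its state on first use.
pub struct LazyMetastore {
    /// Stands in for reading and parsing the metastore file.
    loader: Box<dyn Fn() -> Vec<String> + Send + Sync>,
    /// Internal state used as a cache; `None` until the first call.
    cache: Mutex<Option<Vec<String>>>,
}

impl LazyMetastore {
    pub fn new(loader: impl Fn() -> Vec<String> + Send + Sync + 'static) -> Self {
        LazyMetastore {
            loader: Box::new(loader),
            cache: Mutex::new(None),
        }
    }

    /// Lists split ids, "opening" the index only if it is not loaded yet.
    pub fn list_split_ids(&self) -> Vec<String> {
        let mut cache = self.cache.lock().unwrap();
        cache.get_or_insert_with(|| (self.loader)()).clone()
    }
}
```

The private loader plays the role of the former open method: callers never invoke it directly, and it runs at most once.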
Currently, the indexing command does not handle the time range field during document indexing.
This needs to be implemented.
Right now unit tests create temporary directories. This is not great.
Tantivy can work with a RAMDirectory without any trouble. Similarly, we already have an in ram storage.
The main problem is actually split upload. We only handle this on Tokio::File today.
The documentation should include the following section:
I end up with one split uploaded, and an extra empty directory.
Our doc and our code are inconsistent (independently and mutually) on these concepts.
The goal of this document is to come up with a clear definition of what should be done.
The origin of the confusion is the coexistence of 3 different products
Barrel is just a prototype and should only be here as a source of experience for the future... So let's exclude it from the discussion.
Going one step further, the source of the confusion is that the CLI only targets a single index.
Our foundational abstractions, however, try to make us ready for the near future and try to address the problem of managing several indexes (duh 🙂).
The serve command could actually be a good candidate to already handle several indices, and can be a good pivot for the reflection.
Here is the solution that I suggest:
metastore_url: a self-sufficient address to open a metastore. The MetastoreUrlResolver turns a url into a MetaStore object. The protocol helps choose the right implementation. Examples:
index_id: an id that only makes sense in the context of the meta store. The meta store is able to get a bunch of information about the index given an index_id. Most importantly, the index_url and a list of splits.
index_url in the code:
split_relative_url to a split_absolute_url.
split_id: a url that points at the storage of a given split. We prefer to store a split_id and resolve things from the index_url for different reasons... The most obvious one is that it makes it easier to move an index. For instance, if someone wanted to move an entire index from S3 to the local filesystem, it would only require copying all of the split files, exporting a single-file version of the metastore, and fixing the index_url to point to the filesystem.
Storage object and fetch paths including the split prefix path, rather than create one storage for every split.
index_url in the CLI arguments: Here comes the cause of the ambiguity... Conceptually speaking, with the following setting, the create and the index CLI (this is the same story for other CLI commands) would look like
quickwit create --meta-store-url s3://quickwit/ --index-id wikipedia --index-url s3://quickwit/wikipedia
quickwit index --meta-store-url s3://quickwit/ --index-id wikipedia
This is obviously not user friendly, error prone, and overkill in a world where we have only one index with a colocated metastore.
index-url argument. It is required to be a "storage url". We infer the meta-store-url and index-id from this index-url.
quickwit index --index-url s3://quickwit/wikipedia
quickwit serve, we also use the quickwit serve --index-url s3://quickwit/wikipedia syntax, but we can also support the following
quickwit serve --meta-store-url s3://quickwit/
s3://quickwit/wikipedia/quickwit.json
The difference between the two is notoriously confusing.
We do not really use these URIs as IDs, and it is quite common to consider relative URLs valid. Usage is split roughly 50/50: PostgreSQL refers to its postgresql://... addresses as URLs, while Amazon S3 refers to its s3://... addresses as URIs.
The following poll suggests uri is more popular.
https://twitter.com/fulmicoton/status/1397738350270840832
Which one do you prefer?
I think it is possible to build a consistent spec based on index_uri in place of index_id.
Our current spec is halfway between that imaginary spec and the spec above... This alternative is possible but we need to correctly spec it out.
Right now if a single split fails, we fail. We need to at least retry this split, or return partial results and point out the failed split.
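Both options mentioned above (retrying and returning partial results) can be combined. The following is a minimal sketch under stated assumptions: SearchOutcome and search_splits are hypothetical names, hits are reduced to plain u64 document ids, and splits are searched sequentially for simplicity.

```rust
/// Result of a search that tolerates failing splits.
pub struct SearchOutcome {
    pub hits: Vec<u64>,
    /// Ids of the splits that still failed after all attempts.
    pub failed_splits: Vec<String>,
}

/// Searches every split, retrying each one up to `max_attempts` times, and
/// reports the splits that could not be searched instead of failing entirely.
pub fn search_splits<F>(split_ids: &[&str], max_attempts: usize, mut search_one: F) -> SearchOutcome
where
    F: FnMut(&str) -> Result<Vec<u64>, String>,
{
    let mut outcome = SearchOutcome { hits: Vec::new(), failed_splits: Vec::new() };
    for &split_id in split_ids {
        let mut result: Result<Vec<u64>, String> = Err(String::from("not attempted"));
        for _ in 0..max_attempts {
            result = search_one(split_id);
            if result.is_ok() {
                break;
            }
        }
        match result {
            Ok(mut hits) => outcome.hits.append(&mut hits),
            Err(_) => outcome.failed_splits.push(split_id.to_string()),
        }
    }
    outcome
}
```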
This is low priority
We might get better performance by allocating only one thread per split.
The idea would then be to read the target amount of docs (5M) in memory, and have
a single thread indexer start working on its indexing.
Several splits would be working at the same time. Each one with a single thread.
With enough RAM, this could reduce or even remove the need for merges.
READMEs can be really helpful for developers to navigate a new project.
They also force us to define the perimeter of projects.
We want the Storage trait to mimic the experience of using an object storage.
Object storages do not have a real notion of a directory.
Put("a/b/c/d") does not require creating the directory "a/b/c" beforehand.
Similarly, removing "a/b/c/d" does not mean that we left the directory a/b/c empty.
The directory simply does not exist. It is more of a blob key-value store.
The "/" are only characters in a key that happen to be useful. Listing files in one of those "fake" directories is in fact a prefix enumeration.
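These semantics can be illustrated with an in-memory stand-in. This is only a sketch: BlobStore and its methods are hypothetical and are not the Storage trait itself.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an object store: a flat map from key to bytes.
#[derive(Default)]
pub struct BlobStore {
    blobs: BTreeMap<String, Vec<u8>>,
}

impl BlobStore {
    /// Stores a blob. No "directory" has to exist beforehand.
    pub fn put(&mut self, key: &str, payload: Vec<u8>) {
        self.blobs.insert(key.to_string(), payload);
    }

    /// Removes a blob. No empty "directory" is left behind.
    pub fn delete(&mut self, key: &str) {
        self.blobs.remove(key);
    }

    /// "Listing a directory" is just enumerating the keys sharing a prefix.
    pub fn list_prefix(&self, prefix: &str) -> Vec<String> {
        self.blobs
            .range(prefix.to_string()..)
            .take_while(|(key, _)| key.starts_with(prefix))
            .map(|(key, _)| key.clone())
            .collect()
    }
}
```

After put("a/b/c/d", ...) followed by delete("a/b/c/d"), list_prefix("a/b/c/") is simply empty: there is no directory object to clean up.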
The time range needs to be used to prune splits, but also to filter documents before they reach the collector.
This part has not been implemented yet.
Cargo-deny makes it possible to check for license inconsistency.
We should add it to our CI.
https://crates.io/crates/cargo-deny
Currently, splits are published at the end of indexing, but not atomically. Also, the indexing just reports split failures.
What we want is an atomic way of publishing splits with an all-or-nothing behavior. That is, if one of the splits fails to publish, we should consider the whole indexing operation as failed. As a suggestion, the Metastore could be a good place to implement this.
Metastore::publish_splits(splits: Vec<SplitId>)
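A minimal in-memory sketch of the all-or-nothing behavior: validate every split first, then apply the change. InMemoryMetastore, State and the error strings are hypothetical. For a file-backed metastore, atomicity would additionally depend on writing the updated file in a single operation, which is not shown here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Staged,
    Published,
}

#[derive(Default)]
pub struct InMemoryMetastore {
    splits: HashMap<String, State>,
}

impl InMemoryMetastore {
    pub fn stage(&mut self, split_id: &str) {
        self.splits.insert(split_id.to_string(), State::Staged);
    }

    pub fn state(&self, split_id: &str) -> Option<State> {
        self.splits.get(split_id).copied()
    }

    /// Publishes all the given splits, or none of them.
    pub fn publish_splits(&mut self, split_ids: &[&str]) -> Result<(), String> {
        // Phase 1: validate every split before touching any state.
        for id in split_ids {
            if !self.splits.contains_key(*id) {
                return Err(format!("split {} does not exist", id));
            }
        }
        // Phase 2: apply. Nothing can fail past this point.
        for id in split_ids {
            self.splits.insert((*id).to_string(), State::Published);
        }
        Ok(())
    }
}
```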
https://github.com/cla-assistant/cla-assistant offers a project with a github action.
We can either host it, use the SAP hosted version, or use cla-assistant-lite which is alpha, and host the data directly in the repo.
The latter is convenient but a bit scary and does not allow for an individual/corporate contributor license agreement.
For the template itself, we can probably rely on http://www.harmonyagreements.org/faqs.html
We also need to document the CLAs and the process in a contributing.md file.
We need to define and enforce what kind of string is accepted as an index-id.
Please edit this ticket to drive the discussion.
Read #55 to understand the actual nature of an index_id.