quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

License: Other

Rust 96.42% Makefile 0.15% Shell 0.26% Dockerfile 0.10% PLpgSQL 0.06% Python 0.66% HTML 0.03% TypeScript 1.60% CSS 0.01% JavaScript 0.30% HCL 0.42%

big-data cloud-native cloud-storage distributed-tracing log-management logs open-source rust search-engine tantivy

quickwit's Issues

Rest API

Implement `index` CLI subcommand

Description

Indexes a Newline Delimited JSON (NDJSON) dataset located at input-path or read from stdin. The data is appended to the target index specified by index-path unless overwrite is passed. input-path can be a file or a directory. In the latter case, the directory is traversed recursively. Local and remote datasets (S3) are supported. By default, the process uses 4 threads and 1GiB of memory per thread. The num-threads and heap-size options allow for customizing those settings.

Synopsis

quickwit index
	--index-path <path>
	[--input-path <path>]
	[--num-threads <num threads>]
	[--heap-size <num bytes>]
	[--overwrite]

Options

--index-path (string) Specifies the location of the target index.
--input-path (string) Specifies the location of the NDJSON-formatted source dataset.
--num-threads (integer) Number of allocated threads.
--heap-size (integer) Amount of allocated memory.
--overwrite (boolean) Overwrites existing data.

Examples

Indexing a local dataset
quickwit index --index-path s3://quickwit-indexes/nginx --input-path nginx.json

Indexing a remote dataset
quickwit index --index-path s3://quickwit-indexes/nginx --input-path s3://quickwit-datasets/nginx.json

Indexing a dataset from stdin
cat nginx.json | quickwit index --index-path s3://quickwit-indexes/nginx
quickwit index --index-path s3://quickwit-indexes/nginx < nginx.json

Reindexing a dataset
quickwit index --index-path s3://quickwit-indexes/nginx --input-path nginx.json --overwrite

Customizing the resources allocated to the process
quickwit index --index-path s3://quickwit-indexes/nginx --input-path nginx.json --num-threads 8 --heap-size 16GiB

Output

Before completing, the command displays a usage example of quickwit search for the target index (see #4 for more details about the search command).

guilload@modern14 ~> quickwit index --index-path s3://quickwit-indexes/nginx < nginx.json
Indexed 12345 docs in 3s.
You can now query your index with `quickwit search --index-path s3://quickwit-indexes/nginx --query "barack obama"`

Design and implementation pointers

The command creates and uploads splits to index-path and updates the meta store file. When appending data, merging newly generated splits with "older" splits is out of scope for this iteration. However, the command should create a new split every 5 million documents. The command should also be able to generate and upload two splits concurrently. We want a Quickwit split to be comprised of a single Tantivy segment, so we should wait for Tantivy to merge all the segments before uploading a split. The command should be recoverable, i.e. we should be able to hit Ctrl+C to interrupt the process and then resume right where we left off. Tantivy has a mechanism to embed metadata within a commit, which should help to implement this behavior. Finally, the command should display some statistics (number of docs indexed, throughput, ...) as the indexing progresses.
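The "new split every 5 million documents" rule above can be sketched as follows. This is only an illustration of the rollover arithmetic, not the actual indexer; `plan_splits` and `MAX_DOCS_PER_SPLIT` are hypothetical names introduced here.

```rust
/// Document-count threshold after which a new split is started
/// (the 5 million figure comes from the issue text; the name is hypothetical).
const MAX_DOCS_PER_SPLIT: u64 = 5_000_000;

/// Returns the number of documents assigned to each split when `num_docs`
/// documents are indexed in order and a new split is cut at every threshold.
fn plan_splits(num_docs: u64, max_docs_per_split: u64) -> Vec<u64> {
    assert!(max_docs_per_split > 0, "threshold must be positive");
    let mut splits = Vec::new();
    let mut remaining = num_docs;
    while remaining > 0 {
        // The last split simply holds whatever is left.
        let size = remaining.min(max_docs_per_split);
        splits.push(size);
        remaining -= size;
    }
    splits
}
```

With 12 million input documents this plan yields two full splits and one partial split of 2 million documents.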

Assignee(s), please complete this paragraph further as you're going through the implementation.

Testing

Assignee(s), please describe in a few sentences how you intend to test this command.

Implement the meta store interface and its impl for its file-backed implementation

Purpose and Challenge

Quickwit needs a way to keep track of the list of splits available for a given index. Currently the metastore's implementation is backed by PostgreSQL.

However, deploying PostgreSQL might be a cumbersome requirement for the user who just wants an index for a single large static dataset in their datalake, or who wants to test out whether Quickwit's search is working as advertised before throwing themselves into deploying multiple services.

For this reason, there will be different implementations of the metastore, the most constrained one being stored directly on S3.

API Definition

Functional need

Indexers need to publish the splits they have finished building.
Searchers need to list the splits in an index, possibly pruning them based on some metadata.
Mergers need to publish the outcome of a merge (generally the atomic publication of a split and the deletion of some others).
The retention manager needs to be able to remove splits based on their timestamps (and possibly some metadata?).

Note that some of these operations imply some interaction with the storage. This is not the role of the meta store, and the correct implementation of its client application is assumed. The meta store semantics, however, need to make this coordination possible... (See section below)

File management need, and the split lifecycle

Quickwit needs a way to ensure that we can clean up unused files, and this process needs to be resilient to any fail-stop failure.

Because our metastores are typically more constrained than a filesystem (e.g. object storage), we cannot rely on a write-ahead log. We also do not want to rely on a strategy that requires listing the files in the store, because:

we might end up removing user files unrelated to Quickwit.
listing files is notoriously slow and clumsy on object storage.

Instead, we rely on atomically shifting the status of splits.

A split goes through the following lifecycle:

Building
   -stage->
Staged
   -publish->
Published
   -markAsDelete->
MarkAsDeleted
   -delete->
Deleted

Building splits and deleted splits are not states that are registered in the metastore, and should have no durable left-over anywhere.

We maintain the following invariant:
If a split has files in the storage, it MUST be registered in the meta store, and its state can be one of:

Staged: the split is almost ready. Some of its files may have been uploaded to the storage.
Published: the split is ready and published.
ScheduledForDelete: the split is scheduled to be deleted.

Before creating any file, we need to stage the split. If there is a failure, upon recovery we schedule all of the staged splits for deletion.

A client does not necessarily remove files from storage right after marking a split as deleted. A CLI client may delete files right away, but a more serious deployment should probably only delete these files after a period of time, to give running search queries time to finish.
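The recovery rule described above (every split still staged after a failure gets scheduled for deletion) might look like the sketch below. It works on a plain slice of states rather than a real metastore, and `SplitState` and `recover` are hypothetical names.

```rust
/// The three states that are registered in the metastore.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SplitState {
    Staged,
    Published,
    ScheduledForDelete,
}

/// Recovery pass: a split that is still staged may have left partial files
/// in the storage, so it is scheduled for deletion. Published splits and
/// splits already scheduled for deletion are left untouched.
/// Returns the number of splits that were rescheduled.
fn recover(states: &mut [SplitState]) -> usize {
    let mut rescheduled = 0;
    for state in states.iter_mut() {
        if *state == SplitState::Staged {
            *state = SplitState::ScheduledForDelete;
            rescheduled += 1;
        }
    }
    rescheduled
}
```

Because files are only ever created after staging, this pass is enough to find every leftover file without listing the storage.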

API Definition

Stage splits (W)

/// Stages a split.
/// A split needs to be staged BEFORE uploading any of its files to the storage.
/// The SplitId is returned for convenience, but it is not generated
/// by the metastore. In fact it was supplied by the client and is present in the split manifest.
fn stage_split(split_manifest: SplitManifest) -> Result<SplitId, StageError>;

enum StageError {
    InvalidManifest(...),
    ExistingSplitId(...),
    InternalError(...),
}

/// A split manifest carries all meta data about a split
/// and its files.
struct SplitManifest {
    /// the index it belongs to or is destined to belong to.
    index_id: String,

    /// The set of information required to open the split,
    /// and possibly allocate the split.
    metadata: SplitMetaData,

    /// The list of files associated to the split.
    files: Vec<ManifestEntry>,
}

struct SplitMetaData {
    /// Split uri. In spirit, this uri should be self sufficient
    /// to identify a split.
    /// In reality, some information may be implicitly configured
    /// in the store uri resolver, such as the S3 region.
    split_uri: String,

    /// Number of records (or documents) in the split.
    num_records: u64,

    /// Weight in bytes of the split.
    size_in_bytes: u64,

    /// If a time field is available, the min / max timestamp in the segment.
    time_range: Option<Range<i64>>,

    /// Number of merges this segment has been subjected to during its lifetime.
    generation: usize,
}

Publish splits (W)

/// Records a split as published.
///
/// This API is typically used by an indexer who needs to publish a new split.
/// At this point, the split files are assumed to have already been uploaded.
/// The metastore only updates the state of the split from staging to published.
///
/// It has two side effects:
/// - it makes the split visible to other clients via the ListSplit API.
/// - it guards eventual recovery procedure from removing the split files.
///
/// Unless specified otherwise in the implementation, the metastore DOES NOT
/// do any check on the presence, integrity or validity of the metadata of this split.
///
/// A split successfully published MUST eventually be visible to all clients.
/// Stronger consistency semantics should be documented in the implementation.
///
/// If the split is already published, this API call returns a success.
fn publish_split(split: SplitId) -> Result<(), PublishError>;

enum PublishError {
    DoesNotExist(SplitId),
    SplitIsNotStaged {
        split: SplitId,
        previous_state: State
    },
    InternalError(...)
}

List splits (W)

/// Returns the list of splits in the given state intersecting with the given time_range.
/// Regardless of the time range filter, if a split has no timestamp it is always returned.
/// Splits are returned in any order.
fn list_splits(index: IndexId, state: State, time_range: Option<TimeRange>) -> Result<Vec<SplitMeta>, ListSplitsError>;

enum ListSplitsError {
    IndexDoesNotExist(IndexId),
    InternalError(...)
}

List splits is used:

by searcher with state=Published, and often with some timestamp range.
by recovery procedure with state=Staging, and state=ScheduledForDelete
by eventual garbage collection procedure with state=ScheduledForDelete
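The time-range pruning rule of list_splits (keep splits that intersect the query range, and always keep splits without a timestamp) can be sketched as below. The `SplitMeta` struct here is a reduced, hypothetical stand-in, and ranges are assumed to be half-open as in Rust's `Range`; the real metadata may use inclusive min / max bounds.

```rust
use std::ops::Range;

/// Reduced stand-in for the split metadata (hypothetical fields).
struct SplitMeta {
    split_id: String,
    /// Half-open time range covered by the split, if it has a time field.
    time_range: Option<Range<i64>>,
}

/// Two half-open ranges intersect if each one starts before the other ends.
fn overlaps(a: &Range<i64>, b: &Range<i64>) -> bool {
    a.start < b.end && b.start < a.end
}

/// Keeps the splits that may contain documents within `query`.
/// A split without a time range is always kept, and so is every split
/// when no query range is given.
fn prune_splits<'a>(splits: &'a [SplitMeta], query: Option<&Range<i64>>) -> Vec<&'a SplitMeta> {
    splits
        .iter()
        .filter(|split| match (query, &split.time_range) {
            (Some(q), Some(r)) => overlaps(q, r),
            _ => true,
        })
        .collect()
}
```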

Delete split (W)

/// Marks a split as deleted.
/// This function is successful if a split was already marked as deleted.
fn mark_as_deleted(split_id: SplitId) -> Result<(), MarkAsDeletedError>;

enum MarkAsDeletedError {
    DoesNotExist(SplitId),
    InternalError(...)
}

/// Deletes a split.
/// This function only accepts splits that are in the staged or marked-as-deleted state.
fn delete_split(split_id: SplitId) -> Result<(), DeleteSplitError>;

enum DeleteSplitError {
    DoesNotExist(...),
    Forbidden(...), // deleting a published split is forbidden
    InternalError(...)
}
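To make the semantics of the four write operations concrete, here is a minimal in-memory sketch. It is not the proposed trait: the split manifest is reduced to a split id, all error enums are collapsed into one hypothetical `MetastoreError`, and `InMemoryMetastore` is an illustrative name.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Staged,
    Published,
    ScheduledForDelete,
}

/// Simplified error type merging the per-call error enums (assumption).
#[derive(Debug, PartialEq, Eq)]
enum MetastoreError {
    ExistingSplitId,
    DoesNotExist,
    SplitIsNotStaged,
    Forbidden,
}

#[derive(Default)]
struct InMemoryMetastore {
    splits: HashMap<String, State>,
}

impl InMemoryMetastore {
    /// Registers a split before any of its files are uploaded.
    fn stage_split(&mut self, split_id: &str) -> Result<(), MetastoreError> {
        if self.splits.contains_key(split_id) {
            return Err(MetastoreError::ExistingSplitId);
        }
        self.splits.insert(split_id.to_string(), State::Staged);
        Ok(())
    }

    /// Staged -> Published. Publishing an already published split succeeds.
    fn publish_split(&mut self, split_id: &str) -> Result<(), MetastoreError> {
        let state = self.splits.get_mut(split_id).ok_or(MetastoreError::DoesNotExist)?;
        match *state {
            State::Staged | State::Published => {
                *state = State::Published;
                Ok(())
            }
            State::ScheduledForDelete => Err(MetastoreError::SplitIsNotStaged),
        }
    }

    /// Marks a split as deleted. Succeeds if it was already marked.
    fn mark_as_deleted(&mut self, split_id: &str) -> Result<(), MetastoreError> {
        let state = self.splits.get_mut(split_id).ok_or(MetastoreError::DoesNotExist)?;
        *state = State::ScheduledForDelete;
        Ok(())
    }

    /// Removes the entry. Deleting a published split is forbidden.
    fn delete_split(&mut self, split_id: &str) -> Result<(), MetastoreError> {
        match self.splits.get(split_id).copied() {
            None => Err(MetastoreError::DoesNotExist),
            Some(State::Published) => Err(MetastoreError::Forbidden),
            Some(_) => {
                self.splits.remove(split_id);
                Ok(())
            }
        }
    }
}
```

Note that none of these calls touch the storage: removing the split files remains the client's responsibility.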

File-in-object-storage based implementation

We simply serialize all of the splits associated with a single index into one large file.
We do not enforce any lock, and having several concurrent writers can lead to corruption (as with several data lake metastores).

The file has the following format.

{
    "index": {
        // Possible index metas
    },
    "splits": {
        <split_id_1>: {
            "metadata": <SplitMetaData>,
            "files": [<manifest_entries>],
        }
    }
}

Distributed search

Remove or hide DocMapperType

Right now the role of DocMapperType is a bit unclear. Its only role is to help with serialization, yet the type is too visible.

We could hide it more, or even remove it and rely on typetag.
https://github.com/dtolnay/typetag

Hot cache creation must happen after merging

Right now it happens before merging, so its content is incorrect.

Specify and implement `DocMapping`

Why we need a doc mapping

What is a "doc mapping"?

Let's quote Elasticsearch doc:

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.

Tantivy provides only a schema

As mentioned here, tantivy already provides a schema, which defines the set of fields for a given index and how they are stored. The user has yet to define how input documents (typically JSON) are converted into the data that ends up in the index.

Do we need more than a schema?

Tantivy already provides some useful information in the schema about how data needs to be stored (see indexing options). But we may need something more to handle other cases:

define what metadata we may store: typically we want to be able to store or not the entire document in a _source field like ES does it
more complex fields need a mapping: for example, multifields (see tantivy issue), one day we will have geo fields :)
handle coerce
define a facet for a given field
define how to control dynamic field mapping (to avoid mapping explosion for example)

Where to implement it?

I would do it in tantivy but we can start a "dumb" implementation for quickwit's need.

Specification

The mapping is a json (or yaml?) file and sufficient to create the index schema.
To start with, we can just use tantivy schema as it will do the job.

Format

{
  "store_source": true,
  "ignore_unknown_fields": true,
  "default_search_fields": [
    "body", "title"
  ],
  "properties": [
    {
      "source_name": "timestamp",
      "target_name": "timestamp2",
      "type": "u64",
      "indexed": true,
      "stored": true
    },
    {
      "source_name": "severity_text",
      "type": "text",
      "record": "basic",
      "tokenizer": "raw",
      "stored": true
    },
    {
      "source_name": "resources.label",
      "type": "my_custom_text"
    },
    {
      "source_name": "meta",
      "type": "array<organization>"
    },
    {
      "source_name": ""
    }
  ],
  "field_types": [
    {
      "name": "my_custom_text",
      "type": "text",
      "analyzer": "my_analyzer"
    },
    {
      "name": "organization",
      "type": "object",
      "properties": [
        {
          "source_name": "name",
          "type": "text"
        },
        {
          "source_name": "created_at",
          "type": "u64"
        }
      ]
    }
  ]
}

Implementation

We talked about having 3 types of doc mapping but I don't think this is relevant. A mapping is specified as a json or yaml file and that's all. But we can also provide a basic template that works with the OpenTelemetry JSON format.

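struct DocMapper {
    meta: Vec<String>,
    properties: Vec<FieldMapper>,
}

impl DocMapper {
    /// Converts a JSON document into a tantivy `Document` using the mapping.
    pub fn doc_from_json(&self, doc_json: &str) -> Result<Document, MappingError> {
        todo!("apply each FieldMapper to the parsed JSON document")
    }
}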

Add options to choose doc mapper in CLI

Set up basic CI

We want

cargo fmt --check
cargo check --all
cargo clippy --all
cargo test --all

To run on each PR.

Usage of Path::join creates non-conform index on windows

Describe the bug
On MS Windows 10, since the standard path separator is \, usage of Path::join creates indices that cannot be found on *nix platforms.

Steps to reproduce the behavior:

Run cargo run -- new --index-uri s3://quickwit-dev/evance/data --no-timestamp-field --doc-mapper-type wikipedia
An index is created on S3 at: s3://quickwit-dev/evance\data\quickwit.json
Trying to access this index via any other command on macOS or Ubuntu will yield an IndexNotFound error.

Expected behavior
Index manipulation should be interoperable across all platforms we support

System configuration:
Windows 10,
Rustup 1.24.3 31/05/2021
Rustc 1.52.1 09/05/2021
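One possible direction, sketched under the assumption that index URIs are manipulated as strings rather than OS paths: join the segments with a literal "/" so the result is the same on every platform. `join_uri` is a hypothetical helper, not an existing Quickwit function.

```rust
/// Joins a base URI and a segment with exactly one '/', independent of the
/// host OS path separator (unlike `Path::join`, which uses '\' on Windows).
fn join_uri(base: &str, segment: &str) -> String {
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        segment.trim_start_matches('/')
    )
}
```

For instance, joining `s3://quickwit-dev/evance/data` and `quickwit.json` gives `s3://quickwit-dev/evance/data/quickwit.json` on Windows as well as on Linux or macOS.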

Handle single bound time interval

Users should be able to search on an interval with a single bound (e.g. t > start_time) instead of being forced to define an entire interval.
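A single-bound interval could be modeled with the standard library's `Bound` type, as in this sketch. `TimeFilter` is a hypothetical name and does not reflect how the search request is actually defined.

```rust
use std::ops::Bound;

/// A time filter where each side may be inclusive, exclusive, or absent.
struct TimeFilter {
    start: Bound<i64>,
    end: Bound<i64>,
}

impl TimeFilter {
    /// Returns true if `ts` satisfies both bounds.
    fn contains(&self, ts: i64) -> bool {
        let after_start = match self.start {
            Bound::Included(start) => ts >= start,
            Bound::Excluded(start) => ts > start,
            Bound::Unbounded => true,
        };
        let before_end = match self.end {
            Bound::Included(end) => ts <= end,
            Bound::Excluded(end) => ts < end,
            Bound::Unbounded => true,
        };
        after_start && before_end
    }
}
```

The query `t > start_time` then becomes `TimeFilter { start: Bound::Excluded(start_time), end: Bound::Unbounded }`.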

Create an S3 compatible metastore

Currently, the index command of quickwit-cli can only be used with --index-uri file://. Any other protocol raises an UnsuportedProtocol error.

We need to implement a Metastore that interacts with the S3 storage interface.

We could also make SingleFileMetastore support the S3 protocol.

Remove reference to quickwit s3

see https://github.com/quickwit-inc/quickwit/blob/b29a7d9882f1c982acec56b11de7ca37717e7ca3/storage/src/object_storage/s3_compatible_storage.rs#L574

minio support ?

The current code looks very AWS-centric. What about MinIO support for those who don't want to be tied to a cloud provider?

Remove interleaving of doc and term frequency in tantivy and improve the warmup operation

Term frequencies are only useful for

BM25 queries
phrase queries.

These are either impossible today or fairly rare.
We could reduce the amount of downloaded data if we changed the tantivy codec to stop interleaving doc frequency blocks and doc id blocks in the posting list (provided we do not observe a performance regression).

warm_postings could then take an IndexRecordOption argument instead of need_position.

A mistake in the docmapper does not emit an error.

The following command

cargo run -- new --index-uri file:///home/fulmicoton/datasets/wikipedia-idx --no-timestamp-field --doc-mapper-type wikipeda

creates an index with the default mapper.
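A strict parser for the option value would surface the typo instead of silently falling back. The sketch below assumes two mapper types, "default" and "wikipedia"; only "wikipedia" appears in the command above, and `DocMapperKind` is a hypothetical name.

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq, Eq)]
enum DocMapperKind {
    Default,
    Wikipedia,
}

impl FromStr for DocMapperKind {
    type Err = String;

    /// Accepts only known mapper names. Anything else is an error rather
    /// than a silent fallback to the default mapper.
    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value {
            "default" => Ok(DocMapperKind::Default),
            "wikipedia" => Ok(DocMapperKind::Wikipedia),
            other => Err(format!("unknown doc mapper type `{}`", other)),
        }
    }
}
```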

Implement `serve` CLI subcommand

Mapping configuration from example [food for thought]

This is more of a vague idea than a fully fleshed-out one.

I am a huge fan of this extension:
https://marketplace.visualstudio.com/items?itemName=quicktype.quicktype

The idea is as follows...
When creating a model in your favorite language, you paste an example json object and the extension creates the struct (or Class) automatically, with a lot of boilerplate.
You can then tweak it to your preference.

It is a massive timesaver and avoids a lot of typo related friction.

We could use the same idea in the CLI for instance, by allowing the user to copy paste a schema... or even sniff it from the clipboard...
We could also offer a webapp that helps creating the quickwit.json file with a good looking mapping straight from json.

Indexing does not work and fails with an unhelpful error

cat wiki-1000.json | RUST_LOG=debug cargo run -- index --index-uri file:///home/fulmicoton/datasets/wikipedia-idx --heap-size 1000000000 --num-threads 2

fails with

Jun 04 13:33:06.374 DEBUG quickwit_cli: indexing index_uri=file:///home/fulmicoton/datasets/wikipedia-idx input_uri=None temp_dir=/tmp num_threads=2 heap_size=1000000000 overwrite=false
Please enter your new line delimited json documents.
[quickwit-core/src/indexing/split_finalizer.rs:76] "he" = "he"
Jun 04 13:33:06.376  INFO tantivy::indexer::segment_updater: save metas    
Jun 04 13:33:06.377 DEBUG tantivy::directory::mmap_directory: Atomic Write ".managed.json"    
Jun 04 13:33:06.379 DEBUG tantivy::directory::mmap_directory: Atomic Write "meta.json"    
Jun 04 13:33:06.380 DEBUG tantivy::indexer::segment_updater: Saved metas Ok("{\n  \"index_settings\": {\n    \"sort_by_field\": null\n  },\n  \"segments\": [],\n  \"schema\": [],\n  \"opstamp\": 0\n}")    
Jun 04 13:33:07.149 DEBUG tantivy::directory::mmap_directory: Releasing lock ".tantivy-writer.lock"    
Command failed: Could not get the index_id

Investigate security considerations on the local file storage

The URI resolution scheme for the file system can have important security implications.
We also want to handle file:///var/local and the incorrect but common file://var/local format.

We need to make sure someone cannot access files that are not under the root directory using the "../" scheme, for instance.

Generally speaking: we should put some thought into it.

UPDATE:

We want to enforce that paths accessed by the FileStorage are child paths of the FileStorage root.
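The child-path rule could be enforced lexically, as in the sketch below. `resolve_under_root` is a hypothetical helper; it rejects `..` and absolute components but does not resolve symlinks, so it is only a first line of defense.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolves `relative` under `root`, or returns `None` if the path could
/// escape the root. The check is purely lexical: symlinks are not followed.
fn resolve_under_root(root: &Path, relative: &str) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    for component in Path::new(relative).components() {
        match component {
            Component::Normal(part) => resolved.push(part),
            Component::CurDir => {}
            // `..`, a leading `/`, or a Windows prefix could leave the root.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}
```

Rejecting every `..` is stricter than necessary (a path like `a/../b` stays under the root), but it keeps the rule easy to audit.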

Improved documentation of the MetaStore trait

The trait does not explain the point of staging, mark_split_as_deleted, etc.

A reader could think for instance that delete_split would remove the files associated to the split in the storage.

Define the client pool trait

The searchers receiving a root request need to dispatch the leaf work among the search servers.

We want to handle the peer discovery problem using the SWIM gossip algorithm.
Before plugging the peer discovery, please submit a documented trait describing the public API that will be consumed by the root handler.

Add a File system storage

This is only relevant after the merge of #18.

We should handle "file://" as a valid protocol and allow people to store and consume files on their local file system.

add a file Storage
add a file StorageFactory
add it to the default StorageUriResolver
test

Better logging messages in console during indexing

I got this

cargo r index --index-uri file:///Users/fmassot/Documents/quickwit/repos/quickwit/my-index --input-path wiki-articles-1000.json
Documents: 10000 Errors: 0  Splits: 1 Dataset Size: 11MB Throughput: 11.24MB/s
Documents: 10000 Errors: 0  Splits: 1 Dataset Size: 11MB Throughput: 5.62MB/s
Documents: 10000 Errors: 0  Splits: 1 Dataset Size: 11MB Throughput: 3.75MB/s
Documents: 10000 Errors: 0  Splits: 1 Dataset Size: 11MB Throughput: 2.81MB/s
may

I don't understand the first constant number.

Need to be stricter on index-uri ?

This is clearly not a bug, but I expect some users will make the same mistake as I did.

I expected the index-uri to be an absolute URI, but with only two slashes it is relative.

So file://folder1/folder2/folder3/folder4/wikipedia will create 5 folders because I forgot the slash.

I suggest supporting only absolute paths.
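The stricter behavior suggested here could be sketched as a small validation step. `parse_file_uri` is a hypothetical helper, and the error wording is illustrative.

```rust
/// Extracts the path from a `file://` URI, accepting absolute paths only.
/// `file:///var/local` is accepted; `file://var/local` is rejected with a hint.
fn parse_file_uri(uri: &str) -> Result<&str, String> {
    let rest = uri
        .strip_prefix("file://")
        .ok_or_else(|| format!("`{}` is not a file:// URI", uri))?;
    if rest.starts_with('/') {
        Ok(rest)
    } else {
        Err(format!(
            "`{}` does not contain an absolute path; did you mean `file:///{}`?",
            uri, rest
        ))
    }
}
```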

Consideration of node discovery using P2P technology

Currently, we are doing the node discovery by polling DNS entries in Kubernetes.
However, it takes a few seconds after a new node is added before it becomes available.
So, we will try to create a sample project to consider if it is possible to do the node discovery more easily using P2P technology.

Implement `new` CLI command

Spec

quickwit new ... creates a new index.

Command synopsis

quickwit new <index name> 
	--index-path <path> 
	{--timestamp-field <field name> | --no-timestamp-field}
	[--overwrite]

Command options

--index-path <path> (string) Defines where the index is created.
--timestamp-field <field name> (string) Creates a time-series index.
--no-timestamp-field (boolean) Creates a non time-series index.
--overwrite (boolean) Overwrites existing index.

The timestamp-field and no-timestamp-field options are mutually exclusive.

Examples

Creating a time-series index

quickwit new <index name> --index-path <path> --timestamp-field <field name>
quickwit new nginx --index-path s3://quickwit-indexes/nginx --timestamp-field ts

Creating a non time-series index

quickwit new <index name> --index-path <path> --no-timestamp-field
quickwit new nginx --index-path s3://quickwit-indexes/nginx --no-timestamp-field

Before completing, the subcommand displays a usage example of quickwit index for the index that was just created (see #3 for more details about the index subcommand).

guilload@modern14 ~> quickwit new nginx --index-path s3://quickwit-indexes/nginx --timestamp-field ts
Creating new index at s3://quickwit-indexes/nginx...
Done!
You can now [...] using `quickwit index --index-path s3://quickwit-indexes/nginx --input-path <path to file or directory>`

Design

Assignee(s), please describe in a few sentences how you intend to implement and test this feature.

Upon Storage failure, the state is still updated in the metastore.

Remove the open method

Ideally, the meta store should not require a call to open.

The current interface is meant to be backed by a more serious backend in the future (e.g. PostgreSQL, FoundationDB, or what not).
open is a quirk associated with the single file implementation.

It is not actually necessary. We could use the internal state as a cache and lazily "open" the index upon a call if it is not loaded yet.
You can probably remove the method from the trait, make the current implementation private,
and lazily call it from methods like read, etc.
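The lazy "open" idea could look like the sketch below, where every public method goes through a private `load` that populates the cache on first use. The struct and its fields are hypothetical, and reading the metadata file from storage is replaced by creating an empty map.

```rust
use std::collections::HashMap;

/// Hypothetical single-file metastore that opens itself on first use.
struct LazyMetastore {
    /// `None` until the first call that needs the index metadata.
    cache: Option<HashMap<String, Vec<String>>>,
    /// Counts how many times the (simulated) file was loaded.
    load_count: usize,
}

impl LazyMetastore {
    fn new() -> Self {
        LazyMetastore { cache: None, load_count: 0 }
    }

    /// Private replacement for `open`: loads the state once, then reuses it.
    fn load(&mut self) -> &mut HashMap<String, Vec<String>> {
        if self.cache.is_none() {
            self.load_count += 1;
            // Stand-in for fetching and parsing the metadata file.
            self.cache = Some(HashMap::new());
        }
        self.cache.as_mut().expect("cache was populated above")
    }

    fn add_split(&mut self, index_id: &str, split_id: &str) {
        self.load()
            .entry(index_id.to_string())
            .or_default()
            .push(split_id.to_string());
    }

    fn list_splits(&mut self, index_id: &str) -> Vec<String> {
        self.load().get(index_id).cloned().unwrap_or_default()
    }
}
```

Callers never see `open`: the first `list_splits` or `add_split` triggers the load, and later calls reuse the cached state.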

Handle time-series data in indexing

Currently, the indexing command does not handle the time range field during document indexing.
This needs to be implemented.

Make indexing entirely in RAM possible

Right now, unit tests create temporary directories. This is not great.

Tantivy can work with a RAMDirectory without any trouble. Similarly, we already have an in ram storage.
The main problem is actually split upload. We only handle this on Tokio::File today.

Write documentation for Quickwit CLI

The documentation should include the following sections:

a "Getting Started" section providing a quick overview of what the CLI can do
a "Commands" section describing each command in detail
a guide or tutorial explaining how to index and query a dataset from start to finish

A useless directory is created in the storage after indexing 1000 docs.

I end up with one split uploaded, and an extra empty directory.

Consistent IndexUri vs IndexId

Index(Id/Url/Uri), Split(Id/Url/Uri)

Our doc and our code are inconsistent (independently and mutually) on these concepts.

The goal of this document is to come up with a clear definition of what should be done.

The confusion originates from the coexistence of 3 different products:

barrel
quickwit version 1, as a CLI, where each command only applies to a single index
quickwit in the future, which is seen as a better version of barrel.

Barrel is just a prototype and should only be here as a source of experience for the future... So let's exclude it from the discussion.

Going one step further, the source of the confusion is that the CLI only targets a single index.
Our foundational abstractions, however, try to make us ready for the near future and address the problem of managing several indexes (duh 🙂).

The serve command could actually be a good candidate to already handle several indices, and can be a good pivot for the reflection.

Here is the solution that I suggest:

metastore_url: a self sufficient address to open a metastore. The MetastoreUrlResolver turns an url into a MetaStore object. The protocol helps choosing the right implementation. Examples:
- s3://quickwit-indexes/
- postgres://localhost/toto

index_id: an id that only makes sense in the context of the meta store. The meta store is able to get a bunch of information about the index given an index_id. Most importantly, the index_url and a list of splits.

index_url in the code:
- When we index, it serves as a base url to create the storage used to store splits.
- On search, it makes it possible to resolve a split_relative_url to a split_absolute_url.

split_id: an url that points at the storage of a given split. We prefer to store a split_id and resolve things from the index_url for different reasons... The most obvious one is that it makes it easier to move an index. For instance, if someone wanted to move an entire index from s3 to the local filesystem, it would only require copying all of the split files, exporting a single file version of the metastore, and fixing the index_url to point to the filesystem.
The less obvious reason is that we might want to create only one Storage object and fetch paths including the split prefix path, rather than create one storage for every split.

index_url in the CLI arguments: Here comes the cause of the ambiguity... Conceptually speaking, with the following setting, the create and the index CLI (this is the same story for other CLI commands) would look like

quickwit create --meta-store-url s3://quickwit/ --index-id wikipedia --index-url s3://quickwit/wikipedia
quickwit index --meta-store-url s3://quickwit/ --index-id wikipedia

This is obviously not user friendly, error prone, and overkill in a world where we have only one index with a colocated metastore.
Rather than this, for quickwit v1, we only have one index-url argument. It is required to be a "storage url". We infer the meta-store-url and index-id from this index-url.

quickwit index --index-url s3://quickwit/wikipedia

In quickwit serve, we also use the quickwit serve --index-url s3://quickwit/wikipedia syntax, but we can also support the following

quickwit serve --meta-store-url s3://quickwit/

When receiving a query targeting index_id=wikipedia, the server will dynamically load and cache the meta file in s3://quickwit/wikipedia/quickwit.json.
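The inference of meta-store-url and index-id from a single index-url could be sketched as follows, assuming the index id is the last path segment and the metastore url is everything before it. `split_index_url` is a hypothetical helper.

```rust
/// Splits an index url into (metastore url, index id), or returns `None`
/// when the url has no path segment after the host or bucket.
fn split_index_url(index_url: &str) -> Option<(String, String)> {
    let trimmed = index_url.trim_end_matches('/');
    // Position right after the "://" separator.
    let scheme_end = trimmed.find("://")? + 3;
    let last_slash = trimmed.rfind('/')?;
    if last_slash < scheme_end {
        // The only slashes belong to "://": there is no index segment.
        return None;
    }
    let metastore_url = trimmed[..=last_slash].to_string();
    let index_id = trimmed[last_slash + 1..].to_string();
    Some((metastore_url, index_id))
}
```

With this rule, `s3://quickwit/wikipedia` maps to the metastore url `s3://quickwit/` and the index id `wikipedia`, matching the examples above.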

Url vs Uri

The difference between the two is notoriously confusing.

https://danielmiessler.com/study/difference-between-uri-url/

We do not really use these uri as IDs and it is quite common to consider relative url valid. Usage is split roughly 50/50. PostgreSQL refers to its postgresql://... addresses as an url, while Amazon S3 refers to s3://... addresses as an uri.

The following poll suggests uri is more popular.
https://twitter.com/fulmicoton/status/1397738350270840832

Which one do you prefer?

Alternative solution

I think it is possible to build a consistent spec based on index_uri in place of index_id.

Our current spec is halfway between that imaginary spec and the spec above... This alternative is possible but we need to correctly spec it out.

Gracefully handle failing split search

Right now if a single split fails, we fail. We need to at least retry this split, or return partial results and point out the failed split.
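Both options (retrying a split, then returning partial results with the failed splits listed) can be combined, as in this sketch. The per-split search is abstracted as a closure, hits are plain strings, and `search_splits` and `SearchOutcome` are hypothetical names.

```rust
/// Partial result: the hits that were collected, plus the splits that
/// still failed after all attempts.
struct SearchOutcome {
    hits: Vec<String>,
    failed_splits: Vec<String>,
}

/// Searches every split, retrying each one up to `max_attempts` times.
/// A split that keeps failing is reported instead of failing the whole query.
fn search_splits<F>(split_ids: &[&str], max_attempts: usize, mut search_one: F) -> SearchOutcome
where
    F: FnMut(&str) -> Result<Vec<String>, String>,
{
    let mut outcome = SearchOutcome { hits: Vec::new(), failed_splits: Vec::new() };
    for &split_id in split_ids {
        let mut result = Err(String::from("not attempted"));
        for _ in 0..max_attempts {
            result = search_one(split_id);
            if result.is_ok() {
                break;
            }
        }
        match result {
            Ok(mut hits) => outcome.hits.append(&mut hits),
            Err(_) => outcome.failed_splits.push(split_id.to_string()),
        }
    }
    outcome
}
```

The caller can then decide whether a non-empty `failed_splits` list should be surfaced as a warning or turned into an error.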

Add SingleFileMetastoreFactory for MetastoreUriResolver

One split one thread

This is low priority

We might get better performance by allocating only one thread per split.
The idea would then be to read the target number of docs (5M) in memory, and have
a single-threaded indexer start working on its indexing.

Several splits would be working at the same time. Each one with a single thread.
With enough RAM, this could reduce or even remove the need for merges.

Add a README in all directories describing the perimeter of the project.

README can be really helpful for developers to navigate a new project.
They also force us to define the perimeter of projects.

IndexDataParams.temp_dir is actually not a temp dir.

More accurate semantics in Storage trait and implementation in the FileStorage

We want the Storage trait to mimic the experience of using an object storage.

Object storages do not have a real notion of a directory.
Put("a/b/c/d") does not require creating the directory "a/b/c" beforehand.

Similarly, removing "a/b/c/d" does not mean that we left the directory a/b/c empty.
The directory simply does not exist. It is more of a blob key-value store.

The "/" are only characters in a key that happen to be useful. Listing files in one of those "fake" directories is in fact a prefix enumeration.

Clarify these semantics in the Storage trait.
Fix the LocalFileStorage implementation to implement these semantics. Delete can be best effort. If we delete a file that is the last file in the current directory, and it is not the root of the filestorage, try to remove the directory. If removing the directory fails, log a warning, and ignore the error.
Add tests
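The object-storage semantics described above can be illustrated with a tiny in-memory key-value store. This is a model of the intended behavior, not the Storage trait itself; `BlobStore` and its methods are hypothetical.

```rust
use std::collections::BTreeMap;

/// Flat key-value store: "/" is just a character in the key.
#[derive(Default)]
struct BlobStore {
    blobs: BTreeMap<String, Vec<u8>>,
}

impl BlobStore {
    /// No "directory" has to exist before writing a key.
    fn put(&mut self, key: &str, payload: &[u8]) {
        self.blobs.insert(key.to_string(), payload.to_vec());
    }

    /// Removing the last key under a prefix leaves nothing behind.
    /// Returns true if the key existed.
    fn delete(&mut self, key: &str) -> bool {
        self.blobs.remove(key).is_some()
    }

    /// "Listing a directory" is a prefix enumeration over the keys.
    fn list_prefix(&self, prefix: &str) -> Vec<&str> {
        self.blobs
            .keys()
            .filter(|key| key.starts_with(prefix))
            .map(|key| key.as_str())
            .collect()
    }
}
```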

Implement `search` CLI subcommand

Handle time range in the quickwit collector

The time range needs to be used to prune splits, but also to filter documents before they reach the collector.
This part has not been implemented yet.

Add cargo deny check in the CI

Cargo-deny makes it possible to check for license inconsistency.

We should add it to our CI.
https://crates.io/crates/cargo-deny

Persist doc mapper in metastore

Implement Atomic Split Publishing

Currently, splits are published at the end of indexing, but not atomically; indexing merely reports split failures.

What we want is an atomic way of publishing splits with all-or-nothing behavior. That is, if one of the splits fails to publish, we should consider the whole indexing operation as failed. As a suggestion, the Metastore could be a good place to implement this.

Metastore::publish_splits(splits: Vec<SplitId>)
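An all-or-nothing publish_splits could follow a validate-then-apply pattern, sketched below on an in-memory map. The state enum is reduced to two states and the signature is hypothetical; in a file-backed metastore, atomicity would additionally depend on writing the updated file in a single operation.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Staged,
    Published,
}

/// Publishes all the given splits, or none of them.
fn publish_splits(states: &mut HashMap<String, State>, split_ids: &[&str]) -> Result<(), String> {
    // Phase 1: validate every split before touching anything.
    for split_id in split_ids {
        if !states.contains_key(*split_id) {
            return Err(format!("split `{}` does not exist", split_id));
        }
    }
    // Phase 2: apply. Nothing can fail past this point.
    for split_id in split_ids {
        states.insert(split_id.to_string(), State::Published);
    }
    Ok(())
}
```

If any split is unknown, the function returns early and every split keeps its previous state.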

For the template itself, we can probably rely on http://www.harmonyagreements.org/faqs.html

We also need to document the CLAs and the process in a contributing.md file.

Define and enforce the restriction on index_id

We need to define and enforce what kind of string is accepted as an index-id.

Please edit this ticket to drive the discussion.
Read #55 to understand the actual nature of an index_id.
