tantivy-cli's Introduction

License: MIT

Tantivy-cli is the project hosting the command line interface for tantivy, a search engine project.

Tutorial: Indexing Wikipedia with Tantivy CLI

Introduction

In this tutorial, we will create a brand new index containing the articles of English Wikipedia.

Installing the tantivy CLI

There are two simple ways to add the tantivy CLI to your computer.

If you are a Rust programmer, you probably have cargo installed and you can just run cargo install tantivy-cli.

Alternatively, if you are on 64-bit Linux, you can directly download a static binary from binaries/linux_x86_64/ and save it in a directory on your system's PATH.

Creating the index: new

Let's create a directory in which your index will be stored.

    # create the directory
    mkdir wikipedia-index

We will now initialize the index and create its schema. The schema defines the list of your fields and, for each field:

  • its name
  • its type, currently u32 or str
  • how it should be indexed.

You can find more information about the indexing options on [tantivy's schema documentation page](http://fulmicoton.com/tantivy/tantivy/schema/index.html).

In our case, our documents will contain

  • a title
  • a body
  • a url

We want the title and the body to be tokenized and indexed. We also want to add the term frequencies and term positions to our index. (To be honest, phrase queries are not yet implemented in tantivy, so the positions won't really be useful in this tutorial.)

Running tantivy new will start a wizard that walks you through the definition of the schema of your new index.

Like all the other tantivy commands, it requires you to pass your index directory via the -i or --index parameter, as follows.

    tantivy new -i wikipedia-index

When prompted, answer the questions as follows:


    Creating new index 
    Let's define it's schema! 



    New field name  ? title
    Text or unsigned 32-bit Integer (T/I) ? T
    Should the field be stored (Y/N) ? Y
    Should the field be indexed (Y/N) ? Y
    Should the field be tokenized (Y/N) ? Y
    Should the term frequencies (per doc) be in the index (Y/N) ? Y
    Should the term positions (per doc) be in the index (Y/N) ? Y
    Add another field (Y/N) ? Y



    New field name  ? body
    Text or unsigned 32-bit Integer (T/I) ? T
    Should the field be stored (Y/N) ? Y
    Should the field be indexed (Y/N) ? Y
    Should the field be tokenized (Y/N) ? Y
    Should the term frequencies (per doc) be in the index (Y/N) ? Y
    Should the term positions (per doc) be in the index (Y/N) ? Y
    Add another field (Y/N) ? Y



    New field name  ? url
    Text or unsigned 32-bit Integer (T/I) ? T
    Should the field be stored (Y/N) ? Y
    Should the field be indexed (Y/N) ? N
    Add another field (Y/N) ? N

    [
    {
        "name": "title",
        "type": "text",
        "options": {
            "indexing": "position",
            "stored": true
        }
    },
    {
        "name": "body",
        "type": "text",
        "options": {
            "indexing": "position",
            "stored": true
        }
    },
    {
        "name": "url",
        "type": "text",
        "options": {
            "indexing": "unindexed",
            "stored": true
        }
    }
    ]


After the wizard has finished, a meta.json file has been written to wikipedia-index/meta.json. It is fairly human-readable JSON, so you may check its content.

It contains two sections:

  • segments (currently empty, but we will change that soon)
  • schema
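To check the result without reading the file by hand, the two sections can be summarized with a short script. This is a minimal Python sketch, not part of tantivy: it assumes meta.json is a JSON object with top-level "segments" and "schema" keys, and that "schema" is the list of field entries printed by the wizard. The function name summarize_meta is illustrative.

```python
import json
from pathlib import Path


def summarize_meta(meta: dict) -> dict:
    """Return the segment count and the field names of a parsed meta.json.

    Assumption: top-level "segments" and "schema" keys, where "schema" is a
    list of {"name": ..., "type": ..., "options": ...} entries.
    """
    return {
        "num_segments": len(meta.get("segments", [])),
        "fields": [field["name"] for field in meta.get("schema", [])],
    }


# Only read the file if the index directory from this tutorial exists.
meta_path = Path("wikipedia-index") / "meta.json"
if meta_path.exists():
    print(summarize_meta(json.loads(meta_path.read_text(encoding="utf-8"))))
```

Right after running the wizard, the segment count should be zero and the fields should be title, body and url, provided the layout assumption above holds for your version.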

Indexing the documents: index

Tantivy's index command offers a way to index a JSON file. More accurately, the file must contain one document per line, each formatted as a JSON object. The structure of this JSON object must match that of our schema definition.

    {"body": "some text", "title": "some title", "url": "http://somedomain.com"}
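If you want to index your own documents instead of the Wikipedia corpus, a file in this one-document-per-line format can be produced with the standard json module. The following is a minimal Python sketch; the helper names to_json_line and write_corpus are illustrative, and the field names are the ones from the schema defined above.

```python
import json


def to_json_line(title: str, body: str, url: str) -> str:
    """Serialize one document as a single-line JSON object."""
    # json.dumps escapes embedded newlines, so the result always fits on
    # one line, which is what the one-document-per-line format requires.
    return json.dumps({"title": title, "body": body, "url": url})


def write_corpus(docs, path):
    """Write an iterable of (title, body, url) tuples, one document per line."""
    with open(path, "w", encoding="utf-8") as out:
        for title, body, url in docs:
            out.write(to_json_line(title, body, url) + "\n")
```

For example, write_corpus([("some title", "some text", "http://somedomain.com")], "docs.json") writes a one-line file that can then be piped into the index command.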

For this tutorial, you can download a corpus with the 5 million+ English articles of Wikipedia in the right format here: wiki-articles.json (2.34 GB). Make sure to uncompress the file:

    bunzip2 wiki-articles.json.bz2

If you are in a rush, you can download 100 articles in the right format here.

The index command will index your documents. By default, it uses as many threads as there are cores on your machine. You can change the number of threads by passing it the -t parameter.

On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), it takes around 6 minutes.

    cat wiki-articles.json | tantivy index -i ./wikipedia-index

While it is indexing, you can peek at the index directory to check what is happening.

    ls ./wikipedia-index

If you indexed the 5 million articles, you should see a lot of new files. The main file is meta.json.

Our index is in fact divided into segments. Each segment acts as an individual smaller index. Its name is simply a UUID.

Serving the search index: serve

Tantivy's CLI also embeds a search server. You can run it with the following command.

    tantivy serve -i wikipedia-index

By default, the server listens on port 3000.

You can search for the top 20 most relevant documents for the query Barack Obama by opening the following URL in your browser:

http://localhost:3000/api/?q=barack+obama&explain=true&nhits=20
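The same URL can be built programmatically, which takes care of escaping spaces and special characters in the query. This is a minimal Python sketch: the parameter names q, explain and nhits are taken from the URL above, while the function name search_url and its defaults are illustrative.

```python
from urllib.parse import urlencode


def search_url(query: str, nhits: int = 20, explain: bool = True,
               base: str = "http://localhost:3000/api/") -> str:
    """Build a search URL for the server started by `tantivy serve`."""
    # urlencode escapes the query string; spaces become "+".
    params = {"q": query, "explain": str(explain).lower(), "nhits": nhits}
    return base + "?" + urlencode(params)
```

Calling search_url("barack obama") reproduces the URL shown above. The result can be opened in a browser or fetched with any HTTP client while the server is running.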

Optimizing the index: merge

Each of tantivy's indexer threads closes a new segment every 100K documents (this is completely arbitrary at the moment). You should have more than 50 segments in your directory at the moment.
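The figure of 50 follows from dividing the corpus size by the segment size. As a rough Python sketch of that arithmetic (the function name min_segments is illustrative, and the result is only a lower bound, since it assumes every segment except possibly the last is full):

```python
def min_segments(num_docs: int, docs_per_segment: int = 100_000) -> int:
    """Lower bound on the number of segments for a corpus of num_docs."""
    # Integer ceiling division: a partially filled segment still counts.
    return -(-num_docs // docs_per_segment)
```

With exactly 5,000,000 documents this gives 50. Since the corpus holds slightly more than 5 million articles, the actual count ends up above that.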

Having that many segments hurts your query performance (well, mostly for the fast queries). The tantivy merge command will merge your segments into one.

    tantivy merge -i ./wikipedia-index

(The command takes around 7 minutes on my computer.)

Note that the old segment files are still there even after running the command. However, meta.json now lists only one segment. You will need to remove the stale files manually.
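Before deleting anything, it helps to list which files no longer belong to a live segment. The following Python sketch only computes that list and deletes nothing. It rests on two assumptions that you should verify against your own index directory: that every segment file is named "<segment id>.<extension>", and that meta.json is the only file not tied to a segment. The function name stale_files is illustrative.

```python
def stale_files(filenames, live_segment_ids):
    """Return the file names that do not belong to any live segment.

    Assumptions (not confirmed by the tantivy documentation): segment files
    are named "<segment id>.<extension>", and ids may be written with or
    without hyphens, so hyphens are ignored when comparing.
    """
    live = {seg_id.replace("-", "") for seg_id in live_segment_ids}
    stale = []
    for name in filenames:
        if name == "meta.json":
            continue  # the main index file is never stale
        stem = name.split(".", 1)[0].replace("-", "")
        if stem not in live:
            stale.append(name)
    return sorted(stale)
```

A typical use is to pass it os.listdir("wikipedia-index") together with the segment ids read from meta.json, then review the returned names by hand before removing them.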
