The current API sticks very closely to the tantivy API.
Some of the choices in the current API are strongly specific to Rust and do not necessarily make sense in Python. Also, some of the current choices are not necessarily very smart, even in the context of Rust.
I suspect we could simplify our API a bit.
I explored different changes in #3, but I would like us to discuss them somewhere.
Fields
Getting Field objects, obtained either at the creation of the Schema or by querying the Schema, can be cumbersome. I suggest we let the user identify fields by string all the way along.
Note that this bit may change in Rust's tantivy as well.
The README currently reads:
builder = tantivy.SchemaBuilder()
title = builder.add_text_field("title", stored=True)
body = builder.add_text_field("body")
schema = builder.build()
We could stop saving title and body, and even introduce a fluent interface. (Note: I do not know how to return self with PyO3. It might be impossible.)
schema = (
    tantivy.SchemaBuilder()
    .add_text_field("title", stored=True)
    .add_text_field("body")
    .build()
)
When referring to fields then, users could always directly pass in strings instead of the field objects.
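As a rough illustration of the fluent, string-keyed builder discussed above, here is a pure-Python sketch. FluentSchemaBuilder and its dict-based schema are hypothetical stand-ins, not part of tantivy or its bindings. The sketch only shows the chaining pattern; it says nothing about whether PyO3 can return self.

```python
class FluentSchemaBuilder:
    """Hypothetical pure-Python stand-in; not the real tantivy binding."""

    def __init__(self):
        # Fields are keyed by their string name; no Field objects are exposed.
        self._fields = {}

    def add_text_field(self, name, stored=False):
        if name in self._fields:
            raise ValueError(f"duplicate field: {name!r}")
        self._fields[name] = {"type": "text", "stored": stored}
        # Returning self is what makes method chaining possible.
        return self

    def build(self):
        # Return a copy so later builder calls cannot mutate the built schema.
        return dict(self._fields)


schema = (
    FluentSchemaBuilder()
    .add_text_field("title", stored=True)
    .add_text_field("body")
    .build()
)
```

With this shape, the caller never holds on to title or body objects; the strings "title" and "body" are the only handles.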
Documents
Conceptually, documents are just maps from a field to a list of values.
We could remove the notion of document entirely and use dictionaries instead.
It might make onboarding much more straightforward, and leave us with an arguably more appealing README.
That being said, as stated above, tantivy allows more than one value per field.
In tantivy's JSON serializer and deserializer, the choice was made to accept the single-value format as input, so that both
{ "title": "The old man and the sea"}
and
{ "title": ["The old man and the sea"]}
are valid documents.
On deserialization, we always deserialize as {"title": ["The old man and the sea"]}
This could lead to confusion among users, so there could be value in keeping a structured Document object.
We could define __getitem__ and __len__ to make the user's life easier while keeping them aware that a field can be multivalued.
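To make the trade-off concrete, here is a minimal pure-Python sketch of such a structured document. The Document class, its keyword-argument constructor, and to_dict are hypothetical and not the tantivy binding. It also assumes that __len__ counts fields rather than values, which the discussion above leaves open.

```python
class Document:
    """Hypothetical sketch of a multivalued document; not the tantivy API."""

    def __init__(self, **fields):
        self._fields = {}
        for name, value in fields.items():
            # Accept both the single-value form and the list form as input,
            # but always store a list so the multivalued nature stays visible.
            if isinstance(value, (list, tuple)):
                self._fields[name] = list(value)
            else:
                self._fields[name] = [value]

    def __getitem__(self, field_name):
        # Always returns a list of values, even for a single value.
        return self._fields[field_name]

    def __len__(self):
        # Assumption: the length is the number of fields, not of values.
        return len(self._fields)

    def to_dict(self):
        return {name: list(values) for name, values in self._fields.items()}


single = Document(title="The old man and the sea")
multi = Document(title=["The old man and the sea"])
```

Both inputs end up in the same normalized form, so single["title"] and multi["title"] each return a one-element list.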
Reader
We could have the reader be part of the Index.
The searcher would then be acquired directly from the index.
In a similar spirit, query parsing could be exposed directly as methods of Index.
Also, we could have a helper to perform a vanilla top-K search, with or without a count of documents, directly from the index.
The README sample could become...
searcher = index.searcher()
query = index.parse_query("sea whale", default_fields=["title", "body"])
search_results = searcher.search(query, nhits=10, count=True)
print(search_results.count)
(_score, doc_address) = search_results.hits()[0]
searched_doc = searcher.doc(doc_address)
assert searched_doc["title"] == ["The Old Man and the Sea"]
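For the top-K helper mentioned above, here is a toy in-memory sketch of what "search directly from the index" could look like. ToyIndex, SearchResults, and the term-count scoring are all hypothetical illustrations; none of them reflect how tantivy scores or stores documents.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SearchResults:
    hits: List[Tuple[int, int]]  # (score, doc_address) pairs, best first
    count: Optional[int]         # total matches, or None if not requested


class ToyIndex:
    """Hypothetical in-memory stand-in for an index; not the tantivy API."""

    def __init__(self, docs):
        # Each document maps a field name to a list of string values.
        self._docs = docs

    def search(self, text, default_fields, nhits=10, count=False):
        terms = text.lower().split()
        scored = []
        for doc_address, doc in enumerate(self._docs):
            words = []
            for field in default_fields:
                for value in doc.get(field, []):
                    words.extend(value.lower().split())
            # Naive score: number of occurrences of the query terms.
            score = sum(words.count(term) for term in terms)
            if score > 0:
                scored.append((score, doc_address))
        # Keep only the nhits best-scoring documents.
        hits = heapq.nlargest(nhits, scored)
        return SearchResults(hits=hits, count=len(scored) if count else None)


index = ToyIndex([
    {"title": ["The Old Man and the Sea"], "body": ["He was an old man who fished alone"]},
    {"title": ["Moby Dick"], "body": ["The whale hunter returned to the sea"]},
    {"title": ["Of Mice and Men"], "body": ["A few miles south of Soledad"]},
])
results = index.search("sea whale", default_fields=["title", "body"], nhits=1, count=True)
```

The count flag stays optional because counting every match can cost more than collecting only the top hits, which is presumably why the helper would offer both modes.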