blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go

License: Apache License 2.0

Go 99.78% Shell 0.01% Yacc 0.22%

bleve's Introduction

bleve

A modern text indexing library in go

Features

Index any go data structure (including JSON)
Intelligent defaults backed up by powerful configuration
Supported field types:
- text, number, datetime, boolean, geopoint, geoshape, IP, vector
Supported query types:
- Term, Phrase, Match, Match Phrase, Prefix, Fuzzy
- Conjunction, Disjunction, Boolean (must/should/must_not)
- Term Range, Numeric Range, Date Range
- Geo Spatial
- Simple query string syntax
- Vector Search
tf-idf Scoring
Query time boosting
Search result match highlighting with document fragments
Aggregations/faceting support:
- Terms Facet
- Numeric Range Facet
- Date Range Facet

Indexing

message := struct{
	Id   string
	From string
	Body string
}{
	Id:   "example",
	From: "user@example.com",
	Body: "bleve indexing is easy",
}

mapping := bleve.NewIndexMapping()
index, err := bleve.New("example.bleve", mapping)
if err != nil {
	panic(err)
}
if err := index.Index(message.Id, message); err != nil {
	panic(err)
}

Querying

index, err := bleve.Open("example.bleve")
if err != nil {
	panic(err)
}
searchRequest := bleve.NewSearchRequest(bleve.NewQueryStringQuery("bleve"))
searchResult, err := index.Search(searchRequest)
if err != nil {
	panic(err)
}

Command Line Interface

To install the CLI for the latest release of bleve, run:

$ go install github.com/blevesearch/bleve/v2/cmd/bleve@latest

$ bleve --help
Bleve is a command-line tool to interact with a bleve index.

Usage:
  bleve [command]

Available Commands:
  bulk        bulk loads from newline delimited JSON files
  check       checks the contents of the index
  count       counts the number documents in the index
  create      creates a new index
  dictionary  prints the term dictionary for the specified field in the index
  dump        dumps the contents of the index
  fields      lists the fields in this index
  help        Help about any command
  index       adds the files to the index
  mapping     prints the mapping used for this index
  query       queries the index
  registry    registry lists the bleve components compiled into this executable
  scorch      command-line tool to interact with a scorch index

Flags:
  -h, --help   help for bleve

Use "bleve [command] --help" for more information about a command.

Text Analysis

Bleve includes general-purpose analyzers (customizable) as well as pre-built text analyzers for the following languages:

Arabic (ar), Bulgarian (bg), Catalan (ca), Chinese-Japanese-Korean (cjk), Kurdish (ckb), Danish (da), German (de), Greek (el), English (en), Spanish - Castilian (es), Basque (eu), Persian (fa), Finnish (fi), French (fr), Gaelic (ga), Spanish - Galician (gl), Hindi (hi), Croatian (hr), Hungarian (hu), Armenian (hy), Indonesian (id, in), Italian (it), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Swedish (sv), Turkish (tr)

Text Analysis Wizard

bleveanalysis.couchbase.com

Discussion/Issues

Discuss usage/development of bleve and/or report issues here:

License

Apache License Version 2.0


bleve's Issues

token synonym filter

ability to load synonyms from files (like stop word lists)
ability to either expand (index all synonyms)
or contract (consolidate synonyms to single version)
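The two modes described above can be illustrated with a small standalone sketch. This is not bleve's filter API: the `synonymFilter` type, its constructor, and the convention that the first entry of each group is the canonical form are all assumptions made for the example.

```go
package main

import "fmt"

// synonymFilter is a hypothetical type for this sketch. It maps every known
// term to its synonym group; by assumption, group[0] is the canonical form.
type synonymFilter struct {
	groups map[string][]string
}

// newSynonymFilter builds the lookup table from a list of synonym groups,
// such as might be parsed from a file with one group per line.
func newSynonymFilter(groups [][]string) *synonymFilter {
	f := &synonymFilter{groups: make(map[string][]string)}
	for _, g := range groups {
		for _, term := range g {
			f.groups[term] = g
		}
	}
	return f
}

// Expand emits every member of a token's synonym group (index all synonyms).
func (f *synonymFilter) Expand(tokens []string) []string {
	var out []string
	for _, tok := range tokens {
		if g, ok := f.groups[tok]; ok {
			out = append(out, g...)
		} else {
			out = append(out, tok)
		}
	}
	return out
}

// Contract replaces each token with the canonical member of its group
// (consolidate synonyms to a single version).
func (f *synonymFilter) Contract(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if g, ok := f.groups[tok]; ok {
			out = append(out, g[0])
		} else {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	f := newSynonymFilter([][]string{{"fast", "quick", "rapid"}})
	fmt.Println(f.Expand([]string{"quick", "fox"}))
	fmt.Println(f.Contract([]string{"rapid", "fox"}))
}
```

Expanding grows the index but keeps queries simple; contracting keeps the index small but requires the same filter to run at query time.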

also, investigate wordnet: http://wordnet.princeton.edu/

make stop token dictionaries loadable from file

czech stemmer

truncate token length filter

truncate token at the specified max length
useful for fields left as a single token
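A minimal sketch of such a filter follows. The issue does not say whether the maximum is measured in bytes or in characters; this example assumes characters (runes) so that multi-byte UTF-8 sequences are never split. The function name `truncateToken` is hypothetical.

```go
package main

import "fmt"

// truncateToken shortens a token to at most maxLen runes. Counting runes
// rather than bytes is an assumption of this sketch; it avoids cutting a
// multi-byte UTF-8 character in half.
func truncateToken(token string, maxLen int) string {
	if maxLen < 0 {
		maxLen = 0
	}
	runes := []rune(token)
	if len(runes) <= maxLen {
		return token
	}
	return string(runes[:maxLen])
}

func main() {
	for _, tok := range []string{"indexing", "go", "naïveté"} {
		fmt.Println(truncateToken(tok, 5))
	}
}
```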

catalan stemmer

field type for numeric values

add elision token filter

will improve analyzer for french
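As a rough illustration of what an elision filter does, the sketch below strips a leading elided article (for example "l'" in "l'avion") from a token. The article list is illustrative rather than exhaustive, and the function name `removeElision` is hypothetical, not a bleve API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// frenchArticles is an illustrative, non-exhaustive set of elided prefixes.
var frenchArticles = map[string]bool{
	"l": true, "d": true, "j": true, "m": true, "t": true,
	"s": true, "n": true, "c": true, "qu": true,
}

// removeElision strips a leading elided article and its apostrophe.
// Tokens without a recognized prefix are returned unchanged.
func removeElision(token string) string {
	// Accept both the ASCII apostrophe and the typographic one.
	i := strings.IndexAny(token, "'’")
	if i <= 0 {
		return token
	}
	if !frenchArticles[strings.ToLower(token[:i])] {
		return token
	}
	// Skip the apostrophe itself, which may be multi-byte.
	_, size := utf8.DecodeRuneInString(token[i:])
	return token[i+size:]
}

func main() {
	for _, tok := range []string{"l'avion", "qu’il", "aujourd'hui", "avion"} {
		fmt.Println(removeElision(tok))
	}
}
```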

use protobufs to encode index values

While we can't use them for the index keys which we craft to get the desired sort order, we should use protobufs to encode the index values. This will make the binary serialization/deserialization less error prone, more compact, and easier to evolve over time.

create initial wiki pages

initial wiki pages for:

building bleve with all the c libraries
creating a custom mapping
one for each type of field
one for each type of analyzer/character filter/tokenizer/token filter
- these can be stubs initially, but serve as place holders for adding information over time
one for each type of query

galician stemmer

indonesian stemmer

document mapping supports ignoring sections of documents

arabic stemmer

add normalization token filters for various languages

arabic, german, hindi, indic, kurdish, persian, and scandinavian

irish stemmer

change back index entries to just contain list of keys

Currently the back index contains 2 separate lists of more strongly typed data. This should be changed to just a flat list of keys. This will make it easier to introduce new index row types in the future without having to keep updating the way the back index works.

basque stemmer

add support for some sort of synthesized _all field

need hindi stemmer

add apostrophe filter

will improve turkish analyzer

token filter that filters by min/max codepoint length

use unicode/utf8 package RuneCount method
also rename existing length filter to ByteLength
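The difference between byte length and code point length can be sketched as follows. The function name and the convention that a non-positive maximum means "no upper bound" are assumptions of this example; it uses `utf8.RuneCountInString`, the string counterpart of `utf8.RuneCount`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// filterByRuneLength keeps tokens whose code point count lies in
// [minLen, maxLen]. A maxLen <= 0 disables the upper bound (an assumption
// of this sketch).
func filterByRuneLength(tokens []string, minLen, maxLen int) []string {
	kept := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		n := utf8.RuneCountInString(tok)
		if n < minLen {
			continue
		}
		if maxLen > 0 && n > maxLen {
			continue
		}
		kept = append(kept, tok)
	}
	return kept
}

func main() {
	// "héllo" is 5 code points but 6 bytes, so a byte-length filter with
	// max 5 would drop it while this filter keeps it.
	fmt.Println(filterByRuneLength([]string{"a", "héllo", "indexing"}, 2, 5))
}
```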

support for facet queries

Initial implementation should just operate at query time.

If we swap the field and term order in the index key we can support faceting at query time. For every document satisfying the original query, we can look up the document in the back index, and find entries for the field that is being faceted. Seems like we don't even have to load that key, just be able to parse the field id and terms. For categorical facets the terms are bucketed and counted. For numerical range facets the parsed terms are bucketed and counted. The top-N facets are then returned with the query results.
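The query-time counting step for a categorical facet could look roughly like the sketch below. The back index is modeled as a plain map from document ID to the terms of the faceted field, which is a simplification made for illustration; `termsFacet` and `facetCount` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// facetCount pairs a term with the number of matching documents containing it.
type facetCount struct {
	Term  string
	Count int
}

// termsFacet looks up each matching document in the (simplified) back index,
// counts the terms of the faceted field, and returns the top n buckets.
// Ties are broken alphabetically so the result is deterministic.
func termsFacet(matches []string, backIndex map[string][]string, n int) []facetCount {
	counts := make(map[string]int)
	for _, id := range matches {
		for _, term := range backIndex[id] {
			counts[term]++
		}
	}
	facets := make([]facetCount, 0, len(counts))
	for term, c := range counts {
		facets = append(facets, facetCount{Term: term, Count: c})
	}
	sort.Slice(facets, func(i, j int) bool {
		if facets[i].Count != facets[j].Count {
			return facets[i].Count > facets[j].Count
		}
		return facets[i].Term < facets[j].Term
	})
	if n >= 0 && len(facets) > n {
		facets = facets[:n]
	}
	return facets
}

func main() {
	backIndex := map[string][]string{
		"d1": {"go", "search"},
		"d2": {"go"},
		"d3": {"rust"},
	}
	// Only d1 and d2 satisfied the original query.
	fmt.Println(termsFacet([]string{"d1", "d2"}, backIndex, 2))
}
```

A numeric range facet would follow the same shape, with each parsed value assigned to a range bucket instead of being counted by exact term.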

index term entry should be able to include hierarchical position data

Currently index term entries are:

't'

Would like to add support for also storing the position of this term in any arrays that were a part of the field path.

Not 100% decided that this must be in the key, but that would be the only way to have some hope of efficiently querying on this information.

The idea is to be able to further qualify queries and say that in addition to other query criteria, matching items must occur in the same parent element.

Consider the following documents in an index.

{
  "name": "a",
  "children": [
      {
          "name": "c",
          "age": 25
      },
      {
          "name": "d",
          "age": 15
      }
  ]
}

{
  "name": "b",
  "children": [
      {
          "name": "c",
          "age": 15
      },
      {
          "name": "d",
          "age": 25
      }
  ]
}

Logically we want to query:
child.name = "c" AND child.age < 20 AND same child

Both documents have a child named "c" and a child whose age is less than 20, but ONLY "b" satisfies both criteria in the same child.

The implementation idea is to include the position in the children array, and the query criteria "same child" is accomplished by verifying that matching items have the same position value.
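The "same child" check can be sketched as an intersection on (document, array position) pairs. The `posting` type and `sameChild` function are hypothetical and only model the idea; they do not reflect how bleve stores index rows.

```go
package main

import "fmt"

// posting records a matching document and the position, within the
// children array, of the element that matched.
type posting struct {
	DocID    string
	ChildPos int
}

// sameChild returns the IDs of documents in which some match from a and
// some match from b occurred at the same array position.
func sameChild(a, b []posting) []string {
	seen := make(map[posting]bool, len(a))
	for _, p := range a {
		seen[p] = true
	}
	var docs []string
	added := make(map[string]bool)
	for _, p := range b {
		if seen[p] && !added[p.DocID] {
			added[p.DocID] = true
			docs = append(docs, p.DocID)
		}
	}
	return docs
}

func main() {
	// Postings derived from the two example documents.
	nameIsC := []posting{{"a", 0}, {"b", 0}}    // child name is "c"
	ageUnder20 := []posting{{"a", 1}, {"b", 0}} // child age is below 20
	fmt.Println(sameChild(nameIsC, ageUnder20)) // prints [b]
}
```

Document "a" matches both criteria, but at different positions (0 and 1), so only "b" is returned.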