
hunt's Introduction


Hunt is a flexible, lightweight search platform written in Haskell.

The default server implementation provides a powerful JSON-API and a simple web interface.

Features

  • Powerful query language
  • Schema support (numeric data, dates, geospatial data)
  • Granular ranking capabilities
  • JSON API
  • Extensible architecture

Installation

Dependencies
  • GHC: The Glasgow Haskell Compiler
  • Cabal: Haskell package management tool
Hunt Installation

The easiest way to get set up is to install the current Haskell Platform.

Linux

If you're using Linux, you can use make for the build.

git clone https://github.com/hunt-framework/hunt.git
cd hunt
make sandbox install
Windows

If you're using Windows, you can use cabal for the build. If you would like to use sandboxes on Windows, you can copy the necessary cabal commands from our Makefile.

git clone https://github.com/hunt-framework/hunt.git
cd hunt/hunt-server
cabal install

Getting Started

The following line starts the default server. The web interface is available at http://localhost:3000/.

make startServer

A small sample data set can be inserted with:

make insertJokes

FAQ

Can I run Hunt on a 32-bit machine?

No, we are using 64-bit hashes for our internal document IDs. Collisions are much more likely for 32-bit hashes and the available memory would be limited to 4GB.

Why is the CPU usage so high when idle?

GHC performs a major garbage collection every 0.3 seconds when idle, which can be computationally expensive on a big index. This can be disabled with the GHC RTS option -I0.

Development / History

Hunt was started in 2013 by Ulf Sauer and Chris Reumann to improve and extend the existing Holumbus framework. Holumbus was developed in 2008-2009 by Timo B. Kranz and Sebastian M. Gauck and powers the current Haskell API search Hayoo!. We decided to rebrand, because Hunt represents a major rewrite and breaks compatibility.

A new Hayoo implementation is currently under development by Sebastian Philipp.

Both projects were developed at the FH Wedel under supervision and active support of Prof. Dr. Uwe Schmidt.


hunt's Issues

Query Processor: search has to be context-type aware

Values get normalized before being stored in the index.

E.g., for the date context type, 2013-07-21 gets normalized to 20130721000000.

Right now the query processor does not take that into account. Users can search for "2013-07-21" and won't get a result; they have to search for "20130721".

Query processing has to be context aware.

empty result for set queries with and without context

This query works:

"mapM package:base"
{icQuery = QBinary And (QWord QNoCase "mapM") (QContext ["package"] (QWord QNoCase "base")), icOffsetSR = 0, icMaxSR = 20}

This doesn't:

"package:base mapM"
{icPrefixCR = QBinary And (QContext ["package"] (QWord QNoCase "base")) (QWord QNoCase "mapM"), icMaxCR = 20}

Interpreter: Improve read/write locking

The current implementation uses a simple MVar.
Readers don't block each other (readMVar), but writers block readers and writers until they are finished (takeMVar, write, putMVar).

Readers should never be blocked and use the "old" index until the writer is done.
Writers should block each other so that writes can only occur sequentially.

Default instances for several types

I think we already have a dependency on the defaults package, so it makes sense to implement Default instances for common data types.

The index environment is a good example.

This looks ugly in the examples right now.

ContextTypes should be a map

The context type name should be unique. This could be expressed more easily with a map using the name as the key.

type ContextTypes = [ContextType]
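
A minimal sketch of the suggested change, using a stand-in ContextType record with a hypothetical ctName field (the real type has more fields):

import qualified Data.Map.Strict as M
import           Data.Text       (Text)

-- Stand-in for Hunt's ContextType; only the name matters for this sketch.
data ContextType = ContextType { ctName :: Text }

-- Suggested change: key the context types by their unique name.
type ContextTypes = M.Map Text ContextType

-- Converting from the current list representation:
contextTypesFromList :: [ContextType] -> ContextTypes
contextTypesFromList ts = M.fromList [ (ctName t, t) | t <- ts ]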

Normalizer: Be more extensible here.

I'd like to change the way we define normalizer functions. With this change it would be easy to write new normalizer functions and add them by simply implementing this small datatype.

I think we don't lose a lot of type safety by removing the enum, because we rarely use it anyway.

Any concerns? What do you think?

-- current implementation
data CNormalizer = NormUpperCase | NormLowerCase | NormDate | NormPosition | NormIntZeroFill
  deriving (Show, Eq)

-- suggested change:
data CNormalizer = CNormalizer { normalize :: Text -> Text }
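
With the record-based variant above, the current enum cases become plain values, and a new normalizer is just another value of the same type. A small sketch (Data.Text provides toUpper and toLower; the date example is only illustrative):

import qualified Data.Text as T

normUpperCase, normLowerCase :: CNormalizer
normUpperCase = CNormalizer T.toUpper
normLowerCase = CNormalizer T.toLower

-- A user-defined normalizer, e.g. stripping dashes from dates
-- ("2013-07-21" -> "20130721"):
normDate :: CNormalizer
normDate = CNormalizer (T.filter (/= '-'))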

Completion: include contexts in some way?

For example: if you have an empty search field and start typing, and we included the context names in the completion somehow, they would be suggested as well:

search: "con"
completion: [ "contextgeo:", "contextdate:"... ]

Change handling of invalid queries on different context types

Imagine the case where a text index and a date index are created.

Right now all general queries run on each context. If the searched value is not compatible with one context type, the whole search fails.

For this case we added the "default" option for the ContextSchema and context.

But I'm not sure if this is really the easiest way to solve this. Maybe it would be better to remove this "default" option and handle the search differently: if validation of the search term fails for a context, we could set that context's search result to empty instead of throwing an error.

What do you think?

Schema: Add proper Int support

There is no proper support for Int right now.
The input needs to be padded, e.g.

  2 -> 0000000002
157 -> 0000000157

or with simple support for negative numbers

   2 -> 10000000002
  -2 -> 00000000002
 157 -> 10000000157
-157 -> 00000000157

or

  1 -> 5000000001
  0 -> 5000000000
 -1 -> 4999999999

The validator needs to exclude values that are out of range of this encoding.
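
A sketch of the third (offset) variant, padding to ten digits after shifting by 5000000000; the exact width and offset are illustrative assumptions:

import           Data.Text (Text)
import qualified Data.Text as T

-- Shift by 5 * 10^9 so negative numbers sort correctly as text,
-- then zero-pad to a fixed width of ten digits.
encodeInt :: Int -> Text
encodeInt n = T.justifyRight 10 '0' (T.pack (show (n + 5000000000)))

-- The validator rejects values outside the encodable range.
validInt :: Int -> Bool
validInt n = n >= -5000000000 && n < 5000000000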

Related issue: #1

DocTable: Compression

We have to reintegrate document compression within the DocTable. I guess a trivial approach would be okay.

Add Ranking per Document

We need to add a document-based ranking. This can either be part of the document description or part of the document itself.

Hayoo needs a ranking based on the document itself.

Change module structure

Right now we have:

Hunt.Interpreter.Interpreter
Hunt.Index.Index

This redundancy is a bit ugly. Maybe we should change this before releasing 0.1.

Snapshots

Basic filesystem-like snapshot support.

Completing the parallel-over-context implementation

We have implemented the parallel map over contexts for the BatchInsert command, but for none of the others.

BatchDelete should perform the deletion in parallel as well. Then there is update: update is implemented as a delete followed by an insert, so if insert and delete performance is critical, update performance is even worse.

For that reason we need BatchUpdate as well. It should be implemented within the internal Command language like BatchInsert and BatchDelete. Since it is based on BatchInsert and BatchDelete, implementation should be trivial.
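
A rough sketch of that idea, with hypothetical constructors rather than Hunt's actual Command type:

-- Hypothetical, simplified command language; Hunt's real Command type differs.
data Command doc uri
  = BatchInsert [doc]
  | BatchDelete [uri]
  | Sequence    [Command doc uri]

-- BatchUpdate expressed in terms of the existing batch commands.
batchUpdate :: (doc -> uri) -> [doc] -> Command doc uri
batchUpdate docUri docs =
  Sequence [BatchDelete (map docUri docs), BatchInsert docs]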

In the ContextMap we still have functions for single insertion and deletion. I think that, due to the batch optimizations, they are no longer used. Even a single insert should be transformed into a BatchInsert with just one document, so these single-element functions should be removed to clean up the interface.

Any thoughts?

Store/Load does not work

The commands are currently disabled/undefined - probably because some Binary instances are missing.

Overall Performance

Everything performance related

ContextIndex: Inserting single documents in parallel

Most documents insert values into multiple contexts. Since the contexts form a map, we should be able to improve insertion performance by performing these insertions in parallel, at least within a single document.

I implemented this within the addWords function. I'm not 100% sure everything is perfectly optimized yet, but a first benchmark showed improvements:

Test dataset: 300 documents containing between 200 and 10,000 words each.
Test run on an 8-core machine with all cores active.

The sequential foldM is the initial implementation.

foldM sequential    mapM sequential    mapM parallel
46.12884s           48.05277s          32.794021s
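
One way to express the parallel map over a document's contexts, sketched with the async package and a hypothetical per-context insert action (the actual implementation lives in addWords):

import           Control.Concurrent.Async (mapConcurrently)
import           Data.Map                 (Map)
import qualified Data.Map                 as M
import           Data.Text                (Text)

type Context = Text

-- Insert one document's words into all of its contexts in parallel.
-- insertIntoContext is a hypothetical per-context insertion action.
insertDocument :: (Context -> [Text] -> IO ())  -- insertIntoContext
               -> Map Context [Text]            -- words per context
               -> IO ()
insertDocument insertIntoContext wordsByCx =
  () <$ mapConcurrently (uncurry insertIntoContext) (M.toList wordsByCx)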

Completion: suggested words should be related

Current auto completion returns results only for the currently typed word, without taking the rest of the query into account.

f.e.: "word1 wor" will return all completions for "wor", not only all completions of "wor" that follow "word1".

We could improve this by splitting up the query: search exactly for the first part and perform a fuzzy prefix search on the second part.
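
A tiny sketch of that split, with a hypothetical helper: everything before the last word is searched exactly, the last word gets the fuzzy prefix search:

import           Data.Text (Text)
import qualified Data.Text as T

-- "word1 wor" -> ("word1", "wor")
splitForCompletion :: Text -> (Text, Text)
splitForCompletion q = case T.words q of
  [] -> (T.empty, T.empty)
  ws -> (T.unwords (init ws), last ws)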

Normalizers: No default normalizer

That's a bit strange, because we need default normalization for numbers, dates and geo positions. There should be at least one default for each context type.

E.g., the date default: 2013-01-01 -> 20130101

Users should be able to override those defaults, but they should not be forced to configure them themselves.

Normalization: Handle default normalizers and user-defined normalizers separately

At the moment we don't make a separation here. That's a problem, because for some context types, like int, geo/position and date, a default validation is required. The user configuration should not be able to change that.

The idea is to keep user-defined normalization as it is implemented now, with a normalization function from Text -> Text.

The type-specific required normalization could be handled by an IndexProxy implementation. That way we could support index structures other than the StringMap, with different types, without changing anything in the interpreter or query processor.
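
One shape such a proxy could take, sketched with a hypothetical minimal index class (the real Hunt Index class is richer):

import Data.Text (Text)

-- Hypothetical, stripped-down index interface.
class Index i where
  insertKey :: Text -> v -> i v -> i v
  searchKey :: Text -> i v -> [(Text, v)]

-- Proxy that applies the type-specific normalization before delegating,
-- independent of any user-configured Text -> Text normalizers.
data NormalizedIndex i v = NormalizedIndex
  { niNormalize :: Text -> Text
  , niInner     :: i v
  }

instance Index i => Index (NormalizedIndex i) where
  insertKey k v (NormalizedIndex f ix) = NormalizedIndex f (insertKey (f k) v ix)
  searchKey k   (NormalizedIndex f ix) = searchKey (f k) ix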

Occurrences should be a type class

First of all, this is just an idea and I don't think we could implement it within the time frame of the thesis.

The Occurrences data structure is the one part of the whole implementation that is still not very flexible. It is bound to the InvertedIndex implementation and cannot be replaced with something else, because it is used heavily within the query processor.

I think that instead of an Occurrences data type, a type class "IndexValue" might be more suitable. It would also help to test different Occurrences implementations, like the BinTree vs. the IntMap.
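
A very rough sketch of what such a class could provide; the method set is an assumption, the real requirements would come from the query processor:

-- Hypothetical interface abstracting over the concrete Occurrences type.
class IndexValue v where
  emptyValue :: v
  mergeValue :: v -> v -> v   -- union, e.g. for merging context results
  diffValue  :: v -> v -> v   -- difference, e.g. for deletions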

Any thoughts about this?

Merge ContextIndex and IndexHandler

We never really liked the name IndexHandler, so we wanted to find a better name here anyway. We also have the ContextIndex listed as an index proxy; while that was correct at first, after a couple of changes it is no longer an index proxy.

I think it would be a good idea to merge those two modules. Any thoughts?

--merge these
data IndexHandler dt = IXH     
  { ixhIndex  :: CIx.ContextIndex Occurrences 
  , ixhDocs   :: dt            
  , ixhSchema :: Schema        
  }

newtype ContextIndex v         
  = ContextIx { contextIx :: Map Context (Impl.IndexImpl v) }
  deriving (Show)

-- to:         
data ContextIndex dt = ContextIx
  { ixhIndex  :: Map Context (Impl.IndexImpl Occurrences) 
  , ixhDocs   :: dt            
  , ixhSchema :: Schema        
  }

Clean up the API

There are a couple of things we should think through before releasing the package on hackage.

The common package is pretty confusing right now with regard to the exported types.

Maybe we need another module:
Hunt.Interpreter.Config
which exposes all the configuration stuff needed for the interpreter.

Right now the client needs to import multiple modules like:
Hunt.Query.Ranking (function defaultRankConfig)
only to initialize the interpreter.

It's impossible to search for keys containing spaces

That's because the query parser will split the query into words:

[2014-01-05 15:28:06 CET : Holumbus.Interpreter.Interpreter : DEBUG] Executing command: Completion {icPrefixCR = QBinary And (QWord QNoCase "StringMap") (QBinary And (QWord QNoCase "a") (QBinary And (QWord QNoCase "->") (QWord QNoCase "Bool"))), icMaxCR = 20}

Using a Phrase doesn't work either:

GET /search/%22StringMap%20a%20-%3E%20Bool%22/0/20
Accept: 
[2014-01-05 15:28:25 CET : Holumbus.Interpreter.Interpreter : DEBUG] Executing command: Search {icQuery = QPhrase QNoCase "StringMap a -> Bool", icOffsetSR = 0, icMaxSR = 20}
Status: 200 OK. /search/%22StringMap%20a%20-%3E%20Bool%22/0/20

Memory profiling


This issue is about the question of whether ByteString serialization is a good or a bad thing. We really had a lot of trouble with it and, despite certain optimizations, it seemed not to improve the overall index size.

Setup: I always used the same stack of indexes.

newtype InvertedIndex _v = InvIx { invIx :: KeyProxyIndex Text ComprOccPrefixTree CompressedOccurrences }

I only exchanged the CompressedOccurrences type.

Test 1: 3000 documents with ~200 words each (25 MB JSON)

I compared the serialized files on disk.

Compression        Size on disk    Size in memory
no compression     49 MB           260 MB
only ByteString    49.1 MB         83 MB
bzip ByteString    27.1 MB         73 MB

Conclusion

The tests were run with profiling and the profiling results look okay. It seems like the extra compression makes sense after all.

While executing this benchmark a strictness bug regarding lazy Occurrences was fixed.

The profiling results confirm the benchmark: the profiled values match the corresponding memory footprints.

ByteString without further compression is only a bit bigger than the compressed variants. This makes sense, since we deal with small values here. The conversion to ByteString also takes more time than the compression afterwards.

The uncompressed result is as big as it is because of the hashed DocIds within the IntMap. The positions within the IntSet are not a factor.

Index: Compression does not work

Huge memory hog and overall slow performance.
(Re)introduced with commit a374670.

Memory usage with jokes dataset now and before commit a374670:
before: ~19MB
now: ~175MB

Insertion is at least 3x slower too.

Processor: range queries - range check

This is not correct for all possible values: rangeValidator just checks for <=. I think this needs to be performed on the normalized values, or at least be implemented for each CType separately. For now I disabled this check for the position search.

Normalizer.hs

rangeValidator :: CType -> [Text] -> [Text] -> Bool
rangeValidator t from to = case t of
  -- XXX TODO real range check for positions
  CPosition -> True

Processor.hs

-- values form a valid range    
-- XXX fix range validation - normalized values should be compared
--    unless' (rangeValidator cType ls' hs')
--            400 $ "invalid range for context: " `T.append` rangeText
-- type determines the processing
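
A sketch of what a check on the normalized values could look like, with a hypothetical helper that takes the context's normalization function:

import Data.Text (Text)

-- Compare the normalized bounds, so that e.g. normalized dates
-- ("20130101" <= "20131231") are ordered correctly as text.
rangeValidatorNorm :: (Text -> Text) -> [Text] -> [Text] -> Bool
rangeValidatorNorm norm froms tos =
  and (zipWith (<=) (map norm froms) (map norm tos))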

Geo Index

New context type for indexing geo-positions. We should try out different things here.

  1. use a prefix tree to store positions
  2. implement an R*-tree with our index structure and try out different combinations of R-trees and prefix trees

hunt depends on hayoo

Hayoo and Hunt share the same sandbox and therefore require a common Makefile.

HayooFrontend: escape signature:context

In queries and auto-completions:

a -> f a name:pure

must be escaped to

a\ ->\ f\ a name:pure

and

pure signature:a -> f a

to

pure signature:a\ ->\ f\ a
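
A minimal sketch of that escaping step, backslash-escaping the spaces in the signature part (the helper name is hypothetical):

import           Data.Text (Text)
import qualified Data.Text as T

-- "a -> f a" -> "a\ ->\ f\ a"
escapeSignature :: Text -> Text
escapeSignature = T.replace " " "\\ "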

Query Processor: Fix intermediate merging and result construction

Wrong results due to misuse of optimized functions, probably a result of refactorings, especially the different approaches to parallelizing index queries.

Processor.toRawResult only works for results from a single context, so Processor.toRawResult . ContextIndex.searchWithCxs is just wrong.
Intermediate.fromListCx probably shouldn't exist.
Processor.limitWords uses psTotal of ProcessState which is initialized with a constant in the Interpreter.

BatchInsert Space leak

While playing around to find the best maximum size for the batchInsert, I noticed something.

I ran two tests, both with one 80 MB JSON file containing 10,000 documents with ~200 words each.

batch size 2000 documents

First I set the maximum batchInsert size to 2000 documents.
command execution time: 56.143825s
index size: 1.1 GB

batch size 200 documents

The same documents inserted with the maximum batchInsert size set to 200 documents:
command execution time: 136.253376s
index size: 800 MB

Conclusion

So it looks like the bigger the inserted batches are, the more memory is consumed after the insert.

Client Library

We should provide a small, simple client library which makes it easy to integrate with the framework on the client side.

Perhaps a jQuery plugin?

Important functions could be:

  • interface for creating complex queries
  • help for creating geo and date queries
  • wrapper for paging and completion
