
hunt's Introduction


Hunt is a flexible, lightweight search platform written in Haskell.

The default server implementation provides a powerful JSON-API and a simple web interface.

Features

  • Powerful query language
  • Schema support (numeric data, dates, geospatial data)
  • Granular ranking capabilities
  • JSON API
  • Extensible architecture

Installation

Dependencies
  • GHC: The Glasgow Haskell Compiler
  • Cabal: Haskell package management tool
Hunt Installation

The easiest way to get set up is to install the current Haskell Platform.

Linux

If you're using Linux, you can use make for the build.

git clone https://github.com/hunt-framework/hunt.git
cd hunt
make sandbox install
Windows

If you're using Windows, you can use cabal for the build. If you would like to use sandboxes on Windows, you can copy the necessary cabal commands from our Makefile.

git clone https://github.com/hunt-framework/hunt.git
cd hunt/hunt-server
cabal install

Getting Started

The following line starts the default server. The web interface is available at http://localhost:3000/.

make startServer

A small sample data set can be inserted with:

make insertJokes

FAQ

Can I run Hunt on a 32-bit machine?

No, we are using 64-bit hashes for our internal document IDs. Collisions are much more likely for 32-bit hashes and the available memory would be limited to 4GB.

Why is the CPU usage so high when idle?

GHC performs a major garbage collection every 0.3 seconds when idle, which can be computationally expensive on a big index. This can be disabled with the GHC RTS option -I0.

Development / History

Hunt was started in 2013 by Ulf Sauer and Chris Reumann to improve and extend the existing Holumbus framework. Holumbus was developed in 2008-2009 by Timo B. Kranz and Sebastian M. Gauck and powers the current Haskell API search Hayoo!. We decided to rebrand, because Hunt represents a major rewrite and breaks compatibility.

A new Hayoo implementation is currently under development by Sebastian Philipp.

Both projects were developed at the FH Wedel under supervision and active support of Prof. Dr. Uwe Schmidt.


hunt's Issues

Query Processor: search has to be context-type aware

Values get normalized before being stored in the index.

E.g., for the date context type, 2013-07-21 gets normalized to 20130721000000.

Right now the query processor does not take that into account. Users can search for "2013-07-21" and won't get a result; they have to search for "20130721".

Query processing has to be context aware.

empty result for set queries with and without context

This query works:

"mapM package:base"
{icQuery = QBinary And (QWord QNoCase "mapM") (QContext ["package"] (QWord QNoCase "base")), icOffsetSR = 0, icMaxSR = 20}

This doesn't:

"package:base mapM"
{icPrefixCR = QBinary And (QContext ["package"] (QWord QNoCase "base")) (QWord QNoCase "mapM"), icMaxCR = 20}

Interpreter: Improve read/write locking

The current implementation uses a simple MVar.
Readers don't block each other (readMVar), but writers block readers and writers until they are finished (takeMVar, write, putMVar).

Readers should never be blocked and use the "old" index until the writer is done.
Writers should block each other so that writes can only occur sequentially.

Default instances for several types

I think we already have a dependency on the defaults package, so it makes sense to implement Default instances for common data types.

The index environment is a good example.

This looks ugly in the examples right now.

ContextTypes should be a map

The context type name should be unique. This could be expressed more easily with a map using the name as the key.

type ContextTypes = [ContextType]
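
A minimal sketch of the suggested change, using a stand-in ContextType record with a hypothetical ctName field (the real type has more fields):

import qualified Data.Map.Strict as M
import           Data.Text       (Text)

-- Stand-in for Hunt's ContextType; only the name matters for this sketch.
data ContextType = ContextType { ctName :: Text }

-- Suggested change: key the context types by their unique name.
type ContextTypes = M.Map Text ContextType

-- Converting from the current list representation:
contextTypesFromList :: [ContextType] -> ContextTypes
contextTypesFromList ts = M.fromList [ (ctName t, t) | t <- ts ]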

Normalizer: Be more extensible here.

I'd like to change the way we define normalizer functions. With this change it would be easy to write new normalizer functions and add them by simply implementing this small datatype.

I think we don't lose a lot of type safety by removing the enum, because we rarely use it anyway.

Any concerns? What do you think?

-- current implementation
data CNormalizer = NormUpperCase | NormLowerCase | NormDate | NormPosition | NormIntZeroFill
  deriving (Show, Eq)

-- suggested change:
data CNormalizer = CNormalizer { normalize :: Text -> Text }
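
With the record-based variant above, the current enum cases become plain values, and a new normalizer is just another value of the same type. A small sketch (Data.Text provides toUpper and toLower; the date example is only illustrative):

import qualified Data.Text as T

normUpperCase, normLowerCase :: CNormalizer
normUpperCase = CNormalizer T.toUpper
normLowerCase = CNormalizer T.toLower

-- A user-defined normalizer, e.g. stripping dashes from dates
-- ("2013-07-21" -> "20130721"):
normDate :: CNormalizer
normDate = CNormalizer (T.filter (/= '-'))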

Completion: include contexts in some way?

For example: if you have an empty search field and start typing, and we included the context names in the completion somehow, they would be suggested as well:

search: "con"
completion: [ "contextgeo:", "contextdate:"... ]

Change handling of invalid queries on different context types

Imagine the case where a text index and a date index are created.

Right now all general queries run on each context. If the searched value is not compatible with one context type, the whole search fails.

For this case we added the "default" option for the ContextSchema and context.

But I'm not sure if this is really the easiest way to solve this. Maybe it would be better to remove this "default" option and handle the search differently: if validation of the search term fails for a context, we could set that context's search result to empty instead of throwing an error.

What do you think?

Schema: Add proper Int support

There is no proper support for Int right now.
The input needs to be padded, e.g.

  2 -> 0000000002
157 -> 0000000157

or with simple support for negative numbers

   2 -> 10000000002
  -2 -> 00000000002
 157 -> 10000000157
-157 -> 00000000157

or

  1 -> 5000000001
  0 -> 5000000000
 -1 -> 4999999999

The validator needs to exclude values that are out of range of this encoding.
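
A sketch of the third (offset) variant, padding to ten digits after shifting by 5000000000; the exact width and offset are illustrative assumptions:

import           Data.Text (Text)
import qualified Data.Text as T

-- Shift by 5 * 10^9 so negative numbers sort correctly as text,
-- then zero-pad to a fixed width of ten digits.
encodeInt :: Int -> Text
encodeInt n = T.justifyRight 10 '0' (T.pack (show (n + 5000000000)))

-- The validator rejects values outside the encodable range.
validInt :: Int -> Bool
validInt n = n >= -5000000000 && n < 5000000000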

Related issue: #1

DocTable: Compression

We have to reintegrate document compression within the DocTable. I guess a trivial approach would be okay.

Add Ranking per Document

We need to add a document-based ranking. This can either be part of the document description or part of the document itself.

Hayoo needs a ranking based on the document itself.

Change module structure

Right now we have:

Hunt.Interpreter.Interpreter
Hunt.Index.Index

This redundancy is a bit ugly. Maybe we should change this before releasing 0.1.

Snapshots

Basic filesystem-like snapshot support.

Completing the parallel-over-context implementation

We have implemented the parallel map over contexts for the BatchInsert command, but for none of the others.

BatchDelete should perform the deletion in parallel as well. Then there is update: update is implemented as a delete followed by an insert, so if insert and delete performance is critical, update performance is even worse.

For that reason we need BatchUpdate as well. It should be implemented within the internal Command language like BatchInsert and BatchDelete. Since it is based on BatchInsert and BatchDelete, implementation should be trivial.
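
A rough sketch of that idea, with hypothetical constructors rather than Hunt's actual Command type:

-- Hypothetical, simplified command language; Hunt's real Command type differs.
data Command doc uri
  = BatchInsert [doc]
  | BatchDelete [uri]
  | Sequence    [Command doc uri]

-- BatchUpdate expressed in terms of the existing batch commands.
batchUpdate :: (doc -> uri) -> [doc] -> Command doc uri
batchUpdate docUri docs =
  Sequence [BatchDelete (map docUri docs), BatchInsert docs]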

In the ContextMap we still have functions for single insertion and deletion. I think that, due to the batch optimizations, they are no longer used. Even a single insert should be transformed into a BatchInsert with just one document, so these single-element functions should be removed to clean up the interface.

Any thoughts?

Store/Load does not work

The commands are currently disabled/undefined - probably because some Binary instances are missing.

Overall Performance

Everything performance related

ContextIndex: Inserting single documents in parallel

Most documents insert values into multiple contexts. Since the contexts form a map, we should be able to improve insertion performance by performing these insertions in parallel, at least within a single document.

I implemented this within the addWords function. I'm not 100% sure everything is perfectly optimized yet, but a first benchmark showed improvements:

Test dataset: 300 documents containing between 200 and 10,000 words each.
Test run on an 8-core machine with all cores active.

The sequential foldM is the initial implementation.

foldM sequential    mapM sequential    mapM parallel
46.12884s           48.05277s          32.794021s
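
One way to express the parallel map over a document's contexts, sketched with the async package and a hypothetical per-context insert action (the actual implementation lives in addWords):

import           Control.Concurrent.Async (mapConcurrently)
import           Data.Map                 (Map)
import qualified Data.Map                 as M
import           Data.Text                (Text)

type Context = Text

-- Insert one document's words into all of its contexts in parallel.
-- insertIntoContext is a hypothetical per-context insertion action.
insertDocument :: (Context -> [Text] -> IO ())  -- insertIntoContext
               -> Map Context [Text]            -- words per context
               -> IO ()
insertDocument insertIntoContext wordsByCx =
  () <$ mapConcurrently (uncurry insertIntoContext) (M.toList wordsByCx)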

Completion: suggested words should be related

Current auto completion returns results only for the currently typed word, without taking the rest of the query into account.

f.e.: "word1 wor" will return all completions for "wor", not only all completions of "wor" that follow "word1".

We could improve this by splitting up the query: search exactly for the first part and perform a fuzzy prefix search on the second part.
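
A tiny sketch of that split, with a hypothetical helper: everything before the last word is searched exactly, the last word gets the fuzzy prefix search:

import           Data.Text (Text)
import qualified Data.Text as T

-- "word1 wor" -> ("word1", "wor")
splitForCompletion :: Text -> (Text, Text)
splitForCompletion q = case T.words q of
  [] -> (T.empty, T.empty)
  ws -> (T.unwords (init ws), last ws)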

Normalizers: No default normalizer

That's a bit strange, because we need default normalization for numbers, dates and geo positions. There should be at least one default for each context type.

E.g., the date default: 2013-01-01 -> 20130101

Users should be able to override those defaults, but they should not be forced to configure them themselves.

Normalization: Handle default normalizers and user-defined normalizers separately

At the moment we don't make a separation here. That's a problem, because for some context types, like int, geo/position and date, a default validation is required. The user configuration should not be able to change that.

The idea is to keep user-defined normalization as it is implemented now, with a normalization function from Text -> Text.

The type-specific required normalization could be handled by an IndexProxy implementation. That way we could support index structures other than the StringMap, with different types, without changing anything in the interpreter or query processor.
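
One shape such a proxy could take, sketched with a hypothetical minimal index class (the real Hunt Index class is richer):

import Data.Text (Text)

-- Hypothetical, stripped-down index interface.
class Index i where
  insertKey :: Text -> v -> i v -> i v
  searchKey :: Text -> i v -> [(Text, v)]

-- Proxy that applies the type-specific normalization before delegating,
-- independent of any user-configured Text -> Text normalizers.
data NormalizedIndex i v = NormalizedIndex
  { niNormalize :: Text -> Text
  , niInner     :: i v
  }

instance Index i => Index (NormalizedIndex i) where
  insertKey k v (NormalizedIndex f ix) = NormalizedIndex f (insertKey (f k) v ix)
  searchKey k   (NormalizedIndex f ix) = searchKey (f k) ix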

Occurrences should be a type class

First of all, this is just an idea and I don't think we could implement it within the time frame of the thesis.

The Occurrences data structure is the one part of the whole implementation that is still not very flexible. It is bound to the InvertedIndex implementation and cannot be replaced with something else, because it is used heavily within the query processor.

I think that instead of an Occurrences data type, a type class "IndexValue" might be more suitable. It would also help to test different Occurrences implementations, like the BinTree vs. the IntMap.
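
A very rough sketch of what such a class could provide; the method set is an assumption, the real requirements would come from the query processor:

-- Hypothetical interface abstracting over the concrete Occurrences type.
class IndexValue v where
  emptyValue :: v
  mergeValue :: v -> v -> v   -- union, e.g. for merging context results
  diffValue  :: v -> v -> v   -- difference, e.g. for deletions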

Any thoughts about this?

Merge ContextIndex and IndexHandler

We never really liked the name IndexHandler, so we wanted to find a better name here anyway. We also have the ContextIndex listed as an index proxy; while that was correct at first, after a couple of changes it is no longer an index proxy.

I think it would be a good idea to merge those two modules. Any thoughts?

--merge these
data IndexHandler dt = IXH     
  { ixhIndex  :: CIx.ContextIndex Occurrences 
  , ixhDocs   :: dt            
  , ixhSchema :: Schema        
  }

newtype ContextIndex v         
  = ContextIx { contextIx :: Map Context (Impl.IndexImpl v) }
  deriving (Show)

-- to:         
data ContextIndex dt = ContextIx
  { ixhIndex  :: Map Context (Impl.IndexImpl Occurrences) 
  , ixhDocs   :: dt            
  , ixhSchema :: Schema        
  }

Clean up the API

There are a couple of things we should think through before releasing the package on hackage.

The common package is pretty confusing right now with regard to the exported types.

Maybe we need another module:
Hunt.Interpreter.Config
which exposes all the configuration stuff needed for the interpreter.

Right now the client needs to import multiple modules like:
Hunt.Query.Ranking (function defaultRankConfig)
only to initialize the interpreter.

It's impossible to search for keys containing spaces

That's because the query parser will split the query into words:

[2014-01-05 15:28:06 CET : Holumbus.Interpreter.Interpreter : DEBUG] Executing command: Completion {icPrefixCR = QBinary And (QWord QNoCase "StringMap") (QBinary And (QWord QNoCase "a") (QBinary And (QWord QNoCase "->") (QWord QNoCase "Bool"))), icMaxCR = 20}

Using a Phrase doesn't work either:

GET /search/%22StringMap%20a%20-%3E%20Bool%22/0/20
Accept: 
[2014-01-05 15:28:25 CET : Holumbus.Interpreter.Interpreter : DEBUG] Executing command: Search {icQuery = QPhrase QNoCase "StringMap a -> Bool", icOffsetSR = 0, icMaxSR = 20}
Status: 200 OK. /search/%22StringMap%20a%20-%3E%20Bool%22/0/20

Memory profiling


This issue is about the question of whether ByteString serialization is a good or a bad thing. We really had a lot of trouble with it and, despite certain optimizations, it seemed not to improve the overall index size.

Setup: I always used the same stack of indexes.

newtype InvertedIndex _v = InvIx { invIx :: KeyProxyIndex Text ComprOccPrefixTree CompressedOccurrences }

I only exchanged the CompressedOccurrences type.

Test 1: 3000 documents with ~200 words each (25 MB JSON)

I compared the serialized files on disk.

Compression        Size on disk    Size in memory
no compression     49 MB           260 MB
only ByteString    49.1 MB         83 MB
bzip ByteString    27.1 MB         73 MB

Conclusion

The tests were run with profiling and the profiling results look okay. It seems like the extra compression makes sense after all.

While executing this benchmark a strictness bug regarding lazy Occurrences was fixed.

The profiling results confirm the benchmark: the profiled values match the corresponding memory footprints.

ByteString without further compression is only a bit bigger than the compressed variants. This makes sense, since we deal with small values here. The conversion to ByteString also takes more time than the compression afterwards.

The uncompressed result is as big as it is because of the hashed DocIds within the IntMap. The positions within the IntSet are not a factor.

Index: Compression does not work

Huge memory hog and overall slow performance.
(Re)introduced with commit a374670.

Memory usage with jokes dataset now and before commit a374670:
before: ~19MB
now: ~175MB

Insertion is at least 3x slower too.

Processor: range queries - range check

This is not correct for all possible values: rangeValidator just checks for <=. I think this needs to be performed on the normalized values, or at least be implemented for each CType separately. For now I disabled this check for the position search.

Normalizer.hs

rangeValidator :: CType -> [Text] -> [Text] -> Bool
rangeValidator t from to = case t of
  -- XXX TODO real range check for positions
  CPosition -> True

Processor.hs

-- values form a valid range    
-- XXX fix range validation - normalized values should be compared
--    unless' (rangeValidator cType ls' hs')
--            400 $ "invalid range for context: " `T.append` rangeText
-- type determines the processing
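
A sketch of what a check on the normalized values could look like, with a hypothetical helper that takes the context's normalization function:

import Data.Text (Text)

-- Compare the normalized bounds, so that e.g. normalized dates
-- ("20130101" <= "20131231") are ordered correctly as text.
rangeValidatorNorm :: (Text -> Text) -> [Text] -> [Text] -> Bool
rangeValidatorNorm norm froms tos =
  and (zipWith (<=) (map norm froms) (map norm tos))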

Geo Index

New context type for indexing geo-positions. We should try out different things here.

  1. use a prefix tree to store positions
  2. implement an R*-tree with our index structure and try out different combinations of R-trees and prefix trees

hunt depends on hayoo

Hayoo and Hunt share the same sandbox and therefore require a common Makefile.

HayooFrontend: escape signature:context

In queries and auto-completions:

a -> f a name:pure

must be escaped to

a\ ->\ f\ a name:pure

and

pure signature:a -> f a

to

pure signature:a\ ->\ f\ a
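
A minimal sketch of that escaping step, backslash-escaping the spaces in the signature part (the helper name is hypothetical):

import           Data.Text (Text)
import qualified Data.Text as T

-- "a -> f a" -> "a\ ->\ f\ a"
escapeSignature :: Text -> Text
escapeSignature = T.replace " " "\\ "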

Query Processor: Fix intermediate merging and result construction

Wrong results due to misuse of optimized functions, probably a result of refactorings, especially the different approaches to parallelizing index queries.

Processor.toRawResult only works for results from a single context, so Processor.toRawResult . ContextIndex.searchWithCxs is just wrong.
Intermediate.fromListCx probably shouldn't exist.
Processor.limitWords uses psTotal of ProcessState which is initialized with a constant in the Interpreter.

BatchInsert Space leak

While playing around to find the best maximum size for the batchInsert, I noticed something.

I ran two tests, both with one 80 MB JSON file containing 10,000 documents with ~200 words each.

batch size 2000 documents

First I set the maximum batchInsert size to 2000 documents.
command execution time: 56.143825s
index size: 1.1 GB

batch size 200 documents

The same documents inserted with the maximum batchInsert size set to 200 documents:
command execution time: 136.253376s
index size: 800 MB

Conclusion

So it looks like the bigger the inserted batches are, the more memory is consumed after the insert.

Client Library

We should provide a small, simple client library which makes it easy to integrate with the framework on the client side.

Perhaps a jQuery plugin?

Important functions could be:

  • interface for creating complex queries
  • help for creating geo and date queries
  • wrapper for paging and completion
