rluiten / elm-text-search

Full text index engine in Elm language inspired by lunr.js.

Home Page: http://package.elm-lang.org/packages/rluiten/elm-text-search/latest

License: BSD 3-Clause "New" or "Revised" License

Elm 100.00%

elm-text-search's Introduction

ElmTextSearch full text indexer

This is a full text indexing engine inspired by lunr.js and written in the Elm language. See http://lunrjs.com/ for lunr.js.

I am happy to hear about users of this package.

I am happy to receive contributions, be they bug reports, pull requests, documentation updates or examples.

v4.0.0 will not load indexes saved with older versions.

If you do not use storeToValue, storeToString, fromString or fromValue in ElmTextSearch, this update is not likely to introduce issues.

The way that filters and transforms are applied to the content of documents has changed. This properly fixes a reported bug (see #10) where stop word filters were not correctly applied. As a result, saved indexes from previous versions of ElmTextSearch will not load in this version.

Defaults.indexVersion has changed value.

The reason this is a major version bump is that some generalisation was done to enable future support for loading and saving older versions and types of default index configurations.

v5.0.0 updates for Elm 0.19

Result types from loading indexes are now Decode.Error not String.

v5.0.2, v5.1.0

New functions: addT for add, searchT for search and removeT for remove. These replace the String error in the Result with a dedicated error type. v5.0.2 was a goof on my part; I forgot to expose the new functions correctly.

Packages

Several packages were created for this project and published separately for this package to depend on.

trie
http://package.elm-lang.org/packages/rluiten/trie/latest
stemmer
http://package.elm-lang.org/packages/rluiten/stemmer/latest
sparsevector
http://package.elm-lang.org/packages/rluiten/sparsevector/latest

Parts of lunr.js were left out

This does not have an event system.
Its internal data structure is not compatible.

Notes captured along the way while writing this.

lunr.js
tokenStore.remove does not decrement length, but length is not really used except in save/load
stemmer: "lay" -> "lay" and "try" -> "tri", which is the opposite of the Porter stemmer
porter stemmer erlang implementation
step5b does not use endsWithDoubleCons, which is required, as far as I know, to pass the voc.txt / output.txt cases

Example

See the examples folder for four examples. To run any of them, navigate to the examples folder, run elm reactor and select an example in the src folder.

The first example is included inline here.

IndexNewAddSearch.elm

module Main exposing (ExampleDocType, createNewIndexExample, main, resultSearchIndex, resultUpdatedMyIndexAfterAdd)

{-| Create an index and add a document, search a document

Copyright (c) 2016 Robin Luiten

-}

import Browser
import ElmTextSearch
import Html exposing (Html, button, div, text)


{-| Example document type.
-}
type alias ExampleDocType =
    { cid : String
    , title : String
    , author : String
    , body : String
    }


{-| Create an index with default configuration.
See ElmTextSearch.SimpleConfig documentation for parameter information.
-}
createNewIndexExample : ElmTextSearch.Index ExampleDocType
createNewIndexExample =
    ElmTextSearch.new
        { ref = .cid
        , fields =
            [ ( .title, 5.0 )
            , ( .body, 1.0 )
            ]
        , listFields = []
        }


{-| Add a document to an index.
-}
resultUpdatedMyIndexAfterAdd : Result String (ElmTextSearch.Index ExampleDocType)
resultUpdatedMyIndexAfterAdd =
    ElmTextSearch.add
        { cid = "id1"
        , title = "First Title"
        , author = "Some Author"
        , body = "Words in this example document with explanations."
        }
        createNewIndexExample


{-| Search the index.

The result includes an updated Index because a search causes internal
caches to be updated to improve overall performance.

-}
resultSearchIndex : Result String ( ElmTextSearch.Index ExampleDocType, List ( String, Float ) )
resultSearchIndex =
    resultUpdatedMyIndexAfterAdd
        |> Result.andThen
            (ElmTextSearch.search "explanations")


{-| Display search result.
-}
main =
    Browser.sandbox { init = 0, update = update, view = view }


type Msg
    = DoNothing


update msg model =
    case msg of
        DoNothing ->
            model


view model =
    let
        -- want only the search results not the returned index
        searchResults =
            Result.map Tuple.second resultSearchIndex
    in
    div []
        [ text
            ("Result of searching for \"explanations\" is "
                ++ Debug.toString searchResults
            )
        ]

elm-text-search's People

Forkers

nicholasgwk twocolumn eniac314 mthadley icodein

elm-text-search's Issues

Usage example for indexing an array of records?

In the docs you only index one record, but a search engine won't be very useful unless you can index a lot of them.

What would you think of adding an example of how to index an array of records?
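
A minimal sketch of one way to index several records, using only the ElmTextSearch.new and ElmTextSearch.add calls shown in the README example. The module name, the Doc type, the sample documents and the indexAll helper are hypothetical names made up for illustration. The package documentation may also describe a dedicated bulk helper; this sketch does not rely on one.

```elm
module IndexManyDocs exposing (Doc, docs, emptyIndex, indexAll)

import ElmTextSearch


{-| Hypothetical document type used only for this sketch.
-}
type alias Doc =
    { cid : String
    , title : String
    , body : String
    }


emptyIndex : ElmTextSearch.Index Doc
emptyIndex =
    ElmTextSearch.new
        { ref = .cid
        , fields = [ ( .title, 5.0 ), ( .body, 1.0 ) ]
        , listFields = []
        }


docs : List Doc
docs =
    [ { cid = "id1", title = "First Title", body = "Words in this example document." }
    , { cid = "id2", title = "Second Title", body = "More words with explanations." }
    ]


{-| Add every document in turn. A document that fails to add
is skipped and the index built so far is kept.
-}
indexAll : List Doc -> ElmTextSearch.Index Doc -> ElmTextSearch.Index Doc
indexAll documents index =
    List.foldl
        (\doc acc -> ElmTextSearch.add doc acc |> Result.withDefault acc)
        index
        documents
```

Calling indexAll docs emptyIndex returns an index that can be passed to ElmTextSearch.search. Skipping failed documents silently is a design choice of this sketch; collecting the Err values instead would be a small change to the fold.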

v5.0.2 isn't exposing addT, removeT, searchT correctly, sorry

I forgot to expose them in the ElmTextSearch module; they are only in Index at the moment.
I have to figure out how to expose the error types with constructors correctly.

Typo in MultipleAddSearch.elm example

The text readout at the end of the MultipleAddSearch.elm example should be:
"Result of searching for "title" is " ++ SearchResults

The example searches for the keyword "title" not "explanations".

No hits for partial words

If you run the example in the readme it finds a result, but if you then remove the last two characters from "explanations" so you have "explanatio" then it won't find anything. This is different from lunr.js behavior.

The API docs seem to suggest that this should not happen:
Each token is expanded, so that the term "he" might be expanded to "hello" and "help" if those terms were already included in the document index.

http://package.elm-lang.org/packages/rluiten/elm-text-search/2.0.0/ElmTextSearch#search

Is this a bug or am I misunderstanding the docs? :)

Can not find anything when two records have a similar text

I'm indexing a FAQ archive.

In my prototype I have "Question1" and "Question2" and when I search for "Q" I get nothing, but if I rename "Question2" to "Puestion2", then I find "Question1". This seems like a bug.

Index documents on fields that contain lists of strings

I have a use case where it would be very useful if the index was able to use fields that hold lists of strings in addition to single strings and apply the same weight to all strings in the list. The documents in this use case look like:

{ cid : String
, name : String
, synonyms : List String
}

So I'd like to index both the name and the synonyms and apply the same weight to all synonyms.

From a SimpleConfig / Config API, I would see it as adding a new option, something like:

list_fields : List (doc -> List String, Float)

Which would be used like this:

ElmTextSearch.new
{ ref = .cid
, fields = [ ( .name, 5.0 ) ]
, list_fields = [ ( .synonyms, 1.0 ) ]
}
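
The README example on this page already passes a listFields entry (camelCase) to ElmTextSearch.new, which looks like the shipped form of this request. A minimal sketch under that assumption; the module name, the Term type and synonymIndex are hypothetical names for illustration.

```elm
module SynonymIndex exposing (Term, synonymIndex)

import ElmTextSearch


{-| Hypothetical document type matching the record in this issue.
-}
type alias Term =
    { cid : String
    , name : String
    , synonyms : List String
    }


{-| The name is weighted 5.0 and every string in synonyms is weighted 1.0.
Assumes listFields takes (doc -> List String, Float) pairs, as the
README example and the proposal above suggest.
-}
synonymIndex : ElmTextSearch.Index Term
synonymIndex =
    ElmTextSearch.new
        { ref = .cid
        , fields = [ ( .name, 5.0 ) ]
        , listFields = [ ( .synonyms, 1.0 ) ]
        }
```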

Instead of having IndexX.elm files, use a folder

Index.elm
IndexDefaults.elm
IndexLoad.elm
IndexModel.elm
IndexUtils.elm
IndexVector.elm

should be

Index.elm
Index/Defaults.elm
Index/Load.elm
Index/Model.elm
Index/Utils.elm
Index/Vector.elm

How do I use CodecIndexRecord?

I've encoded an Index to JSON, and then decoded it, which produces a CodecIndexRecord. How do I use that to get back an Index?
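
A hedged sketch of a save and load round trip that avoids touching CodecIndexRecord directly. It assumes the storeToString and fromString functions named in the release notes have the signatures Index doc -> String and SimpleConfig doc -> String -> Result Decode.Error (Index doc); check the package documentation for the version in use. The module name, Doc, config and reload are hypothetical names, and elm/json must be a direct dependency for the Json.Decode import.

```elm
module SaveLoad exposing (Doc, config, reload)

import ElmTextSearch
import Json.Decode as Decode


{-| Hypothetical document type used only for this sketch.
-}
type alias Doc =
    { cid : String
    , title : String
    , body : String
    }


{-| The same configuration is needed both to create and to reload the index,
because the field accessor functions are not part of the saved data.
-}
config : ElmTextSearch.SimpleConfig Doc
config =
    { ref = .cid
    , fields = [ ( .title, 5.0 ), ( .body, 1.0 ) ]
    , listFields = []
    }


{-| Serialise an index to a String and load it straight back.
Assumed signatures: see the note above.
-}
reload : ElmTextSearch.Index Doc -> Result Decode.Error (ElmTextSearch.Index Doc)
reload index =
    ElmTextSearch.storeToString index
        |> ElmTextSearch.fromString config
```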

Leaving field empty when using listFields returns results with score equal to NaN

I am indexing documents with this type:

type alias RecipeName = String

type alias RecipeBody = String

type alias FeuilleLiaison =
    { date : Date
    , filename : String
    , wholeBody : String
    , recipes : Dict RecipeName RecipeBody
    }

I am using two indexes, one for whole text search, one for the recipes only.
As each document can contain zero, one or more recipes I thought of setting up the index like this:

recipeConfig =
    { indexType = "ElmTextSearch - Customized French Stop Words"
    , ref = Date.toIsoString << .date
    , fields =
        []
    , listFields = [ ( Dict.keys << .recipes, 5.0 ), ( Dict.values << .recipes, 1.0 ) ]
    , initialTransformFactories = Index.Defaults.defaultInitialTransformFactories
    , transformFactories = [ (\func index -> ( index, func )) (FrenchStemmer.stemmer True) ]
    , filterFactories = [ createFilterFunc frenchStopWords ]
    }

The search is working fine and both recipes names and bodies seem to be indexed. However the scores associated with the results are all equal to NaN, so I cannot sort or filter the results.

Putting ( always "", 1 ) in fields seems to remove the NaNs, but I do not know if it affects the search in any way.

Am I using the listFields parameter wrong?

addOrUpdate function

I have a use case where I receive documents as part of an HTTP GET that may or may not already be in my index, so I'd like to add them if they're not in and update them if they are. The way I do this at the moment is to add and, if this results in an error, update. It would be nice to have an addOrUpdate capability instead.
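
The workaround described above can be sketched as follows. It assumes ElmTextSearch.update exists with the same shape as ElmTextSearch.add (doc -> Index doc -> Result String (Index doc)); the module and function names here are hypothetical.

```elm
module AddOrUpdateSketch exposing (addOrUpdateSketch)

import ElmTextSearch


{-| Try to add the document; if that fails, try to update it instead.
-}
addOrUpdateSketch : doc -> ElmTextSearch.Index doc -> Result String (ElmTextSearch.Index doc)
addOrUpdateSketch doc index =
    case ElmTextSearch.add doc index of
        Ok updatedIndex ->
            Ok updatedIndex

        Err _ ->
            -- Assumption: the add failed because the document already exists.
            -- Any other failure (for example an empty ref) will surface
            -- as the Err returned by update.
            ElmTextSearch.update doc index
```

Because the error is a String, the sketch cannot tell why the add failed without comparing message text, which is the limitation the request is about.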

wildcards

Wondering if you can do wild cards like llo or *ll

Also, can you use AND | OR?

p.s. Thanks for a great library!

Errors when searching for the keyword "a" or "A" but not other single characters.

When a search is done for the single letter "a" it produces the error: "Error after tokenisation there are no terms to search for."

This can be reproduced by changing the search term in the MultipleAddSearch.elm example to either "a" or "A".

"A" was the only character I found that caused a problem. Other single letters or numbers work correctly.
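
One way for a caller to cope with this, sketched under the assumption that an error such as "no terms to search for" should simply be shown as an empty result list. The module and function names are hypothetical; only ElmTextSearch.search from the README is used.

```elm
module SafeSearch exposing (searchOrEmpty)

import ElmTextSearch


{-| Run a search and treat any error as "no hits".
The updated index returned by search is discarded here, so the
internal cache updates mentioned in the README are not kept.
-}
searchOrEmpty : String -> ElmTextSearch.Index doc -> List ( String, Float )
searchOrEmpty query index =
    ElmTextSearch.search query index
        |> Result.map Tuple.second
        |> Result.withDefault []
```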

Is there a way to make the tokenizer ignore apostrophes?

I am using the library to index a corpus of recipes in French. I replaced the default stop word list by a French one, so far so good.

In French we often use l' or d' as determiner before nouns as in de l'eau, l'ail meaning some water and the garlic. I think the tokenizer includes l' or d' with the following words, so a search for ail or eau does not work, probably because the search string tokens are expanded to the right. The fact that searching for john in a text containing John's book works seems to confirm that.

Ideally the index would only register ail and eau as tokens and ignore what comes before, as it is not relevant in the context of a search.

Is there a way to change the index config in order to achieve this effect?
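
A possible workaround that needs no tokenizer change is to preprocess the text inside the field accessor functions. This is only a sketch: it assumes the tokenizer splits on whitespace, and the module name, the Recipe type, splitApostrophes and recipeIndex are hypothetical. It uses ElmTextSearch.new with the default English setup for brevity; a customised config would take the same accessors.

```elm
module Elision exposing (Recipe, recipeIndex, splitApostrophes)

import ElmTextSearch


{-| Hypothetical document type used only for this sketch.
-}
type alias Recipe =
    { cid : String
    , title : String
    , body : String
    }


{-| Replace straight and typographic apostrophes with a space,
so that "l'ail" becomes "l ail".
-}
splitApostrophes : String -> String
splitApostrophes input =
    input
        |> String.replace "'" " "
        |> String.replace "’" " "


recipeIndex : ElmTextSearch.Index Recipe
recipeIndex =
    ElmTextSearch.new
        { ref = .cid
        , fields =
            [ ( .title >> splitApostrophes, 5.0 )
            , ( .body >> splitApostrophes, 1.0 )
            ]
        , listFields = []
        }
```

The leftover single letters such as l and d could then be added to the stop word list, and the same splitApostrophes call would need to be applied to query strings before searching.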

Can't index or search for "one" (or: maybe filter stopwords before stemming?)

In my initial testing, I got an error when I searched for the text "one". As far as I can tell, it looks like processTokens transforms the tokens before filtering them. In this case, "one" becomes "on" and is filtered out as a stopword.

I don't know much about this stuff, but it would seem to make sense to filter out stopwords first? This seems to be what lunr.js does according to this comment

Example: https://ellie-app.com/346jy8VmsMXa1/0

0.19 Upgrade

Hey @rluiten! Are you planning to continue maintaining this and related packages into Elm 0.19? A response either way will help us at NoRedInk to plan our codebase upgrade. Thank you!

Retrieving documents based on search results

Is there a recommended method to retrieve the actual documents after you get the results? I always struggled with this with lunr.js as well...to me, when I do a search, the IDs aren't easily mapped to the documents they represent since the original documents are in a list rather than a Dict.

Obviously you can do a map over the results then filter the original list to pull out each document, but this seems inefficient.
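
One common approach is to keep the documents in a Dict keyed by the same ref used for the index, so each hit is a dictionary lookup instead of a list scan. A minimal sketch; the module name, the Doc type and both helper names are hypothetical, and it assumes search results have the List ( String, Float ) shape shown in the README.

```elm
module LookupResults exposing (Doc, docsByRef, resultsToDocs)

import Dict exposing (Dict)


{-| Hypothetical document type used only for this sketch.
-}
type alias Doc =
    { cid : String
    , title : String
    , body : String
    }


{-| Build the lookup once, keyed by the field used as ref in the index.
-}
docsByRef : List Doc -> Dict String Doc
docsByRef documents =
    documents
        |> List.map (\doc -> ( doc.cid, doc ))
        |> Dict.fromList


{-| Turn search results into documents, keeping the result order.
Refs that are missing from the lookup are dropped.
-}
resultsToDocs : Dict String Doc -> List ( String, Float ) -> List Doc
resultsToDocs lookup results =
    List.filterMap (\( ref, _ ) -> Dict.get ref lookup) results
```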

Great port btw! Working great, just thought I'd post up with this question!

Replace error strings with union types?

I'm trying to handle some of the errors in certain ways, but having to switch on strings is a bit of a bummer. Would you be interested in a PR that replaces the string error messages with some custom union types?

I'm picturing something verbose but straightforward like these (we could sort out the details to your taste):

type AddError
    = UniqueRefIsEmpty
    | NoTermsToIndexAfterTokenisation
    | DocAlreadyExists

type RemoveError
    = UniqueRefIsEmpty
    | DocIsNotInIndex

type SearchError
    = IndexIsEmpty
    | QueryIsEmpty
    | NoTermsToSearchAfterTokenisation
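
A self-contained sketch of how such types could be consumed. Two custom types in the same Elm module cannot share a constructor name, so UniqueRefIsEmpty is prefixed here. These types are local to the sketch and are not the library's actual definitions; the module name and the message strings are also made up for illustration.

```elm
module ErrorSketch exposing (AddError(..), RemoveError(..), addErrorToString)

{-| Hypothetical error types modelled on the proposal above.
-}


type AddError
    = AddErrorUniqueRefIsEmpty
    | NoTermsToIndexAfterTokenisation
    | DocAlreadyExists


type RemoveError
    = RemoveErrorUniqueRefIsEmpty
    | DocIsNotInIndex


{-| The compiler checks that every case is handled, which is the
main gain over matching on error strings.
-}
addErrorToString : AddError -> String
addErrorToString error =
    case error of
        AddErrorUniqueRefIsEmpty ->
            "The document has an empty unique reference."

        NoTermsToIndexAfterTokenisation ->
            "No terms were left to index after tokenisation."

        DocAlreadyExists ->
            "The document is already in the index."
```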

Issue with the word "Loyalty"

I went to https://elm-lang.org/try and added rluiten/elm-text-search as a dependency.

This snippet demonstrates the issue by performing three searches and printing the number of results for lo, loy, and loya respectively. Only the search for loy returns no results.

import ElmTextSearch
import Html

main =
  let
    index =
      ElmTextSearch.new
        { ref = .id
        , fields = [ (.title, 1 ) ]
        , listFields = []
        }

    indexAddResult =
      ElmTextSearch.add { id = "1234", title = "Loyalty" } index
      
    searchResultLo =
        indexAddResult
          |> Result.andThen (\i ->
            ElmTextSearch.search "lo" i |> Result.map (Tuple.second >> List.map Tuple.first)
          )
          |> Result.map List.length
          |> Result.withDefault 0  
      
    searchResultLoy =
        indexAddResult
          |> Result.andThen (\i ->
            ElmTextSearch.search "loy" i |> Result.map (Tuple.second >> List.map Tuple.first)
          )
          |> Result.map List.length
          |> Result.withDefault 0
          
    searchResultLoya =
        indexAddResult
          |> Result.andThen (\i ->
            ElmTextSearch.search "loya" i |> Result.map (Tuple.second >> List.map Tuple.first)
          )
          |> Result.map List.length
          |> Result.withDefault 0          
  in
  Html.ul []
    [ Html.li [] [Html.text (String.fromInt searchResultLo)]
    , Html.li [] [Html.text (String.fromInt searchResultLoy)]
    , Html.li [] [Html.text (String.fromInt searchResultLoya)]
    ]