elm-text-search's Introduction

ElmTextSearch full text indexer

Copyright (c) 2016 Robin Luiten

This is a full text indexing engine inspired by lunr.js and written in the Elm language. See http://lunrjs.com/ for lunr.js.

I am happy to hear about users of this package.

I am happy to receive contributions, be they bug reports, pull requests, documentation updates or examples.

v4.0.0 will not load indexes saved with older versions.

If you do not use storeToValue, storeToString, fromString or fromValue in ElmTextSearch, this update is not likely to introduce issues.

The way that filters and transforms are applied to the content of documents has changed. This properly fixes a reported bug (see #10) where stop word filters were not correctly applied. As a result, indexes saved with previous versions of ElmTextSearch will not load in this version.

  • Defaults.indexVersion has changed value.

The reason this is a major version bump is that some generalisation was done to enable future support for loading and saving older versions and types of default index configurations.

v5.0.0 updates for Elm 0.19

Result types from loading indexes are now Decode.Error not String.

v5.0.2, v5.1.0

New functions: addT for add, searchT for search and removeT for remove. These replace the String error in the result with a dedicated error type. v5.0.2 was a goof on my part: I forgot to expose the new functions correctly.

Packages

Several packages were created for this project and published separately for this package to depend on.

Parts of lunr.js were left out

  • This does not have an event system.
  • Its internal data structure is not compatible.

Notes captured along the way while writing this.

  • lunr.js
      • tokenStore.remove does not decrement length, but it doesn't really use length, only in save/load
      • stemmer "lay" -> "lay" "try" -> "tri" is opposite to porter stemmer
  • porter stemmer erlang implementation
      • step5b does not use endsWithDoubleCons, which is required afaik to pass the voc.txt output.txt cases

Example

See the examples folder for four examples. To run any of them, navigate to the examples folder, run elm reactor, and select an example in the src folder.

First example is included inline here.

IndexNewAddSearch.elm

module Main exposing (ExampleDocType, createNewIndexExample, main, resultSearchIndex, resultUpdatedMyIndexAfterAdd)

{-| Create an index and add a document, search a document

Copyright (c) 2016 Robin Luiten

-}

import Browser
import ElmTextSearch
import Html exposing (Html, button, div, text)


{-| Example document type.
-}
type alias ExampleDocType =
    { cid : String
    , title : String
    , author : String
    , body : String
    }


{-| Create an index with default configuration.
See ElmTextSearch.SimpleConfig documentation for parameter information.
-}
createNewIndexExample : ElmTextSearch.Index ExampleDocType
createNewIndexExample =
    ElmTextSearch.new
        { ref = .cid
        , fields =
            [ ( .title, 5.0 )
            , ( .body, 1.0 )
            ]
        , listFields = []
        }


{-| Add a document to an index.
-}
resultUpdatedMyIndexAfterAdd : Result String (ElmTextSearch.Index ExampleDocType)
resultUpdatedMyIndexAfterAdd =
    ElmTextSearch.add
        { cid = "id1"
        , title = "First Title"
        , author = "Some Author"
        , body = "Words in this example document with explanations."
        }
        createNewIndexExample


{-| Search the index.

The result includes an updated Index because a search causes internal
caches to be updated to improve overall performance.

-}
resultSearchIndex : Result String ( ElmTextSearch.Index ExampleDocType, List ( String, Float ) )
resultSearchIndex =
    resultUpdatedMyIndexAfterAdd
        |> Result.andThen
            (ElmTextSearch.search "explanations")


{-| Display search result.
-}
main =
    Browser.sandbox { init = 0, update = update, view = view }


type Msg
    = DoNothing


update msg model =
    case msg of
        DoNothing ->
            model


view model =
    let
        -- want only the search results not the returned index
        searchResults =
            Result.map Tuple.second resultSearchIndex
    in
    div []
        [ text
            ("Result of searching for \"explanations\" is "
                ++ Debug.toString searchResults
            )
        ]

elm-text-search's People

Contributors

bendingbender, rluiten

elm-text-search's Issues

Usage example for indexing an array of records?

In the docs you only index one record, but a search engine won't be very useful unless you can index a lot of them.

What would you think of adding an example of how to index an array of records?
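One possible shape for this, as a minimal sketch: it assumes only the `ElmTextSearch.new` and `ElmTextSearch.add` signatures shown in the README example above. The names `Doc`, `emptyIndex`, `docs` and `addAll` are hypothetical and exist only for illustration. The idea is to fold `add` over the list, threading the index through a `Result`.

```elm
module IndexMany exposing (Doc, addAll, docs, emptyIndex)

import ElmTextSearch


{-| Hypothetical document type for this sketch.
-}
type alias Doc =
    { cid : String
    , title : String
    , body : String
    }


emptyIndex : ElmTextSearch.Index Doc
emptyIndex =
    ElmTextSearch.new
        { ref = .cid
        , fields = [ ( .title, 5.0 ), ( .body, 1.0 ) ]
        , listFields = []
        }


docs : List Doc
docs =
    [ { cid = "id1", title = "First Title", body = "Words in this example document." }
    , { cid = "id2", title = "Second Title", body = "More words in another document." }
    ]


{-| Add each document in turn. Once an `add` fails, that `Err` is carried
through and no later documents are added.
-}
addAll : List Doc -> ElmTextSearch.Index Doc -> Result String (ElmTextSearch.Index Doc)
addAll documents index =
    List.foldl
        (\doc acc -> Result.andThen (ElmTextSearch.add doc) acc)
        (Ok index)
        documents
```

With these definitions, `addAll docs emptyIndex` yields a `Result` that can be passed to `Result.andThen (ElmTextSearch.search "title")` in the same way as the single-document example.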

Typo in MultipleAddSearch.elm example

The text readout at the end of the MultipleAddSearch.elm example should be:
"Result of searching for "title" is " ++ SearchResults

The example searches for the keyword "title" not "explanations".

No hits for partial words

If you run the example in the readme it finds a result, but if you then remove the last two characters from "explanations" so you have "explanatio" then it won't find anything. This is different from lunr.js behavior.

The API docs seem to suggest that this should not happen:
Each token is expanded, so that the term "he" might be expanded to "hello" and "help" if those terms were already included in the document index.

http://package.elm-lang.org/packages/rluiten/elm-text-search/2.0.0/ElmTextSearch#search

Is this a bug or am I misunderstanding the docs? :)

Index documents on fields that contain lists of strings

I have a use case where it would be very useful if the index was able to use fields that hold lists of strings in addition to single strings and apply the same weight to all strings in the list. The documents in this use case look like:

{ cid : String
, name : String
, synonyms : List String
}

So I'd like to index both the name and the synonyms and apply the same weight to all synonyms.

From a SimpleConfig / Config API, I would see it as adding a new option, something like:

list_fields : List (doc -> List String, Float)

Which would be used like this:

ElmTextSearch.new
{ ref = .cid
, fields = [ ( .name, 5.0 ) ]
, list_fields = [ ( .synonyms, 1.0 ) ]
}

Instead of having IndexX.elm files, use a folder

  • Index.elm
  • IndexDefaults.elm
  • IndexLoad.elm
  • IndexModel.elm
  • IndexUtils.elm
  • IndexVector.elm

should be

  • Index.elm
  • Index/Defaults.elm
  • Index/Load.elm
  • Index/Model.elm
  • Index/Utils.elm
  • Index/Vector.elm

How do I use CodecIndexRecord?

I've encoded an Index to JSON, and then decoded it, which produces a CodecIndexRecord. How do I use that to get back an Index?

Leaving field empty when using listFields returns results with score equal to NaN

I am indexing documents with this type:

type alias RecipeName = String 

type alias RecipeBody = String

type alias FeuilleLiaison =
    { date : Date
    , filename : String
    , wholeBody : String
    , recipes : Dict RecipeName RecipeBody
    }

I am using two indexes, one for whole text search, one for the recipes only.
As each document can contain zero, one or more recipes I thought of setting up the index like this:

recipeConfig =
    { indexType = "ElmTextSearch - Customized French Stop Words"
    , ref = Date.toIsoString << .date
    , fields =
        []
    , listFields = [ ( Dict.keys << .recipes, 5.0 ), ( Dict.values << .recipes, 1.0 ) ]
    , initialTransformFactories = Index.Defaults.defaultInitialTransformFactories
    , transformFactories = [ (\func index -> ( index, func )) (FrenchStemmer.stemmer True) ]
    , filterFactories = [ createFilterFunc frenchStopWords ]
    }

The search is working fine and both recipes names and bodies seem to be indexed. However the scores associated with the results are all equal to NaN, so I cannot sort or filter the results.

Putting ( always "", 1 ) in fields seems to remove the NaNs, but I do not know if it affects the search in any way.

Am I using the listFields parameter wrong?

addOrUpdate function

I have a use case where I receive documents as part of an HTTP GET that may or may not already be in my index, so I'd like to add them if they're not in it and update them if they are. The way I do this at the moment is to add and, if this results in an error, update. It would be nice to have an addOrUpdate capability instead.
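The workaround described above could be sketched roughly as follows. This assumes `ElmTextSearch.update` takes the same arguments and returns the same `Result String (Index doc)` as `add` does in the README example; that signature is an assumption here, and `addOrUpdateSketch` is a hypothetical name.

```elm
module AddOrUpdateSketch exposing (addOrUpdateSketch)

import ElmTextSearch


{-| Try `add` first; if it fails for any reason, fall back to `update`.

Assumption: `ElmTextSearch.update` has the same shape as `add`.

-}
addOrUpdateSketch : doc -> ElmTextSearch.Index doc -> Result String (ElmTextSearch.Index doc)
addOrUpdateSketch doc index =
    case ElmTextSearch.add doc index of
        Ok updatedIndex ->
            Ok updatedIndex

        Err _ ->
            -- Any add error triggers the fallback, not only "already exists".
            ElmTextSearch.update doc index
```

Because the error is a `String`, this sketch cannot tell a duplicate document apart from other failures, which is part of why a built-in function would be more convenient.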

wildcards

Wondering if you can do wild cards like llo or *ll

Also, can you use AND | OR?

p.s. Thanks for a great library!

Errors when searching for the keyword "a" or "A" but not other single characters.

When a search is done for the single letter "a" it produces the error: "Error after tokenisation there are no terms to search for."

This can be reproduced by changing the search term in the MultipleAddSearch.elm example to either "a" or "A".

"A" was the only character I found that caused a problem. Other single letters or numbers work correctly.

Is there a way to make the tokenizer ignore apostrophes?

I am using the library to index a corpus of recipes in French. I replaced the default stop word list by a French one, so far so good.

In French we often use l' or d' as determiner before nouns as in de l'eau, l'ail meaning some water and the garlic. I think the tokenizer includes l' or d' with the following words, so a search for ail or eau does not work, probably because the search string tokens are expanded to the right. The fact that searching for john in a text containing John's book works seems to confirm that.

Ideally the index would only register ail and eau as tokens and ignore what comes before, as it is not relevant in the context of a search.

Is there a way to change the index config in order to achieve this effect?

Can't index or search for "one" (or: maybe filter stopwords before stemming?)

In my initial testing, I got an error when I searched for the text "one". As far as I can tell, it looks like processTokens transforms the tokens before filtering them. In this case, "one" becomes "on" and is filtered out as a stopword.

I don't know much about this stuff, but it would seem to make sense to filter out stopwords first? This seems to be what lunr.js does according to this comment

Example: https://ellie-app.com/346jy8VmsMXa1/0

0.19 Upgrade

Hey @rluiten! Are you planning to continue maintaining this and related packages into Elm 0.19? A response either way will help us at NoRedInk to plan our codebase upgrade. Thank you!

Retrieving documents based on search results

Is there a recommended method to retrieve the actual documents after you get the results? I always struggled with this with lunr.js as well...to me, when I do a search, the IDs aren't easily mapped to the documents they represent since the original documents are in a list rather than a Dict.

Obviously you can do a map over the results then filter the original list to pull out each document, but this seems inefficient.

Great port btw! Working great, just thought I'd post up with this question!
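One way to avoid filtering the original list for every hit, as a minimal sketch using only `Dict` from elm/core: build a lookup keyed by the same value the index uses as its ref, then map the `( ref, score )` pairs returned by `search` onto documents. The `Doc` type and the names `toDict` and `docsForResults` are hypothetical.

```elm
module Lookup exposing (Doc, docsForResults, toDict)

import Dict exposing (Dict)


{-| Hypothetical document type; `cid` is the field used as the index ref.
-}
type alias Doc =
    { cid : String
    , title : String
    }


{-| Build a lookup table keyed by the document ref.
-}
toDict : List Doc -> Dict String Doc
toDict documents =
    documents
        |> List.map (\doc -> ( doc.cid, doc ))
        |> Dict.fromList


{-| Turn search results (ref, score) into (document, score), keeping the
order of the results. Refs with no matching document are dropped.
-}
docsForResults : Dict String Doc -> List ( String, Float ) -> List ( Doc, Float )
docsForResults lookup results =
    List.filterMap
        (\( ref, score ) ->
            Dict.get ref lookup
                |> Maybe.map (\doc -> ( doc, score ))
        )
        results
```

The dictionary is built once and each hit is then resolved with a single `Dict.get`, rather than a pass over the whole document list per result.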

Replace error strings with union types?

I'm trying to handle some of the errors in certain ways, but having to switch on strings is a bit of a bummer. Would you be interested in a PR that replaces the string error messages with some custom union types?

I'm picturing something verbose but straightforward like these (we could sort out the details to your taste):

type AddError
    = UniqueRefIsEmpty
    | NoTermsToIndexAfterTokenisation
    | DocAlreadyExists

type RemoveError
    = UniqueRefIsEmpty
    | DocIsNotInIndex

type SearchError
    = IndexIsEmpty
    | QueryIsEmpty
    | NoTermsToSearchAfterTokenisation

Issue with the word "Loyalty"

I went to https://elm-lang.org/try and added rluiten/elm-text-search as a dependency.

This snippet demonstrates the issue by performing 3 searches, for lo, loy, and loya respectively, and printing out the number of results for each. You can see that only for loy am I getting no results back.

import ElmTextSearch
import Html

main =
  let
    index =
      ElmTextSearch.new
        { ref = .id
        , fields = [ (.title, 1 ) ]
        , listFields = []
        }

    indexAddResult =
      ElmTextSearch.add { id = "1234", title = "Loyalty" } index
      
    searchResultLo =
        indexAddResult
          |> Result.andThen (\i ->
            ElmTextSearch.search "lo" i |> Result.map (Tuple.second >> List.map Tuple.first)
          )
          |> Result.map List.length
          |> Result.withDefault 0  
      
    searchResultLoy =
        indexAddResult
          |> Result.andThen (\i ->
            ElmTextSearch.search "loy" i |> Result.map (Tuple.second >> List.map Tuple.first)
          )
          |> Result.map List.length
          |> Result.withDefault 0
          
    searchResultLoya =
        indexAddResult
          |> Result.andThen (\i ->
            ElmTextSearch.search "loya" i |> Result.map (Tuple.second >> List.map Tuple.first)
          )
          |> Result.map List.length
          |> Result.withDefault 0          
  in
  Html.ul []
    [ Html.li [] [Html.text (String.fromInt searchResultLo)]
    , Html.li [] [Html.text (String.fromInt searchResultLoy)]
    , Html.li [] [Html.text (String.fromInt searchResultLoya)]
    ]
