lucene-text-analysis

Library to inspect the output of the Lucene text analysis pipeline.

It supports three ways of analyzing text:

  • string to a list of strings;
  • string to a list of tokens (similar to the Elasticsearch/OpenSearch _analyze API);
  • string to a Graphviz program that draws a Lucene TokenStream as a graph.

Quickstart

Dependencies:

{:deps {lt.jocas/lucene-text-analysis {:mvn/version "1.0.21"}}}

Code:

(require '[lucene.custom.text-analysis :as analysis])

(analysis/text->token-strings "Test TEXT")
;; => ["test" "text"]

(analysis/text->tokens "Test TEXT")
;; => 
[#lucene.custom.text_analysis.TokenRecord{:token "test",
                                          :type "<ALPHANUM>",
                                          :start_offset 0,
                                          :end_offset 4,
                                          :position 0,
                                          :positionLength 1}
 #lucene.custom.text_analysis.TokenRecord{:token "text",
                                          :type "<ALPHANUM>",
                                          :start_offset 5,
                                          :end_offset 9,
                                          :position 1,
                                          :positionLength 1}]

(analysis/text->graph "Test TEXT")
;; =>
"digraph tokens {
   graph [ fontsize=30 labelloc=\"t\" label=\"\" splines=true overlap=false rankdir = \"LR\" ];
   // A2 paper size
   size = \"34.4,16.5\";
   edge [ fontname=\"Helvetica\" fontcolor=\"red\" color=\"#606060\" ]
   node [ style=\"filled\" fillcolor=\"#e8e8f0\" shape=\"Mrecord\" fontname=\"Helvetica\" ]
 
   0 [label=\"0\"]
   -1 [shape=point color=white]
   -1 -> 0 []
   0 -> 1 [ label=\"test / Test\"]
   1 [label=\"1\"]
   1 -> 2 [ label=\"text / TEXT\"]
   -2 [shape=point color=white]
   2 -> -2 []
 }
 "
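The offsets in each token record index into the original input, so they can be used to recover the surface form behind each analyzed token. The following is a minimal sketch rather than part of the library: plain maps stand in for the TokenRecord values shown above so that the snippet runs on its own, and `surface-forms` is a hypothetical helper name.

```clojure
;; Plain maps standing in for the TokenRecord values shown above.
;; Assumption: records support keyword lookup, as Clojure records do.
(def text "Test TEXT")

(def tokens
  [{:token "test" :start_offset 0 :end_offset 4 :position 0}
   {:token "text" :start_offset 5 :end_offset 9 :position 1}])

(defn surface-forms
  "Pairs each analyzed token with the slice of the input it came from."
  [text tokens]
  (mapv (fn [{:keys [token start_offset end_offset]}]
          [token (subs text start_offset end_offset)])
        tokens))

(surface-forms text tokens)
;; => [["test" "Test"] ["text" "TEXT"]]
```

With the library loaded, the result of `analysis/text->tokens` could presumably be passed in place of the hand-written `tokens` vector.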

Every function also accepts a Lucene Analyzer as an optional second argument; the calls above omit it and use the default.
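As an illustration of the second argument, an analyzer instance can be constructed through Java interop and passed in directly. This is a sketch under the assumption that Lucene's StandardAnalyzer (shipped in lucene-core in recent Lucene versions) is on the classpath; any other Analyzer should work the same way.

```clojure
(require '[lucene.custom.text-analysis :as analysis])
(import '(org.apache.lucene.analysis.standard StandardAnalyzer))

;; Assumption: StandardAnalyzer is available on the classpath.
(def analyzer (StandardAnalyzer.))

;; Same call as in the quickstart, with the analyzer passed explicitly.
(def token-strings
  (analysis/text->token-strings "Test TEXT" analyzer))

(println token-strings)
```

The same `analyzer` value can be reused for `text->tokens` and `text->graph`.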

Use cases

  • Apply ASCII folding to person names:

With the helper library:

lt.jocas/lucene-custom-analyzer {:mvn/version "1.0.14"}
(require '[lucene.custom.analyzer :as custom-analyzer]
         '[lucene.custom.text-analysis :as analysis])

(analysis/text->token-strings
  "Thomas Müller"
  (custom-analyzer/create {:token-filters [{:asciiFolding {}}]}))
;; => ["Thomas" "Muller"]

How to draw a graph image?

This example assumes that the Graphviz dot program is installed and on the PATH:

clojure -M --eval '(require `lucene.custom.text-analysis)(println (lucene.custom.text-analysis/text->graph "one two three"))' | dot -Tpng -o docs/assets/images/token-graph.png

The command writes the rendered image to docs/assets/images/token-graph.png:

Token Graph
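The same rendering can be done from a REPL instead of a shell pipeline. The sketch below assumes that `dot` is on the PATH and uses `clojure.java.shell/sh`, which ships with Clojure; the output file name is arbitrary.

```clojure
(require '[clojure.java.shell :as shell]
         '[lucene.custom.text-analysis :as analysis])

;; Assumptions: Graphviz `dot` is on the PATH; the output file name is arbitrary.
(let [dot-source (analysis/text->graph "one two three")
      ;; :in feeds the DOT source to the process's stdin.
      {:keys [exit err]} (shell/sh "dot" "-Tpng" "-o" "token-graph.png"
                                   :in dot-source)]
  (when-not (zero? exit)
    (throw (ex-info "dot failed" {:exit exit :err err}))))
```

When this runs as a script rather than in a REPL, calling `(shutdown-agents)` at the end lets the JVM exit promptly, because `clojure.java.shell` runs its stream handling in futures.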

Development

Compile Java classes:

clojure -T:build compile-java

Start your REPL.

License

Copyright © 2023 Dainius Jocas.

Distributed under The Apache License, Version 2.0.

lucene-text-analysis's Issues

[lucene-text-analysis] function to check what Analyzer::normalize returns

(deftest numeric-range-queries
  (testing "multi-field with string and numeric"
    (let [query-string "field-a:foo AND field-b:[10 TO 19]"
          opts {:default-field "text"
                :schema        {:field-b
                                {:analyzer
                                 {:char-filters
                                  [{:patternReplace {:pattern     "(\\d+)"
                                                     :replacement "000000000000$1"}}
                                   {:patternReplace {:pattern     ".*(\\d{5}$)"
                                                     :replacement "$1"}}]
                                  :tokenizer :keyword}}}}
          queries [{:id           "1"
                    :query        query-string
                    :query-parser {}}]
          docs [{:field-a "prefix foo suffix"
                 :field-b "100"}
                {:field-a "prefix foo suffix"
                 :field-b "15"}]]
      (with-open [monitor (m/monitor opts queries)]
        (is (= [[] [{:id "1"}]] (m/match monitor docs)))))))
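The issue title refers to Lucene's `Analyzer.normalize(String, String)`, which returns a `BytesRef`. Lucene's classic query parser, for example, normalizes range bounds this way instead of running full analysis. Until the library offers such a function, the result can be inspected through Java interop. This is only a sketch: `normalized-string` is a hypothetical helper name, and StandardAnalyzer is assumed to be on the classpath.

```clojure
(import '(org.apache.lucene.analysis Analyzer)
        '(org.apache.lucene.analysis.standard StandardAnalyzer))

(defn normalized-string
  "Returns what the analyzer's normalize step produces for `text`,
  decoded from the returned BytesRef into a String."
  [^Analyzer analyzer ^String field ^String text]
  (.utf8ToString (.normalize analyzer field text)))

;; StandardAnalyzer lowercases during normalization and does not tokenize.
(normalized-string (StandardAnalyzer.) "field-b" "Test TEXT")
;; => "test text"
```

Unlike `text->token-strings`, normalization does not split the input into tokens, so the whole string comes back as a single value.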
