elasticsearch-skroutz-greekstemmer's Introduction

SkroutzGreekStemmer plugin for ElasticSearch

This plugin is based on the GreekStemmer that is included in Apache Lucene.

Lucene's GreekStemmer was created according to Development of a Stemmer for the Greek Language by Georgios Ntaias. The thesis states that 166 suffixes are recognized in the Greek language. However, only 158 were captured by this stemmer, because adding the remaining suffixes would reduce the precision of the stemmer on the word-sets that were used for its evaluation.

However, excluding these suffixes does not work well on our word-set, which consists of more than 120,000 words. For our needs we therefore modified the implementation of Lucene's GreekStemmer to include eight more suffixes, which improve the quality of our search results. Four of these new suffixes are not among the 166 suffixes in the thesis of Georgios Ntaias. These are:

-ιο, -ιοσ, -εασ, -εα

The remaining four suffixes belong to the set of eight suffixes that were intentionally not captured by the original GreekStemmer. They reflect different forms of the words that end with the first three of the above suffixes:

-ιασ, -ιεσ, -ιοι, -ιουσ

Examples:

Word                    GreekStemmer  SkroutzGreekStemmer
κριτηριο (singular)     κριτηρι       κριτηρ
κριτηρια (plural)       κριτηρ        κριτηρ
προβολεας (singular)    προβολε       προβολ
προβολεις (plural)      προβολ        προβολ
αμινοξυ (singular)      αμινοξ        αμινοξ
αμινοξεα (plural)       αμινοξε       αμινοξ
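
The effect of the eight extra suffixes can be illustrated with a toy sketch in Python. This is not the plugin's algorithm: the real stemmer applies a much larger, ordered rule set, and the function name `strip_extra_suffix` and its minimum-length guard are assumptions made for illustration. It only shows naive longest-suffix stripping over the eight suffixes listed above.

```python
# Toy illustration only: the real plugin applies a much larger, ordered rule set.
# The eight additional suffixes named in this README (final sigma already folded).
EXTRA_SUFFIXES = ["ιο", "ιοσ", "εασ", "εα", "ιασ", "ιεσ", "ιοι", "ιουσ"]


def strip_extra_suffix(word: str) -> str:
    """Remove the longest matching extra suffix from a lowercased, unaccented word."""
    folded = word.replace("ς", "σ")  # fold final sigma, as the stemmer expects
    for suffix in sorted(EXTRA_SUFFIXES, key=len, reverse=True):
        # Hypothetical guard: never strip the whole word.
        if folded.endswith(suffix) and len(folded) > len(suffix):
            return folded[: -len(suffix)]
    return folded


if __name__ == "__main__":
    for w in ["κριτηριο", "προβολεας", "αμινοξεα"]:
        print(w, "->", strip_extra_suffix(w))
```

With these three inputs the sketch yields the same stems as the SkroutzGreekStemmer column, which is the point of adding the suffixes: singular and plural forms collapse to one stem.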

Stemming exceptions

The stemmer can be combined with the keyword-marker and stemmer-override Elasticsearch filters to support stemming exceptions (see also the greek_exceptions.txt sample stemmer-override configuration file). As of version 5.4.2.6, the plugin itself has no built-in support for stemming exceptions.
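
As a rough sketch of how such a combination might look, the Python snippet below builds index settings that place a `keyword_marker` and a `stemmer_override` filter before the stemmer. The filter names `greek_keywords` and `greek_overrides`, the sample keyword and the sample override rule are all hypothetical; only the filter types and the `keywords` and `rules` parameters are standard Elasticsearch options.

```python
import json

# Hypothetical analysis settings: filter names, the sample keyword and the
# override rule are illustrative, not taken from the plugin's sample files.
settings = {
    "analysis": {
        "analyzer": {
            "stem_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                # The exception filters must run before the stemmer.
                "filter": [
                    "lower_greek",
                    "greek_keywords",
                    "greek_overrides",
                    "stem_greek",
                ],
            }
        },
        "filter": {
            "lower_greek": {"type": "lowercase", "language": "greek"},
            # Tokens listed here are marked as keywords and left unstemmed.
            "greek_keywords": {"type": "keyword_marker", "keywords": ["φωσ"]},
            # Each rule maps an input token to the stem it should get.
            "greek_overrides": {
                "type": "stemmer_override",
                "rules": ["σαφωσ => σαφ"],
            },
            "stem_greek": {"type": "skroutz_stem_greek"},
        },
    }
}

# Request body for index creation (PUT /<index>).
body = json.dumps({"settings": settings}, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    print(body)
```

Both exception filters operate on tokens that have already been lowercased and stripped of diacritics, so keywords and rules are written in that folded form. For larger lists, `keywords_path` and `rules_path` can point to files instead of inline values.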

Installation

To list all plugins in the current installation:

sudo bin/elasticsearch-plugin list

In order to install the latest version of the plugin, simply run:

sudo bin/elasticsearch-plugin install gr.skroutz:elasticsearch-skroutz-greekstemmer:7.7.0.1

To install version 5.4.2.6 run:

sudo bin/elasticsearch-plugin install gr.skroutz:elasticsearch-skroutz-greekstemmer:5.4.2.6

In order to install version 2.4.4 of the plugin, simply run:

sudo bin/plugin install skroutz/elasticsearch-skroutz-greekstemmer/2.4.4.1

In order to install versions prior to 0.0.12, simply run:

sudo bin/plugin -install skroutz/elasticsearch-skroutz-greekstemmer/0.0.1

To remove a plugin (5.x.x/7.x.x):

sudo bin/elasticsearch-plugin remove <plugin_name>

Versions

SkroutzGreekStemmer Plugin  ElasticSearch  Branch
7.7.0.2                     7.7.0          7.7.0
7.7.0.1                     7.7.0          7.7.0
5.4.2.6                     5.4.2          5.4.2
5.4.0.1                     5.4.0          5.4.0
2.4.4.1                     2.4.4          2.4.4
0.0.12 (<=)                 1.5.0          1.5.0
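
The table suggests a naming convention for releases from 2.4.4.1 onward: the first three components of the plugin version are the Elasticsearch version it was built for. The Python helper below is a small sketch under that assumption; the function names are hypothetical, and the convention does not apply to the 0.0.x releases.

```python
def required_es_version(plugin_version: str) -> str:
    """Return the Elasticsearch version a plugin release targets.

    Assumes the convention visible in the versions table for releases
    2.4.4.1 and later: plugin version = <Elasticsearch version>.<build>.
    """
    parts = plugin_version.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected plugin version: {plugin_version!r}")
    if parts[0] == "0":
        raise ValueError("0.0.x releases do not follow this convention")
    return ".".join(parts[:3])


def is_compatible(plugin_version: str, running_es_version: str) -> bool:
    """True only if the plugin was built for exactly the running version."""
    return required_es_version(plugin_version) == running_es_version


if __name__ == "__main__":
    print(required_es_version("7.7.0.1"))
    print(is_compatible("7.7.0.1", "7.17.6"))
```

The check is an exact match because the plugin installer rejects a plugin built for a different Elasticsearch version, as the 7.17.6 report in the issues shows.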

Example usage

# Create index
$ curl -XPUT 'http://localhost:9200/test_stemmer' -H 'Content-Type: application/json' -d '{
  "settings":{
    "analysis":{
      "analyzer":{
        "stem_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter": ["lower_greek", "stem_greek"]
        }
      },
      "filter": {
        "lower_greek": {
          "type":"lowercase",
          "language":"greek"
        },
        "stem_greek": {
          "type":"skroutz_stem_greek"
        }
      }
    }
  }
}'
{"acknowledged":true}

# Test analyzer
$ curl -XGET 'http://localhost:9200/test_stemmer/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer": "stem_analyzer", "text": "κουρευτικές μηχανές"}'
{
  "tokens" : [ {
    "token" : "κουρευτ",
    "start_offset" : 0,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "μηχαν",
    "start_offset" : 12,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

$ curl -XGET 'http://localhost:9200/test_stemmer/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer": "stem_analyzer", "text": "κουρευτική μηχανή"}'
{
  "tokens" : [ {
    "token" : "κουρευτ",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "μηχαν",
    "start_offset" : 11,
    "end_offset" : 17,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

# Delete test index
$ curl -XDELETE 'http://localhost:9200/test_stemmer'
{"ok":true,"acknowledged":true}

YML configuration example

index:
  analysis:
    filter:
      stem_greek:
        type: skroutz_stem_greek

Warning

Input is expected to be casefolded for Greek (including folding of final sigma to sigma), and with diacritics removed. This can be achieved with GreekLowerCaseFilter.
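
To make the expected input form concrete, here is a minimal Python sketch that approximates this folding outside Elasticsearch. It is an assumption-laden stand-in, not a port of GreekLowerCaseFilter: the function name `fold_greek` is hypothetical, and it strips every combining mark rather than only the Greek ones.

```python
import unicodedata


def fold_greek(text: str) -> str:
    """Lowercase, strip diacritics and fold final sigma (ς) to sigma (σ)."""
    lowered = text.lower()
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose, then replace every final sigma with a regular sigma.
    return unicodedata.normalize("NFC", stripped).replace("ς", "σ")


if __name__ == "__main__":
    print(fold_greek("κουρευτικές μηχανές"))
```

For example, `fold_greek("κουρευτικές μηχανές")` returns `κουρευτικεσ μηχανεσ`, which is the form the stemmer expects to receive.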

Issues

For stemming issues: here

elasticsearch-skroutz-greekstemmer's People

Contributors

astathopoulos, bill-kolokithas, chief, greenonion, lovemeblender, m-peter, ptheof

elasticsearch-skroutz-greekstemmer's Issues

rule0, exceptions handling

rule0 of SkroutzGreekStemmer.java tries to handle special cases for specific word endings.
However, most of those cases concern whole words rather than endings.
E.g. the word περατοσ is handled as an ending, so it also matches υδατοπερατοσ and stems it as υδατοπερ; likewise σαφωσ matches φωσ, etc.
Those cases are false positive matches.

Most of the cases should be handled with string equality (rather than string suffix matching).
This should happen in an extra step before what is now rule0, and rule0 should then have fewer special cases to handle.
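
The difference between the two matching strategies can be shown with a short Python sketch. It uses only the two special cases named above; the set name and function names are hypothetical and do not mirror the Java source.

```python
# Two of the special cases named in this issue; the real rule0 has more.
SPECIAL_CASES = {"περατοσ", "φωσ"}


def matches_by_suffix(word: str) -> bool:
    """Suffix matching: the behaviour this issue reports."""
    return any(word.endswith(case) for case in SPECIAL_CASES)


def matches_by_equality(word: str) -> bool:
    """Whole-word matching: the behaviour this issue proposes."""
    return word in SPECIAL_CASES


if __name__ == "__main__":
    for word in ["περατοσ", "υδατοπερατοσ", "φωσ", "σαφωσ"]:
        print(word, matches_by_suffix(word), matches_by_equality(word))
```

Suffix matching accepts all four words, while equality accepts only περατοσ and φωσ, so υδατοπερατοσ and σαφωσ would fall through to the regular rules.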

ElasticsearchIllegalArgumentException[failed to find token filter type [skroutz_stem_greek] for [stem_greek]];

Version 1.1, Index :

    "index":{
        "analysis":{
            "analyzer":{
                "analyzer_startswith":{
                    "tokenizer":"keyword",
                    "filter":"lowercase"
                },
                "prefix-test-analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter" : ["lowercase","stem_greek"]
                }
            },
            "filter" : {
                "mynGram" : {
                    "type" : "nGram",
                    "min_gram" : 2,
                    "max_gram" : 50
                },
                "stem_greek": {
                    "type":"skroutz_stem_greek"
                }
            },
            "tokenizer": {
                "prefix-test-tokenizer": {
                    "type": "path_hierarchy",
                    "delimiter": "."
                }
            }
        }
    }

Installation on 6.x ES

Cannot install on latest ES due to error:

sudo bin/elasticsearch-plugin install gr.skroutz:elasticsearch-skroutz-greekstemmer:5.4.2.1
-> Downloading gr.skroutz:elasticsearch-skroutz-greekstemmer:5.4.2.1 from maven central
[=================================================] 100%   
Warning: sha512 not found, falling back to sha1. This behavior is deprecated and will be removed in a future release. Please update the plugin to use a sha512 checksum.
ERROR: This plugin was built with an older plugin structure. Contact the plugin author to remove the intermediate "elasticsearch" directory within the plugin zip.

Any chance for an update here?

Can't install plugin

I can't install the plugin, can you please help?

cd /usr/share/elasticsearch && sudo bin/plugin --install skroutz/elasticsearch-skroutz-greekstemmer/2.4.4.1
-> Installing skroutz/elasticsearch-skroutz-greekstemmer/2.4.4.1...
Trying http://download.elasticsearch.org/skroutz/elasticsearch-skroutz-greekstemmer/elasticsearch-skroutz-greekstemmer-2.4.4.1.zip...
Trying http://search.maven.org/remotecontent?filepath=skroutz/elasticsearch-skroutz-greekstemmer/2.4.4.1/elasticsearch-skroutz-greekstemmer-2.4.4.1.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/skroutz/elasticsearch-skroutz-greekstemmer/2.4.4.1/elasticsearch-skroutz-greekstemmer-2.4.4.1.zip...
Trying https://github.com/skroutz/elasticsearch-skroutz-greekstemmer/archive/2.4.4.1.zip...
Trying https://github.com/skroutz/elasticsearch-skroutz-greekstemmer/archive/master.zip...
Failed to install skroutz/elasticsearch-skroutz-greekstemmer/2.4.4.1, reason: failed to download out of all possible locations..., use --verbose to get detailed information

How to install it in elasticsearch 7.17.6

Hi, I want to install it in my current elasticsearch which is v.7.17.6

Installation fails with the following message:

Exception in thread "main" java.lang.IllegalArgumentException: Plugin [elasticsearch-skroutz-greekstemmer] was built for Elasticsearch version 7.7.0 but version 7.17.6 is running

How can I update the code for my current elasticsearch version?

Branch for 5.5.2

Hi!

Could you compile a new branch for ES 5.5.2?

Thank you :)

Can't install plugin

Here is what I get:

sudo /usr/share/elasticsearch/bin/plugin -install skroutz/elasticsearch-skroutz-greekstemmer/0.0.1
-> Installing skroutz/elasticsearch-skroutz-greekstemmer/0.0.1...
Trying http://download.elasticsearch.org/skroutz/elasticsearch-skroutz-greekstemmer/elasticsearch-skroutz-greekstemmer-0.0.1.zip...
Trying http://search.maven.org/remotecontent?filepath=skroutz/elasticsearch-skroutz-greekstemmer/0.0.1/elasticsearch-skroutz-greekstemmer-0.0.1.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/skroutz/elasticsearch-skroutz-greekstemmer/0.0.1/elasticsearch-skroutz-greekstemmer-0.0.1.zip...
Trying https://github.com/skroutz/elasticsearch-skroutz-greekstemmer/zipball/v0.0.1... (assuming site plugin)
Failed to install skroutz/elasticsearch-skroutz-greekstemmer/0.0.1, reason: failed to download out of all possible locations..., use -verbose to get detailed information

Usage sample

Hi,

First of all, I would like to congratulate you guys on the enhanced Greek stemmer you built for the Elasticsearch platform. I believe a usage example is needed, as well as a test case scenario, so that we can be sure our configuration is correct.

elasticsearch 2

Installation fails on Elasticsearch 2:
ERROR: Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip

Stem Exception Handling

The exceptional cases of the various analysis steps are not uniformly handled.
Some are static variables and some are coded into if clauses.
All of them are hardcoded and can only change by altering the source files.
We can make an effort to

  1. handle them uniformly
  2. load them from a resource file

Building or Testing on a system with default encoding other than UTF-8 breaks file "stemming_samples.txt"

The problem is that UpdateStemmingSamples.java reads the file with UTF-8 encoding and replaces it with a file using the default encoding of the building computer. Subsequent builds fail.

Proposed changes (lines 27, 28):
FileOutputStream fileWriter = new FileOutputStream(file.getAbsoluteFile());
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fileWriter, StandardCharsets.UTF_8));
