Giter Club home page Giter Club logo

pg_trgm_pro's Introduction

pg_trgm โ€“ text similarity measurement and index searching based on trigrams

Introduction

The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.

The original module is located in GitHub. This module provides a new function and new operators which provide fuzzy searching for word in a text.

Note. Functions of this module and functions of pg_trgm module, which included in the PostgreSQL 9.6, are differ. Functions of this module have other names and the module does not provide GUC parameters.

License

This module available from GitHub under the same license as PostgreSQL and supports PostgreSQL 9.4+.

Installation

Before build and install pg_trgm you should ensure following:

  • PostgreSQL version is 9.4 or higher.

Typical installation procedure may look like this:

$ git clone https://github.com/postgrespro/pg_trgm_pro
$ cd pg_trgm_pro
$ make USE_PGXS=1
$ sudo make USE_PGXS=1 install
$ make USE_PGXS=1 installcheck
$ psql DB -c "CREATE EXTENSION pg_trgm;"

New functions and operators

The pg_trgm module provides the new functions.

Function Returns Description
substring_similarity(text, text) real Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the first string is identical to one of the word of the second string).
show_substring_limit() real Returns the current substring similarity threshold that is used by the <% operator.
set_substring_limit(real) real Sets the current substring similarity threshold that is used by the <% operator. The threshold must be between 0 and 1 (default is 0.6).

The module provides new operators.

Operator Returns Description
text <% text boolean Returns true if its arguments have a substring similarity that is greater than the current substring similarity threshold set by set_substring_limit().

GiST and GIN indexes support the operator <%.

Examples

Let us assume we have an test_trgm table:

CREATE TABLE test_trgm (t text);

You can create GiST index:

CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);

or GIN index:

CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);

Now you can use an index on the t column for substring similarity. For example:

SELECT t, substring_similarity('word', t) AS sml
  FROM test_trgm
  WHERE 'word' <% t
  ORDER BY sml DESC, t;

This will return all values in the text column that have a word which sufficiently similar to word, sorted from best match to worst. The index will be used to make this a fast operation even over very large data sets.

Authors

Oleg Bartunov [email protected]

Teodor Sigaev [email protected]

pg_trgm_pro's People

Contributors

akorotkov avatar za-arthur avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.