eulerto / pg_similarity Goto Github PK

set of functions and operators for executing similarity queries

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.46% C 99.54%

pg_similarity's Introduction

Introduction

pg_similarity is an extension to support similarity queries on PostgreSQL. The implementation is tightly integrated in the RDBMS in the sense that it defines operators so instead of the traditional operators (= and <>) you can use ~~~ and ! (any of these operators represents a similarity function).

pg_similarity has three main components:

Functions: a set of functions that implements similarity algorithms available in the literature. These functions can be used as UDFs and, will be the base for implementing the similarity operators;
Operators: a set of operators defined at the top of similarity functions. They use similarity functions to obtain the similarity threshold and, compare its value to a user-defined threshold to decide if it is a match or not;
Session Variables: a set of variables that store similarity function parameters. Theses variables can be defined at run time.

Installation

pg_similarity is supported on those platforms that PostgreSQL is. The installation steps depend on your operating system.

You can also keep up with the latest fixes and features cloning the Git repository.

$ git clone https://github.com/eulerto/pg_similarity.git

UNIX based Operating Systems

Before you are able to use your extension, you should build it and load it at the desirable database.

$ tar -zxf pg_similarity-1.0.tgz
$ cd pg_similarity-1.0
$ $EDITOR Makefile # edit PG_CONFIG iif necessary
$ make
$ make install
$ psql mydb
psql (13.0)
Type "help" for help.

mydb=# CREATE EXTENSION pg_similarity;
CREATE EXTENSION

The typical usage is to copy a sample file at tarball (pg_similarity.conf.sample) to PGDATA (as pg_similarity.conf) and include the following line in postgresql.conf:

include 'pg_similarity.conf'

Windows

Sorry, never tried^H^H^H^H^H Actually I tried that but it is not that easy as on UNIX. :( There are two ways to build PostgreSQL on Windows: (i) MingW and (ii) MSVC. The former is supported but it is not widely used and the latter is popular because Windows binaries (officially distributed) are built using MSVC. If you choose to use Mingw, just follow the UNIX instructions above to build pg_similarity. Otherwise, the MSVC steps are below:

Edit pg_similarity.vcxproj replacing c:\postgres\pg130 with PostgreSQL prefix directory;
Open this project file in MS Visual Studio and build it;
Copy pg_similarity.dll to pg_config --pkglibdir;
Copy pg_similarity.control and pg_similarity--*.sql to SHAREDIR/extension (SHAREDIR is pg_config --sharedir).

Functions and Operators

This extension supports a set of similarity algorithms. The most known algorithms are covered by this extension. You must be aware that each algorithm is suited for a specific domain. The following algorithms are provided.

L1 Distance (as known as City Block or Manhattan Distance);
Cosine Distance;
Dice Coefficient;
Euclidean Distance;
Hamming Distance;
Jaccard Coefficient;
Jaro Distance;
Jaro-Winkler Distance;
Levenshtein Distance;
Matching Coefficient;
Monge-Elkan Coefficient;
Needleman-Wunsch Coefficient;
Overlap Coefficient;
Q-Gram Distance;
Smith-Waterman Coefficient;
Smith-Waterman-Gotoh Coefficient;
Soundex Distance.

Algorithm	Function	Operator	Use Index?	Parameters
L1 Distance	block(text, text) returns float8	~++	yes	pg_similarity.block_tokenizer (enum) pg_similarity.block_threshold (float8) pg_similarity.block_is_normalized (bool)
Cosine Distance	cosine(text, text) returns float8	~##	yes	pg_similarity.cosine_tokenizer (enum) pg_similarity.cosine_threshold (float8) pg_similarity.cosine_is_normalized (bool)
Dice Coefficient	dice(text, text) returns float8	~-~	yes	pg_similarity.dice_tokenizer (enum) pg_similarity.dice_threshold (float8) pg_similarity.dice_is_normalized (bool)
Euclidean Distance	euclidean(text, text) returns float8	~!!	yes	pg_similarity.euclidean_tokenizer (enum) pg_similarity.euclidean_threshold (float8) pg_similarity.euclidean_is_normalized (bool)
Hamming Distance	hamming(bit varying, bit varying) returns float8 hamming_text(text, text) returns float8	~@~	no	pg_similarity.hamming_threshold (float8) pg_similarity.hamming_is_normalized (bool)
Jaccard Coefficient	jaccard(text, text) returns float8	~??	yes	pg_similarity.jaccard_tokenizer (enum) pg_similarity.jaccard_threshold (float8) pg_similarity.jaccard_is_normalized (bool)
Jaro Distance	jaro(text, text) returns float8	~%%	no	pg_similarity.jaro_threshold (float8) pg_similarity.jaro_is_normalized (bool)
Jaro-Winkler Distance	jarowinkler(text, text) returns float8	~@@	no	pg_similarity.jarowinkler_threshold (float8) pg_similarity.jarowinkler_is_normalized (bool)
Levenshtein Distance	lev(text, text) returns float8	~==	no	pg_similarity.levenshtein_threshold (float8) pg_similarity.levenshtein_is_normalized (bool)
Matching Coefficient	matchingcoefficient(text, text) returns float8	~^^	yes	pg_similarity.matching_tokenizer (enum) pg_similarity.matching_threshold (float8) pg_similarity.matching_is_normalized (bool)
Monge-Elkan Coefficient	mongeelkan(text, text) returns float8	~\|\|	no	pg_similarity.mongeelkan_tokenizer (enum) pg_similarity.mongeelkan_threshold (float8) pg_similarity.mongeelkan_is_normalized (bool)
Needleman-Wunsch Coefficient	needlemanwunsch(text, text) returns float8	~#~	no	pg_similarity.nw_threshold (float8) pg_similarity.nw_is_normalized (bool)
Overlap Coefficient	overlapcoefficient(text, text) returns float8	~**	yes	pg_similarity.overlap_tokenizer (enum) pg_similarity.overlap_threshold (float8) pg_similarity.overlap_is_normalized (bool)
Q-Gram Distance	qgram(text, text) returns float8	~~~	yes	pg_similarity.qgram_threshold (float8) pg_similarity.qgram_is_normalized (bool)
Smith-Waterman Coefficient	smithwaterman(text, text) returns float8	~=~	no	pg_similarity.sw_threshold (float8) pg_similarity.sw_is_normalized (bool)
Smith-Waterman-Gotoh Coefficient	smithwatermangotoh(text, text) returns float8	~!~	no	pg_similarity.swg_threshold (float8) pg_similarity.swg_is_normalized (bool)
Soundex Distance	soundex(text, text) returns float8	~*~	no

The several parameters control the behavior of the pg_similarity functions and operators. I don't explain in detail each parameter because they can be classified in three classes: tokenizer, threshold, and normalized.

tokenizer: controls how the strings are tokenized. The valid values are alnum, gram, word, and camelcase. All tokens are lowercase (this option can be set at compile time; see PGS_IGNORE_CASE at source code). Default is alnum;
- alnum: delimiters are any non-alphanumeric characters. That means that only alphabetic characters in the standard C locale and digits (0-9) are accepted in tokens. For example, the string "Euler_Taveira_de_Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", "22", "02", "2011";
- gram: an n-gram is a subsequence of length n. Extracting n-grams from a string can be done by using the sliding-by-one technique, that is, sliding a window of length n through out the string by one character. For example, the string "euler taveira" (using n = 3) is tokenized as "eul", "ule", "ler", "er ", "r t", " ta", "tav", "ave", "vei", "eir", and "ira". There are some authors that consider n-grams adding " e", " eu", "ra ", and "a " to the set of tokens, that is called full n-grams (this option can be set at compile time; see PGS_FULL_NGRAM at source code);
- word: delimiters are white space characters (space, form-feed, newline, carriage return, horizontal tab, and vertical tab). For example, the string "Euler Taveira de Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", and "22/02/2011";
- camelcase: delimiters are capitalized characters but they are also included as first token characters. For example, the string "EulerTaveira de Oliveira" is tokenized as "Euler", "Taveira de ", and "Oliveira".
threshold: controls how flexible will be the result set. These values are used by operators to match strings. For each pair of strings, if the calculated value (using the corresponding similarity function) is greater or equal the threshold value, there is a match. The values range from 0.0 to 1.0. Default is 0.7;
normalized: controls whether the similarity coefficient/distance is normalized (between 0.0 and 1.0) or not. Normalized values are used automatically by operators to match strings, that is, this parameter only makes sense if you are using similarity functions. Default is true.

Examples

Set parameters at run time.

mydb=# show pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.7
(1 row)

mydb=# set pg_similarity.levenshtein_threshold to 0.5;
SET
mydb=# show pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.5
(1 row)

mydb=# set pg_similarity.cosine_tokenizer to camelcase;
SET
mydb=# set pg_similarity.euclidean_is_normalized to false;
SET

Simple tables for examples.

mydb=# create table foo (a text);
CREATE TABLE
mydb=# insert into foo values('Euler'),('Oiler'),('Euler Taveira de Oliveira'),('Maria Taveira dos Santos'),('Carlos Santos Silva');
INSERT 0 5
mydb=# create table bar (b text);
CREATE TABLE
mydb=# insert into bar values('Euler T. de Oliveira'),('Euller'),('Oliveira, Euler Taveira'),('Sr. Oliveira');
INSERT 0 4

Example 1: Using similarity functions cosine, jaro, and euclidean.

mydb=# select a, b, cosine(a,b), jaro(a, b), euclidean(a, b) from foo, bar;
             a             |            b            |  cosine  |   jaro   | euclidean
---------------------------+-------------------------+----------+----------+-----------
 Euler                     | Euler T. de Oliveira    |      0.5 |     0.75 |  0.579916
 Euler                     | Euller                  |        0 | 0.944444 |         0
 Euler                     | Oliveira, Euler Taveira |  0.57735 | 0.605797 |  0.552786
 Euler                     | Sr. Oliveira            |        0 | 0.505556 |  0.225403
 Oiler                     | Euler T. de Oliveira    |        0 | 0.472222 |  0.457674
 Oiler                     | Euller                  |        0 |      0.7 |         0
 Oiler                     | Oliveira, Euler Taveira |        0 | 0.672464 |  0.367544
 Oiler                     | Sr. Oliveira            |        0 | 0.672222 |  0.225403
 Euler Taveira de Oliveira | Euler T. de Oliveira    |     0.75 |  0.79807 |      0.75
 Euler Taveira de Oliveira | Euller                  |        0 | 0.677778 |  0.457674
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.866025 | 0.773188 |       0.8
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.353553 | 0.592222 |  0.552786
 Maria Taveira dos Santos  | Euler T. de Oliveira    |        0 |  0.60235 |       0.5
 Maria Taveira dos Santos  | Euller                  |        0 | 0.305556 |  0.457674
 Maria Taveira dos Santos  | Oliveira, Euler Taveira | 0.288675 | 0.535024 |  0.552786
 Maria Taveira dos Santos  | Sr. Oliveira            |        0 | 0.634259 |  0.452277
 Carlos Santos Silva       | Euler T. de Oliveira    |        0 | 0.542105 |   0.47085
 Carlos Santos Silva       | Euller                  |        0 | 0.312865 |  0.367544
 Carlos Santos Silva       | Oliveira, Euler Taveira |        0 | 0.606662 |   0.42265
 Carlos Santos Silva       | Sr. Oliveira            |        0 | 0.507728 |  0.379826
(20 rows)

Example 2: Using operator levenshtein (~==) and changing its threshold at run time.

mydb=# show pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold
-------------------------------------
 0.7
(1 row)

mydb=# select a, b, lev(a,b) from foo, bar where a ~== b;
             a             |          b           |   lev
---------------------------+----------------------+----------
 Euler                     | Euller               | 0.833333
 Euler Taveira de Oliveira | Euler T. de Oliveira |     0.76
(2 rows)

mydb=# set pg_similarity.levenshtein_threshold to 0.5;
SET
mydb=# select a, b, lev(a,b) from foo, bar where a ~== b;
             a             |          b           |   lev
---------------------------+----------------------+----------
 Euler                     | Euller               | 0.833333
 Oiler                     | Euller               |      0.5
 Euler Taveira de Oliveira | Euler T. de Oliveira |     0.76
(3 rows)

Example 3: Using operator qgram (~~~) and changing its threshold at run time.

mydb=# set pg_similarity.qgram_threshold to 0.7;
SET
mydb=# show pg_similarity.qgram_threshold;
 pg_similarity.qgram_threshold
-------------------------------
 0.7
(1 row)

mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b;
             a             |            b            |  qgram
---------------------------+-------------------------+----------
 Euler                     | Euller                  |      0.8
 Euler Taveira de Oliveira | Euler T. de Oliveira    |  0.77551
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692
(3 rows)

mydb=# set pg_similarity.qgram_threshold to 0.35;
SET
mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b;
             a             |            b            |  qgram
---------------------------+-------------------------+----------
 Euler                     | Euler T. de Oliveira    | 0.413793
 Euler                     | Euller                  |      0.8
 Oiler                     | Euller                  |      0.4
 Euler Taveira de Oliveira | Euler T. de Oliveira    |  0.77551
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.439024
(6 rows)

Example 4: Using a set of operators using the same threshold (0.7) to ilustrate that some similarity functions are appropriated to certain data domains.

mydb=# select * from bar where b ~@@ 'euler'; -- jaro-winkler operator
          b
----------------------
 Euler T. de Oliveira
 Euller
(2 rows)

mydb=# select * from bar where b ~~~ 'euler'; -- qgram operator
 b
---
(0 rows)

mydb=# select * from bar where b ~== 'euler'; -- levenshtein operator
   b
--------
 Euller
(1 row)

mydb=# select * from bar where b ~## 'euler'; -- cosine operator
 b
---
(0 rows)

License

Copyright © 2008-2020 Euler Taveira de Oliveira All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer; Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution; Neither the name of the Euler Taveira de Oliveira nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

pg_similarity's People

Contributors

Stargazers

Watchers

pg_similarity's Issues

Could not open extension control file “/usr/share/postgresql/10/extension/pg_similarity.control”: No such file or directory

I’m trying to create the extension pg_smiliarity in docker container. I create this Dockerfile:

FROM postgres:10

ENV POSTGRES_USER user
ENV POSTGRES_PASSWORD user
ENV POSTGRES_DB user_db

RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get -y install \
     postgresql-server-dev-all \
     wget \
     make \
     gcc \
    && rm -rf /var/lib/apt/lists/*
RUN wget -c 'pgfoundry.org/frs/download.php/2237/pg_similarity-0.0.19.tgz' 
RUN tar -zxf pg_similarity-0.0.19.tgz
RUN cd pg_similarity \
    && USE_PGXS=1 make \
    && USE_PGXS=1 make install

Then I build the image and I run the container. I exec into the container:

psql -U user -d user_db
psql (10.3 (Debian 10.3-1.pgdg90+1))
 Type "help" for help.
user_db=# create extension pg_similarity;

I got this error:

ERROR: could not open extension control file "/usr/share/postgresql/10/extension/pg_similarity.control": No such file or directory

Under '/usr/share/postgres/10/contrib' I find the file pg_similarity.sql.How can I fix this problem?

How to install on version 9.5

when running "make" in builds in directory for 9.4, what should I change to force it install on postgres 9.5?

set PGS_MAX_STR_LEN at runtime

I would like to set PGS_MAX_STR_LEN at runtime instead of compile time. For example:

set pg_similarity.max_str_len to 2000;

pg_similarity crashes on PG16

See the build log:
https://pgdgbuild.dus.dg-i.net/view/Binaries-Beta/job/pg-similarity-binaries-beta/lastFailedBuild/architecture=amd64,distribution=sid/console#console-section-9

cosine and jaro on perfect matches

Hi,
I am starting to use this library and came across those potential glitches.
I would expect cosine, jaro and jarowinkler to return exactly 1.0 on equal strings, but I get the following (jarowinkler omitted, but shows the same behaviour as jaro on this):

server_encoding                         | UTF8 
client_encoding                         | UTF8
pg_similarity.cosine_is_normalized      | on
pg_similarity.jaro_is_normalized        | on

cl_generated_1000=# select version();
                                                   version                                                    
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.9 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.0 (clang-700.1.76), 64-bit
(1 row)
# set pg_similarity.cosine_tokenizer = 'alnum';
SET
# set pg_similarity.jaro_tokenizer = 'alnum';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'word';
SET
# set pg_similarity.jaro_tokenizer = 'word';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | f        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'gram';
SET
# set pg_similarity.jaro_tokenizer = 'gram';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 f        | f        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'camelcase';
SET
# set pg_similarity.jaro_tokenizer = 'camelcase';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | f        | t        | t
(1 row)

ERROR: argument exceeds the maximum length of 1024 bytes

I'm having an issue when using the smithwaterman() function. I receive the following error:

SELECT content, smithwaterman(content, 'example word') FROM review;
SQL Error [22023]: ERROR: argument exceeds the maximum length of 1024 bytes

The size of the string exceeds the limit of 1024 bytes defined. I would like to know if it's possible to increase this limit. I believe that I may need to change the PGS_MAX_STR_LEN or PGS_MAX_TOKEN_LEN variable, but I'm not sure how to do that. Could someone guide me on how to set this variable in PostgreSQL? I appreciate any help in advance!

Regression tests fail with PG12

Regression tests fail with PG12beta1 because floating point output is now more precise by default:

16:50:42 --- /tmp/autopkgtest.SxiwfA/tree/expected/test1.out	2019-05-21 14:50:21.000000000 +0000
16:50:42 +++ /tmp/autopkgtest.SxiwfA/tree/results/test1.out	2019-05-21 14:50:41.994778640 +0000
16:50:42 @@ -7,27 +7,27 @@
16:50:42  \set b '\'Euler T Oliveira\''
16:50:42  \set c '\'Oiler Taviera do Oliviera\''
16:50:42  select block(:a, :b), block_op(:a, :b), :a ~++ :b as operator;
16:50:42 -       block       | block_op | operator 
16:50:42 --------------------+----------+----------
16:50:42 - 0.571428571428571 | f        | f
16:50:42 +       block        | block_op | operator 
16:50:42 +--------------------+----------+----------
16:50:42 + 0.5714285714285714 | f        | f
16:50:42  (1 row)
16:50:42  
16:50:42  select cosine(:a, :b), cosine_op(:a, :b), :a ~## :b as operator;
16:50:42 -      cosine       | cosine_op | operator 
16:50:42 --------------------+-----------+----------
16:50:42 - 0.577350269189626 | f         | f
16:50:42 +       cosine       | cosine_op | operator 
16:50:42 +--------------------+-----------+----------
16:50:42 + 0.5773502691896258 | f         | f
16:50:42  (1 row)
16:50:42  
16:50:42  select dice(:a, :b), dice_op(:a, :b), :a ~-~ :b as operator;
16:50:42 -       dice        | dice_op | operator 
16:50:42 --------------------+---------+----------
16:50:42 - 0.571428571428571 | f       | f
16:50:42 +        dice        | dice_op | operator 
16:50:42 +--------------------+---------+----------
16:50:42 + 0.5714285714285714 | f       | f
16:50:42  (1 row)
16:50:42  
16:50:42  select euclidean(:a, :b), euclidean_op(:a, :b), :a ~!! :b as operator;
16:50:42 -     euclidean     | euclidean_op | operator 
16:50:42 --------------------+--------------+----------
16:50:42 - 0.653589838486225 | f            | f
16:50:42 +     euclidean      | euclidean_op | operator 
16:50:42 +--------------------+--------------+----------
16:50:42 + 0.6535898384862245 | f            | f
16:50:42  (1 row)
16:50:42  
16:50:42  select hamming_text(:a, :c), hamming_text_op(:a, :c), :a ~@~ :c as operator;
16:50:42 @@ -43,9 +43,9 @@
16:50:42  (1 row)
16:50:42  
16:50:42  select jaro(:a, :b), jaro_op(:a, :b), :a ~%% :b as operator;
16:50:42 -       jaro        | jaro_op | operator 
16:50:42 --------------------+---------+----------
16:50:42 - 0.796666666666667 | t       | t
16:50:42 +        jaro        | jaro_op | operator 
16:50:42 +--------------------+---------+----------
16:50:42 + 0.7966666666666666 | t       | t
16:50:42  (1 row)
16:50:42  
16:50:42  select jarowinkler(:a, :b), jarowinkler_op(:a, :b), :a ~@@ :b as operator;
16:50:42 @@ -77,13 +77,13 @@
16:50:42  select overlapcoefficient(:a, :b), overlapcoefficient_op(:a, :b), :a ~** :b as operator;
16:50:42   overlapcoefficient | overlapcoefficient_op | operator 
16:50:42  --------------------+-----------------------+----------
16:50:42 -  0.666666666666667 | f                     | f
16:50:42 + 0.6666666666666666 | f                     | f
16:50:42  (1 row)
16:50:42  
16:50:42  select qgram(:a, :b), qgram_op(:a, :b), :a ~~~ :b as operator;
16:50:42 -       qgram       | qgram_op | operator 
16:50:42 --------------------+----------+----------
16:50:42 - 0.711111111111111 | t        | t
16:50:42 +       qgram        | qgram_op | operator 
16:50:42 +--------------------+----------+----------
16:50:42 + 0.7111111111111111 | t        | t
16:50:42  (1 row)
16:50:42  
16:50:42  --select smithwaterman(:a, :b), smithwaterman_op(:a, :b), :a ~=~ :b as operator;
16:50:42 diff -U3 /tmp/autopkgtest.SxiwfA/tree/expected/test4.out /tmp/autopkgtest.SxiwfA/tree/results/test4.out
16:50:42 --- /tmp/autopkgtest.SxiwfA/tree/expected/test4.out	2018-03-27 04:39:53.000000000 +0000
16:50:42 +++ /tmp/autopkgtest.SxiwfA/tree/results/test4.out	2019-05-21 14:50:42.154779828 +0000
16:50:42 @@ -29,13 +29,13 @@
16:50:42  (2 rows)
16:50:42  
16:50:42  SELECT a, cosine(a, :a) FROM simtst WHERE a ~## :a;
16:50:42 -             a             |      cosine       
16:50:42 ----------------------------+-------------------
16:50:42 - Euler Taveira de Oliveira |                 1
16:50:42 - Euler T. de Oliveira      |              0.75
16:50:42 - Euler Oliveira            | 0.707106781186547
16:50:42 - Euler Taveira             | 0.707106781186547
16:50:42 - Oliveira, Euler           | 0.707106781186547
16:50:42 +             a             |       cosine       
16:50:42 +---------------------------+--------------------
16:50:42 + Euler Taveira de Oliveira |                  1
16:50:42 + Euler T. de Oliveira      |               0.75
16:50:42 + Euler Oliveira            | 0.7071067811865475
16:50:42 + Euler Taveira             | 0.7071067811865475
16:50:42 + Oliveira, Euler           | 0.7071067811865475
16:50:42  (5 rows)
16:50:42  
16:50:42  CREATE INDEX simtsti ON simtst USING gin (a gin_similarity_ops);
16:50:42 @@ -56,13 +56,13 @@
16:50:42  
16:50:42  SET enable_bitmapscan TO ON;
16:50:42  SELECT a, cosine(a, :a) FROM simtst WHERE a ~## :a;
16:50:42 -             a             |      cosine       
16:50:42 ----------------------------+-------------------
16:50:42 - Euler Taveira de Oliveira |                 1
16:50:42 - Euler T. de Oliveira      |              0.75
16:50:42 - Euler Oliveira            | 0.707106781186547
16:50:42 - Euler Taveira             | 0.707106781186547
16:50:42 - Oliveira, Euler           | 0.707106781186547
16:50:42 +             a             |       cosine       
16:50:42 +---------------------------+--------------------
16:50:42 + Euler Taveira de Oliveira |                  1
16:50:42 + Euler T. de Oliveira      |               0.75
16:50:42 + Euler Oliveira            | 0.7071067811865475
16:50:42 + Euler Taveira             | 0.7071067811865475
16:50:42 + Oliveira, Euler           | 0.7071067811865475
16:50:42  (5 rows)
16:50:42  
16:50:42  DROP TABLE simtst;

A possible fix would be to hardcode SET extra_float_digits = 0; in the test file.

A new release with the fix would be nice so we packagers could update the packages. Thanks!

release 版本用例跑不通， master分支可以跑通，但没有发布release，希望能发布下

Extension does not compile against PG16

Hello!

It looks like this extension can not be compiled against Postgres 16 developer libraries. I see this error when attempting to compile the library on Debian bookworm against PG16

31.64 hamming.c:181:65: error: incompatible pointer to integer conversion passing 'char *' to parameter of type 'Datum' (aka 'unsigned long') [-Wint-conversion]
31.64         elog(DEBUG1, "hammingdistance(%s, %s) = %.3f", DatumGetCString(a), DatumGetCString(b), res);
31.64                                                                        ^
31.64 /usr/include/postgresql/16/server/utils/elog.h:239:34: note: expanded from macro 'elog'
31.64         ereport(elevel, errmsg_internal(__VA_ARGS__))
31.64                                         ^~~~~~~~~~~
31.64 /usr/include/postgresql/16/server/utils/elog.h:164:37: note: expanded from macro 'ereport'
31.64         ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
31.64                                            ^~~~~~~~~~~
31.64 /usr/include/postgresql/16/server/utils/elog.h:147:4: note: expanded from macro 'ereport_domain'
31.64                         __VA_ARGS__, errfinish(__FILE__, __LINE__, __func__); \
31.64                         ^~~~~~~~~~~
31.64 /usr/include/postgresql/16/server/postgres.h:335:23: note: passing argument to parameter 'X' here
31.64 DatumGetCString(Datum X)

I see that some Debian maintainers patched this in https://packages.debian.org/trixie/amd64/database/postgresql-16-similarity - Could this get upstreamed here?

Cannot install

I'm trying to install pg_similarity following the instructions given in ReadMe.

I'm probably doing something really basic incorrectly but at the stage: USE_PGXS=1 make install I'm getting the following output.

/bin/mkdir -p '/usr/lib/postgresql/9.5/lib'
/bin/mkdir -p '/usr/share/postgresql/9.5/contrib'
/usr/bin/install -c -m 755  pg_similarity.so '/usr/lib/postgresql/9.5/lib/pg_similarity.so'
/usr/bin/install: cannot remove '/usr/lib/postgresql/9.5/lib/pg_similarity.so': Permission denied
/usr/lib/postgresql/9.5/lib/pgxs/src/makefiles/../../src/Makefile.shlib:506: recipe for target 'install-lib-shared' failed
make: *** [install-lib-shared] Error 1

Could you advise what I should do?

Needleman-Wunsch crashes the Postgres server

I have a database with a table of 3460 organizations. I found that the Needleman-Wunsch operator and function crash my server.

Here is the query that crashes the server:

select o1.organization_id, o1.organization_name,
       o2.organization_id, o2.organization_name,
        needlemanwunsch(o1.organization_name, o2.organization_name)
 from organizations o1
   left join organizations o2 on o2.organization_id != o1.organization_id 
                             and o1.organization_name ~#~ o2.organization_name
where o1.organization_id in( 4088, 3935, 3700, 3652)

[SELECT - 0 rows, 1.138 secs]  [Code: 0, SQL State: 08006]  An I/O error occurred while sending to the backend.

Here is a query using gram that works fine:

select o1.organization_id, o1.organization_name,
       o2.organization_id, o2.organization_name,
        qgram(o1.organization_name, o2.organization_name)
 from organizations o1
   left join organizations o2 on o2.organization_id != o1.organization_id 
                             and o1.organization_name ~~~ o2.organization_name
where o1.organization_id in( 4088, 3935, 3700, 3652)

organization_id organization_name               organization_id organization_name                qgram              
--------------- ------------------------------- --------------- -------------------------------- ------------------ 
3652            U. of Arizona                   (null)          (null)                           (null)             
3700            Rensselaer Polytechnic Insitute 330             Rensselaer Polytechnic Institute 0.9253731343283582 
3935            Shiv Nadar University           (null)          (null)                           (null)             
4088            University of Pittsbugh         586             University of Pittsburgh         0.9019607843137255 
4088            University of Pittsbugh         2185            University of Pisa               0.7111111111111111

The server info is:

PostgresSQL 9.3.14 on x86_64-unknonwn-linux-gnu, compiled by gcc (Debian 4.9.2.-10) 4.9.2, 64-bit

gin problems

It appears that the gin opclasses don't guarantee correct results:

joe=# create index aix on foo using gin(a gin_similarity_ops);
CREATE INDEX
joe=# select a, b, lev(a,b) from foo, bar where a ~== b;
             a             |          b           |        lev        
---------------------------+----------------------+-------------------
 Euler Taveira de Oliveira | Euler T. de Oliveira |              0.76
 Euler                     | Euller               | 0.833333333333333
(2 rows)

joe=# explain select a, b, lev(a,b) from foo, bar where a ~== b;
                           QUERY PLAN                           
----------------------------------------------------------------
 Nested Loop  (cost=0.00..122.43 rows=7 width=64)
   Join Filter: (foo.a ~== bar.b)
   ->  Seq Scan on bar  (cost=0.00..23.10 rows=1310 width=32)
   ->  Materialize  (cost=0.00..1.07 rows=5 width=32)
         ->  Seq Scan on foo  (cost=0.00..1.05 rows=5 width=32)
(5 rows)

joe=# show enable_seqscan;
 enable_seqscan 
----------------
 on
(1 row)

joe=# set enable_seqscan = 'off';
SET
joe=# explain select a, b, lev(a,b) from foo, bar where a ~== b;
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Nested Loop  (cost=10000000000.01..10000005300.97 rows=7 width=64)
   ->  Seq Scan on bar  (cost=10000000000.00..10000000023.10 rows=1310 width=32)
   ->  Bitmap Heap Scan on foo  (cost=0.01..4.02 rows=1 width=32)
         Recheck Cond: (a ~== bar.b)
         ->  Bitmap Index Scan on aix  (cost=0.00..0.01 rows=1 width=0)
               Index Cond: (a ~== bar.b)
(6 rows)

joe=# select a, b, lev(a,b) from foo, bar where a ~== b;
             a             |          b           | lev  
---------------------------+----------------------+------
 Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76
(1 row)

joe=# set enable_seqscan = 'on';
SET
joe=# select a, b, lev(a,b) from foo, bar where a ~== b;
             a             |          b           |        lev        
---------------------------+----------------------+-------------------
 Euler Taveira de Oliveira | Euler T. de Oliveira |              0.76
 Euler                     | Euller               | 0.833333333333333
(2 rows)

joe=# \q
(joe@3fac) pg_similarity >

pg_config editting

It's not possible to edit the pg_config but you can edit the Makefile.global where the values are.

overlapcoefficient causes segment fault if either parameter is an empty string

If either parameter passed to overlapcoefficient is an empty string rather than null, a segment fault is raised. I can repeat this with the extension installed on Amazon Linux 2012-09 running Postgresql 9.2. I compiled the pg_similarity code from c12cc00.

Here's an example:

values(overlapcoefficient('hello',''))

CREATE EXTENSION syntax error

This line..

CREATE EXTENSION pg_similarity FROM unpackaged;

failed for me with...

ERROR:  function block(text, text) does not exist

removing the "FROM unpackaged" part and just running...

CREATE EXTENSION pg_similarity;

seemed to solve it.

Unable to install the extension in rhel7 and postgres version is Postgres13.3

Hi Team

I downloaded the file from the git and tried to execute the following command.
the PATH env looks like below:
$ echo $PATH
/usr/pgsql-13/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:
When I try to run make install I get the below error
Make Version is this
$ make --version
GNU Make 3.82
Built for x86_64-redhat-linux-gnu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
When I try ti install by giving the command
$ make install
Makefile:21: ../../src/Makefile.global: No such file or directory
Makefile:22: /contrib/contrib-global.mk: No such file or directory
make: *** No rule to make target `/contrib/contrib-global.mk'. Stop.
I am getting the above error.
Please let me know am I missing anything.

docs don't mention modifying the conf file

I tried to blindly follow the installation instructions at http://pgsimilarity.projects.postgresql.org/ and met up with two frustrations:

a handful of errors from the install scripts
failure of the show and set examples in the install

Solution was to copy pg_similarity.conf.sample to the PGDATA directory as pg_similarity.conf and add a line to postgresql.conf:

include 'pg_similarity.conf'

Then reinstalled and all was well. Not sure if the solution is to extend the Makefile or/and fix the docs.

Example of how to add Indexes

Can you please add an example of how to add indexes and what the use_index means ?
Is adding the indexes and the operators possible on a JSONB column?

Installation on Postgresql 9.5 - pg_similarity.control: No such file or directory

Ubuntu 16.04 and Postgreqsl 9.5.

I've been trying to install pg_similarity, but run into the following error:

ERROR: could not open extension control file "/usr/share/postgresql/9.5/extension/pg_similarity.control": No such file or directory

Steps I followed:

$ ls
block.c    dice.c       jaccard.c      Makefile      needlemanwunsch.c     qgram.c       smithwaterman.c       test2.sql  tokenizer.c
COPYRIGHT  euclidean.c  jaro.c         matching.c    overlap.c             similarity.c  smithwatermangotoh.c  test.sql   tokenizer.h
cosine.c   hamming.c    levenshtein.c  mongeelkan.c  pg_similarity.sql.in  similarity.h  soundex.c             TODO       uninstall_pg_similarity.sql

➜  $ USE_PGXS=1 make

Output: https://pastebin.com/z84DZh8e

➜ $ ls
block.c    dice.c       hamming.o  levenshtein.c  mongeelkan.c       overlap.o             qgram.o          smithwatermangotoh.c  test.sql     uninstall_pg_similarity.sql
block.o    dice.o       jaccard.c  levenshtein.o  mongeelkan.o       pg_similarity.so      similarity.c     smithwatermangotoh.o  TODO
COPYRIGHT  euclidean.c  jaccard.o  Makefile       needlemanwunsch.c  pg_similarity.sql     similarity.h     smithwaterman.o       tokenizer.c
cosine.c   euclidean.o  jaro.c     matching.c     needlemanwunsch.o  pg_similarity.sql.in  similarity.o     soundex.c             tokenizer.h
cosine.o   hamming.c    jaro.o     matching.o     overlap.c          qgram.c               smithwaterman.c  test2.sql             tokenizer.o

➜  $ sudo USE_PGXS=1 make install
/bin/mkdir -p '/usr/lib/postgresql/9.5/lib'
/bin/mkdir -p '/usr/share/postgresql/9.5/contrib'
/usr/bin/install -c -m 755  pg_similarity.so '/usr/lib/postgresql/9.5/lib/pg_similarity.so'
/usr/bin/install -c -m 644 .//uninstall_pg_similarity.sql pg_similarity.sql '/usr/share/postgresql/9.5/contrib/'

➜  $ psql mydb
psql (9.5.6)
Type "help" for help.

mydb=# CREATE EXTENSION pg_similarity;
ERROR:  could not open extension control file "/usr/share/postgresql/9.5/extension/pg_similarity.control": No such file or directory
mydb=#

What am I missing?

Querying a table with jaccard similarity with 1.6 million records take 12 seconds

Hi Team

We have encountered a problem in our testing environment.
I have a scenario where I am running a similarity match for an address
I have created a table with following number of records
1603423
And I have created GIN index (since i am using pg_similarity) library jaccard similarity method
when I run the Explain analyze
EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad 201017' order by qsim DESC;
QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual time=12101.194..12101.197 rows=6 loops=1)
Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)) DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93) (actual time=3516.233..12101.172 rows=6 loops=1)
Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)
Rows Removed by Index Recheck: 1039186
Heap Blocks: exact=58018
-> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02 rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1)
Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)
Planning Time: 0.141 ms
Execution Time: 12101.245 ms
(11 rows)
it took 12 seconds
following are my pgconf file settings
shared buffer as 4GB
work_mem 256 MB
maintenence_work_mem 512MB
autovacuum_work_mem 20MB
My index definition is this "address_complete_address_idx1" gin (complete_address gin_similarity_ops)

It is taking the index correctly. But why it took 12 seconds to process I really don't understand.

Please help.

Thanks
C.R.Bala

Possible Bug in Dice (and Jaccard) Function

Hi,

I tried running the dice function on the sample string

select('euler', 'heuser')

and I got 0 instead of 0.444... as specified in the example.

Using another example

select dice('hello there not', 'hello friend what')

It seems that the dice function computes over the words and not the n-grams specified in the comments. Is this the expected behavior?

Wrong loop variable in levenshtein.c

There is a bug in the _lev() function: Near the end of the function, a loop is looping alen times, but it should actually be blen (since that's the length of the arow[] array).

As far as I can tell, the fix is a one letter change from alen to blen.

Let me know if you'd like a PR - I didn't see any contributing docs so was unsure how you like to take patches.

New feature: compare array of ids

Hi,

Can you implement feature that would allow to test similarity between two sets of integers? For example to compare tags of two objects?

select pg_similarity.jaccard_threshold(0.7);
select * from objects where tags ARRAY[1,2,3,4]

As far as I understand Jaccard Coefficient is used for similarity between sets of numbers?

Thank you and have a nice day.

Trying to get pg_similarity extension on windows

Can you please help me to get code for pg_similarity.dll?

Large number of errors when compiling on CentOS

Hi eulerto,

thanks first of all for the cool library. I am having trouble compiling it on CentOS 6.4. Please see the following output:

[gpadmin@gpdbsne pg_similarity]$ USE_PGXS=1 make
sed 's,MODULE_PATHNAME,$libdir/pg_similarity,g' pg_similarity.sql.in >pg_similarity.sql
gcc -m64 -O3 -funroll-loops -fargument-noalias-global -fno-omit-frame-pointer -g -finline-limit=1800 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -I/data/espine1/dev/tools/curl/7.21.7/dist/rhel5_x86_64/include -Werror -fpic -I. -I/usr/local/greenplum-db-4.2.5.1/include/postgresql/server -I/usr/local/greenplum-db-4.2.5.1/include/postgresql/internal -D_GNU_SOURCE -I/data/home/build/builds/greenplum-db/Release-4_2_5_1-build-1_rc_042513-1306/Release-4_2_5_1-build-1_src/ext/rhel5_x86_64/include -I/data/home/build/builds/greenplum-db/Release-4_2_5_1-build-1_rc_042513-1306/Release-4_2_5_1-build-1_src/ext/rhel5_x86_64/include/libxml2 -c -o tokenizer.o tokenizer.c
gcc -m64 -O3 -funroll-loops -fargument-noalias-global -fno-omit-frame-pointer -g -finline-limit=1800 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -I/data/espine1/dev/tools/curl/7.21.7/dist/rhel5_x86_64/include -Werror -fpic -I. -I/usr/local/greenplum-db-4.2.5.1/include/postgresql/server -I/usr/local/greenplum-db-4.2.5.1/include/postgresql/internal -D_GNU_SOURCE -I/data/home/build/builds/greenplum-db/Release-4_2_5_1-build-1_rc_042513-1306/Release-4_2_5_1-build-1_src/ext/rhel5_x86_64/include -I/data/home/build/builds/greenplum-db/Release-4_2_5_1-build-1_rc_042513-1306/Release-4_2_5_1-build-1_src/ext/rhel5_x86_64/include/libxml2 -c -o similarity.o similarity.c
similarity.c: In function ‘_PG_init’:
similarity.c:135: error: array type has incomplete element type
similarity.c:142: error: array type has incomplete element type
cc1: warnings being treated as errors
similarity.c:148: error: implicit declaration of function ‘DefineCustomEnumVariable’
similarity.c:174: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:174: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:186: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:186: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:215: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:215: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:227: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:227: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:256: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:256: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:268: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:268: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:297: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:297: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:309: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:309: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:325: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:325: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:337: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:337: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:366: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:366: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:378: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:378: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:394: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:394: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:406: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:406: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:422: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:422: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:434: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:434: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:450: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:450: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:462: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:462: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:491: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:491: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:503: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:503: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:532: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:532: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:544: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:544: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:560: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:560: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:572: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:572: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:586: error: large integer implicitly truncated to unsigned type
similarity.c:586: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:586: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:615: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:615: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:627: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:627: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:656: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:656: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:668: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:668: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:684: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:684: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:696: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:696: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:712: error: passing argument 8 of ‘DefineCustomRealVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:359: note: expected ‘GucRealAssignHook’ but argument is of type ‘int’
similarity.c:712: error: too many arguments to function ‘DefineCustomRealVariable’
similarity.c:724: error: passing argument 6 of ‘DefineCustomBoolVariable’ makes pointer from integer without a cast
/usr/local/greenplum-db-4.2.5.1/include/postgresql/server/utils/guc.h:339: note: expected ‘GucBoolAssignHook’ but argument is of type ‘int’
similarity.c:724: error: too many arguments to function ‘DefineCustomBoolVariable’
similarity.c:142: error: unused variable ‘pgs_gram_options’
similarity.c:135: error: unused variable ‘pgs_tokenizer_options’
make: *** [similarity.o] Error 1

Any help would be greatly appreciated!

Thanks and best regards
Ronert

hamming_text function not created

It looks like I have most of the functions. hamming_text() is not created however. I have the hamming() and hamming_op(). I'm using 8.4 on ubuntu 10.04 server LTS

Empty strings with ~*~ crashes server

Brief

Empty strings with ~*~ crashes server

Repro

SELECT '' ~*~ '';

SQL Output

db=> SELECT version();
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------
version | PostgreSQL 13.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6), 64-bit

db=> SELECT '' ~*~ '';
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

could not load library

ERROR: could not load library "/usr/lib/postgresql/10/lib/pg_similarity.so": /usr/lib/postgresql/10/lib/pg_similarity.so: undefined symbol: Float8GetDatum

i got this error, how to fix it

Comparing timestamps

Hi,

Is there a good way to compare similarities between timestamps with pg_similarity? E.g.:

Table A

City	Arrival Time
Los Angeles	2019-03-05 11:00:00.00000+00
Hong Kong	2019-03-07 17:00:00.00000+00

Table B

City	Arrival Time
Los Angeles	2019-03-06 13:00:00.00000+00
Los Angeles	2019-03-01 11:00:00.00000+00

When joining Table A and Table B I would like to return only the rows which are most similar, in this case the rows Los Angeles with arrival dates on 2019-03-05 and 2019-03-06 are the closest matches.

Problema na Instalação do PG_SIMILARITY

Euler, estou tendo problemas para instalar o pg_similarity no meu Ubuntu sobre uma VM. O problema ocorre a partir da linha do Makefile. Gostaria de saber se você tem algum outro 'how to install' atualizado ou se você sabe os pontos principais que devem ser atualizados no Makefile.

Obrigado.

Native MSVC build support

A recent stack overflow poster showed interest in using this extension on Windows.

I'm sure you find Windows as fun as I do ;-) . Nonetheless, I noticed you had instructions on how to build for MSVC. There's actually a much simpler way to do it if you want to, by providing or generating MSBuild project file. I wrote some details about it in a recent blog post:

Compiling PostgreSQL extensions with MSVC.

It's a relatively non-awful process, for Windows development, and doesn't require a full PostgreSQL source tree to be configured and built.

I'm not asking you to do anything. I just thought I'd mention the option in case you - or someone who has a reason to bother to contribute Windows support - is interested.

Hamming does not support bytea type

Unfortunately there's a) no good way to convert between postgres's bytea type (variable length byte strings) and varbit (variable length bit strings) due to well, read the whole thread here: http://archives.postgresql.org/pgsql-patches/2006-05/msg00041.php

mostly seems to be endianness/arguments that byte strings and bitstrings aren't really the same thing, yadda, yadda,

Anyway, I'm gonna try to write a hamming_bytea function and submit a pull request, but we'll see how far I get before I realize I have no idea what I'm doing. Anyway, looks like you're treating "text" as a bytestring anyway, so I should be able to copy/paste that code somehow, or something, maybe.