matthewfranglen / postgres-elasticsearch-fdw Goto Github PK
View Code? Open in Web Editor NEWPostgres to Elastic Search Foreign Data Wrapper
License: MIT License
Postgres to Elastic Search Foreign Data Wrapper
License: MIT License
We are modernizing Multicorn2 the support python 3.9+ and Postgres 12+. Perhaps this project could do the same? Dropping support for old Postgres versions and old python versions will make it sensible to continue moving this valuable project forward to support thru current versions of Postgres and ElasticSearch
Hi, again me :)
I think the main goal of this extension is to synchronize an ES index to a Postgres database.
The only option for now is to filter thanks to the URI search in ES.
If you have to synchronize regularly your ES index database, you need to only load new documents since the last import (if the ES documents are immutable such as logs). Personally, to do a delta import this way, I do that:
-- it loads every new documents since the last import.
-- it uses Query String Syntax. Do note that date filter is a ">" (as expected) and not ">=" due to the character "{" in the query string
--
-- @see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
-- @see https://lucene.apache.org/solr/guide/6_6/working-with-dates.html
--
-- Do not try to modify it with CTE or put subquery from the WHERE clause in the FROM clause.
-- Otherwise, the optimizer tries to load all data from ES and then applies the WHERE clause on the timestamp on Postgres side.
insert into access_log
select
*
from es_access_log
where query = (
select
CONCAT(
'log_timestamp:{',
to_char(MAX(log_timestamp), 'YYYY-MM-DD"T"HH24:MI:SS"Z"'),
' TO *}'
)
from access_log
);
The problem is that you can have logs in the same second interval that are not loaded. Generally, when you want to do such export, you export with a deterministic order as for a seek pagination
Whith pseudo SQL code, the filter would be:
WHERE log_timestamp >= max_log_timestamp and id > id_for_max_log_timestamp
ORDER BY id
Adding sort capability (possible in the URI search from ES) would allow to export accurately your data for such usecase.
Hi Matthew,
I like the work you are doing and am interested in possible sponsorship. Please contact me at [email protected]
--Luss
I create two tables with one-many relationship, and I want to use a 'join field' to handle this. But I've got a RequestError(400, 'mapper_parsing_exception', 'failed to parse') issue when I insert a child data. Could this extension support the 'join field' feature?
Will you consider to change your doc?
Hi,
Are fields starting with an underscore in ElasticSearch, like _index or _source, supported? Because they showed empty whe I do a select in the foreing table of PostgreSQL, and they have information in Elasticsearch.
In the foreign table definition, I tried with index, _index, and "_index" without any luck.
Thanks in advance!
Hi. Thanks for your amazing work!
A question: are capital letter fields in ElasticSearch, like client_source_IP and client_source_Port, supported? Because they showed empty, and they have information in Elasticsearch.
Also, te correct type for an ip datatype of Elasticsearch is INET in PostgreSQL, right? And for the port, that in Elasticsearch is a long datatype, I'm using BIGINT in PostgreSQL, and is empty too. Is this correct?
Thanks!
Hello,
did this fdw also works with OpenSearch?
Regrads
Lutz
When the field is defined as an array type, the insert succeeds, but an error is thrown on query.
CREATE FOREIGN TABLE "public"."_es_spt_product" ( "id" varchar(20) COLLATE "pg_catalog"."default", "product_id" varchar(20) COLLATE "pg_catalog"."default", "product_name" varchar(100) COLLATE "pg_catalog"."default", "cate_ids" text[] ) SERVER "multicorn_es" OPTIONS ( "host" '127.0.0.1', "port" '9200', "index" 'spt_product', "rowid_column" 'id', "query_dsl" 'true', "default_sort" 'product_id', "refresh" 'false', "timeout" '20', "username" 'elastic', "password" 'elastic' );
INSERT INTO "_es_spt_product" VALUES ( '194805', '194805', 'TIMBUK2', '{7JIA2_CATE_04,7JIA2_CATE_0402,7JIA2_CATE_040203}')
"hits" : [ { "_index" : "spt_product", "_type" : "_doc", "_id" : "194805", "_score" : 1.0, "_source" : { "product_id" : "194805", "product_name" : "TIMBUK2", "cate_ids" : [ "7JIA2_CATE_04", "7JIA2_CATE_0402", "7JIA2_CATE_040203" ] } }
SELECT * FROM "_es_spt_product"
ERROR: malformed array literal: "["7JIA2_CATE_04", "7JIA2_CATE_0401", "7JIA2_CATE_040101"]"
DETAIL: "[" must introduce explicitly-specified array dimensions.
Hello,
I was trying to perform the query from Postgres to elasticsearch using "query" as shown above. But it was throwing me an error saying...
postgres=# select * from articles_es where query = '_id:1';
ERROR: COUNT for /index1 failed: Expecting value: line 1 column 1 (char 0)
Here are the details of how I created the foreign table (providing info on a high level):
CREATE FOREIGN TABLE articles_es
(
id TEXT,
query TEXT,
....
)
SERVER multicorn_es
OPTIONS
(
host 'elasticsearch',
port '9200',
index 'article-index',
type 'article',
rowid_column 'id',
query_column 'query',
query_dsl 'false',
refresh 'false',
complete_returning 'false',
timeout '20',
username 'elastic',
password 'elastic'
)
;
Any suggestions, please.
For posterity, I was able to get this to "work" with PostgreSQL 10 and Elasticsearch 6 by pinning the version of the elasticsearch
package dependency (in setup.py
, and requirements.txt
for consistency). Without pinning the version, elasticsearch
v7 is installed, which only works with Elasticsearch 7+.
The Dockerfile I used:
FROM postgres:10-alpine
RUN apk add --no-cache \
python2 \
python2-dev \
py2-pip
RUN apk add --no-cache build-base
RUN pip install pgxnclient
RUN pgxn install multicorn
RUN pip install pg_es_fdw
ENTRYPOINT ["docker-entrypoint.sh"]
EXPOSE 5432
CMD ["postgres"]
I haven't run the full tests yet, but simple insert
, delete
, and select
are working for me.
The documentation is pretty good, but I think it could be tweaked to make it even more clear up-front that quite likely you will have to accommodate your particular version Elasticsearch in addition to PostgreSQL and the most recent Multicorn that goes with your version of PostgreSQL.
Along with this, it would be better if the repo was configured to work with a particular set of versions, and those versions were clearly spelled out.
BTW, the current version of Multicorn (1.3.5 from 2018) is compatible with PostgreSQL 10 but not PostgreSQL 11.
when i try to insert into chinese,i got zhe flowing error:
postgres@linux-dtq3:/home/fdw/postgres-elasticsearch-fdw-master> 2021-03-05 05:37:13.358 EST [1705] 错误: INDEX for /test-index/doc/1 and document {'istrue': u't', 'testjson': {}, 'createdate': datetime.date(2021, 3, 5), 'time': u'11:27:00', 'flag': u'111', 'num': 666L, 'testbytea': '\xe6\xb5\x8b\xe8\xaf\x95', 'address': u'\u5317\u4eac\u5e02\u4e30\u53f0\u533a', 'price2': u'66.666', 'message': u'\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5', 'datetime': datetime.datetime(2021, 3, 5, 12, 22, 54), 'price': 11.01} failed: ({'istrue': u't', 'testjson': {}, 'createdate': datetime.date(2021, 3, 5), 'time': u'11:27:00', 'flag': u'111', 'num': 666L, 'testbytea': '\xe6\xb5\x8b\xe8\xaf\x95', 'address': u'\u5317\u4eac\u5e02\u4e30\u53f0\u533a', 'price2': u'66.666', 'message': u'\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5', 'datetime': datetime.datetime(2021, 3, 5, 12, 22, 54), 'price': 11.01}, UnicodeDecodeError('ascii', '"\xe6\xb5\x8b\xe8\xaf\x95"', 1, 2, 'ordinal not in range(128)'))
2021-03-05 05:37:13.358 EST [1705] 语句: insert into test_es(id,num,flag,istrue,createdate,price,price2,datetime,address,message,testbytea,testjson,time)
values(1,666,'111'::"bit",true,'2021-03-05',11.01,66.666,'2021-03-05 12:22:54','北京市丰台区','测试字段测试字段测试字段测试字段','测试'::bytea,'{}','11:27')
geometry field don‘t support?
[created_stamp] field is Timestamptz type, an error occurred when inserting data:
failed: RequestError(400, 'mapper_parsing_exception', "failed to parse field [created_stamp] of type [date] in document with id '1111111111'. Preview of field's value: '2018-05-28 19:00:15.359+08'")
Hi,
Want to know whether "pg_es_fdw" supports SSL authentication.
I have an Elasticsearch cluster setup with SSL authentication. When I tried to insert data into elasticsearch after the foreign table creation step, I see below authentication error:
postgres=# INSERT INTO articles_es1
(
id,
title,
body,
metadata
)
VALUES
(
1,
'foo',
'spike',
'{"score": 3}'::json
);
ERROR: INDEX for /article-index/1 and document {'title': 'foo', 'body': 'spike', 'metadata': {'score': 3}, 'query': None, 'score': None, 'sort': None} failed: ConnectionError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))) caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))
Moreover, I tried with "SSL" option too by providing the pem file location as shown below, but no luck:
CREATE FOREIGN TABLE articles_es1
(
id BIGINT,
title TEXT,
body TEXT,
metadata JSON,
query TEXT,
score NUMERIC,
sort TEXT
)
SERVER multicorn_es
OPTIONS
(
host '192.168.1.194',
port '9200',
index 'article-index',
rowid_column 'id',
query_column 'query',
query_dsl 'false',
score_column 'score',
default_sort 'last_updated:desc',
sort_column 'sort',
refresh 'false',
complete_returning 'false',
timeout '20',
username 'admin',
password 'admin',
ssl '/opt/root-ca.pem'
)
;
Thanks in advance !!
Hello,
It seems something has changed with the python3 or python3-pip packages, as meanwhile they implemented "externally managed" base environments.
https://peps.python.org/pep-0668/#specification
As a result, the docker image fails Dockerfile PG15 at line PYTHON_OVERRIDE=python3 make install, more specifically when it tries to run pip3 install .
I tried fixing it by myself, I also tried to install an earlier version of python3 but it seems they are not available anymore, and I am not very good at installing packages in linux.
I would greatly appreciate it if you could offer me a solution and maybe update the dockerfile.
Thank you,
Mihai
there is an issue with doc/example ... "body" and "content" that are mixed.
replace content by body and everything will be fine
ty
Hi,
Thank you for your very helpful project!
Currently, I’m using an authentication system passing identifiers using headers in the request to ElasticSearch.
Would it be possible to customize headers in the request made to ElasticSearch please ?
Thank you for your help,
Matthiosso
https://wiki.postgresql.org/wiki/Foreign_data_wrappers#Big_Data_Wrappers
This wrapper supports 12 now. I need to update that page. There is a cool-off period after creating an account 😸
I put 100,000 rows data to elasticsearch,when I select 1 row by id through foreign table,it took 222 second, I wander is it fetch all data from elasticsearch to postgres and then filter the data?
Looks like the auth header is http_auth for versions 6 and 7 and basic_auth for 8.
Hi!
As PostgreSQL 13 offers significant improvements in performance and usability, want to know whether any plan to support pgsql-13 version in the future.
Thank you very much!
I'm trying to test this wrapper with AWS Elastic Search
My Elastic Object
{
"_index": "twitter",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"User": "olivere",
"Message": "Take Five",
"Retweets": 0
}
}
My foreing table
CREATE FOREIGN TABLE articles_es
(
"User" TEXT,
Message TEXT,
Retweets NUMERIC
)
SERVER multicorn_es
OPTIONS
(
host 'https://vpc-my-elastic-jxd4u4p3jwj4sl00ydrardrt7i.us-east-1.es.amazonaws.com',
port '80',
index 'twitter',
type '',
rowid_column 'id',
query_column 'query',
query_dsl 'false',
score_column 'score',
default_sort 'last_updated:desc',
sort_column 'sort',
refresh 'false',
complete_returning 'false',
timeout '20',
username '',
password ''
)
;
I got this error
SQL Error [XX000]: ERROR: COUNT for /twitter failed: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f5373493750>: Failed to establish a new connection: [Errno -2] Name or service not known) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f5373493750>: Failed to establish a new connection: [Errno -2] Name or service not known)
With elasticsearch==7.16.0 I have this error:
ERROR: SEARCH for /index-name failed: RequestError(400, 'search_phase_execution_exception', 'No mapping found for [] in order to sort on')
With previous elasticsearch==7.14.2 all work correctly.
Is this compatible?
self.client = Elasticsearch(
[
{
"host": options.get("host", "localhost"),
"port": int(options.get("port", "9200")),
}
],timeout=int(options.get("timeout", "10"))
)
Hello,
Is it possible to cast an array as json easily with the extension? Same thing for json objects.
As far I know and by reading the Python code of the extension, it casts it as a Python string.
For example, the ES document:
{
"browser_name": "Chrome",
"user_os_name": "Mac OS X",
"email_domains": ["example.com", "example2.com"],
"geoip": {
"ip": "203.168.53.187",
"latitude": -37.734,
"location": {
"lat": -37.734,
"lon": 144.7378
}
}
}
The foreign table definition (JSON for geoip
or domain_names
does not work):
CREATE FOREIGN TABLE es_access_log
(
id TEXT,
query TEXT,
geoip TEXT,
browser_name TEXT,
user_os_name TEXT
domain_names TEXT
)
SERVER multicorn_es
OPTIONS
(
host 'elasticsearch',
port '9200',
index 'logstash-nginx',
rowid_column 'id',
query_column 'query',
score_column 'score',
timeout '20',
scroll_size '1000',
duration '30m'
)
;
select email_domains from my_foreign_table;
[u'example.com', u'example2.com']
select geoip from my_foreign_table;
"ip"=>"203.168.53.187", "latitude"=>-37.734, "location"=>"lat"=>-37.734, "lon"=>144.7378
What I would like is to declare it as json
or jsonb
in order to be cast automatically by the wrapper.
Did I miss something? Is it supported?
please now test esfdw on pg15.
Good morning,
I would like to query ElasticSearch using a query body defined in Postgres.
I am not able to pass the query body as a JSON using the query parameter.
I have one index containing all my data in ElasticSearch target-index
With the mapping:
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text"
},
"city": {
"type": "text"
}
}
}
containing the data
id | name | city |
---|---|---|
1 | Paul Ogier | New York, NY |
2 | Joe Scott | Boston, MA |
3 | Alice Sharp | Boston, MA |
One table
id | name | city | q_body |
---|---|---|---|
1 | Paul | New York | {JSON query: see below |
2 | Joe | Boston | JSON Query |
3 | Alice | Boston | Json query |
See the example of the JSON query for row 1: It's a multi-match query
{"query" :
{"bool" :
{"should":
[
{"match": {"name": {"query": "Paul", "fuzziness": 1}}},
{"match": {"city": {"query": "New York", "fuzziness": 1}}}
]
}
},
"explain" : true, "size" : 20}
CREATE FOREIGN TABLE es_fdw
(
id BIGINT,
name varchar,
query JSON,
score NUMERIC
)
WITH mypgquery AS (SELECT pg_id, q_body FROM mypgtable )
SELECT a.pg_id, id as es_id, name, score
FROM es_fdw, a
WHERE query = mypgquery.q_body LIMIT 10;
pg_id | es_id | name | score |
---|---|---|---|
1 | 1 | Paul Ogier | 10 |
2 | 2 | Json Scott | 10 |
2 | 3 | Alice Sharp | 3 |
3 | 3 | Alice Sharp | 10 |
3 | 2 | Json Scott | 3 |
ERROR: operator does not exist: json = json
I cannot pass a query parameter as json
First, thank you for this awesome project ! Easy to set up and works very well.
I use RETURNING *
almost every time I do an INSERT
in postgres to retrieve the real inserted data (e.g: with default values).
When using it on a foreign table using this foreign data wrapper, I get an empty row.
Is it the expected behavior? If not, even if I am very new to FDW and python, I could try to fix it.
Originally reported in #22
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.