stac-utils / pgstac
Schema, functions, and a Python library for storing and accessing STAC collections and items in PostgreSQL.
License: MIT License
I noticed that the latest version of pypgstac on PyPI is 0.2.4, the main branch has a version of 0.2.3, and there are no tags for all released versions.
We should set up GitHub Actions to do the publishing, much like we are doing in pystac. That way releases happen on each GitHub release, and the Python package is published by CI against tagged versions. This will require the stac-utils user to be made a maintainer on the pypgstac PyPI project.
The data upload functionality assumes the data is temporally dense and creates datetime partitions that are one week long.
Some datasets (e.g., CMIP6) are sparsely populated over a very large period of time (e.g., 20 files per month spanning 150 years). This leads to about 1800 partitions with about 20 files each (the monthly data is timestamped to the 1st of the month, so only the partition for the first week of the month is created).
I haven't benchmarked the search functionality with this many partitions, but after a discussion today with @bitner, I wanted to log an issue for implementing arbitrary partitions, since it seems like a valuable feature anyway.
The STAC API will contain both the CMIP6 dataset and other datasets that are much more temporally dense and with shorter temporal extents. The preferred solution in this case would be the ability to make large partitions for the sparsely populated time ranges and smaller partitions for the densely populated ones (e.g., 1- or 5-year partitions from 1950-1999, week or month partitions from 2000-2030, and then 1- or 5-year partitions again from 2031-2100).
This could be a function that is executed manually before uploading the data (as opposed to dynamically calculating the partition size during upload); a sketch of the idea is below.
A further improvement could be a function to further split up partitions (or even recombine them) as data is added to or removed from the STAC database.
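For illustration, a minimal sketch of what manually pre-creating coarse partitions could look like, using native PostgreSQL range partitioning via asyncpg. The table name, partition naming, and the assumption that items partitions can be created this way are all hypothetical; pgstac currently manages its weekly partitions through pg_partman:

```python
# Hypothetical sketch: pre-create one partition per decade for the
# sparse 1950-1999 range before uploading data. Table and partition
# names are illustrative, not pgstac's actual API.
import asyncio
import asyncpg

async def create_coarse_partitions(dsn: str) -> None:
    conn = await asyncpg.connect(dsn)
    try:
        for start in range(1950, 2000, 10):
            await conn.execute(
                f"""
                CREATE TABLE IF NOT EXISTS pgstac.items_{start}s
                PARTITION OF pgstac.items
                FOR VALUES FROM ('{start}-01-01') TO ('{start + 10}-01-01');
                """
            )
    finally:
        await conn.close()

asyncio.run(create_coarse_partitions("postgresql://user:pass@localhost/postgis"))
```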
https://github.com/stac-utils/pgstac/blob/main/sql/004_search.sql#L561
This field doesn't match the STAC API response and is extraneous; I think we should remove it to avoid confusion and stick to standard fields.
The code to add the temporal component to queries with pagination is not appropriately intersecting the token datetime with the query datetime.
Tried psql -f pgstac.sql on an empty PostgreSQL 12 GCP Cloud SQL database at commit d60d7e8 and got some errors, especially:
psql:sql/003_items.sql:269: NOTICE: trigger "items_stmt_trigger" for relation "items" does not exist, skipping
psql:sql/004_search.sql:191: ERROR: return type mismatch in function declared to return box3d
Here's the full output:
BEGIN
CREATE EXTENSION
CREATE SCHEMA
CREATE EXTENSION
CREATE SCHEMA
SET
CREATE TABLE
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
SET
CREATE TABLE
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
DELETE 0
create_parent
---------------
t
(1 row)
CREATE FUNCTION
CREATE FUNCTION
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE TYPE
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
psql:sql/003_items.sql:269: NOTICE: trigger "items_stmt_trigger" for relation "items" does not exist, skipping
DROP TRIGGER
CREATE TRIGGER
CREATE VIEW
CREATE VIEW
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
SET
DROP VIEW
CREATE VIEW
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
psql:sql/004_search.sql:191: ERROR: return type mismatch in function declared to return box3d
DETAIL: Actual return type is geometry.
CONTEXT: SQL function "bbox_geom"
psql:sql/004_search.sql:195: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:199: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:259: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:279: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:288: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:330: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:361: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:371: ERROR: current transaction is aborted, commands ignored until end of transaction block
psql:sql/004_search.sql:595: ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACK
I had an old version of pgstac running, which I have just upgraded to the latest version. Since then, I've found that some of the search queries I've run have returned incorrect results. I actually found the incorrect results when querying using stac_fastapi, but I've checked by running queries using the underlying pgstac SQL functions and found that they return incorrect results, which are then simply displayed by stac_fastapi.
For example, when running the query:
SELECT * FROM pgstac.search('{"collection":"test-bec-4"}'::jsonb)
I get a JSON result which starts like this:
{
"next": "N51E000.tif",
"prev": null,
"type": "FeatureCollection",
"context": {
"limit": 10,
"returned": 10
},
"features": [
{
"id": "core-4",
"bbox": [
35,
19.3,
35,
19.3
],
"type": "Feature",
"links": [
{
"rel": "self",
"href": "http://localhost/test.tif",
"type": "application/json"
}
],
"assets": {},
"geometry": {
"type": "Point",
"coordinates": [
35,
19.3
]
},
"collection": "test-anglo-cores-3",
"properties": {
"datetime": "2020-02-08T11:23:00Z",
"core-type": "shallow",
"last-reviewed": "2021-03-19"
},
...
In that returned JSON, it lists the collection as test-anglo-cores-3, not test-bec-4.
Running SELECT * FROM items WHERE collection_id = 'test-bec-4' gives me 100 results, which is the correct number, and shows that there are items in the database with that collection ID. I've also checked the content field of the items table, and in there the collection is also specified correctly as test-bec-4.
I have no idea how to go forward with debugging this, as my SQL skills are quite limited. Any help would be very much appreciated.
Let me know if you have another recommendation for backing up and restoring the database, but I attempted to use pg_dump and pg_restore and I get this error:
CONTEXT: SQL function "stac_geom" during inlining
....
pg_restore: error: COPY failed for table "items_p2021w37": ERROR: function st_geomfromgeojson(text) does not exist
LINE 5: ST_GeomFromGeoJSON(value->>'geometry')
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:
SELECT
CASE
WHEN value->>'geometry' IS NOT NULL THEN
ST_GeomFromGeoJSON(value->>'geometry')
WHEN value->>'bbox' IS NOT NULL THEN
ST_MakeEnvelope(
(value->'bbox'->>0)::float,
(value->'bbox'->>1)::float,
(value->'bbox'->>2)::float,
(value->'bbox'->>3)::float,
4326
)
ELSE NULL
END as geometry
;
This is happening because pg_dump explicitly sets the search_path to empty, so custom functions (e.g., from the extension or pgstac's functions) without specific schemas set aren't found. I was able to fix this specific error by adding "public" in front of the ST_GeomFromGeoJSON function, but there were many more errors like this.
If the functions called in the pgstac functions had explicit schemas set, I think it would solve the issue, and pg_dump and pg_restore could be used for backups.
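For what it's worth, a minimal reproduction sketch of the mechanism, assuming PostGIS is installed in the public schema (the DSN and geometry are placeholders):

```python
# Emulate pg_restore's empty search_path and show that only the
# schema-qualified call resolves.
import asyncio
import asyncpg

POINT = '{"type":"Point","coordinates":[0,0]}'

async def main(dsn: str) -> None:
    conn = await asyncpg.connect(dsn)
    # pg_dump/pg_restore run with an empty search_path; emulate that.
    await conn.execute("SELECT pg_catalog.set_config('search_path', '', false);")
    try:
        # Unqualified call fails, just like during the restore.
        await conn.fetchval("SELECT ST_AsText(ST_GeomFromGeoJSON($1::text));", POINT)
    except asyncpg.UndefinedFunctionError as exc:
        print("unqualified call failed:", exc)
    # Schema-qualified call resolves regardless of search_path.
    wkt = await conn.fetchval(
        "SELECT public.ST_AsText(public.ST_GeomFromGeoJSON($1::text));", POINT
    )
    print("qualified call ok:", wkt)
    await conn.close()

asyncio.run(main("postgresql://user:pass@localhost/postgis"))
```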
pgstac version: 0.4.3
pypgstac version: 0.4.5
When using load_ndjson with the method loadopt.upsert (https://github.com/NASA-IMPACT/cmr-pgstac-loader/blob/main/lambdas/pgstac_loader/handler.py#L40), records with duplicate ids throw UniqueViolationError. An example full stack trace:
[ERROR] UniqueViolationError: duplicate key value violates unique constraint "items_p2022w11_id_pk"
DETAIL: Key (id)=(HLS.S30.T49TCF.2022073T032539.v2.0) already exists.
Traceback (most recent call last):
File "/var/task/aws_lambda_powertools/middleware_factory/factory.py", line 133, in wrapper
response = middleware()
File "/var/task/aws_lambda_powertools/utilities/data_classes/event_source.py", line 39, in event_source
return handler(data_class(event), context)
File "/var/task/handler.py", line 36, in handler
asyncio.run(
File "/var/lang/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/var/lang/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/var/task/pypgstac/load.py", line 246, in load_ndjson
await load_iterator(f, table, conn, method)
File "/var/task/pypgstac/load.py", line 234, in load_iterator
await copy_upsert(iter, table, conn)
File "/var/task/pypgstac/load.py", line 212, in copy_upsert
await conn.copy_to_table(
File "/var/task/asyncpg/connection.py", line 897, in copy_to_table
return await self._copy_in(copy_stmt, source, timeout)
File "/var/task/asyncpg/connection.py", line 1101, in _copy_in
return await self._protocol.copy_in(
File "asyncpg/protocol/protocol.pyx", line 529, in copy_in
status_msg = await waiter
I can confirm that loadopt.upsert appears to work in the pypgstac tests by modifying https://github.com/stac-utils/pgstac/blob/main/test/testdata/items.ndjson to include an item with a duplicate id and running the pypgstac LoadTest.
Inconsistently, I am also seeing the following error immediately afterwards:
[ERROR] CardinalityViolationError: ON CONFLICT DO UPDATE command cannot affect row a second time
HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
Traceback (most recent call last):
File "/var/task/aws_lambda_powertools/middleware_factory/factory.py", line 133, in wrapper
response = middleware()
File "/var/task/aws_lambda_powertools/utilities/data_classes/event_source.py", line 39, in event_source
return handler(data_class(event), context)
File "/var/task/handler.py", line 36, in handler
asyncio.run(
File "/var/lang/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/var/lang/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/var/task/pypgstac/load.py", line 246, in load_ndjson
await load_iterator(f, table, conn, method)
File "/var/task/pypgstac/load.py", line 234, in load_iterator
await copy_upsert(iter, table, conn)
File "/var/task/pypgstac/load.py", line 212, in copy_upsert
await conn.copy_to_table(
File "/var/task/asyncpg/connection.py", line 897, in copy_to_table
return await self._copy_in(copy_stmt, source, timeout)
File "/var/task/asyncpg/connection.py", line 1101, in _copy_in
return await self._protocol.copy_in(
File "asyncpg/protocol/protocol.pyx", line 529, in copy_in
status_msg = await waiter
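Both errors point at duplicate ids within a single upsert batch: ON CONFLICT DO UPDATE cannot affect the same row twice in one statement. A possible workaround sketch (not part of pypgstac; the file names are placeholders) is to de-duplicate each batch by id before loading, keeping the last occurrence:

```python
# De-duplicate an ndjson batch by item id so a single upsert COPY
# never proposes the same key twice.
import json
from typing import Dict, Iterable, Iterator

def dedupe_ndjson(lines: Iterable[str]) -> Iterator[str]:
    latest: Dict[str, str] = {}
    for line in lines:
        if line.strip():
            latest[json.loads(line)["id"]] = line
    yield from latest.values()

with open("items.ndjson") as src, open("items-deduped.ndjson", "w") as dst:
    for line in dedupe_ndjson(src):
        dst.write(line if line.endswith("\n") else line + "\n")
```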
Right now the official image published is built with https://github.com/stac-utils/pgstac/blob/main/Dockerfile, packaging the pypgstac Python module and postgres (+ the pgstac SQL). I don't think the image will be used both for running migrations/loading items and for starting the database, so IMO we should split the Dockerfile in two.
The latest version of pypgstac (0.4.3) requires asyncpg >= 0.22.0, < 0.23.0. However, support for sslrootcert was only added to asyncpg in 0.24.0. The dependencies need to be updated to at least asyncpg 0.24.0 so that PostgreSQL servers that require a root cert to connect, like Azure PostgreSQL, can be supported.
pgstac/pypgstac/pyproject.toml
Line 18 in 4b58277
asyncio is part of the native Python install, so there is no need to force-install it.
This could cause issues when using pypgstac in AWS Lambda: https://stackoverflow.com/a/61746719
In order to add a bulk insert endpoint, it would be great to have a create_items function:
-- Proposed function: take a jsonb array of items, insert each element
-- as an item row, then create any missing partitions.
CREATE OR REPLACE FUNCTION create_items(datas jsonb) RETURNS VOID AS $$
INSERT INTO items (content) SELECT * FROM jsonb_array_elements(datas);
SELECT backfill_partitions();
$$ LANGUAGE SQL SET SEARCH_PATH TO pgstac, public;
First mentioned in #39 (comment): the collections search doesn't seem to work anymore.
I think I found out what is going on:
# using `collections`
$ postgis=# SELECT * from search('{"collections": ["facebook-population-density"]}');
NOTICE: SEARCH CQL 1: {"collections": ["facebook-population-density"]}
NOTICE: SEARCH CQL 2: {"filter": {}, "collections": ["facebook-population-density"]}
NOTICE: newprops: []
NOTICE: SEARCH CQL Final: {"filter": {}, "collections": ["facebook-population-density"]}
NOTICE: j: {}, op: <NULL>, jtype: object
NOTICE: parsing object
NOTICE: Getting stats for TRUE
NOTICE: Estimated Count: 253
NOTICE: Actual Count: 253
NOTICE: SEARCH: (a8cb91b69e25c49753b759a3f73e9462,"{""collections"": [""facebook-population-density""]}"," TRUE ","datetime DESC, id DESC","2021-08-16 10:19:03.68915+00",1,"2021-08-16 10:19:03.68915+00",253,253)
# using `collection` (without the s)
$ SELECT * from search('{"collection": ["facebook-population-density"]}');
NOTICE: SEARCH: (ef6ca26f3bf560dfc90e2227abfa5735,"{""collection"": [""facebook-population-density""]}","(((collection_id = ANY ( '{facebook-population-density}'::text[] ))))","datetime DESC, id DESC","2021-08-16 10:20:43.002928+00",2,"2021-08-16 10:18:14.842254+00",91,90)
NOTICE: FULL QUERY (((collection_id = ANY ( '{facebook-population-density}'::text[] )))) 00:00:00.011999
It seems that when we translate the query to CQL, the code expects collection while the spec says collections:
Line 28 in be27663
Lines 149 to 158 in be27663
Line 191 in be27663
In the previous (~0.2.*) version, here is what we had:
Lines 454 to 456 in c9f0f64
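To make the expected behavior concrete, a small illustrative sketch (the real translation lives in pgstac's SQL, not Python):

```python
# The STAC API Item Search parameter is "collections", plural. Reading
# "collection" instead means a spec-compliant request gets no
# collection filter at all, which matches the NOTICE output above.
def get_collections(search: dict) -> list:
    return search.get("collections", [])  # not search.get("collection")

assert get_collections({"collections": ["facebook-population-density"]}) == [
    "facebook-population-density"
]
```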
Upgrading stac-fastapi to pgstac 0.3.2, some tests are failing because search responses look like:
{
"type": "FeatureCollection",
"context": {
"limit": 10,
"matched": 0,
"returned": 0
},
"features": null
}
It seems as though pgstac is returning null instead of [] for empty searches, at least in some cases.
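Until that's fixed, a tiny client-side guard can normalize the response; a sketch, assuming the response has been parsed into a dict:

```python
# Normalize a pgstac search response so "features" is always a list,
# whatever the database returned.
def normalize_features(fc: dict) -> dict:
    fc["features"] = fc.get("features") or []
    return fc

assert normalize_features({"type": "FeatureCollection", "features": None})["features"] == []
```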
I'm seeing in logging output:
ERROR asyncio:base_events.py:1707 Unhandled exception in asyncpg log message listener callback <function pglogger at 0x7fc70bb4c280>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/asyncpg/connection.py", line 1435, in _call_log_listener
cb(con_ref, message)
TypeError: pglogger() takes 1 positional argument but 2 were given
This doesn't break any code, so it's really more of a warning, but perhaps there's a mismatch between the pglogger method's parameters and the callback signature asyncpg expects.
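For reference, a sketch of a listener that matches the two-argument signature asyncpg uses for log listeners (the logger name is arbitrary):

```python
# Log listeners registered with conn.add_log_listener() are invoked as
# callback(connection, message), so a one-argument pglogger raises the
# TypeError above.
import logging

logger = logging.getLogger("pgstac")

def pglogger(connection, message) -> None:
    # message is an asyncpg PostgresLogMessage (NOTICE, WARNING, etc.).
    logger.debug("%s", message)

# usage: conn.add_log_listener(pglogger)
```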
I was testing out pgstac migrations (with pypgstac migrate) and it looks like multiple version numbers can be inserted at the same time.
db=> SELECT * FROM pgstac.migrations;
version | datetime
---------+-------------------------------
0.2.4 | 2021-08-24 18:34:59.364497+00
0.2.7 | 2021-08-24 18:39:26.047286+00
0.3.4 | 2021-08-24 18:39:26.124702+00
0.2.8 | 2021-08-24 18:39:26.129415+00
0.3.4 | 2021-08-24 18:39:26.133057+00
0.2.9 | 2021-08-24 18:39:26.136371+00
0.3.4 | 2021-08-24 18:39:26.136371+00
(7 rows)
0.2.4 was added to the table manually, but all other rows were added as part of pypgstac migrate. This is an issue because pypgstac migrate picks up the most-recently-added row:
pgstac/pypgstac/pypgstac/migrate.py
Lines 113 to 118 in 6e43fb3
db=> SELECT version FROM pgstac.migrations
db-> ORDER BY datetime DESC LIMIT 1;
version
---------
0.2.9
(1 row)
If pgstac is dedicated to SemVer, you could just take MAX(version) and use lexicographic sort on the string? (Though see the caveat sketched below.)
db=> SELECT MAX(version) FROM pgstac.migrations;
max
-------
0.3.4
(1 row)
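One caveat with a plain string MAX: lexicographic order breaks once any version component reaches two digits. A minimal Python illustration (the version list is made up):

```python
# "0.10.0" sorts before "0.9.0" as a string; parsing the components as
# integers gives a SemVer-safe ordering.
versions = ["0.2.9", "0.3.4", "0.10.0"]

def semver_key(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

print(max(versions))                  # "0.3.4"  (string order, wrong)
print(max(versions, key=semver_key))  # "0.10.0" (numeric order, correct)
```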
As described here, it would be useful to be able to filter a search request based on whether a given property is null.
When running searches that include an "intersects" for complex geometries, I am seeing this error:
ERROR: index row requires 293792 bytes, maximum size is 8191
The "293792 bytes" seems to correspond to the size of the input geometry.
The error does not occur for simpler geometries (I ran one that loosely covered the same area as the complex polygon) nor in earlier versions of pgstac. Below is the stack trace from the database when running a search with one collection and an intersects of California, although I left off the first two NOTICE: ARGS after array cleaning lines, since the geometry value is so long. Happy to provide the specific California geometry if needed.
NOTICE: Checking if update is needed.
NOTICE: Stats Last Updated: <NULL>
NOTICE: TTL: 1 day, Age: <NULL>
NOTICE: Context: off, Existing Total: <NULL>
NOTICE: Time for just the explain: 00:00:00.428574
NOTICE: Time for explain + join: 00:00:00.431062
ERROR: index row requires 293792 bytes, maximum size is 8191
CONTEXT: SQL statement "INSERT INTO search_wheres SELECT sw.*
ON CONFLICT (_where)
DO UPDATE
SET
lastused = sw.lastused,
usecount = sw.usecount,
statslastupdated = sw.statslastupdated,
estimated_count = sw.estimated_count,
estimated_cost = sw.estimated_cost,
time_to_estimate = sw.time_to_estimate,
partitions = sw.partitions,
total_count = sw.total_count,
time_to_count = sw.time_to_count"
PL/pgSQL function where_stats(text,boolean) line 189 at SQL statement
SQL statement "SELECT where_stats(search._where, updatestats)"
PL/pgSQL function search_query(jsonb,boolean,jsonb) line 41 at PERFORM
PL/pgSQL function search(jsonb) line 71 at assignment
SQL state: 54000
Thank you!
Use case: I was about to migrate a database and wanted to confirm I had the latest pypgstac installed in my container. It would be useful to have a pypgstac version or pypgstac --version command to check the installed version.
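A sketch of what this could look like with typer, which pypgstac already uses for its CLI; the __version__ attribute is an assumption for illustration:

```python
# Minimal version command sketch for a typer-based CLI.
import typer

app = typer.Typer()

@app.command()
def version() -> None:
    """Print the installed pypgstac version."""
    from pypgstac import __version__  # assumed to exist
    typer.echo(__version__)

if __name__ == "__main__":
    app()
```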
I'm running into a situation where a query is logged in the search_wheres table before new data is brought in. This causes queries to miss the newly ingested items, returning only results from the previous set of partitions. In the case where a query was run before any items were ingested, it will return 0 results, as no partitions are logged in search_wheres for that query.
The workaround is to perform a query like
delete from search_wheres where _where LIKE '%COLLECTIONID%';
where COLLECTIONID is the collection ID of the recently ingested items.
Perhaps some purging of the search_wheres table should be done after a load to avoid stale cached results? Or turn down the default TTL, as one day is a long time to wait for new items to show up.
Changelog hasn't been updated since 0.3.4 😬
docker-compose.yml
database:
container_name: stac-db
platform: linux/amd64
image: ghcr.io/stac-utils/pgstac:v0.3.5
environment:
- POSTGRES_USER=username
- POSTGRES_PASSWORD=password
- POSTGRES_DB=postgis
ports:
- "5439:5432"
command: postgres -N 500
docker-compose up database
stac-db | /usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/pgstac.sql
stac-db | BEGIN
stac-db | psql:/docker-entrypoint-initdb.d/pgstac.sql:2: error: sql/001_core.sql: No such file or directory
cc @bitner
Any thoughts on adding a section in the documentation (maybe for "advanced users") on how to ingest data into pgstac using \copy?
I tried using pypgstac load, but for some reason it appeared to hang, and after 24 hours I killed the process.
I saw that the existing load commands basically use copy through asyncpg:
pgstac/pypgstac/pypgstac/load.py
Lines 112 to 119 in 8d85be7
So I figured it would be worth exploring using \copy directly through psql to see if that would be faster. The following steps worked for me, loading 1.4 million Landsat STAC items (the same file as above) in about 30 minutes on my local Mac:
1. Add a volume to the docker-compose section to be able to copy in data.
2. Connect to the database: docker exec -it stac-db psql -U username -d postgis
3. Run SET SEARCH_PATH TO pgstac, public; in psql.
4. Load the pgstac SQL files: \i /docker-entrypoint-initdb.d/001_core.sql, etc. (https://github.com/stac-utils/pgstac/blob/8d85be7f26b0be0384140aed6d850d122a2a9947/pypgstac/pypgstac/load.py#L133-L161)
5. Copy the data into a temp table:
CREATE TEMP TABLE pgstactemp (content jsonb);
\copy pgstactemp FROM '/app/data/stac_items.jsonl';
6. Insert into pgstac.items (14 minutes):
INSERT INTO pgstac.items (content)
SELECT content FROM pgstactemp
ON CONFLICT DO NOTHING;
7. Backfill the partitions (SELECT backfill_partitions();)
Here, a list of operators is defined that includes le and ge: https://github.com/stac-utils/pgstac/blob/main/sql/004_search.sql#L218-L230
The spec has these as lte and gte.
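For concreteness, a sketch of the mapping the spec implies (illustrative; pgstac defines its operators in SQL):

```python
# The STAC Query extension names the comparison operators "lte" and
# "gte", so entries keyed "le" and "ge" would never match a compliant
# request.
OPERATOR_MAP = {
    "eq": "=",
    "neq": "<>",
    "lt": "<",
    "lte": "<=",
    "gt": ">",
    "gte": ">=",
}
```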
As follow-on work to #18, we need to integrate poetry into the development environment setup.
The CI process is a bit complex, IMO. It's built on top of a series of shell scripts which call other shell scripts.
I think it would be good to refactor our CI process a bit.
Ideally, the migration files should also be created automatically within CI when pushing a new tag.
Running pypgstac migrate --toversion 0.3.5 against a clean PostgreSQL 12 database fails with the message:
asyncpg.exceptions.WrongObjectTypeError: "items" is a partitioned table
DETAIL: Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.
This might be because BEFORE row-level triggers for partitioned tables are not supported in PostgreSQL 12 but are supported in PostgreSQL 13 (some hints here: https://www.depesz.com/2020/03/31/waiting-for-postgresql-13-enable-before-row-level-triggers-for-partitioned-tables/).
Upgrading to Postgres 13 resolved the error.
Full stack trace:
No pgstac version set, installing 0.3.5 from scratch.
Running migrations for ['/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/pypgstac/migrations/pgstac.0.3.5.sql'].
Traceback (most recent call last):
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/bin/pypgstac", line 8, in <module>
sys.exit(app())
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/pypgstac/pypgstac.py", line 31, in migrate
version = asyncio.run(run_migration(dsn, toversion))
File "/home/jthetzel/.pyenv/versions/3.9.6/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/jthetzel/.pyenv/versions/3.9.6/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
return future.result()
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/pypgstac/migrate.py", line 163, in run_migration
await conn.execute(migration_sql)
File "/home/jthetzel/.cache/pypoetry/virtualenvs/c-stac-api-4u_-WkNV-py3.9/lib/python3.9/site-packages/asyncpg/connection.py", line 297, in execute
return await self._protocol.query(query, timeout)
File "asyncpg/protocol/protocol.pyx", line 336, in query
asyncpg.exceptions.WrongObjectTypeError: "items" is a partitioned table
DETAIL: Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/jthetzel/src/c-core-labs/c-stac-api/scripts.py", line 53, in migrate
check_call(["pypgstac", "migrate"])
File "/home/jthetzel/.pyenv/versions/3.9.6/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pypgstac', 'migrate']' returned non-zero exit status 1.
Running migrations against a --dsn that has ?sslmode=require (or with PGSSLMODE=require), I get:
ERROR:asyncio:Fatal error on SSL transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x7f4882e35e20>
transport: <_SelectorSocketTransport closing fd=8>
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/asyncio/selector_events.py", line 910, in write
n = self._sock.send(data)
OSError: [Errno 9] Bad file descriptor
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/asyncio/sslproto.py", line 685, in _process_write_backlog
self._transport.write(chunk)
File "/opt/conda/lib/python3.8/asyncio/selector_events.py", line 916, in write
self._fatal_error(exc, 'Fatal write error on socket transport')
File "/opt/conda/lib/python3.8/asyncio/selector_events.py", line 711, in _fatal_error
self._force_close(exc)
File "/opt/conda/lib/python3.8/asyncio/selector_events.py", line 723, in _force_close
self._loop.call_soon(self._call_connection_lost, exc)
File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 719, in call_soon
self._check_closed()
File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
I'm able to connect to the database using the same DSN with psql. Any ideas?
Document the process for version tagging a release and generating the appropriate migration scripts.
This is follow-up work from #18, which introduced a release process that is based on GitHub releases, so that releases are automated based on a new tag/GitHub release.
The Makefile processing currently creates a script that inserts the version in the migration table. This issue would create a similar workflow that is done as a manual step prior to releasing.
Running the pypgstac migrate command from a previous version to a new version leads to multiple entries of the final version in the pgstac.migrations table:
Running migrations for ['pgstac.0.2.4-0.2.7.sql', 'pgstac.0.2.7-0.2.8.sql', 'pgstac.0.2.8-0.2.9.sql', 'pgstac.0.2.9-0.3.0.sql', 'pgstac.0.3.0-0.3.1.sql', 'pgstac.0.3.1-0.3.2.sql', 'pgstac.0.3.2-0.3.3.sql'].
pgstac version 0.3.3
version | datetime
---------+-------------------------------
0.2.4 | 2021-07-23 00:00:00+00
0.2.7 | 2021-09-03 09:02:22.32407+00
0.3.3 | 2021-09-03 09:02:22.360333+00
0.2.8 | 2021-09-03 09:02:22.362009+00
0.3.3 | 2021-09-03 09:02:22.362903+00
0.2.9 | 2021-09-03 09:02:22.364059+00
0.3.3 | 2021-09-03 09:02:22.364059+00
0.3.0 | 2021-09-03 09:02:22.366378+00
0.3.3 | 2021-09-03 09:02:22.366378+00
0.3.1 | 2021-09-03 09:05:24.690869+00
0.3.3 | 2021-09-03 09:05:24.690869+00
0.3.2 | 2021-09-03 09:05:24.716236+00
0.3.3 | 2021-09-03 09:05:24.716236+00
0.3.3 | 2021-09-03 09:05:24.723922+00
0.3.3 | 2021-09-03 09:05:24.723922+00
(15 rows)
For any migration file (e.g., 0.2.4 to 0.2.7), two inserts are currently conducted:
From the migration file itself:
https://github.com/stac-utils/pgstac/blob/main/pypgstac/pypgstac/migrations/pgstac.0.2.4-0.2.7.sql#L313
From the Python script (which always inserts the latest version being migrated to, in my case 0.3.3):
https://github.com/stac-utils/pgstac/blob/main/pypgstac/pypgstac/migrate.py#L160-L170
I think this could be resolved by removing the insert statement from the Python script?
Lines 645 to 655 in b210b68
Working on a project that had pypgstac 0.2.3 installed, I ran pypgstac migrate against a development database using image bitner/pgstac:0.1.9 and got the error:
root@39835833116a:/opt/src# pypgstac migrate --dsn postgresql://username:password@database:5432/postgis
Traceback (most recent call last):
File "/opt/conda/bin/pypgstac", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.8/site-packages/pypgstac/pypgstac.py", line 125, in migrate
typer.echo(asyncio.run(run_migration(dsn)))
File "/opt/conda/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.8/site-packages/pypgstac/pypgstac.py", line 110, in run_migration
await conn.execute(migration_sql)
File "/opt/conda/lib/python3.8/site-packages/asyncpg/connection.py", line 297, in execute
return await self._protocol.query(query, timeout)
File "asyncpg/protocol/protocol.pyx", line 336, in query
asyncpg.exceptions.PostgresSyntaxError: syntax error at or near "'0.2.3'"
I then updated to pypgstac 0.2.4 and reran, and got the error:
root@62d1d24ad3c3:/opt/src# pypgstac migrate --dsn postgresql://username:password
@database:5432/postgis
Traceback (most recent call last):
File "/opt/conda/bin/pypgstac", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.8/site-packages/pypgstac/pypgstac.py", line 125, in migrate
typer.echo(asyncio.run(run_migration(dsn)))
File "/opt/conda/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.8/site-packages/pypgstac/pypgstac.py", line 110, in run_migration
await conn.execute(migration_sql)
File "/opt/conda/lib/python3.8/site-packages/asyncpg/connection.py", line 297, in execute
return await self._protocol.query(query, timeout)
File "asyncpg/protocol/protocol.pyx", line 336, in query
asyncpg.exceptions.DuplicateObjectError: constraint "items_collections_fk" for relation "items" already exists
There are some datasets that do not have a datetime, but instead have a start_datetime and end_datetime. For example, a landcover dataset that represents a year of data: there is no single datetime that represents it. STAC specifies that in this case datetime is null and start_datetime and end_datetime are set.
How can pgstac handle these types of items?
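For reference, the properties of such an item would look like this (example values):

```python
# The STAC pattern described above: a year-long item with a null
# datetime and an explicit start/end range in its properties.
properties = {
    "datetime": None,  # serialized as null in JSON
    "start_datetime": "2020-01-01T00:00:00Z",
    "end_datetime": "2020-12-31T23:59:59Z",
}
```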
Hi there!
We are using stac-fastapi with the pgstac backend. The database was initially set up using pypgstac migrate.
We are attempting to insert a test item:
{
"type": "Feature",
"stac_version": "1.0.0",
"id": "test_item",
"properties": {
"datetime": "1990-12-27T08:27:00Z"
},
"bbox": [-180, -90, 180, 90],
"geometry": {
"type": "Point",
"coordinates": [125.6, 10.1]
},
"links": [
{
"rel": "root",
"href": "http://stac.mydomain.com/",
"type": "application/json",
"title": "stac-fastapi"
},
{
"rel": "collection",
"href": "https://stac.mydomain.com/collections/test_collection",
"type": "application/json"
},
{
"rel": "parent",
"href": "https://stac.mydomain.com/collections/test_collection",
"type": "application/json"
},
{
"rel": "self",
"href": "https://stac.mydomain.com/collections/test_collection/items/test_item",
"type": "application/json"
}
],
"assets": {},
"stac_extensions": [],
"collection": "test_collection"
}
However, the API is returning the following error:
asyncpg.exceptions.UndefinedTableError: relation "items_p1990w52" does not exist
I figured this might be caused by Partman not being installed, so we amended the database:
create schema partman;
create extension pg_partman with schema partman;
create table partman.pgstac_items_template(
like pgstac.items
);
create index on partman.pgstac_items_template (datetime asc);
create index on partman.pgstac_items_template (end_datetime asc);
create index on partman.pgstac_items_template using gist(geometry);
select partman.create_parent('pgstac.items', 'datetime', 'native', 'weekly',
p_template_table := 'partman.pgstac_items_template');
This unfortunately did not help.
Not sure what is going on here. Is anyone able to point us in the direction of a solution?
As follow-up work to #18, we need to get the pgTAP SQL tests to run as part of the testing workflow. With that PR, only the Python tests are run as part of scripts/test and the CI testing; this will integrate the tests in the top-level test folder.
It seems that every time there is a pgstac update, the hash for a search will be different (I have hash-based tests in https://github.com/stac-utils/titiler-pgstac/blob/master/tests/test_titiler_pgstac.py#L7-L8 and I have to update the hash every time the pgstac version changes).
I think this is usually fine, but it would be great if we could have a way to get a consistent id for registering/searching/accessing a search query.
I'm mostly thinking about using a name to describe a search query. This would be optional; a search could still be located by its hash, but also by its name.
Lines 1000 to 1004 in 76512ab
I am seeing behavior where, if a search is performed with only "items" specified but no "collections", invalid results can occur. I saw this through stac-fastapi, where the get_item() call was only supplying an item_id to the PgstacSearch. That is incorrect, and I'll be fixing it in a PR there, but what I was seeing was that an item that did not have a matching item ID was being returned, and that item belonged to another collection. A search on an item ID without a collection ID should return all (potentially multiple) matching items from across collections. I have not reproduced this error in the project, and it may be hard to reproduce: this seems to have been working fine, but after a redeploy (I'm not sure if some state changed in the DB) I started consistently getting incorrect results from item queries.
Executing migrations/pgstac.0.4.3.sql on PostgreSQL 12.8 fails with the following:
"items" is a partitioned table
Partitioned tables cannot have BEFORE / FOR EACH ROW triggers
Issue #63 reported the same error, and that issue was later closed by the same user reporting "Forgot to install pg_partman.", indicating that installing pg_partman solved the problem. However, PR #66 removed pg_partman from the list of required extensions, and my testing suggests pg_partman does not make a difference.
pgstac.0.3.4.sql fails in the same way on PostgreSQL 12.8.
pgstac.0.4.3.sql succeeds in PostgreSQL 13.4 with no further changes besides version number.
To reproduce: https://github.com/captaincoordinates/pgstac-pg12-test/
Any feedback much appreciated 🙏
Currently the dependencies of the project block using typer 0.4.0, which is needed for click compatibility (see fastapi/typer#280).
Rather than limiting results by a count alone, the limit would be considered filled once the footprints of the found items cover the entire tile.
As follow-up work on #18, we need to add the ability to push Docker images to the cipublish script. PyPI publishing is already handled by cipublish, and the addition of Docker image publishing will cover most of the functionality in the Makefile, which should be removed once this and any other functionality is covered by scripts/.
https://planetarycomputer-staging.microsoft.com/api/stac/v1/collections/mtbs/items/S2A_MSIL2A_20220329T215531_R029_T07WDU_20220330T055949 is a link to a valid item from Sentinel-2, but it's using the wrong collection in the path (mtbs). That should return an error saying that the item isn't in the mtbs collection. Instead, it returns the Sentinel-2 item.
This seems to be fixed on the 0.5 branch (https://pc-pqe-red-mqe.westeurope.cloudapp.azure.com/api/stac/v1/collections/mtbs/items/S2B_MSIL2A_20220309T105759_R094_T39XVH_20220309T203945), but I wanted to report it anyway in case we want to add a regression test. This is also not present in previous versions of pgstac (e.g. https://planetarycomputer.microsoft.com/api/stac/v1/collections/mtbs/items/S2B_MSIL2A_20220309T105759_R094_T39XVH_20220309T203945; I'm not sure what version that's running)
Feel free to close if you think the 0.5 release will make this impossible by design.
If the pgstac_settings default-filter-lang is set to cql-json, I don't believe it can be overridden as intended by this line:
Line 651 in 3afa5df
In the case where the default is set to "cql-json" and the client sends a filter-lang property of "cql2-json", I read the logic of that line as (false or true), and it then attempts and fails to parse the CQL2 as CQL (which is the behavior I'm seeing). Instead, I think the default should only be applied when the search didn't provide a filter-lang property.
If the default is set to cql2-json, the search-supplied and override behavior seems correct.
In the most recent release of the STAC API specification (i.e., v1.0.0-beta.5), two features have been added which imply a hierarchical representation of STAC documents within STAC APIs: children and browseable. A recently added draft on stac_fastapi implements these features with an in-memory representation of the hierarchy which defines behavior for both of these features.
In this draft, the hierarchy is specified separately from the ingested contents of the database. For instance, the data which defines the Joplin collection and its various related items is supplied via a JSON file and an environment variable which tells the server where a valid hierarchy definition can be located.
Assuming this structure is sufficient, and desiring a more robust implementation which leverages the performance of postgres and benefits implementers of other STAC libraries which might lean on pgstac, it would be great to push this information and the associated behaviors down to postgres.
Below are some TODOs in the stac_fastapi draft, next to functions/behavior which could likely find their home in pgstac.
I've been loading some STAC items into my pgstac instance, and it seems that pgstac requires globally unique Item IDs. For example, if I create a new collection and then add an item that I've already added, with the same ID as the existing item, I get the following error:
$ pypgstac load items bec-items.json
How to deal conflicting ids (insert, insert_ignore, upsert) [insert]:
loading bec-items.json into items using insert
Traceback (most recent call last):
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\Scripts\pypgstac.exe\__main__.py", line 7, in <module>
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\typer\main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\typer\main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\pypgstac\pypgstac.py", line 43, in load
typer.echo(asyncio.run(load_ndjson(file=file, table=table, dsn=dsn, method=method)))
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\asyncio\runners.py", line 44, in run
return loop.run_until_complete(main)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\asyncio\base_events.py", line 642, in run_until_complete
return future.result()
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\pypgstac\load.py", line 246, in load_ndjson
await load_iterator(f, table, conn, method)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\pypgstac\load.py", line 230, in load_iterator
await copy(iter, table, conn)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\pypgstac\load.py", line 134, in copy
await conn.copy_to_table(
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\asyncpg\connection.py", line 859, in copy_to_table
return await self._copy_in(copy_stmt, source, timeout)
File "C:\Users\rwilson3\Documents\mambaforge\envs\anglo\lib\site-packages\asyncpg\connection.py", line 1041, in _copy_in
return await self._protocol.copy_in(
File "asyncpg\protocol\protocol.pyx", line 506, in copy_in
asyncpg.exceptions.UniqueViolationError: duplicate key value violates unique constraint "items_p1999w52_datetime_id_idx"
DETAIL: Key (datetime, id)=(2000-01-01 00:00:00+00, 4096a_0_10.0_32_7_-46) already exists.
Is this intentional? The STAC spec states that:
The ID should be unique within the Collection that contains the Item.