
openeo-geopyspark-driver's Introduction

OpenEO Geopyspark Driver

Status

Python version: at least 3.8

This driver implements the GeoPySpark/Geotrellis specific backend for OpenEO.

It does this by implementing a direct (non-REST) version of the OpenEO client API on top of GeoPySpark.

A REST service based on Flask translates incoming calls to this local API.

Technology stack

Operating environment dependencies

This backend has been tested with:

  • Something that runs Spark: Kubernetes or YARN (Hadoop), standalone or on your laptop
  • Accumulo as the tile storage backend for Geotrellis
  • Reading GeoTiff files directly from disk or object storage

Public endpoint

https://openeo.vito.be/openeo/

Running locally

Set up your (virtual) environment with necessary dependencies:

# Install Python package and its dependencies
pip install . --extra-index-url https://artifactory.vgt.vito.be/artifactory/api/pypi/python-openeo/simple

# Get necessary JAR dependency files for running Geopyspark driver
python scripts/get-jars.py

For development, refer to docs/development for more information. You can run the service with:

export SPARK_HOME=$(find_spark_home.py)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export FLASK_DEBUG=1
python openeogeotrellis/deploy/local.py

For production, a gunicorn server script is available: PYTHONPATH=. python openeogeotrellis/server.py
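
Once the service is running, a quick sanity check with the openeo Python client could look like this; a minimal sketch, where the local port and API path are assumptions about the default local deployment:

import openeo

con = openeo.connect("http://localhost:8080/openeo/1.0")  # port/path assumed
print(con.list_collection_ids())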

Running on the Proba-V MEP

The web application can be deployed by running: sh scripts/submit.sh
This will package the application and its dependencies from source, and submit it to the cluster. The application will register itself with an Nginx reverse proxy using Zookeeper.

openeo-geopyspark-driver's People

Contributors

bossie, dependabot[bot], emilesonneveld, jdries, jeroenverstraelen, johankjschreurs, m-mohr, soxofaan, tcassaert


openeo-geopyspark-driver's Issues

NetCDF result download comes without CRS

Upon downloading results in NetCDF format, no CRS is specified in the file. Is this intentional?
The files appear to be in the corresponding WGS84 UTM zone, which can be assigned later, but relying on that does not seem very robust to me.

Example ProcessGraph:
{ "process_graph": { "1": { "process_id": "load_collection", "arguments": { "id": "TERRASCOPE_S2_NDVI_V2", "spatial_extent": { "west": 5.224096231078771, "south": 50.69038597219307, "east": 5.311809621280062, "north": 50.72560528417654 }, "temporal_extent": [ "2020-10-08T00:00:00Z", "2020-10-22T23:59:59Z" ], "bands": [ "NDVI_10M" ] } }, "2": { "process_id": "reduce_dimension", "arguments": { "reducer": { "process_graph": { "1": { "process_id": "min", "arguments": { "data": { "from_parameter": "data" } }, "result": true } } }, "dimension": "t", "data": { "from_node": "1" } } }, "3": { "process_id": "save_result", "arguments": { "data": { "from_node": "2" }, "format": "NETCDF" }, "result": true } } }

unexpected importance of parameter type in func add_dimension

I stumbled upon the following behaviour of the function add_dimension: it seems to only apply the given label when the parameter type is not other (but, for example, bands). When type is other, the label seems to just end up as band_0. Because type has a default value (other) and is therefore not a mandatory input, I did not expect it to have such an impact on the behaviour of the function. In my opinion this should at least be documented.

I did not test this with other backends. Here's an example:
{ "process_graph": { "add_dimension_IWUCC3825R": { "arguments": { "data": { "from_node": "reduce_dimension_IBSFT4301B" }, "label": "collaps1", "name": "bands", "type": "other" }, "process_id": "add_dimension" }, "load_collection_WHZYS7018X": { "arguments": { "bands": [ "B04", "B08" ], "id": "SENTINEL2_L2A_SENTINELHUB", "spatial_extent": { "east": 4.5277, "north": 50.9305, "south": 50.7816, "west": 4.2369 }, "temporal_extent": [ "2020-10-01", "2020-10-15" ] }, "process_id": "load_collection" }, "reduce_dimension_IBSFT4301B": { "arguments": { "context": null, "data": { "from_node": "reduce_dimension_SXELG4042R" }, "dimension": "bands", "reducer": { "process_graph": { "array_element_OCQIS1854F": { "arguments": { "data": { "from_parameter": "data" }, "index": 1, "return_nodata": false }, "process_id": "array_element" }, "array_element_ZXQBW2629C": { "arguments": { "data": { "from_parameter": "data" }, "index": 0, "return_nodata": false }, "process_id": "array_element" }, "normalized_difference_JVSKC0506O": { "arguments": { "x": { "from_node": "array_element_ZXQBW2629C" }, "y": { "from_node": "array_element_OCQIS1854F" } }, "process_id": "normalized_difference", "result": true } } } }, "process_id": "reduce_dimension" }, "reduce_dimension_SXELG4042R": { "arguments": { "context": null, "data": { "from_node": "load_collection_WHZYS7018X" }, "dimension": "t", "reducer": { "process_graph": { "mean_ZYDSH5431N": { "arguments": { "data": { "from_parameter": "data" }, "ignore_nodata": true }, "process_id": "mean", "result": true } } } }, "process_id": "reduce_dimension" }, "save_result_PLEPS1878X": { "arguments": { "data": { "from_node": "add_dimension_IWUCC3825R" }, "format": "NetCDF", "options": {} }, "process_id": "save_result", "result": true } } }

Add config to toggle on/off zookeeper usage

Originally we used TRAVIS env var to toggle on/off various subsystems (e.g. in tests):

self.is_ci_context = any(v in env for v in ['TRAVIS', 'PYTEST_CURRENT_TEST', 'PYTEST_CONFIGURE'])

But now this is also being used as a workaround to skip zookeeper in our creodias setup: Open-EO/openeo-geotrellis-kubernetes#1

We should provide an explicit config option to toggle zookeeper usage on/off.
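
A minimal sketch of what such an explicit toggle could look like (the environment variable name is hypothetical, and the import path is assumed from the tracebacks elsewhere on this page):

import os
from openeogeotrellis.service_registry import InMemoryServiceRegistry, ZooKeeperServiceRegistry

# Hypothetical explicit setting instead of piggybacking on TRAVIS/PYTEST env vars.
use_zookeeper = os.environ.get("OPENEO_USE_ZOOKEEPER", "true").lower() in ("1", "true", "yes")
service_registry = ZooKeeperServiceRegistry() if use_zookeeper else InMemoryServiceRegistry()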

CreoDIAS capabilities: production=true but description says it's "unstable"

The CreoDIAS driver's capability response at https://openeo.creo.vito.be/openeo/1.0/ has the production flag set to true, yet the free text under description says, as the very first thing and in CAPITAL letters, that this backend is "[UNSTABLE]". Could you align the two?

server.run(
    title="OpenEO API",
    description="""[UNSTABLE] OpenEO API running on CreoDIAS (using GeoPySpark driver). This endpoint runs openEO on a Kubernetes cluster.
The main component can be found here: https://github.com/Open-EO/openeo-geopyspark-driver
The deployment is configured using Terraform and Kubernetes configs: https://github.com/Open-EO/openeo-geotrellis-kubernetes
Data is read directly from the CreoDIAS data offer through object storage. Processing is limited by the processing
capacity of the Kubernetes cluster running on DIAS. Contact VITO for experiments with higher resource needs.
""",
    deploy_metadata=build_backend_deploy_metadata(
        packages=["openeo", "openeo_driver", "openeo-geopyspark", "openeo_udf", "geopyspark"]
        # TODO: add version info about geotrellis-extensions jar?
    ),
    backend_version=get_backend_version(),
    threads=10,
    host=host,
    port=port,
    on_started=on_started)

'resample_cube_spatial' can't handle temporally reduced input

The function resample_cube_spatial can't seem to handle data that has been temporally reduced, i.e. has undergone a reduce_dimension(.. dimension = t ..) process. It throws the error geotrellis.layer.SpatialKey cannot be cast to geotrellis.layer.SpaceTimeKey when both data and target, or just data, are temporally reduced, and the error TypeError: can't compare offset-naive and offset-aware datetimes when the reduced data is used as target. When the reduce_dimension is done after the resampling, the process runs without error.

Example process, expected to throw the error above:
{ "process_graph": { "load_collection_TWKYF6910T": { "arguments": { "bands": [ "VV" ], "id": "TERRASCOPE_S1_SLC_COHERENCE_V1", "spatial_extent": { "east": 4.5277, "north": 50.9305, "south": 50.7816, "west": 4.2369 }, "temporal_extent": [ "2020-10-01", "2020-10-31" ] }, "process_id": "load_collection" }, "load_collection_VGBOT9882O": { "arguments": { "bands": [ "NIR" ], "id": "PROBAV_L3_S5_TOC_100M", "spatial_extent": { "east": 4.5277, "north": 50.9305, "south": 50.7816, "west": 4.2369 }, "temporal_extent": [ "2020-10-01", "2020-10-31" ] }, "process_id": "load_collection" }, "reduce_dimension_MQUHW5102U": { "arguments": { "context": null, "data": { "from_node": "load_collection_VGBOT9882O" }, "dimension": "t", "reducer": { "process_graph": { "mean_ZQKLU1648K": { "arguments": { "data": { "from_parameter": "data" }, "ignore_nodata": true }, "process_id": "mean", "result": true } } } }, "process_id": "reduce_dimension" }, "resample_cube_spatial_SSRYY7709G": { "arguments": { "data": { "from_node": "reduce_dimension_MQUHW5102U" }, "method": "near", "target": { "from_node": "load_collection_TWKYF6910T" } }, "process_id": "resample_cube_spatial" }, "save_result_HQUXV2759H": { "arguments": { "data": { "from_node": "resample_cube_spatial_SSRYY7709G" }, "format": "NetCDF", "options": {} }, "process_id": "save_result", "result": true } } }

implement dimension_labels

I noticed that dimension_labels is not yet implemented, and I just wanted to give feedback that this process would be much appreciated.

Non-deterministic GTiff download of multitemporal cube

GTiff download of multitemporal cubes seems to be non-deterministic: I think the geopyspark implementation collapses the time dimension by arbitrarily picking a raster tile (when there are multiple) for each tile location in the layout.

`if` process not working

[screenshot of the error message]

Sample process graph:

[screenshot of the visual process graph]


{
  "process_graph": {
    "6": {
      "arguments": {
        "id": "S1_GRD_SIGMA0_ASCENDING",
        "spatial_extent": {
          "east": 12.04481792607112,
          "north": 46.51293492736491,
          "south": 46.33970892265867,
          "west": 11.495501519821119
        },
        "temporal_extent": [
          "2019-09-01T00:00:00Z",
          "2020-09-19T23:59:59Z"
        ]
      },
      "process_id": "load_collection"
    },
    "17": {
      "arguments": {
        "data": {
          "from_node": "99"
        },
        "process": {
          "process_graph": {
            "6": {
              "arguments": {
                "accept": {
                  "from_parameter": "x"
                },
                "reject": 1,
                "value": {
                  "from_node": "10"
                }
              },
              "process_id": "if",
              "result": true
            },
            "10": {
              "process_id": "gt",
              "arguments": {
                "x": {
                  "from_parameter": "x"
                },
                "y": 20
              }
            }
          }
        }
      },
      "process_id": "apply"
    },
    "28": {
      "arguments": {
        "data": {
          "from_node": "1012"
        },
        "format": "GTIFF"
      },
      "process_id": "save_result",
      "result": true
    },
    "99": {
      "arguments": {
        "data": {
          "from_node": "1009"
        },
        "dimension": "t",
        "reducer": {
          "process_graph": {
            "2": {
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "process_id": "mean",
              "result": true
            }
          }
        }
      },
      "process_id": "reduce_dimension"
    },
    "1009": {
      "arguments": {
        "bands": [
          "angle"
        ],
        "data": {
          "from_node": "6"
        },
        "wavelengths": []
      },
      "process_id": "filter_bands"
    },
    "1012": {
      "arguments": {
        "data": {
          "from_node": "17"
        },
        "extent": [
          "2019-12-31T00:00:00Z",
          "2020-01-06T23:59:59Z"
        ]
      },
      "process_id": "filter_temporal"
    }
  }
}

Questions about TERRASCOPE coherence dataset

I have some questions about the TERRASCOPE_S1_SLC_COHERENCE_V1 dataset, that I hope you can help me with (quotes from the description):

The product algorithm starts from two ESA Level-1 SLC products which are from the same area, the same relative orbit number and preferably from within a short time interval [my emphasis].

  1. So far I just assumed these images would always be successive images. Is that correct?
  2. Which timestamps do they have in the collection, first or second acquisition?
  3. Is there a way to know which two dates were used in the coherence calculation?

This product exposes properties to enable filtering on relative orbit number and orbit direction.

  1. I assume this feature has yet to be implemented - but how will it be done?

These details would be very interesting towards using this coherence product for change detection use cases, as it is important to know in which time period changes occurred. I have outlined an example use case already in #53.

`linear_scale_range` throws unsupported operation error

Upon applying a linear_scale_range I get an "unsupported operation" error. Is this known?

Test graph for reference:
{ "process_graph": { "1": { "process_id": "apply", "arguments": { "data": { "from_node": "reduce1" }, "process": { "process_graph": { "2": { "process_id": "linear_scale_range", "arguments": { "x": { "from_parameter": "x" }, "inputMin": 0, "inputMax": 4000, "outputMin": 0, "outputMax": 255 }, "result": true } } } } }, "loadco1": { "arguments": { "bands": [ "B02", "B03", "B04" ], "id": "SENTINEL2_L2A_SENTINELHUB", "spatial_extent": { "west": 4.181508513240051, "south": 50.770854338038305, "east": 4.326734351829529, "north": 50.84505243999911 }, "temporal_extent": [ "2021-02-01T00:00:00Z", "2021-02-14T23:59:59Z" ] }, "process_id": "load_collection" }, "reduce1": { "arguments": { "data": { "from_node": "loadco1" }, "dimension": "t", "reducer": { "process_graph": { "mean1": { "arguments": { "data": { "from_parameter": "data" } }, "process_id": "mean", "result": true } } } }, "process_id": "reduce_dimension" }, "savere1": { "arguments": { "data": { "from_node": "1" }, "format": "PNG", "options": { "red": "B4", "green": "B3", "blue": "B2" } }, "process_id": "save_result", "result": true } } }

bands input in load_collection causes error

In a process graph, when adding the bands input, e.g.

"dc": {
    "process_id": "load_collection",
    "arguments": {
      "id": "CGS_SENTINEL2_RADIOMETRY_V102_001",
      "spatial_extent": {
        "west": 2.052030657924054,
        "east": 4.063236553549667,
        "north": 51.00726308446294,
        "south": 50.99458367677388,
        "crs": "EPSG:4326"
        },
      "temporal_extent": [
        "2018-06-04T00:00:00.000Z",
        "2018-06-23T00:00:00.000Z"
      ],
      "bands": ["8", "4"]
    },
    "result": false
  }

the back-end returns the following error:

{"message":"An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveStitched.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 23 in stage 266.0 failed 4 times, most recent failure: Lost task 23.3 in stage 266.0 (TID 35790, epod53.vgt.vito.be, executor 147): java.lang.ClassCastException\n\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)\n\tat scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)\n\tat scala.Option.foreach(Option.scala:257)\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)\n\tat org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n\tat org.apache.spark.rdd.RDD.collect(RDD.scala:944)\n\tat geotrellis.spark.stitch.SpatialTileLayoutRDDStitchMethods.stitch(StitchRDDMethods.scala:81)\n\tat org.openeo.geotrellis.geotiff.package$.saveStitched(package.scala:39)\n\tat org.openeo.geotrellis.geotiff.package$.saveStitched(package.scala:28)\n\tat org.openeo.geotrellis.geotiff.package.saveStitched(package.scala)\n\tat sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: java.lang.ClassCastException\n"}

If the bands input is not given, everything works fine. If needed I can provide full examples for this.

empty entries with polygonal_mean_timeseries (aggregate_spatial with mean reducer)

polygonal_mean_timeseries returns a dense timeseries, with an entry for every day, which is empty if there was no observation for that day.

Other comparable timeseries functions, like polygonal_median_timeseries and polygonal_standarddeviation_timeseries, only return entries with data.

e.g. the print statements of this test in the integrationtests project: https://github.com/Open-EO/openeo-geopyspark-integrationtests/blob/1d643b96243ea714e8bd1e5a2edcdb9f873708b1/tests/test_integration.py#L724-L743
return this:

----------------------------- Captured stdout call -----------------------------
mean {'2017-11-01T00:00:00Z': [[149.22597218531314]], '2017-11-02T00:00:00Z': [[]], '2017-11-03T00:00:00Z': [[]], '2017-11-04T00:00:00Z': [[]], '2017-11-05T00:00:00Z': [[]], '2017-11-06T00:00:00Z': [[]], '2017-11-07T00:00:00Z': [[]], '2017-11-08T00:00:00Z': [[]], '2017-11-09T00:00:00Z': [[]], '2017-11-10T00:00:00Z': [[]], '2017-11-11T00:00:00Z': [[149.62631765435378]], '2017-11-12T00:00:00Z': [[]], '2017-11-13T00:00:00Z': [[]], '2017-11-14T00:00:00Z': [[]], '2017-11-15T00:00:00Z': [[]], '2017-11-16T00:00:00Z': [[]], '2017-11-17T00:00:00Z': [[]], '2017-11-18T00:00:00Z': [[]], '2017-11-19T00:00:00Z': [[]], '2017-11-20T00:00:00Z': [[]], '2017-11-21T00:00:00Z': [[94.16547081229515]]}
median {'2017-11-01T00:00:00Z': [[150.9128919860627]], '2017-11-11T00:00:00Z': [[151.16723549488054]], '2017-11-21T00:00:00Z': [[87.43661971830986]]}
sd {'2017-11-01T00:00:00Z': [[30.416932628062053]], '2017-11-11T00:00:00Z': [[32.6456345358718]], '2017-11-21T00:00:00Z': [[35.279792157199665]]}
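
Until the behaviour is aligned with the other reducers, the empty entries can be dropped client-side; a minimal sketch over a dict like the mean output above:

mean_ts = {
    "2017-11-01T00:00:00Z": [[149.22597218531314]],
    "2017-11-02T00:00:00Z": [[]],
    "2017-11-11T00:00:00Z": [[149.62631765435378]],
}
# Keep only dates that actually contain observations.
mean_ts_nonempty = {date: values for date, values in mean_ts.items()
                    if any(len(v) > 0 for v in values)}
print(mean_ts_nonempty)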

Weird issue when combining `linear_scale_range` with `resample`

While working on a use case involving NDVI resampling (EP-3068), I stumbled on a weird issue:

In an attempt to reduce data size, @jdries suggested rescaling NDVI values to the range 0-240 in order to switch to byte pixels instead of float pixels. The resampled output however changes in a weird way depending on whether you do linear_scale_range before or after resample.

This notebook tries to illustrate and drill down into the issue:

https://gist.github.com/soxofaan/37e8d927e89984973805556c9b6a6a33

"Unsupported operation: linear_scale_range" even though it's listed on the backend

When using the linear_scale_range process in the process graph, I get the following error message even though it is listed in the list of processes on the backend:

{
	message: "Unsupported operation: linear_scale_range (arguments: [inputMax, inputMin, x, outputMin, outputMax])"
}

Am I using the process wrong here or is this an issue on the backend?

The process graph used:

{
	"process_graph": {
		"dc": {
			"process_id": "load_collection",
			"arguments": {
				"id": "TERRASCOPE_S2_TOC_V2",
				"spatial_extent": {
					"west": 7.594852,
					"south": 46.893906,
					"east": 7.6498137,
					"north": 46.91994
				},
				"temporal_extent": ["2020-10-18", "2020-10-28"],
				"bands": ["TOC-B04_10M", "TOC-B08_10M"]
			}
		},
		"diff": {
			"process_id": "reduce_dimension",
			"arguments": {
				"data": {
					"from_node": "dc"
				},
				"reducer": {
					"process_graph": {
						"red": {
							"process_id": "array_element",
							"arguments": {
								"data": {
									"from_parameter": "data"
								},
								"index": 0
							}
						},
						"nir": {
							"process_id": "array_element",
							"arguments": {
								"data": {
									"from_parameter": "data"
								},
								"index": 1
							}
						},
						"diff": {
							"process_id": "normalized_difference",
							"arguments": {
								"x": {
									"from_node": "nir"
								},
								"y": {
									"from_node": "red"
								}
							},
							"result": true
						}
					}
				},
				"dimension": "bands"
			}
		},
		"reduce": {
			"process_id": "reduce_dimension",
			"arguments": {
				"data": {
					"from_node": "diff"
				},
				"reducer": {
					"process_graph": {
						"min": {
							"arguments": {
								"data": {
									"from_parameter": "data"
								}
							},
							"process_id": "min",
							"result": true
						}
					}
				},
				"dimension": "t"
			}
		},
		"scale": {
			"process_id": "apply",
			"arguments": {
				"data": {
					"from_node": "reduce"
				},
				"process": {
					"process_graph": {
						"lsr": {
							"arguments": {
								"x": {
									"from_parameter": "x"
								},
								"inputMin": -1,
								"inputMax": 1,
								"outputMin": 0,
								"outputMax": 255
							},
							"process_id": "linear_scale_range",
							"result": true
						}
					}
				}
			}
		},
		"save": {
			"process_id": "save_result",
			"arguments": {
				"data": {
					"from_node": "scale"
				},
				"format": "GTIFF"
			},
			"result": true
		}
	}
}

Process specifications incomplete

I just tried to use the back-end and figured out that the processes are often missing schemas or the parameter_order although they have them in the official definitions.

save_result for example looks like this in this back-end:

{
      "description": "Save processed data to storage or export to http.",
      "id": "save_result",
      "name": "save_result",
      "parameters": {
        "data": {
          "description": "The data to save.",
          "required": true,
          "schema": {}
        },
        "format": {
          "description": "The file format to save to. It must be one of the values that the server reports as supported output formats, which usually correspond to the short GDAL/OGR codes. This parameter is case insensitive.",
          "required": true,
          "schema": {}
        },
        "options": {
          "description": "The file format options to be used to create the file(s). Must correspond to the options that the server reports as supported options for the chosen format. The option names and valid values usually correspond to the GDAL/OGR format options.",
          "required": true,
          "schema": {}
        }
      },
      "returns": {
        "description": "Raster Data Cube",
        "schema": {
          "format": "raster-cube",
          "type": "object"
        }
      }
    }

The official definition looks like this:

{
  "id": "save_result",
  "summary": "Save processed data to storage",
  "description": "Saves processed data to the local user workspace / data store of the authenticated user. This process aims to be compatible to GDAL/OGR formats and options. STAC-compatible metadata should be stored with the processed data.\n\nCalling this process may be rejected by back-ends in the context of secondary web services.",
  "categories": [
    "cubes",
    "export"
  ],
  "parameter_order": [
    "data",
    "format",
    "options"
  ],
  "parameters": {
    "data": {
      "description": "The data to save.",
      "schema": {
        "anyOf": [
          {
            "type": "object",
            "format": "raster-cube"
          },
          {
            "type": "object",
            "format": "vector-cube"
          }
        ]
      },
      "required": true
    },
    "format": {
      "description": "The file format to save to. It must be one of the values that the server reports as supported output formats, which usually correspond to the short GDAL/OGR codes. This parameter is *case insensitive*.",
      "schema": {
        "type": "string",
        "format": "output-format"
      },
      "required": true
    },
    "options": {
      "description": "The file format options to be used to create the file(s). Must correspond to the options that the server reports as supported options for the chosen `format`. The option names and valid values usually correspond to the GDAL/OGR format options.",
      "schema": {
        "type": "object",
        "format": "output-format-options",
        "default": {}
      }
    }
  },
  "returns": {
    "description": "`false` if saving failed, `true` otherwise.",
    "schema": {
      "type": "boolean"
    }
  },
  "links": [
    {
      "rel": "about",
      "href": "https://www.gdal.org/formats_list.html",
      "title": "GDAL Raster Formats"
    },
    {
      "rel": "about",
      "href": "https://www.gdal.org/ogr_formats.html",
      "title": "OGR Vector Formats"
    }
  ]
}

Why have they been changed? The schema and parameter order are important for clients to automatically generate methods. In the Web Editor, for example, the VITO back-end is much less user-friendly by default:
[screenshot of the Web Editor with the reduced process specification]

If you responded with the full process specification, it would look like this by default:
[screenshot of the Web Editor with the full process specification]

cryptic error message upon filtering for out of bounds temporal array

When a specific temporal extent is loaded and then filtered (filter_temporal) with an array that lies completely out of bounds, the error message "minKey" is returned. This is not very meaningful; a more descriptive message could decrease debugging time by a lot.

Example ProcessGraph:

{
  "process_graph": {
    "load_collection_MLVVS5955J": {
      "arguments": {
        "bands": [
          "VV"
        ],
        "id": "TERRASCOPE_S1_SLC_COHERENCE_V1",
        "spatial_extent": {
          "west": 5.55281639099121,
          "south": 50.62333062064343,
          "east": 5.601739883422852,
          "north": 50.65076659491649
        },
        "temporal_extent": [
          "2018-02-10",
          "2019-03-10"
        ]
      },
      "process_id": "load_collection"
    },
    "filter_temporal_GBOXV6689F": {
      "arguments": {
        "data": {
          "from_node": "load_collection_MLVVS5955J"
        },
        "extent": [
          "2020-02-18",
          "2020-02-23"
        ]
      },
      "process_id": "filter_temporal"
    },
    "reduce_dimension_QGCXA5614G": {
      "arguments": {
        "context": null,
        "data": {
          "from_node": "filter_temporal_GBOXV6689F"
        },
        "dimension": "t",
        "reducer": {
          "process_graph": {
            "1": {
              "process_id": "mean",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        }
      },
      "process_id": "reduce_dimension"
    },
    "save_result_MEEZS2624X": {
      "arguments": {
        "data": {
          "from_node": "reduce_dimension_QGCXA5614G"
        },
        "format": "NetCDF",
        "options": {}
      },
      "process_id": "save_result",
      "result": true
    }
  }
}

ProcessGraphComplexity when setting up secondary service

I am trying to set up a secondary service for a cube without spatial bounds (because the spatial bounds will come from the WMTS usage), but the backend doesn't let me do that:

OpenEoApiError: [400] ProcessGraphComplexity: The process graph is too complex for for synchronous processing. Please use a batch job instead.

Fix several issues reported by the validator

@bgoesswe deployed the Go validation tool for back-ends. The results for the geopyspark back-end can be found here:
https://www.geo.tuwien.ac.at/openeoct/backend/validate/3

There are several issues reported:

  • http://openeo.vgt.vito.be/openeo/0.4.0 returns something different than http://openeo.vgt.vito.be/openeo/0.4.0/
  • title in GET / is missing
  • Your collections contain invalid datetimes (i.e. only the date, but missing the time). Make sure they are RFC3339 compliant (e.g. 2019-01-01T12:00:00Z)
  • GET /jobs returns "Usage: Create a new batch processing job using POST" instead of a list of jobs (but it is advertised in GET / to be supported).
  • GET /output_formats seems to still return a response compatible only to API v0.3. Basically, just move the value for formats to the root level and remove the property default.
  • GET /services returns an invalid response.

API spec incompatibilities

This list became a bit extensive, I just wrote down everything that I spotted, hope it helps :)


Most important to get the Hub working:

  • GET /collections doesn't wrap the collections array into a collections property and misses links property. Output should be like this:
{
    "collections": [
        <array here as you already output it>
    ],
    "links": [
        <required, but can be empty>
    ]
}
  • GET /processes: same as above: array that is currently returned should be wrapped into processes property and required links property is missing

(The links are not really needed by the Hub, but I mentioned that here since while you're at it you can just add them too.)


Nice to add more content to the Hub:

  • GET /output_formats and GET /service_types are not listed in /'s endpoints array

Those endpoints exist and are spec-compliant (:+1:), but not added to the Hub yet because they're not listed in the capabilities :(


Needed to display collections correctly:

  • The second collection in GET /collections looks alright-ish, but the others contain fewer properties. Can they all look the same?

From now on I'm referring to what the second collection looks like.

  • Non-spec data_id and product_id are present (which is okay), but not the required name property (which is really needed!)
  • extent property should be object with two sub-properties spatial and temporal
  • that spatial property needs a bounding box in array form
  • that temporal property should contain what is currently in time

That should get the rendering working.
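
Putting the points above together, a hedged sketch of what a single collection entry is expected to look like (values are illustrative, not taken from the actual catalog):

collection = {
    "name": "CGS_SENTINEL2_RADIOMETRY_V102_001",    # required, in addition to data_id/product_id
    "extent": {
        "spatial": [2.05, 50.99, 4.06, 51.01],      # bounding box in array form
        "temporal": ["2015-07-06T00:00:00Z", None], # what is currently under "time"
    },
    "license": "proprietary",  # required, see below
    "links": [],               # required, see below
}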

For completeness (the Hub doesn't really care about those properties (yet), but FYI):

  • required license property missing
  • required links property missing

Needed to display processes correctly:

  • process identifier should be named name instead of process_id
  • required returns property is missing

Band filtering does not work

Through openeo-client I am trying to run the following code:

dataCollection = (
    openeo.connect(url)
    .load_collection('TERRASCOPE_S2_TOC_V2')
    .filter_temporal('2019-01-01', '2019-01-10')
    .filter_bbox(crs="EPSG:4326", **dict(zip(["west", "south", "east", "north"], bbox)))
    .filter_bands(["TOC-B02_10M", "TOC-B04_10M", "TOC-B08_10M"])
    .apply_dimension(utils.load_udf('udf_vito_save_to_public.py'), dimension='t', runtime="Python")
    .execute_batch("tmp/batchtest.json", job_options=job_options)
)

But I get the following exception:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): openeo-dev.vgt.vito.be:80
DEBUG:urllib3.connectionpool:http://openeo-dev.vgt.vito.be:80 "GET /openeo/0.4.0/ HTTP/1.1" 200 1721
DEBUG:urllib3.connectionpool:http://openeo-dev.vgt.vito.be:80 "GET /openeo/0.4.0/credentials/basic HTTP/1.1" 200 58
DEBUG:urllib3.connectionpool:http://openeo-dev.vgt.vito.be:80 "GET /openeo/0.4.0/collections/TERRASCOPE_S2_TOC_V2 HTTP/1.1" 200 2056
DEBUG:urllib3.connectionpool:http://openeo-dev.vgt.vito.be:80 "POST /openeo/0.4.0/result HTTP/1.1" 500 2229
Traceback (most recent call last):
File "/home/banyait/eclipse-workspace/openeo-usecases/multisource_phenology_usecase/multisource_phenology_2_usecase.py", line 64, in
.apply_dimension(utils.load_udf('udf_vito_save_to_public.py'),dimension='t',runtime="Python")
File "/home/banyait/eclipse-workspace/openeo-python-client/openeo/rest/imagecollectionclient.py", line 1067, in execute
return self.session.execute(newbuilder.processes)
File "/home/banyait/eclipse-workspace/openeo-python-client/openeo/rest/connection.py", line 448, in execute
return self.post(path="/result", json=req).json()
File "/home/banyait/eclipse-workspace/openeo-python-client/openeo/rest/connection.py", line 134, in post
return self.request("post", path=path, json=json, **kwargs)
File "/home/banyait/eclipse-workspace/openeo-python-client/openeo/rest/connection.py", line 93, in request
self._raise_api_error(resp)
File "/home/banyait/eclipse-workspace/openeo-python-client/openeo/rest/connection.py", line 113, in _raise_api_error
raise exception
openeo.rest.connection.OpenEoApiError: [500] unknown: Traceback (most recent call last):
File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000572/pyspark.zip/pyspark/worker.py", line 253, in main
process()
File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000572/pyspark.zip/pyspark/worker.py", line 248, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000572/pyspark.zip/pyspark/serializers.py", line 140, in dump_stream
for obj in iterator:
File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000572/pyspark.zip/pyspark/util.py", line 55, in wrapper
return f(*args, **kwargs)
File "/data3/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000002/venv/lib64/python3.6/site-packages/openeogeotrellis/GeotrellisImageCollection.py", line 244, in tilefunction
File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000572/venv/lib64/python3.6/site-packages/openeogeotrellis/GeotrellisImageCollection.py", line 212, in _tile_to_datacube
the_array = xr.DataArray(bands_numpy, coords=coords,dims=dims,name="openEODataChunk")
File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000572/venv/lib64/python3.6/site-packages/xarray/core/dataarray.py", line 281, in init
coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1590661099118_0334/container_e4867_1590661099118_0334_01_000572/venv/lib64/python3.6/site-packages/xarray/core/dataarray.py", line 104, in _infer_coords_and_dims
'coordinate %r' % (d, sizes[d], s, k))
ValueError: conflicting sizes for dimension 'bands': length 3 on the data but length 9 on coordinate 'bands'

I believe what happens is that the coordinate 'bands' (holding the band names) is not reduced to the filtered band names.

Problems with "spatial_extent" parameter in `load_collection` process

I just tried to recreate the min evi example with the r-client and got an error in load_collection:

Invalid Extent: ymin must be less than ymax (ymin=48.6, ymax=16.6)

I have used the order west, east, north, south in that case. Changing it to xmin, xmax, ymin, ymax solved my problem for now, but this was neither documented in the process description nor is it part of the core processes definition.

Mask process not working

I have just tested the mask process as discussed in the dev meeting. It throws an error: "An error occured while calling RasterMask..." (the rest disappears too quickly to read).
The process graph up to "reduce_dimension" works.
Here is the visual process graph:
[screenshot of the visual process graph]

Here is the JSON:

{
  "process_graph": {
    "1": {
      "arguments": {
        "bands": null,
        "id": "TERRASCOPE_S2_NDVI_V2",
        "spatial_extent": {
          "east": 5.67950128534149,
          "north": 51.98561668073381,
          "south": 51.960603275339736,
          "west": 5.6154620713870225
        },
        "temporal_extent": [
          "2018-01-01T00:00:00Z",
          "2018-02-01T23:59:59Z"
        ]
      },
      "process_id": "load_collection"
    },
    "2": {
      "arguments": {
        "data": {
          "from_node": "4"
        },
        "process": {
          "process_graph": {
            "1": {
              "arguments": {
                "x": {
                  "from_parameter": "x"
                },
                "y": 100
              },
              "process_id": "gt",
              "result": true
            }
          }
        }
      },
      "process_id": "apply"
    },
    "3": {
      "arguments": {
        "data": {
          "from_node": "4"
        },
        "mask": {
          "from_node": "2"
        },
        "replacement": 0
      },
      "process_id": "mask"
    },
    "4": {
      "arguments": {
        "data": {
          "from_node": "1"
        },
        "dimension": "t",
        "reducer": {
          "process_graph": {
            "1": {
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "process_id": "mean",
              "result": true
            }
          }
        }
      },
      "process_id": "reduce_dimension"
    },
    "5": {
      "arguments": {
        "data": {
          "from_node": "3"
        },
        "format": "GTIFF"
      },
      "process_id": "save_result",
      "result": true
    }
  }
}

Version parameter should be named "api_version"

With v0.4, the version parameter in the capabilities endpoint (/) was renamed to api_version -- could you update this in your implementation? It's a very minor incompatibility with the very major effect of blocking the JS client from working with the VITO deployment 🙂 (the JS client needs the attribute to detect the correct version). Fixing this is not urgent at all, but it would be a lot of "value for money" 😉

Support for CORS

We try to access the GeoPySpark back-end for a presentation next week, but unfortunately it doesn't send CORS headers so the JS client (in a browser environment) and the Web Editor can't access it:

Access to XMLHttpRequest at 'http://openeo.vgt.vito.be/openeo/' from origin 'http://editor.openeo.org' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

It would be appreciated to get CORS support. For more information see https://open-eo.github.io/openeo-api/v/0.3.1/cors/
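
Since the REST layer is Flask-based (see the introduction above), enabling CORS could be as simple as using the flask-cors extension; a minimal sketch, not tied to the actual app module of this driver:

from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # adds Access-Control-Allow-Origin headers to all responses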

Problem executing a process graph

I tried to calculate the min evi example using the r-client with this graph. I got a rather cryptic message as error from the server:

'list' object has no attribute 'reduce'

I'm not sure whether I got something wrong in my graph or where to search for the problem in the graph.

GeopysparkDataCube

def apply(self, process: str, arguments: dict = {}) -> 'GeopysparkDataCube':
    from openeogeotrellis.backend import SingleNodeUDFProcessGraphVisitor, GeoPySparkBackendImplementation
    if isinstance(process, dict):
        apply_callback = GeoPySparkBackendImplementation.accept_process_graph(process)
        # apply should leave metadata intact, so can do a simple call?
        return self.reduce_bands(apply_callback)

The openEO apply process should be a pixel-level (local) transformation, but reduce_bands reduces the band dimension, so there is something wrong with the above, right, @jdries?
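
For contrast, a minimal numpy sketch of the two semantics: apply keeps the cube shape and evaluates a function per pixel, while a band reduce collapses the band axis.

import numpy as np

cube = np.random.rand(3, 4, 4)      # (bands, y, x)

applied = cube * 0.0001             # apply: per-pixel, shape unchanged (3, 4, 4)
reduced = cube.mean(axis=0)         # reduce over bands: shape (4, 4)

assert applied.shape == cube.shape
assert reduced.shape == cube.shape[1:]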

Make band filtering more consistent

The implementation of band filtering and handling is confusing at the moment.

in GeoPySparkLayerCatalog.load_collection:

  • s3_jp2_pyramid, file_pyramid, and some others do handle band filtering immediately, while accumulo_pyramid and s3_pyramid do not
  • at the end there is an additional image_collection.band_filter(band_indices), which will probably fail for the cases where band filtering already happened

in GeotrellisTimeSeriesImageCollection:

  • there is some stuff going on with _band_index which only considers the first band of the list; this is probably a temporary hack that has to be eliminated
  • band_filter does a special case for source type "file", but that is probably not correct anymore (now that there are multiple source types that handle band filtering up front)

Paths in capabilities should not include the base URL of the API

Today I tried to include this backend in the openEO Hub, but the crawling process failed because the crawler didn't recognise the backend as supporting any features. This happened because the paths in the capability document's endpoints attribute include the base URL of the API (e.g. /openeo/0.3.0/collections instead of /collections), so the js-client's hasFeature method couldn't detect any endpoints that it is familiar with.

The corresponding API spec entry says that the paths should be the "Path to the endpoint, starting from the base url of the API" -- I see that this can be understood one or the other way, but it's meant to be "starting after the base url of the API". Matthias just made this clearer in the docs.
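
In other words, a hedged sketch of an endpoints entry as the Hub and js-client expect it (method lists are illustrative):

endpoints = [
    {"path": "/collections", "methods": ["GET"]},        # relative to the base URL
    # not: {"path": "/openeo/0.3.0/collections", ...}    # what the backend currently returns
]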

Process "array_element" only accepts numbers for "label" parameter

Trying to run a process graph that contains

"reducer": {
    "process_graph": {
        "red": {
            "process_id": "array_element",
            "arguments": {
                "data": {
                    "from_parameter": "data"
                },
                "label": "B04"
            }
        },
        "nir": {
            "process_id": "array_element",
            "arguments": {
                "data": {
                    "from_parameter": "data"
                },
                "label": "B08"
            }
        },

I receive an error:

Expecting numeric value for 'label' but got 'B08'

But according to the /processes endpoint, both strings and numbers are valid:

[screenshot of the array_element specification from /processes]

Or as rendered in the Hub:

[screenshot of the process as rendered in the Hub]

Failed to start the server

When I run "python openeogeotrellis/deploy/local.py", I get the error below:

[2020-12-03 20:37:27,804] 13669 INFO in openeo-geotrellis-local: Created Spark Context <SparkContext master=local[*] appName=openeo-geotrellis-local>
[2020-12-03 20:37:27,830] 13669 INFO in openeo-geotrellis-local: Spark web UI: http://localhost:4040/
[2020-12-03 20:37:28,274] 13669 WARNING in openeo_driver.ProcessGraphDeserializer: Adding process 'pi' without implementation
[2020-12-03 20:37:28,278] 13669 WARNING in openeo_driver.ProcessGraphDeserializer: Adding process 'e' without implementation
[2020-12-03 20:37:28,289] 13669 INFO in openeo_driver.backend: Using driver implementation package openeogeotrellis
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
[2020-12-03 20:37:58,364] 13669 WARNING in kazoo.client: Cannot resolve epod-master1.vgt.vito.be: [Errno 8] nodename nor servname provided, or not known
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
Debugger failed to attach: handshake failed - received >GET / HTTP/1.1< - expected >JDWP-Handshake<
[2020-12-03 20:38:28,367] 13669 WARNING in kazoo.client: Cannot resolve epod-master2.vgt.vito.be: [Errno 8] nodename nor servname provided, or not known
[2020-12-03 20:38:58,369] 13669 WARNING in kazoo.client: Cannot resolve epod-master3.vgt.vito.be: [Errno 8] nodename nor servname provided, or not known
Traceback (most recent call last):
  File "openeogeotrellis/deploy/local.py", line 104, in <module>
    from openeo_driver.views import app, build_backend_deploy_metadata
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/openeo_driver/views.py", line 20, in <module>
    from openeo_driver.ProcessGraphDeserializer import evaluate, get_process_registry
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 128, in <module>
    backend_implementation = get_backend_implementation()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/openeo_driver/backend.py", line 420, in get_backend_implementation
    _backend_implementation = module.get_openeo_backend_implementation()
  File "/Users/phil/openeo/openeo-geopyspark-driver-master/openeogeotrellis/__init__.py", line 12, in get_openeo_backend_implementation
    return GeoPySparkBackendImplementation()
  File "/Users/phil/openeo/openeo-geopyspark-driver-master/openeogeotrellis/backend.py", line 224, in __init__
    else ZooKeeperServiceRegistry()
  File "/Users/phil/openeo/openeo-geopyspark-driver-master/openeogeotrellis/service_registry.py", line 121, in __init__
    with self._zk_client() as zk:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/phil/openeo/openeo-geopyspark-driver-master/openeogeotrellis/service_registry.py", line 201, in _zk_client
    zk.start()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/kazoo/client.py", line 560, in start
    raise self.handler.timeout_exception("Connection time-out")
kazoo.handlers.threading.KazooTimeoutError: Connection time-out

So how can I fix this?
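
Not a proper fix, but a hedged workaround based on the is_ci_context check quoted in the "Add config to toggle on/off zookeeper usage" issue above: setting one of the recognised environment variables before launching may make the backend fall back to the in-memory service registry instead of timing out on the VITO Zookeeper hosts. Whether local.py actually honours that check is an assumption here.

import os
import runpy

os.environ["TRAVIS"] = "1"  # same variable checked by is_ci_context (assumed to apply here)
runpy.run_path("openeogeotrellis/deploy/local.py", run_name="__main__")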

aggregate_temporal not specification compliant, not working

Upon executing a simple process with aggregate_temporal (e.g. via the online editor, example given below), an error is thrown and the execution is stopped. When using ["2019-07-03T00:00:00Z", "2019-07-17T23:59:59Z"] (not specification compliant) instead of [ ["2019-07-03T00:00:00Z", "2019-07-17T23:59:59Z"] ] (compliant), the process is executed, but a faulty raster seems to be returned.

process example, expected to throw array error:
{ "process_graph": { "load_collection_BEMGN6618B": { "arguments": { "bands": [ "B08" ], "id": "SENTINEL2_L2A_SENTINELHUB", "spatial_extent": { "west": 5.991754531860351, "south": 51.06869305078254, "east": 6.013212203979491, "north": 51.08141995869366 }, "temporal_extent": [ "2019-07-01T00:00:00Z", "2019-07-31T23:59:59Z" ] }, "process_id": "load_collection" }, "aggregate_temporal_YEHJA8836E": { "arguments": { "context": null, "data": { "from_node": "load_collection_BEMGN6618B" }, "intervals": [ [ "2019-07-03T00:00:00Z", "2019-07-17T23:59:59Z" ] ], "labels": [], "reducer": { "process_graph": { "1": { "process_id": "max", "arguments": { "data": { "from_parameter": "data" } }, "result": true } } }, "dimension": "t" }, "process_id": "aggregate_temporal" }, "save_result_IINPH1538H": { "arguments": { "data": { "from_node": "aggregate_temporal_YEHJA8836E" }, "format": "GTiff", "options": {} }, "process_id": "save_result", "result": true } } }

Eliminate service registry and WMTS dependencies from GeotrellisTimeSeriesImageCollection

GeotrellisTimeSeriesImageCollection has a dependency on InMemoryServiceRegistry and WMTS/tms details

class GeotrellisTimeSeriesImageCollection(ImageCollection):

    def __init__(self, pyramid: Pyramid, service_registry: InMemoryServiceRegistry, metadata: CollectionMetadata = None):
        # ...
        self.tms = None
        self._service_registry = service_registry

All this baggage is there just for a single method, tiled_viewing_service, which seems a bit backward. I think it would be cleaner to refactor these details out of the GeotrellisTimeSeriesImageCollection class.

Internal Server Error on batch/preview job

Hello all,

I am trying to get the results of a batch job and preview the job.

Tested with 2 process graphs:

PG 1:

{
  "process_graph": {
    "process_id": "NDVI",
    "process_description": "Computes the normalized difference vegetation index (NDVI) for all pixels and time slices of the input dataset.",
    "imagery": {
      "process_id": "filter_daterange",
      "process_description": "Creates a subset in including only values inside a given data range.",
      "imagery": {
        "process_id": "filter_bbox",
        "process_description": "Creates a subset in including only values inside a given bounding box.",
        "imagery": {
          "process_id": "get_collection",
          "process_description": "Finds the Collection to be processed.",
          "name": "CGS_SENTINEL2_RADIOMETRY_V102_001"
        },
        "extent": {
          "west": 10.9840519,
          "east": 11.2523813,
          "north": 46.5853562,
          "south": 46.7603795
        }
      },
      "extent": [
        "2018-01-27T11:13:29.024Z",
        "2018-04-28T11:13:29.024Z"
      ]
    },
    "red": "B04",
    "nir": "B8A"
  }
}

PG 2:

{
  "process_graph": {
    "imagery": {
      "red": "B4",
      "nir": "B8A",
      "imagery": {
        "extent": [
          "2018-01-27T11:13:29.024Z",
          "2018-02-28T11:13:29.024Z"
        ],
        "imagery": {
          "extent": {
            "west": 10.9840519,
            "east": 11.2523813,
            "north": 46.5853562,
            "south": 46.7603795
          },
          "imagery": {
            "process_id": "get_collection",
            "name": "CGS_SENTINEL2_RADIOMETRY_V102_001"
          },
          "process_id": "filter_bbox"
        },
        "process_id": "filter_daterange"
      },
      "process_id": "NDVI"
    },
    "process_id": "max_time"
  }
}

Response on http://openeo.vgt.vito.be/openeo/0.4.0/preview :

HTTP/1.1 500 Internal Server Error
Date: Tue, 12 Mar 2019 14:45:00 GMT
Server: nginx/1.10.3
Content-Type: application/json
Content-Length: 60
Access-Control-Allow-Origin: *
Connection: close

{
  "message": "'NoneType' object has no attribute 'get'"
}

Response on http://openeo.vgt.vito.be/openeo/0.4.0/jobs

HTTP/1.1 201 Created
Date: Tue, 12 Mar 2019 14:46:09 GMT
Server: nginx/1.10.3
Content-Type: text/html; charset=utf-8
Content-Length: 0
Location: http://openeo.vgt.vito.be/openeo/0.4.0/jobs/d3c4c928-1c17-4ac9-bcdd-ea4a740aeea2
Access-Control-Allow-Origin: *
Connection: close

Response on http://openeo.vgt.vito.be/openeo/0.4.0/jobs/d3c4c928-1c17-4ac9-bcdd-ea4a740aeea2/results :

HTTP/1.1 500 Internal Server Error
Date: Tue, 12 Mar 2019 14:46:50 GMT
Server: nginx/1.10.3
Content-Type: application/json
Content-Length: 371
Access-Control-Allow-Origin: *
Connection: close

{
  "message": "Command '['./submit_batch_job.sh', 'OpenEO batch job d3c4c928-1c17-4ac9-bcdd-ea4a740aeea2', '/mnt/ceph/Projects/OpenEO/d3c4c928-1c17-4ac9-bcdd-ea4a740aeea2/in', '/mnt/ceph/Projects/OpenEO/d3c4c928-1c17-4ac9-bcdd-ea4a740aeea2/out', '[email protected]', 'mep_tsviewer.keytab-799e0a26-f2fa-423b-a372-7b262fd6cd8c']' returned non-zero exit status 1"
}

Update process graph handling in POST/GET/PATCH requests to 1.0 API

For batch jobs (/jobs), services (/services) and sync. processing (/result) the property process_graph got replaced by process. It contains a process graph and optionally all process metadata. #260

also see Open-EO/openeo-api#260

methods and paths that have to be addressed:

  • POST /result request
  • PUT /process_graphs/{process_graph_id} request
  • POST /services request
  • PATCH /services/{service_id} request
  • GET /services/{service_id} response
  • POST /jobs request
  • PATCH /jobs/{job_id} request
  • GET /jobs/{job_id} response

also see python client: Open-EO/openeo-python-client#129 and python driver Open-EO/openeo-python-driver#34

add file name extension to result assets

With VITO geopyspark backend, the standard batch jobs have a single output file, which is called "out" in the result listing (GET /jobs/{jobid}/results)

{'assets': {'out': {'href': 'http://openeo-dev.vgt.vito.be/openeo/1.0/jobs/57da31da-7fd4-463a-9d7d-c9c51646b6a4/results/out',
   'type': 'application/octet-stream'}},

It would be better for the user experience when downloading to add a proper extension (.tiff, .json, ...).
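
A minimal sketch of how an extension could be derived from the output format when building the asset name (mapping and function name are illustrative):

FORMAT_EXTENSIONS = {"GTIFF": ".tiff", "NETCDF": ".nc", "JSON": ".json", "PNG": ".png"}

def asset_filename(output_format: str, base: str = "out") -> str:
    # Fall back to no extension for unknown formats.
    return base + FORMAT_EXTENSIONS.get(output_format.upper(), "")

print(asset_filename("GTiff"))  # out.tiff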

array_element not exposed?

It seems array_element is usable in process graphs, but /processes doesn't expose this process. This makes the EVI example fail for the R client and Web Editor.

cc @lforesta

Avoid heavy __init__.py

openeogeotrellis/__init__.py is quite heavy: a lot of functions are defined in it, it imports a lot of implementation details, and it triggers some Spark-related initializations.

This causes some issues: e.g. a high risk of circular import problems, and undesired side effects of importing openeogeotrellis that make code reuse harder.

It would be better to move things to subpackages and limit __init__ to imports of the core features.
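
A minimal sketch of what a slimmed-down openeogeotrellis/__init__.py could look like, keeping only the entry point that shows up in the traceback elsewhere on this page (assumption: everything else moves to submodules):

# openeogeotrellis/__init__.py (sketch)

def get_openeo_backend_implementation():
    # Import lazily so that "import openeogeotrellis" stays cheap and free of
    # Spark-related side effects.
    from openeogeotrellis.backend import GeoPySparkBackendImplementation
    return GeoPySparkBackendImplementation()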

(related to Open-EO/openeo-python-driver#7)

filter_temporal should be left-closed

@lforesta noted that filter_temporal (and the related argument in load_collection) on the VITO backend does not exclude the end date of the data selection as specified in the process specification https://open-eo.github.io/openeo-api/processreference/#filter_temporal:

extent
Left-closed temporal interval, i.e. an array with exactly two elements:

  1. The first element is the start of the date and/or time interval. The specified instance in time is included in the interval.
  2. The second element is the end of the date and/or time interval. The specified instance in time is excluded from the interval.
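
In other words, the interval is half-open; a minimal sketch of the intended check:

from datetime import datetime

def in_temporal_extent(timestamp: datetime, start: datetime, end: datetime) -> bool:
    # Left-closed interval: start is included, end is excluded.
    return start <= timestamp < end

print(in_temporal_extent(datetime(2019, 1, 10), datetime(2019, 1, 1), datetime(2019, 1, 10)))  # False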

support generic reducers in aggregate_spatial (aka zonal_statistics)

aggregate_spatial/zonal_statistics in openeo-geopyspark-driver (and openeo-python-driver) assumes the provided reducer is a simple one-process reducer, and only supports: 'histogram', 'sd', 'median' or 'mean'

(also: I think it would be cleaner if we could also remove the usage of old-style "zonal_statistics" and use 1.0-style "aggregate_spatial" in more places)

band_filter() process is executed twice

While trying to figure out the weird band_filter() behaviour discussed in Open-EO/openeo-python-client#76 I found that the band_filter logic is executed twice.

Take this simple example:

im = session.imagecollection("CGS_SENTINEL2_RADIOMETRY_V102_001")
im = im.band_filter(0)

When you execute/download this, the band filter logic is first executed when loading the image collection from openeogeotrellis.getImageCollection:

return image_collection.band_filter(band_indices) if band_indices else image_collection

The second time is when "applying" the band_filter process in openeo_driver.filter_bands:
https://github.com/Open-EO/openeo-python-driver/blob/0133e6160fc347e7e91ace2bca90ef5e58b0543a/openeo_driver/ProcessGraphDeserializer.py#L376-L379

Travis build fails due to Kryo serialisation issue

Latest travis-ci builds are failing with (from https://travis-ci.org/github/Open-EO/openeo-geopyspark-driver/builds/675374224):

INTERNALERROR>   File "/home/travis/build/Open-EO/openeo-geopyspark-driver/tests/conftest.py", line 14, in pytest_configure
INTERNALERROR>     _setup_local_spark(terminal_reporter, verbosity=config.getoption("verbose"))
INTERNALERROR>   File "/home/travis/build/Open-EO/openeo-geopyspark-driver/tests/conftest.py", line 58, in _setup_local_spark
INTERNALERROR>     answer = context.parallelize([9, 10, 11, 12]).sum()
INTERNALERROR>   File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/pyspark/rdd.py", line 1044, in sum
INTERNALERROR>     return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
INTERNALERROR>   File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/pyspark/rdd.py", line 915, in fold
INTERNALERROR>     vals = self.mapPartitions(func).collect()
INTERNALERROR>   File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/pyspark/rdd.py", line 814, in collect
INTERNALERROR>     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
INTERNALERROR>   File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1305, in __call__
INTERNALERROR>     answer, self.gateway_client, self.target_id, self.name)
INTERNALERROR>   File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
INTERNALERROR>     format(target_id, ".", name), value)
INTERNALERROR> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
INTERNALERROR> : org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
INTERNALERROR> org.apache.spark.SparkException: Failed to register classes with Kryo
INTERNALERROR> 	at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:140)
INTERNALERROR> 	at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:324)
INTERNALERROR> 	at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:309)
INTERNALERROR> 	at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:218)
INTERNALERROR> 	at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:288)
INTERNALERROR> 	at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
INTERNALERROR> 	at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
INTERNALERROR> 	at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
INTERNALERROR> 	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
INTERNALERROR> 	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1482)
INTERNALERROR> 	at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1039)
