locationtech / geowave
GeoWave provides geospatial and temporal indexing on top of Accumulo, HBase, BigTable, Cassandra, Kudu, Redis, RocksDB, and DynamoDB.
License: Apache License 2.0
Features should be able to be looked up directly by feature ID - i.e. a secondary index mapping feature IDs to row IDs.
Probably needed for
#16
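A minimal in-memory sketch of the idea (class and method names are hypothetical, not GeoWave API): a side table keyed by feature ID whose values are the primary-index row IDs, so a feature can be fetched by ID without scanning the spatially ordered table.

```java
// Hypothetical sketch (not GeoWave API): a secondary index mapping
// feature IDs to the row IDs under which features are stored in the
// primary, SFC-ordered index, so a feature can be fetched by ID
// without a full scan.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FeatureIdIndex {
    private final Map<String, String> featureIdToRowId = new ConcurrentHashMap<>();

    // Record the mapping when a feature is written to the primary index.
    public void put(String featureId, String rowId) {
        featureIdToRowId.put(featureId, rowId);
    }

    // Direct lookup by feature ID; returns null when the feature is unknown.
    public String lookup(String featureId) {
        return featureIdToRowId.get(featureId);
    }
}
```

In a real deployment this map would itself be a table in the key-value store, populated at ingest time.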
Specify multiple build configurations for hadoop versions, accumulo versions, geotools versions, and geoserver versions, and have travis run the different combinations.
In the readme detail this build matrix, and include status icons.
Develop a spatial (and consider temporal) benchmark utility that can be generally applicable to run against any system that stores, indexes, and retrieves spatial content. Likely this will most generally be done by keeping it separate from GeoWave and utilizing GeoTools' data store abstraction.
14/06/27 18:33:30 INFO mapred.JobClient: Task Id : attempt_201406170050_0010_m_000128_2, Status : FAILED
java.lang.IllegalStateException: getY called on empty Point
at com.vividsolutions.jts.geom.Point.getY(Point.java:131)
at mil.nga.giat.geowave.analytics.mapreduce.kde.GaussianCellMapper.incrementLevelStore(GaussianCellMapper.java:157)
at mil.nga.giat.geowave.analytics.mapreduce.kde.GaussianCellMapper.map(GaussianCellMapper.java:144)
at mil.nga.giat.geowave.analytics.mapreduce.kde.GaussianCellMapper.map(GaussianCellMapper.java:32)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
The geometryToBinary function in the GeometryUtils class is using the default constructor for the WKBWriter. This defaults to an output coordinate dimension of 2, so any z values will be ignored.
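A sketch of the fix, assuming the JTS `WKBWriter(int outputDimension)` constructor: passing 3 preserves z ordinates that the default two-dimensional writer drops. The class and method below are illustrative, not the actual GeometryUtils code.

```java
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import com.vividsolutions.jts.io.WKBWriter;

public class WkbDimensionDemo {
    // Serialize a 3D point with the given output dimension and return the
    // WKB length: 21 bytes with dimension 2 (z dropped), 29 with dimension 3.
    public static int wkbLength(int outputDimension) {
        Point p = new GeometryFactory().createPoint(new Coordinate(1.0, 2.0, 3.0));
        // new WKBWriter() defaults to an output dimension of 2 and silently
        // drops z; geometryToBinary should pass 3 when z values matter.
        return new WKBWriter(outputDimension).write(p).length;
    }
}
```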
All our data is stored in EPSG:4326 internally - data sources that aren't in this projection need to be normalized. Currently some input methods (such as the vector file ingest) don't do this. Since the CRS is known, we should normalize data not in 4326 in a centralized place (so each ingest process doesn't have to replicate the code - and also to ensure data consistency).
The current geoserver plugin dies (tomcat needs to be restarted) if the zookeeper cluster goes down and is then restarted (due to persistent zookeeper connections).
Some sort of connection pooling might also need to be enabled as there's a limit to the number of simultaneous zookeeper connections we want (due to internal collection synchronization).
Currently, GeoWave's usage of the uzaygezen library expects the resultant beginning and end of range to be inclusive, but on further inspection the end of the range is exclusive. This can result in one extra row ID, which will be filtered out by the fine-grained constraints filtering, but it should still be fixed for accurate usage.
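One way to correct for the exclusive range end is to decrement it as an unsigned big-endian key before building the row-ID range. A self-contained sketch (the method name is illustrative; the real fix would live where uzaygezen ranges are translated into row-ID ranges):

```java
import java.util.Arrays;

public class RangeEnd {
    // Convert an exclusive range end into an inclusive one by decrementing
    // the key as an unsigned big-endian integer, borrowing across bytes.
    public static byte[] toInclusive(byte[] exclusiveEnd) {
        byte[] end = Arrays.copyOf(exclusiveEnd, exclusiveEnd.length);
        for (int i = end.length - 1; i >= 0; i--) {
            if (end[i] != 0) {
                end[i]--;
                return end;
            }
            end[i] = (byte) 0xFF; // borrow from the next higher-order byte
        }
        throw new IllegalArgumentException("cannot decrement an all-zero key");
    }
}
```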
Integration tests failing:
Add accumulo namespace support where required (or validate and document if namespace.tablename prefix is needed)
Checking the existence of a locality group is causing a noticeable slowdown during ingest. Logic needs to be added to cache locality groups so that we can avoid the overhead with Accumulo. Cached locality groups should be dropped after a specified period of time.
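A sketch of such a cache, assuming a simple time-to-live policy (all names are illustrative; the real existence check would go through Accumulo's TableOperations API):

```java
import java.util.concurrent.ConcurrentHashMap;

public class LocalityGroupCache {
    // Caches when each locality group was last verified so ingest can skip
    // the Accumulo round trip until the cache entry expires.
    private final long ttlMillis;
    private final ConcurrentHashMap<String, Long> checkedAt = new ConcurrentHashMap<>();

    public LocalityGroupCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // True if a recent check exists and the expensive round trip can be skipped.
    public boolean isFresh(String localityGroup, long nowMillis) {
        Long at = checkedAt.get(localityGroup);
        return at != null && nowMillis - at <= ttlMillis;
    }

    // Record that the locality group was verified (or created) at this time.
    public void markChecked(String localityGroup, long nowMillis) {
        checkedAt.put(localityGroup, nowMillis);
    }
}
```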
Please place some explanatory and useful graphics within the placeholder sections.
Caused by: java.net.MalformedURLException: invalid url: hdfs://:8020/accumulo/lib/geowave-gt-0.8.0-SNAPSHOT-accumulo-singlejar.jar!/ (java.net.MalformedURLException: unknown protocol: hdfs)
at java.net.URL.&lt;init&gt;(URL.java:619)
at java.net.URL.&lt;init&gt;(URL.java:482)
at java.net.URL.&lt;init&gt;(URL.java:431)
at mil.nga.giat.geowave.gt.query.CqlQueryFilterIterator.initClassLoader(CqlQueryFilterIterator.java:163)
at mil.nga.giat.geowave.gt.query.CqlQueryFilterIterator.init(CqlQueryFilterIterator.java:186)
at org.apache.accumulo.core.iterators.IteratorUtil.loadIterators(IteratorUtil.java:243)
(This is where, in a static instance, we create a new classloader instance with the jars and attach it to the parent VFS classloader. This hack is required to get SPI injection working in iterator stacks.)
GeoGit is a DVCS for geospatial data. It adds the ability to track provenance and history, and to perform diffs on different data sets.
Ultimately it would be ideal to be able to store and track this information for any feature stored in GeoWave - as a first step toward this, implementing a GeoGit backend in Accumulo (leveraging GeoWave where possible) is desired.
GeoGit has a pluggable data store implementation - it looks like there's a concept of a GraphDatabase, an ObjectDatabase, and a StagingDatabase that need to be implemented. The MongoDB backend provides a good example of implementing all three of these (see [3]).
The GraphDB implementation can be done from scratch, but it looks like there are canned implementations that leverage the Blueprints API. There's a project ([4]) that implements a Blueprints API on Accumulo which may speed this up (the state of this project - stability, quality, etc. - is currently unknown).
Note that further investigation is needed to determine to what extent GeoWave and GeoGit objects can be co-mingled. It would be ideal not to duplicate any data when not strictly needed. It might also be desirable to keep historical data (diffs, versions) in a separate table to keep "current state" queries quick. Working out the appropriate direction here would be done in conjunction with this task.
There are two groups on the GeoGit dev list working on an HBase (may be migrating to something else - Ceph?) object store as well as spatial indexing - currently most of this work seems pretty rudimentary, but it might be worth keeping an eye on.
[1] GeoGit: https://github.com/boundlessgeo/GeoGit
[2] DevDocs: https://github.com/boundlessgeo/GeoGit/blob/master/doc/technical/source/developers.rst
[3] MongoDB implementation: https://github.com/boundlessgeo/GeoGit/tree/master/src/storage/mongo
[4] Accumulo-Blueprints project: https://github.com/mikelieberman/blueprints-accumulo-graph
[5] Group working NoSQL object database (Hbase - now Ceph?): http://geogitobjdb.blogspot.com/
[6] GeoGit spatial index discussion: https://groups.google.com/a/boundlessgeo.com/forum/#!searchin/geogit/spatial$20index/geogit/9yVQAFL4n4I/VZDFrCsh3kgJ
[7] GeoGit discussion group: https://groups.google.com/a/boundlessgeo.com/forum/#!forum/geogit
Create a web page / back-end service that exposes the vector file ingest, as well as other ingesters (GPX, etc.), to allow web-based submission of GIS data (think GeoJSON, shapefile, etc.). The ingest service should be usable directly as well (i.e. it shouldn't require the web page).
This should also interact with the geoserver api to automate, or at least simplify, the publishing of data stores and layers.
The GeoLife dataset (see the ingester in the current geolife branch; data at http://research.microsoft.com/en-us/projects/geolife/ ) has longitude values that go up to 400. Dateline wrapping isn't handled properly for these values.
We should handle this on the parsing side, as the "meaning" of EPSG:4326 values outside the -180/+180 range is undefined, or rather defined by the convention of the data set.
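For reference, wrapping out-of-range longitudes into [-180, 180) is simple arithmetic, assuming plain dateline wrapping is the intended convention (which, per the note above, would need confirming against the data set):

```java
public class LongitudeNormalizer {
    // Wrap an arbitrary longitude into the conventional EPSG:4326 range
    // [-180, 180); e.g. the GeoLife value 400 wraps to 40.
    public static double normalize(double lon) {
        return ((lon + 180.0) % 360.0 + 360.0) % 360.0 - 180.0;
    }
}
```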
JAI is getting bundled as a dependency, but if it isn't installed it causes an exception when the native libraries aren't found (causing GeoServer not to load in some cases, depending on the plugins installed).
The libraries should already be on the classpath ($JRE_HOME/lib/ext), so there is no need to bundle them.
create a geowave-utils project with main methods to do convenience functions:
This will allow us to only multiplex SFC query ranges across the tiers that actually have data.
The integration tests fill the output log with errors regarding a non-closed shapefile.
servlet or servlet-api is being included in the shaded tomcat jar, which causes tomcat to refuse to load the geoserver plugin (and geoserver as a result).
The geowave-ingest project has a class, VectorFileIngest, which ingests supported geotools datastore formats into geowave.
The test will ingest reasonably large point and line temporal datasets within default spatial and spatial-temporal indices and test that query results match expected results to give a good indication that the entire system works as expected. This can be useful for verifying a system is set up correctly and for functional regression testing as new features are added.
Finish implementing required geotools datastore methods for WFS-T functionality.
Probably has a dependency on #17
This is a feature that would be nice to have in the future. The hope is that we will be able to leverage some of the work done for #17
Write a PDAL plugin (read/write) that allows persistence and query of point clouds in GeoWave.
See #13 - use the same technique chosen there (RPC vs. JNI) to bridge the PDAL C++ interface with Java.
[1] https://github.com/PDAL/PDAL
[2] http://www.pdal.io/docs.html
[3] http://osgeo-org.1560.x6.nabble.com/pdal-Feedback-on-driver-development-td4680397.html
The only apparent impact currently is log spam, but it needs to be fixed.
19 Jun 06:24:32 WARN [transport.TIOStreamTransport] - Error closing output stream.
java.io.IOException: The stream is closed
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:115)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at org.apache.thrift.transport.TIOStreamTransport.close(TIOStreamTransport.java:110)
at org.apache.thrift.transport.TFramedTransport.close(TFramedTransport.java:89)
at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.close(ThriftTransportPool.java:289)
at org.apache.accumulo.core.client.impl.ThriftTransportPool.returnTransport(ThriftTransportPool.java:570)
at org.apache.accumulo.core.util.ThriftUtil.returnClient(ThriftUtil.java:115)
at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:693)
at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:361)
at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
at java.lang.Thread.run(Thread.java:745)
The Data Adapter and the Index configuration are persisted within the metadata as values that are only intended for internal system use, but they are not given a visibility.
Ideally we will use Travis CI to run integration and unit tests prior to merging any changes into the master. It will also be nice to automatically build javadocs to be committed to our gh-pages branch for living documentation.
There are a few deprecated methods in 1.5 that are dropped in 1.6 (InputFormatBase methods, etc.). The default is to support 1.5.x and 1.6.x, so bias method choice toward that.
This is particularly for a performance improvement for GeoServer's "Compute from data" in the "Edit Layer" page.
It's published now -
http://search.maven.org/#browse|1859425327
Here's the commit that got wiped out from the latest ingest framework:
bc795d5
Load point clouds (3D / 3D + temporal) into GeoWave.
Investigation on access efficiency
Access mechanism: look to #14 as one of the potential use cases
In 1.6, Accumulo has a namespace option.
In GeoWave we use the term namespace to refer to a dataset.
Ensure that wherever a namespace value is required or displayed we are clear about which type of namespace it is.
At first cut geowave examples should
The documentation in the gh-pages branch should be updated to explain the above as well.
Current visibility options are determined by a GEOWAVE_VISIBILITY attribute of a feature.
Each layer should define its own visibility criteria.
The visibility metadata should be maintained in zookeeper.
The visibility metadata should be associated with each adapter (typename).
The metadata includes the attribute name and the parser. Currently, there is only one: JsonDefinitionColumnVisibilityManagement. These options are provided to the FeatureDataAdaptor in its constructor as called by GeoWaveGTDataStore#createSchema.
Somehow, the GeoWaveGTDataStore must compile and maintain the metadata. The visibility page accesses the data through the data store.
Of note: each GT Data Store instance is associated with a workspace and has its own set of associated layers. At the moment, namespace issues can be resolved by having layers with unique names. However, the developer should consider the possibility that two data store instances have layers with the same name. I do not think this is possible or realistic. Thus, metadata could simply be indexed by typeName.
Some operations to support are:
Currently, our GeoTools data store will create a SpatialQuery object for all queries against any index. In particular we want to be able to utilize the spatial-temporal index if both spatial and temporal bounds are given, but this could be generally useful for querying by any property/dimension, and an index will not work if some bounds are not provided for an indexed field.
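The decision described above can be sketched as a simple rule (names are illustrative, not the GeoTools/GeoWave API):

```java
public class IndexChooser {
    public enum Index { SPATIAL, SPATIAL_TEMPORAL }

    // Prefer the spatial-temporal index only when both kinds of bounds are
    // present; an index cannot be used when some of its dimensions are
    // unconstrained, so otherwise fall back to the purely spatial index.
    public static Index choose(boolean hasSpatialBounds, boolean hasTemporalBounds) {
        if (hasSpatialBounds && hasTemporalBounds) {
            return Index.SPATIAL_TEMPORAL;
        }
        return Index.SPATIAL;
    }
}
```

The general version of this would score each registered index by how many of its dimensions the query constrains.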
Develop a mapnik geowave datasource plugin:
Decision point on RPC (Thrift, etc.) vs. in-process (probably Jace [4])
See #14 - same technique here applies there
ref:
[1] https://github.com/mapnik/mapnik/wiki/PluginArchitecture
[2] https://github.com/mapnik/mapnik/wiki/DevelopingPlugins
[3] https://github.com/mapnik/mapnik/issues?milestone=15&state=open
[4] https://code.google.com/p/jace/wiki/Overview
Add options that can be provided to AccumuloDataStore and GeoWaveDataStore to change some behaviors but have reasonable defaults be the current behavior if no options are specified. A few examples to start out would be to enable/disable persisting a data adapter in the metadata table if it doesn't exist, enable/disable persisting an index in the metadata table if it doesn't exist, and enable/disable creating an index table if it doesn't exist (probably just throwing an error if it doesn't exist).
Another option in which the default behavior is currently not implemented is to enable/disable automatically adding a locality group for a new column family (data adapter ID) within an index table. It seems the default behavior would be to create a locality group for each column family because this is the most typical access pattern.
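A sketch of what such an options object might look like, with defaults preserving current behavior (all names are hypothetical, not the actual AccumuloDataStore/GeoWaveDataStore API):

```java
public class DataStoreOptions {
    // Defaults preserve current behavior when no options are specified.
    private boolean persistAdapter = true;
    private boolean persistIndex = true;
    private boolean createTable = true;
    private boolean localityGroupPerAdapter = true;

    // Fluent setters so callers can override only what they need.
    public DataStoreOptions persistAdapter(boolean v) { persistAdapter = v; return this; }
    public DataStoreOptions persistIndex(boolean v) { persistIndex = v; return this; }
    public DataStoreOptions createTable(boolean v) { createTable = v; return this; }
    public DataStoreOptions localityGroupPerAdapter(boolean v) { localityGroupPerAdapter = v; return this; }

    public boolean isPersistAdapter() { return persistAdapter; }
    public boolean isPersistIndex() { return persistIndex; }
    public boolean isCreateTable() { return createTable; }
    public boolean isLocalityGroupPerAdapter() { return localityGroupPerAdapter; }
}
```

With createTable disabled, the data store would throw an error on a missing index table rather than creating it.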