This should be a configurable option to index geospatial data using either the original TieredSFCIndexStrategy or the new XZHierarchicalIndexFactory. GeoWave query functions should be updated as well to support querying the new indices (perhaps via an optional parameter) while also supporting the older indices.
Need to add support for populating the value cell of the field index filter with relevant data, and to add support for filtering against that data at query time within the query iterators.
About 4 years ago I described the D4M schema at https://github.com/medined/D4M_Schema. Could any of that material be adapted for DataWave? If so, I give permission to use any or all of it.
In the event that we are processing the same geometry strings over and over, it wouldn't be a bad idea to cache the geometry objects once they have been parsed.
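A small memoizing cache is one option. Below is a minimal sketch using a bounded LRU map, where the parser function stands in for the real geometry-parsing call; the class and method names are illustrative, not existing DataWave code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: bounded LRU cache of parsed geometry objects keyed by the raw
// geometry string, so repeated strings are parsed only once.
class GeometryCache<G> {
    private final Function<String, G> parser;
    private final Map<String, G> cache;

    GeometryCache(int maxEntries, Function<String, G> parser) {
        this.parser = parser;
        // access-order LinkedHashMap evicts the least recently used entry
        this.cache = new LinkedHashMap<String, G>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, G> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized G parse(String geometryString) {
        G geometry = cache.get(geometryString);
        if (geometry == null) {
            geometry = parser.apply(geometryString);
            cache.put(geometryString, geometry);
        }
        return geometry;
    }
}
```

Bounding the cache keeps memory predictable if the stream of geometry strings turns out to have little repetition after all.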
The instanceof method in QueryPropertyMarker is used to identify nodes that have been marked with a QueryPropertyMarker (like ASTDelayedPredicate), but at the moment, it does not correctly identify marked nodes after they have gone through serialization/deserialization. At times, it also prematurely identifies the parent of a marked node as the marked node, which can cause problems.
TLDEventDataFilter assumes a 20-character fixed-length base UID for rootPointer parsing purposes. This is not always the case. Adjust the code to remove the bad assumption and instead calculate the length for each Key, counting the separators to determine root document state.
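The separator-counting check might look like the sketch below. The dotted base UID layout (two separators, e.g. a.b.c, with child documents appending further .n segments) and the method name are assumptions for illustration, not the actual DataWave format:

```java
// Sketch: determine root-document state by counting UID separators instead of
// assuming a fixed 20-character base UID.
class UidUtil {
    // Assumption: a base UID contains exactly two '.' separators; child
    // documents append additional ".n" segments.
    static final int BASE_SEPARATORS = 2;

    static boolean isRootPointer(String uid) {
        int separators = 0;
        for (int i = 0; i < uid.length(); i++) {
            if (uid.charAt(i) == '.') {
                separators++;
            }
        }
        return separators <= BASE_SEPARATORS;
    }
}
```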
The authorization microservice exposes an endpoint to evict users from the cache. However, if more than one copy of the service is running, the operation won't currently take effect on all copies. The backing cache is shared; however, if users are in an in-memory cache on a different copy of the service, they won't be evicted. Use the Spring event bus to notify all copies of the authorization service of an eviction request.
When we have a document-specific zone and an index-only field with a query of the form
(some expression) && !(some other expression)
the query can return incorrect results. The reason has to do with the limitSources path, where we will return a field index entry without actually looking it up, assuming we already found it in the global index. There is a check for negation, but that flag is not appropriately set within an ASTNotNode in the IteratorBuildingVisitor.
When we have a sequence of or'ed terms that have been delayed by the planner but have term frequencies (i.e. are tokenized), the QueryIterator will pull those terms back up. However, if this happens within a union of other terms, it can cause results to not be found.
With the current query API, it can be problematic to restart a web server during a long-running query since either the query will be cancelled and the user has to start over, or we must wait a long time for the query to complete before shutting the server down. The latter also requires load balancer support to ensure that requests for an active query go to the draining server, but no other new requests go to it.
Design and implement a mechanism to checkpoint query progress so that a server can be restarted between next calls without interrupting the query. This could be supported only for certain query logics at first, although it would be nice to have it for all query logics.
I think it would be beneficial to place the table structures on the front readme or linked in the docs. I know the docs are under construction, but I'd be happy to take a first stab at this.
The "services" folder contains microservices, which are intended to be evolved and deployed largely independently of each other. Currently, this entire hierarchy shares the same pom version as the main datawave parent pom. Instead, each service should support having its own version. The services parent pom (which will eventually become the main parent pom) could then name the current version of each service that will be in use.
Another task that is part of this issue is to break the dependencies that currently exist where microservices code depends on legacy datawave code. Instead, the cross-dependent code should be moved to new modules under services. For example, each service might need a service-api module that contains the public api for the service.
DocumentProjection should support copying a secondary field into a primary field if the primary field is not set. This could be useful when there are multiple possible sources for a primary field value and, when the primary is not set, the final value should be chosen in priority order.
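A sketch of the fallback behavior over a simple field multimap (the method and field names are hypothetical; the real DocumentProjection operates on Document attributes, not raw maps):

```java
import java.util.List;
import java.util.Map;

class FieldFallback {
    // Sketch: if the primary field has no values, copy the secondary field's
    // values into it, so downstream consumers always see the primary name.
    static void copyIfAbsent(Map<String, List<String>> document, String primary, String secondary) {
        List<String> primaryValues = document.get(primary);
        if ((primaryValues == null || primaryValues.isEmpty()) && document.containsKey(secondary)) {
            document.put(primary, document.get(secondary));
        }
    }
}
```

Calling this once per (primary, secondary) pair, in configuration order, gives the priority-ordered selection described above.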
The UID generation needs to be separated from the setting of the raw data; in some circumstances this causes the UID to be generated multiple times unnecessarily. We also need to ensure that the time on the event is set before the UID is generated. Finally, we should allow for alternative UID implementations other than hash and snowflake.
FieldConfigHelper is currently a way to control whether a field is indexed (forward/reverse) and tokenized (also forward/reverse). It would be helpful if this configuration file could be used to control whether a field was stored in the event as well.
We need the ability to allow the configuration for a table to override properties. The idea is to prefix the properties to override with the table name or a defined prefix via another property.
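A minimal sketch of the prefix-override idea using java.util.Properties (the class and method names are illustrative):

```java
import java.util.Properties;

class TableProperties {
    // Sketch: a property prefixed with "<tableName>." overrides the unprefixed
    // property of the same name for that table.
    static Properties forTable(Properties base, String tableName) {
        Properties merged = new Properties();
        for (String name : base.stringPropertyNames()) {
            merged.setProperty(name, base.getProperty(name));
        }
        String prefix = tableName + ".";
        for (String name : base.stringPropertyNames()) {
            if (name.startsWith(prefix)) {
                merged.setProperty(name.substring(prefix.length()), base.getProperty(name));
            }
        }
        return merged;
    }
}
```

The same shape works with a separately configured prefix in place of the table name.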
Instead of generating a bounding envelope which contains the entire multipolygon, try to break the multipolygon down into smaller bounding areas so that we generate tighter, more accurate scan ranges.
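To illustrate why per-polygon envelopes yield tighter ranges, the sketch below compares a single envelope over all points against per-polygon envelopes, with plain coordinate arrays standing in for real geometry objects:

```java
class EnvelopeSketch {
    // Each polygon is an array of [x, y] points; an envelope is [minX, minY, maxX, maxY].
    static double[] envelope(double[][] points) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] p : points) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    static double area(double[] env) {
        return (env[2] - env[0]) * (env[3] - env[1]);
    }
}
```

For two unit squares at opposite corners of an 11x11 region, the per-polygon envelopes cover a total area of 2 while the single envelope covers 121, so ranges derived from the single envelope scan far more irrelevant keys.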
A feature has been requested where we configure the shard query logic with one or more objects that can be used to modify the document prior to evaluation. The particular requirement is to allow us to dynamically create the pieces of a virtual field for insertion into the JexlContext. This will allow one to drop the pieces of a virtual field (saving space in the DB) but still allow users to potentially query on the pieces.
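One possible shape for such a hook (the interface, method, and field names are all hypothetical): a modifier that rebuilds the pieces of a virtual field from the stored composite value before the document reaches the JexlContext:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: a pre-evaluation hook applied to the document's fields.
interface DocumentModifier {
    void modify(Map<String, List<String>> fields);
}

// Hypothetical example: a stored composite GEO_POINT of the form "lat,lon" is
// split back into LATITUDE and LONGITUDE pieces, so the pieces never need to
// be stored in the database but can still be queried.
class VirtualFieldModifier implements DocumentModifier {
    @Override
    public void modify(Map<String, List<String>> fields) {
        List<String> points = fields.get("GEO_POINT");
        if (points == null) {
            return;
        }
        List<String> lat = new ArrayList<>();
        List<String> lon = new ArrayList<>();
        for (String point : points) {
            int comma = point.indexOf(',');
            if (comma > 0) {
                lat.add(point.substring(0, comma));
                lon.add(point.substring(comma + 1));
            }
        }
        fields.put("LATITUDE", lat);
        fields.put("LONGITUDE", lon);
    }
}
```

Configuring the shard query logic with a list of such modifiers would let each deployment decide which virtual-field pieces to materialize.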
Currently, multiple scans of the same shard can occur. This happens when a day range gets expanded at the same time we have the shard range for shard 0:
day range: 20180101 to 20180101
shard range: 20180101_0 to 20180101\x00
after day range expansion we get: 20180101 to 20180101\x00
The expanded day range does not get collapsed with the shard range, resulting in two scans against the same shard.
Meanwhile, if we have an ivarator, we will now be working in the same directory at the same time, which can cause all kinds of confusion. Typically we have found that when one of the ivarators compacts its files, the other falters because of missing files.
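The missing collapse step amounts to merging overlapping ranges before scanning. A generic sketch over lexicographic string bounds (real Accumulo Range semantics, including inclusivity and the \x00 suffixes above, are simplified away here):

```java
import java.util.ArrayList;
import java.util.List;

class RangeMerge {
    // A range is [startInclusive, endInclusive] over lexicographically ordered
    // row keys. Assumes the input is sorted by start bound; overlapping ranges
    // are merged so the same shard is never scanned twice.
    static List<String[]> collapse(List<String[]> sorted) {
        List<String[]> out = new ArrayList<>();
        for (String[] range : sorted) {
            if (!out.isEmpty() && out.get(out.size() - 1)[1].compareTo(range[0]) >= 0) {
                // overlap with the previous range: extend its end bound if needed
                String[] last = out.get(out.size() - 1);
                if (range[1].compareTo(last[1]) > 0) {
                    last[1] = range[1];
                }
            } else {
                out.add(new String[] {range[0], range[1]});
            }
        }
        return out;
    }
}
```

Applied after day-range expansion, the expanded day range would absorb the contained shard range, producing a single scan (and a single ivarator directory) per shard.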
Authorization and auditing have already been separated out to microservices. Identify additional components that can be separated out, in addition to the query logics. For example, TypeMetadataHelper is a good candidate. There is probably other low-hanging fruit.
Through profiling with YourKit and monitoring cache misses, I've noticed improved execution time and fewer cache misses when I avoid some of the CAS operations with AtomicInteger. By using LongAdder in QueryStatsDClient I've seen improved performance. I suspect this will also be the case with the schedulers and scanners that increment integers across threads. I'll make these additional targeted changes (beyond the statsd client) as a POC.
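The swap itself is small: LongAdder spreads contended increments across internal cells, so threads don't spin retrying a single compare-and-set the way they do on a hot AtomicInteger. This generic sketch shows the pattern (QueryStatsDClient's actual fields differ):

```java
import java.util.concurrent.atomic.LongAdder;

class StatCounter {
    // LongAdder trades exact read-time cost for cheap contended writes:
    // ideal for stats counters that are written often and read rarely.
    private final LongAdder seen = new LongAdder();

    void increment() {
        seen.increment();
    }

    long snapshot() {
        // sum() is a moment-in-time aggregate across the internal cells
        return seen.sum();
    }
}
```

The same trade applies to the schedulers and scanners: counters that are incremented across threads but only read when reporting are good LongAdder candidates, while values that need atomic read-modify-write decisions should stay on AtomicInteger/AtomicLong.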
Currently the NumShards cache can only handle URIs that correspond to the configured defaultFS. However, we have instances where multiple filesystems may be available. Instead of using FileSystem.get(conf), we should use FileSystem.get(uri, conf) to create the appropriate filesystem for each URI.
The version of HTTPClient we are using is subject to a bug during connection pool shutdown where deadlock can occur (HTTPCORE-446). Since we perform shutdown as a part of the "/Common/Configuration/refresh" method, this bug can take effect during runtime after a refresh has been issued.
Update to a newer version of HTTPClient that is not susceptible to this bug. Also update usage of HTTPClient to delay shutdown for a while to give existing pending requests a chance to finish since they would otherwise be immediately canceled and would fail, which is not what we want during a refresh.
MetadataHelperUpdateHdfsListener throws a TableNotFoundException (TNFE) because the knowledgeMetadata table does not exist, flooding Query.log. The knowledge schema is unused at the moment and therefore should not be configured by default for MetadataHelperUpdateHdfsListener in default.properties.
Computing size on BaseType can be very expensive (in time) depending on the implementation. Specifically, Normalizers that are singletons need not be evaluated as part of the computation.
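One way to avoid re-evaluating shared normalizers is to memoize the computed size per instance, so a singleton is measured at most once however many values reference it. This is a sketch; the names and structure are illustrative, not the BaseType code:

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

class SizeEstimator {
    // IdentityHashMap keys by reference, which is exactly right for singleton
    // instances: the expensive computation runs once per distinct object.
    private final Map<Object, Long> cache = new IdentityHashMap<>();
    private final ToLongFunction<Object> expensiveSize;

    SizeEstimator(ToLongFunction<Object> expensiveSize) {
        this.expensiveSize = expensiveSize;
    }

    synchronized long sizeOf(Object normalizer) {
        return cache.computeIfAbsent(normalizer, k -> expensiveSize.applyAsLong(k));
    }
}
```

An even cheaper variant would charge known singleton classes a fixed constant (or zero, since the instance is shared) and skip the computation entirely.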