This should be a configurable option to index geospatial data using either the original TieredSFCIndexStrategy or the new XZHierarchicalIndexFactory. GeoWave query functions should be updated as well to support querying the new indices (perhaps via an optional parameter) while also supporting the older indices.
Need to add support for populating the value cell of the field index filter with relevant data, and to add support for filtering against that data at query time within the query iterators.
About 4 years ago I described the D4M schema at https://github.com/medined/D4M_Schema. Could any of that material be adapted for DataWave? If so, I give permission to use any or all of it.
In the event that we are processing the same geometry strings over and over, it wouldn't be a bad idea to cache the geometry objects once they have been parsed.
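A small memoizing cache is one option. Below is a minimal sketch using a bounded LRU map, where the parser function stands in for the real geometry-parsing call; the class and method names are illustrative, not existing DataWave code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: bounded LRU cache of parsed geometry objects keyed by the raw
// geometry string, so repeated strings are parsed only once.
class GeometryCache<G> {
    private final Function<String, G> parser;
    private final Map<String, G> cache;

    GeometryCache(int maxEntries, Function<String, G> parser) {
        this.parser = parser;
        // access-order LinkedHashMap evicts the least recently used entry
        this.cache = new LinkedHashMap<String, G>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, G> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized G parse(String geometryString) {
        G geometry = cache.get(geometryString);
        if (geometry == null) {
            geometry = parser.apply(geometryString);
            cache.put(geometryString, geometry);
        }
        return geometry;
    }
}
```

Bounding the cache keeps memory predictable if the stream of geometry strings turns out to have little repetition after all.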
The instanceof method in QueryPropertyMarker is used to identify nodes that have been marked with a QueryPropertyMarker (like ASTDelayedPredicate), but at the moment, it does not correctly identify marked nodes after they have gone through serialization/deserialization. At times, it also prematurely identifies the parent of a marked node as the marked node, which can cause problems.
TLDEventDataFilter assumes a 20-character fixed-length base UID for rootPointer parsing purposes. This is not always the case. Adjust the code to remove the bad assumption and instead calculate the length for each Key, counting the separators to determine root document state.
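The separator-counting check might look like the sketch below. The dotted base UID layout (two separators, e.g. a.b.c, with child documents appending further .n segments) and the method name are assumptions for illustration, not the actual DataWave format:

```java
// Sketch: determine root-document state by counting UID separators instead of
// assuming a fixed 20-character base UID.
class UidUtil {
    // Assumption: a base UID contains exactly two '.' separators; child
    // documents append additional ".n" segments.
    static final int BASE_SEPARATORS = 2;

    static boolean isRootPointer(String uid) {
        int separators = 0;
        for (int i = 0; i < uid.length(); i++) {
            if (uid.charAt(i) == '.') {
                separators++;
            }
        }
        return separators <= BASE_SEPARATORS;
    }
}
```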
The authorization microservice exposes an endpoint to evict users from the cache. However, if more than one copy of the service is running, the operation won't currently take effect on all copies. The backing cache is shared; however, if users are in an in-memory cache on a different copy of the service, they won't be evicted. Use the Spring event bus to notify all copies of the authorization service of an eviction request.
When we have a document-specific zone and an index-only field with a query of the form
(some expression) && !(some other expression)
the query can return incorrect results. The reason has to do with the limitSources path, where we will return a field index entry without actually looking it up, assuming we already found it in the global index. There is a check for negation, but that flag is not appropriately set within an ASTNotNode in the IteratorBuildingVisitor.
When we have a sequence of or'ed terms that have been delayed by the planner but have term frequencies (i.e. are tokenized), the QueryIterator will pull those terms back up. However, if this happens within a union of other terms, it can cause results to not be found.
With the current query API, it can be problematic to restart a web server during a long-running query since either the query will be cancelled and the user has to start over, or we must wait a long time for the query to complete before shutting the server down. The latter also requires load balancer support to ensure that requests for an active query go to the draining server, but no other new requests go to it.
Design and implement a mechanism to checkpoint query progress so that a server can be restarted between next calls without interrupting the query. This could be supported only for certain query logics at first, although it would be nice to have it for all query logics.
I think it would be beneficial to place the table structures on the front readme or linked in the docs. I know the docs are under construction, but I'd be happy to take a first stab at this.
The "services" folder contains microservices, which are intended to be evolved and deployed largely independently of each other. Currently, this entire hierarchy shares the same pom version as the main datawave parent pom. Instead, each service should support having its own version. The services parent pom (which will eventually become the main parent pom) could then name the current version of each service that will be in use.
Another task that is part of this issue is to break the dependencies that currently exist where microservices code depends on legacy datawave code. Instead, the cross-dependent code should be moved to new modules under services. For example, each service might need a service-api module that contains the public api for the service.
DocumentProjection should support copying a secondary field into a primary field if the primary field is not set. This could be useful when there are multiple possible sources for a primary field value and, when the primary is not set, the final value should be chosen in priority order.
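A sketch of the fallback behavior over a simple field multimap (the method and field names are hypothetical; the real DocumentProjection operates on Document attributes, not raw maps):

```java
import java.util.List;
import java.util.Map;

class FieldFallback {
    // Sketch: if the primary field has no values, copy the secondary field's
    // values into it, so downstream consumers always see the primary name.
    static void copyIfAbsent(Map<String, List<String>> document, String primary, String secondary) {
        List<String> primaryValues = document.get(primary);
        if ((primaryValues == null || primaryValues.isEmpty()) && document.containsKey(secondary)) {
            document.put(primary, document.get(secondary));
        }
    }
}
```

Calling this once per (primary, secondary) pair, in configuration order, gives the priority-ordered selection described above.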
The UID generation needs to be separated from the setting of the raw data; in some circumstances this causes the UID to be generated multiple times unnecessarily. We also need to ensure that the time on the event is set before the UID is generated. Finally, we should allow for alternative UID implementations other than hash and snowflake.
FieldConfigHelper is currently a way to control whether a field is indexed (forward/reverse) and tokenized (also forward/reverse). It would be helpful if this configuration file could be used to control whether a field was stored in the event as well.
We need the ability to allow the configuration for a table to override properties. The idea is to prefix the properties to override with the table name or a defined prefix via another property.
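A minimal sketch of the prefix-override idea using java.util.Properties (the class and method names are illustrative):

```java
import java.util.Properties;

class TableProperties {
    // Sketch: a property prefixed with "<tableName>." overrides the unprefixed
    // property of the same name for that table.
    static Properties forTable(Properties base, String tableName) {
        Properties merged = new Properties();
        for (String name : base.stringPropertyNames()) {
            merged.setProperty(name, base.getProperty(name));
        }
        String prefix = tableName + ".";
        for (String name : base.stringPropertyNames()) {
            if (name.startsWith(prefix)) {
                merged.setProperty(name.substring(prefix.length()), base.getProperty(name));
            }
        }
        return merged;
    }
}
```

The same shape works with a separately configured prefix in place of the table name.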
Instead of generating a bounding envelope which contains the entire multipolygon, try to break the multipolygon down into smaller bounding areas so that we generate tighter, more accurate scan ranges.
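To illustrate why per-polygon envelopes yield tighter ranges, the sketch below compares a single envelope over all points against per-polygon envelopes, with plain coordinate arrays standing in for real geometry objects:

```java
class EnvelopeSketch {
    // Each polygon is an array of [x, y] points; an envelope is [minX, minY, maxX, maxY].
    static double[] envelope(double[][] points) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] p : points) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    static double area(double[] env) {
        return (env[2] - env[0]) * (env[3] - env[1]);
    }
}
```

For two unit squares at opposite corners of an 11x11 region, the per-polygon envelopes cover a total area of 2 while the single envelope covers 121, so ranges derived from the single envelope scan far more irrelevant keys.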
A feature has been requested where we configure the shard query logic with one or more objects that can be used to modify the document prior to evaluation. The particular requirement is to allow us to dynamically create the pieces of a virtual field for insertion into the JexlContext. This will allow one to drop the pieces of a virtual field (saving space in the DB) but still allow users to potentially query on the pieces.
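One possible shape for such a hook (the interface, method, and field names are all hypothetical): a modifier that rebuilds the pieces of a virtual field from the stored composite value before the document reaches the JexlContext:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: a pre-evaluation hook applied to the document's fields.
interface DocumentModifier {
    void modify(Map<String, List<String>> fields);
}

// Hypothetical example: a stored composite GEO_POINT of the form "lat,lon" is
// split back into LATITUDE and LONGITUDE pieces, so the pieces never need to
// be stored in the database but can still be queried.
class VirtualFieldModifier implements DocumentModifier {
    @Override
    public void modify(Map<String, List<String>> fields) {
        List<String> points = fields.get("GEO_POINT");
        if (points == null) {
            return;
        }
        List<String> lat = new ArrayList<>();
        List<String> lon = new ArrayList<>();
        for (String point : points) {
            int comma = point.indexOf(',');
            if (comma > 0) {
                lat.add(point.substring(0, comma));
                lon.add(point.substring(comma + 1));
            }
        }
        fields.put("LATITUDE", lat);
        fields.put("LONGITUDE", lon);
    }
}
```

Configuring the shard query logic with a list of such modifiers would let each deployment decide which virtual-field pieces to materialize.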
Currently, multiple scans of the same shard can occur. This happens when a day range gets expanded at the same time we have the shard range for shard 0:
day range: 20180101 to 20180101
shard range: 20180101_0 to 20180101\x00
after day range expansion we get: 20180101 to 20180101\x00
The expanded day range does not get collapsed with the shard range, resulting in two scans against the same shard.
Meanwhile, if we have an ivarator, we will now be working in the same directory at the same time, which can cause all kinds of confusion. Typically we have found that when one of the ivarators compacts its files, the other falters because of missing files.
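The missing collapse step amounts to merging overlapping ranges before scanning. A generic sketch over lexicographic string bounds (real Accumulo Range semantics, including inclusivity and the \x00 suffixes above, are simplified away here):

```java
import java.util.ArrayList;
import java.util.List;

class RangeMerge {
    // A range is [startInclusive, endInclusive] over lexicographically ordered
    // row keys. Assumes the input is sorted by start bound; overlapping ranges
    // are merged so the same shard is never scanned twice.
    static List<String[]> collapse(List<String[]> sorted) {
        List<String[]> out = new ArrayList<>();
        for (String[] range : sorted) {
            if (!out.isEmpty() && out.get(out.size() - 1)[1].compareTo(range[0]) >= 0) {
                // overlap with the previous range: extend its end bound if needed
                String[] last = out.get(out.size() - 1);
                if (range[1].compareTo(last[1]) > 0) {
                    last[1] = range[1];
                }
            } else {
                out.add(new String[] {range[0], range[1]});
            }
        }
        return out;
    }
}
```

Applied after day-range expansion, the expanded day range would absorb the contained shard range, producing a single scan (and a single ivarator directory) per shard.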
Authorization and auditing have already been separated out to microservices. Identify additional components that can be separated out, in addition to the query logics. For example, TypeMetadataHelper is a good candidate. There is probably other low-hanging fruit.
Through profiling with YourKit and monitoring cache misses, I've noticed improved execution time and fewer cache misses when I avoid some of the CAS operations with AtomicInteger. By using LongAdder in QueryStatsDClient I've seen improved performance. I suspect this will also be the case with the schedulers and scanners that increment integers across threads. I'll make these additional targeted changes (beyond the statsd client) as a POC.
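The swap itself is small: LongAdder spreads contended increments across internal cells, so threads don't spin retrying a single compare-and-set the way they do on a hot AtomicInteger. This generic sketch shows the pattern (QueryStatsDClient's actual fields differ):

```java
import java.util.concurrent.atomic.LongAdder;

class StatCounter {
    // LongAdder trades exact read-time cost for cheap contended writes:
    // ideal for stats counters that are written often and read rarely.
    private final LongAdder seen = new LongAdder();

    void increment() {
        seen.increment();
    }

    long snapshot() {
        // sum() is a moment-in-time aggregate across the internal cells
        return seen.sum();
    }
}
```

The same trade applies to the schedulers and scanners: counters that are incremented across threads but only read when reporting are good LongAdder candidates, while values that need atomic read-modify-write decisions should stay on AtomicInteger/AtomicLong.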
Currently the NumShards cache can only handle URIs that correspond to the configured defaultFS. However, we have instances where multiple filesystems may be available. Instead of using FileSystem.get(conf), we should use FileSystem.get(uri, conf) to create the appropriate filesystem for each URI.
The version of HTTPClient we are using is subject to a bug during connection pool shutdown where deadlock can occur (HTTPCORE-446). Since we perform shutdown as a part of the "/Common/Configuration/refresh" method, this bug can take effect during runtime after a refresh has been issued.
Update to a newer version of HTTPClient that is not susceptible to this bug. Also update usage of HTTPClient to delay shutdown for a while to give existing pending requests a chance to finish since they would otherwise be immediately canceled and would fail, which is not what we want during a refresh.
MetadataHelperUpdateHdfsListener throws a TableNotFoundException (TNFE) because the knowledgeMetadata table does not exist, flooding Query.log. The knowledge schema is unused at the moment and therefore should not be configured by default for MetadataHelperUpdateHdfsListener in default.properties.
Computing size on BaseType can be very expensive (in time) depending on the implementation. Specifically, Normalizers that are singletons need not be evaluated as part of the computation.
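One way to avoid re-evaluating shared normalizers is to memoize the computed size per instance, so a singleton is measured at most once however many values reference it. This is a sketch; the names and structure are illustrative, not the BaseType code:

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

class SizeEstimator {
    // IdentityHashMap keys by reference, which is exactly right for singleton
    // instances: the expensive computation runs once per distinct object.
    private final Map<Object, Long> cache = new IdentityHashMap<>();
    private final ToLongFunction<Object> expensiveSize;

    SizeEstimator(ToLongFunction<Object> expensiveSize) {
        this.expensiveSize = expensiveSize;
    }

    synchronized long sizeOf(Object normalizer) {
        return cache.computeIfAbsent(normalizer, k -> expensiveSize.applyAsLong(k));
    }
}
```

An even cheaper variant would charge known singleton classes a fixed constant (or zero, since the instance is shared) and skip the computation entirely.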