thedataleek / hadoopinspector
License: Other
If the cluster instance info is wrong, you don't see the result until a check is run.
Then, if the check does good logging, you'll see the issue in that check's logs.
But if it doesn't, it may just print a json structure without valid fields. In this case the runner will fail it on data validation, but it won't log what the failed structure looked like - which is probably key to quickly understanding the problem.
This happens specifically with rule_table_range.bash when it is given an invalid hapinsp_instance.
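A minimal sketch of how the runner could log the raw payload before validating it, so the failed structure is always visible in the logs. The field names (rc, violation_cnt) and function name are assumptions, not the project's actual schema:

```python
import json
import logging

logger = logging.getLogger("hadoopinspector.runner")

REQUIRED_FIELDS = {"rc", "violation_cnt"}  # assumed field names

def parse_check_output(raw_stdout):
    """Parse a check's json stdout; log the raw payload on any failure."""
    try:
        result = json.loads(raw_stdout)
    except ValueError:
        logger.error("check output is not valid json: %r", raw_stdout)
        return None
    missing = REQUIRED_FIELDS - set(result)
    if missing:
        # log what the failed structure looked like - key to diagnosis
        logger.error("check output missing fields %s: %r", missing, result)
        return None
    return result
```

With this, an invalid hapinsp_instance that makes a check emit a field-less json structure would still leave the offending payload in the runner's log.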
The runner should be easily integrated within a CI environment.
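Easy CI integration mostly comes down to a meaningful process exit code. A minimal sketch, assuming check results carry an rc field (the function name and result shape are illustrative):

```python
def exit_code_for_ci(check_results):
    """Return 0 only when every check passed, so a CI job can
    fail the build on any check failure.

    check_results: list of dicts with an 'rc' field (assumed shape).
    """
    failed = [r for r in check_results if r.get("rc", 1) != 0]
    return 1 if failed else 0
```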
Right now the results database only knows when a check was run - not what time period the check was run against.
Add another pair of timestamps for the start & stop of the period the check was run against. This is particularly important when incremental testing is being performed.
- one pair of timestamps defines when the process ran
- the other pair defines the period being checked
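A sketch of what the extended results table might look like, using sqlite3. All column names and the key are assumptions for illustration, not the project's actual DDL:

```python
import sqlite3

# Hypothetical schema: the run_* pair records when the process ran,
# the data_* pair records the period being checked.
DDL = """
CREATE TABLE check_results (
    instance        TEXT NOT NULL,
    database_name   TEXT NOT NULL,
    table_name      TEXT NOT NULL,
    check_name      TEXT NOT NULL,
    run_start_ts    TEXT NOT NULL,   -- defines when the process ran
    run_stop_ts     TEXT,
    data_start_ts   TEXT,            -- defines the period being checked
    data_stop_ts    TEXT,
    PRIMARY KEY (instance, database_name, table_name, check_name, run_start_ts)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
```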
Need to add logging to the runner
Checks are run in series. Let's change this to run them in parallel, driven by a max-workers value provided by the user in the registry.
Easy change to the DDL, but for future-proofing it would be a good addition.
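The parallel execution described above could be sketched with the stdlib's concurrent.futures, with max_workers taken from the user's registry entry. The shape of the checks argument is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_checks(checks, max_workers=1):
    """Run check callables in parallel, bounded by max_workers
    (which would come from the user's registry entry).

    checks: dict mapping check name to a zero-arg callable (assumed shape).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check): name for name, check in checks.items()}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Threads are a reasonable default here since the checks are external processes or database queries and the work is I/O-bound.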
Config items should be kept in a standard XDG location.
In particular for hadoopinspector_runner.
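A minimal sketch of resolving that location per the XDG Base Directory convention (XDG_CONFIG_HOME, falling back to ~/.config):

```python
import os

def config_dir(app_name="hadoopinspector"):
    """Resolve the XDG config directory for the app, honoring
    XDG_CONFIG_HOME and falling back to ~/.config."""
    base = os.environ.get("XDG_CONFIG_HOME") or os.path.expanduser("~/.config")
    return os.path.join(base, app_name)
```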
We need pure-SQL checks since they'll be very simple, fast, easy to write, and preferred by some.
The SQL will need to be a template which the runner will read and fill in the variables for. The runner will also have to make the database connection and run the SQL.
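One way to sketch this with the stdlib's string.Template and sqlite3. The variable names ($table, $column) and the result shape are assumptions, and the demo table is hypothetical:

```python
import sqlite3
from string import Template

# A pure-SQL check as a template; $table and $column are illustrative names.
CHECK_SQL = Template("SELECT COUNT(*) FROM $table WHERE $column IS NULL")

def run_sql_check(conn, template, variables):
    """Fill in the template's variables, run the SQL, report violations."""
    sql = template.substitute(variables)
    violation_cnt = conn.execute(sql).fetchone()[0]
    return {"rc": 0, "violation_cnt": violation_cnt}

# demo against an in-memory database with a hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (None,), (3,)])
result = run_sql_check(conn, CHECK_SQL,
                       {"table": "customers", "column": "customer_id"})
```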
We should add the ability to log violations - since it will help users in diagnosing problems.
We should create a table, check_results_sample, to contain sample row violations. It should have a key of inst, db, table, check_name, run_starttime, and a serial integer. Its non-key columns should include: contents - which will be a string representation of a json data structure holding the sampled row.
The checks will pass these samples back in their json stdout. The runner will have to handle out-of-control checks passing a million rows back, though the feature is definitely not intended for that.
The next problem is that the typical activity of counting is at odds with getting actual rows. So we need to figure out when we do one vs the other, and when both. Need to resolve.
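A sketch of the sample table plus a cap guarding against runaway checks. The exact column names and the cap value are assumptions:

```python
import json
import sqlite3

MAX_SAMPLE_ROWS = 100  # cap out-of-control checks passing a million rows back

SAMPLE_DDL = """
CREATE TABLE check_results_sample (
    instance       TEXT NOT NULL,
    database_name  TEXT NOT NULL,
    table_name     TEXT NOT NULL,
    check_name     TEXT NOT NULL,
    run_start_ts   TEXT NOT NULL,
    sample_id      INTEGER NOT NULL,  -- the serial-integer part of the key
    contents       TEXT               -- json string representation of the row
)
"""

def store_samples(conn, key, rows):
    """Persist at most MAX_SAMPLE_ROWS violation samples for one check run.

    key is (instance, database, table, check_name, run_start_ts).
    """
    for i, row in enumerate(rows[:MAX_SAMPLE_ROWS]):
        conn.execute(
            "INSERT INTO check_results_sample VALUES (?, ?, ?, ?, ?, ?, ?)",
            key + (i, json.dumps(row)))

# demo: a check that hands back far more samples than we want to keep
conn = sqlite3.connect(":memory:")
conn.execute(SAMPLE_DDL)
store_samples(conn,
              ("prod", "sales", "orders", "rule_table_range", "2016-01-07"),
              [{"id": n} for n in range(500)])
```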
The runner and tests should integrate with Hive's Metastore in order to collect known information about tables to be tested.
Right now if the runner encounters invalid inputs the issue may not be very clear. This affects:
- tests providing invalid input to some add-on, like hapinsp_formatter.py
- tests providing invalid input back to the runner, which hapinsp_formatter.py should prevent
- registry.json having invalid configs
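For the registry.json case, a sketch of failing fast with messages that name the offending field, rather than surfacing an obscure error later. The field names and allowed values here are illustrative, not the project's actual config schema:

```python
def validate_registry_entry(entry):
    """Return a list of human-readable problems with one registry entry.

    Field names and allowed values are hypothetical examples.
    """
    errors = []
    for field in ("check_name", "check_type"):
        if field not in entry:
            errors.append("missing required field: %s" % field)
    if entry.get("check_type") not in (None, "rule", "check", "setup", "teardown"):
        errors.append("invalid check_type: %r" % entry["check_type"])
    return errors
```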
Testing an entire table can take far too long, especially considering that the only changes since the last run may have been to the most recent daily/weekly/monthly partition.
The runner should be able to identify the right partition to check, given:
- what it last checked
- what has been loaded since
And then it should share that with the checks.
Also, while ideally these time-oriented constraints should be partitions in order to get performance benefits, in some cases they may not be. So, maybe 'partition' is the wrong terminology.
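The selection logic above can be sketched independently of whether the constraint is a true partition. This assumes the time-oriented keys sort chronologically as strings (e.g. ISO dates):

```python
def partitions_to_check(last_checked, loaded_partitions):
    """Return the partitions (or time-oriented slices) loaded since the
    runner last checked, in order.

    Assumes keys sort chronologically, e.g. '2016-01-07'.
    """
    if last_checked is None:
        return sorted(loaded_partitions)
    return sorted(p for p in loaded_partitions if p > last_checked)
```

The runner would then pass the resulting list to the checks, whatever the slices are actually implemented as.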
The user should get to annotate - or comment on - the issues discovered.
These comments should be stored within the sqlite database.
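A sketch of what such a comments table might look like in the sqlite database; the schema is an assumption:

```python
import sqlite3

# Hypothetical schema for user annotations on discovered issues.
COMMENTS_DDL = """
CREATE TABLE check_comments (
    instance       TEXT NOT NULL,
    database_name  TEXT NOT NULL,
    table_name     TEXT NOT NULL,
    check_name     TEXT NOT NULL,
    run_start_ts   TEXT NOT NULL,   -- ties the comment to a check run
    comment_ts     TEXT NOT NULL,
    author         TEXT,
    comment        TEXT NOT NULL
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(COMMENTS_DDL)
conn.execute(
    "INSERT INTO check_comments VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("prod", "sales", "orders", "rule_table_range",
     "2016-01-07 02:00:00", "2016-01-07 09:15:00", "analyst1",
     "known issue - upstream feed was late"))
```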
We've got setups - now we need teardowns, to support the following:
Right now hadoopinspector_runner could be run multiple times concurrently, probably by mistake or due to insanely long tests. This should be changed to allow only one instance at a time to run against a given database.
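One way to sketch per-database single-instance enforcement: an advisory flock on a lock file beside the results database (Unix-only; the naming convention is an assumption):

```python
import fcntl
import os

def acquire_instance_lock(db_path):
    """Take an exclusive advisory lock tied to the results database,
    so only one runner instance can operate on it at a time.

    Returns an open fd to hold for the life of the run, or None if
    another runner already holds the lock.
    """
    fd = os.open(db_path + ".lock", os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None  # another runner holds the lock
    return fd
```

An advantage of flock over a create-if-absent lock file is that the kernel releases it automatically if the runner crashes, so stale locks can't block later runs.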
Should support at least one effective method of daemonizing:
* ubuntu upstart is probably top priority
* daemonize?
* ?
The runner needs to use standard logging for its log output.