thedataleek / hadoopinspector
License: Other
If the cluster instance info is wrong, you don't see the result until a check is run.
Then, if the check does good logging, you'll see the issue in that check's logs.
But if it doesn't, it may just print a json structure without valid fields. In this case the runner will fail it on data validation, but it won't log what the failed structure looked like - which is probably key to quickly understanding the problem.
This happens specifically with rule_table_range.bash when it is given an invalid hapinsp_instance.
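A minimal sketch of how the runner could log the raw payload before validating it, so the failed structure is always visible in the logs. The field names (rc, violation_cnt) and function name are assumptions, not the project's actual schema:

```python
import json
import logging

logger = logging.getLogger("hadoopinspector.runner")

REQUIRED_FIELDS = {"rc", "violation_cnt"}  # assumed field names

def parse_check_output(raw_stdout):
    """Parse a check's json stdout; log the raw payload on any failure."""
    try:
        result = json.loads(raw_stdout)
    except ValueError:
        logger.error("check output is not valid json: %r", raw_stdout)
        return None
    missing = REQUIRED_FIELDS - set(result)
    if missing:
        # log what the failed structure looked like - key to diagnosis
        logger.error("check output missing fields %s: %r", missing, result)
        return None
    return result
```

With this, an invalid hapinsp_instance that makes a check emit a field-less json structure would still leave the offending payload in the runner's log.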
The runner should be easily integrated within a CI environment.
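Easy CI integration mostly comes down to a meaningful process exit code. A minimal sketch, assuming check results carry an rc field (the function name and result shape are illustrative):

```python
def exit_code_for_ci(check_results):
    """Return 0 only when every check passed, so a CI job can
    fail the build on any check failure.

    check_results: list of dicts with an 'rc' field (assumed shape).
    """
    failed = [r for r in check_results if r.get("rc", 1) != 0]
    return 1 if failed else 0
```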
Right now the results database only knows when a check was run - not what time period the check was run against.
Add another pair of timestamps for the start & stop of the period the check was run against. This is particularly important when incremental testing is being performed.
- one pair of timestamps defines when the process ran
- the other pair defines the period being checked
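A sketch of what the extended results table might look like, using sqlite3. All column names and the key are assumptions for illustration, not the project's actual DDL:

```python
import sqlite3

# Hypothetical schema: the run_* pair records when the process ran,
# the data_* pair records the period being checked.
DDL = """
CREATE TABLE check_results (
    instance        TEXT NOT NULL,
    database_name   TEXT NOT NULL,
    table_name      TEXT NOT NULL,
    check_name      TEXT NOT NULL,
    run_start_ts    TEXT NOT NULL,   -- defines when the process ran
    run_stop_ts     TEXT,
    data_start_ts   TEXT,            -- defines the period being checked
    data_stop_ts    TEXT,
    PRIMARY KEY (instance, database_name, table_name, check_name, run_start_ts)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
```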
Need to add logging to the runner
Checks are run in series. Let's change this to run them in parallel, driven by a max-workers value provided by the user in the registry.
Easy change to the DDL, but for future-proofing it would be a good addition.
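The parallel execution described above could be sketched with the stdlib's concurrent.futures, with max_workers taken from the user's registry entry. The shape of the checks argument is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_checks(checks, max_workers=1):
    """Run check callables in parallel, bounded by max_workers
    (which would come from the user's registry entry).

    checks: dict mapping check name to a zero-arg callable (assumed shape).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check): name for name, check in checks.items()}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Threads are a reasonable default here since the checks are external processes or database queries and the work is I/O-bound.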
Config items should be kept in a standard XDG location.
In particular for hadoopinspector_runner.
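A minimal sketch of resolving that location per the XDG Base Directory convention (XDG_CONFIG_HOME, falling back to ~/.config):

```python
import os

def config_dir(app_name="hadoopinspector"):
    """Resolve the XDG config directory for the app, honoring
    XDG_CONFIG_HOME and falling back to ~/.config."""
    base = os.environ.get("XDG_CONFIG_HOME") or os.path.expanduser("~/.config")
    return os.path.join(base, app_name)
```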
We need pure-SQL checks since they'll be very simple, fast, easy to write, and preferred by some.
The SQL will need to be a template which the runner will read and fill in the variables for. The runner will also have to make the database connection and run the SQL.
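One way to sketch this with the stdlib's string.Template and sqlite3. The variable names ($table, $column) and the result shape are assumptions, and the demo table is hypothetical:

```python
import sqlite3
from string import Template

# A pure-SQL check as a template; $table and $column are illustrative names.
CHECK_SQL = Template("SELECT COUNT(*) FROM $table WHERE $column IS NULL")

def run_sql_check(conn, template, variables):
    """Fill in the template's variables, run the SQL, report violations."""
    sql = template.substitute(variables)
    violation_cnt = conn.execute(sql).fetchone()[0]
    return {"rc": 0, "violation_cnt": violation_cnt}

# demo against an in-memory database with a hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (None,), (3,)])
result = run_sql_check(conn, CHECK_SQL,
                       {"table": "customers", "column": "customer_id"})
```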
We should add the ability to log violations - since it will help users in diagnosing problems.
We should create a table, check_results_sample, to contain sample row violations. It should have a key of inst, db, table, check_name, run_starttime, and a serial integer. Its non-key columns should include: contents - which will be a string representation of a json data structure holding the sampled row.
The checks will pass these samples back in their json stdout. The runner will have to handle out-of-control checks passing a million rows back, though the feature is definitely not intended for that.
The next problem is that the typical activity of counting is at odds with getting actual rows. So we need to figure out when we do one vs the other, and when both. Need to resolve.
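A sketch of the sample table plus a cap guarding against runaway checks. The exact column names and the cap value are assumptions:

```python
import json
import sqlite3

MAX_SAMPLE_ROWS = 100  # cap out-of-control checks passing a million rows back

SAMPLE_DDL = """
CREATE TABLE check_results_sample (
    instance       TEXT NOT NULL,
    database_name  TEXT NOT NULL,
    table_name     TEXT NOT NULL,
    check_name     TEXT NOT NULL,
    run_start_ts   TEXT NOT NULL,
    sample_id      INTEGER NOT NULL,  -- the serial-integer part of the key
    contents       TEXT               -- json string representation of the row
)
"""

def store_samples(conn, key, rows):
    """Persist at most MAX_SAMPLE_ROWS violation samples for one check run.

    key is (instance, database, table, check_name, run_start_ts).
    """
    for i, row in enumerate(rows[:MAX_SAMPLE_ROWS]):
        conn.execute(
            "INSERT INTO check_results_sample VALUES (?, ?, ?, ?, ?, ?, ?)",
            key + (i, json.dumps(row)))

# demo: a check that hands back far more samples than we want to keep
conn = sqlite3.connect(":memory:")
conn.execute(SAMPLE_DDL)
store_samples(conn,
              ("prod", "sales", "orders", "rule_table_range", "2016-01-07"),
              [{"id": n} for n in range(500)])
```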
The runner and tests should integrate with Hive's Metastore in order to collect known information about tables to be tested.
Right now if the runner encounters invalid inputs the issue may not be very clear. This affects:
- tests providing invalid input to some add-on, like hapinsp_formatter.py
- tests providing invalid input back to the runner, which hapinsp_formatter.py should prevent
- registry.json having invalid configs
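For the registry.json case, a sketch of failing fast with messages that name the offending field, rather than surfacing an obscure error later. The field names and allowed values here are illustrative, not the project's actual config schema:

```python
def validate_registry_entry(entry):
    """Return a list of human-readable problems with one registry entry.

    Field names and allowed values are hypothetical examples.
    """
    errors = []
    for field in ("check_name", "check_type"):
        if field not in entry:
            errors.append("missing required field: %s" % field)
    if entry.get("check_type") not in (None, "rule", "check", "setup", "teardown"):
        errors.append("invalid check_type: %r" % entry["check_type"])
    return errors
```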
Testing an entire table can take far too long, especially considering that the only changes since the last run may have been to the most recent daily/weekly/monthly partition.
The runner should be able to identify the right partition to check, given:
- what it last checked
- what has been loaded since
And then it should share that with the checks.
Also, while ideally these time-oriented constraints should be partitions in order to get performance benefits, in some cases they may not be. So, maybe 'partition' is the wrong terminology.
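The selection logic above can be sketched independently of whether the constraint is a true partition. This assumes the time-oriented keys sort chronologically as strings (e.g. ISO dates):

```python
def partitions_to_check(last_checked, loaded_partitions):
    """Return the partitions (or time-oriented slices) loaded since the
    runner last checked, in order.

    Assumes keys sort chronologically, e.g. '2016-01-07'.
    """
    if last_checked is None:
        return sorted(loaded_partitions)
    return sorted(p for p in loaded_partitions if p > last_checked)
```

The runner would then pass the resulting list to the checks, whatever the slices are actually implemented as.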
The user should get to annotate - or comment on - the issues discovered.
These comments should be stored within the sqlite database.
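A sketch of what such a comments table might look like in the sqlite database; the schema is an assumption:

```python
import sqlite3

# Hypothetical schema for user annotations on discovered issues.
COMMENTS_DDL = """
CREATE TABLE check_comments (
    instance       TEXT NOT NULL,
    database_name  TEXT NOT NULL,
    table_name     TEXT NOT NULL,
    check_name     TEXT NOT NULL,
    run_start_ts   TEXT NOT NULL,   -- ties the comment to a check run
    comment_ts     TEXT NOT NULL,
    author         TEXT,
    comment        TEXT NOT NULL
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(COMMENTS_DDL)
conn.execute(
    "INSERT INTO check_comments VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("prod", "sales", "orders", "rule_table_range",
     "2016-01-07 02:00:00", "2016-01-07 09:15:00", "analyst1",
     "known issue - upstream feed was late"))
```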
We've got setups - now we need teardowns, to support the following:
Right now hadoopinspector_runner could be run multiple times concurrently, probably by mistake or due to insanely long tests. This should be changed to allow only one instance at a time to run against a given database.
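One way to sketch per-database single-instance enforcement: an advisory flock on a lock file beside the results database (Unix-only; the naming convention is an assumption):

```python
import fcntl
import os

def acquire_instance_lock(db_path):
    """Take an exclusive advisory lock tied to the results database,
    so only one runner instance can operate on it at a time.

    Returns an open fd to hold for the life of the run, or None if
    another runner already holds the lock.
    """
    fd = os.open(db_path + ".lock", os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None  # another runner holds the lock
    return fd
```

An advantage of flock over a create-if-absent lock file is that the kernel releases it automatically if the runner crashes, so stale locks can't block later runs.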
Should support at least one effective method of daemonizing:
* ubuntu upstart is probably top priority
* daemonize?
* ?
The runner needs to use standard logging for its log output.