
hadoopinspector's People

Contributors

kenfar

hadoopinspector's Issues

improve logging for: bad config results in failed check

If the cluster instance info is wrong, the problem doesn't surface until a check is run.

Then, if the check does good logging, you'll see the issue in that check's logs.

But if it doesn't, it may just print a JSON structure without valid fields. In this case the runner will fail it on data validation, but it won't log what the failed structure looked like, which is probably key to quickly understanding the problem.

This happens specifically with rule_table_range.bash when it is given an invalid hapinsp_instance.
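
One way the runner could surface the failed structure is sketched below. This is only an illustration: the logger name, the function parse_check_output, and the required field names are assumptions, not the runner's actual code.

```python
import json
import logging

logger = logging.getLogger("hapinsp_runner")  # hypothetical logger name

# Hypothetical required fields; the runner's real validation rules may differ.
REQUIRED_FIELDS = ("rc", "violations")


def parse_check_output(check_name, raw_stdout):
    """Return the parsed check result, or None after logging why it was rejected."""
    try:
        result = json.loads(raw_stdout)
    except ValueError as err:
        # Log a bounded slice of the raw output so the cause is visible.
        logger.error("check %s: stdout is not valid JSON (%s); raw output: %r",
                     check_name, err, raw_stdout[:2000])
        return None
    if not isinstance(result, dict):
        logger.error("check %s: expected a JSON object, got: %r",
                     check_name, result)
        return None
    missing = [field for field in REQUIRED_FIELDS if field not in result]
    if missing:
        # Log the structure that failed validation, not just the failure.
        logger.error("check %s: missing fields %s; received structure: %s",
                     check_name, missing,
                     json.dumps(result, sort_keys=True)[:2000])
        return None
    return result
```

The output is truncated before logging so that a misbehaving check cannot flood the log.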

add: distinction in results-db between run timestamp & data timestamp

Right now the results database only knows when a check was run - not what time period the check was run against.

Add another set of timestamps for the start & stop time the check was run against. This is particularly important when incremental testing is being performed.

defines when process ran:

  • run_start_timestamp
  • run_stop_timestamp

defines period being checked:

  • data_start_timestamp
  • data_stop_timestamp
  • if mode == 'full', then the period is all data, up to roughly the run time
    • only roughly, because the table could hold data dated in the future
    • or because the table may have no data for, say, the past 3 weeks
  • if mode == 'incremental', then the period is the partitioning date used by the incremental checks
    • but there are many different incremental date formats
    • maybe the setup should also provide regular start & stop timestamps when it provides incremental checks?
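
A rough sketch of how the two sets of timestamps might sit side by side in the sqlite results database. The table name check_results, the key columns, and the record_result helper are assumptions for illustration; only the four timestamp column names come from the lists above.

```python
import sqlite3

# Hypothetical table and key column names; only the four timestamp
# column names come from the issue text.
DDL = """
CREATE TABLE IF NOT EXISTS check_results (
    instance             TEXT NOT NULL,
    db                   TEXT NOT NULL,
    table_name           TEXT NOT NULL,
    check_name           TEXT NOT NULL,
    mode                 TEXT NOT NULL CHECK (mode IN ('full', 'incremental')),
    run_start_timestamp  TEXT NOT NULL,  -- when the process ran
    run_stop_timestamp   TEXT,
    data_start_timestamp TEXT,           -- period being checked; NULL if unknown
    data_stop_timestamp  TEXT,
    PRIMARY KEY (instance, db, table_name, check_name, run_start_timestamp)
)
"""


def record_result(conn, key, mode, run_period, data_period=(None, None)):
    """Insert one result row; key is (instance, db, table_name, check_name)."""
    conn.execute(
        "INSERT INTO check_results VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (*key, mode, *run_period, *data_period),
    )
    conn.commit()
```

The data timestamps are left nullable here because, in 'full' mode, the true bounds of the data may not be known.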

Standard Config

Config items should be kept in a standard XDG location.

This applies in particular to hadoopinspector_runner.
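
A minimal sketch of resolving such a location, following the XDG Base Directory convention of using $XDG_CONFIG_HOME and falling back to $HOME/.config. The function name config_dir and the directory name "hadoopinspector" are assumptions.

```python
import os
from pathlib import Path


def config_dir(app_name="hadoopinspector", environ=None):
    """Return the per-user config directory for app_name (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    base = environ.get("XDG_CONFIG_HOME", "")
    # The XDG spec says unset, empty, or relative values fall back to the default.
    if not base or not os.path.isabs(base):
        home = environ.get("HOME") or os.path.expanduser("~")
        base = os.path.join(home, ".config")
    return Path(base) / app_name
```

Passing the environment in as a parameter keeps the lookup easy to test without changing the real process environment.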

add: SQL Checks

We need pure-SQL checks since they'll be very simple, fast, easy to write, and preferred by some.

The SQL will need to be a template, which the runner will read and fill in with variables. The runner will also have to make the database connection and run the SQL.
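
The template idea could look roughly like the sketch below. It assumes string.Template placeholders and a check that returns a single violation count; sqlite3 stands in for the real cluster connection, and run_sql_check is a hypothetical name.

```python
import sqlite3
from string import Template


def run_sql_check(conn, sql_template, variables):
    """Fill in the template, run it, and return the violation count.

    Assumes the SQL returns one row whose first column is a count.
    The variables are substituted as raw text (identifiers included), so
    they must come from trusted config, never from untrusted input.
    """
    # substitute() raises KeyError if the template names a missing variable.
    sql = Template(sql_template).substitute(variables)
    row = conn.execute(sql).fetchone()
    return int(row[0])
```

Using substitute() rather than safe_substitute() means an unfilled variable fails loudly before any SQL is sent to the database.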

add: logging of actual violation rows

We should add the ability to log violations - since it will help users in diagnosing problems.

We should create a table, check_results_sample, to contain sample row violations. It should have a key of inst, db, table, check_name, run_starttime and a serial integer. Its non-key columns should include contents, which will be a string representation of a JSON data structure. That is the table.

The checks will pass back these samples in the JSON stdout. The runner will have to handle out-of-control checks passing a million rows back, though the feature is definitely not intended for that.
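
A possible shape for that table and for capping what the runner stores, as a sketch only. The cap of 100, the store_samples helper, and the column name table_name (used because TABLE is a reserved word in SQL) are assumptions.

```python
import itertools
import json
import sqlite3

MAX_SAMPLES = 100  # hypothetical cap on stored rows per check run

DDL = """
CREATE TABLE IF NOT EXISTS check_results_sample (
    inst          TEXT NOT NULL,
    db            TEXT NOT NULL,
    table_name    TEXT NOT NULL,
    check_name    TEXT NOT NULL,
    run_starttime TEXT NOT NULL,
    serial        INTEGER NOT NULL,
    contents      TEXT NOT NULL,  -- JSON string of one violating row
    PRIMARY KEY (inst, db, table_name, check_name, run_starttime, serial)
)
"""


def store_samples(conn, key, samples, max_samples=MAX_SAMPLES):
    """Store at most max_samples rows and return how many were kept.

    key is (inst, db, table_name, check_name, run_starttime).
    """
    kept = 0
    # islice stops reading after max_samples, so the rest is never stored.
    for serial, sample in enumerate(itertools.islice(samples, max_samples)):
        conn.execute(
            "INSERT INTO check_results_sample VALUES (?, ?, ?, ?, ?, ?, ?)",
            (*key, serial, json.dumps(sample, sort_keys=True)),
        )
        kept += 1
    conn.commit()
    return kept
```

This only bounds what is written to the database; a check that emits a million rows on stdout would still need to be bounded when its output is read.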

The next problem is that the typical activity of counting is at odds with getting actual rows. So we need to figure out when a check counts, when it returns rows, and when it does both. This still needs to be resolved.

add: Hive metadata integration

The runner and tests should integrate with Hive's Metastore in order to collect known information about tables to be tested.

add: extremely clear input validation

Right now if the runner encounters invalid inputs the issue may not be very clear. This affects:
- tests providing invalid input to some add-on, like hapinsp_formatter.py
- tests providing invalid input back to the runner, which hapinsp_formatter.py should prevent
- registry.json having invalid configs
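
As an illustration of what clearer validation could report, the sketch below collects every problem with a location instead of failing on the first one. The registry layout assumed here (a JSON object mapping a name to an object of settings) and the validate_registry helper are hypothetical; the real registry.json structure may differ.

```python
import json


def validate_registry(text, required_keys):
    """Return a list of human-readable problems; an empty list means valid.

    Assumes a hypothetical layout: {"<entry name>": {<settings>}, ...}.
    """
    try:
        registry = json.loads(text)
    except ValueError as err:
        return ["registry is not valid JSON: %s" % err]
    if not isinstance(registry, dict):
        return ["registry top level must be an object, got %s"
                % type(registry).__name__]
    problems = []
    for name, entry in sorted(registry.items()):
        if not isinstance(entry, dict):
            problems.append("%s: expected an object, got %r" % (name, entry))
            continue
        for key in required_keys:
            if key not in entry:
                problems.append("%s: missing required key '%s'" % (name, key))
    return problems
```

Each message names the entry and the missing key, so the user can fix all problems in one pass.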

runner: add ability to test a single partition

Testing an entire table can take far too long, especially considering that the only changes since the last run may have been to the most recent daily/weekly/monthly partition.

The runner should be able to identify the right partition to check, given:
- what it last checked
- what has been loaded since

And then it should share that with the checks.
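
For daily partitions, the selection could be sketched as follows. The function name partitions_to_check and the assumption of one partition per calendar day are illustrative only; weekly or monthly schemes would need a different step.

```python
from datetime import date, timedelta


def partitions_to_check(last_checked, last_loaded):
    """Return the daily partition dates after last_checked, up to and
    including last_loaded. Returns an empty list if nothing new was loaded."""
    days = (last_loaded - last_checked).days
    return [last_checked + timedelta(days=n) for n in range(1, days + 1)]
```

For example, with last_checked = date(2016, 1, 1) and last_loaded = date(2016, 1, 3), the runner would hand the checks the partitions for January 2 and January 3.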

Also, while ideally these time-oriented constraints should be partitions in order to get performance benefits, in some cases they may not be. So, maybe 'partition' is the wrong terminology.

add: data annotation

The user should be able to annotate, or comment on, the issues discovered.

These comments should be stored within the sqlite database.

add: teardown checks

We've got setups - now we need teardowns, to support the following:

  • dropping temporary tables created only to enable testing against a subset of a table
  • integration with Jenkins, etc., to report that tests have failed for a table

add: hadoopinspector_runner run once only

Right now the hadoopinspector_runner could be run multiple times, probably by mistake, or due to insanely long tests. This should be changed to only allow one instance at a time to run - for a given database.
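
One common way to enforce this on Unix is an advisory file lock next to the database, sketched below with fcntl.flock. The lock-file naming and the acquire_run_lock helper are assumptions, and this approach does not work on Windows.

```python
import fcntl
import os


def acquire_run_lock(db_path):
    """Try to take an exclusive lock for db_path.

    Returns an open file descriptor on success (keep it open for the whole
    run), or None if another runner already holds the lock.
    """
    lock_path = db_path + ".lock"  # hypothetical naming convention
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes this fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd
```

The operating system releases the lock when the descriptor is closed or the process exits, so a crashed runner does not leave a stale lock behind.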

add: daemonization

Should support at least one effective method of daemonizing:
* ubuntu upstart is probably top priority
* daemonize?
* ?
