rstudio / pointblank
Data quality assessment and metadata reporting for data frames and database tables
Home Page: https://rstudio.github.io/pointblank/
License: Other
I intended to check key properties of `sf`/`sfc` objects using `rows_not_duplicated()`. The check was supposed to ignore the geometry column of the object (cf. the 2nd example in the reprex). It seems that `interrogate()` ran into an error because of the way `summarize()` works on these objects.
Reprex example:

``` r
library(pointblank)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

# Geometry object with 2 features
g <- rep(st_sfc(st_point(1:2)), 2)

# Vector with 2 entries
v <- c("a", "b")

# Object including both
mixed_obj <- st_sf("vector" = v, "points" = g)
mixed_obj
#> Simple feature collection with 2 features and 1 field
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 2 xmax: 1 ymax: 2
#> epsg (SRID):    NA
#> proj4string:    NA
#>   vector      points
#> 1      a POINT (1 2)
#> 2      b POINT (1 2)

agent <- create_agent()

agent %>%
  focus_on("mixed_obj") %>%
  rows_not_duplicated() %>%
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

# It already happens when I only check whether column "vector" is duplicated
# (likely because `sf` objects have "sticky geometries")
agent <- create_agent()

agent %>%
  focus_on("mixed_obj") %>%
  rows_not_duplicated(cols = vector) %>%
  interrogate()
#> Error: Can't coerce element 2 from a list to a double
```

Created on 2019-02-12 by the reprex package (v0.2.1)
I think it happens at the following chunk in `interrogate()`, in the section "# Judge tables on expectation of non-duplicated rows":

``` r
# Get total count of rows
row_count <-
  table %>%
  dplyr::group_by() %>%
  dplyr::summarize(row_count = n()) %>%
  dplyr::as_tibble() %>%
  purrr::flatten_dbl()
```
My expectation would be that with `rows_not_duplicated()` (without specifying columns) each whole row, including the geometry column, would be compared with the others, and that with `rows_not_duplicated(cols = vector)` the check would be done only for the column "vector". Perhaps a solution might be to call `as_tibble()` before `group_by()` and `summarize()`?
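In the meantime, dropping the sticky geometry before the duplicate check works as a sketch in plain base R (the `points` list column below merely stands in for an sf geometry; this is not pointblank's actual internals):

``` r
# Hypothetical stand-in for an sf object: a data frame whose geometry
# is a list column (sf geometries behave this way)
tbl <- data.frame(vector = c("a", "b", "a"), stringsAsFactors = FALSE)
tbl$points <- list(c(1, 2), c(3, 4), c(1, 2))

# Dropping the list column first makes the duplicate check well-defined
plain <- tbl[, setdiff(names(tbl), "points"), drop = FALSE]
n_duplicated <- sum(duplicated(plain))
n_duplicated
#> [1] 1
```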
CC: @krlmlr
Thanks for making this package! It's very helpful and I got a few validation processes running smoothly in production (no issues with that at all).
I like the email notifier that's included; something along the same lines for Slack would be a great addition! Would that be a feature you're willing to add?
Presently, any `preconditions` just filter the data before performing a validation. It would be much better to accept a list of expressions that manipulate the data. In this way, the user could mutate the table and perhaps generate a new column that would undergo validation (among other possibilities).
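A rough sketch of what expression-based preconditions could enable, emulated with plain base R (the list-of-expressions API itself is the proposal, not existing code):

``` r
# Two precondition-style steps applied in order before validation
df <- data.frame(a = c(1, 2, 3))

# A mutate-style step: derive a new column that could then be validated
df <- transform(df, a_doubled = a * 2)

# A filter-style step: restrict the rows under validation
df <- df[df$a > 1, , drop = FALSE]

df$a_doubled
#> [1] 4 6
```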
Prepare for release:

- `devtools::check()`
- `devtools::check_win_devel()`
- `rhub::check_for_cran()`
- `revdepcheck::revdep_check(num_workers = 4)`

Submit to CRAN:

- `usethis::use_version('minor')`
- Update `cran-comments.md`
- `devtools::submit_cran()`

Wait for CRAN...

- `usethis::use_github_release()`
- `usethis::use_dev_version()`
Currently there are no datasets in the package but one or two would be useful for examples and vignettes.
We need a standardized test suite that exercises all of the validation step functions with a variety of database types. The databases and their drivers should be: MySQL (with RMariaDB), PostgreSQL (with RPostgres), and SQLite (with RSQLite).
``` r
create_agent() %>%           # (1)
  focus_on(
    tbl_name = "tbl_1") %>%  # (2)
  col_vals_gt(
    column = "a",
    value = 0)
#> Error in bind_rows_(x, .id) :
#>   Evaluation error: Argument 6: list can't contain data frames.
```
``` r
sessionInfo()
#> R version 3.3.1 (2016-06-21)
#> Platform: x86_64-apple-darwin13.4.0 (64-bit)
#> Running under: OS X 10.12.4 (Sierra)
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats  graphics  grDevices  utils  datasets  methods  base
#>
#> other attached packages:
#> [1] bindrcpp_0.1  pointblank_0.1  rlang_0.0.0.9018
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.10.2  bindr_0.1  knitr_1.15.1  magrittr_1.5  hms_0.3  devtools_1.12.0
#>  [7] R6_2.2.0  stringr_1.2.0  httr_1.2.1  dplyr_0.5.0.9004  tools_3.3.1  DBI_0.6-11
#> [13] git2r_0.15.0  withr_1.0.2  htmltools_0.3.5  lazyeval_0.2.0.9000  RPostgreSQL_0.4-1  assertthat_0.2.0
#> [19] digest_0.6.12  rprojroot_1.2  tibble_1.3.0  tidyr_0.6.1  purrr_0.2.2  readr_1.1.0
#> [25] curl_2.3  memoise_1.0.0  glue_1.0.0  evaluate_0.10  rmarkdown_1.5  stringi_1.1.5
#> [31] backports_1.0.5
```
This is currently not tested at all.
Just a suggestion as these are quite difficult for some people to meet. 😮
This function is just to display simple instructions on how to do the basic things à la https://tldr.sh/.
These are functions that perform logging, emailing, and issuing warnings/errors.
Also, there should be variants that create function templates for each of the types.
Tables that are `tbl_spark` are not yet supported, but this shouldn't be difficult to implement. There will also need to be tests for performing pointblank validations on `tbl_spark` datasets.
Love this package! I'm setting up all sorts of validations and one thing I think would be useful is to enable a direct comparison of one column to another.
For example, if we wanted to validate that column a is always greater than column b, we should be able to use `col_vals_gt(vars(a), vars(b))`. What do you think?
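For reference, the requested check boils down to this base R comparison (the `vars()` form above is the proposed, not-yet-existing API):

``` r
# Column-to-column comparison: is every value of a greater than b?
df <- data.frame(a = c(3, 5, 7), b = c(1, 2, 9))
all(df$a > df$b)
#> [1] FALSE
```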
Again, this package is incredible. Thanks!
The `conjointly()` function does not currently create an autobrief if `brief = NULL` (which is most of the time). So, handcraft a nice brief that states the number of steps and the types of validations contained in this step. Limit the listing of validation types to three.
Would it be useful to list the distinguishing features and strengths of the package w/r/t the assert and assertive packages, which perform similar tasks?
The `col_vals_in_set` test appears to pass regardless of whether or not values are actually in the set. See the reproducible example below.

Create a simple two-column data frame:

``` r
df <-
  data.frame(
    a = c(1, 2, 3, 4),
    b = c("one", "two", "three", "four"),
    stringsAsFactors = FALSE)
```

Validate that all numerical values in column a belong to a numerical set, and that all values in column b belong to a set of string values. Note that none of the values in either validation set should pass.

``` r
agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_in_set(
    column = a,
    set = 10:20) %>%
  col_vals_in_set(
    column = b,
    set = c("mouse", "dog", "cat", "pig")) %>%
  interrogate()
```

However, all validation checks are reported as passed:

``` r
all_passed(agent)
#> [1] TRUE
```
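For contrast, the membership checks the two steps should perform can be written directly in base R, and both come out `FALSE`, so neither step should pass:

``` r
# The set-membership checks the agent should be applying
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("one", "two", "three", "four"),
  stringsAsFactors = FALSE)

all(df$a %in% 10:20)
#> [1] FALSE
all(df$b %in% c("mouse", "dog", "cat", "pig"))
#> [1] FALSE
```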
As a way to make the x-list object a bit more visually appealing in the console (and less annoying), a print method should be added.
Each validation step function will get the argument `active`, which will accept a logical value (defaulting to `TRUE`).

If step functions are working with an agent, `FALSE` will make the step inactive (still reporting its presence and keeping the indexes for the steps unchanged).

If the step functions are operating directly on data, then any step with `active = FALSE` will simply pass the data through, no longer acting as a filter (internally, just returning the data early).

A valid use case for this is setting a global switch on some or all validation steps depending on the context (e.g., in production or not).
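A minimal sketch of the pass-through behavior for data-mode steps (the function below is illustrative only; the real step functions have different signatures):

``` r
# An inactive step returns the data untouched instead of filtering
validate_gt <- function(data, column, value, active = TRUE) {
  if (!active) return(data)  # inactive step: no-op pass-through
  data[data[[column]] > value, , drop = FALSE]
}

df <- data.frame(a = c(1, 5, 10))
nrow(validate_gt(df, "a", 3))                  # acts as a filter
#> [1] 2
nrow(validate_gt(df, "a", 3, active = FALSE))  # passes data through
#> [1] 3
```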
The idea is to pass a data object directly to a validation function and get a re-usable output (e.g., a vector of logical values) that can be used in other functions. This would be very useful for joint validations where we could have:

``` r
df %>% <validation_function>(...) &
  df %>% <validation_function>(...)
```

And the resultant vector of logicals could show which rows jointly passed (of course, one has to ensure that the input is passed to the validation functions unchanged).
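The combination could look like this in base R terms (the row-wise logicals here are computed directly; only the `&` combination is the point):

``` r
# Two row-wise validations combined into a joint pass/fail vector
df <- data.frame(a = c(1, 5, 10))

gt_zero  <- df$a > 0
lt_eight <- df$a < 8
jointly_passed <- gt_zero & lt_eight

jointly_passed
#> [1]  TRUE  TRUE FALSE
```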
This shouldn't affect the existing API that much. The first argument of any validation function will change from `agent` to `...`, where each function will internally sort out whether to use an agent object or immediately interrogate. The `...` will also be useful if we decide to wrap inputs in some helper function (e.g., `jointly()`, etc.).
This is needed to simplify the model for validation steps. With a list column we can accommodate any type, so any values put in `value`, `set`, and `regex` would simply go into the `values` list column.

This also makes it easier to have non-numeric comparisons, so dates or date-times could then be specified and used.
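A small illustration of how one `values` list column can hold heterogeneous inputs (this is not pointblank's actual internal table, just a sketch):

``` r
# One list column accommodating a number, a set, a regex, and dates
steps <- data.frame(
  assertion_type = c("col_vals_gt", "col_vals_in_set",
                     "col_vals_regex", "col_vals_between"),
  stringsAsFactors = FALSE)
steps$values <- list(
  0,
  c("a", "b", "c"),
  "^[0-9]+$",
  as.Date(c("2019-01-01", "2019-12-31")))

lengths(steps$values)
#> [1] 1 3 1 2
```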
The description for col_exists implies that it can handle multiple columns. However, the method for doing this is not apparent:

- `col_exists(column = c(start_date, age))` looks for a column called "c(start_date, age)"
- `col_exists(column = c("start_date", "age"))` looks for a column called "c('start_date', 'age')"

Given this function isn't fully documented, I'm not sure if I'm missing something. Is there a way to pass a vector of column names to this function?
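For what it's worth, the multi-column existence check amounts to this in base R (whether `col_exists()` should accept a vector this way is exactly the open question):

``` r
# Do all of the named columns exist in the table?
df <- data.frame(start_date = as.Date("2020-01-01"), age = 30)
all(c("start_date", "age") %in% names(df))
#> [1] TRUE
```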
Prepare for release:

- `devtools::check()`
- `devtools::check_win_devel()`
- `rhub::check_for_cran()`
- `revdepcheck::revdep_check(num_workers = 4)`

Submit to CRAN:

- `usethis::use_version('patch')`
- Update `cran-comments.md`
- `devtools::submit_cran()`

Wait for CRAN...

- `usethis::use_github_release()`
- `usethis::use_dev_version()`
Currently this doesn't work for `tbl_dbi` objects (tested on SQLite and MySQL) but it's fine for `data.frame` and `tbl_df` objects.
lubridate is required for creation of the HTML summary but is not installed as a dependency of the package.
Different parts of the interrogation report handle NA values differently. This is apparent for col_vals_between and col_vals_in_set (and presumably affects other tests I don't currently use).
I have a simple data frame that includes missing values, and I check that all values of column B are between 0 and 20.

``` r
df <-
  structure(list(A = c(NA, 2L, 3L, 4L, 5L, 6L, NA, 8L, 9L, NA),
                 B = c(11L, 12L, NA, NA, NA, NA, 17L, 18L, 19L, 20L),
                 C = c(NA, NA, NA, 24L, NA, 26L, NA, 28L, NA, 30L)),
            .Names = c("A", "B", "C"), row.names = c(NA, -10L),
            class = "data.frame")

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20) %>%
  interrogate()

all_passed(agent)
get_interrogation_summary(agent)
```
all_passed(agent)
get_interrogation_summary(agent)
`all_passed()` returns `TRUE` but `get_interrogation_summary()` reports that only 60% of the rows passed:

``` r
#> # A tibble: 1 x 12
#>   tbl_name db_type  assertion_type   column value regex all_passed     n n_passed f_passed action brief
#>   <chr>    <chr>    <chr>            <chr>  <dbl> <chr> <lgl>      <dbl>    <dbl>    <dbl> <chr>  <chr>
#> 1 df       local_df col_vals_between B         NA NA    TRUE          10        6      0.6 NA     Expect that valu~
```
This occurs because the NAs are counted as failing in some calculations but not in others.
I can partially control this behaviour by adding a pre-condition to only apply this test to rows where B is not NA.
``` r
agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    preconditions = is.na(B) == FALSE) %>%
  interrogate()
```
But this becomes tedious when testing a large number of columns where NAs should not be counted as failing (since each would require a precondition that refers to the relevant column by name).
It's also not apparent whether I could specify that NAs should be counted as not in range (except by using a different test). This is a problem if I want to trigger a warning or notification based on the joint failure rate of NAs and out-of-range values.
The ability to include or exclude NAs from any given test (as in the example below) would improve the usability of this function and add consistency between the n_passed and all_passed values.
``` r
agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    na_as_in_range = TRUE) %>%
  interrogate()
```
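To spell out how a hypothetical `na_as_in_range = TRUE` could tally the pass fraction (NA treated as passing), in base R:

``` r
# Column B from the example above
b <- c(11, 12, NA, NA, NA, NA, 17, 18, 19, 20)

in_range <- b >= 0 & b <= 20        # NA wherever b is NA

# Strict: NAs count as failing; lenient: NAs count as passing
f_passed_strict  <- sum(in_range, na.rm = TRUE) / length(b)
f_passed_lenient <- sum(in_range | is.na(b)) / length(b)

f_passed_strict
#> [1] 0.6
f_passed_lenient
#> [1] 1
```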
Because the implementation uses `dplyr::bind_cols()`, there is no SQL translation. What should be done instead is to use a join on row indices. Then we get something that works in all contexts.
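A base R sketch of the row-index idea, with `merge()` standing in for a dplyr join that would also translate to SQL:

``` r
# Attach per-row validation results via an explicit row index,
# rather than positional column binding
df    <- data.frame(row_id = 1:3, a = c(5, 6, 7))
flags <- data.frame(row_id = 1:3, pass = c(TRUE, FALSE, TRUE))

merged <- merge(df, flags, by = "row_id")
merged$pass
#> [1]  TRUE FALSE  TRUE
```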
This function could use some refactoring and tidying up.
Currently, the agent report provides line entries for all validation steps in the given order. There should be options for sorting by severity of the failure conditions and limiting/omitting the passing steps. This will make for more succinct reporting especially in an email context.
I really like that you added multilingual support for the report outputs. One place where that is currently missing (I think) is in the stock message parts for the emailing of the pointblank report. Could you add those in?
Add the `stop_count` and `stop_fraction` arguments for stopping validations when encountering a set threshold of failed checks.

Behind the scenes, create a secret agent and perform a one-hit interrogation on the target. Put the logic in `secret_ops.R`.
👋 Howdy! Not so much an issue, but a request. Would it be useful to create a vignette detailing how to use this package for data testing prior to retraining a machine learning model?
.pb
Here's an example:

``` r
library(pointblank)

# Copied from the docs, but wrapped in a function
fn <- function() {
  my_df <- data.frame(a = c(5, 4, 3, 5, 1, 2))
  agent <- create_agent() %>%
    focus_on(tbl_name = "my_df") %>%
    col_vals_lt(
      column = a,
      value = 6) %>%
    interrogate()
  all_passed(agent)
}

fn()
#> Error in get(tbl_name): object 'my_df' not found
```
Created on 2019-09-21 by the reprex package (v0.3.0)
Note that if a `my_df` object existed in the global scope, `focus_on` would use the global object instead. I think this behavior is happening because `focus_on` uses `get` instead of `dynGet` here.
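A minimal illustration that the lookup environment is what matters here (base R only; not pointblank code):

``` r
# get() finds a function-local object only when pointed at the
# right frame; looking in the global environment would miss it
fn <- function() {
  my_df <- data.frame(a = 1)
  get("my_df", envir = environment())  # the caller's local frame
}
nrow(fn())
#> [1] 1
```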
There needs to be a convenient method for passing in references to environment variables that hold DB credentials. A bonus function would be for testing environment variables (i.e., do the supplied environment variables result in a successful connection?).
Hi,
It would be very useful if this package could be applied to other DB types too using odbc, which translates into DB-specific queries. Any thoughts?
Right now date and datetime column values cannot be validated, and that’s a shame.
Seeing different behavior for the CRAN version, an intermediate version, and `master`.
``` r
library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> pointblank agent // <agent_2019-02-12_17:28:44>
#>
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#>
#> interrogation (2019-02-12 17:28:44) resulted in:
#>   - 1 passing validation
#>   - no failing validations   more info: `get_interrogation_summary()`
```
Created on 2019-02-12 by the reprex package (v0.2.1.9000)
``` r
library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
#> Please use the non-prefixed form or `!!` instead.
#>
#>   # Bad:
#>   rlang::expr(mean(rlang::UQ(var) * 100))
#>
#>   # Ok:
#>   rlang::expr(mean(UQ(var) * 100))
#>
#>   # Good:
#>   rlang::expr(mean(!!var * 100))
#>
#> This warning is displayed once per session.
#> pointblank agent // <agent_2019-02-12_17:29:47>
#>
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#>
#> interrogation (2019-02-12 17:29:48) resulted in:
#>   - no passing validations
#>   - 1 failing validation   more info: `get_interrogation_summary()`
```
Created on 2019-02-12 by the reprex package (v0.2.1.9000)
``` r
library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Error in create_autobrief(agent = agent, assertion_type = "col_vals_in_set", : argument "set" is missing, with no default
```
Created on 2019-02-12 by the reprex package (v0.2.1.9000)
When using the package to validate tables in Redshift, the number of lingering open connections shows a definite increase. Look into solutions for closing connections and efficiently re-using existing connections.
The README hasn't been touched in quite a long time and it could be a little shorter. Goal, I think, is to talk about the main workflows and problems that can be solved. All the other little details can go into vignettes/articles.
Currently, the documentation is minimal and doesn't explain how this function is utilized in the different validation contexts.
Right now, it's a pretty barebones report.
We need the option to send warnings or other general notifications to a specified Slack channel. This is very important for in-production validations running non-interactively.
A few usage examples for each function would be helpful.
Installation required manual installation of rJava, Hmisc, and lazyWeave
The dependence on R version >= 3.4.0 means I cannot install this package at work. (We are stuck on R 3.2, and it's not going to change.) I have looked at the dependencies of pointblank, and none of them seem to require 3.4, although perhaps some of their dependencies do. I might suggest modifying the `.travis.yml` file to try builds on older releases as a test, cf. R versions.
Add documentation for the x-list object. It can take multiple forms, which is potentially confusing. However, good docs and examples can make things more clear.
While most functions only require the use of one column, some take multiple columns. Using select helpers would be helpful for those cases.
This is necessary for creating any schematics of the validation plan and for reporting post-interrogation. It should indicate the settings for the object returned by the `action_levels()` helper function and applied to the validation step.
Right now, nothing is shown in the console during an interrogation (or even at the conclusion of one). So, use messages to convey what is happening.
I've been using this package (along with your gt package!) for about a month and it's really helped out a lot at work. We're trying to get our data quality under control and this package solves that problem perfectly (it's actually amazing how everything just seems to work without problems).
A feature request that I have is making a sort of combined report of interrogations for the same table but at different times. We want to have these to see (in a simple table) where things have improved or gotten worse.
I honestly wouldn't be surprised if you weren't thinking about this already, so, if this is something you were planning it would be a great next step.
Thank you so much for all your great packages. You're making my life easier!