rstudio / pointblank
Data quality assessment and metadata reporting for data frames and database tables
Home Page: https://rstudio.github.io/pointblank/
License: Other
I intended to check key properties of `sf`/`sfc` objects using `rows_not_duplicated()`. The check was supposed to ignore the geometry column of the object (cf. the 2nd example in the reprex). It seems that `interrogate()` ran into an error because of the way `summarize()` works on these objects.
Reprex example:

``` r
library(pointblank)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

# Geometry object with 2 features
g <- rep(st_sfc(st_point(1:2)), 2)

# Vector with 2 entries
v <- c("a", "b")

# Object including both
mixed_obj <- st_sf("vector" = v, "points" = g)
mixed_obj
#> Simple feature collection with 2 features and 1 field
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 2 xmax: 1 ymax: 2
#> epsg (SRID):    NA
#> proj4string:    NA
#>   vector      points
#> 1      a POINT (1 2)
#> 2      b POINT (1 2)

agent <- create_agent()

agent %>%
  focus_on("mixed_obj") %>%
  rows_not_duplicated() %>%
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

# It already happens when I only check whether column "vector" is duplicated
# (likely because `sf` objects have "sticky geometries")
agent <- create_agent()

agent %>%
  focus_on("mixed_obj") %>%
  rows_not_duplicated(cols = vector) %>%
  interrogate()
#> Error: Can't coerce element 2 from a list to a double
```

Created on 2019-02-12 by the reprex package (v0.2.1)
I think it happens at the following chunk in `interrogate()`, in the section "# Judge tables on expectation of non-duplicated rows":

``` r
# Get total count of rows
row_count <-
  table %>%
  dplyr::group_by() %>%
  dplyr::summarize(row_count = n()) %>%
  dplyr::as_tibble() %>%
  purrr::flatten_dbl()
```
My expectation would be that with `rows_not_duplicated()` (without specifying columns) each whole row, including the geometry column, would be compared with the others, and that with `rows_not_duplicated(cols = vector)` the check would be done only for the column "vector". Perhaps a solution might be to call `as_tibble()` before `group_by()` and `summarize()`?
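In the meantime, dropping the sticky geometry before the duplicate check works as a sketch in plain base R (the `points` list column below merely stands in for an sf geometry; this is not pointblank's actual internals):

``` r
# Hypothetical stand-in for an sf object: a data frame whose geometry
# is a list column (sf geometries behave this way)
tbl <- data.frame(vector = c("a", "b", "a"), stringsAsFactors = FALSE)
tbl$points <- list(c(1, 2), c(3, 4), c(1, 2))

# Dropping the list column first makes the duplicate check well-defined
plain <- tbl[, setdiff(names(tbl), "points"), drop = FALSE]
n_duplicated <- sum(duplicated(plain))
n_duplicated
#> [1] 1
```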
CC: @krlmlr
Thanks for making this package! It's very helpful and I got a few validation processes running smoothly in production (no issues with that at all).
I like the email notifier that's included; something along the same lines for Slack would be a great addition! Would that be a feature you're willing to add?
Presently, any `preconditions` just filter the data before performing a validation. It would be much better to accept a list of expressions that manipulate the data. In this way, the user could mutate the table and perhaps generate a new column that would undergo validation (among other possibilities).
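A rough sketch of what expression-based preconditions could enable, emulated with plain base R (the list-of-expressions API itself is the proposal, not existing code):

``` r
# Two precondition-style steps applied in order before validation
df <- data.frame(a = c(1, 2, 3))

# A mutate-style step: derive a new column that could then be validated
df <- transform(df, a_doubled = a * 2)

# A filter-style step: restrict the rows under validation
df <- df[df$a > 1, , drop = FALSE]

df$a_doubled
#> [1] 4 6
```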
Prepare for release:

- `devtools::check()`
- `devtools::check_win_devel()`
- `rhub::check_for_cran()`
- `revdepcheck::revdep_check(num_workers = 4)`

Submit to CRAN:

- `usethis::use_version('minor')`
- Update `cran-comments.md`
- `devtools::submit_cran()`

Wait for CRAN...

- `usethis::use_github_release()`
- `usethis::use_dev_version()`
Currently there are no datasets in the package but one or two would be useful for examples and vignettes.
We need a standardized test suite that exercises all of the validation step functions with a variety of database types. The databases and their drivers should be: MySQL (with RMariaDB), PostgreSQL (with RPostgres), and SQLite (with RSQLite).
``` r
create_agent() %>%           # (1)
  focus_on(
    tbl_name = "tbl_1") %>%  # (2)
  col_vals_gt(
    column = "a",
    value = 0)
#> Error in bind_rows_(x, .id) :
#>   Evaluation error: Argument 6: list can't contain data frames.
```
``` r
sessionInfo()
#> R version 3.3.1 (2016-06-21)
#> Platform: x86_64-apple-darwin13.4.0 (64-bit)
#> Running under: OS X 10.12.4 (Sierra)
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats  graphics  grDevices  utils  datasets  methods  base
#>
#> other attached packages:
#> [1] bindrcpp_0.1  pointblank_0.1  rlang_0.0.0.9018
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.10.2  bindr_0.1  knitr_1.15.1  magrittr_1.5  hms_0.3  devtools_1.12.0
#>  [7] R6_2.2.0  stringr_1.2.0  httr_1.2.1  dplyr_0.5.0.9004  tools_3.3.1  DBI_0.6-11
#> [13] git2r_0.15.0  withr_1.0.2  htmltools_0.3.5  lazyeval_0.2.0.9000  RPostgreSQL_0.4-1  assertthat_0.2.0
#> [19] digest_0.6.12  rprojroot_1.2  tibble_1.3.0  tidyr_0.6.1  purrr_0.2.2  readr_1.1.0
#> [25] curl_2.3  memoise_1.0.0  glue_1.0.0  evaluate_0.10  rmarkdown_1.5  stringi_1.1.5
#> [31] backports_1.0.5
```
This is currently not tested at all.
Just a suggestion as these are quite difficult for some people to meet. 😮
This function is just to display simple instructions on how to do the basic things à la https://tldr.sh/.
These are functions that perform logging, emailing, and issuing warnings/errors.
Also, there should be variants that create function templates for each of the types.
Tables that are `tbl_spark` are not yet supported, but this shouldn't be difficult to implement. There will also need to be tests for performing pointblank validations on `tbl_spark` datasets.
Love this package! I'm setting up all sorts of validations and one thing I think would be useful is to enable a direct comparison of one column to another.
For example, if we wanted to validate that column a is always greater than column b, we should be able to use `col_vals_gt(vars(a), vars(b))`. What do you think?
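For reference, the requested check boils down to this base R comparison (the `vars()` form above is the proposed, not-yet-existing API):

``` r
# Column-to-column comparison: is every value of a greater than b?
df <- data.frame(a = c(3, 5, 7), b = c(1, 2, 9))
all(df$a > df$b)
#> [1] FALSE
```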
Again, this package is incredible. Thanks!
The `conjointly()` function does not currently create an autobrief if `brief = NULL` (which is most of the time). So, handcraft a nice brief that states the number of steps and the types of validations contained in this step. Limit the listing of validation types to three.
Would it be useful to list the distinguishing features and strengths of the package w/r/t the assert and assertive packages, which perform similar tasks?
The `col_vals_in_set` test appears to pass regardless of whether or not values are actually in the set. See the reproducible example below.

Create a simple two-column data frame:

``` r
df <-
  data.frame(
    a = c(1, 2, 3, 4),
    b = c("one", "two", "three", "four"),
    stringsAsFactors = FALSE)
```

Validate that all numerical values in column a belong to a numerical set, and that all values in column b belong to a set of string values. Note that none of the values in either validation set should pass.

``` r
agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_in_set(
    column = a,
    set = 10:20) %>%
  col_vals_in_set(
    column = b,
    set = c("mouse", "dog", "cat", "pig")) %>%
  interrogate()
```

However, all validation checks are reported as passed:

``` r
all_passed(agent)
#> [1] TRUE
```
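For contrast, the membership checks the two steps should perform can be written directly in base R, and both come out `FALSE`, so neither step should pass:

``` r
# The set-membership checks the agent should be applying
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("one", "two", "three", "four"),
  stringsAsFactors = FALSE)

all(df$a %in% 10:20)
#> [1] FALSE
all(df$b %in% c("mouse", "dog", "cat", "pig"))
#> [1] FALSE
```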
As a way to make the x-list object a bit more visually appealing in the console (and less annoying), a print method should be added.
Each validation step function will get the argument `active`, which will accept a logical value (defaulting to `TRUE`).

If step functions are working with an agent, `FALSE` will make the step inactive (still reporting its presence and keeping the indexes for the steps unchanged).

If the step functions are operating directly on data, then any step with `active = FALSE` will simply pass the data through, no longer acting as a filter (internally, just returning the data early).

A valid use case for this is setting a global switch on some or all validation steps depending on the context (e.g., in production or not).
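A minimal sketch of the pass-through behavior for data-mode steps (the function below is illustrative only; the real step functions have different signatures):

``` r
# An inactive step returns the data untouched instead of filtering
validate_gt <- function(data, column, value, active = TRUE) {
  if (!active) return(data)  # inactive step: no-op pass-through
  data[data[[column]] > value, , drop = FALSE]
}

df <- data.frame(a = c(1, 5, 10))
nrow(validate_gt(df, "a", 3))                  # acts as a filter
#> [1] 2
nrow(validate_gt(df, "a", 3, active = FALSE))  # passes data through
#> [1] 3
```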
The idea is to pass a data object directly to a validation function and get a re-usable output (e.g., a vector of logical values) that can be used in other functions. This would be very useful for joint validations where we could have:

``` r
df %>% <validation_function>(...) &
  df %>% <validation_function>(...)
```

And the resultant vector of logicals could show which rows jointly passed (of course, one has to ensure that the input is passed to the validation functions unchanged).
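The combination could look like this in base R terms (the row-wise logicals here are computed directly; only the `&` combination is the point):

``` r
# Two row-wise validations combined into a joint pass/fail vector
df <- data.frame(a = c(1, 5, 10))

gt_zero  <- df$a > 0
lt_eight <- df$a < 8
jointly_passed <- gt_zero & lt_eight

jointly_passed
#> [1]  TRUE  TRUE FALSE
```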
This shouldn't affect the existing API that much. The first argument of any validation function will change from `agent` to `...`, where each function will internally sort out whether to use an agent object or immediately interrogate. The `...` will also be useful if we decide to wrap inputs in some helper function (e.g., `jointly()`, etc.).
This is needed to simplify the model for validation steps. With a list column we can accommodate any type, so any values put in `value`, `set`, and `regex` would simply go into the `values` list column.

This also makes it easier to have non-numeric comparisons, so dates or date-times could then be specified and used.
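A small illustration of how one `values` list column can hold heterogeneous inputs (this is not pointblank's actual internal table, just a sketch):

``` r
# One list column accommodating a number, a set, a regex, and dates
steps <- data.frame(
  assertion_type = c("col_vals_gt", "col_vals_in_set",
                     "col_vals_regex", "col_vals_between"),
  stringsAsFactors = FALSE)
steps$values <- list(
  0,
  c("a", "b", "c"),
  "^[0-9]+$",
  as.Date(c("2019-01-01", "2019-12-31")))

lengths(steps$values)
#> [1] 1 3 1 2
```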
The description for col_exists implies that it can handle multiple columns. However, the method for doing this is not apparent:

- `col_exists(column = c(start_date, age))` looks for a column called "c(start_date, age)"
- `col_exists(column = c("start_date", "age"))` looks for a column called "c('start_date', 'age')"

Given this function isn't fully documented, I'm not sure if I'm missing something. Is there a way to pass a vector of column names to this function?
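For what it's worth, the multi-column existence check amounts to this in base R (whether `col_exists()` should accept a vector this way is exactly the open question):

``` r
# Do all of the named columns exist in the table?
df <- data.frame(start_date = as.Date("2020-01-01"), age = 30)
all(c("start_date", "age") %in% names(df))
#> [1] TRUE
```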
Prepare for release:

- `devtools::check()`
- `devtools::check_win_devel()`
- `rhub::check_for_cran()`
- `revdepcheck::revdep_check(num_workers = 4)`

Submit to CRAN:

- `usethis::use_version('patch')`
- Update `cran-comments.md`
- `devtools::submit_cran()`

Wait for CRAN...

- `usethis::use_github_release()`
- `usethis::use_dev_version()`
Currently this doesn't work for `tbl_dbi` objects (tested on SQLite and MySQL) but it's fine for `data.frame` and `tbl_df` objects.
lubridate is required for creation of the HTML summary but is not installed as a dependency of the package.
Different parts of the interrogation report handle NA values differently. This is apparent for col_vals_between and col_vals_in_set (and presumably affects other tests I don't currently use).
I have a simple data frame that includes missing values, and I check that all values of column B are between 0 and 20.

``` r
df <-
  structure(list(A = c(NA, 2L, 3L, 4L, 5L, 6L, NA, 8L, 9L, NA),
                 B = c(11L, 12L, NA, NA, NA, NA, 17L, 18L, 19L, 20L),
                 C = c(NA, NA, NA, 24L, NA, 26L, NA, 28L, NA, 30L)),
            .Names = c("A", "B", "C"), row.names = c(NA, -10L),
            class = "data.frame")

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20) %>%
  interrogate()

all_passed(agent)
get_interrogation_summary(agent)
```
all_passed(agent)
get_interrogation_summary(agent)
`all_passed()` returns `TRUE` but `get_interrogation_summary()` reports that only 60% of the rows passed:

``` r
#> # A tibble: 1 x 12
#>   tbl_name db_type  assertion_type   column value regex all_passed     n n_passed f_passed action brief
#>   <chr>    <chr>    <chr>            <chr>  <dbl> <chr> <lgl>      <dbl>    <dbl>    <dbl> <chr>  <chr>
#> 1 df       local_df col_vals_between B         NA NA    TRUE          10        6      0.6 NA     Expect that valu~
```
This occurs because the NAs are counted as failing in some calculations but not in others.
I can partially control this behaviour by adding a pre-condition to only apply this test to rows where B is not NA.
``` r
agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    preconditions = is.na(B) == FALSE) %>%
  interrogate()
```
But this becomes tedious when testing a large number of columns where NAs should not be counted as failing (since each would require a precondition that refers to the relevant column by name).
It's also not apparent whether I could specify that NAs should be counted as not in range (except by using a different test). This is a problem if I want to trigger a warning or notification based on the joint failure rate of NAs and out-of-range values.
The ability to include or exclude NAs from any given test (as in the example below) would improve the usability of this function and add consistency between the n_passed and all_passed values.
``` r
agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    na_as_in_range = TRUE) %>%
  interrogate()
```
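To spell out how a hypothetical `na_as_in_range = TRUE` could tally the pass fraction (NA treated as passing), in base R:

``` r
# Column B from the example above
b <- c(11, 12, NA, NA, NA, NA, 17, 18, 19, 20)

in_range <- b >= 0 & b <= 20        # NA wherever b is NA

# Strict: NAs count as failing; lenient: NAs count as passing
f_passed_strict  <- sum(in_range, na.rm = TRUE) / length(b)
f_passed_lenient <- sum(in_range | is.na(b)) / length(b)

f_passed_strict
#> [1] 0.6
f_passed_lenient
#> [1] 1
```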
Because the implementation uses `dplyr::bind_cols()`, there is no SQL translation. What should be done instead is to use a join on row indices. Then we get something that works in all contexts.
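A base R sketch of the row-index idea, with `merge()` standing in for a dplyr join that would also translate to SQL:

``` r
# Attach per-row validation results via an explicit row index,
# rather than positional column binding
df    <- data.frame(row_id = 1:3, a = c(5, 6, 7))
flags <- data.frame(row_id = 1:3, pass = c(TRUE, FALSE, TRUE))

merged <- merge(df, flags, by = "row_id")
merged$pass
#> [1]  TRUE FALSE  TRUE
```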
This function could use some refactoring and tidying up.
Currently, the agent report provides line entries for all validation steps in the given order. There should be options for sorting by severity of the failure conditions and limiting/omitting the passing steps. This will make for more succinct reporting especially in an email context.
I really like that you added multilingual support for the report outputs. One place where that is currently missing (I think) is in the stock message parts for the emailing of the pointblank report. Could you add those in?
Add the `stop_count` and `stop_fraction` arguments for stopping validations when encountering a set threshold of failed checks.

Behind the scenes, create a secret agent and perform a one-hit interrogation on the target. Put the logic in `secret_ops.R`.
👋 Howdy! Not so much an issue, but a request. Would it be useful to create a vignette detailing how to use this package for data testing prior to retraining a machine learning model?
.pb
Here's an example:

``` r
library(pointblank)

# Copied from the docs, but wrapped in a function
fn <- function() {
  my_df <- data.frame(a = c(5, 4, 3, 5, 1, 2))
  agent <- create_agent() %>%
    focus_on(tbl_name = "my_df") %>%
    col_vals_lt(
      column = a,
      value = 6) %>%
    interrogate()
  all_passed(agent)
}

fn()
#> Error in get(tbl_name): object 'my_df' not found
```
Created on 2019-09-21 by the reprex package (v0.3.0)
Note that if a `my_df` object existed in the global scope, `focus_on` would use the global object instead. I think this behavior is happening because `focus_on` uses `get` instead of `dynGet` here.
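A minimal illustration that the lookup environment is what matters here (base R only; not pointblank code):

``` r
# get() finds a function-local object only when pointed at the
# right frame; looking in the global environment would miss it
fn <- function() {
  my_df <- data.frame(a = 1)
  get("my_df", envir = environment())  # the caller's local frame
}
nrow(fn())
#> [1] 1
```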
There needs to be a convenient method for passing in references to environment variables that hold DB credentials. A bonus function would be for testing environment variables (i.e., do the supplied environment variables result in a successful connection?).
Hi,
It would be very useful if this package could be applied to other DB types too using odbc, which translates into DB-specific queries. Any thoughts?
Right now date and datetime column values cannot be validated, and that’s a shame.
Seeing different behavior for the CRAN version, an intermediate version, and `master`.
``` r
library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> pointblank agent // <agent_2019-02-12_17:28:44>
#>
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#>
#> interrogation (2019-02-12 17:28:44) resulted in:
#>   - 1 passing validation
#>   - no failing validations   more info: `get_interrogation_summary()`
```
Created on 2019-02-12 by the reprex package (v0.2.1.9000)
``` r
library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
#> Please use the non-prefixed form or `!!` instead.
#>
#>   # Bad:
#>   rlang::expr(mean(rlang::UQ(var) * 100))
#>
#>   # Ok:
#>   rlang::expr(mean(UQ(var) * 100))
#>
#>   # Good:
#>   rlang::expr(mean(!!var * 100))
#>
#> This warning is displayed once per session.
#> pointblank agent // <agent_2019-02-12_17:29:47>
#>
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#>
#> interrogation (2019-02-12 17:29:48) resulted in:
#>   - no passing validations
#>   - 1 failing validation   more info: `get_interrogation_summary()`
```
Created on 2019-02-12 by the reprex package (v0.2.1.9000)
``` r
library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Error in create_autobrief(agent = agent, assertion_type = "col_vals_in_set", : argument "set" is missing, with no default
```
Created on 2019-02-12 by the reprex package (v0.2.1.9000)
When using the package to validate tables in Redshift, the number of lingering open connections shows a definite increase. Look into solutions for closing connections and efficiently re-using existing connections.
The README hasn't been touched in quite a long time and it could be a little shorter. Goal, I think, is to talk about the main workflows and problems that can be solved. All the other little details can go into vignettes/articles.
Currently, the documentation is minimal and doesn't explain how this function is utilized in the different validation contexts.
Right now, it's a pretty barebones report.
We need the option to send warnings or other general notifications to a specified Slack channel. This is very important for in-production validations running non-interactively.
A few usage examples for each function would be helpful.
Installation required manual installation of rJava, Hmisc, and lazyWeave
The dependence on R version >= 3.4.0 means I cannot install this package at work. (We are stuck on R 3.2, and it's not going to change.) I have looked at the dependencies of pointblank, and none of them seem to require 3.4, although perhaps some of their dependencies do. I might suggest modifying the `.travis.yml` file to try builds on older releases as a test, cf. R versions.
Add documentation for the x-list object. It can take multiple forms, which is potentially confusing. However, good docs and examples can make things more clear.
While most functions only require the use of one column, some take multiple columns. Using select helpers would be helpful for those cases.
This is necessary for creating any schematics of the validation plan and for reporting post-interrogation. It should indicate the settings for the object returned by the `action_levels()` helper function and applied to the validation step.
Right now, nothing is shown in the console during an interrogation (or even at the conclusion of one). So, use messages to convey what is happening.
I've been using this package (along with your gt package!) for about a month and it's really helped out a lot at work. We're trying to get our data quality under control and this package solves that problem perfectly (it's actually amazing how everything just seems to work without problems).
A feature request that I have is making a sort of combined report of interrogations for the same table but at different times. We want to have these to see (in a simple table) where things have improved or gotten worse.
I honestly wouldn't be surprised if you weren't thinking about this already, so, if this is something you were planning it would be a great next step.
Thank you so much for all your great packages. You're making my life easier!