
pointblank's People

Contributors

brancengregory, davzim, ekothe, gadenbuie, kierisi, ldalby, mayeulk, mikejohnpage, nutterb, pachadotdev, rich-iannone, yjunechoe

pointblank's Issues

Checks for sf-objects

I intended to check key properties of sf(c) objects using rows_not_duplicated(). The check was supposed to ignore the geometry column of the object (cf. the second example in the reprex).

interrogate() appears to fail because of the way summarize() works on these objects.

Reprex example:

library(pointblank)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

# Geometry object with 2 features
g <- rep(st_sfc(st_point(1:2)), 2)

# vector with 2 entries
v <- c("a", "b")

# object including both objects
mixed_obj <- st_sf("vector" = v, "points" = g)
mixed_obj
#> Simple feature collection with 2 features and 1 field
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 2 xmax: 1 ymax: 2
#> epsg (SRID):    NA
#> proj4string:    NA
#>   vector      points
#> 1      a POINT (1 2)
#> 2      b POINT (1 2)

agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated() %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

# The same error occurs when only the column "vector" is checked for duplicates
# (likely because `sf` objects have "sticky" geometries)
agent <- create_agent()
agent %>% 
  focus_on("mixed_obj") %>% 
  rows_not_duplicated(cols = vector) %>% 
  interrogate()
#> Error: Can't coerce element 2 from a list to a double

Created on 2019-02-12 by the reprex package (v0.2.1)

I think it happens at the following chunk in interrogate() in the section "# Judge tables on expectation of non-duplicated rows":

      # Get total count of rows
      row_count <-
        table %>%
        dplyr::group_by() %>%
        dplyr::summarize(row_count = n()) %>%
        dplyr::as_tibble() %>%
        purrr::flatten_dbl()

My expectation would be that:

  1. in the first case of the reprex (rows_not_duplicated(), without specifying columns) each whole row, including the geometry column, would be compared with the others.
  2. in the second case (rows_not_duplicated(cols = vector)) the check would be done only for the column "vector".

Perhaps a solution might be to call as_tibble() before group_by() and summarize()?
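The two expectations can be illustrated outside of pointblank and sf. The following is only a base R sketch: the list column stands in for a geometry column, and `rows_duplicated()` is a hypothetical helper, not a pointblank function.

```r
# A plain data frame with a list column standing in for the sf geometry column
# (an assumption for illustration; this is not an sf object).
tbl <- data.frame(vector = c("a", "b", "a"), stringsAsFactors = FALSE)
tbl$points <- list(c(1, 2), c(1, 2), c(3, 4))

# Hypothetical helper: flag duplicated rows, optionally restricted to `cols`.
rows_duplicated <- function(tbl, cols = names(tbl)) {
  duplicated(tbl[, cols, drop = FALSE])
}

# Case 1: whole rows, including the list column, are compared.
whole_rows <- rows_duplicated(tbl)

# Case 2: only the "vector" column is compared, so the third row
# counts as a duplicate of the first.
vector_only <- rows_duplicated(tbl, cols = "vector")
```

The row count itself does not need summarize() in this sketch; `nrow(tbl)` is independent of the column types.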

CC: @krlmlr

Include a notification function that integrates with Slack

Thanks for making this package! It's very helpful and I got a few validation processes running smoothly in production (no issues with that at all).

I like the email notifier that's included and something along the same lines is a Slack notifier, which would be a great addition! Would that be a feature you're willing to add in?

`preconditions` should be a list of expressions

Presently, any preconditions just filter the data before performing a validation. It would be much better to accept a list of expressions that manipulate the data. In this way, the user could mutate the table and perhaps generate a new column that would undergo validation (among other possibilities).
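One way to picture this, as a base R sketch only: each precondition is modelled here as a function of the table (an assumption; the issue speaks of expressions), and `apply_preconditions()` is a hypothetical helper that applies them in order before any validation runs.

```r
# Hypothetical helper: apply each precondition (a function of the table) in order.
apply_preconditions <- function(tbl, preconditions = list()) {
  Reduce(function(acc, f) f(acc), preconditions, tbl)
}

tbl <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

prepared <- apply_preconditions(
  tbl,
  preconditions = list(
    function(x) x[x$a > 1, , drop = FALSE],   # filter rows
    function(x) transform(x, total = a + b)   # derive a new column
  )
)

# The derived column can now be validated like any other column.
total_ok <- all(prepared$total > 6)
```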

Release pointblank 0.3.0

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Add dataset to package

Currently there are no datasets in the package but one or two would be useful for examples and vignettes.

Add manual tests for a variety of database types

We need a standardized test suite that exercises all of the validation step functions with a variety of database types. The databases and their drivers should be: MySQL (with RMariaDB), PostgreSQL (with RPostgres), SQLite (with RSQLite).

welcome page example not working

>  create_agent() %>%             # (1)
+   focus_on(
+     tbl_name = "tbl_1") %>%      # (2)
+   col_vals_gt(
+     column = "a",
+     value = 0)
Error in bind_rows_(x, .id) : 
  Evaluation error: Argument 6: list can't contain data frames.
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.4 (Sierra)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.1     pointblank_0.1   rlang_0.0.0.9018

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10.2      bindr_0.1           knitr_1.15.1        magrittr_1.5        hms_0.3             devtools_1.12.0    
 [7] R6_2.2.0            stringr_1.2.0       httr_1.2.1          dplyr_0.5.0.9004    tools_3.3.1         DBI_0.6-11         
[13] git2r_0.15.0        withr_1.0.2         htmltools_0.3.5     lazyeval_0.2.0.9000 RPostgreSQL_0.4-1   assertthat_0.2.0   
[19] digest_0.6.12       rprojroot_1.2       tibble_1.3.0        tidyr_0.6.1         purrr_0.2.2         readr_1.1.0        
[25] curl_2.3            memoise_1.0.0       glue_1.0.0          evaluate_0.10       rmarkdown_1.5       stringi_1.1.5      
[31] backports_1.0.5   

Have some of the step functions use columns (not just numbers) as comparisons

Love this package! I'm setting up all sorts of validations and one thing I think would be useful is to enable a direct comparison of one column to another.

For example if we wanted to validate that column a is always greater than column b, we should be able to use col_vals_gt(vars(a), vars(b)). What do you think?
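The comparison being asked for amounts to a row-wise test of one column against another. A minimal base R sketch, where `vals_gt()` is a hypothetical helper and not part of pointblank:

```r
# Hypothetical helper: `value` may be a number or the name of another column.
vals_gt <- function(tbl, column, value) {
  rhs <- if (is.character(value)) tbl[[value]] else value
  tbl[[column]] > rhs
}

df <- data.frame(a = c(5, 7, 9), b = c(1, 8, 3))

against_column <- vals_gt(df, "a", "b")  # compare column a with column b, row by row
against_number <- vals_gt(df, "a", 0)    # compare column a with a fixed value
```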

Again, this package is incredible. Thanks!

col_vals_in_set passes even when values are not in the set

The col_vals_in_set test appears to pass regardless of whether or not values are actually in the set. See reproducible example below:

Create a simple two column data frame

df <-
  data.frame(
    a = c(1, 2, 3, 4),
    b = c("one", "two", "three", "four"),
    stringsAsFactors = FALSE)

Validate that all numerical values in column a belong to a numerical set and that all values in column b belong to a set of string values. Note that none of the values in either column belong to their set, so neither validation should pass.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_in_set(
    column = a,
    set = 10:20) %>%
  col_vals_in_set(
    column = b,
    set = c("mouse", "dog", "cat", "pig")) %>%
  interrogate()

However, all validation checks are reported as passed:

all_passed(agent)
[1] TRUE
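For reference, the expected outcome can be checked with base R's `%in%` on the same data. This is an independent check and not how pointblank evaluates the step.

```r
df <-
  data.frame(
    a = c(1, 2, 3, 4),
    b = c("one", "two", "three", "four"),
    stringsAsFactors = FALSE)

# Row-wise membership tests for each column
a_in_set <- df$a %in% 10:20
b_in_set <- df$b %in% c("mouse", "dog", "cat", "pig")

# Neither column has any value in its set, so both checks should fail.
expected_all_passed <- all(a_in_set) && all(b_in_set)
```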

Add print method for the x-list

As a way to make the x-list object a bit more visually appealing in the console (and less annoying), a print method should be added.
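A print method of this kind could look roughly like the S3 sketch below. The class name `x_list`, the constructor, and the components shown are all assumptions made for illustration.

```r
# Hypothetical constructor: the class name "x_list" is an assumption.
new_x_list <- function(...) structure(list(...), class = "x_list")

# Compact print method: one line per component instead of the full contents.
print.x_list <- function(x, ...) {
  cat("<x-list> with", length(x), "components\n")
  for (nm in names(x)) {
    cat(sprintf("  $%s <%s> [%d]\n", nm, class(x[[nm]])[1], length(x[[nm]])))
  }
  invisible(x)
}

xl <- new_x_list(n = c(10, 10), f_passed = c(0.6, 1), warn = c(TRUE, FALSE))
printed <- capture.output(print(xl))
```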

Add an `active` option to all validation step functions

Each validation step function will get the argument active, which will accept a logical value (defaulting to TRUE).

If step functions are working with an agent, FALSE will make the step inactive (still reporting its presence and keeping indexes for the steps unchanged).

If the step functions are operating directly on data, then any step with active = FALSE will simply pass the data through, no longer acting as a filter (internally, just returning the data early).

A valid use case for this is setting a global switch on some or all validation steps depending on the context (e.g., in production or not).
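For the case where a step operates directly on data, the described behaviour might be sketched as follows. `check_vals_gt()` and `in_production` are hypothetical names used only for this example.

```r
# Hypothetical step function operating directly on data.
check_vals_gt <- function(tbl, column, value, active = TRUE) {
  # An inactive step returns the data early and performs no check.
  if (!isTRUE(active)) return(tbl)
  ok <- tbl[[column]] > value
  if (!all(ok, na.rm = TRUE)) {
    stop("Values in `", column, "` are not all greater than ", value, call. = FALSE)
  }
  tbl
}

# A global switch (hypothetical) decides whether the step runs.
in_production <- FALSE
df <- data.frame(a = c(1, -2, 3))

# With the step inactive, the data passes through unchanged.
passed_through <- check_vals_gt(df, "a", 0, active = in_production)

# With the step active, the failing value raises an error.
active_result <- tryCatch(
  check_vals_gt(df, "a", 0, active = TRUE),
  error = function(e) "failed"
)
```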

Add functionality for simple validations (e.g., `df %>% col_vals_gt(...)`)

The idea is to pass a data object directly to a validation function and get a re-usable output (e.g., vector of logical values) that can be used in other functions. This would be very useful for joint validations where we could have:

df %>% <validation_function>(...) & 
df %>% <validation_function>(...)

And the resultant vector of logicals could show which rows jointly passed (of course, one has to ensure that the input is passed to the validation functions unchanged).

This shouldn't affect the existing API that much. The first argument of any validation function will change from agent to ... where each function will internally sort out whether to use an agent object or immediately interrogate. The ... will also be useful if we decide to wrap inputs in some helper function (e.g., jointly(), etc.).
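As a rough base R sketch of the joint-validation idea: `test_vals_gt()` and `test_vals_lt()` are hypothetical functions that take the data directly and return one logical value per row.

```r
# Hypothetical validation functions returning a logical vector per row.
test_vals_gt <- function(tbl, column, value) tbl[[column]] > value
test_vals_lt <- function(tbl, column, value) tbl[[column]] < value

df <- data.frame(a = c(1, 5, 9), b = c(10, 2, 3))

# Rows that pass both validations; `df` itself is left unchanged.
jointly_passed <- test_vals_gt(df, "a", 2) & test_vals_lt(df, "b", 5)
```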

Use a `values` list column in the `validation_set` object

This is needed to simplify the model for validation steps. With a list column we can accommodate any type so any values put in value, set, and regex would simply go into the values list column.

This also makes it easier to have non-numeric comparisons so dates or date-times could then be specified and used.
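To illustrate what a list column allows, here is a small sketch of a table of steps. The layout and column names are assumptions and do not reflect the actual `validation_set` structure.

```r
# Hypothetical validation_set layout with a `values` list column.
validation_set <- data.frame(
  assertion_type = c("col_vals_gt", "col_vals_in_set", "col_vals_regex", "col_vals_gt"),
  column = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Each element may have its own type and length.
validation_set$values <- list(
  0,                      # a single numeric value
  c("one", "two"),        # a set
  "^[a-z]+$",             # a regex
  as.Date("2019-01-01")   # a date for a non-numeric comparison
)

value_classes <- vapply(validation_set$values, function(v) class(v)[1], character(1))
```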

col_exists handling of multiple columns

The description for col_exists implies that it can handle multiple columns. However the method for doing this is not apparent.

col_exists(column = c(start_date, age)) looks for a column called "c(start_date, age)"

col_exists(column = c("start_date", "age")) looks for a column called c('start_date', 'age')

Given this function isn't fully documented I'm not sure if I'm missing something. Is there a way to pass a vector of column names to this function?
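The behaviour being asked for, one existence check per supplied name, can be expressed in base R as below. This only sketches the expected result and says nothing about how col_exists is implemented.

```r
df <- data.frame(start_date = as.Date("2019-01-01"), age = 30L)

# One existence check per column name ("height" is deliberately absent).
cols <- c("start_date", "age", "height")
found <- setNames(cols %in% names(df), cols)
```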

Release pointblank 0.2.1

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

all_passed and n_passed handle NA values differently for some interrogation tests

Different parts of the interrogation report handle NA values differently. This is apparent for col_vals_between and col_vals_in_set (and presumably affects other tests I don't currently use).

I have a simple dataframe that includes missing values and check that all values of column B are between 0 and 20.

df<-structure(list(A = c(NA, 2L, 3L, 4L, 5L, 6L, NA, 8L, 9L, NA), 
               B = c(11L, 12L, NA, NA, NA, NA, 17L, 18L, 19L, 20L), 
               C = c(NA, NA, NA, 24L, NA, 26L, NA, 28L, NA, 30L)), 
               .Names = c("A",  "B", "C"), row.names = c(NA, -10L), class = "data.frame")

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20) %>%
  interrogate()

all_passed(agent)
get_interrogation_summary(agent)

all_passed() returns TRUE but get_interrogation_summary() reports that only 60% of rows passed (n_passed is 6 of 10), so the four NA rows are counted as failing there.

# A tibble: 1 x 12
  tbl_name db_type  assertion_type   column value regex all_passed     n n_passed f_passed action brief            
  <chr>    <chr>    <chr>            <chr>  <dbl> <chr> <lgl>      <dbl>    <dbl>    <dbl> <chr>  <chr>            
1 df       local_df col_vals_between B         NA NA    TRUE          10        6      0.6 NA     Expect that valu~

This occurs because the NAs are counted as failing in some calculations but not in others.

I can partially control this behaviour by adding a pre-condition to only apply this test to rows where B is not NA.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    preconditions = is.na(B) == FALSE) %>%
  interrogate()

But this becomes tedious when testing a large number of columns where NAs should not be counted as failing (since each would require a pre-condition that refers to the relevant column by name).

It's also not apparent whether I could specify that NAs should be counted as not in range (except by using a different test). This is a problem if I want to trigger a warning or notification based on the joint failure rate of NAs and out-of-range values.

The ability to include or exclude NAs from any given test (as in the example below) would improve the usability of this function and add consistency between the n_passed and all_passed values.

agent <-
  create_agent() %>%
  focus_on(tbl_name = "df") %>%
  col_vals_between(
    column = B,
    left = 0,
    right = 20,
    na_as_in_range = TRUE) %>%
  interrogate()
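The effect of such an option can be sketched in base R on column B from the example. `vals_between()` is a hypothetical helper; the argument name `na_as_in_range` is taken from the proposal above.

```r
# Hypothetical helper: NAs take the value of `na_as_in_range`.
vals_between <- function(x, left, right, na_as_in_range = FALSE) {
  ok <- x >= left & x <= right
  ok[is.na(ok)] <- na_as_in_range
  ok
}

B <- c(11L, 12L, NA, NA, NA, NA, 17L, 18L, 19L, 20L)

strict  <- vals_between(B, 0, 20)                         # NAs count as failing
lenient <- vals_between(B, 0, 20, na_as_in_range = TRUE)  # NAs count as passing

# Under either setting, n_passed and all_passed are derived from the
# same logical vector, so they cannot disagree.
strict_n_passed    <- sum(strict)
strict_all_passed  <- all(strict)
lenient_n_passed   <- sum(lenient)
lenient_all_passed <- all(lenient)
```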

Have options to sort agent report by failure condition

Currently, the agent report provides line entries for all validation steps in the given order. There should be options for sorting by severity of the failure conditions and limiting/omitting the passing steps. This will make for more succinct reporting especially in an email context.
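A sketch of the requested ordering on a toy summary table follows. The column layout loosely follows the interrogation summary shown in other issues, and the severity ranking (`stop`, then `warn`, then `notify`) is an assumption.

```r
# Toy report; values are made up for illustration.
report <- data.frame(
  step = 1:4,
  f_passed = c(1, 0.95, 0.6, 1),
  action = c(NA, "warn", "stop", NA),
  stringsAsFactors = FALSE
)

# Assumed severity ranking, most severe first.
severity_order <- c("stop", "warn", "notify")

# Omit passing steps, then sort by severity and by fraction passed.
failing <- report[report$f_passed < 1, ]
severity_rank <- match(failing$action, severity_order)
sorted_failing <- failing[order(severity_rank, failing$f_passed), ]
```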

Additional language support for message parts

I really like that you added multilingual support for the report outputs. One place where that is currently missing (I think) is in the stock message parts for the emailing of the pointblank report. Could you add those in?

`focus_on` can fail to get the right local dataframe

Here's an example:

library(pointblank)
# Copied from the docs, but wrapped in a function
fn <- function() {
  my_df <- data.frame(a = c(5, 4, 3, 5, 1, 2))
  
  agent <- create_agent() %>%
    focus_on(tbl_name = "my_df") %>%
    col_vals_lt(
      column = a,
      value = 6) %>%
    interrogate()
  
  all_passed(agent)
}
fn()
#> Error in get(tbl_name): object 'my_df' not found

Created on 2019-09-21 by the reprex package (v0.3.0)

Note that if a my_df object existed in the global scope, focus_on would use the global object instead. I think this behavior is happening because focus_on uses get instead of dynGet here.
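The scoping difference can be reproduced without pointblank. In this sketch, `lookup_with_get()` and `lookup_with_dynget()` are hypothetical stand-ins for a package function that resolves a table by name.

```r
# get() searches the helper's own frame and then its enclosing environments,
# which do not include the frame of the function that called the helper.
lookup_with_get <- function(tbl_name) get(tbl_name)

# dynGet() walks back through the calling frames instead.
lookup_with_dynget <- function(tbl_name) dynGet(tbl_name)

fn <- function() {
  local_only_df <- data.frame(a = c(5, 4, 3, 5, 1, 2))
  list(
    get = tryCatch(lookup_with_get("local_only_df"), error = function(e) NULL),
    dynget = lookup_with_dynget("local_only_df")
  )
}

res <- fn()
get_found    <- !is.null(res$get)
dynget_found <- is.data.frame(res$dynget)
```

An alternative with the same effect would be to pass the caller's environment explicitly, for example `get(tbl_name, envir = parent.frame())`.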

Add option to use environment variables for DB connections

There needs to be a convenient method for passing in references to environment variables that hold DB credentials. A bonus function would be for testing environment variables (i.e., do the supplied environment variables result in a successful connection?).
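A minimal sketch of the first part, reading credentials from named environment variables, might look like this. `db_creds_from_env()` and the variable names are hypothetical; the connection test mentioned as a bonus is not covered.

```r
# Hypothetical helper: resolve credentials from named environment variables
# and fail early if any of them is unset.
db_creds_from_env <- function(user_var, password_var) {
  vars <- c(user = user_var, password = password_var)
  values <- Sys.getenv(vars, unset = NA)
  unset_vars <- vars[is.na(values)]
  if (length(unset_vars) > 0) {
    stop("Unset environment variables: ", paste(unset_vars, collapse = ", "), call. = FALSE)
  }
  as.list(setNames(unname(values), names(vars)))
}

# Example with made-up variable names and values.
Sys.setenv(PB_EXAMPLE_USER = "analyst", PB_EXAMPLE_PASSWORD = "secret")
creds <- db_creds_from_env("PB_EXAMPLE_USER", "PB_EXAMPLE_PASSWORD")

# Referencing a variable that is not set produces an error.
unset_result <- tryCatch(
  db_creds_from_env("PB_EXAMPLE_USER", "PB_EXAMPLE_NOT_SET_ANYWHERE"),
  error = function(e) "error"
)
```

Only the variable names appear in code this way; the credentials themselves stay in the environment.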

col_vals_in_set() broken

I'm seeing different behavior for the CRAN version, an intermediate version, and master:

CRAN

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> pointblank agent // <agent_2019-02-12_17:28:44>
#> 
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#> 
#> interrogation (2019-02-12 17:28:44) resulted in:
#>   - 1 passing validation
#>   - no failing validations   more info: `get_interrogation_summary()`

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

5f7b88a (last good revision, parent of b2541da)

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
#> Please use the non-prefixed form or `!!` instead.
#> 
#>   # Bad:
#>   rlang::expr(mean(rlang::UQ(var) * 100))
#> 
#>   # Ok:
#>   rlang::expr(mean(UQ(var) * 100))
#> 
#>   # Good:
#>   rlang::expr(mean(!!var * 100))
#> 
#> This warning is displayed once per session.
#> pointblank agent // <agent_2019-02-12_17:29:47>
#> 
#> tables of focus: data/local_df (1).
#> number of validation steps: 1
#> 
#> interrogation (2019-02-12 17:29:48) resulted in:
#>   - no passing validations
#>   - 1 failing validation   more info: `get_interrogation_summary()`

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

b2541da up to master

library(tibble)
library(pointblank)

data <-
  tibble(text = c("a", "b", "C", NA))

set <- letters

create_agent() %>%
  focus_on("data") %>%
  col_vals_in_set(text, set) %>%
  interrogate()
#> Error in create_autobrief(agent = agent, assertion_type = "col_vals_in_set", : argument "set" is missing, with no default

Created on 2019-02-12 by the reprex package (v0.2.1.9000)

Potential unclosed connections in Redshift

When using the package to validate tables in Redshift, the number of non-strapped connections shows a definite increase. Look into solutions for closing connections and re-using existing connections efficiently.

Update README

The README hasn't been touched in quite a long time and it could be a little shorter. Goal, I think, is to talk about the main workflows and problems that can be solved. All the other little details can go into vignettes/articles.

Consider less stringent R version dependence.

The dependence on R version >= 3.4.0 means I cannot install this package at work. (We are stuck on R 3.2, and it's not going to change.) I have looked at the dependencies of pointblank, and none of them seem to require 3.4, although perhaps some of their dependencies do. I would suggest modifying the .travis.yml file to try builds on older releases as a test (cf. R versions).

Create an actions and levels info strip

This is necessary for creating any schematics of the validation plan and for reporting post-interrogation. It should indicate the settings for the object returned by the action_levels() helper function and applied to the validation step.

Reporting across interrogations

I've been using this package (along with your gt package!) for about a month and it's really helped out a lot at work. We're trying to get our data quality under control and this package solves that problem perfectly (it's actually amazing how everything just seems to work without problems).

A feature request that I have is making a sort of combined report of interrogations for the same table but at different times. We want to have these to see (in a simple table) where things have improved or gotten worse.
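The kind of table described could be assembled along these lines. The two summaries are made-up data with an assumed layout (one row per validation step and the fraction of rows passing).

```r
# Made-up summaries of the same validation steps at two points in time.
run_1 <- data.frame(step = 1:2, f_passed = c(0.60, 1.00))
run_2 <- data.frame(step = 1:2, f_passed = c(0.85, 1.00))

# Join on the step and compute the change between runs.
combined <- merge(run_1, run_2, by = "step", suffixes = c("_run_1", "_run_2"))
combined$change <- combined$f_passed_run_2 - combined$f_passed_run_1
```

A positive `change` marks a step that improved between the two interrogations; a negative one marks a step that got worse.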

I honestly wouldn't be surprised if you were already thinking about this, so if it's something you were planning, it would be a great next step.

Thank you so much for all your great packages. You're making my life easier!
