emilyriederer / convo

R package based on "Column Names as Contracts" blog post (https://emilyriederer.netlify.app/post/column-name-contracts/)

Home Page: https://emilyriederer.github.io/convo/

License: Other

R 100.00%

data-validation data-quality controlled-vocabulary schema-design variable-names variable-naming r-package

convo's Introduction

convo

The goal of convo is to enable the creation of a controlled vocabulary for naming columns in a relational dataset, as described in my blog post Column Names as Contracts. This controlled vocabulary can then be used to check a set of names for adherence, to automate documentation, and to generate data checks via the pointblank package.

Installation

You can install the development version of convo from GitHub with:

devtools::install_github("emilyriederer/convo")

Features

Available

Define a controlled vocabulary (a convo) in R or YAML, including valid name stubs at different levels of the ontology and optional descriptions or validation checks
Parse stub lists (candidate convos) from a set of variables
Evaluate if a set of names adheres to a convo and identify violations
Compare convo objects and/or stub lists with set-like operations (union, intersect, setdiff) to identify new candidates for inclusion
Generate a pointblank validation agent or YAML file from a convo object for data validation
Document a dataset with network diagrams or a table
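The parsing, adherence-checking, and set-operation features listed above can be illustrated with a minimal base-R sketch. It deliberately does not call convo's own functions: `vocab`, `parse_stub_levels()`, and `find_violations()` are hypothetical names made up for this illustration, and the example assumes every name uses `_` as its separator.

```r
# Hypothetical controlled vocabulary: allowed stubs at each level.
vocab <- list(
  level1 = c("ID", "IND", "AMT"),
  level2 = c("A", "B")
)

# Hypothetical helper: split names on the separator and collect the
# unique stubs observed at each position (a candidate vocabulary).
parse_stub_levels <- function(col_names, sep = "_") {
  parts <- strsplit(col_names, sep, fixed = TRUE)
  depth <- max(lengths(parts))
  lapply(seq_len(depth), function(i) {
    stubs <- vapply(
      parts,
      function(p) if (length(p) >= i) p[i] else NA_character_,
      character(1)
    )
    unique(stubs[!is.na(stubs)])
  })
}

# Hypothetical helper: for each level, list observed stubs that the
# vocabulary does not allow (a plain setdiff per level).
find_violations <- function(col_names, vocab, sep = "_") {
  observed <- parse_stub_levels(col_names, sep)
  n <- min(length(observed), length(vocab))
  lapply(seq_len(n), function(i) setdiff(observed[[i]], vocab[[i]]))
}

col_names  <- c("ID_A", "IND_B", "AMT_C", "QTY_A")
candidates <- parse_stub_levels(col_names)
violations <- find_violations(col_names, vocab)
```

Here `violations` flags `QTY` at level 1 and `C` at level 2, which are exactly the stubs one might then review as new candidates for inclusion in the vocabulary.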

Current Limitations / Future Enhancements

Define overall metadata for the controlled vocabulary, such as:
- overall descriptor string
- human-readable names describing each level
Richer control over levels. Names are currently evaluated only from the front; in the future the package could:
- allow some levels to be optional
- work from both front and the back
Current levels are independent of one another
- could allow for truly hierarchical ontologies where allowed level 2 stubs vary by level 1 stub used
The current assumption is that all realizations of a controlled vocabulary are delimited by the same separator
- to work better with file paths, it might be useful to enable multiple types of delimiters
Current regex support is slightly unreliable and needs better documentation and expansion
More aesthetic documentation (describe_*() functions)
Better set operations for combining instead of overwriting full convo specifications (not just stub lists)
Explore integration with dm package to validate names across a schema

Example

Main pieces of functionality are illustrated in the Quick Start Guide on the package website.

convo's Issues

Quickstart guide example fails

Hi Emily,

Lovely package. I wanted to use it for an upcoming project, and while I was kicking the tires, I hit this

> filepath <- system.file("", "ex-convo.yml", package = "convo")
> convo <- read_convo(filepath)
> write_pb(convo, c("IND_A", "AMT_B"), filename = "convo-validation.yml", path = tempdir())
 Error in if (step_list$column[[1]] == step_list$columns_expr) { : 
  missing value where TRUE/FALSE needed 
7. get_column_text(step_list = step_list, expanded = expanded)
6. as_agent_yaml_list(agent = agent, expanded = expanded)
5. pointblank::yaml_write(., filename = "convo-validation.yml",
       path = "/var/folders/bp/l5qt50g13sndlz0x48jqqv300000gn/T//RtmpqDnAMJ")
4. pointblank::create_agent(read_fn = ~setNames(as.data.frame(matrix(1,
       ncol = 2)), c("IND_A", "AMT_B"))) %>% pointblank::col_vals_not_null(matches("^([A-Za-z]_){0}ID"),
       step_id = 1) %>% pointblank::col_is_numeric(matches("^([A-Za-z]_){0}ID"),
       step_id = 2) %>% pointblank::col_vals_between(matches("^([A-Za-z]_){0}ID"),  ... at <text>#1
3. eval(parse(text = code))
2. eval(parse(text = code))
1. write_pb(convo, c("IND_A", "AMT_B"), filename = "convo-validation.yml",
       path = tempdir())

Likewise when interrogating:

> pointblank::interrogate(agent)

── Interrogation Started - there are 13 steps ───────────────────────────────────────────────────────────────
Error: Only strings can be converted to symbols
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlang_error>
Only strings can be converted to symbols
Backtrace:
 1. pointblank::interrogate(agent)
 2. pointblank:::check_table_with_assertion(...)
 3. pointblank:::interrogate_not_null(agent = agent, idx = idx, table = table)
 4. pointblank:::get_column_as_sym_at_idx(agent = agent, idx = idx)
 5. rlang::sym(...)

I'd love to find out this was my misunderstanding, thanks!

GitHub links missing from DESCRIPTION :-)

Update `pointblank` dependency once it gets next version number

Per changes in #4

create_pb_agent() does not match any columns if level > 1

Hi Emily,

Excellent idea. I took the package for a spin - mainly to see if it can live with the not-recommended scheme of having the measure in the second level.

That's how I found a minor bug in create_pb_agent(), where no checks are run by {pointblank} if level is set to 2 or more.

Reprex:

library(convo)
library(tibble)
library(pointblank)

dt <- tibble(car_id = numeric(), car_name = character(),
             passenger_cnt = numeric())

cnv <- structure(list(level1 = list(car = NULL, 
                                    passenger = NULL), 
                      level2 = list(id = list(valid = "col_is_numeric()"), 
                                    cnt = list(valid = "col_is_numeric()"), 
                                    name = NULL)), class = c("convo", "list"))

evaluate_convo(cnv, names(dt))

#> Level 1
#> - 
#> Level 2
#> -

cagent <- create_pb_agent(cnv, dt, level = 2)
cagent$validation_set

#> # A tibble: 0 x 29
#> # … with 29 variables: i <int>, step_id <chr>, sha1 <chr>,
#> #   assertion_type <chr>, column <list>, values <list>, na_pass <lgl>,
#> #   preconditions <list>, actions <list>, label <chr>, brief <chr>,
#> #   active <list>, eval_active <lgl>, eval_error <lgl>, eval_warning <lgl>,
#> #   capture_stack <list>, all_passed <lgl>, n <int>, n_passed <int>,
#> #   n_failed <int>, f_passed <dbl>, f_failed <dbl>, warn <lgl>, notify <lgl>,
#> #   stop <lgl>, row_sample <dbl>, tbl_checked <list>, time_processed <dttm>,
#> #   proc_duration_s <dbl>

# interrogate(cagent) # shows no checks if run in an interactive session

Created on 2021-01-13 by the reprex package (v0.3.0)

Looking at the debug, I think it's caused by a bit of regex. I am aware this is experimental but in case it would be helpful I'd be happy to submit my local fix as a PR.
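One plausible reading of "a bit of regex", offered here only as a guess, is the pattern shape visible in the traceback of the Quickstart issue: `^([A-Za-z]_){0}ID`. If the level-2 pattern has the same shape with the repeat count raised to 1, the group `[A-Za-z]_` only admits a single-letter first-level stub, so `car_id` can never match. The base-R sketch below shows that behaviour and a widened pattern; the pattern construction is an assumption and is not taken from the package source.

```r
# Assumption: the pattern for a stub at a given level follows the shape
# seen in the traceback, with level - 1 as the repeat count.
level <- 2
cols  <- c("car_id", "car_name", "passenger_cnt", "a_id")

# Shape as in the traceback: exactly one letter before each underscore.
narrow <- sprintf("^([A-Za-z]_){%d}id", level - 1)

# Widened variant: one or more letters before each underscore.
wide <- sprintf("^([A-Za-z]+_){%d}id", level - 1)

narrow_hits <- grepl(narrow, cols)  # only "a_id" matches
wide_hits   <- grepl(wide, cols)    # "car_id" and "a_id" match
```

If this guess is right, it would also explain why the empty validation set appears only when level is 2 or more: with a repeat count of 0 the group is skipped entirely, so its width does not matter.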
