Professional data validation for the R environment
It may be useful or even necessary to specify the data types of the variables.
Currently this can be done by adding a rule for each column, but it could be made stricter,
especially when the same variable name is used with different types in a large rule set.
We need to think about the following case. Given the rule
x == y
we can incorporate some roundoff error by transforming it to
abs(x - y) < eps
with some chosen constant eps = lin.eq.eps (or num.eq.eps). Alternatively we can also translate the abs function and create the equivalent pair
-(x - y) < eps
x - y < eps
This was possible in earlier development versions. For now, I think it is a good idea to leave it out and hard-code the common membership/comparison and boolean operators. It will be fairly easy to make this more flexible in version 0.2.
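The rewrite described above can be sketched with R's computing-on-the-language tools. The function name lin_eq and the default tolerance below are illustrative assumptions, not part of the package:

```r
# Hypothetical sketch: rewrite a call `x == y` into `abs(x - y) < eps`.
lin_eq <- function(e, eps = 1e-8){
  stopifnot(is.call(e), identical(e[[1]], as.symbol("==")))
  substitute(abs(lhs - rhs) < eps,
             list(lhs = e[[2]], rhs = e[[3]], eps = eps))
}
lin_eq(quote(x == y))
# abs(x - y) < 1e-08
```

Since substitute() works on unevaluated expressions, the same trick extends to rewriting abs() into the pair of linear inequalities.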
The ~ / %->% operators currently yield conflict labels. These should be logical.
Reporting on block size, variables and linearity of rules and indicators
Need testing and probably some fixes.
Pure validation functions have a membership operator at the top of the AST. Compound validation functions have a boolean operator at the top and are therefore not guaranteed to be surjective on {T,F,NA}
.
It would be a good idea to separate this, at least under the hood.
Currently, a group is defined as:
G : { var1; var2; ...; varN}
We could do
G := { var1; var2; ...; varN }
or even
G := vargroup(var1, var2, ..., varN)
Since :=
is already used for transient assignment.
Advantages include cleaner syntax and probably cleaner parsing under the hood.
We can parse validation rules from yaml syntax. We need something for indicators as well. Something like
---
options:
  include:
    - <file1.yaml>
    - <file2.yaml>
---
indicators:
  -
    expr : mean(x)
    name : average
  -
    expr : sd(x)
    name : std.dev
    short: standard deviation
    long : |
      Estimated mean quadratic deviance from the mean.
---
For general equality rules we could write our own eq function that can handle numeric and categorical data and takes a tolerance, i.e. substitute
f(x) == g(y)
with
eq(x,y,tol='num.eq.eps')
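A minimal sketch of such an eq function. The name, signature and default tolerance are illustrative, not the package API:

```r
# Hypothetical eq(): tolerant comparison for numeric data, exact for categorical.
eq <- function(x, y, tol = 1e-8){
  if (is.numeric(x) && is.numeric(y)){
    abs(x - y) < tol
  } else {
    x == y
  }
}
eq(0.1 + 0.2, 0.3)   # TRUE despite floating point roundoff
eq("a", "b")         # FALSE
```

In the real substitution the tolerance would presumably be looked up from an option such as num.eq.eps rather than passed as a literal.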
Currently a rule has an implicit scope. It can act on individual records:
height > 50
on a whole column:
mean(height) > 50
or across datasets:
mean(height) < 1.2 * mean(ref$height)
It would be useful for reporting and dependent packages to retrieve explicitly the scope of rules, or to
retrieve rules of a specific scope type.
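One way to derive a scope automatically is to scan the rule's AST for aggregating functions. The helper below is a hypothetical sketch, not existing package API:

```r
# Hypothetical helper: classify a rule's scope by scanning its AST for aggregates.
rule_scope <- function(e, aggregates = c("mean", "sum", "sd", "min", "max")){
  has_agg <- function(x){
    if (is.call(x)){
      fname <- as.character(x[[1]])[1]
      if (fname %in% aggregates) return(TRUE)
      return(any(vapply(as.list(x)[-1], has_agg, logical(1))))
    }
    FALSE
  }
  if (has_agg(e)) "aggregate" else "record"
}
rule_scope(quote(height > 50))        # "record"
rule_scope(quote(mean(height) > 50))  # "aggregate"
```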
P.S.
It would also be nice to have:
group_by(gender) %>% mean(height) > 50
The %>% and %T>% operators of the magrittr package are really interesting for validating data flows.
We expect better integration with e.g. the Hadleyverse in the future.
Parsing expressionset and rule objects to markdown, LaTeX or html for report generation. Similar with confrontation-like objects. Could be in another package as well: validate.report (depending on validate.viz e.g.).
A miss slot could take values NA (default), TRUE or FALSE and represents the value to return when a rule yields a missing value.
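The intended semantics can be sketched as follows; apply_miss is a hypothetical helper illustrating the behavior, not package code:

```r
# Hypothetical: replace NA outcomes of a rule by the value of the `miss` slot.
apply_miss <- function(result, miss = NA){
  result[is.na(result)] <- miss
  result
}
apply_miss(c(TRUE, NA, FALSE))               # TRUE NA FALSE (default: keep NA)
apply_miss(c(TRUE, NA, FALSE), miss = FALSE) # TRUE FALSE FALSE
```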
We may need to look at the interface. I propose to label these functions EXPERIMENTAL
in the docs for 0.1 and develop them further for 0.2.
This way, output can be labeled with keys.
confront(v,d,key=<name of identifying variable>)
The key argument should be optional.
The confront generic always catches errors. This is a good default but should be made optional (probably by setting a switch in the factory function).
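The optional catching could look roughly like this; the function and the `raise` switch are hypothetical names for illustration:

```r
# Sketch: evaluate a rule, optionally catching errors as values instead of throwing.
eval_rule <- function(expr, env, raise = FALSE){
  if (raise){
    eval(expr, env)
  } else {
    tryCatch(eval(expr, env), error = function(e) e)
  }
}
res <- eval_rule(quote(undefined_var > 0), new.env())
inherits(res, "error")  # TRUE: the error is captured, not thrown
```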
Move away from weave-like parsing options and integrate it neatly in a yaml syntax. Advantages include json, xml and what-have-you-not.
Allow statements like
#' @validate include <filename>
in validation syntax files
The function validator should emit a warning when a user has locally overwritten primitives such as <, <=, ==, >=, > or %in%.
A validating function is defined as a function v: S->>{0,1 [,NA]}
(with S a complicated definition of datasets, see book draft def 8.3.4), where the 'double arrow' means surjective. At the moment, only the domain part is ensured (sort of) by checking that the top of the AST is a comparison or membership operator (or a boolean). Surjectivity however is not ensured. For example, the rule
v <- validator(1 > 0)
is swallowed by the constructor.
We should test for at least 1 variable, and probably reject constant rules like
1 > 0
1 < 0
This may seem annoying for rule manipulation using elimination or substitution methods, but maybe we need to do those outside the validator context. Elimination methods need to extract rules anyway, and we may leave it to the elimination method to handle contradictions.
The disadvantage of throwing an error when encountering an obvious inconsistency is that users may get a false sense of security when no error is thrown: there is no such thing as a check for inconsistency across rules for all cases.
It would make validate
more expressive if we allowed self referencing in the statements, meaning that the data set to be checked can be used in the syntax statements.
A good candidate is .
since this is also used by magrittr
and dplyr
:
e.g.
v <- validator(
  rich := subset(., income_class == "rich"),
  mean(rich$income) >= 1e6
)
Implementation would be simple: just add . to the environment in which the confrontation takes place.
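A bare-bones illustration of that idea, outside of the package:

```r
# Bind the dataset to `.` in the evaluation environment, then evaluate the rule.
d <- data.frame(income_class = c("rich", "poor"), income = c(2e6, 1e4))
e <- new.env()
assign(".", d, envir = e)
eval(quote(mean(subset(., income_class == "rich")$income) >= 1e6), e)
# TRUE
```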
nice to have.
Package does not build/install for i386 on my machine.
This gives check --as-cran errors on Windows.
Right now, rules are stored as a list of calls in a reference object. Switch to a rule object that is more extendable
This probably means at least a confront
method for time-series objects, e.g. ts
, xts
.
It should be possible to set processing options in files containing validation syntax. E.g.
validate_option(
  key = idvar,
  error = 'throw'
)
Several functions (var_group
, number_missing
, and so on) now have the option to define variables in quotes as regular expressions.
This is currently hardly tested and complicates things like rule/block analyses for unexpanded rule sets.
Better to remove this functionality for now and perhaps put it on the list for a later version, when we've thought about this better.
for the future!
Turns out, the magrittr::%T>% operator cannot do everything we want. We could do the following: give confront an extra argument out that gets called on the resulting confrontation object. E.g.
confront <- function(dat, x, out){
  cf <- NULL # compute confrontation object here (placeholder)
  if ( !missing(out) ){
    out(cf)
    invisible(dat)
  } else {
    cf
  }
}
This way, anybody can write a function that dumps confrontation output wherever. As a service we add something like the following simple storage facility
store <- function(){
  # storage room + forklifts (confrontation results are kept in a closed-over list)
  L <- list()
  add <- function(x) L <<- c(L, list(x))
  take <- function(x, simplify){
    if (simplify && length(x) == 1){
      L[[x]]
    } else {
      L[x]
    }
  }
  # return a function
  function(put, get, simplify = TRUE){
    if (!missing(put)) add(put)
    if (!missing(get)) return(take(get, simplify))
    if (missing(put) && missing(get)) take(seq_along(L), simplify)
  }
}
Typical workflow would be
v <- validator(x > 0, y>0)
st <- store()
mydata %<>% confront(v,st) %>%
do_something_to_mydata %>%
confront(v,st) %>%
do_something_else_to_mydata %>%
confront(v,st)
# retrieve first stored validation result
st(1)
# retrieve all validation results
st()
These may overwrite the global options set by validate_options
> library(validate)
> library(magrittr)
> women %>% check_that(mean(height)>0,height>0)
Object of class 'validation'
Call:
check_that(., mean(height) > 0, height > 0)
Confrontations: 2
Error in apply(values(x), 1, all) : dim(X) must have a positive length
For file processing, a debugging level that controls verbosity and exception handling will be useful.
We need a decision on this. We currently have implemented the ~ and the %->% synonym.
I finally decided I like %->% better. However, it has precedence over +, so the statement
A + B %->% C
is interpreted (parsed) as
A + (B %->% C)
while
A + B ~ C
parses to
(A + B) ~ C
Here's the precedence list from the R lang def
::
$ @
^
- + (unary)
:
%xyz%
* /
+ - (binary)
> >= < <= == !=
!
& &&
| ||
~ (unary and binary)
-> ->>
= (as assignment)
Since we need a) a relational operator and b) a combination operator, I now vote that we use the ~ again...
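The precedence difference is easy to verify with quote():

```r
# %xyz% operators bind tighter than `+`; `~` binds looser.
e1 <- quote(A + B %->% C)
e2 <- quote(A + B ~ C)
as.character(e1[[1]])  # "+" : parsed as A + (B %->% C)
as.character(e2[[1]])  # "~" : parsed as (A + B) ~ C
```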
Validator yaml files should allow for a section like this:
---
before_check : >
  # some free R code like
  nacecodes <- read.table("nace.tab")[,1]
---
The idea is that this code gets executed in the environment that will later surround the evaluation environment for the checks.
Docs/description need to be updated/rewritten to conform to the new structure and yaml parsing capabilities
Execute prior to extracting validation rules so users may define their own functions external to the validation file. Example
# file fun.R
myfun <- function(x) mean(sqrt(abs(x)))
# file validation.val
source('fun.R')
myfun(y) > 3
There are several functions in validate
that count missings row- or column wise. For example
v <- validator(number_missing(x,y)==0)
tests whether the total number of missings in columns x
and y
equals zero. So far so good. However,
v <- validator(number_missing()==0)
tests whether all variables in the dataset are non-missing.
Since number_missing
maps the full dataset to a single number, all variables are involved. At the moment, the blocks
function does not take this into account.
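For reference, the column-wise counting semantics described above amount to the following sketch (count_missing is a stand-in name, not the package implementation):

```r
# Assumed semantics: total number of NA values over the given columns.
count_missing <- function(...){
  sum(vapply(list(...), function(x) sum(is.na(x)), numeric(1)))
}
count_missing(c(1, NA, 3), c(NA, NA, 6))  # 3
```

The zero-argument form is the tricky case: it must be resolved against the full dataset at confrontation time, which is why blocks would need to treat it as involving all variables.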
Allow statements like
#' @validate package <packagename>
in validation syntax files.
Example use case (on commandline)
library(validate)
library(stringdist)
v <- validator(stringdist(name1,name2) < 3)
expressionset objects could carry a vector of package names that should be available during execution.
Seriously, setting add_calls=TRUE has no effect.
The idea is to reserve :=
assignments for transient variables and <-
assignments for immutable variables. Its use would be, for example, to do something like
# executed upon first parse.
validclasses <- read.csv('myclasses.csv')$classname
# executed when confronted with data:
v %in% validclasses
Perhaps we should relabel "calls", because it is an internal R feature, namely an R expression that is "callable" (can be evaluated). It will likely be unclear to users what "calls" are or mean.
Alternatives:
I suggest "title" and "description" (or "desc") instead of "short" and "long".
"Short" and "long" may lead users to believe these are integer types.