Professional data validation for the R environment
It may be useful or even necessary to specify the data types of the variables.
Currently this can be done by adding a rule for each column, but it could be made stricter,
especially when the same variable name is used with different types in a large rule set.
We need to think about the following case. Given the rule
x == y
we can incorporate some roundoff error by transforming it to
abs(x - y) < eps
with some chosen constant eps = lin.eq.eps (or num.eq.eps). Alternatively we can also translate the abs function and create the equivalent pair
-(x - y) < eps
x - y < eps
This was possible in earlier development versions. For now, I think it is a good idea to leave it out and hard-code the common membership/comparison and boolean operators. It will be fairly easy to make this more flexible in version 0.2.
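The rewrite described above can be sketched with R's computing-on-the-language tools. The function name lin_eq and the default tolerance below are illustrative assumptions, not part of the package:

```r
# Hypothetical sketch: rewrite a call `x == y` into `abs(x - y) < eps`.
lin_eq <- function(e, eps = 1e-8){
  stopifnot(is.call(e), identical(e[[1]], as.symbol("==")))
  substitute(abs(lhs - rhs) < eps,
             list(lhs = e[[2]], rhs = e[[3]], eps = eps))
}
lin_eq(quote(x == y))
# abs(x - y) < 1e-08
```

Since substitute() works on unevaluated expressions, the same trick extends to rewriting abs() into the pair of linear inequalities.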
The ~ / %->% operators currently yield conflict labels. These should be logical.
Reporting on block size, variables and linearity of rules and indicators
Need testing and probably some fixes.
Pure validation functions have a membership operator at the top of the AST. Compound validation functions have a boolean operator at the top and are therefore not guaranteed to be surjective on {T,F,NA}
.
It would be a good idea to separate this, at least under the hood.
Currently, a group is defined as:
G : { var1; var2; ...; varN}
We could do
G := { var1; var2; ...; varN }
or even
G := vargroup(var1, var2, ..., varN)
Since :=
is already used for transient assignment.
Advantages include cleaner syntax and probably cleaner parsing under the hood.
We can parse validation rules from yaml syntax. We need something for indicators as well. Something like
---
options:
  include:
    - <file1.yaml>
    - <file2.yaml>
---
indicators:
  -
    expr : mean(x)
    name : average
  -
    expr : sd(x)
    name : std.dev
    short: standard deviation
    long : |
      Estimated mean quadratic deviance from the mean.
---
For general equality rules we could write our own eq function that can handle numeric and categorical data and takes a tolerance, i.e. substitute
f(x) == g(y)
with
eq(x,y,tol='num.eq.eps')
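A minimal sketch of such an eq function. The name, signature and default tolerance are illustrative, not the package API:

```r
# Hypothetical eq(): tolerant comparison for numeric data, exact for categorical.
eq <- function(x, y, tol = 1e-8){
  if (is.numeric(x) && is.numeric(y)){
    abs(x - y) < tol
  } else {
    x == y
  }
}
eq(0.1 + 0.2, 0.3)   # TRUE despite floating point roundoff
eq("a", "b")         # FALSE
```

In the real substitution the tolerance would presumably be looked up from an option such as num.eq.eps rather than passed as a literal.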
Currently a rule has an implicit scope. It can act on individual records:
height > 50
on a whole column:
mean(height) > 50
or across datasets:
mean(height) < 1.2 * mean(ref$height)
It would be useful for reporting and dependent packages to retrieve explicitly the scope of rules, or to
retrieve rules of a specific scope type.
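One way to derive a scope automatically is to scan the rule's AST for aggregating functions. The helper below is a hypothetical sketch, not existing package API:

```r
# Hypothetical helper: classify a rule's scope by scanning its AST for aggregates.
rule_scope <- function(e, aggregates = c("mean", "sum", "sd", "min", "max")){
  has_agg <- function(x){
    if (is.call(x)){
      fname <- as.character(x[[1]])[1]
      if (fname %in% aggregates) return(TRUE)
      return(any(vapply(as.list(x)[-1], has_agg, logical(1))))
    }
    FALSE
  }
  if (has_agg(e)) "aggregate" else "record"
}
rule_scope(quote(height > 50))        # "record"
rule_scope(quote(mean(height) > 50))  # "aggregate"
```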
P.S.
It would also be nice to have:
group_by(gender) %>% mean(height) > 50
The %>% and %T>% operators of the magrittr package are really interesting for validating data flows.
We expect better integration with e.g. the Hadleyverse in the future.
Parsing expressionset and rule objects to markdown, LaTeX or html for report generation. Similar with confrontation-like objects. Could be in another package as well: validate.report (depending on validate.viz e.g.).
A miss slot could take values NA (default), TRUE or FALSE and represents the value to return when a rule yields a missing value.
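The intended semantics can be sketched as follows; apply_miss is a hypothetical helper illustrating the behavior, not package code:

```r
# Hypothetical: replace NA outcomes of a rule by the value of the `miss` slot.
apply_miss <- function(result, miss = NA){
  result[is.na(result)] <- miss
  result
}
apply_miss(c(TRUE, NA, FALSE))               # TRUE NA FALSE (default: keep NA)
apply_miss(c(TRUE, NA, FALSE), miss = FALSE) # TRUE FALSE FALSE
```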
We may need to look at the interface. I propose to label these functions EXPERIMENTAL
in the docs for 0.1 and develop them further for 0.2.
This way, output can be labeled with keys.
confront(v,d,key=<name of identifying variable>)
The key argument should be optional.
The confront generic always catches errors. This is a good default but should be made optional (probably by setting a switch in the factory function).
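The optional catching could look roughly like this; the function and the `raise` switch are hypothetical names for illustration:

```r
# Sketch: evaluate a rule, optionally catching errors as values instead of throwing.
eval_rule <- function(expr, env, raise = FALSE){
  if (raise){
    eval(expr, env)
  } else {
    tryCatch(eval(expr, env), error = function(e) e)
  }
}
res <- eval_rule(quote(undefined_var > 0), new.env())
inherits(res, "error")  # TRUE: the error is captured, not thrown
```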
Move away from weave-like parsing options and integrate it neatly in a yaml syntax. Advantages include json, xml and what-have-you-not.
Allow statements like
#' @validate include <filename>
in validation syntax files
The function validator should emit a warning when a user has locally overwritten primitives such as <, <=, ==, >=, > or %in%.
A validating function is defined as a function v: S->>{0,1 [,NA]}
(with S a complicated definition of datasets, see book draft def 8.3.4), where the 'double arrow' means surjective. At the moment, only the domain part is ensured (sort of) by checking that the top of the AST is a comparison or membership operator (or a boolean). Surjectivity however is not ensured. For example, the rule
v <- validator(1 > 0)
is swallowed by the constructor.
We should test for at least 1 variable, and probably reject constant rules like
1 > 0
1 < 0
This may seem annoying for rule manipulation using elimination or substitution methods, but maybe we need to do those outside the validator context. Elimination methods need to extract rules anyway, and we may leave it to the elimination method to handle contradictions.
The disadvantage of throwing an error when encountering an obvious inconsistency is that users may get a false sense of security when no error is thrown: there is no such thing as a check for inconsistency across rules for all cases.
It would make validate
more expressive if we allowed self referencing in the statements, meaning that the data set to be checked can be used in the syntax statements.
A good candidate is .
since this is also used by magrittr
and dplyr
:
e.g.
v <- validator(
  rich := subset(., income_class == "rich"),
  mean(rich$income) >= 1e6
)
Implementation would be simple: just add . to the environment in which the confrontation takes place.
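A bare-bones illustration of that idea, outside of the package:

```r
# Bind the dataset to `.` in the evaluation environment, then evaluate the rule.
d <- data.frame(income_class = c("rich", "poor"), income = c(2e6, 1e4))
e <- new.env()
assign(".", d, envir = e)
eval(quote(mean(subset(., income_class == "rich")$income) >= 1e6), e)
# TRUE
```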
nice to have.
Package does not build/install for i386 on my machine.
This gives check --as-cran errors on Windows.
Right now, rules are stored as a list of calls in a reference object. Switch to a rule object that is more extendable
This probably means at least a confront
method for time-series objects, e.g. ts
, xts
.
It should be possible to set processing options in files containing validation syntax. E.g.
validate_option(
  key = idvar,
  error = 'throw'
)
Several functions (var_group
, number_missing
, and so on) now have the option to define variables in quotes as regular expressions.
This is currently hardly tested and complicates things like rule/block analyses for unexpanded rule sets.
Better to remove this functionality for now and perhaps put it on the list for a later version, when we've thought about this better.
for the future!
Turns out, the magrittr::%T>% operator cannot do everything we want. We could do the following: give confront an extra argument out that gets called on the resulting confrontation object. E.g.
confront <- function(dat, x, out){
  cf <- NULL # compute confrontation object here (placeholder)
  if ( !missing(out) ){
    out(cf)
    invisible(dat)
  } else {
    cf
  }
}
This way, anybody can write a function that dumps confrontation output wherever. As a service we add something like the following simple storage facility
store <- function(){
  # storage room + forklifts (confrontation results are kept in a closed-over list)
  L <- list()
  add <- function(x) L <<- c(L, list(x))
  take <- function(x, simplify){
    if (simplify && length(x) == 1){
      L[[x]]
    } else {
      L[x]
    }
  }
  # return a function
  function(put, get, simplify = TRUE){
    if (!missing(put)) add(put)
    if (!missing(get)) return(take(get, simplify))
    if (missing(put) && missing(get)) take(seq_along(L), simplify)
  }
}
Typical workflow would be
v <- validator(x > 0, y>0)
st <- store()
mydata %<>% confront(v,st) %>%
do_something_to_mydata %>%
confront(v,st) %>%
do_something_else_to_mydata %>%
confront(v,st)
# retrieve first stored validation result
st(1)
# retrieve all validation results
st()
These may overwrite the global options set by validate_options
> library(validate)
> library(magrittr)
> women %>% check_that(mean(height)>0,height>0)
Object of class 'validation'
Call:
check_that(., mean(height) > 0, height > 0)
Confrontations: 2
Error in apply(values(x), 1, all) : dim(X) must have a positive length
For file processing, a debugging level that controls verbosity and exception handling will be useful.
We need a decision on this. We currently have implemented the ~ and the %->% synonym.
I finally decided I like %->% better. However, it has precedence over +, so the statement
A + B %->% C
is interpreted (parsed) as
A + (B %->% C)
while
A + B ~ C
parses to
(A + B) ~ C
Here's the precedence list from the R lang def
::
$ @
^
- + (unary)
:
%xyz%
* /
+ - (binary)
> >= < <= == !=
!
& &&
| ||
~ (unary and binary)
-> ->>
= (as assignment)
Since we need a) a relational operator and b) a combination operator, I now vote that we use the ~ again...
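The precedence difference is easy to verify with quote():

```r
# %xyz% operators bind tighter than `+`; `~` binds looser.
e1 <- quote(A + B %->% C)
e2 <- quote(A + B ~ C)
as.character(e1[[1]])  # "+" : parsed as A + (B %->% C)
as.character(e2[[1]])  # "~" : parsed as (A + B) ~ C
```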
Validator yaml files should allow for a section like this:
---
before_check : >
  # some free R code like
  nacecodes <- read.table("nace.tab")[,1]
---
The idea is that this code gets executed in the environment that will later surround the evaluation environment for the checks.
Docs/description need to be updated/rewritten to conform to the new structure and yaml parsing capabilities
Execute prior to extracting validation rules so users may define their own functions external to the validation file. Example
# file fun.R
myfun <- function(x) mean(sqrt(abs(x)))
# file validation.val
source('fun.R')
myfun(y) > 3
There are several functions in validate
that count missings row- or column wise. For example
v <- validator(number_missing(x,y)==0)
tests whether the total number of missings in columns x
and y
equals zero. So far so good. However,
v <- validator(number_missing()==0)
tests whether all variables in the dataset are non-missing.
Since number_missing
maps the full dataset to a single number, all variables are involved. At the moment, the blocks
function does not take this into account.
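For reference, the column-wise counting semantics described above amount to the following sketch (count_missing is a stand-in name, not the package implementation):

```r
# Assumed semantics: total number of NA values over the given columns.
count_missing <- function(...){
  sum(vapply(list(...), function(x) sum(is.na(x)), numeric(1)))
}
count_missing(c(1, NA, 3), c(NA, NA, 6))  # 3
```

The zero-argument form is the tricky case: it must be resolved against the full dataset at confrontation time, which is why blocks would need to treat it as involving all variables.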
Allow statements like
#' @validate package <packagename>
in validation syntax files.
Example use case (on commandline)
library(validate)
library(stringdist)
v <- validator(stringdist(name1,name2) < 3)
expressionset objects could carry a vector of package names that should be available during execution.
Seriously, setting add_calls=TRUE has no effect.
The idea is to reserve :=
assignments for transient variables and <-
assignments for immutable variables. Its use would be, for example, to do something like
# executed upon first parse.
validclasses <- read.csv('myclasses.csv')$classname
# executed when confronted with data:
v %in% validclasses
Perhaps we should relabel "calls", because it is an internal R feature, namely an R expression that is "callable" (can be evaluated). It will likely be unclear to users what "calls" are or mean.
Alternatives:
I suggest "title" and "description" (or "desc") instead of "short" and "long".
"Short" and "long" may lead users to believe these are integer types.