
validate's People

Contributors

daniel-barnett, dpritchard, earcanal, edwindj, erm-eanway, etiennebacher, flother, gerrymanoim, jonmcalder, markvanderloo, mayeulk, mutahiwachira, rmsharp


validate's Issues

Add Type/Class information for variables

It may be useful, or even necessary, to specify the data types of the variables.

Currently this can be done by adding a rule for each column, but it could be made stricter,
especially when a duplicate variable name with a different type is used in a large rule set.
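
As a sketch of what a stricter type declaration could check, here is a hypothetical `check_types` helper with a declared name-to-class mapping; none of this is current validate API.

```r
# Hypothetical sketch: checking declared types against a data frame in base R.
# 'types' maps variable names to expected classes; not part of validate itself.
check_types <- function(dat, types) {
  vapply(names(types),
         function(v) inherits(dat[[v]], types[[v]]),
         logical(1))
}

d <- data.frame(height = c(1.80, 1.65),
                name   = c("anna", "bob"),
                stringsAsFactors = FALSE)
check_types(d, list(height = "numeric", name = "character"))
# height and name both check out as TRUE
```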

Tolerances on (linear) equalities

We need to think about the following case. Given the rule

x == y

We can incorporate some roundoff error by transforming this to

abs(x-y) < eps

with some chosen constant eps=lin.eq.eps (or num.eq.eps). Alternatively, we can translate the abs function and create the equivalent pair of rules

-(x-y) < eps
x-y < eps
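
The first transformation can be sketched by computing on the language; `toleranced` below is a hypothetical helper, with `eps` standing in for lin.eq.eps.

```r
# Hypothetical sketch: rewrite an equality rule into a toleranced comparison.
toleranced <- function(rule, eps = 1e-8) {
  stopifnot(is.call(rule), identical(rule[[1]], as.name("==")))
  substitute(abs(lhs - rhs) < e,
             list(lhs = rule[[2]], rhs = rule[[3]], e = eps))
}

r <- toleranced(quote(x == y))      # the call: abs(x - y) < 1e-08
eval(r, list(x = 1, y = 1 + 1e-10)) # TRUE: within tolerance
```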

extending validation syntax

This was possible in earlier development versions. For now, I think it is a good idea to leave it out and hard-code the common membership/comparison and boolean operators. It will be fairly easy to make this flexible again in version 0.2.

Separate pure validation function detection from compound

Pure validation functions have a membership operator at the top of the AST. Compound validation functions have a boolean operator at the top and are therefore not guaranteed to be surjective on {T,F,NA}.

It would be a good idea to separate this, at least under the hood.

Change group definition syntax?

Currently, a group is defined as:

G : { var1; var2; ...; varN}

We could do

G := { var1; var2; ...; varN }

or even

G := vargroup(var1, var2, ..., varN)

Since := is already used for transient assignment.

Advantages include cleaner syntax and probably cleaner parsing under the hood.

read indicators from file

We can parse validation rules from yaml syntax. We need something for indicators as well. Something like


---
options:
include : 
  - <file1.yaml>
  - <file2.yaml>

---
indicators:
-
  expr : mean(x)
  name: average
- 
  expr: sd(x)
  name: std.dev
  short: standard deviation
  long: |
    Estimated mean quadratic deviance from the mean.

---

general tolerance

For general equality rules we could write our own eq function that handles numeric and categorical data and takes a tolerance, i.e. substitute

f(x) == g(y)

with

eq(x,y,tol='num.eq.eps')
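A minimal sketch of such an eq function, with `tol` standing in for num.eq.eps; this is an illustration, not the package's actual implementation.

```r
# Hypothetical sketch: equality with tolerance for numeric data,
# exact comparison for categorical data.
eq <- function(x, y, tol = 1e-8) {
  if (is.numeric(x) && is.numeric(y)) abs(x - y) <= tol else x == y
}

eq(0.1 + 0.2, 0.3)  # TRUE, although 0.1 + 0.2 == 0.3 is FALSE in floating point
eq("a", "b")        # FALSE
```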

Implement scope for a rule?

Currently a rule has an implicit scope; it can act at the level of:

  • "record" of a dataset (e.g. height > 50)
  • "dataset" (mean(height) > 50)
  • "crossdataset" (mean(height) < 1.2 * mean(ref$height))

It would be useful for reporting and for dependent packages to retrieve the scope of rules
explicitly, or to retrieve rules of a specific scope type.

P.S.

It would also be nice to have:

  • "group", e.g. group_by(gender) %>% mean(height) > 50
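
Scope detection could be sketched by inspecting a rule's expression; the aggregate whitelist below is purely illustrative.

```r
# Hypothetical sketch: a rule that calls an aggregating function acts on
# the whole dataset, otherwise it acts per record.
rule_scope <- function(rule) {
  aggregates <- c("mean", "sum", "sd", "number_missing")
  if (any(all.names(rule) %in% aggregates)) "dataset" else "record"
}

rule_scope(quote(height > 50))        # "record"
rule_scope(quote(mean(height) > 50))  # "dataset"
```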

reporting functionality

Parsing expressionset and rule objects to markdown, LaTeX or html for report generation; similarly for confrontation-like objects. This could also live in another package, e.g. validate.report (depending on, say, validate.viz).

A miss slot for rule objects

A miss slot could take values NA (default), TRUE or FALSE and represents the value to return when a rule yields a missing value.
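
Applying such a miss value to a vector of rule outcomes could look like this; `apply_miss` is a sketch, not current API.

```r
# Hypothetical sketch: replace NA outcomes by the rule's miss value.
# The default miss = NA leaves results unchanged.
apply_miss <- function(result, miss = NA) {
  result[is.na(result)] <- miss
  result
}

apply_miss(c(TRUE, NA, FALSE), miss = FALSE)  # TRUE FALSE FALSE
```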

testing compare/comparison

We may need to look at the interface. I propose to label these functions EXPERIMENTAL in the documentation for 0.1 and develop them further for 0.2.

Make error catching optional

The confront generic always catches errors. This is a good default, but it should be made optional (probably by setting a switch in the factory function).

Rule definition in yaml format

Move away from weave-like parsing options and integrate them neatly in a yaml syntax. Advantages include

  • yaml is user friendly and well known
  • easy transformation to json, xml and what have you

Subtlety in check for validating functions

A validating function is defined as a function v: S->>{0,1 [,NA]} (with S a complicated definition of datasets, see book draft def 8.3.4), where the 'double arrow' means surjective. At the moment, only the domain part is ensured (sort of) by checking that the top of the AST is a comparison or membership operator (or a boolean). Surjectivity, however, is not ensured. For example, the rule

v <- validator(1 > 0)

is swallowed by the constructor.

We should test for at least one variable, and probably

  • Throw a message when encountering and ignoring spurious rules (1 > 0)
  • Throw an error (or warning?) when encountering inconsistencies (1 < 0)

This may seem annoying for rule manipulation using elimination or substitution methods, but maybe we need to do those outside the validator context. Elimination methods need to extract rules anyway, and we may leave it up to the elimination method to handle contradictions.

The disadvantage of throwing an error on an obvious inconsistency is that users may get a false sense of security when no error is thrown: there is no such thing as a check for inconsistency across rules that covers all cases.
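
Detecting spurious and inconsistent constant rules could be sketched as follows; `classify_rule` is a hypothetical helper, not part of the package.

```r
# Hypothetical sketch: rules without variables are constants; evaluate them
# once to distinguish spurious (always TRUE) from inconsistent (always FALSE).
classify_rule <- function(rule) {
  if (length(all.vars(rule)) >= 1) return("ok")
  if (isTRUE(eval(rule, baseenv()))) "spurious" else "inconsistent"
}

classify_rule(quote(x > 0))  # "ok"
classify_rule(quote(1 > 0))  # "spurious"
classify_rule(quote(1 < 0))  # "inconsistent"
```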

Add `.` to allowed syntax

It would make validate more expressive if we allowed self-referencing in the statements, meaning that the dataset to be checked can be used in the syntax statements.

A good candidate is `.`, since it is also used by magrittr and dplyr, e.g.

v <- validator(
   rich := subset(., income_class == "rich"),
   mean(rich$income) >= 1e6 
)

Implementation would be simple: just add `.` to the environment in which the confrontation takes place.

setting options in validation files

It should be possible to set processing options in files containing validation syntax, e.g.

validate_option(
  key = idvar
, error = 'throw'
)

remove regex variable definition options

Several functions (var_group, number_missing, and so on) now offer the option to define variables, in quotes, as regular expressions.

This is currently barely tested and complicates things like rule/block analyses for unexpanded rule sets.

Better to remove this functionality for now and perhaps put it back on the list for a later version, when we've thought it through better.

redirect of confrontation object from `confront`

It turns out the magrittr %T>% operator cannot do everything we want. We could do the following.

  • Give confront an extra argument out that is called on the resulting confrontation object, e.g.

confront <- function(dat, x, out){
  cf <- NULL # compute the confrontation object here
  if ( !missing(out) ){
    out(cf)
    invisible(dat)
  } else {
    cf
  }
}

This way, anybody can write a function that dumps confrontation output wherever they like. As a service we could add something like the following simple storage facility.

store <- function(){
  # storage room + forklifts (recall that confrontation objects are reference objects)
  L <- list()
  add  <- function(x) L[[length(L) + 1]] <<- x
  take <- function(x, simplify){
    if (simplify && length(x) == 1){
      L[[x]]
    } else {
      L[x]
    }
  }
  # return a function
  function(put, get, simplify=TRUE){
    if (!missing(put)) add(put)
    if (!missing(get)) return(take(get, simplify))
    if (missing(put) && missing(get)) take(seq_along(L), simplify)
  }
}

Typical workflow would be

v <- validator(x > 0, y>0)
st <- store()
mydata %<>% confront(v,st) %>%
  do_something_to_mydata %>%
  confront(v,st) %>%
  do_something_else_to_mydata %>%
  confront(v,st)

# retrieve first stored validation result
st(1)
# retrieve all validation results
st()

show method fails

> library(validate)
> library(magrittr)
> women %>% check_that(mean(height)>0,height>0)
Object of class 'validation'
Call:
    check_that(., mean(height) > 0, height > 0)

Confrontations: 2
Error in apply(values(x), 1, all) : dim(X) must have a positive length

debugging level

For file processing, a debugging level that controls verbosity and exception handling will be useful.

functional dependency operator

We need a decision on this. We currently have implemented ~ and the synonym %->%.

I finally decided I like %->% better. However, it has higher precedence than binary +, so the statement

A + B %->% C

is interpreted (parsed) as

A + (B %->% C)

while

A + B ~ C

parses to

(A + B) ~ C

Here's the precedence list from the R language definition (highest to lowest):

::
$ @
^
- +                (unary)
:
%xyz%
* /
+ -                (binary)
> >= < <= == !=
!
& &&
| ||
~                  (unary and binary)
-> ->>
=                  (as assignment)

Since we need (a) a relational operator and (b) a combination operator, I now vote that we use ~ again.
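
The difference can be verified by inspecting the parse trees:

```r
# %xyz% operators bind tighter than binary '+', while '~' binds looser,
# so the two spellings yield different top-level calls.
e1 <- quote(A + B %->% C)
e2 <- quote(A + B ~ C)

e1[[1]]  # `+` : parsed as A + (B %->% C)
e2[[1]]  # `~` : parsed as (A + B) ~ C
```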

before_check

Validator yaml files should allow for a section like this:


---
before_check : >
  # some free R code like
  nacecodes <- read.table("nace.tab")[,1]

---

The idea is that this code gets executed in the environment that will later surround the evaluation environment for the checks.

update documentation

Docs and DESCRIPTION need to be updated/rewritten to conform to the new structure and yaml parsing capabilities.

file parsing: execute 'source' statements.

Execute these prior to extracting validation rules, so users may define their own functions external to the validation file. Example:

# file fun.R
myfun <- function(x) mean(sqrt(abs(x)))
# file validation.val
source('fun.R')

myfun(y) > 3

block decomposition and missing counters

There are several functions in validate that count missings row- or column wise. For example

v <- validator(number_missing(x,y)==0)

tests whether the total number of missings in columns x and y equals zero. So far so good. However,

v <- validator(number_missing()==0)

tests whether all variables in the dataset are non-missing.

Since number_missing maps the full dataset to a single number, all variables are involved. At the moment, the blocks function does not take this into account.
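
The counting behaviour itself can be sketched in base R; `number_missing_impl` is a hypothetical stand-in for the package function.

```r
# Hypothetical sketch: count NAs in the given columns, or in all columns
# when no variables are specified.
number_missing_impl <- function(dat, ...) {
  vars <- c(...)
  if (length(vars) == 0) vars <- names(dat)
  sum(vapply(dat[vars], function(col) sum(is.na(col)), integer(1)))
}

d <- data.frame(x = c(1, NA), y = c(NA, NA), z = 1:2)
number_missing_impl(d, "x", "y")  # 3
number_missing_impl(d)            # 3: all variables are involved
```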

package dependencies in validation syntax files

allow statements like

#' @validate package <packagename>

in validation syntax files.

Example use case (on commandline)

library(validate)
library(stringdist)

v <- validator(stringdist(name1,name2) < 3)

parsing of files: execute '<-' assignments

The idea is to reserve := assignments for transient variables and <- assignments for immutable variables. Its use would be, for example, to do something like

# executed upon first parse.
validclasses <- read.csv('myclasses.csv')$classname

# executed when confronted with data:

v %in% validclasses

Alternative naming of "calls"?

Perhaps we should relabel "calls", because it is an internal R feature, namely an R expression that is "callable" (can be evaluated). It will likely be unclear to users what "calls" are or mean.

Alternatives:

  • "expr"
  • "expression"
  • "rule"
