dataMaid

dataMaid is an R package for documenting and creating reports on data cleanliness.

Installation

This GitHub page contains the development version of dataMaid. To get the latest stable version, install the package directly from CRAN:

install.packages("dataMaid")

To install the development version of dataMaid, run the following command from within R (this requires that the devtools package is already installed):

devtools::install_github('ekstroem/dataMaid')

Package overview

A super simple way to get started is to load the package and use the makeDataReport function on a data frame. If you generate several reports for the same data, you may need to add the replace=TRUE argument to overwrite the existing report.

library(dataMaid)
data(trees)
makeDataReport(trees)

This will create a report with summaries and error checks for each variable in the trees data frame. The format of the report depends on your OS and on whether you have a LaTeX installation on your computer, which is needed for creating pdf reports.
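If you would rather not depend on the automatic format choice, the report format can be requested explicitly. The sketch below assumes the output, replace and openResult arguments of makeDataReport behave as described in the package help page (?makeDataReport); check there for the full list of options.

```r
library(dataMaid)
data(trees)

# Ask for an HTML report explicitly, so no LaTeX installation is needed.
# replace = TRUE overwrites a report left over from an earlier run, and
# openResult = FALSE keeps the finished report from being opened automatically.
makeDataReport(trees, output = "html", replace = TRUE, openResult = FALSE)
```

The report files are written to the current working directory.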

Using dataMaid interactively

The dataMaid package can also be used interactively, by running checks on individual variables or on all variables in a dataset:

data(toyData)
check(toyData$events)  # Individual check of events
check(toyData) # Check all variables at once

By default the standard battery of tests is run depending on the variable type. If we just want a specific test for, say, a numeric variable then we can specify that. All available checks can be viewed by calling allCheckFunctions(). See the documentation for an overview of the checks available or how to create and include your own tests.

check(toyData$events, checks = setChecks(numeric = "identifyMissing"))
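More than one check can be requested for a variable type by passing a vector of check function names. The following is a minimal sketch; it assumes that identifyOutliers is among the names listed by allCheckFunctions() in your installed version.

```r
library(dataMaid)
data(toyData)

# List the check functions that are available
allCheckFunctions()

# Run two specific checks on a numeric variable.
# Assumption: "identifyOutliers" appears in the list printed above.
res <- check(toyData$events,
             checks = setChecks(numeric = c("identifyMissing", "identifyOutliers")))
res
```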

We can also access the graphics or summary tables that are produced for a variable by calling the visualize or summarize functions. One can visualize a single variable or a full dataset:

#Visualize a variable
visualize(toyData$events)

#Visualize a dataset
visualize(toyData)

The same is true for summaries. Note also that the choice of checks, visualizations and summaries is customizable:

#Summarize a variable with default settings:
summarize(toyData$events) 

#Summarize a variable with user-specified settings:
summarize(toyData$events, summaries = setSummaries(all = c("centralValue", "minMax")))

Detailed documentation

A manuscript that provides a detailed introduction to the dataMaid package has been accepted for publication in JSS. At some point it will be added as a vignette. Moreover, we have created a vignette that describes how to extend dataMaid with user-defined data screening checks, summaries and visualizations. This vignette is called extending_dataMaid:

vignette("extending_dataMaid")
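To give a flavour of what a user-defined check can look like, here is a minimal sketch. The function identifyNegative is hypothetical and written only for illustration; the sketch assumes the checkResult and checkFunction helpers work as described in the vignette, which remains the authoritative reference.

```r
library(dataMaid)

# Hypothetical check (not part of dataMaid): flag negative values.
identifyNegative <- function(v, nMax = 10, ...) {
  negatives <- sort(unique(v[!is.na(v) & v < 0]))
  problem <- length(negatives) > 0
  message <- ""
  problemValues <- NULL
  if (problem) {
    problemValues <- negatives
    message <- paste("The following negative values were found:",
                     paste(negatives, collapse = ", "))
  }
  # Wrap the result so that dataMaid can print it and include it in reports
  checkResult(list(problem = problem, message = message,
                   problemValues = problemValues))
}

# Register the function as a check, with a description and the variable
# classes it is meant for
identifyNegative <- checkFunction(identifyNegative,
                                  description = "Identify negative values",
                                  classes = c("numeric", "integer"))

res <- identifyNegative(c(3, -1, 4, -1, NA))
res
```

Once defined, such a function should be usable by name in the same way as the built-in checks, for example through setChecks(numeric = "identifyNegative").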

Online app

We are currently working on an online version of the tool, where users can upload their data and get a report. A prototype is already up and running - we just need to configure the R server correctly.

Until we have set it up online, you can try it out on your own machine:

library(shiny)
runUrl("https://github.com/ekstroem/dataMaid/raw/master/app/app.zip")
