Giter Club home page Giter Club logo

elucidate's Introduction

elucidate

Project Status

img R-CMD-check

This package is currently maintained by Craig Hutton, a Data Scientist working with the Research Branch of the British Columbia Ministry of Social Development & Poverty Reduction.

Why elucidate?

elucidate provides a collection of convenience functions to make exploratory data analysis in R easier and more accessible for researchers to (among other things):

  • Interrogate data in search of row duplicates with dupes() and anomalous values with the counts* set of functions.

  • Describe data with the describe* set of functions for obtaining summary statistics, bootstrapping confidence intervals, and detecting missing values.

  • Quickly visualise and interact with data representations with the plot_* set of functions.

Inspired by tidyverse naming conventions, the core functions of elucidate are organized into sets that begin with a common root (e.g. describe*, plot_*), since this enables the user to see them all as suggestions as you are coding in R studio.

Drawing from similar inspiration, many elucidate functions are also designed to accept a data object as the 1st argument and return a data or plotting object (e.g. ggplot2 or plotly) so they are compatible with the pipe operator from the magrittr package for easy integration into data processing pipelines.

For a comprehensive introduction to the package see the vignette via vignette("elucidate").

Installation

You can install the development version of elucidate from this repository with:

# use the remotes package to install from a github repository

install.packages("remotes") #only run this 1st if you haven't installed remotes before

remotes::install_github("bcgov/elucidate")

The authors of elucidate acknowledge and express their gratitude to the authors of the tidyverse packages, data.table, and the functions of other dependency packages which were used to build elucidate, since without their effort and ingenuity elucidate would mostly have remained a collection of ideas instead of functions.

Usage

dupes() can tell you how many rows are duplicated based on one or more variables (default is all of them).

library(elucidate)

#list any number of variables to use when searching for duplicates after the
#data argument
dupes(pdata, d) 
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 x 11
#>       id d          g     high_low even     y1    y2    x1    x2    x3 n_copies
#>    <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>    <int>
#>  1     1 2019-01-01 e     high     FALSE  144. 112.     86   199   261     1000
#>  2     2 2019-01-01 c     low      TRUE   262.  91.8    88   104   224     1000
#>  3     3 2019-01-01 d     high     FALSE  224. 120.     29   192   224     1000
#>  4     4 2019-01-01 c     high     TRUE   262. 102.     77   151   254     1000
#>  5     5 2019-01-01 a     low      FALSE  167.  91.6    75   186   258     1000
#>  6     6 2019-01-01 a     low      TRUE   163.  83.6    39   152   293     1000
#>  7     7 2019-01-01 d     low      FALSE  247.  98.9    88   177   202     1000
#>  8     8 2019-01-01 b     high     TRUE   189. 102.     79   163   217     1000
#>  9     9 2019-01-01 e     high     FALSE  156. 110.      4   136   220     1000
#> 10    10 2019-01-01 c     low      TRUE   248.  95.9    63   113   213     1000
#> # ... with 11,990 more rows
#in this case we search for duplicated based on the "d" (date) column        

describe() a single variable in a data frame or a vector of values.

#set random generator seed for reproducibility
set.seed(1234)

#using a numeric vector as input
describe(data = rnorm(1:1000, 100, 5))
#> # A tibble: 1 x 14
#>   cases     n    na  p_na  mean    sd    se    p0   p25   p50   p75  p100   skew
#>   <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1  1000  1000     0     0  99.9  4.99 0.158  83.0  96.6  99.8  103.  116. -0.005
#> # ... with 1 more variable: kurt <dbl>

describe_all() all variables in a data frame.

describe_all(pdata)
#> $date
#>    variable cases     n na p_na n_unique      start        end
#> 1:        d 12000 12000  0    0       12 2008-01-01 2019-01-01
#> 
#> $factor
#>    variable cases     n na p_na n_unique ordered
#> 1:        g 12000 12000  0    0        5   FALSE
#>                              counts_tb
#> 1: a_2592, b_2460, ..., e_2352, c_2220
#> 
#> $character
#>    variable cases     n na p_na n_unique min_chars max_chars
#> 1: high_low 12000 12000  0    0        2         3         4
#>              counts_tb
#> 1: high_6045, low_5955
#> 
#> $logical
#>    variable cases     n na p_na n_TRUE n_FALSE p_TRUE
#> 1:     even 12000 12000  0    0   6000    6000    0.5
#> 
#> $numeric
#>    variable cases     n na p_na    mean      sd    se      p0     p25     p50
#> 1:       id 12000 12000  0    0 500.500 288.687 2.635   1.000 250.750 500.500
#> 2:       y1 12000 12000  0    0 153.705  42.724 0.390  69.224 120.941 144.782
#> 3:       y2 12000 12000  0    0 100.092  10.120 0.092  59.963  93.351 100.128
#> 4:       x1 12000 12000  0    0  50.495  28.937 0.264   1.000  25.000  50.000
#> 5:       x2 12000 12000  0    0 150.651  28.820 0.263 101.000 126.000 150.000
#> 6:       x3 12000 12000  0    0 250.497  28.865 0.264 201.000 225.000 251.000
#>        p75     p100   skew   kurt
#> 1: 750.250 1000.000  0.000 -1.200
#> 2: 181.022  289.235  0.739 -0.181
#> 3: 106.906  142.181 -0.032  0.059
#> 4:  75.000  100.000  0.008 -1.208
#> 5: 176.000  200.000  0.002 -1.209
#> 6: 276.000  300.000  0.006 -1.214

Use plot_var() to produce a class-appropriate ggplot2 graph of a single variable in a data frame or a vector of values.

plot_var(data = rnorm(1:1000, 100, 5)) 

#in this case we get a density plot with a normal density curve added for
#reference (dashed line).

To generate class-appropriate ggplot2 graphs for all variables in a data frame and combine them into a multiple-panel figure with the patchwork package, use plot_var_all(). You can also limit the graphing to a subset of columns with the “cols” argument, which accepts a character vector of column names.

plot_var_all(pdata, cols = c("y1", "y2", "g", "even"))

#density plots for numeric variables and bar graphs for categorical variables

Learn more

These examples only highlight a few of the many things elucidate can do. You can learn more from these additional resources:

  • A blog post, which was written to help researchers and science trainees learn how to use the package for common exploratory data analysis tasks.

  • The introductory vignette (vignette("elucidate")), which provides annotated code examples that demonstrate most of the package’s functionality.

Reporting an Issue

To report bugs/issues or request feature changes, open an issue for the package GitHub repo. If raising an issue, please provide a reproducible example (reprex) of the problem you’re encountering.

Requesting Features and/or Changes

To suggest changes or code improvements, please submit a pull request.

License

Copyright 2021 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

elucidate's People

Contributors

huttoncp avatar repo-mountie[bot] avatar sean-hayes avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

eipi10 mbexhrs3

elucidate's Issues

It's Been a While Since This Repository has Been Updated

This issue is a kind reminder that your repository has been inactive for 181 days. Some repositories are maintained in accordance with business requirements that infrequently change thus appearing inactive, and some repositories are inactive because they are unmaintained.

To help differentiate products that are unmaintained from products that do not require frequent maintenance, repomountie will open an issue whenever a repository has not been updated in 180 days.

  • If this product is being actively maintained, please close this issue.
  • If this repository isn't being actively maintained anymore, please archive this repository. Also, for bonus points, please add a dormant or retired life cycle badge.

Thank you for your help ensuring effective governance of our open-source ecosystem!

Add project lifecycle badge

No Project Lifecycle Badge found in your readme!

Hello! I scanned your readme and could not find a project lifecycle badge. A project lifecycle badge will provide contributors to your project as well as other stakeholders (platform services, executive) insight into the lifecycle of your repository.

What is a Project Lifecycle Badge?

It is a simple image that neatly describes your project's stage in its lifecycle. More information can be found in the project lifecycle badges documentation.

What do I need to do?

I suggest you make a PR into your README.md and add a project lifecycle badge near the top where it is easy for your users to pick it up :). Once it is merged feel free to close this issue. I will not open up a new one :)

It's Been a While Since This Repository has Been Updated

This issue is a kind reminder that your repository has been inactive for 181 days. Some repositories are maintained in accordance with business requirements that infrequently change thus appearing inactive, and some repositories are inactive because they are unmaintained.

To help differentiate products that are unmaintained from products that do not require frequent maintenance, repomountie will open an issue whenever a repository has not been updated in 180 days.

  • If this product is being actively maintained, please close this issue.
  • If this repository isn't being actively maintained anymore, please archive this repository. Also, for bonus points, please add a dormant or retired life cycle badge.

Thank you for your help ensuring effective governance of our open-source ecosystem!

Add missing topics

TL;DR

Topics greatly improve the discoverability of repos; please add the short code from the table below to the topics of your repo so that ministries can use GitHub's search to find out what repos belong to them and other visitors can find useful content (and reuse it!).

Why Topic

In short order we'll add our 800th repo. This large number clearly demonstrates the success of using GitHub and our Open Source initiative. This huge success means its critical that we work to make our content as discoverable as possible; Through discoverability, we promote code reuse across a large decentralized organization like the Government of British Columbia as well as allow ministries to find the repos they own.

What to do

Below is a table of abbreviation a.k.a short codes for each ministry; they're the ones used in all @gov.bc.ca email addresses. Please add the short codes of the ministry or organization that "owns" this repo as a topic.

add a topic

That's in, you're done!!!

How to use

Once topics are added, you can use them in GitHub's search. For example, enter something like org:bcgov topic:citz to find all the repos that belong to Citizens' Services. You can refine this search by adding key words specific to a subject you're interested in. To learn more about searching through repos check out GitHub's doc on searching.

Pro Tip 🤓

  • If your org is not in the list below, or the table contains errors, please create an issue here.

  • While you're doing this, add additional topics that would help someone searching for "something". These can be the language used javascript or R; something like opendata or data for data only repos; or any other key words that are useful.

  • Add a meaningful description to your repo. This is hugely valuable to people looking through our repositories.

  • If your application is live, add the production URL.

Ministry Short Codes

Short Code Organization Name
AEST Advanced Education, Skills & Training
AGRI Agriculture
ALC Agriculture Land Commission
AG Attorney General
MCF Children & Family Development
CITZ Citizens' Services
DBC Destination BC
EMBC Emergency Management BC
EAO Environmental Assessment Office
EDUC Education
EMPR Energy, Mines & Petroleum Resources
ENV Environment & Climate Change Strategy
FIN Finance
FLNR Forests, Lands, Natural Resource Operations & Rural Development
HLTH Health
FLNR Indigenous Relations & Reconciliation
JEDC Jobs, Economic Development & Competitiveness
LBR Labour Policy & Legislation
LDB BC Liquor Distribution Branch
MMHA Mental Health & Addictions
MAH Municipal Affairs & Housing
BCPC Pension Corporation
PSA Public Safety & Solicitor General & Emergency B.C.
SDPR Social Development & Poverty Reduction
TCA Tourism, Arts & Culture
TRAN Transportation & Infrastructure

NOTE See an error or omission? Please create an issue here to get it remedied.

It's Been a While Since This Repository has Been Updated

This issue is a kind reminder that your repository has been inactive for 181 days. Some repositories are maintained in accordance with business requirements that infrequently change thus appearing inactive, and some repositories are inactive because they are unmaintained.

To help differentiate products that are unmaintained from products that do not require frequent maintenance, repomountie will open an issue whenever a repository has not been updated in 180 days.

  • If this product is being actively maintained, please close this issue.
  • If this repository isn't being actively maintained anymore, please archive this repository. Also, for bonus points, please add a dormant or retired life cycle badge.

Thank you for your help ensuring effective governance of our open-source ecosystem!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.