datamodules

The goal of datamodules is to facilitate working with a central data store that is structured so that data-related scripts for downloading, updating, cleaning, etc. are organized in source-specific subfolders.

This all has to do with my continual pains in working with diverse data sources. A couple of specific problems that I persistently struggle with are:

  • For any given data source, there is a mix of tasks that can be automated, like data downloading and updating, tasks that likely will have to remain interactive and monitored, like data cleaning and imputation, and project-specific needs like particular data transformations. Where do you put the bits for all of this?
  • I often have multiple copies of essentially the same data in different locations, and similarly multiple copies and slight variations of similar data cleaning code.
  • It’s always a pain to figure out how to get a particular dataset into R for exploration. E.g. sometimes I’d like to look up how many ACLED events there were in Estonia last year without having to think about downloading ACLED, where that file is located on my computer, etc.

A structure I have evolved towards is to have a central data store in which raw data are saved, and which contains source-specific cleaning code, e.g. to get a dataset compliant with G&W country-years. For a specific project, I copy the minimally cleaned data from that store and process it further as needed.

data/  # the base path for storing artifacts and source-specific cleaning code, notes, etc.
├── acled/
├── archigos/
│   ├── input-data/
│   ├── output-data/
│   ├── coding-notes.Rmd  # clean, transform
│   └── README.Rmd        # summary of current data, notes, etc.
├── epr/
...
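
As a rough illustration of the layout above, a new source folder could be scaffolded as follows. This is only a sketch in base R: `create_source_module()` is a hypothetical helper, not a datamodules function, and the demo writes to a temporary directory rather than a real data store.

```r
# Hypothetical helper (not part of datamodules): create one source-specific
# folder with the sub-folders and files shown in the tree above.
create_source_module <- function(base_path, source_name) {
  module_dir <- file.path(base_path, source_name)
  # Raw downloads go in input-data/, cleaned versions in output-data/
  for (sub in c("input-data", "output-data")) {
    dir.create(file.path(module_dir, sub), recursive = TRUE, showWarnings = FALSE)
  }
  # Empty placeholders for the cleaning notes and the source summary
  for (f in c("coding-notes.Rmd", "README.Rmd")) {
    path <- file.path(module_dir, f)
    if (!file.exists(path)) file.create(path)
  }
  invisible(module_dir)
}

# Demo against a throwaway store in the session's temp directory
store <- file.path(tempdir(), "data")
create_source_module(store, "archigos")
list.files(file.path(store, "archigos"))
```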

datamodules is meant to fill some gaps in this strategy:

  • dm_path() provides the location of the central data store, which makes it easier to read data from it in other projects.
  • It includes code to scrape Wikipedia terrorism events. The idea is to add more small, miscellaneous code sets like this in the future, ones that don’t warrant a full-blown R package of their own.
  • It includes some imputation helpers (well, just one right now).
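
To make the first point concrete, here is a sketch of reading from a central store in another project. The signature of dm_path() is not shown here, so the sketch uses a hypothetical stand-in, `store_path()`, that reads the base path from an environment variable called `DATA_STORE_PATH` (also an assumption). The file name `events.csv` and the toy rows are made up for the demo.

```r
# Hypothetical stand-in for dm_path(); the real function may work differently.
store_path <- function(...) {
  base <- Sys.getenv("DATA_STORE_PATH", unset = NA)
  if (is.na(base)) stop("Set the DATA_STORE_PATH environment variable first")
  file.path(base, ...)
}

# Toy setup: a temporary store with a made-up cleaned events file
Sys.setenv(DATA_STORE_PATH = file.path(tempdir(), "data"))
out_dir <- store_path("acled", "output-data")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
toy <- data.frame(
  country = c("Estonia", "Estonia", "Latvia"),
  year    = c(2018, 2018, 2018)
)
write.csv(toy, file.path(out_dir, "events.csv"), row.names = FALSE)

# In a project script, the lookup no longer depends on where the store lives
events <- read.csv(store_path("acled", "output-data", "events.csv"))
n_estonia <- sum(events$country == "Estonia" & events$year == 2018)
n_estonia
```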

Installation

remotes::install_github("andybega/datamodules")

Example

datamodules's Issues

Parsing dates not working for format %d %B

Scraping doesn't work for dates in the format %d %B (see below). [Non-numerical casualty numbers also throw an error but this is a different issue.]

library("datamodules")
testcase <- wikipedia_terrorism_scrape(from = "1970-01", to = "1970-01")
# Warning messages:
#   1: In wikipedia_terrorism_scrape_table(urls[i]) :
#   Some dates were not parsed
# Table: NULL
# # A tibble: 5 x 2
# orig              date      
# <chr>             <date>    
# 1 26 March          NA        
# 2 April             NA        
# 3 May               NA        
# 4 31 July-August 10 NA        
# 5 12 August         NA        
# 
# 2: In wikipedia_terrorism_scrape_table(urls[i]) :
#
#   Some fatality figures were not parsed
# Table: NULL
# # A tibble: 2 x 2
# dead    dead_min
# <chr>      <dbl>
# 1 (1+)      NA
# 2 Unknown   NA
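
One possible direction for both warnings, sketched in base R. Neither helper below is part of datamodules; the names are hypothetical. The date helper assumes the year is known from the page being scraped and that month names are in English (`%B` is locale-dependent). The fatality helper takes the first run of digits as a lower bound, which is an assumption about what `dead_min` should mean.

```r
# Hypothetical helper: parse "26 March"-style strings by appending the year
# of the page being scraped. Month-only entries such as "April" stay NA.
parse_day_month <- function(x, year) {
  as.Date(paste(x, year), format = "%d %B %Y")
}

# Hypothetical helper: take the first run of digits as the minimum number of
# fatalities; entries without digits (e.g. "Unknown") become NA.
parse_dead_min <- function(x) {
  m <- regexpr("[0-9]+", x)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_real_, length(x))
  out[hit] <- as.numeric(regmatches(x, m))
  out
}

dates <- parse_day_month(c("26 March", "12 August", "April"), 1970)
dead  <- parse_dead_min(c("(1+)", "Unknown", "3"))
```

Ranges like "31 July-August 10" would still need separate handling, for example splitting into a start and an end date.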

wiki-terror scraper has issues for earlier years

e.g.

library("datamodules")
#> Artifacts path not set, see `?setup_datamodules()`
foo = wikipedia_terrorism_scrape(from = 2000, to = 2005)
#> Warning in wikipedia_terrorism_scrape_table(urls[i]): Some dates were not parsed
#> Table: NULL
#> # A tibble: 3 x 2
#>   orig  date      
#>   <chr> <date>    
#> 1 14    NA        
#> 2 28    NA        
#> 3 29    NA
#> Warning in wikipedia_terrorism_scrape_table(urls[i]): Some dates were not parsed
#> Table: NULL
#> # A tibble: 8 x 2
#>   orig  date      
#>   <chr> <date>    
#> 1 4     NA        
#> 2 6     NA        
#> 3 8     NA        
#> 4 17    NA        
#> 5 18    NA        
#> 6 24    NA        
#> 7 25    NA        
#> 8 25    NA
#> Warning: Unknown or uninitialised column: 'dead'.
#> Error in `$<-.data.frame`(`*tmp*`, "dead_min", value = numeric(0)): replacement has 0 rows, data has 4

Created on 2019-10-11 by the reprex package (v0.3.0)

Ultimately this should work back to 1970, when the data start, and throw an error for input earlier than 1970.
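
A minimal sketch of the input check described above, assuming `from` and `to` arrive either as a year (2000) or as a "YYYY-MM" string ("1970-01"), as in the calls in these issues. `check_year_range()` is a hypothetical name, not an existing datamodules function.

```r
# Hypothetical validator: reject requests that start before the data do.
check_year_range <- function(from, to, first_year = 1970L) {
  # Works for both 2000 and "1970-01": keep the first four characters
  from_year <- suppressWarnings(as.integer(substr(as.character(from), 1, 4)))
  to_year   <- suppressWarnings(as.integer(substr(as.character(to), 1, 4)))
  if (is.na(from_year) || is.na(to_year)) {
    stop("`from` and `to` must start with a four-digit year")
  }
  if (from_year < first_year || to_year < first_year) {
    stop("The data start in ", first_year, "; earlier input is not supported")
  }
  if (from_year > to_year) {
    stop("`from` must not be later than `to`")
  }
  invisible(TRUE)
}

check_year_range(2000, 2005)
check_year_range("1970-01", "1970-01")
```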
