datamodules's Introduction

datamodules

The goal of datamodules is to facilitate working with a central data store that is structured so that data-related scripts for downloading, updating, cleaning, etc. are organized in source-specific subfolders.

This all has to do with my continual pains in working with diverse data sources. A couple of specific problems that I persistently struggle with are:

For any given data source, there is a mix of tasks that can be automated, like data downloading and updating, tasks that likely will have to remain interactive and monitored, like data cleaning and imputation, and project-specific needs like particular data transformations. Where do you put the bits for all of this?
I often have multiple copies of essentially the same data in different locations, and similarly multiple copies and slight variations of similar data cleaning code.
It’s always a pain to figure out how to get a particular dataset into R for exploration. E.g., sometimes I’d like to look up how many ACLED events there were in Estonia last year without having to think about downloading ACLED, where that file is located on my computer, etc.

A structure I have evolved towards is to have a central data store in which raw data are saved, and which contains source-specific cleaning code, e.g. to get a dataset compliant with G&W country-years. For a specific project, I copy the minimally cleaned data from that store and further process it as needed.

data/  # the base path for storing artifacts and source-specific cleaning code, notes, etc.
├── acled/
├── archigos/
│   ├── input-data/
│   ├── output-data/
│   ├── coding-notes.Rmd  # clean, transform
│   └── README.Rmd        # summary of current data, notes, etc.
├── epr/
...
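
As a rough illustration of the layout above, here is a minimal base R sketch that creates the skeleton for one source folder. The function name `create_source_dir()` is hypothetical and is not part of datamodules; the folder and file names are simply copied from the tree.

```r
# Hypothetical helper (not part of datamodules): create the folder skeleton
# for one data source inside a central data store.
create_source_dir <- function(base, source) {
  src <- file.path(base, source)
  # recursive = TRUE also creates `base` and `src` if they do not exist yet
  dir.create(file.path(src, "input-data"), recursive = TRUE, showWarnings = FALSE)
  dir.create(file.path(src, "output-data"), recursive = TRUE, showWarnings = FALSE)
  # Create empty placeholder notes files without overwriting existing ones
  for (f in c("coding-notes.Rmd", "README.Rmd")) {
    path <- file.path(src, f)
    if (!file.exists(path)) file.create(path)
  }
  invisible(src)
}

# Demo in a temporary directory so nothing on disk is touched permanently
base <- file.path(tempdir(), "data")
create_source_dir(base, "archigos")
list.files(file.path(base, "archigos"))
```

Calling it a second time for the same source leaves existing notes files alone, so it should be safe to rerun.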

datamodules is meant to fill some gaps in this strategy:

dm_path() provides the location of the central data store, which makes it easier to read data from the store for use in other projects.
It includes code to scrape Wikipedia terrorism events. The idea is to add more miscellaneous small code sets like this in the future, ones that don’t warrant a full-blown R package of their own.
It includes some imputation helpers (well, one right now).
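
To make the first point concrete, here is a sketch of the general idea of a single function that knows where the store lives. This is not the implementation or signature of dm_path(); `store_path()`, the option name `mystore.path`, and the toy file and its columns are all made up for illustration.

```r
# Hypothetical stand-in for the idea behind dm_path(); NOT the package's code.
# The option name "mystore.path" is invented for this sketch.
store_path <- function(...) {
  base <- getOption("mystore.path")
  if (is.null(base)) {
    stop("Data store path not set; use options(mystore.path = ...)")
  }
  file.path(base, ...)
}

# Demo with a temporary directory standing in for the central store
options(mystore.path = file.path(tempdir(), "data"))
out_dir <- store_path("archigos", "output-data")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write a toy file to the store, then read it back as another project would
toy <- data.frame(gwcode = c(2, 20), year = c(2000, 2000))
write.csv(toy, file.path(out_dir, "toy.csv"), row.names = FALSE)
dat <- read.csv(store_path("archigos", "output-data", "toy.csv"))
```

The point of the design is that project code only ever names the source and file, never an absolute path on a particular machine.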

Installation

remotes::install_github("andybega/datamodules")

Example

datamodules's Issues

Parsing dates not working for format %d %B

Scraping doesn't work for dates in the format %d %B (see below). [Non-numerical casualty numbers also throw an error but this is a different issue.]

library("datamodules")
testcase <- wikipedia_terrorism_scrape(from = "1970-01", to = "1970-01")
# Warning messages:
#   1: In wikipedia_terrorism_scrape_table(urls[i]) :
#   Some dates were not parsed
# Table: NULL
# # A tibble: 5 x 2
# orig              date      
# <chr>             <date>    
# 1 26 March          NA        
# 2 April             NA        
# 3 May               NA        
# 4 31 July-August 10 NA        
# 5 12 August         NA        
# 
# 2: In wikipedia_terrorism_scrape_table(urls[i]) :
#
#   Some fatality figures were not parsed
# Table: NULL
# # A tibble: 2 x 2
# dead    dead_min
# <chr>      <dbl>
# 1 (1+)      NA
# 2 Unknown   NA
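
One possible direction for the date problem, sketched in base R: the `orig` strings carry no year, so the year has to be supplied separately and pasted on before parsing. `parse_day_month()` is a hypothetical helper, not the package's code, and it assumes an English locale because `%B` matches month names in the current locale.

```r
# Hypothetical helper, not the package's implementation.
# `orig` holds day-month strings as scraped; `year` is supplied by the caller.
parse_day_month <- function(orig, year) {
  # %B matches full month names in the current locale (assumed English here)
  as.Date(paste(orig, year), format = "%d %B %Y")
}

orig <- c("26 March", "April", "31 July-August 10", "12 August")
parsed <- parse_day_month(orig, 1970)
```

This only covers plain `%d %B` entries such as "26 March" and "12 August". Month-only entries like "April" and ranges like "31 July-August 10" are not handled by this sketch and would need separate rules.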

wiki-terror scraper has issues for earlier years

e.g.

library("datamodules")
#> Artifacts path not set, see `?setup_datamodules()`
foo = wikipedia_terrorism_scrape(from = 2000, to = 2005)
#> Warning in wikipedia_terrorism_scrape_table(urls[i]): Some dates were not parsed
#> Table: NULL
#> # A tibble: 3 x 2
#>   orig  date      
#>   <chr> <date>    
#> 1 14    NA        
#> 2 28    NA        
#> 3 29    NA
#> Warning in wikipedia_terrorism_scrape_table(urls[i]): Some dates were not parsed
#> Table: NULL
#> # A tibble: 8 x 2
#>   orig  date      
#>   <chr> <date>    
#> 1 4     NA        
#> 2 6     NA        
#> 3 8     NA        
#> 4 17    NA        
#> 5 18    NA        
#> 6 24    NA        
#> 7 25    NA        
#> 8 25    NA
#> Warning: Unknown or uninitialised column: 'dead'.
#> Error in `$<-.data.frame`(`*tmp*`, "dead_min", value = numeric(0)): replacement has 0 rows, data has 4

Created on 2019-10-11 by the reprex package (v0.3.0)

This should ultimately work back to 1970, when the data start (and throw an error for input earlier than 1970).
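
The requested error for early input could look roughly like the following. `check_year()` is a hypothetical helper, not existing package code, and it assumes the input is a plain year such as `2000` rather than a "1970-01" style string.

```r
# Hypothetical input check, not the package's code.
# Assumes `year` is a plain year (numeric, or a string like "2000").
check_year <- function(year, min_year = 1970) {
  year <- as.integer(year)
  if (anyNA(year) || any(year < min_year)) {
    stop("`year` must be ", min_year, " or later", call. = FALSE)
  }
  invisible(year)
}

check_year(2000)  # passes silently
# Capture the error message instead of stopping the script
res <- tryCatch(check_year(1969), error = function(e) conditionMessage(e))
```

Running a check like this on `from` and `to` before any page is requested would fail fast instead of producing the downstream `$<-.data.frame` error.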
