A few months ago, Harper and Palayew[1] published a study looking at whether a signal could be detected in fatal car crashes in the United States based on the "4/20" holiday, based on a previous study by Staples and Redelmeier[2] that suggested a strong link. Using more robust methods and a more comprehensive time window, Harper and Palayew could not find a signal for 4/20, but could for other holidays, such as July 4.
This is a great example of how charts can mislead based on choices in analysis and plotting.
Some of Harper and Palayew's analysis was done in R, but more was done in Stata and Stan. Their manuscript and their original data/code is at https://osf.io/qnrg6/. I built a script to download their original raw data and tidy it up into datasets they used in their paper as a possible #tidytuesday activity using R. Other dataset creation from the raw data and additional tidying possibilities exist, of course.
#### Load packages -------------------------------------------------------------
library(haven)
library(tidyverse)
library(lubridate)
#### Acquire raw data ----------------------------------------------------------
# Crash data (from Harper and Palayew)
download.file("https://osf.io/kj7ub/download", "~/Downloads/farsp/farsp.zip")
unzip("~/Downloads/farsp/farsp.zip", exdir = "~/Downloads/farsp")
dta_files = list.files(path = "~/Downloads/farsp", pattern = "*.dta", full.names = TRUE)
dta_files = setNames(dta_files, dta_files)
fars = map_df(dta_files, read_dta, .id = "id")
# Geographic lookup
geog = read_csv("https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
col_names = c("state_name", "state_code", "county_code",
"county_name", "FIPS_class_code")) %>%
mutate(state = as.numeric(state_code),
count = as.numeric(county_code),
FIPS = paste0(state_code, county_code))
#### Data wrangling ------------------------------------------------------------
# Used https://osf.io/drbge/ Stata code as a guide for cleaning
# All data
# This might take awhile... go get a coffee
all_accidents = fars %>%
# What are state and county codes/look ups?
select(id, state, county, month, day, hour, minute, st_case, per_no, veh_no,
per_typ, age, sex, inj_sev, death_da, death_mo, death_yr,
death_hr, mod_year, death_mn, death_tm, lag_hrs, lag_mins) %>%
# CAPS used to avoid conflict with lubridate
rename(MONTH = month, DAY = day, HOUR = hour, MINUTE = minute) %>%
mutate_at(vars(MONTH, DAY, HOUR, MINUTE), na_if, 99) %>%
mutate(crashtime = HOUR * 100 + MINUTE,
YEAR = as.numeric(gsub("\\D", "", id)) - 10000,
DATE = as.Date(paste(YEAR, MONTH, DAY, sep = "-")),
TIME = paste(HOUR, MINUTE, sep = ":"),
TIMESTAMP = as.POSIXct(paste(DATE, TIME), format = "%Y-%m-%d %H:%M"),
e420 = case_when(
MONTH == 4 & DAY == 20 & crashtime >= 1620 & crashtime <= 2359 ~ 1,
TRUE ~ 0),
e420_control = case_when(
MONTH == 4 & (DAY == 20 | DAY == 27) & crashtime >= 1620 & crashtime < 2359 ~ 1,
TRUE ~ 0),
d420 = case_when(
crashtime >= 1620 & crashtime <= 2359 ~ 1,
TRUE ~ 0),
sex = factor(case_when(
sex == 2 ~ "F",
sex == 1 ~ "M",
sex >= 8 ~ NA_character_,
TRUE ~ NA_character_)),
Period = factor(case_when(
YEAR < 2004 ~ "Remote (1992-2003)",
YEAR >= 2004 ~ "Recent (2004-2016)",
TRUE ~ NA_character_)),
age_group = factor(case_when(
age <= 20 ~ "<20y",
age <= 30 ~ "21-30y",
age <= 40 ~ "31-40y",
age <= 50 ~ "41-50y",
age <= 97 ~ "51-97y",
age == 98 | age == 99 | age == 998 ~ NA_character_,
is.na(age) ~ NA_character_,
TRUE ~ NA_character_))
) %>%
filter(per_typ == 1,
!is.na(MONTH),
!is.na(DAY))
# Daily+Time Group
# This should match 420-data.dta observations at https://osf.io/ejz28/
# Verify: dta_orig = read_dta("https://osf.io/ejz28/download")
# arsenal::compare(daily_accidents_time_groups, dta_orig)
daily_accidents_time_groups = all_accidents %>%
group_by(DATE, d420) %>%
summarize(fatalities_count = n())
# Daily+Time Group final working data
# Only use data starting in 1992
daily_accidents_time_groups = all_accidents %>%
filter(YEAR > 1991) %>%
group_by(DATE, d420) %>%
summarize(fatalities_count = n())
# Daily final working data
daily_accidents = all_accidents %>%
filter(YEAR > 1991) %>%
group_by(DATE) %>%
summarize(fatalities_count = n())
"We will have many sources of data and want to emphasize that no causation is implied. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our guidelines are to use the data provided to practice your data tidying and plotting techniques. Participants are invited to consider for themselves what nuancing factors might underlie these relationships.
The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data."
[1]. Harper S, Palayew A The annual cannabis holiday and fatal traffic crashes. BMJ Injury Prevention. Published Online First: 29 January 2019. doi: 10.1136/injuryprev-2018-043068. Manuscript and original data/code at https://osf.io/qnrg6/
[2]. Staples JA, Redelmeier DA. The April 20 cannabis celebration and fatal traffic crashes in the United States. JAMA Intern Med. 2018 Feb;178(4):569–72.