safegraphinc / SafeGraphR
R code for common, repeatable data wrangling and analysis of SafeGraph data
Home Page: https://safegraphinc.github.io/SafeGraphR/
License: Apache License 2.0
Hi there @felixsafegraph! Not sure whether it's better to reach out here or on Slack, but the SafeGraphR package is ready for beta, and there's a docs site in here now. Could you turn on GitHub Pages for the SafeGraphR repository (set to docs/)? I don't have access. Thank you!
Hello -
I am running the following code on data downloaded directly from the SafeGraph Shop:
pitt.data <- read_shop(
  filename = "safe-graph-data.zip",
  keeplist = c("patterns", "home_panel_summary.csv"),
  by = "placekey",
  expand_int = "visitors_by_day",
  name = "visits",
  start_date = lubridate::ymd("2018-01-01")
)
However, doing so results in the following error:
Error in read_many_patterns(filelist = patfiles, dir = exdir, recursive = FALSE, :
Number of files (0) does not match number of start_dates (1) to go along with them.
My desired result is a count of visitors for each placekey by day (2018-01-01, 2018-01-02, and so on).
Looking forward to a response.
I am trying to use the read_many_patterns() function, but I keep getting this error:
"Attempted to find start_date from filename but failed." The zipped files came directly from SafeGraph and are named like 2019-06-core_poi_patterns_part1. There are 10 zip files for June alone.
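When the filename parse fails, one possible workaround (a sketch only, not tested against the real files; the filename pattern, the `filelist` argument, and the one-start-date-per-file pairing are assumptions taken from the error messages quoted in these issues) is to build the file list yourself and pass the start date explicitly:

```r
# Sketch: supply start_date explicitly instead of relying on filename parsing.
# The pattern below and the argument names are assumptions based on the
# error messages in this thread.
patfiles <- list.files(pattern = "2019-06-core_poi_patterns", full.names = TRUE)

# One start date per file, since the error complains when the counts differ
start_dates <- rep(as.Date("2019-06-01"), length(patfiles))

if (length(patfiles) > 0) {
  patterns <- SafeGraphR::read_many_patterns(filelist = patfiles,
                                             start_date = start_dates)
}
```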
read_many_patterns appears to be having a problem handling missing values for distance_from_home when aggregating by county FIPS code.
For example, when I make a call to read_many_patterns with the below code to read weekly patterns for a single state, every other variable reads without issue, but the entire column of distance_from_home is filled with NA values.
Not every POI is missing data for distance_from_home, so this is not the expected behavior for this function. Is there any way around this?
patterns <- read_many_patterns(
  "patterns_dir",
  recursive = TRUE,
  naics_link = poi_link,
  by = c('state_fips', 'county_fips'),
  filter = 'state_fips == 34'
)
Hi,
I re-ran some old code today, and I'm wondering if the cbg_pop file has changed: the poi_cbg codes are appearing as ultra-small values (e.g., 4.960370e-314) rather than 12-digit numbers.
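Values like 4.96e-314 are the classic symptom of bytes being reinterpreted as doubles, and in any case 12-digit CBG codes are identifiers rather than quantities. A possible guard, a minimal self-contained sketch assuming you load the file yourself with data.table::fread (the column name poi_cbg is taken from the post; the tiny CSV stands in for the real file), is to force the column to character:

```r
library(data.table)

# Minimal sketch: read CBG identifiers as character so leading zeros and
# precision are preserved. The two-line CSV stands in for the real cbg_pop file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("poi_cbg,pop",
             "010010201001,1500"), tmp)

cbg_pop <- fread(tmp, colClasses = c(poi_cbg = "character"))
```

Reading as character also keeps the leading zero that a numeric read would drop.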
Any help greatly appreciated.
I am attempting to read a collection of v2 social distancing files for 2020 in the directory layout expected by read_distancing(); however, the function appears to be broken.
I have all social distancing patterns in the layout expected by this function within my current working directory, yet the function is unable to detect them and fails, as shown from this reprex:
library(SafeGraphR)
library(tidyverse)
# Start with all social distancing for 2020
setwd("Y:/Gavin/social-distancing/social-distancing/v2")
distancing <- read_distancing(
  start = lubridate::ymd('2020-01-01'),
  end = lubridate::ymd('2020-03-10')
)
#> Running read_distancing with default select and by - this will select only the device count variables, and aggregate to the county level. Change the select and by options if you don't want this. This message will be displayed only once per session.
#> [1] ".2020/01/01/"
#> Error in data.table::fread(file = target, select = select, ...): File '.2020/01/01/' does not exist or is non-readable. getwd()=='Y:/Gavin/social-distancing/social-distancing/v2'
Created on 2021-04-20 by the reprex package (v2.0.0)
Hi,
Not sure if this is a bug or intended behavior. The JSON in the second row of the input datatable is empty. If I expand the JSON with by = F and na.rm = T, the initial_rowno variable for rows 3 and 4 of the output is 2, when it should be 3. If I set na.rm = F, it becomes 3.
Obviously this issue can be avoided entirely by setting na.rm = F. Maybe it should be obvious to me why this behavior occurs, but it confused me so I thought I'd bring it up.
Thanks for all your work on this package, by the way!
patterns <- data.table::data.table(
  state_fips = c(1, 2, 3),
  cat_origin = c('{"a": "2", "b": "3"}',
                 '{}',
                 '{"a": "4", "b": "5"}')
)
> patterns
state_fips cat_origin
1: 1 {"a": "2", "b": "3"}
2: 2 {}
3: 3 {"a": "4", "b": "5"}
>
expand_cat_json(
  patterns,
  'cat_origin',
  'index',
  by = F,
  na.rm = T
)
initial_rowno cat_origin index
1: 1 2 a
2: 1 3 b
3: 2 4 a
4: 2 5 b
expand_cat_json(
  patterns,
  'cat_origin',
  'index',
  by = F,
  na.rm = F
)
initial_rowno cat_origin index
1: 1 2 a
2: 1 3 b
3: 3 4 a
4: 3 5 b
Per our conversation on Slack, it would be great if this package could process the open_hours
field from SafeGraph (see here for spec). I had hoped to write up a PR but, having compared my amateurish attempt to the existing codebase, maybe it's better if I just supply the code I put together here and you decide how to proceed.
library(data.table)
library(SafeGraphR)
library(fst)
library(magrittr)
# Load Core POI data ----
core_poi <- read_many_csvs(dir = "/data1/safegraph/core_poi/2020/11/06/11/")
# Limit to POI that give open hours
open_hours_only <- core_poi[open_hours != ""]
convert_hour_str <- function(time_str, midnight_is_zero = TRUE) {
  # Convert an %H:%M time string to numeric, e.g., "08:15" -> 8.25
  time_POSIX <- as.POSIXlt(time_str, format = "%H:%M")
  result <- hour(time_POSIX) + minute(time_POSIX) / 60
  if (!midnight_is_zero) {
    result[result == 0] <- 24
  }
  return(result)
}

convert_JSON_hours <- function(hours_clean) {
  # Convert a JSON string listing hours open and closed into a data.table
  # hours_clean <- unique_hours$open_hours_clean[96] # DEBUG
  hour_list <- jsonlite::fromJSON(hours_clean) # This takes a long, long time.
  # Keep only non-empty days
  hour_list <- hour_list[sapply(hour_list, length) > 0]
  hour_dt <- rbindlist(lapply(hour_list, as.data.table), idcol = "dow")
  setnames(hour_dt, c("V1", "V2"), c("open", "close"))
  hour_dt[, `:=`(open = convert_hour_str(open),
                 close = convert_hour_str(close, midnight_is_zero = F))]
  hour_dt
}

expand_hours <- function(dt) {
  # dt <- open_hours_only[1:10000] # DEBUG
  # To save on parsing time, get unique values of open_hours
  unique_hours <- dt[, .N, by = open_hours] %>% .[, N := NULL]
  # Remove extra escaped quotes
  unique_hours[, open_hours_clean := stringr::str_replace_all(open_hours, '\\"\\"', '\\"')]
  # Get a data.table where each obs is a row-by-dow open/close interval
  unique_hours_dt <- unique_hours[, convert_JSON_hours(open_hours_clean), by = open_hours]
  # Merge (M:M) back to the original dataset
  dt_final <- merge(dt[, .(placekey, open_hours)],
                    unique_hours_dt,
                    by = "open_hours",
                    allow.cartesian = T)
  dt_final <- dt_final[, .(placekey, dow, open, close)]
  dt_final
}
expanded_hours <- expand_hours(open_hours_only[sample(.N, 100)])
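For anyone skimming, here's a quick self-contained check of the convert_hour_str() helper above. It is copied from the snippet except that hour and minute are read straight off the POSIXlt fields, so the check runs in base R without depending on which package's hour() is attached:

```r
# Self-contained check of convert_hour_str() from the snippet above,
# using POSIXlt fields so it runs without data.table attached.
convert_hour_str <- function(time_str, midnight_is_zero = TRUE) {
  time_POSIX <- as.POSIXlt(time_str, format = "%H:%M")
  result <- time_POSIX$hour + time_POSIX$min / 60
  if (!midnight_is_zero) {
    result[result == 0] <- 24
  }
  result
}

convert_hour_str("08:15")                           # 8.25
convert_hour_str("00:00")                           # 0: midnight as an opening time
convert_hour_str("00:00", midnight_is_zero = FALSE) # 24: midnight as a closing time
```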
I've tried all the package versions back to 3.6, and none work with SafeGraphR.