
google-covid-mobility-scrape

Repo status: inactive. For background, see lapsedgeographer blog posts 1 and 2.

This is a repo to scrape the data from Google's COVID-19 community mobility reports using R. This code is released freely under the MIT Licence and is provided 'as-is'.

This project is now archived: no further development is planned and the GitHub Actions have been paused. Google has been publishing the underlying data for some time; that data should be preferred over any data held in this repository.


This project is built in R and extracts both the headline mobility comparison figures and trendline data from Google's PDFs. The trendline extraction code lives in the feature/trendlines branch until it has been verified.

The trendline extraction work benefits significantly from the ONS's work on trendline extraction and from Duncan Garmonsway's port of the ONS code to R.

If you'd like to read about the process of developing this code, please read the lapsedgeographer blog posts mentioned above.

Data

You can browse the extracted data in the data folder, which also contains a log of the processed countries and regions.

A GitHub Actions workflow runs the get_all_data.R script hourly to check for new reports. If new reports have been published (or existing reports updated), the script runs and the new data is pushed to the repository. Files continue to use the format YYYY-MM-DD_alldata_[wide|long].csv, but there are now also latest_alldata_[wide|long].csv files, which are copies of the most recently produced data. All files contain a reference date column. A workflow has also been written to scrape the trendlines; it executes when an update to LASTUPDATE_UTC.txt is pushed to the repository (i.e. when new headline figures have been added).
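For illustration, here is a minimal sketch of that update check in R, assuming LASTUPDATE_UTC.txt holds the previously recorded timestamp as its first line; it mirrors the behaviour described above rather than reproducing the actual code in get_all_data.R:

source("R/functions.R")

# compare the stored timestamp with the one currently published;
# only run the extraction when something has changed
last_seen <- readLines("LASTUPDATE_UTC.txt", n = 1)
current   <- get_update_time()

if (identical(as.character(current), last_seen)) {
  message("Reports unchanged; skipping extraction")
}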

The table below provides a list of the data in the repository, but it is manually updated. Check processing.log for a log of activity, and LASTUPDATE_UTC.txt for the metadata relating to updates, if you want to check whether there has been an update. To run the extraction manually:

cd ~/r/google-covid-mobility-scrape
Rscript get_all_data.R

NEWS (date/time in London local time; BST)

2020-09-23 13:10 Project archived; GitHub Actions paused.
2020-04-23 19:30 Code updated; GitHub Actions resumed.
2020-04-23 20:04 Google updated their website, breaking the code, so GitHub Actions automated checking was paused.
2020-04-17 12:45 Google are now publishing their own CSV, which should be considered the canonical source; this project will continue for now.
2020-04-17 12:40 Trendlines moved to the feature/trendline branch while under review.
2020-04-16 01:50 Corrected an error with the baselining of trendlines for the overall report trends.
2020-04-15 22:16 TRENDLINES EXTRACTED: data for the trendlines is now being extracted, with thanks to Duncan Garmonsway's port of the ONS code to R for the code inspiration.
2020-04-13 19:30 get_all_data.R now runs hourly via GitHub Actions.
2020-04-10 16:16 get_all_data.R amended to check the update time: it doesn't run the extraction code if times are the same, and gives a warning if update times have changed but report dates are unchanged.
2020-04-10 15:36 Added function get_update_time() to extract the time of update.
2020-04-10 13:15 Extracted new mobility data (reference date 2020-04-05). get_all_data.R updated so it can be run without needing to change filenames (i.e. it will programmatically extract the date and use that for the filenames).
2020-04-07 16:52 Updated README to reference ONS work on trendline extraction.
2020-04-04 16:51 get_all_data.R script pulls data from all reports, saved in the data folder.
2020-04-04 16:26 Added comments to the functions; moved the tidyverse library call to the scripts.
2020-04-03 18:22 Converted code into functions; added date and country codes to the output tables; created functions for region reports (US state-level data).
2020-04-03 12:59 First version: scrape of a PDF and extraction of data into CSV (reference date 2020-03-29).

How to use

You'll need the following R packages: dplyr, purrr, xml2, rvest, pdftools and countrycode. These are all on CRAN.

install.packages("tidyverse")       # installs dplyr, purrr, rvest and xml2
install.packages("pdftools")
install.packages("countrycode")

The R/functions.R script provides a number of functions to interact with the Google COVID-19 Community Mobility Reports:

  • get_country_list() gets a list of the country reports available
  • get_national_data() extracts the overall figures from a country report
  • get_subnational_data() extracts the locality figures from a country report
  • get_region_list() gets a list of the region reports available (currently just US states)
  • get_region_data() extracts the overall figures from a region report
  • get_subregion_data() extracts the locality figures from a region report
  • get_update_time() extracts the time the reports were updated (not the reference date of the reports)

The functions return tibbles providing the headline mobility report figures; they do not extract or interact with the trendlines provided in the chart reports. The tibbles have the following columns:

  • date: the date from the PDF file name
  • country: the ISO 2-character country code from the PDF file name
  • region: for region reports the region name
  • entity: the datapoint label, one of the six entities listed below
  • value: the datapoint value; these are presented as percentages in the report but are converted to decimal representation in the tables

There are six mobility entities presented in the reports:

  • retail_recr (Retail & recreation): Mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters.
  • grocery_pharm (Grocery & pharmacy): Mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies.
  • parks (Parks): Mobility trends for places like national parks, public beaches, marinas, dog parks, plazas, and public gardens.
  • transit (Transit stations): Mobility trends for places like public transport hubs such as subway, bus, and train stations.
  • workplace (Workplaces): Mobility trends for places of work.
  • residential (Residential): Mobility trends for places of residence.
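The data folder stores both long and wide CSVs. For anyone working with the long tibbles directly, here is a hedged sketch of reshaping them so each entity becomes its own column; mob_long is a hypothetical name for any tibble returned by the functions above, and this is illustrative rather than how get_all_data.R necessarily produces its wide files:

library(tidyverse)

# spread the six entities from the long tibble `mob_long` into their own columns
mob_wide <- mob_long %>%
  pivot_wider(names_from = entity, values_from = value)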

Example code

This code is also provided in mobility_report_scraping.R

library(tidyverse)       # pdftools and countrycode do not need to be loaded
source("R/functions.R")  # they are referenced in my functions using pkg::fun()

# get list of countries
# default url is https://www.google.com/covid19/mobility/
countries <- get_country_list()

# extract the url for the uk
uk_url <- countries %>% filter(country == "GB") %>% pull(url)

# extract overall data for the uk
uk_overall_data <- get_national_data(uk_url)

# extract locality data for the uk
uk_location_data <- get_subnational_data(uk_url)

# get list of us states
states <- get_region_list()

# extract the url for new york
ny_url <- states %>% filter(region == "New York") %>% pull(url)

# extract overall data for new york state
ny_data <- get_region_data(ny_url)

# extract locality data for new york state
ny_locality_data <- get_subregion_data(ny_url)
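Building on the example above, here is a short sketch in the spirit of get_all_data.R: loop over every country report and bind the overall figures into a single tibble. It assumes, as shown earlier, that get_country_list() returns a url column; it is a sketch rather than the script's actual code.

# pull the overall figures for every available country report
countries    <- get_country_list()
all_national <- purrr::map_dfr(countries$url, get_national_data)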


Issues

Add check to subnational function

Countries with no subnational data only have three pages, so the code doesn't actually need to be run.

Add check for number of pages.
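A possible implementation, placed at the top of get_subnational_data(); this sketch uses pdftools::pdf_info() to count pages, with the three-page threshold taken from the observation above:

# reports without subnational data run to only three pages, so bail out early
n_pages <- pdftools::pdf_info(url)$pages
if (n_pages <= 3) {
  warning("Report appears to contain no subnational data")
  return(invisible(NULL))
}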

Code duplication for PDF reading

Problem

Not much of a problem. Just some minor code duplication when reading a PDF.

Example

get_national_data() and get_subnational_data() both do this:

report_data <- pdftools::pdf_data(url)

Solution

Mild refactor to create a separate PDF-reading step that then feeds into e.g. get_national_data() and get_subnational_data().
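A sketch of what that might look like, reusing uk_url from the README's example; note that get_national_data() and get_subnational_data() would need reworking to accept parsed data rather than a URL, so this is hypothetical:

# read the PDF once, then hand the parsed data to both extractors
read_report <- function(url) {
  pdftools::pdf_data(url)
}

report      <- read_report(uk_url)
national    <- get_national_data(report)     # reworked to take parsed data
subnational <- get_subnational_data(report)  # likewise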

Risk

Minimal. Some efficiency is gained from reading PDFs just once from a given URL.

Invoke error for national/regional URL input

Problem

You can, for example, pass a national URL to get_subnational_data() and no error is raised. The function manages to extract data, and the region column gets filled with Mobility Report en.pdf (because this variable is filled using a str_split() index).

Example

Passing the GB PDF to get_subregion_data().

get_subregion_data("https://www.gstatic.com/covid19/mobility/2020-04-05_GB_Mobility_Report_en.pdf")
# A tibble: 900 x 6
#   date       country region                 location      entity         value
#   <chr>      <chr>   <chr>                  <chr>         <chr>          <dbl>
# 1 2020-04-05 GB      Mobility Report en.pdf Aberdeen City retail_recr   -0.84 
# ...

Solution

Detect whether the input is the path to a national or a regional file. This could be based on the number of str_split() elements, although that depends on the consistency of the URL format.

length(str_split("2020-04-05_US_Alabama_Mobility_Report_en.pdf", "_")[[1]])  # 6 elements
length(str_split("2020-04-05_GB_Mobility_Report_en.pdf", "_")[[1]])  # 5 elements

Or perhaps there's an element in the PDFs themselves that can help identify whether it's national or subnational.

Risk

Minimal. Perhaps only a problem if a third party uses the function incorrectly.

dot separator in full_ref is ambiguous

Unfortunately there are dots in some location names, for example "St. Gallen", which makes it difficult to separate full_ref into columns. This affects a few thousand rows.

library(tidyverse)
x <- readRDS("./2020-04-05_trendline_long.rds")
filter(x, str_detect(location, fixed(".")))

Nice work on the rest, though; I hope to learn how you did the GitHub Actions.

additional libraries

I needed to additionally require the stringr, tibble, and tidyr packages (clearly, oldschool and not a tidyverse regular here).

This is amazing, and saved me a ton of time -- thank you.

time-series

Nice work. Are you going to build time-series data by running this script daily?
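Since every dated CSV in the data folder already carries a reference date column, a time series can be assembled by stacking them. A hedged sketch, assuming the filename pattern described in the README:

library(tidyverse)

# stack the dated long-format CSVs; the pattern excludes the
# latest_alldata_long.csv copy so no reference date is duplicated
files <- list.files("data", pattern = "^\\d{4}-\\d{2}-\\d{2}_alldata_long\\.csv$",
                    full.names = TRUE)
time_series <- purrr::map_dfr(files, readr::read_csv)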
