
Air Pollution Prediction Commons


About

The goal of the appc package is to provide daily, high-resolution, near real-time, model-based ambient air pollution exposure assessments. This is achieved by training a generalized random forest on several geomarkers to predict daily average EPA AQS concentrations from 2017 until the present at exact locations across the contiguous United States (see vignette("cv-model-performance") for more details). The appc package contains functions for generating the geomarker predictors and for predicting ambient air pollution concentrations. Predictor geomarkers include weather and atmospheric information, wildfire smoke plumes, elevation, and satellite-based aerosol diagnostics products. Source files included with the package train and evaluate models that can be updated with any release to use more recent AQS measurements and/or geomarker predictors.

Installing

Install the development version of appc from GitHub with:

# install.packages("remotes")
remotes::install_github("geomarker-io/appc")

Example

In R, create model-based predictions of ambient air pollution concentrations at exact locations on specific dates using the predict_pm25() function:

appc::predict_pm25(
  x = s2::as_s2_cell(c("8841b39a7c46e25f", "8841a45555555555")),
  dates = list(as.Date(c("2023-05-18", "2023-11-06")), as.Date(c("2023-06-22", "2023-08-15")))
)
#> ℹ (down)loading random forest model
#> ✔ (down)loading random forest model [9.2s]
#> 
#> ℹ checking that s2 are within the contiguous US
#> ✔ checking that s2 are within the contiguous US [60ms]
#> 
#> ℹ adding coordinates
#> ✔ adding coordinates [1.6s]
#> 
#> ℹ adding elevation
#> ✔ adding elevation [1.4s]
#> 
#> ℹ adding HMS smoke data
#> ✔ adding HMS smoke data [889ms]
#> 
#> ℹ adding NARR
#> ✔ adding NARR [483ms]
#> 
#> ℹ adding gridMET
#> ✔ adding gridMET [443ms]
#> 
#> ℹ adding MERRA
#> ✔ adding MERRA [558ms]
#> 
#> ℹ adding time components
#> ✔ adding time components [23ms]
#> 
#> [[1]]
#> # A tibble: 2 × 2
#>    pm25 pm25_se
#>   <dbl>   <dbl>
#> 1  8.16   1.11 
#> 2  9.28   0.806
#> 
#> [[2]]
#> # A tibble: 2 × 2
#>    pm25 pm25_se
#>   <dbl>   <dbl>
#> 1  5.13   0.381
#> 2  5.95   0.466

Installed geomarker data sources and the grf model are hosted as release assets on GitHub and are downloaded locally to the package-specific R user data directory (i.e., tools::R_user_dir("appc", "data")). These files are cached across all of an R user’s sessions and projects. (Specify an alternative download location by setting the R_USER_DATA_DIR environment variable; see ?tools::R_user_dir.)
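The cache location can be checked, or redirected before R starts (the path below is illustrative):

# where appc caches downloaded release assets
tools::R_user_dir("appc", "data")
# redirect the cache for all packages, e.g., in .Renviron:
# R_USER_DATA_DIR=/scratch/username/R_data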

See more examples in vignette("timeline-example").

S2 geohash

The s2 geohash is a hierarchical geospatial index that uses spherical geometry. The appc package uses s2 cells via the s2 package to specify geospatial locations. In R, s2 cells can be created using their character string representation, or by specifying latitude and longitude coordinates; e.g.:

s2::s2_lnglat(c(-84.4126, -84.5036), c(39.1582, 39.2875)) |> s2::as_s2_cell()
#> <s2_cell[2]>
#> [1] 8841ad122d9774a7 88404ebdac3ea7d1

Geomarker Assessment

Spatiotemporal geomarkers are used for predicting air pollution concentrations, but also serve as exposures or confounding exposures themselves. View information and options about each geomarker:

geomarker                                 appc function
🌦 weather & atmospheric conditions       get_gridmet_data(), get_narr_data()
🛰 satellite-based aerosol diagnostics    get_merra_data()
🔥 wildfire smoke                         get_hms_smoke_data()
🗻 elevation                              get_elevation_summary()

Currently, get_urban_imperviousness(), get_traffic(), and get_nei_point_summary() are stashed in the /inst folder and are not integrated into this package.

Developing

Please note that the appc project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

To create and release geomarker data as release assets, as well as to create the AQS training data and to train and evaluate a generalized random forest model, use just to execute the recipes in the justfile.

> just --list

Available recipes:
    build_site             # build readme and webpage
    check                  # CRAN check package
    dl_geomarker_data      # download all geomarker data ahead of time, if not already cached
    docker_test            # run tests without cached release files
    docker_tool            # build docker image preloaded with {appc} and data
    make_training_data     # make training data for GRF
    release_hms_smoke_data # install smoke data from source and upload to github release
    release_merra_data     # upload merra data to github release
    release_model          # upload grf model and training data to current github release
    train_model            # train grf model and render report


appc's Issues

example data set of s2 and dates

To use for both examples and in tests

  • create children polygons inside contiguous_us()
  • randomly sample some
  • s2::s2_point_on_surface()
  • aggregate back and then choose a random number of dates for each point
  • make into a named list that can be used with examples:

ex_data <- ...
get_merra_data(x = names(ex_data), dates = ex_data)
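A rough sketch of these steps (simplified to random points instead of children polygons of contiguous_us(); all names and ranges are illustrative):

# sample random locations, convert to s2 cells, and attach 1-3 random
# 2023 dates per cell (no contiguous US containment check in this sketch)
set.seed(1)
pts <- s2::s2_lnglat(runif(5, -120, -75), runif(5, 30, 45))
ex_cells <- s2::as_s2_cell(pts)
ex_data <- lapply(
  seq_along(ex_cells),
  \(i) sort(sample(seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = 1),
                   sample(1:3, 1)))
)
names(ex_data) <- as.character(ex_cells)
get_merra_data(x = names(ex_data), dates = ex_data)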

detailed user example vignettes

make into two separate vignettes, and link to them from the readme

case crossover example

  • simulate a data frame of case dates and geocodes
  • apply Stephen's case-crossover code to generate matched control days
  • use the package to also include temperature and humidity

chronic exposures example

  • simulate a data frame of patient IDs, geocodes, and start and stop dates
  • move the example from the readme on how to expand these into a list of dates (see the sketch below)
  • estimate pm25 and average it to monthly estimates
  • simulate an outcome model? use averages of other variables as confounders?
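A minimal sketch of the date-expansion step (column names are illustrative):

# expand each patient's start/stop dates into a list of daily date
# vectors, one list element per row
d <- data.frame(
  id = c("a", "b"),
  start = as.Date(c("2023-01-01", "2023-03-15")),
  stop = as.Date(c("2023-01-31", "2023-04-14"))
)
dates <- purrr::map2(d$start, d$stop, \(x, y) seq(x, y, by = 1))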

CV error report updates

  • plot pred vs actual concentrations on the log10 scale
  • add in variable importance plots and tables
  • try one level of coarser aggregation for the s2 error maps
  • why is there a mismatch between the quick error calculations when training the model?
    • the quick R sq is the Spearman correlation coefficient

consistent messages

  • clearly state when a file is not found and will be downloaded
  • provide an argument to quiet predict_pm25() and all other functions?
  • be consistent with progress updates

error building vignettes / running examples... related to tigris caching?

If I clone the repo and try to build the package locally without changing anything, I keep getting this error, sometimes when building vignettes and sometimes when running tests. In both cases, the error is triggered by tigris calls.

Cannot open "/private/var/folders/7f/zd0jc6710pj2mk4wzyv_tj5w0000gr/T/RtmpEdlWqx"; The source could be corrupt or not supported. See `st_drivers()` for a list of supported formats.

In my .Rprofile I set tigris_use_cache to FALSE, but I am still getting the error. I think this is more a problem with the way R runs checks on my machine than with the package itself, but I am documenting it here as I work through it.
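For reference, tigris caching is controlled by an option that can be set in .Rprofile:

# disable (or enable) tigris's local caching of downloaded shapefiles
options(tigris_use_cache = FALSE)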

Use vignette for example geomarker assessment

Start from addresses? or from lat/lon?

Randomly choose dates? Or show example of start_date and end_date being used to make a daily time series? (or do this as an s3 object)

Show conversion to s2 cells and how to use tibble to add new geomarker column or columns

Show how to extract columns from geomarker functions that return tibbles

Show how to nest / unnest dates to get tabular data for further use

air pollution predictor functions

install_{geomarker}_data() functions (e.g., install_smoke_data(), install_elevation_data()) download geomarker data files directly from the provider and store them as harmonized files in the R user data directory for the appc package. This allows the geomarker data files to be referenced across R sessions and projects. These functions are utilized automatically by the geomarker assessment functions, but can also be called without input data to install the geomarker data ahead of time (e.g., if internet access will not be available later, after input data are added). Note that some of the install functions require a system installation of GDAL.
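For example, to cache geomarker data ahead of time (using the example function names above):

# download and cache geomarker data now so later sessions can run offline
install_smoke_data()
install_elevation_data()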

get_{geomarker}_summary() functions (e.g., get_narr_data(), get_census_tract_id()) take a vector of s2 geographic cell identifiers and a list of date vectors. Each item in the list of date vectors corresponds to one of the s2 cell identifiers in the vector and can contain multiple dates.
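For example, a sketch of this shared calling pattern (the x and dates argument names follow the predict_pm25() example above; some functions may take additional arguments):

# two s2 cells: the first paired with two dates, the second with one
s2_cells <- s2::as_s2_cell(c("8841b39a7c46e25f", "8841a45555555555"))
dates <- list(as.Date(c("2023-05-18", "2023-11-06")), as.Date("2023-06-22"))
get_narr_data(x = s2_cells, dates = dates)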

TODO Each get_ geomarker summary function has an XXX argument specifying the path to the geomarker data file. This defaults to using the corresponding install_ geomarker function, but can be set to use existing geomarker data files (e.g., on a shared storage drive, or across compute nodes of a high-performance cluster).

get_elevation_data(data_source = "/scratch/broeg1/appc/")

TODO if providing a directory, the specific geomarker data file will be searched for in that directory.

e.g.,

get_narr_data(data_source = "/scratch/broeg1/appc")

consenting users to downloading large files

Include a function that gets the file size from the HTTP header and interactively asks the user if downloading a file of this size is OK. Insert it in each install_() function and provide an option or environment variable to consent to all downloads (how does renv do this?)
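A minimal sketch of such a helper (hypothetical function and environment variable names; not part of {appc}):

# ask before downloading large files, based on the Content-Length header;
# APPC_DOWNLOAD_OK is an assumed consent-to-all environment variable
consent_to_download <- function(url, threshold_mb = 100) {
  if (nzchar(Sys.getenv("APPC_DOWNLOAD_OK"))) return(invisible(TRUE))
  size_mb <- as.numeric(httr::headers(httr::HEAD(url))[["content-length"]]) / 1e6
  if (is.na(size_mb) || size_mb < threshold_mb) return(invisible(TRUE))
  ok <- utils::askYesNo(sprintf("OK to download %.0f MB from %s?", size_mb, url))
  if (!isTRUE(ok)) stop("download not confirmed by user", call. = FALSE)
  invisible(TRUE)
}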

dependencies

  • only require packages for model predictions
  • suggest (or rlang::check...) for packages required to install data

speed

  • can the final random forest have a smaller file size by removing OOB predictions?
  • #69
  • improve smoke data extraction?
  • replace any purrr calls with furrr and document how to set parallel options for model training? (see the sketch below)
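A minimal sketch of the furrr pattern (inputs and slow_fn are illustrative stand-ins):

# stand-in inputs and function
inputs <- 1:8
slow_fn <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# the user controls parallelism by setting a future plan beforehand
future::plan(future::multisession, workers = 2)
results <- furrr::future_map(inputs, slow_fn)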

model accuracy experiments

Do any of the following improve model accuracy?

  • #16
  • #36
  • #18
  • projecting coordinates to planar coordinates instead of lat/lon coordinates (yes)
  • day of week (yes by 0.01 in CV Rsq)
  • #47

training data

  • pick up where other models leave off and be more useful for recent EHR and registry data
  • low-latency exposure estimates for major air pollutants are a recognized need

Instead of creating a spatiotemporal grid of predictors, create a prediction model for a set of input points and reuse the code to derive features and predict on new input data.

atmospheric composition data from NASA's GEOS model

GEOS - global earth observing system - https://gmao.gsfc.nasa.gov/GEOS_systems/

https://gmao.gsfc.nasa.gov/research/aerosol/

The GEOS-5 system is used with the GOCART module of AeroChem to make real-time estimates and forecasts of aerosols, CO and CO2 tracers in support of NASA field campaigns.

collections can be "aqc_tavg_1hr_g1440x721_v1", "chm_tavg_1hr_g1440x721_v1", "met_tavg_1hr_g1440x721_x1", "xgc_tavg_1hr_g1440x721_x1", "chm_inst_1hr_g1440x721_p23", "met_inst_1hr_g1440x721_p23"

https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/

====================

MERRA-2 reanalysis data are hosted on GES DISC, which points to S3 locations

MERRA-2 documentation on available data

https://disc.gsfc.nasa.gov/datasets?page=1&measurement=Aerosol%20Optical%20Depth%2FThickness&project=MERRA-2

https://disc.gsfc.nasa.gov/datasets/M2T1NXAER_5.12.4/summary

s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXAER.5.12.4/

needs a .netrc file with valid Earthdata Login credentials; here is a tutorial on how to do this in Python: https://disc.gsfc.nasa.gov/information/howto?keywords=%22Earthdata%20Cloud%22&title=How%20to%20Directly%20Access%20MERRA-2%20Data%20from%20an%20S3%20Bucket%20with%20Python%20from%20a%20Cloud%20Environment

(re)moving unnecessary geomarker functions

Throughout development of the model, several geomarker assessment functions were written but were ultimately not included in the final prediction model. These are more hyperlocal predictors, like get_nei_point_summary(), get_traffic_summary(), and get_urban_imperviousness(). The get_census_tract_id() function is useful in general, but is no longer used here.

Should these be moved to a different package to reduce code and dependencies here?

  • models for other pollutants could utilize these geomarkers, but then could just use them from another package
  • move three "hyperlocal" geomarkers into their own package (NEI, traffic, NLCD)
  • #70

readme section on s2 geohash

The s2 geohash is a hierarchical geospatial index that uses spherical geometry (https://s2geometry.io/about/overview). The {appc} package uses s2 cells via the {s2} package to specify geospatial locations.

In R, s2 cells can be created using their character string representation, or by specifying latitude and longitude coordinates; e.g.:

s2::s2_lnglat(c(-84.4126, -84.5036), c(39.1582, 39.2875)) |> s2::as_s2_cell()

#> <s2_cell[2]>
#> [1] 8841ad122d9774a7 88404ebdac3ea7d1

refer to vignettes in readme

  • remove timeline example now that it has its own vignette
  • remove model report from /inst and update just recipe

development roadmap

  • add just target for downloading/installing required geomarker data sources (w/o running any geomarker assessment code)
  • add small section about just and using it to run commands in the repository
  • #12
  • #11
    • double check data pipeline (missingness, summary statistics)
    • use OOB predictions to create CV error estimations, pred vs obs scatter plots (both in bag and out of bag by space)
    • include prediction tests for random locations and dates (but with a fixed seed)
    • test for reasonable prediction magnitude, mean and var of preds across time and space
    • describe differences between test predictions for different model releases
  • #10
    • alter NEI emissions predictors for these pollutants?
  • translate to a runnable 'thing' (SALT? GHA?)
  • build user prediction tools, API
    • add data tests for prediction outputs (reasonable prediction magnitude?)
    • version the api and prediction tool with {appc}
  • release appc package on CRAN
    • just geomarker assessment functions? or entire model and predictions?
  • quantify how model predictions retroactively change over time with updated models
    • publish this process and the change in models that provide "near real-time" ambient exposure estimates

problem with get_closest_year()

Need a better way to match.

get_closest_year <- function(date, years) {
  date_year <- as.numeric(format(date, "%Y"))
  purrr::map_chr(date_year, \(x) as.character(years[which.min(abs(as.numeric(years) - x))]))
}

Ties mean the order of the supplied years changes the output:

get_closest_year(as.Date(c("2021-09-15", "2022-09-01")), years = c(2020, 2022))
# [1] "2020" "2022"
get_closest_year(as.Date(c("2021-09-15", "2022-09-01")), years = c(2022, 2020))
# [1] "2022" "2022"

overwriting appc package data

  • how to delete certain appc package data?
  • store by version so a newer version of the package doesn't use an older version of the data? (see the sketch below)
  • provide an overwrite argument for the install functions?
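A sketch of version-scoped storage (illustrative; not the package's current layout):

# scope the cache directory to the installed package version so a new
# release starts with a fresh cache instead of reusing old data files
data_dir <- file.path(
  tools::R_user_dir("appc", "data"),
  as.character(utils::packageVersion("appc"))
)
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)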

update smoke data to use NOAA hazard mapping system

https://www.ospo.noaa.gov/Products/land/hms.html#about

updated daily

HMS's smoke analysis is based on visual classification of plumes using GOES-16 and GOES-17 ABI true-color imagery available during the sunlit part of the orbit. Since the analysis generally requires a sequential set of satellite images to help distinguish smoke from clouds and other atmospheric aerosols, the first smoke analysis for the current day is usually produced around the local noon time – until then, only fire detection points may be available. Additional smoke analysis will occur throughout the day until sunset or as observation conditions permit.

The density information is qualitatively described using light, medium, and heavy labels that are assigned based on the apparent thickness (opacity) of the smoke in the satellite imagery.

Use https://www.ospo.noaa.gov/Products/land/hms.html#data ; e.g.,

https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2023/06/hms_smoke20230610.zip

or

https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2021/06/hms_smoke20210610.zip
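The daily shapefile URL can be constructed from a date (pattern inferred from the two example URLs above):

# build the HMS smoke polygon shapefile URL for a given date
hms_smoke_url <- function(date) {
  sprintf(
    "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/%s/%s/hms_smoke%s.zip",
    format(date, "%Y"), format(date, "%m"), format(date, "%Y%m%d")
  )
}

hms_smoke_url(as.Date("2023-06-10"))
# [1] "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/2023/06/hms_smoke20230610.zip"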

readme examples

TODO add example usage for prediction

TODO add example usage for using get_ functions

CRS Error

Receiving an error when trying to run the first example in the README.

> appc::predict_pm25(
    x = s2::as_s2_cell(c("8841b39a7c46e25f", "8841a45555555555")),
    dates = list(as.Date(c("2023-05-18", "2023-11-06")), as.Date(c("2023-06-22", "2023-08-15"))),
    quiet = FALSE
)

loading random forest model...
checking input s2 locations and dates
checking that s2 locations are within the contiguous united states
adding HMS smoke data
progress bar doesn't work in RStudio; follow the file ".progress" instead
adding coordinates
adding elevation
adding AADT using level 14 s2 approximation (~ 260 m sq)
adding NARR
Error: [project] output crs is not valid
In addition: Warning message:
[rast] unknown extent 
