
conmat's Introduction

conmat


The goal of conmat is to make it easy to generate synthetic contact matrices for a given age population.

What is a contact matrix?

Contact matrices describe the degree of contact between individuals of given age groups.

For example, this matrix describes the number of contacts between individuals in three age groups:

#>       0-4 5-9 10-14
#> 0-4    10   3     4
#> 5-9     3  11     5
#> 10-14   4   5    13

The rows and columns represent age groups. The main diagonal has the highest numbers of contacts, showing that people tend to interact most with others of a similar age.

We can use the information in these matrices to model how diseases such as COVID-19 spread in a population through social contact.
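As a toy illustration (base R, not conmat code), scaling a contact matrix by an assumed per-contact transmission probability and multiplying by the current number of infections projects an epidemic forward one generation; the probability `q` and the orientation ("age group to" on the rows, as conmat outputs) are assumptions for the sketch:

```r
# Toy sketch, not conmat code: project infections forward one generation.
contacts <- matrix(
  c(10,  3,  4,
     3, 11,  5,
     4,  5, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(to = c("0-4", "5-9", "10-14"),
                  from = c("0-4", "5-9", "10-14"))
)
q <- 0.05                      # assumed per-contact transmission probability
infections_now <- c(1, 0, 0)   # one infectious person in the 0-4 group
infections_next <- (q * contacts) %*% infections_now
```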

Why do we need synthetic contact matrices?

Contact matrices are produced from empirical data collected in contact surveys, which require participants to keep a diary of the number and nature of their contacts over a day.

However, these surveys are highly time-consuming and expensive to run, meaning that only a handful of these empirical datasets exist globally.

We can use statistical methods to create synthetic contact matrices, which are new contact matrices that have been generalised to new countries based on existing surveys.

Why do we need conmat?

Existing methods only provide contact matrices at the country level, or at best, separately for urban and rural areas within a country.

We need methods that allow for flexibly creating synthetic contact matrices for a specified age population, as the age distribution of many countries (e.g., Australia) is quite heterogeneous across regions, and assuming it is homogeneous would misrepresent community infection in many regions.

Installation

You can install the development version with:

install.packages("conmat", repos = "https://idem-lab.r-universe.dev")

Alternatively, you can install the development version from GitHub using remotes:

# install.packages("remotes")
remotes::install_github("idem-lab/conmat")

Example

First we want to fit the model to the POLYMOD data, which contains various survey and population data.

library(conmat)
polymod_contact_data <- get_polymod_contact_data(setting = "work")
polymod_survey_data <- get_polymod_population()

The contact data is a data frame containing the ages of both parties to a contact (“from” and “to”), the number of contacts for each of the specified settings, “home”, “work”, “school”, “other”, or “all”, and the number of participants. By default, the contact data covers “all” settings, but we use the “work” setting here, as it produces an interesting-looking matrix. Each row records the number of contacts from one age group to another, along with the number of survey participants in that age group.

The survey data, polymod_survey_data contains the lower age limit and the population in that age group.

polymod_survey_data
#> # A tibble: 21 × 2 (conmat_population)
#>  - age: lower.age.limit
#>  - population: population
#>    lower.age.limit population
#>              <int>      <dbl>
#>  1               0   1898966.
#>  2               5   2017632.
#>  3              10   2192410.
#>  4              15   2369985.
#>  5              20   2467873.
#>  6              25   2484327.
#>  7              30   2649826.
#>  8              35   3043704.
#>  9              40   3117812.
#> 10              45   2879510.
#> # ℹ 11 more rows

Predicting the contact rate

We can create a model of the contact rate with the function fit_single_contact_model

set.seed(2022 - 09 - 06) # note: R evaluates this arithmetic, so the seed is 2007
contact_model <- fit_single_contact_model(
  contact_data = polymod_contact_data,
  population = polymod_survey_data
)

This fits a generalised additive model (GAM), predicting the contact rate, based on a series of prediction terms that describe various features of the contact rates.

contact_model
#> 
#> Family: poisson 
#> Link function: log 
#> 
#> Formula:
#> contacts ~ s(gam_age_offdiag) + s(gam_age_offdiag_2) + s(gam_age_diag_prod) + 
#>     s(gam_age_diag_sum) + s(gam_age_pmax) + s(gam_age_pmin) + 
#>     school_probability + work_probability + offset(log_contactable_population)
#> 
#> Estimated degrees of freedom:
#> 1.85 3.95 3.82 7.10 7.59 4.77  total = 32.09 
#> 
#> fREML score: 24429.55     rank: 55/57
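As a minimal sketch of the idea (simulated data and a single made-up smooth term, not the conmat formula above), a Poisson GAM of contact counts with a log-population offset can be fitted with mgcv like so:

```r
# Minimal sketch, not the conmat model: a Poisson GAM of contact counts with
# one smooth of age difference and a log-population offset.
library(mgcv)
set.seed(1)
d <- expand.grid(age_from = 0:49, age_to = 0:49)
d$age_diff <- abs(d$age_from - d$age_to)
d$pop <- 1000  # constant population, for illustration only
# simulate more contacts between people of similar ages
d$contacts <- rpois(nrow(d), lambda = exp(1 - d$age_diff / 20))
m <- bam(
  contacts ~ s(age_diff) + offset(log(pop)),
  family = poisson(link = "log"),
  data = d
)
```

The offset means the model estimates a per-person contact propensity rather than a raw count, which is what lets the fit be extrapolated to a population with a different age distribution.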

We can use this contact model to then predict the contact rate in a new population.

As a demonstration, let’s take an age population from a given LGA in Australia (this was the initial motivation for the package, so there are some helper functions for Australian-specific data).

fairfield <- abs_age_lga("Fairfield (C)")
fairfield
#> # A tibble: 18 × 4 (conmat_population)
#>  - age: lower.age.limit
#>  - population: population
#>    lga           lower.age.limit  year population
#>    <chr>                   <dbl> <dbl>      <dbl>
#>  1 Fairfield (C)               0  2020      12261
#>  2 Fairfield (C)               5  2020      13093
#>  3 Fairfield (C)              10  2020      13602
#>  4 Fairfield (C)              15  2020      14323
#>  5 Fairfield (C)              20  2020      15932
#>  6 Fairfield (C)              25  2020      16190
#>  7 Fairfield (C)              30  2020      14134
#>  8 Fairfield (C)              35  2020      13034
#>  9 Fairfield (C)              40  2020      12217
#> 10 Fairfield (C)              45  2020      13449
#> 11 Fairfield (C)              50  2020      13419
#> 12 Fairfield (C)              55  2020      13652
#> 13 Fairfield (C)              60  2020      12907
#> 14 Fairfield (C)              65  2020      10541
#> 15 Fairfield (C)              70  2020       8227
#> 16 Fairfield (C)              75  2020       5598
#> 17 Fairfield (C)              80  2020       4006
#> 18 Fairfield (C)              85  2020       4240

We can then pass the contact model through to predict_contacts, along with the fairfield age population data, and some age breaks that we want to predict to.

set.seed(2022 - 09 - 06) # note: R evaluates this arithmetic, so the seed is 2007
synthetic_contact_fairfield <- predict_contacts(
  model = contact_model,
  population = fairfield,
  age_breaks = c(seq(0, 85, by = 5), Inf)
)

synthetic_contact_fairfield
#> # A tibble: 324 × 3
#>    age_group_from age_group_to contacts
#>    <fct>          <fct>           <dbl>
#>  1 [0,5)          [0,5)         0.00281
#>  2 [0,5)          [5,10)        0.00318
#>  3 [0,5)          [10,15)       0.00345
#>  4 [0,5)          [15,20)       0.00571
#>  5 [0,5)          [20,25)       0.0133 
#>  6 [0,5)          [25,30)       0.0261 
#>  7 [0,5)          [30,35)       0.0356 
#>  8 [0,5)          [35,40)       0.0372 
#>  9 [0,5)          [40,45)       0.0349 
#> 10 [0,5)          [45,50)       0.0317 
#> # ℹ 314 more rows

Plotting

Let’s visualise the matrix to get a sense of the predictions with autoplot. First we need to transform the predictions to a matrix:

synthetic_contact_fairfield %>%
  predictions_to_matrix() %>%
  autoplot()

Applying the model across all settings

You can also fit a model for all of the settings at once with fit_setting_contacts, and predict from it with predict_setting_contacts. This repeats the steps above for each setting: “home”, “work”, “school”, “other”, and “all”. We recommend this workflow when using conmat, as it is the most common use case; the single-matrix walkthrough above was shown for demonstration purposes. The models for each setting can also be fitted in parallel; for details on that workflow, see the “getting started” vignette.

Data sources

This package provides data, and helper functions for that data, for use in calculating contact matrices. The data sources are from the Australian Bureau of Statistics (ABS), as we used these heavily when creating the package. In the future we might move these data sources and helpers into a separate package, but for the time being they live here. Below are a couple of examples of the data provided; see the “data sources” vignette and the help files on the package website for full details.

You can extract the age population structure for the LGA, Brisbane, like so:

abs_age_lga("Brisbane (C)")
#> # A tibble: 18 × 4 (conmat_population)
#>  - age: lower.age.limit
#>  - population: population
#>    lga          lower.age.limit  year population
#>    <chr>                  <dbl> <dbl>      <dbl>
#>  1 Brisbane (C)               0  2020      72894
#>  2 Brisbane (C)               5  2020      75933
#>  3 Brisbane (C)              10  2020      73990
#>  4 Brisbane (C)              15  2020      72010
#>  5 Brisbane (C)              20  2020     104564
#>  6 Brisbane (C)              25  2020     119000
#>  7 Brisbane (C)              30  2020     110798
#>  8 Brisbane (C)              35  2020     100493
#>  9 Brisbane (C)              40  2020      86630
#> 10 Brisbane (C)              45  2020      86791
#> 11 Brisbane (C)              50  2020      76063
#> 12 Brisbane (C)              55  2020      69273
#> 13 Brisbane (C)              60  2020      59666
#> 14 Brisbane (C)              65  2020      49134
#> 15 Brisbane (C)              70  2020      42252
#> 16 Brisbane (C)              75  2020      29927
#> 17 Brisbane (C)              80  2020      20898
#> 18 Brisbane (C)              85  2020      22683

Note that you need to use the exact LGA name - you can look up LGA names in the data set abs_lga_lookup:

abs_lga_lookup
#> # A tibble: 544 × 3
#>    state lga_code lga                  
#>    <chr>    <dbl> <chr>                
#>  1 NSW      10050 Albury (C)           
#>  2 NSW      10180 Armidale Regional (A)
#>  3 NSW      10250 Ballina (A)          
#>  4 NSW      10300 Balranald (A)        
#>  5 NSW      10470 Bathurst Regional (A)
#>  6 NSW      10500 Bayside (A)          
#>  7 NSW      10550 Bega Valley (A)      
#>  8 NSW      10600 Bellingen (A)        
#>  9 NSW      10650 Berrigan (A)         
#> 10 NSW      10750 Blacktown (C)        
#> # ℹ 534 more rows

Or get the information for states like so:

abs_age_state(state_name = "QLD")
#> # A tibble: 18 × 4 (conmat_population)
#>  - age: lower.age.limit
#>  - population: population
#>     year state lower.age.limit population
#>    <dbl> <chr>           <dbl>      <dbl>
#>  1  2020 QLD                 0     314602
#>  2  2020 QLD                 5     339247
#>  3  2020 QLD                10     345205
#>  4  2020 QLD                15     319014
#>  5  2020 QLD                20     338824
#>  6  2020 QLD                25     370468
#>  7  2020 QLD                30     362541
#>  8  2020 QLD                35     354219
#>  9  2020 QLD                40     325208
#> 10  2020 QLD                45     348003
#> 11  2020 QLD                50     321168
#> 12  2020 QLD                55     317489
#> 13  2020 QLD                60     288317
#> 14  2020 QLD                65     254114
#> 15  2020 QLD                70     226033
#> 16  2020 QLD                75     156776
#> 17  2020 QLD                80     100692
#> 18  2020 QLD                85      94266

Note

The contact matrices created using this package are transposed when compared to the contact matrices discussed by Prem and Mossong. That is, the rows are “age group to”, and the columns are “age group from”.
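For example, to compare a conmat matrix against a Prem- or Mossong-style matrix (“from” on the rows), simply transpose it with t():

```r
# conmat orientation: rows are "age group to", columns are "age group from"
conmat_style <- matrix(
  1:4, nrow = 2,
  dimnames = list(to = c("0-4", "5-9"), from = c("0-4", "5-9"))
)
# Prem/Mossong orientation: rows are "age group from"
prem_style <- t(conmat_style)
```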

Code of Conduct

Please note that the conmat project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

conmat's People

Contributors

aarathybabu97, chitrams, geryan, goldingn, mikelydeamore, njtierney


conmat's Issues

standardise column names and state values

Currently the state value is sometimes the long name (Victoria) and sometimes the short name (VIC)

  • standardise state values so they are the same, to assist with joining.
  • standardise columns to be just "lga" instead of "lga_name" - no need for "lga_code" as not all data have it?
  • use "population" as the value column where appropriate

Household age matrices

Add ability to include known household age distributions as offsets in the 'home' setting model, in place of the whole population distribution (pop_age_to)

clean up wpp_age function

  • so we have lower and upper age bounds
  • make it easier for people to add their own data source to clean it up (so column names aren't hard coded)

smooth out diagonal in school/other settings

The modelled results have a more pronounced 'ridge' of contacts between people of the same or similar ages for adults in 'school' and 'other' settings than is evident in the data.

Needs a tweak of the model structure to capture that ridge in the other settings but not this one.

abs_age_lga failing

this function fails:
abs_age_lga("Fairfield (C)")
#> Error: Problem with `mutate()` column `age_group`.
#> ℹ `age_group = readr::parse_number(age_group)`.
#> ✖ is.character(x) is not TRUE

Issue seems to be that age_group is a factor.

Can be solved by wrapping it in as.character inside the function:

abs_pop_age_lga_2020 %>%
  dplyr::filter(lga == lga_name) %>%
  dplyr::select(lga, age_group, year, population) %>%
  dplyr::mutate(age_group = readr::parse_number(as.character(age_group))) %>%
  dplyr::rename(lower.age.limit = age_group)

Separately - it might be useful to return both lower age limit and age_class in original factor form?

add functions to generate next-generation-matrices

I think it makes sense to include in this repo functions for applying age-specific per-contact transmission probabilities to contact matrices in order to produce next generation matrices. There is various code for doing this in the contact_matrices_australia private repo that could be brought in here. None of the relevant functions or data are sensitive.

The interface might be something like:

ngm_5y_fairfield <- make_ngm(
  contact_matrices = synthetic_settings_5y_fairfield,
  age_contributions = davies_covid19(),
  R = 6
)

which involves find_m() to do the calibration, and apply_age_contribution() to apply the Davies estimates (though those will need interpolating from 10y age bins in order to be used more generally)
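The calibration step could look something like the following base R sketch (not the actual find_m() implementation): scale an unscaled next generation matrix so that its spectral radius equals the target reproduction number R:

```r
# Sketch only -- not the actual find_m() implementation.
# Scale a matrix so its dominant eigenvalue (spectral radius) equals R.
calibrate_ngm <- function(unscaled, R) {
  m <- R / max(Re(eigen(unscaled, only.values = TRUE)$values))
  m * unscaled
}

unscaled <- matrix(c(10, 3, 3, 11), nrow = 2)
ngm <- calibrate_ngm(unscaled, R = 6)
```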

school offset makes bad predictions

Each of the setting-specific models include an offset term representing the age distribution of the contactable population. The model therefore fits to the difference between this and the observed number of contacts - estimating a propensity to contact each contactable person. That way, the contact matrix can be extrapolated to a new population based on the age distribution of that population - applying the model with an updated offset. This is the same as the population ratio reweighting in Prem, but in a model-based way.

Currently the contactable population for every setting is taken to be the whole population. So the rate of contacts is proportional to the fraction of the population in that age group - extrapolating to a younger population means more contacts with young people and fewer with old. That makes sense for most settings (population distribution predicts household age structure, age of workforce, age of casual contacts), but not school contacts.

Schools contacts are heavily cohorted - students are in classes with people either their calendar age, or the age 1 above or 1 below. Regardless of the population age distribution of the population, the age and number (class size) for students will remain the same. The only things that will change are the average number of contacts from adults to kids in school (fewer kids = fewer classes, and adults are less likely to be teachers), and the average number of contacts from kids to adults in school (teachers are likely to be younger).

As a consequence, predictions currently massively over- or under-estimate school contacts when extrapolated to younger or older populations. Redefining the offset for schools to capture class-based mixing (independent of population age distribution) for contacts between students, and the population distribution for contacts between adults in schools and students, would fix this. The predicted number of contacts within the same or adjacent calendar ages then wouldn't change when extrapolating, but contacts with adults would.

mgcv::bam uses the offset as a covariate

😱😱😱😱

Only when specified in the formula, by the looks of it. This makes extrapolation behave in some very weird ways, e.g. places with more kids have far fewer school contacts.

Will investigate and can always pass in with the offset argument instead

link contact data and survey population

As mentioned in #33 by @goldingn

... population is paired with the contact data (represents the population when and where the survey took place, to capture the opportunity for contacts) so in a future implementation we might want to wrap them up in a single data object. That will make it easier when we start fitting to different/multiple surveys.

We should think about a way to wrap up the contact data and survey population into a single object/dataframe/list to pair them together.

Document the full description of the methods towards a wider audience

Write up a vignette that walks through the entire process and motivation for how conmat creates the contact matrices.

This should be written up in a way that is aimed at people who might not be as familiar with epidemiology. So, it will go into detail on what each of the steps in the model fitting does, what the polymod data is, and answer questions like:

  • "What is a contact matrix"
  • "How do I adjust contact matrices for my age population"
  • "How do I estimate the impact of vaccination on spread"

This will involve fixing and simplifying the README, which is #35.

make vignette more compact

Currently the vignette takes a long time to build because of model fitting - it would be good to resolve this using smaller data and examples

Tidy up README

There's a lot of output in the README that could be cut down. Also, the images and outputs change on each render when they should be stable, so the set.seed value might need fixing?

incorporate adjustment to household size

There is code in the contact_matrices_australia private repo to adjust synthetic household contact matrices to account for the household size distribution of a population. We should bring this over and incorporate it as an optional argument to predict_setting_contacts(). Probably a bit like this:

synthetic_settings_5y_fairfield <- predict_setting_contacts(
  population = fairfield_age_pop,
  contact_model = setting_models,
  age_breaks = c(seq(0, 85, by = 5), Inf),
  household_size_distribution = fairfield_household_sizes
)

population offsets should be normalised

The number of contacts is not proportional to the absolute number of people in the area; only the relative number of contacts per age group depends on the relative population in each age group. So use normalised populations as offsets for fitting and prediction.
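A sketch of what that normalisation might look like (illustrative numbers, base R; not conmat code):

```r
# Use the relative age distribution, not raw counts, as the log offset.
pop <- c(`0-4` = 12261, `5-9` = 13093, `10-14` = 13602)
pop_fraction <- pop / sum(pop)   # sums to 1
log_offset <- log(pop_fraction)  # offset for a Poisson / log-link model
```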

add per-contact transmission probabilities from Eyre et al.

Depends on #38, which will introduce the ability to turn sets of setting-specific contact matrices into next generation matrices.

Code to digitise per-contact transmission probability matrices is in the contact_matrices_australia private repo. Note that these will be setting-specific (one each for home, work/school or other) so should be combined with the corresponding contact matrix before summing to get the NGM

Find globally-available household structure data

I.e., for developing countries that don't have ABS-equivalent data. Prem et al. used DHS survey data. You need to apply for access to that, but we have access at MAP that we might be able to use for this purpose?

ABS data to source

Population count data

  • Number of people in a given 5-year age bracket (0-4, 5-9, etc.)

Workforce data

  • Age group breakdown of people in different types of workforce

add functions to apply vaccination to next generation matrices

depends on #38

Once we have functions to generate next generation matrices, add functions to apply vaccination to them in the same way as in the contact_matrices_australia private repo. This could look something like this:

ngm_5y_fairfield_vacc_0.5 <- apply_vaccination(
  ngm = ngm_5y_fairfield,
  coverage_by_age = age_coverage,
  efficacy_susceptibility = efficacy_susceptibility,
  efficacy_infectiousness = efficacy_infectiousness
)
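Under the hood this might reduce to elementwise scaling of the NGM. The following is a hypothetical sketch under assumed conventions (“to” on rows, so reduced susceptibility scales rows and reduced infectiousness scales columns); the function name and all numbers are made up for illustration:

```r
# Hypothetical sketch, not the conmat implementation.
# With "to" on rows: susceptibility effects scale rows (infectees),
# infectiousness effects scale columns (infectors).
apply_vaccination_sketch <- function(ngm, coverage, eff_sus, eff_inf) {
  sus_mult <- 1 - coverage * eff_sus
  inf_mult <- 1 - coverage * eff_inf
  ngm * outer(sus_mult, inf_mult)
}

ngm <- matrix(c(2, 1, 1, 2), nrow = 2)
ngm_vacc <- apply_vaccination_sketch(
  ngm,
  coverage = c(0, 0.8),  # e.g. group 1 unvaccinated, group 2 at 80% coverage
  eff_sus  = 0.7,
  eff_inf  = 0.5
)
```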

aggregating age groups is inclusive of the upper and exclusive of the lower bound

i.e. when aggregating integer age bins with breaks c(0, 5, 10), 0 is being discarded, 5 is being assigned to the first bin and not the second.

This would be fine if we were explicit and consistent about it, and always indexed from 0, but polymod data indexes from 0, and socialmixr's wpp_age(), which we use, defines breaks based on the lowest age, and indexes from 0. So we just need to flip the definition in calls to cut(), so that in the example above, 0s get assigned to the first bin, 5s get assigned to the second, etc.
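In base R terms, the flip described above is cut()'s right = FALSE argument, which closes bins on the left:

```r
ages <- c(0, 4, 5, 9, 10)
# right = FALSE: bins are [0,5), [5,10), [10,Inf) -- 0 lands in the first
# bin and 5 in the second, matching how POLYMOD and wpp_age() index ages.
bins <- cut(ages, breaks = c(0, 5, 10, Inf), right = FALSE)
as.character(bins)
```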

Clean LGA names

Hi @njtierney , Started a new issue on this one so that I work on everything related to this on another branch.

It seems LGAs like the ones below are present in other datasets in conmat. If I change them in abs_lga_lookup, I should also change them in other datasets such as abs_household_lga, so that existing functions (e.g. get_per_capita_household_size()) don't break.

library(tidyverse)
conmat::abs_household_lga %>%
  filter(str_detect(lga, patterns)) %>%
  distinct(state, lga)

#> # A tibble: 20 x 2
#>    state lga                                   
#>    <chr> <chr>                                 
#>  1 NSW   Campbelltown (C) (NSW)                
#>  2 NSW   Central Coast (C) (NSW)               
#>  3 NSW   Migratory - Offshore - Shipping (NSW) 
#>  4 VIC   Kingston (C) (Vic.)                   
#>  5 VIC   Latrobe (C) (Vic.)                    
#>  6 VIC   Migratory - Offshore - Shipping (Vic.)
#>  7 QLD   Central Highlands (R) (Qld)           
#>  8 QLD   Flinders (S) (Qld)                    
#>  9 QLD   Migratory - Offshore - Shipping (Qld) 
#> 10 SA    Campbelltown (C) (SA)                 
#> 11 SA    Kingston (DC) (SA)                    
#> 12 SA    Migratory - Offshore - Shipping (SA)  
#> 13 WA    Migratory - Offshore - Shipping (WA)  
#> 14 TAS   Central Coast (M) (Tas.)              
#> 15 TAS   Central Highlands (M) (Tas.)          
#> 16 TAS   Flinders (M) (Tas.)                   
#> 17 TAS   Latrobe (M) (Tas.)                    
#> 18 TAS   Migratory - Offshore - Shipping (Tas.)
#> 19 NT    Migratory - Offshore - Shipping (NT)  
#> 20 ACT   Migratory - Offshore - Shipping (ACT)

Created on 2022-05-26 by the reprex package (v2.0.1)

Capture MGCV error

library(conmat)
library(readr)
library(dplyr)

oz_pop <- abs_pop_age_lga_2020 %>%
  group_by(age_group) %>%
  summarise(
    population = sum(population)
  ) %>%
  mutate(
    lower.age.limit = parse_number(as.character(age_group))
  ) %>%
  mutate(
    country = "Australia"
  )
oz_pop

fit_setting_contacts(
  contact_data_list = get_polymod_setting_data(),
  population = oz_pop
)
#> Error in if (sum(uconv.ind) == 0) {: missing value where TRUE/FALSE needed

Thanks to @goldingn for working out when this error typically occurs:

Because the interpolated population function was predicting 0 population for some ages, which then led to a non-finite offset

The goal here is to try to capture that error exactly and replace it with a more informative error message that includes suggestions.

Alternatively, identify this ahead of time somehow, before model fitting, and describe how to fix that issue.
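A pre-fit check along those lines could be as simple as testing the log offset for finiteness (a base R sketch, not conmat code; the function name is made up):

```r
# Sketch: catch zero populations before they become non-finite log offsets.
check_population_offset <- function(pop) {
  bad <- !is.finite(log(pop))
  if (any(bad)) {
    stop(
      "Interpolated population is zero (or negative) at positions: ",
      paste(which(bad), collapse = ", "),
      ". The log offset would be non-finite; restrict the age range or ",
      "fix the population data before fitting."
    )
  }
  invisible(log(pop))
}

# log(0) is -Inf, so the zero population is caught before model fitting
res <- try(check_population_offset(c(1000, 500, 0, 250)), silent = TRUE)
```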

Add Eyre weights as a data object

- [x] called setting_weights - for each of the settings (home, school, work, other)
- [ ] Provide greta code to reproduce these weights in data-raw
- [ ] Provide a function to appropriately map the weights to the transmission probability settings

get_per_capita_household_size() uses outdated LGA definitions

abs_household_lga uses data with an older set of LGAs than abs_pop_age_lga_2020, so the code to adjust household sizes doesn't work for all LGAs.

We'll need to use a correspondence to line them up. I'll see what I can hack together for this one off code I'm writing and paste here.

matrices are transposed!

When outputting matrices, 'age_group_from' should correspond to columns, and "age_group_to" to rows. That way, after conversion to a next generation matrix, a matrix multiply will give the number of infections in each age group at the next step.

Currently the output matrix is being output the other way: https://github.com/njtierney/conmat/blob/master/R/predictions_to_matrix.R#L17-L18 ('from' on the rows)

This also applies to the setting transmission matrices.

This doesn't appear in the plots, which are fine since it's the other way around in matrix_to_predictions() too, so it cancels out.

safer interpolation

in predict_contacts(), the contacts are predicted at integer resolution using the population age interpolator function to define population sizes, and then integrated to get the aggregate matrices. For some population age distributions, this predicts 0 population within the age bounds of the integration, which leads to non-finite prediction of contact rates. The integration range needs to be restricted to prevent this from happening during prediction.

Some lgas listed in `abs_lga_lookup` aren't listed in `abs_household_lga`:

A number of lgas listed in abs_lga_lookup aren't listed in abs_household_lga:

> anti_join(abs_lga_lookup, abs_household_lga, by = "lga")
# A tibble: 13 × 3
   state lga_code lga                              
   <chr>    <dbl> <chr>                            
 1 NSW      10500 Bayside (A)                      
 2 NSW      12160 Cootamundra-Gundagai Regional (A)
 3 NSW      12390 Dubbo Regional (A)               
 4 NSW      15700 Nambucca Valley (A)              
 5 SA       40150 Adelaide Plains (DC)             
 6 SA       45400 Orroroo-Carrieton (DC)           
 7 WA       54200 Kalamunda (C)                    
 8 WA       54280 Kalgoorlie-Boulder (C)           
 9 TAS      60210 Break O`Day (M)                  
10 TAS      62410 Glamorgan-Spring Bay (M)         
11 TAS      65410 Waratah-Wynyard (M)              
12 NA       99399 Unincorp. Other Territories      
13 NA          NA NA
