oswgmcr-maps-collaboration's Issues

Data treatment/tidying

It looks like we need to do a fair bit of data tidying at the start. For example, for the variables comp_int_bed_16 and comp_noint_bed_16 the values in the dataset are "Yes" and NA. Presumably the NA corresponds to "No" rather than reflecting real missing data:

> unique(maps_synthetic_data$comp_int_bed_16)
[1] NA "Yes"

> unique(maps_synthetic_data$comp_noint_bed_16)
[1] NA "Yes"

So I'm thinking we replace the NAs in these two variables with "No" and then turn them into 2-level factors:

maps_synthetic_data$comp_int_bed_16[is.na(maps_synthetic_data$comp_int_bed_16)] <- "No"
maps_synthetic_data$comp_noint_bed_16[is.na(maps_synthetic_data$comp_noint_bed_16)] <- "No"
maps_synthetic_data$comp_int_bed_16 <- factor(maps_synthetic_data$comp_int_bed_16)
maps_synthetic_data$comp_noint_bed_16 <- factor(maps_synthetic_data$comp_noint_bed_16)
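
Or, equivalently, with the tidyverse (a sketch, assuming we're loading dplyr/tidyr anyway and the data sit in a data frame or tibble):

library(dplyr)
library(tidyr)

# Recode NA -> "No", then convert both variables to 2-level factors
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(
    comp_int_bed_16   = factor(replace_na(comp_int_bed_16, "No")),
    comp_noint_bed_16 = factor(replace_na(comp_noint_bed_16, "No"))
  )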

For the anxiety measure at age 15 (anx_band_15), it looks like the NAs correspond to 0 rather than missing data:

> unique(maps_synthetic_data$anx_band_15)
[1] "~0.5%" NA "~3%" "~15%" "~50%" "<0.1%"

So we might want to replace the NAs there with zeros.

# anx_band_15 is a character column, so recode the NAs as "0", not numeric 0
maps_synthetic_data$anx_band_15[is.na(maps_synthetic_data$anx_band_15)] <- "0"

But what about the other values? Should we make this an ordered factor or treat it as numeric? If treating it as numeric, we could recode as:

# Use as.numeric rather than as.integer: as.integer("0.5") truncates to 0.
# The "0"s we created from the NAs pass through recode() unchanged.
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(anx_band_15 = as.numeric(recode(anx_band_15,
    "<0.1%" = "0",
    "~0.5%" = "0.5",
    "~3%"   = "3",
    "~15%"  = "15",
    "~50%"  = "50"
  )))

Although that forces the <0.1% values to be 0. Would an ordered factor be better, do you think? @wjchulme, @jspickering, @OliJimbo
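
For the ordered-factor option, something like this (assuming the NAs have already been recoded to "0" as above):

# An ordered factor keeps the banding without assuming the bands are equally spaced
maps_synthetic_data$anx_band_15 <- factor(
  maps_synthetic_data$anx_band_15,
  levels = c("0", "<0.1%", "~0.5%", "~3%", "~15%", "~50%"),
  ordered = TRUE
)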

Exposure

This is "computer use during weekday and weekends at 16 years old".

There are two self-reported variables, comp_week and comp_wend, defined as the "average time child spent per day using a computer on a typical week/weekend day".
Presumably we should weight for the fact that there are 5 weekdays and 2 weekend days, so the average daily use would be comp_all = (5*comp_week + 2*comp_wend)/7. Or should we consider these variables separately? Something more sophisticated?
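
As an R sketch of the weighted version (comp_all is a name we're introducing, not one from the dataset):

# Weighted average daily computer use across the whole week
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(comp_all = (5 * comp_week + 2 * comp_wend) / 7)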

Statistical modelling approach

Firstly, Bayesian: yes or no? It might be a good learning opportunity for people with limited experience of Bayesian inference, including me. But I don't have a strong feeling about this either way and am happy to go with the majority view.

As for the actual model, an odds ratio is required for the final output, so the outcome variable is necessarily binary: depression at 18, yes/no. Logistic regression is the obvious candidate, but it's possible to recover an odds ratio from any model that can (be coerced to) provide the probability of the outcome with/without the exposure, including models with non-binary outcomes.

So it will depend on how depression is represented in the dataset, which we won't know until we get access. There's a variable for a clinical diagnosis of depression at age 18, has_dep_diag, but also some other variables relating to depressive symptoms at 18.
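
As a minimal sketch of the obvious route (the covariates are placeholders, and has_dep_diag may need recoding to 0/1 or a factor depending on how it arrives):

# Logistic regression of depression diagnosis at 18 on the exposure
mod <- glm(has_dep_diag ~ comp_all + dep_band_15,
           data = maps_synthetic_data, family = binomial)

exp(coef(mod)["comp_all"])     # odds ratio for the exposure
exp(confint(mod, "comp_all"))  # 95% CI on the odds-ratio scale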

I can see where the path of least resistance will take us, but if anyone has a desire to do the non-obvious thing then let's hear it! Either way, it would be helpful to get a consensus on these issues as early as possible.
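
On the Bayesian question, a brms version of the same placeholder model might look like this (default priors, purely illustrative):

library(brms)

bmod <- brm(has_dep_diag ~ comp_all + dep_band_15,
            data = maps_synthetic_data, family = bernoulli())

exp(fixef(bmod)["comp_all", c("Estimate", "Q2.5", "Q97.5")])  # posterior odds-ratio summary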

Software: R or Python?

Presumably more of us are using R than Python. Other languages aren't allowed. I'm going to assume R is a given and close this issue.

Meeting

Hi all,

Can I suggest an in-person meeting to discuss everything and come up with an action plan? We can then create issues on GitHub depending on what needs to be done.

Jade

Measurement error

Many of the variables are self-reported and measured with error. No way does any kid know the exact “average time spent per day talking on a mobile phone on a typical weekend day”, for example. This will increase uncertainty, and possibly bias, in effect estimates. Should we try to account for this?

This paper is a useful starting point if we want to begin thinking about these issues.

Confounders

This is where it gets spicy. The wonderful thing about synthetic data is that you can’t p-hack - you have to specify your study design based on the data schema, and not the data values, since you can’t rely on the results.

An obvious important confounder is the pre-existence of depression at or before age 16. We have variables dep_band_10, dep_band_13, and dep_band_15, which are caregiver-reported (e.g. by mum or dad) measures of the child's depression at ages 10, 13, and 15. We should consider at least dep_band_15.

After that, it's a free-for-all. We can't use everything, but we want to be sure we're adjusting for the relevant confounders. We can't (and shouldn't) select them via univariate associations, stepwise variable selection, etc. Causal inference would be helpful here; see https://doi.org/10.1007/s10654-019-00494-6

A first step might be for somebody to draw a DAG representing beliefs about causal relationships. We can reason about confounders from that starting point.
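
As a starting point, a minimal sketch with the dagitty package (the node names and edges here are illustrative assumptions, not settled beliefs; ses stands in for something like socioeconomic status):

library(dagitty)

dag <- dagitty("dag {
  comp_use -> dep_18
  dep_15 -> comp_use
  dep_15 -> dep_18
  ses -> comp_use
  ses -> dep_18
}")

# Minimal adjustment sets for the effect of computer use on depression
adjustmentSets(dag, exposure = "comp_use", outcome = "dep_18")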

Missing values

How we choose to deal with missing values depends on how they are distributed throughout the dataset. Complete-case analysis (removing all observations where some information is missing) is probably justified if we're removing no more than about 10% of observations. Otherwise, multiple imputation is usually an approximately valid approach, depending on what we're willing to assume about why the data are missing.
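
If we end up imputing, a minimal sketch with the mice package (defaults throughout; the model inside with() reuses the placeholder from the modelling issue):

library(mice)

imp  <- mice(maps_synthetic_data, m = 5, seed = 123)
fits <- with(imp, glm(has_dep_diag ~ comp_all + dep_band_15,
                      family = binomial))
pool(fits)  # combine the m analyses with Rubin's rules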

The use of synthetic data forces us to consider this up-front. The first thing to investigate is the missing-value patterns once we get the dataset. Who wants to volunteer? Some inspiration.
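
For the pattern investigation, the naniar package is one option (a sketch, not a prescription):

library(naniar)

miss_var_summary(maps_synthetic_data)  # proportion missing per variable
vis_miss(maps_synthetic_data)          # heatmap of missing values
gg_miss_upset(maps_synthetic_data)     # which variables are missing together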
