oswgmcr-maps-collaboration's Issues

Data treatment/tidying

It looks like we need to do a fair bit of data tidying at the start. For example, for the variables comp_int_bed_16 and comp_noint_bed_16 the values in the dataset are "Yes" and NA. Presumably the NA corresponds to "No" rather than reflecting real missing data:

> unique(maps_synthetic_data$comp_int_bed_16)
[1] NA "Yes"

> unique(maps_synthetic_data$comp_noint_bed_16)
[1] NA "Yes"

So I'm thinking we replace the NAs in these two variables with "No" and then turn them into 2-level factors:

maps_synthetic_data$comp_int_bed_16[is.na(maps_synthetic_data$comp_int_bed_16)] <- "No"
maps_synthetic_data$comp_noint_bed_16[is.na(maps_synthetic_data$comp_noint_bed_16)] <- "No"
maps_synthetic_data$comp_int_bed_16 <- factor(maps_synthetic_data$comp_int_bed_16)
maps_synthetic_data$comp_noint_bed_16 <- factor(maps_synthetic_data$comp_noint_bed_16)
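
Or, equivalently, with the tidyverse (a sketch, assuming we're loading dplyr/tidyr anyway and the data sit in a data frame or tibble):

library(dplyr)
library(tidyr)

# Recode NA -> "No", then convert both variables to 2-level factors
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(
    comp_int_bed_16   = factor(replace_na(comp_int_bed_16, "No")),
    comp_noint_bed_16 = factor(replace_na(comp_noint_bed_16, "No"))
  )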

For the anxiety measure at age 15 (anx_band_15), it looks like the NAs correspond to 0 rather than missing data:

> unique(maps_synthetic_data$anx_band_15)
[1] "~0.5%" NA "~3%" "~15%" "~50%" "<0.1%"

So we might want to replace the NAs there with zeros.

# anx_band_15 is a character column, so recode the NAs as "0", not numeric 0
maps_synthetic_data$anx_band_15[is.na(maps_synthetic_data$anx_band_15)] <- "0"

But what about the other values? Should we make this an ordered factor or treat it as numeric? If treating it as numeric, we could recode as:

# Use as.numeric rather than as.integer: as.integer("0.5") truncates to 0.
# The "0"s we created from the NAs pass through recode() unchanged.
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(anx_band_15 = as.numeric(recode(anx_band_15,
    "<0.1%" = "0",
    "~0.5%" = "0.5",
    "~3%"   = "3",
    "~15%"  = "15",
    "~50%"  = "50"
  )))

Although that forces the <0.1% values to be 0. Would an ordered factor be better, do you think? @wjchulme, @jspickering, @OliJimbo
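
For the ordered-factor option, something like this (assuming the NAs have already been recoded to "0" as above):

# An ordered factor keeps the banding without assuming the bands are equally spaced
maps_synthetic_data$anx_band_15 <- factor(
  maps_synthetic_data$anx_band_15,
  levels = c("0", "<0.1%", "~0.5%", "~3%", "~15%", "~50%"),
  ordered = TRUE
)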

Exposure

This is "computer use during weekday and weekends at 16 years old".

There are two self-reported variables, comp_week and comp_wend, defined as the "average time child spent per day using a computer on a typical week/weekend day".
Presumably we should weight for the fact that there are 5 weekdays and 2 weekend days, so the average daily use would be comp_all = (5*comp_week + 2*comp_wend)/7. Or should we consider these variables separately? Something more sophisticated?
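
As an R sketch of the weighted version (comp_all is a name we're introducing, not one from the dataset):

# Weighted average daily computer use across the whole week
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(comp_all = (5 * comp_week + 2 * comp_wend) / 7)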

Statistical modelling approach

Firstly, Bayesian: yes or no? It might be a good learning opportunity for people with limited experience of Bayesian inference, including me. But I don't have a strong feeling about this either way and am happy to go with the majority view.

As for the actual model, an odds ratio is required for the final output, so the outcome variable is necessarily binary: depression at 18, yes/no. Logistic regression is the obvious candidate, but it's possible to recover an odds ratio from any model that can (be coerced to) provide the probability of the outcome with/without the exposure, including models with non-binary outcomes.

So it will depend on how depression is represented in the dataset, which we won't know until we get access. There's a variable for a clinical diagnosis of depression at age 18, has_dep_diag, but also some other variables relating to depressive symptoms at 18.
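
As a minimal sketch of the obvious route (the covariates are placeholders, and has_dep_diag may need recoding to 0/1 or a factor depending on how it arrives):

# Logistic regression of depression diagnosis at 18 on the exposure
mod <- glm(has_dep_diag ~ comp_all + dep_band_15,
           data = maps_synthetic_data, family = binomial)

exp(coef(mod)["comp_all"])     # odds ratio for the exposure
exp(confint(mod, "comp_all"))  # 95% CI on the odds-ratio scale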

I can see where the path of least resistance will take us, but if anyone has a desire to do the non-obvious thing then let's hear it! Either way, it would be helpful to get a consensus on these issues as early as possible.
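
On the Bayesian question, a brms version of the same placeholder model might look like this (default priors, purely illustrative):

library(brms)

bmod <- brm(has_dep_diag ~ comp_all + dep_band_15,
            data = maps_synthetic_data, family = bernoulli())

exp(fixef(bmod)["comp_all", c("Estimate", "Q2.5", "Q97.5")])  # posterior odds-ratio summary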

Software: R or Python?

Presumably more of us are using R than Python. Other languages aren't allowed. I'm going to assume R is a given and close this issue.

Meeting

Hi all,

Can I suggest an in-person meeting to discuss everything and come up with an action plan? We can then create issues on GitHub depending on what needs to be done.

Jade

Measurement error

Many of the variables are self-reported and measured with error. No way does any kid know the exact “average time spent per day talking on a mobile phone on a typical weekend day”, for example. This will increase uncertainty, and possibly bias, in effect estimates. Should we try to account for this?

This paper is a useful starting point if we want to begin thinking about these issues.

Confounders

This is where it gets spicy. The wonderful thing about synthetic data is that you can’t p-hack - you have to specify your study design based on the data schema, and not the data values, since you can’t rely on the results.

An obvious important confounder is the pre-existence of depression at or before age 16. We have variables dep_band_10, dep_band_13, and dep_band_15, which are caregiver-reported (e.g. by mum or dad) measures of the child's depression at ages 10, 13, and 15. We should consider at least dep_band_15.

After that, it's a free-for-all. We can't use everything, but we want to be sure we're adjusting for the relevant confounders. We can't (and shouldn't) select them via univariate associations, stepwise variable selection, etc. Causal inference would be helpful here; see https://doi.org/10.1007/s10654-019-00494-6

A first step might be for somebody to draw a DAG representing beliefs about causal relationships. We can reason about confounders from that starting point.
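
As a starting point, a minimal sketch with the dagitty package (the node names and edges here are illustrative assumptions, not settled beliefs; ses stands in for something like socioeconomic status):

library(dagitty)

dag <- dagitty("dag {
  comp_use -> dep_18
  dep_15 -> comp_use
  dep_15 -> dep_18
  ses -> comp_use
  ses -> dep_18
}")

# Minimal adjustment sets for the effect of computer use on depression
adjustmentSets(dag, exposure = "comp_use", outcome = "dep_18")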

Missing values

How we choose to deal with missing values depends on how they are distributed throughout the dataset. Complete-case analysis (removing all observations where some information is missing) is probably justified if we're removing no more than about 10% of observations. Otherwise, multiple imputation is usually an approximately valid approach, depending on what we're willing to assume about why the data are missing.
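
If we end up imputing, a minimal sketch with the mice package (defaults throughout; the model inside with() reuses the placeholder from the modelling issue):

library(mice)

imp  <- mice(maps_synthetic_data, m = 5, seed = 123)
fits <- with(imp, glm(has_dep_diag ~ comp_all + dep_band_15,
                      family = binomial))
pool(fits)  # combine the m analyses with Rubin's rules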

The use of synthetic data forces us to consider this up-front. The first thing to investigate is the missing-value patterns once we get the dataset. Who wants to volunteer? Some inspiration.
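
For the pattern investigation, the naniar package is one option (a sketch, not a prescription):

library(naniar)

miss_var_summary(maps_synthetic_data)  # proportion missing per variable
vis_miss(maps_synthetic_data)          # heatmap of missing values
gg_miss_upset(maps_synthetic_data)     # which variables are missing together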
