Giter Club home page Giter Club logo

coiaf-real-data's Introduction

Real Data Analysis With coiaf

DOI Requirement License: MIT

This repository stores the real data analysis conducted to test the software package {coiaf}.

Data Source

We analysed samples from the MalariaGEN Plasmodium falciparum Community Project [1]. The MalariaGEN Plasmodium falciparum Community Project provides genomic data from over 7,000 P. falciparum samples from 28 malaria-endemic countries in Africa, Asia, South America, and Oceania from 2002-2015. Detailed information about the data release including brief descriptions of contributing partner studies and study locations is available in the supplementary of MalariaGEN et al..

Project Structure

.
├── analysis
│   ├── estimation-comparison.Rmd
│   └── pf6_analysis.Rmd
├── data-outputs
│   ├── core-genome.rds
│   ├── data_dims.rds
│   ├── rmcl_estimation.rds
│   └── seq-error
│       ├── seq_0.01.rds
│       ├── seq_0.05.rds
│       ├── seq_0.1.rds
│       ├── seq_0.15.rds
│       └── seq_0.2.rds
├── download
│   ├── 00_Pf6_vcf_filtering.Rmd
│   ├── 01_create_coiaf_inputs.Rmd
│   ├── 02_run_rmcl.Rmd
│   └── core-genome.tsv
├── figures
│   ├── cluster-locations.png
│   ├── coi-world.png
│   ├── comparison.png
│   ├── continuous-region.png
│   ├── deploid.png
│   ├── discrete-region.png
│   ├── fws.png
│   ├── grouped-prevalence.png
│   ├── log-prevalence.png
│   ├── silhoutte.png
│   └── varying-seq-error.png
├── metadata
│   ├── deploidibd_data.rdata
│   ├── pf6_meta.Rmd
│   └── pf6_meta.rds
├── raw-regions
└── scripts
    ├── coi-region.R
    ├── combine-raw-regions.R
    ├── data-dimensions.R
    ├── rmcl-estimation.R
    └── slurm-region.R

References

1. MalariaGEN, Ahouidi A, Ali M, Almagro-Garcia J, Amambua-Ngwa A, Amaratunga C, et al. An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples. Wellcome Open Research. 2021;6: 42. doi:10.12688/wellcomeopenres.16168.1

coiaf-real-data's People

Contributors

arisp99 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

skiyaga

coiaf-real-data's Issues

Prevalence analyses comments

Hey Aris,

This seemed simpler to me to just make some comments through an issue here that relate to overleaf comments.

  1. Malaria prevalence data

This line I think will fail to grab the prevalence data for non African settings as the rasters have changed. This explains why you have so many NAs in meta$prev_2_10

PfPR2_10 <- malariaAtlas::getRaster(year = sort(unique(meta$prev_2_10)))

Just to be sure use:

PfPR2_10 <- malariaAtlas::getRaster(surface = "Plasmodium falciparum PR2 - 10 version 2020", year = sort(unique(meta$prev_2_10)))
  1. Prevalence analysis:

In the below we can definitely see that COI is increasing as prevalence increases. Probably worth adding the loglinear relationship here (i.e. geom_smooth(method = "lm")) +

ggplot(data = prev_fws, aes(x = prev, y = COI, color = Method)) +
geom_point(alpha = 0.3, position = position_jitter(height = 0.2)) +
scale_x_log10(
breaks = c(0.01, 0.1, 1),
labels = c(0.01, 0.1, 1),
limits = c(0.01, 1)
) +
facet_grid(~Method) +
labs(x = "Log10 Prevalence", y = "Estimated COI") +
scale_color_discrete(
name = "Estimation Method",
labels = c("Discrete Variant Method", "THE REAL McCOIL")
) +
theme_coiaf()

Also, to test for this relationship, put it in a linear model and have a look at the coefficients and the p_value:

lm(COI ~ (prev) + Method, data = prev_fws) %>% summary()

Call:
lm(formula = COI ~ (prev) + Method, data = prev_fws)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4135 -0.4991 -0.2853  0.4548  7.4829 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.26726    0.01336   94.86   <2e-16 ***
prev                   1.19990    0.06905   17.38   <2e-16 ***
MethodTHE REAL McCOIL  0.22653    0.01703   13.31   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8644 on 10307 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.04435,	Adjusted R-squared:  0.04417 
F-statistic: 239.2 on 2 and 10307 DF,  p-value: < 2.2e-16

Clearly prevalence increases lead to COI increases, with ~1.2 increase in COI as prevalence goes from 0 to 1, and also noting that McCOIL has a significantly quicker increase of 0.22653 in COI with respect to prevalence compared to discrete variant method (which is the default in the linear model).

However, the R2 is incredibly low, 0.044, which shows that the majority of the variation in COI is not well predicted by prevalence. This is expected as all of this samples are not equal (timing in transmission season, age of sample, clinical status are all unknown and may vary).

What I would do though is also have a look at the R2 for each method on its own. For example, I was having a look at the relationships on the grouped data (was working off the old codebase) and you can see that the continuous method 1 has much less variation around the mean linear relationship than the other discrete approaches. This is probably both due to the fact it's continuous, so the rounding that occurs in discrete approaches is less able to contribute to the variance observed, but also shows that it is probably a more reliable indicator of prevalence than the discrete methods that lose sensitivity with relation to prevalence (please ignore discrete_2_med - this is with your old values).

image

But it would be good to show these relationships, and also discuss the R2 for each method on its own, to show that while there is a relationship to be observed, it is highly variable and not that large from this data. The continuous approach though is more informative of prevalence, most likely because it is more capable at not losing signal as a result of relatedness.


Bring this in for the last bit of the analysis in the Results section. Once in this will then help you write that missing Discussion paragraph about comparing to the wider literature.

Data download inconsistencies

Hey @OJWatson, I was working through the data download files and had some questions regarding regional VCF creations (i.e. VCFs that have a MAF above a threshold regionally but not globally).

My understanding of the following lines was that we were splitting each of the 14 VCFs into a new 24 VCFs (one per region for a total of 14*24 VCFs). From the code, it seems that we are skipping region 1 here and, as a matter of fact, when I run this code, it creates a VCF for every region but region 1. Why do we skip region 1 here?

out <- mclapply(X = unlist(region_pos[2:24]), FUN = function(i){
loci_sample_subset(
loci = obj_list[[i]]$loci,
vcf = obj_list[[i]]$vcf,
chrom = obj_list[[i]]$chrom,
vcfout = obj_list[[i]]$vcfout,
samples = obj_list[[i]]$samples
)
}, mc.cores = 14
)

In the next code chunk, we join all the VCFs from each region into one (taking the 14*24 VCFs and making 24 VCFs). However, in the for loop, we only look at regions 4:24. Why do we skip regions 1, 2, 3, and 4?

for(j in (4:24)){
lj <- grep(paste0("\\d\\d_v3.*reg_",j,"\\.vcf"), l, value = TRUE)
O <- gsub("_01_v3","",lj[1])
# concatenate our vcfs with bcftools
system(paste("bcftools concat -O z -o", gsub("//", "/", O), paste(lj, collapse = " ")))
# read that in and save it out
vcf_rds <- file.path(here::here(),paste0("analysis/data/derived/snp_selections/vcf_final_reg_",j,".rds"))
newvcfR <- vcfR::read.vcfR(gsub("//", "/", O))
saveRDS(newvcfR, vcf_rds)
}

Subsetting data in comparing results

Comparison of coiaf and The REAL McCOIL

In comparing the results of {coiaf} and The REAL McCOIL, we remove some patients in the following lines:

```{r exclude patients}
exclude <- setdiff(unique(complete_predictions$name), unique(rmcl_coi_out$name))
compare_data <- complete_predictions %>%
dplyr::filter(!name %in% exclude) %>%
dplyr::filter(rmcl_med != 25) %>%
dplyr::relocate(Region, .after = dplyr::last_col())
```

This is done as in some cases we were missing predictions from The REAL McCOIL. We also remove two samples in which the The REAL McCOIL prediction was clearly wrong (the predicted value was 25). This data filtering step is justified for comparing the two software packages.

COI across the world

When we then use our results to plot the COI across the world, we continue to use our subsetted data.

```{r load meta}
meta <- readRDS(here::here("metadata", "pf6_meta.rds"))
patient_lat_long <- dplyr::left_join(
compare_data, meta,
by = c("name" = "Sample")
)
```

The variable patient_lat_long is what we base the rest of our results on. Given that this data is a subset of our results and we are no longer interested in the comparison between the two populations, it seems that it would be better to use our estimations for the complete patient population we had access to.

Prevalence and FwS

For examing the relationship of COI to the prevalence of malaria and FwS, we can continue to use the subsetted data set as we compare {coiaf} to The REAL McCOIL. However, when we plot the ridge plot with the prevalence in each of our 24 regions, we should use the entire data set. Especially because our subsetted data does not cover region 11.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.