Giter Club home page Giter Club logo

coiaf's Introduction

coiaf

R-CMD-check Lifecycle: stable License: MIT

Introduction

In malaria, individuals are often infected with different parasite strains, with the complexity of infection (COI) giving the number of genetically different parasite strains in an individual. Changes in the mean COI in a population have been shown to be informative of changes in transmission intensity with a number of probabilistic likelihood and Bayesian models now developed to estimate COI. However, rapid, direct measures based on heterozygosity or FwS have not been directly related to the COI. This package features two rapid, direct measures for characterizing polyclonal infections.

Installation

# Install most recent released version
devtools::install_github("bailey-lab/[email protected]")
# Install development version
devtools::install_github("bailey-lab/coiaf")

Usage

In order to run real data, please refer to the Articles drop-down menu. Several articles detail how the algorithm works, how data is simulated to test the algorithm, and, importantly, how to run real data. A short example of running real data is included and outlines the necessary data structure and the commands to run.

coiaf's People

Contributors

arisp99 avatar ojwatson avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

skiyaga mfdiop

coiaf's Issues

Regression returns list instead of value

Bug Description

The continuous estimation of the COI for the Frequency Method returns a list when we suspect the COI is 1. When we use bins, the result is a value with attributes.

Expected Behavior

The result should be a single value with attributes.

Reprex

library(coiaf)
set.seed(202)
data <- sim_biallelic(1, runif(1000, 0, 0.5), coverage = 200)

optimize_coi(data, "sim", coi_method = "frequency", use_bins = FALSE)
#> $coi
#> [1] 1
#> 
#> $probability
#>  [1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 
#> $notes
#> [1] "Too few variant loci suggesting that the COI is 1 based on the Variant Method."

Created on 2022-02-24 by the reprex package (v2.0.1)

Vignette Links are broken

Maintenance Request

Vignette Links are broken.

Code Example

Ex: followed hyperlink to vignettes on website and GitHub. Both result in the following:
"The 'bailey-lab/coiaf' repository doesn't contain the 'analysis/vignettes/example_real_data.Rmd' path in 'main'. "

I thrive on pictures and guides, a vignette would be great! Thanks guys:)

Rename WSAF and PLAF

Maintenance Request

Rename WSAF to WSAMF and PLAF to PLMAF, to make it more clear that we are looking at the minor allele.

Code Example

coiaf/R/process.R

Lines 25 to 29 in 856887f

process <- function(wsaf,
plaf,
seq_error = NULL,
bin_size = 20,
coi_method = "variant") {

Release coiaf 0.1.0

First release:

Prepare for release:

  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • Review pkgdown reference index for, e.g., missing topics

CRAN:

usethis::use_cran_comments()
rhub::check_for_cran()
usethis::use_version('major')
devtools::submit_cran()
Approve email

Ready to Release:

  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Bump version (in DESCRIPTION and NEWS)
  • Update install instructions in README

Replace superseded functions

In some locations, we use superseded functions to accomplish tasks. These function calls should be replaced with their newer equivalents. A couple of examples where we use superseded functions are below:

dplyr::select_if(function(x) dplyr::n_distinct(x) > 1)

dplyr::select_if(function(x) dplyr::n_distinct(x) > 1)

Likelihood based distance

Currently, we work out residuals between the theoretical COIs and the observed data.

This was done because initially we were using buckets to work out averages and then comparing those.

With the regression approach though we use the individual data. We could now construct a formal likelihood here, by saying that our observed data (whether method 1 or 2) is described by a Binomial distribution, with number of trials being the coverage and the number of successes being as.integer(wsmaf*coverage) (as.integer here just for rounding errors, wsmaf cam from doing reads/coverage initially) for Method 2.

E.g demo for Method 2:

data2 <- sim_biallelic(6, runif(10000,0,0.5), epsilon = 0.02, coverage = sample(300, 10000, replace = TRUE))

data <- data2$data
data <- data[data$wsmaf > 0.03 & data$wsmaf < 0.97,]
counts <- data$counts
coverage <- data$coverage
plmaf <- data$plmaf

thcois <- theoretical_coi(2:10, plmaf, coi_method = "frequency")

lls <- dbinom(x = counts, size = coverage, prob = as.matrix(thcois[,-ncol(thcois)]), log = TRUE)
lls <- colSums(lls)
which.max(lls)
coi_6 
    5 

Just did a quick check to see how different these approaches are:

     
fgh <- function(){
  
data2 <- sim_biallelic(6, runif(1000,0,0.5), epsilon = 0.02, coverage = sample(50, 1000, replace = TRUE))

data <- data2$data
data <- data[data$wsmaf > 0.03 & data$wsmaf < 0.97,]
counts <- data$counts
coverage <- data$coverage
plmaf <- data$plmaf

thcois <- theoretical_coi(2:25, plmaf, coi_method = "frequency")

lls <- dbinom(x = counts, size = coverage, prob = as.matrix(thcois[,-ncol(thcois)]), log = TRUE)
lls <- colSums(lls)
return(c(as.numeric(which.max(lls)+1),compute_coi(data2, "sim", coi_method = "frequency", use_bins = FALSE, seq_error = 0.03)$coi))

}

comp <- replicate(100, fgh(), simplify = TRUE)
rowMeans(comp)
6.42 6.22   

So the Binomial likelihood approach is slightly higher for the same samples. Not sue will make a big difference but just flagging here to remind us to ask Bob about this when we go through the paper on his thoughts on this re simple residual

Function exiting

Maintenance Request

Function exciting should be uniform across the package. We should exit implicity (without calling return()) and return values visibily so when users call the function, the output is printed.

Code Examples

Here we return explicitly

# Include the PLAF in the final output and return
curves$plaf <- plaf
return(curves)

Here we return an invisibile value

coiaf/R/compute_coi.R

Lines 249 to 250 in 9f3b786

# List to return
ret <- list(coi = as.numeric(coi), probability = dist)

We should be doing the opposite.

Frequency Method COI = 1 threshold

The current check_freq_method is I think too conservative. E.g. the following sample that based on WSMAF vs PLMAF is almost certainly a COI of 2, but it would fail the check_freq_method as it has too few loci (it has 6550 but the 95% is 6560) - this is an extreme example but there are others i came across where it was say 3000 loci but was clearly COI of 2 but relatedness meant we had fewer loci than expected.

On reflection re code design, rather than have coiaf return COI of 1, get it instead to execute as normal but then have the note in place. Then the end user can decide if the COI returned by coiaf should be taken on face value or should in fact be 1. As it currently works though we are making too many COI = 2 samples that have some relatedness (which the frequency method is less affected by) return as COI = 1 when if left to coiaf they would return as COI = 2.

We could try loosening the threshold for setting the COI = 1 by using a higher confidence interval instead of the 95% confidence interval. It may, as you mentioned, be better to just let the algorithm run all the way through and add a note if we did not have enough variant sites. If we use this approach, then we will also see a lot more samples for which we estimate the maximum COI. I do worry that users will not notice the note and not take the uncertainty surrounding the estimation into consideration. I think that the concern whether the users will see the note holds regardless of which strategy we employ.

With that in mind, I am starting to think that the best course of action is to return a special value which makes it clear that there is uncertainty regarding the calculation (perhaps just NA_real_ or NaN). I think we can then add attributes to this result. We can then run our estimation on the data and add a note saying the COI could be 1 or it could be the estimated value. This gives more advanced users the opportunity to choose how to handle these samples while preventing more basic users from making unintentional assumptions about the results.

Maybe one option would be to have anything that Method 1 discrete returns as COI = 1 have then Method 2 return as COI of 1 with a note, rather than the check_freq_method?

I think it is better to leave the two methods separate from one another and not have the Frequency Method call the Variant Method in its estimation.

Originally posted by @arisp99 in #17 (comment)

Error when input data is a dataframe or tibble

Bug Description

compute_coi and optimize_coi fail to run when the data is either a data frame or a tibble. They require data to be a matrix. This error is caused when our data is processed to be fed into the algorithims.

Data Frame

library(coiaf)

data_df <- data.frame(
  check.names = FALSE,
  row.names = c("FP0008-C","FP0009-C","FP0015-C","FP0016-C","FP0017-C"),
  `Pf3D7_01_v3-93378` = c(0, 0, 0, 0, 0),
  `Pf3D7_01_v3-93901` = c(0.882353, 1, 1, NA, 1),
  `Pf3D7_01_v3-94422` = c(0, 0, 0.22973, 0, 0),
  `Pf3D7_01_v3-94442` = c(0, 0, 0.30137, NA, 0),
  `Pf3D7_01_v3-94445` = c(0, 0, 0.3375, NA, 0)
)
str(data_df)
#> 'data.frame':    5 obs. of  5 variables:
#>  $ Pf3D7_01_v3-93378: num  0 0 0 0 0
#>  $ Pf3D7_01_v3-93901: num  0.882 1 1 NA 1
#>  $ Pf3D7_01_v3-94422: num  0 0 0.23 0 0
#>  $ Pf3D7_01_v3-94442: num  0 0 0.301 NA 0
#>  $ Pf3D7_01_v3-94445: num  0 0 0.338 NA 0

wsaf_df  <- data_df[1, ]
plaf_df <- colMeans(data_df, na.rm = T)
input_df <- tibble::tibble(wsaf = wsaf_df, plaf = plaf_df) %>% tidyr::drop_na()
compute_coi(input_df, "real", coi_method = "1")
#> Error: wsaf must be a non-recursive vector

Tibble

data_tb <- tibble::tibble(data_df)
str(data_tb)
#> tibble [5 ร— 5] (S3: tbl_df/tbl/data.frame)
#>  $ Pf3D7_01_v3-93378: num [1:5] 0 0 0 0 0
#>  $ Pf3D7_01_v3-93901: num [1:5] 0.882 1 1 NA 1
#>  $ Pf3D7_01_v3-94422: num [1:5] 0 0 0.23 0 0
#>  $ Pf3D7_01_v3-94442: num [1:5] 0 0 0.301 NA 0
#>  $ Pf3D7_01_v3-94445: num [1:5] 0 0 0.338 NA 0

wsaf_tb  <- data_tb[1, ]
plaf_tb <- colMeans(data_tb, na.rm = T)
input_tb <- tibble::tibble(wsaf = wsaf_tb, plaf = plaf_tb) %>% tidyr::drop_na()
optimize_coi(input_tb, "real", coi_method = "1")
#> Error: wsaf must be a non-recursive vector

Created on 2021-09-15 by the reprex package (v2.0.1)

Relatedness decay in simulator

Relatedness is defined really only for the comparison between two individuals. So in the case when COI = 2, we have relatedness between the two strains equal to r (relatedness).

The next way to define relatedness I suppose is the mean relatedness when we do all the pairwise comparisons as you have detailed for 4 strains. Let's look at the case with 3 strains (A, B, C) and the steps we take, and have r be 0.5 and lets say there are 12 loci.

  1. Strain A is created
  2. Strain B is created
  3. Strain B is altered to share 6 loci with A
  4. Strain C is created
  5. Strain C is altered. It will end up have 3 loci shared with A and 3 loci shared with B (because we choose which of A and B it is related to each loci when altering).

Strain C will actually have 4.5 loci shared with A and B, because half of the loci it shares with B will be also shared with A due to step 3.

Therefore, the mean shared number of loci we would expect would be ((6 + 4.5 + 4.5)/3)/12 = 5/12

We can generalise this mathematically (something like this - haven't figured it out fully, it will collapse down I am sure), for k = COI, and r, mean relatedness:

1/k * sum(r + 2*(r/2 + r/2r) + 2(r/3 + r/3r + r/3((r/2 + r/2*r))) + ... )

Crucially though this decays away from r with increasing COI. So maybe we should change this so that each strain is equally related to each other to begin with. We may need to think about this a bit more though as coding that is quite hard. If we make it so that strain C above shared 6 loci with A and B we then end up increasing relatedness with increasing COI due to similar reasons.

dummy_rel <- function(COI = 3, nl = 12, r = 0.5) {
  
  s <- matrix(data = seq_len(COI*nl), nrow = COI, ncol = nl, byrow = TRUE)
  
  if (r > 0 && COI > 1) {
    
    # If there is relatedness we iteratively step through each lineage
    for (i in seq_len(COI - 1)) {
      
      # For each loci we assume that it is related with probability relatedness
      rel_i <- as.logical(rbinom(nl, size = 1, prob = r))
      
      # And for those sites that related, we draw the other lineages
      if (any(rel_i)) {
        
        if (i == 1) {
          s[i+1, rel_i] <- s[seq_len(i), rel_i]
        } else {
          s[i+1, rel_i] <- apply(s[seq_len(i), rel_i, drop = FALSE], 2, sample, size = 1)
        }
      }
      
    }
    
  }
  
  return(s)
}

relatedness_2 <- function(s) {
  
  sum(s[1,] == s[2,])/ncol(s)
  
}

relatedness <- function(s) {
  
  Triangular <- function(n) choose(seq(n),2)
  
  rs <- tail(Triangular(nrow(s)),1)
  rs <- rep(0, rs)
  
  count <- 1
  
  for (i in seq_len(nrow(s)-1)) {
    
    for(j in seq(i+1, nrow(s))) {
      
      rs[count] <- relatedness_2(s[c(i, j),])
      count <- count + 1
    }
    
  }
  
  return(rs)
  
}


ks <- 2:10
r_mean <- vector("list",length(ks)+1)
for(c in ks) {    
  
  ss <- replicate(10000, dummy_rel(COI = c))
  rs <- apply(ss, 3, relatedness)
  if(c > 2) {
    rs <- colMeans(rs)
  } 
  
  r_mean[[c]] <- data.frame("COI" = c, "rel" = rs)
  
}

r_df <- do.call(rbind, r_mean)
ggplot(r_df, aes(as.factor(COI), rel)) + 
  geom_boxplot() + 
  geom_smooth(method = "loess", aes(group = 1)) +
  xlab("COI") + ylab("Relatedness") 

image

Originally posted by @OJWatson in #1 (comment)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.