
kbet's Introduction

title: kBET short introduction
author: Maren Büttner
date: 9/18/2017

kBET - k-nearest neighbour batch effect test

The R package provides a test for batch effects in high-dimensional single-cell RNA sequencing data. It evaluates the accordance of replicates based on Pearson's $\chi^2$ test. First, the algorithm creates a k-nearest neighbour matrix and chooses 10% of the samples to check the batch label distribution in their neighbourhoods. If the local batch label distribution is sufficiently similar to the global batch label distribution, the $\chi^2$ test does not reject the null hypothesis (that is, "all batches are well-mixed"). The neighbourhood size k is fixed for all tests. The test returns a binary result for each tested sample, and the final kBET result is the average test rejection rate. The lower the rejection rate, the less bias is introduced by the batch effect. kBET is very sensitive to any kind of bias. If kBET returns an average rejection rate of 1 for your batch-corrected data, consider also computing the average silhouette width and PCA-based batch-effect measures to explore the degree of the batch effect. Learn more about kBET and batch effect correction in our publication.
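
To make the local test concrete, here is a toy illustration (not package code) of the $\chi^2$ comparison kBET performs in a single neighbourhood: observed batch label counts within a neighbourhood of size k are compared against the counts expected under the global batch label frequencies. All numbers below are made up for illustration.

# toy illustration (not package code): one local chi-squared test as kBET performs it
# hypothetical setting: 3 batches with global frequencies 0.5, 0.3, 0.2 and one
# neighbourhood of size k = 30 containing 22, 5 and 3 cells from these batches
global.freq  <- c(0.5, 0.3, 0.2)
local.counts <- c(22, 5, 3)
local.test <- chisq.test(local.counts, p = global.freq)
local.test$p.value # a small p-value means this neighbourhood deviates from the global mixture and counts as a rejection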

Installation

Installation should take less than 5 min.

Via Github and devtools

If you want to install the package directly from GitHub, I recommend using the devtools package.

library(devtools)
install_github('theislab/kBET')

Manually

Please download the package as a zip archive and install it via

install.packages('kBET.zip', repos = NULL, type = 'source')

Usage of the kBET function:

#data: a matrix (rows: cells or other observations, columns: features (genes); will be transposed if necessary)
#batch: vector or factor with batch label of each cell/observation; length has to match the size of the corresponding data dimension  
batch.estimate <- kBET(data, batch)
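
For a quick self-contained check, the toy example from the package help (also quoted in the issues below) can be adapted to the documented orientation, i.e. cells as rows; the numbers here are purely illustrative:

library(kBET)
set.seed(1)
# simulate 200 cells (rows) from 10 batches of 20 cells each, with 250 genes (columns)
batch <- rep(seq_len(10), each = 20)
data  <- matrix(rpois(n = 200 * 250, lambda = 10) * rbinom(200 * 250, 1, prob = 0.5), nrow = 200)
batch.estimate <- kBET(data, batch)
batch.estimate$summary # mean and quantiles of the expected and observed rejection rates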

If plot = TRUE, kBET creates a boxplot of the kBET rejection rates (for neighbourhoods and randomly chosen subsets of size k). kBET returns a list with several parts:

  • summary: summarizes the test results (with 95% confidence interval)
  • results: the p-values of all tested samples
  • average.pval: an average over all p-values of the tested samples
  • stats: the results for each of n_repeat runs - they can be used to reproduce the boxplot that is returned by kBET
  • params: the parameters used in kBET
  • outsider: samples without mutual nearest neighbours, their batch labels, and a p-value indicating whether their batch label composition differs from the global batch label frequencies

For a single-cell RNAseq dataset with less than 1,000 samples, the estimated run time is less than 2 minutes.

Plot kBET's rejection rate

By default (plot = TRUE), kBET returns a boxplot of observed and expected rejection rates for a data set. You might want to turn off the display of these plots and create them elsewhere. kBET returns all information that is needed in the stats part of the results.

library(ggplot2)
batch.estimate <- kBET(data, batch, plot=FALSE)
plot.data <- data.frame(class=rep(c('observed', 'expected'), 
                                  each=length(batch.estimate$stats$kBET.observed)), 
                        data =  c(batch.estimate$stats$kBET.observed,
                                  batch.estimate$stats$kBET.expected))
g <- ggplot(plot.data, aes(class, data)) + geom_boxplot() + 
     labs(x='Test', y='Rejection rate',title='kBET test results') +
     theme_bw() +  
     scale_y_continuous(limits=c(0,1))
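
Note that the plot object g is only displayed once it is printed (or saved), for example:

print(g) # display the boxplot
# ggsave('kBET_rejection_rates.png', g) # optionally save it to disk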

Variations:

The standard implementation of kBET performs a k-nearest neighbour search (if knn = NULL) with a pre-defined neighbourhood size k0, computes an optimal neighbourhood size (heuristic = TRUE), and finally chooses 10% of the samples at random to compute the test statistic itself (repeatedly by default, n_repeat = 100, to derive a confidence interval). For repeated runs of kBET, we recommend running the k-nearest neighbour search separately:

require('FNN')
# data: a matrix (rows: samples, columns: features (genes))
k0=floor(mean(table(batch))) #neighbourhood size: mean batch size 
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')

#now run kBET with pre-defined nearest neighbours.
batch.estimate <- kBET(data, batch, k = k0, knn = knn)

Note that the get.knn function from the FNN package initializes a variable with n * k entries, where n is the sample size and k is the neighbourhood size. If n * k > 2^31, get.knn aborts the k-nearest neighbour search. The initial neighbourhood size in kBET (k0) is ~ 1/4 * mean(batch size), which can already be too large, for example for mass cytometry data. In such cases, we recommend subsampling the data.
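
As a rough guard (just a sketch based on the limit described above), you can check the product n * k before calling get.knn:

# sketch: check the FNN limit described above before running the knn search
n  <- nrow(data)                 # number of samples
k0 <- floor(mean(table(batch)))  # intended neighbourhood size
if (as.numeric(n) * k0 > 2^31) {
  warning('n * k exceeds 2^31; subsample the data or reduce k0 before calling get.knn.')
}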

Subsampling:

Currently (July 2019), kBET operates only on dense matrices, which results in memory issues for large datasets. Furthermore, the k-nearest neighbour search with FNN is limited (see above). We recommend subsampling in these cases. We have considered several options. One option is to subsample the data irrespective of its substructure:

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
subset_size <- 0.1 #subsample to 10% of the data
subset_id <- sample.int(n = length(batch), size = floor(subset_size * length(batch)), replace=FALSE)
batch.estimate <- kBET(data[subset_id,], batch[subset_id])

In case of differently sized batches, one should consider stratified sampling in order to keep more samples from smaller batches.
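
A possible sketch of such a stratified subsampling, keeping up to a fixed number of cells per batch (the choice of n_per_batch is hypothetical):

# stratified subsampling sketch: keep up to n_per_batch cells from every batch,
# so that small batches are not under-represented
n_per_batch <- 100 # hypothetical target per batch
subset_id <- unlist(lapply(unique(batch), function(b) {
  idx <- which(batch == b)
  idx[sample.int(length(idx), size = min(length(idx), n_per_batch))]
}))
batch.estimate <- kBET(data[subset_id, ], batch[subset_id])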

The second option is to take the substructure of the data (i.e. clusters) into account when subsampling. We observed that batch label frequencies may vary between clusters. Such differences can arise, for example, from inter-individual variability or from targeted population enrichment in some batches (e.g. by FACS), in contrast to unbiased cell sampling. In these cases, we compute the rejection rates for each cluster separately and average the results afterwards.

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
#clusters: vector or factor with cluster label of each cell 

kBET_result_list <- list()
sum_kBET <- 0
for (cluster_level in unique(clusters)){
   batch_tmp <- batch[clusters == cluster_level]
   data_tmp <- data[clusters == cluster_level,]
   kBET_tmp <- kBET(df=data_tmp, batch=batch_tmp, plot=FALSE)
   kBET_result_list[[cluster_level]] <- kBET_tmp
   sum_kBET <- sum_kBET + kBET_tmp$summary$kBET.observed[1]
}

#averaging
mean_kBET = sum_kBET/length(unique(clusters))
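
If cluster sizes differ strongly, a cell-number-weighted average may be preferable to the plain mean above. A sketch, reusing kBET_result_list and the same cluster labels as in the loop:

# sketch: weight each cluster's observed rejection rate by its number of cells
cluster_levels <- unique(clusters)
observed_rates <- sapply(cluster_levels, function(cl) kBET_result_list[[cl]]$summary$kBET.observed[1])
cluster_sizes  <- sapply(cluster_levels, function(cl) sum(clusters == cl))
weighted_mean_kBET <- sum(observed_rates * cluster_sizes) / sum(cluster_sizes)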

Compute a silhouette width and PCA-based measure:

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
pca.data <- prcomp(data, center=TRUE) #compute PCA representation of the data
batch.silhouette <- batch_sil(pca.data, batch)
batch.pca <- pcRegression(pca.data, batch)


kbet's People

Contributors

flying-sheep, gokceneraslan, mbuttner


kbet's Issues

kBET doesn't give any output?

Hi, I am trying out kBET on my counts data from alignment for three single-cell RNA-seq batches with [6, 8, 17 cells] respectively. I used the following code and it executed without any errors [after I troubleshot previous errors using some answered issues]. But I am not able to generate any output file. Can you please help, as I want to know how the batch-corrected files can be obtained? Thanks so much for your time, and thanks in advance!

library(kBET)
scdata<-read.csv("batch_merge.txt", skip=1,row.names = 1, header=FALSE, sep='\t')
scdata
df<- t(scdata)
df
batch<-factor (c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3))

kBET <- function(
  df, batch, k0 = 8, knn = NULL,
  testSize = NULL, do.pca = TRUE, dim.pca = 50,
  heuristic = TRUE, n_repeat = 100,
  alpha = 0.05, addTest = FALSE,
  verbose = TRUE, plot = TRUE, adapt = TRUE
) {
  #create a subsetting mode:
  #if (is.data.frame(batch) && dim(batch)[2]>1) {
  #  cat('Complex study design detected.\n')
  #  design.names <- colnames(batch)
  #  cat(paste0('Subset by ', design.names[[2]], '\n'))
  #  bio <- unlist(unique(batch[[design.names[2]]]))
  #drop subset batch
  #  new.batch <- base::subset(batch,
  # batch[[2]]== bio[1])[,!design.names %in% design.names[[2]]]
  #  sapply(df[batch[[2]]==bio[1],], kBET, new.batch,
  # k0, knn, testSize, heuristic,n_repeat, alpha, addTest, verbose)
  
  #}
  
  
  #preliminaries:
  dof <- length(unique(batch)) - 1 #degrees of freedom
  if (is.factor(batch)) {
    batch <- droplevels(batch)
  }
  
  frequencies <- table(batch)/length(batch)
  #get 3 different permutations of the batch label
  batch.shuff <- replicate(3, batch[sample.int(length(batch))])
  
  
  class.frequency <- data.frame(class = names(frequencies),
                                freq = as.numeric(frequencies))
  dataset <- df
  dim.dataset <- dim(dataset)
  #check the feasibility of data input
  if (dim.dataset[1] != length(batch) && dim.dataset[2] != length(batch)) {
    stop("Input matrix and batch information do not match. Execution halted.")
  }
  
  if (dim.dataset[2] == length(batch) && dim.dataset[1] != length(batch)) {
    if (verbose) {
      cat('Input matrix has samples as columns. kBET needs samples as rows. Transposing...\n')
    }
    dataset <- t(dataset)
    dim.dataset <- dim(dataset)
  }
  #check if the dataset is too small per se
  if (dim.dataset[1]<=10){
    if (verbose){
      cat("Your dataset has less than 10 samples. Abort and return NA.\n")
    }
    return(NA)
  }
  
  stopifnot(class(n_repeat) == 'numeric', n_repeat > 0)
  
  do_heuristic <- FALSE
  if (is.null(k0) || k0 >= dim.dataset[1]) {
    do_heuristic <- heuristic
    if (!heuristic) {
      #default environment size: quarter the size of the largest batch
      k0 <- floor(mean(class.frequency$freq)*dim.dataset[1]/4)
    } else {
      #default environment size: three quarter the size of the largest batch
      k0 <- floor(mean(class.frequency$freq)*dim.dataset[1]*0.75)
      if (k0 < 10) {
        if (verbose){
          warning(
            "Your dataset has too few samples to run a heuristic.\n",
            "Return NA.\n",
            "Please assign k0 and set heuristic=FALSE."
          )
        }
        
        return(NA)
      }
    }
    if (verbose) {
      cat(paste0('Initial neighbourhood size is set to ', k0, '.\n'))
    }
  }
  
  #if k0 was set by the user and is too small & we do not operate on a
  #knn graph, abort
  #the reason is that if we want to test kBET on knn graph data integration methods,
  #we usually face small numbers of nearest neighbours.
  if (k0 < 10 & is.null(knn)) {
    if (verbose){
      warning(
        "Your dataset has too few samples to run a heuristic.\n",
        "Return NA.\n",
        "Please assign k0 and set heuristic=FALSE."
      )
    }
    return(NA)
  }
  # find KNNs
  if (is.null(knn)) {
    if (!do.pca) {
      if (verbose) {
        cat('finding knns...')
        tic <- proc.time()
      }
      #use the nearest neighbour index directly for further use in the package
      knn <- get.knn(dataset, k = k0, algorithm = 'cover_tree')$nn.index
    } else {
      dim.comp <- min(dim.pca, dim.dataset[2])
      if (verbose) {
        cat('reducing dimensions with svd first...\n')
      }
      data.pca <- svd(x = dataset, nu = dim.comp, nv = 0)
      if (verbose) {
        cat('finding knns...')
        tic <- proc.time()
      }
      knn <- get.knn(data.pca$u,  k = k0, algorithm = 'cover_tree')
    }
    if (verbose) {
      cat('done. Time:\n')
      print(proc.time() - tic)
    }
  }
  
  #backward compatibility for knn-graph
  if (class(knn) =='list'){
    knn <- knn$nn.index
    if (verbose){
      cat('KNN input is a list, extracting nearest neighbour index.\n')
    }
  }
  
  #set number of tests
  if (is.null(testSize) || (floor(testSize) < 1 || dim.dataset[1] < testSize)) {
    test.frac <- 0.1
    testSize <- ceiling(dim.dataset[1]*test.frac)
    if (testSize < 25 && dim.dataset[1] > 25) {
      testSize <- 25
    }
    if (verbose) {
      cat('Number of kBET tests is set to ')
      cat(paste0(testSize, '.\n'))
    }
  }
  #decide to adapt general frequencies
  if (adapt) {
    # idx.run <- sample.int(dim.dataset[1], size = min(2*testSize, dim.dataset[1]))
    outsider <- which(!(seq_len(dim.dataset[1]) %in% knn[, seq_len(k0 - 1)]))
    is.imbalanced <- FALSE #initialisation
    p.out <- 1
    #avoid unwanted things happen if length(outsider) == 0
    if (length(outsider) > 0) {
      p.out <- chi_batch_test(outsider, class.frequency, batch,  dof)
      if (!is.na(p.out)){
        is.imbalanced <- p.out < alpha
        if (is.imbalanced) {
          new.frequencies <- table(batch[-outsider])/length(batch[-outsider])
          new.class.frequency <- data.frame(class = names(new.frequencies),
                                            freq = as.numeric(new.frequencies))
          if (verbose) {
            outs_percent <- round(length(outsider) / length(batch) * 100, 3)
            cat(paste(
              sprintf(
                'There are %s cells (%s%%) that do not appear in any neighbourhood.',
                length(outsider), outs_percent
              ),
              'The expected frequencies for each category have been adapted.',
              'Cell indexes are saved to result list.',
              '', sep = '\n'
            ))
          }
        } else {
          if (verbose) {
            cat(paste0('No outsiders found.'))
          }
        }
      } else{
        if (verbose) {
          cat(paste0('No outsiders found.'))
        }
      }
    }
  }
  
  if (do_heuristic) {
    #btw, when we bisect here that returns some interval with
    #the optimal neihbourhood size
    if (verbose) {
      cat('Determining optimal neighbourhood size ...')
    }
    opt.k <- bisect(scan_nb, bounds = c(10, k0), known = NULL, dataset, batch, knn)
    #result
    if (length(opt.k) > 1) {
      k0 <- opt.k[2]
      if (verbose) {
        cat(paste0('done.\nNew size of neighbourhood is set to ', k0, '.\n'))
      }
    } else {
      if (verbose) {
        cat(paste(
          'done.',
          'Heuristic did not change the neighbourhood.',
          sprintf('If results appear inconclusive, change k0=%s.', k0),
          '', sep = '\n'
        ))
      }
    }
  }
  
  #initialise result list
  rejection <- list()
  rejection$summary <- data.frame(
    kBET.expected = numeric(4),
    kBET.observed = numeric(4),
    kBET.signif = numeric(4)
  )
  
  rejection$results <- data.frame(
    tested = numeric(dim.dataset[1]),
    kBET.pvalue.test = rep(0,dim.dataset[1]),
    kBET.pvalue.null = rep(0, dim.dataset[1])
  )
  #get average residual score
  env <- as.vector(cbind(knn[, seq_len(k0 - 1)], seq_len(dim.dataset[1])))
  cf <- if (adapt && is.imbalanced) new.class.frequency else class.frequency
  rejection$average.pval <- 1 - pchisq(k0 * residual_score_batch(env, cf, batch), dof)
  
  
  
  #initialise intermediates
  kBET.expected <- numeric(n_repeat)
  kBET.observed <- numeric(n_repeat)
  kBET.signif <- numeric(n_repeat)
  
  
  if (addTest) {
    #initialize result list
    rejection$summary$lrt.expected <- numeric(4)
    rejection$summary$lrt.observed <- numeric(4)
    
    rejection$results$lrt.pvalue.test <- rep(0,dim.dataset[1])
    rejection$results$lrt.pvalue.null <- rep(0, dim.dataset[1])
    
    lrt.expected <- numeric(n_repeat)
    lrt.observed <- numeric(n_repeat)
    lrt.signif <- numeric(n_repeat)
    #decide to perform exact test or not
    if (choose(k0 + dof,dof) < 5e5 && k0 <= min(table(batch))) {
      exact.expected <- numeric(n_repeat)
      exact.observed <- numeric(n_repeat)
      exact.signif <- numeric(n_repeat)
      
      rejection$summary$exact.expected <- numeric(4)
      rejection$summary$exact.observed <- numeric(4)
      rejection$results$exact.pvalue.test <- rep(0, dim.dataset[1])
      rejection$results$exact.pvalue.null <- rep(0, dim.dataset[1])
    }
    
    for (i in seq_len(n_repeat)) {
      # choose a random sample from dataset (rows: samples, columns: features)
      idx.runs <- sample.int(dim.dataset[1], size = testSize)
      env <- cbind(knn[idx.runs, seq_len(k0 - 1)], idx.runs)
      #env.rand <- t(sapply(rep(dim.dataset[1],testSize),  sample.int, k0))
      
      #perform test
      if (adapt && is.imbalanced) {
        p.val.test <- apply(env, 1, FUN = chi_batch_test,
                            new.class.frequency, batch,  dof)
      } else {
        p.val.test <- apply(env, 1, FUN = chi_batch_test,
                            class.frequency, batch,  dof)
      }
      
      is.rejected <- p.val.test < alpha
      
      
      
      #p.val.test <- apply(env, 1, FUN = chi_batch_test, class.frequency,
      #batch,  dof)
      p.val.test.null <-  apply(apply(batch.shuff, 2,
                                      function(x, freq, dof, envir) {
                                        apply(envir, 1, FUN = chi_batch_test, freq, x, dof)},
                                      class.frequency, dof, env), 1, mean, na.rm=TRUE)
      #p.val.test.null <- apply(env.rand, 1, FUN = chi_batch_test,
      #class.frequency, batch, dof)
      
      #summarise test results
      kBET.expected[i] <- sum(p.val.test.null < alpha,
                              na.rm=TRUE) / sum(!is.na(p.val.test.null))
      kBET.observed[i] <- sum(is.rejected,na.rm=TRUE) / sum(!is.na(p.val.test))
      
      #compute significance
      kBET.signif[i] <-
        1 - ptnorm(kBET.observed[i],
                   mu = kBET.expected[i],
                   sd = sqrt(kBET.expected[i] * (1 - kBET.expected[i]) / testSize),
                   alpha = alpha)
      
      #assign results to result table
      rejection$results$tested[idx.runs] <- 1
      rejection$results$kBET.pvalue.test[idx.runs] <- p.val.test
      rejection$results$kBET.pvalue.null[idx.runs] <- p.val.test.null
      
      
      #compute likelihood-ratio test (approximation for multinomial exact test)
      cf <- if (adapt && is.imbalanced) new.class.frequency else class.frequency
      p.val.test.lrt <- apply(env, 1, FUN = lrt_approximation, cf, batch, dof)
      p.val.test.lrt.null <- apply(apply(batch.shuff, 2,
                                         function(x, freq, dof, envir) {
                                           apply(envir, 1, FUN = lrt_approximation, freq, x, dof)},
                                         class.frequency, dof, env), 1, mean, na.rm=TRUE)
      
      lrt.expected[i] <-
        sum(p.val.test.lrt.null < alpha, na.rm=TRUE) / sum(!is.na(p.val.test.lrt.null))
      lrt.observed[i] <-
        sum(p.val.test.lrt < alpha, na.rm=TRUE) / sum(!is.na(p.val.test.lrt))
      
      lrt.signif[i] <-
        1 - ptnorm(lrt.observed[i],
                   mu = lrt.expected[i],
                   sd = sqrt(lrt.expected[i] * (1 - lrt.expected[i]) / testSize),
                   alpha = alpha)
      
      rejection$results$lrt.pvalue.test[idx.runs] <- p.val.test.lrt
      rejection$results$lrt.pvalue.null[idx.runs] <- p.val.test.lrt.null
      
      
      #exact result test consume a fairly high amount of computation time,
      #as the exact test computes the probability of ALL
      #possible configurations (under the assumption that all batches
      #are large enough to 'imitate' sampling with replacement)
      #For example: k0=33 and dof=5 yields 501942 possible
      #choices and a computation time of several seconds (on a 16GB RAM machine)
      if (exists(x = 'exact.observed')) {
        if (adapt && is.imbalanced) {
          p.val.test.exact <- apply(env, 1, multiNom,
                                    new.class.frequency$freq, batch)
        } else {
          p.val.test.exact <- apply(env, 1, multiNom,
                                    class.frequency$freq, batch)
        }
        p.val.test.exact.null <-  apply(apply(batch.shuff, 2,
                                              function(x, freq, envir) {
                                                apply(envir, 1, FUN = multiNom, freq, x)},
                                              class.frequency$freq, env), 1, mean, na.rm=TRUE)
        # apply(env, 1, multiNom, class.frequency$freq, batch.shuff)
        
        exact.expected[i] <- sum(p.val.test.exact.null < alpha, na.rm=TRUE)/testSize
        exact.observed[i] <- sum(p.val.test.exact < alpha, na.rm=TRUE)/testSize
        #compute the significance level for the number of rejected data points
        exact.signif[i] <-
          1 - ptnorm(exact.observed[i],
                     mu = exact.expected[i],
                     sd = sqrt(exact.expected[i] * (1 - exact.expected[i]) / testSize),
                     alpha = alpha)
        #p-value distribution
        rejection$results$exact.pvalue.test[idx.runs] <- p.val.test.exact
        rejection$results$exact.pvalue.null[idx.runs] <- p.val.test.exact.null
      }
    }
    
    
    
    if (n_repeat > 1) {
      #summarize chi2-results
      CI95 <- c(0.025,0.5,0.975)
      rejection$summary$kBET.expected <-  c(mean(kBET.expected, na.rm=TRUE) ,
                                            quantile(kBET.expected, CI95,
                                                     na.rm=TRUE))
      rownames(rejection$summary) <- c('mean', '2.5%', '50%', '97.5%')
      rejection$summary$kBET.observed <-  c(mean(kBET.observed,na.rm=TRUE) ,
                                            quantile(kBET.observed, CI95,
                                                     na.rm=TRUE))
      rejection$summary$kBET.signif <- c(mean(kBET.signif,na.rm=TRUE) ,
                                         quantile(kBET.signif, CI95,
                                                  na.rm=TRUE))
      #summarize lrt-results
      rejection$summary$lrt.expected <-  c(mean(lrt.expected,na.rm=TRUE) ,
                                           quantile(lrt.expected, CI95,
                                                    na.rm=TRUE))
      rejection$summary$lrt.observed <-  c(mean(lrt.observed,na.rm=TRUE) ,
                                           quantile(lrt.observed, CI95,
                                                    na.rm=TRUE))
      rejection$summary$lrt.signif <- c(mean(lrt.signif,na.rm=TRUE) ,
                                        quantile(lrt.signif, CI95,
                                                 na.rm=TRUE))
      #summarize exact test results
      if (exists(x = 'exact.observed')) {
        rejection$summary$exact.expected <-  c(mean(exact.expected,na.rm=TRUE) ,
                                               quantile(exact.expected, CI95,
                                                        na.rm=TRUE))
        rejection$summary$exact.observed <-  c(mean(exact.observed,na.rm=TRUE) ,
                                               quantile(exact.observed, CI95,
                                                        na.rm=TRUE))
        rejection$summary$exact.signif <- c(mean(exact.signif,na.rm=TRUE) ,
                                            quantile(exact.signif, CI95,
                                                     na.rm=TRUE))
      }
      
      if (n_repeat < 10) {
        cat('Warning: The quantile computation for ')
        cat(paste0(n_repeat))
        cat(' subset results is not meaningful.')
      }
      
      if (plot && exists(x = 'exact.observed')) {
        plot.data <- data.frame(class = rep(c('kBET', 'kBET (random)',
                                              'lrt', 'lrt (random)',
                                              'exact', 'exact (random)'),
                                            each = n_repeat),
                                data =  c(kBET.observed, kBET.expected,
                                          lrt.observed, lrt.expected,
                                          exact.observed, exact.expected))
        g <- ggplot(plot.data, aes(class, data)) + geom_boxplot() +
          theme_bw() + labs(x = 'Test', y = 'Rejection rate')  +
          theme(axis.text.x = element_text(angle = 45, hjust = 1))
        print(g)
      }
      if (plot && !exists(x = 'exact.observed')) {
        plot.data <- data.frame(class = rep(c('kBET', 'kBET (random)',
                                              'lrt', 'lrt (random)'),
                                            each = n_repeat),
                                data =  c(kBET.observed, kBET.expected,
                                          lrt.observed, lrt.expected))
        g <- ggplot(plot.data, aes(class, data)) + geom_boxplot() +
          theme_bw() + labs(x = 'Test', y  = 'Rejection rate') +
          scale_y_continuous(limits = c(0,1))  +
          theme(axis.text.x = element_text(angle = 45, hjust = 1))
        print(g)
      }
      
    } else {#i.e. no n_repeat
      rejection$summary$kBET.expected <- kBET.expected
      rejection$summary$kBET.observed <- kBET.observed
      rejection$summary$kBET.signif <- kBET.signif
      
      if (addTest) {
        rejection$summary$lrt.expected <- lrt.expected
        rejection$summary$lrt.observed <- lrt.observed
        rejection$summary$lrt.signif <- lrt.signif
        if (exists(x = 'exact.observed')) {
          rejection$summary$exact.expected <-  exact.expected
          rejection$summary$exact.observed <-  exact.observed
          rejection$summary$exact.signif <- exact.signif
        }
      }
      
    }
  } else {#kBET only
    for (i in seq_len(n_repeat)) {
      # choose a random sample from dataset
      #(rows: samples, columns: parameters)
      idx.runs <- sample.int(dim.dataset[1], size = testSize)
      env <- cbind(knn[idx.runs,seq_len(k0 - 1)], idx.runs)
      
      #perform test
      cf <- if (adapt && is.imbalanced) new.class.frequency else class.frequency
      p.val.test <- apply(env, 1, chi_batch_test, cf, batch, dof)
      
      #print(dim(env))
      is.rejected <- p.val.test < alpha
      
      p.val.test.null <- apply(
        batch.shuff, 2,
        function(x) apply(env, 1, chi_batch_test, class.frequency, x, dof)
      )
      # p.val.test.null <- apply(env, 1, FUN = chi_batch_test,
      #class.frequency, batch.shuff, dof)
      
      #summarise test results
      #kBET.expected[i] <- sum(p.val.test.null < alpha) / length(p.val.test.null)
      kBET.expected[i] <- mean(apply(
        p.val.test.null, 2,
        function(x) sum(x < alpha, na.rm =TRUE) / sum(!is.na(x))
      ))
      
      kBET.observed[i] <- sum(is.rejected,na.rm=TRUE) / sum(!is.na(p.val.test))
      
      #compute significance
      kBET.signif[i] <- 1 - ptnorm(
        kBET.observed[i],
        mu = kBET.expected[i],
        sd = sqrt(kBET.expected[i] * (1 - kBET.expected[i]) / testSize),
        alpha = alpha
      )
      #assign results to result table
      rejection$results$tested[idx.runs] <- 1
      rejection$results$kBET.pvalue.test[idx.runs] <- p.val.test
      rejection$results$kBET.pvalue.null[idx.runs] <- rowMeans(p.val.test.null, na.rm=TRUE)
    }
    
    if (n_repeat > 1) {
      #summarize chi2-results
      CI95 <- c(0.025,0.5,0.975)
      rejection$summary$kBET.expected <- c(mean(kBET.expected,na.rm=TRUE),
                                           quantile(kBET.expected,
                                                    CI95,na.rm=TRUE))
      rownames(rejection$summary) <- c('mean', '2.5%', '50%', '97.5%')
      rejection$summary$kBET.observed <- c(mean(kBET.observed,na.rm=TRUE),
                                           quantile(kBET.observed, CI95,
                                                    na.rm=TRUE))
      rejection$summary$kBET.signif <- c(mean(kBET.signif,na.rm=TRUE),
                                         quantile(kBET.signif, CI95,
                                                  na.rm=TRUE))
      
      #return also n_repeat
      rejection$stats$kBET.expected <- kBET.expected
      rejection$stats$kBET.observed <- kBET.observed
      rejection$stats$kBET.signif <- kBET.signif
      
      if (n_repeat < 10) {
        cat('Warning: The quantile computation for ')
        cat(paste0(n_repeat))
        cat(' subset results is not meaningful.')
      }
      if (plot) {
        plot.data <- data.frame(class = rep(c('observed(kBET)',
                                              'expected(random)'),
                                            each = n_repeat),
                                data =  c( kBET.observed,kBET.expected))
        g <- ggplot(plot.data, aes(class, data)) +
          geom_boxplot() + theme_bw() +
          labs(x = 'Test', y = 'Rejection rate') +
          scale_y_continuous(limits = c(0,1))
        print(g)
      }
    } else {
      rejection$summary$kBET.expected <- kBET.expected[1]
      rejection$summary$kBET.observed <- kBET.observed[1]
      rejection$summary$kBET.signif <- kBET.signif[1]
    }
  }
  #collect parameters
  rejection$params <- list()
  rejection$params$k0 <- k0
  rejection$params$testSize <- testSize
  rejection$params$do.pca <- do.pca
  rejection$params$dim.pca <- dim.pca
  rejection$params$heuristic <- heuristic
  rejection$params$n_repeat <- n_repeat
  rejection$params$alpha <- alpha
  rejection$params$addTest <- addTest
  rejection$params$verbose <- verbose
  rejection$params$plot <- plot
  
  #add outsiders
  if (adapt) {
    rejection$outsider <- list()
    rejection$outsider$index <- outsider
    rejection$outsider$categories <- table(batch[outsider])
    rejection$outsider$p.val <- p.out
  }
  rejection
}

session_info_kbet.docx

Overall Batch Effect

Hi,

I wanted to use kBET to estimate the overall magnitude of batch effects in my data.
Does it make sense to use the difference between expected and observed mean rejection rate for this?
(summary$kBET.observed[1] - summary$kBET.expected[1])

Best
Marc

kBET not working (?)

Hi,

I have a data set with the following samples:

Sample Replicates
Vehicle 3
Short term treatment 3
Long term treatment 3

I subsampled my matrix to 1,000 cells, with 21,190 genes.

It looks like this:

mydata[c(1:5), c(1:10)]

GNAI3 CDC45 H19 SCML2 APOH NARF CAV2 KLF6 SCMH1 COX5A
231279535536556-V_1     0     0   0     0    0    0    1   22     0     5
226953411382694-V_1     1     0   0     0    0    0    0    3     0     1
196527716482341-V_1     0     0   0     0    0    1    0    0     0     4
160440043826078-V_1     0     0   0     0    0    0    0    3     0     1
232301289171748-V_1     1     0   0     0    0    0    2   12     3     3

However, when I run kBET and do batch.estimate$summary, I get:

      kBET.expected kBET.observed kBET.signif
mean              0             0           1
2.5%              0             0           1
50%               0             0           1
97.5%             0             0           1

All tests result in 0.

This occurs both if I set each sample as a different batch and if I label each treatment as a different batch.

The example in the method's help, however, does give results different from zero:

      kBET.expected kBET.observed kBET.signif
mean     0.02653333        0.0212  0.73780060
2.5%     0.00000000        0.0000  0.01681108
50%      0.02666667        0.0000  1.00000000
97.5%    0.06666667        0.0800  1.00000000

Do you have any idea of why this could be happening?

Thank you,

data matrix for kBET evaluation

Hi,
I have two questions:
Q1. What kind of data matrix would you suggest for kBET estimation?
(1) raw counts
(2) normalized data (log normalized)
(3) normalized and scaled data
Which one would be proper?

Q2. Thanks for adding the subsampling part! Should I iterate multiple times and average the results, or is subsampling once enough?

Thanks,
Phoebe.

High rejection rates for seemingly well integrated data

Hello there. I have been trying to get this package working for quite some time now, but no matter what, the rejection rates are either 1 or close to 1 regardless of correction. I would greatly appreciate feedback on this issue.

The data before conversion to a matrix is a [cell by gene] data.frame with the last two columns giving the batch and cell types as categorical data. The data is for a single cell type collected over three separate batches, and each batch contains a not too dissimilar number of observations.

Example:

PCA before correction:
image

kBET before correction:
image

PCA after correction:
image

kBET after correction:
image

Code:

# KBET for uncorrected 
# coerce as matrix and remove categorical columns
data <- as.matrix(original[ , 1:(length(original)-2)])
#coerce batch labels(sample) to factor
batch <- as.factor(original$sample)
#compute and plot
k0=floor(mean(table(batch))) 
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')
batch.estimate <- kBET(data, batch, plot=TRUE, do.pca = TRUE, dim.pca = 2)

#KBET for corrected 
# coerce as matrix and remove categorical columns
data <- as.matrix(corrected[ , 1:(length(corrected)-2)])
#coerce batch labels(sample) to factor
batch <- as.factor(corrected$sample)
#compute and plot
k0=floor(mean(table(batch))) #neighbourhood size: mean batch size 
knn <- get.knn(data, k=k0, algorithm = 'cover_tree')
batch.estimate <- kBET(data, batch, plot=TRUE, do.pca = TRUE, dim.pca = 2)

I also tried iterating over many different values of k (recomputing knn as suggested), yet the rejection rate saturates and remains high throughout.

Assessment over varying values of k (corrected data):
image

best regards,
Dean

Warning from 'if (class(knn) == "list")' in R 4.0.0

In R 4.0.0 I am getting the following warning from the kBET function:

Warning messages:
1: In if (class(knn) == "list") { :
  the condition has length > 1 and only the first element will be used
2: In if (class(knn) == "list") { :
  the condition has length > 1 and only the first element will be used
3: In if (class(knn) == "list") { :
  the condition has length > 1 and only the first element will be used

It may have to do with the fact that R 4.0.0 assigns both class "matrix" and "array" to matrices. See the discussion here: https://developer.r-project.org/Blog/public/2019/11/09/when-you-think-class.-think-again/index.html
They recommend using if(is(object, "list")) instead of if(class(object) == "list").
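
For reference, the suggested change would look roughly like this (a sketch of the check, not the released code):

# sketch: is()/inherits() return a single logical even when an object has
# several classes, e.g. c('matrix', 'array') in R >= 4.0.0
if (is(knn, 'list')) {
  knn <- knn$nn.index
}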

Seurat CCA

Hi,

I'm trying to use kBet for optimizing dims.use in Seurat CCA. Can you please provide a quick explanation/code for how you did this in Buttner et al?

Thanks,
Alex

Using kBET with AnnData object

Hi all,
I would like to use the R version of kBET on an AnnData object (or rather the neighborhood graph stored in this anndata object) for the comparison of two batch correction methods.
I don't want to use scib, since that is currently not working for me (gives me only 'nan' for the kBET metric).
Which part of the anndata object should I use as input for kBET so that the batch correction stored in the neighbor graph is considered (connectivities or distances)?
Can I pass these matrices to the df parameter in kBET or rather to the knn parameter?

Thanks for the help!

kBET values

Hello,

Thanks for developing a great tool.
I have several datasets which were taken on different days. We analyzed them and have 4 clusters in total (which we checked, and they are viable). Nevertheless, we collected another dataset in which 2 of the 4 clusters are significantly shifted from the other datasets in the tSNE. I know that tSNE distances mean little, but I wanted to check whether we have a batch effect using kBET.

When I compute kBET for 2 datasets (the new dataset, which has the 2 separate clusters, and another "good" dataset), I see an observed rejection rate of around 0.8 for non-normalized and 0.4 for normalized data.

  1. I wonder: should one look more at the normalized or the non-normalized results?

  2. In order to understand how kBET would evaluate 2 "good" datasets, I computed kBET on the "good" datasets by splitting them and got an observed rejection rate of almost 0 for non-normalized and around 0% for normalized data. Are these values normal? I would always expect some percentage of rejection, not 0. Also, I got an expected rejection rate of around 0.15. What is the expected rejection rate? How is it computed?

Thanks in advance for any help,
Vlad

Error when setting nPCs to 2 in batch_sil function

Hi!
I have only PC1 and PC2, so when I try to set nPCs = 2 instead of the default 3

pca.data <- prcomp(data, center=TRUE) #compute PCA representation of the data
batch.silhouette <- batch_sil(pca.data, batch, nPCs = 2)

I get

Error in if (!all(x == round(x))) stop("'x' must only have integer codes") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In silhouette(as.numeric(batch), dd) : NAs introduced by coercion

I have also tried a different approach:
after PCA dimension reduction (Scanpy) I input the top 40 PCs from the batch-corrected gene expression matrix to the kBET function (matrix: PCs as columns, cells as rows; vector with batch label of each cell). kBET returned an average rejection rate of 1, thus I tried batch_sil function, but the error remained the same

Error in if (!all(x == round(x))) stop("'x' must only have integer codes") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In silhouette(as.numeric(batch), dd) : NAs introduced by coercion

How should I calculate batch.silhouette?
Thanks!

The input data format

Sorry to disturb you, but I have a question about the input of kBET.
In the readme you write that columns represent features (genes); let's assume that is true.

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
batch.estimate <- kBET(data, batch)

In the example you prepared, you said:

batch <- rep(seq_len(10),each=20)
data <- matrix(rpois(n = 50000, lambda = 10)*rbinom(50000,1,prob=0.5), ncol=200)
batch.estimate <- kBET(data,batch)

So 'data' is a 250 x 200 matrix, i.e. there are 200 genes, and 'batch' is a vector with 200 elements.

In my understanding, the batch information should relate to something like the experiment rather than to genes. Am I correct? So do the columns represent cells and the rows genes? Is that true?

Extremely long run time

Hello! Thank you for developing this tool.

I am currently trying to run the initial kBET command, but am having difficulty due to an extremely long run time (>24 hours).

My inputs are as follows:

  • df: a matrix where rows are cells (1157) and columns as genes (2000)
  • batch: a vector of factors (length 1157)

When I set verbose = TRUE on this command, I see that it is selecting an optimal number of nearest neighbors of 433, which confuses me a bit seeing as there are only 1157 cells. In addition, I can tell that the code is stuck at identifying knns.

What can I do to optimize the format of my data in order to achieve a run time similar to what is stated on your code (~2 min for ~1k cells)?
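
A sketch of one way to keep the knn search cheap, based on the parameters documented above (k0 = 50 is a hypothetical choice, not a verified fix):

# sketch: fix a smaller neighbourhood size and skip the heuristic,
# so the knn search does not run with k ~ 433
batch.estimate <- kBET(df, batch, k0 = 50, heuristic = FALSE, do.pca = TRUE)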

Thank you so much!

kBET documentation for batch argument

Hello, I was wondering if you could clarify the documentation for the batch argument to the kBET function. Currently, it says "batch id for each cell or a data frame with both condition and replicates". Does the data frame option allow you to specify multiple replicates for some biological condition (i.e. treatment vs. control)? If so, could you provide an example of how to format this data frame?

kBET within known cluster

Hi,

I appreciate the great work, kBET. I have a question about its implementation. All (simulated) settings in the paper show no clustering of cells. When cells are clustered, kBET should only consider neighbourhoods within clusters. Is there a way to force kBET to do so?

Best regards,
Alemu

weird (opposite) result

Hi, I enjoy your tool very much! However, I encountered a weird problem with a small sample size. I found that kBET may not be very stable when quantifying the batch effect: the score may be exactly 0 or 1, and the scores may be totally opposite to intuition and visualization. Is it possible that kBET requires a large sample size to work?

question about input matrix size

Hello, thanks for your useful tool! Recently I was trying to run kBET on a matrix of size 4280 x 30, but it returns an error stating that:
image

I am wondering whether there are any limitations on the input dataset for the get.knn() function?

question about mean_kBET in the README subsampling section

I want to make sure that my interpretation of this output is correct.

So I went through the source code, and in the kBET.R file:

line 343: is.rejected <- p.val.test < alpha
line 359: kBET.observed[i] <- sum(is.rejected,na.rm=TRUE) / sum(!is.na(p.val.test))
lines 439-442:
rownames(rejection$summary) <- c('mean', '2.5%', '50%', '97.5%')
rejection$summary$kBET.observed <- c(mean(kBET.observed,na.rm=TRUE) ,
quantile(kBET.observed, CI95,
na.rm=TRUE))

mean_kBET is a kind of average rejection rate over all repeats (e.g., n_repeat = 100). Therefore, the lower mean_kBET is, the better the samples are mixed. Am I right?

Thanks for your time!

batch_sil with non-numeric batch codes

Hello,

I get the following error when running batch_sil:

R[write to console]: Error in if (!all(x == round(x))) stop("'x' must only have integer codes") :
missing value where TRUE/FALSE needed

R[write to console]: In addition:
R[write to console]: Warning message:

R[write to console]: In silhouette(as.numeric(batch), dd) :
R[write to console]: NAs introduced by coercion

The problem comes from having non-numeric batch codes, which this line of code converts to NAs:
summary(silhouette(as.numeric(batch), dd))

Can you suggest any workaround for this?
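
One possible workaround (a sketch, not an official fix): map the batch labels to integer codes before calling batch_sil, since silhouette() only needs numeric cluster codes.

# sketch of a possible workaround: convert non-numeric batch labels to integer codes
batch_codes <- as.integer(factor(batch))
batch.silhouette <- batch_sil(pca.data, batch_codes)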

Many thanks,
Jess

Abnormal kBET Scores

Hello,

Thank you for a great tool! We had some questions on interpreting the results. In the image below, the figure at the top represents integrated data, where species are well mixed. Yet the rejection rate is 1 for all samples, suggesting that the local and global distributions of species labels are completely different.

On the unintegrated data, however, the rejection rate averages to 0.91. How should we interpret these results? We were under the impression that the observed kBET rejection rate should be high for unintegrated and low for integrated data. Thank you for the help!

Screenshot 2024-06-27 at 11 07 57 AM

Memory

Hi,

will there be support for sparse matrices in the package? With my data I get the error below when I try to run kBET:

Error in La.svd(x, nu, nv) : 
  cannot allocate memory block of size 134217728 Tb
Calls: kBET -> svd -> La.svd
Execution halted

Cheers,

Dave

which cell is rejected

Hi, I want to generate a plot similar to the one in the kBET paper:
image
Where can I get the results that indicate whether a cell is rejected or not?

Which kind of normalization for 10X data is preferred for kBET?

Thanks for such a good tool for assessing single-cell RNA-seq batch correction. I have 180k cells across 20 patients and I would like to analyze whether there is a batch effect. I wonder which kind of data is preferred for kBET: raw counts with all genes, log(CPM+1) data with selected highly variable genes, or z-score normalized log(CPM+1) data? Do I have to select highly variable genes, or should I use PCs as input?

Improve runtime of kBET

kBET is slow, partly because it runs many computations multiple times (for instance, to obtain good statistics for the rejection rate).

  • ensure that neighbourhoods are computed at most once
  • revisit the subsampling implementation
  • use a more efficient kNN computation (FNN at the moment)

Using kBET for methylation and RNA-Seq data

Hi, I came across your method and thought that it was a great way to query the presence of batch effects in other data types (i.e. methylation arrays and RNA-Seq data). From my understanding of your algorithm, your approach should be fine when applied to other batch effects (e.g. array, plate id, etc.). What are your thoughts? Thank you again for this wonderful approach to assessing batch effects!

Scanpy Anndata object to KBET?

Hi kBET developers,

I was wondering whether you have guidance or tutorials on how to run kBET on an integrated Scanpy AnnData object?

Maybe you have one already, but I cannot find it after searching.

Thanks!

Does kBET allow a normalized matrix as input?

Hi there,

I used kBET to detect the batch effect of the data after batch effect correction. However, I have a question on the input format. The example shows that the kBET function takes a raw count matrix as input. However, after correction using Seurat, I only have normalized and scaled matrices, where there are both positive and negative values (as shown below). So I wonder whether I can use the log-transformed data as input?

5 x 5 sparse Matrix of class "dgCMatrix"
F3_Unsorted_cell1 F3_LMPP_cell2 F3_LMPP_cell3 F3_MPP_cell4 F3_MPP_cell5
ENSMUSG00000027562 4.609797282 5.60968033 5.563666940 4.71301533 5.278862679
ENSMUSG00000027556 4.303335643 6.25554822 5.787379093 4.03904792 4.507551082
ENSMUSG00000027863 0.188618445 . -0.011766633 -0.03955919 -0.005985704
ENSMUSG00000035042 0.157980256 0.03028318 0.194765506 -0.06541776 -0.007800523
ENSMUSG00000056399 -0.001272955 -0.02507969 -0.001600793 -0.04997392 -0.018655246

Thanks!

how to assess the batch? by rejection rate or silhouette?

Hi, I recently used kBET to assess the batch effect of a single-cell dataset, but I'm really confused about how to assess the batch effect from the result.
1: Do you have a cutoff on the rejection rate to decide whether there is a batch effect or not?
2: I noticed the silhouette score, but I am confused about its description. It ranges from -1 to 1; from the formula, s(i) close to -1 would mean a lower batch effect and s(i) close to 1 a higher batch effect, but according to the methods part of the kBET paper, the absolute value of s was used to assess the absence or presence of a batch effect. So how do we assess the batch effect with the silhouette?

appreciate your response.

Questions about cell type parameters.

If I do not have cell type variables, can I leave this argument empty when I call kBET(), or must I fill in something? Thanks.

In addition, I sometimes notice that a state-of-the-art method's rejection rate is higher than that of an older method (such as Seurat v3). Thanks.

Error in batch_sil() and pcRegression()

Hello,

I am following the steps in the tutorial and I was able to get an output for the kBET() function on my dataset. I'm running into some issues in the subsequent functions:

batch.silhouette <- batch_sil(pca.data, meta$sample_id)
Error in silhouette.default(as.numeric(batch), dd) :
clustering 'x' and dissimilarity 'dist' are incompatible

batch.pca <- pcRegression(pca.data, meta$sample_id)
Error in model.frame.default(formula = rot.data ~ batch, drop.unused.levels = TRUE) :
variable lengths differ (found for 'batch')

I'm using the same data and batch variables I used in the kBET() function. Any thoughts on what might be going wrong and how I can fix this? Thanks a lot.

data.R is missing

Hi Maren!

In the last commit there is a data.R file listed in the DESCRIPTION file; however, it was not added to the package. This breaks the devtools installation. Could you please add data.R to the package?

Cheers,
Vlad

Rejection rates in integrated data of lung and liver versus two lung tissues

Hello there, I used the package to detect the batch effect of the data, for example:
image

First, I used Seurat to integrate the raw matrices, then filtered cells and calculated the top 2000 VariableFeatures. I used the matrix of these 2000 genes as the data to assess the batch effect with kBET; here are my results:

lung1-liver1: (image)

lung1-lung3: (image)

It seems reasonable that the batch effect between different tissues should be higher than between the same tissue.
So I want to know why the rejection rates between two different tissues (lung1 vs liver1) are lower than between the same tissue (lung1 vs lung3).
I am looking forward to your reply. Thank you.

kBET for scATAC-seq data

Hi, thanks for the nice package!

A few questions... kBET has worked well for scRNA-seq data, but I'm wondering whether the method is applicable to scATAC-seq data as well. The cell-by-peaks feature matrix is a bit different from an expression count matrix, so I wonder whether kBET makes any specific assumptions that would prevent it from being used on this type of data.

Second, and related, is whether there is any update on sparse matrix support? For example, if I use a tile matrix covering the entire human genome, it's very large: 6 million features by 12k cells (or larger), which is not really feasible to load as a dense matrix. I know it's possible to use a pre-computed nearest neighbor graph, but even when doing this, kBET still seems to require the original data matrix as input. Any other suggestions would be welcome also.

Improve input flexibility for kBET

kBET requires a dense matrix as input, even if kBET will only use a knn graph.

  • Include sparse matrix computation
  • Remove requirement of data matrix when a knn graph is provided
  • Use sparse matrix structure for knn graph
  • allow SCE objects as input

Add cell type label as optional argument

kBET uses the entire data matrix currently, and when kBET needs to be computed per cell type, one has to subset per cell type manually. kBET should handle this internally and return rejection rates per cell type.

  • Add cell type as optional argument
  • Handle knn graph subsetting if a knn graph is provided
  • Adapt output to a table of rejection rates

Working on SingleCellExperiment object.

Thank you very much for the great package; I found it through papers.
I am sorry that my questions may be a little basic. I was using a Seurat object and, as you suggest, I transformed it to a SingleCellExperiment object. However, according to the vignette, the data input should be a matrix. Could you tell me how to use the SingleCellExperiment object as input (batch.estimate <- kBET(data, batch))? I know how to get the batch, but I do not know how to make the matrix work.
Could you give me some help?
Thank you very much.
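
A possible sketch (assumptions: the SingleCellExperiment object is called sce, it carries a 'logcounts' assay, and the batch label is stored in colData under 'batch'): extract an assay as a dense matrix and transpose it so that cells are rows, as kBET expects.

library(SingleCellExperiment)
# sketch: pull an assay out of the SCE object, densify and transpose (cells as rows)
data  <- t(as.matrix(assay(sce, 'logcounts')))
batch <- sce$batch # assumed colData column holding the batch label
batch.estimate <- kBET(data, batch)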

cell x feature or feature x cell?

Hi,

I see you mentioned rows for cells and columns for features.

But in your example, there are 200 labels (for cells), while your sample data has 250 rows with 200 columns.

Does that mean we should put cells in columns, please?

Thanks!
Jon
