Pipeline for benchmarking atlas-level single-cell integration

This repository contains the snakemake pipeline for our benchmarking study of data integration tools. In this study, we benchmark 16 methods (see here) with 4 combinations of preprocessing steps, leading to 68 method combinations on 85 batches of gene expression and chromatin accessibility data. The pipeline uses the scib package and allows for reproducible and automated analysis of the different combinations of preprocessing and integration methods.

Workflow

Resources

  • On our website we visualise the results of the study.

  • The scib package that is used in this pipeline can be found here.

  • For reproducibility and visualisation we have a dedicated repository: scib-reproducibility.

  • The data used in the study is available on figshare.

Please cite:

Benchmarking atlas-level data integration in single-cell genomics.
MD Luecken, M Büttner, K Chaichoompu, A Danese, M Interlandi, MF Mueller, DC Strobl, L Zappia, M Dugas, M Colomé-Tatché, FJ Theis. bioRxiv 2020.05.22.111161; doi: https://doi.org/10.1101/2020.05.22.111161

Installation

To reproduce the results from this study, three different conda environments are needed. There are different environments for the python integration methods, the R integration methods and the conversion of R data types to anndata objects.

The main steps are:

  1. Install the conda environment
  2. Set environment variables
  3. Install any extra packages through R

For the installation of conda, follow these instructions or use your system's package manager. The environments have only been tested on Linux operating systems, although it should be possible to run the pipeline on macOS.

To create the conda environments use the .yml files in the envs directory. To install the envs, use

conda env create -f FILENAME.yml

Note: instead of conda, you can use mamba to speed up installation.
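
For example, to create the base pipeline environment (the file name follows the envs/ layout described below):

conda env create -f envs/scib-pipeline.yml
# or, equivalently, with mamba
mamba env create -f envs/scib-pipeline.yml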

For R environments, some dependencies need to be installed after the environment has been created. However, it is important to set environment variables for the conda environments first, to guarantee that the correct R version installs packages into the correct directories. All necessary steps are mentioned below.

Setting Environment Variables

Some parameters need to be added manually to the conda environment in order for packages to work correctly. For example, all environments using R need LD_LIBRARY_PATH set to the conda R library path. If that variable is not set, rpy2 might reference the library path of a different R installation that might be on your system.
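
As a rough sketch of what such an activate hook does (the actual scripts shipped in envs/ are authoritative and may differ in detail):

# sketch of an activate hook: back up the old value, then point
# LD_LIBRARY_PATH at the environment's own library directory
export LD_LIBRARY_PATH_BACKUP=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

The matching deactivate script restores the original value.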

Environment variables are provided in env_vars_activate.sh and env_vars_deactivate.sh and should be copied to the designated locations of each conda environment. Make sure to determine $CONDA_PREFIX in the activated environment first, then deactivate the environment before copying the files to prevent unwanted effects. This process is automated with the following script, which you should call for each environment that uses R.

. envs/set_vars.sh <conda_prefix>

After the script has successfully finished, you should be ready to use your new environment.

If you want to set these and potentially other variables manually, proceed as follows.

e.g. for scIB-python:

conda activate scIB-python
echo $CONDA_PREFIX  # referred to as <conda_prefix>
conda deactivate

# copy activate variables
cp envs/env_vars_activate.sh <conda_prefix>/etc/conda/activate.d/env_vars.sh
# copy deactivate variables
cp envs/env_vars_deactivate.sh <conda_prefix>/etc/conda/deactivate.d/env_vars.sh

If necessary, create any missing directories manually. In case some lines in the environment scripts cause problems, you can edit the files to troubleshoot.
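
To check that the hooks are picked up, activate the environment and inspect the variable; it should point into the environment's lib directory:

conda activate scIB-python
echo $LD_LIBRARY_PATH   # should contain <conda_prefix>/lib
conda deactivate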

Python environments

There are multiple different environments for the python dependencies:

YAML file location           Environment name    Description
envs/scib-pipeline.yml       scib-pipeline       Base environment for calling the pipeline, running python integration methods and computing metrics
envs/scIB-python-paper.yml   scIB-python-paper   Environment used for the results in the publication

The scib-pipeline environment is the one the user activates before calling the pipeline. It needs to be specified under the py_env key in the config files under configs/ so that the pipeline uses it for running python methods. Alternatively, you can specify scIB-python-paper as the py_env to recreate the environment used in the paper and reproduce its results.
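
For illustration, the corresponding entry in a config file could look like this (a sketch of the single py_env key, not a complete config; see the files under configs/ for the full schema):

# in configs/<your_config>.yaml
py_env: scib-pipeline   # or scIB-python-paper to reproduce the paper's results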

Furthermore, scib-pipeline python environments require the R package kBET to be installed manually. Make sure that the environment variables are set as described above, so that R packages are correctly installed and located by rpy2. For example, when working with scib-pipeline, call

conda activate scib-pipeline
conda_prefix=$CONDA_PREFIX
conda deactivate
. envs/set_vars.sh $conda_prefix

Once environment variables have been set, you can install kBET:

conda activate <py-environment>
Rscript -e "devtools::install_github('theislab/kBET')"

Make sure you have rpy2==3.4.2 installed.
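
You can verify both the rpy2 version and the kBET installation from within the activated environment:

python -c "import rpy2; print(rpy2.__version__)"   # expected: 3.4.2
Rscript -e "library(kBET)"                         # should load without error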

R environments

YAML file location            Environment name     Description
envs/scIB-R-integration.yml   scIB-R-integration   Environment used for the results in the publication (doi: 10.1101/2020.05.22.111161)
envs/scib-R.yml               scib-R               Updated environment with R dependencies

The R environments require extra R packages to be installed manually. Don't forget to set the environment variables before installing anything through R. e.g. for scib-R:

conda activate scib-R
conda_prefix=$CONDA_PREFIX
conda deactivate
. envs/set_vars.sh $conda_prefix

Activate the environment and install all the R dependencies directly in R, or use the script install_R_methods.R:

conda activate <r-environment>
Rscript envs/install_R_methods.R

For the installation of Conos, please see the Conos GitHub repo.

We used these conda versions of the R integration methods in our study:

harmony_1.0
Seurat_3.2.0
conos_1.3.0
liger_0.5.0
batchelor_1.4.0
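
To confirm that the installed versions match this list, you can query them from the activated environment (package names follow the conda listing above):

Rscript -e 'for (p in c("harmony", "Seurat", "conos", "liger", "batchelor")) print(paste(p, packageVersion(p)))'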

Running the Pipeline

This repository contains a snakemake pipeline to run integration methods and metrics reproducibly for different data scenarios and preprocessing setups.

Generate Test data

A script in data/ can be used to generate test data. This is useful to ensure that the installation was successful before moving on to a larger dataset. More information on how to use the data generation script can be found in data/README.md.

Setup Configuration File

The parameters and input files are specified in config files, which can be found in configs/. In the DATA_SCENARIOS section you can define the input data per scenario. The main input per scenario is a preprocessed .h5ad file containing an anndata object with batch and cell type annotations.

TODO: explain different entries
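
As a hypothetical illustration of the structure (all key names apart from DATA_SCENARIOS are placeholders; consult the shipped configs under configs/ for the exact schema):

DATA_SCENARIOS:
  test_data:
    file: data/adata_norm.h5ad   # preprocessed anndata file
    batch_key: batch             # obs column holding batch annotations
    label_key: celltype          # obs column holding cell type annotations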

Pipeline Commands

To call the pipeline on the test data, start with a dry run:

snakemake --configfile configs/test_data.yaml -n

This gives you an overview of the jobs that will be run. In order to execute these jobs, call

snakemake --configfile configs/test_data.yaml --cores N_CORES

where N_CORES defines the number of threads to use.

More snakemake commands can be found in the documentation.

Visualise the Workflow

A dependency graph of the workflow can be created anytime and is useful to gain a general understanding of the workflow. Snakemake can create a graphviz representation of the rules, which can be piped into an image file.

snakemake --configfile configs/test_data.yaml --rulegraph | dot -Tpng -Grankdir=TB > dependency.png
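
Similarly, snakemake's --dag option produces a job-level graph instead of the rule-level one:

snakemake --configfile configs/test_data.yaml --dag | dot -Tpng > dag.png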


Tools

Tools that are compared include:

  • BBKNN
  • ComBat
  • Conos
  • DESC
  • fastMNN (batchelor)
  • Harmony
  • LIGER
  • MNN
  • SAUCIE
  • Scanorama
  • scANVI
  • scGen
  • scVI
  • Seurat v3 (CCA and RPCA)
  • trVAE
