
LuCA - The single-cell Lung Cancer Atlas


Salcher, S., Sturm, G., Horvath, L., Untergasser, G., Kuempers, C., Fotakis, G., ... & Trajanoski, Z. (2022). High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell. doi:10.1016/j.ccell.2022.10.008

The single-cell lung cancer atlas is a resource integrating more than 1.2 million cells from 309 patients across 29 datasets.

The atlas is publicly available for interactive exploration through a cell-x-gene instance. We also provide h5ad objects and a scArches model that allows projecting custom datasets onto the atlas.

This repository contains the source code to reproduce the single-cell data analysis for the paper. The analyses are wrapped into nextflow pipelines, all dependencies are provided as singularity containers, and input data are available from zenodo.

For clarity, the project is split up into two separate workflows:

  • build_atlas: Takes one AnnData object with UMI counts per dataset and integrates them into an atlas.
  • downstream_analyses: Runs analysis tools on the annotated, integrated atlas and produces plots for the publication.

The build_atlas step requires specific hardware (CPU + GPU) for exact reproducibility (see the notes on reproducibility) and is computationally expensive. Therefore, the downstream_analyses step can also operate on precomputed results of the build_atlas step, which are available from zenodo.

Launching the workflows

1. Prerequisites

  • Nextflow, version 21.10.6 or higher
  • Singularity/Apptainer, version 3.7 or higher (tested with 3.7.0-1.el7)
  • A high-performance computing (HPC) cluster or cloud setup. The whole analysis will consume several thousand CPU hours.

2. Obtain data

Before launching the workflow, you need to obtain the input data and singularity containers from zenodo. First, clone this repository:

git clone https://github.com/icbi-lab/luca.git
cd luca

Then, within the repository, download the data archives and extract them to the corresponding directories:

# singularity containers
curl "https://zenodo.org/record/7227571/files/containers.tar.xz?download=1" | tar xvJ

# input data
curl "https://zenodo.org/record/7227571/files/input_data.tar.xz?download=1" | tar xvJ

# OPTIONAL: obtain intermediate results if you just want to run the `downstream_analyses` workflow
curl "https://zenodo.org/record/7227571/files/build_atlas_results.tar.xz?download=1" | tar xvJ

Note that some steps of the downstream analysis depend on an additional cohort of checkpoint-inhibitor-treated patients, which is only available under a protected-access agreement. For this reason, these data are not included in our data archive. You'll need to obtain the dataset yourself and place it in the data/14_ici_treatment/Genentech folder. The corresponding analysis steps are skipped by default; you can enable them by adding the --with_genentech flag to the nextflow run command.

3. Configure nextflow

Depending on your HPC/cloud setup, you will need to adjust the nextflow profile in nextflow.config to tell nextflow how to submit the jobs. Using a withName:... selector, dedicated resources can be assigned to GPU jobs. You can get an idea by checking out the icbi_lung profile, which we used to run the workflow on our on-premise cluster. Only the build_atlas workflow makes use of GPU processes.
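As a rough sketch, a custom profile with a withName: selector for GPU jobs could look like the following. The executor, queue names, and the process-name pattern are illustrative assumptions, not the actual names used in this repository's conf/ directory:

```groovy
// Illustrative nextflow.config profile sketch -- adapt executor,
// queues, and process-name patterns to your own cluster.
profiles {
    my_cluster {
        process {
            executor = 'slurm'              // or 'sge', 'pbs', ...
            // Route GPU processes (hypothetical name pattern) to a GPU queue
            withName: '.*SCANVI.*' {
                queue          = 'gpu'
                clusterOptions = '--gres=gpu:1'
            }
        }
        singularity.enabled = true
    }
}
```

The profile would then be selected with -profile my_cluster in the nextflow run commands below.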

4. Launch the workflows

# newer versions of nextflow are incompatible with the workflow. By setting this variable
# the correct version will be used automatically.
export NXF_VER=22.04.5

# Run `build_atlas` workflow
nextflow run main.nf --workflow build_atlas -resume -profile <YOUR_PROFILE> \
    --outdir "./data/20_build_atlas"

# Run `downstream_analysis` workflow
nextflow run main.nf --workflow downstream_analyses -resume -profile <YOUR_PROFILE> \
    --build_atlas_dir "./data/20_build_atlas" \
    --outdir "./data/30_downstream_analyses"

As you can see, the downstream_analyses workflow requires the output of the build_atlas workflow as input. The intermediate results from zenodo contain the output of the build_atlas workflow.

Structure of this repository

  • analyses: Place for e.g. jupyter/rmarkdown notebooks, grouped by their respective (sub-)workflows.
  • bin: executable scripts called by the workflow
  • conf: nextflow configuration files for all processes
  • containers: place for singularity image files. Not part of the git repository; created by the download command above.
  • data: place for input data and results in different subfolders. Gets populated by the download commands and by running the workflows.
  • lib: custom libraries and helper functions
  • modules: nextflow DSL2.0 modules
  • preprocessing: scripts used to preprocess data upstream of the nextflow workflows. The processed data are part of the archives on zenodo.
  • subworkflows: nextflow subworkflows
  • tables: contains static content that should be under version control (e.g. manually created tables)
  • workflows: the main nextflow workflows

Build atlas workflow

The build_atlas workflow comprises the following steps:

  • QC of the individual datasets based on detected genes, read counts and mitochondrial fractions
  • Merging of all datasets into a single AnnData object. Harmonization of gene symbols.
  • Annotation of two "seed" datasets as input for scANVI.
  • Integration of datasets with scANVI
  • Doublet removal with SOLO
  • Annotation of cell-types based on marker genes and unsupervised leiden clustering.
  • Integration of additional datasets with transfer learning using scArches.
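The pipeline performs the QC step with scanpy; the underlying per-cell metrics can be sketched with plain numpy. The thresholds and the "MT-" gene-name convention below are illustrative assumptions, not the cutoffs actually used in the atlas:

```python
import numpy as np


def qc_mask(counts, gene_names, min_genes=200, min_counts=1000, max_mito_frac=0.2):
    """Return a boolean mask of cells passing simple QC.

    counts:     (cells x genes) UMI count matrix
    gene_names: sequence of gene symbols
    Thresholds are illustrative, not the atlas defaults.
    """
    counts = np.asarray(counts)
    n_genes = (counts > 0).sum(axis=1)       # detected genes per cell
    n_counts = counts.sum(axis=1)            # total UMIs per cell
    # mitochondrial genes are conventionally prefixed "MT-" (assumption)
    mito = np.array([g.upper().startswith("MT-") for g in gene_names])
    mito_frac = counts[:, mito].sum(axis=1) / np.maximum(n_counts, 1)
    return (n_genes >= min_genes) & (n_counts >= min_counts) & (mito_frac <= max_mito_frac)
```

In the actual workflow these filters are applied per dataset, with dataset-specific cutoffs, before merging.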

Downstream analysis workflow

  • Patient stratification into immune phenotypes
  • Subclustering and analysis of the neutrophil cluster
  • Differential gene expression analysis using pseudobulk + DESeq2
  • Differential analysis of transcription factors, cancer pathways and cytokine signalling using DoRothEA, PROGENy, and CytoSig.
  • Copy number variation analysis using SCEVAN
  • Cell-type composition analysis using scCODA
  • Association of single cells with phenotypes from bulk RNA-seq datasets with Scissor
  • Cell2cell communication based on differential gene expression and the CellphoneDB database.
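The pseudobulk step aggregates raw counts before handing them to DESeq2; the aggregation itself can be sketched with pandas. The 'patient' and 'cell_type' column names are illustrative assumptions, not the atlas' actual obs keys:

```python
import pandas as pd


def pseudobulk(counts: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Sum single-cell counts into one row per (patient, cell_type).

    counts: cells x genes count DataFrame (index = cell barcodes)
    meta:   DataFrame with 'patient' and 'cell_type' columns, same index
    """
    groups = meta.loc[counts.index, ["patient", "cell_type"]]
    return counts.groupby([groups["patient"], groups["cell_type"]]).sum()
```

Summing raw UMI counts (rather than averaging normalized values) preserves the count distribution that DESeq2's model expects.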

Contact

For reproducibility issues or any other requests regarding single-cell data analysis, please use the issue tracker. For anything else, you can reach out to the corresponding author(s) as indicated in the manuscript.

Notes on reproducibility

We aimed to make this workflow reproducible by providing all input data, containerizing all software dependencies, and integrating all analysis steps into a nextflow workflow. In theory, this allows executing the workflow on any system that can run nextflow and singularity. Unfortunately, some single-cell analysis algorithms (in particular scVI/scANVI and UMAP) yield slightly different results on different hardware, trading computational reproducibility for a significantly faster runtime. For example, results will differ when changing the number of cores, or when running on a CPU/GPU of a different architecture. See also scverse/scanpy#2014 for a discussion.

Since the cell-type annotation depends on clustering, and the clustering depends on the neighborhood graph, which again depends on the scANVI embedding, running the build_atlas workflow on a different machine will likely break the cell-type labels.

Below is the hardware we used to execute the build_atlas workflow. Theoretically, any CPU/GPU of the same generation should produce identical results, but we have not had the chance to test this yet.

  • Compute node CPU: Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz (2x)
  • GPU node CPU: EPYC 7352 24-Core (2x)
  • GPU node GPU: Nvidia Quadro RTX 8000 GPU


Issues

RAM usage in SCISSOR_TCGA

Hi, thank you for your work. When performing the SCISSOR_TCGA step, I encountered a job requiring over three thousand GB of RAM. How much memory did you use for SCISSOR_TCGA?

Question about cluster idents in Seurat object

Hello authors. Thanks for sharing the valuable data and code.
While reproducing the data analysis using Seurat, I had a question about the differences between three idents that can be used to annotate the clusters: cell_type, cell_type_major, and cell_type_tumor.
Sometimes, the same cluster is named differently depending on which ident is used for annotation.
For example, a cluster annotated as "type I/II pneumocytes" with the cell_type ident is annotated as "Alveolar cell type 1/2" when using the cell_type_major ident.
Could you explain the differences between these three idents, and which one should be used as the principal annotation?
Thank you so much.
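One quick way to see how two annotation levels relate is to cross-tabulate the label columns from the object's cell metadata. This is a sketch assuming the metadata is available as a pandas DataFrame (e.g. the obs table of the provided h5ad objects) with these column names:

```python
import pandas as pd


def compare_idents(obs: pd.DataFrame,
                   fine: str = "cell_type",
                   coarse: str = "cell_type_major") -> pd.DataFrame:
    """Cross-tabulate a fine annotation column against a coarse one,
    showing how fine labels roll up into coarse labels."""
    return pd.crosstab(obs[fine], obs[coarse])
```

Each row of the resulting table shows which coarse label(s) a fine label maps to, making one-to-many relationships between the ident levels visible at a glance.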

Project new dataset to the atlas

Dear authors,

Thanks for sharing this valuable data and workflow. I wonder whether you have a script for projecting a new dataset onto the atlas, or whether there is a parameter/argument in the nextflow pipeline that can do such a projection. Many thanks.

Best,
Nan

Modules scanpy_helpers / AnnotationHelper

Hi,

I love exploring these thrilling datasets and code, but I ran into an error I want to ask about.

From the code at analyses/37_subclustering/37_neutrophil_subclustering.py
I tried to import following modules and functions

"from scanpy_helpers.annotation import AnnotationHelper
import scanpy_helpers as sh"

, but I encountered errors saying there is no module named scanpy_helpers and no AnnotationHelper.

From googling, it seems there is no installable package called scanpy_helpers or AnnotationHelper.
How can I use these modules? Is there a specific way to install them?

Thanks
bangbattlers
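As the "Structure of this repository" section above notes, custom libraries and helper functions live in the repository's lib folder rather than on PyPI. A sketch of making them importable from a notebook follows; the exact location of the scanpy_helpers package under lib is an assumption:

```python
import sys
from pathlib import Path

# The repo keeps helper code in `lib/` (see "Structure of this repository").
# Assumption: scanpy_helpers is a package somewhere under that folder.
repo_root = Path("luca")  # path to your clone of the repository
sys.path.insert(0, str(repo_root / "lib"))

# After this, `import scanpy_helpers` should resolve if the package is there.
```

This is also how the workflow's own notebooks typically see in-repo helpers: by putting the helper directory on the Python path rather than installing a package.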

Index 0 out of bounds for length 0 error

Dear authors,
Thank you very much for your work. I am new to nextflow. After downloading the related data, I followed the command you give:
"nextflow run main.nf --workflow downstream_analyses -resume -profile icbi_lung --build_atlas_dir ./data/20_build_atlas --outdir ./data/30_downstream_analyses"

This error always occurs. I run this on a single server instead of an HPC. I don't know why this error occurred. Is it related to the environment I am using?

Here is the error info from the log file: [screenshots of the error and log omitted]

Is there rds/Seurat data available for the extended and core atlas with STK11 mutation information?

Hi, I am not a professional bioinformatician. I am interested in the data with or without STK11 mutations. Could you please let me know which dataset I can download? It would be great if the dataset included mutation information for STK11 as well as the other key oncogenes mentioned in the paper. I can only start with RDS or Seurat data. Could you please send me a direct link to download the data?

I sincerely appreciate your help.

Regards,
Shawn

Loading the h5ad

Congratulations on a very nice preprint.

I am trying to load the extended atlas from the .h5ad provided. In R, using loomR's connect() function I encounter the following error:

lfile  <- connect( filename = 'data/extended_atlas.h5ad' )
Error in validateLoom(object = self) :
  There can only be one dataset at the root of the loom file

When I tried using python:

out.file =  scanpy.read_10x_h5 ('data/extended_atlas.h5ad')
ValueError: 'data/extended_atlas.h5ad' contains more than one genome. For legacy 10x h5 files you must specify the genome if more than one is present. Available genomes are: ['X', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var']

So I then attempted using the raw "genome," and encountered this error:

Exception: File is missing one or more required datasets.

I work mostly in R for scRNA-seq analyses, so I don't have much experience with the h5ad format. How can I go about loading this file?

Epithelial Cell Types

Hi!

We were wondering what happened to the different epithelial cell types in 33_epithelial_cells

In our project, we are interested in epithelial cells (Goblet cells in particular). Were these cell types filtered out during preprocessing or not found at all?

In the script 33_epithelial_cells.py Goblet cells are mentioned, but the plots aren't visible to give more context.

Thanks in advance!

malignant cells labeled in normal samples

Hi,

I downloaded the extended data atlas from CellxGene, and was surprised to find many cells labeled as "malignant" in supposedly normal tissues (see screenshot containing malignant cell counts per disease/study/origin below - data was aggregated from the cell metadata in the downloaded h5ad file). Can you please help me understand how these cell type labels were generated, and explain the presence of these malignant labels in normal tissues? I thought perhaps it was due to mislabeling during transfer learning, but many of these cells come from the core atlas datasets.

Thanks,
Rebecca

[screenshot of malignant cell counts per disease/study/origin omitted]
