tamor

Rapid automated Personal Cancer Genome Report (PCGR) generation using Illumina Dragen + Snakemake.

tl;dr

Data for large-scale tumor analysis projects can be spread over multiple DNA sequencing instrument runs; tamor simplifies the process of analyzing them.

A tab-delimited file is provided by the user to associate tumor and germline sequencing sample IDs with a study subject ID, along with a tissue-of-origin for the tumor. PCGR somatic variant reports (including germline susceptibility sequence variants) are generated using 1) this tab-delimited file, 2) the Illumina sequencer output (BCL or FASTQ), and 3) the Illumina Experiment Manager samplesheets CSV for the sequencing runs.

Tumor RNA analysis is in development.

Installation

  1. Nota bene: These instructions assume that you already have a Dragen server with software version 3.10 or higher and a working hg38 genome index.

  2. Download this code base:

git clone https://github.com/nodrogluap/tamor

  3. Install all the dependencies via conda or mamba (my preference because it's much, much faster):

mamba env create -f conda_tamor.yml

  4. Due to quirks in the conda dependencies spec, you will need to install the latest version of the Perl zlib library module manually, and the R hg38 genome sequence module:

mamba activate pcgrr
cpanm Compress::Raw::Zlib
R -e 'BiocManager::install("rtracklayer", force=TRUE);BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")'

  5. Download (~22GB) the cancer databases that CPSR and PCGR rely on for annotation of your discovered sequence variants:

BUNDLE=pcgr.databundle.hg38.20220203.tgz
wget http://insilico.hpc.uio.no/pcgr/$BUNDLE
tar zxvf $BUNDLE

Configuration

If the tamor directory is not at the front of your shell's PATH variable, you will need to prepend it so that tamor's ersatz bcftools command is used (place this in your .bashrc if you don't want to do this manually each time):

export PATH=/where/you/have/put/tamor:$PATH

Copy the config.yml.sample file to config.yml:

cp config.yml.sample config.yml

This is the file that you can customize for your site-specific settings. By default the config is set up to write result files under the current directory in output, and is expecting the input list of paired tumor-normal samples in a file called tumor_dna_paired_germline_dna_samples.tsv which has 5 columns to be specified:

subjectID<tab>tumorSampleName<tab>germlineSampleName<tab>TrueOrFalse_germline_contains_some_tumor<tab>PCGRTissueSiteNumber

The first three columns are subject to the following rules:

  • subjectID, tumorSampleName and germlineSampleName must contain no underscores
  • subjectID must be between 6 and 35 characters long (due to a PCGR naming limitation)
  • tumorSampleName and germlineSampleName must be the exact Sample_Name values you used in your Illumina sequencing sample spreadsheets

These sample sheets are the only metadata to which tamor has access. Place all the Illumina experiment sample sheets for your project into data/spreadsheets by default (see the samplesheets_dir setting in config.yml). They must be called runID.csv where runID is typically the Illumina folder name in the format YYMMDD_machineID_SideFlowCellID.

Tamor can start with either BCL files or FASTQ. If you are starting with BCLs, the full Illumina experiment output folders (which contain the requisite Data/Intensities/Basecalls subfolder) are expected in data/bcls/runID (see the bcl_dir setting in config.yml). Tamor will perform BCL to FASTQ conversion, writing the FASTQ output into data/analysis/primary/sequencer/runID (see the analysis_dir setting in config.yml; the default sequencer is novaseq6000).

If instead you are providing the FASTQs directly as input to tamor, they must also be in the data/analysis/primary/sequencerName/runID directory, with a corresponding Illumina Experiment Manager samplesheet data/spreadsheets/runID.csv. This is required because tamor reads the sample sheet to find the correspondence between Sample_Name and Sample_ID for each sequencing library. Analysis for DNA samples also differs from that for RNA samples, so the sample sheet must contain a Sample_Project column: sample projects whose names contain "RNA" are processed as RNA, and all others are assumed to be DNA. The samplesheet is also used to determine whether Unique Molecular Indices were used to generate the sequencing libraries, which requires different handling in Dragen during genotyping downstream.
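To make the sample sheet lookup described above concrete, here is a minimal sketch of how the Sample_Name to Sample_ID correspondence and the DNA/RNA split could be read from a sample sheet. This is not tamor's own code: the function names are hypothetical, the "RNA" match is assumed to be case-sensitive, and the [Data] section is assumed to be the last section of the sheet.

```python
import csv


def read_data_section(samplesheet_text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = samplesheet_text.splitlines()
    for i, line in enumerate(lines):
        # Section markers are often padded with trailing commas, e.g. "[Data],,,"
        if line.strip().strip(",").lower() == "[data]":
            # The line after the marker is the column header row.
            return list(csv.DictReader(lines[i + 1:]))
    raise ValueError("no [Data] section found in sample sheet")


def classify_samples(samplesheet_text):
    """Map each Sample_Name to a (Sample_ID, 'DNA' or 'RNA') tuple."""
    result = {}
    for row in read_data_section(samplesheet_text):
        name = (row.get("Sample_Name") or "").strip()
        if not name:
            continue  # ignore rows without a sample name
        project = (row.get("Sample_Project") or "").strip()
        # Assumption: a case-sensitive substring match on "RNA".
        kind = "RNA" if "RNA" in project else "DNA"
        result[name] = ((row.get("Sample_ID") or "").strip(), kind)
    return result
```

Running classify_samples on the text of data/spreadsheets/runID.csv before starting an analysis is a quick way to confirm that every sample name in your pairing file appears in a sample sheet, and that it is being treated as the molecule type you expect.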

If you provide FASTQ files directly, they must be timestamped later than the corresponding Illumina Experiment Manager spreadsheet. Otherwise Snakemake will assume that you have changed the spreadsheet in a consequential way and will try to automatically regenerate all FASTQs for that run, from potentially non-existent BCLs.
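The timestamp requirement above can be checked before launching Snakemake. The following is a small sketch, not part of tamor: the function name is hypothetical, and it treats a FASTQ whose modification time is equal to the sample sheet's as a problem, on the assumption that "later than" means strictly later.

```python
import os


def fastqs_not_newer_than_samplesheet(samplesheet_path, fastq_paths):
    """Return the FASTQ paths whose modification time is not strictly
    later than that of the sample sheet."""
    sheet_mtime = os.path.getmtime(samplesheet_path)
    return [p for p in fastq_paths if os.path.getmtime(p) <= sheet_mtime]
```

An empty list means the timestamps are in the order described above. Any path that is returned is a file that could trigger FASTQ regeneration for its run.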

The fourth column of the paired input sample TSV file is usually False, unless your germline sample is from a leukemia or perhaps a poor quality histology section from a tumor, in which case use True. This instructs Dragen to still report variants as somatic in the tumor analysis output even when they are present at low frequency in the germline sample (see the default of 0.05 under tumor_in_normal_tolerance_proportion in config.yml).

For the fifth column, the list of tissue site numbers for the version of PCGR included here is:

                        0 = Any
                        1 = Adrenal Gland
                        2 = Ampulla of Vater
                        3 = Biliary Tract
                        4 = Bladder/Urinary Tract
                        5 = Bone
                        6 = Breast
                        7 = Cervix
                        8 = CNS/Brain
                        9 = Colon/Rectum
                        10 = Esophagus/Stomach
                        11 = Eye
                        12 = Head and Neck
                        13 = Kidney
                        14 = Liver
                        15 = Lung
                        16 = Lymphoid
                        17 = Myeloid
                        18 = Ovary/Fallopian Tube
                        19 = Pancreas
                        20 = Peripheral Nervous System
                        21 = Peritoneum
                        22 = Pleura
                        23 = Prostate
                        24 = Skin
                        25 = Soft Tissue
                        26 = Testis
                        27 = Thymus
                        28 = Thyroid
                        29 = Uterus
                        30 = Vulva/Vagina
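The rules for the five columns described in this section can be checked mechanically before running an analysis. Below is a minimal sketch of such a check, not tamor's own validation: the function name is hypothetical, and it assumes the pairing file has no header row and that the fourth column is written exactly as True or False.

```python
import csv
import io


def validate_pairing_rows(tsv_text):
    """Return a list of human-readable problems found in the pairing TSV text."""
    problems = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 5:
            problems.append(f"line {line_no}: expected 5 columns, found {len(row)}")
            continue
        subject, tumor, germline, tumor_in_normal, site = (c.strip() for c in row)
        names = (("subjectID", subject),
                 ("tumorSampleName", tumor),
                 ("germlineSampleName", germline))
        for label, value in names:
            if "_" in value:
                problems.append(f"line {line_no}: {label} contains an underscore")
        if not 6 <= len(subject) <= 35:
            problems.append(f"line {line_no}: subjectID must be 6 to 35 characters")
        # Assumption: the literal strings True and False are the accepted values.
        if tumor_in_normal not in ("True", "False"):
            problems.append(f"line {line_no}: fourth column must be True or False")
        # Tissue site numbers run from 0 (Any) to 30 (Vulva/Vagina).
        if not (site.isascii() and site.isdigit() and 0 <= int(site) <= 30):
            problems.append(f"line {line_no}: tissue site must be an integer from 0 to 30")
    return problems
```

For example, passing the contents of tumor_dna_paired_germline_dna_samples.tsv to validate_pairing_rows returns an empty list when every row follows the rules, and one message per violation otherwise. Note that this does not check that the sample names actually appear as Sample_Name values in your sample sheets.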

Running a paired tumor-normal analysis

Any time you want to use tamor, you must be sure to have the conda/mamba environment loaded:

mamba activate pcgrr

Once the sample pairing file mentioned earlier is ready, you can simply run Snakemake to generate the FASTQs (optionally), BAMs, VCFs, and CPSR/PCGR reports:

snakemake --cores=1

The default outputs are in a directory called data/output/pcgr/subjectID_tumorSampleName_germlineSampleName. The most relevant document may be the self-contained Web page subjectID.pcgr_acmg.grch38.flexdb.html.
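If you want to locate reports programmatically, the naming pattern above can be expressed as a small helper. This is only a sketch based on the default paths stated here: the function name is hypothetical, and output_root must be adjusted if you have changed the output location in your configuration.

```python
from pathlib import PurePosixPath


def pcgr_report_path(subject_id, tumor_name, germline_name,
                     output_root="data/output/pcgr"):
    """Build the expected path of the PCGR HTML report for one sample pair."""
    run_dir = PurePosixPath(output_root) / f"{subject_id}_{tumor_name}_{germline_name}"
    return run_dir / f"{subject_id}.pcgr_acmg.grch38.flexdb.html"
```

Calling pcgr_report_path with the first three columns of a row from the pairing file gives the path where the report for that pair is expected once the run has completed.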

In a multi-user system, it is imperative to use a queuing system such as slurm so that only one job at a time is submitted to Dragen v4.x. Once slurm is installed and configured on your Dragen system, Snakemake support for slurm is enabled by invoking Snakemake like so:

snakemake --cluster sbatch --cores=2

Screenshot of a sample Personal Cancer Genome Report, FlexDB version

Acknowledgements

This project is being developed in support of the Terry Fox Research Institute's Marathon of Hope Cancer Care Network activities within the Prairie Cancer Research Consortium.
