Giter Club home page Giter Club logo

cancerpipeline's Introduction

######################################################################

CancerPipeline

This is a simple pipeline to automate the processing of cancer samples downstream of variant calling.

The pipeline steps are as follows:

Annotation

  • Filter variants according to coverage, allele frequency or quality cutoffs.
  • Annotate variants with Annovar and SNPeff.
  • Transform sample vcfs into mafs.
  • Unite all samples into a common output maf/vcf.

MuSiC

  • Prepare input files for MuSiC (i.e. intersection of samples available as bam and samples in the united maf).
  • Run all MuSiC tools sequentially.

MutSigCV

  • Run MutSigCV on the united maf.

CancerPipeline is written in Python/ruffus and aware of which steps have been run before on which files. Each rerun of a pipeline will thus only touch the files that have changed since the last time it was run, not stupidly rerun all of them.

##Input

CancerPipeline allows the definition of different projects, each with different filtering settings, a different name etc. If new samples arrive, a project can easily be rerun to bring all folders, files and functional analyses for that project up to date. In the following examples, our project is called myproject.

The pipeline expects all unfiltered input vcf to be located in one folder. Different projects can share the same raw vcf folder.

.
└── root
    └── vcfs_raw
        ├── sample1.vcf
        └── sample2.vcf

CancerPipeline will then create output folders per project for the filtered and annotated vcfs and an output folder for the functional analysis tools.

.
└── root
    ├── analysis_myproject
    │   ├── input
    │   │   └── myproject_input.txt
    │   └── output
    │       └── myproject_sign_genes.txt
    ├── vcfs_myproject
    │   ├── sample1_annotated.tsv
    │   ├── sample1.vcf
    │   ├── sample2_annotated.tsv
    │   └── sample2.vcf
    └── vcfs_raw
        ├── sample1.vcf
        └── sample2.vcf

File locations and settings per project are saved into a regular text file in Python configuration format (see below). The file contains information on each project that was run, serving both as a config file for the current runs and a log file for past runs.

Example project definition file

[myproject]

root = /home/user/root/
raw_vcf_folder = /home/user/root/vcfs_raw/
bam_folder = /home/user/bams/
bed = /home/user/root/data/myproject.bed
ref = /home/user/root/data/hg19.fasta

cpus = 1
verbose_logging = False 
min_cov = 100
min_varfreq = 0.05

Location settings

Use absolute paths!

  • root: the root folder of this pipeline
  • raw_vcf_folder: the folder containing the unfiltered vcfs
  • bed: the bed file for this panel
  • ref: path to the reference sequence (in fasta format)
  • bam_folder (optional): the folder containing the bam and bam.bai files. If not supplied, MuSiC will not be started.
  • whitelist (optional): a list of samples from the raw_vcf_folder to process (default: use whole folder)
  • blacklist (optional): a list of samples from the raw_vcf_folder to NOT process (default: use whole folder)

Filtering settings (optional)

  • min_cov: Minimum accepted coverage for a variant position
  • min_varfreq: Minimum variant frequency for a variant position
  • min_qual: Minimum quality for a variant position

Run flags (optional)

  • vcf_type: select the type of input vcf (iontorrent/illumina_strelka) (default: iontorrent)
  • cpus: number of CPUs to use in parallel (default: 1)
  • functional_analysis: if False, don't run MutSigCV and MuSiC (default: True)
  • verbose_logging: if set to True, will result in more output while running (default: False)
  • version_numbers_not_in_blacklist: legacy flag, don't use

##Usage

If there's only one project in the definition file, the only argument needed is the name of the file:

> python CancerPipeline.py project_definitions.txt

If there's more than one project defined, we need to select a project with the 2nd argument::

> python CancerPipeline.py project_definitions.txt myproject

##Output

For each unfiltered input mysample, the following outputs are created in a new folder named ./myproject/:

  • mysample.vcf: all variants that passed the filtering step
  • mysample.maf: all filtered variants, in maf format
  • mysample_annotated.tsv: all filtered variants with annotation from the vcf itself, Annovar and SNPeff

The following files are created for all unfiltered inputs together:

  • all_samples_myproject.maf: all filtered variants in maf format
  • all_samples_myproject.tsv: all filtered and annotated variants

The folder analysis_myproject contains intermediate and output files for MuSiC and MutSigCV. The list of significant genes created by is in ./analysis_myproject/output/myproject_mutsigcv.sig_genes.txt

##Dependencies

CoverageCheck was developed on Ubuntu 13.10 with Python 2.7 and the Python packages numpy (1.8.1+), pandas (0.13.1+) and ruffus (2.4+).

The pipeline depends on the installation of the following 3rd-party tools:

The install directories of these tools need to be set in ToolConfig.py.

cancerpipeline's People

Contributors

aweller avatar

Stargazers

0x1orz avatar Igor Kroitor avatar  avatar Joakim Karlsson avatar

Watchers

James Cloos avatar  avatar Roozbeh avatar Mako Medical Lab Bioinformatics Team avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.