cgpbox

The cgpbox project encapsulates the core Cancer Genome Project analysis pipeline in an easy-to-use docker image:

Docker Repository on Quay

The pipeline is optimised for somatic variation calling on Illumina paired-end sequencing data mapped with BWA-mem.

Analysis performed

cgpbox will perform the following analysis (not necessarily in this order):

  • Basic genotype call using the standard Sequenom SNP locations
    • GRCh37 locations here
  • A comparison of the called genotype between tumour and normal.
  • An evaluation of gender using 4 chrY specific SNPs
    • GRCh37 locations here
    • The list is not ordered: the first 2 SNPs are part of the standard Sequenom QCplex, and the additional ones are included for improved accuracy in patchy sequencing.
  • Copy Number Variation (CNV) using ascatNgs
  • Insertion and deletion (InDel) calling using cgpPindel
  • Single Nucleotide Variant (SNV) calling using CaVEMan
  • Gene annotation of SNV and InDel calls using VAGrENT
  • Structural Variation (SV) calls using BRASS
    • Basic gene annotation via grass

Running the docker image

The bulk of this repository manages the building of the docker image so that users don't have to build it themselves.

Provided you have a base system configured to run docker then you only need to fulfil the following requirements:

  1. Ability to provide large workspace as a volume mount point.
    • Workspace needs to be ~25% of the sum of your BAM inputs.
    • Normally it is simplest to place the BAMs in the same area.
  2. 24 cores or more for sensible turn around times.
  3. ~4GB RAM per core

The required resources are unfortunately large but the system does run many elements in parallel to reduce wall-time.
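As a rough pre-flight check, the sizing guidelines above can be turned into simple arithmetic. The following is a minimal sketch, not part of cgpbox: the function names `workspace_bytes_needed` and `ram_gb_for_cores` are hypothetical, and the 25% and 4GB-per-core figures are taken directly from the requirements listed above.

```shell
# Hypothetical helpers (not shipped with cgpbox) that apply the
# guidelines above: workspace ~25% of summed BAM size, ~4GB RAM per core.

# Print ~25% of the combined size (in bytes) of the files given as arguments.
workspace_bytes_needed() {
  local total=0 f size
  for f in "$@"; do
    size=$(wc -c < "$f") || return 1
    total=$((total + size))
  done
  echo $((total / 4))
}

# Print the RAM (in GB) suggested for a given core count at ~4GB per core.
ram_gb_for_cores() {
  echo $(( $1 * 4 ))
}

# Example usage (paths are placeholders):
#   workspace_bytes_needed /some/large/storage/tumour.bam /some/large/storage/normal.bam
#   ram_gb_for_cores "$(nproc)"
```

Comparing the `ram_gb_for_cores` result with the memory actually installed gives a quick indication of whether the CPU count may need to be capped.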


Test run

The current test dataset takes quite a long time to run. We are working to find more suitable data that we can share. Please see Running your data to use your own sample pair.

To run the pre-built docker image with the test data log into a docker enabled host and run the following:

$ cd ~
$ curl -sSL --retry 10 -O https://raw.githubusercontent.com/cancerit/cgpbox/master/examples/run.params
$ export MOUNT_POINT=/some/large/storage
$ (docker run --rm -v $MOUNT_POINT:/datastore -v ~/run.params:/datastore/run.params quay.io/wtsicgp/cgp_in_a_box > ~/run.out) >& ~/run.err &

$MOUNT_POINT should be a storage area with ~25GB of space for this test.

Result files will be written to $MOUNT_POINT/output
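Before starting the test it may be worth confirming that the mount point really has the space available. This is a small sketch using the standard `df` utility; the helper names `free_kb` and `has_free_gb` are hypothetical, and the 25GB threshold is the figure quoted above for the test data.

```shell
# Hypothetical helpers (not part of cgpbox) for checking free space.

# Print the free space (in KB) on the filesystem holding the given directory.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the directory has at least the given number of GB free.
has_free_gb() {
  [ "$(free_kb "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example usage:
#   has_free_gb "$MOUNT_POINT" 25 || echo "Less than 25GB free on $MOUNT_POINT" >&2
```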

Running your data

To analyse your own tumour/normal pairs of BAM files you can modify the example run.params file indicated in Test run.

The run.params file contains comments to assist you but here are the critical items:

  • NAME_* - Should match the sample names found in the headers of the BAM files.
  • *_MT - Refers to data linked to the MuTant/tumour sample.
  • *_WT - Refers to data linked to the WildType/Normal sample.
  • BAM_* - Paths to the input BAM files, as seen from inside the docker container (in the Test run command the mount point appears as /datastore).

You can also force the CPU count to a specified value. By default the image will use all cores available to the docker container. Should you need more memory available per process, you can set the CPU count lower than the actual number of cores, for example CPU=4 (uncommenting the line if needed).
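For orientation, the critical entries might look like the sketch below. The variable names are inferred from the patterns listed above (NAME_*, BAM_*, *_MT, *_WT) and the sample names are those of the test data; the BAM paths are hypothetical. Treat examples/run.params as the authoritative reference.

```shell
# Sketch of the critical run.params entries. Variable names are inferred
# from the patterns described above; the paths are hypothetical.
NAME_MT='HCC1143'                         # tumour sample name, as in the BAM header
NAME_WT='HCC1143_BL'                      # normal sample name, as in the BAM header
BAM_MT='/datastore/input/HCC1143.bam'     # path as seen inside the container
BAM_WT='/datastore/input/HCC1143_BL.bam'  # path as seen inside the container
CPU=4                                     # optional: cap the number of cores used
```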

Please see Input requirements.

PRE-EXEC array

This is an optional section to provide actions that should be performed prior to the main analysis being triggered. In the example run.params this downloads and unpacks the test dataset.

The uses are only limited by the tools available within the docker image (S3 tools are already included). If there is a good case for additional tools please raise an issue.

If not needed comment out or delete.

POST-EXEC array

This is an optional section to provide actions that should be performed after the main analysis has completed. In the example run.params this shows how you could automatically trigger an upload to an S3 bucket.

The uses are only limited by the tools available within the docker image (S3 tools are already included). If there is a good case for additional tools please raise an issue.

If not needed comment out or delete.
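To illustrate the idea of an array of actions, here is a minimal bash sketch that runs each entry in order and stops at the first failure. The array name `PRE_EXEC`, the `run_actions` function, and the example commands are all assumptions made for illustration; this is not necessarily how cgpbox itself executes the entries, so check examples/run.params for the real syntax.

```shell
# Assumption: the actions are held in a bash array of command strings.
# The name PRE_EXEC and these placeholder commands are illustrative only.
PRE_EXEC=(
  'echo "fetching inputs"'
  'echo "unpacking inputs"'
)

# Run each command string in order, stopping at the first failure.
run_actions() {
  local cmd
  for cmd in "$@"; do
    echo "running: $cmd"
    bash -c "$cmd" || { echo "failed: $cmd" >&2; return 1; }
  done
}

run_actions "${PRE_EXEC[@]}"
```

Keeping each action as a separate array element makes it easy to comment out a single step without disturbing the others.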

Other params not documented here

There are some other parameters that have not been documented here as they relate to future features. Basic notes are included with all parameters in examples/run.params.

Input requirements

cgpbox expects to be provided with a pair of BAM files (one tumour, one normal) each:

  • Mapped with BWA-mem
    • Having valid ReadGroup headers including LB and SM tags
    • See SAM/BAM specification here for more details.
  • Duplicates marked.
  • BAM indexes created.
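The read-group requirement can be checked before a long run is started. The sketch below reads a SAM header from standard input and verifies that every @RG line carries both an LB (library) and an SM (sample) tag, as defined in the SAM specification. The function name `check_rg_tags` is hypothetical; in practice the header would typically come from `samtools view -H`.

```shell
# Hypothetical helper: reads a SAM header on stdin and succeeds only if
# there is at least one @RG line and every @RG line has LB and SM tags.
check_rg_tags() {
  awk -F'\t' '
    $1 == "@RG" {
      n++
      lb = 0; sm = 0
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^LB:/) lb = 1
        if ($i ~ /^SM:/) sm = 1
      }
      if (!lb || !sm) bad++
    }
    END { exit (n > 0 && bad == 0) ? 0 : 1 }
  '
}

# Example usage (requires samtools on the host; the path is a placeholder):
#   samtools view -H /some/large/storage/tumour.bam | check_rg_tags \
#     || echo "Read group headers are missing LB or SM tags" >&2
```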

Data mapped in different fashion

Data mapped using a different algorithm may process successfully; however, we are unlikely to be able to provide detailed support.

If you already have a mapped BAM you can re-map with all of the above handled for you using the bwa_mem.pl script which is part of PCAP-core.

Monitoring

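A simple webpage has been created so that you can monitor the progress of your job. It only provides evidence that things are progressing, and it requires python on the base host (not inside the docker container). The command below uses the Python 2 SimpleHTTPServer module; under Python 3 the equivalent module is http.server: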

$ cd $MOUNT_POINT/site
$ sudo python -m SimpleHTTPServer 80 >& ~/monitor.log&

Then point your browser at:

http://yourhost/html/index.html

Example display: startup

Example display: mid run

Output

On completion the data files used to generate the web-site are copied into the output location along with files containing timing/memory data. These can be found at $MOUNT_POINT/output/*.time and are of the form:

$ cat ascat.time
command:ascat.pl -o /datastore/output/HCC1143_vs_HCC1143_BL/ascat -t /datastore/output/tmp/HCC1143.bam -n /datastore/output/tmp/HCC1143_BL.bam -s /datastore/reference_files/ascat/SnpLocus.tsv -sp /datastore/reference_files/ascat/SnpPositions.tsv -sg /datastore/reference_files/ascat/SnpGcCorrections.tsv -r /datastore/reference_files/genome.fa -q 20 -g L -rs Human -ra GRCh37 -pr WGS -pl ILLUMINA -c 8
real:1390.62
user:2106.95
sys:40.48
text:0k
data:0k
max:2183804k
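Since each .time file is a set of `key:value` lines, a short awk wrapper is enough to summarise wall time and peak memory per tool. This is a sketch only: the function name `summarise_time_file` is hypothetical, and it assumes that `real` is in seconds and that the `k` suffix on `max` means units of 1024 bytes.

```shell
# Hypothetical helper: print "<file>  <wall time in min>  <peak memory in GiB>"
# for one .time file. Assumes real is in seconds and max is in KiB.
summarise_time_file() {
  awk -v name="$1" '
    /^real:/ { real = substr($0, 6) }
    /^max:/  { max = substr($0, 5); sub(/k$/, "", max) }
    END { printf "%s\t%.1f min\t%.2f GiB\n", name, real / 60, max / 1048576 }
  ' "$1"
}

# Example usage:
#   for f in "$MOUNT_POINT"/output/*.time; do summarise_time_file "$f"; done
```

For the ascat.time example shown above this reports roughly 23.2 minutes of wall time and a peak of about 2.08 GiB.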

Additionally all of the data in the output folder is packaged as a tar.gz for easy retrieval (example data set: $MOUNT_POINT/result_HCC1143_vs_HCC1143_BL.tar.gz). Please see examples/run.params for an example of using post-exec to push your data to AWS.

Primary analysis software

It incorporates the following cancerit projects:

Dependencies

Additionally these have dependencies on the following software packages, which may have different license restrictions to the cancerit packages:

LICENSE

Copyright (c) 2016 Genome Research Ltd.

Author: Cancer Genome Project [email protected]

This file is part of cgpbox.

cgpbox is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

  1. The usage of a range of years within a copyright statement contained within this distribution should be interpreted as being equivalent to a list of years including the first and last year specified and all consecutive years between them. For example, a copyright statement that reads ‘Copyright (c) 2005, 2007-2009, 2011-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2007, 2008, 2009, 2011, 2012’ and a copyright statement that reads ‘Copyright (c) 2005-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012’.

cgpbox's People

Contributors

keiranmraine

cgpbox's Issues

Caveman filtering and flagging

Hello,

We are interested in using the cgpbox container to teach a workshop on the analysis of somatic mutations. However, we're struggling to see what filtering is being performed on the caveman calls.

Running the container on the example dataset (ftp://ftp.sanger.ac.uk/pub/cancer/cgpbox/testdata.tar) yields 2027 calls in the file "HCC1143_vs_HCC1143_BL.muts.ids.vcf.gz"; which I presume are the raw somatic calls before filtering?

In the file "HCC1143_vs_HCC1143_BL.flagged.muts.vcf.gz" there are still 2027 calls, however the only change seems to be that the FILTER flag has changed from "." to PASS. In other words, all the calls in HCC1143_vs_HCC1143_BL.flagged.muts.vcf.gz are still passing the filter.

Is this the expected behaviour? It seems as though the caveman filtering step is not filtering / flagging anything?

Do I need to run the script cgpFlagCaVEMan.pl manually to achieve filtering?

Regards,

Mark

Dockerfile can't create image

I tried to use the Dockerfile to create an image, but ran into some problems.

  1. cgpBigWig is not installed before PCAP-core
  2. no python

Could you please update the latest Dockerfile?

Thanks!

Cannot build Dockerfile

I cannot build the Dockerfile. Fails at # PCAP-core. Using release 2.0.

Building and testing Bio-BigFile-1.07 ... ! Installing Bio::DB::BigFile failed. See /root/.cpanm/work/1471547787.8447/build.log for details. Retry with --force to force install it.
FAIL
INFO[0480] The command [/bin/sh -c curl -L -o master.zip --retry 10 https://github.com/ICGC-TCGA-PanCancer/PCAP-core/archive/master.zip &&     mkdir /tmp/downloads/distro &&     bsdtar -C /tmp/downloads/distro --strip-components 1 -xf master.zip &&     cd /tmp/downloads/distro &&     ./setup.sh $OPT &&     cd /tmp/downloads &&     rm -rf master.zip /tmp/downloads/distro /tmp/hts_cache] returned a non-zero code: 1 

X11 display problem with --user parameters in docker

I noticed that if I set the --user parameter for docker run I get the following error. Based on some searching, I would guess it is caused by the Rprofile settings. Is there a known solution for this problem?

Error in .External2(C_X11, paste("png::", filename, sep = ""), g$width, :
unable to start device PNG
Calls: ascat.plotRawData -> png
In addition: Warning message:
In png(filename = paste(ASCATobj$samples[i], ".tumour.png", sep = ""), :
unable to open connection to X11 display ''
Execution halted
