Tasrkleat, a pipeline for targeted analysis

What does it do

The pipeline was designed with three types of analysis in mind:

  1. Targeted cleavage site prediction in candidate genes, which is the main focus of this pipeline
  2. Targeted de novo assembly of candidate genes
  3. Targeted read alignment for expression quantification of candidate genes, i.e. genes of interest

What the three have in common is that they are all targeted analyses of candidate genes rather than of all genes available in a given RNA-Seq dataset. In addition, targeted cleavage site prediction (Task 1) depends on the results of the targeted de novo assembly (Task 2).

Currently, all three analyses are run by default when the pipeline is executed. There is no option yet to disable any of them.

Citation

Xue Z, Warren RL, Gibb EA, MacMillan D, Wong J, Chiu R, et al. Recurrent tumor-specific regulation of alternative polyadenylation of cancer-related genes. BMC Genomics. 2018;19:536

Running environment

The pipeline is designed mainly for cloud computing environments, the Google Cloud Platform (GCP) in particular, and is meant to be used as a Docker image. In principle, however, nothing prevents it from being used without Docker in a non-cloud environment.

Install

The easiest way to install is to build a Docker image with the included Dockerfile, and use that image directly. To build a Docker image, make sure you have Docker installed, then run

git clone git@github.com:bcgsc/tasrkleat.git
cd tasrkleat
make build

To check whether the image has been built successfully, run

docker images

Pre-built docker images

Pre-built Docker images are available on Docker Hub. The tags should match those in the GitHub repo, except for the v0 tag, which is used exclusively for testing purposes, and the latest tag, which reflects the image built automatically from the master branch.
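As a minimal sketch of fetching a pre-built image by tag: the repository name zyxue/tasrkleat is taken from the docker run examples in this README, and the helper name tasrkleat_image is made up for illustration.

```shell
# Build the image reference for a given tag (defaults to "latest").
# The repository name zyxue/tasrkleat comes from the docker run examples.
tasrkleat_image() {
    printf 'zyxue/tasrkleat:%s\n' "${1:-latest}"
}

# Usage (needs Docker and network access):
#   docker pull "$(tasrkleat_image)"       # image built from the master branch
#   docker pull "$(tasrkleat_image v0)"    # testing-only tag
```

Pinning a release tag instead of latest keeps repeated runs on the same software versions.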

Use docker image

It is recommended to run the pipeline interactively first to get familiar with its behavior before scaling up the computation.

Start an interactive Docker session

# You may or may not need sudo depending on your user group setup
sudo docker run -it --rm \
    -v /path/to/reference:/mnt \
    -v /path/to/reads-data/:/data \
    zyxue/tasrkleat:latest \
    /bin/bash

-it starts an interactive pseudo-TTY session. For details of docker run, please see the Docker documentation.

--rm removes the container after it finishes (e.g. when you exit the container). This is optional, but I find it handy. Otherwise, you will need to clean up all the finished containers manually with docker rm.

-v mounts a path on the local file system to a path inside the container so that the data is accessible to the pipeline. The above command mounts two paths, one for the reference data and one for the reads data.

/bin/bash runs bash when the container first starts so that you can interact with it.

The reference directory should contain all the necessary reference files. A copy of those used in the manuscript can be found at http://bcgsc.ca/downloads/tasrkleat-static/on-cloud/.
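Before mounting the reference directory, it may help to confirm that it contains the files the pipeline will be pointed at. The sketch below is not part of tasrkleat: the function name check_reference_dir is hypothetical, and the file names are simply those used in the example app.py command in this README, so adjust the list to your own setup.

```shell
# Report which reference files named in the example command are missing
# from a directory. Returns 0 if all are present, 1 otherwise.
check_reference_dir() {
    ref_dir=$1
    missing=0
    for name in targets.bf hg19.fa gmapdb ensembl.fixed.sorted.gz; do
        if [ ! -e "$ref_dir/$name" ]; then
            echo "missing: $ref_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_reference_dir /path/to/reference && echo "reference directory looks complete"
```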

Once you are inside a tasrkleat container, you are logged in as root. The environment looks like

root@b7aed8a3b50f:/# whoami
root

While the tasrkleat Docker image is just a binary file with all the necessary software packaged in, a Docker container is a running instance of the image. By analogy with programming, the image is like a class, and the container is like an instance of that class.

An example command to run the pipeline inside the container

app.py \
    --input-tar /data/data.tar \
    --input-bf /mnt/targets.bf \
    --transabyss-kmer-sizes 32 52 72 \
    --reference-genome /mnt/hg19.fa \
    --reference-genome-gmap-index /mnt/gmapdb \
    --gtf /mnt/ensembl.fixed.sorted.gz
  • --input-bf accepts the pre-built input Bloom filters.
  • --transabyss-kmer-sizes accepts three k-mer sizes.
  • --input-tar can be a tarball of gzipped FASTQ files or a gzipped tarball of uncompressed FASTQ files; both situations occur in the TCGA samples. See the extract_tarball function for details of how this is handled. Currently, tasrkleat can only handle paired-end data.
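To see which of the two tarball layouts a given input uses before running the pipeline, something like the following could work. It is only a sketch and not how extract_tarball is implemented; the function name tarball_kind is hypothetical.

```shell
# Print "gzipped-tar" if the archive itself is gzip-compressed, else "plain-tar".
# gzip -t exits non-zero for files that are not valid gzip streams.
tarball_kind() {
    if gzip -t "$1" 2>/dev/null; then
        echo "gzipped-tar"
    else
        echo "plain-tar"
    fi
}

# Usage:
#   tarball_kind /path/to/reads-data/data.tar
#   tar -tf /path/to/reads-data/data.tar   # list members without extracting
```

Recent GNU tar and bsdtar detect compression automatically when reading, so tar -tf should list the members in either case. A plain tar whose members end in .gz corresponds to the first layout described above.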

After you get familiar with how the pipeline works, you could run it in batch mode, e.g.

sudo docker run --rm \
    -v /path/to/reference:/mnt \
    -v /path/to/reads-data/:/data \
    zyxue/tasrkleat:latest \
    app.py --input-tar /data/data.tar \
           --input-bf /mnt/targets.bf \
           --transabyss-kmer-sizes 32 52 72 \
           --reference-genome /mnt/hg19.fa \
           --reference-genome-gmap-index /mnt/gmapdb \
           --gtf /mnt/ensembl.fixed.sorted.gz

The command is mostly the same as in interactive mode, except that the parts that enable interaction (-it and /bin/bash) are removed. It now runs app.py directly instead of /bin/bash when the container starts.
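When there are several samples, one way to scale up is to generate one docker run command per tarball and review the commands before executing them. The sketch below assumes a layout that this README does not prescribe: one *.tar file per sample in a single reads directory. The function name print_batch_commands is hypothetical, and paths containing spaces are not handled.

```shell
# Print one `docker run` command per tarball found in a directory (dry run).
print_batch_commands() {
    reads_dir=$1   # host directory holding one tarball per sample (assumed layout)
    ref_dir=$2     # host directory holding the reference files
    for tarball in "$reads_dir"/*.tar; do
        [ -e "$tarball" ] || continue   # skip when the glob matches nothing
        name=$(basename "$tarball")
        printf 'docker run --rm -v %s:/mnt -v %s:/data zyxue/tasrkleat:latest ' "$ref_dir" "$reads_dir"
        printf 'app.py --input-tar /data/%s --input-bf /mnt/targets.bf ' "$name"
        printf '%s\n' '--transabyss-kmer-sizes 32 52 72 --reference-genome /mnt/hg19.fa --reference-genome-gmap-index /mnt/gmapdb --gtf /mnt/ensembl.fixed.sorted.gz'
    done
}

# Usage: inspect the commands first, then pipe them to sh.
#   print_batch_commands /path/to/reads-data /path/to/reference
#   print_batch_commands /path/to/reads-data /path/to/reference | sh
```

Where the pipeline writes its results is not covered here, so check that samples sharing a mounted directory do not overwrite each other's output.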

Development

  1. Version every package installed if possible in the Dockerfile.
  2. Push each versioned image explicitly with docker push zyxue/tasrkleat:<tag>.
  3. Write a changelog with git tag -a <tag> <commit hash> for new releases in the following format
    One sentence summary
    
    - changed thing a
    - changed thing b
    
    Memo or other stuff to be recorded
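
The tag message format above can be assembled in a file and handed to git with -F. This is a sketch only: the helper name write_tag_message and the tag name v0.1.0 are hypothetical.

```shell
# Write a tag message in the summary / changes / memo format to a file.
# Arguments: output file, summary line, memo text, then one argument per change.
write_tag_message() {
    out_file=$1; summary=$2; memo=$3
    shift 3
    {
        printf '%s\n\n' "$summary"
        for change in "$@"; do
            printf '%s\n' "- $change"
        done
        printf '\n%s\n' "$memo"
    } > "$out_file"
}

# Usage (v0.1.0 is a hypothetical tag name):
#   write_tag_message msg.txt "One sentence summary" "Memo" "changed thing a" "changed thing b"
#   git tag -a v0.1.0 -F msg.txt
#   git push origin v0.1.0
```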
    

tasrkleat's People

Contributors

dmacmillan, readmanchiu, zyxue

tasrkleat's Issues

Missing numpy dependency in Dockerfile

Hi there. I tried to build tasrkleat but the docker build fails because numpy is required by biopython but not installed. Installing numpy prior to biopython solves this issue and the image builds just fine.

TODOs

  • Time each cmd execution
  • Record the command used to run app.py
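
Until these are implemented inside app.py, both items could be approximated from outside with a small wrapper. This is a sketch under that assumption and not part of tasrkleat; the function name run_logged and the log format are made up for illustration.

```shell
# Run a command, then append its command line, exit status, and elapsed
# wall-clock seconds to a log file. Returns the command's exit status.
run_logged() {
    log_file=$1
    shift
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    printf '%s\texit=%s\tseconds=%s\n' "$*" "$status" "$((end - start))" >> "$log_file"
    return "$status"
}

# Usage:
#   run_logged run.log app.py --input-tar /data/data.tar --input-bf /mnt/targets.bf ...
```

This only times the pipeline as a whole; timing each command that app.py launches would still need changes inside the pipeline.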
