hammerlab / biokepi
Bioinformatics Ketrew Pipelines
License: Apache License 2.0
See #54
By default it sets:
TMPDIR=/tmp
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000
SORTING_COLLECTION_SIZE_RATIO=0.25
We can change these so that MarkDuplicates spills less when plenty of memory is available, or at least spills to a different directory.
We also run it with
java -jar $PICARD_JAR MarkDuplicates
but we can update this to
java -Xmx<MEM_PARAM> -jar $PICARD_JAR MarkDuplicates
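Putting the two together, an updated invocation could look like the following sketch (the 8g heap, the /scratch/tmp spill directory, and the file names are illustrative placeholders, not biokepi's actual values):

```shell
# Sketch: MarkDuplicates with an explicit heap size, a non-default spill
# directory, and the spill-tuning options overridden.
java -Xmx8g -jar $PICARD_JAR MarkDuplicates \
  TMP_DIR=/scratch/tmp \
  MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 \
  SORTING_COLLECTION_SIZE_RATIO=0.25 \
  INPUT=aligned.sorted.bam \
  OUTPUT=aligned.dedup.bam \
  METRICS_FILE=dedup.metrics
```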
See list here: http://massgenomics.org/varscan
I wish there was an example script which
(1) ran locally without configuration
(2) allowed me to align reads & call somatic variants from local FASTQs
(3) didn't require editing any OCaml code to specify input/output filenames
Going to try this out - started here: https://github.com/hammerlab/biokepi/tree/add-haplotype-caller
Wanted a few clarifications on installing through OPAM:
opam pin add biokepi .
should be performed inside of the biokepi source folder?

[NOTE] You are pinning to /path/to/src/hammerlab/biokepi as a raw path while it seems to be a git repository.
[NOTE] Consider pinning with '-k git' to have OPAM synchronise with your commits and ignore build artefacts.
[biokepi.~unknown] Synchronizing with /path/to/src/hammerlab/biokepi
Installing new package description for biokepi from /path/to/src/hammerlab/biokepi/opam
biokepi needs to be installed.
Your request can't be satisfied:
- biokepi is not available because the package is pinned to /path/to/src/hammerlab/biokepi, version 0.0.0-alpha.
No solution found, exiting
[NOTE] Pinning command successful, but your installed packages may be out of sync.
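A sketch of the pinning sequence following the [NOTE]s above, run from inside the checkout (the path is the placeholder from the log):

```shell
# Pin from inside the biokepi source folder; -k git makes OPAM track
# commits and ignore build artefacts, as the [NOTE] suggests.
cd /path/to/src/hammerlab/biokepi
opam pin add -k git biokepi .
opam install biokepi
```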
This is important for many of the tools, as some versions are incompatible with different programs. A good example is bwa, as we may want to specify that a downstream tool requires a specific version. A possible way to do this is to enumerate the versions supported by the various tools and have the base tools set up the appropriate one. I'll provide a code example shortly for what I'm thinking, for feedback.
I'm not sure of the best way to handle multiple versions of tools not installed by biokepi, e.g. if a workflow needs Python 2 vs. 3.
According to this: https://groups.google.com/forum/#!topic/rna-star/AWuKw5zuEiY there is a patch to have STAR output the MD field itself, but it is probably easier to just run samtools calmd as the final step in the pipeline.
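The calmd step could be sketched as follows (file names are placeholders):

```shell
# Fill in MD/NM tags on the STAR-aligned BAM using the reference FASTA;
# -b makes calmd emit BAM on stdout.
samtools calmd -b star_aligned.sorted.bam reference.fasta > star_aligned.md.bam
```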
The first step of many variant callers is
&& shf "cd %s" Filename.(dirname result_prefix)
where Filename.(dirname result_prefix) may not exist. The open question is whether the task should create the directory or fail when it does not exist.
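If we choose to create it, a minimal sketch of the create-then-cd behavior in plain shell (the result_prefix value is a made-up example):

```shell
# Create the parent directory of the result prefix before cd-ing into it,
# so the `cd` in the task cannot fail on a missing directory.
result_prefix=/tmp/biokepi-example/results/run1   # hypothetical prefix
mkdir -p "$(dirname "$result_prefix")"
cd "$(dirname "$result_prefix")"
```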
Examples are good; there's room for some improvement over that. The API is deceptively simple; it's good to explain what it handles for you.
My understanding of
is that we are running the indel realigner separately for tumor / normal. Wenyi Wang from MD Anderson says we'll get better results if we run it jointly (which I think means just giving GATK both BAMs by specifying the -I option twice). We should support this in biokepi (and test whether we in fact get different results).

BaseRecalibrator seems to run but then the recal table file is missing? @seb has this worked for you before?
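For reference, the joint indel-realignment run mentioned above could be sketched like this with GATK 3 (jar path and file names are placeholders; with multiple inputs, IndelRealigner writes per-input outputs via --nWayOut):

```shell
# Target creation and realignment over tumor and normal together,
# by repeating -I.
java -jar $GATK_JAR -T RealignerTargetCreator -R ref.fasta \
  -I tumor.bam -I normal.bam -o realign.intervals
java -jar $GATK_JAR -T IndelRealigner -R ref.fasta \
  -I tumor.bam -I normal.bam -targetIntervals realign.intervals \
  --nWayOut .realigned.bam
```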
INFO 13:48:34,132 HelpFormatter - Program Args: -T BaseRecalibrator -I /hpc/users/ahujaa01/ksinai-demeter/work/path/pt189_bwa-mem-gap11-gep4-sorted.bam -R /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/b37.fasta -knownSites /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/dbsnp.vcf -o /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-recal_data.table
INFO 08:11:58,496 BaseRecalibrator - Writing recalibration report...
INFO 08:11:59,798 BaseRecalibrator - ...done!
INFO 08:12:02,994 HelpFormatter - Program Args: -T PrintReads -R /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/b37.fasta -I /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-sorted.bam -BQSR /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-recal_data.table -o /hpc/users/ahujaa01/ksinai-demeter/work/PT189-PT189--B37/pt189-bwa-mem-gap11-gep4-bqsr.bam
There are many calls to
INFO 08:12:00,701 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable
INFO 08:12:00,702 HttpMethodDirector - Retrying request
due to the ET (phone-home) call. We can set the NO_ET flag if we have a key file to use.
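With a key file, the sketch would be to add the two flags to each GATK call (the key file name and other paths are placeholders):

```shell
# -et NO_ET disables the phone-home ("ET") call; -K supplies the key file
# that GATK requires when phone-home is turned off.
java -jar $GATK_JAR -T BaseRecalibrator \
  -et NO_ET -K our_gatk.key \
  -R b37.fasta -I sample-sorted.bam -knownSites dbsnp.vcf -o recal_data.table
```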
Cycledash requires that the normal/tumor BAMs exist at a location on HDFS. The current post_vcf function provides local paths to BAMs, which Cycledash won't be able to access. Also, it requires them to be file_targets instead of strings, so you can't submit an HDFS path without failing.
I've been using an hdfs_file and an hdfs_file_target for some things, but I'm not sure where they should live, in biokepi or Ketrew.
let hdfs_file ~run_with path =
  let open Ketrew.EDSL in
  let open Biokepi_run_environment in
  let test_statement = sprintf "hadoop fs -test -e %s" path in
  object
    method path = path
    val command =
      Ketrew.Target.Command.shell
        ~host:Machine.(as_host run_with) test_statement
    method exists = `Command_returns (command, 0)
    (* Note: `hadoop fs -test -z` exits 0 for a zero-length file, so we
       expect exit code 1 to mean "non-empty"; the argument is currently
       ignored, i.e. we can only check non-emptiness, not an arbitrary
       size. *)
    method is_bigger_than _n =
      `Command_returns
        (Ketrew.Target.Command.shell
           ~host:Machine.(as_host run_with)
           (sprintf "hadoop fs -test -z %s" path),
         1)
  end
let hdfs_file_target ~run_with ?dependencies ?make ?metadata ?name
    ?equivalence ?if_fails_activate ?success_triggers ?tags path =
  let product = hdfs_file ~run_with path in
  let name = Option.value name ~default:("Make on HDFS: " ^ path) in
  Ketrew_edsl.target ~product ?equivalence ?if_fails_activate ?tags
    ?success_triggers ~done_when:product#exists
    ?dependencies ?make ?metadata name
At one point the Makefile had an oredoc target, bring it back.
cc @iskandr
It does not require it.
This requires a new kind of machine, that knows about Hadoop/Yarn.
I've made it work before, the question is how to abstract it to make it portable/configurable.
Edit: typo
Something we had discussed before: most of the tools take a processors argument. Some of the tools don't support it, but steps in their pipelines might. A lot of these tools default to 1 or drop the argument at some point, which can lead to some oddities; i.e. if I submit a variant-caller run with mutect, the samtools sort step will happen with 2 threads, while with strelka it will happen with as many processors as I gave strelka.
Interested to hear thoughts on this before I do it. In biokepi_pipeline there is a Bam type that the pipeline steps support. I was planning on changing this to Dna_bam and adding an equivalent Rna_bam with its own set of supported steps.
An unsorted BAM is much larger than a sorted one, and we can avoid the intermediate file by piping the commands together, since almost every tool needs a sorted BAM.
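A sketch of the piped version (tool flags and file names are placeholders; the trailing `-` tells samtools sort to read from stdin):

```shell
# Pipe the aligner's SAM output straight into samtools sort, skipping the
# large unsorted intermediate BAM on disk.
bwa mem -t 8 ref.fasta r1.fastq.gz r2.fastq.gz \
  | samtools sort -@ 4 -o aligned.sorted.bam -
```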
I'm not quite sure where to file this ticket, could also go in internal-tools perhaps.
I'm finding it difficult to get my head around biokepi. Lots of this is me learning ocaml at the same time, so please bear with me.
Examples like: https://github.com/hammerlab/biokepi/blob/c823e3fb1cee3227930e703c9ed23c63a16a3578/src/app/main.ml and https://github.com/hammerlab/internal-tools/blob/master/ksinai/seb.ml are unfortunately not doing it for me. I think I need a more documented example.
How am I, in practice, meant to use biokepi? Is the idea that I would set environment variables like BIOKEPI_DATASET_NAME as described in the biokepi README and then run a tool as it describes? Or is the idea that I write scripts like https://github.com/hammerlab/internal-tools/blob/master/ksinai/rna-alignment.ml and hard-code my dataset paths? I realize the answer is probably "you can do whatever you want, this is just a library", but I think it'd be useful for us to come up with a recommended style for using this library, and a tutorial for beginners like me to get started working in that way, at least for our work in hammer lab.
I'm finding the stuff in Common.ml pretty confusing. It would be really useful to have docs on things like:
- What is KEDSL? An extension to Ketrew's EDSL? The only comment I can find on it is "The idea is carry around a type parameter to have arbitrary products.", which I don't understand.
- What is a workflow_node, and how does it differ from a Ketrew target? Which should I use? If I'm supposed to use workflow_node, how do I specify its dependencies?
- What is a Dataset, and why is that in internal-tools/ksinai instead of biokepi?
I'm also having a tough time understanding the Ketrew EDSL here: https://github.com/hammerlab/ketrew/blob/master/src/lib/ketrew_edsl.mli (or the API docs at http://seb.mondet.org/software/ketrew/doc.1.0.0/api/Ketrew_edsl.html). For example, looking at target, I'm not clear what the active parameter means, and what I should set it to.
Thanks in advance for putting up with my ignorance :-)
Woops, there is a build target...
@rleonid suggested re-organizing some of the code here into more easily discoverable pipelines. I'd be interested in the best OCaml strategy for doing this, but in the meantime I'll lay out some of the hierarchy for the tools that have been added.
Basic pipeline:
{aligners} x {BAM preprocessing} x {variant calling}
aligners: fastq -> BAM = {BWA, bwa-mem, bowtie (not in biokepi yet), SNAP (not in biokepi yet)}
BAM preprocessing: BAM -> BAM = {GATK Indel Realignment, GATK BQSR, Picard MarkDuplicates}
Germline variant calling: BAM -> VCF = {GATK Unified Genotyper, GATK HaplotypeCaller}
Somatic variant calling: (BAM, BAM) -> VCF = {Mutect, Strelka, Varscan2, SomaticSniper}
Basic pipeline:
{aligners} x {quantification}
aligners: fastq -> BAM = {Hisat, Star}
quantification: fastq/BAM -> custom text file = {Cufflinks, Stringtie, Kallisto}
More GATK randomness
##### ERROR MESSAGE: Input files /hpc/users/ahujaa01/ksinai-demeter/B38-reference-genome/dbsnp.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
##### ERROR /hpc/users/ahujaa01/ksinai-demeter/B38-reference-genome/dbsnp.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
##### ERROR reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, KI270728.1, KI270727.1, KI270442.1, KI270729.1, GL000225.1, KI270743.1, GL000008.2, GL000009.2, KI270747.1, KI270722.1, GL000194.1, KI270742.1, GL000205.2, GL000195.1, KI270736.1, KI270733.1, GL000224.1, GL000219.1, KI270719.1, GL000216.2, KI270712.1, KI270706.1, KI270725.1, KI270744.1, KI270734.1, GL000213.1, GL000220.1, KI270715.1, GL000218.1, KI270749.1, KI270741.1, GL000221.1, KI270716.1, KI270731.1, KI270751.1, KI270750.1, KI270519.1, GL000214.1, KI270708.1, KI270730.1, KI270438.1, KI270737.1, KI270721.1, KI270738.1, KI270748.1, KI270435.1, GL000208.1, KI270538.1, KI270756.1, KI270739.1, KI270757.1, KI270709.1, KI270746.1, KI270753.1, KI270589.1, KI270726.1, KI270735.1, KI270711.1, KI270745.1, KI270714.1, KI270732.1, KI270713.1, KI270754.1, KI270710.1, KI270717.1, KI270724.1, KI270720.1, KI270723.1, KI270718.1, KI270317.1, KI270740.1, KI270755.1, KI270707.1, KI270579.1, KI270752.1, KI270512.1, KI270322.1, GL000226.1, KI270311.1, KI270366.1, KI270511.1, KI270448.1, KI270521.1, KI270581.1, KI270582.1, KI270515.1, KI270588.1, KI270591.1, KI270522.1, KI270507.1, KI270590.1, KI270584.1, KI270320.1, KI270382.1, KI270468.1, KI270467.1, KI270362.1, KI270517.1, KI270593.1, KI270528.1, KI270587.1, KI270364.1, KI270371.1, KI270333.1, KI270374.1, KI270411.1, KI270414.1, KI270510.1, KI270390.1, KI270375.1, KI270420.1, KI270509.1, KI270315.1, KI270302.1, KI270518.1, KI270530.1, KI270304.1, KI270418.1, KI270424.1, KI270417.1, KI270508.1, KI270303.1, KI270381.1, KI270529.1, KI270425.1, KI270396.1, KI270363.1, KI270386.1, KI270465.1, KI270383.1, KI270384.1, KI270330.1, KI270372.1, KI270548.1, KI270580.1, KI270387.1, KI270391.1, KI270305.1, KI270373.1, KI270422.1, KI270316.1, KI270340.1, KI270338.1, KI270583.1, KI270334.1, KI270429.1, KI270393.1, KI270516.1, KI270389.1, KI270466.1, KI270388.1, KI270544.1, KI270310.1, KI270412.1, KI270395.1, KI270376.1, 
KI270337.1, KI270335.1, KI270378.1, KI270379.1, KI270329.1, KI270419.1, KI270336.1, KI270312.1, KI270539.1, KI270385.1, KI270423.1, KI270392.1, KI270394.1]
Interesting answers on the GATK forum, without further explanation:
B38 is not a simple drop-in replacement as previous releases were. There is some tool work associated with it before we can release anything.
We could re-sort the FASTA file or find a link to a GATK-appropriate reference.
Now we know more about where we are going with these.
What needs to get done first:
Currently reference_genome
(https://github.com/hammerlab/biokepi/blob/master/src/lib/reference_genome.ml#L3) only contains a small number of human references:
type specification =
  [ `B37 | `B38 | `hg19 | `hg18 | `B37decoy ]
For working with mouse melanoma data we'll need the mouse references mm9/GRCm37 and mm10/GRCm38.
Having a product where is_done = None does not make sense. You should always be able to determine when a target is_done. We should not use it for target equivalence, since there is already a field for that logic in a target.
We don't want to have to download FASTAs from random FTP servers as part of our workflow; it'd be nice to have a verified repository of FASTAs/VCFs/etc that workflows can use.
Most tools that work with VCFs (e.g. varcode) can also work with gzip-compressed VCFs, which of course tend to be much smaller. We should compress the VCFs output by our pipelines by default.
The best way to do this would be with the bgzip tool from tabix, but plain gzip would also work. See: http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
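The compression step could be sketched as follows (the file name is a placeholder):

```shell
# bgzip produces a blocked gzip file that tabix can index, so downstream
# tools keep random access; plain gzip would lose that property.
bgzip calls.vcf              # writes calls.vcf.gz (and removes calls.vcf)
tabix -p vcf calls.vcf.gz    # writes the calls.vcf.gz.tbi index
```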
For some analyses we're doing, I'd like to be able to run variant callers on randomly downsampled data.
An example analysis would be: run MuTect on subsamples of my BAM with {0.1, 0.2, ..., 1.0} fraction of the reads present, for a total of 10 runs. It would be useful if I could specify an arbitrary schedule of downsampling rates, potentially with duplicates, e.g. {0.1, 0.1, 0.1, 0.5, 0.5, 0.5}, so that we can get replicates with the same fraction of reads sampled (but a different random sample).
The downsampling can be performed on a BAM with "samtools view". This tool is actually fairly clever in that it keeps mate pairs together (either both reads in the pair are sampled or neither). Here's the bash loop I've been using so far for generating downsamples manually:
for i in $(seq 1 9) ; do
samtools view -s 0.$i -b original.clean.dedup.recal.bam > subsampled-0.$i.bam
done
According to the samtools help, the integer part of the -s option can be used to specify the random seed, which we'd have to set carefully to support downsampling schedules with duplicates.
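A plain-shell sketch of turning a schedule with duplicates into distinct -s arguments, using the integer (seed) part to disambiguate repeated fractions (the schedule values are just an example):

```shell
# Each entry in the schedule gets a fresh seed, so repeated fractions
# produce different random subsets: fraction 0.1 with seed 3 becomes -s 3.1.
schedule="0.1 0.1 0.1 0.5 0.5 0.5"   # example schedule with duplicates
seed=0
args=""
for frac in $schedule; do
  seed=$((seed + 1))
  args="$args -s ${seed}${frac#0}"   # strip the leading 0: 0.1 -> .1
done
echo "$args"
```

Each generated `-s` value would then be passed to `samtools view -b -s SEED.FRAC` as in the loop above.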
Just noting this in case it matters later, regarding the dbsnp_alt_url and dbsnp_broad_url defined here to be passed to mutect:
https://github.com/hammerlab/biokepi/blob/master/src/lib/run_environment.ml#L656-L657
dbsnp_alt_url is a dead link: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz
dbsnp_alt_url and dbsnp_broad_url are not the same files: the Broad URL gives a VCF with 61M records; the alt URL gives one with 145M records.

Ran into this a few times but don't have a solution yet. Most of the variant callers take a reference build and a BAM, but those are coupled, and it should not be possible to specify the wrong reference build for a BAM.