hammerlab / biokepi
Bioinformatics Ketrew Pipelines
License: Apache License 2.0
See #54
By default it sets:
TMPDIR=/tmp
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000
SORTING_COLLECTION_SIZE_RATIO=0.25
We can change these so that MarkDuplicates spills less when plenty of memory is available, or at least spills to a different directory.
We also run it with
java -jar $PICARD_JAR MarkDuplicates
but we can update this to
java -Xmx<MEM_PARAM> -jar $PICARD_JAR MarkDuplicates
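Putting the two together, an updated invocation could look like the following sketch (the 8g heap, the /scratch/tmp spill directory, and the file names are illustrative placeholders, not biokepi's actual values):

```shell
# Sketch: MarkDuplicates with an explicit heap size, a non-default spill
# directory, and the spill-tuning options overridden.
java -Xmx8g -jar $PICARD_JAR MarkDuplicates \
  TMP_DIR=/scratch/tmp \
  MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 \
  SORTING_COLLECTION_SIZE_RATIO=0.25 \
  INPUT=aligned.sorted.bam \
  OUTPUT=aligned.dedup.bam \
  METRICS_FILE=dedup.metrics
```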
See list here: http://massgenomics.org/varscan
I wish there was an example script which
(1) ran locally without configuration
(2) allowed me to align reads & call somatic variants from local FASTQs
(3) didn't require editing any OCaml code to specify input/output filenames
Going to try this out - started here: https://github.com/hammerlab/biokepi/tree/add-haplotype-caller
Wanted a few clarifications on installing through OPAM:
opam pin add biokepi .
should be performed inside of the biokepi source folder?

[NOTE] You are pinning to /path/to/src/hammerlab/biokepi as a raw path while it seems to be a git repository.
[NOTE] Consider pinning with '-k git' to have OPAM synchronise with your commits and ignore build artefacts.
[biokepi.~unknown] Synchronizing with /path/to/src/hammerlab/biokepi
Installing new package description for biokepi from /path/to/src/hammerlab/biokepi/opam
biokepi needs to be installed.
Your request can't be satisfied:
- biokepi is not available because the package is pinned to /path/to/src/hammerlab/biokepi, version 0.0.0-alpha.
No solution found, exiting
[NOTE] Pinning command successful, but your installed packages may be out of sync.
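A sketch of the pinning sequence following the [NOTE]s above, run from inside the checkout (the path is the placeholder from the log):

```shell
# Pin from inside the biokepi source folder; -k git makes OPAM track
# commits and ignore build artefacts, as the [NOTE] suggests.
cd /path/to/src/hammerlab/biokepi
opam pin add -k git biokepi .
opam install biokepi
```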
This is important for many of the tools, as some versions are incompatible with different programs. A good example is bwa, as we may want to specify that a downstream tool requires a specific version. A possible way to do this is to enumerate the versions supported by the various tools and have the base tools set up the appropriate one. I'll provide a code example shortly for what I'm thinking, for feedback.
I'm not sure of the best way to handle multiple versions of tools not installed by biokepi, e.g. if a workflow needs Python 2 vs. 3.
According to this: https://groups.google.com/forum/#!topic/rna-star/AWuKw5zuEiY there is a patch to have STAR output the MD field itself, but it is probably easier to just run samtools calmd as the final step in the pipeline.
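The calmd step could be sketched as follows (file names are placeholders):

```shell
# Fill in MD/NM tags on the STAR-aligned BAM using the reference FASTA;
# -b makes calmd emit BAM on stdout.
samtools calmd -b star_aligned.sorted.bam reference.fasta > star_aligned.md.bam
```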
The first step of many variant callers is
&& shf "cd %s" Filename.(dirname result_prefix)
where Filename.(dirname result_prefix) may not exist. The open question is whether the task should create the directory or fail when it does not exist.
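If we choose to create it, a minimal sketch of the create-then-cd behavior in plain shell (the result_prefix value is a made-up example):

```shell
# Create the parent directory of the result prefix before cd-ing into it,
# so the `cd` in the task cannot fail on a missing directory.
result_prefix=/tmp/biokepi-example/results/run1   # hypothetical prefix
mkdir -p "$(dirname "$result_prefix")"
cd "$(dirname "$result_prefix")"
```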
Examples are good; there's room for some improvement over that. The API is deceptively simple; it's good to explain what it handles for you.
My understanding of
is that we are running the indel realigner separately for tumor / normal. Wenyi Wang from MD Anderson says we'll get better results if we run it jointly (which I think means just giving GATK both BAMs by specifying the -I option twice). We should support this in biokepi (and test whether we in fact get different results).

BaseRecalibrator seems to run but then the recal table file is missing? @seb has this worked for you before?
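For reference, the joint indel-realignment run mentioned above could be sketched like this with GATK 3 (jar path and file names are placeholders; with multiple inputs, IndelRealigner writes per-input outputs via --nWayOut):

```shell
# Target creation and realignment over tumor and normal together,
# by repeating -I.
java -jar $GATK_JAR -T RealignerTargetCreator -R ref.fasta \
  -I tumor.bam -I normal.bam -o realign.intervals
java -jar $GATK_JAR -T IndelRealigner -R ref.fasta \
  -I tumor.bam -I normal.bam -targetIntervals realign.intervals \
  --nWayOut .realigned.bam
```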
INFO 13:48:34,132 HelpFormatter - Program Args: -T BaseRecalibrator -I /hpc/users/ahujaa01/ksinai-demeter/work/path/pt189_bwa-mem-gap11-gep4-sorted.bam -R /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/b37.fasta -knownSites /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/dbsnp.vcf -o /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-recal_data.table
INFO 08:11:58,496 BaseRecalibrator - Writing recalibration report...
INFO 08:11:59,798 BaseRecalibrator - ...done!
INFO 08:12:02,994 HelpFormatter - Program Args: -T PrintReads -R /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/b37.fasta -I /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-sorted.bam -BQSR /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-recal_data.table -o /hpc/users/ahujaa01/ksinai-demeter/work/PT189-PT189--B37/pt189-bwa-mem-gap11-gep4-bqsr.bam
There are many calls to
INFO 08:12:00,701 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable
INFO 08:12:00,702 HttpMethodDirector - Retrying request
due to the ET (phone-home) call. We can set the NO_ET flag if we have a key file to use.
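With a key file, the sketch would be to add the two flags to each GATK call (the key file name and other paths are placeholders):

```shell
# -et NO_ET disables the phone-home ("ET") call; -K supplies the key file
# that GATK requires when phone-home is turned off.
java -jar $GATK_JAR -T BaseRecalibrator \
  -et NO_ET -K our_gatk.key \
  -R b37.fasta -I sample-sorted.bam -knownSites dbsnp.vcf -o recal_data.table
```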
Cycledash requires that the normal/tumor BAMs exist at a location on HDFS. The current post_vcf function provides local paths to BAMs, which Cycledash won't be able to access. Also, it requires them to be file_targets instead of strings, so you can't submit an HDFS path without failing.
I've been using an hdfs_file and an hdfs_file_target for some things, but I'm not sure where they should live, in biokepi or Ketrew.
let hdfs_file ~run_with path =
  let open Ketrew.EDSL in
  let open Biokepi_run_environment in
  let test_statement = sprintf "hadoop fs -test -e %s" path in
  object
    method path = path
    val command =
      Ketrew.Target.Command.shell
        ~host:Machine.(as_host run_with) test_statement
    method exists = `Command_returns (command, 0)
    (* Note: `hadoop fs -test -z` exits 0 for a zero-length file, so we
       expect exit code 1 to mean "non-empty"; the argument is currently
       ignored, i.e. we can only check non-emptiness, not an arbitrary
       size. *)
    method is_bigger_than _n =
      `Command_returns
        (Ketrew.Target.Command.shell
           ~host:Machine.(as_host run_with)
           (sprintf "hadoop fs -test -z %s" path),
         1)
  end
let hdfs_file_target ~run_with ?dependencies ?make ?metadata ?name
    ?equivalence ?if_fails_activate ?success_triggers ?tags path =
  let product = hdfs_file ~run_with path in
  let name = Option.value name ~default:("Make on HDFS: " ^ path) in
  Ketrew_edsl.target ~product ?equivalence ?if_fails_activate ?tags
    ?success_triggers ~done_when:product#exists
    ?dependencies ?make ?metadata name
At one point the Makefile had an oredoc target, bring it back.
cc @iskandr
It does not require it.
This requires a new kind of machine, that knows about Hadoop/Yarn.
I've made it work before, the question is how to abstract it to make it portable/configurable.
Edit: typo
Something we had discussed before: most of the tools take a processors argument. Some of the tools don't support it, but steps in their pipelines might. A lot of these tools default to 1 or drop the argument at some point, which can lead to some oddities; i.e. if I submit a variant-caller run with mutect, the samtools sort step will happen with 2 threads, while with strelka it will happen with as many processors as I gave strelka.
Interested to hear thoughts on this before I do it. In biokepi_pipeline there is a Bam type that the pipeline steps support. I was planning on changing this to Dna_bam and adding an equivalent Rna_bam with its own set of supported steps.
An unsorted BAM is much larger than a sorted one, and we can avoid the intermediate file by piping the commands together, since almost every tool needs a sorted BAM.
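A sketch of the piped version (tool flags and file names are placeholders; the trailing `-` tells samtools sort to read from stdin):

```shell
# Pipe the aligner's SAM output straight into samtools sort, skipping the
# large unsorted intermediate BAM on disk.
bwa mem -t 8 ref.fasta r1.fastq.gz r2.fastq.gz \
  | samtools sort -@ 4 -o aligned.sorted.bam -
```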
I'm not quite sure where to file this ticket, could also go in internal-tools perhaps.
I'm finding it difficult to get my head around biokepi. Lots of this is me learning ocaml at the same time, so please bear with me.
Examples like: https://github.com/hammerlab/biokepi/blob/c823e3fb1cee3227930e703c9ed23c63a16a3578/src/app/main.ml and https://github.com/hammerlab/internal-tools/blob/master/ksinai/seb.ml are unfortunately not doing it for me. I think I need a more documented example.
How am I, in practice, meant to use biokepi? Is the idea that I would set environment variables like BIOKEPI_DATASET_NAME as described in the biokepi README and then run a tool as it describes? Or is the idea that I write scripts like https://github.com/hammerlab/internal-tools/blob/master/ksinai/rna-alignment.ml and hard-code my dataset paths? I realize the answer is probably "you can do whatever you want, this is just a library", but I think it'd be useful for us to come up with a recommended style for using this library, and a tutorial for beginners like me to get started working in that way, at least for our work in hammer lab.
I'm finding the stuff in Common.ml pretty confusing. It would be really useful to have docs on things like:
- What is KEDSL? An extension to Ketrew's EDSL? The only comment I can find on it is "The idea is carry around a type parameter to have arbitrary products.", which I don't understand.
- What is a workflow_node, and how does it differ from a Ketrew target? Which should I use? If I'm supposed to use workflow_node, how do I specify its dependencies?
- What is a Dataset, and why is that in internal-tools/ksinai instead of biokepi?
I'm also having a tough time understanding the Ketrew EDSL here: https://github.com/hammerlab/ketrew/blob/master/src/lib/ketrew_edsl.mli (or the API docs at http://seb.mondet.org/software/ketrew/doc.1.0.0/api/Ketrew_edsl.html). For example, looking at target, I'm not clear what the active parameter means, and what I should set it to.
Thanks in advance for putting up with my ignorance :-)
Woops, there is a build target...
@rleonid suggested re-organizing some of the code here into more easily discoverable pipelines. I'd be interested in the best OCaml strategy for doing this, but in the meantime I'll lay out some of the hierarchy for the tools that have been added.
Basic pipeline:
{aligners} x {BAM preprocessing} x {variant calling}
aligners: fastq -> BAM = {BWA, bwa-mem, bowtie (not in biokepi yet), SNAP (not in biokepi yet)}
BAM preprocessing: BAM -> BAM = {GATK Indel Realignment, GATK BQSR, Picard MarkDuplicates}
Germline variant calling: BAM -> VCF = {GATK Unified Genotyper, GATK HaplotypeCaller}
Somatic variant calling: (BAM, BAM) -> VCF = {Mutect, Strelka, Varscan2, SomaticSniper}
Basic pipeline:
{aligners} x {quantification}
aligners: fastq -> BAM = {Hisat, Star}
quantification: fastq/BAM -> custom text file = {Cufflinks, Stringtie, Kallisto}
More GATK randomness
##### ERROR MESSAGE: Input files /hpc/users/ahujaa01/ksinai-demeter/B38-reference-genome/dbsnp.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
##### ERROR /hpc/users/ahujaa01/ksinai-demeter/B38-reference-genome/dbsnp.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
##### ERROR reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, KI270728.1, KI270727.1, KI270442.1, KI270729.1, GL000225.1, KI270743.1, GL000008.2, GL000009.2, KI270747.1, KI270722.1, GL000194.1, KI270742.1, GL000205.2, GL000195.1, KI270736.1, KI270733.1, GL000224.1, GL000219.1, KI270719.1, GL000216.2, KI270712.1, KI270706.1, KI270725.1, KI270744.1, KI270734.1, GL000213.1, GL000220.1, KI270715.1, GL000218.1, KI270749.1, KI270741.1, GL000221.1, KI270716.1, KI270731.1, KI270751.1, KI270750.1, KI270519.1, GL000214.1, KI270708.1, KI270730.1, KI270438.1, KI270737.1, KI270721.1, KI270738.1, KI270748.1, KI270435.1, GL000208.1, KI270538.1, KI270756.1, KI270739.1, KI270757.1, KI270709.1, KI270746.1, KI270753.1, KI270589.1, KI270726.1, KI270735.1, KI270711.1, KI270745.1, KI270714.1, KI270732.1, KI270713.1, KI270754.1, KI270710.1, KI270717.1, KI270724.1, KI270720.1, KI270723.1, KI270718.1, KI270317.1, KI270740.1, KI270755.1, KI270707.1, KI270579.1, KI270752.1, KI270512.1, KI270322.1, GL000226.1, KI270311.1, KI270366.1, KI270511.1, KI270448.1, KI270521.1, KI270581.1, KI270582.1, KI270515.1, KI270588.1, KI270591.1, KI270522.1, KI270507.1, KI270590.1, KI270584.1, KI270320.1, KI270382.1, KI270468.1, KI270467.1, KI270362.1, KI270517.1, KI270593.1, KI270528.1, KI270587.1, KI270364.1, KI270371.1, KI270333.1, KI270374.1, KI270411.1, KI270414.1, KI270510.1, KI270390.1, KI270375.1, KI270420.1, KI270509.1, KI270315.1, KI270302.1, KI270518.1, KI270530.1, KI270304.1, KI270418.1, KI270424.1, KI270417.1, KI270508.1, KI270303.1, KI270381.1, KI270529.1, KI270425.1, KI270396.1, KI270363.1, KI270386.1, KI270465.1, KI270383.1, KI270384.1, KI270330.1, KI270372.1, KI270548.1, KI270580.1, KI270387.1, KI270391.1, KI270305.1, KI270373.1, KI270422.1, KI270316.1, KI270340.1, KI270338.1, KI270583.1, KI270334.1, KI270429.1, KI270393.1, KI270516.1, KI270389.1, KI270466.1, KI270388.1, KI270544.1, KI270310.1, KI270412.1, KI270395.1, KI270376.1, 
KI270337.1, KI270335.1, KI270378.1, KI270379.1, KI270329.1, KI270419.1, KI270336.1, KI270312.1, KI270539.1, KI270385.1, KI270423.1, KI270392.1, KI270394.1]
Interesting answers on the GATK forum, without further explanation:
B38 is not a simple drop-in replacement as previous releases were. There is some tool work associated with it before we can release anything.
We could re-sort the FASTA file or find a link to a GATK-appropriate reference.
Now we know more about where we are going with these.
What needs to get done first:
Currently reference_genome
(https://github.com/hammerlab/biokepi/blob/master/src/lib/reference_genome.ml#L3) only contains a small number of human references:
type specification =
  [ `B37 | `B38 | `hg19 | `hg18 | `B37decoy ]
For working with mouse melanoma data we'll need the mouse references mm9/GRCm37 and mm10/GRCm38.
Having a product where is_done = None does not make sense. You should always be able to determine when a target is_done. We should not use it for target equivalence, since there is already a field for that logic in a target.
We don't want to have to download FASTAs from random FTP servers as part of our workflow; it'd be nice to have a verified repository of FASTAs/VCFs/etc that workflows can use.
Most tools that work with VCFs (e.g. varcode) can also work with gzip-compressed VCFs, which of course tend to be much smaller. We should compress the VCFs output by our pipelines by default.
The best way to do this would be with the bgzip tool from tabix, but plain gzip would also work. See: http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
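The compression step could be sketched as follows (the file name is a placeholder):

```shell
# bgzip produces a blocked gzip file that tabix can index, so downstream
# tools keep random access; plain gzip would lose that property.
bgzip calls.vcf              # writes calls.vcf.gz (and removes calls.vcf)
tabix -p vcf calls.vcf.gz    # writes the calls.vcf.gz.tbi index
```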
For some analyses we're doing, I'd like to be able to run variant callers on randomly downsampled data.
An example analysis would be: run MuTect on subsamples of my BAM with {0.1, 0.2, ..., 1.0} fraction of the reads present, for a total of 10 runs. It would be useful if I could specify an arbitrary schedule of downsampling rates, potentially with duplicates, e.g. {0.1, 0.1, 0.1, 0.5, 0.5, 0.5}, so that we can get replicates with the same fraction of reads sampled (but a different random sample).
The downsampling can be performed on a BAM with "samtools view". This tool is actually fairly clever in that it keeps mate pairs together (either both reads in the pair are sampled or neither). Here's the bash loop I've been using so far for generating downsamples manually:
for i in $(seq 1 9) ; do
samtools view -s 0.$i -b original.clean.dedup.recal.bam > subsampled-0.$i.bam
done
According to the samtools help, the integer part of the -s option can be used to specify the random seed, which we'd have to set carefully to support downsampling schedules with duplicates.
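A plain-shell sketch of turning a schedule with duplicates into distinct -s arguments, using the integer (seed) part to disambiguate repeated fractions (the schedule values are just an example):

```shell
# Each entry in the schedule gets a fresh seed, so repeated fractions
# produce different random subsets: fraction 0.1 with seed 3 becomes -s 3.1.
schedule="0.1 0.1 0.1 0.5 0.5 0.5"   # example schedule with duplicates
seed=0
args=""
for frac in $schedule; do
  seed=$((seed + 1))
  args="$args -s ${seed}${frac#0}"   # strip the leading 0: 0.1 -> .1
done
echo "$args"
```

Each generated `-s` value would then be passed to `samtools view -b -s SEED.FRAC` as in the loop above.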
Just noting this in case it matters later, regarding the dbsnp_alt_url and dbsnp_broad_url defined here to be passed to mutect:
https://github.com/hammerlab/biokepi/blob/master/src/lib/run_environment.ml#L656-L657
dbsnp_alt_url is a dead link: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz
dbsnp_alt_url and dbsnp_broad_url are not the same files: the Broad URL gives a VCF with 61M records; the alt URL gives one with 145M records.

Ran into this a few times but don't have a solution yet. Most of the variant callers take a reference build and a BAM, but those are coupled, and it should not be possible to specify the wrong reference build for a BAM.