kids-first / kf-alignment-workflow


:microscope: Alignment workflow for Kids-First DRC

License: Apache License 2.0

Python 1.23% Common Workflow Language 98.77%

bioinformatics cwl drc-harmonization

kf-alignment-workflow's People


bogdang989 bgi-chop ferlab-ste-justine clairexinsun inab

kf-alignment-workflow's Issues

HaplotypeCaller RamMin parameter

We should reduce the RamMin requirement as far as possible so that more jobs can run in parallel, without causing failures from exhausted RAM. A few tests show that reducing the RamMin requirement to 8000 works on c4.8xlarge and c5.9xlarge instances.
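The trade-off above can be sketched as a simple packing calculation: the number of jobs that fit on one instance is bounded by both RAM and vCPUs. This is a minimal sketch, not the scheduler's actual logic. The c5.9xlarge figures (36 vCPU, 72 GB RAM) are the ones quoted in the instance-type issue in this tracker; treating 72 GB as 72 * 1024 MiB, the `cores_min` parameter, and the 16000 "before" value are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    vcpus: int
    ram_mib: int


def max_parallel_jobs(instance: Instance, ram_min_mib: int, cores_min: int = 1) -> int:
    """Upper bound on concurrent jobs: limited by RAM and by vCPUs."""
    by_ram = instance.ram_mib // ram_min_mib
    by_cpu = instance.vcpus // cores_min
    return min(by_ram, by_cpu)


# Figures from the issue text; 72 GB is assumed to mean 72 * 1024 MiB here.
c5_9xlarge = Instance("c5.9xlarge", vcpus=36, ram_mib=72 * 1024)

# 16000 is a hypothetical "before" value; 8000 is the value reported to work.
for ram_min in (16000, 8000):
    jobs = max_parallel_jobs(c5_9xlarge, ram_min)
    print(f"RamMin={ram_min}: up to {jobs} jobs on {c5_9xlarge.name}")
```

A real platform reserves some memory for the OS and the executor, so the practical number may be lower than this bound.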

Alignment Workflow

@allisonheath commented on Thu Oct 12 2017

Will document specific details as development occurs, but at a high level:

Per-read-group alignment with BWA-MEM
Picard MarkDuplicates
GATK4 BQSR
Picard Metrics
Final output CRAM

@kellrott commented on Wed Oct 18 2017

Should the pipeline start from FASTQs or unaligned BAMs? With FASTQ pairs we need to package metadata alongside; with unaligned BAMs it's built into the BAM header. PCAWG went with unaligned BAMs.

Also, what metadata should be included as part of the input?

@yuankunzhu commented on Wed Oct 25 2017

We're working on a workflow that takes both FASTQ and uBAM as input.

I personally lean toward uBAM input: BAM handles metadata well, which makes the whole pipeline much cleaner. But I can also see why some people still love FASTQs, as GZIP is smaller than BGZF and you don't necessarily need random access for a uBAM. If we could have a submitter system capture read group info, like the GDC dictionary, most metadata could be stored there as well. Lastly, if we decide to use bwa mem, we still need to convert the uBAM to intermediate FASTQ, afaik.

Anyway, we'd like to provide both FASTQ and uBAM as entry points to the alignment pipeline for now. We'd like to hear more thoughts and comments on this.

@yuankunzhu commented on Wed Oct 25 2017

@kellrott do you have a link to, or more details on, the PCAWG pipeline?

@allisonheath commented on Fri Oct 27 2017

@yuankunzhu the alignment workflow is here: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow and I believe the other workflows are generally collected under https://github.com/ICGC-TCGA-PanCancer. Also of interest might be the specifications for the unaligned BAMs that were uploaded: https://github.com/ICGC-TCGA-PanCancer/cli/wiki/Preparing-BAM-files.

I think in general we want to use the Gen3 stack as the authoritative source of sequencing metadata, with just the correct IDs kept as part of the header. So a critical decision will be what ID(s) those are, remembering that they typically propagate to downstream files as well.

@allisonheath commented on Thu Dec 21 2017

This now lives in https://github.com/kids-first/kf-alignment-workflow

Optimal Instance Type

Choosing an optimal instance type for execution with regard to task cost and duration. By setting the CWL requirements for RamMin and specifying the maximum RAM allowed for tools, we can switch from AWS memory-optimized instances (r-.-xlarge) to compute-optimized instances (c-.-xlarge). Last week, Cavatica deployed support for the modern c5.*xlarge instances, and preliminary tests show that the c5.9xlarge (36 vCPU, 72 GB RAM) instance may be the best bet, as it is cheaper (looking at on-demand pricing) and faster (newer and better hardware) than its c4.8xlarge predecessor.

BAM Preprocessing optimization suggestion

Tools in the workflow

Picard RevertSam

Current function

Split input BAM/SAM/CRAM into BAM per RG.

Proposed modification

Use Samtools split for this purpose.

Performance improvement

Picard RevertSam time 4h 7m
Samtools Split time 28m
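As a quick worked check of the two timings above, the following sketch converts the reported durations to minutes and derives the speedup. The helper name `to_minutes` is introduced here for illustration only; the timings are the ones reported in this issue, from a single test BAM.

```python
import re


def to_minutes(duration: str) -> int:
    """Convert a duration such as '4h 7m' or '28m' to whole minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


revertsam_min = to_minutes("4h 7m")  # Picard RevertSam
split_min = to_minutes("28m")        # Samtools split

speedup = revertsam_min / split_min
saved = revertsam_min - split_min
print(f"{revertsam_min} min vs {split_min} min: {speedup:.1f}x faster, {saved} min saved")
```

On these numbers the split step drops from 247 to 28 minutes, roughly a 9-fold reduction for this one sample.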

Example command line

/opt/samtools-1.7/samtools split -f '%!.bam' -@ 29 ae3b4fcd963d404081393b9cf038d4d5.bam

Samtools split was tested with a randomly selected BAM from the pilot set. QualityYield metrics show no difference between the two tools' outputs, which is expected since it is a simple tool.
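With the `-f '%!.bam'` format used in the command above, `%!` expands to the read group ID, so the expected output files can be predicted from the `@RG` lines of the header (obtainable with `samtools view -H`). The sketch below does only that prediction, as a sanity check before or after running the split. It handles only the `%!` placeholder, and the header and read group IDs in the example are hypothetical.

```python
from typing import List


def read_group_ids(sam_header: str) -> List[str]:
    """Return the ID of every @RG line in a SAM header, in order."""
    ids = []
    for line in sam_header.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "ID" in fields:
            ids.append(fields["ID"])
    return ids


def expected_split_outputs(sam_header: str, fmt: str = "%!.bam") -> List[str]:
    """Predict per-read-group file names; only the '%!' placeholder is handled."""
    return [fmt.replace("%!", rg_id) for rg_id in read_group_ids(sam_header)]


# Hypothetical header, for illustration only.
example_header = (
    "@HD\tVN:1.5\tSO:unsorted\n"
    "@RG\tID:rgA\tSM:sample1\tPL:ILLUMINA\n"
    "@RG\tID:rgB\tSM:sample1\tPL:ILLUMINA\n"
)
print(expected_split_outputs(example_header))
```

Comparing this list against the files actually produced is a cheap way to confirm that every read group ended up in its own BAM.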

alignment optimization suggestions

BWA Mem (scattered by read group)

Current setup:

Picard SamToFastq | BWA | Picard MergeBamAlignment

Proposed setup:

Identify pilot test set for workflows

Proposed: 30 trios (so 90 WGS) from each sequencing center to run.

Replaces kids-first/kf-workflows#4

workflow/tools versioning and organizing

@yuankunzhu commented on Wed Nov 01 2017

Opening this issue to discuss how to better version and organize KF workflows in terms of tool versioning, Docker image builds and tags, CWL file names, workflow and tool IDs and versions, etc., and potentially how to automate the release process.

The initial idea is to create a subrepo for each workflow and link them back to this kf-workflows repo (https://github.com/kids-first/kf-workflows/).

Currently on pre-cocleaning branch, we have something like this:

.
├── LICENSE
├── README.md
├── dockerfiles
│   ├── alpine-bwa-samtools
│   │   ├── 0.7.17
│   │   │   └── Dockerfile
│   │   └── latest
│   │       └── Dockerfile
│   ├── alpine-gatk
│   │   ├── 3.8
│   │   │   └── Dockerfile
│   │   └── latest
│   │       └── Dockerfile
│   └── alpine-picard
│       ├── 2.14.0
│       │   └── Dockerfile
│       └── latest
│           └── Dockerfile
├── tools
│   ├── bwa_mem_0.7.17.cwl
│   ├── bwa_mem_latest.cwl
│   ├── picard_markduplicates_2.14.0.cwl
│   ├── picard_markduplicates_latest.cwl
│   ├── picard_sortsam_2.14.0.cwl
│   └── picard_sortsam_latest.cwl
└── workflows
    ├── pre_cocleaning_workflow_1.0.cwl
    └── pre_cocleaning_workflow_latest.cwl

12 directories, 16 files

@allisonheath commented on Thu Dec 21 2017

Now being worked on in https://github.com/kids-first/rfcs/pull/4

Identify pilot test set for workflows

@allisonheath commented on Tue Dec 19 2017

Proposed: 30 trios (so 90 WGS) from each sequencing center to run.

@allisonheath commented on Thu Dec 21 2017

Replaced by #27

BAM sorting optimization suggestion

Tools in the workflow

Picard SortSam

Current function

Sort aligned, duplicate marked BAM.

Proposed modification

Use Sambamba Sort+Index for this purpose.

Performance improvement

Picard SortSam time 3h 1m
Picard MarkDuplicates time 4h 50m
Sambamba Merge+Sort+Index time 1h 12m

Example command lines

Sambamba Merge

/opt/sambamba_v0.6.4 merge  -t 31  ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam
/root_bwa_mem_1_s/2895813008.aligned.unsorted.bam 
/root_bwa_mem_2_s/2895813030.aligned.unsorted.bam 
/root_bwa_mem_3_s/2895813316.aligned.unsorted.bam 
....
...
... 
/root_bwa_mem_21_s/2895821901.aligned.unsorted.bam

Sambamba Sort

/opt/sambamba_v0.6.4 sort  -o ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam -t 31 /root_sambamba_merge/ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam

Sambamba Index

mv /root_sambamba_sort/ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam . && /opt/sambamba_v0.6.4 index -t 31 ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bai

Sambamba Merge is added in place of Picard MarkDuplicates, as merging of separate BAM files was originally done there. Tested with a randomly selected BAM from the pilot set.
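The three steps above can be sketched as a small helper that assembles the merge, sort, and index commands as argument lists, using only the subcommands and flags that appear in the example command lines. This is an illustrative sketch rather than the workflow's actual wiring: the function name, the `.merged.bam` intermediate name, and the sample and input names in the usage example are assumptions. Nothing is executed; the commands are only printed.

```python
import shlex
from typing import List, Sequence


def sambamba_commands(
    sample: str,
    inputs: Sequence[str],
    threads: int = 31,
    exe: str = "/opt/sambamba_v0.6.4",
) -> List[List[str]]:
    """Build merge -> sort -> index commands for one sample."""
    merged = f"{sample}.merged.bam"  # hypothetical intermediate file name
    sorted_bam = f"{sample}.aligned.duplicates_marked.sorted.bam"
    index = sorted_bam[: -len(".bam")] + ".bai"
    return [
        [exe, "merge", "-t", str(threads), merged, *inputs],
        [exe, "sort", "-o", sorted_bam, "-t", str(threads), merged],
        [exe, "index", "-t", str(threads), sorted_bam, index],
    ]


# Hypothetical sample and per-read-group inputs, for illustration only.
for cmd in sambamba_commands("sample1", ["rg1.aligned.unsorted.bam", "rg2.aligned.unsorted.bam"]):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute the steps in order; keeping them as lists avoids shell quoting problems with unusual file names.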

optimization suggestions list

Opening this issue to aggregate optimization suggestions.

Alignment #47
Marking duplicates, BQSR, sorting
Metrics
Validate gVCF
