Giter Club home page Giter Club logo

hapdip's Introduction

Introduction

This project started as a set of companion scripts used in my review (or preprint). After the publication, I polished scripts and separated them out in this independent repo. With time, the project grows with more functionalities particularly focusing on VCF processing. This README gives an overview of key routines. README-hapdip.md explains the original hapdip benchmark in more details.

As most of my programs, this hapdip.js script consists of multiple subcommands. It is very fast and seamlessly reads both plain and gzip'd text files. In addition, for VCF processing, the script properly breaks MNPs and complex events down to atomic SNPs and INDELs. It is aware of the CIGAR INFO field produced by FreeBayes.

This script requires the k8 javascript shell to run, which can be found here.

Functionalities

Filtering single-sample VCF files

VCF produced from whole-genome deep resequencing can be filtered with:

k8 hapdip.js deovlp raw.vcf.gz | k8 hapdip.js anno | gzip -1 > anno.vcf.gz
k8 hapdip.js filter anno.vcf.gz | bgzip > filtered.vcf.gz

Here deovlp chooses the best VCF record among overlapping records, anno adds generic depth-related INFO fields and filter uses the fields to filter variants with hard thresholds in my review. In my experience, for single-sample calling, these hard filters are better than GATK's VQSR (see the FermiKit manuscript).

Converting VCF to unary BED

k8 hapdip.js vcf2bed var.vcf.gz > var.bed

In the output, each alternate allele takes one BED line, which consists of chr, start, end, allele length (0 for SNPs, positive for insertions and negative for deletions), ref allele, alt allele, count of alt allele, total called haplotypes and filters in VCF. Complex events are broken down to atomic SNPs and INDELs.

Subsetting VCF and reordering/replacing sample names

echo -e "old2\tnew1\nold4\tnew2" > list.txt
k8 hapdip.js vcfsub list.txt var.vcf > var-sub.vcf

The command chooses a subset of samples and reorder the samples according to the input. If the input list.txt has a second column, the command replace the old sample name in the first column to the new one on the second column.

Evaluating calls from the CHM1-NA12878 pair

k8 hapdip.js eval CHM1.vcf.gz NA12878.vcf.gz

Please see README-hapdip.md for details.

Evaluating against a truth set

The distEval command evaluates a call set against a GIAB-like truth dataset, which consists of one BED file giving high-confidence regions and one VCF file giving variants.

# download GIAB-2.18 truth dataset (or from http://bit.ly/giab218)
wget ftp://hengli-data:[email protected]/hapdip/GIAB/N+P.auto.bed.gz
wget ftp://hengli-data:[email protected]/hapdip/GIAB/P.vcf.gz
# download an evaluation call set
wget ftp://hengli-data:[email protected]/hapdip/vcf-flt/NA12878.mem.hc.flt.vcf.gz
# evaluate
k8 hapdip.js distEval -d 10 -b N+P.auto.bed.gz P.vcf.gz NA12878.mem.hc.flt.vcf.gz

The output is:

distEval        SNP     N+P     2083729399
distEval        SNP     TP      2634728
distEval        SNP     FN      27555
distEval        SNP     FP      263
distEval        INDEL   N+P     2083729399
distEval        INDEL   TP      164730
distEval        INDEL   FN      2074
distEval        INDEL   FP      1762

where N+P gives the total length of the non-overlapping regions in N+P.auto.bed.gz, TP is the number of true SNPs (INDELs) within d-bp from a called variant (including both SNPs and INDELs), FN the number of true SNPs (INDELs) not within d-bp from any called variants and FP the number of called SNPs (INDELs) not within d-bp from any true variants. Distance-based evaluation is robust to the multi-representation of variants.

hapdip's People

Contributors

brentp avatar lh3 avatar mcshane avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.