Giter Club home page Giter Club logo

biscuit's Introduction

BISCUIT Travis-CI Build Status DOI Downloads

BISulfite-seq CUI Toolkit (BISCUIT) is a utility suite for analyzing sodium bisulfite conversion-based DNA methylation/modification data. It was written to perform alignment, DNA methylation and mutation calling, allele specific methylation from bisulfite sequencing data.

Download

Source code

All releases are available here. Note after v0.2.0, if you choose to use Git, make sure use git clone --recursive to get the submodules.

You can also download with Curl using

$ curl -O $(curl -s https://api.github.com/repos/zwdzwd/biscuit/releases/latest | grep browser_download_url | grep release-source.zip | cut -d '"' -f 4)

The precompiled

$ # macOS
$ curl -O $(curl -s https://api.github.com/repos/zwdzwd/biscuit/releases/latest | grep browser_download_url | grep macOS | cut -d '"' -f 4)
$ # Linux
$ curl -O $(curl -s https://api.github.com/repos/zwdzwd/biscuit/releases/latest | grep browser_download_url | grep x86_64 | cut -d '"' -f 4)

Compile

If you download source code, BISCUIT can be compiled through,

$ unzip release.zip
$ cd biscuit-release
$ make

The created biscuit binary is the main entry point.

Usage

Index reference for alignment

biscuit index GRCh38.fa

The index of BISCUIT composed of the 2-bit packed reference (.bis.pac, .bis.amb, .bis.ann). The suffix array and FM-index of the parent strand (.par.bwt and .par.sa) and the daughter strand (.dau.bwt and .dau.sa).

Read alignment

The following snippet shows how BISCUIT can be used in conjunction with samtools to produce indexed alignment BAM file.

$ biscuit align -t 10 GRCh38.fa fastq1.fq.gz fastq2.fq.gz | samtools sort -T . -O bam -o output.bam
$ samtools index output.bam
$ samtools flagstat output.bam >output.bam.flagstat

See here for more information.

Visualize alignment

The tview subroutine colors the alignments in bisulfite mode. Here is a screenshot.

$ biscuit tview -g chr19:7525080 input.bam ref.fa

Unlike samtools, in this subroutine, a reference fasta file is mandatory so that bisulfite conversion can be identified.

Mark duplicate reads

This step is optional. The mark duplicate of BISCUIT is bisulfite strand aware.

$ biscuit markdup input.bam output.bam

Pileup

Like samtools, BISCUIT extract DNA methylation as well as genetic information. The following shows how to produce a tabix-indexed VCF file.

$ biscuit pileup GRCh38.fa input.bam -o output.vcf -q 20
$ bgzip output.vcf
$ tabix -p vcf output.vcf.gz

Make bed files

The following extract CpG beta values from the VCF file.

$ biscuit vcf2bed -k 10 -t cg input.vcf.gz

-t can also take

  • snp - SNP information
  • c - all cytosines
  • hcg - HCG for NOMe-seq
  • gch - GCH for NOMe-seq

-e output sequence contexts.

QC your run

A bash script is provided to simplify QC procedure.

$ ./scripts/QC.sh -v input.vcf setup_file sample.name input.bam

generates QC information that can be picked up by MultiQC. Setup files for hg19, hg38 and mm10 are provided. Other genome builds can be built following the same format.

EPI-reads and allele-specific methylation

Following illustrates how to produce epiread which carries the information of epi-haplotype.

$ biscuit epiread -r GRCh38.fa -i input.bam -B snp.bed

To test all SNP-CpG pair,

$ biscuit epiread -r GRCh38.fa -P -i input.bam -B snp.bed

Details can be found here.

sort -k1,1 -k2,2n -k3,3n in.epiread >out.epiread
biscuit asm out.epiread >out.asm

Validate bisulfite conversion label

Sometimes, the bisulfite conversion labels in a given alignment are inaccurate, conflicting or ambiguous. The bsstrand command summarizes these labels given the number of C>T, G>A substitutions. It can correct inaccurate labels as an option.

$ biscuit bsstrand -g "chr1:1000000-1050000" GRCh37.fa input.bam 

gives something like

Mapped reads: 16688
Unmapped reads: 29
Corrected reads: 0 (0.00%)

Confusion counts:
orig\infer     BSW (f)      BSC (r)      Conflict (c) Unknown (u)
     BSW (f):   8426         42           4            12
     BSC (r):   15           8167         3            19
Conflict (c):   0            0            0            0
 Unknown (u):   0            0            0            0

The inferred YD tag gives the following

  • f: foward/Waston strand
  • r: reverse/Crick strand
  • c: conflicting strand information
  • u: unintelligible strand source (unknown)

YD is inferred based on the count of C>T (nCT) and G>A (nGA) observations in each read according to the following rule: If both nCT and nGA are zero, YD = u. s = min(nG2A,nC2T)/max(nG2A,nC2T). if nC2T > nG2A && (nG2A == 0 || s <= 0.5), then YD = f. if nC2T < nG2A && (nC2T == 0 || s <= 0.5), then YD = r. All other scenarios gives YD = c. -y append nCT(YC tag) and nGA(YG tag) in the output bam.

Summarize and filter reads by bisulfite conversion

$ biscuit bsconv -g "chr1:1000000-1050000" GRCh37.fa input.bam

For some library preparation, incomplete conversion are enriched in a subset of reads that needs to be filtered. This command transforms bam into one that contains file ZN tag e.g., ZN:Z:CA_R0C11,CC_R1C14,CG_R0C2,CT_R1C5. This tag summarizes counts of retention and conversion for four different cytosine contexts CpA, CpC, CpG and CpT. It contains a minimum threshold of CpA, CpC, CpT or CpH in general. The -b option outputs the summary in tables instead of as tags in the BAM file.

Acknowledgements

  • lib/aln was adapted from Heng Li's BWA-mem code.
  • lib/htslib was subtree-ed from the htslib library.
  • lib/klib was subtree-ed from Heng Li's klib.

biscuit's People

Contributors

zwdzwd avatar zackramjan avatar ttriche avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.