GRASPER
Heewook Lee
[email protected]

--------------------------
         SUMMARY
--------------------------

GRASPER (Genome Rearrangement Analysis using Short Paired-End Reads) is a de novo structural variation (SV) caller that is capable of detecting repetitive SVs.

It uses RepGraph (a BLAST to A-Bruijn graph package) to construct A-Bruijn graphs of a given reference genome to capture approximate repeats (e.g. 95% sequence similarity or higher); SVs are then detected on the graphs.

GRASPER requires a reference genome sequence in a FASTA-formatted file along with Illumina paired-end sequencing data of a sample genome.

Currently, it supports the following SV types:

1) Duplicative transposition
2) Deletion of non-repetitive region
3) Deletion of repetitive region
4) Deletion of non-repetitive region bounded by repeats (via homologous recombination)
5) Inversion
6) Tandem-duplication

Unsupported events are still reported in the form of breakpoints. GRASPER first calls breakpoints, then assigns SV events based on the well-known paired SV signatures along with read-depth information. Any breakpoint without an SV event assignment is reported separately.


--------------------------
      Requirements
--------------------------
To build and run GRASPER, the following are required:

- JDK 1.6 or higher

- Unix-like OS (Linux, Mac OS X, ... )

- Legacy BLAST (available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ ; more information at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download ). We used version 2.2.25, which can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.25/

- Burrows-Wheeler Aligner by Heng Li (version 0.7.9 or higher)

- BLAST to A-Bruijn graph package (available from https://github.com/COL-IU/RepGraph )

- Illumina or Illumina-like paired-end reads (whole-genome sequencing)

- a reference genome sequence

- As of v0.1.1, the .medMAD file is generated AUTOMATICALLY by RepGraph (v0.1.1). This file contains the median and Median Absolute Deviation (MAD) values for the library insert size, as a single line of 2 values delimited by a tab. Under a normal distribution, 1 SD ~ 1.4826 MAD (https://en.wikipedia.org/wiki/Median_absolute_deviation).
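
As an illustration of the .medMAD description, here is a minimal Python sketch that reads such a file and converts the MAD into an approximate standard deviation. It is not part of GRASPER. The function names are hypothetical, and the sketch assumes the two values appear in the order median, then MAD.

```python
from pathlib import Path

# Scale factor quoted above: 1 SD ~ 1.4826 MAD under a normal distribution.
MAD_TO_SD = 1.4826


def read_med_mad(path):
    """Parse the single tab-delimited line of a .medMAD file.

    Assumption: the first value is the median and the second is the MAD.
    """
    first_line = Path(path).read_text().splitlines()[0]
    median_text, mad_text = first_line.split("\t")
    return float(median_text), float(mad_text)


def approx_sd(mad):
    """Approximate the insert-size standard deviation from its MAD."""
    return MAD_TO_SD * mad
```

For example, a MAD of 30 corresponds to an SD of roughly 44.5 under the normality assumption.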

-------------------------
      Installation
-------------------------

After downloading the GRASPER source distribution and unpacking it, change into the top-level directory:

> cd grasper


Then, compile and create .jar files

> make
 

This will create a new directory "bin" under the grasper directory with the following jar file:

grasper.jar


-------------------------
       Config file
-------------------------
The configuration file contains the parameters that GRASPER, RepGraph, BLAST, and bwa need.

An example configuration file can be found in the "test_data" directory.


-------------------------
      How to run
-------------------------

Although grasper can run as a standalone program, it first needs an A-Bruijn graph representation of the reference genome, which is generated by the RepGraph package, as well as a SAM-formatted alignment of the paired-end reads. For this reason, grasper.sh is provided to tie all these dependencies together in a single script.

Here is the list of commands for running on test_data:

1. Move into the test_data directory under the GRASPER directory
> cd <GRASPER_INSTALLATION_DIR>/test_data

2. Indexing for BLAST and bwa (ONLY needs to be run once for a reference genome)
> ../grasper.sh I example_config.txt

3. Run pair-wise BLASTN on a given reference genome and construct A-Bruijn graphs (ONLY needs to be run once for a reference genome)
> ../grasper.sh G example_config.txt

4. Align via BWA
> ../grasper.sh A example_config.txt 20Insertions_per_element_1TH_pIRS_20X_11_90_470_1.fq.gz 20Insertions_per_element_1TH_pIRS_20X_11_90_470_2.fq.gz

5. Depth serialization, mid-sorting, discordant pair removal, SV detection
> ../grasper.sh DS example_config.txt

Note that the commands A, D, and S can be run separately or combined (e.g. ADS). Run grasper.sh without any parameters to see more explanation.
> ../grasper.sh

A screen dump of a run on test_data can be found in test_data/test_data.screendump
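
The numbered steps above can also be collected in a small driver. The following Python sketch is not part of GRASPER: the function name is hypothetical, and the read file names in the usage lines are placeholders. It only builds and prints the four grasper.sh invocations, using the modes I, G, A, and DS exactly as listed, and assumes it is run from the test_data directory.

```python
import shlex


def grasper_commands(config, reads_1, reads_2, script="../grasper.sh"):
    """Return one argv list per grasper.sh step, in the order listed above."""
    return [
        [script, "I", config],                    # index for BLAST and bwa (once per reference)
        [script, "G", config],                    # pairwise BLASTN and A-Bruijn graphs (once per reference)
        [script, "A", config, reads_1, reads_2],  # align the paired-end reads via BWA
        [script, "DS", config],                   # depth serialization through SV detection
    ]


if __name__ == "__main__":
    # Placeholder read file names; substitute the real FASTQ files.
    for argv in grasper_commands("example_config.txt", "reads_1.fq.gz", "reads_2.fq.gz"):
        # Dry run: replace print with subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Keeping the commands as argument lists avoids shell quoting problems if a file name contains spaces.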

------------------------
        OUTPUT
------------------------
*.thread : A-Bruijn graphs threading information

*.depth : .depth file contains the serialization of depth arrays. 

*.discordant.midsorted : midpoint-sorted SAM file containing only the discordant mappings

*.SV : this file contains the SV calls from GRASPER

-----------------------
       .SV file
-----------------------
Two-breakpoint events (TRANSPOSITION or INVERSION) have 23 columns; one-breakpoint events have only the first 13 columns.

*** COLUMNS ***
Column 1 : Event  ( (I) means inverted )
Column 2 : event classifier (internal purpose)

Column 3/5/20/22 : These columns indicate #reads in cluster
Column 4/6/21/23 : These columns indicate the number of locations on the linear reference to which these clusters can map. Clusters on the graph that lie on repetitive paths will have numbers > 1 to indicate their multiplicities.

Column ( 7-8-9 / 10-11-12 / 14-15-16 / 17-18-19 ) : One triplet indicates 5'boundary-3'boundary-ClusteringDirection of a cluster of reads

Columns 3-4-7-8-9 describe a single cluster (the boundaries and direction are given in columns 7-8-9, and the #reads and multiplicity of this cluster are in columns 3-4).
Columns 5-6-10-11-12 describe a single cluster.
Columns 20-21-14-15-16 describe a single cluster.
Columns 22-23-17-18-19 describe a single cluster.

Clusters that cannot be assigned to a specific event are appended at the end under the "#		UNASSIGNED CLUSTERS" section.
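
To make the column grouping concrete, here is a hedged Python sketch that splits one event row of a .SV file into its clusters. It is not part of GRASPER. The function name and dictionary keys are hypothetical, and the sketch assumes the columns are tab-delimited, which the description does not state. Values are kept as strings so that no field types are assumed, and rows in the unassigned-clusters section are not handled.

```python
# 1-based column groups taken from the description above:
# (#reads, multiplicity, 5' boundary, 3' boundary, clustering direction)
CLUSTER_COLUMNS = [
    (3, 4, 7, 8, 9),
    (5, 6, 10, 11, 12),
    (20, 21, 14, 15, 16),
    (22, 23, 17, 18, 19),
]


def split_sv_line(line, delimiter="\t"):
    """Group the columns of one event row into per-cluster dictionaries."""
    fields = line.rstrip("\n").split(delimiter)
    record = {"event": fields[0], "classifier": fields[1], "clusters": []}
    for columns in CLUSTER_COLUMNS:
        if max(columns) > len(fields):
            break  # 13-column rows carry only the first two clusters
        reads, multiplicity, five_prime, three_prime, direction = (
            fields[c - 1] for c in columns
        )
        record["clusters"].append({
            "reads": reads,
            "multiplicity": multiplicity,
            "five_prime": five_prime,
            "three_prime": three_prime,
            "direction": direction,
        })
    return record
```

A 23-column row yields four clusters and a 13-column row yields two, matching the two row widths described above.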


*** Event Boundaries ***
1) Deletion: Deletion boundaries are roughly defined by [column8, column10] (Direction of clusters : --> <--)

2) Inversion: Inversion boundaries are roughly defined by [column8/column10 , column15/column16] 

3) Transposition: The segment that is being transposed is roughly defined by [column3, column7] (<--- --->), and it is transposed to the target location, roughly around column15/column16 (---> <---). The midpoint of column15 and column16 is probably a reasonable guess.

4) Tandem duplication: Segment that is being tandemly duplicated is roughly defined by [column7, column11] (Direction: <--- --->)
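
As a small illustration of the boundary rules for the two simplest cases, the Python sketch below looks up the rough interval of a deletion or a tandem duplication from an already split row. It is not part of GRASPER. The names are hypothetical, and it assumes the boundary columns hold integer coordinates. Inversion and transposition are left out because their boundaries involve more than one column pair.

```python
# 1-based column pairs quoted from the event boundary notes above.
BOUNDARY_COLUMNS = {
    "deletion": (8, 10),
    "tandem_duplication": (7, 11),
}


def rough_interval(fields, kind):
    """Return the approximate (start, end) of an event.

    `fields` is one .SV row split into columns; `kind` is a key of
    BOUNDARY_COLUMNS chosen by the caller. Assumes integer coordinates.
    """
    start_column, end_column = BOUNDARY_COLUMNS[kind]
    return int(fields[start_column - 1]), int(fields[end_column - 1])
```

The intervals are only as precise as the cluster boundaries, so they should be treated as approximate.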
