heewookl / grasper
This project forked from col-iu/grasper
GRASPER: Genome Rearrangement Analysis using Short Paired-End Reads
License: GNU General Public License v3.0
GRASPER
Heewook Lee
[email protected]

--------------------------
SUMMARY
--------------------------
GRASPER (Genome Rearrangement Analysis using Short Paired-End Reads) is a de novo structural variation (SV) caller capable of detecting repetitive SVs. It uses RepGraph (the BLAST to A-Bruijn program) to construct A-Bruijn graphs of a given reference genome to capture approximate repeats (e.g. 95% sequence similarity or higher); SVs are then detected on the graphs.

GRASPER requires a reference genome sequence in a FASTA-formatted file along with Illumina paired-end sequencing data of a sample genome.

Currently, it supports:
1) Duplicative transposition
2) Deletion of non-repetitive region
3) Deletion of repetitive region
4) Deletion of non-repetitive region bounded by repeats (via homologous recombination)
5) Inversion
6) Tandem duplication

Unsupported events are still reported in the form of breakpoints.

GRASPER first calls breakpoints, then assigns SV events based on the well-known paired SV signatures along with read-depth information. Any breakpoint event without an SV event assignment is reported separately.

--------------------------
Requirements
--------------------------
To build and run GRASPER, the following are required:
- JDK 1.6 or higher
- Unix-like OS (Linux, Mac OS X, ...)
- Legacy BLAST (available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ , more information at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
  We used version 2.2.25, which can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.25/
- Burrows-Wheeler Aligner (BWA) by Heng Li (version 0.7.9 or higher)
- BLAST to A-Bruijn graph package (available from https://github.com/COL-IU/RepGraph )
- Illumina or Illumina-like paired-end reads (whole-genome sequencing)
- a reference genome sequence
- as of v0.1.1, a .medMAD file is generated AUTOMATICALLY by RepGraph (v 0.1.1).
  This file contains the median and Median Absolute Deviation (MAD) values of the library insert size. Under a normal distribution, 1 SD ~ 1.4826 MAD (https://en.wikipedia.org/wiki/Median_absolute_deviation). The file contains a single line of 2 values delimited by a tab.

-------------------------
Installation
-------------------------
After downloading the GRASPER source distribution and unpacking it, change into the top-level directory:

> cd grasper

Then compile and create the .jar file:

> make

This will create a new directory "bin" under the grasper directory containing the following jar file:

grasper.jar

-------------------------
Config file
-------------------------
The configuration file contains the parameters that GRASPER/RepGraph/BLAST/bwa need. An example configuration file can be found in the "test_data" directory.

-------------------------
How to run
-------------------------
Although grasper can run as a stand-alone program, it first needs an A-Bruijn graph representation of the reference genome, which is generated by the RepGraph package, as well as a SAM-formatted alignment of the paired-end reads. For this reason, grasper.sh is provided to tie all these dependencies together in a single script.

Here is the list of commands for running on test_data:

1. Move into the test_data directory under the GRASPER directory
> cd <GRASPER_INSTALLATION_DIR>/test_data

2. Indexing for BLAST and bwa (ONLY needs to be run once for a reference genome)
> ../grasper.sh I example_config.txt

3. Run pair-wise BLASTN on a given reference genome and construct A-Bruijn graphs (ONLY needs to be run once for a reference genome)
> ../grasper.sh G example_config.txt

4. Align via BWA
> ../grasper.sh A example_config.txt 20Insertions_per_element_1TH_pIRS_20X_11_90_470_1.fq.gz 20Insertions_per_element_1TH_pIRS_20X_11_90_470_2.fq.gz

5. Depth serialization, mid-sorting, discordant pair removal, SV detection
> ../grasper.sh DS example_config.txt

Note that the A, D and S commands can be run separately or combined all together (ADS).
Run grasper.sh without any parameters to see more explanation.
> ./grasper.sh

A screen dump of a run on test_data can be found in test_data/test_data.screendump

------------------------
OUTPUT
------------------------
*.thread               : A-Bruijn graph threading information
*.depth                : serialization of the depth arrays
*.discordant.midsorted : midpoint-sorted SAM file containing only the discordant mappings
*.SV                   : the SV calls from GRASPER

-----------------------
.SV file
-----------------------
2-breakpoint events (TRANSPOSITION or INVERSION) have 23 columns; 1-breakpoint events have only the first 13 columns.

*** COLUMNS ***
Column 1 : Event ( (I) means inverted )
Column 2 : Event classifier (internal purpose)
Column 3/5/20/22 : Number of reads in the cluster
Column 4/6/21/23 : Number of instances at which the cluster can map on the linear reference. Clusters on the graph that lie on repetitive paths have numbers > 1 to indicate their multiplicities.
Column ( 7-8-9 / 10-11-12 / 14-15-16 / 17-18-19 ) : Each triplet gives 5'boundary-3'boundary-ClusteringDirection of a cluster of reads

Columns 3-4-7-8-9 describe a single cluster (the boundaries and direction are in columns 7-8-9, and the #reads and multiplicity of this cluster are in columns 3-4).
Columns 5-6-10-11-12 describe a single cluster.
Columns 20-21-14-15-16 describe a single cluster.
Columns 22-23-17-18-19 describe a single cluster.

Clusters that cannot be assigned to a specific event are appended at the end under the "# UNASSIGNED CLUSTERS" section.
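
The column layout above can be turned into a small reader. The following Python sketch is not part of GRASPER. It assumes the fields are tab-separated (the delimiter is not stated here) and keeps every value as a raw string. The names parse_sv_line and read_sv_events are hypothetical. Column 13 is not described above, so it is not interpreted.

```python
# Column numbers are 1-based, as in the column list above.
# Each tuple: (#reads, multiplicity, 5' boundary, 3' boundary, direction).
CLUSTER_COLUMNS = [
    (3, 4, 7, 8, 9),
    (5, 6, 10, 11, 12),
    (20, 21, 14, 15, 16),
    (22, 23, 17, 18, 19),
]


def parse_sv_line(line, sep="\t"):
    """Split one .SV event row into its event label and read clusters.

    ASSUMPTION: fields are tab-separated; values are kept as raw strings.
    """
    cols = line.rstrip("\n").split(sep)
    if len(cols) < 13:
        raise ValueError("expected at least 13 columns, got %d" % len(cols))

    clusters = []
    for reads, mult, five, three, direction in CLUSTER_COLUMNS:
        if max(reads, mult, five, three, direction) > len(cols):
            continue  # 13-column rows carry only the first two clusters
        clusters.append({
            "reads": cols[reads - 1],
            "multiplicity": cols[mult - 1],
            "five_prime": cols[five - 1],
            "three_prime": cols[three - 1],
            "direction": cols[direction - 1],
        })
    return {"event": cols[0], "classifier": cols[1], "clusters": clusters}


def read_sv_events(lines):
    """Parse event rows, stopping at the "# UNASSIGNED CLUSTERS" section."""
    events = []
    for line in lines:
        if line.startswith("# UNASSIGNED CLUSTERS"):
            break  # the row format of that section is not described above
        if not line.strip() or line.startswith("#"):
            continue
        events.append(parse_sv_line(line))
    return events
```

With an open file handle, read_sv_events(handle) would return one dictionary per assigned event. Rows with 23 columns yield four clusters and rows with 13 columns yield two.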

*** Event Boundaries ***
1) Deletion: Deletion boundaries are roughly defined by [column8, column10] (Direction of clusters: --> <--)
2) Inversion: Inversion boundaries are roughly defined by [column8/column10 , column15/column16]
3) Transposition: The segment that is being transposed is roughly defined by [column3, column7] (<--- --->), and it is being transposed to the target location, roughly around column15/column16 (---> <---). The midpoint of column15 and column16 is probably a reasonable guess.
4) Tandem duplication: The segment that is being tandemly duplicated is roughly defined by [column7, column11] (Direction: <--- --->)
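
As a rough illustration of the deletion and tandem-duplication rules above, the following Python sketch reads the approximate interval from an already-split row. It is not part of GRASPER. It assumes the boundary columns hold plain integer coordinates, and the function names are hypothetical. Inversion and transposition are left out because their rules refer to more than one candidate column.

```python
def column(cols, n):
    """Return the value in 1-based column n as an integer.

    ASSUMPTION: boundary columns contain plain integer coordinates.
    """
    return int(cols[n - 1])


def deletion_interval(cols):
    """Approximate deletion boundaries: [column8, column10]."""
    return (column(cols, 8), column(cols, 10))


def tandem_duplication_interval(cols):
    """Approximate tandemly duplicated segment: [column7, column11]."""
    return (column(cols, 7), column(cols, 11))
```

Here cols is the list of fields of a single .SV row, for example line.rstrip("\n").split("\t") if the file is tab-separated. The caller decides which function applies from the event label in column 1.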