Giter Club home page Giter Club logo

gff3sort's Introduction

GFF3sort

A Perl Script to sort gff3 files and produce suitable results for tabix tools

Usage

gff3sort.pl [input GFF3 file] >output.sort.gff3
Optional Parameters:
--precise           Run in precise mode, about 2X~3X slower than the default mode. 
                    Only needed to be used if your original GFF3 files have parent
                    features appearing behind their children features.
                    
--chr_order         Select how the chromosome IDs should be sorted. 
                    Acceptable values are: alphabet, natural, original
                    [Default: alphabet]
                    
--extract_FASTA     If the input GFF3 file contains FASTA sequence at the end, use this
                    option to extract the FASTA sequence and place in a separate file 
                    with the extention '.fasta'. By default, the FASTA sequences would be
                    discarded.

Publication

Zhu T, Liang C, Meng Z, Guo S, Zhang R: GFF3sort: A novel tool to sort GFF3 files for tabix indexing. BMC Bioinformatics 2017, 18:482, https://doi.org/10.1186/s12859-017-1930-3

Background

The tabix tool from htslib requires files sorted by their chromosomes and positions. For GFF3 files, they would be sorted by column 1 (chromosomes) and 4 (start positions) as:

sort -k1,1 -k4,4n myfile.gff > myfile.sorted.gff
(OR)
gt gff3 -sortlines -tidy -retainids myfile.gff > myfile.sorted.gff

Then, the sorted GFF3 file could be indexed by:

bgzip myfile.sorted.gff
tabix -p gff myfile.sorted.gff.gz

However, either the GNU sort or the gt tool has a bug: Lines with the same chromosomes and start positions would be placed randomly. Therefore, parent feature lines might sometimes be placed after their children lines. For example, the following features:

##gff-version 3
###
A01	Cufflinks	mRNA	473	6154	.	-	.	ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01	Cufflinks	exon	473	814	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	1626	2574	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	2695	2721	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	3637	3726	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	5329	5408	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	5994	6154	.	-	.	Parent=XLOC_001154.41
###
A01	Cufflinks	mRNA	473	6386	.	-	.	ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01	Cufflinks	exon	473	2024	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	2615	2721	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	3637	3726	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	5329	6386	.	-	.	Parent=XLOC_001154.42

would be sorted as:

##gff-version 3
##sequence-region   A01 473 6386
A01	Cufflinks	exon	473	814	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	473	2024	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	mRNA	473	6154	.	-	.	ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01	Cufflinks	mRNA	473	6386	.	-	.	ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01	Cufflinks	exon	1626	2574	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	2615	2721	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	2695	2721	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	3637	3726	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	3637	3726	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	5329	5408	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	5329	6386	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	5994	6154	.	-	.	Parent=XLOC_001154.41
###

That is, the two mRNA lines start with pos 473 would be "randomly" placed after the two exon lines which also start with pos 473. These would encount bugs such as GMOD/jbrowse#780

This script would adjust lines with the same start positions. It would move lines with "Parent=" attributes (case insensitive) behind lines without "Parent=" attributes. The result would be:

A01	Cufflinks	mRNA	473	6386	.	-	.	ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01	Cufflinks	mRNA	473	6154	.	-	.	ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01	Cufflinks	exon	473	814	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	473	2024	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	1626	2574	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	2615	2721	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	2695	2721	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	3637	3726	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	3637	3726	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	5329	5408	.	-	.	Parent=XLOC_001154.41
A01	Cufflinks	exon	5329	6386	.	-	.	Parent=XLOC_001154.42
A01	Cufflinks	exon	5994	6154	.	-	.	Parent=XLOC_001154.41

gff3sort's People

Contributors

sestaton avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.