gffutils's Issues

more flexible FeatureDB.region

See comments in #42. It would be useful to select features by entire seqid rather than knowing ahead of time what the seqid size is.

Proposed interface:

# already works
db.region(('chr1', 1, 1000))

# already works
db.region("chr1:1-1000")

# String as first argument? Assume seqid.
db.region('chr1')

# None as "start" position in tuple? 
# Do not add start >= clause (interpret as chr1[:1000]).
db.region(('chr1', None, 1000))

# None as "stop" position in tuple?
# Do not add stop <= clause (interpret as chr1[1000:]).
db.region(('chr1', 1000, None))
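
A rough sketch of how the WHERE-clause construction could honor these semantics, using the usual overlap test (feature.end >= start AND feature.start <= end) and simply omitting a clause when its bound is None. This is a hypothetical helper, not gffutils API:

```python
def region_clauses(seqid, start=None, end=None):
    """Build WHERE clauses for the proposed region() semantics:
    a None start or end simply omits that constraint."""
    clauses, args = ["seqid = ?"], [seqid]
    if start is not None:
        clauses.append("end >= ?")
        args.append(start)
    if end is not None:
        clauses.append("start <= ?")
        args.append(end)
    return " AND ".join(clauses), args
```

So `region_clauses('chr1')` constrains only the seqid, matching the `db.region('chr1')` case above.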

Citation?

What is the appropriate way to cite gffutils in a publication?
Thanks!
Olga

Optimization of database creation

Is it possible to optimize database creation? It'd be interesting to see where the bottlenecks are. Would Cythonizing certain parts of the code help, or is the bottleneck purely at the sqlite interface? For example, creating a database for this GFF (Ensembl mouse genes) takes ~15 minutes:

http://genes.mit.edu/yarden/Mus_musculus.NCBIM37.65.gff
$ gffutils-cli create Mus_musculus.NCBIM37.65.gff

The bottleneck is there even when dbs are created in memory, though it's of course smaller in those cases. This is related to another issue we discussed: how to create derivative GFF files from existing ones, i.e. how to iterate through a GFF and make a new version of it. For example, you might iterate through the above GFF and simply add an optional key/value pair to the attributes of some records. This is a "streaming" operation that can be done line-by-line and doesn't necessarily need a db. The overhead of creating the db makes gffutils impractical for these kinds of simple operations on large-ish GFF files.
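
That streaming case can be sketched with the stdlib alone, with no database involved (a minimal illustration of the idea, not gffutils code):

```python
import io

def add_attribute(lines, key, value):
    """Append key=value to column 9 of each GFF3 feature line,
    passing comment and blank lines through untouched."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        fields[8] = fields[8].rstrip(";") + ";%s=%s" % (key, value)
        yield "\t".join(fields)

gff = io.StringIO("chr1\tensGene\tgene\t100\t200\t.\t+\t.\tID=g1\n")
new_lines = list(add_attribute(gff, "my_tag", "1"))
```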

There are more sophisticated (i.e. non-streaming) operations where a naive in-memory solution is still considerably faster because of the database-creation bottleneck. For example, the naive GFF parser used in misopy.gff_utils (see load_genes_from_gff in https://github.com/yarden/MISO/blob/fastmiso/misopy/Gene.py) simply iterates through the GFF file multiple times to collect all the gene entries into a simple Gene object, with mRNAs represented as lists of exons. This gene-centric in-memory solution is less expressive than gffutils (it does not handle arbitrary nesting and basically ignores non-genes), but for simple operations like "add attribute X to all genes" or "reformat the IDs of this GFF" it's considerably faster. It's not actually faster once the DB is created; gffutils retrieval is excellent, but the overhead of creating the db trumps the computation time for many of these operations.

In summary, I'm wondering whether there's a way to bridge the gap between the fast but hacky solutions and the sophisticated gffutils solution that comes with an overhead. I think this is an important issue because many of the operations done on GFFs (at least that I do) don't require many hierarchical SQL queries.

Database creation may be unnecessarily slow due to missing indexes

gffutils database creation may be unnecessarily slow because some sqlite indexes that matter for duplicate merging appear to be missing. Most of the time is spent in _candidate_merges(); the joins in this function are expensive since the newid and idspecid indexes are apparently absent. I added

        c.execute('CREATE UNIQUE INDEX featuresid ON features (id);')
        c.execute('CREATE UNIQUE INDEX dupnewid ON duplicates (newid);')
        c.execute('CREATE INDEX dupspecid ON duplicates (idspecid);')
        self.conn.commit()

below the other index-preparation steps, which resulted in roughly 100-fold faster database creation.

As a side note, I also changed the default pragmas to further increase performance, though I believe the main boost came from the indexes alone:

        pragmas = {'synchronous':'OFF', 'journal_mode': 'MEMORY', 'main.page_size': 4096,\
                   'main.cache_size': 100000, 'temp_store': 2, 'locking_mode': 'EXCLUSIVE'}

add "annotate_introns" method

It would be useful to have a single method that adds intron annotations to the db or a GFF/GTF file. Something along these lines:

for gene in db.features_of_type('gene'):
    for transcript in db.children(gene):
        exons = list(db.children(transcript, featuretype='exon', order_by='start'))
        for intron in db.interfeatures(exons):
            intron.featuretype = 'intron'
            yield intron

(but interfeatures should probably not be a method on FeatureDB... classmethod maybe? Or a separate utility function? Also, it should do something useful with attributes; currently the attributes from interfeatures are an empty string)

sqlite3.OperationalError using features that are too large

db.region(seqid='chr2L', start=1, end=2e8)

gives a sqlite3.OperationalError for too many SQL variables. The sqlite3 limits docs say 999 is the default maximum number of bound arguments.

So we get this error from looking at too many bins at the same time (similar to the last example query in the original UCSC Genome Browser paper).

I think the query-generating code should detect that we have too many args and simply not add the bin constraint to the query.
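
A minimal sketch of that guard (the helper name and the count of other bound args are made up for illustration):

```python
MAX_SQLITE_VARS = 999  # sqlite's default SQLITE_MAX_VARIABLE_NUMBER

def bin_constraint(bins, n_other_args=3):
    """Return the 'bin IN (...)' clause plus its arguments, or
    nothing at all when the query would exceed sqlite's limit."""
    if len(bins) + n_other_args > MAX_SQLITE_VARS:
        return "", []
    return "bin IN (%s)" % ",".join("?" * len(bins)), list(bins)
```

For a region spanning 2e8 bp the bin list blows past the limit, so the clause is dropped and the query falls back to the plain coordinate constraints.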

Merging GTF database and output results to a GTF file

Hello @daler ,

I've been trying to merge features from two different GTF files/databases into a single "combined" GTF: an Ensembl GTF file plus my features of interest from another GTF file (Cufflinks/Cuffcompare output). I have some questions I couldn't find answers to in the docs:

First, since I want to output a single GTF as the result of my merging script: is there a GTF writing method in gffutils? I know I can print features to an open file handle, but how do we deal with GTF/GFF headers?

Second, since I want to pick some features from each input GTF database, my logical approach would be to create an empty db ("output_db", if that's possible), add the features of interest to it, sort them by chromosome/start position, and finally write them to the output GTF file.

At the moment I can pick my features of interest from both input GTF dbs, but I don't know how to merge them into a single db for sorting and then writing to the output.
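
If only a sorted, combined file is needed, a stdlib-only sketch that side-steps the database entirely is possible: collect feature lines from each source, sort by (seqid, start), and write them out. Headers are dropped here for brevity, and the sample lines are purely illustrative:

```python
def merge_sorted(*line_iters):
    """Collect feature lines from several GTF/GFF sources and
    return them sorted by (seqid, start)."""
    records = []
    for lines in line_iters:
        for line in lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip headers/comments
            records.append(line.split("\t"))
    records.sort(key=lambda f: (f[0], int(f[3])))
    return ["\t".join(f) for f in records]

ensembl_lines = ["chr1\tens\texon\t200\t300\t.\t+\t.\tID=b", "# header"]
cufflinks_lines = ["chr1\tcuff\texon\t100\t150\t.\t+\t.\tID=a"]
merged = merge_sorted(ensembl_lines, cufflinks_lines)
```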

Thanks!

Empty attribute

Hi Ryan,

gffutils is awesome, I'm swapping over to using it for doing all of my GTF manipulations.

I have a GTF file that looks like this:

chr1    hg19_refFlat    exon    14362   14829   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    14970   15038   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    15796   15947   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    16607   16765   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    16858   17055   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    17233   17368   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    17606   17742   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    17915   18061   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    18268   18366   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    24738   24891   0.000000    -   .   gene_id "WASH7P"; transcript_id "WASH7P";

But when I write out features like this:

for isoform in single_isoform:
    for feature in db.children(isoform, featuretype="exon", order_by="start"):
        print feature.attributes
        out_handle.write(str(feature) + "\n")

the output has an extra empty key in the attributes field:

{u'': [], u'transcript_id': [u'UBE2Q1'], u'gene_id': [u'UBE2Q1']}
{u'': [], u'transcript_id': [u'RNF10'], u'gene_id': [u'RNF10']}
{u'': [], u'transcript_id': [u'RNF11'], u'gene_id': [u'RNF11']}
{u'': [], u'transcript_id': [u'REM1'], u'gene_id': [u'REM1']}
{u'': [], u'transcript_id': [u'LOC100506540'], u'gene_id': [u'LOC100506540']}
{u'': [], u'transcript_id': [u'TRPS1'], u'gene_id': [u'TRPS1']}

Leading to a GTF file that some downstream tools really don't like:

chr1    hg19_refFlat    exon    154521051   154522945   0.000000    -   .   ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1    hg19_refFlat    exon    154523414   154523480   0.000000    -   .   ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1    hg19_refFlat    exon    154523873   154523968   0.000000    -   .   ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1    hg19_refFlat    exon    154524247   154524295   0.000000    -   .   ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1    hg19_refFlat    exon    154524397   154524455   0.000000    -   .   ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1    hg19_refFlat    exon    154524569   154524659   0.000000    -   .   ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
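
A stdlib workaround until the underlying parsing is fixed: scrub the empty chunks out of column 9 before the file touches downstream tools (a sketch, not gffutils API):

```python
def strip_empty_attrs(gtf_line):
    """Drop empty key/value chunks (e.g. a leading '; ') from
    column 9 of a GTF line."""
    fields = gtf_line.rstrip("\n").split("\t")
    parts = [p.strip() for p in fields[8].split(";")]
    fields[8] = "; ".join(p for p in parts if p) + ";"
    return "\t".join(fields)

bad = ('chr1\thg19_refFlat\texon\t154521051\t154522945\t0.000000\t-\t.\t'
       '; transcript_id "UBE2Q1"; gene_id "UBE2Q1"')
cleaned = strip_empty_attrs(bad)
```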

Stuck on error: AttributeError: 'list' object has no attribute 'dialect'

Hi again,

I'm trying to update GFF features based on the attributes of their siblings or children, so I tried to use FeatureDB.iter_by_parent_childs(). However, when I feed the resulting feature list back to FeatureDB.update(), I keep getting AttributeError: 'list' object has no attribute 'dialect', and I don't understand why. Could you have a quick look? I put a minimal example at https://gist.github.com/rmzelle/2dfd897a285c581c164b

[gffutils 0.8.7.1]
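
One guess at the cause (just a hunch from the error message): iter_by_parent_childs() yields one list of features per parent, while update() expects a flat iterable of features, so each inner list ends up treated as a feature. Flattening first (shown here with placeholder strings rather than real Feature objects) might sidestep the error:

```python
from itertools import chain

# stand-ins for the per-parent feature lists yielded by iter_by_parent_childs()
groups = [["gene1", "mRNA1"], ["gene2", "mRNA2", "exon2"]]

# a flat iterable like this is what update() would then receive
flat = list(chain.from_iterable(groups))
```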

Lightweight Gene class for gffutils

A feature I'd find very useful is parsing gffutils GFF records into a (lightweight) Gene-like object. Something like:

# recs is a set of GFF records
gene_obj = make_gene_obj(recs)
# Access the mRNAs, their exons
first_mRNA = gene_obj.mRNAs[0]
print "Exons of mRNA: ", first_mRNA
for exon_num, exon in enumerate(first_mRNA.exons):
  print "Exon %d is: " %(exon_num), exon 

The reason this is useful is because:

  1. You don't have to keep thinking of parent-child queries for canonical structures like a gene; you can think of things as objects, which is very intuitive. A gene has a list of mRNAs as one of its attributes, and each mRNA has a list of exons (or introns, or CDSs) as its attributes.
  2. The object representation would internally order the features in a predictable way, kind of like GFFWriter does for serialization. For example, if you iterate over the exons belonging to an mRNA, as in:
for exon in first_mRNA.exons:
  # access exon
  ...

then it'd be helpful to know that you're guaranteed to get the exons sorted by coordinate, so that you're iterating over the mRNA's exons from the start of the mRNA to the finish. I believe that if you simply do db.children(mRNA_id) you're not guaranteed any order. Similarly for getting the mRNA records of a gene: it's helpful to know they're sorted in some canonical order, like longest to shortest, so that gene_obj.mRNAs[0] is always the longest mRNA.

I think the ideal implementation would try to keep the objects as similar to GFF features as possible. For example, you should be able to do: gene_obj.mRNAs[0].start, gene_obj.mRNAs[0].stop, etc. The mRNA object could even inherit from the GFF feature class.

The way I'm thinking of accessing the objects would be through an iterator that returns gene objects, making them on the fly, something like:

for gene_obj in db.get_genes():
  # access gene
  ...

where get_genes() internally does something like:

def get_genes(self, ...):
    # Get records belonging to each gene
    for gene_recs in self.iter_by_parent_childs(featuretype='gene'):
        # make the gene's records into an object in canonical order,
        # where mRNAs appear in attributes sorted by length
        # and exons appear in attributes sorted by their
        # start coordinate
        gene_obj = records_to_gene(gene_recs)
        yield gene_obj

What do you think? Is this something you think is worth putting into gffutils? If so, do you have ideas on what would be the most efficient way to do it? The naive sketch of get_genes() might be fast enough, but I'm not sure. Thanks, --Yarden
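
A minimal sketch of the proposed records_to_gene(), using plain dicts as stand-ins for GFF records (the field names and mRNA class are invented for illustration; a real implementation would presumably work on gffutils Feature objects):

```python
class mRNA(object):
    def __init__(self, rec):
        self.id, self.start, self.stop = rec["id"], rec["start"], rec["stop"]
        self.exons = []

def records_to_gene(recs):
    """Group exon records under their parent mRNA; mRNAs come back
    sorted longest-first, and each mRNA's exons sorted by start."""
    mrnas = {}
    for rec in recs:
        if rec["featuretype"] == "mRNA":
            mrnas[rec["id"]] = mRNA(rec)
    for rec in recs:
        if rec["featuretype"] == "exon":
            mrnas[rec["parent"]].exons.append(rec)
    out = sorted(mrnas.values(), key=lambda m: m.stop - m.start, reverse=True)
    for m in out:
        m.exons.sort(key=lambda e: e["start"])
    return out
```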

ANALYZE features speeds computation up to 1000x ?

Hi,
I've been using gffutils with the gencode v18 human genome annotation (http://www.gencodegenes.org/releases/current.html) and noticed the following: if I create the database this way:

#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")

And then run multiple calls to db.parents:

#!/usr/bin/python2
import gffutils as gff
import os.path
import sys

db = gff.FeatureDB('gencode.v19.annotation.gff3.db')
with open(os.path.join(os.path.dirname(__file__), '1000_transcripts.txt')) as transcripts:
        for t in transcripts:
                t = t.strip()
                for gene in db.parents(t, featuretype='gene'):
                        print gene

1000_transcripts.txt contains 1000 random transcripts from the human genome annotation.
The runtime of the second piece of code, as reported by the bash time command, is:

real    0m31.850s
user    0m17.351s
sys     0m14.497s

But if I create database this way:

#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")
db.execute('ANALYZE features')

Then the multiple calls to parents take

real    0m0.089s
user    0m0.072s
sys     0m0.016s

However, I can't see this effect if the call to parents is performed in the same script as the database creation.
I suggest #55 as a fix.
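
For what it's worth, ANALYZE can also be run on an existing database file after the fact via plain sqlite3, without recreating it. The sketch below uses a throwaway in-memory db with a mocked-up features table; for a real gffutils db you would connect to the .db file instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for 'gencode.v19.annotation.gff3.db'
conn.execute("CREATE TABLE features (id TEXT, seqid TEXT)")
conn.execute("CREATE INDEX features_id ON features (id)")
conn.executemany("INSERT INTO features VALUES (?, ?)",
                 [("g%d" % i, "chr1") for i in range(10)])

# gather statistics so the query planner can pick the right indexes
conn.execute("ANALYZE features")
conn.commit()

# ANALYZE records its statistics in the sqlite_stat1 table
has_stats = conn.execute(
    "SELECT count(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
).fetchone()[0]
```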

Error handling for missing/empty/non-GFF files

Not very critical, but it would be good at some point to add better error handling for empty/missing/non-GFF files in various gffutils operations, especially from gffutils-cli. For example:

# make empty file
$ touch test.gff3
# try to use it for gff db creation
$ gffutils-cli create test.gff3
Traceback (most recent call last):
...
  File "yarden/gffutils/gffutils/create.py", line 803, in create_db
    if force_gff or (dialect['fmt'] == 'gff3'):
TypeError: 'NoneType' object has no attribute '__getitem__'
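
A sketch of the kind of pre-flight check that could produce friendlier errors (a hypothetical helper, not part of gffutils):

```python
import os

def checked_gff(fn):
    """Raise a clear error for missing or empty files before they
    reach create_db; content sniffing could be added similarly."""
    if not os.path.exists(fn):
        raise IOError("GFF file not found: %s" % fn)
    if os.path.getsize(fn) == 0:
        raise ValueError("GFF file is empty: %s" % fn)
    return fn
```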

Augustus output: AssertionError or ValueError

Hi again :),

As suggested, I am opening a new issue for the GTF/GFF3 files that fail to parse. They come from the same FASTA file (and are attached below).

I get a ValueError or AssertionError when calling create_db:

>>> gffutils.create_db('augustusDbia3.gtf', 'gtf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 1273, in create_db
    c.create()
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 488, in create
    self._populate_from_lines(self.iterator)
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 553, in _populate_from_lines
    for i, f in enumerate(lines):
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/iterators.py", line 100, in __iter__
    for i in self._iter:
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/iterators.py", line 144, in _custom_iter
    yield feature_from_line(line, dialect=self.dialect)
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/feature.py", line 392, in feature_from_line
    attrs, _dialect = parser._split_keyvals(attr_string, dialect=dialect)
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/parser.py", line 164, in _split_keyvals
    assert len(item) == 1, item
AssertionError: [u'', u'gene_id', u'"contig1.g1"']

and

>>> gffutils.create_db('augustusDbia3.gff3', 'gff3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 1273, in create_db
    c.create()
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 488, in create
    self._populate_from_lines(self.iterator)
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 571, in _populate_from_lines
    fixed, final_strategy = self._do_merge(f, self.merge_strategy)
  File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 218, in _do_merge
    raise ValueError("Duplicate ID {0.id}".format(f))
ValueError: Duplicate ID contig1.g1.t1.cds

Do you know what these errors indicate?
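
For the Duplicate ID error, besides gffutils's own merge_strategy options, one workaround is to make the IDs unique up front before loading (a stdlib sketch, not gffutils API):

```python
from collections import defaultdict

def uniquify(ids):
    """Append _1, _2, ... to repeated IDs so every ID becomes unique."""
    seen = defaultdict(int)
    out = []
    for i in ids:
        seen[i] += 1
        out.append(i if seen[i] == 1 else "%s_%d" % (i, seen[i] - 1))
    return out
```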

augustusDbia3.gtf.txt
augustusDbia3.gff3.txt

Thanks! :)

Parser fails on Alias value with spaces

I am getting the error

  File "C:\...\python-2.7.9.amd64\lib\site-packages\gffutils\parser.py", line 164, in _split_keyvals
    assert len(item) == 1, item
AssertionError: [u'Alias=CCHL,cytochrome', u'c', u'heme', u'lyase']

for the line

scaffold_84 liftOver    CDS 41166   41975   .   -   .   ID=YAL039C_JAY291;Name=YAL039C;orf_classification=Verified;gene=CYC3;Alias=CCHL,cytochrome c heme lyase;dbxref=SGD:S000000037;Note=Cytochrome%20c%20heme%20lyase%20%28holocytochrome%20c%20synthase%29%2C%20attaches%20heme%20to%20apo-cytochrome%20c%20%28Cyc1p%20or%20Cyc7p%29%20in%20the%20mitochondrial%20intermembrane%20space%3B%20human%20ortholog%20may%20have%20a%20role%20in%20microphthalmia%20with%20linear%20skin%20defects%20%28MLS%29

According to http://www.sequenceontology.org/gff3.shtml, spaces are allowed in the Alias field ("Note that unescaped spaces are allowed within fields, meaning that parsers must split on tabs, not spaces."), as are multiple values ("In addition to Parent, the Alias, Note, Dbxref and Ontology_term attributes can have multiple values.").
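
A minimal illustration that the attribute string parses fine when you split only on ';' and '=' (and on tabs at the field level), as the spec requires, letting unescaped spaces and comma-separated multiple values through. This is a sketch of the expected behavior, not the gffutils parser:

```python
def parse_gff3_attributes(attr_string):
    """Split GFF3 column 9 on ';' and '=' only; unescaped spaces stay
    inside values, and commas separate multiple values."""
    attrs = {}
    for pair in attr_string.strip().strip(";").split(";"):
        key, _, value = pair.partition("=")
        attrs[key] = value.split(",")
    return attrs

attrs = parse_gff3_attributes(
    "ID=YAL039C_JAY291;Alias=CCHL,cytochrome c heme lyase")
```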

pip install RuntimeError: maximum recursion depth exceeded

Hello,

When I do pip install gffutils I get the error below.

I am installing gffutils 0.8.7.1 using python 2.7 with pip 8.1.1.

Could you please let me know where the problem is?

Best regards,
Sebastian.

  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2134, in find_on_path
    for item in dists:
  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2134, in find_on_path
    for item in dists:
  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2120, in find_on_path
    path_item, entry, metadata, precedence=DEVELOP_DIST
  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2510, in from_location
    py_version=py_version, platform=platform, **kw
  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2489, in __init__
    self._version = safe_version(version)
  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 1366, in safe_version
    return str(packaging.version.Version(version))
  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/packaging/version.py", line 230, in __init__
    self._version.local,
  File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/packaging/version.py", line 353, in _cmpkey
    reversed(release),
RuntimeError: maximum recursion depth exceeded

Dealing with 'hybrid' bedtools gff/bed files?

An issue I frequently run into is that bedtools sometimes produces extra fields in its output, creating files that are strictly neither GFF nor BED. For example, if I call coverageBed with a BAM input as the -a parameter and a GFF as the -b parameter, it creates output lines like:

chr1    SE  gene    83877956    83908217    .   -   .   ID=chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-;Name=chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-   18  228 30262   0.0075342

where the last four fields are various counts/density values. gffutils will not parse this file, yielding:

in gffutils.gfffeature.GFFFile.__init__ (gffutils/gfffeature.c:1207)()

ValueError: No features found in file

which is technically the correct thing to do, since the file has more than 9 fields and so isn't a valid GFF3 file. However, these files are very useful and contain all the information necessary to be treated as if they were GFF3 files.

What do you think is the correct approach to dealing with these files?

I think it'd be ideal if bedtools output these values as lower-case optional key/value pairs in the attributes field of the GFF3, since that would preserve the format and allow strict GFF3 parsers to handle the file with no issues. But given that this is not the current format, I'm curious how you think these files should be handled in gffutils. Best, --Yarden
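
In the meantime, a stdlib sketch of exactly that transformation, folding the trailing coverageBed columns into lowercase attributes (the key names here are invented for illustration; coverageBed does not name its extra columns):

```python
def hybrid_to_gff3(line):
    """Fold any columns beyond the standard 9 into lowercase
    attributes so the line becomes valid GFF3."""
    fields = line.rstrip("\n").split("\t")
    core, extra = fields[:9], fields[9:]
    if not extra:
        return "\t".join(core)
    # hypothetical names for coverageBed's four trailing columns
    keys = ["count", "bases_covered", "length", "fraction"][:len(extra)]
    core[8] = core[8].rstrip(";") + ";" + ";".join(
        "%s=%s" % kv for kv in zip(keys, extra))
    return "\t".join(core)

fixed = hybrid_to_gff3("chr1\tSE\tgene\t1\t10\t.\t-\t.\tID=x\t18\t228")
```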

GTF from FlyBase & exons ids

Hello,

I am using gffutils to process selected entries from the following file:
ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.09_FB2016_01/gtf/dmel-all-r6.09.gtf.gz

In short, I got a list of transcript_id(s) and fgrep-ed the above GTF file, keeping rows with these featuretypes:

gene
mRNA
exon
CDS
start_codon
stop_codon

I am loading the data using:
db = gffutils.create_db(fn, dbfn='dmel_test.db', force=True, keep_order=False, merge_strategy='merge', sort_attribute_values=True, verbose=True, id_spec={'gene':'gene_id', 'mRNA': 'transcript_id',}, force_gff=False)

I would like to have ids for exons in the form transcript_id.exon_number (FBtr0330644.exon_1, FBtr0330644.exon_2, etc.).
I tried to use a transform function, but the exon IDs stay the same:

def transform(f):
    if f.featuretype == 'exon':
        _id = f.attributes['transcript_id'][0] + ".ex_"
        id = 'autoincrement:_id'
    return f

db = gffutils.create_db(fn, dbfn='dmel_test.db', force=True, keep_order=False, merge_strategy='merge', sort_attribute_values=True, id_spec={'gene':'gene_id', 'mRNA': 'transcript_id',}, transform=transform, gtf_subfeature='exon', force_gff=False,  verbose=True)
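
A sketch of a transform that actually rewrites the exon ID, counting exons per transcript. Note that it sets the ID attribute on the feature itself rather than assigning local variables, and you would presumably also need id_spec to pick up 'ID' for exons (e.g. id_spec={'exon': 'ID', ...}); both points are assumptions, not tested against gffutils:

```python
from collections import defaultdict

exon_counts = defaultdict(int)

def transform(f):
    """Assign transcript_id.exon_N style IDs, counting exons per
    transcript; assumes exons arrive in file order within each transcript."""
    if f.featuretype == "exon":
        tid = f.attributes["transcript_id"][0]
        exon_counts[tid] += 1
        f.attributes["ID"] = ["%s.exon_%d" % (tid, exon_counts[tid])]
    return f
```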

I will keep trying, but if you find time to give me a hint, it would be really helpful.

Thank you for writing and maintaining gffutils

Darek Kedra

Features dropped when creating a database from iterators?

I'm trying to create a GFF database from iterators. When I do this, the database is created successfully, but the majority of features are dropped from the db when it is queried by methods like all_features. I wrote a short unit test for it. The GFF, "gff_example1.gff3", is:

chr1    ensGene gene    4763287 4775820 .   -   .   Name=ENSMUSG00000033845;ID=ENSMUSG00000033845;Alias=ENSMUSG00000033845;gid=ENSMUSG00000033845
chr1    ensGene mRNA    4764517 4775779 .   -   .   Name=ENSMUST00000045689;Parent=ENSMUSG00000033845;ID=ENSMUST00000045689;Alias=ENSMUSG00000033845;gid=ENSMUSG00000033845
chr1    ensGene CDS 4775654 4775758 .   -   0   Name=ENSMUST00000045689.cds0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.cds0;gid=ENSMUSG00000033845
chr1    ensGene CDS 4772761 4772814 .   -   0   Name=ENSMUST00000045689.cds1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.cds1;gid=ENSMUSG00000033845
chr1    ensGene exon    4775654 4775779 .   -   .   Name=ENSMUST00000045689.exon0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon0;gid=ENSMUSG00000033845
chr1    ensGene exon    4772649 4772814 .   -   .   Name=ENSMUST00000045689.exon1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon1;gid=ENSMUSG00000033845
chr1    ensGene exon    4767606 4767729 .   -   .   Name=ENSMUST00000045689.exon2;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon2;gid=ENSMUSG00000033845
chr1    ensGene exon    4764517 4764597 .   -   .   Name=ENSMUST00000045689.exon3;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon3;gid=ENSMUSG00000033845
chr1    ensGene five_prime_UTR  4775759 4775779 .   -   .   Name=ENSMUST00000045689.utr0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr0;gid=ENSMUSG00000033845
chr1    ensGene three_prime_UTR 4772649 4772760 .   -   .   Name=ENSMUST00000045689.utr1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr1;gid=ENSMUSG00000033845
chr1    ensGene three_prime_UTR 4767606 4767729 .   -   .   Name=ENSMUST00000045689.utr2;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr2;gid=ENSMUSG00000033845
chr1    ensGene three_prime_UTR 4764517 4764597 .   -   .   Name=ENSMUST00000045689.utr3;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr3;gid=ENSMUSG00000033845

The unit test is:

def test_create_db_from_iter():
    """
    Test creation of FeatureDB from iterator.
    """
    print "Testing creation of DB from iterator"
    def my_iterator():
        db_fname = gffutils.example_filename("gff_example1.gff3")
        db = gffutils.create_db(db_fname, ":memory:")
        for rec in db.all_features():
            yield rec
    new_db = gffutils.create_db(my_iterator(), ":memory:")
    print list(new_db.all_features())
    gene_feats = new_db.all_features(featuretype="gene")
    assert (len(list(gene_feats)) != 0), "Could not load genes from GFF."

The database is successfully created, but for some reason all features except the last two three_prime_UTR entries are dropped when it is queried. For example, none of the gene, mRNA, or exon entries are in the resulting database. The only features it registers are:

[<Feature three_prime_UTR (chr1:4767606-4767729[-]) at 0x1967b90>, <Feature three_prime_UTR (chr1:4764517-4764597[-]) at 0x1967790>]

Any ideas on what might be causing this? Thanks, --Yarden

gffutils.create_db | AttributeError: 'str' object has no attribute 'dialect'

Hi!

I just tried the version installed through pip, and I get an AttributeError about dialect on a str object when creating the db.

It reproduces on Mac and Linux (Ubuntu), with both GTF and GFF3 :/.

Here is the output:

Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gffutils
>>> gtf = open("augustusDbia3.gtf", 'r')
>>> db = gffutils.create_db(gtf, ":memory:")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/remi/Dev/galaxy_env/local/lib/python2.7/site-packages/gffutils/create.py", line 1231, in create_db
    iterator = iterators.DataIterator(**kwargs)
  File "/home/remi/Dev/galaxy_env/local/lib/python2.7/site-packages/gffutils/iterators.py", line 278, in DataIterator
    return _FeatureIterator(**_kwargs)
  File "/home/remi/Dev/galaxy_env/local/lib/python2.7/site-packages/gffutils/iterators.py", line 93, in __init__
    self._observed_dialects = [i.dialect for i in self.peek]
AttributeError: 'str' object has no attribute 'dialect'

Thanks for the help!

Graph database?

This is not an issue, more of a question. It takes some serious SQL wrangling to get parent-child or grandparent-child information about gene-transcript-exon relationships. Have you thought about using a graph database for gffutils? There doesn't seem to be an embedded, SQLite-style equivalent of Neo4j or TitanDB that would avoid opening up a separate port, so that could be a drawback.

Database creation leaves file lock on GFF file?

Windows won't allow me to delete certain GFF files, and it seems to be a result of gffutils.create_db. Using psutil to keep track of open files (per http://stackoverflow.com/a/25069136/1712389), I get:

Code:

    proc = psutil.Process()
    print ('beforeDB: ' + str(proc.open_files()))

    #create database for genome
    db = gffutils.create_db(genomeGFF, dbfn=genomeGFFdb, force=True, merge_strategy="merge", keep_order=True)

    print ('afterDB: ' + str(proc.open_files()))

Output:

beforeDB: [popenfile(path=u'C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1)]
afterDB: [popenfile(path=u'C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1), popenfile(path=u'C:\\...\\modify-reference-genome\\temp\\SacceM3707_1_AssemblyScaffolds.gff', fd=-1)]

("genomeGFF" matches the "SacceM3707_1_AssemblyScaffolds.gff" file path). Is this a bug? If not, is there a way to release the GFF file? I'm using gffutils 0.8.3.1.

Non-deterministic bug in children() method of gffutils?

I encountered what appears to be a non-deterministic / context-dependent bug in the children method. I have the following GFF file: http://genes.mit.edu/burgelab/yarden/SE.mm9.with_introns.gff3

Which contains the event:

chr6    SE  gene    71510891    71531580    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+;Name=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+
chr6    SE  mRNA    71510891    71531580    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+
chr6    SE  exon    71529635    71531580    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B.dn;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B
chr6    SE  exon    71510891    71511070    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B.up;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B
chr6    SE  mRNA    71510891    71531580    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+
chr6    SE  exon    71510891    71511070    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.up;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6    SE  exon    71524109    71524230    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.se;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6    SE  exon    71529635    71531580    .   +   .   ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.dn;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6    SE  intron  71511071    71524108    .   +   .   ID=chr6:71511071-71524108:+.intron;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6    SE  intron  71524231    71529634    .   +   .   ID=chr6:71524231-71529634:+.intron;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6    SE  intron  71511071    71529634    .   +   .   ID=chr6:71511071-71529634:+.intron;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B

The event has two mRNAs: the first mRNA contains three exons (and two intervening intron entries), while the second mRNA contains two exons (with only one intervening intron entry). Hence, there are three intron entries total in the file: two having the first mRNA (whose ID ends in .A) as parent, and one having the second mRNA (whose ID ends in .B) as parent.

I create a db for this GFF file in the usual way. Then I try to fetch all the children of the first mRNA as follows:

# Load up the database
In [1]: import gffutils

In [2]: gff_db = gffutils.FeatureDB("SE.mm9.with_introns.gff3.db")

In [5]: event = "chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+"

# Get the children of the event, i.e. the mRNAs
In [7]: list(gff_db.children(event))

Out[7]: 
[<Feature: mRNA, chr6:71510891-71531580 (+)>,
 <Feature: mRNA, chr6:71510891-71531580 (+)>]

# Now get the children of the first mRNA
In [8]: mrna1 = "chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A"

In [10]: list(gff_db.children(mrna1))
Out[10]: 
[<Feature: exon, chr6:71510891-71511070 (+)>,
 <Feature: intron, chr6:71511071-71524108 (+)>,
 <Feature: exon, chr6:71524109-71524230 (+)>,
 <Feature: intron, chr6:71524231-71529634 (+)>,
 <Feature: exon, chr6:71529635-71531580 (+)>]

So far so good: it picked out the three exon children of mrna1 and the two intron children. However, it got the Parent identifiers wrong:

In [12]: [p.attributes["Parent"] for p in list(gff_db.children(mrna1))]
Out[12]: 
['chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A',
 'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A',
 'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A',
 'chr6:71524109:71524230:+@chr6:71527725:71527839:+@chr6:71529635:71531580:+.B',
 'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A']

It assigned chr6:71524109:71524230:+@chr6:71527725:71527839:+@chr6:71529635:71531580:+.B as the parent of the second intron, even though it has as its parent chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A and we asked for the children of chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A. Here's the output that makes it more explicit:

In [13]: [(p.id, p.attributes["Parent"]) for p in list(gff_db.children(mrna1))]
Out[13]: 
[('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.up',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
 ('chr6:71511071-71524108:+.intron',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
 ('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.se',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
 ('chr6:71524231-71529634:+.intron',
  'chr6:71524109:71524230:+@chr6:71527725:71527839:+@chr6:71529635:71531580:+.B'),
 ('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.dn',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A')]

The intron entry chr6:71524231-71529634:+.intron definitely does not have the ...B mRNA as its Parent, yet it's listed this way. It looks like something might be off with the attributes field, since if I use the parents() method it gets the parent right:

In [18]: [(p.id, list(gff_db.parents(p))[0].id) for p in list(gff_db.children(mrna1))]
Out[18]: 
[('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.up',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
 ('chr6:71511071-71524108:+.intron',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
 ('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.se',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
 ('chr6:71524231-71529634:+.intron',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
 ('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.dn',
  'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A')]

If I take this event out of the large GFF file I linked to and place it in its own file, the bug disappears and the Parent field is correct... no idea what is happening. --Yarden

How to handle duplicate IDs

Currently db creation uses INSERT OR IGNORE rather than just INSERT. The side effect is that only the first instance of a unique ID is entered into the database and all other instances with the same ID are ignored.

Should duplicate IDs in a GFF file be considered an error?
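One alternative to silently ignoring duplicates is to disambiguate them at import time. Below is a minimal pure-Python sketch of a "create unique IDs" strategy; the helper name is made up, and recent gffutils versions expose a similar knob as the merge_strategy argument to create_db (check your version's docs):

```python
def uniquify_id(feature_id, seen):
    # Suffix duplicate IDs with _1, _2, ... instead of silently dropping
    # them the way INSERT OR IGNORE does.
    if feature_id not in seen:
        seen.add(feature_id)
        return feature_id
    n = 1
    while '%s_%d' % (feature_id, n) in seen:
        n += 1
    new_id = '%s_%d' % (feature_id, n)
    seen.add(new_id)
    return new_id

seen = set()
ids = [uniquify_id(i, seen) for i in ['gene1', 'gene1', 'gene1', 'gene2']]
# ids == ['gene1', 'gene1_1', 'gene1_2', 'gene2']
```

Whether this is preferable to raising an error is exactly the design question above: renaming is lossless but hides genuine annotation mistakes.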

Generating DataTables from GFF database

In design/feature/documentation category:

Find a way to produce an HTML table using entabled (https://github.com/daler/entabled) or something similar from a gffutils database. It would make for a great example in the documentation. The table could be easily sorted by start/end coordinates, chromosome, and other GFF columns.

OperationalError: too many SQL variables

I'm trying to save the ENSEMBL/Gencode ID and the chrom/start/stop of a gff3 (created by @yarden) that uses UCSC IDs. The GFF3 is ALE.hg19.gff3 from this zip file, and the Gencode database was created as described in #20 (comment).

Here's the code: http://nbviewer.ipython.org/gist/olgabot/ef88e1963c5647e95696

I get most of the way through the file, but then I run into this OperationalError:

---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
<ipython-input-7-e2b521904d43> in <module>()
      5 for ale in db.features_of_type('gene'):
      6     print ale.id
----> 7     overlapping_genes = list(v19_db.region((ale.chrom, ale.start, ale.stop), featuretype='gene'))
      8 
      9     gencode = set(g.id for g in overlapping_genes)

/projects/ps-yeolab/software/anaconda/envs/olga/lib/python2.7/site-packages/gffutils/interface.pyc in region(self, region, featuretype, completely_within)
    498 
    499         c = self.conn.cursor()
--> 500         c.execute(query, tuple(args))
    501         for i in c:
    502             yield self._feature_returner(**i)

OperationalError: too many SQL variables

Any ideas? I'm thinking of just saving to disk periodically
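For what it's worth, SQLite caps the number of ? placeholders in a single statement (SQLITE_MAX_VARIABLE_NUMBER; 999 in many older builds), and the query built by region() presumably uses one placeholder per bin, so a large region can overflow that limit. A hedged workaround sketch, chunking a long parameter list (the table and helper are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE features (id TEXT)')
ids = ['f%d' % i for i in range(5000)]
conn.executemany('INSERT INTO features VALUES (?)', [(i,) for i in ids])

def select_chunked(conn, wanted, chunk=500):
    # Never put more than `chunk` placeholders in a single statement.
    out = []
    for i in range(0, len(wanted), chunk):
        part = wanted[i:i + chunk]
        q = 'SELECT id FROM features WHERE id IN (%s)' % ','.join('?' * len(part))
        out.extend(row[0] for row in conn.execute(q, part))
    return out

rows = select_chunked(conn, ids)
```

The same chunking idea could be applied inside region() to the list of bins.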

merge "non-attribute" fields

Currently when merge_strategy="merge", features will only be merged if all of their non-attribute fields (chrom, source, start, stop, etc) are identical.

For some use-cases (see later comments in #20), it would be useful to allow a subset of these fields to be ignored.

For example, these exons have the same coords, but different sources:

chr1  HAVANA  exon  12613  12721  .  +  .  transcript_id "mRNA1"; gene_id "gene1"
chr1  ENSEMBL  exon  12613  12721  .  +  .  transcript_id "mRNA2"; gene_id "gene1"

We would like to force a merged feature into something like this:

chr1  HAVANA,ENSEMBL  exon  12613  12721  .  +  .  transcript_id "mRNA1,mRNA2"; gene_id "gene1"

Fields allowed to be merged should be specified in the create_db() call, e.g., force_merge_fields=['source'].

It's up to the user to, if they really want to, do things like force_merge_fields=["start"] (which would result in non-integer start positions that will likely break downstream code).
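A pure-Python sketch of the proposed merge semantics (plain dicts stand in for gffutils Feature objects; merge_features and the field tuple are hypothetical):

```python
def merge_features(a, b, force_merge_fields=()):
    # Non-attribute fields must match unless explicitly allowed to merge,
    # in which case their values are joined with commas.
    fields = ('seqid', 'source', 'featuretype', 'start', 'end', 'strand')
    merged = {}
    for f in fields:
        if a[f] == b[f]:
            merged[f] = a[f]
        elif f in force_merge_fields:
            merged[f] = '%s,%s' % (a[f], b[f])
        else:
            raise ValueError('cannot merge: field %r differs' % f)
    return merged

a = dict(seqid='chr1', source='HAVANA', featuretype='exon',
         start=12613, end=12721, strand='+')
b = dict(a, source='ENSEMBL')
m = merge_features(a, b, force_merge_fields=['source'])
# m['source'] == 'HAVANA,ENSEMBL'
```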

force=True does not force creation of database

os.listdir('.')
['test.gtf', 'test.fastq', 'test.sql']

gffutils.create_db('test.gtf', 'test.sql', force=True)
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/__init__.py", line 44, in create_db
creator.create()
File "/usr/local/lib/python2.7/dist-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.py", line 83, in create
self.init_tables()
File "/usr/local/lib/python2.7/dist-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.py", line 73, in init_tables
''')
sqlite3.OperationalError: table features already exists

gffutils.create_db('test.gtf', 'test.sqlite')
Populating features and relations table: 9 (90%) done in 0.0s
Creating temporary indexes...done in 0.0s
Inferring gene and transcript extents: 15 (93%)...done in 0.0s
Importing inferred features...done in 0.0s
Creating index...done in 0.0s

Unicode error when trying to use gffutils?

I'm trying out gffutils and after installing and trying to import it, I got the error:

$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gffutils
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lab/solexa_jaenisch2/yarden/software/gffutils/gffutils/__init__.py", line 3, in <module>
    from gfffeature import Feature, GFFFile
ImportError: /lab/solexa_jaenisch2/yarden/software/gffutils/gffutils/gfffeature.so: undefined symbol: PyUnicodeUCS2_Tailmatch

Any ideas on how this can be fixed? Also, are there plans to make a gffutils command-line interface? I haven't seen anything like it, and the operations described in the docs are extremely common, so it would be tremendously useful. Thanks.

pip installation fails without cython

Hello, I'm sure this is a rare issue, but I wanted to let you know that in my case, installation in Python 2.7 through pip fails with a compiler error if cython is not already installed. The installation appears to download cython as a dependency and begins to install it, but the conversion of *.pyx to *.c fails and the compiler aborts with a "file not found" error.

Thanks for the great work!

Correct way to set attributes of GFF records?

I found an oddity, which might be there by design, in re-setting an attribute of a GFF record. Suppose I have an exon record:

In [237]: e
Out[237]: <Feature: exon, chr1:83907290-83908217 (-)>

In [238]: e.attributes
Out[238]: {'ID': 'chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A.up', 'Parent': 'chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A'}

I try to reassign its ID using e.attributes.__setitem__, accessing it as a dict:

In [239]: e.attributes["ID"] = "foo"

In [240]: e.attributes
Out[240]: {'ID': 'foo', 'Parent': 'chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A'}

It appears to work, but internally it creates an additional ID field that gets serialized when the record is written to a file:

In [241]: str(e.attributes)
Out[241]: 'ID=foo;Parent=chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A;ID=foo'

Is this intentional? Is there a different recommended way to reset attributes? I wanted to make sure I'm not missing something before I start meddling with the code. Thanks.
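As a stopgap, the duplicated key can be scrubbed at serialization time. A sketch that rebuilds a GFF3 attribute string with each key appearing exactly once (set_attribute is a hypothetical helper operating on plain strings, not on gffutils objects):

```python
def set_attribute(attr_string, key, value):
    # Rebuild a GFF3 attribute string so `key` appears exactly once,
    # dropping any duplicates of it along the way.
    pairs = [p.split('=', 1) for p in attr_string.split(';') if p]
    out, replaced = [], False
    for k, v in pairs:
        if k == key:
            if not replaced:
                out.append((k, value))
                replaced = True
        else:
            out.append((k, v))
    if not replaced:
        out.append((key, value))
    return ';'.join('%s=%s' % kv for kv in out)

set_attribute('ID=foo;Parent=bar;ID=foo', 'ID', 'baz')
# -> 'ID=baz;Parent=bar'
```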

Creating db from nonstandard GTFs

Hello,
Thank you for creating such an excellent package. I am attempting to read some GTF files that don't have the transcript_id attribute, and with this command:

# RNA-Seq Exons from Gencode V7 data
gff_filename = '/nas3/scratch/obot/encode/gm12878/rna-seq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf'
db_filename = gff_filename + '.db'
gffutils.create_db(gff_filename, db_filename)
cage_gencode7_db = gffutils.FeatureDB(db_filename)

I am getting this error:

Populating features and relations tables...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-f3bcda2b7341> in <module>()
     11 gff_filename = '/nas3/scratch/obot/encode/gm12878/rna-seq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf'
     12 db_filename = gff_filename + '.db'
---> 13 gffutils.create_db(gff_filename, db_filename)
     14 cage_gencode7_db = gffutils.FeatureDB(db_filename)

/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/__init__.pyc in create_db(fn, dbfn, verbose, force)
     43     else:
     44         creator = GTFDBCreator(fn, dbfn, verbose=verbose, force=force)
---> 45     creator.create()
     46 
     47 

/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.pyc in create(self)
     82     def create(self):
     83         self.init_tables()
---> 84         self.populate_from_features(self.features)
     85         self.update_relations()
     86         self.set_filetype()

/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.pyc in populate_from_features(self, features)
    229             last_perc = perc
    230 
--> 231             parent = feature.attributes['transcript_id']
    232             grandparent = feature.attributes['gene_id']
    233 

/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/gfffeature.so in gffutils.gfffeature.Attributes.__getitem__ (gffutils/gfffeature.c:6261)()

KeyError: 'transcript_id'

There are two main GTFs I'm looking at, which you can get from here:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageGm12878CellPapTssGencV7.gtf.gz

And they look like this:

head /nas3/scratch/obot/encode/gm12878/rna-seq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf
chr14   gencodeV7   exon    105886353   105886400   0.005370    +   .   gene_ids "ENSG00000182979.10,"; transcript_ids "ENST00000438610.1,"; RPKM1 "0"; RPKM2 "0.177373"; iIDR "0.084";
chr9    gencodeV7   exon    123744122   123744217   0.003540    -   .   gene_ids "ENSG00000106804.5,"; transcript_ids "ENST00000223642.1,"; RPKM1 "0.06532"; RPKM2 "0.051599"; iIDR "0.151";
chr11   gencodeV7   exon    61539301    61539424    0.038597    +   .   gene_ids "ENSG00000124920.9,"; transcript_ids "ENST00000278836.5,ENST00000265460.5,"; RPKM1 "0.889132"; RPKM2 "0.385668"; iIDR "0.027";
chr16   gencodeV7   exon    50730251    50730391    0.001048    +   .   gene_ids "ENSG00000167207.4,"; transcript_ids "ENST00000531674.1,"; RPKM1 "0"; RPKM2 "0.034611"; iIDR "0.207";
chr11   gencodeV7   exon    64950172    64950439    0.172877    +   .   gene_ids "ENSG00000014216.9,"; transcript_ids "ENST00000528396.1,ENST00000529133.1,ENST00000533820.1,ENST00000530571.1,ENST00000527739.1,ENST00000526966.1,ENST00000533129.1,ENST00000524773.1,ENST00000530442.1,ENST00000279247.6,ENST00000527189.1,ENST00000527469.1,ENST00000534373.1,ENST00000259755.6,ENST00000526954.1,ENST00000531068.1,ENST00000527699.1,ENST00000533909.1,"; RPKM1 "4.06183"; RPKM2 "1.64803"; iIDR "0.008";
chr18   gencodeV7   exon    48447444    48447557    0.194347    +   .   gene_ids "ENSG00000082212.6,"; transcript_ids "ENST00000321341.5,ENST00000382927.2,"; RPKM1 "3.04978"; RPKM2 "3.36921"; iIDR "0.013";
chr15   gencodeV7   exon    62320557    62320619    0.357897    -   .   gene_ids "ENSG00000129003.10,"; transcript_ids "ENST00000261517.5,ENST00000395896.3,ENST00000395898.2,"; RPKM1 "5.8709"; RPKM2 "5.9499"; iIDR "0.000";
chr14   gencodeV7   exon    92789949    92790304    0.000000    +   .   gene_ids "ENSG00000140090.11,"; transcript_ids "ENST00000532405.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
chr17   gencodeV7   exon    59989301    59989431    0.142000    -   .   gene_ids "ENSG00000108506.6,"; transcript_ids "ENST00000444766.2,ENST00000251334.5,"; RPKM1 "2.33894"; RPKM2 "2.3511"; iIDR "0.010";
chr4    gencodeV7   exon    123269847   123269872   0.000000    +   .   gene_ids "ENSG00000138688.9,"; transcript_ids "ENST00000431755.2,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";

Which has transcript_ids but not transcript_id, and thus the parent-child associations are failing.

And for this file:

head /nas3/scratch/obot/encode/gm12878/cage/wgEncodeRikenCageGm12878CellPapTssGencV7.gtf
chr1    Gencode TSS 11869   11869   0   +   .   gene_id "ENSG00000223972.3"; trlist "ENST00000456328.2,"; trbiotlist "processed_transcript,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 11872   11872   0   +   .   gene_id "ENSG00000249291.2"; trlist "ENST00000515242.2,"; trbiotlist "protein_coding,"; confidence "not_low"; gene_biotype "protein_coding"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 11874   11874   0   +   .   gene_id "ENSG00000253101.2"; trlist "ENST00000518655.2,"; trbiotlist "protein_coding,"; confidence "not_low"; gene_biotype "protein_coding"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 12010   12010   0   +   .   gene_id "ENSG00000223972.3"; trlist "ENST00000450305.2,"; trbiotlist "transcribed_unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 19305   19305   0   -   .   gene_id "ENSG00000227232.3"; trlist "ENST00000537342.1,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 24886   24886   0   -   .   gene_id "ENSG00000227232.3"; trlist "ENST00000541675.1,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 29344   29344   0   -   .   gene_id "ENSG00000227232.3"; trlist "ENST00000430492.2,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 29370   29370   0   -   .   gene_id "ENSG00000227232.3"; trlist "ENST00000438504.2,ENST00000423562.1,"; trbiotlist "unprocessed_pseudogene,unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 29554   29554   0   +   .   gene_id "ENSG00000243485.1"; trlist "ENST00000473358.1,"; trbiotlist "processed_transcript,"; confidence "not_low"; gene_biotype "processed_transcript"; rpm1 "0"; rpm2 "0";
chr1    Gencode TSS 29570   29570   0   -   .   gene_id "ENSG00000227232.3"; trlist "ENST00000488147.1,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";

Which has trlist for the transcripts.

Is the solution here to have a janky bunch of if/elses to check for all the possible types of transcript_id-like things?
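Rather than if/else chains inside gffutils, one option is to normalize the attributes up front; recent gffutils versions accept a transform callable in create_db for exactly this kind of fix-up (check your version's signature). A sketch operating on a plain dict, with the alias list ('transcript_ids', 'trlist') taken from the two files above:

```python
def normalize_transcript_key(d):
    # Rename nonstandard transcript attribute keys to 'transcript_id',
    # splitting comma-separated values like "ENST0001.1,ENST0002.1,".
    attrs = d['attributes']
    for alias in ('transcript_ids', 'trlist'):
        if alias in attrs and 'transcript_id' not in attrs:
            attrs['transcript_id'] = [x for x in attrs.pop(alias)[0].split(',') if x]
    return d

d = {'attributes': {'trlist': ['ENST00000456328.2,'],
                    'gene_id': ['ENSG00000223972.3']}}
d = normalize_transcript_key(d)
# d['attributes']['transcript_id'] == ['ENST00000456328.2']
```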

Inserting/shifting annotations

In my field of research, it is very common to remove and/or insert custom DNA sequences in microbial genomes. I would like to make the same changes in silico in the FASTA file of the reference genome, as well as in the accompanying GFF annotation file.

To make this easy, I would love the following features:

  • conversion of GenBank annotations to the GFF format, since we save our genomic modifications in GenBank format. For this I found http://biopython.org/wiki/GFF_Parsing#Converting_other_formats_to_GFF3 (the author mentions at https://github.com/chapmanb/bcbb/tree/master/gff that he hopes to eventually integrate this code into gffutils).
  • an easy way to introduce annotations from a second GFF file (describing the newly introduced sequence) at a specified location in the first GFF file (describing the genome). For example, I might want to replace a 4 kb gene in GFF file A by a 14 kb DNA fragment described in GFF file B. It would be great if gffutils had functions for: a) deleting annotations that fall in the deleted 4 kb region, b) deleting, splitting, or stretching annotations that partially overlap or span the deleted 4 kb region, c) shifting the coordinates of any annotations downstream of the 4 kb integration site by 10 kb (14 - 4 kb), and d) adding the annotations from GFF file B to GFF file A while updating their coordinates.

Does this fall in the scope of gffutils, or are my requests too esoteric? I guess the simplest feature request would be the ability to easily select and/or shift annotations that start beyond a certain coordinate. (I realize that I can already make such changes by iterating over each feature, but it would still be nice to have such a function)
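The coordinate-shifting part, at least, is simple bookkeeping. A sketch of the arithmetic for replacing a region (a pure function, not part of gffutils; splitting/stretching partial overlaps is deliberately left to the caller, since the right policy is use-case specific):

```python
def shift_feature(start, end, edit_start, removed, inserted):
    # Shift a 1-based, inclusive interval to account for replacing
    # `removed` bases starting at `edit_start` with `inserted` bases.
    delta = inserted - removed
    edit_end = edit_start + removed - 1
    if end < edit_start:                       # entirely upstream: unchanged
        return (start, end)
    if start > edit_end:                       # entirely downstream: shift
        return (start + delta, end + delta)
    if start >= edit_start and end <= edit_end:
        return None                            # falls inside the removed region
    return (start, end)                        # partial overlap: policy left to caller

# Replacing a 4 kb region with a 14 kb fragment shifts downstream features by 10 kb:
shift_feature(20000, 21000, 5000, 4000, 14000)
# -> (30000, 31000)
```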

MySQL support

I wanted to integrate the gffutils database with my existing MySQL database for a different purpose. I just need to add the tables to my database and use gffutils attributes in my Python code. Would it be possible to add MySQL support for this?

Thanks.

retrofitting a db to add additional relationships

Use-case in the later comments of #20: can we have CDSs be children of exons?

More generally, can we define additional relationships that are not defined in the original GFF/GTF?

There are two different ways of "adding a relation", depending on what you want to do with it later:

  1. Add a relation in the attributes, without touching the database relations, by adding an entry like Parent="gene1" in the attributes for a feature. This will only affect the output of GFF lines, and will not allow child/parent queries via the gffutils API. So it's more like a cosmetic change.
  2. Add a relation to the database itself, by adding an entry in the relations table. On its own, this does not affect the output of GFF lines, since the info is not stored in the attributes field of a feature. But it does allow children/parents to be accessed from the db via the gffutils API.

For 1), this should just be left to the user to manipulate the attribute dictionaries however they want to.

For 2), this should be implemented as a new method on a FeatureDB -- say, FeatureDB.add_relation(). It should also [optionally] make the change in the attributes as in 1).

However . . .

There's the issue of database IDs, which are primary keys, vs the IDs that you'd want to include in the final output. The database ID needs to be added to the relations table, but you might want some other ID to be added to the attributes.

A concrete example, showing just attributes to save space:

ID="exon1"; fancy_id="exon:chr1:1-100"
ID="CDS1"; fancy_id="CDS:chr1:1-100"

Imagine we imported these features into a db using the idspec fancy_id, such that exon:chr1:1-100 and CDS:chr1:1-100 are the primary keys in the db. We want to make CDSs be children of exons. So we add an entry in the relations table, using the hypothetical FeatureDB.add_relation() method:

exon:chr1-100    CDS:chr1:1-100    1

But in the output -- and therefore the attributes as stored in the db -- we want to specify the parent in terms of another attribute, like ID. So when we print out the GFF lines, they look like this:

ID="exon1"; fancy_id="exon:chr1:1-100";
ID="CDS1"; fancy_id="CDS:chr1:1-100"; parent_exon="exon1";

This implies a method signature like:

FeatureDB.add_relation(
   child,
   parent,
   level, parent_func, child_func)

where parent_func is a custom function with signature parent_func(parent, child) that returns the modified parent, and child_func(parent, child) returns the modified child.

In this example, parent_func would be lambda parent, child: parent and child_func would be:

def child_func(parent, child):
    child.attributes['parent_exon'] = parent.id
    return child

remove orphans

Add a mechanism for cleaning a database of features that have an annotated parent when that parent does not exist in the database.
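A pure-Python sketch of the orphan check (the dict maps feature ID to its annotated parent IDs as a stand-in for a FeatureDB; with a real database you would walk the relations table or all features and then delete the hits — hedged, since the deletion API varies by version):

```python
def find_orphans(features):
    # `features` maps feature ID -> list of annotated parent IDs.
    ids = set(features)
    return sorted(fid for fid, parents in features.items()
                  if any(p not in ids for p in parents))

feats = {'gene1': [], 'mRNA1': ['gene1'], 'exon1': ['mRNA_gone']}
orphans = find_orphans(feats)
# orphans == ['exon1']
```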

gffutils gets confused when using db.update() on subset of features?

Example at https://gist.github.com/rmzelle/256cbef7d594b518ac46

I'm trying to only update the features in my gffutils database that needed updating, but when I do that I get unexpected results. In particular, some features get deleted while others are duplicated, so it looks like gffutils sometimes misidentifies the features I'm trying to update. Any thoughts? How does gffutils establish which feature is which when using db.update()?

Order of records parsed by gffutils?

Hi Ryan and all,

Finally back to hacking on gffutils (over at my fork; I will send a pull request when the next batch of features is completed). I notice that in the refactor branch I forked today, the unit test test_attributes_modify fails. I believe it is because the GFF records are no longer read in the order in which they appear in the GFF file, like they used to be. The test function is below. The output I get when running it is:

$ python gffutils/test/test.py
First child is not an mRNA
exon
Traceback (most recent call last):
  File "gffutils/test/test.py", line 1073, in <module>
    test_attributes_modify()
  File "gffutils/test/test.py", line 1061, in test_attributes_modify
    assert str(gene_childs[0].attributes) == 'ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11'
AssertionError

If I understand it correctly, I believe it's reading the exon record before the mRNA record, although in the file the mRNA appears as the first child of the gene:

chr2L   FlyBase gene    7529    9484    .   +   .   ID=FBgn0031208;Name=CG11023;Ontology_term=SO:0000010,SO:0000087,GO:0008234,GO:0006508;Dbxref=FlyBase:FBan0011023,FlyBase_Annotation_IDs:CG11023,GB:AE003590,GB_protein:AAO41164,GB:AI944728,GB:AJ564667,GB_protein:CAD92822,GB:BF495604,UniProt/TrEMBL:Q6KEV3,UniProt/TrEMBL:Q86BM6,INTERPRO:IPR003653,BIOGRID:59420,dedb:5870,GenomeRNAi_gene:33155,ihop:59383;derived_computed_cyto=21A5-21A5;gbunit=AE014134
chr2L   FlyBase mRNA    7529    9484    .   +   .   ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11

Although the GFF format is unordered, I think it's worth ensuring that the entries are read in the order in which they appear in the file. We, at least, rely heavily on this convention to always put the longest mRNA entries first, and such conventions are very useful.

I believe that unordered entries are returned by the children method, while in general gffutils stores entries in file-appearance order. Here's a self-contained example (note that get_db_fname is a function I'm contributing to helpers in my fork):

In [56]: f = gffutils.example_filename("FBgn0031208.gff")

In [57]: db_fname = helpers.get_db_fname(f)

In [58]: db = gffutils.FeatureDB(db_fname)

In [59]: all_recs = list(db.all())

# Gene and mRNA are read first (file appearance order)
In [61]: all_recs[0:2]
Out[61]: [<Feature: gene, chr2L:7529-9484 (+)>, <Feature: mRNA, chr2L:7529-9484 (+)>]

# But the first child is not an mRNA, it is an exon
In [62]: c = list(db.children("FBgn0031208"))

In [63]: c[0]
Out[63]: <Feature: exon, chr2L:7529-8116 (+)>

The test function is:

def test_attributes_modify():
    """
    Test that attributes can be modified in a GFF record.

    TODO: This test case fails?
    """
    # Test that attributes can be modified
    gffutils.create_db(gffutils.example_filename('FBgn0031208.gff'), testdbfn_gff,
                       verbose=False,
                       force=True)
    db = gffutils.FeatureDB(testdbfn_gff)
    gene_id = "FBgn0031208"
    gene_childs = list(db.children(gene_id))
    print "First child is not an mRNA"
    print gene_childs[0].featuretype
    assert str(gene_childs[0].attributes) == 'ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11'
    gene_childs[0].attributes["ID"] = "Modified"
    assert str(gene_childs[0].attributes) == 'ID=Modified;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11;ID=Modified'
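Until insertion order is preserved, a deterministic order can at least be imposed after retrieval. Recent gffutils versions expose an order_by argument on children() (check your version); a pure-Python stand-in for sorting retrieved features by featuretype rank and then start coordinate:

```python
# (featuretype, start, end) tuples standing in for retrieved Features
children = [('exon', 7529, 8116), ('mRNA', 7529, 9484), ('exon', 8193, 9484)]

# One possible convention: mRNAs before exons, then by start coordinate
rank = {'gene': 0, 'mRNA': 1, 'exon': 2}
ordered = sorted(children, key=lambda c: (rank.get(c[0], 99), c[1]))
# ordered[0] == ('mRNA', 7529, 9484)
```

This is only a workaround; it cannot recover conventions like "longest mRNA first" that depend on true file order.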

Bug in retrieval of features by type when IDs/chromosomes contain underscores?

I think there might be a bug in the retrieval of children records by type (in db.children(..., featuretype=)) when the record IDs/chromosomes have an underscore _ character in their name.

I made a small unit test for it over at my fork in test.py:

def test_random_chr():
    """
    Test on GFF files with random chromosome events.
    """
    gff_fname = gffutils.example_filename("random-chr.gff")
    db_fname = helpers.get_db_fname(gff_fname)
    db = gffutils.FeatureDB(db_fname)
    # Test that we can get children of only a selected type
    gene_id = "chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-"
    mRNAs = db.children(gene_id, featuretype="mRNA")
    for mRNA_entry in mRNAs:
        assert (mRNA_entry.featuretype == "mRNA"), \
               "Not all entries are of type mRNA! %s" \
               %(",".join([entry.featuretype for entry in mRNAs]))
    print "Parsed random chromosome successfully."

The GFF file is in test/data/random-chr.gff:

chr1_random SE  gene    97006   165969  .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-;Name=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-
chr1_random SE  mRNA    97006   165969  .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-
chr1_random SE  mRNA    97006   165969  .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.B;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-
chr1_random SE  exon    165882  165969  .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.up;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A
chr1_random SE  exon    137473  137600  .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.se;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A
chr1_random SE  exon    97006   97527   .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.dn;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A
chr1_random SE  exon    165882  165969  .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.up;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.B
chr1_random SE  exon    97006   97527   .   -   .   ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.dn;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.B

When I run my unit test, I get:

$ python gffutils/test/test.py
Creating db for test/data/unsanitized.gff
  - Took 1.48 seconds
Sanitized GFF successfully.
Creating db for test/data/random-chr.gff
  - Took 1.55 seconds
Traceback (most recent call last):
  File "gffutils/test/test.py", line 1092, in <module>
    test_random_chr()
  File "gffutils/test/test.py", line 1059, in test_random_chr
    %(",".join([entry.featuretype for entry in mRNAs]))
AssertionError: Not all entries are of type mRNA! exon,exon

I'm guessing the reason is that the underscore character gets interpreted as something else by SQLite (perhaps as the single-character wildcard in LIKE patterns), but I don't know enough about the SQL syntax to see how/where these characters should be escaped. The error goes away when I remove the _random suffix from the names of the entries/chromosomes.

Any thoughts on this? Thanks, --Yarden
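In SQL LIKE patterns, _ does match any single character, so an unescaped ID in a LIKE clause can match more features than intended. A minimal sqlite3 demonstration (the table and values are made up; whether gffutils actually uses LIKE here is an assumption to be verified in the query-building code):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE features (id TEXT)')
conn.executemany('INSERT INTO features VALUES (?)',
                 [('chr1_random',), ('chr1xrandom',)])

# _ is a single-character wildcard in LIKE, so this matches both rows:
loose = conn.execute(
    "SELECT id FROM features WHERE id LIKE 'chr1_random'").fetchall()

# Escaping the underscore restricts the match to the literal ID:
strict = conn.execute(
    "SELECT id FROM features WHERE id LIKE 'chr1\\_random' ESCAPE '\\'").fetchall()
```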

regions in refactor not working?

Hi,

I tried for the first time to fetch GFF entries from a region with db.region() and I get the error:

<ipython-input-44-f7ff462cf266> in <module>()
----> 1 for x in matched_db.region("chr5:653492-65368119"): print x

/home/yarden/.local/lib/python2.7/site-packages/gffutils-0.8-py2.7.egg/gffutils/interface.pyc in region(self, region, featuretype, completely_within)
    368 
    369         # Get a list of all possible bins for this region
--> 370         _bins = list(bins.bins(start, end, one=False))
    371 
    372         if completely_within:

/home/yarden/.local/lib/python2.7/site-packages/gffutils-0.8-py2.7.egg/gffutils/bins.pyc in bins(start, stop, fmt, one)
     68     # Jump to highest resolution bin that will fit these coords (depending on
     69     # whether we have a BED or GFF-style coordinate).
---> 70     start = (start - COORD_OFFSETS[fmt]) >> FIRST_SHIFT
     71     stop = (stop) >> FIRST_SHIFT
     72 

This happens to me on the refactor branch with pretty much any GFF (there should be nothing special about this file, I believe).
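For what it's worth, the traceback points at `start - COORD_OFFSETS[fmt]` and the shift just below it, which is exactly what happens when `start` is still the string sliced out of the region argument rather than an int. A guess at the fix (an assumption on my part — I have not traced the refactor branch) is that the string-parsing path needs to coerce the coordinates:

```python
# Splitting "seqid:start-end" leaves the coordinates as strings.
seqid, coords = "chr5:653492-65368119".split(":")
start, end = coords.split("-")

try:
    start >> 17  # mimics the shift in bins.bins(); fails on a str
except TypeError as exc:
    print("str coordinate:", exc)

# Coercing to int first makes the bin arithmetic work.
start, end = int(start), int(end)
print(start >> 17)  # 4
```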

A much more minor point: the region string parsing fails for coordinates that have a strand specification:

<ipython-input-41-c321ddc79db2> in <module>()
----> 1 for x in matched_db.region("chr5:65349234-65368119:-"): print x

/home/yarden/.local/lib/python2.7/site-packages/gffutils-0.8-py2.7.egg/gffutils/interface.pyc in region(self, region, featuretype, completely_within)
    357         """
    358         if isinstance(region, basestring):
--> 359             seqid, coords = region.split(':')
    360             start, end = coords.split('-')

This is because region.split(':') is unpacked into exactly two values. Thanks, --Yarden
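A bounded check on the number of colon-separated fields is enough to make the parser tolerate an optional trailing strand. Here is a sketch (a hypothetical helper, not gffutils' actual parser), which also converts the coordinates to ints:

```python
def parse_region_string(region):
    """Parse 'seqid:start-end' or 'seqid:start-end:strand'.

    Hypothetical helper, not gffutils' actual implementation.
    Returns (seqid, start, end, strand); strand defaults to '.'.
    """
    parts = region.split(":")
    if len(parts) == 2:
        seqid, coords = parts
        strand = "."
    elif len(parts) == 3:
        seqid, coords, strand = parts
    else:
        raise ValueError("expected seqid:start-end[:strand], got %r" % region)
    start, end = coords.split("-")
    return seqid, int(start), int(end), strand

print(parse_region_string("chr5:65349234-65368119:-"))
# ('chr5', 65349234, 65368119, '-')
```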

Feature.sequence(): option to extract flanking region

It would be really great to add an option to extract flanking sequence along with the feature sequence using the Feature.sequence method.

For instance, if I want to extract x bases of flanking sequence, then:

Feature.sequence(fasta, use_strand=True, flanking=x)

would generate the sequence:

x(bases) + feature + x(bases)

and the argument flanking=0 would give just the feature sequence (this could be the default).
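Until something like this lands in the library, the behavior is easy to sketch by hand. The helper below works on a plain dict of sequence strings with 1-based inclusive GFF coordinates and clips the flank at the chromosome ends; it is a hypothetical illustration of the requested flanking= semantics, not part of gffutils' API:

```python
def feature_sequence(seqs, seqid, start, end, strand="+", flanking=0):
    """Return a feature's sequence padded by `flanking` bases on each side.

    `seqs` is a plain dict {seqid: sequence string}; start/end are
    1-based inclusive, GFF-style. Hypothetical helper, not gffutils' API.
    """
    chrom = seqs[seqid]
    lo = max(start - 1 - flanking, 0)          # clip at chromosome start
    hi = min(end + flanking, len(chrom))       # clip at chromosome end
    seq = chrom[lo:hi]
    if strand == "-":
        complement = str.maketrans("ACGTacgt", "TGCAtgca")
        seq = seq.translate(complement)[::-1]  # reverse complement
    return seq

seqs = {"chr1": "AAACGTTTT"}
print(feature_sequence(seqs, "chr1", 4, 6))               # CGT
print(feature_sequence(seqs, "chr1", 4, 6, flanking=2))   # AACGTTT
```

Note that for a minus-strand feature the result is the reverse complement of the padded slice, so the upstream flank (in transcript orientation) comes from the higher genomic coordinates.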
