daler / gffutils
GFF and GTF file manipulation and interconversion
Home Page: http://daler.github.io/gffutils
License: MIT License
See comments in #42. It would be useful to select features by entire seqid rather than knowing ahead of time what the seqid size is.
Proposed interface:
# already works
db.region(('chr1', 1, 1000))
# already works
db.region("chr1:1-1000")
# String as first argument? Assume seqid.
db.region('chr1')
# None as "start" position in tuple?
# Do not add start >= clause (interpret as chr1[:1000]).
db.region(('chr1', None, 1000))
# None as "stop" position in tuple?
# Do not add stop <= clause (interpret as chr1[1000:]).
db.region(('chr1', 1000, None))
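The proposed call forms above can be sketched as a small normalization step that maps each form to optional SQL constraints. This is a hypothetical helper for illustration, not gffutils' actual query builder:

```python
def region_clauses(region):
    """Normalize a region argument and build WHERE clauses.

    `region` may be a "seqid:start-end" string, a bare seqid string
    (whole chromosome), or a (seqid, start, end) tuple where None
    means "unbounded on that side".
    """
    if isinstance(region, str):
        if ':' in region:
            seqid, _, coords = region.partition(':')
            start, _, end = coords.partition('-')
            region = (seqid, int(start), int(end))
        else:
            region = (region, None, None)  # bare seqid: entire chromosome
    seqid, start, end = region
    clauses = ['seqid = ?']
    if start is not None:
        clauses.append('end >= %d' % start)   # overlap lower bound
    if end is not None:
        clauses.append('start <= %d' % end)   # overlap upper bound
    return clauses
```

With this shape, `db.region('chr1')` would add only the seqid constraint, matching the "entire seqid" use case.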
as mentioned in #37 (comment)
From @sjackman, originally on chapmanb/bcbb#101.
The original GFF to TBL conversion script uses BCBio.GFF; it could be ported to a gffutils.convert module.
Example data: gffutils/test/data
See TBL format description 1 and TBL format description 2 for details.
Hi Ryan,
The db schema defines score as 'float',
https://github.com/daler/gffutils/blob/master/gffutils/db.py#L58
Then in Feature it has type of 'str',
https://github.com/daler/gffutils/blob/master/gffutils/gfffeature.pyx#L81
This incompatibility is triggered when the children() function is called. My own solution is to change 'float' to 'text' in db.py; I'm not sure if this is the right way to approach it.
Thanks, Haibao
What is the appropriate way to cite gffutils in a publication?
Thanks!
Olga
Is it possible to optimize database creation? It'd be interesting to see where the bottlenecks are. Would Cythonizing certain parts of the code help, or is the bottleneck purely at the sqlite interface? For example, creating a database for this GFF (Ensembl mouse genes) takes ~15 minutes:
http://genes.mit.edu/yarden/Mus_musculus.NCBIM37.65.gff
$ gffutils-cli create Mus_musculus.NCBIM37.65.gff
The bottleneck is there even when dbs are created in memory, though it is of course smaller in that case. This is related to another issue we discussed: how to create derivative GFF files from existing ones, i.e. how to iterate through a GFF and make a new version of it. For example, you might want to iterate through the above GFF and simply add an optional key/value pair to the attributes of some records. This is a kind of "streaming" operation that can be done line by line and doesn't necessarily need a db. The overhead of creating the db makes gffutils impractical for these kinds of simple operations on large-ish GFF files.
There are more sophisticated operations (i.e. non-streaming ones) where a naive in-memory solution is still considerably faster because of the database creation bottleneck. For example, the naive GFF parser used in misopy.gff_utils
(see load_genes_from_gff
in https://github.com/yarden/MISO/blob/fastmiso/misopy/Gene.py) simply iterates through the GFF file multiple times to collect all the gene entries into a simple Gene
object, with mRNAs represented as lists of exons. This kind of gene-centric in-memory solution is less expressive than gffutils (it does not handle arbitrary nesting and basically ignores non-genes), but for simple operations like "add attribute X to all genes" or "reformat the IDs of this GFF" it's considerably faster. It's not actually faster once the DB is created; gffutils retrieval is excellent, but the overhead of creating the db trumps the computation time for many of these operations.
In summary, I'm wondering if there's a way to bridge the gap between the fast but hacky solutions and the sophisticated gffutils solution that comes with this overhead. I think this is an important issue because many of the operations done on GFFs (at least the ones I do) don't require many hierarchical SQL queries.
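The "streaming" operation described above (add a key/value pair to the attributes of some records without building a db) can be sketched in plain Python. This is a minimal no-database sketch assuming tab-separated GFF3 with attributes in column 9, not part of gffutils:

```python
def add_attribute(lines, key, value):
    """Stream GFF3 lines, appending key=value to the attributes
    column of every feature line; headers and comments pass through
    unchanged."""
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            yield line  # pass comment/pragma lines through untouched
            continue
        fields = line.split('\t')
        # Append to column 9, avoiding a doubled ';'
        fields[8] = '%s;%s=%s' % (fields[8].rstrip(';'), key, value)
        yield '\t'.join(fields)
```

Since it touches each line exactly once, the cost is a single pass over the file with no database creation.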
gffutils database creation may be unnecessarily slow because some sqlite indexes that are important for duplicate merging seem to be missing. The code spends most of its time in _candidate_merges(); the joins in that function are expensive since the newid and idspecid indexes are apparently missing. I added
c.execute('CREATE UNIQUE INDEX featuresid ON features (id);')
c.execute('CREATE UNIQUE INDEX dupnewid ON duplicates (newid);')
c.execute('CREATE INDEX dupspecid ON duplicates (idspecid);')
self.conn.commit()
below the other index-preparation steps, which resulted in roughly 100-fold faster database creation.
As a side note, I also changed the default pragmas to further increase performance, though I believe the main boost came from the indexes alone:
pragmas = {'synchronous': 'OFF', 'journal_mode': 'MEMORY',
           'main.page_size': 4096, 'main.cache_size': 100000,
           'temp_store': 2, 'locking_mode': 'EXCLUSIVE'}
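For reference, the pragma dict above can be applied to a connection with the stdlib sqlite3 module. A sketch; note these settings trade crash-safety for speed, so they are only appropriate for throwaway database builds:

```python
import sqlite3

# Performance pragmas as suggested above; 'synchronous': 'OFF' and
# 'journal_mode': 'MEMORY' disable durability guarantees.
pragmas = {'synchronous': 'OFF', 'journal_mode': 'MEMORY',
           'main.page_size': 4096, 'main.cache_size': 100000,
           'temp_store': 2, 'locking_mode': 'EXCLUSIVE'}

conn = sqlite3.connect(':memory:')
for name, value in pragmas.items():
    # PRAGMA statements do not accept bound parameters, so the
    # name/value pair is interpolated directly.
    conn.execute('PRAGMA %s=%s' % (name, value))
```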
It would be useful to have a single method that adds intron annotations to the db or a GFF/GTF file. Something along these lines:
def gen_introns(db):
    for gene in db.features_of_type('gene'):
        for transcript in db.children(gene):
            exons = sorted(db.children(transcript, featuretype='exon'))
            for intron in db.interfeatures(exons):
                intron.featuretype = 'intron'
                yield intron
(But interfeatures should probably not be a method on FeatureDB... a classmethod, maybe? Or a separate utility function? Also, it should do something useful with attributes; currently the attributes from interfeatures are an empty string.)
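The coordinate arithmetic such an interfeatures helper would perform can be shown in isolation. A pure-coordinate sketch (hypothetical helper, working on plain tuples rather than Feature objects):

```python
def interfeature_coords(exons):
    """Given sorted (start, end) exon intervals (1-based, inclusive),
    return the intervening intron intervals."""
    introns = []
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        if s2 > e1 + 1:  # only an actual gap between exons is an intron
            introns.append((e1 + 1, s2 - 1))
    return introns
```

Adjacent exons (no gap) correctly yield no intron.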
db.region(seqid='chr2L', start=1, end=2e8)
gives a sqlite3.OperationalError for too many SQL variables. The sqlite3 limits docs say 999 is the default number of arguments.
So we get this error from looking at too many bins at the same time (similar to the last example query in the original UCSC Genome Browser paper).
I think the query-generating code should detect that we have too many args and then simply not add the bin constraint to the query.
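The suggested fix can be sketched as a helper that either builds the bin constraint or signals the caller to fall back to a plain start/end range scan (hypothetical helper, not gffutils code):

```python
SQLITE_MAX_VARS = 999  # sqlite's default bound-parameter limit

def bin_constraint(bins):
    """Return ('bin IN (...)', params) for the given bins, or
    (None, []) when the bin list would exceed the bound-variable
    limit, meaning the bin constraint should be omitted."""
    bins = list(bins)
    if len(bins) >= SQLITE_MAX_VARS:
        return None, []  # caller falls back to start/end range scan
    return 'bin IN (%s)' % ','.join('?' * len(bins)), bins
```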
Hello @daler ,
I've been trying to merge features from two different GTF files/databases into a single "combined" GTF. I'm trying to merge an Ensembl GTF file with my features of interest from another GTF file (Cufflinks/Cuffcompare output).
So I have some questions I couldn't find answers to in the docs:
First, since I want to output a single GTF as the result of my merging script: is there any GTF-writing method in gffutils? I know I can print features into an open file handle, but how do we deal with GTF/GFF headers?
Second, since I want to pick some features from each input GTF database, my logical approach would be to create an empty db ("output_db", if that is possible), add the features of interest to it, sort the features by chromosome/start position, and finally write them to the output GTF file.
At the moment I can pick my features of interest from both input GTF dbs, but I don't know how to merge them into a single db for sorting and writing.
Thanks!
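The sorting-and-merging part of the question can be handled with the stdlib alone when each selection is already coordinate-sorted. A generic sketch; with gffutils itself you could instead create a db from one selection and pass features from the other to its update() method:

```python
import heapq

def merge_sorted(*streams):
    """Merge coordinate-sorted streams of (seqid, start, line) records
    into one sorted stream, e.g. for writing a combined GTF."""
    return heapq.merge(*streams, key=lambda rec: (rec[0], rec[1]))
```

heapq.merge is lazy, so arbitrarily large selections can be combined without holding everything in memory.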
Hi Ryan,
gffutils is awesome, I'm swapping over to using it for doing all of my GTF manipulations.
I have a GTF file that looks like this:
chr1 hg19_refFlat exon 14362 14829 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 14970 15038 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 15796 15947 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 16607 16765 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 16858 17055 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 17233 17368 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 17606 17742 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 17915 18061 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 18268 18366 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 24738 24891 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
But when I write out features like this:
for isoform in single_isoform:
    for feature in db.children(isoform, featuretype="exon", order_by="start"):
        print feature.attributes
        out_handle.write(str(feature) + "\n")
it has an extra empty column in the attribute field:
{u'': [], u'transcript_id': [u'UBE2Q1'], u'gene_id': [u'UBE2Q1']}
{u'': [], u'transcript_id': [u'RNF10'], u'gene_id': [u'RNF10']}
{u'': [], u'transcript_id': [u'RNF11'], u'gene_id': [u'RNF11']}
{u'': [], u'transcript_id': [u'REM1'], u'gene_id': [u'REM1']}
{u'': [], u'transcript_id': [u'LOC100506540'], u'gene_id': [u'LOC100506540']}
{u'': [], u'transcript_id': [u'TRPS1'], u'gene_id': [u'TRPS1']}
Leading to a GTF file that some downstream tools really don't like:
chr1 hg19_refFlat exon 154521051 154522945 0.000000 - . ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1 hg19_refFlat exon 154523414 154523480 0.000000 - . ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1 hg19_refFlat exon 154523873 154523968 0.000000 - . ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1 hg19_refFlat exon 154524247 154524295 0.000000 - . ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1 hg19_refFlat exon 154524397 154524455 0.000000 - . ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
chr1 hg19_refFlat exon 154524569 154524659 0.000000 - . ; transcript_id "UBE2Q1"; gene_id "UBE2Q1"
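The stray u'' key suggests the trailing ';' in the input produced an empty attribute chunk. A minimal sketch of a GTF attribute parser that skips such empty chunks (for illustration only, not gffutils' actual parser):

```python
def parse_gtf_attributes(attr_string):
    """Parse a GTF attribute column into a dict of value lists,
    ignoring the empty chunk that a trailing ';' produces."""
    attrs = {}
    for chunk in attr_string.strip().split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue  # trailing ';' yields an empty chunk; skip it
        key, _, value = chunk.partition(' ')
        attrs.setdefault(key, []).append(value.strip('"'))
    return attrs
```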
Hi again,
I'm trying to update GFF features based on the attributes of their siblings or children, so I was trying to use FeatureDB.iter_by_parent_childs(). However, when I feed the feature list back to FeatureDB.update(), I keep getting AttributeError: 'list' object has no attribute 'dialect', and I don't understand why. Could you have a quick look? I put a minimal example at https://gist.github.com/rmzelle/2dfd897a285c581c164b
[gffutils 0.8.7.1]
A feature that I'd find very useful is parsing of gffutils GFF records into a (lightweight) Gene-like object. Something like:
# recs is a set of GFF records
gene_obj = make_gene_obj(recs)
# Access the mRNAs, their exons
first_mRNA = gene_obj.mRNAs[0]
print "Exons of mRNA: ", first_mRNA
for exon_num, exon in enumerate(first_mRNA.exons):
    print "Exon %d is: " % (exon_num), exon
The reason this is useful is that it would provide canonical ordering guarantees, similar to what GFFWriter does for serialization. For example, if you iterate over the exons belonging to an mRNA, as in:
for exon in first_mRNA.exons:
    # access exon
    ...
then it'd be helpful to know that you're guaranteed to get the exons sorted by their coordinate, so that you're iterating over the mRNA's exons from start to finish. I believe that if you simply do db.children(mRNA_id)
you're not guaranteed to get them in order. Similarly for getting the mRNA records of a gene: it's helpful to know that they're sorted in some canonical order, like longest to shortest, so that gene_obj.mRNAs[0] is always the longest mRNA.
I think the ideal implementation would keep the objects as similar to GFF features as possible. For example, you should be able to do gene_obj.mRNAs[0].start, gene_obj.mRNAs[0].stop, etc. The mRNA object could even inherit from the GFF feature class.
The way I'm thinking of accessing the objects would be through an iterator that returns gene objects, making them on the fly, something like:
for gene_obj in db.get_genes():
    # access gene
    ...
where get_genes() internally does something like:
def get_genes(self, ...):
    # Get records belonging to each gene
    for gene_recs in db.iter_by_parent_childs(featuretype='gene'):
        # Make the gene's records into an object in canonical order,
        # where mRNAs appear sorted by length and exons appear sorted
        # by their start coordinate.
        gene_obj = records_to_gene(gene_recs)
        yield gene_obj
What do you think? Is this something worth putting into gffutils? If so, do you have ideas on the most efficient way to do it? The naive sketch of get_genes() might be fast enough, but I'm not sure. Thanks, --Yarden
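The ordering guarantees the proposal asks for (mRNAs longest-first, exons by start coordinate) can be sketched with a minimal wrapper. The Rec namedtuple is a stand-in for illustration; in practice these would be gffutils Feature objects:

```python
from collections import namedtuple

# Stand-in record carrying only the fields the sketch needs.
Rec = namedtuple('Rec', ['featuretype', 'start', 'stop'])

class MRNAModel(object):
    def __init__(self, rec, exons):
        self.rec = rec
        self.start, self.stop = rec.start, rec.stop
        # Exons guaranteed sorted by start coordinate.
        self.exons = sorted(exons, key=lambda e: e.start)

class GeneModel(object):
    """Gene-centric wrapper: mRNAs kept sorted longest-first."""
    def __init__(self, gene, mrnas):
        # mrnas: iterable of (mRNA_record, exon_records) pairs
        self.gene = gene
        self.mRNAs = sorted((MRNAModel(m, ex) for m, ex in mrnas),
                            key=lambda m: m.stop - m.start, reverse=True)
```

A records_to_gene()-style constructor would only need to group a gene's records into these (mRNA, exons) pairs.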
See https://gist.github.com/rmzelle/5697c87edf376f17fd63
There are two GFF files. The stringified output of all features seems to put the first attribute of the first feature in front, even though I have keep_order set to True. Tested with gffutils 0.8.3 and Python 2.7.9.
Hi.
I've been using gffutils with the GENCODE human genome annotation (v19; http://www.gencodegenes.org/releases/current.html) and noticed the following:
If I create the database this way:
#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")
And then run multiple calls to db.parents:
#!/usr/bin/python2
import gffutils as gff
import os.path
import sys
db = gff.FeatureDB('gencode.v19.annotation.gff3.db')
with open(os.path.join(os.path.dirname(__file__), '1000_transcripts.txt')) as transcripts:
    for t in transcripts:
        t = t.strip()
        for gene in db.parents(t, featuretype='gene'):
            print gene
1000_transcripts.txt contains 1000 random transcript IDs from the human genome annotation.
The run time of the second script, as reported by the time bash command, is:
real 0m31.850s
user 0m17.351s
sys 0m14.497s
But if I create the database this way:
#!/usr/bin/python2
import gffutils as gff
import sys
db = gff.create_db('gencode.v19.annotation.gff3',
                   'gencode.v19.annotation.gff3.db',
                   force=True,
                   merge_strategy="merge")
db.execute('ANALYZE features')
Then the same calls to parents take
real 0m0.089s
user 0m0.072s
sys 0m0.016s
However, I can't see this effect if the call to parents is performed in the same script as the database creation.
I suggest #55 as a fix.
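For context, ANALYZE records table/index statistics in sqlite_stat1, which the query planner consults when choosing indexes. A self-contained stdlib demonstration of that mechanism (the table and row here are illustrative; on a real gffutils db you would run db.execute('ANALYZE features') as above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# TEXT PRIMARY KEY creates an implicit unique index, which ANALYZE
# then gathers statistics for.
conn.execute('CREATE TABLE features (id TEXT PRIMARY KEY, seqid TEXT)')
conn.execute("INSERT INTO features VALUES ('FBtr0000001', 'chr1')")
conn.execute('ANALYZE features')
tables = [row[0] for row in conn.execute('SELECT tbl FROM sqlite_stat1')]
```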
Not very critical, but it would be good at some point to add better error handling for empty/missing/non-GFF files in various gffutils operations, especially in gffutils-cli. For example:
# make empty file
$ touch test.gff3
# try to use it for gff db creation
$ gffutils-cli create test.gff3
Traceback (most recent call last):
...
File "yarden/gffutils/gffutils/create.py", line 803, in create_db
if force_gff or (dialect['fmt'] == 'gff3'):
TypeError: 'NoneType' object has no attribute '__getitem__'
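Until such handling lands in gffutils itself, a user-side guard can turn the dialect TypeError above into a clear message. A hypothetical wrapper, not part of gffutils:

```python
import os

def checked_create_db(fn, dbfn, **kwargs):
    """Fail with a clear error on empty or missing input before
    handing off to gffutils.create_db."""
    if not os.path.exists(fn):
        raise IOError('GFF file not found: %s' % fn)
    if os.path.getsize(fn) == 0:
        raise ValueError('GFF file is empty: %s' % fn)
    import gffutils  # deferred so the guards run without it
    return gffutils.create_db(fn, dbfn, **kwargs)
```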
Hi again :),
As suggested, I am opening a new issue for the GTF/GFF3 parsing failures. The files come from the same fasta file (and are attached below).
I get a ValueError or an AssertionError when calling create_db:
>>> gffutils.create_db('augustusDbia3.gtf', 'gtf')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 1273, in create_db
c.create()
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 488, in create
self._populate_from_lines(self.iterator)
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 553, in _populate_from_lines
for i, f in enumerate(lines):
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/iterators.py", line 100, in __iter__
for i in self._iter:
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/iterators.py", line 144, in _custom_iter
yield feature_from_line(line, dialect=self.dialect)
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/feature.py", line 392, in feature_from_line
attrs, _dialect = parser._split_keyvals(attr_string, dialect=dialect)
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/parser.py", line 164, in _split_keyvals
assert len(item) == 1, item
AssertionError: [u'', u'gene_id', u'"contig1.g1"']
and
>>> gffutils.create_db('augustusDbia3.gff3', 'gff3')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 1273, in create_db
c.create()
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 488, in create
self._populate_from_lines(self.iterator)
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 571, in _populate_from_lines
fixed, final_strategy = self._do_merge(f, self.merge_strategy)
File "/Users/remi/Google_Drive/Dev/galaxy_env/lib/python2.7/site-packages/gffutils/create.py", line 218, in _do_merge
raise ValueError("Duplicate ID {0.id}".format(f))
ValueError: Duplicate ID contig1.g1.t1.cds
Do you know what this indicates?
augustusDbia3.gtf.txt
augustusDbia3.gff3.txt
Thanks! :)
I gathered from https://github.com/daler/gffutils/blob/master/gffutils/gffwriter.py that it's possible to write GFF files from a (modified) gffutils database, but I can't find any examples in the documentation at http://pythonhosted.org/gffutils/contents.html or online on how to do so.
Does this feature exist, and if so, can you provide any examples on how to invoke the function?
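In the meantime, the basic operation (features out, GFF3 header in front) can be done with a few lines. A minimal stand-in sketch for what gffutils.gffwriter.GFFWriter provides, not its actual API:

```python
def write_gff3(features, out_handle):
    """Write features (anything whose str() is a valid GFF line)
    preceded by the GFF3 version pragma."""
    out_handle.write('##gff-version 3\n')
    for feature in features:
        out_handle.write(str(feature) + '\n')
```

With gffutils, db.all_features(order_by=('seqid', 'start')) would supply a sorted feature stream to pass in.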
I am getting the error
File "C:\...\python-2.7.9.amd64\lib\site-packages\gffutils\parser.py", line 164, in _split_keyvals
assert len(item) == 1, item
AssertionError: [u'Alias=CCHL,cytochrome', u'c', u'heme', u'lyase']
for the line
scaffold_84 liftOver CDS 41166 41975 . - . ID=YAL039C_JAY291;Name=YAL039C;orf_classification=Verified;gene=CYC3;Alias=CCHL,cytochrome c heme lyase;dbxref=SGD:S000000037;Note=Cytochrome%20c%20heme%20lyase%20%28holocytochrome%20c%20synthase%29%2C%20attaches%20heme%20to%20apo-cytochrome%20c%20%28Cyc1p%20or%20Cyc7p%29%20in%20the%20mitochondrial%20intermembrane%20space%3B%20human%20ortholog%20may%20have%20a%20role%20in%20microphthalmia%20with%20linear%20skin%20defects%20%28MLS%29
According to http://www.sequenceontology.org/gff3.shtml, spaces are allowed in the Alias field ("Note that unescaped spaces are allowed within fields, meaning that parsers must split on tabs, not spaces."), as are multiple values ("In addition to Parent, the Alias, Note, Dbxref and Ontology_term attributes can have multiple values.").
Hello,
When I do pip install gffutils, I get the error below.
I am installing gffutils 0.8.7.1 using Python 2.7 with pip 8.1.1.
Could you please let me know where the problem is?
Best regards,
Sebastian.
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2134, in find_on_path
for item in dists:
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2134, in find_on_path
for item in dists:
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2120, in find_on_path
path_item, entry, metadata, precedence=DEVELOP_DIST
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2510, in from_location
py_version=py_version, platform=platform, **kw
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2489, in __init__
self._version = safe_version(version)
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 1366, in safe_version
return str(packaging.version.Version(version))
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/packaging/version.py", line 230, in __init__
self._version.local,
File "/apps/python-2.7.11/lib/python2.7/site-packages/pip/_vendor/packaging/version.py", line 353, in _cmpkey
reversed(release),
RuntimeError: maximum recursion depth exceeded
It would be useful to accept GFF files where end - start equals the feature length (i.e., 0-based, half-open coordinates).
This can be done in HTSeq:
http://www-huber.embl.de/users/anders/HTSeq/doc/features.html#gff-reader-and-genomicfeature
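Supporting such files amounts to one coordinate shift on input. A sketch of the conversion, assuming the input uses BED-style 0-based, half-open intervals:

```python
def zero_based_to_gff(start, end):
    """Convert a 0-based, half-open interval (where end - start equals
    the feature length) to GFF's 1-based, inclusive convention."""
    return start + 1, end
```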
An issue I frequently run into is that bedtools sometimes produces extra fields in its output, creating files that are, strictly speaking, neither GFF nor BED. For example, if I call coverageBed with a BAM input as the -a parameter and a GFF as the -b parameter, it creates output lines like:
chr1 SE gene 83877956 83908217 . - . ID=chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-;Name=chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:- 18 228 30262 0.0075342
where the last four fields are various count/density values. gffutils will not parse this file, yielding:
in gffutils.gfffeature.GFFFile.__init__ (gffutils/gfffeature.c:1207)()
ValueError: No features found in file
which is technically the correct behavior, since the file has more than 9 fields and so isn't a valid GFF3 file. However, these files are very useful and contain all the information necessary to be treated as GFF3 files.
What do you think is the correct approach to dealing with these files?
I think it'd be ideal if bedtools output these values as lower-case optional attributes in the GFF3 attributes field, since that would preserve the format and allow strict GFF3 parsers to handle the file without issues. But given that this is not the current format, I'm curious how you think these files should be handled in gffutils. Best, --Yarden
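The attributes-folding idea can be applied as a pre-processing step before handing the file to any strict GFF3 parser. A sketch; the attribute names are illustrative assumptions, not a bedtools or gffutils convention:

```python
def fold_extra_fields(line, names=('reads', 'bases', 'length', 'fraction')):
    """Fold trailing columns of a coverageBed-style line (>9 fields)
    into lower-case GFF3 attributes, yielding a valid 9-column line."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) <= 9:
        return '\t'.join(fields)  # already valid; pass through
    attrs = fields[8].rstrip(';')
    for name, value in zip(names, fields[9:]):
        attrs += ';%s=%s' % (name, value)
    return '\t'.join(fields[:8] + [attrs])
```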
Hello,
I am using gffutils to process selected entries from following file:
ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.09_FB2016_01/gtf/dmel-all-r6.09.gtf.gz
In short, I have a list of transcript_id(s) and fgrep-ed the above GTF file, keeping rows with these featuretypes:
gene
mRNA
exon
CDS
start_codon
stop_codon
I load the data using:
db = gffutils.create_db(fn, dbfn='dmel_test.db', force=True, keep_order=False, merge_strategy='merge', sort_attribute_values=True, verbose=True, id_spec={'gene':'gene_id', 'mRNA': 'transcript_id',}, force_gff=False)
I would like the exons to at least have ids of the form transcript_id.exon_number (FBtr0330644.exon_1, FBtr0330644.exon_2, etc.).
I did try to use a transform function, but the ids of the exons stay the same:
def transform(f):
    if f.featuretype == 'exon':
        _id = f.attributes['transcript_id'][0] + ".ex_"
        id = 'autoincrement:_id'
    return f
db = gffutils.create_db(fn, dbfn='dmel_test.db', force=True, keep_order=False, merge_strategy='merge', sort_attribute_values=True, id_spec={'gene':'gene_id', 'mRNA': 'transcript_id',}, transform=transform, gtf_subfeature='exon', force_gff=False, verbose=True)
I will keep trying, but if you find time to give me a hint, it would be really helpful.
Thank you for writing and maintaining gffutils.
Darek Kedra
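One likely reason the attempt above has no effect is that it only assigns local variables and never mutates the feature. A sketch of a transform that actually rewrites the ID attribute, shown with a stand-in FakeFeature carrying the two Feature properties the transform touches (with gffutils you would pass the same transform() to create_db(..., transform=transform), and may also need an id_spec entry so the new ID attribute is used):

```python
from collections import defaultdict

class FakeFeature(object):
    """Stand-in with featuretype and an attributes dict of lists."""
    def __init__(self, featuretype, attributes):
        self.featuretype = featuretype
        self.attributes = attributes

# Running exon number per transcript.
exon_counts = defaultdict(int)

def transform(f):
    # Rewrite exon IDs to <transcript_id>.exon_<n>; unlike the attempt
    # above, this mutates f.attributes rather than local variables.
    if f.featuretype == 'exon':
        tid = f.attributes['transcript_id'][0]
        exon_counts[tid] += 1
        f.attributes['ID'] = ['%s.exon_%d' % (tid, exon_counts[tid])]
    return f
```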
I'm trying to create a GFF database from an iterator. I noticed that when I do this, the database is created successfully, but the majority of features are dropped from the db when it is queried by methods like all_features. I wrote a short unit test for it. The GFF, "gff_example1.gff3", is:
chr1 ensGene gene 4763287 4775820 . - . Name=ENSMUSG00000033845;ID=ENSMUSG00000033845;Alias=ENSMUSG00000033845;gid=ENSMUSG00000033845
chr1 ensGene mRNA 4764517 4775779 . - . Name=ENSMUST00000045689;Parent=ENSMUSG00000033845;ID=ENSMUST00000045689;Alias=ENSMUSG00000033845;gid=ENSMUSG00000033845
chr1 ensGene CDS 4775654 4775758 . - 0 Name=ENSMUST00000045689.cds0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.cds0;gid=ENSMUSG00000033845
chr1 ensGene CDS 4772761 4772814 . - 0 Name=ENSMUST00000045689.cds1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.cds1;gid=ENSMUSG00000033845
chr1 ensGene exon 4775654 4775779 . - . Name=ENSMUST00000045689.exon0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon0;gid=ENSMUSG00000033845
chr1 ensGene exon 4772649 4772814 . - . Name=ENSMUST00000045689.exon1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon1;gid=ENSMUSG00000033845
chr1 ensGene exon 4767606 4767729 . - . Name=ENSMUST00000045689.exon2;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon2;gid=ENSMUSG00000033845
chr1 ensGene exon 4764517 4764597 . - . Name=ENSMUST00000045689.exon3;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon3;gid=ENSMUSG00000033845
chr1 ensGene five_prime_UTR 4775759 4775779 . - . Name=ENSMUST00000045689.utr0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr0;gid=ENSMUSG00000033845
chr1 ensGene three_prime_UTR 4772649 4772760 . - . Name=ENSMUST00000045689.utr1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr1;gid=ENSMUSG00000033845
chr1 ensGene three_prime_UTR 4767606 4767729 . - . Name=ENSMUST00000045689.utr2;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr2;gid=ENSMUSG00000033845
chr1 ensGene three_prime_UTR 4764517 4764597 . - . Name=ENSMUST00000045689.utr3;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr3;gid=ENSMUSG00000033845
The unit test is:
def test_create_db_from_iter():
    """
    Test creation of FeatureDB from iterator.
    """
    print "Testing creation of DB from iterator"

    def my_iterator():
        db_fname = gffutils.example_filename("gff_example1.gff3")
        db = gffutils.create_db(db_fname, ":memory:")
        for rec in db.all_features():
            yield rec

    new_db = gffutils.create_db(my_iterator(), ":memory:")
    print list(new_db.all_features())
    gene_feats = new_db.all_features(featuretype="gene")
    assert len(list(gene_feats)) != 0, "Could not load genes from GFF."
The database is successfully created, but for some reason all features except the last two three_prime_UTR entries are dropped when it is queried. For example, none of the gene, mRNA, or exon entries are in the resulting database. The only features it registers are:
[<Feature three_prime_UTR (chr1:4767606-4767729[-]) at 0x1967b90>, <Feature three_prime_UTR (chr1:4764517-4764597[-]) at 0x1967790>]
Any ideas on what might be causing this? Thanks, --Yarden
Hi!
I just tried the version installed through pip, and I get an error about dialect on a str object when I try to create the db. It reproduces on Mac and Linux (Ubuntu), with both GTF and GFF3 :/.
Here is the output:
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gffutils
>>> gtf = open("augustusDbia3.gtf", 'r')
>>> db = gffutils.create_db(gtf, ":memory:")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/remi/Dev/galaxy_env/local/lib/python2.7/site-packages/gffutils/create.py", line 1231, in create_db
iterator = iterators.DataIterator(**kwargs)
File "/home/remi/Dev/galaxy_env/local/lib/python2.7/site-packages/gffutils/iterators.py", line 278, in DataIterator
return _FeatureIterator(**_kwargs)
File "/home/remi/Dev/galaxy_env/local/lib/python2.7/site-packages/gffutils/iterators.py", line 93, in __init__
self._observed_dialects = [i.dialect for i in self.peek]
AttributeError: 'str' object has no attribute 'dialect'
Thanks for the help!
This is not an issue, more a question. It takes some serious SQL wrangling to get parent-child or grandparent-child information about gene-transcript-exon relationships. Have you thought about using a graph database for gffutils? There doesn't seem to be a SQLite-style embedded equivalent for Node.js or TitanDB, so you would have to open up a separate port; that could be a drawback.
Windows won't allow me to delete certain GFF files, and it seems to be a result of gffutils.create_db. Using psutil to keep track of open files (per http://stackoverflow.com/a/25069136/1712389), I get:
Code:
proc = psutil.Process()
print ('beforeDB: ' + str(proc.open_files()))
#create database for genome
db = gffutils.create_db(genomeGFF, dbfn=genomeGFFdb, force=True, merge_strategy="merge", keep_order=True)
print ('afterDB: ' + str(proc.open_files()))
Output:
beforeDB: [popenfile(path=u'C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1)]
afterDB: [popenfile(path=u'C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1), popenfile(path=u'C:\\...\\modify-reference-genome\\temp\\SacceM3707_1_AssemblyScaffolds.gff', fd=-1)]
("genomeGFF" matches the "SacceM3707_1_AssemblyScaffolds.gff" file path). Is this a bug? If not, is there a way to release the GFF file? I'm using gffutils 0.8.3.1.
There is a constant for the temp-file location in the loader that forces multiple parallel instances of gffutils to use the same tempfile. Also, is there any way to do this without making temp files at all? If :memory: is used as the dbfn database location, users probably expect gffutils to work without disk access, so a pure in-memory solution would be beneficial in those cases.
I encountered what appears to be a non-deterministic / context-dependent bug in the children method. I have the following GFF file: http://genes.mit.edu/burgelab/yarden/SE.mm9.with_introns.gff3
Which contains the event:
chr6 SE gene 71510891 71531580 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+;Name=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+
chr6 SE mRNA 71510891 71531580 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+
chr6 SE exon 71529635 71531580 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B.dn;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B
chr6 SE exon 71510891 71511070 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B.up;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B
chr6 SE mRNA 71510891 71531580 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+
chr6 SE exon 71510891 71511070 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.up;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6 SE exon 71524109 71524230 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.se;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6 SE exon 71529635 71531580 . + . ID=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.dn;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6 SE intron 71511071 71524108 . + . ID=chr6:71511071-71524108:+.intron;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6 SE intron 71524231 71529634 . + . ID=chr6:71524231-71529634:+.intron;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
chr6 SE intron 71511071 71529634 . + . ID=chr6:71511071-71529634:+.intron;Parent=chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.B
The event has two mRNAs: the first mRNA contains three exons (and two intervening intron entries), while the second mRNA contains two exons (with only one intervening intron entry). Hence, there are three intron entries total in the file: two having the first mRNA (whose ID ends in .A) as parent, and one having the second mRNA (whose ID ends in .B) as parent.
I create a db for this GFF file in the usual way. Then I try to fetch all the children of the first mRNA as follows:
# Load up the database
In [1]: import gffutils
In [2]: gff_db = gffutils.FeatureDB("SE.mm9.with_introns.gff3.db")
In [5]: event = "chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+"
# Get the children of the event, i.e. the mRNAs
In [7]: list(gff_db.children(event))
Out[7]:
[<Feature: mRNA, chr6:71510891-71531580 (+)>,
<Feature: mRNA, chr6:71510891-71531580 (+)>]
# Now get the children of the first mRNA
In [8]: mrna1 = "chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A"
In [10]: list(gff_db.children(mrna1))
Out[10]:
[<Feature: exon, chr6:71510891-71511070 (+)>,
<Feature: intron, chr6:71511071-71524108 (+)>,
<Feature: exon, chr6:71524109-71524230 (+)>,
<Feature: intron, chr6:71524231-71529634 (+)>,
<Feature: exon, chr6:71529635-71531580 (+)>]
So far so good: it picked out the three exon children of mrna1 and the two intron children. However, it got the Parent identifiers wrong:
In [12]: [p.attributes["Parent"] for p in list(gff_db.children(mrna1))]
Out[12]:
['chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A',
'chr6:71524109:71524230:+@chr6:71527725:71527839:+@chr6:71529635:71531580:+.B',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A']
It assigned chr6:71524109:71524230:+@chr6:71527725:71527839:+@chr6:71529635:71531580:+.B
as the parent of the second intron, even though it has as its parent chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
and we asked for the children of chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A
. Here's the output that makes it more explicit:
In [13]: [(p.id, p.attributes["Parent"]) for p in list(gff_db.children(mrna1))]
Out[13]:
[('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.up',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
('chr6:71511071-71524108:+.intron',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.se',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
('chr6:71524231-71529634:+.intron',
'chr6:71524109:71524230:+@chr6:71527725:71527839:+@chr6:71529635:71531580:+.B'),
('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.dn',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A')]
The intron entry chr6:71524231-71529634:+.intron
definitely does not have the ...B
mRNA as its Parent
, yet it's listed this way. It looks like it might be something off with the attributes
field, since if I reference the parents()
method it gets it right:
In [18]: [(p.id, list(gff_db.parents(p))[0].id) for p in list(gff_db.children(mrna1))]
Out[18]:
[('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.up',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
('chr6:71511071-71524108:+.intron',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.se',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
('chr6:71524231-71529634:+.intron',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A'),
('chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A.dn',
'chr6:71510891:71511070:+@chr6:71524109:71524230:+@chr6:71529635:71531580:+.A')]
If I take this event out of the large GFF file I linked to and place it in a file on its own, the bug disappears and the Parent field comes out right... no idea what is happening. --Yarden
Currently db creation uses INSERT OR IGNORE rather than just INSERT. The side effect is that only the first instance of a unique ID is entered into the database and all other instances with the same ID are ignored.
Should duplicate IDs in a GFF file be considered an error?
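The effect can be reproduced with a minimal sqlite3 session (illustrative only; the table layout here is simplified from the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, start INTEGER)")

# Two records sharing an ID, as duplicate IDs in a GFF file would
conn.execute("INSERT OR IGNORE INTO features VALUES ('exon1', 100)")
conn.execute("INSERT OR IGNORE INTO features VALUES ('exon1', 500)")

rows = conn.execute("SELECT id, start FROM features").fetchall()
print(rows)  # only the first instance survives: [('exon1', 100)]
```

With plain INSERT the second statement would instead raise sqlite3.IntegrityError, which is what treating duplicate IDs as an error would look like.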
In the design/feature/documentation category: find a way to produce an HTML table from a gffutils database using entabled (https://github.com/daler/entabled) or something similar. It would make for a great example in the documentation. The table could be easily sorted by start/end coordinates, chromosome, and other GFF columns.
I'm trying to save the ENSEMBL/Gencode ID and the chrom, start, and stop of a gff3 (created by @yarden) that uses UCSC IDs. The GFF3 is ALE.hg19.gff3 from this zip file, and the Gencode database was created as described in #20 (comment).
Here's the code: http://nbviewer.ipython.org/gist/olgabot/ef88e1963c5647e95696
I get most of the way through the file, but then I run into this OperationalError:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-7-e2b521904d43> in <module>()
5 for ale in db.features_of_type('gene'):
6 print ale.id
----> 7 overlapping_genes = list(v19_db.region((ale.chrom, ale.start, ale.stop), featuretype='gene'))
8
9 gencode = set(g.id for g in overlapping_genes)
/projects/ps-yeolab/software/anaconda/envs/olga/lib/python2.7/site-packages/gffutils/interface.pyc in region(self, region, featuretype, completely_within)
498
499 c = self.conn.cursor()
--> 500 c.execute(query, tuple(args))
501 for i in c:
502 yield self._feature_returner(**i)
OperationalError: too many SQL variables
Any ideas? I'm thinking of just saving to disk periodically.
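Until the underlying bin query is fixed, one workaround is to split a large region into windows small enough that the generated bin list stays under SQLite's variable limit. A minimal sketch (the window size is an assumption chosen conservatively):

```python
def region_windows(start, stop, size=1_000_000):
    """Yield (start, stop) windows covering the inclusive interval [start, stop]."""
    pos = start
    while pos <= stop:
        yield pos, min(pos + size - 1, stop)
        pos += size

windows = list(region_windows(1, 2_500_000))
print(windows)  # [(1, 1000000), (1000001, 2000000), (2000001, 2500000)]
```

Each window would then be passed to db.region(...) in turn, collecting results in a set keyed by feature ID so that features spanning a window boundary are not counted twice.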
What sort of functionality should be exposed to the CLI?
For example a universal GTF<->GFF<->refFlat conversion script would be useful.
Currently when merge_strategy="merge", features will only be merged if all of their non-attribute fields (chrom, source, start, stop, etc.) are identical. For some use-cases (see later comments in #20), it would be useful to allow a subset of these fields to be ignored.
For example, these exons have the same coords, but different sources:
chr1 HAVANA exon 12613 12721 . + . transcript_id "mRNA1"; gene_id "gene1"
chr1 ENSEMBL exon 12613 12721 . + . transcript_id "mRNA2"; gene_id "gene1"
We would like to force a merged feature into something like this:
chr1 HAVANA,ENSEMBL exon 12613 12721 . + . transcript_id "mRNA1,mRNA2"; gene_id "gene1"
Fields allowed to be merged should be specified in the create_db() call, e.g., force_merge_fields=['source']. It's up to the user to, if they really want to, do things like force_merge_fields=["start"] (which would result in non-integer start positions that will likely break downstream code).
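A sketch of the proposed semantics on plain 9-field rows (force_merge_fields here mirrors the proposed keyword; none of this is existing gffutils API):

```python
# GFF/GTF column names, in file order
FIELDS = ("seqid", "source", "featuretype", "start", "end",
          "score", "strand", "frame", "attributes")

def merge_rows(a, b, force_merge_fields=("source",)):
    """Merge two GFF/GTF rows; values that differ are comma-joined,
    but only for fields explicitly listed in force_merge_fields."""
    merged = []
    for name, x, y in zip(FIELDS, a, b):
        if x == y:
            merged.append(x)
        elif name in force_merge_fields:
            merged.append(f"{x},{y}")
        else:
            raise ValueError(f"field {name!r} differs and is not in force_merge_fields")
    return merged

row = merge_rows(
    ["chr1", "HAVANA", "exon", "12613", "12721", ".", "+", ".", "gene1"],
    ["chr1", "ENSEMBL", "exon", "12613", "12721", ".", "+", ".", "gene1"],
)
print(row[1])  # HAVANA,ENSEMBL
```

Attribute merging (transcript_id "mRNA1,mRNA2") would need its own handling, since the attributes column is structured rather than a flat string.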
>>> os.listdir('.')
['test.gtf', 'test.fastq', 'test.sql']
>>> gffutils.create_db('test.gtf', 'test.sql', force=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/__init__.py", line 44, in create_db
    creator.create()
  File "/usr/local/lib/python2.7/dist-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.py", line 83, in create
    self.init_tables()
  File "/usr/local/lib/python2.7/dist-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.py", line 73, in init_tables
    ''')
sqlite3.OperationalError: table features already exists
>>> gffutils.create_db('test.gtf', 'test.sqlite')
Populating features and relations table: 9 (90%) done in 0.0s
Creating temporary indexes...done in 0.0s
Inferring gene and transcript extents: 15 (93%)...done in 0.0s
Importing inferred features...done in 0.0s
Creating index...done in 0.0s
I'm trying out gffutils and after installing and trying to import it, I got the error:
$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gffutils
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/lab/solexa_jaenisch2/yarden/software/gffutils/gffutils/__init__.py", line 3, in <module>
from gfffeature import Feature, GFFFile
ImportError: /lab/solexa_jaenisch2/yarden/software/gffutils/gffutils/gfffeature.so: undefined symbol: PyUnicodeUCS2_Tailmatch
Any ideas on how this can be fixed? Also, are there plans to make a gffutils command line interface? I think it'd be extremely useful. I haven't seen anything like it, and the operations you set out to answer as described in docs are extremely common, so this would be tremendously useful. Thanks.
Hello, I'm sure this is a rare issue, but I wanted to let you know that in my case, installation in Python 2.7 through pip fails with a compiler error if Cython is not already installed. The installation appears to download Cython as a dependency and begins to install it, but the conversion of *.pyx to *.c fails and the compiler aborts with a "file not found" error.
Thanks for the great work!
I found an oddity which might be there by design in re-setting an attribute of a GFF field. Suppose I have an exon record:
In [237]: e
Out[237]: <Feature: exon, chr1:83907290-83908217 (-)>
In [238]: e.attributes
Out[238]: {'ID': 'chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A.up', 'Parent': 'chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A'}
I try to reassign its ID using e.attributes.__setitem__
, accessing it as a dict:
In [239]: e.attributes["ID"] = "foo"
In [240]: e.attributes
Out[240]: {'ID': 'foo', 'Parent': 'chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A'}
It appears to work, but internally it creates an additional ID field which gets serialized when the record is written to a file:
In [241]: str(e.attributes)
Out[241]: 'ID=foo;Parent=chr1:83907290:83908217:-@chr1:83878734:83878799:-@chr1:83877956:83878093:-.A;ID=foo'
Is this intentional? Is there a different recommended way to reset attributes? I wanted to make sure I'm not missing something before I start meddling with the code. Thanks.
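As a workaround until that's resolved, the attribute string can be rebuilt from scratch instead of mutated in place. A sketch operating on the raw key=value string (this is not the gffutils API, just a standalone helper):

```python
def set_attribute(attr_str, key, value):
    """Return a GFF attribute string with `key` replaced (or appended),
    never duplicated."""
    pairs = [p.split("=", 1) for p in attr_str.split(";") if p]
    out, seen = [], False
    for k, v in pairs:
        if k == key:
            out.append((k, value))  # replace in place, preserving position
            seen = True
        else:
            out.append((k, v))
    if not seen:
        out.append((key, value))
    return ";".join(f"{k}={v}" for k, v in out)

print(set_attribute("ID=old;Parent=geneA", "ID", "foo"))  # ID=foo;Parent=geneA
```

Because the string is fully re-serialized from the parsed pairs, a stale cached copy of the original attribute string can never leak back in.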
See last few comments of #40, where using this feature in a region()
call fails:
chr3 ALE gene 97947 120347376 . - . [email protected]@uc003fzi.1uc003fzj.1;[email protected]@uc003fzi.1uc003fzj.1;[email protected]@uc003fzi.1uc003fzj.1
Specifically, a call to FeatureDB.region() like this results in too many possible bins being passed to the SQL query:
db.region(('chr1', 1, 120000000))
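The arithmetic shows why: under the standard UCSC binning scheme, a 120 Mb region overlaps more bins than SQLite's default 999-variable limit. A sketch of the scheme (not gffutils' own bins module):

```python
def bins_in_range(start, end):
    """All bins overlapping the 0-based half-open interval [start, end)
    in the standard UCSC binning scheme (128 kb finest level)."""
    out = []
    lo, hi = start >> 17, (end - 1) >> 17
    # Offsets of each bin level, finest to coarsest, plus the root bin
    for offset in (585, 73, 9, 1, 0):
        out.extend(range(offset + lo, offset + hi + 1))
        lo >>= 3
        hi >>= 3
    return out

print(len(bins_in_range(0, 120_000_000)))  # 1049 -- over the 999-variable default
```

So any fix likely has to either batch the bin list across multiple queries or fall back to a plain start/stop range comparison when the bin list gets this large.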
Hello,
Thank you for creating such an excellent package. I am attempting to read some GTF files which don't have the transcript_id attribute, and thus with this command:
# RNA-Seq Exons from Gencode V7 data
gff_filename = '/nas3/scratch/obot/encode/gm12878/rna-seq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf'
db_filename = gff_filename + '.db'
gffutils.create_db(gff_filename, db_filename)
cage_gencode7_db = gffutils.FeatureDB(db_filename)
I am getting this error:
Populating features and relations tables...
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-f3bcda2b7341> in <module>()
11 gff_filename = '/nas3/scratch/obot/encode/gm12878/rna-seq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf'
12 db_filename = gff_filename + '.db'
---> 13 gffutils.create_db(gff_filename, db_filename)
14 cage_gencode7_db = gffutils.FeatureDB(db_filename)
/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/__init__.pyc in create_db(fn, dbfn, verbose, force)
43 else:
44 creator = GTFDBCreator(fn, dbfn, verbose=verbose, force=force)
---> 45 creator.create()
46
47
/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.pyc in create(self)
82 def create(self):
83 self.init_tables()
---> 84 self.populate_from_features(self.features)
85 self.update_relations()
86 self.set_filetype()
/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/db.pyc in populate_from_features(self, features)
229 last_perc = perc
230
--> 231 parent = feature.attributes['transcript_id']
232 grandparent = feature.attributes['gene_id']
233
/nas3/yeolab/Software/Python-2.7.5/lib/python2.7/site-packages/gffutils-0.7-py2.7-linux-x86_64.egg/gffutils/gfffeature.so in gffutils.gfffeature.Attributes.__getitem__ (gffutils/gfffeature.c:6261)()
KeyError: 'transcript_id'
There are two main GTFs I'm looking at, which you can get from here:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageGm12878CellPapTssGencV7.gtf.gz
And they look like this:
head /nas3/scratch/obot/encode/gm12878/rna-seq/wgEncodeCshlLongRnaSeqGm12878CellPamExonsGencV7.gtf
chr14 gencodeV7 exon 105886353 105886400 0.005370 + . gene_ids "ENSG00000182979.10,"; transcript_ids "ENST00000438610.1,"; RPKM1 "0"; RPKM2 "0.177373"; iIDR "0.084";
chr9 gencodeV7 exon 123744122 123744217 0.003540 - . gene_ids "ENSG00000106804.5,"; transcript_ids "ENST00000223642.1,"; RPKM1 "0.06532"; RPKM2 "0.051599"; iIDR "0.151";
chr11 gencodeV7 exon 61539301 61539424 0.038597 + . gene_ids "ENSG00000124920.9,"; transcript_ids "ENST00000278836.5,ENST00000265460.5,"; RPKM1 "0.889132"; RPKM2 "0.385668"; iIDR "0.027";
chr16 gencodeV7 exon 50730251 50730391 0.001048 + . gene_ids "ENSG00000167207.4,"; transcript_ids "ENST00000531674.1,"; RPKM1 "0"; RPKM2 "0.034611"; iIDR "0.207";
chr11 gencodeV7 exon 64950172 64950439 0.172877 + . gene_ids "ENSG00000014216.9,"; transcript_ids "ENST00000528396.1,ENST00000529133.1,ENST00000533820.1,ENST00000530571.1,ENST00000527739.1,ENST00000526966.1,ENST00000533129.1,ENST00000524773.1,ENST00000530442.1,ENST00000279247.6,ENST00000527189.1,ENST00000527469.1,ENST00000534373.1,ENST00000259755.6,ENST00000526954.1,ENST00000531068.1,ENST00000527699.1,ENST00000533909.1,"; RPKM1 "4.06183"; RPKM2 "1.64803"; iIDR "0.008";
chr18 gencodeV7 exon 48447444 48447557 0.194347 + . gene_ids "ENSG00000082212.6,"; transcript_ids "ENST00000321341.5,ENST00000382927.2,"; RPKM1 "3.04978"; RPKM2 "3.36921"; iIDR "0.013";
chr15 gencodeV7 exon 62320557 62320619 0.357897 - . gene_ids "ENSG00000129003.10,"; transcript_ids "ENST00000261517.5,ENST00000395896.3,ENST00000395898.2,"; RPKM1 "5.8709"; RPKM2 "5.9499"; iIDR "0.000";
chr14 gencodeV7 exon 92789949 92790304 0.000000 + . gene_ids "ENSG00000140090.11,"; transcript_ids "ENST00000532405.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
chr17 gencodeV7 exon 59989301 59989431 0.142000 - . gene_ids "ENSG00000108506.6,"; transcript_ids "ENST00000444766.2,ENST00000251334.5,"; RPKM1 "2.33894"; RPKM2 "2.3511"; iIDR "0.010";
chr4 gencodeV7 exon 123269847 123269872 0.000000 + . gene_ids "ENSG00000138688.9,"; transcript_ids "ENST00000431755.2,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
Which has transcript_ids but not transcript_id, and thus the parent-child associations are failing.
And for this file:
head /nas3/scratch/obot/encode/gm12878/cage/wgEncodeRikenCageGm12878CellPapTssGencV7.gtf
chr1 Gencode TSS 11869 11869 0 + . gene_id "ENSG00000223972.3"; trlist "ENST00000456328.2,"; trbiotlist "processed_transcript,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 11872 11872 0 + . gene_id "ENSG00000249291.2"; trlist "ENST00000515242.2,"; trbiotlist "protein_coding,"; confidence "not_low"; gene_biotype "protein_coding"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 11874 11874 0 + . gene_id "ENSG00000253101.2"; trlist "ENST00000518655.2,"; trbiotlist "protein_coding,"; confidence "not_low"; gene_biotype "protein_coding"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 12010 12010 0 + . gene_id "ENSG00000223972.3"; trlist "ENST00000450305.2,"; trbiotlist "transcribed_unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 19305 19305 0 - . gene_id "ENSG00000227232.3"; trlist "ENST00000537342.1,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 24886 24886 0 - . gene_id "ENSG00000227232.3"; trlist "ENST00000541675.1,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 29344 29344 0 - . gene_id "ENSG00000227232.3"; trlist "ENST00000430492.2,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 29370 29370 0 - . gene_id "ENSG00000227232.3"; trlist "ENST00000438504.2,ENST00000423562.1,"; trbiotlist "unprocessed_pseudogene,unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 29554 29554 0 + . gene_id "ENSG00000243485.1"; trlist "ENST00000473358.1,"; trbiotlist "processed_transcript,"; confidence "not_low"; gene_biotype "processed_transcript"; rpm1 "0"; rpm2 "0";
chr1 Gencode TSS 29570 29570 0 - . gene_id "ENSG00000227232.3"; trlist "ENST00000488147.1,"; trbiotlist "unprocessed_pseudogene,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
Which has trlist for the transcripts. Is the solution here to have a janky bunch of if/elses to check for all the possible types of transcript_id-like things?
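An alternative to if/elses inside gffutils is to pre-process the attribute column before database creation, mapping known variants onto transcript_id. A sketch (the synonym table is an assumption covering just the keys seen in these two files):

```python
import re

# Hypothetical synonym map; extend per input file
SYNONYMS = {"transcript_ids": "transcript_id", "trlist": "transcript_id",
            "gene_ids": "gene_id"}

def normalize_attributes(attr_column):
    """Parse a GTF attribute column into a dict of lists, renaming
    transcript_id-like keys and dropping trailing empty list items."""
    attrs = re.findall(r'(\S+) "([^"]*)";', attr_column)
    out = {}
    for key, value in attrs:
        key = SYNONYMS.get(key, key)
        out[key] = [v for v in value.split(",") if v]
    return out

attrs = normalize_attributes('gene_ids "ENSG1,"; transcript_ids "ENST1,ENST2,";')
print(attrs)  # {'gene_id': ['ENSG1'], 'transcript_id': ['ENST1', 'ENST2']}
```

Note that an exon with several transcript_ids would then need to be emitted once per parent transcript to build the usual parent-child relations.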
In my field of research, it is very common to remove and/or insert custom DNA sequences in microbial genomes. I would like to make the same changes in silico in the FASTA file of the reference genome, as well as in the accompanying GFF annotation file.
To make this easy, I would love the following features:
Does this fall in the scope of gffutils, or are my requests too esoteric? I guess the simplest feature request would be the ability to easily select and/or shift annotations that start beyond a certain coordinate. (I realize that I can already make such changes by iterating over each feature, but it would still be nice to have such a function)
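For the simple case — shifting everything downstream of an edit point — a first cut could look like this (a sketch, not a gffutils function; it deliberately ignores features that span the edit point, which need a policy decision of their own):

```python
def shift_feature(line, edit_pos, offset):
    """Shift a tab-delimited GFF line's start/end by `offset` if the
    feature starts at or after `edit_pos` (1-based coordinates)."""
    fields = line.rstrip("\n").split("\t")
    start, end = int(fields[3]), int(fields[4])
    if start >= edit_pos:
        fields[3], fields[4] = str(start + offset), str(end + offset)
    return "\t".join(fields)

line = "chr1\tsrc\tgene\t1000\t2000\t.\t+\t.\tID=g1"
print(shift_feature(line, 500, 300))  # start/end become 1300/2300
```

A negative offset would model a deletion; the matching change to the FASTA sequence would be a separate splice operation on the sequence string.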
I wanted to integrate gffutils database with my existing database (MySQL) for a different purpose. I just need to include the tables to my database and use gffutils attributes in my python codes. Is it possible for you to add MySQL support for this?
Thanks.
Use-case in the later comments of #20: can we have CDSs be children of exons?
More generally, can we define additional relationships that are not defined in the original GFF/GTF?
There are two different ways of "adding a relation", depending on what you want to do with it later:
1) Add Parent="gene1" to the attributes for a feature. This will only affect the output of GFF lines, and will not allow child/parent queries via the gffutils API. So it's more like a cosmetic change.
2) Add the relation to the database itself, so that child/parent queries work via the gffutils API.
For 1), this should just be left to the user to manipulate the attribute dictionaries however they want to.
For 2), this should be implemented as a new method on a FeatureDB -- say, FeatureDB.add_relation(). It should also [optionally] make the change in the attributes as in 1).
However . . .
There's the issue of database IDs, which are primary keys, vs the IDs that you'd want to include in the final output. The database ID needs to be added to the relations
table, but you might want some other ID to be added to the attributes.
A concrete example, showing just attributes to save space:
ID="exon1"; fancy_id="exon:chr1:1-100"
ID="CDS1"; fancy_id="CDS:chr1:1-100"
Imagine we imported these features into a db using the idspec fancy_id, such that exon:chr1:1-100 and CDS:chr1:1-100 are the primary keys in the db. We want to make CDSs be children of exons. So we add an entry in the relations table, using the hypothetical FeatureDB.add_relation() method:
exon:chr1:1-100 CDS:chr1:1-100 1
But in the output -- and therefore the attributes as stored in the db -- we want to specify the parent in terms of another attribute, like ID. So when we print out the GFF lines, they look like this:
ID="exon1"; fancy_id="exon:chr1:1-100";
ID="CDS1"; fancy_id="CDS:chr1:1-100"; parent_exon="exon1";
This implies a method signature like:
FeatureDB.add_relation(child, parent, level, parent_func, child_func)
where parent_func is a custom function with signature parent_func(parent, child) that returns the modified parent, and child_func(parent, child) returns the modified child.
In this example, parent_func would be:
lambda parent, child: parent
and child_func would be:
def child_func(parent, child):
    child.attributes['parent_exon'] = parent.id
    return child
Add a mechanism for cleaning a database of features that have an annotated parent when that parent does not exist in the database.
Example at https://gist.github.com/rmzelle/256cbef7d594b518ac46
I'm trying to only update the features in my gffutils database that needed updating, but when I do that I get unexpected results. In particular, some features get deleted while others are duplicated, so it looks like gffutils sometimes misidentifies the features I'm trying to update. Any thoughts? How does gffutils establish which feature is which when using db.update()
?
Hi Ryan and all,
Finally back to hacking on gffutils (over at my fork; will send a pull request when the next batch of features is completed). I notice that in the refactor branch I forked today, the unit test test_attributes_modify fails. I believe it is because the GFF records are not read in the order in which they appear in the GFF file, like they used to be. The test function is below. The output I get when running it is:
$ python gffutils/test/test.py
First child is not an mRNA
exon
Traceback (most recent call last):
File "gffutils/test/test.py", line 1073, in <module>
test_attributes_modify()
File "gffutils/test/test.py", line 1061, in test_attributes_modify
assert str(gene_childs[0].attributes) == 'ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11'
AssertionError
If I understand it correctly, I believe it's reading the exon record before the mRNA record, although in the file the mRNA appears as the first child of the gene:
chr2L FlyBase gene 7529 9484 . + . ID=FBgn0031208;Name=CG11023;Ontology_term=SO:0000010,SO:0000087,GO:0008234,GO:0006508;Dbxref=FlyBase:FBan0011023,FlyBase_Annotation_IDs:CG11023,GB:AE003590,GB_protein:AAO41164,GB:AI944728,GB:AJ564667,GB_protein:CAD92822,GB:BF495604,UniProt/TrEMBL:Q6KEV3,UniProt/TrEMBL:Q86BM6,INTERPRO:IPR003653,BIOGRID:59420,dedb:5870,GenomeRNAi_gene:33155,ihop:59383;derived_computed_cyto=21A5-21A5;gbunit=AE014134
chr2L FlyBase mRNA 7529 9484 . + . ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11
Although the GFF format is unordered, I think it's worth ensuring that the entries are read in the order in which they appear in the file. We at least use this convention heavily to always put the longest mRNA entries first and these types of conventions are very useful.
I believe that the entries returned by the children method are unordered, even though in general gffutils stores entries in file-appearance order. Here's a self-contained example (note that get_db_fname is a function I'm contributing to helpers in my fork):
In [56]: f = gffutils.example_filename("FBgn0031208.gff")
In [57]: db_fname = helpers.get_db_fname(f)
In [58]: db = gffutils.FeatureDB(db_fname)
In [59]: all_recs = list(db.all())
# Gene and mRNA are read first (file appearance order)
In [61]: all_recs[0:2]
Out[61]: [<Feature: gene, chr2L:7529-9484 (+)>, <Feature: mRNA, chr2L:7529-9484 (+)>]
# But the first child is not an mRNA, it is an exon
In [62]: c = list(db.children("FBgn0031208"))
In [63]: c[0]
Out[63]: <Feature: exon, chr2L:7529-8116 (+)>
The test function is:
def test_attributes_modify():
"""
Test that attributes can be modified in a GFF record.
TODO: This test case fails?
"""
# Test that attributes can be modified
gffutils.create_db(gffutils.example_filename('FBgn0031208.gff'), testdbfn_gff,
verbose=False,
force=True)
db = gffutils.FeatureDB(testdbfn_gff)
gene_id = "FBgn0031208"
gene_childs = list(db.children(gene_id))
print "First child is not an mRNA"
print gene_childs[0].featuretype
assert str(gene_childs[0].attributes) == 'ID=FBtr0300689;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11'
gene_childs[0].attributes["ID"] = "Modified"
assert str(gene_childs[0].attributes) == 'ID=Modified;Name=CG11023-RB;Parent=FBgn0031208;Dbxref=FlyBase_Annotation_IDs:CG11023-RB;score_text=Strongly Supported;score=11;ID=Modified'
I think there might be a bug in the retrieval of children records by type (in db.children(..., featuretype=)
) when the record IDs/chromosomes have an underscore _
character in their name.
I made a small unit test for it over at my fork in test.py
:
def test_random_chr():
"""
Test on GFF files with random chromosome events.
"""
gff_fname = gffutils.example_filename("random-chr.gff")
db_fname = helpers.get_db_fname(gff_fname)
db = gffutils.FeatureDB(db_fname)
# Test that we can get children of only a selected type
gene_id = "chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-"
mRNAs = db.children(gene_id, featuretype="mRNA")
for mRNA_entry in mRNAs:
assert (mRNA_entry.featuretype == "mRNA"), \
"Not all entries are of type mRNA! %s" \
%(",".join([entry.featuretype for entry in mRNAs]))
print "Parsed random chromosome successfully."
The GFF file is in test/data/random-chr.gff
:
chr1_random SE gene 97006 165969 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-;Name=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-
chr1_random SE mRNA 97006 165969 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-
chr1_random SE mRNA 97006 165969 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.B;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-
chr1_random SE exon 165882 165969 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.up;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A
chr1_random SE exon 137473 137600 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.se;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A
chr1_random SE exon 97006 97527 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.dn;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.A
chr1_random SE exon 165882 165969 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.up;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.B
chr1_random SE exon 97006 97527 . - . ID=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.dn;Parent=chr1_random:165882:165969:-@chr1_random:137473:137600:-@chr1_random:97006:97527:-.B
When I run my unit test, I get:
$ python gffutils/test/test.py
Creating db for test/data/unsanitized.gff
- Took 1.48 seconds
Sanitized GFF successfully.
Creating db for test/data/random-chr.gff
- Took 1.55 seconds
Traceback (most recent call last):
File "gffutils/test/test.py", line 1092, in <module>
test_random_chr()
File "gffutils/test/test.py", line 1059, in test_random_chr
%(",".join([entry.featuretype for entry in mRNAs]))
AssertionError: Not all entries are of type mRNA! exon,exon
I'm guessing the reason is that the underscore character gets parsed as something else by SQLite, but I don't know enough about the SQL syntax to see how/where these characters should be escaped. The error goes away when I remove the _random
suffix from the names of entries/chromosomes.
Any thoughts on this? Thanks, --Yarden
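One plausible culprit (an assumption, not confirmed against the gffutils source): in SQL LIKE patterns, the underscore is a single-character wildcard, so an ID containing _random can unintentionally match other rows if the query is built with LIKE instead of =. A minimal sqlite3 demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("chr1_random",), ("chr1xrandom",)])

# '_' matches any single character in LIKE, so both rows come back:
like_rows = conn.execute("SELECT id FROM t WHERE id LIKE 'chr1_random'").fetchall()
# '=' compares literally and matches only the intended row:
eq_rows = conn.execute("SELECT id FROM t WHERE id = 'chr1_random'").fetchall()
print(like_rows, eq_rows)
```

If a LIKE is involved, switching it to = (or adding an ESCAPE clause and escaping _ and %) should make the _random suffix behave.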
Hi,
I tried for the first time to fetch GFF entries from a region with db.region()
and I get the error:
<ipython-input-44-f7ff462cf266> in <module>()
----> 1 for x in matched_db.region("chr5:653492-65368119"): print x
/home/yarden/.local/lib/python2.7/site-packages/gffutils-0.8-py2.7.egg/gffutils/interface.pyc in region(self, region, featuretype, completely_within)
368
369 # Get a list of all possible bins for this region
--> 370 _bins = list(bins.bins(start, end, one=False))
371
372 if completely_within:
/home/yarden/.local/lib/python2.7/site-packages/gffutils-0.8-py2.7.egg/gffutils/bins.pyc in bins(start, stop, fmt, one)
68 # Jump to highest resolution bin that will fit these coords (depending on
69 # whether we have a BED or GFF-style coordinate).
---> 70 start = (start - COORD_OFFSETS[fmt]) >> FIRST_SHIFT
71 stop = (stop) >> FIRST_SHIFT
72
This happens to me on the refactor branch with pretty much any GFF (there should be nothing special about this GFF, I believe).
A much more minor point: the region string parsing fails for coordinates that have a strand specification:
<ipython-input-41-c321ddc79db2> in <module>()
----> 1 for x in matched_db.region("chr5:65349234-65368119:-"): print x
/home/yarden/.local/lib/python2.7/site-packages/gffutils-0.8-py2.7.egg/gffutils/interface.pyc in region(self, region, featuretype, completely_within)
357 """
358 if isinstance(region, basestring):
--> 359 seqid, coords = region.split(':')
360 start, end = coords.split('-')
This is because region.split(':') is expected to unpack into exactly two values. Thanks, --Yarden
It would be really great to add an option to extract flanking sequence along with the feature sequence using the Feature.sequence method.
For instance, if I want to extract x bases of flanking sequence, then:
Feature.sequence(fasta, use_strand=True, flanking=x)
would generate the sequence:
x(bases)+feature+x(bases)
and the argument flanking=0 would give just the feature sequence (this could be the default).
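Until such an option exists, the coordinate arithmetic can be done up front before fetching sequence. A sketch (the clamping-to-sequence-bounds behavior is an assumption about how such an option might act near chromosome ends):

```python
def flanked_coords(start, end, flanking, seq_length):
    """1-based inclusive feature coordinates extended by `flanking` bases
    on each side, clamped to the sequence bounds."""
    return max(1, start - flanking), min(seq_length, end + flanking)

print(flanked_coords(100, 200, 50, 1000))  # (50, 250)
print(flanked_coords(10, 200, 50, 220))    # (1, 220) -- clamped at both ends
```

The widened coordinates can then be used to slice the FASTA sequence; for minus-strand features the slice would be reverse-complemented afterwards, as use_strand=True already does for the feature itself.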