Comments (4)
@daler I can confirm that this seems to work for me. Thanks!
from gffutils.
The problem is caused by this line:
NC_000001.11 BestRefSeq gene 14362 29370 . - . gene_id "WASH7P"; transcript_id ""; db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WASP family homolog 7, pseudogene"; gbkey "Gene"; gene "WASH7P"; gene_biotype "transcribed_pseudogene"; gene_synonym "FAM39F"; gene_synonym "WASH5P"; pseudo "true";
Specifically, the description attribute is "WASP family homolog 7, pseudogene".
Some GTF and GFF files store multiple values separated by commas within a single attribute. During the dialect inference step, which figures out which of the MANY different format flavors is used in this particular file, gffutils inspects the first N lines. It turns out that this gene falls within the first N lines, and its description contains a comma, implying that multiple values are joined by a comma and therefore no repeated keys are expected. However, there are lines prior to it that have repeated keys (db_xref is repeated in the first line), indicating that multiple values are NOT joined by a comma.
Hence the error about being internally inconsistent.
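To make the conflict concrete, here is a minimal standalone sketch that looks for both signals in the attribute column of the WASH7P line quoted above. It is not gffutils's actual inference code; the regex and the `attribute_signals` helper are hypothetical names used only for illustration, and the sketch assumes every value is double-quoted and contains no embedded double quotes.

```python
import re
from collections import Counter

# Hypothetical pattern for one GTF attribute of the form: key "value";
# Assumes values are always double-quoted with no embedded double quotes.
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def attribute_signals(attributes):
    """Report the two multi-value signals found in one GTF attribute column."""
    pairs = ATTRIBUTE.findall(attributes)
    key_counts = Counter(key for key, _ in pairs)
    return {
        # Keys that appear more than once: multiple values via repeated keys.
        "repeated_keys": sorted(k for k, n in key_counts.items() if n > 1),
        # Keys whose value contains a comma: looks like comma-joined values.
        "comma_values": sorted(k for k, v in pairs if "," in v),
    }


# Attribute column of the WASH7P gene line from this issue.
wash7p = (
    'gene_id "WASH7P"; transcript_id ""; db_xref "GeneID:653635"; '
    'db_xref "HGNC:HGNC:38034"; '
    'description "WASP family homolog 7, pseudogene"; gbkey "Gene"; '
    'gene "WASH7P"; gene_biotype "transcribed_pseudogene"; '
    'gene_synonym "FAM39F"; gene_synonym "WASH5P"; pseudo "true";'
)

signals = attribute_signals(wash7p)
print(signals)
```

Both lists come back non-empty for this line: db_xref and gene_synonym are repeated, while description contains a comma. Those are the two contradictory hints described above.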
The fix is to restrict the number of lines checked:
import gffutils

db = gffutils.create_db(
    'GRCh38_latest_genomic.gtf.gz',
    'GRCh38_latest_genomic.gtf.sqldb',
    keep_order=True,
    disable_infer_genes=True,
    disable_infer_transcripts=True,
    # add this argument so that only the first line is checked
    checklines=1,
)
from gffutils.
I'm also hitting this issue with this GTF file, and I'm wondering whether there's a downside to setting 'repeated keys': True in the dialect for GTF files without repeated keys. My issue is that I'm using gffutils to extract transcript features from arbitrary GTF/FASTA pairs (https://github.com/Novartis/pisces/blob/master/pisces/index.py#L172) defined using a configuration file (e.g. https://github.com/Novartis/pisces/blob/master/pisces/config.json). I'd like to not require a full GTF dialect definition in the config file, but for this most recent RefSeq release I'm having a hard time figuring out the best path forward.
from gffutils.
@mdshw5, @chenzemin see #208, I think that provides a reasonable general solution (see details in that PR). It's merged into the master branch, but I haven't released it yet -- can you confirm it works for you?
from gffutils.