nextgenusfs / gene2product
Curated list of gene names and product descriptions that pass NCBI genome submission rules.
License: BSD 2-Clause "Simplified" License
Hi,
I just cloned the repo and ran the script. It then complained about something that is already in the list:
update-gene2product.py Gene2Products.new-names-passed.txt
Product too long: SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily E member 1-related
Traceback (most recent call last):
File "./update-gene2product.py", line 38, in <module>
product = raw_input("Product for %s: " % name)
NameError: name 'raw_input' is not defined
Removing that particular line fixes it. Is there a hard product length limit for NCBI?
Cheers
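The `NameError` in the traceback above is what Python 3 raises for `raw_input`, which exists only in Python 2 (Python 3 renamed it to `input`). A minimal sketch of a prompt that works under either interpreter follows. The names `read_line`, `MAX_PRODUCT_LENGTH` and `is_too_long` are hypothetical, and the threshold is a placeholder for illustration, not a confirmed NCBI limit.

```python
# raw_input exists only in Python 2; Python 3 renamed it to input.
try:
    read_line = raw_input  # Python 2
except NameError:
    read_line = input  # Python 3

# Placeholder threshold for illustration only; not a confirmed NCBI limit.
MAX_PRODUCT_LENGTH = 100


def is_too_long(product, limit=MAX_PRODUCT_LENGTH):
    """Return True when the product description is longer than the limit."""
    return len(product) > limit
```

With this in place, the script could call `read_line("Product for %s: " % name)` instead of `raw_input(...)`.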
sst2 human AMSH/STAMBP protein ubiquitin specific-protease
is in the ncbi_cleaned_gene_products list, but it throws a fatal error in the discrepancy report:
FATAL: DiscRep_SUB:SUSPECT_PRODUCT_NAMES::Remove organism from product name
DiscRep_SUB:DISC_PRODUCT_NAME_QUICKFIX::1 features contains 'human'
Should this be removed from the list and get flagged for manual curation instead? (Also, seems like it may be a bad annotation to begin with?) I imagine the same might be true for five other products containing "human" in ncbi_cleaned_gene_products, but this is the only one I have encountered in my own data.
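To find the other entries that might trip the same check, a rough scan of the list could look like the sketch below. It assumes each line of the list is `gene<TAB>product`; the function name `find_organism_terms` and the default term list are illustrative only and do not reproduce NCBI's actual discrepancy rules.

```python
import re


def find_organism_terms(lines, terms=("human",)):
    """Return (gene, product, term) tuples for products containing a flagged term.

    Assumes each line is 'gene<TAB>product'; the term list is illustrative.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, comment lines, and lines without a tab separator.
        if not line or line.startswith("#") or "\t" not in line:
            continue
        gene, product = line.split("\t", 1)
        match = pattern.search(product)
        if match:
            hits.append((gene, product, match.group(1).lower()))
    return hits
```

Any hits could then be reviewed by hand rather than removed automatically.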
The naming of the mitochondrial ribosomal proteins of the small and large subunits is ambiguous when it comes to the product: the list mixes 54S/39S (yeast/human) for the large subunit and 37S/28S for the small subunit.
I think we need to check these manually, because the product list is currently inconsistent.
Here's a snapshot of some of these:
MRPL38 39S ribosomal protein L38, mitochondrial
MRPL39 54S ribosomal protein L39, mitochondrial
MRPL40 39S ribosomal protein L40, mitochondrial
MRPL43 39S ribosomal protein L43, mitochondrial
MRPL44 39S ribosomal protein L44, mitochondrial
MRPL45 39S ribosomal protein L45, mitochondrial
MRPL46 39S ribosomal protein L46, mitochondrial
MRPL49 54S ribosomal protein L49, mitochondrial
MRPL50 54S ribosomal protein L50, mitochondrial
MRPL51 39S ribosomal protein L51, mitochondrial
MRPS5 28S ribosomal protein S5, mitochondrial
MRPS8 37S ribosomal protein S8, mitochondrial
MRPS9 37S ribosomal protein S9, mitochondrial
MRPS12 37S ribosomal protein S12, mitochondrial
MRPS16 37S ribosomal protein S16, mitochondrial
MRPS17 37S ribosomal protein S17 mitochondrial
MRPS18 28S ribosomal protein S18b, mitochondrial
MRPS18C 28S ribosomal protein S18c, mitochondrial
MRPS28 37S ribosomal protein S28, mitochondrial
MRPS35 28S ribosomal protein S35, mitochondrial
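To see how widespread the mix is, the entries could be grouped by subunit family and sedimentation prefix. This is a sketch only: `tally_prefixes` is a hypothetical helper, and it assumes the entries have already been split into (gene, product) pairs.

```python
import re
from collections import defaultdict

# Matches a leading sedimentation coefficient such as "39S" or "54S".
PREFIX_RE = re.compile(r"^(\d+S) ribosomal protein")


def tally_prefixes(entries):
    """Group gene names by subunit family (MRPL/MRPS) and sedimentation prefix.

    `entries` is an iterable of (gene, product) pairs.
    """
    tally = defaultdict(lambda: defaultdict(list))
    for gene, product in entries:
        match = PREFIX_RE.match(product)
        family = gene[:4]
        if match and family in ("MRPL", "MRPS"):
            tally[family][match.group(1)].append(gene)
    # Convert nested defaultdicts to plain dicts for easier printing.
    return {family: dict(prefixes) for family, prefixes in tally.items()}
```

A family that ends up with more than one prefix key (for example both 39S and 54S under MRPL) is a candidate for manual review.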
Hello,
I have an updated annotation (funannotate update) which I subsequently modified as follows:
I used the extracted proteins (extracted with agat from the reduced gff file) to produce interproscan and eggnog-mapper annotations outside of funannotate.
Now I would like to add this annotation to the reduced gff rather than to the one produced by funannotate, but I get the following error:
[Jul 14 02:09 PM]: OS: Ubuntu 18.10, 48 cores, ~ 264 GB RAM. Python: 3.8.12
[Jul 14 02:09 PM]: Running 1.8.11
[Jul 14 02:09 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Jul 14 02:09 PM]: Unable to detect funannotate folder as input, please provide -o,--out directory
Is there a way to run "funannotate annotate" on a custom gff and protein set after some gene models have been removed?
Thanks
Alex
It is a little unclear how you would like changes to be submitted, so here is the report from a recent run. I assume these need the gene/protein name removed:
| #Name | Original Description | Cleaned Description | Error-message |
| --- | --- | --- | --- |
| CTS2 | Inherit from sordNOG: Glycoside hydrolase 18 protein | Cts2p | Product defline failed funannotate checks |
| MUC21 | Inherit from spriNOG: mucin 21, cell surface associated | Muc21p | Product defline failed funannotate checks |
| SLC3A1 | Solute carrier 3 (Cystine, dibasic and neutral amino acid transporters, activator of cystine, dibasic and neutral amino acid transport), member 1 | Slc3a1p | Product defline failed funannotate checks |
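Two of the original descriptions in the report above start with an "Inherit from ...NOG:" tag. As a first curation step, that tag could be stripped before the description is reviewed. The sketch below is an assumption about one possible cleanup, not funannotate's own logic; `strip_nog_prefix` is a hypothetical helper.

```python
import re

# Leading tag such as "Inherit from sordNOG:" or "Inherit from euNOG:".
NOG_PREFIX_RE = re.compile(r"^Inherit from \w+NOG:\s*", re.IGNORECASE)


def strip_nog_prefix(description):
    """Remove a leading 'Inherit from <x>NOG:' tag and leave the rest untouched."""
    return NOG_PREFIX_RE.sub("", description).strip()
```

The remaining text still needs a manual check, since it may lack the word "protein" or be otherwise incomplete.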
This did not get flagged in tbl2asn but was kicked by NCBI upon submission.
hypothetical protein:
4 /product="Conserved protein (DUF2356)"
6 /product="DUF1900 super"
4 /product="Family with sequence similarity 49, member"
4 /product="Inherit from dotNOG: Snf7"
2 /product="Inherit from euNOG: HEAT repeat containing 2"
2 /product="protein of unknown function DUF3722"
4 /product="with sequence similarity 63, member B"
6 /product="with sequence similarity 72, member"
Suggests that these should be fixed:
INTS3 Conserved protein (DUF2356)
CRN1 DUF1900 super family
FAM49A Family with sequence similarity 49, member
DID2 Inherit from dotNOG: Snf7
HEATR2 Inherit from euNOG: HEAT repeat containing 2
MDM10 protein of unknown function DUF3722
FAM63B with sequence similarity 63, member B
FAM72A with sequence similarity 72, member
Missing the word 'protein':
2 /product="Mitochondrial Translation Optimization"
MTO1 Mitochondrial Translation Optimization
CCDC97 Coiled-coil domain containing 97
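Entries like these could be pre-screened before a list is submitted. The patterns below are drawn only from the examples in this report and are an illustrative assumption, not NCBI's or tbl2asn's actual rule set; `SUSPECT_PATTERNS` and `flag_product` are hypothetical names.

```python
import re

# Illustrative patterns taken from the examples above; not NCBI's actual rules.
SUSPECT_PATTERNS = [
    ("domain of unknown function", re.compile(r"\bDUF\d+\b")),
    ("NOG inheritance tag", re.compile(r"^Inherit from \w+NOG:", re.IGNORECASE)),
    ("truncated start", re.compile(r"^with\b", re.IGNORECASE)),
]


def flag_product(product):
    """Return the labels of every suspect pattern found in the product name."""
    return [label for label, pattern in SUSPECT_PATTERNS if pattern.search(product)]
```

An empty list means none of these particular patterns matched; it does not mean the product name would pass submission.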
Here's the mostly full list of what
incomplete product name:
2 /product="ATPase, class V, type"
16 /product="chitin synthase, class"
CHS3 chitin synthase, class
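Both incomplete names above end in a dangling qualifier ("class", "type") with nothing after it. A rough check for that shape, assuming the qualifier list below (which is a guess based on these examples and on the "member" cases elsewhere in this thread), might be:

```python
import re

# A comma followed by a bare qualifier at the very end of the name.
DANGLING_RE = re.compile(r",\s*(class|type|member)\s*$", re.IGNORECASE)


def is_incomplete(product):
    """Return True when the product name ends in a qualifier with no value."""
    return bool(DANGLING_RE.search(product))
```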
Missing the word 'protein':
2 /product="Ataxin 2-like"
2 /product="Cation efflux"
4 /product="Chitinase domain containing 1"
2 /product="Coiled-coil domain containing 82"
2 /product="Coiled-coil domain containing 97"
2 /product="Coiled-coil domain containing"
2 /product="DDHD domain containing"
4 /product="DENN MADD domain containing"
2 /product="DNA repair"
2 /product="Dcp1-like decapping"
4 /product="ELMO CED-12 domain containing 2"
2 /product="Formin-like"
2 /product="Fungalysin/Thermolysin Propeptide Motif"
2 /product="G patch domain containing 2"
2 /product="G patch domain containing 8"
2 /product="G patch domain-containing protein 4"
2 /product="GRIP and coiled-coil domain containing 2"
4 /product="HEAT repeat containing 6"
4 /product="Isochorismatase domain containing 1"
2 /product="Kelch domain containing"
4 /product="MAD2 mitotic arrest deficient-like 2"
2 /product="MRNA capping" (note MRNA as well)
2 /product="Mitochondrial Distribution and Morphology"
2 /product="Mitochondrial Translation Optimization"
4 /product="Mitochondrial distribution and morphology
8 /product="Myb-like DNA-binding domain"
6 /product="Myb-like, SWIRM and MPN domains 1"
2 /product="Nuclear Envelope Morphology"
2 /product="PCI domain containing 2"
6 /product="PQ loop repeat"
2 /product="Plexin repeat"
2 /product="RNA binding"
4 /product="RWD domain containing 1"
4 /product="Run and fyve domain containing"
2 /product="RuvB-like 1"
2 /product="RuvB-like 2"
4 /product="SAM domain and HD"
4 /product="SET and MYND domain containing 3"
8 /product="SET domain containing 4"
2 /product="SRP40, C-terminal domain"
4 /product="Sad1 and UNC84 domain containing"
4 /product="THUMP domain containing 1"
4 /product="Thioredoxin-like 4A"
4 /product="WD domain, G-beta repeat"
2 /product="WD repeat and FYVE domain containing 3"
4 /product="WD repeat domain 12"
4 /product="WD repeat domain 33"
4 /product="WD repeat domain 4"
8 /product="WD repeat domain 43"
6 /product="WD repeat domain 81"
2 /product="WD repeat domain 87"
4 /product="Zinc finger, CCHC domain containing"
4 /product="Zinc finger, DHHC-type containing"
8 /product="adenyl nucleotide binding"
2 /product="cell division cycle 34"
2 /product="chromosome X 26"
2 /product="cingulin-like 1"
2 /product="coiled-coil and C2 domain containing"
2 /product="coiled-coil domain containing 64"
4 /product="complexed with cef1p"
2 /product="erv26 super"
10 /product="grainyhead-like"
2 /product="guanylyl cyclase domain containing 1"
2 /product="heme binding"
8 /product="intercellular trafficking and secretion"
2 /product="intracellular distribution of mitochondria"
2 /product="jumonji domain containing 6"
4 /product="mRNA turnover 4"
2 /product="membrane metallo-endopeptidase-like 1"
4 /product="metallophosphoesterase domain containing 1"
2 /product="methyltransferase like 5"
8 /product="mitochondrial"
2 /product="nuclear control of ATPase"
6 /product="oxidation resistance"
2 /product="p53 and DNA-damage regulated 1"
2 /product="peptidyl-tRNA hydrolase domain containing 1"
6 /product="piwi-like"
6 /product="ring finger and SPRY domain containing 1"
2 /product="sister chromatid cohesion"
2 /product="spermatoproteinsis associated 5"
4 /product="tetratricopeptide repeat domain 39A"
2 /product="tetratricopeptide repeat domain"
2 /product="tetratricopeptide repeat"
4 /product="transmembrane and coiled-coil domains 7"
2 /product="von Willebrand factor A domain-containing
4 /product="wD repeat domain"
2 /product="zinc finger, DHHC domain containing 11"
8 /product="zinc finger, DHHC-type containing 20"
hypothetical protein:
2 /product="Conserved protein (DUF2356)"
2 /product="DUF1900 super"
2 /product="Family with sequence similarity 49, member"
2 /product="Inherit from dotNOG: Snf7"
4 /product="Inherit from euNOG: HEAT repeat containing 2"
4 /product="protein of unknown function DUF3722"
2 /product="with sequence similarity 63, member A"
6 /product="with sequence similarity 72, member"
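For anyone wanting to compare reports like these across runs, the lines could be collapsed into per-product totals. The sketch below assumes every entry has the form `<count> /product="<name>"` and tolerates a missing closing quote, as seen in a few lines above; `parse_product_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Leading count, then /product="...", with the closing quote optional.
LINE_RE = re.compile(r'^\s*(\d+)\s+/product="([^"]*)"?')


def parse_product_counts(lines):
    """Sum the leading counts per product name, skipping non-matching lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2).strip()] += int(match.group(1))
    return counts
```

Category headers such as "hypothetical protein:" do not match the pattern and are ignored, so the totals are not split by category.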
Hi Jon,
I'd like to contribute to this community project. I was testing the waters with the Gene2Products.new-names-passed.txt files from one of my genomes.
I have already run update-gene2product.py. I have forked the repository, and now...how do I proceed? Should I just open a new pull request and add the new ncbi_cleaned_gene_products.txt file? I am very new at the inner guts of GitHub.
Thanks!
Hi Jon,
Per usual, thank you for all your work on this software. Truly amazing!
I recently annotated an invertebrate using v1.5.0 and thought I'd submit a few new gene product names. Sorry in advance, as I am not really a python person:
I forked the repository and tried to update, "python update-gene2product.py Gene2Products.new-names-passed.txt" and hit this error:
'''
Traceback (most recent call last):
File "update-gene2product.py", line 22, in <module>
for line in db:
io.UnsupportedOperation: not readable
'''
I worked around this (and the next error) by adding "a+" to lines 22 and 43, then I ran into this error and I don't know what it wants...
'''
Traceback (most recent call last):
File "update-gene2product.py", line 65, in <module>
os.rename(os.path.join(currentdir, 'ncbi_cleaned_gene_products.txt'), os.path.join(currentdir, 'ncbi_cleaned_gene_products.v'+version+'.txt'))
TypeError: can only concatenate str (not "NoneType") to str
'''
I've attached my list of gene names in hopes that, if nothing else, someone can update the list for me.
Gene2Products.new-names-passed.txt
Cheers!
-bjp
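One possible reading of the two tracebacks above: "not readable" means the file handle was opened in a mode without read access, and in Python 3 a file opened with "a+" starts at the end, so a plain read loop sees no lines. If the script picks up the version string inside that loop (an assumption, not confirmed from the script), `version` would stay `None` and the later string concatenation would fail. The sketch below demonstrates the mode behaviour on a throwaway file; the file contents, `read_lines` and `archive_name` are hypothetical.

```python
import os
import tempfile


def read_lines(path, mode):
    """Open `path` with the given mode and return whatever a plain read yields."""
    with open(path, mode) as handle:
        return handle.readlines()


def archive_name(version):
    """Build the versioned file name, failing early when no version was parsed."""
    if version is None:
        raise ValueError("no version string found; was the file read from the start?")
    return "ncbi_cleaned_gene_products.v" + version + ".txt"


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")
    # Hypothetical contents; the real file layout may differ.
    with open(path, "w") as handle:
        handle.write("#version 1.0\nGENE1\tsome protein\n")
    lines_read_mode = read_lines(path, "r")     # reads from the start
    lines_append_mode = read_lines(path, "a+")  # position starts at the end
```

Opening the database with mode "r" for the read loop (or calling `seek(0)` after opening with "a+") avoids the empty read.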