gene2product's Introduction

gene2product

Curated list of gene names and product descriptions that pass NCBI genome submission rules. Used by funannotate during eukaryotic genome annotation.

I hope this becomes a "community" project where we can keep a list of gene names and their product definitions that pass NCBI requirements. The funannotate annotate script generates a list of gene names/product deflines that pass tbl2asn but are not yet in the database. To contribute, open a pull request (PR) against this repository:

  1. Fork this repository
  2. Run update-gene2product.py Gene2Products.new-names-passed.txt
  3. Open a pull request with your update

Important:

While a product definition may pass tbl2asn, that doesn't mean the description is a good one. Please at least glance over the names/product deflines in the Gene2Products.new-names-passed.txt file before opening a PR; a few quick manual tweaks will likely improve the product deflines.

Extra Credit:

funannotate annotate also produces a file named Gene2Products.need-curating.txt, which contains names/deflines that need manual curation. If you manually curate these data, please validate that they pass tbl2asn before adding them to your PR.

Thanks!

gene2product's People

Contributors

aberaslop, atiweb, azneto, dwinter, hyphaltip, jayvolr, nextgenusfs, olekto, plantdr430, rowena-h

gene2product's Issues

sst2 human AMSH/STAMBP protein ubiquitin specific-protease

sst2 human AMSH/STAMBP protein ubiquitin specific-protease is in the ncbi_cleaned_gene_products list, but it throws a fatal error in the discrepancy report:

FATAL: DiscRep_SUB:SUSPECT_PRODUCT_NAMES::Remove organism from product name
DiscRep_SUB:DISC_PRODUCT_NAME_QUICKFIX::1 features contains 'human'

Should this be removed from the list and get flagged for manual curation instead? (Also, seems like it may be a bad annotation to begin with?) I imagine the same might be true for five other products containing "human" in ncbi_cleaned_gene_products, but this is the only one I have encountered in my own data.
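
A quick way to find the other candidates might be to scan the list for organism words. This is only a rough sketch: it assumes the list is tab-separated (gene name, then product), the word list is a made-up starting point rather than NCBI's actual rule set, and the sample entry assumes "sst2" is the gene name.

```python
import re

# Hypothetical word list; NCBI's discrepancy report checks many more organisms.
ORGANISM_WORDS = ("human", "mouse", "yeast")

def find_organism_terms(lines, words=ORGANISM_WORDS):
    """Return (gene, product, word) for products that contain an organism word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE
    )
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blanks, comments, and anything that is not gene<TAB>product.
        if not line or line.startswith("#") or "\t" not in line:
            continue
        gene, product = line.split("\t", 1)
        match = pattern.search(product)
        if match:
            hits.append((gene, product, match.group(1).lower()))
    return hits

# Sample entries; the tab position in the first one is an assumption.
sample = [
    "SST2\thuman AMSH/STAMBP protein ubiquitin specific-protease",
    "MRPS9\t37S ribosomal protein S9, mitochondrial",
]
hits = find_organism_terms(sample)
for gene, product, word in hits:
    print(f"{gene}: contains '{word}': {product}")
```

Reading the real file with `open(...)` and passing the file handle as `lines` should work the same way.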

kicked back products

These did not get flagged by tbl2asn but were kicked back by NCBI upon submission.

hypothetical protein:
4 /product="Conserved protein (DUF2356)"
6 /product="DUF1900 super"
4 /product="Family with sequence similarity 49, member"
4 /product="Inherit from dotNOG: Snf7"
2 /product="Inherit from euNOG: HEAT repeat containing 2"
2 /product="protein of unknown function DUF3722"
4 /product="with sequence similarity 63, member B"
6 /product="with sequence similarity 72, member"

Suggests that these should be fixed:
INTS3 Conserved protein (DUF2356)
CRN1 DUF1900 super family
FAM49A Family with sequence similarity 49, member
DID2 Inherit from dotNOG: Snf7
HEATR2 Inherit from euNOG: HEAT repeat containing 2
MDM10 protein of unknown function DUF3722
FAM63B with sequence similarity 63, member B
FAM72A with sequence similarity 72, member

Missing the word 'protein':
2 /product="Mitochondrial Translation Optimization"
MTO1 Mitochondrial Translation Optimization

CCDC97 Coiled-coil domain containing 97

Here's the mostly full list of what

incomplete product name:
2 /product="ATPase, class V, type"

16 /product="chitin synthase, class"

CHS3 chitin synthase, class

Missing the word 'protein':
2 /product="Ataxin 2-like"
2 /product="Cation efflux"
4 /product="Chitinase domain containing 1"
2 /product="Coiled-coil domain containing 82"
2 /product="Coiled-coil domain containing 97"
2 /product="Coiled-coil domain containing"
2 /product="DDHD domain containing"
4 /product="DENN MADD domain containing"
2 /product="DNA repair"
2 /product="Dcp1-like decapping"
4 /product="ELMO CED-12 domain containing 2"
2 /product="Formin-like"
2 /product="Fungalysin/Thermolysin Propeptide Motif"
2 /product="G patch domain containing 2"
2 /product="G patch domain containing 8"
2 /product="G patch domain-containing protein 4"
2 /product="GRIP and coiled-coil domain containing 2"
4 /product="HEAT repeat containing 6"
4 /product="Isochorismatase domain containing 1"
2 /product="Kelch domain containing"
4 /product="MAD2 mitotic arrest deficient-like 2"
2 /product="MRNA capping" (note MRNA as well)
2 /product="Mitochondrial Distribution and Morphology"
2 /product="Mitochondrial Translation Optimization"
4 /product="Mitochondrial distribution and morphology
8 /product="Myb-like DNA-binding domain"
6 /product="Myb-like, SWIRM and MPN domains 1"
2 /product="Nuclear Envelope Morphology"
2 /product="PCI domain containing 2"
6 /product="PQ loop repeat"
2 /product="Plexin repeat"
2 /product="RNA binding"
4 /product="RWD domain containing 1"
4 /product="Run and fyve domain containing"
2 /product="RuvB-like 1"
2 /product="RuvB-like 2"
4 /product="SAM domain and HD"
4 /product="SET and MYND domain containing 3"
8 /product="SET domain containing 4"
2 /product="SRP40, C-terminal domain"
4 /product="Sad1 and UNC84 domain containing"
4 /product="THUMP domain containing 1"
4 /product="Thioredoxin-like 4A"
4 /product="WD domain, G-beta repeat"
2 /product="WD repeat and FYVE domain containing 3"
4 /product="WD repeat domain 12"
4 /product="WD repeat domain 33"
4 /product="WD repeat domain 4"
8 /product="WD repeat domain 43"
6 /product="WD repeat domain 81"
2 /product="WD repeat domain 87"
4 /product="Zinc finger, CCHC domain containing"
4 /product="Zinc finger, DHHC-type containing"
8 /product="adenyl nucleotide binding"
2 /product="cell division cycle 34"
2 /product="chromosome X 26"
2 /product="cingulin-like 1"
2 /product="coiled-coil and C2 domain containing"
2 /product="coiled-coil domain containing 64"
4 /product="complexed with cef1p"
2 /product="erv26 super"
10 /product="grainyhead-like"
2 /product="guanylyl cyclase domain containing 1"
2 /product="heme binding"
8 /product="intercellular trafficking and secretion"
2 /product="intracellular distribution of mitochondria"
2 /product="jumonji domain containing 6"
4 /product="mRNA turnover 4"
2 /product="membrane metallo-endopeptidase-like 1"
4 /product="metallophosphoesterase domain containing 1"
2 /product="methyltransferase like 5"
8 /product="mitochondrial"
2 /product="nuclear control of ATPase"
6 /product="oxidation resistance"
2 /product="p53 and DNA-damage regulated 1"
2 /product="peptidyl-tRNA hydrolase domain containing 1"
6 /product="piwi-like"
6 /product="ring finger and SPRY domain containing 1"
2 /product="sister chromatid cohesion"
2 /product="spermatoproteinsis associated 5"
4 /product="tetratricopeptide repeat domain 39A"
2 /product="tetratricopeptide repeat domain"
2 /product="tetratricopeptide repeat"
4 /product="transmembrane and coiled-coil domains 7"
2 /product="von Willebrand factor A domain-containing
4 /product="wD repeat domain"
2 /product="zinc finger, DHHC domain containing 11"
8 /product="zinc finger, DHHC-type containing 20"

hypothetical protein:
2 /product="Conserved protein (DUF2356)"
2 /product="DUF1900 super"
2 /product="Family with sequence similarity 49, member"
2 /product="Inherit from dotNOG: Snf7"
4 /product="Inherit from euNOG: HEAT repeat containing 2"
4 /product="protein of unknown function DUF3722"
2 /product="with sequence similarity 63, member A"
6 /product="with sequence similarity 72, member"
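
The categories above could be turned into a rough pre-submission filter. The sketch below is an assumption-laden heuristic built only from the patterns reported in this issue (orthology prefixes, DUF identifiers, names cut off after "member"/"class"/"type"/"super", and "domain containing" without the word "protein"); it is not what tbl2asn or NCBI actually run.

```python
import re

def flag_product(product):
    """Return heuristic warnings for one product name (empty list if none)."""
    warnings = []
    # e.g. "Inherit from dotNOG: Snf7"
    if re.match(r"Inherit from \w+:", product):
        warnings.append("orthology prefix")
    # e.g. "Conserved protein (DUF2356)"
    if re.search(r"\bDUF\d+\b", product):
        warnings.append("DUF identifier")
    # e.g. "chitin synthase, class" or "DUF1900 super"
    if re.search(r",\s*(member|class|type)$", product) or product.endswith(" super"):
        warnings.append("incomplete name")
    # e.g. "Coiled-coil domain containing 97"
    if (re.search(r"(domain|repeat)[ -]containing\b", product)
            and "protein" not in product.lower()):
        warnings.append("missing 'protein'")
    return warnings

for name in ("chitin synthase, class", "DUF1900 super",
             "Coiled-coil domain containing 97",
             "G patch domain-containing protein 4"):
    print(name, "->", flag_product(name))
```

Entries such as "DNA repair" or "heme binding" would slip through this filter, so it can only narrow the manual review, not replace it.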

Adding functional annotation to reduced gff3 file

Hello,

I have an updated annotation (funannotate update) which I subsequently modified as follows:

  1. modified scaffold names
  2. Removed some cds models with strong hits to a TE library

I used the extracted proteins (extracted with agat from the reduced gff file) to produce interproscan and eggnog-mapper annotations outside of funannotate.
Now I would like to add these annotations to the reduced gff rather than to the one produced by funannotate. I get the following error:


[Jul 14 02:09 PM]: OS: Ubuntu 18.10, 48 cores, ~ 264 GB RAM. Python: 3.8.12
[Jul 14 02:09 PM]: Running 1.8.11
[Jul 14 02:09 PM]: No NCBI SBT file given, will use default, however if you plan to submit to NCBI, create one and pass it here '--sbt'
[Jul 14 02:09 PM]: Unable to detect funannotate folder as input, please provide -o,--out directory

Is there a way to run "funannotate annotate" on a custom gff and protein set after some gene models have been removed?

Thanks
Alex

update-gene2product.py not working

Hi Jon,

Per usual, thank you for all your work on this software. Truly amazing!
I recently annotated an invertebrate using v1.5.0 and thought I'd submit a few new gene product names. Sorry in advance, as I am not really a Python person:

I forked the repository and tried to update, "python update-gene2product.py Gene2Products.new-names-passed.txt" and hit this error:
'''
Traceback (most recent call last):
  File "update-gene2product.py", line 22, in <module>
    for line in db:
io.UnsupportedOperation: not readable
'''

I worked around this (and the next error) by adding "a+" to lines 22 and 43, then I ran into this error and I don't know what it wants...

'''
Traceback (most recent call last):
  File "update-gene2product.py", line 65, in <module>
    os.rename(os.path.join(currentdir, 'ncbi_cleaned_gene_products.txt'), os.path.join(currentdir, 'ncbi_cleaned_gene_products.v'+version+'.txt'))
TypeError: can only concatenate str (not "NoneType") to str
'''

I've attached my list of gene names in hopes that, if nothing else, someone can update the list for me.

Gene2Products.new-names-passed.txt

Cheers!
-bjp
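
One possible reading of the second traceback, not confirmed against the script itself: a file opened with "a+" starts positioned at its end, so a read loop sees no lines, and any value the script expected to parse from the file (such as a version) could stay None. The sketch below only demonstrates that Python behaviour; the file name and the "#version" header line are made up for illustration.

```python
import os
import tempfile

def read_with_append_mode(path):
    """Read a file opened with 'a+' before and after rewinding it."""
    with open(path, "a+") as handle:
        before_seek = handle.readlines()  # position starts at end of file
        handle.seek(0)                    # rewind to the beginning
        after_seek = handle.readlines()
    return before_seek, after_seek

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo_products.txt")  # hypothetical file
    with open(demo_path, "w") as handle:
        handle.write("#version 0.0\nACT1\tactin\n")  # made-up contents
    before, after = read_with_append_mode(demo_path)

# Number of lines read before and after the rewind.
print(len(before), len(after))
```

If this is what happened, opening the file with "r" for the read loop (and only later with "w" for writing) may be a safer workaround than "a+".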

Product too long: SWI/SNF-related matrix-associated....

Hi,
I just cloned the repo and ran the script. It then complained about something that is already in the list:

update-gene2product.py Gene2Products.new-names-passed.txt

Product too long: SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily E member 1-related
Traceback (most recent call last):
  File "./update-gene2product.py", line 38, in <module>
    product = raw_input("Product for %s: " % name)
NameError: name 'raw_input' is not defined

Removing that particular line fixes it. Is there a hard product length limit for NCBI?
Cheers
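
On the NameError: `raw_input` exists only in Python 2, and Python 3 renamed it to `input`, so the traceback suggests the script was run under Python 3. A small compatibility shim along these lines might work in both; `read_line` and `ask_product` are names invented for this sketch, not taken from the script.

```python
try:
    read_line = raw_input  # Python 2 name
except NameError:
    read_line = input      # Python 3 replacement

def ask_product(name, reader=read_line):
    """Prompt for a replacement product name and return it without padding."""
    return reader("Product for %s: " % name).strip()

# Non-interactive demonstration using a stand-in reader.
answer = ask_product("SMARCE1", reader=lambda prompt: "  example product  ")
print(answer)
```

Passing the reader as a parameter keeps the prompt testable without a terminal.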

mitochondrial ribosomal SSU/LSU nomenclature

The naming of mitochondrial ribosomal proteins of the small and large subunits is ambiguous when it comes to the product: it is a mix of 54S/39S for the large subunit, depending on whether you look at yeast or human, and 37S/28S for the small subunit.
I think we need to manually make sure these are correct, as the naming is currently inconsistent throughout the product list.

Here's a snapshot of some of these:

MRPL38	39S ribosomal protein L38, mitochondrial
MRPL39	54S ribosomal protein L39, mitochondrial
MRPL40	39S ribosomal protein L40, mitochondrial
MRPL43	39S ribosomal protein L43, mitochondrial
MRPL44	39S ribosomal protein L44, mitochondrial
MRPL45	39S ribosomal protein L45, mitochondrial
MRPL46	39S ribosomal protein L46, mitochondrial
MRPL49	54S ribosomal protein L49, mitochondrial
MRPL50	54S ribosomal protein L50, mitochondrial
MRPL51	39S ribosomal protein L51, mitochondrial
MRPS5	28S ribosomal protein S5, mitochondrial
MRPS8	37S ribosomal protein S8, mitochondrial
MRPS9	37S ribosomal protein S9, mitochondrial
MRPS12	37S ribosomal protein S12, mitochondrial
MRPS16	37S ribosomal protein S16, mitochondrial
MRPS17	37S ribosomal protein S17 mitochondrial
MRPS18	28S ribosomal protein S18b, mitochondrial
MRPS18C	28S ribosomal protein S18c, mitochondrial
MRPS28	37S ribosomal protein S28, mitochondrial
MRPS35	28S ribosomal protein S35, mitochondrial
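
To see how widespread the mix is, the entries could be grouped by subunit and sedimentation prefix. A minimal sketch, assuming gene/product pairs like the ones in the snapshot above (only a few are copied here):

```python
import re
from collections import defaultdict

def group_by_prefix(entries):
    """Map subunit ('large'/'small') -> sedimentation prefix -> gene names."""
    groups = defaultdict(lambda: defaultdict(list))
    for gene, product in entries:
        match = re.match(r"(\d+S) ribosomal protein ([LS])\d", product)
        if match:
            subunit = "large" if match.group(2) == "L" else "small"
            groups[subunit][match.group(1)].append(gene)
    return groups

# A few entries from the snapshot above.
snapshot = [
    ("MRPL38", "39S ribosomal protein L38, mitochondrial"),
    ("MRPL39", "54S ribosomal protein L39, mitochondrial"),
    ("MRPS5", "28S ribosomal protein S5, mitochondrial"),
    ("MRPS8", "37S ribosomal protein S8, mitochondrial"),
]
groups = group_by_prefix(snapshot)
for subunit in sorted(groups):
    for prefix in sorted(groups[subunit]):
        print(subunit, prefix, ", ".join(groups[subunit][prefix]))
```

Any subunit that ends up with more than one prefix is a candidate for manual review.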

Adding terms

It's a little unclear how you would like changes added. Here is the report from a recent run; I assume these need the gene/protein name removed:

#Name	Original Description	Cleaned Description	Error-message
CTS2	Inherit from sordNOG: Glycoside hydrolase  18 protein	Cts2p	Product defline failed funannotate checks
MUC21	Inherit from spriNOG: mucin 21, cell surface associated	Muc21p	Product defline failed funannotate checks
SLC3A1	Solute carrier  3 (Cystine, dibasic and neutral amino acid transporters, activator of cystine, dibasic and neutral amino acid transport), member 1	Slc3a1p	Product defline failed funannotate checks
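
As a starting point for curating a report like this, the "Inherit from ...NOG:" prefix could be stripped from the original description before editing by hand. This is a hedged sketch: it assumes the report is tab-separated with the four columns in the header above, and `suggest_defline` is a hypothetical helper whose output still needs manual review and a tbl2asn check.

```python
import re

HEADER = ["Name", "Original Description", "Cleaned Description", "Error-message"]

def suggest_defline(original):
    """Drop an 'Inherit from <x>NOG:' prefix and collapse repeated spaces."""
    text = re.sub(r"^Inherit from \w+:\s*", "", original)
    return re.sub(r"\s{2,}", " ", text).strip()

def parse_report(lines):
    """Turn tab-separated report lines into dictionaries keyed by HEADER."""
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header comment and blank lines
        rows.append(dict(zip(HEADER, line.rstrip("\n").split("\t"))))
    return rows

report = [
    "#Name\tOriginal Description\tCleaned Description\tError-message",
    "CTS2\tInherit from sordNOG: Glycoside hydrolase  18 protein\tCts2p\t"
    "Product defline failed funannotate checks",
]
rows = parse_report(report)
for row in rows:
    print(row["Name"], "->", suggest_defline(row["Original Description"]))
```

The suggestion for CTS2 still reads oddly because a word appears to be missing from the original description, which is the kind of case that needs a human to fix.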

Adding entries

Hi Jon,

I'd like to contribute to this community project. I was testing the waters with the Gene2Products.new-names-passed.txt files from one of my genomes.
I have already run update-gene2product.py. I have forked the repository, and now...how do I proceed? Should I just open a new pull request and add the new ncbi_cleaned_gene_products.txt file? I am very new at the inner guts of GitHub.

Thanks!
