aureme / emapper2gbk

This project forked from arnaudbelcour/gff_to_gbk

Convert GFF, fastas, annotation table and species name into Genbank.

License: GNU Lesser General Public License v3.0

Python 100.00%

emapper2gbk's Introduction

AuReMe

License

This project is licensed under the GNU GPL-3.0-or-later, see the LICENSE file for details.

Documentation

AuReMe documentation

emapper2gbk's Issues

Full taxonomy doesn't work

eggnog2gbk version:
Python version:
Operating System:

Description

In directory mode, when we give a file with taxonomy values that contain the full taxonomy rather than only the species, the program only uses the first value.
I'm going to create a branch and propose a modification.

What I Did

Example :
    Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia marmotae
The taxon found is the one for "Bacteria".
If we reverse the lineage, it finds the taxon for "Escherichia marmotae".
If the first element is not found (because it does not exist), the program does not try to find the parent.
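
The behaviour asked for here (try the most specific rank first, then fall back to its parents) could be sketched as follows. This is only an illustration, not the emapper2gbk code: resolve_lineage and the KNOWN_TAXA table are hypothetical names, and the table stands in for whatever taxonomy query the tool really performs.

```python
from typing import Callable, Optional, Tuple


def resolve_lineage(
    lineage: str, lookup: Callable[[str], Optional[int]]
) -> Optional[Tuple[str, int]]:
    """Return (name, taxon_id) for the most specific rank that `lookup` knows."""
    ranks = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    # Walk from the most specific rank (last) towards the least specific (first).
    for name in reversed(ranks):
        taxon_id = lookup(name)
        if taxon_id is not None:
            return name, taxon_id
    return None


# Hypothetical lookup table standing in for a real taxonomy query.
# "Escherichia marmotae" is left out on purpose to show the fallback.
KNOWN_TAXA = {"Bacteria": 2, "Escherichia": 561}

lineage = (
    "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;"
    "Enterobacteriaceae;Escherichia;Escherichia marmotae"
)
print(resolve_lineage(lineage, KNOWN_TAXA.get))
```

With this toy table the species is unknown, so the sketch falls back to the genus instead of stopping at the first failure or returning "Bacteria".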

Wrong GenBank output file

eggnog2gbk version: 0.1.0
Python version: 3.7.4
Operating System: CentOS-7

Description

Hello,

I ran an annotation with eggnog-mapper on my new PacBio assembly. The annotation and the output files look OK. What I want to do is convert my GFF annotation file into a GenBank file. That's why I am using emapper2gbk.

I have a single-chromosome assembly, >CP019962.1_RagTag (I only show the beginning of each file). My gene predictions from Prodigal:

>CP019962.1_RagTag_1  2  115  -1  ID=1_1;partial=10;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.465
MTKEQKAVLKRALDHYGIDNQLTKAAEEMAELTKEICK
>CP019962.1_RagTag_2  358  468  -1  ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.342
MSWTMIFKKFEFPVLKVPVGNKVYIWLKKNINLLRV*
>CP019962.1_RagTag_3  446  2329  -1  ID=1_3;partial=00;start_type=GTG;rbs_motif=GGxGG;rbs_spacer=5-10bp;gc_cont=0.503
MQIEKLTKETILEDTTFEEIIDEKDEIYRQRLINDLTDRAAELGVKTKFTSLLKAYQKEE
KKMLQEQKKQLQEQNRARILQNLDRRTEFGSECYPDLRCGNWFADETGIRTFGMFGEVQA

my annotation.tsv file :

CP019962.1_RagTag_3     903814.ELI_1277 0.0     1239.2  COG5519@1|root,COG5519@2|Bacteria,1TRMV@1239|Firmicutes,24BJ9@186801|Clostridia 186801|ClostridiaL       Psort location Cytoplasmic, score       -       -       -       -       -       -       -       -       -       -       -       -       DUF927
CP019962.1_RagTag_4     903814.ELI_1276 3.4e-100        370.9   COG0358@1|root,COG0358@2|Bacteria       2|Bacteria      L       DNA primase activity    --       3.6.4.12        ko:K02316,ko:K17680     ko03030,map03030        -       -       -       ko00000,ko00001,ko01000,ko03029,ko03032 -       -       -DUF3991,DnaB_C,Toprim_2,Toprim_3,Toprim_N,zf-CHC2

and my gff file :

CP019962.1_RagTag       eggNOG-mapper   CDS     1       35      69.7    +       .       ID=CP019962.1_RagTag_1740;em_target=903814.ELI_3509;em_score=69.7;em_evalue=3.2e-10;em_tcov=15.6;em_searcher=diamond;em_OGs=COG1668@1|root,COG1668@2|Bacteria,1V2WV@1239|Firmicutes,24IFP@186801|Clostridia;em_COG_cat=CP;em_desc=transmembrane transport;em_max_annot_lvl=186801|Clostridia;em_Preferred_name=;em_KEGG_ko=ko:K16906;em_KEGG_Pathway=ko02010,map02010;em_KEGG_Module=M00224;em_BRITE=ko00000,ko00001,ko00002,ko02000;em_KEGG_TC=3.A.1;em_PFAMs=
CP019962.1_RagTag       eggNOG-mapper   CDS     1       37      80.9    +       .       ID=CP019962.1_RagTag_1273;em_target=903814.ELI_4010;em_score=80.9;em_evalue=1.4e-13;em_tcov=100.0;em_searcher=diamond;em_OGs=COG0257@1|root,COG0257@2|Bacteria,1VK4F@1239|Firmicutes,24UGF@186801|Clostridia,25XP9@186806|Eubacteriaceae;em_COG_cat=J;em_desc=Belongs to the bacterial ribosomal protein bL36 family;em_max_annot_lvl=186801|Clostridia;em_PFAMs=Ribosomal_L36;em_Preferred_name=rpmJ;em_KEGG_ko=ko:K02919;em_KEGG_Pathway=ko03010,map03010;em_KEGG_Module=M00178;em_BRITE=br01610,ko00000,ko00001,ko00002,ko03011;em_GOs=

I get a GenBank file but without the CDS annotations :

LOCUS       CP019962.1_RagTag    4424507 bp    DNA              BCT 05-NOV-2021
DEFINITION  Firmicutes genome.
ACCESSION   CP019962.1_RagTag
VERSION     CP019962.1_RagTag
KEYWORDS    Firmicutes.
SOURCE      .
  ORGANISM  Firmicutes
            Bacteria.
FEATURES             Location/Qualifiers
     source          1..4424507
                     /scaffold="CP019962.1_RagTag"
                     /db_xref="taxon:1239"
ORIGIN
        1 gcttgcagat ctcttttgtc aattcagcca tctcttccgc ggccttggtg agctggttgt
       61 caatgccgta atggtcaagt gcccttttta ggaccgcttt ctgttctttg gtcatttcat
      121 cctcccgtag gtctcataaa tttcttgcaa cttatagttt tattttttaa ttgttataaa

What I Did

I ran :

emapper2gbk genomes --fastanucleic EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.fna --fastaprot EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.genepred.faa --out test.gbk --gff EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.gff --annotation EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.annotation.tsv -n "Firmicutes"

Do you have any idea how to get a correct GenBank file?

Thanks

UnboundLocalError: local variable 'annotation_dict' referenced before assignment

eggnog2gbk version: emapper2gbk 0.3.0
Python version: 3.9
Operating System: redhat

Description

I am trying to convert GFF and FASTA files to GenBank, but I got the following error:
The default organism name 'cellular organisms' is used.
Creating GFF database (gffutils) for chr01.1
Traceback (most recent call last):
File "/HOPS/hqkalsan/python3.9_new/bin/emapper2gbk", line 33, in
sys.exit(load_entry_point('emapper2gbk==0.3.0', 'console_scripts', 'emapper2gbk')())
File "/HOPS/hqkalsan/python3.9_new/lib64/python3.9/site-packages/emapper2gbk-0.3.0-py3.9.egg/emapper2gbk/main.py", line 306, in cli
gbk_creation(nucleic_fasta=args.fastanucleic, protein_fasta=args.fastaprot, annot=args.annotation, gff=args.gff, gff_type=gff_type,
File "/HOPS/hqkalsan/python3.9_new/lib64/python3.9/site-packages/emapper2gbk-0.3.0-py3.9.egg/emapper2gbk/emapper2gbk.py", line 84, in gbk_creation
gbk_result = genomes_to_gbk.gff_to_gbk(nucleic_fasta=nucleic_fasta, protein_fasta=protein_fasta, annot=annot,
File "/HOPS/hqkalsan/python3.9_new/lib64/python3.9/site-packages/emapper2gbk-0.3.0-py3.9.egg/emapper2gbk/genomes_to_gbk.py", line 164, in gff_to_gbk
annot = dict(read_annotation(annot))
File "/HOPS/hqkalsan/python3.9_new/lib64/python3.9/site-packages/emapper2gbk-0.3.0-py3.9.egg/emapper2gbk/utils.py", line 427, in read_annotation
for key in annotation_dict:
UnboundLocalError: local variable 'annotation_dict' referenced before assignment

What I Did

emapper2gbk genomes -fn genome_dir/chr01.1.fna -fp protein_sequence/chr01.1.faa -o hop_pseudomolecules_v1.1_p1_p2.gbk -g gff/chr01.1.gff -a annotation/chr01.1.tsv
The GFF file looks like:
chr01.1 PGSBv3.1.1 gene 29309 31323 0.000 + . ID=id1
chr01.1 PGSBv3.1.1 CDS 29309 31323 . + . ID=id1.1;Parent=id1;primary=T
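
For readers unfamiliar with this error class: Python raises UnboundLocalError when a local variable is assigned on only some code paths and then read on a path where the assignment never ran. The sketch below is a generic illustration of that pattern and of one way to turn it into a clearer message. read_annotation_sketch and its file_format argument are hypothetical and do not reproduce the emapper2gbk implementation, whose actual cause for this report is not known from the traceback alone.

```python
def read_annotation_sketch(rows, file_format):
    """Hypothetical reader for illustration; this is NOT the emapper2gbk code."""
    if file_format == "tsv":
        annotation_dict = {row[0]: row[1:] for row in rows}
    else:
        # Without this branch, annotation_dict would stay unassigned and the
        # loop below would raise UnboundLocalError instead of a clear message.
        raise ValueError(f"Unsupported annotation format: {file_format!r}")
    for key in annotation_dict:
        yield key, annotation_dict[key]


print(list(read_annotation_sketch([["id1.1", "K02316"]], "tsv")))
```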

issue KeyError for gbk generation

eggnog2gbk version: 0.1.0
Python version: 3.9.6
Operating System: linux mint
emapper version: 2.1.6

Description

hello,

We encountered this KeyError while running emapper2gbk. It seems to be due to unmatched columns in the input annotation file from emapper. We then tested with the GitHub example files (betbox fna, faa and annotation), but the problem persisted.

We used emapper version 2.1.6 with the default outformat 6 (--outfmt 6).

We also noticed that the online go-basic.obo file has a missing ":".

thanks a lot

What I Did

emapper2gbk genes -fn nucleotide_sequence/ -fp protein_sequence/  -a annotation/ -o gbk/  -go /data/eggnog-mapper_database/eggnog-mapper/data/go-basic.obo 
The default organism name 'metagenome' is used.
Assembling Genbank informations for MAG001
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/genes_to_gbk.py", line 103, in faa_to_gbk
    create_genbank(gene_nucleic_seqs, gene_protein_seqs, annot, go_namespaces, go_alternatives, output_path, species_informations)
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/genes_to_gbk.py", line 127, in create_genbank
    record = record_info(gene_nucleic_id, gene_nucleic_seqs[gene_nucleic_id], species_informations)
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/utils.py", line 298, in record_info
    description=species_informations['description'],
KeyError: 'description'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/anaconda3/envs/m2m/bin/emapper2gbk", line 8, in <module>
    sys.exit(cli())
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/__main__.py", line 309, in cli
    gbk_creation(nucleic_fasta=args.fastanucleic, protein_fasta=args.fastaprot, annot=args.annotation, org=orgnames,
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/emapper2gbk.py", line 196, in gbk_creation
    gbk_results = gbk_pool.starmap(genes_to_gbk.faa_to_gbk, multiprocess_data)
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'description'

Incorrect gbk files when genes identifiers are numbers

eggnog2gbk version: 0.1.0
Python version: 3.7.7
Operating System: MacOS 10.15.7

Description

Running emapper2gbk in genes mode with gene identifiers consisting of numbers does not create all the GBK features (translation etc.). There is no crash, a gbk is created but it lacks some important information.

What I Did

emapper2gbk genes -fn bin.fna -fp bin.faa -o bin.gbk -n "Prevotella" -a bin.tsv

LOCUS       _10007119               3225 bp    DNA              BCT 08-MAR-2022
DEFINITION  Prevotella genome.
ACCESSION   10007119
VERSION     10007119
KEYWORDS    Prevotella.
SOURCE      .
  ORGANISM  Prevotella
            Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Prevotellaceae.
FEATURES             Location/Qualifiers
     source          1..3225
                     /scaffold="10007119"
                     /db_xref="taxon:838"
     gene            2..3225
                     /locus_tag="gene_10007119"
     CDS             2..3225
                     /locus_tag="gene_10007119"
ORIGIN
        1 atgaaagatc aaaatattaa gaaggtgttg ctcctcggct ccggtgcgtt gaagatcggt
       61 gaggccggcg agttcgacta ttccggttca caggcactca aggcgctgcg tgaggaaggc
      121 gtctacacgg tgctcatcaa tcctaatatc gccaccgtgc agacctccga gggcgtggcc
     [...]
//

When adding a prefix to all identifiers, a correct gbk is created:

LOCUS       g10007119               3225 bp    DNA              BCT 08-MAR-2022
DEFINITION  Prevotella genome.
ACCESSION   g10007119
VERSION     g10007119
KEYWORDS    Prevotella.
SOURCE      .
  ORGANISM  Prevotella
            Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Prevotellaceae.
FEATURES             Location/Qualifiers
     source          1..3225
                     /scaffold="g10007119"
                     /db_xref="taxon:838"
     gene            2..3225
                     /locus_tag="g10007119"
     CDS             2..3225
                     /locus_tag="g10007119"
                     /gene="carB"
                     /EC_number="6.3.5.5"
                     /dbxref="KEGG:R00256"
                     /dbxref="KEGG:R00575"
                     /dbxref="KEGG:R01395"
                     /dbxref="KEGG:R10948"
                     /dbxref="KEGG:R10949"
                     /translation="MKDQNIKKVLLLGSGALKIGEAGEFDYSGSQALKALREEGVYTVL
                     INPNIATVQTSEGVADQIYFLP[...]"
ORIGIN
        1 atgaaagatc aaaatattaa gaaggtgttg ctcctcggct ccggtgcgtt gaagatcggt
       61 gaggccggcg agttcgacta ttccggttca caggcactca aggcgctgcg tgaggaaggc
      121 gtctacacgg tgctcatcaa tcctaatatc gccaccgtgc agacctccga gggcgtggcc
      [...]
//
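
The workaround described above (adding a prefix to every identifier) has to be applied consistently to the nucleotide FASTA, the protein FASTA and the annotation table. A minimal sketch, assuming the identifier is the first token of each FASTA header and the first column of the annotation file; both helper names are made up for this example and the sample lines are toy data.

```python
def prefix_fasta_ids(lines, prefix="g"):
    """Yield FASTA lines with `prefix` prepended to every record identifier."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


def prefix_annotation_ids(lines, prefix="g"):
    """Yield annotation lines with `prefix` prepended to the first column.

    Comment lines (starting with '#') and blank lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
        else:
            yield prefix + line


# Toy input using the identifier from the report above.
fasta = [">10007119\n", "ATGAAAGATCAAAATATTAAG\n"]
annotation = ["#query\tPreferred_name\n", "10007119\tcarB\n"]

print("".join(prefix_fasta_ids(fasta)), end="")
print("".join(prefix_annotation_ids(annotation)), end="")
```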

Usage with eggnog-mapper2

eggnog2gbk version: 0.0.7
Python version: 3.8.2
Operating System: CentOS Linux 7

Description

Hi, I'm trying to use your tool with my output from eggnog-mapper v2.

What I Did

I used your test data and it worked, but not with mine.

emapper2gbk genomic -fg ../Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_cds_from_genomic.fna -fp ../Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_protein.faa -o teste.out -a Roseburia_inulinivorans_DSM16841.emapper.annotations 
The default organism name 'cellular organisms' is used.
Formatting fasta and annotation file for GCF_000174195.1_ASM17419v1_genomic
Traceback (most recent call last):
  File "/raeslab/scratch/lucmac/miniconda3/bin/emapper2gbk", line 8, in <module>
    sys.exit(cli())
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/__main__.py", line 245, in cli
    gbk_creation(genome=args.fastagenome, proteome=args.fastaprot, annot=args.annotation, gff=args.gff, org=orgnames, gbk=args.out, gobasic=args.gobasic, dirmode=directory_mode, cpu=args.cpu, metagenomic_mode=False)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/emapper2gbk.py", line 32, in gbk_creation
    fa_to_gbk.main(genome, proteome, annot, org, gbk, gobasic)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/fa_to_gbk.py", line 170, in main
    faa_to_gbk(genome_fasta, prot_fasta, annot_table, species_name, gbk_out, gobasic)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/fa_to_gbk.py", line 64, in faa_to_gbk
    annotation_data = dict(read_annotation(annotation_data))
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/utils.py", line 269, in read_annotation
    annotation_data.columns = headers_row
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5475, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 669, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 220, in set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 24 elements, new values have 1 elements

# Fri Feb 12 12:56:02 2021
# emapper-2.0.6
# emapper.py -i Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_protein.faa --cpu 4 --itype proteins -m diamond --output_dir eggnog --output Roseburia_inulinivorans_DSM16841 
#
#query_name     seed_eggNOG_ortholog    seed_ortholog_evalue    seed_ortholog_score     eggNOG OGs   narr_og_name     narr_og_cat     narr_og_desc    best_og_name    best_og_cat     best_og_desc    Preferred_name        GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass  BRITE    KEGG_TC CAZy    BiGG_Reaction   PFAMs
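
The "Length mismatch" above says the table has 24 columns while the header was read as a single element, which suggests the header line was not split into tab-separated fields. A quick way to check an annotation file independently of emapper2gbk is sketched below with the standard csv module. read_emapper_annotations is a hypothetical helper, and it assumes the column header is the comment line that starts with "#query".

```python
import csv
import io


def read_emapper_annotations(handle):
    """Return (header, rows) from a tab-separated emapper annotation file."""
    header = None
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue
        if fields[0].startswith("#"):
            # Assumption: the column header line starts with "#query".
            if fields[0].startswith("#query"):
                header = [fields[0].lstrip("#")] + fields[1:]
            continue
        rows.append(fields)
    if header is not None:
        for row in rows:
            if len(row) != len(header):
                raise ValueError(
                    f"{row[0]}: {len(row)} fields, header has {len(header)}"
                )
    return header, rows


# Toy file with three columns only; a real file has many more.
sample = io.StringIO(
    "# emapper-2.0.6\n"
    "#query_name\tseed_eggNOG_ortholog\tseed_ortholog_evalue\n"
    "prot_1\t903814.ELI_1277\t0.0\n"
)
header, rows = read_emapper_annotations(sample)
print(header)
print(rows)
```

If the header comes back with a single element on a real file, the columns are probably separated by spaces rather than tabs.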

No Corresponding protein ID between GFF and FAA

eggnog2gbk version: 0.1.0
Python version: 3.9.12
Operating System: Ubuntu 20.04.4

Description

I am trying to convert the eggNOG-mapper output into gbk files. When I pass the folders eggnog_fnas, eggnog_faas, eggnog_annot and eggnog_gff, plus a namefile.txt, I keep getting a "No corresponding protein ID" error and the gbk files are not created.

What I Did

my command:

emapper2gbk genomes -fn /mnt/d/eggnog_fnas -fp /mnt/d/eggnog_faas -o /mnt/d/gbk_files -g /mnt/d/eggnog_gffs -gt cds_only -go /mnt/d/GO_annotations/go-basic.obo -nf /mnt/d/namefile.txt -a /mnt/d/eggnog_annot -c 2 --keep-gff-annotation

The reply:

Creating GFF database (gffutils) for bin.8
Creating GFF database (gffutils) for bin.5
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.5.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.5.faa (-fp/protein_fasta) sequence for bin.5
Creating GFF database (gffutils) for bin.6
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.8.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.8.faa (-fp/protein_fasta) sequence for bin.8
Creating GFF database (gffutils) for bin.4
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.6.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.6.faa (-fp/protein_fasta) sequence for bin.6
Creating GFF database (gffutils) for bin.7
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.4.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.4.faa (-fp/protein_fasta) sequence for bin.4
Creating GFF database (gffutils) for bin.2
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.7.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.7.faa (-fp/protein_fasta) sequence for bin.7
Creating GFF database (gffutils) for bin.1
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.2.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.2.faa (-fp/protein_fasta) sequence for bin.2
Creating GFF database (gffutils) for bin.3
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.1.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.1.faa (-fp/protein_fasta) sequence for bin.1
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.3.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.3.faa (-fp/protein_fasta) sequence for bin.3
/!\ Only 0 on 8 genbanks have been created, check the logs for error.
--- Total runtime 46.52 seconds ---

I am a little confused. When I look at the .gff file, I see a header like this:
NODE_27_length_174714_cov_8.012086

And when I look in the .faa file, I see this kind of header:
>NODE_27_length_174714_cov_8.012086_1

Is that the problem? How would I fix this?

Thank you for reading!
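
One way to investigate this kind of mismatch is to list the identifiers on both sides and intersect them. The sketch below assumes that the protein identifier is expected in the ID attribute (column 9) of the CDS lines of the GFF, which may not be exactly how emapper2gbk matches them. fasta_ids and gff_cds_ids are hypothetical helpers and the two sample lines are toy data, not taken from the files in this report.

```python
def fasta_ids(lines):
    """Return the set of record identifiers (first token after '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff_cds_ids(lines):
    """Return the set of ID attributes found on CDS lines of a GFF file."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        for attribute in fields[8].split(";"):
            key, _, value = attribute.partition("=")
            if key.strip() == "ID":
                ids.add(value.strip())
    return ids


# Toy data: the GFF ID is "1_1" while the protein is named after the contig.
gff_lines = ["NODE_27\tProdigal\tCDS\t2\t115\t.\t-\t0\tID=1_1;partial=10\n"]
faa_lines = [">NODE_27_1 # 2 # 115 # -1\n", "MTKEQKAVLKRALDHYG\n"]

shared = fasta_ids(faa_lines) & gff_cds_ids(gff_lines)
print(sorted(fasta_ids(faa_lines)), sorted(gff_cds_ids(gff_lines)), sorted(shared))
```

An empty intersection, as in this toy case, would point to a naming difference between the two files rather than to missing proteins.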

Question about new eggNOG output

Hello, I'm trying to use Metage2Metabo with eggNOG output and MAG bins from metaWRAP, but I didn't get a .faa file when running eggNOG. It appears that eggNOG v2 no longer needs to make a .faa file in order to annotate bins.

I'm curious, is there a way I can use those two outputs -- metaWRAP && eggNOG -- without those .faa files? And if not, is there an alternative method of creating a .faa file?

Thank you for your help.

Genome names do not match annotation

Hi,
I am trying to convert my emapper annotations into GenBank format using your tool. I have the following directories set up:

ANNOTATION/ FASTAPROT/ FASTNUCLEIC/ GENBANK/ GFF/ HITS/ ORTHOLOGS/

(emapper2gbk) [mjensen2$] ls FASTNUCLEIC/
BC-1_bin.100.fna BC-1_bin.116.fna BC-1_bin.14.fna etc.

(emapper2gbk) [mjensen2$] ls FASTAPROT/
BC-1_bin.100.emapper.genepred.faa BC-1_bin.116.emapper.genepred.faa BC-1_bin.14.emapper.genepred.faa etc.

(emapper2gbk) [mjensen2$] ls ANNOTATION/
BC-1_bin.100.emapper.annotations BC-1_bin.116.emapper.annotations BC-1_bin.14.emapper.annotations etc.

When I run the following command, however, I get an error saying that the genome names do not match the annotation names.

(emapper2gbk) [mjensen2$] emapper2gbk genes -fn ./FASTNUCLEIC/ -fp ./FASTAPROT/ -o ./GENBANK/ -a ./ANNOTATION/ -c 10 -n BC-1 -go gobasic -g ./GFF/

Since it is not the filenames I checked the file content and noticed that emapper has added an additional number to the identifier when it predicted genes and annotated these, e.g.

Contig ID: >bin.1.fak127_1021
Prot ID: >bin.1.fak127_1021_1
Annotation ID: bin.1.fak127_1021_1

I believe this is the problem but I don't know how to work around this as this is something emapper added. Have you encountered this before? I might just be missing a flag of some sort but I am unsure and would appreciate your help!

Cheers,
Marlene
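
A quick check that may help here is to compare the file names once the last extension is removed, since the listings above show different suffixes in each directory (.fna, .emapper.genepred.faa, .emapper.annotations). The sketch below rests on an assumption that is not confirmed in this thread: that files are paired across directories by their name without the final extension. stems is a hypothetical helper.

```python
import os


def stems(filenames):
    """Map each file name without its last extension to the original name."""
    return {os.path.splitext(os.path.basename(name))[0]: name for name in filenames}


# File names copied from the directory listings above.
nucleic = stems(["BC-1_bin.100.fna", "BC-1_bin.116.fna", "BC-1_bin.14.fna"])
protein = stems([
    "BC-1_bin.100.emapper.genepred.faa",
    "BC-1_bin.116.emapper.genepred.faa",
    "BC-1_bin.14.emapper.genepred.faa",
])
annotation = stems([
    "BC-1_bin.100.emapper.annotations",
    "BC-1_bin.116.emapper.annotations",
    "BC-1_bin.14.emapper.annotations",
])

print(sorted(nucleic))
print(sorted(protein))
print(sorted(annotation))
print(sorted(set(nucleic) & set(protein) & set(annotation)))
```

Under that assumption the three directories share no common name, so renaming the files to a common stem (for example BC-1_bin.100.fna, BC-1_bin.100.faa, BC-1_bin.100.tsv) would be worth trying before looking at the identifiers inside the files.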

Issue with some macOS systems (like macOS Catalina 10.15 from GitHub Actions)

On some macOS systems (but not all), there is an issue with the multiprocessing part of emapper2gbk.

This leads to errors like:

+[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called.

On GitHub Actions, this causes the macOS job to reach its time limit without completing.

The fix is to set OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before the python call, like:

OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES python test_emapper2gbk.py

Solution from: https://stackoverflow.com/a/52230415
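
When the command is launched from a Python wrapper rather than from a shell, the same variable can be passed through the environment of the child process. A minimal sketch: fork_safe_env is a hypothetical helper, and the child command here only prints the variable so the example runs anywhere; in practice it would be replaced by the real call.

```python
import os
import subprocess
import sys


def fork_safe_env(base_env=None):
    """Return a copy of the environment with the macOS fork-safety check disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"
    return env


# Placeholder child process: it only echoes the variable it received.
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'])"],
    env=fork_safe_env(),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```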

emapper2gbk genomes issue

eggnog2gbk version:
Python version:
Operating System:

Description

I am using M2M, and I am trying to make gbk files using emapper2gbk for ~1200 bacterial species for use as input in m2m recon. I have created separate folders with .faa, .fna, .gff, and eggNOG files (in .tsv), and have also made a .tsv file with the genome ID in column 1 and the bacterial name in column 2 for -nf. However, when I try to run emapper2gbk in genomes mode for all of my ~1200 bacterial species, I get the error attached below. Please note that I have tried running emapper2gbk genomes for each bacterium separately and it worked; the issue seems to occur only when I run it in bulk.

What I Did

Here is what the organism name .tsv file looks like:

Handle multiple predictions for the same protein.

Some versions of eggnog-mapper (but I do not know which) can return multiple matches for the same protein. This could lead to an error in the script (here).

There are multiple solutions:

Select the match with the highest score by adding the following line (just after this line). The sort must be descending so that drop_duplicates, which keeps the first row of each query, retains the best match:

annotation_data = annotation_data.sort_values('score', ascending=False).drop_duplicates('query').sort_index()

Merge the annotations of the different matches by adding the following line (just after this line):

annotation_data = annotation_data.groupby(['query']).agg(lambda col: ','.join(col)).reset_index()
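
The first option (keep only the best-scoring match per protein) can also be written without pandas, which makes the selection rule explicit. This is a sketch on toy rows, not the emapper2gbk code; best_hit_per_query is a hypothetical helper, and the 'query' and 'score' keys follow the column names used in the lines above.

```python
def best_hit_per_query(rows, query_key="query", score_key="score"):
    """Keep, for each query, the row with the highest score."""
    best = {}
    for row in rows:
        current = best.get(row[query_key])
        if current is None or float(row[score_key]) > float(current[score_key]):
            best[row[query_key]] = row
    # Queries come back in the order of their first appearance.
    return list(best.values())


# Toy rows: prot_1 has two matches with different scores.
rows = [
    {"query": "prot_1", "score": "50.0", "Preferred_name": "-"},
    {"query": "prot_2", "score": "80.9", "Preferred_name": "rpmJ"},
    {"query": "prot_1", "score": "120.3", "Preferred_name": "carB"},
]
for row in best_hit_per_query(rows):
    print(row["query"], row["score"], row["Preferred_name"])
```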

Key error for emapper2gbk

eggnog2gbk version: 0.1.0
Python version: 3.9.6
Operating System: linux mint
emapper version: 2.1.6

Description

hello,

We encountered this KeyError while running emapper2gbk. It seems to be due to unmatched columns in the input annotation file from emapper.
We used emapper version 2.1.6 with the default outformat 6 (--outfmt 6).

We also noticed that the online go-basic.obo file has a missing ":".

thanks a lot

What I Did

emapper2gbk genes -fn nucleotide_sequence/ -fp protein_sequence/  -a annotation/ -o gbk/  -go /data/eggnog-mapper_database/eggnog-mapper/data/go-basic.obo 
The default organism name 'metagenome' is used.
Assembling Genbank informations for MAG001
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/genes_to_gbk.py", line 103, in faa_to_gbk
    create_genbank(gene_nucleic_seqs, gene_protein_seqs, annot, go_namespaces, go_alternatives, output_path, species_informations)
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/genes_to_gbk.py", line 127, in create_genbank
    record = record_info(gene_nucleic_id, gene_nucleic_seqs[gene_nucleic_id], species_informations)
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/utils.py", line 298, in record_info
    description=species_informations['description'],
KeyError: 'description'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/anaconda3/envs/m2m/bin/emapper2gbk", line 8, in <module>
    sys.exit(cli())
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/__main__.py", line 309, in cli
    gbk_creation(nucleic_fasta=args.fastanucleic, protein_fasta=args.fastaprot, annot=args.annotation, org=orgnames,
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/emapper2gbk.py", line 196, in gbk_creation
    gbk_results = gbk_pool.starmap(genes_to_gbk.faa_to_gbk, multiprocess_data)
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'description'

Issue with Pronto and GO Ontology.

eggnog2gbk version: 0.1.0
Python version: 3.7
Operating System: Linux

Description

If you try to use emapper2gbk with the new go-basic.obo file there is an error with pronto. Pronto will return the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pronto/ontology.py", line 283, in __init__
    cls(self).parse_from(_handle)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/pronto/parsers/obo.py", line 45, in parse_from
    raise SyntaxError(s.args[0], location) from None
  File "http://purl.obolibrary.org/obo/go/snapshot/go.obo", line 436334
    def: "Catalysis of the reaction: behenoyl-CoA(4-) + malonyl-CoA(5-) + H+ <=> 3-oxotetracosanoyl-CoA. + carbon dioxide + coenzyme A." [GOC:pz, Rhea: 36507]
                                                                                                                                                        ^
SyntaxError: expected QuotedString

This is linked to an issue in the current go-basic.obo file (format-version: 1.2, data-version: releases/2021-06-16), caused by the space in Rhea: 36507. The issue has been fixed in geneontology/go-ontology@104252c. Until the fix is released in a new version of go-basic.obo, you can either download the current go-basic.obo (available at http://purl.obolibrary.org/obo/go/go-basic.obo) and manually correct the Rhea ID (which is associated with the GO term GO:0102338), or use an older version of go-basic.obo, such as the one in the test folder of emapper2gbk.
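
The manual correction can be scripted. The sketch below removes the space in that single cross-reference and leaves everything else untouched. fix_rhea_xref and patch_obo_file are hypothetical helper names, and the sample line is a shortened stand-in for the real definition.

```python
from pathlib import Path

BROKEN_XREF = "Rhea: 36507"
FIXED_XREF = "Rhea:36507"


def fix_rhea_xref(obo_text: str) -> str:
    """Remove the stray space in the Rhea cross-reference."""
    return obo_text.replace(BROKEN_XREF, FIXED_XREF)


def patch_obo_file(source: str, destination: str) -> int:
    """Write a corrected copy of `source` and return the number of fixes made."""
    text = Path(source).read_text(encoding="utf-8")
    Path(destination).write_text(fix_rhea_xref(text), encoding="utf-8")
    return text.count(BROKEN_XREF)


# Shortened stand-in for the offending line of go-basic.obo.
sample = 'def: "example definition." [GOC:pz, Rhea: 36507]'
print(fix_rhea_xref(sample))
```

Calling patch_obo_file("go-basic.obo", "go-basic.fixed.obo") on a downloaded copy would then produce a file that can be passed to the -go option.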
