tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Shell 1.04% Perl 98.34% TSQL 0.61%

genome-annotation functional-assignment bacterial-genomes gene-finding

prokka's Issues

Bug for option '--hypo' in 'prokka-genbank_to_fasta_db'

Option --hypo in prokka-genbank_to_fasta_db doesn't work.

Line 41 (https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka-genbank_to_fasta_db#L41) has to be changed from:

next if $prod eq 'hypothetical protein';

to:

next if !$hypo and $prod eq 'hypothetical protein';

Support for --kingdom ALL for mixed metagenomes

Allow Kingdom=ALL or ANY for metagenomes [Andreas Bremges]

Exception: Bad end parameter

Running prokka 1.9, with --metagenome option.

Prokka fails with the same "Bad end parameter" exception on two separate contigs from two separate assemblies.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Bad end parameter (5209). End must be less than the total length of sequence (total=5208)
STACK: Error::throw
STACK: Bio::Root::Root::throw /srv/sw/cpan-modules/lib/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeq.pm:432
STACK: Bio::PrimarySeq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeq.pm:387
STACK: Bio::Seq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/Seq.pm:630
STACK: Bio::PrimarySeqI::trunc /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeqI.pm:435
STACK: /srv/sw/prokka/1.9/prokka-1.9/bin/prokka:1054

E.g. Troublesome contig:

707_L1_merged_contig_150143
CGTATAAAGGCATTGCTTGCTGAATTTATGAATCCGGAATATGGGGTTGAAAATGTTCGTCCTTATTCGCCAAGTCAGCAAGAAATATTGCGGATTTATGAGGATACGGTTTTGAAAGGGGAAGAACAGATTCCGGAAGATATAGATGTAATATTGAAAAAATTCAATAATAGCAAACTACCGACAAAATCAGAGTTTTTGCGTTATAAATTATGGTTGGAACAGAAGTATCGTTCGCCTTATACCGGTGAGTTGATACCTTTGGGAAAATTGTTTACGGCTGCGTATGAGATAGAACATATAATTCCTCAATCTCGTTATTTTGATGATTCTTTTTCTAACAAGGTGATATGTGAATCTGCTGTGAATAAATTGAAAGATAATCAATTGGGGTATGAGTTTATCAAGAATCATCACGGGCAGAAAGTTGAAGTGGGTTTTGGAAAAACGGTAGAAATTCTTTCTGTGGATAGCTACGAATGTTTTGTAAAAGAACAATATGCTAAATCGGGCGTGAAAATGAAGAAATTGTTGATGGATGATATTCCCGAGCAATTTATTGAGCGCCAATTGAACGATAGCCGGTATATCAGCAAGGTTGTTAAAGGGCTTTTGTCGAATATTGTTCGTGAAAAGAATGATAGCGGTGAATATGAGCCGGAGGCTGTTTCAAAAAATATATTAGTTTGTACGGGAAGCGTGACGGACAGGCTGAAAAAGGATTGGGGGATGAATGATGTTTGGAACAGTATTGTATATCCTCGTTTTGAACGTTTAAACGCTTTGACTGGAACACAGTGCTTTGGGCATTGGGAGAATAAAGATGGAAAAAAAGTTTTTCAGACGGAATTGCCCCTTGAATATCAGAAAGGGTTTAGTAAGAAACGTATTGACCATAGGCATCATGCCATGGATGCAATAGTGATAGCTTGCGCTACGCGGAATCATGTGAACTATTTGAGCAATGAGTCTGCAAGCCGTAATGCCAAAATCTCCCGTTATGATTTGCAGAGATTGTTGTGTGATAAGAGCAGAGTAGATGGTACTGGTAATTATAGATGGATTATAAAGAAACCATGGAATACTTTTACACAAGATGCAAGGGAGGCATTGGATAAAATAGTGATTAGCTCGAAGCAGAATTTGCGTATAATAAATAAAACAACTAATATTTATCAACATTTTGATACAGAAGGAAATCGTGTTTATAAGAAACAGGAAACCGGTGATAGTTGGGCTATTCGTAAACCGATGCATAAAGATACGGTTTTTGGAACAGTGAATTTACGAAAAGTAAAAAGTGTACGATTGTCTGTGGCTTTGGATACTCCTACCATGATTGTTGATAAGAGAGTGAAAGGCAAGGTTCTTGAATTGTTATCATATAAATATGATAAGAAGAAAATTGAAAAATATTTCAAAGAGAATGTTTTCTTTTGGAAGGATTTGGATATAGCTAAAGTTGCAGTCTATTATTTTACAGAAAATACTTCTGAACCTTTGGTTGCGGTGCGTAAACCACTTGATTCTACTTTCAATGAGAAGAAAATAAAAGAATCGGTAACGGATACTGGCATACAGAAAATTCTTTTGAATCATTTATCTGCAAAAGAAGGAAAGACGGATTTGGCTTTTTCTGCAGAAGGAATAGAAGAAATGAATCGTAATATTTTACAGTTGAATGATGGAAAAGAACATCAGCCAATATATAAAGTGAGAGTGTATGAACCACGTGGAAATAAATTTAGAGTTGGTGCATTTGGTAATAAAGGGACTAAATGGGTGGAAGCCGCTAAGGGTACTAATTTGTTCTTTGCTATTTATGCAACAGAAGATGGAAAAAGGACGTATGAGACTGTCCCCTTAAATTTGGTTATAGAACGTGAGAAACAAGGGCTTATTCCTGTTCCGGATAGGAACGAAAAAGGGGATAAACTGTTGTTTTGGTTATCTCCTAATGATTTGGTGTATCTGCCAACTGAAGAAGAACGGGAATTTGG
TAGGATAAATGAGCCGATAGATAGGGGGCGGGTTTATAAAATGGTAAGTTGTACTGGGAATGAGGGACATTTTATTCCTGTAAATGTGGCTAATCCAATATTGCCGACTATTGAATTAGGAAGTAATAATAAGGCCCAGAGAGCATGGAATAATGAAATGGTAAAAGATATTTGTATCCCAGTAAAAGTTGATAGATTGGGTCGTATTATAGAAGTTAAGTATAAAGCAAATGAATAATATAAAGTTATTTCAAGAAAAGAAAATCCGTTCCATGTGGAACGAAGAAGAGCAGCAATGGTACTTTTCTGTTGTTGATGTAGTTGGTGTATTGACTGATAGCGTGAATCCTACGGACTATCTGAAGAAGATGAGAAAACGGGATGAAGAACTGGCTACTTACCTGGGGACAAATTGTCCCCAGGTAGAAATGCTGACAGATACAGGAAAAAAAAGAAAAACTTTGGCGGCAAATGTACAGGCTTTATTCCGTATCATTCAATCCATCTCCTCTCCTAAAGCTGAACCTTTTAAACTTTGGCTGGCACAGGTGGGGTATGAGCGTGTGCAGGAAATTGAAAATCCGGAATTGGCTCAGGAACGCATGAAAGAACTTTATGAGCAGAAGGGTTATCCAAAGGATTGGATTGATAAACGTCTGAGAGGAATTGCCATTCGTCAGAATTTGACGGATGAGTGGAAAGAAAGGGGAATCACGGATGCCATTCTTACGGCAGAAATATCTAAGGCAACGTTTGGATTAAGCCCTTCGGATTATAAAATATATAAAGGACTGACAAAGAAGAATCAGAATCTTCGTGACCATATGTCCGATTTGGAATTGATATTCACGATGCTTGGCGAGCGTGTCACTACGGAAATCTCTCAGAAAGAGAAACCGGATACATTTACTAAAAGTAAACAAGTTGCACAGCGTGGTGGAAATGTTGCCGGAGTAGCACGTGAACAGGCTGAAAAAGAACTGGGTAGAAGTATTATTTCTTCCGACAATTTTTTGTTGGATTCAGATAAGCAAGATGATACCTTAAAACTTCCTTTTGAGGAAAATGATGAATGAATAATTTGTAAAATCTGTATACTATGATTAAGAAAACGCTTTATTTCGGAAATCCTGTTTATCTCTCTTTGAAAAATGCTCAGTTGGTGATTAAATTGCCGGAGGTCGTAAAAAGCTGTGCTTTGCCCGAAGGGTTCAAGCAAGTGTCTGAGGTGACTAAGCCAATAGAGGATATTGGGATAGTGGTATTGGATAATAAACAGATAACTGTTACTTCGGGAGTGTTGGAGGCTTTACTTGAAAATAATTGTGCAGTCATAACTTGTGACTCTAAAAGTATGCCGGTTGGTCTGATGCTTCCTTTGTATGGAAATACTACACAAAATGAGAGGTTTCGACAGCAACTTGGCGCTTCTCTGCCATTGATGAAACAACTTTGGCAGCAAACGATAAAGGCTAAAATAGAAAATCAGGCGGCGGTATTGAGTAAATGTACTGGAGAGGAAATAAAGTGTATGAAGATATGGGCTGCTGATGTGAAAAGTGGAGATCCGGATAACTTGGAGGCTCGTGCAGCTGCTTATTATTGGAAAAATTTGTTCAAAATAAAAGGTTTTACAAGAGATAGAGAAGGTATTCCACCTAATAATCTGTTGAATTATGGGTATGCTATTTTGCGGGCGGTCGTTGCCCGTGGTTTGGTTGCAAGTGGACTTTTACCTACTTTGGGAATACATCATCATAATCGTTATAATGCTTATTGTTTGGCGGATGATATAATGGAGCCTTATCGCCCCTATGTGGATAGGTTGGTATATGATATGATTAAAGGAGAAGAAATAAATTGTATTGGATTGACAAAAGAATTGAAAGCACAGCTGCTTACTATTCCTACGTTGGATACTATTATTTCGGGAAAACGTAGTCCGTTGATGGTGGCTGTTGGGCAGACTACGGCTTCTCTATATAAATGTTTTAGCGGTG
AGTTACGCAGAATATCTTATCCGGAGATGTAATGGAACGGTTTAGTGAATATCGGATTATGTGGGTACTTGTATTGTTTGATTTGCCAACCGAAACAAAAAAAGATAAAAAGGCATATGCGGACTTTAGAAAAAATCTGCAAAAGGATGGATTTACGATGTTTCAATTTTCTATATATGTTCGCCATTGCGCAAGTAGTGAGAATGCGGAGGTACATATAAAAAGAGTTAAGTCTATTTTGCCTGAGCACGGAAGTATTGGAATAATGTGTATTACAGATAAACAATTTGGAAATATAGAACTTTTTTATGGGAAAAAAACAGTAGATGTGAATACTCCCGGGCAGCAGTTAGAACTATTCTGAAAAGAAAATCCCGCTATATAGCGGGATTTCTTTCTTGGAAACTATATCTTTTTTAAATTCTAATGTTTAATATAACTGTATGTATATTAGTTTGTTACTGATGTTCGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTGATACTTTCTTTGTCTTTCATCTTTTAACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTCGCAAAGAACAGCAACGATAAAATGATTGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAAGTTAATCCCAATTCGCTTAATCCTTTGTGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAAACATTGGACGCTTGAAGCAAAGTACAGGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACCAGGAGAAACGGAGAAAAACCGGCATATATGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACGGGATAATGCCATTTATCCTGAAACTAACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACATGTTGATTACGGATGCAAAATTAGACGATGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAATATGCTTTTTGATAATAATAGTTGGACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTCCTTAACTTCATCAAACTTATCTGCCGTTACTGTTTTCTATGGTTCAAAGATACTAAAATGAAAGCAAATCACAA
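The exception above is raised when a feature's end coordinate (5209) lies beyond the contig length (5208). A bounds check before extracting subsequences would report the offending feature instead of aborting. The following is only a sketch in Python, not Prokka's Perl code; the function name `out_of_bounds`, the tuple layout, and the feature start coordinates are hypothetical.

```python
def out_of_bounds(features, contig_lengths):
    """Return the features whose coordinates fall outside their contig.

    features: iterable of (contig_id, start, end), 1-based inclusive.
    contig_lengths: dict mapping contig_id to sequence length.
    """
    bad = []
    for contig_id, start, end in features:
        length = contig_lengths.get(contig_id)
        # Unknown contig, start before base 1, end past the last base,
        # or inverted coordinates all count as invalid.
        if length is None or start < 1 or end > length or start > end:
            bad.append((contig_id, start, end))
    return bad


# Contig length and end coordinate taken from the exception message;
# the start coordinates are made up for illustration.
lengths = {"707_L1_merged_contig_150143": 5208}
features = [
    ("707_L1_merged_contig_150143", 4900, 5209),
    ("707_L1_merged_contig_150143", 10, 300),
]
print(out_of_bounds(features, lengths))
```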

Rfam Update

I am a Prokka user; thanks for providing, maintaining, and regularly updating Prokka. Could you please update the Rfam database to Rfam 12 in the next Prokka release? There seem to have been many changes compared to the earlier version of Rfam.

I suggest that Prokka expose whichever file it uses for the ncRNA "Rfam analysis" (Rfam.cm or Rfam.fasta) so that people can easily change or update the Rfam database themselves and no longer need to wait for the next Prokka release. Thank you and have a nice day.

Check database indexes exist before spawning searches

If --setupdb failed for some reason, prokka will still attempt to run BLAST, HMMER, etc., giving a strange error. Best to check all is well before starting.
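A pre-flight check along these lines could catch a failed --setupdb before any search is spawned. This is only a sketch in Python rather than Prokka's own Perl. The helper name `missing_index_files` is hypothetical, and the sketch assumes single-volume BLAST protein databases (makeblastdb writes .phr, .pin and .psq files) and hmmpress indexes (.h3f, .h3i, .h3m, .h3p).

```python
from pathlib import Path

# Index files written next to the database by makeblastdb (protein) and hmmpress.
BLAST_PROTEIN_SUFFIXES = (".phr", ".pin", ".psq")
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_index_files(db_path, suffixes):
    """Return the index files that should exist for db_path but do not."""
    return [db_path + s for s in suffixes if not Path(db_path + s).is_file()]


# Hypothetical database location, used only to show the call.
missing = missing_index_files("/nonexistent/db/kingdom/Archaea/sprot",
                              BLAST_PROTEIN_SUFFIXES)
if missing:
    print("Database not indexed, run --setupdb first. Missing:", missing)
```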

Prokka fails when --outdir has spaces in it

I have been very lazy and not been shell-quoting my pipe commands. Bad programmer.
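The general remedy is to quote every argument before it is interpolated into a shell command line. As an illustration only (in Python, not the Perl used by Prokka), the standard library's `shlex.quote` does this; the helper name `build_command` and the example path are hypothetical.

```python
import shlex


def build_command(program, *args):
    """Join a program and its arguments into one safely quoted shell string."""
    return " ".join(shlex.quote(str(a)) for a in (program, *args))


# An output directory containing a space survives as a single argument.
cmd = build_command("hmmpress", "my results/db/hmm/CLUSTERS.hmm")
print(cmd)
```

Splitting the resulting string with `shlex.split` gives back the original two arguments, which is a quick way to confirm that the quoting round-trips.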

prokka running always using rnammer

I am running Prokka to annotate several genomes. It worked well until now, but suddenly it started looking for rnammer, which I do not have installed, even though I did not select the flag -rnammer.
Nothing changes if I type the flag -rnammer.
Shouldn't it use barrnap (which I have installed and running) by default?

Included aragorn OSX binary hangs

aragorn binary in OS X distribution doesn’t work (Prokka hangs at tRNA prediction stage), at least on my mac (with OS X 10.9.4). Recompiling aragorn from the source fixes this.

Yevgeny Nikolaichik

Circular genome

The first line of the genbank file indicates the genome is linear. The default should be circular for bacteria (perhaps with a linear override option?).

LOCUS       205522                129078 bp    DNA     linear       20-MAY-2014

should be

LOCUS       205522                129078 bp    DNA     circular       20-MAY-2014

Missing sequence ID on 'gene' features (via Chris Fields)

Hi Torsten! Got something for you re: Prokka. I have a small bug fix, but it’s not worth a fork if you have the time.

BTW, are the Prokka scripts available on Github? Just curious...

We’re running Prokka 1.8 (BTW, great tool!) using the following:

prokka --locustag 'CBEIJ_B593' --gram pos \
--cpus $PBS_NUM_PPN \
--genus Clostridium \
--species beijerinckii \
--strain B593 \
--addgenes \
--mincontiglen 200 \
--centre 'CBC' \
--rfam \
454Scaffolds.fna.GC2

Everything looks fine except the GFF; the reference seq ID for the added ‘gene’ feature looks like this:

gnl|CBC|contig000001 Prodigal:2.60 CDS 378 1526 . - 0 ID=CBEIJ_B593_00001;gene=mlc;inference=ab initio prediction:Prodigal:2.60,similar to AA sequence:UniProtKB:P50456;locus_tag=CBEIJ_B593_00001;product=Making large colonies protein;protein_id=gnl|CBC|CBEIJ_B593_00001
SEQ prokka gene 378 1526 . - 1 gene=mlc;locus_tag =CBEIJ_B593_00001
gnl|CBC|contig000001 Prodigal:2.60 CDS 1717 3219 . - 0 ID=CBEIJ_B593_00002;eC_number=2.7.1.17;gene=xylB_1;inference=ab initio prediction:Prodigal:2.60,similar to AA sequence:UniProtKB:P35850;locus_tag=CBEIJ_B593_00002;product=Xylulose kinase;protein_id=gnl|CBC|CBEIJ_B593_00002
SEQ prokka gene 1717 3219 . - 1 gene=xylB_1;locus_tag =CBEIJ_B593_00002
…

(note the replacement of the reference with ‘SEQ’). It’s easy enough to fix on my end, as the generic ‘SEQ’ comes from Bio::SeqFeature::Generic when no seq_id is present, just need to pass the seq_id along. Starting at line 957 in the main prokka script:

if ($addgenes) {
  # make a 'sister' gene feature for the CDS feature
  # (ideally it would encompass the UTRs as well, but we don't know them)
  my $g = Bio::SeqFeature::Generic->new(
    -primary    => 'gene',
    -seq_id     => $f->seq_id,  # <---
    -start      => $f->start,
    -end        => $f->end,
    -strand     => $f->strand,
    -source_tag => $EXE,
    -tag        => { 'locus_tag '=> $ID },
  );

chris

Could not run command: makeblastdb -dbtype prot

I previously installed prokka in Biolinux8 and everything worked well.
I have now had to create a new Biolinux account, and I tried to reinstall prokka-1.10.
Everything worked, but when I try

prokka --setupdb

I get the following error:
manager@bl8vbox[lib] prokka --setupdb [12:01PM]
[12:02:05] Cleaning databases in /usr/local/lib/prokka-1.10/bin/../db
[12:02:05] Cleaning complete.
[12:02:05] Looking for 'makeblastdb' - found /usr/bin/makeblastdb
[12:02:06] Determined makeblastdb version is 2.2
[12:02:06] Making kingdom BLASTP database: /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot
[12:02:06] Running: makeblastdb -dbtype prot -in /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot -logfile /dev/null
[12:02:06] Could not run command: makeblastdb -dbtype prot -in /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot -logfile /dev/null

suggestions?

Support .GBK/.GFF for --proteins option

Instead of having to prepare a .faa file from a .GBK/.GFF file manually, perhaps this could be supported within prokka.

For GBK it would be simple to run "prokka-genbank_to_fasta_db" from within prokka.

Prokka reorders contigs

If you give prokka a contig set ordered by reference, it reorders the contigs alphabetically in the output GenBank file. It would be nice if it preserved the original contig order (preferably without renaming the contigs? We submit contigs with GenBank-format-friendly names).

Gene name attribute from --proteins evidence

The genes annotated using the --proteins evidence don't get gene= attributes in the GFF file. My FASTA file of proteins is formatted like so:

>psbK photosystem II protein K
MPVMLNIFLDDAFIYSNNIFFGKLPEAYAISDPIVDVMPIIPVLSFLLAFVWQAAVSFR
>psbI photosystem II protein I
MLTLKLFVYTVVIFFISLFIFGFLSNDPGRNPGRKE
>ycf12 hypothetical protein
MNLEVIAQLTVLTLTVVSGPLVIVLLAVRKGNL
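To make the request concrete: with headers laid out as above, the gene name is the first word and the product is the rest of the line. The snippet below is a Python sketch of that split, not Prokka's parser; `split_header` is a hypothetical name, and the "first token is the gene" convention is an assumption taken from this FASTA file only.

```python
def split_header(header):
    """Split '>gene product description' into (gene, product)."""
    body = header.lstrip(">").strip()
    # Everything up to the first space is treated as the gene name.
    gene, _, product = body.partition(" ")
    return gene, product


print(split_header(">psbK photosystem II protein K"))
```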

Batch run issues

Thanks for the great software! I have several files to be processed. Running Prokka on them, either one at a time or in batches of, say, 10, 50, or 100, often results in partially completed outputs (> 80% of inputs are incomplete). The most common error is:

Could not run command: cat ~/proteins.faa | parallel --gnu -j 8 --block 943 --recstart '>' --pipe hmmscan --noali --notextw --acc -E 1e-06 --cpu 1 ~/tools/prokka/prokka-1.10/bin/../db/hmm/CLUSTERS.hmm /dev/stdin > ~/proteins.bls 2> /dev/null

Output directories usually have only the final *fna completed.

Any suggestions? Many thanks for your time and efforts.

Changing annotations to Hypothetical Protein

Hi again,
I was going through the prokka script as well as the log file, and I noticed that some annotations are changed to "hypothetical protein" even though the original product names do not look hypothetical. I could not find an explanation for this in the script. Can you help me understand why it changes these particular annotations to hypothetical?

Thanks!
Chandni

Problem running prokka on isolate genome

Hi,

this is a little feature request.

I have the following genome Abiotrophia defectiva ATCC 49176 (s.a. http://www.ncbi.nlm.nih.gov/genome/?term=txid592010[Organism:noexp]) in fasta format and wanted to run prokka on it for test purposes.
However, I get the following error when running the following command: prokka --notrna --norrna --cpus 1 Abiotrophia_defectiva_ATCC_49176.fasta with the prokka-binary directory being in my PATH.

[17:25:50] Loading and checking input file: Abiotrophia_defectiva_ATCC_49176.fasta
[17:25:50] Wrote 20 contigs
[17:25:50] Skipping tRNA search at user request.
[17:25:50] Disabling rRNA search: --kingdom=Bacteria or --norrna=1
[17:25:50] Skipping ncRNA search, enable with --rfam if desired.
[17:25:50] Total of 0 tRNA + rRNA features
[17:25:50] Predicting coding sequences
[17:25:50] Contigs total 629 bp, so using meta mode
[17:25:50] Running: prodigal -i PROKKA_09042014/PROKKA_09042014.fna -c -m -g 11 -p meta -f sco -q
[17:26:17] Found 1875 CDS
[17:26:17] Connecting features back to sequences
[17:26:17] Option --gram not specified, will NOT check for signal peptides.
[17:26:17] Not using genus-specific database. Try --usegenus to enable it.
[17:26:17] Annotating CDS, please be patient.
[17:26:17] Will use 1 CPUs for similarity searching.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Bad end parameter (834). End must be less than the total length of sequence (total=629)
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/users/claczny/perl5/lib/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::subseq /home/users/claczny/perl5/lib/perl5/Bio/PrimarySeq.pm:452
STACK: Bio::Seq::subseq /home/users/claczny/perl5/lib/perl5/Bio/Seq.pm:630
STACK: Bio::PrimarySeqI::trunc /home/users/claczny/perl5/lib/perl5/Bio/PrimarySeqI.pm:458
STACK: Bio::SeqFeature::Generic::seq /home/users/claczny/perl5/lib/perl5/Bio/SeqFeature/Generic.pm:705
STACK: /work/projects/ecosystem_biology/local_tools/prokka-1.7/bin/prokka:754
-----------------------------------------------------------

I found the line [17:25:50] Contigs total 629 bp, so using meta mode suspicious. After looking into this, I found that it appears to be related to the fasta headers. For this genome, the fasta header of the first contig is >Abiotrophia defectiva ATCC 49176 : ACIN03000001 (and similar for the other contigs). After replacing the whitespace with underscores, prokka runs through without errors. The corresponding line now says [17:31:08] Contigs total 2041839 bp, so using single mode, which appears to be correct.
Hence, I suspect prokka needs unique fasta headers (which is not the case here, since every header starts with the same word).
Accordingly, I think it would be useful to extend the input format check so that the user is told when the fasta headers are not unique.

Looking forward to your comments.

Best,

Cedric

[EDIT]
The above is with respect to prokka-1.7. I installed prokka-1.10 now and discovered that some input format check is applied that was apparently not in the prior version (1.7) -> [11:11:27] WARNING: Contig IDs must be less than 38 characters for Genbank compliance - Abiotrophia_defectiva_ATCC_49176_:_ACIN03000001. I do not know though if there is a check already integrated as suggested above (uniqueness of IDs).
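The suggested uniqueness check could look roughly like the following. It is a Python sketch rather than anything taken from Prokka, and it assumes that the sequence ID is the first whitespace-delimited token of the header, which is consistent with the behaviour described above. The function name `duplicate_ids` is hypothetical, and the second header is made up for illustration.

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return the sorted sequence IDs that occur more than once."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # The ID is taken to be the first whitespace-delimited token.
            ids.append(tokens[0] if tokens else "")
    return sorted(i for i, n in Counter(ids).items() if n > 1)


headers = [
    ">Abiotrophia defectiva ATCC 49176 : ACIN03000001",
    ">Abiotrophia defectiva ATCC 49176 : ACIN03000002",
]
print(duplicate_ids(headers))
```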

signalp 3.0b version checking problem

Peter Cock (@pjacock), Mar 26

prokka 1.8 dependency version checking unhappy with 3.0b (as in a,b,c not beta):

$ signalp -v
3.0b, Dec 2005
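A version parser that tolerates a trailing letter would accept this output. The regular expression below is a Python sketch of one way to do it, not the pattern Prokka uses; `parse_version` is a hypothetical name.

```python
import re


def parse_version(text):
    """Parse 'MAJOR.MINOR' with an optional letter suffix, e.g. '3.0b'."""
    match = re.search(r"(\d+)\.(\d+)([a-z]?)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


print(parse_version("3.0b, Dec 2005"))
```

Because the result is a tuple, it compares naturally against a minimum version such as `(3, 0, "")`.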

COORDINATES: qualifier for Infernal/Aragorn output

I have been using Prokka to annotate de novo generated whole genome sequences of bacteria, based on species or a trusted database of proteins. I use the GBK output of Prokka to import the genome sequence into Artemis, where I do tweaks to the annotation, such as missed pseudogenes, for instance. I save the files as EMBL flat files for submission to ENA/SRA. Before submission I run the EnaValidator.jar to check for issues with the EMBL file. During these checks, it gives an error that turns out to be because of a space after the " COORDINATES: " qualifier. When I remove this in Artemis manually, the error is gone. I don't know where in the Prokka pipeline this space gets inserted, but it would be helpful to fix this (if possible).

prokka-genbank_to_fasta_db does not use the correct translation table

I've tried to make a local database of a genome from candidate division SR1, which uses translation table 25: http://www.ncbi.nlm.nih.gov/nuccore/CP006913

However, it seems like prokka-genbank_to_fasta_db does not use the correct translation table (the table is encoded in the GenBank file).

Overly long locustag/prefix results in bad GenBank LOCUS lines

Prokka 1.10 can produce broken GenBank output with over-long identifiers in the LOCUS lines.

Sample input (anonymised since this is for a collaborator):

/opt/prokka-1.10/bin/prokka --outdir XYZ123draft2_prokka --prefix XYZ123draft2 --locustag XYZ123draft2 --compliant --kingdom Bacteria --gram neg --genus ... --species ... --strain XYZ123 --quiet XZY123_draft2.fasta

Sample output:

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig000001119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig000002170983 bp   DNA   linear       19-AUG-2014
...

The LOCUS identifier is too long for the strict GenBank format, and there is no white space between the (truncated) identifier and the sequence length, meaning for example Biopython complains.

Possible output with truncation (not ideal) to ensure a white space would be something like this:

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig0000 1119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig0000 2170983 bp   DNA   linear       19-AUG-2014
...

Possible output abusing the LOCUS line (also not ideal, but some parsers will cope):

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig000001 1119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig000002 2170983 bp   DNA   linear       19-AUG-2014
...

Expected output: Fail early complaining about the overly long identifiers which will cause problems, specifying which option should be changed.
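A fail-early check of this kind might be sketched as follows. This is Python for illustration, not Prokka code; `overlong_names` is a hypothetical helper, and the default limit of 16 characters is an assumption based on the classic fixed-column LOCUS layout, so it is left as a parameter.

```python
def overlong_names(names, max_len=16):
    """Return the LOCUS names that exceed max_len characters."""
    return [name for name in names if len(name) > max_len]


names = ["XYZ123draft2_contig000001", "XYZ123draft2_contig000002"]
too_long = overlong_names(names)
if too_long:
    # Report before any annotation work starts.
    print("LOCUS names too long, shorten --locustag or --prefix:", too_long)
```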

sig_peptide coordinates still in protein space

Signal peptide annotation is incorrect: sig_peptide features use amino acid coordinates directly, without multiplying by 3 to convert them to nucleotide coordinates.

Yevgeny Nikolaichik
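For reference, the conversion from protein space to nucleotide space works out as below. This is a Python sketch of the arithmetic only, with 1-based inclusive coordinates assumed throughout; the function name `peptide_to_genome` is hypothetical and the code is not taken from Prokka.

```python
def peptide_to_genome(cds_start, cds_end, strand, aa_start, aa_end):
    """Map an amino acid range within a CDS to genome coordinates.

    All coordinates are 1-based and inclusive; strand is '+' or '-'.
    """
    if strand == "+":
        # Residue 1 begins at the first base of the CDS.
        return cds_start + 3 * (aa_start - 1), cds_start + 3 * aa_end - 1
    # On the reverse strand residue 1 begins at the last base of the CDS.
    return cds_end - 3 * aa_end + 1, cds_end - 3 * (aa_start - 1)


# A 20-residue signal peptide at the start of a forward-strand CDS.
print(peptide_to_genome(100, 400, "+", 1, 20))
# The same peptide on a reverse-strand CDS.
print(peptide_to_genome(378, 1526, "-", 1, 20))
```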

Remove doc/LICENSE.TIGRFAMs

Since TIGRFAMs HMM has been removed in Prokka 1.9, its license file can also be deleted.

Perl Exceptions

I installed Prokka and it ran fine. I then wanted to add SignalP, and since then everything has gone wrong! I thought I needed to add a Perl module and may have updated something through CPAN; now I get:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not read file 'minced -gff 'PROKKA_07152014/PROKKA_07152014.fna' |': No such file or directory
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.18.2/Bio/Root/Root.pm:449
STACK: Bio::Root::IO::_initialize_io /usr/local/share/perl/5.18.2/Bio/Root/IO.pm:270
STACK: Bio::Tools::GFF::new /usr/local/share/perl/5.18.2/Bio/Tools/GFF.pm:200
STACK: /usr/local/bin/prokka:589

I get the same exception for barrnap.

Both programs seem to run ok on their own.

Any suggestions?

Many thanks

Mitochondrial mode for plants

prokka --kingdom mito sets the genetic code to metazoan mt (5) and enables the metazoan mitochondrial tRNA mode of Aragorn (aragorn -mt). Is --kingdom mito intended specifically for metazoa, or should it also work for plants?

https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L274

Support prodigal 2.7 (git head)

Prodigal 2.7 has unfortunately changed the command line options in a non-compatible manner. -m was renamed to -n and -p was renamed to -m.

2.6	2.7
-m	-n
-p meta	-m anon
-p single	-m normal
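The mapping in the table can be applied to an existing 2.6 argument list in a single pass, so that the renamed -m is not confused with the new -m. The following Python sketch shows the idea; it is not Prokka code, `translate_prodigal_args` is a hypothetical name, and only the three renames listed above are handled.

```python
# Values of the old -p option and their 2.7 equivalents (from the table).
MODE_2_7 = {"meta": "anon", "single": "normal"}


def translate_prodigal_args(args):
    """Rewrite Prodigal 2.6 style arguments into the 2.7 style."""
    out = []
    it = iter(args)
    for arg in it:
        if arg == "-m":
            out.append("-n")
        elif arg == "-p":
            mode = next(it, None)
            if mode is None:
                raise ValueError("-p requires a value")
            out += ["-m", MODE_2_7.get(mode, mode)]
        else:
            out.append(arg)
    return out


old = ["-i", "in.fna", "-c", "-m", "-g", "11", "-p", "meta", "-f", "sco", "-q"]
print(" ".join(translate_prodigal_args(old)))
```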

Include more info from minced in the CRISPR annotations

Lizzy Wilbanks has left a new comment on your post "Prokka - rapid prokaryotic annotation":

Thanks for this great tool! So useful!! One thing that might be a nice addition for future releases would be providing more of the information from minced about the CRISPR regions - maybe as a separate output file? I've been re-running this to get the locations of the direct repeats and spacer sequences.

Posted by Lizzy Wilbanks to The Genome Factory at 31 July 2014 04:26

CLUSTERS.hmm corrupted in the tarball?

Greetings,
When updating to Prokka 1.9 and running 'prokka --setupdb', I got this error message and the setup aborted:

[16:37:51] Running: hmmpress '/home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Error: File /home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm does not appear to be in a recognized HMM format.

[16:37:51] Could not run command: hmmpress '/home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Any idea how to solve this? Is the file corrupted? I downloaded the tarball just a few minutes ago. Thanks in advance!

parallel version

I am sorry to bother you with this question. Is there a problem with the new parallel 20141022 version? Prokka does not recognise its version number correctly and always asks me to update it, even though I am using the most up-to-date version. Should I downgrade it? Which parallel version is preferred?
Thanks for any help!

Option to keep intermediate files

I'd like to inspect the .bls file for debugging (trying to track down some missing genes found by MAKER-P). I've disabled delfile( $faa_name, $bls_name);. A command line option for this purpose would be useful.

https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L934

New/Customized HMM Databases

Hi,
I've been using Prokka a bit with the default options and databases. I recently added the vFAM HMM database in /opt/prokka/db/hmm. After indexing, it was recognized successfully (prokka --listdb).
However, upon running Prokka, I see (from the log) that hmmer3 runs only for the default HMM databases (Pfam,CLUSTERS,HAMAP). Is there any way to confirm that the Prokka run actually used the new database?
Also, we work mostly on metagenomics projects so we are really looking forward to the kingdom=ALL option from the To-Do list.

Thank you,
Chandni

prokka and parallel parallel-20140222

Hi.
I have been working with Prokka: a really nice package and super easy to run.

I noticed a small bug: when using Prokka with parallel-20140222 installed, I got this error during the annotation step:
[16:05:51] Could not run command: cat MyAnnotation_MyGenomeproteins.faa | parallel --gnu -j 4 --block 166030 --recstart ...........

Launching the command outside the pipeline, I found that parallel was crashing. This is the message:

parallel: Error: -g has been retired. Use --group.
parallel: Error: -B has been retired. Use --bf.
parallel: Error: -T has been retired. Use --tty.
parallel: Error: -U has been retired. Use --er.
parallel: Error: -W has been retired. Use --wd.
parallel: Error: -Y has been retired. Use --shebang.
parallel: Error: -H has been retired. Use --halt.
parallel: Error: --tollef has been retired. Use -u -q --arg-sep -- and --load for -l.

So I reverted to parallel-20130422, and now everything seems to work properly, even inside the prokka pipeline.

Maybe this is of some help.
best
m.

Marco Fondi, PhD
Dep. of Biology, University of Florence
Via Madonna del Piano 6, S. Fiorentino, Florence, Italy
Tel. +39 055 4574736

Improve the cleanup_product() function

This function makes lots of mistakes:

Bug: HI0933-like protein => -like protein
Bug: IS1251-like transposase => -like transposase
Bug: transcription termination factor Rho => hypothetical protein
Bug: xx kDa SS-A/Ro ribonucleoprotein homolog => hypothetical protein
[12:09:23] Modify product: conserved protein with nucleoside triphosphate hydrolase domain => hypothetical protein
[12:09:23] Modify product: 23S rRNA m(2)G2445 methyltransferase => 23S rRNA m(2) methyltransferase
[12:09:23] Modify product: DNA replication terminus site-binding protein => hypothetical protein
[12:09:24] Modify product: conserved inner membrane protein => hypothetical protein
[12:09:24] Modify product: 16S ribosomal RNA m2G1207 methyltransferase => 16S ribosomal RNA methyltransferase
[12:09:25] Modify product: hypothetical protein TTC0453 => hypothetical protein
[12:09:25] Modify product: type VI secretion protein, VC_A0107 family => type VI secretion protein, family
[12:09:26] Modify product: conserved hypothetical pathogenicity island protein => hypothetical protein
[12:09:26] Modify product: IS1400 transposase B => transposase B
[12:09:26] Modify product: Dyp-type peroxidase family => Dyp-type peroxidase family protein

prokka --setupdb should check binaries

I think that the check of versions of tools and the PATH extension with $BINDIR should be done before running setup_db sub:

$ ./bin/prokka --setupdb
[19:59:35] Cleaning databases in /tmp/prokka-1.9/bin/../db
[19:59:35] Cleaning complete.
[19:59:35] Making kingdom BLASTP database: /tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot
[19:59:35] Running: makeblastdb -dbtype prot -in '/tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot' -logfile /dev/null
sh: 1: makeblastdb: not found
[19:59:35] Could not run command: makeblastdb -dbtype prot -in '/tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot' -logfile /dev/null

Also the item "use included binary if PATH one is wrong version [Simon Gladman]" from TODO in doc/ChangeLog.txt would be helpful, since having a wrong version of hmmpress in the PATH leads to this error:

$ ./bin/prokka --setupdb
...
[20:01:03] Pressing HMM database: /tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm
[20:01:03] Running: hmmpress '/tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Error: File /tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm does not appear to be in a recognized HMM format.

[20:01:03] Could not run command: hmmpress '/tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Could I use prokka with scaffolds?

Hello everybody,

I'd like to use prokka and I have scaffolds of a draft genome. Could I use prokka to annotate it or should I use contigs?

Best Regards,

Daniel

Select Barrnap or RNAmmer

I have both Barrnap and RNAmmer installed, Prokka detects both, and seems to use Barrnap by default. How do I select which is used?

[14:36:48] Looking for 'barrnap' - found /usr/local/bin/barrnap
[14:36:48] Determined barrnap version is 0.4
[14:36:49] Looking for 'rnammer' - found /usr/local/bin/rnammer
[14:36:49] Determined rnammer version is 1.2
[14:36:49] Predicting Ribosomal RNAs
[14:36:49] Running Barrnap with 4 threads

Namespace collisions with default contig ID naming

Having just used Prokka 1.8 on several strains I am left with *.fna and *.gbk (etc) files with ambiguous identifiers like gnl|PROKKA|contig000001 which appear in all my strains.

Referring to http://www.ncbi.nlm.nih.gov/genomes/static/Annotation_pipeline_README.txt (linked to in the Prokka script - thank you) the NCBI say:

The fasta file should look like this:
 >gnl|center|<ID1> [organism=<ORGANISM NAME STRAIN NAME>] [strain=<STRAIN NAME>] [gcode=11]
 <NUCLEOTIDE SEQUENCE>

NOTE: The |center|<ID1> part of the header must be less than 38 characters

An example of a fasta header for the Bacterium bacterius 253 is:

>gnl|LrgU|Contig01 [organism=Bacterium bacterius 253] [strain=253] [gcode=11]

I propose that rather than using gnl|$centre|contig%06d where the ID is just contig000001 etc, Prokka prefixes this (if a prefix is specified, and under 38 - 6 = 32 characters). This prefix could be a new command line option, or perhaps reuse the existing strain or locus tag prefix?

e.g. I would like to be able to request gnl|PROKKA|XXX_contig000001 and gnl|PROKKA|YYY_contig000001 for strains XXX and YYY.
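The proposal can be sketched as follows, in Python rather than Prokka's Perl. The function name `contig_id` is hypothetical, and the length rule is an assumption: the sketch counts the characters of "centre|ID" and requires fewer than 38, which is one reading of the NCBI note quoted above.

```python
def contig_id(centre, prefix, number, limit=38):
    """Build 'gnl|centre|prefix_contigNNNNNN', failing if it is too long."""
    local = f"contig{number:06d}"
    if prefix:
        local = f"{prefix}_{local}"
    # Assumed reading of the NCBI rule: centre plus ID must be < 38 characters.
    if len(f"{centre}|{local}") >= limit:
        raise ValueError(f"ID too long for NCBI: {centre}|{local}")
    return f"gnl|{centre}|{local}"


print(contig_id("PROKKA", "XXX", 1))
print(contig_id("PROKKA", "YYY", 1))
```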

'--proteins' reference introduces too many "paralogs"

I noticed that if you use a reference genome for annotation with option '--proteins', lots of false paralogs will be annotated (gene names that are numbered with an underscore '_' suffix; see hash %collide in lines https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L985-1013).

For annotation Prokka just takes the best BLASTP hit (lowest e-value). Might it be useful to have more options, in addition to '--evalue', for subject/query coverage and identity cutoffs?

For this purpose BioPerl includes BLAST HSP tiling via the 'frac*' methods, which can be used to skip BLASTP results that don't satisfy these constraints, e.g.:
$hit->frac_identical('query')
$hit->frac_aligned_hit
These could be included in the BLASTP parsing routine (line https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L921 onwards).
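The kind of filter being proposed is sketched below in Python, independent of BioPerl. The dictionary keys, the function name `passes_cutoffs`, and the example numbers are all hypothetical; only the default e-value of 1e-06 mirrors the value seen in Prokka's logged commands.

```python
def passes_cutoffs(hit, max_evalue=1e-6, min_identity=0.0,
                   min_query_cov=0.0, min_subject_cov=0.0):
    """Decide whether a hit satisfies e-value, identity and coverage cutoffs."""
    identity = hit["identical"] / hit["aligned_length"]
    query_cov = hit["query_aligned"] / hit["query_length"]
    subject_cov = hit["subject_aligned"] / hit["subject_length"]
    return (hit["evalue"] <= max_evalue
            and identity >= min_identity
            and query_cov >= min_query_cov
            and subject_cov >= min_subject_cov)


# A strong but short hit: it covers only 20% of the query protein.
hit = {
    "evalue": 1e-20, "identical": 55, "aligned_length": 60,
    "query_aligned": 60, "query_length": 300,
    "subject_aligned": 60, "subject_length": 80,
}
print(passes_cutoffs(hit))                     # e-value alone accepts it
print(passes_cutoffs(hit, min_query_cov=0.8))  # a coverage cutoff rejects it
```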

tbl2asn fails silently with pipe >> | << character in contig names

James Doonan: I couldn't get the full complement of files from the prokka output for two of my whole genome sequences. I discovered that it was to do with the contig names. The whole-genome contig was called '>scf7180000000002|quiver'. The output from prokka was missing the GenBank and Sequin files. When I changed the contig name to just '>scf71', it gave me all the files as output.

*.faa result files methionine/stop codon

The resulting .faa files from Prokka include stop codons ('*'), and proteins with atypical (non-ATG) start codons don't start with methionine (neither is NCBI standard). This is remedied by changing line https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L1101 from:

$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode) );

to:

$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode, -complete => 1) );

See BioPerl HOWTO: http://www.bioperl.org/wiki/HOWTO:Beginners#Translating
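
To make the intended effect concrete, here is a standalone Python sketch of the two adjustments the report asks for: dropping the trailing stop symbol and writing the initiator residue as methionine. It is not BioPerl and only approximates what the report expects from '-complete => 1'; the function name and the `has_start_codon` flag are hypothetical.

```python
def ncbi_style_protein(seq, has_start_codon=True):
    """Strip a trailing '*' and force a leading 'M' on a translated CDS.

    `has_start_codon` is an assumed flag for partial genes whose first
    codon is not a real start codon and so should not be rewritten.
    """
    if seq.endswith("*"):
        seq = seq[:-1]
    if has_start_codon and seq and seq[0] != "M":
        seq = "M" + seq[1:]
    return seq


if __name__ == "__main__":
    # A CDS starting with GTG translates to V under the standard lookup.
    print(ncbi_style_protein("VKTAYIAK*"))
```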

Translation table 25 is not supported

Prokka does not support translation table 25: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG25

Minor .gbk file issues

The LOCUS entries in the .gbk file don't put a space in between the name of the contig (in this case) and its size, which means BioPerl gets upset when it tries to read the file.
Additionally, it would be helpful if there was a default entry for the ACCESSION and VERSION fields, just a '.' would do, as other programs read the file incorrectly when these entries are blank (such as pmauve).

Proper /inference when --proteins has gi/gb/ref ID

/inference="similar to AA sequence:trusted.faa:gi|302750786|gb|ADL64963.1|"

should be :Genbank:ADLXXXXX.1
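
One way to derive the shorter form is to parse the old-style NCBI identifier and keep only the accession. The Python sketch below is an assumption-laden illustration: the function name is made up, the 'gb' mapping follows the ':Genbank:' form requested above, the 'ref' mapping to 'RefSeq' is my assumption, and anything unrecognised falls back to the current file-based form.

```python
import re

# Assumed mapping from the database tag in the ID to the name used in /inference.
DB_NAMES = {"gb": "Genbank", "ref": "RefSeq"}

NCBI_ID = re.compile(r"^gi\|\d+\|(?P<db>[a-z]+)\|(?P<acc>[^|]+)\|?")


def inference_for(hit_id, source="trusted.faa"):
    """Build an /inference value from a hit ID such as gi|1|gb|ACC.1|."""
    match = NCBI_ID.match(hit_id)
    if match and match.group("db") in DB_NAMES:
        db = DB_NAMES[match.group("db")]
        return f"similar to AA sequence:{db}:{match.group('acc')}"
    # Fallback: keep the existing behaviour for IDs we do not recognise.
    return f"similar to AA sequence:{source}:{hit_id}"


if __name__ == "__main__":
    print(inference_for("gi|302750786|gb|ADL64963.1|"))
```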

Missing short genes

Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with prodigal -s I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.

Checksum for tarball

It would be useful if there was an md5 checksum for the stable-release tarball, for verifying the download from http://www.vicbioinformatics.com/software.prokka.shtml.
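
If a checksum were published, verifying a download could be as simple as the Python sketch below. The function names are illustrative, and no checksum value for any Prokka release is implied here; the demo uses the well-known MD5 digest of the bytes "abc".

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(stream, expected_hex):
    """Compare a stream's MD5 with a published value, ignoring case and whitespace."""
    return md5_of_stream(stream) == expected_hex.strip().lower()


if __name__ == "__main__":
    # For a real tarball, pass open(path, "rb") instead of the in-memory demo.
    print(md5_of_stream(io.BytesIO(b"abc")))
```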

Equivalent option for HMMs like --proteins for BLASTP ?

New --hmms option to prioritise a custom HMM (like --proteins does for BLASTP)

Connor Driscoll

Output a .PTT file for the CDS features

Andrew Buultjens requests PTT file output: https://www.biostars.org/p/16405/

minced version upgrade error

Hi,

I recently queued Prokka for a large dataset, split into 7-8 parts so I could run them all together. After they all finished, my next script to extract output data failed for some runs. Upon some investigation I saw that there were a few failed runs with the log entry:
"Prokka needs minced 1.6 or higher. Please upgrade and try again."

I am not sure why this error came up for only some runs while the others finished successfully. It was an extremely small fraction of the total runs, and I just queued those contigs again, but I thought it worthwhile to mention it here in case someone else has had a similar issue.

Thanks!
Chandni

tbl2asn: no .gbk or .sqn files

I'm sorry to bother you with this issue here, but despite installing a new version of tbl2asn and being able to call it from the command line, we've not been able to produce .gbk files using prokka. An example final set of lines from the .log follow:

[21:15:14] Writing outputs to /home/cooper/Vaughn/tmp/pneumo/ref//
[21:15:17] Generating annotation statistics file
[21:15:17] Generating Genbank and Sequin files
[21:15:17] Running: tbl2asn -V b -a r10k -l paired-ends -M n -N 1 -y 'Annotated using prokka 1.10 from http://www.vicbioinformatics.com' -Z /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.err -i /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.fsa 2> /dev/null
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//errorsummary.val
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.dr
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.fixedproducts
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.ecn
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.val
[21:15:17] Output files:
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.faa
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.tbl
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.txt
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.log
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.fsa
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.fna
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.err
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.gff
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.ffn

Tag stable releases

Tagging stable releases could be a good way to download the Prokka code without downloading the large databases.