prokka's People

Contributors

andersgs, ctskennerton, nsoranzo, petehaitch, peterjc, robymetallo, sjackman, smsaladi, standage, stephenturner, telatin, tseemann, ucpete

prokka's Issues

Exception: Bad end parameter

Running prokka 1.9, with --metagenome option.

Prokka falls down with same bad end parameter exception on two separate contigs from two separate assemblies.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Bad end parameter (5209). End must be less than the total length of sequence (total=5208)
STACK: Error::throw
STACK: Bio::Root::Root::throw /srv/sw/cpan-modules/lib/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeq.pm:432
STACK: Bio::PrimarySeq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeq.pm:387
STACK: Bio::Seq::subseq /srv/sw/cpan-modules/lib/perl5/Bio/Seq.pm:630
STACK: Bio::PrimarySeqI::trunc /srv/sw/cpan-modules/lib/perl5/Bio/PrimarySeqI.pm:435
STACK: /srv/sw/prokka/1.9/prokka-1.9/bin/prokka:1054

E.g. Troublesome contig:

707_L1_merged_contig_150143
CGTATAAAGGCATTGCTTGCTGAATTTATGAATCCGGAATATGGGGTTGAAAATGTTCGTCCTTATTCGCCAAGTCAGCAAGAAATATTGCGGATTTATGAGGATACGGTTTTGAAAGGGGAAGAACAGATTCCGGAAGATATAGATGTAATATTGAAAAAATTCAATAATAGCAAACTACCGACAAAATCAGAGTTTTTGCGTTATAAATTATGGTTGGAACAGAAGTATCGTTCGCCTTATACCGGTGAGTTGATACCTTTGGGAAAATTGTTTACGGCTGCGTATGAGATAGAACATATAATTCCTCAATCTCGTTATTTTGATGATTCTTTTTCTAACAAGGTGATATGTGAATCTGCTGTGAATAAATTGAAAGATAATCAATTGGGGTATGAGTTTATCAAGAATCATCACGGGCAGAAAGTTGAAGTGGGTTTTGGAAAAACGGTAGAAATTCTTTCTGTGGATAGCTACGAATGTTTTGTAAAAGAACAATATGCTAAATCGGGCGTGAAAATGAAGAAATTGTTGATGGATGATATTCCCGAGCAATTTATTGAGCGCCAATTGAACGATAGCCGGTATATCAGCAAGGTTGTTAAAGGGCTTTTGTCGAATATTGTTCGTGAAAAGAATGATAGCGGTGAATATGAGCCGGAGGCTGTTTCAAAAAATATATTAGTTTGTACGGGAAGCGTGACGGACAGGCTGAAAAAGGATTGGGGGATGAATGATGTTTGGAACAGTATTGTATATCCTCGTTTTGAACGTTTAAACGCTTTGACTGGAACACAGTGCTTTGGGCATTGGGAGAATAAAGATGGAAAAAAAGTTTTTCAGACGGAATTGCCCCTTGAATATCAGAAAGGGTTTAGTAAGAAACGTATTGACCATAGGCATCATGCCATGGATGCAATAGTGATAGCTTGCGCTACGCGGAATCATGTGAACTATTTGAGCAATGAGTCTGCAAGCCGTAATGCCAAAATCTCCCGTTATGATTTGCAGAGATTGTTGTGTGATAAGAGCAGAGTAGATGGTACTGGTAATTATAGATGGATTATAAAGAAACCATGGAATACTTTTACACAAGATGCAAGGGAGGCATTGGATAAAATAGTGATTAGCTCGAAGCAGAATTTGCGTATAATAAATAAAACAACTAATATTTATCAACATTTTGATACAGAAGGAAATCGTGTTTATAAGAAACAGGAAACCGGTGATAGTTGGGCTATTCGTAAACCGATGCATAAAGATACGGTTTTTGGAACAGTGAATTTACGAAAAGTAAAAAGTGTACGATTGTCTGTGGCTTTGGATACTCCTACCATGATTGTTGATAAGAGAGTGAAAGGCAAGGTTCTTGAATTGTTATCATATAAATATGATAAGAAGAAAATTGAAAAATATTTCAAAGAGAATGTTTTCTTTTGGAAGGATTTGGATATAGCTAAAGTTGCAGTCTATTATTTTACAGAAAATACTTCTGAACCTTTGGTTGCGGTGCGTAAACCACTTGATTCTACTTTCAATGAGAAGAAAATAAAAGAATCGGTAACGGATACTGGCATACAGAAAATTCTTTTGAATCATTTATCTGCAAAAGAAGGAAAGACGGATTTGGCTTTTTCTGCAGAAGGAATAGAAGAAATGAATCGTAATATTTTACAGTTGAATGATGGAAAAGAACATCAGCCAATATATAAAGTGAGAGTGTATGAACCACGTGGAAATAAATTTAGAGTTGGTGCATTTGGTAATAAAGGGACTAAATGGGTGGAAGCCGCTAAGGGTACTAATTTGTTCTTTGCTATTTATGCAACAGAAGATGGAAAAAGGACGTATGAGACTGTCCCCTTAAATTTGGTTATAGAACGTGAGAAACAAGGGCTTATTCCTGTTCCGGATAGGAACGAAAAAGGGGATAAACTGTTGTTTTGGTTATCTCCTAATGATTTGGTGTATCTGCCAACTGAAGAAGAACGGGAATTTGG
TAGGATAAATGAGCCGATAGATAGGGGGCGGGTTTATAAAATGGTAAGTTGTACTGGGAATGAGGGACATTTTATTCCTGTAAATGTGGCTAATCCAATATTGCCGACTATTGAATTAGGAAGTAATAATAAGGCCCAGAGAGCATGGAATAATGAAATGGTAAAAGATATTTGTATCCCAGTAAAAGTTGATAGATTGGGTCGTATTATAGAAGTTAAGTATAAAGCAAATGAATAATATAAAGTTATTTCAAGAAAAGAAAATCCGTTCCATGTGGAACGAAGAAGAGCAGCAATGGTACTTTTCTGTTGTTGATGTAGTTGGTGTATTGACTGATAGCGTGAATCCTACGGACTATCTGAAGAAGATGAGAAAACGGGATGAAGAACTGGCTACTTACCTGGGGACAAATTGTCCCCAGGTAGAAATGCTGACAGATACAGGAAAAAAAAGAAAAACTTTGGCGGCAAATGTACAGGCTTTATTCCGTATCATTCAATCCATCTCCTCTCCTAAAGCTGAACCTTTTAAACTTTGGCTGGCACAGGTGGGGTATGAGCGTGTGCAGGAAATTGAAAATCCGGAATTGGCTCAGGAACGCATGAAAGAACTTTATGAGCAGAAGGGTTATCCAAAGGATTGGATTGATAAACGTCTGAGAGGAATTGCCATTCGTCAGAATTTGACGGATGAGTGGAAAGAAAGGGGAATCACGGATGCCATTCTTACGGCAGAAATATCTAAGGCAACGTTTGGATTAAGCCCTTCGGATTATAAAATATATAAAGGACTGACAAAGAAGAATCAGAATCTTCGTGACCATATGTCCGATTTGGAATTGATATTCACGATGCTTGGCGAGCGTGTCACTACGGAAATCTCTCAGAAAGAGAAACCGGATACATTTACTAAAAGTAAACAAGTTGCACAGCGTGGTGGAAATGTTGCCGGAGTAGCACGTGAACAGGCTGAAAAAGAACTGGGTAGAAGTATTATTTCTTCCGACAATTTTTTGTTGGATTCAGATAAGCAAGATGATACCTTAAAACTTCCTTTTGAGGAAAATGATGAATGAATAATTTGTAAAATCTGTATACTATGATTAAGAAAACGCTTTATTTCGGAAATCCTGTTTATCTCTCTTTGAAAAATGCTCAGTTGGTGATTAAATTGCCGGAGGTCGTAAAAAGCTGTGCTTTGCCCGAAGGGTTCAAGCAAGTGTCTGAGGTGACTAAGCCAATAGAGGATATTGGGATAGTGGTATTGGATAATAAACAGATAACTGTTACTTCGGGAGTGTTGGAGGCTTTACTTGAAAATAATTGTGCAGTCATAACTTGTGACTCTAAAAGTATGCCGGTTGGTCTGATGCTTCCTTTGTATGGAAATACTACACAAAATGAGAGGTTTCGACAGCAACTTGGCGCTTCTCTGCCATTGATGAAACAACTTTGGCAGCAAACGATAAAGGCTAAAATAGAAAATCAGGCGGCGGTATTGAGTAAATGTACTGGAGAGGAAATAAAGTGTATGAAGATATGGGCTGCTGATGTGAAAAGTGGAGATCCGGATAACTTGGAGGCTCGTGCAGCTGCTTATTATTGGAAAAATTTGTTCAAAATAAAAGGTTTTACAAGAGATAGAGAAGGTATTCCACCTAATAATCTGTTGAATTATGGGTATGCTATTTTGCGGGCGGTCGTTGCCCGTGGTTTGGTTGCAAGTGGACTTTTACCTACTTTGGGAATACATCATCATAATCGTTATAATGCTTATTGTTTGGCGGATGATATAATGGAGCCTTATCGCCCCTATGTGGATAGGTTGGTATATGATATGATTAAAGGAGAAGAAATAAATTGTATTGGATTGACAAAAGAATTGAAAGCACAGCTGCTTACTATTCCTACGTTGGATACTATTATTTCGGGAAAACGTAGTCCGTTGATGGTGGCTGTTGGGCAGACTACGGCTTCTCTATATAAATGTTTTAGCGGTG
AGTTACGCAGAATATCTTATCCGGAGATGTAATGGAACGGTTTAGTGAATATCGGATTATGTGGGTACTTGTATTGTTTGATTTGCCAACCGAAACAAAAAAAGATAAAAAGGCATATGCGGACTTTAGAAAAAATCTGCAAAAGGATGGATTTACGATGTTTCAATTTTCTATATATGTTCGCCATTGCGCAAGTAGTGAGAATGCGGAGGTACATATAAAAAGAGTTAAGTCTATTTTGCCTGAGCACGGAAGTATTGGAATAATGTGTATTACAGATAAACAATTTGGAAATATAGAACTTTTTTATGGGAAAAAAACAGTAGATGTGAATACTCCCGGGCAGCAGTTAGAACTATTCTGAAAAGAAAATCCCGCTATATAGCGGGATTTCTTTCTTGGAAACTATATCTTTTTTAAATTCTAATGTTTAATATAACTGTATGTATATTAGTTTGTTACTGATGTTCGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTGATACTTTCTTTGTCTTTCATCTTTTAACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTCGCAAAGAACAGCAACGATAAAATGATTGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAAGTTAATCCCAATTCGCTTAATCCTTTGTGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAAACATTGGACGCTTGAAGCAAAGTACAGGGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACCAGGAGAAACGGAGAAAAACCGGCATATATGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACGGGATAATGCCATTTATCCTGAAACTAACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACATGTTGATTACGGATGCAAAATTAGACGATGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACAATATGCTTTTTGATAATAATAGTTGGACGCTGTTTCCAATGGTTCAAAGATACTAAAATGAAAGCAAATCACAACTCCTTAACTTCATCAAACTTATCTGCCGTTACTGTTTTCTATGGTTCAAAGATACTAAAATGAAAGCAAATCACAA

Rfam Update

I am a Prokka user; thanks for providing, maintaining and regularly updating Prokka. Could you please update the Rfam database to Rfam 12 in the next Prokka release? There seem to be many changes compared to the earlier version of Rfam.

I suggest that Prokka expose whichever file it uses for the ncRNA "Rfam analysis" (Rfam.cm or Rfam.fasta), so that people can easily change or update the Rfam database themselves and no longer need to wait for the next Prokka release. Thank you and have a nice day.

prokka always runs using rnammer

I am running Prokka to annotate several genomes. It worked well until now, but suddenly it started to look for rnammer, which I do not have installed, even though I did not select the flag -rnammer.
Nothing changes if I type the flag -rnammer.
Shouldn't it use barrnap (which I have installed and running) as the default?

Included aragorn OSX binary hangs

The aragorn binary in the OS X distribution doesn’t work (Prokka hangs at the tRNA prediction stage), at least on my Mac (with OS X 10.9.4). Recompiling aragorn from source fixes this.

Yevgeny Nikolaichik

Circular genome

The first line of the genbank file indicates the genome is linear. The default should be circular for bacteria (perhaps with a linear override option?).

LOCUS       205522                129078 bp    DNA     linear       20-MAY-2014

should be

LOCUS       205522                129078 bp    DNA     circular       20-MAY-2014

Missing sequence ID on 'gene' features (via Chris Fields)

Hi Torsten! Got something for you re: Prokka. I have a small bug fix, but it’s not worth a fork if you have the time.

BTW, are the Prokka scripts available on Github? Just curious...

We’re running Prokka 1.8 (BTW, great tool!) using the following:

prokka --locustag 'CBEIJ_B593' --gram pos \
  --cpus $PBS_NUM_PPN \
  --genus Clostridium \
  --species beijerinckii \
  --strain B593 \
  --addgenes \
  --mincontiglen 200 \
  --centre 'CBC' \
  --rfam \
  454Scaffolds.fna.GC2

Everything looks fine except the GFF; the reference seq ID for the added ‘gene’ feature looks like this:

gnl|CBC|contig000001 Prodigal:2.60 CDS 378 1526 . - 0 ID=CBEIJ_B593_00001;gene=mlc;inference=ab initio prediction:Prodigal:2.60,similar to AA sequence:UniProtKB:P50456;locus_tag=CBEIJ_B593_00001;product=Making large colonies protein;protein_id=gnl|CBC|CBEIJ_B593_00001
SEQ prokka gene 378 1526 . - 1 gene=mlc;locus_tag =CBEIJ_B593_00001
gnl|CBC|contig000001 Prodigal:2.60 CDS 1717 3219 . - 0 ID=CBEIJ_B593_00002;eC_number=2.7.1.17;gene=xylB_1;inference=ab initio prediction:Prodigal:2.60,similar to AA sequence:UniProtKB:P35850;locus_tag=CBEIJ_B593_00002;product=Xylulose kinase;protein_id=gnl|CBC|CBEIJ_B593_00002
SEQ prokka gene 1717 3219 . - 1 gene=xylB_1;locus_tag =CBEIJ_B593_00002

(note the replacement of the reference with ‘SEQ’). It’s easy enough to fix on my end, as the generic ‘SEQ’ comes from Bio::SeqFeature::Generic when no seq_id is present, just need to pass the seq_id along. Starting at line 957 in the main prokka script:

if ($addgenes) {
  # make a 'sister' gene feature for the CDS feature
  # (ideally it would encompass the UTRs as well, but we don't know them)
  my $g = Bio::SeqFeature::Generic->new(
    -primary    => 'gene',
    -seq_id     => $f->seq_id,  # <---
    -start      => $f->start,
    -end        => $f->end,
    -strand     => $f->strand,
    -source_tag => $EXE,
    -tag        => { 'locus_tag '=> $ID },
  );

chris

Could not run command: makeblastdb -dbtype prot

I previously installed prokka in Biolinux8 and everything worked well.
I had to create a new Biolinux account now and I tried to reinstall prokka-1.10.
Everything worked, but when I try

prokka --setupdb

I get the following error:
manager@bl8vbox[lib] prokka --setupdb [12:01PM]
[12:02:05] Cleaning databases in /usr/local/lib/prokka-1.10/bin/../db
[12:02:05] Cleaning complete.
[12:02:05] Looking for 'makeblastdb' - found /usr/bin/makeblastdb
[12:02:06] Determined makeblastdb version is 2.2
[12:02:06] Making kingdom BLASTP database: /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot
[12:02:06] Running: makeblastdb -dbtype prot -in /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot -logfile /dev/null
[12:02:06] Could not run command: makeblastdb -dbtype prot -in /usr/local/lib/prokka-1.10/bin/../db/kingdom/Archaea/sprot -logfile /dev/null

suggestions?

Support .GBK/.GFF for --proteins option

Instead of having to prepare a .faa file manually, perhaps Prokka could support these formats directly.

For GBK it would be simple to run "prokka-genbank_to_fasta_db" from within prokka.

Prokka reorders contigs

If you give prokka a contig set ordered by reference, it reorders the contigs alphabetically in the output genbank file. It would be nice if it preserved the original contig order (preferably without renaming the contigs; we submit contigs with genbank format friendly names).

Gene name attribute from --proteins evidence

The genes annotated using the --proteins evidence don't get gene= attributes in the GFF file. My FASTA file of proteins is formatted like so:

>psbK photosystem II protein K
MPVMLNIFLDDAFIYSNNIFFGKLPEAYAISDPIVDVMPIIPVLSFLLAFVWQAAVSFR
>psbI photosystem II protein I
MLTLKLFVYTVVIFFISLFIFGFLSNDPGRNPGRKE
>ycf12 hypothetical protein
MNLEVIAQLTVLTLTVVSGPLVIVLLAVRKGNL

Batch run issues

Thanks for the great software! I have several files to be processed. Running PROKKA on them either serially individually or in batches of say 10 or 50 or 100 often results in partially completed outputs (> 80% of inputs are incomplete). The most common error is:

Could not run command: cat ~/proteins.faa | parallel --gnu -j 8 --block 943 --recstart '>' --pipe hmmscan --noali --notextw --acc -E 1e-06 --cpu 1 ~/tools/prokka/prokka-1.10/bin/../db/hmm/CLUSTERS.hmm /dev/stdin > ~/proteins.bls 2> /dev/null

Output directories usually have only the final *fna completed.

Any suggestions? Many thanks for your time and efforts.

Changing annotations to Hypothetical Protein

Hi again,
I was going through the prokka script as well as the log file, and I noticed that some of the annotations change themselves to Hypothetical protein, even though they don't look like they are annotated as "Hypothetical Protein". I could not find a suitable explanation for the same in the script. Can you help me out with this and let me know why it is changing some particular annotations and making them hypothetical?

Thanks!
Chandni

Problem running prokka on isolate genome

Hi,

this is a little feature request.

I have the following genome Abiotrophia defectiva ATCC 49176 (s.a. http://www.ncbi.nlm.nih.gov/genome/?term=txid592010[Organism:noexp]) in fasta format and wanted to run prokka on it for test purposes.
However, I get the following error when running the following command: prokka --notrna --norrna --cpus 1 Abiotrophia_defectiva_ATCC_49176.fasta with the prokka-binary directory being in my PATH.

[17:25:50] Loading and checking input file: Abiotrophia_defectiva_ATCC_49176.fasta
[17:25:50] Wrote 20 contigs
[17:25:50] Skipping tRNA search at user request.
[17:25:50] Disabling rRNA search: --kingdom=Bacteria or --norrna=1
[17:25:50] Skipping ncRNA search, enable with --rfam if desired.
[17:25:50] Total of 0 tRNA + rRNA features
[17:25:50] Predicting coding sequences
[17:25:50] Contigs total 629 bp, so using meta mode
[17:25:50] Running: prodigal -i PROKKA_09042014/PROKKA_09042014.fna -c -m -g 11 -p meta -f sco -q
[17:26:17] Found 1875 CDS
[17:26:17] Connecting features back to sequences
[17:26:17] Option --gram not specified, will NOT check for signal peptides.
[17:26:17] Not using genus-specific database. Try --usegenus to enable it.
[17:26:17] Annotating CDS, please be patient.
[17:26:17] Will use 1 CPUs for similarity searching.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Bad end parameter (834). End must be less than the total length of sequence (total=629)
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/users/claczny/perl5/lib/perl5/Bio/Root/Root.pm:486
STACK: Bio::PrimarySeq::subseq /home/users/claczny/perl5/lib/perl5/Bio/PrimarySeq.pm:452
STACK: Bio::Seq::subseq /home/users/claczny/perl5/lib/perl5/Bio/Seq.pm:630
STACK: Bio::PrimarySeqI::trunc /home/users/claczny/perl5/lib/perl5/Bio/PrimarySeqI.pm:458
STACK: Bio::SeqFeature::Generic::seq /home/users/claczny/perl5/lib/perl5/Bio/SeqFeature/Generic.pm:705
STACK: /work/projects/ecosystem_biology/local_tools/prokka-1.7/bin/prokka:754
-----------------------------------------------------------

I found the line [17:25:50] Contigs total 629 bp, so using meta mode suspicious. After looking into this, I found out that it appears to be related to the fasta headers. For this genome, the fasta header of the first contig is >Abiotrophia defectiva ATCC 49176 : ACIN03000001 (and similar for the other contigs). After replacing the whitespaces with underscores, prokka appears to run nicely through. The corresponding line now says [17:31:08] Contigs total 2041839 bp, so using single mode, which appears to be correct.
Hence, I suspect prokka needs unique fasta headers (which is not the case here).
Accordingly, I think it would be a useful feature to integrate or extend the input format check and let the user know when the fasta headers are not unique.

Looking forward to your comments.

Best,

Cedric

[EDIT]
The above is with respect to prokka-1.7. I installed prokka-1.10 now and discovered that some input format check is applied that was apparently not in the prior version (1.7) -> [11:11:27] WARNING: Contig IDs must be less than 38 characters for Genbank compliance - Abiotrophia_defectiva_ATCC_49176_:_ACIN03000001. I do not know though if there is a check already integrated as suggested above (uniqueness of IDs).
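
The uniqueness check suggested in this issue could look roughly like the following Python sketch. It assumes the sequence ID is the first whitespace-delimited token of each header line, which is how the spaces in the headers above end up collapsing every contig onto one ID. The function name `find_duplicate_ids` is hypothetical and is not part of Prokka, and the second header in the example is made up to mirror the "similar for the other contigs" remark.

```python
from collections import Counter


def find_duplicate_ids(fasta_lines):
    """Return the sorted list of IDs that occur more than once.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# The first header is taken from the issue; the second is illustrative only.
headers = [
    ">Abiotrophia defectiva ATCC 49176 : ACIN03000001",
    "ACGT",
    ">Abiotrophia defectiva ATCC 49176 : ACIN03000002",
    "TTGA",
]
print(find_duplicate_ids(headers))  # ['Abiotrophia']
```

A non-empty result would be the point at which to warn the user and stop, before any gene prediction is attempted.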

COORDINATES: qualifier for Infernal/Aragorn output

I have been using Prokka to annotate de novo generated whole genome sequences of bacteria, based on species or a trusted database of proteins. I use the GBK output of Prokka to import the genome sequence into Artemis, where I do tweaks to the annotation, such as missed pseudogenes, for instance. I save the files as EMBL flat files for submission to ENA/SRA. Before submission I run the EnaValidator.jar to check for issues with the EMBL file. During these checks, it gives an error that turns out to be because of a space after the " COORDINATES: " qualifier. When I remove this in Artemis manually, the error is gone. I don't know where in the Prokka pipeline this space gets inserted, but it would be helpful to fix this (if possible).

Overly long locustag/prefix results in bad GenBank LOCUS lines

Prokka 1.10 can produce broken GenBank output with over-long identifiers in the LOCUS lines.

Sample input (anonymised since this is for a collaborator):

/opt/prokka-1.10/bin/prokka --outdir XYZ123draft2_prokka --prefix XYZ123draft2 --locustag XYZ123draft2 --compliant --kingdom Bacteria --gram neg --genus ... --species ... --strain XYZ123 --quiet XZY123_draft2.fasta

Sample output:

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig000001119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig000002170983 bp   DNA   linear       19-AUG-2014
...

The LOCUS identifier is too long for the strict GenBank format, and there is no white space between the (truncated) identifier and the sequence length, meaning for example Biopython complains.

Possible output with truncation (not ideal) to ensure a white space would be something like this:

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig0000 1119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig0000 2170983 bp   DNA   linear       19-AUG-2014
...

Possible output abusing the LOCUS line (also not ideal, but some parsers will cope):

$ grep LOCUS XYZ123draft2.gbk 
LOCUS       XYZ123draft2_contig000001 1119615 bp   DNA   linear       19-AUG-2014
LOCUS       XYZ123draft2_contig000002 2170983 bp   DNA   linear       19-AUG-2014
...

Expected output: Fail early complaining about the overly long identifiers which will cause problems, specifying which option should be changed.
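
As a rough illustration of the "fail early" behaviour requested above, here is a minimal Python sketch. The function name `check_locus_ids` is hypothetical, and the limit is passed in by the caller: the value 16 used in the example is an assumption for illustration, not a limit confirmed by this issue.

```python
def check_locus_ids(contig_ids, max_len):
    """Raise ValueError if any contig ID is longer than max_len characters.

    max_len is supplied by the caller; this sketch does not claim to know
    the limit that a given GenBank writer or parser enforces.
    """
    too_long = [cid for cid in contig_ids if len(cid) > max_len]
    if too_long:
        raise ValueError(
            f"{len(too_long)} contig ID(s) exceed {max_len} characters "
            f"(e.g. {too_long[0]!r}); consider a shorter --locustag/--prefix"
        )


# Example with an assumed limit of 16 characters (illustrative only).
try:
    check_locus_ids(["XYZ123draft2_contig000001"], max_len=16)
except ValueError as err:
    print(f"ERROR: {err}")
```

Running such a check before any annotation work starts would turn the malformed LOCUS line into an immediate, explained failure.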

Perl Exceptions

Hi

Installed Prokka and it ran fine. Wanted to add signalP and since then all gone wrong! I thought I needed to add a perl module and maybe updated through cpan now I get:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not read file 'minced -gff 'PROKKA_07152014/PROKKA_07152014.fna' |': No such file or directory
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.18.2/Bio/Root/Root.pm:449
STACK: Bio::Root::IO::_initialize_io /usr/local/share/perl/5.18.2/Bio/Root/IO.pm:270
STACK: Bio::Tools::GFF::new /usr/local/share/perl/5.18.2/Bio/Tools/GFF.pm:200
STACK: /usr/local/bin/prokka:589

Same exception for barrnap.

Both programs seem to run ok on their own.

Any suggestions?

Many thanks

Support prodigal 2.7 (git head)

Prodigal 2.7 has unfortunately changed the command line options in a non-compatible manner. -m was renamed to -n and -p was renamed to -m.

Prodigal 2.6    Prodigal 2.7
-m              -n
-p meta         -m anon
-p single       -m normal
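
One way to handle both versions is to build the command line in the 2.6 style and translate it when a 2.7 binary is detected. The Python sketch below encodes only the three renames listed in the table above; it has not been checked against Prodigal itself, and `translate_prodigal_args` is a hypothetical helper name.

```python
# Renames taken from the table in this issue (2.6 spelling -> 2.7 spelling).
OPTION_MAP_26_TO_27 = {
    ("-m",): ["-n"],
    ("-p", "meta"): ["-m", "anon"],
    ("-p", "single"): ["-m", "normal"],
}


def translate_prodigal_args(args):
    """Translate a 2.6-style argument list into the 2.7 spelling."""
    out = []
    i = 0
    while i < len(args):
        pair = tuple(args[i:i + 2])
        single = (args[i],)
        if pair in OPTION_MAP_26_TO_27:
            # Option plus value, e.g. "-p meta".
            out.extend(OPTION_MAP_26_TO_27[pair])
            i += 2
        elif single in OPTION_MAP_26_TO_27:
            # Bare flag, e.g. "-m".
            out.extend(OPTION_MAP_26_TO_27[single])
            i += 1
        else:
            out.append(args[i])
            i += 1
    return out


old = ["-i", "in.fna", "-c", "-m", "-g", "11", "-p", "meta", "-f", "sco", "-q"]
print(" ".join(translate_prodigal_args(old)))
# -i in.fna -c -n -g 11 -m anon -f sco -q
```

Because the output list is built separately from the input, a translated `-m anon` is never re-interpreted as the old `-m` flag.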

Include more info from minced in the CRISPR annotations

Lizzy Wilbanks has left a new comment on your post "Prokka - rapid prokaryotic annotation":

Thanks for this great tool! So useful!! One thing that might be a nice addition for future releases would be providing more of the information from minced about the CRISPR regions - maybe as a separate output file? I've been re-running this to get the locations of the direct repeats and spacer sequences.

Posted by Lizzy Wilbanks to The Genome Factory at 31 July 2014 04:26

CLUSTERS.hmm corrupted in the tarball?

Greetings
When updating to Prokka 1.9 and running 'prokka --setupdb', I got this error message and the setup aborted:

[16:37:51] Running: hmmpress '/home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Error: File /home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm does not appear to be in a recognized HMM format.

[16:37:51] Could not run command: hmmpress '/home/jcarrico/NGStools/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Any idea how to solve this? Is it a corrupted file? I downloaded the tarball just a few minutes ago. Thanks in advance!

parallel version

I am really sorry to bother you with this question. Is there any problem with the new parallel20141022 version? Prokka does not recognise its version number correctly and always asks me to update it, even though I use the most up-to-date version. Should I downgrade it? Which is the preferred parallel version?
Thanks for any help!

New/Customized HMM Databases

Hi,
I've been using Prokka a bit with the default options and databases. I recently added the vFAM HMM database in /opt/prokka/db/hmm. After indexing, it was recognized successfully (prokka --listdb).
However, upon running Prokka, I see (from the log) that hmmer3 runs only for the default HMM databases (Pfam,CLUSTERS,HAMAP). Is there any way to confirm that the Prokka run actually used the new database?
Also, we work mostly on metagenomics projects so we are really looking forward to the kingdom=ALL option from the To-Do list.

Thank you,
Chandni

prokka and parallel parallel-20140222

HI.
Working with Prokka, really nice package and super-easy to run.

I noticed a small bug: when using Prokka with parallel-20140222 installed I got an error during the annotation step, this:
[16:05:51] Could not run command: cat MyAnnotation_MyGenomeproteins.faa | parallel --gnu -j 4 --block 166030 --recstart ...........

Launching the command outside the pipeline, I found that parallel was crashing. This is the message:

parallel: Error: -g has been retired. Use --group.
parallel: Error: -B has been retired. Use --bf.
parallel: Error: -T has been retired. Use --tty.
parallel: Error: -U has been retired. Use --er.
parallel: Error: -W has been retired. Use --wd.
parallel: Error: -Y has been retired. Use --shebang.
parallel: Error: -H has been retired. Use --halt.
parallel: Error: --tollef has been retired. Use -u -q --arg-sep -- and --load for -l.

so I reverted back to parallel-20130422 and now everything seems to work properly, even inside the prokka pipeline.

Maybe this is of some help.
best
m.


Marco Fondi, PhD
Dep. of Biology, University of Florence
Via Madonna del Piano 6, S. Fiorentino, Florence, Italy
Tel. +39 055 4574736

Improve the cleanup_product() function

This function makes lots of mistakes:

  • Bug: HI0933-like protein => -like protein
  • Bug: IS1251-like transposase => -like transposase
  • Bug: transcription termination factor Rho => hypothetical protein
  • Bug: xx kDa SS-A/Ro ribonucleoprotein homolog => hypothetical protein
  • [12:09:23] Modify product: conserved protein with nucleoside triphosphate hydrolase domain => hypothetical protein
  • [12:09:23] Modify product: 23S rRNA m(2)G2445 methyltransferase => 23S rRNA m(2) methyltransferase
  • [12:09:23] Modify product: DNA replication terminus site-binding protein => hypothetical protein
  • [12:09:24] Modify product: conserved inner membrane protein => hypothetical protein
  • [12:09:24] Modify product: 16S ribosomal RNA m2G1207 methyltransferase => 16S ribosomal RNA methyltransferase
  • [12:09:25] Modify product: hypothetical protein TTC0453 => hypothetical protein
  • [12:09:25] Modify product: type VI secretion protein, VC_A0107 family => type VI secretion protein, family
  • [12:09:26] Modify product: conserved hypothetical pathogenicity island protein => hypothetical protein
  • [12:09:26] Modify product: IS1400 transposase B => transposase B
  • [12:09:26] Modify product: Dyp-type peroxidase family => Dyp-type peroxidase family protein

prokka --setupdb should check binaries

I think that the check of versions of tools and the PATH extension with $BINDIR should be done before running setup_db sub:

$ ./bin/prokka --setupdb
[19:59:35] Cleaning databases in /tmp/prokka-1.9/bin/../db
[19:59:35] Cleaning complete.
[19:59:35] Making kingdom BLASTP database: /tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot
[19:59:35] Running: makeblastdb -dbtype prot -in '/tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot' -logfile /dev/null
sh: 1: makeblastdb: not found
[19:59:35] Could not run command: makeblastdb -dbtype prot -in '/tmp/prokka-1.9/bin/../db/kingdom/Archaea/sprot' -logfile /dev/null

Also the item "use included binary if PATH one is wrong version [Simon Gladman]" from TODO in doc/ChangeLog.txt would be helpful, since having a wrong version of hmmpress in the PATH leads to this error:

$ ./bin/prokka --setupdb
...
[20:01:03] Pressing HMM database: /tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm
[20:01:03] Running: hmmpress '/tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Error: File /tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm does not appear to be in a recognized HMM format.

[20:01:03] Could not run command: hmmpress '/tmp/prokka-1.9/bin/../db/hmm/CLUSTERS.hmm'

Could I use prokka with scaffolds?

Hello everybody,

I'd like to use prokka and I have scaffolds of a draft genome. Could I use prokka to annotate it or should I use contigs?

Best Regards,

Daniel

Select Barrnap or RNAmmer

I have both Barrnap and RNAmmer installed, Prokka detects both, and seems to use Barrnap by default. How do I select which is used?

[14:36:48] Looking for 'barrnap' - found /usr/local/bin/barrnap
[14:36:48] Determined barrnap version is 0.4
[14:36:49] Looking for 'rnammer' - found /usr/local/bin/rnammer
[14:36:49] Determined rnammer version is 1.2
[14:36:49] Predicting Ribosomal RNAs
[14:36:49] Running Barrnap with 4 threads

Namespace collisions with default contig ID naming

Having just used Prokka 1.8 on several strains I am left with *.fna and *.gbk (etc) files with ambiguous identifiers like gnl|PROKKA|contig000001 which appear in all my strains.

Referring to http://www.ncbi.nlm.nih.gov/genomes/static/Annotation_pipeline_README.txt (linked to in the Prokka script - thank you) the NCBI say:

The fasta file should look like this:
 >gnl|center|<ID1> [organism=<ORGANISM NAME STRAIN NAME>] [strain=<STRAIN NAME>] [gcode=11]
 <NUCLEOTIDE SEQUENCE>

NOTE: The |center|<ID1> part of the header must be less than 38 characters

An example of a fasta header for the Bacterium bacterius 253 is:

>gnl|LrgU|Contig01 [organism=Bacterium bacterius 253] [strain=253] [gcode=11]

I propose that rather than using gnl|$centre|contig%06d where the ID is just contig000001 etc, Prokka prefixes this (if a prefix is specified, and under 38 - 6 = 32 characters). This prefix could be a new command line option, or perhaps reuse the existing strain or locus tag prefix?

e.g. I would like to be able to request gnl|PROKKA|XXX_contig000001 and gnl|PROKKA|YYY_contig000001 for strains XXX and YYY.
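
The proposed naming scheme can be sketched in Python as follows. The helper name `make_contig_id` is hypothetical, and the length check is one literal reading of the NCBI note quoted above (the `|center|<ID1>` part, including its leading pipe, must be shorter than 38 characters); treat that interpretation as an assumption.

```python
def make_contig_id(centre, index, prefix=None, limit=38):
    """Build gnl|<centre>|[<prefix>_]contig%06d and enforce the length note.

    Assumption: the measured part is '|<centre>|<ID>' and it must be
    strictly shorter than `limit` characters.
    """
    local_id = f"contig{index:06d}"
    if prefix:
        local_id = f"{prefix}_{local_id}"
    measured = f"|{centre}|{local_id}"
    if len(measured) >= limit:
        raise ValueError(
            f"{measured!r} is {len(measured)} characters; "
            f"it must be less than {limit}"
        )
    return f"gnl{measured}"


print(make_contig_id("PROKKA", 1, prefix="XXX"))  # gnl|PROKKA|XXX_contig000001
print(make_contig_id("PROKKA", 1, prefix="YYY"))  # gnl|PROKKA|YYY_contig000001
```

With a per-strain prefix the IDs stay unique when several annotated strains are combined in one analysis.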

'--proteins' reference introduces too many "paralogs"

I noticed that if you use a reference genome for annotation with option '--proteins', lots of false paralogs will be annotated (gene names which are numerated with an underscore '_'; see hash %collide in lines https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L985-1013).

For annotation Prokka just takes the best BLASTP hit (lowest evalue). Might it be useful to have some more options to include the possibility of subject/query coverage/identity cutoffs, in addition to option '--evalue'?

For this purpose BioPerl includes BLAST HSP tiling via the 'frac*' methods, which can be used to skip BLASTP results that don't satisfy these constraints, e.g.:
$hit->frac_identical('query')
$hit->frac_aligned_hit
These could be included in the BLASTP parsing routine (line https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L921 onwards).
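
The selection logic being proposed (keep the lowest e-value hit, but only among hits that also pass identity and coverage cutoffs) is sketched below in Python. The dictionary keys, the function name `best_hit`, the default thresholds and the example hits are all hypothetical; in Prokka the values would come from the BioPerl methods named above.

```python
def best_hit(hits, max_evalue=1e-6, min_identity=0.0, min_coverage=0.0):
    """Return the lowest e-value hit that passes all cutoffs, or None.

    Each hit is a dict with 'evalue', 'identity', 'query_cov' and
    'subject_cov'; the last three are fractions between 0 and 1.
    """
    passing = [
        h for h in hits
        if h["evalue"] <= max_evalue
        and h["identity"] >= min_identity
        and h["query_cov"] >= min_coverage
        and h["subject_cov"] >= min_coverage
    ]
    return min(passing, key=lambda h: h["evalue"], default=None)


# Made-up hits: the first has the best e-value but covers little of the subject.
hits = [
    {"name": "geneA", "evalue": 1e-40, "identity": 0.95,
     "query_cov": 0.98, "subject_cov": 0.30},
    {"name": "geneB", "evalue": 1e-25, "identity": 0.85,
     "query_cov": 0.97, "subject_cov": 0.96},
]
print(best_hit(hits)["name"])                                       # geneA
print(best_hit(hits, min_identity=0.8, min_coverage=0.8)["name"])  # geneB
```

With e-value alone, a short domain match can win and produce a spurious numbered "paralog"; adding coverage cutoffs lets the full-length match be chosen instead.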

tbl2asn fails silently with pipe >> | << character in contig names

James Doonan: I couldn't get the full complement of files from the prokka output for two of my whole genome sequences. I discovered that it was to do with the contig names. The whole genome contig was called >scf7180000000002|quiver. The output from prokka was missing the genbank and sequin files. When I changed the contig name to just '>scf71' it gave me all the files as output.
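
Until the underlying problem is fixed, the workaround described here (renaming the contigs) can be automated. The Python sketch below replaces pipe characters in the ID part of each FASTA header and leaves any description text alone; `replace_pipes_in_headers` is a hypothetical helper, and using an underscore as the replacement is an arbitrary choice.

```python
def replace_pipes_in_headers(fasta_lines, replacement="_"):
    """Replace '|' in the ID (text before the first space) of each header."""
    fixed = []
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id, sep, description = line[1:].partition(" ")
            line = ">" + seq_id.replace("|", replacement) + sep + description
        fixed.append(line)
    return fixed


records = [">scf7180000000002|quiver", "ACGTACGT"]
print(replace_pipes_in_headers(records))
# ['>scf7180000000002_quiver', 'ACGTACGT']
```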

*.faa result files methionine/stop codon

The resulting .faa files from Prokka include stop codons ('*'), and proteins with atypical (non-ATG) start codons don't start with methionine (neither is NCBI standard). This is remedied by changing line https://github.com/Victorian-Bioinformatics-Consortium/prokka/blob/master/bin/prokka#L1101:

$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode) );

to:

$faa_fh->write_seq( $p->translate(-codontable_id=>$gcode, -complete => 1) );

See BioPerl HOWTO: http://www.bioperl.org/wiki/HOWTO:Beginners#Translating
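
For readers who want to fix existing .faa files rather than patch the script, the effect described in this issue can be approximated after the fact. The Python sketch below drops a trailing stop symbol and forces the first residue to methionine; it mimics the outcome requested above and is not BioPerl's implementation. `as_complete_cds` is a hypothetical name.

```python
def as_complete_cds(protein):
    """Strip a trailing '*' and make the first residue 'M'."""
    if protein.endswith("*"):
        protein = protein[:-1]
    if protein and protein[0] != "M":
        protein = "M" + protein[1:]
    return protein


print(as_complete_cds("VKLTNE*"))  # MKLTNE
print(as_complete_cds("MKLTNE"))   # MKLTNE
```

This is only appropriate for complete coding sequences; a partial CDS at a contig edge has no true start codon to rewrite.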

Minor .gbk file issues

The LOCUS entries in the .gbk file don't put a space in between the name of the contig (in this case) and its size, which means BioPerl gets upset when it tries to read the file.
Additionally, it would be helpful if there was a default entry for the ACCESSION and VERSION fields, just a '.' would do, as other programs read the file incorrectly when these entries are blank (such as pmauve).

Missing short genes

Prodigal penalizes predicting genes that are shorter than 250 bp (83 aa). As a result, Prokka is missing a number of short proteins that do exist in a closely related species. Any thoughts on how to deal with this? By checking with prodigal -s I've learned that Prodigal is in fact predicting the genes, but they have a score < 0 and so are discarded.

minced version upgrade error

Hi,

I recently queued Prokka for a large dataset, split into 7-8 parts so I could run them all together. After they all finished, my next script to extract output data failed for some runs. Upon some investigation I saw that there were a few failed runs with the log entry:
" Prokka needs minced 1.6 or higher. Please upgrade and try again. "

I am not sure why this error came up for only some runs while the others finished successfully. It was an extremely small fraction of the total runs, and I just queued those contigs again - but I thought worthwhile to mention it here, in case someone else had a similar issue.

Thanks!
Chandni

tbl2asn: no .gbk or .sqn files

I'm sorry to bother you with this issue here, but despite installing a new version of tbl2asn and being able to call it from the command line, we've not been able to produce .gbk files using prokka. An example final set of lines from the .log follow:

[21:15:14] Writing outputs to /home/cooper/Vaughn/tmp/pneumo/ref//
[21:15:17] Generating annotation statistics file
[21:15:17] Generating Genbank and Sequin files
[21:15:17] Running: tbl2asn -V b -a r10k -l paired-ends -M n -N 1 -y 'Annotated using prokka 1.10 from http://www.vicbioinformatics.com' -Z /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.err -i /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.fsa 2> /dev/null
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//errorsummary.val
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.dr
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.fixedproducts
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.ecn
[21:15:17] Deleting unwanted file: /home/cooper/Vaughn/tmp/pneumo/ref//pneumo.val
[21:15:17] Output files:
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.faa
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.tbl
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.txt
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.log
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.fsa
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.fna
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.err
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.gff
[21:15:17] /home/cooper/Vaughn/tmp/pneumo/ref/pneumo.ffn

Tag stable releases

Tagging stable releases could be a good way to download the Prokka code without downloading the large databases.
