
biocode's People

Contributors

arnaudbelcour, jonathancrabtree, jorvis, jwdebler, kastman, ktmeaton, ktretina, mfitzp, mr-c, pgonzale60, priti88, rpinerd, zpgao


biocode's Issues

Attribute error for update_selected_column9_values.py

Hi,
I am trying to add "EC-numbers" to a gff file.
This is my command:

python update_selected_column9_values.py -i LMA_1258_IMG.gff3 -u ID_EC_onlycol1258.tab -k 'ID' -a 'ec_num' -o LMA_1258_IMG_EC.gff

And this is the error I am getting:

Traceback (most recent call last):
File "update_selected_column9_values.py", line 100, in <module>
main()
File "update_selected_column9_values.py", line 90, in main
atts = gff.column_9_dict(cols[8])
AttributeError: module 'biocode.gff' has no attribute 'column_9_dict'

What am I doing wrong?

Thank you

convert_gff3_to_ncbi_tbl

Can someone tell me what assumptions convert_gff3_to_ncbi_tbl makes about how names are formatted? Ours apparently lack something:

python3 gff/convert_gff3_to_ncbi_tbl.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.tbl -ln LAB -nap NAP -gf ../juncus.fasta 
Traceback (most recent call last):
  File "gff/convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "gff/convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 95, in print_tbl_from_assemblies
    print_biogene(gene=gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 122, in print_biogene
    raise Exception("ERROR: locus_tag attributes are required for all gene elements (gene id: {0}".format(gene.id))
Exception: ERROR: locus_tag attributes are required for all gene elements (gene id: Transcript_32960|g.33387

ping @arsilan324

Minor fasta script naming issue

The fasta/ subdirectory contains a number of scripts, pretty much all of which deal with multi-FASTA files rather than single-sequence FASTA files. Most (but not all) of the scripts have "fasta" somewhere in their name, but there's a lone script that uses the term "multifasta" instead: "compare_two_multifastas.pl".

report_gff_intron_and_intergenic_stats.py error message Detected assembly with undefined or 0 length

When I run this script, I get a message that I'm not sure how to troubleshoot.

$ /home/cmccracken/biocode/gff/report_gff_intron_and_intergenic_stats.py -i final_annotation_bmi_20140606.fixed.newIDs.gff3
/usr/local/packages/Python-3.2.3/lib/python3.2/subprocess.py:389: RuntimeWarning: The _posixsubprocess module is not being used. Child process reliability may suffer if your program uses threads.
"program uses threads.", RuntimeWarning)
Traceback (most recent call last):
File "/home/cmccracken/biocode/gff/report_gff_intron_and_intergenic_stats.py", line 212, in <module>
main()
File "/home/cmccracken/biocode/gff/report_gff_intron_and_intergenic_stats.py", line 91, in main
raise Exception("ERROR: Detected assembly with undefined or 0 length: {0}".format(assembly.id))
Exception: ERROR: Detected assembly with undefined or 0 length: ChromosomeIII_BmicrotiR1

convert_gff3_to_gbk.py with no embedded FASTA

convert_gff3_to_gbk.py currently does not support GFF3 files without embedded FASTA sequences. Here is the error that I got:

"raise Exception("ERROR: CDS.get_residues() requested but its molecule {0} has no stored residues".format(mol.id))"

I guess the FASTA could be passed through a command line option when running the file.

Model UTRs explicitly

Kyle - This is something you requested, but could you add a comment with a bit more information? Do you just need the class to be created or do you have a file already where they could be included? (I expect a GFF file where the mRNA/exon feature coordinates are outside of the range of the CDS ones.)

Keep in mind the GFF specification (scroll down to the section labeled "The Canonical Gene")
http://www.sequenceontology.org/gff3.shtml

And the SO definition:
http://www.sequenceontology.org/miso/current_release/term/SO:0000203

bioannotation.py - check for properly-formed EC numbers

Some sources in public HMM and BLAST libraries assert malformed EC numbers, such as "1.2.1.n2". Given how these are used, I think the proper behavior is to warn when the user attempts to add a malformed EC number, but not to throw an exception.
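
A minimal sketch of the warn-but-don't-raise behavior described above. The function names and the pattern are assumptions for illustration, not the bioannotation.py API: the pattern accepts four dot-separated fields where trailing fields may be "-" (e.g. "4.3.99.-"), and this sketch chooses to skip the malformed value after warning, which the issue leaves open.

```python
import re
import warnings

# Assumed definition of "well-formed": digits in each field, with any
# trailing fields allowed to be "-" (1.2.3.4, 1.2.3.-, 1.2.-.-, 1.-.-.-).
EC_PATTERN = re.compile(r"\d+\.(?:\d+\.(?:\d+\.(?:\d+|-)|-\.-)|-\.-\.-)")


def is_well_formed_ec(value):
    """Return True if value matches the assumed EC number pattern."""
    return EC_PATTERN.fullmatch(value) is not None


def add_ec_number(ec_numbers, value):
    """Append value to ec_numbers if well-formed; otherwise warn and skip.

    Hypothetical helper: returns True when the value was stored.
    """
    if not is_well_formed_ec(value):
        warnings.warn("Ignoring malformed EC number: {0}".format(value))
        return False
    ec_numbers.append(value)
    return True
```

With this sketch, "1.2.1.n2" triggers a warning and is not stored, while "4.3.99.3" is appended normally.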

correct_gff_feature_order.pl doesn't work

Hello,

When I run the script correct_gff_feature_order.pl, I get this error

Can't locate bioUtils.pm in @INC (you may need to install the bioUtils module) (@INC contains: /Users/arslan/Documents/Juncus/EMBL/EMBLmyGFF3/../lib /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.2/darwin-thread-multi-2level /Library/Perl/Updates/5.18.2 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at correct_gff_feature_order.pl line 76.
BEGIN failed--compilation aborted at correct_gff_feature_order.pl line 76.

Could you please explain how I can fix this?
Thanks

write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff

Running write_fasta_from_gff.py on the output of convert_metagenemark_gff_to_gff3.py, I observed some large discrepancies between the number of CDS features in the GFF3 file and the number of CDS features written by write_fasta_from_gff.py. In one case the GFF3 file contained 16042 CDS features, but the FASTA output contained only 10299 sequences, a loss of 5743 CDS sequences, ~36% of the total.

Structural comparison script deletes all bed files

I'd like to shift your structural comparison script out of the sandbox and into production, but I did a quick code review and one part at the very end concerned me. It seems to delete all bed files in the output directory rather than keeping track of the specific temporary bed files it creates and just deleting them.

This certainly isn't an urgent issue, since this is currently a sandbox script.
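
One way this could be handled, sketched under the assumption that the script creates its intermediate BED files itself: record each temporary path at creation time and delete only those at the end. `TempFileTracker` and its method names are hypothetical and not part of biocode.

```python
import os
import tempfile


class TempFileTracker:
    """Remember the temporary files we create so cleanup touches nothing else."""

    def __init__(self, directory):
        self.directory = directory
        self.paths = []

    def create(self, suffix=".bed"):
        """Create an empty temporary file in the output directory and track it."""
        fd, path = tempfile.mkstemp(suffix=suffix, dir=self.directory)
        os.close(fd)
        self.paths.append(path)
        return path

    def cleanup(self):
        """Delete only the files created through this tracker."""
        for path in self.paths:
            if os.path.exists(path):
                os.remove(path)
        self.paths = []
```

Any BED file a user already had in the output directory is left alone, because it was never registered with the tracker.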

Annotation parsing file output options

For the annotation parsing file, it would be helpful to add a few options for different output types. GFF3 and protein fasta files are already generated. I would like to see these nucleic acid file outputs, as well:

  1. gene (full gene, including UTRs, start, stop, exons and introns).
  2. coding sequence (CDS from start to stop that get translated into protein)
  3. gene plus 1000 bases up and downstream of start and stop.

Thanks!
Marcus
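
For the third output type, the flanking region has to be clamped where a gene sits close to a contig end. A small sketch of that coordinate arithmetic, assuming 1-based inclusive GFF coordinates; `flanked_region` is a hypothetical helper, not an existing biocode function.

```python
def flanked_region(start, end, contig_length, flank=1000):
    """Extend a 1-based inclusive (start, end) by `flank` bases on each side,
    clamped so the result stays within 1..contig_length."""
    return max(1, start - flank), min(contig_length, end + flank)
```

For example, a gene at 500..2000 on a 2500 bp contig would yield 1..2500 rather than running off either end.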

Needed: Speed-optimized FASTQ statistics script

One of the really common tasks when given a FASTQ file is to find the following statistics:

  • total read count
  • total base count

While this is trivial in itself, the more interesting part is finding the method that performs best. Because this will be an important component of a few other projects, speed and proper error handling are important. Most apps assume Python, but I'm up for implementations in whatever language will give the best results here, as long as they don't open up a huge can of worms dependency-wise.
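
As a baseline to benchmark faster implementations against, here is a minimal pure-Python sketch. It assumes strict four-line FASTQ records (no wrapped sequence lines, no trailing blank lines) and treats a ".gz" suffix as gzip-compressed; `fastq_stats` is a hypothetical name, and no claim is made that this is the fastest approach.

```python
import gzip


def fastq_stats(path):
    """Return (read_count, base_count) for a four-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    reads = bases = line_count = 0
    with opener(path, "rb") as fh:
        for line_count, line in enumerate(fh, start=1):
            slot = (line_count - 1) % 4  # 0=header, 1=sequence, 2='+', 3=quality
            if slot == 0:
                if not line.startswith(b"@"):
                    raise ValueError("line {0}: expected '@' header".format(line_count))
                reads += 1
            elif slot == 1:
                bases += len(line.rstrip(b"\r\n"))
            elif slot == 2 and not line.startswith(b"+"):
                raise ValueError("line {0}: expected '+' separator".format(line_count))
    if line_count % 4 != 0:
        raise ValueError("truncated file: {0} lines is not a multiple of 4".format(line_count))
    return reads, bases
```

Reading in binary mode avoids decoding overhead, and the structural checks give basic error handling for truncated or malformed input.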

Python 2 Compatibility?

Hi Josh,

Thanks for sharing biocode. Any interest in accepting pull requests for single files (I'm just looking at fastq/randomly_subsample_fastq.py) that would cheaply add py2 compatibility (change the hashbang, cast to float for future division)? On the bright side it would increase usability, but I understand if you only want to test against py3. Thanks for putting this out there,

NameError: name 'utils' is not defined

I am trying to run convert_gff3_to_ncbi_tbl.py script but getting this error.


ubt80:EMBLmyGFF3 arslan$ python3 convert_gff3_to_ncbi_tbl.py -i juncus.fasta.transdecoder.refined.gff3 -o arslan.tbl -ln TEST -nap JE -gf juncus-rp.fasta
INFO: splitting mRNA off gene Transcript_138016|g.186294
Traceback (most recent call last):
  File "convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/biocode/tbl.py", line 86, in print_tbl_from_assemblies
    print_biogene(gene=new_gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/biocode/tbl.py", line 117, in print_biogene
    gene_coords = utils.interbase0_to_humancoords(gene_loc.fmin, gene_loc.fmax, gene_loc.strand)
NameError: name 'utils' is not defined

Could you please explain how I can fix it?
Thanks

Intergenic space shorter than expected in "report_gff_intron_and_intergenic_stats.py"

Hi,
When I calculate the intergenic space of a contig with report_gff_intron_and_intergenic_stats.py and add the total length of the genes on that contig, the sum is shorter than the total length of the contig. My assumption is that the code does not count the stretch from the beginning of the contig to the start of the first gene, or from the end of the last gene to the end of the contig.

Cheers,
Pezhman
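
To make the expectation concrete, here is a sketch of an intergenic calculation that can optionally count the two contig-end stretches. It assumes 1-based inclusive gene coordinates and merges overlapping genes; `intergenic_lengths` is a hypothetical function, not the script's actual logic.

```python
def intergenic_lengths(contig_length, genes, include_ends=True):
    """Return the lengths of intergenic stretches on one contig.

    genes: iterable of (start, end) pairs, 1-based inclusive.
    include_ends: also count contig start -> first gene and last gene -> contig end.
    """
    gaps = []
    cursor = 1  # first position not yet covered by a gene
    for start, end in sorted(genes):
        if start > cursor and (cursor > 1 or include_ends):
            gaps.append(start - cursor)
        cursor = max(cursor, end + 1)
    if include_ends and cursor <= contig_length:
        gaps.append(contig_length - cursor + 1)
    return gaps
```

With `include_ends=True`, the intergenic total plus the total (non-overlapping) gene length adds up to the contig length; with `include_ends=False` only the gaps between neighboring genes are reported.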

Can't import things with write_fasta_from_gff.py

After trying to run this script with the command below, I had trouble importing utils/things. I've used this script before (checkout and relevant .bashrc line below), so this must be due to recent changes. I noticed that these modules were in the */biocode/lib/biocode/ sub-directory, so I added that to my PYTHONPATH (below), and got the same error.

$ python ~/git/biocode/gff/write_fasta_from_gff.py -i ref.gff3 -f ref.fasta -o ref.fasta -t cds

Traceback (most recent call last):
File "/home/ktretina/git/biocode/gff/write_fasta_from_gff.py", line 31, in <module>
from biocode import utils, gff
File "/home/jorvis/git/biocode/lib/biocode/gff.py", line 4, in <module>
from biocode import things, annotation
File "/home/jorvis/git/biocode/lib/biocode/things.py", line 3, in <module>
from biocode import utils, gff, tbl
File "/home/jorvis/git/biocode/lib/biocode/tbl.py", line 3, in <module>
from biocode import utils, things

Checkout
/home/ktretina/git/biocode/

.bashrc file
export PYTHONPATH=$PYTHONPATH:/home/jorvis/lib:/home/jorvis/svn/jorvis/utilities/lib:/home/jorvis/git/biocode/lib:/home/jorvis/git/Emergence/emergence/apps:/home/ktretina/git/biocode/lib/biocode/

biocodeutils.py add error output mRNA ID

When the biocodeutils function called "translate" finds an unknown codon, it currently will deal with it like this:

print("WARN: Encountered unknown codon during translation: {0}".format(seq[x:x+3]))

Could you please add the mRNA ID to this output? I think you'll have to add that to the function input. This will help when trying to track down the sequences with this issue.

Thanks!
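
A sketch of what the requested change could look like. This is a stand-alone re-implementation for illustration, not the biocodeutils code: the `seq_id` parameter, the use of "X" for unknown codons, and writing the warning to STDERR are all assumptions.

```python
import sys
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, seq_id=None):
    """Translate a nucleotide string; report unknown codons together with seq_id."""
    seq = seq.upper()
    protein = []
    for x in range(0, len(seq) - 2, 3):
        codon = seq[x:x + 3]
        residue = CODON_TABLE.get(codon)
        if residue is None:
            print("WARN: Encountered unknown codon during translation "
                  "of {0}: {1}".format(seq_id or "unknown sequence", codon),
                  file=sys.stderr)
            residue = "X"
        protein.append(residue)
    return "".join(protein)
```

Callers that know the mRNA ID would pass it along, e.g. `translate(cds_seq, seq_id="mRNA4")`, so the warning names the offending sequence.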

report_gff3_statistics.py unsupported operand type(s) error

When using report_gff3_statistics.py, I get the following error:

/usr/local/packages/Python-3.3.2/bin/python3 /home/jorvis/git/biocode/gff/report_gff3_statistics.py -i 175.annotation.gff3
The biothings.py is still under testing and development. Please feel free to try using it, though the API is in flux.
Traceback (most recent call last):
File "/home/jorvis/git/biocode/gff/report_gff3_statistics.py", line 74, in <module>
main()
File "/home/jorvis/git/biocode/gff/report_gff3_statistics.py", line 41, in main
type_lengths['assembly'] += assemblies[assembly_id].length
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

The full path of the gff3 file is: /usr/local/projects/mucormycosis/annotation/175/175.annotation.gff3 in the IGS filesystem.

write_fasta_from_gff.py error

When running write_fasta_from_gff.py, I'm getting an error and the output file contains only a small portion of the number of proteins that should be present (protein count varies every time).

The command I'm using is: python ~/git/biocode/gff/write_fasta_from_gff.py -i BV115/BV115.gff3 -f BV115/BV115.fasta -o test.txt

cwd: /local/scratch/ncpalmateer/silva_lab/p67

Error message:
Traceback (most recent call last):
File "/home/Nicholas.Palmateer/git/biocode/gff/write_fasta_from_gff.py", line 126, in <module>
main()
File "/home/Nicholas.Palmateer/git/biocode/gff/write_fasta_from_gff.py", line 87, in main
coding_seq = feat.get_CDS_residues(for_translation=True)
File "/home/Nicholas.Palmateer/git/biocode/lib/biocode/things.py", line 1093, in get_CDS_residues
chop = sorted_cds[0].phase
IndexError: list index out of range

path to checkout: /home/Nicholas.Palmateer/git/biocode
$PYTHONPATH in .bashrc: /home/Nicholas.Palmateer/git/biocode/lib

Add motif predictions to parse_ergatis_euk_functional_pipeline.py

The euk functional annotation script (sandbox/jorvis/parse_ergatis_euk_functional_pipeline.py) might be augmented with some additional evidence. I propose adding the following predictions:
SignalP
SecretomeP
TMHMM
TargetP
(More information can be found here: http://www.cbs.dtu.dk/services/ and there are additional prediction tools there, as well.)

With respect to how to handle the annotation name in column 9 of the GFF3 file, I propose adding information to those names that would otherwise be "Hypothetical protein" due to lack of significant matches to other evidence (e.g. no named BLAST hits from UniProt, nor any HMM results). For example, if a protein is putatively secreted, but otherwise has no annotation, we might call it "Hypothetical secreted protein", and if a protein localizes to the membrane, it could be called "Hypothetical transmembrane protein".

For database submissions, this might not be useful (as GenBank would reject annotations following such nomenclature), but we could parse those prior to submission to GenBank. (For example, all proteins called "Hypothetical" followed by any other text would be renamed "Hypothetical protein".)
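
The naming and the pre-submission cleanup proposed above could be sketched roughly as follows. Both function names are hypothetical, and the precedence of "secreted" over "transmembrane" when both predictions apply is an arbitrary assumption the proposal does not settle.

```python
import re


def hypothetical_name(is_secreted=False, is_transmembrane=False):
    """Pick a product name for a protein with no other named evidence.

    Assumption: a secretion prediction wins when both flags are set.
    """
    if is_secreted:
        return "Hypothetical secreted protein"
    if is_transmembrane:
        return "Hypothetical transmembrane protein"
    return "Hypothetical protein"


def genbank_safe_name(name):
    """Collapse any 'Hypothetical ...' product name to 'Hypothetical protein'."""
    if re.match(r"hypothetical\b", name, flags=re.IGNORECASE):
        return "Hypothetical protein"
    return name
```

Running `genbank_safe_name` over every product name just before export would leave other annotations untouched while stripping the extra qualifiers.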

Insert EC numbers into chado database issue

Hi Josh,

I am trying to run your script "insert_ec_number.pl" in the chado folder to insert an EC number into a chado database. I get the following error:

[snadendla@thanos chado]$ perl insert_ec_number.pl --ec_number=4.3.99.3 --name=7-carboxy-7-deazaguanine synthase --database=hcon2 --user=XXXX --password=XXXX --server=manatee-db --database_type=mysql
attempting to create database connection
INFO: got db_id 8 for name EC
INFO: got cv_id 7 for name EC
Unable to find cvterm_id corresponding to base accession 4.3.99.-. Check the base term? at insert_ec_number.pl line 283.

I tried adding the base 4.3.99.- but still get the same error as above.

What can I do to insert this EC number?

Thanks,
Suvvi

problem with compare_gene_structures.py

Dear,
I have a problem using the script compare_gene_structures.py. I get the following error:
File "/home/faino001/bin/biocode/gff/compare_gene_structures.py", line 612, in <module>
process_files(args)
File "/home/faino001/bin/biocode/gff/compare_gene_structures.py", line 322, in process_files
for exon_1 in sorted(feat_1) :
UnboundLocalError: local variable 'feat_1' referenced before assignment

any idea why?

thanks
Luigi

Augustus conversion failing

User @kayussky911 reports:

convert_augustus_to_gff3.py -i augustus_erins.gtf -o new_augustus

my input looks like,

scaffold10x_1 AUGUSTUS gene 3591 4530 0.27 - . g1
scaffold10x_1 AUGUSTUS transcript 3591 4530 0.27 - . g1.t1
scaffold10x_1 AUGUSTUS stop_codon 3591 3593 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 3591 3859 0.34 - 2 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 3591 3859 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS intron 3860 4022 0.28 - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 4023 4530 0.63 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 4023 4530 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS start_codon 4528 4530 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS gene 26186 31433 0.2 - . g2
scaffold10x_1 AUGUSTUS transcript 26186 31433 0.2 - . g2.t1
scaffold10x_1 AUGUSTUS stop_codon 26186 26188 . - 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS CDS 26186 26304 0.37 - 2 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS exon 26186 26304 . - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS intron 26305 29389 0.28 - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS CDS 29390 30220 0.39 - 2 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS exon 29390 30220 . - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS intron 30221 30844 0.45 - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS CDS 30845 31433 0.41 - 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS exon 30845 31433 . - . transcript_id "g2.t1"; gene_id "g2";

the output just says;

##gff-version 3

and that's it. So I think it's the last columns of the input files I need to work on.

convert_metagenemark_gff_to_gff3.py produces invalid GFF3

convert_metagenemark_gff_to_gff3.py echoes comment lines from MetaGeneMark unchanged. This is a problem when a comment line starts with "##FASTA" (e.g., as part of a predicted polypeptide), because GFF3 parsers are required to interpret such a line as the beginning of the GFF3 FASTA sequence section. One possible solution would be to tack on an extra "#" before echoing the comments. The situation is exacerbated by the fact that the current Biocode GFF3 parser will accept any line starting with "##FASTA" (e.g., "##FASTATKAANICDYENLAFMG") as the FASTA section delimiter (issue #32).
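
Both halves of the problem can be sketched in a few lines. The function names are hypothetical and this is not the converter's or parser's current code: the first follows the suggestion above of prefixing echoed comments with one extra "#", and the second treats a line as the FASTA delimiter only when it is exactly "##FASTA".

```python
def escape_comment(line):
    """Prefix an echoed MetaGeneMark comment line with one extra '#'.

    Lines that are not comments are returned unchanged.
    """
    if line.startswith("#"):
        return "#" + line
    return line


def is_fasta_delimiter(line):
    """Return True only for a line that is exactly '##FASTA' (ignoring trailing whitespace)."""
    return line.rstrip() == "##FASTA"
```

With the escape applied, "##FASTATKAANICDYENLAFMG" becomes "###FASTATKAANICDYENLAFMG", which no longer begins with "##FASTA"; with the stricter check, the unescaped line would not be mistaken for the delimiter either.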

report_gff3_statistics.py, 'Gene' object has no attribute 'length'

Hi,
When I am trying to run "report_gff3_statistics.py" script for a file which looks like what I have pasted below, I get this error:
Traceback (most recent call last):
File "report_gff_stat.py", line 113, in <module>
main()
File "report_gff_stat.py", line 56, in main
type_lengths['gene'] += gene.length
AttributeError: 'Gene' object has no attribute 'length'
The gff3 file:
scaffold_936 phytozome9_0 gene 5553 6897 . - . ID=gene4;Name=Aquca_936_00001
scaffold_936 phytozome9_0 mRNA 5553 6897 . - . ID=mRNA4;Parent=gene4;Name=Aquca_936_00001.1;pacid=22051342;longest=1
scaffold_936 phytozome9_0 three_prime_UTR 5553 5787 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 exon 5553 5897 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 CDS 5788 5897 . - 2 Parent=mRNA4;pacid=22051342
scaffold_936 . intron 5898 6021 . - . Parent=mRNA4
scaffold_936 phytozome9_0 exon 6022 6086 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 CDS 6022 6086 . - 1 Parent=mRNA4;pacid=22051342
scaffold_936 . intron 6087 6219 . - . Parent=mRNA4
scaffold_936 phytozome9_0 exon 6220 6305 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 CDS 6220 6305 . - 0 Parent=mRNA4;pacid=22051342
scaffold_936 . intron 6306 6802 . - . Parent=mRNA4
scaffold_936 phytozome9_0 CDS 6803 6895 . - 0 Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 exon 6803 6897 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 five_prime_UTR 6896 6897 . - . Parent=mRNA4;pacid=22051342

Thanks,

Pezhman

report_gc_content_by_feature_type.pl definition of telomere

It looks to me like report_gc_content_by_feature_type.pl defines a telomere as the region between the terminal exon on a contig and the end of the contig. Perhaps this definition varies by field, but I don't think that this is typically seen as the biological definition of a telomere, which is typically defined as something like "a region of repetitive sequence at the end of a chromatid." The two problems that come to mind are:

  1. Not all contigs contain the ends of chromosomes (i.e. chromosomes may not be fully sequenced or broken into several contigs).
  2. It includes the region between the repetitive sequence and the first annotated gene, which I think is usually considered part of the sub-telomeric region.

Maybe there was some application-specific reason for this addition. Please correct me if I am wrong on this, but I know a little about how you like to be very precise with your terminology, so I thought I should bring it up to be looked at further.

AttributeError: 'Gene' object has no attribute 'add_CDS'

Hello, I am trying to get intron and exon statistics using both your 'report_gff3_statistics.py' and 'report_gff_intron_and_intergenic_stats.py' and I am getting the AttributeError that is in the title.

stephenwyka@bspmgenomics:/data/wyka/Reference_genomes/originals$ /data/wyka/report_gff3_statistics.py -i Claviceps_purpurea_20_1.gff -o exon_report.txt
Traceback (most recent call last):
  File "/data/wyka/report_gff3_statistics.py", line 110, in <module>
    main()
  File "/data/wyka/report_gff3_statistics.py", line 30, in main
    (assemblies, features) = gff.get_gff3_features(args.input_file)
  File "/data/wyka/biocode/lib/biocode/gff.py", line 350, in get_gff3_features
    parent_feat.add_CDS(CDS)
AttributeError: 'Gene' object has no attribute 'add_CDS'

I downloaded this gff3 from GenBank and below is an example of the contents.

CAGA01000191.1	EMBL	region	1	224490	.	+	.	ID=id0;Dbxref=taxon:1111077;clone=scaffold00051;gbkey=Src;mol_type=genomic DNA;strain=20.1
CAGA01000191.1	EMBL	gene	3223	3902	.	-	.	ID=gene0;Name=CPUR_06801;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06801
CAGA01000191.1	EMBL	CDS	3642	3902	.	-	0	ID=cds0;Parent=gene0;Dbxref=NCBI_GP:CCE35373.1;Name=CCE35373.1;Note=CP_06801.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35373.1
CAGA01000191.1	EMBL	CDS	3223	3315	.	-	0	ID=cds0;Parent=gene0;Dbxref=NCBI_GP:CCE35373.1;Name=CCE35373.1;Note=CP_06801.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35373.1
CAGA01000191.1	EMBL	exon	3223	3315	.	-	.	ID=id1;Parent=gene0;gbkey=exon
CAGA01000191.1	EMBL	exon	3642	3902	.	-	.	ID=id2;Parent=gene0;gbkey=exon
CAGA01000191.1	EMBL	gap	7156	7946	.	+	.	ID=id3;estimated_length=791;gbkey=gap
CAGA01000191.1	EMBL	gene	11485	11880	.	+	.	ID=gene1;Name=CPUR_06802;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06802
CAGA01000191.1	EMBL	CDS	11485	11880	.	+	0	ID=cds1;Parent=gene1;Dbxref=NCBI_GP:CCE35374.1;Name=CCE35374.1;Note=CP_06802.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35374.1
CAGA01000191.1	EMBL	exon	11485	11880	.	+	.	ID=id4;Parent=gene1;gbkey=exon
CAGA01000191.1	EMBL	gene	11895	12257	.	-	.	ID=gene2;Name=CPUR_06803;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06803
CAGA01000191.1	EMBL	CDS	11895	12257	.	-	0	ID=cds2;Parent=gene2;Dbxref=NCBI_GP:CCE35375.1;Name=CCE35375.1;Note=CP_06803.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35375.1
CAGA01000191.1	EMBL	exon	11895	12257	.	-	.	ID=id5;Parent=gene2;gbkey=exon
CAGA01000191.1	EMBL	gene	13574	15125	.	-	.	ID=gene3;Name=CPUR_06804;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06804
CAGA01000191.1	EMBL	CDS	14956	15125	.	-	0	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	14507	14850	.	-	1	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	14135	14454	.	-	2	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	13822	14062	.	-	0	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	13574	13758	.	-	2	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	exon	13574	13758	.	-	.	ID=id6;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	13822	14062	.	-	.	ID=id7;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	14135	14454	.	-	.	ID=id8;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	14507	14850	.	-	.	ID=id9;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	14956	15125	.	-	.	ID=id10;Parent=gene3;gbkey=exon

Exclude mRNA features in bacterial TBL exports

From Suvvi:

Just a reminder that mRNA features need to be left out of the tbl file when converting a gff or genbank file to tbl.

The genome that I submitted with mRNA features was sent back because it had mRNA. Pasting the error here (just FYI):

“FATAL: DISC_BACTERIA_SHOULD_NOT_HAVE_MRNA:5 bacterial sequences have mRNA features

FATAL: DiscRep_ALL:DISC_BACTERIA_SHOULD_NOT_HAVE_MRNA::5 bacterial sequences have mRNA features

/tmp/tmp.zCDmHD4k9r:tig00000001_edited (length 4883137)
/tmp/tmp.zCDmHD4k9r:tig00000064_edited (length 1234209)
/tmp/tmp.zCDmHD4k9r:tig00000082_edited (length 415988)
/tmp/tmp.zCDmHD4k9r:tig00000065_edited (length 771583)
/tmp/tmp.zCDmHD4k9r:tig00000066_edited (length 630306)”

write_fasta_from_gff.pl

Hello,

I'd like to suggest that a check for a start and stop codon be added to this script (for both CDS and polypeptide sequences) for each sequence. Whether there is just a warning to STDOUT or to a log file does not matter so much to me, but this would be a very useful feature, particularly for an annotation project. I have recently found that such a check does not occur in WebApollo, and so this would be the most convenient place to add this check in our pipeline.

Thanks!
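
A sketch of the kind of check being requested, written in Python for illustration rather than as a patch to the Perl script. It assumes the standard genetic code with ATG as the only start codon, and `check_cds_termini` is a hypothetical name; a real implementation would need to allow for alternative starts and partial gene models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def check_cds_termini(seq_id, cds):
    """Return a list of warning strings for a CDS nucleotide sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("WARN: {0} length {1} is not a multiple of 3".format(seq_id, len(cds)))
    if not cds.startswith("ATG"):
        problems.append("WARN: {0} does not begin with a start codon".format(seq_id))
    if cds[-3:] not in STOP_CODONS:
        problems.append("WARN: {0} does not end with a stop codon".format(seq_id))
    return problems
```

The caller can print the returned messages or append them to a log file, whichever suits the pipeline.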

convert_gff3_to_gbk.py template error

I am getting the following error after a pip install of biocode when running convert_gff3_to_gbk.py.
raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: genbank_flat_file_header.template

convert_gff3_to_gbk.py, convert sequences with no annotation

Currently convert_gff3_to_gbk.py will only create GenBank entries for input sequences that have at least one feature localized to them in the GFF. However, one might want to create GenBank entries for genomic sequences (in the FASTA section of the input GFF3) that have no features localized to them. The description of the converter ("Converts GFF3 representing gene models to Genbank flat-file format.") does suggest that the conversion process is based around gene models rather than sequences, but since the GenBank flat file format is inherently sequence-based it would be good to at least have an option to include unannotated sequences in the conversion.

Script needed for assembly evaluation

We need a script that simulates fragmented sequences from more complete input sequences. This is perhaps best illustrated with a current use case.

We are using unsheared, paired-end reads aligned to transcriptome assemblies to determine real evidence for each, or even possibly group them further. We expect overlapping transcripts like this to be assembled:

5'---------------------3'
               5'----------------------------------3'

But paired-end grouping might also be able to pull these together, even inserting Ns given a known library insert size, if read mate pairs span the gap between them:

5'---------------------3'
                                          5'----------------------------------3'

So, here, this proposed script would allow me to take a known set of transcripts and artificially fragment them, generating some fragments that overlap and others that are separate from one another. This could be controlled with user-configurable options such as:

--min_overlap_distance=-200
--max_overlap_distance=100
--fragmentation_factor=6

Notice the negative value above, which allows for the 2nd case above where sequence fragments do not overlap. With these options, the script would transform a FASTA file with 1000 sequences into one with around 6000 sequences, with fragments generated with an overlap distance of up to 100bp and as far as 200bp apart from each other based on their parent sequence.

Data should be appended to the header descriptions in the product sequences to indicate their source and coordinates.
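
A rough sketch of how the fragmentation could work for a single sequence. Everything here is an assumption for illustration: the function name, the fixed fragment length of `len(seq) // factor`, the uniform choice of overlap between the minimum and maximum, and the header layout. The number of fragments produced is only approximately the fragmentation factor.

```python
import random


def fragment_sequence(seq_id, seq, factor=6, min_overlap=-200, max_overlap=100, rng=None):
    """Split seq into fragments; returns a list of (header, fragment) tuples.

    A negative overlap leaves a gap between consecutive fragments.
    Headers record the source ID and 1-based inclusive coordinates.
    """
    rng = rng or random.Random()
    frag_len = max(1, len(seq) // factor)
    fragments = []
    start = 0  # 0-based start of the next fragment
    while start < len(seq):
        end = min(len(seq), start + frag_len)
        header = "{0}_frag{1} source={0} coords={2}-{3}".format(
            seq_id, len(fragments) + 1, start + 1, end)
        fragments.append((header, seq[start:end]))
        if end == len(seq):
            break
        overlap = rng.randint(min_overlap, max_overlap)
        # Always advance by at least one base so the loop terminates.
        start = max(start + 1, end - overlap)
    return fragments
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when comparing assemblies of the same simulated fragments.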

correct_gff_feature_order.pl misplacing ##FASTA

Hello,
While the script seems to work as advertised otherwise, when given a GFF file with a FASTA at the end of the file, correct_gff_feature_order.pl places the "##FASTA" header in the wrong place in the output file, such that this header is on the second line of the file like so:

1 ##gff-version 3
2 ##FASTA

The problem in the code seems to be here, where you need special handling of the "##FASTA" line:

# first write the comments to the output file
for ( @comment_lines ) {
    print $ofh "$_\n";
}

convert_augustus_to_gff3.py error

Hi,

I ran convert_augustus_to_gff3.py with the command python3 convert_augustus_to_gff3.py -i RH88_augustus_draft.gff -o RH88_augustus_draft_converted.gff3

And I got the following error:
Traceback (most recent call last):
  File "convert_augustus_to_gff3.py", line 179, in <module>
    main()
  File "convert_augustus_to_gff3.py", line 135, in main
    feat_id = gff.column_9_value(cols[8], 'ID')
NameError: name 'gff' is not defined

I also tried running the script directly with ./convert_augustus_to_gff3.py -i RH88_augustus_draft.gff -o RH88_augustus_draft_converted.gff3
but it didn't work.

My python version is python/3.7.0. How can I fix this?

Thanks!!
Jing

gff/write_fasta_from_gff.pl mistranslates some Augustus -> GFF3 output

There are some classes of Augustus output genes that are a bit puzzling, such as the one below. The transcript starts with an intron and then the following (first) CDS fragment has a non-zero phase value, which goes against the GFF specification (in my understanding of it.) This needs to be checked and corrected for.

# start gene g856
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        gene    1       1068    0.91    +       .       g856
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        transcript      1       1068    0.91    +       .       g856.t1
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        intron  1       106     0.91    +       .       transcript_id "g856.t1"; gene_id "g856";
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        CDS     107     1068    0.91    +       2       transcript_id "g856.t1"; gene_id "g856";
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        stop_codon      1066    1068    .       +       0       transcript_id "g856.t1"; gene_id "g856";
# protein sequence = [TQTSTAQSQAMDAESNTSTDPKNGDSQSALVQQLCQTVERLTNELSQARHEIQHLQERINTINSTTTPLSPLEFPTLQ
# ESQIRSTAFPDAPWNNPSKIQALKQPSIQRSEQRRMQREATAARFFQPPSENQGFKYLYIPTKARIPVGTIRTTFRKLGVNNARLLDIHYPARNTVAV
# LIHNDYEAEFVELLTRKNVHIRTDFTPFNGKILADPKYTSLPQEERDSIAIRLQKLRLSRALDYIRSPVKYAVARYFLDQEWISRTRYEEIMADRYNT
# KLTSIFDQTSQQQTTQDTFNDVSDNDLNMEAIDELPTGTSSPALH]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 0
# CDS exons: 0/1
# CDS introns: 0/1
#5'UTR exons and introns: 0/0
#3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 0
# end gene g856
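
A minimal sketch of how records like this one could be flagged before translation, assuming tab-separated Augustus GTF-style lines as shown above. The function name find_nonzero_phase_first_cds is hypothetical and not part of biocode, and the sketch only reports the cases; it does not decide how they should be corrected.

```python
from collections import defaultdict


def find_nonzero_phase_first_cds(lines):
    """Return {transcript_id: phase} for transcripts whose 5'-most CDS
    fragment has a non-zero phase (hypothetical helper, not biocode API)."""
    cds_by_transcript = defaultdict(list)
    marker = 'transcript_id "'

    for line in lines:
        if line.startswith('#') or not line.strip():
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) != 9 or cols[2] != 'CDS':
            continue
        # Augustus attribute style: transcript_id "g856.t1"; gene_id "g856";
        pos = cols[8].find(marker)
        if pos == -1:
            continue
        transcript_id = cols[8][pos + len(marker):].split('"', 1)[0]
        cds_by_transcript[transcript_id].append(
            (int(cols[3]), int(cols[4]), cols[6], cols[7]))

    flagged = {}
    for transcript_id, cds_list in cds_by_transcript.items():
        strand = cds_list[0][2]
        if strand == '-':
            # On the reverse strand the first CDS has the highest end coordinate
            first = max(cds_list, key=lambda cds: cds[1])
        else:
            first = min(cds_list, key=lambda cds: cds[0])
        if first[3] not in ('0', '.'):
            flagged[transcript_id] = int(first[3])
    return flagged
```

Run on the g856 record above, this would report g856.t1 with phase 2.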

Check and/or correct coordinate column order in biocodegff.py

It would be helpful if biocodegff could print a warning, and perhaps automatically swap the values, when it detects that the GFF start coordinate (column 4) is larger than the GFF end coordinate (column 5). In the absence of this check, incorrectly switched coordinates are passed through to the GenBank format output of convert_gff3_to_gbk.py.
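
The requested check could look roughly like the sketch below. check_coordinate_order is a hypothetical helper written for illustration, not an existing biocodegff function, and it assumes the nine GFF columns have already been split into a list of strings.

```python
import sys


def check_coordinate_order(cols, line_number=None, swap=True):
    """Warn when GFF start (column 4) > end (column 5), optionally swapping.

    Returns the column list, with coordinates swapped if requested.
    Hypothetical helper, not part of biocodegff.
    """
    start, end = int(cols[3]), int(cols[4])
    if start > end:
        where = '' if line_number is None else ' on line {0}'.format(line_number)
        print('WARNING: start ({0}) is larger than end ({1}){2}'.format(
            start, end, where), file=sys.stderr)
        if swap:
            # Work on a copy so the caller's list is left untouched
            cols = list(cols)
            cols[3], cols[4] = str(end), str(start)
    return cols
```

Calling it with swap=False only emits the warning, which may be the safer default if swapped coordinates usually indicate a deeper problem in the input.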

Syntax error on gff.py

Hello,

I was trying to run some of your GFF3 statistics scripts, and after cloning the repository I get a syntax error when the script imports gff.py:

/Wyka/bioinformatics$ python report_gff3_statistics.py -i Claviceps_purpurea_LM4.gff3 -o output_test
Traceback (most recent call last):
File "report_gff3_statistics.py", line 19, in <module>
from biocode import gff
File "/opt/biocode/lib/biocode/gff.py", line 103
[*v] = map(unquote, tt[1].strip().split(COMMA))
^
SyntaxError: invalid syntax

I am running this on Ubuntu 18.04

AttributeError: type object 'str' has no attribute 'maketrans'

Traceback (most recent call last):
File "remove_duplicate_sequences.py", line 27, in <module>
from biocode import utils
File "/opt/biocode/lib/biocode/utils.py", line 6, in <module>
_nt_comp_table = bytes.maketrans(b'ACBDGHKMNSRUTWVYacbdghkmnsrutwvy',
AttributeError: type object 'str' has no attribute 'maketrans'

Needed: Speed-optimized FASTA statistics script

One of the really common tasks when given a FASTA file is to find the following statistics:

  • Total sequence count
  • Total base count
  • GC content
  • Longest sequence
  • Shortest sequence
  • Mean sequence length
  • Median sequence length
  • N50
  • N90

While the task itself is trivial, the more interesting part is finding the method that performs best. Because this will be an important component of a few other projects, speed and proper error handling are important. Most apps assume Python, but I'm up for implementations in whatever language gives the best results, as long as they don't open up a huge can of worms dependency-wise.
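
As a starting point for comparison, here is a plain-Python baseline that computes the statistics listed above in a single pass. It is only a sketch, not a speed-optimized implementation: the function name fasta_stats and the result keys are made up for illustration, and N50/N90 are taken as the length of the sequence at which the running total of lengths, sorted longest first, reaches 50% or 90% of all bases.

```python
import statistics


def fasta_stats(handle):
    """Compute basic statistics from an iterable of FASTA lines."""
    lengths = []
    gc_count = 0
    current = None  # length of the record being read; None before any header

    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if current is not None:
                lengths.append(current)
            current = 0
        else:
            if current is None:
                raise ValueError('sequence data found before the first header')
            seq = line.upper()
            current += len(seq)
            gc_count += seq.count('G') + seq.count('C')
    if current is not None:
        lengths.append(current)
    if not lengths:
        raise ValueError('no sequences found')

    total = sum(lengths)
    by_length = sorted(lengths, reverse=True)

    def nx(fraction):
        # Length at which the cumulative sum reaches the given fraction
        running = 0
        for length in by_length:
            running += length
            if running >= total * fraction:
                return length

    return {
        'sequence_count': len(lengths),
        'base_count': total,
        'gc_content': gc_count / total if total else 0.0,
        'longest': by_length[0],
        'shortest': by_length[-1],
        'mean_length': total / len(lengths),
        'median_length': statistics.median(lengths),
        'n50': nx(0.5),
        'n90': nx(0.9),
    }
```

Passing an open file object works directly, for example fasta_stats(open('assembly.fa')), where the file name is just a placeholder.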

Conda based install

For those who don't have admin privileges and avoid apt-get, you should be able to use conda to manage the install of the biocode dependencies. You can use the following commands...

# create a new conda environment named 'misc3' with needed dependencies and install biocode
conda create -n misc3 -c conda-forge python==3.6.8 pip zlib libblas liblapack libxml2
conda activate misc3
pip install biocode

I haven't fully tested my install but have used several of the gff scripts and it all seems to work fine.

Assuming this installation method actually works (I don't see why it wouldn't), it may be worth adding these commands to the biocode README.

pip2 install biocode error

Hi, I tried installing biocode using pip3 in Python 3, and here is the output (I also tried pip install biocode from Python 3.6.3 in Anaconda):

Collecting biocode
Using cached biocode-0.5.3.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "/tmp/pip-build-a5aeh2_4/biocode/setup.py", line 5, in <module>
from pypandoc import convert
ModuleNotFoundError: No module named 'pypandoc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-a5aeh2_4/biocode/setup.py", line 8, in <module>
    raise Exception("Error: pypandoc module not found, could not convert Markdown to RST")
Exception: Error: pypandoc module not found, could not convert Markdown to RST

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-a5aeh2_4/biocode/

I need to convert augustus gtf to GFF3 format

Any ideas?
Thanks
Kay

convert_gff3_to_ncbi_tbl.py error generated

Dear Joshua,

I got an error while running convert_gff3_to_ncbi_tbl.py.

Can you please check?

convert_gff3_to_ncbi_tbl.py -i ../gene.gff -o aasdasdasd -ln JC0 -nap adsadasd

Traceback (most recent call last):
File "/Users/wyim/bin/biocode/gff/convert_gff3_to_ncbi_tbl.py", line 92, in <module>
main()
File "/Users/wyim/bin/biocode/gff/convert_gff3_to_ncbi_tbl.py", line 55, in main
(assemblies, features) = biocodegff.get_gff3_features( args.input_file )
File "/Users/wyim/bin/biocode/lib/biocodegff.py", line 272, in get_gff3_features
raise Exception("Error in GFF3: Parent {0} referenced by a child feature before it was defined".format(parent_id) )
Exception: Error in GFF3: Parent Mecry000010.1 referenced by a child feature before it was defined
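
The exception is raised when a feature's Parent attribute names an ID that has not yet appeared in the file. A small sketch like the one below can list every such line before running the converter. find_undefined_parents is a hypothetical helper, not part of biocode, and it assumes simple key=value attributes separated by semicolons in column 9.

```python
def find_undefined_parents(lines):
    """Return (line_number, parent_id) pairs for features whose Parent
    has not been defined by an earlier ID in the same GFF3 input."""
    seen_ids = set()
    problems = []

    for line_number, line in enumerate(lines, start=1):
        if line.startswith('#') or not line.strip():
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) != 9:
            continue
        # Parse column 9 into a dict, e.g. {'ID': 'e1', 'Parent': 'm1'}
        atts = dict(pair.strip().split('=', 1)
                    for pair in cols[8].split(';') if '=' in pair)
        # A feature may list several parents separated by commas
        for parent_id in atts.get('Parent', '').split(','):
            if parent_id and parent_id not in seen_ids:
                problems.append((line_number, parent_id))
        if 'ID' in atts:
            seen_ids.add(atts['ID'])
    return problems
```

An empty result means every parent is defined before its children; otherwise the reported lines show which features would need to be moved after their parents.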

convert_gff3_to_gbk.py, add full support for non-protein-coding genes

If convert_gff3_to_gbk.py finds a tRNA, rRNA, or other non-protein-coding gene in the input GFF3, it writes the parent "gene" feature to the output GenBank file, but nothing else. Only protein-coding genes with an mRNA feature below the parent gene appear to be converted fully. It looks like biocodegenbank.print_biogene needs to be generalized to handle all gene types, or at least all those that currently have a corresponding representation in the biothings module.

Biocode.gff module error

Hello,

I kept getting an error that the module biocode.gff doesn't contain the function "column_9_value" even though I checked the module and I see that the function exists.

stephenwyka@bspmgenomics:/data/wyka/funannotate/LM470$ python3 /data/wyka/biocode/gff/convert_glimmerHMM_gff_to_gff3.py -i LM470_glimmerhmm.gff -o LM470_glimmerhmm.gff3
Traceback (most recent call last):
  File "/data/wyka/biocode/gff/convert_glimmerHMM_gff_to_gff3.py", line 104, in <module>
    main()
  File "/data/wyka/biocode/gff/convert_glimmerHMM_gff_to_gff3.py", line 66, in main
    id = gff.column_9_value(cols[8], 'ID')
AttributeError: module 'biocode.gff' has no attribute 'column_9_value'
stephenwyka@bspmgenomics:/data/wyka/funannotate/LM470$

Needed: Speed-optimized FASTQ to FASTA script

This script should accept a FASTQ file and simply convert it to FASTA. The only currently needed options are:

  • Allow the user to manually append a text string to the end of each header, such as "/1".
  • Auto-detect this header format "@SN7001163:78:C0YG5ACXX:6:1101:1241:2178 1:N:0:CCTAGGT" in which the first digit after the whitespace is the mate pair number, then add it to the read ID to make the header like: ">SN7001163:78:C0YG5ACXX:6:1101:1241:2178/1"

While the task itself is trivial, the more interesting part is finding the method that performs best. Because this will be an important component of a few other projects, speed and proper error handling are important. Most apps assume Python, but I'm up for implementations in whatever language gives the best results, as long as they don't open up a huge can of worms dependency-wise.
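
A straightforward Python baseline for the two options above might look like this. It is a sketch for comparison rather than a speed-optimized solution: the function name fastq_to_fasta, its parameters, and the regular expression are assumptions made for illustration, and it expects strict four-line FASTQ records.

```python
import io
import re

# Matches headers such as "SN7001163:78:C0YG5ACXX:6:1101:1241:2178 1:N:0:CCTAGGT"
# where the first digit after the whitespace is the mate pair number.
MATE_HEADER = re.compile(r'^(\S+)\s+(\d):[YN]:\d+:\S*$')


def fastq_to_fasta(in_handle, out_handle, suffix='', detect_mate=False):
    """Convert four-line FASTQ records to FASTA; return the record count."""
    count = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = in_handle.readline()
        plus = in_handle.readline()
        qual = in_handle.readline()
        if not header.startswith('@') or not plus.startswith('+') or not qual:
            raise ValueError(
                'malformed FASTQ record near record {0}'.format(count + 1))

        name = header[1:].rstrip('\r\n')
        if detect_mate:
            match = MATE_HEADER.match(name)
            if match:
                name = match.group(1) + '/' + match.group(2)
        out_handle.write('>{0}{1}\n{2}\n'.format(
            name, suffix, seq.rstrip('\r\n')))
        count += 1
    return count
```

With detect_mate=True, the example header from the list above becomes ">SN7001163:78:C0YG5ACXX:6:1101:1241:2178/1". The suffix option covers the manual case, for example suffix='/1'.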
