jorvis / biocode

Bioinformatics code libraries and scripts

License: MIT License

Python 77.54% Perl 21.29% Shell 0.66% Dockerfile 0.13% R 0.37%

biocode's Introduction

Overview

This is a collection of bioinformatics scripts many have found useful and code modules which make writing new ones a lot faster.

Over the years most bioinformatics people amass a collection of small utility scripts which make their lives easier. Too often they are kept either in private repositories or as part of a public collection to which no one else can contribute. Biocode is a curated repository of general-use utility scripts my colleagues and I have found useful and want to share with others. I have also developed some code libraries/modules which have made my scripting work a lot easier. Some have found these to be more useful than the scripts themselves.

Look below if you want to learn more, contribute code yourself, or just get the scripts.

-- Joshua Orvis

The scripts

The scope here is intentionally very open. I want to include anything that developers find generally useful. There are no limitations on language choice, though the majority are Python. For now, the following directories make up the initial groupings but will be expanded as needed:

blast - If it uses, massages, or just reformats BLAST output, it goes here.
chado - Scripts that are tied into the chado schema (gmod.org) should be found here.
fasta - Filtering, converting, size distribution plots, etc.
fastq - Utilities for fasta's newer sister format.
genbank - Anything related to the GenBank Flat File Format.
general - Utility scripts that may not fit in any other existing directory or don't warrant creation of their own. We should be selective about what we put here and create or use other directories whenever appropriate.
gff - Extractions, conversions and manipulations of files in the Generic Feature Format
gtf - From Ensembl/WashU, the GTF format is the focus of scripts here.
hmm - Merging, manipulating or reading HMM libraries.
sam_bam - Analysis and parsing of SAM/BAM files.
sandbox - Each committer gets their own personal directory here to add anything they want while testing or waiting to be moved to the production directories.
sysadmin - While not specifically bioinformatics, our work tends to be on Unix machines, and utility scripts are often needed to support our work. From file system manipulation to database backup scripts, put your generic sysadmin utilities here.
taxonomy - Anything related to taxonomic analysis.

The modules

If you're a developer these modules can save a lot of time. Yes, there is some duplicate functionality you'll find in modules like Biopython, but these were written to add features I always wanted and with a more biologically-focused API.

Three of the primary Python modules:

biocode.things

Classes here represent biological things (as defined by the Sequence Ontology) in a way that makes more sense biologically and hides some of the CS abstraction. What does this mean? This is a simple example, but compare these syntax approaches:

# This way is typical of other libraries
genes = assembly.get_subfeatures_by_type( type='genes' )
mRNAs = assembly.get_subfeatures_by_type( type='mRNA' )

# And instead, in biothings:
genes = assembly.genes()
for gene in genes:
    mRNAs = gene.mRNAs()

This more direct approach is held throughout these libraries. It also adds some shortcuts for tasks that always annoyed me when working with things that had coordinates. Consider if you wanted to determine if one gene is before another one on a molecule:

if gene1 < gene2:
    return True

In the background, biocode checks if the two gene objects are located on the same molecule and, if so, compares their coordinates. There are many other methods for coordinate comparison, such as:

thing1 <= thing2 : The thing1 overlaps thing2 on the 5' end
thing1.contained_within( thing2 )
thing1.overlaps( thing2 )
thing1.overlap_size_with( thing2 )
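
To make the idea concrete, here is a minimal sketch of how comparisons like these could be built on Python's rich comparison methods. The Located class and its molecule, fmin and fmax attributes are hypothetical stand-ins written for this illustration, not biocode's actual classes; only the same-molecule check comes from the description above, and the coordinates are assumed to be 0-based and end-exclusive.

```python
class Located:
    """Hypothetical feature with 0-based, end-exclusive coordinates."""

    def __init__(self, molecule, fmin, fmax):
        self.molecule = molecule
        self.fmin = fmin
        self.fmax = fmax

    def _check_same_molecule(self, other):
        # Coordinates are only comparable when both features share a molecule
        if self.molecule != other.molecule:
            raise ValueError("features are located on different molecules")

    def __lt__(self, other):
        # True when this feature ends before the other one starts
        self._check_same_molecule(other)
        return self.fmax <= other.fmin

    def overlap_size_with(self, other):
        # Number of shared positions, or 0 when the features do not overlap
        self._check_same_molecule(other)
        return max(0, min(self.fmax, other.fmax) - max(self.fmin, other.fmin))

    def overlaps(self, other):
        return self.overlap_size_with(other) > 0

    def contained_within(self, other):
        self._check_same_molecule(other)
        return other.fmin <= self.fmin and self.fmax <= other.fmax


gene1 = Located("chr1", 100, 500)
gene2 = Located("chr1", 450, 900)
gene3 = Located("chr1", 1000, 1200)

print(gene1 < gene3)                   # True
print(gene1.overlap_size_with(gene2))  # 50
```

Raising an error for features on different molecules is one possible design choice; returning False would be another.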

This module also contains readable and detailed documentation within the source code.

biocode.annotation

This set of classes allows formal definition of functional annotation which can be attached to various biothings. These include gene product names, gene symbols, EC numbers, GO terms, etc. Once annotated, the biothings can be written out in common formats such as GFF3, GenBank, NCBI tbl, etc.

biocode.gff

Much of biocode was written while working with genomic data and annotation, and one of the more common formats for storing these is GFF3. Using this module, you can parse a GFF3 file of annotations into a set of biothings with a single line of code. For example:

import biocode.gff

(assemblies, features) = biocode.gff.get_gff3_features( input_file_path )

That's it. You can then iterate over the assemblies and their children, or access the 'features' dict, which is keyed on each feature's ID.

Installing dependencies

On Debian-based systems (like Ubuntu) you can be sure to get all biocode dependencies like this:

apt-get install -y python3 python3-pip zlib1g-dev libblas-dev liblapack-dev libxml2-dev

Getting the code (pip3, latest release)

You can install biocode using pip3 (requires Python3) like this:

pip3 install biocode

Getting the code (github, current trunk)

If you want the latest developer version:

git clone https://github.com/jorvis/biocode.git

Important: Many of these scripts use the modules in the biocode/lib directory, so you'll need to point Python to them. Full setup example:

cd /opt
git clone https://github.com/jorvis/biocode.git

# You probably want to add this line to your $HOME/.bashrc file
export PYTHONPATH=/opt/biocode/lib:$PYTHONPATH

Problems / Suggestions?

If you encounter any issues with the existing code, or would like to request new features or scripts, please submit them to the Issue tracking system.

Contributing

If you'd like to contribute code to this collection, have a look at the Requirements And Convention Guide and then submit a pull request once your code is ready. We'll check your script and pull it into the production directories. If you aren't sure your code is ready for the production directories yet, we'll happily pull it into your sandbox directory instead.


biocode's Issues

Minor fasta script naming issue

The fasta/ subdirectory contains a number of scripts, pretty much all of which deal with multi-FASTA files, rather than single-sequence FASTA files. Most (but not all) of the scripts have "fasta" somewhere in their name, but there's a lone script that uses the term "multifasta" instead, "compare_two_multifastas.pl"

convert_gff3_to_gbk.py with no embedded FASTA

convert_gff3_to_gbk.py currently does not support GFF3 files without embedded FASTA sequences. Here is the error that I got:

"raise Exception("ERROR: CDS.get_residues() requested but its molecule {0} has no stored residues".format(mol.id))"

I guess the FASTA could be passed through a command line option when running the file.

bioannotation.py - check for properly-formed EC numbers

There are sources in public HMM and BLAST libraries which assert EC numbers that are malformed, such as "1.2.1.n2". Due to the nature of how these are used, I think the proper thing to do is to warn when the user attempts to add a malformed EC number but not throw an exception.
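
A minimal sketch of such a check follows. The function names are hypothetical and the rule is an assumption, not bioannotation.py's actual logic: four dot-separated fields, each a number or "-", with dashes allowed only in trailing positions.

```python
import re
import warnings

NUMBER = re.compile(r"[0-9]+")


def is_well_formed_ec(ec):
    """Return True if ec looks like N.N.N.N, with '-' allowed for trailing fields."""
    fields = ec.split(".")
    if len(fields) != 4 or not NUMBER.fullmatch(fields[0]):
        return False
    seen_dash = False
    for field in fields[1:]:
        if field == "-":
            seen_dash = True
        elif NUMBER.fullmatch(field) and not seen_dash:
            continue
        else:
            return False
    return True


def check_ec_number(ec):
    """Warn, rather than raise, when an EC number is malformed."""
    ok = is_well_formed_ec(ec)
    if not ok:
        warnings.warn("malformed EC number: {0}".format(ec))
    return ok


print(check_ec_number("4.3.99.3"))  # True
print(check_ec_number("1.2.1.n2"))  # False, after emitting a warning
```

Whether a malformed value should still be stored after the warning is left to the caller here.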

Structural comparison script deletes all bed files

I'd like to shift your structural comparison script out of the sandbox and into production, but I did a quick code review and one part at the very end concerned me. It seems to delete all bed files in the output directory rather than keeping track of the specific temporary bed files it creates and just deleting them.
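
The safer pattern could look something like the sketch below: record each temporary file as it is created and delete only those. The helper names are hypothetical and this is not the sandbox script's code.

```python
import os
import tempfile


def make_temp_bed(output_dir, created):
    """Create an empty temporary .bed file and remember its path."""
    fd, path = tempfile.mkstemp(suffix=".bed", dir=output_dir)
    os.close(fd)
    created.append(path)
    return path


def cleanup(created):
    """Delete only the files this run created."""
    for path in created:
        if os.path.exists(path):
            os.remove(path)
    created.clear()


with tempfile.TemporaryDirectory() as output_dir:
    # A pre-existing bed file that must survive the cleanup
    user_bed = os.path.join(output_dir, "user_results.bed")
    with open(user_bed, "w") as fh:
        fh.write("chr1\t0\t100\n")

    created = []
    make_temp_bed(output_dir, created)
    make_temp_bed(output_dir, created)
    cleanup(created)
    remaining = sorted(os.listdir(output_dir))

print(remaining)  # ['user_results.bed']
```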

This certainly isn't an urgent issue, since this is currently a sandbox script.

pip3 install biocode error

Hi, I tried installing biocode using pip3 in python3, and here is the output (I also tried using pip install biocode, from python 3.6.3 in anaconda):

Collecting biocode
Using cached biocode-0.5.3.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "/tmp/pip-build-a5aeh2_4/biocode/setup.py", line 5, in <module>
    from pypandoc import convert
ModuleNotFoundError: No module named 'pypandoc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-a5aeh2_4/biocode/setup.py", line 8, in <module>
    raise Exception("Error: pypandoc module not found, could not convert Markdown to RST")
Exception: Error: pypandoc module not found, could not convert Markdown to RST

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-a5aeh2_4/biocode/

I need to convert augustus gtf to GFF3 format

any ideas?
Thanks
Kay

report_gc_content_by_feature_type.pl definition of telomere

It looks to me like report_gc_content_by_feature_type.pl defines a telomere as the region between the terminal exon on a contig and the end of the contig. Perhaps this definition varies by field, but I don't think that this is typically seen as the biological definition of a telomere, which is typically defined as something like "a region of repetitive sequence at the end of a chromatid." The two problems that come to mind are:

Not all contigs contain the ends of chromosomes (i.e. chromosomes may not be fully sequenced or broken into several contigs).
It includes the region between the repetitive sequence and the first annotated gene, which I think is usually considered part of the sub-telomeric region.

Maybe there was some application-specific reason for this addition. Please correct me if I am wrong on this, but I know a little about how you like to be very precise with your terminology, so I thought I should bring it up to be looked at further.

gff/write_fasta_from_gff.pl mistranslates some Augustus -> GFF3 output

There are some classes of Augustus output genes that are a bit puzzling, such as the one below. The transcript starts with an intron and then the following (first) CDS fragment has a non-zero phase value, which goes against the GFF specification (in my understanding of it.) This needs to be checked and corrected for.

# start gene g856
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        gene    1       1068    0.91    +       .       g856
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        transcript      1       1068    0.91    +       .       g856.t1
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        intron  1       106     0.91    +       .       transcript_id "g856.t1"; gene_id "g856";
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        CDS     107     1068    0.91    +       2       transcript_id "g856.t1"; gene_id "g856";
NODE_4651_length_1024_cov_24.708984     AUGUSTUS        stop_codon      1066    1068    .       +       0       transcript_id "g856.t1"; gene_id "g856";
# protein sequence = [TQTSTAQSQAMDAESNTSTDPKNGDSQSALVQQLCQTVERLTNELSQARHEIQHLQERINTINSTTTPLSPLEFPTLQ
# ESQIRSTAFPDAPWNNPSKIQALKQPSIQRSEQRRMQREATAARFFQPPSENQGFKYLYIPTKARIPVGTIRTTFRKLGVNNARLLDIHYPARNTVAV
# LIHNDYEAEFVELLTRKNVHIRTDFTPFNGKILADPKYTSLPQEERDSIAIRLQKLRLSRALDYIRSPVKYAVARYFLDQEWISRTRYEEIMADRYNT
# KLTSIFDQTSQQQTTQDTFNDVSDNDLNMEAIDELPTGTSSPALH]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 0
# CDS exons: 0/1
# CDS introns: 0/1
#5'UTR exons and introns: 0/0
#3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 0
# end gene g856

[convert_genbank_to_gff3.py] key_error: locus_tag

Hello !
I'm trying to use the genbank to gff3 converter on this Genbank file: NC_008724

But I get a KeyError: 'locus_tag'. Some features don't have the locus_tag qualifier, and it seems that's why the error is raised.
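
A defensive lookup of the kind sketched below avoids the KeyError. It assumes qualifiers are available as a dict mapping qualifier names to lists of values; the get_locus_tag helper and the fallback naming are hypothetical, not the converter's actual code.

```python
def get_locus_tag(qualifiers, fallback):
    """Return the first locus_tag value, or the fallback when it is absent."""
    values = qualifiers.get("locus_tag")
    if values:
        return values[0]
    return fallback


# Two illustrative qualifier dicts; the second has no locus_tag
features = [
    {"locus_tag": ["ABC_0001"], "product": ["hypothetical protein"]},
    {"product": ["tRNA-Ala"]},
]

ids = [
    get_locus_tag(qualifiers, "feature_{0}".format(i))
    for i, qualifiers in enumerate(features, start=1)
]
print(ids)  # ['ABC_0001', 'feature_2']
```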

correct_gff_feature_order.pl doesn't work

Hello,

When I run the script correct_gff_feature_order.pl, I get this error

Can't locate bioUtils.pm in @INC (you may need to install the bioUtils module) (@INC contains: /Users/arslan/Documents/Juncus/EMBL/EMBLmyGFF3/../lib /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.2/darwin-thread-multi-2level /Library/Perl/Updates/5.18.2 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at correct_gff_feature_order.pl line 76.
BEGIN failed--compilation aborted at correct_gff_feature_order.pl line 76.

Could you please comment on how I can fix this?
Thanks

Biocode.gff module error

Hello,

I kept getting an error that the module biocode.gff doesn't contain the function "column_9_value" even though I checked the module and I see that the function exists.

stephenwyka@bspmgenomics:/data/wyka/funannotate/LM470$ python3 /data/wyka/biocode/gff/convert_glimmerHMM_gff_to_gff3.py -i LM470_glimmerhmm.gff -o LM470_glimmerhmm.gff3
Traceback (most recent call last):
  File "/data/wyka/biocode/gff/convert_glimmerHMM_gff_to_gff3.py", line 104, in <module>
    main()
  File "/data/wyka/biocode/gff/convert_glimmerHMM_gff_to_gff3.py", line 66, in main
    id = gff.column_9_value(cols[8], 'ID')
AttributeError: module 'biocode.gff' has no attribute 'column_9_value'
stephenwyka@bspmgenomics:/data/wyka/funannotate/LM470$

report_gff3_statistics.py, 'Gene' object has no attribute 'length'

Hi,
When I am trying to run "report_gff3_statistics.py" script for a file which looks like what I have pasted below, I get this error:
Traceback (most recent call last):
  File "report_gff_stat.py", line 113, in <module>
    main()
  File "report_gff_stat.py", line 56, in main
    type_lengths['gene'] += gene.length
AttributeError: 'Gene' object has no attribute 'length'

The gff3 file:
scaffold_936 phytozome9_0 gene 5553 6897 . - . ID=gene4;Name=Aquca_936_00001
scaffold_936 phytozome9_0 mRNA 5553 6897 . - . ID=mRNA4;Parent=gene4;Name=Aquca_936_00001.1;pacid=22051342;longest=1
scaffold_936 phytozome9_0 three_prime_UTR 5553 5787 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 exon 5553 5897 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 CDS 5788 5897 . - 2 Parent=mRNA4;pacid=22051342
scaffold_936 . intron 5898 6021 . - . Parent=mRNA4
scaffold_936 phytozome9_0 exon 6022 6086 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 CDS 6022 6086 . - 1 Parent=mRNA4;pacid=22051342
scaffold_936 . intron 6087 6219 . - . Parent=mRNA4
scaffold_936 phytozome9_0 exon 6220 6305 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 CDS 6220 6305 . - 0 Parent=mRNA4;pacid=22051342
scaffold_936 . intron 6306 6802 . - . Parent=mRNA4
scaffold_936 phytozome9_0 CDS 6803 6895 . - 0 Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 exon 6803 6897 . - . Parent=mRNA4;pacid=22051342
scaffold_936 phytozome9_0 five_prime_UTR 6896 6897 . - . Parent=mRNA4;pacid=22051342

Thanks,

Pezhman

correct_gff_feature_order.pl misplacing ##FASTA

Hello,
While the script seems to work as advertised otherwise, when given a GFF file with a FASTA at the end of the file, correct_gff_feature_order.pl places the "##FASTA" header in the wrong place in the output file, such that this header is on the second line of the file like so:

1 ##gff-version 3
2 ##FASTA

The problem in the code seems to be here, where you need special handling of the "##FASTA" line:

# first write the comments to the output file
for ( @comment_lines ) {
    print $ofh "$_\n";
}
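
One way to give "##FASTA" the special handling described above is sketched here in Python rather than the script's Perl. The function names are hypothetical, and the sketch only illustrates holding the directive back until all feature lines are written; feature reordering itself is left out.

```python
def split_gff3(lines):
    """Separate leading-style comments, feature lines and the FASTA section."""
    comments, features, fasta = [], [], []
    in_fasta = False
    for line in lines:
        line = line.rstrip("\n")
        if in_fasta:
            fasta.append(line)
        elif line == "##FASTA":
            in_fasta = True  # everything after this belongs to the FASTA section
        elif line.startswith("#"):
            comments.append(line)
        elif line:
            features.append(line)
    return comments, features, fasta


def assemble_gff3(comments, features, fasta):
    """Write comments first, then features, then ##FASTA and its sequences."""
    out = list(comments) + list(features)
    if fasta:
        out.append("##FASTA")
        out.extend(fasta)
    return out


example = [
    "##gff-version 3",
    "ctg1\tsrc\tgene\t1\t9\t.\t+\t.\tID=g1",
    "##FASTA",
    ">ctg1",
    "ATGAAATAA",
]
result = assemble_gff3(*split_gff3(example))
print(result.index("##FASTA"))  # 2
```

As in the original loop, comments that appear between feature lines are all moved to the top, which is a simplification.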

Python 2 Compatibility?

Hi Josh,

Thanks for sharing biocode. Any interest in accepting pull requests for single files (I'm just looking at fastq/randomly_subsample_fastq.py) that would cheaply add py2 compatibility (change the hashbang, cast to float for future division)? On the bright side it would increase usability, but I understand if you only want to test against py3. Thanks for putting this out there,

Script needed for assembly evaluation

We have need of a script which simulates fragmented sequences based on more-complete input sequence. This is perhaps best illustrated with a current use case.

We are using unsheared, paired-end reads aligned to transcriptome assemblies to determine real evidence for each, or even possibly group them further. We expect overlapping transcripts like this to be assembled:

5'---------------------3'
               5'----------------------------------3'

But paired-end grouping might also be able to pull these together, even inserting Ns given a known library insert size, if read mate pairs span the gap between them:

5'---------------------3'
                                          5'----------------------------------3'

So, here, this proposed script would allow me to take a known set of transcripts and artificially fragment them, generating some fragments that overlap and others that are separate from one another. This could be controlled with user-configurable options such as:

--min_overlap_distance=-200
--max_overlap_distance=100
--fragmentation_factor=6

Notice the negative value above, which allows for the 2nd case above where sequence fragments do not overlap. With these options, the script would transform a FASTA file with 1000 sequences into one with around 6000 sequences, with fragments generated with an overlap distance of up to 100bp and as far as 200bp apart from each other based on their parent sequence.

Data should be appended to the header descriptions in the product sequences to indicate their source and coordinates.
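
A rough sketch of the proposed behaviour follows, as one possible interpretation rather than an agreed design. It cuts each sequence into fragmentation_factor pieces at evenly spaced points, then shifts each fragment's end by a random amount between the minimum and maximum overlap distance: a positive shift creates an overlap with the next fragment and a negative one leaves a gap. The function name, the header layout and the even spacing are all assumptions.

```python
import random


def fragment_sequence(seq_id, seq, factor=6, min_overlap=-200,
                      max_overlap=100, rng=None):
    """Return (header, fragment) pairs cut from seq."""
    rng = rng or random.Random()
    n = len(seq)
    if n < factor:
        raise ValueError("sequence is shorter than the number of fragments")
    # Evenly spaced boundaries: factor fragments need factor + 1 cut points
    cuts = [i * n // factor for i in range(factor + 1)]
    fragments = []
    for i in range(factor):
        start, end = cuts[i], cuts[i + 1]
        if i < factor - 1:
            # Positive shift: run into the next fragment (overlap).
            # Negative shift: stop short of it (gap).
            shift = rng.randint(min_overlap, max_overlap)
            end = max(start + 1, min(n, end + shift))
        # Source and 1-based inclusive coordinates go into the header
        header = "{0}_frag{1} source={0} start={2} end={3}".format(
            seq_id, i + 1, start + 1, end)
        fragments.append((header, seq[start:end]))
    return fragments


frags = fragment_sequence("transcript1", "ACGT" * 300, rng=random.Random(0))
for header, fragment in frags:
    print(">" + header, len(fragment))
```

Passing a seeded random.Random makes a run reproducible, which helps when comparing assemblers on the same simulated input.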

Annotation parsing file output options

For the annotation parsing file, it would be helpful to add a few options for different output types. GFF3 and protein fasta files are already generated. I would like to see these nucleic acid file outputs, as well:

gene (full gene, including UTRs, start, stop, exons and introns).
coding sequence (CDS from start to stop that get translated into protein)
gene plus 1000 bases up and downstream of start and stop.

Thanks!
Marcus

NameError: name 'utils' is not defined

I am trying to run convert_gff3_to_ncbi_tbl.py script but getting this error.


ubt80:EMBLmyGFF3 arslan$ python3 convert_gff3_to_ncbi_tbl.py -i juncus.fasta.transdecoder.refined.gff3 -o arslan.tbl -ln TEST -nap JE -gf juncus-rp.fasta
INFO: splitting mRNA off gene Transcript_138016|g.186294
Traceback (most recent call last):
  File "convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/biocode/tbl.py", line 86, in print_tbl_from_assemblies
    print_biogene(gene=new_gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/biocode/tbl.py", line 117, in print_biogene
    gene_coords = utils.interbase0_to_humancoords(gene_loc.fmin, gene_loc.fmax, gene_loc.strand)
NameError: name 'utils' is not defined

Could you please comment on how I can fix it?
Thanks

Needed: Speed-optimized FASTQ to FASTA script

This script should accept a FASTQ file and simply convert it to FASTA. The only currently needed options are:

Allow the user to manually append a text string to the end of each header, such as "/1".
Auto-detect this header format "@SN7001163:78:C0YG5ACXX:6:1101:1241:2178 1:N:0:CCTAGGT" in which the first digit after the whitespace is the mate pair number, then add it to the read ID to make the header like: ">SN7001163:78:C0YG5ACXX:6:1101:1241:2178/1"

While this is trivial itself, what can get more interesting is finding the method to do it that performs the best. Because this will be an important component of a few other projects, speed and proper error handling are important. Most apps assume python, but I'm up for implementations in whatever language will give the best results here as long as they don't open up a huge can of worms dependency-wise.
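
As a correctness-first baseline to benchmark faster versions against, the two options could be sketched as below. This is not speed-optimized. It assumes strict four-line FASTQ records, and the function names are hypothetical.

```python
import io
import re

# Header style from the request: read ID, whitespace, then "<mate digit>:..."
MATE_PATTERN = re.compile(r"^(\S+)\s+(\d):")


def fasta_header(fastq_header, suffix=""):
    """Convert one FASTQ header line into a FASTA header line."""
    body = fastq_header[1:] if fastq_header.startswith("@") else fastq_header
    match = MATE_PATTERN.match(body)
    if match:
        # Fold the mate pair number into the read ID
        body = "{0}/{1}".format(match.group(1), match.group(2))
    return ">" + body + suffix


def fastq_to_fasta(in_fh, out_fh, suffix=""):
    """Stream FASTQ records from in_fh and write FASTA to out_fh."""
    while True:
        header = in_fh.readline()
        if not header:
            break
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = in_fh.readline()
        plus = in_fh.readline()
        qual = in_fh.readline()
        if not header.startswith("@") or not plus.startswith("+") or not qual:
            raise ValueError("malformed FASTQ record near: " + header.strip())
        out_fh.write(fasta_header(header.rstrip("\n"), suffix) + "\n")
        out_fh.write(seq.rstrip("\n") + "\n")


fastq = io.StringIO(
    "@SN7001163:78:C0YG5ACXX:6:1101:1241:2178 1:N:0:CCTAGGT\nACGT\n+\nIIII\n"
)
fasta = io.StringIO()
fastq_to_fasta(fastq, fasta)
print(fasta.getvalue(), end="")
```

If a suffix is given and a mate number is also detected, this sketch appends both; a real script would probably want to make the two options mutually exclusive.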

Needed: Speed-optimized FASTQ statistics script

One of the really common tasks when given a FASTQ file is to find the following statistics:

total read count
total base count
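
For reference, the two counts listed above can be gathered in a single pass, as in this unoptimized sketch. It assumes strict four-line records with no wrapped sequence lines, and the function name is hypothetical.

```python
import io


def fastq_stats(fh):
    """Return (total read count, total base count) for a FASTQ stream."""
    reads = 0
    bases = 0
    for i, line in enumerate(fh):
        # In a four-line record the sequence is always the second line
        if i % 4 == 1:
            reads += 1
            bases += len(line.strip())
    return reads, bases


example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(fastq_stats(example))  # (2, 10)
```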

AttributeError: type object 'str' has no attribute 'maketrans'

Traceback (most recent call last):
  File "remove_duplicate_sequences.py", line 27, in <module>
    from biocode import utils
  File "/opt/biocode/lib/biocode/utils.py", line 6, in <module>
    _nt_comp_table = bytes.maketrans(b'ACBDGHKMNSRUTWVYacbdghkmnsrutwvy',
AttributeError: type object 'str' has no attribute 'maketrans'

GFF3 validator

We have talked before about writing a script to confirm full gff3 format agreement, and I have found someone else's attempt at doing this (link below). This might be a good reference if we plan on writing such a script ourselves.

http://modencode.oicr.on.ca/validate_gff3_online/validate_gff3.html

convert_gff3_to_gbk.py, add full support for non-protein-coding genes

If convert_gff3_to_gbk.py finds a tRNA, rRNA, or other non protein-coding gene in the input GFF3 it will output the parent "gene" feature in the output GenBank file, but nothing else. Only protein-coding genes with an mRNA feature below the parent gene appear to be converted fully. It looks like biocodegenbank.print_biogene needs to be generalized to handle all gene types, or at least all those that currently have a corresponding representation in the biothings module.

Can't import things with write_fasta_from_gff.py

After trying to run this script with the below command, I was having trouble with importing utils/things. I've used this script before (checkout and relevant .bashrc line below), so this must be due to recent changes. I noticed that these modules were in the */biocode/lib/biocode/ sub-directory, so I added that to my PYTHONPATH (below), and got the same error.

$ python ~/git/biocode/gff/write_fasta_from_gff.py -i ref.gff3 -f ref.fasta -o ref.fasta -t cds

Traceback (most recent call last):
  File "/home/ktretina/git/biocode/gff/write_fasta_from_gff.py", line 31, in <module>
    from biocode import utils, gff
  File "/home/jorvis/git/biocode/lib/biocode/gff.py", line 4, in <module>
    from biocode import things, annotation
  File "/home/jorvis/git/biocode/lib/biocode/things.py", line 3, in <module>
    from biocode import utils, gff, tbl
  File "/home/jorvis/git/biocode/lib/biocode/tbl.py", line 3, in <module>
    from biocode import utils, things

Checkout
/home/ktretina/git/biocode/

.bashrc file
export PYTHONPATH=$PYTHONPATH:/home/jorvis/lib:/home/jorvis/svn/jorvis/utilities/lib:/home/jorvis/git/biocode/lib:/home/jorvis/git/Emergence/emergence/apps:/home/ktretina/git/biocode/lib/biocode/

Augustus conversion failing

User @kayussky911 reports:

convert_augustus_to_gff3.py -i augustus_erins.gtf -o new_augustus

my input looks like,

scaffold10x_1 AUGUSTUS gene 3591 4530 0.27 - . g1
scaffold10x_1 AUGUSTUS transcript 3591 4530 0.27 - . g1.t1
scaffold10x_1 AUGUSTUS stop_codon 3591 3593 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 3591 3859 0.34 - 2 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 3591 3859 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS intron 3860 4022 0.28 - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 4023 4530 0.63 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 4023 4530 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS start_codon 4528 4530 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS gene 26186 31433 0.2 - . g2
scaffold10x_1 AUGUSTUS transcript 26186 31433 0.2 - . g2.t1
scaffold10x_1 AUGUSTUS stop_codon 26186 26188 . - 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS CDS 26186 26304 0.37 - 2 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS exon 26186 26304 . - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS intron 26305 29389 0.28 - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS CDS 29390 30220 0.39 - 2 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS exon 29390 30220 . - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS intron 30221 30844 0.45 - . transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS CDS 30845 31433 0.41 - 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1 AUGUSTUS exon 30845 31433 . - . transcript_id "g2.t1"; gene_id "g2";

the output just says;

##gff-version 3

and that's it. So I think it's the last column of the input file I need to work on.
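
For anyone inspecting that last column, the sample above uses two styles: a bare ID on gene and transcript rows, and GTF-style key "value"; pairs on the others. The sketch below reads both. The function name is hypothetical, this is not how convert_augustus_to_gff3.py parses its input, and it is not offered as a diagnosis of the empty output.

```python
import re

# GTF-style attributes: key "value"; key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_augustus_column_9(column_9):
    """Return a dict of attributes from either column 9 style."""
    pairs = dict(ATTRIBUTE_PATTERN.findall(column_9))
    if pairs:
        return pairs
    # Gene and transcript rows carry only a bare identifier
    return {"ID": column_9.strip()}


print(parse_augustus_column_9('transcript_id "g1.t1"; gene_id "g1";'))
print(parse_augustus_column_9("g1.t1"))
```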

convert_gff3_to_gbk.py template error

I am getting the following error after a pip install of biocode when running convert_gff3_to_gbk.py.
raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: genbank_flat_file_header.template

convert_augustus_to_gff3.py error

Hi,

I used the convert_augustus_to_gff3.py with the code python3 convert_augustus_to_gff3.py -i RH88_augustus_draft.gff -o RH88_augustus_draft_converted.gff3

And I got the following error:
Traceback (most recent call last):
  File "convert_augustus_to_gff3.py", line 179, in <module>
    main()
  File "convert_augustus_to_gff3.py", line 135, in main
    feat_id = gff.column_9_value(cols[8], 'ID')
NameError: name 'gff' is not defined

I also tried running the script directly with ./convert_augustus_to_gff3.py -i RH88_augustus_draft.gff -o RH88_augustus_draft_converted.gff3
but it didn't work.

My python version is python/3.7.0. How can I fix this?

Thanks!!
Jing

convert_metagenemark_gff_to_gff3.py produces invalid GFF3

convert_metagenemark_gff_to_gff3.py echoes comment lines from MetaGeneMark unchanged. This is a problem when the comment is "##FASTA" (i.e., as part of a predicted polypeptide) GFF3 parsers are required to interpret such a line as the beginning of the GFF3 FASTA sequence section. One possible solution would be to tack on an extra "#" before echoing the comments. The situation is exacerbated by the fact that the current Biocode GFF3 parser will accept any line starting with "##FASTA" (e.g., "##FASTATKAANICDYENLAFMG") as the FASTA section delimiter (issue #32).
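
Both halves of the problem can be sketched in a few lines. The function names are hypothetical and neither is the current Biocode code. The escaping is a narrower variant of the suggestion above: it only touches comments that start with "##FASTA" instead of every comment.

```python
def is_fasta_directive(line):
    """Strict test: the line must be exactly ##FASTA, ignoring trailing whitespace."""
    return line.rstrip() == "##FASTA"


def escape_comment(line):
    """Prefix an extra '#' so an echoed comment cannot look like the directive."""
    if line.startswith("##FASTA"):
        return "#" + line
    return line


print(is_fasta_directive("##FASTA\n"))                 # True
print(is_fasta_directive("##FASTATKAANICDYENLAFMG"))   # False
print(escape_comment("##FASTATKAANICDYENLAFMG"))       # ###FASTATKAANICDYENLAFMG
```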

To extract longest isoform from trinity assembly

Dear All,

I am getting an error when using this script: https://github.com/jorvis/biocode/blob/master/sandbox/jorvis/filter_longest_trinity_subcomponents.py

Error:
  File "/home/yogesh/biocode/sandbox/jorvis/filter_longest_trinity_subcomponents.py", line 45, in <module>
    import biocodeutils
ImportError: No module named 'biocodeutils'

How can I resolve it?

Thanks

The biothings module needs to support multiple parentage

The biothings module doesn't currently support features having multiple parents, which is one of the main reasons GFF3 was written (over GFF2) in the first place.

http://www.bioinformatics.org/wiki/Generic_Feature_Format#GFF_Version_3

Conda based install

For those who don't have admin privileges and avoid apt-get, you should be able to use conda to manage the install of the biocode dependencies. You can use the following commands...

# create a new conda environment named 'misc3' with needed dependencies and install biocode
conda create -n misc3 -c conda-forge python==3.6.8 pip zlib libblas liblapack libxml2
conda activate misc3
pip install biocode

I haven't fully tested my install but have used several of the gff scripts and it all seems to work fine.

Assuming this installation method actually works (I don't see why it wouldn't) it may be worth adding these commands to the biocode README

Add motif predictions to parse_ergatis_euk_functional_pipeline.py

The euk functional annotation script (sandbox/jorvis/parse_ergatis_euk_functional_pipeline.py) might be augmented with some additional evidence. I propose adding the following predictions:
SignalP
SecretomeP
TMHMM
TargetP
(More information can be found here: http://www.cbs.dtu.dk/services/ and there are additional prediction tools there, as well.)

With respect to how to handle the annotation name in column 9 of the GFF3 file, I propose adding information to those names that would otherwise be "Hypothetical protein" due to lack of significant matches to other evidence (e.g. no named BLAST hits from UniProt, nor any HMM results). For example, if a protein is putatively secreted, but otherwise has no annotation, we might call it "Hypothetical secreted protein", and if a protein localizes to the membrane, it could be called "Hypothetical transmembrane protein".

For database submissions, this might not be useful (as GenBank would reject annotations following such nomenclature), but we could parse those prior to submission to GenBank. (For example, all proteins called "Hypothetical" followed by any other text would be renamed "Hypothetical protein".)

problem with compare_gene_structures.py

Dear,
I have a problem using the code compare_gene_structures.py. I get the following error:
  File "/home/faino001/bin/biocode/gff/compare_gene_structures.py", line 612, in <module>
    process_files(args)
  File "/home/faino001/bin/biocode/gff/compare_gene_structures.py", line 322, in process_files
    for exon_1 in sorted(feat_1) :
UnboundLocalError: local variable 'feat_1' referenced before assignment

any idea why?

thanks
Luigi

write_fasta_from_gff.pl

Hello,

I'd like to suggest that a check for a start and stop codon be added to this script (for both CDS and polypeptide sequences) for each sequence. Whether there is just a warning to STDOUT or to a log file does not matter so much to me, but this would be a very useful feature, particularly for an annotation project. I have recently found that such a check does not occur in WebApollo, and so this would be the most convenient place to add this check in our pipeline.
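
A check of the kind suggested could be as small as the sketch below. It assumes the standard genetic code (ATG start; TAA, TAG or TGA stop) and a CDS already oriented 5' to 3'. The function name and message format are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def codon_warnings(seq_id, cds):
    """Return warning strings for a CDS lacking a start or stop codon."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("missing ATG start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("missing stop codon")
    return ["WARNING: {0}: {1}".format(seq_id, p) for p in problems]


print(codon_warnings("g1", "ATGAAATAA"))  # []
for message in codon_warnings("g2", "CCCAAATTT"):
    print(message)
```

Organisms with alternative start codons or a different translation table would need the sets adjusted.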

Thanks!

Insert EC numbers into chado database issue

Hi Josh,

I am trying to run your script "insert_ec_number.pl" in the chado folder to insert an EC number into a chado database. I get the following error:

[snadendla@thanos chado]$ perl insert_ec_number.pl --ec_number=4.3.99.3 --name=7-carboxy-7-deazaguanine synthase --database=hcon2 --user=XXXX --password=XXXX --server=manatee-db --database_type=mysql
attempting to create database connection
INFO: got db_id 8 for name EC
INFO: got cv_id 7 for name EC
Unable to find cvterm_id corresponding to base accession 4.3.99.-. Check the base term? at insert_ec_number.pl line 283.

I tried adding the base 4.3.99.- but still get the same error as above.

What can I do to insert this EC number?

Thanks,
Suvvi

shorter than real intergenic space in "report_gff_intron_and_intergenic_stats.py"

Hi,
When I calculate the intergenic space of a contig with report_gff_intron_and_intergenic_stats.py and add the total length of the genes on that contig to it, the result is shorter than the total length of the contig. My assumption is, this code does not consider the length from beginning of the contig to the beginning of the first gene and also end of the last gene to the end of the contig.
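
To illustrate the accounting described here, the sketch below computes intergenic lengths with and without the two terminal regions. It is not the script's code; the function name and the 1-based inclusive coordinates are assumptions.

```python
def intergenic_lengths(contig_length, genes, include_ends=True):
    """Return gap lengths between genes given as (start, end), 1-based inclusive."""
    gaps = []
    covered_to = 0      # last base covered by any gene seen so far
    seen_gene = False
    for start, end in sorted(genes):
        gap = start - covered_to - 1
        # The gap before the first gene is a contig end, not a gene-to-gene gap
        if gap > 0 and (seen_gene or include_ends):
            gaps.append(gap)
        covered_to = max(covered_to, end)
        seen_gene = True
    trailing = contig_length - covered_to
    if trailing > 0 and include_ends:
        gaps.append(trailing)
    return gaps


genes = [(101, 300), (401, 600)]
gene_total = sum(end - start + 1 for start, end in genes)

print(sum(intergenic_lengths(1000, genes, include_ends=False)) + gene_total)  # 500
print(sum(intergenic_lengths(1000, genes)) + gene_total)                      # 1000
```

With the contig ends included, the gene and intergenic totals add up to the contig length, provided the genes do not overlap one another.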

Cheers,
Pezhman
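To make the accounting concrete, here is a small sketch of an intergenic calculation that includes both flanking regions. The function `intergenic_lengths` and its inputs are hypothetical and are not how report_gff_intron_and_intergenic_stats.py is written; coordinates are assumed to be 1-based and inclusive, as in GFF.

```python
def intergenic_lengths(contig_length, gene_spans):
    """Return the lengths of all gene-free stretches on a contig.

    gene_spans is an iterable of (start, end) tuples, 1-based inclusive.
    The stretch before the first gene and after the last gene is included.
    """
    gaps = []
    covered_to = 0  # last base known to be covered by a gene so far

    for start, end in sorted(gene_spans):
        if start > covered_to + 1:
            gaps.append(start - covered_to - 1)
        covered_to = max(covered_to, end)

    if covered_to < contig_length:
        gaps.append(contig_length - covered_to)
    return gaps
```

Note that overlapping genes also break the simple sum: adding up individual gene lengths counts shared bases twice, so intergenic plus summed gene length only equals the contig length when genes do not overlap.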

Exclude mRNA features in bacterial TBL exports

From Suvvi:

Just a reminder that mRNA features need to be left out of the tbl file when converting a GFF or GenBank file to tbl.

The genome that I submitted with mRNA features was sent back because it contained mRNA. Pasting the error here (just FYI):

“FATAL: DISC_BACTERIA_SHOULD_NOT_HAVE_MRNA:5 bacterial sequences have mRNA features

FATAL: DiscRep_ALL:DISC_BACTERIA_SHOULD_NOT_HAVE_MRNA::5 bacterial sequences have mRNA features

/tmp/tmp.zCDmHD4k9r:tig00000001_edited (length 4883137)
/tmp/tmp.zCDmHD4k9r:tig00000064_edited (length 1234209)
/tmp/tmp.zCDmHD4k9r:tig00000082_edited (length 415988)
/tmp/tmp.zCDmHD4k9r:tig00000065_edited (length 771583)
/tmp/tmp.zCDmHD4k9r:tig00000066_edited (length 630306)”

Model UTRs explicitly

Kyle - This is something you requested, but could you add a comment with a bit more information? Do you just need the class to be created or do you have a file already where they could be included? (I expect a GFF file where the mRNA/exon feature coordinates are outside of the range of the CDS ones.)

Keep in mind the GFF specification (scroll down to the section labeled "The Canonical Gene")
http://www.sequenceontology.org/gff3.shtml

And the SO definition:
http://www.sequenceontology.org/miso/current_release/term/SO:0000203

convert_gff3_to_gbk.py, convert sequences with no annotation

Currently convert_gff3_to_gbk.py will only create GenBank entries for input sequences that have at least one feature localized to them in the GFF. However, one might want to create GenBank entries for genomic sequences (in the FASTA section of the input GFF3) that have no features localized to them. The description of the converter ("Converts GFF3 representing gene models to Genbank flat-file format.") does suggest that the conversion process is based around gene models rather than sequences, but since the GenBank flat file format is inherently sequence-based it would be good to at least have an option to include unannotated sequences in the conversion.

Needed: Speed-optimized FASTA statistics script

One of the really common tasks when given a FASTA file is to find the following statistics:

Total sequence count
Total base count
GC content
Longest sequence
Shortest sequence
Mean sequence length
Median sequence length
N50
N90
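As a reference point for what such a script would compute, here is a plain single-pass sketch using only the standard library. It is not speed-optimized and is not an existing biocode script; the function names are made up for this example, and N50/N90 are taken to mean the length L at which sequences of length L or longer cover at least 50% or 90% of all bases.

```python
from statistics import mean, median


def read_fasta_lengths(handle):
    """Yield (length, gc_count) for each record in a FASTA file handle."""
    length = gc = 0
    in_record = False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                yield length, gc
            length = gc = 0
            in_record = True
        elif in_record and line:
            seq = line.upper()
            length += len(seq)
            gc += seq.count("G") + seq.count("C")
    if in_record:
        yield length, gc


def nx(lengths, fraction):
    """Length L such that sequences >= L cover `fraction` of all bases."""
    threshold = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


def fasta_stats(handle):
    """Collect the summary statistics listed above into a dict."""
    records = list(read_fasta_lengths(handle))
    lengths = [rec[0] for rec in records]
    total = sum(lengths)
    return {
        "sequence_count": len(lengths),
        "base_count": total,
        "gc_content": sum(rec[1] for rec in records) / total if total else 0.0,
        "longest": max(lengths, default=0),
        "shortest": min(lengths, default=0),
        "mean_length": mean(lengths) if lengths else 0,
        "median_length": median(lengths) if lengths else 0,
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
    }
```

Any iterable of lines works as the handle, so `fasta_stats(open(path))` and `fasta_stats(text.splitlines())` behave the same way.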

biocodeutils.py add error output mRNA ID

When the biocodeutils function called "translate" finds an unknown codon, it currently will deal with it like this:

print("WARN: Encountered unknown codon during translation: {0}".format(seq[x:x+3]))

Could you please add the mRNA ID to this output? I think you'll have to add that to the function input. This will help when trying to track down the sequences with this issue.

Thanks!
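A sketch of what the requested change could look like. This is a stand-alone illustration, not the actual biocodeutils `translate` function: the `seq_id` and `warn` parameters, the use of `X` for unknown codons, and the standard-code table are all assumptions made here.

```python
import itertools
import sys

# Standard genetic code, built in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, seq_id=None, warn=None):
    """Translate a nucleotide string, naming the sequence in any warning.

    seq_id is the hypothetical extra argument (for example an mRNA ID).
    warn is a callable that receives each message; default is STDERR.
    """
    if warn is None:
        warn = lambda msg: print(msg, file=sys.stderr)

    seq = seq.upper()
    label = seq_id if seq_id is not None else "unknown sequence"
    protein = []

    # Step through complete codons only; a trailing partial codon is dropped.
    for x in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[x:x + 3]
        aa = CODON_TABLE.get(codon)
        if aa is None:
            warn("WARN: Encountered unknown codon during translation "
                 "of {0}: {1}".format(label, codon))
            aa = "X"
        protein.append(aa)
    return "".join(protein)
```

Making `seq_id` optional keeps existing callers working, while callers that know the mRNA ID can pass it through.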

Check and/or correct coordinate column order in biocodegff.py

It would be helpful if biocodegff could print a warning (and perhaps automatically switch the values) if it detects that the GFF start coordinate (column 4) is larger than the GFF end coordinate (column 5). In the absence of this check, incorrectly switched coordinates are getting passed through to the GenBank format output of convert_gff3_to_gbk.py.
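One possible shape for such a check, sketched on raw GFF3 lines rather than on biocodegff's own objects. The function `check_gff_coords` and its `fix` and `warn` parameters are hypothetical and do not exist in the library.

```python
import sys


def check_gff_coords(line, fix=False, warn=None):
    """Warn when column 4 > column 5; optionally return the swapped line.

    Comment lines and lines without exactly 9 columns are returned as-is.
    """
    if warn is None:
        warn = lambda msg: print(msg, file=sys.stderr)

    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9:
        return line

    start, end = int(cols[3]), int(cols[4])
    if start <= end:
        return line

    warn("WARN: start {0} > end {1} for feature on {2}: {3}".format(
        start, end, cols[0], cols[8]))
    if not fix:
        return line

    # Swap the two coordinate columns and keep the original line ending.
    cols[3], cols[4] = cols[4], cols[3]
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline
```

Swapping is off by default in this sketch because reversed coordinates sometimes indicate that the strand column is wrong too, which a blind swap would hide.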

Resolve competing functions to serialize GFF3

These two are extremely similar and should probably be collapsed into just one function.

utils.serialize_gff3
gff.print_gff3_from_assemblies

convert_gff3_to_ncbi_tbl.py error generated

Dear Joshua,

I got an error while running convert_gff3_to_ncbi_tbl.py.

Can you please check?

convert_gff3_to_ncbi_tbl.py -i ../gene.gff -o aasdasdasd -ln JC0 -nap adsadasd

Traceback (most recent call last):
File "/Users/wyim/bin/biocode/gff/convert_gff3_to_ncbi_tbl.py", line 92, in <module>
main()
File "/Users/wyim/bin/biocode/gff/convert_gff3_to_ncbi_tbl.py", line 55, in main
(assemblies, features) = biocodegff.get_gff3_features( args.input_file )
File "/Users/wyim/bin/biocode/lib/biocodegff.py", line 272, in get_gff3_features
raise Exception("Error in GFF3: Parent {0} referenced by a child feature before it was defined".format(parent_id) )
Exception: Error in GFF3: Parent Mecry000010.1 referenced by a child feature before it was defined
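The exception message indicates that a child row appears in the file before the row that defines its parent. A quick way to locate such rows is sketched below; `find_parent_order_problems` is a hypothetical diagnostic and not part of biocode.

```python
def find_parent_order_problems(lines):
    """Scan GFF3 lines for Parent IDs used before (or without) definition.

    Returns (early, never): early is a list of (line_number, parent_id)
    for references that precede the parent's own row, and never is the
    subset of those IDs that are not defined anywhere in the file.
    """
    seen_ids = set()
    early = []

    for line_no, line in enumerate(lines, 1):
        if line.startswith("##FASTA"):
            break  # sequence section; no more feature rows
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue

        # Column 9 is a semicolon-separated list of key=value pairs.
        atts = dict(pair.split("=", 1)
                    for pair in cols[8].split(";") if "=" in pair)
        for parent_id in atts.get("Parent", "").split(","):
            if parent_id and parent_id not in seen_ids:
                early.append((line_no, parent_id))
        if "ID" in atts:
            seen_ids.add(atts["ID"])

    never = {parent_id for _, parent_id in early if parent_id not in seen_ids}
    return early, never
```

IDs that land in `early` but not in `never` point to an ordering problem, which reordering the file so that parents come first would address. IDs in `never` point to a missing parent row.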

convert_gff3_to_ncbi_tbl

Can someone tell me which assumptions convert_gff3_to_ncbi_tbl makes about the formatting of the names? Apparently ours are missing something:

python3 gff/convert_gff3_to_ncbi_tbl.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.tbl -ln LAB -nap NAP -gf ../juncus.fasta 
Traceback (most recent call last):
  File "gff/convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "gff/convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 95, in print_tbl_from_assemblies
    print_biogene(gene=gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 122, in print_biogene
    raise Exception("ERROR: locus_tag attributes are required for all gene elements (gene id: {0}".format(gene.id))
Exception: ERROR: locus_tag attributes are required for all gene elements (gene id: Transcript_32960|g.33387

ping @arsilan324

report_gff_intron_and_intergenic_stats.py error message Detected assembly with undefined or 0 length

When I run this script, I get a message that I'm not sure how to troubleshoot.

$ /home/cmccracken/biocode/gff/report_gff_intron_and_intergenic_stats.py -i final_annotation_bmi_20140606.fixed.newIDs.gff3
/usr/local/packages/Python-3.2.3/lib/python3.2/subprocess.py:389: RuntimeWarning: The _posixsubprocess module is not being used. Child process reliability may suffer if your program uses threads.
"program uses threads.", RuntimeWarning)
Traceback (most recent call last):
File "/home/cmccracken/biocode/gff/report_gff_intron_and_intergenic_stats.py", line 212, in <module>
main()
File "/home/cmccracken/biocode/gff/report_gff_intron_and_intergenic_stats.py", line 91, in main
raise Exception("ERROR: Detected assembly with undefined or 0 length: {0}".format(assembly.id))
Exception: ERROR: Detected assembly with undefined or 0 length: ChromosomeIII_BmicrotiR1
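One way to troubleshoot this is to check whether the GFF3 file carries any length information for the named assembly at all. The sketch below assumes that a length could come from either a `##sequence-region` line or a sequence in the embedded `##FASTA` section; the source does not say which of these the script actually reads, and `assembly_lengths` is a hypothetical helper.

```python
def assembly_lengths(lines):
    """Map sequence ID -> length, from GFF3 directives and embedded FASTA.

    Lengths from the ##FASTA section take precedence over
    ##sequence-region directives for the same ID.
    """
    lengths = {}
    in_fasta = False
    current = None

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            in_fasta = True
        elif in_fasta:
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line.strip())
        elif line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) == 4:
                seq_id, start, end = parts[1], int(parts[2]), int(parts[3])
                lengths.setdefault(seq_id, end - start + 1)
    return lengths
```

If an ID from column 1 of the feature rows (here, ChromosomeIII_BmicrotiR1) is absent from the returned dict, the file itself gives no way to determine that assembly's length.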

Attribute error for update_selected_column9_values.py

Hi,
I am trying to add "EC-numbers" to a gff file.
This is my command:

python update_selected_column9_values.py -i LMA_1258_IMG.gff3 -u ID_EC_onlycol1258.tab -k 'ID' -a 'ec_num' -o LMA_1258_IMG_EC.gff

And this is the error I am getting:

Traceback (most recent call last):
File "update_selected_column9_values.py", line 100, in <module>
main()
File "update_selected_column9_values.py", line 90, in main
atts = gff.column_9_dict(cols[8])
AttributeError: module 'biocode.gff' has no attribute 'column_9_dict'

What am I doing wrong?

Thank you
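For reference, the operation the script is trying to perform at that line is turning column 9 of a GFF3 row into key/value pairs. A stand-alone sketch of that parsing follows; `parse_column_9` is a hypothetical stand-in written for this example and is not the missing `column_9_dict` function.

```python
from urllib.parse import unquote


def parse_column_9(col9):
    """Parse a GFF3 attributes column into {key: [values, ...]}.

    Pairs are separated by ';', keys and values by '=', and multiple
    values by ','. Percent-encoded characters are decoded after splitting.
    """
    atts = {}
    for pair in col9.strip().split(";"):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries
        atts[unquote(key.strip())] = [unquote(v) for v in value.split(",")]
    return atts
```

Values are kept as lists because the GFF3 format allows several comma-separated values for a single attribute such as Dbxref or Parent.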

report_gff3_statistics.py unsupported operand type(s) error

When using report_gff3_statistics.py, I get the following error:

/usr/local/packages/Python-3.3.2/bin/python3 /home/jorvis/git/biocode/gff/report_gff3_statistics.py -i 175.annotation.gff3
The biothings.py is still under testing and development. Please feel free to try using it, though the API is in flux.
Traceback (most recent call last):
File "/home/jorvis/git/biocode/gff/report_gff3_statistics.py", line 74, in <module>
main()
File "/home/jorvis/git/biocode/gff/report_gff3_statistics.py", line 41, in main
type_lengths['assembly'] += assemblies[assembly_id].length
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

The full path of the gff3 file is: /usr/local/projects/mucormycosis/annotation/175/175.annotation.gff3 in the IGS filesystem.

AttributeError: 'Gene' object has no attribute 'add_CDS'

Hello, I am trying to get intron and exon statistics using both your 'report_gff3_statistics.py' and 'report_gff_intron_and_intergenic_stats.py' and I am getting the AttributeError that is in the title.

stephenwyka@bspmgenomics:/data/wyka/Reference_genomes/originals$ /data/wyka/report_gff3_statistics.py -i Claviceps_purpurea_20_1.gff -o exon_report.txt
Traceback (most recent call last):
  File "/data/wyka/report_gff3_statistics.py", line 110, in <module>
    main()
  File "/data/wyka/report_gff3_statistics.py", line 30, in main
    (assemblies, features) = gff.get_gff3_features(args.input_file)
  File "/data/wyka/biocode/lib/biocode/gff.py", line 350, in get_gff3_features
    parent_feat.add_CDS(CDS)
AttributeError: 'Gene' object has no attribute 'add_CDS'

I downloaded this gff3 from GenBank and below is an example of the contents.

CAGA01000191.1	EMBL	region	1	224490	.	+	.	ID=id0;Dbxref=taxon:1111077;clone=scaffold00051;gbkey=Src;mol_type=genomic DNA;strain=20.1
CAGA01000191.1	EMBL	gene	3223	3902	.	-	.	ID=gene0;Name=CPUR_06801;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06801
CAGA01000191.1	EMBL	CDS	3642	3902	.	-	0	ID=cds0;Parent=gene0;Dbxref=NCBI_GP:CCE35373.1;Name=CCE35373.1;Note=CP_06801.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35373.1
CAGA01000191.1	EMBL	CDS	3223	3315	.	-	0	ID=cds0;Parent=gene0;Dbxref=NCBI_GP:CCE35373.1;Name=CCE35373.1;Note=CP_06801.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35373.1
CAGA01000191.1	EMBL	exon	3223	3315	.	-	.	ID=id1;Parent=gene0;gbkey=exon
CAGA01000191.1	EMBL	exon	3642	3902	.	-	.	ID=id2;Parent=gene0;gbkey=exon
CAGA01000191.1	EMBL	gap	7156	7946	.	+	.	ID=id3;estimated_length=791;gbkey=gap
CAGA01000191.1	EMBL	gene	11485	11880	.	+	.	ID=gene1;Name=CPUR_06802;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06802
CAGA01000191.1	EMBL	CDS	11485	11880	.	+	0	ID=cds1;Parent=gene1;Dbxref=NCBI_GP:CCE35374.1;Name=CCE35374.1;Note=CP_06802.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35374.1
CAGA01000191.1	EMBL	exon	11485	11880	.	+	.	ID=id4;Parent=gene1;gbkey=exon
CAGA01000191.1	EMBL	gene	11895	12257	.	-	.	ID=gene2;Name=CPUR_06803;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06803
CAGA01000191.1	EMBL	CDS	11895	12257	.	-	0	ID=cds2;Parent=gene2;Dbxref=NCBI_GP:CCE35375.1;Name=CCE35375.1;Note=CP_06803.1;gbkey=CDS;product=uncharacterized protein;protein_id=CCE35375.1
CAGA01000191.1	EMBL	exon	11895	12257	.	-	.	ID=id5;Parent=gene2;gbkey=exon
CAGA01000191.1	EMBL	gene	13574	15125	.	-	.	ID=gene3;Name=CPUR_06804;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CPUR_06804
CAGA01000191.1	EMBL	CDS	14956	15125	.	-	0	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	14507	14850	.	-	1	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	14135	14454	.	-	2	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	13822	14062	.	-	0	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	CDS	13574	13758	.	-	2	ID=cds3;Parent=gene3;Dbxref=NCBI_GP:CCE35376.1;Name=CCE35376.1;Note=CP_06804.1;gbkey=CDS;product=probable dis1-suppressing protein kinase dsk1;protein_id=CCE35376.1
CAGA01000191.1	EMBL	exon	13574	13758	.	-	.	ID=id6;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	13822	14062	.	-	.	ID=id7;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	14135	14454	.	-	.	ID=id8;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	14507	14850	.	-	.	ID=id9;Parent=gene3;gbkey=exon
CAGA01000191.1	EMBL	exon	14956	15125	.	-	.	ID=id10;Parent=gene3;gbkey=exon

Syntax error on gff.py

Hello,

I was trying to run some of your GFF3 statistics scripts, and after cloning the repository I get a syntax error when it tries to import gff.py

/Wyka/bioinformatics$ python report_gff3_statistics.py -i Claviceps_purpurea_LM4.gff3 -o output_test
Traceback (most recent call last):
File "report_gff3_statistics.py", line 19, in <module>
from biocode import gff
File "/opt/biocode/lib/biocode/gff.py", line 103
[*v] = map(unquote, tt[1].strip().split(COMMA))
^
SyntaxError: invalid syntax

I am running this on Ubuntu 18.04
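A likely cause, assuming the `python` command on that machine resolves to Python 2: the starred assignment target on the reported line is Python 3 syntax (extended iterable unpacking, PEP 3132), so Python 2 rejects the file before running any of it. The sketch below shows the syntax together with a hypothetical version guard; `require_python3` is not part of biocode.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Exit with a readable message when run under Python 2."""
    if version_info[0] < 3:
        raise SystemExit(
            "This script requires Python 3; try running it with python3.")


require_python3()

# Extended iterable unpacking: valid in Python 3, a SyntaxError in Python 2.
# Because a SyntaxError is raised when a file is compiled, a guard like the
# one above only helps if it lives in a file without Python 3-only syntax
# and runs before the module containing that syntax is imported.
[*values] = map(str.upper, "a,b,c".split(","))
```

Under that assumption, invoking the script as `python3 report_gff3_statistics.py ...` instead of `python ...` would be the first thing to try.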

write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff

Running write_fasta_from_gff.py on the output of convert_metagenemark_gff_to_gff3.py, I observed some large discrepancies between the number of CDS features in the GFF3 file and the number of CDS features written by write_fasta_from_gff.py. In one case the GFF3 file contained 16042 CDS features, but the FASTA output contained only 10299 sequences, a loss of 5743 CDS sequences, ~36% of the total.

write_fasta_from_gff.py error

When running write_fasta_from_gff.py, I'm getting an error and the output file contains only a small portion of the number of proteins that should be present (protein count varies every time).

The command I'm using is: python ~/git/biocode/gff/write_fasta_from_gff.py -i BV115/BV115.gff3 -f BV115/BV115.fasta -o test.txt

cwd: /local/scratch/ncpalmateer/silva_lab/p67

Error message:
Traceback (most recent call last):
File "/home/Nicholas.Palmateer/git/biocode/gff/write_fasta_from_gff.py", line 126, in <module>
main()
File "/home/Nicholas.Palmateer/git/biocode/gff/write_fasta_from_gff.py", line 87, in main
coding_seq = feat.get_CDS_residues(for_translation=True)
File "/home/Nicholas.Palmateer/git/biocode/lib/biocode/things.py", line 1093, in get_CDS_residues
chop = sorted_cds[0].phase
IndexError: list index out of range

path to checkout: /home/Nicholas.Palmateer/git/biocode
$PYTHONPATH in .bashrc: /home/Nicholas.Palmateer/git/biocode/lib
