wrf / genomegtftools Goto Github PK

convert various features into a GFF-like file for use in genome browsers

Python 84.80% R 15.20%

genomics visualization gff jbrowse genome-annotation genome-browsers

genomegtftools's Introduction

genomeGTFtools

Overview

These are some scripts to convert various features and annotations into a GFF-like file for use in genome browsers. There are also general purpose tools for genome annotation.

GFF (generic feature format) is a tab-delimited pseudoformat for annotating genomes, consisting of 8 well-defined columns, and a free-text 9th column (with some "guidelines" of what to include). Most work in fixing a GFF involves fixing the final column. While this is not a great format, most formats are insufficient for one reason or another. Ultimately, it would be better if certain fields were explicit, like ID (a unique machine-understandable identifier, which could be accession, etc.), Parent (any other ID), and Name (like gene name). This would also allow for fast indexing or sorting by ID, Parent, or Name, which would also make it easier to extract or process genes or features for graphics.

Column	Name	Description
1	seqid	The sequence ID, usually this is the contig, scaffold or chromosome (e.g. `ctg0123`, `scaffold_99`, `chrX`), and NOT the name of the gene or feature (which goes in column 9)
2	source	Usually the program that generated the data (e.g. `blastp`, `AUGUSTUS`)
3	type	Type of feature (e.g. `gene`, `exon`), typically using standarized terms
4	start	First base in the feature
5	end	Last base the feature
6	score	Float (of an arbitrary scale), could be something like coverage (say 0-1000), percent identity (0.0-100.0)
7	strand	Which DNA strand, as forward (+), reverse (-), unstranded (.) or unknown (?)
8	phase	Phase of the coding sequence as 0, 1, or 2 (i.e. whether the exon ends mid-codon), only applies to `CDS` features
9	attributes	All other information, as a string of pairs of `key=value;`, though this is variable depending on version or program. Most problems with GFF are problems with parsing information from this column.

Many of these tools were used in our analysis of the genome of the sponge Tethya wilhelma. Please cite the paper: Mills, DB. et al (2018) The last common ancestor of animals lacked the HIF pathway and respired in low-oxygen environments. eLife 7:e31176.

pfam2gff.py PFAM domains of proteins/coding sequences made into a GFF
blast2gff.py blast hits directly on a genome (i.e. the scaffolds) to GFF, generally indicating exons or conserved domains
blast2genomegff.py make features of blast hits against transcripts, showing the span of target proteins and include the splicing
microsynteny.py using two GFF files of two different genomes and blast hits, determine blocks of conserved gene order
extract_coordinates.py extract gene or exon information to draw diagrams that look like a genome browser

number_contigs_by_length

By convention, the longest chromosomes are numbered first. This naturally applies to scaffolds as well. Contigs/scaffolds can be renumbered and reordered with number_contigs_by_length.py script. Use the option -c to specify an additional output file of the conversion vector, that can be used to rename the scaffold column in any GFF file with the rename_gtf_contigs.py script.

pfam2gff

This has two modes: one will convert the "tabular" hmmscan output (generated using PFAM-A database (Pfam-A.hmm), which can be found in the FTP section of PFAM as the hmm database) into a protein GFF with domains at the protein positions. The other mode will generate a genome-GFF with the domain positions mapped onto the protein features, i.e. that the domains will fit into/across exons.

hmmscan --domE 0.1 --cpu 4 --domtblout stringtie.pfam.tab ~/PfamScan/data/Pfam-A.hmm stringtie_transdecoder_prots.fasta > stringtie.pfam.log

pfam2gff.py -i stringtie.pfam.tab > stringtie.pfam.gff

For genomic coordinates

The other output will convert the domain positions into genomic coordinates for use in genome browsers, so individual domains can be viewed spanning exons. Run hmmscan as above, then use the -g option to include genomic coordinates. Use -T for presets for TransDecoder genome GFF file.

pfam2gff.py -g stringtie_transdecoder.gff -i stringtie.pfam.tab -T > stringtie_transdecoder_pfam_domains.gff

For AUGUSTUS proteins (using extract_features.py or translated nucleotides), this would be run as:

hmmscan --domE 0.1 --cpu 4 --domtblout renilla_test_prots.pfam.tab ~/db/Pfam-A.hmm renilla_test_prots.fasta > /dev/null

Then run with the AUGUSTUS GFF (ensure this is GFF format and not GTF), instructing to use CDS features as exons with -x. Depending on version, the AUGUSTUS output might appear as below, where ID is not explicitly given in the attributes of the gene or transcripts (it just says g1), but is given for the introns or CDS. This makes it difficult to parse in a standarized way.

jcf7180000021585	AUGUSTUS	gene	921	9763	0.03	+	.	g1
jcf7180000021585	AUGUSTUS	transcript	921	9763	0.03	+	.	g1.t1
jcf7180000021585	AUGUSTUS	intron	1118	2513	0.83	+	.	transcript_id "g1.t1"; gene_id "g1";
jcf7180000021585	AUGUSTUS	CDS	921	1117	0.57	+	0	transcript_id "g1.t1"; gene_id "g1";
jcf7180000021585	AUGUSTUS	CDS	2514	2647	1	+	1	transcript_id "g1.t1"; gene_id "g1";

This should be changed to appear as GFF format. Introns are ignored anyway (thus can be removed) and CDS features will be treated as exons (with option -x).

jcf7180000021585	AUGUSTUS	gene	921	9763	0.03	+	.	ID=g1;Name=g1
jcf7180000021585	AUGUSTUS	mRNA	921	9763	0.03	+	.	ID=g1.t1;Parent=g1;Name=g1.t1
jcf7180000021585	AUGUSTUS	CDS	921	1117	0.57	+	0	Parent=g1.t1
jcf7180000021585	AUGUSTUS	CDS	2514	2647	1	+	1	Parent=g1.t1

This can be run with pfam2gff.py, specifying the GFF file with -g (this will also make it print a GFF of genomic coordinates).

pfam2gff.py -g aug_nocomments_cds.gff -e 1e-3 -i aug_prots.pfam.tab -x > aug_prots_pfam.gff

Produces the following example output. The results are not filtered by position, so it is evident that 3 different von Willebrand domains are found, though group 1 is clearly the strongest match, as indicated by the score column.

jcf7180000021585	hmmscan	PFAM	1011	1117	113.3	+	.	ID=g1.t1.VWA.1;Name=PF00092.VWA.von_Willebrand_factor_type_A_domain
jcf7180000021585	hmmscan	PFAM	2514	2585	113.3	+	.	ID=g1.t1.VWA.1;Name=PF00092.VWA.von_Willebrand_factor_type_A_domain
jcf7180000021585	hmmscan	PFAM	1014	1117	55.5	+	.	ID=g1.t1.VWA_2.1;Name=PF13519.VWA_2.von_Willebrand_factor_type_A_domain
jcf7180000021585	hmmscan	PFAM	2514	2526	55.5	+	.	ID=g1.t1.VWA_2.1;Name=PF13519.VWA_2.von_Willebrand_factor_type_A_domain
jcf7180000021585	hmmscan	PFAM	1011	1117	35.1	+	.	ID=g1.t1.VWA_3.1;Name=PF13768.VWA_3.von_Willebrand_factor_type_A_domain
jcf7180000021585	hmmscan	PFAM	2514	2566	35.1	+	.	ID=g1.t1.VWA_3.1;Name=PF13768.VWA_3.von_Willebrand_factor_type_A_domain

pfamgff2clans

Convert a PFAM protein GFF (above) to the PFAM clans, and remove some redundant hits, essentially just changing the names of the domains and merging duplicates. This is needed for the pfampipeline.py script. This script requires the clan links file, called Pfam-A.clans.tsv.

BioPython is required for this step, to get the length of each sequence.

pfampipeline

Requires BioPython, HMMER, PFAM-A and SignalP Script to generate graph of domains for a FASTA file of proteins. Both of the above scripts (pfam2gff.py and pfamgff2clans.py) are automatically called, followed by SignalP and the R script draw_protein_gtf.R. This is called as a command on a protein file, for example on the test dataset of nidogen proteins. The graph below illustrates some of the difficulty in identifying true orthologs, as the domain structure is quite variable between phyla.

pfampipeline.py test_data/nidogen_full_prots.fasta

Several output files are automatically generated, including the domain assignments in a GFF-like format, and a PDF of the domains, where a black line indicates the protein length, black box is the signal peptide, and colors correspond to different domains. Note that some domains may overlap, due to bad calls in the HMM. Domains are by definition non-overlapping, thus these must be removed. Automatic removal may be implemented in the future.

To view exon structure from a GFF file on a 3D protein structure, see instructions at my PDBcolor repo. Below is an example using human nidogen1, with the structure from Alphafold2:

blast2gff

This was a strategy to convert blast hits into gene models. The direction of the blast hit and the grouping of blast hits in the same region is most indicative of a gene (though possibly pseudogenes as well). In general, blasting all human proteins against the target genome can find many proteins even in distantly related organisms. Repeated domains or very common domains (like ATP binding for kinases) will show up all over the place, so limiting the -max_target_seqs is advisable.

Blast a protein set against the genome, and make use of the multithreading power of blast.

tblastn -query proteins.fa -db target_genome.fa -num_threads 16 -outfmt 6 > prots_vs_genome.tab

Run blast2gff.py on the output file. This will reformat the tabular blast hits into gff3 alignment style. Simple filtering options can be applied with -e -s and -F. If the queries were from SwissProt, use the -S option to correctly format the SwissProt fasta headers for the output.

blast2gff.py -b prots_vs_genome.tab > prots_vs_genome.gff3

blast2genomegff

As above for the domains in pfam2gff.py, entire blast hits to transcripts or proteins can be printed as GFF using the blast2genomegff.py script, where the blast hits will be shown spanning multiple exons as a single feature. This might help to identify erroneously fused genes. Using transcripts from a de novo transcriptome Trinity, or genome guided StringTie, convert blastx protein matches into genomic coordinates. For Trinity transcripts, the coordinates on the genome need to be determined by mapping the transcripts to the genome. This can be done with GMAP. Generically, this would be run as:

blastx -query transcripts.fasta -db uniprot_sprot.fasta -num_threads 16 -outfmt 6 -max_target_seqs 5 > transcripts_sprot.tab

blast2genomegff.py -b transcripts_sprot.tab -d uniprot_sprot.fasta -g transcripts.gtf > transcripts_sprot.genome.gff

Options are:

-p : program, by default is blastx, but change to blastp if proteins were used. The correct blast program is needed to calculate the intervals correctly, as blastp results need to be converted to nucleotide coordinates.
-x : if starting from proteins, and the CDS features are available (such as from AUGUSTUS), use CDS features from the GFF instead of exons (that is, exons are not specified at all). If exons are specified as well and are mostly identical to CDS, then also use the option --skip-exons.
-D : delimiter for spliting names in the tabular blast output. For example, if names were gene1.CDS, the option -D . would be used to split at gene1.
-F : as above for -D, but to split the IDs in the GFF. If the query column in blast (first column) and the ID in the GFF do not match, and there is no output, then this or -D may fix the problem.
-c : coverage cutoff, remove queries where the hit is under 0.1 of the subject length
-e : E-value cutoff, by default is 1e-3
-s : bitscore/length cutoff, remove hits with bitscore/length of under 0.1, that is, remove very distant matches. Set higher for more closely related species (0.3) or lower for distance species (0.05).
-G : ignore gene features, or other high-level features like mRNA or transcript, and instead extract the ID directly from each exon. This might be more convenient to use if exons are given unique IDs in the GFF, like gene1.t1.exon1. This is typically used with -F, as -G -F "."
-S : specifies that the target proteins are from SwissProt/UniProt. This is required for using --add-accession, and usually for --add-description, though the description should work anyway, but will not be parsed correctly.

starting from StringTie transcripts

StringTie transcripts can be converted to fasta using the script cufflinks_gtf_genome_to_cdna_fasta.pl (packaged with TransDecoder). These are used as input for blastx. Note that with blastx, some can hit antisense, which suggests there is a protein on the antisense strand, or possibly there is an erroneous fusion of two adjacent genes.

Blastx the transcriptome against a protein set, such as Swissprot, or perhaps gene models of a related organism.

blastx -query stringtie.fasta -db uniprot_sprot.fasta -max_target_seqs 10 -evalue 1e-3 > stringtie_vs_swissprot_blastx.tab

Convert to genomic coordinates, so individual protein hits can be seen spanning exons. The genes -g can be defined using the GTF output of StringTie.

blast2genomegff.py -b stringtie_vs_swissprot_blastx.tab -g stringtie.gtf -d uniprot_sprot.fasta -S > stringtie_vs_swissprot_blastx.gff

starting from AUGUSTUS proteins or CDS

When running AUGUSTUS, include the options --protein=on --cds=on --gff=on. As above, the proteins themselves (and the CDS nucleotides) can be extracted with the extract_features.py script. Then run either blastp or blastx, for proteins or nucleotide CDS, respectively.

blastp -query renilla_test_prots.fasta -db ~/db/human_uniprot.fasta -outfmt 6 -evalue 1e-4 -max_target_seqs 5 > renilla_vs_human_blastp_1e-4.tab

By default, AUGUSTUS does not report exon features (only intron and CDS).

blast2genomegff.py -b renilla_vs_human_blastp_1e-4.tab -g renilla_test_prots.gff -d ~/db/human_uniprot.fasta -S -x > renilla_vs_human_blastp_1e-4.gff

If proteins were used, set -p to blastp. If nucleotide CDS was used, the default -p is blastx.

microsynteny

Blocks of colinear genes between two species can be identified using blast and GFF files of the gene positions for each species.

Required terms are:

-b tabular blast or diamond results, of the query proteins or genes against some database
-q GFF file of the query genes. The results are sensitive to presence of multiple features at the same locus. It is advisable to include only features that would match the blast results, either gene or mRNA. i.e. grep gene annotation.gff > genes_features_only.gff
-d GFF of the genes of the target genome

Common options are:

-m minimum length of a block, 3 is usually sufficient for true synteny, as determined by randomized gene order (with -R)
-z max allowed distance between genes. This should be roughly the upper limit of intergenic distances in the genome, meaning 99% of genes are closer than -z (see here for an example).
-R randomize gene order of the query, to estimate false discovery rate
--make-gff produce a GFF output, instead of gene-by-gene information of the synteny blocks

Blastx or blastp of the transcriptome (or translated CDS) against a database of proteins from the target species. Use the tabular output -outfmt 6. A maximum number of sequences does not need to be set, since the objective is to find homology, and this is not assumed from blast similarity. For example, here I am using the genomes of the corals Acropora digitifera and Styllophora pistillata.

blastp -query adi_aug101220_pasa_prot.fa -db Spis.genome.annotation.pep.longest.fa -outfmt 6 -evalue 1e-2 > acropora_vs_styllophora_blastp.tab

Determine the blocks with the microsynteny.py script. Various delimiter parameters may need to be set depending on the version of GFF for the annotations and the names of the proteins. In this case, as the subject proteins contain the same names as the ID field in the GFF, the delimiter -D must be set to any alternate symbol, here using |.

microsynteny.py -b acropora_vs_styllophora_blastp.tab -q test_data/adi_aug101220_pasa_mrna_t1_only.gff -d test_data/Spis.genome.annotation.mrna_only.gff -D "|" > test_data/acropora_vs_styllophora_microsynteny.tab

The standard output is a tab-delimited text file of 11 columns. The colinear block is evident from the systematic numbering of the two genomes, as the five A. digitifera genes (numbered 23536-23540) correspond to the five S. pistillata genes (Spis14158-14162). The strand direction is the same for each gene as well.

query-scaf	sub-scaf	block-id	query-gene	q-start	q-end	q-strand	sub-gene	s-start	s-end	s-strand
scaf16952	Spis.scaffold280|size416857	blk-2	aug_v2a.23536.t1	23026	25506	+	Spis14158	255619	257861	+
scaf16952	Spis.scaffold280|size416857	blk-2	aug_v2a.23537.t1	28396	43517	+	Spis14159	264153	287169	+
scaf16952	Spis.scaffold280|size416857	blk-2	aug_v2a.23538.t1	53236	67680	+	Spis14160	294209	310911	+
scaf16952	Spis.scaffold280|size416857	blk-2	aug_v2a.23539.t1	68035	73535	-	Spis14161	321061	322080	-
scaf16952	Spis.scaffold280|size416857	blk-2	aug_v2a.23540.t1	77256	88420	+	Spis14162	340030	347635	+

There are two main caveats to the data generated by this program. First, the reported block length represents the most genes on either query or target scaffolds, meaning if genes are erronously fused or split on one of the two species, this could, for instance, generate a block of 1 gene in the query and 3 genes in the target. Below is an example of a block where gene aug_v2a.19447.t1 blasts to five genes in a row in S. pistillata. From these data alone, it cannot be determined if the A. digitifera gene is a false fusion, or the S. pistillata genes are falsely split.

scaf11463	Spis.scaffold21|size1341190	blk-19	aug_v2a.19446.t1	188491	189125	-	Spis2568	665897	666978	-
scaf11463	Spis.scaffold21|size1341190	blk-19	aug_v2a.19447.t1	192487	193775	-	Spis2569	668955	676058	-
scaf11463	Spis.scaffold21|size1341190	blk-19	aug_v2a.19447.t1	192487	193775	-	Spis2573.t2	740727	760040	-
scaf11463	Spis.scaffold21|size1341190	blk-19	aug_v2a.19447.t1	192487	193775	-	Spis2572	715149	725680	-
scaf11463	Spis.scaffold21|size1341190	blk-19	aug_v2a.19447.t1	192487	193775	-	Spis2571	705673	706617	-
scaf11463	Spis.scaffold21|size1341190	blk-19	aug_v2a.19447.t1	192487	193775	-	Spis2575	784361	787164	-

The other potential problem occurs in the case of tandem duplications. If a tandem duplication occurred in one of the two genomes, then nearly identical blocks will be identified for each duplicate, as up to 5 intervening genes are allowed (by the option -s). A hypothetical example is shown in the figure below. In such a case, hypothetical genes 1-3 would be matched twice (once normally, once with the intervening gene 1b), meaning the block will be counted twice.

Optionally, determine the frequency of random matches using the -R option. This will randomly reorder the positions of the query proteins, and then check for synteny. Blocks of 3-in-a-row are extremely rare, though blocks of two can be found easily by setting -m 2. This suggests that 3 colinear genes is usually sufficient to infer inherited gene order between the two species. Usually, long blocks are actually erroneous gene fusions in one of the genomes (as above).

microsynteny.py -b acropora_vs_styllophora_blastp.tab -q test_data/adi_aug101220_pasa_mrna_t1_only.gff -d test_data/Spis.genome.annotation.mrna_only.gff -D "|" -R -m 2 > test_data/acropora_vs_styllophora_microsynteny.randomized.tab

A summary plot of block lengths can be produced using the associated R script synteny_block_length_plot.R.

Rscript synteny_block_length_plot.R acropora_vs_styllophora_microsynteny.tab

Generate a GFF of the blocks, to plot in a genome browser. Use the option --make-gff.

scaffold_synteny

Generate a PDF of a dot plot, similar to what was done in Srivistava 2008 and Simakov 2013. This requires unidirectional blast results (not reciprocal) as duplicated blocks can be identified this way.

scaffold_synteny.py -b hoilungia_vs_trichoplax_blastp_e-3.tab -q Hhon_BRAKER1_genes.gff3 -d Trichoplax_scaffolds_JGI_AUGUSTUS_transcript_only.gff -f Hhon_final_contigs_unmasked.fasta -F Triad1_genomic_scaffolds.fasta --blast-query-delimiter . --blast-db-delimiter __ -l 80 -L 100 > hoilungia_vs_trichoplax_scaffold2d_points.tab

Then, the R script is run to generate the dot plot. Synteny is clearly evident for major sections of chromosomes, indicated by the diagonals formed by the points. For instance, the longest H. hongkongensis contig maps completely to scaffold 7 in Trichoplax. This is mostly the case for the first 5 contigs in H. hongkongensis.

Rscript synteny_2d_plot.R hoilungia_vs_trichoplax_scaffold2d_points.tab Hoilungia-hongkongensis Trichoplax-adhaerens

The same can be generated for more distant species (such as this one published in Kenny 2020). Here two choanoflagellates are used, Monosiga brevicollis (using the AUGUSTUS reannotation) and Salpingoeca rosetta (at Ensembl).

blastp -query Monbr1_augustus_v1.prot_no_rename.fasta -db Salpingoeca_rosetta.Proterospongia_sp_ATCC50818.pep.all.fa -outfmt 6 -num_threads 6 -evalue 1e-3 -max_target_seqs 100 > monbr1_vs_srosetta_blastp.tab

scaffold_synteny.py -b monbr1_vs_srosetta_blastp.tab -q Monbr1_augustus_v1_no_comment.gff -d ~/genomes/salpingoeca_rosetta/srosetta_mrna_only_ID_renamed.gff -f Monbr1_scaffolds.fasta -F ~/genomes/salpingoeca_rosetta/Salpingoeca_rosetta.Proterospongia_sp_ATCC50818.dna.toplevel.fa.gz -l 40 -L 50 > monbr1_vs_srosetta_scaffold2d_points.tab

Rscript ~/git/genomeGTFtools/synteny_2d_plot.R monbr1_vs_srosetta_scaffold2d_points.tab Monosiga-brevicollis Salpingoeca-rosetta

extract_coordinates

This is a script to extract feature positions from a GFF to generate a figure that roughly resembles what is seen in a genome browser, to make something that has the same effect as a screenshot but looks better. Because GFFs contain indices by scaffold position, rather than gene, the script effectively converts the tabular GFF to a format where the features are the genes and exons themselves. This allows for easy downstream processing and visualization with the associated script draw_annotation_blocks.R.

Required terms are:

-g one or more GFF files, each roughly representing a track in a genome browser
-s the chromosome or scaffold ID
-b the beginning position to extract from that scaffold (inclusive)
-e the end position to extract from that scaffold (also inclusive)

The output of the program is a 5-column tab-delimited file, containing gene name, feature type, start, stop, and strand. Even if multiple GFFs are used, all of the transcripts or genes will end up in the same file.

For example, studying the arrangement of the Lux operon, in the bacterium Aliivibrio fisheri, the genome and GFF are downloaded from NCBI.

extract_coordinates.py -g GCF_000011805.1_ASM1180v1_genomic.gff -s NC_006841.2 -b 1041700 -e 1051700 -p > lux_locus_short_annot.tab

draw_annotation_blocks

A rather unsophisticated script to plot the extracted coordinates of genes or exons from extract_coordinates.py. By default, it only requires the output of extract_coordinates.py and will generate a .pdf of the same name. In this example, the input file lux_locus_short_annot.tab yields the output lux_locus_short_annot.pdf.

Rscript draw_annotation_blocks.R lux_locus_short_annot.tab

As there is not a straightforward way to display features, or generate a figure of whatever someone would want to convey, the output is rather quirky, ultimately meaning that it is much easier to generate the .pdf and import it into another program like Inkscape.

The X-axis and scaling of the features is set by the boundaries from -b and -e in extract_coordinates.py, meaning to make two graphs to compare loci, the span of -b to -e should be the same
Each gene or transcript is printed on a different Y-position, of up to 30 features (this can be changed). This means that splice variants are each on their own line, as are genes in operons.
All features are colored the same.
The names are taken from the ID tag in the GFF, which may not be the same as Name.

The raw output is shown below. Some feature names are cut off due to the grouping of objects. These can be ungrouped and moved.

Naturally this requires some work to make it presentable, i.e. flipping the axis, coloring each gene, renaming them from the GFF ID, which did not have meaningful IDs for most of the proteins. Thus, a final version may look more like this, maintaining the color scheme used in Dunlap 2009.

draw_genome_annotation

This was intended to make a printable version of any NCBI prokaryote genome, and have it displayed on one or more pages, with 1Mbp per page. From the downloaded GFF (or .gz), the PDF is generated automatically with the Rscript.

Rscript draw_genome_annotation.R GCF_000011805.1_ASM1180v1_genomic.gff

repeat2gtf

From scaffolds or masked contigs, generate a feature for each long repeat of N's or n's (or any other arbitrary letter or pattern). The most obvious application is to make a track for gaps, which is the default behavior. The search is a regular expression, so could be any other simple repeat as well - CACA, CAG (glutamine repeats).

repeat2gtf.py scaffolds.fasta > scaffolds_gaps.gtf

In jbrowse, I typically change a few options in the style to make this more visually useful. The color is set to black, the height is reduced to narrow bars, and the label is the score, which is the length of the gap.

         "style" : {
            "className" : "feature",
            "color" : "#000000",
            "height" : 5,
            "label" : "score"
         },

pal2gtf

Convert palindromic repeats from the EMBOSS program palindrome into GTF features. This was meant for mitochondrial genomes, but could potentially be whole nuclear genomes.

DEPRICATED: blast2genewise

To get gene models from blast hits, the best strategy may be to use blast2gff.py with the option -A to convert the blast hits to AUGUSTUS hints (which are in a GFF-like format). This is then specified in the AUGUSTUS run as: --hintsfile=geneset_vs_scaffolds.gff

blast2gff.py -t CDSpart -b geneset_vs_scaffolds.tab -A > geneset_vs_scaffolds.gff

The original idea was intended to take advantage of the speed of blasting. Blast hits are then parsed to give a single command to Genewise, and the gff output is collected into a single file and reformatted for modern genome browsers. This is very similar to the strategy used by the BUSCO pipeline, which takes blast hits and runs AUGUSTUS on the scaffold that was hit. As AUGUSTUS can use HMM-like profiles to find specific proteins that are conserved, perhaps with more complex domain structures, this might be developed further.

genomegtftools's People

Contributors

Stargazers

Watchers

genomegtftools's Issues

The different domain genome coordinates for test viral genomes like ##sequence-region KJ755186.1 1 10774 ##sequence-region NC_007803.1 1 19212 ##sequence-region NC_012532.1 1 10794 ##sequence-region NC_045512.2 1 29903

Hi,
I met different domain genome coordinates for ZIKV, sars-cov2,
The script used in genomeGTFtools can't get the exact genome coordinates as their description in NCBI.

Can you gene me some solutions?

I run as fellow,

I download viral fasta sequence from NCBI,

##sequence-region KJ755186.1 1 10774
##sequence-region NC_007803.1 1 19212
##sequence-region NC_012532.1 1 10794
##sequence-region NC_045512.2 1 29903

ANNOT=testVirus
FilePrefix=Prokka_output

using prokka find CDS

prokka
--outdir $ANNOT --force
--prefix $FilePrefix
--addmrna
--locustag ZNL
--kingdom Viruses --gcode 11
--metagenome
testVirus.fasta

sed -i '/FASTA/,$d' $ANNOT/$FilePrefix.gff
sed -i 's/_mRNA//g' $ANNOT/$FilePrefix.gff

find domain location with hmmscan

hmmscan --domE 0.1 --cpu 4 --domtblout $ANNOT/$FilePrefix.faa.pfam.tab ~/20T/DataBase/SoftwaresEnsembel/EnTAP/interproscan-5.59-91.0/data/pfam/35.0/pfam_a.hmm $ANNOT/$FilePrefix.faa > stringtie.pfam.log

pfam2gff.py -g $ANNOT/$FilePrefix.gff -i $ANNOT/$FilePrefix.faa.pfam.tab -T > $ANNOT/$FilePrefix.faa.pfam_domains.gff

delete blank rows

sed -ir '/^$/d' $ANNOT/$FilePrefix.faa.pfam_domains.gff

The cooridinations in $ANNOT/$FilePrefix.faa.pfam_domains.gff different from their description in NCBI.

UnboundLocalError: local variable 'geneid' referenced before assignment

Hi,
I used the latest code and got UnboundLocalError: local variable 'geneid' referenced before assignment. The following parameters were used:

python /apps/genomeGTFtools/blast2genomegff.py -p blastp -b labVSviridiplantae_hybrid_blastp.out -g NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.gff3 -d uniprot-viridiplantae-reviewed-yes-isoforms.fasta --gff-delimiter "." -G -S -x > labVSviridiplantae_hybrid_blastp.gff

# Parsing target sequences from uniprot-viridiplantae-reviewed-yes-isoforms.fasta  Fri Feb 14 13:50:22 2020
# Found 42684 sequences  Fri Feb 14 13:50:23 2020
# Parsing gff from NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT.newID.gff3  Fri Feb 14 13:50:23 2020
# CDS features WILL BE USED as exons
# gene name and strand will be read for each exon
Traceback (most recent call last):
  File "/work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py", line 462, in <module>
    main(sys.argv[1:],sys.stdout)
  File "/work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py", line 456, in main
    geneintervals, genestrand, genescaffold =  gtf_to_intervals(args.genes, args.cds_exons, args.skip_exons, args.transdecoder, args.no_genes, args.gff_delimiter)
  File "/work/waterhouse_team/apps/genomeGTFtools/blast2genomegff.py", line 121, in gtf_to_intervals
    geneid = geneid.rsplit(genesplit,1)[0]
UnboundLocalError: local variable 'geneid' referenced before assignment

This my new GFF3 file:

##gff-version 3
##sequence-region NbV1Ch01 1 187295484
NbV1Ch01        annotation      remark  1       187295484       .       .       .       gff-version=3;sequence-region=%28%27NbV1Ch01%27%2C 0%2C 187295484%29
NbV1Ch01        transdecoder    gene    98185   99705   .       -       .       ID=NBlab01G00010
NbV1Ch01        transdecoder    mRNA    98185   99705   .       -       .       ID=NBlab01G00010.1;Parent=NBlab01G00010
NbV1Ch01        transdecoder    exon    98185   98571   .       -       .       ID=NBlab01G00010.1.exon4;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    exon    98679   98844   .       -       .       ID=NBlab01G00010.1.exon3;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    exon    99134   99325   .       -       .       ID=NBlab01G00010.1.exon2;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    exon    99417   99705   .       -       .       ID=NBlab01G00010.1.exon1;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     98186   98571   .       -       2       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     98679   98844   .       -       0       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     99134   99325   .       -       0       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    CDS     99417   99704   .       -       0       ID=NBlab01G00010.1.cds;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    five_prime_UTR  99705   99705   .       -       .       ID=NBlab01G00010.1.5UTRg1;Parent=NBlab01G00010.1
NbV1Ch01        transdecoder    three_prime_UTR 98185   98185   .       -       .       ID=NBlab01G00010.1.3UTRg1;Parent=NBlab01G00010.1
NbV1Ch01        AUGUSTUS        gene    109665  112554  0.04    -       .       ID=NBlab01G00020
NbV1Ch01        AUGUSTUS        mRNA    109665  112554  0.04    -       .       ID=NBlab01G00020.1;Parent=NBlab01G00020
NbV1Ch01        AUGUSTUS        exon    109665  110489  .       -       .       ID=NBlab01G00020.1.exon1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        exon    110608  111042  .       -       .       ID=NBlab01G00020.1.exon2;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        exon    111592  111844  .       -       .       ID=NBlab01G00020.1.exon3;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        exon    112128  112554  .       -       .       ID=NBlab01G00020.1.exon4;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     109839  110489  0.69    -       0       ID=NBlab01G00020.1.CDS1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     110608  111042  0.21    -       0       ID=NBlab01G00020.1.CDS2;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     111592  111844  0.23    -       1       ID=NBlab01G00020.1.CDS3;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        CDS     112128  112450  0.95    -       0       ID=NBlab01G00020.1.CDS4;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        five_prime_utr  112451  112554  0.26    -       .       ID=NBlab01G00020.1.5UTR1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        intron  110490  110607  0.68    -       .       ID=NBlab01G00020.1.intron1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        intron  111043  111591  0.22    -       .       ID=NBlab01G00020.1.intron2;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        intron  111845  112127  0.49    -       .       ID=NBlab01G00020.1.intron3;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        start_codon     112448  112450  .       -       0       ID=NBlab01G00020.1.SCodon1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        stop_codon      109839  109841  .       -       0       ID=NBlab01G00020.1.ECodon1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        three_prime_utr 109665  109838  0.25    -       .       ID=NBlab01G00020.1.3UTR1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        transcription_end_site  109665  109665  .       -       .       ID=NBlab01G00020.1.TES1;Parent=NBlab01G00020.1
NbV1Ch01        AUGUSTUS        transcription_start_site        112554  112554  .       -       .       ID=NBlab01G00020.1.TSS1;Parent=NBlab01G00020.1

What did I miss?

Thank you in advance,

Michal

merging -K and no -K parameter

Hi,
I ran blast2genomegff.py twice. The first one with -K (track 2) and second without -K (track 3)

python blast2genomegff.py -p blastp -b labVSviridiplantae_hybrid_blastp.out -g augustus.hints_utrAGAT.newID.no_remark.gff3 -d uniprot-viridiplantae-reviewed-yes-isoforms.fasta --gff-delimiter "." -G -K -S -x > labVSviridiplantae_hybrid_blastp-K.gff

python blast2genomegff.py -p blastp -b labVSviridiplantae_hybrid_blastp.out -g augustus.hints_utrAGAT.newID.no_remark.gff3 -d uniprot-viridiplantae-reviewed-yes-isoforms.fasta --gff-delimiter "." -G -S -x > labVSviridiplantae_hybrid_blastp-K.gff

However, comparing NBlab03G03860.1 from the first track it appears that the proteins from track 2 and 3 should not have been split.

Is there a way to merge those two proteins to make one big one?

Thank you in advance,

Michal

microsynteny Name attribute change

Hi all,
I ran microsynteny.py -b no_remark.out -q Q.no_remark.mRNA.gff3 -d NbR.no_remark.mRNA.gff3 -D "|" --make-gff > Q-NbR.no_remark.mRNA.microsynteny.gff3

Diamond's BLASTp output:

NBqld01G09150.1 NBlab01G14990.1 100.0   209     0       0       1       209     409     617     8.5e-123        438.3
NBqld01G09150.1 NBlab01G14990.3 100.0   209     0       0       1       209     409     617     8.5e-123        438.3
NBqld01G09150.1 NBlab02G09300.1 95.7    209     9       0       1       209     191     399     4.1e-117        419.5
NBqld01G09150.1 NBlab02G09300.2 95.7    209     9       0       1       209     393     601     4.1e-117        419.5
NBqld01G09150.1 NBlab05G10170.1 42.3    182     104     1       1       181     269     450     8.1e-41 166.0
NBqld01G09150.1 NBlab16G10960.1 41.5    212     109     3       1       206     251     453     1.1e-40 165.6
NBqld01G09150.1 NBlab03G05570.2 40.6    197     108     2       1       197     418     605     2.4e-40 164.5
NBqld01G09150.1 NBlab15G04360.1 42.4    191     101     2       4       194     335     516     3.1e-40 164.1
NBqld01G09150.1 NBlab06G09770.1 40.9    193     105     2       1       193     538     721     1.2e-39 162.2
NBqld01G09150.1 NBlab06G09780.1 40.9    193     105     2       1       193     538     721     1.2e-39 162.2
NBqld01G09150.1 NBlab03G17680.1 42.8    194     102     3       1       194     317     501     2.0e-39 161.4
NBqld01G09150.1 NBlab03G17690.1 42.8    194     102     3       1       194     316     500     2.0e-39 161.4
NBqld01G09150.1 NBlab04G16080.1 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.1 NBlab04G16080.2 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.1 NBlab04G16080.3 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.1 NBlab04G16080.4 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.1 NBlab06G17770.1 42.8    194     102     2       1       194     391     575     2.6e-39 161.0
NBqld01G09150.1 NBlab19G04850.1 41.8    182     105     1       1       181     77      258     3.4e-39 160.6
NBqld01G09150.1 NBlab02G04500.1 41.0    183     104     1       1       183     294     472     5.8e-39 159.8
NBqld01G09150.1 NBlab12G07510.1 41.8    194     104     2       1       194     369     553     7.6e-39 159.5
NBqld01G09150.1 NBlab12G20100.1 40.0    200     109     3       1       200     493     681     9.9e-39 159.1
NBqld01G09150.1 NBlab12G20110.1 40.0    200     109     3       1       200     72      260     9.9e-39 159.1
NBqld01G09150.1 NBlab05G19490.1 41.7    199     101     3       5       201     552     737     1.3e-38 158.7
NBqld01G09150.1 NBlab05G19490.2 41.7    199     101     3       5       201     490     675     1.3e-38 158.7
NBqld01G09150.1 NBlab05G19490.3 41.7    199     101     3       5       201     552     737     1.3e-38 158.7
NBqld01G09150.2 NBlab01G14990.1 100.0   209     0       0       1       209     409     617     8.5e-123        438.3
NBqld01G09150.2 NBlab01G14990.3 100.0   209     0       0       1       209     409     617     8.5e-123        438.3
NBqld01G09150.2 NBlab02G09300.1 95.7    209     9       0       1       209     191     399     4.1e-117        419.5
NBqld01G09150.2 NBlab02G09300.2 95.7    209     9       0       1       209     393     601     4.1e-117        419.5
NBqld01G09150.2 NBlab05G10170.1 42.3    182     104     1       1       181     269     450     8.1e-41 166.0
NBqld01G09150.2 NBlab16G10960.1 41.5    212     109     3       1       206     251     453     1.1e-40 165.6
NBqld01G09150.2 NBlab03G05570.2 40.6    197     108     2       1       197     418     605     2.4e-40 164.5
NBqld01G09150.2 NBlab15G04360.1 42.4    191     101     2       4       194     335     516     3.1e-40 164.1
NBqld01G09150.2 NBlab06G09770.1 40.9    193     105     2       1       193     538     721     1.2e-39 162.2
NBqld01G09150.2 NBlab06G09780.1 40.9    193     105     2       1       193     538     721     1.2e-39 162.2
NBqld01G09150.2 NBlab03G17680.1 42.8    194     102     3       1       194     317     501     2.0e-39 161.4
NBqld01G09150.2 NBlab03G17690.1 42.8    194     102     3       1       194     316     500     2.0e-39 161.4
NBqld01G09150.2 NBlab04G16080.1 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.2 NBlab04G16080.2 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.2 NBlab04G16080.3 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.2 NBlab04G16080.4 39.6    197     110     2       1       197     418     605     2.6e-39 161.0
NBqld01G09150.2 NBlab06G17770.1 42.8    194     102     2       1       194     391     575     2.6e-39 161.0
NBqld01G09150.2 NBlab19G04850.1 41.8    182     105     1       1       181     77      258     3.4e-39 160.6
NBqld01G09150.2 NBlab02G04500.1 41.0    183     104     1       1       183     294     472     5.8e-39 159.8
NBqld01G09150.2 NBlab12G07510.1 41.8    194     104     2       1       194     369     553     7.6e-39 159.5
NBqld01G09150.2 NBlab12G20100.1 40.0    200     109     3       1       200     493     681     9.9e-39 159.1
NBqld01G09150.2 NBlab12G20110.1 40.0    200     109     3       1       200     72      260     9.9e-39 159.1
NBqld01G09150.2 NBlab05G19490.1 41.7    199     101     3       5       201     552     737     1.3e-38 158.7
NBqld01G09150.2 NBlab05G19490.2 41.7    199     101     3       5       201     490     675     1.3e-38 158.7
NBqld01G09150.2 NBlab05G19490.3 41.7    199     101     3       5       201     552     737     1.3e-38 158.7
NBqld01G09160.1 NBlab02G09310.1 94.6    112     6       0       1       112     1       112     3.2e-56 216.5
NBqld01G09160.1 NBlab01G15000.1 96.4    112     0       1       1       112     1       108     7.2e-56 215.3
NBqld01G09160.1 NBlab02G09310.3 84.8    112     5       1       1       112     1       100     8.0e-47 185.3
NBqld01G09160.3 NBlab01G15000.1 97.5    203     1       1       1       203     1       199     2.0e-105        380.6
NBqld01G09160.3 NBlab02G09310.1 94.0    201     12      0       1       201     1       201     1.2e-102        371.3
NBqld01G09160.3 NBlab02G09310.3 86.7    128     5       1       1       128     1       116     4.3e-55 213.4
NBqld01G09160.5 NBlab02G09310.1 94.6    112     6       0       1       112     1       112     3.2e-56 216.5
NBqld01G09160.5 NBlab01G15000.1 96.4    112     0       1       1       112     1       108     7.2e-56 215.3
NBqld01G09160.5 NBlab02G09310.3 84.8    112     5       1       1       112     1       100     8.0e-47 185.3
NBqld01G09170.1 NBlab02G09320.1 98.0    149     3       0       1       149     1       149     8.0e-83 305.1
NBqld01G09170.1 NBlab02G09320.2 98.0    149     3       0       1       149     1       149     8.0e-83 305.1
NBqld01G09170.1 NBlab02G09320.3 98.0    149     3       0       1       149     1       149     8.0e-83 305.1
NBqld01G09170.1 NBlab02G09320.4 98.0    149     3       0       1       149     1       149     8.0e-83 305.1
NBqld01G09170.1 NBlab19G02340.1 89.0    145     16      0       5       149     4       148     3.0e-74 276.6
NBqld01G09170.1 NBlab19G02340.2 89.0    145     16      0       5       149     4       148     3.0e-74 276.6
NBqld01G09170.1 NBlab05G15330.1 88.3    145     17      0       5       149     4       148     9.8e-73 271.6
NBqld01G09170.1 NBlab05G15330.2 88.3    145     17      0       5       149     4       148     9.8e-73 271.6
NBqld01G09170.1 NBlab01G15010.1 96.7    90      3       0       60      149     1       90      9.2e-47 185.3
NBqld01G09180.1 NBlab02G09330.2 92.1    114     9       0       10      123     13      126     6.4e-57 218.8
NBqld01G09180.1 NBlab02G09330.3 92.1    114     9       0       10      123     13      126     6.4e-57 218.8
NBqld01G09180.1 NBlab01G15020.1 100.0   104     0       0       14      117     1       104     2.7e-55 213.4
NBqld01G09180.1 NBlab01G15020.2 100.0   104     0       0       14      117     1       104     2.7e-55 213.4
NBqld01G09180.1 NBlab02G09330.1 96.3    108     4       0       10      117     13      120     3.5e-55 213.0
NBqld01G09180.2 NBlab01G15020.1 100.0   259     0       0       5       263     1       259     6.0e-150        528.9
NBqld01G09180.2 NBlab02G09330.1 94.7    263     14      0       1       263     13      275     1.4e-143        507.7
NBqld01G09180.2 NBlab01G15020.2 100.0   157     0       0       5       161     1       157     9.1e-90 328.9
NBqld01G09180.2 NBlab01G15020.3 98.7    158     2       0       106     263     6       163     1.9e-87 321.2
NBqld01G09180.2 NBlab02G09330.2 96.3    108     4       0       1       108     13      120     1.2e-54 212.2
NBqld01G09180.2 NBlab02G09330.3 96.3    108     4       0       1       108     13      120     1.2e-54 212.2
NBqld01G09180.2 NBlab07G01870.1 46.7    214     112     2       47      258     79      292     8.3e-51 199.5

GFF Query:

NBqld01 transdecoder    mRNA    75913489        75916784        .       +       .       ID=NBqld01G09150.1;Note=Pentatricopeptide repeat-containing protein At1g26900%2C mitochondrial;Parent=NBqld01G09150
NBqld01 transdecoder    mRNA    75913489        75916116        .       +       .       ID=NBqld01G09150.2;Note=Pentatricopeptide repeat-containing protein At1g26900%2C mitochondrial;Parent=NBqld01G09150
NBqld01 transdecoder    mRNA    75918271        75922448        .       -       .       ID=NBqld01G09160.1;Note=Ran guanine nucleotide release factor;Parent=NBqld01G09160
NBqld01 transdecoder    mRNA    75918532        75922421        .       -       .       ID=NBqld01G09160.3;Note=Ran guanine nucleotide release factor;Parent=NBqld01G09160
NBqld01 transdecoder    mRNA    75918532        75922448        .       -       .       ID=NBqld01G09160.5;Note=Ran guanine nucleotide release factor;Parent=NBqld01G09160
NBqld01 transdecoder    mRNA    75924269        75929750        .       +       .       ID=NBqld01G09170.1;Note=Cullin-3B;Parent=NBqld01G09170
NBqld01 transdecoder    mRNA    75932007        75936883        .       -       .       ID=NBqld01G09180.1;Note=haloacid dehalogenase-like hydrolase domain-containing protein 3;Parent=NBqld01G09180
NBqld01 transdecoder    mRNA    75932157        75936856        .       -       .       ID=NBqld01G09180.2;Note=Glyceraldehyde 3-phosphate phosphatase;Parent=NBqld01G09180

GFF Target

NbV1Ch01        transdecoder    mRNA    90446162        90449165        .       +       .       ID=NBlab01G14990.1;Note=Pentatricopeptide repeat-containing protein At1g26900%2C mitochondrial;Parent=NBlab01G14990
NbV1Ch01        transdecoder    mRNA    90446162        90449165        .       +       .       ID=NBlab01G14990.3;Note=Pentatricopeptide repeat-containing protein At1g26900%2C mitochondrial;Parent=NBlab01G14990
NbV1Ch01        transdecoder    mRNA    88248079        88254983        .       -       .       ID=NBlab01G14500.1;Note=Heat shock protein 90-5%2C chloroplastic;Parent=NBlab01G14500
NbV1Ch01        transdecoder    mRNA    88248079        88254983        .       -       .       ID=NBlab01G14500.2;Note=Heat shock protein 90-5%2C chloroplastic;Parent=NBlab01G14500
NbV1Ch01        AUGUSTUS        mRNA    88291066        88292874        0.04    +       .       ID=NBlab01G14510.1;Note=nucleolin-like;Parent=NBlab01G14510
NbV1Ch01        AUGUSTUS        mRNA    88322551        88324084        0.1     -       .       ID=NBlab01G14520.1;Note=zinc finger BED domain-containing protein RICESLEEPER 2-like;Parent=NBlab01G14520

The resulting GFF3 file:

NBqld01 microsynteny    match   75913489        75936856        8       +       .       ID=blk-1905;Name=blk-1905_to_NbV1Ch01;Target=NbV1Ch01 90446162 90479531
NBqld01 microsynteny    match_part      75913489        75916784        438.3   +       .       ID=blk-1905.1.NBqld01G09150.1;Parent=blk-1905;Target=NBlab01G14990.1 90446162 90449165 +
NBqld01 microsynteny    match_part      75913489        75916116        438.3   +       .       ID=blk-1905.2.NBqld01G09150.2;Parent=blk-1905;Target=NBlab01G14990.3 90446162 90449165 +
NBqld01 microsynteny    match_part      75932007        75936883        213.4   -       .       ID=blk-1905.7.NBqld01G09180.1;Parent=blk-1905;Target=NBlab01G15020.1 90474498 90479531 -
NBqld01 microsynteny    match_part      75932157        75936856        528.9   -       .       ID=blk-1905.8.NBqld01G09180.2;Parent=blk-1905;Target=NBlab01G15020.1 90474498 90479531 -
NBqld01 microsynteny    match   75913489        75936856        8       +       .       ID=blk-1906;Name=blk-1906_to_NbV1Ch01;Target=NbV1Ch01 90446162 90479531000.1 90454994 90458939 -
NBqld01 microsynteny    match_part      75913489        75916784        438.3   +       .       ID=blk-1906.1.NBqld01G09150.1;Parent=blk-1906;Target=NBlab01G14990.3 90446162 90449165 +
NBqld01 microsynteny    match_part      75913489        75916116        438.3   +       .       ID=blk-1906.2.NBqld01G09150.2;Parent=blk-1906;Target=NBlab01G14990.1 90446162 90449165 +
NBqld01 microsynteny    match_part      75918271        75922448        215.3   -       .       ID=blk-1906.3.NBqld01G09160.1;Parent=blk-1906;Target=NBlab01G15000.1 90454994 90458939 -
NBqld01 microsynteny    match_part      75918532        75922421        380.6   -       .       ID=blk-1906.4.NBqld01G09160.3;Parent=blk-1906;Target=NBlab01G15000.1 90454994 90458939 -
NBqld01 microsynteny    match_part      75918532        75922448        215.3   -       .       ID=blk-1906.5.NBqld01G09160.5;Parent=blk-1906;Target=NBlab01G15000.1 90454994 90458939 -
NBqld01 microsynteny    match_part      75924269        75929750        185.3   +       .       ID=blk-1906.6.NBqld01G09170.1;Parent=blk-1906;Target=NBlab01G15010.1 90466931 90471854 +
NBqld01 microsynteny    match_part      75932007        75936883        213.4   -       .       ID=blk-1906.7.NBqld01G09180.1;Parent=blk-1906;Target=NBlab01G15020.1 90474498 90479531 -
NBqld01 microsynteny    match_part      75932157        75936856        528.9   -       .       ID=blk-1906.8.NBqld01G09180.2;Parent=blk-1906;Target=NBlab01G15020.1 90474498 90479531 -
NBqld01 microsynteny    match   75913489        75936856        8       +       .       ID=blk-1907;Name=blk-1907_to_NbV1Ch02;Target=NbV1Ch02 67391974 67445767310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75913489        75916784        419.5   +       .       ID=blk-1907.1.NBqld01G09150.1;Parent=blk-1907;Target=NBlab02G09300.1 67391974 67409155 +
NBqld01 microsynteny    match_part      75913489        75916116        419.5   +       .       ID=blk-1907.2.NBqld01G09150.2;Parent=blk-1907;Target=NBlab02G09300.2 67398533 67409155 +
NBqld01 microsynteny    match_part      75918271        75922448        216.5   -       .       ID=blk-1907.3.NBqld01G09160.1;Parent=blk-1907;Target=NBlab02G09310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75918532        75922421        371.3   -       .       ID=blk-1907.4.NBqld01G09160.3;Parent=blk-1907;Target=NBlab02G09310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75918532        75922448        216.5   -       .       ID=blk-1907.5.NBqld01G09160.5;Parent=blk-1907;Target=NBlab02G09310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75924269        75929750        305.1   +       .       ID=blk-1907.6.NBqld01G09170.1;Parent=blk-1907;Target=NBlab02G09320.1 67421569 67434518 +
NBqld01 microsynteny    match_part      75932007        75936883        218.8   -       .       ID=blk-1907.7.NBqld01G09180.1;Parent=blk-1907;Target=NBlab02G09330.2 67441539 67445767 -
NBqld01 microsynteny    match_part      75932157        75936856        507.7   -       .       ID=blk-1907.8.NBqld01G09180.2;Parent=blk-1907;Target=NBlab02G09330.1 67441539 67445767 -
NBqld01 microsynteny    match   75913489        75936856        8       +       .       ID=blk-1908;Name=blk-1908_to_NbV1Ch02;Target=NbV1Ch02 67398533 67445767310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75913489        75916784        419.5   +       .       ID=blk-1908.1.NBqld01G09150.1;Parent=blk-1908;Target=NBlab02G09300.2 67398533 67409155 +
NBqld01 microsynteny    match_part      75913489        75916116        419.5   +       .       ID=blk-1908.2.NBqld01G09150.2;Parent=blk-1908;Target=NBlab02G09300.1 67391974 67409155 +
NBqld01 microsynteny    match_part      75918271        75922448        216.5   -       .       ID=blk-1908.3.NBqld01G09160.1;Parent=blk-1908;Target=NBlab02G09310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75918532        75922421        371.3   -       .       ID=blk-1908.4.NBqld01G09160.3;Parent=blk-1908;Target=NBlab02G09310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75918532        75922448        216.5   -       .       ID=blk-1908.5.NBqld01G09160.5;Parent=blk-1908;Target=NBlab02G09310.1 67415016 67420166 -
NBqld01 microsynteny    match_part      75924269        75929750        305.1   +       .       ID=blk-1908.6.NBqld01G09170.1;Parent=blk-1908;Target=NBlab02G09320.1 67421569 67434518 +
NBqld01 microsynteny    match_part      75932007        75936883        218.8   -       .       ID=blk-1908.7.NBqld01G09180.1;Parent=blk-1908;Target=NBlab02G09330.2 67441539 67445767 -
NBqld01 microsynteny    match_part      75932157        75936856        507.7   -       .       ID=blk-1908.8.NBqld01G09180.2;Parent=blk-1908;Target=NBlab02G09330.1 67441539 67445767 -

Query browser:

Target browser:

Is there a setting to change match_part to match in order to create easier to read names e.g. Name=NBlab01G14990.1-vs-NBqld01G09150.1?
Did I use -D "|" correctly?

Thank you in advance,

Michal

microsynteny.py: GFF input for bacterial genomes

Hi,
I have a problem running the microsynteny.py. I have rather verbose gff files for bacterial genomes, so not the intron-exon examples you have. I suppose there is something wrong with parsing the gffs?

I attached both gff-files and the blast output.
This is how I am running the script:

python genomeGTFtools/microsynteny.py -b blast -q query.1-Contigs.cds.gff -d db.1-Contigs.cds.gff

and here is the error:

# Running command:
genomeGTFtools/microsynteny.py -b blast -q query.1-Contigs.cds.gff -d db.1-Contigs.cds.gff
# Parsing query.1-Contigs.cds.gff  Tue Aug 18 10:57:34 2020
Traceback (most recent call last):
  File "genomeGTFtools/microsynteny.py", line 431, in <module>
    main(sys.argv[1:],sys.stdout)
  File "genomeGTFtools/microsynteny.py", line 417, in main
    querydict = parse_gtf(args.query_gtf, args.no_genes, args.cds_only, exclusionDict, args.query_delimiter)
  File "genomeGTFtools/microsynteny.py", line 129, in parse_gtf
    sys.stderr.write("# Estimated {} genes from {} exons  ".format(len(exonboundaries), sum(len(x) for x in exonboundaries.values() ) ) + time.asctime() + os.linesep)
UnboundLocalError: local variable 'exonboundaries' referenced before assignment

db.1-Contigs.cds.gff.gz
query.1-Contigs.cds.gff.gz
blast.gz

blast2genomegff.py missing in repository

Missing blast2genomegff.py but found it in youre bitbucket https://bitbucket.org/wrf/sequences/src
Also, are you ok with wrapping the repository into a usegalaxy.org tool?

Errors in blast2genomegff.py

Hi,

I made a gene model with BRAKER and would like to add gene names using blast2genomegff.py.
I, however, have had two types of error, which depends on file format (gtf or gff3).
If I used gtf, I got error as follows.

$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gtf > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta  Fri Mar 19 21:36:27 2021
# Found 564277 sequences  Fri Mar 19 21:36:37 2021
# Parsing gff from augustus.hints_utr.gtf  Fri Mar 19 21:36:37 2021
Traceback (most recent call last):
  File "/usr/local/genome/bin/blast2genomegff.py", line 489, in <module>
    main(sys.argv[1:],sys.stdout)
  File "/usr/local/genome/bin/blast2genomegff.py", line 483, in main
    geneintervals, genestrand, genescaffold =  gtf_to_intervals(args.genes, args.cds_exons, args.skip_exons, args.transdecoder, args.no_genes, args.gff_delimiter)
  File "/usr/local/genome/bin/blast2genomegff.py", line 134, in gtf_to_intervals
    genestrand[geneid] = strand
UnboundLocalError: local variable 'geneid' referenced before assignment

If I used gff3, another error was obtained.

$ blast2genomegff.py -b braker04_augustus.hints_utr.uniprot_sprot.blastp.tab -p BLASTP -d uniprot_sprot.fasta -g augustus.hints_utr.gff3 > augustus.hints_utr.blastp.uniprot_sprot.gff
# Parsing target sequences from uniprot_sprot.fasta  Fri Mar 19 21:38:07 2021
# Found 564277 sequences  Fri Mar 19 21:38:15 2021
# Parsing gff from augustus.hints_utr.gff3  Fri Mar 19 21:38:15 2021
# Counted 2407291 lines and 0 comments  Fri Mar 19 21:38:23 2021
# Ignored 1150218 other features in the GFF
# Counted 567177 exons for 116687 inferred transcripts
# blast program is blastp, multiplying coordinates by 3
# Starting BLAST parsing on braker04_augustus.hints_utr.uniprot_sprot.blastp.tab  Fri Mar 19 21:38:23 2021
# Removed 761276 hits by shortness
# Removed 0 hits by bitscore
# Removed 0 hits by evalue
# Removed 0 hits that exceeded query max
# Found 0 hits for 0 queries  Fri Mar 19 21:38:26 2021
# WARNING: did not write any intervals, check options -D -F or -G for mismatch between IDs in GFF and blast table

My files are as follows.

BLAST result

$ head braker04_augustus.hints_utr.uniprot_sprot.blastp.tab
g21100.t1       Q90663  53.097  113     52      1       92      203     41      153     7.49e-34        130
g21100.t1       O95025  54.128  109     49      1       92      199     53      161     2.96e-33        129
g21100.t1       Q8BH34  51.327  113     54      1       92      203     53      165     5.50e-33        128
g21100.t1       Q9W6G6  47.934  121     61      1       87      205     51      171     1.34e-29        118
g21100.t1       Q9W7J1  42.478  113     64      1       92      203     40      152     1.41e-26        109
g21100.t2       Q9W6G6  44.156  77      41      1       87      161     51      127     4.15e-11        63.9
g21100.t2       Q90663  44.286  70      38      1       92      160     41      110     4.93e-11        63.5
g21100.t2       O95025  45.714  70      37      1       92      160     53      122     1.03e-10        62.8
g21100.t2       Q8BH34  42.857  70      39      1       92      160     53      122     1.17e-10        62.4
g21100.t2       Q99985  45.283  53      29      0       108     160     54      106     1.66e-07        53.1

$ grep g21100 augustus.hints_utr.gtf 
chr01   AUGUSTUS        gene    3456    15768   0.44    +       .       g21100
chr01   AUGUSTUS        transcript      3456    15768   0.14    +       .       g21100.t1
chr01   AUGUSTUS        tss     3456    3456    .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        5'-UTR  3456    3635    0.48    +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        exon    3456    3635    .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        5'-UTR  4102    4159    0.47    +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        exon    4102    4240    .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        start_codon     4160    4162    .       +       0       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        CDS     4160    4240    0.47    +       0       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        intron  4241    4510    0.47    +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        CDS     4511    4560    0.47    +       0       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        exon    4511    4560    .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        intron  4561    12338   0.92    +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        CDS     12339   12478   0.99    +       1       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        exon    12339   12478   .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        intron  12479   13906   1       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        CDS     13907   14061   1       +       2       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        exon    13907   14061   .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        intron  14062   14195   1       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        CDS     14196   14258   1       +       0       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        exon    14196   14258   .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        intron  14259   15041   1       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        CDS     15042   15176   1       +       0       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        exon    15042   15768   .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        stop_codon      15174   15176   .       +       0       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        3'-UTR  15177   15768   0.74    +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        tts     15768   15768   .       +       .       transcript_id "g21100.t1"; gene_id "g21100";
chr01   AUGUSTUS        transcript      3456    15768   0.3     +       .       g21100.t2
chr01   AUGUSTUS        tss     3456    3456    .       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        5'-UTR  3456    3635    0.56    +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        exon    3456    3635    .       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        5'-UTR  4102    4159    0.51    +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        exon    4102    4240    .       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        start_codon     4160    4162    .       +       0       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        CDS     4160    4240    0.51    +       0       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        intron  4241    4510    0.51    +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        CDS     4511    4560    0.51    +       0       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        exon    4511    4560    .       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        intron  4561    12338   0.94    +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        CDS     12339   12478   0.98    +       1       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        exon    12339   12478   .       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        intron  12479   13906   1       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        CDS     13907   14061   1       +       2       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        exon    13907   14061   .       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        intron  14062   14195   1       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        CDS     14196   14282   1       +       0       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        exon    14196   15768   .       +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        stop_codon      14280   14282   .       +       0       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        3'-UTR  14283   15768   0.88    +       .       transcript_id "g21100.t2"; gene_id "g21100";
chr01   AUGUSTUS        tts     15768   15768   .       +       .       transcript_id "g21100.t2"; gene_id "g21100";

GFF3

$ grep g21100 augustus.hints_utr.gff3
chr01   AUGUSTUS        gene    3456    15768   0.44    +       .       ID=g21100;
chr01   AUGUSTUS        mRNA    3456    15768   0.14    +       .       ID=g21100.t1;Parent=g21100;
chr01   AUGUSTUS        transcription_start_site        3456    3456    .       +       .       ID=g21100.t1.tss1;Parent=g21100.t1;
chr01   AUGUSTUS        five_prime_utr  3456    3635    0.48    +       .       ID=g21100.t1.5UTR1;Parent=g21100.t1;
chr01   AUGUSTUS        exon    3456    3635    .       +       .       ID=g21100.t1.exon1;Parent=g21100.t1;
chr01   AUGUSTUS        five_prime_utr  4102    4159    0.47    +       .       ID=g21100.t1.5UTR2;Parent=g21100.t1;
chr01   AUGUSTUS        exon    4102    4240    .       +       .       ID=g21100.t1.exon2;Parent=g21100.t1;
chr01   AUGUSTUS        start_codon     4160    4162    .       +       0       ID=g21100.t1.start1;Parent=g21100.t1;
chr01   AUGUSTUS        CDS     4160    4240    0.47    +       0       ID=g21100.t1.CDS1;Parent=g21100.t1;
chr01   AUGUSTUS        intron  4241    4510    0.47    +       .       ID=g21100.t1.intron1;Parent=g21100.t1;
chr01   AUGUSTUS        CDS     4511    4560    0.47    +       0       ID=g21100.t1.CDS2;Parent=g21100.t1;
chr01   AUGUSTUS        exon    4511    4560    .       +       .       ID=g21100.t1.exon3;Parent=g21100.t1;
chr01   AUGUSTUS        intron  4561    12338   0.92    +       .       ID=g21100.t1.intron2;Parent=g21100.t1;
chr01   AUGUSTUS        CDS     12339   12478   0.99    +       1       ID=g21100.t1.CDS3;Parent=g21100.t1;
chr01   AUGUSTUS        exon    12339   12478   .       +       .       ID=g21100.t1.exon4;Parent=g21100.t1;
chr01   AUGUSTUS        intron  12479   13906   1       +       .       ID=g21100.t1.intron3;Parent=g21100.t1;
chr01   AUGUSTUS        CDS     13907   14061   1       +       2       ID=g21100.t1.CDS4;Parent=g21100.t1;
chr01   AUGUSTUS        exon    13907   14061   .       +       .       ID=g21100.t1.exon5;Parent=g21100.t1;
chr01   AUGUSTUS        intron  14062   14195   1       +       .       ID=g21100.t1.intron4;Parent=g21100.t1;
chr01   AUGUSTUS        CDS     14196   14258   1       +       0       ID=g21100.t1.CDS5;Parent=g21100.t1;
chr01   AUGUSTUS        exon    14196   14258   .       +       .       ID=g21100.t1.exon6;Parent=g21100.t1;
chr01   AUGUSTUS        intron  14259   15041   1       +       .       ID=g21100.t1.intron5;Parent=g21100.t1;
chr01   AUGUSTUS        CDS     15042   15176   1       +       0       ID=g21100.t1.CDS6;Parent=g21100.t1;
chr01   AUGUSTUS        exon    15042   15768   .       +       .       ID=g21100.t1.exon7;Parent=g21100.t1;
chr01   AUGUSTUS        stop_codon      15174   15176   .       +       0       ID=g21100.t1.stop1;Parent=g21100.t1;
chr01   AUGUSTUS        three_prime_utr 15177   15768   0.74    +       .       ID=g21100.t1.3UTR1;Parent=g21100.t1;
chr01   AUGUSTUS        transcription_end_site  15768   15768   .       +       .       ID=g21100.t1.tts1;Parent=g21100.t1;
chr01   AUGUSTUS        mRNA    3456    15768   0.3     +       .       ID=g21100.t2;Parent=g21100;
chr01   AUGUSTUS        transcription_start_site        3456    3456    .       +       .       ID=g21100.t2.tss1;Parent=g21100.t2;
chr01   AUGUSTUS        five_prime_utr  3456    3635    0.56    +       .       ID=g21100.t2.5UTR1;Parent=g21100.t2;
chr01   AUGUSTUS        exon    3456    3635    .       +       .       ID=g21100.t2.exon1;Parent=g21100.t2;
chr01   AUGUSTUS        five_prime_utr  4102    4159    0.51    +       .       ID=g21100.t2.5UTR2;Parent=g21100.t2;
chr01   AUGUSTUS        exon    4102    4240    .       +       .       ID=g21100.t2.exon2;Parent=g21100.t2;
chr01   AUGUSTUS        start_codon     4160    4162    .       +       0       ID=g21100.t2.start1;Parent=g21100.t2;
chr01   AUGUSTUS        CDS     4160    4240    0.51    +       0       ID=g21100.t2.CDS1;Parent=g21100.t2;
chr01   AUGUSTUS        intron  4241    4510    0.51    +       .       ID=g21100.t2.intron1;Parent=g21100.t2;
chr01   AUGUSTUS        CDS     4511    4560    0.51    +       0       ID=g21100.t2.CDS2;Parent=g21100.t2;
chr01   AUGUSTUS        exon    4511    4560    .       +       .       ID=g21100.t2.exon3;Parent=g21100.t2;
chr01   AUGUSTUS        intron  4561    12338   0.94    +       .       ID=g21100.t2.intron2;Parent=g21100.t2;
chr01   AUGUSTUS        CDS     12339   12478   0.98    +       1       ID=g21100.t2.CDS3;Parent=g21100.t2;
chr01   AUGUSTUS        exon    12339   12478   .       +       .       ID=g21100.t2.exon4;Parent=g21100.t2;
chr01   AUGUSTUS        intron  12479   13906   1       +       .       ID=g21100.t2.intron3;Parent=g21100.t2;
chr01   AUGUSTUS        CDS     13907   14061   1       +       2       ID=g21100.t2.CDS4;Parent=g21100.t2;
chr01   AUGUSTUS        exon    13907   14061   .       +       .       ID=g21100.t2.exon5;Parent=g21100.t2;
chr01   AUGUSTUS        intron  14062   14195   1       +       .       ID=g21100.t2.intron4;Parent=g21100.t2;
chr01   AUGUSTUS        CDS     14196   14282   1       +       0       ID=g21100.t2.CDS5;Parent=g21100.t2;
chr01   AUGUSTUS        exon    14196   15768   .       +       .       ID=g21100.t2.exon6;Parent=g21100.t2;
chr01   AUGUSTUS        stop_codon      14280   14282   .       +       0       ID=g21100.t2.stop1;Parent=g21100.t2;
chr01   AUGUSTUS        three_prime_utr 14283   15768   0.88    +       .       ID=g21100.t2.3UTR1;Parent=g21100.t2;
chr01   AUGUSTUS        transcription_end_site  15768   15768   .       +       .       ID=g21100.t2.tts1;Parent=g21100.t2;

Any suggestions would be appreciated.

pfam2gff genomics coordinates generate short pfam description

I have generated genome-based coding region annotation file with Transdecoder as described here
Then I have used pfam2gff .py like this :
python pfam2gff.py -g transcripts.fasta.transdecoder.genome.gff3 \ -i PFAM.out \ -T > pfam_domains.gff

But in this file the annotation from pfam are extremely shortened :

chr1	hmmscan	PFAM	11416724	11416749	99.5	+	.	ID=g34449.t1.p1.Aldose_epim.1;Name=PF01263.Aldose_epim.1
chr1	hmmscan	PFAM	11416839	11416942	99.5	+	.	ID=g34449.t1.p1.Aldose_epim.1;Name=PF01263.Aldose_epim.1
chr4	hmmscan	PFAM	3455	3615	118.3	+	.	ID=jg1.t1.p1.RVT_1.1;Name=PF00078.RVT_1.1
chr4	hmmscan	PFAM	4970	5082	53.8	+	.	ID=jg1.t1.p1.rve.1;Name=PF00665.rve.1

Instead of in the file TrinotatePFAM.out :

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Aldose_epim          PF01263.19   301 g34449.t1.p1         -            194   3.7e-31  108.5   0.3   1   1   1.2e-32     2e-28   99.5   0.3    53   177    55   184    32   193 0.88 Aldose 1-epimerase
RVT_1                PF00078.26   222 jg1.t1.p1            -           1474   1.8e-34  119.1   0.1   1   1   1.6e-37   3.3e-34  118.3   0.1     1   222   627   787   627   787 0.98 Reverse transcriptase (RNA-dependent DNA polymerase)
rve                  PF00665.25   119 jg1.t1.p1            -           1474   6.4e-15   55.3   0.0   1   1   8.9e-18   1.9e-14   53.8   0.0     2   115  1108  1220  1107  1223 0.96 Integrase core domain
Retrotrans_gag       PF03732.16    97 jg1.t1.p1            -           1474   4.6e-12   46.0   1.9   1   2   2.2e-15   4.6e-12   46.0   1.9     1    96   164   259   164   260 0.96 Retrotransposon gag protein
Retrotrans_gag       PF03732.16    97 jg1.t1.p1            -           1474   4.6e-12   46.0   1.9   2   2       1.8   3.8e+03   -1.9   0.1    54    76   958   981   952   994 0.82 Retrotransposon gag protein

Leading to this in a genome browser :

Is there a way to keep more informative data in the output file from pfam2gff.py ?
for example Aldose 1-epimerase instead of Aldose_epim.1

bast2gff.py concatenate hits

Hi,
I ran blast2gff.py, and I got the following results:

Is there a way to stick together smaller alignments into a region of interest to achieve the below Blast visualisation?

Thank you in advance,

Michal

TypeError: unsupported operand

Hi,
I ran into the below error:

> python stringtie_gtf_to_gff3.py bams/stringtie_merged.gtf > bams/stringtie_merged.gff3
Traceback (most recent call last):
  File "/work/apps/genomeGTFtools/misc/stringtie_gtf_to_gff3.py", line 41, in <module>
    main(sys.argv[1:],sys.stdout)
  File "/work/apps/genomeGTFtools/misc/stringtie_gtf_to_gff3.py", line 21, in main
    print >> sys.stderr, "# Converting {} to GFF".format(args.gtf)
TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?

with this gtf file:

# stringtie --merge -p 8 -o bams/stringtie_merged.gtf bams/mergelist.txt
# StringTie version 2.0
NbV1Ch01        StringTie       transcript      221937  223251  1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
NbV1Ch01        StringTie       exon    221937  222881  1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
NbV1Ch01        StringTie       exon    222989  223251  1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "2"; 
NbV1Ch01        StringTie       transcript      773600  784171  1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; 
NbV1Ch01        StringTie       exon    773600  773945  1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1"; 
NbV1Ch01        StringTie       exon    774033  774173  1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "2"; 
NbV1Ch01        StringTie       exon    783756  784171  1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "3"; 
NbV1Ch01        StringTie       transcript      1003303 1005419 1000    -       .       gene_id "MSTRG.3"; transcript_id "MSTRG.3.1"; 
NbV1Ch01        StringTie       exon    1003303 1003738 1000    -       .       gene_id "MSTRG.3"; transcript_id "MSTRG.3.1"; exon_number "1"; 
NbV1Ch01        StringTie       exon    1004949 1005419 1000    -       .       gene_id "MSTRG.3"; transcript_id "MSTRG.3.1"; exon_number "2"; 
NbV1Ch01        StringTie       transcript      918382  942103  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; 
NbV1Ch01        StringTie       exon    918382  918907  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "1"; 
NbV1Ch01        StringTie       exon    919974  920080  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "2"; 
NbV1Ch01        StringTie       exon    921381  921508  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "3"; 
NbV1Ch01        StringTie       exon    923908  923979  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "4"; 
NbV1Ch01        StringTie       exon    928187  928288  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "5"; 
NbV1Ch01        StringTie       exon    933546  933605  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "6"; 
NbV1Ch01        StringTie       exon    937393  937458  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "7"; 
NbV1Ch01        StringTie       exon    938672  938719  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "8"; 
NbV1Ch01        StringTie       exon    939217  939393  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "9"; 
NbV1Ch01        StringTie       exon    939484  939539  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "10"; 
NbV1Ch01        StringTie       exon    939630  939747  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "11"; 
NbV1Ch01        StringTie       exon    940011  940100  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "12"; 
NbV1Ch01        StringTie       exon    941276  941385  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "13"; 
NbV1Ch01        StringTie       exon    941521  942103  1000    +       .       gene_id "MSTRG.4"; transcript_id "MSTRG.4.1"; exon_number "14";

How is it possible to fix it?

Thank you in advance,

Michal

converting InterProScan results to genomic coordinates

Hi,
I ran interproscan.sh --cpu 10 --input $1 --seqtype p --iprlookup --pathways --goterms -d $2 --tempdir interproscan-dp-tmp and got the following output files:

augustus.hints_utr.aa.gff3
augustus.hints_utr.aa.tsv
augustus.hints_utr.aa.xml

Now I would like to load the results into JBrowse. However, the coordinates do not match the genomic coordinates. By any chance, do you have any scripts to convert InterPro scan results to genomic coordinates?

Thank you in advance,

Michal

Add the mRNA, cds features into the Stringtie outfiles?

Hi, there.
Since I have used the stringtie merge parameters to combined all the gtf file together. As we can see that there are just the transcripts and exons features in the final files.

Is there scripts that can add the gene, mRNA and CDS features into this files, which make it looks like the following files?

Thanks advance!
Sincerely
Yizhong Huang

--add-description seems not be implemented in blast2genomegff.py

Hi,
It appears that --add-description is not implemented because I get Description=None with

blast2genomegff.py -p blastp \
-b labVSviridiplantae_hybrid_blastp.out \
-g NbLab330.61k.gff3.rename.gff3 \
--add-description \
-d uniprot-taxonomy-viridiplantae-reviewed-yes.fasta > labVSviridiplantae_hybrid_blastp.gff

What did I miss?

Thank you in advance

pfampipeline.py: FileNotFoundError: [Errno 2]

Hi Warren,

I run the pfampipeline.py, and got a message that pfam2gff.py does not found:

ilya@ilya-MacBookAir:~/genomeGTFtools$ ./pfampipeline.py -P ~/pfam_base/Pfam-A.hmm ./test_data/nidogen_full_prots.fasta 
# Searching PFAM against ./test_data/nidogen_full_prots.fasta  Tue Feb  2 15:31:37 2021
Calling:
hmmscan --cpu 1 --domtblout ./test_data/nidogen_full_prots.pfam.tab /home/ilya/pfam_base/Pfam-A.hmm ./test_data/nidogen_full_prots.fasta
Calling:
pfam2gff.py -i ./test_data/nidogen_full_prots.pfam.tab -e 0.1
Traceback (most recent call last):
  File "./pfampipeline.py", line 120, in <module>
    main(sys.argv[1:],sys.stdout)
  File "./pfampipeline.py", line 112, in main
    pfamgff = call_pfamgff(hmmtblout, args.evalue)
  File "./pfampipeline.py", line 51, in call_pfamgff
    subprocess.call(pfam2gff_args, stdout=pfamgff)
  File "/usr/lib/python3.8/subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pfam2gff.py'

But actually it is present:

ilya@ilya-MacBookAir:~/genomeGTFtools$ ls pfam* -lh
-rwxrwxr-x 1 ilya ilya  17K фев  2 15:31 pfam2gff.py
-rwxrwxr-x 1 ilya ilya 5,9K фев  2 15:31 pfamgff2clans.py
-rwxrwxr-x 1 ilya ilya 5,6K фев  2 15:13 pfampipeline.py

And even works if I run it separately:

ilya@ilya-MacBookAir:~/genomeGTFtools$ ./pfam2gff.py -i ./test_data/nidogen_full_prots.pfam.tab -e 0.1
# Parsing hmmscan PFAM tabular ./test_data/nidogen_full_prots.pfam.tab  Tue Feb  2 15:24:41 2021
# Found 911 domains for 9 proteins  Tue Feb  2 15:24:41 2021
# Removed 31 domain hits by shortness
# Removed 165 domain hits by evalue
NID1_MOUSE	hmmscan	PFAM	176	266	78.9	.	.	ID=PF06119.NIDO.1;Name=PF06119.NIDO.1
NID1_MOUSE	hmmscan	PFAM	388	419	14.3	.	.	ID=PF00008.EGF.1;Name=PF00008.EGF.1
NID1_MOUSE	hmmscan	PFAM	388	423	5.3	.	.	ID=PF07645.EGF_CA.1;Name=PF07645.EGF_CA.1
NID1_MOUSE	hmmscan	PFAM	388	424	4.2	.	.	ID=PF12946.EGF_MSP1_1.1;Name=PF12946.EGF_MSP1_1.1
NID1_MOUSE	hmmscan	PFAM	427	619	240.2	.	.	ID=PF07474.G2F.1;Name=PF07474.G2F.1
NID1_MOUSE	hmmscan	PFAM	669	706	8.6	.	.	ID=PF07645.EGF_CA.2;Name=PF07645.EGF_CA.2
NID1_MOUSE	hmmscan	PFAM	670	702	15.9	.	.	ID=PF00008.EGF.2;Name=PF00008.EGF.2
NID1_MOUSE	hmmscan	PFAM	670	706	8.0	.	.	ID=PF14670.FXa_inhibition.2;Name=PF14670.FXa_inhibition.2
NID1_MOUSE	hmmscan	PFAM	676	706	8.6	.	.	ID=PF12946.EGF_MSP1_1.2;Name=PF12946.EGF_MSP1_1.2

Could you please help me? What is the trouble with pfampipeline.py?

Ilya

pfampipeline.py CANNOT FIND FILE

Hi,

i tried to use pfampipeline.py, and got this message:

ilya@ilya-MacBookAir:~/genomeGTFtools$ ./pfampipeline.py test_data/nidogen_full_prots.fasta 
ERROR: CANNOT FIND FILE test_data/nidogen_full_prots.fasta

All dependencies are setted (Bio Python, HMMER et al) and the input is present, sure:

ilya@ilya-MacBookAir:~/genomeGTFtools$ ls test_data/ -l nidogen*
-rw-rw-r-- 1 ilya ilya 13333 ноя  4 14:53 nidogen_full_prots.fasta

Checked at laptop Ubuntu 20.04 and server CentOS 7 - looks the same.

Could you help me please?

Thank you in advance,
Ilya

wrf / genomegtftools Goto Github PK

genomegtftools's Introduction

genomeGTFtools

Overview

Jump to:

number_contigs_by_length

pfam2gff

For genomic coordinates

pfamgff2clans

pfampipeline

blast2gff

blast2genomegff

Options are:

starting from StringTie transcripts

starting from AUGUSTUS proteins or CDS

microsynteny

Required terms are:

Common options are:

scaffold_synteny

extract_coordinates

Required terms are:

draw_annotation_blocks

draw_genome_annotation

repeat2gtf

pal2gtf

DEPRICATED: blast2genewise

genomegtftools's People

Contributors

Stargazers

Watchers

Forkers

genomegtftools's Issues

I download viral fasta sequence from NCBI,

using prokka find CDS

find domain location with hmmscan

delete blank rows

Recommend Projects

Recommend Topics

Recommend Org