jstacs / jstacs

License: GNU General Public License v3.0

Java 96.00% HTML 0.95% C 0.02% TeX 2.53% Shell 0.17% Perl 0.01% R 0.33% Raku 0.01%

biological-sequences biological-sequence-statistics statistical-models statistical-learning machine-lerning classification discriminative-learning generative-model gradient-descent-algorithm mixture-model

jstacs's Introduction

Description

Sequence analysis is one of the major subjects of bioinformatics. Several existing libraries combine the representation of biological sequences with exact and approximate pattern matching as well as alignment algorithms. We present Jstacs, an open source Java library, which focuses on the statistical analysis of biological sequences instead. Jstacs comprises an efficient representation of sequence data and provides implementations of many statistical models with generative and discriminative approaches for parameter learning. Using Jstacs, classifiers can be assessed and compared on test datasets or by cross-validation experiments evaluating several performance measures. Due to its strictly object-oriented design Jstacs is easy to use and readily extensible.

For more information, including API documentation, code examples, FAQs, binaries, and a cookbook, visit http://www.jstacs.de.

Organization of the library

Jstacs core classes may be found in sub-packages of de.jstacs. Individual projects using and extending these core classes for specific applications are located in the package projects.

A list of projects that are based on Jstacs, including binaries and documentation of user parameters, is available at http://jstacs.de/index.php/Projects.

Building upon Jstacs, JstacsFX visualizes parameters and results in a JavaFX-based GUI that is built upon the generic de.jstacs.tools.JstacsTool class.

Licensing information

Jstacs is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 or (at your option) any later version as published by the Free Software Foundation.

For more information, please read COPYING.txt.

jstacs's Issues

AnnotationFinalizer for renaming genes and transcripts

Hello,
I'm currently trying to annotate my genome with multiple references. I've used GeMoMa many times before, but this time my goal is to have my own prefix for the genes, transcripts, etc. (i.e. Sg_v4.2 at the start of every gene and associated transcript). This is my current command:

java -Xmx64g -jar $GEMOMA/GeMoMa-1.9.jar CLI GeMoMaPipeline \
t=$TARGET o=true p=true \
s=own a=$REFANN0 g=$REFGEN0 \
s=own a=$REFANN1 g=$REFGEN1 \
s=own a=$REFANN2 g=$REFGEN2 \
s=own a=$REFANN3 g=$REFGEN3 \
s=own a=$REFANN4 g=$REFGEN4 \
tblastn=false \
GeMoMa.c=0.4 \
GeMoMa.Score=ReAlign \
AnnotationFinalizer.r=NO \
Extractor.r=true \
Extractor.f=false \
GAF.f="start=='M' and stop=='*' and (isNaN(score) or score/aa>='3.50')" \
outdir=$OUTDIR/gemoma_output threads=12

This results in gene IDs such as "gene_1", but the mRNA IDs still reference the reference that they were derived from. Is there a method to rename the mRNAs with your own original prefix?

I tried AnnotationFinalizer.r=COMPOSED, but I kept getting errors. I assume I need to use "AnnotationFinalizer.p" as well, but I kept providing names that can't be used. What is meant by "the prefix of the generic name"?

I can't find clear directions for doing this kind of thing using GeMoMa. Any help would be greatly appreciated. Thanks!

GeMoMaPipeline including lots of RNA-seq bam files

Hello,
GeMoMa is a suitable tool for me to predict genes in mammals. I have downloaded GeMoMa-1.7.1.jar for this purpose. I want to use RNA-seq BAM files for the gene prediction, but I do not know how to merge all the BAM files (about 38 tissues). I have tried adapting the command "java -cp GeMoMa-.jar projects.gemoma.CombineIntronFiles", but it does not work.
Need help!
Thanks!

About the incomplete gene models predicted by GeMoMa

Hi, Jens,
Thanks for the very useful tool. I have tested many homology-based annotation tools, and GeMoMa is the most accurate one as far as I know, considering the BUSCO results and the distributions of gene-structure features such as exon length, intron length, mRNA length, and exon number per gene, which also agree with those of its relatives.
Here is the problem: I have tried to predict genes in several mammal genomes, and most of the genes are accurate, but some of the gene models are incomplete. Comparison with closely related species shows that these models belong to the same gene, yet GeMoMa annotated them as several separate gene models. How should I solve this problem? Is there a tool to merge these incomplete models into one complete gene, or do I just need to add some parameters before the final annotation in the last filter steps?

Could you give me some suggestions? I would appreciate it very much!

yours, Liu

Using RNA-seq from closely-related organism

Hi! -

Thanks again for the useful program!

I would like to make use of RNA-seq data for the closest-related organisms for which data is available, at least to see how it might help annotations. I have data for many tissue types, and initially mapped them to my reference genome, merged bams, indexed, then used in ERE. Although with bamstats this mapping seems to have gone well (711 million reads are mapped), using GeMoMa ERE m=$RNA s=FR_FIRST_STRAND, where $RNA is my merged/indexed bam, I get an empty .gff file. I used FR_FIRST_STRAND because I believe my RNA-seq data is TruSeq (e.g.), though I have experimentally tried other s options. Do you know what might be happening?

file statistics:
0	false	TranscriptResources/rnaseq.bam

overall statistics:
#files:	1
#corrupt files:	0
#reads:	1002408184
#split reads:	0
#questionable split reads:	0
#removed very short intron:	0
#introns:	0
#intron length:	2147483647 .. -2147483648

mapping qualities:
0-39	~300million reads	0 used reads	0 split reads               <-until here are discarded due to minimum map qual of 40
40	28932312 reads	28932312 used reads	0 split reads
41	2663338 reads	2663338 used reads	0 split reads
42	4329883 reads	4329883 used reads	0 split reads
43	3082543 reads	3082543 used reads	0 split reads
44	3444935 reads	3444935 used reads	0 split reads
45	5047684 reads	5047684 used reads	0 split reads
46	9114847 reads	9114847 used reads	0 split reads
47	3903113 reads	3903113 used reads	0 split reads
48	4122858 reads	4122858 used reads	0 split reads
49	4259958 reads	4259958 used reads	0 split reads
50	3526281 reads	3526281 used reads	0 split reads
51	3577584 reads	3577584 used reads	0 split reads
52	7327936 reads	7327936 used reads	0 split reads
53	3361437 reads	3361437 used reads	0 split reads
54	7514747 reads	7514747 used reads	0 split reads
55	3999037 reads	3999037 used reads	0 split reads
56	4756347 reads	4756347 used reads	0 split reads
57	4648650 reads	4648650 used reads	0 split reads
58	5814773 reads	5814773 used reads	0 split reads
59	5990352 reads	5990352 used reads	0 split reads
60	481850384 reads	481850384 used reads	0 split reads

Thank you!
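
A check that follows from the statistics above: ERE reports 0 split reads and 0 introns, so it may be worth confirming whether the BAM file contains any spliced alignments at all. In the SAM format, a spliced alignment carries an N operation in its CIGAR string (column 6). The following is a minimal Python sketch, assuming the alignments are available as SAM text lines (for example from samtools view); the function name count_spliced is hypothetical and not part of GeMoMa.

```python
import re

# CIGAR operation 'N' marks a skipped reference region (an intron in RNA-seq data).
_SPLICE = re.compile(r"\d+N")

def count_spliced(sam_lines):
    """Return (total_alignments, spliced_alignments) for an iterable of SAM text lines."""
    total = spliced = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a complete alignment record
        total += 1
        if _SPLICE.search(fields[5]):  # column 6 holds the CIGAR string
            spliced += 1
    return total, spliced
```

Calling count_spliced(sys.stdin) from a small script and piping in a sample of the alignments would give a quick impression. If the second number is 0, the reads were probably mapped without splice awareness, which would be consistent with the empty output, although the report does not state which mapper was used.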

wrong alphabet exception

Hi, thanks for GeMoMa - it works really well! I've recently run into an issue where I get the below error. I've been annotating 6 bird genomes and 5 out of the 6 finish without an error, but the 6th returns the below even when I use the same references as all the others. Thanks for any insights!
-Rebecca

Example command (v 1.7.1):
GeMoMa -Xmx600G GeMoMaPipeline threads=$NSLOTS outdir=annotation_manacus GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=manacus.fasta s=own i=pipra a=pipra.gff g=pipra.fasta

Error:

Exception in thread "main" java.lang.RuntimeException: Did not finish as intended. de.jstacs.data.WrongAlphabetException: Symbol "J" from input not defined in alphabet: [A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, B, Z, X, *]
	at de.jstacs.data.alphabets.DiscreteAlphabet.getCode(DiscreteAlphabet.java:288)
	at de.jstacs.data.AlphabetContainer.getCode(AlphabetContainer.java:643)
	at de.jstacs.data.sequences.ByteSequence.<init>(ByteSequence.java:159)
	at de.jstacs.data.sequences.ByteSequence.<init>(ByteSequence.java:127)
	at de.jstacs.data.sequences.Sequence.create(Sequence.java:643)
	at de.jstacs.data.sequences.Sequence.create(Sequence.java:611)
	at de.jstacs.data.sequences.Sequence.create(Sequence.java:584)
	at projects.gemoma.GeMoMa.addHit(GeMoMa.java:1173)
	at projects.gemoma.GeMoMa.run(GeMoMa.java:992)
	at projects.gemoma.GeMoMaPipeline$JGeMoMa.doJob(GeMoMaPipeline.java:1922)
	at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1321)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
GeMoMa for species 0 (pipra) split=44 throws an Exception
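
The exception message lists the accepted amino acid alphabet, and "J" (the IUPAC code for leucine or isoleucine) is not in it. The report does not say which input the symbol comes from, so one option is to scan candidate protein FASTA files for symbols outside that alphabet. The following is a minimal Python sketch; the helper name find_foreign_symbols is hypothetical, and the case-insensitive comparison is an assumption.

```python
# Alphabet copied from the WrongAlphabetException message above.
ALPHABET = set("ARNDCQEGHILKMFPSTWYVBZX*")

def find_foreign_symbols(fasta_lines):
    """Map each FASTA record name to the set of symbols that are not in ALPHABET."""
    offenders = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            continue
        # Assumption: compare case-insensitively.
        bad = set(line.upper()) - ALPHABET
        if bad:
            offenders.setdefault(name, set()).update(bad)
    return offenders
```

Passing an open file handle, for example find_foreign_symbols(open(path)), returns an empty dictionary when every symbol is covered by the alphabet.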

Multiple reference organisms

Hello,
How can I combine the predictions from multiple reference organisms with GAF?
My commands are "java -jar GeMoMa-1.6.4.jar CLI GAF g=species1/predicted_annotation.gff g=species2/predicted_annotation.gff g=species3/predicted_annotation.gff outdir=out" and "java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer a=out/filtered_predictions.gff outdir=out rename=NO". Is this right?

In addition, which kind of masked genome should be used to run GeMoMa, soft-masked or hard-masked?

amino acid J found in input

Hello, I have used the following GeMoMa command for gene prediction several times, and it always did an excellent job.

java -Xms5G -Xmx150G -jar /project/bat_analysis/baiwei/software/GeMoMa-1.8.jar CLI GeMoMaPipeline \
threads=40 outdir=annotation_out GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=$genome \
s=own i=whale a=genomic.gff g=genome.fa \
GeMoMa.m=100000

I used the exact same command and reference species on another genome, but this time it produces errors like this:

GeMoMa for species 0 split=29 throws an Exception
de.jstacs.data.WrongAlphabetException: Symbol "J" from input not defined in alphabet: [A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, B, Z, X, *]
at de.jstacs.data.alphabets.DiscreteAlphabet.getCode(DiscreteAlphabet.java:288)
at de.jstacs.data.AlphabetContainer.getCode(AlphabetContainer.java:643)
at de.jstacs.data.sequences.ByteSequence.<init>(ByteSequence.java:159)
at de.jstacs.data.sequences.ByteSequence.<init>(ByteSequence.java:127)
at de.jstacs.data.sequences.Sequence.create(Sequence.java:643)
at de.jstacs.data.sequences.Sequence.create(Sequence.java:611)
at de.jstacs.data.sequences.Sequence.create(Sequence.java:584)
at projects.gemoma.GeMoMa.addHit(GeMoMa.java:1192)
at projects.gemoma.GeMoMa.run(GeMoMa.java:1003)
at projects.gemoma.GeMoMaPipeline$JGeMoMa.doJob(GeMoMaPipeline.java:1976)
at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1370)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)

How to disable "Searching for the new GeMoMa updates ...".

Hi,

How to disable "Searching for the new GeMoMa updates ..."?

Best,
Kun

GeMoMa expected runtime

Hello!

I am trying to run GeMoMa pipeline for the first time using:
1- One target species genome
2- RNAseq data from this species of interest
3- Annotation from one closely related reference species

Here (#7) you mentioned that tblastn had some bug that could result in huge run times. My job has been already running for 5 days on 20 CPUs.

Do you know if the current tblastn version has this problem solved? I'm using BLAST 2.12.0+.
Is there much difference between mmseqs results and tblastn results?

Thanks in advance,

GeMoMa Extractor Error: There are gene annotations on chromosomes/contigs with missing reference sequence

Hi,

I'm having some issues with getting Extractor to run on a reference genome and associated gff3 file. I keep getting the following error (below). Extractor then proceeds to output a cds-parts.fasta file, which is empty. I am running GeMoMa-1.7.1.

The genome assembly and gff3 file can be downloaded here;

https://ngdc.cncb.ac.cn/gwh/Assembly/7806/show

detected annotation format: GFF
number of detected CDS lines: 63563
number of detected genes: 12019
number of detected transcripts: 12019

genes   0
identical CDS of same gene      0
transcripts     0

reasons for discarding transcripts:
ambiguous nucleotide    0
start phase not zero    0
missing start   0
missing stop    0
premature stop  0
no DNA  0
wrong phase     0
conflicting phase       0

unexpected error        0

repaired        0

WARNING: There are gene annotations on chromosomes/contigs with missing reference sequence: [GWHALOE00000052, GWHALOE00000173, GWHALOE00000294, GWHALOE00000053, GWHALOE00000174, GWHALOE00000295, GWHALOE00000051, GWHALOE00000293, GWHALOE00000290, GWHALOE00000170, GWHALOE00000291, GWHALOE00000610, GWHALOE00000611, GWHALOE00000058, GWHALOE00000179, GWHALOE00000730, GWHALOE00000056, GWHALOE00000177, GWHALOE00000298, GWHALOE00000057, GWHALOE00000296, GWHALO
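
The warning names GFF sequence IDs for which no reference sequence was found, and all gene and transcript counts are 0, so one thing worth checking is whether the names in column 1 of the GFF match the FASTA header names exactly. The following is a minimal Python sketch under the assumption that the sequence name is the first whitespace-delimited token of each FASTA header; how Extractor itself matches names is not stated in the report, and the helper names are hypothetical.

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the first whitespace-delimited token after '>'."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names

def gff_seqids(gff_lines):
    """Collect the values of GFF column 1, skipping comment and blank lines."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0].strip())
    return ids

def missing_in_fasta(gff_lines, fasta_lines):
    """Return the sorted GFF sequence IDs that have no FASTA record of the same name."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))
```

If missing_in_fasta returns nearly every ID, comparing a few GFF IDs with the corresponding FASTA headers by eye usually shows the systematic difference, such as a version suffix or a different naming scheme.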

java.lang.NullPointerException: Cannot invoke "javax.script.ScriptEngine.createBindings()" because "engine" is null

Hi Jens,

Thank you for developing such great software! I am trying to use GeMoMa to annotate my genome with 5 reference species from Ensembl. But after running for several hours, I encountered this error report:

java.lang.NullPointerException: Cannot invoke "javax.script.ScriptEngine.createBindings()" because "engine" is null
        at projects.gemoma.Tools.eval(Tools.java:698)
        at projects.gemoma.Tools.filter(Tools.java:731)
        at projects.gemoma.GeMoMaAnnotationFilter.run(GeMoMaAnnotationFilter.java:555)
        at projects.gemoma.GeMoMaPipeline$JGAF.doJob(GeMoMaPipeline.java:2126)
        at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1375)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1660)
        at java.base/java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1464)
        at projects.gemoma.GeMoMaPipeline$1.run(GeMoMaPipeline.java:609)
        at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1409)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

Statistics:
Job     WAITING RUNNING INTERRUPTED     FAILED  SUCCEEDED
---------------------------------------------------------
MmseqsCreateDB  0       0       0       0       1
EREAndFill      0       0       0       0       1
ExtractorAndSplit       0       0       0       0       5
Mmseqs  0       0       0       0       5
GeMoMa  0       0       0       0       200
Cat     0       0       0       0       5
GAF     0       0       0       1       0

1 jobs did not finish as expected. Please check the output carefully.
Did not delete temporary files allowing to debug.

Elapsed time: 23863 seconds     (6h 37m 43s)
Exception in thread "main" java.lang.RuntimeException: Did not finish as intended. java.lang.NullPointerException: Cannot invoke "javax.script.ScriptEngine.createBindings()" because "engine" is null
        at projects.gemoma.Tools.eval(Tools.java:698)
        at projects.gemoma.Tools.filter(Tools.java:731)
        at projects.gemoma.GeMoMaAnnotationFilter.run(GeMoMaAnnotationFilter.java:555)
        at projects.gemoma.GeMoMaPipeline$JGAF.doJob(GeMoMaPipeline.java:2126)
        at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1375)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

And this is the command I used to run GeMoMa:

java -jar -Xmx490G /gpfs/home/liunyw/biosoft/GeMoMa/GeMoMa-1.9.jar CLI GeMoMaPipeline threads=40 \
        outdir=/gpfs/home/liunyw/howler_monkey/05.gene-annotation/homology-based/GeMoMa/aloCar/out\
        GeMoMa.Score=ReAlign \
        AnnotationFinalizer.r=NO o=true p=false \
        t=aloCar.masked.fa \
        s=own i=human a=../data_for_annotation/Homo_sapiens.GRCh38.107.chr.gff3 g=../data_for_annotation/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
        s=own i=champ a=../data_for_annotation/Pan_troglodytes.Pan_tro_3.0.107.chr.gff3 g=../data_for_annotation/Pan_troglodytes.Pan_tro_3.0.dna_sm.toplevel.fa \
        s=own i=goril a=../data_for_annotation/Gorilla_gorilla.gorGor4.107.chr.gff3 g=../data_for_annotation/Gorilla_gorilla.gorGor4.dna_sm.toplevel.fa \
        s=own i=macac a=../data_for_annotation/Macaca_mulatta.Mmul_10.107.chr.gff3 g=../data_for_annotation/Macaca_mulatta.Mmul_10.dna_sm.toplevel.fa \
        s=own i=mouse a=../data_for_annotation/Mus_musculus.GRCm39.107.chr.gff3 g=../data_for_annotation/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa

Can you help me fix this error?

Thanks
Yawen

How does GeMoMa treat masked nucleotides?

Hi GeMoMa team,

Thank you for developing GeMoMa. I tested GeMoMa on one of my reference genomes using the annotations of 10 other reference species, without RNA-seq data. It predicts 19000 genes, while the original annotation (based on RNA-seq) has about 12000 genes. Besides the problem that the number of predicted genes increases with the number of references provided, I also wonder whether transposable elements could be causing this, so I want to test running GeMoMa on a masked genome.

Before I run it, I want to confirm whether GeMoMa ignores or down-weights soft-masked sequence (changed from upper case to lower case), or whether I should hard-mask the genome.

Thanks,
Zexuan

Exception in thread "main" java.lang.NullPointerException

Dear Developer,

Thanks a lot for developing this nice software.

I would like to ask you some questions concerning my analysis.

I used TSEBRA in combination with BRAKER2 to predict the genes, and now I want to use GeMoMa to add the UTRs.

I tested the software with ./test.sh and it works.

I also ran java -jar GeMoMa-1.9.jar CLI ERE m=analysis/Cday10_1.bam
# this step produces introns.gff and coverage.bedgraph successfully.

Then, when I want to run AnnotationFinalizer:
java -jar GeMoMa-1.9.jar CLI AnnotationFinalizer u=YES g=analysis/assembly.fasta.PolcaCorrected.FINAL.fasta a=analysis/Helsinki_Plus_Feywei_nonOverlapping_annotation.gff i=analysis/introns.gff c=UNSTRANDED coverage_unstranded=analysis/coverage.bedgraph rename=NO outdir=x

the following error occurred:

Exception in thread "main" java.lang.NullPointerException
	at projects.gemoma.AnnotationFinalizer.read(AnnotationFinalizer.java:517)
	at projects.gemoma.AnnotationFinalizer.run(AnnotationFinalizer.java:663)
	at projects.gemoma.AnnotationFinalizer.run(AnnotationFinalizer.java:603)
	at projects.gemoma.GeMoMaModule.run(GeMoMaModule.java:94)
	at de.jstacs.tools.ui.cli.CLI.run(CLI.java:426)
	at projects.gemoma.GeMoMa.main(GeMoMa.java:399)

My GFF input file looks like this:

HiC_scaffold_1 AUGUSTUS mRNA 3484 5786 . - . ID=Bpus_H_g1.t1;geneID=Bpus_H_g1
HiC_scaffold_1 AUGUSTUS exon 3484 3726 1.00 - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 4045 4137 1.00 - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 4373 4502 1.00 - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 4875 4934 1.00 - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 5175 5400 1.00 - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 5582 5786 1.00 - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 3484 3726 . - 0 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4045 4137 . - 0 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4373 4502 . - 1 Parent=Bpus_H_g1.t1

My Java version is 1.8 and my GeMoMa version is 1.9.

I would like to ask why this happens and how to solve it.

Cheers,
Yuling

Catchitt tutorial: "ERROR BBFile header is unrecognized type, header magic = -498368882"

Has anyone gotten this error when doing the tutorial and running the DNase-seq access tool?

This is my input (on a Mac, with the Catchitt JAR file installed):
java -Xmx512m -jar Catchitt-0.1.3.jar access d="Bigwig" i=ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=.

java.lang.RuntimeException: Index 1 out of bounds for length 1

test.log
I used the command "java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline threads=10 t=spades.fasta s=own a=Models.gff3 g=scaffolds.fasta AnnotationFinalizer.r=NO >test.log 2>&1". Some input files ran without problems, but some failed. The error is shown below; do you know how to fix it? Thank you very much! I have also attached the log file.

Exception in thread "main" java.lang.RuntimeException: Did not finish as intended. java.util.concurrent.ExecutionException: java.lang.RuntimeException: Index 1 out of bounds for length 1
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
at projects.gemoma.GeMoMa$TranscriptPredictor.computeTranscript(GeMoMa.java:2466)
at projects.gemoma.GeMoMa$TranscriptPredictor.compute(GeMoMa.java:2266)
at projects.gemoma.GeMoMa.run(GeMoMa.java:972)
at projects.gemoma.GeMoMaPipeline$JGeMoMa.doJob(GeMoMaPipeline.java:1922)
at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1321)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)

java.lang.OutOfMemoryError: Java heap space when using GeMoMa

Hello!
I ran GeMoMa with a given external annotation (external.gff); the command is as follows:

java -jar /newlustre/home/xiongqian/software/annotation/gemoma/GeMoMa-1.9.jar CLI GeMoMaPipeline threads=24 tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=D2005012427.Euk.IDs.fa outdir=test s=own i=1 a=GCA_000143455.1.gff g=GCA_000143455.1.fas s=own i=2 a=GCA_001584585.1.gff g=GCA_001584585.1.fas s=own i=3 a=GCA_030144725.1.gff g=GCA_030144725.1.fas s=own i=4 a=GCA_030272155.1.gff g=GCA_030272155.1.fas ID=august e=D2005012427.augustus.gff3

And the error I get is:
Problem while gene: gene-QJQ45_015695_
GeMoMa for species 2 (3) split=20 throws an Exception
java.lang.Exception: Forwarding java.lang.OutOfMemoryError: Java heap space
at de.jstacs.algorithms.alignment.Alignment.computeAlignment(Alignment.java:248)
at projects.gemoma.GeMoMa$MyAlignment.computeAlignment(GeMoMa.java:4707)
at de.jstacs.algorithms.alignment.Alignment.getAlignment(Alignment.java:185)
at de.jstacs.algorithms.alignment.Alignment.getAlignment(Alignment.java:167)
at projects.gemoma.GeMoMa$TranscriptPredictor.computeTranscript(GeMoMa.java:2636)
at projects.gemoma.GeMoMa$TranscriptPredictor.compute(GeMoMa.java:2336)
at projects.gemoma.GeMoMa.run(GeMoMa.java:1012)
at projects.gemoma.GeMoMaPipeline$JGeMoMa.doJob(GeMoMaPipeline.java:1988)
at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1375)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

GeMoMa for species 2 (3) split=16 throws an Exception
GeMoMa for species 2 (3) split=7 throws an Exception
GeMoMa for species 2 (3) split=10 throws an Exception
GeMoMa for species 2 (3) split=14 throws an Exception

I would really appreciate your help!

RuntimeException

Hey,
I am a new user of GeMoMa, and I am trying to test the program with A. thaliana data.
I use the latest version (1.8), and my command is: java -jar /home/qymeng/biosoft/GeMoMa/GeMoMa-1.8.jar CLI GeMoMaPipeline threads=10 AnnotationFinalizer.r=NO p=false o=true t=GCF_000004255.2_v.1.0_genomic.fna.gz outdir=output/ s=pre-extracted c=/home/qymeng/Gbarbadense/Annotation/Protein/Athaliana_447_Araport11.protein.fa >test.log 2>&1 &
However, when it finished, I got a temp folder, GeMoMa_temp, which contains filtered_predictions.gff and final_annotation.gff. I have attached my log file; could you give me some suggestions to solve this?
test.log
Best wishes,
Qingying

"does not match the regular expression for sequence IDs"

Hi!

I am trying to run GeMoMa using some gffs from NCBI as reference. It looks like my run fails, because one (or more) of the genes does not comply with a GeMoMa naming convention:

java.lang.IllegalArgumentException: Sequence ID (gene-Su(var)3-9_0) in fasta comment line (>gene-Su(var)3-9_0) does not match the regular expression for sequence IDs (([a-zA-Z\-\.:0-9]+(_\d+)?)|([a-zA-Z\-_\.:0-9]+_\d+))

How can I avoid this? I would prefer not having to change the gff from NCBI. Or am I missing the point and something else is going wrong?
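
To find out in advance which IDs would be rejected, the regular expression quoted in the error message can be applied to the FASTA headers. The following is a minimal Python sketch, assuming the check requires the whole ID to match (this is not confirmed by the message) and that the ID is the header text up to the first whitespace; the helper name rejected_ids is hypothetical.

```python
import re

# Pattern copied from the IllegalArgumentException above; the outermost parentheses
# belong to the message formatting and are dropped here, which does not change the match.
SEQ_ID = re.compile(r"([a-zA-Z\-\.:0-9]+(_\d+)?)|([a-zA-Z\-_\.:0-9]+_\d+)")

def rejected_ids(fasta_lines):
    """Return the FASTA record IDs that do not fully match SEQ_ID."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            if SEQ_ID.fullmatch(seq_id) is None:
                bad.append(seq_id)
    return bad
```

For the ID from the error message, gene-Su(var)3-9_0, the parentheses are the characters that fall outside both character classes.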

GeMoMa restart error

Hello,
I'm currently using GeMoMa to transfer gene models from a draft assembly to a pseudo assembly created on the CoGe online platform. I have 40 draft assemblies, and for each of them we created a "pseudoassembly" by mapping the assembly onto a chromosome-level reference assembly that we created for a very close relative of all 40 species. I was told that GeMoMa would be an easy and quick option for "transferring" each set of gene models from draft assembly to pseudo assembly.

I installed GeMoMa-1.6.4 and ran this command:
java -Xmx512g -jar /path/to/GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=32 outdir=$OUTDIR GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=$TARGET a=$REFASS g=$REFGEN

Our cluster doesn't allow jobs to run for more than 72 hours. I hoped that it would finish before then, but it timed out. Luckily there is the restart option, but when I ran this command:
java -Xmx187g -jar /path/to/GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=32 restart=true outdir=$OUTDIR GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=$TARGET a=$REFASS g=$REFGEN

I got this error message:
Unknown parameters: {restart=[true]}

I'm just realizing now that restart is only available for version 1.7+. This was a big mistake on my end, but I still want to ask my question to see if I'm doing this job efficiently. Is this the command that you would use to do what I'm hoping to do? Thank you for your time
-Steve

Running GAF independently of GeMoMa Pipeline gives empty file

Hi @JensKeilwagen ,

I ran GeMoMa pipeline using RNAseq data and 6 reference genomes. I used the -o option to get the unfiltered output files, and I've tried running GeMoMa GAF independently of the pipeline to do further filtering. However, when I do this I get an output file with some software information but nothing else. How can I get it to perform filtering? Here is the command I am running:

java -jar GeMoMa-1.8.jar CLI GAF f=score/aa>=1.0 g=unfiltered_predictions_from_species_1.gff

I also attached the first 50 lines of one of the unfiltered gff files I got from running GeMoMa pipeline.

Thanks,
Brian
#27
label:GeMoMa
unfiltered_predictions_from_species_0_50lines.txt
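
One possible cause that is independent of GeMoMa: in a POSIX shell, an unquoted > is the output redirection operator, so f=score/aa>=1.0 would be split into the argument f=score/aa and a redirection of standard output to a file named =1.0. Whether this happened here is an assumption, since the shell in use is not stated. The following is a minimal Python sketch that uses the standard shlex module to approximate how a shell tokenises the two variants; the helper name shell_tokens is hypothetical.

```python
import shlex

def shell_tokens(command):
    """Tokenise a command line roughly as a POSIX shell would, keeping operators such as '>' separate."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    return list(lexer)

# Unquoted filter as in the command above: '>' is split off as an operator.
unquoted = shell_tokens("CLI GAF f=score/aa>=1.0 g=unfiltered_predictions_from_species_1.gff")

# Quoted filter: the whole expression stays one argument.
quoted = shell_tokens("CLI GAF f='score/aa>=1.0' g=unfiltered_predictions_from_species_1.gff")
```

If the unquoted form was used, a file named =1.0 should exist in the working directory. Quoting the filter expression keeps it intact as a single argument.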

GeMoMa Extractor error StringIndexOutOfBoundsException

Hi,

Thanks for your tool, it's been very useful so far!
I'm trying to annotate a new genome but I keep getting an error in Extractor. The code I'm using is:
java -jar $jar CLI Extractor a=${ref_annotation} g=${ref_genome} Ambiguity=AMBIGUOUS outdir=${out} r=true sefc=true d=false

And the error:
number of detected CDS lines: 264010
number of detected genes: 20033
number of detected transcripts: 25635
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:658)
at projects.gemoma.Extractor.transcript(Extractor.java:892)
at projects.gemoma.Extractor.extract(Extractor.java:706)
at projects.gemoma.Extractor.run(Extractor.java:144)
at projects.gemoma.GeMoMaModule.run(GeMoMaModule.java:92)
at de.jstacs.tools.ui.cli.CLI.run(CLI.java:426)
at projects.gemoma.GeMoMa.main(GeMoMa.java:378)

Thanks,
Marta

Is it possible to use GeMoMa based on a protein database like OrthoDB to annotate a target genome?

Hi,

Is it possible to use GeMoMa based only on a protein database like OrthoDB (or combined with RNA-seq data) to annotate a target genome?

Best,
Kun

cdsParts=true, but the ID (gene-si: )seems to be no CDS part

Hi,

I am trying to run GeMoMa using some reference genomes and annotation files downloaded from NCBI RefSeq. It is crashing with the following error message:

Exception in thread "main" java.lang.RuntimeException: Did not finish as intended. java.lang.IllegalArgumentException: You selected cdsParts=true, but the ID (gene-si: )seems to be no CDS part.

One example is the zebrafish, GCF_000002035.6_GRCz11_genomic.fna and corresponding genomic.gff.

The genes including gene-si in the ID seem to have CDS, so I am wondering if this is a parser formatting issue?

NC_007112.7     BestRefSeq%2CGnomon     gene    65705   71667   .       -       .       ID=gene-si:zfos-1011f11.1;Dbxref=GeneID:407680,ZFIN:ZDB-GENE-080229-4;Name=si:zfos-1011f11.1;description=si:zfos-1011f11.1;gbkey=Gene;gene=si:zfos-1011f11.1;gene_biotype=protein_coding;gene_synonym=sc:d0144
NC_007112.7     BestRefSeq      mRNA    65705   71523   .       -       .       ID=rna-NM_001115105.2;Parent=gene-si:zfos-1011f11.1;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;Name=NM_001115105.2;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      exon    71433   71523   .       -       .       ID=exon-NM_001115105.2-1;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      exon    69271   69443   .       -       .       ID=exon-NM_001115105.2-2;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      exon    68862   69075   .       -       .       ID=exon-NM_001115105.2-3;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      exon    68049   68213   .       -       .       ID=exon-NM_001115105.2-4;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      exon    67834   67950   .       -       .       ID=exon-NM_001115105.2-5;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      exon    67621   67753   .       -       .       ID=exon-NM_001115105.2-6;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      exon    65705   67548   .       -       .       ID=exon-NM_001115105.2-7;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NM_001115105.2,ZFIN:ZDB-GENE-080229-4;gbkey=mRNA;gene=si:zfos-1011f11.1;product=si:zfos-1011f11.1;transcript_id=NM_001115105.2
NC_007112.7     BestRefSeq      CDS     71433   71475   .       -       0       ID=cds-NP_001108577.2;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NP_001108577.2,ZFIN:ZDB-GENE-080229-4;Name=NP_001108577.2;gbkey=CDS;gene=si:zfos-1011f11.1;product=cell surface A33 antigen precursor;protein_id=NP_001108577.2
NC_007112.7     BestRefSeq      CDS     69271   69443   .       -       2       ID=cds-NP_001108577.2;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NP_001108577.2,ZFIN:ZDB-GENE-080229-4;Name=NP_001108577.2;gbkey=CDS;gene=si:zfos-1011f11.1;product=cell surface A33 antigen precursor;protein_id=NP_001108577.2
NC_007112.7     BestRefSeq      CDS     68862   69075   .       -       0       ID=cds-NP_001108577.2;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NP_001108577.2,ZFIN:ZDB-GENE-080229-4;Name=NP_001108577.2;gbkey=CDS;gene=si:zfos-1011f11.1;product=cell surface A33 antigen precursor;protein_id=NP_001108577.2
NC_007112.7     BestRefSeq      CDS     68049   68213   .       -       2       ID=cds-NP_001108577.2;Parent=rna-NM_001115105.2;Dbxref=GeneID:407680,Genbank:NP_001108577.2,ZFIN:ZDB-GENE-080229-4;Name=NP_001108577.2;gbkey=CDS;gene=si:zfos-1011f11.1;product=cell surface A33 antigen precursor;protein_id=NP_001108577.2
...

Any advice as to how to fix this would be appreciated.
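The error message shows the ID cut off right after `gene-si:`, which suggests (an assumption, not confirmed anywhere in this report) that the ID is being split at the colon somewhere during parsing. As a point of comparison, here is a minimal sketch of reading the ninth GFF3 column so that colons inside values survive. The class and method names are hypothetical and this is not GeMoMa's parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of GeMoMa or Jstacs.
class GffAttributeSketch {

    // Splits the attribute column on ';' and each field on its FIRST '=' only,
    // so values such as "gene-si:zfos-1011f11.1" keep their colon.
    static Map<String, String> parseAttributes(String column9) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String field : column9.split(";")) {
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed fields
            }
            attrs.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
        return attrs;
    }

    public static void main(String[] args) {
        String column9 = "ID=gene-si:zfos-1011f11.1;Dbxref=GeneID:407680,ZFIN:ZDB-GENE-080229-4;"
                + "Name=si:zfos-1011f11.1";
        System.out.println(parseAttributes(column9).get("ID"));
    }
}
```

If an ID parsed this way still ends at the colon further downstream, the truncation would have to happen in a later step rather than in the attribute parsing itself.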

Thanks,

Rich

NullPointerException in Coverage.print

While running the "access" stage, and after lots of processing, the exception was thrown with these arguments and this stack trace. Any advice?

(Perhaps it would be possible to modify the code slightly so that the exception gets caught, and processing continues?)

Parameters of tool "Chromatin accessibility" (access, version: 0.1):
d - Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)	= Bigwig
    Parameters for selection "BAM/SAM":
    	i - Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)	= null
    Parameters for selection "Bigwig":
    	i - Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)	= /proj/price4/apaquett/Placenta_DHS/placental_data/dataSets/LN54270/aggregation-17562/Signal.UniqueMultiple.both.bw
    	f - FastA index (The genome index)	= hg38.fa.fai
b - Bin width (The width of the genomic bins considered)	= 50
outdir - The output directory, defaults to the current working directory (.)	= dnase
Exception in thread "main" java.lang.NullPointerException
	at projects.encodedream.Coverage.print(Coverage.java:156)
	at projects.encodedream.Coverage.coverage(Coverage.java:113)
	at projects.encodedream.tools.ChromatinAccessibility.run(ChromatinAccessibility.java:121)
	at de.jstacs.tools.ui.cli.CLI.run(CLI.java:374)
	at projects.encodedream.tools.Catchitt.main(Catchitt.java:27)

This appears to correspond to the control statement of the inner for loop:

	private static void print(String chr, double[][] temp, PrintStream out, int bin) {
		for(int i=0;i<temp.length;i++){
			out.print(chr);
			out.print("\t");
			out.print(i*bin);
			for(int j=0;j<temp[i].length;j++){   // line 156
				out.print("\t");
				out.print(temp[i][j]);
			}
			out.println();
		}
	}
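A NullPointerException in the inner loop header would be consistent with `temp[i]` being null for some bin. Below is a minimal sketch of a null-tolerant variant, assuming (this is not confirmed by the report) that a null row simply means no coverage was recorded for that bin. The class name is hypothetical and skipping such rows is only one possible policy.

```java
import java.io.PrintStream;

// Hypothetical stand-alone variant of the quoted method, not the Jstacs code.
class CoveragePrintSketch {

    static void print(String chr, double[][] temp, PrintStream out, int bin) {
        if (temp == null) {
            return; // nothing collected for this chromosome
        }
        for (int i = 0; i < temp.length; i++) {
            if (temp[i] == null) {
                continue; // assumption: a null row means the bin has no data
            }
            out.print(chr);
            out.print("\t");
            out.print(i * bin);
            for (int j = 0; j < temp[i].length; j++) {
                out.print("\t");
                out.print(temp[i][j]);
            }
            out.println();
        }
    }

    public static void main(String[] args) {
        double[][] rows = { { 1.0, 2.0 }, null, { 3.0 } };
        print("chr1", rows, System.out, 50); // the null row at index 1 is skipped
    }
}
```

Skipping rows leaves gaps in the bin coordinates of the output, so a downstream step that expects one line per bin might prefer a row of zeros instead.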

Usage of the weight parameter (w)

Hi, Jens,
After running the "GeMoMa" step with several related genomes, I need to run the "GAF" step. I set GAF.f strictly, e.g. "iAA>=0.8 and ce/rce==1", and the predictions are much better than before this filter step.

Here is the problem: each predicted gene is supported by predictions from several relatives, and I want to choose the one from the closest relative as the final prediction and keep the others as alternatives.

I tried to use the parameter "w" for this, giving the relatives different weights according to their phylogenetic relationships. For example, the closest relative gets the largest weight, say 1000.0, while a distant one gets a weight of 1.0. I expected GAF to then choose predictions with the priority given by these weights. Is the attribute "sumWeight (the sum of the weights of the references that perfectly support this prediction)" used for this purpose?

Yours sincerely

Supplementary files:
These are the predictions from the related species SP1, SP2, SP3, SP4.

This file contains the weights reflecting their phylogenetic relationships (SP1, SP2, SP3, SP4).

The best prediction should come from SP1, not from SP2, but the GAF step keeps the SP2 prediction as the best one and those from SP1, SP3 and SP4 as alternatives.

Here is the description of the parameter "w" from the GeMoMa website:

I ran the filter like this: first, I added the weights to each set of predictions as in the step above; then I combined all predictions and filtered with sumWeight (here I don't know how to use it to achieve my goal).

GeMoMa-v1.8 overestimating gene number with multiple references

Hello,
I think GeMoMa is overestimating the number of genes in my genomes. I'm working with 3 haploid genome assemblies and using GeMoMa to annotate them. These are a few stats for one of them using different numbers of references.

1 reference: 39,178 genes/39,178 mRNA; 86.7% complete BUSCOs
2 references: 42,883 genes/45,776 mRNA; 93.6% complete BUSCOs
4 references: 61,006 genes/72,630 mRNA; 96.7% complete BUSCOs

I used these commands to count the genes and mRNA
awk '{if ($3=="mRNA") print}' final_annotation.gff | wc -l
awk '{if ($3=="gene") print}' final_annotation.gff | wc -l
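The same counts can be obtained in one pass over the file. The following is a minimal sketch that tallies every feature type in column 3 of a tab-separated GFF. The class name is hypothetical, and unlike awk's default whitespace splitting it splits on tabs only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of GeMoMa.
class GffFeatureCounter {

    // Counts how often each feature type (column 3) occurs.
    static Map<String, Integer> countTypes(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines, comments and directives
            }
            String[] cols = line.split("\t");
            if (cols.length < 9) {
                continue; // not a complete GFF record
            }
            counts.merge(cols[2], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java GffFeatureCounter.java final_annotation.gff");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countTypes(lines).forEach((type, n) -> System.out.println(type + "\t" + n));
    }
}
```

Comparing the mRNA-to-gene ratio across the runs (1.00, about 1.07 and about 1.19 from the numbers above) separates growth in alternative transcripts from growth in gene loci.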

The chromosome-level reference (2021) that I'm comparing these to is very closely related (essentially sister taxa in the same genus), has only 35,470 genes (but a surprisingly low complete BUSCO score of 81.3%) and shares all the same polyploidy events.

My initial worry is that GeMoMa is overestimating the gene number, but is this likely? It's definitely finding potential genes and the BUSCOs are increasing, but 72k seems like too many genes. Do you have any recommendations for how I could have more confidence in this annotation? I appreciate any help. Thank you,
Steve

Here is the command I'm using for all of them:

java -Xmx64g -jar /projects/academic/vaalbert/modulefiles/gemoma-1.8/GeMoMa-1.8.jar CLI GeMoMaPipeline
t=$TARGET o=true p=true
s=own i=$REF1 a=$REFASS1 g=$REFGEN1
s=own i=$REF2 a=$REFASS2 g=$REFGEN2
s=own i=$REF3 a=$REFASS3 g=$REFGEN3
s=own i=$REF4 a=$REFASS4 g=$REFGEN4
tblastn=false
GeMoMa.c=0.4
GeMoMa.Score=ReAlign
AnnotationFinalizer.r=NO
Extractor.r=true
Extractor.f=false
outdir=$OUTDIR threads=16

Confused about an error

Hi,
I used GeMoMa to predict genes in a mammalian genome with the following commands:
GeMoMa=/home/softwares/miniconda3/pkgs/gemoma-1.6.4-0/share/GeMoMa-1.7.1.jar
genome=/home/bmx01HIFI_v130.fa
nohup java -jar ${GeMoMa} CLI GeMoMa s=test.bla t=${genome} c=cds-parts.fasta i=introns.gff &
It seems to work well, but there is one error that confuses me:
"java.lang.IllegalArgumentException: There is no sequence with sequence ID 347 in the target genome."
I only have 20 chromosomes, so why does this error occur?
"genome parts: 20 [chrY, chr9, chr7, chrX, chr8, chr5, chr6, chr10, chr3, chr11, chr4, chr1, chr12, chr13, chr2, chr18, chr14, chr15, chr16, chr17]"
I am very confused about it.
I need help!
Thanks so much!
Yizhong Huang

Running GAF independently of GeMoMa Pipeline gives empty file @GeMoMa

Hi,

java -jar GeMoMa-1.8.jar CLI GAF f=score/aa>=1.0 g=unfiltered_predictions_from_species_1.gff

I also attached the first 50 lines of one of the unfiltered gff files I got from running GeMoMa pipeline.

Thanks,
Brian
unfiltered_predictions_from_species_0_50lines.txt

Prediction: Chromosomes do not match between input files.

Hello,

I'm having this problem in the prediction step. I had the same error in the training step but I figured it out (not exactly the same reference).

The input motif order is the same as in training.

Now, I really don't understand this error.

java -jar -Xmx30g ${Catchitt_path} predict c=TFBS_predict_yli11_2021-10-03/trained_model/Classifiers.xml a=TFBS_predict_yli11_2021-10-03/ATAC/${COL1}/Chromatin_accessibility.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/ENCSR000BHK_SP1-human_1_hg19-model-2/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_lslim3-model-1/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_lslim3-model-2/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_lslim3-model-3/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_lslim3-model-4/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_lslim3-model-5/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_lslim3-model-6/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_lslim3-model-7/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/intersect_all_relaxed_filtered_pwm-model-1/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/NFIX.homer/Motif_scores.tsv.gz m=TFBS_predict_yli11_2021-10-03/Motif/PU1.homer/Motif_scores.tsv.gz f=/home/yli11/Data/Mouse/mm9/fasta/mm9_main.fa.fai outdir=TFBS_predict_yli11_2021-10-03/prediction/${COL1}

Thanks,
Yichao

Chromosomes do not match between input files.
[, null, null, null, null, null, null, null, null, null, null, null, null, null]

GeMoMa fails if directory containing GeMoMa JAR file is not writable

If the GeMoMa JAR file is located in a directory the user does not have write access to (e.g., as part of a global package installation by a sysadmin that is usable by all server/HPC cluster users; or in a Singularity Image Format (SIF) file, which contains a read-only file system providing the software environment for a Singularity container), GeMoMa will fail with an error like (e.g., in the case where the JAR is at /usr/local/bin/GeMoMa-1.6.4.jar):

Exception in thread "main" java.io.FileNotFoundException: /usr/local/bin/GeMoMa.ini.xml (Read-only file system)

A GeMoMa.ini.xml file is apparently assumed to exist in the same directory as the GeMoMa JAR file, and if not, one is written with default values:

Jstacs/projects/gemoma/GeMoMa.java

Line 321 in c7126b1

 File ini = new File( jarfile.getParentFile().getAbsolutePath() + File.separator + "GeMoMa.ini.xml" ); 

To support these GeMoMa installation scenarios, it would be convenient if the GeMoMa defaults were silently used (instead of being written to a new file) when GeMoMa.ini.xml does not exist (or, alternatively, written to the current working directory, which is likely to be writable).

Jstacs/projects/gemoma/GeMoMa.java

Lines 319 to 320 in c7126b1

 int maxSize = -1; 

 long timeOut=3600, maxTimeOut=60*60*24*7;
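The second alternative suggested above could look roughly like the following. This is a minimal sketch with hypothetical class and method names, not the GeMoMa code: it keeps the file next to the JAR when that location is usable and otherwise falls back to another directory such as the current working directory.

```java
import java.io.File;

// Hypothetical helper illustrating the proposed fallback, not part of GeMoMa.
class IniLocationSketch {

    // Prefers GeMoMa.ini.xml next to the JAR if it already exists there or
    // could be created there; otherwise points into the fallback directory.
    static File chooseIniFile(File jarDir, File fallbackDir) {
        File preferred = new File(jarDir, "GeMoMa.ini.xml");
        if (preferred.exists() || jarDir.canWrite()) {
            return preferred;
        }
        return new File(fallbackDir, "GeMoMa.ini.xml");
    }

    public static void main(String[] args) {
        File jarDir = new File("/usr/local/bin"); // example location from the report
        File workingDir = new File(System.getProperty("user.dir"));
        System.out.println(chooseIniFile(jarDir, workingDir).getAbsolutePath());
    }
}
```

`File.canWrite()` is only a pre-check, so the actual write would still need to handle an `IOException`, for instance by continuing with in-memory defaults.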

Assignment parameter optional does not exist when running GeMoMa with protein sequences

Hi, I'm trying to run GeMoMa with annotation evidence from other species + protein sequences from my species (extracted from a transcriptome).

java -Xms10G -Xmx120G -jar 0_programs/GeMoMa-1.8.jar CLI GeMoMaPipeline threads=16 GeMoMa.Score=ReAlign \
AnnotationFinalizer.r=NO o=true t=../hakea_chr_v2_mask.fa p=true pc=true pgr=true outdir=5_gemoma_out/macadamia_ts2_hp \
s=own i=Nn a=GCF_000365185.1_Chinese_Lotus_1.1_genomic.gff g=GCF_000365185.1_Chinese_Lotus_1.1_genomic.fna \
s=own i=At a=GCF_000001735.4_TAIR10.1_genomic.gff g=GCF_000001735.4_TAIR10.1_genomic.fna \
s=own i=Mi a=GCF_013358625.1_SCU_Mint_v3_genomic.gff g=GCF_013358625.1_SCU_Mint_v3_genomic.fna \
s=own i=Ts a=Tspe_v1/Tspe_v1_gemomaannotation.gff g=Tspe_v1/Tspe_v1.fa \
s=pre-extracted c=Trinity_2021_filt_peptides_NRtop1Emb_ids_tr2.fasta

Exception in thread "main" java.lang.RuntimeException: Did not finish as intended. de.jstacs.parameters.SimpleParameter$IllegalValueException: Error in parameter(assignment): Parameter not permitted: File does not exist
at de.jstacs.parameters.FileParameter.setValue(FileParameter.java:305)
at projects.gemoma.GeMoMaPipeline.run(GeMoMaPipeline.java:1039)
at projects.gemoma.GeMoMaModule.run(GeMoMaModule.java:92)
at de.jstacs.tools.ui.cli.CLI.run(CLI.java:426)
at projects.gemoma.GeMoMa.main(GeMoMa.java:385)

I get this error only when I include the last line with the protein sequence fasta. I assume the error is to do with the assignment parameter, but I thought this was optional and I'm not sure how to make the assignment file.

Update conda to v1.8

Is it possible to push the latest version to Conda so that v1.8 can be downloaded using

conda install -c bioconda gemoma

Currently, Conda is only using v1.7.1

java.lang.IllegalArgumentException: At least two sequences with the same ID but different sequence:

Hi-

Thanks for this useful annotation program!

I am running the GeMoMaPipeline using three reference species and one bam of mapped RNA-seq reads (consisting of all RNA-seq data for that species compiled with samtools merge).
GeMoMa GeMoMaPipeline -Xms5G -Xmx50G GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO pc=true pgr=true t=$TARGET threads=24 outdir=/GemomaOut/ tblastn=true b=$BLAST_PATH o=true p=false r=MAPPED ERE.m=$RNA1 s=own i=species1 a=$ANN1 g=$GEN1 s=own i=species2 a=$ANN2 g=$GEN2 s=own i=species3 a=$ANN3 g=$GEN3>gemoma.out

In phase 2 of the pipeline, the program prints the following multiple times:
Exception in thread "main" java.lang.RuntimeException: Did not finish as intended. java.lang.IllegalArgumentException: At least two sequences with the same ID but different sequence: gene-XB22166507.L
The program ends with an exception. This message appears to be associated with the GFF of one of my reference species, and each time the message is printed, that same gene name is specified. However, if I search the 9th column of that GFF file (ID=...), that ID only occurs once. Just in case, I checked the other two GFF files, and neither includes this ID.

Do you know what might be causing this exception?

Thank you!

GeMoMa failing with multiple reference species + RNAseq

Hi,
I'm trying to combine evidence from 15 reference species and mapped RNAseq data.
GeMoMa works if I only use one reference species, but it is not working when I try to combine all of them.

Any tips?

Thank you,
Mark

Here's my command:

java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline tblastn=false threads=40 t=/Geli_finishersc_pilon_masked.fasta
s=own i=Agif a=/GenomicData/Agif/genomic.gff g=/GenomicData/Agif/Agif.fna
s=own i=Btre a=/GenomicData/Btre/genomic.gff g=/GenomicData/Btre/Btre.fna
s=own i=Ccon a=/GenomicData/Ccon/genomic.gff g=/GenomicData/Ccon/Ccon.fna
s=own i=Cflo a=/GenomicData/Cflo/genomic.gff g=/GenomicData/Cflo/Cflo.fna
s=own i=Cins a=/GenomicData/Cins/genomic.gff g=/GenomicData/Cins/Cins.fna
s=own i=Csol a=/GenomicData/Csol/genomic.gff g=/GenomicData/Csol/Csol.fna
s=own i=Ctyp a=/GenomicData/Ctyp/genomic.gff g=/GenomicData/Ctyp/Ctyp.fna
s=own i=Dall a=/GenomicData/Dall/genomic.gff g=/GenomicData/Dall/Dall.fna
s=own i=Fari a=/GenomicData/Fari/genomic.gff g=/GenomicData/Fari/Fari.fna
s=own i=Mdem a=/GenomicData/Mdem/genomic.gff g=/GenomicData/Mdem/Mdem.fna
s=own i=Nvit a=/GenomicData/Nvit/genomic.gff g=/GenomicData/Nvit/Nvit.fna
s=own i=Tbra a=/GenomicData/Tbra/genomic.gff g=/GenomicData/Tbra/Tbra.fna
s=own i=Tpre a=/GenomicData/Tpre/genomic.gff g=/GenomicData/Tpre/Tpre.fna
s=own i=Tsar a=/GenomicData/Tsar/genomic.gff g=/GenomicData/Tsar/Tsar.fna
s=own i=Vcan a=/GenomicData/Vcan/genomic.gff g=/GenomicData/Vcan/Vcan.fna
outdir=Pipeline_Out m=/mmseqs/mmseqs2-sse2-13-45111/bin/ AnnotationFinalizer.r=NO
r=MAPPED ERE.s=FR_UNSTRANDED ERE.m=~/merged_sorted.bam

Everything appears to run well up to the GeMoMa step, where I get the following:

Starting: GeMoMa split=3 for species 0 (Agif) (9019.389s)
Starting: GeMoMa split=4 for species 0 (Agif) (9019.39s)
Starting: GeMoMa split=5 for species 0 (Agif) (9019.39s)
Starting: GeMoMa split=2 for species 0 (Agif) (9019.389s)
Starting: GeMoMa split=1 for species 0 (Agif) (9019.389s)
Starting: GeMoMa split=6 for species 0 (Agif) (9019.39s)
Starting: GeMoMa split=0 for species 0 (Agif) (9019.389s)
Starting: GeMoMa split=7 for species 0 (Agif) (9019.391s)
Starting: GeMoMa split=8 for species 0 (Agif) (9019.391s)
Starting: GeMoMa split=9 for species 0 (Agif) (9019.391s)
Starting: GeMoMa split=10 for species 0 (Agif) (9019.394s)
Starting: GeMoMa split=11 for species 0 (Agif) (9019.394s)
Starting: GeMoMa split=12 for species 0 (Agif) (9019.396s)
Starting: GeMoMa split=13 for species 0 (Agif) (9019.398s)
Starting: GeMoMa split=14 for species 0 (Agif) (9019.398s)
Starting: GeMoMa split=15 for species 0 (Agif) (9019.398s)
Starting: GeMoMa split=16 for species 0 (Agif) (9019.398s)
Starting: GeMoMa split=17 for species 0 (Agif) (9019.398s)
Starting: GeMoMa split=18 for species 0 (Agif) (9019.399s)
Starting: GeMoMa split=19 for species 0 (Agif) (9019.4s)
Starting: GeMoMa split=20 for species 0 (Agif) (9019.4s)
Starting: GeMoMa split=21 for species 0 (Agif) (9019.404s)
Starting: GeMoMa split=22 for species 0 (Agif) (9019.405s)
Starting: GeMoMa split=23 for species 0 (Agif) (9019.405s)
Starting: GeMoMa split=24 for species 0 (Agif) (9019.407s)
Starting: GeMoMa split=25 for species 0 (Agif) (9019.408s)
Starting: GeMoMa split=26 for species 0 (Agif) (9019.408s)
Starting: GeMoMa split=27 for species 0 (Agif) (9019.408s)
Starting: GeMoMa split=28 for species 0 (Agif) (9019.408s)
Starting: GeMoMa split=29 for species 0 (Agif) (9019.408s)
Starting: GeMoMa split=30 for species 0 (Agif) (9019.409s)
Starting: GeMoMa split=31 for species 0 (Agif) (9019.409s)
Starting: GeMoMa split=32 for species 0 (Agif) (9019.409s)
Starting: GeMoMa split=35 for species 0 (Agif) (9019.41s)
Starting: GeMoMa split=34 for species 0 (Agif) (9019.41s)
Starting: GeMoMa split=36 for species 0 (Agif) (9019.41s)
Starting: GeMoMa split=38 for species 0 (Agif) (9019.41s)
Starting: GeMoMa split=37 for species 0 (Agif) (9019.41s)
Starting: GeMoMa split=33 for species 0 (Agif) (9019.418s)
Starting: GeMoMa split=39 for species 0 (Agif) (9125.925s)
GeMoMa for species 0 (Agif) split=3 throws an Exception
No external annotation given.

Statistics:
Job                 WAITING  RUNNING  INTERRUPTED  FAILED  SUCCEEDED
---------------------------------------------------------
MmseqsCreateDB      0        0        0            0       1
EREAndFill          0        0        0            0       1
ExtractorAndSplit   0        0        0            0       15
Mmseqs              0        0        0            0       15
GeMoMa              560      39       0            1       0

Elapsed time: 9412 seconds (2h 36m 52s)
GeMoMa for species 0 (Agif) split=6 throws an Exception
GeMoMa for species 0 (Agif) split=34 throws an Exception
GeMoMa for species 0 (Agif) split=17 throws an Exception
GeMoMa for species 0 (Agif) split=13 throws an Exception
GeMoMa for species 0 (Agif) split=8 throws an Exception
GeMoMa for species 0 (Agif) split=38 throws an Exception
GeMoMa for species 0 (Agif) split=19 throws an Exception
GeMoMa for species 0 (Agif) split=11 throws an Exception
GeMoMa for species 0 (Agif) split=16 throws an Exception
GeMoMa for species 0 (Agif) split=5 throws an Exception
GeMoMa for species 0 (Agif) split=37 throws an Exception
GeMoMa for species 0 (Agif) split=32 throws an Exception
GeMoMa for species 0 (Agif) split=24 throws an Exception
GeMoMa for species 0 (Agif) split=4 throws an Exception
GeMoMa for species 0 (Agif) split=21 throws an Exception
GeMoMa for species 0 (Agif) split=18 throws an Exception
GeMoMa for species 0 (Agif) split=12 throws an Exception
GeMoMa for species 0 (Agif) split=0 throws an Exception
GeMoMa for species 0 (Agif) split=1 throws an Exception
GeMoMa for species 0 (Agif) split=39 throws an Exception
GeMoMa for species 0 (Agif) split=14 throws an Exception
GeMoMa for species 0 (Agif) split=30 throws an Exception
GeMoMa for species 0 (Agif) split=25 throws an Exception
GeMoMa for species 0 (Agif) split=36 throws an Exception
GeMoMa for species 0 (Agif) split=23 throws an Exception
GeMoMa for species 0 (Agif) split=31 throws an Exception
GeMoMa for species 0 (Agif) split=26 throws an Exception
GeMoMa for species 0 (Agif) split=33 throws an Exception
GeMoMa for species 0 (Agif) split=7 throws an Exception
GeMoMa for species 0 (Agif) split=22 throws an Exception
GeMoMa for species 0 (Agif) split=10 throws an Exception
GeMoMa for species 0 (Agif) split=20 throws an Exception
GeMoMa for species 0 (Agif) split=28 throws an Exception
GeMoMa for species 0 (Agif) split=27 throws an Exception
GeMoMa for species 0 (Agif) split=9 throws an Exception
GeMoMa for species 0 (Agif) split=2 throws an Exception
GeMoMa for species 0 (Agif) split=35 throws an Exception
GeMoMa for species 0 (Agif) split=29 throws an Exception
GeMoMa for species 0 (Agif) split=15 throws an Exception

Problem with DNase-seq reads mapped to the reverse strand

Hi,
first of all, thanks for the tools and the tutorial!

I am having trouble with a couple of DNase-seq BAM files (here is an example: https://www.encodeproject.org/experiments/ENCSR136DNA/)

Catchitt throws this error message:

$ java -jar Catchitt.jar access i=/mnt/MareNostrum/db/dnase_seq/ENCFF473YHH.bam b=100 outdir=dnase

Parameters of tool "Chromatin accessibility" (access, version: 0.1):
d - Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)  = BAM/SAM
    Parameters for selection "BAM/SAM":
        i - Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)       = /mnt/MareNostrum/db/dnase_seq/ENCFF473YHH.bam
    Parameters for selection "Bigwig":
        i - Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)        = null
        f - FastA index (The genome index)      = null
b - Bin width (The width of the genomic bins considered)        = 100
outdir - The output directory, defaults to the current working directory (.)    = dnase
Exception in thread "Thread-0" java.lang.ArrayIndexOutOfBoundsException: Index -84668 out of bounds for length 10000
        at projects.encodedream.Pileup.pileup(Pileup.java:299)
        at projects.encodedream.tools.ChromatinAccessibility.lambda$0(ChromatinAccessibility.java:93)
        at java.base/java.lang.Thread.run(Thread.java:834)

I believe that this happens because of the strand to which the read is mapped. For instance, if I convert all reverse-strand reads to forward by changing the BAM flags, it works.

EDIT: this behavior also occurs when the sequence length is negative.

thanks!

java.lang.InterruptedException

Hello!
I ran GeMoMa with a given external annotation (external.gff); the command is as follows:

I would really appreciate your help!

Small gene models after GAF

Hi, Jens,
Again, I have an issue that confuses me. After the GeMoMa prediction, I compared the distributions of gene structures with those of close relatives, and there are some redundant small genes of around 1000 bp. I tried to filter those genes by limiting the minimum lengths of introns, CDS and genes, but some redundant small gene models remain. Here is an example of the gene structures of two mammal species: the blue curve represents the closely related reference species annotated by NCBI, while the red curve is my study species predicted by GeMoMa. Since they are very close relatives, the gene structures should be highly conserved. Could you give me some suggestions for making the two curves fit together?

Before filtering
After filtering introns and small CDS

Best regards,
Liu

Prediction accuracy: average gene length shorter than in relatives

Hi,
I annotated several mammal genomes with GeMoMa (version 1.7.1, "run.sh"), with and without RNA evidence, but the predicted genes are much shorter than those of the relatives of my study genomes.

I have tried many things but still cannot improve the average length of my predicted genes. Could you help me understand why?

Here is how I ran the pipeline:
Inputs: 1. the target genome (masked or unmasked, contig or chromosome level; I tried all of them); 2. high-quality genome assemblies and GFF annotations of relatives (GFF3, all transcripts or longest transcripts only).
GeMoMa run: "run.sh".

Many thanks, I look forward to your kind reply!

Format of search results for GeMoMa s= ?

Hi,

I used mmseqs results (default format) as search results for GeMoMa, but I got the following error. Why?
java -jar GeMoMa-1.8.jar CLI GeMoMa s=duck.blast ...

Problem while gene: gene-FBXW7_
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Please check the search algorithm input. It seems to have less columns than expected.
line: gene-FBXW7_16 scaffold_4 0.979 147 1 0 1 49 45179652 45179506 9.783E-26 103
split: [gene-FBXW7_16, scaffold_4, 0.979, 147, 1, 0, 1, 49, 45179652, 45179506, 9.783E-26, 103]
original message: 13

at projects.gemoma.GeMoMa.addHit(GeMoMa.java:1176)
at projects.gemoma.GeMoMa.run(GeMoMa.java:1003)
at projects.gemoma.GeMoMaModule.run(GeMoMaModule.java:92)
at de.jstacs.tools.ui.cli.CLI.run(CLI.java:426)
at projects.gemoma.GeMoMa.main(GeMoMa.java:385)
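The reported line splits into 12 fields, while the "original message: 13" suggests that index 13 was accessed, i.e. that at least 14 columns were expected. That minimum is an inference from the message, not a documented value. A minimal sketch for checking a tab-separated search result line before passing the file on (class and constant names are hypothetical):

```java
// Hypothetical pre-check, not part of GeMoMa.
class SearchResultColumnCheck {

    // Assumption: index 13 was accessed, so at least 14 columns are needed.
    static final int ASSUMED_MIN_COLUMNS = 14;

    // Number of tab-separated fields, keeping trailing empty fields.
    static int columnCount(String line) {
        return line.split("\t", -1).length;
    }

    static boolean hasEnoughColumns(String line) {
        return columnCount(line) >= ASSUMED_MIN_COLUMNS;
    }

    public static void main(String[] args) {
        // The hit from the error message above, rebuilt with tab separators.
        String line = String.join("\t", "gene-FBXW7_16", "scaffold_4", "0.979", "147", "1", "0",
                "1", "49", "45179652", "45179506", "9.783E-26", "103");
        System.out.println(columnCount(line) + " columns, enough: " + hasEnoughColumns(line));
        // prints "12 columns, enough: false"
    }
}
```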

here is the duck.blast format

Best,
Kun

Few peak overlaps between predicted TFBSs for TF1 and TF2

Hello,

Thank you for this great tool.

I have trained two models for TF1 and TF2 in cell line A and predicted TFBSs for the whole blood lineages, such as HSC, CMP, GMP, etc. In cell line A, I have 700+ co-binding peaks (TF1&TF2, IDR peaks). However, with the predicted TFBSs, I only have 10-100+ overlap peaks. I feel something is not right.

Here is what I did:

(-c) Conserved peak is IDR peaks.

(-r) Relaxed peak is the union of MACS2 call-peaks

(-a) The accessibility data is single-end ATAC-seq bam file.

(-m) The motif data is the PWM for TF1 and TF2, and several others including (the given order is the same in training and prediction):

Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1
intersect_all_relaxed_filtered_lslim3-model-2
intersect_all_relaxed_filtered_lslim3-model-5
intersect_all_relaxed_filtered_pwm-model-1
ENCSR000BHK_SP1-human_1_hg19-model-2
intersect_all_relaxed_filtered_lslim3-model-3
intersect_all_relaxed_filtered_lslim3-model-6
intersect_all_relaxed_filtered_lslim3-model-1
intersect_all_relaxed_filtered_lslim3-model-4
intersect_all_relaxed_filtered_lslim3-model-7

When I got Predictions.tsv.gz, I took all regions >0.5 and used bedtools merge to merge the 50bp windows and output the final "called peaks". I then used these "called peaks" for TF1 and TF2 and took the intersection. The result is very few overlaps.
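The thresholding and merging step described above can be sketched as follows for a single chromosome. The input layout (sorted window starts with one probability each) and all names are assumptions for illustration, not the Predictions.tsv.gz format. Like `bedtools merge` with default settings, the sketch joins overlapping and directly adjacent windows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "threshold, then merge windows".
class PeakMergeSketch {

    // starts must be sorted ascending; each window spans [start, start + width).
    // Returns merged regions as {start, end} pairs for windows with prob > threshold.
    static List<int[]> callPeaks(int[] starts, double[] probs, int width, double threshold) {
        List<int[]> peaks = new ArrayList<>();
        int[] current = null;
        for (int i = 0; i < starts.length; i++) {
            if (probs[i] <= threshold) {
                continue; // window below the cutoff
            }
            int start = starts[i];
            int end = starts[i] + width;
            if (current != null && start <= current[1]) {
                current[1] = Math.max(current[1], end); // extend the open region
            } else {
                current = new int[] { start, end };     // open a new region
                peaks.add(current);
            }
        }
        return peaks;
    }

    public static void main(String[] args) {
        int[] starts = { 0, 50, 100, 150, 200 };
        double[] probs = { 0.9, 0.7, 0.2, 0.6, 0.4 };
        for (int[] p : callPeaks(starts, probs, 50, 0.5)) {
            System.out.println(p[0] + "\t" + p[1]);
        }
    }
}
```

With a fixed cutoff, a single window just below 0.5 splits a region in two, so the number and extent of called peaks, and therefore the overlap between TF1 and TF2, depend strongly on the chosen threshold.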

Is there any suggestion for me to better tune the model?

Thanks,
Yichao

Did not finish as intended

Hey,
I am a new GeMoMa user and ran into a bug.
I downloaded the latest version (1.8) and tested the program with Athaliana genome and protein data.

java -jar /home/qymeng/biosoft/GeMoMa/GeMoMa-1.8.jar CLI GeMoMaPipeline threads=10 AnnotationFinalizer.r=NO p=false o=true t=GCF_000004255.2_v.1.0_genomic.fna.gz outdir=output/ s=pre-extracted c=/home/qymeng/Pan-genome/Annotation/Protein/Athaliana_447_Araport11.protein.fa >test.log 2>&1 &

I got the temp directory including filtered_predictions.gff and final_annotation.gff; however, the log file shows that the program did not finish. How can I solve this?
Best wishes,
Qingying
test.log

Can a ChIP-seq .bedgraph file (four columns) be used as input for 'labels' in Catchitt

a line of my file format:
chr2L 9180 9200 0.095
...

I was hoping I could use it (instead of narrow peak 10 column format), but I am getting 'U' for all the labels and so my bedgraph file doesn't seem to work.

This file format is all that is available from the paper--do you have any recommendations?

(I know this isn't an error question, but I really appreciate the help)

how to set the intron length, say the max length?

GeMoMa multiple reference species

Hello,
I am trying to annotate a genome using the GeMoMa pipeline with protein information from multiple reference species plus transcripts of my target species.

I ran the following code
java -Xmx180G -jar /cluster/software/gemoma/GeMoMa-1.6.4/GeMoMa-1.6.4.jar CLI GeMoMaPipeline
threads=60
outdir=/gemoma/
tblastn=false
GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO
o=true
t=/target_genome.fasta
i=spec1 a=/pathtogff/spec1.gff g=/pathtogenome/spec1.fna
i=spec2 a=/pathtogff/spec2.gff g=/pathtogenome/spec2.fna
i=spec3 a=/pathtogff/spec3.gff g=/pathtogenome/spec3.fna
r=MAPPED ERE.m=/target_transcriptomeReads_sort.bam

and obtain the following error message
....
Parameter a specified multiple times or not applicable for current selection(s). Could not use value(s):
[/pathtogff/spec1.gff, /pathtogff/spec2.gff, /pathtogff/spec2.gff]

It seems that using "a" multiple times is a problem. Do I need to additionally specify s=specXY or add pre-/suffixes to "a" and "g"? Any help on how to get the pipeline running with multiple references would be appreciated.

Problem when adding external evidence

Hello!

I've tried to run the whole GeMoMa pipeline a couple of times. I was able to run it using the evidence from one reference species (obtained with BRAKER). However, when I try to include some external GFF evidence (BRAKER output), something fails.
I looked through the issues but I still don't know what is happening.

Are there any restrictions on the ID names or something similar?

Here are the error, the command, and an example of the external GFF for my target species.

Error


Check RNA-seq data (introns): 31% of the sequences in the reference genome are covered.
Did not delete temporary files allowing to debug.

Exception in thread "main" java.lang.RuntimeException: Did not finish as intended. java.lang.NullPointerException: null
        at projects.gemoma.GeMoMaPipeline.run(GeMoMaPipeline.java:1038)
        at projects.gemoma.GeMoMaModule.run(GeMoMaModule.java:94)
        at de.jstacs.tools.ui.cli.CLI.run(CLI.java:426)
        at projects.gemoma.GeMoMa.main(GeMoMa.java:399)

Command

GeMoMa GeMoMaPipeline  \
        threads=$ncpu outdir=$outdir \
        tblastn=false \
        restart=true \
        -Xms200G -Xmx400G \
        GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true \
        t=$target_genome \
        p=true pc=true pgr=true \
        s=own i=Dtil a=$ref_annot g=$ref_genome \
        r=MAPPED ERE.m=$mapped_reads \
        ID=DcatBRp24 e=$ext_annot
date

External GFF

ctg620  AUGUSTUS        gene    1806    11524   .       +       .       ID=jg39146;
ctg620  AUGUSTUS        mRNA    1806    11524   .       +       .       ID=jg39146.t1;Parent=jg39146;
ctg620  AUGUSTUS        CDS     1806    2001    0.75    +       2       ID=jg39146.t1.CDS1;Parent=jg39146.t1;
ctg620  AUGUSTUS        exon    1806    2001    .       +       .       ID=jg39146.t1.exon1;Parent=jg39146.t1;
ctg620  AUGUSTUS        intron  2002    2339    .       +       .       ID=jg39146.t1.intron1;Parent=jg39146.t1;
ctg620  AUGUSTUS        CDS     2340    3849    0.77    +       1       ID=jg39146.t1.CDS2;Parent=jg39146.t1;
ctg620  AUGUSTUS        exon    2340    3849    .       +       .       ID=jg39146.t1.exon2;Parent=jg39146.t1;
ctg620  AUGUSTUS        intron  3850    11428   .       +       .       ID=jg39146.t1.intron2;Parent=jg39146.t1;
ctg620  AUGUSTUS        CDS     11429   11524   0.53    +       0       ID=jg39146.t1.CDS3;Parent=jg39146.t1;
ctg620  AUGUSTUS        exon    11429   11524   .       +       .       ID=jg39146.t1.exon3;Parent=jg39146.t1;
ctg620  AUGUSTUS        stop_codon      11522   11524   .       +       0       ID=jg39146.t1.stop1;Parent=jg39146.t1;
ctg1420 AUGUSTUS        gene    1625453 1627745 .       -       .       ID=jg38264;
ctg1420 AUGUSTUS        mRNA    1625453 1627745 .       -       .       ID=jg38264.t1;Parent=jg38264;
ctg1420 AUGUSTUS        stop_codon      1625453 1625455 .       -       0       ID=jg38264.t1.stop1;Parent=jg38264.t1;
ctg1420 AUGUSTUS        CDS     1625453 1625999 0.96    -       1       ID=jg38264.t1.CDS1;Parent=jg38264.t1;
ctg1420 AUGUSTUS        exon    1625453 1625999 .       -       .       ID=jg38264.t1.exon1;Parent=jg38264.t1;
ctg1420 AUGUSTUS        intron  1626000 1627731 .       -       .       ID=jg38264.t1.intron1;Parent=jg38264.t1;
ctg1420 AUGUSTUS        CDS     1627732 1627745 0.2     -       0       ID=jg38264.t1.CDS2;Parent=jg38264.t1;
ctg1420 AUGUSTUS        exon    1627732 1627745 .       -       .       ID=jg38264.t1.exon2;Parent=jg38264.t1;
ctg1420 AUGUSTUS        start_codon     1627743 1627745 .       -       0       ID=jg38264.t1.start1;Parent=jg38264.t1;
Scaffold_2      AUGUSTUS        gene    747379997       747380596       .       -       .       ID=jg51762;
Scaffold_2      AUGUSTUS        mRNA    747379997       747380596       .       -       .       ID=jg51762.t1;Parent=jg51762;
Scaffold_2      AUGUSTUS        stop_codon      747379997       747379999       .       -       0       ID=jg51762.t1.stop1;Parent=jg51762.t1;
Scaffold_2      AUGUSTUS        CDS     747379997       747380596       1       -       0       ID=jg51762.t1.CDS1;Parent=jg51762.t1;
Scaffold_2      AUGUSTUS        exon    747379997       747380596       .       -       .       ID=jg51762.t1.exon1;Parent=jg51762.t1;
Scaffold_2      AUGUSTUS        start_codon     747380594       747380596       .       -       0       ID=jg51762.t1.start1;Parent=jg51762.t1;
Scaffold_5      AUGUSTUS        gene    244344569       244361688       .       -       .       ID=jg4123;
Scaffold_5      AUGUSTUS        mRNA    244344569       244361688       .       -       .       ID=jg4123.t1;Parent=jg4123;
Scaffold_5      AUGUSTUS        stop_codon      244344569       244344571       .       -       0       ID=jg4123.t1.stop1;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        CDS     244344569       244344708       0.93    -       2       ID=jg4123.t1.CDS1;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        exon    244344569       244344708       .       -       .       ID=jg4123.t1.exon1;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        intron  244344709       244346581       .       -       .       ID=jg4123.t1.intron1;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        CDS     244346582       244346629       1       -       2       ID=jg4123.t1.CDS2;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        exon    244346582       244346629       .       -       .       ID=jg4123.t1.exon2;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        intron  244346630       244355368       .       -       .       ID=jg4123.t1.intron2;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        CDS     244355369       244355416       1       -       2       ID=jg4123.t1.CDS3;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        exon    244355369       244355416       .       -       .       ID=jg4123.t1.exon3;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        intron  244355417       244361513       .       -       .       ID=jg4123.t1.intron3;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        CDS     244361514       244361573       1       -       2       ID=jg4123.t1.CDS4;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        exon    244361514       244361573       .       -       .       ID=jg4123.t1.exon4;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        intron  244361574       244361657       .       -       .       ID=jg4123.t1.intron4;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        CDS     244361658       244361688       0.98    -       0       ID=jg4123.t1.CDS5;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        exon    244361658       244361688       .       -       .       ID=jg4123.t1.exon5;Parent=jg4123.t1;
Scaffold_5      AUGUSTUS        start_codon     244361686       244361688       .       -       0       ID=jg4123.t1.start1;Parent=jg4123.t1;

Any suggestions are welcome :)
Thanks in advance,
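
Before looking for restrictions on ID names, it can help to rule out structural problems in the external GFF itself. The following is a minimal sketch, not a GeMoMa tool: it checks that every line has nine tab-separated columns, that `ID` values are not repeated, and that every `Parent` refers to an `ID` present in the file. The function names are hypothetical. Note that GFF3 allows a repeated `ID` for multi-line features such as CDS, so a reported duplicate is a hint, not necessarily an error.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column ('ID=a;Parent=b;') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_gff(lines):
    """Return a list of human-readable problems found in GFF3 lines."""
    ids = Counter()
    parents = []
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            problems.append(
                f"line {number}: expected 9 tab-separated columns, found {len(columns)}"
            )
            continue
        attrs = parse_attributes(columns[8])
        if "ID" in attrs:
            ids[attrs["ID"]] += 1
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.append((number, parent))
    problems += [f"duplicate ID {i!r} ({n} times)" for i, n in ids.items() if n > 1]
    problems += [
        f"line {n}: Parent {p!r} has no matching ID" for n, p in parents if p not in ids
    ]
    return problems


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for problem in check_gff(handle):
                print(problem)
```

Saved under any file name, the script takes the path to the external GFF as its only argument; no output means none of the three checks failed.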

Update GeMoMa bioconda package

The current GeMoMa bioconda package is a bit outdated (version 1.6.4). Could this be updated when convenient? Thanks!

time-out warning

Hi Jens,

I am now running the GeMoMa pipeline with 4 reference species and RNA-seq data as follows:

java -Xmx5G -Xms5G -Xmx70G  -jar /home/s/schradel/software/gemoma-1.8/GeMoMa-1.8.jar CLI GeMoMaPipeline threads=10 outdir=gem GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO Extractor.r=true o=true t=assembly.fasta \
$refs r=MAPPED ERE.m=s1.bam ERE.m=s2.bam

After a couple of hours of runtime, the process gets stuck and does not progress. Checking the output and log files, the only thing I could find was that file GeMoMa_temp/GeMoMaPipeline-10149417817013514234/3/3/protocol_GeMoMa.txt ends with:

time-out warning: rna-XM_044896866.1, rna-XM_044896869.1

Similarly, the protocol file for split 7 ends with

time-out warning: rna-XM_044906410.1

This matches the GeMoMa output for that run (see screenshot), so apparently these three RNAs are causing trouble. Can you think of anything that might explain this?

Best
Lukas
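
To see at a glance which transcripts are affected across all splits, the protocol files can be scanned for such lines. This is a small sketch that assumes the warning always has the form `time-out warning: id1, id2` on one line, as in the two excerpts above, and that every split writes a file named `protocol_GeMoMa.txt`; the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "time-out warning: rna-XM_044896866.1, rna-XM_044896869.1"
WARNING = re.compile(r"^time-out warning:\s*(.+)$")


def timed_out_ids(text):
    """Return all transcript IDs listed in time-out warnings within text."""
    ids = []
    for line in text.splitlines():
        match = WARNING.match(line.strip())
        if match:
            ids += [name.strip() for name in match.group(1).split(",") if name.strip()]
    return ids


def scan(temp_dir):
    """Map each protocol_GeMoMa.txt below temp_dir to the IDs it reports."""
    report = {}
    for path in sorted(Path(temp_dir).rglob("protocol_GeMoMa.txt")):
        ids = timed_out_ids(path.read_text(errors="replace"))
        if ids:
            report[str(path)] = ids
    return report


if __name__ == "__main__":
    for path, ids in scan("GeMoMa_temp").items():
        print(path, ", ".join(ids))
```

The collected IDs can then be used to pull the corresponding reference transcripts and inspect them separately.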

GAF with multiple species and weighting

Hi,

I'm trying to use annotations from multiple species with GeMoMa, and I would like to give them different weights in GAF, but I'm not sure I fully understand how to do this.

My command so far (with default w) is:
java -jar $jar CLI GAF p=TEN g=TEN/predicted_annotation.gff p=LOX g=LOX/predicted_annotation.gff outdir=combined1

How do I give the p=LOX annotations a higher weight than the p=TEN annotations?
I tried this, but it looks like it's doing the opposite of what I want?
java -jar $jar CLI GAF p=TEN g=TEN/predicted_annotation.gff p=LOX g=LOX/predicted_annotation.gff w=50 outdir=combined1

Thanks,
Marta

Could not open GeMoMa_temp/GeMoMaPipeline-9364982901453846136/mmseqsdb_h.index.5 for writing!

Thanks for creating GeMoMa. I've been using GeMoMa for a long time, but my university updated their hardware and we need to reinstall GeMoMa. There was no EasyBuild (new system they want us to use) to install jstacs or GeMoMa, so I decided to install Miniconda and install GeMoMa-1.9 as a conda environment. It seemed to work just fine, but I couldn't get it to run. After reading through the issues, I tried running chmod -R 777 on the conda environment and it had an effect. Now when I run GeMoMa, I get a new error:

mmseqs: Could not open GeMoMa_temp/GeMoMaPipeline-9364982901453846136/mmseqsdb_h.index.5 for writing!
java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2109)
        at java.base/java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1454)
        at projects.gemoma.GeMoMaPipeline$1.run(GeMoMaPipeline.java:609)
        at projects.gemoma.GeMoMaPipeline$FlaggedRunnable.run(GeMoMaPipeline.java:1409)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
No gene model was extracted from the references.
8 jobs did not finish as expected. Please check the output carefully.
Did not delete temporary files allowing to debug.

When I cd into GeMoMaPipeline-9364982901453846136, these files exist:
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.0
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.1
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.2
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.3
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.4
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.5
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.index.0
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.index.1
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.index.2
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.index.3
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb_h.index.4
-rw-rw-r-- 1 sjfleck grp-vaalbert 0 Oct 27 13:27 mmseqsdb.source
-rw-rw-r-- 1 sjfleck grp-vaalbert 206561 Oct 27 13:27 parameters.xml

Do you have any ideas as to why this error is happening? Why can it create mmseqsdb_h.index.4, but not mmseqsdb_h.index.5? Thanks,
Steve

Why does GeMoMa AnnotationFinalizer remove the exon features?

Dear Jens,

I have another question for you.

I used TEBRA to combine the BRAKER1 and BRAKER2 annotations and then used the GeMoMa AnnotationFinalizer to add UTRs.

This is my original GFF file:

HiC_scaffold_1 AUGUSTUS gene 3484 5786 . - . ID=Bpus_H_g1
HiC_scaffold_1 AUGUSTUS mRNA 3484 5786 . - . ID=Bpus_H_g1.t1;Parent=Bpus_H_g1
HiC_scaffold_1 AUGUSTUS exon 3484 3726 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 4045 4137 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 4373 4502 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 4875 4934 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 5175 5400 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS exon 5582 5786 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 3484 3726 . - 0 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4045 4137 . - 0 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4373 4502 . - 1 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4875 4934 . - 1 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 5175 5400 . - 2 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 5582 5786 . - 0 Parent=Bpus_H_g1.t1

This is the GFF after running AnnotationFinalizer:

HiC_scaffold_1 AUGUSTUS gene 2124 10618 . - . ID=Bpus_H_g1
HiC_scaffold_1 AUGUSTUS mRNA 2124 10618 . - . ID=Bpus_H_g1.t1;Parent=Bpus_H_g1
HiC_scaffold_1 AnnotationFinalizer_ five_prime_UTR 9293 10618 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AnnotationFinalizer_ five_prime_UTR 5787 8846 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 5582 5786 . - 0 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 5175 5400 . - 2 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4875 4934 . - 1 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4373 4502 . - 1 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 4045 4137 . - 0 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AUGUSTUS CDS 3484 3726 . - 0 Parent=Bpus_H_g1.t1
HiC_scaffold_1 AnnotationFinalizer_ three_prime_UTR 3471 3483 . - . Parent=Bpus_H_g1.t1
HiC_scaffold_1 AnnotationFinalizer_ three_prime_UTR 2124 3139 . - . Parent=Bpus_H_g1.t1

I noticed that after running AnnotationFinalizer, my GFF file lost its exon features, which leaves gffread unable to include the UTRs when extracting transcripts.fa.

Do you know why the exon features are lost?

Thanks a lot!

Cheers,
Yuling
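
Independent of why the exon lines are dropped, exon features can be rebuilt from the output as a post-processing step, since every exon is the union of the CDS and UTR parts that overlap or touch each other. The sketch below is a workaround under that assumption, not something AnnotationFinalizer does itself; the function names are hypothetical, and the coordinates are those of `Bpus_H_g1.t1` in the output above.

```python
def merge_to_exons(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals.

    Coordinates are 1-based and inclusive, as in GFF.
    """
    exons = []
    for start, end in sorted(intervals):
        if exons and start <= exons[-1][1] + 1:
            exons[-1] = (exons[-1][0], max(exons[-1][1], end))
        else:
            exons.append((start, end))
    return exons


def exon_lines(seqid, source, strand, transcript_id, intervals):
    """Format the merged intervals as tab-separated GFF exon lines."""
    return [
        "\t".join(
            [seqid, source, "exon", str(start), str(end), ".", strand, ".",
             f"Parent={transcript_id}"]
        )
        for start, end in merge_to_exons(intervals)
    ]


# CDS and UTR coordinates of Bpus_H_g1.t1 after AnnotationFinalizer.
parts = [
    (9293, 10618), (5787, 8846),                      # five_prime_UTR
    (5582, 5786), (5175, 5400), (4875, 4934),
    (4373, 4502), (4045, 4137), (3484, 3726),         # CDS
    (3471, 3483), (2124, 3139),                       # three_prime_UTR
]
for line in exon_lines("HiC_scaffold_1", "AUGUSTUS", "-", "Bpus_H_g1.t1", parts):
    print(line)
```

For this transcript the merge yields eight exons; for example, the CDS 5582-5786 and the adjacent UTR 5787-8846 become one exon 5582-8846.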
