BRAKER User Guide

News

Here is a recording of the first BGA23 workshop session on BRAKER. If learning by watching videos is easy for you, consider watching that: https://www.youtube.com/watch?v=UXTkJ4mUkyg

BRAKER3 is now in https://usegalaxy.eu/

Contacts for Repository

TSEBRA & BRAKER3 related:

BRAKER & AUGUSTUS related:

  • Katharina J. Hoff, University of Greifswald, Germany, [email protected], +49 3834 420 4624, https://twitter.com/katharina_hoff

GeneMark related:

Core Authors of BRAKER

[a] University of Greifswald, Institute for Mathematics and Computer Science, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany

[b] University of Greifswald, Center for Functional Genomics of Microbes, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany

[c] Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical Engineering, 30332 Atlanta, USA

[d] School of Computational Science and Engineering, 30332 Atlanta, USA

[e] Moscow Institute of Physics and Technology, Moscow Region 141701, Dolgoprudny, Russia

Figure 1: Current BRAKER authors, from left to right: Mario Stanke, Alexandre Lomsadze, Katharina J. Hoff, Tomas Bruna, Lars Gabriel, and Mark Borodovsky. We acknowledge that a larger community of scientists contributed to the BRAKER code (e.g. via pull requests).

Funding

The development of BRAKER1, BRAKER2, and BRAKER3 was supported by the National Institutes of Health (NIH) [GM128145 to M.B. and M.S.]. Development of BRAKER3 was partially funded by Project Data Competency granted to K.J.H. and M.S. by the government of Mecklenburg-Vorpommern, Germany.

Related Software

The Transcript Selector for BRAKER (TSEBRA) is available at https://github.com/Gaius-Augustus/TSEBRA .

GeneMark-ETP, one of the gene finders at the core of BRAKER, is available at https://github.com/gatech-genemark/GeneMark-ETP .

AUGUSTUS, the second gene finder at the core of BRAKER, is available at https://github.com/Gaius-Augustus/Augustus .

GALBA, a BRAKER pipeline spin-off for using Miniprot or GenomeThreader to generate training genes, is available at https://github.com/Gaius-Augustus/GALBA .

What is BRAKER?

The rapidly growing number of sequenced genomes requires fully automated methods for accurate gene structure annotation. With this goal in mind, we have developed BRAKER1 R1, R0, a combination of GeneMark-ET R2 and AUGUSTUS R3, R4, that uses genomic and RNA-Seq data to automatically generate full gene structure annotations in novel genomes.

However, the quality of RNA-Seq data that is available for annotating a novel genome is variable, and in some cases, RNA-Seq data is not available, at all.

BRAKER2 is an extension of BRAKER1 which allows for fully automated training of the gene prediction tools GeneMark-ES/ET/EP/ETP R14, R15, R17, F1 and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction.

In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data.

BRAKER3 is the latest pipeline in the BRAKER suite. It enables the usage of RNA-seq and protein data in a fully automated pipeline to train and predict highly reliable genes with GeneMark-ETP and AUGUSTUS. The result of the pipeline is the combined gene set of both gene prediction tools, which only contains genes with very high support from extrinsic evidence.

In this user guide, we will refer to BRAKER1, BRAKER2, and BRAKER3 simply as BRAKER because they are executed by the same script (braker.pl).

Keys to successful gene prediction

  • Use a high quality genome assembly. If you have a huge number of very short scaffolds in your genome assembly, those short scaffolds will likely increase runtime dramatically but will not increase prediction accuracy.

  • Use simple scaffold names in the genome file (e.g. >contig1 will work better than >contig1my custom species namesome putative function /more/information/  and lots of special characters %&!*(){}). Make the scaffold names in all your fasta files simple before running any alignment program.

  • In order to predict genes accurately in a novel genome, the genome should be masked for repeats. This will avoid the prediction of false positive gene structures in repetitive and low complexity regions. Repeat masking is also essential for mapping RNA-Seq data to a genome with some tools (other RNA-Seq mappers, such as HISAT2, ignore masking information). In the case of GeneMark-ES/ET/EP/ETP and AUGUSTUS, softmasking (i.e. putting repeat regions into lower case letters and all other regions into upper case letters) leads to better results than hardmasking (i.e. replacing letters in repetitive regions by the letter N for unknown nucleotide).

  • Many genomes have gene structures that will be predicted accurately with standard parameters of GeneMark-ES/ET/EP/ETP and AUGUSTUS within BRAKER. However, some genomes have clade-specific features, e.g. a special branch point model in fungi, or non-standard splice-site patterns. Please read the options section in order to determine whether any of the custom options may improve gene prediction accuracy in the genome of your target species.

  • Always check gene prediction results before further usage! You can e.g. use a genome browser for visual inspection of gene models in context with extrinsic evidence data. BRAKER supports the generation of track data hubs for the UCSC Genome Browser with MakeHub for this purpose.
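Two of the points above (simple scaffold names and softmasking) can be checked before a run. The following is a minimal sketch, not part of BRAKER: the helper functions `simplify_fasta_headers` and `softmasked_fraction` and all file names are hypothetical, and the tiny demo FASTA merely stands in for your genome file. Headers are cut at the first whitespace character, which assumes that the first token of each header is already unique.

```shell
# Create a tiny demo FASTA (stand-in for your genome file).
genome_dir=$(mktemp -d)
cat > "$genome_dir/genome.fa" <<'EOF'
>contig1 my custom species name /more/information/
ACGTacgtNNNN
>contig2 some putative function
acgtACGTACGT
EOF

# Keep only the first whitespace-delimited token of each header line.
simplify_fasta_headers() {
    sed -e '/^>/ s/[[:space:]].*$//' "$1"
}

# Fraction of lower-case (softmasked) characters among all sequence characters.
softmasked_fraction() {
    awk '!/^>/ { total += length($0); masked += gsub(/[acgtn]/, "") }
         END   { if (total > 0) printf "%.2f\n", masked / total }' "$1"
}

simplify_fasta_headers "$genome_dir/genome.fa" > "$genome_dir/genome.simple.fa"
grep '^>' "$genome_dir/genome.simple.fa"          # header lines after simplification
softmasked_fraction "$genome_dir/genome.simple.fa"
```

A softmasked fraction of zero suggests that the genome has not been repeat masked (or has been hardmasked) and should be checked before training.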

Overview of modes for running BRAKER

BRAKER mainly features semi-unsupervised, extrinsic evidence data (RNA-Seq and/or protein spliced alignment information) supported training of GeneMark-ES/ET/EP/ETP[F1] and subsequent training of AUGUSTUS with integration of extrinsic evidence in the final gene prediction step. However, there are now a number of additional pipelines included in BRAKER. In the following, we give an overview of possible input files and pipelines:

  • Genome file, only. In this mode, GeneMark-ES is trained on the genome sequence, alone. Long genes predicted by GeneMark-ES are selected for training AUGUSTUS. Final predictions by AUGUSTUS are ab initio. This approach will likely yield lower prediction accuracy than all other pipelines described here (see Figure 2).

Figure 2: BRAKER pipeline A: training GeneMark-ES on genome data, only; ab initio gene prediction with AUGUSTUS

  • Genome and RNA-Seq file from the same species (see Figure 3). This approach is suitable for short read RNA-Seq libraries with a good coverage of the transcriptome. Important: this approach requires that each intron is covered by many alignments, i.e. it does not work with assembled transcriptome mappings. In principle, alignments of long read RNA-Seq data may also provide sufficient data for running BRAKER, but only if each transcript that will go into training was sequenced and aligned to the genome multiple times. Please be aware that BRAKER does not yet officially support the integration of long read RNA-Seq data.

Figure 3: BRAKER pipeline B: training GeneMark-ET supported by RNA-Seq spliced alignment information, prediction with AUGUSTUS with that same spliced alignment information.

  • Genome file and database of proteins that may be of unknown evolutionary distance to the target species (see Figure 4); this approach is particularly suitable if no RNA-Seq data is available. This method will work better with proteins from species that are rather close to the target species, but accuracy will drop only very little if the reference proteins are more distant from the target species. Important: This approach requires a database of protein families, i.e. many representatives of each protein family must be present in the database. BRAKER has been tested successfully with OrthoDB R19. The ProtHint R18 protein mapping pipeline for generating required hints for BRAKER is available for download at https://github.com/gatech-genemark/ProtHint; software for preparing the OrthoDB input proteins is available at https://github.com/tomasbruna/orthodb-clades. You may add proteins of a closely related species to the OrthoDB fasta file in order to incorporate additional evidence into gene prediction. We provide pre-partitioned OrthoDB v.11 clades for download at https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/ .

Figure 4: BRAKER pipeline C: training GeneMark-EP+ on protein spliced alignment, start and stop information, prediction with AUGUSTUS with that same information, in addition chained CDSpart hints. Proteins used here can be of any evolutionary distance to the target organism.

  • Genome file and RNA-Seq set(s) from the same species, and proteins that may be of unknown evolutionary distance to the target species (see Figure 5). Important: this approach requires a database of protein families, i.e. many representatives of each protein family must be present in the database; OrthoDB, for example, is suitable. (You may add proteins of a closely related species to the OrthoDB fasta file in order to incorporate additional evidence into gene prediction.)

Figure 5: BRAKER pipeline D: If necessary, download and alignment of RNA-Seq sets for the target species. Training of GeneMark-ETP supported by the RNA-Seq alignments and a large protein database (proteins can be of any evolutionary distance). Subsequently, AUGUSTUS training and prediction using the same extrinsic information together with the GeneMark-ETP results. The final prediction is the TSEBRA combination of the AUGUSTUS and GeneMark-ETP results.
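The mapping from available input files to pipelines A to D can be sketched as a small shell helper. This is an illustration only: `suggest_braker_command` is a hypothetical function that prints a command line without running it, and the options `--genome`, `--bam`, `--prot_seq` and `--esmode` are taken from the braker.pl usage information rather than from this overview, so please verify them with `braker.pl --help` for your installed version.

```shell
# Print (do not run) a braker.pl call matching the available evidence.
# Arguments: genome FASTA, BAM file (may be empty), protein FASTA (may be empty).
suggest_braker_command() {
    genome=$1
    bam=$2
    proteins=$3
    cmd="braker.pl --genome=$genome"
    if [ -n "$bam" ] && [ -n "$proteins" ]; then
        cmd="$cmd --bam=$bam --prot_seq=$proteins"   # pipeline D: RNA-Seq and proteins
    elif [ -n "$bam" ]; then
        cmd="$cmd --bam=$bam"                        # pipeline B: RNA-Seq only
    elif [ -n "$proteins" ]; then
        cmd="$cmd --prot_seq=$proteins"              # pipeline C: proteins only
    else
        cmd="$cmd --esmode"                          # pipeline A: genome only
    fi
    printf '%s\n' "$cmd"
}

suggest_braker_command genome.fa "" ""
suggest_braker_command genome.fa rnaseq.bam ""
suggest_braker_command genome.fa "" proteins.fa
suggest_braker_command genome.fa rnaseq.bam proteins.fa
```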

Container

We are aware that the "manual" installation of BRAKER3 and all its dependencies is tedious and really challenging without root permissions. Therefore, we provide a Docker container that has been developed to be run with Singularity. All information on this container can be found at https://hub.docker.com/r/teambraker/braker3

In short, build it as follows:

singularity build braker3.sif docker://teambraker/braker3:latest

Execute with:

singularity exec braker3.sif braker.pl

Test with:

singularity exec -B $PWD:$PWD braker3.sif cp /opt/BRAKER/example/singularity-tests/test1.sh .
singularity exec -B $PWD:$PWD braker3.sif cp /opt/BRAKER/example/singularity-tests/test2.sh .
singularity exec -B $PWD:$PWD braker3.sif cp /opt/BRAKER/example/singularity-tests/test3.sh .
export BRAKER_SIF=/your/path/to/braker3.sif # may need to modify
bash test1.sh
bash test2.sh
bash test3.sh
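The test scripts rely on the BRAKER_SIF variable exported above. A minimal pre-flight check, assuming nothing beyond that variable, could look as follows; `check_braker_sif` is a hypothetical helper and not part of the container.

```shell
# Verify that BRAKER_SIF is set and points to an existing file.
check_braker_sif() {
    if [ -z "${BRAKER_SIF:-}" ]; then
        echo "BRAKER_SIF is not set"
        return 1
    fi
    if [ ! -f "$BRAKER_SIF" ]; then
        echo "BRAKER_SIF does not point to a file: $BRAKER_SIF"
        return 1
    fi
    echo "OK: $BRAKER_SIF"
}

check_braker_sif || echo "Fix BRAKER_SIF before running test1.sh, test2.sh and test3.sh."
```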

Few users want to run their analysis inside Docker (since root permissions are required). However, if that's your goal, you can run and test the container as follows:

sudo docker run --user 1000:100 --rm -it teambraker/braker3:latest bash
bash /opt/BRAKER/example/docker-tests/test1.sh # BRAKER1
bash /opt/BRAKER/example/docker-tests/test2.sh # BRAKER2
bash /opt/BRAKER/example/docker-tests/test3.sh # BRAKER3

⚠️ The container does not include Java/GUSHR/anything UTR related because we are currently not maintaining UTR prediction with BRAKER. It's buggy and unstable. Do not use it.

⚠️ Users have reported that you need to manually copy the AUGUSTUS_CONFIG_PATH contents to a writable location before running our containers from Nextflow. Afterwards, you need to specify the writable AUGUSTUS_CONFIG_PATH as command line argument to BRAKER in Nextflow.

Good luck ;-)

Installation

⚠️ Warning: If you previously used BRAKER1 and/or BRAKER2, please be aware that the usage changed in several aspects. Also, older GeneMark versions that linger in your $PATH variable might lead to unforeseen interferences, causing program failures. Please move all older GeneMark versions out of your $PATH (also e.g. the GeneMark in ProtHint/dependencies).

Supported software versions

At the time of release, this BRAKER version was tested with:

  • AUGUSTUS 3.5.0 F2

  • GeneMark-ETP (source see Dockerfile)

  • BAMTOOLS 2.5.1 R5

  • SAMTOOLS 1.7-4-g93586ed R6

  • Spaln 2.3.3d R8, R9, R10

  • NCBI BLAST+ 2.2.31+ R12, R13

  • DIAMOND 0.9.24

  • cdbfasta 0.99

  • cdbyank 0.981

  • GUSHR 1.0.0

  • SRA Toolkit 3.00 R14

  • HISAT2 2.2.1 R15

  • BEDTOOLS 2.30 R16

  • StringTie2 2.2.1 R17

  • GFFRead 0.12.7 R18

  • compleasm 0.2.5 R27

BRAKER

Perl pipeline dependencies

Running BRAKER requires a Linux-system with bash and Perl. Furthermore, BRAKER requires the following CPAN-Perl modules to be installed:

  • File::Spec::Functions

  • Hash::Merge

  • List::Util

  • MCE::Mutex

  • Module::Load::Conditional

  • Parallel::ForkManager

  • POSIX

  • Scalar::Util::Numeric

  • YAML

  • Math::Utils

  • File::HomeDir

For GeneMark-ETP, used when protein and RNA-Seq is supplied:

  • YAML::XS
  • Data::Dumper
  • Thread::Queue
  • threads

On Ubuntu, for example, install the modules with CPANminus F4: sudo cpanm Module::Name, e.g. sudo cpanm Hash::Merge.
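To find out which of the modules listed above still need to be installed, you can ask Perl to load each one. The following is a minimal sketch; `missing_perl_modules` is a hypothetical helper, and it only tests whether a module can be loaded by the perl found first in your $PATH.

```shell
# Print every module from the argument list that perl cannot load.
missing_perl_modules() {
    for module in "$@"; do
        perl -M"$module" -e 1 >/dev/null 2>&1 || printf '%s\n' "$module"
    done
}

# Modules required by BRAKER itself.
missing_perl_modules File::Spec::Functions Hash::Merge List::Util MCE::Mutex \
    Module::Load::Conditional Parallel::ForkManager POSIX \
    Scalar::Util::Numeric YAML Math::Utils File::HomeDir

# Additional modules for GeneMark-ETP.
missing_perl_modules YAML::XS Data::Dumper Thread::Queue threads
```

Each name that is printed can then be installed, e.g. with cpanm as described above.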

BRAKER also uses a Perl module helpMod_braker.pm that is not available on CPAN. This module is part of the BRAKER release and does not require separate installation.

If you do not have root permissions on the Linux machine, try setting up an Anaconda (https://www.anaconda.com/distribution/) environment as follows:

wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh
bash Anaconda3-2018.12-Linux-x86_64.sh # do not install VS (needs root privileges)
conda install -c anaconda perl
conda install -c anaconda biopython
conda install -c bioconda perl-app-cpanminus
conda install -c bioconda perl-file-spec
conda install -c bioconda perl-hash-merge
conda install -c bioconda perl-list-util
conda install -c bioconda perl-module-load-conditional
conda install -c bioconda perl-posix
conda install -c bioconda perl-file-homedir
conda install -c bioconda perl-parallel-forkmanager
conda install -c bioconda perl-scalar-util-numeric
conda install -c bioconda perl-yaml
conda install -c bioconda perl-class-data-inheritable
conda install -c bioconda perl-exception-class
conda install -c bioconda perl-test-pod
conda install -c bioconda perl-file-which # skip if you are not comparing to reference annotation
conda install -c bioconda perl-mce
conda install -c bioconda perl-threaded
conda install -c bioconda perl-math-utils
conda install -c bioconda cdbtools
conda install -c eumetsat perl-yaml-xs
conda install -c bioconda perl-data-dumper

Subsequently install BRAKER and other software "as usual" while being in your conda environment. Note: There is a bioconda braker package, and a bioconda augustus package. They work. But they are usually lagging behind the development code of both tools on GitHub. We therefore recommend manual installation and usage of the latest sources.

BRAKER components

BRAKER is a collection of Perl and Python scripts and a Perl module. The main script that will be called in order to run BRAKER is braker.pl. Additional Perl and Python components are:

  • align2hints.pl

  • filterGenemark.pl

  • filterIntronsFindStrand.pl

  • startAlign.pl

  • helpMod_braker.pm

  • findGenesInIntrons.pl

  • downsample_traingenes.pl

  • ensure_n_training_genes.py

  • get_gc_content.py

  • get_etp_hints.py

All scripts (files ending with *.pl and *.py) that are part of BRAKER must be executable in order to run BRAKER. This should already be the case if you download BRAKER from GitHub. Executability may be overwritten if you e.g. transfer BRAKER on a USB-stick to another computer. In order to check whether required files are executable, run the following command in the directory that contains BRAKER Perl scripts:

ls -l *.pl *.py

The output should be similar to this:

    -rwxr-xr-x 1 katharina katharina  18191 Mai  7 10:25 align2hints.pl
    -rwxr-xr-x 1 katharina katharina   6090 Feb 19 09:35 braker_cleanup.pl
    -rwxr-xr-x 1 katharina katharina 408782 Aug 17 18:24 braker.pl
    -rwxr-xr-x 1 katharina katharina   5024 Mai  7 10:25 downsample_traingenes.pl
    -rwxr-xr-x 1 katharina katharina   5024 Mai  7 10:23 ensure_n_training_genes.py
    -rwxr-xr-x 1 katharina katharina   4542 Apr  3  2019 filter_augustus_gff.pl
    -rwxr-xr-x 1 katharina katharina  30453 Mai  7 10:25 filterGenemark.pl
    -rwxr-xr-x 1 katharina katharina   5754 Mai  7 10:25 filterIntronsFindStrand.pl
    -rwxr-xr-x 1 katharina katharina   7765 Mai  7 10:25 findGenesInIntrons.pl
    -rwxr-xr-x 1 katharina katharina   1664 Feb 12  2019 gatech_pmp2hints.pl
    -rwxr-xr-x 1 katharina katharina   2250 Jan  9 13:55 log_reg_prothints.pl
    -rwxr-xr-x 1 katharina katharina   4679 Jan  9 13:55 merge_transcript_sets.pl
    -rwxr-xr-x 1 katharina katharina  41674 Mai  7 10:25 startAlign.pl

It is important that the x in -rwxr-xr-x is present for each script. If that is not the case, run

    chmod a+x *.pl *.py

in order to change file attributes.
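Instead of reading the ls output by eye, the check can be scripted. This is a minimal sketch with a hypothetical helper `list_non_executable_scripts`; the temporary directory and its two empty files only stand in for your BRAKER scripts directory.

```shell
# Print the names of all *.pl and *.py files in a directory that lack execute permission.
list_non_executable_scripts() {
    dir=$1
    for script in "$dir"/*.pl "$dir"/*.py; do
        [ -e "$script" ] || continue                      # glob pattern matched nothing
        [ -x "$script" ] || printf '%s\n' "${script##*/}"
    done
}

# Demo: one executable and one non-executable script.
scripts_dir=$(mktemp -d)
touch "$scripts_dir/braker.pl" "$scripts_dir/get_gc_content.py"
chmod a+x "$scripts_dir/braker.pl"
list_non_executable_scripts "$scripts_dir"   # prints: get_gc_content.py
```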

You may find it helpful to add the directory in which BRAKER perl scripts reside to your $PATH environment variable. For a single bash session, enter:

    PATH=/your_path_to_braker/:$PATH
    export PATH

To make this $PATH modification available to all bash sessions, add the above lines to a startup script (e.g. ~/.bashrc).

Bioinformatics software dependencies

BRAKER calls upon various bioinformatics software tools that are not part of BRAKER. Some tools are obligatory, i.e. BRAKER will not run at all if these tools are not present on your system. Other tools are optional. Please install all tools that are required for running BRAKER in the mode of your choice.

Mandatory tools

GeneMark-ETP

Download GeneMark-ETP F1 from http://github.com/gatech-genemark/GeneMark-ETP or https://topaz.gatech.edu/GeneMark/etp.for_braker.tar.gz. Unpack and install GeneMark-ETP as described in GeneMark-ETP’s README file.

If already contained in your $PATH variable, BRAKER will guess the location of gmes_petap.pl or gmetp.pl automatically. Otherwise, BRAKER can find GeneMark-ES/ET/EP/ETP executables either by locating them in an environment variable GENEMARK_PATH, or by taking a command line argument (--GENEMARK_PATH=/your_path_to_GeneMark_executables/).

In order to set the environment variable for your current Bash session, type:

export GENEMARK_PATH=/your_path_to_GeneMark_executables/

Add the above line to a startup script (e.g. ~/.bashrc) in order to make it available to all bash sessions.

Perl scripts within GeneMark-ES/ET/EP/ETP are configured with default Perl location at /usr/bin/perl.

If you are running GeneMark-ES/ET/EP/ETP in an Anaconda environment (or want to use Perl from the $PATH variable for any other reason), modify the shebang of all GeneMark-ES/ET/EP/ETP scripts with the following command located inside GeneMark-ES/ET/EP/ETP folder:

perl change_path_in_perl_scripts.pl "/usr/bin/env perl"

You can check whether GeneMark-ES/ET/EP is installed properly by running the check_install.bash and/or executing examples in GeneMark-E-tests directory.

GeneMark-ETP is downward compatible, i.e. it covers the functionality of GeneMark-EP and GeneMark-ET in BRAKER, too.

AUGUSTUS

Download AUGUSTUS from its master branch at https://github.com/Gaius-Augustus/Augustus. Unpack AUGUSTUS and install AUGUSTUS according to AUGUSTUS README.TXT. Do not use outdated AUGUSTUS versions from other sources, e.g. Debian package or Bioconda package! BRAKER highly depends in particular on an up-to-date Augustus/scripts directory, and other sources are often lagging behind.

You should compile AUGUSTUS on your own system in order to avoid problems with versions of libraries used by AUGUSTUS. Compilation instructions are provided in the AUGUSTUS README.TXT file (Augustus/README.txt).

AUGUSTUS consists of augustus, the gene prediction tool, additional C++ tools located in Augustus/auxprogs and Perl scripts located in Augustus/scripts. Perl scripts must be executable (see instructions in section BRAKER components).

The C++ tool bam2hints is an essential component of BRAKER when run with RNA-Seq. Sources are located in Augustus/auxprogs/bam2hints. Make sure that you compile bam2hints on your system (it should be automatically compiled when AUGUSTUS is compiled, but in case of problems with bam2hints, please read troubleshooting instructions in Augustus/auxprogs/bam2hints/README).

Since BRAKER is a pipeline that trains AUGUSTUS, i.e. writes species specific parameter files, BRAKER needs writing access to the configuration directory of AUGUSTUS that contains such files (Augustus/config/). If you install AUGUSTUS globally on your system, the config folder will typically not be writable by all users. Either make the directory where config resides recursively writable to users of AUGUSTUS, or copy the config/ folder (recursively) to a location where users have writing permission.

AUGUSTUS will locate the config folder by looking for an environment variable $AUGUSTUS_CONFIG_PATH. If the $AUGUSTUS_CONFIG_PATH environment variable is not set, then BRAKER will look in the path ../config relative to the directory in which it finds an AUGUSTUS executable. Alternatively, you can supply the variable as a command line argument to BRAKER (--AUGUSTUS_CONFIG_PATH=/your_path_to_AUGUSTUS/Augustus/config/). We recommend that you export the variable e.g. for your current bash session:

    export AUGUSTUS_CONFIG_PATH=/your_path_to_AUGUSTUS/Augustus/config/

In order to make the variable available to all Bash sessions, add the above line to a startup script, e.g. ~/.bashrc.

Please have a look at the Dockerfile in case you want to install AUGUSTUS as Debian package. A number of scripts needs to be patched, then.

Important:

BRAKER expects the entire config directory of AUGUSTUS at $AUGUSTUS_CONFIG_PATH, i.e. the subfolders species with its contents (at least generic) and extrinsic! Providing a writable but empty folder at $AUGUSTUS_CONFIG_PATH will not work for BRAKER. If you need to separate augustus binary and $AUGUSTUS_CONFIG_PATH, we recommend that you recursively copy the un-writable config contents to a writable location.
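The requirements above can be checked with a few shell tests. This is a minimal sketch: `check_augustus_config` is a hypothetical helper, and it only tests for the subfolders named above (species/generic and extrinsic) and for write permission on species; it does not validate the contents of these folders.

```shell
# Check that an AUGUSTUS config directory looks complete and is writable.
# Argument: path to the config directory (defaults to $AUGUSTUS_CONFIG_PATH).
check_augustus_config() {
    config_dir=${1:-${AUGUSTUS_CONFIG_PATH:-}}
    problems=0
    if [ -z "$config_dir" ]; then
        echo "no config directory given"
        return 1
    fi
    for sub in species/generic extrinsic; do
        if [ ! -d "$config_dir/$sub" ]; then
            echo "missing: $config_dir/$sub"
            problems=1
        fi
    done
    if [ ! -w "$config_dir/species" ]; then
        echo "not writable: $config_dir/species"
        problems=1
    fi
    if [ "$problems" -eq 0 ]; then
        echo "ok: $config_dir"
    fi
    return "$problems"
}

check_augustus_config "${AUGUSTUS_CONFIG_PATH:-}" || echo "AUGUSTUS config directory needs attention."
```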

If you have a system-wide installation of AUGUSTUS at /usr/bin/augustus, an unwritable copy of config sits at /usr/bin/augustus_config/. The folder /home/yours/ is writable to you. Copy with the following command (and additionally set the then required variables):

cp -r /usr/bin/augustus_config/ /home/yours/
export AUGUSTUS_CONFIG_PATH=/home/yours/augustus_config
export AUGUSTUS_BIN_PATH=/usr/bin
export AUGUSTUS_SCRIPTS_PATH=/usr/bin/augustus_scripts

Modification of $PATH

Adding directories of AUGUSTUS binaries and scripts to your $PATH variable enables your system to locate these tools automatically. This is not a requirement for running BRAKER, because BRAKER will try to guess both directories from the location of another environment variable ($AUGUSTUS_CONFIG_PATH), or both directories can be supplied as command line arguments to braker.pl. We nevertheless recommend adding them to your $PATH variable. For your current bash session, type:

    PATH=/your_path_to_augustus/bin/:/your_path_to_augustus/scripts/:$PATH
    export PATH

For all your bash sessions, add the above lines to a startup script (e.g. ~/.bashrc).

Python3

On Ubuntu, Python3 is usually installed by default, so python3 will be in your $PATH variable and BRAKER will locate it automatically. However, you have the option to specify the python3 binary location in two other ways:

  1. Export an environment variable $PYTHON3_PATH, e.g. in your ~/.bashrc file:

    export PYTHON3_PATH=/path/to/python3/
    
  2. Specify the command line option --PYTHON3_PATH=/path/to/python3/ to braker.pl.

Bamtools

Download BAMTOOLS (e.g. git clone https://github.com/pezmaster31/bamtools.git). Install BAMTOOLS by typing the following in your shell:

    cd your-bamtools-directory
    mkdir build
    cd build
    cmake ..
    make

If already in your $PATH variable, BRAKER will find bamtools, automatically. Otherwise, BRAKER can locate the bamtools binary either by using an environment variable $BAMTOOLS_PATH, or by taking a command line argument (--BAMTOOLS_PATH=/your_path_to_bamtools/bin/) F6. In order to set the environment variable e.g. for your current bash session, type:

    export BAMTOOLS_PATH=/your_path_to_bamtools/bin/

Add the above line to a startup script (e.g. ~/.bashrc) in order to set the environment variable for all bash sessions.

NCBI BLAST+ or DIAMOND

You can use either NCBI BLAST+ or DIAMOND for removal of redundant training genes. You do not need both tools. If DIAMOND is present, it will be preferred because it is much faster.

Obtain and unpack DIAMOND as follows:

    wget http://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
    tar xzf diamond-linux64.tar.gz

If already in your $PATH variable, BRAKER will find diamond, automatically. Otherwise, BRAKER can locate the diamond binary either by using an environment variable $DIAMOND_PATH, or by taking a command line argument (--DIAMOND_PATH=/your_path_to_diamond). In order to set the environment variable e.g. for your current bash session, type:

    export DIAMOND_PATH=/your_path_to_diamond/

Add the above line to a startup script (e.g. ~/.bashrc) in order to set the environment variable for all bash sessions.

If you decide for BLAST+, install NCBI BLAST+ with sudo apt-get install ncbi-blast+.

If already in your $PATH variable, BRAKER will find blastp, automatically. Otherwise, BRAKER can locate the blastp binary either by using an environment variable $BLAST_PATH, or by taking a command line argument (--BLAST_PATH=/your_path_to_blast/). In order to set the environment variable e.g. for your current bash session, type:

    export BLAST_PATH=/your_path_to_blast/

Add the above line to a startup script (e.g. ~/.bashrc) in order to set the environment variable for all bash sessions.

Mandatory tools for BRAKER3

The following tools are required by GeneMark-ETP, which will try to locate them in your $PATH variable. Make sure to add their location to your $PATH, e.g.:

export PATH=$PATH:/your/path/to/Tool

For all tools below, add the above line to a startup script (e.g. ~/.bashrc) in order to extend your $PATH variable for all bash sessions.

These software tools are only mandatory if you run BRAKER with RNA-Seq and protein data!
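Whether the tools described below can be found in your $PATH can be verified with `command -v`. The following is a minimal sketch; `report_missing_tools` is a hypothetical helper, and the executable names stringtie, bedtools and gffread are assumed to be the names under which the respective packages install their binaries.

```shell
# Report for each tool whether it can be found in $PATH.
# Returns non-zero if at least one tool is missing.
report_missing_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found:   %s -> %s\n' "$tool" "$(command -v "$tool")"
        else
            printf 'missing: %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

report_missing_tools stringtie bedtools gffread ||
    echo "Add the missing tools to \$PATH before running BRAKER with RNA-Seq and protein data."
```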

StringTie2

StringTie2 is used by GeneMark-ETP to assemble aligned RNA-Seq reads. A precompiled version of StringTie2 can be downloaded from https://ccb.jhu.edu/software/stringtie/#install.

BEDTools

The software package bedtools is required by GeneMark-ETP if you want to run BRAKER with both RNA-Seq and protein data. You can download bedtools from https://github.com/arq5x/bedtools2/releases. Here, you can either download a precompiled version bedtools.static.binary, e.g.

wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools.static.binary
mv bedtools.static.binary bedtools
chmod a+x bedtools

or you can download bedtools-2.30.0.tar.gz and compile it from source using make, e.g.

wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
tar -zxvf bedtools-2.30.0.tar.gz
cd bedtools2
make

See https://bedtools.readthedocs.io/en/latest/content/installation.html for more information.

GffRead

GffRead is a utility software required by GeneMark-ETP. It can be downloaded from https://github.com/gpertea/gffread/releases/download/v0.12.7/gffread-0.12.7.Linux_x86_64.tar.gz and installed with make, e.g.

wget https://github.com/gpertea/gffread/releases/download/v0.12.7/gffread-0.12.7.Linux_x86_64.tar.gz
tar xzf gffread-0.12.7.Linux_x86_64.tar.gz
cd gffread-0.12.7.Linux_x86_64
make

Optional tools

Samtools

Samtools is not required for running BRAKER without GeneMark-ETP if all your files are formatted correctly (i.e. all sequences should have short and unique fasta names). If you are not sure whether all your files are formatted correctly, it might be helpful to have Samtools installed because BRAKER can automatically fix certain format issues by using Samtools.

As a prerequisite for Samtools, download and install htslib (e.g. git clone https://github.com/samtools/htslib.git, follow the htslib documentation for installation).

Download and install Samtools (e.g. git clone https://github.com/samtools/samtools.git, subsequently follow the Samtools documentation for installation).

If already in your $PATH variable, BRAKER will find samtools, automatically. Otherwise, BRAKER can find Samtools either by taking a command line argument (--SAMTOOLS_PATH=/your_path_to_samtools/), or by using an environment variable $SAMTOOLS_PATH. For exporting the variable, e.g. for your current bash session, type:

    export SAMTOOLS_PATH=/your_path_to_samtools/

Add the above line to a startup script (e.g. ~/.bashrc) in order to set the environment variable for all bash sessions.

Biopython

If Biopython is installed, BRAKER can generate FASTA-files with coding sequences and protein sequences predicted by AUGUSTUS and generate track data hubs for visualization of a BRAKER run with MakeHub R16. These are optional steps. The first can be disabled with the command-line flag --skipGetAnnoFromFasta, the second can be activated by using the command-line options --makehub [email protected]. Biopython is not required if neither of these optional steps shall be performed.

On Ubuntu, install Python3 package manager with:

    sudo apt-get install python3-pip

Then, install Biopython with:

    sudo pip3 install biopython

cdbfasta

cdbfasta and cdbyank are required by BRAKER for correcting AUGUSTUS genes with in frame stop codons (spliced stop codons) using the AUGUSTUS script fix_in_frame_stop_codon_genes.py. This can be skipped with --skip_fixing_broken_genes.

On Ubuntu, install cdbfasta with:

    sudo apt-get install cdbfasta

For other systems, you can for example obtain cdbfasta from https://github.com/gpertea/cdbfasta, e.g.:

    git clone https://github.com/gpertea/cdbfasta.git
    cd cdbfasta
    make all

On Ubuntu, cdbfasta and cdbyank will be in your $PATH variable after installation, and BRAKER will automatically locate them. However, you have the option to specify the cdbfasta and cdbyank binary location in two other ways:

  1. Export an environment variable $CDBTOOLS_PATH, e.g. in your ~/.bashrc file:

    export CDBTOOLS_PATH=/path/to/cdbtools/

  2. Specify the command line option --CDBTOOLS_PATH=/path/to/cdbtools/ to braker.pl.

Spaln

Note: Support of stand-alone Spaln (outside of ProtHint) within BRAKER is deprecated.

This tool is required if you run ProtHint or if you would like to run protein to genome alignments with BRAKER using Spaln outside of ProtHint. Using Spaln outside of ProtHint is a suitable approach only if an annotated species of short evolutionary distance to your target genome is available. We recommend running Spaln through ProtHint for BRAKER. ProtHint brings along a Spaln binary. If that does not work on your system, download Spaln from https://github.com/ogotoh/spaln. Unpack and install according to spaln/doc/SpalnReadMe22.pdf.

BRAKER will try to locate the Spaln executable by using an environment variable $ALIGNMENT_TOOL_PATH. Alternatively, this can be supplied as command line argument (--ALIGNMENT_TOOL_PATH=/your/path/to/spaln).

GUSHR

This tool is only required if you want either to add UTRs (from RNA-Seq data) to predicted genes or to train UTR parameters for AUGUSTUS and predict genes with UTRs. In any case, GUSHR requires the input of RNA-Seq data.

GUSHR is available for download at https://github.com/Gaius-Augustus/GUSHR. Obtain it by typing:

git clone https://github.com/Gaius-Augustus/GUSHR.git

GUSHR executes a GeMoMa jar file [R19, R20, R21], and this jar file requires Java 1.8. On Ubuntu, you can install Java 1.8 with the following command:

sudo apt-get install openjdk-8-jdk

If you have several Java versions installed on your system, make sure that you enable Java 1.8 prior to running BRAKER by running

sudo update-alternatives --config java

and selecting the correct version.

Tools from UCSC

If you switch --UTR=on, bamToWig.py will require the following tools that can be downloaded from http://hgdownload.soe.ucsc.edu/admin/exe:

  • twoBitInfo

  • faToTwoBit

It is optional to install these tools into your $PATH. If you don't, and you switch --UTR=on, bamToWig.py will automatically download them into the working directory.

MakeHub

If you wish to automatically generate a track data hub of your BRAKER run, the MakeHub software, available at https://github.com/Gaius-Augustus/MakeHub, is required. Download the software either by running git clone https://github.com/Gaius-Augustus/MakeHub.git, or by picking a release from https://github.com/Gaius-Augustus/MakeHub/releases. Extract the release package if you downloaded a release (e.g. unzip MakeHub.zip or tar -zxvf MakeHub.tar.gz).

BRAKER will try to locate the make_hub.py script by using an environment variable $MAKEHUB_PATH. Alternatively, this can be supplied as command line argument (--MAKEHUB_PATH=/your/path/to/MakeHub/). BRAKER can also try to guess the location of MakeHub on your system.

SRA Toolkit

If you want BRAKER to download RNA-Seq libraries from NCBI's SRA, the SRA Toolkit is required. You can get a precompiled version of the SRA Toolkit from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit.

BRAKER will try to find executable binaries from the SRA Toolkit (fastq-dump, prefetch) by using an environment variable $SRATOOLS_PATH. Alternatively, this can be supplied as command line argument (--SRATOOLS_PATH=/your/path/to/SRAToolkit/). BRAKER can also try to guess the location of the SRA Toolkit on your system if the executables are in your $PATH variable.

HISAT2

If you want to use unaligned RNA-Seq reads, the HISAT2 software is required to map them to the genome. A precompiled version of HISAT2 can be downloaded from http://daehwankimlab.github.io/hisat2/download/#version-hisat2-221.

BRAKER will try to find executable HISAT2 binaries (hisat2, hisat2-build) by using an environment variable $HISAT2_PATH. Alternatively, this can be supplied as command line argument (--HISAT2_PATH=/your/path/to/HISAT2/). BRAKER can also try to guess the location of HISAT2 on your system if the executables are in your $PATH variable.

compleasm

If you want to run TSEBRA within BRAKER in a BUSCO completeness maximizing mode, you need to install compleasm.

wget https://github.com/huangnengCSU/compleasm/releases/download/v0.2.4/compleasm-0.2.4_x64-linux.tar.bz2
tar -xvjf compleasm-0.2.4_x64-linux.tar.bz2

Add the resulting folder compleasm_kit to your $PATH variable, e.g.:

export PATH=$PATH:/your/path/to/compleasm_kit

Compleasm requires pandas, which can be installed with:

pip install pandas

System dependencies

BRAKER (braker.pl) uses getconf to see how many threads can be run on your system. On Ubuntu, you can install it with:

sudo apt-get install libc-bin
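To see what getconf reports on your machine, you can query the processor count yourself. This is a minimal sketch; the variable name _NPROCESSORS_ONLN is widely supported by getconf but is an assumption here, not a statement about the exact query that braker.pl issues.

```shell
# Ask getconf for the number of processors that are currently online.
# _NPROCESSORS_ONLN is assumed to be supported by the local getconf.
T=$(getconf _NPROCESSORS_ONLN)
echo "Processors available: ${T}"
```

The value in ${T} can then serve as an upper bound when choosing an argument for --threads.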

Running BRAKER

Different BRAKER pipeline modes

In the following, we describe “typical” BRAKER calls for different input data types. BRAKER should only be applied to genomic sequences that have been softmasked for repeats!
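One quick way to check whether a genome file appears to be softmasked is to measure the share of lowercase bases. The following is a minimal sketch, not part of BRAKER: the function name softmasked_fraction is hypothetical, and it assumes that masked repeats are written as lowercase a, c, g, t or n.

```shell
# Hypothetical helper: print the fraction of lowercase bases in a FASTA file.
# Reads the file(s) given as arguments, or stdin if none are given.
softmasked_fraction() {
    awk '
        /^>/ { next }                        # skip sequence headers
        {
            total += length($0)
            lower += gsub(/[acgtn]/, "")     # gsub returns the number of matches
        }
        END {
            if (total > 0) printf "%.4f\n", lower / total
            else print "0.0000"
        }
    ' "$@"
}

# Small inline demonstration: 8 of 16 bases are lowercase.
printf '>contig1\nACGTacgt\nNNNNnnnn\n' | softmasked_fraction
```

For a real genome, call softmasked_fraction genome.fasta. A result of 0.0000 suggests that the genome has not been softmasked.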

BRAKER with RNA-Seq data

This approach is suitable for genomes of species for which RNA-Seq libraries with good transcriptome coverage are available and for which protein data is not at hand. The pipeline is illustrated in Figure 2.

BRAKER has several ways to receive RNA-Seq data as input:

  • You can provide ID(s) of RNA-Seq libraries from SRA (in case of multiple IDs, separate them by comma) as argument to --rnaseq_sets_ids. The libraries belonging to the IDs are then downloaded automatically by BRAKER, e.g.:

        braker.pl --species=yourSpecies --genome=genome.fasta \
           --rnaseq_sets_ids=SRA_ID1,SRA_ID2
    
  • You can use local FASTQ file(s) of unaligned reads as input. In this case, you have to provide BRAKER with the ID(s) of the RNA-Seq set(s) as argument to --rnaseq_sets_ids and the path(s) to the directories where the FASTQ files are located as argument to --rnaseq_sets_dirs. For each ID, BRAKER will search these directories for one FASTQ file named ID.fastq if the reads are unpaired, or for two FASTQ files named ID_1.fastq and ID_2.fastq if they are paired.

    For example, if you have a paired library called 'SRA_ID1' and an unpaired library named 'SRA_ID2', you need a directory /path/to/local/fastq/files/ in which the files SRA_ID1_1.fastq, SRA_ID1_2.fastq, and SRA_ID2.fastq reside. Then, you could run BRAKER with the following command:

        braker.pl --species=yourSpecies --genome=genome.fasta \
           --rnaseq_sets_ids=SRA_ID1,SRA_ID2 \
           --rnaseq_sets_dirs=/path/to/local/fastq/files/
    
  • There are two ways of supplying BRAKER with RNA-Seq data as bam file(s). First, you can do it in the same way as you would supply FASTQ file(s): Provide the ID(s)/name(s) of your bam file(s) as argument to --rnaseq_sets_ids and specify directories where the bam files reside with --rnaseq_sets_dirs. BRAKER will automatically detect that these ID(s) are bam and not FASTQ file(s), e.g.:

        braker.pl --species=yourSpecies --genome=genome.fasta \
           --rnaseq_sets_ids=BAM_ID1,BAM_ID2 \
           --rnaseq_sets_dirs=/path/to/local/bam/files/
    

    Second, you can specify the paths to your bam file(s) directly with --bam, e.g.:

        braker.pl --species=yourSpecies --genome=genome.fasta \
           --bam=file1.bam,file2.bam
    

    Please note that we generally assume that bam files were generated with HISAT2 because that is the aligner that BRAKER3 would also execute with FASTQ input. If you want, for some reason, to generate the bam files with STAR, use the STAR option --outSAMstrandField intronMotif to produce files that are compatible with StringTie in BRAKER3.

  • In order to run BRAKER with RNA-Seq spliced alignment information that has already been extracted, run:

        braker.pl --species=yourSpecies --genome=genome.fasta \
           --hints=hints1.gff,hints2.gff
    

    The format of such a hints file must be as follows (tabulator separated file):

        chrName b2h intron  6591    8003    1   +   .   pri=4;src=E
        chrName b2h intron  6136    9084    11  +   .   mult=11;pri=4;src=E
        ...
    

    The source b2h in the second column and the source tag src=E in the last column are essential for BRAKER to determine whether a hint has been generated from RNA-Seq data.

It is also possible to provide RNA-Seq sets in different ways for the same BRAKER run; any combination of the above options is possible. It is not recommended to provide RNA-Seq data with --hints if you run BRAKER in ETPmode (RNA-Seq and protein data), because GeneMark-ETP won't use these hints!
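The FASTQ naming scheme described above (ID.fastq for unpaired reads, ID_1.fastq and ID_2.fastq for paired reads) can be checked before starting BRAKER. This is a minimal sketch, not part of BRAKER: the function classify_rnaseq_set and the variables RNASEQ_DIR and RNASEQ_IDS are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: classify one RNA-Seq set ID by the FASTQ files
# found for it in a directory. $1 = directory, $2 = set ID.
classify_rnaseq_set() {
    if [ -f "$1/${2}_1.fastq" ] && [ -f "$1/${2}_2.fastq" ]; then
        echo "paired"
    elif [ -f "$1/${2}.fastq" ]; then
        echo "unpaired"
    else
        echo "missing"
    fi
}

# Check every ID of a comma-separated list, as passed to --rnaseq_sets_ids.
RNASEQ_DIR="${RNASEQ_DIR:-.}"
RNASEQ_IDS="${RNASEQ_IDS:-SRA_ID1,SRA_ID2}"
for id in $(printf '%s' "$RNASEQ_IDS" | tr ',' ' '); do
    printf '%s\t%s\n' "$id" "$(classify_rnaseq_set "$RNASEQ_DIR" "$id")"
done
```

An ID reported as missing is not necessarily an error: if no local file exists, the ID may refer to an SRA library to be downloaded or to a bam file.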

BRAKER with protein data

This approach is suitable for genomes of species for which no RNA-Seq libraries are available. A large database of proteins (with possibly longer evolutionary distance to the target species) should be used in this case. This mode is illustrated in Figure 9.

braker2-main-a

Figure 9: BRAKER with proteins of any evolutionary distance. The ProtHint protein mapping pipeline is used to generate protein hints. ProtHint automatically determines which alignments are from close relatives, and which are from rather distant relatives.

For running BRAKER in this mode, type:

braker.pl --genome=genome.fa --prot_seq=proteins.fa

We recommend using OrthoDB as basis for proteins.fa. The instructions on how to prepare the input OrthoDB proteins are documented here: https://github.com/gatech-genemark/ProtHint#protein-database-preparation.

You can of course add additional protein sequences to that file, or try with a completely different database. Any database will need several representatives for each protein, though.

Instead of having BRAKER run ProtHint, you can also start BRAKER with hints already produced by ProtHint, by providing ProtHint's prothint_augustus.gff output:

braker.pl --genome=genome.fa --hints=prothint_augustus.gff

The format of prothint_augustus.gff in this mode looks like this:

2R ProtHint intron 11506230 11506648 4 + . src=M;mult=4;pri=4
2R ProtHint intron 9563406  9563473  1 + . grp=69004_0:001de1_702_g;src=C;pri=4;
2R ProtHint intron 8446312  8446371  1 + . grp=43151_0:001cae_473_g;src=C;pri=4;
2R ProtHint intron 8011796  8011865  2 - . src=P;mult=1;pri=4;al_score=0.12;
2R ProtHint start  234524   234526   1 + . src=P;mult=1;pri=4;al_score=0.08;

The prediction of all hints with src=M will be enforced. Hints with src=C are 'chained evidence', i.e. they will only be incorporated if all members of the group (grp=...) can be incorporated in a single transcript. All other hints have src=P in the last column. Supported features in column 3 are intron, start, stop and CDSpart.
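To get an overview of how many hints of each kind a hints file contains, the feature column and the src tag can be tallied with awk. This is a minimal sketch, not part of BRAKER or ProtHint: the function name tally_hints is hypothetical, and it assumes a tab-separated file with the attributes in column 9, as in the example above.

```shell
# Hypothetical helper: count hints per feature (column 3) and source tag
# (src=... in column 9). Reads the file(s) given as arguments, or stdin.
tally_hints() {
    awk -F'\t' '
        NF >= 9 && $1 !~ /^#/ {
            src = "none"
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^src=/) src = substr(attrs[i], 5)
            count[$3 "\t" src]++
        }
        END { for (k in count) print k "\t" count[k] }
    ' "$@" | sort
}

# Inline demonstration with two hints from the example above.
printf '2R\tProtHint\tintron\t11506230\t11506648\t4\t+\t.\tsrc=M;mult=4;pri=4\n2R\tProtHint\tstart\t234524\t234526\t1\t+\t.\tsrc=P;mult=1;pri=4;al_score=0.08;\n' | tally_hints
```

For a real file, call tally_hints prothint_augustus.gff. Each output line lists the feature, the source tag, and the number of hints.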

Training and prediction of UTRs, integration of coverage information

If RNA-Seq (and only RNA-Seq) data is provided to BRAKER as a bam file, and if the genome is softmasked for repeats, BRAKER can automatically train UTR parameters for AUGUSTUS. After successful training of UTR parameters, BRAKER will automatically predict genes including coverage information from RNA-Seq data. Example call:

braker.pl --species=yourSpecies --genome=genome.fasta \
   --bam=file.bam --UTR=on

Warnings:

  1. This feature is experimental!

  2. --UTR=on is currently not compatible with bamToWig.py as released in AUGUSTUS 3.3.3; it requires the current development code version from the github repository (git clone https://github.com/Gaius-Augustus/Augustus.git).

  3. --UTR=on increases memory consumption of AUGUSTUS. Carefully monitor jobs if your machine was close to maxing RAM without --UTR=on! Reducing the number of cores will also reduce RAM consumption.

  4. UTR prediction sometimes improves coding sequence prediction accuracy, but not always. If you try this feature, carefully compare results with and without UTR parameters, afterwards (e.g. in UCSC Genome Browser).

Stranded RNA-Seq alignments

For running BRAKER without UTR parameters, it is not very important whether RNA-Seq data was generated by a stranded protocol (because spliced alignments are ’artificially stranded’ by checking the splice site pattern). However, for UTR training and prediction, stranded libraries may provide information that is valuable for BRAKER.

After alignment of the stranded RNA-Seq libraries, separate the resulting bam file entries into two files: one for plus strand mappings, one for minus strand mappings. Call BRAKER as follows:

braker.pl --species=yourSpecies --genome=genome.fasta \
    --bam=plus.bam,minus.bam --stranded=+,- \
    --UTR=on

You may additionally include bam files from unstranded libraries. Those files will not be used for generating UTR training examples, but they will be included in the final gene prediction step as unstranded coverage information. Example call:

braker.pl --species=yourSpecies --genome=genome.fasta \
   --bam=plus.bam,minus.bam,unstranded.bam \
   --stranded=+,-,. --UTR=on

Warning: This feature is experimental and currently has low priority on our maintenance list!

BRAKER with RNA-Seq and protein data

This is the native mode for running BRAKER with RNA-Seq and protein data. It will call GeneMark-ETP, which will use RNA-Seq and protein hints for training GeneMark-ETP. Subsequently, AUGUSTUS is trained on 'high-confidence' genes (genes with very high extrinsic evidence support) from the GeneMark-ETP prediction, and a set of genes is predicted by AUGUSTUS. In a last step, the predictions of AUGUSTUS and GeneMark-ETP are combined using TSEBRA.

Alignment of RNA-Seq reads

GeneMark-ETP utilizes StringTie2 to assemble RNA-Seq data, which requires that the aligned reads (BAM files) contain the XS (strand) tag for spliced reads. Therefore, if you align your reads with HISAT2, you must enable the --dta option, or if you use STAR, you must use the --outSAMstrandField intronMotif option. TopHat alignments include this tag by default.
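Whether an existing alignment file carries the XS tag can be checked on the SAM text records. This is a minimal sketch, not part of BRAKER: the function name count_xs_tags is hypothetical, and it treats an alignment as spliced when its CIGAR string (column 6) contains an N operation.

```shell
# Hypothetical helper: read SAM records on stdin and count spliced
# alignments (CIGAR contains N) with and without an XS:A strand tag.
count_xs_tags() {
    awk -F'\t' '
        /^@/ { next }                          # skip SAM header lines
        $6 ~ /N/ {                             # spliced alignment
            spliced++
            for (i = 12; i <= NF; i++)         # optional tags start at column 12
                if ($i ~ /^XS:A:[+-]$/) { tagged++; break }
        }
        END { printf "spliced=%d with_XS=%d\n", spliced, tagged }
    '
}

# Inline demonstration with three synthetic records:
# one spliced read with XS, one unspliced read, one spliced read without XS.
printf 'r1\t0\tchr1\t100\t60\t50M200N50M\t*\t0\t0\t*\t*\tNH:i:1\tXS:A:+\nr2\t0\tchr1\t500\t60\t100M\t*\t0\t0\t*\t*\tNH:i:1\nr3\t16\tchr1\t900\t60\t30M500N70M\t*\t0\t0\t*\t*\tNH:i:1\n' | count_xs_tags
```

With samtools installed, a real file can be inspected with samtools view file.bam | count_xs_tags. If with_XS is far below spliced, the file was probably produced without the aligner options named above.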

To call the pipeline in this mode, you have to provide it with a protein database using --prot_seq (as described in BRAKER with protein data), and RNA-Seq data either by their SRA ID so that they are downloaded by BRAKER, as unaligned reads in FASTQ format, and/or as aligned reads in bam format (as described in BRAKER with RNA-Seq data). You could also specify already processed extrinsic evidence using the --hints option. However, this is not recommended for a normal BRAKER run in ETPmode, as these hints won't be used in the GeneMark-ETP step. Only use --hints when you want to skip the GeneMark-ETP step!

Examples of how you could run BRAKER in ETPmode:

    braker.pl --genome=genome.fa --prot_seq=orthodb.fa \
        --rnaseq_sets_ids=SRA_ID1,SRA_ID2 \
        --rnaseq_sets_dirs=/path/to/local/RNA-Seq/files/

    braker.pl --genome=genome.fa --prot_seq=orthodb.fa \
        --rnaseq_sets_ids=SRA_ID1,SRA_ID2,SRA_ID3

    braker.pl --genome=genome.fa --prot_seq=orthodb.fa \
        --bam=/path/to/SRA_ID1.bam,/path/to/SRA_ID2.bam

BRAKER with short and long read RNA-Seq and protein data

A preliminary protocol for integration of assembled subreads from PacBio ccs sequencing in combination with short read Illumina RNA-Seq and protein database is described at https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/long_reads/long_read_protocol.md

BRAKER with long read RNA-Seq (only) and protein data

We forked GeneMark-ETP and hard coded that StringTie will perform long read assembly in that particular version. If you want to use this 'fast-hack' version for BRAKER, you have to prepare the BAM file with long read to genome spliced alignments outside of BRAKER, e.g.:

T=48 # adapt to your number of threads
minimap2 -t${T} -ax splice:hq -uf genome.fa isoseq.fa > isoseq.sam     
samtools view -bS --threads ${T} isoseq.sam -o isoseq.bam

Pull the adapted container:

singularity build braker3_lr.sif docker://teambraker/braker3:isoseq

Calling BRAKER3 with a BAM file of spliced-aligned IsoSeq Reads:

singularity exec -B ${PWD}:${PWD} braker3_lr.sif braker.pl --genome=genome.fa --prot_seq=protein_db.fa --bam=isoseq.bam --threads=${T}

Warning: Do NOT mix short read and long read data in this BRAKER/GeneMark-ETP variant!

Warning: The accuracy of gene prediction here heavily depends on the depth of your IsoSeq data. We verified with PacBio HiFi reads from 2022 that given sufficient completeness of the assembled transcriptome you will reach similar results as with short reads. However, we also observed a drop in accuracy compared to short reads when using other long read data sets with higher error rates and less sequencing depth.

Description of selected BRAKER command line options

Please run braker.pl --help to obtain a full list of options.

--ab_initio

Compute AUGUSTUS ab initio predictions in addition to AUGUSTUS predictions with hints (additional output files: augustus.ab_initio.*). This may be useful for estimating the quality of training gene parameters when inspecting predictions in a genome browser.

--augustus_args="--some_arg=bla"

One or several command line arguments to be passed to AUGUSTUS. If several arguments are given, separate them by whitespace, i.e. "--first_arg=sth --second_arg=sth". This may be useful if you know that gene prediction in your particular species benefits from a particular AUGUSTUS argument during the prediction step.

--threads=INT

Specifies the maximum number of threads that can be used during computation. BRAKER has to run some steps on a single thread, others can take advantage of multiple threads. If you use more than 8 threads, this will not speed up all parallelized steps, in particular, the time consuming optimize_augustus.pl will not use more than 8 threads. However, if you don’t mind some threads being idle, using more than 8 threads will speed up other steps.

--fungus

GeneMark-ETP option: run algorithm with branch point model. Use this option if your genome is a fungus.

--useexisting

Use the present config and parameter files if they exist for 'species'; will overwrite original parameters if BRAKER performs an AUGUSTUS training.

--crf

Execute CRF training for AUGUSTUS; resulting parameters are only kept for final predictions if they show higher accuracy than HMM parameters. This increases runtime!

--lambda=int

Change the parameter $\lambda$ of the Poisson distribution that is used for downsampling training genes according to their number of introns (only genes with up to 5 introns are downsampled). The default value is $\lambda=2$. You might want to set it to 0 for organisms that mainly have single-exon genes. (Generally, single-exon genes contribute less to improving AUGUSTUS parameters than genes with many exons.)

--UTR=on

Generate UTR training examples for AUGUSTUS from RNA-Seq coverage information, train AUGUSTUS UTR parameters and predict genes with AUGUSTUS and UTRs, including coverage information for RNA-Seq as evidence. This is an experimental feature!

If you performed a BRAKER run without --UTR=on, you can add UTR parameter training and gene prediction with UTR parameters (and only RNA-Seq hints) with the following command:

braker.pl --genome=../genome.fa --addUTR=on \
    --bam=../RNAseq.bam --workingdir=$wd \
    --AUGUSTUS_hints_preds=augustus.hints.gtf \
    --threads=8 --skipAllTraining --species=somespecies

Modify augustus.hints.gtf to point to the AUGUSTUS predictions with hints from the previous BRAKER run; modify the flanking_DNA value to the flanking region from the log file of your previous BRAKER run; set $wd to the location where BRAKER should store the results of the additional BRAKER run; modify somespecies to the species name used in your previous BRAKER run.

--addUTR=on

Add UTRs from RNA-Seq coverage information to AUGUSTUS gene predictions using GUSHR. No training of UTR parameters and no gene prediction with UTR parameters is performed.

If you performed a BRAKER run without --addUTR=on, you can add UTRs to the results of a previous BRAKER run with the following command:

braker.pl --genome=../genome.fa --addUTR=on \
    --bam=../RNAseq.bam --workingdir=$wd \
    --AUGUSTUS_hints_preds=augustus.hints.gtf --threads=8 \
    --skipAllTraining --species=somespecies

Modify augustus.hints.gtf to point to the AUGUSTUS predictions with hints from the previous BRAKER run; set $wd to the location where BRAKER should store the results of the additional BRAKER run. This run will not modify AUGUSTUS parameters. We recommend that you specify the species of the original run with --species=somespecies. Otherwise, BRAKER will create an unneeded species parameters directory Sp_*.

--stranded=+,-,.,...

If --UTR=on is enabled, strand-separated bam files can be provided with --bam=plus.bam,minus.bam. In that case, --stranded=... should hold the strands of the bam files (+ for plus strand, - for minus strand, . for unstranded). Note that unstranded data will only be used in the gene prediction step if the parameter --stranded=... is set. This is an experimental feature! GUSHR currently does not take advantage of stranded data.

--makehub --email=EMAIL

If --makehub and --email (with your valid e-mail address) are provided, a track data hub for visualizing results with the UCSC Genome Browser will be generated using MakeHub (https://github.com/Gaius-Augustus/MakeHub).

--gc_probability=DECIMAL

By default, GeneMark-ES/ET/EP/ETP uses a probability of 0.001 for predicting the donor splice site pattern GC (instead of GT). It may make sense to increase this value for species where this donor splice site is more common. For example, in the species Emiliania huxleyi, about 50% of donor splice sites have the pattern GC (https://media.nature.com/original/nature-assets/nature/journal/v499/n7457/extref/nature12221-s2.pdf, page 5).

--busco_lineage=lineage

Use a species-specific lineage, e.g. arthropoda_odb10 for an arthropod. BRAKER does not support auto-typing of the lineage.

Specifying a BUSCO lineage invokes two changes in BRAKER [R28]:

  1. BRAKER will run compleasm with the specified lineage in genome mode and convert the detected BUSCO matches into hints for AUGUSTUS. This may increase the number of BUSCOs in the augustus.hints.gtf file slightly.

  2. BRAKER will invoke best_by_compleasm.py to check whether the braker.gtf file that is by default generated by TSEBRA has the lowest number of missing BUSCOs compared to the augustus.hints.gtf and the genemark.gtf file. If not, the following decision schema is applied to re-run TSEBRA to minimize the missing BUSCOs in the final output of BRAKER (always braker.gtf). If an alternative and better gene set is created, the original braker.gtf gene set is moved to a directory called braker_original. Information on what happened during the best_by_compleasm.py run is written to the file best_by_compleasm.log.

best_by_busco[fig14]

Please note that using BUSCO to assess the quality of a gene set, in particular when comparing BRAKER to other pipelines, does not make sense once you specified a BUSCO lineage. We recommend that you use other measures to assess the quality of your gene set, e.g. by comparing it to a reference gene set or running OMArk.

Output of BRAKER

BRAKER produces several important output files in the working directory.

  • braker.gtf: Final gene set of BRAKER. The contents of this file depend on how you called BRAKER:

    • in ETPmode: Final gene set of BRAKER consisting of genes predicted by AUGUSTUS and GeneMark-ETP that were combined and filtered by TSEBRA.

    • otherwise: Union of augustus.hints.gtf and reliable GeneMark-ES/ET/EP predictions (genes fully supported by external evidence). In --esmode, this is the union of augustus.ab_initio.gtf and all GeneMark-ES genes. Thus, this set is generally more sensitive (more genes correctly predicted) and can be less specific (more false-positive predictions can be present). This output is not necessarily better than augustus.hints.gtf, and it is not recommended to use it if BRAKER was run in ESmode.

  • braker.codingseq: Final gene set with coding sequences in FASTA format

  • braker.aa: Final gene set with protein sequences in FASTA format

  • braker.gff3: Final gene set in gff3 format (only produced if the flag --gff3 was specified to BRAKER).

  • Augustus/*: AUGUSTUS gene set(s) as gtf/codingseq/aa files

  • GeneMark-E*/genemark.gtf: Genes predicted by GeneMark-ES/ET/EP/EP+/ETP in GTF-format.

  • hintsfile.gff: The extrinsic evidence data extracted from RNAseq.bam and/or protein data.

  • braker_original/*: Genes predicted by BRAKER (TSEBRA merge) before compleasm was used to improve BUSCO completeness

  • bbc/*: output folder of best_by_compleasm.py script from TSEBRA that is used to improve BUSCO completeness in the final output of BRAKER

Output files may be present with the following name endings and formats:

  • Coding sequences in FASTA-format are produced if the flag --skipGetAnnoFromFasta was not set.

  • Protein sequence files in FASTA-format are produced if the flag --skipGetAnnoFromFasta was not set.

For details about gtf format, see http://www.sanger.ac.uk/Software/formats/GFF/. A GTF-format file contains one line per predicted exon. Example:

    HS04636 AUGUSTUS initial   966 1017 . + 0 transcript_id "g1.1"; gene_id "g1";
    HS04636 AUGUSTUS internal 1818 1934 . + 2 transcript_id "g1.1"; gene_id "g1";

The columns (fields) contain:

    seqname source feature start end score strand frame transcript ID and gene ID
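Because every exon line carries a gene_id and a transcript_id in the last column, the size of a gene set can be estimated by counting the distinct values. This is a minimal sketch, not part of BRAKER: the function name count_gtf_ids is hypothetical, and lines whose last column lacks these attribute keys are simply ignored.

```shell
# Hypothetical helper: count distinct gene_id and transcript_id values
# in a GTF file. Reads the file(s) given as arguments, or stdin.
count_gtf_ids() {
    awk -F'\t' '
        $1 !~ /^#/ && NF >= 9 {
            if (match($9, /gene_id "[^"]+"/))
                genes[substr($9, RSTART, RLENGTH)] = 1
            if (match($9, /transcript_id "[^"]+"/))
                transcripts[substr($9, RSTART, RLENGTH)] = 1
        }
        END {
            g = 0; for (k in genes) g++
            t = 0; for (k in transcripts) t++
            printf "genes=%d transcripts=%d\n", g, t
        }
    ' "$@"
}

# Inline demonstration with the two example lines above (tab-separated).
printf 'HS04636\tAUGUSTUS\tinitial\t966\t1017\t.\t+\t0\ttranscript_id "g1.1"; gene_id "g1";\nHS04636\tAUGUSTUS\tinternal\t1818\t1934\t.\t+\t2\ttranscript_id "g1.1"; gene_id "g1";\n' | count_gtf_ids
```

For a real run, call count_gtf_ids braker.gtf in the working directory.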

If the --makehub option was used and MakeHub is available on your system, a hub directory beginning with the name hub_ will be created. Copy this directory to a publicly accessible web server. A file hub.txt resides in the directory. Provide the link to that file to the UCSC Genome Browser for visualizing results.

Example data

An incomplete example data set is contained in the directory BRAKER/example. In order to complete the data set, please download the RNA-Seq alignment file (134 MB) with wget:

cd BRAKER/example
wget http://topaz.gatech.edu/GeneMark/Braker/RNAseq.bam

In case you have trouble accessing that file, there's also a copy available from another server:

cd BRAKER/example
wget http://bioinf.uni-greifswald.de/augustus/datasets/RNAseq.bam

The example data set was not compiled in order to achieve optimal prediction accuracy, but in order to quickly test pipeline components. The small subset of the genome used in these test examples is not long enough for BRAKER training to work well.

Data description

Data corresponds to the last 1,000,000 nucleotides of Arabidopsis thaliana's chromosome Chr5, split into 8 artificial contigs.

RNA-Seq alignments were obtained by VARUS.

The protein sequences are a subset of OrthoDB v10 plants proteins.

List of files:

  • genome.fa - genome file in fasta format
  • RNAseq.bam - RNA-Seq alignment file in bam format (this file is not a part of this repository, it must be downloaded separately from http://topaz.gatech.edu/GeneMark/Braker/RNAseq.bam)
  • RNAseq.hints - RNA-Seq hints (can be used instead of RNAseq.bam as RNA-Seq input to BRAKER)
  • proteins.fa - protein sequences in fasta format

The below given commands assume that you configured all paths to tools by exporting bash variables or that you have the necessary tools in your $PATH.

The example data set also contains scripts tests/test*.sh that will execute the commands listed below for testing BRAKER with the example data set. You will find example results of AUGUSTUS and GeneMark-ES/ET/EP/ETP in the folder results/test*. Be aware that BRAKER contains several parts where random variables are used, so the results that you obtain when running the tests may not be exactly identical. To compare your test results with the reference ones, you can use the compare_intervals_exact.pl script as follows:

# Compare CDS features
compare_intervals_exact.pl --f1 augustus.hints.gtf --f2 ../../results/test${N}/augustus.hints.gtf --verbose
# Compare transcripts
compare_intervals_exact.pl --f1 augustus.hints.gtf --f2 ../../results/test${N}/augustus.hints.gtf --trans --verbose

Several tests use the --gm_max_intergenic 10000 option to make the test runs faster. It is not recommended to use this option in real BRAKER runs, because the speed increase achieved by adjusting this option is negligible on full-sized genomes.

We give runtime estimations derived from computing on Intel(R) Xeon(R) CPU E5530 @ 2.40GHz.

Testing BRAKER with RNA-Seq data

The following command will run the pipeline according to Figure 3:

braker.pl --genome genome.fa --bam RNAseq.bam --threads N --busco_lineage=lineage_odb10

This test is implemented in test1.sh, expected runtime is ~20 minutes.

Testing BRAKER with proteins

The following command will run the pipeline according to Figure 4:

braker.pl --genome genome.fa --prot_seq proteins.fa --threads N --busco_lineage=lineage_odb10

This test is implemented in test2.sh, expected runtime is ~20 minutes.

Testing BRAKER with proteins and RNA-Seq

The following command will run a pipeline that first trains GeneMark-ETP with protein and RNA-Seq hints and subsequently trains AUGUSTUS on the basis of GeneMark-ETP predictions. AUGUSTUS predictions are also performed with hints from both sources, see Figure 5.

Run with local RNA-Seq file:

braker.pl --genome genome.fa --prot_seq proteins.fa --bam ../RNAseq.bam --threads N --busco_lineage=lineage_odb10

This test is implemented in test3.sh, expected runtime is ~20 minutes.

Download RNA-Seq library from Sequence Read Archive (~1gb):

braker.pl --genome genome.fa --prot_seq proteins.fa --rnaseq_sets_ids ERR5767212 --threads N --busco_lineage=lineage_odb10

This test is implemented in test3_4.sh, expected runtime is ~35 minutes.

Testing BRAKER with pre-trained parameters

The training step of all pipelines can be skipped with the option --skipAllTraining. This means, only AUGUSTUS predictions will be performed, using pre-trained, already existing parameters. For example, you can predict genes with the command:

    braker.pl --genome=genome.fa --bam RNAseq.bam --species=arabidopsis \
        --skipAllTraining --threads N

This test is implemented in test4.sh, expected runtime is ~1 minute.

Testing BRAKER with genome sequence

The following command will run the pipeline with no extrinsic evidence:

braker.pl --genome=genome.fa --esmode --threads N

This test is implemented in test5.sh, expected runtime is ~20 minutes.

Testing BRAKER with RNA-Seq data and --UTR=on

The following command will run BRAKER with training UTR parameters from RNA-Seq coverage data:

braker.pl --genome genome.fa --bam RNAseq.bam --UTR=on --threads N

This test is implemented in test6.sh, expected runtime is ~20 minutes.

Testing BRAKER with RNA-Seq data and --addUTR=on

The following command will add UTRs to augustus.hints.gtf from RNA-Seq coverage data:

braker.pl --genome genome.fa --bam RNAseq.bam --addUTR=on --threads N

This test is implemented in test7.sh, expected runtime is ~20 minutes.

Starting BRAKER on the basis of previously existing BRAKER runs

There is currently no clean way to restart a failed BRAKER run (after solving some problem). However, it is possible to start a new BRAKER run based on results from a previous run, given that the old run produced the required intermediate results. In the following, we refer to the old working directory with the variable ${BRAKER_OLD}, and to the new BRAKER working directory with ${BRAKER_NEW}. The file what-to-cite.txt will always only refer to the software that was actually called by a particular run. You might have to combine the contents of ${BRAKER_NEW}/what-to-cite.txt with ${BRAKER_OLD}/what-to-cite.txt for preparing a publication. The following figure illustrates at which points a BRAKER run may be intercepted.

braker-intercept[fig8]

Figure 10: Points for intercepting a BRAKER run and reusing intermediate results in a new BRAKER run.

Option 1: starting BRAKER with existing hints file(s) before training

This option is only possible for BRAKER in ETmode or EPmode and not in ETPmode!

If you have access to an existing BRAKER output that contains hintsfiles that were generated from extrinsic data, such as RNA-Seq or protein sequences, you can recycle these hints files in a new BRAKER run. Also, hints from a separate ProtHint run can be directly used in BRAKER.

The hints can be given to BRAKER with --hints ${BRAKER_OLD}/hintsfile.gff option. This is illustrated in the test files test1_restart1.sh, test2_restart1.sh, test4_restart1.sh. The other modes (for which this test is missing) cannot be restarted in this way.

Option 2: starting BRAKER after GeneMark-ES/ET/EP/ETP had finished, before training AUGUSTUS

The GeneMark result can be given to BRAKER with --geneMarkGtf ${BRAKER_OLD}/GeneMark*/genemark.gtf option if BRAKER is run in ETmode or EPmode. This is illustrated in the test files test1_restart2.sh, test2_restart2.sh, test5_restart2.sh.

In ETPmode, you can either provide BRAKER with the results of the GeneMark-ETP step manually, with --geneMarkGtf ${BRAKER_OLD}/GeneMark-ETP/proteins.fa/genemark.gtf, --traingenes ${BRAKER_OLD}/GeneMark-ETP/training.gtf, and --hints ${BRAKER_OLD}/hintsfile.gff (see test3_restart1.sh for an example), or you can specify the previous GeneMark-ETP results with the option --gmetp_results_dir ${BRAKER_OLD}/GeneMark-ETP/ so that BRAKER can search for the files automatically (see test3_restart2.sh for an example).

Option 3: starting BRAKER after AUGUSTUS training

The trained species parameters for AUGUSTUS can be passed with the --skipAllTraining and --species $speciesName options. This is illustrated in the test*_restart3.sh files. Note that in ETPmode you have to specify the GeneMark files as described in Option 2!

Bug reporting

Before reporting bugs, please check that you are using the most recent versions of GeneMark-ES/ET/EP/ETP, AUGUSTUS and BRAKER. Also check the list of Common problems and the Issue list on GitHub before reporting bugs. We do monitor open issues on GitHub. Sometimes we are unable to help you immediately, but we try hard to solve your problems.

Reporting bugs on GitHub

If you found a bug, please open an issue at https://github.com/Gaius-Augustus/BRAKER/issues (or contact [email protected] or [email protected]).

Information worth mentioning in your bug report:

Check in braker/yourSpecies/braker.log at which step braker.pl crashed.

There are a number of other files that might be of interest, depending on where in the pipeline the problem occurred. Some of the following files will not be present if they did not contain any errors.

  • braker/yourSpecies/errors/bam2hints.*.stderr - will give details on a bam2hints crash (step for converting bam file to intron gff file)

  • braker/yourSpecies/hintsfile.gff - is this file empty? If yes, something went wrong during hints generation. Does this file contain hints from source “b2h” and of type “intron”? If not, GeneMark-ET will not be able to execute properly. Conversely, GeneMark-EP+ will not be able to execute correctly if hints from the source “ProtHint” are missing.

  • braker/yourSpecies/spaln/*err - errors reported by spaln

  • braker/yourSpecies/errors/GeneMark-{ET,EP,ETP}.stderr - errors reported by GeneMark-ET/EP+/ETP

  • braker/yourSpecies/errors/GeneMark-{ET,EP,ETP}.stdout - may give clues about the point at which errors in GeneMark-ET/EP+/ETP occurred

  • braker/yourSpecies/GeneMark-{ET,EP,ETP}/genemark.gtf - is this file empty? If yes, something went wrong during executing GeneMark-ET/EP+/ETP

  • braker/yourSpecies/GeneMark-{ET,EP}/genemark.f.good.gtf - is this file empty? If yes, something went wrong during filtering GeneMark-ET/EP+ genes for training AUGUSTUS

  • braker/yourSpecies/genbank.good.gb - try a “grep -c LOCUS genbank.good.gb” to determine the number of training genes for training AUGUSTUS, should not be low

  • braker/yourSpecies/errors/firstetraining.stderr - contains errors from first iteration of training AUGUSTUS

  • braker/yourSpecies/errors/secondetraining.stderr - contains errors from second iteration of training AUGUSTUS

  • braker/yourSpecies/errors/optimize_augustus.stderr - contains errors from optimize_augustus.pl (additional training set for AUGUSTUS)

  • braker/yourSpecies/errors/augustus*.stderr - contain AUGUSTUS execution errors

  • braker/yourSpecies/startAlign.stderr - check this file if you provided a protein FASTA file and something went wrong during protein alignment

  • braker/yourSpecies/startAlign.stdout - may give clues about the point at which protein alignment went wrong
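Two of the checks in the list above can be scripted. The following is a minimal sketch: the function names are hypothetical, summarize_hints assumes a tab-separated GFF file with the source in column 2 and the feature type in column 3, and count_training_genes wraps the grep call suggested above.

```shell
# Hypothetical helper: count hints per (source, type) pair.
# Prints lines of the form: source<TAB>type<TAB>count
summarize_hints() {
    awk -F'\t' '!/^#/ && NF >= 3 { n[$2 "\t" $3]++ }
                END { for (k in n) print k "\t" n[k] }' "$1" | LC_ALL=C sort
}

# Hypothetical helper: number of training genes in a GenBank file,
# using the grep call suggested in the list above.
count_training_genes() {
    grep -c LOCUS "$1"
}

# Example calls:
# summarize_hints braker/yourSpecies/hintsfile.gff
# count_training_genes braker/yourSpecies/genbank.good.gb
```

A hints file without any line for source b2h and type intron (RNA-Seq) or source ProtHint (proteins) points to the hints generation step as the place where the run went wrong. Including this summary in a bug report saves a round of questions.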

Common problems

  • BRAKER complains that the RNA-Seq file does not correspond to the provided genome file, but I am sure the files correspond to each other!

    Please check the headers of the genome FASTA file. If the headers are long and contain whitespaces, some RNA-Seq alignment tools will truncate sequence names in the BAM file. This leads to an error with BRAKER. Solution: shorten/simplify FASTA headers in the genome file before running the RNA-Seq alignment and BRAKER.
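One way to simplify the headers is to keep only the first whitespace-delimited token of every header line. The following is a minimal sketch, assuming that this first token is unique for every sequence in your genome file; the function name simplify_fasta_headers and the file names in the comment are illustrative.

```shell
# Hypothetical helper: truncate every FASTA header at the first whitespace.
# Sequence lines are passed through unchanged.
simplify_fasta_headers() {
    awk '/^>/ { print $1; next } { print }' "$1"
}

# Example call (file names are placeholders):
# simplify_fasta_headers genome.fa > genome.simple_headers.fa
```

Run this before aligning the RNA-Seq reads, and use the same simplified genome file for both the alignment and BRAKER.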

  • GeneMark fails!

    (a) GeneMark by default only uses contigs longer than 50k for training. If you have a highly fragmented assembly, this might lead to "no data" for training. You can override the default minimal length by setting the BRAKER argument --min_contig=10000.

    (b) see "[something] failed to execute" below.
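To judge whether case (a) applies, you can count how many sequences in your assembly reach a given length. The following is a minimal sketch; the function name count_long_contigs is hypothetical, and treating the threshold as inclusive is an assumption made for illustration.

```shell
# Hypothetical helper: count FASTA sequences with at least min_len bases.
# The default of 50000 mirrors the GeneMark default mentioned above.
count_long_contigs() {
    local fasta="$1" min_len="${2:-50000}"
    awk -v min="$min_len" '
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
        { len += length($0) }
        END { if (seen && len >= min) n++; print n + 0 }' "$fasta"
}

# Example calls (genome.fa is a placeholder):
# count_long_contigs genome.fa          # sequences of at least 50000 bases
# count_long_contigs genome.fa 10000    # sequences of at least 10000 bases
```

If only very few sequences pass the default threshold, lowering it with --min_contig as described in (a) is worth trying.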

  • [something] failed to execute!

    When providing paths to software to BRAKER, please use absolute, non-abbreviated paths. For example, BRAKER might have problems with --SAMTOOLS_PATH=./samtools/ or --SAMTOOLS_PATH=~/samtools/. Please use --SAMTOOLS_PATH=/full/absolute/path/to/samtools/ instead. This applies to all path specifications given as command line options to braker.pl. Relative paths and absolute paths will not pose problems if you export a bash variable instead, or if you append the location of tools to your $PATH variable.
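A relative directory can be converted to an absolute path in the shell before it is passed to braker.pl. This is a minimal sketch; the function name abs_dir is hypothetical and the directory must already exist.

```shell
# Hypothetical helper: print the absolute, symlink-resolved path of an
# existing directory. The cd happens in a subshell, so the caller's
# working directory does not change.
abs_dir() {
    (cd "$1" && pwd -P)
}

# Example use on the command line (./samtools is a placeholder):
# braker.pl ... --SAMTOOLS_PATH="$(abs_dir ./samtools)"
```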

  • GeneMark-ETP in BRAKER dies with '/scratch/11232323': No such file or directory.

    This appears to be related to sorting large files, and it is a problem that depends on the system configuration. Solve it with export TMPDIR=/tmp/ before calling BRAKER via Singularity.

  • BRAKER cannot find the Augustus script XYZ...

    Update Augustus from github with git clone https://github.com/Gaius-Augustus/Augustus.git. Do not use Augustus from other sources. BRAKER is highly dependent on an up-to-date Augustus. Augustus releases happen rather rarely, updates to the Augustus scripts folder occur rather frequently.

  • Does BRAKER depend on Python3?

    It does. The python scripts employed by BRAKER are not compatible with Python2.

  • Why does BRAKER predict more genes than I expected?

    If transposable elements (or similar) have not been masked appropriately, AUGUSTUS tends to predict those elements as protein coding genes. This can lead to a huge number of genes. You can check whether this is the case for your project by BLASTing (or DIAMONDing) the predicted protein sequences against themselves (all vs. all) and counting how many of the proteins have a high number of high quality matches. You can use the output of this analysis to divide your gene set into two groups: the protein coding genes that you want to find and the repetitive elements that were additionally predicted.
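The counting step can be sketched as follows, assuming the all-vs-all search was written in the 12-column tabular format (query in column 1, subject in column 2, e-value in column 11). The function name proteins_with_many_hits and the default thresholds (10 hits, e-value 1e-10) are arbitrary assumptions for illustration, not recommendations from the BRAKER authors. The DIAMOND commands in the comments are not run by the sketch, and proteins.fa is a placeholder for your predicted protein file.

```shell
# The search itself could be produced like this (not executed here):
#   diamond makedb --in proteins.fa --db proteins
#   diamond blastp --db proteins --query proteins.fa --outfmt 6 --out hits.tsv

# Hypothetical helper: list proteins that have at least min_hits non-self
# hits with an e-value of at most max_evalue.
# Prints lines of the form: protein<TAB>number_of_hits
proteins_with_many_hits() {
    local hits="$1" min_hits="${2:-10}" max_evalue="${3:-1e-10}"
    awk -F'\t' -v min="$min_hits" -v maxe="$max_evalue" '
        $1 != $2 && ($11 + 0) <= (maxe + 0) { n[$1]++ }
        END { for (q in n) if (n[q] >= min) print q "\t" n[q] }' "$hits" |
        LC_ALL=C sort
}

# Example call:
# proteins_with_many_hits hits.tsv 10 1e-10 > repeat_candidates.tsv
```

The sketch counts alignment lines, so several alignments against the same subject are counted separately. The resulting list is only a set of candidates for repetitive elements and should be inspected before genes are removed.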

  • I am running BRAKER in Anaconda and something fails...

    Update AUGUSTUS and BRAKER from github with git clone https://github.com/Gaius-Augustus/Augustus.git and git clone https://github.com/Gaius-Augustus/BRAKER.git. The Anaconda installation is great, but it relies on releases of AUGUSTUS and BRAKER - which are often lagging behind. Please use the current GitHub code, instead.

  • Why and where is the GenomeThreader support gone?

    BRAKER is a joint project between teams from University of Greifswald and Georgia Tech. While the group of Mark Borodovsky from Georgia Tech contributes GeneMark expertise, the group of Mario Stanke from University of Greifswald contributes AUGUSTUS expertise. Using GenomeThreader to build training genes for AUGUSTUS in BRAKER circumvents execution of GeneMark. Thus, the GenomeThreader mode is, strictly speaking, not part of the BRAKER project. The previous functionality of BRAKER with GenomeThreader has been moved to GALBA at https://github.com/Gaius-Augustus/GALBA. Note that GALBA has also been extended to use Miniprot instead of GenomeThreader.

  • My BRAKER gene set has too many BUSCO duplicates!

    AUGUSTUS within BRAKER can predict alternative splicing isoforms. In addition, merging the AUGUSTUS and GeneMark gene sets with TSEBRA within BRAKER may result in additional isoforms for a single gene. The BUSCO duplicates usually come from alternative splicing isoforms, i.e. they are expected.

  • Augustus and/or etraining within BRAKER complain that the file aug_cmdln_parameters.json is missing. Even though I am using the latest Singularity container!

    BRAKER copies the AUGUSTUS_CONFIG_PATH folder to a writable location. In older versions of Augustus, that file indeed did not exist. If the local writable copy of the folder already exists, BRAKER will not re-copy it. Simply delete the old folder. (It is often ~/.augustus, so you can simply do rm -rf ~/.augustus; the folder might reside in $PWD if your home directory was not writable.)

  • I sit behind a firewall, compleasm cannot download the BUSCO files, what can I do?

    See the comment on Issue #785.

Citing BRAKER and software called by BRAKER

Since BRAKER is a pipeline that calls several Bioinformatics tools, publication of results obtained by BRAKER requires that not only BRAKER is cited, but also the tools that are called by BRAKER. BRAKER will output a file what-to-cite.txt in the BRAKER working directory, informing you about which exact sources apply to your run.

  • Always cite:

    • Stanke, M., Diekhans, M., Baertsch, R. and Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics, doi: 10.1093/bioinformatics/btn013.

    • Stanke, M., Schöffmann, O., Morgenstern, B. and Waack, S. (2006). Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62.

  • If you provided any kind of evidence for BRAKER, cite:

    • Gabriel, L., Bruna, T., Hoff, K. J., Borodovsky, M., Stanke, M. (2021) TSEBRA: transcript selector for BRAKER. BMC Bioinformatics 22, 1-12.
  • If you provided both short read RNA-Seq evidence and a large database of proteins, cite:

    • Gabriel, L., Bruna, T., Hoff, K. J., Ebel, M., Lomsadze, A., Borodovsky, M., Stanke, M. (2023). BRAKER3: Fully Automated Genome Annotation Using RNA-Seq and Protein Evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. bioRxiv, doi: 10.1101/2023.06.10.544449.

    • Bruna, T., Lomsadze, A., Borodovsky, M. (2023). GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistence with Extrinsic Data. bioRxiv, doi: 10.1101/2023.01.13.524024.

    • Kovaka, S., Zimin, A. V., Pertea, G. M., Razaghi, R., Salzberg, S. L., & Pertea, M. (2019). Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome biology, 20(1):1-13.

    • Pertea, G., & Pertea, M. (2020). GFF utilities: GffRead and GffCompare. F1000Research, 9.

    • Quinlan, A. R. (2014). BEDTools: the Swiss‐army tool for genome feature analysis. Current protocols in bioinformatics, 47(1):11-12.

  • If the only source of evidence for BRAKER was a large database of protein sequences, cite:

    • Bruna, T., Hoff, K.J., Lomsadze, A., Stanke, M., & Borodovsky, M. (2021). BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database. NAR Genomics and Bioinformatics 3(1):lqaa108, doi: 10.1093/nargab/lqaa108.
  • If the only source of evidence for BRAKER was RNA-Seq data, cite:

    • Hoff, K.J., Lange, S., Lomsadze, A., Borodovsky, M. and Stanke, M. (2016). BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 32(5):767-769.

    • Lomsadze, A., Burns, P.D. and Borodovsky, M. (2014). Integration of Mapped RNA-Seq Reads into Automatic Training of Eukaryotic Gene Finding Algorithm. Nucleic Acids Research 42(15): e119--e119.

  • If you called BRAKER3 with an IsoSeq BAM file, or if you invoked the --busco_lineage option, cite:

    • Bruna, T., Gabriel, L., Hoff, K. J. (2024). Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA. arXiv, doi: 10.48550/arXiv.2403.19416 .
  • If you called BRAKER with the --busco_lineage option, in addition, cite:

    • Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210-3212.

    • Li, H. (2023). Protein-to-genome alignment with miniprot. Bioinformatics, 39(1), btad014.

    • Huang, N., & Li, H. (2023). compleasm: a faster and more accurate reimplementation of BUSCO. Bioinformatics, 39(10), btad595.

  • If any kind of AUGUSTUS training was performed by BRAKER, check carefully whether you configured BRAKER to use NCBI BLAST or DIAMOND. One of them was used to filter out redundant training gene structures.

    • If you used NCBI BLAST, please cite:

      • Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215:403--410.

      • Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). Blast+: architecture and applications. BMC bioinformatics, 10(1):421.

    • If you used DIAMOND, please cite:

      • Buchfink, B., Xie, C., Huson, D.H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods 12:59-60.
  • If BRAKER was executed with a genome file and no extrinsic evidence, then GeneMark-ES was used, cite:

    • Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y.O. and Borodovsky, M. (2005). Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 33(20):6494--6506.

    • Ter-Hovhannisyan, V., Lomsadze, A., Chernoff, Y.O. and Borodovsky, M. (2008). Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome research, pages gr--081612, 2008.

    • Hoff, K.J., Lomsadze, A., Borodovsky, M. and Stanke, M. (2019). Whole-Genome Annotation with BRAKER. Methods Mol Biol. 1962:65-95, doi: 10.1007/978-1-4939-9173-0_5.

  • If BRAKER was run with proteins as source of evidence, please cite all tools that are used by the ProtHint pipeline to generate hints:

    • Bruna, T., Lomsadze, A., & Borodovsky, M. (2020). GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genomics and Bioinformatics, 2(2), lqaa026.

    • Buchfink, B., Xie, C., Huson, D.H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods 12:59-60.

    • Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y.O. and Borodovsky, M. (2005). Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 33(20):6494--6506.

    • Iwata, H., and Gotoh, O. (2012). Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic acids research, 40(20), e161-e161.

    • Gotoh, O., Morita, M., Nelson, D. R. (2014). Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. BMC bioinformatics, 15(1), 189.

  • If BRAKER was executed with RNA-Seq alignments in bam-format, then SAMtools was used, cite:

    • Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078-9.

    • Barnett, D.W., Garrison, E.K., Quinlan, A.R., Strömberg, M.P. and Marth G.T. (2011). BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics, 27(12):1691-2

  • If BRAKER downloaded RNA-Seq libraries from SRA using their IDs, cite SRA, SRA toolkit, and HISAT2:

    • Leinonen, R., Sugawara, H., Shumway, M., & International Nucleotide Sequence Database Collaboration. (2010). The sequence read archive. Nucleic acids research, 39(suppl_1), D19-D21.

    • SRA Toolkit Development Team (2020). SRA Toolkit. https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software.

    • Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8):907-915.

  • If BRAKER was executed using RNA-Seq data in FASTQ format, cite HISAT2:

    • Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8):907-915.
  • If BRAKER called MakeHub for creating a track data hub for visualization of BRAKER results with the UCSC Genome Browser, cite:

    • Hoff, K. J. (2019). MakeHub: fully automated generation of UCSC genome browser assembly hubs. Genomics, Proteomics and Bioinformatics, 17(5), 546-549.
  • If BRAKER called GUSHR for generating UTRs, cite:

    • Keilwagen, J., Hartung, F., Grau, J. (2019) GeMoMa: Homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol Biol. 1962:161-177, doi: 10.1007/978-1-4939-9173-0_9.

    • Keilwagen, J., Wenk, M., Erickson, J.L., Schattat, M.H., Grau, J., Hartung F. (2016) Using intron position conservation for homology-based gene prediction. Nucleic Acids Research, 44(9):e89.

    • Keilwagen, J., Hartung, F., Paulini, M., Twardziok, S.O., Grau, J. (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics, 19(1):189.

License

All source code, i.e. scripts/*.pl and scripts/*.py, is under the Artistic License (see http://www.opensource.org/licenses/artistic-license.php).

Footnotes

[F1] EX = ES/ET/EP/ETP, all available for download under the name GeneMark-ES/ET/EP

[F2] Please use the latest version from the master branch of AUGUSTUS distributed by the original developers; it is available from GitHub at https://github.com/Gaius-Augustus/Augustus. Problems have been reported by users who tried to run BRAKER with AUGUSTUS releases maintained by third parties, e.g. Bioconda.

[F4] install with sudo apt-get install cpanminus

[F6] The binary may e.g. reside in bamtools/build/src/toolkit

References

[R0] Bruna, Tomas, Hoff, Katharina J., Lomsadze, Alexandre, Stanke, Mario, and Borodovsky, Mark. 2021. “BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database." NAR Genomics and Bioinformatics 3(1):lqaa108.

[R1] Hoff, Katharina J, Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2015. “BRAKER1: Unsupervised Rna-Seq-Based Genome Annotation with Genemark-et and Augustus.” Bioinformatics 32 (5). Oxford University Press: 767--69.

[R2] Lomsadze, Alexandre, Paul D Burns, and Mark Borodovsky. 2014. “Integration of Mapped Rna-Seq Reads into Automatic Training of Eukaryotic Gene Finding Algorithm.” Nucleic Acids Research 42 (15). Oxford University Press: e119--e119.

[R3] Stanke, Mario, Mark Diekhans, Robert Baertsch, and David Haussler. 2008. “Using Native and Syntenically Mapped cDNA Alignments to Improve de Novo Gene Finding.” Bioinformatics 24 (5). Oxford University Press: 637--44.

[R4] Stanke, Mario, Oliver Schöffmann, Burkhard Morgenstern, and Stephan Waack. 2006. “Gene Prediction in Eukaryotes with a Generalized Hidden Markov Model That Uses Hints from External Sources.” BMC Bioinformatics 7 (1). BioMed Central: 62.

[R5] Barnett, Derek W, Erik K Garrison, Aaron R Quinlan, Michael P Strömberg, and Gabor T Marth. 2011. “BamTools: A C++ Api and Toolkit for Analyzing and Managing Bam Files.” Bioinformatics 27 (12). Oxford University Press: 1691--2.

[R6] Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. “The Sequence Alignment/Map Format and Samtools.” Bioinformatics 25 (16). Oxford University Press: 2078--9.

[R7] Gremme, G. 2013. “Computational Gene Structure Prediction.” PhD thesis, Universität Hamburg.

[R8] Gotoh, Osamu. 2008a. “A Space-Efficient and Accurate Method for Mapping and Aligning cDNA Sequences onto Genomic Sequence.” Nucleic Acids Research 36 (8). Oxford University Press: 2630--8.

[R9] Iwata, Hiroaki, and Osamu Gotoh. 2012. “Benchmarking Spliced Alignment Programs Including Spaln2, an Extended Version of Spaln That Incorporates Additional Species-Specific Features.” Nucleic Acids Research 40 (20). Oxford University Press: e161--e161.

[R10] Gotoh, Osamu. 2008b. “Direct Mapping and Alignment of Protein Sequences onto Genomic Sequence.” Bioinformatics 24 (21). Oxford University Press: 2438--44.

[R11] Slater, Guy St C, and Ewan Birney. 2005. “Automated Generation of Heuristics for Biological Sequence Comparison.” BMC Bioinformatics 6(1). BioMed Central: 31.

[R12] Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215:403--10.

[R13] Camacho, Christiam, et al. 2009. “BLAST+: architecture and applications.” BMC Bioinformatics 10(1): 421.

[R14] Lomsadze, A., V. Ter-Hovhannisyan, Y.O. Chernoff, and M. Borodovsky. 2005. “Gene identification in novel eukaryotic genomes by self-training algorithm.” Nucleic Acids Research 33 (20): 6494--6506. doi:10.1093/nar/gki937.

[R15] Ter-Hovhannisyan, Vardges, Alexandre Lomsadze, Yury O Chernoff, and Mark Borodovsky. 2008. “Gene Prediction in Novel Fungal Genomes Using an Ab Initio Algorithm with Unsupervised Training.” Genome Research. Cold Spring Harbor Lab, gr--081612.

[R16] Hoff, K.J. 2019. MakeHub: Fully automated generation of UCSC Genome Browser Assembly Hubs. Genomics, Proteomics and Bioinformatics, in press, preprint on bioRxiv, doi: https://doi.org/10.1101/550145.

[R17] Bruna, T., Lomsadze, A., & Borodovsky, M. 2020. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genomics and Bioinformatics, 2(2), lqaa026. doi: https://doi.org/10.1093/nargab/lqaa026.

[R18] Kriventseva, E. V., Kuznetsov, D., Tegenfeldt, F., Manni, M., Dias, R., Simão, F. A., and Zdobnov, E. M. 2019. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Research, 47(D1), D807-D811.

[R19] Keilwagen, J., Hartung, F., Grau, J. (2019) GeMoMa: Homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol Biol. 1962:161-177, doi: 10.1007/978-1-4939-9173-0_9.

[R20] Keilwagen, J., Wenk, M., Erickson, J.L., Schattat, M.H., Grau, J., Hartung F. (2016) Using intron position conservation for homology-based gene prediction. Nucleic Acids Research, 44(9):e89.

[R21] Keilwagen, J., Hartung, F., Paulini, M., Twardziok, S.O., Grau, J. (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics, 19(1):189.

[R22] SRA Toolkit Development Team (2020). SRA Toolkit. https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software.

[R23] Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8):907-915.

[R24] Quinlan, A. R. (2014). BEDTools: the Swiss‐army tool for genome feature analysis. Current protocols in bioinformatics, 47(1):11-12.

[R25] Kovaka, S., Zimin, A. V., Pertea, G. M., Razaghi, R., Salzberg, S. L., & Pertea, M. (2019). Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome biology, 20(1):1-13.

[R26] Pertea, G., & Pertea, M. (2020). GFF utilities: GffRead and GffCompare. F1000Research, 9.

[R27] Huang, N., & Li, H. (2023). compleasm: a faster and more accurate reimplementation of BUSCO. Bioinformatics, 39(10), btad595.

[R28] Bruna, T., Gabriel, L. & Hoff, K. J. (2024). Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA. arXiv, https://doi.org/10.48550/arXiv.2403.19416 .

braker's People

Contributors

baberlevi, douglasgscofield, eernst, epaule, fabiangumz, katharinahoff, kiwiroy, larsgab, mariostanke, michaelkarlcoleman, nathanweeks, rsettlage, sanjaysrikakulam, satyamkapoor, smoe, tomasbruna, youreprettygood

braker's Issues

AUGUSTUS_CONFIG_PATH/species (in this case /.../Augustus/config/) is not writeable.

Hello!

I'm trying to predict genes in a novel genome with BRAKER,
but I get an error.
I ran BRAKER with the following command:

/.../BRAKER/scripts/braker.pl --genome=/.../final.genome.scf.masked.sort.fasta --prot_seq=/.../compgene.fasta --prg=gth --ALIGNMENT_TOOL_PATH=/.../gth-1.7.1-Linux_x86_64-64bit/bin --trainFromGth --cores 30 --AUGUSTUS_CONFIG_PATH=/.../Augustus/config/

But I've got the following error:

Use of uninitialized value $species in concatenation (.) or string at /.../BRAKER/scripts/braker.pl line 1498.
Sun Jun 9 15:07:46 2019: braker.pl version 2.1.3

Sun Jun 9 15:07:46 2019: Configuring of BRAKER for using external tools...

Sun Jun 9 15:07:46 2019: Command line flag --AUGUSTUS_CONFIG_PATH was provided. Setting $AUGUSTUS_CONFIG_PATH in braker.pl to /.../Augustus/config.
Sun Jun 9 15:07:46 2019: ERROR: in file /.../BRAKER/scripts/braker.pl at line 1496
AUGUSTUS_CONFIG_PATH/species (in this case /.../Augustus/config/) is not writeable.
There are 3 alternative ways to set this variable for braker.pl:
a) provide command-line argument --AUGUSTUS_CONFIG_PATH=/your/path
b) use an existing environment variable $AUGUSTUS_CONFIG_PATH
for setting the environment variable, run
export AUGUSTUS_CONFIG_PATH=/your/path
in your shell. You may append this to your .bashrc or
.profile file in order to make the variable available to all
your bash sessions.
c) braker.pl can try guessing the location of
$AUGUSTUS_CONFIG_PATH from an augustus executable that is
available in your $PATH variable.
If you try to rely on this option, you can check by typing
which augustus
in your shell, whether there is an augustus executable in
your $PATH
Be aware: the $AUGUSTUS_CONFIG_PATH must be writable for
braker.pl because braker.pl is a pipeline that
optimizes parameters that reside in that
directory. This might be problematic in case you
are using a system-wide installed augustus
installation that resides in a directory that is
not writable to you as a user.

So I checked whether the directory /.../Augustus/config/ is writable or not.
But it was obviously writable because I found the access permission like this:

[ ~ Augustus]$ ls -ltr
.....
drwxr-xr-x 7 ... qbg 72 ... config

How could I fix this?
Any ideas would be appreciated!!
Thank you in advance!

question about splice sites

Dear Braker,
(your software has been great for me on my published genomes - thanks!!!)

In line 7303 of the code in the latest Braker (December 2018) there is a line --allow_hinted_splicesites=gcag,atac - these are non-canonical splice sites, which I know my beast of interest has. Does Braker by default allow non-canonical splice sites (depending on RNAseq or protein evidence), or do you have to specify them? See the line below.

In the developmental options:
--splice_sites (default GTAG) ... this says it is for the UTR regions. Is this for the UTR only, or do you need to specify the non-canonical splice sites for Braker to use in the main genic predictions?

cheers,

Peter Thorpe

AUGUSTUS_CONFIG_PATH not writeable

Hello!

I've been trying to use BRAKER but it seems that there is a problem with AUGUSTUS_CONFIG_PATH.
This is the command I run:

katerina87@WE11sv03:~/Tools/BRAKER2$ ./BRAKER/scripts/braker.pl -softmasking --genome=../RepeatMasker/polished_assembly1_x3.fasta.masked -cores=8 --AUGUSTUS_CONFIG_PATH=/opt/augustus-3.3.2/config --AUGUSTUS_BIN_PATH=/opt/augustus-3.3.2/bin --AUGUSTUS_SCRIPTS_PATH=/opt/augustus-3.3.2/scripts --bam=../../My_MinION/MMETSP/RNA_seqs_raw/Aligned.out.sorted.bam

And this is the error:

Use of uninitialized value $species in concatenation (.) or string at ./BRAKER/scripts/braker.pl line 1455.
Mon Mar 4 10:42:02 2019: braker.pl version 2.1.2

Mon Mar 4 10:42:02 2019: Configuring of BRAKER for using external tools...

Mon Mar 4 10:42:02 2019: Command line flag --AUGUSTUS_CONFIG_PATH was provided. Setting $AUGUSTUS_CONFIG_PATH in braker.pl to /opt/augustus-3.3.2/config.
Mon Mar 4 10:42:02 2019: ERROR: in file ./BRAKER/scripts/braker.pl at line 1453
AUGUSTUS_CONFIG_PATH/species (in this case /opt/augustus-3.3.2/config/) is not writeable.
There are 3 alternative ways to set this variable for braker.pl:
a) provide command-line argument --AUGUSTUS_CONFIG_PATH=/your/path
b) use an existing environment variable $AUGUSTUS_CONFIG_PATH
for setting the environment variable, run
export AUGUSTUS_CONFIG_PATH=/your/path
in your shell. You may append this to your .bashrc or
.profile file in order to make the variable available to all
your bash sessions.
c) braker.pl can try guessing the location of
$AUGUSTUS_CONFIG_PATH from an augustus executable that is
available in your $PATH variable.
If you try to rely on this option, you can check by typing
which augustus
in your shell, whether there is an augustus executable in
your $PATH
Be aware: the $AUGUSTUS_CONFIG_PATH must be writable for
braker.pl because braker.pl is a pipeline that
optimizes parameters that reside in that
directory. This might be problmatic in case you
are using a system-wide installed augustus
installation that resides in a directory that is
not writable to you as a user.

In the beginning I thought the problem was that I didn't have permission to write to the config folder, but I get the same error even after becoming the owner of the augustus-3.3.2 folder.
In the meantime, when I try to copy it to my local folder by running:

cp -r /opt/augustus-3.3.2/config/ . export AUGUSTUS_CONFIG_PATH=./config export AUGUSTUS_BIN_PATH=/opt/augustus-3.3.2/bin export AUGUSTUS_SCRIPTS_PATH=/opt/augustus-3.3.2/scripts

I get this:
cp: target 'AUGUSTUS_SCRIPTS_PATH=/opt/augustus-3.3.2/scripts' is not a directory

which I don't understand since the path is correct.

Maybe I should also mention that augustus is executed through a link found in /usr/local/bin/augustus
Could the problem be that?
Any ideas that could help solve this, would be greatly appreciated.

Thank you!

BRAKER on Conda Cloud

Hi,

Would it be possible to provide the newest version of BRAKER on the Anaconda Cloud? Currently, BRAKER version 1.7 is available there: https://anaconda.org/bioconda/braker

Many thanks,

Random fails with multiple BAM files

I run BRAKER with multiple BAM files (separate forward and reverse strand) and protein evidence. Several times it printed a message like this:

WARNING: Format of hintsfile /data2/results2/gusev/jellyfish/annotation/braker2/braker/genemark_hintsfile.gff is incorrect in the last column, possibly src=tag is missing!

And later GeneMark-ETP fails with this or similar message:

error, unexpected ID format found on line: scaffold_2158        b2h     intron  1173736 1173804 1       -       i=4;src=E      scaffold_2158   b2h     intron  1173736 1173804 998     -       .       mult=998;pri=4;src=E

Sometimes restarting from scratch helps, sometimes it does not.

It seems like some lines in genemark_hintsfile.gff are garbled on merge, because
the individual bam2hints.temp.*.gff files seem to be correct.

I noticed that parallel processing in the make_rnaseq_hints routine involves writing into the same file from different threads. See this line:

$cmdString .= "cat $bam_temp >>$hintsfile_temp";

After changing line 4473 from

 my $pj = new Parallel::ForkManager($CPU);

to

 my $pj = new Parallel::ForkManager(1);

the error goes away (however, I did not do any elaborate testing). Obviously, this is a temporary fix and the error (if I am right about the cause) should be fixed in a different way.

Hope this helps someone.
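For illustration, here is a sketch of one way to avoid interleaved writes while keeping the jobs parallel: each job writes only to its own temporary file, and the merge happens once, serially, after all jobs have finished. This is not BRAKER's code. The function make_hints_for is a hypothetical stand-in for a bam2hints call, and the file names are only modelled on the ones mentioned above.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for one bam2hints call: writes 1000 GFF-like lines.
make_hints_for() {
    # $1 = job number, $2 = output file owned by this job alone
    awk -v job="$1" 'BEGIN {
        for (i = 1; i <= 1000; i++)
            printf "scaffold_%d\tb2h\tintron\t%d\t%d\t1\t-\t.\tmult=1;pri=4;src=E\n", job, i, i + 50
    }' > "$2"
}

# Run the jobs in parallel; no two jobs share an output file.
for job in 1 2 3 4; do
    make_hints_for "$job" "$workdir/bam2hints.temp.$job.gff" &
done
wait    # block until every background job has finished

# Merge serially, in a fixed order, only after all writers are done.
cat "$workdir"/bam2hints.temp.[1-4].gff > "$workdir/hintsfile.temp.gff"

# Sanity checks: total line count, and lines that do not have 9 tab-separated columns.
merged_lines=$(wc -l < "$workdir/hintsfile.temp.gff" | tr -d ' ')
bad_lines=$(awk -F'\t' 'NF != 9' "$workdir/hintsfile.temp.gff" | wc -l | tr -d ' ')
echo "merged lines: $merged_lines, malformed lines: $bad_lines"
```

The same column check (NF != 9) could be run on an existing genemark_hintsfile.gff to see whether it contains garbled lines.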

WARNING: Number of good genes is low (30).

I ran braker2 with RNA-seq data on a green alga genome with low GC content.
I got the final output with this warning:
WARNING: Number of good genes is low (30). Recommended are at least 600 genes

Then I ran BUSCO with the eukarya database on the augustus.hints.aa file generated by braker; 87.5% of the BUSCO groups are missing.

How should I proceed with this genome?

optimize_augustus.pl line 1224: Could not read the accuracy values out of predictions.txt when processing bucket 1

While using BRAKER2 I keep getting the error below.
My command:
perl /he_lab/share/data/tuguangxian/tgx/software/miniconda3/envs/augustus/bin/optimize_augustus.pl --rounds=5 --species=qiaozuigui --kfold=9 --AUGUSTUS_CONFIG_PATH=/he_lab/share/data/local/augustus/augustus-3.3.2/config --onlytrain=/he_lab/share/data/tuguangxian/tgx/data/genome/qiaozuigui/nGS/05_braker2/braker/qiaozuigui/train.gb.train.train --cpus=9 /he_lab/share/data/tuguangxian/tgx/data/genome/qiaozuigui/nGS/05_braker2/braker/qiaozuigui/train.gb.train.test 1>/he_lab/share/data/tuguangxian/tgx/data/genome/qiaozuigui/nGS/05_braker2/braker/qiaozuigui/optimize_augustus.stdout

error:
replaced tx with 0 MEA txs
replaced tx with 0 MEA txs
replaced tx with 0 MEA txs
sh: line 1: 81016 Segmentation fault (core dumped) augustus --species=qiaozuigui --AUGUSTUS_CONFIG_PATH=/he_lab/share/data/local/augustus/augustus-3.3.2/config/ --/Constant/dss_end=4 --/Constant/dss_start=3 --/Constant/ass_start=3 --/Constant/ass_end=2 --/Constant/ass_upwindow_size=30 --/IntronModel/d=100 --/IntronModel/ass_motif_memory=3 --/IntronModel/ass_motif_radius=3 --/ExonModel/tis_motif_memory=3 --/ExonModel/tis_motif_radius=2 --/Constant/trans_init_window=20 --/Constant/init_coding_len=15 --/ExonModel/patpseudocount=5.0 --/ExonModel/etpseudocount=3 --/ExonModel/etorder=2 --/Constant/intterm_coding_len=5 --/ExonModel/slope_of_bandwidth=0.3 --/ExonModel/minwindowcount=10 --/IGenicModel/patpseudocount=5.0 --/IntronModel/patpseudocount=5.0 --/IntronModel/slope_of_bandwidth=0.4 --/IntronModel/minwindowcount=4 --/IntronModel/asspseudocount=0.00266 --/IntronModel/dsspseudocount=0.0005 --/IntronModel/dssneighborfactor=0.00173 --/ExonModel/minPatSum=233.3 --/Constant/probNinCoding=0.23 --/Constant/decomp_num_steps=1 --/ExonModel/infile=exon-tmp2.pbl --/IntronModel/infile=intron-tmp2.pbl --/IGenicModel/infile=igenic-tmp2.pbl --/UtrModel/infile=utr-tmp2.pbl tmp_opt_qiaozuigui/bucket2.gb > tmp_opt_qiaozuigui/predictions-2.txt
Could not read the accuracy values out of predictions.txt when processing bucket 1. at /he_lab/share/data/tuguangxian/tgx/software/miniconda3/envs/augustus/bin/optimize_augustus.pl line 1224
Any idea what the reason may be? Thank you!

ERROR in randomSplit.pl line 47: LOCUS names in genbank file are not unique!

Hi,
I'm trying to run BRAKER/2.10 using only proteins of short evolutionary distance. My script is:

braker.pl \
 --cores=8 \
 --softmasking=1 \
 --genome=/home/CAM/qlin/CE10_Genome/versions/1.9.2/repeatMasker_20181201_soft/CE10g_v1.92.fa.masked \
 --prot_seq=/home/CAM/qlin/LF10_Genome/BRAKER2/t3/braker/Sp_12/augustus.hints.filter.fa \
 --prg=gth \
 --gth2traingenes \
 --trainFromGth 

The program failed with this error:
ERROR in randomSplit.pl line 47: LOCUS names in genbank file are not unique!
And I found this error in the gbFilterEtraining.stderr file:

mRNA contains character m
GBProcessor::getGeneList(): GBProcessor::getJoin( ):  failed!!!
Encountered error after reading 1 annotations.

Could you help me solve this problem? Thank you so much!
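The randomSplit.pl message can be checked directly against the GenBank file. A minimal sketch follows, assuming the usual GenBank layout in which each record starts with a line whose first word is LOCUS and whose second word is the locus name. The function name is my own. It does not address the getJoin failure reported in gbFilterEtraining.stderr, which may be a separate problem.

```shell
# Print every LOCUS name that occurs more than once in a GenBank file.
duplicate_locus_names() {
    # $1 = GenBank file
    awk '$1 == "LOCUS" { print $2 }' "$1" | sort | uniq -d
}
```

Calling duplicate_locus_names on the training GenBank file prints nothing if all LOCUS names are unique.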

Number of cores for large genomes

Hi Katharina,
I have a question about setting the number of cores for genome annotation. I have a eukaryotic genome (~500Mb) with a large number of scaffolds (39888). Most of the scaffolds are > 20Mb, and the rest (14% of the genome) are fragmented. I set --cores=32 to speed up the annotation, and then got this warning while running braker:

file genome.fa contains a highly fragmented assembly (39888 scaffolds). This may lead to problems when running AUGUSTUS via braker in parallelized mode. You set --cores=32. You should run braker.pl in linear mode on such genomes, though (--cores=1).

I want to make sure the annotation finishes reasonably fast, so I was wondering whether there are other ways to speed up the process. For example, can I split the genome into two parts, the longer scaffolds and the shorter scaffolds, and annotate the longer scaffolds (> 20Mb) with --cores=32 and the shorter ones with --cores=1? I don't know whether splitting the genome would also affect building the gene models.

Thank you for any suggestions on this!

Yiyuan
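As a starting point for the split described above, the scaffold lengths can be tabulated with a few lines of awk. This is only a sketch of the bookkeeping. It says nothing about whether splitting affects training, and the function names and the threshold argument are my own.

```shell
# Print "name<TAB>length" for every sequence in a FASTA file.
fasta_lengths() {
    # $1 = FASTA file; the name is the header up to the first whitespace
    awk '
        /^>/ { if (name != "") print name "\t" len
               name = substr($1, 2); len = 0; next }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Print the names of all sequences with at least the given length.
scaffolds_at_least() {
    # $1 = FASTA file, $2 = minimum length in bp
    fasta_lengths "$1" | awk -F'\t' -v min="$2" '$2 >= min { print $1 }'
}
```

For example, scaffolds_at_least genome.fa 20000000 would list the scaffolds of 20 Mb or more; the remaining names form the second part.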

/yourSpecies/genome.fa replaced by protein file.

When I run something like the command below, I find that /yourSpecies/genome.fa, which should be the genome.fasta file, gets replaced with proteins.fa. As a result, the hints file comes up empty.
The filterIntronsFindStrand.stderr file is filled with the following warning for each sequence of the genome file:
WARNING: 'Scbe7cn_1026_HRSCAF_1040' does not match any sequence in the fasta file. Maybe the two files do not belong together even though the sequences are there in the input file.

braker.pl --species=yourSpecies --genome=genome.fasta
--bam=file1.bam,file2.bam --prot_seq=proteins.fa
--prg=(gth|exonerate|spaln)

Which version of GeneMark-ES does BRAKER v2.1.2 depend on?

Hi,

I installed braker2 with conda install braker2 and downloaded GeneMark-ES v4.38 from http://topaz.gatech.edu/GeneMark/license_download.cgi.

When I ran braker.pl --species=test --genome=../00.ref/ref.fa --hints=../01.gth_alignment/merge.hints --epmode, I got an error:

error on command line: /programs/GeneMark-ES/gm_et_linux_64/gmes_petap/gmes_petap.pl
Unknown option: ep_score
Unknown option: ep

The algorithm options listed by gmes_petap.pl look like this:

Algorithm options
  --ES           to run self-training
  --fungus       to run algorithm with branch point model (most useful for fungal genomes)
  --ET           [filename]; to run training with introns coordinates from RNA-Seq read alignments (GFF format)
  --et_score     [number]; 4 (default) minimum score of intron in initiation of the ET algorithm
  --evidence     [filename]; to use in prediction external evidence (RNA or protein) mapped to genome
  --training     to run only training step
  --prediction   to run only prediction step
  --predict_with [filename]; predict genes using this file species specific parameters (bypass regular training and prediction steps)

Does this mean that the EP algorithm has been removed?

I want to use protein sequences of unknown evolutionary distance to predict the genes of the genome. Looking forward to your answer.

Shenglong

RNA.bam does not exist. ERROR

Hi, I am attempting to run a BRAKER annotation on my species of interest.

I keep running into an issue where the RNAseq.bam file cannot be detected:

Working Directory: /home/tng23/rds/rds-cj107-jiggins-rds/rds-cj107-heliconius/tng23/Project-Anno-HerIll/Hermetia_illucens-BRAKER

Working on Node: login-n-1

Date Started: Tue  2 Apr 14:04:49 BST 2019

intel/cce(15):ERROR:105: Unable to locate a modulefile for 'intel/cce/14.0.3.174'
intel/fce(15):ERROR:105: Unable to locate a modulefile for 'intel/fce/14.0.3.174'
intel/mkl(15):ERROR:105: Unable to locate a modulefile for 'intel/mkl/11.1.3.174'
NEXT STEP: check files and settings
NEXT STEP: check options
ERROR: BAM file ~/rds/rds-cj107-jiggins-rds/rds-cj107-heliconius/tng23/Project-Anno-HerIll/Hermetia_illucens-BRAKER/STAR-map_test-01.output/RNA-mapped-test.bam does not exist. Please check.
... options check complete.

I have run both the test data (after grabbing it separately) and my own data, with the same error, from a script as follows:

#!/bin/bash
#SBATCH -p skylake
#SBATCH -A JIGGINS-SL2-CPU
#SBATCH -J BRAKER
#SBATCH --time=36:00:00
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --tasks=32
#SBATCH --exclusive

# Provide general information about job

printf "\nWorking Directory: $(pwd)\n"
printf "\nWorking on Node: $(hostname)\n"
printf "\nDate Started: $(date)\n\n"

# Activate the miniconda environment with required tools on;

module load perl/5.20.0
module load genemark/4.32
module load python/3.4.1
module load bamtools/2.4.2
module load miniconda3/4.5.1
source activate Biopython-local

# Give command for the script to run;

time braker.pl \
        --GENEMARK_PATH=~/privatemodules/gm_et_linux_64/gmes_petap \
        --BAMTOOLS_PATH=~/.conda/envs/Biopython-local/bin \
        --AUGUSTUS_CONFIG_PATH=~/privatemodules/augustus-3.3.2/bin \
        --genome=~/rds/hpc-work/Data1/Genome-Assembly_BSF_PB-10X_Sanger_2019-02-12/genomic_resources/iHerIll_ref-genome.fasta \
        --bam=~/rds/rds-cj107-jiggins-rds/rds-cj107-heliconius/tng23/Project-Anno-HerIll/Hermetia_illucens-BRAKER/STAR-map_test-01.output/RNA-mapped-test.bam \
        --softmasking \
        --species=Hermetia_illucens01 \
        --cores=1

# Information on time finished

printf "\nDate Finished: $(date)\n\n"

The same happens with the test data:

#!/bin/bash

#SBATCH -J BRAKER
#SBATCH -o BRAKER-%j.out
#SBATCH -e BRAKER-%j.out
#SBATCH --time=36:00:00
#SBATCH [email protected]
#SBATCH --nodes=1
#SBATCH --tasks=32
#SBATCH -p skylake
#SBATCH --exclusive

# Provide general information about job

printf "\nWorking Directory: $(pwd)\n"
printf "\nWorking on Node: $(hostname)\n"
printf "\nDate Started: $(date)\n\n"

# Activate the miniconda environment with required tools on;

module load bamtools/2.4.2
module load miniconda3/4.5.1
source activate Biopython-local

# Give command for the script to run;

time    braker.pl \
        --GENEMARK_PATH=/home/tng23/privatemodules/gm_et_linux_64/gmes_petap \
        --BAMTOOLS_PATH=/home/tng23/.conda/envs/Biopython-local/bin \
        --AUGUSTUS_CONFIG_PATH=/home/tng23/privatemodules/augustus-3.3.2/bin \
        --genome=genome.fa --bam=RNAseq.bam --softmasking \
        --species=Species-Test-01 \
        --cores=1

# Information on time finished

printf "\nDate Finished: $(date)\n\n"

I have double-checked the paths and ensured they are OK. Running directly from the command line detects the same BAM files (before running into another error), so the issue must lie with submitting from a script.

I haven't seen any recorded issues from script submission, is there any advice or similar problems observed elsewhere?

Thanks,
Tom

EDIT: BRAKER v1.9 GeneMark-ES Suite version 4.38 augustus-3.3.2

On submitting from the command line

(Biopython-local) [tng23@login-n-1 example]$ braker.pl --GENEMARK_PATH=/home/tng23/privatemodules/gm_et_linux_64/gmes_petap --BAMTOOLS_PATH=/home/tng23/.conda/envs/Biopython-local/bin --AUGUSTUS_CONFIG_PATH=/home/tng23/privatemodules/augustus-3.3.2/bin --genome=genome.fa --bam=RNAseq.bam --species=Species-Test-01 --cores=1

NEXT STEP: check files and settings
NEXT STEP: check options
... options check complete.

WARNING: /home/tng23/privatemodules/BRAKER/example/braker/Species-Test-01 already exists. Braker will use existing files, if they are newer than the input files. You can choose another working directory with --workingdir=dir or overwrite it with --overwrite

NEXT STEP: create SAM header file /home/tng23/privatemodules/BRAKER/example/braker/Species-Test-01/RNAseq_header.sam.
SAM file /home/tng23/privatemodules/BRAKER/example/braker/Species-Test-01/RNAseq_header.sam complete.

NEXT STEP: check BAM headers
headers check for BAM file /home/tng23/privatemodules/BRAKER/example/RNAseq.bam complete.

NEXT STEP: make hints from BAM file /home/tng23/privatemodules/BRAKER/example/RNAseq.bam
failed to execute: Inappropriate ioctl for device
'''
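One difference between the first batch script and the interactive command is worth checking: the script passes paths starting with ~ inside --option=value words, while the interactive command uses absolute paths. The shell expands ~ only at the start of a word (or in a variable assignment), so --bam=~/... reaches braker.pl with a literal ~. That matches the unexpanded path shown in the "does not exist" error. The small demonstration below uses a hypothetical helper, show_arg, that simply prints the argument it receives. It may not explain the failure with the test data script.

```shell
# Print the first argument exactly as the called program would receive it.
show_arg() { printf '%s\n' "$1"; }

# ~ is in the middle of the word, so the shell leaves it untouched.
literal=$(show_arg --bam=~/RNAseq.bam)

# $HOME is an ordinary variable and is expanded anywhere in a word.
expanded=$(show_arg --bam="$HOME/RNAseq.bam")

echo "$literal"
echo "$expanded"
```

Replacing ~ with $HOME (or the absolute path) in the braker.pl options of the script would be a cheap thing to try.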

samtools sort code

Hi,
Running braker with RNA-seq data (BAM files) and --UTR=on, I encountered an error at the samtools sort step.
In the original braker.pl the code is (line 8991):
$cmdString .= "$SAMTOOLS_PATH/samtools sort -\@ " .($CPU-1) . " -o $otherfilesDir/merged.s.bam " . "$otherfilesDir/merged.bam " . "1> $otherfilesDir/samtools_sort_before_wig.stdout " . "2> $errorfilesDir/samtools_sort_before_wig.stderr"; print LOG "\n$cmdString\n" if ($v > 3);

Looking at samtools sort:

Usage: samtools sort [options] <in.bam> <out.prefix>
Options: -n sort by read name
-f use <out.prefix> as full file name instead of prefix
-o final output to stdout
-l INT compression level, from 0 to 9 [-1]
-@ INT number of sorting and compression threads [1]
-m INT max memory per thread; suffix K/M/G recognized [768M]

So I thought it may not be correct to run it with "-o", which here means "final output to stdout". I changed it to:

$cmdString .= "$SAMTOOLS_PATH/samtools sort -\@ " .($CPU-1) . " $otherfilesDir/merged.bam " . "$otherfilesDir/merged.s " . "1> $otherfilesDir/samtools_sort_before_wig.stdout " . "2> $errorfilesDir/samtools_sort_before_wig.stderr"; print LOG "\n$cmdString\n" if ($v > 3);

That is, I deleted the "-o" option; merged.bam is the input and merged.s is the output prefix, so the output file will be merged.s.bam.

I hope this is right.

Thanks

PS. The braker version I am using: I cloned braker with git in December 2018, and the README.TXT file says it is from August 22nd.
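The usage text quoted above looks like it comes from a samtools 0.1.x release, where the second positional argument is an output prefix and -o means "write to stdout". To my knowledge, samtools 1.x takes the output file name via -o FILE, which would match the call in braker.pl. The sketch below picks the call syntax from the major version number. It is an illustration only: the function name is my own, and the version string is passed in as an argument so that the snippet can be tried without samtools installed.

```shell
# Print a samtools sort command line suited to the given samtools version.
samtools_sort_cmd() {
    # $1 = samtools version string, e.g. "0.1.19" or "1.9"
    # $2 = input BAM, $3 = output BAM (name ending in .bam)
    major=${1%%.*}
    if [ "$major" -ge 1 ]; then
        # Assumed 1.x syntax: output file given with -o.
        printf 'samtools sort -o %s %s\n' "$3" "$2"
    else
        # 0.1.x syntax as in the usage text: <in.bam> <out.prefix>; ".bam" is appended.
        printf 'samtools sort %s %s\n' "$2" "${3%.bam}"
    fi
}

samtools_sort_cmd 0.1.19 merged.bam merged.s.bam
samtools_sort_cmd 1.9 merged.bam merged.s.bam
```

If the installed samtools is a 0.1.x release, upgrading it may be simpler than editing braker.pl.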

Alternative Transcripts input and output question

Does --alternatives_from_evidence=true require a separate hints file supplied on the command line, or will it use the hints generated by GeneMark in the output directory? I ran two jobs, one with this option and one without; the one without had more predicted genes in the output gtf/gff file. Does this option take alternative transcripts into account when predicting genes, or should it produce an output file of alternative transcripts? Should I include this option when predicting genes for an unannotated non-model organism?

Secondly, I have a question about the predicted genes in the output. Several genes have the suffix t1, t2, etc. (e.g. g55.t1, g55.t2). Do these indicate alternative transcripts/isoforms?

Thank you

gtf2gff error

Dear developers,
I have downloaded the latest version of BRAKER2 and am running it with the --gff3 flag. I get the following error when the gtf output file is converted to gff3 format:

ERROR in file /home/software/BRAKER/scripts/braker.pl at line 9415
Failed to execute: cat /home/results/BRAKER/augustus.hints.gtf | perl -ne 'if(m/\tAUGUSTUS\t/) {print $_;}' | perl /home/software/Augustus/scripts/gtf2gff.pl --gff3 --out=/home/results/BRAKER/augustus.hints.gff3 >> /home/results/BRAKER/gtf2gff3.log 2>> /home/results/BRAKER/errors/gtf2gff3.err

and the gtf2gff3.err file shows

transcript jg1.t1 has conflicting gene parents: and jg1. Remember: In GTF txids need to be overall unique. at /home/software/Augustus/scripts/gtf2gff.pl line 119, <STDIN> line 590303.

Any help is much appreciated.
Thanks.

The hints file is empty

Hello, I am using BRAKER for predicting genes using a RNAseq bam file and protein fasta file. I consistently get the following error.

Tue May 21 04:24:52 2019: ERROR: in file /opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/braker-2.1.2-75wblifp2zieps5rf7tzp7ajcwvzo2oz/bin/braker.pl at line 4189

The hints file is empty. Maybe the genome and the RNA-seq file do not belong together.

I do not know what this error means. Could you help me understand what is going wrong?
The command I am using is
braker.pl --cores=1 --overwrite --prg=exonerate --BAMTOOLS_PATH=/opt/rit/spack-app/linux-rhel7-x86_64/gcc-7.3.0/bamtools-2.5.1-7rljcjuix7pff6pcjqnu6c5ztdhk4coj/bin/ --SAMTOOLS_PATH=/opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/samtools-1.9-k6deogajvbc2bpx3csxjuwtmqh5w65nr/bin/ --PYTHON3_PATH=/opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/python-3.6.3-u4oaxsbnvbz6s7yxztqvvirlipfjrnx7/bin/ --ALIGNMENT_TOOL_PATH=/opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/exonerate-2.4.0-ddwi7zhv5cb4rzwsyhpe32jmqn22pmcm/bin/ --prot_seq=protein.faa --genome=genome.fasta --bam=ASP_rnaseq_sorted.bam --gff3

Problem with GeneMark

Dear developers,

I am running BRAKER 2.1.2 on a plant genome with around 40K scaffolds. But the job is dying at some stage with GeneMark. GeneMark dies with the following error reported to the STDERR:

ERROR in file /Storage/progs/BRAKER-2.1.2/scripts/braker.pl at line 5307
Failed to execute: perl /Storage/progs/gm_et_linux_64_v4.38/gmes_petap/gmes_petap.pl --verbose --seq /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/genome.fa --max_intergenic 50000 --evidence /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/evidence.gff --et_score 10 --ET /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/genemark_hintsfile.gff --cores=1 --soft_mask 1000 1>/Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP.stdout 2>/Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/errors/GeneMark-ETP.stderr

braker.log does not report any error. GeneMark-ETP.stdout has the following:

check before run
create directories
commit input data
data report
commit training data
training data report
prepare initial model
get GC of sequence
GC 36
build initial ET model
running step ET_A
running gm.hmm on local system
3 contigs in training
concatenate predictions: /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/run/ET_A_1
training level ET_A: /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/run/ET_A_1
From 261 loaded 232 and ignored dublications 29
exon no_match match_one match_two
Initial 7 5 0
Internal 8 11 28
Terminal 9 6 0
Single 4 0 0
CDS_no_match all short long seq_short seq_long
CDS_no_match 28 20 8 5927 13691
Intergenic all between_match seq_match
Intergenic: 18 2 2893
error, no valid sequences were found
error on call: /Storage/progs/gm_et_linux_64_v4.38/gmes_petap/make_nt_freq_mat.pl --cfg /Storage/data1/riano/Thalictrum/GenomeAnnotation/BRAKER/Masked/MaSuRCA_WT478/braker/MaSuRCA_WT478/GeneMark-ETP/run.cfg --section stop_TAG --format TERM_TAG

I am running BRAKER as follows:
braker.pl --etpmode --softmasking --species=$SP --genome=$ASSEMBLY --bam=${SP}.scf_gt1000bp.sorted.bam --hints=prot_hintsfile.aln2hints.gff --cores=$NSLOTS --AUGUSTUS_CONFIG_PATH=AugustusCONFIG --AUGUSTUS_BIN_PATH=/Storage/progs/Augustus-3.3.1-tag1/bin --AUGUSTUS_SCRIPTS_PATH=/Storage/progs/Augustus-3.3.1-tag1/scripts
prot_hintsfile.aln2hints.gff contains hits to a related species (same family), generated in a previous run of BRAKER (unmasked genome) with GenomeThreader. The genome has been softmasked using RepeatModeler/RepeatMasker.

Any suggestion on how to carry on with genome annotation is greatly appreciated.
Thanks,
Diego

Augustus output is not produced

Hello,

I am trying to use BRAKER to annotate an insect genome. The GeneMark-ET step seems to succeed, but Augustus produces no output and I cannot find the reason for the problem.

Below are the output files produced for my genome:

[slukiche@hmem00 braker]$ ls -lh
total 1.9G
-rw-rw-r-- 1 slukiche slukiche  900 Jun  7 01:56 augustus.hints.gff
-rw-rw-r-- 1 slukiche slukiche 384K Jun  6 12:45 bam_header.map
-rw-rw-r-- 1 slukiche slukiche 9.1M Jun  7 01:58 braker.log
drwxrwxr-x 2 slukiche slukiche    1 Jun  7 01:58 errors
-rw-rw-r-- 1 slukiche slukiche  417 Jun  7 01:15 filterGenemark.stdout
drwxrwxr-x 6 slukiche slukiche   12 Jun  7 01:15 GeneMark-ET
-rw-rw-r-- 1 slukiche slukiche 6.2K Jun  7 00:38 GeneMark-ET.stdout
-rw-rw-r-- 1 slukiche slukiche  11M Jun  6 18:53 genemark_hintsfile.gff
-rw-rw-r-- 1 slukiche slukiche 1.9G Jun  6 12:44 genome.fa
-rw-rw-r-- 1 slukiche slukiche 384K Jun  6 12:44 genome_header.map
-rw-rw-r-- 1 slukiche slukiche  11M Jun  6 18:53 hintsfile.gff
drwxrwxr-x 3 slukiche slukiche    1 Jun  7 01:15 species

The errors folder only contains the file GeneMark-ET.stderr, with this error:
(in cleanup) Can't call method "FETCH" on an undefined value at /home/ulb/ebe/slukiche/perl5/lib/perl5/Object/InsideOut.pm line 1953 during global destruction.
GeneMark seems to run correctly, though.
The content of filterGenemark.stdout is:

Number of cds hints is 0
Average gene length: 3572
Average number of introns: 1.48159767565855
Good gene rate: 0.0447713908185422
Number of genes: 143889
Number of complete genes: 129413
Number of good genes: 5794
Number of one-exon-genes: 43795
Number of bad genes: 138095
Good intron rate: 0.191962402293721
One exon gene rate (of good genes): 0.304452882292026
One exon gene rate (of all genes): 0.304366560334702

The only file produced by Augustus is augustus.hints.gff, which contains the following:

# This output was generated with AUGUSTUS (version 3.3.2).
# AUGUSTUS is a gene prediction tool written by M. Stanke ([email protected]),
# O. Keller, S. König, L. Gerischer, L. Romoth and Katharina Hoff.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# Sources of extrinsic information: M RM E W P
# reading in the file /mnt/fhgfs/users/s/l/slukiche/genome_annotation/braker/braker/augustus_tmp/8256.001.ctg11933.fa.1..21181.hints ...
# Have extrinsic information about 1 sequences (in the specified range).
# Initializing the parameters using config directory /CECI/home/ulb/ebe/slukiche/tools/augustus/Augustus/config/ ...
# gonioctenaQuinquepunctata version. Using default transition matrix.

I wasn't able to find any error message from Augustus in the log file so I don't understand what is going on.

Setting AUGUSTUS_BIN_PATH fails at bam2hints

Setting the AUGUSTUS_BIN_PATH variable causes an issue with bam2hints: it is looked up in bin/ below that path, which results in an incorrect path. Removing bin from AUGUSTUS_BIN_PATH fails with "couldn't find AUGUSTUS".

AUGUSTUS_BIN_PATH =/opt/augustus-3.2.3/bin
Command that fails:
/opt/augustus-3.2.3/bin/bin/bam2hints --intronsonly --in=/data/scratch/user/STAR_Aligned.out.bam --out=/data/scratch/user/braker/Sp_1/bam2hints.temp.gff 2>/data/scratch/user/braker/Sp_1/errors/bam2hints.0.stderr

braker2/augustus gff3 output

Hi!

How can I transfer the output gff3 of the Braker2 ab initio gene annotation pipeline to a valid EMBL flat file that I can submit to ENA?

I tried using EMBLmyGFF3 (https://github.com/NBISweden/EMBLmyGFF3). The tool seems to work fine, but the BRAKER gff3 seems to be non-standard, and I always get warnings and error messages like these (just the first lines of the log):

13:59:49 ERROR feature: >>stop_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
13:59:49 ERROR feature: >>start_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
13:59:51 ERROR feature: >>inferred_parent<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
13:59:51 WARNING feature: Partial CDS. The CDS with ID = g5589.t1.braker.CDS2 not a multiple of three.
/home/meitel/.local/lib64/python2.7/site-packages/Bio/Seq.py:2071: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5848.t1.braker.CDS2 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5903.t1.braker.CDS3 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5825.t1.braker.CDS2 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g6051.t1.braker.CDS1 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5770.t1.braker.CDS2 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5770.t1.braker.CDS3 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5770.t1.braker.CDS4 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5770.t1.braker.CDS5 not a multiple of three.
13:59:52 WARNING feature: Partial CDS. The CDS with ID = g5770.t1.braker.CDS7 not a multiple of three.

I am basically getting this error for all genes and wonder if the braker2 gff is non-standard.

These are the first lines of the output gff3:

scaffold_154 AUGUSTUS gene 5501 7878 0.85 - . ID=g1.braker;
scaffold_154 AUGUSTUS mRNA 5501 7878 0.52 - . ID=g1.t1.braker;Parent=g1.braker
scaffold_154 AUGUSTUS stop_codon 5501 5503 . - 0 Parent=g1.t1.braker;
scaffold_154 AUGUSTUS CDS 5501 5587 0.85 - 0 ID=g1.t1.braker.CDS1;Parent=g1.t1
scaffold_154 AUGUSTUS exon 5501 5587 . - . ID=g1.t1.braker.exon1;Parent=g1.t1;
scaffold_154 AUGUSTUS intron 5588 7152 0.84 - . Parent=g1.t1.braker;
scaffold_154 AUGUSTUS CDS 7153 7878 0.57 - 0 ID=g1.t1.braker.CDS2;Parent=g1.t1
scaffold_154 AUGUSTUS exon 7153 7878 . - . ID=g1.t1.braker.exon2;Parent=g1.t1;
scaffold_154 AUGUSTUS start_codon 7876 7878 . - 0 Parent=g1.t1.braker;
scaffold_154 AUGUSTUS mRNA 6946 7878 0.33 - . ID=g1.t2.braker;Parent=g1.braker
scaffold_154 AUGUSTUS stop_codon 6946 6948 . - 0 Parent=g1.t2.braker;
scaffold_154 AUGUSTUS CDS 6946 7878 0.33 - 0 ID=g1.t2.braker.CDS1;Parent=g1.t2
scaffold_154 AUGUSTUS exon 6946 7878 . - . ID=g1.t2.braker.exon1;Parent=g1.t2;
scaffold_154 AUGUSTUS start_codon 7876 7878 . - 0 Parent=g1.t2.braker;
scaffold_154 AUGUSTUS gene 10822 13441 0.34 + . ID=g2.braker;
scaffold_154 AUGUSTUS mRNA 10822 13441 0.34 + . ID=g2.t1.braker;Parent=g2.braker
scaffold_154 AUGUSTUS start_codon 10822 10824 . + 0 Parent=g2.t1.braker;
scaffold_154 AUGUSTUS CDS 10822 10946 0.42 + 0 ID=g2.t1.braker.CDS1;Parent=g2.t1
scaffold_154 AUGUSTUS exon 10822 10946 . + . ID=g2.t1.braker.exon1;Parent=g2.t1;
scaffold_154 AUGUSTUS intron 10947 11358 0.42 + . Parent=g2.t1.braker;
scaffold_154 AUGUSTUS CDS 11359 11608 0.49 + 1 ID=g2.t1.braker.CDS2;Parent=g2.t1
scaffold_154 AUGUSTUS exon 11359 11608 . + . ID=g2.t1.braker.exon2;Parent=g2.t1;
scaffold_154 AUGUSTUS intron 11609 13147 0.38 + . Parent=g2.t1.braker;
scaffold_154 AUGUSTUS CDS 13148 13441 0.38 + 0 ID=g2.t1.braker.CDS3;Parent=g2.t1
scaffold_154 AUGUSTUS exon 13148 13441 . + . ID=g2.t1.braker.exon3;Parent=g2.t1;
scaffold_154 AUGUSTUS stop_codon 13439 13441 . + 0 Parent=g2.t1.braker;

Any suggestions/comments are highly appreciated.

Michael
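One thing stands out in the excerpt above: the mRNA lines carry IDs such as ID=g1.t1.braker, but the CDS and exon lines point to Parent=g1.t1 (without the .braker suffix), so those parents refer to IDs that do not occur in the file. Whether this is what EMBLmyGFF3 trips over I cannot say. The sketch below lists such dangling Parent values. It assumes a tab-separated GFF3 file with the attributes in column 9 and a single value per Parent attribute; the function name is my own.

```shell
# List Parent values that have no feature with a matching ID in the same file.
orphan_parents() {
    # $1 = GFF3 file; it is read twice: first to collect IDs, then to check Parents
    awk -F'\t' '
        NR == FNR {
            if (match($9, /ID=[^;]+/))
                ids[substr($9, RSTART + 3, RLENGTH - 3)] = 1
            next
        }
        match($9, /Parent=[^;]+/) {
            parent = substr($9, RSTART + 7, RLENGTH - 7)
            if (!(parent in ids)) print parent
        }
    ' "$1" "$1" | sort -u
}
```

On a file laid out like the excerpt, this would print g1.t1, g1.t2, g2.t1 and so on, one per line.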

missing RNAseq.bam from examples

There is a missing test file so test1.sh can not be completed. BTW would it be possible to have shorter tests or make the computations parallel at least?

BRAKER error on call: parse_ET.pl

I'm trying BRAKER v2.1.0, but it stops at the GeneMark-ET step.
Here's the exact error:
error, file not found ~/src/gm_et_linux_64/gmes_petap/parse_ET.pl: set.out
error on call: ~/src/gm_et_linux_64/gmes_petap/parse_ET.pl --section ET_A --cfg ~/src/BRAKER2/BRAKER_v2.1.0/example/run.cfg --v
However, if I run ~/src/gm_et_linux_64/gmes_petap/parse_ET.pl by itself, it works.

Any help or suggestions would be appreciated!

PS: Where can I download BRAKER v1?

Can't locate Hash/Merge.pm

Getting this error during the GeneMark-ET part of the pipeline:

Can't locate Hash/Merge.pm in @INC (you may need to install the Hash::Merge module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at /home/arkadiy_garber/bin/gm_et_linux_64/gmes_petap/parse_by_introns.pl line 22.
BEGIN failed--compilation aborted at /home/arkadiy_garber/bin/gm_et_linux_64/gmes_petap/parse_by_introns.pl line 22.
        (in cleanup)    (in cleanup)  at /home/arkadiy_garber/anaconda3/lib/site_perl/5.26.2/Object/InsideOut.pm line 1953 during global destruction.

The Hash::Merge module is indeed installed, and the gmes_petap.pl script otherwise runs:

arkadiy_garber@server:/scratch/1/arkadiy/PacBIO_data/assembly_and_trimmed_reads$ perl /home/arkadiy_garber/bin/gm_et_linux_64/gmes_petap/gmes_petap.pl
# -------------------
Usage:  /home/arkadiy_garber/bin/gm_et_linux_64/gmes_petap/gmes_petap.pl  [options]  --sequence [filename]

GeneMark-ES Suite version 4.38
   includes transcript (GeneMark-ET) and protein (GeneMark-EP) based training and prediction

Input sequence/s should be in FASTA format

Algorithm options
  --ES           to run self-training
  --fungus       to run algorithm with branch point model (most useful for fungal genomes)
  --ET           [filename]; to run training with introns coordinates from RNA-Seq read alignments (GFF format)
  --EP           [filename]; to run training with introns coordinates from protein splice alighnmnet (GFF format)
  --et_score     [number]; 10 (default) minimum score of intron in initiation of the ET algorithm
  --ep_score     [number]; 4 (default) minimum score of intron in initiation of the EP algorithm
  --evidence     [filename]; to use in prediction external evidence (RNA or protein) mapped to genome
  --training     to run only training step
  --prediction   to run only prediction step
  --predict_with [filename]; predict genes using this file species specific parameters (bypass regular training and prediction steps)

Sequence pre-processing options
  --max_contig   [number]; 5000000 (default) will split input genomic sequence into contigs shorter then max_contig
  --min_contig   [number]; 50000 (default); will ignore contigs shorter then min_contig in training 
  --max_gap      [number]; 5000 (default); will split sequence at gaps longer than max_gap
                 Letters 'n' and 'N' are interpreted as standing within gaps 
  --max_mask     [number]; 5000 (default); will split sequence at repeats longer then max_mask
                 Letters 'x' and 'X' are interpreted as results of hard masking of repeats
  --soft_mask    [number] to indicate that lowercase letters stand for repeats; utilize only lowercase repeats longer than specified length

Run options
  --cores        [number]; 1 (default) to run program with multiple threads 
  --pbs          to run on cluster with PBS support
  --v            verbose

Customizing parameters:
  --max_intron          [number]; default 10000 (3000 fungi), maximum length of intron
  --max_intergenic      [number]; default 10000, maximum length of intergenic regions
  --min_gene_prediction [number]; default 300 (120 fungi) minimum allowed gene length in prediction step

Developer options:
  --usr_cfg      [filename]; to customize configuration file
  --ini_mod      [filename]; use this file with parameters for algorithm initiation
  --test_set     [filename]; to evaluate prediction accuracy on the given test set
  --key_bin
  --debug
# -------------------

thanks,
Arkadiy

Error running gmes_petap.pl and ab-initio

I am trying to run BRAKER2 for a reference haplotype fungal genome.
braker.pbs

#!/bin/bash
#PBS -P OSR
#PBS -N ref_braker
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=20:00:00
#PBS -e logs/braker.err
#PBS -o logs/braker.out

projectDir=/scratch/OSR/canu4_annotation/ref
outDir=$projectDir/results/braker
inDir=$projectDir/results/star/star2pass

module load genemark-es/4.33
module load bamtools/2.5.1
module load blast+/2.7.1
module load samtools/1.9
module load python/3.7.2
module load genomethreader/1.7.1
module load makehub/i
module load eval/2.2.8
module load braker2/2.1.2

export AUGUSTUS_CONFIG_PATH=$outDir/augustus_config
export AUGUSTUS_BIN_PATH=/usr/local/augustus/3.3.2/bin
export AUGUSTUS_SCRIPTS_PATH=/usr/local/augustus/3.3.2/scripts

braker.pl --species=OSR --genome=$projectDir/ref/ref.fa --bam=$inDir/refAligned.sortedByCoord.out.bam --softmasking --UTR=on --ab_initio --cores=8 --fungus --crf --makehub [email protected] --workingdir=$outDir 

braker.err

	The file ~/.gmkey exists and has not been copied.
Python 3.7.2 As we suffer from package overload, only minimal packages will be installed in this version.
Unknown option: ab_initio
Unknown option: makehub
Unknown option: email
ERROR in file /usr/local/braker2/2.1.2/braker.pl at line 5179
Failed to execute: perl /usr/local/genemark-es/4.33/gmes_petap.pl --verbose --sequence=/scratch/RDS-FAE-OSR-RW/canu4_annotation/ref/results/braker/genome.fa --ET=/scratch/RDS-FAE-OSR-RW/canu4_annotation/ref/results/braker/genemark_hintsfile.gff --et_score 10 --max_intergenic 50000 --cores=8 --fungus --soft_mask 1000 1>/scratch/RDS-FAE-OSR-RW/canu4_annotation/ref/results/braker/GeneMark-ET.stdout 2>/scratch/RDS-FAE-OSR-RW/canu4_annotation/ref/results/braker/errors/GeneMark-ET.stderr

Can you help me figure out why gmes_petap.pl is not running? Also, why does braker not recognize ab_initio, makehub and email for makehub? Thank you.

example files missing or test commands wrong

I was trying to test my BRAKER installation using test1.sh in directory example/tests. The braker command is

braker.pl --genome=../genome.fa --bam=../RNAseq.bam --softmasking --workingdir=$wd

however, there is no file RNAseq.bam in the example directory. Should it perhaps be

braker.pl --genome=../genome.fa --hints=../RNAseq.hints --softmasking --workingdir=$wd

?

Also, the documentation says there should be result files called augustus.gff and augustus.gtf. The directories example/results/test? only contain files named augustus.hints.gtf. Is this the same as augustus.gtf, i.e. the final predictions of AUGUSTUS?

Thanks for your help!

UTRs annotations

Hi,
I have been running BRAKER with RNA-seq data (BAM files) and UTR=on because I wanted to add UTR annotations to my genes. Looking through the output (GTF and GFF files, etc.), I encountered this scenario in some (many) genes: more than one annotated 5'UTR and/or 3'UTR within the same gene and the same isoform. See below for a simplified example:

Contig1001 AUGUSTUS gene 35608 37394 0.08 - . g8717
Contig1001 AUGUSTUS transcript 35608 37394 0.08 - . g8717.t1
Contig1001 AUGUSTUS tts 35608 35608 . - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS 3'-UTR 35608 35986 0.46 - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS exon 35608 36139 . - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS stop_codon 35987 35989 . - 0 transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS CDS 35987 36139 1 - 0 transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS intron 36140 36198 1 - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS CDS 36199 36313 1 - 1 transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS exon 36199 36313 . - . transcript_id "g8717.t1"; gene_id "g8717";
...
Contig1001 AUGUSTUS intron 36949 37004 1 - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS CDS 37005 37222 1 - 0 transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS exon 37005 37227 . - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS start_codon 37220 37222 . - 0 transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS 5'-UTR 37223 37227 1 - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS 5'-UTR 37281 37394 0.25 - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS exon 37281 37394 . - . transcript_id "g8717.t1"; gene_id "g8717";
Contig1001 AUGUSTUS tss 37394 37394 . - . transcript_id "g8717.t1"; gene_id "g8717";

When I look into the RNAseq profile (IGV) I understand more or less why it's annotated this way: lack of RNAseq coverage that would link both initially separated 5'UTRs (or 3'UTRs).

My question: I haven't seen this in other genome annotation files. Is this common in other BRAKER annotations with UTR=on (these are two Illumina 150PE lanes, which give ~30x coverage)? Should I merge/connect the double UTRs into one for further analysis?

Thank you
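
To see how often this pattern occurs in an annotation, the UTR features can be grouped per transcript. The following is a minimal sketch, not part of BRAKER; it assumes a tab-separated AUGUSTUS GTF with `transcript_id "..."` in the attribute column, and the function names are made up for illustration. Note that in the listing above each of the two 5'-UTR features lies inside its own exon feature (37005-37227 and 37281-37394), so they may be two exonic parts of one UTR separated by an intron; that reading is an interpretation of the listing, not a confirmed answer.

```python
import re
from collections import defaultdict

UTR_TYPES = ("5'-UTR", "3'-UTR")


def utr_segments(gtf_lines):
    """Collect (start, end) of every UTR feature, keyed by transcript and UTR type."""
    segments = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] not in UTR_TYPES:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue  # attribute column does not name a transcript
        segments[match.group(1)][fields[2]].append((int(fields[3]), int(fields[4])))
    return segments


def multi_segment_utrs(gtf_lines):
    """Return only those (transcript, UTR type) pairs that have more than one segment."""
    result = {}
    for transcript, by_type in utr_segments(gtf_lines).items():
        for utr_type, segs in by_type.items():
            if len(segs) > 1:
                result[(transcript, utr_type)] = sorted(segs)
    return result


# UTR lines taken from the example above, re-joined with tabs
ATTRS = 'transcript_id "g8717.t1"; gene_id "g8717";'
example = [
    "\t".join(["Contig1001", "AUGUSTUS", "3'-UTR", "35608", "35986", "0.46", "-", ".", ATTRS]),
    "\t".join(["Contig1001", "AUGUSTUS", "5'-UTR", "37223", "37227", "1", "-", ".", ATTRS]),
    "\t".join(["Contig1001", "AUGUSTUS", "5'-UTR", "37281", "37394", "0.25", "-", ".", ATTRS]),
]
print(multi_segment_utrs(example))
```

Running the same function over a whole file (`with open("augustus.hints.gtf") as handle: multi_segment_utrs(handle)`, file name assumed) gives a count of affected transcripts before deciding whether any merging is needed.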

bam file sorting

I get this error from bam2hints, although the BAM file is coordinate-sorted with the samtools sort defaults: "BAM file MUST be sorted by target sequence names". Any ideas why this happens? The file should be coordinate-sorted, right?
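
One way to narrow this down is to check what the file itself claims and how the records are actually ordered. The sketch below is an illustration only: it works on text produced by `samtools view -H` (header) and on the reference-name column of `samtools view` output, and the function names are hypothetical. How bam2hints performs its own check is not established here.

```python
def sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def reference_names(header_text):
    """Return the reference sequence names in the order of the @SQ lines."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def grouped_by_reference(rnames):
    """True if every reference name occurs in one contiguous run of records."""
    seen, previous = set(), None
    for name in rnames:
        if name != previous:
            if name in seen:
                return False  # this reference was already finished earlier
            seen.add(name)
            previous = name
    return True


header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:100\n@SQ\tSN:chr2\tLN:50\n"
print(sort_order(header), reference_names(header))
print(grouped_by_reference(["chr1", "chr1", "chr2"]))
```

If the header reports `coordinate` and the records are grouped by reference, the sorting itself is probably not the problem and the input handed to bam2hints is worth a second look (for example, whether it is the sorted file or an intermediate one).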

Input file not in genbank format.

Hi,
I ran into this error:

/work/waterhouse_team/miniconda2/envs/braker2/bin//etraining: ERROR
        Input file not in genbank format.

Here is the full log output:

# Tue Apr 16 12:10:52 2019: braker.pl version 2.1.2

# Tue Apr 16 12:10:52 2019: Configuring of BRAKER for using external tools...

# Tue Apr 16 12:10:52 2019: Found environment variable $AUGUSTUS_CONFIG_PATH. Setting $AUGUSTUS_CONFIG_PATH to /work/waterhouse_team/miniconda2/envs/braker2/config/
# Tue Apr 16 12:10:52 2019: Found environment variable $AUGUSTUS_BIN_PATH. Setting $AUGUSTUS_BIN_PATH to /work/waterhouse_team/miniconda2/envs/braker2/bin/
# Tue Apr 16 12:10:52 2019: Found environment variable $AUGUSTUS_SCRIPTS_PATH. Setting $AUGUSTUS_SCRIPTS_PATH to /work/waterhouse_team/miniconda2/envs/braker2/bin/
# Tue Apr 16 12:10:52 2019: Did not find environment variable $GENEMARK_PATH  (either variable does not exist, or the path given in variable does not exist). Will try to set this variable in a different way, later.
# Tue Apr 16 12:10:52 2019: Trying to guess $GENEMARK_PATH from location of gmes_petap.pl executable that is available in your $PATH.
# Tue Apr 16 12:10:52 2019: Setting $GENEMARK_PATH to /work/waterhouse_team/apps/gm_et_linux_64/gmes_petap
# Tue Apr 16 12:10:52 2019: Did not find environment variable $BAMTOOLS_PATH (either variable does not exist, or the path given in variable does not exist). Will try to set this variable in a different way, later.
# Tue Apr 16 12:10:52 2019: Trying to guess $BAMTOOLS_BIN_PATH from location of bamtools executable that is available in your $PATH.
# Tue Apr 16 12:10:52 2019: Setting $BAMTOOLS_BIN_PATH to /work/waterhouse_team/miniconda2/envs/braker2/bin
# Tue Apr 16 12:10:52 2019: Did not find environment variable $SAMTOOLS_PATH  (either variable does not exist, or the path given in variable doesnot exist). Will try to set this variable in a different way, later.
# Tue Apr 16 12:10:52 2019: Trying to guess $SAMTOOLS_PATH from location of samtools executable in your $PATH.
# Tue Apr 16 12:10:52 2019: Setting $SAMTOOLS_PATH to /work/waterhouse_team/miniconda2/envs/braker2/bin
# Tue Apr 16 12:10:52 2019: Did not find environment variable $BLAST_PATH
# Tue Apr 16 12:10:52 2019: Trying to guess $BLAST_PATH from location of blastp executable that is available in your $PATH.
# Tue Apr 16 12:10:52 2019: Setting $BLAST_PATH to /work/waterhouse_team/miniconda2/envs/braker2/bin
# Tue Apr 16 12:10:52 2019: Did not find environment variable $PYTHON3_PATH
# Tue Apr 16 12:10:52 2019: Trying to guess $PYTHON3_PATH from location of python3 executable that is available in your $PATH.
# Tue Apr 16 12:10:52 2019: Setting $PYTHON3_PATH to /work/waterhouse_team/miniconda2/envs/braker2/bin
# Tue Apr 16 12:10:52 2019: Configuration of BRAKER for using external tools is complete!

# Tue Apr 16 12:10:52 2019: WARNING: /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp already exists. Braker will use existing files, if they are newer than the input files. You can choose another working directory with --workingdir=dir or overwrite it with --overwrite.
# Tue Apr 16 12:10:53 2019: changing into working directory /lustre/scratch/waterhouse_team/braker/braker
cd /lustre/scratch/waterhouse_team/braker/braker

# Tue Apr 16 12:10:53 2019: Creating parameter template files for AUGUSTUS with new_species.pl

# Tue Apr 16 12:10:53 2019: new_species.pl will create parameter files for species NbV1ChF-NbRNASeq_Wlp in /work/waterhouse_team/miniconda2/envs/braker2/config//species/NbV1ChF-NbRNASeq_Wlp
perl /lustre/work-lustre/waterhouse_team/miniconda2/envs/braker2/bin/new_species.pl --species=NbV1ChF-NbRNASeq_Wlp --AUGUSTUS_CONFIG_PATH=/work/waterhouse_team/miniconda2/envs/braker2/config/ 1> /dev/null 2>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/errors/new_species.stderr

# Tue Apr 16 12:11:06 2019: Converting bam files to hints
# Tue Apr 16 12:11:06 2019: Preparing hints for running GeneMark

# Tue Apr 16 12:11:06 2019: Filtering intron hints for GeneMark from /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/hintsfile.gff...
cat /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genemark_hintsfile.gff.rnaseq | sort -n -k 4,4 | sort -s -n -k 5,5 | sort -s -k 3,3 | sort -s -k 1,1 | /lustre/work-lustre/waterhouse_team/miniconda2/envs/braker2/bin/join_mult_hints.pl > /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genemark_hintsfile.gff.rnaseq.tmp
mv /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genemark_hintsfile.gff.rnaseq.tmp /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genemark_hintsfile.gff

# Tue Apr 16 12:11:09 2019: Checking whether file /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genemark_hintsfile.gff contains sufficient multiplicity information...
# Tue Apr 16 12:11:09 2019: Executing GeneMark-ET

# Tue Apr 16 12:11:09 2019: changing into GeneMark-ET directory /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET
cd /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET

# Tue Apr 16 12:11:09 2019: Executing gmes_petap.pl
perl /work/waterhouse_team/apps/gm_et_linux_64/gmes_petap/gmes_petap.pl --verbose --sequence=/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genome.fa --ET=/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genemark_hintsfile.gff --et_score 10 --max_intergenic 50000 --cores=1 1>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET.stdout 2>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/errors/GeneMark-ET.stderr

# Sun Apr 21 10:34:10 2019: change to working directory /lustre/scratch/waterhouse_team/braker/braker
cd /lustre/scratch/waterhouse_team/braker/braker

# Sun Apr 21 10:34:10 2019 Filtering output of GeneMark for generating training genes for AUGUSTUS
# Sun Apr 21 10:34:10 2019: Checking whether hintsfile contains single exon CDSpart hints
# Sun Apr 21 10:34:10 2019: filtering GeneMark genes by intron hints
perl /lustre/work-lustre/waterhouse_team/miniconda2/envs/braker2/bin/filterGenemark.pl --genemark=/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET/genemark.gtf --introns=/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/hintsfile.gff 1>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/filterGenemark.stdout 2>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/errors/filterGenemark.stderr

#Sun Apr 21 10:34:14 2019: downsampling good genemark genes according to poisson distribution with Lambda 2:
perl /lustre/work-lustre/waterhouse_team/miniconda2/envs/braker2/bin/downsample_traingenes.pl --in_gtf=/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET/genemark.f.good.gtf --out_gtf=/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET/genemark.d.gtf --lambda=2 1> /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/downsample_traingenes.log 2> /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/errors/downsample_traingenes.err

# Sun Apr 21 10:34:14 2019: training AUGUSTUS
# Sun Apr 21 10:34:14 2019: creating softlink from /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET/genemark.gtf to /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/traingenes.gtf.
ln -s /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET/genemark.gtf /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/traingenes.gtf
# Sun Apr 21 10:34:15 2019: Converting gtf file /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/traingenes.gtf to genbank file
# Sun Apr 21 10:34:15 2019: Computing flanking region size for AUGUSTUS training genes
# Sun Apr 21 10:34:16 2019: create genbank file /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/train.gb
perl /lustre/work-lustre/waterhouse_team/miniconda2/envs/braker2/bin/gff2gbSmallDNA.pl /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/traingenes.gtf /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/genome.fa 1477 /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/train.gb 2>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/errors/traingenes.gtf_gff2gbSmallDNA.stderr

# Sun Apr 21 10:35:43 2019: $trainGb1 file /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/train.gb contains 77478 genes.
#  Sun Apr 21 10:35:43 2019: concatenating good and downsampled GeneMark training genes to /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/good_genes.lst.
cat /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/GeneMark-ET/genemark.d.gtf > /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/good_genes.lst
# Sun Apr 21 10:35:43 2019: Filtering train.gb for "good" mRNAs:
perl /lustre/work-lustre/waterhouse_team/miniconda2/envs/braker2/bin/filterGenesIn_mRNAname.pl /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/good_genes.lst /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/train.gb > /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/train.f.gb 2>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/errors/filterGenesIn_mRNAname.stderr

# Sun Apr 21 10:35:48 2019: $trainGb2 file /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/train.f.gb contains 0 genes.
# Sun Apr 21 10:35:48 2019: Running etraining to catch gene structure inconsistencies:
/work/waterhouse_team/miniconda2/envs/braker2/bin//etraining --species=NbV1ChF-NbRNASeq_Wlp --AUGUSTUS_CONFIG_PATH=/work/waterhouse_team/miniconda2/envs/braker2/config/ /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/train.f.gb 1> /lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/gbFilterEtraining.stdout 2>/lustre/scratch/waterhouse_team/braker/braker/NbV1ChF-NbRNASeq_Wlp/errors/gbFilterEtraining.stderr

What did I miss?

Thank you in advance,

Michal
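
The log above reports that train.gb contains 77478 genes while train.f.gb, the file handed to etraining, contains 0 genes. An empty input would plausibly explain "Input file not in genbank format", although that is an inference from the log and not a confirmed diagnosis. A minimal sketch for checking the record counts by hand follows; it relies only on the fact that each GenBank record starts with a `LOCUS` line, and the function name is made up for illustration.

```python
def count_genbank_records(lines):
    """Count GenBank records by counting lines that start with 'LOCUS'."""
    return sum(1 for line in lines if line.startswith("LOCUS"))


# Tiny in-memory stand-ins for train.gb and an empty train.f.gb
two_records = [
    "LOCUS       seq1_1-500   500 bp  DNA\n",
    "FEATURES             Location/Qualifiers\n",
    "//\n",
    "LOCUS       seq2_1-800   800 bp  DNA\n",
    "//\n",
]
print(count_genbank_records(two_records))
print(count_genbank_records([]))
```

On real files the same function can be fed an open file handle, for example `with open("train.f.gb") as handle: count_genbank_records(handle)`. If train.gb has records and train.f.gb has none, the filtering step between the two (here filterGenesIn_mRNAname.pl with good_genes.lst) is the place to look.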

Errors in the annotations with --UTR=ON

I tried to annotate the genome with the example data using the command

braker.pl --genome=../genome.fa --prot_seq=../prot.fa --prg=gth --bam=../RNAseq.bam --gth2traingenes --softmasking --UTR=on --gff3 --workingdir=$wd --cleanup --core=20

but got the following error:

ERROR in file /public/home/fanlj/software/Braker2/BRAKER/scripts/braker.pl at line 8741
Failed to execute: perl /public/home/fanlj/software/Braker2/Augustus/scripts/aa2nonred.pl /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utr_genes_in_gb.fa /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utr_genes_in_gb.nr.fa --BLAST_PATH=/public/home/fanlj/software/miniconda3/bin --cores=20 1> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utr.aa2nonred.stdout 2> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/errors/utr.aa2nonred.stderr!

The file "utr_genes_in_gb.fa" was not created by the pipeline, and I cannot find an error in braker.log. The last 20 lines of braker.log were:

# Fri Mar 22 09:34:51 2019: sorting bam file...

/public/software/apps/samtools-1.3.1/bin/samtools sort -@ 19 -o /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/merged.s.bam /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/merged.bam 1> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/samtools_sort_before_wig.stdout 2> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/errors/samtools_sort_before_wig.stderr
# Fri Mar 22 09:34:58 2019: Creating wiggle file...

/public/home/fanlj/software/Braker2/Augustus/bin/../auxprogs/bam2wig/bam2wig /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/merged.s.bam 1>/public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/merged.wig 2> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/errors/bam2wig.err
# Fri Mar 22 09:36:01 2019: Creating /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utrs.gff

/public/home/fanlj/software/Braker2/Augustus/bin/utrrnaseq --in-scaffold-file /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/genome.fa -C /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/stops.and.starts.gff -I /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/rnaseq.utr.hints -W /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/merged.wig -o /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utrs.gff -r 76 -v 100 -n 15 -i 0.7 -m 0.3 -w 70 -c 100 -p 0.5  1> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/rnaseq2utr.stdout 2> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/errors/rnaseq2utr.err
# Fri Mar 22 09:36:11 2019: fixing utrrnaseq output

mv /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utrs.f.gff /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utrs.gff
# Fri Mar 22 09:36:11 2019: Creating gb file for UTR training

cat /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utrs.gff /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/augustus.hints.f.gtf | grep -P "(CDS|5'-UTR|3'-UTR)" | sort -n -k 4,4 | sort -s -k 10,10 | sort -s -k 1,1 >> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/genes.gtf 2> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/errors/cat_utrs_augustus_noUtrs.err

perl /public/home/fanlj/software/Braker2/Augustus/scripts/gff2gbSmallDNA.pl /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/genes.gtf /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/genome.fa 1854 /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utr.gb --good=/public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/bothutr.lst 1> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/gff2gbSmallDNA.utr.stdout 2> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/errors/gff2gbSmallDNA.utr.stderr
# Fri Mar 22 09:36:11 2019: BLAST training gene structures (with UTRs) against themselves:
perl /public/home/fanlj/software/Braker2/Augustus/scripts/aa2nonred.pl /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utr_genes_in_gb.fa /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utr_genes_in_gb.nr.fa --BLAST_PATH=/public/home/fanlj/software/miniconda3/bin --cores=20 1> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/utr.aa2nonred.stdout 2> /public/home/fanlj/software/Braker2/BRAKER/example_mao/tests/test6/errors/utr.aa2nonred.stderr


Changing genetic code table

It is currently not possible to switch genetic code table for running BRAKER. AUGUSTUS can use a different genetic code, but GeneMark-ES/ET must be extended before BRAKER can be extended.

BRAKER code extension is planned, depending on GeneMark-ES/ET extension.

Required changes:

  • command line option for genetic code table in BRAKER

  • translation script prior to aa2nonred.pl (part of AUGUSTUS)

  • editing AUGUSTUS parameters prior to training and prediction

  • command line option for getAnnoFastaFromJoingenes.pl in BRAKER
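
To illustrate what the translation step in the list above would have to do, here is a minimal sketch of translating a coding sequence under a selectable genetic code table. It is not BRAKER or AUGUSTUS code; the function names are hypothetical, and only NCBI tables 1 (standard) and 6 (ciliate nuclear) are included as examples.

```python
from itertools import product

BASES = "TCAG"

# Amino acids in NCBI order: codons enumerated over T, C, A, G with the
# third position varying fastest (TTT, TTC, TTA, TTG, TCT, ...).
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    6: "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def codon_table(table_id):
    """Map each of the 64 codons to its amino acid for the given table id."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, TABLES[table_id]))


def translate(cds, table_id=1):
    """Translate a CDS; unknown codons become 'X', a trailing partial codon is dropped."""
    table = codon_table(table_id)
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(table.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGTAAGGA", table_id=1))
print(translate("ATGTAAGGA", table_id=6))
```

The same codon TAA is a stop under table 1 and glutamine under table 6, which is why every step that translates or filters by stop codons needs to receive the table id.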

The hints file is empty. Maybe the genome and the RNA-seq file do not belong together

I'm trying to run BRAKER with the following command:
perl /home/chutima/wat_work_2/software/BRAKER/scripts/braker.pl --genome ref.fasta --bam=gmap_isoseq_sort.bam --BAMTOOLS_PATH=/home/chutima/wat_work_2/software/bamtools/bin/ --AUGUSTUS_CONFIG_PATH=/home/chutima/wat_work_2/software/Augustus/config/ -cores=10 --gff3 -AUGUSTUS_ab_initio --SAMTOOLS_PATH=/home/chutima/wat_work_2/software/samtools-1.9/

This is the beginning of the BAM file:
c404_f29p8_2075c404 16 tig00000954_quiver 388229 40 4S395M82N108M126N87M106N81M79N135M98N222M106N81M109N162M96N191M100N142M124N96M99N139M659

This is the header of fasta file

tig00000954_quiver
AAGTTAATAATCCTTCCCCTGAATTAAAACAATTGTCTGCTCACCTACGTTATGCTTTCTTAGGAGAATCTTCTACTTTC
CAGTTATCATTTCAAAATGATTTAAGTAAAGAAGAAGAGGAAAAATTGTTGGATGTGTTAAAAAAGCATAAATCTGCCTT

I think both files use the same sequence name. Do you have any suggestion as to why this error happens?
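
Comparing one record by eye can miss mismatches elsewhere in the files, so it may help to compare all sequence names. The sketch below is illustrative only: it expects FASTA lines and the text output of `samtools view` (tab-separated, reference name in the third column), and the function names are hypothetical. Matching names do not rule out other causes of an empty hints file.

```python
def fasta_names(fasta_lines):
    """Return the first whitespace-delimited token of every FASTA header line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">") and len(line.strip()) > 1:
            names.add(line[1:].split()[0])
    return names


def sam_reference_names(sam_lines):
    """Return the reference names (column 3) used by aligned SAM records."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2] != "*":
            names.add(fields[2])
    return names


def name_overlap(fasta_lines, sam_lines):
    """Report which names are shared and which occur in only one of the inputs."""
    in_fasta = fasta_names(fasta_lines)
    in_sam = sam_reference_names(sam_lines)
    return {
        "shared": in_fasta & in_sam,
        "only_in_fasta": in_fasta - in_sam,
        "only_in_bam": in_sam - in_fasta,
    }


fasta = [">tig00000954_quiver\n", "AAGTTAATAATCC\n"]
sam = ["c404_f29p8_2075c404\t16\ttig00000954_quiver\t388229\t40\t4S395M82N108M\n"]
print(name_overlap(fasta, sam))
```

An empty `shared` set, or names that differ only by a suffix after whitespace in the FASTA header, would point to a naming mismatch between genome and alignments.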

Can't locate File/HomeDir.pm in @INC

Dear Braker team,

I have this problem when running BRAKER:

./braker.pl
Can't locate File/HomeDir.pm in @INC (you may need to install the File::HomeDir module) (@INC contains: /home/ulg/.local/share.cpan/build/GD-2.67-0/ /home/ulg/miniconda3/lib/perl5/site_perl/5.22.0/x86_64-linux-thread-multi /home/ulg/miniconda3/lib/perl5/site_perl/5.22.0 /home/ulg/miniconda3/lib/perl5/5.22.0/x86_64-linux-thread-multi /home/ulg/miniconda3/lib/perl5/5.22.0 .) at ./braker.pl line 21.
BEGIN failed--compilation aborted at ./braker.pl line 21.

cpanm File::HomeDir
File::HomeDir is up to date. (1.004)

Can you help me?

Failure during training

Hi,
Running BRAKER2 I get the following error:

bio@biocomp04:~/Documents/Purpureocillium/March2019$ ~/apps/BRAKER-2.1.2/scripts/braker.pl  --genome=./working-genomes/361.fa --species=Purp --AUGUSTUS_CONFIG_PATH=/home/bio/apps/augustus-3.3.2/config/ --prot_seq=hints-annotation.faa --prg=gth --ALIGNMENT_TOOL_PATH=/home/bio/apps/gth-1.7.1-Linux_x86_64-64bit/bin --trainFromGth
# Wed Mar 27 20:01:51 2019: Logfile: /home/bio/Documents/Purpureocillium/March2019/braker/Purp/braker.log!
ERROR in file /home/bio/apps/BRAKER-2.1.2/scripts/braker.pl at line 5831
Failed to execute: /home/bio/apps/augustus-3.3.2/config/../bin/etraining --species=Purp --AUGUSTUS_CONFIG_PATH=/home/bio/apps/augustus-3.3.2/config /home/bio/Documents/Purpureocillium/March2019/braker/Purp/train.f.gb 1> /home/bio/Documents/Purpureocillium/March2019/braker/Purp/gbFilterEtraining.stdout 2>/home/bio/Documents/Purpureocillium/March2019/braker/Purp/errors/gbFilterEtraining.stderr

When I do:

less /home/bio/Documents/Purpureocillium/March2019/braker/Purp/errors/gbFilterEtraining.stderr

I get:

/home/bio/apps/augustus-3.3.2/config/../bin/etraining: ERROR
        Input file not in genbank format.

Find attached the genbank file

train.gb.zip

How can I fix it?
Thanks in advance,
Luis Alfonso.

utrrnaseq segmentation fault

Hi,

I am running BRAKER with UTR prediction, and I have split my RNA-seq into plus and minus strands as recommended here: https://www.biostars.org/p/92935/

My command is:

braker.pl --species=c_incerta_v1.3 --genome=../../repeat_masking/Chlamydomonas_incerta.V3.softmask_v1.fa --softmasking --bam=../plus.bam,../minus.bam --stranded=+,- --UTR=on --cores=16 --PYTHON3_PATH=/usr/local/bin/

I am getting the following segmentation fault error to standard out:

sh: line 1: 132194 Segmentation fault /home/craigror/programs/augustus-3.3.1/config//../bin/utrrnaseq --in-scaffold-file /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/genome.fa -C /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/stops.and.starts.gff -I /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/rnaseq.utr.hints -W /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/rnaseq_plus.wig -o /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/utrs_plus.gff -r 76 -v 100 -n 15 -i 0.7 -m 0.3 -w 70 -c 100 -p 0.5 > /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/rnaseq2utr_plus.stdout 2> /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/errors/rnaseq2utr_plus.err
ERROR in file /home/craigror/programs/BRAKER/scripts/braker.pl at line 8496
Failed to execute: /home/craigror/programs/augustus-3.3.1/config//../bin/utrrnaseq --in-scaffold-file /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/genome.fa -C /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/stops.and.starts.gff -I /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/rnaseq.utr.hints -W /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/rnaseq_plus.wig -o /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/utrs_plus.gff -r 76 -v 100 -n 15 -i 0.7 -m 0.3 -w 70 -c 100 -p 0.5 1> /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/rnaseq2utr_plus.stdout 2> /scratch/research/projects/chlamydomonas/Cincerta_deNovo/analysis/assembly_V3/BRAKER2/run_v1.3/braker/c_incerta_v1.3/errors/rnaseq2utr_plus.err!

The file rnaseq2utr_plus.stdout contains:

Read in of scaffold file finished successfully!
Read in of coding region file finished successfully!
Read in of intron file finished successfully!
Input Data procession finished successfully!

and rnaseq2utr_plus.err is empty. I can reproduce the fault if I run the final utrrnaseq command by itself. Any help would be much appreciated.

Cheers,
Rory

Replace BLAST by DIAMOND

Replace BLAST by DIAMOND for aa2nonred.pl (originally an AUGUSTUS script). Intention: speed up BRAKER.

Failed to execute: mv .rnaseq.tmp

Hello,

I'm running BRAKER using only proteins from a closely related species, as per your recipe in Fig 5 of the (excellent, by the way) help page. Full command is below:

~/software/BRAKER/scripts/braker.pl \
--genome=genome.fa \
--species=my_species --useexisting \
--prot_seq=proteins.faa --prg=gth --gth2traingenes --trainFromGth --gff3 --cores=1 \
--AUGUSTUS_CONFIG_PATH=/path/to/config/ \
--AUGUSTUS_BIN_PATH=/path/to/bin/ \
--AUGUSTUS_SCRIPTS_PATH=/path/to/scripts/ \
--BAMTOOLS_PATH=/path/to/bamtools/bin/ \
--GENEMARK_PATH=/path/to/genemark/bin/ \
--SAMTOOLS_PATH=/path/to/samtools \
--ALIGNMENT_TOOL_PATH=/path/to/gth

It fails with these messages to STDOUT:

# Wed Apr 24 11:10:56 2019: Log information is stored in file /path/to/braker.log
Use of uninitialized value $genemark_hintsfile in concatenation (.) or string at ~/software/BRAKER/scripts/braker.pl line 5005.
Use of uninitialized value $genemark_hintsfile in concatenation (.) or string at ~/software/BRAKER/scripts/braker.pl line 5006.
Use of uninitialized value $genemark_hintsfile in concatenation (.) or string at ~/software/BRAKER/scripts/braker.pl line 5075.
mv: missing destination file operand after ‘.rnaseq.tmp’
Try 'mv --help' for more information.
ERROR in file ~/software/BRAKER/scripts/braker.pl at line 5079
Failed to execute: mv .rnaseq.tmp 

and the last entry in braker.log is:

# Wed Apr 24 11:37:14 2019: Filtering intron hints for GeneMark from /path/to/my_species/hintsfile.gff...
cat .prot | sort -n -k 4,4 | sort -s -n -k 5,5 | sort -s -k 3,3 | sort -s -k 1,1 | /path/to/join_mult_hints.pl > .prot.tmp
mv .rnaseq.tmp 

I'm guessing it thinks there should be some RNA-seq evidence, as per $genemark_hintsfile, and fails when mv throws an error when it does not find it? Is this a bug, or perhaps I am missing a flag in my command?

Many thanks,
reubwn

PS: I updated the repo immediately before running this command.
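
The failing command has the shape that results when an empty variable is interpolated into `mv <file>.rnaseq.tmp <file>`. The sketch below reproduces that effect and shows one possible guard. It is written in Python for illustration and is not the BRAKER code (braker.pl is Perl); the function name is hypothetical.

```python
import shlex


def build_mv_command(hintsfile):
    """Build 'mv <hintsfile>.rnaseq.tmp <hintsfile>', refusing an empty file name."""
    if not hintsfile:
        raise ValueError("hints file name is empty; refusing to build mv command")
    return "mv {} {}".format(shlex.quote(hintsfile + ".rnaseq.tmp"), shlex.quote(hintsfile))


# Without a guard, an empty name silently yields the broken command from the log.
unguarded = "mv {}.rnaseq.tmp {}".format("", "")
print(repr(unguarded))

print(build_mv_command("genemark_hintsfile.gff"))
```

Failing early with a clear message makes it obvious that the hints file variable was never set in a protein-only run, instead of surfacing later as an `mv` usage error.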

getAnnoFastaFromJoingenes.py: error

My BRAKER2 run fails in the middle of the process, and I couldn't find any similar problem or possible resolution online. This is my command and the error messages:

################################################

Command:

nohup time -v --output=time_braker2_RNAonly.log braker.pl --species=BUSCO_BUSCO_sunbird_trimPLK_881882613 --useexisting --cores=40 --genome=/home/cch/sunbird/sunbird_RepeatMasker_trim/sunbird_platanus_trim_kraken_scaff_gapclosed_1000.fa.masked --softmasking 1 --hints=/home/cch/sunbird/sunbird_HISAT2_trim/sunbird_trim_hints_AS_400K.gff,/home/cch/sunbird/sunbird_HISAT2_trim/sunbird_trim_hints_China_400K.gff --UTR=off &

Error messages:

In GeneMark-ET.stderr this message is printed repeatedly:

(in cleanup) Can't call method "FETCH" on an undefined value at /usr/local/share/perl/5.22.1/Object/InsideOut.pm line 1953 during global destruction.

In getAnnoFastaJoingenes..stderr:

usage: getAnnoFastaFromJoingenes.py [-h] -g GENOME -f GTF -o OUT
[-t TRANSLATION_TABLE] [-s FILTER]
getAnnoFastaFromJoingenes.py: error: argument -o/--out: expected one argument

My nohup file contains these error messages:

Use of uninitialized value $_ in substitution (s///) at /opt/BRAKER/scripts/braker.pl line 7970.
ERROR in file /opt/BRAKER/scripts/braker.pl at line 7986
Failed to execute: perl /opt/Augustus/scripts/join_aug_pred.pl < /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/augustus.tmp.gff > /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/augustus.hints.gff

My braker.log file ends here:

#Thu Dec 27 20:33:51 2018: Making a gtf file from /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/augustus.hints.gff
cat /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/augustus.hints.gff | perl -ne 'if(m/\tAUGUSTUS\t/) {print $_;}' | perl /opt/Augustus/scripts/gtf2gff.pl --printExon --out=/home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/augustus.hints.tmp.gtf 2>/home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/errors/gtf2gff.augustus.hints.gtf.stderr

#Thu Dec 27 20:34:03 2018: AUGUSTUS prediction complete
#Thu Dec 27 20:34:03 2018: Making a fasta file with protein sequences of /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/augustus.hints.gtf
/usr/bin/python3 /opt/Augustus/scripts/getAnnoFastaFromJoingenes.py -g /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/genome.fa -f /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/augustus.hints.gtf -o 1> /home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/getAnnoFasta..stdout 2>/home/cch/sunbird/sunbird_BRAKER2_RNAonly_trim/braker/BUSCO_BUSCO_sunbird_trimPLK_881882613/errors/getAnnoFastaJoingenes..stderr

YAML not install

I tried to run BRAKER, but it showed the following error:

Perl module 'YAML' is required but not installed yet

I tried to install the YAML module with both conda and cpan. However, it still produces this error. What should I do?

Failing during optimize_augustus.pl

Hey, it's me again,
I tried to generate a model for my species and during the optimize_augustus.pl, I got the following error:

bio@biocomp04:~/Documents/Purpureocillium/March2019$ perl /home/bio/apps/augustus-3.3.2/scripts/optimize_augustus.pl --rounds=5 --species=Purp --kfold=8 --AUGUSTUS_CONFIG_PATH=/home/bio/apps/augustus-3.3.2/config --onlytrain=/home/bio/Documents/Purpureocillium/March2019/braker/Purp/train.gb.train.train --cpus=8 /home/bio/Documents/Purpureocillium/March2019/braker/Purp/train.gb.train.test
Splitting training file into 8 buckets...
Reading in the meta parameters used for optimization from /home/bio/apps/augustus-3.3.2/config/species/generic/generic_metapars.cfg...
Reading in the starting meta parameters from /home/bio/apps/augustus-3.3.2/config/species/Purp/Purp_parameters.cfg...
bucket Segmentation fault (core dumped)
2 Segmentation fault (core dumped)
Segmentation fault (core dumped)
3 Segmentation fault (core dumped)
8 5 Segmentation fault (core dumped)
1 Segmentation fault (core dumped)
6 Segmentation fault (core dumped)
Segmentation fault (core dumped)
4 7 Could not read the accuracy values out of predictions.txt when processing bucket 1. at /home/bio/apps/augustus-3.3.2/scripts/optimize_augustus.pl line 1224.

Later I used the following command, and BRAKER2 ran without errors:

bio@biocomp04:~/Documents/Purpureocillium/March2019$ ~/apps/BRAKER-2.1.2/scripts/braker.pl --genome=./working-genomes/Purp-polished-abyss.fasta.masked --bam=./Purp-annotation/Purp-RNAseq-to-genome-unmasked.bam --species=Purp --cores=4 --AUGUSTUS_CONFIG_PATH=/home/bio/apps/augustus-3.3.2/config --GENEMARK_PATH=/home/bio/apps/gm_et_linux_64/gmes_petap --fungus --skipOptimize --useexisting

What do you think it could be?
Am I executing it wrong?
Cheers,
Luis Alfonso.

Error with joingenes

Dear developers,
I am running BRAKER2 on a plant genome with RNA-seq data and proteins of long evolutionary distance, but the program always dies at the joingenes step, and I am not sure why.

I am running BRAKER as follows:
perl braker.pl --species=myspecies --genome=genome.fasta --cores=8 \
    --bam=hisat.sorted.bam --stranded=. --softmasking --UTR=on \
    --prot_seq=protein.fa --prg=gth --gth2traingenes --gff3

BRAKER died with an error like this:

Can't locate object method "tx_structures" via package "g45625.t1" (perhaps you forgot to load "g45625.t1"?) at braker.pl line 8402, line 5.

The only errors I can find are reported in the errors/joingenes.err file, for example:

Load warning: Did not expect feature "initial". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.
Load warning: Did not expect feature "terminal". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.
Load warning: Did not expect feature "internal". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.
Load warning: Did not expect feature "single". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.
Load warning: Did not expect feature "terminal". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.
Load warning: Did not expect feature "internal". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.
Load warning: Did not expect feature "initial". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.
Load warning: Did not expect feature "single". Known features are "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene", "transcript", "tss", "tts", "start_codon" and "stop_codon". This feature is going to be ignored.
This warning may affect the result.

Any suggestions on how to solve this problem would be greatly appreciated.

Best regards.
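
One way to narrow this down is to list which feature names in column 3 of the input file fall outside the set that the joingenes warning names as known. The following Python snippet is only a diagnostic sketch: the function name and the example records are made up for illustration, and it does not fix the crash at braker.pl line 8402.

```python
from collections import Counter

# Feature names copied from the joingenes warning quoted above.
KNOWN_FEATURES = {
    "CDS", "UTR", "3'-UTR", "5'-UTR", "exon", "intron", "gene",
    "transcript", "tss", "tts", "start_codon", "stop_codon",
}


def unexpected_features(lines):
    """Count column-3 feature names that are not in KNOWN_FEATURES."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a 9-column GFF/GTF record
        if fields[2] not in KNOWN_FEATURES:
            counts[fields[2]] += 1
    return counts


# Made-up records, for illustration only.
example = [
    "chr1\tsrc\tinitial\t100\t200\t.\t+\t0\tg1.t1\n",
    "chr1\tsrc\tinternal\t300\t400\t.\t+\t0\tg1.t1\n",
    "chr1\tsrc\tCDS\t300\t400\t.\t+\t0\tg1.t1\n",
    "chr1\tsrc\tinitial\t900\t950\t.\t-\t0\tg2.t1\n",
]
print(dict(unexpected_features(example)))
```

To scan a real file, pass an open file handle instead of the example list, e.g. `unexpected_features(open(path))`, where `path` points to whichever GTF/GFF file is handed to joingenes in your run.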

output from protein mapping pipeline is different to what BRAKER expects

Hi Katharina,

I am attempting to predict genes with BRAKER2 by using protein families from several related species as evidence. I have followed the README in the protein mapping pipeline which generated an introns.gff file.

In your BRAKER README, you mention that the hints file from this pipeline must be in the following format to work:

chrName ProSplign   intron  6591    8003    5   +   .   mult=5;pri=4;src=P
chrName ProSplign   intron  6136    9084    11  +   .   mult=11;pri=4;src=P

However, the introns.gff file is in the following format:

CSP28.scaffold163_cov73 ProSplign       Intron  1528203 1528347 1       +       .       tmp
CSP28.scaffold295_cov78 ProSplign       Intron  414858  414903  6       +       .       tmp

Providing this file to BRAKER results in an error.

Any idea where I've gone wrong? Column 9 containing the word tmp looks wrong, so perhaps this is an issue with the combine_gff_records.pl script? I am happy to contact the authors of that code if this is indeed the issue.

Thanks for your help,

Lewis
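
For comparison, the two layouts quoted above differ in column 3 (Intron vs. intron) and in column 9 (tmp vs. mult=...;pri=4;src=P). The Python sketch below rewrites one record into the layout from the README. It rests on two assumptions that the thread does not confirm: that the records are tab-separated, and that the score in column 6 is the multiplicity (in the README example the score and mult agree, 5 and 11). The function name is hypothetical, and whether this mismatch is the actual cause of the error is not established here.

```python
def to_braker_hint(line, pri=4, src="P"):
    """Rewrite one tab-separated intron record into the hint layout quoted above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    fields[2] = fields[2].lower()  # "Intron" -> "intron"
    mult = fields[5]  # assumption: the score column holds the multiplicity
    fields[8] = f"mult={mult};pri={pri};src={src}"
    return "\t".join(fields)


# First record from the post, written with tabs between columns.
record = "CSP28.scaffold163_cov73\tProSplign\tIntron\t1528203\t1528347\t1\t+\t.\ttmp"
print(to_braker_hint(record))
```

Applied line by line to introns.gff, this yields records whose last column follows the mult/pri/src pattern from the README example.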

braker with diamond

Hi @npavlovikj,
Could you please build a new BRAKER package that contains Diamond as a replacement for BLAST (#25)?

Thank you in advance,

Best wishes,

Michal

AUGUSTUS Segmentation fault

While using BRAKER, I keep getting the following error:

braker.pl --genome=/media/damian/Toshiba/Canu_Pilon.fasta --esmode
Logfile: /home/damian/Programs/BRAKER-master/braker/Sp_4/braker.log!
ERROR in file /home/damian/Programs/BRAKER-master/scripts/braker.pl at line 6556
Failed to execute: perl /home/damian/Programs/Augustus/scripts/optimize_augustus.pl --rounds=5 --species=Sp_4 --kfold=8 --AUGUSTUS_CONFIG_PATH=/home/damian/Programs/Augustus/config/ --onlytrain=/home/damian/Programs/BRAKER-master/braker/Sp_4/train.gb.train.train /home/damian/Programs/BRAKER-master/braker/Sp_4/train.gb.train.test 1>/home/damian/Programs/BRAKER-master/braker/Sp_4/optimize_augustus.stdout 2>/home/damian/Programs/BRAKER-master/braker/Sp_4/errors/optimize_augustus.stderr!

This is the stdout and stderr:
Splitting training file into 8 buckets...
Reading in the meta parameters used for optimization from /home/damian/Programs/Augustus/config/species/generic/generic_metapars.cfg...
Reading in the starting meta parameters from /home/damian/Programs/Augustus/config/species/Sp_4/Sp_4_parameters.cfg...
bucket

Segmentation fault (core dumped)
Could not read the accuracy values out of predictions.txt when processing bucket 1. at /home/damian/Programs/Augustus/scripts/optimize_augustus.pl line 1128.

Any idea what the reason may be?

Substitution loop error in filterIntronsFindStrand.pl

I'm attempting to run BRAKER v2.1.2 on a genome with large chromosomes (the largest being over 1 Gb in size), and I ran into the following error message in the file errors/filterIntronsFindStrand.stderr:

Substitution loop at PATH/to/BRAKER-2.1.2/scripts/filterIntronsFindStrand.pl line 122, <FASTA> chunk 1.

Of course, the obvious fix is to split apart the chromosomes before running BRAKER - please do note that BRAKER then ran successfully with split chromosomes. However, I still wanted to note the error for this particular use case. Thank you.
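
For readers who hit the same message, the splitting workaround can be sketched in Python as follows. This is a minimal illustration, not the procedure the poster used (the post does not say how the chromosomes were split): the function name, the `_part<offset>` naming scheme, and the chunk size are all hypothetical. Cutting at fixed positions can cut through genes, and predictions on the pieces need to be shifted back to chromosome coordinates afterwards.

```python
def split_fasta(lines, max_len):
    """Yield (name, sequence) pieces of at most max_len bases per FASTA record."""

    def pieces(name, seq):
        for start in range(0, len(seq), max_len):
            # The suffix is the 1-based start of the piece on the original
            # sequence, so coordinates can be mapped back later.
            yield f"{name}_part{start + 1}", seq[start:start + max_len]

    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield from pieces(name, "".join(chunks))
            # Keep only the identifier, drop the description after whitespace.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield from pieces(name, "".join(chunks))


# Tiny made-up input; a real run would use a much larger max_len.
fasta = [">chr1 demo", "ACGTAC", "GTAC", ">chr2", "TTG"]
for piece_name, piece_seq in split_fasta(fasta, 4):
    print(f">{piece_name}\n{piece_seq}")
```

The generator holds one full sequence in memory at a time, which should be acceptable for a single chromosome of about 1 Gb on a typical annotation server, but that is an assumption about the available hardware.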
