pierrebarbera / epa-ng

Massively parallel phylogenetic placement of genetic sequences

License: GNU Affero General Public License v3.0

phylogenetics bioinformatics taxonomic-classification mpi openmp mpi-io placement

epa-ng's Introduction

EPA-ng - Fast, parallel, highly accurate Maximum Likelihood Phylogenetic Placement, by the team behind RAxML(-ng)

Introduction
Installation
Usage
Test data
Citing EPA-ng

WARNING v0.3.0 - v0.3.3!

Please note that versions v0.3.0 through v0.3.3 are affected by a result-breaking bug if the input tree is rooted! If you think this may be the case for you, we strongly urge you to update to at least version v0.3.4!
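
As a rough illustration of the warning above, the following sketch checks whether a version string falls into the affected range. It assumes the version is written as vX.Y.Z (the form shown in the program banner); the function names are made up for this example and are not part of EPA-ng.

```python
import re

# Affected range as stated in the warning: v0.3.0 through v0.3.3.
FIRST_AFFECTED = (0, 3, 0)
LAST_AFFECTED = (0, 3, 3)


def parse_version(text):
    """Turn a string such as 'v0.3.8' or '0.3.8' into a tuple of ints."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version_text):
    """True if the version lies in the range named by the warning."""
    return FIRST_AFFECTED <= parse_version(version_text) <= LAST_AFFECTED


if __name__ == "__main__":
    for candidate in ("v0.3.2", "v0.3.4"):
        print(candidate, "affected" if is_affected(candidate) else "not affected")
```

Whether a given analysis was actually harmed also depends on the input tree having been rooted, which this check does not look at.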

SUPPORT

There is now a short tutorial available that covers the basic steps of a placement project. You can find it here

The most reliable way to get in touch with us is to head over to the Phylogenetic Placement Google Group. You can also search its history, or the history of the RAxML Google Group, for your particular question.

Alternatively I've created a gitter chat room where I can usually be found during office hours.

DISCLAIMER

This tool is still in active development. Suggestions, bug reports and constructive comments are more than encouraged! Please post them in the Google Group.

Introduction

EPA-ng is a complete rewrite of the Evolutionary Placement Algorithm (EPA), previously implemented in RAxML. It uses libpll and pll-modules to perform maximum likelihood-based phylogenetic placement of genetic sequences on a user-supplied reference tree and alignment.

What can EPA-ng do?

perform phylogenetic placement using explicitly specified model parameters
take separate reference and query alignment files as input, in the fasta or fasta.gz formats
handle DNA and amino acid data
run as a distributed computation on a cluster
prepare inputs for the cluster:
- convert a query fasta file into a random-access, binary-encoded file called a bfast file
output the placement results in the jplace format, ready for downstream analysis by libraries like genesis and tools like gappa

Installation

With Conda

Thanks to @gavinmdouglas, EPA-ng can now be installed using conda:

conda install -c bioconda epa-ng

With Homebrew

This one is thanks to @gaberoo :)

brew install brewsci/bio/epa-ng

Building from source

First, ensure the following packages are installed or otherwise available (relevant modules loaded on your cluster):

sudo apt-get install autotools-dev libtool flex bison cmake automake autoconf

Once these dependencies are available, you need to ensure that your compiler is recent enough, as EPA-ng is built using C++14 features. The minimum required versions are as follows:

Compiler	Min. Version
gcc	4.9.2
clang	3.8
icc	16

Any one of these compilers will be sufficient. gcc is the most widespread, and current versions of Ubuntu ship gcc versions exceeding the minimum.
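
The table above can be turned into a small check. This is only a sketch: it assumes you already have the compiler's version as a dotted string (for example from the compiler's own version output), and the helper names are invented for illustration.

```python
# Minimum versions copied from the table above.
MIN_VERSIONS = {"gcc": (4, 9, 2), "clang": (3, 8), "icc": (16,)}


def version_tuple(text):
    """Convert a dotted version string such as '4.9.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def pad(version, width):
    """Right-pad a version tuple with zeros so tuples of equal length are compared."""
    return version + (0,) * (width - len(version))


def meets_minimum(compiler, version_text):
    """True if the given compiler version is at least the documented minimum."""
    minimum = MIN_VERSIONS[compiler]
    found = version_tuple(version_text)
    width = max(len(minimum), len(found))
    return pad(found, width) >= pad(minimum, width)


if __name__ == "__main__":
    print(meets_minimum("gcc", "4.8.5"))   # older than the table's minimum
    print(meets_minimum("clang", "3.8"))   # exactly the minimum
```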

Now it's time to build the program.

make

That's it! If all goes well, the build process will fetch any missing git submodule dependencies and build them as well, before building the program itself. The executable will be located in the epa-ng/bin/ folder.

Apple

In principle, the procedure is the same as under Linux, but I recommend installing libomp (brew install libomp) before building.

Windows

Not supported at this time, though I highly recommend looking into the Ubuntu subsystem if you're using Windows 10!

Usage

EPA-ng is used from the command line, as the main use-case is processing large amounts of data using a supercomputing cluster.

Here is a list of the most basic arguments you will use:

Flag	Long Flag	Meaning
-s	--ref-msa	reference MSA (fasta)
-t	--tree	reference Tree (newick)
-q	--query	query sequences (fasta or bfast)
-w	--outdir	output directory (default: current directory)
	--model	model parameter specification
-T	--threads	number of threads to use

For a full overview of command line options either run EPA-ng with no input, or with the flag -h (or --help).

Basic

On a single computer, an example execution might look like this:

epa-ng --ref-msa $REF_MSA --tree $TREE --query $QRY_MSA --model $MODEL

Note that this will use as many threads as specified by the environment variable OMP_NUM_THREADS. Usually this defaults to the number of cores. Note, however, that no speedup is to be expected from hyperthreads, so the number of threads should be set to the number of physical cores.
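
When EPA-ng is driven from a script, the call above can be assembled programmatically. The sketch below only builds the argument list, using the long flags from the table of basic arguments; the function name and the file names in the example are placeholders, not part of EPA-ng.

```python
def build_epa_ng_command(ref_msa, tree, query, model, outdir=None, threads=None):
    """Assemble an epa-ng call from the basic flags listed in the table above."""
    cmd = [
        "epa-ng",
        "--ref-msa", ref_msa,
        "--tree", tree,
        "--query", query,
        "--model", model,
    ]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be a positive integer")
        # Per the note above, this should be the number of physical cores.
        cmd += ["--threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; pass the result to subprocess.run(cmd, check=True)
    # to actually launch the program.
    print(" ".join(build_epa_ng_command(
        "ref.fasta", "ref.newick", "query.fasta", "info.raxml.bestModel", threads=4)))
```

Keep in mind that Python's os.cpu_count() reports logical cores, so on machines with hyperthreading it is not a reliable source for the thread count recommended here.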

Setting the Model Parameters

As of version 0.2.0, GTRGAMMA model parameters have to be specified explicitly. There are currently two ways of doing this: Either specify a raxml-ng-style model descriptor (elaborated here), like so:

epa-ng <...> --model GTR{0.7/1.8/1.2/0.6/3.0/1.0}+FU{0.25/0.23/0.30/0.22}+G4{0.47}

... or pass a file containing the relevant information, coming from one of the supported tree inference programs.

RECOMMENDED In the case of raxml-ng, pass the [...].bestModel file resulting from an evaluation run to EPA-ng:

raxml-ng --evaluate --msa $REF_MSA --tree $TREE --prefix info --model GTR+G+F
epa-ng <...> --model info.raxml.bestModel

This method has support for pretty much every model that raxml-ng supports, so it is highly recommended you do it this way.

Alternatively, we also support parsing the model parameters from RAxML 8.x info files or from IQ-TREE report files, though there may be parsing problems, as not all models are covered.

For RAxML 8.x: pass a RAxML_info file to the program, where the info file was generated by calling RAxML with the option -f e:

raxmlHPC-AVX -f e -s $REF_MSA -t $TREE -n file -m GTRGAMMAX
epa-ng <...> --model RAxML_info.file
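
If the parameters are available as numbers rather than as a file, the raxml-ng-style descriptor shown earlier in this section can be assembled as a string. The following is a minimal sketch for the GTR + user frequencies + gamma case only; the function name is hypothetical, and the loose tolerance on the frequency sum is an assumption of this example, not a documented rule.

```python
def gtr_gamma_descriptor(rates, freqs, alpha, gamma_cats=4):
    """Format a descriptor of the form GTR{...}+FU{...}+G4{...}."""
    if len(rates) != 6:
        raise ValueError("GTR needs six substitution rates")
    if len(freqs) != 4:
        raise ValueError("DNA needs four base frequencies")
    # Assumption of this sketch: warn the caller about clearly wrong frequencies.
    if abs(sum(freqs) - 1.0) > 0.01:
        raise ValueError("base frequencies should sum to roughly 1")
    rate_part = "/".join(str(r) for r in rates)
    freq_part = "/".join(str(f) for f in freqs)
    return f"GTR{{{rate_part}}}+FU{{{freq_part}}}+G{gamma_cats}{{{alpha}}}"


if __name__ == "__main__":
    # Same numbers as the example descriptor earlier in this section.
    print(gtr_gamma_descriptor(
        [0.7, 1.8, 1.2, 0.6, 3.0, 1.0], [0.25, 0.23, 0.30, 0.22], 0.47))
```

Python drops trailing zeros when printing floats (0.30 becomes 0.3), so the string may differ cosmetically from a hand-written one. Passing the bestModel file remains the recommended route.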

Advanced

Overview of advanced features:

Flag	Long Flag	Meaning
-g	--dyn-heur	use dynamic preplacement heuristic (default)
-G	--fix-heur	use fixed preplacement heuristic
	--no-heur	disable preplacement heuristic
	--no-pre-mask	disable premasking
-c	--bfast	convert query fasta to binary format

The description of basic cluster usage starts here

Configuring the Heuristic Preplacement

By default, EPA-ng performs placement of a sequence in two stages: first selecting promising branches quickly (preplacement), then evaluating the selected branches in greater detail.

EPA-ng currently offers three ways of selecting these candidates.

The default is the accumulated threshold method, in which branches are added to the set of candidates until the sum of their likelihood weight ratios (LWR) exceeds a user-specified threshold. The flag controlling this mode is -g (or --dyn-heur), with a default setting of 0.99999, corresponding to a covered likelihood weight of 99.999%.

The second mode functions identically to the candidate selection mode in the original implementation of the EPA in RAxML. Here again, the branches are sorted by the LWR of the placement of a sequence. Then, the top x% of the total number of branches are selected into the set of candidates. As in RAxML, this behavior is controlled via the -G (or --fix-heur) flag.

The third mode works identically to the baseball heuristic from pplacer, with default settings (--strike-box 3.0, --max-strikes 6, --max-pitches 40) and is enabled using the --baseball-heur flag.

Lastly, to disable the preplacement completely, you can simply supply the --no-heur flag. Be warned however: doing so will be significantly more computationally demanding. Our advice is to use the heuristic, as it sacrifices only insignificant amounts of accuracy for greatly improved speed.
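
To make the first two selection modes concrete, here is a small sketch of both rules applied to the preplacement LWRs of a single query, one value per branch. It illustrates the description above and is not EPA-ng's actual code; the function names are invented, and the rounding used for the fixed fraction is an assumption.

```python
import math


def select_accumulated(lwrs, threshold=0.99999):
    """Add branches in order of decreasing LWR until the summed LWR exceeds the threshold."""
    order = sorted(range(len(lwrs)), key=lambda i: lwrs[i], reverse=True)
    chosen, total = [], 0.0
    for branch in order:
        chosen.append(branch)
        total += lwrs[branch]
        if total > threshold:
            break
    return chosen


def select_fixed(lwrs, fraction):
    """Keep the top fraction of all branches, ranked by LWR (rounded up, at least one)."""
    order = sorted(range(len(lwrs)), key=lambda i: lwrs[i], reverse=True)
    count = max(1, math.ceil(fraction * len(lwrs)))
    return order[:count]


if __name__ == "__main__":
    lwrs = [0.05, 0.6, 0.3, 0.05]  # toy values for four branches
    print(select_accumulated(lwrs, threshold=0.8))
    print(select_fixed(lwrs, fraction=0.5))
```

The accumulated rule adapts the number of candidates to each query: a query with one dominant branch yields a short list, while a query whose likelihood weight is spread out yields a longer one. The fixed rule always returns the same number of candidates.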

Premasking

By default, EPA-ng enables premasking, which works similarly to the same option in pplacer: if a site of the alignment is all-gaps in either the reference OR the query alignment, it is thrown out. Further, for each query sequence, the leading and trailing gap columns are ignored (this is where we differ from pplacer, as they ignore ALL query gap columns).

This reduces both runtime and memory footprint greatly, depending on the data. For short read data, the impact will be massive, as typically query alignments will be mostly all-gap.
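
The premasking rule described above can be sketched on plain strings. This is an illustration only, not the implementation used by EPA-ng: it assumes all sequences are aligned to the same width and that '-' is the only gap character, and the function names are hypothetical.

```python
GAP_CHARS = set("-")  # assumption: only '-' is treated as a gap in this sketch


def all_gap_columns(seqs):
    """Return the set of column indices where every sequence has a gap."""
    width = len(seqs[0])
    return {col for col in range(width) if all(s[col] in GAP_CHARS for s in seqs)}


def premask(ref_seqs, query_seqs):
    """Drop columns that are all-gap in either the reference or the query alignment."""
    width = len(ref_seqs[0])
    drop = all_gap_columns(ref_seqs) | all_gap_columns(query_seqs)
    keep = [col for col in range(width) if col not in drop]

    def strip(seq):
        return "".join(seq[col] for col in keep)

    return [strip(s) for s in ref_seqs], [strip(s) for s in query_seqs]


def informative_range(query_seq):
    """Return (start, end) so that query_seq[start:end] has no leading or trailing gaps."""
    start, end = 0, len(query_seq)
    while start < end and query_seq[start] in GAP_CHARS:
        start += 1
    while end > start and query_seq[end - 1] in GAP_CHARS:
        end -= 1
    return start, end


if __name__ == "__main__":
    refs, queries = premask(["ACG-TA", "AC--TA"], ["-CG-T-", "--G-T-"])
    print(refs, queries, [informative_range(q) for q in queries])
```

Note that gaps inside a query's informative range are kept, which matches the stated difference from pplacer.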

Cluster usage

To use distributed parallelism in EPA-ng, we must first recompile the program with MPI enabled. This requires a version of MPI to be loaded/installed on your system. The only additional requirement EPA-ng has is that the compiler loaded in conjunction with MPI satisfies the minimum version requirements. Often this can be ensured by the order in which the relevant modules are loaded on the cluster: first MPI, then the compiler. However, we recommend you contact your support team should this cause issues for you.

The actual compilation is very straight-forward:

make clean && make EPA_HYBRID=1

This will attempt to compile the program with both MPI and OpenMP, as the most efficient way to run the program is to map one MPI rank per node (good alternative: one rank per socket!), each rank starting as many threads as there are physical cores.

In your job submission script, you can then call the program in a highly similar way to before:

mpirun epa-ng --ref-msa $REF_MSA --tree $TREE -q query.fasta -w ./some/output/dir

Converting the query file to `.bfast`

You may also explicitly convert the input query fasta file to our internal binary fasta format (bfast). This format is binary encoded (reducing the size by half) and randomly accessible. Using this format is recommended under MPI, as it increases parallel efficiency.

To convert the fasta file, simply run the program with the query file specified like this:

epa-ng --bfast query.fasta --outdir $OUT

This will produce a file called query.fasta.bfast in the specified output directory.
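
In a scripted pipeline it can help to know the converted file's path in advance. The sketch below builds the conversion command shown above and derives the expected output path; it assumes the output name is the query file's base name plus .bfast, which matches the query.fasta example but is otherwise an assumption. The function name is invented.

```python
import os


def bfast_conversion(query_fasta, outdir):
    """Return the conversion command and the path where the bfast file is expected."""
    cmd = ["epa-ng", "--bfast", query_fasta, "--outdir", outdir]
    # Assumption: the output is named after the input file's base name.
    expected = os.path.join(outdir, os.path.basename(query_fasta) + ".bfast")
    return cmd, expected


if __name__ == "__main__":
    cmd, expected = bfast_conversion("query.fasta", "out")
    print(" ".join(cmd))
    print("expected output:", expected)
```

The resulting file can then be passed to the placement run through -q (or --query), which accepts fasta or bfast input according to the table of basic arguments.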

Test data

This repository includes a test data set which can be found under test/data/neotrop. Consult the README located there for usage examples.

Citing EPA-ng

If you use EPA-ng, please cite the following paper:

Pierre Barbera, Alexey M Kozlov, Lucas Czech, Benoit Morel, Diego Darriba, Tomáš Flouri, Alexandros Stamatakis; EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences, Systematic Biology, syy054, https://doi.org/10.1093/sysbio/syy054

epa-ng's Issues

Add a function to convert combined phylip to separated reference and query MSA

e.g. when coming from papara. Even when epa-ng can handle combined inputs, this should be useful.

Add "automake" and "autoconf" libraries reference in README.md

sudo apt-get install autoconf automake

Accept more ML models

"more" = the easy ones like F81, HKY etc.

Split command using fasta files

I used mafft to align query sequences against reference msa (both in fasta format) and also the combined msa is in fasta format. Right now I have to convert everything to phylip to use the split command, but it would be nice if epa could use fasta format as well.

I cannot find list of forbidden character for header of fasta

I got with my data set the error
"... fasta_getnext failed: Illegal header line in query fasta file"
my headers look like
">Read_3_sample=RF-pre-50cells-8h_S30_L001_R_size=121_"
So nothing clearly wrong.
But maybe with a list of forbidden character I can find a way to squeeze the info ( size=, is needed for post processing)
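
Without knowing which characters the parser rejects, one neutral way to narrow this down is to tabulate every character that occurs in the header lines and look at the ones that are not plain letters, digits or underscores. The sketch below does only that; it does not claim that any of the reported characters is actually forbidden, and the function names are made up.

```python
from collections import Counter


def header_characters(fasta_lines):
    """Count every character that appears in fasta header lines (after the '>')."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            counts.update(line[1:].rstrip("\r\n"))
    return counts


def unusual_characters(counts):
    """Keep only characters that are not ASCII letters, digits or underscores."""
    return {
        ch: n for ch, n in counts.items()
        if not (ch.isascii() and (ch.isalnum() or ch == "_"))
    }


if __name__ == "__main__":
    lines = [">Read_3_sample=RF-pre-50cells-8h_S30_L001_R_size=121_", "ACGT"]
    print(unusual_characters(header_characters(lines)))
```

Replacing the reported characters one at a time in a small test file would then show which of them, if any, triggers the error.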

src/core/pll/pll_util.cpp compile error: ‘tie’ is not a member of ‘std’

I was getting this compile error. Adding #include <tuple> on /src/core/pll/pll_util.cpp fixes the issue.
Edit: I am using gcc 4.9.2 compiler.

Issue with building from source: ld throws errors

Dear, I am trying to build epa-ng from source on Ubuntu server 18.04.5 LTS. I have checked the dependencies and all goes well until the end where ld is throwing errors:

[ 99%] Building CXX object src/CMakeFiles/epa_module.dir/util/stringify.cpp.o
[100%] Linking CXX executable ../../bin/epa-ng
/usr/bin/ld: /usr/local/lib/libz.a(inflate.o): relocation R_X86_64_32S against hidden symbol `zcfree' can not be used when making a PIE object
/usr/bin/ld: /usr/local/lib/libz.a(inftrees.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libz.a(zutil.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libz.a(crc32.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libz.a(inffast.o): relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
src/CMakeFiles/epa_module.dir/build.make:598: recipe for target '../bin/epa-ng' failed
make[3]: *** [../bin/epa-ng] Error 1
make[3]: Leaving directory '/usr/local/programs/epa-ng/build'
CMakeFiles/Makefile2:901: recipe for target 'src/CMakeFiles/epa_module.dir/all' failed
make[2]: *** [src/CMakeFiles/epa_module.dir/all] Error 2
make[2]: Leaving directory '/usr/local/programs/epa-ng/build'
Makefile:94: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/usr/local/programs/epa-ng/build'
Makefile:11: recipe for target 'run_make' failed
make: *** [run_make] Error 2

how could I resolve this? Am I missing something?

Tree Log-Likelihood -INF Error

Hello Team,
When I try to place a sequence into the tree on Linux, I get the error Aborted (core dumped) without any further error information.
When I tried the same data and command on macOS, it core dumped again (with libc++abi: terminating with uncaught exception of type std::runtime_error: Tree Log-Likelihood -INF!).
The EPA-ng version is v0.3.8.
the command used:

epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query query_aligned.fasta --model LG+F+G4 --outdir .

system infos:

Linux:     #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
MacOS:     20.6.0 Darwin Kernel Version 20.6.0: Fri Dec 16 00:35:00 PST 2022; root:xnu-7195.141.49~1/RELEASE_X86_64 x86_64

running details infos(MacOS):

% epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query query_aligned.fasta --model LG+F+G4 --outdir .
INFO Selected: Output dir: ./
INFO Selected: Query file: query_aligned.fasta
INFO Selected: Tree file: refer_tree.tree
INFO Selected: Reference MSA: reference_v4.mafft
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model: LG+F+G4
INFO     ______ ____   ___           _   __ ______
        / ____// __ \ /   |         / | / // ____/
       / __/  / /_/ // /| | ______ /  |/ // / __  
      / /___ / ____// ___ |/_____// /|  // /_/ /  
     /_____//_/    /_/  |_|      /_/ |_/ \____/ (v0.3.8)
WARN The reference MSA and tree have differing number of taxa! 196 vs. 186
INFO Using model parameters:
INFO    Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 1 (ML),  weights&rates: (0.25,0.136954) (0.25,0.476752) (0.25,1) (0.25,2.38629) 
        Base frequencies (empirical): 0.250946 0 0 0 0.176994 0 0 0.367461 0 0 0 0 0 0 0 0 0.204599 0 0 0 
        Substitution rates (model): 0.425093 0.276818 0.395144 2.48908 0.969894 1.03855 2.06604 0.358858 0.14983 0.395337 0.536518 1.12403 0.253701 1.17765 4.72718 2.1395 0.180717 0.218959 2.54787 0.751878 0.123954 0.534551 2.80791 0.36397 0.390192 2.4266 0.126991 0.301848 6.32607 0.484133 0.052722 0.332533 0.858151 0.578987 0.593607 0.31444 0.170887 5.07615 0.528768 1.69575 0.541712 1.43765 4.50924 0.191503 0.068427 2.14508 0.371004 0.089525 0.161787 4.00836 2.00068 0.045376 0.612025 0.083688 0.062556 0.523386 5.24387 0.844926 0.927114 0.01069 0.015076 0.282959 0.025548 0.017416 0.394456 1.24028 0.42586 0.02989 0.135107 0.037967 0.084808 0.003499 0.569265 0.640543 0.320627 0.594007 0.013266 0.89368 1.10525 0.075382 2.78448 1.14348 0.670128 1.16553 1.95929 4.12859 0.267959 4.81351 0.072854 0.582457 3.23429 1.67257 0.035855 0.624294 1.22383 1.08014 0.236199 0.257336 0.210332 0.348847 0.423881 0.044265 0.069673 1.80718 0.173735 0.018811 0.419409 0.611973 0.604545 0.077852 0.120037 0.245034 0.311484 0.008705 0.044261 0.296636 0.139538 0.089586 0.196961 1.73999 0.129836 0.268491 0.054679 0.076701 0.108882 0.366317 0.697264 0.442472 0.682139 0.508851 0.990012 0.584262 0.597054 5.30683 0.119013 4.14507 0.159069 4.27361 1.11273 0.078281 0.064105 1.03374 0.11166 0.232523 10.6491 0.1375 6.31236 2.59269 0.24906 0.182287 0.302936 0.619632 0.299648 1.70274 0.656604 0.023918 0.390322 0.748683 1.13686 0.049906 0.131932 0.185202 1.79885 0.099849 0.34696 2.02037 0.696175 0.481306 1.89872 0.094464 0.361819 0.165001 2.45712 7.8039 0.654683 1.33813 0.571468 0.095131 0.089613 0.296501 6.47228 0.248862 0.400547 0.098369 0.140825 0.245841 2.18816 3.15182 0.18951 0.249313
INFO Output file: ./epa_result.jplace
libc++abi: terminating with uncaught exception of type std::runtime_error: Tree Log-Likelihood -INF!
zsh: abort      epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query  --model   .

the query file (it has the same length as the MSA sequences after alignment with MAFFT):

>db29dfd5db5e2501ed9deadabc7dd91d
TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGGACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTCCAGACTAGAGTGTGAGAGAGGGTTGTGGAATTCTCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

It seems like something is wrong with my tree or query: when I placed a different query into a different tree, everything was OK.
The tree I used in the error situation was built by iqtree2 with the command

iqtree2 -s seq/all_seq.fasta -m LG+F+G4 -pre 'tree/%s_%s' -nt 48 --fast -alrt 1000

and the tree is :

(GCF_001581085.1:0.0001883592,(((((((((((GCF_002153675.1:0.6764403004,GCF_002153735.1:0.0030645544)0:0.0000029411,GCF_002153795.1:0.0158138579)0:0.6605801847,(((((((GCF_008690765.1:0.0000000000,GCF_008690505.1:0.0000000000):0.0000000000,GCF_008690645.1:0.0000000000):0.0000000000,GCF_008690705.1:0.0000000000):0.0000000000,GCF_008690905.1:0.0000000000):0.0000000000,GCF_008690915.1:0.0000000000):0.0000000000,GCF_008704295.1:0.0000000000):0.0416326726,GCF_024158225.1:0.0053982956)0:0.0000024925)0:0.0000024344,((((((((GCF_008690365.1:0.4406791409,GCF_003323795.1:0.4617060122)0:0.1640898965,GCF_024329695.1:0.0262970073)0:0.0003373681,(((GCF_000010845.1:0.6138629501,(GCF_003391275.1:1.6768345925,GCF_014196315.1:0.0139330781)0:0.0000025327)0:0.0000021894,GCF_008704325.1:0.0375132164)0:0.0000029909,((GCF_002153605.1:0.0134270192,GCF_001499615.1:0.4663964356)0:0.0000028317,GCF_008689845.1:0.4877107087)0:0.0033911496)0:0.0000024453)0:0.0000480426,GCF_001953595.1:1.7161662788)0:0.0000026774,(((((GCF_018811985.1:1.2476321189,GCF_001580535.1:0.3701136010)0:0.0000021507,GCF_002153545.1:1.3085919649)0:0.3109927057,GCF_000010905.1:0.0137330848)0:0.0000047088,GCF_001580945.1:0.6034760600)0:0.0000022867,GCF_024158385.1:0.4899289062)0:0.0028710588)0:0.0000028918,GCF_002153745.1:1.6808551804)0:0.0001082810,(GCF_003850905.1:0.0032674677,(((GCF_003966365.1:0.0147410702,GCF_021284605.1:0.5985121941)0:0.0000028764,GCF_001580695.1:0.0350269438)0:0.0005460484,GCF_000613865.1:0.0047823241)0:0.0000164957)0:0.0232318069)0:0.0054273270,(((((((GCF_002153695.1:1.4509147497,(GCF_024158305.1:0.0514046837,(GCF_008365315.1:0.4543791870,(GCF_001642635.1:0.0922419883,GCF_001581035.1:1.6490729713)0:0.0000029322)0:0.0000027350)0:0.1683579951)0:0.0124108208,GCF_002202135.1:0.3419663450)0:0.2244853940,((((GCF_002153775.1:0.2612440122,((GCF_022130865.1:0.4022580284,GCF_024158475.1:1.3897862725)0:0.0247699162,GCF_000379545.1:0.2593021461)0:0.0000027277)100:0.3193728053,(GCF_004341595.1:0.0609295276,(GCF_003850
825.1:0.0112367348,GCF_004014775.2:0.4916741059)0:0.0173493126)0:0.0007038507)0:0.0000021411,(GCF_008690725.1:0.4285462033,GCF_011516935.1:1.4206867824)0:0.1558481053)0:0.0000955949,(((((((GCF_001662905.1:0.4844812877,GCF_002005445.1:0.3366328551)0:0.0317392220,GCF_011516945.1:0.2458425154)0:0.2026757393,((GCF_022130785.1:0.0322073465,GCF_000963945.1:0.5333918479)0:0.0000022699,GCF_000193495.2:1.0022380453)0:0.0000024617)0:0.0045456014,GCF_018256985.1:0.0114794035)0:0.0000200495,GCF_024158185.1:0.7045870891)0:0.0000023903,(GCF_000285315.1:0.5450676598,GCF_008689805.1:0.1960126694)0:0.4423247183)0:0.0000948410,GCF_018811975.1:0.0090654175)0:0.0002828809)0:0.0010058575)0:0.0004283380,(((((((GCF_000963965.1:0.0060146282,((GCF_001580915.1:0.0060752079,GCF_007989335.1:0.6741506451)0:0.0000024009,(GCF_000964205.1:0.2689952602,GCF_001766235.1:0.5150130446)0:0.1617072119)0:0.0000021130)0:0.2398539718,GCF_006539325.1:0.2143134267)0:0.0203072357,GCF_001499675.1:0.5146329584)0:0.0000027795,GCF_008689865.1:0.1825394650)0:0.4026651290,GCF_024158265.1:0.0044401363)0:0.0000028173,((GCF_024158205.1:0.0042233156,GCF_011516875.1:0.5034271003)0:0.0000792547,GCF_000285275.1:1.6548804506)0:0.0000020615)0:0.0273314929,(((GCF_024158315.1:0.0070175453,(GCF_001183745.1:0.0063524715,(GCF_000787635.2:0.0065526170,GCF_002276785.1:0.4184158280)0:0.0000028871)0:0.5991026047)0:0.0000022640,(GCF_024158365.1:0.2715677289,GCF_013307325.1:1.3479024265)0:0.3013223189)0:0.0112147574,GCF_002220195.1:0.4824377378)0:0.0000023096)0:0.0000024257)0:0.0000021468,((((((GCF_007989245.1:0.0332987392,((GCF_008704245.1:0.0000000000,GCF_000755665.1:0.0000000000):0.0210429353,GCF_000963925.1:0.4718752717)0:0.0021558859)0:0.0078774398,((GCF_000241585.2:0.0088264104,GCF_021961645.1:0.6286745465)0:0.0000022309,GCF_008704305.1:0.6437612063)0:0.0002032385)0:0.0000020520,GCF_003850885.1:0.5811558117)0:0.0000020541,(((GCF_001580615.1:0.0310270635,GCF_017377745.1:0.0083640209)0:0.0000025252,((GCF_007991075.1:1.1324145009,(G
CF_024158505.1:0.2977152696,GCF_003850845.1:1.2075012951)0:0.0000020645)0:0.3300841784,GCF_000755675.1:0.0192709240)0:0.0000022336)0:0.0003230373,GCF_000964225.1:0.6464079685)0:0.0000023058)0:0.0000024507,((((((GCF_000010925.1:0.6580623757,(GCF_018256865.1:0.0729211723,GCF_017377715.1:1.6458223012)0:0.0000028144)0:0.0028590724,(GCF_024158445.1:0.5296455512,GCF_024158425.1:0.1631401068)0:0.0051417052)0:0.0000023336,(GCF_001580995.1:0.0095597996,GCF_011516835.1:0.0096655497)0:0.2897252936)0:0.0000021271,GCF_018256955.1:0.1057967928)0:0.4949788116,(GCF_002173775.1:0.0033052381,GCF_014132135.1:0.0135786395)0:0.0220684424)0:0.0027175137,((GCF_008704285.1:0.5204516769,(GCF_003850945.1:0.0142766333,GCF_002554745.1:1.6439203740)0:0.0000010000)0:0.0000023521,GCF_001581075.1:1.6439085500)0:0.0018296433)0:0.0032985240)0:0.0000024196,GCF_011516735.1:0.0311539844)0:0.0007464121)0:0.0000021260,((((GCF_018256895.1:0.0238122931,GCF_002276555.1:0.3567056482)0:0.0006964877,(((((GCF_022130905.1:0.6759780445,GCF_003850805.1:0.0684026210)0:0.0031813183,(((GCF_006539345.1:0.0117079797,GCF_000429165.1:0.0027619486)0:0.4105163026,GCF_011516925.1:0.2211171337)0:0.0987070937,GCF_009914215.1:0.0889556948)0:0.0000024406)0:0.0002617895,(GCF_024158285.1:0.5527586906,GCF_011516755.1:0.0753710884)0:0.0000020481)0:0.1569666418,GCF_022130805.1:0.6070661649)0:0.1229730883,GCF_002549835.1:0.0191286796)0:0.0000027567)0:0.0000027920,(((GCF_021961685.1:0.4339536548,GCF_000963905.1:0.5731901648)0:0.1371761622,(GCF_018256915.1:0.0068522593,GCF_007991375.1:1.6104189388)0:0.0002093233)0:0.0000025380,GCF_011516865.1:0.5911367130)0:0.0002295569)0:0.0000027232,((GCF_002153475.1:0.3833738686,GCF_000613905.1:0.0057836498)0:0.0000020031,(GCF_002456135.1:0.0089582661,GCF_007989285.1:0.6700584241)0:0.0161515351)0:0.0002918145)0:0.5686823855)0:0.0000025291,(((((((GCF_007989305.1:0.1238860300,GCF_019599335.1:0.2119236220)0:0.2066203913,(((((GCF_000010825.1:0.0000000000,GCF_000010965.1:0.0000000000):0.0000000000,GCF_00
0010945.1:0.0000000000):0.0000000000,GCF_000010885.1:0.0000000000):0.0000000000,GCF_000010865.1:0.0000000000):0.0814400697,GCF_008689815.1:0.0319721336)0:0.0035268865)0:0.0000021231,GCF_018256975.1:0.6696643419)0:0.0000027511,((((GCF_007991395.1:0.0000000000,GCF_011516825.1:0.0000000000):0.6003152616,GCF_011516885.1:0.0099408109)0:0.0000020243,GCF_002173735.1:0.0370744835)0:0.0019782092,(((GCF_014486685.1:1.5739864716,GCF_011516655.1:0.0221378734)0:0.0000020155,(GCF_002358055.1:0.4205428001,(GCF_001581105.1:0.0591542756,(GCF_002276805.1:0.2130518592,GCF_011516765.1:1.4222825718)0:0.1270049725)0:0.0012018872)0:0.0357890234)0:0.0000120705,((GCF_008704255.1:0.6340700535,GCF_011516725.1:0.0244629949)0:0.0000029892,GCF_002153485.1:1.6164220977)0:0.0000022480)0:0.0001081203)0:0.0095606788)0:0.2193311916,(((GCF_000241625.1:0.6839073885,GCF_018256935.1:0.1832314634)0:0.0067353401,(GCF_018256835.1:0.2030163707,GCF_001581005.1:0.3323390344)0:0.0000029143)0:0.0000027006,GCF_002153575.1:1.3448122176)0:0.0000028216)0:0.3003791125,(GCF_000723785.2:0.0082027889,GCF_014218315.1:0.5230507614)0:0.0000628853)0:0.0000020699,((GCF_008689795.1:0.6260611411,GCF_002723895.1:0.2262062301)0:0.1850061244,(((GCF_018256855.1:0.6672368453,GCF_000193245.1:0.0006517461)0:0.0000026401,(GCF_002156945.1:0.4738437206,GCF_011516745.1:0.0001157176)0:0.0000024761)0:0.0483889382,GCF_002153685.1:0.0741115710)0:0.2248808694)0:0.1757701247)0:0.0000021012)0:0.0000028596)0:0.0029442856)0:0.0000026584,(GCF_024158235.1:0.0241025345,GCF_022130845.1:0.0059523180)0:0.0005193121)0:0.0000021230,GCF_001766255.1:0.5120604596)0:0.0000020516,(((GCF_017377735.1:0.0621445686,GCF_000225485.1:0.3345182353)0:0.0027563893,(GCF_024158325.1:0.6234262367,(GCF_014207635.1:0.0585282647,GCF_003850965.1:0.5727943103)0:0.0000020369)0:0.0026639284)0:0.0000024753,GCF_002153515.1:0.0780511343)0:0.5342060505)0:0.0034632152,(((GCF_019083805.1:0.0001091116,GCF_003850865.1:0.0000540390)0:0.0018101296,GCF_003850925.1:0.4782026931)0:0.00000230
29,GCF_000613285.1:0.6558629151)0:0.0000025192)0:0.0000026868,GCF_024158405.1:1.0557610867)0:0.0000020045,GCF_001628715.1:0.0019204407)0:0.4800599364,GCF_002153655.1:0.1172978701)0:0.2140405767,((GCF_002006565.1:0.0894455164,GCF_009295745.1:0.0927561460)0:0.2385422753,GCF_002738225.1:0.0099295595)0:0.0000010000);

After getting an error with this tree, I noticed that most nodes have zero confidence. I removed the support values, but the error remained.
I also built the tree with FastTree using fasttree -intree genus_Acetobacter.treefile ../seq/all_seq.fasta> fast.tre (and transformed it into a binary tree with ete3: tree.resolve_polytomy()), but the error remained.
What can I do to solve this problem?
Looking forward to your reply.
Thanks a lot.

the MSA file is uploaded as attachment.
max_16s_ref.txt
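
The log above contains a warning that the reference MSA and the tree have differing numbers of taxa (196 vs. 186). Whether or not that is related to the crash, listing the names that appear in only one of the two files is a quick check. The sketch below is a rough helper, not part of EPA-ng: it extracts leaf names from a newick string with a regular expression (quoted labels are not handled) and takes the first word of each fasta header as the taxon name.

```python
import re


def newick_leaf_names(newick):
    """Collect labels that directly follow '(' or ',' (leaves, not inner nodes)."""
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def fasta_names(fasta_text):
    """Collect the first word of every non-empty fasta header line."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def compare_taxa(newick, fasta_text):
    """Return (names only in the MSA, names only in the tree), both sorted."""
    tree = newick_leaf_names(newick)
    msa = fasta_names(fasta_text)
    return sorted(msa - tree), sorted(tree - msa)


if __name__ == "__main__":
    tree = "(A:0.1,(B:0.2,C:0.3)90:0.05,D:0.4);"
    msa = ">A\nACGT\n>B\nACGT\n>E extra\nACGT\n"
    print(compare_taxa(tree, msa))
```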

Hello, I found an issue running this program

Hello Pierre Barbera,
First, thank you for making this program :) Sadly, even though you wrote a very kind protocol for beginners like me to follow,
my script keeps stopping while running.

INFO Selected: Output dir: ./
INFO Selected: Query file: Onlyotualigned.fasta
INFO Selected: Tree file: RAxML_fastTree.reference_aligned.tree
INFO Selected: Reference MSA: reference_aligned.fasta
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model: GTR+G+F
INFO     ______ ____   ___           _   __ ______
        / ____// __ \ /   |         / | / // ____/
       / __/  / /_/ // /| | ______ /  |/ // / __  
      / /___ / ____// ___ |/_____// /|  // /_/ /  
     /_____//_/    /_/  |_|      /_/ |_/ \____/ (v0.3.8)
ERR Setting tip states failed for sequence: YP_004324525.1
ERR message: Illegal state code in tip "E"

This is the issue.
My data contains amino acids;
it was aligned with MUSCLE, and the reference tree was made with RAxML.
I think I have a problem with my datasets, but I can't pinpoint the exact issue.
Could you please help me with this issue?
Thank you in advance
reference_aligned.txt
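
The error names the state code "E", which is not an IUPAC nucleotide code, while the selected model GTR+G+F is a nucleotide model. One thing worth ruling out is therefore a mismatch between the data type and the model. The sketch below lists the characters in a sequence that are neither nucleotide codes nor gaps; the set of gap symbols is an assumption of this example, and the function names are hypothetical.

```python
NUCLEOTIDE_CODES = set("ACGTURYKMSWBDHVN")  # IUPAC nucleotide codes
GAP_CODES = set("-.?")  # assumption: symbols treated as gaps or missing data


def non_nucleotide_codes(sequence):
    """Return the sorted characters that are neither nucleotide codes nor gaps."""
    return sorted(set(sequence.upper()) - NUCLEOTIDE_CODES - GAP_CODES)


def looks_like_dna(sequences):
    """True if no sequence contains a character outside the nucleotide alphabet."""
    return all(not non_nucleotide_codes(seq) for seq in sequences)


if __name__ == "__main__":
    print(non_nucleotide_codes("MKVLE-"))
    print(looks_like_dna(["ACGT-N", "acgtry"]))
```

Because several amino acid letters double as nucleotide ambiguity codes, a positive result is only a hint, whereas any reported character is a clear sign that the data are not nucleotides.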

Treeparsing error

Hi there,

I'm using the latest version of epa-ng and am getting this error when I try to run it:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Treeparsing failed!

The input tree (edit: which is in newick format) is unrooted and is based on 16,539 sequences. Any idea how I can better troubleshoot what is causing the error?

Thanks,

Gavin

Utilize sequence multiplicities

Although non-standard, many sequence file preprocessing steps add meta-data to the sequence name. For example, in fasta, one often sees

>name_1234

>name;size=1234

in order to note the abundance of the sequence. Such formats are used e.g., by swarm and vsearch.

It would be helpful if epa-ng picks this information up and uses it as multiplicity in the result jplace file.
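
For illustration, the two header conventions mentioned above could be parsed along these lines. This is a sketch of the idea only and says nothing about how EPA-ng would implement it; the function name is invented, and treating a header without an annotation as multiplicity 1 is an assumption.

```python
import re

# ">name;size=1234" (optionally with a trailing ';')
SIZE_PATTERN = re.compile(r"^(?P<name>.+?);size=(?P<count>\d+);?$")
# ">name_1234"
SUFFIX_PATTERN = re.compile(r"^(?P<name>.+)_(?P<count>\d+)$")


def split_abundance(header):
    """Return (name, multiplicity) for a fasta header, defaulting to 1."""
    header = header.lstrip(">").strip()
    for pattern in (SIZE_PATTERN, SUFFIX_PATTERN):
        match = pattern.match(header)
        if match:
            return match.group("name"), int(match.group("count"))
    return header, 1


if __name__ == "__main__":
    for example in (">name_1234", ">name;size=1234", ">plain"):
        print(example, split_abundance(example))
```

The underscore form is ambiguous, since a trailing number may simply be part of the name (for example a sample index), so any real implementation would probably need this to be opt-in.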

Treeparsing error

Hello, I'm trying to run epa-ng with a newick tree generated with ETE3 (after resolving the polytomies) and I receive this error:

terminate called after throwing an instance of 'std::runtime_error'
what(): Treeparsing failed! syntax error, unexpected $end, expecting ','. (line 1 column 55)

I have checked the newick format several times, and I don't understand why it is reading an "end" in the middle of the tree...
Could you guide me in solving this problem, please?

Thank you very much in advance
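
A message about an unexpected end of input suggests looking at the overall structure of the tree file first. The sketch below reports a few structural observations about a newick string: line breaks, a missing terminating semicolon, and unbalanced parentheses. It is a generic sanity check under the assumption that labels are unquoted, not a reimplementation of the parser used by EPA-ng, and the function name is made up.

```python
def newick_problems(text):
    """Return a list of structural observations about a newick string."""
    problems = []
    stripped = text.strip()
    if "\n" in stripped or "\r" in stripped:
        problems.append("tree spans multiple lines")
    if not stripped.endswith(";"):
        problems.append("missing terminating ';'")
    depth = 0
    for position, ch in enumerate(stripped, start=1):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at position {position}")
                break
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    return problems


if __name__ == "__main__":
    print(newick_problems("(A,(B,C));"))
    print(newick_problems("(A,(B,C)"))
```

An empty list only means that none of these particular problems was found; other causes, such as unusual characters in labels, are not covered.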

Internal node labels disappear from reference tree

Hi,

It appears that the internal node labels from the reference tree are not present in the output file. Is there an option to preserve these labels?

Thanks.

FastTree

Hi, I'm just setting up a pipeline to place query sequences onto a phylogenetic tree of ~5,000 reference sequences. I'm finding RAxML to be very slow for tree building and wondered whether epa-ng supports the use of FastTree. Will this be a problem when I have to supply the model parameters to epa-ng? I've looked online but haven't found any examples of this.

Cheers,
Andrew

The model parameters under data partitioning

If a phylogenetic tree was generated using likelihood searches under data partitioning, how do I get the model parameters?
I have a reference tree that was generated from 10 partitions. My query sequences are from one of the 10 partitions. I am not sure whether I can also use this command (-f e) to get model parameters for phylogenetic placement:
raxmlHPC-AVX -f e -s $REF_MSA -t $TREE -n file -m GTRGAMMAI

Version option broken

Minor issue, but I just noticed that --version and -v don't actually return the version number.

Thanks,

Gavin

Address issues with heuristic placement

For "weird" sequences that don't fit in the tree, or may not align correctly, or are just plain wrong, we have observed a tendency for them to be placed on longer branches of a reference tree, sometimes with a high to very high LWR. While some of these may be caught by filtering based on the pendant length of the queries, the real problem lies with the heuristic preplacement phase, which is the likely culprit. Specifically, during this phase queries are inserted using a default pendant length of 0.9, which for some cases may simply be too long.

This also touches on identification of "novel" lineages in the query data, which is usually a goal of placement analyses. However the primary goal is to re-establish LWR as the primary criterion for placement confidence.

-INF logl at branch 0 error

Hello,

It is my first time reporting an issue on github so I'm sorry if I'm not the clearest.

I've been using epa-ng v0.3.5 for phylogenetic placement with no problems so far, but recently I'm stuck with an error.

This is the code line I wrote (I changed the name to make it more clear) :
epa-ng --model GTR{1.08637/3.57995/1.86164/0.81671/6.21607/1.00000}+FU{0.235/0.257/0.328/0.179}+IU{0.014}+G4{0.639} --ref-msa reference-seq.ali --tree reference-tree.treefile --query query.ali --outdir EPA-ng-Jplacer --redo

The return I got is:

INFO Selected: Output dir: EPA-ng-Jplacer/
INFO Selected: Query file: query.ali
INFO Selected: Tree file: reference-tree.treefile
INFO Selected: Reference MSA: reference-seq.ali 
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model: GTR{1.08637/3.57995/1.86164/0.81671/6.21607/1.00000}+FU{0.235/0.257/0.328/0.179}+IU{0.014}+G4{0.639}
INFO    Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 0.639 (user),  weights&rates: (0.25,0.0612386) (0.25,0.333269) (0.25,0.899982) (0.25,2.70551) 
        P-inv (user): 0.014
        Base frequencies (user): 0.235235 0.257257 0.328328 0.179179 
        Substitution rates (user): 1.08637 3.57995 1.86164 0.81671 6.21607 1
INFO     ______ ____   ___           _   __ ______
        / ____// __ \ /   |         / | / // ____/
       / __/  / /_/ // /| | ______ /  |/ // / __  
      / /___ / ____// ___ |/_____// /|  // /_/ /  
     /_____//_/    /_/  |_|      /_/ |_/ \____/ (v0.3.5)
INFO Output file: EPA-ng_Jplacer_Article3_cleaned4/epa_result.jplace
srun: error: Amoeba-mt6: task 0: Aborted
task 0: Aborted

And if I look at the log file I get :
terminate called after throwing an instance of 'std::runtime_error'
what():  -INF logl at branch 0 with sequence QUERY_DALLOL_1339_Gt_7Gt_7Gt-pp_Ass

Honestly, I don't understand what this means, and I haven't seen anybody report the same problem.

I probably made a dumb mistake somewhere, but I don't have a clue what it is. Can you please help me?

What I have tried already:

The sequence named in the error is the first one in my file, so I tried removing it. The same issue appears with the next sequence, so the problem is not that particular sequence.
Using EPA-ng with other files: works OK (but the reference tree has fewer OTUs; for this one it's 2993 OTUs for the reference sequences alone).
Reducing the number of queries from 1634 to 10 to see if it was a query number problem: same issue.
Rerooting my tree: same issue.
Running EPA-ng with a different number of threads: same issue.
Checking for hidden characters: didn't detect any.
Redoing everything and checking the files (the tree is in Newick format from IQTree): didn't change anything.

Hope I was clear and that there's a solution to this issue
Thanks for reading / your help
JB

model info error

Hello, I am having trouble running epa-ng with the following error:
INFO Selected: Output dir: ./epa_tree/
INFO Selected: Query file: query.fasta
INFO Selected: Tree file: T3.raxml.bestTree
INFO Selected: Reference MSA: reference.fasta
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model file: RAxML_info.info
what(): Model string in provided file seems wrong.
XXXX.sh: line 20: 26465 Aborted (core dumped) epa-ng --tree T3.raxml.bestTree --ref-msa reference.fasta --query query.fasta --outdir $OUT --model RAxML_info.info

I am attempting to align 806 amplicon sequences to 1121 nifH reference sequences. I started by running raxml-ng to build a reference tree on muscle-aligned ref seqs with the following command:
raxml-ng --msa T2.raxml.rba --model GTR+G --prefix T3 --threads 8 --seed 8273

I then used papara to align the query seqs, and then raxml-ng --split to separate the aligned seqs.

In my first go running epa-ng, I provided the example model parameters suggested in the full stack tutorial to define the model:
GTR{0.7/1.8/1.2/0.6/3.0/1.0}+FU{0.25/0.23/0.30/0.22}+G4{0.47}

But I got the following error:
ERR When using epa-ng like this, a model has to be explicitly specified!
You may specify it generically (GTR+G), however parameters will not be optimized.
Instead we reccommend to use RAxML to re-evaluate the parameters and then pass the resulting
RAxML_info file to the epa-ng --model argument. epa-ng will then auto-parse the parameters.
( raxmlHPC -f e -s -t -n info -m GTRGAMMAX )

So I ran the example command above (although I got an error that led me to change the -m option to GTRGAMMA; the only other possible input is GTRGAMMAI), and that executed fine.
But using the resulting RAxML_info file as input for epa-ng threw the error shown at the top of this post.

Is there some other way to get around this? If it helps, below are the contents of the RAxML_info file:
_This is RAxML version 7.3.0 released by Alexandros Stamatakis in June 2011.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (Univ. Tsukuba)

Alignment has 4167 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 93.41%

RAxML Model Optimization up to an accuracy of 0.100000 log likelihood units

Using 1 distinct models/data partitions with joint branch length optimization

All free model parameters will be estimated by RAxML
GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter

GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units

Partition: 0
Alignment Patterns: 4167
Name: No Name Provided
DataType: DNA
Substitution Matrix: GTR
RAxML was called as follows:

raxmlHPC -f e -s ref.clus.phyi -t T3.raxml.bestTree -n info -m GTRGAMMA

Testing which likelihood implementation to use
Standard Implementation full tree traversal time: 2.301094
Subtree Equality Vectors for gap columns full tree traversal time: 0.809563
... using SEV-based implementation

Model parameters (binary file format) written to: /home/rodrigues-lab/msa_red/epa_ng/RAxML_binaryModelParameters.info

Overall Time for Tree Evaluation 419.071737
Final GAMMA likelihood: -186925.416854
Number of free parameters for AIC-TEST(BR-LEN): 2248
Number of free parameters for AIC-TEST(NO-BR-LEN): 9

Model Parameters of Partition 0, Name: No Name Provided, Type of Data: DNA
alpha: 1.029898
Tree-Length: 201.377284
rate A <-> C: 1.154527
rate A <-> G: 2.645042
rate A <-> T: 1.360458
rate C <-> G: 1.626075
rate C <-> T: 3.503977
rate G <-> T: 1.000000

freq pi(A): 0.240682
freq pi(C): 0.260669
freq pi(G): 0.267798
freq pi(T): 0.230851_
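
One possible workaround, while the file itself is not accepted, is to assemble an explicit model string from the values printed above. The sketch below is not epa-ng's own parser: `model_from_raxml_info` is a made-up helper, it only handles a single DNA partition under GTR+GAMMA, and it assumes the `GTR{...}+FU{...}+G4{...}` notation used in the model strings quoted elsewhere on this page, with rates in the order A-C, A-G, A-T, C-G, C-T, G-T.

```python
import re

def model_from_raxml_info(text):
    """Build a GTR+FU+G4 model string from the text of a RAxML_info file.

    Assumes one DNA partition evaluated under GTRGAMMA (hypothetical helper).
    """
    num = r"([0-9.eE+-]+)"

    def grab(pattern):
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"pattern not found: {pattern}")
        return match.group(1)

    pairs = ["A <-> C", "A <-> G", "A <-> T", "C <-> G", "C <-> T", "G <-> T"]
    rates = [grab(rf"rate {re.escape(pair)}:\s*{num}") for pair in pairs]
    freqs = [grab(rf"freq pi\({base}\):\s*{num}") for base in "ACGT"]
    alpha = grab(rf"alpha:\s*{num}")
    return "GTR{%s}+FU{%s}+G4{%s}" % ("/".join(rates), "/".join(freqs), alpha)
```

For the file contents shown above this would give `GTR{1.154527/2.645042/1.360458/1.626075/3.503977/1.000000}+FU{0.240682/0.260669/0.267798/0.230851}+G4{1.029898}`, which can then be passed to `--model` in quotes. Whether that resolves the original error has not been verified here.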

Error when compiling MPI enabled version

Without MPI everything works fine, but when I enable the EPA_HYBRID=1 then I get this error:

[ 90%] Building CXX object src/CMakeFiles/epa_module.dir/net/epa_mpi_util.cpp.o
In file included from /gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.cpp:1:0:
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp: In function ‘void epa_mpi_send(T&, int, MPI_Comm)’:
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp:110:58: error: there are no arguments to ‘memcpy’ that depend on a template parameter, so a declaration of ‘memcpy’ must be available [-fpermissive]
   memcpy(buffer, data.c_str(), data.size() * sizeof(char));
                                                          ^
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp:110:58: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp: In function ‘void epa_mpi_isend(T&, int, MPI_Comm, request_tuple&, Timer<>&)’:
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp:148:58: error: there are no arguments to ‘memcpy’ that depend on a template parameter, so a declaration of ‘memcpy’ must be available [-fpermissive]
   memcpy(buffer, data.c_str(), data.size() * sizeof(char));
                                                          ^
make[3]: *** [src/CMakeFiles/epa_module.dir/net/epa_mpi_util.cpp.o] Error 1

Loaded modules:
openmpi-4.0.0
cmake/3.8.2
gcc-5.2.0

Conda installation syntax error

Hey again,

I'm running into this error in the bioconda circleci tests:

In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.cpp:1:
In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.hpp:4:
/Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/rtree_mapper.hpp:73:35: error: implicit instantiation of undefined template 'std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >'
      throw std::invalid_argument{std::string("Edge ") + std::to_string(i) + " is the root edge! Please handle separately"};
                                  ^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/_build_env/bin/../include/c++/v1/iosfwd:193:32: note: template is declared here
    class _LIBCPP_TEMPLATE_VIS basic_string;
                               ^
In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.cpp:1:
In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.hpp:4:
/Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/rtree_mapper.hpp:73:63: error: no member named 'to_string' in namespace 'std'
      throw std::invalid_argument{std::string("Edge ") + std::to_string(i) + " is the root edge! Please handle separately"};

The full log is here.

This looks like a syntax error that is only caught for some reason in the Mac OS X stage of testing (it passes linting and the Linux tests). You can see the EPA-NG bioconda PR here.

Please let me know if you need more details!

Thanks,

Gavin

Placements sensitive to outlier sequences?

I recently noticed that the placements of all query sequences can be affected by the presence of a small number of outlier sequences when placing onto a large tree. This problem appears to especially affect the pendant length estimates and less so the edge placements. I noticed this issue when running different subsets of a dataset with PICRUSt2, which wraps EPA-NG.

I've reproduced this for query datasets of 323 sequences placed into a tree of 20,000 sequences. In this example only one query sequence differs between the test datasets (essentially, in one case the query sequence doesn't align to the reference).

In the original case the focal query sequence alignment looks like this:

>02905cfb87861c837dde629596d9272b
....-----GGTCTTGACATC--CCTCT-GACGAGTGAGTAATGT-CG--CTT--T-C--CC--T----T----C--GG---------G------G--C-A-G-A-GGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAGTAGCCAGCA----GTAAGA-----TGGGAACTCTAGAGAGACTGCCGGGGATAACCCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCAGGGCTACACACGTGCTACAATGGC-G-T-A-A-ACAGAGGGA-AGCGACCCTGTGAAGGTAAGCAAATCCCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGA-ATGTCGCGGTGAATACGTTCCCGG----...

I swapped in random DNA for a different test file (with the same header) and the alignment looks like this (only a single base aligned):

>02905cfb87861c837dde629596d9272b
....--------------T---------------....

You can see in the jplace outputs of running EPA-NG that there are many differences in the placements, particularly in the pendant distances.

I think this issue might be related to issue #29. I didn't expect all placements to be affected by a single weird sequence. I wasn't able to reproduce this issue with the example datasets used in the EPA-NG paper and I'm thinking that maybe this issue only arises with large trees. Any insight would be greatly appreciated!

I ran EPA-NG with this command:

epa-ng --tree pro_ref.tre \
       --ref-msa ref_seqs_hmmalign.fasta \
       --query STUDY_SEQS \
       --chunk-size 5000 \
       -T 20 \
       -m pro_ref.model \
       -w epa_out \
       --filter-acc-lwr 0.99 \
       --filter-max 100

You can see the input and output files attached in this zipfile: placement_test.zip.

STUDY_SEQS corresponds to study_seqs_hmmalign_original.fasta and study_seqs_hmmalign_funky.fasta for the original dataset and the dataset with the outlier sequence, respectively. The output jplace files are named the same way.

Thanks,

Gavin

unimplemented pure virtual method error

Hi there,

I've been troubleshooting whether EPA-NG could be installed with conda and I've been running into this error when trying to install with clang rather than gcc (first reported here):

/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/build/genesis_unity_sources/lib/all.cpp:17912:56: error: allocating an object of abstract class type 'const utils::ColorNormalization'
tree, params, std::vector<utils::Color>{}, {}, {}, svg_filename
^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/libs/genesis/lib/genesis/utils/tools/color/normalization.hpp:168:20: note: unimplemented pure virtual method 'normalize_' in 'ColorNormalization'
virtual double normalize_( double value ) const = 0;
^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/libs/genesis/lib/genesis/utils/tools/color/normalization.hpp:173:18: note: unimplemented pure virtual method 'is_valid_' in 'ColorNormalization'
virtual bool is_valid_() const = 0;
^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/build/genesis_unity_sources/lib/all.cpp:17923:45: error: allocating an object of abstract class type 'const utils::ColorNormalization'
tree, params, color_per_branch, {}, {}, svg_filename

I know EPA-NG hasn't been tested on Mac OS X (which uses clang), but from what I understand this may actually be a bug that wasn't caught by gcc. This is based on the development version forked on Sep 5, 2018 (commit: 45a8e53). Please let me know if you would like more details!

Gavin

RNA data causes error with heuristic

The character U is not accounted for in the preplacement lookup table.

Easy to fix: the reverse lookup char_to_posish_ just needs an appropriate entry.

A bit unclear model notation

Hi!
I would like to pass on the model to EPA-ng manually for the command line, for example

--model GTR{0.7/1.8/1.2/0.6/3.0/1.0}+F+R9{r1/r2/.../r9}{w1/w2/.../w9}

Is the order of the relative rates in GTR{} A-C, A-G, A-T, C-G, C-T, G-T?

I'm just double checking this, since it is not explicitly stated in the readme.

Could you also update the README for this?

Cheers,

Joran

Avoid duplicate work

Unlike pplacer and the old RAxML-EPA, epa-ng re-computes the placements of identical sequences. This is not necessary.

Possible solution: Store hashes of the sequences that have already been processed. If a new sequence has a hash that was seen before, add the name to the list of names for the pquery of the previous sequence (or, if that name also already exists, increment its multiplicity). This assumes that hash collisions don't occur, so the hash function should be good enough (SHA1?).
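
A rough sketch of the hashing idea, written in Python for brevity rather than in the C++ of the code base. The function name and the returned structure are assumptions made for illustration, as is the choice to treat upper- and lower-case characters as identical.

```python
import hashlib

def group_identical(records):
    """Group (name, sequence) pairs by the SHA-1 digest of the sequence.

    Returns one entry per distinct sequence, in first-seen order, with a
    mapping from each name to its multiplicity.
    """
    groups = {}
    for name, seq in records:
        # Assumption: case differences do not make two sequences distinct.
        digest = hashlib.sha1(seq.upper().encode("ascii")).hexdigest()
        entry = groups.setdefault(digest, {"sequence": seq, "names": {}})
        # A repeated name increments its multiplicity instead of being added twice.
        entry["names"][name] = entry["names"].get(name, 0) + 1
    return list(groups.values())
```

Only one placement would then be computed per entry, and the names with their multiplicities attached to that single pquery. Comparing the stored sequence on a digest match would remove the reliance on collision-free hashing at the cost of some memory.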

Provide test scripts / sanity checking for jplace

Especially for this early beta phase, it would be great to have a script to check the basic sanity of the output. This could allow the user to ensure some validity, and help me debug problems if the checks are broken for their various data.
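
As a starting point, such a script might run a few structural checks on the parsed output. The sketch below is not an official checker: the function name, the tolerance, and the particular set of checks are assumptions. It relies only on the usual jplace layout, with a `tree` string carrying `{edge_num}` annotations, a `fields` list, and a `placements` list.

```python
import re

def check_jplace(jplace, tol=1e-6):
    """Return a list of human-readable problems found in a parsed jplace document."""
    problems = []
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    pen_i = fields.index("pendant_length")
    # Edge numbers are annotated in the tree string as {0}, {1}, ...
    edges = {int(e) for e in re.findall(r"\{(\d+)\}", jplace["tree"])}
    for idx, pquery in enumerate(jplace["placements"]):
        if not (pquery.get("n") or pquery.get("nm")):
            problems.append(f"pquery {idx}: no query names")
        lwr_sum = 0.0
        for row in pquery["p"]:
            if row[edge_i] not in edges:
                problems.append(f"pquery {idx}: unknown edge_num {row[edge_i]}")
            if not 0.0 <= row[lwr_i] <= 1.0 + tol:
                problems.append(f"pquery {idx}: LWR {row[lwr_i]} outside [0, 1]")
            if row[pen_i] < 0.0:
                problems.append(f"pquery {idx}: negative pendant length {row[pen_i]}")
            lwr_sum += row[lwr_i]
        # Filtered output may sum to less than 1, so only the upper bound is checked.
        if lwr_sum > 1.0 + tol:
            problems.append(f"pquery {idx}: LWRs sum to {lwr_sum}")
    return problems
```

An empty list means none of these particular checks failed; it says nothing about whether the placements are biologically sensible.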

Memory management strategy for large trees / alignments

Currently the epa-ng process might be killed by the system for exceeding the maximum amount of memory. This can happen with large trees / alignments. We need a strategy to deal with this, possibly employing an offloading strategy that prefetches data from disk intelligently.

Add checkpointing

Add the ability to detect incomplete runs and resume where the last one left off.

Help on Aborted (core dumped) message

Hi,

Similar to a recent post, I am trying to do placement with a 1.3k-sequence msa file built with mafft. I first built a conda env with raxml & epa-ng.

I then checked my msa file with the --parse & --check commands from raxml-ng; it suggested I would need 7 threads and 1200MB of memory. I have around 200GB available, so I don't really understand why I would have insufficient resources.

This is my current command line (I also used iqtree web server to find the best model and parameters).

epa-ng --ref-msa $refmsa --tree $reftree --query $querymsa -w . --threads 7 --chunk-size 500 --model GTR+F+I+G4

No error is printed other than the Aborted (core dumped) message.

Any suggestions that could help me ?

Edit :

Used pythia on my query sequences and obtained a 0.63 difficulty.
This is not really the place for it, but since I posted: I also tried pplacer and got the same aborted error, with an additional message I don't understand yet (I am very new to phylogenetic placement):

pplacer: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.
Aborted (core dumped)

Tried the test data, it worked for 10k sequences but failed for 100k returning the same Aborted (core dumped) message. Edit : It worked for 100k seq after adding the --redo option for some reason.
I am currently trying to reduce the number of queries, deduplicate the reference sequences, and any other strategies I could find, but so far I am stuck on the same issue.

some unanswered questions I have so far

I read "Many placement methods require query reads to be aligned to the reference". I am not sure I understand this step. I also found this information in this tutorial here. Does it mean I should pre-align the queries to the reference .aln before calling epa-ng? I also found some workflows that align the queries to an HMM profile of the reference sequences, but I feel I am missing something here...
Following this last point, I generated an HMM profile from the deduplicated reference msa file, then ran hmmalign on my queries against this profile with the --mapali option. I then extracted two msa files, one with queries only and one with references only. Got the same error again. I am starting to think the issue comes from my reference tree (is 1.3k sequences too big?)

thxx,
Paul

Unable to parse model file from IQtree

Hello. EPA-NG fails to parse a model file from the latest IQtree version (option --model), when the selected DNA substitution model is not GTR. The attached IQtree model file, where the selected model is TIM3e+R10, can be used to recreate the following error:
libc++abi: terminating due to uncaught exception of type std::invalid_argument: Couldn't parse model file! (can't find 'A-R: '!)

What basically happens is that EPA-NG looks for "GTR" as the DNA substitution model (lines 174-178 in the EPA-NG source code parse_model.hpp), and if this is not found, it thinks that it is dealing with an amino-acid model. Consequently, it then looks for the substitution rate between amino acids A & R (line 182 in source file parse_model.hpp), which of course does not exist since the model is in fact a DNA substitution model (TIM3e+R10).

My impression was that EPA-NG can handle more DNA substitution models beyond just GTR, however right now this does not seem to be the case (at least not if the model file is from IQtree). Is it possible to fix this issue? It seems that this could be achieved easily using either of the following approaches:

Give the user the option to explicitly specify whether the input model is a DNA or AA substitution model.
Don't automatically switch to AA if the model is not explicitly written "GTR" in the IQtree file, but instead also accept other common specifiers such as "TIM3e" (full list here).

Thank you!

cannot compile libpll

When I run "make pll" in the epa folder to start the installation,
I get:
mkdir -p bin
cd libs/pll-modules && ./install-with-libpll.sh ..
/bin/sh: 1: ./install-with-libpll.sh: not found
and indeed I cannot find the file install-with-libpll.sh anywhere.

Error running this command

Hi there, I am trying to follow this picrust tutorial

After running this command: place_seqs.py -s ../seqs.fna -o out.tre -p 1 --intermediate intermediate/place_seqs

It shows an error running this command:
epa-ng --tree /home/tayezy/miniconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/prokaryotic/pro_ref/pro_ref.tre --ref-msa intermediate/place_seqs/ref_seqs_hmmalign.fasta --query intermediate/place_seqs/study_seqs_hmmalign.fasta --chunk-size 5000 -T 1 -m /home/tayezy/miniconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/prokaryotic/pro_ref/pro_ref.model -w intermediate/place_seqs/epa_out --filter-acc-lwr 0.99 --filter-max 100

Any idea what might be the issue or how to resolve this?

Thanks in advance

error when doing the final epa-ng placement

Hi!
I got this error when running:
epa-ng -t reference_tree.raxml.bestTree -s reference_aligment.fasta -q query.fasta -w ./ --model reference_tree.raxml.bestModel

Please find attached the .err and the info.log files
jobLog_447616.err.txt
epa_info.log.txt

It looks like the query.fasta produced by epa-ng --split has a problem, but it looks normal to me.

Thank you so much for your help.

Compilation error on Bioconda

While rebuilding epa-ng on Bioconda due to updated compilers (GCC 10) I'm seeing the following error:

 [  5%] Building C object libs/pll-modules/libs/libpll/src/CMakeFiles/pll_obj.dir/parse_utree.c.o
 In file included from /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.c:236:
 /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.h:102:6: error: conflicting types for 'pll_utree_error'
   102 | void pll_utree_error (const char *msg);
       |      ^~~~~~~~~~~~~~~
 /opt/conda/conda-bld/epa-ng_1646126155046/work/libs/pll-modules/libs/libpll/src/parse_utree.y:143:13: note: previous definition of 'pll_utree_error' was here
   143 | static void pll_utree_error(pll_unode_t * node, const char * s)
       |             ^~~~~~~~~~~~~~~
 /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.c: In function 'pll_utree_parse':
 /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.c:1727:18: warning: passing argument 1 of 'pll_utree_error' from incompatible pointer type [-Wincompatible-pointer-types]
  1727 |         yyerror (tree, yymsgp);
       |                  ^~~~
       |                  |
       |                  struct pll_unode_s *

Is cmake perhaps downloading the wrong version of pll-modules?

Accept RAxML-ng style model descriptor

Intuition for a common taxonomic-assignment procedure

Hello Pierre,

I have a conceptual problem. For a set of query sequences coming from metagenomic origin, I want to know the taxonomy. I have a reference tree with ~600 sequences. Should I:

Build a tree with all the sequences and generate a manual taxonomy.
Build a tree with only the reference sequences, and afterwards use epa-ng to obtain the placements of the query sequences (~600 bp at least) in this tree. Afterwards, use gappa to obtain the taxonomy.

I think the latter is a feasible option, but I have one major concern. If I have 400 query sequences, wouldn't this be a case of overfitting? Maybe some environmental sequences form a new cluster by themselves. So, what would be the most common approach to that?

Additionally, I also have an amplicon dataset coming from the same metagenomic sequences, covering a smaller region. Should I place these queries on the whole tree (references + query metagenomic seqs), or could this again be considered overfitting, so that I should work only with the references (which are in fact the origin of the taxonomy)?

EPA-ng is really fast and useful, thank you for working on it :)

MacOS M1 neon support

Hello Team,

Is EPA-ng not supported on aarch64 (arm64) platforms such as macOS M1? Is there any plan to support them in the future?

Thanks,
Jianshu

Zero-length branches altered

Firstly, thanks for developing such a great tool! I've had an odd issue occur whereby some zero-length branches in the reference tree get replaced with a branch length of 0.1053605157 in the epa_result.jplace file. I've attached an example that reproduces the issue:
submit.tar.gz

The command I ran was:

epa-ng -T 4 -m LG --redo --tree test.tree.tre --ref-msa test.msa.ref.fa --query test.msa.query.fa --preserve-rooting on --outdir out

test.tree.tre was:
"(g33013252_Mapoly0056s0034.1.p:0,((g33020677_Mapoly0173s0004.1.p:0,g33026704_Mapoly0052s0024.1.p:1.11896)100:0.662911,g33026973_Mapoly0016s0089.1.p:0.312835)1:0.234546);"

whereas the relevant line in epa_result.jplace was
"tree": "(g33013252_Mapoly0056s0034.1.p:0.0000000000{0},((g33020677_Mapoly0173s0004.1.p:0.1053605157{1},g33026704_Mapoly0052s0024.1.p:1.1189600000{2})100:0.6629110000{3},g33026973_Mapoly0016s0089.1.p:0.3128350000{4})1:0.2345460000{5});",

Note the zero branch length for "g33013252_Mapoly0056s0034.1.p" and the non-zero branch for "g33020677_Mapoly0173s0004.1.p"

This also happens in a larger tree, with the same resulting branch-length, even from branches far away from where the gene was inserted. Many/all of the zero length branches were replaced with a branch length of the same value as in the example above, 0.1053605157.
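
To list every affected leaf in a larger tree, the two Newick strings can be compared directly. The following is a small sketch with made-up helper names; it relies on a regular expression rather than a full Newick parser, so it assumes unquoted leaf labels and only looks at leaf branches.

```python
import re

# A leaf label directly follows "(" or "," and is followed by ":<length>".
LEAF = re.compile(r"(?<=[(,])([^(),:;{}]+):([0-9.eE+-]+)")

def leaf_lengths(newick):
    """Map each leaf label to its branch length."""
    return {name: float(length) for name, length in LEAF.findall(newick)}

def changed_leaves(original, jplace_tree, tol=1e-9):
    """Return {leaf: (length_before, length_after)} for leaves whose length differs."""
    before = leaf_lengths(original)
    after = leaf_lengths(jplace_tree)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and abs(before[name] - after[name]) > tol
    }
```

Applied to the two trees quoted above, this reports only g33020677_Mapoly0173s0004.1.p, with a length of 0 before and 0.1053605157 after.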

All the best
David

Segmentation fault when processing a large number of sequences

I'm trying to use the latest version of EPA-ng to process ~5,000 sequences, but I get a Segmentation fault (core dumped) after 3500 sequences. I don't get this issue in version 0.2.1-beta.

Segfault with large query volume under MPI

Program produces following error message when started with large volume of queries (in this case half a TB, 100M neotrop sequences):

INFO Output file: /hits/basement/sco/barbera/neotrop/out/128-hybrid-100M-10/epa_result.jplace
[haswell-002:21508] *** Process received signal ***
[haswell-002:21508] Signal: Segmentation fault (11)
[haswell-002:21508] Signal code: Address not mapped (1)
[haswell-002:21508] Failing at address: 0xbb03f8088
[haswell-002:21508] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7fe83c046100]
[haswell-002:21508] [ 1] /hits/sw/shared/apps/OpenMPI/1.10.2-GCC-4.9.3-2.25/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_malloc+0x7ea)[0x7fe83b7137ba]
[haswell-002:21508] [ 2] /hits/sw/shared/apps/OpenMPI/1.10.2-GCC-4.9.3-2.25/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_memalign+0x52)[0x7fe83b715b62]
[haswell-002:21508] [ 3] /hits/sw/shared/apps/OpenMPI/1.10.2-GCC-4.9.3-2.25/lib/libopen-pal.so.13(opal_memory_ptmalloc2_memalign+0xbf)[0x7fe83b715f9f]
[haswell-002:21508] [ 4] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_Znwm+0x18)[0x7fe83c9fab58]
[haswell-002:21508] [ 5] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSs4_Rep9_S_createEmmRKSaIcE+0x59)[0x7fe83ca5a799]
[haswell-002:21508] [ 6] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSs4_Rep8_M_cloneERKSaIcEm+0x1b)[0x7fe83ca5b40b]
[haswell-002:21508] [ 7] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSs7reserveEm+0x30)[0x7fe83ca5b4b0]
[haswell-002:21508] [ 8] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSt15basic_stringbufIcSt11char_traitsIcESaIcEE8overflowEi+0xa4)[0x7fe83ca37884]
[haswell-002:21508] [ 9] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSt15basic_streambufIcSt11char_traitsIcEE6xsputnEPKcl+0x85)[0x7fe83ca3b635]
[haswell-002:21508] [10] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l+0xe6)[0x7fe83ca32bd6]
[haswell-002:21508] [11] epa(_Z23sample_to_jplace_stringRK6SampleI9PlacementE+0x1b8)[0x49e1a8]
[haswell-002:21508] [12] epa(_Z10simple_mpiR4TreeRKSsS2_RK7OptionsS2_+0x186f)[0x4de28f]
[haswell-002:21508] [13] epa(main+0x659f)[0x48358f]
[haswell-002:21508] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe83bc96b15]
[haswell-002:21508] [15] epa[0x48d7c1]
[haswell-002:21508] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21508 on node haswell-002 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Preserve Rooting for rooted input trees

Currently, rooted trees are de-rooted during the parsing, and they stay unrooted. This does not conform to raxml and pplacer.

question on comparison with RAxML-epa

hello Team,

I can use epa algorithm in RAxML like this:

mafft --addfragments ${name}_NarG_all.faa --reorder --thread 64 NarG.ref.afa > ${name}_Refs_NarG_reads.aln

raxmlHPC-PTHREADS-AVX2 -f v -s ${name}_Refs_NarG_reads.aln -t RAxML_bestTree.NarG_ref_tree -m PROTGAMMAAUTO -n ${name}_JPlace_narG -T 64 -G 0.2

That is, I first add the short query sequences (${name}_NarG_all.faa) to the reference alignment (NarG.ref.afa, the long sequences used to generate the backbone tree). Then I run the whole alignment, including the short query sequences (${name}_Refs_NarG_reads.aln), with the tree generated from NarG.ref.afa (RAxML_bestTree.NarG_ref_tree). How can I do this in epa-ng? Since a query MSA file is needed, I assume it is the reads alignment file (no reference included). The reads are very short and most of the time do not overlap. How do I generate a query alignment?
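
Conceptually, the combined alignment produced by the mafft step can be split by sequence name into a reference part and a query part, so that both keep the same alignment width. Other posts on this page mention a `--split` step for this purpose; the sketch below only illustrates the idea, and its function names are made up.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def split_alignment(combined_lines, ref_names):
    """Split a combined alignment into (reference, query) record lists by name."""
    ref_names = set(ref_names)
    reference, query = [], []
    for name, seq in parse_fasta(combined_lines):
        (reference if name in ref_names else query).append((name, seq))
    return reference, query
```

Here `ref_names` would be the headers of the original reference alignment. Taking both parts from the combined file matters because adding fragments may introduce new columns, and the reference and query alignments need to have the same width for placement.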

Thanks,

Jianshu

Support for QMaker and nQMaker from IQ-TREE?

Hi,
Very nice software.
I was wondering whether you plan to support (or already support) the new amino acid substitution models QMaker and nQMaker from the IQ-TREE software?

http://www.iqtree.org/doc/Estimating-amino-acid-substitution-models

Thanks

Setting random seed

Hi @Pbdas,

Is there any plan to allow users to set a random seed at some point to ensure reproducible results? I believe that at the heuristic pre-placement stage there is randomness that can cause slightly differing results. I'm not sure how feasible setting the random seed at the start of this process would be.

Thanks for your thoughts!

PLL assertion error

Hi Pierre,

I'm placing some sequences onto a relatively small tree with EPA-NG v0.3.7 and it's aborting after a failed assertion in PLL:

epa-ng: /home/connor/bin/epa-ng/libs/pll-modules/libs/libpll/src/core_likelihood_avx2.c:911: pll_core_edge_loglikelihood_repeats_generic_avx2: Assertion `site_lk < 0. && isfinite(site_lk)' failed.
Aborted (core dumped)

Here is the command I used:
epa-ng -s references.txt -t NosZ.txt -q queries.txt --model NosZ_epa.model.txt --dyn-heur 0.9 -T 4 --no-pre-mask

Placement is completed if I allow masking by removing '--no-pre-mask'.

Files used:
NosZ.txt
NosZ_epa.model.txt
queries.txt
references.txt

Can you figure out what is wrong?

Thanks!
Connor

brew tap brewsci/bio
brew install epa-ng

brew install brewsci/bio/epa-ng

Cheers,
Gabriel

Converting the query file to `.bfast`