epa-ng's People

Contributors

crosenth, pierrebarbera, ramabit

epa-ng's Issues

Memory management strategy for large trees / alignments

Currently the epa-ng process might be killed by the system for exceeding the maximum amount of memory. This can happen with large trees / alignments. We need a strategy to deal with this, possibly employing an offloading strategy that prefetches data from disk intelligently.
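
One way to picture the offloading idea is a bounded in-memory cache that keeps only the most recently used per-node data and reloads the rest on demand. The sketch below is a generic least-recently-used cache, not EPA-ng's design: `BoundedCache`, the `load` callback, and `prefetch` are hypothetical names, and the cached data is treated as read-only (nothing is written back on eviction).

```python
from collections import OrderedDict


class BoundedCache:
    """Keep at most `capacity` items in memory; reload evicted items on demand."""

    def __init__(self, load, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._load = load          # hypothetical callback, e.g. reads one block from disk
        self._capacity = capacity
        self._slots = OrderedDict()

    def get(self, key):
        if key in self._slots:
            # Mark as most recently used.
            self._slots.move_to_end(key)
        else:
            self._slots[key] = self._load(key)
            if len(self._slots) > self._capacity:
                # Evict the least recently used entry.
                self._slots.popitem(last=False)
        return self._slots[key]

    def prefetch(self, keys):
        """Load items that are expected to be needed soon."""
        for key in keys:
            self.get(key)
```

An "intelligent" prefetch would call `prefetch` with the keys the traversal order is about to visit; how to predict that order is the open question the issue raises.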

Error when compiling MPI enabled version

Without MPI everything works fine, but when I enable EPA_HYBRID=1 I get this error:

[ 90%] Building CXX object src/CMakeFiles/epa_module.dir/net/epa_mpi_util.cpp.o
In file included from /gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.cpp:1:0:
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp: In function ‘void epa_mpi_send(T&, int, MPI_Comm)’:
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp:110:58: error: there are no arguments to ‘memcpy’ that depend on a template parameter, so a declaration of ‘memcpy’ must be available [-fpermissive]
   memcpy(buffer, data.c_str(), data.size() * sizeof(char));
                                                          ^
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp:110:58: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp: In function ‘void epa_mpi_isend(T&, int, MPI_Comm, request_tuple&, Timer<>&)’:
/gpfs/hpchome/a51256/epa/src/net/epa_mpi_util.hpp:148:58: error: there are no arguments to ‘memcpy’ that depend on a template parameter, so a declaration of ‘memcpy’ must be available [-fpermissive]
   memcpy(buffer, data.c_str(), data.size() * sizeof(char));
                                                          ^
make[3]: *** [src/CMakeFiles/epa_module.dir/net/epa_mpi_util.cpp.o] Error 1

Loaded modules:
openmpi-4.0.0
cmake/3.8.2
gcc-5.2.0

Version option broken

Minor issue, but I just noticed that --version and -v don't actually return the version number.

Thanks,

Gavin

RNA data causes error with heuristic

The character U is not accounted for in the preplacement lookup table.

Easy to fix: the reverse lookup char_to_posish_ just needs an appropriate entry.
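
The idea can be sketched as follows. This is an illustration only: the table layout and all names below are hypothetical and not taken from the EPA-ng source. The point is that U should share the slot that T already has. The second helper is a possible user-side workaround, assuming it is applied to sequence lines only and never to FASTA headers.

```python
# Hypothetical illustration; names and layout are not from the EPA-ng source.
NUCLEOTIDES = "ACGT"


def build_reverse_lookup(alphabet, aliases):
    """Map each character (upper and lower case) to its index in `alphabet`."""
    table = {}
    for pos, char in enumerate(alphabet):
        table[char] = pos
        table[char.lower()] = pos
    for alias, target in aliases.items():
        # An alias reuses the position of an existing character.
        table[alias] = table[target]
        table[alias.lower()] = table[target]
    return table


# RNA uses U where DNA uses T, so U gets the same entry as T.
CHAR_TO_POS = build_reverse_lookup(NUCLEOTIDES, {"U": "T"})


def rna_to_dna(sequence):
    """Workaround on the input side: rewrite U/u as T/t before placement."""
    return sequence.translate(str.maketrans("Uu", "Tt"))
```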

Setting random seed

Hi @Pbdas,

Is there any plan to allow users to set a random seed at some point to ensure reproducible results? I believe that at the heuristic pre-placement stage there is randomness that can cause slightly differing results. I'm not sure how feasible setting the random seed at the start of this process would be.

Thanks for your thoughts!

Tree Log-Likelihood -INF Error

Hello Team,
When I try to place a sequence into the tree on Linux, I get the error Aborted (core dumped) without any further information.
I tried the same data and command on macOS, and it crashed again (with libc++abi: terminating with uncaught exception of type std::runtime_error: Tree Log-Likelihood -INF!).
The EPA-ng version is v0.3.8.
The command used:

epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query query_aligned.fasta --model LG+F+G4 --outdir .

System info:

Linux:     #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
MacOS:     20.6.0 Darwin Kernel Version 20.6.0: Fri Dec 16 00:35:00 PST 2022; root:xnu-7195.141.49~1/RELEASE_X86_64 x86_64

Run details (macOS):

% epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query query_aligned.fasta --model LG+F+G4 --outdir .
INFO Selected: Output dir: ./
INFO Selected: Query file: query_aligned.fasta
INFO Selected: Tree file: refer_tree.tree
INFO Selected: Reference MSA: reference_v4.mafft
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model: LG+F+G4
INFO     ______ ____   ___           _   __ ______
        / ____// __ \ /   |         / | / // ____/
       / __/  / /_/ // /| | ______ /  |/ // / __  
      / /___ / ____// ___ |/_____// /|  // /_/ /  
     /_____//_/    /_/  |_|      /_/ |_/ \____/ (v0.3.8)
WARN The reference MSA and tree have differing number of taxa! 196 vs. 186
INFO Using model parameters:
INFO    Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 1 (ML),  weights&rates: (0.25,0.136954) (0.25,0.476752) (0.25,1) (0.25,2.38629) 
        Base frequencies (empirical): 0.250946 0 0 0 0.176994 0 0 0.367461 0 0 0 0 0 0 0 0 0.204599 0 0 0 
        Substitution rates (model): 0.425093 0.276818 0.395144 2.48908 0.969894 1.03855 2.06604 0.358858 0.14983 0.395337 0.536518 1.12403 0.253701 1.17765 4.72718 2.1395 0.180717 0.218959 2.54787 0.751878 0.123954 0.534551 2.80791 0.36397 0.390192 2.4266 0.126991 0.301848 6.32607 0.484133 0.052722 0.332533 0.858151 0.578987 0.593607 0.31444 0.170887 5.07615 0.528768 1.69575 0.541712 1.43765 4.50924 0.191503 0.068427 2.14508 0.371004 0.089525 0.161787 4.00836 2.00068 0.045376 0.612025 0.083688 0.062556 0.523386 5.24387 0.844926 0.927114 0.01069 0.015076 0.282959 0.025548 0.017416 0.394456 1.24028 0.42586 0.02989 0.135107 0.037967 0.084808 0.003499 0.569265 0.640543 0.320627 0.594007 0.013266 0.89368 1.10525 0.075382 2.78448 1.14348 0.670128 1.16553 1.95929 4.12859 0.267959 4.81351 0.072854 0.582457 3.23429 1.67257 0.035855 0.624294 1.22383 1.08014 0.236199 0.257336 0.210332 0.348847 0.423881 0.044265 0.069673 1.80718 0.173735 0.018811 0.419409 0.611973 0.604545 0.077852 0.120037 0.245034 0.311484 0.008705 0.044261 0.296636 0.139538 0.089586 0.196961 1.73999 0.129836 0.268491 0.054679 0.076701 0.108882 0.366317 0.697264 0.442472 0.682139 0.508851 0.990012 0.584262 0.597054 5.30683 0.119013 4.14507 0.159069 4.27361 1.11273 0.078281 0.064105 1.03374 0.11166 0.232523 10.6491 0.1375 6.31236 2.59269 0.24906 0.182287 0.302936 0.619632 0.299648 1.70274 0.656604 0.023918 0.390322 0.748683 1.13686 0.049906 0.131932 0.185202 1.79885 0.099849 0.34696 2.02037 0.696175 0.481306 1.89872 0.094464 0.361819 0.165001 2.45712 7.8039 0.654683 1.33813 0.571468 0.095131 0.089613 0.296501 6.47228 0.248862 0.400547 0.098369 0.140825 0.245841 2.18816 3.15182 0.18951 0.249313
INFO Output file: ./epa_result.jplace
libc++abi: terminating with uncaught exception of type std::runtime_error: Tree Log-Likelihood -INF!
zsh: abort      epa-ng --tree refer_tree.tree --msa reference_v4.mafft --query  --model   .

The query file (same length as the MSA sequences after alignment with MAFFT):

>db29dfd5db5e2501ed9deadabc7dd91d
TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGTGTAGGCGGTTTGGACAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAGCTGCATTTGATACGTCCAGACTAGAGTGTGAGAGAGGGTTGTGGAATTCTCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

It seems that something is wrong with my tree or query: when I placed a different query into a different tree, everything worked.
The tree used in the failing run was built by iqtree2 with the command

iqtree2 -s seq/all_seq.fasta -m LG+F+G4 -pre 'tree/%s_%s' -nt 48 --fast -alrt 1000

and the tree is:

(GCF_001581085.1:0.0001883592,(((((((((((GCF_002153675.1:0.6764403004,GCF_002153735.1:0.0030645544)0:0.0000029411,GCF_002153795.1:0.0158138579)0:0.6605801847,(((((((GCF_008690765.1:0.0000000000,GCF_008690505.1:0.0000000000):0.0000000000,GCF_008690645.1:0.0000000000):0.0000000000,GCF_008690705.1:0.0000000000):0.0000000000,GCF_008690905.1:0.0000000000):0.0000000000,GCF_008690915.1:0.0000000000):0.0000000000,GCF_008704295.1:0.0000000000):0.0416326726,GCF_024158225.1:0.0053982956)0:0.0000024925)0:0.0000024344,((((((((GCF_008690365.1:0.4406791409,GCF_003323795.1:0.4617060122)0:0.1640898965,GCF_024329695.1:0.0262970073)0:0.0003373681,(((GCF_000010845.1:0.6138629501,(GCF_003391275.1:1.6768345925,GCF_014196315.1:0.0139330781)0:0.0000025327)0:0.0000021894,GCF_008704325.1:0.0375132164)0:0.0000029909,((GCF_002153605.1:0.0134270192,GCF_001499615.1:0.4663964356)0:0.0000028317,GCF_008689845.1:0.4877107087)0:0.0033911496)0:0.0000024453)0:0.0000480426,GCF_001953595.1:1.7161662788)0:0.0000026774,(((((GCF_018811985.1:1.2476321189,GCF_001580535.1:0.3701136010)0:0.0000021507,GCF_002153545.1:1.3085919649)0:0.3109927057,GCF_000010905.1:0.0137330848)0:0.0000047088,GCF_001580945.1:0.6034760600)0:0.0000022867,GCF_024158385.1:0.4899289062)0:0.0028710588)0:0.0000028918,GCF_002153745.1:1.6808551804)0:0.0001082810,(GCF_003850905.1:0.0032674677,(((GCF_003966365.1:0.0147410702,GCF_021284605.1:0.5985121941)0:0.0000028764,GCF_001580695.1:0.0350269438)0:0.0005460484,GCF_000613865.1:0.0047823241)0:0.0000164957)0:0.0232318069)0:0.0054273270,(((((((GCF_002153695.1:1.4509147497,(GCF_024158305.1:0.0514046837,(GCF_008365315.1:0.4543791870,(GCF_001642635.1:0.0922419883,GCF_001581035.1:1.6490729713)0:0.0000029322)0:0.0000027350)0:0.1683579951)0:0.0124108208,GCF_002202135.1:0.3419663450)0:0.2244853940,((((GCF_002153775.1:0.2612440122,((GCF_022130865.1:0.4022580284,GCF_024158475.1:1.3897862725)0:0.0247699162,GCF_000379545.1:0.2593021461)0:0.0000027277)100:0.3193728053,(GCF_004341595.1:0.0609295276,(GCF_003850
825.1:0.0112367348,GCF_004014775.2:0.4916741059)0:0.0173493126)0:0.0007038507)0:0.0000021411,(GCF_008690725.1:0.4285462033,GCF_011516935.1:1.4206867824)0:0.1558481053)0:0.0000955949,(((((((GCF_001662905.1:0.4844812877,GCF_002005445.1:0.3366328551)0:0.0317392220,GCF_011516945.1:0.2458425154)0:0.2026757393,((GCF_022130785.1:0.0322073465,GCF_000963945.1:0.5333918479)0:0.0000022699,GCF_000193495.2:1.0022380453)0:0.0000024617)0:0.0045456014,GCF_018256985.1:0.0114794035)0:0.0000200495,GCF_024158185.1:0.7045870891)0:0.0000023903,(GCF_000285315.1:0.5450676598,GCF_008689805.1:0.1960126694)0:0.4423247183)0:0.0000948410,GCF_018811975.1:0.0090654175)0:0.0002828809)0:0.0010058575)0:0.0004283380,(((((((GCF_000963965.1:0.0060146282,((GCF_001580915.1:0.0060752079,GCF_007989335.1:0.6741506451)0:0.0000024009,(GCF_000964205.1:0.2689952602,GCF_001766235.1:0.5150130446)0:0.1617072119)0:0.0000021130)0:0.2398539718,GCF_006539325.1:0.2143134267)0:0.0203072357,GCF_001499675.1:0.5146329584)0:0.0000027795,GCF_008689865.1:0.1825394650)0:0.4026651290,GCF_024158265.1:0.0044401363)0:0.0000028173,((GCF_024158205.1:0.0042233156,GCF_011516875.1:0.5034271003)0:0.0000792547,GCF_000285275.1:1.6548804506)0:0.0000020615)0:0.0273314929,(((GCF_024158315.1:0.0070175453,(GCF_001183745.1:0.0063524715,(GCF_000787635.2:0.0065526170,GCF_002276785.1:0.4184158280)0:0.0000028871)0:0.5991026047)0:0.0000022640,(GCF_024158365.1:0.2715677289,GCF_013307325.1:1.3479024265)0:0.3013223189)0:0.0112147574,GCF_002220195.1:0.4824377378)0:0.0000023096)0:0.0000024257)0:0.0000021468,((((((GCF_007989245.1:0.0332987392,((GCF_008704245.1:0.0000000000,GCF_000755665.1:0.0000000000):0.0210429353,GCF_000963925.1:0.4718752717)0:0.0021558859)0:0.0078774398,((GCF_000241585.2:0.0088264104,GCF_021961645.1:0.6286745465)0:0.0000022309,GCF_008704305.1:0.6437612063)0:0.0002032385)0:0.0000020520,GCF_003850885.1:0.5811558117)0:0.0000020541,(((GCF_001580615.1:0.0310270635,GCF_017377745.1:0.0083640209)0:0.0000025252,((GCF_007991075.1:1.1324145009,(G
CF_024158505.1:0.2977152696,GCF_003850845.1:1.2075012951)0:0.0000020645)0:0.3300841784,GCF_000755675.1:0.0192709240)0:0.0000022336)0:0.0003230373,GCF_000964225.1:0.6464079685)0:0.0000023058)0:0.0000024507,((((((GCF_000010925.1:0.6580623757,(GCF_018256865.1:0.0729211723,GCF_017377715.1:1.6458223012)0:0.0000028144)0:0.0028590724,(GCF_024158445.1:0.5296455512,GCF_024158425.1:0.1631401068)0:0.0051417052)0:0.0000023336,(GCF_001580995.1:0.0095597996,GCF_011516835.1:0.0096655497)0:0.2897252936)0:0.0000021271,GCF_018256955.1:0.1057967928)0:0.4949788116,(GCF_002173775.1:0.0033052381,GCF_014132135.1:0.0135786395)0:0.0220684424)0:0.0027175137,((GCF_008704285.1:0.5204516769,(GCF_003850945.1:0.0142766333,GCF_002554745.1:1.6439203740)0:0.0000010000)0:0.0000023521,GCF_001581075.1:1.6439085500)0:0.0018296433)0:0.0032985240)0:0.0000024196,GCF_011516735.1:0.0311539844)0:0.0007464121)0:0.0000021260,((((GCF_018256895.1:0.0238122931,GCF_002276555.1:0.3567056482)0:0.0006964877,(((((GCF_022130905.1:0.6759780445,GCF_003850805.1:0.0684026210)0:0.0031813183,(((GCF_006539345.1:0.0117079797,GCF_000429165.1:0.0027619486)0:0.4105163026,GCF_011516925.1:0.2211171337)0:0.0987070937,GCF_009914215.1:0.0889556948)0:0.0000024406)0:0.0002617895,(GCF_024158285.1:0.5527586906,GCF_011516755.1:0.0753710884)0:0.0000020481)0:0.1569666418,GCF_022130805.1:0.6070661649)0:0.1229730883,GCF_002549835.1:0.0191286796)0:0.0000027567)0:0.0000027920,(((GCF_021961685.1:0.4339536548,GCF_000963905.1:0.5731901648)0:0.1371761622,(GCF_018256915.1:0.0068522593,GCF_007991375.1:1.6104189388)0:0.0002093233)0:0.0000025380,GCF_011516865.1:0.5911367130)0:0.0002295569)0:0.0000027232,((GCF_002153475.1:0.3833738686,GCF_000613905.1:0.0057836498)0:0.0000020031,(GCF_002456135.1:0.0089582661,GCF_007989285.1:0.6700584241)0:0.0161515351)0:0.0002918145)0:0.5686823855)0:0.0000025291,(((((((GCF_007989305.1:0.1238860300,GCF_019599335.1:0.2119236220)0:0.2066203913,(((((GCF_000010825.1:0.0000000000,GCF_000010965.1:0.0000000000):0.0000000000,GCF_00
0010945.1:0.0000000000):0.0000000000,GCF_000010885.1:0.0000000000):0.0000000000,GCF_000010865.1:0.0000000000):0.0814400697,GCF_008689815.1:0.0319721336)0:0.0035268865)0:0.0000021231,GCF_018256975.1:0.6696643419)0:0.0000027511,((((GCF_007991395.1:0.0000000000,GCF_011516825.1:0.0000000000):0.6003152616,GCF_011516885.1:0.0099408109)0:0.0000020243,GCF_002173735.1:0.0370744835)0:0.0019782092,(((GCF_014486685.1:1.5739864716,GCF_011516655.1:0.0221378734)0:0.0000020155,(GCF_002358055.1:0.4205428001,(GCF_001581105.1:0.0591542756,(GCF_002276805.1:0.2130518592,GCF_011516765.1:1.4222825718)0:0.1270049725)0:0.0012018872)0:0.0357890234)0:0.0000120705,((GCF_008704255.1:0.6340700535,GCF_011516725.1:0.0244629949)0:0.0000029892,GCF_002153485.1:1.6164220977)0:0.0000022480)0:0.0001081203)0:0.0095606788)0:0.2193311916,(((GCF_000241625.1:0.6839073885,GCF_018256935.1:0.1832314634)0:0.0067353401,(GCF_018256835.1:0.2030163707,GCF_001581005.1:0.3323390344)0:0.0000029143)0:0.0000027006,GCF_002153575.1:1.3448122176)0:0.0000028216)0:0.3003791125,(GCF_000723785.2:0.0082027889,GCF_014218315.1:0.5230507614)0:0.0000628853)0:0.0000020699,((GCF_008689795.1:0.6260611411,GCF_002723895.1:0.2262062301)0:0.1850061244,(((GCF_018256855.1:0.6672368453,GCF_000193245.1:0.0006517461)0:0.0000026401,(GCF_002156945.1:0.4738437206,GCF_011516745.1:0.0001157176)0:0.0000024761)0:0.0483889382,GCF_002153685.1:0.0741115710)0:0.2248808694)0:0.1757701247)0:0.0000021012)0:0.0000028596)0:0.0029442856)0:0.0000026584,(GCF_024158235.1:0.0241025345,GCF_022130845.1:0.0059523180)0:0.0005193121)0:0.0000021230,GCF_001766255.1:0.5120604596)0:0.0000020516,(((GCF_017377735.1:0.0621445686,GCF_000225485.1:0.3345182353)0:0.0027563893,(GCF_024158325.1:0.6234262367,(GCF_014207635.1:0.0585282647,GCF_003850965.1:0.5727943103)0:0.0000020369)0:0.0026639284)0:0.0000024753,GCF_002153515.1:0.0780511343)0:0.5342060505)0:0.0034632152,(((GCF_019083805.1:0.0001091116,GCF_003850865.1:0.0000540390)0:0.0018101296,GCF_003850925.1:0.4782026931)0:0.00000230
29,GCF_000613285.1:0.6558629151)0:0.0000025192)0:0.0000026868,GCF_024158405.1:1.0557610867)0:0.0000020045,GCF_001628715.1:0.0019204407)0:0.4800599364,GCF_002153655.1:0.1172978701)0:0.2140405767,((GCF_002006565.1:0.0894455164,GCF_009295745.1:0.0927561460)0:0.2385422753,GCF_002738225.1:0.0099295595)0:0.0000010000);

After getting the error with this tree, I noticed that most nodes have zero support. I removed the support values, but the error persisted.
I also rebuilt the tree with FastTree using fasttree -intree genus_Acetobacter.treefile ../seq/all_seq.fasta > fast.tre (and converted it to a binary tree with ete3's tree.resolve_polytomy()), but the error persisted.
What can I do to solve this problem?
Looking forward to your reply.
Thanks a lot.

The MSA file is uploaded as an attachment.
max_16s_ref.txt
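
Two details in the log above may be worth checking first. The WARN line reports differing taxon counts between the reference MSA and the tree (196 vs. 186), and the empirical base frequencies list only four non-zero values out of twenty, which is what one would expect if nucleotide data were read under a protein model such as LG (the query shown is a nucleotide sequence). Whether either of these explains the -INF likelihood is not confirmed here. The sketch below shows one way to run both checks on the input files; all function names are hypothetical, and the Newick parsing assumes unquoted leaf labels.

```python
import re


def fasta_records(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def newick_leaf_labels(newick):
    """Return unquoted leaf labels: names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)


def compare_taxa(newick, msa_fasta):
    """Return (labels only in the tree, labels only in the MSA)."""
    tree_taxa = set(newick_leaf_labels(newick))
    msa_taxa = {name for name, _ in fasta_records(msa_fasta)}
    return sorted(tree_taxa - msa_taxa), sorted(msa_taxa - tree_taxa)


def looks_like_nucleotides(sequence, threshold=0.9):
    """Heuristic: most non-gap characters are A, C, G, T, U or N."""
    residues = [c for c in sequence.upper() if c not in "-.?"]
    if not residues:
        return False
    hits = sum(c in "ACGTUN" for c in residues)
    return hits / len(residues) >= threshold
```

If the second check returns True for the reference MSA, a nucleotide model rather than LG+F+G4 would be the natural thing to try; the 0.9 threshold is an arbitrary choice for this sketch.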

Intuition for a common taxonomic-assignment procedure

Hello Pierre,

I have a conceptual problem. For a set of query sequences of metagenomic origin, I want to know the taxonomy. I have a reference tree with ~600 sequences. Should I:

  • Build a tree with all the sequences and generate a manual taxonomy.
  • Build a tree with only the reference sequences, then use epa-ng to obtain the placements of the query sequences (~600 bp at least) in this tree, and finally use gappa to obtain the taxonomy.

I think the latter is a feasible option, but I have one major concern. If I have 400 query sequences, wouldn't this be a case of overfitting? Maybe some environmental sequences form a new cluster by themselves. So what is the most common approach in this case?

Additionally, I also have an amplicon dataset coming from the same metagenomic sequences, covering a smaller region. Should I place these queries on the whole tree (references + metagenomic query sequences), or could this again be considered overfitting, so that I should work only with the references (which are in fact the origin of the taxonomy)?

EPA-ng is really fast and useful, thank you for working on it :)

Conda installation syntax error

Hey again,

I'm running into this error in the bioconda circleci tests:

In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.cpp:1:
In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.hpp:4:
/Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/rtree_mapper.hpp:73:35: error: implicit instantiation of undefined template 'std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >'
      throw std::invalid_argument{std::string("Edge ") + std::to_string(i) + " is the root edge! Please handle separately"};
                                  ^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/_build_env/bin/../include/c++/v1/iosfwd:193:32: note: template is declared here
    class _LIBCPP_TEMPLATE_VIS basic_string;
                               ^
In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.cpp:1:
In file included from /Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/pll_util.hpp:4:
/Users/distiller/project/miniconda/conda-bld/epa-ng_1544580516776/work/src/core/pll/rtree_mapper.hpp:73:63: error: no member named 'to_string' in namespace 'std'
      throw std::invalid_argument{std::string("Edge ") + std::to_string(i) + " is the root edge! Please handle separately"};

The full log is here.

This looks like a syntax error that is only caught for some reason in the Mac OS X stage of testing (it passes linting and the Linux tests). You can see the EPA-NG bioconda PR here.

Please let me know if you need more details!

Thanks,

Gavin

Help on Aborted (core dumped) message

Hi,

Similar to a recent post, I am trying to do placement for an MSA file of 1.3k sequences built with mafft. I first built a conda env with raxml & epa-ng.

I then checked my MSA file with the --parse and --check commands from raxml-ng; it suggested I would need 7 threads and 1200 MB of memory. I have around 200 GB available, so I don't really understand why I would have insufficient resources.

This is my current command line (I also used iqtree web server to find the best model and parameters).

epa-ng --ref-msa $refmsa --tree $reftree --query $querymsa -w . --threads 7 --chunk-size 500 --model GTR+F+I+G4

No error is printed apart from the Aborted (core dumped) message.

Any suggestions that could help me?

Edit :

  • Used pythia on my query sequences and obtained a 0.63 difficulty.

  • Not really the place for it, but since I am posting: I also tried pplacer and got the same aborted error, with an additional message I don't understand yet (I am very new to phylogenetic placement):

pplacer: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.
Aborted (core dumped)
  • Tried the test data: it worked for 10k sequences but failed for 100k with the same Aborted (core dumped) message. Edit: it worked for 100k sequences after adding the --redo option, for some reason.

  • Currently trying to reduce the number of queries, deduplicating reference sequences, and other strategies I could find, but so far I am stuck on the same issue.

Some unanswered questions I have so far:

  • I read "Many placement methods require query reads to be aligned to the reference". I am not sure I understand this step. I also found this information in this tutorial here. Does it mean I should pre-align the queries to the reference .aln before calling epa-ng? I also found some workflows that align the queries to the HMM profile of the reference sequences, but I feel I am missing something here...

  • Following up on this last point, I generated an HMM profile from the deduplicated reference MSA file, then aligned my queries to this profile with hmmalign and the --mapali option. I then extracted two MSA files, one with queries only and one with references only. Got the same error again. I am starting to think the issue comes from my reference tree (is 1.3k sequences too big?)

Thanks,
Paul
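
On the alignment question: the workflow described above ends with two files, references only and queries only, that come from one combined alignment and therefore share the same columns. The sketch below shows that splitting step under some assumptions: the combined alignment has already been converted to FASTA, reference sequences are identified by name, and all function names are hypothetical. It also checks that every sequence has the same length, since a mismatch there is an easy mistake to make.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_alignment(combined_fasta, reference_names):
    """Split a combined alignment into (reference, query) record lists."""
    reference, query = [], []
    for name, seq in parse_fasta(combined_fasta):
        (reference if name in reference_names else query).append((name, seq))
    # Both parts must keep the same number of alignment columns.
    lengths = {len(seq) for _, seq in reference + query}
    if len(lengths) > 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return reference, query
```

For example, `split_alignment(open("combined.fasta").read(), reference_names)` would return the two record lists, where `combined.fasta` and `reference_names` are placeholders for the user's own data.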

Placements sensitive to outlier sequences?

I recently noticed that the placements of all query sequences can be affected by the presence of a small number of outlier sequences when placing onto a large tree. This problem appears to especially affect the pendant length estimates and less so the edge placements. I noticed this issue when running different subsets of a dataset with PICRUSt2, which wraps EPA-NG.

I've reproduced this for query datasets of 323 sequences placed into a tree of 20,000 sequences. In this example only one query sequence differs between the test datasets (essentially, in one case the query sequence doesn't align to the reference).

In the original case the focal query sequence alignment looks like this:

>02905cfb87861c837dde629596d9272b
....-----GGTCTTGACATC--CCTCT-GACGAGTGAGTAATGT-CG--CTT--T-C--CC--T----T----C--GG---------G------G--C-A-G-A-GGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAGTAGCCAGCA----GTAAGA-----TGGGAACTCTAGAGAGACTGCCGGGGATAACCCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCAGGGCTACACACGTGCTACAATGGC-G-T-A-A-ACAGAGGGA-AGCGACCCTGTGAAGGTAAGCAAATCCCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGA-ATGTCGCGGTGAATACGTTCCCGG----...

I swapped in random DNA for a different test file (with the same header) and the alignment looks like this (only a single base aligned):

>02905cfb87861c837dde629596d9272b
....--------------T---------------....

You can see in the jplace outputs of running EPA-NG that there are many differences in the placements, particularly in the pendant distances.

I think this issue might be related to issue #29. I didn't expect all placements to be affected by a single weird sequence. I wasn't able to reproduce this issue with the example datasets used in the EPA-NG paper and I'm thinking that maybe this issue only arises with large trees. Any insight would be greatly appreciated!

I ran EPA-NG with this command:

epa-ng --tree pro_ref.tre \
       --ref-msa ref_seqs_hmmalign.fasta \
       --query STUDY_SEQS \
       --chunk-size 5000 \
       -T 20 \
       -m pro_ref.model \
       -w epa_out \
       --filter-acc-lwr 0.99 \
       --filter-max 100

You can see the input and output files attached in this zipfile: placement_test.zip.

STUDY_SEQS corresponds to study_seqs_hmmalign_original.fasta and study_seqs_hmmalign_funky.fasta for the original dataset and the dataset with the outlier sequence, respectively. The output jplace files are named the same way.

Thanks,

Gavin
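
A query like the single-base alignment shown above can be spotted before placement by measuring how much of each aligned query consists of residues rather than gaps. The sketch below is a generic pre-filter, not part of EPA-NG or PICRUSt2: the function names and the 0.5 cutoff are assumptions, and both '-' and '.' are treated as gaps because both appear in the hmmalign-style alignments above. It does not explain why the other placements shift; it only flags the suspicious input.

```python
GAP_CHARS = set("-.")


def aligned_fraction(sequence):
    """Fraction of alignment columns that hold a residue rather than a gap."""
    if not sequence:
        return 0.0
    residues = sum(1 for c in sequence if c not in GAP_CHARS)
    return residues / len(sequence)


def flag_sparse_queries(records, min_fraction=0.5):
    """Return the names of queries whose aligned fraction is below the cutoff."""
    return [name for name, seq in records if aligned_fraction(seq) < min_fraction]
```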

-INF logl at branch 0 error

Hello,

This is my first time reporting an issue on GitHub, so I'm sorry if I'm not the clearest.

I've been using epa-ng v0.3.5 for phylogenetic placement with no problems so far, but recently I've been stuck on an error.

This is the command line I ran (I changed the names to make it clearer):
epa-ng --model GTR{1.08637/3.57995/1.86164/0.81671/6.21607/1.00000}+FU{0.235/0.257/0.328/0.179}+IU{0.014}+G4{0.639} --ref-msa reference-seq.ali --tree reference-tree.treefile --query query.ali --outdir EPA-ng-Jplacer --redo

The output I got is:

INFO Selected: Output dir: EPA-ng-Jplacer/
INFO Selected: Query file: query.ali
INFO Selected: Tree file: reference-tree.treefile
INFO Selected: Reference MSA: reference-seq.ali 
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model: GTR{1.08637/3.57995/1.86164/0.81671/6.21607/1.00000}+FU{0.235/0.257/0.328/0.179}+IU{0.014}+G4{0.639}
INFO    Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 0.639 (user),  weights&rates: (0.25,0.0612386) (0.25,0.333269) (0.25,0.899982) (0.25,2.70551) 
        P-inv (user): 0.014
        Base frequencies (user): 0.235235 0.257257 0.328328 0.179179 
        Substitution rates (user): 1.08637 3.57995 1.86164 0.81671 6.21607 1
INFO     ______ ____   ___           _   __ ______
        / ____// __ \ /   |         / | / // ____/
       / __/  / /_/ // /| | ______ /  |/ // / __  
      / /___ / ____// ___ |/_____// /|  // /_/ /  
     /_____//_/    /_/  |_|      /_/ |_/ \____/ (v0.3.5)
INFO Output file: EPA-ng_Jplacer_Article3_cleaned4/epa_result.jplace
srun: error: Amoeba-mt6: task 0: Aborted
task 0: Aborted

And if I look at the log file I get:
terminate called after throwing an instance of 'std::runtime_error'
what():  -INF logl at branch 0 with sequence QUERY_DALLOL_1339_Gt_7Gt_7Gt-pp_Ass

Honestly, I don't understand what this means, and I haven't seen anybody report the same problem.

I probably made a dumb mistake somewhere, but I don't have a clue what it is. Can you please help me?

What I tried already:

  • The reported sequence is the first one in my file, so I tried removing it. The same error appears with the next sequence, so the sequence itself is not the problem.
  • Ran EPA-ng with other files: works OK (but that reference tree has fewer OTUs; this one has 2993 OTUs for the reference sequences alone).
  • Reduced the number of queries from 1634 to 10 to see if it was a query-count problem: same issue.
  • Rerooted my tree: same issue.
  • Ran EPA-ng with different numbers of threads: same issue.
  • Checked for hidden characters: didn't detect any.
  • Redid everything and checked the files (the tree is in Newick format from IQ-TREE): didn't change anything.

I hope I was clear and that there's a solution to this issue.
Thanks for reading and for your help,
JB
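
A side note on reading the log above: the user-supplied frequencies in FU{0.235/0.257/0.328/0.179} sum to 0.999, while the log prints 0.235235 0.257257 0.328328 0.179179. Those printed values are consistent with the input being rescaled to sum to 1, as the arithmetic below reproduces. This only explains the differing numbers; it is not offered as the cause of the -INF error. The `normalize` helper is a hypothetical name.

```python
def normalize(frequencies):
    """Rescale a list of non-negative values so that they sum to 1."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    return [f / total for f in frequencies]


# Values copied from the model string in the command above.
user_freqs = [0.235, 0.257, 0.328, 0.179]
normalized = normalize(user_freqs)

print(round(sum(user_freqs), 3))           # sum of the values as given
print([round(f, 6) for f in normalized])   # values after rescaling
```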

MacOS M1 neon support

Hello Team,

Is EPA-ng not supported on aarch64 (arm64) platforms such as macOS on M1? Is there any plan to support it in the future?

Thanks,
Jianshu

Segfault with large query volume under MPI

The program produces the following error message when started with a large volume of queries (in this case half a TB, 100M neotrop sequences):

INFO Output file: /hits/basement/sco/barbera/neotrop/out/128-hybrid-100M-10/epa_result.jplace
[haswell-002:21508] *** Process received signal ***
[haswell-002:21508] Signal: Segmentation fault (11)
[haswell-002:21508] Signal code: Address not mapped (1)
[haswell-002:21508] Failing at address: 0xbb03f8088
[haswell-002:21508] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7fe83c046100]
[haswell-002:21508] [ 1] /hits/sw/shared/apps/OpenMPI/1.10.2-GCC-4.9.3-2.25/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_malloc+0x7ea)[0x7fe83b7137ba]
[haswell-002:21508] [ 2] /hits/sw/shared/apps/OpenMPI/1.10.2-GCC-4.9.3-2.25/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_memalign+0x52)[0x7fe83b715b62]
[haswell-002:21508] [ 3] /hits/sw/shared/apps/OpenMPI/1.10.2-GCC-4.9.3-2.25/lib/libopen-pal.so.13(opal_memory_ptmalloc2_memalign+0xbf)[0x7fe83b715f9f]
[haswell-002:21508] [ 4] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_Znwm+0x18)[0x7fe83c9fab58]
[haswell-002:21508] [ 5] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSs4_Rep9_S_createEmmRKSaIcE+0x59)[0x7fe83ca5a799]
[haswell-002:21508] [ 6] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSs4_Rep8_M_cloneERKSaIcEm+0x1b)[0x7fe83ca5b40b]
[haswell-002:21508] [ 7] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSs7reserveEm+0x30)[0x7fe83ca5b4b0]
[haswell-002:21508] [ 8] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSt15basic_stringbufIcSt11char_traitsIcESaIcEE8overflowEi+0xa4)[0x7fe83ca37884]
[haswell-002:21508] [ 9] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZNSt15basic_streambufIcSt11char_traitsIcEE6xsputnEPKcl+0x85)[0x7fe83ca3b635]
[haswell-002:21508] [10] /hits/sw/shared/apps/GCCcore/4.9.3/lib64/libstdc++.so.6(_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l+0xe6)[0x7fe83ca32bd6]
[haswell-002:21508] [11] epa(_Z23sample_to_jplace_stringRK6SampleI9PlacementE+0x1b8)[0x49e1a8]
[haswell-002:21508] [12] epa(_Z10simple_mpiR4TreeRKSsS2_RK7OptionsS2_+0x186f)[0x4de28f]
[haswell-002:21508] [13] epa(main+0x659f)[0x48358f]
[haswell-002:21508] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe83bc96b15]
[haswell-002:21508] [15] epa[0x48d7c1]
[haswell-002:21508] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21508 on node haswell-002 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Unable to parse model file from IQtree

Hello. EPA-NG fails to parse a model file from the latest IQtree version (option --model), when the selected DNA substitution model is not GTR. The attached IQtree model file, where the selected model is TIM3e+R10, can be used to recreate the following error:
libc++abi: terminating due to uncaught exception of type std::invalid_argument: Couldn't parse model file! (can't find 'A-R: '!)

What basically happens is that EPA-NG looks for "GTR" as the DNA substitution model (lines 174-178 in the EPA-NG source code parse_model.hpp), and if this is not found, it thinks that it is dealing with an amino-acid model. Consequently, it then looks for the substitution rate between amino acids A & R (line 182 in source file parse_model.hpp), which of course does not exist since the model is in fact a DNA substitution model (TIM3e+R10).

My impression was that EPA-NG can handle more DNA substitution models beyond just GTR, however right now this does not seem to be the case (at least not if the model file is from IQtree). Is it possible to fix this issue? It seems that this could be achieved easily using either of the following approaches:

  1. Give the user the option to explicitly specify whether the input model is a DNA or AA substitution model.
  2. Don't automatically switch to AA if the model is not explicitly written "GTR" in the IQtree file, but instead also accept other common specifiers such as "TIM3e" (full list here).

Thank you!
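
The two suggestions above can be combined in a few lines. The sketch below is not the logic in parse_model.hpp: the function names are hypothetical, and the model sets are small illustrative subsets of the names IQ-TREE uses, not complete lists. An explicit override corresponds to suggestion 1, the name lookup to suggestion 2, and unknown names raise an error instead of silently falling back to amino acids.

```python
import re

# Illustrative subsets only; IQ-TREE supports more models than listed here.
DNA_MODELS = {
    "JC", "F81", "K2P", "K80", "HKY", "TN", "TNe", "K3P", "K81", "K81u",
    "TPM2", "TPM2u", "TPM3", "TPM3u", "TIM", "TIMe", "TIM2", "TIM2e",
    "TIM3", "TIM3e", "TVM", "TVMe", "SYM", "GTR",
}
AA_MODELS = {"LG", "WAG", "JTT", "Dayhoff"}


def base_model_name(model_string):
    """Return the substitution model name, e.g. 'TIM3e' for 'TIM3e+R10'."""
    match = re.match(r"\s*([^+{\s]+)", model_string)
    if match is None:
        raise ValueError(f"no model name found in {model_string!r}")
    return match.group(1)


def data_type(model_string, override=None):
    """Classify a model as 'DNA' or 'AA'; `override` lets the user decide."""
    if override is not None:
        return override
    name = base_model_name(model_string)
    if name in DNA_MODELS:
        return "DNA"
    if name in AA_MODELS:
        return "AA"
    raise ValueError(f"unknown substitution model: {name!r}")
```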

error when doing the final epa-ng placement

Hi!
I got this error when running:
epa-ng -t reference_tree.raxml.bestTree -s reference_aligment.fasta -q query.fasta -w ./ --model reference_tree.raxml.bestModel

Please find attached the .err and the info.log files
jobLog_447616.err.txt
epa_info.log.txt

It looks like the query.fasta produced by epa-ng --split has a problem, but it looks normal to me.

Thank you so much for your help.

epa-ng available on Homebrew through brewsci/bio

Hi,

I just wanted to let you know that we've put together a Homebrew formula for epa-ng that's available through the brewsci/bio tap. Just in case you wanted to list this on the README.

brew tap brewsci/bio
brew install epa-ng

or

brew install brewsci/bio/epa-ng

Cheers,
Gabriel

Issue with building from source: ld throws errors

Dear, I am trying to build epa-ng from source on Ubuntu server 18.04.5 LTS. I have checked the dependencies and all goes well until the end where ld is throwing errors:

[ 99%] Building CXX object src/CMakeFiles/epa_module.dir/util/stringify.cpp.o
[100%] Linking CXX executable ../../bin/epa-ng
/usr/bin/ld: /usr/local/lib/libz.a(inflate.o): relocation R_X86_64_32S against hidden symbol `zcfree' can not be used when making a PIE object
/usr/bin/ld: /usr/local/lib/libz.a(inftrees.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libz.a(zutil.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libz.a(crc32.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: /usr/local/lib/libz.a(inffast.o): relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
src/CMakeFiles/epa_module.dir/build.make:598: recipe for target '../bin/epa-ng' failed
make[3]: *** [../bin/epa-ng] Error 1
make[3]: Leaving directory '/usr/local/programs/epa-ng/build'
CMakeFiles/Makefile2:901: recipe for target 'src/CMakeFiles/epa_module.dir/all' failed
make[2]: *** [src/CMakeFiles/epa_module.dir/all] Error 2
make[2]: Leaving directory '/usr/local/programs/epa-ng/build'
Makefile:94: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/usr/local/programs/epa-ng/build'
Makefile:11: recipe for target 'run_make' failed
make: *** [run_make] Error 2

How could I resolve this? Am I missing something?

Add checkpointing

Add the ability to detect incomplete runs and resume from where the last run left off.

model info error

Hello, I am having trouble running epa-ng with the following error:
INFO Selected: Output dir: ./epa_tree/
INFO Selected: Query file: query.fasta
INFO Selected: Tree file: T3.raxml.bestTree
INFO Selected: Reference MSA: reference.fasta
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model file: RAxML_info.info
what(): Model string in provided file seems wrong.
XXXX.sh: line 20: 26465 Aborted (core dumped) epa-ng --tree T3.raxml.bestTree --ref-msa reference.fasta --query query.fasta --outdir $OUT --model RAxML_info.info

I am attempting to align 806 amplicon sequences to 1121 nifH reference sequences. I started by running raxml-ng to build a reference tree on muscle-aligned ref seqs with the following command:
raxml-ng --msa T2.raxml.rba --model GTR+G --prefix T3 --threads 8 --seed 8273

I then used papara to align the query seqs, and then raxml-ng --split to separate the aligned seqs.

In my first go running epa-ng, I provided the example model parameters suggested in the full stack tutorial to define the model:
GTR{0.7/1.8/1.2/0.6/3.0/1.0}+FU{0.25/0.23/0.30/0.22}+G4{0.47}

But I got the following error:
ERR When using epa-ng like this, a model has to be explicitly specified!
You may specify it generically (GTR+G), however parameters will not be optimized.
Instead we reccommend to use RAxML to re-evaluate the parameters and then pass the resulting
RAxML_info file to the epa-ng --model argument. epa-ng will then auto-parse the parameters.
( raxmlHPC -f e -s -t -n info -m GTRGAMMAX )

So I ran the example command above (I did get an error that led me to change the -m option to GTRGAMMA; the only other possible input is GTRGAMMAI), and that executed fine.
But passing the resulting RAxML_info file to epa-ng threw the error shown at the top ("Model string in provided file seems wrong.").

Is there some other way to get around this? If it helps, below are the contents of the RAxML_info file:
This is RAxML version 7.3.0 released by Alexandros Stamatakis in June 2011.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (Univ. Tsukuba)

Alignment has 4167 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 93.41%

RAxML Model Optimization up to an accuracy of 0.100000 log likelihood units

Using 1 distinct models/data partitions with joint branch length optimization

All free model parameters will be estimated by RAxML
GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter

GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units

Partition: 0
Alignment Patterns: 4167
Name: No Name Provided
DataType: DNA
Substitution Matrix: GTR
RAxML was called as follows:

raxmlHPC -f e -s ref.clus.phyi -t T3.raxml.bestTree -n info -m GTRGAMMA

Testing which likelihood implementation to use
Standard Implementation full tree traversal time: 2.301094
Subtree Equality Vectors for gap columns full tree traversal time: 0.809563
... using SEV-based implementation

Model parameters (binary file format) written to: /home/rodrigues-lab/msa_red/epa_ng/RAxML_binaryModelParameters.info

Overall Time for Tree Evaluation 419.071737
Final GAMMA likelihood: -186925.416854
Number of free parameters for AIC-TEST(BR-LEN): 2248
Number of free parameters for AIC-TEST(NO-BR-LEN): 9

Model Parameters of Partition 0, Name: No Name Provided, Type of Data: DNA
alpha: 1.029898
Tree-Length: 201.377284
rate A <-> C: 1.154527
rate A <-> G: 2.645042
rate A <-> T: 1.360458
rate C <-> G: 1.626075
rate C <-> T: 3.503977
rate G <-> T: 1.000000

freq pi(A): 0.240682
freq pi(C): 0.260669
freq pi(G): 0.267798
freq pi(T): 0.230851

unimplemented pure virtual method error

Hi there,

I've been troubleshooting whether EPA-NG could be installed with conda and I've been running into this error when trying to install with clang rather than gcc (first reported here):

/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/build/genesis_unity_sources/lib/all.cpp:17912:56: error: allocating an object of abstract class type 'const utils::ColorNormalization'
tree, params, std::vector<utils::Color>{}, {}, {}, svg_filename
^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/libs/genesis/lib/genesis/utils/tools/color/normalization.hpp:168:20: note: unimplemented pure virtual method 'normalize_' in 'ColorNormalization'
virtual double normalize_( double value ) const = 0;
^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/libs/genesis/lib/genesis/utils/tools/color/normalization.hpp:173:18: note: unimplemented pure virtual method 'is_valid_' in 'ColorNormalization'
virtual bool is_valid_() const = 0;
^
/Users/distiller/project/miniconda/conda-bld/epa-ng_1540400737097/work/build/genesis_unity_sources/lib/all.cpp:17923:45: error: allocating an object of abstract class type 'const utils::ColorNormalization'
tree, params, color_per_branch, {}, {}, svg_filename

I know EPA-NG hasn't been tested on Mac OS X (which uses clang), but from what I understand this may actually be a bug that wasn't caught by gcc. This is based on the development version forked on Sep 5, 2018 (commit: 45a8e53). Please let me know if you would like more details!

Gavin

Treeparsing error

Hi there,

I'm using the latest version of epa-ng and am getting this error when I try to run it:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Treeparsing failed!

The input tree (edit: which is in newick format) is unrooted and is based on 16,539 sequences. Any idea how I can better troubleshoot what is causing the error?

Thanks,

Gavin

cannot compile libpll

When I type "make pll" in the epa folder to start the installation,
I get
mkdir -p bin
cd libs/pll-modules && ./install-with-libpll.sh ..
/bin/sh: 1: ./install-with-libpll.sh: not found
and indeed I cannot find the file install-with-libpll.sh anywhere.

Treeparsing error

Hello, I'm trying to run epa-ng with a Newick tree generated with ETE3 (after resolving the polytomies) and I receive this error:

terminate called after throwing an instance of 'std::runtime_error'
what(): Treeparsing failed! syntax error, unexpected $end, expecting ','. (line 1 column 55)

I have checked the Newick format several times, and I don't understand why it is reading an "end" in the middle of the tree...
Could you help me solve this problem, please?

Thank you very much in advance

I cannot find list of forbidden character for header of fasta

I got with my data set the error
"... fasta_getnext failed: Illegal header line in query fasta file"
my headers look like
">Read_3_sample=RF-pre-50cells-8h_S30_L001_R_size=121_"
So nothing clearly wrong.
But maybe with a list of forbidden characters I can find a way to squeeze in the info (size= is needed for post-processing).

PLL assertion error

Hi Pierre,

I'm placing some sequences onto a relatively small tree with EPA-NG v0.3.7 and it's aborting after a failed assertion in PLL:

epa-ng: /home/connor/bin/epa-ng/libs/pll-modules/libs/libpll/src/core_likelihood_avx2.c:911: pll_core_edge_loglikelihood_repeats_generic_avx2: Assertion `site_lk < 0. && isfinite(site_lk)' failed.
Aborted (core dumped)

Here is the command I used:
epa-ng -s references.txt -t NosZ.txt -q queries.txt --model NosZ_epa.model.txt --dyn-heur 0.9 -T 4 --no-pre-mask

Placement is completed if I allow masking by removing '--no-pre-mask'.

Files used:
NosZ.txt
NosZ_epa.model.txt
queries.txt
references.txt

Can you figure out what is wrong?

Thanks!
Connor

Hello,I find issue running this program

Hello Pierre Barbera,
First, thank you for making this program :) Sadly, even though you wrote a very kind protocol for beginners like me to follow,
my script keeps stopping while running.

INFO Selected: Output dir: ./
INFO Selected: Query file: Onlyotualigned.fasta
INFO Selected: Tree file: RAxML_fastTree.reference_aligned.tree
INFO Selected: Reference MSA: reference_aligned.fasta
INFO Selected: Automatic switching of use of per rate scalers
INFO Selected: Preserving the root of the input tree
INFO Selected: Specified model: GTR+G+F
INFO [EPA-ng ASCII-art banner] (v0.3.8)
ERR Setting tip states failed for sequence: YP_004324525.1
ERR message: Illegal state code in tip "E"

This is the issue.
My data contains amino acids; it was aligned with MUSCLE, and the reference tree was made with RAxML.
I think I have a problem with my datasets, but I can't pinpoint the exact issue.
Could you please help me with this issue?
Thank you in advance.
reference_aligned.txt
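
The log above shows a nucleotide model (GTR+G+F) selected while the data is described as amino acids, and "E" is not an IUPAC nucleotide code. A quick pre-check like the following can reveal such a mismatch before running a placement. It is a minimal Python sketch: the function names, the toy sequences, and the set of accepted gap characters are assumptions made for illustration.

```python
# IUPAC nucleotide codes (including U and the ambiguity codes).
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
# Assumption: these characters are treated as gaps / missing data.
GAP_CHARS = set("-.?")


def read_fasta(text):
    """Parse FASTA text into {name: sequence}; the name is the first word of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def non_nucleotide_chars(sequence):
    """Return the sorted characters that are neither nucleotide codes nor gaps."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDE - GAP_CHARS)


def report_non_nucleotide(text):
    """Map each sequence name to its non-nucleotide characters (empty dict if none)."""
    report = {}
    for name, seq in read_fasta(text).items():
        bad = non_nucleotide_chars(seq)
        if bad:
            report[name] = bad
    return report


if __name__ == "__main__":
    toy = ">dna_example\nACGTN-\n>protein_example\nMKE-LLQ\n"  # made-up data
    print(report_non_nucleotide(toy))
```

If the report is non-empty for most sequences, the alignment is probably protein data and a protein model would be the matching choice.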

question on comparison with RAxML-epa

hello Team,

I can use epa algorithm in RAxML like this:

mafft --addfragments ${name}_NarG_all.faa --reorder --thread 64 NarG.ref.afa > ${name}_Refs_NarG_reads.aln

raxmlHPC-PTHREADS-AVX2 -f v -s ${name}_Refs_NarG_reads.aln -t RAxML_bestTree.NarG_ref_tree -m PROTGAMMAAUTO -n ${name}_JPlace_narG -T 64 -G 0.2

That is, first add the short query sequences (${name}_NarG_all.faa) to the reference alignment (NarG.ref.afa, the long one used to generate the backbone tree). Then run the whole alignment, including the short query sequences (${name}_Refs_NarG_reads.aln), with the tree generated from NarG.ref.afa (RAxML_bestTree.NarG_ref_tree). How can I do this in epa-ng? Since a query MSA file is needed, I assume it is the reads alignment file (no reference included). The reads are very short and most of the time do not overlap, so how do I generate a query alignment?

Thanks,

Jianshu
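
One way to obtain a query-only alignment from the workflow described above is to align the reads against the reference as before and then separate the combined alignment by sequence name. The sketch below illustrates that idea in Python. It is not the tool's own split implementation; `split_combined` and the other names are hypothetical, and it assumes the aligner leaves the reference sequence names unchanged.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def to_fasta(records):
    """Serialize {name: sequence} back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


def split_combined(reference_text, combined_text):
    """Split a combined alignment into (reference, query) parts by name.

    Both parts keep the column layout of the combined alignment, so they
    stay aligned to each other.
    """
    ref_names = set(read_fasta(reference_text))
    combined = read_fasta(combined_text)
    missing = ref_names - set(combined)
    if missing:
        raise ValueError(f"reference names missing from combined alignment: {sorted(missing)}")
    reference = {n: s for n, s in combined.items() if n in ref_names}
    query = {n: s for n, s in combined.items() if n not in ref_names}
    return reference, query


if __name__ == "__main__":
    ref = ">ref1\nMKLV\n>ref2\nMKIV\n"                       # made-up data
    combined = ">ref1\nMK-LV\n>read1\n--ALV\n>ref2\nMK-IV\n"  # made-up data
    reference_part, query_part = split_combined(ref, combined)
    print(to_fasta(reference_part), to_fasta(query_part), sep="")
```

Note that the reference part is taken from the combined alignment, not from the original reference file, because adding the reads may have introduced new gap columns.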

Address issues with heuristic placement

For "weird" sequences that don't fit in the tree, may not align correctly, or are just plain wrong, we have observed a tendency for them to be placed on longer branches of a reference tree, sometimes with a high to very high LWR. While some of these may be caught by filtering based on the pendant length of the queries, the likely culprit is the heuristic preplacement phase. Specifically, during this phase queries are inserted using a default pendant length of 0.9, which for some cases may simply be too long.

This also touches on identification of "novel" lineages in the query data, which is usually a goal of placement analyses. However the primary goal is to re-establish LWR as the primary criterion for placement confidence.
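
The pendant-length filtering mentioned above could be sketched as follows. This is a minimal Python example working on an already loaded jplace document (a dict as returned by `json.load`). The function names and both thresholds are hypothetical choices for illustration, not recommended values.

```python
def pquery_names(pquery):
    """Return the sequence names of a pquery, from either 'n' or 'nm'."""
    if "n" in pquery:
        return list(pquery["n"])
    return [entry[0] for entry in pquery.get("nm", [])]


def suspicious_placements(jplace, min_lwr=0.9, max_pendant=1.0):
    """Flag pqueries whose best placement is confident but far from the tree.

    A pquery is flagged when its highest-LWR placement has
    like_weight_ratio >= min_lwr and pendant_length > max_pendant.
    Both thresholds are illustrative assumptions.
    """
    col = {name: i for i, name in enumerate(jplace["fields"])}
    flagged = []
    for pquery in jplace["placements"]:
        best = max(pquery["p"], key=lambda row: row[col["like_weight_ratio"]])
        lwr = best[col["like_weight_ratio"]]
        pendant = best[col["pendant_length"]]
        if lwr >= min_lwr and pendant > max_pendant:
            flagged.append((pquery_names(pquery), best[col["edge_num"]], lwr, pendant))
    return flagged


if __name__ == "__main__":
    toy = {  # made-up data in jplace layout
        "fields": ["edge_num", "likelihood", "like_weight_ratio",
                   "distal_length", "pendant_length"],
        "placements": [
            {"p": [[3, -100.0, 0.95, 0.01, 2.5]], "n": ["odd_query"]},
            {"p": [[1, -90.0, 0.95, 0.02, 0.01]], "n": ["normal_query"]},
        ],
    }
    for names, edge, lwr, pendant in suspicious_placements(toy):
        print(names, "edge", edge, "LWR", lwr, "pendant", pendant)
```

Such a filter only treats the symptom; it does not change how the preplacement phase chooses candidate branches.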

Provide test scripts / sanity checking for jplace

Especially for this early beta phase, it would be great to have a script to check the basic sanity of the output. This could allow the user to ensure some validity, and help me debug problems if the checks are broken for their various data.
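
A basic sanity check of this kind could look like the sketch below. It is a minimal Python example under stated assumptions: `check_jplace` is a hypothetical name, the checks are an illustrative selection rather than a complete validation of the format, and the input is a jplace document already loaded as a dict (for example with `json.load`).

```python
import re

REQUIRED_KEYS = ("tree", "placements", "fields", "version")


def check_jplace(doc, eps=1e-6):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = [f"missing top-level key '{key}'" for key in REQUIRED_KEYS if key not in doc]
    if problems:
        return problems

    fields = doc["fields"]
    col = {name: i for i, name in enumerate(fields)}
    # Edge numbers appear in curly braces in the jplace tree string.
    edges = {int(num) for num in re.findall(r"\{(\d+)\}", doc["tree"])}

    for index, pquery in enumerate(doc["placements"]):
        if "n" not in pquery and "nm" not in pquery:
            problems.append(f"pquery {index}: neither 'n' nor 'nm' present")
        rows = pquery.get("p", [])
        if not rows:
            problems.append(f"pquery {index}: no placements")
        lwr_sum = 0.0
        for row in rows:
            if len(row) != len(fields):
                problems.append(f"pquery {index}: row has {len(row)} values, expected {len(fields)}")
                continue
            if "edge_num" in col and row[col["edge_num"]] not in edges:
                problems.append(f"pquery {index}: edge_num {row[col['edge_num']]} not in tree")
            if "like_weight_ratio" in col:
                lwr = row[col["like_weight_ratio"]]
                if not 0.0 <= lwr <= 1.0 + eps:
                    problems.append(f"pquery {index}: LWR {lwr} outside [0, 1]")
                lwr_sum += lwr
            if "pendant_length" in col and row[col["pendant_length"]] < 0:
                problems.append(f"pquery {index}: negative pendant length")
        if lwr_sum > 1.0 + eps:
            problems.append(f"pquery {index}: LWRs sum to {lwr_sum:.6f} (> 1)")
    return problems


if __name__ == "__main__":
    toy = {  # made-up data
        "version": 3,
        "tree": "(A:0.1{0},B:0.2{1},C:0.3{2});",
        "fields": ["edge_num", "likelihood", "like_weight_ratio",
                   "distal_length", "pendant_length"],
        "placements": [{"p": [[0, -10.0, 0.7, 0.05, 0.01]], "n": ["q1"]}],
    }
    print(check_jplace(toy) or "all checks passed")
```

Returning a list of messages, instead of stopping at the first failure, makes it easier to attach the full report to a bug report.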

A bit unclear model notation

Hi!
I would like to pass the model to EPA-ng manually on the command line, for example

--model GTR{0.7/1.8/1.2/0.6/3.0/1.0}+F+R9{r1/r2/.../r9}{w1/w2/.../w9}

Is the order of the relative rates in GTR{} A-C, A-G, A-T, C-G, C-T, G-T?

I'm just double checking this, since it is not explicitly stated in the readme.

Could you also update the README for this?

Cheers,

Joran

FastTree

Hi, I'm just setting up a pipeline to place query sequences onto a phylogenetic tree of ~5,000 reference sequences. I'm finding RAxML to be very slow for tree building and wondered whether epa-ng supports the use of FastTree. Will this be a problem when I have to supply the model parameters to epa-ng? I've looked online but haven't found any examples of this.

Cheers,
Andrew

Avoid duplicate work

Unlike pplacer and old RAxML-EPA, epa-ng re-computes the placements of identical sequences. This is not necessary.

Possible solution: Store hashes of the sequences that have already been processed. If a new sequence has a hash that was seen before, add the name to the list of names for the pquery of the previous sequence (or, if that name also already exists, increment its multiplicity). This assumes that hash collisions don't occur, so the hash function should be good enough (SHA1?).
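
The hashing idea described above can be sketched in a few lines. This is a minimal Python illustration of the bookkeeping only, not epa-ng code: `group_identical` and the returned data layout are hypothetical, and sequences are compared exactly as given (no case folding or gap stripping).

```python
import hashlib


def group_identical(records):
    """Group (name, sequence) pairs by the SHA-1 digest of the sequence.

    Returns {digest: {"sequence": str, "names": {name: multiplicity}}}.
    A repeated sequence adds its name to the existing group; a repeated
    name within a group increments that name's multiplicity.
    """
    groups = {}
    for name, sequence in records:
        digest = hashlib.sha1(sequence.encode("utf-8")).hexdigest()
        group = groups.setdefault(digest, {"sequence": sequence, "names": {}})
        group["names"][name] = group["names"].get(name, 0) + 1
    return groups


if __name__ == "__main__":
    reads = [("a", "ACGT"), ("b", "ACGT"), ("a", "ACGT"), ("c", "AC-T")]  # made-up data
    for group in group_identical(reads).values():
        # Only one placement computation would be needed per group.
        print(group["sequence"], group["names"])
```

Each group would then need a single placement computation, with all names in the group attached to the resulting pquery.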

The model parameters under data partitioning

If a phylogenetic tree was generated using likelihood searches under data partitioning, how do I get the model parameters?
I have a reference tree that was generated from 10 partitions. My query sequences are from one of the 10 partitions. I am not sure whether I can also use this command (-f e) to get model parameters for phylogenetic placement:
raxmlHPC-AVX -f e -s $REF_MSA -t $TREE -n file -m GTRGAMMAI

Error running this command

Hi there, I am trying to follow this picrust tutorial

After running this command: place_seqs.py -s ../seqs.fna -o out.tre -p 1
--intermediate intermediate/place_seqs

It shows an error when running this command:
epa-ng --tree /home/tayezy/miniconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/prokaryotic/pro_ref/pro_ref.tre --ref-msa intermediate/place_seqs/ref_seqs_hmmalign.fasta --query intermediate/place_seqs/study_seqs_hmmalign.fasta --chunk-size 5000 -T 1 -m /home/tayezy/miniconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/prokaryotic/pro_ref/pro_ref.model -w intermediate/place_seqs/epa_out --filter-acc-lwr 0.99 --filter-max 100

Any idea what might be the issue or how to resolve this?

Thanks in advance

Split command using fasta files

I used mafft to align query sequences against reference msa (both in fasta format) and also the combined msa is in fasta format. Right now I have to convert everything to phylip to use the split command, but it would be nice if epa could use fasta format as well.
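
Until FASTA input is supported there, the conversion step described above can be scripted. The following is a minimal Python sketch of a FASTA to sequential, relaxed PHYLIP conversion (full-length names, one sequence per line). The function name is hypothetical, and whether a given tool accepts this relaxed PHYLIP variant is an assumption to verify.

```python
def fasta_to_relaxed_phylip(fasta_text):
    """Convert aligned FASTA text to sequential relaxed PHYLIP text."""
    records = []  # list of [name, list of sequence chunks]
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            records.append([line[1:].split()[0], []])
        elif records:
            records[-1][1].append(line)
    if not records:
        raise ValueError("no sequences found")

    seqs = [(name, "".join(chunks)) for name, chunks in records]
    lengths = {len(seq) for _, seq in seqs}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; input is not an alignment")

    width = lengths.pop()
    lines = [f"{len(seqs)} {width}"]
    lines += [f"{name} {seq}" for name, seq in seqs]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = ">a\nAC-\nT\n>b some description\nACGT\n"  # made-up data
    print(fasta_to_relaxed_phylip(toy), end="")
```

The length check doubles as a quick test that the combined file really is an alignment before it is passed on.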

Compilation error on Bioconda

While rebuilding epa-ng on Bioconda due to updated compilers (GCC 10) I'm seeing the following error:

 [  5%] Building C object libs/pll-modules/libs/libpll/src/CMakeFiles/pll_obj.dir/parse_utree.c.o
 In file included from /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.c:236:
 /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.h:102:6: error: conflicting types for 'pll_utree_error'
   102 | void pll_utree_error (const char *msg);
       |      ^~~~~~~~~~~~~~~
 /opt/conda/conda-bld/epa-ng_1646126155046/work/libs/pll-modules/libs/libpll/src/parse_utree.y:143:13: note: previous definition of 'pll_utree_error' was here
   143 | static void pll_utree_error(pll_unode_t * node, const char * s)
       |             ^~~~~~~~~~~~~~~
 /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.c: In function 'pll_utree_parse':
 /opt/conda/conda-bld/epa-ng_1646126155046/work/build/libs/pll-modules/libs/libpll/src/parse_utree.c:1727:18: warning: passing argument 1 of 'pll_utree_error' from incompatible pointer type [-Wincompatible-pointer-types]
  1727 |         yyerror (tree, yymsgp);
       |                  ^~~~
       |                  |
       |                  struct pll_unode_s *

Is cmake perhaps downloading the wrong version of pll-modules?

Utilize sequence multiplicities

Although non-standard, many sequence file preprocessing steps add meta-data to the sequence name. For example, in fasta, one often sees

>name_1234

or

>name;size=1234

in order to note the abundance of the sequence. Such formats are used e.g., by swarm and vsearch.

It would be helpful if epa-ng picks this information up and uses it as multiplicity in the result jplace file.
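
Extracting the abundance from the two name styles shown above could be done as in this minimal Python sketch. The function name and the fallback multiplicity of 1 are assumptions for illustration, and the sketch covers only the two patterns quoted in this issue.

```python
import re

# ">name;size=1234" style, with an optional trailing semicolon.
SIZE_RE = re.compile(r";size=(\d+);?$")
# ">name_1234" style.
UNDERSCORE_RE = re.compile(r"_(\d+)$")


def split_abundance(header):
    """Return (name, multiplicity) parsed from a FASTA header or sequence name.

    Falls back to multiplicity 1 when no abundance annotation is found.
    """
    header = header.lstrip(">").strip()
    for pattern in (SIZE_RE, UNDERSCORE_RE):
        match = pattern.search(header)
        if match:
            return header[:match.start()], int(match.group(1))
    return header, 1


if __name__ == "__main__":
    for example in (">name_1234", ">name;size=1234", ">plain_name"):
        print(example, "->", split_abundance(example))
```

The underscore style is ambiguous, since any name that happens to end in `_<digits>` would be read as carrying an abundance, so this kind of parsing is probably best made opt-in. In the jplace format, such a pair would correspond to a name and multiplicity entry in a pquery's `nm` list.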

Zero-length branches altered

Hi

Firstly, thanks for developing such a great tool! I've had an odd issue occur whereby some zero-length branches in the reference tree get replaced with a branch length of 0.1053605157 in the epa_result.jplace file. I've attached an example that reproduces the issue:
submit.tar.gz

The command I ran was:

epa-ng -T 4 -m LG --redo --tree test.tree.tre --ref-msa test.msa.ref.fa --query test.msa.query.fa --preserve-rooting on --outdir out

test.tree.tre was:
"(g33013252_Mapoly0056s0034.1.p:0,((g33020677_Mapoly0173s0004.1.p:0,g33026704_Mapoly0052s0024.1.p:1.11896)100:0.662911,g33026973_Mapoly0016s0089.1.p:0.312835)1:0.234546);"

whereas the relevant line in epa_result.jplace was
"tree": "(g33013252_Mapoly0056s0034.1.p:0.0000000000{0},((g33020677_Mapoly0173s0004.1.p:0.1053605157{1},g33026704_Mapoly0052s0024.1.p:1.1189600000{2})100:0.6629110000{3},g33026973_Mapoly0016s0089.1.p:0.3128350000{4})1:0.2345460000{5});",

Note the zero branch length for "g33013252_Mapoly0056s0034.1.p" and the non-zero branch for "g33020677_Mapoly0173s0004.1.p"

This also happens in a larger tree, with the same resulting branch-length, even from branches far away from where the gene was inserted. Many/all of the zero length branches were replaced with a branch length of the same value as in the example above, 0.1053605157.

All the best
David
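
To list which branch lengths differ between the input tree and the tree in the jplace output, a comparison like the following can help. It is a minimal Python sketch that only looks at tip branches and uses a regular expression instead of a real Newick parser, so it assumes simple, unquoted tip names as in the example above. The function names are hypothetical.

```python
import re

# A tip is a name directly after "(" or ",", followed by ":" and a length.
TIP_RE = re.compile(r"(?<=[(,])([^(),:;{}]+):([0-9.eE+-]+)")


def tip_lengths(newick):
    """Return {tip_name: branch_length} for all tips in a Newick string."""
    return {name: float(length) for name, length in TIP_RE.findall(newick)}


def changed_tips(before, after, tol=1e-9):
    """Return {tip_name: (old_length, new_length)} for tips whose length changed."""
    old, new = tip_lengths(before), tip_lengths(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and abs(old[name] - new[name]) > tol
    }


if __name__ == "__main__":
    input_tree = (
        "(g33013252_Mapoly0056s0034.1.p:0,((g33020677_Mapoly0173s0004.1.p:0,"
        "g33026704_Mapoly0052s0024.1.p:1.11896)100:0.662911,"
        "g33026973_Mapoly0016s0089.1.p:0.312835)1:0.234546);"
    )
    jplace_tree = (
        "(g33013252_Mapoly0056s0034.1.p:0.0000000000{0},"
        "((g33020677_Mapoly0173s0004.1.p:0.1053605157{1},"
        "g33026704_Mapoly0052s0024.1.p:1.1189600000{2})100:0.6629110000{3},"
        "g33026973_Mapoly0016s0089.1.p:0.3128350000{4})1:0.2345460000{5});"
    )
    for name, (old, new) in changed_tips(input_tree, jplace_tree).items():
        print(f"{name}: {old} -> {new}")
```

Run on the two tree strings quoted in this issue, it reports only g33020677_Mapoly0173s0004.1.p, whose length changes from 0 to 0.1053605157.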
