
benchmarkalignments's People

Contributors

alicefsmith, dkainer, pbfrandsen, roblanf, snubian, terezasenfeldova


benchmarkalignments's Issues

add dataset_curation section to yaml file

This section should describe exactly what a user would need to do to get from the original Dryad link to the dataset in our database.

For example:

  • which alignment file to download
  • origin of partitioning information
  • justification for that alignment file over others (e.g. this is the alignment used to create figure 1 in the paper)
  • any edits made to that alignment file (e.g. changing characters in names etc. - this can point to a script in the repo which does this)
  • any sequences removed

So, we can have a section like this

dataset:
    DOI: dx.doi.org/10.5524/101041
    license: CC0
    used for tree inference: yes
    dataset_curation:
        alignment_file: 
        partition_information:
        justification:
        name_edits:
        sequence_edits:

name edits script and csv file

Sometimes we need to change names to remove special characters. To make this reproducible, we should have an

edit_names.py script in the utility scripts.

This should be run on every alignment. If any names are changed in the original alignment, it should output a csv file with one name change per line and two columns:

  • original_name
  • new_name

E.g.:

original_name, new_name
Homo-sapiens, Homo_sapiens
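A minimal sketch of what edit_names.py could look like. The sanitisation rule here (replace anything that isn't alphanumeric or an underscore) is an assumption, not a decision we've made yet:

```python
import csv
import re

def sanitize_name(name):
    """Replace any run of characters that is not alphanumeric or an
    underscore with a single underscore (assumed rule, adjust as needed)."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name)

def edit_names(names, csv_path):
    """Sanitize a list of taxon names; if any changed, write a csv with
    columns original_name,new_name, one row per change. Returns the
    mapping of changed names."""
    changes = [(n, sanitize_name(n)) for n in names if sanitize_name(n) != n]
    if changes:
        with open(csv_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["original_name", "new_name"])
            writer.writerows(changes)
    return dict(changes)
```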

Rightmeyer_2013 - error with final position in CHARSETS

A few of the CHARSETs have this pattern:

    CHARSET ef1a_exon1_1stpos = 1-583\3;
    CHARSET ef1a_exon1_2ndpos = 2-581\3;
    CHARSET ef1a_exon1_3rdpos = 3-582\3;

Should be:

    CHARSET ef1a_exon1_1stpos = 1-583\3;
    CHARSET ef1a_exon1_2ndpos = 2-583\3;
    CHARSET ef1a_exon1_3rdpos = 3-583\3;
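A quick way to catch this pattern across all datasets would be a small checker (hypothetical, not in the repo) that groups codon-position CHARSETs by locus prefix and flags any group whose end coordinates disagree, since all three positions should share the same region end:

```python
import re
from collections import defaultdict

# Matches lines like: CHARSET ef1a_exon1_2ndpos = 2-581\3;
CHARSET_RE = re.compile(r"CHARSET\s+(\S+)\s*=\s*(\d+)-(\d+)\\3;")

def check_codon_charsets(lines):
    """Group codon-position CHARSETs (suffix _1stpos/_2ndpos/_3rdpos) by
    locus prefix and return the groups whose end coordinates disagree."""
    groups = defaultdict(dict)
    for line in lines:
        m = CHARSET_RE.search(line)
        if m:
            name, start, end = m.group(1), int(m.group(2)), int(m.group(3))
            prefix = re.sub(r"_(1st|2nd|3rd)pos$", "", name)
            groups[prefix][name] = (start, end)
    return {prefix: members for prefix, members in groups.items()
            if len({end for _, end in members.values()}) > 1}
```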

new dataset

Burleigh JG, Bansal MS, Eulenstein O, Hartmann S, Wehe A, Vision TJ (2010) Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology 60(2): 117-125. http://dx.doi.org/10.5061/dryad.7881

separate out nexus files

At the moment we have the alignment and the SETS block all in one file. But this is annoying in terms of utility.

If we go ahead and make the partitions, loci, and genomes into a csv file, it would be better to have:

  • alignment.nex
  • partitions.nex
  • loci.nex
  • genomes.nex

partitions.nex, loci.nex, and genomes.nex each contain only CHARSETs, which list the relevant sites for each element.

E.g. partitions.nex would look like:

begin SETS;

	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;
	
end;

and loci.nex would look like:

begin SETS;

	CHARSET COI = COI_1stpos: 1-1592\3, COI_2ndpos: 2-1592\3, COI_3rdpos: 3-1592\3;
	CHARSET 16S = 16S: 1593-3037;

end;

etc

This lets people easily do whatever kind of analysis they are interested in.
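A rough sketch of the split, assuming a single `begin SETS; ... end;` block per combined file (the exact wrapping and output file names are illustrative):

```python
import re

def split_nexus(text):
    """Split a combined nexus file into the alignment part (everything
    except the SETS block) and the SETS block, each wrapped so it is a
    valid nexus file on its own. Assumes one 'begin SETS; ... end;'."""
    m = re.search(r"begin\s+SETS;.*?end;", text, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return text, None
    sets_block = m.group(0)
    alignment = (text[:m.start()] + text[m.end():]).strip() + "\n"
    partitions = "#NEXUS\n\n" + sets_block + "\n"
    return alignment, partitions
```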

yaml checker problem

Hey @snubian,

I've been trying to run the YAML checker.

On the 'tereza' branch, I do this:

source("~/Documents/github/PartitionedAlignments/R/check_yaml/config_yaml.R", chdir = TRUE)
checkYAML("~/Documents/github/PartitionedAlignments/datasets/Anderson_2013")

I get this:

Error in getClass(Class, where = topenv(parent.frame())) : 
  “SilentReporter” is not a defined class 

It looks like I can fix this by changing this (line 29 of config_YAML):

reporter = new("SilentReporter")

To this

reporter = SilentReporter

But I don't really know for sure I'm not doing something terrible. Could you check?

Unrelated issue - we might have changed our YAML format a touch. Once this is sorted I'll get on to seeing if I can address that...

benchmark ideas

Some ideas from a chat with Fred Jaya. Use the wiki to give a few examples of simple benchmarks, e.g.

  1. RAxML vs. IQ-TREE on individual loci (fix models to be equal, assess trees in one piece of software, measure CPU time, memory, likelihood)
  2. NJ vs. ML vs. Parsimony
  3. Simulate datasets with AliSim and re-do the above when you know the truth

include iqtree trees for each dataset, and alisim commands

For each dataset in the final collection, we should estimate:

  1. A concatenated tree
  2. Single-locus trees (i.e. from loci.nex)

And then we should provide a set of alisim commands which would let people simulate data according to those trees.
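For example, a helper that builds one AliSim command per tree. This assumes IQ-TREE 2's `--alisim` interface; the model string, tree file names, and output prefixes are placeholders, not choices we've settled on:

```python
def alisim_command(tree_file, model, out_prefix, length=None):
    """Build an AliSim command line for one tree (IQ-TREE 2's --alisim
    interface; model and prefix are illustrative placeholders)."""
    cmd = ["iqtree2", "--alisim", out_prefix, "-t", tree_file, "-m", model]
    if length is not None:
        cmd += ["--length", str(length)]
    return " ".join(cmd)

def alisim_commands(tree_files, model="GTR+G"):
    """One simulation command per single-locus tree file."""
    return [alisim_command(t, model, t.replace(".treefile", "_sim"))
            for t in tree_files]
```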

Add info on root node to metadata

For non-rev models (and maybe other stuff) you need to know the position of the root.

If I added information on that to the yaml file (where possible) it would be useful.

add concat and loci folders

for each dataset, we want a script which takes as input the dataset folder, and then outputs two new folders as follows:

  • concat [all the files for a concatenated analysis, bar the original alignment.nex, which is already in the root folder]

    • alignment_partitions.nex
    • alignment_partitions.raxml
  • loci

    • locus_1.nex (i.e. locus name, then .nex)
    • locus_1_partitions.nex
    • locus_1_partitions.raxml
    • ...
    • the same three files for every locus

Then in the readme we can include command lines to do concatenated and individual locus analyses for every alignment.

new dataset

Wiens JJ, Hutter CR, Mulcahy DG, Noonan BP, Townsend TM, Sites Jr JW, Reeder TW (2012) Resolving the phylogeny of lizards and snakes (Squamata) with extensive sampling of genes and species. Biology Letters 8(6): 1043-1046. http://dx.doi.org/10.5061/dryad.g1gd8

update yaml file

To remove:

  • alignment section, because it's redundant with other information we already have
  • genomes section, for the same reason

To add:

  • outgroups

include raxml-format partition files

These look like this:

DNA, part1 = 1-100
DNA, part2 = 101-384

We can build these from the csv file, and we should name them similarly to the nexus format ones e.g.

  • partitions_raxml.txt
  • loci_raxml.txt

etc

Then we should add tests to make sure they load OK in raxml-ng
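A sketch of the conversion from (name, sites) pairs — e.g. rows pulled out of the proposed csv file — to raxml-ng partition-file lines:

```python
def raxml_partition_lines(rows, datatype="DNA"):
    """Convert (partition_name, sites) pairs into raxml-ng partition-file
    lines such as 'DNA, part1 = 1-100'."""
    return ["{}, {} = {}".format(datatype, name, sites) for name, sites in rows]

def write_raxml_partitions(rows, path, datatype="DNA"):
    """Write the lines to e.g. partitions_raxml.txt."""
    with open(path, "w") as fh:
        fh.write("\n".join(raxml_partition_lines(rows, datatype)) + "\n")
```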

run tests with nextflow

Maybe better than just using a python script is to actually manage things properly with a workflow language.

empirical distributions

Suha's idea:

Include empirical distributions of all the things for all the datasets.

E.g.:

  • branch lengths
  • GTR model parameters (also NonRev model parameters)
  • base frequencies (these are already in the AMAS files)
  • tree balance (per locus? per dataset?)

Lots of decisions to make here, but this has the potential to be a very useful resource.
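E.g. branch lengths could be pulled straight out of the newick strings with a crude regex — fine for summary distributions, though a real tree parser would be safer:

```python
import re

def branch_lengths(newick):
    """Extract all branch lengths from a newick string by matching the
    ':<number>' tokens (handles plain decimals and scientific notation)."""
    return [float(x) for x in
            re.findall(r":([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", newick)]
```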

Missing data in 'Wood_2012' dataset

Hi @roblanf ,
When I used the dataset "Wood_2012", I noticed that you removed the morphological data. But these sequences:
Archeaa_paradoxa, Afrarchaea_grimaldii, Myrmecarchaea_sp, Baltarchaea_conica and Patarchaea_muralis contain only morphological data, so after it is removed they contain no data at all, which affects the use of that dataset.

When I use this dataset in IQ-TREE, it stops automatically and reports:

WARNING: Sequence Archaea_paradoxa contains only gaps or missing data
WARNING: Sequence Afrarchaea_grimaldii contains only gaps or missing data
WARNING: Sequence Myrmecarchaea_sp contains only gaps or missing data
WARNING: Sequence Baltarchaea_conica contains only gaps or missing data
WARNING: Sequence Patarchaea_muralis contains only gaps or missing data
ERROR: Some sequences (see above) are problematic, please check your alignment again

I think these sequences could be deleted from the dataset so that it can be used normally.

Kind regards,
Qiuyue

check genome assignments in Looney_2016

Hi Rob,

I noticed that in Looney_2016 dataset the rpb1 and rpb2 genes are defined as chloroplast genome in the alignment.nex file

I’ve seen these loci in other datasets and they are always in the nuclear genome.

I checked the paper but the only sentence that mentions these two loci is not clear at all “Four loci were targeted for infrageneric clade‐level resolution including two nrDNA regions (nuclear ribosomal large subunit (LSU) and internal transcribed spacers (ITS) and two single‐copy genes (rpb1 and rpb2, which encode the largest and second largest subunits of RNA polymerase II, respectively)”. However, the paper is all about fungi!

There is also a reference (https://doi.org/10.1016/j.ympev.2004.11.014) in that paper that is addressing these two loci and defines them as nuclear genes.

Can you please check that?

Best,

Suha

add locus charpartitions

The point is to add explicit information on every single locus in the dataset.

At the moment we have

begin SETS;

	[loci]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

    CHARPARTITION loci = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos, 4:16S;

	[genomes]
	CHARSET	mitochondrial_genome = 1-3037;

    CHARPARTITION genomes = 1:mitochondrial_genome;

end;

But we should have

begin SETS;

	[partitions]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

	[loci]
	CHARPARTITION COI = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos;
	CHARPARTITION 16S = 1:16S;

	CHARPARTITION loci = 1:COI, 2:16S;

	[genomes]
	CHARPARTITION	mitochondrial_genome = 1:COI, 2:16S;

	CHARPARTITION genomes = 1:mitochondrial_genome;

end;

This involves:

  • Change current [loci] to [partitions]
  • add a [loci] section with one CHARPARTITION for each locus
  • make the loci CHARPARTITION list the actual loci, not the charsets themselves
  • make the mitochondrial genome (and other genome) CHARPARTITIONs list the actual loci, not the charsets themselves
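One way to derive the [loci] CHARPARTITIONs automatically from the charset names, assuming the `_1stpos`/`_2ndpos`/`_3rdpos` naming convention (charsets without a codon suffix become single-member loci):

```python
import re
from collections import OrderedDict

def loci_charpartitions(charset_names):
    """Group codon-position charsets into one CHARPARTITION per locus,
    then append the overall 'loci' CHARPARTITION listing the loci."""
    loci = OrderedDict()
    for name in charset_names:
        locus = re.sub(r"_(1st|2nd|3rd)pos$", "", name)
        loci.setdefault(locus, []).append(name)
    lines = []
    for locus, members in loci.items():
        parts = ", ".join("{}:{}".format(i + 1, m) for i, m in enumerate(members))
        lines.append("CHARPARTITION {} = {};".format(locus, parts))
    lines.append("CHARPARTITION loci = {};".format(
        ", ".join("{}:{}".format(i + 1, l) for i, l in enumerate(loci))))
    return lines
```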

Is there a way to search by organism type or sequence type?

The repository currently seems to list the data sets by author_year. I only recognize one or two of the authors. It would be very nice if I could find data sets for "mammals", or "primates" or "vertebrates". Or maybe find sets with at least 3,000 bases of DNA and not more than 40 taxa.

10 datasets for MCMC diagnostics

From @rbouckaert:

Today I bumped into this bit "The first group of data sets, which we will call DS1-DS11, have become standard data sets for evaluating MCMC methods” which is on page 15 of http://arxiv.org/pdf/1405.2120v2.pdf

It is a set of alignments from TreeBase that are mainly smaller size (see below), so easy to use for running lots of experiments.

Data N Cols Type of data Study Est error
DS1 27 1949 rRNA; 18s Hedges et al. (1990) 0.0048
DS2 29 2520 rDNA; 18s Garey et al. (1996) 0.0002
DS3 36 1812 mtDNA; COII (1678); cytb (679-1812) Yang and Yoder (2003) 0.0002
DS4 41 1137 rDNA; 18s Henk et al. (2003) 0.0006
DS5 50 378 Nuclear protein coding; wingless Lakner et al. (2008) 0.0005
DS6 50 1133 rDNA; 18s Zhang and Blackwell (2001) 0.0023
DS7 59 1824 mtDNA; COII; and cytb Yoder and Yang (2004) 0.0011
DS8 64 1008 rDNA; 28s Rossman et al. (2001) 0.0009
DS9 67 955 Plastid ribosomal protein; s16 (rps16) Ingram and Doyle (2004) 0.0164
DS10 67 1098 rDNA; 18s Suh and Blackwell (1999) 0.0164
DS11 71 1082 rDNA; internal transcribed spacer Kroken and Taylor (2000) 0.0008

Comprehensive phylogenomic time tree of bryophytes reveals deep relationships and uncovers gene incongruences in the last 500 million years of diversification.

https://bsapubs.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/ajb2.16249?casa_token=vDewRqKp43IAAAAA:N66N89E_iRepM7D3tbaH4d6mgHWW4QUvK45Yfbw3oLJu2Zdihmb_kVlFsml7La0wFKCdpO7qlmQ0Ap4

All phylogenetic nucleotide and amino acid alignments and
resulting gene trees and species trees are available on Dryad:
https://datadryad.org/stash/share/n-BLaXfahJNFs7jvkBODGzvpjQYuBdvKESEaQGKeyE; DOI: 10.5061/dryad.3j9kd51qm.

The Dryad repo isn't live yet, but this seems like it might be a really nice dataset once it is!

new dataset

Ruhfel BR, Gitzendanner MA, Soltis PS, Soltis DE, Burleigh JG (2014) From algae to angiosperms: inferring the phylogeny of green plants (Viridiplantae) from 360 plastid genomes. BMC Evolutionary Biology 14: 23. http://dx.doi.org/10.5061/dryad.k1t1f

run all alignments through geneious

Right now some alignments differ just a little from others, e.g. in the inclusion of special characters in the species names, the use of lower and upper case letters etc.

A simple fix for this is to run all alignments through Geneious and export them as nexus according to its specifications, replacing all special characters in species names. This will just standardise the formatting among datasets a little better.

include a script to split alignments into partitions

Easy to do with AMAS I think.

Even better if I could also make it split alignments into loci, i.e. to identify suffixes like '1st_pos' etc. and then concatenate the 1st, 2nd, and 3rd positions into a single locus.
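Whatever tool does the splitting, the column-slicing step amounts to this (a sketch, not the AMAS API: expand each charset's site specification, then pull those columns from each aligned sequence):

```python
def expand_sites(spec):
    """Expand a nexus site specification like '1-1592\\3' (start-end with
    a step) or '1593-3037' into a list of 1-based site indices."""
    if "\\" in spec:
        rng, step = spec.split("\\")
        step = int(step)
    else:
        rng, step = spec, 1
    start, end = (int(x) for x in rng.split("-"))
    return list(range(start, end + 1, step))

def extract_partition(sequences, spec):
    """Pull the columns for one partition out of a dict of aligned
    sequences, keeping only the sites named by the charset."""
    sites = expand_sites(spec)
    return {name: "".join(seq[i - 1] for i in sites)
            for name, seq in sequences.items()}
```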

add a csv file with alignment information

At the moment all the useful information is in nexus format, which can be annoying to work with.

E.g. we have this:

begin SETS;

	[partitions]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

	[loci]
	CHARPARTITION COI = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos;
	CHARPARTITION 16S = 1:16S;

	CHARPARTITION loci = 1:COI, 2:16S;

	[genomes]
	CHARPARTITION	mitochondrial_genome = 1:COI, 2:16S;

	CHARPARTITION genomes = 1:mitochondrial_genome;

But this could be represented as a csv file with the following columns:

  • alignment_name (e.g. "Anderson_2012")
  • partition (e.g. "COI_1stpos")
  • partition_sites (e.g. "1-1592\3")
  • locus (e.g. "COI")
  • genome (e.g. "mitochondrial")

We could then use the csv file when entering the data, and build the nexus block directly from the csv file.
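A sketch of building the SETS block back from rows of (partition, partition_sites, locus, genome); the exact layout and the `_genome` suffix are illustrative:

```python
from collections import OrderedDict

def sets_block_from_rows(rows):
    """Build a nexus SETS block from csv-style rows of
    (partition, partition_sites, locus, genome)."""
    charsets, loci, genomes = [], OrderedDict(), OrderedDict()
    for partition, sites, locus, genome in rows:
        charsets.append("\tCHARSET {} = {};".format(partition, sites))
        loci.setdefault(locus, []).append(partition)
        genomes.setdefault(genome, []).append(locus)
    lines = ["begin SETS;", "", "\t[partitions]"] + charsets + ["", "\t[loci]"]
    for locus, parts in loci.items():
        lines.append("\tCHARPARTITION {} = {};".format(
            locus, ", ".join("{}:{}".format(i + 1, p) for i, p in enumerate(parts))))
    lines.append("\tCHARPARTITION loci = {};".format(
        ", ".join("{}:{}".format(i + 1, l) for i, l in enumerate(loci))))
    lines += ["", "\t[genomes]"]
    for genome, gloci in genomes.items():
        uniq = list(OrderedDict.fromkeys(gloci))  # each locus listed once
        lines.append("\tCHARPARTITION {}_genome = {};".format(
            genome, ", ".join("{}:{}".format(i + 1, l) for i, l in enumerate(uniq))))
    lines += ["", "end;"]
    return "\n".join(lines)
```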

change charset to charpartition for genomes

Right now when I run one of the alignments in IQ-TREE it loads all the charsets:

iqtree2 -s alignment.nex -p alignment.nex --redo

Loading 5 partitions...
Subset  Type    Seqs    Sites   Infor   Invar   Model   Name
1               145     531     54      425     1       COI_1stpos
2               145     531     3       502     2       COI_2ndpos
3               145     530     303     109     3       COI_3rdpos
4               145     1445    283     924     4       16S
5               145     3037    643     1960    1       mitochondrial_genome
Degree of missing data: 0.000
Info: multi-threading strategy over partitions

So instead of this:

begin SETS;

	[loci]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

    CHARPARTITION loci = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos, 4:16S;

	[genomes]
	CHARSET	mitochondrial_genome = 1-3037;

    CHARPARTITION genomes = 1:mitochondrial_genome;

end;

we should have this:

begin SETS;

	[loci]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

    CHARPARTITION loci = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos, 4:16S;

	[genomes]
	CHARPARTITION	mitochondrial_genome = 1-3037;

    CHARPARTITION genomes = 1:mitochondrial_genome;

end;
