
benchmarkalignments's People

Contributors

alicefsmith, dkainer, pbfrandsen, roblanf, snubian, terezasenfeldova


benchmarkalignments's Issues

add dataset_curation section to yaml file

This section should describe exactly what a user would need to do to get from the original Dryad link to the dataset in our database.

For example:

  • which alignment file to download
  • origin of partitioning information
  • justification for that alignment file over others (e.g. this is the alignment used to create figure 1 in the paper)
  • any edits made to that alignment file (e.g. changing characters in names etc. - this can point to a script in the repo which does this)
  • any sequences removed

So, we can have a section like this

dataset:
    DOI: dx.doi.org/10.5524/101041
    license: CC0
    used for tree inference: yes
    dataset_curation:
        alignment_file: 
        partition_information:
        justification:
        name_edits:
        sequence_edits:

name edits script and csv file

Sometimes we need to change names to remove special characters. To make this reproducible, we should have an

edit_names.py script in the utility scripts.

This should be run on every alignment. If any names are changed in the original alignment, it should output a csv file with one name change per line and two columns:

  • original_name
  • new_name

E.g.:

original_name, new_name
Homo-sapiens, Homo_sapiens
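A minimal sketch of what edit_names.py could look like. The sanitisation rule here (replace anything that isn't alphanumeric or an underscore) is an assumption, not a decision we've made yet:

```python
import csv
import re

def sanitize_name(name):
    """Replace any run of characters that is not alphanumeric or an
    underscore with a single underscore (assumed rule, adjust as needed)."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name)

def edit_names(names, csv_path):
    """Sanitize a list of taxon names; if any changed, write a csv with
    columns original_name,new_name, one row per change. Returns the
    mapping of changed names."""
    changes = [(n, sanitize_name(n)) for n in names if sanitize_name(n) != n]
    if changes:
        with open(csv_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["original_name", "new_name"])
            writer.writerows(changes)
    return dict(changes)
```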

Rightmeyer_2013 - error with final position in CHARSETS

A few of the CHARSETs have this pattern:

    CHARSET ef1a_exon1_1stpos = 1-583\3;
    CHARSET ef1a_exon1_2ndpos = 2-581\3;
    CHARSET ef1a_exon1_3rdpos = 3-582\3;

Should be:

    CHARSET ef1a_exon1_1stpos = 1-583\3;
    CHARSET ef1a_exon1_2ndpos = 2-583\3;
    CHARSET ef1a_exon1_3rdpos = 3-583\3;
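A quick way to catch this pattern across all datasets would be a small checker (hypothetical, not in the repo) that groups codon-position CHARSETs by locus prefix and flags any group whose end coordinates disagree, since all three positions should share the same region end:

```python
import re
from collections import defaultdict

# Matches lines like: CHARSET ef1a_exon1_2ndpos = 2-581\3;
CHARSET_RE = re.compile(r"CHARSET\s+(\S+)\s*=\s*(\d+)-(\d+)\\3;")

def check_codon_charsets(lines):
    """Group codon-position CHARSETs (suffix _1stpos/_2ndpos/_3rdpos) by
    locus prefix and return the groups whose end coordinates disagree."""
    groups = defaultdict(dict)
    for line in lines:
        m = CHARSET_RE.search(line)
        if m:
            name, start, end = m.group(1), int(m.group(2)), int(m.group(3))
            prefix = re.sub(r"_(1st|2nd|3rd)pos$", "", name)
            groups[prefix][name] = (start, end)
    return {prefix: members for prefix, members in groups.items()
            if len({end for _, end in members.values()}) > 1}
```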

new dataset

Burleigh JG, Bansal MS, Eulenstein O, Hartmann S, Wehe A, Vision TJ (2010) Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology 60(2): 117-125. http://dx.doi.org/10.5061/dryad.7881

separate out nexus files

At the moment we have the alignment and the SETS block all in one file. But this is annoying in terms of utility.

If we go ahead and make the partitions, loci, and genomes into a csv file, it would be better to have:

  • alignment.nex
  • partitions.nex
  • loci.nex
  • genomes.nex

partitions.nex, loci.nex, and genomes.nex each contain only CHARSETs, which list the relevant sites for each element.

E.g. partitions.nex would look like:

begin SETS;

	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;
	
end;

and loci.nex would look like:

begin SETS;

	CHARSET COI = COI_1stpos: 1-1592\3, COI_2ndpos: 2-1592\3, COI_3rdpos: 3-1592\3;
	CHARSET 16S = 16S: 1593-3037;

end;

etc

This lets people easily do whatever kind of analysis they are interested in.
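A rough sketch of the split, assuming a single `begin SETS; ... end;` block per combined file (the exact wrapping and output file names are illustrative):

```python
import re

def split_nexus(text):
    """Split a combined nexus file into the alignment part (everything
    except the SETS block) and the SETS block, each wrapped so it is a
    valid nexus file on its own. Assumes one 'begin SETS; ... end;'."""
    m = re.search(r"begin\s+SETS;.*?end;", text, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return text, None
    sets_block = m.group(0)
    alignment = (text[:m.start()] + text[m.end():]).strip() + "\n"
    partitions = "#NEXUS\n\n" + sets_block + "\n"
    return alignment, partitions
```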

yaml checker problem

Hey @snubian,

I've been trying to run the YAML checker.

On the 'tereza' branch, I do this:

source("~/Documents/github/PartitionedAlignments/R/check_yaml/config_yaml.R", chdir = TRUE)
checkYAML("~/Documents/github/PartitionedAlignments/datasets/Anderson_2013")

I get this:

Error in getClass(Class, where = topenv(parent.frame())) : 
  “SilentReporter” is not a defined class 

It looks like I can fix this by changing this (line 29 of config_YAML):

reporter = new("SilentReporter")

To this

reporter = SilentReporter

But I don't really know for sure I'm not doing something terrible. Could you check?

Unrelated issue - we might have changed our YAML format a touch. Once this is sorted I'll get on to seeing if I can address that...

benchmark ideas

Some ideas from a chat with Fred Jaya. Use the wiki to give a few examples of simple benchmarks, e.g.

  1. RAxML vs. IQ-TREE on individual loci (fix models to be equal, assess trees in one piece of software, measure CPU time, memory, likelihood)
  2. NJ vs. ML vs. Parsimony
  3. Simulate datasets with AliSim and re-do the above when you know the truth

include iqtree trees for each dataset, and alisim commands

For each dataset in the final collection, we should estimate:

  1. A concatenated tree
  2. Single-locus trees (i.e. from loci.nex)

And then we should provide a set of alisim commands which would let people simulate data according to those trees.
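For example, a helper that builds one AliSim command per tree. This assumes IQ-TREE 2's `--alisim` interface; the model string, tree file names, and output prefixes are placeholders, not choices we've settled on:

```python
def alisim_command(tree_file, model, out_prefix, length=None):
    """Build an AliSim command line for one tree (IQ-TREE 2's --alisim
    interface; model and prefix are illustrative placeholders)."""
    cmd = ["iqtree2", "--alisim", out_prefix, "-t", tree_file, "-m", model]
    if length is not None:
        cmd += ["--length", str(length)]
    return " ".join(cmd)

def alisim_commands(tree_files, model="GTR+G"):
    """One simulation command per single-locus tree file."""
    return [alisim_command(t, model, t.replace(".treefile", "_sim"))
            for t in tree_files]
```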

Add info on root node to metadata

For non-rev models (and maybe other stuff) you need to know the position of the root.

If I added information on that to the yaml file (where possible) it would be useful.

add concat and loci folders

for each dataset, we want a script which takes as input the dataset folder, and then outputs two new folders as follows:

  • concat [all the files for a concatenated analysis, bar the original alignment.nex, which is already in the root folder]

    • alignment_partitions.nex
    • alignment_partitions.raxml
  • loci

    • locus_1.nex (i.e. locus name, then .nex)
    • locus_1_partitions.nex
    • locus_1_partitions.raxml
    • ...
    • the same three files for every locus

Then in the readme we can include command lines to do concatenated and individual locus analyses for every alignment.

new dataset

Wiens JJ, Hutter CR, Mulcahy DG, Noonan BP, Townsend TM, Sites Jr JW, Reeder TW (2012) Resolving the phylogeny of lizards and snakes (Squamata) with extensive sampling of genes and species. Biology Letters 8(6): 1043-1046. http://dx.doi.org/10.5061/dryad.g1gd8

update yaml file

To remove:

  • alignment section, because it's redundant with other information we already have
  • genomes section, for the same reason

To add:

  • outgroups

include raxml-format partition files

These look like this:

DNA, part1 = 1-100
DNA, part2 = 101-384

We can build these from the csv file, and we should name them similarly to the nexus format ones e.g.

  • partitions_raxml.txt
  • loci_raxml.txt

etc

Then we should add tests to make sure they load OK in raxml-ng
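A sketch of the conversion from (name, sites) pairs — e.g. rows pulled out of the proposed csv file — to raxml-ng partition-file lines:

```python
def raxml_partition_lines(rows, datatype="DNA"):
    """Convert (partition_name, sites) pairs into raxml-ng partition-file
    lines such as 'DNA, part1 = 1-100'."""
    return ["{}, {} = {}".format(datatype, name, sites) for name, sites in rows]

def write_raxml_partitions(rows, path, datatype="DNA"):
    """Write the lines to e.g. partitions_raxml.txt."""
    with open(path, "w") as fh:
        fh.write("\n".join(raxml_partition_lines(rows, datatype)) + "\n")
```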

run tests with nextflow

Maybe better than just using a python script is to actually manage things properly with a workflow language.

empirical distributions

Suha's idea:

Include empirical distributions of all the things for all the datasets.

E.g.:

  • branch lengths
  • GTR model parameters (also NonRev model parameters)
  • base frequencies (these are already in the AMAS files)
  • tree balance (per locus? per dataset?)

Lots of decisions to make here, but this has the potential to be a very useful resource.
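E.g. branch lengths could be pulled straight out of the newick strings with a crude regex — fine for summary distributions, though a real tree parser would be safer:

```python
import re

def branch_lengths(newick):
    """Extract all branch lengths from a newick string by matching the
    ':<number>' tokens (handles plain decimals and scientific notation)."""
    return [float(x) for x in
            re.findall(r":([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", newick)]
```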

Missing data in 'Wood_2012' dataset

Hi @roblanf ,
When I used the dataset "Wood_2012", I noticed that you removed the morphological data. But these sequences:
Archeaa_paradoxa, Afrarchaea_grimaldii, Myrmecarchaea_sp, Baltarchaea_conica and Patarchaea_muralis contain only morphological data, so after it is removed they contain no data at all, which affects the use of that dataset.

When I use this dataset in IQ-TREE, it stops automatically and reports:

WARNING: Sequence Archaea_paradoxa contains only gaps or missing data
WARNING: Sequence Afrarchaea_grimaldii contains only gaps or missing data
WARNING: Sequence Myrmecarchaea_sp contains only gaps or missing data
WARNING: Sequence Baltarchaea_conica contains only gaps or missing data
WARNING: Sequence Patarchaea_muralis contains only gaps or missing data
ERROR: Some sequences (see above) are problematic, please check your alignment again

I think these sequences could be deleted from the dataset so that it can be used normally.

Kind regards,
Qiuyue

check genome assignments in Looney_2016

Hi Rob,

I noticed that in Looney_2016 dataset the rpb1 and rpb2 genes are defined as chloroplast genome in the alignment.nex file

I’ve seen these loci in other datasets and they are always in the nuclear genome.

I checked the paper but the only sentence that mentions these two loci is not clear at all “Four loci were targeted for infrageneric clade‐level resolution including two nrDNA regions (nuclear ribosomal large subunit (LSU) and internal transcribed spacers (ITS) and two single‐copy genes (rpb1 and rpb2, which encode the largest and second largest subunits of RNA polymerase II, respectively)”. However, the paper is all about fungi!

There is also a reference (https://doi.org/10.1016/j.ympev.2004.11.014) in that paper that is addressing these two loci and defines them as nuclear genes.

Can you please check that?

Best,

Suha

add locus charpartitions

The point is to add explicit information on every single locus in the dataset.

At the moment we have

begin SETS;

	[loci]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

    CHARPARTITION loci = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos, 4:16S;

	[genomes]
	CHARSET	mitochondrial_genome = 1-3037;

    CHARPARTITION genomes = 1:mitochondrial_genome;

end;

But we should have

begin SETS;

	[partitions]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

	[loci]
	CHARPARTITION COI = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos;
	CHARPARTITION 16S = 1:16S;

	CHARPARTITION loci = 1:COI, 2:16S;

	[genomes]
	CHARPARTITION	mitochondrial_genome = 1:COI, 2:16S;

	CHARPARTITION genomes = 1:mitochondrial_genome;

end;

This involves:

  • Change current [loci] to [partitions]
  • add a [loci] section with one CHARPARTITION for each locus
  • make the loci CHARPARTITION list the actual loci, not the charsets themselves
  • make the mitochondrial genome (and other genome) CHARPARTITIONs list the actual loci, not the charsets themselves
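One way to derive the [loci] CHARPARTITIONs automatically from the charset names, assuming the `_1stpos`/`_2ndpos`/`_3rdpos` naming convention (charsets without a codon suffix become single-member loci):

```python
import re
from collections import OrderedDict

def loci_charpartitions(charset_names):
    """Group codon-position charsets into one CHARPARTITION per locus,
    then append the overall 'loci' CHARPARTITION listing the loci."""
    loci = OrderedDict()
    for name in charset_names:
        locus = re.sub(r"_(1st|2nd|3rd)pos$", "", name)
        loci.setdefault(locus, []).append(name)
    lines = []
    for locus, members in loci.items():
        parts = ", ".join("{}:{}".format(i + 1, m) for i, m in enumerate(members))
        lines.append("CHARPARTITION {} = {};".format(locus, parts))
    lines.append("CHARPARTITION loci = {};".format(
        ", ".join("{}:{}".format(i + 1, l) for i, l in enumerate(loci))))
    return lines
```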

Is there a way to search by organism type or sequence type?

The repository currently seems to list the data sets by author_year. I only recognize one or two of the authors. It would be very nice if I could find data sets for "mammals", or "primates" or "vertebrates". Or maybe find sets with at least 3,000 bases of DNA and not more than 40 taxa.

10 datasets for MCMC diagnostics

From @rbouckaert:

Today I bumped into this bit "The first group of data sets, which we will call DS1-DS11, have become standard data sets for evaluating MCMC methods” which is on page 15 of http://arxiv.org/pdf/1405.2120v2.pdf

It is a set of alignments from TreeBase that are mainly smaller size (see below), so easy to use for running lots of experiments.

Data N Cols Type of data Study Est error
DS1 27 1949 rRNA; 18s Hedges et al. (1990) 0.0048
DS2 29 2520 rDNA; 18s Garey et al. (1996) 0.0002
DS3 36 1812 mtDNA; COII (1678); cytb (679-1812) Yang and Yoder (2003) 0.0002
DS4 41 1137 rDNA; 18s Henk et al. (2003) 0.0006
DS5 50 378 Nuclear protein coding; wingless Lakner et al. (2008) 0.0005
DS6 50 1133 rDNA; 18s Zhang and Blackwell (2001) 0.0023
DS7 59 1824 mtDNA; COII; and cytb Yoder and Yang (2004) 0.0011
DS8 64 1008 rDNA; 28s Rossman et al. (2001) 0.0009
DS9 67 955 Plastid ribosomal protein; s16 (rps16) Ingram and Doyle (2004) 0.0164
DS10 67 1098 rDNA; 18s Suh and Blackwell (1999) 0.0164
DS11 71 1082 rDNA; internal transcribed spacer Kroken and Taylor (2000) 0.0008

Comprehensive phylogenomic time tree of bryophytes reveals deep relationships and uncovers gene incongruences in the last 500 million years of diversification.

https://bsapubs.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/ajb2.16249?casa_token=vDewRqKp43IAAAAA:N66N89E_iRepM7D3tbaH4d6mgHWW4QUvK45Yfbw3oLJu2Zdihmb_kVlFsml7La0wFKCdpO7qlmQ0Ap4

All phylogenetic nucleotide and amino acid alignments and
resulting gene trees and species trees are available on Dryad:
https://datadryad.org/stash/share/n-BLaXfahJNFs7jvkBODGzvpjQYuBdvKESEaQGKeyE; DOI: 10.5061/dryad.3j9kd51qm.

The Dryad repo isn't live yet, but this seems like it might be a really nice dataset once it is!

new dataset

Ruhfel BR, Gitzendanner MA, Soltis PS, Soltis DE, Burleigh JG (2014) From algae to angiosperms: inferring the phylogeny of green plants (Viridiplantae) from 360 plastid genomes. BMC Evolutionary Biology 14: 23. http://dx.doi.org/10.5061/dryad.k1t1f

run all alignments through geneious

Right now some alignments differ just a little from others, e.g. in the inclusion of special characters in the species names, the use of lower and upper case letters etc.

A simple fix for this is to run all alignments through Geneious and export them as nexus according to its specifications, replacing all special characters in species names. This will just standardise the formatting among datasets a little better.

include a script to split alignments into partitions

Easy to do with AMAS I think.

Even better if I could also make it split alignments into loci, i.e. to identify suffixes like '1st_pos' etc. and then concatenate the 1st, 2nd, and 3rd positions into a single locus.
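Whatever tool does the splitting, the column-slicing step amounts to this (a sketch, not the AMAS API: expand each charset's site specification, then pull those columns from each aligned sequence):

```python
def expand_sites(spec):
    """Expand a nexus site specification like '1-1592\\3' (start-end with
    a step) or '1593-3037' into a list of 1-based site indices."""
    if "\\" in spec:
        rng, step = spec.split("\\")
        step = int(step)
    else:
        rng, step = spec, 1
    start, end = (int(x) for x in rng.split("-"))
    return list(range(start, end + 1, step))

def extract_partition(sequences, spec):
    """Pull the columns for one partition out of a dict of aligned
    sequences, keeping only the sites named by the charset."""
    sites = expand_sites(spec)
    return {name: "".join(seq[i - 1] for i in sites)
            for name, seq in sequences.items()}
```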

add a csv file with alignment information

At the moment all the useful information is in nexus format, which can be annoying to work with.

E.g. we have this:

begin SETS;

	[partitions]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

	[loci]
	CHARPARTITION COI = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos;
	CHARPARTITION 16S = 1:16S;

	CHARPARTITION loci = 1:COI, 2:16S;

	[genomes]
	CHARPARTITION	mitochondrial_genome = 1:COI, 2:16S;

	CHARPARTITION genomes = 1:mitochondrial_genome;

But this could be represented as a csv file with the following columns:

  • alignment_name (e.g. "Anderson_2012")
  • partition (e.g. "COI_1stpos")
  • partition_sites (e.g. "1-1592\3")
  • locus (e.g. "COI")
  • genome (e.g. "mitochondrial")

We could then use the csv file when entering the data, and build the nexus block directly from the csv file.
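A sketch of building the SETS block back from rows of (partition, partition_sites, locus, genome); the exact layout and the `_genome` suffix are illustrative:

```python
from collections import OrderedDict

def sets_block_from_rows(rows):
    """Build a nexus SETS block from csv-style rows of
    (partition, partition_sites, locus, genome)."""
    charsets, loci, genomes = [], OrderedDict(), OrderedDict()
    for partition, sites, locus, genome in rows:
        charsets.append("\tCHARSET {} = {};".format(partition, sites))
        loci.setdefault(locus, []).append(partition)
        genomes.setdefault(genome, []).append(locus)
    lines = ["begin SETS;", "", "\t[partitions]"] + charsets + ["", "\t[loci]"]
    for locus, parts in loci.items():
        lines.append("\tCHARPARTITION {} = {};".format(
            locus, ", ".join("{}:{}".format(i + 1, p) for i, p in enumerate(parts))))
    lines.append("\tCHARPARTITION loci = {};".format(
        ", ".join("{}:{}".format(i + 1, l) for i, l in enumerate(loci))))
    lines += ["", "\t[genomes]"]
    for genome, gloci in genomes.items():
        uniq = list(OrderedDict.fromkeys(gloci))  # each locus listed once
        lines.append("\tCHARPARTITION {}_genome = {};".format(
            genome, ", ".join("{}:{}".format(i + 1, l) for i, l in enumerate(uniq))))
    lines += ["", "end;"]
    return "\n".join(lines)
```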

change charset to charpartition for genomes

Right now when I run one of the alignments in IQ-TREE it loads all the charsets:

iqtree2 -s alignment.nex -p alignment.nex --redo

Loading 5 partitions...
Subset  Type    Seqs    Sites   Infor   Invar   Model   Name
1               145     531     54      425     1       COI_1stpos
2               145     531     3       502     2       COI_2ndpos
3               145     530     303     109     3       COI_3rdpos
4               145     1445    283     924     4       16S
5               145     3037    643     1960    1       mitochondrial_genome
Degree of missing data: 0.000
Info: multi-threading strategy over partitions

So instead of this:

begin SETS;

	[loci]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

    CHARPARTITION loci = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos, 4:16S;

	[genomes]
	CHARSET	mitochondrial_genome = 1-3037;

    CHARPARTITION genomes = 1:mitochondrial_genome;

end;

we should have this:

begin SETS;

	[loci]
	CHARSET	COI_1stpos = 1-1592\3;
	CHARSET	COI_2ndpos = 2-1592\3;
	CHARSET	COI_3rdpos = 3-1592\3;
	CHARSET	16S = 1593-3037;

    CHARPARTITION loci = 1:COI_1stpos, 2:COI_2ndpos, 3:COI_3rdpos, 4:16S;

	[genomes]
	CHARPARTITION	mitochondrial_genome = 1-3037;

    CHARPARTITION genomes = 1:mitochondrial_genome;

end;
