BenchmarkAlignments

What's this?

A curated repository of DNA and amino acid alignments with comprehensive metadata.

What's it for?

To test, verify, benchmark, and compare software and methods in phylogenetics.

How many datasets, how big?

Check out the summary.csv file in this repository. It has most of the data you might want on all of the alignments in the database. You can view it directly on GitHub by clicking on it above.
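
If you would rather inspect summary.csv programmatically, a minimal Python sketch along these lines lists its columns and counts its rows. It assumes only that the file is a comma-separated file with a header row, encoded as UTF-8; no particular column names are assumed, and the function name describe_summary is purely illustrative.

```python
import csv
from pathlib import Path


def describe_summary(csv_path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Count data rows; the header row is consumed as the field names.
        row_count = sum(1 for _ in reader)
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    path = Path("summary.csv")  # assumed: run from the repository root
    if path.exists():
        columns, row_count = describe_summary(path)
        print(f"{row_count} rows; columns: {', '.join(columns)}")
```

Printing the column names first is a cheap way to find out what metadata is available before writing any filter against it.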

As of April 2019, the database contains:

  • 67 datasets
  • 66,932 partitions
  • 31,971,550 alignment columns
  • 2,623,278,645 total alignment matrix cells

Getting the data

The data are stored on figShare, and the total download size is roughly 1 GB.

You can download it by hand by clicking 'Download All' at this link: https://figshare.com/s/622e9e0a156e5233944b, and then extracting each of the .tar.gz files yourself.

If you are comfortable with the command line, you can do it like this:

# make the directory to keep it in
mkdir BenchmarkAlignments
cd BenchmarkAlignments

# download and unzip the data
# (the URL is quoted so the shell does not interpret the '?'; -L follows any redirects)
curl -L "https://ndownloader.figshare.com/articles/7092356?private_link=622e9e0a156e5233944b" -o BenchmarkAlignments.zip
unzip BenchmarkAlignments.zip

# unpack the .tar.gz files
find . -name "*.tar.gz" -exec tar xzf {} \;

# clean up
rm *.tar.gz
rm *.zip

You should now have a series of folders named e.g. Anderson_2013. This is the database.

What's in each folder?

Inside each folder is:

  1. README.yaml: a YAML file with metadata on the alignment, including but not limited to the license, DOIs for the original study and the dataset, and notes on the dataset itself.

  2. alignment.nex: a Nexus file containing the sequence alignment in non-interleaved format, plus a SETS block with information on partitions, genomes, and outgroups.

  3. alignment.nex-summary.txt: summary stats on the whole alignment generated by AMAS.

  4. alignment.nex-seq-summary.txt: summary stats on each sequence in the alignment generated by AMAS.
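
After unpacking, it can be worth checking that every dataset folder actually contains these four files. The following is a minimal sketch, not part of the repository: the function name find_incomplete_datasets is hypothetical, and it assumes that every non-hidden subdirectory of the database directory is a dataset folder.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = (
    "README.yaml",
    "alignment.nex",
    "alignment.nex-summary.txt",
    "alignment.nex-seq-summary.txt",
)


def find_incomplete_datasets(database_dir):
    """Map each dataset folder name to the expected files it is missing."""
    missing = {}
    for folder in sorted(Path(database_dir).iterdir()):
        # Skip plain files and hidden directories such as .git.
        if not folder.is_dir() or folder.name.startswith("."):
            continue
        absent = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
        if absent:
            missing[folder.name] = absent
    return missing


if __name__ == "__main__":
    # Assumed: run from inside the BenchmarkAlignments directory.
    for dataset, absent in find_incomplete_datasets(".").items():
        print(f"{dataset}: missing {', '.join(absent)}")
```

An empty result means every folder has the full set of files; anything printed points to a download or extraction that did not complete.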

Can I use these datasets?

Yes. All of the original datasets are publicly available and can be re-used. The datasets themselves are all released under a CC0 or CC-BY license.

Everything here and on figShare that is not a dataset (e.g. the summary.csv file, README.yaml, the code here) is released under a CC-BY license.

Attribution

If you use any of the datasets, please make sure to reference three things:

  1. The original study (the full reference and DOI are provided in README.yaml and in summary.csv)

  2. The dataset itself (the DOI is provided in README.yaml and in summary.csv)

  3. This repository (github.com/roblanf/BenchmarkAlignments)

This is essential to reward and acknowledge those who spend weeks and months in the field, laboriously chasing frogs/flies/lizards etc., then are kind enough to share their data with the world so that people like me (and you, if you're reading this) can re-use them for other things.

I want individual loci, not concatenated alignments

Depending on what you're doing, you might be more interested in single-locus alignments rather than concatenated multi-locus alignments. If this is the case, please use the script split_into_loci.py (in the utility_scripts folder), as follows:

python3 split_into_loci.py -i 'INPUT_FOLDER' -o 'OUTPUT_FOLDER'

Your output folder will now contain a series of single-locus alignments, with the name of the dataset prepended to the locus name (which is itself taken from the concatenated alignment file):

Anderson_2013_16S
Anderson_2013_COI
Seago_2011_28S
Seago_2011_COI
Seago_2011_COII
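
Because the dataset name is prepended to each locus name, the output can be grouped back by dataset. The sketch below is one hedged way to do that: group_loci_by_dataset is a hypothetical helper, and it assumes you pass in the dataset names yourself (for example, the folder names in the database) rather than guessing where the dataset name ends.

```python
from collections import defaultdict


def group_loci_by_dataset(locus_names, dataset_names):
    """Group names like 'Anderson_2013_16S' under their dataset prefix."""
    # Try longer dataset names first so a longer name wins over a shorter prefix of it.
    ordered = sorted(dataset_names, key=len, reverse=True)
    groups = defaultdict(list)
    for name in locus_names:
        for dataset in ordered:
            prefix = dataset + "_"
            if name.startswith(prefix):
                groups[dataset].append(name[len(prefix):])
                break
        # Names that match no known dataset are left out of the result.
    return dict(groups)


if __name__ == "__main__":
    names = ["Anderson_2013_16S", "Anderson_2013_COI", "Seago_2011_28S",
             "Seago_2011_COI", "Seago_2011_COII"]
    print(group_loci_by_dataset(names, ["Anderson_2013", "Seago_2011"]))
```

Matching against known dataset names avoids having to assume anything about underscores inside locus names.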

Note that this script will only work properly on alignments from this database, because all of the loci and other character sets in the nexus files are named with a consistent naming scheme upon which the script relies. This script will recursively search for alignments in the INPUT_FOLDER, and then output each locus to a new nexus file in the OUTPUT_FOLDER. There are two differences between the charsets in the nexus file and the single locus alignments in the OUTPUT_FOLDER:

  1. Protein coding loci in the original files are split into 1st, 2nd, and 3rd codon positions. These are concatenated into single alignments in the output folder (note that they will be concatenated in no particular order).

  2. Genome charsets (which are present in the original alignments) are ignored.
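
To see which charsets an alignment.nex file defines before splitting it, a rough sketch like the one below can help. It is not the logic used by split_into_loci.py: it simply assumes the standard Nexus statement form charset NAME = DEFINITION; with unquoted names, and it returns each definition as raw text without interpreting it. The example SETS block and its charset names are made up for illustration.

```python
import re

# Matches Nexus statements of the form: charset NAME = DEFINITION;
CHARSET_PATTERN = re.compile(
    r"^\s*charset\s+(\S+)\s*=\s*([^;]+);", re.IGNORECASE | re.MULTILINE
)


def list_charsets(nexus_text):
    """Return (name, definition) pairs for every charset statement found."""
    return [
        (name, definition.strip())
        for name, definition in CHARSET_PATTERN.findall(nexus_text)
    ]


if __name__ == "__main__":
    # Hypothetical SETS block for illustration; not copied from any dataset.
    example = """
    begin sets;
        charset geneA_pos1 = 1-300\\3;
        charset geneB = 301-500;
    end;
    """
    for name, definition in list_charsets(example):
        print(name, "->", definition)
```

To apply it to a real file, read the file contents first, for example with pathlib.Path("alignment.nex").read_text(), and pass the resulting string to list_charsets.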

One thing to note is that the meaning of a 'locus' differs somewhat (and unavoidably) between datasets. For example, in many datasets a 'locus' corresponds to a single transcript (i.e. multiple exons). But in other datasets, a 'locus' corresponds to a single exon. You can find more information on all of this by studying the alignments themselves, and by reading the original papers describing each study. The reference and DOI for each original study are given in the README.yaml file in each folder, and in the summary.csv file in this repository.

I have something to say

If you find errors, bugs, or have suggested datasets or features, please leave suggestions on the issue tracker here: https://github.com/roblanf/BenchmarkAlignments/issues
