
jVCFparser

A command-line parser for VCF files designed for population genetics analyses.


This is a beta version of jVCFparser, a command-line tool for processing variant call format (VCF) files. While reading a VCF file, the tool keeps only memory-efficient descriptive statistics (allele and genotype counts) and then performs population genetics calculations on them. Because it stores only allele and genotype frequencies, it can process large files. Reading a file may take some time, but the subsequent calculations are fast. The tool has been tested on VCF versions 4.0 and 4.2.

VCF 4.2 description (08.23.2022): the manual is available on the SAMtools site.

Date of last modification: 04.04.2023

Usage (example)

Get the JAR artifact HERE!

$ java -jar jVCFparser.jar -f ".\populations.snps.vcf" -mg 


| Flag | Long flag | Description |
|------|-----------|-------------|
| -mg | -missg | Missing genotype counts |
| -ra | -refa | REF allele counts |
| -aa | -alta | ALT allele counts |
| -gc | -gcounts | Genotype counts |
| -dgc | -diffgcounts | Different genotype counts |
| -da | -dacounts | Different allele counts |
| -ea | -eacounts | Effective allele counts |
| -het | -hetcounts | Heterozygote counts |
| -hom | -homcounts | Homozygote counts |
| -oh | -obshet | Average Observed Heterozygosity (Ho) |
| -eh | -exphet | Average Expected Heterozygosity (He) |
| -ueh | -uexphet | Average Unbiased Expected Heterozygosity (uHe) |
| -sh | -shann | Average Shannon's Information Index (H) |
| -si | -simp | Average Simpson's Diversity Index (D) |
| -fx | -fix | Average Fixation Index (F) |
| -ar | -arich | Average Allelic Richness (Ar) |

Requirements:
GNU/Linux, Microsoft Windows, or macOS
Java 11 or later (JRE or JDK)

Sample data used for testing:
~25K SNP (loci) and 180 sample matrix: Sessile oak SNP dataset; de novo assembly; File size: ~15MB
~50K SNP (loci) and ~20K sample matrix: SoySNP50K iSelect BeadChip, Wm82.a1; File size: ~3.5GB
~1.3M SNP (loci) and ~2K sample matrix: 1000 genomes project, Phase 3, Chromosome 21; File size: ~10GB

The following allele and genotype counts are collected while reading a file:

  • Number of reference alleles ('0')
  • Number of alternative alleles ('1')
  • Number of unique alleles
  • Number of homozygote genotypes (e.g. 0/0 or 1/1)
  • Number of heterozygote genotypes (e.g. 0/1 or 1/0)
  • Number of missing genotypes (e.g. ./.)
  • Number of unique genotypes
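
The counts listed above could be tallied for a single locus roughly as in the following sketch. It is not taken from the jVCFparser source: the class `LocusCounts` and its fields are hypothetical, and it assumes diploid calls where only the GT subfield (e.g. `0/1`, `1|1`, `./.`) is passed in. Treating a partially missing call such as `./1` as fully missing is also an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-locus tally; not the actual jVCFparser implementation.
class LocusCounts {
    int refAlleles, altAlleles, homozygotes, heterozygotes, missing;
    final Set<String> uniqueAlleles = new HashSet<>();
    final Set<String> uniqueGenotypes = new HashSet<>();

    // Adds one diploid GT value such as "0/1", "1|1" or "./." to the tally.
    void add(String gt) {
        String[] a = gt.split("[/|]"); // unphased '/' or phased '|' separator
        if (a.length != 2 || a[0].equals(".") || a[1].equals(".")) {
            missing++; // assumption: any missing allele makes the call missing
            return;
        }
        for (String allele : a) {
            if (allele.equals("0")) refAlleles++;
            else if (allele.equals("1")) altAlleles++;
            uniqueAlleles.add(allele);
        }
        if (a[0].equals(a[1])) homozygotes++; else heterozygotes++;
        // Treat 0/1 and 1/0 as the same unordered genotype.
        String key = a[0].compareTo(a[1]) <= 0 ? a[0] + "/" + a[1] : a[1] + "/" + a[0];
        uniqueGenotypes.add(key);
    }

    public static void main(String[] args) {
        LocusCounts c = new LocusCounts();
        for (String gt : new String[] {"0/0", "0/1", "1|0", "1/1", "./."}) c.add(gt);
        // prints: REF=4 ALT=4 hom=2 het=2 missing=1 genotypes=3
        System.out.println("REF=" + c.refAlleles + " ALT=" + c.altAlleles
                + " hom=" + c.homozygotes + " het=" + c.heterozygotes
                + " missing=" + c.missing + " genotypes=" + c.uniqueGenotypes.size());
    }
}
```

Keeping only these few integers and two small sets per locus, rather than every sample's genotype, is what allows memory use to stay independent of the number of samples.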

The currently implemented diversity-related descriptive statistics, calculations, and counts are as follows:

  • Number of missing genotypes
  • Number of REF alleles
  • Number of ALT alleles
  • Number of genotypes
  • Number of heterozygotes
  • Number of homozygotes
  • Average number of different genotypes (Ng)
  • Average number of different alleles (Na)
  • Average number of effective alleles (Ne)
  • Average Observed Heterozygosity (Ho)
  • Average Expected Heterozygosity (He)
  • Average Unbiased Expected Heterozygosity (uHe)
  • Average Shannon's Information Index (H)
  • Average Simpson's Diversity Index (D)
  • Average Fixation Index (F)
  • Average Allelic Richness (Ar)

Calculation details (formulas used in the calculations and their references)

Average number of different genotypes (Ng):

$$N_g = \frac{1}{n}\sum_{i=1}^{n} g_i$$
Ng is the mean number of distinct genotypes across n loci, where g_i is the number of distinct genotypes at locus i (i = 1,2,...,n). It is computed as the sum of the distinct genotype counts over all loci, divided by the total number of loci.

Average number of different alleles (Na):

$$N_a = \frac{1}{n}\sum_{i=1}^{n} a_i$$
Na is the mean number of different alleles across n loci, where a_i is the number of different alleles at locus i (i = 1,2,...,n). It is computed as the sum of the different allele counts over all loci, divided by the total number of loci.

Average number of effective alleles (Ne):

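$$N_e = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\sum_{j} p_{ij}^{2}}$$
Ne is the mean number of effective alleles across n loci, where p_ij is the frequency of allele j at locus i (i = 1,2,...,n). For each locus it is calculated as the inverse of the sum of squared allele frequencies; the per-locus values are then summed and divided by the total number of loci. Based on Brown & Weir (1983).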
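
As a minimal sketch, the effective number of alleles can be computed as below, assuming the usual definition of 1 divided by the sum of squared allele frequencies at each locus. The class and method names are illustrative and are not taken from the jVCFparser source.

```java
// Illustrative helper; names are hypothetical, not from the jVCFparser source.
class EffectiveAlleles {
    // Effective number of alleles at one locus: 1 / sum of squared allele frequencies.
    static double perLocus(double[] freqs) {
        double sumSq = 0.0;
        for (double p : freqs) sumSq += p * p;
        return 1.0 / sumSq;
    }

    // Mean of the per-locus values across all loci.
    static double average(double[][] loci) {
        double total = 0.0;
        for (double[] freqs : loci) total += perLocus(freqs);
        return total / loci.length;
    }

    public static void main(String[] args) {
        // Locus 1: two equally frequent alleles (Ne = 2); locus 2: fixed (Ne = 1).
        double[][] loci = { {0.5, 0.5}, {1.0, 0.0} };
        System.out.println(average(loci)); // prints 1.5
    }
}
```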

Average Observed Heterozygosity (Ho):

$$H_o = \frac{1}{n}\sum_{i=1}^{n}\frac{h_i}{N}$$
Ho is the average proportion of heterozygous individuals across n loci. For each locus i = 1,2,...,n, the proportion is the number of heterozygotes h_i divided by the total number of individuals N. Observed heterozygosity is then the mean of these proportions across all n loci. Based on Hartl & Clark (1997).
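
The averaging described above can be sketched from per-locus heterozygote counts as follows. The names are hypothetical, and the sketch assumes the same number of individuals N at every locus, as in the description; it is not the tool's own code.

```java
// Illustrative helper; names are hypothetical, not from the jVCFparser source.
class ObservedHeterozygosity {
    // hetCounts[i] is the number of heterozygotes at locus i;
    // individuals is the total number of individuals N.
    static double average(int[] hetCounts, int individuals) {
        double total = 0.0;
        for (int h : hetCounts) total += (double) h / individuals;
        return total / hetCounts.length;
    }

    public static void main(String[] args) {
        int[] hetCounts = {2, 5, 0}; // three loci
        System.out.println(average(hetCounts, 10)); // mean of 0.2, 0.5 and 0.0
    }
}
```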

Average Expected Heterozygosity (He):

$$H_e = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \sum_{j} p_{ij}^{2}\right)$$
He is the mean probability, across n loci, that two randomly chosen alleles at a locus are different, where p_ij is the frequency of allele j at locus i (i = 1,2,...,n). It is calculated as the average of 1 minus the sum of squared allele frequencies; for a biallelic locus with allele frequencies p_i and q_i this is 1 - p_i^2 - q_i^2. Based on the intra-locus gene diversity (H = 1 - p^2 - q^2) derived from the Hardy-Weinberg equilibrium.

Average Unbiased Expected Heterozygosity (uHe):

$$uH_e = \frac{1}{n}\sum_{i=1}^{n}\frac{2N}{2N-1}\left(1 - \sum_{j} p_{ij}^{2}\right)$$
uHe is the mean probability, across n loci, that two randomly chosen alleles at a locus are different, adjusted for sample size bias. It is calculated as the average of 1 minus the sum of squared allele frequencies p_ij, multiplied by the correction factor 2N/(2N - 1), where N is the number of individuals. Based on Peakall & Smouse (2006).
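
The two expected-heterozygosity estimators can be sketched per locus as below. The correction factor 2N/(2N - 1) is the one used in GenAlEx (Peakall & Smouse 2006); that jVCFparser applies exactly this factor is an assumption, and the class and method names are hypothetical.

```java
// Illustrative helper; names are hypothetical, not from the jVCFparser source.
class ExpectedHeterozygosity {
    // Expected heterozygosity at one locus: 1 - sum of squared allele frequencies.
    static double he(double[] freqs) {
        double sumSq = 0.0;
        for (double p : freqs) sumSq += p * p;
        return 1.0 - sumSq;
    }

    // Unbiased estimate at one locus, assuming the GenAlEx factor 2N / (2N - 1),
    // where individuals is the number of individuals N.
    static double unbiasedHe(double[] freqs, int individuals) {
        double twoN = 2.0 * individuals;
        return twoN / (twoN - 1.0) * he(freqs);
    }

    public static void main(String[] args) {
        double[] freqs = {0.5, 0.5};
        System.out.println("He  = " + he(freqs));
        System.out.println("uHe = " + unbiasedHe(freqs, 10));
    }
}
```

The averages reported by the tool would then be the means of these per-locus values across all loci. The correction matters most for small samples and approaches 1 as N grows.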

Average Shannon's Information Index (H):

$$H = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j} p_{ij}\ln p_{ij}$$
H is the Average Shannon's Information Index: the average uncertainty associated with predicting the identity of a randomly chosen allele at a locus, across n loci. For each locus it is the negative sum, over alleles, of the product of the allele frequency p_ij and its natural logarithm; the per-locus values are then averaged across loci. Based on Brown & Weir (1983).
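
A per-locus version of this index can be sketched as follows. The names are hypothetical, and skipping alleles with zero frequency (taking 0 · ln 0 as 0) is a convention assumed here rather than something stated in the source.

```java
// Illustrative helper; names are hypothetical, not from the jVCFparser source.
class ShannonIndex {
    // Shannon's information index at one locus: -sum(p * ln p) over alleles.
    static double perLocus(double[] freqs) {
        double h = 0.0;
        for (double p : freqs) {
            if (p > 0.0) h -= p * Math.log(p); // absent alleles contribute nothing
        }
        return h;
    }

    // Mean of the per-locus values across all loci.
    static double average(double[][] loci) {
        double total = 0.0;
        for (double[] freqs : loci) total += perLocus(freqs);
        return total / loci.length;
    }

    public static void main(String[] args) {
        // Two equally frequent alleles give ln(2); a fixed locus gives 0.
        double[][] loci = { {0.5, 0.5}, {1.0, 0.0} };
        System.out.println(average(loci));
    }
}
```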

Average Simpson's Diversity Index (D):

$$D = 1 - \frac{1}{n}\sum_{i=1}^{n}\sum_{j} p_{ij}^{2}$$
D is the Average Simpson's Diversity Index: the probability that two randomly chosen alleles at a locus are different, across n loci. It is calculated as 1 minus the average of the sum of squared allele frequencies p_ij. Based on Simpson (1949) and Morris et al. (2014).

Average Fixation Index (F):

$$F = \frac{1}{n}\sum_{i=1}^{n}\frac{H_{e,i} - H_{o,i}}{H_{e,i}}$$
F is the Fixation Index averaged across n loci. For each locus i it is the difference between expected heterozygosity (He) and observed heterozygosity (Ho), normalized by expected heterozygosity; the per-locus values are then averaged. Based on Hartl & Clark (1997).
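
A minimal sketch of the fixation index, written with the common sign convention F = (He - Ho) / He, is shown below. The names are hypothetical. Skipping monomorphic loci (He = 0), where the ratio is undefined, is an assumption made for this sketch and not necessarily what jVCFparser does.

```java
// Illustrative helper; names are hypothetical, not from the jVCFparser source.
class FixationIndex {
    // Fixation index at one locus: (He - Ho) / He.
    static double perLocus(double ho, double he) {
        return (he - ho) / he;
    }

    // Mean over loci, skipping loci with He == 0 (assumption of this sketch).
    static double average(double[] ho, double[] he) {
        double total = 0.0;
        int counted = 0;
        for (int i = 0; i < he.length; i++) {
            if (he[i] == 0.0) continue; // undefined for monomorphic loci
            total += perLocus(ho[i], he[i]);
            counted++;
        }
        return counted == 0 ? Double.NaN : total / counted;
    }

    public static void main(String[] args) {
        double[] ho = {0.3, 0.5, 0.0};
        double[] he = {0.5, 0.5, 0.0}; // third locus is monomorphic and skipped
        System.out.println(average(ho, he));
    }
}
```

Positive values indicate a deficit of heterozygotes relative to Hardy-Weinberg expectations, and negative values an excess.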

Average Allelic Richness (Ar):

$$A_r = \frac{1}{S}\sum_{i=1}^{S}\sum_{j}\left[1 - \binom{N_i - N_{ij}}{n}\Big/\binom{N_i}{n}\right]$$
Ar is the Average Allelic Richness: the expected number of different alleles in a sample of n alleles ("genes") drawn at random from the N_i genes observed at locus i, averaged across S loci. Here N_ij is the number of copies of allele j at locus i, so the observed number of alleles is standardized for the sample size N_i. Based on the rarefaction method of Hurlbert (1971) as applied to alleles by El Mousadik & Petit (1996). NOTE: Not designed and not suitable for big data!
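
For illustration, the rarefaction formula of Hurlbert (1971), as applied to alleles by El Mousadik & Petit (1996), can be sketched for a single locus as below. This shows the cited method and is not the tool's code: the names are hypothetical, and evaluating the binomial ratio as a running product (to avoid large factorials) is a choice made for this sketch.

```java
// Illustrative helper; names are hypothetical, not from the jVCFparser source.
class AllelicRichness {
    // counts[j] is the number of copies of allele j at the locus;
    // n is the rarefied sample size in genes (n <= total copies).
    static double rarefied(int[] counts, int n) {
        int total = 0;
        for (int c : counts) total += c;
        if (n > total) throw new IllegalArgumentException("n exceeds sample size");
        double richness = 0.0;
        for (int c : counts) {
            // ratio = C(total - c, n) / C(total, n), built up factor by factor
            double ratio = 1.0;
            for (int k = 0; k < n; k++) {
                int num = total - c - k;
                if (num <= 0) { ratio = 0.0; break; } // allele is certain to be drawn
                ratio *= (double) num / (total - k);
            }
            richness += 1.0 - ratio; // probability that allele j appears in the sample
        }
        return richness;
    }

    public static void main(String[] args) {
        int[] counts = {6, 4}; // 10 genes sampled, two alleles
        System.out.println(rarefied(counts, 2));
    }
}
```

The inner loop runs n times per allele per locus, which gives some intuition for why a rarefaction-based statistic becomes expensive on very large datasets.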

References
Brown, A. H., & Weir, B. S. (1983). Measuring genetic variability in plant populations. Isozymes in plant genetics and breeding, part A, 219-239.

El Mousadik, A., & Petit, R. J. (1996). High level of genetic differentiation for allelic richness among populations of the argan tree [Argania spinosa (L.) Skeels] endemic to Morocco. Theoretical and applied genetics, 92, 832-839.

Hartl, D. L., & Clark, A. G. (1997). Principles of population genetics (Vol. 116). Sunderland: Sinauer Associates.

Hurlbert, S. H. (1971). The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4), 577-586.

Morris, E. K., Caruso, T., Buscot, F., Fischer, M., Hancock, C., Maier, T. S., ... & Rillig, M. C. (2014). Choosing and using diversity indices: insights for ecological applications from the German Biodiversity Exploratories. Ecology and evolution, 4(18), 3514-3524.

Peakall, R., & Smouse, P. E. (2006). GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular ecology notes, 6(1), 288-295.

Simpson, E. H. (1949). Measurement of diversity. Nature, 163(4148), 688-688.
