jVCFparser

A command-line parser for VCF files designed for population genetics analyses.

This is a beta version of the jVCFparser command-line tool for processing Variant Call Format (VCF) files. Rather than holding full genotype records in memory, the tool reduces each file to allele and genotype frequencies and then performs population genetics calculations on those summaries. Because only these frequencies are stored, it can process large files. Reading a file may take some time, but the calculations that follow are fast. The tool has been tested on VCF versions 4.0 and 4.2.

VCF 4.2 specification (version dated 08.23.2022): the manual is available on the SAMtools site.

Date of last modification: 04.04.2023

Usage (example)

Get the JAR artifact HERE!

$ java -jar jVCFparser.jar -f ".\populations.snps.vcf" -mg

Flag	Long flag	Description
`-mg`	`-missg`	Missing genotype counts
`-ra`	`-refa`	REF allele counts
`-aa`	`-alta`	ALT allele counts
`-gc`	`-gcounts`	Genotype counts
`-dgc`	`-diffgcounts`	Different genotype counts
`-da`	`-dacounts`	Different allele counts
`-ea`	`-eacounts`	Effective allele counts
`-het`	`-hetcounts`	Heterozygote counts
`-hom`	`-homcounts`	Homozygote counts
`-oh`	`-obshet`	Average Observed Heterozygosity (Ho)
`-eh`	`-exphet`	Average Expected Heterozygosity (He)
`-ueh`	`-uexphet`	Average Unbiased Expected Heterozygosity (uHe)
`-sh`	`-shann`	Average Shannon's Information Index (H)
`-si`	`-simp`	Average Simpson's Diversity Index (D)
`-fx`	`-fix`	Average Fixation Index (F)
`-ar`	`-arich`	Average Allelic Richness (Ar)

Requirements:
GNU/Linux, Microsoft Windows, or macOS
Java 11 or later (JRE or JDK)

Sample data used for testing:
~25K SNP (loci) and 180 sample matrix: Sessile oak SNP dataset; de novo assembly; File size: ~15MB
~50K SNP (loci) and ~20K sample matrix: SoySNP50K iSelect BeadChip, Wm82.a1; File size: ~3.5GB
~1.3M SNP (loci) and ~2K sample matrix: 1000 genomes project, Phase 3, Chromosome 21; File size: ~10GB

The following allele and genotype counts are collected while reading a file:

Number of reference allele '0'
Number of alternative allele '1'
Number of unique alleles
Number of homozygote genotypes (e.g. 0/0 or 1/1)
Number of heterozygote genotypes (e.g. 0/1 or 1/0)
Number of missing genotypes (e.g. ./.)
Number of unique genotypes
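
The counts above can be sketched for a single locus as follows. This is a minimal illustration, not the tool's actual code: the class `LocusTally` and its fields are hypothetical names, and it assumes diploid GT strings, treats a genotype with any missing allele as missing, lumps every non-`0` allele into the ALT count, and treats `0/1` and `1/0` as the same genotype.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of jVCFparser): per-locus tallies from GT strings. */
final class LocusTally {
    int refAlleles, altAlleles, homozygotes, heterozygotes, missing;
    final Set<String> uniqueAlleles = new HashSet<>();
    final Set<String> uniqueGenotypes = new HashSet<>();

    /** Adds one diploid genotype such as "0/1", "1|1" or "./.". */
    void add(String gt) {
        String[] a = gt.split("[/|]");
        // Assumption: anything that is not two called alleles counts as missing.
        if (a.length != 2 || a[0].equals(".") || a[1].equals(".")) {
            missing++;
            return;
        }
        for (String allele : a) {
            // Assumption: every non-"0" allele is counted as ALT.
            if (allele.equals("0")) refAlleles++; else altAlleles++;
            uniqueAlleles.add(allele);
        }
        if (a[0].equals(a[1])) homozygotes++; else heterozygotes++;
        // Store the genotype in sorted order so that 0/1 and 1/0 coincide.
        boolean ordered = a[0].compareTo(a[1]) <= 0;
        uniqueGenotypes.add(ordered ? a[0] + "/" + a[1] : a[1] + "/" + a[0]);
    }

    /** Builds a tally from the genotypes of all samples at one locus. */
    static LocusTally of(String... genotypes) {
        LocusTally t = new LocusTally();
        for (String g : genotypes) t.add(g);
        return t;
    }

    public static void main(String[] args) {
        LocusTally t = LocusTally.of("0/0", "0/1", "1/0", "1|1", "./.");
        System.out.println("REF=" + t.refAlleles + " ALT=" + t.altAlleles
                + " hom=" + t.homozygotes + " het=" + t.heterozygotes
                + " missing=" + t.missing);
    }
}
```

Because only these small counters are kept per locus, memory use does not grow with the number of samples.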

The currently implemented diversity-related descriptive statistics, calculations, and counts are as follows:

Number of missing genotypes
Number of REF alleles
Number of ALT alleles
Number of genotypes
Number of heterozygotes
Number of homozygotes
Average number of different genotypes (Ng)
Average number of different alleles (Na)
Average number of effective alleles (Ne)
Average Observed Heterozygosity (Ho)
Average Expected Heterozygosity (He)
Average Unbiased Expected Heterozygosity (uHe)
Average Shannon's Information Index (H)
Average Simpson's Diversity Index (D)
Average Fixation Index (F)
Average Allelic Richness (Ar)

Calculation details [Formulas used in calculations and their references.]

Average number of different genotypes (Ng):

N_g represents the mean number of distinct genotypes across n loci, where g_i is the number of distinct genotypes at locus i, for i = 1,2,...,n. It is computed as the sum of the g_i, divided by the total number of loci.
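
Written out, the description above corresponds to the following expression, given here only as a reading of the text:

$$N_g = \frac{1}{n}\sum_{i=1}^{n} g_i$$

For example, three loci with 2, 3, and 1 distinct genotypes give N_g = (2 + 3 + 1) / 3 = 2.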

Average number of different alleles (Na):

N_a represents the mean number of different alleles across n loci, where a_i is the number of different alleles at locus i, for i = 1,2,...,n. It is computed as the sum of the a_i, divided by the total number of loci.
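
A minimal sketch of this average, assuming per-locus allele counts are already available. The class and method names are hypothetical and not taken from the tool.

```java
/** Hypothetical helper (not part of jVCFparser): mean number of different alleles (Na). */
final class DifferentAlleles {

    /**
     * counts[i][k] holds how many copies of allele k were observed at locus i.
     * An allele counts as present at a locus when its count is above zero.
     */
    static double meanDifferentAlleles(int[][] counts) {
        double sum = 0.0;
        for (int[] locus : counts) {
            int distinct = 0;
            for (int c : locus) {
                if (c > 0) distinct++;
            }
            sum += distinct;
        }
        return sum / counts.length;
    }

    public static void main(String[] args) {
        // Three loci: two biallelic, one fixed for the REF allele.
        int[][] counts = { {10, 6}, {16, 0}, {4, 12} };
        System.out.println(meanDifferentAlleles(counts));
    }
}
```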

Average number of effective alleles (Ne):

N_e represents the mean number of effective alleles across n genetic loci, where p_i denotes the allele frequencies at locus i, for i = 1,2,...,n. At each locus it is calculated as the inverse of the sum of squared allele frequencies; these per-locus values are then summed and divided by the total number of loci. Based on Brown & Weir (1983).
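
The usual form of the effective number of alleles at one locus is 1 / Σ p², where the sum runs over the allele frequencies at that locus. The sketch below applies that form and averages it over loci. The names are hypothetical, and it assumes every locus has at least one called allele.

```java
/** Hypothetical helper (not part of jVCFparser): effective number of alleles (Ne). */
final class EffectiveAlleles {

    /** 1 / (sum of squared allele frequencies) for one locus, from raw allele counts. */
    static double locusNe(int[] alleleCounts) {
        long total = 0;
        for (int c : alleleCounts) total += c;
        double sumSquares = 0.0;
        for (int c : alleleCounts) {
            double p = (double) c / total;
            sumSquares += p * p;
        }
        return 1.0 / sumSquares;
    }

    /** Arithmetic mean of the per-locus values. */
    static double meanNe(int[][] loci) {
        double sum = 0.0;
        for (int[] locus : loci) sum += locusNe(locus);
        return sum / loci.length;
    }

    public static void main(String[] args) {
        // Equal frequencies give Ne = 2; a fixed locus gives Ne = 1.
        System.out.println(meanNe(new int[][] { {8, 8}, {16, 0} }));
    }
}
```

With two equally frequent alleles the value reaches its maximum of 2 for a biallelic locus, and it falls towards 1 as one allele approaches fixation.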

Average Observed Heterozygosity (Ho):

H_o represents the average proportion of heterozygous individuals across n genetic loci. For each locus i = 1,2,...,n, the proportion of heterozygous individuals is computed as the ratio of the number of heterozygotes to the total number of individuals N. Observed heterozygosity is then calculated as the mean of these proportions across all n loci. Based on Hartl & Clark (1997).
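
As a formula, with h_i standing for the number of heterozygotes at locus i (a symbol introduced here for illustration):

$$H_o = \frac{1}{n}\sum_{i=1}^{n}\frac{h_i}{N}$$

For example, with N = 180 individuals and two loci carrying 45 and 90 heterozygotes, the per-locus proportions are 0.25 and 0.5, so H_o = 0.375.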

Average Expected Heterozygosity (He):

H_e represents the mean probability that two randomly chosen alleles at a given locus are different, across n genetic loci, where p_i and q_i are the allele frequencies at locus i, for i = 1,2,...,n. It is calculated as the average of 1 minus the sum of squared allele frequencies. Based on the intralocus gene diversity (H = 1-p²-q²) derived from the Hardy-Weinberg equilibrium.
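
A minimal sketch of this calculation from raw allele counts. The class and method names are hypothetical, and each locus is assumed to have at least one called allele.

```java
/** Hypothetical helper (not part of jVCFparser): expected heterozygosity (He). */
final class ExpectedHeterozygosity {

    /** 1 - (sum of squared allele frequencies) for one locus. */
    static double locusHe(int[] alleleCounts) {
        long total = 0;
        for (int c : alleleCounts) total += c;
        double sumSquares = 0.0;
        for (int c : alleleCounts) {
            double p = (double) c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Arithmetic mean of the per-locus values. */
    static double meanHe(int[][] loci) {
        double sum = 0.0;
        for (int[] locus : loci) sum += locusHe(locus);
        return sum / loci.length;
    }

    public static void main(String[] args) {
        // p = 0.75, q = 0.25 at the first locus; p = q = 0.5 at the second.
        System.out.println(meanHe(new int[][] { {12, 4}, {8, 8} }));
    }
}
```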

Average Unbiased Expected Heterozygosity (uHe):

uH_e represents the mean probability that two randomly chosen alleles at a given locus are different, across n genetic loci, corrected for the bias introduced by a finite sample size. It is calculated as the average of 1 minus the sum of squared allele frequencies, with the sample-size correction applied at each locus. Based on Peakall & Smouse (2006).
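
In GenAlEx (Peakall & Smouse, 2006) the correction is the factor 2N / (2N - 1), where N is the number of sampled individuals. Assuming the tool follows that definition, and writing N_i for the sample size and p_ik for the frequency of allele k at locus i (symbols introduced here), the average reads:

$$uH_e = \frac{1}{n}\sum_{i=1}^{n}\frac{2N_i}{2N_i - 1}\left(1 - \sum_{k} p_{ik}^{2}\right)$$

For example, a locus with N_i = 8 individuals and allele frequencies 0.75 and 0.25 has an uncorrected value of 0.375 and a corrected value of (16 / 15) × 0.375 = 0.4.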

Average Shannon's Information Index (H):

H represents the Average Shannon's Information Index, defined as the average amount of uncertainty associated with predicting the identity of a randomly chosen allele at a given locus, across n loci. At each locus it is calculated as the negative sum, over alleles, of the product of the allele frequency p_i and its natural logarithm; these per-locus values are then averaged. Based on Brown & Weir (1983).
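
A minimal sketch of the index, using the standard form -Σ p ln p at each locus and then averaging. The names are hypothetical; alleles with zero count are skipped, since p ln p tends to 0 as p tends to 0.

```java
/** Hypothetical helper (not part of jVCFparser): Shannon's Information Index (H). */
final class ShannonIndex {

    /** -sum(p * ln p) over the alleles observed at one locus. */
    static double locusShannon(int[] alleleCounts) {
        long total = 0;
        for (int c : alleleCounts) total += c;
        double h = 0.0;
        for (int c : alleleCounts) {
            if (c == 0) continue; // absent alleles contribute nothing
            double p = (double) c / total;
            h -= p * Math.log(p);
        }
        return h;
    }

    /** Arithmetic mean of the per-locus values. */
    static double meanShannon(int[][] loci) {
        double sum = 0.0;
        for (int[] locus : loci) sum += locusShannon(locus);
        return sum / loci.length;
    }

    public static void main(String[] args) {
        // A locus with two equally frequent alleles has H = ln 2; a fixed locus has H = 0.
        System.out.println(meanShannon(new int[][] { {8, 8}, {16, 0} }));
    }
}
```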

Average Simpson's Diversity Index (D):

D represents the Average Simpson's Diversity Index, defined as the probability that two randomly chosen alleles at a given locus are different, across n loci. It is calculated as 1 minus the average of the sum of squared allele frequencies. Based on Simpson (1949) and Morris et al. (2014).
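
Read literally, the calculation described above is the following, with p_ik standing for the frequency of allele k at locus i (a symbol introduced here):

$$D = 1 - \frac{1}{n}\sum_{i=1}^{n}\sum_{k} p_{ik}^{2}$$

For example, two loci with allele frequencies (0.5, 0.5) and (0.75, 0.25) have sums of squares 0.5 and 0.625, whose average is 0.5625, so D = 0.4375.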

Average Fixation Index (F):

F represents the Average Fixation Index, averaged across n loci. At each locus it is calculated as the difference between expected heterozygosity (He) and observed heterozygosity (Ho), divided by expected heterozygosity, that is F = (He - Ho) / He; the per-locus values are then averaged across n loci. Based on Hartl & Clark (1997).
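
A minimal sketch of the per-locus index F = (He - Ho) / He and its average. The names are hypothetical. Skipping loci with He = 0, where the ratio is undefined, is an assumption made for this example and not a documented behaviour of the tool.

```java
/** Hypothetical helper (not part of jVCFparser): average Fixation Index (F). */
final class FixationIndex {

    /**
     * Mean of (He_i - Ho_i) / He_i over all loci with He_i > 0.
     * Returns NaN when no locus is polymorphic.
     */
    static double meanF(double[] ho, double[] he) {
        if (ho.length != he.length) {
            throw new IllegalArgumentException("ho and he must have the same length");
        }
        double sum = 0.0;
        int used = 0;
        for (int i = 0; i < he.length; i++) {
            if (he[i] > 0.0) { // assumption: monomorphic loci are skipped
                sum += (he[i] - ho[i]) / he[i];
                used++;
            }
        }
        return used == 0 ? Double.NaN : sum / used;
    }

    public static void main(String[] args) {
        // First locus shows a heterozygote deficit, second matches expectation.
        System.out.println(meanF(new double[] {0.25, 0.5}, new double[] {0.5, 0.5}));
    }
}
```

Positive values indicate fewer heterozygotes than expected, and negative values indicate an excess.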

Average Allelic Richness (Ar):

Ar represents the Average Allelic Richness, defined as the expected number of distinct alleles in a sample of n genes selected at random from a collection of N genes at a locus, averaged across S loci. This is the rarefaction measure that Hurlbert (1971) defined for species counts, applied to alleles. It is calculated as the number of alleles observed in a sample of size N_i, normalized by the sample size N_i and averaged across S loci. Based on Hurlbert (1971) and El Mousadik & Petit (1996). NOTE: Not designed and not suitable for big data!
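
For reference, the rarefaction formula from Hurlbert (1971), as applied to alleles by El Mousadik & Petit (1996), gives the expected number of distinct alleles at one locus when n genes are drawn from N, where N_k is the number of copies of allele k. The tool's implementation may differ in detail from this textbook form:

$$A_r = \sum_{k}\left[1 - \frac{\binom{N - N_k}{n}}{\binom{N}{n}}\right]$$

For example, with N = 4 genes, allele counts (3, 1) and n = 2, the first allele contributes 1 - 0/6 = 1 and the second 1 - 3/6 = 0.5, so A_r = 1.5. The binomial coefficients grow quickly with N, which is consistent with the note that this statistic is not suited to very large datasets.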

References

Brown, A. H., & Weir, B. S. (1983). Measuring genetic variability in plant populations. Isozymes in plant genetics and breeding, part A, 219-239.

El Mousadik, A., & Petit, R. J. (1996). High level of genetic differentiation for allelic richness among populations of the argan tree [Argania spinosa (L.) Skeels] endemic to Morocco. Theoretical and applied genetics, 92, 832-839.

Hartl, D. L., & Clark, A. G. (1997). Principles of population genetics (Vol. 116). Sunderland: Sinauer associates.

Hurlbert, S. H. (1971). The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4), 577-586.

Morris, E. K., Caruso, T., Buscot, F., Fischer, M., Hancock, C., Maier, T. S., ... & Rillig, M. C. (2014). Choosing and using diversity indices: insights for ecological applications from the German Biodiversity Exploratories. Ecology and evolution, 4(18), 3514-3524.

Peakall, R., & Smouse, P. E. (2006). GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular ecology notes, 6(1), 288-295.

Simpson, E. H. (1949). Measurement of diversity. Nature, 163(4148), 688-688.
