
regens-analysis's Introduction

ATTENTION ❗

You are now in the regens-analysis repository. Click here if you want to go back to regens.

REGENS algorithm methods 🤖

REGENS repeats the following process for each chromosome. Each chromosome that REGENS simulates begins as a set of SNPs without genotypes, which breakpoints demarcate into segments. The user selects the number of breakpoints per chromosome as one of REGENS' input arguments, and that many breakpoint positions are then drawn from the empirical distribution of recombination event positions. This empirical distribution is computed via equation 2 in the REGENS manuscript, where P(Ri = 1) is computed for the ith recombination interval by feeding its recombination rate (computed from the recombination map) into Haldane's map function.

Once an empty chromosome is segmented by breakpoints, the row indices of whole-genome bed file rows from a real dataset are duplicated so that 1) there is one real individual for each empty segment and 2) every real individual is selected an equal number of times (minus 1 for each remainder sample if the number of segments is not divisible by the number of individuals). Then, for each empty segment, a whole chromosome is randomly selected without replacement from the set of autosomal genotypes that correspond to the duplicated indices, and the empty simulated segment is filled with the homologous segment from the sampled real chromosome. These steps are repeated for every empty simulated segment in every chromosome so that all of the empty simulated genomes are filled with real SNP values. This quasirandom selection of individuals minimizes minor allele frequency (MAF) variation between the simulated and real datasets, while randomizing segment selection maintains normal population-level genetic variability.
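
The segment-filling step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the REGENS source code: the function names `balanced_donor_pool` and `fill_chromosome` are hypothetical, genotypes are a toy in-memory array rather than a bed file, and the breakpoints in the demo are drawn uniformly as a placeholder instead of from the recombination-based distribution.

```python
import numpy as np

def balanced_donor_pool(n_individuals, n_segments, rng):
    """Return one donor row index per empty segment.

    Every real individual appears the same number of times, except that
    some appear once less when n_segments is not divisible by n_individuals.
    """
    full_copies, remainder = divmod(n_segments, n_individuals)
    pool = np.tile(np.arange(n_individuals), full_copies)
    # The remainder is covered by distinct, randomly chosen individuals.
    extra = rng.choice(n_individuals, size=remainder, replace=False)
    pool = np.concatenate([pool, extra])
    # Shuffling and then reading the pool in order is equivalent to
    # sampling from it without replacement.
    rng.shuffle(pool)
    return pool

def fill_chromosome(real_genotypes, breakpoints, donors):
    """Fill one empty simulated chromosome from real chromosomes.

    real_genotypes: (n_individuals, n_snps) array for one chromosome.
    breakpoints:    sorted SNP indices that split the chromosome into
                    len(breakpoints) + 1 segments.
    donors:         one donor row index per segment.
    """
    n_snps = real_genotypes.shape[1]
    edges = np.concatenate(([0], np.asarray(breakpoints, dtype=int), [n_snps]))
    simulated = np.empty(n_snps, dtype=real_genotypes.dtype)
    for start, stop, donor in zip(edges[:-1], edges[1:], donors):
        # Copy the homologous segment from the sampled real chromosome.
        simulated[start:stop] = real_genotypes[donor, start:stop]
    return simulated

# Toy demo (all sizes and values are made up for illustration).
rng = np.random.default_rng(0)
n_individuals, n_snps = 5, 12
real = rng.integers(0, 3, size=(n_individuals, n_snps))
n_simulated, n_breakpoints = 4, 2
n_segments = n_simulated * (n_breakpoints + 1)
donors = balanced_donor_pool(n_individuals, n_segments, rng).reshape(n_simulated, -1)
simulated = np.array([
    fill_chromosome(
        real,
        # Placeholder: uniform breakpoints, NOT the recombination-based draw.
        np.sort(rng.choice(np.arange(1, n_snps), size=n_breakpoints, replace=False)),
        donors[i],
    )
    for i in range(n_simulated)
])
print(simulated.shape)
```

Because the donor pool is balanced before it is shuffled, each real individual contributes nearly the same number of segments, which is the property the text credits with keeping simulated allele frequencies close to the real ones.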

About the recombination maps (input that we provided) 🦃

REGENS converts the recombination rate maps output by pyrho (which correspond one-to-one to the twenty-six 1000 Genomes populations) into probabilities of drawing each simulated breakpoint at a specific genomic location. It is also possible to simulate GWAS data from a custom plink (bed, bim, fam) fileset, a custom recombination rate map, or both. Note that recombination rate maps of populations within the same superpopulation (e.g. British and Italian) have Pearson correlation coefficients of roughly 0.9 (see figure 2B of the pyrho paper), so if a genotype dataset has no recombination rate map for its exact population, then the map for a closely related population should suffice.
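
As a rough illustration of that conversion, the sketch below turns per-interval recombination rates into breakpoint probabilities with Haldane's map function, r = 0.5 * (1 - exp(-2d)) for a genetic distance d in Morgans. Several things here are assumptions rather than facts about REGENS: the map is taken to be a table of (start, end, rate) intervals with the rate in Morgans per base pair, the per-interval probabilities are simply normalized to sum to one (the manuscript's equation 2 may differ), breakpoints are drawn with replacement, and all function names are hypothetical.

```python
import numpy as np

def haldane(genetic_distance_morgans):
    """Haldane's map function: recombination probability for a distance in Morgans."""
    return 0.5 * (1.0 - np.exp(-2.0 * np.asarray(genetic_distance_morgans, dtype=float)))

def breakpoint_probabilities(starts, ends, rates_per_bp):
    """Probability of placing a breakpoint in each recombination interval.

    Assumes rates are in Morgans per base pair, so that rate * length is
    the interval's genetic distance in Morgans.
    """
    lengths = np.asarray(ends, dtype=float) - np.asarray(starts, dtype=float)
    p_recomb = haldane(np.asarray(rates_per_bp, dtype=float) * lengths)
    total = p_recomb.sum()
    if total <= 0:
        raise ValueError("recombination map has no positive recombination rate")
    # Assumed normalization; see equation 2 of the manuscript for the exact form.
    return p_recomb / total

def draw_breakpoint_intervals(probabilities, n_breakpoints, rng):
    """Draw interval indices for the breakpoints (with replacement, by assumption)."""
    draws = rng.choice(len(probabilities), size=n_breakpoints, replace=True, p=probabilities)
    return np.sort(draws)

# Toy map with three intervals; the middle one is longer and "hotter".
starts = np.array([0, 1000, 5000])
ends = np.array([1000, 5000, 6000])
rates = np.array([1e-8, 5e-8, 1e-8])
probs = breakpoint_probabilities(starts, ends, rates)
draws = draw_breakpoint_intervals(probs, 3, np.random.default_rng(0))
print(probs, draws)
```

At the small genetic distances typical of a single map interval, Haldane's function is almost linear, so the resulting probabilities are close to proportional to rate times interval length.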

1000 Genomes Project data acquisition methods

REGENS can easily simulate GWAS data from any of the 26 populations in the 1000 Genomes Project, and a filtered subset of these populations' genotype data is provided in the GitHub repository as corresponding plink filesets. In summary, I kept a random subset of 500000 quality-control-filtered, biallelic SNPs such that every subpopulation contains at least two instances of the minor allele. The exact thinning methods are in the supplementary analysis.
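
To make the minor-allele condition concrete, here is a small sketch of such a filter on an in-memory genotype matrix. It is not the procedure from the supplementary analysis: the function name `thin_snps` is hypothetical, genotypes are assumed to be coded as 0/1/2 alternate-allele counts, the minor allele is assumed to be defined across all samples pooled together, and upstream quality control is assumed to have been done already.

```python
import numpy as np

def thin_snps(genotypes, populations, n_keep, rng, min_minor_count=2):
    """Pick a random subset of SNP columns that pass the minor-allele filter.

    genotypes:   (n_samples, n_snps) array of 0/1/2 alternate-allele counts.
    populations: length n_samples array of subpopulation labels.
    Returns the sorted column indices of the SNPs that were kept.
    """
    genotypes = np.asarray(genotypes)
    populations = np.asarray(populations)
    # Assumption: the minor allele is the less common allele over all samples.
    alt_freq = genotypes.sum(axis=0) / (2 * genotypes.shape[0])
    alt_is_minor = alt_freq <= 0.5
    keep = np.ones(genotypes.shape[1], dtype=bool)
    for pop in np.unique(populations):
        sub = genotypes[populations == pop]
        alt_count = sub.sum(axis=0)
        # Count copies of the minor allele within this subpopulation.
        minor_count = np.where(alt_is_minor, alt_count, 2 * sub.shape[0] - alt_count)
        keep &= minor_count >= min_minor_count
    eligible = np.flatnonzero(keep)
    chosen = rng.choice(eligible, size=min(n_keep, eligible.size), replace=False)
    return np.sort(chosen)

# Toy data: two subpopulations of two samples each, four SNPs.
toy_genotypes = np.array([
    [1, 2, 0, 2],
    [1, 0, 0, 2],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
])
toy_populations = np.array(["A", "A", "B", "B"])
kept = thin_snps(toy_genotypes, toy_populations, n_keep=5, rng=np.random.default_rng(0))
print(kept)
```

In the toy data only the first SNP has at least two copies of the minor allele in both subpopulations, so it is the only column that survives the filter.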

Benchmark comparison against Triadsim

REGENS simulated 20000 samples with 500000 SNPs per sample ten times. Triadsim simulated 10000 trios with 500000 SNPs per individual ten times. A perfect comparison is not possible because simulating 10000 trios produces 30000 individuals but only 20000 unrelated individuals (assuming each child's mother and father are not related). REGENS benefits from this comparison by having to read and write only two thirds as many samples, while Triadsim benefits by having to draw only half as many breakpoints. To clarify the latter, each of Triadsim's breakpoints is applied to a trio, of which 10000 were simulated, whereas each of REGENS' breakpoints is applied to an individual, of which 20000 were simulated. Since these two advantages at least roughly amount to a wash, the fairest comparison was of each algorithm's ability to simulate the same number of unrelated individuals, because relatives are generally removed from real GWAS data. A bootstrap confidence interval was computed for the ratio of Triadsim's mean runtime to REGENS' mean runtime, and another was computed for the ratio of Triadsim's max RAM usage to REGENS' max RAM usage. All replicate runs for both algorithms were run on an Intel(R) Xeon(R) CPU E5-2690 v4 2.60GHz processor. Instructions for how to rerun those tests are here.
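
One common way to obtain such an interval is a percentile bootstrap over the replicate runs, sketched below. The text does not say which bootstrap variant was used, so the percentile method, the number of resamples, and the function name `bootstrap_ratio_ci` are all assumptions, and the demo values are synthetic placeholders rather than measured runtimes.

```python
import numpy as np

def bootstrap_ratio_ci(numerator_runs, denominator_runs, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(numerator_runs) / mean(denominator_runs).

    The two sets of replicate runs are resampled independently with replacement.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = np.random.default_rng(seed)
    num = np.asarray(numerator_runs, dtype=float)
    den = np.asarray(denominator_runs, dtype=float)
    # Each row is one bootstrap resample of the replicate runs.
    num_means = rng.choice(num, size=(n_boot, num.size), replace=True).mean(axis=1)
    den_means = rng.choice(den, size=(n_boot, den.size), replace=True).mean(axis=1)
    ratios = num_means / den_means
    lower, upper = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return num.mean() / den.mean(), lower, upper

# Synthetic placeholder measurements for two tools, ten replicates each.
demo_rng = np.random.default_rng(1)
slow_tool_minutes = demo_rng.normal(100.0, 5.0, size=10)
fast_tool_minutes = demo_rng.normal(10.0, 1.0, size=10)
ratio, low, high = bootstrap_ratio_ci(slow_tool_minutes, fast_tool_minutes)
print(f"ratio = {ratio:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The same function could be applied to the per-run peak RAM measurements to get the second interval.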
