Giter Club home page Giter Club logo

new-code-for-vcf's Introduction

New code for reading VCF files

UPDATE: all code has now been included in pegas.

See below for some examples.

These functions provide new tools to handle VCF files. The new user-level functions are:

  • VCFheader extracts the header of a VCF file as a single character string (can be printed in a more friendly way with cat).

  • VCFlabels extracts the labels of the individuals.

  • VCFloci extracts the information of the loci, by default all nine fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT) are read. This returns an object of class "VCFinfo" with a print method.

  • is.snp tests whether a locus is a SNP (this is now a generic function).

  • rangePOS selects loci within a given range of POSition using an object of class "VCFinfo".

  • selectQUAL selects loci above a given QUALity using an object of class "VCFinfo".

  • getINFO extracts specific information from the INFO field (by default "DP").

The function read.vcf (already in pegas) has been rewritten with a new interface:

read.vcf(file, from = 1, to = 1e4, which.loci = NULL, quiet = FALSE)

By default the first 10,000 loci. An alternative is to use which.loci which specifies which loci to read, typically as an output from the above function.

Another new feature is that read.vcf (and VCFloci too) can read both compressed (*.gz) or uncompressed files.

Example (with the chromosome Y from 1000 Genomes):

> info <- VCFlociinfo("chrY.vcf")
Scanning file chrY.vcf 
 171.6615 / 171.6615 Mb
Done.

The INFO field is big so we remove it to get a nicer print:

> INFO <- info$INFO
> info$INFO <- NULL
> info
      CHROM      POS         ID REF ALT QUAL FILTER FORMAT
1         Y  2655180 rs11575897   G   A  100   PASS     GT
2         Y  2655471          .   A   C  100   PASS     GT
3         Y  2655754          .   A   T  100   PASS     GT
4         Y  2655800          .   A   G  100   PASS     GT
....
.....
62039     Y 28770656          .   A   G  100   PASS     GT
62040     Y 28770756          .   C   G  100   PASS     GT
62041     Y 28770875          .   C   G  100   PASS     GT
62042     Y 28770931          .   T   C  100   PASS     GT

We can now identify the loci which are real SNPs (i.e., substitutions with only 2 alleles):

> SNP <- is.snp(info)
> table(SNP)
SNP
FALSE  TRUE 
 1537 60505

We now read the SNPs:

> X <- read.vcf("chrY.vcf", which.loci = which(SNP))
Reading 60505 loci.
Done
> X
Allelic data frame: 1233 individuals
                    60505 loci

And it's a simple matter to read the other (non-SNP) loci:

> Y <- read.vcf("chrY.vcf", which.loci = which(!SNP))
Reading 1537 loci.
Done
> Y
Allelic data frame: 1233 individuals
                    1537 loci

A lot of things can be done from this (as usual with R). For instance, the help page ?read.vcf gives code to draw the distribution of mutations along chromosome Y for non-SNP mutations (in red) and SNPs (in blue) for the whole chromosome (first plot) and on a restricted portion (second plot) marked with a dashed square:

alt text

new-code-for-vcf's People

Contributors

emmanuelparadis avatar

Watchers

 avatar  avatar  avatar

new-code-for-vcf's Issues

New Code Not loading VCF File

Hi,

I think that peas was updated to include the new vcc info into the workflow, but I tried to load a vcf file into R and got the following error messages (after I force quit after an hour waiting for file to load); including both command and error:

read.vcf("2015_termiteRF.vcf", from = 1, to = 10000, which.loci = NULL, quiet = FALSE)

Warning messages:
1: In i + 0:5 : NAs produced by integer overflow
2: In i + 0:5 : NAs produced by integer overflow
3: In i + 0:5 : NAs produced by integer overflow
4: In i + 0:5 : NAs produced by integer overflow
5: In i + 0:5 : NAs produced by integer overflow
6: In i + 1L : NAs produced by integer overflow

Also, if I try using the command to get info from the vcc file, I get the following error:

RF_info <-VCFlociinfo("2015_termiteRF.vcf")
Error: could not find function "VCFlociinfo"

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.