arrogantrobot / 23andme2vcf
convert your 23andme raw file to VCF | DEPRECATED, please see https://github.com/plantimals/2vcf
License: MIT License
The 23andMe SNPs have been updated, so when I run the script I see this error message:
raw data file and reference file are out of sync at ./23andme2vcf.pl line 153, <GEN1> line 587611.
The most current VCF specification is VCF 4.2 (http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41).
The generated VCF file uses only a small subset of the VCF spec, and none of the fields it uses change between VCF 4.1 and VCF 4.2. Additionally, when the version in the VCF header is changed to 4.2, the generated VCF file is still syntactically valid according to vcf-validator.
Are there any reasons not to have 23andme2vcf generate 4.2 instead of 4.1? If there are none, I'm happy to make the PR.
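For illustration, the header change discussed above touches only the `##fileformat` line. Below is a minimal Python sketch of that rewrite applied to an already generated file. The function name `bump_vcf_version` is hypothetical and is not part of 23andme2vcf.

```python
def bump_vcf_version(lines, old="VCFv4.1", new="VCFv4.2"):
    """Yield VCF lines, switching only the ##fileformat header from old to new."""
    for line in lines:
        # Only the exact fileformat header is rewritten; every other line passes through.
        if line.strip() == f"##fileformat={old}":
            yield f"##fileformat={new}\n"
        else:
            yield line


if __name__ == "__main__":
    sample = ["##fileformat=VCFv4.1\n", "#CHROM\tPOS\tID\tREF\tALT\n"]
    print("".join(bump_vcf_version(sample)), end="")
```

Running a validator on the result afterwards, as described above, is still the real check.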
I followed the simple instructions to convert my 23andme downloaded .txt file to a .vcf file, receiving this output response:
gunzip: 23andme_v3_hg19_ref.txt.gz: not in gzip format
587747 sites were not included; these unmatched references can be found in sites_not_in_reference.txt.
Try running again, but specify the other reference version: ./23andme2vcf.pl {path to}mysnps.txt mysnps.vcf 4
FYI, I replaced the correct path to the .txt file with {path to} in this issue.
I tried 23andme_v3_hg19_ref.txt.gz version as well, but received the same response, telling me to try the version 4. In either case, the output file (mysnps.vcf) is empty below this section:
`##fileformat=VCFv4.2`
Any clue on what's wrong? Please and thanks.
It appears that you have used UCSC's reference hg19 instead of NCBI's GRCh37 to build your own reference. Normally this is fine, but these builds differ on chromosome MT. For example, the UCSC genome browser shows a T instead of a C at MT:150.
You can use bcftools to validate against a reference.
bcftools norm -ce -f /reference/homo.sapiens/GRCh37/Homo_sapiens_assembly19.fasta 23andme.vcf
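To make the idea behind that check concrete, here is a rough Python sketch of a REF-allele comparison. It is not how bcftools works internally. It assumes the reference has already been loaded into a plain dict mapping chromosome name to sequence string, and `find_ref_mismatches` is a hypothetical helper name.

```python
def find_ref_mismatches(vcf_lines, reference):
    """Return (chrom, pos, vcf_ref, reference_base) for records whose REF disagrees.

    `reference` is assumed to be a dict such as {"MT": "GATC..."}; VCF POS is 1-based.
    """
    mismatches = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        seq = reference.get(chrom)
        if seq is None:
            continue  # chromosome not present in the loaded reference
        observed = seq[pos - 1:pos - 1 + len(ref)]
        if observed.upper() != ref.upper():
            mismatches.append((chrom, pos, ref, observed))
    return mismatches
```

A mismatch concentrated on one chromosome, as with MT here, hints at a difference between builds and not at a problem with individual sites.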
hi rob,
i tried using your script for converting the 23andme data from the personal genome project. it appears that data is still aligned to hg18. could you provide the corresponding reference or let me know how to compile it myself (like, do you have a script that derives it from the ncbi fasta files?)
thanks
tim
I noticed that there's no licensing document or statement for the script; I'm interested to expand it but I don't want to cause any issues. What terms is this released under, please?
My file looks like this:
rs548049170 1 69869 TT
rs13328684 1 74792 --
rs9283150 1 565508 --
But the reference document in the cloned repo looks like this:
chr1 734462 rs12564807 G
chr1 752721 rs3131972 A
chr1 760998 rs148828841 C
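The two excerpts above differ in column order and in how the chromosome is named (`1` versus `chr1`). A minimal Python sketch, assuming whitespace-separated columns exactly as shown, reads both layouts into a common shape so that records can be compared by chromosome and position. The function names are hypothetical and do not come from the script.

```python
def parse_raw_line(line):
    """Parse a raw-data line laid out as: rsid, chromosome, position, genotype."""
    rsid, chrom, pos, genotype = line.split()
    return {"rsid": rsid, "chrom": chrom, "pos": int(pos), "genotype": genotype}


def parse_ref_line(line):
    """Parse a reference line laid out as: chromosome, position, rsid, reference base."""
    chrom, pos, rsid, ref = line.split()
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # drop the "chr" prefix so both files name chromosomes alike
    return {"rsid": rsid, "chrom": chrom, "pos": int(pos), "ref": ref}


if __name__ == "__main__":
    raw = parse_raw_line("rs548049170 1 69869 TT")
    ref = parse_ref_line("chr1 734462 rs12564807 G")
    # Same chromosome, but the raw file starts at an earlier position than the reference.
    print(raw["chrom"] == ref["chrom"], raw["pos"] < ref["pos"])
```

With the sample lines above, the first raw position (69869) comes before the first reference position (734462), so the raw file lists at least some sites that this reference excerpt does not.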
I keep getting just "0" or "1" for genotype after ~560,000 correct conversions using the 23andme_v4_hg19_ref.txt.gz data and after ~500,000 correct conversions using the 23andme_v4_hg19_ref.txt.gz data using "perl 23andme2vcf.pl <path to 23andme txt.zip file> genome_XYZ.vcf 4 (or 3)". The break happens in both after:
chrX 2689575 rs311150 G A . . . GT 1/1
I've seen this with two different individuals' SNP files, one generated in Feb 2015 and another generated Dec 2016. Both the above rsID and the following one are still listed in the current dbSNP, and both entries in the 23andme_v5_ht19_ref.txt.gz appear valid. I've run this on a laptop and on a Linux server so don't think it's a resource issue. Any suggestions? Thanks.
Larry
Could you please update the reference files for the new V5 chip?
Hi,
I would like to convert v2 23andme files (build 36) to vcf. There are a lot of v2 files to convert, and I am not able to re-download the data. Could you update your program to support v2 build 36 conversion?
Hi,
I couldn't find any other way to contact you so I suppose it'll have to be through this. I was wondering if you could make an exact list of steps of how to convert my 23andme.txt file to the VCF format using your tool. I'm asking you as a person who has near to no experience with coding or computer science. If you are able to tell me exactly how to do this, then I can share your tool with others like myself for a personal genomics class for uploading VCF files to various different programs and tools for analysis. Your assistance in this matter would be greatly appreciated.
Thanks.
OS X includes a broken zcat (can't handle .gz files, only .Z files). The following patch fixes this issue.
Cheers,
Shaun
diff --git a/23andme2vcf.pl b/23andme2vcf.pl
index 2ef215f..d545a56 100755
--- a/23andme2vcf.pl
+++ b/23andme2vcf.pl
@@ -18,10 +18,10 @@ missing($raw_path) unless -s $raw_path;
missing($ref_path) unless -s $ref_path;
#open the raw data as a zip or text
-my $fh = ($raw_path =~ m/zip$/) ? IO::File->new("zcat $raw_path|") : IO::File->new($raw_path);
+my $fh = ($raw_path =~ m/zip$/) ? IO::File->new("gunzip -c $raw_path|") : IO::File->new($raw_path);
#open the compressed reference file
-my $ref_fh = IO::File->new("zcat $ref_path|");
+my $ref_fh = IO::File->new("gunzip -c $ref_path|");
my $output_fh = IO::File->new(">$output_path");
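The patch above avoids the platform's `zcat` by calling `gunzip -c`. Another way to avoid depending on external tools is to decompress in-process. The Python sketch below does that: it checks for the two-byte gzip magic number and falls back to plain text otherwise. It illustrates the idea only and is not a change to the Perl script; `open_text` is a hypothetical helper name.

```python
import gzip


def open_text(path):
    """Open path for text reading, decompressing if it begins with the gzip magic bytes."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with bytes 0x1f 0x8b
        return gzip.open(path, "rt")
    return open(path, "r")
```

Because the check looks at file content and not at the extension, a file that has a `.gz` name but is not actually gzip data is read as plain text and does not fail in the decompressor.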
running the script on 23andMe raw data (downloaded 29th of Oct, 2013) prints out a TON of warnings:
Use of uninitialized value $data_line in scalar chomp at 23andme2vcf.pl line 164, <GEN1> line 960613.
Use of uninitialized value $data_line in split at 23andme2vcf.pl line 165, <GEN1> line 960613.
Use of uninitialized value $chr in string eq at 23andme2vcf.pl line 166, <GEN1> line 960613.
Use of uninitialized value $my_pos in numeric gt (>) at 23andme2vcf.pl line 170, <GEN1> line 960613.
over and over again. Also, when script starts, I get this warning for every line in the data file:
Use of uninitialized value $my_pos in numeric gt (>) at 23andme2vcf.pl line 170, <GEN1> line 274267.
This may take some significant development. The exact coordinates and alleles involved in the insertions and deletions signified by the I's and D's in the genotype column of the 23andme raw data may need to be retrieved from dbSNP or some other outside source. Once done, this reference can be added in to the existing SNP reference file.
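As a possible first step, the insertion and deletion calls can at least be told apart from ordinary SNP calls before any coordinates are looked up. The Python sketch below classifies a genotype string, assuming genotypes are made of the letters A, C, G, T, I, D or the no-call marker `-`. `classify_genotype` is a hypothetical name, and the sketch does not resolve the actual indel alleles.

```python
def classify_genotype(genotype):
    """Classify a raw genotype string as 'no-call', 'indel', 'snp', or 'unknown'."""
    calls = set(genotype)
    if not genotype or calls <= {"-"}:
        return "no-call"
    if calls & {"I", "D"}:
        return "indel"  # alleles would still have to be fetched from an outside source
    if calls <= set("ACGT"):
        return "snp"
    return "unknown"
```

Sites flagged as `indel` could then be collected into a list for lookup, leaving the SNP path unchanged.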
Hi Rob,
thanks for the script. Would you mind committing the code that generates the reference file? Mine says 8140 sites were not included even with version 4.
Many thanks.