arrogantrobot / 23andme2vcf

convert your 23andme raw file to VCF | DEPRECATED, please see https://github.com/plantimals/2vcf

License: MIT License

Perl 100.00%

vcf 23andme ancestry annotations perl

23andme2vcf's Issues

23andMe SNPs have been updated

The 23andMe SNPs have been updated, so when I run the script I see this error message:

raw data file and reference file are out of sync at ./23andme2vcf.pl line 153, <GEN1> line 587611.

Upgrade to VCF4.2?

The most current VCF specification is VCF 4.2 (http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41).

The generated VCF file uses only a small subset of the VCF spec, and none of the fields it uses change from VCF4.1 to VCF4.2. Additionally, when the version in the VCF header is changed to 4.2, the generated VCF file is still syntactically valid according to vcf-validator.

Are there any reasons not to have 23andme2vcf generate 4.2 instead of 4.1? If there are none, I'm happy to make the PR.
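A minimal sketch of the header change described above, written in Python rather than the script's Perl. It assumes the only difference needed is the `##fileformat` line; the function name `bump_vcf_version` is hypothetical and not part of 23andme2vcf.

```python
def bump_vcf_version(lines, new_version="VCFv4.2"):
    """Return the VCF lines with the ##fileformat header set to new_version.

    Assumption: nothing else in the file depends on the declared version.
    """
    out = []
    for line in lines:
        if line.startswith("##fileformat="):
            # Preserve the original line ending, if any.
            ending = "\n" if line.endswith("\n") else ""
            line = "##fileformat=" + new_version + ending
        out.append(line)
    return out
```

The result would still need to be checked with a validator such as vcf-validator, as the issue text does.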

Tried to use it, but the output file doesn't contain any rows below the static headers for the .vcf file?

I followed the simple instructions to convert my 23andme downloaded .txt file to a .vcf file, receiving this output response:

gunzip: 23andme_v3_hg19_ref.txt.gz: not in gzip format
587747 sites were not included; these unmatched references can be found in sites_not_in_reference.txt.
Try running again, but specify the other reference version: ./23andme2vcf.pl {path to}mysnps.txt mysnps.vcf 4

FYI, I replaced the correct path to the .txt file with {path to} in this issue.

I tried the 23andme_v3_hg19_ref.txt.gz version as well, but received the same response telling me to try version 4. In either case, the output file (mysnps.vcf) is empty below this section:

##fileformat=VCFv4.2
##fileDate=20151116
##source=23andme2vcf.pl https://github.com/arrogantrobot/23andme2vcf
##reference=file://23andme_v4_hg19_ref.txt.gz
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GENOTYPE

Any clue on what's wrong? Please and thanks.
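Since gunzip reports "not in gzip format", one thing worth checking is whether the reference file on disk really is a gzip stream. The sketch below (Python, not part of the script; `looks_like_gzip` is a hypothetical helper) compares the first two bytes against the gzip magic number `1f 8b`. It does not say why a file might fail the check.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_like_gzip(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC
```

For example, `looks_like_gzip("23andme_v4_hg19_ref.txt.gz")` returning False would be consistent with the gunzip message above.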

MT reference incorrect

It appears that you used UCSC's hg19 reference instead of NCBI's GRCh37 to build your own reference. Normally this is fine, but these builds differ on chromosome MT. For example, the UCSC genome browser shows a T instead of a C at MT:150.

You can use bcftools to validate against a reference.

bcftools norm -ce -f /reference/homo.sapiens/GRCh37/Homo_sapiens_assembly19.fasta 23andme.vcf

hg18 reference

hi rob,

i tried using your script for converting the 23andme data from the personal genome project. it appears that the data is still aligned to hg18. could you provide the corresponding reference or let me know how to compile it myself (like, do you have a script that derives it from the ncbi fasta files?)

thanks
tim

No Licensing Data

I noticed that there's no licensing document or statement for the script. I'm interested in expanding it, but I don't want to cause any issues. What terms is this released under, please?

23andMe Changed their columns

My file looks like this:

rs548049170	1	69869	TT
rs13328684	1	74792	--
rs9283150	1	565508	--

But the reference document in the cloned repo looks like this:

chr1	734462	rs12564807	G
chr1	752721	rs3131972	A
chr1	760998	rs148828841	C
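The two samples above differ in column order (rsID first versus chromosome first) and in chromosome naming (`1` versus `chr1`). A rough Python sketch of reading each layout into a common shape, assuming tab-separated fields exactly as in the samples; the function names are hypothetical and this is not the script's own parsing code.

```python
def parse_raw_line(line):
    """Parse a raw-data row: rsid, chromosome, position, genotype."""
    rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
    return {"rsid": rsid, "chrom": chrom, "pos": int(pos), "genotype": genotype}

def parse_ref_line(line):
    """Parse a reference row: chromosome, position, rsid, reference allele."""
    chrom, pos, rsid, ref = line.rstrip("\n").split("\t")
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # normalise 'chr1' to '1' for comparison
    return {"rsid": rsid, "chrom": chrom, "pos": int(pos), "ref": ref}

def same_site(raw, ref):
    """True when both records describe the same chromosome and position."""
    return (raw["chrom"], raw["pos"]) == (ref["chrom"], ref["pos"])
```

With the rows quoted above, the first raw row (1:69869) and the first reference row (chr1:734462) are different sites, so `same_site` returns False for that pair.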

Broken genotypes

I keep getting just "0" or "1" for the genotype after ~560,000 correct conversions using the 23andme_v4_hg19_ref.txt.gz data, and after ~500,000 correct conversions using the 23andme_v4_hg19_ref.txt.gz data, when running "perl 23andme2vcf.pl <path to 23andme txt.zip file> genome_XYZ.vcf 4 (or 3)". In both cases the break happens after:

chrX 2689575 rs311150 G A . . . GT 1/1

I've seen this with two different individuals' SNP files, one generated in Feb 2015 and another generated Dec 2016. Both the above rsID and the following one are still listed in the current dbSNP, and both entries in the 23andme_v5_ht19_ref.txt.gz appear valid. I've run this on a laptop and on a Linux server, so I don't think it's a resource issue. Any suggestions? Thanks.

Larry
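For illustration only, here is a Python sketch of how a raw genotype string could be turned into a VCF GT value given the reference allele. It is not the script's actual logic, and `genotype_to_gt` is a hypothetical name. Under this mapping a single-letter (haploid) call produces a GT of just "0" or "1", which may or may not be what is happening in the report above.

```python
def genotype_to_gt(genotype, ref):
    """Map a raw genotype such as 'AG' or 'A' to (alt_alleles, GT string).

    Returns ([], None) for no-calls and for anything that is not plain ACGT.
    """
    alleles = list(genotype)
    if not alleles or any(a not in "ACGT" for a in alleles):
        return [], None
    alts = []
    for a in alleles:
        if a != ref and a not in alts:
            alts.append(a)
    # 0 is the reference allele; ALT alleles are numbered from 1.
    indices = sorted(0 if a == ref else alts.index(a) + 1 for a in alleles)
    return alts, "/".join(str(i) for i in indices)
```

A two-letter call `AA` against reference `G` gives ALT `A` and GT `1/1`, matching the shape of the last correct row quoted above, while a one-letter call `A` gives GT `1`.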

V5 Chip

Could you please update the reference files for the new V5 chip?

b36?

Hi,

I would like to convert v2 23andme files (build 36) to vcf. There are a lot of v2 files to convert, and I am not able to re-download the data. Could you update your program to support v2 build 36 conversion?

List of Steps to Convert 23andme.txt to VCF

Hi,

I couldn't find any other way to contact you, so I suppose it'll have to be through this. I was wondering if you could make an exact list of steps for converting my 23andme.txt file to the VCF format using your tool. I'm asking as a person who has next to no experience with coding or computer science. If you are able to tell me exactly how to do this, then I can share your tool with others like myself in a personal genomics class, for uploading VCF files to various programs and tools for analysis. Your assistance in this matter would be greatly appreciated.

Thanks.

Use gunzip -c rather than zcat for OS X

OS X includes a broken zcat (can't handle .gz files, only .Z files). The following patch fixes this issue.

Cheers,
Shaun

diff --git a/23andme2vcf.pl b/23andme2vcf.pl
index 2ef215f..d545a56 100755
--- a/23andme2vcf.pl
+++ b/23andme2vcf.pl
@@ -18,10 +18,10 @@ missing($raw_path) unless -s $raw_path;
 missing($ref_path) unless -s $ref_path;

 #open the raw data as a zip or text
-my $fh = ($raw_path =~ m/zip$/) ? IO::File->new("zcat $raw_path|") : IO::File->new($raw_path);
+my $fh = ($raw_path =~ m/zip$/) ? IO::File->new("gunzip -c $raw_path|") : IO::File->new($raw_path);

 #open the compressed reference file
-my $ref_fh = IO::File->new("zcat $ref_path|");
+my $ref_fh = IO::File->new("gunzip -c $ref_path|");

 my $output_fh = IO::File->new(">$output_path");
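A different way to sidestep platform differences between `zcat` and `gunzip -c` is to decompress in-process instead of shelling out. The sketch below shows that idea in Python using the standard `gzip` module; it is a swapped-in technique for illustration, not a port of the patch, and `open_text` is a hypothetical helper.

```python
import gzip

def open_text(path):
    """Open path for reading as text, decompressing gzip data if present.

    The decision is based on the gzip magic bytes, not the file extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")
```

Because no external command is run, the behaviour does not depend on which `zcat` the operating system ships.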

Warnings for "use of uninitialized value" lines 164-170

running the script on 23andMe raw data (downloaded 29th of Oct, 2013) prints out a TON of warnings:

Use of uninitialized value $data_line in scalar chomp at 23andme2vcf.pl line 164, <GEN1> line 960613.
Use of uninitialized value $data_line in split at 23andme2vcf.pl line 165, <GEN1> line 960613.
Use of uninitialized value $chr in string eq at 23andme2vcf.pl line 166, <GEN1> line 960613.
Use of uninitialized value $my_pos in numeric gt (>) at 23andme2vcf.pl line 170, <GEN1> line 960613.

over and over again. Also, when the script starts, I get this warning for every line in the data file:

Use of uninitialized value $my_pos in numeric gt (>) at 23andme2vcf.pl line 170, <GEN1> line 274267.

Add In/Del support

This may take some significant development. The exact coordinates and alleles involved in the insertions and deletions signified by the I's and D's in the genotype column of the 23andme raw data may need to be retrieved from dbSNP or some other outside source. Once done, this reference can be added in to the existing SNP reference file.
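As a first step toward this, the rows that carry I/D calls would need to be told apart from ordinary SNP calls and no-calls. A small Python sketch of that classification, assuming genotypes look like `TT`, `--`, `II`, `DI` and so on; `classify_call` is a hypothetical name, and looking up the actual indel alleles is out of scope here.

```python
def classify_call(genotype):
    """Classify a raw genotype string as 'snp', 'indel', 'no-call' or 'other'."""
    symbols = set(genotype)
    if not symbols or symbols <= {"-"}:
        return "no-call"
    if symbols <= {"I", "D"}:
        return "indel"
    if symbols <= set("ACGT"):
        return "snp"
    return "other"  # mixed or unexpected symbols
```

Rows classified as `indel` are the ones that would need coordinates and alleles from dbSNP or another outside source, as described above.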

New reference

Hi Rob,

Thanks for the script. Would you mind committing the code that generates the reference file? My run says 8140 sites were not included, even with version 4.

Many thanks.