The Platinum Genomes Truthset

Home Page: https://illumina.github.io/PlatinumGenomes

platinumgenomes's Introduction

Platinum Genomes

This repo contains the Platinum Genomes small variant truthset for samples NA12878 (also known as HG001) and NA12877. Platinum Genomes truthset variants were validated using haplotype inheritance information through a well-studied 17-member pedigree (CEPH 1463).

Truthsets

Truthsets are made up of a VCF of validated variant records and a BED file of confident regions. These files aren't huge (hundreds of MB) but are too large to play nicely with git and GitHub. Here are a few ways to download them:

AWS CLI

Truthset files are stored in an AWS S3 bucket called platinum-genomes. One way to download them is via the AWS CLI:

aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive

To download without AWS credentials, add the --no-sign-request flag. You can also explore the bucket and download individual files with the S3 bucket display at https://illumina.github.io/PlatinumGenomes.

wget

Alternatively, use wget or similar with the file URIs in this repo, e.g.:

wget -xi files/2017-1.0.files

You can then use the relevant md5 checksum in each release to validate data integrity.
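The integrity check can be sketched in Python as below. This is a minimal example, not part of the release tooling: the function names are hypothetical, and the expected digest is assumed to be copied by hand from the release's md5 checksum list, whose exact file layout is not described here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with an expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("NA12878.vcf.gz", expected)` returns True only when the downloaded file is byte-identical to the one the checksum was computed from.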

Finally, truthset files can also be downloaded via FTP, e.g.:

wget ftp://platgene_ro:''@ussd-ftp.illumina.com/2017-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz

Usage

To compare a VCF against these truthsets, we recommend using hap.py, which performs a sophisticated haplotype comparison, rather than a simpler tool such as bcftools isec.
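A typical hap.py invocation takes the truth VCF and the query VCF as positional arguments, with the confident regions passed via -f, the reference FASTA via -r and an output prefix via -o. The sketch below only assembles that command line; the helper name and all file paths are placeholders chosen for illustration, so substitute the files from your own download and pipeline.

```python
def build_happy_command(truth_vcf, query_vcf, confident_bed, reference_fasta, out_prefix):
    """Assemble a hap.py command line as an argument list."""
    return [
        "hap.py", truth_vcf, query_vcf,
        "-f", confident_bed,      # restrict false-positive counting to confident regions
        "-r", reference_fasta,    # reference the VCFs were called against
        "-o", out_prefix,         # prefix for the output files
    ]


if __name__ == "__main__":
    # Placeholder paths (assumptions): adjust to your local layout.
    command = build_happy_command(
        "truth/NA12878.vcf.gz",
        "my_pipeline/NA12878.query.vcf.gz",
        "truth/ConfidentRegions.bed.gz",
        "reference/genome.fa",
        "benchmark/NA12878",
    )
    print(" ".join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` on a machine where hap.py is installed.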

Applications wrapping hap.py and containing these truthsets are available on the following platforms:

BaseSpace Sequence Hub (Hap.py Benchmarking and VCAT)
PrecisionFDA (GA4GH Benchmarking)

Details

See the attached wiki for technical information.

Raw data

Sequencing data for NA12878, NA12877 and samples NA12889-NA12892 (grandparents) are available through the ENA:

ENA Study: PRJEB3381

BaseSpace users can access this data via a shared BaseSpace project:

BaseSpace project share

Sequencing data for the remaining pedigree members are not consented for public release and so are made available through dbGaP:

dbGaP: phs001224.v1.p1

Issues

Please open an issue for comments, issues and other feedback.

Citation

For further information or to cite Platinum Genomes resources, see:

Eberle, MA et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research, 27:157-164. doi:10.1101/gr.210500.116

platinumgenomes's Issues

Two indels or one SNV

Hi, I am a PhD student who works on genomic variants.
In the Platinum Genomes 2017 release, in the hg19 NA12878.vcf.gz (https://s3.eu-central-1.amazonaws.com/platinum-genomes/2017-1.0/hg19/small_variants/NA12878/NA12878.vcf.gz), I found these two records:
chr19 36397290 . CA C . PASS MTD=bwa_platypus;KM=8.96;KFP=0;KFF=0 GT 0|1
chr19 36397299 . A AT . PASS MTD=bwa_platypus;KM=9.30;KFP=0;KFF=0 GT 0|1

However, when I check the reference sequence around these two variants, I see:
CAAAAAAAAATTTTTTTTA
So I am confused about whether one deletion (an A is deleted) and one insertion (a T is inserted after the A) should instead be written as one SNV (A changes to T).

When I used GATK (v4.1.9.0) to call variants from the BAM file downloaded from ENA (PRJEB1813), the result was:
19 36397299 . A T 645.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.782;DP=46;ExcessHet=3.0103;FS=1.177;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=14.04;ReadPosRankSum=1.153;SOR=0.458 GT:AD:DP:GQ:PL 0/1:20,26:46:99:653,0,519

I wonder why the VCF file has two indels instead of one SNV. Since both indels have the phased genotype 0|1, I think they occur on the same haplotype.

Waiting for your reply
Thank you very much
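The equivalence asked about here can be checked by applying each representation to the reference and comparing the resulting haplotypes. The sketch below is illustrative only and says nothing about why the truthset uses the two-indel form. It assumes the reference window quoted above is C, nine A's, eight T's and a final A, starting at chr19:36397290; `apply_variants` is a hypothetical helper.

```python
def apply_variants(reference, start_pos, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference window.

    `start_pos` is the 1-based genomic coordinate of reference[0].
    """
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        offset = pos - start_pos
        if offset < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if reference[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:offset])
        pieces.append(alt)
        cursor = offset + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Assumption: the quoted window starts at chr19:36397290.
WINDOW_START = 36397290
REFERENCE = "C" + "A" * 9 + "T" * 8 + "A"

two_indels = [(36397290, "CA", "C"), (36397299, "A", "AT")]
one_snv = [(36397299, "A", "T")]

haplotype_from_indels = apply_variants(REFERENCE, WINDOW_START, two_indels)
haplotype_from_snv = apply_variants(REFERENCE, WINDOW_START, one_snv)
print(haplotype_from_indels == haplotype_from_snv)
```

Under these assumptions both representations yield the same alternate haplotype, which is the kind of difference a haplotype-aware comparison is designed to reconcile and a position-by-position comparison is not.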

Question about Confident region

Hi, I am a master's student who is using PG.
I am confused about the confident regions. Based on the paper and the GitHub wiki, my understanding is that positions inside the confident regions are either non-variant (0|0) or homozygous variant (1|1), but I still found some heterozygous variants (0|1) from the truth set located inside confident regions. So what characterizes the confident regions? Are they fully homozygous (only 0|0 and 1|1), or homozygous (0|0 and 1|1) plus validated heterozygous variants (0|1)?
Thank you

Another question about BAM file

Hi, I am the master's student who asked a "simple" question before.
In my research plan, I am using NA12878 as the truth set, downloaded from ENA (https://www.ebi.ac.uk/ena/data/view/ERR194147).
On that page, I can download either the FASTQs or the BAM file. My question is: was the BAM file made from those FASTQs? If yes, what was your trimming strategy?
Thank you

[Errno 95] Operation not supported

I get the following error trying to download into an AWS S3 bucket
failed: s3://platinum-genomes/2017-1.0/hg19/small_variants/NA12877/NA12877.vcf.gz to pg2017/hg19/small_variants/NA12877/NA12877.vcf.gz [Errno 95] Operation not supported

corresponding BAM file for NA12877.vcf.gz

Hi,
We are exploring the platinum truthset variant calls for NA12877 (aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive). Which is the correct corresponding BAM file from which these variant calls were made (i.e. the BAM to match this VCF)?

The ENA BAM file for NA12877 (ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12877_S1.bam) has a header which does not match the ref genome
Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa reported in the NA12877.vcf.gz.

For example the BAM file header:

@SQ	SN:chrM	LN:16571
@SQ	SN:chr1	LN:249250621
@SQ	SN:chr2	LN:243199373
@SQ	SN:chr3	LN:198022430
@SQ	SN:chr4	LN:191154276
@SQ	SN:chr5	LN:180915260
@SQ	SN:chr6	LN:171115067
@SQ	SN:chr7	LN:159138663
@SQ	SN:chr8	LN:146364022
@SQ	SN:chr9	LN:141213431
@SQ	SN:chr10	LN:135534747
@SQ	SN:chr11	LN:135006516
@SQ	SN:chr12	LN:133851895
@SQ	SN:chr13	LN:115169878
@SQ	SN:chr14	LN:107349540
@SQ	SN:chr15	LN:102531392
@SQ	SN:chr16	LN:90354753
@SQ	SN:chr17	LN:81195210
@SQ	SN:chr18	LN:78077248
@SQ	SN:chr19	LN:59128983
@SQ	SN:chr20	LN:63025520
@SQ	SN:chr21	LN:48129895
@SQ	SN:chr22	LN:51304566
@SQ	SN:chrX	LN:155270560
@SQ	SN:chrY	LN:59373566
@RG	ID:NA12877	SM:NA12877

The reference FASTA index, however, appears to describe a different reference sequence:

less  /ifs/labs/andrews/walter/varcal/data/platinum/Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa.fai 

chr1    248956422       112     80      81
chr2    242193529       252068602       80      81
chr3    198295559       497289663       80      81
chr4    190214555       698064029       80      81
chr5    181538259       890656400       80      81
chr6    170805979       1074464000      80      81
chr7    159345973       1247405166      80      81
chr8    145138636       1408743076      80      81
chr9    138394717       1555696057      80      81
chr10   133797422       1695820821      80      81
chr11   135086622       1831290824      80      81
chr12   133275309       1968066142      80      81
chr13   114364328       2103007506      80      81
chr14   107043718       2218801515      80      81
chr15   101991189       2327183393      80      81
chr16   90338345        2430449584      80      81
chr17   83257441        2521917271      80      81
chr18   80373285        2606215543      80      81
chr19   58617616        2687593620      80      81
chr20   64444167        2746944069      80      81
chr21   46709983        2812193914      80      81
chr22   50818468        2859487897      80      81
chrX    156040895       2910941708      80      81
chrY    57227415        3068933262      80      81
chrM    16569   3126876142      80      81
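A mismatch like this can be detected mechanically by comparing the @SQ lengths in the BAM header with the lengths in the FASTA index. The sketch below is one way to do that, using hypothetical helper names and a two-contig excerpt of the values listed above. For orientation, the header lengths above are the hg19 chromosome lengths (for example chr1 = 249250621), while the index lists GRCh38 lengths (chr1 = 248956422).

```python
def parse_sq_lengths(header_text):
    """Map contig name to length from the @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split()[1:] if ":" in f)
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def parse_fai_lengths(fai_text):
    """Map contig name to length from a .fai index (columns 1 and 2)."""
    lengths = {}
    for line in fai_text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            lengths[parts[0]] = int(parts[1])
    return lengths


def mismatched_contigs(bam_lengths, ref_lengths):
    """Return contigs present in both inputs whose lengths differ."""
    return {
        name: (bam_lengths[name], ref_lengths[name])
        for name in bam_lengths
        if name in ref_lengths and bam_lengths[name] != ref_lengths[name]
    }


# Excerpt of the header and index shown above.
header = "@SQ\tSN:chrM\tLN:16571\n@SQ\tSN:chr1\tLN:249250621\n@RG\tID:NA12877\tSM:NA12877"
fai = "chr1 248956422 112 80 81\nchrM 16569 3126876142 80 81"

mismatches = mismatched_contigs(parse_sq_lengths(header), parse_fai_lengths(fai))
for contig, (bam_len, ref_len) in sorted(mismatches.items()):
    print(f"{contig}: BAM header {bam_len} vs reference index {ref_len}")
```

The header text can be obtained with `samtools view -H` and the index text by reading the .fai file directly.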

NA12877 chrX calls

Hi,

We are using the PlatinumGenomes NA12877 resource and are wondering why calls on chrX begin at position 2781986? The corresponding ConfidentRegions.bed.gz begins at: chrX 251053 251087.

Thank you for your help!

NA12877 VCF file issue

Hi,
I am using the NA12877 VCF file for my project and I found that it contains a duplicate entry for one chromosome position. As far as I know, a VCF file should not need such a duplicate entry.
Can you please explain the reason for the duplicate position in your VCF?
Also, why is there no chrY entry in the VCF file for the male NA12877 individual?

The file and entries are given below:
/ussd-ftp.illumina.com/2016-1.0/hg38/small_variants/NA12877/NA12877.vcf.gz

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12877
chr12 32192430 . T TTAAA 0 PASS KM=11.9;KFP=0;KFF=0;MTD=isaac_strelka GT 0|1
chr12 32192430 . T TTAAA 0 PASS MTD=isaac_strelka GT 0|1

Thanks.
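Duplicates of this kind can be listed with a short script that counts records sharing the same CHROM, POS, REF and ALT. This is a minimal sketch with a hypothetical function name; it treats two records as duplicates even when their INFO fields differ, as in the pair above.

```python
from collections import Counter


def duplicate_records(vcf_lines):
    """Return {(chrom, pos, ref, alt): count} for records seen more than once."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split()[:5]
        counts[(chrom, int(pos), ref, alt)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# The two records quoted above.
records = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12877",
    "chr12 32192430 . T TTAAA 0 PASS KM=11.9;KFP=0;KFF=0;MTD=isaac_strelka GT 0|1",
    "chr12 32192430 . T TTAAA 0 PASS MTD=isaac_strelka GT 0|1",
]
print(duplicate_records(records))
```

To scan a whole release file, pass an open text handle instead of a list, for example `gzip.open("NA12877.vcf.gz", "rt")`.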

Found Reflected XSS On your Site

Hello Security Team, today I found a reflected XSS vulnerability on your website.

Steps:-

1. Open this URL: https://illumina.github.io/PlatinumGenomes/?prefix=
2. Add an XSS payload to the prefix parameter
3. The exploit runs

Example:-

https://illumina.github.io/PlatinumGenomes/?prefix=1%27%22%3CImg%20Src%20OnError=confirm(%27xElkomy%27)%3E

payload:

1'"<Img Src OnError=confirm('xElkomy')>

Fix:-

Remove the reflection of the prefix parameter
Filter input on arrival
Encode data on output
Use appropriate response headers
Use a Content Security Policy

HLA Truth set

Hi, I want to benchmark some HLA typing methods, and I found that Illumina Platinum Genomes is a common source of solid validation data. However, I haven't found a truth set of HLA alleles; I have searched the website and the GitHub repo too.

Could you please direct me to where I can get this dataset?

homozygous reference positions for FP assessment

Hi there,

I'm looking for the data described in your Genome Research 2017 paper as:
“… we identified 2,737,246,156 positions that are homozygous reference across the pedigree. These positions can be used to calculate false positive rates when assessing variant calling pipelines.”

Could you please direct me to the correct file for hg38?

From the description of the Confident Regions at https://github.com/Illumina/PlatinumGenomes/wiki/Confident-regions I can't tell if this is homozygous reference data as the first and second paragraph on this page are confusing when read together.

Thanks for your help

helen

Depth 0 position for platinum call

Hi,

I downloaded NA12878 bam file (114G) and NA12878 VCF file from the ENA (https://www.ebi.ac.uk/ena/browser/view/PRJEB3381).
As far as I know, platinum variant calls are the positions with the 'PASS' filter in the VCF file.
If so, I think I found some odd variant positions with zero depth.
For instance, on chromosome 1, the depth at 1-based coordinate 248,806,349 is zero.
The GT at this position is 1|1 and Ref/Alt are G/A.
How can this position be called as a platinum variant? Am I misinterpreting something?

Best,
Ahn

hg38 reference genome for

Hi,
This is a great resource - I'm wondering which specific human reference genome you used for generating the sequence data NA12877_S1.bam and truth set pg2017/hg38/small_variants/NA12877/NA12877.vcf.gz so that we can get the appropriate ref genome to make best use of these tools?

Thank you!

The VCF file lists the genome here but I'm wondering which publicly available version this is.
/illumina/sync/igenomes/Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta/genome.fa

Can't access NA12878 files on ftp site

It wants a password.

hg19 release VCF contig order does not match VCF header / hg19 fasta

Some tools care about this; the expected ordering is natural sort (1-22, then X).

Workaround is to sort with bcftools:

bcftools sort NA12878.vcf.gz -Oz -o NA12878.sorted.vcf.gz
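To confirm whether a file needs this workaround, or that the sorted output is in the expected order, the contig order of the records can be compared with an expected order. The sketch below uses hypothetical helper names and made-up records. It assumes chr-prefixed contig names; adjust the expected list if your file uses a different naming scheme.

```python
def contig_order(vcf_lines):
    """Return contig names in order of first appearance among the records."""
    order = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split(None, 1)[0]
        if chrom not in order:
            order.append(chrom)
    return order


def follows_order(observed, expected):
    """True if the observed contigs keep the relative order given in `expected`."""
    ranks = [expected.index(name) for name in observed]  # ValueError if unknown
    return ranks == sorted(ranks)


# Natural sort order: chr1..chr22, then chrX.
NATURAL_ORDER = [f"chr{i}" for i in range(1, 23)] + ["chrX"]

# Made-up records for illustration only, in lexicographic contig order.
example = [
    "##fileformat=VCFv4.1",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",
    "chr10\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|1",
    "chr2\t300\t.\tG\tA\t.\tPASS\t.\tGT\t1|1",
]
print(follows_order(contig_order(example), NATURAL_ORDER))
```

A False result indicates that at least one contig appears before a contig that should precede it.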

SNP Base pair calls for NA12878 and NA12877

Hello, I need some technical assistance. I have genotyping results for a number of SNPs (rs.....) with the associated allele 1 and allele 2 base calls for NA12878 and NA12877, and I would like to check them for concordance. Where can I obtain the Platinum Genomes data to compare against my Illumina genotyping of these SNPs? Is there a file you can point me to?