Interpreting the pathogenicity of coding variants using the ExAC database
Fundamentals of Clinical Genetics
Wellcome Trust Genome Campus, January 2016
Tarjinder Singh, Jeffrey C. Barrett
Session description
Interpreting genetic variation in an individual patient’s genome can only be done in the context of variation in the wider population. Many databases now exist with variation data from thousands of individuals drawn from the general population. This session will demonstrate one of the most valuable, the Exome Aggregation Consortium (ExAC) database of protein-coding variation in over 60,000 individuals. Typical use cases will be illustrated, including a demonstration of the ExAC website interface. We will also highlight other resources, such as 1000 Genomes and Ensembl.
Interpreting the function of coding variants
Which of many genetic variants in an individual are functional or likely pathogenic?
We can use:
- coding consequences
  - synonymous, missense, loss-of-function (LoF)
- gene function
  - e.g. a LoF variant in ARID1B is more likely to be pathogenic than a LoF variant in OR2T1
- allele frequency in the general population, as a proxy for selective pressure
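The three criteria above can be combined into a simple filter. The following is a minimal sketch, not a clinical pipeline: the `Variant` class, the `prioritise` function, the 0.001 frequency cut-off, and the example frequencies are all assumptions made for illustration and are not taken from ExAC.

```python
from dataclasses import dataclass

# Hypothetical cut-off for "rare"; choose one appropriate to the disorder.
MAX_ALLELE_FREQUENCY = 0.001

# Consequence classes named in the list above, ordered by expected severity.
SEVERITY = {"synonymous": 0, "missense": 1, "LoF": 2}


@dataclass
class Variant:
    gene: str
    consequence: str         # one of the SEVERITY keys
    allele_frequency: float  # frequency in the reference population


def prioritise(variants, max_af=MAX_ALLELE_FREQUENCY):
    """Keep rare missense/LoF variants, most severe and rarest first."""
    candidates = [
        v for v in variants
        if SEVERITY[v.consequence] >= SEVERITY["missense"]
        and v.allele_frequency <= max_af
    ]
    return sorted(
        candidates,
        key=lambda v: (-SEVERITY[v.consequence], v.allele_frequency),
    )


# Illustrative, made-up frequencies; NOT real ExAC values.
example = [
    Variant("ARID1B", "LoF", 0.0),
    Variant("OR2T1", "LoF", 0.02),
    Variant("ARID1B", "synonymous", 0.0001),
    Variant("NOD2", "missense", 0.0005),
]
ranked = prioritise(example)
for v in ranked:
    print(v.gene, v.consequence, v.allele_frequency)
```

Gene function is deliberately left out of this filter: deciding whether a rare LoF variant in a given gene matters still requires knowledge of that gene.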
Exome Aggregation Consortium (ExAC) - the largest public database of genetic variation to date
- earlier exome sequencing projects:
  - 1000 Genomes Project (n = 2,504)
  - NHLBI ESP project (n = 6,503)
  - UK10K project (n = ~5,000)
- ExAC combines all available exomes from global sequencing projects
- describes the functional consequence and allele frequency of each observed coding variant in 60,706 individuals (as of January 2016, v0.3)
- over 10 million variants: one variant every 6 base pairs; most are rare and novel
http://exac.broadinstitute.org/
How can I use the ExAC database?
- Browse high-quality genetic variants in individual transcripts, genes, and genomic regions
- Identify the functional consequence, allele frequency, and quality of an individual variant
- Find differences in allele frequency of a single variant between global populations (African, American, Non-Finnish Europeans, Finnish Europeans, East Asians, South Asians)
- Annotate the variants identified in a patient to prioritise likely pathogenic variants
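Per-population allele frequencies are allele counts divided by allele numbers (the number of chromosomes successfully genotyped at that site). A minimal sketch of that arithmetic follows; the function name `allele_frequency` and every count in `counts` are hypothetical, chosen only to illustrate the calculation.

```python
def allele_frequency(allele_count, allele_number):
    """Return allele_count / allele_number, or None if no chromosomes were genotyped."""
    if allele_number == 0:
        return None
    return allele_count / allele_number


# Hypothetical (allele count, allele number) pairs; NOT real ExAC values.
counts = {
    "African": (50, 10000),
    "Non-Finnish European": (2, 60000),
    "East Asian": (0, 8000),
    "Finnish European": (0, 0),  # site not covered in this population
}

frequencies = {
    population: allele_frequency(ac, an)
    for population, (ac, an) in counts.items()
}

for population, af in frequencies.items():
    print(population, "not covered" if af is None else f"{af:.6f}")
```

Keeping "not covered" (`None`) distinct from a frequency of zero matters: an absent variant in a poorly covered region is not evidence that the variant is rare.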
Genes likely intolerant of damaging mutations
- calculated from how depleted the gene is of damaging variants compared to expectation given the gene's mutation rate
- measured by pLI
  - a score from 0 to 1
  - genes with pLI > 0.9 are described as under genic constraint
  - a proxy for whether single-copy loss of a gene is selected against in the population
- CHD8 has a pLI of 1 and, when disrupted, is highly penetrant for developmental disorders
- OR2T1 has a pLI of 0, is an olfactory receptor, and a single-copy loss is unlikely to cause a severe phenotype
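Applying the pLI threshold is a one-line comparison. The sketch below uses the 0.9 cut-off and the two pLI values quoted above; the function name `is_constrained` is an assumption for illustration, and the code does not attempt to compute pLI itself.

```python
# Threshold quoted in the text above.
PLI_CONSTRAINT_THRESHOLD = 0.9


def is_constrained(pli, threshold=PLI_CONSTRAINT_THRESHOLD):
    """Return True if a gene's pLI exceeds the constraint threshold."""
    if not 0.0 <= pli <= 1.0:
        raise ValueError(f"pLI must lie between 0 and 1, got {pli}")
    return pli > threshold


# pLI values as stated in the text above.
pli_scores = {"CHD8": 1.0, "OR2T1": 0.0}

constrained = {gene: is_constrained(pli) for gene, pli in pli_scores.items()}
print(constrained)
```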
How to use ExAC
- for individual queries, access the online browser at http://exac.broadinstitute.org/
- first, type in one of:
  - gene symbols (e.g. PCSK9)
  - Ensembl or RefSeq transcript IDs (e.g. ENST00000407236)
  - rs IDs (e.g. rs1800234)
  - variant positions (e.g. 22-46615880)
  - regions of interest (e.g. 22:46615715-46615880)
- in the gene, transcript, and region view, we see:
  - Top left: gene name, number of variants, and links to other online resources and references
  - Top right: observed and expected number of variants of each functional class, and the pLI score
  - Middle: exonic coverage for the gene or transcript (a proxy for regional quality)
  - Below: table of all variants observed in this gene
    - for each variant, the chromosome, position, consequence, annotation, and allele frequency are provided
    - can filter by consequence (Missense + LoF, or LoF)
- in the variant view, we see:
  - Top left: ID, frequency, and links to other online resources
  - Top right: quality metrics
  - Middle left: functional consequence, and links to the gene and transcript
  - Middle right: frequencies in different global populations
  - Bottom: read-level data, for a low-level evaluation of quality
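The query formats accepted by the search box can be told apart by their shape. The sketch below guesses the query type with regular expressions; the patterns and the function `classify_query` are simplified assumptions for illustration and are not how the ExAC browser itself parses input (for example, real gene symbols may contain lowercase letters).

```python
import re

# Simplified, assumed patterns; checked in order, most specific first.
CHROMOSOME = r"(?:\d{1,2}|X|Y|MT?)"
QUERY_PATTERNS = [
    ("region", re.compile(rf"^{CHROMOSOME}:\d+-\d+$")),
    ("variant position", re.compile(rf"^{CHROMOSOME}-\d+$")),
    ("rs ID", re.compile(r"^rs\d+$")),
    ("Ensembl transcript ID", re.compile(r"^ENST\d{11}$")),
    ("RefSeq transcript ID", re.compile(r"^[NX]M_\d+(?:\.\d+)?$")),
    ("gene symbol", re.compile(r"^[A-Z][A-Z0-9-]*$")),
]


def classify_query(query):
    """Return a label for the first pattern the query matches, else 'unknown'."""
    text = query.strip()
    for label, pattern in QUERY_PATTERNS:
        if pattern.match(text):
            return label
    return "unknown"


# The example queries from the list above.
examples = [
    "PCSK9",
    "ENST00000407236",
    "rs1800234",
    "22-46615880",
    "22:46615715-46615880",
]
labels = [classify_query(q) for q in examples]
for query, label in zip(examples, labels):
    print(query, "->", label)
```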
Quick examples
- rs334, the causal variant in sickle cell anemia
  - note the differences in allele frequency between populations
- p.Phe508del in CFTR, the most common causal variant in cystic fibrosis
  - again, note the differences in allele frequency
  - because cystic fibrosis is recessive (there is no dominant mode of inheritance), the pLI for CFTR is 0
- KMT2A, a gene that, when disrupted, causes Wiedemann-Steiner syndrome
  - there are only 4 LoF variants in >60,000 individuals
  - highly constrained gene (pLI = 1)
Demo
For the following genes (TP53, ARID1B, NOD2, NRXN1):
- Determine the number of LoF variants in the canonical transcript
- Find the pLI score and determine if the gene is constrained
- Find the number of transcripts
- For the first missense variant in the gene, find the allele frequency in Non-Finnish Europeans
- Identify any exons not well-covered by the exome capture technology
Ensembl VEP for annotating large numbers of variants
- we can use the Ensembl VEP (Variant Effect Predictor) tool to annotate a large number of variants (for example, all variants in a single patient)
- VEP accepts a number of input formats, but the most common is the VCF format
http://www.ensembl.org/common/Tools/VEP
Demo
- For variants encoded in GRCh37, go here
- Paste the following into "Either paste data:":
12 49416554 . G GA . . .
18 53070914 . G A . . .
22 36142530 . AAGCGGCTGC A . . .
  Alternatively, you can upload a VCF to the website.
- Under Identifiers and frequency data, go to Frequency data for co-located variants, and select ExAC allele frequencies
- Click Run and wait. Click view results.
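Before submitting, it can help to check what kind of change each pasted line describes. The sketch below reads the three minimal VCF lines from this demo and labels each as an SNV, insertion, or deletion from the lengths of the reference and alternate alleles. The helper names `parse_vcf_line` and `variant_type` are assumptions for illustration; this does not call VEP or reproduce its annotation.

```python
# The three lines from the demo: CHROM POS ID REF ALT QUAL FILTER INFO.
VCF_TEXT = """\
12 49416554 . G GA . . .
18 53070914 . G A . . .
22 36142530 . AAGCGGCTGC A . . .
"""


def parse_vcf_line(line):
    """Parse the first five whitespace-separated VCF columns into a dict."""
    chrom, pos, _variant_id, ref, alt = line.split()[:5]
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


def variant_type(ref, alt):
    """Classify a variant by comparing reference and alternate allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "substitution"  # equal lengths greater than one base


records = [
    parse_vcf_line(line)
    for line in VCF_TEXT.splitlines()
    if line.strip() and not line.startswith("#")  # skip blanks and headers
]
types = [variant_type(r["ref"], r["alt"]) for r in records]
length_changes = [len(r["alt"]) - len(r["ref"]) for r in records]

for record, kind, change in zip(records, types, length_changes):
    print(record["chrom"], record["pos"], kind, change)
```

A length change that is not a multiple of three shifts the reading frame if the variant falls in coding sequence, which is one reason to look at indel sizes before reading the VEP output.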
Answer the following questions:
- What are the allele frequencies of each variant in ExAC?
- If the three variants are observed in the same patient, which variant is most likely to be diagnostic? Use constraint scores on the ExAC website as support.
Things to be aware of
- ExAC is not a collection of phenotypically healthy individuals: it includes individuals with schizophrenia, inflammatory bowel disease, diabetes, etc. (see here for more information)
  - for rare or severe developmental disorders, this should be less of an issue
- some regions appear to lack genetic variants simply because they were not well covered by earlier exome captures, so note the coverage!
- ExAC is still in development, so some variants may change between releases