caleblareau / gchromVAR

Cell type specific enrichments using finemapped variants and quantitative epigenetic data

Home Page: https://caleblareau.github.io/gchromVAR/

License: MIT License

R 100.00%

gwas epigenetics atac-seq single-cell

gchromvar's Introduction

gchromVAR

About:

Two outstanding challenges in the post-GWAS era are (1) the precise identification of causal variants within associated loci and (2) determining the exact mechanisms by which these variants result in the observed phenotypes, starting with identification of the pertinent cell types. To address (1), we used robust genetic fine mapping to identify hundreds of likely causal variants for 16 blood cell traits, allowing for up to 5 causal variants in each locus. We combined our fine-mapped results with high resolution open chromatin data for 18 primary hematopoietic populations and derived functional annotations to identify predicted target genes, mechanisms, and disease relevance. Moreover, we elucidate compelling anecdotes for the utility of this approach. To address (2), we developed a novel enrichment method (gchromVAR) that can discriminate between closely related cell types and score single cells for GWAS enrichment.

We've implemented gchromVAR as an R package for computing cell-type-specific GWAS enrichments from GWAS summary statistics and quantitative epigenomic data. This web resource and vignette compilation shows how to reproduce these results in hematopoiesis and how to run gchromVAR on other data sets.

Installation:

Once all of the dependencies for gchromVAR are installed, the package can be installed directly from GitHub by typing the following into an R console:

devtools::install_github("caleblareau/gchromVAR")

Application to hematopoiesis

We performed our genome-wide association studies (GWASs) in the UK Biobank (UKBB), which consists of 113,000 individuals of predominantly European descent in whom 16 blood cell traits have been measured. These blood cell traits represent several important and distinct hematopoietic lineages, including red blood cells (RBCs), platelets, lymphocytes (T and B cells), monocytes, eosinophils, and basophils (Figure 1, bottom). We overlaid these GWAS results with chromatin accessibility data derived from ATAC-seq (shown in the population tree, Figure 1, top).

Figure 1. Overview of hematopoiesis with cell types (top) and GWAS traits (bottom) explored thus far with gchromVAR.

Although several methods have been developed to calculate enrichment of genetic variation within genomic annotations such as changes in chromatin accessibility (Trynka et al. 2015, Finucane et al. 2015), resolving associations within the stepwise hierarchies that define hematopoiesis requires a method that takes into account both (1) the strength and specificity of the genomic annotation and (2) the probability of variant causality, accounting for LD structure. To these ends, we developed a new approach called genetic-chromVAR (gchromVAR), an adaptation of the recently described chromVAR method, to measure the enrichment of regulatory variants in each cell state using our fine-mapped genetic variants and quantitative genomic annotations (Fig. 2A). Briefly, this method weights chromatin features by variant posterior probabilities and computes the enrichment for each cell type versus an empirical background matched for GC content and feature intensity. We show that gchromVAR successfully identifies true enrichments of causal variants and is generally robust to variant posterior probability thresholds.
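The weighting step described above can be illustrated with a small base R sketch. This is not the package's implementation: the counts are simulated, `weights` stands in for the per-peak sum of variant posterior probabilities, and the background is built by permuting the weights across peaks rather than by matching peaks on GC content and feature intensity as the text describes. All object names are hypothetical.

```r
set.seed(1)
n_peaks <- 200
n_cells <- 4

# Simulated accessibility counts: peaks in rows, cell types in columns
counts <- matrix(rpois(n_peaks * n_cells, lambda = 20), nrow = n_peaks,
                 dimnames = list(NULL, paste0("celltype", seq_len(n_cells))))

# Hypothetical per-peak weights: summed posterior probabilities of the
# fine-mapped variants overlapping each peak (zero for most peaks)
weights <- numeric(n_peaks)
weights[sample(n_peaks, 15)] <- runif(15)

# Expected counts if each peak's accessibility were independent of cell type
expected <- outer(rowSums(counts) / sum(counts), colSums(counts))

# Weighted deviation of observed from expected accessibility, per cell type
weighted_deviation <- function(w) {
  observed_sum <- as.vector(crossprod(w, counts))
  expected_sum <- as.vector(crossprod(w, expected))
  (observed_sum - expected_sum) / expected_sum
}

observed_dev <- weighted_deviation(weights)

# Stand-in background: permute the weights across peaks (assumption; the
# method in the text instead matches on GC content and feature intensity)
n_iter <- 50
bg_dev <- replicate(n_iter, weighted_deviation(sample(weights)))

# Z-score of the observed deviation against the background, per cell type
z <- (observed_dev - rowMeans(bg_dev)) / apply(bg_dev, 1, sd)
names(z) <- colnames(counts)
print(round(z, 2))
```

Because the counts here are pure noise, the z-scores should scatter around zero; a real enrichment would appear as a large positive z-score for the relevant cell type.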

Figure 2. Schematic and overview of results when applying gchromVAR to bulk populations.

We applied gchromVAR to each of the 16 traits and 18 bulk ATAC-seq hematopoietic progenitor populations primarily sorted from the bone marrow of multiple healthy individual donors (Fig. 1A). We compared gchromVAR to two state-of-the-art methods: LDSR (Finucane et al. 2015), which calculates the enrichment for genome-wide heritability using binary annotations after accounting for LD and overlapping annotations, and goShifter (Trynka et al. 2015), which calculates the enrichment of tight LD blocks containing sentinel GWAS SNVs for binary annotations. Using a Bonferroni correction, gchromVAR identified 22 trait-cell type enrichments, LDSR identified 71, goShifter identified 39, and chromVAR identified 79 (Fig. 2C).
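For reference, a Bonferroni correction over all trait-cell type pairs can be sketched in base R. With 16 traits and 18 cell types there are 288 tests. The z-scores below are simulated placeholders, not results from the study, and both the one-sided test and the 0.05 level are assumptions made for illustration.

```r
set.seed(42)
n_traits <- 16
n_cell_types <- 18

# Simulated enrichment z-scores (placeholders only)
z <- matrix(rnorm(n_traits * n_cell_types), nrow = n_traits,
            dimnames = list(paste0("trait", seq_len(n_traits)),
                            paste0("celltype", seq_len(n_cell_types))))

# One-sided p-values: is the enrichment larger than expected under the null?
p <- pnorm(z, lower.tail = FALSE)

# Bonferroni adjustment across all 288 trait-cell type pairs
p_adj <- matrix(p.adjust(as.vector(p), method = "bonferroni"),
                nrow = n_traits, dimnames = dimnames(z))
significant <- p_adj < 0.05

# The same decision expressed as a per-test threshold
threshold <- 0.05 / (n_traits * n_cell_types)
print(sum(significant))
```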

To compare the performance of enrichment tools, we leveraged our knowledge of the hematopoietic system and devised a lineage specificity test. For any measured cell trait, we identified all possible upstream progenitors that could be passed through before terminal differentiation (Fig. 1A). For example, the differentiation of an RBC is generally thought to begin at the hematopoietic stem cell (HSC) and progress through multipotent progenitor (MPP), common myeloid progenitor (CMP), and megakaryocyte erythroid progenitor (MEP) before reaching the erythroid progenitor (Ery) stage. Thus, the lineage specificity test is a nonparametric rank-sum test that measures the ranking of lineage-specific trait-cell type pairs relative to the non-lineage-specific pairs for each of the compared methodologies. Using this metric for specificity, we found that gchromVAR vastly outperformed all three other methods (Fig. 2D). Additionally, we found that 21/22 (95%) of gchromVAR trait-cell type enrichments were supported by LDSR, all of which were lineage specific (Fig. 2C). For certain traits such as monocyte count, we found highly similar enrichment patterns for gchromVAR and LDSR, but non-lineage enrichments for chromVAR. For other traits, such as mean reticulocyte volume, gchromVAR identified only the two most terminally proximal cell types (MEP and Ery) as significantly enriched for the trait, whereas LDSR non-specifically identified 15/18 of the investigated cell types as enriched after Bonferroni correction. We note that we can improve the lineage specificity of LDSR by including all hematopoietic ATAC-seq annotations in the model as covariates, but this results in a loss of power.
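A minimal sketch of such a rank-sum comparison for a single trait is shown below, using `wilcox.test` from base R. The z-scores are made-up numbers, the set of cell types is abbreviated, and restricting the test to one trait is a simplification; the test described above ranks pairs across traits for each method.

```r
# RBC lineage as described in the text: HSC -> MPP -> CMP -> MEP -> Ery
rbc_lineage <- c("HSC", "MPP", "CMP", "MEP", "Ery")

# Hypothetical enrichment z-scores for one RBC trait (invented values)
z <- c(HSC = 1.2, MPP = 1.5, CMP = 2.1, MEP = 3.4, Ery = 4.8,
       GMP = 0.3, Mono = -0.6, CD4 = -1.1, CD8 = -0.9, B = -0.4)

in_lineage <- names(z) %in% rbc_lineage

# One-sided rank-sum test: do lineage-specific cell types rank higher
# than non-lineage cell types?
res <- wilcox.test(z[in_lineage], z[!in_lineage], alternative = "greater")
print(res$statistic)
print(res$p.value)
```

A method with good lineage specificity should give a small p-value here, because its strongest enrichments fall in cell types along the trait's differentiation path.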

Having validated our approach, we investigated cell type enrichments for each of the 16 traits. We found that the most lineage-restricted progenitor populations were typically most strongly enriched (Fig. 2E-H). For example, RBC count was most strongly enriched in erythroid progenitors (Fig. 2E), platelet count was most strongly enriched in megakaryocytes (platelet progenitors) (Fig. 2F), and lymphocyte count was most strongly enriched in CD4+ and CD8+ T cells (Fig. 2H). In several instances, we observed significant enrichments for traits in earlier progenitor cell types within each lineage, including enrichment for platelet traits in CMPs and enrichment for monocyte traits in a specific subpopulation of GMPs. Building on several studies that recently demonstrated transcriptomic and chromatin accessibility heterogeneity within these populations, we next sought to apply gchromVAR to single-cell ATAC-seq (scATAC-seq) data in order to interrogate the impact of common genetic variation underlying blood cell traits in heterogeneous cell populations.

Contact:

Caleb Lareau developed and maintains this package with Erik Bao and Jacob Ulirsch.

gchromvar's Issues

SE part with colData error

Thank you so much. I tried again and that step now works. I now get an error saying that the numbers of columns are different. Any suggestions? I built colData with DataFrame. The underlying table is an Excel sheet, saved as CSV, with the sample name E16WTc_nPN_peakchr.narrowPeak and the sample type E16.

What should the part colData = DataFrame(names = colnames(counts)) look like?

nrows <- 131417
ncols <- 10
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
countsFile <- counts
peaksFile <- "E16WTc_nPN_peakschr.narrowPeak"
peaksdf <- read.table(peaksFile)
peaks <- makeGRangesFromDataFrame(peaksdf, seqnames = "V1", start.field = "V2", end.field = "V3")
counts <- data.matrix(countsFile)
SE <- SummarizedExperiment(assays = list(counts = counts),
                           rowData = peaks,
                           colData = DataFrame(names = colnames(counts)))

Error in validObject(.Object) :
invalid class “SummarizedExperiment” object:
nb of cols in 'assay' (10) must equal nb of rows in 'colData' (0)
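
The error reports that colData has 0 rows, which is consistent with the simulated matrix having no column names: `colnames(counts)` is then `NULL`, and a data frame built from `NULL` is empty. The following base R sketch reproduces that behavior with a smaller matrix and shows one way to set the names first. The sample names are hypothetical, and `data.frame` is used here as a stand-in for `DataFrame`.

```r
# A matrix created with matrix() has no column names by default
counts <- matrix(runif(5 * 10, 1, 1e4), nrow = 5)
print(is.null(colnames(counts)))

# Building a data frame from NULL column names yields zero rows
empty <- data.frame(names = colnames(counts))
print(nrow(empty))

# Assign one (hypothetical) name per sample before building the annotation
colnames(counts) <- paste0("sample", seq_len(ncol(counts)))
col_data <- data.frame(names = colnames(counts))
print(nrow(col_data) == ncol(counts))
```

Once the column names are set, the same expression from the post, colData = DataFrame(names = colnames(counts)), should yield one row per column of the assay.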

count files and peak files

Dear Caleb,
Thank you for gchromVAR. I am trying to use it for our bulk ATAC-seq data. I have narrowPeak files and BAM files.

I have a basic question about the input files. In gchromVAR, what is the counts file and what is the peaks file? Can I load two files at the same time as the counts file and the peaks file? I am new to this, so please bear with my simple questions. Thank you very much.

Have a great day.

Renfang

Do I need to normalize the ATAC-seq matrix?

Hi,

I am using gchromVAR with snATAC-seq data to identify the cellular targets for a disease, and I have some fine-mapped SNPs.
I have clustered and annotated the cells and generated a pseudo-bulk ATAC-seq matrix of peaks by cell type.
Do I need to normalize this matrix before I use it as input?

Thanks!
Li

example code failed: computeWeightedDeviations(SE, ukbb)

I ran the example code from the gchromVAR User's Guide line by line. When it reaches the line:

ukbb_wDEV <- computeWeightedDeviations(SE, ukbb)

There is an error:

Error in nrow(counts_mat) == nrow(background_peaks) :
the leading minor of order 2 is not positive definite

question about annotations

Thank you for creating this nice package and its documentation.

Is it possible to use other ATAC-seq peaks as self-defined annotations to check cell-type-specific enrichment beyond hematopoietic cells?

Thanks!

Could you please tag your project

To promote reproducible science, could you please use git tags? Creating a tag also creates a release for your project. We require tagged releases when building scientific software, because pulling from master is not reproducible.

I would also recommend following the semantic versioning standard, Semver (https://semver.org/), with version numbers of the form Major.Minor.Patch. Please do not follow the git examples that put a "v" as the leading character. GitHub will create a "release" when the tag is pushed.

Thank you for making your software available

git tag 1.0.0
git push origin 1.0.0

gchromVAR peaks

Dear Caleb,

This is Renfang again. I have been trying gchromVAR for a few days. When I try to build the peaks, I get an error saying that it could not determine the start/end columns. Any suggestions? Thank you.

peaksdf <- data.frame(peaksFile)

peaks <- makeGRangesFromDataFrame(peaksdf, seqnames = "V1", start.field = "V2", end.field = "V3")
Error in .find_start_end_cols(df_colnames0, start.field0, end.field0) :
cannnot determine start/end columns
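
In the snippet above, `data.frame(peaksFile)` wraps the file path itself in a one-cell data frame, so there are no V1, V2, or V3 columns to find. A base R sketch of the difference follows; the two-line narrowPeak-like file is made up so that the example is self-contained.

```r
# Write a tiny, invented narrowPeak-like file to a temporary location
peaks_file <- tempfile(fileext = ".narrowPeak")
writeLines(c("chr1\t100\t200\tpeak1\t50",
             "chr1\t500\t650\tpeak2\t80"), peaks_file)

# data.frame() on a path stores the string itself: one row, one column
wrong <- data.frame(peaks_file)
print(dim(wrong))

# read.table() parses the file; without a header, columns are named V1, V2, ...
peaksdf <- read.table(peaks_file, header = FALSE, sep = "\t",
                      stringsAsFactors = FALSE)
print(colnames(peaksdf))
```

With `peaksdf` read this way, the call in the post that names V1, V2, and V3 as the seqnames, start, and end fields has the columns it refers to.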

gchromVAR installation

Dear Caleb, thank you very much for the help. I added the line for the column names and it worked! I am now at the importBedScore step. R said that there is no function importBedScore, so I tried to install the package again and got the error below. Is there another way to install it? Thank you!

devtools::install_github("caleblareau/gchromVAR")
Downloading GitHub repo caleblareau/gchromVAR@master
Skipping 1 packages not available: chromVARmotifs
Installing 3 packages: chromVARmotifs, motifmatchr, RSQLite
Installing packages into ‘C:/Users/rsong/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
Error: Failed to install 'gchromVAR' from GitHub:
(converted from warning) package ‘chromVARmotifs’ is not available (for R version 3.5.1)

What is the V5 column of the GWAS summary statistics in your vignette?

Hi, I am interested in your work. What does the fifth column of the GWAS summary statistics files in the vignette represent?
It is shown below.

  V1   V2       V3       V4      V5
1 chr1 25653526 25653527 region1 1.0000
2 chr1 24850597 24850598 region1 0.0242
3 chr1 24722451 24722452 region1 0.0103
4 chr1 24665802 24665803 region1 0.0096
5 chr1 24994340 24994341 region1 0.0096
6 chr1 25095653 25095654 region1 0.0061

I would also like to know how to produce GWAS summary statistics files in this format. Is there a script shipped with gchromVAR that does this, or do I need to make them myself? Thanks!
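
Since the method weights chromatin features by variant posterior probabilities, the fifth column in the excerpt is presumably the fine-mapping posterior probability of each variant, although this thread does not confirm it. Under that assumption, a file in the same five-column layout could be written from a fine-mapping table roughly as follows. The input column names (`chr`, `pos`, `region`, `pp`) are hypothetical, and treating the variant position as the interval end (with start = end - 1) is also an assumption.

```r
# Hypothetical fine-mapping output: one row per variant
finemap <- data.frame(
  chr    = c("chr1", "chr1", "chr2"),
  pos    = c(25653527L, 24850598L, 1000L),   # integer positions
  region = c("region1", "region1", "region2"),
  pp     = c(1.0000, 0.0242, 0.5),           # posterior probabilities
  stringsAsFactors = FALSE
)

# Five-column, BED-like layout: chr, start, end, region, weight
bed <- data.frame(
  chr    = finemap$chr,
  start  = finemap$pos - 1L,
  end    = finemap$pos,
  region = finemap$region,
  pp     = finemap$pp,
  stringsAsFactors = FALSE
)

# Tab-separated, no header and no row names
bed_file <- tempfile(fileext = ".bed")
write.table(bed, bed_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read it back to confirm the layout
check <- read.table(bed_file, header = FALSE, sep = "\t",
                    stringsAsFactors = FALSE)
print(check)
```

Keeping the positions as integers avoids scientific notation in the written file.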

importBedScore function

Dear Caleb,
I installed gchromVAR and ran importBedScore, but it gives me the error below. Any suggestions? Thank you so much!

nrows <- 131417
ncols <- 10
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
countsFile <- counts
colnames(countsFile) <- paste0("E16WTc_nPN_peakschr.narrowPeak", as.character(1:dim(countsFile)[2]))
peaksFile <- "E16WTc_nPN_peakschr.narrowPeak"
peaksdf <- read.table(peaksFile)
peaks <- makeGRangesFromDataFrame(peaksdf, seqnames = "V1", start.field = "V2", end.field = "V3")
counts <- data.matrix(countsFile)
SE <- SummarizedExperiment(assays = list(counts = counts),
                           rowData = peaks,
                           colData = DataFrame(names = colnames(counts)))

SE <- addGCBias(SE, genome = BSgenome.Mmusculus.UCSC.mm10)
E16 <- importBedScore(rowRanges(SE), peaks1, colidx = 5)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘importBedScore’ for signature ‘"GRanges", "GRanges"’

Which GWAS SNPs should I use?

I ran through the vignette using my scATAC-seq data for the peaks, and I downloaded fine-mapped posterior probabilities for a few traits of interest from CausalDB. However, the results look strange: the Z-scores and deviations are extremely high for a few cells and around zero for the remaining cells.

I am wondering whether I am supposed to have posterior probabilities for every single SNP included in the GWAS, or just for the loci that reached genome-wide significance and were followed up with Bayesian fine-mapping. For instance, the GWAS for Alzheimer's Disease (Jansen et al. 2019) has over 1 million SNPs profiled in the GWAS itself, but CausalDB only has fine-mapping posterior probabilities for ~10k SNPs.

input without weight

Hi,
Is it possible to input just the SNP locations without the weights? The SNP file is from a public source and does not include weights.
Thanks!