team_bloodies's Introduction

This is the repository for the group project of Team Bloodies.

Project: Data-driven analysis of the potential candidate transcription factors in hematopoietic stem cell differentiation into multiple progenitor compartments.

Links to:

Proposal
Progress Report
Poster

Members and division of labor

Name | Initial work assignment | Affiliation | Expertise
Annie Cavalla | TF motif enrichment analysis | Bioinformatics | Cancer genomics
Rawnak Hoque | RNA-seq analysis and TF motif enrichment analysis | Genome Science and Technology | Genome scale data analysis
Fangwu Wang | DNA methylation analysis, TF clustering | Medical Genetics | Stem cell biology
Somdeb Paul | DNA methylation analysis | Genome Science and Technology | Transcriptomics

Rationale: Human hematopoietic stem cells (HSCs) hold great clinical promise for curative HSC transplantation therapies for numerous hematologic malignancies and diseases. Understanding the mechanisms that regulate HSC self-renewal and lineage restriction is crucial for improving transplantation regimens. HSCs are thought to become lineage-restricted in multiple steps as they pass through successive progenitor populations; during this process the binary myeloid vs. lymphoid decision is made, and subsequent progeny are restricted to one fate. In this project, we are interested in the epigenomic status of HSCs and other progenitor populations and how it interacts with transcription factor binding to regulate the lineage differentiation program.

Data source:

Our dataset includes matched DNA methylation (bisulfite-seq) and RNA-seq data from HSCs and 5 other progenitor cell types, obtained from a recent publication (Farlik M. et al., Cell Stem Cell, 2016) that characterized the differentiation path of HSCs based on DNA methylation profiles.

How our strategy differs from the published paper: To more rigorously identify TFs with a potential function in cell differentiation, we annotated DNA methylation using both promoters and customized enhancers. The enhancer regions were defined for two hematopoietic cell lines (K562, GM12878) using the Genome Segment ChromHMM tracks (UCSC Table Browser).

Data replicate summary:

Cell Type | Replicates for Methylation | Replicates for RNA
HSC | 3 | 1
MPP | 3 | 2
MLP | 3 | 2
CMP | 3 | 1
GMP | 3 | 2
CLP | 3 | 1

Workflow: We first analyzed differential DNA methylation of 5 pairwise comparisons in the annotated promoter and enhancer regions using RnBeads. The biological meaning of the 5 pairwise comparisons:

Comparison | Biological Meaning
HSC-MPP | loss of long-term regeneration potential
MPP-CMP | multipotent to myeloid commitment
MPP-MLP | multipotent to lymphoid commitment
CMP-MLP | difference between myeloid and lymphoid on the CMP-MLP level
GMP-CLP | difference between myeloid and lymphoid on the GMP-CLP level

We then took the low-methylated regions of each cell type in each comparison (defined by a > 40% methylation difference in the pairwise comparison), searched them for enriched transcription factor binding motifs using the HOMER motif-finding tools, and generated a list of data-driven candidate TFs for each population from each comparison.
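The region selection described above can be illustrated with a minimal Python sketch. It is not the project's code (the actual analysis used RnBeads in R); the `RegionDiff` record, the function name and the example region ids are hypothetical, and the only value taken from the text is the 40% difference threshold.

```python
from typing import NamedTuple


class RegionDiff(NamedTuple):
    region_id: str
    mean_beta_a: float  # mean methylation of the region in cell type A (0..1)
    mean_beta_b: float  # mean methylation of the region in cell type B (0..1)


def split_low_methylated(regions, min_diff=0.40):
    """Return (low_in_a, low_in_b): ids of regions whose mean methylation
    differs by more than min_diff, assigned to the less methylated cell type."""
    low_in_a, low_in_b = [], []
    for r in regions:
        diff = r.mean_beta_a - r.mean_beta_b
        if diff < -min_diff:
            low_in_a.append(r.region_id)
        elif diff > min_diff:
            low_in_b.append(r.region_id)
    return low_in_a, low_in_b


# Made-up values for illustration only
example = [
    RegionDiff("enhancer_1", 0.10, 0.75),  # low in A
    RegionDiff("promoter_7", 0.80, 0.20),  # low in B
    RegionDiff("promoter_9", 0.50, 0.60),  # difference below the threshold
]
low_a, low_b = split_low_methylated(example)
print(low_a, low_b)
```

Each of the two resulting lists would then be written out as a region file and used as input to motif enrichment.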

We analyzed the genes shared between the DNA methylation and RNA expression results to see whether low methylation correlates with high expression. We inspected the expression of the TFs identified from motif enrichment to see whether they are highly expressed in the corresponding population. Then we used the expression of the TFs identified from the CMP/MLP comparison (representing the myeloid and lymphoid lineages) to cluster the leukemia samples and see whether samples of the same leukemia type group together.
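As a rough illustration of the methylation/expression overlap, the sketch below intersects a list of low-methylated genes with genes that are up-regulated in the same population. The function name, gene labels and the log2 fold-change cutoff of 1 are assumptions made for this example, not values taken from the project code.

```python
def low_meth_high_expr(low_methylated_genes, log2_fold_change, min_log2fc=1.0):
    """Genes that are low-methylated AND up-regulated in the same population.

    low_methylated_genes: iterable of gene ids
    log2_fold_change:     dict mapping gene id -> log2 fold change
    """
    return sorted(
        g for g in set(low_methylated_genes)
        if log2_fold_change.get(g, 0.0) >= min_log2fc
    )


# Placeholder gene ids and values
low_meth = ["GENE_A", "GENE_B", "GENE_C"]
expr_fc = {"GENE_A": 2.3, "GENE_B": -0.4, "GENE_D": 3.1}
shared = low_meth_high_expr(low_meth, expr_fc)
print(shared)
```

Genes missing from the expression table are treated as unchanged here, which is one possible convention among several.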

Analysis and Major Findings:

RnBeads analysis of pairwise comparison:
a. Beta-value distribution and variation
b. PCA
c. Clustering
d. Differentially methylated regions
e. Correlation with RNA expression
Methods:
f. Data preparation: replicate merging
g. Enhancer annotation-code
h. RnBeads: all samples and pairwise comparison (CLP-GMP as an example)
i. intersection between DNA/RNA gene lists-code

a. Sanity check: sample-sample correlation, heatmap clustering
b. Differentially expressed gene lists
Methods:
c. Data processing and gene id conversion

a. Results
Differential gene table
Methods:
b. limma

a. TFs found at Enhancer
b. TFs found at Promoter
Methods
c. Input files
d. HOMER Findingmotif tool

a. Normal samples CMP/MLP
b. Leukemia samples AML/CLL
Methods
c. TF list feeding into expression

team_bloodies's Issues

readme links

@santina @singha53 I just noticed that some of our links in the main readme file were empty, due to a mistaken commit conflict of our group members. Our last commit was 11:59pm, Friday, April 7th. Is it allowed to only edit the links to the readme file in a new commit?

Analyzing RNASeq data without replicates

@singha53 @santina
Hi,
Here https://github.com/STAT540-UBC/team_Bloodies/tree/master/Data/RNA-seq/Normal/
is a GSE87195_rnaseq_ensT_all.csv file with RNA-seq count data for ~60000 transcripts from 13 samples. Since I do not have replicates, I would like to perform only pairwise comparisons. Do I need to perform any statistical analysis before the comparison? Could you please suggest some tools or statistical approaches I can use at this point? I see that many of the cells contain zero values. Should I get rid of the zeros? Thanks.

Progress Report of Bloodies

What has changed based on the final proposal

We are using the same datasets as proposed but have eliminated a few samples from the analysis. For example, we removed one outlier sample from the normal cell RNA-seq data and randomly chose 7 out of 11 samples from the AML dataset so that the two groups (AML and CLL) have equal numbers of samples. Also, we noticed that the MEP population we planned to include in our analysis is missing from the RNA-seq data. The major conceptual change of this project is that we will base the transcription factor binding analysis more on the differential DNA methylation results than on the RNA-seq data, since there are no replicates in the RNA-seq data, as mentioned in our proposal, and we suspect there is high variability in this data that we cannot effectively measure. The task assignment of group members remains the same, except that Fangwu has taken on the data preparation for the DNA methylation and RNA-seq data.

What is the progress of the analyses

I. DNA methylation analysis of the seven normal progenitor populations

Our DNA methylation (bisulfite-seq) and RNA-seq data are from the published BLUEPRINT Epigenome project [1] and were downloaded from GEO (GSE 87197). The RNA-seq data are deposited in our Data folder, and the code for getting the methylation data with GEOquery is provided here. There are three biological replicates for each of the seven populations. The data were already aligned but not quantified, in BED format, and we first merged technical replicates to increase the overall coverage. We used bedtools and R to add up the reads from technical replicates (aliquots from the same donor condition), which resulted in three merged data sets (three biological replicates) for each cell type (details and code for merging are shown here).
Then we used the RnBeads software to conduct the DNA methylation analysis, including data import, QC, methylation quantification, region annotation and differential methylation analysis [2].
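The replicate merging step can be pictured with a small Python sketch. The project itself used bedtools and R for this; the version below is only a stand-in that assumes each technical replicate is available as a mapping from a CpG position to its (methylated, unmethylated) read counts. The function name and the example positions are hypothetical.

```python
from collections import defaultdict


def merge_replicates(replicates):
    """Sum (methylated, unmethylated) read counts per CpG position
    across technical replicates.

    replicates: list of dicts mapping (chrom, pos) -> (meth, unmeth)
    """
    merged = defaultdict(lambda: [0, 0])
    for rep in replicates:
        for site, (meth, unmeth) in rep.items():
            merged[site][0] += meth
            merged[site][1] += unmeth
    return {site: tuple(counts) for site, counts in merged.items()}


# Two aliquots from the same donor; counts are made up
rep_50_cells = {("chr1", 10468): (1, 0), ("chr1", 10470): (0, 1)}
rep_1000_cells = {("chr1", 10468): (2, 1), ("chr1", 10483): (1, 0)}
merged = merge_replicates([rep_50_cells, rep_1000_cells])
print(merged)
```

Sites seen in only one aliquot keep their original counts, so merging raises coverage only where the aliquots overlap.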

The datasets, combined into bed files by cell type, were then passed to the RnBeads package to analyze the methylation profiles. The code for the RnBeads analysis can be found here and the annotation file here. The data files cannot be shared, however, due to the size restriction on GitHub. Prior to running the analysis, we set certain memory-control arguments in rnb.options() to use fewer system resources.

The bed files were imported using bismarkCov as the bed style, since they were combined from Bismark's coverage file output. This style was chosen because the files contain methylation site information on the chromosomes, including the methylated and unmethylated signal intensities, which were used to calculate the density plot of beta values. We also disabled Greedycut, as it consumes a lot of memory and slows the analysis down. Finally, just before running the analysis, we set the column for differential methylation comparison to the cell type, as we wanted to compare methylation profiles across cell types.
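To make the relation between the coverage file, coverage and beta values concrete, here is a minimal Python sketch. It assumes the standard tab-separated Bismark coverage layout (chromosome, start, end, methylation percentage, methylated count, unmethylated count); the function name and the example line are hypothetical, and RnBeads performs this step internally in the actual analysis.

```python
def parse_bismark_cov_line(line):
    """Parse one line of a Bismark coverage file and derive coverage and beta.

    coverage = methylated + unmethylated reads at the site
    beta     = methylated / coverage (None when the site has no reads)
    """
    chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
    meth, unmeth = int(meth), int(unmeth)
    coverage = meth + unmeth
    beta = meth / coverage if coverage else None
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "coverage": coverage,
        "beta": beta,
    }


site = parse_bismark_cov_line("chr1\t10468\t10468\t75.0\t3\t1\n")
print(site)
```

A site covered by 4 reads, 3 of them methylated, gets a beta value of 0.75; with coverage near 1 per CpG the beta value can only be 0 or 1, which is why low coverage makes the estimates unreliable.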

II. RNA-seq analysis - normal progenitors

The RNA-seq data contains 62,589 rows (transcripts) and 14 columns (samples). The values represent estimated reads from the BitSeq software. We first inspected the data by visualizing the sample-sample correlation in a heatmap, and we saw an outlier sample, “RNA_D2_GMP_100”, that showed relatively poor correlation with the other samples, even with the sample of the same cell type, “RNA_D1_GMP_100”. The R script for this part is here.
Then we took out this outlier sample and also removed several populations that we are not looking at (HSCbm, MLP0, MLP2, MLP3). There is no replicate available for the HSC, CMP, GMP, CLP populations and there are two replicates for MPP and MLP.
We filtered out transcripts whose total reads across all samples were less than 50. We found large differences between the sample means, suggesting that the effective library sizes differ among samples, so we used the DESeq size factor function to adjust for the discrepancy in library sizes. After this adjustment, the column means were closer to each other. We then clustered the 2000 most abundantly expressed transcripts in a heatmap to see whether the cell types clustered according to the differentiation hierarchy. Unexpectedly, the clustering pattern did not fully match the differentiation hierarchy. MPP and HSC grouped closely together, indicating similar stem cell expression profiles. The GMP population clustered with the lymphoid progenitors MLP and CLP, which is consistent with recent findings that these progenitors are not completely separated in their functions and there might be “cross-activities” in their lineage output. Another explanation could be that the noise in the samples is high and cannot be accounted for due to the lack of replicates. Also, the most abundant genes may not necessarily reflect the “molecular signatures” of these samples. Without biological replicates, we looked at the fold change of expression values between samples for exploratory purposes only. The R code for this part is here.
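The filtering and library-size adjustment can be sketched in Python as follows. The size-factor function is a plain reimplementation of the median-of-ratios idea behind DESeq's size factors, not a call to DESeq itself, and the function names and toy counts are assumptions for illustration; the threshold of 50 total reads is the one stated above.

```python
import math
from statistics import median


def filter_low_counts(counts, min_total=50):
    """Keep transcripts whose total reads across all samples reach min_total."""
    return {t: row for t, row in counts.items() if sum(row) >= min_total}


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column).

    For each transcript the geometric mean across samples serves as a
    reference; a sample's size factor is the median ratio of its counts
    to that reference. Transcripts with a zero count are skipped because
    their geometric mean is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Toy counts: transcript -> [sample_1, sample_2]; sample_2 is sequenced twice as deep
raw = {"t1": [100, 200], "t2": [50, 100], "t3": [10, 20], "t4": [5, 0]}
kept = filter_low_counts(raw)
factors = size_factors(kept)
normalized = {t: [c / f for c, f in zip(row, factors)] for t, row in kept.items()}
print(factors, normalized)
```

After dividing by the size factors, the two toy samples have matching values for every retained transcript, which mirrors the observation that the column means moved closer together.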

III. RNA-seq analysis - leukemia samples

We obtained two types of leukemia samples (AML and CLL) from the BLUEPRINT consortium as a complementary analysis and validation data for the normal RNA-seq data. There are seven patient samples in each group. A list of differentially expressed genes between the two groups was generated using the limma and edgeR Bioconductor packages. We first normalized the data and fitted a linear model. We created a mean difference plot displaying the log-fold-changes and average log expression values for each gene. We also inspected the p-value distribution. With this linear model, the differentially expressed genes were detected.

Results

I. DNA methylation analysis of the seven normal progenitor populations

The pre-processing of the samples produced a beta (β) value density plot, which we used to set the criteria for removing samples that do not have good enough coverage. The plot can be found here. We also performed quality control on the data to summarize the coverage in each sample cell type; the resulting spreadsheet file can be found at this link. The coverage distribution in each of the cell types can also be visualized using the following violin density plots - A and B. The analysis also produced differential methylation results and PCA clustering, but we have yet to analyze these outputs.

II. RNA-seq analysis - normal progenitors

We calculated the mean value of the two replicates for the two populations that have them (MPP, MLP), transformed the data to log2 scale, and calculated the “log2FC” for each pairwise comparison. Lists of genes with over 2-fold changes were generated for the pairwise comparisons of HSC vs. MPP, MPP vs. CMP, MPP vs. CLP, CLP vs. MLP, and CMP vs. GMP. These pairs were chosen because they are directly related in the differentiation hierarchy.
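The fold-change calculation can be written out as a short Python sketch: average the replicates where they exist, move to log2 scale, and subtract. The pseudocount added before the log transform is an assumption made here to keep zero counts finite; the text does not say how zeros were handled, and the function name and example values are hypothetical.

```python
import math


def log2_fold_change(values_a, values_b, pseudocount=1.0):
    """log2(mean of B) - log2(mean of A), with a pseudocount to avoid log2(0)."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    return math.log2(mean_b + pseudocount) - math.log2(mean_a + pseudocount)


# One HSC value versus two MPP replicates for a single transcript (made-up numbers)
fc = log2_fold_change([15.0], [60.0, 66.0])
over_two_fold = abs(fc) > 1.0  # a 2-fold change corresponds to |log2FC| > 1
print(fc, over_two_fold)
```

With a single value per group, this number carries no estimate of variability, which is why it is only suitable for exploration.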

III. RNA-seq analysis - leukemia samples

Using edgeR and limma, we created a mean difference plot displaying the log-fold-changes and average log expression values for each gene. From this plot, we found that the genes with high fold changes between the two groups tend to be moderately or lowly expressed.
We generated a histogram of the adjusted p-value distribution (shown as P.Value in the plot). The values are not uniformly distributed, which suggests that the null hypothesis does not hold for many genes, i.e. there are significant differences between the two leukemia groups (AML and CLL). With cutoffs of 0.05 for the adjusted p-value and 1 for logFC, around 9000 genes were called differentially expressed.
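The thresholding step can be illustrated in Python. The sketch assumes the p-values are adjusted with the Benjamini-Hochberg procedure and that the logFC cutoff is applied to the absolute value; both are assumptions about the analysis rather than facts stated above. Function names, gene labels and numbers are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_differential(genes, p_values, log_fc, alpha=0.05, min_abs_logfc=1.0):
    """Genes passing both the adjusted p-value and the |logFC| cutoff."""
    adj = benjamini_hochberg(p_values)
    return [
        g for g, q, fc in zip(genes, adj, log_fc)
        if q < alpha and abs(fc) >= min_abs_logfc
    ]


genes = ["g1", "g2", "g3", "g4"]
p_vals = [0.001, 0.02, 0.03, 0.5]
log_fcs = [2.5, -1.4, 0.3, 3.0]
adj_p = benjamini_hochberg(p_vals)
hits = call_differential(genes, p_vals, log_fcs)
print(adj_p, hits)
```

In the toy data, g3 passes the p-value cutoff but not the fold-change cutoff, and g4 has a large fold change without statistical support, so neither is reported.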

Challenges

A major problem with the RNA-seq analysis is that there are no replicates for many samples, so variation-based statistical analyses are not applicable. We therefore decided to look at the fold change of the expression values as an exploratory process, without drawing any definitive conclusions from this data.
Another challenge, with the DNA methylation data, is that each biological sample consists of many aliquots with small numbers of cells (1, 10, 50, 1000 cells), and the coverage is very low (around 1 per CpG site) due to the low cell input. We therefore merged the data from the samples with 50 and 1000 cells to increase the overall coverage. A drawback of this manipulation is that there might be technical variation within the replicates, although they are from the same batch and the same flowcell. In some cases the intersection between replicates is only a small proportion of the total reads, so merging does not substantially increase the coverage.

References:
  1. Farlik M, Halbritter F, Müller F, et al. DNA Methylation Dynamics of Human Hematopoietic Stem Cell Differentiation. Cell Stem Cell. 2016 Dec 1;19(6):808-822. PMID: 27867036
  2. Assenov Y, Müller F, Lutsik P, et al. Comprehensive analysis of DNA methylation data with RnBeads. Nat Methods. 2014 Nov;11(11):1138-40. PMID: 25262207

RNAseq for leukemia sample

@fangwuwang @acavalla @psomdeb25
Here is the table of the top 9863 genes that are differentially expressed in the leukemia samples. The genes are ranked in descending order. I could not do much for the normal data, though, as there are no replicates. But a single heatmap file might give us some idea about the data quality.

Update on Wednesday's Class :

Hey,

I was unable to make it to this Wednesday's class and seminar, as I had an appointment. Was there anything specific related to the project that was discussed?

Thanks!
Somdeb

Regarding poster

@santina @singha53

Can the font size 9-14 be visible on the poster? On powerpoint, I am unable to understand whether the final content would be visible.

Thanks!

Issue with RnBeads:

Following is the code that I am running on R Studio with the RnBeads package:

library(RnBeads)

# Directory containing the combined bed files and the sample annotation sheet
data_dir <- "/Volumes/Dark/Study/University of British Columbia/Courses/GSAT 540/Project/Combined Data"
# file.path() only builds the path string; it does not read or attach the file
annotation <- file.path(data_dir, "annotation.csv")
data.source <- c(data_dir, annotation)

# Directory where the analysis output is written
analysis.dir <- "/Volumes/Dark/Study/University of British Columbia/Courses/GSAT 540/Project/analysis1.2"

# Directory where the report files are written
report.dir <- file.path(analysis.dir, "reports_details")

rnb.initialize.reports(report.dir)
rnb.options(import.bed.style = "bismarkCov")  # bed files follow Bismark's coverage format
rnb.options("import.bed.columns")             # print the column mapping currently in use
rnb.options(filtering.greedycut = FALSE)      # FALSE instead of F, which can be reassigned
rnb.options(differential = FALSE)

# Set some analysis options
# rnb.options(filtering.sex.chromosomes.removal = TRUE, identifiers.column = "bedFile")
# logger.start(fname = NA)

# Set up the RnBSet object by importing the bed files
result <- rnb.run.import(data.source = data.source, data.type = "bs.bed.dir", dir.reports = report.dir)

The rnb.run.import() call stops midway. Is there a way I can look at the bed files to check whether the data is arranged in order?
@fangwuwang @singha53 @santina

Thanks!
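One way to inspect the ordering of a bed file is a short script that reports the first line breaking chromosome/start order. The Python sketch below is only a diagnostic aid with a hypothetical function name; whether ordering is what makes rnb.run.import() stop is not established in this thread.

```python
def first_unsorted_line(lines):
    """Return the 1-based number of the first line that breaks
    (chromosome block, start position) order, or None if none does."""
    seen_chroms = set()
    prev_chrom, prev_start = None, -1
    for n, line in enumerate(lines, start=1):
        # Skip blank lines and common header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start = fields[0], int(fields[1])
        if chrom != prev_chrom:
            if chrom in seen_chroms:
                return n  # a chromosome reappears after another one
            seen_chroms.add(chrom)
            prev_chrom = chrom
        elif start < prev_start:
            return n  # start position goes backwards within a chromosome
        prev_start = start
    return None


sorted_bed = ["chr1\t10\t11\n", "chr1\t25\t26\n", "chr2\t5\t6\n"]
unsorted_bed = ["chr1\t10\t11\n", "chr1\t5\t6\n"]
print(first_unsorted_line(sorted_bed), first_unsorted_line(unsorted_bed))
```

For a file on disk, passing the open file handle works as well, for example `with open(path) as fh: first_unsorted_line(fh)`, where `path` stands for one of the combined bed files.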

Data format

Sorry I could not come to the seminar today. I have downloaded the data but could not push it to the repo because of the large data size. Is there any way to upload data with a size of 100-200 MB to GitHub?
For your information, the data format is as below:
Bigwig for Bisulfite-seq (also defined hyper/hypo-me regions in bigbed format);
Processed RNA-seq with quantitation on gene/transcript level in txt format (contig data in Bigwig also available if we don't trust their processing);
Processed ChIP-seq in bigbed format (bigwig data also available but much bigger).

The available data for each sample is summarized in the Excel doc uploaded earlier. I have all the data on my PC already and will find a way to pass it to everyone later.

What do you mean by coverage?

@singha53

I am doing DNA methylation analysis on my dataset. A lot of places have mentioned coverage. But I am not quite able to grasp the concept of what coverage is in a DNA Methylation data. I tried looking up literature but I am not able to get a clear meaning for the same.

Thanks!

Poster

@rawnakhoque @fangwuwang @psomdeb25

If people want to paste links here of bits they think we should include in the poster (final plots that look pretty), I can collate them into a poster and add descriptions of what they represent.

Thanks!

How to interpret a Principal Component Analysis.

@singha53 @santina @ppavlidis @farnushfarhadi

We were looking at some principal component analyses between cell types for our DNA methylation data. The RnBeads analysis has produced three principal components for each region type (promoters, genes, and enhancers). When we look at the PCA plot of PC1 against PC2, the clustering is not very clear. But for PC2 against PC3, the difference in clustering is prominent.

Could you clarify how we should interpret what the different principal components mean?

I referred to this link for an example. But it is not very clear.

This is also a bit difficult to follow.

Thanks!

Homer Findingmotifs TFBS

@rawnakhoque I asked the PDF in our lab and he showed me that everything is done in bash. Follow the installation and basic configuration step by step here. As shown on the webpage, genome configuration is done using this line (see the Download Homer Packages section) -- perl /path-to-homer/configureHomer.pl -install hg19
And to run the analysis there is only one line (link) -- findMotifsGenome.pl <peak/BED file> <genome> <output directory> -size # [options]
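For scripting this step per comparison, the command can be assembled programmatically. The Python sketch below only builds the argument list and prints it; it assumes HOMER's documented calling convention of region file, genome, and output directory followed by options. The BED file name, the output directory name and the size of 200 are placeholders.

```python
import shlex


def homer_motif_command(bed_path, genome="hg19", out_dir="homer_out", size=200):
    """Build the findMotifsGenome.pl argument list for one region file."""
    return ["findMotifsGenome.pl", bed_path, genome, out_dir, "-size", str(size)]


cmd = homer_motif_command("CMP_low_methylated_enhancers.bed")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a machine where HOMER is installed and on the PATH.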

Initial feedback

Name | Department/Program | Expertise/Interests | GitHub ID
Annie Cavalla | Bioinformatics | Cancer genomics, single cell transcriptomics | @acavalla
Rawnak Hoque | Genome Science and Technology | Genome scale data analysis | @rawnakhoque
Somdeb Paul | Genome Science and Technology | Transcriptomics | @psomdeb25
Fangwu Wang | Medical Genetics | Stem cell biology, Epigenomics | @fangwuwang

Team name: Bloodies

Project summary: Our project is interested in how hematopoietic stem cells, a rare stem cell population able to regenerate all erythroid, myeloid and lymphoid lineages in humans, make cell fate decisions during multiple-stage differentiation. We will obtain RNA-seq, DNA Methylome and ChIP-seq data from a European public resource (BLUEPRINT: http://dcc.blueprint-epigenome.eu/#/home). We will study the transcription factors (TFs) specifically expressed in one cell type (transcriptional signature) and the epigenomic features of different cells.

The preliminary plan includes:

  1. Using RNA-seq data to find signature genes and generate a list of essential TFs for the development of each cell type;
  2. Using methylome and ChIP-seq of histone marks to identify active/poised promoter and enhancer regions, and recognize the TF binding motif within these regions to infer the important transcriptional regulation during differentiation;
  3. Correlating these promoter/enhancer regions from (2) with gene expression to verify the transcriptional regulation;
  4. Constructing a TF network for lineage differentiation using these datasets and known TF interactions from literature.
    There will be comparisons and statistical analyses involved in each step.

Contribution summary

Hi all @rawnakhoque @psomdeb25 @fangwuwang

We also have to produce a 'contribution summary'. I've written my section and uploaded it into the main repo. I left the instructions of how to write it in there, so maybe leave those there until the last person has finished their part. This part is due tomorrow night at 23:59 I believe.

Thanks!
Annie
