Home Page: https://hbctraining.github.io/DGE_workshop/

dge_workshop's Introduction

THIS REPO IS ARCHIVED, PLEASE GO TO https://hbctraining.github.io/main FOR CURRENT LESSONS.

Differential gene expression workshop

Audience	Computational skills required	Duration
Biologists	Introduction to R	1.5-day workshop (~10 hours of trainer-led time)

Description

This repository has teaching materials for a 1.5-day, hands-on Introduction to differential gene expression (DGE) analysis workshop. The workshop will lead participants through performing a differential gene expression analysis workflow on RNA-seq count data using R/RStudio. Participants need either a working knowledge of R or to have completed the Introduction to R workshop.

Learning Objectives

QC on count data using Principal Component Analysis (PCA) and hierarchical clustering
Using DESeq2 to obtain a list of significantly different genes
Visualizing expression patterns of differentially expressed genes
Performing functional analysis on gene lists with R-based tools

These materials are developed for a trainer-led workshop, but are also amenable to self-guided learning.

Lessons

Below are links to the lessons and suggested schedules:

Installation Requirements

Download the most recent versions of R and RStudio for your laptop:

R
RStudio

Install the following packages using the instructions provided below.

NOTE: When installing the following packages, if you are asked to select (a/s/n) or (y/n), please select "a" or "y" as applicable, but be aware that installation can take a while.

(a) Install the below packages on your laptop from CRAN. You DO NOT have to go to the CRAN webpage; you can use the following function to install them one by one:

install.packages("insert_first_package_name_in_quotations")
install.packages("insert_second_package_name_in_quotations")
# ... and so on

Packages to install from CRAN (note that these package names are case sensitive!):

BiocManager
RColorBrewer
pheatmap
ggrepel
devtools
tidyverse
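
As a rough sketch, the CRAN packages listed above can also be checked in one pass before installing anything. The helper name `missing_packages` is hypothetical and not part of the workshop materials; the install call is left commented out so that nothing is downloaded unless you choose to run it.

```r
# CRAN packages named in the list above
cran_pkgs <- c("BiocManager", "RColorBrewer", "pheatmap",
               "ggrepel", "devtools", "tidyverse")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(cran_pkgs)
print(to_install)

# Uncomment to install whatever is still missing:
# install.packages(to_install)
```

install.packages() accepts a character vector, so the one-by-one calls shown earlier and this single call should be equivalent in effect.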

(b) Install the below packages from Bioconductor, calling the BiocManager::install() function once for each of the 9 packages:

BiocManager::install("insert_first_package_name_in_quotations")
BiocManager::install("insert_second_package_name_in_quotations")

Packages to install from Bioconductor (note that these package names are case sensitive!):

DESeq2
clusterProfiler
DOSE
org.Hs.eg.db
pathview
DEGreport
EnsDb.Hsapiens.v86
AnnotationHub
ensembldb

Finally, please check that all the packages were installed successfully by loading them one at a time using the library() function.

library(DESeq2)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
library(ggrepel)
library(clusterProfiler)
library(DEGreport)
library(org.Hs.eg.db)
library(DOSE)
library(pathview)
library(tidyverse)
library(EnsDb.Hsapiens.v86)
library(AnnotationHub)
library(ensembldb)

Once all packages have been loaded, run sessionInfo().

sessionInfo()

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

dge_workshop's Issues

DESeq2 documentation links

Add links to DESeq2 vignette: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Add links to Mike's book: http://genomicsclass.github.io/book/pages/rnaseq_gene_level.html

add attr() function to look at design matrix

attr(dds, "modelMatrix")

change as.data.frame to data.frame

change all instances for consistency

plots in deseq2

Hi,
First, totally new to rnaseq analysis... figured out the command line portion (star, stringtie) and also managed deseq2 in R.
So I figured out how to analyze the differential expression in R, got actual fold change values in the command line tables!! But... I can't figure out how to 'save' the resulting expression analysis as a csv file. I did this below but I have no idea where the file is; I checked all of the folders, don't see it.

write.table(res, file="ctrlVS10um.txt", append = FALSE, sep = "\t" )

The other issue is that when I tried plotMA or plotCounts, it looks like it's running, but I have no idea where to find the graphs. Are they supposed to pop up in my command line window? How would I find the graphs so I can save and export them onto my local computer?

I've been googling both, but all I find are instructions on how to generate graphs or save files, nothing on how to actually save them or where they go.

Thank you.
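
A minimal sketch of one way to approach both questions, using base R only. Files written with a relative name such as "ctrlVS10um.txt" land in the current working directory, which getwd() reports. Plots drawn in a terminal session without a display may not appear anywhere, so one option is to open a file device, draw, and close it. The data frame and the output file names below are made-up stand-ins, not values from the issue.

```r
# Relative file names are resolved against this directory
print(getwd())

# Stand-in for a results table (hypothetical values)
res_df <- data.frame(gene = c("geneA", "geneB"),
                     log2FoldChange = c(1.2, -0.8))

# Write to an explicit, known location so the file is easy to find
out_csv <- file.path(tempdir(), "ctrlVS10um.csv")
write.csv(res_df, file = out_csv, row.names = FALSE)

# Open a PDF device, draw the plot, then close the device to write the file
out_pdf <- file.path(tempdir(), "example_plot.pdf")
pdf(out_pdf, width = 7, height = 5)
plot(res_df$log2FoldChange, main = "Example plot")
dev.off()

print(c(out_csv, out_pdf))
```

With a real DESeq2 results object, as.data.frame(res) would take the place of res_df, and plotMA(res) would take the place of the plot() call between pdf() and dev.off().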

Change Sleuth lesson to download metadata file

Or we have it match Salmon

Correct issues encountered in November workshop

Modify the DGE repo for consistency with changes made in last workshop
Remove added section for resKD added in last workshop

results() function needs alpha value

We should be specifying the alpha value in the results() function when extracting our results. By default it uses alpha = 0.1, and since we select genes with padj < 0.05, we should be setting alpha = 0.05.

Also, if we use a lfcThreshold, we should specify that in the results() function as well. I think this is what Mike had mentioned to us previously.
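
A sketch of what the proposed change could look like, assuming DESeq2 is installed. It uses simulated data from makeExampleDESeqDataSet() rather than the workshop dataset, and the lfcThreshold value of 0.58 is an arbitrary illustration, not a value taken from the lessons.

```r
library(DESeq2)

# Simulated dataset (not the workshop data): 500 genes, 6 samples
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 6)
dds <- DESeq(dds)

# Match alpha to the padj cutoff used when selecting significant genes
res <- results(dds, alpha = 0.05)

# Same, but testing against a log2 fold change threshold (illustrative value)
res_lfc <- results(dds, alpha = 0.05, lfcThreshold = 0.58)

summary(res)
```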

modify the hypothesis testing lesson

create separate contrast vectors for OE and KD
For OE save the unshrunken FC to plot MA with and without shrinkage

bring in counts from geo and make metadata

sleuth pca function plots pca on non-log transformed counts

code below to use log transformed values

# Extract data from object
norm_counts <- sleuth_to_matrix(de, "obs_norm", "est_counts")
log_norm_counts <- de$transform_fun_counts(norm_counts)

# Compute PCs
pc <- prcomp(t(log_norm_counts))
plot_pca <- data.frame(pc$x, summarydata)


# Plot with sample names used as data points
ggplot(plot_pca, aes(PC1, PC2)) + 
  theme_bw() +
  geom_point(aes(color=genotype)) +
  xlab('PC1') +
  ylab('PC2') +
  scale_x_continuous(expand = c(0.3,  0.3)) +
  #geom_text_repel(aes(x=PC1, y=PC2), label=name) +
  theme(plot.title = element_text(size = rel(1.5)),
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25)))

Error/Issue with code for stripping version from ENSEMBL ids

Hi,

I believe there might be an error in the code for stripping version ids from ENSEMBL IDs because some version numbers are double digits. (lesson 9a)

If you apply current code to a file that contains any double digit ensembl version ids (including the demo salmon files):
Current code: ids.strip <- str_replace(ids, "([.][0-9])", "")
Then the second digit of the ENSEMBL version number is left appended to the end of the ENSEMBL ID, which results in errors in downstream processes.
ENST00000339924.12
becomes
ENST000003399242
instead of
ENST00000339924

Probably a more efficient way to do this but I circumvented this by running two code steps to strip double digit and then single digit version ids:
ids.strip <- str_replace(ids, "([.][0-9][0-9])", "")
ids.strip <- str_replace(ids.strip, "([.][0-9])", "")

Best,
Sam
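
One possible single-step alternative, sketched in base R: the original pattern matches a dot followed by exactly one digit, so a quantifier and an end-of-string anchor let it cover version numbers of any length. The example IDs other than the one quoted in the issue are made up for illustration.

```r
# First ID is from the issue; the other two are illustrative
ids <- c("ENST00000339924.12", "ENST00000456328.2", "ENST00000450305")

# Remove a trailing ".<digits>" version suffix of any length
ids.strip <- sub("[.][0-9]+$", "", ids)

print(ids.strip)
```

The same pattern should work with the stringr call used in the lesson, i.e. str_replace(ids, "[.][0-9]+$", "").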

update normalization lesson

Bring in materials from Mary's BOSC lesson, specifically the table for normalization methods and associated text.

remove square brackets from visualization

clusterprofiler:: change enrichMap to emapplot

Overview of DGE Analysis Workflow

Hello,

I am new to R, and I cannot follow the step from Salmon (quant.sf) to tximport in "Overview of DGE Analysis Workflow": https://hbctraining.github.io/DGE_workshop_salmon/lessons/01_DGE_setup_and_overview.html

When I run the code below:

# List all directories containing data
samples <- list.files(path = "./data", full.names = T, pattern="salmon$")

# Obtain a vector of all filenames including the path
files <- file.path(samples, "quant.sf")

# Since all quant files have the same name it is useful to have names for each element
names(files) <- str_replace(samples, "./data/", "") %>%
  str_replace(".salmon", "")

The environment then shows "files character(0)" and "samples character (empty)", and I cannot run tximport.

I am looking for advice and would really appreciate any help.
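
An empty result from list.files() usually means that the path does not exist relative to the working directory, or that nothing in it matches the pattern. The sketch below wraps the lookup in a hypothetical helper, `find_salmon_dirs`, that reports which of the two happened instead of silently returning character(0). The demonstration runs on a throwaway directory, not on the workshop data.

```r
# Hypothetical helper: fail loudly instead of returning character(0)
find_salmon_dirs <- function(data_dir) {
  if (!dir.exists(data_dir)) {
    stop("Directory not found: ", data_dir,
         " (working directory is ", getwd(), ")")
  }
  samples <- list.files(path = data_dir, full.names = TRUE, pattern = "salmon$")
  if (length(samples) == 0) {
    stop("Nothing ending in 'salmon' in ", data_dir, "; it contains: ",
         paste(list.files(data_dir), collapse = ", "))
  }
  samples
}

# Demonstration on a temporary directory layout (made-up sample name)
demo_dir <- file.path(tempdir(), "data")
dir.create(file.path(demo_dir, "sampleA.salmon"),
           recursive = TRUE, showWarnings = FALSE)

samples <- find_salmon_dirs(demo_dir)
files <- file.path(samples, "quant.sf")
names(files) <- sub("[.]salmon$", "", basename(samples))
print(files)
```

Running getwd() and list.files("./data") in the console is a quick manual version of the same check.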

add a note for `keytype` versus `keyType` in enrichGO

move the thresholds stuff from visualization to hypothesis testing

Cannot use FPKM/RPKM/TPM for comparison between samples?

I have checked the DESeq2 normalization results, and the column sums also differ between samples, just as with FPKM/RPKM/TPM. Following your reasoning, I think RPGC (1x average coverage) would be better.

In most papers (about cancer), gene expression changes quoted from TCGA use FPKM rather than DESeq2-normalized values (TCGA provides the raw counts).

Include code for prcomp() in PCA lesson

Replace bitr() with ensembldb() in functional analysis

How to use DEseq2 for Differential expression

Hi,
I have a .txt file generated by featureCounts. I want to use DESeq2 for differential expression analysis. Please suggest a script to run the program.
Here is what my input file looks like:
counts.txt

Thank you,
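
A hedged sketch of how a featureCounts table could be prepared for DESeq2. It assumes the standard featureCounts layout (a leading comment line, then Geneid, Chr, Start, End, Strand, Length, followed by one column per sample). The file contents, sample names, and the two-group design are invented for illustration, since the attached counts.txt is not shown.

```r
# Write a tiny featureCounts-style file (made-up values) to a temp location
fc_file <- file.path(tempdir(), "counts.txt")
writeLines(c(
  "# Program:featureCounts (simulated header line)",
  "Geneid\tChr\tStart\tEnd\tStrand\tLength\tctrl_1\tctrl_2\ttreat_1\ttreat_2",
  "geneA\tchr1\t100\t900\t+\t801\t10\t12\t30\t28",
  "geneB\tchr1\t2000\t2500\t-\t501\t5\t7\t6\t4",
  "geneC\tchr2\t300\t1300\t+\t1001\t100\t90\t40\t55"
), fc_file)

# Read the table, skipping the comment line
fc <- read.delim(fc_file, comment.char = "#")

# Keep only the count columns; gene IDs become row names
count_matrix <- as.matrix(fc[, -(1:6)])
rownames(count_matrix) <- fc$Geneid

# Hypothetical sample metadata; row names must match the count columns
coldata <- data.frame(
  condition = factor(c("ctrl", "ctrl", "treat", "treat")),
  row.names = colnames(count_matrix)
)

# Build the DESeq2 object only if the package is available
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = count_matrix,
                                        colData = coldata,
                                        design = ~ condition)
  # On a real dataset, the next steps would be:
  # dds <- DESeq2::DESeq(dds)
  # res <- DESeq2::results(dds, alpha = 0.05)
}
```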

functional analysis updates from the DGE_salmon repo

the images need to be updated, but maybe also some code?

put the vignette function somewhere in the lessons

results() function to have lfc threshold incorporated

Change the PCA plots in the QC lesson

The example PCA plots are too small; increase the size of the data points and the axis titles.
Also, Batch should not be a continuous variable.

says

" For genes with moderate to high count values, the square root of dispersion will be equal to the coefficient of variation (Var / μ) "
Whereas it should be SD / μ.

(Judging by the formula Var = μ + α*μ^2)
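
The reasoning behind this correction can be written out as a short derivation from the quoted formula, where α is the dispersion and CV is the coefficient of variation:

```latex
\begin{align*}
\mathrm{Var} &= \mu + \alpha\mu^{2} \\
\mathrm{CV}^{2} = \frac{\mathrm{Var}}{\mu^{2}} &= \frac{1}{\mu} + \alpha \\
\sqrt{\alpha} &\approx \mathrm{CV} = \frac{\mathrm{SD}}{\mu}
  \quad \text{when } \mu \text{ is large, so that } 1/\mu \ll \alpha
\end{align*}
```

So for moderate to high counts the square root of the dispersion approximates SD / μ, not Var / μ.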