scienceparkstudygroup / rna-seq-lesson

A Carpentries-style lesson on RNA-Sequencing

Home Page: https://scienceparkstudygroup.github.io/rna-seq-lesson/

License: Other

Makefile 1.89% HTML 38.33% CSS 2.38% JavaScript 0.93% R 26.61% Shell 0.12% Python 24.49% Ruby 0.13% Jupyter Notebook 1.43% SCSS 3.69%

rna-seq-lesson's Issues

upload the assignment Rmarkdown notebooks

In order to have them available for the next course.

scripts/mypca.R

Dear ScienceParkStudyGroup-Team,

I am struggling with your Introduction to RNA-seq lessons.
I managed to start an RStudio instance in the browser and then tried to follow Episode 05. However, I got stuck at the very beginning when I tried to execute:
source("scripts/mypca.R")
This is followed by an error:
Error in file(filename, "r", encoding = encoding) :
cannot open the connection
In addition: Warning message:
In file(filename, "r", encoding = encoding) :
cannot open file 'scripts/mypca.R': No such file or directory

Maybe this is a trivial problem, but I cannot figure out where it comes from.

Help would be appreciated
Thanks and best,
Matthias
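
An error like this usually means that the path is resolved relative to a working directory that has no scripts/ folder. That diagnosis is an assumption, since the issue does not show the directory layout. A minimal sketch for checking it, where safe_source is a hypothetical helper and not part of the lesson:

```r
# Hypothetical helper (not part of the lesson): source a script only if it
# can be found relative to the current working directory.
safe_source <- function(path) {
  if (!file.exists(path)) {
    message("Working directory: ", getwd())
    message("File not found: ", path)
    return(invisible(FALSE))
  }
  source(path)
  invisible(TRUE)
}

# Inspect where R is looking and what it can see there.
getwd()
list.files("scripts")
safe_source("scripts/mypca.R")
```

If the file is reported as not found, either change the working directory with setwd() to the folder that contains scripts/, or copy the script there from the lesson repository.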

Outdated GO terms for enrichment analysis

Apparently, the GO terms in the org.At.tair.db Bioconductor package (release 3.10) are outdated.

In this package, there are 27,416 Arabidopsis genes for a total of 4837 GO terms.

columns(org.At.tair.db)
 [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
 [7] "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
[13] "PMID"         "REFSEQ"       "SYMBOL"       "TAIR"        

length(keys(org.At.tair.db, keytype = 'TAIR'))
[1] 27416

length(keys(org.At.tair.db, keytype = 'GO'))
[1] 4837

One solution would perhaps be to use the most up-to-date official GO terms (available here) and perform the enrichment analysis manually.
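
One way to read "manually" is an over-representation test per GO term, based on the hypergeometric distribution. The sketch below assumes that reading. The function name go_enrichment_p and all gene identifiers are made up for illustration, and the gene-to-term annotation would come from the up-to-date GO file.

```r
# Hypothetical helper: one-sided hypergeometric test for a single GO term.
# de_genes   : differentially expressed genes
# term_genes : genes annotated with the GO term
# universe   : all genes that could have been detected
go_enrichment_p <- function(de_genes, term_genes, universe) {
  de_genes   <- intersect(de_genes, universe)
  term_genes <- intersect(term_genes, universe)
  k <- length(intersect(de_genes, term_genes))  # DE genes carrying the term
  m <- length(term_genes)                       # annotated genes in the universe
  n <- length(universe) - m                     # genes without the term
  # P(X >= k) when drawing length(de_genes) genes without replacement
  phyper(k - 1, m, n, length(de_genes), lower.tail = FALSE)
}

# Toy example with made-up identifiers
universe   <- paste0("g", 1:100)
term_genes <- paste0("g", 1:10)
de_genes   <- paste0("g", 1:5)
p_value <- go_enrichment_p(de_genes, term_genes, universe)
```

When many terms are tested, the resulting p-values would still need a multiple-testing correction, for example with p.adjust(p_values, method = "BH").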

Good source of inspiration

To add to the next versions of the lesson:
https://www.slideshare.net/jakonix/4rna-seqpart4extracting-countsandanalysing

A great week of courses:
https://www.physalia-courses.org/courses-workshops/course19/curriculum-19/

Broken link in episode 3

Hello,
Thanks for providing this wonderful resource.

I wanted to point out that there is a broken link in episode 3. Specifically, the linked file does not exist, and I couldn't find it anywhere in the repo:

We encourage you to look at the full set of reads and note how the QC results differ when using the entire dataset

We encourage you to look at the [full set of reads](../fastqc/Mov10oe_1-fastqc_report.html) and note how the QC results differ when using the entire dataset

Can't find this file: ../fastqc/Mov10oe_1-fastqc_report.html

Cheers,
Jacob

Searching for regulatory elements

From a list of differential genes.

6.1 Extracting the coordinates of genes

Using biomartr

6.2 Adding or subtracting X nts

For instance, 5000 nts
If the gene is on the + DNA strand, then subtract 5000 nts
If the gene is on the - DNA strand, then add 5000 nts

Promoter retrieval using GenomicRanges
MEME for motif...
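
The strand rule in 6.2 can be sketched in base R as follows. This is a swapped-in illustration and not the GenomicRanges approach mentioned above. The function name promoter_window and the coordinates are hypothetical, and the sketch assumes 1-based coordinates with start <= end for every gene.

```r
# Hypothetical helper: upstream window of a gene, depending on its strand.
# On the + strand the window lies before the gene start; on the - strand
# it lies after the gene end. Chromosome ends are not checked here.
promoter_window <- function(start, end, strand, upstream = 5000) {
  stopifnot(all(strand %in% c("+", "-")))
  plus <- strand == "+"
  data.frame(
    promoter_start = ifelse(plus, pmax(start - upstream, 1), end + 1),
    promoter_end   = ifelse(plus, start - 1, end + upstream)
  )
}

# Toy coordinates (made up): one gene per strand
promoter_window(start = c(10000, 2000), end = c(12000, 3000), strand = c("+", "-"))
```

The resulting windows could then be used to extract sequences for a motif search such as MEME.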

episode 02 improvement ideas

For the statistical refresher part:

What is a good p-value histogram? It should have a high peak on the left (near 0), suggesting that you are comparing two conditions that have different distributions.
If the distribution is uniform, then there is no detectable difference between the experimental conditions being tested.

Perhaps also split episode 02 into "statistical refresher" and a new episode termed 03 "statistics applied to RNA-seq"

For episode 02:

population and sample notions
simulate two populations from two countries with different heights.
draw a sample + increase size of sample and make average + sd estimations.
Case 1 = identical populations (= same country)
- draw N samples of similar size, for example N = 5, N = 10 or N = 10,000 groups.
- perform a t-test for each of these three group sizes.
- draw a p-value histogram for each of these three group sizes.
- apply an FDR procedure to limit false positives (type I errors).
Case 2 = different populations
- draw N samples of similar size, for example N = 5, N = 10 or N = 10,000 groups.
- perform a t-test for each of these three group sizes.
- draw a p-value histogram for each of these three group sizes.
- apply an FDR procedure to limit false positives (type I errors).
Type I error and type II error.
FDR procedure
Power
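
The two cases above could be simulated along these lines. This is a minimal base R sketch. The mean heights, standard deviation, sample size and number of tests are arbitrary assumptions, and simulate_pvalues is a hypothetical helper name.

```r
set.seed(42)

# Hypothetical helper: repeat a two-sample t-test n_tests times and
# collect the p-values. Heights are drawn from normal distributions.
simulate_pvalues <- function(n_tests, n_per_group, mean_diff, sd = 7) {
  replicate(n_tests, {
    a <- rnorm(n_per_group, mean = 170, sd = sd)
    b <- rnorm(n_per_group, mean = 170 + mean_diff, sd = sd)
    t.test(a, b)$p.value
  })
}

# Case 1: identical populations; Case 2: populations differing by 10 cm
p_same <- simulate_pvalues(n_tests = 1000, n_per_group = 10, mean_diff = 0)
p_diff <- simulate_pvalues(n_tests = 1000, n_per_group = 10, mean_diff = 10)

# p-value histograms: roughly flat for case 1, peak near 0 for case 2
hist(p_same, breaks = 20, main = "Identical populations", xlab = "p-value")
hist(p_diff, breaks = 20, main = "Different populations", xlab = "p-value")

# Benjamini-Hochberg FDR adjustment, then count discoveries at 5%
n_sig_same <- sum(p.adjust(p_same, method = "BH") < 0.05)
n_sig_diff <- sum(p.adjust(p_diff, method = "BH") < 0.05)
```

Changing n_per_group shows how sample size affects power: with fewer individuals per group, the peak near 0 in the second histogram shrinks.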

For episode 03 = application to RNA-seq

maximise biological replicate sample numbers to increase statistical power.
talk about sample sequencing depth = rarefaction curve.
p-value histogram profiles and what to do about them.

Useful links
https://www.bioconductor.org/packages/release/bioc/html/qvalue.html

tips for statistical episode

https://carpentries-incubator.github.io/Data-Science-for-Docs/07-just-enough-statistics/index.html

Episode 7 Functional Enrichment Analysis GO/KEGG for non model organisms

I just wanted to point out that using clusterProfiler with OrgDb objects is not ideal for less well-annotated species. This is the case when the OrgDb comes from AnnotationHub.

This includes rice, for example. The issue is that the OrgDb lacks many translations from Entrez IDs to GO terms: ~75% of the input Entrez IDs do not map to GO terms through this method.
Since the OrgDb object does not have an Ensembl keytype, I was forced to translate from Ensembl to Entrez IDs using biomaRt. This also loses some IDs.
A direct translation from Ensembl IDs to GO terms (using biomaRt) leads to only ~39% non-mapping genes.
I am unaware of a method to update OrgDb objects with, for example, new keytypes, but I need to look into it, as this clusterProfiler method for GSEA is unusable for less well-annotated species.

I have not tried creating an OrgDb from NCBI, but I would not recommend using AnnotationHub, as was recommended by the authors of clusterProfiler.
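
The non-mapping percentages quoted above can be computed from any gene-to-GO table. Below is a minimal base R sketch with toy data. The table layout (columns gene_id and go_id) and the gene identifiers are assumptions chosen for illustration, not the output of a specific query.

```r
# Toy gene-to-GO table with placeholder identifiers. An empty go_id
# means the gene is known but has no GO annotation.
gene2go <- data.frame(
  gene_id = c("geneA", "geneA", "geneB", "geneC"),
  go_id   = c("GO:0000001", "GO:0000002", "GO:0000001", ""),
  stringsAsFactors = FALSE
)
input_genes <- c("geneA", "geneB", "geneC", "geneD")

# Keep only rows that carry a GO term
has_go <- !is.na(gene2go$go_id) & gene2go$go_id != ""
annotated_genes <- unique(gene2go$gene_id[has_go])

# Fraction of input genes without any GO term
unmapped_fraction <- mean(!input_genes %in% annotated_genes)

# Two-column term/gene table (term first, gene second)
term2gene <- unique(gene2go[has_go, c("go_id", "gene_id")])
```

A term/gene table in this shape is the kind of input that the TERM2GENE argument of clusterProfiler::enricher() accepts, which would avoid going through an OrgDb object at all.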

grading table

Figures (present or not):
- PCA: +1
- Top ten table: +1
- Volcano plot: +1
- Heatmap: +1
Interpretation of the figures: 0.5 point for each figure commented.
- PCA: +0.5
- Top ten table: +0.5
- Volcano plot: +0.5
- Heatmap: +0.5
Volcano plot:
- Title: +0.2
- Explicit arguments: +0.2
- Set cutoff axes: +0.2
- Set the limits of the min/max: +0.2
Heatmap:
- Top ten/50/differential genes: +0.2
- Clustering of the genes: +0.2
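
The grading table can be encoded as a named vector so that totals are computed instead of added by hand. This is only a sketch. The item names and the score_report helper are hypothetical, and the point values are copied from the table above.

```r
# Points per rubric item, copied from the grading table
rubric <- c(
  pca_present = 1, table_present = 1, volcano_present = 1, heatmap_present = 1,
  pca_comment = 0.5, table_comment = 0.5, volcano_comment = 0.5, heatmap_comment = 0.5,
  volcano_title = 0.2, volcano_explicit_args = 0.2,
  volcano_cutoffs = 0.2, volcano_limits = 0.2,
  heatmap_gene_selection = 0.2, heatmap_clustering = 0.2
)

# Hypothetical helper: total score for the items a report fulfils
score_report <- function(achieved) {
  stopifnot(all(achieved %in% names(rubric)))
  sum(rubric[achieved])
}

max_score <- sum(rubric)
example_score <- score_report(c("pca_present", "pca_comment", "volcano_present"))
```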

assignment ideas

Make a clustering of the samples: step by step.

Find individual genes affected by the treatment and make up a story from it:

retrieving info on TAIR.org
gene function
subcellular localization, etc.

Have each student pick a gene from the list of differentially expressed genes that is different from the genes picked by the other students.
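
For the sample clustering idea, a step-by-step version could look like the sketch below. It uses simulated, already log-scaled expression values rather than the lesson data. The sample names, the size of the treatment effect and the choice of average linkage are arbitrary assumptions.

```r
set.seed(1)

# Step 1: simulated expression matrix (genes in rows, samples in columns)
n_genes <- 200
samples <- c("ctrl_1", "ctrl_2", "ctrl_3", "treat_1", "treat_2", "treat_3")
expr <- matrix(rnorm(n_genes * length(samples), mean = 8, sd = 0.3),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), samples))

# Step 2: add a treatment effect to the first 50 genes
expr[1:50, 4:6] <- expr[1:50, 4:6] + 3

# Step 3: Euclidean distances between samples (hence the transpose)
sample_dist <- dist(t(expr))

# Step 4: hierarchical clustering and dendrogram
hc <- hclust(sample_dist, method = "average")
plot(hc, main = "Sample clustering", xlab = "", sub = "")

# Step 5: cut the tree into two groups
clusters <- cutree(hc, k = 2)
```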

Variance stabilisation

Need to add the vst(dds) step before the PCA. It is not correct to plot the PCA without this transformation.
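
A minimal sketch of that step, assuming the DESeq2 package is installed. It uses a simulated dataset from makeExampleDESeqDataSet() in place of the lesson's own dds object, so the numbers of genes and samples are arbitrary.

```r
library(DESeq2)

set.seed(123)

# Simulated count dataset standing in for the lesson's dds object;
# it comes with a "condition" column in its sample table.
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# Variance-stabilising transformation before any PCA
vsd <- vst(dds, blind = TRUE)

# PCA on the transformed values, coloured by condition
plotPCA(vsd, intgroup = "condition")
```

With the lesson data, only the vst() and plotPCA() lines would be needed, applied to the existing dds object and its own grouping variable.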

biomart cache clear

Add a short example showing what happens when the biomartr function runs. Make a screenshot of the query while it runs.

Then add the biomaRt::biomartCacheClear() call to the main body of the episode, as this is a frequent bug.