namlab / qbio

QBIO HHU course materials

qbio's Issues

Error in Step 5

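# Top 50 genes from the limma fit, sorted by log fold change (BH-adjusted p-values)
myTopHits2 <- topTable(ebFit2, adjust = "BH", coef = 1, number = 50, sort.by = "logFC")
myTopHits2
ebFit2
# Functional enrichment of those genes with gprofiler2
gost.res2 <- gost(rownames(myTopHits2), organism = "tccriollo", correction_method = "fdr", significant = FALSE)
gost.res2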

That's the code I ran, and this is the resulting message:
No results to show
Please make sure that the organism is correct or set significant = FALSE

We couldn't solve this problem on our own: setting significant = FALSE did not help, and neither did adding a user threshold.
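
A possible first check, offered only as a sketch: g:Profiler also reports "No results to show" when none of the query strings are recognised as gene identifiers for the organism. Whether that is the cause here is an assumption; the report does not confirm it. If the expression matrix passed to limma had no row names, `rownames(myTopHits2)` would hold row numbers rather than gene IDs. The helper `looks_like_row_index()` is hypothetical and is not part of limma or gprofiler2.

```r
# Hypothetical helper: TRUE if every identifier consists of digits only,
# i.e. the vector looks like row numbers rather than gene IDs.
looks_like_row_index <- function(ids) {
  ids <- as.character(ids)
  length(ids) > 0 && all(grepl("^[0-9]+$", ids))
}

# Usage with the object from the report (not run here):
# query_ids <- rownames(myTopHits2)
# head(query_ids)
# if (looks_like_row_index(query_ids)) {
#   message("Row names are numeric; gost() needs gene IDs instead.")
# }
```

If the row names do turn out to be gene IDs, the next thing to compare would be their naming scheme against the one g:Profiler expects for the chosen organism.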

Gene sets enrichment analysis (GSEA) using g:Profiler still works fine for me!

See https://github.com/IngoGiebel/qbio304-student-work/blob/main/scripts/dge-analysis-PRJCA004229.R for how the variables used here were created.

# ------------------------------------------------------------------------------
# Step 5: Gene sets enrichment analysis (GSEA) using g:Profiler
# ------------------------------------------------------------------------------

# Oryza nivara

# Functional enrichment analysis of the 100 top-ranked genes
top_genes_gostres_onivara <- gprofiler2::gost(
  top_genes_onivara_df$geneID[1:100],
  organism = "onivara",
  correction_method = "fdr")

# Produce an interactive manhattan plot of the enriched GO terms
gprofiler2::gostplot(
  top_genes_gostres_onivara,
  interactive = TRUE,
  capped = FALSE)

# Produce a static publication quality manhattan plot
# with the first 10 top-ranked GO terms highlighted.
gprofiler2::gostplot(
  top_genes_gostres_onivara,
  interactive = FALSE,
  capped = FALSE) |>
  gprofiler2::publish_gostplot(
    highlight_terms = top_genes_gostres_onivara$result$term_id[1:10])

# Generate a table of the gost results of the first 20 top-ranked GO terms
gprofiler2::publish_gosttable(
  top_genes_gostres_onivara,
  highlight_terms = top_genes_gostres_onivara$result$term_id[1:20],
  show_columns = c("source", "term_name", "term_size", "intersection_size"))

# Oryza sativa

# Functional enrichment analysis of the 100 top-ranked genes
top_genes_gostres_osativa <- gprofiler2::gost(
  top_genes_osativa_df$geneID[1:100],
  organism = "osativa",
  correction_method = "fdr")

# Produce an interactive manhattan plot of the enriched GO terms
gprofiler2::gostplot(
  top_genes_gostres_osativa,
  interactive = TRUE,
  capped = FALSE)

# Produce a static publication quality manhattan plot
# with the first 10 top-ranked GO terms highlighted.
gprofiler2::gostplot(
  top_genes_gostres_osativa,
  interactive = FALSE,
  capped = FALSE) |>
  gprofiler2::publish_gostplot(
    highlight_terms = top_genes_gostres_osativa$result$term_id[1:10])

# Generate a table of the gost results of the first 20 top-ranked GO terms
gprofiler2::publish_gosttable(
  top_genes_gostres_osativa,
  highlight_terms = top_genes_gostres_osativa$result$term_id[1:20],
  show_columns = c("source", "term_name", "term_size", "intersection_size"))
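
One caveat about the indexing in the script above, shown as a small sketch rather than a change to the author's code: `term_id[1:10]` pads the vector with `NA` when fewer than 10 terms are enriched, and `gost()` returns `NULL` when it finds nothing at all. The helper name `top_term_ids()` and the mock result object are hypothetical and only serve the illustration.

```r
# Hypothetical helper: return at most n term IDs from a gost() result.
# gost() returns NULL when nothing is enriched; that case yields character(0).
top_term_ids <- function(gostres, n) {
  if (is.null(gostres) || is.null(gostres$result)) {
    return(character(0))
  }
  head(gostres$result$term_id, n)
}

# Mock object with the same shape as a gost() result, for illustration only
mock_gostres <- list(
  result = data.frame(
    term_id = c("GO:0008150", "GO:0003674", "GO:0005575"),
    stringsAsFactors = FALSE
  )
)

top_term_ids(mock_gostres, 10)  # all three IDs, without NA padding
top_term_ids(NULL, 10)          # empty character vector
```

In the script, `highlight_terms = top_term_ids(top_genes_gostres_osativa, 10)` would then take the place of the `[1:10]` indexing.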

Oryza sativa: The GMT files I found use different gene codes from those used in BioMart

Checked GMT files: http://structuralbiology.cau.edu.cn/PlantGSEA/download.php

- GO (Gene Ontology) gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_GO

- Gene Family based gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_GFam

- KEGG gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_KEGG

- PO gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_PO

- MIR gene sets : http://structuralbiology.cau.edu.cn/PlantGSEA/database/Osa_MIR

None of these files fully adheres to the GMT standard, which requires the genes to be separated by tabs; in these files the genes are separated by ",". That issue can of course be fixed. Once it is, however, a showstopper remains: the gene codes differ from the codes used in the reference genome file "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-56/plants/fasta/oryza_sativa/cdna/".

For example:
BioMart gene codes: Os12g0469300, Os07g0249200
MSU Rice Genome Annotation Project gene codes (used in the GMT files): LOC_Os01g07760, LOC_Os01g40630, LOC_Os03g59220
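
The comma problem mentioned above could be handled along the following lines. This is a minimal sketch under an assumption the post does not state: that each line has the layout name, TAB, description, TAB, comma-separated genes. The function name `commas_to_gmt()` and the example line are hypothetical.

```r
# Hypothetical helper: turn "name<TAB>description<TAB>g1,g2,g3" into
# "name<TAB>description<TAB>g1<TAB>g2<TAB>g3".
# Assumption: everything after the second field is a list of genes.
commas_to_gmt <- function(lines) {
  vapply(strsplit(lines, "\t", fixed = TRUE), function(fields) {
    if (length(fields) < 3) {
      return(paste(fields, collapse = "\t"))  # leave short lines unchanged
    }
    genes <- trimws(unlist(strsplit(fields[-(1:2)], ",", fixed = TRUE)))
    genes <- genes[nzchar(genes)]             # drop empty entries
    paste(c(fields[1:2], genes), collapse = "\t")
  }, character(1), USE.NAMES = FALSE)
}

example_line <- "SET_A\tsome description\tLOC_Os01g07760,LOC_Os01g40630,LOC_Os03g59220"
commas_to_gmt(example_line)

# For a whole file (file names are placeholders, not run here):
# writeLines(commas_to_gmt(readLines("Osa_GO")), "Osa_GO.gmt")
```

This only repairs the separators; it does nothing about the mismatch between the two gene code schemes.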

Only at http://plants.ensembl.org/Oryza_sativa/Location/Viewdb=core;g=Os03g0786000;r=3:32624612-32627796;t=Os03t0786000-01 did I find the following information, shown when displaying the details for one of the genes:

Transcript: LOC_Os01g02240.1.1
Gene: LOC_Os01g02240
Protein product: LOC_Os01g02240.1
Location: Chromosome 1: 678,778-684,594
Gene type: Msu gene
Strand: Reverse
Base pairs: 4,758
Amino acids: 1,585
Analysis: Genes (MSU)
Annotation method: Gene annotation by MSU Rice Genome Annotation Project dated 2011-10-31. These genes are included alongside the IRGSP annotations, but are not included in Compara or BioMart.


Unfortunately, I could not find any other suitable GMT files that use the BioMart gene codes (the codes that result from using the reference genome file with kallisto and tximport).
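
Before running an enrichment against any candidate GMT file, a quick check of the ID schemes can show whether the two sources overlap at all. The sketch below is hypothetical: the patterns in `classify_rice_id()` are inferred only from the example codes quoted above and are not an official specification of either scheme.

```r
# Hypothetical helper: label IDs by naming scheme, based only on the examples above.
# "BioMart-style": Os + 2 digits + g + 7 digits   (e.g. Os12g0469300)
# "MSU-style":     LOC_Os + 2 digits + g + 5 digits (e.g. LOC_Os01g07760)
classify_rice_id <- function(ids) {
  ifelse(grepl("^Os[0-9]{2}g[0-9]{7}$", ids), "BioMart-style",
         ifelse(grepl("^LOC_Os[0-9]{2}g[0-9]{5}$", ids), "MSU-style", "other"))
}

my_ids  <- c("Os12g0469300", "Os07g0249200")
gmt_ids <- c("LOC_Os01g07760", "LOC_Os01g40630", "LOC_Os03g59220")

table(classify_rice_id(c(my_ids, gmt_ids)))
length(intersect(my_ids, gmt_ids))  # number of IDs shared by both sources
```

An overlap of zero, as with these example codes, means the gene sets cannot match any gene in the expression data until the IDs are mapped from one scheme to the other.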

Script 5 - getGmt Error

Some of you get errors when importing some of the PlantGSEA GMT files:

> broadSet.C2.ALL <- getGmt("Osa.DetailInfo.csv", geneIdType=SymbolIdentifier())
Error in validObject(.Object) : 
  invalid class “GeneSetCollection” object: each setName must be distinct
In addition: Warning message:
In getGmt("Osa.DetailInfo.csv", geneIdType = SymbolIdentifier()) :
  5788 record(s) contain duplicate ids: 'DE_NOVO'_IMP_BIOSYNTHETIC_PROCESS, 'DE_NOVO'_PYRIMIDINE_NUCLEOBASE_BIOSYNTHETIC_PROCESS, ..., ZINC_ION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY, ZINC_ION_TRANSPORT

The error is caused by duplicated names of some of the gene sets. I don't know why such duplicates occur in the file, but they can easily be removed in R with the code below.

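# Quick solution
# 1. Add the ".csv" extension to the downloaded file; for rice, the file downloaded from PlantGSEA is named "Osa.DetailInfo"
# 2. Read the file (tab-separated, no header); quote = "" keeps names such as 'DE_NOVO'_... intact
tmp <- read.csv("Osa.DetailInfo.csv", header = FALSE, sep = "\t", quote = "")
# 3. Convert to a tibble (needs the tibble package; as.tibble() is deprecated in favour of as_tibble())
tmp <- tibble::as_tibble(tmp)
# 4. Remove rows whose gene set name (first column) is a duplicate
tmp <- tmp[!duplicated(tmp$V1), ]
# 5. Write the new file; quote = FALSE avoids wrapping every field in double quotes
write.table(tmp, "OsaUnique.csv", sep = "\t", quote = FALSE, col.names = FALSE, row.names = FALSE)
# 6. Read the file as GMT (getGmt() and SymbolIdentifier() come from the GSEABase package)
broadSet.Osa.Unique <- getGmt("OsaUnique.csv", geneIdType = SymbolIdentifier())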
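
Because the quick solution keeps only the first record for each name, it may be worth checking first whether the duplicated records really carry the same content. The following is a sketch only; `summarise_duplicates()` and the toy lines are hypothetical and assume that the set name is the text before the first tab.

```r
# Hypothetical helper: for each duplicated set name, count its records and
# report whether all of them are identical lines.
summarise_duplicates <- function(lines) {
  set_names <- sub("\t.*$", "", lines)  # text before the first tab
  dup_names <- unique(set_names[duplicated(set_names)])
  data.frame(
    set_name = dup_names,
    n_records = vapply(dup_names, function(nm) sum(set_names == nm),
                       integer(1), USE.NAMES = FALSE),
    identical = vapply(dup_names, function(nm) length(unique(lines[set_names == nm])) == 1,
                       logical(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

toy <- c("SET_A\tdesc\tg1,g2", "SET_A\tdesc\tg1,g2",
         "SET_B\tdesc\tg3", "SET_B\tother\tg4",
         "SET_C\tdesc\tg5")
summarise_duplicates(toy)

# On the real file (not run here):
# summarise_duplicates(readLines("Osa.DetailInfo.csv"))
```

Where the `identical` column is `FALSE`, dropping all but the first record discards genes, so merging the records or renaming the sets may be preferable there.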
