ganglilab / genekitr


🧬 Gene analysis toolkit based on R

Home Page: https://www.genekitr.fun

License: GNU General Public License v3.0

R 100.00%

gene enrichment-analysis id-converter plotting

genekitr's Introduction

Overview

Genekitr is a gene analysis toolkit based on R.

Five core features:

Search: gene-related information (e.g. gene functional summary, gene name, location, GC content, gene biotype ...) and PubMed records
Convert: ID conversion among Symbol & Alias, NCBI Entrez, Ensembl, UniProt and human microarray probe
Analysis: users can select gene sets of interest from hundreds of gene sets for both model and non-model species, including GO (BP, CC and MF), KEGG (pathway, module, enzyme, network, drug and disease), WikiPathways, MSigDB, EnrichrDB, Reactome, MeSH, DisGeNET, Disease Ontology (DO), Network of Cancer Genes (NCG, versions 6 and 7) and COVID-19. Gene enrichment analysis (GSA) includes both over-representation analysis (ORA) and gene set enrichment analysis (GSEA) methods. ORA supports multi-group comparisons.
Plot: easily generate 13 ORA plots, 5 GSEA plots, 2 Venn plots, and 1 volcano plot with customizable features such as text, color, border, axis, and legend. The plotting functions accept a data frame as input and support GeneOntology website results based on PantherDB.
Export: quickly export numerous datasets as different sheets within a single Excel file.

Supported organisms:

For more details, please refer to this site.

Search & ID conversion: 195 vertebrate species, 120 plant species and 2 bacteria species
Enrichment analysis: GO supports 143 species, KEGG supports 8213 species, MeSH supports 71 species, MSigDB supports 20 species, WikiPathways supports 16 species, Reactome supports 11 species, EnrichrDB supports 5 species, plus human-specific gene sets (DO, NCG, DisGeNET and COVID-19)

🛠 Installation

# check current version
packageVersion('genekitr')

Install stable version from CRAN:

install.packages("genekitr")

Install development version from GitHub:

remotes::install_github("GangLiLab/genekitr")

Install development version from Gitee (for users in mainland China):

remotes::install_git("https://gitee.com/genekitr/pacakge_genekitr")

📚 Vignette

ENGLISH: https://www.genekitr.fun/

🧙🏻‍♂️ Tell a story ~ why develop genekitr?

Genes, the essence of life's art, Omics research's fundamental part,

Like cells in our physical frame, Their study reveals life's vibrant flame.

Let me tell you a story about Mr. Doodle, a computational biology student working with his PI.

Scene 1: repeat work

One day, PI gave him 30 genes to check for their locations and exact names, preferably with sequences.

Mr. Doodle searched for each gene on NCBI, copying and pasting the information into an Excel sheet. He sent the file to PI an hour later, and received praise for his work. But just when he thought he was done, PI gave him another 50 genes to check!

Despite feeling a little overwhelmed, Mr. Doodle repeated the same process, determined to complete the task to the best of his abilities.

Doodle wondered how to avoid having to repeatedly search for the same information.

Scene 2: embarrassing name

Once upon a time, PI gave Mr. Doodle a DEG matrix and a target gene list file. The task was to determine if the target gene was up-regulated after treatment.

Mr. Doodle searched the matrix but couldn't find the PDL1 gene, even though it was in the gene list. He asked PI about it, and PI explained that the gene was listed as CD274, which is an alias for PDL1.

This left Mr. Doodle feeling a little confused.

Doodle wondered how to differentiate between a real gene name and an alias.

Scene 3: outdated database

Mr. Doodle was analyzing KEGG pathways for the up-regulated genes in the last DEG matrix. However, KEGG only supported Entrez IDs, and the genes were listed by their symbols.

Mr. Doodle needed to convert the gene symbols to Entrez IDs, but he found that some symbols did not match the corresponding Entrez IDs. However, he discovered that NCBI had the correct IDs.

Mr. Doodle realized that he was using an outdated org.Hs.eg.db v3.15 annotation package. After updating to the current version, v3.17, he was finally able to obtain all the matched IDs, and continue his analysis of the KEGG pathways.

Doodle wondered if there was a method to help him obtain updated results automatically, instead of having to check them manually every time.

Scene 4: incompatible format

PI did some fancy enrichment analysis all by himself on a website called GeneOntology. He then asked Mr. Doodle to help him make a pretty picture of the results. "Can you make a bubble plot for me and show the FoldEnrichment on the x-axis?" he asked with a smile. Doodle tried to use a fancy R package called clusterProfiler, but it wouldn't work with the data. So, he bravely coded it himself using ggplot2.

Doodle wondered why there isn't a tool that accepts plain data frames.

Scene 5: annoying plot theme

Doodle finally finished making the bubble plot and sent it to PI. After 15 minutes, PI sent him a message with a smile: "the text is too small, and can you make the background white with a border size of 4 points?" Doodle tweaked the ggplot theme and made the changes in 10 minutes. But, after a little while, PI sent another message saying, "The border is too thick in the second version. Can you please redo it?"

Doodle wondered if there was a function that could help him process the plot theme instead of having to modify the current code repeatedly.

Scene 6: limited plot types

PI gave Doodle the GO enrichment analysis result and asked him to think of a creative way to display it. Doodle found that each tool had its specific plot. For example, WEGO could compare BP, CC, and MF terms; GOplot had a chord plot to show the relationship between genes and GO terms; and clusterProfiler supported enrichment maps and networks, which could explore the relationship among enriched terms. However, there was a big problem: the input data for each tool was not compatible, making it inconvenient to plot WEGO plots using clusterProfiler objects.

Doodle wondered if there was a method that could produce beautiful plots from different tools using a universal data format.

Scene 7: chaotic export files

Doodle finished conducting differential expression analysis and GO/KEGG enrichment analysis. PI asked him to send over all the result files. Doodle saved the results into three separate Excel files, naming them "DEG_data.xlsx," "GO_enrich.xlsx," and "KEGG_enrich.xlsx." He then compressed the three files into a zipped folder, naming it after the date, and sent it to PI. After a while, PI asked him if he could put all three results into a single Excel file.

Doodle wondered if there was a way to save all data into a single file without having to perform many manual operations.

If you have encountered problems similar to Mr. Doodle's, give genekitr a try!

✍️ Author

Yunze Liu

🔖 Citation

The paper has been published. Please cite:

Liu, Y., Li, G. Empowering biologists to decode omics data: the Genekitr R package and web server. BMC Bioinformatics 24, 214 (2023). https://doi.org/10.1186/s12859-023-05342-9

💓 Welcome to contribute

If you are interested in genekitr, you are welcome to contribute your ideas as follows:

Git clone this project
Double click genekitr.Rproj to open RStudio
Modify source code in R/ folder
Run devtools::check() to make sure there are no errors, warnings or notes
Open a pull request and clearly describe your changes


genekitr's Issues

transId function sometimes gives an error

Hi,
Thanks for your tools. When I used transId to transform gene IDs, I found that this tool sometimes works fine, but sometimes it gives an error:

ec <- 0
for (i in 1:10){
  esm_id <- try(genekitr::transId(unique(dt$NCBIid), transTo = "ensembl"))
  if(inherits(esm_id, "try-error")){
    ec <- ec + 1
  }
}
ec
##[1] 3

In this example, it fails 3 out of 10 times. Why does this happen, and how can I handle it? (The dt file is provided.)
Thanks.
dt.zip
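
One possible workaround, sketched under the assumption that the failures are transient (the report does not establish the cause), is to retry the call a few times before giving up. `with_retry` below is a hypothetical helper written in base R, not a genekitr function.

```r
# Hypothetical helper (not part of genekitr): retry a call that fails
# intermittently, returning the first successful result.
with_retry <- function(fun, times = 3, wait = 1) {
  last_error <- NULL
  for (attempt in seq_len(times)) {
    # Capture the error object instead of letting it abort the loop.
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (attempt < times) Sys.sleep(wait)  # pause before the next attempt
  }
  stop("all ", times, " attempts failed; last error: ",
       conditionMessage(last_error))
}

# Usage with the call from the report (needs genekitr and the reporter's `dt`):
# esm_id <- with_retry(function() {
#   genekitr::transId(unique(dt$NCBIid), transTo = "ensembl")
# })
```

If every attempt fails, the last error message is re-raised, so a persistent problem is still visible rather than silently swallowed.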

Combining pathway gene sets for ORA/GSEA analyses

Hi Genekitr developers,

Is there a way to combine pathway gene sets from different sources like (mSigDB, Reactome, KEGG) into a single geneset for ORA or GSEA analysis?

Thank you in advance,
Adi
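
A rough sketch of one way to do this in base R, assuming each source returns a list with a two-column `geneset` data frame (term, gene) and a two-column `geneset_name` data frame (id, name), as the getGO() output quoted elsewhere on this page suggests. `combine_genesets` is a hypothetical helper, the toy values are made up, and whether genORA() or genGSEA() accept such a hand-built list is not confirmed here.

```r
# Hypothetical helper: stack several gene set lists into one list with the
# same layout. Column names are unified because they may differ by source.
combine_genesets <- function(..., organism = NA) {
  sets <- list(...)
  geneset <- do.call(rbind, lapply(sets, function(gs) {
    setNames(gs$geneset[, 1:2], c("term", "gene"))
  }))
  geneset_name <- do.call(rbind, lapply(sets, function(gs) {
    setNames(gs$geneset_name[, 1:2], c("id", "name"))
  }))
  geneset <- unique(geneset)            # drop exact duplicate rows
  geneset_name <- unique(geneset_name)
  rownames(geneset) <- NULL
  rownames(geneset_name) <- NULL
  list(geneset = geneset, geneset_name = geneset_name, organism = organism)
}

# Toy stand-ins for two sources (all values are made up).
gs_a <- list(
  geneset = data.frame(term_a = c("A:1", "A:1"), gene = c("G1", "G2")),
  geneset_name = data.frame(id = "A:1", name = "Pathway A1")
)
gs_b <- list(
  geneset = data.frame(term_b = c("B:1", "B:1"), gene = c("G2", "G3")),
  geneset_name = data.frame(id = "B:1", name = "Pathway B1")
)

combined <- combine_genesets(gs_a, gs_b, organism = "hs")
```

Keeping the source prefix in the term ID (here `A:` and `B:`) makes it possible to tell afterwards which database an enriched term came from.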

transId keeping unique ids issue

Hello,

some genes are not changed to the new symbols (the new symbol is BABAM2):

Also, the information of the genee is not complete:

Problem about transId

Hi,

Thank you for your fantastic work and the great convenience you've brought to us.

However, recently, when I attempted to convert a column of IDs, despite setting both the 'keepNA' and 'unique' parameters to TRUE, I noticed that the returned data length doesn't match the input. What's even more peculiar is that when I re-enter the initially missed IDs into the function, the data is then output completely, although some may be None. The package version of genekitr is 1.2.5. Details are as mentioned above. I'm looking forward to your response, and once again, thank you for your awesome work.

Best wishes,

Zhaoyu

org.db bug in gotangram

Hi!

I'm trying to plot "gotangram" for GO BP enriched A.thaliana data. Other types like bar, upset, and the network worked fine, but gotangram is raising an error: "Error: Bioconductor orgdb for org.At.eg.db not found. You should install first.". It looks like a bug since for A.thaliana it's org.At.tair.db

test <- c('AT1G12610', 'AT5G47600', 'AT1G33760') # just to make it easy to reproduce
gs <- getGO(org="Arabidopsis thaliana", ont="bp")
go_bp <- genORA(test,geneset=gs) 
plotEnrich(go_bp, plot_type = "gotangram", sim_method = "Rel", org='Arabidopsis thaliana')

ps
Many thanks for the package. I do love it.

load older versions of the package from CRAN

hello, is there any way one can load older versions of the genekitr package? thank you
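
One common approach for CRAN packages in general is `remotes::install_version()`. The sketch below wraps it in a hypothetical helper; the version number in the usage comment is the one mentioned in another report on this page, and its availability in the CRAN archive is an assumption.

```r
# Hypothetical helper: install a specific archived genekitr release from CRAN.
install_genekitr_version <- function(version) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    stop("the 'remotes' package is required: install.packages('remotes')")
  }
  remotes::install_version("genekitr", version = version)
}

# Example (not run here; needs network access):
# install_genekitr_version("1.2.5")
# packageVersion("genekitr")
```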

Plotting for ShinyGO ORA results

Describe the bug
Hello, I am attempting to perform a GO term visualization of my ShinyGO ORA results with plotEnrich. There are GO terms that plotEnrich won't recognize, is there a way to skip them entirely? thank you

Could not resolve host: genekitr-china.oss-accelerate.aliyuncs.com

Hi reedliu,

I was running the example code, and got the error, "Could not resolve host: genekitr-china.oss-accelerate.aliyuncs.com". It seems like a host issue and data access issue. I am in Europe.

data(geneList, package = "genekitr")
entrez_id <- names(geneList)[abs(geneList) > 2]
head(entrez_id, 5)
hg_gs <- geneset::getGO(org = "human",ont = "mf")

Best, Chen

plotGSEA classic type for non-model species

Describe the bug
Hello, I was planning to coerce a fgsea (preranked gsea) result onto a plotEnrich function for plotting, with a previous step of gene count and GeneRatio calculation, geneID_symbol mapping and column name changes so that the dataframe looked identical to the model dataframe returned by genGSEA (which I find less flexible than fgsea).

However, when I attempted to plot the results for a single category, which I had checked was in the gsea_df, I received an error:

Error in `$<-.data.frame`(`*tmp*`, "gene", value = c("BnaA04g22070D", : 
replacement has 64949 rows, data has 65732

With the following traceback:

8.
stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
"replacement has %d rows, data has %d"), N, nrows), domain = NA)
7.
`$<-.data.frame`(`*tmp*`, "gene", value = c("BnaA04g22070D", 
NA, NA, NA, NA, "BnaC01g43250D", NA, NA, NA, NA, "BnaC07g39370D", 
"BnaA03g47170D", NA, NA, "BnaCnng19060D", NA, NA, "BnaC01g19310D", 
NA, NA, "BnaC07g50360D", NA, NA, NA, "BnaA05g03390D", NA, NA, ...
6.
`$<-`(`*tmp*`, "gene", value = c("BnaA04g22070D", NA, NA, NA, 
NA, "BnaC01g43250D", NA, NA, NA, NA, "BnaC07g39370D", "BnaA03g47170D", 
NA, NA, "BnaCnng19060D", NA, NA, "BnaC01g19310D", NA, NA, "BnaC07g50360D", 
NA, NA, NA, "BnaA05g03390D", NA, NA, NA, NA, NA, NA, NA, NA, ...
5.
calcScore(geneset, genelist, x, exponent, fortify = TRUE, org)
4.
FUN(X[[i]], ...)
3.
lapply(show_pathway, function(x) {
calcScore(geneset, genelist, x, exponent, fortify = TRUE, 
org)
})
2.
do.call(rbind, lapply(show_pathway, function(x) {
calcScore(geneset, genelist, x, exponent, fortify = TRUE, 
org)
}))
1.
genekitr::plotGSEA(BP_HDAC_list, plot_type = "classic", show_pathway = "GO:0040029")

To Reproduce
reprex_plotGSEA_filtered.xlsx

This is my Excel file representing the list of different dataframes I used after preprocessing (with names "gsea_df", "genelist", "geneset", "exponent" and "org"). I am working with Brassica napus external_gene_name ENA identifiers.

I filtered the gsea_result down to only 21 rows in order to preserve the confidentiality of my results, but it still has the identifier I want to create a GSEA plot from, GO:0040029. If this is a problem for test generation, please let me know.


Wrong label side in bar plot when more positive NES than negative NES

Hi there,

In your code plotGSEA.R line 498:120, when there are fewer positive NES than negative NES, the gene set names are plotted on the right side, but when there are more positive NES than negative NES, the gene set names are plotted on the wrong side.

Issue with ORA and importCP

Hi,
Firstly, kudos for the development of genekitr. It's a great tool and your reasons for its creation resonate so much with my experiences so far.

I'm currently working with the organism Yarrowia lipolytica and have noted some challenges:

The geneset for Yarrowia lipolytica is available in the geneset package (GO and KEGG), but there is no organism value attached to it when running getGO:

> mf <- getGO(org = "Yarrowia lipolytica", ont = "mf")
> head(mf$geneset)
          mf          gene
1 GO:0000030 YALI0_C04004g
2 GO:0000030 YALI0_D10549g
3 GO:0000030 YALI0_B01672g
4 GO:0000030 YALI0_E02222g
5 GO:0000030 YALI0_A20922g
6 GO:0000030 YALI0_A13585g
> head(mf$geneset_name)
          id                           name
1 GO:0000030   mannosyltransferase activity
2 GO:0000049                   tRNA binding
3 GO:0000149                  SNARE binding
4 GO:0000166             nucleotide binding
5 GO:0000175 3'-5'-RNA exonuclease activity
6 GO:0000287          magnesium ion binding
> mf$organism
[1] NA

However, I ran into issues with follow-up functions, specifically the genORA function. It suggests there's no short name for the organism. This is perplexing given the initial inclusion of Yarrowia lipolytica in the geneset. I also tried to add the organism value, but the function still does not work.

>   gs <- genORA(de.genes$ensembl_gene_id, mf$geneset,padj_method = "BH",
+                p_cutoff = 0.05,)
Error in if (organism == "hg" | organism == "human" | organism == "hsa" |  : 
  argument is of length zero

I also tried a different route, performing the ORA with ClusterProfiler and then importing the results to genekitr. But this too resulted in an error.

>   ora_go <- clusterProfiler::enrichGO(gene = de.genes,
+                         OrgDb = org.Ylipolytica.eg.db,
+                         universe = filtered_data$entrez,
+                         keyType = "ENTREZID",  
+                         ont = "ALL",  # Biological Process
+                         pAdjustMethod = "BH",  # adjust method
+                         pvalueCutoff = 0.05,
+                         minGSSize = 5,
+                         maxGSSize = 500,
+                         readable = FALSE)
>   go_easy <- importCP(ora_go, type = "go")
Error in mapEnsOrg(object@organism) : 
Check the latin_short_name in `genekitr::ensOrg_name`

I'd appreciate any insights or suggestions you might have regarding these issues. Is there a workaround or am I possibly missing a step?
Thanks!

transId loading issue

Describe the bug
transId not working

To Reproduce
Steps to reproduce the behavior:
Just run the example below.

Screenshots

Desktop (please complete the following information):

OS: Windows
Version [11 ]
Browser [Brave]

labels overlay bars in plotEnrichAdv when left x-axis limit is less than the right limit

Hi!
When I create a figure with plotEnrichAdv on simplified data and the left x-axis limit is less than the right x-axis limit, the labels overlay the bars of the graph.

Let up_go_bp_sim and down_go_bp_sim be the resultant dataframes returned by genORA function ran with up- and downregulated DEGs.
Then:

Left limit is less than the right limit (labels overlay the bars)

plotEnrichAdv(up_go_bp_sim, down_go_bp_sim,
              plot_type = "one",
              term_metric = "FoldEnrich",
              stats_metric = "p.adjust",
              xlim_left = 15, xlim_right = 20) +
    theme(legend.position = c(0.8, 0.5))

Left limit is greater than the right limit (as in the example in the documentation)
Everything is OK.

plotEnrichAdv(up_go_bp_sim, down_go_bp_sim,
              plot_type = "one",
              term_metric = "FoldEnrich",
              stats_metric = "p.adjust",
              xlim_left = 20.1, xlim_right = 20) +  # now left border is greater than the right one
    theme(legend.position = c(0.8, 0.5))

ps:
It also would be great to add more parameters to simGO function like cutoff etc.

pps:
Thanks again for the package!

ORA result plotting error because of duplicated terms

Describe the bug
I am trying to create bar plots of my ORA results but keep getting an error in dplyr::mutate()

To Reproduce
Steps to reproduce the behavior:
using attached testfile 'testgenelist.csv', the following code should reproduce the error

library(genekitr)
library(geneset)
gs3 <- getReactome(org = "human")
testgenes <- read.csv(file = "data/testgenelist.csv", header = TRUE, sep = ",")
## ORA Analysis
id <- testgenes$GeneID
test_ego <- genORA(id,
                        geneset = gs3,
                        p_cutoff = 0.05,
                        q_cutoff = 0.10
)

#plot
plotEnrich(test_ego, plot_type = "bar")

See error
The following error was raised (screenshot included):

plotEnrich(test_ego, plot_type = "bar")
Error in dplyr::mutate():
ℹ In argument: Description = factor(.$Description, levels = .$Description, ordered = T).
Caused by error in levels<-:
! factor level [20] is duplicated
Run rlang::last_trace() to see where the error occurred.

When rlang last trace is run:
Error in `dplyr::mutate()`:
ℹ In argument: `Description = factor(.$Description, levels = .$Description, ordered = T)`.
Caused by error in `levels<-`:
! factor level [20] is duplicated

Backtrace:
▆

├─genekitr::plotEnrich(test_ego, plot_type = "bar")
│ └─... %>% ...
├─dplyr::mutate(...)
├─dplyr:::mutate.data.frame(...)
│ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
│ ├─base::withCallingHandlers(...)
│ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
│ └─mask$eval_all_mutate(quo)
│ └─dplyr (local) eval()
├─base::factor(.$Description, levels = .$Description, ordered = T)
└─base::.handleSimpleError(...)
└─dplyr (local) h(simpleError(msg, call))

└─rlang::abort(message, class = error_class, parent = parent, call = error_call)

Expected behavior

I expected the barplot to be generated as normal. I haven't had this issue with any other datasets I have analyzed. Inspection of the test_ego result doesn't seem to be impacted either. Dataframe of ORA result (test_ego) screenshot included.

Screenshots
testgenelist.csv

Desktop (please complete the following information):

OS: macOS
Version 12.6.5
Browser Chrome

Additional context
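
The trace points at `factor()` being given repeated `Description` values as levels, which base R rejects. A minimal sketch of two possible workarounds, using a made-up toy data frame in place of the real `test_ego`; whether plotEnrich() then behaves as desired on the cleaned result is an assumption.

```r
# Toy ORA-like result with a repeated Description (values are made up).
ora <- data.frame(
  ID = c("R-1", "R-2", "R-3"),
  Description = c("Signaling", "Signaling", "Metabolism"),
  stringsAsFactors = FALSE
)

# The failing step from the trace: duplicated factor levels are not allowed.
failed <- inherits(
  try(factor(ora$Description, levels = ora$Description, ordered = TRUE),
      silent = TRUE),
  "try-error"
)

# Option 1: keep every row but make the labels unique.
ora_unique <- ora
ora_unique$Description <- make.unique(ora_unique$Description, sep = " #")

# Option 2: keep only the first row for each Description.
ora_first <- ora[!duplicated(ora$Description), ]

# With the real result one would then try, for example:
# plotEnrich(ora_unique, plot_type = "bar")
```

Option 1 preserves all terms (useful when distinct IDs share one description), while Option 2 silently drops rows, so it is worth checking which terms were removed.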

genInfo missing data

Hello,

I am again having issues with getting genes information.

for example this gene: ENSG00000257122 has an HGNC symbol but the package reports it as NA.

Best,

transId from alias to symbol no longer works

I'm encountering the following error when using transId() to convert gene aliases to symbols despite the same script working a week ago.

Here's the output using the example in the documentation:

> transId(c("BCC7", "TP53", "PD1", "PDL1", "TET2"), "sym")
Maybe your "trans_to" argument is wrong, please check again...
Error in tbl_vars_dispatch(x) : object 'res' not found

Could not resolve host: genekitr-china.oss-accelerate.aliyuncs.com

Hi,

When I attempt to run

"gse <- genGSEA(genelist = ranks, geneset = gs)"

I receive the following error:

"Error in function (type, msg, asError = TRUE) :
Could not resolve host: genekitr-china.oss-accelerate.aliyuncs.com"

What is the reason for this error?

symbols with Excel misidentified gene names

Hi--

I wanted to report that some symbols are official and are not returned by transId().
Also, the tool does not fix the date problem of Excel.
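
As a stopgap, date-like entries can at least be flagged before conversion. The sketch below is base R only and assumes the mangled symbols look like "1-Mar" or "Sep-07"; the exact form Excel produces depends on locale and cell format, and `looks_like_excel_date` is a hypothetical helper.

```r
# Build a pattern for "<day>-<Mon>" or "<Mon>-<number>" using R's built-in
# English month abbreviations (month.abb).
months_alt <- paste(month.abb, collapse = "|")
date_pattern <- paste0(
  "^([0-9]{1,2}-(", months_alt, ")|(", months_alt, ")-[0-9]{1,2})$"
)

# Hypothetical helper: TRUE for entries that look like Excel-converted dates.
looks_like_excel_date <- function(x) {
  grepl(date_pattern, x, ignore.case = TRUE)
}

symbols <- c("1-Mar", "Sep-07", "TP53", "MARCH1")
flagged <- symbols[looks_like_excel_date(symbols)]
```

Flagged entries still need manual review, because a date-like string does not always map back to a single original symbol.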

plotGSEA with max.overlap parameter

Hi there,

Thanks for developing this fantastic package. I have used it a lot, and I want to raise an issue about the visualization of the GSEA results.

In the 'classic' mode, if the genes are overlapped, they will only show part of the genes. May I ask if you could add a "max.overlap" to customize the number of showing in the GSEA plot.

Best,
Logan

Potential issue with upcoming ggplot2 3.5.0

Hi there,

We have been preparing for a new release of ggplot2, and during a reverse dependency check, it became apparent that the prospective ggplot2 3.5.0 would break genekitr.

The issue we encountered is in the plotVenn() function, but we believe the cause of the issue is krassowski/complex-upset#192.

I think no action is required from genekitr's side, but this issue is a heads up that plotVenn() might become broken through no fault of your own.

new version is memory hungry

Hello--

In the previous version, I used to convert 30k gene symbols in one command on my machine with 32 GB and never had a problem. Now, when I try to run the same command (transId) on the same symbols, even a machine with 128 GB kills the process because there is not enough memory.

Citation

How do we cite your tool? I can't find it? Thanks

transId() updating symbols weird behaviour

Hello--

I am updating old gene symbols with keepNA = FALSE, unique = FALSE.

I am getting some strange data (please see below).
row 138 (Gm553): is official symbol and it is returned as NA.
row 149-151: the original symbols are Ankrd44 & 4930444A19Rik.
row 156 & 157 (Mob4): it comes one time as Mob4 and one time as NA.

transId() does not return all input symbols

Hi--

I use transId():

transId(id = IDs, transTo= "symbol", org = "mouse", keepNA = TRUE, unique = TRUE)

which should return all the input symbols; however, it returns fewer records and there is always a row of all NA.

Please use the same symbols file to verify.
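
To narrow down a report like this, it can help to list exactly which inputs are absent from the result and which rows are entirely NA. The sketch below uses base R on a made-up toy result; the column name `input_id` is an assumption about what transId() returns, and `missing_inputs` is a hypothetical helper.

```r
# Hypothetical helper: inputs that do not appear in the result's input column.
missing_inputs <- function(ids, result, input_col = "input_id") {
  stopifnot(input_col %in% names(result))
  setdiff(unique(ids), result[[input_col]])
}

# Toy input and a made-up result that drops one ID and contains an all-NA row.
ids <- c("Trp53", "Mob4", "Gm553")
res <- data.frame(
  input_id = c("Trp53", "Mob4", NA),
  symbol = c("Trp53", "Mob4", NA),
  stringsAsFactors = FALSE
)

dropped <- missing_inputs(ids, res)      # IDs missing from the result
all_na_rows <- rowSums(!is.na(res)) == 0 # rows where every column is NA
```

Attaching `dropped` and the count of all-NA rows to the report gives the maintainers a concrete set of IDs to test against.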

transId() mouse symbols

Hello,

I was comparing between transId() and biomaRt and found that biomaRt returns more symbols than transId() from ensembl ids. They are official mgi symbols, what would be the reason?
