GREIN : GEO RNA-seq Experiments Interactive Navigator

Home Page: https://shiny.ilincs.org/grein

License: GNU General Public License v2.0

grein's Introduction

GREIN : GEO RNA-seq experiments interactive navigator for re-analyzing GEO RNA-seq data

GREIN is accessible at:
https://shiny.ilincs.org/grein

The Gene Expression Omnibus (GEO) is a public repository of gene expression data that hosts more than 7,500 RNA-seq datasets, and this number is increasing. Most of these samples are deposited in raw sequencing format, which must be downloaded and processed before analysis. To transform all these datasets into an analysis-ready format, we are currently processing the available human, mouse, and rat RNA-seq samples from GEO using GREP2, an R-based automated pipeline. This pipeline simultaneously downloads and processes the raw RNA-seq sequencing data available in GEO. We present the results in a web application, GREIN (GEO RNA-seq Experiments Interactive Navigator), built with the shiny framework. This interactive and intuitive application allows a user with little or no programming background to explore and analyze processed GEO RNA-seq datasets in a point-and-click manner. GREIN can also create ilincs-compliant signatures that can be uploaded to ilincs for further in-depth analysis. In addition, GREIN produces publication-ready graphs and lets the user download all the analysis results. By accumulating the processed GEO RNA-seq datasets in a common platform, we present GREIN as a prominent choice for practitioners analyzing GEO RNA-seq datasets.

If you use GREIN, please cite:

Al Mahi, Naim, et al. "GREIN: An interactive web platform for re-analyzing GEO RNA-seq data." Scientific Reports 9.1 (2019): 7580. doi: https://doi.org/10.1038/s41598-019-43935-8

Docker instructions

Installation of Docker

Ubuntu: follow the instructions to get Docker CE for Ubuntu.

Mac: follow the instructions to install the Stable version of Docker CE on Mac.

Windows: follow the instructions to install Docker Toolbox on Windows.

Download and run the docker container

To obtain the latest docker image, run the following in your command line:

[sudo] docker pull ucbd2k/grein

Linux users may need to use sudo to run Docker.

To run the container execute the following command:

[sudo] docker run -d -p <an available port>:3838 ucbd2k/grein

Typically you can use port 3838 if it is not already used by another application. In that case the command is

[sudo] docker run -d -p 3838:3838 ucbd2k/grein

First make sure that port 3838 is free. If it is not, you can stop and remove any other Docker container that is using this port with

[sudo] docker stop <container ID> && [sudo] docker rm <container ID>

To know the container ID run this command:

docker ps -a
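
Checking whether port 3838 is free can also be scripted. The following is a minimal Python sketch, not part of GREIN: the function name port_is_free is hypothetical, and it only tests whether a TCP socket can bind to the port on the local interface.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Binding fails when another process already holds the port.
            return False
        return True
```

If port_is_free(3838) returns False, either choose another host port for the -p option or stop the container that holds the port, as described above.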

Before starting GREIN, download the GitHub repository. An example dataset (GSE22666) is provided in the data folder. Keep every processed dataset, together with its files (eset.RData, multiqc_report.html, and transcripts.RData), in this folder. You can process GEO RNA-seq datasets with our GREP2 pipeline. To process data on the fly from within GREIN, run the GREP2 pipeline in the backend: run GEO_data_processing.R for each new dataset you want to process, changing the parameters accordingly. The script takes the GEO accession ID from the user_geo_request folder, and GREIN looks for the log file within this directory. After processing the data, run after_processing.R to update the datatable used by GREIN.
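
A quick way to verify the data folder before launching GREIN is to check that each dataset has the three files named above. This is a minimal Python sketch under one assumption that the text does not confirm: each dataset sits in its own subfolder of data, named by its GEO accession (for example data/GSE22666). The function name missing_files is hypothetical.

```python
from pathlib import Path

# File names come from the instructions above; the per-dataset
# subfolder layout is an assumption.
REQUIRED_FILES = ("eset.RData", "multiqc_report.html", "transcripts.RData")


def missing_files(data_dir):
    """Map each dataset subfolder name to the required files it lacks."""
    report = {}
    for dataset in sorted(Path(data_dir).iterdir()):
        # Skip loose files and the request folder, which holds
        # accession requests and logs rather than processed results.
        if not dataset.is_dir() or dataset.name == "user_geo_request":
            continue
        missing = [name for name in REQUIRED_FILES
                   if not (dataset / name).is_file()]
        if missing:
            report[dataset.name] = missing
    return report
```

Calling missing_files("data") from the repository root returns an empty dictionary when every dataset folder is complete.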

To start GREIN, open a browser and enter <Host URL>:<available port as specified> in the address bar. For example, use http://localhost:3838 on Mac or Linux systems when port 3838 is used.

The host URL on Ubuntu and Mac is localhost when GREIN is accessed locally. On Windows, the IP is shown when Docker is launched by double-clicking the Docker Quickstart Terminal icon on the desktop, or it can be obtained from the output of docker-machine ls in the interactive shell window.

Issues and bug reports

Please submit any bug reports, issues, or comments here: https://github.com/uc-bd2k/GREIN/issues

grein's Issues

Analysis not done

Hello

Thank you for making a great program.
I tried to analyze GSE144269 at http://www.ilincs.org/apps/grein/?gse=.
The processing console showed that the analysis had started, but
it did not get past 2 steps
before going back to the beginning.
I tried more than 4 times, and it was interrupted in the same way each time.

Could you solve this problem?
Does this interruption happen because the dataset is large?

Regards,

Header in data and meta data file does not match

In the metadata, the id column uses the GSE id. In the transcript count file (not sure about the gene-level file), the sample columns have the id structure paste0( SRR_id, '_', 'GSE_id').

GSE2666 example in container, Create Signature, not working

Hi,
I installed GREIN using Docker, and the interface shows as advertised. However, when I tried to analyze the example dataset GSE2666 (Analyze -> Create Signature -> Variable of Interest: Cell Line, Type of Comparison: Two group without covariate, Experimental Group: H9, Control Group: HeLa, then Generate Signature), it looks like it is computing but then shows no output. When I download the signature, it is also empty.

I think your work is great, and I would like to use it locally (by the way, when I go to the public server, the dataset GSE2666 has not been analyzed yet ... I thought it would be there since you chose it as the example in the container).

I would appreciate it if you could help; it may be something simple (I hope). I can ssh into the container, but I don't know which log files could help to understand the issue.

API for download and TPM calculation

Hello,

GREIN looks great! I am interested in downloading expression files for blood samples from GEO.
What would be the best way to do so? Also, given the transcript counts and the genome reference version, please let me know how to convert the measurements to TPM.

Many thanks,
Eila
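
On the TPM question above, the standard definition divides each count by the feature length, then rescales the resulting rates so that they sum to one million within a sample. The following is a minimal Python sketch of that formula. It assumes a length in bases is available for every transcript or gene, which the text does not confirm is part of GREIN's downloads; the function name counts_to_tpm is hypothetical.

```python
def counts_to_tpm(counts, lengths):
    """Convert one sample's counts to TPM, given positive feature lengths in bases."""
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same number of entries")
    # Reads per base: normalise each count by the length of its feature.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        # A sample with no reads has no defined proportions; report zeros.
        return [0.0] * len(rates)
    # Rescale so that the values of one sample sum to one million.
    return [rate / total * 1_000_000 for rate in rates]
```

The conversion is done per sample, so a count matrix is processed one column at a time with the same vector of lengths.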

download metadata

Hello naimmahi,
I can't load the page http://www.ilincs.org/tmp/GREIN_metadata.h5, and I can't download data from https://shiny.ilincs.org/grein. Can you give me some advice on how to download the GREIN_metadata? Thanks very much.

Server down error?

Hey there, I've tried to load this page http://www.ilincs.org/apps/GRIN/ a few times today and yesterday without success, and I keep seeing this error.

Help?

Duplicated gene symbol

Hi, I got the gene raw counts from GREIN, but the gene symbol column has many duplicated genes (around 2,700) with different Ensembl IDs. If I want to use gene symbols for further analysis, how can I remove the duplicated rows? Thank you

ex:
ENSG00000204574 ABCF1
ENSG00000206490 ABCF1
ENSG00000225989 ABCF1
ENSG00000231129 ABCF1
ENSG00000232169 ABCF1
ENSG00000236149 ABCF1
ENSG00000236342 ABCF1
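
One common way to get a single row per gene symbol is to sum the counts of all Ensembl IDs that share the symbol. Whether summing is appropriate depends on the analysis, and keeping the Ensembl IDs as the row key avoids the problem entirely. The sketch below is in Python and is not part of GREIN; the function name collapse_by_symbol and the row layout (Ensembl ID, symbol, per-sample counts) are assumptions for illustration.

```python
def collapse_by_symbol(rows):
    """Sum the count vectors of rows that share a gene symbol.

    rows: iterable of (ensembl_id, symbol, [count per sample]).
    Returns {symbol: [summed count per sample]}.
    """
    totals = {}
    for _ensembl_id, symbol, counts in rows:
        if symbol in totals:
            # Add this row's counts sample by sample to the running total.
            totals[symbol] = [a + b for a, b in zip(totals[symbol], counts)]
        else:
            totals[symbol] = list(counts)
    return totals
```

For example, two ABCF1 rows with hypothetical counts [5, 0] and [1, 2] collapse to a single ABCF1 entry with counts [6, 2].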

All counts are zero

Hello
I thank you for your academic contributions.

I downloaded the count data for GSE110390, which has already been analyzed,
from http://www.ilincs.org/apps/grein/?gse= .
However, the dataset shows that the counts for all genes are zero.

Could you please solve this problem?

Regards,

GEO_data_processing.R get error

Hi, I tried to run GREP2 on a new dataset by following GEO_data_processing.R, but I get the errors below.
In the folder "data/user_geo_request/" I only see a test.txt file with empty content. How can I fix this?
For example, I want to process the dataset GSE22666:

library(GREP2)

logdir <- "data/user_geo_request/"
destdir <- "data/user_geo_request/"
cat(paste("STEP 1: Processing starts... ","\n",sep=""),file=paste0(logdir,"/",geo_series_acc,"/log.txt"))
Error in paste0(logdir, "/", geo_series_acc, "/log.txt") :
object 'geo_series_acc' not found

process_geo_rnaseq (geo_series_acc=GSE22666,destdir="data/user_geo_request",
ascp=TRUE,prefetch_workspace="path_to_prefetch_workspace",
ascp_path="path_to_aspera",get_sra_file=FALSE,trim_fastq=FALSE,
trimmomatic_path=NULL,index_dir="path_to_indexDir",
species=species,countsFromAbundance="lengthScaledTPM",n_thread=2)
Error in process_geo_rnaseq(geo_series_acc = GSE22666, destdir = "data/user_geo_request", :
unused arguments (get_sra_file = FALSE, trimmomatic_path = NULL)

Combine multiple dataset from GREIN

I have a question about combining results. As far as I know, combining different datasets introduces batch effects. How can I avoid batch effects when combining results from GREIN? Specifically, I want to combine gene counts or normalized counts.
Thank you

ERROR: GSE174836 cannot be processed

Hi!

Thank you for this great tool!

One dataset that I would like to analyse is not yet in GREIN so I am trying to process it. Its GEO accession number is GSE174836.

However, it is giving the following error. Does this mean that single-cell RNA data cannot be processed? I find it strange because many pre-processed datasets in GREIN are single-cell RNA data.

How can I preprocess and analyse this dataset on GREIN? Thank you in advance!

Gene number different from different dataset

I got results from several different datasets. However, the number of genes in the count tables differs.
For example, GSE55807 has 28,089 genes and GSE126669 has 28,125 genes.
Can I combine results from multiple datasets for further analysis?
Thank you.
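
When count tables from different datasets list different genes, one simple option is to keep only the genes present in every table before joining the sample columns. The Python sketch below illustrates this; it is not part of GREIN, the function names are hypothetical, and each table is assumed to be a dictionary from gene ID to a list of per-sample counts. It does nothing about batch effects, which need separate handling.

```python
def common_genes(*tables):
    """Return gene IDs present in every table, in the order of the first table."""
    if not tables:
        return []
    shared = set(tables[0]).intersection(*tables[1:])
    return [gene for gene in tables[0] if gene in shared]


def merge_counts(*tables):
    """Join per-sample counts side by side for the genes shared by all tables."""
    return {
        gene: [count for table in tables for count in table[gene]]
        for gene in common_genes(*tables)
    }
```

Genes missing from any one dataset are dropped, so the merged table can be smaller than each input.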

Suggestions for improvements

I think GREIN is a really useful tool, but I have two suggestions for improvements:

Make abundances downloadable as well. Many people would be interested in them, and the improved abundance estimates are one of the main features of a Salmon quantification.
If you re-run the pipeline, I would like Salmon to be run with bias correction (--seqBias and --gcBias), since those are some of the main features of Salmon.

The following datasets are not complete in terms of sample size.

GREIN is a fantastic tool for exploring RNA-seq data, and I greatly appreciate it. However, certain datasets include only 20 samples each, which is not consistent with the sample size listed in GEO. There might be a bug. Could I request re-processing of the following datasets?
GSE184941
GSE190504
GSE180280
GSE183947
GSE189757

GSE146009
GSE162960
GSE165255
GSE183984
GSE107422

GSE179746
GSE158420
GSE171415
GSE142441
GSE172356

GSE181273
GSE133626
GSE147493
GSE179252
GSE184336

GSE113255
GSE126304
GSE127165
GSE142083
GSE173855

GSE112026
GSE179351
Thank you very much!!

Post your comments or suggestions here. Thanks!

Hi. I downloaded raw counts for GSE83577, but the data corresponds to normalized counts. Is this a problem just for this dataset or a platform bug? Thanks

Some datasets are not loading completely

Hi guys,
I was trying to look at some datasets (e.g. GSE133317). They seem to be processed, but the metadata, count table, etc. cannot be loaded.
There is probably a bug somewhere!

Thanks,
Mehdi

new datasets

Hi guys,
I was wondering why the number of in-progress datasets is larger than the number of processed ones (first page plot). It seems that something has changed and the pipeline can no longer download data from GEO. I checked several recently published datasets on GEO: either someone else had already tried them and they failed, or, when I submitted them for analysis, they failed at the download step shortly after metadata extraction (e.g. GSE125422, GSE159067...). It seems many dataset names and metadata have been added to the database list but failed to download and process. Any idea?

gene names not correspond to human ensembl v91

Hi team,

Recently I downloaded gene expression files for GSE58375 from GREIN, but I found something contradictory. In GSE58375_GeneLevel_Raw_data.csv, genes named "COX1", "ND5", "COX3", "CYTB", "COX2" and more are not in the human Ensembl v91 GTF file. Yet these genes can be found on the GREIN website and in the downloaded files, with corresponding expression values. Could you please explain how this happens? Thank you.

Best,
Regards

Header in downloadable data is partially wrong

I found this in the transcript level data (not sure about gene level data) The header looks like this:

",V1,Sample1,Sample2"

The data is structured as:

"rowNumber,transcriptId,Sample1"

This means that a header is missing, which will cause R's read.csv to skip the transcriptId column.

Is there a particular reason you would include rowNumbers in the data?
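
Until the header is fixed, a downloaded file can be cleaned before loading. The Python sketch below assumes exactly the layout reported above: an unnamed first column of row numbers, transcript IDs under V1, then one column per sample. The function name clean_transcript_table and the replacement column name transcript_id are hypothetical.

```python
import csv
import io


def clean_transcript_table(text):
    """Drop the row-number column and give the ID column a proper name.

    Returns a list of rows; the first row is the repaired header.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # header[0] is the unnamed row-number column and header[1] is "V1".
    cleaned = [["transcript_id"] + header[2:]]
    for row in reader:
        # Keep the transcript ID and the sample values, drop the row number.
        cleaned.append(row[1:])
    return cleaned
```

The cleaned rows can be written back with csv.writer, after which the file has one named column per field.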

Count table at gene level

Hi,
I am trying to run DEA for some datasets using the raw count table at gene level. I see that some gene symbols are mapped to multiple IDs, such as AADACL2, mapped to 2 different Ensembl IDs, and LST1, mapped to 8 different Ensembl IDs. Is there a way to get one row for each gene symbol?
Thanks,

Fast download of Count Matrix?

Thank you for creating this very powerful database. You have made outstanding contributions to GEO data mining. I would like to ask if there is a faster way to download the count matrix, as the website is not well suited to web crawlers. Please contact me if there are any further developments. Thank you once again.

Incomplete Dataset and New Datasets

Respected developers and authors,

I sincerely thank you for hosting and maintaining this website, which analyzes and provides raw data for tons of RNA-seq studies in GEO. I have also recently used the processing request to get some datasets analyzed, and it's great to see such quick processing on your end.

I have some queries regarding certain human datasets I am investigating that are hosted on GEO. The details with GSE IDs are as follows:

GSE135743 - It seems the dataset is analyzed for only 20 samples while the total samples in the GEO is 59. Can it be redone on your end with all the samples?

GSE144254 and GSE78928 - It seems these datasets are not available on your backend. Last I checked GSE144254 has 42 samples of only bulk mRNA-seq data.

GSE78928 has bulk mRNA seq samples in both Homo Sapiens and Mus musculus. Additionally it has ncRNA-seq data as well. I guess the pipeline cannot differentiate multiple organisms/RNA library types in a single dataset?

At least GEO shows that raw SRA data for both of these datasets are available. Is there a way to ask the authors/developers of GREIN to add datasets and specific samples on your backend for analysis?

GSE182866 - This dataset has no accessible raw SRA in the GEO database. So I assume this cannot be processed?

Thank You .

GEO_data_processing error

library(GREP2)

logdir <- "data/user_geo_request/"
destdir <- "data/user_geo_request/"
cat(paste("STEP 1: Processing starts... ","\n",sep=""),file=paste0(logdir,"/",geo_series_acc,"/log.txt"))
Error in paste0(logdir, "/", geo_series_acc, "/log.txt") :
object 'geo_series_acc' not found

process_geo_rnaseq (geo_series_acc=GSE22666,destdir="data/user_geo_request",
ascp=TRUE,prefetch_workspace="path_to_prefetch_workspace",
ascp_path="path_to_aspera",get_sra_file=FALSE,trim_fastq=FALSE,
trimmomatic_path=NULL,index_dir="path_to_indexDir",
species=species,countsFromAbundance="lengthScaledTPM",n_thread=2)
Error in process_geo_rnaseq(geo_series_acc = GSE22666, destdir = "data/user_geo_request", :
unused arguments (get_sra_file = FALSE, trimmomatic_path = NULL)

website down?

Hello,
I haven't been able to access the GREIN website for a few days. A blue circle is constantly rotating and the background turns light grey after a few seconds. The problem remains the same with Chrome, Opera and Firefox.
Thank you.

Christophe

Site not loading (bis)

The site is still not loading today (May 23, 2022).

ALL SAMPLES ARE NOT INCLUDED FOR ANALYSIS

I am trying to analyse GSE131705 and GSE134900 through GREIN. When I searched for these two in section 2, it showed that the data has been processed for both series. NCBI GEO shows that both series have more than 90 samples, but in GREIN only 20 samples are included.
GREIN is also not accepting re-analysis of these series.

Any possibility for mass download or API for dataset

Hi, I really appreciate the webserver and its high-quality RNA-seq datasets. However, I am working on a project that requires LOTS of GEO RNA-seq datasets, and I found that downloading manually or with crawler tools like selenium is inefficient for GREIN. Is there any possibility for the team to offer access to the database files, such as an archive of RNA-seq rawcount.csv files, or to provide an API like getGEO in R? Thanks a lot.

Site not loading

Hi,
I was wondering if the website is still active?
When I try to use it, the elements on the screen keep showing the loading animation and the screen is greyed out.

Best regards,
Sebastian