maayanlab / archs4

ARCHS4 RNA-seq processing scripts and web server pages.

License: Other

R 5.63% CSS 9.49% HTML 20.64% PHP 7.86% JavaScript 51.60% Shell 0.39% Python 0.98% XSLT 0.33% Perl 1.92% Batchfile 0.01% Rebol 0.10% Hack 0.97% Dockerfile 0.03% Raku 0.04%

archs4's Introduction

ARCHS4

ARCHS4 provides access to gene counts from HiSeq 2000 and HiSeq 2500 platforms for human and mouse experiments from GEO and SRA. The website enables downloading of the data in H5 format for programmatic access, as well as a 3-dimensional view of the sample and gene spaces. Search features allow users to browse the data by metadata annotation, submit their own up and down gene sets, and explore matching samples enriched for annotated gene sets. Selected sample sets can be downloaded into a tab-separated text file through auto-generated R scripts for further analysis. Reads are aligned with Kallisto using a custom cloud computing platform. Human samples are aligned against the GRCh38 human reference genome, and mouse samples against the GRCm38 mouse reference genome.

Website: https://amp.pharm.mssm.edu/archs4
BioRxiv: https://www.biorxiv.org/content/early/2017/09/15/189092

The collection of scripts is provided as is, and there are currently no streamlined instructions on how to use it in other projects. The alignment and processing of RNA-seq samples involves a large number of prerequisites. In the future this code base will be cleaned up and made more user friendly. Running the code requires Docker/Marathon/Mesos, Python, and R, as well as access to the Amazon Cloud.

archs4's Issues

Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) : HDF5. File accessibilty. Unable to open file.

I downloaded the R script for human esophagus from https://amp.pharm.mssm.edu/archs4/data.html and ran it in R.

An error occurred at the step that retrieves information from the compressed data:

samples = h5read(destination_file, "meta/Sample_geo_accession")
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.

This problem has been reported here by others, but the suggested solution did not work for me. I am new to R; could anyone guide me through this problem? Thanks a lot.
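One thing that can be ruled out before debugging R itself is the file on disk. The sketch below is in Python and is independent of the generated R script; it only assumes that an "Unable to open file" error can follow from a missing, empty, or non-HDF5 download, which is a guess rather than a confirmed diagnosis for this report. The function name `check_hdf5_download` is hypothetical. It will not detect a file that is truncated after the header.

```python
import os

# First 8 bytes of every HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def check_hdf5_download(path):
    """Return a short diagnosis string for a downloaded .h5 file."""
    if not os.path.isfile(path):
        # A relative path is resolved against the current working directory.
        return "missing: file not found from " + os.getcwd()
    size = os.path.getsize(path)
    if size == 0:
        return "empty: download produced a 0-byte file"
    with open(path, "rb") as handle:
        header = handle.read(len(HDF5_SIGNATURE))
    if header != HDF5_SIGNATURE:
        # For example an HTML error page saved under the .h5 name.
        return "not HDF5: file does not start with the HDF5 signature"
    return "ok: %d bytes, HDF5 signature present" % size


if __name__ == "__main__":
    # File name taken from one of the auto-generated scripts; adjust as needed.
    print(check_hdf5_download("human_matrix_v10.h5"))
```

If the result is anything other than "ok", deleting the file and downloading it again is a reasonable first step; if it is "ok", the working directory used by R is worth checking next.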

"File download ran into problems. Please try to download again."

Hi, and thank you for this great resource! I have been trying to download the HDF5 files using the auto-generated R scripts, but I keep running into this error regardless of the dataset I try to download.

"File download ran into problems. Please try to download again."

Do you have any recommendations on how to fix this?

Best,
Dylan

Filtering results?

I am interested in rRNA-depleted RNA-seq datasets. Would I be able to search for that in the ARCHS4 interface?

Please let me know.

Duplicate Gene Names Implications?

Hi @lachmann12. I really appreciate this resource; it is truly a great help. I noticed this too, and as you can see in the attached screenshot, the values for each entry are not identical. Should I infer that it was at the transcript level rather than the gene level?

Unexpected number of transcripts for mouse

Hello @lachmann12,

I was wondering why the mouse transcript quantification has only 98,492 rows. The metadata says you used Ensembl v90, which has 131,195 unique transcripts in the GTF file and 109,282 in the cDNA file provided by Ensembl. Was there any additional filtering?

Thank you!

Meta data

Hello, I had a few questions around the metadata of downloaded h5 files. Namely:

The ARCHS4 homepage shows what I assume is the number of FASTQ files that have been processed by the ARCHS4 pipeline; for example, human indicates 135 K out of approximately 700 K, while the h5 files show approximately 620 K samples. Am I right to assume that the 135 K has to do with processing? And how do I distinguish the 135 K within the 620 K samples?
Your visualization tool has a breakdown by tissue and cell type. I can't find the field in the metadata that matches that breakdown (though I can see that there is a series field that indicates, I assume through some mapping table, which tissue and cell type a sample belongs to).

Those are my main queries for now!

Thank you in advance,
Edgar

RequestTimeout

Hi again,

When I upload a fastq.gz file to elysium from a server (tunneled via SMB), I get the following error:

RequestTimeout: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.

Please advise. Thanks.

ENSG genes when using gene_symbols

Hi,

When I use the R script (h5read(destination_file, "meta/genes/gene_symbol")) the matrix generated includes genes with prefix ENSG along with regular gene symbols. Why is this?

Thanks and good day.

Missing gene correlation in gene page

API backend does not seem to return genes

Multiple downloads -- hundreds of zipped folders?

Hello!

I noticed today that if you attempt to download multiple files at once, it can spawn over 100 downloads for each one on Chrome. My example is that I tried to download from the mouse+sample page several tissue-specific gene expression files at once (Image attached).

It took a while to start downloading -- and when it did start, it downloaded about 300 zipped folders totaling nearly 4 GB. I am also fairly sure it downloaded the wrong files since the gene count tsv in the colon folder only had 285 samples -- though according to the page I downloaded it from, it should contain 1169 samples.

Anyways, hope this is helpful! I really love ARCHS4 and despite this issue I think it's an amazing tool!

Best,
Henry Miller

Updating R Scripts with New h5 Matrices

When I download an R script to read gene expression data for human data, the initial variables are as follows:

destination_file = "human_matrix_v10.h5"
extracted_expression_file = "GSE30017_expression_matrix.tsv"
url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix_v10.h5"

I see that on the ARCHS4 downloads page there is a "human_matrix_v11.h5" available. Should the h5 file that the R script prompts the user to download be the updated "v11" data?

gene counts format

Hi,

Why are the gene counts from elysium in float format, whereas the gene counts from archs4 are integers?

Thanks.

What gencode version for 'human_matrix_v1.11.h5' gene name annotation?

Hello,

The latest ARCHS4 (ARCHS4 Version 2.3) is based on Ensembl 107.
Thus, I guessed Entrez gene symbol was annotated based on gencode v41.

However, there is no information for 'human_matrix_v1.11.h5'.
The data was released on 11-16-2021, so I thought the gene symbols were annotated based on GENCODE v38, which was released in 05-2021. However, 2971 of the 35238 genes in 'human_matrix_v1.11.h5' do not overlap with GENCODE v38 gene names.

So which GENCODE version was used for the gene name annotation in 'human_matrix_v1.11.h5'?

Desired feature: On Results page, permit download of individual output tables

When an ARCHS4 query is performed for a particular gene, and the results page (e.g. https://amp.pharm.mssm.edu/archs4/gene/ACE2) displays numerous tables of output data, it would be helpful to allow users to select each individual table for download as a file; offering multiple formats such as CSV, TSV, and TXT would be convenient.

Add meta/gene_ensemblid to human_matrix.hdf5

It would be great if the human gene-level hdf5 file included a meta/gene_ensemblid object, like the mouse file does, so that users can use those a bit more confidently in downstream analysis.

The human_matrix.h5 (v8) file I took for a spin when v8 first came out does not include them.

different expression genes

Hello @lachmann12 :
I ran a differential expression analysis on ARCHS4 data, but when I used DESeq, DESeq2, or edgeR to find differentially expressed genes (DEGs), the number of DEGs was zero for GSE49110 in ARCHS4. When I use the raw counts from GEO, I get more than 100 DEGs. This is not a one-off occurrence. The correlation between GEO and ARCHS4 for GSM1193921 is over 0.7. I cannot work out why the number of DEGs is zero for GSE49110 in ARCHS4.
ARCHS4 is a great database, and I love it.

Best wishes,

Newest reference genome version of human_matrix.h5 v8 (Date: 2/2020)

Hi,

Thank you for your data. Could you please tell me which version of ensembl you use to create human_matrix.h5 v8 (Date: 2/2020)? I need to calculate gene-level TPM from transcript data. Thank you.

Best,
Zheng
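The Ensembl version matters here because gene-level TPM is usually obtained by summing the TPMs of all transcripts that belong to a gene, so the transcript-to-gene mapping has to come from the same annotation release as the transcript IDs. A minimal Python sketch of that summation follows; the function name `gene_level_tpm` and the IDs in the example are hypothetical, and the mapping is assumed to be supplied by the user from the matching Ensembl annotation.

```python
from collections import defaultdict


def gene_level_tpm(transcript_tpm, transcript_to_gene):
    """Sum transcript-level TPM values into gene-level TPM.

    transcript_tpm:     dict of transcript ID -> TPM for one sample
    transcript_to_gene: dict of transcript ID -> gene ID or symbol
    """
    gene_tpm = defaultdict(float)
    for transcript_id, tpm in transcript_tpm.items():
        gene = transcript_to_gene.get(transcript_id)
        if gene is None:
            # Transcript absent from the annotation: skipped in this sketch.
            continue
        gene_tpm[gene] += tpm
    return dict(gene_tpm)


if __name__ == "__main__":
    # Hypothetical IDs and values, for illustration only.
    sample = {"tx1": 1.5, "tx2": 2.5, "tx3": 4.0}
    mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
    print(gene_level_tpm(sample, mapping))
```

Transcripts that are missing from the mapping are silently dropped here; counting them first is a quick way to spot an annotation-version mismatch.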

Metadata info

Hi,
The datasets look very good here.
I wish to download the files for looking at the human data. I am more interested in the ages of the donors across all of the tissues. Is this metadata embedded within these H5 files? Or could there be a separate metadata file available that contains such information. I would love to be able to get a hold of such information before working on the main files.
Many thanks.

Incorrect ensembl id's in v5 mouse gene expression hdf5 file

The entries (ordering) of meta/genes has changed between the v4 and v5 mouse_matrix.h5 files, and the corresponding meta/gene_ensemblid wasn't updated to match.

For the v5 dataset, it looks like the expression values found in data/expression likely correspond to the re-ordered meta/genes entries, which makes the v5 meta/gene_ensemblid entries wrong ... and likely the other gene-level metadata in the v5 matrix (ie. I just checked that the entrez id's haven't changed from v4, which means they would also be incorrect).

library(rhdf5)
library(dplyr)
v4.h5 <- "mouse_matrix_v4.h5"
v5.h5 <- "mouse_matrix_v5.h5"

ginfo <- tibble(
  v4name = h5read(v4.h5, "meta/genes"),
  v4ens = h5read(v4.h5, "meta/gene_ensemblid"),
  v5name = h5read(v5.h5, "meta/genes"),
  v5ens = h5read(v5.h5, "meta/gene_ensemblid"))

head(ginfo)
# A tibble: 6 x 4
#   v4name  v4ens              v5name        v5ens             
#   <chr>   <chr>              <chr>         <chr>             
# 1 A1bg    ENSMUSG00000022347 0610007P14Rik ENSMUSG00000022347
# 2 A1cf    ENSMUSG00000052595 0610009B22Rik ENSMUSG00000052595
# 3 A2m     ENSMUSG00000030111 0610009L18Rik ENSMUSG00000030111
# 4 A3galt2 ENSMUSG00000028794 0610009O20Rik ENSMUSG00000028794
# 5 A4galt  ENSMUSG00000047878 0610010F05Rik ENSMUSG00000047878
# 6 A4gnt   ENSMUSG00000037953 0610010K14Rik ENSMUSG00000037953

all.equal(ginfo$v4ens, ginfo$v5ens)
# [1] TRUE

(cc @lachmann12 )

CPM and TPM from gene_abundance.tsv

Hi,

In issue #30, you shared how to obtain gene abundance values from the transcript expression levels. I would like to know how to obtain CPM and TPM values from these gene abundance values (gene_abundance.tsv). From what I understand some normalization is already performed to obtain gene_abundance.tsv. Can I still just perform the regular calculations for CPM and TPM?

Thanks.
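For reference, the standard definitions can be sketched in a few lines of Python. This assumes the inputs are unnormalized counts and, for TPM, that a length in base pairs is available for each gene; whether the values in gene_abundance.tsv meet those assumptions is not established in this thread, and applying the formulas to values that are already length-normalized would correct for length twice. The function names are hypothetical.

```python
def cpm(counts):
    """Counts per million: scale each count by the library size."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library size is zero")
    return [c * 1e6 / total for c in counts]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths must have the same length")
    # Reads per kilobase for each gene.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all length-normalized rates are zero")
    return [r * 1e6 / total for r in rates]


if __name__ == "__main__":
    # Toy values for illustration only.
    print(cpm([10, 30, 60]))
    print(tpm([10, 20], [1000, 2000]))
```

Both functions operate on a single sample; for a matrix they would be applied column by column.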

Wrong size of "/meta/samples/singlecellprobability" in mouse matrix v9

In mouse matrix v9 there are only 307268 elements in the probabilities vector, but there are 360627 samples in total.
Is this expected?
Am I right that I can match these probabilities with the first 307268 samples?

Quality filters to select/ reject GEO samples

Hello ARCHS4 team.

Thanks for developing this database and making it so freely available. The fact that all the raw files were uniformly processed and kallisto counts were directly shared is pretty awesome.

My issue concerns the GEO accession GSE57872 for Homo sapiens. In the GEO database, there are 800+ samples for this study, and processed data is made available for 500+ after removal of low-quality cells.
But in ARCHS4, only 80-odd samples are available from this study. May I know what filters or criteria were used during processing to reject the remaining cells?

I could not get this information from the ARCHS4 publication or from the code shared here. (I tried to go through it as best I could.)
I hope this isn't something silly, as no one has raised this kind of issue before.

Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native)

I downloaded the R script for human esophagus from https://amp.pharm.mssm.edu/archs4/data.html and ran it in R.

An error occurred at the step that retrieves information from the compressed data:

samples = h5read(destination_file, "meta/Sample_geo_accession")
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.

I reinstalled the "rhdf5" package, but the problem persists.

elysium

Transcript analysis using downloaded human_transcript_v7.h5

Hi there,

We love ARCHS4 and would like to extract transcript (all isoforms) expression data for our gene (ILRUN) in different tissues. There are three transcripts for ILRUN in Ensembl.

ENST00000374023.8
ENST00000374026.7
ENST00000374021.1

However, only the latter two are retrieved from your human_transcript_v7.h5 file. The first, ENST00000374023.8, is isoform a, which is believed to be the dominantly expressed transcript (principal isoform).

Looking forward to your response.
Kind regards,
Marina

Thank you kindly,
Marina

Sample disease state

Is the status of a given sample/experiment captured somewhere in the metadata for a given dataset download? I'd like to know if a given series/experiment was tagged with a phenotype, e.g. "breast cancer", or something similar.

Fix the footer

Replace the BD2K link with ITCR.
Replace DCIC logo with Ma'ayan Lab new logo.

Add meta/* data to human_transcript.h5 (v8)

The human_transcript.h5 (v8) file (rounded TPMs) seems to be missing almost all of the metadata for the data in this file.

For instance, the human_transcript_v8.h5/meta/* directory in the HDF5 file only has a Sample_channel_count file in it (no transcript ids, or anything else).

[Question]: License page clarification

Hi @AviMaayan and @lachmann12

Thanks for the work here.

I read through the license page (https://github.com/MaayanLab/archs4/blob/master/LICENSE), and it is not clear to me whether utilities such as gget can be used to query the database programmatically when the results of those queries will be used in R&D work or presentations in biotech/pharma.

As an example:

https://github.com/pachterlab/gget#-quick-start-guide has an example for gget archs4 -w tissue ACE2

Thanks in advance

Update blurb on the landing page

All RNA-seq and ChIP-seq sample and signature search (ARCHS4) (https://maayanlab.cloud/archs4/) is a resource that provides access to gene and transcript counts uniformly processed from all human and mouse RNA-seq experiments from GEO and SRA. The ARCHS4 website provides the uniformly processed data for download and programmatic access in H5 format, and as a 3-dimensional interactive viewer and search engine. Users can search and browse the data by metadata enhanced annotations, and can submit their own gene sets for search. Subsets of selected samples can be downloaded as a tab delimited text file that is ready for loading into the R programming environment. To generate the ARCHS4 resource, the kallisto aligner is applied in an efficient parallelized cloud infrastructure. Human and mouse samples are aligned against the most recent Ensembl annotation (Ensembl 107).

Duplicate gene symbols

Hi,
I've noticed that some genes share the same symbol but have different indices (meaning they might have different sample values).

It looks like there are around 2000 duplicate gene symbols, with some appearing over 10 times.

Any idea what could cause this?
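Independently of the cause, which is not established in this thread, the duplicates can be counted and, if desired, collapsed before downstream analysis. The Python sketch below works on a plain list of symbols and a matching list of expression rows; summing duplicate rows is one common choice and is an assumption here, not a recommendation from the ARCHS4 authors. Both function names are hypothetical.

```python
from collections import Counter


def duplicated_symbols(symbols):
    """Return {symbol: occurrences} for every symbol seen more than once."""
    counts = Counter(symbols)
    return {symbol: n for symbol, n in counts.items() if n > 1}


def collapse_by_symbol(symbols, rows):
    """Sum the expression rows of duplicate symbols, element by element.

    symbols: list of gene symbols, one per row
    rows:    list of equal-length lists of expression values
    """
    collapsed = {}
    for symbol, row in zip(symbols, rows):
        if symbol in collapsed:
            collapsed[symbol] = [a + b for a, b in zip(collapsed[symbol], row)]
        else:
            collapsed[symbol] = list(row)
    return collapsed


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_symbols = ["A", "B", "A"]
    toy_rows = [[1, 2], [3, 4], [5, 6]]
    print(duplicated_symbols(toy_symbols))
    print(collapse_by_symbol(toy_symbols, toy_rows))
```

Keeping only the row with the highest mean, or keying rows by Ensembl ID where one is available, are alternatives that avoid merging distinct loci.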

ARCHS4 bugs in the gene pages

The list of co-expressed genes sometimes does not show up. The numbers for the tissue expression levels are not legible (see the screenshot, which captures both bugs).

Gene correlation files on the site

Hi,
I noticed that there is new ARCHS4 data (from 2024), yet the gene correlation files are old (2018).
I was wondering if the correlation data on the site itself is up to date and if so, is there a way to download it as a file?
Thanks!

Error when downloading gene expression files

Hi,

When I download certain gene expression files (e.g. GSE121380) from the generated R scripts I run into the following error:

Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter.
Calls: t ... tryCatch -> tryCatchList -> tryCatchOne ->
Error: Error in h5checktype(). H5Identifier not valid.
Execution halted

I've tried using up to 120 GB of memory and I still get the same error.

Please advise. Thanks.

Some questions regarding h5 files

Hey,
I hope this is the right place to ask these questions - please point me in the right direction if not.
There were two questions that arose while working with the downloadable gene expression h5 files:

How can I tell if a gene wasn't part of a sample? Currently I assume 0 expression means the gene isn't there.
What is the difference between the "Expression (gene level)" and "Expression (transcript level)" files offered for download on the archs4 site?
I was wondering if there is a best practice when creating gene expression correlations using the ARCHS4 data? For example the batch effect correction I saw on the site, or handling missing expressions from samples while measuring expression correlation.

Thank you!

Difference between human_correlation.rda and human_correlation_archs4.f?

Dear ARCHS4 developers,

As the title says, I was wondering what the difference between these two files is. I noticed that the values and the number of genes are different.
Is the archs4.f file the more recent one?

Thanks!

Instructions on running pipeline

Are there instructions somewhere on how to run this pipeline? I would like to expand upon it, but it's not clear to me what the process is or what the dependencies are.

Much appreciated.

Incomplete meta/ensemblid and meta/transcriptlength entries in mouse_hiseq_eid_1.0.h5

It seems that the transcript information is incomplete for the mouse transcript-level expression matrix (mouse_hiseq_eid_1.0.h5).

The data/expression matrix has quantitation for 178,136 transcripts; however, the meta/ensemblid and meta/transcriptlength objects have only 98,492 entries each.

The dimensions of these same objects in human_hiseq_eid_1.0.h5, however, appear to be concordant.

Programmatic way to submit fastq files

Hi,

I have about 700+ fastq files (16 GSE IDs) I would like to submit to get gene expression files. Is there a way for me to programmatically do this instead of going to the elysium/biojupies webpage and uploading the file each time?

Thanks and good day.

Latest pipeline to create ARCHS4 Version 2.1.2 h5 files?

Hi,
What is the pipeline used to create human_matrix_v2.1.2.h5 and mouse_matrix_v2.1.2.h5 ? Is it the same as the pipeline mentioned in the 2018 paper?
Thanks.