README for D-GEX

INTRODUCTION

Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations and so on. Although the cost of whole-genome expression profiling has been dropping steadily, generating a compendium of expression profiles over thousands of samples is still very expensive. Recognizing that gene expression levels are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ~1,000 carefully selected landmark genes and relying on computational methods to infer the expression of the remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression, which limits its accuracy because it does not capture complex nonlinear relationships between the expression of genes.

We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based GEO dataset, consisting of 111K expression profiles, to train our model and compared its performance with that of other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms linear regression with a 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than linear regression in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2,921 expression profiles. Deep learning still outperforms linear regression with a 6.57% relative improvement, and achieves lower error in 81.31% of the target genes.
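The two comparison metrics mentioned above (relative improvement in mean absolute error, and the share of target genes with lower error) can be illustrated with a minimal sketch. It assumes predictions are stored as NumPy arrays of shape (samples, genes) and that relative improvement is measured against the baseline's error. The function names and the synthetic data are hypothetical and are not part of the D-GEX code base.

```python
import numpy as np

def gene_wise_mae(y_true, y_pred):
    """Mean absolute error per gene (column), averaged over samples (rows)."""
    return np.abs(y_true - y_pred).mean(axis=0)

def compare_models(y_true, y_hat_baseline, y_hat_model):
    """Return (relative improvement in overall MAE, fraction of genes improved)."""
    mae_base = gene_wise_mae(y_true, y_hat_baseline)
    mae_model = gene_wise_mae(y_true, y_hat_model)
    # Overall error: gene-wise MAE averaged across all genes.
    rel_improvement = (mae_base.mean() - mae_model.mean()) / mae_base.mean()
    # Gene-wise comparison: share of genes with strictly lower error.
    frac_better = (mae_model < mae_base).mean()
    return float(rel_improvement), float(frac_better)

# Synthetic illustration only; these are not the paper's data.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(200, 50))
y_hat_baseline = y_true + rng.normal(scale=0.5, size=y_true.shape)
y_hat_model = y_true + rng.normal(scale=0.4, size=y_true.shape)

rel, frac = compare_models(y_true, y_hat_baseline, y_hat_model)
print(f"relative improvement: {rel:.2%}, genes improved: {frac:.2%}")
```

In practice the baseline would be the linear regression predictions and the model would be the D-GEX predictions for the same held-out samples.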

This code base provides all the necessary pieces to reproduce the main results of D-GEX. If you have any questions, please email [email protected]

PREREQUISITES

DATA

The original data files are not provided within this codebase, as some of them require applying for access. Once you have downloaded all of them, please place them in this codebase.

GEO and GTEx

The GEO and GTEx data we used in our paper are preliminary versions that predate their official publication, and are not publicly available. If you are interested in the data, please email us ([email protected]) with your basic information from an academic institutional email address, and we will provide you with a private download link. The files you will download are bgedv2_QNORM.gctx and GTEx_RNASeq_RPKM_n2921x55993.gctx.

1000G

The 1000 Genomes RNA-Seq expression data can be accessed from EMBL-EBI. The file to download is GD462.GeneQuantRPKM.50FN.samplename.resk10.txt.

L1000

The predicted expression of L1000 data based on D-GEX can be downloaded at l1000_n1328098x22268.gctx. It consists of 1328098 expression profiles of 22268 genes. The first 978 genes are landmark genes that were directly measured by the L1000 platform. The other 21290 genes are target genes inferred by D-GEX based on the GEO data. The expression profiles of each gene were standardized to mean 0 and standard deviation 1.
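As a rough illustration of the layout described above, the sketch below splits a matrix into its landmark and target blocks. It assumes the data has already been loaded into a NumPy array with one row per profile and one column per gene; the actual orientation inside the .gctx file may differ, and the helper name and the small synthetic matrix are hypothetical.

```python
import numpy as np

N_LANDMARK = 978    # genes measured directly by the L1000 platform
N_TARGET = 21290    # genes inferred by D-GEX
N_GENES = N_LANDMARK + N_TARGET  # 22268

def split_landmark_target(profiles):
    """Split a (samples, 22268) matrix into landmark and target blocks."""
    if profiles.shape[1] != N_GENES:
        raise ValueError(f"expected {N_GENES} genes, got {profiles.shape[1]}")
    return profiles[:, :N_LANDMARK], profiles[:, N_LANDMARK:]

# Small synthetic stand-in for the real 1328098-profile matrix.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(100, N_GENES))
# Standardize each gene to mean 0 and standard deviation 1.
profiles = (profiles - profiles.mean(axis=0)) / profiles.std(axis=0)

landmark, target = split_landmark_target(profiles)
print(landmark.shape, target.shape)
```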

PREPROCESS

The whole preprocessing pipeline is run with

$ ./preprocess.sh

Specifically, there are four steps.

  1. Removing duplicates by k-means: kmeans.py, nodup_idx.py.
  2. Converting data into numpy format: bgedv2.py, GTEx.py, 1000G.py.
  3. Quantile normalization: bgedv2_reqnorm.py, GTEx_reqnorm.py, 1000G_reqnorm.py.
  4. Standardization: bgedv2_norm.py, GTEx_norm.py, 1000G_norm.py.
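Steps 3 and 4 can be sketched in a few lines of NumPy. This is a generic illustration of quantile normalization followed by per-gene standardization, not a copy of the *_reqnorm.py or *_norm.py scripts, which may differ in detail (for example in the reference distribution they normalize against). The function names are hypothetical and the input is assumed to have shape (samples, genes).

```python
import numpy as np

def quantile_normalize(x):
    """Give every sample (row) of x the same empirical distribution.

    The reference distribution is the mean of the sorted rows. Ties are
    broken by sort order rather than averaged.
    """
    order = np.argsort(x, axis=1)
    ranks = np.argsort(order, axis=1)          # rank of each value within its row
    reference = np.sort(x, axis=1).mean(axis=0)
    return reference[ranks]

def standardize(x, mean=None, std=None):
    """Scale each gene (column) to mean 0 and standard deviation 1.

    Pass the training-set mean and std to apply the same scaling to new data.
    """
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0) if std is None else std
    return (x - mean) / std, mean, std

# Synthetic illustration only.
rng = np.random.default_rng(0)
raw = rng.lognormal(size=(20, 8))
qn = quantile_normalize(raw)
z, mu, sigma = standardize(qn)
print(z.mean(axis=0).round(6), z.std(axis=0).round(6))
```

Returning the mean and standard deviation from standardize makes it possible to reuse the statistics of one dataset when scaling another, which is one common design choice; whether the repository does this is not stated here.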

TRAINING

D-GEX is trained by running H1_0-4760.py, H1_4760-9520.py, H2_0-4760.py, H2_4760-9520.py, H3_0-4760.py and H3_4760-9520.py. Each script trains half of the target genes (0-4760 or 4760-9520) with a certain architecture (1, 2 or 3 hidden layers).
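The naming scheme of the six scripts, and how the two halves of the target genes fit together, can be sketched as follows. The sketch assumes that the ranges are half-open column slices (0-4760 covers columns 0 through 4759), which the script names suggest but do not confirm; the helper name and the placeholder arrays are hypothetical.

```python
import numpy as np

N_TARGET = 9520
HALVES = [(0, 4760), (4760, 9520)]  # assumed half-open column ranges

def script_name(n_hidden_layers, start, end):
    """Build a training script name such as H1_0-4760.py."""
    return f"H{n_hidden_layers}_{start}-{end}.py"

scripts = [script_name(h, s, e) for h in (1, 2, 3) for s, e in HALVES]
print(scripts)

# Predictions from the two halves can be joined along the gene axis.
y_hat_first = np.zeros((10, 4760))   # placeholder for the 0-4760 predictions
y_hat_second = np.ones((10, 4760))   # placeholder for the 4760-9520 predictions
y_hat_full = np.concatenate([y_hat_first, y_hat_second], axis=1)
print(y_hat_full.shape)
```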

For example, to train target genes 0-4760 for 200 epochs with an include rate of 0.75 (a dropout rate of 0.25) and 1 hidden layer of 9000 hidden units, run:

$ ./H1_0-4760.py 9000_H1_0-4760_75 200 9000 0.75

Here, 9000_H1_0-4760_75 is the base name for all the output files.
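The four positional arguments in the command above can be read as base name, number of epochs, number of hidden units and include rate. A minimal sketch of parsing them is shown below; the TrainConfig class and parse_args function are hypothetical and are not taken from the training scripts, which may handle their arguments differently.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    base_name: str        # prefix for all output files
    n_epochs: int
    n_hidden_units: int   # units in each hidden layer
    include_rate: float   # probability of keeping a unit during dropout

    @property
    def dropout_rate(self):
        """Dropout rate is the complement of the include rate."""
        return 1.0 - self.include_rate

def parse_args(argv):
    """Parse [base_name, n_epochs, n_hidden_units, include_rate]."""
    if len(argv) != 4:
        raise SystemExit("usage: <script> BASE_NAME N_EPOCHS N_HIDDEN INCLUDE_RATE")
    base_name, n_epochs, n_hidden, include_rate = argv
    rate = float(include_rate)
    if not 0.0 < rate <= 1.0:
        raise SystemExit("INCLUDE_RATE must be in (0, 1]")
    return TrainConfig(base_name, int(n_epochs), int(n_hidden), rate)

config = parse_args(["9000_H1_0-4760_75", "200", "9000", "0.75"])
print(config, config.dropout_rate)
```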

OUTPUT

Each training run writes 7 files. For example, running

$ ./H1_0-4760.py 9000_H1_0-4760_75 200 9000 0.75

produces:

9000_H1_0-4760_75.log, the log file of the training run.

9000_H1_0-4760_75_bestva_model.pkl, the model with the best performance on Y_va (GEO microarray data).

9000_H1_0-4760_75_bestva_Y_va_hat.npy, the Y_va_hat predicted by the model with the best performance on Y_va (GEO microarray data).

9000_H1_0-4760_75_bestva_Y_te_hat.npy, the Y_te_hat predicted by the model with the best performance on Y_va (GEO microarray data).

9000_H1_0-4760_75_best1000G_model.pkl, the model with the best performance on Y_1000G (1000G RNA-Seq data).

9000_H1_0-4760_75_best1000G_Y_1000G_hat.npy, the Y_1000G_hat predicted by the model with the best performance on Y_1000G (1000G RNA-Seq data).

9000_H1_0-4760_75_best1000G_Y_GTEx_hat.npy, the Y_GTEx_hat predicted by the model with the best performance on Y_1000G (1000G RNA-Seq data).
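To check that a training run produced everything it should, the seven file names can be derived from the base name. The sketch below only builds the names listed above and reports which ones are missing from a directory; the helper names are hypothetical.

```python
from pathlib import Path

OUTPUT_SUFFIXES = [
    ".log",
    "_bestva_model.pkl",
    "_bestva_Y_va_hat.npy",
    "_bestva_Y_te_hat.npy",
    "_best1000G_model.pkl",
    "_best1000G_Y_1000G_hat.npy",
    "_best1000G_Y_GTEx_hat.npy",
]

def output_files(base_name):
    """Return the 7 output file names for a given base name."""
    return [base_name + suffix for suffix in OUTPUT_SUFFIXES]

def missing_outputs(base_name, directory="."):
    """Return the expected output files that do not exist in directory."""
    return [name for name in output_files(base_name)
            if not (Path(directory) / name).exists()]

names = output_files("9000_H1_0-4760_75")
print("\n".join(names))
```

The .npy prediction files can then be read with numpy.load, for example numpy.load("9000_H1_0-4760_75_bestva_Y_te_hat.npy").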

REFERENCE

Gene expression inference with deep learning, 2016. Bioinformatics, bioRxiv.

d-gex's People

Contributors

yil8

d-gex's Issues

data file

Hi,

In the script "GTEx_reqnorm.py", one of the input files is "bgedv2_GTEx_1000G_quantile_float64.npy". However, I could not find this file in the output of the previous steps or in the data directory [https://cbcl.ics.uci.edu/public_data/D-GEX/]. Could you please help me with this? Is it the same file as "bgedv2_GTEx_1000G_float64.npy" in "bgedv2.py"?

I would highly appreciate your help!

The GTEx RNA-Seq expression data

Hello, Professor Li. I have recently been doing research in this field.
However, I can't find the GTEx RNA-Seq expression data file GTEx_RNASeq_RPKM_n2921x55993.gctx on the GTEx Portal.
Would the file GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz be an appropriate substitute?

some question about output?

Hello.
I have run your code on my dataset,
but there are some errors in some of the output files,
for example, H2-6000:
[image]

Do you know how to solve them?

Thanks a lot in advance.

Same genes in landmark set and target set

Hi,

I am a PhD student at Polytechnic University of Milan and I am interested in the D-GEX work, of which you are one of the authors.
I would like to ask how you extracted the two lists of landmark and target genes, map_lm.txt and map_tg.txt, in the GitHub repository.

I noticed that a subset of the landmark genes (roughly 400 genes) is also part of the set of target genes (but with different probes). What is the reason behind this?

Thank you in advance.

Dataset

Hi,
Could you save your dataset as a CSV file and send a download link?
Thanks.

Paper Dataset

Hi,
I found your dataset at this URL and downloaded the files, but after the download completed their extension changed to .hdf. Can you help me?
Thanks.

The supplemental file in bioinformatics

Hello,

The bioinformatics paper of D-GEX has a dead link to the supplemental files. Would it be possible for you to post the supplemental files here or email me a copy? Thanks!
