Comments (14)

agitter commented on August 21, 2024

While I think this paper makes the case that deep learning is better at imputation, I don't think it's good enough to salvage the imputed LINCS L1000 expression calls.

This could be a nice theme for the review. Almost every paper will show that neural networks are better than baseline regression/classification techniques. But when are the improvements enough to make a practical difference in the domain?

cgreene commented on August 21, 2024

Thanks for chiming in @in4matx! We don't have a solution. We were discussing this paper, which you co-authored, and its use of deep learning. In an evaluation of the previous imputation ( https://thinklab.com/discussion/assessing-the-imputation-quality-of-gene-expression-in-lincs-l1000/185 ), @dhimmel found that the imputed genes had a very different distribution than the directly measured genes in their knockdown/overexpression experiments.

What we're particularly interested in is whether or not the deep learning approach in your paper - which reduces imputation error - also affects these distributions.

Quoting @dhimmel - we need to know:

Whether I can re-do my analysis of LINCS L1000 with their imputation data depends on whether we're building off of the same raw LINCS L1000 data. I used a modzs.gctx file (learn more on figshare or dhimmel/lincs#3). We will want to make sure that the only difference between the modzs.gctx I used and l1000_n1328098x22268.gctx is the imputation method.

gwaybio commented on August 21, 2024

Great example of deep learning (a feed-forward neural network) significantly outperforming a simpler machine learning algorithm (linear regression) on an important task (predicting gene expression from an informative panel). Also demonstrates the ability of a model trained on microarray data to infer RNA-seq data.

Biology

  • Inferring the expression of ~10,000 genes using only the measurements of a representative set of ~1,000 genes
  • Comparing the algorithm's performance when inferring RNA-seq- vs. microarray-assayed genes
  • Trained on GEO data and evaluated on held-out GEO, GTEx, and 1000 Genomes expression data

Computational aspects

  • Feed-forward neural network (see the sketch below)
    • Trained with dropout, momentum, an adaptive learning rate, and Glorot-initialized weights
    • Three hidden layers
      • 9000 hidden units in each layer shown to perform best
  • Performance evaluated using mean absolute error (MAE) at a per-gene level
    • Compared to the standard LINCS method (linear regression) and KNN regression
  • Nice discussion of visualizing network components
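
A minimal, hypothetical sketch of such an architecture in PyTorch (the original D-GEX implementation is not reproduced here; the tanh activation, 10% dropout, and learning rate below are illustrative assumptions, while the layer sizes follow the summary above):

```python
# Hypothetical D-GEX-style network: 943 landmark genes in (the paper's GEO
# setup; the L1000 data uses 978), three 9000-unit hidden layers, and one
# half (4760) of the target genes out, mirroring the two-network split
# discussed later in this thread.
import torch
import torch.nn as nn

def build_dgex_like(n_landmark=943, n_hidden=9000, n_target=4760, p_drop=0.1):
    layers, in_dim = [], n_landmark
    for _ in range(3):  # three hidden layers, per the summary above
        linear = nn.Linear(in_dim, n_hidden)
        nn.init.xavier_uniform_(linear.weight)  # Glorot initialization
        layers += [linear, nn.Tanh(), nn.Dropout(p_drop)]  # activation assumed
        in_dim = n_hidden
    layers.append(nn.Linear(in_dim, n_target))
    return nn.Sequential(*layers)

model = build_dgex_like()
criterion = nn.L1Loss()  # mean absolute error, the paper's evaluation metric
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
```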

Results

  • ~15% improvement in MAE for the NN compared to LR; KNN regression performed poorly
  • 99.97% of genes show improvement with the NN method compared to LR
  • An NN trained on microarray data predicts RNA-seq genes very well
    • Different optimal hyperparameters
      • e.g. two hidden layers (9000 x 2)
    • The NN's improvements over LR are not as large (but still improvements!)
  • Visualization techniques
    • Sparse connections from the input layer to the hidden layer
    • Certain hidden units are hubs of activity - probably capturing some central co-regulated organizational principles

agitter commented on August 21, 2024

Ideally, we should also configure D-GEX with 9520 units in the output layer corresponding to the 9520 target genes. However, each of our GPUs has only 6 GB of memory, thus we cannot configure hidden layers with sufficient number of hidden units if all the target genes are included in one output layer. Therefore, we randomly partitioned the 9520 target genes into two sets that each contains 4760 target genes. We then built two separate neural networks with each output layer corresponding to one half of the target genes.

This part was unfortunate. I wonder how much better they could have done without this artificial limitation.

Their code has pairs of scripts for training networks on the first half of the data and then the second half. It may not be too difficult to train on all of the genes if someone is feeling especially curious (not me).
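
If someone does pick this up, the split itself is trivial to reproduce; here is a minimal sketch of the random partition described in the quote (the seed is an arbitrary choice for illustration):

```python
import numpy as np

# Randomly partition the 9520 target genes into two disjoint halves of 4760,
# each predicted by its own network, as the quoted Methods describe.
rng = np.random.default_rng(0)     # arbitrary seed for illustration
shuffled = rng.permutation(9520)   # indices of the 9520 target genes
half_a, half_b = shuffled[:4760], shuffled[4760:]
# Lifting the GPU memory limit would merge these into one 9520-unit output layer.
```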

gwaybio commented on August 21, 2024

That is unfortunate - makes it even more impressive how much better their performance was than LR

dhimmel commented on August 21, 2024

Greetings, we discussed this paper in the 9/9 at 9 Greene Lab Journal Club (see slides). My take was that their deep learning model (D-GEX) did perform almost universally better than linear regression. However, both performed poorly:

deep learning reduced imputation error from 38% to 32% … still too much error for many expression applications. [Source: Tweet]

I originally was interested in this paper because of the poor imputation quality in LINCS L1000. While I think this paper makes the case that deep learning is better at imputation, I don't think it's good enough to salvage the imputed LINCS L1000 expression calls.

cgreene commented on August 21, 2024

@agitter Totally agree with that sentiment! That's what I really want to see. @dhimmel - is it feasible to re-do your imputation quality analysis if the authors provide their new imputed data? It may be possible to request it from them.

dhimmel commented on August 21, 2024

The imputed LINCS data from the study is available as described in their Methods:

we have re-trained GEX-10%-9000 × 3 using all the 978 landmark genes and the 21 290 target genes from the GEO data and inferred the expression values of unmeasured target genes from the L1000 data. The full dataset consists of 1 328 098 expression profiles and can be downloaded at https://cbcl.ics.uci.edu/public_data/D-GEX/l1000_n1328098x22268.gctx. We hope this dataset will be of great interest to researchers who are currently querying the LINCS L1000 data.

Note that l1000_n1328098x22268.gctx is 110 GB. Whether I can re-do my analysis of LINCS L1000 with their imputation data depends on whether we're building off of the same raw LINCS L1000 data. I used a modzs.gctx file (learn more on figshare or dhimmel/lincs#3). We will want to make sure that the only difference between the modzs.gctx I used and l1000_n1328098x22268.gctx is the imputation method.

Tagging the study authors @admiral-chen, @yil8, and @in4matx to see if they can provide more information regarding l1000_n1328098x22268.gctx and its relation to modzs.gctx.
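
For anyone wanting to poke at the 110 GB file without loading it whole, here is a hedged sketch using the cmapPy library (not mentioned in this thread; I'm assuming its parse() interface for .gctx files):

```python
# Read column metadata first, then load only a few selected profiles,
# avoiding a full 110 GB read.
from cmapPy.pandasGEXpress.parse import parse

path = "l1000_n1328098x22268.gctx"
col_meta = parse(path, col_meta_only=True)  # column (profile) metadata only
some_cids = list(col_meta.index[:5])        # pick a few profile IDs
gctoo = parse(path, cid=some_cids)          # load just those columns
print(gctoo.data_df.shape)                  # genes x selected profiles
```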

cgreene commented on August 21, 2024

This one garnered quite a bit of discussion, so I won't close it at this point. @admiral-chen, @yil8, and @in4matx - it would be nice to highlight your contribution. Can you provide some quick info on whether or not the potential evaluation is feasible?

in4matx commented on August 21, 2024

We recently ran a contest in which a number of folks submitted their improvements to inference, and the submissions were scored against a benchmark. While the contest is over, several folks have asked for the benchmarks so that their ideas can be compared to the current best performer.

Let me know if you are interested in comparing your algorithm.

https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16753&pm=14337

in4matx commented on August 21, 2024

Oh, sorry, I didn't catch that.

The application of deep learning was an exploratory analysis applied to a very specific set of criteria. I'd ping Xiaohui Xie or his student for more on the methods.

But just so that you are aware, that particular deep-learning-inferred dataset isn't used for the "bread and butter" CMap/LINCS analysis. For that we use a linear regression-based inference, which the community recently improved with a KNN-based approach.

So, I'm sorry, I don't know much about the relative distributions, and as the deep learning approach hasn't been as extensively vetted from the perspective of predicting knockdowns, I'd caution against drawing conclusions before comparing to the current linear regression-based dataset.

FYI - we expect the improved KNN-based algorithm and results to be released later this fall.

aravind

yil8 commented on August 21, 2024

Hi all,

Thanks so much for following up on our work. If I remember correctly, the L1000 prediction was based on the expression values of the 978 landmark genes from /data.lincscloud.org/l1000/level3/q2norm_n1328098x22268.gctx (the first 978 are the landmark genes), and the predicted values of the other ~21K genes are all normalized to 0-mean, 1-std. I don't know too much about modzs.gctx. As for the best imputation methods for L1000 data, honestly, I have also entered the contest @in4matx referred to: https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=16753&pm=14337
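
The stated normalization is easy to verify on any loaded slice of the matrix; a minimal sketch with stand-in data in place of the real file:

```python
import numpy as np

# Per-gene z-scoring: each gene (row) has 0-mean, 1-std across profiles,
# as described above for the ~21K predicted genes.
raw = np.random.standard_normal((100, 50)) * 3 + 5  # stand-in for real data
expr = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)
assert np.allclose(expr.mean(axis=1), 0) and np.allclose(expr.std(axis=1), 1)
```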

And surprisingly, I only made it to the top 10 using neural networks. I am also interested to see what the best performer used in the contest. I am currently in China, and the internet connection is not very good. I will be back around the end of October and will come back to this issue.

Thanks

Yi Li

cgreene commented on August 21, 2024

This is very helpful! We won't push ahead with the evaluation given the caveats that you've raised, but we do look forward to the improved predictions.

Thanks for your time Aravind and Yi!

dhimmel commented on August 21, 2024

I tested out replacing modzs.gctx with l1000_n1328098x22268.gctx in our consensus signature pipeline (notebook). While l1000_n1328098x22268.gctx contained all of the probes we need, it used different perturbagen identifiers.

Specifically,

  • modzs.gctx signature IDs look like CPC005_VCAP_6H:BRD-A47494775-003-03-0:10
  • whereas l1000_n1328098x22268.gctx contained CPC005_VCAP_6H_X1_F1B3_DUO52HI53LO:K06

Therefore, I'm unable to proceed unless we figure out a way to convert between perturbagen vocabularies.
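
To illustrate the mismatch: the two vocabularies appear to share only a plate prefix, so string munging alone cannot recover the perturbagen. The interpretation of the ID components below is my assumption, using the example IDs from this comment:

```python
# The modzs signature ID appears to encode plate:perturbagen:dose, while the
# l1000_n1328098x22268.gctx ID appears to encode a plate replicate and well.
modzs_id = "CPC005_VCAP_6H:BRD-A47494775-003-03-0:10"
l1000_id = "CPC005_VCAP_6H_X1_F1B3_DUO52HI53LO:K06"

modzs_plate = modzs_id.split(":")[0]                 # 'CPC005_VCAP_6H'
l1000_plate = l1000_id.split(":")[0].split("_X")[0]  # 'CPC005_VCAP_6H'
print(modzs_plate == l1000_plate)  # True: a plate-level match only; a
# well-to-perturbagen metadata table would be needed for a real mapping.
```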
