benjamin-lee / deep-rules Goto Github PK

View Code? Open in Web Editor NEW

227.0 227.0 45.0 45.83 MB

Ten Quick Tips for Deep Learning in Biology

Home Page: https://benjamin-lee.github.io/deep-rules/

License: Other

HTML 63.53% Shell 4.41% Jupyter Notebook 32.07%

bioinformatics biology computational-biology data-science deep-learning genomics machine-learning manubot manuscript

deep-rules's People

Contributors

Stargazers

Watchers

deep-rules's Issues

If a human can't perform the task with ease, be surprised if a machine can.

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Alternative heading: Garbage in -- garbage out.

Some people confuse machine learning -- and deep learning is no exception -- with elaborate statistical methods, and belief that neural networks have magical, super-human powers that can detect patterns nobody else can. As a consequence they belief that they can take their data in any form, throw the trending machine learning method of the day at it (which will automagically find a suitable representation of the data and then learn the task), and then profit.

Life rarely works out that way. Humans are actually incredibly good at detecting patterns. More often than not, if a human can't see "it", a machine won't either (c.f. the rule on the importance on baseline performances). If the machine performs better than chance, be suspicious. It is probably overfitting the training data (c.f. the rule on freezing a test set). For success, it is generally tantamount to find a representation of the data based on which the human could perform the task, and then -- and only then -- try to teach a machine to do it.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Use existing model architectures whenever possible.

Don't forget your ML fundamentals/Rules that apply to ML also apply to DL

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Maybe we could dedicate one rule to ML basics that people still don't seem to get right, which, although they are not DL specific, still apply to it. Issues #1, #19, #20, #21 all come to mind. This could possibly be rule one.

Understand, protect against, and benchmark possible overfitting

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Evaluating/validating the algorithm/predictions

The "Ten Quick Tips" cited in #1 has a number of rules related to this. As with other issues/potential questions, most of these are related to ML in general, but should at least be mentioned for DL. The rules related to this in "Ten Quick Tips" are:

Split the input dataset into training, validation, test sets
Take care of the imbalanced data problem
Optimize each hyper-parameter
Minimize overfitting
Evaluate algorithm with the MCC and precision-recall curve

I think we only need 1-2 rules related to this. One issue that has been coming up recently is that of bias in training data eg https://arxiv.org/abs/1711.08536 and http://www.pnas.org/content/115/16/E3635.short. Not sure if that would fit in here or if we should have that as a separate rule relating to generalizability or if it relates enough to #9 that it can just fit in there.

Be patient and pay attention to seemingly trivial hyperparameter settings

Have you checked the list of proposed rules to see if the rule has already been proposed?

In general, getting DL models to work on even simple, structured datasets requires much extensive hyperparameter tuning (Koutsoukas et al. 2017) [Could also add a rule saying that a 2-layer multi-layer perceptron is not deep learning ;)] compared to "traditional" machine learning. Hence, it is important to be patient and try many different hyperparameter settings and combinations.

For instance, the previously "best" learning rate might become useless if we add another layer or change the activation function. Hence, extensive tuning and a near-exhaustive search is recommended.

Furthermore, it should be stressed that even the same architecture and hyperparameter configuration should be tested multiple times with different random seeds since random weight initialization may make a difference between convergence and non-convergence. For model selection and evaluation, it is recommended is to evaluate models based on their average performance based on at least a "top 3 out of 5 models for a given hyperparameter setting" comparison.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Koutsoukas et al. 2017 DOI

When it comes to data, size matters.

Logo Feedback

In an effort to help with publicity, I hired a logo designer on Fiverr for the project. They just came back with their first draft of the logo:

We have 24 hours to request revisions, so any ideas/comments/critiques to communicate to the designer would be highly appreciated.

Be suspicious: high prediction accuracy may backfire

Have you checked the list of proposed rules to see if the rule has already been proposed?

Did you add yourself as a contributor by making a pull request if this is your first contribution?

Yes, I added myself or am already a contributor

While we are amazed by the impressive predictive performance of deep learning models, we may have to stay suspicious when deep learning reports significantly higher predictive performance than other interpretable models with explicit constructed features. This is particularly important because people notice that deep learning does not predict through we expect as "semantic" information, but through some superficial statistics that should not be regarded as useful (Ref 1). To reiterate in an optimistic way, deep learning is a little bit too powerful, and what it learns within the scope of training data (and tested within the scope of testing data), may only be related to the superficial patterns of the data set itself, but not actually related to the task (Many more references about this phenonmeon are availbale in vision/NLP tasks, but I guess I should not digress too much). Thus, when you train a model, and test it and get high predictive peformance, it looks OK, but it may backfire (even more than traditional methods) when it is actually applied in industry.

Relations to other proposed rules:

Related to #43 but extends it as the data can be skewed in a very subtle way (through superficial patterns) (e.g. Ref 1), thus one may not be able to easily check. Also related to @rasbt comment within the thread as the bias in superficial way seems to be only related to deep learning due to DL's power. (#43 seems to extend #12)
Related to #26 but serves as a more fundamental ML-oriented discussion and offers many more evidence (in vision/NLP, refs omitted at this moment) about why #26 is important.
Kind of related to #27 and #35 as currently, the best way to avoid this issue is probably a more careful design of experiments.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Ref 1

Fairly validate your model with proper test/train data

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Expect to revise your models as new (or improved) datasets and methods become available

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Despite all our best efforts in trying to use the the cleanest of the datasets, balanced but heterogenous data, optimal hyperparameter settings, etc., we should expect to revise our models as new data and methods become available. We have to assume that our current set of observations is incomplete and our understanding is limited. Larger and improved datasets, metadata and other annotations will inevitably become available as new experimental techniques are introduced. This is particularly true across prolific fields such as regulatory genomics, transcriptomics, systems biology and biological image analysis (Angermueller et al., 2016, Ching et al., 2018). It will mean that models need to be re-trained and re-tested. Deep learning is still a young technology and in addition to new data, we should expect new developments and improvements to the methods and libraries we currently use.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Angermueller et al., 2016. doi.org/10.15252/msb.20156651
Ching et al., 2018. doi.org/10.1098/rsif.2017.0387

Your deep learning model does not need to be a black box.

Have you checked the list of proposed rules to see if the rule has already been proposed?

Many machine learning application tolerate black-box models; however, research in Life Sciences benefits from details that come from the model interpretation. The idea of this rule is to describe the use of methods, focusing on autoencoders, as a way to interpret you deep learning model.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Understanding Random Forests: From Theory to Practice: not deep learn, but relevant.
Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
Visual Interpretability for Deep Learning: a Survey

Ensure that your data is not skewed or biased

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.
Training and test sets might contain bias because e.g. more data is available on a subject or specimen. Or a subject has been studied more. Or only positive results are published. This can all result in models with prejudice. It can also ethical issues as described by Cathy O’Neil in the book 'Weapons of Math Destruction'.

Any citations for the rule? (peer-reviewed literature preferred but not required)

There's more than one way to do it: find the learning that's right for you (and your data)

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Lock testing cohorts and only evaluate a trained model once.

This is probably the most occurring mistake. A lot of people test-test-test and test different models on the same testing cohorts, and then only report the best performing model. The deep learning model is, of course, over fitted. The tuning datasets can be used more frequently, but the testing should be locked and not used before the final model is identified

If you don't have enough data, consider whether there is other data that can help pre-train a model.

Often we lack a large amount of labeled data, but sometimes there are abundant unlabeled data. We could touch on pre-training approaches (autoencoder pre-training, etc) as well as transfer + refinement (though you have that in #6 so transfer could just be added there). Perhaps this ends up subsumed into a broader point as people contribute their tips.

Know your resources

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Might be the wrong title, but using a rule to list available community resources, intro papers, and tutorials might be helpful

Any citations for the rule? (peer-reviewed literature preferred but not required)

Propose ten rules

To get this project started, we're going to need to have an initial idea of ten rules to send to PLOS as a presubmission inquiry.

I propose the following:

Don't use deep learning if you don't have to.
Don't share model weights trained on private data.
When it comes to data, size matters.
Use existing model architectures whenever possible.
Normalize, normalize, normalize

Project organization

@Benjamin-Lee I had a few big picture ideas for this collaborative manuscript based on our deep review experience.

In addition to the ideas in the readme about how participants can contribute, which should remain short like they are now, you could consider a more expansive contributing document. That would outline authorship criteria or other expectations for the collaborative project. See the deep review example.

That would also be a good place to include a code of conduct, which we did not have in deep review but now recommend for Manubot-based projects.

Send presubmission inquiry to PLOS Comput Bio

I drafted a presubmission inquiry that we'll send in once we have a full ten rules proposed.

Feedback from anyone would be highly appreciated. We can always change the rules and their ordering, but we should get this in and hear that PLOS is interested before doing any big publicity effort (just in case PLOS isn't happy about the idea).

@cgreene @agitter Did you do a presubmission inquiry for the deep review and, if so, do you have any feedback with respect to this one?

Have an operational plan

Have you checked the list of proposed rules to see if the rule has already been proposed?

Did you add yourself as a contributor by making a pull request if this is your first contribution?

Yes, I added myself or am already a contributor

Feel free to elaborate, rant, and/or ramble.

At the time of writing, rules.md is at 9 rules. How about a last rule that discusses some of the practical (operational/meta) considerations of ML/DL? Some ideas:

Sure, #21 says "be reproducible", but how?! Some discussion of Sacred or similar is warranted. What is the "best" way for someone to save and share results? Can you upload a Python script and a flat file to github, or is it better to invest time in figuring out docker? What is docker, anyways?
How do my analyses scale? I just learned about about MNIST on Kaggle and ran some DL on my personal laptop. Now I can scrape all of the Gene Expression Omnibus and run some models, right?!
Does a newbie need to buy a fancy computer with 8 graphics cards? Can they just use their institution's HPC? What are the pros/cons for HPC vs cloud?
Why is Python and/or R not good enough? Why on Earth would I want to use this Spark thing?

So you have a model/classifier....now what

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Eg the right ways to interpret weights, covariance of features, etc

Any citations for the rule? (peer-reviewed literature preferred but not required)

Understand the biomedical questions

There are a large number of prediction tasks DL algorithms can do. However, many of them may not be clinically useful or may be confounded by factors in the current clinical practice. For example, in any retrospective study with AP view (usually for bed-ridden patients) and PA view chest X-ray images, it would not be surprising the DL identifies that AP view images are correlated with poorer clinical outcomes, due to confounding by indication.

Make sure your problem is a good fit for deep learning.

Not all problems are equally amenable to deep learning. Make sure that there is a high prior probability that your problem will benefit from deep learning. Some useful rules of thumb:

Is there "structure" in your data. Data types such as imaging, text, time series, etc contain useful and regular kinds of structure and correlation that a deep learning model can more easily exploit relative to traditional models. If all you have is tabular data then DL is unlikely to provide much lift.
Do you have a lot of data? Deep learning is much more scalable (due to the ability to leverage GPU computing) and can take advantage of large datasets more easily.
Do you want to assess statistical significance? If so deep learning might not be the best fit.

Condense down to ten rules

Now for the fun part! We have a bunch of rules that have been proposed, but can only have ten rules in the paper:

Dashnow H, Lonsdale A, Bourne PE (2014) Ten Simple Rules for Writing a PLOS Ten Simple Rules Article. PLoS Comput Biol 10(10): e1003858. https://doi.org/10.1371/journal.pcbi.1003858

Let's start winnowing down the rules. As @rasbt proposed in #47:

A next step would maybe be that we vote on the rules to include. Maybe everyone (the contributors here) should give a 1-10 score for each rule and then we can take the top 10 rules based on the highest average scores and see if the results make sense (and whether we maybe need to look for additional ideas).

Let's do this programmatically. How does a CSV wherein contributors PR/push their votes sound?

For full reproducibility, always report all details of your analyses, including sharing of data, methods, and models

Fine-tune the baseline models before concluding they are worse than DL

Related to #10, many conventional machine learning methods (e.g., support vector machines, random forests) require parameter tuning. Before jumping to the conclusion that DL is better than other machine learning methods, investigate whether the baseline models are fine-tuned.

Understand the trade-off between interpretability and performance

Deep learning methods are notably difficult to interpret. If data is provided without explicitly engineered features, how are you going to address finding any biases in predictions?

What sort of problem are you trying to solve with DL?
Is high performance from a DL approach worth the difficulty in explaining how the model assigns a value, or is the value in the model in understanding the biological problem at hand?

Moving forward (meta)

Have we identified corresponding author(s) to help keep this on track and importantly pay the ~$2k USD publication fee?
Should we agree on some timeline for getting the rules finalized and the presubmission inquiry sent in? Should the corresponding author(s) facilitate this?
Will @Benjamin-Lee submit the presubmission inquiry?
If the presubmission inquiry goes well, then we should form a plan about who works on what or do we just see how it turns out? (some discussion in #48 about this)
If the presubmission inquiry does not go well, then do we try a different journal with the same format or turn this into a review/primer or maybe a Methods in Molecular Biology chapter?
Do some of the authors communicate offline? Some of you seem to work/have worked together on other projects. If so, can these communications/decisions be shared with us?

Define the research questions, curate the data, and specify the analysis strategies before looking at the data.

Before diving into the data and starting an 'exploratory' analyses; first, simply think about what you want to do and what strategy you would like to take. This will narrow your analysis options, hopefully reducing false findings.

Ten Quick Tips

It would be helpful to make sure this ten simple rules gets distinguished from a ten quick tips article that got published last year:
https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0155-3

Normalize, normalize, normalize.

I think this would be a rule to talk about feature scaling, batch normalization, etc.

Test the robustness of your DL model in a simulation framework in which the ground truth is known

Have you checked the list of proposed rules to see if the rule has already been proposed?

Did you add yourself as a contributor by making a pull request if this is your first contribution?

Yes, I added myself or am already a contributor

Deep learning (DL) and generally machine learning on biological data suffers from the fact that the a priori ground truth remains unknown. Therefore, if available, a given DL architecture should be tested first on a simulation framework in which the different classes were explicitly modelled in order to understand strengths and weaknesses of the used DL model (robustness). The concept of stress testing the robustness of DL models is not unique to DL – however, given the complexity of DL, it is even more important in the context of DL.

Perform sanity checks and follow good coding practices

Have you checked the list of proposed rules to see if the rule has already been proposed?

Did you add yourself as a contributor by making a pull request if this is your first contribution?

Yes, I added myself or am already a contributor

Feel free to elaborate, rant, and/or ramble.
When coding DL models, it is important to maintain good software engineering practices. All code should be documented and include rigorous tests. Sanity checks are also useful. For instance, something is probably wrong (e.g. bug in code, ill posed problem, bad hyperparameters) if model training loss does not decrease (i.e. not overfitting) when considering a very small subset of the training data.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Correlation is not causality

DL has caveats just like any other model. Just because I feel like this still needs to be said and there more times, the better.

A primer on deep learning in genomics

https://doi.org/10.1038/s41588-018-0295-5

Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.

This primer isn't in a ten simple rules format, but there is some overlap we with the goals of this project. We should review it and potentially use it as a reference. For instance, Table 1 lists resources for #31.

Future-proof: Plan how you will support and share your model

Have you checked the list of proposed rules to see if the rule has already been proposed?

Did you add yourself as a contributor by making a pull request if this is your first contribution?

Yes, I added myself or am already a contributor

Much of the discussion this far has concerned with creating/crafting a model, but as far as I can tell, there hasn't been much discussion on what happens afterwards (perhaps, let's say, after you publish the model). Since lots of deep learning is done to solve problems, making the model available as a tool seems just as important as crafting it.

This conversation is not unique to deep learning, but I think it's especially relevant because the hard work being done to make the application layer (e.g. Tensorflow, pytorch) runnable across machines/architectures and to support it.

In the past, there's been a question of "can you get everything running on your machine", say to process the input data in the format necessary for inference (in my experience, this is not trivial) and then actually run the inference code. However, with deep learning, often times the input is the raw data itself (perhaps with a tad bit of processing). In my experience with sequence bioinformatics, people are building models that use the DNA or protein sequence itself, without any sort of featurization calculations/pre-processing.

With a deep learning model, if it's built on a "standard" platform, sharing the model and running it oneself is significantly simplified.

When building a deep learning model (or perhaps shortly after), my suggestion is that the authors think about how they will share it with the user community. Here are some ideas:

Will they provide the model, weights, and a code example of how to use it for inference (least convenient)?
Will they run a web application that takes in data, runs the inference, and then returns the output to the user in a easy-to-understand way (and where will they run this? heroku? locally?)?
Will they have a database of precomputed inputs and provide this to the user (simply as data? or though a web api?)?
Will they create a static website that can take input data and run inference in the user's browser itself and render the output there itself? There's lots of work being done (e.g. TensorFlowJS, TensorRT) on the inference stage/lifetime of a model. (possibly the most convenient in the long term)

And, how will they maintain the application going forward? What if they (or someone else) puts out a model that's performs better at the same benchmarks? Will they turn their service off if the second model is made available? Will they host this second model?

PS. This is an awesome effort!

Don't share model weights trained on private data.

Adversarial learning (not to be confused with generative adversarial networks) is a growing threat. One particular issue is model inversion, which allows the extraction of sensitive data, either with or without access to the weights.

To be safe, unless you know what you are doing, it's probably best not to share weights that have been trained on sensitive data.

Design and run experiments systematically

Have you checked the list of proposed rules to see if the rule has already been proposed?

Deep learning experiments need to be thought out just like any other experiment. You wouldn't set foot in a biology lab to just start trying things and seeing what sticks. Once you have results, you need to manage them so you know where you've been and where you're going. What constitutes success and failure are especially important to define since it's easy to fall into a trap of endlessly iterating or scope creep.

Don't use deep learning if you don't have to.

Oftentimes, deep learning is not the best approach. I think one rule (possibly even the first in the article) should be a warning against using DL when it's not needed.

Discuss computational issues

Mention compute time issues, tools that are available, open source programming. The "Ten Quick Tips" for comp bio article linked in #1 specifically has "Program your software with open source code and platforms" as tip 9, so we have to make it somewhat DL-specific (eg how much can be done in R vs needing TensorFlow, how much can actually be done on a PC vs needing access to a cluster.) Also figure out the best way to create an actual usable rule around this.

If you don't have enough data, don't even start

This is, surprisingly, a very reoccurring problem. People spend a lot of time on analyzing small datasets, just as a proof-of-principle. However, the data are too small, no independent validation, etc to derive any meaningful results. Not enough data -> do something else

Use a simple linear and non-linear model as a baseline for measuring progress.

Have you checked the list of proposed rules to see if the rule has already been proposed?

Use a simple, linear model, e.g., (multinomial) logistic regression or (multiple) linear regression as a performance baseline. This may be expanded to also include off-the-shelf easy-to-use ensemble methods like random forests that are relatively robust and don't require much tuning to work well out of the box.

There are many situations where a ML/DL expert can, without much doubt, say that throwing DL at the problem (given the size of the dataset and the nature of the task) doesn't make sense. However, let's face it, DL is popular, and people want and will use it even if it is not always the best thing to do in every situation. I think a good thing to recommend though is to start with a linear model as a baseline and compare the DL efforts to it.

But on the more positive side, using a simple model as a baseline is also useful in situations where using DL does make sense.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Make sure you are optimizing the right objective function!

Have you checked the list of proposed rules to see if the rule has already been proposed?

Feel free to elaborate, rant, and/or ramble.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Feature engineering is still (or is no longer) important

Have you checked the list of proposed rules to see if the rule has already been proposed?

Did you add yourself as a contributor by making a pull request if this is your first contribution?

Yes, I added myself or am already a contributor

Feel free to elaborate, rant, and/or ramble.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Seeking relevant citations from other contributors

I have seen assertions that deep learning reduces or eliminates the need for feature engineering because the network itself constructs features from "raw" inputs. Most examples supporting this assertion come from imaging tasks.

I propose having a rule discussing whether or not this is true in biology, or perhaps the types of biological tasks for which feature engineering is still required. Unlike some of our other rules that are applicable to all DL or all ML in biology, this one would be specific to DL in biology.

I have some initial thoughts about feature engineering in DL but am hoping to gather a broad set of references before proposing the final rule.

Understand the data

Before hitting the data with fancy DL algorithms, always spend some time to understand how the data were generated, how the samples are selected, and what were the assumptions in the data generation and data cleaning process.

Treat it as a statistical model

Have you checked the list of proposed rules to see if the rule has already been proposed?

This rule is related to @ttriche's comments in #5. Because DL models can be difficult to interpret intuitively, there is a temptation to anthropomorphize DL models. We should resist this temptation. We would not expect a linear model to learn beyond pattern recognition, so we should also not expect an overparameterized non-linear model to do so. Because of this, we need to pay just as much attention to statistical caveats with DL models as we do with traditional models. For instance:

Don't use a deep learning model to extrapolate beyond the domain of the training set. While architectures with saturating nonlinearities will produce outputs in a reasonable range outside the training domain, that does not mean the predictions are reliable.
Don't use predictive accuracy or interpretability as evidence of casual reasoning (#9). We wouldn't interpret R^2 or MSE as evidence of causal reasoning in statistical models; don't do it for DL models either.
Interrogate the model for input sensitivity. Does your model respond to confounding effects, translations on the data domain, etc.? Do you want it to? Just as we would inspect the coefficients of a linear model, we should inspect the sensitivity of DL models to understand which signals they identify.

Exploit knowledge of similar data, use transfer learning!

Have you checked the list of proposed rules to see if the rule has already been proposed?

[x ] Yes

Feel free to elaborate, rant, and/or ramble.

Any citations for the rule? (peer-reviewed literature preferred but not required)

Always compare the performance of DL models with simple baseline models

Depending on the amount of available data and the type of tasks, DL models may not be the best performing one. For example, the Supplemental Materials of Rajkomar A et al. (2018) (https://www.nature.com/articles/s41746-018-0029-1) indicated that their simple baseline models had achieved performance comparable with that of DL in a number of tasks, which may be a surprise to many.

Strive for balance between training, tuning, and testing datasets

Scientists should strive for balance in the training, tuning, and testing datasets to assure that different phenotypic groups are represented appropriately/similarly. This refers to the balanced proportion of different classes of outcome or target variables. If class imbalance is inevitable, appropriate strategies like augmentation or bootstrapping can be used.

benjamin-lee / deep-rules Goto Github PK

deep-rules's People

Contributors

Stargazers

Watchers

Forkers

deep-rules's Issues

Recommend Projects

Recommend Topics

Recommend Org