cognoma / machine-learning Goto Github PK

View Code? Open in Web Editor NEW

32.0 17.0 47.0 9.93 MB

Machine learning for Project Cognoma

License: Other

Jupyter Notebook 97.08% Python 2.76% Shell 0.01% HTML 0.16%

machine-learning sklearn python jupyter-notebook classifier dataphilly

machine-learning's Introduction

Project Cognoma

Putting machine learning in the hands of cancer biologists.

Project Cognoma is an open source project to create a webapp for analyzing cancer data. We're a community-driven philanthropic project that began as a collaboration between the Greene Lab, DataPhilly, and Code for Philly. Our contributors are primarily based in the Philadelphia area, but anyone anywhere is welcome. This GitHub repository is the administrative and informational home of Cognoma.

The Meetup phase of Cognoma is now complete! The Childhood Cancer Data Lab of Alex's Lemonade Stand Foundation will be providing longterm maintenance. Public contributions are still welcome through GitHub. The main priority is enhancements and bug fixes to improve http://cognoma.org. For a nice overview of the project, see its coverage by The Philadelphia Citizen.

Teams and Repositories

The project is composed of four teams with their own corresponding repositories:

Team Name	Repositories	Description
Cancer Data	`cancer-data`, `genes`, `figshare`	processing the underlying cancer data to the formats required for this project.
Machine Learning	`machine-learning`, `cognoml`	building classifiers to predict mutation status from gene expression data.
Backend	`core-service`, `task-service`, `ml-workers`, `infrastructure`	creating the infrastructure to power the webapp and glue the components together.
Frontend	`frontend`, `uiux`	building the webapp that users interact with.

New Here?

If you are a new user and would like to get involved, please introduce yourself. Contributions are made through GitHub, so if you are unfamiliar with git or GitHub, check out the sandbox for a place to learn by doing.

Meetup Schedule

We hold project meetups. Our usual meeting spot is at Industrious (where CandiDate is located). The address is 230 S Broad St, Floor 17, Philadelphia.

📅 Date	⌚ Time	🗺 Location	ℹ️ Meetup Details	💰 Sponsor
~~Wednesday, October 11, 2017~~	6:00 PM	MilkBoy	DataPhilly	Alex’s Lemonade Stand Foundation
~~Tuesday, August 15, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 11, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, June 27, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, May 30, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, April 25, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, April 4, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, February 28, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Monday, February 13, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, January 31, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Monday, January 16, 2017~~	9:00 AM	Philly Think Space	Frontend Only	MLK Day Volunteers from Think Company
~~Tuesday, January 10, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, December 20, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, December 6, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, November 15, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, November 1, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, October 18, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, October 4, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Monday, September 19, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, September 6, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, August 23, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, August 9, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 26, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 19, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 12, 2016~~	6:00 PM	CandiDate	DataPhilly	MilkBoy
~~Tuesday, July 5, 2016~~	6:00 PM	CandiDate	DataPhilly	Neo Technology
~~Tuesday, June 28, 2016~~	6:00 PM	MilkBoy	DataPhilly / Code for Philly	MilkBoy

Contributing

Community contributions are the driving force behind Cognoma. The heatmap below shows which users have contributed to which repositories:

See the guidelines for contributing for more information.

Maintainers

Cognoma relies on our generous community maintainers to assist with contributions. Thanks to the following maintainers for their help:

Cancer Data: Claire McLeod (@clairemcleod)
Machine Learning: Patrick Miller (@patrick-miller), Ryan Velazquez (@rdvelazquez), Jesse Prestwood-Taylor (@jessept), Yichuan Liu (@yl565)
Backend: Derek Goss (@dcgoss), Andrew Madonna (@awm33), Kurt Wheeler (@kurtwheeler)
Frontend: Benjamin Dolly (@bdolly)
Community: Karin Wolok (@KarinSpiderwoman)
Wildcards: Daniel Himmelstein (@dhimmel), Gregory Way (@gwaygenomics), Casey Greene (@cgreene)

machine-learning's People

Contributors

Stargazers

Watchers

machine-learning's Issues

Using incremental(on-line) learning to reduce memory cost

Since there might be many people using the website simultaneously, it would be good to adopt incremental(on-line) learning algorithms as much as possible to train classifier. In incremental(on-line) learning, classifiers are not trained using the whole dataset, the samples are feed into the learning algorithm one-by-one (or by small batches). Here is the reference:
http://scikit-learn.org/stable/modules/scaling_strategies.html

cognoml tests

cognoml should really have tests. At the very least the "public" functions. The complex operations could be mocked.

Could a general mutation-load pattern confound mutation-specific signals?

I think it's likely that there is a general expression pattern for how mutated a tumor is. For example, super mutated tumors may have wacky gene expression, solely because they're super mutated and not specifically because of which exact mutations they contain.

For a given gene, tumors with mutations are more likely to be highly mutated overall. This could cause confounding. It may appear that a mutation is associated with a specific expression pattern, although the signal is be driven by general mutation-load.

So we may need to end up including a mutation-load covariate. In the meantime, someone should see whether it's possible to use gene expression to predict the mutation-load of each sample (labeling this a task and looking for a volunteer).

Characterize runtime performance of each ML algorithm in our toolbox vs data set size for a representative set of queries

Build data sets that span a range of complexity.
Number of samples X number of features.
Run each algorithm in Issue #5 multiple times for each data set on a consistent set of hardware and collect results.

@hhummel

Predicting BRAF V600E activation in colorectal cancer

BRAF V600E mutations are common in several cancer types, including melanoma. This mutation is also present in about 8% of colorectal cancer patients (COAD and READ). However, there is emerging evidence that this subgroup is genomically heterogeneous, and Phase II Vemurafenib clinical trails have failed.

There are also recent efforts to stratify BRAF V600E colorectal tumors based on gene expression data. For example, Barras et al. 2017 compiled a dataset of 218 BRAF V600E mutated colorectal tumors and identified two subgroups and suggest that this heterogeneity may be the basis for poor clinical trial results.

I think that Cognoma could be a nice system to test this model. I think this could be a nice research based analysis for the group to undertake. Roughly, I would approach this analysis like this:

Build a model to predict BRAFV600E activating mutations
- Would be nice to see a table of counts across different tumor types.
- Hold out both COAD and READ tumors from this model
Assess training/testing performance
Apply model to COAD/READ tumors and investigate heterogeneity of predictions
Possibly apply model to other datasets listed in Barras et al.

My hypothesis is that cognoma will select BRAF V600E melanomas (SKCM) to appear similarly as a subset of BRAF V600E COAD/READ tumors that may respond to antibody treatment.

ML module API

Hello,

In an effort to build the global cognoma architecture, it would be very useful to determine an API which defines exactly what is given to the ML module (and incidentally what it will return).

As an exemple of strong API documentation, I believe OpenStack is a good start. Note how every module's API is listed, and how for each of those modules each route is described.

Some direct example for a cognoma API can be found here. This is a first specification for the frontend module.

Thrashing in 2.TCGA-MLexample

When my MBP with 16 GB of RAM hits this line cv_pipeline = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=-1, scoring='roc_auc') it thrashes. The n_jobs parameter causes multiple jobs to be created, with them a higher demand on RAM. My MBP has a i7 processor which is hyper-threaded. When n_jobs=-1, it spins up as many tasks as the machine has cores, but it thinks that my machine has 8 cores when it really only has 4 and 4 virtual cores. Hyper-threading uses inefficiency in the pipeline to create a virtual pipeline. This works fine for multi-tasking like browsing and editing but not for processor hungry tasks like GridSearchCV that most likely do not use the real pipeline inefficiently. So 8 tasks was swamping my RAM and there is probably no benefit on my system to spinning up more than 4.
To fix this, I merely set n_jobs=4.

Integrating dimensionality reduction into the pipeline

It will benefit all of us if the operations of dimensionality reduction can be integrated into the pipeline.

Moreover, it seems necessary to place dimensionality reduction after preliminary feature selection (keeping 5000?); otherwise, our computers are likely to run out of memory.

Preventing overfitting when evaluating many hyperparameters

In #18 I propose using a grid search to fit the classifier hyperparameters (notebook). We end up with average performance across cross-validation folds for many hyperparameter combinations. Here's the performance visualization from the notebook:

So the question is given a performance grid, how do we pick the optimal parameter combination? Picking just the highest performer can be a recipe for overfitting.

Here's a sklearn guide that doesn't answer my question but is still helpful. See also #19 (comment) where overfitting has been mentioned. I'm paging @antoine-lizee, who has dealt with this issue in the past, and who can hopefully provide solutions from afar as he lives in the hexagon.

Evaluate dask-searchcv to speed up GridSearchCV

I'm excited about trying out dask-searchcv as a drop-in replacement for GridSearchCV. For info on dask-searchcv see, the blog post, github, docs, and video.

I'm hoping using dask-searchcv for GridSearchCV will help solve the following problems:

High memory usage, e.g. #70, caused by joblib overhead.
The slow performance of the pipeline when properly implementing cross-validation. See discussion at scikit-learn/scikit-learn#7536 (comment). The builtin GridSearchCV is repeating the same transform steps making it brutally slow.

I initially mentioned dask-searchcv in #93 (comment), a PR by @patrick-miller. I thought this would be a good issue for @rdvelazquez to work on. @rdvelazquez are you interested?

We'll have to add some additional dependencies to our environment. It may be a good time to also update the package versions of existing packages (especially pandas).

Characterize predictive performance of each ML algorithm in our toolbox vs various data sets for a representative set of queries

Issue #5 describes a table of ML methods. (This is the "toolbox" referred to in the title)
Issue #11 describes creating a set of sample data sets.
This issue is a task that consists of:

Running each algorithm on each data set and evaluating its predictive quality.
To the best of your capability, provide a few words explaining the results

Since there are a lot of algorithms and a lot of datasets there is an opportunity for many people to participate in this task.

@dhimmel

OncoKB: Precision Oncology Knowledge Base

http://oncokb.org contains 418 genes with varying levels of evidence for their role as cancer targets. This is a great list of genes to fit classifiers for.

Installing a Python Neo4j driver on OSX

Hi all,

I am having hard time running the "2.TCGA-MLexample.ipynb" notebook. The problem occurs at the line

from neo4j.v1 import GraphDatabase

I believe I have to install Neo4j, or rather py2neo, the Python library that gives access to it. When I use the recipe from anaconda.org

conda install -c ivoflipse py2neo=1.6.4

I get a message

PackageNotFoundError: Package not found: '' Package missing in current osx-64 channels: 
  - py2neo 1.6.4*

Has anybody successfully installed py2neo on OSX?

Notebook explaining an RNA-seq classifier

There are Jupyter notebooks on acquiring and cleaning data from TCGA. Is there one that has a outline of the type of classifier that Cognoma will run on this data for our users, and if not, would it be much trouble to create one?

Ideally, it would continue from 1) data acquisition and 2) cleansing (done above), and go on to 3) feature extraction, 4) learning, and 5) interpretation. Perhaps I have missed some steps?

A paper of @cgreene, @dhimmel or @gwaygenomics might also suffice, but it could be nice for everyone to have some code to play around with. Thoughts?

Task Payload Data

The task service can hand off anything you can store in JSON. This is typically where task configuration is stored, for instance what algorithm to use, maybe the gene list, algorithm parameters.

For a given task, what do you need to know?

{
    algorithm: "svm",
    ....
}

Standardize the plots in notebooks

I wanted to open this issue so that we could discuss what framework we wanted to use in the notebook that we export for our users. There are a lot of options.

Right now, the majority of the plots are in matplotlib/seaborn. There is a pretty good replication of R's ggplot2 with yhat's ggpy.

We can use JavaScript based, dynamic plots with bokeh or Vega. There has already been some work done with Vega (#74, #77, #84), and these can be incorporated into notebooks. The benefit of using Vega is that once we build a frontend results viewer, presumably we will be moving towards this dynamic method of displaying plots. Maintaining consistency would be a plus.

One issue with displaying Vega plots with ipyvega, is that (as detailed in this ipyvega pull request) the plots are exported to a static image when viewed through GitHub or nbviewer. Only live notebooks get the JavaScript version. As we are going to be serving a static notebook, I assume we would have the same result. I'll be looking into if this new project holds any answers.

Multiple comparisons problems

I'm still working my way through the paper published by @gwaygenomics, @allaway and @cgreene, but it made me think of an issue that I believe we should try to deal with in our final product. In the paper they had a specific hypothesis that they tested; however, we are going to provide people with the ability to test out hypotheses on thousands of different mutations.

There are some problems with this ability, such as non-response bias. There are bound to be many uninteresting results (AUROC = 0.5) for different genes that people will tend to glance over. I can very easily imagine a scenario where someone iterates through many different genes until they reach one where a model does a good job at predicting a mutation.

We could approach this issue in a few different ways:

hold out some data for validation -- only to be used for publication
apply some sort of correction (e.g. Bonferroni)
place strong emphasis on effect sizes
list a clear disclaimer

I wanted to open this issue up so we can discuss the importance of the problem and possible solutions.

Git large-file storage

It might be useful to store the files currently being downloaded by 1.download.ipynb on git's large file storage. That way we can eliminate 1.download.ipynb and have the data files under version control.
https://git-lfs.github.com/

It needs to be investigated whether git-lfs can be incorporated via conda into the environment automatically.

Learning Resources for the Uninitiated - Machine Learning Edition

Going off of @gwaygenomics's post in the Cognoma repo (cognoma/cognoma#15) , and @cgreene's suggestion this might be a good place to discuss ML-specific resources for the uninitiated, especially for others like me who are new to the bioinformatics domain and perhaps last studied biology in high school.

Machine Learning Punch List for Launch

Here's the general punch list that we discussed at tonight's meetup for getting the machine learning part of cognoma launch ready.

Update Notebook 2 to be compatible with ml-workers and PR ml-workers so the two notebooks are the same (#111 by @patrick-miller)
Change plotting in notebooks from vega to plotnine (#112 by @patrick-miller)
Look into the memory usage issues (@wisygig)
Look into the number of pca components to use (#113 and #114 by @rdvelazquez)
Optimize between memory usage, pca components (and the regulirization_alpha_list), and how large of an AWS instance we use (@dhimmel)

To be completed at a later date: Templating for jupyter notebooks (@wisygig)

Selecting the elastic net mixing parameter

Thus far we've been using grid search (cross validation) to select the optimal elastic net mixing parameter. For SGDClassifier, this mixing parameter is set using l1_ratio, where l1_ratio = 0 performs ridge regularization and l1_ratio = 1 performs lasso regularization.

Here's what I'm thinking:

Grid search is not the appropriate way to select the mixing parameter. Ridge (with the optimal regularization penalty, alpha) will always perform better than the optimal Lasso. The reason is that there's a cost for the convenience of sparsity. Lasso makes difficult decisions about which features to select. Therefore the sparsity can aid in model interpretation, but weakens performance because identifying only the predictive features is an impossible task.

For example, see our grid from this notebook (note this used MAD feature selection to select only 500 features which likely accentuates the performance deficit as l1_ratio increases).

So my sense is that l1_ratio should be chosen based on what properties we want the model to have, not based on maximum CV performance. If we only care about performance, we might as well save ourselves the computation time and always go with ridge or the default l1_ratio = 0.15. l1_ratio = 0.15 can still filter ~50% of features with little performance degradation. But if you want real sparsity (lasso) there's going to be a performance cost -- and the user not grid search will have to make this decision.

Create benchmark data sets

Go through the data we have
Select a diverse range of possible formulations based on the actual data, but designed to be diverse so e.g. a small number of positives in a lot of data, a lot of positives, smaller number of genes, larger number of genes, different gene expression distributions, etc. @dhimmel: please review description

When we run an analysis, what do we want to get back?

We need to design our results json so that we can later visualize the most important results via the results viewer from the UI team.

Literature review

There are several research papers and webpages that can be helpful to the cognoma task. Maybe I missed it, but I am not aware of any other issue that addresses this question. It would be good to have a place where members can post papers that could benefit the community. Since not everyone has the academic access to databases, it is preferable that papers posted are open-access.

I recently found this paper - Dudoit, Sandrine, Jane Fridlyand, and Terence P. Speed. "Comparison of discrimination methods for the classification of tumors using gene expression data." http://www.stat.cmu.edu/~jiashun/Research/software/GenomicsData/papers/dudoit.pdf

It is published in 2002 so their dataset is way smaller. However, it contains some useful information regarding the data processing and gene datasets in general. It was a good read even though I am not in this field.

Memory error

As per the instruction, I made copy of TGCA-MLexample.ipny with CART-VijYadav.ipynb. I didn't make any change to the code ( I just wanted to see if the existing code works) and finding the following memory error. I am running Windows 10 with 32 GB RAM. Can you please let me know how to fix the issue.

%%time
path = os.path.join('data', 'expression.tsv.bz2')
X = pd.read_table(path, index_col=0)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-7-6c501a1d1501> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', "path = os.path.join('data', 'expression.tsv.bz2')\nX = pd.read_table(path, index_col=0)")

C:\Users\Vijay\Anaconda3\envs\cognoma-machine-learning\lib\site-packages\IPython\core\interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2113             magic_arg_s = self.var_expand(line, stack_depth)
   2114             with self.builtin_trap:
-> 2115                 result = fn(magic_arg_s, cell)
   2116             return result
   2117 

<decorator-gen-60> in time(self, line, cell, local_ns)

C:\Users\Vijay\Anaconda3\envs\cognoma-machine-learning\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

C:\Users\Vijay\Anaconda3\envs\cognoma-machine-learning\lib\site-packages\IPython\core\magics\execution.py in time(self, line, cell, local_ns)
   1178         else:
   1179             st = clock2()
-> 1180             exec(code, glob, local_ns)
   1181             end = clock2()
   1182             out = None

<timed exec> in <module>()

C:\Users\Vijay\Anaconda3\envs\cognoma-machine-learning\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561 
--> 562         return _read(filepath_or_buffer, kwds)
    563 
    564     parser_f.__name__ = name

C:\Users\Vijay\Anaconda3\envs\cognoma-machine-learning\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    323         return parser
    324 
--> 325     return parser.read()
    326 
    327 _parser_defaults = {

C:\Users\Vijay\Anaconda3\envs\cognoma-machine-learning\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
    813                 raise ValueError('skip_footer not supported for iteration')
    814 
--> 815         ret = self._engine.read(nrows)
    816 
    817         if self.options.get('as_recarray'):

C:\Users\Vijay\Anaconda3\envs\cognoma-machine-learning\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1312     def read(self, nrows=None):
   1313         try:
-> 1314             data = self._reader.read(nrows)
   1315         except StopIteration:
   1316             if self._first_chunk:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:8748)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9428)()

pandas\parser.pyx in pandas.parser._concatenate_chunks (pandas\parser.c:25134)()

MemoryError:

Fix cognoml package

Trying to run the cognoml package, I found a couple issues.

The requirements are not in setup.py, so a consumer has to manually install them. I manually installed them by looking at the repo's environment.yml file. The following were needed before the next issue was raised.
- pandas==0.18.1
- numpy==1.11.1
- scikit-learn==0.18.0
- scipy==0.17.1
Take a look at https://packaging.python.org/requirements/
Looks like cogoml.classifiers is not included in the package

>>> from cognoml analysis
  File "<stdin>", line 1
    from cognoml analysis
                        ^
SyntaxError: invalid syntax
>>> from cognoml import analysis
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amadonna/Documents/ml-workers/env/lib/python3.5/site-packages/cognoml/analysis.py", line 11, in <module>
    from cognoml.classifiers.logistic_regression import grid_search
ImportError: No module named 'cognoml.classifiers'

Might have to be included here.

I'd suggest testing the module in a clean environment.

Installed using pip install git+git://github.com/cognoma/machine-learning

Should testing data be used for unsupervised feature tranformation or selection

Imagine splitting the data as follows, where X is the complete feature matrix and y is the outcome array (train_test_split doc):

X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y)

The goal of this discussion is to evaluate whether we should apply any operations on X (the union of X_train and X_test). @htcai cautioned against feature selection/transformation on the entire X: #18 (comment).

What are the drawbacks and advantages of selection/transformation on an X that includes X_test?

Decisions required to reach a minimum viable product

We're nearing the point where we'll need to implement a machine learning module to execute user queries. We're looking to create a minimum viable product. We can expand functionality later, but for now let's focus on the simplest and most succinct implementation. There are several decisions to make:

Classifier: which classifiers should we support? If we want to support only a single classifier for now, which one?
Predictions: do we want to return probabilities, scores, or class predictions?
Threshold: do we want to report performance measures that depend on a single classification threshold? Or do we want report performance that span thresholds?
Testing: Do we want to use a testing partition in addition to cross-validation? If so, do we refit a model on all observations?
Features Should we include covariates in addition to expression features (see #21)?
Feature selection: Do we want to perform any feature selection?
Feature extraction: Do we want to perform features extraction, such as PCA (see #43)?

So let's work out these choices, with a focus on simplicity.

PCA (and other pre-processing steps) on just the expression matrix in CV pipeline

We have two sources of features: the covariates and the gene expression matrix. When pre-processing this data, we generally want to perform dimensionality reduction only on the expression matrix.

This can prove cumbersome when trying to implement PCA in a pipeline. Scikit-learn does not provide this sort of functionality out of the box, but I believe it is possible to continue using pipelines. Here is an example.

Performance on genes with targeted cancer therapies

It seems that some of the most interesting genes to predict using expression will be genes that have targeted therapies. These genes are ideal, because we may be able identify unmutated samples that would still respond to the therapy based on their expression.

Here's a list of targeted cancer drugs and thier genes. Tagging @brankaj.

TOTAL column is skewing the heatmap in 3.TCGA-MLexample_Pathway

Hi, @dhimmel and @gwaygenomics

I hope that this is the right place to ask a question about the code in
3.TCGA-MLexample_Pathway.ipynb

I'm working on converting the first heatmap:
"percentage of different mutations across different cancer types"
from seaborn to to Altair/vega-lite, continuing from the fine work of
@superkostya.

I have figured out how to use different color maps in vega-lite,
e.g. the viridis color map.

The 'TOTAL' column is not really a gene, and because it is a sum,
it's values are much larger than the gene expression values, causing
the differences between other values to be less apparent in the display.

I can move this column to the right of the chart with some slicing and dicing,
but I'm not sure it really belongs.

here are the relevant lines from cognoma/machine-learning/3.TCGA-MLexample_Pathway.ipynb
which create the 'TOTAL' column:

unique_pos = y.groupby('disease').apply(lambda x: x['indicator'].sum())
heatmap_df0 = y_full.groupby('disease').sum().assign(TOTAL = unique_pos)
heatmap_df = heatmap_df0.divide(y_full.disease.value_counts(sort=False).sort_index(), axis=0)

It is not clear to me what the TOTAL column means after the 3rd line does the divide operation,
is it now some kind of average?

Thanks for any clarification.

Create a table classifying the different algorithms

Columns including binary classification, how hard to implement and estimate of quality

What covariates should we include as features?

In addition to gene expression, we probably should include other information on samples. This discussion will focus on identifying potential covariates and evaluating whether they make sense to include in models. If we don't include the right covariates, confounding is likely to be an issue.

See #8 as a potential example of confounding that may be addressable by adding a mutation load feature.

Work with django team to select which examples should be in the training set

Check out this project: TPOT Data Science Assistant

I am not sure if the TPOT project is applicable or if it would help but it is an interesting spin on ML
http://rhiever.github.io/tpot/

TPOT is a Python tool that automatically creates and optimizes machine
learning pipelines using genetic programming.

Selecting the number of components returned by PCA

A topic that has come up a number of times is how many components should be returned by PCA...

The number of components (n_components) can be a parameter that is searched across in GridSearchCV. This used to cause problems with thrashing (#43) but these problems seem to have been eliminated by using the dasksearcCV implementation of GridSearchCV (#94).
Although n_components can be included in GridSearchCV, we would like to limit the range that needs to be searched over based on the specifics of the query (i.e. how many samples are included {the user's filter by disease} and how many positive/negative mutations there are {the user's filter by gene(s)}).
Anecdotally, it seems that the optimal n_components is larger for balanced datasets (equal number of mutated and non-mutated samples) and smaller for unbalanced datasets (typically small number of mutated samples). Using a small n_components for balanced datasets results in low training and testing scores. Using a large n_components for unbalanced datasets results in higher training and lower testing scores (over-fitting).
As @dhimmel pointed out in #94,

When working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross validation or a large number of StratifiedShuffleSplits. See discussion on #71 by @htcai. We ended up going with:

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, `random_state=0)

I'm thinking the next step should be creating a dataset that provides good coverage of the different query scenarios (#11), and perform GridSearchCV on these datasets, searching over a range of n_components to see how changing n_components effects performance (AUROC).

@dhimmel, @htcai, @patrick-miller feel free to comment now or we can discuss at tonight's meetup.

Add stronger support for pip

cognoma runs into a specific import issue when installing directly from github:

>>> from cognoml import analysis
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Projects/miniconda3/envs/test_cognoma/lib/python3.5/site-packages/cognoml/analysis.py", line 11, in <module>
    from cognoml.classifiers.logistic_regression import grid_search
ImportError: No module named 'cognoml.classifiers'

This does not appear when installing in conda, but makes installing with pip more time-consuming. This would require some exploration into the files necessary to be able to pip install directly from github.

How notebooks will work in production

@wisygig, @dcgoss and/or @dhimmel, what are your thoughts on how the Jupyter Notebook part of the application will work? I'm specifically interested in:

Where and how will the notebook be hosted/executed?
How will the back end interface with the notebooks?
How will the specific information from the user's query (genes and diseases) be inputted/updated in the notebook?
How will the specifics of the classifier be selected (what list of parameters to include in cross-validation, include or exclude covariates, and potentially in the future which classifier pipeline to use)? We have discussed automatically selecting some parameters (n_components) based on the query (#106) and we have also discussed letting the user select some parameters (l1_ratio) based on their preference (#106).
What are we thinking for the MVP... one notebook template to cover all queries or a number of different notebook templates for different situations?

I know this may be getting ahead of ourselves so feel free to differ this till later but I thought I'd at least mention that these topics are starting to come up. This issue spans a few different repos but I thought the machine-learning repo might be the best place for it... I'll also tag #63 from cognoma/cognoma.

A quick guide for new members

Since there often are new members who join us, it seems a good idea to provide a quick guide for the machine learning work for Cognoma. The README.md file has been working as a guidance for setting up the Python environment. In addition, I think it is helpful to cover the following issues in the quick guide by either providing the code, or the link to the appropriate webpages:

Install and configure Git, clone the machine-learning repository, etc.
Which jupyter notebooks people can/should read and play around.
The ongoing problems which we are tackling and the contributions people can make.
...

These things are for Cognoma members instead of outsiders. Therefore, they probably should be incorporated into a separate file than README.md. I would like to compose the file, as I have given a tour to many new members who are interested in the machine learning work for Cognoma.

Facilitate application of multiple methods to multiple data sets

Issue #13 requires application of multiple methods to multiple data sets.
To simplify this process, it would be helpful to set up a reusable workbook or code set that supports plugging in different methods.

@RenasonceGent has expressed interest in this task

@dhimmel

Median absolute deviation feature selection

@gwaygenomics presented evidence that median absolute deviation (MAD) feature selection (selecting genes with the highest MADs) can eliminate most features without hurting performance: #18 (comment). In fact, it appears that performance increased with the feature selection, which could make sense if the selection enriched for predictive features, increasing the signal-to-noise ratio.

Therefore, I think we should investigate this method of feature selection further. Specifically, I'm curious whether:

@gwaygenomics' findings hold true for outcomes other than RAS?
MAD is better than MAD / median? I think MAD could be biased against selecting genes that are lowly expressed but still variable?
MAD outperforms random selection of the same feature set size?
MAD performs well for other algorithms besides logistic regression?

I'm labeling this issue a task, so please investigate if you feel inclined.

Memory issue

I am running my notebook obtained by revising the latest 2.TCGA-MLexample in Ubuntu on my laptop (8GB RAM & 8GB swap). I used over-sampling which increased the size of the training data by about 7%. My machine keeps running into memory problem: OSError: [Errno 12] Cannot allocate memory, as well as other exceptions

There is no problem after I discard pipeline. I will use my MacBook (using compressed memory) to run the notebook, but it will be much slower.

feature engineering

• Feature Engineering
o Feature Transformations
 Log
 Square
 Inverse
 Percentile
 ZScore
o Feature Creation
 Interactions (+ * - )
 LDA (Latent Dirichlet Allocation)
• Feature Reduction (Selection / Extraction)
o Stepwise Regression
o RFE (Recursive Feature Elimination)
o PCA (Principle Component Analysis)
o LDA (Latent Dirichlet Allocation)
o Linear Discriminate Analysis -Also LDA =/
o Genetic Algorithms
o Wrapper Methods
• Algorithms
o Linear Regression
 Ridge Regression
 LASSO
 Elastic Net
 OLS (Ordinary Least Squares)
o Logistic Regression
o SVM (Support Vector Machine)
 RBF Kernel (Radial Basis Function)
 Polynomial & Linear Kernel
 Histogram Kernel
o Random Forest
o Adaboost
o Logitboost
o KNN (K Nearest Neighbors)
o Naïve Bayes
o K Means
o Perceptron
o Neural Nets
o GBM (Gradient Boosting Machines)

Defining features and labels

This issue is a follow-up of the results obtained for different genes #52 . It is still not clear why few oncogenes produced such bad results. Before analyzing genes themselves, I got puzzled by one thing in the code.

If we want to run the classifier for a different gene, the only part that is currently changed is y, i.e., vector of labels y=Y[GENE]. Matrix X, which contains our feature values, remains the same. This means that one set of feature values can belong to class '0' in one iteration, while in another iteration same set is denoted as class '1'. Even though each iteration corresponds to a different gene, classifier sees it as another combination of '0' and '1' for which model has to be built.

If the matrix X is static, i.e., its values are completely reliable, I guess the main question is how reliable are the labels given in matrix Y and would it be possible to measure that reliability.

Claim an sklearn algorithm to implement and troubleshoot

In the August 26 meetup, we discussed having each team member in the machine learning group claim an algorithm. We've made lot's of progress on the example notebook (1.TCGA-MLexample.ipynb) since then (see #18 & #25). Currently, 1.TCGA-MLexample.ipynb uses elastic net logistic regression implemented in SGDClassifier.

The goal of this repository is for people to:

Claim an algorithm. See the list of classifiers at #5 (comment). The main requirement is that the algorithm uses the sklearn API so we can use it in the pipeline. Make a comment here once you've chosen an algorithm.
Create a modified version of 1.TCGA-MLexample.ipynb in an algorithms directory. So if I took the SVM classifier, I would copy 1.TCGA-MLexample.ipynb to algorithms/SVC-dhimmel.ipynb. Then I would make my edits to algorithms/SVC-dhimmel.ipynb to switch to an SVC classifier.
Your goal should be to pick a good set of parameters for grid search. It would also be great if you could document what seems to work well about the algorithm (or if it doesn't seem to work well).

Best of luck! If you can work on this before the August 9 meetup then great! Otherwise make sure to bring a laptop with the cognoma-machine-learning environment installed.

TP53 mutation prediction from metadata

I'm new to the group so let me know if there is a better place to write this kind of thing...

I am working on assessing whether the gene expression data provides considerably more predictive information than the metadata (samples.tsv). I created a notebook to predict TP53 mutation from the metadata alone and achieved ~~.82 AUROC. This is substantially lower than the AUROC achieved using gene expression (~~.92). I have a few other ideas for what to do next, but am interested in any input. The new notebook can be found on my forked repo (4.TCGA-Metadata-MLexample). Have not submitted a pull request.

Should cognoml reside in a separate repo?

cognoml is the Python package we're developing for machine-learning

I'm starting to think this would simplify things.

@awm33 what do you think?

Save performance-related parameters when processing queries (for later response time prediction)

Some queries will be much faster than others. It will be very helpful to both end users and UI designers to have info about how long a query can be expected to take.

If we save performance-related parameters such as the query itself, the sizes of referenced tables, and relevant query time intervals (database, post-processing), then we will have lots of data to mine later for predicting response time.