
mars's Introduction

MARS

Project website

PyTorch implementation of MARS, a meta-learning approach for cell type discovery in heterogeneous single-cell data. MARS annotates known and new cell types by transferring latent cell representations across multiple datasets. It is able to discover cell types that have never been seen before and to characterize experiments that are not yet annotated. For a detailed description of the algorithm, please see our manuscript Discovering Novel Cell Types across Heterogeneous Single-cell Experiments (2020).

Setup

MARS requires the anndata and scanpy libraries. Please check the requirements.txt file for more details on the required Python packages. You can create a new environment and install all required packages with:

pip install -r requirements.txt
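
To keep dependencies isolated, you can first create a fresh environment. A minimal sketch using conda (the environment name and Python version below are assumptions, not requirements stated by the repository):

conda create -n mars python=3.7
conda activate mars
pip install -r requirements.txt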

Using MARS

We implemented the MARS model as a self-contained class. To make an instance and train MARS:

mars = MARS(n_clusters, params, labeled_exp, unlabeled_exp, pretrain_data)
adata, landmarks, scores = mars.train(evaluation_mode=True)

MARS provides annotations for the unlabeled experiment, as well as embeddings for the annotated and unannotated experiments, and stores them in an anndata object. In evaluation_mode, annotations for the unlabeled experiment need to be provided; they are used to compute metrics and evaluate the performance of the model.
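
A slightly fuller sketch of a typical run, assuming labeled_exp, unlabeled_exp, and pretrain_data have already been prepared as ExperimentDataset objects (as in the example notebooks) and that n_clusters is known or estimated:

from args_parser import get_parser
from model.mars import MARS

parser = get_parser()
params = parser.parse_args()  # assumption: default MARS hyperparameters; see main_TM.py for the exact usage
mars = MARS(n_clusters, params, labeled_exp, unlabeled_exp, pretrain_data,
            hid_dim_1=1000, hid_dim_2=100)  # hidden dimensions as used in main_TM.py
adata, landmarks, scores = mars.train(evaluation_mode=True)
adata.write('mars_results.h5ad')  # anndata object with embeddings and MARS annotations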

MARS embeddings can be visualized in 2D using UMAP or t-SNE (for example, for the diaphragm and liver tissues).
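
A minimal visualization sketch with scanpy; the names 'MARS_embedding' and 'MARS_labels' below are assumptions about where MARS stores its outputs in the anndata object, so check adata.obsm.keys() and adata.obs.columns for the actual keys:

import scanpy as sc

# assumption: adata is the object returned by mars.train(), with the joint embedding
# stored in adata.obsm (under a key such as 'MARS_embedding') and the predicted
# cluster assignments stored in adata.obs['MARS_labels']
sc.pp.neighbors(adata, n_neighbors=30, use_rep='MARS_embedding')
sc.tl.umap(adata)
sc.pl.umap(adata, color=['MARS_labels'])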

MARS can generate interpretable names for discovered clusters by calling:

mars.name_cell_types(adata, landmarks, cell_type_name_map)
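
The exact structure expected for cell_type_name_map is defined in the MARS source; as a purely hypothetical sketch, it could map the cell-type codes used in the annotated data to human-readable names:

# hypothetical example only -- consult the name_cell_types source/docstring for the
# structure it actually expects
cell_type_name_map = {
    0: 'B cell',
    1: 'T cell',
    2: 'endothelial cell',
}
mars.name_cell_types(adata, landmarks, cell_type_name_map)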

Example of the MARS naming approach: [figure omitted]

An example of running MARS on the Tabula Muris dataset in a leave-one-tissue-out manner is provided in main_TM.py. We also provide two example notebooks, cellbench.ipynb and kolod_pollen_bench.ipynb, that illustrate MARS on small-scale datasets.

Cross-validation benchmark

We provide a cross-validation benchmark, cross_tissue_generator.py, for classifying cell types in the Tabula Muris data. The iterator goes over cross-organ train/test splits and automatically downloads the Tabula Muris data.

Datasets

The Tabula Muris Senis dataset is from https://figshare.com/projects/Tabula_Muris_Senis/64982.

The Tabula Muris Senis dataset in h5ad format can be downloaded from http://snap.stanford.edu/mars/data/tms-facs-mars.tar.gz. The small-scale example datasets CellBench and Kolodziejczyk/Pollen can be downloaded from http://snap.stanford.edu/mars/data/cellbench_kolod_pollen.tgz.

Pretrained models for each tissue in Tabula Muris can be downloaded from http://snap.stanford.edu/mars/data/TM_trained_models.tar.gz.
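
For example, the Tabula Muris Senis archive can be fetched and unpacked with standard command-line tools (where you place the extracted files is up to you):

wget http://snap.stanford.edu/mars/data/tms-facs-mars.tar.gz
tar -xzf tms-facs-mars.tar.gz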

Citing

If you find our research useful, please consider citing:

@article{brbic2020mars,
  title={MARS: Discovering Novel Cell Types across Heterogeneous Single-cell Experiments},
  author={Brbic, Maria and Zitnik, Marinka and Wang, Sheng and Pisco, Angela O and Altman, 
          Russ B and Darmanis, Spyros and Leskovec, Jure},
  journal={Nature Methods},
  year={2020},
}

Contact

Please contact Maria Brbic at [email protected] for questions.


mars's Issues

How to use memory across multiple GPUs with MARS?

Hi, I am running MARS on a dataset containing approximately 20k cells. I was running it on a 1080Ti, which has 11 GB of memory, but 11 GB of GPU memory is not enough for MARS to process this dataset, and I got a CUDA out-of-memory error.

Is MARS able to use memory across multiple GPUs, so we can scale up to more than 20k cells? Or is there any other way to run MARS with large datasets?

What version of Python?

Hi, I'm trying to install your package but am running into issues installing all the requirements. I'm wondering what version of Python you used? Thanks!

How to determine n_clusters for unseen data

I noticed that in kolod_pollen_bench.ipynb, cellbench.ipynb, and main_TM.py, the ground-truth number of cell types in the unlabeled dataset is given to the n_clusters parameter, for example as in main_TM.py:

n_clusters = len(np.unique(unlabeled_data.y))
mars = MARS(n_clusters, params, labeled_data, unlabeled_data, pretrain_data[idx], hid_dim_1=1000, hid_dim_2=100)

But in practice, the unlabeled dataset has usually never been seen before, so the number of cell types in it is usually unknown. What is your recommendation for determining the value of the n_clusters parameter if we have a completely unseen, unlabeled dataset? If the value of n_clusters is off, would this greatly influence the final cell type labeling outcome?

Provide tutorial for mars.name_cell_types

Hi, I am wondering if you could provide more guidance on how to generate interpretable names for discovered clusters by calling mars.name_cell_types(adata, landmarks, cell_type_name_map).

This was not included in any available tutorial, and the help/comments in the source code are not detailed enough.

Mapping of MARS predicted IDs to groundtruth/reference dataset

Should the MARS predicted cluster IDs map to the cluster IDs in the groundtruth/reference dataset?

I'm able to get cell type predictions with MARS, but they don't appear to be mapping back to the groundtruth cluster IDs.

This issue also seems to be present in the example notebook, where the predicted cluster IDs don't line up with the reference cluster IDs (for example, in the visualization output for notebook cell 38, where ground_truth and MARS_labels don't appear to match up with each other).

Thanks in advance!

NameError: name 'anndata' is not defined

Hi MARS team,

I created my loom object from a Seurat object, but when I run anndata.read_loom, I get the following error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-62-f16757897d08> in <module>
----> 1 adata = anndata.read_loom('~/Skin.loom', sparse=True, cleanup=False, X_name='spliced', obs_names='CellID', obsm_names=None, var_names='Gene', varm_names=None, dtype='float32', **kwargs)

NameError: name 'anndata' is not defined

Can you help me fix it?

Thanks a bunch!
Yale

How to load args_parser and model?

Hi MARS team,

Thanks for such a great package. As a beginner, when I try to run your notebook:

from args_parser import get_parser
from model.mars import MARS
from model.experiment_dataset import ExperimentDataset
from data.benchmarks import BenchmarkData
import warnings

I run into errors: No module named 'args_parser' / 'get_parser' / 'model'.

I am wondering whether you can help me load these modules correctly?

Thanks,
Yale

mars.train reports inaccurate performance scores?

Hi, I am evaluating the performance of MARS, but I have a question about how MARS computes the performance score in evaluation mode:

mars = MARS(n_clusters, params, labeled_exp, unlabeled_exp, pretrain_data)
adata, landmarks, scores = mars.train(evaluation_mode=True)

Here, scores is the performance score, and here is the code that computes it: https://github.com/snap-stanford/mars/blob/master/model/metrics.py

I noticed that in the above-mentioned code, when computing the score, MARS first uses the Hungarian algorithm to match the predicted labels to the original labels. If my understanding is correct, this Hungarian matching step finds the best correspondence between the original cell type labels and the predicted labels, such that the number of correctly predicted labels is maximized.

Having this Hungarian match might be okay for a quick evaluation, because we can let the Hungarian algorithm find the optimal match between the original cell types and the predicted labels; otherwise we would have to run mars.name_cell_types to find the match ourselves. However, in some cases this Hungarian match can result in an inaccurate performance score and even overestimate the performance. I am attaching a working example (a zipped Jupyter notebook, compute_score.zip) here FYI. You can see how the Hungarian match influences the performance evaluation.

Therefore I believe the way MARS evaluates its own performance might be inaccurate, and it could report inaccurate performance scores when run in evaluation mode. I also noticed this score is used in many places, including the online tutorials (kolod_pollen_bench.ipynb and cellbench.ipynb) and even the leave-one-tissue-out experiment reported in the original paper (https://github.com/snap-stanford/mars/blob/master/main_TM.py), suggesting the authors may have used it not just as a quick check but as a formal evaluation of the model performance.

Please correct me if my understanding of the Hungarian match/score calculation is wrong :-)! Also, could the authors share how they evaluated MARS performance in the leave-one-tissue-out experiment, if they used a more accurate way to calculate performance scores than what is shown here: https://github.com/snap-stanford/mars/blob/master/main_TM.py
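
For reference, a minimal sketch of the matching step described above (an illustration, not the repository's metrics.py), using scipy's linear_sum_assignment to find the one-to-one relabeling that maximizes agreement before computing accuracy:

import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_accuracy(y_true, y_pred):
    # Hungarian-matched clustering accuracy; assumes integer-coded labels starting at 0.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = int(max(y_true.max(), y_pred.max())) + 1
    w = np.zeros((n, n), dtype=np.int64)          # contingency table of (predicted, true) counts
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1
    row_ind, col_ind = linear_sum_assignment(-w)  # maximize total agreement over one-to-one matches
    return w[row_ind, col_ind].sum() / y_true.size

matched_accuracy([0, 0, 1, 1], [1, 1, 0, 0])      # 1.0 after relabeling, even though raw agreement is 0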

requirements.txt

Hello. I wanted to try MARS for my project. I created a new conda environment, which I called mars. Then I installed the packages in requirements.txt via conda:
while read requirement; do conda install --yes $requirement; done < requirements.txt 2>error.log

Then, when I ran cellbench.ipynb, I got a lot of errors from the very beginning, while loading the required libraries. Looking at the error.log I created above, I see that many libraries could not be found, for example annoy==1.16.0. Doing conda search annoy, I see that there is no package called just annoy, but python-annoy instead. I am wondering if the requirements.txt file is outdated? Also, which Python version is MARS built on? I have Python 3.8 on my system.
I actually rarely use Python (I am an R user), so I may be missing something here.

Thank you very much!

MARS_labels is totally different from ClusterID

Hi MARS team,

I used your great package to run my own data and found that the MARS_labels are very different from my own cluster IDs. The result is the following:

[attached screenshot]

Can you tell me whether this means my clustering is not very good and cannot separate every cell type well? Thanks!

Also, I checked your two example notebooks and found that you always split the data into annotated and unannotated sets. Is there any way to avoid doing so? We mainly focus on the clusters themselves, regardless of condition or tissue.

Thanks,
Yale

How to visualize the "MARS_embeddings" and how to define the "cell_type_name_map"?

Hi MARS team,

Sorry to bother you again.

Can you tell me how to visualize the "MARS_embeddings"?

Also, when I run mars.name_cell_types(adata, landmarks, cell_type_name_map), I get the error NameError: name 'cell_type_name_map' is not defined. Can you tell me how to define cell_type_name_map?

Sorry that I have no background in Python, but I really like your package.

Thanks for your patience,

Thanks again,
Yale

MARS code problems

I could not properly install everything in requirements.txt for MARS, but I worked around some of the packages manually. Now I have some problems running the cellbench.ipynb notebook; scanpy in particular is quite error-prone.
With the scanpy version specified in requirements.txt (scanpy 1.4.4.post1), scanpy loads only intermittently (in one run I get an error and in another run I don't).

When it does get imported, I get the error 'tuple' object has no attribute 'tocsr' further down the script when I run the command:
sc.pp.neighbors(adata, n_neighbors=30, use_rep='X')

I found online that I need to upgrade scanpy to solve this, but when I did so, I got other errors (linked to the umap function).

Also, the args_parser module is only intermittently found (in one run it is found and in a later run it is not).

Please help me solve these problems. I don't use Python; I am used to R. Maybe some of the problems I face are trivial for Python programmers, or maybe requirements.txt is outdated?

Note: I did not create a separate environment to run MARS; I did all of the above in my base environment. I have Python 3.6 installed via Anaconda3.

Thank you very much.

ValueError: too many dimensions 'str' when running MARS

Hi MARS team,

Thanks for your help.

Today I ran a new issue.

When I ran mars = MARS(n_clusters, params, [annotated], unannotated, pretrain_data), I got the error ValueError: too many dimensions 'str'.

Can you help me with it?

Thanks,
Yale
