
solo's Introduction

solo -- Doublet detection via semi-supervised deep learning

Why

Cells subjected to single cell RNA-seq have been through a lot, and they'd really just like to be alone now, please. If they cannot escape the cell social scene, you end up sequencing RNA from more than one cell to a barcode, creating a doublet when you expected a single cell profile. Paper: https://www.cell.com/cell-systems/fulltext/S2405-4712(20)30195-2

solo is a neural network framework to classify doublets, so that you can remove them from your data and clean your single cell profile.

We benchmarked solo against other doublet detection tools such as DoubletFinder and Scrublet and found that it consistently outperformed them in average precision. solo also performed markedly better on a more complex tissue, mouse kidney.

Quick set up

Run the following to clone the repository and set up a conda environment: git clone git@github.com:calico/solo.git && cd solo && conda create -n solo python=3.12 && conda activate solo && pip install -e .

Or install via pip: conda create -n solo python=3.12 && conda activate solo && pip install solo-sc

If you don't have conda follow the instructions here: https://docs.conda.io/projects/conda/en/latest/user-guide/install/


usage: solo [-h] -j MODEL_JSON_FILE -d DATA_PATH
            [--set-reproducible-seed REPRODUCIBLE_SEED]
            [--doublet-depth DOUBLET_DEPTH] [-g] [-a] [-o OUT_DIR]
            [-r DOUBLET_RATIO] [-s SEED] [-e EXPECTED_NUMBER_OF_DOUBLETS] [-p]
            [-recalibrate_scores] [--version] [--lr_st] [--lr_vae]

optional arguments:
  -h, --help            show this help message and exit
  -j MODEL_JSON_FILE    json file to pass VAE parameters (default: None)
  -d DATA_PATH          path to h5ad, loom, or 10x mtx dir cell by genes
                        counts (default: None)
  --set-reproducible-seed REPRODUCIBLE_SEED
                        Reproducible seed, give an int to set seed (default:
                        None)
  --doublet-depth DOUBLET_DEPTH
                        Depth multiplier for a doublet relative to the average
                        of its constituents (default: 2.0)
  -g                    Run on GPU (default: True)
  -a                    output modified anndata object with solo scores. Only
                        works for anndata input (default: False)
  -o OUT_DIR
  -r DOUBLET_RATIO      Ratio of doublets to true cells (default: 2)
  -s SEED               Path to previous solo output directory. Seed VAE
                        models with a previously trained solo model. The
                        directory structure is assumed to be the same as the
                        solo output directory structure; it should at least
                        have a vae.pt (a pickled object of vae weights) and a
                        latent.npy (an np.ndarray of the latents of your
                        cells). (default: None)
  -e EXPECTED_NUMBER_OF_DOUBLETS
                        Experimentally expected number of doublets (default:
                        None)
  -p                    Plot outputs for solo (default: False)
  -recalibrate_scores   Recalibrate doublet scores (not recommended anymore)
                        (default: False)
  --version             Get version of solo-sc (default: False)
  --lr_st               Learning rate used for solo.train (default: 1e-3)
  --lr_vae              Learning rate used for vae (default: 1e-3)

Warning: If you are going directly from cellranger 10x output you may want to manually inspect your data prior to running solo.
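
For example, here is a minimal pre-filtering sketch using scanpy (the path, thresholds, and output file name are illustrative assumptions, not recommendations):

import scanpy as sc

# Illustrative QC before running solo; adjust thresholds to your own data.
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")  # example cellranger output path
sc.pp.filter_cells(adata, min_genes=200)                 # drop near-empty barcodes
sc.pp.filter_genes(adata, min_cells=3)                   # drop genes detected in almost no cells
adata.write("counts_for_solo.h5ad")                      # write counts to pass to solo with -d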

model_json example:

{
  "n_hidden": 384,
  "n_latent": 64,
  "n_layers": 1,
  "cl_hidden": 128,
  "cl_layers": 1,
  "dropout_rate": 0.2,
  "lr_st": 1e-3,
  "valid_pct": 0.10
}

The suggested learning rates work well in most settings, but if a ValueError occurs during training, consider lowering the learning rates to 1e-5.
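
As a small sketch (the file name solo_params_lowlr.json is just an example), you could write a copy of the json above with a lowered learning rate and pass it to solo with -j:

import json

params = {
    "n_hidden": 384,
    "n_latent": 64,
    "n_layers": 1,
    "cl_hidden": 128,
    "cl_layers": 1,
    "dropout_rate": 0.2,
    "lr_st": 1e-5,      # lowered from 1e-3 in case training raises a ValueError
    "valid_pct": 0.10,
}
with open("solo_params_lowlr.json", "w") as fh:
    json.dump(params, fh, indent=2)
# then: solo -j solo_params_lowlr.json -d counts_for_solo.h5ad -o solo_out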

Outputs (see the loading sketch after this list):

  • is_doublet.npy np boolean array, true if a cell is a doublet; differs from preds.npy only if the -e expected_number_of_doublets parameter was used

  • vae scVI directory for vae

  • classifier.pt scVI directory for classifier

  • latent.npy latent embedding for each cell

  • preds.npy doublet predictions

  • softmax_scores.npy updated softmax of doublet scores (see paper); now identical to no_update_softmax_scores.npy

  • no_update_softmax_scores.npy raw softmax of doublet scores

  • logit_scores.npy logit of doublet scores

  • real_cells_dist.pdf histogram of distribution of doublet scores

  • accuracy.pdf accuracy plot test vs train

  • train_v_test_dist.pdf doublet scores of test vs train

  • roc.pdf roc of test vs train

  • softmax_scores_sim.npy see above but for simulated doublets

  • logit_scores_sim.npy see above but for simulated doublets

  • preds_sim.npy see above but for simulated doublets

  • is_doublet_sim.npy see above but for simulated doublets
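
Below is a minimal sketch, assuming numpy and anndata are installed, that solo_out/ is the directory you passed with -o, and counts_for_solo.h5ad is the input you passed with -d, for pulling the main outputs back onto your AnnData:

import numpy as np
import anndata

out = "solo_out/"                                   # example -o directory
is_doublet = np.load(out + "is_doublet.npy")        # boolean doublet calls
scores = np.load(out + "softmax_scores.npy")        # doublet scores (check the array shape for your version)

adata = anndata.read_h5ad("counts_for_solo.h5ad")   # the counts you gave solo with -d
if len(is_doublet) == adata.n_obs:                  # sanity check before attaching
    adata.obs["solo_is_doublet"] = is_doublet
    singlets = adata[~adata.obs["solo_is_doublet"].values].copy()  # drop called doublets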

How to demultiplex cell hashing data using HashSolo CLI

Demultiplexing takes as input an h5ad file with only hashing counts. Counts can be obtained from your fastqs by using kite. See tutorial here: https://github.com/pachterlab/kite
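
If your hashing counts currently sit alongside gene counts in one object, here is a rough sketch for writing a hashing-only h5ad first (the feature names and file names are assumptions; older versions also struggled with a sparse .X, see the inline-usage issue below, so the sketch densifies defensively):

import numpy as np
import anndata

adata = anndata.read_h5ad("counts_with_htos.h5ad")    # example combined object
hto_features = ["HTO1", "HTO2", "HTO3", "HTO4"]       # example hashtag feature names
hdata = adata[:, hto_features].copy()
if not isinstance(hdata.X, np.ndarray):               # densify in case .X is sparse
    hdata.X = np.asarray(hdata.X.todense())
hdata.write("cell_hashing_counts.h5ad")               # input for hashsolo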

usage: hashsolo [-h] [-j MODEL_JSON_FILE] [-o OUT_DIR] [-c CLUSTERING_DATA]
                [-p PRE_EXISTING_CLUSTERS] [-q PLOT_NAME]
                [-n NUMBER_OF_NOISE_BARCODES]
                data_file

positional arguments:
  data_file             h5ad file containing cell hashing counts

optional arguments:
  -h, --help            show this help message and exit
  -j MODEL_JSON_FILE    json file to pass optional arguments (default: None)
  -o OUT_DIR            Output directory for results (default:
                        hashsolo_output)
  -c CLUSTERING_DATA    h5ad file with count transcriptional data to perform
                        clustering on (default: None)
  -p PRE_EXISTING_CLUSTERS
                        column in cell_hashing_data_file.obs specifying
                        different cell types or clusters (default: None)
  -q PLOT_NAME          name of plot to output (default: hashing_qc_plots.pdf)
  -n NUMBER_OF_NOISE_BARCODES
                        Number of barcodes to use to create noise distribution
                        (default: None)

model_json example:

{
  "priors": [0.01, 0.5, 0.49]
}

Priors is a list of the probabilities of the three hypotheses we test when demultiplexing cell hashing data: negative, singlet, or doublet. A negative cell's barcode doesn't have enough signal to identify its sample of origin. A singlet has enough signal from a single hashing barcode to associate the cell with its originating sample. A doublet is a cell barcode with signal for more than one hashing barcode. Depending on how you processed your cell hashing matrix beforehand, you may want to set different priors. Under the assumption that you have subset your cell barcodes using typical QC on your cell by genes matrix (e.g. minimum UMI counts, percent mitochondrial reads, etc.), we found the above prior setting performed well (see paper). If you have only done relatively light QC in transcriptome space, we'd suggest an even prior, e.g. [1./3., 1./3., 1./3.].
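
As a tiny sketch, you could generate that even-prior json from Python (the file name hashsolo_even_priors.json is just an example) and pass it to hashsolo with -j:

import json

with open("hashsolo_even_priors.json", "w") as fh:    # example file name
    json.dump({"priors": [1. / 3., 1. / 3., 1. / 3.]}, fh, indent=2)
# then: hashsolo -j hashsolo_even_priors.json cell_hashing_counts.h5ad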

Outputs:

  • hashsoloed.h5ad anndata with demultiplexing information in .obs
  • hashing_qc_plots.png plots of probabilities for each cell

How to demultiplex cell hashing data using HashSolo in line

>>> from solo import hashsolo
>>> import anndata
>>> cell_hashing_data = anndata.read("cell_hashing_counts.h5ad")
>>> hashsolo.hashsolo(cell_hashing_data)
>>> cell_hashing_data.obs.head()
                  most_likeli_hypothesis  cluster_feature  negative_hypothesis_probability  singlet_hypothesis_probability  doublet_hypothesis_probability         Classification
index                                                                                                                                                                            
CCTTTCTGTCCGAACC                       2                0                     1.203673e-16                        0.000002                        0.999998                Doublet
CTGATAGGTGACTCAT                       1                0                     1.370633e-09                        0.999920                        0.000080  BatchF-GTGTGACGTATT_x
AGCTCTCGTTGTCTTT                       1                0                     2.369380e-13                        0.996992                        0.003008  BatchE-GAGGCTGAGCTA_x
GTGCGGTAGCGATGAC                       1                0                     1.579405e-09                        0.999879                        0.000121  BatchB-ACATGTTACCGT_x
AAATGCCTCTAACCGA                       1                0                     1.867626e-13                        0.999707                        0.000293  BatchB-ACATGTTACCGT_x
>>> hashsolo.plot_qc_checks_cell_hashing(cell_hashing_data)
  • most_likeli_hypothesis 0 == Negative, 1 == Singlet, 2 == Doublet (see the mapping sketch after this list)
  • cluster_feature how the cell hashing data was divided, if specified, or as determined automatically by giving a cell by genes anndata object to the cluster_data argument when calling demultiplex_cell_hashing
  • negative_hypothesis_probability
  • singlet_hypothesis_probability
  • doublet_hypothesis_probability
  • Classification The sample of origin for the cell or whether it was a negative or doublet cell.
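
As a small follow-up, here is a sketch that turns the numeric hypotheses into labels and summarizes the calls (it assumes hashsolo was run in place as above; column names are as in the table):

# Map hypothesis codes to labels and count the calls.
labels = {0: "Negative", 1: "Singlet", 2: "Doublet"}
cell_hashing_data.obs["hypothesis_label"] = (
    cell_hashing_data.obs["most_likeli_hypothesis"].astype(int).map(labels)
)
print(cell_hashing_data.obs["hypothesis_label"].value_counts())
print(cell_hashing_data.obs["Classification"].value_counts())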

solo's People

Contributors

anyakors, davek44, elhl93, georgiaschmitt, njbernstein


solo's Issues

scipy error

Hi,

I am attempting to implement the hash demultiplexing pipeline described in the vignette. I have generated h5ad files using the KITE pipeline from kallisto and am able to cleanly demultiplex using the seurat pipeline.

When I try to run solo I keep getting this error:

(error screenshot attached to the original issue)

Do you have any suggestions?

Best,
Dylan

h5ad dataset

Hi, I would like to compare the results of different doublet detection models using the 'Kang et al. Control PBMCs (2c)' dataset mentioned in your paper. Since this dataset isn't available online in h5ad format, could you please provide it as an h5ad file?

Thank you so much

QUESTION: Running `solo` on multiple batches?

Hi,

I have been playing with solo and I have been having good results. I have a question regarding how best to run solo on my data:

I have data coming from three different experiments, say Cells-CD45+, Cells-CD31+ and Cells-CD19+, and I was wondering whether you recommend running solo on each of the individual data sources, or whether it is better to run it on a merged AnnData with all the sources concatenated?

Thanks in advance for the advice!

Error in inline usage

Hi!

I'm trying to use Solo inline on an Anndata object with 4 hashtag features. My code looks roughly like this:

# Load adata, filter cells and genes on min_counts = 1
...

# Subset on hashtag features
hdata = adata[:, hashtag_features].copy()

from solo import hashsolo

hashsolo.hashsolo(hdata)

Initially, the error from solo was NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported - I assume because hdata.X is a sparse matrix. Then I tried to do

hdata.X = hdata.X.todense()

prior to the hashsolo call, but now I'm getting an IndexError: boolean index did not match indexed array along dimension 1; dimension is 4 but corresponding boolean dimension is 1. Any ideas how to proceed?

Thanks,
Jens

Solo on 10x genomics scRNA data

Hi,
I am trying to run solo on the 10x genomics data. This is the command I used:
solo -d /projects/lihc_hiseq/active/SingleCell/processedDATA/12_M491_PBMC_CTC_cDNArep/outs/filtered_feature_bc_matrix -j model_json.json -o 12_M491_output
The model_json.json is the default you have suggested.
This is the error I get:
Cuda is not available, switching to cpu running!
Min cell depth: 500.0, Max cell depth: 68659.0
INFO     No batch_key inputted, assuming all cells are same batch
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 7333 cells, 32738 vars, 1 batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/tqdm/std.py:538: TqdmWarning: clamping frac to range [0, 1]
  colour=colour)
Epoch 1/2000: -0%| | -1/2000 [00:00<?, ?it/s]
/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:398: LightningDeprecationWarning: One of the returned values {'reconstruction_loss_sum', 'kl_global', 'kl_local_sum', 'n_obs'} has a grad_fn. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: return {'loss': ..., 'something': something.detach()}
  f"One of the returned values {set(extra.keys())} has a grad_fn. We will detach it automatically"
Traceback (most recent call last):
  File "/home/paulyr2/miniconda/envs/solo/bin/solo", line 33, in <module>
    sys.exit(load_entry_point('solo-sc', 'console_scripts', 'solo')())
  File "/home/paulyr2/solo/solo/solo.py", line 240, in main
    callbacks=scvi_callbacks,
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/scvi/model/base/_training_mixin.py", line 70, in train
    return runner()
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/scvi/model/base/_trainrunner.py", line 75, in __call__
    self.trainer.fit(self.training_plan, train_dl, val_dl)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/scvi/lightning/_trainer.py", line 131, in fit
    super().fit(*args, **kwargs)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 118, in run
    output = self.on_run_end()
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 235, in on_run_end
    self._on_train_epoch_end_hook(processed_outputs)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 275, in _on_train_epoch_end_hook
    trainer_hook(processed_epoch_output)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 109, in on_train_epoch_end
    callback.on_train_epoch_end(self, self.lightning_module)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 170, in on_train_epoch_end
    self._run_early_stopping_check(trainer)
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 185, in _run_early_stopping_check
    logs
  File "/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 134, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric reconstruction_loss_validation which is not available. Pass in or modify your EarlyStopping callback to use any of the following: elbo_train, reconstruction_loss_train, kl_local_train, kl_global_train
/home/paulyr2/miniconda/envs/solo/lib/python3.6/site-packages/tqdm/std.py:538: TqdmWarning: clamping frac to range [0, 1]
Epoch 1/2000: -0%|

Suggestions?
Thanks!

Problem with hashsolo

Hi,
I just ran my solo analysis and now I need the information about which of my cells were predicted to be doublets and so forth.
If I understood correctly I have to use hashsolo for this, but that just runs with no logging or progress information, and I don't seem to get anything...
I only have an h5ad file with ~800 cells; does this just take over a day, or am I getting something wrong?
Thank you

Solo in line?

Hello,

I was wondering if it is possible to run solo in line? I noticed hashsolo has that capability, but I would like to run solo in line as well.

Thanks,
Chang

Add testing for performance

We should automatically test performance for every PR to ensure things haven't gone awry.

Implementation rough draft (see the sketch below):
  • Pull the 2c Kang dataset
  • Run the current and the new Solo version 10 times each
  • Mann-Whitney U test to check for a performance regression
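
A minimal sketch of that comparison, assuming scipy is available; the two lists are placeholders standing in for the average precision from the 10 runs of each version:

from scipy.stats import mannwhitneyu

# Placeholder average-precision values from 10 runs of each version on the 2c Kang dataset.
ap_current = [0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.63, 0.62, 0.60, 0.62]
ap_new     = [0.62, 0.63, 0.61, 0.63, 0.64, 0.62, 0.63, 0.62, 0.61, 0.63]

# One-sided test: is the new version's performance stochastically lower than the current one's?
stat, pval = mannwhitneyu(ap_new, ap_current, alternative="less")
assert pval > 0.05, "new version appears significantly worse than the current one"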

Optimal Preprocessing

Hi,

I have a question regarding the optimal way of preprocessing counts for solo. In particular, should the input be raw counts, or do you recommend some preprocessing? What preprocessing steps would that be?

Thanks a lot for your help!

Difference between is_doublet and preds

Hi!
I'm trying your tool to identify doublets in my scRNASeq data but I'm not sure why I get slightly different results in is_doublet.csv and preds.csv files. I'm not using the -e parameter (expected_number_of_doublets), so shouldn't they be the same? What's the difference between these files?

Thanks!!
Yamil

Error during run

Hello,

I have tried using solo on a few different datasets but have run into the same error with all datasets I have tried:

Traceback (most recent call last):
  File "/opt/conda/envs/py36/bin/solo", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/solo/solo.py", line 252, in main
    non_zero_indexes = np.where(singlet_scvi_data.X > 0)
  File "<__array_function__ internals>", line 6, in where
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scipy/sparse/base.py", line 287, in __bool__
    raise ValueError("The truth value of an array with more than one "
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

The model gets through some of the epochs (usually ~15-30).

solo was installed with pip (python 3.6.8) and is being run out of a singularity bucket. Please let me know if you have any recommendations on how I might fix this or would like additional information.

Thanks,
Drew

MemoryError

Thanks for providing this resource.

I am running into the following error -
MemoryError: Unable to allocate 1.66 TiB for an array with shape (6794880, 33538) and data type int64.

I wonder how much space you recommend having available to run a sample using the raw 10X data set.

Or do you recommend filtering our 10X objects and then using the filtered matrix to run solo?

Thanks for your help

installation error

Hi,
When I tried to install the package, it gave some error. Any idea what could be causing this and how to fix it?
I am installing as you instructed, in a new conda environment with python 3.6.9.
log:
... (omitted the non-error lines)
Collecting ptyprocess>=0.5
Using cached https://files.pythonhosted.org/packages/d1/29/605c2cc68a9992d18dada28206eeada56ea4bd07a239669da41674648b6f/ptyprocess-0.6.0-py2.py3-none-any.whl
ERROR: jupyter-console 6.0.0 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 3.0.2 which is incompatible.
Installing collected packages: ConfigArgParse, six, cycler, decorator, numpy, h5py, joblib, mock, natsort, networkx, numexpr, python-dateutil, pytz, pandas, patsy, Pillow, pyparsing, scipy, anndata, llvmlite, numba, scikit-learn, umap-learn, tqdm, kiwisolver, matplotlib, statsmodels, seaborn, tables, scanpy, cloudpickle, future, hyperopt, click, numpy-groupies, loompy, torch, xlrd, ipython-genutils, traitlets, pygments, jupyter-core, tornado, pyzmq, jupyter-client, parso, jedi, pickleshare, ptyprocess, pexpect, wcwidth, prompt-toolkit, backcall, ipython, ipykernel, qtconsole, jupyter-console, pyrsistent, attrs, more-itertools, zipp, importlib-metadata, jsonschema, nbformat, Send2Trash, terminado, prometheus-client, webencodings, bleach, testpath, mistune, entrypoints, pandocfilters, defusedxml, MarkupSafe, jinja2, nbconvert, notebook, widgetsnbextension, ipywidgets, jupyter, scvi, dataclasses, python-igraph, leidenalg, atomicwrites, packaging, py, pluggy, pytest, solo

Running setup.py develop for solo
ERROR: Command errored out with exit status -11:
command: /home/mzhibo/anaconda3/envs/solo/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/mzhibo/apps/solo/setup.py'"'"'; file='"'"'/home/mzhibo/apps/solo/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps
cwd: /home/mzhibo/apps/solo/
Complete output (1 lines):
[2019-12-17 13:53:26,158] INFO - scvi._settings | Added StreamHandler with custom formatter to 'scvi' logger.

----------------------------------------

ERROR: Command errored out with exit status -11: /home/mzhibo/anaconda3/envs/solo/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/mzhibo/apps/solo/setup.py'"'"'; file='"'"'/home/mzhibo/apps/solo/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

Output Recommendation

Hi solo developers,

I am interested in using solo but noticed that all the outputs are in numpy formats. I think it would be helpful to have bash-readable file outputs so that the user is not required to use python to see the results following running the software. Any chance this could be implemented for solo?

ValueError: Length of values does not match length of index

Hi again @njbernstein and @davek44 :)

After the fix from issue #13 solo is working really great in some of my samples. Thanks very much!

However, for some of them, solo runs without any problem but just before it's going to report the scores/doublets it fails with the following error:

Train accuracy: 0.8743
Test accuracy:  0.8638
Train AUROC: 0.8638
Test AUROC:  0.8553

Traceback (most recent call last):
      File "/nfs/users/nfs_c/ct5/storage/anaconda3/envs/solo/bin/solo", line 11, in <module>
          load_entry_point('solo', 'console_scripts', 'solo')()
      File "/nfs/users/nfs_c/ct5/tools/solo/solo/solo.py", line 390, in main
         adata.obs['is_doublet'] = is_doublet[:num_cells]
      File "/nfs/users/nfs_c/ct5/storage/anaconda3/envs/solo/lib/python3.6/site-packages/pandas/core/frame.py", line 3370, in __setitem__
         self._set_item(key, value)
     File "/nfs/users/nfs_c/ct5/storage/anaconda3/envs/solo/lib/python3.6/site-packages/pandas/core/frame.py", line 3445, in _set_item
        value = self._sanitize_column(key, value)
     File "/nfs/users/nfs_c/ct5/storage/anaconda3/envs/solo/lib/python3.6/site-packages/pandas/core/frame.py", line 3630, in _sanitize_column
       value = sanitize_index(value, self.index, copy=False)
    File "/nfs/users/nfs_c/ct5/storage/anaconda3/envs/solo/lib/python3.6/site-packages/pandas/core/internals/construction.py", line 519, in sanitize_index
      raise ValueError('Length of values does not match length of index')
ValueError: Length of values does not match length of index

Now, I understand that the index somehow doesn't match the original object, but there's nothing obviously wrong with the original anndata object.

Do you have any ideas?

Add batching to Solo

Currently, users must manually break up their dataset if it contains multiple samples. We can help them by running Solo per batch for them. The main downside is that this will be slower if the user has multiple GPUs available.
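
Until that is implemented, a minimal workaround sketch (it assumes a 'batch' column in .obs; file names are examples) is to split the merged object yourself and run solo once per batch:

import anndata

adata = anndata.read_h5ad("all_samples.h5ad")        # example merged object
for batch in adata.obs["batch"].unique():            # assumes a 'batch' column in .obs
    sub = adata[adata.obs["batch"] == batch].copy()
    sub.write(f"counts_{batch}.h5ad")
# then run solo once per file: solo -j model_json.json -d counts_<batch>.h5ad -o solo_<batch>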

scvi is deprecated

Hi all,

I wanted to let you all know that we are deprecating the scvi software package. We recently pushed a final update (0.6.8) with a deprecation warning upon import. scvi is now implemented in the scvi-tools package (same repo). Based on viewing your code, it seems like the easiest thing to do at the moment is pin the scvi requirement to 0.6.7. We will be changing some of the classes you use (like Classifier etc.) in scvi-tools, so it will require non-trivial updating to achieve the same solo algorithm. To fix the issue with the invalid parameter, you might consider detecting outliers in the dataset in terms of library size, possibly truncating the library size prior, or reducing the number of latent dimensions.

Please let me, @galenxing, or @romain-lopez know if you have any questions.

Problem with loom file

Hey,

I'm trying to use your tool on a loom file and it's giving me an error I don't quite know how to solve.
Mind having a look at it and telling me how you think I could solve it?

Kind regards,
Margherita

(base) KI-C02Z42TFLVDM:solo marzam$ solo solo/solo_params_example.json dev_all.loom
[2021-01-13 16:12:25,205] INFO - scvi._settings | 'scvi' logger already has a StreamHandler, set its level to 10.
Cuda is not available, switching to cpu running!
[2021-01-13 16:12:25,205] INFO - scvi.dataset.loom | Preprocessing dataset
Traceback (most recent call last):
  File "/Users/marzam/miniconda3/bin/solo", line 33, in <module>
    sys.exit(load_entry_point('solo-sc', 'console_scripts', 'solo')())
  File "/Users/marzam/OneDrive - KI.SE/Mac/Documents/sequencing/ionut/doublets/solo/solo/solo/solo.py", line 119, in main
    scvi_data = LoomDataset(data_path)
  File "/Users/marzam/miniconda3/lib/python3.7/site-packages/scvi/dataset/loom.py", line 66, in __init__
    delayed_populating=delayed_populating,
  File "/Users/marzam/miniconda3/lib/python3.7/site-packages/scvi/dataset/dataset.py", line 2026, in __init__
    self.populate()
  File "/Users/marzam/miniconda3/lib/python3.7/site-packages/scvi/dataset/loom.py", line 138, in populate
    data = ds[:, select].T  # change matrix to cells by genes
  File "/Users/marzam/.local/lib/python3.7/site-packages/loompy/loompy.py", line 206, in __getitem__
    return self.layers[""][slice_]
  File "/Users/marzam/.local/lib/python3.7/site-packages/loompy/loom_layer.py", line 88, in __getitem__
    return self.ds._file['/matrix'].__getitem__(slice)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/Users/marzam/miniconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 777, in __getitem__
    selection = sel.select(self.shape, args, dataset=self)
  File "/Users/marzam/miniconda3/lib/python3.7/site-packages/h5py/_hl/selections.py", line 82, in select
    return selector.make_selection(args)
  File "h5py/_selector.pyx", line 272, in h5py._selector.Selector.make_selection
  File "h5py/_selector.pyx", line 183, in h5py._selector.Selector.apply_args
TypeError: Indexing arrays must have integer dtypes
(base) KI-C02Z42TFLVDM:solo marzam$ 

json variables

Hello,

I am interested in using solo for doublet detection and I was wondering if you could provide more details about the variables in the json file that are used for the model and how one might change the values based on their dataset?

Thanks!
-Drew

Can not read from remote repository

When I run the command:
git clone git@github.com:calico/solo.git && cd solo && conda create -n solo python=3.6 && conda activate solo && pip install -e

I get this message:

Cloning into 'solo'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Or if I run the command with sudo :

 The authenticity of host 'github.com (140.82.113.4)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,140.82.113.4' (RSA) to the list of known hosts.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

solo installation issue

Hi,

Thank you so much for creating this fantastic tool! I have previously successfully installed solo and got the pipeline running nicely. But later, I probably installed some other packages and messed up the conda environment, and I got error messages when I tried to run solo. So I decided to remove the conda environment and reinstall solo. I believe I followed everything I did before: conda create -n solo python=3.6 && conda activate solo && pip install -e .
This time I got error messages I had never seen before, no matter how many times I tried. And after googling, I have no idea how to solve it. Thank you so much for any help!

Here is the screenshot of the error messages:
ERROR: Command errored out with exit status 1:
command: /home/yi.ding/anaconda3/envs/solo/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/yi.ding/solo/setup.py'"'"'; file='"'"'/home/yi.ding/solo/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info
cwd: /home/yi.ding/solo/
Complete output (112 lines):
Traceback (most recent call last):
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/errors.py", line 662, in new_error_context
yield
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 258, in lower_block
self.lower_inst(inst)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 301, in lower_inst
val = self.lower_assign(ty, inst)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 459, in lower_assign
return self.lower_expr(ty, value)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 919, in lower_expr
res = self.lower_call(resty, expr)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 711, in lower_call
res = self._lower_call_normal(fnty, expr, signature)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 890, in _lower_call_normal
res = impl(self.builder, argvals, self.loc)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/targets/base.py", line 1132, in call
res = self._imp(self._context, builder, self._sig, args, loc=loc)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/targets/base.py", line 1157, in wrapper
return fn(*args, **kwargs)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/targets/arrayobj.py", line 3375, in numpy_zeros_nd
_zero_fill_array(context, builder, ary)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/targets/arrayobj.py", line 3305, in _zero_fill_array
cgutils.memset(builder, ary.data, builder.mul(ary.itemsize, ary.nitems), 0)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/cgutils.py", line 866, in memset
builder.call(fn, [ptr, value, size, int32_t(0), bool_t(0)])
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/llvmlite/ir/builder.py", line 841, in call
cconv=cconv, tail=tail, fastmath=fastmath)
File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/llvmlite/ir/instructions.py", line 84, in init
raise TypeError(msg)
TypeError: Type of #4 arg mismatch: i1 != i32

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yi.ding/solo/setup.py", line 9, in <module>
    from solo import __author__, __email__
  File "/home/yi.ding/solo/solo/__init__.py", line 5, in <module>
    from . import hashsolo, utils
  File "/home/yi.ding/solo/solo/hashsolo.py", line 10, in <module>
    import scanpy as sc
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/scanpy/__init__.py", line 27, in <module>
    check_versions()
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/scanpy/utils.py", line 33, in check_versions
    import anndata, umap
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/umap/__init__.py", line 1, in <module>
    from .umap_ import UMAP
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/umap/umap_.py", line 23, in <module>
    import umap.sparse as sparse
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/umap/sparse.py", line 9, in <module>
    from umap.utils import (
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/umap/utils.py", line 106, in <module>
    @numba.njit("f8[:, :, :](i8,i8)")
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/decorators.py", line 186, in wrapper
    disp.compile(sig)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/dispatcher.py", line 693, in compile
    cres = self._compiler.compile(args, return_type)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/dispatcher.py", line 76, in compile
    status, retval = self._compile_cached(args, return_type)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/dispatcher.py", line 90, in _compile_cached
    retval = self._compile_core(args, return_type)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/dispatcher.py", line 108, in _compile_core
    pipeline_class=self.pipeline_class)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 972, in compile_extra
    return pipeline.compile_extra(func)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 390, in compile_extra
    return self._compile_bytecode()
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 903, in _compile_bytecode
    return self._compile_core()
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 890, in _compile_core
    res = pm.run(self.status)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 266, in run
    raise patched_exception
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 257, in run
    stage()
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 764, in stage_nopython_backend
    self._backend(lowerfn, objectmode=False)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 703, in _backend
    lowered = lowerfn()
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 690, in backend_nopython_mode
    self.metadata)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/compiler.py", line 1143, in native_lowering_stage
    lower.lower()
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 177, in lower
    self.lower_normal_function(self.fndesc)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 218, in lower_normal_function
    entry_block_tail = self.lower_function_body()
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 243, in lower_function_body
    self.lower_block(block)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/lowering.py", line 258, in lower_block
    self.lower_inst(inst)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/errors.py", line 670, in new_error_context
    six.reraise(type(newerr), newerr, tb)
  File "/home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/numba/six.py", line 659, in reraise
    raise value
numba.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
Type of #4 arg mismatch: i1 != i32

File "../anaconda3/envs/solo/lib/python3.6/site-packages/umap/utils.py", line 129:
def make_heap(n_points, size):
    <source elided>
    """
    result = np.zeros((3, int(n_points), int(size)), dtype=np.float64)
    ^

[1] During: lowering "$0.14 = call $0.2($0.10, func=$0.2, args=[Var($0.10, /home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/umap/utils.py (129))], kws=[('dtype', Var($0.12, /home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/umap/utils.py (129)))], vararg=None)" at /home/yi.ding/anaconda3/envs/solo/lib/python3.6/site-packages/umap/utils.py (129)
----------------------------------------

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

sklearn.metrics error

Hi,

during creation of a new conda environment and install of solo by:
conda create -n solo python=3.7 && conda activate solo && pip install solo-sc
I got the following error:
ERROR: umap-learn 0.4.3 has requirement numba!=0.47,>=0.46, but you'll have numba 0.45.0 which is incompatible.
ERROR: scvi 0.6.0 has requirement llvmlite==0.30.0, but you'll have llvmlite 0.32.1 which is incompatible.

I manually install the required versions by:
conda install cython numba=0.49.1 llvmlite=0.30.0

now the install of solo-sc works without errors, but when I try to run it I get the error:
Traceback (most recent call last):
File "/home/sguenth/.conda/envs/solo/bin/solo", line 5, in <module>
from solo.solo import main
File "/home/sguenth/.conda/envs/solo/lib/python3.7/site-packages/solo/__init__.py", line 5, in <module>
from . import hashsolo, utils
File "/home/sguenth/.conda/envs/solo/lib/python3.7/site-packages/solo/hashsolo.py", line 13, in <module>
from sklearn.metrics import calinski_harabaz_score
ImportError: cannot import name 'calinski_harabaz_score' from 'sklearn.metrics' (/home/sguenth/.conda/envs/solo/lib/python3.7/site-packages/sklearn/metrics/__init__.py)

Could you please help out. Thanks in advance.

Hashsolo failing when only two HTOs are present?

Hi,

Thank you for the tool - it's very useful for reprocessing cell hashing experiments! However, I've come across a strange issue which I think comes down to how hashsolo estimates noise. When the multiplexed sample has only 2 HTOs, hashsolo invariably fails - generating lots of NaNs and assigning all cells to either "negative" or "doublet", while the HTO distribution clearly suggests otherwise.

Is this expected? I saw this comment in the code and thought that's probably what's causing the issue:

Noise distributions for a hashing barcode are estimated from samples where that hashing barcode is one of the k-2 lowest barcodes, where k is the number of barcodes.

Perhaps it's something more basic though. The manual/README doesn't seem to mention this - is the user expected to provide a matrix of raw HTO counts, or do they need to be normalized/filtered in any way?

Would be thankful for any comment/feedback.

Problem with solo - PyTorch Lightning

Hi everyone!

I've been facing an error for some time. When I first installed solo 1.3, PyTorch Lightning 1.3.1 was installed by default. With this configuration, it was impossible to even call solo on the terminal (I'm working on macOS Monterey 12.4).
The error was the following one:

File "/opt/anaconda3/envs/demul/bin/solo", line 5, in <module>
    from solo.solo import main
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/solo/solo.py", line 15, in <module>
    from pytorch_lightning.callbacks.early_stopping import EarlyStopping
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/metrics/utils.py", line 22, in <module>
    from torchmetrics.utilities.data import get_num_classes as _get_num_classes

This error is linked to the version of PyTorch Lightning, so updating that library to 1.3.8 or higher was a solution.
Anyway, after installing PyTorch Lightning 1.3.8 (and later 1.7.3), calling the tool on the terminal is possible, but it produces the following error:

Global seed set to 0
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Cuda is not available, switching to cpu running!
Global seed set to 2732
/opt/anaconda3/envs/demul/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function setup_anndata is deprecated; Please use the model-specific setup_anndata methods instead. The global method will be removed in version 0.15.0.
  warnings.warn(msg, category=FutureWarning)
INFO     No batch_key inputted, assuming all cells are same batch                                                                                                                        
INFO     No label_key inputted, assuming all cells have same label                                                                                                                       
INFO     Using data from adata.X                                                                                                                                                         
INFO     Successfully registered anndata object containing 4889 cells, 33694 vars, 1 batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra       
         continuous covariates.                                                                                                                                                          
INFO     Please do not further modify adata until model is trained.                                                                                                                      
/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.LightningLoggerBase` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.Logger` instead.
  return new_rank_zero_deprecation(*args, **kwargs)
Traceback (most recent call last):
  File "/opt/anaconda3/envs/demul/bin/solo", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/solo/solo.py", line 251, in main
    vae.train(
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/scvi/model/base/_training_mixin.py", line 69, in train
    runner = TrainRunner(
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/scvi/train/_trainrunner.py", line 66, in __init__
    self.trainer = Trainer(max_epochs=max_epochs, gpus=gpus, **trainer_kwargs)
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/scvi/train/_trainer.py", line 138, in __init__
    super().__init__(
  File "/opt/anaconda3/envs/demul/lib/python3.9/site-packages/pytorch_lightning/utilities/argparse.py", line 345, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'flush_logs_every_n_steps'

This error seems to be related to PyTorch Lightning. I called the tool with 2 different datasets and got similar errors, both pointing PyTorch Lightning as the reason.
Does anyone else have this error? Any help would be very appreciated.
Thank you very much in advance!

2c.h5ad doesn't contain gene names

I was trying to use the h5ad data files collected by the paper. However, that 2c.h5ad doesn't contain gene names - nor does it contain cell type labels.
Is there any chance you can provide complete h5ad data files you used for doublet detection demonstration?

Thanks!

Running error: "Resource temporarily unavailable"

Hi, I'm having the following error for some of my samples:

file sample1.loom ; output folder solo_results/sample1
[2021-05-27 13:31:32,706] INFO - scvi._settings | 'scvi' logger already has a StreamHandler, set its level to 10.
Cuda is not available, switching to cpu running!
[2021-05-27 13:31:32,708] INFO - scvi.dataset.loom | Preprocessing dataset

Traceback (most recent call last):
  File "/shared/matiasfa/miniconda2/envs/solo/bin/solo", line 8, in <module>
    sys.exit(main())
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/solo/solo.py", line 114, in main
    scvi_data = LoomDataset(data_path)
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/scvi/dataset/loom.py", line 66, in __init__
    delayed_populating=delayed_populating,
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/scvi/dataset/dataset.py", line 2026, in __init__
    self.populate()
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/scvi/dataset/loom.py", line 88, in populate
    ds = loompy.connect(os.path.join(self.save_path, self.filenames[0]))
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/loompy/loompy.py", line 1389, in connect
    return LoomConnection(filename, mode, validate=validate)
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/loompy/loompy.py", line 81, in __init__
    if not lv.validate(filename):
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/loompy/loom_validator.py", line 48, in validate
    with h5py.File(path, mode="r") as f:
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/shared/matiasfa/miniconda2/envs/solo/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

The samples are filtered and have similar numbers of genes and cells as other samples that did run. Any suggestions?

Underestimation of doublets with data set that contains a lot of proliferative cells

Dear Solo team,

your tool is great and easy to use, thank you!
Still, I have a question: I am running several 10x samples in order to identify doublets, and I observed that if my data set consists mainly of cells that are in G2M or S phase (cell cycle classification according to Seurat), the number of detected doublets is extremely underestimated. E.g.: in a data set with 10^4 cells where more than 50% of the cells are classified as being in G2M/S phase, fewer than 100 cells are found to be doublets. With my other data, where less than 30% is in G2M/S phase, the number of detected doublets is similar to the expected one.

Do you have any idea how we could manage to find the doublets in data sets where cells are proliferating?

Best,

Lena

Upload on conda/PyPI

Hi,

I had a look at your preprint and it really looks promising!

It would be cool if you could upload the package on bioconda and/or PyPI.
This makes it easier to define it as a dependency in a workflow (e.g. through a conda yml file) compared to installing it from the git repository.

Best,
Gregor

Error on new solo version

Hello,

I updated to the newest version on github as well as scvi-tools.
However when I start it fails on the first step:

Min cell depth: 500.0, Max cell depth: 40136.0                                                                                                                                                                   
INFO     No batch_key inputted, assuming all cells are same batch                                                                                                                                                
INFO     No label_key inputted, assuming all cells have same label                                                                                                                                               
INFO     Using data from adata.X                                                                                                                                                                                 
INFO     Computing library size prior per batch                                                                                                                                                                  
INFO     Successfully registered anndata object containing 14643 cells, 36601 vars, 1 batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra continuous covariates.       
INFO     Please do not further modify adata until model is trained.                                                                                                                                              
GPU available: True, used: True                                                                                                                                                                                  
TPU available: False, using: 0 TPU cores                                                                                                                                                                         
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]                                                                                                                                                                        
Epoch 1/2000:   0%|                                                                                                                                   | 1/2000 
[00:02<1:34:48,  2.85s/it, loss=8.43e+03, v_num=1]
...
  File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.train_loop.run_training_epoch()
  File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
    self.trainer.optimizer_connector.update_learning_rates(interval='epoch')
  File "/home/chang/miniconda3/envs/venv/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/optimizer_connector.py", line 66, in update_learning_rates
    f'ReduceLROnPlateau conditioned on metric {monitor_key}'
pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric reconstruction_loss_validation which is not available. 
Available metrics are: ['train_loss_step', 'train_loss_epoch', 'train_loss', 'elbo_train', 'reconstruction_loss_train', 'kl_local_train', 'kl_global_train']. 
Condition can be set using `monitor` key in lr scheduler dict
Epoch 1/2000:   0%| 

Any pre-trained model

Hi, I am wondering if there's any pre-trained model to start with. And I also would like to know if there's any open dataset in h5ad format so that I can run directly with your code.

Thank you in advance.

Allow user to change interval at which validation loss is checked

Currently the scVI model is trained with the following invocation, in which the check_val_every_n_epoch argument is hardcoded to 5:

        vae.train(
            max_epochs=2000,
            validation_size=valid_pct,
            check_val_every_n_epoch=5,
            plan_kwargs=plan_kwargs,
            callbacks=scvi_callbacks,
        )

We'd like to allow users to specify this value using the Solo command line tool.

There are two ways this could be addressed: via a command line argument, or via the model_json_file which specifies model parameters. In this case we will use the model_json_file to allow users to change this argument for training.

The example json file can be seen here: https://github.com/calico/solo/blob/master/solo_params_example.json

Dataset10x Failing to Load

Hello,

I'm not sure the best location to put this issue - it arises when using the solo package but I'm fairly certain that the issue lies with the scvi package. I am getting the following output with the error:


[2020-11-16 15:18:18,195] INFO - scvi._settings | 'scvi' logger already has a StreamHandler, set its level to 10.
Cuda is not available, switching to cpu running!
[2020-11-16 15:18:18,202] DEBUG - scvi.dataset.dataset10X | Loading extracted local 10X dataset with custom filename
[2020-11-16 15:18:18,202] INFO - scvi.dataset.dataset10X | Preprocessing dataset
/opt/conda/envs/py36/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /opt/conda/conda-bld/pytorch_1603729021865/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py36/bin/solo", line 33, in <module>
    sys.exit(load_entry_point('solo-sc', 'console_scripts', 'solo')())
  File "/opt/solo/solo/solo.py", line 123, in main
    dense=True)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset10X.py", line 156, in __init__
    delayed_populating=delayed_populating,
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset.py", line 2026, in __init__
    self.populate()
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/dataset/dataset10X.py", line 196, in populate
    gene_names = measurements_info[self.measurement_names_column].astype(np.str)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 1

I can replicate this error when in python after importing scvi and trying to load the dataset with Dataset10X but can't identify why I am receiving this error.

The dataset's barcodes are letters followed by a dash, a number, and sometimes another letter, e.g.:

AACCGCGGTTGGTTTG-16
AACTCAGCACGGTAAG-16
ACCTTTACAACAACCT-5
ACGCAGCCAATGAAAC-9
ACGGAGAGTCAGATAA-9
ACGGGCTGTTTACTCT-14
ACTGATGTCTTGCAAG-4
ACTTTCAGTCTCTTTA-9
AGGTCATCAAACAACA-4D

and the files were produced with umitools_to_mtx from the R scrunchy package. I have a feeling the main problem is somehow related to the fact that these files were not directly produced by the 10x cellranger pipeline. Here's the top of the matrix.mtx file:

%%MatrixMarket matrix coordinate integer general
%
20469 15266 4765018
1 1 1
90 1 1
129 1 1
169 1 1
170 1 13
245 1 1

I'll do some more digging, but I would love to hear if you have any recommendations or input.

Thanks!
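One possible workaround, sketched under the assumption that the directory contains the usual matrix.mtx, genes.tsv, and barcodes.tsv trio (the file names and column layout are assumptions, not something the report confirms), is to assemble the AnnData yourself and hand solo the resulting h5ad instead of the 10x directory:

import anndata as ad
import pandas as pd
from scipy.io import mmread

# genes x cells on disk -> transpose to the cells x genes orientation solo expects
counts = mmread("matrix.mtx").T.tocsr()
genes = pd.read_csv("genes.tsv", sep="\t", header=None)
barcodes = pd.read_csv("barcodes.tsv", sep="\t", header=None)

adata = ad.AnnData(X=counts)
adata.var_names = genes.iloc[:, -1].astype(str).values   # last column as gene names
adata.obs_names = barcodes.iloc[:, 0].astype(str).values
adata.write("counts.h5ad")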

Have solo consume 10x directory directly

Have solo consume a 10x directory directly. scVI now handles this for us, so it should not be brittle.

Rough sketch:
- use the scVI Dataset10X loader
- provide some light feedback to users about whether they should further QC the data (see the sketch below)
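A minimal sketch of what that could look like, assuming scanpy's read_10x_mtx loader and an arbitrary 500-count threshold (both are illustrative choices, not solo's actual implementation):

import scanpy as sc

# load a cellranger output directory (path is a placeholder)
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/", var_names="gene_symbols")
adata.var_names_make_unique()

# light QC feedback before doublet calling
low_depth = int((adata.X.sum(axis=1) < 500).sum())
print(f"{low_depth} of {adata.n_obs} barcodes have fewer than 500 counts; "
      "consider filtering them before running solo.")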

ValueError: The parameter mu has invalid values

Hello solo developers,

I am receiving a strange error from solo for one of my pools. I have run the same code for 72 other pools without issue. However, for this pool, I have tried running it 8 times and receive the same error each time: ValueError: The parameter mu has invalid values. This is running from a 10x directory, but the data should be pretty good quality (i.e. no empty droplets, a similar number of UMIs and counts to the other pools, low mt%). I have included the full output below. Any idea what might be going wrong here?

[2020-08-07 10:21:06,357] INFO - scvi._settings | 'scvi' logger already has a StreamHandler, set its level to 10.
Cuda is not available, switching to cpu running!
[2020-08-07 10:21:06,359] DEBUG - scvi.dataset.dataset10X | Loading extracted local 10X dataset with custom filename
[2020-08-07 10:21:06,359] INFO - scvi.dataset.dataset10X | Preprocessing dataset
[2020-08-07 10:21:25,885] INFO - scvi.dataset.dataset10X | Finished preprocessing dataset
[2020-08-07 10:21:28,177] WARNING - scvi.dataset.dataset | Gene names are not unique.
[2020-08-07 10:21:28,177] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-08-07 10:21:28,178] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-08-07 10:21:33,886] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-08-07 10:21:34,948] INFO - scvi.dataset.dataset | Downsampled from 17734 to 17734 cells
Min cell depth: 809.0, Max cell depth: 82986.0
[2020-08-07 10:21:35,633] DEBUG - scvi.inference.trainer | 
EPOCH [0/2000]: 
[2020-08-07 10:21:35,633] DEBUG - scvi.inference.trainer | Train Set
[2020-08-07 10:22:01,487] DEBUG - scvi.inference.posterior | ELBO : 28455.5914
[2020-08-07 10:22:26,058] DEBUG - scvi.inference.posterior | Reconstruction Error : 28385.3297
[2020-08-07 10:22:26,059] DEBUG - scvi.inference.trainer | Test Set
[2020-08-07 10:22:28,805] DEBUG - scvi.inference.posterior | ELBO : 27671.6659
[2020-08-07 10:22:31,538] DEBUG - scvi.inference.posterior | Reconstruction Error : 27413.0038
[2020-08-07 10:22:31,538] INFO - scvi.inference.inference | KL warmup for 400 epochs
training:   0%|          | 1/2000 [01:14<41:11:32, 74.18s/it][2020-08-07 10:24:58,636] DEBUG - scvi.inference.trainer | 
EPOCH [2/2000]: 
[2020-08-07 10:24:58,636] DEBUG - scvi.inference.trainer | Train Set
[2020-08-07 10:25:22,855] DEBUG - scvi.inference.posterior | ELBO : 8610.9732
[2020-08-07 10:25:46,947] DEBUG - scvi.inference.posterior | Reconstruction Error : 8240.9735
[2020-08-07 10:25:46,948] DEBUG - scvi.inference.trainer | Test Set
[2020-08-07 10:25:49,655] DEBUG - scvi.inference.posterior | ELBO : 7807.6187
[2020-08-07 10:25:52,310] DEBUG - scvi.inference.posterior | Reconstruction Error : 7437.9775
training:   0%|          | 3/2000 [04:33<47:02:30, 84.80s/it][2020-08-07 10:28:18,850] DEBUG - scvi.inference.trainer | 
EPOCH [4/2000]: 
[2020-08-07 10:28:18,855] DEBUG - scvi.inference.trainer | Train Set
training:   0%|          | 3/2000 [05:56<65:58:41, 118.94s/it]
Traceback (most recent call last):
  File "/opt/conda/envs/py36/bin/solo", line 11, in <module>
    load_entry_point('solo-sc', 'console_scripts', 'solo')()
  File "/opt/solo/solo/solo.py", line 232, in main
    utrainer.train(n_epochs=2000, lr=learning_rate)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/inference/trainer.py", line 182, in train
    if not self.on_epoch_end():
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/inference/trainer.py", line 224, in on_epoch_end
    self.compute_metrics()
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/inference/trainer.py", line 137, in compute_metrics
    result = getattr(posterior, metric)()
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/inference/posterior.py", line 251, in elbo
    elbo = compute_elbo(self.model, self)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/models/log_likelihood.py", line 33, in compute_elbo
    **kwargs
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/models/vae.py", line 316, in forward
    reconst_loss = self.get_reconstruction_loss(x, px_rate, px_r, px_dropout)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/models/vae.py", line 219, in get_reconstruction_loss
    -NegativeBinomial(mu=px_rate, theta=px_r).log_prob(x).sum(dim=-1)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/scvi/models/distributions.py", line 95, in __init__
    super().__init__(validate_args=validate_args)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/distributions/distribution.py", line 36, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter mu has invalid values

Preprocessing Error - IndexError: tuple index out of range

Dear solo team,

First of all, thanks for your great tool! I have the following small problem: if I run solo (version 0.6) on CellRanger output, everything works fine. If I run it on one of the other supported formats, e.g. .loom or .h5ad, then I always get the same error shortly after the start: "IndexError: tuple index out of range", coming from the scvi/dataset package (version 0.6.5). It seems to have a problem with the gene names... My input files were created using scanpy (version 1.4.6), for example like adata.write('data.h5ad', force_dense=True).

Since the solo results on the unfiltered CellRanger output for most of my samples are actually not correct (it tremendously underestimates the number of doublets, maybe because cell cycle plays a large role in my samples?), I would really like to try it on filtered cells. Do you have any ideas regarding the aforementioned problem?

Thanks for your help!

Best,

Jonas
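A guess at a workaround, assuming the loader is tripping over non-string or non-unique gene names (this is speculation based on the report above, not a confirmed fix):

import scanpy as sc

adata = sc.read_h5ad("data.h5ad")              # placeholder path
adata.var_names = adata.var_names.astype(str)  # ensure gene names are plain strings
adata.var_names_make_unique()                  # de-duplicate repeated symbols
adata.write("data.clean.h5ad")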

Running solo on a sample as a batch job

Hello,
I'm sure this is a Python question rather than a solo question, but I'm going to ask it anyway because I can't find the answer online.
I'm on a SLURM computing cluster and was able to install everything:

conda create -n solo python=3.7 && conda activate solo && pip install solo-sc
conda init bash
exec bash
conda activate solo

Once the environment is activated, I was able to run solo (this example has the file names removed):
solo a b -o c

It takes a while, so I would like to submit these jobs to my computing cluster instead of running them interactively.
I wrote a bash script that looks like this:

#!/bin/bash
input1=$1
input2=$2
input3=$3

module load conda2/4.2.13

eval "$(conda shell.bash hook)"
conda activate solo

python python_solo_script.py $input1 $input2 $input3

The Python script looks like this:
#!/usr/bin/env python3

import sys

input1 = sys.argv[1]
input2 = sys.argv[2]
input3 = sys.argv[3]

solo %input1 %input2 -o %input3

The error that I get is:

Traceback (most recent call last):
  File "python_solo_script.py", line 10, in <module>
    solo %input1 %input2 -o %input3
NameError: name 'solo' is not defined

So it looks like solo doesn't load. Would you be able to help me?
Thanks
Lauren
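The short answer is that solo is a shell command, not a Python name, so it cannot appear as a bare statement inside a .py file. Two hedged options: put the solo invocation directly in the bash script, or wrap it in a subprocess call. A minimal sketch of the latter, with flag names taken from the CLI usage above and file names as placeholders:

#!/usr/bin/env python3
# run the solo CLI from Python via subprocess
import subprocess
import sys

model_json, data_path, out_dir = sys.argv[1:4]
subprocess.run(["solo", "-j", model_json, "-d", data_path, "-o", out_dir], check=True)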

Solo encountering Nan values with 10x data

I'm getting this error while running solo on one (but not all) of my samples:

ValueError: Expected parameter loc (Tensor of shape (128, 64)) of distribution Normal(loc: torch.Size([128, 64]), scale: torch.Size([128, 64])) to satisfy the constraint Real(), but found invalid values: tensor([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0', grad_fn=<AddmmBackward0>)

Using this as my model json file, if helpful:
{ "n_hidden": 384, "n_latent": 64, "n_layers": 1, "cl_hidden": 128, "cl_layers": 1, "dropout_rate": 0.2, "learning_rate": 0.001, "valid_pct": 0.10 }

Interestingly, if I filter out cells with < 1000 UMI, this error goes away. Weirdly, I have several other samples where this does not appear to be a problem at all (regardless of filtering).

Any thoughts on how to resolve this without applying the UMI cutoff? Thanks!
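For reference, the reporter's own workaround expressed as a short scanpy sketch (the path is a placeholder and the 1,000-UMI threshold is the one quoted above); lowering the VAE learning rate via the --lr_vae flag is another knob worth trying, though neither is a guaranteed fix:

import scanpy as sc

adata = sc.read_h5ad("sample.h5ad")           # placeholder path
sc.pp.filter_cells(adata, min_counts=1000)    # drop barcodes with fewer than 1,000 UMIs
adata.write("sample.filtered.h5ad")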

Running problem

Hi,

I followed the instructions to run solo but got the following error:

(solo) nxi@server1-X10DAx:/media/nxi/nxi/doublet/real_data$ solo solo.jason, cline-ch.h5ad
[2020-01-15 20:58:03,502] INFO - scvi._settings | Added StreamHandler with custom formatter to 'scvi' logger.
/home/nxi/anaconda3/envs/solo/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d
[2020-01-15 20:58:05,545] INFO - scvi._settings | 'scvi' logger already has a StreamHandler, set its level to 10.
Traceback (most recent call last):
  File "/home/nxi/anaconda3/envs/solo/bin/solo", line 8, in <module>
    sys.exit(main())
  File "/home/nxi/anaconda3/envs/solo/lib/python3.6/site-packages/solo/solo.py", line 108, in main
    scvi_data = AnnDatasetFromAnnData(adata)
  File "/home/nxi/anaconda3/envs/solo/lib/python3.6/site-packages/scvi/dataset/anndataset.py", line 40, in __init__
    cell_types=cell_types,
  File "/home/nxi/anaconda3/envs/solo/lib/python3.6/site-packages/scvi/dataset/dataset.py", line 147, in populate_from_data
    else np.zeros((X.shape[0], 1)),
AttributeError: 'NoneType' object has no attribute 'shape'

Could you help resolve it? Thank you!

Feedback on compatibility with scvi_0.6.0

I tried to install solo with scvi_0.6.0, and it is mostly compatible except for line 124 in solo.py. scvi_0.6.0 expects toarray() (numpy.ndarray) instead of todense() (numpy.matrix). This might or might not need a fix, since the solo installation currently only depends on scvi_0.4.0.
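A self-contained sketch of the tweak described above; the toy AnnData here just stands in for whatever solo.py holds at that point:

import anndata as ad
import numpy as np
from scipy import sparse

# toy data standing in for the real count matrix
adata = ad.AnnData(X=sparse.csr_matrix(np.eye(3)))

# newer scvi releases expect a numpy.ndarray (toarray) rather than a numpy.matrix (todense)
X = adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X)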

python 3.6 incompatibility

Hi there,

I am writing to report an incompatibility with Python 3.6, even though PyPI indicates that solo-sc v1.0 should be compatible with Python 3.6 and the clone + ve instructions still use Python 3.6 (git clone [email protected]:calico/solo.git && cd solo && conda create -n solo python=3.6 && conda activate solo && pip install -e .). This is actually due to an incompatibility between scvi-tools and Python 3.6 that returns the following error:

Traceback (most recent call last):
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/solo_soup_36a/bin/solo", line 33, in <module>
    sys.exit(load_entry_point('solo-sc', 'console_scripts', 'solo')())
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/solo_soup_36a/bin/solo", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/solo_soup_36a/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 105, in load
    module = import_module(match.group('module'))
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/solo_soup_36a/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/drenea/solo/solo/solo.py", line 20, in <module>
    from scvi.external import SOLO
ModuleNotFoundError: No module named 'scvi.external'

This seems to have been resolved in scvi-tools v0.9.1. However, that requires python >= 3.7.

I've been trying to install on 3.6 due to the requirements of other software I want in the same image. I wanted to post here to ask for an update to the documentation, and in case others are facing a similar error.

h5ad for HashSolo & solo .pdf plots output

Hi,

I'd like to run HashSolo on an experiment, but am not sure about the structure/specification of the h5ad input file...

[screenshot of the h5ad structure attached in the original issue]

Is there anything beyond reading it into an anndata object & writing out the h5?

Also, I ran solo on the (a?) PBMC dataset, but the 4 .pdf plot files did not appear in the solo_out directory... Should they be produced automatically?

Thanks much!
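On the first question, a toy sketch of the anndata-and-write route; the HTO column names and the assumption that hashing counts live in .obs are illustrative guesses, so check the HashSolo documentation for the expected layout:

import anndata as ad
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 5 cells x 3 genes of toy counts, plus two hypothetical hashtag-count columns in .obs
X = rng.poisson(2.0, size=(5, 3)).astype(np.float32)
obs = pd.DataFrame(
    {"HTO_A": rng.poisson(50, 5), "HTO_B": rng.poisson(50, 5)},
    index=[f"cell{i}" for i in range(5)],
)

adata = ad.AnnData(X=X, obs=obs)
adata.write("hashing_experiment.h5ad")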

3K PBMC & AnnData Output

I have two questions:

1. I tried solo on the 3k PBMCs from a Healthy Donor dataset (available from 10x), and I got about 313 doublets out of 2,828 cells (I just counted the number of True values in is_doublet.npy). I thought that was a bit much, but I'm curious whether you have tried it before, and if so, does this number sound right to you?

2. I supplied -a to get an AnnData output, but when I tried to load the generated .h5ad file, I got this error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-27-ac5b857875b6> in <module>
----> 1 adata = sc.read_h5ad(os.path.join(path_data, "out/pbmc/soloed.h5ad"))

~/miniconda/envs/notebook/lib/python3.6/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed, chunk_size)
    445     else:
    446         # load everything into memory
--> 447         constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
    448         X = constructor_args[0]
    449         dtype = None

~/miniconda/envs/notebook/lib/python3.6/site-packages/anndata/readwrite/read.py in _read_args_from_h5ad(adata, filename, mode, chunk_size)
    500     if not backed:
    501         f.close()
--> 502     return AnnData._args_from_dict(d)
    503 
    504 

~/miniconda/envs/notebook/lib/python3.6/site-packages/anndata/core/anndata.py in _args_from_dict(ddata)
   2155             if d_true_keys[true_key] is not None:
   2156                 for key in keys:
-> 2157                     if key in d_true_keys[true_key].dtype.names:
   2158                         d_true_keys[true_key] = pd.DataFrame.from_records(
   2159                             d_true_keys[true_key], index=key)

AttributeError: 'dict' object has no attribute 'dtype'

Thanks!

Newbie questions: warning message, model parameters, and outputs

Hi, I've used solo a few times now and am really appreciating how user-friendly it is. It seems to work really well on my dataset.

  1. Every time I run solo, I get the following warning:
    "UserWarning: Make sure the registered X field in anndata contains unnormalized count data."
    I want to confirm that this is a normal warning that shows up with scvi-tools and that I'm not messing something up. Before running Solo, I've been removing ambient RNA and empty droplets from my dataset using CellBender, then doing some subsetting in Seurat to remove droplets with aggressively low or high counts, but that's all.
    I found the warning in this vignette on the scvi-tools website: https://docs.scvi-tools.org/en/0.13.0/user_guide/notebooks/scarches_scvi_tools.html so my gut says it's OK, but I figured I'd check since I'm new at all this.

  2. This relates to the model parameters:
    On the README, the example parameters are:
    {
    "n_hidden": 384,
    "n_latent": 64,
    "n_layers": 1,
    "cl_hidden": 128,
    "cl_layers": 1,
    "dropout_rate": 0.2,
    "learning_rate": 0.001,
    "valid_pct": 0.10
    }
    But the model.json file included with Solo has:
    {
    "n_hidden": 128,
    "n_latent": 16,
    "cl_hidden": 64,
    "cl_layers": 1,
    "dropout_rate": 0.1,
    "learning_rate": 0.001,
    "valid_pct": 0.10
    }
    Which of these should I use for regular snRNA-seq data? Is one of these examples intended for use with demultiplexing/HashSolo?

  3. I wanted to make sure that I should go by the is_doublet.csv binary predictions, and not worry about the preds.npy files etc.
    My data are from muscle nuclei and when I create a FeaturePlot for a particular muscle marker, many of the non-muscle cells that express this marker are categorized as doublets in is_doublet.csv, so the results seem reasonable.
    I got some example code from a colleague for adding solo calls to Seurat metadata; they use Rcpp to import the preds.npy output into R, which doesn't work well for me. It's something to do with my newer versions of either Rcpp or solo: he gets a consistent number string for the binary "T" and 0 for "F" when importing the preds.npy file into R, while I get more than two different number strings when I do this. When I open my preds.npy output in Python it IS binary and matches what I see in is_doublet.csv. I'm happy to cut out the Rcpp middleman and just use is_doublet.csv (a CSV-export sketch follows after this list), but I wanted to make sure that this is correct.
    I just started using solo in Dec2021/Jan2022 so I suspect the other person wrote the code for data generated with an older version of solo, given what you mentioned in #62.
    However, it's been several months since that issue was resolved so I wanted to make sure that I'm using the correct output files.

Thank you so much for your time!!
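On point 3, a hedged sketch of the Rcpp-free route mentioned above: pair solo's binary calls with the input barcodes in Python and write a plain CSV that can be joined to Seurat metadata. It assumes the row order of is_doublet.npy matches the cell order of the input h5ad; file names are placeholders:

import numpy as np
import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("input.h5ad")               # placeholder path to the solo input
is_doublet = np.load("solo_out/is_doublet.npy")  # one boolean per cell

pd.DataFrame(
    {"barcode": adata.obs_names.to_numpy(), "solo_is_doublet": is_doublet.astype(bool)}
).to_csv("solo_is_doublet.csv", index=False)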

limited reproducibility

Hi,

I applied solo to an immune cell dataset and it appears to work quite well.
However, I get (slightly) different results every time I run solo, which is a bit annoying as this also changes the downstream clustering step.

Is there any way to make solo's results reproducible (such as setting a seed)? I would like to be able to share my entire analysis as a reproducible pipeline in the end.

Best,
Gregor
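For what it's worth, the CLI exposes a --set-reproducible-seed flag for exactly this; a minimal invocation might look like the line below, with file names as placeholders. Note that exact reproducibility can still depend on hardware and library versions.

solo -j model.json -d input.h5ad -o solo_out --set-reproducible-seed 42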
