slowkow / harmonypy Goto Github PK

View Code? Open in Web Editor NEW

177.0 6.0 23.0 2.83 MB

🎼 Integrate multiple high-dimensional datasets with fuzzy k-means and locally linear adjustments.

Home Page: https://portals.broadinstitute.org/harmony/

License: GNU General Public License v3.0

Python 99.15% Makefile 0.85%

data-science data-integration bioinformatics single-cell-analysis

harmonypy's People

Stargazers

Watchers

harmonypy's Issues

Question re variance explained from harmony-corrected components

Hi there,
thanks for implementing harmony for python - worked fine for me! I was wondering if there was any more information about how to get the variance explained for each harmony components (similar to sd in Seurat objects)? This issue was solved for the R version here, but I was wondering if there is a way to get this information from the python output object? Sorry if this is obvious but I couldn't find details about the output object or how to get this from the output.
Thanks a lot,
Kathrin

Feature request: plot convergence

Can you make the parameter "plot_convergence" available?

feature request:

Hey! Thank you so much for porting harmony into python -- it's super useful.

I had one feature request that I wanted to make, in case others found it useful: Is there any chance you could include support for multiple covariates? It would make harmonypy even more useful and powerful, but I haven't been able to find out whether this is already supported as is by the current program base.

Thank you again for this wonderful contribution!
David

Build changes have broken Conda install

The conda package build no longer works with the latest release, CI produces errors like:

...
09:54:24 BIOCONDA INFO (OUT) Processing $SRC_DIR
09:54:24 BIOCONDA INFO (OUT)   Added file://$SRC_DIR to build tracker '/tmp/pip-build-tracker-xqauy_ow'
09:54:24 BIOCONDA INFO (OUT)   Running setup.py (path:$SRC_DIR/setup.py) egg_info for package from file://$SRC_DIR
09:54:24 BIOCONDA INFO (OUT)   Created temporary directory: /tmp/pip-pip-egg-info-9vdginkf
09:54:24 BIOCONDA INFO (OUT)   Running command python setup.py egg_info
09:54:24 BIOCONDA INFO (OUT)   Preparing metadata (setup.py): started
09:54:25 BIOCONDA INFO (OUT)   Traceback (most recent call last):
09:54:25 BIOCONDA INFO (OUT)     File "<string>", line 2, in <module>
09:54:25 BIOCONDA INFO (OUT)     File "<pip-setuptools-caller>", line 34, in <module>
09:54:25 BIOCONDA INFO (OUT)     File "/opt/conda/conda-bld/harmonypy_1669197095540/work/setup.py", line 3, in <module>
09:54:25 BIOCONDA INFO (OUT)       with open("README.md", "r") as fh:
09:54:25 BIOCONDA INFO (OUT)            ^^^^^^^^^^^^^^^^^^^^^^
09:54:25 BIOCONDA INFO (OUT)   FileNotFoundError: [Errno 2] No such file or directory: 'README.md'
09:54:25 BIOCONDA INFO (OUT)   error: subprocess-exited-with-error
09:54:25 BIOCONDA INFO (OUT) 
09:54:25 BIOCONDA INFO (OUT)   × python setup.py egg_info did not run successfully.
09:54:25 BIOCONDA INFO (OUT)   │ exit code: 1
09:54:25 BIOCONDA INFO (OUT)   ╰─> See above for output.
...

Conda installs the package from the PyPi source (not directly from here) like:

python -m pip install . --no-deps --ignore-installed -vv

I'm thinking the issues may be due to recent build changes like 472c15b, and the fact that README.md is both included and excluded here: https://github.com/slowkow/harmonypy/blob/master/MANIFEST.in.

Harmony with two or more covariates

Hello @slowkow,

Thank you for adding Harmony to Python!

May you please allow harmony to be run with multiple covariates in harmonypy as this will make many workflows easier? This can be done in the R package.

Try filprofiler

It is possible that harmonypy uses more memory than the R version of Harmony.

filprofiler might help to optimize the code to use less memory.

Reference mapping option

Hi,

when reviewing run_harmony(), I found that reference_values is available as a keyword argument. However, reference_values is not referenced anywhere in the function and also does not appear in the Harmony class. I wonder if this feature of reference mapping was implemented in harmonypy.

Thank you

run harmonypy true_divide error

When running harmonypy I got an error I have not seen before. Not sure what is wrong or how to fix it.

import harmonypy
sce.pp.harmony_integrate(adata, 'SINGLECELL_TYPE', max_iter_harmony = 10)

2021-11-09 14:10:44,301 - harmonypy - INFO - Iteration 1 of 10
/PHShome/hm604/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/harmonypy/harmony.py:295: RuntimeWarning: invalid value encountered in true_divide
  self.R[:,b] = self.R[:,b] / np.linalg.norm(self.R[:,b], ord=1, axis=0)
2021-11-09 14:14:11,826 - harmonypy - INFO - Iteration 2 of 10

It keeps running until it's done.

Then I try this:

sc.pp.neighbors(adata,  use_rep='X_pca_harmony', n_neighbors=10, n_pcs=50)

computing neighbors
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_54497/579152319.py in <module>
----> 1 sc.pp.neighbors(adata, 
      2                 use_rep='X_pca_harmony',
      3                 n_neighbors=10, n_pcs=50)
      4 sc.tl.umap(adata)
      5 sc.pl.umap(adata,color=['SINGLECELL_TYPE'])

~/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/scanpy/neighbors/__init__.py in neighbors(adata, n_neighbors, n_pcs, use_rep, knn, random_state, method, metric, metric_kwds, key_added, copy)
    137         adata._init_as_actual(adata.copy())
    138     neighbors = Neighbors(adata)
--> 139     neighbors.compute_neighbors(
    140         n_neighbors=n_neighbors,
    141         knn=knn,

~/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/scanpy/neighbors/__init__.py in compute_neighbors(self, n_neighbors, knn, n_pcs, use_rep, method, random_state, write_knn_indices, metric, metric_kwds)
    789                 X = pairwise_distances(X, metric=metric, **metric_kwds)
    790                 metric = 'precomputed'
--> 791             knn_indices, knn_distances, forest = compute_neighbors_umap(
    792                 X, n_neighbors, random_state, metric=metric, metric_kwds=metric_kwds
    793             )

~/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/scanpy/neighbors/__init__.py in compute_neighbors_umap(X, n_neighbors, random_state, metric, metric_kwds, angular, verbose)
    303     random_state = check_random_state(random_state)
    304 
--> 305     knn_indices, knn_dists, forest = nearest_neighbors(
    306         X,
    307         n_neighbors,

~/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/umap/umap_.py in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
    326         n_iters = max(5, int(round(np.log2(X.shape[0]))))
    327 
--> 328         knn_search_index = NNDescent(
    329             X,
    330             n_neighbors=n_neighbors,

~/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/pynndescent/pynndescent_.py in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, verbose)
    684         self.verbose = verbose
    685 
--> 686         data = check_array(data, dtype=np.float32, accept_sparse="csr", order="C")
    687         self._raw_data = data
    688 

~/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    790 
    791         if force_all_finite:
--> 792             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    793 
    794     if ensure_min_samples > 0:

~/.conda/envs/HM_Py3.9_try2/lib/python3.9/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    112         ):
    113             type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114             raise ValueError(
    115                 msg_err.format(
    116                     type_err, msg_dtype if msg_dtype is not None else X.dtype

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I am using:

python == 3.9
harmonypy  == 0.0.5
scanpy == 1.8.2
anndata == 0.7.6
umap == 0.5.2
numpy == 1.20.3
scipy == 1.7.1
pandas == 1.3.4
scikit-learn == 1.0.1
statsmodels == 0.13.0
pynndescent == 0.5.5

How to cite this work?

Hi,

Thank you so much for this contribution to the Python ecosystem. I've been applying your implementation of Harmony to cytometry data with some success after hyperparameter tuning and normalisation. I was wondering how best to cite your work? I had planned on citing the original Harmony paper but would also like to give credit to your library.

ValueError: operands could not be broadcast together with shapes

Hi there,
i ran harmonypy via scanpy external api on 12 different data sets and it worked for 11 of them.
Unfortunately, for one of them (contatining 1974 cells) I get the following error.

  File "/nfs/home/students/postd/.local/lib/python3.6/site-packages/harmonypy/harmony.py", line 126, in run_harmony
    epsilon_cluster, epsilon_harmony, nclust, block_size, lamb_mat, verbose
  File "/nfs/home/students/postd/.local/lib/python3.6/site-packages/harmonypy/harmony.py", line 172, in __init__
    self.init_cluster()
  File "/nfs/home/students/postd/.local/lib/python3.6/site-packages/harmonypy/harmony.py", line 195, in init_cluster
    self.R = self.R / self.sigma[:,None]
ValueError: operands could not be broadcast together with shapes (64,1974) (66,1)

I tried again and got the same error with different shapes:
ValueError: operands could not be broadcast together with shapes (65,1974) (66,1)

I extracted all necessary code to reproduce the error and to find its cause.

km = kmeans(self.Z_cos.T, self.K, iter=10)
self.Y = km[0].T

I found that kmeans does not always return the expected number (self.K) of clusters (probably to centroids with no attached node?) which leads self.Y to be smaller than expected and also self.R.shape[0] will not match the length of self.sigma.
This finally leads to the errors above.
I guess i can just set 'nclust' to a smaller number as a param, but maybe i am not the only one with this problem.

Best,
Dennis

Error in running large dataset

First of all, thank you very much for migrating the Harmony algorithm from R to Python. Today, I encountered the following error while integrating 318,720 cells from different datasets using harmonypy.

Hope to receive your help. Thank you.

sklearn.cluster.KMeans issue with kmeans_single() and threadpool_limits()

Hello, I am trying to run harmonypy on an AnnData object. This has 58k spots and 36k genes (doing spatial tx). When I run harmony, I run into an issue during the kmeans phase:

     8 print('beginning harmony')
----> 9 ho = hm.run_harmony(data_mat, meta_data, vars_use)

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/harmonypy/harmony.py:124, in run_harmony(data_mat, meta_data, vars_use, theta, lamb, sigma, nclust, tau, block_size, max_iter_harmony, max_iter_kmeans, epsilon_cluster, epsilon_harmony, plot_convergence, verbose, reference_values, cluster_prior, random_state)
    120 phi_moe = np.vstack((np.repeat(1, N), phi))
    122 np.random.seed(random_state)
--> 124 ho = Harmony(
    125     data_mat, phi, phi_moe, Pr_b, sigma, theta, max_iter_harmony, max_iter_kmeans,
    126     epsilon_cluster, epsilon_harmony, nclust, block_size, lamb_mat, verbose
    127 )
    129 return ho

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/harmonypy/harmony.py:172, in Harmony.__init__(self, Z, Phi, Phi_moe, Pr_b, sigma, theta, max_iter_harmony, max_iter_kmeans, epsilon_kmeans, epsilon_harmony, K, block_size, lamb, verbose)
    169 self.kmeans_rounds  = []
    171 self.allocate_buffers()
--> 172 self.init_cluster()
    173 self.harmonize(self.max_iter_harmony, self.verbose)

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/harmonypy/harmony.py:191, in Harmony.init_cluster(self)
    188 logger.info("Computing initial centroids with sklearn.KMeans...")
    189 model = KMeans(n_clusters=self.K, init='k-means++',
    190                n_init=10, max_iter=25)
--> 191 model.fit(self.Z_cos.T)
    192 km_centroids, km_labels = model.cluster_centers_, model.labels_
    193 logger.info("sklearn.KMeans initialization complete.")

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:1468, in KMeans.fit(self, X, y, sample_weight)
   1465     print("Initialization complete")
   1467 # run a k-means once
-> 1468 labels, inertia, centers, n_iter_ = kmeans_single(
   1469     X,
   1470     sample_weight,
   1471     centers_init,
   1472     max_iter=self.max_iter,
   1473     verbose=self.verbose,
   1474     tol=self._tol,
   1475     n_threads=self._n_threads,
   1476 )
   1478 # determine if these results are the best so far
   1479 # we chose a new run if it has a better inertia and the clustering is
   1480 # different from the best so far (it's possible that the inertia is
   1481 # slightly better even if the clustering is the same with potentially
   1482 # permuted labels, due to rounding errors)
   1483 if best_inertia is None or (
   1484     inertia < best_inertia
   1485     and not _is_same_clustering(labels, best_labels, self.n_clusters)
   1486 ):

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py:679, in _kmeans_single_lloyd(X, sample_weight, centers_init, max_iter, verbose, tol, n_threads)
    675 strict_convergence = False
    677 # Threadpoolctl context to limit the number of threads in second level of
    678 # nested parallelism (i.e. BLAS) to avoid oversubscription.
--> 679 with threadpool_limits(limits=1, user_api="blas"):
    680     for i in range(max_iter):
    681         lloyd_iter(
    682             X,
    683             sample_weight,
   (...)
    689             n_threads,
    690         )

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/sklearn/utils/fixes.py:151, in threadpool_limits(limits, user_api)
    149     return controller.limit(limits=limits, user_api=user_api)
    150 else:
--> 151     return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:171, in threadpool_limits.__init__(self, limits, user_api)
    167 def __init__(self, limits=None, user_api=None):
    168     self._limits, self._user_api, self._prefixes = \
    169         self._check_params(limits, user_api)
--> 171     self._original_info = self._set_threadpool_limits()

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:268, in threadpool_limits._set_threadpool_limits(self)
    265 if self._limits is None:
    266     return None
--> 268 modules = _ThreadpoolInfo(prefixes=self._prefixes,
    269                           user_api=self._user_api)
    270 for module in modules:
    271     # self._limits is a dict {key: num_threads} where key is either
    272     # a prefix or a user_api. If a module matches both, the limit
    273     # corresponding to the prefix is chosed.
    274     if module.prefix in self._limits:

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:340, in _ThreadpoolInfo.__init__(self, user_api, prefixes, modules)
    337     self.user_api = [] if user_api is None else user_api
    339     self.modules = []
--> 340     self._load_modules()
    341     self._warn_if_incompatible_openmp()
    342 else:

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:371, in _ThreadpoolInfo._load_modules(self)
    369 """Loop through loaded libraries and store supported ones"""
    370 if sys.platform == "darwin":
--> 371     self._find_modules_with_dyld()
    372 elif sys.platform == "win32":
    373     self._find_modules_with_enum_process_module_ex()

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:428, in _ThreadpoolInfo._find_modules_with_dyld(self)
    425 filepath = filepath.decode("utf-8")
    427 # Store the module if it is supported and selected
--> 428 self._make_module_from_path(filepath)

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:515, in _ThreadpoolInfo._make_module_from_path(self, filepath)
    513 if prefix in self.prefixes or user_api in self.user_api:
    514     module_class = globals()[module_class]
--> 515     module = module_class(filepath, prefix, user_api, internal_api)
    516     self.modules.append(module)

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:606, in _Module.__init__(self, filepath, prefix, user_api, internal_api)
    604 self.internal_api = internal_api
    605 self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
--> 606 self.version = self.get_version()
    607 self.num_threads = self.get_num_threads()
    608 self._get_extra_info()

File ~/opt/anaconda3/envs/py39new/lib/python3.9/site-packages/threadpoolctl.py:646, in _OpenBLASModule.get_version(self)
    643 get_config = getattr(self._dynlib, "openblas_get_config",
    644                      lambda: None)
    645 get_config.restype = ctypes.c_char_p
--> 646 config = get_config().split()
    647 if config[0] == b"OpenBLAS":
    648     return config[1].decode("utf-8")

AttributeError: 'NoneType' object has no attribute 'split'

I am not sure if this is an error in the data that I'm inputting, something to do with Harmony, or Sklearn. I currently believe it to be Sklearn but am opening this issue in case there is anybody who has dealt with the same issue.

ValueError: operands could not be broadcast together

Hi Kamil,

I get the following error when I run harmonization on an AnnData with ~2.7k cells, using a PCA matrix of 18 PCs and using one variable for vars_use, which has 30 unique values:

Traceback (most recent call last):
  File "/home/sramesh/software/miniconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-27-f125ff7d55f1>", line 1, in <module>
    hm_obj = hm.run_harmony(adata.obsm['X_pca'], adata.obs, [channel_name])
  File "/home/sramesh/software/miniconda3/lib/python3.8/site-packages/harmonypy/harmony.py", line 124, in run_harmony
    ho = Harmony(
  File "/home/sramesh/software/miniconda3/lib/python3.8/site-packages/harmonypy/harmony.py", line 172, in __init__
    self.init_cluster()
  File "/home/sramesh/software/miniconda3/lib/python3.8/site-packages/harmonypy/harmony.py", line 202, in init_cluster
    self.compute_objective()
  File "/home/sramesh/software/miniconda3/lib/python3.8/site-packages/harmonypy/harmony.py", line 214, in compute_objective
    w = np.dot(y * z, self.Phi)
ValueError: operands could not be broadcast together with shapes (92,30) (92,35)

Not sure why I'm getting this error, when usually harmonypy works most of the time. Any help is appreciated. Thanks!

Unique key is not in pandas describe

Hi, I guess in this line should be an option include='all' for describe() method.

harmonypy/harmonypy/harmony.py

Line 88 in 18a4f02

phi_n = meta_data[vars_use].describe().loc['unique'].to_numpy().astype(int)

Otherwise there is an error

     86 
     87     phi = pd.get_dummies(meta_data[vars_use]).to_numpy().T
---> 88     phi_n = meta_data[vars_use].describe().loc['unique'].to_numpy().astype(int)
     89 
     90     if theta is None:

KeyError: 'unique'

Running multiple instances of the same variable in vars_use.

Hello, I have noticed that if I add multiple entries of the same .obs I want to correct for in vars_use, I will obtain different results, as if that .obs is being corrected for multiple times. The order used also affects results. For example:

vars_use = ['phase',"batch","batch","batch"] will seemingly correct for phase once and then batch three times, whereas
vars_use = ['batch',"batch","batch","phase"] will correct batch three times and then for phase. The results between these two runs are different.

Also, note that this is not the same as running harmony several times for the same variable, in which each separate run will remove the top PC from the observations. E.g.:

vars_use = ['batch'], then a new run with vars_use = ['batch'], then a new run with vars_use = ['batch'], will remove the top PC per run. So for this example if I start with 50, I'll end up with 47 in the end.

My questions are: what is it actually doing when I add multiple instances of the same observation? Also, why is the top PC removed for each successive harmonypy run?

Will the latest R harmony updates be propagated to the harmonypy

hi ,
I was wondering, if the recent updates made to the harmony R package which takes harmony to v1.2 will be propagated to the harmonypy package as well? We have pipelines which regularly use the harmonypy package and we would really appreciate it if these changes were also updated in the python version.

Best,
Devika

Failed to set number of max iterations to more than 10

When I tried to run Harmony using:

hm.run_harmony(data_mat, meta_data, vars_use, max_iter_harmony=50)

It still only ran 10 iterations and stopped prematurely before reaching convergence. Any idea how to harmonize a dataset with > 10 iterations?

Thanks in advance!

Error when trying to run Harmony with multiple covariates

From reading the documentation at https://github.com/immunogenomics/harmony, it seems that feeding a list of multiple covariates to the argument vars_use would harmonize the data over these covariates. However when I tried to do this using:

meta_data = adata.obs
vars_use = ['batch', 'library_name']
data_mat = adata.obsm['X_pca']
ho = hm.run_harmony(data_mat, meta_data, vars_use)

where adata is my dataset in AnnData format, I got this error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-91-b088ad1e792b> in <module>
----> 1 ho = hm.run_harmony(data_mat, meta_data, vars_use)

~/.conda/envs/py37_TOC_PAGA/lib/python3.7/site-packages/harmonypy/harmony.py in run_harmony(data_mat, meta_data, vars_use, theta, lamb, sigma, nclust, tau, block_size, max_iter_harmony, max_iter_cluster, epsilon_cluster, epsilon_harmony, plot_convergence, return_object, verbose, reference_values, cluster_prior)
     56     for i in range(len(categories.categories)):
     57         ix = categories == categories.categories[i]
---> 58         phi[i,ix] = 1
     59 
     60     N_b = phi.sum(axis = 1)

IndexError: boolean index did not match indexed array along dimension 1; dimension is 23381 but corresponding boolean dimension is 2

Was there something I overlooked? Please advise. Thanks!

How to get the result after running harmony?

I cannot find the result after batch effect removal in python. How to get the data value after batch effect removal? Thanks

AttributeError: 'NoneType' object has no attribute 'split'

Hi,

I am trying to run Harmonypy on an anndata object. It gave me the error below.

AttributeError                            Traceback (most recent call last)
Cell In[198], line 1
----> 1 sc.external.pp.harmony_integrate(xenium_breast_htma_breast, key = 'patient')

File ~/anaconda3/lib/python3.10/site-packages/scanpy/external/pp/_harmony_integrate.py:82, in harmony_integrate(adata, key, basis, adjusted_basis, **kwargs)
     79 except ImportError:
     80     raise ImportError("\nplease install harmonypy:\n\n\tpip install harmonypy")
---> 82 harmony_out = harmonypy.run_harmony(adata.obsm[basis], adata.obs, key, **kwargs)
     84 adata.obsm[adjusted_basis] = harmony_out.Z_corr.T

File ~/anaconda3/lib/python3.10/site-packages/harmonypy/harmony.py:124, in run_harmony(data_mat, meta_data, vars_use, theta, lamb, sigma, nclust, tau, block_size, max_iter_harmony, max_iter_kmeans, epsilon_cluster, epsilon_harmony, plot_convergence, verbose, reference_values, cluster_prior, random_state)
    120 phi_moe = np.vstack((np.repeat(1, N), phi))
    122 np.random.seed(random_state)
--> 124 ho = Harmony(
    125     data_mat, phi, phi_moe, Pr_b, sigma, theta, max_iter_harmony, max_iter_kmeans,
    126     epsilon_cluster, epsilon_harmony, nclust, block_size, lamb_mat, verbose
    127 )
    129 return ho

File ~/anaconda3/lib/python3.10/site-packages/harmonypy/harmony.py:172, in Harmony.__init__(self, Z, Phi, Phi_moe, Pr_b, sigma, theta, max_iter_harmony, max_iter_kmeans, epsilon_kmeans, epsilon_harmony, K, block_size, lamb, verbose)
    169 self.kmeans_rounds  = []
    171 self.allocate_buffers()
--> 172 self.init_cluster()
    173 self.harmonize(self.max_iter_harmony, self.verbose)

File ~/anaconda3/lib/python3.10/site-packages/harmonypy/harmony.py:191, in Harmony.init_cluster(self)
    188 logger.info("Computing initial centroids with sklearn.KMeans...")
    189 model = KMeans(n_clusters=self.K, init='k-means++',
    190                n_init=10, max_iter=25)
--> 191 model.fit(self.Z_cos.T)
    192 km_centroids, km_labels = model.cluster_centers_, model.labels_
    193 logger.info("sklearn.KMeans initialization complete.")

File ~/anaconda3/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:1468, in KMeans.fit(self, X, y, sample_weight)
   1465     print("Initialization complete")
   1467 # run a k-means once
-> 1468 labels, inertia, centers, n_iter_ = kmeans_single(
   1469     X,
   1470     sample_weight,
   1471     centers_init,
   1472     max_iter=self.max_iter,
   1473     verbose=self.verbose,
   1474     tol=self._tol,
   1475     n_threads=self._n_threads,
   1476 )
   1478 # determine if these results are the best so far
   1479 # we chose a new run if it has a better inertia and the clustering is
   1480 # different from the best so far (it's possible that the inertia is
   1481 # slightly better even if the clustering is the same with potentially
   1482 # permuted labels, due to rounding errors)
   1483 if best_inertia is None or (
   1484     inertia < best_inertia
   1485     and not _is_same_clustering(labels, best_labels, self.n_clusters)
   1486 ):

File ~/anaconda3/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:679, in _kmeans_single_lloyd(X, sample_weight, centers_init, max_iter, verbose, tol, n_threads)
    675 strict_convergence = False
    677 # Threadpoolctl context to limit the number of threads in second level of
    678 # nested parallelism (i.e. BLAS) to avoid oversubscription.
--> 679 with threadpool_limits(limits=1, user_api="blas"):
    680     for i in range(max_iter):
    681         lloyd_iter(
    682             X,
    683             sample_weight,
   (...)
    689             n_threads,
    690         )

File ~/anaconda3/lib/python3.10/site-packages/sklearn/utils/fixes.py:139, in threadpool_limits(limits, user_api)
    137     return controller.limit(limits=limits, user_api=user_api)
    138 else:
--> 139     return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:171, in threadpool_limits.__init__(self, limits, user_api)
    167 def __init__(self, limits=None, user_api=None):
    168     self._limits, self._user_api, self._prefixes = \
    169         self._check_params(limits, user_api)
--> 171     self._original_info = self._set_threadpool_limits()

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:268, in threadpool_limits._set_threadpool_limits(self)
    265 if self._limits is None:
    266     return None
--> 268 modules = _ThreadpoolInfo(prefixes=self._prefixes,
    269                           user_api=self._user_api)
    270 for module in modules:
    271     # self._limits is a dict {key: num_threads} where key is either
    272     # a prefix or a user_api. If a module matches both, the limit
    273     # corresponding to the prefix is chosed.
    274     if module.prefix in self._limits:

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:340, in _ThreadpoolInfo.__init__(self, user_api, prefixes, modules)
    337     self.user_api = [] if user_api is None else user_api
    339     self.modules = []
--> 340     self._load_modules()
    341     self._warn_if_incompatible_openmp()
    342 else:

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:371, in _ThreadpoolInfo._load_modules(self)
    369 """Loop through loaded libraries and store supported ones"""
    370 if sys.platform == "darwin":
--> 371     self._find_modules_with_dyld()
    372 elif sys.platform == "win32":
    373     self._find_modules_with_enum_process_module_ex()

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:428, in _ThreadpoolInfo._find_modules_with_dyld(self)
    425 filepath = filepath.decode("utf-8")
    427 # Store the module if it is supported and selected
--> 428 self._make_module_from_path(filepath)

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:515, in _ThreadpoolInfo._make_module_from_path(self, filepath)
    513 if prefix in self.prefixes or user_api in self.user_api:
    514     module_class = globals()[module_class]
--> 515     module = module_class(filepath, prefix, user_api, internal_api)
    516     self.modules.append(module)

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:606, in _Module.__init__(self, filepath, prefix, user_api, internal_api)
    604 self.internal_api = internal_api
    605 self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
--> 606 self.version = self.get_version()
    607 self.num_threads = self.get_num_threads()
    608 self._get_extra_info()

File ~/anaconda3/lib/python3.10/site-packages/threadpoolctl.py:646, in _OpenBLASModule.get_version(self)
    643 get_config = getattr(self._dynlib, "openblas_get_config",
    644                      lambda: None)
    645 get_config.restype = ctypes.c_char_p
--> 646 config = get_config().split()
    647 if config[0] == b"OpenBLAS":
    648     return config[1].decode("utf-8")

AttributeError: 'NoneType' object has no attribute 'split'

I tried the method in the closed issue. Didn't work for me. Any help is greatly appreciated!! Thanks.

Harmonypy results with anndata

Once the adjusted PCs are calculated...

data_mat = adata.obsm['X_pca']
meta_data = adata.obs
vars_use = ['batch']
ho = hm.run_harmony(data_mat, meta_data, vars_use)

...how do I integrate the results back into the anndata object (adata) to proceed with the workflow? I am attempting to use harmonypy in the scanpy workflow but am very new at this.

Thanks in advance!

Runharmony with R and run_harmony with Python produced different results

Hi! Thank you for porting harmony into python -- it's very useful.

I found that Runharmony with R and run_harmony with Python produced different results using the same principal component and meta data. Is that normal？And it feels like R is going to be a little bit better.

Zhaoyu

Add converged status to harmony object

Thank you very much for this port!
What do you think about adding a "converged" state to Harmony object properties? (https://github.com/slowkow/harmonypy/blob/master/harmonypy/harmony.py#L242)
It would be easier to check it programmatically rather than reading logs.

Results not reproducible

Looks like the recent updated to KMeans method made the results not reproducible. Please see comment here
9c081b8#commitcomment-98243211

slowkow / harmonypy Goto Github PK

harmonypy's People

Stargazers

Watchers

Forkers

harmonypy's Issues

Recommend Projects

Recommend Topics

Recommend Org