Hi there, i ran harmonypy via scanpy external api on 12 different data sets and it

ValueError: operands could not be broadcast together with shapes about harmonypy HOT 9 CLOSED

DennisPost10 commented on May 27, 2024

ValueError: operands could not be broadcast together with shapes

from harmonypy.

Comments (9)

slowkow commented on May 27, 2024

Thank you @bahnk for confirming that we should switch from kmeans() to kmeans2().

I'll make a new release soon.

slowkow commented on May 27, 2024

Thank you @onionpork for bringing this up again.

The documentation for scipy.cluster.vq.kmeans describes the objects returned by kmeans():

Returns
    codebook : ndarray
        A k by N array of k centroids. The ith centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion. Note that the number of centroids is not necessarily the same as the k_or_guess parameter, because centroids assigned to no observations are removed during iterations.
    distortion : float
        The mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note the difference to the standard definition of distortion in the context of the k-means algorithm, which is the sum of the squared distances.

So, the documentation is consistent with your experience: the kmeans function does not always return k centroids.
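A minimal sketch of this behavior, assuming SciPy is installed. The toy data below is made up for illustration and is not from this issue: it has only 3 distinct points, so asking for 5 centroids leaves some centroids with no observations, and kmeans() drops them.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Toy data (illustrative only): 30 observations, but just 3 distinct points.
points = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 5.0]])
obs = np.repeat(points, 10, axis=0)

k = 5
codebook, distortion = kmeans(obs, k, iter=10)

# Centroids with no assigned observations are removed during iterations,
# so the codebook has fewer than k rows for this data.
print(codebook.shape[0], "centroids returned for k =", k)
```

Any downstream code that assumes exactly k rows in the codebook would then see mismatched array shapes.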

I think the best solution would be to choose a different implementation of kmeans that returns a consistent output. I might consider:

slowkow commented on May 27, 2024

Hi Dennis, thanks for opening this issue! And thanks for investigating which lines might be the root cause of the error.

Did you find that setting nclust to a smaller number resolves your issue?

Have you considered whether the 12th dataset is representative of any of the other 11 datasets? For example, have you looked at a clustering result for each dataset independently?

Could I please ask if you have a proposal for the best modification to the code? For example, would it be helpful to add a check on how many clusters we got from kmeans(), and then report an error to the user along with a suggestion that they try a smaller value for nclust?
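The kind of check proposed above could look roughly like the sketch below. The helper name check_centroid_count and its error message are hypothetical and are not part of harmonypy.

```python
import numpy as np

def check_centroid_count(centroids, nclust):
    """Hypothetical helper: fail early if clustering returned the wrong number of centroids."""
    n_found = np.asarray(centroids).shape[0]
    if n_found != nclust:
        raise ValueError(
            f"Expected {nclust} centroids but got {n_found}; "
            "consider trying a smaller value for nclust."
        )
    return centroids

# Example with placeholder arrays: 87 centroids returned when 100 were requested.
try:
    check_centroid_count(np.zeros((87, 20)), 100)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

Failing at this point would give the user an actionable message instead of a broadcasting error later in the run.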

onionpork commented on May 27, 2024

Hey there,
I think I have the same problem here, even after tuning nclust down. The problem is that kmeans always returns fewer centroids than the requested nclust.
I think the problem is in the centroid initialization (line 88), which always returns fewer cluster centers than requested.
For example,

km = kmeans(self.Z_cos.T, 100, iter=10)
km[0].shape[0]  # 87 (< 100)

Any idea how to fix it?
Thank you so much!

slowkow commented on May 27, 2024

@onionpork @DennisPost10 I think the issue should be fixed by commit 2fd234e

I would be grateful if you could confirm that the new code works for your application.

slowkow commented on May 27, 2024

@onionpork and @DennisPost10

Please feel free to comment on this issue if you would like to share your experience with the new code. Thanks.

bahnk commented on May 27, 2024

Hey @slowkow,

Just had the same issue with the pip version that doesn't use kmeans2. But it works well with the git version that uses kmeans2.

Cheers,

wlason commented on May 27, 2024

Hi! I'm using harmonypy as part of panpipes and I am still facing this issue with harmonypy version 0.0.9:

    2024-03-20 15:35:23,333 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
    2024-03-20 15:35:36,417 - harmonypy - INFO - sklearn.KMeans initialization complete.
    Traceback (most recent call last):
      File "[...]/panpipes/panpipes/python_scripts/batch_correct_harmony.py", line 110, in <module>
        ho = hm.run_harmony(adata.obsm[dimred][:,0:int(args.harmony_npcs)],
      File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 124, in run_harmony
        ho = Harmony(
      File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 172, in __init__
        self.init_cluster()
      File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 207, in init_cluster
        self.compute_objective()
      File "[...]//conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 219, in compute_objective
        w = np.dot(y * z, self.Phi)
    ValueError: operands could not be broadcast together with shapes (100,33) (100,34)

Might it have to do with the switch away from kmeans2()?

harmonypy/harmonypy/harmony.py

Line 21 in 182a5c6

from sklearn.cluster import KMeans
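The message in the traceback above is the one NumPy raises for an elementwise operation on arrays with incompatible shapes, so it comes from the product y * z rather than from np.dot itself. A minimal sketch with placeholder arrays of the reported shapes (the array contents are assumptions, since the thread does not show what y and z hold):

```python
import numpy as np

# Placeholder arrays with the shapes reported in the traceback.
y = np.ones((100, 33))
z = np.ones((100, 34))

try:
    y * z  # elementwise product needs broadcast-compatible shapes
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

The second dimensions differ by one (33 versus 34), which is the kind of mismatch that would appear if one array was sized from the clusters actually returned and the other from a different count.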

slowkow commented on May 27, 2024

@wlason

Feel free to switch to kmeans2 if you want to check if that might solve your issue. The test demonstrates how you can do that:

harmonypy/tests/test_harmony.py

Lines 65 to 75 in 72da6fb

    def cluster_fn(data, K):
        centroid, label = kmeans2(data, K, minit='++', seed=0)
        return centroid

    def run(cluster_fn):
        ho = hm.run_harmony(data_mat,
                            meta_data, ['donor'],
                            max_iter_harmony=2,
                            max_iter_kmeans=2,
                            cluster_fn=cluster_fn)
        return ho.Z_corr
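Before passing a custom cluster_fn to run_harmony, it may help to confirm on its own that it returns one centroid per requested cluster. A small sketch, assuming SciPy is installed and using random placeholder data rather than real embeddings:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_fn(data, K):
    # kmeans2 returns (centroids, labels); only the centroids are returned here.
    centroid, label = kmeans2(data, K, minit='++', seed=0)
    return centroid

# Random placeholder data: 200 observations with 5 features each.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))

centroids = cluster_fn(data, 10)

# One row per requested cluster, one column per feature.
print(centroids.shape)
```

This checks only the output shape of the clustering function and says nothing about the quality of the clusters.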

If you want more help, please consider sharing a minimal reproducible example, so others might have a chance to reproduce the error that you are facing. We don't need your data, any random data would be fine. But if this issue depends on your data, then you might consider sharing it privately with me so I can use it to debug.

I do not know if sklearn.cluster.KMeans is guaranteed to return the number of clusters that we asked for. I hope so, but I am not certain.

We switched to KMeans for faster execution time. But it seems that all of the implementations we've tried so far have some limitations. If you can suggest a specific implementation of a k-means algorithm that you think might be suitable, that would also be helpful.

Good luck!
