Comments (9)
Thank you @bahnk for confirming that we should switch from `kmeans()` to `kmeans2()`.
I'll make a new release soon.
from harmonypy.
Thank you @onionpork for bringing this up again.
The documentation for scipy.cluster.vq.kmeans describes the objects returned by `kmeans()`:
Returns
codebook ndarray
A k by N array of k centroids. The ith centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion. Note that the number of centroids is not necessarily the same as the k_or_guess parameter, because centroids assigned to no observations are removed during iterations.
distortion float
The mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note the difference to the standard definition of distortion in the context of the k-means algorithm, which is the sum of the squared distances.
So, the documentation is consistent with your experience: the kmeans function does not always return k centroids.
I think the best solution would be to choose a different implementation of kmeans that returns a consistent output. I might consider:
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
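A minimal sketch of the first option, assuming random data stands in for the real embedding. The variable names and the choice of `minit='++'` are illustrative assumptions, not taken from harmonypy. It shows that `kmeans2` returns a centroid array with exactly `k` rows.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Random data standing in for a real embedding (assumption for illustration).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 20))
k = 50

# kmeans2 returns a (k, n_features) centroid array and one label per observation.
# With the default missing='warn', an empty cluster triggers a warning
# instead of dropping the centroid, so the row count stays at k.
centroids, labels = kmeans2(data, k, iter=10, minit='++')

print(centroids.shape)
```

An empty cluster still deserves attention, since its centroid is simply carried over from the previous iteration.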
Hi Dennis, thanks for opening this issue! And thanks for investigating which lines might be the root cause of the error.
Did you find that setting `nclust` to a smaller number resolves your issue?
Have you considered whether the 12th dataset is representative of any of the other 11 datasets? For example, have you looked at a clustering result for each dataset independently?
Could I please ask if you have a proposal for the best modification to the code? For example, would it help to check how many clusters we got from `kmeans()`, and then report an error to the user along with a suggestion to try a smaller value for `nclust`?
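Such a check could look roughly like the sketch below. The helper name `check_centroid_count` and the error wording are hypothetical, not part of harmonypy.

```python
import numpy as np

def check_centroid_count(centroids, nclust):
    """Raise an informative error if the number of centroids differs from nclust.

    Hypothetical helper for illustration; not part of harmonypy.
    """
    n_found = centroids.shape[0]
    if n_found != nclust:
        raise ValueError(
            f"kmeans returned {n_found} centroids but nclust={nclust}; "
            "try a smaller value for nclust."
        )
    return centroids

# Usage: the first call passes, the second raises and the message is captured.
ok = check_centroid_count(np.zeros((100, 20)), nclust=100)
try:
    check_centroid_count(np.zeros((87, 20)), nclust=100)
except ValueError as err:
    message = str(err)
    print(message)
```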
Hey there,
I think I have the same problem here, even after tuning `nclust` down. The problem is that it always returns fewer centroids than the defined `nclust`.
I think the problem is in initializing the centroids (line 88). It always returns fewer cluster centers than requested.
For example,
km = kmeans(self.Z_cos.T, 100, iter=10)
km[0].shape[0]  # 87 (< 100)
Any idea how to fix it?
Thank you so much!
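The shortfall can be reproduced on toy data. This is a minimal sketch, assuming a data set with only three distinct points (an illustrative assumption, not the data from this report): `scipy.cluster.vq.kmeans` drops centroids that end up with no observations, so fewer rows come back than were requested.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Toy data (assumption for illustration): 30 observations, only 3 distinct points.
data = np.repeat(np.eye(3), 10, axis=0)

# Ask for 5 centroids; centroids assigned to no observations are removed.
codebook, distortion = kmeans(data, 5, iter=10)

# Number of centroids actually returned (fewer than the 5 requested).
print(codebook.shape[0])
```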
@onionpork @DennisPost10 I think the issue should be fixed by commit 2fd234e
I would be grateful if you could confirm that the new code works for your application.
Please feel free to comment on this issue if you would like to share your experience with the new code. Thanks.
Hey @slowkow,
Just had the same issue with the pip version that doesn't use kmeans2. But it works well with the git version that uses kmeans2.
Cheers,
Hi! I'm using harmonypy as part of panpipes and I am still facing this issue with harmonypy version 0.0.9:
    2024-03-20 15:35:23,333 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
    2024-03-20 15:35:36,417 - harmonypy - INFO - sklearn.KMeans initialization complete.
    Traceback (most recent call last):
      File "[...]/panpipes/panpipes/python_scripts/batch_correct_harmony.py", line 110, in <module>
        ho = hm.run_harmony(adata.obsm[dimred][:,0:int(args.harmony_npcs)],
      File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 124, in run_harmony
        ho = Harmony(
      File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 172, in __init__
        self.init_cluster()
      File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 207, in init_cluster
        self.compute_objective()
      File "[...]//conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 219, in compute_objective
        w = np.dot(y * z, self.Phi)
    ValueError: operands could not be broadcast together with shapes (100,33) (100,34)
Might it have to do with the switch away from `kmeans2()`?
harmonypy/harmonypy/harmony.py
Line 21 in 182a5c6
Feel free to switch to kmeans2 if you want to check if that might solve your issue. The test demonstrates how you can do that:
harmonypy/tests/test_harmony.py
Lines 65 to 75 in 72da6fb
If you want more help, please consider sharing a minimal reproducible example, so others might have a chance to reproduce the error that you are facing. We don't need your data, any random data would be fine. But if this issue depends on your data, then you might consider sharing it privately with me so I can use it to debug.
I do not know if `sklearn.cluster.KMeans` is guaranteed to return the number of clusters that we asked for. I hope so, but I am not certain.
We switched to KMeans for faster execution time. But it seems that all of the implementations we've tried so far have some limitations. If you can suggest a specific implementation of a k-means algorithm that you think might be suitable, that would also be helpful.
Good luck!