
Comments (9)

slowkow avatar slowkow commented on May 27, 2024 2

Thank you @bahnk for confirming that we should switch from kmeans() to kmeans2().

I'll make a new release soon.

from harmonypy.

slowkow avatar slowkow commented on May 27, 2024 1

Thank you @onionpork for bringing this up again.

The documentation for scipy.cluster.vq.kmeans describes the objects returned by kmeans():

Returns
codebook ndarray
A k by N array of k centroids. The ith centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion. Note that the number of centroids is not necessarily the same as the k_or_guess parameter, because centroids assigned to no observations are removed during iterations.
distortion float
The mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note the difference to the standard definition of distortion in the context of the k-means algorithm, which is the sum of the squared distances.

So, the documentation is consistent with your experience: the kmeans function does not always return k centroids.
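One way to see this behaviour is to compare the number of rows in the returned codebook with the requested k. The sketch below uses random data (an assumption, not the data from this issue). `kmeans()` may or may not drop centroids on a given run, so the code only prints the counts, without claiming a particular result.

```python
import numpy as np
from scipy.cluster.vq import kmeans, kmeans2

# Random placeholder data: 500 observations with 5 features each.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 5))
k = 100

# kmeans() returns (codebook, distortion); the codebook can have fewer than k rows.
codebook, distortion = kmeans(data, k, iter=10)

# kmeans2() returns (centroids, labels); the centroid array has k rows.
centroids2, labels = kmeans2(data, k, minit='++')

print("requested:", k)
print("kmeans rows:", codebook.shape[0])
print("kmeans2 rows:", centroids2.shape[0])
```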

I think the best solution would be to choose a different implementation of kmeans that returns a consistent output. I might consider:


slowkow avatar slowkow commented on May 27, 2024

Hi Dennis, thanks for opening this issue! And thanks for investigating which lines might be the root cause of the error.

Did you find that setting nclust to a smaller number resolves your issue?

Have you considered whether the 12th dataset is representative of any of the other 11 datasets? For example, have you looked at a clustering result for each dataset independently?

Could I please ask if you have a proposal for the best modification to the code? For example, would it be helpful to add a check of how many clusters we got from kmeans(), and then report an error to the user along with a suggestion that they try a smaller value for nclust?
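Such a check could look roughly like the sketch below. The function name `check_centroid_count` and the wording of the message are hypothetical, chosen for illustration only; this is not code from harmonypy.

```python
import numpy as np

def check_centroid_count(centroids, nclust):
    """Raise a ValueError if the number of centroids differs from nclust.

    Hypothetical helper for illustration; not part of harmonypy.
    """
    n_found = np.asarray(centroids).shape[0]
    if n_found != nclust:
        raise ValueError(
            f"kmeans returned {n_found} centroids but nclust={nclust}; "
            f"consider retrying with a smaller nclust (for example {n_found})."
        )
    return centroids

# Example: 87 centroids came back when 100 were requested.
try:
    check_centroid_count(np.zeros((87, 20)), 100)
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early with a message like this would point the user at nclust, instead of at a shape mismatch much later in the run.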


onionpork avatar onionpork commented on May 27, 2024

Hey there,
I think I have the same problem here, even after tuning nclust down. The problem is that kmeans always returns fewer centroids than the nclust I define.
I think the problem is in the centroid initialization (line 88). It always returns fewer cluster centers than requested.
For example,

km = kmeans(self.Z_cos.T, 100, iter=10)
km[0].shape[0]  # 87, fewer than the 100 requested

Any idea how to fix it?
Thank you so much!


slowkow avatar slowkow commented on May 27, 2024

@onionpork @DennisPost10 I think the issue should be fixed by commit 2fd234e

I would be grateful if you could confirm that the new code works for your application.


slowkow avatar slowkow commented on May 27, 2024

@onionpork and @DennisPost10

Please feel free to comment on this issue if you would like to share your experience with the new code. Thanks.


bahnk avatar bahnk commented on May 27, 2024

Hey @slowkow,

Just had the same issue with the pip version that doesn't use kmeans2. But it works well with the git version that uses kmeans2.

Cheers,


wlason avatar wlason commented on May 27, 2024

Hi! I'm using harmonypy as part of panpipes and I am still facing this issue with harmonypy version 0.0.9:

2024-03-20 15:35:23,333 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-03-20 15:35:36,417 - harmonypy - INFO - sklearn.KMeans initialization complete.
Traceback (most recent call last):
  File "[...]/panpipes/panpipes/python_scripts/batch_correct_harmony.py", line 110, in <module>
    ho = hm.run_harmony(adata.obsm[dimred][:,0:int(args.harmony_npcs)],
  File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 124, in run_harmony
    ho = Harmony(
  File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 172, in __init__
    self.init_cluster()
  File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 207, in init_cluster
    self.compute_objective()
  File "[...]//conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 219, in compute_objective
    w = np.dot(y * z, self.Phi)
ValueError: operands could not be broadcast together with shapes (100,33) (100,34)
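NumPy raises this error for any element-wise product of two arrays whose shapes cannot be aligned. The sketch below uses placeholder arrays with the two shapes reported in the traceback; the real `y` and `z` from harmony.py are not reproduced here. Note that the operands agree on the first axis (100) and differ on the second (33 versus 34).

```python
import numpy as np

# Placeholder arrays with the shapes reported in the traceback.
y = np.ones((100, 33))
z = np.ones((100, 34))

try:
    y * z  # element-wise product, as in `np.dot(y * z, self.Phi)`
except ValueError as err:
    message = str(err)
    print(message)
```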

Might it have to do with the switch away from kmeans2()?

from sklearn.cluster import KMeans


slowkow avatar slowkow commented on May 27, 2024

@wlason

Feel free to switch to kmeans2 if you want to check if that might solve your issue. The test demonstrates how you can do that:

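from scipy.cluster.vq import kmeans2
import harmonypy as hm

# data_mat and meta_data are the inputs loaded earlier in the test.
def cluster_fn(data, K):
    # kmeans2 returns (centroids, labels); only the centroids are returned here.
    centroid, label = kmeans2(data, K, minit='++', seed=0)
    return centroid

def run(cluster_fn):
    ho = hm.run_harmony(data_mat,
                        meta_data, ['donor'],
                        max_iter_harmony=2,
                        max_iter_kmeans=2,
                        cluster_fn=cluster_fn)
    return ho.Z_corr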

If you want more help, please consider sharing a minimal reproducible example, so others might have a chance to reproduce the error that you are facing. We don't need your data, any random data would be fine. But if this issue depends on your data, then you might consider sharing it privately with me so I can use it to debug.
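As a starting point for such an example, random inputs can be built in a few lines. This is only a sketch: the helper name `make_random_inputs`, the sizes, and the donor labels are all hypothetical, and the `run_harmony` call is left as a comment that mirrors the call in the test above.

```python
import numpy as np
import pandas as pd

def make_random_inputs(n_cells=300, n_pcs=10, n_donors=3, seed=0):
    """Build a random (cells x PCs) matrix and matching metadata.

    Hypothetical helper for illustration; sizes and labels are arbitrary.
    """
    rng = np.random.default_rng(seed)
    data_mat = rng.normal(size=(n_cells, n_pcs))
    # Assign donors in a round-robin pattern so every donor is present.
    donors = [f"donor{i % n_donors}" for i in range(n_cells)]
    meta_data = pd.DataFrame({"donor": donors})
    return data_mat, meta_data

data_mat, meta_data = make_random_inputs()
print(data_mat.shape, meta_data["donor"].nunique())

# With harmonypy installed, the inputs would then be passed on as in the test:
# ho = hm.run_harmony(data_mat, meta_data, ['donor'])
```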

I do not know if sklearn.cluster.KMeans is guaranteed to return the number of clusters that we asked for. I hope so, but I am not certain.
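One way to check this empirically is to look at the shape of `cluster_centers_` and at how many distinct labels are actually used. The scikit-learn documentation describes `cluster_centers_` as having shape (n_clusters, n_features). The sketch below only tests this on random data (an assumption, not the data from this issue), so it is an observation and not a proof of a guarantee.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random placeholder data: 500 observations with 5 features each.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 5))
k = 100

model = KMeans(n_clusters=k, n_init=1, max_iter=10, random_state=0).fit(data)
centroids = model.cluster_centers_

# Number of clusters that have at least one observation assigned.
n_used = np.unique(model.labels_).size

print("centroid array shape:", centroids.shape)
print("clusters in use:", n_used)
```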

We switched to KMeans for faster execution time. But it seems that all of the implementations we've tried so far have some limitations. If you can suggest a specific implementation of a k-means algorithm that you think might be suitable, that would also be helpful.

Good luck!

