jasonlaska / spherecluster

Clustering routines for the unit sphere

Home Page: https://medium.com/@jaska_at_clara/simple-datetime-disambiguation-fd2374ce664a

License: MIT License

Python 100.00%

sphericalclustering scikit-learn k-means circular-statistics von-mises-fisher spherical-k-means clustering-algorithm sampling directional-statistics

spherecluster's Issues

AttributeError: 'SphericalKMeans' object has no attribute '_check_fit_data'

I've been getting a
AttributeError: 'SphericalKMeans' object has no attribute '_check_fit_data'

error every time I try to run a SphericalKMeans fit. Diving a little further, it looks like the _check_fit_data method only exists in two locations in the repository...

Line 318 in spherical_kmeans.py (the line that throws the error in this case).

Line 772 in von_mises_fisher_mixture.py where it seems to be a defined method.

Based on what I can see from the imports, it looks like the _check_fit_data doesn't actually exist in the context of spherical_kmeans.py, so the error kind of makes sense.

Could this be the result of some accidental deletions? I went through the commit history and couldn't find anything that immediately seemed like the issue. Or is there something very obvious that I'm missing... wouldn't be the first time :)

Also as an FYI, I'm running Python 3.6.4.

Source install fails due to exceptions in setup.py

Hello,

I stumbled over the following issue.

When a package depends on spherecluster and spherecluster is installed from its source distribution (.tgz), the install fails with "numpy is required during installation", raised from setup.py#12, even when the package correctly lists numpy (and scipy) as dependencies.

How to reproduce:

make a toy setup.py

from setuptools import setup

setup(
    name="spherecluster_test",
    version="1.0.0",
    install_requires=["numpy", "scipy", "spherecluster"]
)

build it to wheel

python setup.py bdist_wheel

try to install it with the --no-binary option to force the spherecluster source distribution:

pip install --no-binary "spherecluster" dist/spherecluster_test-1.0.0-py3-none-any.whl

Processing ./dist/spherecluster_test-1.0.0-py3-none-any.whl
Collecting scipy (from spherecluster-test==1.0.0)
  Using cached https://files.pythonhosted.org/packages/a8/0b/f163da98d3a01b3e0ef1cab8dd2123c34aee2bafbb1c5bffa354cc8a1730/scipy-1.1.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting numpy (from spherecluster-test==1.0.0)
  Using cached https://files.pythonhosted.org/packages/16/21/2e88568c134cc3c8d22af290865e2abbd86efa58a1358ffcb19b6c74f9a3/numpy-1.15.3-cp36-cp36m-manylinux1_x86_64.whl
Collecting spherecluster (from spherecluster-test==1.0.0)
  Using cached https://files.pythonhosted.org/packages/27/27/614b9e568e9a9a8d46938310b7caf092657343bf037b9fae416baf611d06/spherecluster-0.1.6.tar.gz
    Complete output from command python setup.py egg_info:
    numpy is required during installation

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-gtqjhbjz/spherecluster/

I suggest removing the following lines from setup.py#L9-L19.

try:
    import numpy  # NOQA
except ImportError:
    print('numpy is required during installation')
    sys.exit(1)

try:
    import scipy  # NOQA
except ImportError:
    print('scipy is required during installation')
    sys.exit(1)

TypeError: _labels_inertia() got an unexpected keyword argument 'precompute_distances'

The error occurs after running:

from spherecluster import SphericalKMeans
skm = SphericalKMeans(n_clusters=2)
skm.fit(wv) # wv is a (6, 200) numpy array

a question about updating centroids

Hi, I have a question about how centroids are updated in your code:

        # computation of the means
        if sp.issparse(X):
            centers = _k_means._centers_sparse(X, labels, n_clusters,
                                               distances)
        else:
            centers = _k_means._centers_dense(X, labels, n_clusters, distances)

        # l2-normalize centers (this is the main contibution here)
        centers = normalize(centers)

When clustering with cosine similarity, the centers from _k_means._centers_XXX are computed the way Euclidean k-means updates them. If you simply normalize those centers, won't they point in different directions than they should?
I hope I've described my question clearly. Looking forward to your reply. Thanks!
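
A small numerical check may help with this question. It assumes every row of X already has unit L2 norm, which the quoted snippet does not show. Under that assumption the summed cosine similarity to a unit center c equals c · (sum of the points), which by Cauchy-Schwarz is largest when c points along that sum, so the normalized arithmetic mean has the direction a cosine-based update would pick. The helper names below are illustrative and not part of spherecluster.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def total_cosine(center, points):
    """Sum of cosine similarities between a unit center and unit points."""
    return sum(dot(center, p) for p in points)

rng = random.Random(0)
# A toy cluster of unit vectors scattered around one direction.
points = [
    normalize([1.0 + rng.gauss(0, 0.3), 0.5 + rng.gauss(0, 0.3), rng.gauss(0, 0.3)])
    for _ in range(50)
]

# Euclidean-style update (arithmetic mean), followed by L2 normalization.
mean = [sum(p[i] for p in points) / len(points) for i in range(3)]
center = normalize(mean)

# Compare against many randomly drawn unit vectors as candidate centers.
best_random = max(
    total_cosine(normalize([rng.gauss(0, 1) for _ in range(3)]), points)
    for _ in range(2000)
)
print(total_cosine(center, points) >= best_random)
```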

Initialization is using euclidean distance

spherecluster/spherecluster/spherical_kmeans.py

Lines 44 to 48 in 701b0b1

    centers = _init_centroids(
        X, n_clusters, init, random_state=random_state, x_squared_norms=x_squared_norms
    )
    if verbose:
        print("Initialization complete")

I might be getting this wrong, but the code here seems to use the initialization function from sklearn. This could cause issues, since the k-means++ initialization in sklearn is based on Euclidean distance. It should use cosine distance instead.
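
Whether this matters depends on whether X is unit-normalized before initialization, which the quoted lines do not show. If it is, squared Euclidean distance is exactly twice the cosine distance, and because k-means++ samples new centers with probability proportional to squared distance, the constant factor cancels. The sketch below only checks that identity on two random unit vectors.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

rng = random.Random(1)
x = normalize([rng.gauss(0, 1) for _ in range(5)])
y = normalize([rng.gauss(0, 1) for _ in range(5)])

sq_euclidean = sum((a - b) ** 2 for a, b in zip(x, y))
cosine_distance = 1.0 - sum(a * b for a, b in zip(x, y))

# For unit vectors: ||x - y||^2 = 2 - 2 * cos(x, y) = 2 * cosine_distance
print(sq_euclidean, 2.0 * cosine_distance)
```

For inputs that are not unit length the two distances differ, and the concern raised above would apply.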

Cannot import spherecluster with scikit-learn 1.0.2: sklearn.cluster.k_means_ has been renamed

The traceback is

Traceback (most recent call last):
  File "//python3.9/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    import spherecluster
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "//python3.9/site-packages/spherecluster/__init__.py", line 2, in
    from .spherical_kmeans import SphericalKMeans
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "//python3.9/site-packages/spherecluster/spherical_kmeans.py", line 7, in
    from sklearn.cluster.k_means_ import (
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ModuleNotFoundError: No module named 'sklearn.cluster.k_means_'

The problem is that 'sklearn.cluster.k_means_' has been renamed to 'sklearn.cluster._kmeans' in some intermediate scikit-learn version.
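
To see which of the two module names mentioned here exists in a given environment, something like the following could be run. The helper first_importable is hypothetical and only locates a module. Both names refer to private scikit-learn modules, so functions imported from them may also differ between versions.

```python
import importlib.util

def first_importable(candidates):
    """Return the first module name in `candidates` that can be located, else None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ImportError:
            # Raised when a parent package (e.g. `sklearn`) is itself missing.
            continue
    return None

# Newer private module name first, then the older one from the traceback.
print(first_importable(["sklearn.cluster._kmeans", "sklearn.cluster.k_means_"]))
```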

ImportError: cannot import name '_k_means'

Hi. I installed spherecluster successfully using pip install spherecluster on Ubuntu 18.04. But when I run from spherecluster import SphericalKMeans, I get an ImportError.

Traceback (most recent call last):
  File "LM/vectors_cluster.py", line 9, in <module>
    from spherecluster import SphericalKMeans
  File "/path_to_anaconda/lib/python3.6/site-packages/spherecluster/__init__.py", line 2, in <module>
    from .spherical_kmeans import SphericalKMeans
  File "/path_to_anaconda/lib/python3.6/site-packages/spherecluster/spherical_kmeans.py", line 16, in <module>
    from sklearn.cluster import _k_means
ImportError: cannot import name '_k_means'

Here is my environment information:

Package	Version
numpy	1.14.3
scipy	1.1.0
scikit-learn	0.22.2.post1
pytest	3.5.1
nose	1.3.7
joblib	0.14.1
spherecluster	0.1.7

If anyone can help me, I would really appreciate it!

TypeError: Expected sequence or array-like, got <class 'NoneType'>

self.clus.fit(self.data)
File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\spherecluster\spherical_kmeans.py", line 363, in fit
return_n_iter=True,
File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\spherecluster\spherical_kmeans.py", line 189, in spherical_k_means
random_state=random_state,
File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\spherecluster\spherical_kmeans.py", line 39, in _spherical_kmeans_single_lloyd
sample_weight = _check_sample_weight(X, sample_weight)
File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\sklearn\utils\validation.py", line 1215, in _check_sample_weight
n_samples = _num_samples(X)
File "D:\工程\知识图谱自动构建\AutoBuild\venv\lib\site-packages\sklearn\utils\validation.py", line 147, in _num_samples
raise TypeError(message)
TypeError: Expected sequence or array-like, got <class 'NoneType'>

Hi, sklearn defines the signature as "_check_sample_weight(sample_weight, X, dtype=None)", so the parameter order is "sample_weight, X". But spherical_kmeans.py (line 39) calls it as "_check_sample_weight(X, sample_weight)". Does the swapped argument order cause this error?
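
The effect of swapping the two positional arguments can be reproduced with a stand-in function. check_sample_weight below is hypothetical: it only mirrors the (sample_weight, X) parameter order quoted above and is not sklearn's implementation.

```python
def check_sample_weight(sample_weight, X):
    """Hypothetical stand-in mirroring the (sample_weight, X) parameter order."""
    n_samples = len(X)  # raises TypeError when X is None
    if sample_weight is None:
        return [1.0] * n_samples
    return list(sample_weight)

X = [[1.0, 0.0], [0.0, 1.0]]

# Correct order: a missing weight vector is filled with ones.
print(check_sample_weight(None, X))

# Swapped order: None lands in the X slot and the length check fails.
try:
    check_sample_weight(X, None)
except TypeError as exc:
    print("TypeError:", exc)
```

With the swapped call, the None passed as sample_weight ends up where X is expected, which is consistent with the "got <class 'NoneType'>" message in the traceback.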

Using it for dataframe

Hello, how can I use this code with a DataFrame loaded from a CSV file?
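
A minimal sketch of the data preparation, assuming the CSV has a header row and only numeric feature columns. The column names and values are made up, and the standard csv module stands in for a DataFrame so the example has no extra dependencies. With pandas, the numeric columns would be converted to a 2-D array in the same spirit. Since the package clusters points on the unit sphere, each row is scaled to unit L2 norm.

```python
import csv
import io
import math

# Stand-in for open("data.csv"); the file name and columns are hypothetical.
csv_text = "f1,f2,f3\n3,4,0\n0,0,2\n1,1,1\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # skip the column names
rows = [[float(v) for v in row] for row in reader]

def l2_normalize(row):
    """Scale one sample to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

# Each sample becomes a unit vector; this 2-D structure is what fit() would receive.
X = [l2_normalize(r) for r in rows]
print(X[0])
```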

Returned labels are floats in VonMisesFisherMixture (soft and hard)

Spherical KMeans returns integer labels, as expected.
However, VonMisesFisherMixture returns labels as floats, which causes trouble when using them as indices or passing them to functions that only accept integers.
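
Until the return type changes, a caller-side workaround is to cast the labels before indexing. The label values below are made up for illustration.

```python
import numpy as np

# Hypothetical float-valued labels, as described for the mixture model.
labels = np.array([0.0, 2.0, 1.0, 2.0])
names = np.array(["a", "b", "c"])

# Safe here because every label is a whole number stored as a float.
int_labels = labels.astype(int)
print(names[int_labels])
```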

How to use spherecluster in Jupyter notebook

Hi Jason,
I am trying to use the package 'spherecluster' in Jupyter notebook, but I encounter the following message:
ModuleNotFoundError: No module named 'spherecluster'
after the command "import spherecluster"
I have installed the package through the Windows command window without problem.
Could you give me an insight of what I have done wrong?
Thank you.
Dmitry
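
One common cause of this symptom, though not necessarily the one here, is that the notebook kernel runs a different Python interpreter than the one pip installed into. Running the following in a notebook cell shows which interpreter the kernel uses and whether it can see the package.

```python
import importlib.util
import sys

# The interpreter this kernel runs on; compare it with the one pip installed into.
print(sys.executable)

# None means the package is not visible to this interpreter.
spec = importlib.util.find_spec("spherecluster")
print("spherecluster found" if spec is not None else "spherecluster not found")
```

If the package is not found, installing it with that same interpreter (for example by invoking pip as a module of the printed executable) is worth trying.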

SphericalKMeans does not converge

Hi,
for some reason SphericalKMeans doesn't find any valid centroids on my data. The same happens with randomly generated data that clearly has clusters.

Running the following code always leads to the cluster centers [1. 1. 1.]:

from scipy.stats import vonmises
import numpy as np
from spherecluster import SphericalKMeans

ang = vonmises.rvs(25, loc=1, size=100)
ang = np.hstack((ang, vonmises.rvs(25, loc=3, size=100)))
ang = np.hstack((ang, vonmises.rvs(25, loc=5, size=100)))

skm = SphericalKMeans(n_clusters=3)
skm.fit(ang.reshape((-1, 1)))
print skm.cluster_centers_

Python 2.7.12, spherecluster 0.1.2

Am I missing something?
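
One plausible reading of this result: the angles are passed as a single feature column, and a 1-D point scaled to unit length keeps only its sign. The sketch below uses random.vonmisesvariate from the standard library in place of scipy's sampler, so it is an analogue of the snippet above and not a reproduction. It shows the collapse and an alternative embedding of each angle as a point on the unit circle.

```python
import math
import random

rng = random.Random(0)
# Angles (radians) concentrated around 1, 3 and 5, mimicking the snippet above.
angles = [rng.vonmisesvariate(mu, 25.0) for mu in (1.0, 3.0, 5.0) for _ in range(100)]

# Treated as 1-D points, normalizing to unit length leaves only the sign,
# so every positive angle collapses to the same point +1.
as_1d = [a / abs(a) for a in angles]

# Embedding each angle on the unit circle keeps the directional information.
X = [[math.cos(a), math.sin(a)] for a in angles]

print(sorted(set(as_1d)), len(X), len(X[0]))
```

Fitting on the two-column X, instead of the reshaped angle column, would give the clustering something to separate.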

Mistake in sample_vMF

In the function _sample_weight, b is wrongly calculated as b = dim / (np.sqrt(4. * kappa ** 2 + dim ** 2) + 2 * kappa).
The reference material (eq 4 in Wood 1994) has it as b = (-2 * kappa + sqrt(4 * kappa ** 2 + dim ** 2)) / dim
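
Taking both expressions exactly as written here, with dim denoting the same quantity in each, they are algebraically identical: with s = sqrt(4 * kappa ** 2 + dim ** 2), the product (s - 2 * kappa) * (s + 2 * kappa) equals dim ** 2, so (s - 2 * kappa) / dim = dim / (s + 2 * kappa). The function names below are illustrative. The check compares the two forms numerically.

```python
import math

def b_as_in_code(kappa, dim):
    """Form quoted from the package."""
    return dim / (math.sqrt(4.0 * kappa ** 2 + dim ** 2) + 2.0 * kappa)

def b_as_in_reference(kappa, dim):
    """Form quoted from the reference."""
    return (-2.0 * kappa + math.sqrt(4.0 * kappa ** 2 + dim ** 2)) / dim

for kappa, dim in [(0.5, 2), (10.0, 3), (100.0, 50)]:
    print(kappa, dim, b_as_in_code(kappa, dim), b_as_in_reference(kappa, dim))
```

The first form avoids subtracting two nearly equal numbers when kappa is large. Any real discrepancy would therefore have to come from what dim stands for, not from the shape of the formula.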

ValueError: Data l2-norm must be 1, found 0.0

Thank you for your great source code.
While using the soft von_mises_fisher_mixture, I got this error. It only happens with 1000 short documents; it runs fine with a smaller number. The full error log is below.

Could you please show me how to fix it? Thank you so much

File "mvmf_document_clustering.py", line 65, in <module>
    vmf_soft.fit(X)
  File "C:\Users\phuocphan\miniconda3\envs\Py36\lib\site-packages\spherecluster\von_mises_fisher_mixture.py", line 826, in fit
    X = self._check_fit_data(X)
  File "C:\Users\phuocphan\miniconda3\envs\Py36\lib\site-packages\spherecluster\von_mises_fisher_mixture.py", line 789, in _check_fit_data
    raise ValueError("Data l2-norm must be 1, found {}".format(n))
ValueError: Data l2-norm must be 1, found 0.0
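
A norm of 0.0 suggests that at least one row of X is all zeros, for example a short document with no terms left after vectorizing. That is a guess based on the message, not something the log confirms. Under that assumption, a sketch of one fix is to drop zero rows and normalize the rest before fitting.

```python
import math

# Toy document-term rows; the second one is empty (all zeros).
X = [[1.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 4.0]]

norms = [math.sqrt(sum(v * v for v in row)) for row in X]
keep = [i for i, n in enumerate(norms) if n > 0.0]

# Drop zero rows, then scale the remaining rows to unit L2 norm.
X_unit = [[v / norms[i] for v in X[i]] for i in keep]
print(keep, X_unit)
```

Keeping the keep indices around makes it possible to map cluster labels back to the original documents.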

This repo is dead

The version on PyPI does not work, as described in #29 and #35, and #34 is not merged.
There is a fork https://github.com/rfayat/spherecluster but it doesn't work for me either.
Downgrading to scikit-learn==0.20.0 also didn't work for some reason; maybe I'm doing it wrong.

return value of movMF function includes None when n_jobs!=1

Hello.

I found that the result of VonMisesFisherMixture() with n_jobs=-1 includes None.
I think "best_posterior" is never set in the else branch of movMF().

This library has been very helpful.
Thank you as always.

Question about sample_vMF

Hi,
@jasonlaska Thanks for your code! I learned a lot from it!

Recently I ran into a problem: I used 'VonMisesFisherMixture' to estimate the distribution of some sequence data and then used 'sample_vMF' to produce pseudo samples. All pseudo samples follow the same trend as the real ones, but their Y values are always smaller than those of the real samples.

Later, I created a list of 10 sequences, all with the value of [0.95 , 0.9, 0.85, 0.8, ..., 0.05, 0]. In my opinion, when I produced pseudo samples from vmf distribution, I should get the same sequence. However, I got [3.82291703e-01, 3.62182865e-01, 3.42056000e-01, 3.21941382e-01, 3.01821386e-01, 2.81705993e-01, 2.61568550e-01, 2.41452144e-01, 2.21317166e-01, 2.01214795e-01, 1.81081612e-01, 1.60967944e-01, 1.40855783e-01, 1.20740353e-01, 1.00600080e-01, 8.04905383e-02, 6.03600387e-02, 4.02586200e-02, 2.01175857e-02, -8.79339154e-06], which is still a straight line, but each value becomes smaller (that is, 0.95 -> 0.382). Why is that? How to solve this problem?
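
The von Mises-Fisher distribution lives on the unit sphere, so its samples have unit L2 norm and carry direction only, not scale. Dividing the 20-point sequence from this question by its own norm gives values that can be compared with the sampled ones.

```python
import math

# The 20-point sequence from the question: 0.95, 0.90, ..., 0.05, 0.0
seq = [0.05 * k for k in range(19, -1, -1)]

norm = math.sqrt(sum(v * v for v in seq))
unit = [v / norm for v in seq]

print(norm)      # L2 norm of the original sequence
print(unit[:3])  # first entries after projecting onto the unit sphere
```

The first normalized entries come out near 0.382, 0.362 and 0.342, in line with the sampled values quoted above. If the original scale matters, one option is to store each sequence's norm separately and multiply it back after sampling.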

Thank you!

VMF scaling denominator was inf

Hi
I tried to run the soft version and got the error 'VMF scaling denominator was inf'. What could be the reason for that?
Another question: is there any way to estimate the number of clusters?

Thanks

Strict Matplotlib requirement prevents installation on python 3.6

Hello,

Thanks for making this package, it is really useful!

I am using python 3.6 on OS X 10.12.6 and I tried to install the package through pip. The installation fails while trying to install the matplotlib version specified in the requirements.txt file. This is, I think, a known issue for matplotlib which was fixed in later versions (see issues like matplotlib/matplotlib#3889). I have successfully installed matplotlib version 2.0.2 in my environment.

I think that there are 3 ways of solving this issue:

1. Bump the version of matplotlib this package depends on.
2. Remove the strict dependency (==) requirement.
3. Remove the matplotlib dependency altogether. From what I understand, matplotlib is only used in the examples and is thus not needed for the packaged version of the library (similar to seaborn and tabulate).

Spherical K-Means is producing different results each run even when fixing `random_state` at an integer

It seems to me that fixing the random_state parameter at an integer when calling SphericalKMeans constructor is still seeding a numpy.random.RandomState pseudo-random number generator and thus not making Spherical K-Means completely deterministic. The consequent call to _k_init provides it with a random_state of the type RandomState instance instead of int, which preserves the randomness as per the routine's documentation:

    random_state : int, RandomState instance
        The generator used to initialize the centers. Use an int to make the
        randomness deterministic.

As a result, I get slightly different results across runs.

I want the algorithm to be deterministic for the sake of research, am I missing something in that regard?

P.S. I am initialising the algorithm with k-means++.
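
For reference, a numpy.random.RandomState built from the same integer produces the same stream every time, so handing a seeded instance down to the initialization routine is not by itself a source of run-to-run variation. The sketch below checks only that property. It does not investigate where the observed differences come from.

```python
import numpy as np

def draw(seed):
    """Create a fresh RandomState from an int seed and draw five numbers."""
    rs = np.random.RandomState(seed)
    return rs.random_sample(5)

a = draw(42)
b = draw(42)
print(np.array_equal(a, b))
```

If results still differ with a fixed integer seed, the cause is more likely elsewhere, for example in how seeds are derived for repeated initializations or in parallel execution.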

pip package not up to date?

First of all, thank you for sharing this package!

I'm installing spherecluster with pip and had to manually edit spherical_kmeans.py to fix the import of _k_means (I changed it to from sklearn.cluster import _k_means_fast as _k_means).

I can see this change has already been made in the repo.

Maybe the pip package isn't up to date?

Best
Mehdi

Add `normalize=True` parameter

Add normalize=True parameter that normalizes data (optional to user) to both classes so that check_estimator can be applied in tests.

installation: no module spherical kmeans

I'm using Anaconda and installing into an environment with Python 3.4.
I tried installing via pip and via python setup.py install, both in this environment and in a global Python 2.7 environment, each time uninstalling everything and trying again. No luck. Everywhere I get a no-module "spherical_kmeans" message. What can be done here? Thanks a lot!

Using Spherical clustering for Mini-Batch K-Means

Thank you for sharing this great package,
I wanted to experiment with K-Means on a data set big enough to require a Mini-Batch version of K-Means.
Do you have any pointers on how I could extend your implementation to Mini-Batch?

typo in the Readme

skm.intertia_ might be a typo.

I think it should be skm.inertia_.

Error in _sample_orthonormal_to

In line 55 in spherecluster/util.py, in the _sample_orthonormal_to function, it reads:

proj_mu_v = mu * np.dot(mu, v) / np.linalg.norm(mu)

Shouldn't it instead be:

proj_mu_v = mu * np.dot(mu, v) / np.linalg.norm(mu)**2

If norm(mu)=1 then it doesn't make a difference, but otherwise they are quite different.
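
The difference can be checked directly: subtracting the projection of v onto mu from v should leave a residual orthogonal to mu. The helper below is illustrative and not the package's function. It compares both denominators for a mu that is deliberately not unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(mu, v, squared):
    """Project v onto mu, dividing by ||mu|| or by ||mu||**2."""
    norm = math.sqrt(dot(mu, mu))
    denom = norm ** 2 if squared else norm
    return [m * dot(mu, v) / denom for m in mu]

mu = [2.0, 0.0, 0.0]  # deliberately not unit length
v = [1.0, 1.0, 0.0]

for squared in (False, True):
    proj = project(mu, v, squared)
    residual = [a - b for a, b in zip(v, proj)]
    # The residual is orthogonal to mu only for the correct projection.
    print(squared, dot(residual, mu))
```

Only the squared-norm version gives a zero dot product here, which matches the point made above. If callers always pass a unit-length mu, the two versions coincide.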

black dependency breaks python3.5 install

black is only available for python3.6 and higher; its presence in requirements.txt makes installation of this package on python3.5 (under, for me, Ubuntu 16.04 LTS) fail.

Since black doesn't seem to be used by the package (just in the dev process?) maybe it can be removed from requirements.txt? Doing so makes local installation work fine for me.
