jasonlaska / spherecluster
Clustering routines for the unit sphere
Home Page: https://medium.com/@jaska_at_clara/simple-datetime-disambiguation-fd2374ce664a
License: MIT License
I've been getting an
AttributeError: 'SphericalKMeans' object has no attribute '_check_fit_data'
error every time I try to run a SphericalKMeans fit. Diving a little further, it looks like the _check_fit_data method only exists in two locations in the repository...
Line 318 in spherical_kmeans.py (the line that throws the error in this case).
Line 772 in von_mises_fisher_mixture.py where it seems to be a defined method.
Based on what I can see from the imports, it looks like the _check_fit_data doesn't actually exist in the context of spherical_kmeans.py, so the error kind of makes sense.
Could this be the result of some accidental deletions? I went through the commit history and couldn't find anything that immediately seemed like the issue. Or is there something very obvious that I'm missing... wouldn't be the first time :)
Also as an FYI, I'm running Python 3.6.4.
Hello,
I stumbled over the following issue.
When installing a package that depends on spherecluster, and spherecluster is installed from the source distribution (.tgz), the installation fails with the message "numpy is required during installation",
raised from setup.py#12, even when the package correctly lists numpy (and scipy) as dependencies.
How to reproduce:
from setuptools import setup
setup(
    name="spherecluster_test",
    version="1.0.0",
    install_requires=["numpy", "scipy", "spherecluster"]
)
python setup.py bdist_wheel
pip install --no-binary "spherecluster" dist/spherecluster_test-1.0.0-py3-none-any.whl
Processing ./dist/spherecluster_test-1.0.0-py3-none-any.whl
Collecting scipy (from spherecluster-test==1.0.0)
Using cached https://files.pythonhosted.org/packages/a8/0b/f163da98d3a01b3e0ef1cab8dd2123c34aee2bafbb1c5bffa354cc8a1730/scipy-1.1.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting numpy (from spherecluster-test==1.0.0)
Using cached https://files.pythonhosted.org/packages/16/21/2e88568c134cc3c8d22af290865e2abbd86efa58a1358ffcb19b6c74f9a3/numpy-1.15.3-cp36-cp36m-manylinux1_x86_64.whl
Collecting spherecluster (from spherecluster-test==1.0.0)
Using cached https://files.pythonhosted.org/packages/27/27/614b9e568e9a9a8d46938310b7caf092657343bf037b9fae416baf611d06/spherecluster-0.1.6.tar.gz
Complete output from command python setup.py egg_info:
numpy is required during installation
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-gtqjhbjz/spherecluster/
I suggest removing the following lines from setup.py#L9-L19.
try:
    import numpy # NOQA
except ImportError:
    print('numpy is required during installation')
    sys.exit(1)
try:
    import scipy # NOQA
except ImportError:
    print('scipy is required during installation')
    sys.exit(1)
The error occurs after running:
from spherecluster import SphericalKMeans
skm = SphericalKMeans(n_clusters=2)
skm.fit(wv) # wv is a (6, 200) numpy array
Hi, I have a question about how centroids are updated in your code:
# computation of the means
if sp.issparse(X):
    centers = _k_means._centers_sparse(X, labels, n_clusters,
                                       distances)
else:
    centers = _k_means._centers_dense(X, labels, n_clusters, distances)
# l2-normalize centers (this is the main contribution here)
centers = normalize(centers)
When clustering with cosine similarity, if you just normalize the centers calculated with _k_means._centers_XXX, which were designed to update centers under Euclidean distance, won't the resulting centers point in different directions from what they should?
I hope I've described my question clearly. Looking forward to your reply. Thanks!
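A quick numerical check may help with this question. It is a minimal sketch on made-up data, not the library's code. The Euclidean mean of a cluster is the vector sum divided by the number of points, so after l2-normalization it points in the same direction as the normalized sum. By the Cauchy-Schwarz inequality, that direction is also the unit vector with the largest total cosine similarity to unit-norm points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six made-up points projected onto the unit sphere in 3 dimensions.
X = rng.normal(size=(6, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)


def unit(v):
    """Scale a vector to unit l2-norm."""
    return v / np.linalg.norm(v)


def total_cosine(c):
    """Sum of cosine similarities between every row of X and unit vector c."""
    return float((X @ c).sum())


# Euclidean-style mean followed by l2-normalization.
normalized_mean = unit(X.mean(axis=0))

# Direction of the plain vector sum.
normalized_sum = unit(X.sum(axis=0))

# Compare against many random unit directions.
candidates = rng.normal(size=(1000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = max(total_cosine(c) for c in candidates)

print(np.allclose(normalized_mean, normalized_sum))
print(total_cosine(normalized_mean) >= best_random)
```

So, at least for unit-norm inputs, normalizing the Euclidean mean does not appear to change the direction of the center.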
spherecluster/spherecluster/spherical_kmeans.py
Lines 44 to 48 in 701b0b1
I might be getting this wrong, but the code here seems to use the initialization function from sklearn. This could cause issues, since the k-means++ initialization in sklearn is based on Euclidean distance. It should be replaced with cosine distance.
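One identity seems relevant here. For unit-norm vectors, the squared Euclidean distance equals twice the cosine distance: ||x - y||^2 = 2 * (1 - cos(x, y)). The sketch below checks it on made-up data. Assuming the rows are already l2-normalized when initialization runs (not confirmed here), sampling proportional to squared Euclidean distance would then be the same as sampling proportional to cosine distance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Five made-up points on the unit sphere in 4 dimensions.
X = rng.normal(size=(5, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Pairwise cosine distance: 1 - <x, y> for unit vectors.
cosine_distance = 1.0 - X @ X.T

# Pairwise squared Euclidean distance, computed directly.
diff = X[:, None, :] - X[None, :, :]
squared_euclidean = (diff ** 2).sum(axis=-1)

identity_holds = np.allclose(squared_euclidean, 2.0 * cosine_distance)
print(identity_holds)
```

The identity does not hold for rows that are not unit-norm, so the concern would apply if initialization ran on unnormalized data.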
The traceback is
Traceback (most recent call last):
  File "//python3.9/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    import spherecluster
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "//python3.9/site-packages/spherecluster/__init__.py", line 2, in
    from .spherical_kmeans import SphericalKMeans
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "//python3.9/site-packages/spherecluster/spherical_kmeans.py", line 7, in
    from sklearn.cluster.k_means_ import (
  File "/snap/pycharm-community/267/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ModuleNotFoundError: No module named 'sklearn.cluster.k_means_'
The problem is that 'sklearn.cluster.k_means_' has been renamed to 'sklearn.cluster._kmeans' in some intermediate scikit-learn version.
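A common way to cope with a renamed module is to try the candidate names in order. The sketch below uses only the standard library; `import_first_available` is a hypothetical helper, not part of spherecluster or scikit-learn. The private functions inside the module may also have changed between versions, so an import fallback alone might not be enough.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that imports successfully.

    Hypothetical helper for illustration only.
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidates could be imported: " + "; ".join(errors)
    )


if __name__ == "__main__":
    try:
        # New module name first, then the old one named in this issue.
        kmeans_module = import_first_available(
            ["sklearn.cluster._kmeans", "sklearn.cluster.k_means_"]
        )
        print(kmeans_module.__name__)
    except ImportError as exc:
        print(exc)
```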
Hi. I installed spherecluster successfully on Ubuntu 18.04 using pip install spherecluster. But when I call from spherecluster import SphericalKMeans, I get an ImportError.
Traceback (most recent call last):
  File "LM/vectors_cluster.py", line 9, in <module>
    from spherecluster import SphericalKMeans
  File "/path_to_anaconda/lib/python3.6/site-packages/spherecluster/__init__.py", line 2, in <module>
    from .spherical_kmeans import SphericalKMeans
  File "/path_to_anaconda/lib/python3.6/site-packages/spherecluster/spherical_kmeans.py", line 16, in <module>
    from sklearn.cluster import _k_means
ImportError: cannot import name '_k_means'
Here is my environmental information:
Package | Version |
---|---|
numpy | 1.14.3 |
scipy | 1.1.0 |
scikit-learn | 0.22.2.post1 |
pytest | 3.5.1 |
nose | 1.3.7 |
joblib | 0.14.1 |
spherecluster | 0.1.7 |
If anyone can help me, I would really appreciate it!
Hi, sklearn defines the function as "_check_sample_weight(sample_weight, X, dtype=None)", so the parameter order is "sample_weight, X". But spherical_kmeans (line 39) calls it as "_check_sample_weight(X, sample_weight)". Could the argument order be causing this?
Hello, how can I use this code with a CSV dataframe?
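One possible approach, sketched under assumptions: read the numeric columns into a 2-D NumPy array and scale each row to unit l2-norm before fitting. The CSV content and column names below are made up, and an in-memory string stands in for a real file path.

```python
import io

import numpy as np

# Stand-in for a CSV file on disk (hypothetical content and column names).
csv_text = io.StringIO("f1,f2,f3\n1.0,2.0,2.0\n0.0,3.0,4.0\n2.0,0.0,0.0\n")

# Skip the header row and read the numeric columns into a 2-D array.
X = np.genfromtxt(csv_text, delimiter=",", skip_header=1)

# Scale every row to unit l2-norm so each sample lies on the unit sphere.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_unit)

# With spherecluster installed, fitting would follow the usage shown
# elsewhere in this thread:
# from spherecluster import SphericalKMeans
# skm = SphericalKMeans(n_clusters=2)
# skm.fit(X_unit)
```

Non-numeric columns would need to be dropped or encoded first, and rows that are entirely zero cannot be normalized.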
Spherical KMeans returns integer labels, as expected.
However, VonMisesFisherMixture returns labels as floats, which causes trouble when passing them to functions that only accept integer indices.
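Until the labels are returned as integers, a cast on the caller's side is a simple workaround. The sketch below uses a made-up float array as a stand-in for the returned labels.

```python
import numpy as np

# Stand-in for float-valued labels as described in this issue.
labels = np.array([0.0, 2.0, 1.0, 2.0])
names = np.array(["a", "b", "c"])

# NumPy refuses float arrays as indices.
try:
    names[labels]
    float_indexing_works = True
except IndexError:
    float_indexing_works = False

# Casting to int makes the labels usable for indexing and counting.
int_labels = labels.astype(int)
picked = names[int_labels]
counts = np.bincount(int_labels)

print(float_indexing_works, picked, counts)
```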
Hi Jason,
I am trying to use the package 'spherecluster' in Jupyter notebook, but I encounter the following message:
ModuleNotFoundError: No module named 'spherecluster'
after the command "import spherecluster"
I have installed the package through the Windows command window without problem.
Could you give me some insight into what I have done wrong?
Thank you.
Dmitry
Hi,
For some reason SphericalKMeans doesn't find any valid centroids on my data. The same happens with randomly generated data, which clearly has clusters.
Running the following code always leads to the cluster centers [1. 1. 1.]:
from scipy.stats import vonmises
import numpy as np
from spherecluster import SphericalKMeans
ang = vonmises.rvs(25, loc=1, size=100)
ang = np.hstack((ang, vonmises.rvs(25, loc=3, size=100)))
ang = np.hstack((ang, vonmises.rvs(25, loc=5, size=100)))
skm = SphericalKMeans(n_clusters=3)
skm.fit(ang.reshape((-1, 1)))
print skm.cluster_centers_
Python 2.7.12, spherecluster 0.1.2
Am I missing something?
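A possible explanation, assuming the estimator works with l2-normalized rows: a single column of angles is one-dimensional data, and a 1-D row can only normalize to +1 or -1, so all positive angles collapse onto the same point. Embedding each angle on the unit circle keeps the angular information. The sketch below uses Gaussian noise around the three angles as a stand-in for the von Mises samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the three groups of angles from the snippet above.
ang = np.concatenate(
    [mu + 0.2 * rng.normal(size=100) for mu in (1.0, 3.0, 5.0)]
)

# One column of angles: l2-normalizing each 1-D row gives only +1 or -1.
column = ang.reshape((-1, 1))
collapsed = column / np.abs(column)

# Embedding each angle on the unit circle gives distinct 2-D unit vectors.
X = np.column_stack((np.cos(ang), np.sin(ang)))

print(np.unique(collapsed))
print(X.shape)
```

Fitting on `X` instead of `ang.reshape((-1, 1))` should give the estimator three separable groups on the circle.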
In the function _sample_weight, b is wrongly calculated as b = dim / (np.sqrt(4. * kappa ** 2 + dim ** 2) + 2 * kappa).
The reference material (eq 4 in Wood 1994) has it as b = (-2 * kappa + sqrt(4 * kappa ** 2 + dim ** 2)) / dim
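The two expressions look algebraically equivalent, assuming dim denotes the same quantity in both. Multiplying the numerator and denominator of the second form by sqrt(4 * kappa ** 2 + dim ** 2) + 2 * kappa turns the numerator into dim ** 2, which reduces to the first form. A short numerical check, with made-up values of kappa and dim:

```python
import math


def b_rationalized(kappa, dim):
    """Form quoted from the code in this issue."""
    return dim / (math.sqrt(4.0 * kappa ** 2 + dim ** 2) + 2.0 * kappa)


def b_reference(kappa, dim):
    """Form quoted from the reference in this issue."""
    return (-2.0 * kappa + math.sqrt(4.0 * kappa ** 2 + dim ** 2)) / dim


# Made-up (kappa, dim) pairs for illustration.
pairs = [(0.5, 2), (10.0, 3), (100.0, 50)]
agree = all(
    math.isclose(b_rationalized(k, d), b_reference(k, d), rel_tol=1e-9)
    for k, d in pairs
)
print(agree)
```

The first form avoids subtracting two nearly equal numbers when kappa is much larger than dim, which may be why it was written that way.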
Thank you for your great source code.
While using the soft von_mises_fisher_mixture, I got this error. It only happens with 1000 short documents; it runs fine with a smaller number. Below is the full error log.
Could you please show me how to fix it? Thank you so much.
  File "mvmf_document_clustering.py", line 65, in <module>
    vmf_soft.fit(X)
  File "C:\Users\phuocphan\miniconda3\envs\Py36\lib\site-packages\spherecluster\von_mises_fisher_mixture.py", line 826, in fit
    X = self._check_fit_data(X)
  File "C:\Users\phuocphan\miniconda3\envs\Py36\lib\site-packages\spherecluster\von_mises_fisher_mixture.py", line 789, in _check_fit_data
    raise ValueError("Data l2-norm must be 1, found {}".format(n))
ValueError: Data l2-norm must be 1, found 0.0
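The "found 0.0" in the message suggests that at least one row of X is all zeros, which cannot be scaled to unit norm. With short documents this could happen when a document has no terms in the vocabulary, though that is a guess. A minimal sketch, on made-up data, of dropping such rows before normalizing:

```python
import numpy as np

# Made-up feature matrix; the second row is all zeros.
X = np.array(
    [
        [3.0, 4.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 2.0],
    ]
)

norms = np.linalg.norm(X, axis=1)
nonzero = norms > 0

# Keep only rows with a nonzero norm and scale them to unit length.
X_unit = X[nonzero] / norms[nonzero, None]

# Indices of the rows that were dropped, for bookkeeping.
dropped = np.flatnonzero(~nonzero)

print(X_unit)
print(dropped)
```

Keeping the `dropped` indices makes it possible to map the fitted labels back to the original documents.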
The version on PyPI does not work, as described in #29 and #35, and #34 is not merged.
There is a fork, https://github.com/rfayat/spherecluster, but it doesn't work for me either.
Downgrading to scikit-learn==0.20.0 also didn't work for some reason; maybe I'm doing it wrong.
Hello.
I found that the result of VonMisesFisherMixture() with n_jobs=-1 includes None.
I think "best_posterior" is never set in the else branch of movMF().
This library has been very helpful.
Thank you as always.
Hi,
@jasonlaska Thanks for your code! I learned a lot from it.
Recently I ran into a problem: when I used 'VonMisesFisherMixture' to estimate the distribution of sequence data, and then used 'sample_vMF' to produce some pseudo samples, I found that all pseudo samples have the same trend as the real ones, but the Y values are always smaller than those of the real samples.
Later, I created a list of 10 sequences, all with the values [0.95 , 0.9, 0.85, 0.8, ..., 0.05, 0]. I expected that pseudo samples drawn from the vMF distribution would reproduce the same sequence. However, I got [3.82291703e-01, 3.62182865e-01, 3.42056000e-01, 3.21941382e-01, 3.01821386e-01, 2.81705993e-01, 2.61568550e-01, 2.41452144e-01, 2.21317166e-01, 2.01214795e-01, 1.81081612e-01, 1.60967944e-01, 1.40855783e-01, 1.20740353e-01, 1.00600080e-01, 8.04905383e-02, 6.03600387e-02, 4.02586200e-02, 2.01175857e-02, -8.79339154e-06], which is still a straight line, but each value is smaller (that is, 0.95 -> 0.382). Why is that? How can I solve this problem?
Thank you!
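The numbers are consistent with unit-norm scaling. The von Mises-Fisher distribution lives on the unit sphere, so samples have l2-norm 1. The 20-value sequence above has a norm of about 2.485, and 0.95 divided by that is about 0.382. A minimal sketch of the arithmetic, assuming the sequence has 20 evenly spaced values as in the reported output:

```python
import numpy as np

# The sequence from this post: 0.95, 0.90, ..., 0.05, 0.0 (20 values).
sequence = np.linspace(0.95, 0.0, 20)

# Its l2-norm, and the same sequence scaled onto the unit sphere.
norm = np.linalg.norm(sequence)
on_sphere = sequence / norm

# Multiplying by the stored norm restores the original scale.
restored = on_sphere * norm

print(norm)
print(on_sphere[0])
```

If the original magnitudes matter, one option is to keep the norms of the real samples separately and rescale the pseudo samples afterwards. Direction and magnitude are then modeled separately, which is a design choice rather than something the library does for you.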
Hi
I tried to run the soft version, and this error appears: 'VMF scaling denominator was inf'. What could be the reasons for that?
Another question: is there any way to estimate the number of clusters?
Thanks
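One plausible cause, assuming the normalizing constant is computed from a modified Bessel function of the first kind: that function grows roughly like exp(kappa) and overflows double precision for large concentration values. The sketch below shows the overflow with SciPy and the usual workaround of using the exponentially scaled variant and working in log space. The order and kappa values are made up.

```python
import numpy as np
from scipy.special import iv, ive

order = 0.5      # made-up Bessel order
kappa = 1000.0   # made-up large concentration

# The plain Bessel function overflows to inf for large arguments.
plain = iv(order, kappa)

# ive(v, x) equals iv(v, x) * exp(-abs(x)) for real x, which stays finite.
scaled = ive(order, kappa)

# log(iv(v, x)) recovered without overflow.
log_bessel = np.log(scaled) + kappa

print(plain, scaled, log_bessel)
```

If this is what is happening, very tightly concentrated clusters or high-dimensional data would be the likely triggers.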
Hello,
Thanks for making this package, it is really useful!
I am using python 3.6 on OS X 10.12.6 and I tried to install the package through pip. The installation fails while trying to install the matplotlib version specified in the requirements.txt file. This is, I think, a known issue for matplotlib which was fixed in later versions (see issues like matplotlib/matplotlib#3889). I have successfully installed matplotlib version 2.0.2 in my environment.
I think that there are 3 ways of solving this issue.
matplotlib
this package depends on matplotlib
dependency altogether. From what I understand, matplotlib is only used in the examples and is thus not needed for the packaged version of the library (similar to seaborn and tabulate)
It seems to me that fixing the random_state parameter at an integer when calling the SphericalKMeans constructor is still seeding a numpy.random.RandomState pseudo-random number generator and thus not making Spherical K-Means completely deterministic. The consequent call to _k_init provides it with a random_state of the type RandomState instance instead of int, which preserves the randomness as per the routine's documentation:
random_state : int, RandomState instance
The generator used to initialize the centers. Use an int to make the
randomness deterministic.
As a result, I get slightly different results across runs.
I want the algorithm to be deterministic for the sake of research. Am I missing something in that regard?
P.S. I am initialising the algorithm with k-means++.
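For what it is worth, a RandomState built from a fixed integer seed produces the same stream every time it is created, so passing such an instance instead of the int should not add randomness by itself. The sketch below shows this with NumPy alone. Differences would more likely come from reusing one instance across several calls, as in the second half, or from some other source that is not identified here.

```python
import numpy as np

# Two generators created from the same integer seed give identical draws.
first = np.random.RandomState(42).rand(5)
second = np.random.RandomState(42).rand(5)

# Reusing one instance advances its state, so consecutive draws differ.
shared = np.random.RandomState(42)
draw_a = shared.rand(5)
draw_b = shared.rand(5)

print(np.array_equal(first, second))
print(np.array_equal(draw_a, draw_b))
```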
Hi
First of all, thank you for sharing this package!
I'm installing spherecluster with pip and had to manually edit spherical_kmeans.py to fix the import of _k_means (I changed it to from sklearn.cluster import _k_means_fast as _k_means).
I can see this change has already been made in the repo.
Maybe the pip package isn't up to date?
Best
Mehdi
Add a normalize=True parameter to both classes that normalizes the data (optional for the user), so that check_estimator can be applied in tests.
I'm using Anaconda and installing into an environment with Python 3.4.
I tried installing via pip and via python setup.py install, both in this environment and in a global Python 2.7 environment, each time uninstalling everything and trying again. No luck. Everywhere I get a no-module "spherical_kmeans" message. What can be done here? Thanks a lot!
Thank you for sharing this great package.
I wanted to experiment with K-Means on a data set big enough that it would require a Mini-Batch version of K-Means.
Do you have any pointers on how to extend your implementation to Mini-Batch?
skm.intertia_ might be a typo.
I think it should be skm.inertia_
In line 55 in spherecluster/util.py, in the _sample_orthonormal_to function, it reads:
proj_mu_v = mu * np.dot(mu, v) / np.linalg.norm(mu)
Shouldn't it instead be:
proj_mu_v = mu * np.dot(mu, v) / np.linalg.norm(mu)**2
If norm(mu)=1 then it doesn't make a difference, but otherwise they are quite different.
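A small check of the two variants on made-up vectors, using a mu that is deliberately not unit-norm. Subtracting the projection from v should leave a vector orthogonal to mu, and only the version divided by the squared norm does that.

```python
import numpy as np

# Made-up vectors; mu has norm 2 on purpose.
mu = np.array([2.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0])

# Projection as written in util.py (divides by the norm).
proj_norm = mu * np.dot(mu, v) / np.linalg.norm(mu)

# Projection divided by the squared norm, as proposed in this issue.
proj_norm_sq = mu * np.dot(mu, v) / np.linalg.norm(mu) ** 2

# Remove each projection from v and measure what is left along mu.
residual_norm = v - proj_norm
residual_norm_sq = v - proj_norm_sq

print(np.dot(mu, residual_norm))     # nonzero: not orthogonal to mu
print(np.dot(mu, residual_norm_sq))  # zero: orthogonal to mu
```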
black is only available for python3.6 and higher; its presence in requirements.txt makes installation of this package on python3.5 (under, for me, Ubuntu 16.04 LTS) fail.
Since black doesn't seem to be used by the package (just in the dev process?), maybe it can be removed from requirements.txt? Doing so makes local installation work fine for me.