Can you show me how you're using this sequence in the function call to sample_vMF? The signature takes both mu and kappa parameters, each the length of the dimension (and a number of samples). Each output sample will have unit norm, which may be why you are seeing that scaling effect.
from spherecluster.
Here is my source code:
import numpy as np
from spherecluster import VonMisesFisherMixture, sample_vMF

# ten identical rows of a 19-dimensional ramp
x = [i / 20. for i in range(19)]
x_list = []
for i in range(10):
    x_list.append(np.array(x))
x_list = np.array(x_list)

vmf_soft = VonMisesFisherMixture(n_clusters=1, n_jobs=15, posterior_type='hard')
vmf_soft.fit(x_list)
center = vmf_soft.cluster_centers_[0]
kappa = vmf_soft.concentrations_[0]
fake_sample = np.array(sample_vMF(center, kappa, 2))
Thanks a lot!
Hi @dmortem,
First, I'd like to validate that this is what you intended: you're generating a set of identical 19-dimensional points, so we should expect the clustering algorithm to produce a center identical to these points, with a very high kappa so that most samples lie near this point.
The main thing I notice here is that the input points are not unit normalized (i.e., they are not on the surface of the sphere). If we normalize the rows, we get the following values for each row:
>>> x_list[0] / np.linalg.norm(x_list[0])
array([0. , 0.02177518, 0.04355036, 0.06532553, 0.08710071,
0.10887589, 0.13065107, 0.15242624, 0.17420142, 0.1959766 ,
0.21775178, 0.23952696, 0.26130213, 0.28307731, 0.30485249,
0.32662767, 0.34840285, 0.37017802, 0.3919532 ])
Since the input provided is not normalized, the algorithm will automatically project these points to their closest point on the sphere (i.e., l2-normalize them) as part of the clustering. Looking at the center and kappa from your code reveals:
>>> center
array([0. , 0.02177518, 0.04355036, 0.06532553, 0.08710071,
0.10887589, 0.13065107, 0.15242624, 0.17420142, 0.1959766 ,
0.21775178, 0.23952696, 0.26130213, 0.28307731, 0.30485249,
0.32662767, 0.34840285, 0.37017802, 0.3919532 ])
>>> kappa
10000000000.0
Thus, drawing any number of samples with this mu and kappa results in vectors that are very close to center:
>>> samples = sample_vMF(center, kappa, 5)
>>> samples
array([[ 6.32807952e-06, 2.17864995e-02, 4.35500737e-02,
6.53322527e-02, 8.71095727e-02, 1.08881644e-01,
1.30639007e-01, 1.52421854e-01, 1.74200326e-01,
1.95993355e-01, 2.17752592e-01, 2.39536626e-01,
2.61298267e-01, 2.83075639e-01, 3.04853585e-01,
3.26616624e-01, 3.48406769e-01, 3.70171007e-01,
3.91954664e-01],
[-2.55597079e-06, 2.17727691e-02, 4.35469153e-02,
6.53037181e-02, 8.71155660e-02, 1.08857345e-01,
1.30663982e-01, 1.52433263e-01, 1.74222724e-01,
1.95972540e-01, 2.17760572e-01, 2.39528979e-01,
2.61289916e-01, 2.83078353e-01, 3.04848925e-01,
3.26616358e-01, 3.48405330e-01, 3.70185607e-01,
3.91948825e-01],
[-1.36665318e-06, 2.17944338e-02, 4.35683507e-02,
6.53144059e-02, 8.71042671e-02, 1.08872207e-01,
1.30636564e-01, 1.52429533e-01, 1.74200980e-01,
1.95979053e-01, 2.17759354e-01, 2.39538467e-01,
2.61295634e-01, 2.83088949e-01, 3.04848135e-01,
3.26637761e-01, 3.48401665e-01, 3.70155995e-01,
3.91956256e-01],
[ 1.45076901e-05, 2.17576482e-02, 4.35492214e-02,
6.53258495e-02, 8.70920150e-02, 1.08878649e-01,
1.30659238e-01, 1.52415545e-01, 1.74194117e-01,
1.95975390e-01, 2.17747226e-01, 2.39530462e-01,
2.61307605e-01, 2.83069683e-01, 3.04841559e-01,
3.26632971e-01, 3.48404188e-01, 3.70177861e-01,
3.91965990e-01],
[-2.12690749e-08, 2.17772530e-02, 4.35430105e-02,
6.53248000e-02, 8.71232600e-02, 1.08865881e-01,
1.30640309e-01, 1.52433249e-01, 1.74201411e-01,
1.95978082e-01, 2.17733089e-01, 2.39521882e-01,
2.61299886e-01, 2.83091667e-01, 3.04850095e-01,
3.26631264e-01, 3.48415265e-01, 3.70162670e-01,
3.91958856e-01]])
>>> for sample in samples:
... print(np.linalg.norm(center - sample))
...
3.276563405953557e-05
4.696966559956132e-05
4.536780825124043e-05
3.567424819392873e-05
4.314138629688178e-05
In summary, the inputs you provided are not normalized, so the resulting cluster center is the normalized input.
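If you want explicit control over the projection, one option is to l2-normalize the rows yourself before calling fit. A minimal sketch using sklearn.preprocessing.normalize (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.preprocessing import normalize

# the same ten identical rows as in your example
x_list = np.tile([i / 20. for i in range(19)], (10, 1))
x_unit = normalize(x_list)  # l2-normalize each row
print(np.linalg.norm(x_unit, axis=1))  # every row now has norm 1.0
```

Fitting on x_unit makes the implicit projection explicit, so the cluster center matches the data you actually passed in.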
I hope this helps.
That really helps! Thanks a lot!
When I tried to set 'n_clusters=K' where K is larger than 1 in 'VonMisesFisherMixture', I met the problem 'VMF scaling denominator was inf.' In #10, you said 'That exception is raised when the dimension is too large.', but my data is only 10*19, which is not so large. How can I solve this?
Yeah, it's an outstanding issue. It has to do with values becoming too large in some iterations, which I think is due to long vectors (since they are normalized to 1). Let me know if you figure it out.
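The overflow is easy to reproduce in isolation: in double precision, exp overflows once its argument exceeds roughly 709, which a large kappa times a dot product easily does. A standalone sketch (not the exact expression inside spherecluster):

```python
import numpy as np

kappa, dot = 1000.0, 0.9
with np.errstate(over='ignore'):
    direct = np.exp(kappa * dot)  # overflows to inf in float64
log_density = kappa * dot         # the same quantity in log space stays finite
print(direct, log_density)        # -> inf 900.0
```

This is why the next comment's suggestion of working entirely in log space sidesteps the problem.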
OK, I will try to do more experiments about it.
Normalization when kappa or mu are large can be dealt with by noting that you calculate the log likelihood, not the likelihood, so overflow can be avoided. For example, log(exp(kappa*X.dot(mu).T)) = kappa*X.dot(mu).T, and I think this trick can be used for the normalization as well. For example, when the dimension is 3 the normalization has a closed form, so I've modified the code in _vmf_method by adding:
if n_features == 3:
    log_norm = np.log(kappa) - np.log(2. * np.pi) - kappa - np.log(1. - np.exp(-2. * kappa))
    log_density = kappa * X.dot(mu).T
    return log_norm + log_density
I believe it can be done for general dimension, but I didn't verify it.