xifengguo / DEC-keras
Keras implementation for Deep Embedding Clustering (DEC)
License: MIT License
I am using the repo on hyperspectral datasets and, without changing any hyperparameters, I get ARI values that are negative or very close to 0, and the number of predicted classes is smaller than the ground-truth number of classes.
Any suggestions? The pretraining itself seems biased towards producing fewer clusters.
If I understand correctly, the model is evaluated on the same data that it's trained on. Doesn't this lead to a wrong evaluation?
Load data
Line 290 in 2438070
Lines 94 to 103 in 2438070
Evaluate
Lines 333 to 335 in 2438070
Shouldn't x_train and y_train be used to pretrain and fit, and then x_test and y_test be used to evaluate the model?
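One way to settle this is to hold out a split and only score on it; note that purely unsupervised clustering is usually evaluated transductively on the data it was fit on, since the labels are never used for training, but a held-out split answers the generalization question. A minimal sketch, assuming the DEC class and fit signature visible in the snippets above (the pretraining/compile entry points may differ between versions of this repo):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

n_clusters = len(np.unique(y_train))
dec = DEC(dims=[x_train.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, init='glorot_uniform')

# pretrain the autoencoder and run the clustering stage on the training split only
dec.pretrain(x_train, epochs=300, batch_size=256, save_dir='results')   # entry point name may differ by version
dec.compile(optimizer='sgd', loss='kld')
dec.fit(x_train, y=y_train, tol=0.001, maxiter=20000, batch_size=256)

# score on the held-out split: soft assignments -> hard labels
y_pred_test = dec.model.predict(x_test, verbose=0).argmax(1)
print('test NMI:', normalized_mutual_info_score(y_test, y_pred_test))
print('test ARI:', adjusted_rand_score(y_test, y_pred_test))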
Using your code on my dataset (17 columns × ~30k rows), I get the following error:
ValueError: n_samples=23205 should be >= n_clusters=85617.
How can I fix this?
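The k-means call that initializes the cluster centres requires n_samples >= n_clusters, so 85617 clusters on 23205 rows cannot work; such a large n_clusters usually means it was derived from a non-categorical column. A hedged check, where y is assumed to be your label array:

import numpy as np

n_clusters = len(np.unique(y))          # one cluster per distinct label value
print('samples:', x.shape[0], 'proposed n_clusters:', n_clusters)
# a continuous or ID-like column yields one "cluster" per distinct value; pick a
# smaller, domain-motivated n_clusters (or bin the column) before constructing DEC
assert x.shape[0] >= n_clusters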
Hello, great work!
I think 'from sklearn.utils.linear_assignment_ import linear_assignment' is now deprecated, and I would recommend making the following changes to the accuracy module:
import numpy as np
from scipy.optimize import linear_sum_assignment as linear_assignment

def acc(y_true, y_pred):
    """
    Calculate clustering accuracy. Requires scipy installed.
    # Arguments
        y_true: true labels, numpy.array with shape (n_samples,)
        y_pred: predicted labels, numpy.array with shape (n_samples,)
    # Return
        accuracy, in [0, 1]
    """
    y_true = y_true.astype(np.int64)
    assert y_pred.size == y_true.size
    D = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1
    # Hungarian algorithm: best one-to-one mapping between clusters and true labels
    ind = np.transpose(np.asarray(linear_assignment(w.max() - w)))
    return sum([w[i, j] for i, j in ind]) * 1.0 / y_pred.size
Thanks for all the great work!
Ali
I got the following error when running python DEC.py
Using TensorFlow backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', maxiter=20000.0, pretrain_epochs=None, save_dir='results', tol=0.001, update_interval=None)
MNIST samples (70000, 784)
Traceback (most recent call last):
File "DEC.py", line 321, in
dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, init=init)
File "DEC.py", line 138, in init
clustering_layer = ClusteringLayer(self.n_clusters, name='clustering')(self.encoder.output)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/base_layer.py", line 463, in call
self.build(unpack_singleton(input_shapes))
File "DEC.py", line 91, in build
self.clusters = self.add_weight((self.n_clusters, input_dim), initializer='glorot_uniform', name='clusters')
TypeError: add_weight() got multiple values for argument 'name'
Thank you for your contribution!
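In newer Keras versions the first positional parameter of Layer.add_weight() is name, so passing the shape tuple positionally collides with the name='clusters' keyword. Passing the shape as a keyword argument in ClusteringLayer.build (DEC.py line 91) avoids the TypeError:

# DEC.py, ClusteringLayer.build
self.clusters = self.add_weight(shape=(self.n_clusters, input_dim),
                                initializer='glorot_uniform',
                                name='clusters')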
python DEC.py --dataset mnist
This runs fine.
However,
python run_exp.py
yields this:
Reached tolerance threshold. Stopping training.
('saving model to:', './results/exp1/reuters10k/trial9/DEC_model_final.h5')
Traceback (most recent call last):
File "run_exp.py", line 26, in
x, y = load_data(db)
File "./datasets.py", line 324, in load_data
return load_stl()
File "./datasets.py", line 283, in load_stl
y1 = np.fromfile(data_path + '/train_y.bin', dtype=np.uint8) - 1
IOError: [Errno 2] No such file or directory: './data/stl/train_y.bin'
Any ideas on what the problem is?
Much appreciated.
Thanks.
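The traceback only means the STL-10 binaries are not on disk: run_exp.py loops over several datasets, and load_stl() reads raw STL-10 files from ./data/stl, which are not bundled with the repo. A hedged pre-flight check (the exact file list is an assumption based on the standard STL-10 binary release):

import os

required = ['train_X.bin', 'train_y.bin', 'test_X.bin', 'test_y.bin']
missing = [f for f in required if not os.path.exists(os.path.join('data', 'stl', f))]
if missing:
    raise FileNotFoundError(
        'Missing STL-10 binaries %s: download the STL-10 binary archive and '
        'unpack it into ./data/stl before running run_exp.py' % missing)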
$ python DEC.py --dataset mnist
Using plaidml.keras.backend backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', maxiter=20000.0, pretrain_epochs=None, save_dir='results', tol=0.001, update_interval=None)
('MNIST samples', (70000, 784))
INFO:plaidml:Opening device "metal_amd_radeon_hd_-_firepro_d500.1"
return array(obj, copy=False)
Traceback (most recent call last):
File "DEC.py", line 321, in
dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=n_clusters, init=init)
File "DEC.py", line 138, in init
clustering_layer = ClusteringLayer(self.n_clusters, name='clustering')(self.encoder.output)
File "/Users/test/plaidml-venv/lib/python2.7/site-packages/keras/engine/base_layer.py", line 457, in call
output = self.call(inputs, **kwargs)
File "DEC.py", line 106, in call
q **= (self.alpha + 1.0) / 2.0
TypeError: unsupported operand type(s) for ** or pow(): 'Value' and 'float'
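PlaidML's tensor type does not implement the in-place ** operator with a Python float, which is what q **= ... in ClusteringLayer.call (DEC.py line 106) relies on. A hedged workaround is to go through the backend-agnostic pow instead (DEC.py already imports keras.backend as K; not tested on PlaidML):

# DEC.py, ClusteringLayer.call, replacing `q **= (self.alpha + 1.0) / 2.0`
q = K.pow(q, (self.alpha + 1.0) / 2.0)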
Hello author, thank you for reproducing DEC; I would like to ask a few questions. I saw that the code includes the CIFAR-10 dataset and feature extraction, but the results I get on CIFAR-10 are very poor, with acc around 0.3. Which hyperparameters have the biggest influence? Looking forward to your reply.
@XifengGuo, thanks for providing some interesting code. Can you add a licence?
During training with the clustering layer, the loss stays at 0 through all iterations while nmi, acc and ari look fine. Why does this happen? Does it indicate that the encoder layers haven't been trained during this stage?
The SAE (stacked autoencoder) part should be trained layer-wise, meaning the next autoencoder only starts training after the previous one has been trained. From the original paper:
After training of one layer, we use its output h as the input to train the next layer.
However, from the model structure image (autoencoders.png), the encoders are connected to each other, followed by the decoders, and there is only one training phase over the whole autoencoder.
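For comparison, a minimal sketch of greedy layer-wise pretraining in plain Keras (not the repo's code; activations and epoch counts are simplified): each single-layer autoencoder is trained on the output of the previously trained encoder, matching the quoted procedure.

from keras.layers import Input, Dense
from keras.models import Model

def pretrain_layerwise(x, dims=(784, 500, 500, 2000, 10), epochs=50, batch_size=256):
    """Train one single-layer autoencoder at a time, each on the previous encoder's output."""
    features = x
    encoders = []
    for in_dim, out_dim in zip(dims[:-1], dims[1:]):
        inp = Input(shape=(in_dim,))
        h = Dense(out_dim, activation='relu')(inp)        # encoder of this stack
        out = Dense(in_dim, activation='relu')(h)         # decoder of this stack
        ae = Model(inp, out)
        ae.compile(optimizer='adam', loss='mse')
        ae.fit(features, features, batch_size=batch_size, epochs=epochs, verbose=0)
        encoder = Model(inp, h)
        encoders.append(encoder)
        features = encoder.predict(features)              # input for the next layer
    return encoders

The trained encoder/decoder weights can then be copied into the deep autoencoder before an end-to-end finetuning pass.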
Great work!
For the problem I studied, the accuracy reaches 97%, which is very impressive.
How can I compute the DEC loss of every instance after training has completed? For the autoencoder, it is straightforward by defining a simple function:
import tensorflow as tf

def ae_loss(autoencoder, X):
    ae_rec = autoencoder.predict(X)
    return tf.keras.losses.mse(ae_rec, X)   # per-sample reconstruction loss
Defining a similar function for computing the clustering loss is not working. Any idea how this can be implemented?
I would like to do a further investigation by studying the loss distribution.
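Since the clustering objective is a KL divergence between the target distribution p and the soft assignment q, a per-instance value can be computed directly from the model's predictions. A sketch assuming dec.model outputs q, with p built from q as in the DEC paper:

import numpy as np

def dec_loss_per_instance(dec_model, X):
    """Per-sample KL(p_i || q_i) for a trained DEC model."""
    q = dec_model.predict(X, verbose=0)            # soft assignments, shape (n, n_clusters)
    weight = q ** 2 / q.sum(0)                     # target distribution from the DEC paper
    p = (weight.T / weight.sum(1)).T
    q = np.maximum(q, 1e-12)                       # avoid log(0)
    p = np.maximum(p, 1e-12)
    return np.sum(p * np.log(p / q), axis=1)       # one loss value per instance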
Thanks for your great implementation!
I've tried to solve a classification problem whose input data has the shape 1000*221 with the DEC model.
I want to train on over 80 thousand samples (a very large array of shape [8000000,1000,221], dtype=float32, about 60 GB), so it's not possible to load the whole dataset into a Python array.
After googling, I found that tf.TFRecord could help me get around this capacity problem.
I followed the tutorial on the official TensorFlow site to write a TFRecord file, and I can load the TFRecord into a conventional Keras model. However, I can't find how to feed it into the DEC model. The input (MNIST) of the DEC model is a single numpy file with shape [70000,784].
Like the following:

dataset = tf.data.TFRecordDataset(filenames=[filenames])
parsed_dataset = dataset.map(_parse_function, num_parallel_calls=8)
final_dataset = parsed_dataset.shuffle(buffer_size=number_of_sample).batch(10)
iterator = final_dataset.make_one_shot_iterator()
parsed_record = iterator.get_next()
feature, label = parsed_record['feature'], parsed_record['label']

# keras
inputs = keras.Input(shape=(1000, 221), name='feature', tensor=feature)
model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy', 'categorical_crossentropy'],
              target_tensors=[label])
model.fit(epochs=30,
          steps_per_epoch=800000 // 256)
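Since DEC's fit() draws mini-batches by indexing a numpy array, one workaround (a sketch, not the repo's API; n_samples, _parse_function and filenames come from the snippet above) is to stream the parsed TFRecords once into a disk-backed np.memmap and hand that array to DEC, so nothing has to fit in RAM:

import numpy as np
import tensorflow as tf

n_samples = 800000
feat_dim = 1000 * 221                    # DEC expects a flat (n_samples, feat_dim) array

# disk-backed array: behaves like a numpy array without holding everything in RAM
x = np.memmap('features.f32', dtype=np.float32, mode='w+', shape=(n_samples, feat_dim))
y = np.zeros(n_samples, dtype=np.int64)

dataset = tf.data.TFRecordDataset([filenames]).map(_parse_function)
for i, record in enumerate(dataset.as_numpy_iterator()):   # TF 2.x eager iteration
    x[i] = record['feature'].reshape(-1)
    y[i] = record['label']                                  # adapt if the label is one-hot

x.flush()
# x (a memmap) can now be passed wherever DEC expects the (70000, 784)-style numpy input.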
This is what I encountered when running the script. Can anyone help me resolve this issue?
Layer (type)                           Output Shape     Param #    Connected to
==================================================================================================
input (InputLayer)                     [(None, 784)]    0          []
encoder_0 (Dense)                      (None, 500)      392500     ['input[0][0]']
encoder_1 (Dense)                      (None, 500)      250500     ['encoder_0[0][0]']
encoder_2 (Dense)                      (None, 2000)     1002000    ['encoder_1[0][0]']
encoder_3 (Dense)                      (None, 10)       20010      ['encoder_2[0][0]']
tf.expand_dims (TFOpLambda)            (None, 1, 10)    0          ['encoder_3[0][0]']
tf.math.subtract (TFOpLambda)          (None, 10, 10)   0          ['tf.expand_dims[0][0]']
tf.math.square (TFOpLambda)            (None, 10, 10)   0          ['tf.math.subtract[0][0]']
tf.math.reduce_sum (TFOpLambda)        (None, 10)       0          ['tf.math.square[0][0]']
tf.math.truediv (TFOpLambda)           (None, 10)       0          ['tf.math.reduce_sum[0][0]']
tf.__operators__.add (TFOpLambda)      (None, 10)       0          ['tf.math.truediv[0][0]']
tf.math.truediv_1 (TFOpLambda)         (None, 10)       0          ['tf.__operators__.add[0][0]']
tf.math.pow (TFOpLambda)               (None, 10)       0          ['tf.math.truediv_1[0][0]']
tf.compat.v1.transpose (TFOpLambda)    (10, None)       0          ['tf.math.pow[0][0]']
tf.math.reduce_sum_1 (TFOpLambda)      (None,)          0          ['tf.math.pow[0][0]']
tf.math.truediv_2 (TFOpLambda)         (10, None)       0          ['tf.compat.v1.transpose[0][0]',
                                                                    'tf.math.reduce_sum_1[0][0]']
tf.compat.v1.transpose_1 (TFOpLambda)  (None, 10)       0          ['tf.math.truediv_2[0][0]']
==================================================================================================
Total params: 1,665,010
Trainable params: 1,665,010
Non-trainable params: 0
Update interval 140
Save interval 1365
Initializing cluster centers with k-means.
2188/2188 [==============================] - 10s 4ms/step
Traceback (most recent call last):
File "DEC.py", line 335, in
y_pred = dec.fit(x, y=y, tol=args.tol, maxiter=args.maxiter, batch_size=args.batch_size,
File "DEC.py", line 210, in fit
self.model.get_layer(name='clustering').set_weights([kmeans.cluster_centers_])
File "/research/DEC_Pytorch_tutorial/dec_venv/lib/python3.8/site-packages/keras/engine/training.py", line 3353, in get_layer
raise ValueError(
ValueError: No such layer: clustering. Existing layers are: ['input', 'encoder_0', 'encoder_1', 'encoder_2', 'encoder_3', 'tf.expand_dims', 'tf.math.subtract', 'tf.math.square', 'tf.math.reduce_sum', 'tf.math.truediv', 'tf.__operators__.add', 'tf.math.truediv_1', 'tf.math.pow', 'tf.compat.v1.transpose', 'tf.math.reduce_sum_1', 'tf.math.truediv_2', 'tf.compat.v1.transpose_1'].
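The summary shows the clustering math decomposed into individual TFOpLambda layers instead of a single layer named 'clustering', which is why get_layer(name='clustering') fails. One possible cause (an assumption, not verified against this environment) is mixing the standalone keras package with tf.keras, so the custom layer never registers in the functional graph. A minimal re-implementation of the paper's soft-assignment layer written purely against tf.keras keeps it as one named node:

import tensorflow as tf
from tensorflow.keras import layers, backend as K

class ClusteringLayer(layers.Layer):
    """Student's t soft assignment q_ij from the DEC paper (minimal sketch)."""
    def __init__(self, n_clusters, alpha=1.0, **kwargs):
        super().__init__(**kwargs)
        self.n_clusters = n_clusters
        self.alpha = alpha

    def build(self, input_shape):
        input_dim = int(input_shape[-1])
        self.clusters = self.add_weight(shape=(self.n_clusters, input_dim),
                                        initializer='glorot_uniform', name='clusters')

    def call(self, inputs):
        # q_ij = (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha+1)/2), normalized over j
        q = 1.0 / (1.0 + K.sum(K.square(K.expand_dims(inputs, axis=1) - self.clusters), axis=2) / self.alpha)
        q = q ** ((self.alpha + 1.0) / 2.0)
        return q / K.sum(q, axis=1, keepdims=True)

# clustering = ClusteringLayer(n_clusters, name='clustering')(encoder.output)
# model.get_layer(name='clustering') then resolves as expected.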
You have shown that the accuracy produced by your implementation is 0.91, which is higher than that reported in the papers.
Could you explain the improvement?
Hello, this is great work. Thanks!
I just have a question: how can I use my own dataset with this? I have a folder of images that I would like clustered.
Thanks!
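A minimal sketch (paths, image size and n_clusters are placeholders) for turning a folder of images into the flat float array DEC expects, analogous to the (70000, 784) MNIST input:

import glob
import numpy as np
from PIL import Image

paths = sorted(glob.glob('my_images/*.png'))                 # hypothetical folder
imgs = [np.asarray(Image.open(p).convert('L').resize((28, 28)), dtype=np.float32)
        for p in paths]
x = np.stack(imgs).reshape(len(imgs), -1) / 255.0            # shape (n_images, 784)

dec = DEC(dims=[x.shape[-1], 500, 500, 2000, 10], n_clusters=10, init='glorot_uniform')
# ...pretrain / fit as in DEC.py, then read cluster labels from the soft assignments:
# labels = dec.model.predict(x).argmax(1)

As noted elsewhere in this thread for CIFAR-10, clustering raw pixels of natural images tends to work poorly, so extracting features first (e.g. with a pretrained convnet) usually helps.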
I am getting an error with the following message when running:
run DEC.py mnist
Using TensorFlow backend.
Namespace(ae_weights=None, batch_size=256, dataset='mnist', gamma=0.1, maxiter=20000.0, n_clusters=10, save_dir='results', tol=0.001, update_interval=140)
MNIST samples (70000, 784)
No pretrained ae_weights given, start pretraining...
Pretraining the 1th layer...
learning rate = 0.1
Traceback (most recent call last):
File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\DEC.py", line 311, in
x=x)
File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\DEC.py", line 170, in initialize_model
sae.fit(x, epochs=400)
File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\SAE.py", line 133, in fit
self.pretrain_stacks(x, epochs=epochs/2)
File "C:\Projects\ProvidersSimilarity\code\DEC-2\DEC-keras-master\SAE.py", line 102, in pretrain_stacks
self.stacks[i].fit(features, features, batch_size=self.batch_size, epochs=epochs/3)
File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\models.py", line 867, in fit
initial_epoch=initial_epoch)
File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py", line 1598, in fit
validation_steps=validation_steps)
File "C:\Users\kaneja\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py", line 1130, in _fit_loop
for epoch in range(initial_epoch, epochs):
TypeError: 'float' object cannot be interpreted as an integer
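Under Python 3, epochs/2 and epochs/3 are floats, and Keras's fit() needs an integer epoch count, which is exactly what the last frame complains about. A hedged fix is integer division at the two call sites named in the traceback:

# SAE.py, fit()
self.pretrain_stacks(x, epochs=epochs // 2)

# SAE.py, pretrain_stacks()
self.stacks[i].fit(features, features, batch_size=self.batch_size, epochs=epochs // 3)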
Great work!
I am training this using a different approach with the 20 Newsgroups dataset. Training is finished and I can visualize the clusters with z_2d and the cluster centroids from the pickle file.
But how do I now predict on new data and find out which cluster it is mapped onto?
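The clustering layer's output is the soft assignment q over the clusters, so hard labels for new samples come from an argmax; a sketch assuming the trained DEC object and that new data is preprocessed exactly like the training data:

import numpy as np

q = dec.model.predict(x_new, verbose=0)     # shape (n_samples, n_clusters)
labels = q.argmax(axis=1)                   # cluster index each sample maps onto
confidence = q.max(axis=1)                  # how strongly it belongs to that cluster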
Hi all,
Thanks for sharing a great model for deep clustering. I aim to leverage your excellent work to cluster images for my project.
Based on your paper, the network structure is built on a stacked autoencoder composed of 2 pairs of Dropout-Dense layers. However, the implementation does not follow this network structure. Is there any reason behind this?
Hello! The links for the Reuters data are outdated. The new base link is http://www.ai.mit.edu/projects/jmlr/ instead of http://jmlr.csail.mit.edu/.
This should be the new contents of get_data.sh:
#!/bin/sh
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt0.dat.gz
gunzip lyrl2004_tokens_test_pt0.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt1.dat.gz
gunzip lyrl2004_tokens_test_pt1.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt2.dat.gz
gunzip lyrl2004_tokens_test_pt2.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt3.dat.gz
gunzip lyrl2004_tokens_test_pt3.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_train.dat.gz
gunzip lyrl2004_tokens_train.dat.gz
wget http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a08-topic-qrels/rcv1-v2.topics.qrels.gz
gunzip rcv1-v2.topics.qrels.gz
Many thanks for the code.
I have implemented 'Unsupervised Deep Embedding for Clustering Analysis' using PyTorch, and I noticed that the PyTorch version converges much more slowly than your Keras version. Going through the details, I noticed that in your version the encoder weights are not updated during the clustering stage.
I'm not sure what the reason is, but according to the paper, the encoder weights must get updated.
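A quick way to verify this (a sketch, not part of the repo) is to snapshot the encoder weights before the clustering stage and compare them after fit(); if the KL loss is backpropagated through the encoder, the weights must change:

import numpy as np

before = [w.copy() for w in dec.encoder.get_weights()]
dec.fit(x, y=y, tol=0.001, maxiter=20000, batch_size=256)
after = dec.encoder.get_weights()
print('max abs weight change per encoder tensor:',
      [float(np.abs(a - b).max()) for a, b in zip(after, before)])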
Several runs lead to different accuracy values. What can be the cause of this issue?
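Run-to-run variation is expected: both the autoencoder initialization and the k-means initialization of the cluster centres are random. Fixing the seeds (a sketch; how much this pins down depends on the backend and hardware) makes runs comparable:

import os, random
import numpy as np
import tensorflow as tf

os.environ['PYTHONHASHSEED'] = '0'
random.seed(0)
np.random.seed(0)        # scikit-learn's KMeans uses numpy's global state unless random_state is set
tf.random.set_seed(0)    # tf.set_random_seed(0) on TF 1.x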
If I am not getting the desired results with my dataset, will modifying the architecture of the autoencoder help? Do I have to keep the embedding layer below a certain size?
I checked the closed issue about acc; you replied that the pretraining strategy is different.
Can you explain the difference between this implementation and the paper? (I checked the authors' pretraining method in the paper, but I can't find your strategy in this repository.)
Also, the DEC paper uses dropout; why didn't you use a dropout layer?
Hello, when I use your DEC, the loss first rises and then falls, but the accuracy keeps rising. Do you know what the reason for this is?
Hi, may I know whether you use any tricks to preprocess Reuters?
When I load Keras's Reuters dataset for training, the accuracy is only around 0.19.
Thank you
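One likely reason (an assumption about your setup): the DEC paper's Reuters experiments use tf-idf features over the 2000 most frequent word stems of RCV1 with 4 root categories, whereas keras.datasets.reuters is a different corpus with 46 topic labels and raw word-index sequences, so accuracies are not comparable. A hedged sketch of DEC-style preprocessing applied to the Keras dataset:

import numpy as np
from keras.datasets import reuters
from sklearn.feature_extraction.text import TfidfTransformer

# keep only the 2000 most frequent words, as in the DEC paper's Reuters setup
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=2000)
sequences = np.concatenate([x_train, x_test])
labels = np.concatenate([y_train, y_test])

counts = np.zeros((len(sequences), 2000), dtype=np.float32)   # bag-of-words counts
for i, seq in enumerate(sequences):
    for idx in seq:
        counts[i, idx] += 1.0
x = TfidfTransformer().fit_transform(counts).toarray().astype(np.float32)  # tf-idf features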
Hi, I notice the autoencoder here is not built as described in the original implementation (although the results seem good enough).
If needed, I would love to provide an implementation of the denoising autoencoder.
I am wondering what it means if the accuracy, nmi, and ari metrics fluctuate a lot. I noticed that when training on MNIST, nearly every update interval brings an improvement in accuracy and there is an upward trend.
However, when I train on my dataset, there are lots of fluctuations: it sometimes starts high at iteration 0, then goes lower, then goes higher again, and ends up somewhere in between. Does this mean something is wrong with the data? Is this trend representative of something else?
Thanks for your great implementation!
I have a question about experimenting with it. There are default settings for epochs (e.g. MNIST - 300 epochs). Are they the same values used in your IDEC paper experiments?
I want to reproduce your experiments for study, but the accuracy score I get with your DEC implementation does not match the accuracy in your paper (IDEC).