
Comments (12)

Cory-M commented:

Hey guys @kartikgupta-at-anu @sumo8291 @TsungWeiTsai,
The log file I attached above was from the previous repository; some modifications were made to the code later, so there are some differences in format. For instance, the "wd" argument that appears in that log file was actually not used in training, so we removed it in a later version.
For the released code, we re-ran cifar10/100 yesterday; here are the current results:
test_cifar10_041.log
test_cifar100.log
Please kindly refer to these files for this repo. The model is trained on PyTorch 0.4.1. We tried running it on PyTorch 0.4.0; the results were indeed not optimal and required hyperparameter tuning (though we were still able to find hyperparameters that recover the reported results). We don't know the reason for this observation yet, but we strongly encourage you to use PyTorch 0.4.1, and perhaps explore the reasons for the instability.


sKamiJ commented:

(Quoting @TsungWeiTsai's comment, reproduced in full below.)

@TsungWeiTsai This is the code of ImageDataGenerator in Keras 2.3.1:

def __init__(self,
             ...
             data_format='channels_last',
             ...):
    if data_format is None:
        data_format = backend.image_data_format()

Note that the default value of data_format is 'channels_last' rather than None, so it never calls backend.image_data_format() to pick up the image_data_format you set in keras.json.
Just set data_format='channels_first' manually when you create the ImageDataGenerator, as sketched below.
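
For illustration, a minimal sketch of that workaround (the array here is a hypothetical CIFAR-shaped batch, not the repo's actual loader):

  import numpy as np
  from keras.preprocessing.image import ImageDataGenerator

  # Pass data_format explicitly; with the 'channels_last' default,
  # the setting in keras.json is silently ignored.
  datagen = ImageDataGenerator(data_format='channels_first')

  x = np.zeros((32, 3, 32, 32), dtype='float32')  # hypothetical NCHW batch
  batch = next(datagen.flow(x, batch_size=32, shuffle=False))
  print(batch.shape)  # (32, 3, 32, 32), with no channels_last warning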


kartikgupta-at-anu commented:

All the libraries in my installation have the correct versions, and I used the exact same seed you mentioned; such a large variation should not come from randomness. Are you yourself able to reproduce your results now? Are you sure you put the correct hyperparameters in the config file? And what is the "alpha" parameter in your log file name?


sumo8291 commented:

After going through your log files, there are some differences I could observe:
(1) When the architecture arguments are printed in your log files, 'arch' has 'dac' assigned to it, while in the yaml file the value of 'arch' is 'cifar_C4_L2'.
(2) There is no argument called "arch_dim" in either main.py or the yaml files.
(3) Your namespace printout also lacks "c_layer", "layers", and "classifier" as arguments.

Further, I assume "alpha" and "coeff[label]" are equivalent according to your yaml files. However, does "coeff[local]" correspond to the hyperparameter beta in Equation 15 of the paper? If so, could you please clarify its intended value: in the paper you set beta to 0.1, while in the yaml files "coeff[local]" is set to 0.05.

Next, the log file you provided reports a weight_decay 'wd' of 1e-05, while in main.py, torch.optim.RMSprop uses its default weight_decay of 0.
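
For reference, a minimal sketch of that discrepancy (the learning rate and parameter are placeholders, not the repo's actual values):

  import torch

  params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameter

  # As in the released main.py: weight_decay not passed, so the default 0 applies.
  opt_repo = torch.optim.RMSprop(params, lr=1e-3)

  # As in the provided log file: wd = 1e-05 would have to be passed explicitly.
  opt_log = torch.optim.RMSprop(params, lr=1e-3, weight_decay=1e-5)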

So it seems there is something wrong with either the code or the log files you provided.

(Cory-M's reply with the CIFAR10/100 logs is reproduced in full below.)


Cory-M commented:

(Quoting @kartikgupta-at-anu's request for a pip list and the dataset, reproduced in full below.)

Hey @kartikgupta-at-anu, we uploaded the data to:
https://drive.google.com/file/d/1gP67avGLl5zpeeX1kvkspQdJ3pBwfBNl/view?usp=sharing
and here's how we tried to set up everything from scratch:

  1. Install anaconda3 and create a virtual env.
  2. Install PyTorch 0.4.1 via 'conda install pytorch==0.4.1 torchvision==0.2.1 cuda80 -c pytorch'.
  3. pip install tensorboardX
  4. pip install keras==2.0.2
  5. Revise ~/.keras/keras.json to the following form:
    {
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "theano",
    "image_dim_ordering": "th",
    "image_data_format": "channels_first"
    }
  6. Install the other dependencies.
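
A quick sanity check after step 5 (just a sketch, assuming Keras 2.0.2's backend API):

  import keras.backend as K

  print(K.backend())            # expected: 'theano'
  print(K.image_data_format())  # expected: 'channels_first'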

Here is the pip list:
pip_list.txt

Hope that helps.


Cory-M commented:

Hi, attached is the CIFAR10 log at epoch 50 (the whole logs for cifar10/100 are attached below):
test_cifar10_1k_32_alpha005_thres95.log
test_cifar100_1k_32_alpha005_thres95.log

[2019-03-19 16:40:36,527][main_cifar10_bs_1k_32_alpha005_thres95.py][line:385][INFO][rank:0] Epoch: [51][0/60] Time: 3.951 (3.951) Data: 3.921 (3.921)
[2019-03-19 16:40:39,241][main_cifar10_bs_1k_32_alpha005_thres95.py][line:385][INFO][rank:0] Epoch: [51][10/60] Time: 0.079 (0.601) Data: 0.055 (0.576)
[2019-03-19 16:40:40,252][main_cifar10_bs_1k_32_alpha005_thres95.py][line:385][INFO][rank:0] Epoch: [51][20/60] Time: 0.118 (0.361) Data: 0.094 (0.337)
[2019-03-19 16:40:41,783][main_cifar10_bs_1k_32_alpha005_thres95.py][line:385][INFO][rank:0] Epoch: [51][30/60] Time: 0.178 (0.292) Data: 0.155 (0.268)
[2019-03-19 16:40:43,759][main_cifar10_bs_1k_32_alpha005_thres95.py][line:385][INFO][rank:0] Epoch: [51][40/60] Time: 0.212 (0.268) Data: 0.189 (0.244)
[2019-03-19 16:40:46,195][main_cifar10_bs_1k_32_alpha005_thres95.py][line:385][INFO][rank:0] Epoch: [51][50/60] Time: 0.257 (0.262) Data: 0.234 (0.238)
[2019-03-19 16:40:49,632][main_cifar10_bs_1k_32_alpha005_thres95.py][line:227][INFO][rank:0] ARI against ground truth label: 0.407
[2019-03-19 16:40:49,633][main_cifar10_bs_1k_32_alpha005_thres95.py][line:228][INFO][rank:0] NMI against ground truth label: 0.496
[2019-03-19 16:40:49,633][main_cifar10_bs_1k_32_alpha005_thres95.py][line:229][INFO][rank:0] ACC against ground truth label: 0.621
[2019-03-19 16:40:49,633][main_cifar10_bs_1k_32_alpha005_thres95.py][line:230][INFO][rank:0] #######################

It doesn't take that many epochs to reach the reported results. I suggest you check the following: the PyTorch version, the CUDA version, and the NumPy and PyTorch seeds. It's true that the results are quite sensitive; we don't know why yet, but they change a lot under tiny modifications to the configuration, which could be left as potential future work. A lot of clustering-based methods suffer from the same instability. Please let me know if you are still struggling to reproduce the result; maybe I can help check with you.
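
If it helps, here is a minimal sketch of pinning the seeds and printing the versions (standard PyTorch/NumPy calls; the seed value is a placeholder, not the repo's configured seed):

  import random
  import numpy as np
  import torch

  SEED = 0  # placeholder; use the seed from the released config
  random.seed(SEED)
  np.random.seed(SEED)
  torch.manual_seed(SEED)
  torch.cuda.manual_seed_all(SEED)
  torch.backends.cudnn.deterministic = True  # trade speed for repeatability
  torch.backends.cudnn.benchmark = False

  print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())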


Cory-M commented:

Yes, we can; we tested it many times before releasing...


TsungWeiTsai commented:

I also got a different result. I am using newer versions of PyTorch (1.3) and Keras (2.3.1), but the result seems similar to the one from Kartik. The training speed is also slow: it took me around 1 hour per epoch on a single GeForce RTX 2080 Ti. Besides, I have changed the Keras data format to "channels_first" but still got the warning below; could that be the reason the numbers don't match? Thank you.

Some log for CIFAR10:

[2020-04-16 07:52:12,916][main.py][line:276][INFO][rank:0] Epoch: [10/200] ARI against ground truth label: 0.057
[2020-04-16 07:52:12,916][main.py][line:277][INFO][rank:0] Epoch: [10/200] NMI against ground truth label: 0.112
[2020-04-16 07:52:12,916][main.py][line:278][INFO][rank:0] Epoch: [10/200] ACC against ground truth label: 0.219
/data/anaconda3/envs/torch/lib/python3.7/site-packages/keras_preprocessing/image/numpy_array_iterator.py:127: UserWarning: NumpyArrayIterator is set to use the data format convention "channels_last" (channels on axis 3), i.e. expected either 1, 3, or 4 channels on axis 3. However, it was passed an array with shape (32, 3, 32, 32) (32 channels).
  str(self.x.shape[channels_axis]) + ' channels).')
[2020-04-16 07:53:46,243][main.py][line:238][INFO][rank:0] Epoch: [10/200][0/50] Time: 0.299 (0.306) Data: 1.662 (1.662) Loss: 0.4154 graph: 0.2718 label: 0.0156 local: 1.3074
[2020-04-16 08:09:45,322][main.py][line:238][INFO][rank:0] Epoch: [10/200][10/50] Time: 0.409 (0.319) Data: 0.115 (0.139) Loss: 0.4600 graph: 0.2831 label: 0.0222 local: 1.3135
[2020-04-16 08:25:46,428][main.py][line:238][INFO][rank:0] Epoch: [10/200][20/50] Time: 0.267 (0.331) Data: 0.114 (0.133) Loss: 0.3914 graph: 0.2567 label: 0.0139 local: 1.3009
[2020-04-16 08:41:44,206][main.py][line:238][INFO][rank:0] Epoch: [10/200][30/50] Time: 0.395 (0.305) Data: 0.133 (0.121) Loss: 0.4266 graph: 0.2700 label: 0.0187 local: 1.2642
[2020-04-16 08:57:34,701][main.py][line:238][INFO][rank:0] Epoch: [10/200][40/50] Time: 0.227 (0.286) Data: 0.131 (0.135) Loss: 0.3876 graph: 0.2355 label: 0.0178 local: 1.2629


Cory-M commented:

(Quoting @kartikgupta-at-anu's questions above about reproducibility, the config hyperparameters, and the "alpha" in the log file name.)

@kartikgupta-at-anu Most clustering methods suffer from this problem of instability and large variance in performance, which is one of their known defects. They are commonly very sensitive to initialization and hyperparameters, especially on smaller datasets like CIFAR: the accuracy can vary a lot when part of the samples is misclassified because of a bad initialization.


Cory-M commented:

(Quoting @TsungWeiTsai's comment above.)

@TsungWeiTsai Hi, I checked my log and found that we never encountered this warning before. I am not sure, but it could well be the reason the numbers don't match. I think you could set a breakpoint and check whether the images are augmented correctly by the Keras dataloader; a sketch of such a check is below.
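
A sketch of such a check (the random array stands in for a real CIFAR batch; assumes channels-first Keras preprocessing as in our setup):

  import numpy as np
  from keras.preprocessing.image import ImageDataGenerator

  # Hypothetical stand-in for a real CIFAR batch, in NCHW layout.
  x = np.random.rand(8, 3, 32, 32).astype('float32')

  datagen = ImageDataGenerator(horizontal_flip=True, data_format='channels_first')
  batch = next(datagen.flow(x, batch_size=8, shuffle=False))

  print(batch.shape)               # should stay (8, 3, 32, 32)
  print(np.abs(batch - x).mean())  # > 0 means some images really were flipped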


kartikgupta-at-anu commented:

@Cory-M Since your latest log files do seem to show the correct results, this looks related to some library version mismatch. I have tried using the exact same versions of the libraries. It would be best if you could provide a copy of "pip list" so that I can try the exact same versions of all the other libraries you have not mentioned. Also, let me know which CUDA and cuDNN versions you are using. I am currently using Python 3.6.5, torch 0.4.1, and keras 2.0.2. For reference, here is my pip list:
absl-py 0.9.0
bleach 1.5.0
certifi 2020.4.5.1
cycler 0.10.0
decorator 4.4.2
easydict 1.9
html5lib 0.9999999
imageio 2.8.0
joblib 0.14.1
Keras 2.0.2
kiwisolver 1.2.0
lmdb 0.94
Markdown 3.2.1
matplotlib 3.2.1
networkx 2.4
numpy 1.16.2
opencv-python 4.0.0.21
Pillow 7.1.1
pip 20.0.2
plyfile 0.7
protobuf 3.11.3
pyparsing 2.4.7
python-dateutil 2.8.1
PyWavelets 1.1.1
PyYAML 5.3.1
scikit-image 0.16.2
scikit-learn 0.22.2.post1
scipy 1.4.1
setuptools 46.1.3.post20200330
six 1.14.0
sklearn 0.0
tensorboardX 1.2
tensorflow-gpu 1.5.0
tensorflow-tensorboard 1.5.1
Theano 1.0.4
torch 0.4.1
torchvision 0.2.1
Werkzeug 1.0.1
wheel 0.34.2

Another thing you could do is upload the data directory, at least for cifar10/cifar100, to Google Drive or Dropbox, so that we can copy the exact same setup to reproduce.


kartikgupta-at-anu commented:

I tried with the exact same set of libraries and your dataset, but still could not reproduce the results. If anyone other than @Cory-M manages to reproduce them, please let me know.
I have attached my log file so that others can check whether they get inferior results similar to mine.
cifar10_log (copy).txt

My pip list:
certifi 2020.4.5.1
cffi 1.14.0
cycler 0.10.0
decorator 4.4.2
easydict 1.9
imageio 2.8.0
joblib 0.14.1
Keras 2.0.2
kiwisolver 1.2.0
lmdb 0.94
matplotlib 3.2.1
mkl-fft 1.0.15
mkl-random 1.1.0
mkl-service 2.3.0
networkx 2.4
numpy 1.18.1
olefile 0.46
opencv-python 4.0.0.21
Pillow 6.2.2
pip 20.0.2
plyfile 0.7
protobuf 3.11.3
pycparser 2.20
pyparsing 2.4.7
python-dateutil 2.8.1
PyWavelets 1.1.1
PyYAML 5.3.1
scikit-image 0.16.2
scikit-learn 0.22.2.post1
scipy 1.4.1
setuptools 46.1.3.post20200330
six 1.14.0
sklearn 0.0
tensorboardX 2.0
Theano 1.0.4
torch 0.4.1
torchvision 0.2.1
wheel 0.34.2

