
Comments (4)

kwotsin commented on May 26, 2024

Hi @ttaa9 , thanks for your questions! The training splits follow the splits available for each dataset. For example, CIFAR-10 has train and test splits, and similar to many works I used the train split. For STL-10, I used the unlabeled split, which is the most commonly used. As for the scores, they are obtained from a single training run; no retraining of the model to find the best result was done.
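
Concretely, assuming the datasets are fetched via torchvision (the root paths below are placeholders), these splits correspond to the following loaders:

import torchvision

# CIFAR-10: the train split (50,000 images), as used in many GAN works.
cifar10 = torchvision.datasets.CIFAR10(root='/path/to/data', train=True, download=True)

# STL-10: the unlabeled split (100,000 images), the one most commonly used.
stl10 = torchvision.datasets.STL10(root='/path/to/data', split='unlabeled', download=True)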

ttaa9 commented on May 26, 2024

Hi @kwotsin , thanks so much for the quick reply. It might be useful to put this information about splits somewhere in the README, so readers know which splits to evaluate on when comparing against your scores.

Also, the reason I asked about multiple runs is that when I run the GANs myself I don't get quite the same scores. E.g., on CelebA (128 x 128), I get FID/KID of 13.08/0.00956 (versus your 12.93/0.0076) when training on the train set and evaluating on the test set. (It's a bit worse, 13.39/0.010, when evaluating on the training set, unsurprisingly.) The FID is quite close, but the KID is a bit off, so I am wondering whether this is simply stochasticity across training runs or a difference in the training settings. Perhaps you could post your Trainer settings/object in addition to the architectures you currently provide?

kwotsin commented on May 26, 2024

Hi @ttaa9 , no worries! Indeed, the split information is currently listed under the "Baselines" section, which covers all the other datasets tested as well. To clarify, similar to many existing works, the same split was used for both training and evaluation for each dataset. The training settings and architectures are also listed on the README page, and they are the same ones used for the checkpoint.

On the CelebA run, I think your FID score looks correct, with the difference falling within the error interval (which, as you mentioned, is probably due to stochasticity across different training runs). For the KID score, could you check whether the JSON file contains any anomalous readings? For example, my current JSON file for the KID scores has the following values:

[
    0.007495319259681859,
    0.007711712250735898,
    0.007619357938282523
]

I suspect an anomalous reading could affect the KID score significantly. This is not surprising, since I've noticed it can happen even for FID -- e.g. at the same checkpoint, generating with a different random seed can occasionally give a few hundred FID points instead of the 20+ points from the other readings, although this is very rare. I've re-run the evaluation with the given checkpoint and obtained a similar score: 0.007659641506459136 (± 7.556746387021168e-06). Given that your FID is similar to the one I got, I suspect one of your KID readings might be anomalous.
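
As a quick sanity check, one could load the per-seed readings, flag any value far from the median, and report the aggregate in the mean (± std) format above. Here is a minimal sketch, assuming the readings are saved as a flat JSON list like the one shown; the file path and the 50% deviation threshold are placeholders:

import json
import statistics

# Placeholder path to the JSON file of per-seed KID readings.
scores_path = "/path/to/log_dir/kid_scores.json"

with open(scores_path) as f:
    readings = json.load(f)  # e.g. [0.00750, 0.00771, 0.00762]

# Flag readings deviating from the median by more than 50%;
# a single anomalous reading can dominate a 3-seed average.
median = statistics.median(readings)
anomalies = [r for r in readings if abs(r - median) > 0.5 * median]
if anomalies:
    print("Possible anomalous readings:", anomalies)

# Report the aggregate as mean (± std).
mean = statistics.mean(readings)
std = statistics.stdev(readings) if len(readings) > 1 else 0.0
print(f"KID: {mean:.6f} (± {std:.6g})")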

To reproduce the KID scores for CelebA, you can download the checkpoint file and run this minimal script:

import torch
import torch_mimicry as mmc
from torch_mimicry.nets import sngan

# Replace with checkpoint file from CelebA 128x128, SNGAN. https://drive.google.com/open?id=1rYnv2tCADbzljYlnc8Ypy-JTTipJlRyN
ckpt_file = "/path/to/checkpoints/netG/netG_100000_steps.pth"

# Default variables
log_dir = './examples/example_log_celeba'
dataset = 'celeba_128'
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Restore model
netG = sngan.SNGANGenerator128().to(device)
netG.restore_checkpoint(ckpt_file)

# Compute KID over 3 random seeds and collect the scores
scores = []
for seed in range(3):
    score = mmc.metrics.kid_score(num_samples=50000,
                                  netG=netG,
                                  seed=seed,
                                  dataset=dataset,
                                  log_dir=log_dir,
                                  device=device)

    scores.append(score)

print(scores)

Feel free to let me know if this is helpful!

kwotsin commented on May 26, 2024

Closing this issue for now, but feel free to let me know if you have more questions!
