
moco's Introduction

MoCo: Momentum Contrast for Unsupervised Visual Representation Learning

This is a PyTorch implementation of the MoCo paper:

@Article{he2019moco,
  author  = {Kaiming He and Haoqi Fan and Yuxin Wu and Saining Xie and Ross Girshick},
  title   = {Momentum Contrast for Unsupervised Visual Representation Learning},
  journal = {arXiv preprint arXiv:1911.05722},
  year    = {2019},
}

It also includes the implementation of the MoCo v2 paper:

@Article{chen2020mocov2,
  author  = {Xinlei Chen and Haoqi Fan and Ross Girshick and Kaiming He},
  title   = {Improved Baselines with Momentum Contrastive Learning},
  journal = {arXiv preprint arXiv:2003.04297},
  year    = {2020},
}

Preparation

Install PyTorch and prepare the ImageNet dataset following the official PyTorch ImageNet training code.

This repo aims to make minimal modifications to that code. Check the modifications by:

diff main_moco.py <(curl https://raw.githubusercontent.com/pytorch/examples/master/imagenet/main.py)
diff main_lincls.py <(curl https://raw.githubusercontent.com/pytorch/examples/master/imagenet/main.py)

Unsupervised Training

This implementation only supports multi-gpu, DistributedDataParallel training, which is faster and simpler; single-gpu or DataParallel training is not supported.

To do unsupervised pre-training of a ResNet-50 model on ImageNet in an 8-gpu machine, run:

python main_moco.py \
  -a resnet50 \
  --lr 0.03 \
  --batch-size 256 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

This script uses all the default hyper-parameters as described in the MoCo v1 paper. To run MoCo v2, set --mlp --moco-t 0.2 --aug-plus --cos.
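
For example (illustrative only, combining the command above with the v2 flags), a full MoCo v2 pre-training run would look like:

python main_moco.py \
  -a resnet50 \
  --lr 0.03 \
  --batch-size 256 \
  --mlp --moco-t 0.2 --aug-plus --cos \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]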

Note: for 4-gpu training, we recommend following the linear lr scaling recipe: --lr 0.015 --batch-size 128 with 4 gpus. We got similar results using this setting.
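
In other words, the learning rate scales linearly with the total batch size: 0.03 × 128/256 = 0.015.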

Linear Classification

With a pre-trained model, to train a supervised linear classifier on frozen features/weights in an 8-gpu machine, run:

python main_lincls.py \
  -a resnet50 \
  --lr 30.0 \
  --batch-size 256 \
  --pretrained [your checkpoint path]/checkpoint_0199.pth.tar \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

Linear classification results on ImageNet using this repo with 8 NVIDIA V100 GPUs:

            pre-train epochs   pre-train time   MoCo v1 top-1 acc.   MoCo v2 top-1 acc.
ResNet-50   200                53 hours         60.8±0.2             67.5±0.1

Here we run 5 trials (of pre-training and linear classification) and report mean±std: the 5 results of MoCo v1 are {60.6, 60.6, 60.7, 60.9, 61.1}, and of MoCo v2 are {67.7, 67.6, 67.4, 67.6, 67.3}.

Models

Our pre-trained ResNet-50 models can be downloaded as follows:

          epochs   mlp   aug+   cos   top-1 acc.   model      md5
MoCo v1   200      -     -      -     60.6         download   b251726a
MoCo v2   200      ✓     ✓      ✓     67.7         download   59fd9945
MoCo v2   800      ✓     ✓      ✓     71.1         download   a04e12f8

Transferring to Object Detection

See ./detection.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

moco's People

Contributors

amyreese, bigfootjon, facebook-github-bot, kaiminghe, ppwwyyxx, r-barnes, subramen, zertosh


moco's Issues

Use configurable loggers rather than swallowing print statements

Just a nit, but print statements are used throughout the code rather than configurable loggers. Since the print builtin is overridden on processes other than process 0, this can be surprising to developers. Consider using Python's logging module to make this configuration more standardized and clear.
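
A minimal sketch of what this could look like (illustrative, not part of the repo; the setup_logger name and the rank-based level choice are assumptions):

import logging

def setup_logger(rank):
    # Configure a per-process logger; only rank 0 logs at INFO level,
    # other ranks stay at WARNING so the console is not flooded.
    level = logging.INFO if rank == 0 else logging.WARNING
    logging.basicConfig(
        format="%(asctime)s [rank {}] %(levelname)s: %(message)s".format(rank),
        level=level,
    )
    return logging.getLogger("moco")

logger = setup_logger(rank=0)
logger.info("starting training")  # replaces a bare print(...)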

ImageNet linear classifier weights?

Hi, would you mind also uploading the weights (or the whole checkpoint) for a model with the linear classifier on ImageNet? I'm running main_lincls.py myself currently, but it looks like it will take quite some time to get through the 100 epochs needed, and I guess it would be generally useful to others to have these weights readily downloadable.

About shuffleBN with pytorch ddp

I am new to PyTorch DistributedDataParallel (DDP) and not clear about the ShuffleBN process.

In the code, you first call concat_all_gather(), and then broadcast random shuffle indices to all devices from src=0.

Here is my question:
Is only device 0 broadcasting? Do the other devices also run _batch_shuffle_ddp()?
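
For reference, a condensed sketch of what the shuffle does (simplified from the repo's _batch_shuffle_ddp, so treat it as an approximation): every rank runs this code, and the broadcast from src=0 is what forces all ranks to use the permutation drawn on device 0.

import torch
import torch.distributed as dist

@torch.no_grad()
def batch_shuffle_sketch(x):
    # 1) Gather this rank's batch together with every other rank's batch.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(x) for _ in range(world_size)]
    dist.all_gather(gathered, x)
    x_all = torch.cat(gathered, dim=0)

    # 2) Every rank draws a random permutation, but the broadcast from src=0
    #    overwrites it, so all ranks end up with rank 0's permutation.
    idx_shuffle = torch.randperm(x_all.shape[0], device=x.device)
    dist.broadcast(idx_shuffle, src=0)
    idx_unshuffle = torch.argsort(idx_shuffle)  # needed later to undo the shuffle

    # 3) Each rank keeps only its own shuffled slice of the global batch.
    rank = dist.get_rank()
    idx_this = idx_shuffle.view(world_size, -1)[rank]
    return x_all[idx_this], idx_unshuffle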

MoCo v2 performance curve for different K

Hi @KaimingHe, we had fun reading your paper; thank you for sharing your work.

Fig. 3 in the MoCo v1 paper compares the accuracy of various contrastive learning mechanisms for varying K. We could not find this plot for MoCo v2. Could you please share it for MoCo v2? Basically, the accuracy for those 6 points, K = (256, ..., 65536).

thanks again,
Srikar

How to deal with the BN layers in the key encoder and query encoder?

There are some BN layers in the key and query encoders. How should these layers be handled, frozen or not frozen? Has anyone run experiments comparing freezing the BN layers with not freezing them? When I refer to another implementation of MoCo (https://github.com/HobbitLong/CMC/blob/master/train_moco_ins.py, line 412), I noticed that it freezes the BN layers of the key encoder but not those of the query encoder, while in the official implementation neither encoder freezes its BN layers. I wonder whether freezing the BN layers of an encoder has an effect on the final result, and why?

About the FPN setting on COCO

Hi,

Thank you for open-sourcing this simple and clear repo!

I have tried to reproduce the R50-FPN results on COCO and am curious about the normalization setting. I have created a config file here. Would you mind taking a look to see whether there is any difference between my setting and yours?

Thank you!

Must detection experiments be performed on 8 GPUs?

Hi, I am curious about the detection downstream task. My question is whether we must use 8 GPUs to reproduce the performance in your paper, and whether there is any way to reproduce the official numbers with 4 or fewer GPUs.

Thanks.

Question about cloning the queue

Thank you for providing such clear and easy-to-follow code for this great project! I was just curious about line 146 in builder.py:

l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])

Is it necessary to make a copy of the queue at all? Does this introduce unnecessary overhead? Or am I misunderstanding something?
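
To make the shapes concrete, here is a tiny self-contained example of what that einsum computes (dummy tensors, not the repo's code; the dimensions are illustrative):

import torch

N, C, K = 4, 128, 1024            # batch size, feature dim, queue length
q = torch.randn(N, C)             # queries
queue = torch.randn(C, K)         # the memory queue, one key per column

# Each query is compared against every key in the queue: the result is N x K.
l_neg = torch.einsum('nc,ck->nk', [q, queue.clone().detach()])
print(l_neg.shape)                # torch.Size([4, 1024])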

About the FPN setting on COCO

Hi, thanks for open-sourcing this excellent repo.
I tried to reproduce the performance of Mask R-CNN (R50-FPN, 1x) following #34 (comment).
But there is still a gap between the reproduced AP (34.8%) and the AP reported in the paper (35.1%).
Are there differences between our config file and yours?
This is our config file:

_BASE_: "Base-RCNN-FPN.yaml"
MODEL:
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  MASK_ON: True
  WEIGHTS: "Mocov1 Model"
  BACKBONE:
    FREEZE_AT: 0
  RESNETS:
    DEPTH: 50
    NORM: "SyncBN"
    STRIDE_IN_1X1: False
  FPN:
    NORM: "SyncBN"
TEST:
  PRECISE_BN:
    ENABLED: True
  EVAL_PERIOD: 5000
SOLVER:
  STEPS: (60000, 80000)
  MAX_ITER: 90000
INPUT:
  FORMAT: "RGB"
OUTPUT_DIR: "./output/mask_fpn_1x_mocov1/"

Can BN be applied within DistributedDataParallel (DDP)?

Typically, we use SyncBN in DDP to ensure that the computed gradients are identical across different GPUs; it keeps the models on different GPUs with exactly the same parameters during training.

However, in MoCo training (IN-1M), the encoders contain several vanilla BN layers. How can we ensure that the models across GPUs keep the same parameters? Thanks.

Different model parameters initialized in each GPU worker

Hi,
Following the instructions for both 'Unsupervised Training' and 'Linear Classification', I find that different model parameters are initialized in each GPU worker, because the random seed is not set inside the main_worker function.
For PyTorch DistributedDataParallel, do you think initializing the same set of model parameters across all GPU workers would give more accurate gradients and better performance?
Thanks!
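
A minimal sketch of the change being suggested (illustrative; the seed_everything helper is hypothetical and not part of the repo):

import random

import numpy as np
import torch

def seed_everything(seed):
    # Seed all RNGs so each spawned worker starts from identical initial weights.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# e.g. called at the top of main_worker:
seed_everything(42)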

Something wrong with SyncBN

When I try to run VOC detection training with the command
python train_net.py --config-file configs/pascal_voc_R_50_C4_24k_moco.yaml --num-gpus 4 MODEL.WEIGHTS ./output.pkl, it runs out of memory on 4 RTX 2080 Ti GPUs (11 GB each).

This makes no sense to me, since the original pascal_voc_R_50_C4_24k_moco in detectron2 only takes about 7.5 GB per GPU. I found that the only difference between them lies in RESNETS.NORM, which is set to FrozenBN in detectron2 but to SyncBN for MoCo.

I tried changing it to FrozenBN and the memory footprint looks good, except that the loss turns to NaN after 30-40 iterations. Only after decreasing the lr from 0.2 to 0.05 does training stay stable. I am not sure why SyncBN adds such a large memory cost. Is the training instability caused by removing SyncBN? Any help would be appreciated. Thanks.

Loss stuck at ~6.90

I am trying to train MoCo v2 on a machine with 2 GPUs using the hyperparameters recommended in this repo. However, the loss gets stuck around 6.90. Is this behaviour normal, or should I try a different set of hyperparameters? I see that you have used a machine with 8 GPUs; could this explain the difference? Thanks!

strange top-1

Epoch: [34][3590/4999] Time 0.426 ( 1.635) Data 0.000 ( 0.227) Loss 6.8926e+00 (6.9147e+00) Acc@1 73.44 ( 76.76) Acc@5 87.50 ( 87.55)
Epoch: [34][3600/4999] Time 0.437 ( 1.638) Data 0.000 ( 0.227) Loss 7.0694e+00 (6.9147e+00) Acc@1 59.38 ( 76.76) Acc@5 76.56 ( 87.55)
Epoch: [34][3610/4999] Time 0.432 ( 1.638) Data 0.000 ( 0.226) Loss 6.9074e+00 (6.9146e+00) Acc@1 78.12 ( 76.76) Acc@5 90.62 ( 87.55)
Epoch: [34][3620/4999] Time 0.423 ( 1.639) Data 0.000 ( 0.225) Loss 6.9464e+00 (6.9146e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.55)
Epoch: [34][3630/4999] Time 0.436 ( 1.644) Data 0.000 ( 0.225) Loss 6.8364e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 89.06 ( 87.56)
Epoch: [34][3640/4999] Time 0.425 ( 1.646) Data 0.000 ( 0.224) Loss 6.9520e+00 (6.9145e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.56)
Epoch: [34][3650/4999] Time 0.426 ( 1.646) Data 0.000 ( 0.224) Loss 6.8319e+00 (6.9145e+00) Acc@1 84.38 ( 76.77) Acc@5 87.50 ( 87.56)
Epoch: [34][3660/4999] Time 0.428 ( 1.646) Data 0.000 ( 0.223) Loss 6.8066e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3670/4999] Time 0.471 ( 1.651) Data 0.000 ( 0.222) Loss 6.9694e+00 (6.9144e+00) Acc@1 78.12 ( 76.77) Acc@5 89.06 ( 87.57)
Epoch: [34][3680/4999] Time 0.431 ( 1.650) Data 0.000 ( 0.222) Loss 6.8628e+00 (6.9144e+00) Acc@1 81.25 ( 76.77) Acc@5 87.50 ( 87.57)
Epoch: [34][3690/4999] Time 0.428 ( 1.650) Data 0.000 ( 0.221) Loss 6.8666e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 92.19 ( 87.56)
Epoch: [34][3700/4999] Time 0.434 ( 1.650) Data 0.000 ( 0.221) Loss 6.9402e+00 (6.9144e+00) Acc@1 71.88 ( 76.78) Acc@5 87.50 ( 87.57)
Epoch: [34][3710/4999] Time 0.434 ( 1.654) Data 0.000 ( 0.220) Loss 6.8522e+00 (6.9144e+00) Acc@1 81.25 ( 76.78) Acc@5 92.19 ( 87.57)
Epoch: [34][3720/4999] Time 0.421 ( 1.655) Data 0.000 ( 0.219) Loss 6.8393e+00 (6.9145e+00) Acc@1 79.69 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3730/4999] Time 0.426 ( 1.658) Data 0.000 ( 0.219) Loss 6.9804e+00 (6.9145e+00) Acc@1 68.75 ( 76.78) Acc@5 81.25 ( 87.57)
Epoch: [34][3740/4999] Time 0.424 ( 1.658) Data 0.000 ( 0.218) Loss 7.0028e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3750/4999] Time 0.438 ( 1.662) Data 0.000 ( 0.218) Loss 6.9528e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3760/4999] Time 0.423 ( 1.664) Data 0.000 ( 0.217) Loss 6.8455e+00 (6.9143e+00) Acc@1 76.56 ( 76.79) Acc@5 93.75 ( 87.57)
Epoch: [34][3770/4999] Time 0.430 ( 1.666) Data 0.000 ( 0.217) Loss 6.9374e+00 (6.9143e+00) Acc@1 81.25 ( 76.79) Acc@5 90.62 ( 87.57)

I use the following command to train on ImageNet with 4× 2080 Ti GPUs:

python main_moco.py -a resnet50 --mlp --moco-t 0.2 --aug-plus --cos --lr 0.015 --batch-size 256 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 /job/large_dataset/open_datasets/ImageNet/

I suspect it is training in a supervised manner. Is there anything wrong with my experiments?

How to evaluate during training

I notice the final R@1 after 200 epochs in the README is 60%, but there is no code to evaluate the model in the repo, only training accuracy.

Can you tell me how to evaluate the performance during training?

How to measure convergence?

When I apply MoCo to other datasets, I change the length of the queue and other parameters. I find that it is very hard to judge the convergence of the model.

Sometimes the accuracy increases very fast and the loss approaches 0 quickly. Under other circumstances, the loss steadily increases while the accuracy remains low. Both situations could lead to a good feature extractor, because those metrics heavily depend on the queue length. A large queue length leads to excessive negative samples, which makes learning extremely unbalanced. A small queue length can also degrade the instance discrimination problem, making it insufficiently difficult. Problems that are too simple or too hard will affect the learning process.

How can we judge the convergence of MoCo training? And how can we select a proper queue length depending on our dataset size?

Augmentation

Hi. Thanks for this amazing work.

I have a question though. Is there a specific reason why you chose the augmentations jitter, grayscale, and Gaussian blur? Do you know whether stronger augmentations like RandAugment with random magnitude or AutoAugment could provide better results? Or would these hurt performance?

Thanks

Pre-train time too long.

According to your result, pre-training for 200 epochs (ResNet-50 baseline) takes 53 hours on an 8× V100 machine. But the training speed on my 8× V100 machine is three to four times slower than this. I don't know why; maybe the environment configuration is different. Could you release your environment configuration? Thanks!

This is the pre-training log: about 0.6 s per batch, 3000 s (about 1 h) per epoch.

2020-07-16T09:20:06.867Z: [1,0]<stdout>:Epoch: [16][4000/5004]	Time  1.300 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.0633e+00 (1.2471e+00)	Acc@1 100.00 ( 95.40)	Acc@5 100.00 ( 97.76)
2020-07-16T09:20:12.016Z: [1,0]<stdout>:Epoch: [16][4010/5004]	Time  0.309 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.4829e+00 (1.2472e+00)	Acc@1  87.50 ( 95.40)	Acc@5  93.75 ( 97.76)
2020-07-16T09:20:18.283Z: [1,0]<stdout>:Epoch: [16][4020/5004]	Time  1.043 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.1532e+00 (1.2472e+00)	Acc@1  96.88 ( 95.40)	Acc@5  96.88 ( 97.75)
2020-07-16T09:20:24.301Z: [1,0]<stdout>:Epoch: [16][4030/5004]	Time  0.271 ( 0.685)	Data  0.000 ( 0.084)	Loss 1.1201e+00 (1.2469e+00)	Acc@1  96.88 ( 95.40)	Acc@5 100.00 ( 97.75)
2020-07-16T09:20:30.259Z: [1,0]<stdout>:Epoch: [16][4040/5004]	Time  0.413 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.4439e+00 (1.2468e+00)	Acc@1  90.62 ( 95.40)	Acc@5  93.75 ( 97.75)
2020-07-16T09:20:36.487Z: [1,0]<stdout>:Epoch: [16][4050/5004]	Time  0.213 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.1293e+00 (1.2468e+00)	Acc@1  93.75 ( 95.40)	Acc@5 100.00 ( 97.76)
2020-07-16T09:20:42.951Z: [1,0]<stdout>:Epoch: [16][4060/5004]	Time  0.232 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.1727e+00 (1.2470e+00)	Acc@1 100.00 ( 95.40)	Acc@5 100.00 ( 97.75)
2020-07-16T09:20:48.433Z: [1,0]<stdout>:Epoch: [16][4070/5004]	Time  0.260 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.3516e+00 (1.2469e+00)	Acc@1  96.88 ( 95.40)	Acc@5  96.88 ( 97.75)
2020-07-16T09:20:54.556Z: [1,0]<stdout>:Epoch: [16][4080/5004]	Time  0.271 ( 0.684)	Data  0.000 ( 0.083)	Loss 1.0669e+00 (1.2469e+00)	Acc@1  96.88 ( 95.40)	Acc@5 100.00 ( 97.76)
2020-07-16T09:21:01.362Z: [1,0]<stdout>:Epoch: [16][4090/5004]	Time  0.914 ( 0.684)	Data  0.000 ( 0.082)	Loss 1.3178e+00 (1.2468e+00)	Acc@1  90.62 ( 95.40)	Acc@5  96.88 ( 97.75)
2020-07-16T09:21:07.425Z: [1,0]<stdout>:Epoch: [16][4100/5004]	Time  0.215 ( 0.683)	Data  0.000 ( 0.082)	Loss 9.2172e-01 (1.2467e+00)	Acc@1 100.00 ( 95.40)	Acc@5 100.00 ( 97.75)
2020-07-16T09:21:14.707Z: [1,0]<stdout>:Epoch: [16][4110/5004]	Time  0.359 ( 0.684)	Data  0.000 ( 0.082)	Loss 1.3362e+00 (1.2468e+00)	Acc@1  96.88 ( 95.40)	Acc@5  96.88 ( 97.75)
➜  2020-7-16 nvidia-smi
Thu Jul 16 09:41:17 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   54C    P0   181W / 250W |   4802MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   56C    P0   109W / 250W |   4810MiB / 32480MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:0D:00.0 Off |                    0 |
| N/A   42C    P0   176W / 250W |   4808MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:13:00.0 Off |                    0 |
| N/A   43C    P0   172W / 250W |   4810MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  Off  | 00000000:83:00.0 Off |                    0 |
| N/A   56C    P0   197W / 250W |   4804MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   58C    P0   168W / 250W |   4810MiB / 32480MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  Off  | 00000000:8E:00.0 Off |                    0 |
| N/A   43C    P0    64W / 250W |   4810MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  Off  | 00000000:91:00.0 Off |                    0 |
| N/A   42C    P0   157W / 250W |   4808MiB / 32480MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+

It seems that this problem is caused by the PyTorch version. This is my running environment:

pytorch1.3.1-py36-cuda10.0-cudnn7.0

Question regarding the weight decay

Thank you for releasing the code.
In MoCo training, the optimizer is created as

optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay)

Should model.parameters() be just model.encoder_q.parameters()? Or has the weight_decay been tuned accordingly for the entire model?

Loss curves on ImageNet

Hello --
I'm trying to reproduce some of these results on a different dataset, and the loss slowly bounces up and down, without converging (see below). Is that expected behavior? I don't think the paper shows what the loss/pretext accuracy look like in the ImageNet training -- might it be possible to share those plots here?

[screenshot: loss curve from the run described above]

Thanks!

Edit: Note, my dataset has ~250K images, so ~25% the size of ImageNet -- I'm wondering whether the difference in dataset sizes could be causing problems? E.g., perhaps because the length of the momentum buffer is 4x larger relative to the size of the dataset.

Number of GPUs vs. batch size

Hi,

I found a note:
"Note: for 4-gpu training, we recommend following the linear lr scaling recipe: --lr 0.015 --batch-size 128 with 4 gpus. We got similar results using this setting."

If my GPUs have enough memory so that each GPU can handle a batch size of 64, is it fine to use the original recipe --lr 0.03 --batch-size 256?
Or do you have a reason for recommending (batch size) / (# GPUs) = 32?

Question about code in moco/builder.py

Hi,

Thanks for your impressive work.

In moco/builder.py, line 63:

self.queue[:, ptr:ptr + batch_size] = keys.T

I suppose that keys is a Tensor with a batch_size dimension, and that T is a float scalar attribute of self, i.e. self.T.

So the error AttributeError: 'Tensor' object has no attribute 'T' is raised when I directly run the training code.

Should it be self.T (I guess)? Or is there a specific setting I missed?
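
As far as I know, the .T attribute on tensors only exists in newer PyTorch releases; for a 2-D tensor, keys.t() gives the same transpose, so a version-agnostic form of that line would be (an assumption about the error's cause, not a confirmed fix):

self.queue[:, ptr:ptr + batch_size] = keys.t()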

Regards,

The unsupervised training method in the README breaks

Hi,

Thanks for releasing the code! I think the launch method in the README should be updated a bit. I run it like this:

python main_moco.py -a resnet50 --lr 0.03 --batch-size 256 --world-size 1 --rank 0 /data2/zzy/imagenet 

And I got the error of:

Traceback (most recent call last):
  File "main_moco.py", line 402, in <module>
    main()
  File "main_moco.py", line 133, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "main_moco.py", line 186, in main_worker
    raise NotImplementedError("Only DistributedDataParallel is supported.")
NotImplementedError: Only DistributedDataParallel is supported.

I think the rank is not correctly assigned. Did I miss anything?
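
For comparison (not a confirmed fix, just the README's recipe), the unsupervised-training command above includes the multiprocessing flags, so the equivalent launch here would be:

python main_moco.py -a resnet50 --lr 0.03 --batch-size 256 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  /data2/zzy/imagenet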

Problem in reproducing keypoint detection results

Hi, thanks for the great work.

I tried to reproduce your results on COCO keypoint detection using the pretrained MoCo model provided. I strictly followed the training pipeline in moco/detection and used the configs in detectron2/configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml. But training diverged after ~700 iterations as the loss became NaN.

I have tried reducing the base lr, but it does not seem to help much. Also, as I am using imgs_per_batch = 16, I don't think a very small base lr is appropriate.

So:

  1. Would you mind releasing the config files for keypoint detection?
  2. If not, could you please take a look at the configs below and point out where the problem is?

=======
The command I run is: python moco/detection/train_net.py --config-file configs_keypoints/keypoint_rcnn_R_50_FPN_3x.yaml --num-gpus 2 MODEL.WEIGHTS ./output.pkl

The following is the config file generated after running train_net.py

CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
  - keypoints_coco_2017_val
  TRAIN:
  - keypoints_coco_2017_train
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: false
    SIZE:
    - 0.9
    - 0.9
    TYPE: relative_range
  FORMAT: BGR
  MASK_FORMAT: polygon
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN:
  - 640
  - 672
  - 704
  - 736
  - 768
  - 800
  MIN_SIZE_TRAIN_SAMPLING: choice
MODEL:
  ANCHOR_GENERATOR:
    ANGLES:
    - - -90
      - 0
      - 90
    ASPECT_RATIOS:
    - - 0.5
      - 1.0
      - 2.0
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES:
    - - 32
    - - 64
    - - 128
    - - 256
    - - 512
  BACKBONE:
    FREEZE_AT: 0
    NAME: build_resnet_fpn_backbone
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES:
    - res2
    - res3
    - res4
    - res5
    NORM: ''
    OUT_CHANNELS: 256
  KEYPOINT_ON: true
  LOAD_PROPOSALS: false
  MASK_ON: false
  META_ARCHITECTURE: GeneralizedRCNN
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: true
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN:
  - 103.53
  - 116.28
  - 123.675
  PIXEL_STD:
  - 1.0
  - 1.0
  - 1.0
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    - false
    - false
    - false
    - false
    DEPTH: 50
    NORM: SyncBN
    NUM_GROUPS: 1
    OUT_FEATURES:
    - res2
    - res3
    - res4
    - res5
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: true
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_WEIGHTS: &id001
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES:
    - p3
    - p4
    - p5
    - p6
    - p7
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.4
    - 0.5
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS:
    - - 10.0
      - 10.0
      - 5.0
      - 5.0
    - - 20.0
      - 20.0
      - 10.0
      - 10.0
    - - 30.0
      - 30.0
      - 15.0
      - 15.0
    IOUS:
    - 0.5
    - 0.6
    - 0.7
  ROI_BOX_HEAD:
    BBOX_REG_WEIGHTS:
    - 10.0
    - 10.0
    - 5.0
    - 5.0
    CLS_AGNOSTIC_BBOX_REG: false
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: FastRCNNConvFCHead
    NORM: ''
    NUM_CONV: 0
    NUM_FC: 2
    POOLER_RESOLUTION: 7
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.5
    TRAIN_ON_PRED_BOXES: false
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    IOU_LABELS:
    - 0
    - 1
    IOU_THRESHOLDS:
    - 0.5
    NAME: StandardROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 1
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: true
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS:
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: false
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: ''
    NUM_CONV: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_WEIGHTS: *id001
    BOUNDARY_THRESH: -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    - p6
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.3
    - 0.7
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 1500
    PRE_NMS_TOPK_TEST: 1000
    PRE_NMS_TOPK_TRAIN: 2000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  WEIGHTS: ./output.pkl
OUTPUT_DIR: ./output
SEED: -1
SOLVER:
  BASE_LR: 0.02
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: false
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 16
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 180000
  MOMENTUM: 0.9
  NESTEROV: false
  STEPS:
  - 120000
  - 160000
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0.0001
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    - 1000
    - 1100
    - 1200
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: true
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0

Question: Using this method to train on MNIST

Hello, @ppwwyyxx @KaimingHe
I use MoCo to train on the MNIST dataset as an easy example. The MNIST train.py is adapted from the PyTorch MNIST example.
It is easy to reach 99% when training directly in the supervised setting.
When I use the MoCo method to pretrain the model first and then fine-tune the pretrained weights (here the conv weights are frozen and only the fc layer can change), the performance on the test set only reaches 95% and does not improve further.
Concretely, when training on the MNIST dataset I set the queue length to 3840 rather than the default 65536, because the MNIST dataset is much smaller than ImageNet.

Does this mean the feature extraction network is not trained well? Can you give me some suggestions about this phenomenon?
Moreover, can you give me some suggestions on how to train on a custom dataset? What changes are required in the hyperparameters?

Error in distributed training

I frequently get an error when distributed training is enabled. It occurs roughly every 50-100 epochs. Here is the error message:

terminate called after throwing an instance of 'std::system_error'
  what():  Transport endpoint is not connected
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Could you help me to resolve the issue?

Using Tensorboard (TypeError: can't pickle _thread.lock objects)

Hi!
I am trying to incorporate TensorBoard with the following snippet in the train function:

if args.gpu == 0:
    args.tb.add_scalar('loss/train', loss.item(), (len(train_loader)*epoch)+i)
    args.tb.add_scalar('acc1/train', acc1[0], (len(train_loader)*epoch)+i)

But I am receiving a TypeError: can't pickle _thread.lock objects error originating from mp.spawn().

Any way out?
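
A possible workaround sketch, assuming the error comes from a SummaryWriter stored on args and pickled by mp.spawn (the args.tb_dir name below is hypothetical): create the writer inside the spawned worker instead of passing it in.

from argparse import Namespace
from torch.utils.tensorboard import SummaryWriter

def main_worker(gpu, ngpus_per_node, args):
    # Create the writer inside the child process; a SummaryWriter holds
    # thread locks and cannot be pickled through mp.spawn().
    writer = SummaryWriter(log_dir=args.tb_dir) if gpu == 0 else None
    for step in range(3):                       # stand-in for the training loop
        if writer is not None:
            writer.add_scalar('loss/train', 1.0 / (step + 1), step)
    if writer is not None:
        writer.close()

main_worker(gpu=0, ngpus_per_node=1, args=Namespace(tb_dir='./runs'))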

Question regarding the parallelism

Hey, thanks for your contribution to unsupervised CNN learning.

I would like to do some research based on your architecture, but unfortunately I don't have multiple GPUs. Would it be easy to change this architecture to run on a single-GPU system?

The affected methods would be:
concat_all_gather, the forward function, _batch_unshuffle_ddp, and _batch_shuffle_ddp.

On top of that, I have a Windows server, which doesn't support the distributed module.

Thanks

Pretrained model cannot be loaded for detection

Hi, thanks for the amazing code!

When I tried to load a pretrained checkpoint for object detection, this error happens:
"ValueError: Unsupported type found in checkpoint! model: <class 'dict'>"

I can resolve this error if I save the state_dict directly in the checkpoint without using a "model" key, but that results in the "running_mean" and "running_bias" of the batchnorm layers not being loaded into the detector. I guess it has something to do with "matching_heuristics".

Thanks, and looking forward to your reply!

Checkpoint from the pre-training step

Can you please provide checkpoints from the pretraining step (main_moco.py)? When I use the checkpoints you provided to resume pretraining with main_moco.py, I receive errors regarding missing weights. The checkpoints work fine when I use them with the main_lincls.py script.

Can synchronized batch norm (SyncBN) be used to avoid cheating? Is shuffling BN a must?

Ditto. I kept wondering about SyncBN vs ShuffleBN as to whether the former can effectively prevent cheating.
SimCLR appears to be using SyncBN (referred to as "Global BN").

SyncBN works out of the box with PyTorch, whereas shuffling BN requires a bit more hacking. Does the fact that shuffling BN was chosen mean that it is better? (Or that SyncBN wasn't ready at the time MoCo was designed?)

Learning Rate Scheduling Logic

According to the paper, during training you run for a default of 200 epochs and multiply the learning rate by 0.1 at epochs 120 and 160. During fine-tuning, these numbers become 100, 60, and 80 respectively. For the fine-tuning case, this would imply a learning rate of 30, then 3, then 0.3; however, this is not what the logic of the milestone scheduler performs.

def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    for milestone in args.schedule:
        lr *= 0.1 if epoch >= milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

Instead, what happens is that the learning rate is constant at 30 up until epoch 59, and then at every epoch between 60 and 79 it is multiplied by 0.1. Furthermore, at epochs 80 to 100, it is multiplied by 0.1 twice in each epoch cycle (once for epoch >= 60, and again for epoch >= 80). You end up with a final learning rate that is essentially zero. The key is the greater-than-or-equal-to operator, which should be just an equality operator. Correct me if I'm wrong, but shouldn't the logic be:

def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    for milestone in args.schedule:
        lr *= 0.1 if epoch == milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

In order to conduct the kind of stepwise learning rate scheduling that is described in the paper?

tar: Error opening archive: Unrecognized archive format

I am using macOS and I tried to download the pre-trained model with curl -OL https://dl.fbaipublicfiles.com/moco/moco_checkpoints/moco_v2_800ep/moco_v2_800ep_pretrain.pth.tar. I tried to decompress the file with tar xvf moco_v2_800ep_pretrain.pth.tar but received the error tar: Error opening archive: Unrecognized archive format. Could you please verify that the pre-trained model files are not corrupted? Thank you very much.
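
For what it's worth, these .pth.tar files are PyTorch checkpoints saved with torch.save rather than real tar archives, so they are read with torch.load instead of being extracted (a minimal sketch, assuming the file sits in the current directory):

import torch

# The .tar suffix is just a naming convention inherited from the PyTorch
# ImageNet example; the file itself is a regular torch checkpoint.
checkpoint = torch.load("moco_v2_800ep_pretrain.pth.tar", map_location="cpu")
print(checkpoint.keys())  # e.g. a state_dict plus bookkeeping entries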

Question about shuffleBN

Awesome work!
In my opinion, ShuffleBN is proposed to maintain the differences in running mean and variance between encoder q and encoder k, which prevents locally optimal encoder parameters. How do you evaluate the benefits of ShuffleBN?
Moreover, distributed training of MoCo suffers from the time-consuming broadcast and all_gather operations in ShuffleBN. Do you have any suggestions for accelerating distributed training with ShuffleBN?

Question about training the linear classification model

Hi, I did unsupervised pre-training of a ResNet-50 model on a dataset containing 122,208 unlabeled bird images; the last-epoch log is below:

[screenshot: last-epoch training log]

The loss is stuck at ~6.90, which is similar to another closed issue #12, where it seemed not that bad. Is this normal?

[screenshot]

Then I use this pretrained model to train and evaluate on a dataset containing 3,959 train images and 2,000 val images, spanning 200 bird categories. I follow:

python main_lincls.py \
  -a resnet50 \
  --lr 30.0 \
  --batch-size 256 \
  --pretrained [your checkpoint path]/checkpoint_0199.pth.tar \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

However, the validation accuracy is quite low (~12%), which is much lower than with the supervised training method (~60%). I tried several learning rates (0.1, 5, 10, 100.0) but the results still seem bad.
So may I ask how you set these hyperparameters? Or is the pretrained model bad? How can I check this problem?
Thanks!

Training time

On your side, how long does it take to run unsupervised pre-training of a ResNet-50 model on ImageNet on an 8-GPU machine for 800 epochs?

Thank you!

Pre-trained models for ResNet50 2x and 4x width

Hi! Thanks for this great code repo. Would it be possible to make available the pre-trained models for ResNet-50 2x width and 4x width? These models were used in the original MoCo paper, but it requires a lot of resources to train such wide models.
