vturrisi / solo-learn

1.4K 11.0 182.0 5.17 MB

solo-learn: a library of self-supervised methods for visual representation learning powered by Pytorch Lightning

License: MIT License

Shell 3.00% Python 97.00%

simclr nvidia-dali contrastive-learning pytorch pytorch-lightning barlow-twins self-supervised-learning swav byol moco

solo-learn's Introduction

PhD @ University of Trento.

Applied Scientist Intern @ Amazon Berlin (2022)

AI Research Intern @ Samsung Research Center in Cambridge (2022).

solo-learn's Issues

Configuration for linear eval of BYOL on cifar-10/100

Hi!

I tried a slightly modified version of BYOL on cifar-10/100. The additional classifier used during pretraining scores 1% (cifar-10) and 6% (cifar-100) higher than the classifier trained on the final backbone during linear evaluation. This might be because we used the IN-100 configuration of BYOL for linear evaluation, since a cifar-10/100 configuration for linear evaluation is not provided. It would be very helpful if you could share the respective configurations for cifar.

Thanks!

Problem about using repo

I tried to install the repo and run the provided scripts, but ran into package dependency errors. I also tried a new conda env with the given requirements file, and it still failed. Any idea how to fix this?

Thanks so much!
Best.

How to continue the training?

My training process crashed because of a power failure, and now I need to continue the training.
But I don't know how to do it, since this is the first time I am using pytorch-lightning.

Thank you for your help.

Handling of custom datasets

Can I use this code base to run the methods on multispectral images?
I would have to handle the transformations differently.
The data will be .tif files rather than .png.
The backbones will be slightly different, since the first conv layer will have n in_channels instead of the 3 used for plain images.

Implement ROMA (https://arxiv.org/abs/2107.10419v1)

Implement Deepcluster v2

Configuration for BYOL - Imagenet-1000

Hello!

For the BYOL run on Imagenet-1000, I could not find a corresponding bash file in the 'main' branch. Could you please provide the configuration for this setting? I see that the results table in the readme is filled in for BYOL on Imagenet-1000.

Thanks a lot!

Loss curves for VICReg

Hi,

Thank you very much for the excellent work on bringing all the different self-supervised models together in one place, and for the high code quality.

I am currently working on the VICReg loss and was wondering whether you have any loss curves for the invariance/covariance/variance losses, and whether you could please share them. It would really help me understand the behaviour of these losses.

Looking forward to your help.

Thank You.

Best Regards,
Anuj
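For reference when reading such curves, the three terms can be written out as they are defined in the VICReg paper, for two batches of embeddings Z and Z' with n samples and d dimensions each. The mapping of the weights λ, μ, ν to the `--sim_loss_weight`, `--var_loss_weight` and `--cov_loss_weight` flags that appear in a VICReg command elsewhere on this page is an assumption, not something confirmed here.

```latex
\begin{aligned}
s(Z, Z') &= \frac{1}{n}\sum_{i=1}^{n} \lVert z_i - z'_i \rVert_2^2
  && \text{(invariance)}\\
v(Z) &= \frac{1}{d}\sum_{j=1}^{d} \max\!\Big(0,\; \gamma - \sqrt{\operatorname{Var}(z^{j}) + \epsilon}\Big)
  && \text{(variance)}\\
c(Z) &= \frac{1}{d}\sum_{i \neq j} \big[C(Z)\big]_{i,j}^{2}
  && \text{(covariance)}\\
\mathcal{L} &= \lambda\, s(Z, Z') + \mu\,\big[v(Z) + v(Z')\big] + \nu\,\big[c(Z) + c(Z')\big]
\end{aligned}
```

Here z^j is the j-th embedding dimension across the batch, C(Z) is the covariance matrix of the embeddings, and γ is the target standard deviation.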

Auto UMAP

Create a queue class and move `find_nn` to `misc.py`

Which ImageNet-100?

Hi all,

First of all, thank you so much for creating this library, I have found it to be super useful for my own research!

I was wondering if you could provide some details on the ImageNet-100 dataset that you used? I cannot seem to find any "standard" ImageNet-100 dataset for downloading on the internet and the papers that use this dataset (eg. 1 and 2) seem to randomly select 100 classes from the dataset.

Any help would be much appreciated!

Offline k-NN eval

Use of the target in the base model

Hello, great work on these models. I have a question about their implementation. In the base model shared by all the self-supervised methods, there is a [_shared_step function](https://github.com/vturrisi/solo-learn/blob/main/solo/methods/base.py#:~:text=def%20_shared_step(self,1%20and%20acc) that runs a forward pass and computes the 'class_loss' of the network. This 'class_loss' is later added to the contrastive loss to backpropagate and update the network. The 'class_loss' is computed from the data labels with the cross-entropy function, but in self-supervised learning we are not allowed to use labels during pretraining. Maybe I misunderstand your implementation; if so, please let me know where I made the mistake.

Linear Model missing "patch_size" parameter

When I run the following script

python main_linear.py \
    --dataset cifar10 \
    --encoder resnet18 \
    --data_dir ./datasets \
    --max_epochs 100 \
    --gpus 0 \
    --precision 16 \
    --optimizer sgd \
    --scheduler step \
    --lr 1.0 \
    --lr_decay_steps 60 80 \
    --weight_decay 0 \
    --batch_size 256 \
    --num_workers 5 \
    --name simclr-cifar10-linear-eval \
    --pretrained_feature_extractor  EXTRATOR_PATH \
    --project selfsupervised  \
    --wandb

The program raises an error: 'Namespace' object has no attribute 'patch_size'

It seems the linear model is missing some required parameters.
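This message is what Python's argparse produces when code reads an attribute that the parser never defined. The sketch below reproduces it with the standard library only and shows a defensive read with `getattr`. The parser is a hypothetical stand-in, not solo-learn's actual argument parser, and the fallback value 16 is an arbitrary placeholder.

```python
import argparse

# Hypothetical parser that, like the failing script, never defines --patch_size.
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", type=str, default="cifar10")
parser.add_argument("--encoder", type=str, default="resnet18")
args = parser.parse_args(["--dataset", "cifar10"])


def read_patch_size(namespace, default=None):
    """Return namespace.patch_size if it was defined, else the given default."""
    return getattr(namespace, "patch_size", default)


# Direct attribute access fails because the flag was never registered.
try:
    args.patch_size
    raised = False
    message = ""
except AttributeError as err:
    raised = True
    message = str(err)

# Defensive access falls back to a placeholder value instead of crashing.
patch_size = read_patch_size(args, default=16)
```

A defensive read only hides the symptom; if the model really needs the value, registering the flag in the parser used by the linear script is the cleaner route.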

Implement ReSSL (https://arxiv.org/abs/2107.09282v1)

[New method suggestion] Augmentation-Augmented Variational Autoencoders

Hi guys!

I was wondering whether the new method AAVAEs proposed by William Falcon et al. can interest you and thus be implemented in your library or not.

The authors of the paper claim and suggest that autoencoding is a viable third family of self-supervised learning approach in addition to contrastive and non-contrastive learning although their method failed to outperform or perform comparably to the existing families of self-supervised learning algorithms.

I really enjoyed the idea behind this method and I wanted to strongly suggest it to you!

API for custom data

SimSiam results on ImageNet.

Can you share your results on the full Imagenet-1k dataset? I think the results on imagenet-100 are not sufficient to prove that the implementation is right.

My implementation here reaches 65% top-1 acc on ImageNet-1k (using mocov2's linear evaluation protocol):
https://github.com/poodarchu/SelfSup/blob/master/examples/simsiam/SimSiam.res50.imagenet.256bs.224size.100e.lin_cls/README.md. As reported in the paper, this number should be ~66.7%.

Moreover, my results on CIFAR can match the performance in paper exactly.

Add "how to implement new methods" tutorial

saving model checkpoints and initializing using pretrained models

Currently, the model checkpoints are saved at the end of the validation epoch (solo-learn/solo/utils/checkpointer.py, lines 156 to 157 in 09f5f89):

if epoch % self.frequency == 0:
    self.save(trainer)

They are skipped if only training data is used without validation data. Can the model checkpoint be saved at the end of the training epoch when val data is not used?
How can I initialize from pretrained (imagenet / other datasets) resnet18 and resnet50 models at the start of training?
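One possible way to decouple saving from validation is to run the same frequency check from a training-epoch hook. The sketch below is a hypothetical, dependency-free stand-in: `EpochCheckpointer` and `FakeTrainer` are invented for illustration, the hook name mirrors Lightning's `on_train_epoch_end` with a simplified signature, and `save` only records the epoch instead of writing a file.

```python
class EpochCheckpointer:
    """Hypothetical stand-in for a Lightning-style callback (not solo-learn's class)."""

    def __init__(self, frequency=1):
        self.frequency = frequency
        self.saved_epochs = []  # records epochs instead of writing files

    def save(self, trainer):
        # A real callback would write a checkpoint file here.
        self.saved_epochs.append(trainer.current_epoch)

    def on_train_epoch_end(self, trainer):
        # Same frequency test as the quoted snippet, but tied to the training
        # epoch, so it still runs when there is no validation loop.
        epoch = trainer.current_epoch
        if epoch % self.frequency == 0:
            self.save(trainer)


class FakeTrainer:
    """Tiny stub exposing only the attribute the sketch reads."""

    def __init__(self):
        self.current_epoch = 0


checkpointer = EpochCheckpointer(frequency=2)
trainer = FakeTrainer()
for epoch in range(5):
    trainer.current_epoch = epoch
    checkpointer.on_train_epoch_end(trainer)
```

With `frequency=2` and five epochs, the stub records a save at epochs 0, 2 and 4, regardless of whether a validation loader exists.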

Implement multi-crop (SwAV, DCv2, DINO, MoCo)

and improve SimCLR's implementation

Separate train and validation datasets without and with labels

Hi there! First of all nice work on the library, looks great.

I am in the special situation where I have a large dataset for pretraining but without any labels. I would still like to do linear and k-NN eval online on multiple smaller, related labelled datasets to measure transfer performance. This is different from the standard benchmarks where we train and evaluate on splits of the same (usually labelled) dataset, but I think should be a very common problem. It seems like this is not possible in solo-learn at the moment, is this correct? If yes, what changes would be necessary?

Thank you.

Implement additional down-stream tasks for eval

e.g. object detection

Implement AAVAE (https://arxiv.org/abs/2107.12329)

Add ImageNet results for all methods

Why is main_pretrain.py using classification_dataloader and not pretrain_dataloader?

I am trying to load a custom dataset without labels, only for pretraining (not linear evaluation).

I implemented the necessary changes to the pipeline in pretrain_dataloader.py, but I realized that main_pretrain.py (line 88) calls classification_dataloader.py and not pretrain_dataloader.py, as I had expected. Please advise how to address this for loading such a custom dataset.

tutorial

Is there any tutorial for this repo that one can use?

Thanks a lot :)

targets = batch[-1]

Hi, thanks for your sharing.
What is the meaning of the batch?

Thanks

Some problems when using the well trained model

Hello!

Thank you for your previous answer.
However, some errors occurred when I loaded the trained simsiam model for cifar10:
ModuleNotFoundError: No module named 'solo.methods'
and the moco model for cifar100:
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

When I use simclr, byol, moco for cifar10, there is no such problem.

How to use these well trained model

I appreciate your work, but I have some small questions.
Are the trained models provided on the homepage just the encoder, or the encoder plus the linear layer?
How should I use these models?

A bug that occurs while using more than 2 num_crops and DALI

I'm trying to train an older method, 'Rotation Prediction', with this library. It needs 4 crops per image, and I found a bug

in solo.methods.dali, around line 219.

The original code is
output_map = ["large1", "large2", "label"]

In fact, the following code is correct:
output_map = []
for i in range(self.num_crops):
    output_map.append(f"large{i+1}")
output_map.append('label')

Implementation of Barlow Twins

Hello, thanks for your great work!
I have some questions about Barlow Twins. I used the official code on Cifar-10 and modified conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=2, bias=False) and maxpool = nn.Identity(). With that, I got a top-1 accuracy of 89.57. The other parameters follow https://github.com/facebookresearch/barlowtwins.
Could you give some suggestions? Many thanks!

Implement OBoW

Add support for Online Bag of Words (OBoW) - https://arxiv.org/pdf/2012.11552.pdf

Implement DirectPred (https://arxiv.org/abs/2102.06810)

Performance degeneration when batch size is increased(BYOL)

Hi, thanks for your impressive work!

During the BYOL experiment with your code (Imagenet-100, resnet18, 200 epochs, same hyper-params as bash_files), the performance seems to degrade a lot when the batch size is increased ([val acc1] bsz128: 75.5%, bsz256: 70.54%).
Is there anything I need to modify when the batch size is increased?
I don't think it's because of the learning rate.

Improve readthedocs

Add "how to implement new methods" tutorial

We already have a tutorial, which should be fine for now.

Fail to run on custom dataset

@vturrisi Hi vturrisi, I have run the BYOL method on Imagenet100. However, when I try to run it on a custom dataset, it throws the following error:

ValueError: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Traceback (most recent call last):

I use the script custom/byol.sh. The Dataloader seems to have a problem with custom datasets.

The parameters are as follows:

python3 ../../../main_pretrain.py \
    --dataset custom \
    --encoder resnet18 \
    --data_dir /raid/yuanyong/datasets \
    --train_dir imagenet100_data/train \
    --no_labels \
    --max_epochs 400 \
    --gpus 0,1,2,3,4,5,6,7 \
    --distributed_backend ddp \
    --sync_batchnorm \
    --precision 16 \
    --optimizer sgd \
    --lars \
    --grad_clip_lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm \
    --scheduler warmup_cosine \
    --lr 1.0 \
    --classifier_lr 0.1 \
    --weight_decay 1e-5 \
    --batch_size 128 \
    --num_workers 8 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.2 \
    --hue 0.1 \
    --gaussian_prob 1.0 0.1 \
    --solarization_prob 0.0 0.2 \
    --name byol-400ep-custom \
    --entity unitn-mhug \
    --project solo-learn-custom \
    --wandb \
    --method byol \
    --output_dim 256 \
    --proj_hidden_dim 4096 \
    --pred_hidden_dim 8192 \
    --base_tau_momentum 0.99 \
    --final_tau_momentum 1.0

I use the imagenet100 as the custom dataset to validate the training process.
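A loader can report zero length when no images are found under the given path, or when each process receives fewer samples than one batch and incomplete batches are dropped. The helper below is a rough arithmetic check under stated assumptions: samples are split evenly across processes and rounded up, which is typical of a distributed sampler but is not read from solo-learn, and `batches_per_process` is a name invented for this sketch.

```python
import math


def batches_per_process(num_samples, batch_size, world_size=1, drop_last=True):
    """Estimate how many batches each process sees in one epoch.

    Assumes an even split of samples across processes (rounded up);
    this is an assumption, not solo-learn's documented behaviour.
    """
    per_process = math.ceil(num_samples / world_size)
    if drop_last:
        # Incomplete final batch is discarded.
        return per_process // batch_size
    # Incomplete final batch is kept.
    return math.ceil(per_process / batch_size)


# No files discovered under the train directory: zero batches in any setting.
empty = batches_per_process(0, batch_size=128, world_size=8)

# 1000 images over 8 processes is 125 per process, less than one batch of 128.
too_few = batches_per_process(1000, batch_size=128, world_size=8)
kept_partial = batches_per_process(1000, batch_size=128, world_size=8, drop_last=False)
```

If the estimate is zero, counting the image files that the dataset class actually discovers under `--data_dir`/`--train_dir` is a quick first check.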

queue size

Hi, thanks for your work.
I have a question about how to set the queue size properly. For example, if a dataset has 13,000 samples, can the queue size be set to 65536?
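To get a feel for what these numbers imply, the sketch below simulates a plain FIFO queue of sample ids (not embeddings) and counts how many copies of each sample it holds. It assumes the queue behaves like a simple first-in-first-out buffer, as in MoCo-style methods, and it uses a fixed sample order for determinism, whereas real loaders usually shuffle. `simulate_queue` is a name made up for this illustration.

```python
from collections import Counter, deque


def simulate_queue(queue_size, dataset_size, num_epochs):
    """Push sample ids epoch after epoch into a FIFO queue and count the copies left."""
    queue = deque(maxlen=queue_size)  # oldest entries fall out automatically
    for _ in range(num_epochs):
        queue.extend(range(dataset_size))  # fixed order; real loaders usually shuffle
    return Counter(queue)


copies = simulate_queue(queue_size=65536, dataset_size=13000, num_epochs=10)
fewest = min(copies.values())
most = max(copies.values())
```

With these sizes every sample id sits in the queue five or six times, so the queue spans roughly five epochs of entries. Whether that much staleness and duplication hurts training is not something this sketch can answer.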

Pretrain on custom dataset without labels

I am looking to pretrain a model on images for which labels are not available/defined. For that, I'm trying to replicate the DataLoader used for other datasets in the catalog (like cifar10), but it seems the methods are implemented to return labels along with the images. Could you please advise what to tweak, and how, to make the training pipeline work without labels?
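One common workaround when a pipeline expects (image, label) pairs is a dataset that returns a dummy label for every image. The sketch below is a hypothetical, dependency-free illustration: `UnlabeledImageFolder` is not a solo-learn class, the extension list is an assumption, and it returns file paths instead of decoded images so that it runs without an imaging library.

```python
from pathlib import Path
import tempfile

# Assumed set of image extensions; adjust to the actual data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


class UnlabeledImageFolder:
    """Hypothetical dataset: lists image files and pairs each with a dummy label."""

    def __init__(self, root, dummy_label=-1):
        self.paths = sorted(
            p for p in Path(root).rglob("*") if p.suffix.lower() in IMAGE_EXTENSIONS
        )
        self.dummy_label = dummy_label

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # A real dataset would open and transform the image here.
        return self.paths[index], self.dummy_label


# Small demonstration on a temporary folder with two images and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "b.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    dataset = UnlabeledImageFolder(tmp)
    num_items = len(dataset)
    first_path, first_label = dataset[0]
    first_name = first_path.name
```

Any code that consumes the dummy label (for example an online classifier) would still need to be disabled or ignored, since the label carries no information.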

train on custom dataset

Hi, I am trying to train on a custom dataset (~150k images)
with bash_files/pretrain/custom/byol.sh.
I changed

--train_dir path_to_dir_with_all_images

and added the --no_labels flag.
Training failed with CUDA out of memory,
and in the training printout I see

1 | classifier         | Linear                | 306 M

Debugging the code, I see at

solo-learn/solo/methods/base.py

Line 235 in 09f5f89

self.classifier = nn.Linear(self.features_dim, num_classes)

that num_classes = #num_of_images (~150k).
Is that how it is supposed to be?
When the no_labels flag is set, don't we want to remove the classifier layer?
Thanks for the help.
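The reported size can be sanity-checked with the parameter count of a fully connected layer, which is one weight per input/output pair plus one bias per output. The feature width of 2048 below is an assumption (typical of a ResNet-50 encoder); the post does not state which encoder was used, and `linear_param_count` is a helper written for this sketch.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Parameters of a fully connected layer: weights plus optional biases."""
    return in_features * out_features + (out_features if bias else 0)


# Assumed feature width; not stated in the post.
FEATURES_DIM = 2048

# A 1000-class head, for comparison.
imagenet_head = linear_param_count(FEATURES_DIM, 1000)       # 2_049_000

# One "class" per image with roughly 150k images.
per_image_head = linear_param_count(FEATURES_DIM, 150_000)   # 307_350_000
```

Under that assumption, a head with one output per image lands at roughly 307 M parameters, the same order as the 306 M in the printout, while a 1000-class head would be about 2.0 M.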

Is the self.encoder learning even after feats.detach()?

Calling feats.detach() will prevent the gradient from flowing back to earlier computational nodes during backpropagation, right? Won't that prevent the encoder (ResNet) from learning?

solo-learn/solo/methods/base.py

Line 314 in 9f1b131

logits = self.classifier(feats.detach())
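The effect of the detach can be written with a stop-gradient operator sg[·], assuming the self-supervised loss is computed from the un-detached features (the issue quotes only the classifier line, so this is an assumption). The symbols f_θ (encoder), g_φ (classifier), L_ssl and L_ce are notation introduced here, not names from the code.

```latex
\begin{aligned}
\mathcal{L}(\theta,\phi) &= \mathcal{L}_{\text{ssl}}\big(f_\theta(x)\big)
  + \mathcal{L}_{\text{ce}}\big(g_\phi(\operatorname{sg}[f_\theta(x)]),\, y\big),\\
\frac{\partial \mathcal{L}}{\partial \theta} &= \frac{\partial \mathcal{L}_{\text{ssl}}}{\partial \theta},
\qquad
\frac{\partial \mathcal{L}}{\partial \phi} = \frac{\partial \mathcal{L}_{\text{ce}}}{\partial \phi}.
\end{aligned}
```

Under this reading the encoder still learns, but only from the self-supervised term; the labels influence the classifier weights φ and nothing else.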

Pass data augmentation hyperparameters as arguments

Is it possible to add these augmentation hyperparameters (color_jitter_prob: float = 0.8, gray_scale_prob: float = 0.8, horizontal_flip_prob: float = 0.5) as arguments that can be passed to the code, like other parameters (gaussian_prob and solarization_prob)?
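A minimal sketch of what such flags could look like with argparse follows. The three new flag names are hypothetical and simply reuse the parameter names and defaults quoted in the request; the parser itself is a stand-in, not solo-learn's.

```python
import argparse

parser = argparse.ArgumentParser()
# Existing-style flag that accepts one value per augmentation pipeline.
parser.add_argument("--gaussian_prob", type=float, nargs="+")
# Hypothetical new flags; defaults copied from the values quoted in the request.
parser.add_argument("--color_jitter_prob", type=float, default=0.8)
parser.add_argument("--gray_scale_prob", type=float, default=0.8)
parser.add_argument("--horizontal_flip_prob", type=float, default=0.5)

# Example command line overriding one of the new flags.
args = parser.parse_args(
    ["--gaussian_prob", "1.0", "0.1", "--gray_scale_prob", "0.2"]
)
```

The parsed values would then need to be forwarded to wherever the transform pipeline is built, which is outside the scope of this sketch.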

barlowtwins train input question?

I'm very glad to see this amazing project, and found that it supports Barlow Twins.
So I read the code and have one question:

Why are the out feats used as the Barlow Twins input?

feats1, feats2 = out["feats"]

And why use feats from different times?

Improve tests

Add "how to contribute" to readme

Query

This may be a very stupid question, but I found that you always use a class loss along with the main loss function mentioned in the paper. The class loss is the CE loss between the targets and the predicted logits.
Is this something like supervised contrastive learning, or am I reading the code wrong?

nvjpeg memory allocation failure

Hi,

I ran into an issue that the pretraining script crashes after 8.5 epochs due to an allocation failure. I am guessing there might be a memory leak somewhere.

Details:

Nvidia Titan V GPU (12GB)
Using Nvidia Dali
commit 85b888a (I will do another pull, rerun and post the result, just to be sure.)
arguments: python3 main_pretrain.py --dataset imagenet --encoder resnet50 --data_dir /data --train_dir imagenet/train --val_dir imagenet/val --max_epochs 100 --gpus 0 --distributed_backend ddp --sync_batchnorm --precision 16 --optimizer sgd --scheduler warmup_cosine --lr 0.5 --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 48 --num_workers 12 --brightness 0.4 --contrast 0.4 --saturation 0.4 --hue 0.1 --zero_init_residual --name simsiam-resnet50-100ep-imagenet --dali --entity tomsal --project solo-learn --wandb --method simsiam --proj_hidden_dim 2048 --pred_hidden_dim 512 --output_dim 2048 --amp_level O2 --log_gpu_memory all
I did disable the val_loader for unrelated reasons (by val_loader = None just before line 159). No other changes were made.

The error I get is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/configuration_validator.py:101: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
  rank_zero_warn(f'you defined a {step_name} but have no {loader_name}. Skipping {stage} loop')
Global seed set to 5
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type       | Params
------------------------------------------
0 | encoder    | ResNet     | 23.5 M 
1 | classifier | Linear     | 2.0 M
2 | projector  | Sequential | 12.6 M
3 | predictor  | Sequential | 2.1 M 
------------------------------------------
40.2 M    Trainable params
2.0 K     Non-trainable params
40.3 M    Total params
161.002   Total estimated model params size (MB)
Global seed set to 5
read 1281167 files from 1000 directories
Epoch 8:  50%|████████████████                | 13369/26690 [1:55:18<1:54:53,  1.93it/s, loss=3.67, v_num=ok1z]
Traceback (most recent call last):
  File "main_pretrain.py", line 136, in <module>
    main()
  File "main_pretrain.py", line 130, in main
    trainer.fit(model, val_dataloaders=val_loader)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.train_loop.run_training_epoch()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
    for batch_idx, (batch, is_last_batch) in train_dataloader:
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
    value = next(iterator)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py", line 534, in prefetch_iterator
    for val in it:
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
    return self.request_next_batch(self.loader_iters)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/utilities/apply_func.py", line 85, in apply_to_collection
    return function(data, *args, **kwargs)
  File "~/Code/solo-learn/solo/methods/dali.py", line 59, in __next__
    batch = super().__next__()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 194, in __next__
    outputs = self._get_outputs()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/plugin/base_iterator.py", line 255, in _get_outputs
    outputs.append(p.share_outputs())
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 863, in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline:
Error when executing Mixed operator decoders__Image encountered:
Error in thread 2: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_decoder_decoupled_api.h:917] NVJPEG error "5" : NVJPEG_STATUS_ALLOCATOR_FAILURE n02447366/n02447366_33293.jpg
Stacktrace (7 entries):
[frame 0]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(+0x4cbbee) [0x7efc6c55dbee]
[frame 1]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(+0x87a63b) [0x7efc6c90c63b]
[frame 2]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(+0x87aa2e) [0x7efc6c90ca2e]
[frame 3]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x1f0) [0x7efc6b5ed330]
[frame 4]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x70718f) [0x7efc6bb9f18f]
[frame 5]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7efd013b96db]
[frame 6]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7efd010e2a3f]

Current pipeline object is no longer valid.

After I ran into this the first time, I reran it with GPU memory logging. This is the plot I get:

I am a bit confused that there is an increase after 3.5k steps (from 11979 GB to 1201GB). Let me know if I should provide more logs.

P.S.: Great work! It is a pleasure to work with! :)

About the category list of ImageNet-100 subset

Hi @vturrisi, thanks for your great work and also sharing the code.
Was the ImageNet-100 Checkpoint you posted trained on a subset of the following?

https://github.com/HobbitLong/CMC/blob/master/imagenet100.txt

If not, please give us a list of them.

Method VICReg output ungrouped runs based on classes during training

As shown below, wandb could not find a way to manage all classes in one run. Is there any way to fix this?
I have tried the Barlow Twins method and everything looks good there.

My config file looks like this:
python3 main_pretrain.py \
    --dataset $1 \
    --encoder resnet18 \
    --data_dir ./datasets \
    --max_epochs 1000 \
    --gpus 0,1,2,3 \
    --precision 16 \
    --optimizer sgd \
    --lars \
    --grad_clip_lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm \
    --scheduler warmup_cosine \
    --lr 0.3 \
    --weight_decay 1e-4 \
    --batch_size 256 \
    --num_workers 4 \
    --crop_size 32 \
    --min_scale 0.2 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.2 \
    --hue 0.1 \
    --solarization_prob 0.1 \
    --gaussian_prob 0.0 0.0 \
    --crop_size 32 \
    --num_crops_per_aug 1 1 \
    --name vicreg-$1 \
    --project solo-learn \
    --entity qiuyanxin \
    --wandb \
    --save_checkpoint \
    --method vicreg \
    --proj_hidden_dim 2048 \
    --proj_output_dim 2048 \
    --sim_loss_weight 25.0 \
    --var_loss_weight 25.0 \
    --cov_loss_weight 1.0 \
    --accelerator ddp
