seoungwugoh / stm
Video Object Segmentation using Space-Time Memory Networks
Hi! Thanks for your great work. I have three questions about the YouTube-VOS dataset.
Hi, I want to know some details of the configuration of the Adam optimizer. In the paper you mention using a constant learning rate of 1e-5, but you do not mention the weight decay, which is also important for optimization. Would you mind sharing with us the hyperparameter settings for the Adam optimizer (i.e., weight_decay and betas)?
Thanks
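For anyone reproducing this before the author replies, a hedged starting point is the paper's learning rate combined with PyTorch's Adam defaults; the betas and weight_decay below are assumptions, not confirmed settings:

```python
import torch

# model: the STM network instance (assumed in scope).
# lr follows the paper; betas/weight_decay are PyTorch defaults (assumptions).
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-5,
                             betas=(0.9, 0.999),
                             weight_decay=0)
```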
Hello. Thanks for your great work!
Do you plan to release the weights of STM pre-trained on static images? That way we can focus on reproducing the training script for STM.
Hi, thanks for your code and work.
I read in another issue (#6) that the main training runs for 260 epochs with 3771 samples per epoch. That is 260 * 3771 / 4 (batch size) ≈ 245K iterations, while pretraining runs for 2M iterations. Why would pretraining take only 4 days while main training takes 3 days, as mentioned in the paper, given that each iteration should take approximately the same amount of time?
Am I missing something? I am trying to re-train the network, but 260 epochs seem insufficient. Thanks a lot!
The paper mentions that STM samples three temporally ordered frames during the main training stage. After I randomly sample three frames, how the model does the forward pass confused me for a while. Suppose the three frames are named A, B, and C: should I first compute the segmentation result of B using the prev_key and prev_value of A generated in the memorize stage, and then feed B and C into the next forward pass? Or do I only need to compute the segmentation result of C?
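For what it's worth, here is a minimal sketch of the first reading (segment B from A's memory, then segment C from the memory of both A and B), using hypothetical memorize/segment wrappers that are not the author's confirmed API:

```python
import torch

def three_frame_pass(model, frames, gt_masks, num_objects, criterion):
    """frames / gt_masks: lists of 3 tensors for frames A, B, C.
    `model.memorize` / `model.segment` are hypothetical wrappers."""
    # Memorize frame A with its ground-truth mask.
    k, v = model.memorize(frames[0], gt_masks[0], num_objects)
    keys, vals = [k], [v]
    loss = 0.0
    for t in (1, 2):  # segment B, then C, against all memorized frames
        logits = model.segment(frames[t],
                               torch.cat(keys, dim=3),  # assume time axis is dim 3
                               torch.cat(vals, dim=3),
                               num_objects)
        loss = loss + criterion(logits, gt_masks[t])
        # Memorize the *predicted* mask before moving to the next frame.
        k, v = model.memorize(frames[t], torch.softmax(logits, dim=1), num_objects)
        keys.append(k)
        vals.append(v)
    return loss
```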
Hi seoungwugoh:
Your work is great, and I have a question about DAVIS training using the extra YouTubeVOS dataset. How do you set the ratio between the two training datasets, or do you train the model on YouTubeVOS data before training on the DAVIS training data? Looking forward to your reply.
Thanks.
Thanks for the good work done.
I tried to reproduce your results. As written in your wonderful article, I trained at a size of 384x384, and my trained model works at 384x384 input size.
But when I input the full-size frames from the DAVIS dataset (as in your demo script), the output is garbage. Have you encountered this problem, and how did you solve it?
Waiting for your reply!
Hi! Thank you for your fine work, but I still have some questions.
From what I understood, K is the number of categories in a dataset. This affects the dimensionality of the embeddings, which is (Batch, K, C/8, Time, Height, Width), since we compute different embeddings for each category. In the case of DAVIS, K=11, since we have 10 categories plus the background category.
If this is correct, then I'm curious why, throughout the code, you ignore the embeddings for the background (K=0). Wouldn't using them increase performance, along the lines of this paper by Yang et al.?
Also, if we're not using K=0, aren't we wasting memory by computing these embeddings and storing them in VRAM?
Finally, since I'm using this for just people, I've set K=2. Are there any problems with this change?
Hi,
Can you please clarify how many epochs you trained the model for in each stage of training?
Hi, thank you for your great work. Could you please provide the segmentation performance (e.g., J and F on DAVIS-2016 and DAVIS-2017) of the pre-trained STM model, to help us validate the reproduced pre-training process? Thanks.
B_list['o'].append((torch.sum(masks[:, 1:o], dim=1) + torch.sum(masks[:, o+1:num_objects+1], dim=1)).clamp(0, 1))
at Line 217 in 905f114
Sorry, can you explain this operation? I don't understand the meaning of B_list['o']. Besides, in
x = self.conv1(f) + self.conv1_m(m) + self.conv1_o(o)
f represents the image and m means the mask, while I don't know the meaning of o.
I would be grateful if you could answer my questions.
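A hedged reading of that snippet (my interpretation, not the author's answer): with masks of shape [batch, num_objects + 1, H, W] and channel 0 holding the background, the expression sums the masks of every object except object o and clamps the result to [0, 1]:

```python
# Hypothetical reconstruction: the union of all objects other than `o`.
# masks: [batch, num_objects + 1, H, W]; channel 0 is the background.
others = (masks[:, 1:o].sum(dim=1) +
          masks[:, o + 1:num_objects + 1].sum(dim=1)).clamp(0, 1)
```

Under this reading, in x = self.conv1(f) + self.conv1_m(m) + self.conv1_o(o), the third input o would be this "other objects" map, telling the memory encoder which pixels belong to competing objects.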
Thanks for answering my previous question, but I still have many questions...
How did you choose the first frame of the 3 temporally ordered frames?
After how many epochs do you increase the maximum_skip?
What is maximum_skip when the dataset is YouTube-VOS?
Thanks a lot!
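A minimal sketch of one plausible sampling scheme for these questions (an assumption on my part, not the author's confirmed code): pick a uniformly random start frame, then draw each of the two gaps uniformly between 1 and maximum_skip + 1:

```python
import random

def sample_three_frames(num_frames, max_skip):
    """Hypothetical curriculum sampler for 3 temporally ordered frames.
    Assumes num_frames >= 3; max_skip grows over training (e.g. 0 -> 25)."""
    skip = min(max_skip, (num_frames - 3) // 2)  # clamp for short clips
    g1 = random.randint(1, skip + 1)             # gap between frames 1 and 2
    g2 = random.randint(1, skip + 1)             # gap between frames 2 and 3
    start = random.randint(0, num_frames - 1 - g1 - g2)
    return [start, start + g1, start + g1 + g2]
```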
Could you please explain the meaning of in_m and in_o in the forward function of Encoder_M?
For multi-object input, the frame size is [batch_size, color_channels, H, W] and the object-mask size is [batch_size, num_objects + BG, H, W]. My question: with batch_size = 1, should I take the output of shape [batch_size, num_objects + BG, H, W], reshape it to [batch_size*H*W, num_objects + BG], and feed the reshaped tensor into CrossEntropyLoss. Is that right?
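That reshape matches what PyTorch's CrossEntropyLoss expects, though the flattening is optional, since the loss also accepts [B, K, H, W] logits with [B, H, W] integer targets directly. A minimal sketch (the shapes are my assumptions):

```python
import torch.nn.functional as F

def seg_loss(logits, labels):
    """logits: [B, K, H, W] with K = num_objects + BG;
    labels: [B, H, W] integer ids in 0..K-1."""
    B, K, H, W = logits.shape
    flat = logits.permute(0, 2, 3, 1).reshape(-1, K)  # [B*H*W, K]
    # Equivalent shortcut: F.cross_entropy(logits, labels)
    return F.cross_entropy(flat, labels.reshape(-1))
```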
Hi, when I run STM on DAVIS16, I came across the following problem.
Traceback (most recent call last):
File "/home/masterbin-iiau/Desktop/STM/eval_DAVIS.py", line 131, in
img_E.putpalette(palette)
File "/home/masterbin-iiau/anaconda3/envs/VOT20/lib/python3.6/site-packages/PIL/Image.py", line 1641, in putpalette
data = bytes(data)
TypeError: cannot convert 'NoneType' object to bytes
palette = Image.open(DATA_ROOT + '/Annotations/480p/blackswan/00000.png').getpalette()
The path is right, but getpalette() returns None.
Is this problem related to the version of Pillow? I am using pillow=6.1.0.
Could you give me some advice? Thank you!
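Pillow's getpalette() returns None whenever the image is not in palette ('P') mode, which can happen if the annotation PNGs were re-saved as RGB at some point. A hedged workaround (my suggestion, not the author's fix):

```python
from PIL import Image

mask = Image.open(DATA_ROOT + '/Annotations/480p/blackswan/00000.png')
palette = mask.getpalette()  # None unless mask.mode == 'P'
if palette is None:
    # Fall back by quantizing to palette mode (workaround; may alter colors).
    palette = mask.convert('P').getpalette()
```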
Hi man, great job! Do you intend to release the training scripts?
Hi, thanks for your great work. How can I train your model? Where is your train.py? Looking forward to your reply.
From your previous answers, I have two further questions targeting those two answers:
Why do you only use 3 frames for training? According to your paper, more previous frames do benefit model performance; moreover, at inference time more than 3 previous frames are used and added to memory, which causes an inconsistency between training and testing. So why not use longer sequences in main training?
Is BP or BP-through-time used for gradient computation? For each sample, several frames are computed one by one, and subsequent frames rely on previous frames' activations and predictions. Are gradients computed each time a frame is forwarded (with previous activations detached), or only after all frames' losses are accumulated? If the former, it is simple BP; otherwise, it's BPTT, right?
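A toy recurrence that makes the distinction in the second question concrete (illustrative only; which option STM actually uses is exactly what is being asked):

```python
import torch

w = torch.ones(1, requires_grad=True)
h = torch.ones(1)
loss = 0.0
for step in range(3):
    h = w * h                          # the state feeds the next step
    loss = loss + (h - 2.0).pow(2).sum()
    # Simple BP: uncomment to stop gradients flowing across steps.
    # h = h.detach()
loss.backward()                        # as written (no detach), this is BPTT
print(w.grad)
```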
Hi, do you plan to release training code? :)
Hi, I tested your released code and model on YouTube-VOS, but I can't get the accuracy reported in the paper. Did you test this code on YouTube-VOS?
Hello! Thanks for your code! I appreciate it very much!
After reading your code carefully, I've noticed that there are two tensors called Es and Ms:
Es = torch.zeros_like(Ms)
Es[:, :, 0] = Ms[:, :, 0]
Could you tell me their exact meaning?
Wish you a good day!
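For reference, a plausible annotated reading of those two lines (my interpretation of the evaluation script, not the author's statement):

```python
# Ms: the ground-truth masks for all frames, e.g. [batch, K, T, H, W].
# Es: the masks the model estimates, zero-initialized and then seeded with
#     the first frame's ground truth, the only annotation given at test time.
Es = torch.zeros_like(Ms)
Es[:, :, 0] = Ms[:, :, 0]
```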
Thanks for the good work done. I tried to reproduce your results.
But the final output images are all black. Checking, I find that the pred output of Run_video() is all 0. Why does this happen? Is it a problem with the test-data format I'm using, or is it some other reason?
Waiting for your reply!
Hi:
Thanks for sharing the code. I notice that the currently released weights are for the semi-supervised track and differ from the weights you used in the interactive track of the DAVIS 19 challenge. I tested these weights under the DAVIS-interactive framework following the official challenge setting and only achieved an AUC of 67.74 on the DAVIS 17 validation set. I wonder if you have any plan to release the weights trained for the interactive track of the DAVIS 19 challenge?
Hello, thanks for your great work and code!
When I try to train the model by myself, I find class imbalance seems to be a problem: background pixels far outnumber foreground pixels, which makes training difficult. Could you please tell me how you solved this problem? Did you use some kind of re-weighting or anything else? Thank you very much!
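One common remedy, offered here as a hedged suggestion rather than what the author actually did, is to weight the cross-entropy by inverse class frequency within the batch:

```python
import torch
import torch.nn.functional as F

def balanced_ce(logits, labels, eps=1e-6):
    """logits: [B, K, H, W]; labels: [B, H, W] integer class ids.
    Rare classes (foreground) get proportionally larger weights."""
    K = logits.shape[1]
    counts = torch.bincount(labels.reshape(-1), minlength=K).float()
    weight = counts.sum() / (counts + eps)  # inverse-frequency weights
    weight = weight * (K / weight.sum())    # normalize so weights average to 1
    return F.cross_entropy(logits, labels, weight=weight)
```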
When I run the pre-trained model on DAVIS 2016, the following error occurs. I downloaded DAVIS 2016 from the official website; there is no file named "ImageSets/2016/val.txt".
(openmmlab) root@bh1llmn592poa-0:/yhwang/0-Projects/11-mmsegmentation/STM# python eval_DAVIS.py -g '1' -s val -y 16 -D ../STCN/dataset/DAVIS/2016
Space-time Memory Networks: initialized.
STM : Testing on DAVIS
Traceback (most recent call last):
File "eval_DAVIS.py", line 101, in
Testset = DAVIS_MO_Test(DATA_ROOT, resolution='480p', imset='20{}/{}.txt'.format(YEAR,SET), single_object=(YEAR==16))
File "/yhwang/0-Projects/11-mmsegmentation/STM/dataset.py", line 28, in init
with open(os.path.join(_imset_f), "r") as lines:
FileNotFoundError: [Errno 2] No such file or directory: '../STCN/dataset/DAVIS/2016/ImageSets/2016/val.txt'
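The official DAVIS 2016 release ships ImageSets/480p/val.txt (full image paths), while this code expects the DAVIS 2017-style ImageSets/2016/val.txt (one sequence name per line). A hedged workaround is to generate the missing file yourself; the line format below is my assumption about the 2016 list:

```python
import os

root = '../STCN/dataset/DAVIS/2016'
src = os.path.join(root, 'ImageSets/480p/val.txt')
dst_dir = os.path.join(root, 'ImageSets/2016')
os.makedirs(dst_dir, exist_ok=True)

seqs = []
with open(src) as f:
    for line in f:
        # Lines look like "/JPEGImages/480p/blackswan/00000.jpg /Annotations/..."
        seq = line.split()[0].split('/')[3]
        if seq not in seqs:
            seqs.append(seq)

with open(os.path.join(dst_dir, 'val.txt'), 'w') as f:
    f.write('\n'.join(seqs) + '\n')
```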
Hi, thanks for sharing your great work!
I am trying to reproduce the training code and I have 2 questions about how to initialise the weights in the STM model:
1. In your released code, the backbone network (ResNet-50) uses weights pre-trained on ImageNet to extract features from the video sequence. Does that mean this network module is not fine-tuned further during training?
2. How to initialise the weights in the Decoder module and the convolutional layers for computing key and value features? With the ones generated randomly or pre-trained on some segmentation datasets?
Thank you so much for your consideration and look forward to hearing from you soon!
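A minimal sketch of one common initialisation recipe (assumptions on my part, not the author's confirmation): load ImageNet weights into the backbone and fine-tune it end to end, while He-initialising the new decoder and key/value layers:

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(pretrained=True)  # ImageNet weights; typically still fine-tuned

def init_new_layers(module):
    """He initialisation for layers that have no pre-trained weights."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# The decoder / key-value heads (hypothetical names) would then call:
# decoder.apply(init_new_layers)
```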
Hi, according to your paper, you sample 3 temporally ordered frames in main training, and the maximum number of frames to be skipped is gradually increased from 0 to 25. Since you sample 3 frames, is the gap between each TWO frames increased from 0 to 25?
Hey, I think in dataset.py, line 79:
Line 79 in 905f114
Hi, thanks for sharing this great work. I have been working on reproducing STM for 2 months, and finally got a Jaccard of 77 on DAVIS-17 val.
I found that during training (both pre-training and fine-tuning), the Jaccard on the val set jitters severely. For example, J reaches 70 at iteration 1000, quickly drops to 60 at iteration 1100, and then rises back to 70 at iteration 1200.
The batch size is set to 4 and the optimizer is Adam with an lr of 1e-5, following the setting proposed in the paper. I have tried a larger batch size and a smaller lr, which didn't help. I'll appreciate it if you could help me with this.
Dear @seoungwugoh, I've read your paper and found your work extremely interesting. I've been trying to reproduce it according to your paper, with some minor changes, like decoder layers and such. The memory-read operation, which is very much like the transformer's attention mechanism, is taken from this repo; everything else is reimplemented according to your paper's description.
I've been trying to train the model: the loss goes down initially, and after a while it suddenly shoots up. I've tried:
I've not tried disabling batch norm as your paper suggests, and I'm using mixed-precision training with Apex AMP.
Have you experienced such training instability before? What do you think could be the problem?
Hi,
Is the model available for download the one trained on YouTube-VOS?
Thank you!
Hi,
Thanks for your outstanding model and implementation. I have a question about the memory encoder.
In the class Encoder_M, you sum up the frame and mask features at the very beginning:
x = self.conv1(f) + self.conv1_m(m) + self.conv1_o(o)
However, this is confusing because in your paper you say:
"The inputs are concatenated along the channel dimension before being fed into the memory encoder. For the memory encoder, the first convolution layer is modified to be able to take a 4-channel tensor by implanting additional single channel filters."
Could you explain this difference, or talk more about the intuition behind your implementation?
Thanks in advance.
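For what it's worth, the two formulations can be made exactly equivalent: by linearity, a convolution over channel-concatenated inputs equals the sum of separate convolutions over the channel groups, as long as only one branch keeps a bias. A quick numerical check (my demonstration, not the author's code):

```python
import torch
import torch.nn as nn

f = torch.randn(1, 3, 8, 8)   # frame
m = torch.randn(1, 1, 8, 8)   # target-object mask
o = torch.randn(1, 1, 8, 8)   # other-objects mask

conv_cat = nn.Conv2d(5, 64, 7, padding=3)            # one conv over concat([f, m, o])
conv_f = nn.Conv2d(3, 64, 7, padding=3, bias=True)
conv_m = nn.Conv2d(1, 64, 7, padding=3, bias=False)  # only one branch keeps a bias
conv_o = nn.Conv2d(1, 64, 7, padding=3, bias=False)

# Split the concatenated conv's weight across the three branches.
with torch.no_grad():
    conv_f.weight.copy_(conv_cat.weight[:, :3])
    conv_m.weight.copy_(conv_cat.weight[:, 3:4])
    conv_o.weight.copy_(conv_cat.weight[:, 4:5])
    conv_f.bias.copy_(conv_cat.bias)

a = conv_cat(torch.cat([f, m, o], dim=1))
b = conv_f(f) + conv_m(m) + conv_o(o)
print(torch.allclose(a, b, atol=1e-5))  # True
```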
When I run this command (taken from the instructions page https://github.com/seoungwugoh/STM) after running through the install:
(STMVOS) C:\Users\OneWorld\Documents\DeepLearning\VideoObjectSegmentation\STMVOS>python eval_DAVIS.py -g '1' -s val -y 16 -D C:\Users\OneWorld\Documents\DeepLearning\VideoObjectSegmentation\DAVISSemiSupervisedTrainVal480
It gets this far
Space-time Memory Networks: initialized.
STM : Testing on DAVIS
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to C:\Users\OneWorld/.cache\torch\checkpoints\resnet50-19c8e357.pth
100%|██████████████████████████████████████████████████████████████████████████| 97.8M/97.8M [00:09<00:00, 10.7MB/s]
Loading weights: STM_weights.pth
and then I see this error
Traceback (most recent call last):
File "eval_DAVIS.py", line 111, in <module>
model.load_state_dict(torch.load(pth_path))
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 773, in _legacy_load
result = unpickler.load()
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 729, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 178, in default_restore_location
result = fn(storage, location)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 154, in _cuda_deserialize
device = validate_cuda_device(location)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 138, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I am using Windows 10.
# packages in environment at C:\Users\OneWorld\anaconda3\envs\STMVOS:
#
# Name Version Build Channel
blas 1.0 mkl
ca-certificates 2020.1.1 0
certifi 2020.4.5.1 py38_0
cuda100 1.0 0 pytorch
cudatoolkit 10.2.89 h74a9793_1
cycler 0.10.0 py38_0
freetype 2.9.1 ha9979f8_1
hdf5 1.10.4 h7ebc959_0
icc_rt 2019.0.0 h0cc432a_1
icu 58.2 ha925a31_3
intel-openmp 2020.1 216
jpeg 9b hb83a4c4_2
kiwisolver 1.2.0 py38h74a9793_0
libopencv 4.0.1 hbb9e17c_0
libpng 1.6.37 h2a8f88b_0
libtiff 4.1.0 h56a325e_0
matplotlib 3.1.3 py38_0
matplotlib-base 3.1.3 py38h64f37c6_0
mkl 2020.1 216
mkl-service 2.3.0 py38hb782905_0
mkl_fft 1.0.15 py38h14836fe_0
mkl_random 1.1.1 py38h47e9c7a_0
ninja 1.9.0 py38h74a9793_0
numpy 1.18.1 py38h93ca92e_0
numpy-base 1.18.1 py38hc3f5095_1
olefile 0.46 py_0
opencv 4.0.1 py38h2a7c758_0
openssl 1.1.1g he774522_0
pillow 7.1.2 py38hcc1f983_0
pip 20.0.2 py38_3
py-opencv 4.0.1 py38he44ac1e_0
pyparsing 2.4.7 py_0
pyqt 5.9.2 py38ha925a31_4
python 3.8.3 he1778fa_0
python-dateutil 2.8.1 py_0
pytorch 1.5.0 py3.8_cuda102_cudnn7_0 pytorch
qt 5.9.7 vc14h73c81de_0
setuptools 46.4.0 py38_0
sip 4.19.13 py38ha925a31_0
six 1.14.0 py38_0
sqlite 3.31.1 h2a8f88b_1
tk 8.6.8 hfa6e2cd_0
torchvision 0.6.0 py38_cu102 pytorch
tornado 6.0.4 py38he774522_1
tqdm 4.46.0 py_0
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_2
wheel 0.34.2 py38_0
wincertstore 0.2 py38_0
xz 5.2.5 h62dcd97_0
zlib 1.2.11 h62dcd97_4
zstd 1.3.7 h508b16e_0
active environment : STMVOS
active env location : C:\Users\OneWorld\anaconda3\envs\STMVOS
shell level : 2
user config file : C:\Users\OneWorld\.condarc
populated config files : C:\Users\OneWorld\.condarc
conda version : 4.8.2
conda-build version : 3.18.11
python version : 3.7.6.final.0
virtual packages : __cuda=10.2
base environment : C:\Users\OneWorld\anaconda3 (writable)
channel URLs : https://repo.anaconda.com/pkgs/main/win-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/win-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/msys2/win-64
https://repo.anaconda.com/pkgs/msys2/noarch
package cache : C:\Users\OneWorld\anaconda3\pkgs
C:\Users\OneWorld\.conda\pkgs
C:\Users\OneWorld\AppData\Local\conda\conda\pkgs
envs directories : C:\Users\OneWorld\anaconda3\envs
C:\Users\OneWorld\.conda\envs
C:\Users\OneWorld\AppData\Local\conda\conda\envs
platform : win-64
user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.6 Windows/10 Windows/10.0.17134
administrator : False
netrc file : None
offline mode : False
I have added some more detail into things I have tried in the following stack overflow link:-
https://stackoverflow.com/questions/62088265/runtimeerror-attempting-to-deserialize-object-on-a-cuda-device-but-torch-cuda-i
I tried on Ubuntu as well.
Any ideas how to get an NVIDIA GeForce 1070 GPU to work with STM?
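The traceback itself suggests the immediate workaround: torch.cuda.is_available() is returning False in this environment (note the env mixes a cuda100 meta-package with a CUDA 10.2 PyTorch build, which may be worth cleaning up), so the GPU checkpoint cannot be deserialized onto a CUDA device. Loading the weights onto the CPU at least gets past this error:

```python
import torch

# Workaround for CPU-only (or CUDA-misconfigured) machines:
# model and pth_path as in eval_DAVIS.py.
model.load_state_dict(torch.load(pth_path, map_location=torch.device('cpu')))
```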
Hi,
When I checked your code against the supplementary material, I found that the way soft aggregation is calculated in the paper differs from your code.
In the code, you apply softmax directly to the output of the model and then apply the logit function.
In the paper, you apply logit and then softmax.
Is this an error, or did you do it on purpose?
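For reference, a sketch of the two orderings being contrasted, with logit(p) = log(p / (1 - p)) (the shapes and clamping are my assumptions):

```python
import torch

def logit(p, eps=1e-7):
    p = p.clamp(eps, 1 - eps)  # avoid log(0) and division by zero
    return torch.log(p / (1 - p))

probs = torch.rand(3, 64, 64)  # per-object foreground probabilities [K, H, W]

paper_order = torch.softmax(logit(probs), dim=0)  # logit first, then softmax
code_order = logit(torch.softmax(probs, dim=0))   # softmax first, then logit
```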
Thank you for your excellent work! But I have some questions about the implementation. Could you give an example to better illustrate how you disable the BN layers? Do you only set model.eval(), or set requires_grad=False for the BN weight and bias, or both? Further, how many instances did you choose in your main training phase, as there tend to be more than one instance in a video? Thanks for your reply.
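A common recipe for "disabling" BatchNorm during fine-tuning, sketched here as the usual practice rather than the author's confirmed implementation, is to do both things the question mentions: switch BN modules to eval mode so the running statistics freeze, and stop gradients on their affine parameters:

```python
import torch.nn as nn

def freeze_bn(model):
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.eval()  # use (and stop updating) running statistics
            if module.affine:
                module.weight.requires_grad_(False)
                module.bias.requires_grad_(False)

# Note: call freeze_bn(model) after every model.train(),
# since train() flips BN modules back to training mode.
```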