seoungwugoh / stm
Video Object Segmentation using Space-Time Memory Networks
Hi! Thanks for your great work. I have three questions about the YouTube-VOS dataset.
Hi, I want to know some details of the configuration of the Adam optimizer. In the paper you mention using a constant learning rate of 1e-5, but you do not mention the weight decay, which is also important for optimization. Would you mind sharing with us the hyperparameter settings for the Adam optimizer (i.e., weight_decay and betas)?
Thanks
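For anyone reproducing this before the author replies, a hedged starting point is the paper's learning rate combined with PyTorch's Adam defaults; the betas and weight_decay below are assumptions, not confirmed settings:

```python
import torch

# model: the STM network instance (assumed in scope).
# lr follows the paper; betas/weight_decay are PyTorch defaults (assumptions).
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-5,
                             betas=(0.9, 0.999),
                             weight_decay=0)
```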
Hello. Thanks for your great work!
Do you plan to release the weights of STM pre-trained on static images? That way we can focus on reproducing the training script for STM.
Hi, thanks for your code and work.
I read in another issue (#6) that the main training runs for 260 epochs with 3771 samples per epoch. That is 260 * 3771 / 4 (batch size) ≈ 245K iterations, while pretraining runs for 2M iterations. Why would pretraining take only 4 days while main training takes 3 days, as mentioned in the paper, given that each iteration should take approximately the same amount of time?
Am I missing something? I am trying to re-train the network, but 260 epochs seem insufficient. Thanks a lot!
The paper mentions that STM samples three temporally ordered frames during the main training stage. After I randomly sample three frames, how the model does the forward pass confused me for a while. Suppose the three frames are named A, B, and C: should I first compute the segmentation result of B using the prev_key and prev_value of A generated in the memorize stage, and then feed B and C into the next forward pass? Or do I only need to compute the segmentation result of C?
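For what it's worth, here is a minimal sketch of the first reading (segment B from A's memory, then segment C from the memory of both A and B), using hypothetical memorize/segment wrappers that are not the author's confirmed API:

```python
import torch

def three_frame_pass(model, frames, gt_masks, num_objects, criterion):
    """frames / gt_masks: lists of 3 tensors for frames A, B, C.
    `model.memorize` / `model.segment` are hypothetical wrappers."""
    # Memorize frame A with its ground-truth mask.
    k, v = model.memorize(frames[0], gt_masks[0], num_objects)
    keys, vals = [k], [v]
    loss = 0.0
    for t in (1, 2):  # segment B, then C, against all memorized frames
        logits = model.segment(frames[t],
                               torch.cat(keys, dim=3),  # assume time axis is dim 3
                               torch.cat(vals, dim=3),
                               num_objects)
        loss = loss + criterion(logits, gt_masks[t])
        # Memorize the *predicted* mask before moving to the next frame.
        k, v = model.memorize(frames[t], torch.softmax(logits, dim=1), num_objects)
        keys.append(k)
        vals.append(v)
    return loss
```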
Hi seoungwugoh:
Your work is great, and I have a question about DAVIS training using the extra YouTubeVOS dataset. How do you set the ratio between the two training datasets, or do you train the model on YouTubeVOS data before training on the DAVIS training data? Looking forward to your reply.
Thanks.
Thanks for the good work done.
I tried to reproduce your results. As written in your wonderful article, I trained at a size of 384x384, and my trained model works at 384x384 input size.
But when I input the full-size frames from the DAVIS dataset (as in your demo script), the output is garbage. Have you encountered this problem, and how did you solve it?
Waiting for your reply!
Hi! Thank you for your fine work, but I still have some questions.
From what I understood, K is the number of categories in a dataset. This affects the dimensionality of the embeddings, which is (Batch, K, C/8, Time, Height, Width), since we compute different embeddings for each category. In the case of DAVIS, K=11, since we have 10 categories plus the background category.
If this is correct, then I'm curious why, throughout the code, you ignore the embeddings for the background (K=0). Wouldn't using them increase performance, along the lines of this paper by Yang et al.?
Also, if we're not using K=0, aren't we wasting memory by computing these embeddings and storing them in VRAM?
Finally, since I'm using this for just people, I've set K=2. Are there any problems with this change?
Hi,
Can you please clarify how many epochs you trained the model for in each stage of training?
Hi, thank you for your great work. Could you please provide the segmentation performance (e.g., J and F on DAVIS-2016 and DAVIS-2017) of the pre-trained STM model, to help us validate the reproduced pre-training process? Thanks.
B_list['o'].append((torch.sum(masks[:, 1:o], dim=1) + torch.sum(masks[:, o+1:num_objects+1], dim=1)).clamp(0, 1))
at Line 217 in 905f114
Sorry, can you explain this operation? I don't understand the meaning of B_list['o']. Besides, in
x = self.conv1(f) + self.conv1_m(m) + self.conv1_o(o)
f represents the image and m means the mask, while I don't know the meaning of o.
I would be grateful if you could answer my questions.
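A hedged reading of that snippet (my interpretation, not the author's answer): with masks of shape [batch, num_objects + 1, H, W] and channel 0 holding the background, the expression sums the masks of every object except object o and clamps the result to [0, 1]:

```python
# Hypothetical reconstruction: the union of all objects other than `o`.
# masks: [batch, num_objects + 1, H, W]; channel 0 is the background.
others = (masks[:, 1:o].sum(dim=1) +
          masks[:, o + 1:num_objects + 1].sum(dim=1)).clamp(0, 1)
```

Under this reading, in x = self.conv1(f) + self.conv1_m(m) + self.conv1_o(o), the third input o would be this "other objects" map, telling the memory encoder which pixels belong to competing objects.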
Thanks for answering my previous question, but I still have many questions...
How did you choose the first frame of the 3 temporally ordered frames?
After how many epochs do you increase the maximum_skip?
What is maximum_skip when the dataset is YouTube-VOS?
Thanks a lot!
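A minimal sketch of one plausible sampling scheme for these questions (an assumption on my part, not the author's confirmed code): pick a uniformly random start frame, then draw each of the two gaps uniformly between 1 and maximum_skip + 1:

```python
import random

def sample_three_frames(num_frames, max_skip):
    """Hypothetical curriculum sampler for 3 temporally ordered frames.
    Assumes num_frames >= 3; max_skip grows over training (e.g. 0 -> 25)."""
    skip = min(max_skip, (num_frames - 3) // 2)  # clamp for short clips
    g1 = random.randint(1, skip + 1)             # gap between frames 1 and 2
    g2 = random.randint(1, skip + 1)             # gap between frames 2 and 3
    start = random.randint(0, num_frames - 1 - g1 - g2)
    return [start, start + g1, start + g1 + g2]
```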
Could you please explain the meaning of in_m and in_o in the forward function of Encoder_M?
For multi-object input, the frame size is [batch_size, color_channels, H, W] and the object-mask size is [batch_size, num_objects + BG, H, W]. My question: with batch_size = 1, should I take the output of shape [batch_size, num_objects + BG, H, W], reshape it to [batch_size*H*W, num_objects + BG], and feed the reshaped tensor into CrossEntropyLoss. Is that right?
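That reshape matches what PyTorch's CrossEntropyLoss expects, though the flattening is optional, since the loss also accepts [B, K, H, W] logits with [B, H, W] integer targets directly. A minimal sketch (the shapes are my assumptions):

```python
import torch.nn.functional as F

def seg_loss(logits, labels):
    """logits: [B, K, H, W] with K = num_objects + BG;
    labels: [B, H, W] integer ids in 0..K-1."""
    B, K, H, W = logits.shape
    flat = logits.permute(0, 2, 3, 1).reshape(-1, K)  # [B*H*W, K]
    # Equivalent shortcut: F.cross_entropy(logits, labels)
    return F.cross_entropy(flat, labels.reshape(-1))
```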
Hi, when I run STM on DAVIS16, I came across the following problem.
Traceback (most recent call last):
File "/home/masterbin-iiau/Desktop/STM/eval_DAVIS.py", line 131, in
img_E.putpalette(palette)
File "/home/masterbin-iiau/anaconda3/envs/VOT20/lib/python3.6/site-packages/PIL/Image.py", line 1641, in putpalette
data = bytes(data)
TypeError: cannot convert 'NoneType' object to bytes
palette = Image.open(DATA_ROOT + '/Annotations/480p/blackswan/00000.png').getpalette()
The path is right, but getpalette() returns None.
Is this problem related to the version of Pillow? I am using pillow=6.1.0.
Could you give me some advice? Thank you!
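Pillow's getpalette() returns None whenever the image is not in palette ('P') mode, which can happen if the annotation PNGs were re-saved as RGB at some point. A hedged workaround (my suggestion, not the author's fix):

```python
from PIL import Image

mask = Image.open(DATA_ROOT + '/Annotations/480p/blackswan/00000.png')
palette = mask.getpalette()  # None unless mask.mode == 'P'
if palette is None:
    # Fall back by quantizing to palette mode (workaround; may alter colors).
    palette = mask.convert('P').getpalette()
```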
Hi man, great job! Do you intend to release the training scripts?
Hi, thanks for your great work. How can I train your model? Where is your train.py? Looking forward to your reply.
From your previous answers, I have two further questions targeting those two answers:
Why do you only use 3 frames for training? According to your paper, more previous frames do benefit model performance; moreover, at inference time more than 3 previous frames are used and added to memory, which causes an inconsistency between training and testing. So why not use longer sequences in main training?
Is BP or BP-through-time used for gradient computation? For each sample, several frames are computed one by one, and subsequent frames rely on previous frames' activations and predictions. Are gradients computed each time a frame is forwarded (with previous activations detached), or only after all frames' losses are accumulated? If the former, it is simple BP; otherwise, it's BPTT, right?
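A toy recurrence that makes the distinction in the second question concrete (illustrative only; which option STM actually uses is exactly what is being asked):

```python
import torch

w = torch.ones(1, requires_grad=True)
h = torch.ones(1)
loss = 0.0
for step in range(3):
    h = w * h                          # the state feeds the next step
    loss = loss + (h - 2.0).pow(2).sum()
    # Simple BP: uncomment to stop gradients flowing across steps.
    # h = h.detach()
loss.backward()                        # as written (no detach), this is BPTT
print(w.grad)
```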
Hi, do you plan to release training code? :)
Hi, I tested your released code and model on YouTube-VOS, but I can't get the accuracy reported in the paper. Did you test this code on YouTube-VOS?
Hello! Thanks for your code! I appreciate it very much!
After reading your code carefully, I've noticed that there are two tensors called Es and Ms:
Es = torch.zeros_like(Ms)
Es[:, :, 0] = Ms[:, :, 0]
Could you tell me their exact meaning?
Wish you a good day!
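For reference, a plausible annotated reading of those two lines (my interpretation of the evaluation script, not the author's statement):

```python
# Ms: the ground-truth masks for all frames, e.g. [batch, K, T, H, W].
# Es: the masks the model estimates, zero-initialized and then seeded with
#     the first frame's ground truth, the only annotation given at test time.
Es = torch.zeros_like(Ms)
Es[:, :, 0] = Ms[:, :, 0]
```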
Thanks for the good work done. I tried to reproduce your results.
But the final output images are all black. Checking, I find that the pred output of Run_video() is all 0. Why does this happen? Is it a problem with the test-data format I'm using, or is it some other reason?
Waiting for your reply!
Hi:
Thanks for sharing the code. I notice that the currently released weights are for the semi-supervised track and differ from the weights you used in the interactive track of the DAVIS 19 challenge. I tested these weights under the DAVIS-interactive framework following the official challenge setting and only achieved an AUC of 67.74 on the DAVIS 17 validation set. I wonder if you have any plan to release the weights trained for the interactive track of the DAVIS 19 challenge?
Hello, thanks for your great work and code!
When I try to train the model by myself, I find class imbalance seems to be a problem: background pixels far outnumber foreground pixels, which makes training difficult. Could you please tell me how you solved this problem? Did you use some kind of re-weighting or anything else? Thank you very much!
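One common remedy, offered here as a hedged suggestion rather than what the author actually did, is to weight the cross-entropy by inverse class frequency within the batch:

```python
import torch
import torch.nn.functional as F

def balanced_ce(logits, labels, eps=1e-6):
    """logits: [B, K, H, W]; labels: [B, H, W] integer class ids.
    Rare classes (foreground) get proportionally larger weights."""
    K = logits.shape[1]
    counts = torch.bincount(labels.reshape(-1), minlength=K).float()
    weight = counts.sum() / (counts + eps)  # inverse-frequency weights
    weight = weight * (K / weight.sum())    # normalize so weights average to 1
    return F.cross_entropy(logits, labels, weight=weight)
```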
When I run the pre-trained model on DAVIS 2016, the following error occurs. I downloaded DAVIS 2016 from the official website; there is no file named "ImageSets/2016/val.txt".
(openmmlab) root@bh1llmn592poa-0:/yhwang/0-Projects/11-mmsegmentation/STM# python eval_DAVIS.py -g '1' -s val -y 16 -D ../STCN/dataset/DAVIS/2016
Space-time Memory Networks: initialized.
STM : Testing on DAVIS
Traceback (most recent call last):
File "eval_DAVIS.py", line 101, in
Testset = DAVIS_MO_Test(DATA_ROOT, resolution='480p', imset='20{}/{}.txt'.format(YEAR,SET), single_object=(YEAR==16))
File "/yhwang/0-Projects/11-mmsegmentation/STM/dataset.py", line 28, in init
with open(os.path.join(_imset_f), "r") as lines:
FileNotFoundError: [Errno 2] No such file or directory: '../STCN/dataset/DAVIS/2016/ImageSets/2016/val.txt'
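The official DAVIS 2016 release ships ImageSets/480p/val.txt (full image paths), while this code expects the DAVIS 2017-style ImageSets/2016/val.txt (one sequence name per line). A hedged workaround is to generate the missing file yourself; the line format below is my assumption about the 2016 list:

```python
import os

root = '../STCN/dataset/DAVIS/2016'
src = os.path.join(root, 'ImageSets/480p/val.txt')
dst_dir = os.path.join(root, 'ImageSets/2016')
os.makedirs(dst_dir, exist_ok=True)

seqs = []
with open(src) as f:
    for line in f:
        # Lines look like "/JPEGImages/480p/blackswan/00000.jpg /Annotations/..."
        seq = line.split()[0].split('/')[3]
        if seq not in seqs:
            seqs.append(seq)

with open(os.path.join(dst_dir, 'val.txt'), 'w') as f:
    f.write('\n'.join(seqs) + '\n')
```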
Hi, thanks for sharing your great work!
I am trying to reproduce the training code and I have 2 questions about how to initialise the weights in the STM model:
1. In your released code, the backbone network (ResNet-50) uses weights pre-trained on ImageNet to extract features from the video sequence. Does that mean this network module is not fine-tuned further during training?
2. How to initialise the weights in the Decoder module and the convolutional layers for computing key and value features? With the ones generated randomly or pre-trained on some segmentation datasets?
Thank you so much for your consideration and look forward to hearing from you soon!
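A minimal sketch of one common initialisation recipe (assumptions on my part, not the author's confirmation): load ImageNet weights into the backbone and fine-tune it end to end, while He-initialising the new decoder and key/value layers:

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(pretrained=True)  # ImageNet weights; typically still fine-tuned

def init_new_layers(module):
    """He initialisation for layers that have no pre-trained weights."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# The decoder / key-value heads (hypothetical names) would then call:
# decoder.apply(init_new_layers)
```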
Hi, according to your paper, you sample 3 temporally ordered frames in main training, and the maximum number of frames to be skipped is gradually increased from 0 to 25. Since you sample 3 frames, is the gap between each TWO frames increased from 0 to 25?
Hey, I think in dataset.py, line 79:
Line 79 in 905f114
Hi, thanks for sharing this great work. I have been working on reproducing STM for 2 months, and finally got a Jaccard of 77 on DAVIS-17 val.
I found that during training (both pre-training and fine-tuning), the Jaccard on the val set jitters severely. For example, J reaches 70 at iteration 1000, quickly drops to 60 at iteration 1100, and then rises back to 70 at iteration 1200.
The batch size is set to 4 and the optimizer is Adam with an lr of 1e-5, following the setting proposed in the paper. I have tried a larger batch size and a smaller lr, which didn't help. I'll appreciate it if you could help me with this.
Dear @seoungwugoh, I've read your paper and found your work extremely interesting. I've been trying to reproduce it according to your paper, with some minor changes, like decoder layers and such. The memory-read operation, which is very much like the transformer's attention mechanism, is taken from this repo; everything else is reimplemented according to your paper's description.
I've been trying to train the model: the loss goes down initially, and after a while it suddenly shoots up. I've tried:
I've not tried disabling batch norm as your paper suggests, and I'm using mixed-precision training with Apex AMP.
Have you experienced such training instability before? What do you think could be the problem?
Hi,
Is the model available for download the one trained on YouTube-VOS?
Thank you!
Hi,
Thanks for your outstanding model and implementation. I have a question about the memory encoder.
In the class Encoder_M, you sum up the frame and mask features at the very beginning:
x = self.conv1(f) + self.conv1_m(m) + self.conv1_o(o)
However, this is confusing because in your paper you say:
"The inputs are concatenated along the channel dimension before being fed into the memory encoder. For the memory encoder, the first convolution layer is modified to be able to take a 4-channel tensor by implanting additional single channel filters."
Could you explain this difference, or talk more about the intuition behind your implementation?
Thanks in advance.
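For what it's worth, the two formulations can be made exactly equivalent: by linearity, a convolution over channel-concatenated inputs equals the sum of separate convolutions over the channel groups, as long as only one branch keeps a bias. A quick numerical check (my demonstration, not the author's code):

```python
import torch
import torch.nn as nn

f = torch.randn(1, 3, 8, 8)   # frame
m = torch.randn(1, 1, 8, 8)   # target-object mask
o = torch.randn(1, 1, 8, 8)   # other-objects mask

conv_cat = nn.Conv2d(5, 64, 7, padding=3)            # one conv over concat([f, m, o])
conv_f = nn.Conv2d(3, 64, 7, padding=3, bias=True)
conv_m = nn.Conv2d(1, 64, 7, padding=3, bias=False)  # only one branch keeps a bias
conv_o = nn.Conv2d(1, 64, 7, padding=3, bias=False)

# Split the concatenated conv's weight across the three branches.
with torch.no_grad():
    conv_f.weight.copy_(conv_cat.weight[:, :3])
    conv_m.weight.copy_(conv_cat.weight[:, 3:4])
    conv_o.weight.copy_(conv_cat.weight[:, 4:5])
    conv_f.bias.copy_(conv_cat.bias)

a = conv_cat(torch.cat([f, m, o], dim=1))
b = conv_f(f) + conv_m(m) + conv_o(o)
print(torch.allclose(a, b, atol=1e-5))  # True
```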
When I run this command (taken from the instructions page https://github.com/seoungwugoh/STM) after running through the install:
(STMVOS) C:\Users\OneWorld\Documents\DeepLearning\VideoObjectSegmentation\STMVOS>python eval_DAVIS.py -g '1' -s val -y 16 -D C:\Users\OneWorld\Documents\DeepLearning\VideoObjectSegmentation\DAVISSemiSupervisedTrainVal480
It gets this far
Space-time Memory Networks: initialized.
STM : Testing on DAVIS
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to C:\Users\OneWorld/.cache\torch\checkpoints\resnet50-19c8e357.pth
100%|██████████████████████████████████████████████████████████████████████████| 97.8M/97.8M [00:09<00:00, 10.7MB/s]
Loading weights: STM_weights.pth
and then I see this error
Traceback (most recent call last):
File "eval_DAVIS.py", line 111, in <module>
model.load_state_dict(torch.load(pth_path))
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 773, in _legacy_load
result = unpickler.load()
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 729, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 178, in default_restore_location
result = fn(storage, location)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 154, in _cuda_deserialize
device = validate_cuda_device(location)
File "C:\Users\OneWorld\anaconda3\envs\STMVOS\lib\site-packages\torch\serialization.py", line 138, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I am using Windows 10.
# packages in environment at C:\Users\OneWorld\anaconda3\envs\STMVOS:
#
# Name Version Build Channel
blas 1.0 mkl
ca-certificates 2020.1.1 0
certifi 2020.4.5.1 py38_0
cuda100 1.0 0 pytorch
cudatoolkit 10.2.89 h74a9793_1
cycler 0.10.0 py38_0
freetype 2.9.1 ha9979f8_1
hdf5 1.10.4 h7ebc959_0
icc_rt 2019.0.0 h0cc432a_1
icu 58.2 ha925a31_3
intel-openmp 2020.1 216
jpeg 9b hb83a4c4_2
kiwisolver 1.2.0 py38h74a9793_0
libopencv 4.0.1 hbb9e17c_0
libpng 1.6.37 h2a8f88b_0
libtiff 4.1.0 h56a325e_0
matplotlib 3.1.3 py38_0
matplotlib-base 3.1.3 py38h64f37c6_0
mkl 2020.1 216
mkl-service 2.3.0 py38hb782905_0
mkl_fft 1.0.15 py38h14836fe_0
mkl_random 1.1.1 py38h47e9c7a_0
ninja 1.9.0 py38h74a9793_0
numpy 1.18.1 py38h93ca92e_0
numpy-base 1.18.1 py38hc3f5095_1
olefile 0.46 py_0
opencv 4.0.1 py38h2a7c758_0
openssl 1.1.1g he774522_0
pillow 7.1.2 py38hcc1f983_0
pip 20.0.2 py38_3
py-opencv 4.0.1 py38he44ac1e_0
pyparsing 2.4.7 py_0
pyqt 5.9.2 py38ha925a31_4
python 3.8.3 he1778fa_0
python-dateutil 2.8.1 py_0
pytorch 1.5.0 py3.8_cuda102_cudnn7_0 pytorch
qt 5.9.7 vc14h73c81de_0
setuptools 46.4.0 py38_0
sip 4.19.13 py38ha925a31_0
six 1.14.0 py38_0
sqlite 3.31.1 h2a8f88b_1
tk 8.6.8 hfa6e2cd_0
torchvision 0.6.0 py38_cu102 pytorch
tornado 6.0.4 py38he774522_1
tqdm 4.46.0 py_0
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_2
wheel 0.34.2 py38_0
wincertstore 0.2 py38_0
xz 5.2.5 h62dcd97_0
zlib 1.2.11 h62dcd97_4
zstd 1.3.7 h508b16e_0
active environment : STMVOS
active env location : C:\Users\OneWorld\anaconda3\envs\STMVOS
shell level : 2
user config file : C:\Users\OneWorld\.condarc
populated config files : C:\Users\OneWorld\.condarc
conda version : 4.8.2
conda-build version : 3.18.11
python version : 3.7.6.final.0
virtual packages : __cuda=10.2
base environment : C:\Users\OneWorld\anaconda3 (writable)
channel URLs : https://repo.anaconda.com/pkgs/main/win-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/win-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/msys2/win-64
https://repo.anaconda.com/pkgs/msys2/noarch
package cache : C:\Users\OneWorld\anaconda3\pkgs
C:\Users\OneWorld\.conda\pkgs
C:\Users\OneWorld\AppData\Local\conda\conda\pkgs
envs directories : C:\Users\OneWorld\anaconda3\envs
C:\Users\OneWorld\.conda\envs
C:\Users\OneWorld\AppData\Local\conda\conda\envs
platform : win-64
user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.6 Windows/10 Windows/10.0.17134
administrator : False
netrc file : None
offline mode : False
I have added some more detail into things I have tried in the following stack overflow link:-
https://stackoverflow.com/questions/62088265/runtimeerror-attempting-to-deserialize-object-on-a-cuda-device-but-torch-cuda-i
I tried on Ubuntu as well.
Any ideas how to get an NVIDIA GeForce 1070 GPU to work with STM?
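The traceback itself suggests the immediate workaround: torch.cuda.is_available() is returning False in this environment (note the env mixes a cuda100 meta-package with a CUDA 10.2 PyTorch build, which may be worth cleaning up), so the GPU checkpoint cannot be deserialized onto a CUDA device. Loading the weights onto the CPU at least gets past this error:

```python
import torch

# Workaround for CPU-only (or CUDA-misconfigured) machines:
# model and pth_path as in eval_DAVIS.py.
model.load_state_dict(torch.load(pth_path, map_location=torch.device('cpu')))
```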
Hi,
When I checked your code against the supplementary material, I found that the way soft aggregation is calculated in the paper differs from your code.
In the code, you apply softmax directly to the output of the model and then apply the logit function.
In the paper, you apply logit and then softmax.
Is this an error, or did you do it on purpose?
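For reference, a sketch of the two orderings being contrasted, with logit(p) = log(p / (1 - p)) (the shapes and clamping are my assumptions):

```python
import torch

def logit(p, eps=1e-7):
    p = p.clamp(eps, 1 - eps)  # avoid log(0) and division by zero
    return torch.log(p / (1 - p))

probs = torch.rand(3, 64, 64)  # per-object foreground probabilities [K, H, W]

paper_order = torch.softmax(logit(probs), dim=0)  # logit first, then softmax
code_order = logit(torch.softmax(probs, dim=0))   # softmax first, then logit
```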
Thank you for your excellent work! But I have some questions about the implementation. Could you give an example to better illustrate how you disable the BN layers? Do you only set model.eval(), or set requires_grad=False for the BN weight and bias, or both? Further, how many instances did you choose in your main training phase, as there tend to be more than one instance in a video? Thanks for your reply.
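A common recipe for "disabling" BatchNorm during fine-tuning, sketched here as the usual practice rather than the author's confirmed implementation, is to do both things the question mentions: switch BN modules to eval mode so the running statistics freeze, and stop gradients on their affine parameters:

```python
import torch.nn as nn

def freeze_bn(model):
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.eval()  # use (and stop updating) running statistics
            if module.affine:
                module.weight.requires_grad_(False)
                module.bias.requires_grad_(False)

# Note: call freeze_bn(model) after every model.train(),
# since train() flips BN modules back to training mode.
```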