
Comments (4)

os1a avatar os1a commented on September 27, 2024

Hi,
Just to clarify something, figure 3 is only used to explain how the optimization of our loss function works.

The three ground truths are not available at the same time. You will see only one ground truth at every iteration, but over the course of training you will see all of them.

Figure 3 illustrates only the sampling framework, where we use EWTA. For simplicity, you can assume the simpler version of our approach: the sampling network generates multiple hypotheses (a set of (x, y) points), and during fitting you fit those hypotheses into your final mixture model.

In practice, we train the sampling network to generate 20 hypotheses and then fit them into 4 modes (as mentioned in section 6.1).

To get an idea about the EWTA loss implementation, we have already provided the code for the loss function.

def make_sampling_loss(self, hyps, bounded_log_sigmas, gt, mode='epe', top_n=1):
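The repository implements this loss in TensorFlow; as a rough illustration only, here is a minimal NumPy sketch of the winner-takes-all idea behind make_sampling_loss, assuming 2-D hypothesis points, a single ground truth, and Euclidean (EPE) distance. The function name and array shapes are illustrative, not the repository's API.

```python
import numpy as np

def ewta_loss(hyps, gt, top_n=1):
    """Minimal sketch of an (evolving) winner-takes-all loss.

    hyps:  (K, 2) array of K hypothesis points (x, y)
    gt:    (2,) single ground-truth point
    top_n: number of "winner" hypotheses that receive the penalty;
           in EWTA this is annealed from K down to 1 during training
    """
    # endpoint error (EPE) of each hypothesis w.r.t. the single ground truth
    dists = np.linalg.norm(hyps - gt, axis=1)
    # only the top_n closest hypotheses are penalized; the rest get no gradient
    winners = np.argsort(dists)[:top_n]
    return dists[winners].mean()
```

Because only the closest hypotheses are updated for each training sample, different heads end up specializing on different ground truths seen across iterations.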

We also provided the loss function used in the fitting network (nll) at:

def make_fitting_loss(self, means, bounded_log_sigmas, mixture_weights, gt):
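Again as a rough sketch rather than the repository's TensorFlow code: the fitting loss is a negative log-likelihood of the ground truth under the predicted mixture. The version below assumes isotropic 2-D Gaussian modes, which is a simplification for illustration.

```python
import numpy as np

def fitting_nll(means, log_sigmas, weights, gt):
    """NLL of gt under an isotropic 2-D Gaussian mixture (illustrative sketch).

    means:      (M, 2) mode centers
    log_sigmas: (M,) log standard deviation per mode (isotropic assumption)
    weights:    (M,) mixture weights, summing to 1
    gt:         (2,) ground-truth point
    """
    sigmas = np.exp(log_sigmas)
    sq = np.sum((means - gt) ** 2, axis=1)
    # log density of each 2-D isotropic Gaussian at gt
    log_p = -sq / (2 * sigmas ** 2) - 2 * np.log(sigmas) - np.log(2 * np.pi)
    # mixture likelihood (a log-sum-exp would be used for numerical stability)
    mix = np.sum(weights * np.exp(log_p))
    return -np.log(mix)
```

In the paper's pipeline, the 20 sampled hypotheses are first clustered into 4 modes, and a loss of this form then trains the fitting network's means, sigmas, and mixture weights jointly.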

Feel free to raise more questions if you still need help.

from multimodal-future-prediction.

droneRL2020 avatar droneRL2020 commented on September 27, 2024

Thank you for your explanation!

To confirm, could you please elaborate on this sentence?
"The three ground truths are not available at the same time. You will see only one ground truth at every iteration, but during training you will see all of them."

Does that mean we use 3 ground-truth labels (3 future trajectories, i.e. (x, y) positions on the image) paired with one image during training?


os1a avatar os1a commented on September 27, 2024

No, we use only one ground truth. Every training sample has an input (e.g., an image) and a single ground truth. We generate multiple hypotheses (e.g., 8 or 20) and use the EWTA loss function (make_sampling_loss() in our repository), which takes a set of hypotheses (hyps) and a single ground truth (gt).

What we mean in Figure 3 is that during training, at some iteration the network sees an image with its single ground truth, and at another iteration (perhaps much later) it sees a similar input image with a different ground truth. The EWTA loss encourages the network to use one head in the first case and another head in the latter.


droneRL2020 avatar droneRL2020 commented on September 27, 2024

Thank you for the explanation!

