agentformer's People

Contributors: khrylx, yenw


agentformer's Issues

Question about batch training

Hi, is it possible to train this model in a batch-training manner? It seems that the model is optimized with one sample per step (batch size = 1).
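A minimal sketch of a common workaround, assuming one forward pass handles one scene: gradient accumulation can emulate a larger effective batch (model_step, optimizer and data_loader below are hypothetical placeholders, not names from this repository):

    accum_steps = 8                              # assumed effective batch size
    optimizer.zero_grad()
    for i, scene in enumerate(data_loader):      # yields one scene (sample) at a time
        loss = model_step(scene)                 # hypothetical helper: forward pass + total loss
        (loss / accum_steps).backward()          # scale so the accumulated gradient is an average
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()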

Inference Time

How much time does the AgentFormer model take to run inference on a given sample?

Question about paper result

Thanks for sharing this great work!!
I have a few questions about this paper: how did you get the FDE result on nuScenes?
I searched the nuScenes leaderboard, and your result outperforms those methods by a large margin.
So I just want to know how you calculated the result.

Questions about Visualization

Hi @Khrylx ,

Thank you for your work, it is really impressive.

I am wondering if you have the visualisation script for these visualisations that you made.
[two screenshots of the trajectory visualizations attached]

data processing

Hello, this is excellent work, but there is one thing I don't understand and hope you can answer.
Why does the ETH/UCY data processing need to divide by the scale, as in "found_data = past_data[past_data[:, 1] == identity].squeeze()[[self.xind, self.zind]] / self.past_traj_scale"? Here self.past_traj_scale = 2, but in Trajectron++ the data is not divided by any scale.

Looking forward to your reply.

list index out of range

Hi brother. I have installed your package and configured all the environments, but I still can't run it; the following error is reported. Below are the file layout and the failing statement.
[three screenshots attached: the file layout and the error messages]

How to understand the variable num_seq_samples

Thank you for your work !
In the __init__ method of the class data_generator, I cannot understand the purpose of the variable num_seq_samples.
num_seq_samples = preprocessor.num_fr - (parser.min_past_frames + parser.min_future_frames - 1) * self.frame_skip
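For intuition, a worked example with assumed values (treat the numbers as placeholders, not as any particular config):

    num_fr, min_past_frames, min_future_frames, frame_skip = 1000, 8, 12, 1
    num_seq_samples = num_fr - (min_past_frames + min_future_frames - 1) * frame_skip
    print(num_seq_samples)  # 981: the number of starting frames that still leave room for
                            # a full past window plus a full future window inside the sequence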
Thank you very much !

Nuscene Dataset Question

Hi,

Thanks for your wonderful work on this paper, you guys did a good job!

I have a question: did you use the training set obtained from all cameras on the Nuscenes dataset?

Thanks ahead for your help!

velocity and heading

There are no velocity and heading values in the ETH and UCY datasets. If they are present in the nuScenes data, where can I find them?
Why do we need the columns filled with -1.0?
Thanks for the response!

About normalization

Hi, I have noticed that in the code (

self.data['scene_orig'] = torch.cat([self.data['pre_motion'], self.data['fut_motion']]).view(-1, 2).mean(dim=0)
) the center of both the past trajectory and the future trajectory is used to normalize the input data. However, the future trajectory should not be available at test time. Is this data snooping? Please let me know if there is anything wrong with my understanding. Many thanks!
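For comparison, a past-only centering (a minimal sketch mirroring the scene_orig_all_past branch of set_data) would avoid touching the future trajectory:

    # Center the scene on the mean of the observed (past) positions only, so no
    # future information is used for normalization at test time.
    self.data['scene_orig'] = self.data['pre_motion'].view(-1, 2).mean(dim=0)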

NotImplementedError in agent_aware_attention

Thanks for your great work !
I am trying to apply your model to continual learning. Without modifying your network or dataloader, I encounter the following error. Since it is raised inside your library function, and several forward passes have already completed, I find it very hard to debug. Could you please help me out?

epoch:0,loss:mse: 20.561 (18.647) kld: 2.017 (2.326) sample: 20.089 (10.814) total_loss: 42.667 (31.787)
epoch:0,loss:mse: 20.943 (37.872) kld: 2.045 (2.312) sample: 20.115 (24.269) total_loss: 43.103 (64.452)
epoch:0,loss:mse: 21.206 (0.451) kld: 2.075 (2.000) sample: 20.142 (0.421) total_loss: 43.423 (2.872)
epoch:0,loss:mse: 20.992 (4.669) kld: 2.075 (2.000) sample: 19.957 (4.490) total_loss: 43.024 (11.158)
epoch:0,loss:mse: 21.082 (12.970) kld: 2.136 (6.585) sample: 20.000 (12.838) total_loss: 43.218 (32.393)
epoch:0,loss:mse: 21.042 (16.777) kld: 2.331 (3.709) sample: 19.912 (16.551) total_loss: 43.286 (37.037)
Traceback (most recent call last):
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./main.py", line 189, in <module>
    main()
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./main.py", line 162, in main
    task_iter(task, num_devices, pop, generation_id, loop_id, exp_config)
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/mu2net_traj/main.py", line 83, in task_iter
    train_loop(paths, ds_train, ds_validation,
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./mytrain.py", line 203, in train_loop
    model_data = path.model()
  File "/mnt/petrelfs/tangxiaqiang/miniconda3/./lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer.py", line 596, in forward
    self.inference(sample_num=self.loss_cfg['sample']['k'])
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer.py", line 607, in inference
    self.future_decoder(self.data, mode=mode, sample_num=sample_num, autoregress=True, need_weights=need_weights)
  File "/mnt/petrelfs/tangxiaqiang/miniconda3/./lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer.py", line 425, in forward
    self.decode_traj_ar(data, mode, context, pre_motion, pre_vel, pre_motion_scene_norm, z, sample_num, need_weights=need_weights)
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer.py", line 341, in decode_traj_ar
    tf_out, attn_weights = self.tf_decoder(tf_in_pos, context, memory_mask=mem_mask, tgt_mask=tgt_mask, num_agent=data['agent_num'], need_weights=need_weights)
  File "/mnt/petrelfs/tangxiaqiang/miniconda3/./lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer_lib.py", line 746, in forward
    output, self_attn_weights[i], cross_attn_weights[i] = mod(output, memory, tgt_mask=tgt_mask,
  File "/mnt/petrelfs/tangxiaqiang/miniconda3/./lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer_lib.py", line 644, in forward
    tgt2, self_attn_weights = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask,
  File "/mnt/petrelfs/tangxiaqiang/miniconda3/./lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer_lib.py", line 506, in forward
    return agent_aware_attention(
  File "/mnt/petrelfs/tangxiaqiang/code/trajnet/dev/./AgentFormer/model/agentformer_lib.py", line 177, in agent_aware_attention
    raise NotImplementedError
NotImplementedError

Issues understanding the input format for eth_ucy

Dear reader, thank you for the great work on this topic and for releasing the code for the community to build on. I am currently trying to understand the data format, but I am unclear about how the eth_ucy dataset is actually preprocessed. Do I understand correctly that only x and y coordinates are used and no velocity or heading information is extracted? Looking through the data, most columns only contain -1 values. Could you provide a list of column names for the inputs found in the datasets/eth_ucy files?

Furthermore, is my understanding correct that only a single agent's information is stored in a given row? From reading the paper, my understanding was that all agent states would be stored in a single entry for each timestep. Could you maybe elaborate on how you create the
[screenshot of the representation from the paper]
representation of the data that is described in the paper?
Thank you!

ADE/FDE Future mask Loss

Hello,
I have seen that you are using a mask on the MSE loss to exclude the padded agents, which is good.
However, why aren't you applying the same mask to the ADE and FDE metrics?
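For illustration, a minimal sketch of what a masked ADE/FDE could look like (not the repository's metric code; the shapes and the 0/1 float mask convention are assumptions):

    import torch

    def masked_ade_fde(pred, gt, mask):
        # pred, gt: [N agents, T, 2]; mask: [N, T] float, 1.0 for valid steps, 0.0 for padding
        dist = (pred - gt).norm(dim=-1)                        # [N, T] per-step displacement error
        ade = (dist * mask).sum() / mask.sum().clamp(min=1)    # average over valid steps only
        last = mask.cumsum(dim=1).argmax(dim=1)                # index of each agent's last valid step
        has_valid = mask.sum(dim=1) > 0
        fde = dist[torch.arange(pred.shape[0]), last][has_valid].mean()
        return ade, fde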

Single inference script

Is there any documentation or instructions for using the code in this repo for a single inference step?

pred_epoch

Hello @Khrylx , thank you for your great work.

I didn't understand what pred_epoch in the cfg files for each dataset means. Why is it different from one dataset to another?

Thanks

Time encoder implementation

Hi, I really like your work on multi-agent trajectory prediction. I went through the paper and the code and have a quick question about the time encoder. As you mentioned in the paper, the time encoder that integrates the timestamp features differs from the original positional encoder. But I cannot find the time encoder code in this repo. Please let me know if I missed anything. Much appreciated!
[screenshot of the time encoder description from the paper]
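For reference, here is a minimal sketch of the kind of timestamp-based sinusoidal encoding described above (an illustration only, not the repository's implementation; it assumes an even d_model and a timestep-major [T * N, d_model] sequence layout):

    import math
    import torch

    def time_encoding(num_t, num_agents, d_model):
        # Standard sinusoidal table indexed by timestep rather than by flat sequence position.
        pe = torch.zeros(num_t, d_model)
        t = torch.arange(num_t, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(t * div)
        pe[:, 1::2] = torch.cos(t * div)
        # Every agent observed at the same timestep receives the same encoding.
        return pe.repeat_interleave(num_agents, dim=0)   # [num_t * num_agents, d_model]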

some puzzles about the math formulas in the CVAE Future Decoder part

As you described in section 3.2 of the paper:
[screenshot of the loss terms from Section 3.2 of the paper]
I understand that the purpose of the MSE term $\lVert Y - \hat{Y} \rVert^2$ is to push the ground-truth value $Y$ and the mean of the Gaussian $\hat{Y}$ as close together as possible, because a Gaussian distribution attains its maximum probability density at its mean.
But where does the weighting factor $1/(2\beta)$ come from? Why does using $\beta$ as the variance lead to this weighting factor?
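For reference, here is the standard derivation that produces this factor (a generic CVAE/β-VAE argument, not a quote from the paper): if the decoder likelihood is a Gaussian with mean $\hat{Y}$ and fixed variance $\beta$, then

$$
\log p(Y \mid \hat{Y}) = \log \mathcal{N}(Y;\, \hat{Y},\, \beta I) = -\frac{1}{2\beta}\,\lVert Y - \hat{Y} \rVert^2 + \text{const},
$$

so maximizing this log-likelihood is the same as minimizing $\frac{1}{2\beta}\lVert Y - \hat{Y}\rVert^2$, and $\beta$ acts as a weight that trades the reconstruction term off against the KL term.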

Important remark. Potential inconsistency between source code and paper: DLow's diversity loss.

Good day to all people involved in developing/studying the AgentFormer model,

I would like to make a small remark with regards to DLow's diversity sampling loss. There may be a discrepancy between the definition of the loss as stipulated in the paper, and the loss as written in the original source code.

Indeed, the paper defines the diversity component of the loss as:
[screenshot of the diversity loss definition from the paper]
Here we can see that the predictions made across the K modes are compared with one another by summing the distances between predicted points over the prediction horizon.

However, in the source code, the implementation collapses the x and y components of the prediction into one single dimension. This in turn means that the distance computed with the F.pdist() function is not the sum of distances across timesteps, but the L2 distance between two points in a high-dimensional space of shape [T_pred * 2].

Here's a minimal code snippet that highlights the difference between the loss as defined in the source code, and the loss as explained in the paper:

    import torch
    import torch.nn.functional as F

    # scaling factor
    d_scale = 10

    # example predictions
    pred_1 = torch.Tensor([[1, 1],
                           [2, 1],
                           [3, 1],
                           [3, 2],
                           [3, 3],
                           [3, 4]])
    pred_2 = torch.Tensor([[1, 1],
                           [1, 1.5],
                           [1, 2],
                           [2, 2],
                           [3, 3],
                           [4, 3]])
    pred_3 = torch.zeros_like(pred_1)

    # predictions are of shape [N agents, K samples, P prediction length, 2]
    preds = torch.stack([pred_1, pred_2, pred_3]).unsqueeze(0)

    # diversity_loss reshaped the predictions tensor to collapse the x and y components of predictions
    reshaped_preds = preds.view(*preds.shape[:2], -1)       # [N agents, K samples, P prediction length * 2]

    code_loss = 0
    for motion in reshaped_preds:
        # motion: [K, P * 2]
        their_dist = F.pdist(motion, 2) ** 2
        code_loss += (-their_dist / d_scale).exp().mean()
    print(f"{code_loss=}")

    paper_loss = 0
    paper_dists = []
    for motion in preds:
        for k1, sample_1 in enumerate(motion):
            for k2, sample_2 in enumerate(motion[k1+1:, ...]):
                # sample_1, sample_2 --> [P, 2]

                # difference between any two non-identical predictions
                diff = sample_1 - sample_2

                # sum of euclidean distance between points of diff over each timestep
                se = diff.pow(2).sum(-1).sqrt().sum()

                paper_dists.append(se)

    paper_dists = torch.tensor(paper_dists)

    paper_loss = (-paper_dists / d_scale).exp().mean()

    print(f"{paper_loss=}")

Note that the loss value as defined in the paper is different from the one implemented in the source code.

I would also like to mention that both versions of the loss do encourage diversity among predictions; however, the way in which they 'push' predictions away from each other is different.

It might be nice (for whoever is interested in studying this further) to implement an efficient version of the computation behind paper_loss shown in the snippet above, and check whether it ends up altering the behaviour of DLow.

I do not expect a major change in the way DLow operates, but I leave this remark here for whoever might want to study the DLow module in more detail.
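For whoever wants to try this, here is a vectorized sketch of the paper-style loss from the snippet above (same shape conventions as the snippet; an illustration, not the repository's implementation):

    import torch

    def paper_diversity_loss(preds, d_scale=10.0):
        # preds: [N agents, K samples, P prediction length, 2]
        K = preds.shape[1]
        diffs = preds.unsqueeze(2) - preds.unsqueeze(1)   # [N, K, K, P, 2] pairwise differences
        dists = diffs.norm(dim=-1).sum(dim=-1)            # [N, K, K] per-timestep L2, summed over the horizon
        iu = torch.triu_indices(K, K, offset=1)           # keep each unordered pair (k1 < k2) once
        pairwise = dists[:, iu[0], iu[1]]                 # [N, K * (K - 1) / 2]
        return (-pairwise / d_scale).exp().mean()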

significance of FutureDecoder's ```sn_out_type``` variable?

Dear AgentFormer authors,

I would like to get a better understanding of the AgentFormer model. As I review the source code, one of the parameters that has been rather difficult for me to understand is the sn_out_type attribute of the FutureDecoder. So far, here's what I understood about this variable:

  • it is a string, with default value "scene_norm", but it can also take the values "norm" or "vel" if the model is configured as such (by specifying the parameter value in the cfg file)
  • it alters the behaviour of the decode_traj_ar method of the FutureDecoder, by modifying the format of the output seq_out that is being predicted. seq_out is the sequence of positions that is actually predicted by the model.
  • by default, no modification is performed at all.
  • seq_out is translated into dec_motion, which is the actual variable that is compared with the ground truth for loss computation. This is done independently from the alterations performed by sn_out_type (instead, it is done by pred_type, which, quite clearly, defines whether the model is supposed to predict velocities, positions, or positions aligned with the scene origin).
  • sn_out_type does not alter the behaviour of the code in any other place than in FutureDecoder.

From my observations, I am inclined to believe that sn_out_type defines the format of the ground truth. If the ground truth were given to the model in a different format than the predicted one (e.g., if the ground-truth trajectories are expressed as velocities while the model predicts scene-aligned positions), then sn_out_type would be responsible for translating the predicted sequence into the right format before prediction and ground truth are compared for loss calculation. Is this correct? I am uncertain, since I've found that sn_out_type does not alter the processing of the ground truth in the data_generator or the preprocessor classes, which I would expect to be the case if my guess about this variable acting as a bridge between ground truth and prediction were right.

If anyone could help me clarify this, I would be very thankful.

recon_motion_3D significance

Hi,

Thanks for the amazing work on AgentFormer. I had a question: in the test pipeline, there are two sampling methods. One samples the 'mode' (the recon output), and the other draws several samples from the distribution. I think I understand why the mode is sampled, but the future encoder is used together with the future motion data. Could you tell me why the FutureEncoder is used at test time?

implementation on another dataset

Hi,
I ran your model with the ETH dataset; however, I now want to try it with the PIE dataset, but I don't understand the dataloader. I converted the annotations to a txt file, but I couldn't load the dataset. Should I write new process.py and dataloader.py files to run your model on it?
Thanks.

Input trajectories' coordinate system, and necessity to scale coordinates when working on another dataset?

Hey,

I would like to know something about the format of the input trajectories as they are provided directly within the .txt files. Is any kind of coordinate preprocessing performed when generating those files? Specifically, is there any scaling applied to ensure reasonably similar scales across different datasets?

I have plotted the coordinates for some of the extracted .txt files (from within the preprocessed eth_ucy files).

[three plots of the raw coordinates from the preprocessed txt files]

From the looks of it, the coordinate system seems to be expressed in meters. No centering is performed here. I do recognise that this is unnecessary, as it is eventually done in the set_data method of the dataloader anyway:

    if scene_orig_all_past:
        self.data['scene_orig'] = self.data['pre_motion'].view(-1, 2).mean(dim=0)
    else:
        self.data['scene_orig'] = self.data['pre_motion'][-1].mean(dim=0)

The reason I would like to know whether you follow any standard practice for scaling the data is that I would like to apply the model to another dataset, which uses pixels as the unit of its coordinate system instead of meters.

From my current understanding, there's no need to apply any kind of coordinate scaling here. The model will simply adapt its weights to account for a more widely or narrowly "stretched" version of the input data. However, I also found that the preprocessor does apply some kind of scaling on the input trajectory data:

found_data = past_data[past_data[:, 1] == identity].squeeze()[[self.xind, self.zind]] / self.past_traj_scale

found_data = fut_data[fut_data[:, 1] == identity].squeeze()[[self.xind, self.zind]] / self.traj_scale

Is anything required from me in terms of setting this scaling factor properly with respect to the dataset I intend to use? Why is this scaling factor used here?
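In case it helps, here is a hedged sketch of how one might pick the scale for a pixel-based dataset (the 20 px per metre calibration is a made-up value for illustration; traj_scale mirrors the preprocessor attribute quoted above):

    import numpy as np

    # Hypothetical calibration: the new dataset is annotated in pixels at roughly 20 px per metre.
    PIXELS_PER_METER = 20.0
    traj_scale = PIXELS_PER_METER            # analogue of past_traj_scale / traj_scale in the cfg

    raw_xy = np.array([[412.0, 230.0],       # example pixel coordinates for two timesteps
                       [418.0, 233.0]])
    scaled_xy = raw_xy / traj_scale          # back to roughly metre-scale values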

Thank you very much for your time and your work on the AgentFormer model!

possibly missing an else clause in preprocessor.py

    if frame - i < self.init_frame:
        data = []
    data = self.gt[self.gt[:, 0] == (frame - i * self.frame_skip)]

Hi, thanks for the great work done on AgentFormer!

While going through the model's code, I noticed that there might be a missing else clause in preprocessor.py, specifically on lines 67 to 69. I'm unsure if this is a typo or if I have misunderstood something.

I would appreciate it if you could give some clarification on that. Thanks in advance!

Using sequences with only one pedestrian

Thank you for your great contribution. When checking the AgentFormer code, I noticed that the number of samples does not match that of Social-STGCNN and SGAN. Looking at those two codebases, they only consider scenes that contain more than one pedestrian. Is that also the case in your code, and have I overlooked something? Looking at the samples that test.py outputs, some of them include only one pedestrian.
If this is true, would it be fair to compare against SGAN? Trajectron++ also uses the same train/test splits as SGAN.


NaN value obtained during training of the 1-sample trajectory sampler

Hi

I'm working on the nuScenes 1-sample training.
After finishing 100 epochs of CVAE training, I continued to train the trajectory sampler.
But unfortunately I got a NaN value in the diversity loss term during training.
[screenshot of the training log showing NaN in the diversity loss]

I guess this is because the diversity loss term is divided by 0 when K = 1, and it is meaningless to compute the diversity loss when K = 1.
[screenshot of the diversity loss code]

Maybe we need to modify this line:

dist = F.pdist(motion, 2) ** 2

In case I missed something: have you ever encountered a NaN value when training the trajectory sampler with K = 1?
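One possible workaround (a minimal sketch, not the repository's fix) is to skip the term when fewer than two samples are drawn, since F.pdist on a single row returns an empty tensor and the following .exp().mean() then yields NaN:

    import torch.nn.functional as F

    def diversity_term(motion, d_scale):
        # motion: [K, P * 2] flattened trajectory samples for one agent
        if motion.shape[0] < 2:
            # With K = 1 there are no pairs to compare; return 0 so the other losses still train.
            return motion.new_zeros(())
        dist = F.pdist(motion, 2) ** 2
        return (-dist / d_scale).exp().mean()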

Reg. distance between adjacent grid points in semantic maps for nuscenes

Hi, congrats on the excellent work. I was going through your code and found that, while converting the agent's position in the image to a pixel position, you multiply it by a scale of 3. That would mean the distance between adjacent pixels is 1/3 m, as opposed to the 3 m mentioned in the paper. Please let me know if I am interpreting anything wrong.
