media-smart / vedatad
A single stage temporal action detection toolbox based on PyTorch
License: Apache License 2.0
Hi,
Thanks for your wonderful repo. Your paper reports a very large number of epochs, so I'm wondering about the model's training time. Also, how many GPUs did you use?
Hi! Thanks for the great work!
May I ask which GPU type you used for training? Did you also use a GTX 1080 Ti, the same as for inference?
Hi guys, thanks for your work and for sharing the code. I have a question about the labels used to calculate the loss. As I understand it, in a multi-class detection problem with, say, 5 categories, the foreground labels would be 0,1,2,3,4 and the background label would be 5. Similarly, with only 1 class, the foreground would be 0 and the background would be 1.
I was wondering whether this "fg-0 bg-1" convention gets flipped (to "fg-1 bg-0") when calculating the loss, because in vedacore.ops.sigmoid_focal_loss, specifically in the sigmoid_focal_loss_cuda.cu file, it reads:
__global__ void SigmoidFocalLossForward(const int nthreads,
                                        const scalar_t *logits,
                                        const int64_t *targets,
                                        const int num_classes,
                                        const float gamma, const float alpha,
                                        const int num, scalar_t *losses) {
  CUDA_1D_KERNEL_LOOP(i, nthreads) {
    int n = i / num_classes;
    int d = i % num_classes;  // current class [0~79]
    int t = targets[n];       // target class [0~79]
    // Decide whether it is a positive or negative case.
    scalar_t c1 = (t == d);
    scalar_t c2 = (t >= 0 & t != d);
And I guess this line, int d = i % num_classes;, is where the labels are effectively flipped (so they become bg-0, fg-1)?
The reason I ask is that if the labels aren't flipped, the loss doesn't make sense to me. Take the simplest case, binary cross-entropy:
loss = -[y log(p) + (1-y) log(1-p)]
Minimizing this loss is equivalent to maximizing y log(p) + (1-y) log(1-p). So when y=1 we maximize p, and when y=0 we maximize 1-p, i.e. minimize p. Therefore, if the input labels come in as "bg-1 fg-0", we should convert them to "bg-0 fg-1". Is this correct?
Thanks!
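For reference, here is a minimal NumPy sketch (my own reconstruction, not the repo's code) of the forward logic in that kernel. It compares the target index t against each class slot d, mirroring the c1/c2 branches above:

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Mimic the CUDA kernel: logits is (N, C); targets holds the target
    class index for each sample (with background encoded as C, it never
    matches any class slot, so background hits only the negative branch)."""
    n, c = logits.shape
    p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid per class slot
    d = np.arange(c)[None, :]                  # current class index
    t = targets[:, None]                       # target class index
    c1 = (t == d).astype(float)                # positive slots
    c2 = ((t >= 0) & (t != d)).astype(float)   # negative slots
    # focal terms, matching the kernel's positive/negative cases
    pos = -c1 * alpha * (1 - p) ** gamma * np.log(np.clip(p, 1e-12, 1.0))
    neg = -c2 * (1 - alpha) * p ** gamma * np.log(np.clip(1 - p, 1e-12, 1.0))
    return pos + neg
```

As I read the kernel, no explicit fg/bg flip is needed: with the background label equal to num_classes, t never equals d, so every slot of a background sample falls into the negative branch automatically.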
Hi @Media-Smart, thank you for your excellent work and clean implementation. I want to ask whether you trained the whole network end-to-end. As you describe in the paper, one of the disadvantages of two-stream methods is the difficulty of training them end-to-end. Since this work is not based on two-stream input, I assume you trained the network end-to-end. So did you also optimize the feature extractor network when training your model?
Tried to run the model on THUMOS14, and it seems open-mmlab://i3d_r50_256p_32x2x1_100e_kinetics400_rgb is having an issue with loading. Attached the error log for reference.
Hi, thanks a lot for your work and for sharing the code. I'm having trouble loading the weights file. At the very beginning it could not be loaded at all; then I used the method suggested in #10 and it worked. But after that, I ran into these "unexpected keys" and "missing keys" issues.
I only changed num_classes = 1 in the second section, "2. model", because I want to retrain the model on my own dataset. But even when I change it back to num_classes = 20, the same problem remains.
Could you help me with it? Thanks!
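One common workaround (a sketch under my own assumptions, not necessarily what the authors intend) is to drop checkpoint entries whose names or shapes no longer match the model before loading, which accounts for the "unexpected keys" / "missing keys" warnings when the classification head changes size:

```python
import numpy as np

def filter_state_dict(checkpoint, model_state):
    """Keep only checkpoint tensors whose key exists in the model
    and whose shape matches; report everything else."""
    kept, dropped = {}, []
    for key, tensor in checkpoint.items():
        if key in model_state and model_state[key].shape == tensor.shape:
            kept[key] = tensor
        else:
            dropped.append(key)  # e.g. the num_classes-dependent head
    missing = [k for k in model_state if k not in kept]
    return kept, dropped, missing

# Hypothetical example: head shape changed from 20 classes to 1.
ckpt = {'backbone.w': np.zeros((64, 3)), 'head.cls.w': np.zeros((20, 64))}
model = {'backbone.w': np.zeros((64, 3)), 'head.cls.w': np.zeros((1, 64))}
kept, dropped, missing = filter_state_dict(ckpt, model)
```

With PyTorch, the filtered dict can then be loaded via model.load_state_dict(kept, strict=False), and the dropped head weights are simply re-initialized.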
Thank you for your excellent work! I have a question about the fps.
The fps is 25 when you extract frames, but the fps of the videos is 30, and the duration in txt2json.py is calculated with fps 30.
Does this influence the results? Sincerely waiting for your reply.
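To make the concern concrete, here is a small sketch (my own illustration; the frame count is hypothetical, not taken from the dataset) of how the same number of frames maps to different durations under the two frame rates:

```python
def frames_to_seconds(num_frames, fps):
    """Convert a frame count to a duration in seconds."""
    return num_frames / fps

# Hypothetical clip: 750 extracted frames.
dur_at_25 = frames_to_seconds(750, 25)  # duration if the frames were sampled at 25 fps
dur_at_30 = frames_to_seconds(750, 30)  # duration implied by a 30 fps annotation
```

If the annotations and the extracted frames assume different frame rates, every predicted segment boundary is scaled by the ratio of the two rates, which is why the question matters.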
I use test.py to get the mAP@0.5 result, but I cannot find the NMS step, which is necessary for evaluating the result.
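For reference, a minimal 1-D temporal NMS sketch (my own illustration, not the repo's implementation) over (start, end, score) proposals looks like this:

```python
def temporal_nms(segments, iou_thr=0.5):
    """segments: list of (start, end, score) tuples; returns the kept
    segments, greedily suppressing overlaps above iou_thr."""
    segs = sorted(segments, key=lambda s: s[2], reverse=True)
    kept = []
    for s in segs:
        ok = True
        for k in kept:
            inter = max(0.0, min(s[1], k[1]) - max(s[0], k[0]))
            union = (s[1] - s[0]) + (k[1] - k[0]) - inter
            if union > 0 and inter / union > iou_thr:
                ok = False  # overlaps too much with a higher-scoring segment
                break
        if ok:
            kept.append(s)
    return kept
```

The actual post-processing step in the repo may live in the engine or head code rather than in test.py itself.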
In dcn/deform_conv.py a compiled extension is imported via from . import deform_conv_ext. How did you compile files such as deform_conv_ext.cpp? I couldn't find the build command; please advise.
How should the data be prepared for using InferEngine?
If my inference were something like:
def read_video(video):
    """Read video and prepare video_metas."""
    pass

def prepare(cfg, checkpoint):
    engine = build_engine(cfg.infer_engine)
    load_weights(engine.model, checkpoint, map_location='cpu')
    device = torch.cuda.current_device()
    engine = MMDataParallel(
        engine.to(device), device_ids=[torch.cuda.current_device()])
    data_pipeline = Compose(cfg.data_pipeline)
    return engine, data_pipeline

def main():
    args = parse_args()
    cfg = Config.fromfile(args.config)
    engine, data_pipeline = prepare(cfg, args.checkpoint)
    imgs, video_metas = read_video(args.video)
    data = data_pipeline(imgs)
    # scatter here
    results = engine.infer(data['imgs'], video_metas)
    print(results)
I will likely need to change the pipeline from the default, but to what?
data_pipeline = [
    dict(typename='LoadMetaInfo'),  # probably don't need
    dict(typename='Time2Frame'),    # probably don't need
    dict(
        typename='OverlapCropAug',
        num_frames=num_frames,
        overlap_ratio=overlap_ratio,
        transforms=[
            dict(typename='TemporalCrop'),
            dict(typename='LoadFrames', to_float32=True),  # probably don't need
            dict(typename='SpatialCenterCrop', crop_size=img_shape),
            dict(typename='Normalize', **img_norm_cfg),
            dict(typename='Pad', size=(num_frames, *img_shape)),
            dict(typename='DefaultFormatBundle'),
            dict(typename='Collect', keys=['imgs'])
        ])
]
I imagine ImageToTensor is needed as a last step before Collect, and loading the frames will need to be handled differently.
Any clues or help are appreciated.
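As a rough illustration of what OverlapCropAug presumably does (my own sketch; the actual transform may differ), here is how a long frame sequence could be split into fixed-length, overlapping windows for inference:

```python
import numpy as np

def overlap_crops(frames, num_frames=96, overlap_ratio=0.25):
    """Split frames (T, H, W, C) into windows of num_frames frames,
    with adjacent windows overlapping by roughly overlap_ratio."""
    stride = max(1, int(num_frames * (1 - overlap_ratio)))
    t = frames.shape[0]
    starts = list(range(0, max(t - num_frames, 0) + 1, stride))
    if not starts or starts[-1] + num_frames < t:
        starts.append(max(t - num_frames, 0))  # cover the tail of the video
    crops = []
    for s in starts:
        crop = frames[s:s + num_frames]
        if crop.shape[0] < num_frames:  # zero-pad a short tail window
            pad = np.zeros((num_frames - crop.shape[0],) + crop.shape[1:],
                           dtype=crop.dtype)
            crop = np.concatenate([crop, pad], axis=0)
        crops.append(crop)
    return np.stack(crops), starts
```

Each window would then go through the spatial crop, normalization, and padding steps of the pipeline before being stacked into a batch for engine.infer.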
Hello,
Could you explain how the input to eval_map in evaluation/mean_ap.py should be organized?
The format of my det_results input is: first level, a list of N elements, each being the prediction list for one video; second level, C elements, each representing the predictions of one class; third level, a variable number of elements: with K predictions for a class, it holds K×3 elements, namely the start point, the end point, and the predicted probability for that class (see the attached figure).
But line 315 raises an error:
cls_dets = np.vstack(cls_dets)
with:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 21 has size 3
Could you tell me how to organize the det_results format correctly?
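The error message suggests that an empty class entry has shape (0,) rather than (0, 3), so np.vstack cannot concatenate it with the (K, 3) arrays. Here is a sketch (my own guess at the expected layout, not confirmed by the authors) that keeps empty classes two-dimensional:

```python
import numpy as np

def make_det_results(per_video_preds, num_classes):
    """per_video_preds: list over videos; each item maps a class index
    to a list of (start, end, score) tuples. Returns the nested
    list-of-lists layout with every entry a (K, 3) float array."""
    det_results = []
    for preds in per_video_preds:
        video = []
        for c in range(num_classes):
            dets = preds.get(c, [])
            # An empty class must still be (0, 3), not (0,), for np.vstack.
            arr = (np.asarray(dets, dtype=np.float32)
                   if dets else np.zeros((0, 3), dtype=np.float32))
            video.append(arr)
        det_results.append(video)
    return det_results
```

With this layout, np.vstack over a video's class arrays succeeds even when some classes have no detections.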
I got the error explained above. Does the function "_specify_ddp_gpu_num" still exist? Thank you!
Howdy, I read your paper and admire it very much, but I have a few questions I hope you can answer. First, I ran the training model and measured results at several epoch counts, but they are not consistent with the 53.8 in your paper. Is this how the baseline data were obtained?
epoch:  1200   1000   900    800    700    600    300    200    100
result: 0.445  0.448  0.455  0.456  0.457  0.45   0.445  0.416  0.34