media-smart / vedatad
A single stage temporal action detection toolbox based on PyTorch
License: Apache License 2.0
Hi,
Thanks for your wonderful repo. Your paper reports a very large number of epochs, so I'm wondering about the model's training time. Also, how many GPUs did you use?
Hi! Thanks for the great work!
May I ask which GPU type you used for training? Did you also use a GTX 1080 Ti, the same as for inference?
Hi guys, thanks for your work and for sharing the code. I have a question about the labels used to calculate the loss. As I understand it, in a multi-class detection problem with, say, 5 categories, the foreground labels would be 0,1,2,3,4 and the background label would be 5. Similarly, with only 1 class, the foreground would be 0 and the background would be 1.
I was wondering whether this "fg-0 bg-1" convention gets flipped (to "fg-1 bg-0") when calculating the loss, because in vedacore.ops.sigmoid_focal_loss, specifically in the sigmoid_focal_loss_cuda.cu file, it reads:
__global__ void SigmoidFocalLossForward(const int nthreads,
                                        const scalar_t *logits,
                                        const int64_t *targets,
                                        const int num_classes,
                                        const float gamma, const float alpha,
                                        const int num, scalar_t *losses) {
  CUDA_1D_KERNEL_LOOP(i, nthreads) {
    int n = i / num_classes;
    int d = i % num_classes;  // current class [0~79]
    int t = targets[n];       // target class [0~79]
    // Decide whether it is a positive or negative case.
    scalar_t c1 = (t == d);
    scalar_t c2 = (t >= 0 & t != d);
And I guess this line, int d = i % num_classes;, is where the labels are effectively flipped (so they become bg-0, fg-1)?
The reason I ask is that if the labels aren't flipped, the loss doesn't make sense to me. Take the simplest case, binary cross-entropy:
loss = -[y log(p) + (1-y) log(1-p)]
Minimizing this loss is equivalent to maximizing y log(p) + (1-y) log(1-p). So when y=1 we maximize p, and when y=0 we maximize 1-p, i.e. minimize p. Therefore, if the input labels come in as "bg-1 fg-0", we should convert them to "bg-0 fg-1". Is this correct?
Thanks!
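For reference, here is a minimal NumPy sketch (my own reconstruction, not the repo's code) of the forward logic in that kernel. It compares the target index t against each class slot d, mirroring the c1/c2 branches above:

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Mimic the CUDA kernel: logits is (N, C); targets holds the target
    class index for each sample (with background encoded as C, it never
    matches any class slot, so background hits only the negative branch)."""
    n, c = logits.shape
    p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid per class slot
    d = np.arange(c)[None, :]                  # current class index
    t = targets[:, None]                       # target class index
    c1 = (t == d).astype(float)                # positive slots
    c2 = ((t >= 0) & (t != d)).astype(float)   # negative slots
    # focal terms, matching the kernel's positive/negative cases
    pos = -c1 * alpha * (1 - p) ** gamma * np.log(np.clip(p, 1e-12, 1.0))
    neg = -c2 * (1 - alpha) * p ** gamma * np.log(np.clip(1 - p, 1e-12, 1.0))
    return pos + neg
```

As I read the kernel, no explicit fg/bg flip is needed: with the background label equal to num_classes, t never equals d, so every slot of a background sample falls into the negative branch automatically.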
Hi @Media-Smart, thank you for your excellent work and clean implementation. I want to ask whether you trained the whole network end-to-end. As you describe in the paper, one of the disadvantages of two-stream methods is the difficulty of training them end-to-end. Since this work is not based on two-stream input, I assume you trained the network end-to-end. So did you also optimize the feature extractor network when training your model?
Tried to run the model on THUMOS14, and it seems open-mmlab://i3d_r50_256p_32x2x1_100e_kinetics400_rgb is having an issue with loading. Attached the error log for reference.
Hi, thanks a lot for your work and for sharing the code. I'm having trouble loading the weights file. At the very beginning it could not be loaded at all; then I used the method suggested in #10 and it worked. But after that, I ran into these "unexpected keys" and "missing keys" issues.
I only changed num_classes = 1 in the second section, "2. model", because I want to retrain the model on my own dataset. But even when I change it back to num_classes = 20, the same problem remains.
Could you help me with it? Thanks!
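One common workaround (a sketch under my own assumptions, not necessarily what the authors intend) is to drop checkpoint entries whose names or shapes no longer match the model before loading, which accounts for the "unexpected keys" / "missing keys" warnings when the classification head changes size:

```python
import numpy as np

def filter_state_dict(checkpoint, model_state):
    """Keep only checkpoint tensors whose key exists in the model
    and whose shape matches; report everything else."""
    kept, dropped = {}, []
    for key, tensor in checkpoint.items():
        if key in model_state and model_state[key].shape == tensor.shape:
            kept[key] = tensor
        else:
            dropped.append(key)  # e.g. the num_classes-dependent head
    missing = [k for k in model_state if k not in kept]
    return kept, dropped, missing

# Hypothetical example: head shape changed from 20 classes to 1.
ckpt = {'backbone.w': np.zeros((64, 3)), 'head.cls.w': np.zeros((20, 64))}
model = {'backbone.w': np.zeros((64, 3)), 'head.cls.w': np.zeros((1, 64))}
kept, dropped, missing = filter_state_dict(ckpt, model)
```

With PyTorch, the filtered dict can then be loaded via model.load_state_dict(kept, strict=False), and the dropped head weights are simply re-initialized.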
Thank you for your excellent work! I have a question about the fps.
The fps is 25 when you extract frames, but the fps of the videos is 30, and the duration in txt2json.py is calculated with fps 30.
Does this influence the results? Sincerely waiting for your reply.
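To make the concern concrete, here is a small sketch (my own illustration; the frame count is hypothetical, not taken from the dataset) of how the same number of frames maps to different durations under the two frame rates:

```python
def frames_to_seconds(num_frames, fps):
    """Convert a frame count to a duration in seconds."""
    return num_frames / fps

# Hypothetical clip: 750 extracted frames.
dur_at_25 = frames_to_seconds(750, 25)  # duration if the frames were sampled at 25 fps
dur_at_30 = frames_to_seconds(750, 30)  # duration implied by a 30 fps annotation
```

If the annotations and the extracted frames assume different frame rates, every predicted segment boundary is scaled by the ratio of the two rates, which is why the question matters.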
I use test.py to get the mAP@0.5 result, but I cannot find the NMS step, which is necessary for evaluating the result.
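For reference, a minimal 1-D temporal NMS sketch (my own illustration, not the repo's implementation) over (start, end, score) proposals looks like this:

```python
def temporal_nms(segments, iou_thr=0.5):
    """segments: list of (start, end, score) tuples; returns the kept
    segments, greedily suppressing overlaps above iou_thr."""
    segs = sorted(segments, key=lambda s: s[2], reverse=True)
    kept = []
    for s in segs:
        ok = True
        for k in kept:
            inter = max(0.0, min(s[1], k[1]) - max(s[0], k[0]))
            union = (s[1] - s[0]) + (k[1] - k[0]) - inter
            if union > 0 and inter / union > iou_thr:
                ok = False  # overlaps too much with a higher-scoring segment
                break
        if ok:
            kept.append(s)
    return kept
```

The actual post-processing step in the repo may live in the engine or head code rather than in test.py itself.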
In dcn/deform_conv.py a compiled extension is imported via from . import deform_conv_ext. How did you compile files such as deform_conv_ext.cpp? I couldn't find the build command; please advise.
How should the data be prepared for using InferEngine?
If my inference were something like:
def read_video(video):
    """Read video and prepare video_metas."""
    pass

def prepare(cfg, checkpoint):
    engine = build_engine(cfg.infer_engine)
    load_weights(engine.model, checkpoint, map_location='cpu')
    device = torch.cuda.current_device()
    engine = MMDataParallel(
        engine.to(device), device_ids=[torch.cuda.current_device()])
    data_pipeline = Compose(cfg.data_pipeline)
    return engine, data_pipeline

def main():
    args = parse_args()
    cfg = Config.fromfile(args.config)
    engine, data_pipeline = prepare(cfg, args.checkpoint)
    imgs, video_metas = read_video(args.video)
    data = data_pipeline(imgs)
    # scatter here
    results = engine.infer(data['imgs'], video_metas)
    print(results)
I will likely need to change the pipeline from the default, but to what?
data_pipeline = [
    dict(typename='LoadMetaInfo'),  # probably don't need
    dict(typename='Time2Frame'),    # probably don't need
    dict(
        typename='OverlapCropAug',
        num_frames=num_frames,
        overlap_ratio=overlap_ratio,
        transforms=[
            dict(typename='TemporalCrop'),
            dict(typename='LoadFrames', to_float32=True),  # probably don't need
            dict(typename='SpatialCenterCrop', crop_size=img_shape),
            dict(typename='Normalize', **img_norm_cfg),
            dict(typename='Pad', size=(num_frames, *img_shape)),
            dict(typename='DefaultFormatBundle'),
            dict(typename='Collect', keys=['imgs'])
        ])
]
I imagine ImageToTensor is needed as a last step before Collect, and loading the frames will need to be handled differently.
Any clues or help are appreciated.
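As a rough illustration of what OverlapCropAug presumably does (my own sketch; the actual transform may differ), here is how a long frame sequence could be split into fixed-length, overlapping windows for inference:

```python
import numpy as np

def overlap_crops(frames, num_frames=96, overlap_ratio=0.25):
    """Split frames (T, H, W, C) into windows of num_frames frames,
    with adjacent windows overlapping by roughly overlap_ratio."""
    stride = max(1, int(num_frames * (1 - overlap_ratio)))
    t = frames.shape[0]
    starts = list(range(0, max(t - num_frames, 0) + 1, stride))
    if not starts or starts[-1] + num_frames < t:
        starts.append(max(t - num_frames, 0))  # cover the tail of the video
    crops = []
    for s in starts:
        crop = frames[s:s + num_frames]
        if crop.shape[0] < num_frames:  # zero-pad a short tail window
            pad = np.zeros((num_frames - crop.shape[0],) + crop.shape[1:],
                           dtype=crop.dtype)
            crop = np.concatenate([crop, pad], axis=0)
        crops.append(crop)
    return np.stack(crops), starts
```

Each window would then go through the spatial crop, normalization, and padding steps of the pipeline before being stacked into a batch for engine.infer.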
Hello,
Could you explain how the input to eval_map in evaluation/mean_ap.py should be organized?
The format of my det_results input is: first level, a list of N elements, each being the prediction list for one video; second level, C elements, each representing the predictions of one class; third level, a variable number of elements: with K predictions for a class, it holds K×3 elements, namely the start point, the end point, and the predicted probability for that class (see the attached figure).
But line 315 raises an error:
cls_dets = np.vstack(cls_dets)
with:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 21 has size 3
Could you tell me how to organize the det_results format correctly?
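The error message suggests that an empty class entry has shape (0,) rather than (0, 3), so np.vstack cannot concatenate it with the (K, 3) arrays. Here is a sketch (my own guess at the expected layout, not confirmed by the authors) that keeps empty classes two-dimensional:

```python
import numpy as np

def make_det_results(per_video_preds, num_classes):
    """per_video_preds: list over videos; each item maps a class index
    to a list of (start, end, score) tuples. Returns the nested
    list-of-lists layout with every entry a (K, 3) float array."""
    det_results = []
    for preds in per_video_preds:
        video = []
        for c in range(num_classes):
            dets = preds.get(c, [])
            # An empty class must still be (0, 3), not (0,), for np.vstack.
            arr = (np.asarray(dets, dtype=np.float32)
                   if dets else np.zeros((0, 3), dtype=np.float32))
            video.append(arr)
        det_results.append(video)
    return det_results
```

With this layout, np.vstack over a video's class arrays succeeds even when some classes have no detections.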
I got the error explained above. Does the function "_specify_ddp_gpu_num" still exist? Thank you!
Howdy, I read your paper and admire it very much, but I have a few questions I hope you can answer. First, I ran the training model and measured results at several epoch counts, but they are not consistent with the 53.8 in your paper. Is this how the baseline data were obtained?
epoch:  1200   1000   900    800    700    600    300    200    100
result: 0.445  0.448  0.455  0.456  0.457  0.45   0.445  0.416  0.34