explore-and-match's Issues

No zero_shot_clip

In /lib/modeling/model.py there is the following import of zero_shot_clip, but no such module exists anywhere in the project.

from lib.modeling.zero_shot_clip import build_zeroshot_clip
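
As a stopgap, here is a minimal sketch of what such a module might contain, assuming build_zeroshot_clip is simply meant to return a frozen, pretrained CLIP model; the signature and return values are guesses from the import alone.

    # Hypothetical stand-in for the missing lib/modeling/zero_shot_clip.py,
    # assuming it wraps OpenAI's pretrained CLIP for zero-shot use.
    import clip

    def build_zeroshot_clip(arch="ViT-B/32", device="cpu"):
        model, preprocess = clip.load(arch, device=device)  # pretrained CLIP weights
        model.eval()                                        # zero-shot: no fine-tuning
        for p in model.parameters():
            p.requires_grad_(False)                         # freeze everything
        return model, preprocess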

Training equipment

Hello, very meaningful work. How many GPUs are needed for training, and how long does it take altogether?

ap_array is empty

I got this error during evaluation and found that ap_array was sometimes empty. Could you kindly give some advice on how to fix it?

  File "/Explore-and-Match/lib/evaluate/eval.py", line 75, in compute_ap
    iou_thd2ap = dict(zip([str(e) for e in iou_thds], ap_thds))
TypeError: 'numpy.float64' object is not iterable
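
A defensive sketch of how that call could be guarded, assuming ap_thds comes from averaging ap_array over queries (the variable roles are inferred from the traceback, not from the repo):

    import numpy as np

    iou_thds = (0.5, 0.7)
    ap_array = np.empty((0, len(iou_thds)))    # no valid predictions collected
    if ap_array.size:
        ap_thds = ap_array.mean(axis=0)        # per-threshold average precision
    else:
        ap_thds = np.zeros(len(iou_thds))      # fall back instead of a bare scalar
    ap_thds = np.atleast_1d(ap_thds)           # keep zip() iterable either way
    iou_thd2ap = dict(zip([str(e) for e in iou_thds], ap_thds))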

Reproduce results of LVTR-CLIP

Hello! I followed all of the feature extraction and Charades training instructions on your GitHub homepage, but in my environment the reproduced results of LVTR-CLIP at epoch 200 looked like this:

>>>>> Evalutation
[Epoch] 200
[Loss]
        > loss_label 0.0958
        > class_error 0.0000
        > loss_span 0.0345
        > loss_giou 0.5867
        > loss_label_0 0.0965
        > class_error_0 0.0000
        > loss_span_0 0.0351
        > loss_giou_0 0.6012
        > loss_label_1 0.0960
        > class_error_1 0.0000
        > loss_span_1 0.0345
        > loss_giou_1 0.5858
        > loss_label_2 0.0958
        > class_error_2 0.0000
        > loss_span_2 0.0342
        > loss_giou_2 0.5838
        > loss_overall 2.8799
[Metrics_No_NMS]
OrderedDict([   ('[email protected]', 54.11),
                ('[email protected]', 38.61),
                ('[email protected]', 22.44),
                ('[email protected]', 7.09),
                ('[email protected]', 87.81),
                ('[email protected]', 77.6),
                ('[email protected]', 62.84),
                ('[email protected]', 33.93),
                ('VG-full-mAP', 33.87),
                ('VG-full-mIoU@R1', 0.2466),
                ('VG-full-mIoU@R5', 0.5349),
                ('[email protected]', 34.25),
                ('[email protected]', 73.62),
                ('VG-middle-mAP', 40.85),
                ('[email protected]', 15.05),
                ('[email protected]', 55.39),
                ('VG-short-mAP', 29.21)])

To document the problem, I used TensorBoard to collect evaluation metrics for each epoch.
[TensorBoard screenshots]
I followed your code without any modification, except for removing the evaluation record for the "long" length_range. Would you kindly give some advice on how to successfully reproduce your results? Thank you very much!

Frames of some videos in Charades cannot be read

Thanks for your meaningful work. When I tried to extract frames per video, I noticed that some frame folders were empty; this happened for every num_frames setting (16/32/64/126/256). It looks like there were problems reading the Charades video frames with OpenCV. Would you kindly give some advice? Also, I downloaded the original-size Charades (55 GB); which version of Charades did you use? Thank you!
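
For anyone debugging this, a minimal OpenCV sanity check on a single video (the path below is a hypothetical example) shows whether the file opens at all and how many frames actually decode:

    import cv2

    cap = cv2.VideoCapture("Charades_v1/AO8RW.mp4")  # hypothetical example path
    if not cap.isOpened():
        print("cannot open video: check codec support and file integrity")
    count = 0
    while True:
        ok, _frame = cap.read()
        if not ok:
            break                                    # end of stream or decode failure
        count += 1
    cap.release()
    print(f"decoded {count} frames")                 # 0 frames -> an empty output folder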

Regarding Figure 7 in the paper

Hello,
Could you please explain more about Fig. 7 in the paper?
What do the x-axis and y-axis of each proposal plot mean?
Thanks!

What's the definition of run?

I found this snippet in train.py. Where is run defined?

if __name__ == '__main__':
    logger = setup_logger('LVTR', args.log_dir, distributed_rank=0, filename=cur_time()+"_train.txt")
    train_val(logger, run=run)
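
One guess is that run is an experiment-tracking handle that was never committed. A hypothetical workaround, assuming train_val only uses it for optional logging (neither the tracker choice nor the None fallback is confirmed by the repo):

    # Hypothetical: define `run` before the call, e.g. with Weights & Biases,
    # or pass None if train_val guards its logging calls.
    import wandb

    run = wandb.init(project="LVTR")  # or simply: run = None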

Get CLIP Features for Charades

I found the following code in clip_encoder.py. It requires train.json and test.json for Charades, but jiyanggao/TALL only provides charades_sta_train/test.txt. How did you extract the CLIP features for Charades? Could you share the features used in LVTR via Google Drive or Box? And would you kindly provide a detailed description of how your dataset files are organized?

    phases = ['train', 'val', 'test'] if dataset in ['activitynet'] else ['train', 'test']
    for phase in phases:
        # load annotations
        with open(os.path.join(data_dir, phase + '.json')) as j:
            annos = json.load(j)
        time_meters['load_annotations'].update(time.time()-tictoc)
        tictoc = time.time()
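
Absent an official answer, here is a hypothetical converter from TALL's charades_sta_train.txt format ("<video> <start> <end>##<sentence>" per line) into a train.json that a loop like the one above could read; the exact schema clip_encoder.py expects is an assumption:

    import json
    from collections import defaultdict

    annos = defaultdict(lambda: {"timestamps": [], "sentences": []})
    with open("charades_sta_train.txt") as f:
        for line in f:
            meta, sentence = line.strip().split("##")
            vid, start, end = meta.split()
            annos[vid]["timestamps"].append([float(start), float(end)])
            annos[vid]["sentences"].append(sentence)

    with open("train.json", "w") as j:
        json.dump(annos, j)  # assumed schema: {video_id: {timestamps, sentences}}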

'clip' Package in preprocess/clip_encoder.py

Thanks for your good work. When extracting the CLIP features for the datasets, I found that the source code (preprocess/clip_encoder.py) imports a package named 'clip', which is not mentioned in Installation. Where can I install the 'clip' package from? Thanks!
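
The import is almost certainly OpenAI's CLIP, which is installed from its GitHub repository rather than by name from PyPI. A minimal check that the install works:

    # Install first (shell): pip install git+https://github.com/openai/CLIP.git
    import clip
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)  # standard CLIP entry point
    print(clip.available_models())                            # lists e.g. 'RN50', 'ViT-B/32'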

Reproducing with C3D

Hi, thank you for the interesting work.
I want to reproduce the results with C3D features.
Is the configuration the same as for CLIP?
Could you please provide detailed config files for both Charades and ActivityNet with C3D features?

LVTR-CLIP

Hello, I'm very interested in your LVTR-CLIP model. The features extracted by CLIP contain only image information, whereas the features extracted by C3D contain both image information and temporal information. So why does LVTR-CLIP outperform LVTR-C3D? In other words, is the cross-modal encoder capable of modeling temporal relations both between frames and between frames and text?

It seems that some queries were not evaluated?

There seems to be a bug in lines 119-122 of test.py.
If the number of sentences exceeds num_input_sentences, the list "split_src_txt" will have a length greater than 1, while "annotations" always has length 1. Therefore, the for loop (lines 119 to 181) is only executed once, regardless of the size of "split_src_txt".
In other words, it seems that some queries are never evaluated?
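
A minimal illustration of the suspected truncation (toy data, not the repo's real variables): zip stops at the shorter iterable, so every chunk after the first is silently skipped.

    split_src_txt = [["q1", "q2"], ["q3", "q4"], ["q5"]]  # 3 chunks after splitting
    annotations = [{"vid": "abc"}]                        # always a single element

    for txt_chunk, anno in zip(split_src_txt, annotations):
        print(txt_chunk, anno)  # runs once; ["q3", "q4"] and ["q5"] are never visited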
