momentor's Introduction

Momentor (ICML 2024)

The official repository of the paper Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning.

Momentor Overview

Momentor is a Video-LLM designed for fine-grained comprehension and localization in videos. It is composed of a frame encoder, a linear projection layer, a Temporal Perception Module (TPM), and a Large Language Model (LLM). We carefully design the TPM to improve fine-grained temporal modeling and representation. The architecture and training of Momentor are shown in the figure below.
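As a rough illustration, the flow of data through these components could be sketched as below. This is a hypothetical PyTorch-style sketch with placeholder module choices (the Identity encoder and the GRU stand in for the real frame encoder and TPM), not the actual implementation; see the paper for the real design.

import torch
import torch.nn as nn

class MomentorSketch(nn.Module):
    """Hypothetical sketch of Momentor's data flow; internals are placeholders."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.frame_encoder = nn.Identity()             # stands in for a pretrained frame encoder
        self.projection = nn.Linear(vis_dim, llm_dim)  # linear layer into the LLM embedding space
        self.tpm = nn.GRU(llm_dim, llm_dim, batch_first=True)  # placeholder for the Temporal Perception Module

    def forward(self, frames):                # frames: (batch, num_frames, vis_dim)
        feats = self.frame_encoder(frames)    # per-frame visual features
        tokens = self.projection(feats)       # project features into LLM token space
        tokens, _ = self.tpm(tokens)          # mix temporal information across frames
        return tokens                         # visual tokens to interleave with text for the LLM

tokens = MomentorSketch()(torch.randn(1, 8, 1024))
print(tokens.shape)  # torch.Size([1, 8, 4096])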

Installation

Clone our repository and create a conda environment:

git clone <repository_url>
cd Momentor/momentor
conda create --name=momentor python=3.10
conda activate momentor
pip install -r requirements.txt

Training

For training instructions, check out train_momentor.md.

Moment-10M

We present Moment-10M, a large-scale video instruction dataset with segment-level annotations. We use videos from YTTemporal-1B to construct Moment-10M and propose an automatic data generation engine to extract instance and event information from these videos and generate segment-level instruction-following data. We meticulously design 5 single-segment tasks and 3 cross-segment tasks, which enable Video-LLMs to perform comprehensive segment-level reasoning.

We are releasing our Moment-10M dataset; you can download it from the following links: part1, part2.

You can also download the data for Grounded Event-Sequence Modeling here: GESM.

After downloading and extracting the dataset to obtain the data files, you can use convert_data.py to transform the data into a text dialogue format and download_videos.py to download the corresponding video files. The usage for these scripts is as follows:

python convert_data.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

  • --source_path: The path to the input data file that needs to be converted.
  • --target_path: The path where the converted file will be saved.
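For intuition, the conversion presumably turns the templated dialogues (which, as the data sample quoted in the issues section below shows, contain {moment} placeholders alongside a variables field with segment boundaries) into plain text dialogues. A minimal hypothetical sketch of such a transformation, not the actual convert_data.py:

import json

def convert_entry(entry):
    # Hypothetical: substitute the {moment} placeholder with concrete timestamps.
    start, end = entry['variables']['moment']
    moment_text = f'{start:.1f}s - {end:.1f}s'
    return [{role: text.replace('{moment}', moment_text) for role, text in turn.items()}
            for turn in entry['conversations']]

entry = {
    'variables': {'moment': [0.0, 12.4]},
    'conversations': [{'User': 'Please watch the clip of {moment}. What color is the car?',
                       'Assistant': 'The car is black.'}],
}
print(json.dumps(convert_entry(entry), indent=2))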
python download_videos.py --source_path <path_to_data_file> --video_path <path_to_store_videos>

Parameters:

  • --source_path: The path to the input data file containing identifiers for the videos.
  • --video_path: The path where the downloaded video files will be stored.
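Since the identifiers appear to be YouTube video IDs (Moment-10M is built on YTTemporal-1B, which is sourced from YouTube), a stand-in for download_videos.py could simply shell out to yt-dlp. This is an illustrative sketch, not the repository's script:

import subprocess
from pathlib import Path

def download_video(video_id, video_dir):
    # Illustrative only: fetch one video by its (assumed) YouTube ID via yt-dlp.
    out_path = Path(video_dir) / f'{video_id}.mp4'
    subprocess.run(
        ['yt-dlp', '-f', 'mp4', '-o', str(out_path),
         f'https://www.youtube.com/watch?v={video_id}'],
        check=True,
    )

download_video('jS2fE622RMA', './videos')  # example ID taken from the dataset sample below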

For GESM data extraction, use convert_data_gesm.py as follows:

python convert_data_gesm.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

  • --source_path: The path to the input data file that needs to be converted.
  • --target_path: The path where the converted file will be saved.

Citation

If you find our work useful in your research, please consider giving this repository a star and citing our paper as follows:

@misc{qian2024momentor,
      title={Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning}, 
      author={Long Qian and Juncheng Li and Yu Wu and Yaobo Ye and Hao Fei and Tat-Seng Chua and Yueting Zhuang and Siliang Tang},
      year={2024},
      eprint={2402.11435},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgment

Thanks to the following open-source projects:


momentor's Issues

load_video function

Hi @loveofguoke @longqian-zju,

Thank you for your interesting work. I have been really excited about it!

When I run your training script, I encounter this error:

ModuleNotFoundError: No module named 'momentor.eval'

Can you help me locate where the momentor.eval folder is?

Thank you so much!

Checkpoints & Evaluation Code

Hi there

Thanks for sharing your work!

I wonder if you could share the pre-trained model and the evaluation code for downstream tasks to ensure reproducibility.

Thank you.

Correct dataset format to train the model?

Hi @longqian-zju , @loveofguoke

I am trying to train the model; however, the dataset before and after conversion has different formats, and both formats seem incompatible with training the model.

data_0 = json.load(open("Moment-10M_0.json"))
data_0["jS2fE622RMA"]["qa_data"]

[{'id': 'jS2fE622RMA',
  'data_type': 'qa_data',
  'variables': {'moment': [0.0, 12.4]},
  'conversations': [{'User': 'Please watch the clip of {moment}. What color is the car with a black stripe?',
                     'Assistant': 'The car with a black stripe is a black car.'}],
  'clip_similarity': 0.2374267578125},
 {'id': 'jS2fE622RMA',
  'data_type': 'qa_data',
  'variables': {'moment': [12.4, 14.233333333333333]},
  'conversations': [{'User': 'Take a look at the segment of {moment}. What is the color of the SUV in the parking lot?',
                     'Assistant': 'The color of the SUV in the parking lot is blue.'}]

The code in train_momentor.py creates a new format; however, the keys prototype and backbone do not exist.

import re
from tqdm import tqdm
from nltk.tokenize import sent_tokenize  # imports implied by the snippet

qa_types = ['qa_data', 'instance_qa_data', 'cross_segment_qa_data']
packed_instruction_data = []
for video_name in tqdm(instruction_data):
    for key in instruction_data[video_name]:
        for dialogue in instruction_data[video_name][key]:
            text_dialogue = dialogue['dialogue']
            backbone = dialogue['prototype']['backbone']
            if key in qa_types:
                backbone[0]['User'] = re.sub(r'(' + re.escape(sent_tokenize(text_dialogue[0]['User'])[0]) + r') +', '', backbone[0]['User'])
            packed_instruction_data.append({
                'id' : video_name,
                'data_type' : key,
                'moment' : dialogue['prototype']['variables'].get('moment', None),
                'click_position' : dialogue['prototype']['variables'].get('click_position', None),
                'instance_class' : dialogue['prototype']['variables'].get('instance_class', None),
                'SOURCE_CLIP' : dialogue['prototype']['variables'].get('SOURCE_CLIP', None),
                'content' : dialogue['prototype']['variables'].get('content', None),
                'conversations' : backbone,
            })

Lastly, the packed_instruction_data format is not actually used in train_momentor.py. I would appreciate it if you could help resolve these issues.

Query 2: Do we use only data of the types ['qa_data', 'instance_qa_data', 'cross_segment_qa_data'] to train the model and ignore the others?

Thanks!

Training Details

hi @loveofguoke @longqian-zju

I have the following queries on training the model.

  • Do we train the model with image-text or video-text pairs? If so, which datasets are used for this?
  • How do we train with GESM data? The converted data do not have a conversations key, but the source code expects this information. Moreover, after conversion the only keys present are id, data_type, and data, as used in convert_data_gesm.py.
  • I assume we first need to train with GESM data and then perform supervised instruction fine-tuning with Moment-10M. Please confirm.

It would be really helpful if you could share the information needed to train the model, along with the appropriate dataset format.

Thanks in advance!

Request for Release of Instance-Event Metric Data

Dear Maintainers,

I hope this message finds you well. I am reaching out to inquire about the instance-event metric data used in your project.

Your work has been a great source of inspiration for my own research, and having access to the instance-event metric data would significantly aid in understanding and replicating your results.

Are there any plans to release this data in the near future? I believe it would be an invaluable resource for the community and would greatly appreciate your consideration.

Thank you for your time and efforts on this project.

Best Regards

Benefits of the learnable continuous temporal encoding

Looking at the visualization, the learned encoding already appears highly linear, so why not directly use a fixed encoding, such as a cosine or linear mapping? Was an ablation study done on this?
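For reference, a fixed sinusoidal temporal encoding of the kind suggested above could look like the following illustrative sketch (not code from this repository):

import math
import torch

def fixed_temporal_encoding(timestamps, dim=256):
    # Fixed (non-learnable) sinusoidal encoding of normalized timestamps,
    # as a possible alternative to a learnable continuous temporal encoding.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)
    angles = timestamps.unsqueeze(-1) * freqs            # (..., dim // 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

enc = fixed_temporal_encoding(torch.tensor([0.0, 0.5, 1.0]))
print(enc.shape)  # torch.Size([3, 256])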

Data for Grounded Event-Sequence Modeling

Hi @longqian-zju, thanks for sharing this work. I was wondering whether you are using the same videos and annotations for Grounded Event-Sequence Modeling and for instruction tuning. The paper does not mention what data (and how much of it) you used for GESM.
