momentor's Introduction

Momentor (ICML 2024)

The official repository of the paper Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning.

Momentor Overview

Momentor is a Video-LLM designed for fine-grained comprehension and localization in videos. It is composed of a frame encoder, a linear projection layer, a Temporal Perception Module (TPM), and a Large Language Model (LLM). We carefully design the TPM to improve fine-grained temporal modeling and representation. The architecture and training of Momentor are shown in the figure below.
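As a rough illustration, the flow of data through these components could be sketched as below. This is a hypothetical PyTorch-style sketch with placeholder module choices (the Identity encoder and the GRU stand in for the real frame encoder and TPM), not the actual implementation; see the paper for the real design.

import torch
import torch.nn as nn

class MomentorSketch(nn.Module):
    """Hypothetical sketch of Momentor's data flow; internals are placeholders."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.frame_encoder = nn.Identity()             # stands in for a pretrained frame encoder
        self.projection = nn.Linear(vis_dim, llm_dim)  # linear layer into the LLM embedding space
        self.tpm = nn.GRU(llm_dim, llm_dim, batch_first=True)  # placeholder for the Temporal Perception Module

    def forward(self, frames):                # frames: (batch, num_frames, vis_dim)
        feats = self.frame_encoder(frames)    # per-frame visual features
        tokens = self.projection(feats)       # project features into LLM token space
        tokens, _ = self.tpm(tokens)          # mix temporal information across frames
        return tokens                         # visual tokens to interleave with text for the LLM

tokens = MomentorSketch()(torch.randn(1, 8, 1024))
print(tokens.shape)  # torch.Size([1, 8, 4096])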

Installation

Clone our repository and create a conda environment:

git clone <repository_url>
cd Momentor/momentor
conda create --name=momentor python=3.10
conda activate momentor
pip install -r requirements.txt

Training

For training instructions, check out train_momentor.md.

Moment-10M

We present Moment-10M, a large-scale video instruction dataset with segment-level annotations. We use videos from YTTemporal-1B to construct Moment-10M and propose an automatic data generation engine to extract instance and event information from these videos and generate segment-level instruction-following data. We meticulously design 5 single-segment tasks and 3 cross-segment tasks, which enable Video-LLMs to perform comprehensive segment-level reasoning.

We are releasing our Moment-10M dataset; you can download it from the following links: part1, part2.

You can also download the data for Grounded Event-Sequence Modeling here: GESM.

After downloading and extracting the dataset to obtain the data files, you can use convert_data.py to transform the data into a text dialogue format and download_videos.py to download the corresponding video files. The usage for these scripts is as follows:

python convert_data.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

  • --source_path: The path to the input data file that needs to be converted.
  • --target_path: The path where the converted file will be saved.
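For intuition, the conversion presumably turns the templated dialogues (which, as the data sample quoted in the issues section below shows, contain {moment} placeholders alongside a variables field with segment boundaries) into plain text dialogues. A minimal hypothetical sketch of such a transformation, not the actual convert_data.py:

import json

def convert_entry(entry):
    # Hypothetical: substitute the {moment} placeholder with concrete timestamps.
    start, end = entry['variables']['moment']
    moment_text = f'{start:.1f}s - {end:.1f}s'
    return [{role: text.replace('{moment}', moment_text) for role, text in turn.items()}
            for turn in entry['conversations']]

entry = {
    'variables': {'moment': [0.0, 12.4]},
    'conversations': [{'User': 'Please watch the clip of {moment}. What color is the car?',
                       'Assistant': 'The car is black.'}],
}
print(json.dumps(convert_entry(entry), indent=2))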
python download_videos.py --source_path <path_to_data_file> --video_path <path_to_store_videos>

Parameters:

  • --source_path: The path to the input data file containing identifiers for the videos.
  • --video_path: The path where the downloaded video files will be stored.
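Since the identifiers appear to be YouTube video IDs (Moment-10M is built on YTTemporal-1B, which is sourced from YouTube), a stand-in for download_videos.py could simply shell out to yt-dlp. This is an illustrative sketch, not the repository's script:

import subprocess
from pathlib import Path

def download_video(video_id, video_dir):
    # Illustrative only: fetch one video by its (assumed) YouTube ID via yt-dlp.
    out_path = Path(video_dir) / f'{video_id}.mp4'
    subprocess.run(
        ['yt-dlp', '-f', 'mp4', '-o', str(out_path),
         f'https://www.youtube.com/watch?v={video_id}'],
        check=True,
    )

download_video('jS2fE622RMA', './videos')  # example ID taken from the dataset sample below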

For GESM data extraction, use convert_data_gesm.py as follows:

python convert_data_gesm.py --source_path <path_to_data_file> --target_path <path_to_converted_file>

Parameters:

  • --source_path: The path to the input data file that needs to be converted.
  • --target_path: The path where the converted file will be saved.

Citation

If you find our work useful in your research, please consider giving this repository a star and citing our paper as follows:

@misc{qian2024momentor,
      title={Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning}, 
      author={Long Qian and Juncheng Li and Yu Wu and Yaobo Ye and Hao Fei and Tat-Seng Chua and Yueting Zhuang and Siliang Tang},
      year={2024},
      eprint={2402.11435},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgment

Thanks to the following open-source projects:


momentor's Issues

load_video function

Hi @loveofguoke @longqian-zju,

Thank you for your interesting work. I have been really excited about it!

When I run your training script, I encounter this error:

ModuleNotFoundError: No module named 'momentor.eval'

Can you help me locate where the momentor.eval folder is?

Thank you so much!

Checkpoints & Evaluation Code

Hi there

Thanks for sharing your work!

I wonder if you could share the pre-trained model and the evaluation code for downstream tasks to ensure reproducibility.

Thank you.

Correct dataset format to train the model?

Hi @longqian-zju , @loveofguoke

I am trying to train the model; however, the dataset before and after conversion has different formats, and both formats seem incompatible with training the model.

data_0 = json.load(open("Moment-10M_0.json"))
data_0["jS2fE622RMA"]["qa_data"]

[{'id': 'jS2fE622RMA',
  'data_type': 'qa_data',
  'variables': {'moment': [0.0, 12.4]},
  'conversations': [{'User': 'Please watch the clip of {moment}. What color is the car with a black stripe?',
                     'Assistant': 'The car with a black stripe is a black car.'}],
  'clip_similarity': 0.2374267578125},
 {'id': 'jS2fE622RMA',
  'data_type': 'qa_data',
  'variables': {'moment': [12.4, 14.233333333333333]},
  'conversations': [{'User': 'Take a look at the segment of {moment}. What is the color of the SUV in the parking lot?',
                     'Assistant': 'The color of the SUV in the parking lot is blue.'}]

The code in train_momentor.py creates a new format; however, the keys prototype and backbone do not exist.

import re
from tqdm import tqdm
from nltk.tokenize import sent_tokenize  # imports implied by the snippet

qa_types = ['qa_data', 'instance_qa_data', 'cross_segment_qa_data']
packed_instruction_data = []
for video_name in tqdm(instruction_data):
    for key in instruction_data[video_name]:
        for dialogue in instruction_data[video_name][key]:
            text_dialogue = dialogue['dialogue']
            backbone = dialogue['prototype']['backbone']
            if key in qa_types:
                backbone[0]['User'] = re.sub(r'(' + re.escape(sent_tokenize(text_dialogue[0]['User'])[0]) + r') +', '', backbone[0]['User'])
            packed_instruction_data.append({
                'id' : video_name,
                'data_type' : key,
                'moment' : dialogue['prototype']['variables'].get('moment', None),
                'click_position' : dialogue['prototype']['variables'].get('click_position', None),
                'instance_class' : dialogue['prototype']['variables'].get('instance_class', None),
                'SOURCE_CLIP' : dialogue['prototype']['variables'].get('SOURCE_CLIP', None),
                'content' : dialogue['prototype']['variables'].get('content', None),
                'conversations' : backbone,
            })

Lastly, the packed_instruction_data format is not actually used in train_momentor.py. I would appreciate it if you could help resolve these issues.

Query 2: Do we use only data of the types ['qa_data', 'instance_qa_data', 'cross_segment_qa_data'] to train the model and ignore the others?

Thanks!

Training Details

hi @loveofguoke @longqian-zju

I have the following queries on training the model.

  • Do we train the model with image-text or video-text pairs? If so, which datasets are used for this?
  • How do we train with GESM data? The converted data do not have a conversations key, but the source code expects this information. Moreover, after conversion the only keys present are id, data_type, and data, as used in convert_data_gesm.py.
  • I assume we first need to train with GESM data and then perform supervised instruction fine-tuning with Moment-10M. Please confirm.

It would be really helpful if you could share the information needed to train the model, along with the appropriate dataset format.

Thanks in advance!

Request for Release of Instance-Event Metric Data

Dear Maintainers,

I hope this message finds you well. I am reaching out to inquire about the instance-event metric data used in your project.

Your work has been a great source of inspiration for my own research, and having access to the instance-event metric data would significantly aid in understanding and replicating your results.

Are there any plans to release this data in the near future? I believe it would be an invaluable resource for the community and would greatly appreciate your consideration.

Thank you for your time and efforts on this project.

Best Regards

Benefits of the learnable continuous temporal encoding

Looking at the visualization, the learned encoding already appears highly linear, so why not directly use a fixed encoding, such as a cosine or linear mapping? Was an ablation study done on this?
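For reference, a fixed sinusoidal temporal encoding of the kind suggested above could look like the following illustrative sketch (not code from this repository):

import math
import torch

def fixed_temporal_encoding(timestamps, dim=256):
    # Fixed (non-learnable) sinusoidal encoding of normalized timestamps,
    # as a possible alternative to a learnable continuous temporal encoding.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)
    angles = timestamps.unsqueeze(-1) * freqs            # (..., dim // 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

enc = fixed_temporal_encoding(torch.tensor([0.0, 0.5, 1.0]))
print(enc.shape)  # torch.Size([3, 256])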

Data for Grounded Event-Sequence Modeling

Hi @longqian-zju, thanks for sharing this work. I was wondering whether you are using the same videos and annotations for Grounded Event-Sequence Modeling and for instruction tuning. The paper does not mention what data (and how much of it) you used for GESM.
