Awesome-Video-Datasets

  • Contributions are most welcome. If you have any suggestions or improvements, please create an issue or open a pull request.
  • Our group website: VIS Lab, University of Amsterdam.

Action Recognition

  • HOLLYWOOD2: Actions in Context (CVPR 2009)
    [Paper][Homepage]
    12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies

  • HMDB: A Large Video Database for Human Motion Recognition (ICCV 2011)
    [Paper][Homepage]
    51 classes, 7,000 clips

  • UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
    [Paper][Homepage]
    101 classes, 13,320 clips (see the loading sketch at the end of this section)

  • Sports-1M: Large-scale Video Classification with Convolutional Neural Networks
    [Paper][Homepage]
    1,000,000 videos, 487 classes

  • ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding (CVPR 2015)
    [Paper][Homepage]
    203 classes, an average of 137 untrimmed videos per class, 1.41 activity instances per video

  • MPII-Cooking: Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data (IJCV 2015)
    [Paper][Homepage]
    67 fine-grained activities, 59 composite activities, 14,105 clips, 273 videos

  • Kinetics
    [Kinetics-400/Kinetics-600/Kinetics-700/Kinetics-700-2020] [Homepage]
    400/600/700/700 classes, at least 400/600/600/700 clips per class

  • Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
    [Paper][Homepage]
    9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes

  • Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018)
    [Paper][Homepage]
    112 people, 4,000 paired videos, 157 action classes

  • 20BN-jester: The Jester Dataset: A Large-Scale Video Dataset of Human Gestures (ICCVW 2019)
    [Paper][Homepage]
    148,092 videos, 27 classes, 1,376 actors

  • Moments in Time Dataset: one million videos for event understanding (TPAMI 2019)
    [Paper][Homepage]
    over 1,000,000 labeled videos for 339 Moment classes; the average number of labeled videos per class is 1,757, with a median of 2,775

  • Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
    [Paper][Homepage]
    1.02 million videos, 313 action classes, 553,535 videos are annotated with more than one label and 257,491 videos are annotated with three or more labels

  • 20BN-SOMETHING-SOMETHING: The "something something" video database for learning and evaluating visual common sense
    [Paper][Homepage]
    100,000 videos across 174 classes

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes

  • HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021)
    [Paper][Homepage]
    27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships

  • MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding (ICCV 2019)
    [Paper][Homepage]
    36k video clips, 37 action classes; modalities: RGB, keypoints, accelerometer, gyroscope, orientation, Wi-Fi, pressure

  • LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities (ECCV 2020)
    [Paper][Homepage]
    RGB-D, 641 action classes, 11,781 action segments, 4.6M frames

  • NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis (CVPR 2016, TPAMI 2019)
    [Paper][Homepage]
    106 distinct subjects, more than 114,000 video samples, 8 million frames, 120 action classes

  • Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020)
    [Paper][Homepage]
    10K videos, 0.4M objects, 1.7M visual relationships

  • TITAN: Future Forecast using Action Priors (CVPR 2020)
    [Paper][Homepage]
    700 labeled video-clips, 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes

  • PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding (ACM Multimedia Workshop)
    [Paper][Homepage]
    1,076 long video sequences, 51 action categories, performed by 66 subjects in three camera views, 20,000 action instances, 5.4 million frames; modalities: RGB, depth, infrared, skeleton

  • HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
    [Paper][Homepage]
    HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories

  • Oops!: Predicting Unintentional Action in Video (CVPR 2020)
    [Paper][Homepage]
    20,338 videos, 7,368 annotated for training, 6,739 annotated for testing

  • RareAct: A video dataset of unusual interactions
    [Paper][Homepage]
    122 different actions, 7,607 clips, 905 videos, 19 verbs, 38 nouns

  • FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding (CVPR 2020)
    [Paper][Homepage]
    10 event categories, including 6 male events and 4 female events, 530 element categories

  • THUMOS: The THUMOS challenge on action recognition for videos “in the wild”
    [Paper][Homepage]
    101 actions, train: 13,000 temporally trimmed videos, validation: 2,100 temporally untrimmed videos with temporal annotations of actions, background: 3,000 relevant videos, test: 5,600 temporally untrimmed videos with withheld ground truth

  • MultiTHUMOS: Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos (IJCV 2017)
    [Paper][Homepage]
    400 videos, 38,690 annotations of 65 action classes, 10.5 action classes per video

  • TinyVIRAT: Low-resolution Video Action Recognition
    [Paper][Homepage]
    12,829 low-resolution videos, 26 classes, multi-label classification

  • UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles
    [Paper][Homepage]
    67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition

  • Hierarchical Action Search: Searching for Actions on the Hyperbole (CVPR 2020)
    [Paper][Homepage]
    Hierarchical-ActivityNet, Hierarchical-Kinetics, and Hierarchical-Moments, derived from ActivityNet, mini-Kinetics, and Moments-in-Time; provides action hierarchies and action splits for unseen action search
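
For quick experimentation with the classification datasets above, here is a minimal loading sketch using torchvision's built-in UCF101 wrapper. It assumes torchvision is installed and that the videos plus the official train/test split files have been downloaded; the paths below are placeholders.

```python
# A minimal sketch, assuming torchvision and a local copy of UCF101
# plus its official split files (paths are placeholders).
from torchvision.datasets import UCF101

dataset = UCF101(
    root="UCF-101/",                      # directory containing the .avi files
    annotation_path="ucfTrainTestlist/",  # official train/test split files
    frames_per_clip=16,                   # clip length in frames
    step_between_clips=16,                # non-overlapping clips
    fold=1,
    train=True,
)

video, audio, label = dataset[0]          # video: (T, H, W, C) uint8 tensor
print(video.shape, label)
```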

Video Classification

  • COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis (CVPR 2019)
    [Paper][Homepage]
    11,827 videos, 180 tasks, 12 domains, 46,354 annotated segments

  • VideoLT: Large-scale Long-tailed Video Recognition
    [Paper][Homepage]
    256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution

  • Youtube-8M: A Large-Scale Video Classification Benchmark
    [Paper][Homepage]
    8,000,000 videos, 4,000 visual entities (see the TFRecord-parsing sketch at the end of this section)

  • HVU: Large Scale Holistic Video Understanding (ECCV 2020)
    [Paper][Homepage]
    572k videos with 9 million annotations across the training, validation, and test sets, spanning 3,142 labels; semantic aspects cover scenes, objects, actions, events, attributes, and concepts

  • VLOG: From Lifestyle Vlogs to Everyday Interactions (CVPR 2018)
    [Paper][Homepage]
    114K video clips, 10.7K participants, Annotations: Hand/Semantic Object, Hand Contact State, Scene Classification

  • EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video
    [Paper][Homepage]
    23,574 videos (1,700 hours), each annotated at 6 Hz with 15 continuous evoked-expression labels, totalling 36.7 million annotations of viewer facial reactions

  • OmniSource Web Dataset: Omni-sourced Webly-supervised Learning for Video Recognition (ECCV 2020)
    [Paper][Dataset]
    web data related to the 200 classes of the Mini-Kinetics subset: 732,855 Instagram videos, 3,654,650 Instagram images, 3,050,880 Google images
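
YouTube-8M is distributed as TFRecord files of pre-extracted features rather than raw videos. Below is a minimal sketch of parsing the video-level records; the feature names follow the dataset's published schema, but treat them as assumptions and check the official starter code for your release.

```python
# A minimal sketch of parsing YouTube-8M video-level TFRecords
# (feature names are assumed from the published schema).
import tensorflow as tf

feature_spec = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
}

def parse(record):
    ex = tf.io.parse_single_example(record, feature_spec)
    ex["labels"] = tf.sparse.to_dense(ex["labels"])
    return ex

ds = tf.data.TFRecordDataset("train0000.tfrecord").map(parse)
for ex in ds.take(1):
    print(ex["id"].numpy(), ex["labels"].numpy(), ex["mean_rgb"].shape)
```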

Egocentric View

  • Charades-Ego: Actor and Observer: Joint Modeling of First and Third-Person Videos (CVPR 2018)
    [Paper][Homepage]
    112 people, 4,000 paired videos, 157 action classes

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes

  • HOMAGE: Home Action Genome: Cooperative Compositional Action Understanding (CVPR 2021)
    [Paper][Homepage]
    27 participants, 12 sensor types, 75 activities, 453 atomic actions, 1,752 synchronized sequences, 86 object classes, 29 relationship classes, 497,534 bounding boxes, 583,481 relationships

Video Object Segmentation

  • DAVIS: A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation (CVPR 2016)
    [Paper][Homepage]
    50 sequences, 3,455 annotated frames (a region-similarity sketch follows at the end of this section)

  • SegTrack v2: Video Segmentation by Tracking Many Figure-Ground Segments (ICCV 2013)
    [Paper][Homepage]
    1,000 frames with pixel-level annotations

  • UVO: Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation
    [Paper][Homepage]
    1200 videos, 108k frames, 12.29 objects per video

  • VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild (CVPR 2021)
    [Paper][Homepage]
    3,536 videos, 251,632 pixel-level labeled frames, 124 categories; pixel-level annotations are provided at 15 fps, with a complete shot lasting 5 seconds on average
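
DAVIS-style benchmarks score masks with the region similarity J, the mean intersection-over-union between predicted and ground-truth masks. A minimal sketch of the per-frame computation:

```python
# A minimal sketch of region similarity (J): intersection-over-union of a
# predicted and a ground-truth binary mask, as used by DAVIS-style benchmarks.
import numpy as np

def region_similarity(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter / union)

pred = np.zeros((4, 4), bool); pred[1:3, 1:3] = True   # 4 foreground pixels
gt = np.zeros((4, 4), bool); gt[1:4, 1:4] = True       # 9 foreground pixels
print(region_similarity(pred, gt))                     # 4 / 9 ≈ 0.444
```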

Object Detection

  • ImageNet VID
    [Paper][Homepage]
    30 categories, train: 3,862 video snippets, validation: 555 snippets

  • YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
    [Paper][Homepage]
    380,000 video segments about 19 s long, 5.6M bounding boxes, 23 object classes

  • Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations (CVPR 2021)
    [Paper][Homepage]
    15K annotated video clips supplemented with over 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes; each object carries a manually annotated 3D bounding box

  • DroneCrowd: Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark (CVPR 2021)
    [Paper][Homepage]
    112 video clips with 33,600 HD frames in various scenarios, 20,800 people trajectories with 4.8 million heads and several video-level attributes

  • Water detection through spatio-temporal invariant descriptors
    [Paper][Dataset]
    260 videos

Dynamic Texture Classification

  • Dynamic Texture: A New Large Scale Dynamic Texture Dataset with Application to ConvNet Understanding (ECCV 2018)
    [Paper][Homepage]
    over 10,000 videos

  • YUVL: Spacetime Texture Representation and Recognition Based on a Spatiotemporal Orientation Analysis (TPAMI 2012)
    [Paper][Homepage][Dataset]
    610 spacetime texture samples

  • UCLA: Dynamic Texture Recognition (CVPR 2001)
    [Paper][Dataset]
    76 dynamic textures

Group Activity Recognition

  • Volleyball: A Hierarchical Deep Temporal Model for Group Activity Recognition
    [Paper][Homepage]
    4,830 clips, 8 group activity classes, 9 individual actions

  • Collective: What are they doing?: Collective activity classification using spatio-temporal relationship among people
    [Paper][Homepage]
    5 different collective activities, 44 clips

Movie

  • HOLLYWOOD2: Actions in Context (CVPR 2009)
    [Paper][Homepage]
    12 classes of human actions, 10 classes of scenes, 3,669 clips, 69 movies

  • HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do
    [Paper][Homepage]
    10 movies spanning romance, drama, fantasy, adventure, and comedy

  • MPII-MD: A Dataset for Movie Description
    [Paper][Homepage]
    94 videos, 68,337 clips, 68,375 descriptions

  • MovieNet: A Holistic Dataset for Movie Understanding (ECCV 2020)
    [Paper][Homepage]
    1,100 movies, 1.1M characters with bounding boxes and identities, 42K scene boundaries, 2.5K aligned description sentences, 65K tags of place and action, and 92K tags of cinematic style

  • MovieQA: Story Understanding Benchmark (CVPR 2016)
    [Paper][Homepage]
    14,944 questions, 408 movies

  • Video Person-Clustering Dataset: Face, Body, Voice: Video Person-Clustering with Multiple Modalities
    [Paper][Homepage]
    multi-modal annotations (face, body and voice) for all primary and secondary characters from a range of diverse TV-shows and movies

  • MovieGraphs: Towards Understanding Human-Centric Situations from Videos (CVPR 2018)
    [Paper][Homepage]
    7,637 movie clips, 51 movies, annotations: scene, situation, description, graph (Character, Attributes, Relationship, Interaction, Topic, Reason, Time stamp)

  • Condensed Movies: Story Based Retrieval with Contextual Embeddings (ACCV 2020)
    [Paper][Homepage]
    33,976 captioned clips from 3,605 movies, 400K+ face-tracks, 8K+ labelled characters, 20K+ subtitles, densely pre-extracted features for each clip (RGB, Motion, Face, Subtitles, Scene)

360 Videos

  • Pano2Vid: Automatic Cinematography for Watching 360° Videos (ACCV 2016)
    [Paper][Homepage]
    20 out of 86 360° videos have labels for testing; 9,171 normal videos captured by humans for inference in training; topics: soccer, mountain climbing, parade, and hiking

  • Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos (CVPR 2017)
    [Paper][Homepage]
    342 360° videos, topics: basketball, parkour, BMX, skateboarding, and dance

  • YT-ALL: Self-Supervised Generation of Spatial Audio for 360° Video (NeurIPS 2018)
    [Paper][Homepage]
    1,146 videos, half of the videos are live music performances

  • YT360: Learning Representations from Audio-Visual Spatial Alignment (NeurIPS 2020)
    [Paper][Homepage]
    topics: musical performances, vlogs, sports, and others

Activity Localization

  • Hollywood2Tubes: Spot On: Action Localization from Pointly-Supervised Proposals
    [Paper][Dataset]
    train: 823 videos, 1,026 action instances, 16,411 annotations; test: 884 videos, 1,086 action instances, 15,835 annotations

  • DALY: Human Action Localization with Sparse Spatial Supervision
    [Paper][Homepage]
    10 actions, 3.3M frames, 8,133 clips

  • Action Completion: A temporal model for Moment Detection (BMVC 2018)
    [Paper][Homepage]
    completion moments of 16 actions from three datasets: HMDB, UCF101, RGBD-AC

  • RGBD-Action-Completion: Beyond Action Recognition: Action Completion in RGB-D Data (BMVC 2016)
    [Paper][Homepage]
    414 complete/incomplete object interaction sequences, spanning six actions and captured using an RGB-D camera

  • AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
    [Paper][Homepage]
    80 atomic visual actions in 430 15-minute video clips, 1.58M action labels with multiple labels per person occurring frequently

  • AVA-Kinetics: The AVA-Kinetics Localized Human Actions Video Dataset
    [Paper][Homepage]
    230k clips, 80 AVA action classes

  • HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
    [Paper][Homepage]
    HACS Clips: 1.5M annotated clips sampled from 504K untrimmed videos, HACS Segments: 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories

  • CommonLocalization: Localizing the Common Action Among a Few Videos (ECCV 2020)
    [Paper][Homepage]
    few-shot common action localization, revised ActivityNet 1.3 and THUMOS14

  • CommonSpaceTime: Few-Shot Transformation of Common Actions into Time and Space (CVPR 2021)
    [Paper][Homepage]
    revised AVA and UCF101-24

  • MUSES: Multi-shot Temporal Event Localization: a Benchmark (CVPR 2021)
    [Paper][Homepage]
    31,477 event instances, 716 video hours, 19 shots per instance, 176 shots per video, 25 categories, 3,697 videos

  • MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity Detection (WACV 2021)
    [Paper][Homepage]
    144 annotated hours covering 37 activity types, with bounding boxes for actors and props, captured by 38 RGB and thermal-IR cameras

  • TVSeries: Online Action Detection (ECCV 2016)
    [Paper][Homepage]
    27 episodes from 6 popular TV series, 30 action classes, 6,231 action instances (a temporal-IoU sketch follows)
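
The benchmarks in this section typically report mAP at several temporal intersection-over-union (tIoU) thresholds. A minimal sketch of the overlap criterion itself:

```python
# A minimal sketch of temporal IoU, the overlap criterion behind the mAP
# evaluation of temporal localization benchmarks such as THUMOS and ActivityNet.
def temporal_iou(seg_a, seg_b):
    """Segments as (start, end) in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

print(temporal_iou((12.0, 20.0), (15.0, 26.0)))  # 5 / 14 ≈ 0.357
```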

Video Captioning

  • VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events
    [Paper][Homepage]
    45,826 videos and their descriptions obtained by harvesting YouTube

  • MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)
    [Paper][Homepage]
    10K web video clips, 200K clip-sentence pairs (caption quality is typically reported with metrics such as BLEU; see the sketch at the end of this section)

  • VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (ICCV 2019)
    [Paper][Homepage]
    41,250 videos, 825,000 captions in both English and Chinese, over 206,000 English-Chinese parallel translation pairs

  • ActivityNet Captions: Dense-Captioning Events in Videos (ICCV 2017)
    [Paper][Homepage]
    20k videos, 100k sentences

  • ActivityNet Entities: Grounded Video Description
    [Paper][Homepage]
    14,281 annotated videos, 52k video segments with at least one noun phrase annotated per segment; augments the ActivityNet Captions dataset with 158k bounding boxes

  • VTW: Title Generation for User Generated Videos (ECCV 2016)
    [Paper][Homepage]
    18,100 video clips with an average duration of 1.5 minutes per clip

  • TGIF: A New Dataset and Benchmark on Animated GIF Description (CVPR 2016)
    [Paper][Homepage]
    100K animated GIFs from Tumblr and 120K natural language descriptions

  • Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
    [Paper][Homepage]
    9,848 annotated videos, 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes
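
Captioning benchmarks such as MSR-VTT and VaTeX report corpus-level BLEU, METEOR, and CIDEr, usually via the coco-caption toolkit. As an illustration of the underlying idea only, here is a sentence-level BLEU computation with NLTK; the captions are made up for the example.

```python
# A minimal sketch of sentence-level BLEU with NLTK (illustrative only;
# official evaluations use corpus-level metrics from the coco-caption toolkit).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is slicing a tomato in the kitchen".split(),
    "someone cuts a tomato on a cutting board".split(),
]
candidate = "a man slices a tomato".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```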

Video and Language

  • Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017)
    [Paper][Homepage]
    natural language descriptions of the target object

  • MPII-MD: A Dataset for Movie Description
    [Paper][Homepage]
    94 videos, 68,337 clips, 68,375 descriptions

  • Narrated Instruction Videos: Unsupervised Learning from Narrated Instruction Videos
    [Paper][Homepage]
    150 videos, 800,000 frames, five tasks: making coffee, changing a car tire, performing cardiopulmonary resuscitation (CPR), jump-starting a car, and repotting a plant

  • YouCook: A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching (CVPR 2013)
    [Paper][Homepage]
    88 YouTube cooking videos

  • HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019)
    [Paper][Homepage]
    136 million video clips sourced from 1.22M narrated instructional web videos, 23k different visual tasks

  • How2: A Large-scale Dataset for Multimodal Language Understanding (NeurIPS 2018)
    [Paper][Homepage]
    80,000 clips, word-level time alignments to the ground-truth English subtitles

  • Breakfast: The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities
    [Paper][Homepage]
    52 participants, 10 distinct cooking activities captured in 18 different kitchens, 48 action classes, 11,267 clips

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes

  • YouCook2: YouCookII Dataset
    [Paper][Homepage]
    2,000 long untrimmed videos, 89 cooking recipes; each recipe includes 5 to 16 steps, and each step is described with one sentence

  • QuerYD: A video dataset with textual and audio narrations (ICASSP 2021)
    [Paper][Homepage]
    1,400+ narrators, 200+ video hours, 70+ description hours

  • VIOLIN: A Large-Scale Dataset for Video-and-Language Inference (CVPR 2020)
    [Paper][Homepage]
    95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video

  • CrossTask: weakly supervised learning from instructional videos (CVPR 2019)
    [Paper][Homepage]
    4.7K videos, 83 tasks

Video Question Answering

  • SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
    [Paper][Homepage]
    10,080 in-the-wild videos with 62,535 annotated QA pairs

  • R2VQ: Recipe-to-Video Questions
    [Paper][Homepage]

Action Segmentation

  • A2D: Can Humans Fly? Action Understanding with Multiple Classes of Actors (CVPR 2015)
    [Paper][Homepage]
    3,782 videos; actors: adult, baby, bird, cat, dog, ball, and car; actions: climbing, crawling, eating, flying, jumping, rolling, running, and walking

  • J-HMDB: Towards understanding action recognition (ICCV 2013)
    [Paper][Homepage]
    31,838 annotated frames, 21 categories involving a single person in action: brush hair, catch, clap, climb stairs, golf, jump, kick ball, pick, pour, pull-up, push, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, wave

  • A2D Sentences & J-HMDB Sentences: Actor and Action Video Segmentation from a Sentence (CVPR 2018)
    [Paper][Homepage]
    A2D Sentences: 6,656 sentences, including 811 different nouns, 225 verbs, and 189 adjectives; J-HMDB Sentences: 928 sentences, including 158 different nouns, 53 verbs, and 23 adjectives

Audiovisual Learning

  • Audio Set: An ontology and human-labeled dataset for audio events (ICASSP 2017)
    [Paper][Homepage]
    632 audio event classes, 2,084,320 human-labeled 10-second sound clips (see the download sketch at the end of this section)

  • MUSIC: The Sound of Pixels (ECCV 2018)
    [Paper][Homepage]
    685 untrimmed videos, 11 instrument categories

  • AudioSet ZSL: Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos (WACV 2020)
    [Paper][Homepage]
    33 classes, 156,416 videos

  • Kinetics-Sound: Look, Listen and Learn (ICCV 2017)
    [Paper]
    34 action classes from Kinetics

  • EPIC-KITCHENS: Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (ECCV 2018, extended into TPAMI 2020)
    [Paper][Homepage]
    100 hours, 37 participants, 20M frames, 90K action segments, 700 variable length videos, 97 verb classes, 300 noun classes, 4053 action classes

  • SoundNet: Learning Sound Representations from Unlabeled Video (NIPS 2016)
    [Paper][Homepage]
    2+ million videos

  • AVE: Audio-Visual Event Localization in Unconstrained Videos (ECCV 2018)
    [Paper][Homepage]
    4,143 10-second videos, 28 audio-visual events

  • LLP: Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing (ECCV 2020)
    [Paper][Homepage]
    11,849 YouTube video clips, 25 event categories

  • VGG-Sound: A large scale audio-visual dataset
    [Paper][Homepage]
    200k videos, 309 audio classes

  • YouTube-ASMR-300K: Telling Left from Right: Learning Spatial Correspondence of Sight and Sound (CVPR 2020)
    [Paper][Homepage]
    300K 10-second video clips with spatial audio

  • XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020)
    [Paper][Homepage]
    4,754 untrimmed videos

  • VGG-SS: Localizing Visual Sounds the Hard Way (CVPR 2021)
    [Paper][Homepage]
    5K videos, 200 categories

  • VoxCeleb: Large-scale speaker verification in the wild
    [Paper][Homepage]
    a million ‘real-world’ utterances, over 7,000 speakers

  • EmoVoxCeleb: Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
    [Paper][Homepage]
    1,251 speakers

  • Speech2Gesture: Learning Individual Styles of Conversational Gesture (CVPR 2019)
    [Paper][Homepage]
    144 hours of person-specific video, 10 speakers

  • AVSpeech: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
    [Paper][Homepage]
    150,000 distinct speakers, 290k YouTube videos

  • LRW: Lip Reading in the Wild (ACCV 2016)
    [Paper][Homepage]
    1000 utterances of 500 different words

  • LRW-1000: LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild (FG 2019)
    [Paper][Homepage]
    718,018 video samples of 1,000 Mandarin words from 2,000+ individual speakers

  • LRS2: Deep Audio-Visual Speech Recognition (TPAMI 2018)
    [Paper][Homepage]
    Thousands of natural sentences from British television

  • LRS3-TED: a large-scale dataset for visual speech recognition
    [Paper][Homepage]
    thousands of spoken sentences from TED and TEDx videos

  • CMLR: A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading (ACM MM Asia 2019)
    [Paper][Homepage]
    102,072 spoken sentences from 11 speakers on China's national news program (CCTV)

  • Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021)
    [Paper][Homepage]
    1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV
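
Several of the datasets above, Audio Set in particular, are released as CSV lists of YouTube IDs with start/end times rather than as media files. Below is a minimal sketch of cutting one labelled 10-second segment; it assumes yt-dlp and ffmpeg are installed, and "VIDEO_ID" is a placeholder.

```python
# A minimal sketch of fetching one Audio Set segment, given a row from the
# released CSV (YouTube ID, start seconds, end seconds). Assumes yt-dlp and
# ffmpeg are on PATH; verify the flags against your installed versions.
import subprocess

def fetch_segment(ytid: str, start: float, end: float, out_wav: str) -> None:
    # Resolve a direct audio-stream URL for the video.
    url = subprocess.check_output(
        ["yt-dlp", "-g", "-f", "bestaudio", f"https://youtu.be/{ytid}"],
        text=True,
    ).strip()
    # Trim the labelled window and store it as 16 kHz mono wav.
    subprocess.run(
        ["ffmpeg", "-ss", str(start), "-t", str(end - start), "-i", url,
         "-ac", "1", "-ar", "16000", out_wav],
        check=True,
    )

fetch_segment("VIDEO_ID", 30.0, 40.0, "clip.wav")  # placeholder ID
```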

Repetition Counting

  • QUVA Repetition: Real-World Repetition Estimation by Div, Grad and Curl (CVPR 2018)
    [Paper][Homepage]
    100 videos

  • YTSegments: Live Repetition Counting (ICCV 2015)
    [Paper][Homepage]
    100 videos

  • UCFRep: Context-Aware and Scale-Insensitive Temporal Repetition Counting (CVPR 2020)
    [Paper][Homepage]
    526 videos

  • Countix: Counting Out Time: Class Agnostic Video Repetition Counting in the Wild (CVPR 2020)
    [Paper][Homepage]
    8,757 videos

  • Countix-AV & Extreme Countix-AV: Repetitive Activity Counting by Sight and Sound (CVPR 2021)
    [Paper][Homepage]
    1,863 videos in Countix-AV, 214 videos in Extreme Countix-AV (an evaluation sketch follows)
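
Repetition-counting papers, for example those evaluating on Countix, commonly report the count-normalised mean absolute error (MAE) and off-by-one (OBO) accuracy. A minimal evaluation sketch with made-up counts:

```python
# A minimal sketch of the two metrics commonly reported for repetition
# counting: count-normalised MAE and off-by-one (OBO) accuracy.
import numpy as np

def repetition_metrics(pred, true):
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    mae = np.mean(np.abs(pred - true) / true)   # error normalised by true count
    obo = np.mean(np.abs(pred - true) <= 1.0)   # prediction within one count
    return mae, obo

mae, obo = repetition_metrics([4, 7, 10], [5, 7, 12])
print(f"MAE={mae:.3f}  OBO={obo:.3f}")
```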

Video Indexing

  • MediaMill: The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia
    [Paper][Homepage]
    a manually annotated lexicon of 101 semantic concepts

Skill Determination

  • EPIC-Skills: Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination (CVPR 2018)
    [Paper][Homepage]
    3 tasks, 113 videos, 1000 pairwise ranking annotations

  • BEST: The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos (CVPR 2019)
    [Paper][Homepage]
    5 tasks, 500 videos, 13,000 pairwise ranking annotations

  • AQA-7: Action Quality Assessment Across Multiple Actions (WACV 2019)
    [Paper][Homepage]
    1,189 samples from 7 sports: 370 from single diving - 10m platform, 176 from gymnastic vault, 175 from big air skiing, 206 from big air snowboarding, 88 from synchronous diving - 3m springboard, 91 from synchronous diving - 10m platform and 83 from trampoline

  • AQA-MTL: What and how well you performed? A multitask approach to action quality assessment (CVPR 2019)
    [Paper][Homepage]
    1,412 fine-grained samples collected from 16 different events with various views

  • Olympic Scoring Dataset: Learning to score Olympic events (CVPR 2017 Workshop)
    [Paper][Homepage]
    doubles the existing MIT diving dataset from 159 to 370 samples and adds a newly collected gymnastic vault dataset of 176 samples

  • JIGSAWS: JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling
    [Paper][Homepage]
    surgical activities, 3 tasks: Suturing (S), Needle Passing (NP), and Knot Tying (KT); each video is annotated with multiple scores assessing different aspects of the performance (a Spearman-correlation sketch follows)
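
Skill determination and action quality assessment are usually evaluated with rank statistics: pairwise ranking accuracy for EPIC-Skills and BEST, and Spearman's rank correlation between predicted and judge-assigned scores for AQA-style datasets. A minimal sketch of the latter using SciPy; the scores below are made up for illustration.

```python
# A minimal sketch of the Spearman rank-correlation evaluation used for
# action quality assessment (scores are illustrative placeholders).
from scipy.stats import spearmanr

predicted = [81.2, 64.5, 92.0, 70.3]
ground_truth = [83.0, 61.0, 95.5, 72.0]

rho, _ = spearmanr(predicted, ground_truth)
print(f"Spearman rho = {rho:.3f}")
```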

Video Retrieval

  • TRECVID Challenge: TREC Video Retrieval Evaluation
    [Homepage]
    sources: YFCC100M, Flickr, etc.

  • Video Browser Showdown – The Video Retrieval Competition
    [Homepage]

  • TRECVID-VTT: TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval
    [Paper][Homepage]
    9,185 videos with captions

  • V3C - A Research Video Collection
    [Paper][Homepage]
    7,475 Vimeo videos, 1,082,657 short video segments

  • IACC: Creating a web-scale video collection for research
    [Paper][Homepage]
    4,600 Internet Archive videos

  • TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (ECCV 2020)
    [Paper][Homepage]
    108,965 queries on 21,793 videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal alignment (a Recall@K sketch follows)
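
Retrieval benchmarks report Recall@K: the fraction of queries whose ground-truth video appears among the top-K ranked results. A minimal sketch over a query-by-video similarity matrix with the correct match on the diagonal:

```python
# A minimal sketch of Recall@K for text-to-video retrieval; assumes a square
# similarity matrix where query i's ground-truth video is video i.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 5) -> float:
    """sim[i, j]: similarity of query i to video j."""
    ranks = np.argsort(-sim, axis=1)   # best match first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.random.rand(100, 100)         # random scores for illustration
print(f"R@5 = {recall_at_k(sim, 5):.3f}")
```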

Single Object Tracking

  • Lingual OTB99 & Lingual ImageNet Videos: Tracking by Natural Language Specification (CVPR 2017)
    [Paper][Homepage]
    natural language descriptions of the target object

  • OxUvA: Long-term Tracking in the Wild: A Benchmark (ECCV 2018)
    [Paper][Homepage]
    366 sequences spanning 14 hours of video

  • LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking
    [Paper][Homepage]
    1,400 sequences with more than 3.5M frames, each frame is annotated with a bounding box

  • TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild (ECCV 2018)
    [Paper][Homepage]
    30K videos with more than 14 million dense bounding box annotations, a new benchmark composed of 500 novel videos

  • ALOV300+: Visual Tracking: An Experimental Survey (TPAMI 2014)
    [Paper][Homepage][Dataset]
    315 videos

  • NUS-PRO: A New Visual Tracking Challenge (TPAMI 2015)
    [Paper][Homepage]
    365 image sequences

  • UAV123: A Benchmark and Simulator for UAV Tracking (ECCV 2016)
    [Paper][Homepage]
    123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective

  • OTB2013: Online Object Tracking: A Benchmark (CVPR 2013)
    [Paper][Homepage]
    50 video sequences

  • OTB2015: Object Tracking Benchmark (TPAMI 2015)
    [Paper][Homepage]
    100 video sequences (a success-metric sketch follows at the end of this section)

  • VOT Challenge
    [Homepage]
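
OTB-style single-object tracking is scored with success plots: the fraction of frames whose predicted box overlaps the ground truth above a threshold, swept over thresholds and summarised by the area under the curve. A minimal sketch:

```python
# A minimal sketch of the OTB-style success metric: per-frame IoU between
# predicted and ground-truth boxes, averaged over overlap thresholds.
import numpy as np

def iou(a, b):
    """Boxes as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))

pred = [(10, 10, 50, 50), (12, 12, 50, 50)]
gt = [(12, 12, 50, 50), (12, 12, 50, 50)]
print(f"success AUC = {success_auc(pred, gt):.3f}")
```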

Multiple Objects Tracking

  • MOT Challenge
    [Homepage]

  • VisDrone: Vision Meets Drones: A Challenge
    [Paper][Homepage]

  • TAO: A Large-Scale Benchmark for Tracking Any Object
    [Paper][Homepage]
    2,907 videos, 833 classes, 17,287 tracks

  • GMOT-40: A Benchmark for Generic Multiple Object Tracking
    [Paper][Homepage]
    40 carefully annotated sequences evenly distributed among 10 object categories

  • BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning (CVPR 2020)
    [Paper][Homepage]
    100K videos and 10 tasks

  • DroneCrowd: Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark (CVPR 2021)
    [Paper][Homepage]
    112 video clips with 33,600 HD frames in various scenarios, 20,800 people trajectories with 4.8 million heads and several video-level attributes

Video Relation Detection

  • KIEV: Interactivity Proposals for Surveillance Videos
    [Paper][Homepage]
    introduces the new task of spatio-temporal interactivity proposals

  • ImageNet-VidVRD: Video Visual Relation Detection
    [Paper][Homepage]
    1,000 videos, 35 common subject/object categories and 132 relationships

  • VidOR: Annotating Objects and Relations in User-Generated Videos
    [Paper][Homepage]
    10,000 videos selected from the YFCC100M collection, 80 object categories and 50 predicate categories

  • Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks (CVPR 2020)
    [Paper][Homepage]
    annotations for 180,049 videos from the Something-Something Dataset

  • Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (CVPR 2020)
    [Paper][Homepage]
    10K videos, 0.4M objects, 1.7M visual relationships

  • VidSitu: Visual Semantic Role Labeling for Video Understanding (CVPR 2021)
    [Paper][Homepage]
    29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds

Anomaly Detection

  • XD-Violence: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision (ECCV 2020)
    [Paper][Homepage]
    4,754 untrimmed videos

  • UCF-Crime: Real-world Anomaly Detection in Surveillance Videos
    [Paper][Homepage]
    1,900 videos

Highlight Detection

  • YouTube Highlights: Ranking Domain-specific Highlights by Analyzing Edited Videos
    [Paper][Homepage]
    six domain-specific categories: surfing, skating, skiing, gymnastics, parkour, and dog; each domain consists of around 100 videos, and the total accumulated time is 1,430 minutes

  • PHD2: Personalized Highlight Detection for Automatic GIF Creation
    [Paper][Homepage]
    the training set contains highlights from 12,972 users; the test set contains highlights from 850 users

  • TVSum: Summarizing web videos using titles (CVPR 2015)
    [Paper][Homepage]
    50 videos of various genres (e.g., news, how-to, documentary, vlog, egocentric) and 1,000 annotations of shot-level importance scores obtained via crowdsourcing (20 per video)

Video Memorability

  • Video Memorability: Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability (ECCV 2020)
    [Paper][Homepage]
    10,000 video clips, over 900,000 human memory annotations at different delay intervals and 50,000 captions describing the events

Pose Estimation

  • YouTube Pose: Personalizing Human Video Pose Estimation (CVPR 2016)
    [Paper][Homepage]
    50 videos, 5,000 annotated frames

  • JHMDB: Towards understanding action recognition (ICCV 2013)
    [Paper][Homepage]
    5,100 clips of 51 different human actions collected from movies or the Internet, 31,838 annotated frames in total

  • Penn Action: From Actemes to Action: A Strongly-supervised Representation for Detailed Action Understanding (ICCV 2013)
    [Paper][Homepage]
    2,326 video sequences of 15 different actions and human joint annotations for each sequence

  • PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes (RSS 2018)
    [Paper][Homepage]
    accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames

Person Re-identification

  • iLIDS-VID: Person re-identification by video ranking (ECCV 2014)
    [Paper][Homepage]
    600 image sequences of 300 distinct individuals

  • PRID-2011: Person Re-identification by Descriptive and Discriminative Classification
    [Paper][Homepage]
    400 image sequences for 200 identities from two non-overlapping cameras

  • MARS: A Video Benchmark for Large-Scale Person Re-Identification (ECCV 2016)
    [Paper][Homepage]
    1,261 identities, around 18,000 video sequences (a CMC-evaluation sketch follows)
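
Video re-identification benchmarks such as MARS report CMC rank-k accuracy (alongside mAP). A minimal sketch of the CMC computation over a query-to-gallery distance matrix:

```python
# A minimal sketch of CMC rank-k accuracy: for each query, check whether a
# gallery sequence with the same identity appears among the k closest matches.
import numpy as np

def cmc_rank_k(dist, query_ids, gallery_ids, k=1):
    """dist[i, j]: distance from query i to gallery sequence j."""
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)   # closest gallery sequences first
    hits = [(gallery_ids[order[i, :k]] == q).any()
            for i, q in enumerate(query_ids)]
    return float(np.mean(hits))

dist = np.random.rand(4, 10)           # toy distances for illustration
print(cmc_rank_k(dist, [0, 1, 2, 3], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], k=5))
```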

Physics

  • Real-world Flag & FlagSim: Cloth in the Wind: A Case Study of Physical Measurement through Simulation (CVPR 2020)
    [Paper][Homepage]
    Real-world Flag: 2.7K train and 1.3K video clips; FlagSim: 1,000 mesh sequences, 14,000 training examples
