
video-dataset-loading-pytorch's Introduction

Efficient Video Dataset Loading and Augmentation in PyTorch

Author: Raivo Koot
Documentation: https://video-dataset-loading-pytorch.readthedocs.io/en/latest/VideoDataset.html
If you find the code useful, please star the repository.

If you are completely unfamiliar with loading datasets in PyTorch using torch.utils.data.Dataset and torch.utils.data.DataLoader, I recommend getting familiar with these first.

In a Nutshell

Video-Dataset-Loading-Pytorch provides the lowest entry barrier for setting up deep learning training loops on video data. It makes working with video datasets easy, accessible, and efficient. It only requires you to have your video dataset in a certain format on disk and takes care of the rest. It has no complicated dependencies and supports native torchvision video data augmentation.

Overview: This small library provides a single class, VideoFrameDataset.

The VideoFrameDataset class (an implementation of torch.utils.data.Dataset) serves to easily, efficiently and effectively load video samples from video datasets in PyTorch.

  1. Easily because this dataset class can be used with custom datasets with minimum effort and no modification. The class merely expects the video dataset to have a certain structure on disk and expects a .txt annotation file that enumerates each video sample. Details on this can be found below. Pre-made annotation files and preparation scripts are also provided for Kinetics 400, Something Something V2 and Epic Kitchens 100.
  2. Efficiently because the video loading pipeline that this class implements is very fast. This minimizes GPU waiting time during training by eliminating CPU input bottlenecks, which can slow down training severalfold.
  3. Effectively because the implemented sampling strategy for video frames is very representative. Training on the entire sequence of video frames (often several hundred) is too memory- and compute-intensive. Therefore, this implementation samples frames evenly from the video (sparse temporal sampling), so that the loaded frames represent every part of the video, with support for arbitrary and differing video lengths within the same dataset. This approach has been shown to be very effective and is taken from "Temporal Segment Networks (ECCV 2016)", with modifications.

In conjunction with PyTorch's DataLoader, the VideoFrameDataset class returns video batch tensors of shape BATCH x FRAMES x CHANNELS x HEIGHT x WIDTH (when the ImglistToTensor() transform described in Section 5 is used; without a transform, each sample is a list of PIL images).
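For illustration, a minimal sketch of the resulting batch shapes, assuming the dataset is constructed as in the quick demo below but with transform=ImglistToTensor() (shapes are indicative):

from torch.utils.data import DataLoader

# `dataset` constructed as in the quick demo below, but with transform=ImglistToTensor();
# with num_segments=5 and frames_per_segment=1, each sample contains 5 frames.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

video_batch, labels = next(iter(loader))
print(video_batch.shape)  # torch.Size([4, 5, 3, HEIGHT, WIDTH]) -- BATCH x FRAMES x CHANNELS x HEIGHT x WIDTH
print(labels.shape)       # torch.Size([4])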

For a demo, visit demo.py.

QuickDemo (demo.py)

import os
import matplotlib.pyplot as plt
from video_dataset import VideoFrameDataset

root = os.path.join(os.getcwd(), 'demo_dataset')  # folder in which all videos lie in a specific structure
annotation_file = os.path.join(root, 'annotations.txt')  # a row for each video sample as: VIDEO_PATH START_FRAME END_FRAME CLASS_ID

""" DEMO 1 WITHOUT IMAGE TRANSFORMS """
dataset = VideoFrameDataset(
    root_path=root,
    annotationfile_path=annotation_file,
    num_segments=5,
    frames_per_segment=1,
    imagefile_template='img_{:05d}.jpg',
    transform=None,
    test_mode=False
)

sample = dataset[0]  # take first sample of dataset 
frames = sample[0]   # list of PIL images
label = sample[1]    # integer label

for image in frames:
    plt.imshow(image)
    plt.title(label)
    plt.show()
    plt.pause(1)


Table of Contents

  1. Requirements
  2. Custom Dataset
  3. Video Frame Sampling Method
  4. Alternate Video Frame Sampling Methods
  5. Using VideoFrameDataset for training
  6. Allowing Multiple Labels per Sample
  7. Conclusion
  8. Kinetics 400 & Something Something V2 & EPIC-KITCHENS-100
  9. Upcoming Features
  10. Acknowledgements

1. Requirements

# Without these three, VideoFrameDataset will not work.
torchvision >= 0.8.0
torch >= 1.7.0
python >= 3.6

2. Custom Dataset

(This description explains how to use custom datasets where each sample has a single class label. If you want to know how to use a dataset where a sample can have more than one class label, read this section anyway and then read Section 6 below.)

To use any dataset, two conditions must be met.

  1. The video data must be supplied as RGB frames, each frame saved as an image file. Each video must have its own folder, in which the frames of that video lie. The frames of a video inside its folder must be named uniformly with consecutive indices such as img_00001.jpg ... img_00120.jpg, if there are 120 frames. Indices can start at zero or any other number and the exact file name template can be chosen freely. The filename template for frames in this example is "img_{:05d}.jpg" (python string formatting, specifying 5 digits after the underscore), and must be supplied to the constructor of VideoFrameDataset as a parameter. Each video folder must lie inside some root folder.
  2. To enumerate all video samples in the dataset and their required metadata, a .txt annotation file must be manually created that contains a row for each video clip sample in the dataset. The training, validation, and testing datasets must have separate annotation files. Each row must be a space-separated list that contains VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX. The VIDEO_PATH of a video sample should be provided without the root prefix of this dataset.

This example project demonstrates this using a dummy dataset inside of demo_dataset/, which is the root dataset folder of this example. The folder structure looks as follows:

demo_dataset
│
├───annotations.txt
├───jumping # arbitrary class folder naming
│       ├───0001  # arbitrary video folder naming
│       │     ├───img_00001.jpg
│       │     .
│       │     └───img_00017.jpg
│       └───0002
│             ├───img_00001.jpg
│             .
│             └───img_00018.jpg
│
└───running # arbitrary folder naming
        ├───0001  # arbitrary video folder naming
        │     ├───img_00001.jpg
        │     .
        │     └───img_00015.jpg
        └───0002
              ├───img_00001.jpg
              .
              └───img_00015.jpg

 

The accompanying annotation .txt file contains the following rows (PATH, START_FRAME, END_FRAME, LABEL_ID)

jumping/0001 1 17 0
jumping/0002 1 18 0
running/0001 1 15 1
running/0002 1 15 1

Another annotations file that uses multiple clips from each video could be

jumping/0001 1 8 0
jumping/0001 5 17 0
jumping/0002 1 18 0
running/0001 10 15 1
running/0001 5 10 1
running/0002 1 15 1

(END_FRAME is inclusive)

Another, simpler, example of the way your dataset's RGB frames can be organized on disk is the following:

demo_dataset
│
├───annotations.txt
└───rgb 
     ├───video_1
     │     ├───img_00001.jpg
     │     .
     │     └───img_00017.jpg
     ├───video_2
     │     ├───img_00001.jpg
     │     .
     │     └───img_00044.jpg
     └───video_3
           ├───img_00001.jpg
           .
           └───img_00023.jpg

 

The accompanying annotation .txt file contains the following rows (PATH, START_FRAME, END_FRAME, LABEL_ID)

video_1 1 17 1
video_2 1 44 0
video_3 1 23 0

Instantiating a VideoFrameDataset with the root_path parameter pointing to demo_dataset/rgb/, the annotationfile_path parameter pointing to the annotation file demo_dataset/annotations.txt, and the imagefile_template parameter set to "img_{:05d}.jpg" is all it takes to start using the VideoFrameDataset class.
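If you prefer not to write the annotation file by hand, here is a minimal sketch of a script that generates one for the simpler layout above. It assumes frames are numbered from 1 following the img_{:05d}.jpg template, and the CLASS_IDS mapping is a hypothetical lookup you would define yourself:

import os

root = 'demo_dataset/rgb'
CLASS_IDS = {'video_1': 1, 'video_2': 0, 'video_3': 0}  # hypothetical mapping: video folder -> label id

with open('demo_dataset/annotations.txt', 'w') as f:
    for video_folder in sorted(os.listdir(root)):
        frame_files = os.listdir(os.path.join(root, video_folder))
        start_frame = 1                # frames are named img_00001.jpg, img_00002.jpg, ...
        end_frame = len(frame_files)   # END_FRAME is inclusive
        label = CLASS_IDS[video_folder]
        f.write(f'{video_folder} {start_frame} {end_frame} {label}\n')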

3. Video Frame Sampling Method

When loading a video, only a number of its frames are loaded. They are chosen in the following way:

  1. The frame index range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS even segments. From each segment, a random start index is sampled, from which FRAMES_PER_SEGMENT consecutive indices are loaded. This results in NUM_SEGMENTS*FRAMES_PER_SEGMENT chosen indices, whose frames are loaded as PIL images, put into a list, and returned when calling dataset[i]. A rough sketch of this index arithmetic is shown below.
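For intuition, here is a rough sketch of the index arithmetic (not the library's exact internal code), for a clip with START_FRAME=1, END_FRAME=30, NUM_SEGMENTS=3 and FRAMES_PER_SEGMENT=2:

import random

start_frame, end_frame = 1, 30      # END_FRAME is inclusive
num_segments, frames_per_segment = 3, 2

num_frames = end_frame - start_frame + 1          # 30 frames in total
segment_length = num_frames // num_segments       # 10 frames per segment

indices = []
for s in range(num_segments):
    segment_start = start_frame + s * segment_length
    # random start within the segment, leaving room for the consecutive frames
    offset = random.randint(0, segment_length - frames_per_segment)
    indices.extend(segment_start + offset + i for i in range(frames_per_segment))

print(indices)  # e.g. [4, 5, 13, 14, 27, 28] -- frames spread evenly across the clip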

4. Alternate Video Frame Sampling Methods

If you do not want to use sparse temporal sampling and instead want to sample a single N-frame continuous clip from a video, this is possible. Set NUM_SEGMENTS=1 and FRAMES_PER_SEGMENT=N. Because VideoFrameDataset chooses a random start index per segment and takes FRAMES_PER_SEGMENT consecutive frames from it, this results in a single N-frame continuous clip per video that starts at a random index. An example of this is in demo.py, and a short sketch is shown below.
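For example, to sample one random 9-frame continuous clip per video (other parameters as in the quick demo above):

dataset = VideoFrameDataset(
    root_path=root,
    annotationfile_path=annotation_file,
    num_segments=1,         # a single segment ...
    frames_per_segment=9,   # ... of 9 consecutive frames = one continuous 9-frame clip
    imagefile_template='img_{:05d}.jpg',
    transform=None,
    test_mode=False
)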

5. Using VideoFrameDataset for training

As demonstrated in demo.py, we can use PyTorch's torch.utils.data.DataLoader class with VideoFrameDataset to take care of shuffling, batching, and more. To turn the lists of PIL images returned by VideoFrameDataset into tensors, the transform video_dataset.ImglistToTensor() can be supplied as the transform parameter to VideoFrameDataset. This turns a list of N PIL images into a batch of images/frames of shape N x CHANNELS x HEIGHT x WIDTH. We can further chain preprocessing and augmentation functions that act on batches of images onto the end of ImglistToTensor(), as seen in demo.py

As of torchvision 0.8.0, all torchvision transforms can now also operate on batches of images, and they apply deterministic or random transformations on the batch identically on all images of the batch. Because a single video-tensor (FRAMES x CHANNELS x HEIGHT x WIDTH) has the same shape as an image batch tensor (BATCH x CHANNELS x HEIGHT x WIDTH), any torchvision transform can be used here to apply video-uniform preprocessing and augmentation.
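A condensed sketch of such a setup (the resize/crop/normalization values here are illustrative choices, not requirements; see demo.py for the full version):

from torch.utils.data import DataLoader
from torchvision import transforms
from video_dataset import VideoFrameDataset, ImglistToTensor

preprocess = transforms.Compose([
    ImglistToTensor(),           # list of PIL images -> FRAMES x CHANNELS x HEIGHT x WIDTH tensor
    transforms.Resize(299),      # resize the shorter side of every frame to 299
    transforms.CenterCrop(299),  # center-crop every frame to 299x299
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset = VideoFrameDataset(
    root_path=root,
    annotationfile_path=annotation_file,
    num_segments=5,
    frames_per_segment=1,
    imagefile_template='img_{:05d}.jpg',
    transform=preprocess,
    test_mode=False
)

dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2, pin_memory=True)

for video_batch, labels in dataloader:
    # video_batch: BATCH x FRAMES x CHANNELS x HEIGHT x WIDTH, labels: BATCH
    ...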

REMEMBER:
PyTorch transforms are applied to individual dataset samples (in this case, a list of PIL images of a video, or a video-frame tensor after ImglistToTensor()) before batching. So, any transform used here must expect its input to be a frame tensor of shape FRAMES x CHANNELS x HEIGHT x WIDTH, or a list of PIL images if ImglistToTensor() is not used.

6. Allowing Multiple Labels per Sample

Your dataset labels might be more complicated than a single label id per sample. For example, in the EPIC-KITCHENS dataset each video clip has a verb class, noun class, and action class, so each sample is associated with three label ids. To accommodate datasets where a sample can have N integer labels, annotations.txt files can be used where each row is the space-separated list PATH, FRAME_START, FRAME_END, LABEL_1_ID, ..., LABEL_N_ID, instead of PATH, FRAME_START, FRAME_END, LABEL_ID. The VideoFrameDataset class handles this type of annotation file too, without changing anything apart from the rows in your annotations.txt.

The annotations.txt file for a dataset where multiple clip samples can come from the same video and each sample has three labels, would have rows like PATH, START_FRAME, END_FRAME, LABEL1, LABEL2, LABEL3 as seen below

jumping/0001 1 8 0 2 1
jumping/0001 5 17 0 10 3
jumping/0002 1 18 0 5 3
running/0001 10 15 1 3 3
running/0001 5 10 1 1 0
running/0002 1 15 1 12 4

When you use torch.utils.data.DataLoader with VideoFrameDataset to retrieve your batches during training, the DataLoader no longer returns batches as a ( (BATCH x FRAMES x CHANNELS x HEIGHT x WIDTH), (BATCH) ) tuple, where the second item is just a list/tensor of the batch's labels. Instead, the second item is replaced with the tuple ( (BATCH), ..., (BATCH) ), where the first BATCH-sized list gives label_1 for the whole batch and the last BATCH-sized list gives label_n for the whole batch.
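Roughly, a training loop for the three-label example above would then unpack batches like this (sketch only):

for video_batch, (labels1, labels2, labels3) in dataloader:
    # video_batch: BATCH x FRAMES x CHANNELS x HEIGHT x WIDTH
    # labels1, labels2, labels3: one BATCH-sized tensor each (e.g. verb, noun, action ids)
    ...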

A demo of this can be found at the end in demo.py. It uses the dummy dataset in directory demo_dataset_multilabel.

7. Conclusion

A proper, code-based explanation of how to use VideoFrameDataset for training is provided in demo.py.

8. Kinetics 400 & Something Something V2 & EPIC-KITCHENS-100

After you have read Sections 1 to 7: this repository also contains easy pre-made conversion scripts and annotation files to get you instantly started with the Kinetics 400, Something Something V2, and EPIC-KITCHENS-100 datasets. To get started with any of them, read the README inside the Kinetics400, SomethingSomethingV2, or EpicKitchens100 directory.

9. Upcoming Features

  • Include compatible annotation files for common datasets, such as Something-Something-V2, EPIC-KITCHENS-100 and Kinetics, so that users do not need to spend their own time converting those datasets' annotation files to be compatible with this repository.
  • Add a demo for sampling a single continuous-frame clip from videos.
  • Add support for arbitrary labels that are more than just a single integer.
  • Add support for specifying START_FRAME and END_FRAME for a video instead of NUM_FRAMES.
  • Improve the handling of edge cases where NUM_SEGMENTS*FRAMES_PER_SEGMENT (or similar) might be larger than the number of frames in a video. (A warning message is printed for now.)
  • Clean up some of the internal code, which is still very messy and was taken from the codebase acknowledged below.
  • Create a version of this implementation that uses OpenCV instead of PIL for frame loading, so that you can use Albumentation transforms instead of Torchvision transforms.

10. Acknowledgements

We thank the authors of TSN for their codebase, from which we took VideoFrameDataset and adapted it for general use and compatibility.

@InProceedings{wang2016_TemporalSegmentNetworks,
    title={Temporal Segment Networks: Towards Good Practices for Deep Action Recognition},
    author={Limin Wang and Yuanjun Xiong and Zhe Wang and Yu Qiao and Dahua Lin and
            Xiaoou Tang and Luc {Van Gool}},
    booktitle={The European Conference on Computer Vision (ECCV)},
    year={2016}
}

video-dataset-loading-pytorch's People

Contributors

pifry, raivokoot


video-dataset-loading-pytorch's Issues

Problems with batch size

Hello,
first of all, thank you for this nice video DataLoader.

I want to use it in my project, but I am currently running into problems.
As you described, I preprocessed my videos into frames and created the .txt file.
When I now try to load the data in my project, I get a RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [125, 5, 3, 224, 224]
The 125 is my batch size, 5 my num_segments, and so on.

I tried to reshape my video batch like this:
batch_size, frames, channels, height, width = video.shape
video = video.reshape(batch_size * frames, channels, height, width)
But then I get problems with my labels in the batch:
ValueError: Target size (torch.Size([64, 1])) must be the same as input size (torch.Size([320, 1]))

Do you know how to fix it, or did I do something wrong?

Here's the code part I am currently using:

def train_dataloader(self):
    preprocess = Compose([
        ImglistToTensor(),
        RandomResizedCrop(224, scale=(0.8, 1.0)),
        RandomHorizontalFlip(),
        RandomRotation(degrees=15),
        ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
        Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    dataset = VideoFrameDataset(
        root_path=self.video_path_prefix,
        annotationfile_path=self.annotation_file_train,
        num_segments=5,
        frames_per_segment=1,
        imagefile_template='frame_{:04d}.jpg',
        transform=preprocess,
        test_mode=False
    )
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True,
        num_workers=self.num_worker,
        pin_memory=True
    )
    return loader

Thanks in advance :)

Support iterable / webdataset

What would be required to make this work for very large video sets? Would it be some integration with web dataset?

Assigning values from data loader in training loop causing error

First of all, I will apologize if this does not belong here. I am new to ML and Pytorch, so I still have a lot to learn. I am going to paste an image of my stack trace to give a feel for what the error is saying and then do some explanation of what I am seeing on my end.

(screenshot of the stack trace omitted)

For starters, this issue appears to only be on one case or very few cases. I say that because when I had shuffle=True in my dataloader, I would sometimes make it almost the whole way through the loop and crash towards the end. Other times it will crash the first few times through the loop. I have since turned shuffle to false to see if I could gain more information from where this was originating, instead of having my problem moving around on me. I would assume it is irrelevant to the problem, but it is on the 6th round through when shuffle is off.

After doing so, I entered debug mode and began to step into functions. After quite a few steps in and looking at values in the debug window, I don't see an issue so far. (There is definitely a possibility that I am missing something.)

If I step over (rather than into) the line "for video_batch, labels in train_loader:" in the debugger, it takes me out of my training function (where this line is contained) and then ends up catching at the line "if __name__ == '__main__': main()" in my main.py.

It likely shouldn't make a difference, but I am using the UCF11 dataset. My images, file directories, and text files are all formatted as the documentation states.

It looks like empty values are being returned, but as I was in debug mode, I saw which file directory it was looking into, and verified that the images that were supposed to be there were in that directory. They were.

If there is any other information you would like, please let me know and I'd be glad to post it.

Question about annotations.txt

Hey,

great repo! thank you!
Just a question about annotations.txt. Am I right in the assumption that if I do NOT include a datapoint in annotations.txt, it will not be included in the dataset, even though it is in the same folder? Is this correct? That would be great for things like cross-validation.

And by the way, the pytorch-lightning version that works with the repo in 2023 is 1.7.7 (at least for me). Maybe include that in the README.

All the best!

bounding boxes

Hi there, this looks like a really good dataloader, but I was wondering why bounding boxes aren't loaded from the annotations. I have read through the README briefly and it doesn't look like any bounding boxes are to be included within the annotation files. If this isn't the case, please let me know so I can avoid doing extra work. Thank you.

demo fix error

In line 131, just replace frames with frame_tensor:
plot_video(rows=1, cols=5, frame_list=frame_tensor, plot_width=15., plot_height=3.)

END_FRAME

Thanks for your code, it's great. I was wondering, to create the annotation file, does the END_FRAME of each video have to be known in advance?

reporting of different issues encountered while working

Hello.
I'd like to present some things to fix:

TO FIX:

  1. The quick demo (demo.py) at this link shows an incorrect way to use the VideoFrameDataset class. Specifically:
  • image_template should be changed into imagefile_template
  • the parameter random_shift does not exist
  • the annotations.txt file indicated in the file asks the user to have the following structure per line:
    immagine_2022-05-11_154731919
    but on GitHub the following is reported instead:
    immagine_2022-05-11_154822791
    This presents a discrepancy for a user who is trying to follow the instructions step by step.

Thanks for reading, have a nice day.

Extra whitespace characters in the annotation file breaks

The program breaks when there are extra whitespace characters in the annotation file, e.g.

bs 2        5       1
bs 6        8       2
bs 9        12      3
bs 13       16      4

The extra whitespace characters make the annotation file more readable and editable with column-selection mode. Therefore, I think it would be nice to tolerate such extra whitespace.
If you think it's an issue, I can handle it.

Using a dataset with different widths and heights for each frame

Hello,

My dataset is pre-processed: it takes frames of a video and crops out a detection from each frame. This means my dataset has frames of slightly different sizes, due to the bounding boxes being different. Because of this, I am getting an error when using the ImglistToTensor() function. The error is:

'RuntimeError: stack expects each tensor to be equal size, but got [3, 105, 111] at entry 0 and [3, 109, 115] at entry 1'

I tried to resize them all before using this function, but I get a type error:

'TypeError: img should be PIL Image. Got <class 'list'>'

I'm unsure if there is anything I can do to still make use of this custom dataset as it is just what I need for my project.
