Detection of persons in a YouTube video

Objective

Given the URL of a YouTube video (here the Dior - Eau de Parfum commercial), generate a new video that shows the presence of humans by drawing bounding boxes around them.

Methodology

Overall principle

split the video into frames;
apply a detection model every K frames;
for each frame where no detection has been run, interpolate the bounding boxes;
draw the bounding boxes found for each frame;
recombine the frames into the final video.
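
As a rough illustration of the second and third steps, the sketch below lists the frames on which the detector would run and the pairs of frames between which boxes would be interpolated. The function names are hypothetical, and treating the last frame as an extra detection frame is an assumption made here so that every frame lies between two detections; the actual script may handle the tail of the video differently.

```python
def keyframe_indices(n_frames, stride):
    """Indices of the frames on which the detector is run (stride >= 1)."""
    idx = list(range(0, n_frames, stride))
    # Assumption: the last frame is also a detection frame, so that every
    # frame lies between two frames where detection has been run.
    if n_frames > 0 and idx[-1] != n_frames - 1:
        idx.append(n_frames - 1)
    return idx


def segments(n_frames, stride):
    """Pairs (frame0, frame1) of successive detection frames."""
    idx = keyframe_indices(n_frames, stride)
    return list(zip(idx[:-1], idx[1:]))
```

For a 10-frame clip and K = 4, `keyframe_indices(10, 4)` gives `[0, 4, 8, 9]`, so boxes would be interpolated on frames 1-3 and 5-7.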

Some details

👉 For the detection model, the script loads either RetinaNet or Faster R-CNN from torchvision.models.
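
A minimal sketch of how such a model could be loaded is given below. It assumes the ResNet-50 FPN variants shipped in torchvision.models.detection and the `pretrained=True` argument of torchvision 0.8 (the release paired with PyTorch 1.7); the names `BUILDERS`, `load_detector` and the key "retinanet" are hypothetical and not taken from the script.

```python
PERSON_LABEL = 1  # index of "person" in the COCO label map used by torchvision

# Hypothetical mapping from a short model name to a torchvision builder name.
BUILDERS = {
    "retinanet": "retinanet_resnet50_fpn",
    "fasterrcnn": "fasterrcnn_resnet50_fpn",
}


def load_detector(name="retinanet", device="cpu"):
    """Return a pretrained torchvision detector in evaluation mode."""
    # Imported lazily so this module can be imported without torchvision.
    from torchvision.models import detection

    builder = getattr(detection, BUILDERS[name])
    # `pretrained=True` matches torchvision 0.8; newer releases prefer `weights=`.
    model = builder(pretrained=True)
    return model.eval().to(device)
```

In evaluation mode, these models take a list of float image tensors of shape [3, H, W] with values in [0, 1] and return, for each image, a dict with the keys `boxes`, `labels` and `scores`; keeping only the entries whose label equals `PERSON_LABEL` restricts the output to persons.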

👉 To reduce processing time, the detection of persons is performed only on every K-th frame, where K is a parameter chosen by the user (argument --stride=K).

👉 To find the bounding boxes on frames where no detection is performed, the script interpolates boxes between two successive frames on which detection has been run. To decide whether two boxes on two different frames correspond to the same person, it first checks that they have a similar size. Among the candidates that pass this check, it selects the pair whose interpolated boxes overlap the most, scoring each pair by the lowest IoU between successive interpolated boxes. Finally, it checks that this score is above some threshold (to avoid interpolating between boxes that are too far apart) and that the confidence level of at least one of the two boxes is high enough (to avoid false positives).

The principle of the algorithm is described below, where T(box) is a box with the same size but with its lower-left corner (in Cartesian coordinates) translated to the origin, and conf(box) is the confidence level associated with a box:

input:  boxes0, boxes1 (boxes found on successive frames, frame0 and frame1,
        where detection has been run, ordered by decreasing confidence level)
output: final_boxes (list of bounding boxes for each frame between frame0 and frame1)

params: min_IoU, min_conf

for each box1 in boxes1:
    best_IoU <- 0
    best_box <- None
    for each box0 in boxes0:
        if IoU(T(box0), T(box1)) < min_IoU:
            continue
        interpolate(box0, box1)
        score <- min(IoU between successive interpolated boxes)
        if score > best_IoU:
            best_IoU <- score
            best_box <- box0
    if best_IoU > min_IoU and (conf(best_box) > min_conf or conf(box1) > min_conf):
        boxes0 <- boxes0 \ {best_box}
        add each box in interpolate(best_box, box1) to final_boxes
👉 Before this algorithm is applied, a first selection filters out the bounding boxes whose confidence level is below some threshold eps.
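
A minimal sketch of this pre-selection, assuming detections are stored as tuples (x1, y1, x2, y2, conf); the function name `prefilter` is hypothetical.

```python
def prefilter(detections, eps):
    """Drop detections whose confidence level is below eps.

    detections: list of (x1, y1, x2, y2, conf) tuples.
    The result is sorted by decreasing confidence level.
    """
    kept = [d for d in detections if d[4] >= eps]
    return sorted(kept, key=lambda d: d[4], reverse=True)
```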

Some remarks

🚩 The above algorithm will display a box even if its confidence level is below min_conf, provided a similar box with a high confidence level is detected on the previous or the next frame. In other words, it decreases the number of false negative detections (and symmetrically increases the number of false positives).

🚩 On the other hand, if the primary concern is reducing the number of false positive detections, it is better to set eps = min_conf = 0.5 (or any suitable value).

How to run the detection

First, set up a Python virtual environment (named env here) with PyTorch and torchvision (see details here). The script has been tested with Python 3.8 and PyTorch 1.7. Then, install the required packages:

(env)$ python -m pip install -U -r requirements.txt

To generate the output video, activate the virtual environment and run the script:

(env)$ python detect.py --stride=4 --eps=0.3 --min_conf=0.7 --min_iou=0.66 --with_conf

The input video is automatically downloaded in the same folder as the script detect.py, and the output video is created at the same location. To try with another video, add --url='https://youtu.be/<xxx>'.

To use the GPU, add the argument --with_gpu and choose an adequate batch size (say 4) with --batch_size=4.

The default detection model is RetinaNet. To use Faster R-CNN instead, add the argument --model='fasterrcnn'.
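
For reference, the command-line options mentioned in this section could be declared with argparse along the following lines. Only the flag names and the default model come from the text above; every other default value, the choice name "retinanet" and the function name `build_parser` are assumptions made for illustration and may differ from detect.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags described in this README."""
    p = argparse.ArgumentParser(description="Detect persons in a YouTube video")
    p.add_argument("--url", type=str, default=None)      # default is assumed
    p.add_argument("--model", choices=["retinanet", "fasterrcnn"],
                   default="retinanet")                  # RetinaNet by default
    p.add_argument("--stride", type=int, default=4)      # default is assumed
    p.add_argument("--eps", type=float, default=0.3)     # default is assumed
    p.add_argument("--min_conf", type=float, default=0.7)   # default is assumed
    p.add_argument("--min_iou", type=float, default=0.66)   # default is assumed
    p.add_argument("--batch_size", type=int, default=1)     # default is assumed
    p.add_argument("--with_conf", action="store_true")
    p.add_argument("--with_gpu", action="store_true")
    return p
```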

Repository: antoine-hochart/person_detection
