
multimodal-depression-from-video's Introduction

Adrian Cosma

Fantasies have to be unrealistic because the moment, the second that you get what you seek, you don't, you can't want it anymore. In order to continue to exist, desire must have its objects perpetually absent. It's not the "it" that you want, it's the fantasy of "it". [...] What it means to be fully human is to strive to live by ideas and ideals and not to measure your life by what you've attained in terms of your desires but those small moments of integrity, compassion, rationality, even self-sacrifice. Because in the end, the only way that we can measure the significance of our own lives is by valuing the lives of others. - David Gale, The Life of David Gale

A (non-exhaustive) list of projects I've worked on:

🚶🏻 Gait Analysis 🚶🏻:

😶‍🌫️ Psychology and Mental Health 😶‍🌫️:

📖 Romanian Corpora 📖

🔨 Home-Made Tools 🔨


multimodal-depression-from-video's Issues

Implement positional embeddings

We're processing sequences in order. Use nn.Embedding() to map each position (an integer between 0 and sequence_length - 1) to a vector, and add each position vector to the corresponding latent vector.
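A minimal sketch of the idea, assuming latent vectors of shape (batch, seq_len, embed_size); the class and parameter names are illustrative, not the repo's actual API:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len: int, embed_size: int):
        super().__init__()
        # one learned row per position 0..max_len-1
        self.pos_emb = nn.Embedding(max_len, embed_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # positions 0..seq_len-1, broadcast over the batch dimension
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)

x = torch.randn(2, 16, 64)  # (batch, seq_len, embed_size)
out = LearnedPositionalEmbedding(max_len=512, embed_size=64)(x)
print(out.shape)  # torch.Size([2, 16, 64])
```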

For hand gestures, add "token_type_ids"

Each hand should have a different embedding to differentiate between them. The embedding (nn.Embedding(...)) should be added to the projected coordinates, similar to the "token_type_ids" in the huggingface transformers library.
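A sketch of per-hand "token_type" embeddings, assuming the projected hand coordinates have shape (batch, seq_len, 2, embed_size) where dimension 2 indexes [left, right]; names and shapes are assumptions:

```python
import torch
import torch.nn as nn

class HandTypeEmbedding(nn.Module):
    def __init__(self, embed_size: int):
        super().__init__()
        self.type_emb = nn.Embedding(2, embed_size)  # 0 = left hand, 1 = right hand

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hand_ids = torch.tensor([0, 1], device=x.device)
        # (2, embed_size) broadcasts over the batch and time dimensions
        return x + self.type_emb(hand_ids)

x = torch.randn(2, 16, 2, 64)  # (batch, seq_len, hands, embed_size)
out = HandTypeEmbedding(64)(x)
print(out.shape)  # torch.Size([2, 16, 2, 64])
```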

Add mean predictions to TemporalEvaluator

Evaluating the baseline model with the temporal evaluator currently takes only the final prediction into account.

Include the mean prediction over the whole video in TemporalEvaluator: add a line to the .csv results file with kind = 'mean', and tag the current line with kind = 'last'.
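A sketch of adding a 'mean' row next to the existing 'last' row; the column layout of the real results .csv is assumed, not known:

```python
import csv
import io
import torch

def aggregate_predictions(window_preds: torch.Tensor) -> dict:
    # window_preds: per-window depression scores for one video, in order
    return {"last": window_preds[-1].item(),
            "mean": window_preds.mean().item()}

preds = torch.tensor([0.2, 0.6, 0.7, 0.9])
rows = aggregate_predictions(preds)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["kind", "prediction"])
writer.writeheader()
for kind, value in rows.items():
    writer.writerow({"kind": kind, "prediction": value})
print(buf.getvalue())
```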

Implement Temporal Evaluator

Take the windows in order and process all of them sequentially (with the Perceiver or not?). Take the majority vote or the last prediction.

We can use this to visualize the depression score over time and make nice plots for the paper.
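The two aggregation rules above can be sketched over per-window binary predictions taken in temporal order (illustrative only):

```python
from collections import Counter

def majority_vote(window_preds):
    # most common label across all windows of a video
    return Counter(window_preds).most_common(1)[0][0]

def last_prediction(window_preds):
    # label of the final window only
    return window_preds[-1]

preds = [0, 1, 1, 0, 1]  # one prediction per window, in order
print(majority_vote(preds))    # 1
print(last_prediction(preds))  # 1
```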

Implement a simple baseline

After each modality is encoded, let's try a simple baseline: a plain transformer encoder that encodes all modalities and does the classification per window.

Evaluation is majority vote.
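The baseline described above might look like the following sketch: already-encoded modality features for one window go through a transformer encoder, are mean-pooled over time, and classified. All dimensions and names are assumptions, not the repo's actual configuration:

```python
import torch
import torch.nn as nn

class SimpleBaseline(nn.Module):
    def __init__(self, embed_size: int = 64, n_heads: int = 4,
                 n_layers: int = 2, n_classes: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(embed_size, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_size), one window per batch item
        h = self.encoder(x).mean(dim=1)  # mean-pool over time
        return self.classifier(h)        # (batch, n_classes)

logits = SimpleBaseline()(torch.randn(8, 32, 64))
print(logits.shape)  # torch.Size([8, 2])
```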

correct structure to store the videos

Hi @david-gimeno, I have downloaded all the videos from the D-vlog dataset. Should I split them based on the IDs given in the test, train, and validation .csv files? Or is there a separate video_id.csv file, used by python3 ./scripts/feature_extraction/dvlog/extract_wavs.py --csv-path ./data/D-vlog/video_ids.csv --column-video-id video_id --video-dir $VIDEO_DIR --dest-dir $WAV_DIR, which is missing from the repo? Please help me with this.

Implement Perceiver Architecture

  • Implement Time-Unaware Perceiver Architecture
  • Discuss when modality embeddings are needed
  • Discuss details & Implement Time-Aware Perceiver Architecture
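A rough sketch of the Perceiver idea for reference: a small set of learned latents cross-attends to a long (multimodal) input sequence, so compute does not grow quadratically with input length. This is purely illustrative, not the architecture the issue will settle on:

```python
import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    def __init__(self, n_latents: int = 32, embed_size: int = 64, n_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, embed_size))
        self.cross_attn = nn.MultiheadAttention(embed_size, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(embed_size, n_heads, batch_first=True)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, seq_len, embed_size)
        lat = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        lat, _ = self.cross_attn(lat, inputs, inputs)  # latents attend to inputs
        lat, _ = self.self_attn(lat, lat, lat)         # latents refine themselves
        return lat                                     # (batch, n_latents, embed_size)

out = PerceiverBlock()(torch.randn(2, 500, 64))
print(out.shape)  # torch.Size([2, 32, 64])
```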

Implement per Modality Embedding

Differentiate between modalities. Use nn.Embedding() to map a modality id to a vector representation, and add that vector to each latent vector of the corresponding modality.
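A sketch of the per-modality embedding; the modality names, ids, and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

MODALITY_IDS = {"face": 0, "body": 1, "hands": 2, "audio": 3}  # hypothetical set

class ModalityEmbedding(nn.Module):
    def __init__(self, n_modalities: int, embed_size: int):
        super().__init__()
        self.mod_emb = nn.Embedding(n_modalities, embed_size)

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        idx = torch.tensor(MODALITY_IDS[modality], device=x.device)
        # the same modality vector is added to every latent vector in x
        return x + self.mod_emb(idx)

x = torch.randn(2, 16, 64)  # latents of one modality
out = ModalityEmbedding(len(MODALITY_IDS), 64)(x, "audio")
print(out.shape)  # torch.Size([2, 16, 64])
```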

Fix normalization for landmarks

Decide on how to normalize the landmarks.

Use a batch-norm over the last dimension, probably, before applying the projection.
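A sketch of that option, normalizing the landmark features before the linear projection. BatchNorm1d expects the channel dimension second, hence the transposes; the feature size below is just the hands example:

```python
import torch
import torch.nn as nn

feat_dim = 2 * 21 * 5  # e.g. 2 hands x 21 landmarks x 5 coordinates
norm = nn.BatchNorm1d(feat_dim)
proj = nn.Linear(feat_dim, 64)

x = torch.randn(8, 32, feat_dim)            # (batch, frames, features)
x = norm(x.transpose(1, 2)).transpose(1, 2)  # normalize the feature dimension
out = proj(x)                                # project to embed_size
print(out.shape)  # torch.Size([8, 32, 64])
```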

Common Positional Encodings for all modalities

It's kind of weird that we have different positional encodings for each modality; we should probably share the same ones across all of them. But how do we take the different framerates into account?

Maybe some framerate-aware positional encoding? (i.e., don't use learned positional encodings, but "classical" sin/cos P.E.)

Something like this (Fractional positional encoding): https://opus.bibliothek.uni-augsburg.de/opus4/frontdoor/deliver/index/docId/96742/file/96742.pdf

Repository: https://github.com/philm5/fpe-vtt/blob/master/vtt_transformer.py#L85

The implementation is in TensorFlow; we need to convert it to torch (use ChatGPT?).
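A rough torch sketch of the framerate-aware idea only, not a faithful port of the linked TF implementation: positions are expressed in seconds (frame_index / fps), so modalities with different framerates get aligned encodings:

```python
import math
import torch

def fractional_sincos(seq_len: int, fps: float, embed_size: int) -> torch.Tensor:
    # time of each frame in seconds, instead of integer frame indices
    pos = torch.arange(seq_len, dtype=torch.float32) / fps
    i = torch.arange(0, embed_size, 2, dtype=torch.float32)
    div = torch.exp(-math.log(10000.0) * i / embed_size)
    pe = torch.zeros(seq_len, embed_size)
    pe[:, 0::2] = torch.sin(pos[:, None] * div)
    pe[:, 1::2] = torch.cos(pos[:, None] * div)
    return pe

audio_pe = fractional_sincos(seq_len=100, fps=100.0, embed_size=64)
video_pe = fractional_sincos(seq_len=25, fps=25.0, embed_size=64)
# audio frame 4 and video frame 1 both sit at t = 0.04 s,
# so they receive the same encoding despite different framerates
print(audio_pe.shape, video_pe.shape)
```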

Experiments before final experiments

  • Exploring different schedulers & learning rates for the baseline model
  • Check if the perceiver is learning something, check if the same batch size works for it
  • Ablation study on number of windows for perceiver, check if it works
  • Small ablation on presence_threshold, maybe lower is better (more data), or maybe higher is better (more quality)
  • Ablation on the modalities. Which combination works better?

while setting variables

Hello, can you please help me with what values I should set for the variables in each file? For example:
PASE_CONFIG=
PASE_CHCK_PATH=
USE_AUTH_TOKEN=
What should the values be?

I do not know how to remove the issue: Implement Hand Landmarks Projection

When dealing with hand landmarks, we have 21 landmarks with 5 coordinates each, for each of the N frames in each of the W windows:

  • Original Shape: (W, N, 2, 21, 5)
  • Input Transformer Shape: (N, embed_size)

So, we need a linear projection layer to map each modality's dimension to the embed_size of the Transformer. I am a bit confused about how to preserve the distinction between the two hands, because the order in the forward pass would be:

  1. x = linear_proj(x)
  2. x += pos_emb
  3. x += modality_emb
  4. x[:, :, :embed_size//2] += left_hand_type_emb & x[:, :, embed_size//2:] += right_hand_type_emb
  5. x = transformer(x)

The other option is feeding linear_proj a tensor reshaped to (W, N, 2, 21*5), but then we have other problems when reshaping back to the shape it should be.
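The second option could be sketched like this: flatten each hand's landmarks, project per hand, add a per-hand type embedding, then merge the hand axis. The shapes follow the issue; the layer names and the mean-merge are illustrative choices, not a decided design:

```python
import torch
import torch.nn as nn

W, N, embed_size = 4, 16, 64
x = torch.randn(W, N, 2, 21, 5)  # (windows, frames, hands, landmarks, coords)

linear_proj = nn.Linear(21 * 5, embed_size)
hand_type_emb = nn.Embedding(2, embed_size)  # 0 = left, 1 = right

h = linear_proj(x.reshape(W, N, 2, 21 * 5))  # (W, N, 2, embed_size)
h = h + hand_type_emb(torch.tensor([0, 1]))  # distinguish the hands
h = h.mean(dim=2)                            # merge hands -> (W, N, embed_size)
print(h.shape)  # torch.Size([4, 16, 64])
```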

This is for next week. See you tomorrow :)

Probably need another dataset

If we want ECIR, we need to add another dataset with some other particularity.

  • Pre-process DAIC-WOZ dataset + Implement specific dataloader
  • Pre-process Extended DAIC-WOZ dataset + Implement specific dataloader
  • Experiments DAIC-WOZ
  • Experiments E-DAIC-WOZ
