Despite cultural diversity, emotions are expressed in ways that are largely universal. We tackle the EmotiW 2020 challenge: group-level sentiment recognition on videos from across the globe. Given a short clip, the goal is to predict whether the overall sentiment of the video is positive, neutral, or negative. The problem matters because audio-visual sentiment analysis has direct applications in psychology and mental health.
Our paper was accepted at the 22nd ACM International Conference on Multimodal Interaction: https://dl.acm.org/doi/10.1145/3382507.3417966
We worked with the EmotiW 2020 dataset.
We ensembled models from four modalities: overall scene, pose, audio, and face.
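Below is a minimal late-fusion sketch of the ensembling idea: each modality model emits softmax probabilities over the three sentiment classes, and the ensemble takes their (optionally weighted) average. The helper name `fuse_predictions`, the weights, and the example probabilities are illustrative assumptions, not the repo's actual API; see `src/` and the notebooks for the real pipeline.

```python
import numpy as np

LABELS = ["positive", "neutral", "negative"]

def fuse_predictions(modality_probs, weights=None):
    """Weighted average of per-modality softmax outputs (hypothetical helper).

    modality_probs: dict mapping modality name -> length-3 probability vector
    weights: optional dict mapping modality name -> float (defaults to uniform)
    """
    names = list(modality_probs)
    if weights is None:
        weights = {m: 1.0 for m in names}
    total = sum(weights[m] for m in names)
    # Weighted average of the probability vectors, renormalized by total weight.
    fused = sum(weights[m] * np.asarray(modality_probs[m], dtype=float)
                for m in names) / total
    return LABELS[int(np.argmax(fused))], fused

# Made-up softmax outputs from each modality model, for illustration only.
probs = {
    "scene": [0.50, 0.30, 0.20],
    "pose":  [0.35, 0.40, 0.25],
    "audio": [0.60, 0.25, 0.15],
    "face":  [0.30, 0.30, 0.40],
}
label, fused = fuse_predictions(probs)
print(label, fused)  # "positive" with the averaged distribution
```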
To start, please check out our presentation and slide deck.
The code is organized as follows:
- `src/` - preprocessing, generation, and classification code
- `notebooks/` - notebooks for training and prediction
Run this notebook to see how our model works on the dataset.
We provide many of the models we trained here.
Per-modality results on the EmotiW 2020 validation set:
| Modality | Accuracy | F1-Score |
|---|---|---|
| Scene | 0.546 | 0.541 |
| Pose | 0.486 | 0.489 |
| Audio | 0.577 | 0.577 |
| Face | 0.400 | 0.348 |
| Image Captioning | 0.505 | 0.506 |
Fused ensemble results:

| Dataset | Accuracy |
|---|---|
| Validation | 0.640 |
| Test | 0.639 |
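The exact evaluation script isn't reproduced here, but metrics of the kind reported above can be computed with scikit-learn. The macro averaging and the integer label encoding below are our assumptions, shown purely for reference:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for illustration: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

print("Accuracy:", accuracy_score(y_true, y_pred))             # fraction correct
print("F1-Score:", f1_score(y_true, y_pred, average="macro"))  # mean of per-class F1
```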
Boyang Tom Jin, Leila Abdelrahman, Cong Kevin Chen, Amil Khanzada
CS231n - Stanford University
If you find this work useful, please cite:

```bibtex
@misc{2020fusical,
  author       = {Boyang Tom Jin and Leila Abdelrahman and Cong Kevin Chen and Amil Khanzada},
  title        = {Fusical},
  howpublished = {\url{https://github.com/kevincong95/cs231n-emotiw}},
  year         = {2020}
}
```