This note introduces the ViT_SE, Difference of Embeddings, and Disentangled Difference of Embeddings models on the FER (facial expression recognition) task.
-
- ViT_SE
- Difference of Embeddings
- Disentangled Difference of Embeddings
-
- Dataset
- AffectNet small
- Camera two of CTBC Dataset (collected by MISLAB)
- Experimental Results of AffectNet small
- Experimental Results of CTBC Dataset
- Ablation Study of Difference of Embeddings Model
-
Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition
The authors observe that the Vision Transformer gradually shifts from global attention to local attention, so they add an SE (Squeeze-and-Excitation) block to recalibrate the relationships among the local attention features.
First, install the Hugging Face transformers package: pip install transformers
Then you can visit the official documentation pages to learn the API of the ViT model, or see the introduction and implementation here. Alternatively, you can refer directly to the source code.
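As a rough sketch of the idea above (in PyTorch; the class name and the reduction ratio of 16 are illustrative choices, not necessarily the paper's exact design), an SE block can recalibrate the channel dimension of ViT token features like this:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation over the channel (embedding) dimension."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: [B, N_tokens, C], e.g. ViT token features
        s = x.mean(dim=1)            # squeeze: average over tokens -> [B, C]
        w = self.fc(s).unsqueeze(1)  # excitation: per-channel weights in (0, 1) -> [B, 1, C]
        return x * w                 # recalibrate the token features channel-wise

tokens = torch.randn(2, 197, 768)    # e.g. ViT-Base: 196 patch tokens + [CLS]
out = SEBlock(768)(tokens)
print(out.shape)                     # torch.Size([2, 197, 768])
```

The block would sit on top of the ViT's output token features before the classification head.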
-
This model classifies the target expression using both the difference of embeddings (between the Neutral and target expression images) and the target expression embedding itself.
You have to input two images (a Neutral image and a target expression image) into the model at the same time.
A feature extractor based on MobileNetV3-Large, pretrained on ImageNet, infers the embeddings of the Neutral image and the target expression image; the difference of the two embeddings is then concatenated with the target expression embedding to classify the target expression image.
You can refer to the source code here.
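The forward pass described above can be sketched as follows; the class name is hypothetical, and a tiny dummy CNN stands in for the MobileNetV3-Large backbone so the example stays self-contained:

```python
import torch
import torch.nn as nn

class DifferenceOfEmbeddings(nn.Module):
    """Classify the target expression from (target - neutral) plus the target embedding."""
    def __init__(self, backbone, embed_dim, num_classes):
        super().__init__()
        self.backbone = backbone                              # shared feature extractor
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, neutral, target):   # two images of the same subject
        e_n = self.backbone(neutral)      # embedding of the Neutral image
        e_t = self.backbone(target)      # embedding of the Target expression image
        diff = e_t - e_n                  # variation between the two expressions
        return self.classifier(torch.cat([diff, e_t], dim=1))

# Tiny stand-in backbone; the note uses MobileNetV3-Large pretrained on ImageNet.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = DifferenceOfEmbeddings(backbone, embed_dim=8, num_classes=8)
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 8])
```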
-
This model is separated into two parts.
- The first part learns an Emotion feature extractor that aims to ignore the identities of different people.
- The second part uses the Emotion feature extractor to get the emotion embeddings of the Neutral image and the Target expression image. The difference of the two embeddings represents the variation between the Neutral and Target emotions; it is concatenated with the embedding of the target expression to classify the Target expression image.
Pretrained_Emotion_Encoder model input shape: [Batchsize, Cin, Height, Width]
Disentangled_Difference_of_Embeddings model input shape: [Batchsize, 2, Cin, Height, Width]
This model is trained in two steps:
- Train the Pretrained_Emotion_Encoder by using the concatenation of the Identity embedding and the Emotion embedding to classify the expressions. (The Pretrained_Identity_Encoder is based on ResNet50 trained on MS1M and fine-tuned on VGGFace2.)
- Use the Pretrained Emotion Encoder to infer the embeddings of the Neutral image and the Target expression image, and then concatenate the difference of embeddings with the target expression embedding to infer the class of the target expression image.
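Step 1 of the training above can be sketched like this; the class name is illustrative, and two tiny stand-in CNNs replace the real identity encoder (ResNet50, MS1M then VGGFace2) and the emotion encoder so the example runs on its own:

```python
import torch
import torch.nn as nn

class EmotionEncoderStep1(nn.Module):
    """Step 1: train the emotion encoder while the identity encoder stays frozen."""
    def __init__(self, identity_encoder, emotion_encoder, id_dim, emo_dim, num_classes):
        super().__init__()
        self.identity_encoder = identity_encoder   # pretrained, frozen
        for p in self.identity_encoder.parameters():
            p.requires_grad = False
        self.emotion_encoder = emotion_encoder     # learned in this step
        self.head = nn.Linear(id_dim + emo_dim, num_classes)

    def forward(self, x):                          # x: [Batchsize, Cin, Height, Width]
        with torch.no_grad():
            id_emb = self.identity_encoder(x)      # identity embedding
        emo_emb = self.emotion_encoder(x)          # emotion embedding
        # Classify from the concatenation of both embeddings.
        return self.head(torch.cat([id_emb, emo_emb], dim=1))

def tiny_encoder(dim):  # stand-in for shape checking only
    return nn.Sequential(nn.Conv2d(3, dim, 3, stride=4), nn.AdaptiveAvgPool2d(1), nn.Flatten())

step1 = EmotionEncoderStep1(tiny_encoder(16), tiny_encoder(8), id_dim=16, emo_dim=8, num_classes=8)
print(step1(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 8])
```

In step 2 the trained `emotion_encoder` is reused exactly like the feature extractor in the Difference of Embeddings model.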
-
AffectNet is an annotated dataset collected in the wild that contains more than 1M facial images. In this experiment, we only use sample images from the manually annotated part of AffectNet; these images cover 8 discrete facial expressions.
- Randomly sample 10% of the images of the eight labels from the original manually annotated AffectNet
- Crop the face region
- Resize to 224×224
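The crop-and-resize steps can be sketched with PIL; the face box is assumed to come from some face detector, which the note does not name:

```python
from PIL import Image

def preprocess(img: Image.Image, face_box) -> Image.Image:
    """face_box = (left, top, right, bottom) from a face detector
    (the note does not specify which detector was used)."""
    face = img.crop(face_box)       # crop the face region
    return face.resize((224, 224))  # resize to the model input size

# Dummy image in place of an AffectNet sample.
out = preprocess(Image.new("RGB", (640, 480)), (100, 50, 400, 350))
print(out.size)  # (224, 224)
```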
| Expressions | Neutral | Happiness | Sadness | Surprise |
| --- | --- | --- | --- | --- |
| Training data | 7487 | 13441 | 2545 | 1409 |
| Testing data | 500 | 500 | 500 | 500 |

| Expressions | Fear | Disgust | Anger | Contempt |
| --- | --- | --- | --- | --- |
| Training data | 637 | 380 | 2488 | 375 |
| Testing data | 500 | 500 | 500 | 500 |

CTBC is an annotated dataset consisting of 7 labels and 10 subjects, collected by MISLAB. In the following experiments, we use 10-fold validation on this dataset. Due to the RAM limits of Colab, we use only part of the front-face images (camera 2) to train our model in some experiments.
Preprocessing procedure for the experiments on the CTBC dataset
- Sample 16081 cam2 images
- Resize to 224×224
Preprocessing procedure for the ablation study of the Difference of Embeddings model
- Sample all of cam2 data
- Resize to 128×128
-
The experimental results demonstrate that the Disentangled_Difference_of_Embeddings model improves the accuracy on AffectNet small by using the Pretrained Emotion Encoder to extract only the emotion features, mitigating the identity-variation problem in AffectNet.
-
Experimental Settings
| Hyperparameter | Value |
| --- | --- |
| Training data | 28762 |
| Testing data | 4000 |
| Batch size | 32 |
| Epochs | 30 |
| Optimizer | Adam |
| Loss function | Cross Entropy |
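A minimal training loop matching these settings might look like this (the learning rate and the dataloader are assumptions; the note only specifies Adam, cross-entropy, batch size 32, and 30 epochs):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=30, lr=1e-3):  # lr is an assumption, not stated in the note
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:         # batches of (images, expression labels)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

# Smoke test with a dummy model and a single dummy batch of size 32.
model = nn.Linear(8, 8)
data = [(torch.randn(32, 8), torch.randint(0, 8, (32,)))]
train(model, data, epochs=1)
```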
-
-
Ablation Study of Difference of Embeddings
The experimental results demonstrate that combining the Baseline with the difference of embeddings, which represents the variation of expressions, improves the performance of expression recognition.