This note introduces the ViT_SE, Difference of Embeddings, and Disentangled Difference of Embeddings models on the FER (facial expression recognition) task.
-
- ViT_SE
- Difference of Embeddings
- Disentangled Difference of Embeddings
-
- Dataset
- AffectNet small
- Camera two of CTBC Dataset (collected by MISLAB)
- Experimental Results of AffectNet small
- Experimental Results of CTBC Dataset
- Ablation Study of Difference of Embeddings Model
-
Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition
The authors observe that the Vision Transformer gradually shifts from global attention to local attention, so they add an SE (Squeeze-and-Excitation) block to recalibrate the relationships among the local attention features.
First, install the Hugging Face transformers package: pip install transformers
Then you can visit the official documentation pages to learn the API of the ViT model, or see the introduction and implementation here. Alternatively, you can refer directly to the source code.
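As a rough sketch of the idea above (in PyTorch; the class name and the reduction ratio of 16 are illustrative choices, not necessarily the paper's exact design), an SE block can recalibrate the channel dimension of ViT token features like this:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation over the channel (embedding) dimension."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: [B, N_tokens, C], e.g. ViT token features
        s = x.mean(dim=1)            # squeeze: average over tokens -> [B, C]
        w = self.fc(s).unsqueeze(1)  # excitation: per-channel weights in (0, 1) -> [B, 1, C]
        return x * w                 # recalibrate the token features channel-wise

tokens = torch.randn(2, 197, 768)    # e.g. ViT-Base: 196 patch tokens + [CLS]
out = SEBlock(768)(tokens)
print(out.shape)                     # torch.Size([2, 197, 768])
```

The block would sit on top of the ViT's output token features before the classification head.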
-
This model classifies the target expression using both the difference of embeddings (between the Neutral and target expression images) and the target expression embedding itself.
You have to input two images (a Neutral image and a target expression image) into the model at the same time.
A feature extractor based on MobileNetV3-Large, pretrained on ImageNet, infers the embeddings of the Neutral image and the target expression image; the difference of the two embeddings is then concatenated with the target expression embedding to classify the target expression image.
You can refer to the source code here.
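The forward pass described above can be sketched as follows; the class name is hypothetical, and a tiny dummy CNN stands in for the MobileNetV3-Large backbone so the example stays self-contained:

```python
import torch
import torch.nn as nn

class DifferenceOfEmbeddings(nn.Module):
    """Classify the target expression from (target - neutral) plus the target embedding."""
    def __init__(self, backbone, embed_dim, num_classes):
        super().__init__()
        self.backbone = backbone                              # shared feature extractor
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, neutral, target):   # two images of the same subject
        e_n = self.backbone(neutral)      # embedding of the Neutral image
        e_t = self.backbone(target)      # embedding of the Target expression image
        diff = e_t - e_n                  # variation between the two expressions
        return self.classifier(torch.cat([diff, e_t], dim=1))

# Tiny stand-in backbone; the note uses MobileNetV3-Large pretrained on ImageNet.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = DifferenceOfEmbeddings(backbone, embed_dim=8, num_classes=8)
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 8])
```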
-
This model is separated into two parts.
- The first part learns an Emotion feature extractor that aims to ignore the identities of different people.
- The second part uses the Emotion feature extractor to get the emotion embeddings of the Neutral image and the Target expression image. The difference of the two embeddings represents the variation between the Neutral and Target emotions; it is concatenated with the embedding of the target expression to classify the Target expression image.
Pretrained_Emotion_Encoder model input shape: [Batchsize, Cin, Height, Width]
Disentangled_Difference_of_Embeddings model input shape: [Batchsize, 2, Cin, Height, Width]
This model is trained in two steps:
- Train the Pretrained_Emotion_Encoder by using the concatenation of the Identity embedding and the Emotion embedding to classify the expressions. (The Pretrained_Identity_Encoder is based on ResNet50 trained on MS1M and fine-tuned on VGGFace2.)
- Use the Pretrained Emotion Encoder to infer the embeddings of the Neutral image and the Target expression image, and then concatenate the difference of embeddings with the target expression embedding to infer the class of the target expression image.
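Step 1 of the training above can be sketched like this; the class name is illustrative, and two tiny stand-in CNNs replace the real identity encoder (ResNet50, MS1M then VGGFace2) and the emotion encoder so the example runs on its own:

```python
import torch
import torch.nn as nn

class EmotionEncoderStep1(nn.Module):
    """Step 1: train the emotion encoder while the identity encoder stays frozen."""
    def __init__(self, identity_encoder, emotion_encoder, id_dim, emo_dim, num_classes):
        super().__init__()
        self.identity_encoder = identity_encoder   # pretrained, frozen
        for p in self.identity_encoder.parameters():
            p.requires_grad = False
        self.emotion_encoder = emotion_encoder     # learned in this step
        self.head = nn.Linear(id_dim + emo_dim, num_classes)

    def forward(self, x):                          # x: [Batchsize, Cin, Height, Width]
        with torch.no_grad():
            id_emb = self.identity_encoder(x)      # identity embedding
        emo_emb = self.emotion_encoder(x)          # emotion embedding
        # Classify from the concatenation of both embeddings.
        return self.head(torch.cat([id_emb, emo_emb], dim=1))

def tiny_encoder(dim):  # stand-in for shape checking only
    return nn.Sequential(nn.Conv2d(3, dim, 3, stride=4), nn.AdaptiveAvgPool2d(1), nn.Flatten())

step1 = EmotionEncoderStep1(tiny_encoder(16), tiny_encoder(8), id_dim=16, emo_dim=8, num_classes=8)
print(step1(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 8])
```

In step 2 the trained `emotion_encoder` is reused exactly like the feature extractor in the Difference of Embeddings model.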
-
AffectNet is an annotated dataset collected in the wild that contains more than 1M facial images. In this experiment, we only use sample images from the manually annotated part of AffectNet; these images cover 8 discrete facial expressions.
- Randomly sample 10% of the images of the eight labels from the original manually annotated AffectNet
- Crop the face region
- Resize to 224×224
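The crop-and-resize steps can be sketched with PIL; the face box is assumed to come from some face detector, which the note does not name:

```python
from PIL import Image

def preprocess(img: Image.Image, face_box) -> Image.Image:
    """face_box = (left, top, right, bottom) from a face detector
    (the note does not specify which detector was used)."""
    face = img.crop(face_box)       # crop the face region
    return face.resize((224, 224))  # resize to the model input size

# Dummy image in place of an AffectNet sample.
out = preprocess(Image.new("RGB", (640, 480)), (100, 50, 400, 350))
print(out.size)  # (224, 224)
```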
| Expressions | Neutral | Happiness | Sadness | Surprise |
| --- | --- | --- | --- | --- |
| Training data | 7487 | 13441 | 2545 | 1409 |
| Testing data | 500 | 500 | 500 | 500 |

| Expressions | Fear | Disgust | Anger | Contempt |
| --- | --- | --- | --- | --- |
| Training data | 637 | 380 | 2488 | 375 |
| Testing data | 500 | 500 | 500 | 500 |

CTBC is an annotated dataset consisting of 7 labels and 10 subjects, collected by MISLAB. In the following experiments, we use 10-fold validation on this dataset. Due to the RAM limits of Colab, we use only part of the front-face images (camera 2) to train our model in some experiments.
Preprocessing procedure for the experiments on the CTBC dataset
- Sample 16081 cam2 images
- Resize to 224×224
Preprocessing procedure for the ablation study of the Difference of Embeddings model
- Sample all of cam2 data
- Resize to 128×128
-
The experimental results demonstrate that the Disentangled_Difference_of_Embeddings model improves the accuracy on AffectNet small by using the Pretrained Emotion Encoder to extract only the emotion features, mitigating the identity-variation problem in AffectNet.
-
Experimental Settings
| Hyperparameter | Value |
| --- | --- |
| Training data | 28762 |
| Testing data | 4000 |
| Batch size | 32 |
| Epochs | 30 |
| Optimizer | Adam |
| Loss function | Cross Entropy |
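A minimal training loop matching these settings might look like this (the learning rate and the dataloader are assumptions; the note only specifies Adam, cross-entropy, batch size 32, and 30 epochs):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=30, lr=1e-3):  # lr is an assumption, not stated in the note
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:         # batches of (images, expression labels)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

# Smoke test with a dummy model and a single dummy batch of size 32.
model = nn.Linear(8, 8)
data = [(torch.randn(32, 8), torch.randint(0, 8, (32,)))]
train(model, data, epochs=1)
```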
-
-
Ablation Study of Difference of Embeddings
The experimental results demonstrate that combining the Baseline with the difference of embeddings, which represents the variation of expressions, improves the performance of expression recognition.