Hi, I've been looking into doing transfer learning with the model fo

AV-huBERT as backbone for a slightly different task about av_hubert HOT 3 OPEN

gadese-vooban commented on September 10, 2024

Comments (3)

MiscellaneousStuff commented on September 10, 2024

Hello,

A few months ago I worked on a project that used AV-HuBERT as a frozen feature extractor, as you described in approach 1. I then trained a simple MLP classifier on top of it to predict which phoneme a person was uttering at each timestep (that is, for each video frame), and it achieved around 65% accuracy. If you're also interested in which phonemes are classified well and which are often confused, this paper provides useful graphs (refer to the larger dataset) that should give you realistic expectations about performance (hopefully).
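For anyone who wants to run the same kind of check on their own predictions, here is a minimal sketch of per-frame accuracy and a count of the most frequent confusions. It is not taken from the notebook: the function names and the toy label sequences are made up for illustration, and it assumes you already have one reference label and one predicted label per video frame.

```python
from collections import Counter


def frame_accuracy(true_labels, predicted_labels):
    """Fraction of frames whose predicted label matches the reference label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        return 0.0
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count (reference, predicted) pairs for misclassified frames only."""
    return Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )


# Hypothetical per-frame labels for one short clip (illustration only).
reference = ["P", "P", "AA", "AA", "M", "M", "B", "B"]
predicted = ["B", "B", "AA", "AA", "M", "B", "B", "P"]

accuracy = frame_accuracy(reference, predicted)
confusions = confusion_counts(reference, predicted)
most_confused = confusions.most_common(1)[0]  # ((reference, predicted), count)
```

Swapping phoneme labels for mouth shape labels needs no code changes, since the functions only compare labels for equality.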

The code for that is in the notebook I attached. You can basically reuse it and just change the target labels. I think you should expect roughly the same accuracy, since predicting the phoneme is quite similar to predicting the mouth shape during the phoneme's utterance.

Retraining the AV-HuBERT model itself for your new task would be a lot more effort, to be honest. It also removes a benefit you already have: you can pre-compute the AV-HuBERT features for your whole dataset once and then train the mouth shape classifier on those pre-computed features very quickly. In my project, since I was only training a very simple MLP classifier on top of the AV-HuBERT features, training was extremely quick, which also helps experimentation speed (you will find out what works a lot faster than with approach 2).
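As a rough illustration of why the pre-computed route is fast, here is a small sketch of training a one-hidden-layer MLP on cached per-frame features. It is an assumption-laden stand-in, not the notebook's code: the features are random synthetic clusters rather than real AV-HuBERT outputs, the feature dimension, class count, and hyperparameters are arbitrary, and it uses plain NumPy instead of any particular deep learning framework.

```python
import numpy as np


def train_mlp(features, labels, num_classes, hidden=32, epochs=300, lr=0.1, seed=0):
    """Train a one-hidden-layer ReLU MLP with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, dim = features.shape
    w1 = rng.normal(0.0, np.sqrt(2.0 / dim), size=(dim, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, np.sqrt(2.0 / hidden), size=(hidden, num_classes))
    b2 = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        # Forward pass: hidden ReLU layer, then softmax over classes.
        h_pre = features @ w1 + b1
        h = np.maximum(h_pre, 0.0)
        logits = h @ w2 + b2
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Backward pass for the mean cross-entropy loss.
        grad_logits = (probs - one_hot) / n
        grad_w2 = h.T @ grad_logits
        grad_b2 = grad_logits.sum(axis=0)
        grad_h = grad_logits @ w2.T
        grad_h[h_pre <= 0.0] = 0.0
        grad_w1 = features.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return w1, b1, w2, b2


def predict(params, features):
    """Return the most likely class index for each frame."""
    w1, b1, w2, b2 = params
    h = np.maximum(features @ w1 + b1, 0.0)
    return (h @ w2 + b2).argmax(axis=1)


# Synthetic stand-in for cached features: one noisy cluster per class.
# Real features would be loaded from disk after a single extraction pass.
rng = np.random.default_rng(0)
num_classes, dim, frames_per_class = 4, 16, 100  # arbitrary demo sizes
centroids = rng.normal(0.0, 3.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), frames_per_class)
features = centroids[labels] + rng.normal(0.0, 1.0, size=(labels.size, dim))

# Standardise each feature dimension, then split frames into train and held-out.
features = (features - features.mean(axis=0)) / features.std(axis=0)
train_idx = np.arange(0, labels.size, 2)
test_idx = np.arange(1, labels.size, 2)

params = train_mlp(features[train_idx], labels[train_idx], num_classes)
predictions = predict(params, features[test_idx])
held_out_accuracy = float((predictions == labels[test_idx]).mean())
```

Because the backbone never runs inside the training loop, each experiment only costs a few matrix multiplications over the cached array. With real data you would also want to split by video or speaker rather than by alternating frames, since neighbouring frames are highly correlated.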

Good luck!

gadese-vooban commented on September 10, 2024

Glad to know I'm not the only one looking into similar projects! I'll have a look at your code, thanks :)

Did you use the training data in the lip2seq paper for your task? Or did you have your own data?

MiscellaneousStuff commented on September 10, 2024

I used the Jordan Peterson lectures on YouTube, as I needed many hours of a single speaker talking for long periods with their lips clearly visible. However, from what I've seen, most spoken datasets will have a phoneme distribution similar to the one reported in the paper (which should also be very relevant to your project, as the phonemes and mouth movements will be closely related).
