Comments (3)
Hello,
A few months ago I worked on a project that used AV-HuBERT as a frozen feature extractor, as you describe in approach 1. I trained a simple MLP classifier on top of it to predict, for each timestep (i.e. each video frame), which phoneme the speaker was uttering, and it reached around 65% accuracy. If you want to know which phonemes are classified well and which are often confused, this paper provides useful graphs (refer to the larger dataset) that should help you set realistic expectations about performance.
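For checking the same thing on your own predictions, here is a minimal sketch of per-frame accuracy and the most frequent confusions. The label lists are made-up toy data and the helper names are hypothetical; none of this is taken from the notebook.

```python
from collections import Counter


def frame_accuracy(true_labels, predicted_labels):
    """Fraction of frames whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def top_confusions(true_labels, predicted_labels, n=3):
    """Most frequent (true, predicted) pairs among the misclassified frames."""
    wrong_pairs = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return wrong_pairs.most_common(n)


# Toy per-frame labels, for illustration only (not real results).
true_labels = ["AA", "AA", "B", "P", "M", "B", "IY", "P"]
predicted_labels = ["AA", "AO", "P", "P", "B", "P", "IY", "B"]

accuracy = frame_accuracy(true_labels, predicted_labels)
most_confused = top_confusions(true_labels, predicted_labels, n=1)
print(f"accuracy: {accuracy:.3f}")
print(f"most confused pair: {most_confused}")
```

Swapping the phoneme labels for mouth shape labels should work unchanged, since the helpers only compare labels for equality.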
The code is in the notebook I attached; you can essentially reuse it and just change the target labels. I think you should expect roughly the same accuracy, since predicting the phoneme is quite similar to predicting the mouth shape during the phoneme's utterance.
Re-training AV-HuBERT itself for your new task (approach 2) would be a lot more effort, to be honest, and it also removes a benefit you already have: you can pre-compute the AV-HuBERT features for your whole dataset once and then train the mouth shape classifier on them very quickly. In my project, because I was only training a very simple MLP classifier on top of the AV-HuBERT features, training was extremely fast, which also helps experimentation speed: you find out what works much faster than you would with approach 2.
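To give a rough idea of what training on pre-computed features looks like, here is a minimal PyTorch sketch. It is not the code from the notebook: random tensors stand in for the saved AV-HuBERT features, and the feature size, number of classes, hidden size and hyperparameters are all assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed sizes for illustration; adjust to your extractor and label set.
FEATURE_DIM = 768   # hypothetical per-frame feature size
NUM_CLASSES = 12    # hypothetical number of mouth shape classes
NUM_FRAMES = 2000   # total frames across all pre-computed videos
BATCH_SIZE = 256

# Stand-ins for features extracted once from the frozen model and saved to
# disk, plus one target label per frame.
features = torch.randn(NUM_FRAMES, FEATURE_DIM)
labels = torch.randint(0, NUM_CLASSES, (NUM_FRAMES,))

# A small MLP that classifies each frame independently.
classifier = nn.Sequential(
    nn.Linear(FEATURE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_CLASSES),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = []
for epoch in range(5):
    order = torch.randperm(NUM_FRAMES)  # reshuffle frames every epoch
    running_loss = 0.0
    for start in range(0, NUM_FRAMES, BATCH_SIZE):
        batch = order[start:start + BATCH_SIZE]
        logits = classifier(features[batch])
        loss = loss_fn(logits, labels[batch])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * len(batch)
    losses.append(running_loss / NUM_FRAMES)

# Per-frame predictions; only the small classifier is ever run or updated.
with torch.no_grad():
    predictions = classifier(features).argmax(dim=1)
```

Because the extractor is never called inside the loop, each epoch is only a few matrix multiplications, which is why trying different label sets or classifier sizes is cheap.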
Good luck!
from av_hubert.
Glad to know I'm not the only one looking into similar projects! I'll have a look at your code, thanks :)
Did you use the training data in the lip2seq paper for your task? Or did you have your own data?
I used Jordan Peterson lectures from YouTube, as I needed many hours of a single speaker talking for long stretches with their lips clearly visible. However, from what I've seen, most speech datasets have a phoneme distribution similar to the one reported in the paper, which should also be relevant to your project since phonemes and mouth movements are closely related.