Comments (3)

MiscellaneousStuff avatar MiscellaneousStuff commented on September 10, 2024

Hello,

A few months ago I worked on a project that used AV-HuBERT as a frozen feature extractor, as you described in approach 1. I then trained a simple MLP classifier on top of it to predict which phoneme a person was uttering at each timestep (that is, for each video frame), and it reached around 65% accuracy. If you're also interested in which phonemes are classified well and which are often confused, this paper provides useful graphs (refer to the larger dataset) that should give you realistic expectations about performance (hopefully).
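To make "per-frame accuracy" and "which phonemes are confused" concrete, here is a minimal sketch of how both could be computed from per-frame labels. The label sequences are made up for illustration, and the helper names `frame_accuracy` and `top_confusions` are hypothetical; this is not code from the notebook or the paper.

```python
from collections import Counter


def frame_accuracy(true_labels, pred_labels):
    """Fraction of frames whose predicted label matches the reference label."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("no frames to score")
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)


def top_confusions(true_labels, pred_labels, k=3):
    """Most frequent (reference, predicted) pairs among misclassified frames."""
    errors = Counter((t, p) for t, p in zip(true_labels, pred_labels) if t != p)
    return errors.most_common(k)


# Made-up per-frame phoneme labels, for illustration only.
reference = ["p", "p", "b", "b", "m", "aa", "aa", "f", "v", "v"]
predicted = ["p", "b", "b", "p", "m", "aa", "aa", "f", "f", "f"]

accuracy = frame_accuracy(reference, predicted)
confusions = top_confusions(reference, predicted)
print(f"frame accuracy: {accuracy:.2f}")
print("most common confusions:", confusions)
```

Listing the confused pairs alongside the headline accuracy shows whether the errors cluster on sounds that look alike on the lips, which is probably what matters most for a mouth shape task.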

The code for that is in the notebook I attached. You can basically re-use it and just change your target labels. I think you should expect roughly the same accuracy, seeing as predicting the phoneme is quite similar to predicting the mouth shape during the phoneme's utterance.

Re-tasking the AV-HuBERT model and re-training it for your new task would be a lot more effort, to be honest. It also removes a benefit you already have: you can pre-compute AV-HuBERT features for your whole dataset once and then train the mouth shape classifier very quickly on those pre-computed features. In my project, because I was only training a very simple MLP classifier on top of the AV-HuBERT features, training was extremely quick, which also helps experimentation speed (you can find out what is going to work a lot faster than with approach 2).
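As a rough illustration of why training on pre-computed features is so quick, here is a minimal sketch of a one-hidden-layer MLP trained on a cached feature matrix. Everything in it is an assumption made for the example: the features are random synthetic stand-ins rather than real AV-HuBERT outputs, the shapes and hyperparameters are arbitrary, and it uses plain NumPy instead of whatever framework the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for cached per-frame features and labels. In practice
# these would be loaded from disk after running the frozen encoder once.
n_frames, feat_dim, n_classes, hidden = 600, 32, 4, 16
class_means = rng.normal(0.0, 2.0, size=(n_classes, feat_dim))
labels = rng.integers(0, n_classes, size=n_frames)
features = class_means[labels] + rng.normal(0.0, 1.0, size=(n_frames, feat_dim))

# Simple train/validation split, standardised with training statistics only.
x_train, y_train = features[:500], labels[:500]
x_val, y_val = features[500:], labels[500:]
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_train = (x_train - mean) / std
x_val = (x_val - mean) / std

# Parameters of a one-hidden-layer MLP.
params = {
    "w1": rng.normal(0.0, 0.1, size=(feat_dim, hidden)),
    "b1": np.zeros(hidden),
    "w2": rng.normal(0.0, 0.1, size=(hidden, n_classes)),
    "b2": np.zeros(n_classes),
}


def forward(x, p):
    """Return hidden activations (ReLU) and softmax class probabilities."""
    h = np.maximum(x @ p["w1"] + p["b1"], 0.0)
    logits = h @ p["w2"] + p["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return h, probs


# Full-batch gradient descent on the mean cross-entropy loss.
learning_rate = 0.2
one_hot = np.eye(n_classes)[y_train]
for step in range(300):
    h, probs = forward(x_train, params)
    grad_logits = (probs - one_hot) / len(x_train)
    grad_h = (grad_logits @ params["w2"].T) * (h > 0.0)  # ReLU gradient mask
    grads = {
        "w1": x_train.T @ grad_h,
        "b1": grad_h.sum(axis=0),
        "w2": h.T @ grad_logits,
        "b2": grad_logits.sum(axis=0),
    }
    for name in params:
        params[name] -= learning_rate * grads[name]

_, val_probs = forward(x_val, params)
val_accuracy = float((val_probs.argmax(axis=1) == y_val).mean())
print(f"validation accuracy: {val_accuracy:.2f}")
```

Because the encoder never runs inside the training loop, each experiment only touches a small matrix of cached features, so swapping the phoneme labels for mouth shape labels should just mean changing `labels` and `n_classes`. The synthetic clusters here are far easier to separate than real features, so the accuracy this prints says nothing about what to expect on real data.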

Good luck!

gadese-vooban avatar gadese-vooban commented on September 10, 2024

Glad to know I'm not the only one looking into similar projects! I'll have a look at your code, thanks :)

Did you use the training data in the lip2seq paper for your task? Or did you have your own data?

MiscellaneousStuff avatar MiscellaneousStuff commented on September 10, 2024

I used the Jordan Peterson lectures on YouTube, as I needed many hours of a single speaker talking for long periods with their lips clearly visible. However, from what I've seen, most spoken datasets will have a similar phoneme distribution to the one reported in the paper (which should also be very relevant to your project, as the phonemes and mouth movements will be closely related).

