Comments (13)

Gpwner commented on August 12, 2024

Here is the part_of_friends.json:
part_of_friends.zip

b04901014 commented on August 12, 2024

Some guidelines for debugging.

  • Identify whether the problem comes from the proposed algorithms (TAPT, PTAPT) or from the original wav2vec 2.0 fine-tuning. Does V-FT yield similar results? Does the training loss overfit quickly, or is it still under-fitting?
  • The number of epochs and the number of clusters should be tuned according to the size of the dataset. For instance, if you have a smaller dataset, increase the epochs or decrease the batch size, and lower the cluster size for PTAPT.
  • Some of the samples in your dataset seem to be broken. The error
`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}

indicates that the input audio is shorter than about 10 * 0.01 = 0.1 s, which is an unreasonably short clip for predicting emotion. A minimal sketch for screening out such clips is given below.
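For reference, here is a small sketch (my own, not from the repo) to scan a dataset directory for such too-short clips, assuming wav files readable by soundfile; the directory path is hypothetical:

import os
import soundfile as sf

MIN_DURATION = 0.1  # seconds; clips shorter than this trigger the mask_length error

datadir = "Dataset/combineData16KHz/"  # hypothetical path, adjust to your layout
for fname in sorted(os.listdir(datadir)):
    if not fname.endswith(".wav"):
        continue
    # Read only the header to get the duration, without loading the samples
    info = sf.info(os.path.join(datadir, fname))
    duration = info.frames / info.samplerate
    if duration < MIN_DURATION:
        print(f"{fname}: {duration:.3f}s -- too short, consider removing it")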

b04901014 commented on August 12, 2024

Also, some notes from my experience working with the MELD audio.

  • The audio needs to be normalized in terms of mean and variance across the utterance. Otherwise the loss may not go down.
  • It takes more epochs to lower the loss.

Gpwner commented on August 12, 2024

V-FT is even worse. Did I miss anything?
I ran this command:

python run_downstream_custom_multiple_fold.py --precision 16 --datadir Dataset/combineData16KHz/ --labeldir PART_OF_FRIENDS/labels/ --saving_path VFT/downstreammul --outputfile VFT.log  --max_epochs=30

Here is my result:

+++ SUMMARY +++
Mean UAR [%]: 25.00
Fold Std. UAR [%]: 0.00
Fold Median UAR [%]: 25.00
Run Std. UAR [%]: 0.00
Run Median UAR [%]: 25.00
Mean WAR [%]: 25.00
Fold Std. WAR [%]: 0.00
Fold Median WAR [%]: 25.00
Run Std. WAR [%]: 0.00
Run Median WAR [%]: 25.00
Mean macroF1 [%]: 10.00
Fold Std. macroF1 [%]: 0.00
Fold Median macroF1 [%]: 10.00
Run Std. macroF1 [%]: 0.00
Run Median macroF1 [%]: 10.00
Mean microF1 [%]: 10.00
Fold Std. microF1 [%]: 0.00
Fold Median microF1 [%]: 10.00
Run Std. microF1 [%]: 0.00
Run Median microF1 [%]: 10.00

The confusion matrix (every test sample is predicted as the first class, i.e. chance level for four balanced classes):

[[233.   0.   0.   0.]
 [233.   0.   0.   0.]
 [233.   0.   0.   0.]
 [233.   0.   0.   0.]]

The training log is attached:
V_FT.zip

By the way, regarding "the audio needs to be normalized": what is the recommended way to do it? Thanks.

b04901014 commented on August 12, 2024

You can observe from the training loss that it is not decreasing for V-FT, so the training is not even happening.

Something like:

wav = (wav - wav.mean()) / (wav.std() + 1e-12)

will normalize the audio signal.

Some datasets need this for the loss to go down, depending on the recording environment.

b04901014 commented on August 12, 2024

You may add it in the __getitem__ of the downstream dataloader.
But if you run TAPT/PTAPT, you'll also have to add it in the pretraining dataloader.
Or you can simply preprocess the audio files, as sketched below.
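If you go with preprocessing, a minimal sketch (assuming mono wav files; both paths are hypothetical) could look like:

import os
import soundfile as sf

indir = "Dataset/combineData16KHz/"       # hypothetical input path
outdir = "Dataset/combineData16KHzNorm/"  # hypothetical output path
os.makedirs(outdir, exist_ok=True)

for fname in os.listdir(indir):
    if not fname.endswith(".wav"):
        continue
    wav, sr = sf.read(os.path.join(indir, fname))
    # Per-utterance normalization: zero mean, unit variance
    wav = (wav - wav.mean()) / (wav.std() + 1e-12)
    sf.write(os.path.join(outdir, fname), wav, sr)

The downstream and pretraining dataloaders can then read the normalized files unchanged.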

Gpwner commented on August 12, 2024

So the mean and std should be calculated across all the training samples before returning a single sample from https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L73, like this:

# mean and std are computed over all the training samples
return (wav.astype(np.float32) - mean) / (std + 1e-12), label

right?

b04901014 commented on August 12, 2024

No. That is another way of doing normalization, used for spectral-based features.

For raw audio, we can do it inside each sample, where the statistics are calculated and applied per sample. This normalizes the average volume (std) and the DC offset (mean), which shouldn't affect the emotion information.
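To make the distinction concrete, here is a small sketch of both schemes (function names are mine, illustrative only):

import numpy as np

def normalize_per_sample(wav: np.ndarray) -> np.ndarray:
    # Raw-audio case: statistics come from this utterance alone
    return (wav - wav.mean()) / (wav.std() + 1e-12)

def normalize_global(wav: np.ndarray, mean: float, std: float) -> np.ndarray:
    # Spectral-feature style: mean/std precomputed over the whole training set
    return (wav - mean) / (std + 1e-12)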

Gpwner commented on August 12, 2024

> No. That is another way of doing normalization, used for spectral-based features.
>
> For raw audio, we can do it inside each sample, where the statistics are calculated and applied per sample. This normalizes the average volume (std) and the DC offset (mean), which shouldn't affect the emotion information.

So the __getitem__ method (https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L68) should look like this:

    def __getitem__(self, i):
        dataname = self.dataset[i]
        wav, _sr = sf.read(dataname)
        _label = self.label[self.datasetbase[i]]
        label = self.labeldict[_label]
        wav = wav.astype(np.float32)
        # Per-utterance normalization: zero mean, unit variance
        wav = (wav - wav.mean()) / (wav.std() + 1e-12)
        return wav, label

But the result didn't get better; here is the training log:
V_FT.zip
By the way, I have changed the command to:

python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul  --max_epochs=30

b04901014 commented on August 12, 2024

You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, e.g. --lr 2e-5.

qq.log

Here is a log where I used --lr 2e-5 --batch_size 32 on your dataset with only 15 epochs. Again, you can see that my loss decreases within a few epochs, but yours does not. So you should tune the hyper-parameters accordingly.

Also, I would imagine it requires many more epochs to converge. You can see that 15 epochs only gets the loss to about 0.9. Ideally we want the training loss to overfit down to about 0.2~0.3.

Gpwner commented on August 12, 2024

> You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, e.g. --lr 2e-5.
>
> qq.log
>
> Here is a log where I used --lr 2e-5 --batch_size 32 on your dataset with only 15 epochs. Again, you can see that my loss decreases within a few epochs, but yours does not. So you should tune the hyper-parameters accordingly.
>
> Also, I would imagine it requires many more epochs to converge. You can see that 15 epochs only gets the loss to about 0.9. Ideally we want the training loss to overfit down to about 0.2~0.3.

Got it. So how can we decide whether a dataset should be normalized or not?
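One rough heuristic (my own sketch, not something from the repo) is to look at the per-file statistics: if the means are already near zero and the stds sit at a similar scale across files, per-utterance normalization is unlikely to change much.

import os
import numpy as np
import soundfile as sf

datadir = "Dataset/combineData16KHz/"  # hypothetical path
means, stds = [], []
for fname in os.listdir(datadir):
    if fname.endswith(".wav"):
        wav, _ = sf.read(os.path.join(datadir, fname))
        means.append(float(np.mean(wav)))
        stds.append(float(np.std(wav)))

# A large spread in std (volume) or nonzero means (DC offset) suggests normalizing
print(f"file means: {np.mean(means):.4f} +/- {np.std(means):.4f}")
print(f"file stds:  {np.mean(stds):.4f} +/- {np.std(stds):.4f}")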

Gpwner commented on August 12, 2024

I think maybe we don't need the normalization. I just ran the code without normalization, with a batch size of 64:

python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul --max_epochs=30 --lr 2e-5

After 30 epochs the loss decreased to 0.422; here is the training log:
V_FT_NO_NORMALIZITION.zip

b04901014 commented on August 12, 2024

Yeah, maybe it's just the learning rate that matters. Hyper-parameters should be tuned for each dataset.
