Comments (13)
Here is the part_of_friends.json:
part_of_friends.zip
from ft-w2v2-ser.
Some guidelines for debugging.
- Identify whether the problem comes from the proposed algorithms (TAPT, PTAPT) or from the original wav2vec 2.0 fine-tuning. Does V-FT yield similar results? Does the training loss overfit quickly, or is it still under-fitting?
- The number of training epochs and the number of clusters should be tuned to the size of the dataset. For instance, with a smaller dataset, increase the epochs or decrease the batch size, and lower the number of clusters for PTAPT.
- Some of the samples in your dataset seem to be broken:
f"`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}"
indicates the input audio is shorter than about 10 * 0.01 = 0.1 s, which is an unreasonably short length for predicting emotion.
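A quick way to catch such broken clips before training is to scan the dataset for files shorter than that threshold. This is a hypothetical helper (the flat folder of .wav files and the 0.1 s cutoff are assumptions based on this thread), using only the standard library:

```python
import wave
from pathlib import Path

def find_short_clips(datadir, min_sec=0.1):
    """Return paths of .wav files shorter than min_sec seconds.

    Hypothetical pre-filtering helper; assumes PCM .wav files laid out
    flat in the folder passed via --datadir in this thread.
    """
    short = []
    for path in sorted(Path(datadir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            duration = w.getnframes() / w.getframerate()
        if duration < min_sec:
            short.append(str(path))
    return short
```

Dropping (or padding) the returned files should avoid the `mask_length` error.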
Also, some notes from my experience working with the MELD audio:
- The audio needs to be normalized in terms of mean and variance across the utterance; otherwise the loss may not go down.
- It takes more epochs to lower the loss.
V-FT is even worse. Am I missing anything?
I ran the Python command:
python run_downstream_custom_multiple_fold.py --precision 16 --datadir Dataset/combineData16KHz/ --labeldir PART_OF_FRIENDS/labels/ --saving_path VFT/downstreammul --outputfile VFT.log --max_epochs=30
Here is my result:
+++ SUMMARY +++
Mean UAR [%]: 25.00
Fold Std. UAR [%]: 0.00
Fold Median UAR [%]: 25.00
Run Std. UAR [%]: 0.00
Run Median UAR [%]: 25.00
Mean WAR [%]: 25.00
Fold Std. WAR [%]: 0.00
Fold Median WAR [%]: 25.00
Run Std. WAR [%]: 0.00
Run Median WAR [%]: 25.00
Mean macroF1 [%]: 10.00
Fold Std. macroF1 [%]: 0.00
Fold Median macroF1 [%]: 10.00
Run Std. macroF1 [%]: 0.00
Run Median macroF1 [%]: 10.00
Mean microF1 [%]: 10.00
Fold Std. microF1 [%]: 0.00
Fold Median microF1 [%]: 10.00
Run Std. microF1 [%]: 0.00
Run Median microF1 [%]: 10.00
The confusion matrix:
[[233. 0. 0. 0.]
[233. 0. 0. 0.]
[233. 0. 0. 0.]
[233. 0. 0. 0.]]
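For what it's worth, those summary numbers are exactly what you get when the model collapses to predicting a single class on a 4-way balanced test set. A quick NumPy check against the confusion matrix above (standard metric definitions, not the repo's own evaluation code):

```python
import numpy as np

# Confusion matrix from the log: rows = true class, columns = predicted class.
cm = np.array([[233, 0, 0, 0]] * 4, dtype=float)

recalls = np.diag(cm) / cm.sum(axis=1)                    # per-class recall
uar = recalls.mean()                                      # unweighted average recall
war = np.diag(cm).sum() / cm.sum()                        # overall accuracy
precisions = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)  # guard empty columns
f1 = np.where(precisions + recalls > 0,
              2 * precisions * recalls / (precisions + recalls), 0.0)
macro_f1 = f1.mean()

print(round(uar * 100, 2), round(war * 100, 2), round(macro_f1 * 100, 2))
# → 25.0 25.0 10.0, matching the summary above
```

So 25% UAR/WAR and 10% macro-F1 are the degenerate "always predict one class" scores here, confirming that training never got off the ground.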
The training log is attached:
V_FT.zip
By the way, when you say "the audio needs to be normalized", what is the recommended way? Thanks.
You can observe from the training log that the loss is not decreasing for V-FT, so the training is not even happening.
Something like:
wav = (wav - wav.mean()) / (wav.std() + 1e-12)
will normalize the audio signal.
Some datasets need this for the loss to go down, depending on the recording environment.
You may add it in the __getitem__
of the downstream dataloader.
But if you run TAPT/PTAPT, you'll also have to add it in the pretrain dataloader.
Or you can simply preprocess the audio files.
So the mean and std should be calculated across all the training samples before returning a single sample from https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L73, like this:
# mean and std are calculated over all the training samples
return (wav.astype(np.float32) - mean) / (std + 1e-12), label
right?
from ft-w2v2-ser.
No. That is another way of doing normalization, used for spectral-based features.
For raw audio, we can do it within each sample, where the statistics are calculated and normalized per sample. This normalizes the average volume (std) and the DC offset (mean), which shouldn't affect the emotion information.
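As a concrete (synthetic) illustration of that point, per-sample normalization fixes both the volume and the DC offset without any corpus-level statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "quiet" utterance: low volume (std 0.01) plus a DC offset of 0.3.
wav = 0.01 * rng.standard_normal(16000) + 0.3

norm = (wav - wav.mean()) / (wav.std() + 1e-12)
# norm now has (near-)zero mean and unit variance regardless of the original
# recording level; the waveform shape itself is unchanged (only shifted and
# rescaled), so the emotion-relevant content is preserved.
```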
So the __getitem__ method (https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L68) should look like this:
def __getitem__(self, i):
    dataname = self.dataset[i]
    wav, _sr = sf.read(dataname)
    _label = self.label[self.datasetbase[i]]
    label = self.labeldict[_label]
    wav = wav.astype(np.float32)
    wav = (wav - wav.mean()) / (wav.std() + 1e-12)
    return wav, label
But the result didn't get better. Here is the training log:
V_FT.zip
By the way, I have changed the command to:
python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul --max_epochs=30
You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, such as --lr 2e-5.
Here is some log where I use --lr 2e-5 --batch_size 32 on your dataset with only 15 epochs. Again, you can see that my loss decreases within a few epochs, but yours does not, so you should tune the hyper-parameters accordingly.
Also, I would imagine it requires many more epochs to converge. You can see that 15 epochs only get the loss to about 0.9. Ideally we want the training loss to overfit down to about 0.2~0.3.
Got it. So how can we decide whether a dataset should be normalized or not?
I think maybe we don't need the normalization. I just ran the code without normalization and with a batch size of 64:
python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul --max_epochs=30 --lr 2e-5
After 30 epochs the loss decreased to loss=0.422. Here is the training log:
V_FT_NO_NORMALIZITION.zip
Yeah, maybe it's just the learning rate that matters. Hyper-parameters should be tuned from dataset to dataset.