b04901014 / ft-w2v2-ser
Official implementation for the paper Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition
License: MIT License
Thanks very much!
if x == 'neutral':
    return 'neutral'
elif x == 'joy':
    return 'happy'
elif x == 'anger':
    return 'anger'
elif x == 'sadness':
    return 'sad'
else:
    return '-1'
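For reference, the branch chain above collapses MELD emotion labels onto four IEMOCAP-style classes and is equivalent to a small lookup table. A dict-based version (the function name here is my own, not from the repo) behaves like:

```python
def meld_to_iemocap(x):
    """Map a MELD emotion label onto the 4-class IEMOCAP-style set;
    unknown labels map to '-1' (to be filtered out)."""
    mapping = {'neutral': 'neutral', 'joy': 'happy',
               'anger': 'anger', 'sadness': 'sad'}
    return mapping.get(x, '-1')

labels = ['joy', 'sadness', 'fear', 'anger']
print([meld_to_iemocap(x) for x in labels])  # ['happy', 'sad', '-1', 'anger']
```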
3. Then I extract the wav audio from the mp4 files with this shell script:
#!/bin/bash
files=$(ls "$1")
for filename in $files
do
    echo "${filename%.*}"
    ffmpeg -i "$1/${filename%.*}.mp4" -f wav -ar 16000 -ac 1 "$2/${filename%.*}.wav"
done
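To sanity-check the extracted files, the Python stdlib `wave` module can confirm the 16 kHz mono format the script requests. A small sketch (it writes a dummy file purely for demonstration; in real use you would point `check_format` at the ffmpeg output directory):

```python
import wave

def check_format(path, rate=16000, channels=1):
    """Return True if the wav file matches the expected sample rate and channel count."""
    with wave.open(path, 'rb') as w:
        return w.getframerate() == rate and w.getnchannels() == channels

# Demo on a synthetic file: 0.1 s of 16-bit mono silence at 16 kHz.
with wave.open('demo.wav', 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b'\x00\x00' * 1600)

print(check_format('demo.wav'))  # True
```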
4. Then I select samples to train the model. Here are my sample statistics:
Statistics of training splits:
----Involved Emotions----
sad: 450 examples
anger: 450 examples
neutral: 450 examples
happy: 450 examples
Total 1800 examples
----Examples Involved----
Statistics of testing splits:
----Involved Emotions----
sad: 233 examples
anger: 233 examples
happy: 233 examples
neutral: 233 examples
Total 932 examples
----Examples Involved----
5. Then I run https://github.com/b04901014/FT-w2v2-ser/blob/main/bin/run_exp_iemocap.sh
Unfortunately I got an error like this:
f"`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}"
So I modified the code in https://github.com/b04901014/FT-w2v2-ser/blob/main/modules/FeatureFuser.py from
# apply SpecAugment along time axis
batch_size, sequence_length, hidden_size = wav2vec_z.size()
mask_time_indices = _compute_mask_indices(
    (batch_size, sequence_length),
    self.mask_time_prob,
    self.mask_time_length,
    min_masks=2,
    device=x.device
)
to
# apply SpecAugment along time axis
batch_size, sequence_length, hidden_size = wav2vec_z.size()
mask_length = min(self.mask_time_length, sequence_length)
mask_time_indices = _compute_mask_indices(
    (batch_size, sequence_length),
    self.mask_time_prob,
    mask_length,
    min_masks=2,
    device=x.device
)
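For what it's worth, that workaround simply caps the SpecAugment mask span at the utterance length, so very short clips no longer trip the length check. A toy, framework-free sketch of the idea (this is an illustration of the clamp, not the `transformers` implementation):

```python
import numpy as np

def safe_mask_indices(batch_size, sequence_length, mask_prob, mask_length, min_masks=2):
    """Toy SpecAugment time masking: clamp mask_length to the sequence length
    so short utterances cannot violate the mask_length <= sequence_length check."""
    mask_length = min(mask_length, sequence_length)  # the clamp from the fix above
    num_masks = max(min_masks, int(mask_prob * sequence_length / mask_length))
    rng = np.random.default_rng(0)
    mask = np.zeros((batch_size, sequence_length), dtype=bool)
    for b in range(batch_size):
        starts = rng.integers(0, max(1, sequence_length - mask_length + 1), num_masks)
        for s in starts:
            mask[b, s:s + mask_length] = True
    return mask

# A 5-frame clip with a nominal mask span of 10 no longer crashes:
m = safe_mask_indices(2, 5, 0.05, 10)
print(m.shape)  # (2, 5)
```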
6. Finally I got the result.
Here is the P-TAPT.log:
+++ SUMMARY +++
Mean UAR [%]: 39.21
Fold Std. UAR [%]: 0.00
Fold Median UAR [%]: 39.21
Run Std. UAR [%]: 0.62
Run Median UAR [%]: 38.84
Mean WAR [%]: 39.21
Fold Std. WAR [%]: 0.00
Fold Median WAR [%]: 39.21
Run Std. WAR [%]: 0.62
Run Median WAR [%]: 38.84
Mean macroF1 [%]: 38.16
Fold Std. macroF1 [%]: 0.00
Fold Median macroF1 [%]: 38.16
Run Std. macroF1 [%]: 1.45
Run Median macroF1 [%]: 37.73
Mean microF1 [%]: 39.95
Fold Std. microF1 [%]: 0.00
Fold Median microF1 [%]: 39.95
Run Std. microF1 [%]: 0.38
Run Median microF1 [%]: 39.93
And the confusion matrix:
[[358. 310. 207. 290.]
[145. 512. 230. 278.]
[175. 299. 370. 321.]
[156. 231. 191. 587.]]
The result is quite bad. It would be great if you could help.
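As a sanity check on the log above: UAR is the unweighted mean of per-class recalls, and WAR is overall accuracy. Assuming rows of the matrix are ground truth and columns are predictions, both come out to 39.21%, matching the summary (they coincide here because every class has the same number of test examples):

```python
import numpy as np

# Confusion matrix from the P-TAPT log (assumed: rows = ground truth,
# columns = predictions).
cm = np.array([[358., 310., 207., 290.],
               [145., 512., 230., 278.],
               [175., 299., 370., 321.],
               [156., 231., 191., 587.]])

recalls = np.diag(cm) / cm.sum(axis=1)  # per-class recall
uar = recalls.mean()                    # unweighted average recall
war = np.diag(cm).sum() / cm.sum()      # weighted accuracy (overall)

print(round(uar * 100, 2), round(war * 100, 2))  # 39.21 39.21
```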
Got the following when running run_downstream_custom_multiple_fold.py
RuntimeError: CUDA out of memory. Tried to allocate 730.00 MiB (GPU 0; 23.70 GiB total capacity; 21.65 GiB already allocated; 426.81 MiB free; 21.81 GiB reserved in total by PyTorch)
I have NVIDIA GeForce RTX 3090 with 24GB.
Any insights on how to work around it?
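One generic workaround besides lowering the batch size is gradient accumulation: run smaller micro-batches and step the optimizer every k of them, which preserves the effective batch size while lowering peak activation memory (in PyTorch Lightning this is the `accumulate_grad_batches` Trainer argument). A framework-free numpy sketch of why the averaged micro-batch gradients match the full-batch gradient (illustrative only, not the repo's code):

```python
import numpy as np

def batch_grad(w, x, y):
    """Gradient of mean-squared-error loss for a scalar linear model y_hat = w*x."""
    return np.mean(2 * (w * x - y) * x)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
y = rng.normal(size=16)
w = 0.5

full = batch_grad(w, x, y)  # one 16-sample batch
micro = np.mean([batch_grad(w, x[i:i + 4], y[i:i + 4])  # four 4-sample micro-batches
                 for i in range(0, 16, 4)])

# Same gradient, roughly a quarter of the peak activation memory.
print(np.isclose(full, micro))  # True
```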
Hello, I want to know the value of NUM_EXP, because I cannot obtain a correct confusion matrix with any of the fine-tuning methods.
The results are:
Testing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1085/1085 [00:32<00:00, 33.17it/s]
[[ 0. 2519. 916. 0.]
[ 0. 3058. 1112. 0.]
[ 0. 4224. 1536. 0.]
[ 0. 2134. 776. 0.]]
Saved figure to downstream/checkpoints/custom/confmat.png.
cluster.py accepts either wav2vec or wav2vec2 as the model_type. Why did you make wav2vec2 the default model? Could you have used HuBERT or other transformer-based variants?

Thanks for sharing this work. I am new to SER and am currently studying this code. I would like to ask how to test it. Is there any reference code?
cp: cannot stat 'Dataset/IEMOCAP/labels_sess/label_{SESSION_TO_TEST}.json': No such file or directory
python run_pretrain.py --datadir Audio_Dir \
    --labelpath Label_Path \
    --labeling_method hard \
    --saving_path Saving_Path \
    --training_step 10000 \
    --save_top_k 1 \
    --wav2vecpath Wav2vecCKPT \
    --precision 16
Here, should --labelpath point to a per-session label file or to metalabel.json?
I don't see any usage of https://github.com/b04901014/FT-w2v2-ser/blob/main/pretrain/trainer.py#L252; it seems to be unused.
We cannot reproduce the TAPT results, and our pretraining loss is `nan` when running `FT-w2v2-ser-main\run_baseline_continueFT.py`. Can you help us solve this issue?
Hi, while running run_downstream_custom_multiple_fold.py I ran into
TypeError: _compute_mask_indices() got an unexpected keyword argument 'device'
The transformers package's modeling_wav2vec2.py also shows that _compute_mask_indices() does not take a device argument.
Am I missing something here?
Any help is appreciated!
I'm wondering how many GPUs you used. I used 2 V100s (16 GB) and still couldn't run the last phase of the code until I reduced the batch size to 48. I've made sure I use both GPUs by modifying the following code:
trainer = Trainer(
    precision=args.precision,
    amp_backend='native',
    callbacks=[checkpoint_callback] if hasattr(model, 'valid_met') else None,
    checkpoint_callback=hasattr(model, 'valid_met'),
    resume_from_checkpoint=None,
    check_val_every_n_epoch=1,
    max_epochs=hparams.max_epochs,
    num_sanity_val_steps=2 if hasattr(model, 'valid_met') else 0,
    gpus=-1,
    strategy='dp',  # multiple GPUs, 1 machine
    logger=False
)