b04901014 / ft-w2v2-ser
Official implementation for the paper Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition
License: MIT License
Thanks very much!
if x == 'neutral':
    return 'neutral'
elif x == 'joy':
    return 'happy'
elif x == 'anger':
    return 'anger'
elif x == 'sadness':
    return 'sad'
else:
    return '-1'
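For reference, the branch chain above collapses MELD emotion labels onto four IEMOCAP-style classes and is equivalent to a small lookup table. A dict-based version (the function name here is my own, not from the repo) behaves like:

```python
def meld_to_iemocap(x):
    """Map a MELD emotion label onto the 4-class IEMOCAP-style set;
    unknown labels map to '-1' (to be filtered out)."""
    mapping = {'neutral': 'neutral', 'joy': 'happy',
               'anger': 'anger', 'sadness': 'sad'}
    return mapping.get(x, '-1')

labels = ['joy', 'sadness', 'fear', 'anger']
print([meld_to_iemocap(x) for x in labels])  # ['happy', 'sad', '-1', 'anger']
```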
3. Then I extract the wav audio from the mp4 files with this shell script:
#!/bin/bash
files=$(ls "$1")
for filename in $files
do
    echo "${filename%.*}"
    ffmpeg -i "$1/${filename%.*}.mp4" -f wav -ar 16000 -ac 1 "$2/${filename%.*}.wav"
done
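To sanity-check the extracted files, the Python stdlib `wave` module can confirm the 16 kHz mono format the script requests. A small sketch (it writes a dummy file purely for demonstration; in real use you would point `check_format` at the ffmpeg output directory):

```python
import wave

def check_format(path, rate=16000, channels=1):
    """Return True if the wav file matches the expected sample rate and channel count."""
    with wave.open(path, 'rb') as w:
        return w.getframerate() == rate and w.getnchannels() == channels

# Demo on a synthetic file: 0.1 s of 16-bit mono silence at 16 kHz.
with wave.open('demo.wav', 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b'\x00\x00' * 1600)

print(check_format('demo.wav'))  # True
```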
4. Then I select samples to train the model. Here are my sample statistics:
Statistics of training splits:
----Involved Emotions----
sad: 450 examples
anger: 450 examples
neutral: 450 examples
happy: 450 examples
Total 1800 examples
----Examples Involved----
Statistics of testing splits:
----Involved Emotions----
sad: 233 examples
anger: 233 examples
happy: 233 examples
neutral: 233 examples
Total 932 examples
----Examples Involved----
5. Then I run https://github.com/b04901014/FT-w2v2-ser/blob/main/bin/run_exp_iemocap.sh
Unfortunately I got an error like this:
f"`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}"
So I modified the code in https://github.com/b04901014/FT-w2v2-ser/blob/main/modules/FeatureFuser.py from
# apply SpecAugment along time axis
batch_size, sequence_length, hidden_size = wav2vec_z.size()
mask_time_indices = _compute_mask_indices(
    (batch_size, sequence_length),
    self.mask_time_prob,
    self.mask_time_length,
    min_masks=2,
    device=x.device
)
to
# apply SpecAugment along time axis
batch_size, sequence_length, hidden_size = wav2vec_z.size()
mask_length = min(self.mask_time_length, sequence_length)
mask_time_indices = _compute_mask_indices(
    (batch_size, sequence_length),
    self.mask_time_prob,
    mask_length,
    min_masks=2,
    device=x.device
)
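For what it's worth, that workaround simply caps the SpecAugment mask span at the utterance length, so very short clips no longer trip the length check. A toy, framework-free sketch of the idea (this is an illustration of the clamp, not the `transformers` implementation):

```python
import numpy as np

def safe_mask_indices(batch_size, sequence_length, mask_prob, mask_length, min_masks=2):
    """Toy SpecAugment time masking: clamp mask_length to the sequence length
    so short utterances cannot violate the mask_length <= sequence_length check."""
    mask_length = min(mask_length, sequence_length)  # the clamp from the fix above
    num_masks = max(min_masks, int(mask_prob * sequence_length / mask_length))
    rng = np.random.default_rng(0)
    mask = np.zeros((batch_size, sequence_length), dtype=bool)
    for b in range(batch_size):
        starts = rng.integers(0, max(1, sequence_length - mask_length + 1), num_masks)
        for s in starts:
            mask[b, s:s + mask_length] = True
    return mask

# A 5-frame clip with a nominal mask span of 10 no longer crashes:
m = safe_mask_indices(2, 5, 0.05, 10)
print(m.shape)  # (2, 5)
```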
6. Finally I got the result.
Here is the P-TAPT.log:
+++ SUMMARY +++
Mean UAR [%]: 39.21
Fold Std. UAR [%]: 0.00
Fold Median UAR [%]: 39.21
Run Std. UAR [%]: 0.62
Run Median UAR [%]: 38.84
Mean WAR [%]: 39.21
Fold Std. WAR [%]: 0.00
Fold Median WAR [%]: 39.21
Run Std. WAR [%]: 0.62
Run Median WAR [%]: 38.84
Mean macroF1 [%]: 38.16
Fold Std. macroF1 [%]: 0.00
Fold Median macroF1 [%]: 38.16
Run Std. macroF1 [%]: 1.45
Run Median macroF1 [%]: 37.73
Mean microF1 [%]: 39.95
Fold Std. microF1 [%]: 0.00
Fold Median microF1 [%]: 39.95
Run Std. microF1 [%]: 0.38
Run Median microF1 [%]: 39.93
And the confusion matrix:
[[358. 310. 207. 290.]
[145. 512. 230. 278.]
[175. 299. 370. 321.]
[156. 231. 191. 587.]]
The result is quite bad. It would be great if you could help.
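As a sanity check on the log above: UAR is the unweighted mean of per-class recalls, and WAR is overall accuracy. Assuming rows of the matrix are ground truth and columns are predictions, both come out to 39.21%, matching the summary (they coincide here because every class has the same number of test examples):

```python
import numpy as np

# Confusion matrix from the P-TAPT log (assumed: rows = ground truth,
# columns = predictions).
cm = np.array([[358., 310., 207., 290.],
               [145., 512., 230., 278.],
               [175., 299., 370., 321.],
               [156., 231., 191., 587.]])

recalls = np.diag(cm) / cm.sum(axis=1)  # per-class recall
uar = recalls.mean()                    # unweighted average recall
war = np.diag(cm).sum() / cm.sum()      # weighted accuracy (overall)

print(round(uar * 100, 2), round(war * 100, 2))  # 39.21 39.21
```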
Got the following when running run_downstream_custom_multiple_fold.py
RuntimeError: CUDA out of memory. Tried to allocate 730.00 MiB (GPU 0; 23.70 GiB total capacity; 21.65 GiB already allocated; 426.81 MiB free; 21.81 GiB reserved in total by PyTorch)
I have NVIDIA GeForce RTX 3090 with 24GB.
Any insights on how to work around it?
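One generic workaround besides lowering the batch size is gradient accumulation: run smaller micro-batches and step the optimizer every k of them, which preserves the effective batch size while lowering peak activation memory (in PyTorch Lightning this is the `accumulate_grad_batches` Trainer argument). A framework-free numpy sketch of why the averaged micro-batch gradients match the full-batch gradient (illustrative only, not the repo's code):

```python
import numpy as np

def batch_grad(w, x, y):
    """Gradient of mean-squared-error loss for a scalar linear model y_hat = w*x."""
    return np.mean(2 * (w * x - y) * x)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
y = rng.normal(size=16)
w = 0.5

full = batch_grad(w, x, y)  # one 16-sample batch
micro = np.mean([batch_grad(w, x[i:i + 4], y[i:i + 4])  # four 4-sample micro-batches
                 for i in range(0, 16, 4)])

# Same gradient, roughly a quarter of the peak activation memory.
print(np.isclose(full, micro))  # True
```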
Hello, I want to know the value of NUM_EXP, because I cannot obtain a correct confusion matrix with any of the fine-tuning methods.
The results are:
Testing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1085/1085 [00:32<00:00, 33.17it/s]
[[ 0. 2519. 916. 0.]
[ 0. 3058. 1112. 0.]
[ 0. 4224. 1536. 0.]
[ 0. 2134. 776. 0.]]
Saved figure to downstream/checkpoints/custom/confmat.png.
cluster.py accepts either wav2vec or wav2vec2 as the model_type. Why did you make wav2vec2 the default model? Could you have used HuBERT or other transformer-based variants?

Thanks for sharing this work. I am new to SER and am currently studying this code. I would like to ask how to test it. Is there any reference code?
cp: cannot stat 'Dataset/IEMOCAP/labels_sess/label_{SESSION_TO_TEST}.json': No such file or directory
python run_pretrain.py --datadir Audio_Dir \
    --labelpath Label_Path \
    --labeling_method hard \
    --saving_path Saving_Path \
    --training_step 10000 \
    --save_top_k 1 \
    --wav2vecpath Wav2vecCKPT \
    --precision 16
Here, should --labelpath point to a per-session label file or to metalabel.json?
I don't see any usage of https://github.com/b04901014/FT-w2v2-ser/blob/main/pretrain/trainer.py#L252; it seems to be unused.
We cannot reproduce the TAPT results, and our pretraining loss is `nan` when running `FT-w2v2-ser-main\run_baseline_continueFT.py`. Can you help us solve this issue?
Hi, while running run_downstream_custom_multiple_fold.py I ran into
TypeError: _compute_mask_indices() got an unexpected keyword argument 'device'
The transformers package's modeling_wav2vec2.py also shows that _compute_mask_indices() does not take a device argument.
Am I missing something here?
Any help is appreciated!
I'm wondering how many GPUs you used. I used 2 V100s (16 GB) and still couldn't run the last phase of the code until I reduced the batch size to 48. I've made sure I use both GPUs by modifying the following code:
trainer = Trainer(
    precision=args.precision,
    amp_backend='native',
    callbacks=[checkpoint_callback] if hasattr(model, 'valid_met') else None,
    checkpoint_callback=hasattr(model, 'valid_met'),
    resume_from_checkpoint=None,
    check_val_every_n_epoch=1,
    max_epochs=hparams.max_epochs,
    num_sanity_val_steps=2 if hasattr(model, 'valid_met') else 0,
    gpus=-1,
    strategy='dp',  # multiple GPUs, 1 machine
    logger=False
)