Comments (13)
Here is the part_of_friends.json:
part_of_friends.zip
from ft-w2v2-ser.
Some guidelines for debugging.
- Identify whether the problem comes from the proposed algorithms (TAPT, PTAPT) or from the original wav2vec 2.0 fine-tuning. Does V-FT yield similar results? Does the training loss overfit quickly, or is it still under-fitting?
- The number of training epochs and the number of clusters should be tuned to the size of the dataset. For instance, with a smaller dataset, increase the epochs or decrease the batch size, and lower the number of clusters for PTAPT.
- Some of the samples in your dataset seem to be broken:
f"`mask_length` has to be smaller than `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}"
indicates the input audio is shorter than about 10 * 0.01 = 0.1 s, which is an unreasonably short length for predicting emotion.
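A quick way to catch such broken clips before training is to scan the dataset for files shorter than that threshold. This is a hypothetical helper (the flat folder of .wav files and the 0.1 s cutoff are assumptions based on this thread), using only the standard library:

```python
import wave
from pathlib import Path

def find_short_clips(datadir, min_sec=0.1):
    """Return paths of .wav files shorter than min_sec seconds.

    Hypothetical pre-filtering helper; assumes PCM .wav files laid out
    flat in the folder passed via --datadir in this thread.
    """
    short = []
    for path in sorted(Path(datadir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            duration = w.getnframes() / w.getframerate()
        if duration < min_sec:
            short.append(str(path))
    return short
```

Dropping (or padding) the returned files should avoid the `mask_length` error.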
Also, some notes from my experience working with the MELD audio:
- The audio needs to be normalized in terms of mean and variance across the utterance; otherwise the loss may not go down.
- It takes more epochs to lower the loss.
V-FT is even worse. Am I missing anything?
I ran the Python command:
python run_downstream_custom_multiple_fold.py --precision 16 --datadir Dataset/combineData16KHz/ --labeldir PART_OF_FRIENDS/labels/ --saving_path VFT/downstreammul --outputfile VFT.log --max_epochs=30
Here is my result:
+++ SUMMARY +++
Mean UAR [%]: 25.00
Fold Std. UAR [%]: 0.00
Fold Median UAR [%]: 25.00
Run Std. UAR [%]: 0.00
Run Median UAR [%]: 25.00
Mean WAR [%]: 25.00
Fold Std. WAR [%]: 0.00
Fold Median WAR [%]: 25.00
Run Std. WAR [%]: 0.00
Run Median WAR [%]: 25.00
Mean macroF1 [%]: 10.00
Fold Std. macroF1 [%]: 0.00
Fold Median macroF1 [%]: 10.00
Run Std. macroF1 [%]: 0.00
Run Median macroF1 [%]: 10.00
Mean microF1 [%]: 10.00
Fold Std. microF1 [%]: 0.00
Fold Median microF1 [%]: 10.00
Run Std. microF1 [%]: 0.00
Run Median microF1 [%]: 10.00
The confusion matrix:
[[233. 0. 0. 0.]
[233. 0. 0. 0.]
[233. 0. 0. 0.]
[233. 0. 0. 0.]]
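For what it's worth, those summary numbers are exactly what you get when the model collapses to predicting a single class on a 4-way balanced test set. A quick NumPy check against the confusion matrix above (standard metric definitions, not the repo's own evaluation code):

```python
import numpy as np

# Confusion matrix from the log: rows = true class, columns = predicted class.
cm = np.array([[233, 0, 0, 0]] * 4, dtype=float)

recalls = np.diag(cm) / cm.sum(axis=1)                    # per-class recall
uar = recalls.mean()                                      # unweighted average recall
war = np.diag(cm).sum() / cm.sum()                        # overall accuracy
precisions = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)  # guard empty columns
f1 = np.where(precisions + recalls > 0,
              2 * precisions * recalls / (precisions + recalls), 0.0)
macro_f1 = f1.mean()

print(round(uar * 100, 2), round(war * 100, 2), round(macro_f1 * 100, 2))
# → 25.0 25.0 10.0, matching the summary above
```

So 25% UAR/WAR and 10% macro-F1 are the degenerate "always predict one class" scores here, confirming that training never got off the ground.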
The training log is attached:
V_FT.zip
By the way, when you say "the audio needs to be normalized", what is the recommended way? Thanks.
You can observe from the training log that the loss is not decreasing for V-FT, so the training is not even happening.
Something like:
wav = (wav - wav.mean()) / (wav.std() + 1e-12)
will normalize the audio signal.
Some datasets need this for the loss to go down, depending on the recording environment.
You may add it in the __getitem__
of the downstream dataloader.
But if you run TAPT/PTAPT, you'll also have to add it in the pretrain dataloader.
Or you can simply preprocess the audio files.
So the mean and std should be calculated across all the training samples before returning a single sample from https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L73, like this:
# mean and std are calculated over all the training samples
return (wav.astype(np.float32) - mean) / (std + 1e-12), label
right?
from ft-w2v2-ser.
No. That is another way of doing normalization, used for spectral-based features.
For raw audio, we can do it within each sample, where the statistics are calculated and normalized per sample. This normalizes the average volume (std) and the DC offset (mean), which shouldn't affect the emotion information.
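As a concrete (synthetic) illustration of that point, per-sample normalization fixes both the volume and the DC offset without any corpus-level statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "quiet" utterance: low volume (std 0.01) plus a DC offset of 0.3.
wav = 0.01 * rng.standard_normal(16000) + 0.3

norm = (wav - wav.mean()) / (wav.std() + 1e-12)
# norm now has (near-)zero mean and unit variance regardless of the original
# recording level; the waveform shape itself is unchanged (only shifted and
# rescaled), so the emotion-relevant content is preserved.
```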
So the __getitem__ method (https://github.com/b04901014/FT-w2v2-ser/blob/main/downstream/Custom/dataloader.py#L68) should look like this:
def __getitem__(self, i):
    dataname = self.dataset[i]
    wav, _sr = sf.read(dataname)
    _label = self.label[self.datasetbase[i]]
    label = self.labeldict[_label]
    wav = wav.astype(np.float32)
    wav = (wav - wav.mean()) / (wav.std() + 1e-12)
    return wav, label
But the result didn't get better. Here is the training log:
V_FT.zip
By the way, I have changed the command to:
python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul --max_epochs=30
You can still observe from the log that your loss is not decreasing. The learning rate needs to be lower, such as --lr 2e-5.
Here is some log where I use --lr 2e-5 --batch_size 32 on your dataset with only 15 epochs. Again, you can see that my loss decreases within a few epochs, but yours does not, so you should tune the hyper-parameters accordingly.
Also, I would imagine it requires many more epochs to converge. You can see that 15 epochs only get the loss to about 0.9. Ideally we want the training loss to overfit down to about 0.2~0.3.
Got it. So how can we decide whether a dataset should be normalized or not?
I think maybe we don't need the normalization. I just ran the code without normalization and with a batch size of 64:
python run_downstream_custom.py --precision 16 --datadir Dataset/combineData16KHz/ --labelpath PART_OF_FRIENDS/labels/part_of_friends.json --output_path VFT/downstreammul --max_epochs=30 --lr 2e-5
After 30 epochs the loss decreased to loss=0.422. Here is the training log:
V_FT_NO_NORMALIZITION.zip
Yeah, maybe it's just the learning rate that matters. Hyper-parameters should be tuned from dataset to dataset.