Hmm, shouldn't be that hard, since the model itself is only trained on the balanced subset.
The training script for this repo will probably come with the follow-up work.
Did you maybe forget to balance each batch during training? That has a significant influence.
Just for reference, on my AudioSet evaluation subset (only 13k samples), I get an mAP of 0.095.
Also, maybe your feature preprocessing is problematic?
@RicherMans Thanks for your reply. The AudioSet balanced subset contains more than 20k audios, but I only want to separate 'Speech' from other audio, so I randomly chose about 4k 'Speech' clips and 4k other clips, giving me about 8k training samples. Does 'balance each batch during training' mean that for every batch I have to make sure the ratio of 'Speech' to other audio is 1:1? Would that work for me?
Another question: if I reproduce GPV-f (527 classes), how do I balance each batch (when the batch size of 64 is less than 527)?
@RicherMans Thanks for your reply. The AudioSet balanced subset contains more than 20k audios, but I only want to separate 'Speech' from other audio, so I randomly chose about 4k 'Speech' clips and 4k other clips, giving me about 8k training samples.
Ohoh, no, that wasn't the entire point of the work. The problem with using only those 4k speech and 4k non-speech samples is that the "noise" class ends up with much too large a variance.
That is actually one of the points of the work: common VAD methods are trained with a single noise class, which is likely to be "huge".
Also, your model wouldn't be able to fully distinguish between the individual events either.
So just train on all 20k samples of AudioSet.
In this work, GPV-B simply replaced all occurrences of non-speech labels with "noise/non-speech". After replacing, you will notice that there are only two event label combinations: (Speech, Noise) and (Noise); Speech never occurs independently in that dataset.
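To make that label collapse concrete, here is a small NumPy sketch of the remapping described above; the "Speech" column index and the helper name are illustrative, not from the repo:

```python
import numpy as np

# Hypothetical multi-hot labels: rows are clips, columns are AudioSet events.
# Assume column 0 is "Speech"; all other columns are non-speech events.
SPEECH_IDX = 0

def to_gpvb_labels(multi_hot: np.ndarray) -> np.ndarray:
    """Collapse multi-hot event labels to (Speech, Noise) pairs."""
    speech = multi_hot[:, SPEECH_IDX]
    # Any active non-speech event becomes the single "Noise" class
    noise = (np.delete(multi_hot, SPEECH_IDX, axis=1).sum(axis=1) > 0).astype(int)
    return np.stack([speech, noise], axis=1)

labels = np.array([[1, 0, 1],   # Speech + one noise event -> (1, 1)
                   [0, 1, 1]])  # two noise events          -> (0, 1)
print(to_gpvb_labels(labels))
```

With the balanced subset, this yields exactly the two combinations mentioned above, since speech clips there always co-occur with some non-speech event.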
Does 'balance each batch during training' mean that for every batch I have to make sure the ratio of 'Speech' to other audio is 1:1? Would that work for me?
Actually, I personally didn't care about balancing 'Speech' directly; just make sure each batch is reasonably balanced.
Another question: if I reproduce GPV-f (527 classes), how do I balance each batch (when the batch size of 64 is less than 527)?
It's a multi-label, multi-output problem, so balancing as in standard classification problems is impossible.
You will always sample a majority class appearing together with a minority class, so you can't guarantee that each class is seen only once per batch.
However, what I mean by "balancing" here is to roughly balance the dataset, such that within a batch and across subsequent batches, each class appears at least once.
So, say, in the first batch classes 1, 2, 3 are trained, while in the following batch it's 4, 5, 6, and so on. Of course, you can't 100% guarantee that this will always be the case.
I'll just paste a minimal working example of my own "minimum occupancy sampler".
The input is just a 2D NumPy array named labels with shape (N, E), where N is the overall number of samples (your data size) and E is the number of classes (e.g., 527). When used on the unbalanced AudioSet, that array becomes too large to store densely in memory, so I used sparse matrices, but you can just ignore that part.
Have fun.
```python
import numpy as np
import scipy.sparse
import torch.utils.data as tdata


class MinimumOccupancySampler(tdata.Sampler):
    """
    MinimumOccupancySampler
    Samples at least one instance from each class sequentially.
    """
    def __init__(self, labels, sampling_mode='same', random_state=1):
        self.labels = labels
        data_samples, n_labels = labels.shape
        label_to_idx_list, label_to_length = [], []
        self.random_state = np.random.RandomState(seed=random_state)
        for lb_idx in range(n_labels):
            label_selection = labels[:, lb_idx]
            if scipy.sparse.issparse(label_selection):
                label_selection = label_selection.toarray()
            label_indexes = np.where(label_selection == 1)[0]
            self.random_state.shuffle(label_indexes)
            label_to_length.append(len(label_indexes))
            label_to_idx_list.append(label_indexes)
        self.longest_seq = max(label_to_length)
        self.data_source = np.zeros((self.longest_seq, len(label_to_length)),
                                    dtype=np.uint32)
        # Each column represents one "single instance per class" data piece
        for ix, leng in enumerate(label_to_length):
            # Fill in only the "real" samples first
            self.data_source[:leng, ix] = label_to_idx_list[ix]
        self.label_to_idx_list = label_to_idx_list
        self.label_to_length = label_to_length
        if sampling_mode == 'same':
            self.data_length = data_samples
        elif sampling_mode == 'over':  # Sample all items
            self.data_length = int(np.prod(self.data_source.shape))

    def _reshuffle(self):
        # Fill each column's leftover slots with random repeats of that class
        for ix, leng in enumerate(self.label_to_length):
            leftover = self.longest_seq - leng
            random_idxs = self.random_state.randint(leng, size=leftover)
            self.data_source[leng:, ix] = self.label_to_idx_list[ix][random_idxs]

    def __iter__(self):
        # Before each epoch, reshuffle the random indices
        self._reshuffle()
        n_samples = len(self.data_source)
        random_indices = self.random_state.permutation(n_samples)
        data = np.concatenate(
            self.data_source[random_indices])[:self.data_length]
        return iter(data)

    def __len__(self):
        return self.data_length
```
@RicherMans When I applied your Sampler, the 527-class task works better, but the binary task (Speech vs. non-Speech) was worse than your GPV-b. For training and testing, I mapped the label (speech, noises) to (1, 1) and (noises) to (0, 1), where the first number stands for the presence of speech and the second for noise. The original 527-class labels were used with the Sampler to ensure each batch is balanced.
So there are quite a few hyperparameters outside my "direct" control.
I usually do a stratified(-ish) split of the available 20k data into 90% train and 10% CV.
The split is governed by a predefined seed. For these experiments, I ran different seeds, which usually lead to different results.
You won't be able to reproduce my results 1:1, i.e., obtain identical models, but you should definitely get something similar.
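The split script itself isn't published here, but a stratified 90/10 split on a single binary flag can be sketched in plain NumPy; the helper and the toy labels below are illustrative, not the author's code:

```python
import numpy as np

def stratified_split(labels, cv_frac=0.1, seed=1):
    """Split indices so each label value keeps roughly the same ratio."""
    rng = np.random.RandomState(seed)  # predefined seed, as in the thread
    train, cv = [], []
    for value in np.unique(labels):
        idx = np.where(labels == value)[0]
        rng.shuffle(idx)
        n_cv = int(round(len(idx) * cv_frac))
        cv.extend(idx[:n_cv])
        train.extend(idx[n_cv:])
    return np.array(train), np.array(cv)

labels = np.array([1] * 900 + [0] * 100)  # toy speech / non-speech flags
train_idx, cv_idx = stratified_split(labels, cv_frac=0.1, seed=1)
print(len(train_idx), len(cv_idx))  # 900 100
```

Changing the seed changes which clips land in CV, which is one source of the run-to-run variation in the results below.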
Just for reference, my binary results (macro-F1) for four different seeds on the Aurora clean scenario are:
Run | macro-F1 |
---|---|
1 | 91.368 |
2 | 86.242 |
3 | 85.190 |
4 | 94.116 |
I am well aware of the influence of training data and sampling, specifically for a binary classification task.
If your macro-F1 is around 85+, then it's all good.
it was worse than your GPV-b. For training and testing,
I never published training results? You mean you just evaluated my model on your dataset?
Also, what dataset are you using? Aurora 4, or a custom one?
@RicherMans I split the 20k data into 90% train and 10% CV and used BCE loss as my loss function. I tested the models on my own recorded real data; each input was limited to 20-50 frames to check the results. Your pretrained GPV-b and GPV-f models performed much better than my models at the frame level, but I don't know why. Maybe I should run more tests to reproduce similar results.
I tested the models on my own recorded real data; each input was limited to 20-50 frames to check the results.
Sorry, I don't get that. Do you perform cross-validation every 20-50 batches?
What's your input feature? 40 ms / 20 ms LMS?
Also, did you use double thresholding as the post-processing method?
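For readers unfamiliar with double thresholding: it is the common hysteresis-style post-processing for sound event detection, where frames above a high threshold seed a detection that is then extended over neighboring frames above a low threshold. A minimal NumPy sketch with illustrative thresholds (not necessarily the values used in this work):

```python
import numpy as np

def double_threshold(probs, hi=0.75, lo=0.2):
    """Mark frames above `hi`, then extend each detection to the
    surrounding frames that stay above `lo` (hysteresis)."""
    hard = probs > hi
    soft = probs > lo
    out = np.zeros_like(probs, dtype=bool)
    for seed in np.where(hard)[0]:
        left = seed
        while left > 0 and soft[left - 1]:
            left -= 1
        right = seed
        while right + 1 < len(probs) and soft[right + 1]:
            right += 1
        out[left:right + 1] = True
    return out

probs = np.array([0.1, 0.3, 0.8, 0.6, 0.3, 0.1, 0.5])
print(double_threshold(probs).astype(int))  # [0 1 1 1 1 0 0]
```

Note that the last frame (0.5) stays inactive: it exceeds the low threshold but is not connected to any high-threshold seed.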
Also:
the original label (527 classes) was used to apply the Sampler to ensure I have each balanced batch.
That seems wrong: you should always apply the sampling strategy based on your own labels, not on some other labels. For binary classification, just sample according to (1, 1) and (0, 1).
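One way to sample according to those two label combinations is to weight each clip inversely to its combination's frequency; below is a sketch using PyTorch's WeightedRandomSampler (the toy labels are illustrative, and this is one possible balancing scheme, not necessarily the author's):

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)  # repeatable draws

# Hypothetical binary labels: (speech, noise) pairs as discussed above
labels = np.array([[1, 1]] * 3 + [[0, 1]] * 9)  # 3 speech clips, 9 noise-only

# Weight each clip inversely to its label combination's frequency, so that
# (1, 1) and (0, 1) clips are drawn at roughly equal overall rates.
combos, inverse, counts = np.unique(labels, axis=0,
                                    return_inverse=True, return_counts=True)
weights = 1.0 / counts[inverse]
sampler = WeightedRandomSampler(torch.as_tensor(weights),
                                num_samples=len(labels), replacement=True)
drawn = np.array(list(sampler))
print(np.bincount(inverse[drawn], minlength=len(combos)))  # roughly balanced
```

Passing this sampler to a DataLoader then yields batches where the two combinations appear at comparable rates.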
@RicherMans I mean that during training the input clip was trimmed to 10 s, while during testing the input clip was in the range of 20 ms to 50 ms; when the input clip was 20 ms, the input feature was 20 ms of LMS.
I mean that during training the input clip was trimmed to 10 s, while during testing the input clip was in the range of 20 ms to 50 ms; when the input clip was 20 ms, the input feature was 20 ms of LMS.
Sorry, I don't get your meaning here.
Do training features differ from testing ones?
Do you input a single frame at a time during testing?
Yes, the training audio duration differs from the testing duration. Training clips are 10 s; with a frame size of 20 ms that's about 500 frames, so the input feature is 500 × 64. For testing, if the audio is 1 s, the input feature is 50 × 64; if it is 500 ms, the input feature is 25 × 64.
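The frame-count arithmetic above can be captured in a tiny helper; the function name and the non-overlapping 20 ms frame assumption are illustrative:

```python
# Hypothetical helper: LMS feature shape for a clip, assuming
# non-overlapping 20 ms frames and 64 mel bands as in the discussion.
def feature_shape(duration_s, frame_ms=20, n_mels=64):
    return int(duration_s * 1000 / frame_ms), n_mels

print(feature_shape(10.0))  # (500, 64) -> training clips
print(feature_shape(1.0))   # (50, 64)
print(feature_shape(0.5))   # (25, 64)
```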
Yes, the training audio duration differs from the testing duration. Training clips are 10 s; with a frame size of 20 ms that's about 500 frames, so the input feature is 500 × 64. For testing, if the audio is 1 s, the input feature is 50 × 64; if it is 500 ms, the input feature is 25 × 64.
I see, that shouldn't influence performance.
So far, though, I am unsure about your training.
If you can provide some logs, that would be helpful.
@RicherMans Can I have your email, if it's convenient for you? I'll send you the code.
For the speech (label = (1, 1)) and non-speech (label = (0, 1)) task, the first batch's train loss = 0.28 and validation loss = 0.26; after about 8-10 epochs, the validation loss began to increase from 0.20 while the train loss continued to decrease.
Can I have your email, if it's convenient for you? I'll send you the code.
Umm, if possible just add me on WeChat; the username is the same as here (richermans).