asteroid-team / torch-audiomentations
Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.
License: MIT License
Ref #5 (comment)
Change the spectrum. Boost some frequencies, maybe attenuate some frequencies.
@iver56 - the assertion fails when comparing the PyTorch way of convolving against scipy's version. I have made one small change here to match the required arguments as specified in numpy's documentation.
I prefer a non-sox backend for compatibility reasons
We should come up with a different name that doesn't interfere with the nn.Module.parameters method
I just downloaded the EchoThief RIRs and naively assumed that find_audio_files would look into subfolders. It did not, and I got a nice error message telling me it could not find any audio files...
The reason is that EchoThief RIRs are stored in subfolders by category.
We should update find_audio_files to look into subfolders.
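A minimal sketch of a recursive lookup, assuming a standalone helper built on pathlib. The name find_audio_files_recursive and the extension list are hypothetical and are not taken from the library:

```python
from pathlib import Path

# Assumed set of extensions for this sketch; adjust to what the loader supports.
AUDIO_EXTENSIONS = {".wav", ".flac", ".ogg", ".mp3"}


def find_audio_files_recursive(root):
    """Return all audio files under root, including those in subfolders."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")  # rglob descends into every subfolder
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )
```

With a layout like the EchoThief one (one folder per category), a call such as find_audio_files_recursive("path/to/rirs") would pick up the files in each category folder.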
The first dimension should represent batches
Some things, like mp3 compression, are not easy to implement in pytorch, so we have to rely on other libraries like audiomentations for that. Here's an example of what I imagine:
from audiomentations import Mp3Compression
from torch_audiomentations import AudiomentationsWaveformTransform
augment = AudiomentationsWaveformTransform(
Mp3Compression(min_bitrate=32, max_bitrate=96, p=1.0), p=0.5
)
...
I'm sure a lot of people working with torch-audiomentations will want to use some of the transforms available in torchaudio, at least for now until #4 is ready. I noticed that the sample rate is required in the forward call of the BaseWaveTransforms. This means that I have to check each transform in a list to invoke forward correctly. Would it not be better to have sample_rate initialised as part of __init__?
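One way this could look, as a framework-free sketch: the sample rate is stored at construction and can still be overridden per call. The class names below are hypothetical stand-ins, not the library's classes:

```python
class SampleRateAwareTransform:
    """Hypothetical base class: sample_rate may be given once at construction."""

    def __init__(self, sample_rate=None):
        self.sample_rate = sample_rate

    def __call__(self, samples, sample_rate=None):
        # A per-call value overrides the one stored at construction time.
        if sample_rate is None:
            sample_rate = self.sample_rate
        if sample_rate is None:
            raise ValueError("sample_rate must be set in __init__ or passed to the call")
        return self.apply(samples, sample_rate)

    def apply(self, samples, sample_rate):
        raise NotImplementedError


class EchoTransform(SampleRateAwareTransform):
    """Toy transform that returns its inputs, to show which rate was used."""

    def apply(self, samples, sample_rate):
        return samples, sample_rate
```

With this shape, a list of transforms can be invoked uniformly as transform(samples), whether or not a given transform needs the sample rate.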
For multi-GPU training, users might use DataParallel and DistributedDataParallel. The augmentations should probably follow that as well, to avoid copying parameters from one device to another (and losing more time).
Maybe we could use this? https://github.com/jonashaag/numpy-fast-align-mse
There are several ways of representing multichannel audio:
shape like [batch_size, num_samples, num_channels]
shape like [batch_size, num_channels, samples]
I would choose samples last based on this information
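For reference, converting between the two layouts is a single permute. This is a small sketch with made-up sizes:

```python
import torch

# Channels-last batch: [batch_size, num_samples, num_channels]
channels_last = torch.randn(4, 16000, 2)

# Move samples to the last dimension: [batch_size, num_channels, num_samples]
samples_last = channels_last.permute(0, 2, 1).contiguous()
```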
E.g. transforms like this: https://github.com/zcaceres/spec_augment
Currently supported in torchaudio using sox.
import torch
import torchaudio
from IPython.display import Audio, display
# Names of the sox effects available in this torchaudio build
x = torchaudio.sox_effects.effect_names()
effects = [["reverb"]]
wave, sr = torchaudio.load("./tests/data/tst00.wav")
# Play the dry signal
display(Audio(wave, rate=sr))
wave, sr = torchaudio.sox_effects.apply_effects_tensor(wave, sr, effects)
print(sr, wave.shape)
# Play the first channel of the reverberated signal
display(Audio(wave[0][None], rate=sr))
I found an article where someone implements Schroeder's reverberator algorithm and provides code.
Is this something that you are planning to implement, or already in the process of implementing?
We currently have compatibility code for pytorch < 1.7 and torchaudio < 0.7 which the CI currently does not test. We should start doing this.
In pyannote.audio we have a set of functions that run before committing, through the pre-commit git hook. It makes code styling more consistent, checks for missing imports and unused imports, formats the code that is about to change, removes trailing whitespace, etc. I've seen @iver make a few commits to blackify code, so I thought this might help save him some time.
More details
per_channel augmentations: augmentations that are applied on each audio channel separately
per_image augmentations: augmentations that are applied on multichannel audio (e.g. swapping the channels, changing the spatial image)
per_ensemble (or per_batch?) augmentations: augmentations that are applied to the whole ensemble of sources. E.g. imagine you want to change the energy of each image but keep the overall mix energy constant or below a certain value
Ref. a comment from @faroit on slack
In my opinion, per channel and per image has the highest priority
Would be nice for reproducibility reasons to be able to instantiate augmentations from a configuration file.
augment = AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)
would be equivalent to loading from this configuration file:
# config.yaml
AddGaussianNoise:
  min_amplitude: 0.001
  max_amplitude: 0.015
  p: 0.5
import yaml
import torch_audiomentations
with open('config.yaml', 'r') as fp:
    augment = torch_audiomentations.from_config(yaml.safe_load(fp))
Composition could have the following (simple) syntax:
# config.yaml
Compose:
  AddGaussianNoise:
    min_amplitude: 0.001
    max_amplitude: 0.015
    p: 0.5
  TimeStretch:
    min_rate: 0.8
    max_rate: 1.25
    p: 0.5
Would you consider a PR adding such a feature?
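To make the proposal concrete, here is a minimal sketch of what such a from_config could do once the YAML has been parsed into a dict. The classes and the registry are stand-ins defined only for this illustration; none of this is the library's actual API:

```python
class AddGaussianNoise:
    """Stand-in for illustration; not the library's class."""

    def __init__(self, min_amplitude, max_amplitude, p):
        self.min_amplitude = min_amplitude
        self.max_amplitude = max_amplitude
        self.p = p


class Compose:
    """Stand-in that simply stores its child transforms in order."""

    def __init__(self, transforms):
        self.transforms = transforms


# Hypothetical registry mapping config keys to transform classes.
REGISTRY = {"AddGaussianNoise": AddGaussianNoise}


def from_config(config):
    """Build a transform from a dict with exactly one top-level key."""
    ((name, params),) = config.items()
    if name == "Compose":
        # Each child entry is itself a one-key config.
        return Compose([from_config({key: value}) for key, value in params.items()])
    return REGISTRY[name](**params)


# This dict is what a YAML parser would return for the first config file.
augment = from_config(
    {"AddGaussianNoise": {"min_amplitude": 0.001, "max_amplitude": 0.015, "p": 0.5}}
)
```

One design point worth noting: because the children of Compose are keys of a mapping, the same transform cannot appear twice. A list of one-key mappings would lift that restriction and make the order explicit.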
The idea is that whatever perturbations are applied to the input are reflected in the target data
Scenarios:
Input | Target |
---|---|
Audio | class(es), fixed length, not correlated with the input length. Low priority. |
Audio | Time series of class(es). E.g. predict class for each frame of 25 ms with hop size 10 ms. Medium priority. |
Audio | Audio (length same as the input). This one has the highest priority |
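For the highest-priority scenario (audio input with an audio target of the same length), the idea can be sketched with a shift that is sampled once and applied to both tensors. The function name and the choice of a circular shift are assumptions made for illustration:

```python
import torch


def shift_input_and_target(samples, targets, max_shift):
    """Apply the same random circular shift to samples and targets.

    Both tensors are assumed to be (batch_size, num_channels, num_samples).
    """
    # Sample the perturbation once so input and target stay aligned.
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    return torch.roll(samples, shift, dims=-1), torch.roll(targets, shift, dims=-1)
```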
Like in pyro-ppl/pyro@ae55140
Context: https://pytorch.org/docs/stable/generated/torch.rfft.html
The function torch.rfft() is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.rfft() for one-sided output, or torch.fft.fft() for two-sided output.
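A short sketch of the migration described in the deprecation note, using a random signal. torch.view_as_real is used here to recover the old layout with real and imaginary parts in a trailing dimension of size 2:

```python
import torch
import torch.fft  # explicit import, as the deprecation note suggests

signal = torch.randn(2, 1024)

# New API: returns a complex tensor with n // 2 + 1 frequency bins.
spectrum = torch.fft.rfft(signal, dim=-1)

# Real and imaginary parts stacked in a trailing dimension of size 2.
spectrum_real_imag = torch.view_as_real(spectrum)

# Round trip back to the time domain; n restores the original length.
reconstructed = torch.fft.irfft(spectrum, n=signal.shape[-1], dim=-1)
```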
Suppose I define MyNetwork with built-in augmentation like this:
import torch
import torch.nn as nn
from torch_audiomentations import Gain

class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()  # required before assigning submodules
        self.augmentation = Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5)
        self.other_layers = ...

    def forward(self, waveforms):
        augmented_waveforms = self.augmentation(waveforms)
        return self.other_layers(augmented_waveforms)

net = MyNetwork()
As per my understanding of the current code base, calling net.train(mode=False) (aka net.eval()) has no effect on net.augmentation. Augmentation will always be applied, whether we call net.eval() or not.
Would it make sense to add the following method to BaseWaveformTransform so that one can activate/deactivate augmentations with net.train(mode=...)?
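class BaseWaveformTransform(nn.Module):
    def train(self, mode: bool = True):
        super().train(mode)  # keep nn.Module's own training flag in sync
        if mode:
            # restore p; fall back to the current value if eval() was never called
            self.p = getattr(self, "_p_backup", self.p)
        else:
            self._p_backup = self.p
            self.p = 0.0
        return self  # nn.Module.train returns self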
The _p_backup hack is ugly and the whole p management should probably be rewritten, but you get the idea. What do you think?
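An alternative that avoids backing up p is to check the training flag that nn.Module already maintains. The sketch below uses a hypothetical ToyGain transform, not the library's Gain, to show the pattern:

```python
import torch
import torch.nn as nn


class ToyGain(nn.Module):
    """Hypothetical transform that only augments while in training mode."""

    def __init__(self, gain: float = 0.5, p: float = 1.0):
        super().__init__()
        self.gain = gain
        self.p = p

    def forward(self, samples: torch.Tensor) -> torch.Tensor:
        # nn.Module.train()/eval() set self.training on all submodules,
        # so p itself never has to be modified.
        if not self.training or torch.rand(1).item() >= self.p:
            return samples
        return samples * self.gain


augment = ToyGain(p=1.0)
x = torch.ones(1, 1, 4)
augment.train()
train_out = augment(x)  # scaled by gain
augment.eval()
eval_out = augment(x)  # returned unchanged
```

Whether augmentation should be tied to the training flag at all is a design decision: some users may want test-time augmentation, in which case an explicit switch would be needed.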
The BaseWaveTransformations currently don't accept a batch of audio in forward, as they treat a batch of audio as multichannel audio. I can submit a PR, or is there some design decision here that I'm missing?
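One way to remove the ambiguity is to normalize every input to three dimensions before transforming it. The helper below is a sketch, and the convention it encodes (2D means a batch of mono examples) is an assumption for illustration:

```python
import torch


def to_batch_channels_samples(samples: torch.Tensor) -> torch.Tensor:
    """Return a (batch_size, num_channels, num_samples) view of the input.

    Assumed convention for this sketch: 1D is a single mono example,
    2D is a batch of mono examples, 3D is already in the target layout.
    """
    if samples.ndim == 1:
        return samples[None, None, :]
    if samples.ndim == 2:
        return samples[:, None, :]
    if samples.ndim == 3:
        return samples
    raise ValueError(f"Expected 1 to 3 dimensions, got {samples.ndim}")
```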
Such wonderful work! Is there any plan to implement audio effects like pitch-shift, reverb, equalizer, etc?
So the user can extract information on which RIRs were applied afterwards
Like in audiomentations: https://github.com/iver56/audiomentations/blob/master/audiomentations/core/composition.py#L4
torch-audiomentations works on tensors with shapes like (batch_size, samples) or (batch_size, channels, samples)
How do you propose CutMix would work when replacing some of the audio in an audio example? Pick audio from a different audio example in the batch? Or load different audio from disk?
If one of the objectives of torch-audiomentations is to do fast augmentation on GPU as part of a larger network, reading from disk for each new audio sample would slow down the forward pass dramatically, wouldn't it?
Would it make sense to allow the user to preload the whole background noise collection in memory during __init__ and then read from memory in apply_transform?
Or, if the background noise collection is too big to fit in memory, we could load a bunch of them, use them for a while, and then load another (random) bunch of them, etc.
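The "load a bunch, use them for a while, then load another bunch" idea could be sketched as a small rotating pool. Everything here (class name, parameters, the load_fn callback) is hypothetical and independent of the library:

```python
import random


class RotatingPool:
    """Keep a random subset of items in memory and refresh it periodically."""

    def __init__(self, paths, load_fn, pool_size, refresh_every, seed=None):
        self.paths = list(paths)
        self.load_fn = load_fn  # e.g. a function that reads audio from disk
        self.pool_size = min(pool_size, len(self.paths))
        self.refresh_every = refresh_every
        self.rng = random.Random(seed)
        self.draws = 0
        self._refresh()

    def _refresh(self):
        # The only place that touches the disk.
        chosen = self.rng.sample(self.paths, self.pool_size)
        self.pool = [self.load_fn(path) for path in chosen]

    def draw(self):
        # Reload a new random subset after every refresh_every draws.
        if self.draws and self.draws % self.refresh_every == 0:
            self._refresh()
        self.draws += 1
        return self.rng.choice(self.pool)
```

Setting pool_size to the number of files and refresh_every to a very large value gives the "preload everything" variant.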
Pros:
Cons:
What are the scenarios that we want to cover?
I am getting RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cpu!
I send my audio to cuda:2 and pass it to the augmentor and I get the error above. Same problem with RIRs. I also tried to send the augmentor itself to GPU, but this did not help. Is this a known issue?
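The cause of this particular report is not stated here. As a general pattern, though, a transform can avoid this class of error by registering its tensors as buffers and aligning them with the incoming batch. The AddFixedNoise class below is a hypothetical example, not one of the library's transforms:

```python
import torch
import torch.nn as nn


class AddFixedNoise(nn.Module):
    """Hypothetical transform holding a noise tensor as a buffer."""

    def __init__(self, noise: torch.Tensor):
        super().__init__()
        # Buffers follow the module on .to(device) / .cuda().
        self.register_buffer("noise", noise)

    def forward(self, samples: torch.Tensor) -> torch.Tensor:
        # Defensive: also align device and dtype with the incoming batch.
        noise = self.noise.to(device=samples.device, dtype=samples.dtype)
        return samples + noise


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
augment = AddFixedNoise(torch.full((1, 1, 8), 0.1)).to(device)
out = augment(torch.zeros(2, 1, 8, device=device))
```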
There's an off-by-one error in AddBackgroundNoise that we need to fix before v0.5.0 can be released:
============================================================= FAILURES ==============================================================
____________________________ TestAddBackgroundNoise.test_background_noise_guaranteed_with_batched_tensor ____________________________
self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_background_noise_guaranteed_with_batched_tensor>
def test_background_noise_guaranteed_with_batched_tensor(self):
mixed_inputs = self.bg_noise_transform_guaranteed(
> self.input_audios, self.sample_rate
)
tests\test_background_noise.py:85:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:164: in forward
self.randomize_parameters(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000, ..., 0.0077, 0.0062, 0.0042]],
[[0.0000, 0.0000, 0.0000, ..., 0.0077, 0.0...0077, 0.0062, 0.0042]],
[[0.0000, 0.0000, 0.0000, ..., 0.0077, 0.0062, 0.0042]]],
dtype=torch.float64)
sample_rate = 16000
def randomize_parameters(
self, selected_samples: torch.Tensor, sample_rate: int = None
):
"""
:params selected_samples: (batch_size, num_channels, num_samples)
"""
batch_size, _, num_samples = selected_samples.shape
# (batch_size, num_samples) RMS-normalized background noise
audio = self.audio if hasattr(self, "audio") else Audio(sample_rate, mono=True)
self.transform_parameters["background"] = torch.stack(
> [self.random_background(audio, num_samples) for _ in range(batch_size)]
)
E RuntimeError: stack expects each tensor to be equal size, but got [1, 140544] at entry 0 and [1, 140545] at entry 1
torch_audiomentations\augmentations\background_noise.py:113: RuntimeError
____________________________ TestAddBackgroundNoise.test_background_noise_guaranteed_with_single_tensor _____________________________
self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_background_noise_guaranteed_with_single_tensor>
def test_background_noise_guaranteed_with_single_tensor(self):
mixed_input = self.bg_noise_transform_guaranteed(
> self.input_audio, self.sample_rate
)
tests\test_background_noise.py:77:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:168: in forward
] = self.apply_transform(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000, ..., 0.0077, 0.0062, 0.0042]]],
dtype=torch.float64), sample_rate = 16000
def apply_transform(self, selected_samples: torch.Tensor, sample_rate: int = None):
batch_size, num_channels, num_samples = selected_samples.shape
# (batch_size, num_samples)
background = self.transform_parameters["background"].to(selected_samples.device)
# (batch_size, num_channels)
background_rms = calculate_rms(selected_samples) / (
10 ** (self.transform_parameters["snr_in_db"].unsqueeze(dim=-1) / 20)
)
return selected_samples + background_rms.unsqueeze(-1) * background.view(
> batch_size, 1, num_samples
).expand(-1, num_channels, -1)
E RuntimeError: shape '[1, 1, 140544]' is invalid for input of size 140545
torch_audiomentations\augmentations\background_noise.py:134: RuntimeError
_______________________________________ TestAddBackgroundNoise.test_varying_snr_within_batch ________________________________________
self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_varying_snr_within_batch>
def test_varying_snr_within_batch(self):
min_snr_in_db = 3
max_snr_in_db = 30
augment = AddBackgroundNoise(
self.bg_path, min_snr_in_db=3, max_snr_in_db=30, p=1.0
)
> augmented_audios = augment(self.input_audios, self.sample_rate)
tests\test_background_noise.py:105:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:164: in forward
self.randomize_parameters(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000, ..., 0.0077, 0.0062, 0.0042]],
[[0.0000, 0.0000, 0.0000, ..., 0.0077, 0.0...0077, 0.0062, 0.0042]],
[[0.0000, 0.0000, 0.0000, ..., 0.0077, 0.0062, 0.0042]]],
dtype=torch.float64)
sample_rate = 16000
def randomize_parameters(
self, selected_samples: torch.Tensor, sample_rate: int = None
):
"""
:params selected_samples: (batch_size, num_channels, num_samples)
"""
batch_size, _, num_samples = selected_samples.shape
# (batch_size, num_samples) RMS-normalized background noise
audio = self.audio if hasattr(self, "audio") else Audio(sample_rate, mono=True)
self.transform_parameters["background"] = torch.stack(
> [self.random_background(audio, num_samples) for _ in range(batch_size)]
)
E RuntimeError: stack expects each tensor to be equal size, but got [1, 140544] at entry 0 and [1, 140545] at entry 2
torch_audiomentations\augmentations\background_noise.py:113: RuntimeError
____________________________________________ test_transform_is_differentiable[augment0] _____________________________________________
augment = AddBackgroundNoise()
@pytest.mark.parametrize(
"augment",
[
# Differentiable transforms:
AddBackgroundNoise(BG_NOISE_PATH, 20, p=1.0),
ApplyImpulseResponse(IR_PATH, p=1.0),
Compose(
transforms=[
Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=1.0),
PolarityInversion(p=1.0),
]
),
Gain(min_gain_in_db=-6.000001, max_gain_in_db=-6, p=1.0),
PolarityInversion(p=1.0),
Shift(p=1.0),
# Non-differentiable transforms:
# RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
# [torch.DoubleTensor [1, 5]], which is output 0 of IndexBackward, is at version 1; expected version 0 instead.
# Hint: enable anomaly detection to find the operation that failed to compute its gradient,
# with torch.autograd.set_detect_anomaly(True).
pytest.param(
PeakNormalization(p=1.0), marks=pytest.mark.skip("Not differentiable")
),
],
)
def test_transform_is_differentiable(augment):
sample_rate = 16000
# Note: using float64 dtype to be compatible with AddBackgroundNoise fixtures
samples = torch.tensor(
[[1.0, 0.5, -0.25, -0.125, 0.0]], dtype=torch.float64
).unsqueeze(1)
samples_cpy = deepcopy(samples)
# We are going to convert the input tensor to a nn.Parameter so that we can
# track the gradients with respect to it. We'll "optimize" the input signal
# to be closer to that after the augmentation to test differentiability
# of the transform. If the signal got changed in any way, and the test
# didn't crash, it means it works.
samples = torch.nn.Parameter(samples)
optim = SGD([samples], lr=1.0)
for i in range(10):
optim.zero_grad()
> transformed = augment(samples=samples, sample_rate=sample_rate)
tests\test_differentiable.py:64:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:168: in forward
] = self.apply_transform(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = AddBackgroundNoise()
selected_samples = tensor([[[ 1.0000, 0.5000, -0.2500, -0.1250, 0.0000]]], dtype=torch.float64,
grad_fn=<IndexBackward>)
sample_rate = 16000
def apply_transform(self, selected_samples: torch.Tensor, sample_rate: int = None):
batch_size, num_channels, num_samples = selected_samples.shape
# (batch_size, num_samples)
background = self.transform_parameters["background"].to(selected_samples.device)
# (batch_size, num_channels)
background_rms = calculate_rms(selected_samples) / (
10 ** (self.transform_parameters["snr_in_db"].unsqueeze(dim=-1) / 20)
)
return selected_samples + background_rms.unsqueeze(-1) * background.view(
> batch_size, 1, num_samples
).expand(-1, num_channels, -1)
E RuntimeError: shape '[1, 1, 5]' is invalid for input of size 6
torch_audiomentations\augmentations\background_noise.py:134: RuntimeError
Originally posted by @iver56 in #61 (comment)
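A defensive way to make the stacking and view calls in the traceback above robust is to crop or zero-pad every background chunk to exactly num_samples. This sketch uses a hypothetical fit_length helper and does not address where the extra sample comes from:

```python
import torch
import torch.nn.functional as F


def fit_length(audio: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Crop or zero-pad the last dimension to exactly num_samples."""
    current = audio.shape[-1]
    if current > num_samples:
        return audio[..., :num_samples]
    if current < num_samples:
        return F.pad(audio, (0, num_samples - current))
    return audio


# Lengths taken from the traceback: one chunk is a sample too long.
chunks = [torch.zeros(1, 140544), torch.zeros(1, 140545)]
stacked = torch.stack([fit_length(chunk, 140544) for chunk in chunks])
```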
also revise the "is multichannel audio" check