asteroid-team / torch-audiomentations


Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.

License: MIT License

Python 100.00%

data-augmentation pytorch audio-data-augmentation audio waveform dsp audio-effects machine-learning deep-learning music

torch-audiomentations's Issues

BasicTransform subclasses should implement apply, not forward

Ref #5 (comment)

Implement equalizer transform

Change the spectrum. Boost some frequencies, maybe attenuate some frequencies.

@iver56 - there's a failure to the assertion when comparing the PyTorch way of convolving versus scipy's version. I have made one small change here to match the required arguments as specified in numpy's documentation.

Load audio with torchaudio, not librosa

I prefer a non-sox backend for compatibility reasons

Use Google-style docstrings?

Don't override nn.Module.parameters

We should come up with a different name that doesn't interfere with the nn.Module.parameters method

Update find_audio_files to look into subfolders as well

I just downloaded EchoThief RIRs and naively thought that find_audio_files would look into subfolders. It did not, and I got a nice error message telling me it could not find any audio file...

The reason is that EchoThief RIRs are stored in subfolders by category.
We should update find_audio_files to look into subfolders.
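
A minimal sketch of what a recursive lookup could look like, using only the standard library. The function name find_audio_files_recursive and the AUDIO_EXTENSIONS set are assumptions made for illustration, not the library's actual API or extension list.

```python
from pathlib import Path

# Assumed set of extensions; the real library may support a different list.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg"}


def find_audio_files_recursive(root):
    """Return the sorted paths of audio files under `root`, including subfolders."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        # Lower-case the suffix so that a name like "IR.WAV" is matched as well.
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )
```

With a layout like EchoThief's, where files sit in one subfolder per category, rglob("*") walks every level, and files such as .txt notes are skipped by the suffix check.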

Add a few more RIRs for demo/testing purposes

Test support for batches in all transforms

The first dimension should represent batches

Add support for using audiomentations transforms

Some things, like mp3 compression, are not easy to implement in pytorch, so we have to rely on other libraries like audiomentations for that. Here's an example of what I imagine:

from audiomentations import Mp3Compression
from torch_audiomentations import AudiomentationsWaveformTransform

augment = AudiomentationsWaveformTransform(
    Mp3Compression(min_bitrate=32, max_bitrate=96, p=1.0), p=0.5
)

...

Sample Rate in the forward

I'm sure a lot of people working with torch-audiomentations will want to use some of the transforms available in torchaudio, at least until #4 is ready. I noticed that the sample rate is required in the forward call of the BaseWaveTransforms. This means that I have to check each transform in a list to invoke its forward correctly. Would it not be better to pass the sample_rate once, as part of the init?

Multi-GPU augmentations

For multi-GPU training, users might use DataParallel and DistributedDataParallel. The augmentations should probably follow that as well, to avoid copying parameters from one device to another (and losing more time).

Optionally compensate for propagation delay in ApplyImpulseResponse

Maybe we could use this? https://github.com/jonashaag/numpy-fast-align-mse

Implement pitch shift transform

Support multichannel audio

There are several ways of representing multichannel audio:

Channels last

shape like [batch_size, num_samples, num_channels]

Common when loading/saving audio from/to wav

Samples last

shape like [batch_size, num_channels, samples]

Convenient for convolution. E.g. for STFT, the time dimension is commonly the last dimension (in torchaudio, torch-stft and asteroid's STFT), according to @mpariente
"I always prefer channel first because in case one wants to apply an operation to each channel independently can do straight reshaping. Otherwise you have to transpose then reshape." - @popcornell
Librosa uses this convention

I would choose samples last based on this information
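
To make the trade-off concrete, here is a small sketch of converting channels-last audio to samples-last, and of the "straight reshaping" that samples-last allows for per-channel operations. The helper names are hypothetical and only serve the illustration.

```python
import torch


def channels_last_to_samples_last(audio: torch.Tensor) -> torch.Tensor:
    """[batch_size, num_samples, num_channels] -> [batch_size, num_channels, num_samples]."""
    return audio.permute(0, 2, 1).contiguous()


def apply_per_channel(audio: torch.Tensor, fn) -> torch.Tensor:
    """Apply `fn` to every channel independently; expects samples-last input."""
    batch_size, num_channels, num_samples = audio.shape
    # With samples last, folding channels into the batch is a plain reshape.
    flat = audio.reshape(batch_size * num_channels, num_samples)
    return fn(flat).reshape(batch_size, num_channels, -1)
```

With channels last, the same per-channel operation would need a transpose before the reshape, which is the point made in the quote above.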

Add support for spectrogram transforms

E.g. transforms like this: https://github.com/zcaceres/spec_augment

Reverb Augmentation

Currently supported in torchaudio using sox.

import torchaudio
from IPython.display import Audio, display

# Apply the sox "reverb" effect with its default settings
effects = [["reverb"]]
wave, sr = torchaudio.load("./tests/data/tst00.wav")
display(Audio(wave, rate=sr))  # original audio
wave, sr = torchaudio.sox_effects.apply_effects_tensor(wave, sr, effects)
print(sr, wave.shape)
display(Audio(wave[0][None], rate=sr))  # first channel of the reverberated audio

I found an article that implements Schroeder's reverberator algorithm and includes code.

Is this something that you are planning to implement, or are already in the process of implementing?

Implement white/brown/pink/gaussian noise

CI: Test multiple versions of pytorch and torchaudio

We currently have compatibility code for pytorch < 1.7 and torchaudio < 0.7 which the CI currently does not test. We should start doing this.

Test find_audio_files

supports uppercase filename extension
can find audio files in subfolders
ignores irrelevant files like .txt

Use pre-commit

In pyannote.audio we have a set of checks that run before committing, via the pre-commit git hook. It makes code styling more consistent, checks for missing and unused imports, formats the code that is about to be changed, removes whitespace, etc. I've seen @iver make a few commits to blackify code, so I thought this might help save him some time.

More details

Support the following conventions: per channel, per image and per ensemble

per_channel augmentations: an augmentation that is applied on each audio channel separately
per_image augmentations: augmentations that are applied on multichannel audio (e.g. swapping the channels, changing the spatial image)
per_ensemble (or per_batch?) augmentations: augmentations that are applied to the whole ensemble of sources. E.g. imagine you want to change the energy of each image but keep the overall mix energy constant or below a certain value

Ref. a comment from @faroit on slack

In my opinion, per channel and per image has the highest priority

Configuration file

Would be nice for reproducibility reasons to be able to instantiate augmentations from a configuration file.

augment = AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)

would be equivalent to loading from this configuration file:

# config.yaml
AddGaussianNoise:
    min_amplitude: 0.001
    max_amplitude: 0.015
    p: 0.5

import yaml

with open('config.yaml', 'r') as fp:
    augment = torch_audiomentations.from_config(yaml.safe_load(fp))

Composition could have the following (simple) syntax:

# config.yaml
Compose:
    AddGaussianNoise:
        min_amplitude: 0.001
        max_amplitude: 0.015
        p: 0.5
    TimeStretch:
        min_rate: 0.8
        max_rate: 1.25
        p: 0.5

Would you consider a PR adding such a feature?
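
For discussion, a rough sketch of how such a loader could work on the dict that a YAML parser returns. Everything here is an assumption: from_config is only the proposed name, REGISTRY is hypothetical, and the three dataclasses are stand-ins so the sketch runs without torch-audiomentations installed.

```python
from dataclasses import dataclass


# Stand-in classes for illustration only; not the library's transforms.
@dataclass
class AddGaussianNoise:
    min_amplitude: float
    max_amplitude: float
    p: float


@dataclass
class TimeStretch:
    min_rate: float
    max_rate: float
    p: float


@dataclass
class Compose:
    transforms: list


# Hypothetical mapping from config names to classes.
REGISTRY = {"AddGaussianNoise": AddGaussianNoise, "TimeStretch": TimeStretch}


def from_config(config: dict):
    """Build a transform from a one-key dict, e.g. the result of yaml.safe_load."""
    if len(config) != 1:
        raise ValueError("expected exactly one top-level transform name")
    name, params = next(iter(config.items()))
    if name == "Compose":
        # Each child is itself a one-key config; dict order is preserved.
        return Compose([from_config({child: kwargs}) for child, kwargs in params.items()])
    return REGISTRY[name](**params)
```

One consequence of the mapping syntax is that the same transform cannot appear twice inside a Compose, because mapping keys must be unique. A list of one-key mappings would lift that restriction.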

Process (various kinds of) target data similarly to input data

The idea is that whatever perturbations are applied to the input are reflected in the target data

Scenarios:

Input	Target
Audio	class(es), fixed length, not correlated with the input length. Low priority.
Audio	Time series of class(es). E.g. predict class for each frame of 25 ms with hop size 10 ms. Medium priority.
Audio	Audio (length same as the input). This one has the highest priority

Implement time stretch transform

Test the differentiability of transforms

Deal with torch.rfft deprecation

Like in pyro-ppl/pyro@ae55140

Context: https://pytorch.org/docs/stable/generated/torch.rfft.html

The function torch.rfft() is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.rfft() for one-sided output, or torch.fft.fft() for two-sided output.
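
As a small illustration of the replacement API named in the notice, the sketch below computes a one-sided FFT with torch.fft.rfft and converts the complex result to a trailing (real, imag) dimension for code that still expects that layout. The tensor shape is arbitrary and chosen only for the example.

```python
import torch
import torch.fft  # explicit import, as the deprecation notice recommends

signal = torch.randn(4, 16000)

# One-sided FFT over the last dimension; the result is a complex tensor.
spectrum = torch.fft.rfft(signal, dim=-1)

# Code written against the old API expected a trailing (real, imag) dimension.
spectrum_real_imag = torch.view_as_real(spectrum)

# Inverse transform; pass n so that the original length is restored exactly.
reconstructed = torch.fft.irfft(spectrum, n=signal.shape[-1], dim=-1)
```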

Add `train()` method?

Suppose I define MyNetwork with built-in augmentation like this:

import torch
import torch.nn as nn
from torch_audiomentations import Gain

class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()  # required before assigning submodules
        self.augmentation = Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5)
        self.other_layers = ...

    def forward(self, waveforms):
        augmented_waveforms = self.augmentation(waveforms)
        return self.other_layers(augmented_waveforms)

net = MyNetwork()

As per my understanding of current code base, calling net.train(mode=False) (aka net.eval()) has no effect on net.augmentation. Augmentation will always be applied, whether we call net.eval() or not.

Would it make sense to add the following method to BaseWaveformTransform so that one can activate/deactivate augmentations with net.train(mode=...)?

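class BaseWaveformTransform(nn.Module):
    def train(self, mode: bool = True):
        if mode:
            # Restore p; keep the current value if nothing was backed up
            self.p = getattr(self, "_p_backup", self.p)
        elif self.training:
            # Back up p only on the train -> eval transition
            self._p_backup = self.p
            self.p = 0.0
        return super().train(mode)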

The _p_backup hack is ugly and the whole p management should probably be rewritten but you get the idea. What do you think?
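
An alternative that avoids the backup entirely would be to check self.training inside forward, since nn.Module.train() and eval() already set that flag recursively on all submodules. The sketch below uses a hypothetical ToyGain class with a fixed gain_factor; it is not the library's Gain and only illustrates the gating.

```python
import torch
import torch.nn as nn


class ToyGain(nn.Module):
    """Hypothetical stand-in for a waveform transform; not the library's Gain."""

    def __init__(self, gain_factor: float = 0.5, p: float = 0.5):
        super().__init__()
        self.gain_factor = gain_factor
        self.p = p

    def forward(self, samples: torch.Tensor) -> torch.Tensor:
        # train()/eval() toggle self.training, so p never has to be modified.
        if not self.training:
            return samples
        # Decide per example whether the transform is applied (1.0) or not (0.0).
        apply = (torch.rand(samples.shape[0], device=samples.device) < self.p).to(samples.dtype)
        # scale equals gain_factor where apply == 1 and 1.0 elsewhere.
        scale = 1.0 + apply * (self.gain_factor - 1.0)
        return samples * scale.view(-1, *([1] * (samples.dim() - 1)))
```

Calling net.eval() on a parent network then disables the augmentation without any override of train(), and net.train() re-enables it with the original p.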

Make PeakNormalization differentiable

Batch GPU Transforms

The BaseWaveTransformations currently don't accept a batch of audio in the forward, as they treat a batch of audio as multichannel audio. I can submit a PR, or is there some design decision here that I'm missing?

Any plans to support more audio effects like pitchshift, etc.?

Such wonderful work! Are there any plans to implement audio effects like pitch shift, reverb, equalizer, etc.?

Improve the execution time of the shift transform?

Like https://github.com/iver56/audiomentations/blob/6f0b8b6c783a6c8268eb00120ab6bb25cab1aab7/audiomentations/augmentations/transforms.py#L273

ApplyImpulseResponse should store ir_file_paths in transform_parameters

So the user can extract information on which RIRs were applied afterwards

Implement temporal masking

@nicofarr

Implement Compose

Like in audiomentations: https://github.com/iver56/audiomentations/blob/master/audiomentations/core/composition.py#L4

Set up CI

lint
run tests
measure code coverage

Implement CutMix

@nicofarr

torch-audiomentations works on tensors with shapes like (batch_size, samples) or (batch_size, channels, samples)

How do you propose CutMix would work when replacing some of the audio in an audio example? Pick audio from a different audio example in the batch? Or load different audio from disk?
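
To make the first option concrete, here is a sketch in which each example receives one random segment from another example of the same batch. This is only one possible design, not a decision made in this thread. The function name cutmix_within_batch and the max_fraction parameter are hypothetical, and any corresponding mixing of targets is left out.

```python
import torch


def cutmix_within_batch(samples: torch.Tensor, max_fraction: float = 0.5) -> torch.Tensor:
    """Replace one random segment of each example with the same segment of a donor.

    samples: (batch_size, channels, num_samples)
    """
    batch_size, _, num_samples = samples.shape
    # Example i uses example i + 1 (wrapping around) as its donor.
    donors = torch.roll(samples, shifts=-1, dims=0)
    mixed = samples.clone()
    max_length = max(1, int(num_samples * max_fraction))
    for i in range(batch_size):
        # torch.randint excludes the upper bound, hence the + 1.
        length = int(torch.randint(1, max_length + 1, (1,)))
        start = int(torch.randint(0, num_samples - length + 1, (1,)))
        mixed[i, :, start:start + length] = donors[i, :, start:start + length]
    return mixed
```

With a batch size of 1 the donor is the example itself, so the call is a no-op. Loading donor audio from disk instead would avoid that case at the cost of I/O in the forward pass.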

Implement p_mode="per_example" in Compose

https://github.com/asteroid-team/torch-audiomentations/pull/32/files/3d5d9293bb31716e8cbf21e064a85d879940f835#diff-f8aff5239b5132abfaddcbaa25a0342ec801e5f75e0b82f27757169fa5a82c68

Pre-loading background noise (and IR) in init?

If one of the objectives of torch-audiomentations is to do fast augmentation on GPU as part of a larger network, reading from disk for each new audio sample would slow down the forward pass dramatically, wouldn't it?

torch-audiomentations/torch_audiomentations/augmentations/background_noise.py

Lines 47 to 51 in 02752c9

current_bg_audio = load_audio(
    bg_file_path,
    sample_rate=sample_rate,
    start=bg_start_index,
    stop=bg_stop_index,

Would it make sense to allow the user to preload the whole background noise collection in memory during __init__ and then read from memory in apply_transform?

Or, if the background noise collection is too big to fit in memory, we could load a bunch of them, use them for a while, and then load another (random) bunch of them, etc.
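
The "load a bunch, use them for a while, then load another bunch" idea could be sketched roughly as follows. The class name RotatingNoiseCache, its parameters, and the load_fn callable (path in, audio out) are all hypothetical; the sketch says nothing about how the library would actually load or store audio.

```python
import random


class RotatingNoiseCache:
    """Keep a random subset of files in memory and refresh it periodically."""

    def __init__(self, file_paths, load_fn, cache_size=8, uses_per_refresh=100, seed=None):
        self.file_paths = list(file_paths)
        self.load_fn = load_fn  # hypothetical loader: path -> audio
        self.cache_size = min(cache_size, len(self.file_paths))
        self.uses_per_refresh = uses_per_refresh
        self.rng = random.Random(seed)
        self._refresh()

    def _refresh(self):
        # Disk access happens only here, not on every sample() call.
        chosen = self.rng.sample(self.file_paths, self.cache_size)
        self.cache = [self.load_fn(path) for path in chosen]
        self.uses_left = self.uses_per_refresh

    def sample(self):
        """Return one cached item, reloading a new subset when the budget is spent."""
        if self.uses_left == 0:
            self._refresh()
        self.uses_left -= 1
        return self.rng.choice(self.cache)
```

Setting cache_size to the number of files and uses_per_refresh to a very large value would approximate the "preload everything in __init__" variant.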

Switch to julius for resampling

Pros:

Julius has fewer dependencies, and is a lighter dependency overall
Julius is much faster, and can run on GPU

Cons:

Julius is more memory-hungry

Implement compressor

@faroit

Implement SomeOf and OneOf

Like iver56/audiomentations#61

Add Memory Profiling to the Benchmark methods

Originally posted by @iver56

Implement resample transform

https://github.com/adefossez/julius

SNR & gain distribution

#13 (comment)

What are the scenarios that we want to cover?

Bug: AddBackgroundNoise does not support CUDA.

I am getting RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cpu!.
I send my audio to cuda:2 and pass it to the augmenter, and I get the error above. The same problem occurs with RIRs. I also tried moving the augmenter itself to the GPU, but that did not help. Is this a known issue?

Merge test_requires (in setup.py) with .github/workflows/test_requirements.txt

There's an off-by-one error in AddBackgroundNoise that we need to fix before v0.5.0 can be released

There's an off-by-one error in AddBackgroundNoise that we need to fix before v0.5.0 can be released:

============================================================= FAILURES ==============================================================
____________________________ TestAddBackgroundNoise.test_background_noise_guaranteed_with_batched_tensor ____________________________

self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_background_noise_guaranteed_with_batched_tensor>

    def test_background_noise_guaranteed_with_batched_tensor(self):
        mixed_inputs = self.bg_noise_transform_guaranteed(
>           self.input_audios, self.sample_rate
        )

tests\test_background_noise.py:85:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:164: in forward
    self.randomize_parameters(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0...0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]]],
       dtype=torch.float64)
sample_rate = 16000

    def randomize_parameters(
        self, selected_samples: torch.Tensor, sample_rate: int = None
    ):
        """

        :params selected_samples: (batch_size, num_channels, num_samples)
        """

        batch_size, _, num_samples = selected_samples.shape

        # (batch_size, num_samples) RMS-normalized background noise
        audio = self.audio if hasattr(self, "audio") else Audio(sample_rate, mono=True)
        self.transform_parameters["background"] = torch.stack(
>           [self.random_background(audio, num_samples) for _ in range(batch_size)]
        )
E       RuntimeError: stack expects each tensor to be equal size, but got [1, 140544] at entry 0 and [1, 140545] at entry 1

torch_audiomentations\augmentations\background_noise.py:113: RuntimeError
____________________________ TestAddBackgroundNoise.test_background_noise_guaranteed_with_single_tensor _____________________________

self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_background_noise_guaranteed_with_single_tensor>

    def test_background_noise_guaranteed_with_single_tensor(self):
        mixed_input = self.bg_noise_transform_guaranteed(
>           self.input_audio, self.sample_rate
        )

tests\test_background_noise.py:77:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:168: in forward
    ] = self.apply_transform(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]]],
       dtype=torch.float64), sample_rate = 16000

    def apply_transform(self, selected_samples: torch.Tensor, sample_rate: int = None):

        batch_size, num_channels, num_samples = selected_samples.shape

        # (batch_size, num_samples)
        background = self.transform_parameters["background"].to(selected_samples.device)

        # (batch_size, num_channels)
        background_rms = calculate_rms(selected_samples) / (
            10 ** (self.transform_parameters["snr_in_db"].unsqueeze(dim=-1) / 20)
        )

        return selected_samples + background_rms.unsqueeze(-1) * background.view(
>           batch_size, 1, num_samples
        ).expand(-1, num_channels, -1)
E       RuntimeError: shape '[1, 1, 140544]' is invalid for input of size 140545

torch_audiomentations\augmentations\background_noise.py:134: RuntimeError
_______________________________________ TestAddBackgroundNoise.test_varying_snr_within_batch ________________________________________

self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_varying_snr_within_batch>

    def test_varying_snr_within_batch(self):
        min_snr_in_db = 3
        max_snr_in_db = 30
        augment = AddBackgroundNoise(
            self.bg_path, min_snr_in_db=3, max_snr_in_db=30, p=1.0
        )
>       augmented_audios = augment(self.input_audios, self.sample_rate)

tests\test_background_noise.py:105:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:164: in forward
    self.randomize_parameters(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0...0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]]],
       dtype=torch.float64)
sample_rate = 16000

    def randomize_parameters(
        self, selected_samples: torch.Tensor, sample_rate: int = None
    ):
        """

        :params selected_samples: (batch_size, num_channels, num_samples)
        """

        batch_size, _, num_samples = selected_samples.shape

        # (batch_size, num_samples) RMS-normalized background noise
        audio = self.audio if hasattr(self, "audio") else Audio(sample_rate, mono=True)
        self.transform_parameters["background"] = torch.stack(
>           [self.random_background(audio, num_samples) for _ in range(batch_size)]
        )
E       RuntimeError: stack expects each tensor to be equal size, but got [1, 140544] at entry 0 and [1, 140545] at entry 2

torch_audiomentations\augmentations\background_noise.py:113: RuntimeError
____________________________________________ test_transform_is_differentiable[augment0] _____________________________________________

augment = AddBackgroundNoise()

    @pytest.mark.parametrize(
        "augment",
        [
            # Differentiable transforms:
            AddBackgroundNoise(BG_NOISE_PATH, 20, p=1.0),
            ApplyImpulseResponse(IR_PATH, p=1.0),
            Compose(
                transforms=[
                    Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=1.0),
                    PolarityInversion(p=1.0),
                ]
            ),
            Gain(min_gain_in_db=-6.000001, max_gain_in_db=-6, p=1.0),
            PolarityInversion(p=1.0),
            Shift(p=1.0),
            # Non-differentiable transforms:
            # RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
            # [torch.DoubleTensor [1, 5]], which is output 0 of IndexBackward, is at version 1; expected version 0 instead.
            # Hint: enable anomaly detection to find the operation that failed to compute its gradient,
            # with torch.autograd.set_detect_anomaly(True).
            pytest.param(
                PeakNormalization(p=1.0), marks=pytest.mark.skip("Not differentiable")
            ),
        ],
    )
    def test_transform_is_differentiable(augment):
        sample_rate = 16000
        # Note: using float64 dtype to be compatible with AddBackgroundNoise fixtures
        samples = torch.tensor(
            [[1.0, 0.5, -0.25, -0.125, 0.0]], dtype=torch.float64
        ).unsqueeze(1)
        samples_cpy = deepcopy(samples)

        # We are going to convert the input tensor to a nn.Parameter so that we can
        # track the gradients with respect to it. We'll "optimize" the input signal
        # to be closer to that after the augmentation to test differentiability
        # of the transform. If the signal got changed in any way, and the test
        # didn't crash, it means it works.
        samples = torch.nn.Parameter(samples)
        optim = SGD([samples], lr=1.0)
        for i in range(10):
            optim.zero_grad()
>           transformed = augment(samples=samples, sample_rate=sample_rate)

tests\test_differentiable.py:64:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:168: in forward
    ] = self.apply_transform(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[ 1.0000,  0.5000, -0.2500, -0.1250,  0.0000]]], dtype=torch.float64,
       grad_fn=<IndexBackward>)
sample_rate = 16000

    def apply_transform(self, selected_samples: torch.Tensor, sample_rate: int = None):

        batch_size, num_channels, num_samples = selected_samples.shape

        # (batch_size, num_samples)
        background = self.transform_parameters["background"].to(selected_samples.device)

        # (batch_size, num_channels)
        background_rms = calculate_rms(selected_samples) / (
            10 ** (self.transform_parameters["snr_in_db"].unsqueeze(dim=-1) / 20)
        )

        return selected_samples + background_rms.unsqueeze(-1) * background.view(
>           batch_size, 1, num_samples
        ).expand(-1, num_channels, -1)
E       RuntimeError: shape '[1, 1, 5]' is invalid for input of size 6

torch_audiomentations\augmentations\background_noise.py:134: RuntimeError

Originally posted by @iver56 in #61 (comment)
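
The tracebacks above show background tensors of 140545 samples where 140544 were expected. One possible guard, independent of where the extra sample comes from, is to force every loaded background to the requested length before stacking. The helper fix_length below is a hypothetical sketch and not the fix that was actually applied in the repository.

```python
import torch
import torch.nn.functional as F


def fix_length(audio: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Trim or zero-pad the last dimension so that it has exactly num_samples."""
    current = audio.shape[-1]
    if current > num_samples:
        return audio[..., :num_samples]
    if current < num_samples:
        # Pad only on the right side of the last dimension.
        return F.pad(audio, (0, num_samples - current))
    return audio
```

Applied to each element inside the list comprehension that feeds torch.stack, this would make all entries the same size. It would hide the off-by-one rather than explain it, so the index computation in the loader would still be worth checking.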