torch-audiomentations's People

Contributors

akashrajkn, amiasato, eschmidbauer, francislata, frenchkrab, hbredin, hlasse, iver56, kentonishi, keunwoochoi, miccio-dk, mogwai, morenolaquatra, mpariente, nuniz, pzelasko, shahules786

torch-audiomentations's Issues

Update find_audio_files to look into subfolders as well

I just downloaded the EchoThief RIRs and naively thought find_audio_files would look into subfolders. It did not, and I got a nice error message telling me it could not find any audio file...

The reason is that EchoThief RIRs are stored in subfolders by category.
We should update find_audio_files to look into subfolders.
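
A minimal sketch of what a recursive version could look like, built on pathlib's rglob (the supported extension set here is a placeholder, and the real find_audio_files signature may differ):

from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".flac", ".ogg", ".mp3"}  # placeholder set

def find_audio_files(root):
    """Recursively collect audio files under root, including subfolders."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.suffix.lower() in SUPPORTED_EXTENSIONS
    )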

Add support for using audiomentations transforms

Some things, like MP3 compression, are not easy to implement in PyTorch, so we have to rely on other libraries like audiomentations for that. Here's an example of what I imagine:

from audiomentations import Mp3Compression
from torch_audiomentations import AudiomentationsWaveformTransform

augment = AudiomentationsWaveformTransform(
    Mp3Compression(min_bitrate=32, max_bitrate=96, p=1.0), p=0.5
)

...
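
Such a wrapper would presumably have to round-trip each example through NumPy on the CPU, since audiomentations operates on NumPy arrays. A rough sketch of that idea, assuming mono input of shape (batch_size, 1, num_samples) (the class internals here are hypothetical, not an existing API):

import torch

class AudiomentationsWaveformTransform(torch.nn.Module):
    """Hypothetical adapter around a CPU/NumPy audiomentations transform."""

    def __init__(self, transform, p: float = 0.5):
        super().__init__()
        self.transform = transform
        self.p = p

    def forward(self, samples: torch.Tensor, sample_rate: int) -> torch.Tensor:
        if torch.rand(1).item() > self.p:
            return samples
        device = samples.device
        augmented = [
            torch.from_numpy(
                self.transform(example[0].cpu().numpy(), sample_rate=sample_rate)
            ).unsqueeze(0)
            for example in samples  # one (1, num_samples) example at a time
        ]
        return torch.stack(augmented).to(device)

Note that a transform like Mp3Compression may change the number of samples, so the wrapper would also need a policy for length mismatches.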

Sample Rate in the forward

I'm sure a lot of people working with torch-audiomentations will want to use some of the transforms available in torchaudio, at least for now until #4 is ready. I noticed that the sample rate is required in the forward call of BaseWaveformTransform. This means that when composing a list of transforms, I have to thread the sample rate through every forward call to invoke them correctly. Would it not be better to have sample_rate initialised as part of __init__?
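
For comparison, the two calling conventions might look like this (a sketch; the sample_rate constructor argument is the suggestion, not the current API):

import torch
from torch_audiomentations import Gain

waveforms = torch.rand(1, 1, 16000) * 2 - 1  # (batch, channels, samples)

# Current convention: the sample rate accompanies every forward call
augment = Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5)
augmented = augment(waveforms, sample_rate=16000)

# Suggested convention: the sample rate is fixed once, at construction time
augment = Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5, sample_rate=16000)
augmented = augment(waveforms)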

Multi-GPU augmentations

For multi-GPU training, users might use DataParallel or DistributedDataParallel. The augmentations should probably follow that as well, to avoid copying parameters from one device to another (and losing more time).
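
One way to get this behaviour without extra plumbing is to make the augmentation a submodule of the model, so that DataParallel/DistributedDataParallel replicate it together with the rest of the network. A sketch of that pattern (ModelWithAugmentation is a made-up wrapper, not part of the library):

import torch.nn as nn
from torch_audiomentations import Gain

class ModelWithAugmentation(nn.Module):
    def __init__(self, model: nn.Module):
        super().__init__()
        # Registered as a submodule, so .to(device) and DDP replication cover it
        self.augmentation = Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5)
        self.model = model

    def forward(self, waveforms, sample_rate):
        return self.model(self.augmentation(waveforms, sample_rate))

# e.g. wrapped = nn.parallel.DistributedDataParallel(
#     ModelWithAugmentation(my_model).to(rank), device_ids=[rank]
# )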

Support multichannel audio

There are several ways of representing multichannel audio:

Channels last

shape like [batch_size, num_samples, num_channels]

  • Common when loading/saving audio from/to wav

Samples last

shape like [batch_size, num_channels, num_samples]

  • Convenient for convolution. E.g. for STFT, the time dimension is commonly the last dimension (in torchaudio, torch-stft and asteroid's STFT), according to @mpariente
  • "I always prefer channel first because in case one wants to apply an operation to each channel independently can do straight reshaping. Otherwise you have to transpose then reshape." - @popcornell
  • Librosa uses this convention

Based on this information, I would choose samples last.
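
The reshaping argument for samples last can be shown in a few lines: with channels first, applying an operation to each channel independently is a straight view, while channels-last data needs a transpose (and therefore a copy) first. A minimal illustration:

import torch

batch_size, num_channels, num_samples = 4, 2, 16000

# Samples last: fold channels into the batch dimension with a plain view
x = torch.randn(batch_size, num_channels, num_samples)
per_channel = x.view(batch_size * num_channels, 1, num_samples)

# Channels last: transpose first, which forces reshape to copy the data
y = torch.randn(batch_size, num_samples, num_channels)
per_channel_y = y.transpose(1, 2).reshape(batch_size * num_channels, 1, num_samples)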

Reverb Augmentation

Currently supported in torchaudio using sox.

import torch
import torchaudio
from IPython.display import Audio, display

# List the available sox effects (includes "reverb")
print(torchaudio.sox_effects.effect_names())

effects = [["reverb"]]
wave, sr = torchaudio.load("./tests/data/tst00.wav")
display(Audio(wave, rate=sr))
wave, sr = torchaudio.sox_effects.apply_effects_tensor(wave, sr, effects)
print(sr, wave.shape)
display(Audio(wave[0][None], rate=sr))

I found an article where someone implements Schroeder's reverberator algorithm and provides code.

Is this something that you are planning to implement, or are in the process of doing so?
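
For reference, the core of Schroeder's design is a bank of parallel feedback comb filters followed by series allpass stages (omitted here). A minimal, unoptimized sketch of the comb bank, with illustrative delay/gain values assuming roughly 44.1 kHz audio; a practical implementation would vectorize this or use FFT convolution:

import torch

def comb(x: torch.Tensor, delay: int, gain: float) -> torch.Tensor:
    """Feedback comb filter: y[n] = x[n] + gain * y[n - delay]."""
    y = x.clone()
    for n in range(delay, y.shape[-1]):
        y[..., n] = y[..., n] + gain * y[..., n - delay]
    return y

def schroeder_reverb_lite(x: torch.Tensor, wet: float = 0.3) -> torch.Tensor:
    delays, gain = [1557, 1617, 1491, 1422], 0.84  # illustrative values
    wet_signal = sum(comb(x, d, gain) for d in delays) / len(delays)
    return (1 - wet) * x + wet * wet_signal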

Test find_audio_files

  • supports uppercase filename extension
  • can find audio files in subfolders
  • ignores irrelevant files like .txt
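
These could translate into pytest cases along these lines (a sketch; the import path and the use of empty placeholder files are assumptions):

from torch_audiomentations.utils.file import find_audio_files  # assumed import path

def test_supports_uppercase_filename_extension(tmp_path):
    (tmp_path / "NOISE.WAV").touch()
    assert len(find_audio_files(tmp_path)) == 1

def test_finds_audio_files_in_subfolders(tmp_path):
    subfolder = tmp_path / "category"
    subfolder.mkdir()
    (subfolder / "noise.wav").touch()
    assert len(find_audio_files(tmp_path)) == 1

def test_ignores_irrelevant_files(tmp_path):
    (tmp_path / "notes.txt").touch()
    assert len(find_audio_files(tmp_path)) == 0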

Use pre-commit

In pyannote.audio we have a set of functions that run before committing, via the pre-commit git hook. It makes code styling more consistent, checks for missing imports and unused imports, formats code that is about to be committed, removes trailing whitespace, etc. I've seen @iver56 make a few commits just to blackify code, so I thought this might help save him some time.

More details
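
A minimal .pre-commit-config.yaml covering the black formatting part could look like this (the hook revisions are placeholders to pin to real tags; flake8 and the standard pre-commit hooks handle the import and whitespace checks):

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/psf/black
    rev: 20.8b1  # placeholder; pin to a current tag
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 3.8.4  # placeholder
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.4.0  # placeholder
    hooks:
      - id: trailing-whitespace

Running pre-commit install once registers the git hook.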

Support the following conventions: per channel, per image and per ensemble

per_channel augmentations: an augmentation that is applied to each audio channel separately
per_image augmentations: augmentations that are applied to multichannel audio as a whole (e.g. swapping the channels, changing the spatial image)
per_ensemble (or per_batch?) augmentations: augmentations that are applied to the whole ensemble of sources. E.g. imagine you want to change the energy of each image but keep the overall mix energy constant or below a certain value

Ref. a comment from @faroit on slack

In my opinion, per channel and per image have the highest priority.

Configuration file

Would be nice for reproducibility reasons to be able to instantiate augmentations from a configuration file.

augment = AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)

would be equivalent to loading from this configuration file:

# config.yaml
AddGaussianNoise:
    min_amplitude: 0.001
    max_amplitude: 0.015
    p: 0.5

import yaml
import torch_audiomentations

with open("config.yaml", "r") as fp:
    augment = torch_audiomentations.from_config(yaml.safe_load(fp))

Composition could have the following (simple) syntax:

# config.yaml
Compose:
    AddGaussianNoise:
        min_amplitude: 0.001
        max_amplitude: 0.015
        p: 0.5
    TimeStretch:
        min_rate: 0.8
        max_rate: 1.25
        p: 0.5

Would you consider a PR adding such a feature?
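
from_config could stay quite small if the keys in the config map one-to-one onto class names in the package. A sketch of that idea (from_config does not exist yet; this is just the proposal made concrete):

import torch_audiomentations

def from_config(config: dict):
    """Instantiate a transform from a single-entry {ClassName: kwargs} mapping."""
    (name, kwargs), = config.items()
    cls = getattr(torch_audiomentations, name)
    if name == "Compose":
        # Each entry under Compose is itself a transform config
        children = [from_config({key: value}) for key, value in kwargs.items()]
        return cls(transforms=children)
    return cls(**kwargs)

One caveat with the proposed Compose syntax: a YAML mapping cannot repeat a key, so two transforms of the same class in one Compose would need a list-based syntax instead.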

Process (various kinds of) target data similarly to input data

The idea is that whatever perturbations are applied to the input are reflected in the target data

Scenarios:

  • Audio → class(es), fixed length, not correlated with the input length. Low priority.
  • Audio → time series of class(es), e.g. predict a class for each frame of 25 ms with hop size 10 ms. Medium priority.
  • Audio → audio (length same as the input). This one has the highest priority.

Add `train()` method?

Suppose I define MyNetwork with built-in augmentation like this:

import torch
import torch.nn as nn
from torch_audiomentations import Gain

class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.augmentation = Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5)
        self.other_layers = ...

    def forward(self, waveforms):
        augmented_waveforms = self.augmentation(waveforms)
        return self.other_layers(augmented_waveforms)

net = MyNetwork()

As per my understanding of the current code base, calling net.train(mode=False) (aka net.eval()) has no effect on net.augmentation. Augmentation will always be applied, whether we call net.eval() or not.

Would it make sense to add the following method to BaseWaveformTransform so that one can activate/deactivate augmentations with net.train(mode=...)?

class BaseWaveformTransform(nn.Module):
    def train(self, mode: bool = True):
        super().train(mode)  # keep self.training and child modules in sync
        if mode:
            # getattr guards against train() being called before the first eval()
            self.p = getattr(self, "_p_backup", self.p)
        else:
            self._p_backup = self.p
            self.p = 0.0
        return self

The _p_backup hack is ugly and the whole p management should probably be rewritten but you get the idea. What do you think?
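
An alternative that avoids mutating p at all: nn.Module already tracks the mode in self.training, so forward could simply short-circuit when not training (a sketch of that variant):

import torch.nn as nn

class BaseWaveformTransform(nn.Module):
    def forward(self, samples, sample_rate=None):
        if not self.training:
            return samples  # augmentation disabled in eval mode
        ...  # existing randomized augmentation logic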

Batch GPU Transforms

The BaseWaveformTransform currently doesn't accept a batch of audio in the forward, because it interprets a batch of audio as multichannel audio. I can submit a PR, or is there some design decision here that I'm missing?

Set up CI

  • lint
  • run tests
  • measure code coverage
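
A GitHub Actions workflow covering those three points might look roughly like this (the action versions, Python version and choice of black/pytest-cov are placeholders):

# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install -e . black pytest pytest-cov
      - run: black --check .                     # lint
      - run: pytest --cov=torch_audiomentations  # run tests + measure coverage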

Implement CutMix

@nicofarr

torch-audiomentations works on tensors with shapes like (batch_size, samples) or (batch_size, channels, samples)

How do you propose CutMix would work when replacing some of the audio in an audio example? Pick audio from a different audio example in the batch? Or load different audio from disk?
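
The in-batch variant could be done with a roll, so that every example donates a segment to its neighbour in the batch. A rough sketch (the segment length and placement policy are made up for illustration):

import torch

def cutmix_waveforms(samples: torch.Tensor, max_fraction: float = 0.25) -> torch.Tensor:
    """Replace a random time segment of each example with the same segment
    from the previous example in the batch.
    samples: (batch_size, num_channels, num_samples)"""
    _, _, num_samples = samples.shape
    length = int(torch.randint(1, int(max_fraction * num_samples), (1,)))
    start = int(torch.randint(0, num_samples - length, (1,)))
    donors = torch.roll(samples, shifts=1, dims=0)  # example i receives from i-1
    mixed = samples.clone()
    mixed[..., start : start + length] = donors[..., start : start + length]
    return mixed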

Pre-loading background noise (and IR) in __init__?

If one of the objectives of torch-audiomentations is to do fast augmentation on GPU as part of a larger network, reading from disk for each new audio sample would slow down the forward pass dramatically, wouldn't it?

current_bg_audio = load_audio(
    bg_file_path,
    sample_rate=sample_rate,
    start=bg_start_index,
    stop=bg_stop_index,
)

Would it make sense to allow the user to preload the whole background noise collection in memory during __init__ and then read from memory in apply_transform?

Or, if the background noise collection is too big to fit in memory, we could load a bunch of them, use them for a while, and then load another (random) bunch of them, etc.
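
A sketch of the preloading idea, reusing the load_audio call from the snippet above (the attribute names and the preload flag are just for illustration):

import random
import torch

class AddBackgroundNoise(torch.nn.Module):
    def __init__(self, bg_file_paths, sample_rate, preload: bool = True):
        super().__init__()
        self.bg_file_paths = bg_file_paths
        self.sample_rate = sample_rate
        # Read every noise file into memory once, up front
        self.bg_audios = (
            [load_audio(p, sample_rate=sample_rate) for p in bg_file_paths]
            if preload
            else None
        )

    def pick_background(self):
        if self.bg_audios is not None:
            return random.choice(self.bg_audios)  # fast path: already in RAM
        return load_audio(
            random.choice(self.bg_file_paths), sample_rate=self.sample_rate
        )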

Switch to julius for resampling

Pros:

  • Julius has fewer dependencies, and is a lighter dependency overall
  • Julius is much faster, and can run on GPU

Cons:

  • Julius is more memory-hungry
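
For reference, resampling with julius is a one-liner that operates on the last dimension and works the same on CUDA tensors (julius.resample_frac with integer source/target rates):

import julius
import torch

waveform = torch.randn(1, 2, 44100)  # (batch, channels, samples)
resampled = julius.resample_frac(waveform, 44100, 16000)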

Bug: AddBackgroundNoise does not support CUDA.

I am getting RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cpu!.
I send my audio to cuda:2 and pass it to the augmentor and I get the error above. Same problem with Rirs. I also tried to send the augmentor itself to GPU, but this did not help. Is this a known issue?

There's an off-by-one error in AddBackgroundNoise that we need to fix before v0.5.0 can be released

There's an off-by-one error in AddBackgroundNoise that we need to fix before v0.5.0 can be released:

============================================================= FAILURES ==============================================================
____________________________ TestAddBackgroundNoise.test_background_noise_guaranteed_with_batched_tensor ____________________________

self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_background_noise_guaranteed_with_batched_tensor>

    def test_background_noise_guaranteed_with_batched_tensor(self):
        mixed_inputs = self.bg_noise_transform_guaranteed(
>           self.input_audios, self.sample_rate
        )

tests\test_background_noise.py:85:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:164: in forward
    self.randomize_parameters(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0...0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]]],
       dtype=torch.float64)
sample_rate = 16000

    def randomize_parameters(
        self, selected_samples: torch.Tensor, sample_rate: int = None
    ):
        """

        :params selected_samples: (batch_size, num_channels, num_samples)
        """

        batch_size, _, num_samples = selected_samples.shape

        # (batch_size, num_samples) RMS-normalized background noise
        audio = self.audio if hasattr(self, "audio") else Audio(sample_rate, mono=True)
        self.transform_parameters["background"] = torch.stack(
>           [self.random_background(audio, num_samples) for _ in range(batch_size)]
        )
E       RuntimeError: stack expects each tensor to be equal size, but got [1, 140544] at entry 0 and [1, 140545] at entry 1

torch_audiomentations\augmentations\background_noise.py:113: RuntimeError
____________________________ TestAddBackgroundNoise.test_background_noise_guaranteed_with_single_tensor _____________________________

self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_background_noise_guaranteed_with_single_tensor>

    def test_background_noise_guaranteed_with_single_tensor(self):
        mixed_input = self.bg_noise_transform_guaranteed(
>           self.input_audio, self.sample_rate
        )

tests\test_background_noise.py:77:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:168: in forward
    ] = self.apply_transform(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]]],
       dtype=torch.float64), sample_rate = 16000

    def apply_transform(self, selected_samples: torch.Tensor, sample_rate: int = None):

        batch_size, num_channels, num_samples = selected_samples.shape

        # (batch_size, num_samples)
        background = self.transform_parameters["background"].to(selected_samples.device)

        # (batch_size, num_channels)
        background_rms = calculate_rms(selected_samples) / (
            10 ** (self.transform_parameters["snr_in_db"].unsqueeze(dim=-1) / 20)
        )

        return selected_samples + background_rms.unsqueeze(-1) * background.view(
>           batch_size, 1, num_samples
        ).expand(-1, num_channels, -1)
E       RuntimeError: shape '[1, 1, 140544]' is invalid for input of size 140545

torch_audiomentations\augmentations\background_noise.py:134: RuntimeError
_______________________________________ TestAddBackgroundNoise.test_varying_snr_within_batch ________________________________________

self = <tests.test_background_noise.TestAddBackgroundNoise testMethod=test_varying_snr_within_batch>

    def test_varying_snr_within_batch(self):
        min_snr_in_db = 3
        max_snr_in_db = 30
        augment = AddBackgroundNoise(
            self.bg_path, min_snr_in_db=3, max_snr_in_db=30, p=1.0
        )
>       augmented_audios = augment(self.input_audios, self.sample_rate)

tests\test_background_noise.py:105:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:164: in forward
    self.randomize_parameters(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0...0077, 0.0062, 0.0042]],

        [[0.0000, 0.0000, 0.0000,  ..., 0.0077, 0.0062, 0.0042]]],
       dtype=torch.float64)
sample_rate = 16000

    def randomize_parameters(
        self, selected_samples: torch.Tensor, sample_rate: int = None
    ):
        """

        :params selected_samples: (batch_size, num_channels, num_samples)
        """

        batch_size, _, num_samples = selected_samples.shape

        # (batch_size, num_samples) RMS-normalized background noise
        audio = self.audio if hasattr(self, "audio") else Audio(sample_rate, mono=True)
        self.transform_parameters["background"] = torch.stack(
>           [self.random_background(audio, num_samples) for _ in range(batch_size)]
        )
E       RuntimeError: stack expects each tensor to be equal size, but got [1, 140544] at entry 0 and [1, 140545] at entry 2

torch_audiomentations\augmentations\background_noise.py:113: RuntimeError
____________________________________________ test_transform_is_differentiable[augment0] _____________________________________________

augment = AddBackgroundNoise()

    @pytest.mark.parametrize(
        "augment",
        [
            # Differentiable transforms:
            AddBackgroundNoise(BG_NOISE_PATH, 20, p=1.0),
            ApplyImpulseResponse(IR_PATH, p=1.0),
            Compose(
                transforms=[
                    Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=1.0),
                    PolarityInversion(p=1.0),
                ]
            ),
            Gain(min_gain_in_db=-6.000001, max_gain_in_db=-6, p=1.0),
            PolarityInversion(p=1.0),
            Shift(p=1.0),
            # Non-differentiable transforms:
            # RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
            # [torch.DoubleTensor [1, 5]], which is output 0 of IndexBackward, is at version 1; expected version 0 instead.
            # Hint: enable anomaly detection to find the operation that failed to compute its gradient,
            # with torch.autograd.set_detect_anomaly(True).
            pytest.param(
                PeakNormalization(p=1.0), marks=pytest.mark.skip("Not differentiable")
            ),
        ],
    )
    def test_transform_is_differentiable(augment):
        sample_rate = 16000
        # Note: using float64 dtype to be compatible with AddBackgroundNoise fixtures
        samples = torch.tensor(
            [[1.0, 0.5, -0.25, -0.125, 0.0]], dtype=torch.float64
        ).unsqueeze(1)
        samples_cpy = deepcopy(samples)

        # We are going to convert the input tensor to a nn.Parameter so that we can
        # track the gradients with respect to it. We'll "optimize" the input signal
        # to be closer to that after the augmentation to test differentiability
        # of the transform. If the signal got changed in any way, and the test
        # didn't crash, it means it works.
        samples = torch.nn.Parameter(samples)
        optim = SGD([samples], lr=1.0)
        for i in range(10):
            optim.zero_grad()
>           transformed = augment(samples=samples, sample_rate=sample_rate)

tests\test_differentiable.py:64:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\anaconda3\envs\torch-audiomentations-gpu\lib\site-packages\torch\nn\modules\module.py:722: in _call_impl
    result = self.forward(*input, **kwargs)
torch_audiomentations\core\transforms_interface.py:168: in forward
    ] = self.apply_transform(selected_samples, sample_rate)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AddBackgroundNoise()
selected_samples = tensor([[[ 1.0000,  0.5000, -0.2500, -0.1250,  0.0000]]], dtype=torch.float64,
       grad_fn=<IndexBackward>)
sample_rate = 16000

    def apply_transform(self, selected_samples: torch.Tensor, sample_rate: int = None):

        batch_size, num_channels, num_samples = selected_samples.shape

        # (batch_size, num_samples)
        background = self.transform_parameters["background"].to(selected_samples.device)

        # (batch_size, num_channels)
        background_rms = calculate_rms(selected_samples) / (
            10 ** (self.transform_parameters["snr_in_db"].unsqueeze(dim=-1) / 20)
        )

        return selected_samples + background_rms.unsqueeze(-1) * background.view(
>           batch_size, 1, num_samples
        ).expand(-1, num_channels, -1)
E       RuntimeError: shape '[1, 1, 5]' is invalid for input of size 6

torch_audiomentations\augmentations\background_noise.py:134: RuntimeError

Originally posted by @iver56 in #61 (comment)
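
The failing assertions all show a background tensor that is one sample longer than the input (e.g. 140545 vs 140544), which points at an off-by-one in how random_background slices the noise. A defensive guard could be to trim the chunk to exactly num_samples before stacking (a sketch, not the actual patch):

# Hypothetical guard at the end of random_background:
background = background[..., :num_samples]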
