Comments (13)

haoheliu avatar haoheliu commented on July 19, 2024 3

@galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.

from audioldm.

galfaroth avatar galfaroth commented on July 19, 2024 3

Hey! Thanks for the reply! What if I wanted to test the super-resolution? Can you provide an example too, and possibly a sample input and output?

haoheliu avatar haoheliu commented on July 19, 2024

Sure. We will open-source that part, which is also in the TODO list.

galfaroth avatar galfaroth commented on July 19, 2024

Could you possibly just send the Audio Super Resolution model you used so that we don't have to download the dataset and train ourselves?

devilismyfriend avatar devilismyfriend commented on July 19, 2024

galfaroth avatar galfaroth commented on July 19, 2024

@galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.

No way!

haoheliu avatar haoheliu commented on July 19, 2024

Hi all, the code related to super-resolution and inpainting is available here: https://github.com/haoheliu/AudioLDM/blob/main/audioldm/pipeline.py#L223

It has not been integrated into the command-line usage yet because I haven't come up with an elegant and simple interface. I'm trying to avoid making this tool exceedingly heavy, and from my perspective super-resolution and inpainting may not be of broad interest (correct me if I'm wrong). So for now I'll leave super-resolution and inpainting in this Python function form. You can still play with the function, though. I've already tested it and it works fine.

galfaroth avatar galfaroth commented on July 19, 2024

Hey, I tried using the new method:

def upsample(original_filepath,text, duration, guidance_scale, random_seed, n_candidates, steps):
  waveform = super_resolution_and_inpainting(audioldm,text,original_filepath,
                                  seed=random_seed,ddim_steps=steps,
                                  duration=duration, batchsize=1,
                                  guidance_scale=guidance_scale,
                                  n_candidate_gen_per_text=int(n_candidates),
                                  time_mask_ratio_start_and_end=(1.0, 1.0), # no inpainting,
                                  freq_mask_ratio_start_and_end=(0.75, 1.0), # regenerate the higher 75% to 100% mel bins
                                  )
  if(len(waveform) == 1):
    waveform = waveform[0]
  return waveform

but then I get:

[<ipython-input-11-eac161f8fca7>](https://localhost:8080/#) in upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps)
      8 
      9 def upsample(original_filepath,text, duration, guidance_scale, random_seed, n_candidates, steps):
---> 10   waveform = super_resolution_and_inpainting(audioldm,text,original_filepath,
     11                                   seed=random_seed,ddim_steps=steps,
     12                                   duration=duration, batchsize=1,

[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in super_resolution_and_inpainting(latent_diffusion, text, original_audio_file_path, seed, ddim_steps, duration, batchsize, guidance_scale, n_candidate_gen_per_text, time_mask_ratio_start_and_end, freq_mask_ratio_start_and_end, config)
    258     )
    259 
--> 260     batch = make_batch_for_text_to_audio(text, fbank=mel[None,...], batchsize=batchsize)
    261 
    262     # latent_diffusion.latent_t_size = duration_to_latent_t_size(duration)

[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in make_batch_for_text_to_audio(text, waveform, fbank, batchsize)
     26     else:
     27         fbank = torch.FloatTensor(fbank)
---> 28         fbank = fbank.expand(batchsize, 1024, 64)
     29         assert fbank.size(0) == batchsize
     30 

RuntimeError: The expanded size of the tensor (1024) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 1024, 64]. Tensor sizes: [1, 512, 64]

I know the base SR = 16000, where do I specify the target SR? Can it upscale to 96000 for example?
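One possible reading of the RuntimeError above, offered as a sketch rather than a confirmed diagnosis: the traceback shows make_batch_for_text_to_audio expanding the mel spectrogram to a fixed 1024 frames, while the spectrogram that was passed in had 512. If the number of mel frames is derived from the duration at about 102.4 frames per second (an assumption that is not confirmed anywhere in this thread; FRAMES_PER_SECOND and mel_frames are hypothetical names), then only a 10-second duration would produce the 1024 frames that the expand call expects:

```python
# ASSUMPTION: mel frames scale with duration at 102.4 frames per second.
# This rate is not stated in the thread; it is chosen so that 10 s -> 1024 frames.
FRAMES_PER_SECOND = 102.4

# Size hard-coded in the expand() call shown in the traceback.
EXPECTED_FRAMES = 1024


def mel_frames(duration: float) -> int:
    """Hypothetical helper: mel frame count for a clip of `duration` seconds."""
    return int(duration * FRAMES_PER_SECOND)


def matches_expected(duration: float) -> bool:
    """True if the frame count equals the size expand() asks for."""
    return mel_frames(duration) == EXPECTED_FRAMES


for seconds in (5.0, 10.0):
    print(seconds, mel_frames(seconds), matches_expected(seconds))
```

Under that assumption, duration=5 would give exactly the 512 frames reported in the error, so calling the function with duration=10 is one way to test this reading.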

haoheliu avatar haoheliu commented on July 19, 2024

@galfaroth Super-resolution here means upsampling audio with a sampling rate below 16 kHz to 16 kHz. Going to a higher sampling rate would be a separate research effort.
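To make the relation between input sampling rate and mel bins more concrete, here is a rough sketch. It assumes a mel filterbank that spans 0 Hz to 8 kHz (the Nyquist frequency of 16 kHz audio) and uses the HTK mel formula; neither detail is confirmed in this thread, and hz_to_mel_htk and mel_fraction_covered are hypothetical helpers, not part of audioldm. It estimates which fraction of the mel axis an input with a lower sampling rate can actually fill:

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """HTK-style mel scale (ASSUMPTION: the model may use a different variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_fraction_covered(input_sr: float, fmax_hz: float = 8000.0) -> float:
    """Fraction of the 0..fmax_hz mel axis below the input's Nyquist frequency."""
    nyquist_hz = input_sr / 2.0
    return min(1.0, hz_to_mel_htk(nyquist_hz) / hz_to_mel_htk(fmax_hz))


for sr in (4000, 8000, 12000, 16000):
    print(sr, round(mel_fraction_covered(sr), 3))
```

Under these assumptions an 8 kHz input covers roughly the lower three quarters of the mel axis, which would leave about the top quarter of the bins to be regenerated.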

galfaroth avatar galfaroth commented on July 19, 2024

@galfaroth Super-resolution here means upsampling audio with a sampling rate below 16 kHz to 16 kHz. Going to a higher sampling rate would be a separate research effort.

Apart from the target sampling rate, why do I get the error? Can you post an example of how to do the upsampling with this method?

haoheliu avatar haoheliu commented on July 19, 2024

You can use the following script (sr_inpainting.py) @galfaroth

#!/usr/bin/python3
import os
from audioldm import text_to_audio, style_transfer, build_model, save_wave, get_time, super_resolution_and_inpainting
import argparse

CACHE_DIR = os.getenv(
    "AUDIOLDM_CACHE_DIR",
    os.path.join(os.path.expanduser("~"), ".cache/audioldm"))

parser = argparse.ArgumentParser()

parser.add_argument(
    "-t",
    "--text",
    type=str,
    required=False,
    default="",
    help="Text prompt to the model for audio generation",
)

parser.add_argument(
    "-f",
    "--file_path",
    type=str,
    required=False,
    default=None,
    help="(--mode transfer): Original audio file for style transfer; Or (--mode generation): the guidance audio file for generating similar audio",
)

parser.add_argument(
    "--transfer_strength",
    type=float,
    required=False,
    default=0.5,
    help="A value between 0 and 1. 0 means original audio without transfer, 1 means completely transfer to the audio indicated by text",
)

parser.add_argument(
    "-s",
    "--save_path",
    type=str,
    required=False,
    help="The path to save model output",
    default="./output",
)

parser.add_argument(
    "-ckpt",
    "--ckpt_path",
    type=str,
    required=False,
    help="The path to the pretrained .ckpt model",
    default=os.path.join(
                CACHE_DIR,
                "audioldm-s-full.ckpt",
            ),
)

parser.add_argument(
    "-b",
    "--batchsize",
    type=int,
    required=False,
    default=1,
    help="Generate how many samples at the same time",
)

parser.add_argument(
    "--ddim_steps",
    type=int,
    required=False,
    default=200,
    help="The sampling step for DDIM",
)

parser.add_argument(
    "-gs",
    "--guidance_scale",
    type=float,
    required=False,
    default=2.5,
    help="Guidance scale (Large => better quality and relevance to text; Small => better diversity)",
)

parser.add_argument(
    "-dur",
    "--duration",
    type=float,
    required=False,
    default=10.0,
    help="The duration of the samples",
)

parser.add_argument(
    "-n",
    "--n_candidate_gen_per_text",
    type=int,
    required=False,
    default=3,
    help="Automatic quality control. This number controls the number of candidates (e.g., generate three audio clips and choose the best to show you). A larger value usually leads to better quality at the cost of heavier computation",
)

parser.add_argument(
    "--seed",
    type=int,
    required=False,
    default=42,
    help="Changing this value (any integer) will lead to a different generation result.",
)

args = parser.parse_args()
assert args.duration % 2.5 == 0, "Duration must be a multiple of 2.5"

mode = "super_resolution_and_inpainting"
        
save_path = os.path.join(args.save_path, mode)

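if args.file_path is not None:
    # Name the output subfolder after the input file, without its extension.
    # splitext on the basename also handles paths such as "./trumpet.wav".
    save_path = os.path.join(save_path, os.path.splitext(os.path.basename(args.file_path))[0])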

text = args.text
random_seed = args.seed
duration = args.duration
guidance_scale = args.guidance_scale
n_candidate_gen_per_text = args.n_candidate_gen_per_text

os.makedirs(save_path, exist_ok=True)
audioldm = build_model(ckpt_path=args.ckpt_path)

waveform = super_resolution_and_inpainting(
    audioldm,
    text,
    args.file_path,
    random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=args.ddim_steps,
    n_candidate_gen_per_text=n_candidate_gen_per_text,
    batchsize=args.batchsize,
    time_mask_ratio_start_and_end=(0.10, 0.15), # regenerate the 10% to 15% of the time steps in the spectrogram
    # time_mask_ratio_start_and_end=(1.0, 1.0), # no inpainting
    # freq_mask_ratio_start_and_end=(0.75, 1.0), # regenerate the higher 75% to 100% mel bins
    freq_mask_ratio_start_and_end=(1.0, 1.0), # no super-resolution
)


save_wave(waveform, save_path, name="%s_%s" % (get_time(), text))

On the command line, run this script with:

python3 sr_inpainting.py -f trumpet.wav

The script will then inpaint the part of the audio between 10% and 15% of the time steps.
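As a rough illustration of what the two ratio arguments select, the sketch below converts a (start, end) ratio pair into an index range. The sizes 1024 (time frames) and 64 (mel bins) are taken from the traceback earlier in this thread; masked_span is a hypothetical helper, and truncating with int() is an assumption about the rounding, which the thread does not show.

```python
N_FRAMES = 1024   # time frames, from the expand() call in the traceback
N_MEL_BINS = 64   # mel bins, from the same call


def masked_span(ratio_start_and_end, size):
    """Hypothetical helper: map a (start, end) ratio pair to [start, end) indices."""
    start_ratio, end_ratio = ratio_start_and_end
    if not 0.0 <= start_ratio <= end_ratio <= 1.0:
        raise ValueError("ratios must satisfy 0 <= start <= end <= 1")
    # ASSUMPTION: indices are obtained by truncation.
    return int(size * start_ratio), int(size * end_ratio)


# Inpainting setting from the script: regenerate 10% to 15% of the time axis.
print(masked_span((0.10, 0.15), N_FRAMES))    # (102, 153)
# Super-resolution setting: regenerate the top quarter of the mel bins.
print(masked_span((0.75, 1.0), N_MEL_BINS))   # (48, 64)
# (1.0, 1.0) yields an empty span, i.e. nothing is regenerated on that axis.
print(masked_span((1.0, 1.0), N_FRAMES))      # (1024, 1024)
```

Read this way, switching the script from inpainting to super-resolution amounts to setting time_mask_ratio_start_and_end=(1.0, 1.0) and freq_mask_ratio_start_and_end=(0.75, 1.0), as in the commented-out lines of the script.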

bitnom avatar bitnom commented on July 19, 2024

omg it's happening

Hikari-Tsai avatar Hikari-Tsai commented on July 19, 2024

Hi @galfaroth,
Just modify the freq_mask_ratio_start_and_end parameter in @haoheliu's sample code.
It's worth spending a little time to understand this repo; it's a good investment.
