Comments (13)
@galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.
from audioldm.
Hey! Thanks for the reply! What if I wanted to test the super resolution? Can you provide an example too? And possibly sample in and out example.
Sure. We will open-source that part, which is also in the TODO list.
Could you possibly just send the Audio Super Resolution model you used so that we don't have to download the dataset and train ourselves?
> @galfaroth Super-resolution and inpainting will be available this Friday. Thanks for your patience.
No way!
Hi all, the code related to super-resolution and inpainting is available here: https://github.com/haoheliu/AudioLDM/blob/main/audioldm/pipeline.py#L223
It hasn't been integrated into the command-line interface yet because I haven't come up with an elegant and simple interface; I'm trying to avoid making this tool exceedingly heavy. From my perspective, super-resolution and inpainting may not be of that broad an interest (correct me if I'm wrong), so for now I'll leave them in this Python-function form. You can still play with the function, though. I've already tested it and it works fine.
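Before a CLI exists, a minimal call might look like the sketch below, based on the parameter names visible in pipeline.py. The keyword values and the file name "input.wav" are illustrative assumptions; the actual model call is left commented out because it requires the pretrained checkpoint to be downloaded.

```python
# Hypothetical usage sketch for super_resolution_and_inpainting.
# Assumptions: parameter names taken from audioldm/pipeline.py;
# "input.wav" is a placeholder file name, not from the repo.
kwargs = {
    "seed": 42,
    "ddim_steps": 200,
    "duration": 10.0,  # default clip length; the mel length must match this
    "batchsize": 1,
    "guidance_scale": 2.5,
    "n_candidate_gen_per_text": 3,
    "time_mask_ratio_start_and_end": (1.0, 1.0),   # (1.0, 1.0) disables inpainting
    "freq_mask_ratio_start_and_end": (0.75, 1.0),  # regenerate the top 25% of mel bins
}

# With the checkpoint available, the call would plausibly be:
# from audioldm import build_model, save_wave, super_resolution_and_inpainting
# audioldm = build_model()
# waveform = super_resolution_and_inpainting(audioldm, "", "input.wav", **kwargs)
# save_wave(waveform, "./output", name="sr_result")
```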
Hey, I tried using the new method:
def upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps):
    waveform = super_resolution_and_inpainting(
        audioldm, text, original_filepath,
        seed=random_seed, ddim_steps=steps,
        duration=duration, batchsize=1,
        guidance_scale=guidance_scale,
        n_candidate_gen_per_text=int(n_candidates),
        time_mask_ratio_start_and_end=(1.0, 1.0),  # no inpainting
        freq_mask_ratio_start_and_end=(0.75, 1.0),  # regenerate the higher 75% to 100% mel bins
    )
    if len(waveform) == 1:
        waveform = waveform[0]
    return waveform
but then I get:
[<ipython-input-11-eac161f8fca7>](https://localhost:8080/#) in upsample(original_filepath, text, duration, guidance_scale, random_seed, n_candidates, steps)
8
9 def upsample(original_filepath,text, duration, guidance_scale, random_seed, n_candidates, steps):
---> 10 waveform = super_resolution_and_inpainting(audioldm,text,original_filepath,
11 seed=random_seed,ddim_steps=steps,
12 duration=duration, batchsize=1,
[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in super_resolution_and_inpainting(latent_diffusion, text, original_audio_file_path, seed, ddim_steps, duration, batchsize, guidance_scale, n_candidate_gen_per_text, time_mask_ratio_start_and_end, freq_mask_ratio_start_and_end, config)
258 )
259
--> 260 batch = make_batch_for_text_to_audio(text, fbank=mel[None,...], batchsize=batchsize)
261
262 # latent_diffusion.latent_t_size = duration_to_latent_t_size(duration)
[/content/AudioLDM/audioldm/pipeline.py](https://localhost:8080/#) in make_batch_for_text_to_audio(text, waveform, fbank, batchsize)
26 else:
27 fbank = torch.FloatTensor(fbank)
---> 28 fbank = fbank.expand(batchsize, 1024, 64)
29 assert fbank.size(0) == batchsize
30
RuntimeError: The expanded size of the tensor (1024) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 1024, 64]. Tensor sizes: [1, 512, 64]
I know the base SR = 16000; where do I specify the target SR? Can it upscale to 96000, for example?
@galfaroth Super-resolution here means upsampling audio with a sampling rate below 16 kHz to 16 kHz. Going to a higher sampling rate would be a separate research problem.
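A hedged note on the tensor-size error above: the mel spectrogram length is tied to the requested duration, and the expand to 1024 frames corresponds to the 10-second default, so a 512-frame tensor suggests a 5-second input. The 102.4 frames-per-second figure below is inferred from the error message and the multiple-of-2.5-seconds constraint, not taken from the repo, so treat it as an assumption:

```python
# Assumption: AudioLDM's mel spectrogram has ~102.4 frames per second
# (inferred: 10 s -> 1024 frames, so a 512-frame tensor implies a 5 s clip).
def mel_frames(duration_s: float) -> int:
    """Estimated number of mel frames for a clip of the given duration."""
    return round(duration_s * 102.4)

# A 512-vs-1024 mismatch suggests the duration argument (default 10.0)
# does not match the actual length of the input audio file.
print(mel_frames(5.0), mel_frames(10.0))  # 512 1024
```

If that inference is right, passing duration=5.0 (or padding the input to 10 s) should make the shapes agree.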
> @galfaroth Super-resolution here means upsampling audio with a sampling rate below 16 kHz to 16 kHz. Going to a higher sampling rate would be a separate research problem.

Apart from the target resolution, why do I get the error? Can you post an example of how to do the upsampling with this method?
You can use the following script (sr_inpainting.py) @galfaroth
#!/usr/bin/python3
import os
import argparse

from audioldm import text_to_audio, style_transfer, build_model, save_wave, get_time, super_resolution_and_inpainting

CACHE_DIR = os.getenv(
    "AUDIOLDM_CACHE_DIR",
    os.path.join(os.path.expanduser("~"), ".cache/audioldm"))

parser = argparse.ArgumentParser()
parser.add_argument(
    "-t",
    "--text",
    type=str,
    required=False,
    default="",
    help="Text prompt to the model for audio generation",
)
parser.add_argument(
    "-f",
    "--file_path",
    type=str,
    required=False,
    default=None,
    help="(--mode transfer): Original audio file for style transfer; or (--mode generation): the guidance audio file for generating similar audio",
)
parser.add_argument(
    "--transfer_strength",
    type=float,
    required=False,
    default=0.5,
    help="A value between 0 and 1. 0 means the original audio without transfer; 1 means a complete transfer to the audio indicated by the text",
)
parser.add_argument(
    "-s",
    "--save_path",
    type=str,
    required=False,
    help="The path to save model output",
    default="./output",
)
parser.add_argument(
    "-ckpt",
    "--ckpt_path",
    type=str,
    required=False,
    help="The path to the pretrained .ckpt model",
    default=os.path.join(
        CACHE_DIR,
        "audioldm-s-full.ckpt",
    ),
)
parser.add_argument(
    "-b",
    "--batchsize",
    type=int,
    required=False,
    default=1,
    help="How many samples to generate at the same time",
)
parser.add_argument(
    "--ddim_steps",
    type=int,
    required=False,
    default=200,
    help="The sampling step for DDIM",
)
parser.add_argument(
    "-gs",
    "--guidance_scale",
    type=float,
    required=False,
    default=2.5,
    help="Guidance scale (large => better quality and relevance to the text; small => better diversity)",
)
parser.add_argument(
    "-dur",
    "--duration",
    type=float,
    required=False,
    default=10.0,
    help="The duration of the samples",
)
parser.add_argument(
    "-n",
    "--n_candidate_gen_per_text",
    type=int,
    required=False,
    default=3,
    help="Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best one to show you). A larger value usually leads to better quality with heavier computation",
)
parser.add_argument(
    "--seed",
    type=int,
    required=False,
    default=42,
    help="Changing this value (any integer) will lead to a different generation result.",
)

args = parser.parse_args()

assert args.duration % 2.5 == 0, "Duration must be a multiple of 2.5"

mode = "super_resolution_and_inpainting"
save_path = os.path.join(args.save_path, mode)

if args.file_path is not None:
    save_path = os.path.join(save_path, os.path.splitext(os.path.basename(args.file_path))[0])

text = args.text
random_seed = args.seed
duration = args.duration
guidance_scale = args.guidance_scale
n_candidate_gen_per_text = args.n_candidate_gen_per_text

os.makedirs(save_path, exist_ok=True)
audioldm = build_model(ckpt_path=args.ckpt_path)

waveform = super_resolution_and_inpainting(
    audioldm,
    text,
    args.file_path,
    seed=random_seed,
    duration=duration,
    guidance_scale=guidance_scale,
    ddim_steps=args.ddim_steps,
    n_candidate_gen_per_text=n_candidate_gen_per_text,
    batchsize=args.batchsize,
    time_mask_ratio_start_and_end=(0.10, 0.15),  # regenerate the 10% to 15% time steps of the spectrogram
    # time_mask_ratio_start_and_end=(1.0, 1.0),  # no inpainting
    # freq_mask_ratio_start_and_end=(0.75, 1.0),  # regenerate the higher 75% to 100% mel bins
    freq_mask_ratio_start_and_end=(1.0, 1.0),  # no super-resolution
)
save_wave(waveform, save_path, name="%s_%s" % (get_time(), text))
In the command line, run this script with:
python3 sr_inpainting.py -f trumpet.wav
The script will then inpaint the audio between the 10% and 15% time steps.
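To make the mask parameters concrete, here is a small sketch of how the ratio pairs plausibly map onto the spectrogram grid, assuming 1024 time frames and 64 mel bins (the sizes that appear in the traceback above). The real masking logic lives in audioldm/pipeline.py, so this is an illustration, not the repo's code:

```python
# Illustrative helper (assumption: 1024 time frames x 64 mel bins, as seen in
# the traceback; the actual masking is implemented inside audioldm/pipeline.py).
def ratio_to_indices(start_ratio: float, end_ratio: float, size: int) -> range:
    """Convert a (start, end) ratio pair into the index range to regenerate."""
    return range(int(start_ratio * size), int(end_ratio * size))

time_mask = ratio_to_indices(0.10, 0.15, 1024)  # inpaint 10%..15% of time steps
freq_mask = ratio_to_indices(0.75, 1.0, 64)     # regenerate top 25% of mel bins
no_mask   = ratio_to_indices(1.0, 1.0, 64)      # (1.0, 1.0) masks nothing

print(len(time_mask), len(freq_mask), len(no_mask))  # 51 16 0
```

So swapping the commented lines in the script above between the (1.0, 1.0) and fractional pairs selects inpainting-only, super-resolution-only, or both.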
omg it's happening
Hi @galfaroth,
Just modify the freq_mask_ratio_start_and_end parameter in @haoheliu's sample code.
It's worth spending a little time understanding this repo; it's a good investment.