
mars5-tts's Introduction

MARS5: A novel speech model for insane prosody

Updates

July 5, 2024: Latest AR checkpoint released, with higher output stability. A very big update is coming soon!

Approach

This is the repo for the MARS5 English speech model (TTS) from CAMB.AI.

The model follows a two-stage AR-NAR pipeline with a distinctively novel NAR component (see more info in the architecture docs).

With just 5 seconds of audio and a snippet of text, MARS5 can generate speech even for prosodically hard and diverse scenarios like sports commentary, anime and more. Check out our demo:

[Demo video: intro_vid_camb.mp4]

Watch the full video on YouTube.

[Figure: MARS5 simplified diagram]

Figure: The high-level architecture flow of MARS5. Given text and a reference audio, coarse (L0) encodec speech features are obtained through an autoregressive transformer model. Then, the text, reference, and coarse features are refined in a multinomial DDPM model to produce the remaining encodec codebook values. The output of the DDPM is then vocoded to produce the final audio.

Because the model is trained on raw audio together with byte-pair-encoded text, it can be steered with things like punctuation and capitalization. For example, to add a pause, add a comma at that point in the transcript; to emphasize a word, write it in capital letters in the transcript. This enables a fairly natural way of guiding the prosody of the generated output.
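
Purely as illustration, here are transcript variants of the same sentence (the exact effect on prosody varies):

text_neutral = "Hello there, how are you doing today?"
text_paused  = "Hello there, how are you doing, today?"   # the extra comma encourages a pause
text_strong  = "Hello there, how are you doing TODAY?"    # capitals encourage emphasis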

Speaker identity is specified using an audio reference file between 2-12 seconds, with lengths around 6s giving optimal results. Further, by providing the transcript of the reference, MARS5 enables one to do a 'deep clone' which improves the quality of the cloning and output, at the cost of taking a bit longer to produce the audio. For more details on this and other performance and model details, please see the docs folder.


Quickstart

We use torch.hub to make loading the model easy -- no cloning of the repo needed. The steps to perform inference are simple:

  1. Installation using pip:

    Requirements:

    • Python >= 3.10
    • Torch >= 2.0
    • Torchaudio
    • Librosa
    • Vocos
    • Encodec
    • safetensors
    • regex

pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
  2. Load models: load the MARS5 AR and NAR models from torch hub:
import torch, librosa

mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
# The `mars5` contains the AR and NAR model, as well as inference code.
# The `config_class` contains tunable inference config settings like temperature.

(Optional) Load the model from Hugging Face (make sure the repository is cloned):

from inference import Mars5TTS, InferenceConfig as config_class
import torch, librosa

mars5 = Mars5TTS.from_pretrained("CAMB-AI/MARS5-TTS")
  3. Pick a reference and optionally its transcript:
# Load reference audio between 1-12 seconds.
wav, sr = librosa.load('<path to arbitrary 24kHz waveform>.wav',
                       sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "<transcript of the reference audio>"

Note: The reference transcript is optional. Pass it if you wish to do a deep clone.

MARS5 supports 2 kinds of inference: a shallow, fast inference for which you do not need the transcript of the reference (we call this a shallow clone), and a second, slower but typically higher-quality way, which we call a deep clone. To use the deep clone, you need the prompt transcript. See the model architecture for more info on this.

  4. Perform the synthesis:
# Pick whether you want a deep or shallow clone. Set to False if you don't know prompt transcript or want fast inference. Set to True if you know transcript and want highest quality.
deep_clone = True
# Below you can tune other inference settings, like top_k, temperature, top_p, etc...
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100,
                      top_k=100, temperature=0.7, freq_penalty=3)

ar_codes, output_audio = mars5.tts("The quick brown rat.", wav,
          ref_transcript,
          cfg=cfg)
# output_audio is (T,) shape float tensor corresponding to the 24kHz output audio.
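
For example, to save the generated audio to disk, here is a minimal sketch using torchaudio (the output filename is arbitrary):

import torchaudio

# output_audio is (T,); torchaudio.save expects a (channels, T) tensor.
torchaudio.save('output.wav', output_audio.unsqueeze(0).cpu(), mars5.sr)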

That's it! These default settings provide pretty good results, but feel free to tune the inference settings to optimize the output for your particular use case. See the InferenceConfig code or the demo notebook for info and docs on all the different inference settings.

Some tips for best quality:

  • Make sure reference audio is clean and between 1 second and 12 seconds.
  • Use deep clone and provide an accurate transcript for the reference.
  • Use proper punctuation -- the model can be guided and made better or worse with proper use of punctuation and capitalization.

Or Use Docker

Pull from DockerHub

You can directly pull the docker image from our DockerHub page.

Build On Your Own

You can build a custom image from the provided Dockerfile in this repo by running the following command.

cd MARS5-TTS
docker build -t mars5ttsimage ./docker

Note: This image should be used as a base image on top of which you can add your custom inference script in a Dockerfile or docker-compose. Images that directly generate output will be added to Docker Hub and as Dockerfiles in this repo soon.
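
For example, a minimal sketch of a Dockerfile that extends the base image (my_inference.py is a hypothetical script of your own, not part of the repo):

FROM mars5ttsimage
COPY my_inference.py /app/my_inference.py
CMD ["python", "/app/my_inference.py"]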

Model Details

Checkpoints

The checkpoints for MARS5 are provided under the releases tab of this GitHub repo. We provide two checkpoints:

  • AR fp16 checkpoint [~750M parameters], with the config embedded in the checkpoint.
  • NAR fp16 checkpoint [~450M parameters], with the config embedded in the checkpoint.

The byte-pair encoding tokenizer used for the L0 encodec codes and the English text is embedded in each checkpoint under the 'vocab' key, and roughly follows the format of a saved minbpe tokenizer.
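
For example, a minimal sketch of peeking at the embedded tokenizer data (the local checkpoint filename here is hypothetical):

import torch

# Hypothetical local path to the AR .pt checkpoint; on newer torch you may need weights_only=False.
ckpt = torch.load('mars5_ar.pt', map_location='cpu')
print('vocab' in ckpt)  # the tokenizer data lives under the 'vocab' key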

The checkpoints are provided as both PyTorch .pt checkpoints and safetensors .safetensors checkpoints. By default, torch.hub.load() loads the safetensors version, but you can specify which checkpoint format you prefer with the ckpt_format='safetensors' or ckpt_format='pt' argument in the torch.hub.load() call. E.g. to force the safetensors format:

torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', ckpt_format='safetensors')

Or to force pytorch .pt format when loading the checkpoints:

torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', ckpt_format='pt')

Hardware Requirements:

You must be able to store at least 750M+450M params on GPU, and do inference with 750M of active parameters.
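
As a rough back-of-envelope (weights only; actual runtime memory is higher due to activations and the KV cache):

# 1.2B total params at fp16 (2 bytes per param) is about 2.4 GB for the weights alone.
params = 750e6 + 450e6
print(f"~{params * 2 / 1e9:.1f} GB of weights at fp16")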

If you do not have the necessary hardware requirements and just want to use MARS5 in your applications, you can use it via our API. If you need some extra credits to test it for your use case, feel free to reach out to [email protected].

Roadmap and tasks

MARS5 is not perfect at the moment, and we are working on improving its quality, stability, and performance. Rough areas we are looking to improve, and where we welcome any contributions:

  • Improving inference stability and consistency
  • Speed/performance optimizations
  • Improving reference audio selection when given long references.
  • Benchmark performance numbers for MARS5 on standard speech datasets.

Specific tasks

  • Profile the GPU and CPU memory and runtime speed metrics of the current model; add these to the readme.
  • Port model operations not supported by MPS to equivalents to speed up Apple Mac inference. E.g. site-packages/torch/nn/functional.py:4840: UserWarning: The operator 'aten::col2im' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications.
  • Cleanly add more performant ODE samplers to the DDPM inference code (even just DPM++2M would be great).
  • Make a demo/user-interface program to rapidly collect human preference ratings between two audio samples, one generated by the model and one ground truth.
  • Implement a way to do long-form generation. E.g. one possibility is to chunk long input text into smaller pieces, synthesize each in turn, concatenate the codes, and vocode the final result (see the sketch after this list).
  • Perform a search (e.g. beam or grid) over the autoregressive sampling settings to find the preset which gives the best quality.
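
Below is a minimal sketch of the chunking idea from the long-form generation task above (long_form_tts is a hypothetical helper, not part of the repo):

import re
import torch

def long_form_tts(mars5, cfg, text, ref_wav, ref_transcript):
    # Naive split on sentence-ending punctuation; real chunking would be smarter.
    chunks = re.split(r'(?<=[.!?])\s+', text)
    pieces = []
    for chunk in chunks:
        _, audio = mars5.tts(chunk, ref_wav, ref_transcript, cfg=cfg)
        pieces.append(audio)
    # Concatenate the 24kHz waveforms; crossfading at the boundaries may sound smoother.
    return torch.cat(pieces)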

If you would like to contribute any improvement to MARS5, please feel free to contribute (guidelines below).

Contributions

We welcome any contributions to improving the model. As you may find when experimenting, while it can produce really great results, it can still be improved to create excellent outputs consistently. We'd also love to see how you used MARS5 in different scenarios; please use the 🙌 Show and tell category in Discussions to share your examples.

Contribution format:

The preferred way to contribute to our repo is to fork the master repository on GitHub:

  1. Fork the repo on GitHub.
  2. Clone the repo and set this repo as upstream: git remote add upstream git@github.com:Camb-ai/mars5-tts.git
  3. Make a new local branch, make your changes, and commit them.
  4. Push the changes to a new upstream branch: git push --set-upstream origin <NAME-NEW-BRANCH>
  5. On GitHub, go to your fork and click 'Pull Request' to begin the PR process. Please make sure to include a description of what you did/fixed.

License

We are open-sourcing MARS5 in English under GNU AGPL 3.0. For commercial inquiries or to license the closed-source version of MARS, please email [email protected].

Join Our Team

We're an ambitious team, globally distributed, with a singular aim of making everyone's voice count. At CAMB.AI, we're a research team of Interspeech-published, Carnegie Mellon, ex-Siri engineers and we're looking for you to join our team.

We're actively hiring; please drop us an email at [email protected] if you're interested. Visit our careers page for more info.

Community

Join the CAMB.AI community on the Forum and Discord to share any suggestions, feedback, or questions with our team.

Support Camb.ai on Ko-fi ❤️!


Acknowledgements

Parts of the code for this project are adapted from other open-source repositories -- please make sure to check them out! Thank you to their authors.

mars5-tts's People

Contributors

akshhack, arnavmehta7, ashraygattani, hassan-bazzi-lab, nihaalnz, nouralmerey, nourmerey, pieterscholtz, rf5, victorchall


mars5-tts's Issues

20GB VRAM requirement

Hi,

Can somebody explain why this requires ~20GB of VRAM?
For 750M+450M params that seems very strange. The readme indicates that there is room for optimization, but I would like to understand what the main problem is.

[BUG] RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size

File "....\MARS5-TTS\./mdl\hub\Camb-ai_mars5-tts_master\inference.py", line 291, in tts
    final_audio = self.vocode(final_output).squeeze()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".....\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "....\MARS5-TTS\./mdl\hub\Camb-ai_mars5-tts_master\inference.py", line 158, in vocode
    wav_diffusion = self.vocos.decode(features, bandwidth_id=bandwidth_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
......
  File "....\Lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size

I'm not sure what is wrong. I fed in a 5-second wav file and a transcript, but it throws this error.

Provide a docker image for the code

Hi, I found this project on HN and it looks interesting, but I don't want to manually set everything up just to check the quality of your solution.

The modern delivery standard is a Docker image, because it allows anyone to run your code with a single command.
Check other projects from HN, like https://github.com/gabotechs/MusicGPT.

So please create a Dockerfile that encapsulates the whole configuration and lets users try your solution simply and quickly.

release weights as safetensors

safetensors is a file format for tensor storage that is faster to load than PyTorch pickles and avoids the security risks of loading pickles.

It's true that if the code and the weights come from the same repo, one is just as safe as the other. But if there's any chance you're going to end up with an ecosystem where various different models or fine-tunes are floating around, I think it best to set a precedent of using safetensors from the start.

Portuguese Support

Great work!!! 😮😮😮
Would be awesome to have Portuguese support.
Is there anything we can help with on this?

Support for Russian and Kazakh

Hello!
I want to use your TTS locally in my project, but I need support for English, Russian, and Kazakh. Is there any documentation on these languages in MARS5?

Licensing

Noting "We are open-sourcing MARS5 in English under GNU AGPL 3.0, but you can request to use it under a different license by emailing [email protected]".

Keep in mind, if you are accepting third-party contributions on GitHub, offering another license is legally dubious. You should consult a lawyer before you do so.

I have a PR for fixing the numpy requirement, and honestly I don't care (it's one line fixing a trivial bug, whatever), but others might care, and it opens you up to potential lawsuits from contributors for relicensing their contributions without their authorization.

Voice cloning failed when reference transcript is not provided

The quality is really bad when the reference transcript is not provided, even in the mars5_demo notebook.

However, it works well on the official MARS5 speech emulation demo website, even when the reference transcript is not provided.

Does it use a tool like Whisper to generate the audio transcript first?

Thanks!

Colab Demo fails to run.

The issue:

RuntimeError                              Traceback (most recent call last)

<ipython-input-7-2d05018561f0> in <cell line: 6>()
      4                       top_k=100, temperature=0.7, freq_penalty=3)
      5 
----> 6 ar_codes, wav_out = mars5.tts("The quick brown rat.", wav, 
      7           ref_transcript,
      8           cfg=cfg)

13 frames

~/.cache/torch/hub/Camb-ai_mars5-tts_master/mars5/nn_future.py in forward(self, x, freqs_cis, positions, mask, cache)
    249             scatter_pos = (positions[-self.sliding_window:] % self.sliding_window)[None, :, None, None]
    250             scatter_pos = scatter_pos.repeat(bsz, 1, self.n_kv_heads, self.args.head_dim)
--> 251             cache.cache_k[:bsz].scatter_(dim=1, index=scatter_pos, src=xk[:, -self.sliding_window:])
    252             cache.cache_v[:bsz].scatter_(dim=1, index=scatter_pos, src=xv[:, -self.sliding_window:])
    253 

RuntimeError: scatter(): Expected self.dtype to be equal to src.dtype

Google Colab failing to load model

When I run the Google Colab notebook, it throws an error when I reach this command; I believe it's down to a corrupt safetensors installation:

mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)


I fixed it by manually upgrading safetensors and using this command

!pip install --upgrade safetensors


I'm planning to run this locally, but thought it might be useful to publish my fix, and perhaps the Colab file can be updated.

pretraining/finetune

In case people would like to contribute another English accent or other languages, which documents should they refer to?

Deployment

Local deployment keeps failing. Is there a launcher package or installer of some kind?

Great results, but too computationally heavy.

The results seem excellent, but it definitely requires too much VRAM to run. It would be great if you could make a lighter version, keeping as much quality as you can.

Thanks.

Does it run on CUDA? Getting cfg error

I get the following error:

TypeError: _DecoratorContextManager.__call__() got an unexpected keyword argument 'cfg'

on this line

ar_codes, wav_out = mars5.tts("The quick brown rat.", wav, ref_transcript, cfg=cfg)

Installed with the requirements on Windows:

torch==2.0.1+cu117
torchvision==0.15.2+cu117
torchaudio==2.0.2+cu117
numpy==1.26.4
regex
librosa
vocos
encodec
safetensors

Replicate support

I see there's an official demo of Mars5 on Replicate

https://replicate.com/camb-ai/mars5-tts

Could you add the cog.yaml and predict.py files that were used to create that demo to this repo? That way users like myself who use replicate can riff on the code in a way that makes it easier to share and host demos of results.

Give exact versions of dependencies

Hi

Could you give the exact versions of the dependencies in requirements.txt?

While installing on Mac, I always get errors because of version issues and need to google the fix every time.
For example, numpy 2.0 was just released, but your lib does not work with it, so everyone needs to use numpy==1.26, and so on.

Too much inference time

I was running the quickstart demo on Colab, but the synthesis step takes too much time to run, even when I run it as a shallow clone. Is there any way to fix this?

Any way to manually force phonemes? Issue with incorrect utterances of common words

Tried to generate some outputs using this sentence from the demo's instructions:

We provide several generation candidates when you synthesize text, and attempt to pick the best one on the right.

The word "several" simply WILL NOT come out correctly. It comes out as "seeval," "seeral," "seel," etc.

I am sure this is a byproduct of being an early release, but I want to flag it now, as I think that in addition to training there ought to be a way to manually pass in pronunciation data using SSML.

Example:

 <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
 <phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>

This way, if the autoregressive model repeatedly guesses incorrectly (i.e. on an unusual name), there is a way to force the right result.

Where is the output saved?

I feel like an idiot here. Running this on Windows (WSL with conda). Where in the heck does the audio file output get saved?

Problem reading numbers

[Video attachment: tmp6la1swgu.MP4]

I used the demo to generate the sentence 'Select a speaker, enter some text, and hit "generate" to hear Mars 5 yourself.' The number 5 is not pronounced.

slow Inferencing

Can we load the reference voice once and then reuse it for multiple inferences to improve processing speed, which is currently quite slow?

Arabic Support

It's amazing what you guys are doing! Are there any plans to add more languages, such as Arabic?

Thanks.

How do I choose another language?

Hi, your project looks promising.

On the demo, in the usage examples, and in the releases, there is a link only to the English-language model. Please tell me how to choose another language, and whether it can be chosen at all; you have stated support for 140 languages.

Or do you provide only the English-language model within the open-source code?

Windows: PermissionError: File In Use

I've followed all the instructions, but on Windows I cannot get the model to load. It keeps getting stuck attempting to delete the temporary .model file and crashing.

Stacktrace:

Traceback (most recent call last):
  File "E:\Users\MyUser\Documents\git\mars5\clone_tts.py", line 6, in <module>
    mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
  File "E:\Users\MyUser\Documents\git\mars5\.venv\lib\site-packages\torch\hub.py", line 568, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "E:\Users\MyUser\Documents\git\mars5\.venv\lib\site-packages\torch\hub.py", line 597, in _load_local
    model = entry(*args, **kwargs)
  File "E:\Users\MyUser\Documents\git\mars5\./hub\Camb-ai_mars5-tts_master\hubconf.py", line 31, in mars5_english       
    mars5 = Mars5TTS(ar_ckpt, nar_ckpt, device=device)
  File "E:\Users\Nolan\Documents\git\mars5\./hub\Camb-ai_mars5-tts_master\inference.py", line 85, in __init__
    os.remove(tfn)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\myuser\\AppData\\Local\\Temp\\tmp1hrrwls1texttok.model'

The crash seems to be happening here:

os.remove(tfn)

When I inspect what is holding this file, it's locked by python.exe, meaning the file is open somewhere in Python and hasn't been closed yet.

I'll do my best to debug and add anything more, but my knowledge and expertise is not in this area.

System Info

OS: Windows 11 Pro 23H2 build 22631.3593
Processor: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz 3.79 GHz
RAM: 64.0 GB
Graphics Card: NVIDIA GeForce RTX 4090
Graphics VRAM: 24 GB
Graphics Driver: 555.99
Python Version: 3.10.11
Torch Version: 2.3.1

minbpe has a weird assertion on the file name?

File "......\MARS5-TTS\inference.py", line 159, in _from_pretrained
return cls(ar_ckpt=ar_ckpt, nar_ckpt=nar_ckpt, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".....\inference.py", line 92, in init
self.texttok.load(texttok_data)
File "....\MARS5-TTS\mars5\minbpe\base.py", line 143, in load
assert model_file.endswith(".model")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Why does it have this assertion? What model file is it looking for, and where can I download it? I cloned the HF repo and just copy-pasted the Hugging Face example.

The latest librosa-0.10.2.post1 is not compatible with numpy-2.0.0

I set up the environment following the README.md guidelines. When I execute Step 3 (pick a reference and optionally its transcript), I get the following error:

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
xxx
xxx
xxx
ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use 'numpy._import_array' to disable if you are certain you don't need it).
I found that the system default librosa version is 0.10.2.post1 and the numpy version is 2.0.0.

To fix the problem, downgrade the librosa and numpy versions:

pip uninstall librosa && pip uninstall numpy
pip install librosa==0.9.1

Maybe other librosa versions will work too, but 0.10.2.post1 and 0.10.2 will not.

Support for other languages

Thanks for the great project! It looks like only English voice generation works well; generating audio in Spanish results in a native English-speaking voice talking Spanish.

Also, I did testing with a Spanish reference voice, with similar results.

Are there plans to support other languages?

Curious on licensing

Hey @akshhack, I'm wondering how you're dealing with dual AGPL + commercial licensing, since there are no CLAs or anything. How do you incorporate external commits? I want to do the same for my upcoming projects; your input is appreciated.

Running locally?

Hi - just setting this up on my machine (bare metal / pyenv 3.10) and have noticed that there is no inference script, Jupyter notebook, or Gradio demo to verify the install?

If I can find the time I'll reverse-engineer what's happening here to write the script - but do you have this at hand already?

Deep Clone and generation longer than 12s

Currently, when using deep cloning (and maybe when not), the model starts producing artifacts after 12 seconds of total new audio generation. Is this expected for the current model checkpoint, or does it need further troubleshooting?

ImportError - Mars5TTS

I followed the QuickStart (https://github.com/Camb-ai/MARS5-TTS?tab=readme-ov-file#quickstart) and copy/pasted the code exactly (did the pip install --upgrade torch torchaudio librosa vocos encodec, etc.) - here's my script. But when I run it, it exits with the error below (full console log below the script):

ImportError: cannot import name 'Mars5TTS' from partially initialized module 'inference' (most likely due to a circular import) (/Users/josiahbryan/.cache/torch/hub/Camb-ai_mars5-tts_master/inference.py)

Script (mars5.py):

import torch, librosa

mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
# The `mars5` contains the AR and NAR model, as well as inference code.
# The `config_class` contains tunable inference config settings like temperature.

# Load reference audio between 1-12 seconds.
wav, sr = librosa.load('/Users/josiahbryan/Downloads/fakePhoneConvoJosiah.wav',
                       sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "Hey yeah okay...ummm, That sounds great. . Next, we're going to go, and, an' take it, yeah. Okay! Sounds good...when are you coming home? Perfect, okay, talk soon! Thanks, bye."

# Pick whether you want a deep or shallow clone. Set to False if you don't know prompt transcript or want fast inference. Set to True if you know transcript and want highest quality.
deep_clone = True
# Below you can tune other inference settings, like top_k, temperature, top_p, etc...
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100,
                      top_k=100, temperature=0.7, freq_penalty=3)

ar_codes, output_audio = mars5.tts("The quick brown rat.", wav,
          ref_transcript,
          cfg=cfg)
# output_audio is (T,) shape float tensor corresponding to the 24kHz output audio.

Console Output:

(base) josiahbryan@JosiahscBookPro devel % python mars5.py 
Downloading: "https://github.com/Camb-ai/mars5-tts/zipball/master" to /Users/josiahbryan/.cache/torch/hub/master.zip
Using cache found in /Users/josiahbryan/.cache/torch/hub/Camb-ai_mars5-tts_master
Traceback (most recent call last):
  File "/Users/josiahbryan/devel/rubber/backend/devel/mars5.py", line 3, in <module>
    mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
  File "/Users/josiahbryan/miniforge3/lib/python3.10/site-packages/torch/hub.py", line 568, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/Users/josiahbryan/miniforge3/lib/python3.10/site-packages/torch/hub.py", line 594, in _load_local
    hub_module = _import_module(MODULE_HUBCONF, hubconf_path)
  File "/Users/josiahbryan/miniforge3/lib/python3.10/site-packages/torch/hub.py", line 106, in _import_module
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/Users/josiahbryan/.cache/torch/hub/Camb-ai_mars5-tts_master/hubconf.py", line 7, in <module>
    from inference import Mars5TTS, InferenceConfig
  File "/Users/josiahbryan/.cache/torch/hub/Camb-ai_mars5-tts_master/inference.py", line 12, in <module>
    from mars5.model import CodecLM, ResidualTransformer
  File "/Users/josiahbryan/devel/rubber/backend/devel/mars5.py", line 3, in <module>
    mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
  File "/Users/josiahbryan/miniforge3/lib/python3.10/site-packages/torch/hub.py", line 568, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/Users/josiahbryan/miniforge3/lib/python3.10/site-packages/torch/hub.py", line 594, in _load_local
    hub_module = _import_module(MODULE_HUBCONF, hubconf_path)
  File "/Users/josiahbryan/miniforge3/lib/python3.10/site-packages/torch/hub.py", line 106, in _import_module
    spec.loader.exec_module(module)
  File "/Users/josiahbryan/.cache/torch/hub/Camb-ai_mars5-tts_master/hubconf.py", line 7, in <module>
    from inference import Mars5TTS, InferenceConfig
ImportError: cannot import name 'Mars5TTS' from partially initialized module 'inference' (most likely due to a circular import) (/Users/josiahbryan/.cache/torch/hub/Camb-ai_mars5-tts_master/inference.py)
