
Pheme Model

This repo contains recipes and models used for training Pheme TTS models. It is the official implementation of the paper: Pheme: Efficient and Conversational Speech Generation. A demo is available here, and a selection of audio samples can be found here.

Our Pheme TTS framework validates several hypotheses:

  1. We can train Transformer-based conversational TTS models with much less training data than e.g. VALL-E or SoundStorm (roughly 10x less data).
  2. Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
  3. Efficiency is paramount: parameter efficiency (compact models), data efficiency (less training data), and inference efficiency (reduced latency).
  4. One fundamental ingredient is the separation of semantic and acoustic tokens, together with an adequate speech tokenizer.
  5. Inference can be run in parallel through MaskGit-style decoding, with 15x speed-ups compared to similarly sized autoregressive models (a minimal sketch of this decoding style follows this list).
  6. The single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
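
Hypothesis 5 refers to MaskGit-style iterative parallel decoding: all tokens start masked, and at every step the most confident predictions are committed while the rest are re-masked according to a schedule. The snippet below is a minimal, self-contained sketch of that loop (greedy selection, cosine schedule); it is illustrative only and does not reproduce the actual S2A decoder, whose conditioning, sampling, and scheduling details differ.

# Minimal sketch of MaskGit-style parallel decoding (illustrative only).
import math
import torch

def maskgit_decode(model, length, num_steps=16, mask_id=0):
    """Iteratively commit the most confident token predictions, re-masking the rest."""
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    for step in range(num_steps):
        logits = model(tokens)                        # (1, length, vocab_size)
        confidence, candidates = logits.softmax(dim=-1).max(dim=-1)
        still_masked = tokens == mask_id
        # Positions already committed are never reconsidered.
        confidence = confidence.masked_fill(~still_masked, float("inf"))
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * (step + 1) / num_steps))
        if keep_masked > 0:
            lowest = confidence.topk(keep_masked, largest=False).indices
            candidates[0, lowest[0]] = mask_id        # re-mask the least confident
        tokens = torch.where(still_masked, candidates, tokens)
    return tokens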

Set Up the Environment

Set up conda environment:

conda create --name pheme3 python=3.10
conda activate pheme3

pip3 install torch torchvision torchaudio
pip3 install -r requirements.txt --no-deps

Download the pre-trained SpeechTokenizer model and the unique token list:

st_dir="ckpt/speechtokenizer/"
mkdir -p ${st_dir}
cd ${st_dir}
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/SpeechTokenizer.pt"
wget "https://huggingface.co/fnlp/SpeechTokenizer/resolve/main/speechtokenizer_hubert_avg/config.json" 
cd ..
wget "https://huggingface.co/fnlp/USLM/resolve/main/USLM_libritts/unique_text_tokens.k2symbols" 

You need to create a Hugging Face access token to use the speaker embedding of pyannote:

export HUGGING_FACE_HUB_TOKEN=YOUR_PRIVATE_TOKEN
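
For reference, a speaker embedding can be extracted with pyannote.audio roughly as follows. This is a hedged sketch (the repo's inference code handles this internally), and the wav path is purely illustrative.

# Hedged sketch: extracting a speaker embedding with pyannote.audio.
from pyannote.audio import Inference, Model

model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token="YOUR_PRIVATE_TOKEN")
inference = Inference(model, window="whole")
speaker_emb = inference("demo/audios/male_voice.wav")  # ~512-dim embedding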

Download pre-trained T2S and S2A models (the 100M Pheme variant):

git clone https://huggingface.co/PolyAI/pheme_small ckpt/pheme
mkdir -p "ckpt/t2s"
mkdir -p "ckpt/s2a"
mv ckpt/pheme/config_t2s.json ckpt/t2s/config.json
mv ckpt/pheme/generation_config.json ckpt/t2s/generation_config.json
mv ckpt/pheme/t2s.bin ckpt/t2s/pytorch_model.bin
mv ckpt/pheme/config_s2a.json ckpt/s2a/config.json
mv ckpt/pheme/s2a.ckpt ckpt/s2a/s2a.ckpt

or the larger version (300M) at https://huggingface.co/PolyAI/pheme

Prompt-based Generation

The generation can be invoked by:

python transformer_infer.py

Training

Data Preparation

The package expects a manifest such as datasets/example/train.json together with a datasets/audios/ directory where the wav files are stored. The manifest should follow this format:

{
    "LJ001-0051.wav": {
      "text": "and paying great attention to the press work or actual process of printing,",
      "raw-text": "and paying great attention to the press work or actual process of printing,",
      "duration": 4.860090702947846,
      "phoneme": "æ|n|d|_|p|eɪ|ɪ|ŋ|_|ɡ|ɹ|eɪ|t|_|ɐ|t|ɛ|n|ʃ|ə|n|_|t|ə|_|ð|ə|_|\"|p|ɹ|ɛ|s|_|w|ɜː|k|\"|_|ɔː|ɹ|_|æ|k|tʃ|uː|əl|_|p|ɹ|ɑː|s|ɛ|s|_|ʌ|v|_|p|ɹ|ɪ|n|t|ɪ|ŋ|,"
    },
    "LJ001-0120.wav": {
    ...
    },
    ...
}
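
The duration and phoneme fields have to be derived from the audio and the text. The snippet below is a hedged sketch of how a single entry could be built with soundfile and phonemizer (espeak backend); the separator convention only approximates the format above and may differ from the repo's own preprocessing.

# Illustrative manifest-entry builder (not the repo's own preprocessing).
import json
import soundfile as sf
from phonemizer import phonemize
from phonemizer.separator import Separator

def manifest_entry(wav_path, text):
    duration = sf.info(wav_path).duration  # audio length in seconds
    phoneme = phonemize(
        text,
        language="en-us",
        backend="espeak",
        separator=Separator(phone="|", word="_|"),  # approximates the "|" / "_" format
        strip=True,
    )
    return {"text": text, "raw-text": text, "duration": duration, "phoneme": phoneme}

entry = {"LJ001-0051.wav": manifest_entry(
    "datasets/audios/LJ001-0051.wav",
    "and paying great attention to the press work or actual process of printing,")}
print(json.dumps(entry, ensure_ascii=False, indent=2))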

Create train/valid/test manifests

PYTHONPATH=. python utils/data_prep.py

Resample audio files to 16kHz

find LJSpeech-1.1/wavs/ -name "*.wav" | parallel ffmpeg -i {} -ar 16000 -ac 1 audios/{/}
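
If ffmpeg or GNU parallel is not available, the same resampling can be done in Python. A hedged torchaudio equivalent (paths mirror the command above):

# Resample to 16 kHz mono with torchaudio (alternative to the ffmpeg one-liner).
from pathlib import Path
import torchaudio

src_dir, dst_dir = Path("LJSpeech-1.1/wavs"), Path("audios")
dst_dir.mkdir(parents=True, exist_ok=True)
for wav_path in src_dir.glob("*.wav"):
    wav, sr = torchaudio.load(str(wav_path))
    wav = wav.mean(dim=0, keepdim=True)                   # down-mix to mono
    wav = torchaudio.functional.resample(wav, sr, 16000)  # resample to 16 kHz
    torchaudio.save(str(dst_dir / wav_path.name), wav, 16000)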

The following command will create semantic and acoustic tokens based on the audios folder.

PYTHONPATH=. python utils/get_tokens_speech_tokenizer.py \
    --config_path ckpt/speechtokenizer/config.json \
    --ckpt_path ckpt/speechtokenizer/SpeechTokenizer.pt \
    --encoding_input datasets/ljspeech-training-data/audios \
    --encoding_output datasets/ljspeech-training-data/audios-speech-tokenizer
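
For reference, the SpeechTokenizer checkpoint produces residual-vector-quantized codes in which the first quantizer level is treated as semantic tokens and the remaining levels as acoustic tokens. A hedged sketch of the underlying encoding step, following the upstream SpeechTokenizer usage rather than this repo's exact script:

# Sketch of semantic/acoustic token extraction with SpeechTokenizer
# (utils/get_tokens_speech_tokenizer.py wraps this for whole folders).
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

model = SpeechTokenizer.load_from_checkpoint(
    "ckpt/speechtokenizer/config.json",
    "ckpt/speechtokenizer/SpeechTokenizer.pt",
)
model.eval()

wav, sr = torchaudio.load("datasets/ljspeech-training-data/audios/LJ001-0051.wav")
assert sr == 16000, "resample to 16 kHz first (see the step above)"
with torch.no_grad():
    codes = model.encode(wav.unsqueeze(0))  # (n_q, batch, time)
semantic_tokens = codes[0]                  # RVQ level 0: semantic content
acoustic_tokens = codes[1:]                 # RVQ levels 1..n_q-1: acoustic detail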

T2S

TRAIN_MANIFEST="./datasets/ljspeech-training-data/train.json"
DEV_MANIFEST="./datasets/ljspeech-training-data/dev.json"
OUT_DIR="./experiments/t2s-ljspeech"

OUT_DIR="/home/taras/experiments/t2s-ljspeech"
python train_t2s.py --metapath "${TRAIN_MANIFEST}" \
  --val_metapath "${DEV_MANIFEST}" \
  --output_dir "${OUT_DIR}" \
  --model_size tiny --batch_size 64 \
  --nworkers 12 --warmup_steps 10000 \
  --save_steps 500 --n_epochs 100 \
  --learning_rate 1e-3

S2A

TRAIN_MANIFEST="./datasets/ljspeech-training-data/train.json"
DEV_MANIFEST="./datasets/ljspeech-training-data/dev.json"
OUT_DIR="./experiments/s2a-ljspeech"

python train_s2a.py --saving_path "${OUT_DIR}" --sampledir "${OUT_DIR}" --vocoder_type SPEECHTOKENIZER \
 --n_codes 1024 --n_cluster_groups 7 --metapath "${TRAIN_MANIFEST}" \
 --val_metapath "${DEV_MANIFEST}" \
 --warmup_step 10000 --nworkers 12 --first_n_lvls 7 \
 --batch_size 200 --ffd_size 1024 --hidden_size 768 --enc_nlayers 3 --dec_nlayers 6 --nheads 8 \
 --depthwise_conv_kernel_size 5 \
 --val_check_interval 60 --sample_rate 16000 --lr 5e-4 \
 --check_val_every_n_epoch 1 --n_semantic_codes 1024 \
 --distributed

Speed test with TensorRT-LLM:

A100 GPU / 100M Pheme Variant

Model                     Batch Size   Steps   RTF (ms)
T2S-S2A, short sentence   1            16      0.133
T2S-S2A, long sentence    1            16      0.133

A100 GPU / 300M Pheme Variant

Model                     Batch Size   Steps   RTF (ms)
T2S-S2A, short sentence   1            16      0.143
T2S-S2A, long sentence    1            16      0.143

Acknowledgements

MQTTS
SpeechTokenizer
maskgit
SoundStorm

TODO

  1. Add TensorRT-LLM image

Citation

If you use this code or components of the model in your own work, please cite our work as:

@misc{budzianowski2024pheme,
      title={Pheme: Efficient and Conversational Speech Generation}, 
      author={Paweł Budzianowski and Taras Sereda and Tomasz Cichy and Ivan Vulić},
      year={2024},
      eprint={2401.02839},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}


pheme's Issues

Pretrained weights?

Hi,
Are you planning to share pretrained weights or build a demo where I can test this model?
Also, have you tested this model with the Vocos vocoder rather than the SpeechTokenizer-generated audio codec?

Possible to infer word by word?

I was wondering whether it is possible to infer coherent speech word by word, without requiring the whole sentence to be passed together. This would greatly decrease inference time and real-life latency.

Hallucination in the T2S stage

Does the autoregressive decoding of the T2S stage induce random hallucinated results such as repeated words/phrases or long silences? How does this relate to the reported WER results?

Replicate information about the model weights on the 🤗 Hub!

Hi PolyAI team,

I'm VB; I lead the advocacy efforts for Audio at Hugging Face. Thanks for contributing and releasing your weights on the Hugging Face Hub (https://huggingface.co/PolyAI/pheme).

It'd be great if you could replicate the GitHub README over to the Pheme model card as well. This would help with discovery and lead to greater model visibility.

Please do feel free to let me know if you need any support with this!

Cheers,
VB

How to train the model in other languages?

Please tell me: is it enough to generate the correct dataset with the correct manifest for training in other languages, or is some additional manipulation required?

About Producing Manifest File

Hello, thank you for providing a wonderful repository.
I would like to ask how to obtain the information needed to produce the manifest file.

{
    "LJ001-0051.wav": {
      "text": "and paying great attention to the press work or actual process of printing,",
      "raw-text": "and paying great attention to the press work or actual process of printing,",
      "duration": 4.860090702947846,
      "phoneme": "æ|n|d|_|p|eɪ|ɪ|ŋ|_|ɡ|ɹ|eɪ|t|_|ɐ|t|ɛ|n|ʃ|ə|n|_|t|ə|_|ð|ə|_|\"|p|ɹ|ɛ|s|_|w|ɜː|k|\"|_|ɔː|ɹ|_|æ|k|tʃ|uː|əl|_|p|ɹ|ɑː|s|ɛ|s|_|ʌ|v|_|p|ɹ|ɪ|n|t|ɪ|ŋ|,"
    },
    "LJ001-0120.wav": {
    ...
    },
    ...
}

In the manifest format above, although the transcribed text of the audio file is available, obtaining the duration and phoneme fields requires extra pre-processing, and neither the method nor the code to create this information seems to be present in the repository.

Thus, I would like to ask: which libraries did you use to produce this information?
Also, if you used scripts to produce the manifest file in your experiments, I would like to kindly ask whether you could provide them.

Thank you.

unique_text_tokens.k2symbols for non-English languages

Hello everyone,
I've noticed that throughout the pipeline, unknown tokens are removed, and that unique_text_tokens.k2symbols doesn't contain all the phonemes necessary for non-English languages, such as accents and other diacritics.

I'm trying to train Pheme in Portuguese, and I was wondering what I should do so the model can understand the accents of my language. Any tips on how to do it?

P.S.: I've also changed the phonemizer backend so it can generate phonemes in PT-BR. espeak supports PT-BR, so it was a no-brainer.

Only <eos> generated after training a new model

Hi - First, thank you for sharing your impressive work!

I was able to train a model in another language with a similar amount of data as your 100M parameter model. I see this very interesting behavior: if I provide a prompt to the T2S model, it only generates the <eos> token and produces an empty wav file (~0.2 sec of silence). If I override and set semantic_prompt = [] in infer_text() in transformer_infer.py, it generates pretty good random output, but obviously not related to the dub I want to generate. It's a pretty vanilla training run using your code (minor changes to add symbols not present), just on a different dataset.

Q: Have you run into this and/or do you have ideas on how to fix?

Segmentation fault when running `python transformer_infer.py`

Hi, I am on an AWS EC2 machine (Amazon Linux 2). I followed the setup instructions and was able to install the library dependencies. I also downloaded all the checkpoints to my machine. However, when I run python transformer_infer.py, I get the following output followed by a segmentation fault:

/opt/conda/envs/pheme3/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/opt/conda/envs/pheme3/lib/python3.10/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
DEBUG:fsspec.local:open file: /home/ec2-user/wam/pheme/ckpt/s2a/s2a.ckpt
Non-A100 GPU detected, using math or mem efficient attention if input tensor is on cuda
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /pyannote/embedding/resolve/main/pytorch_model.bin HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /pyannote/embedding/resolve/main/config.yaml HTTP/1.1" 200 0
DEBUG:fsspec.local:open file: /home/ec2-user/.cache/torch/pyannote/models--pyannote--embedding/snapshots/c6335d8f1cd77b30084387468a6cf26fea90009b/pytorch_model.bin
DEBUG:fsspec.local:open file: /home/ec2-user/.cache/torch/pyannote/models--pyannote--embedding/snapshots/c6335d8f1cd77b30084387468a6cf26fea90009b/pytorch_model.bin
Lightning automatically upgraded your loaded checkpoint from v1.2.7 to v2.1.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/pyannote/models--pyannote--embedding/snapshots/c6335d8f1cd77b30084387468a6cf26fea90009b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.8.1+cu102, yours is 2.1.1+cu121. Bad things might happen unless you revert torch to 1.x.
Segmentation fault

I have tried debugging the Python script and stepping into function calls.

This is the last non-library function call I reached before the error, in the TextTokenizer class in semantic_dataset.py. From this I believe the phonemizer backend is failing.
[screenshot]

I tried stepping into the phonemizer library as well, and the program errors out after this line is executed:
[screenshot]

And this is my call stack:
[screenshot]

Any help on the issue would be greatly appreciated!

License

Hi,
The quality of this is incredible! Would you mind switching to a software license for the code (i.e., MIT/ISC)? The CC-BY license is designed for non-software works.
Personally, the MIT license seems like the closest software equivalent to CC-BY.
Thanks!

Why is semantic_tokens generation so incredibly slow? :(

Hey, first of all thanks for doing and publishing great work!

But on the practical side, I am rendering my favourite set of Matrix quotes:

Inferring: args.text="As you can see, we've had our eye on you for some time now, Mister Anderson."
generate_audio
semantic_tokens: 3.829667329788208
self.featuredir / element_id_prompt=PosixPath('demo/audios/male_voice.wav')
speaker_emb.shape=(1, 512)
acoustic_tokens: 1.4472830295562744
vocoder time: 1.052943229675293
Inferring: args.text="It seems that you've been living two lives."
generate_audio
semantic_tokens: 1.5347039699554443
acoustic_tokens: 0.33491015434265137
vocoder time: 0.3351314067840576
Inferring: args.text="In one life, you're Thomas A Anderson, program writer for a respectable software company."
generate_audio
semantic_tokens: 4.294419050216675
acoustic_tokens: 1.0456955432891846
vocoder time: 1.1635217666625977
Inferring: args.text='You have a Social Security number, you pay your taxes, and you...help your landlady carry out her garbage.'
generate_audio
semantic_tokens: 5.344205617904663
acoustic_tokens: 1.0732522010803223
vocoder time: 1.2692646980285645
Inferring: args.text='The other life is lived in computers, where you go by the hacker alias Neo and are guilty of virtually every computer crime we have a law for.'
generate_audio
semantic_tokens: 6.878687143325806
acoustic_tokens: 1.191270112991333
vocoder time: 1.4288511276245117
Inferring: args.text='One of these lives has a future, and one of them does not.'
generate_audio
semantic_tokens: 3.3958096504211426
acoustic_tokens: 0.30650901794433594
vocoder time: 0.46129584312438965
Inferring: args.text='Have you ever stood and stared at it, marveled at its beauty, its genius? Billions of people just living out their lives, oblivious.'
generate_audio
semantic_tokens: 5.401077508926392
acoustic_tokens: 1.2113051414489746
vocoder time: 1.4591665267944336
Inferring: args.text='Did you know that the first Matrix was designed to be a perfect human world.'
generate_audio
semantic_tokens: 3.0609078407287598
acoustic_tokens: 0.4051539897918701
vocoder time: 0.49598193168640137
Inferring: args.text='Where none suffered.'
generate_audio
semantic_tokens: 1.1007585525512695
acoustic_tokens: 0.275745153427124
vocoder time: 0.3767883777618408
Inferring: args.text='Where everyone would be happy.'
generate_audio
semantic_tokens: 1.5478556156158447
acoustic_tokens: 0.3160576820373535
vocoder time: 0.45749568939208984
Inferring: args.text='It was a disaster.'
generate_audio
semantic_tokens: 1.159377098083496
acoustic_tokens: 0.2947394847869873
vocoder time: 0.3695690631866455
Inferring: args.text='No one would accept the program.'
generate_audio
semantic_tokens: 1.6103193759918213
acoustic_tokens: 0.29689860343933105
vocoder time: 0.3959059715270996
Inferring: args.text='Entire crops were lost.'
generate_audio
semantic_tokens: 1.473541021347046
acoustic_tokens: 1.0539555549621582
vocoder time: 1.0956377983093262
Inferring: args.text='Some believed that we lacked the programming language to describe your perfect world.'
generate_audio
semantic_tokens: 4.125181674957275
acoustic_tokens: 1.230463981628418
vocoder time: 1.5059840679168701
Inferring: args.text='But I believe that as a species, human beings define their reality through misery and suffering.'

The S2A stage is indeed quite fast; however, T2S is absolutely horrible. With SpeechT5 I can get T2S in 100-200 ms regardless of the prompt length and with a batch size of 50-100. Am I missing something here?
