
FunCodec's Introduction

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

This project is still a work in progress. To make FunCodec better, please let me know your concerns and feel free to raise them in the Issues section.

News

  • 2023.12.22 🎉🎉: We release the training and inference recipes for LauraTTS as well as pre-trained models. LauraTTS is a powerful codec-based zero-shot text-to-speech synthesizer, which outperforms VALL-E in terms of semantic consistency and speaker similarity. Please refer to egs/LibriTTS/text2speech_laura/README.md for more details.

Installation

git clone https://github.com/alibaba/FunCodec.git && cd FunCodec
pip install --editable ./
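
To verify the installation, here is a minimal sanity check (assuming the editable install exposes the funcodec package, which matches the repository layout):

# Quick post-install check: the editable install should make the
# funcodec package importable from anywhere.
import funcodec
print("FunCodec imported from:", funcodec.__file__)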

Available models

🤗 links to the Huggingface model hub, while ⭐ links to ModelScope.

| Model name | Model hub | Corpus | Bitrate (bps) | Parameters | FLOPs |
|---|---|---|---|---|---|
| audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch | 🤗 | General | 250~8000 | 57.83 M | 7.73 G |
| audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch | 🤗 | General | 500~16000 | 14.85 M | 3.72 G |
| audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch | 🤗 | LibriTTS | 250~8000 | 57.83 M | 7.73 G |
| audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch | 🤗 | LibriTTS | 500~16000 | 14.85 M | 3.72 G |
| audio_codec-freqcodec_magphase-en-libritts-16k-gr8nq32ds320-pytorch | 🤗 | LibriTTS | 500~16000 | 4.50 M | 2.18 G |
| audio_codec-freqcodec_magphase-en-libritts-16k-gr1nq32ds320-pytorch | 🤗 | LibriTTS | 500~16000 | 0.52 M | 0.34 G |
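
As a rough guide to the bitrate column, the range follows from the frame rate and the number of quantizers. A back-of-the-envelope sketch, assuming 16 kHz input, 1024-entry codebooks (10 bits per code), and the stride taken from the dsXXX suffix of the model name:

# Back-of-the-envelope bitrate arithmetic for the table above.
# Assumptions: 16 kHz input, 1024-entry codebooks (10 bits per code),
# stride parsed from the dsXXX suffix of the model name.
def bitrate_bps(sample_rate: int, stride: int, num_quantizers: int,
                bits_per_code: int = 10) -> float:
    frames_per_second = sample_rate / stride
    return frames_per_second * num_quantizers * bits_per_code

print(bitrate_bps(16000, 640, 1))   # 250.0   -> lower end for nq32ds640 models
print(bitrate_bps(16000, 640, 32))  # 8000.0  -> upper end for nq32ds640 models
print(bitrate_bps(16000, 320, 1))   # 500.0   -> lower end for nq32ds320 models
print(bitrate_bps(16000, 320, 32))  # 16000.0 -> upper end for nq32ds320 models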

Model Download

Download models from ModelScope

Please refer to egs/LibriTTS/codec/encoding_decoding.sh to download pre-trained models:

cd egs/LibriTTS/codec
model_name=audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch
bash encoding_decoding.sh --stage 0 --model_name ${model_name} --model_hub modelscope
# The pre-trained model will be downloaded to exp/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch

Download models from Huggingface

Please refer to egs/LibriTTS/codec/encoding_decoding.sh to download pre-trained models:

cd egs/LibriTTS/codec
model_name=audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch
bash encoding_decoding.sh --stage 0 --model_name ${model_name} --model_hub huggingface
# The pre-trained model will be downloaded to exp/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch
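
If you would rather fetch the files from Python, huggingface_hub's snapshot_download offers an equivalent route; note that the repo_id namespace below is an assumption, so check the actual model card first:

# Hypothetical Python alternative to the shell script above.
# The "alibaba-damo" namespace in repo_id is an assumption -- verify it
# against the model card on the Hugging Face hub before relying on it.
from huggingface_hub import snapshot_download

model_name = "audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch"
snapshot_download(repo_id=f"alibaba-damo/{model_name}",
                  local_dir=f"exp/{model_name}")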

Inference

Batch inference

Please refer to egs/LibriTTS/codec/encoding_decoding.sh to perform encoding and decoding. Extract codes from an input file input_wav.scp; the codes will be saved to output_dir/codecs.txt in JSON Lines (jsonl) format.

cd egs/LibriTTS/codec
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 \
  --wav_scp input_wav.scp  --out_dir outputs/codecs/
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...

Decode codes from an input file codecs.txt; the reconstructed waveforms will be saved to output_dir/logdir/output.*/*.wav.

bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp codecs.txt --out_dir outputs/recon_wavs 
# codecs.txt is the output of the above encoding stage, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
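
Since each line of codecs.txt is a key followed by a JSON-encoded nested list, a minimal sketch for loading the codes back into numpy arrays might look like this (load_codecs is an illustrative helper, not part of FunCodec):

# Minimal sketch: parse the codecs.txt produced by the encoding stage.
# Each line is "<uttid> <JSON-encoded nested list of code indices>".
import json
import numpy as np

def load_codecs(path: str) -> dict:
    codes = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            uttid, payload = line.strip().split(maxsplit=1)
            codes[uttid] = np.array(json.loads(payload), dtype=np.int64)
    return codes

for uttid, arr in load_codecs("outputs/codecs/codecs.txt").items():
    print(uttid, arr.shape)  # e.g. (1, num_quantizers, num_frames)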

Training

Training on open-source corpora

For commonly-used open-source corpora, you can train a model using the recipes in the egs directory. For example, to train a model on the LibriTTS corpus, you can use egs/LibriTTS/codec/run.sh:

# enter the LibriTTS recipe directory
cd egs/LibriTTS/codec
# run data downloading, preparation and training stages with 2 GPUs (device 0 and 1)
bash run.sh --stage 0 --stop_stage 3 --gpu_devices 0,1 --gpu_num 2

We recommend running the script stage by stage to get an overview of FunCodec.

Training on customized data

For corpora not covered by a recipe, or for customized datasets, you can prepare the data yourself. In general, FunCodec employs Kaldi-style wav.scp files to organize the data. wav.scp has the following format:

# for waveform files
uttid1 /path/to/uttid1.wav
uttid2 /path/to/uttid2.wav
# for kaldi-ark files
uttid3 /path/to/ark1.ark:10
uttid4 /path/to/ark1.ark:200
uttid5 /path/to/ark2.ark:10

As shown in the above example, FunCodec supports mixing waveform files and kaldi-ark files in one wav.scp file for both training and inference. Here is a demo script to train a model on your customized dataset, named foo:

cd egs/LibriTTS/codec
# 0. make the directories for the train, dev and test sets
mkdir -p dump/foo/train dump/foo/dev dump/foo/test

# 1a. if you already have the wav.scp files, rename and place them under the corresponding directories
mv train.scp dump/foo/train/wav.scp; mv dev.scp dump/foo/dev/wav.scp; mv test.scp dump/foo/test/wav.scp
# 1b. if you don't have the wav.scp file, you can prepare it as follows
find path/to/train_set/ -iname "*.wav" | awk -F '/' '{print $(NF),$0}' | sort > dump/foo/train/wav.scp
find path/to/dev_set/   -iname "*.wav" | awk -F '/' '{print $(NF),$0}' | sort > dump/foo/dev/wav.scp
find path/to/test_set/  -iname "*.wav" | awk -F '/' '{print $(NF),$0}' | sort > dump/foo/test/wav.scp

# 2. collate shape files
mkdir -p exp/foo_states/train exp/foo_states/dev
torchrun --nproc_per_node=4 --master_port=1234 scripts/gen_wav_length.py --wav_scp dump/foo/train/wav.scp --out_dir exp/foo_states/train/wav_length
cat exp/foo_states/train/wav_length/wav_length.*.txt | shuf > exp/foo_states/train/speech_shape
torchrun --nproc_per_node=4 --master_port=1234 scripts/gen_wav_length.py --wav_scp dump/foo/dev/wav.scp --out_dir exp/foo_states/dev/wav_length
cat exp/foo_states/dev/wav_length/wav_length.*.txt | shuf > exp/foo_states/dev/speech_shape

# 3. train the model with 2 GPUs (device 4 and 5) on the customized dataset (foo)
bash run.sh --gpu_devices 4,5 --gpu_num 2 --dumpdir dump/foo --state_dir foo_states
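
If you prefer Python to the find/awk pipeline in step 1b, here is a minimal equivalent sketch (write_wav_scp is an illustrative helper using the same filename-as-uttid convention):

# Minimal Python equivalent of step 1b: build a sorted wav.scp mapping
# each file's basename (the same uttid convention as the awk one-liner)
# to its absolute path.
from pathlib import Path

def write_wav_scp(wav_dir: str, scp_path: str) -> None:
    entries = sorted((p.name, str(p.resolve()))
                     for p in Path(wav_dir).rglob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as f:
        for uttid, path in entries:
            f.write(f"{uttid} {path}\n")

write_wav_scp("path/to/train_set", "dump/foo/train/wav.scp")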

Acknowledgements

  1. We kept a design consistent with FunASR, including the dataloader, model definition and so on.
  2. We borrowed a lot of code from Kaldi for data preparation.
  3. We borrowed a lot of code from ESPnet. FunCodec follows the training and finetuning pipelines of ESPnet.
  4. We borrowed the design of the model architecture from EnCodec and EnCodec_Trainer.

License

This project is licensed under the MIT License. FunCodec also contains various third-party components and some code modified from other repos under other open-source licenses.

Citations

@misc{du2023funcodec,
      title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
      author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
      year={2023},
      eprint={2309.07405},
      archivePrefix={arXiv},
      primaryClass={cs.Sound}
}


FunCodec's Issues

LauraTTS: _pickle.UnpicklingError: invalid load key, 'v'.

Environment

  • PyTorch version: 1.12.0
  • Python version: 3.8

Issue Description

I believe I have correctly installed the required PyTorch version as per the README instructions and have also executed pip install --editable ./ to install the necessary requirements. However, while trying to run the "Use LauraTTS to synthesize speech" example, executing the following command:
bash demo.sh --stage 1 --model_name ${model_name} --output_dir results --text "nothing was to be done but to put about, and return in disappointment towards the north."
I encountered the following error:

Traceback (most recent call last):
  File "/root/miniconda3/envs/lg/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/lg/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 561, in <module>
    main()
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 557, in main
    inference(**kwargs)
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 381, in inference
    inference_pipeline = inference_func(
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 287, in inference_func
    my_model = Text2Audio.from_pretrained(
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 227, in from_pretrained
    return Text2Audio(**kwargs)
  File "/root/autodl-tmp/FunCodec/funcodec/bin/text2audio_inference.py", line 53, in __init__
    model, model_args = Text2AudioGenTask.build_model_from_file(
  File "/root/autodl-tmp/FunCodec/funcodec/tasks/abs_task.py", line 1941, in build_model_from_file
    src_state = torch.load(model_file, map_location=device)
  File "/root/miniconda3/envs/lg/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/miniconda3/envs/lg/lib/python3.8/site-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Question:

Is there an issue with how I am using the program? I am eager to experience your project and would greatly appreciate your guidance or suggestions for resolving this issue.

About model details

I want to know how the input shape changes from 257 to 256, thanks.
Also, it seems that the config in this repo is different from the one shipped with the pre-trained models?


How does it achieve zero-shot TTS?

Hi author, thanks for sharing this creative project.
When I read the paper and code, I found that no speaker labels are needed when training LauraTTS. Likewise, dataset.py and the other data-loading files show that training relies only on wav.scp and phoneme.list, and the training data doesn't need to be spliced. So I wonder whether FunCodec and LauraTTS really support zero-shot TTS? If my guess is wrong, thanks for your explanation. :)

Inquiry about Future Plans for FunCodec with Fewer nq Options

I hope this message finds you well. I am reaching out to commend the exceptional work on FunCodec; it has proven to be a remarkable asset to the community. Currently, I notice that all the available checkpoints use 32 nq (quantizers). I am curious whether there are any plans to release versions with fewer quantizers, such as 8 or 12, in the future.
Additionally, I would be interested to learn whether there have been any experiments or considerations regarding the impact of a higher number of quantizers (like 32) on models similar to VALL-E, and whether it affects their performance or efficiency. Your insights on these matters would be greatly appreciated.

Thank you for your dedication to advancing this field. I look forward to your response.

Best regards,

Training FunCodec: Data Sources and Recommendations for Starting From Scratch

In your FunCodec paper, you mentioned that you used 25k hours of data for training the codec. Does this data include open-source datasets like GigaSpeech and WenetSpeech?
If we want to train FunCodec from scratch, do you have any suggestions?
Is it better to use more clean data without background noise, or more data with noise?

Inconsistency in Encode Results with Different Batch Sizes

I have noticed that when using different batch sizes for encoding inference, the same data yields different results. Specifically, changing the batch_size parameter seems to affect the outcome even when the input data remains consistent.
I am unsure if this behavior is expected or indicative of a bug. It would be greatly appreciated if you could provide some insights or guidance on this matter. Understanding the expected behavior when varying batch sizes would be crucial for my continued use and trust in the tool's reliability.
Thank you for your attention to this matter and for your continued support of the community with Funcodec.

Low-complexity FreqCodec requires a lot of VRAM

Hi again,

I'm currently trying to retrain FreqCodec models using the configurations released by you (audio_codec-freqcodec_magphase-en-libritts-16k-gr8nq32ds320-pytorch and audio_codec-freqcodec_magphase-en-libritts-16k-gr1nq32ds320-pytorch).
Using an A100 (40GB) GPU, I am able to train the larger model (4.50M params, 2.18 GFlops) without issues at batch size 32.
However, the smaller model (0.52M params, 0.34 GFlops) causes a CUDA OOM error at batch size 32 (and at 24).

As far as I can tell, in terms of architecture differences, the smaller model uses more groups in the depthwise convolutions, 3 residual layers (instead of 1), and a dilation base of 3 (instead of 2).
I thought depthwise convolutions with more groups would reduce the required memory rather than increase it. Is this a misconception, or do you have any idea what the reason for this could be?

Required features in Jan. 2024

Hi all, I'm collecting requested features to consider implementing in Jan. 2024. Please let me know your concerns and feel free to comment below. Thanks, and let's make FunCodec better!

Getting error while testing LauraTTS

Hi @ZhihaoDU, while running LauraTTS from the egs/LibriTTS/text2speech_laura README, I am getting the errors described below.

When using ModelScope, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rishikesh/.local/lib/python3.10/site-packages/modelscope/pipelines/builder.py", line 170, in pipeline
    return build_pipeline(cfg, task_name=task)
  File "/home/rishikesh/.local/lib/python3.10/site-packages/modelscope/pipelines/builder.py", line 65, in build_pipeline
    return build_from_cfg(
  File "/home/rishikesh/.local/lib/python3.10/site-packages/modelscope/utils/registry.py", line 198, in build_from_cfg
    raise KeyError(
KeyError: 'laura-codec-tts-inference is not in the pipelines registry group text-to-speech. Please make sure the correct version of ModelScope library is used.'

While running the bash command, I get the following error:

File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 561, in <module>
    main()
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 557, in main
    inference(**kwargs)
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 381, in inference
    inference_pipeline = inference_func(
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 287, in inference_func
    my_model = Text2Audio.from_pretrained(
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 227, in from_pretrained
    return Text2Audio(**kwargs)
  File "/home/rishikesh/code/FunCodec/funcodec/bin/text2audio_inference.py", line 53, in __init__
    model, model_args = Text2AudioGenTask.build_model_from_file(
  File "/home/rishikesh/code/FunCodec/funcodec/tasks/abs_task.py", line 1928, in build_model_from_file
    model = cls.build_model(args)
  File "/home/rishikesh/code/FunCodec/funcodec/tasks/text2audio_generation.py", line 206, in build_model
    if args.text_encoder is not None:
AttributeError: 'Namespace' object has no attribute 'text_encoder'

FYI: the bash issue is resolved.

How to check progress?

Hi @ZhihaoDU
Stage 3 has started training, but where can I see the progress bar?
I am on a 4080 card, and the training parameters have not changed; I don't think it is possible to keep the same parameters as your A800 card, such as batch_size and num_workers. I usually watch the progress bar and adjust batch_size accordingly, but I cannot do that now.

Difference between Encodec and Funcodec

Hi,

first of all, thank you for making this toolkit publicly available.

I have a question regarding the difference between Encodec and FunCodec in your paper:
In Table 3, you list Encodec and FunCodec as different models. Initially I thought FunCodec referred to the frequency-domain model. However, on your demo page, the models "FunCodec" and "FunCodec-2x" are time-domain models, and I was unable to find a difference from the Encodec architecture (besides the training data and the increased stride for the 2x model).

I am probably missing something and would be grateful if you could clarify this.

Discriminator loss?

As far as I can tell, the loss in the figure above should be the generator loss. What does the discriminator loss look like?

How can I run inference with only 4 or 8 of FunCodec's quantizer layers?

Hello, I would like to test FunCodec's performance with 4 or 8 codec layers. I tried setting the corresponding hyper-parameter in the config file directly to 8, but after this change it only generates a single layer of codes. I also tried forcing the last 24 of the 32 code layers to zero in ddp_core_vq.py at inference time, but the audio reconstructed that way is very poor. What is the correct way to do this?

zipfile.BadZipFile: File is not a zip file

Issue Description: I followed the instructions in the README.md step by step, until I encountered the following problem when I executed the command in the order described under "Use LauraTTS to synthesize speech":
bash demo.sh --stage 1 --model_name ${model_name} --output_dir results --text "nothing was to be done but to put about, and return in disappointment towards the north."

I encountered the following:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 561, in <module>
    main()
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 557, in main
    inference(**kwargs)
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 381, in inference
    inference_pipeline = inference_func(
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 287, in inference_func
    my_model = Text2Audio.from_pretrained(
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 227, in from_pretrained
    return Text2Audio(**kwargs)
  File "/root/FunCodec/funcodec/bin/text2audio_inference.py", line 75, in __init__
    from funcodec.text.phoneme_tokenizer import G2p_en
  File "/root/FunCodec/funcodec/text/phoneme_tokenizer.py", line 10, in <module>
    import g2p_en
  File "/root/miniconda3/lib/python3.8/site-packages/g2p_en/__init__.py", line 1, in <module>
    from .g2p import G2p
  File "/root/miniconda3/lib/python3.8/site-packages/g2p_en/g2p.py", line 26, in <module>
    nltk.data.find('corpora/cmudict.zip')
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/root/miniconda3/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/root/miniconda3/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Question: I have already downloaded the two relevant models, but encountered an error while running speech synthesis. I am eager to experience your project and would greatly appreciate your guidance or suggestions for resolving this issue.

Chinese raw-text input is not supported

bash demo.sh --stage 2 --model_name ${model_name} --output_dir results --text "你好" \
  --prompt_text "one of these is context" --prompt_audio "demo/8230_279154_000013_000003.wav"
This is not supported.
self.phoneme_tokenizer uses g2p_en to convert English words into phonemes; does the model support Chinese inputs?

TKR?

The first row in this table shows 400, 200, ..., 50 TKR. I think this is the sampling rate divided by the stride, then multiplied by the number of tokens, right?
For example: 16000 / 320 * 8 = 400 TKR.
I guess the number of tokens in each of the first four rows is the same set, [8, 4, 2, 1], all at the same 16000 Hz sampling rate.
But in the last two rows: do you get the same TKR by changing the sampling rate, or by changing the number of tokens?
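
To make the guess above concrete, the arithmetic I have in mind is (assuming TKR = sampling rate / stride × number of tokens):

# The guessed relation: TKR = sample_rate / stride * num_tokens.
def tkr(sample_rate: int, stride: int, num_tokens: int) -> float:
    return sample_rate / stride * num_tokens

for n in (8, 4, 2, 1):
    print(n, tkr(16000, 320, n))  # 400, 200, 100, 50 -- the first four rows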

run.sh: 34: utils/parse_options.sh: Syntax error: Bad for loop variable

In the text2speech_laura folder, when I run sh run.sh to start model training, I get the error shown in the title. Do I need to add some parameter? Also, where in the project is the config conf/encodec_lstm_16k_n32_600k_step_rmseg_use_power_ds640.yaml from the downloaded pre-trained model folder audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch? I could not find it in the project; I would like to inspect and modify the model structure. Thanks.

Relation between bitrate and token ratio

Hi,

reading your paper, it was unclear to me how exactly the token ratio (TKR) relates to the bitrate.
Initially, I thought it meant the number of frames per second at 16 kHz, where one codebook index would be generated per frame. But then I realized this can't be right, because in Table 3 different TKRs are shown for the same stride.

Could you further explain the relation between TKR and bitrate, maybe with an example, e.g. for one of the FreqCodec models?

[bug] The codecs.txt generated in the encoding stage cannot be read directly?

Following the provided encoding_decoding.sh script, the encoding stage generates a codecs.txt file.

The file has a format like:
utt_id <space> json.dumps(codecs)

This format cannot be read directly by read_text.py; the load_jsonl_trans_int function needs to be rewritten as follows:

# Rewritten load_jsonl_trans_int (in funcodec's read_text module); handles
# both dict-valued and list-valued JSON payloads.
import json
import logging
from pathlib import Path
from typing import Dict, Union

import numpy as np

def load_jsonl_trans_int(path: Union[Path, str]) -> Dict[str, np.ndarray]:
    d = read_2column_text(path)  # provided by funcodec's read_text module
    retval = {}
    for k, v in d.items():
        try:
            value = json.loads(v)
            if isinstance(value, dict):
                retval[k] = np.array(value["trans"], dtype=int)
            elif isinstance(value, list):
                retval[k] = np.array(value, dtype=int)
            else:
                raise TypeError
        except TypeError:
            logging.error(f'Error happened with path="{path}", id="{k}", value="{v}"')
            raise
    return retval

Stage 3

/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/conv.py:306: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:80.)
return F.conv1d(input, weight, bias, self.stride,
[DESKTOP-PQV8NDO] 2024-04-16 14:56:40,650 (codec_basic:648) INFO: Will update discriminator: forward_step=0, disc_loss=2.0000, gen_loss=0.0000
Traceback (most recent call last):
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/000/FunCodec-master/funcodec/bin/codec_train.py", line 48, in
main(args=args)
File "/mnt/e/000/FunCodec-master/funcodec/bin/codec_train.py", line 23, in main
GANSpeechCodecTask.main(args=args, cmd=cmd)
File "/mnt/e/000/FunCodec-master/funcodec/tasks/abs_task.py", line 1130, in main
cls.main_worker(args)
File "/mnt/e/000/FunCodec-master/funcodec/tasks/abs_task.py", line 1431, in main_worker
cls.trainer.run(
File "/mnt/e/000/FunCodec-master/funcodec/train/trainer.py", line 308, in run
all_steps_are_invalid, max_update_stop = cls.train_one_epoch(
File "/mnt/e/000/FunCodec-master/funcodec/train/gan_trainer.py", line 185, in train_one_epoch
retval = model(turn == "generator", batch)
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/e/000/FunCodec-master/funcodec/models/codec_basic.py", line 324, in forward
return self._forward_generator(
File "/mnt/e/000/FunCodec-master/funcodec/models/codec_basic.py", line 528, in _forward_generator
orig_mel, recon_mel = map(mel_transform, (orig_speech, recon_speech))
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/rian0532/anaconda3/envs/py39Ubuntu/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/e/000/FunCodec-master/funcodec/models/codec_basic.py", line 66, in forward
mel_output = torch.matmul(self.mel_basis, power_spec)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x513 and 4x513)

Fails to train with multiple GPUs

Hi,

I can train the codec model on a single GPU, but I cannot train it in multi-GPU mode. The logs are below.

The environment is:

CUDA Version: 12.2 
alias-free-torch         0.0.6
pytorch-wpe              0.0.1
torch                    1.13.1
torch-complex            0.4.3
torchaudio               0.13.1
torchvision              0.14.1

The main log is:

./run_freqcodec.sh: gpu_num: 2
stage 3: Training
log can be found at ./exp/freqcodec_mag_phase_16k_n32_600k_step_ds640/log/train.log.0

The detailed log is:

-rw-rw-r-- 1 test test   0 Jan 18 11:07 train.log.0
-rw-rw-r-- 1 test test 884 Jan 18 11:07 train.log.1

cat train.log.1
Traceback (most recent call last):
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/test/code/enhance/FunCodec/funcodec/bin/codec_train.py", line 32, in <module>
    torch.cuda.set_device(args.gpu_id)
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Inconsistent waveform amplitude when decoding with FunCodec

Thanks for sharing. When I try FunCodec decoding, after ignoring the scale output, the waveform structure is similar but the amplitude differs greatly. How should I modify the code?

speech2token = Speech2Token(
    "egs/LibriTTS/codec/exp/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch/config.yaml",
    "egs/LibriTTS/codec/exp/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch/model.pth",
    sampling_rate=16000,
)

audio, rate = librosa.load("egs/LibriTTS/codec/test_wav/BAC009S0002W0122.wav", sr=16000)
audio_32 = np.reshape(audio, (1, 1, -1))
output = speech2token(audio_32, bit_width=16000, run_mod="encode")
tokens = output[0][0]
tokens_t = tokens.permute(1, 2, 0)
audio_re = speech2token(tokens_t, bit_width=16000, run_mod="decode")

TypeError: 'NoneType' object is not callable

I didn't run run.sh; I ran codec_train.py directly, because I wanted to understand the architecture of the whole program. But I hit this problem. Do you know the reason?

Traceback (most recent call last):
File "E:\00\FunCodec-master\funcodec\bin\codec_train.py", line 48, in
main(args=args)
File "E:\00\FunCodec-master\funcodec\bin\codec_train.py", line 23, in main
GANSpeechCodecTask.main(args=args, cmd=cmd)
File "E:\00\FunCodec-master\funcodec\tasks\abs_task.py", line 1130, in main
cls.main_worker(args)
File "E:\00\FunCodec-master\funcodec\tasks\abs_task.py", line 1239, in main_worker
model = cls.build_model(args=args)
File "E:\00\FunCodec-master\funcodec\tasks\gan_speech_codec.py", line 310, in build_model
frontend = frontend_class(**args.frontend_conf)
TypeError: 'NoneType' object is not callable

Error at stage 4 when running run.sh

The error message is shown below:
run.pl: job failed, log is in /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//logdir/inference.1.log
cat: '/mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//logdir/output./codecs.txt': No such file or directory
Codes are saved to /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//logdir/output.
/codecs.txt and collected to /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codecs//codecs.txt.
codec scp files are collected into /mnt/d/Work/FunCodec/egs/LibriTTS/text2speech_laura/dump/libritts/test-other/codec_token.scp

From the log it looks like a path error, but the earlier steps, stages 1 to 3, all ran without error:

2024-04-15 15:52:57,508 (codec_inference:233) INFO: param_dict: None
Traceback (most recent call last):
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 584, in
main()
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 580, in main
inference(**kwargs)
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 425, in inference
return inference_pipeline(data_path_and_name_and_type, raw_inputs=None)
File "/home/zz/work/FunCodec/funcodec/bin/codec_inference.py", line 313, in _forward
for keys, batch in loader:
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 652, in next
data = self._next_data()
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
return self._process_data(data)
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
data.reraise()
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/home/zz/work/FunCodec/funcodec/datasets/iterable_dataset.py", line 260, in iter
array = func(value)
File "/home/zz/work/FunCodec/funcodec/datasets/iterable_dataset.py", line 28, in load_kaldi
retval = kaldiio.load_mat(input)
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/kaldiio/matio.py", line 239, in load_mat
with open_like_kaldi(ark, "rb") as fd:
File "/home/zz/anaconda3/envs/Laura/lib/python3.9/site-packages/kaldiio/utils.py", line 207, in open_like_kaldi
return io.open(name, mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'dump/libritts/train/arks/wav.00.ark'

ERROR Generating with prompt text and prompt audio

Hi, thank you for sharing FunCodec, this is really awesome work!

I ran into the following issue when trying to generate audio using my own prompt audio and prompt text. Please let me know what the nature of this error is and how it can be fixed. Thank you very much!

File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 617, in <module>
    main()
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 613, in main
    inference(**kwargs)
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 454, in inference
    return inference_pipeline(data_path_and_name_and_type, raw_inputs=kwargs.get("raw_inputs", None))
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 400, in _forward
    ret_val, _ = my_model(*model_inputs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/bin/text2audio_inference.py", line 218, in __call__
    gen_speech = self.model.syn_audio(
  File "/home/____/FunCodec/funcodec/models/audio_generation/laura_model.py", line 565, in syn_audio
    _, _, recon_wav, _ = codec_model(codec_emb[:, continual_length:], run_mod="decode_emb")
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/bin/codec_inference.py", line 119, in __call__
    ret_dict = self.model.inference_decoding_emb(*batch)
  File "/home/____/FunCodec/funcodec/models/codec_basic.py", line 829, in inference_decoding_emb
    recon_speech = self._decode(codes)
  File "/home/____/FunCodec/funcodec/models/codec_basic.py", line 390, in _decode
    return self._decode_frame(encoded_frames[0])
  File "/home/____/FunCodec/funcodec/models/codec_basic.py", line 401, in _decode_frame
    out = self.decoder(emb)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/models/decoder/seanet_decoder.py", line 179, in forward
    y = self.model(z.permute(0, 2, 1))
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/modules/normed_modules/conv.py", line 259, in forward
    x = self.conv(x)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/FunCodec/funcodec/modules/normed_modules/conv.py", line 157, in forward
    x = self.conv(x)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/____/anaconda3/envs/funcodec2/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size

Questions about training from scratch

Hello, I followed the steps in run.sh to train on the LibriTTS-R dataset. Below are the training loss curves. When I use the current checkpoint to synthesize speech, the output is almost pure noise. Based on the losses, does the model's training look normal? Thank you!

[Figures: loss, nll_loss, reg_l1_loss, and reg_l2_loss training curves]

The audio generated by LauraTTS inference is noise?

model_name="speech_synthesizer-laura-en-libritts-16k-codec_nq2-pytorch"
bash demo.sh --stage 1 --model_name ${model_name} --output_dir results --text "nothing was to be done but to put about, and return in disappointment towards the north."

[Attachments: utt1_gen.mp4, utt1_gen_only_lm.mp4]

How should the transformer-based LMModel in EnCodec be trained and applied?

Hello, I would like to ask how the optional LM model after the quantizer, described in the original EnCodec paper, should be trained and applied. Does it require a set of options in the EnCodec model's config file, or does it require writing a new training script? If a new training script is needed, how should it be written? Looking forward to your answer!
