
trainer's People

Contributors

a-froghyar, bitnom, edresson, eginhard, erogol, iprovalo, loganhart02, manmay-nakhashi, ppisljar, shenberg, squire-tomsk, weberjulian, zutatensuppe


trainer's Issues

[Feature request] Save only last N checkpoints on ClearML

🚀 Feature Description
Save last N checkpoints on ClearML logger.

Solution
Implement saving only the last N checkpoints on a dashboard logger of choice.

Additional context

Keeping all the checkpoints is unnecessary and induces extra costs on cloud storage.
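As a rough illustration of the requested behaviour, the local-disk part of "keep only the last N" could look like the sketch below. The checkpoint_<step>.pth naming follows what the trainer writes to the run folder; the helper function itself and the idea of calling it after each save are assumptions, and pruning the remote copies (ClearML/S3) would additionally need the dashboard logger's own delete API.

import os
import re

def keep_last_n_checkpoints(run_dir: str, n: int) -> None:
    """Delete all but the newest n checkpoint_<step>.pth files in run_dir (sketch)."""
    pattern = re.compile(r"checkpoint_(\d+)\.pth$")
    found = []
    for fname in os.listdir(run_dir):
        match = pattern.match(fname)
        if match:
            found.append((int(match.group(1)), fname))
    found.sort()  # oldest step first
    for _, fname in found[:-n]:
        os.remove(os.path.join(run_dir, fname))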

[Bug] Errors when running VITS in distributed DDP mode

Describe the bug

There are two problems. The first occurs when starting VCTK VITS training with DDP, which results in a division by zero error. The second occurred with a custom VITS recipe:

 > EVALUATION

 | > Synthesizing test sentences.
 ! Run is removed from /home/wdb/src/gitlab.com/whydobirds/tts-coqui/recipes/frank/vits/vits_frank-June-01-2022_02+35PM-75e2a0fd
Traceback (most recent call last):
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1492, in fit
    self._fit()
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1480, in _fit
    self.test_run()
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1416, in test_run
    self.model.test_log(test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done)
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'test_log'

To Reproduce

Start training VITS in distributed mode:

 python -m trainer.distribute --script recipes/vctk/vits/train_vits.py --gpus "0,1"

This immediately results in a division by zero error.

Starting with a custom script:

 python -m trainer.distribute --script recipes/custom/vits/train_vits.py --gpus "0,1"

This results in the stack trace shown above. Presumably the VCTK run will hit the same problem once the division by zero is resolved.

Expected behavior

The training run should complete successfully.

Logs

No response

Environment

The wget above 404s.

Additional context

I think the culprit might just be this line:

if hasattr(self.model, "test_log") or (self.num_gpus > 1 and hasattr(self.model.module, "test_log")):

Everywhere else the pattern is "if num_gpus > 1: use model.module", but here the predicate covers both cases (DDP and not) while the body always calls self.model. I guess it should be two separate conditions.
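A minimal sketch of that two-branch version, following the num_gpus > 1 -> model.module pattern used elsewhere in trainer.py (the call signature is copied from the traceback above; this is a suggestion, not the merged fix):

# inside Trainer.test_run()
if self.num_gpus > 1 and hasattr(self.model.module, "test_log"):
    # unwrap DistributedDataParallel before calling the model hook
    self.model.module.test_log(
        test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done
    )
elif hasattr(self.model, "test_log"):
    self.model.test_log(
        test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done
    )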

[Feature request] Reset best loss when restoring from checkpoint in order to keep best model of fine-tuning

🚀 Feature Description

I'm fine-tuning a Coqui TTS model on a different dataset and language. Since it never reaches the best loss of the pre-trained model, it never saves a best model for the fine-tuning stage. I can only see the final checkpoints; the best fine-tuning model gets lost because I keep only the last X checkpoints.

Solution

If I could reset the best loss stored in the checkpoint when restoring, the trainer would start tracking and saving the best model from the beginning of fine-tuning.
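For illustration, a user-side workaround could look like the sketch below, assuming the tracked value lives in trainer.best_loss (the attribute name appears in the save_best_model code path); resetting it this way after construction is an assumption, not an existing Trainer option:

trainer = Trainer(
    TrainerArgs(restore_path=RESTORE_PATH),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
# Forget the pre-trained model's best loss so the fine-tuning run starts
# tracking and saving its own best_model_*.pth from the first evaluation.
trainer.best_loss = float("inf")
trainer.fit()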

[Bug] Trainer saves the given config to ClearML rather than the continued config

Describe the bug

Trainer saves the config that is given at the beginning of the training rather than the one restored after continuing a training.

To Reproduce

Continue one of your trainings from a Python script and pass a modified config.json.

The trainer will train with the restored config but save/log the modified one.

Expected behavior

The config that is actually used for training should be the one that gets saved.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            ""
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0",
        "Trainer": "v0.0.12",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.5",
        "version": "#119-Ubuntu SMP Mon Mar 7 18:49:24 UTC 2022"
    }
}

Additional context

No response

[Bug] Outputs referenced before assignment error while using YourTTS recipe

Describe the bug

Hi,

When running the YourTTS recipe with my own LJSpeech-format dataset, I get the following error during the first evaluation:

> DataLoader initialization
| > Tokenizer:
        | > add_blank: True
        | > use_eos_bos: False
        | > use_phonemes: False
| > Number of instances : 20
 | > Preprocessing samples
 | > Max text length: 119
 | > Min text length: 21
 | > Avg text length: 45.2
 |
 | > Max audio length: 119952.0
 | > Min audio length: 18103.0
 | > Avg audio length: 48773.45
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
 > Using weighted sampler for attribute 'speaker_name' with alpha '1.0'
None
 > Attribute weights for '['ljspeech']'
 | > [0.22360679774997894]

 > EVALUATION 

 ! Run is kept in /home/caraduf/Models/YourTTS_ME_22k-February-07-2023_06+31AM-0000000
Traceback (most recent call last):
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1659, in fit
    self._fit()
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1614, in _fit
    self.eval_epoch()
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1501, in eval_epoch
    outputs,
UnboundLocalError: local variable 'outputs' referenced before assignment

I updated the trainer to the latest version from GitHub, following the instructions, but the issue still occurs.

Also note that training a plain VITS model on the same dataset (with the same max audio length of 10 seconds, i.e. 10 x 22050 samples) works, so the failure happens only with the YourTTS recipe. I will try with debug mode on and see if it reveals anything interesting.

Here is the adapted recipe:

import os

import torch
from trainer import Trainer, TrainerArgs

from TTS.bin.compute_embeddings import compute_embeddings
from TTS.bin.resample import resample_files
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import CharactersConfig, Vits, VitsArgs, VitsAudioConfig
from TTS.utils.downloaders import download_vctk

torch.set_num_threads(24)

# pylint: disable=W0105
"""
    This recipe replicates the first experiment proposed in the YourTTS paper (https://arxiv.org/abs/2112.02418).
    The YourTTS model is based on the VITS model, but it uses external speaker embeddings extracted from a pre-trained speaker encoder and has small architecture changes.
    In addition, YourTTS can be trained on multilingual data; however, this recipe replicates the single-language training using the VCTK dataset.
    If you are interested in multilingual training, we have commented out parameters in the VitsArgs class instance that should be enabled for multilingual training.
    In addition, you will need to add the extra datasets, following VCTK as an example.
"""
# Path where you want to save the models outputs (configs, checkpoints and tensorboard logs)
OUT_PATH = os.path.dirname(os.path.abspath(__file__))  # "/raid/coqui/Checkpoints/original-YourTTS/"

# Name of the run for the Trainer
RUN_NAME = "YourTTS_ME_22k"

Me_Rec_1_config = BaseDatasetConfig(
    formatter="ljspeech", dataset_name="ME_Rec1", meta_file_train="metadata.csv", path="/home/caraduf/Datasets/ME_22kHz/Rec_1_LARGE_V2_22.05kHz_dataset", language="fr-fr"
)

Me_Rec_2_config = BaseDatasetConfig(
    formatter="ljspeech", dataset_name="ME_Rec2", meta_file_train="metadata.csv", path="/home/caraduf/Datasets/ME_22kHz/Rec_2_LARGE_V2_22.05kHz_dataset", language="fr-fr"
)



# Add all dataset configs here. Note: if you add new datasets, just add them here and the speaker embeddings (d-vectors) will be computed automatically for them :)
DATASETS_CONFIG_LIST = [
    Me_Rec_1_config,
    Me_Rec_2_config
]


# If you want to do transfer learning and speedup your training you can set here the path to the original YourTTS model
RESTORE_PATH = None  # "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"

# This parameter is useful for debugging: it skips the training epochs and only runs the evaluation and produces the test sentences
SKIP_TRAIN_EPOCH = False

# Set here the batch size to be used in training and evaluation
BATCH_SIZE = 32

# Training sampling rate and the target sampling rate for resampling the downloaded dataset (Note: if you change this you might need to re-download the dataset!!)
# Note: if you add new datasets, please make sure that the dataset sampling rate and this parameter match, otherwise resample your audios
SAMPLE_RATE = 22050

# Max audio length in seconds to be used in training (every audio bigger than it will be ignored)
MAX_AUDIO_LEN_IN_SECONDS = 10

# # Define the number of threads used during the audio resampling
# NUM_RESAMPLE_THREADS = 10
# # Check if VCTK dataset is not already downloaded, if not download it
# if not os.path.exists(VCTK_DOWNLOAD_PATH):
#     print(">>> Downloading VCTK dataset:")
#     download_vctk(VCTK_DOWNLOAD_PATH)
#     resample_files(VCTK_DOWNLOAD_PATH, SAMPLE_RATE, file_ext="flac", n_jobs=NUM_RESAMPLE_THREADS)

# # init configs
# vctk_config = BaseDatasetConfig(
#     formatter="vctk",
#     dataset_name="vctk",
#     meta_file_train="",
#     meta_file_val="",
#     path=VCTK_DOWNLOAD_PATH,
#     language="en",
#     ignored_speakers=[
#         "p261",
#         "p225",
#         "p294",
#         "p347",
#         "p238",
#         "p234",
#         "p248",
#         "p335",
#         "p245",
#         "p326",
#         "p302",
#     ],  # Ignore the test speakers to full replicate the paper experiment
# )


### Extract speaker embeddings
SPEAKER_ENCODER_CHECKPOINT_PATH = (
    "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar"
)
SPEAKER_ENCODER_CONFIG_PATH = "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json"

D_VECTOR_FILES = []  # List of speaker embeddings/d-vectors to be used during the training

# Iterate over all the dataset configs, checking if the speaker embeddings are already computed; if not, compute them
for dataset_conf in DATASETS_CONFIG_LIST:
    # Check if the embeddings weren't already computed, if not compute it
    embeddings_file = os.path.join(dataset_conf.path, "speakers.pth")
    if not os.path.isfile(embeddings_file):
        print(f">>> Computing the speaker embeddings for the {dataset_conf.dataset_name} dataset")
        compute_embeddings(
            SPEAKER_ENCODER_CHECKPOINT_PATH,
            SPEAKER_ENCODER_CONFIG_PATH,
            embeddings_file,
            old_spakers_file=None,
            config_dataset_path=None,
            formatter_name=dataset_conf.formatter,
            dataset_name=dataset_conf.dataset_name,
            dataset_path=dataset_conf.path,
            meta_file_train=dataset_conf.meta_file_train,
            meta_file_val=dataset_conf.meta_file_val,
            disable_cuda=False,
            no_eval=False,
        )
    D_VECTOR_FILES.append(embeddings_file)


# Audio config used in training.
audio_config = VitsAudioConfig(
    sample_rate=SAMPLE_RATE,
    hop_length=256,
    win_length=1024,
    fft_size=1024,
    mel_fmin=0.0,
    mel_fmax=None,
    num_mels=80,
)

# Init VitsArgs, setting the arguments needed for the YourTTS model
model_args = VitsArgs(
    d_vector_file=D_VECTOR_FILES,
    use_d_vector_file=True,
    d_vector_dim=512,
    num_layers_text_encoder=10,
    speaker_encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
    speaker_encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH,
    resblock_type_decoder="1",  # In the paper, YourTTS was accidentally trained with ResNet blocks of type 2; if you prefer, you can use type 1 blocks like the VITS model
    # Useful parameter to enable the Speaker Consistency Loss (SCL) described in the paper
    # use_speaker_encoder_as_loss=True,
    # Useful parameters to enable multilingual training
    # use_language_embedding=True,
    # embedded_language_dim=4,
)

# General training config; here you can change the batch size and other useful parameters
config = VitsConfig(
    output_path=OUT_PATH,
    model_args=model_args,
    run_name=RUN_NAME,
    project_name="YourTTS",
    run_description="""
            - Original YourTTS trained using shorter extracts made by the new method
        """,
    dashboard_logger="tensorboard",
    logger_uri=None,
    audio=audio_config,
    batch_size=BATCH_SIZE,
    batch_group_size=48,
    eval_batch_size=BATCH_SIZE,
    num_loader_workers=4,
    eval_split_max_size=256,
    print_step=50,
    plot_step=100,
    log_model_step=1000,
    save_step=5000,
    save_n_checkpoints=10,
    save_checkpoints=True,
    target_loss="loss_1",
    print_eval=False,
    use_phonemes=False,
    phonemizer="espeak",
    phoneme_language="fr-fr",
    compute_input_seq_cache=True,
    add_blank=True,
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class="TTS.tts.models.vits.VitsCharacters",
        pad="_",
        eos="&",
        bos="*",
        blank=None,
        #characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        
        characters="!',-.:?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz «»ÀÇÉÊàâçèéêëîïôùûœ–’…",
        punctuations="!'(),-.:;? ",
        phonemes="",
        is_unique=True,
        is_sorted=True,
    ),
    phoneme_cache_path=None,
    precompute_num_workers=12,
    start_by_longest=True,
    datasets=DATASETS_CONFIG_LIST,
    cudnn_benchmark=False,
    max_audio_len=SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS,
    mixed_precision=True,
    test_sentences=[
        [
            "Il m'a fallu du temps pour créer cette voix alors ma bouche ne restera pas fermée.",
            # "ME",
            None,
            "fr_FR",
        ],
        [
            "Il m'a fallu beaucoup de temps pour développer une voix, et maintenant que je l'ai, je ne vais pas me taire.",
            # "ME",
            None,
            "fr_FR",
        ],
        [
            "Mais son âge rendait cette dernière qualité plus saillante.",
            # "ME",
            None,
            "fr_FR",
        ],
        # [
        #     "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
        #     "VCTK_p277",
        #     None,
        #     "en",
        # ],
        # [
        #     "Be a voice, not an echo.",
        #     "VCTK_p239",
        #     None,
        #     "en",
        # ],
        # [
        #     "I'm sorry Dave. I'm afraid I can't do that.",
        #     "VCTK_p258",
        #     None,
        #     "en",
        # ],
        # [
        #     "This cake is great. It's so delicious and moist.",
        #     "VCTK_p244",
        #     None,
        #     "en",
        # ],
        # [
        #     "Prior to November 22, 1963.",
        #     "VCTK_p305",
        #     None,
        #     "en",
        # ],
    ],
    # Enable the weighted sampler
    use_weighted_sampler=True,
    # Ensures that all speakers are seen in the training batch equally no matter how many samples each speaker has
    weighted_sampler_attrs={"speaker_name": 1.0},
    weighted_sampler_multipliers={},
    # It defines the Speaker Consistency Loss (SCL) α to 9 like the paper
    speaker_encoder_loss_alpha=9.0,
)

# Load all the dataset samples and split training and evaluation sets
train_samples, eval_samples = load_tts_samples(
    config.datasets,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# Init the model
model = Vits.init_from_config(config)

# Init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(restore_path=RESTORE_PATH, skip_train_epoch=SKIP_TRAIN_EPOCH),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

To Reproduce

Create a dataset in LJSpeech format (22.05 kHz audio) in French.
Adapt the dataset config and sample rate in the provided recipe.
Launch it.

python3 YourTTS_recipe.py

Expected behavior

The training should go on.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "Trainer": "v0.0.22",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

[Bug] TypeError when restarting training from a checkpoint (best loss saved as dict)

Describe the bug

Error when restarting model training from a checkpoint in Coqui TTS.
When a checkpoint is saved for later training, the last training and eval losses are stored as a dict. When training from scratch, the last training loss is tracked as a float. Hence, restarting from a checkpoint fails when the two are compared.
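A minimal sketch of one way io.save_best_model could reconcile the two representations before comparing; the "train_loss"/"eval_loss" key names are assumptions about the dict layout:

def _to_comparable(loss):
    """Reduce a stored best-loss entry (float or dict) to a single float (sketch)."""
    if isinstance(loss, dict):
        # prefer the eval loss when it exists, otherwise fall back to the train loss
        if loss.get("eval_loss") is not None:
            return loss["eval_loss"]
        return loss["train_loss"]
    return loss

# then compare normalized values instead of the raw objects:
# if _to_comparable(current_loss) < _to_comparable(best_loss): ...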

To Reproduce

  1. Train a model in Coqui TTS using trainer
  2. Once a checkpoint for best model is saved, stop the training
  3. Set the checkpoint folder as continue path in the trainer class
  4. Restart from the checkpoint

https://colab.research.google.com/drive/1OwemROn306_JIYASjx39d52eXFHS1O_u

Expected behavior

The training should resume from the checkpoint without errors.

Logs

Traceback (most recent call last):
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1771, in _fit
    self.save_best_model()
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/utils/distributed.py", line 35, in wrapped_fn
    return fn(*args, **kwargs)
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1893, in save_best_model
    self.best_loss = save_best_model(
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/io.py", line 183, in save_best_model
    if current_loss < best_loss:
TypeError: '<' not supported between instances of 'float' and 'dict'

Environment

- torch: 2.1.0
- trainer: 0.0.31
- python: 3.10
- OS: EndeavourOS
- cuda: cuda_12.2.r12.2
- GPU: NVIDIA RTX 3060
- pytorch installation: pip

Additional context

No response

[Feature request] Multi-node training

Hi,
I have two questions:

  1. Can it be used in multi-node training?
  2. When will the trainer support DeepSpeed? I have noticed that integrating DeepSpeed is on the to-do list, but is there a concrete timeline or schedule?

Thank you!

Tag releases on Git

Release versions are currently only available on PyPI, while the exact commit they were built from has to be guessed and can't be cleanly referenced in Git for automatic updates.

[Bug] ClearML checkpoints on S3 are faulty.

Describe the bug

Downloading and trying to load a checkpoint from S3 that was saved by ClearML raises the following error.

/miniforge3/lib/python3.9/site-packages/torch/serialization.py in __init__(self, name_or_buffer)
    240 class _open_zipfile_reader(_opener):
    241     def __init__(self, name_or_buffer) -> None:
--> 242         super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
    243
    244

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

To Reproduce

Try running

import os
from dataclasses import dataclass, field

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

from trainer import Trainer, TrainerArgs, TrainerConfig, TrainerModel


@dataclass
class MnistModelConfig(TrainerConfig):
    optimizer: str = "Adam"
    lr: float = 0.001
    epochs: int = 5
    print_step: int = 1
    plot_step: int = 1
    save_step: int = 1
    dashboard_logger: str = "clearml"
    project_name: str = "pytorch-mnist"
    run_name: str = "test-run"


class MnistModel(TrainerModel):
    def __init__(self):
        super().__init__()

        # mnist images are (1, 28, 28) (channels, height, width)
        self.layer_1 = nn.Linear(28 * 28, 128)
        self.layer_2 = nn.Linear(128, 256)
        self.layer_3 = nn.Linear(256, 10)

    def forward(self, x):
        batch_size, _, _, _ = x.size()

        # (b, 1, 28, 28) -> (b, 1*28*28)
        x = x.view(batch_size, -1)
        x = self.layer_1(x)
        x = F.relu(x)
        x = self.layer_2(x)
        x = F.relu(x)
        x = self.layer_3(x)

        x = F.log_softmax(x, dim=1)
        return x

    def train_step(self, batch, criterion):
        x, y = batch
        logits = self(x)
        loss = criterion(logits, y)
        return {"model_outputs": logits}, {"loss": loss}

    def eval_step(self, batch, criterion):
        x, y = batch
        logits = self(x)
        loss = criterion(logits, y)
        return {"model_outputs": logits}, {"loss": loss}

    def get_criterion(self):
        return torch.nn.NLLLoss()

    def get_data_loader(self, config, assets, is_eval, samples, verbose, num_gpus, rank=0):
        transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
        dataset = MNIST(os.getcwd(), train=not is_eval, download=True, transform=transform)
        mnist_train = DataLoader(dataset, batch_size=8)
        return mnist_train


def test_train_mnist():
    model = MnistModel()
    trainer = Trainer(TrainerArgs(), MnistModelConfig(), model=model, output_path=os.getcwd())
    trainer.fit()


if __name__ == "__main__":
    test_train_mnist()

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-SXM4-80GB",
            "NVIDIA A100-SXM4-80GB"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0",
        "Trainer": "v0.0.4",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.5",
        "version": "#106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022"
    }
}

Additional context

No response

test_train_mnist is failing

On 776eba8 this test tries to call get_data_loader with the unexpected kwarg samples.

_______________________________ test_train_mnist _______________________________

self = Trainer()

    def fit(self) -> None:
        """Where the ✨️magic✨️ happens..."""
        try:
>           self._fit()

trainer/trainer.py:1403:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer()

    def _fit(self) -> None:
        """🏃 train -> evaluate -> test for the number of epochs."""
        self._restore_best_loss()

        self.total_steps_done = self.restore_step

        for epoch in range(0, self.config.epochs):
            if self.num_gpus > 1:
                # let all processes sync up before starting with a new epoch of training
                dist.barrier()
            self.callbacks.on_epoch_start(self)
            self.keep_avg_train = KeepAverage()
            self.keep_avg_eval = KeepAverage() if self.config.run_eval else None
            self.epochs_done = epoch
            self.c_logger.print_epoch_start(epoch, self.config.epochs, self.output_path)
            if not self.skip_train_epoch:
>               self.train_epoch()

trainer/trainer.py:1387:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer()

    def train_epoch(self) -> None:
        """Main entry point for the training loop. Run training on the all training samples."""
        # initialize the data loader
>       self.train_loader = self.get_train_dataloader(
            self.training_assets,
            self.train_samples,
            verbose=True,
        )

trainer/trainer.py:1148:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer(), training_assets = {}, samples = None, verbose = True

    def get_train_dataloader(self, training_assets: Dict, samples: List, verbose: bool) -> DataLoader:
        """Initialize and return a training data loader.
        Call ```model.get_train_data_loader``` if it is implemented, else call ```model.get_data_loader```
        and set ```is_eval=False```.

        Args:
            ap (AudioProcessor): Audio processor.
            samples (List): Data samples used for training.
            verbose (bool): enable/disable printing loader stats at initialization.

        Returns:
            DataLoader: Initialized training data loader.
        """
        if self.num_gpus > 1:
            if hasattr(self.model.module, "get_train_data_loader"):
                loader = self.model.module.get_train_data_loader(
                    self.config,
                    self.training_assets,
                    samples,
                    verbose,
                    self.num_gpus,
                    self.args.rank,
                )
                return loader
        else:
            if hasattr(self.model, "get_train_data_loader"):
                loader = self.model.get_train_data_loader(
                    self.config, self.training_assets, samples, verbose, self.num_gpus
                )
                return loader

>       return self._get_loader(
            self.model,
            self.config,
            training_assets,
            False,
            samples,
            verbose,
            self.num_gpus,
        )

trainer/trainer.py:679:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer()
model = MnistModel(
  (layer_1): Linear(in_features=784, out_features=128, bias=True)
  (layer_2): Linear(in_features=128, out_features=256, bias=True)
  (layer_3): Linear(in_features=256, out_features=10, bias=True)
)
config = MnistModelConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui tra...rue, lr=0.001, optimizer='Adam', optimizer_params={}, lr_scheduler=None, lr_scheduler_params={}, use_grad_scaler=False)
assets = {}, is_eval = False, samples = None, verbose = True, num_gpus = 0

    def _get_loader(
        self,
        model: nn.Module,
        config: Coqpit,
        assets: Dict,
        is_eval: str,
        samples: List,
        verbose: bool,
        num_gpus: int,
    ) -> DataLoader:
        if num_gpus > 1:
            if hasattr(model.module, "get_data_loader"):
                loader = model.module.get_data_loader(
                    config,
                    assets,
                    is_eval,
                    samples,
                    verbose,
                    num_gpus,
                    self.args.rank,
                )
        else:
            if hasattr(model, "get_data_loader"):
>               loader = model.get_data_loader(
                    config=config, assets=assets, is_eval=is_eval, samples=samples, verbose=verbose, num_gpus=num_gpus
                )
E               TypeError: get_data_loader() got an unexpected keyword argument 'samples'

Package is marked as platform-independent, but it does not import on Windows

Describe the bug

Trying to import this package on Windows gives the following error:

  File "<stdin>", line 1, in <module>
  File "...\lib\site-packages\trainer\__init__.py", line 4, in <module>
    from trainer.trainer import *
  File "...\lib\site-packages\trainer\trainer.py", line 47, in <module>
    multiprocessing.set_start_method("fork")
  File "...\lib\multiprocessing\context.py", line 246, in set_start_method
    self._actual_context = self.get_context(method)
  File "...\lib\multiprocessing\context.py", line 238, in get_context
    return super().get_context(method)
  File "...\lib\multiprocessing\context.py", line 192, in get_context
    raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'

Since there is no fork method on Windows, please either fix this issue or do not mark this package as platform-independent.
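A possible guard for that call (a sketch of the idea, not the shipped fix): only request "fork" where it exists and tolerate an already-configured start method.

import multiprocessing
import platform

# "fork" only exists on POSIX; on Windows the default "spawn" is kept, so the
# package can at least be imported there.
if platform.system() != "Windows":
    try:
        multiprocessing.set_start_method("fork")
    except RuntimeError:
        # the start method may already have been set by the host application
        pass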

To Reproduce

Install this package on a Windows system.
Start Python.
Try to import trainer.

Expected behavior

No response

Logs

No response

Environment

- trainer v0.0.5 (installed via conda-forge)
- OS: Windows 10

Additional context

No response

[Bug] trainer.distribute not working

Describe the bug

For single-GPU training, I am using train_yourtts.py. When I switch to multi-GPU, the program runs but shows no speedup. I checked the code in distribute.py and found that it only sets up the environment and starts parallel processes; it doesn't do any gradient collection or synchronization. I am wondering if this is by design or whether I misused trainer.distribute.

To Reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m trainer.distribute --script recipes/vctk/yourtts/train_yourtts.py

Expected behavior

I expected roughly a 2x speedup, but progress is the same as single-GPU training.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "Trainer": "v0.0.34",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.18",
        "version": "#99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023"
    }
}

Additional context

No response

[Bug] Hardcoded argument parsing on distribute.py preventing other options to be passed when using TrainerArgs

Describe the bug

Because the optional arguments are hardcoded in https://github.com/coqui-ai/Trainer/blob/main/trainer/distribute.py#L37-L40, other arguments such as --start_with_eval or --use_accelerate are not forwarded and keep their defaults when running, for example, https://github.com/coqui-ai/TTS/blob/dev/TTS/bin/train_tts.py.

For example, with CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --start_with_eval true --use_accelerate true --script train_tts.py --config_path <config_path>, start_with_eval and use_accelerate will both stay false, because only continue_path, restore_path, group_id, and use_ddp (which is fixed to true) are passed through.
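One possible direction, sketched below: let distribute.py parse only its own options and forward everything it does not recognise to the training script verbatim (argparse.parse_known_args). This is an illustration of the idea, not the current distribute.py.

import argparse
import os
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--script", required=True)
parser.add_argument("--gpus", default="0")
# everything not listed above (--start_with_eval, --use_accelerate,
# --config_path, --coqpit.* ...) stays in `passthrough` untouched
args, passthrough = parser.parse_known_args()

procs = []
for rank, gpu in enumerate(args.gpus.split(",")):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu, RANK=str(rank))
    procs.append(subprocess.Popen([sys.executable, args.script] + passthrough, env=env))
for p in procs:
    p.wait()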

To Reproduce

See the command example written in the description.

Expected behavior

No response

Logs

No response

Environment

No response

Additional context

No response

[Bug] Misleading python version constraint message

Describe the bug

It seems you wanted to say < 3.10 but wrote <= 3.10; at least, the latter would allow 3.10 itself but not 3.10.2.

Trainer/setup.py

Lines 33 to 39 in 45a5604

if LooseVersion(sys.version) < LooseVersion("3.6") or LooseVersion(
    sys.version
) > LooseVersion("3.10"):
    raise RuntimeError(
        "Coqui-Trainer requires python >= 3.6 and <=3.10 "
        "but your Python version is {}".format(sys.version)
    )
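An unambiguous variant of the check could compare version tuples instead of LooseVersion strings (a sketch; whether 3.10.x should be accepted is exactly the clarification being requested):

import sys

# tuple comparison avoids the "3.10.2" > "3.10" string ambiguity;
# here 3.10.x is accepted, tighten the upper bound if that is not intended
if not ((3, 6) <= sys.version_info[:2] <= (3, 10)):
    raise RuntimeError(
        "Coqui-Trainer requires Python >= 3.6 and <= 3.10, "
        "but your Python version is {}".format(sys.version)
    )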

To Reproduce

Building coqui-trainer with python 3.10.2

Expected behavior

Clarification whether < or <= is the intended constraint.

Logs

>   RuntimeError: Coqui-Trainer requires python >= 3.6 and <=3.10 but your Python version is 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0]

Environment

- Version 0.0.5
- Python 3.10.2

Additional context

No response

[Bug] Cannot restore YourTTS model

Describe the bug

Hi,

Following the YourTTS VCTK recipe, I try to restore a model to continue training.

But I get the following error:

 > Restoring from best_model_22016.pth ...
 > Restoring Model...
 > Restoring Optimizer...
Traceback (most recent call last):
  File "/home/caraduf/Models/train_yourtts_16kHz.py", line 318, in <module>
    trainer = Trainer(
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 507, in __init__
    (self.model, self.optimizer, self.scaler, self.restore_step, self.restore_epoch) = self.restore_model(
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 711, in restore_model
    optimizer = _restore_list_objs(checkpoint["optimizer"], optimizer)
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 701, in _restore_list_objs
    obj.load_state_dict(states)
AttributeError: 'list' object has no attribute 'load_state_dict'

The obj is a list, yet the else branch is also executed:


def _restore_list_objs(states, obj):
    if isinstance(obj, list):
        for idx, state in enumerate(states):
            obj[idx].load_state_dict(state)
    if isinstance(obj, dict):
        for key, state in states.items():
            obj[key].load_state_dict(state)
    else:
        obj.load_state_dict(states)
    return obj

A workaround is to replace the second if with an elif: if obj is a list it cannot also be a dict, so the bare else wrongly fires for list objects too. In my opinion an elif makes sense here, but I may be wrong.
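With that workaround applied, the helper would read roughly as follows (a sketch of the suggested change, not the upstream fix):

def _restore_list_objs(states, obj):
    # exactly one branch fires per container type, so the final else no
    # longer runs for lists as well
    if isinstance(obj, list):
        for idx, state in enumerate(states):
            obj[idx].load_state_dict(state)
    elif isinstance(obj, dict):
        for key, state in states.items():
            obj[key].load_state_dict(state)
    else:
        obj.load_state_dict(states)
    return obj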

To Reproduce

Set a restore path to a checkpoint in the recipe and run the recipe.

python3 train_yourtts.py

Expected behavior

The model is restored and training continues.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "Trainer": "v0.0.22",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

[Bug] grad_accum_steps has no effect on training

Describe the bug

Setting the --grad_accum_steps option should reduce the number of optimizer steps needed to complete an epoch by simulating a larger batch size; however, setting it to any value has no effect at all on training speed or the total step counter.

To Reproduce

  1. Start training with batch_size=8 using the recipe train_vits_tts_phonemes.py --grad_accum_steps=16, or pass the value directly to Trainer()
  2. Note the number of steps per epoch
  3. Change --grad_accum_steps to another value
  4. The number of steps required doesn't change, so the effective batch size hasn't changed

Expected behavior

Assuming batch_size=8 and grad_accum_steps=8, an epoch that currently takes 6400 optimizer steps should drop to 800 steps, with an effective batch size of 64 (8*8) during training.
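For reference, the expected effect corresponds to the usual accumulation pattern sketched below (a generic PyTorch loop, not the Trainer's internals): the loss is scaled by the accumulation factor and optimizer.step() runs only every grad_accum_steps batches, so optimizer updates per epoch shrink by that factor.

import torch

def train_one_epoch(model, loader, optimizer, criterion, grad_accum_steps=8):
    """Generic gradient-accumulation loop (sketch)."""
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        # scale the loss so accumulated gradients match a real large batch
        loss = criterion(model(x), y) / grad_accum_steps
        loss.backward()
        if (step + 1) % grad_accum_steps == 0:
            optimizer.step()       # one update per grad_accum_steps batches
            optimizer.zero_grad()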

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 2060"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.29",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.10.12",
        "version": "#1 ZEN SMP PREEMPT_DYNAMIC Wed, 02 Aug 2023 10:40:11 +0000"
    }
}

Additional context

No response

[Bug] AttributeError: module 'torch' has no attribute 'autocast'

Describe the bug

Training crashes with AttributeError: module 'torch' has no attribute 'autocast' (full traceback below under Logs).

To Reproduce

https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/delightful_tts/train_delightful_tts.py

Expected behavior

No response

Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1591, in fit 
    self._fit()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1544, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1309, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1172, in train_step
    num_optimizers=len(self.optimizer),
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1021, in _optimize
    with torch.autocast(device_type=device, dtype=dtype, enabled=config.mixed_precision):
AttributeError: module 'torch' has no attribute 'autocast'

Environment

torch version: 1.8.1+cu102. I have checked several versions of the torch docs; torch.autocast does not seem to exist in this torch version.
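A possible compatibility shim, assuming one wants to keep supporting torch versions before 1.10 (where the device-agnostic torch.autocast does not exist yet, while torch.cuda.amp.autocast has been available since 1.6); this is a sketch, not the Trainer's code:

import contextlib
import torch

def autocast_ctx(device: str, enabled: bool):
    """Return an autocast context that also works on torch < 1.10 (sketch)."""
    if hasattr(torch, "autocast"):
        return torch.autocast(device_type=device, enabled=enabled)
    if device == "cuda":
        return torch.cuda.amp.autocast(enabled=enabled)
    # CPU autocast is not available on old versions; fall back to a no-op
    return contextlib.nullcontext()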

Additional context

No response

[Bug] Unable to run distributed training using TTS recipe for yourtts

Describe the bug

I've been trying to train YourTTS on a Google Compute instance, but it doesn't seem to work with trainer.distribute.
Previously I could run it, but it would get to the same point in initialization and then one of the training workers would crash, with the others freezing.
I am running largely unchanged code from the provided recipe; I have simply reduced the worker count to fit the cloud instance and added my own dataset.
It previously trained fine without distributed training until it ran out of VRAM, and training locally on a 3090 works fine, if slowly.

Also, TTS is installed at the latest version; I'm not sure why collect_env_info.py didn't pick that up.

To Reproduce

  1. Run CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py on google compute instance
  2. Wait several seconds
  3. Error.

Expected behavior

Runs the training script with processing split between the GPUs.

Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1666, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1618, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1350, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/TTS/tts/models/vits.py", line 263, in __getitem__
    item = self.samples[idx]
TypeError: list indices must be integers or slices, not list

Environment

{       
    "CUDA": {
        "GPU": [
            "Tesla T4",
            "Tesla T4",
            "Tesla T4",
            "Tesla T4"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.27",
        "numpy": "1.23.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "",
        "python": "3.10.10",
        "version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
    }
}

Additional context

No response

[Feature request] Save a checkpoint when interrupting the training (ctrl - c)

🚀 Feature Description
Hi,
Sometimes while training, I need my GPU (e.g. to do some work with Whisper, or because I need to switch off the computer). So I have to interrupt the training, and sometimes this happens right between two checkpoints (e.g. checkpoints are saved every 10k iterations and I am 7k past the previous one).
In that case I lose all the training progress achieved since the previous checkpoint.

It would therefore be more comfortable if a checkpoint were saved when I interrupt the training, so that I can later resume right from it.

Solution

When the training process is interrupted (Ctrl-C), make Coqui save a checkpoint at the current step (as it does when save_step is reached).
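A minimal sketch of the idea inside Trainer.fit(); save_checkpoint() here stands for whatever routine save_step currently triggers and is an assumed name, not the existing API:

def fit(self):
    try:
        self._fit()
    except KeyboardInterrupt:
        # write an emergency checkpoint at the current step before the usual
        # cleanup, mirroring what happens when save_step is reached
        self.save_checkpoint()
        raise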

Alternative Solutions

I could lower save_step, but then checkpoints would be too close to each other.

Additional context

[Bug] Tests taking forever

Describe the bug

The num_gpus test hangs.

To Reproduce

Check the CI action logs.

Expected behavior

No response

Logs

No response

Environment

CI Action instances

Additional context

No response

[Bug] data = [self.dataset[idx] for idx in possibly_batched_index] TypeError: 'int' object is not iterable

Describe the bug

possibly_batched_index is not a list.

To Reproduce

python -m trainer.distribute --gpus "0,1" --script train_multi.py --restore_path /root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth

Expected behavior

No response

Logs

> TRAINING (2022-08-26 01:41:21)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1517, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1281, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
TypeError: 'int' object is not iterable

 ! Run is kept in logs/multi_be/vits_ljs_speaker_embedded-August-26-2022_01+41AM-0000000
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1517, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1281, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
TypeError: 'int' object is not iterable

Environment

docker pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

Additional context

No response

Missing dependencies

The dependencies in requirements.txt do not include:

  • soundfile
  • tensorboardX

There is no requirements file for the tests, but they require:

  • torchvision

[Bug] ValueError: not allowed to raise maximum limit (rlimit)

Describe the bug

Error while training (full traceback in the Logs section below):

  • I tried with sudo, same error
  • I am using the docker image nvidia/cuda:11.7.0-base-ubuntu22.04
  • The default value in the docker container for resource.getrlimit(resource.RLIMIT_NOFILE) is (1048576, 1048576)

Due to these lines:

Trainer/trainer/trainer.py

Lines 653 to 660 in 9879d3d

if platform.system() != "Windows":
    # https://github.com/pytorch/pytorch/issues/973
    import resource  # pylint: disable=import-outside-toplevel

    rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
# set and initialize Pytorch runtime
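A defensive variant of that block (a sketch): keep the attempt, but fall back gracefully when the container refuses the change instead of aborting training.

import platform

if platform.system() != "Windows":
    # https://github.com/pytorch/pytorch/issues/973
    import resource  # pylint: disable=import-outside-toplevel

    rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
    except ValueError:
        # some containers/kernels refuse the change; a low NOFILE soft limit
        # only risks "too many open files" later, so continue instead of crashing
        pass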

To Reproduce

  1. Install coqui-tts in an nvidia/cuda:11.7.0-base-ubuntu22.04 docker container
  2. Try to train a VITS model
  3. This error is thrown (even with sudo)

Expected behavior

No errors

Logs

| > stats_path:None
2023-06-14T07:29:43.025431079Z  | > base:10
2023-06-14T07:29:43.025437149Z  | > hop_length:256
2023-06-14T07:29:43.025444429Z  | > win_length:1024
2023-06-14T07:29:43.025450699Z  > initialization of speaker-embedding layers.
2023-06-14T07:29:43.025462919Z Traceback (most recent call last):
2023-06-14T07:29:43.025469199Z   File "/workspace/coqui-tts/train.py", line 320, in <module>
2023-06-14T07:29:43.025476859Z     trainer = Trainer(
2023-06-14T07:29:43.025484659Z   File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 405, in __init__
2023-06-14T07:29:43.025494939Z     self.use_cuda, self.num_gpus = self.setup_training_environment(args=args, config=config, gpu=gpu)
2023-06-14T07:29:43.025500099Z   File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 632, in setup_training_environment
2023-06-14T07:29:43.025543959Z     resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
2023-06-14T07:29:43.025560229Z ValueError: not allowed to raise maximum limit

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-FHHL-16GB"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.20",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020"
    }
}

Additional context

No response

[Bug] Unable to use WandbLogger

Describe the bug

I tried using the WandB logger for training in the TTS repo, but it didn't work.

To Reproduce

  • Clone TTS repo
  • Modify the LJSpeech recipe's dataset path
  • Run:
CUDA_VISIBLE_DEVICES=0 python recipes/ljspeech/vits_tts/train_vits.py \
    --coqpit.dashboard_logger wandb \
    --coqpit.project_name FakeName \
    --coqpit.wandb_entity FakeEntity \

It crashes with this error:

Traceback (most recent call last):
  File "runs/train_vits.py", line 85, in <module>
    eval_samples=eval_samples,
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 359, in __init__
    self.dashboard_logger = logger_factory(config, output_path)
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/logging/__init__.py", line 36, in logger_factory
    entity=config.wandb_entity,
TypeError: Can't instantiate abstract class WandbLogger with abstract methods add_audio, add_figure, add_scalar

Expected behavior

It should work just like the default Tensorboard logger

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.2",
        "TTS": "0.6.1",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.7.11",
        "version": "#202202230823 SMP PREEMPT Wed Feb 23 14:53:24 UTC 2022"
    }
}

Additional context

No response

Remove AMP

Describe the bug

Hi everybody,

Trainer is great, but it still uses APEX, which is deprecated and tends to cause problems.
Could you remove it and/or replace it with native AMP?

Best regards

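For reference, the APEX-free path on recent PyTorch looks roughly like the generic torch.cuda.amp sketch below (not the Trainer's actual code):

import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)

def train_step(model, batch, criterion, optimizer, use_amp=True):
    """Generic mixed-precision step using torch.cuda.amp instead of APEX (sketch)."""
    x, y = batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()
    return loss.detach()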

To Reproduce

Just run Trainer on an Nvidia GPU

Expected behavior

No response

Logs

No response

Environment

not relevant

Additional context

No response
