
trainer's People

Contributors

a-froghyar, bitnom, edresson, eginhard, erogol, iprovalo, loganhart02, manmay-nakhashi, ppisljar, shenberg, squire-tomsk, weberjulian, zutatensuppe


trainer's Issues

[Feature request] Save only last N checkpoints on ClearML

🚀 Feature Description
Save last N checkpoints on ClearML logger.

Solution
Implement saving only the last N checkpoints on a dashboard logger of choice.

Additional context

Keeping all the checkpoints is unnecessary and induces extra costs on cloud storage.
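As a rough illustration of the requested behaviour, the local-disk part of "keep only the last N" could look like the sketch below. The checkpoint_<step>.pth naming follows what the trainer writes to the run folder; the helper function itself and the idea of calling it after each save are assumptions, and pruning the remote copies (ClearML/S3) would additionally need the dashboard logger's own delete API.

import os
import re

def keep_last_n_checkpoints(run_dir: str, n: int) -> None:
    """Delete all but the newest n checkpoint_<step>.pth files in run_dir (sketch)."""
    pattern = re.compile(r"checkpoint_(\d+)\.pth$")
    found = []
    for fname in os.listdir(run_dir):
        match = pattern.match(fname)
        if match:
            found.append((int(match.group(1)), fname))
    found.sort()  # oldest step first
    for _, fname in found[:-n]:
        os.remove(os.path.join(run_dir, fname))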

[Bug] Errors when running VITS in distributed DDP mode

Describe the bug

There are two problems. The first occurs when starting VCTK VITS training with DDP, which results in a division by zero error. The second occurred with a custom VITS recipe:

 > EVALUATION

 | > Synthesizing test sentences.
 ! Run is removed from /home/wdb/src/gitlab.com/whydobirds/tts-coqui/recipes/frank/vits/vits_frank-June-01-2022_02+35PM-75e2a0fd
Traceback (most recent call last):
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1492, in fit
    self._fit()
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1480, in _fit
    self.test_run()
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/trainer/trainer.py", line 1416, in test_run
    self.model.test_log(test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done)
  File "/home/wdb/miniconda3/envs/tts-coqui/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'test_log'

To Reproduce

Start training VITS in distributed mode:

 python -m trainer.distribute --script recipes/vctk/vits/train_vits.py --gpus "0,1"

This immediately results in a division by zero error.

Starting with a custom script:

 python -m trainer.distribute --script recipes/custom/vits/train_vits.py --gpus "0,1"

This results in the stack trace shown above. Presumably the VCTK run will hit the same problem once the division by zero is resolved.

Expected behavior

The training run should complete successfully.

Logs

No response

Environment

The wget above 404s.

Additional context

I think the culprit might just be this line:

if hasattr(self.model, "test_log") or (self.num_gpus > 1 and hasattr(self.model.module, "test_log")):

Everywhere else the pattern is "if num_gpus > 1: use model.module", but here the predicate covers both cases (DDP and not) while the body always calls self.model. I guess it should be two separate conditions.
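A minimal sketch of that two-branch version, following the num_gpus > 1 -> model.module pattern used elsewhere in trainer.py (the call signature is copied from the traceback above; this is a suggestion, not the merged fix):

# inside Trainer.test_run()
if self.num_gpus > 1 and hasattr(self.model.module, "test_log"):
    # unwrap DistributedDataParallel before calling the model hook
    self.model.module.test_log(
        test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done
    )
elif hasattr(self.model, "test_log"):
    self.model.test_log(
        test_outputs, self.dashboard_logger, self.training_assets, self.total_steps_done
    )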

[Feature request] Reset best loss when restoring from checkpoint in order to keep best model of fine-tuning

🚀 Feature Description

I'm fine-tuning a Coqui TTS model on a different dataset and language. Since it never reaches the best loss of the pre-trained model, it never saves a best model for the fine-tuning stage. I can only see the final checkpoints; the best fine-tuning model gets lost because I keep only the last X checkpoints.

Solution

If I could reset the best loss stored in the checkpoint when restoring, the trainer would start tracking and saving the best model from the beginning of fine-tuning.
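For illustration, a user-side workaround could look like the sketch below, assuming the tracked value lives in trainer.best_loss (the attribute name appears in the save_best_model code path); resetting it this way after construction is an assumption, not an existing Trainer option:

trainer = Trainer(
    TrainerArgs(restore_path=RESTORE_PATH),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
# Forget the pre-trained model's best loss so the fine-tuning run starts
# tracking and saving its own best_model_*.pth from the first evaluation.
trainer.best_loss = float("inf")
trainer.fit()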

[Bug] Trainer saves the given config to ClearML rather than the continued config

Describe the bug

Trainer saves the config that is given at the beginning of the training rather than the one restored after continuing a training.

To Reproduce

Continue one of your trainings from a Python script and pass a modified config.json.

The trainer will train with the restored config but save/log the modified one.

Expected behavior

The config that is actually used for training should be the one that gets saved.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            ""
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0",
        "Trainer": "v0.0.12",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.5",
        "version": "#119-Ubuntu SMP Mon Mar 7 18:49:24 UTC 2022"
    }
}

Additional context

No response

[Bug] Outputs referenced before assignment error while using YourTTS recipe

Describe the bug

Hi,

When running the YourTTS recipe with my own LJSpeech-format dataset, I get the following error during the first evaluation:

> DataLoader initialization
| > Tokenizer:
        | > add_blank: True
        | > use_eos_bos: False
        | > use_phonemes: False
| > Number of instances : 20
 | > Preprocessing samples
 | > Max text length: 119
 | > Min text length: 21
 | > Avg text length: 45.2
 |
 | > Max audio length: 119952.0
 | > Min audio length: 18103.0
 | > Avg audio length: 48773.45
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
 > Using weighted sampler for attribute 'speaker_name' with alpha '1.0'
None
 > Attribute weights for '['ljspeech']'
 | > [0.22360679774997894]

 > EVALUATION 

 ! Run is kept in /home/caraduf/Models/YourTTS_ME_22k-February-07-2023_06+31AM-0000000
Traceback (most recent call last):
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1659, in fit
    self._fit()
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1614, in _fit
    self.eval_epoch()
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 1501, in eval_epoch
    outputs,
UnboundLocalError: local variable 'outputs' referenced before assignment

I updated the trainer to the latest version from GitHub, following the instructions, but the issue still occurs.

Also note that training a plain VITS model on the same dataset (with the same max audio length of 10 seconds, i.e. 10 x 22050 samples) works, so the failure happens only with the YourTTS recipe. I will try with debug mode on and see if it reveals anything interesting.

Here is the adapted recipe:

import os

import torch
from trainer import Trainer, TrainerArgs

from TTS.bin.compute_embeddings import compute_embeddings
from TTS.bin.resample import resample_files
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import CharactersConfig, Vits, VitsArgs, VitsAudioConfig
from TTS.utils.downloaders import download_vctk

torch.set_num_threads(24)

# pylint: disable=W0105
"""
    This recipe replicates the first experiment proposed in the YourTTS paper (https://arxiv.org/abs/2112.02418).
    The YourTTS model is based on the VITS model, but it uses external speaker embeddings extracted from a pre-trained speaker encoder and has small architecture changes.
    In addition, YourTTS can be trained on multilingual data; however, this recipe replicates the single-language training using the VCTK dataset.
    If you are interested in multilingual training, we have commented out parameters in the VitsArgs class instance that should be enabled for multilingual training.
    In addition, you will need to add the extra datasets, following VCTK as an example.
"""
# Path where you want to save the models outputs (configs, checkpoints and tensorboard logs)
OUT_PATH = os.path.dirname(os.path.abspath(__file__))  # "/raid/coqui/Checkpoints/original-YourTTS/"

# Name of the run for the Trainer
RUN_NAME = "YourTTS_ME_22k"

Me_Rec_1_config = BaseDatasetConfig(
    formatter="ljspeech", dataset_name="ME_Rec1", meta_file_train="metadata.csv", path="/home/caraduf/Datasets/ME_22kHz/Rec_1_LARGE_V2_22.05kHz_dataset", language="fr-fr"
)

Me_Rec_2_config = BaseDatasetConfig(
    formatter="ljspeech", dataset_name="ME_Rec2", meta_file_train="metadata.csv", path="/home/caraduf/Datasets/ME_22kHz/Rec_2_LARGE_V2_22.05kHz_dataset", language="fr-fr"
)



# Add all dataset configs here. Note: if you add new datasets, just add them here and the speaker embeddings (d-vectors) will be computed automatically for them :)
DATASETS_CONFIG_LIST = [
    Me_Rec_1_config,
    Me_Rec_2_config
]


# If you want to do transfer learning and speedup your training you can set here the path to the original YourTTS model
RESTORE_PATH = None  # "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"

# This parameter is useful for debugging: it skips the training epochs and only runs the evaluation and produces the test sentences
SKIP_TRAIN_EPOCH = False

# Set here the batch size to be used in training and evaluation
BATCH_SIZE = 32

# Training sampling rate and the target sampling rate for resampling the downloaded dataset (Note: if you change this you might need to re-download the dataset!!)
# Note: if you add new datasets, please make sure that the dataset sampling rate and this parameter match, otherwise resample your audios
SAMPLE_RATE = 22050

# Max audio length in seconds to be used in training (every audio bigger than it will be ignored)
MAX_AUDIO_LEN_IN_SECONDS = 10

# # Define the number of threads used during the audio resampling
# NUM_RESAMPLE_THREADS = 10
# # Check if VCTK dataset is not already downloaded, if not download it
# if not os.path.exists(VCTK_DOWNLOAD_PATH):
#     print(">>> Downloading VCTK dataset:")
#     download_vctk(VCTK_DOWNLOAD_PATH)
#     resample_files(VCTK_DOWNLOAD_PATH, SAMPLE_RATE, file_ext="flac", n_jobs=NUM_RESAMPLE_THREADS)

# # init configs
# vctk_config = BaseDatasetConfig(
#     formatter="vctk",
#     dataset_name="vctk",
#     meta_file_train="",
#     meta_file_val="",
#     path=VCTK_DOWNLOAD_PATH,
#     language="en",
#     ignored_speakers=[
#         "p261",
#         "p225",
#         "p294",
#         "p347",
#         "p238",
#         "p234",
#         "p248",
#         "p335",
#         "p245",
#         "p326",
#         "p302",
#     ],  # Ignore the test speakers to full replicate the paper experiment
# )


### Extract speaker embeddings
SPEAKER_ENCODER_CHECKPOINT_PATH = (
    "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar"
)
SPEAKER_ENCODER_CONFIG_PATH = "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json"

D_VECTOR_FILES = []  # List of speaker embeddings/d-vectors to be used during the training

# Iterate over all the dataset configs, checking if the speaker embeddings are already computed; if not, compute them
for dataset_conf in DATASETS_CONFIG_LIST:
    # Check if the embeddings weren't already computed, if not compute it
    embeddings_file = os.path.join(dataset_conf.path, "speakers.pth")
    if not os.path.isfile(embeddings_file):
        print(f">>> Computing the speaker embeddings for the {dataset_conf.dataset_name} dataset")
        compute_embeddings(
            SPEAKER_ENCODER_CHECKPOINT_PATH,
            SPEAKER_ENCODER_CONFIG_PATH,
            embeddings_file,
            old_spakers_file=None,
            config_dataset_path=None,
            formatter_name=dataset_conf.formatter,
            dataset_name=dataset_conf.dataset_name,
            dataset_path=dataset_conf.path,
            meta_file_train=dataset_conf.meta_file_train,
            meta_file_val=dataset_conf.meta_file_val,
            disable_cuda=False,
            no_eval=False,
        )
    D_VECTOR_FILES.append(embeddings_file)


# Audio config used in training.
audio_config = VitsAudioConfig(
    sample_rate=SAMPLE_RATE,
    hop_length=256,
    win_length=1024,
    fft_size=1024,
    mel_fmin=0.0,
    mel_fmax=None,
    num_mels=80,
)

# Init VitsArgs, setting the arguments needed for the YourTTS model
model_args = VitsArgs(
    d_vector_file=D_VECTOR_FILES,
    use_d_vector_file=True,
    d_vector_dim=512,
    num_layers_text_encoder=10,
    speaker_encoder_model_path=SPEAKER_ENCODER_CHECKPOINT_PATH,
    speaker_encoder_config_path=SPEAKER_ENCODER_CONFIG_PATH,
    resblock_type_decoder="1",  # In the paper, YourTTS was accidentally trained with ResNet blocks of type 2; if you prefer, you can use type 1 blocks like the VITS model
    # Useful parameter to enable the Speaker Consistency Loss (SCL) described in the paper
    # use_speaker_encoder_as_loss=True,
    # Useful parameters to enable multilingual training
    # use_language_embedding=True,
    # embedded_language_dim=4,
)

# General training config; here you can change the batch size and other useful parameters
config = VitsConfig(
    output_path=OUT_PATH,
    model_args=model_args,
    run_name=RUN_NAME,
    project_name="YourTTS",
    run_description="""
            - Original YourTTS trained using shorter extracts made by the new method
        """,
    dashboard_logger="tensorboard",
    logger_uri=None,
    audio=audio_config,
    batch_size=BATCH_SIZE,
    batch_group_size=48,
    eval_batch_size=BATCH_SIZE,
    num_loader_workers=4,
    eval_split_max_size=256,
    print_step=50,
    plot_step=100,
    log_model_step=1000,
    save_step=5000,
    save_n_checkpoints=10,
    save_checkpoints=True,
    target_loss="loss_1",
    print_eval=False,
    use_phonemes=False,
    phonemizer="espeak",
    phoneme_language="fr-fr",
    compute_input_seq_cache=True,
    add_blank=True,
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class="TTS.tts.models.vits.VitsCharacters",
        pad="_",
        eos="&",
        bos="*",
        blank=None,
        #characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        
        characters="!',-.:?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz «»ÀÇÉÊàâçèéêëîïôùûœ–’…",
        punctuations="!'(),-.:;? ",
        phonemes="",
        is_unique=True,
        is_sorted=True,
    ),
    phoneme_cache_path=None,
    precompute_num_workers=12,
    start_by_longest=True,
    datasets=DATASETS_CONFIG_LIST,
    cudnn_benchmark=False,
    max_audio_len=SAMPLE_RATE * MAX_AUDIO_LEN_IN_SECONDS,
    mixed_precision=True,
    test_sentences=[
        [
            "Il m'a fallu du temps pour créer cette voix alors ma bouche ne restera pas fermée.",
            # "ME",
            None,
            "fr_FR",
        ],
        [
            "Il m'a fallu beaucoup de temps pour développer une voix, et maintenant que je l'ai, je ne vais pas me taire.",
            # "ME",
            None,
            "fr_FR",
        ],
        [
            "Mais son âge rendait cette dernière qualité plus saillante.",
            # "ME",
            None,
            "fr_FR",
        ],
        # [
        #     "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
        #     "VCTK_p277",
        #     None,
        #     "en",
        # ],
        # [
        #     "Be a voice, not an echo.",
        #     "VCTK_p239",
        #     None,
        #     "en",
        # ],
        # [
        #     "I'm sorry Dave. I'm afraid I can't do that.",
        #     "VCTK_p258",
        #     None,
        #     "en",
        # ],
        # [
        #     "This cake is great. It's so delicious and moist.",
        #     "VCTK_p244",
        #     None,
        #     "en",
        # ],
        # [
        #     "Prior to November 22, 1963.",
        #     "VCTK_p305",
        #     None,
        #     "en",
        # ],
    ],
    # Enable the weighted sampler
    use_weighted_sampler=True,
    # Ensures that all speakers are seen in the training batch equally no matter how many samples each speaker has
    weighted_sampler_attrs={"speaker_name": 1.0},
    weighted_sampler_multipliers={},
    # It defines the Speaker Consistency Loss (SCL) α to 9 like the paper
    speaker_encoder_loss_alpha=9.0,
)

# Load all the dataset samples and split training and evaluation sets
train_samples, eval_samples = load_tts_samples(
    config.datasets,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# Init the model
model = Vits.init_from_config(config)

# Init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(restore_path=RESTORE_PATH, skip_train_epoch=SKIP_TRAIN_EPOCH),
    config,
    output_path=OUT_PATH,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

To Reproduce

Create a dataset in LJSpeech format (22.05 kHz audio) in French.
Adapt the dataset config and sample rate in the provided recipe.
Launch it.

python3 YourTTS_recipe.py

Expected behavior

The training should go on.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "Trainer": "v0.0.22",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

[Bug] TypeError when restarting training from a checkpoint (best loss saved as dict)

Describe the bug

Error when restarting model training from a checkpoint in Coqui TTS.
When a checkpoint is saved for later training, the last training and eval losses are stored as a dict. When training from scratch, the last training loss is tracked as a float. Hence, restarting from a checkpoint fails when the two are compared.
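A minimal sketch of one way io.save_best_model could reconcile the two representations before comparing; the "train_loss"/"eval_loss" key names are assumptions about the dict layout:

def _to_comparable(loss):
    """Reduce a stored best-loss entry (float or dict) to a single float (sketch)."""
    if isinstance(loss, dict):
        # prefer the eval loss when it exists, otherwise fall back to the train loss
        if loss.get("eval_loss") is not None:
            return loss["eval_loss"]
        return loss["train_loss"]
    return loss

# then compare normalized values instead of the raw objects:
# if _to_comparable(current_loss) < _to_comparable(best_loss): ...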

To Reproduce

  1. Train a model in Coqui TTS using trainer
  2. Once a checkpoint for best model is saved, stop the training
  3. Set the checkpoint folder as continue path in the trainer class
  4. Restart from the checkpoint

https://colab.research.google.com/drive/1OwemROn306_JIYASjx39d52eXFHS1O_u

Expected behavior

The training should resume from the checkpoint without errors.

Logs

Traceback (most recent call last):
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1808, in fit
    self._fit()
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1771, in _fit
    self.save_best_model()
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/utils/distributed.py", line 35, in wrapped_fn
    return fn(*args, **kwargs)
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/trainer.py", line 1893, in save_best_model
    self.best_loss = save_best_model(
  File "/mnt/Work/anaconda3/envs/tts-env/lib/python3.10/site-packages/trainer/io.py", line 183, in save_best_model
    if current_loss < best_loss:
TypeError: '<' not supported between instances of 'float' and 'dict'

Environment

- torch: 2.1.0
- trainer: 0.0.31
- python: 3.10
- OS: EndeavourOS
- cuda: cuda_12.2.r12.2
- GPU: NVIDIA RTX 3060
- pytorch installation: pip

Additional context

No response

[Feature request] Multi-node training

Hi,
I have two questions:

  1. Can it be used in multi-node training?
  2. When will the trainer support DeepSpeed? I have noticed that integrating DeepSpeed is on the to-do list, but is there a concrete timeline or schedule?

Thank you!

Tag releases on Git

Release versions are currently only available on PyPI, while the exact commit they were built from has to be guessed and can't be cleanly referenced in Git for automatic updates.

[Bug] ClearML checkpoints on S3 are faulty.

Describe the bug

Downloading and trying to load a checkpoint from S3 that was saved by ClearML raises the following error.

/miniforge3/lib/python3.9/site-packages/torch/serialization.py in __init__(self, name_or_buffer)
    240 class _open_zipfile_reader(_opener):
    241     def __init__(self, name_or_buffer) -> None:
--> 242         super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
    243
    244

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

To Reproduce

Try running

import os
from dataclasses import dataclass, field

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

from trainer import Trainer, TrainerArgs, TrainerConfig, TrainerModel


@dataclass
class MnistModelConfig(TrainerConfig):
    optimizer: str = "Adam"
    lr: float = 0.001
    epochs: int = 5
    print_step: int = 1
    plot_step: int = 1
    save_step: int = 1
    dashboard_logger: str = "clearml"
    project_name: str = "pytorch-mnist"
    run_name: str = "test-run"


class MnistModel(TrainerModel):
    def __init__(self):
        super().__init__()

        # mnist images are (1, 28, 28) (channels, height, width)
        self.layer_1 = nn.Linear(28 * 28, 128)
        self.layer_2 = nn.Linear(128, 256)
        self.layer_3 = nn.Linear(256, 10)

    def forward(self, x):
        batch_size, _, _, _ = x.size()

        # (b, 1, 28, 28) -> (b, 1*28*28)
        x = x.view(batch_size, -1)
        x = self.layer_1(x)
        x = F.relu(x)
        x = self.layer_2(x)
        x = F.relu(x)
        x = self.layer_3(x)

        x = F.log_softmax(x, dim=1)
        return x

    def train_step(self, batch, criterion):
        x, y = batch
        logits = self(x)
        loss = criterion(logits, y)
        return {"model_outputs": logits}, {"loss": loss}

    def eval_step(self, batch, criterion):
        x, y = batch
        logits = self(x)
        loss = criterion(logits, y)
        return {"model_outputs": logits}, {"loss": loss}

    def get_criterion(self):
        return torch.nn.NLLLoss()

    def get_data_loader(self, config, assets, is_eval, samples, verbose, num_gpus, rank=0):
        transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
        dataset = MNIST(os.getcwd(), train=not is_eval, download=True, transform=transform)
        mnist_train = DataLoader(dataset, batch_size=8)
        return mnist_train


def test_train_mnist():
    model = MnistModel()
    trainer = Trainer(TrainerArgs(), MnistModelConfig(), model=model, output_path=os.getcwd())
    trainer.fit()


if __name__ == "__main__":
    test_train_mnist()

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-SXM4-80GB",
            "NVIDIA A100-SXM4-80GB"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0",
        "Trainer": "v0.0.4",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.5",
        "version": "#106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022"
    }
}

Additional context

No response

test_train_mnist is failing

On 776eba8 this test tries to call get_data_loader with the unexpected kwarg samples.

_______________________________ test_train_mnist _______________________________

self = Trainer()

    def fit(self) -> None:
        """Where the ✨️magic✨️ happens..."""
        try:
>           self._fit()

trainer/trainer.py:1403:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer()

    def _fit(self) -> None:
        """🏃 train -> evaluate -> test for the number of epochs."""
        self._restore_best_loss()

        self.total_steps_done = self.restore_step

        for epoch in range(0, self.config.epochs):
            if self.num_gpus > 1:
                # let all processes sync up before starting with a new epoch of training
                dist.barrier()
            self.callbacks.on_epoch_start(self)
            self.keep_avg_train = KeepAverage()
            self.keep_avg_eval = KeepAverage() if self.config.run_eval else None
            self.epochs_done = epoch
            self.c_logger.print_epoch_start(epoch, self.config.epochs, self.output_path)
            if not self.skip_train_epoch:
>               self.train_epoch()

trainer/trainer.py:1387:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer()

    def train_epoch(self) -> None:
        """Main entry point for the training loop. Run training on the all training samples."""
        # initialize the data loader
>       self.train_loader = self.get_train_dataloader(
            self.training_assets,
            self.train_samples,
            verbose=True,
        )

trainer/trainer.py:1148:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer(), training_assets = {}, samples = None, verbose = True

    def get_train_dataloader(self, training_assets: Dict, samples: List, verbose: bool) -> DataLoader:
        """Initialize and return a training data loader.
        Call ```model.get_train_data_loader``` if it is implemented, else call ```model.get_data_loader```
        and set ```is_eval=False```.

        Args:
            ap (AudioProcessor): Audio processor.
            samples (List): Data samples used for training.
            verbose (bool): enable/disable printing loader stats at initialization.

        Returns:
            DataLoader: Initialized training data loader.
        """
        if self.num_gpus > 1:
            if hasattr(self.model.module, "get_train_data_loader"):
                loader = self.model.module.get_train_data_loader(
                    self.config,
                    self.training_assets,
                    samples,
                    verbose,
                    self.num_gpus,
                    self.args.rank,
                )
                return loader
        else:
            if hasattr(self.model, "get_train_data_loader"):
                loader = self.model.get_train_data_loader(
                    self.config, self.training_assets, samples, verbose, self.num_gpus
                )
                return loader

>       return self._get_loader(
            self.model,
            self.config,
            training_assets,
            False,
            samples,
            verbose,
            self.num_gpus,
        )

trainer/trainer.py:679:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Trainer()
model = MnistModel(
  (layer_1): Linear(in_features=784, out_features=128, bias=True)
  (layer_2): Linear(in_features=128, out_features=256, bias=True)
  (layer_3): Linear(in_features=256, out_features=10, bias=True)
)
config = MnistModelConfig(output_path='output', logger_uri=None, run_name='run', project_name=None, run_description='🐸Coqui tra...rue, lr=0.001, optimizer='Adam', optimizer_params={}, lr_scheduler=None, lr_scheduler_params={}, use_grad_scaler=False)
assets = {}, is_eval = False, samples = None, verbose = True, num_gpus = 0

    def _get_loader(
        self,
        model: nn.Module,
        config: Coqpit,
        assets: Dict,
        is_eval: str,
        samples: List,
        verbose: bool,
        num_gpus: int,
    ) -> DataLoader:
        if num_gpus > 1:
            if hasattr(model.module, "get_data_loader"):
                loader = model.module.get_data_loader(
                    config,
                    assets,
                    is_eval,
                    samples,
                    verbose,
                    num_gpus,
                    self.args.rank,
                )
        else:
            if hasattr(model, "get_data_loader"):
>               loader = model.get_data_loader(
                    config=config, assets=assets, is_eval=is_eval, samples=samples, verbose=verbose, num_gpus=num_gpus
                )
E               TypeError: get_data_loader() got an unexpected keyword argument 'samples'

Package is marked as platform-independent, but it does not import on Windows

Describe the bug

Trying to import this package on Windows gives the following error:

  File "<stdin>", line 1, in <module>
  File "...\lib\site-packages\trainer\__init__.py", line 4, in <module>
    from trainer.trainer import *
  File "...\lib\site-packages\trainer\trainer.py", line 47, in <module>
    multiprocessing.set_start_method("fork")
  File "...\lib\multiprocessing\context.py", line 246, in set_start_method
    self._actual_context = self.get_context(method)
  File "...\lib\multiprocessing\context.py", line 238, in get_context
    return super().get_context(method)
  File "...\lib\multiprocessing\context.py", line 192, in get_context
    raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'

Since there is no fork method on Windows, please either fix this issue or do not mark this package as platform-independent.
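A possible guard for that call (a sketch of the idea, not the shipped fix): only request "fork" where it exists and tolerate an already-configured start method.

import multiprocessing
import platform

# "fork" only exists on POSIX; on Windows the default "spawn" is kept, so the
# package can at least be imported there.
if platform.system() != "Windows":
    try:
        multiprocessing.set_start_method("fork")
    except RuntimeError:
        # the start method may already have been set by the host application
        pass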

To Reproduce

Install this package on a Windows system.
Start Python.
Try to import trainer.

Expected behavior

No response

Logs

No response

Environment

- trainer v0.0.5 (installed via conda-forge)
- OS: Windows 10

Additional context

No response

[Bug] trainer.distribute not working

Describe the bug

For single-GPU training, I am using train_yourtts.py. When I switch to multi-GPU, the program runs but shows no speedup. I checked the code in distribute.py and found that it only sets up the environment and starts parallel processes; it doesn't do any gradient collection or synchronization. I am wondering if this is by design or whether I misused trainer.distribute.

To Reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m trainer.distribute --script recipes/vctk/yourtts/train_yourtts.py

Expected behavior

I expected roughly a 2x speedup, but progress is the same as single-GPU training.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "Trainer": "v0.0.34",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.18",
        "version": "#99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023"
    }
}

Additional context

No response

[Bug] Hardcoded argument parsing on distribute.py preventing other options to be passed when using TrainerArgs

Describe the bug

Because the optional arguments are hardcoded in https://github.com/coqui-ai/Trainer/blob/main/trainer/distribute.py#L37-L40, other arguments such as --start_with_eval or --use_accelerate are not forwarded and keep their defaults when running, for example, https://github.com/coqui-ai/TTS/blob/dev/TTS/bin/train_tts.py.

For example, with CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --start_with_eval true --use_accelerate true --script train_tts.py --config_path <config_path>, start_with_eval and use_accelerate will both stay false, because only continue_path, restore_path, group_id, and use_ddp (which is fixed to true) are passed through.
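One possible direction, sketched below: let distribute.py parse only its own options and forward everything it does not recognise to the training script verbatim (argparse.parse_known_args). This is an illustration of the idea, not the current distribute.py.

import argparse
import os
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--script", required=True)
parser.add_argument("--gpus", default="0")
# everything not listed above (--start_with_eval, --use_accelerate,
# --config_path, --coqpit.* ...) stays in `passthrough` untouched
args, passthrough = parser.parse_known_args()

procs = []
for rank, gpu in enumerate(args.gpus.split(",")):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu, RANK=str(rank))
    procs.append(subprocess.Popen([sys.executable, args.script] + passthrough, env=env))
for p in procs:
    p.wait()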

To Reproduce

See the command example written in the description.

Expected behavior

No response

Logs

No response

Environment

No response

Additional context

No response

[Bug] Misleading python version constraint message

Describe the bug

It seems you wanted to say < 3.10 but wrote <= 3.10; at least, the latter would allow 3.10 itself but not 3.10.2.

Trainer/setup.py

Lines 33 to 39 in 45a5604

if LooseVersion(sys.version) < LooseVersion("3.6") or LooseVersion(
    sys.version
) > LooseVersion("3.10"):
    raise RuntimeError(
        "Coqui-Trainer requires python >= 3.6 and <=3.10 "
        "but your Python version is {}".format(sys.version)
    )
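An unambiguous variant of the check could compare version tuples instead of LooseVersion strings (a sketch; whether 3.10.x should be accepted is exactly the clarification being requested):

import sys

# tuple comparison avoids the "3.10.2" > "3.10" string ambiguity;
# here 3.10.x is accepted, tighten the upper bound if that is not intended
if not ((3, 6) <= sys.version_info[:2] <= (3, 10)):
    raise RuntimeError(
        "Coqui-Trainer requires Python >= 3.6 and <= 3.10, "
        "but your Python version is {}".format(sys.version)
    )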

To Reproduce

Building coqui-trainer with python 3.10.2

Expected behavior

Clarification whether < or <= is the intended constraint.

Logs

>   RuntimeError: Coqui-Trainer requires python >= 3.6 and <=3.10 but your Python version is 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0]

Environment

- Version 0.0.5
- Python 3.10.2

Additional context

No response

[Bug] Cannot restore YourTTS model

Describe the bug

Hi,

Following the YourTTS VCTK recipe, I try to restore a model to continue training.

But I get the following error:

 > Restoring from best_model_22016.pth ...
 > Restoring Model...
 > Restoring Optimizer...
Traceback (most recent call last):
  File "/home/caraduf/Models/train_yourtts_16kHz.py", line 318, in <module>
    trainer = Trainer(
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 507, in __init__
    (self.model, self.optimizer, self.scaler, self.restore_step, self.restore_epoch) = self.restore_model(
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 711, in restore_model
    optimizer = _restore_list_objs(checkpoint["optimizer"], optimizer)
  File "/home/caraduf/CoquiTTS/Trainer/trainer/trainer.py", line 701, in _restore_list_objs
    obj.load_state_dict(states)
AttributeError: 'list' object has no attribute 'load_state_dict'

The obj is a list, yet the else branch is also executed:


def _restore_list_objs(states, obj):
    if isinstance(obj, list):
        for idx, state in enumerate(states):
            obj[idx].load_state_dict(state)
    if isinstance(obj, dict):
        for key, state in states.items():
            obj[key].load_state_dict(state)
    else:
        obj.load_state_dict(states)
    return obj

A workaround is to replace the second if with an elif: if obj is a list it cannot also be a dict, so the bare else wrongly fires for list objects too. In my opinion an elif makes sense here, but I may be wrong.
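With that workaround applied, the helper would read roughly as follows (a sketch of the suggested change, not the upstream fix):

def _restore_list_objs(states, obj):
    # exactly one branch fires per container type, so the final else no
    # longer runs for lists as well
    if isinstance(obj, list):
        for idx, state in enumerate(states):
            obj[idx].load_state_dict(state)
    elif isinstance(obj, dict):
        for key, state in states.items():
            obj[key].load_state_dict(state)
    else:
        obj.load_state_dict(states)
    return obj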

To Reproduce

Set a restore path to a checkpoint in the recipe and run the recipe.

python3 train_yourtts.py

Expected behavior

The model is restored and training continues.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "Trainer": "v0.0.22",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

[Bug] grad_accum_steps has no effect on training

Describe the bug

Setting the --grad_accum_steps option should reduce the number of optimizer steps needed to complete an epoch by simulating a larger batch size; however, setting it to any value has no effect at all on training speed or the total step counter.

To Reproduce

  1. Start training with batch_size=8 using the recipe train_vits_tts_phonemes.py --grad_accum_steps=16, or pass the value directly to Trainer()
  2. Note the number of steps per epoch
  3. Change --grad_accum_steps to another value
  4. The number of steps required doesn't change, so the effective batch size hasn't changed

Expected behavior

Assuming batch_size=8 and grad_accum_steps=8, an epoch that currently takes 6400 optimizer steps should drop to 800 steps, with an effective batch size of 64 (8*8) during training.
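For reference, the expected effect corresponds to the usual accumulation pattern sketched below (a generic PyTorch loop, not the Trainer's internals): the loss is scaled by the accumulation factor and optimizer.step() runs only every grad_accum_steps batches, so optimizer updates per epoch shrink by that factor.

import torch

def train_one_epoch(model, loader, optimizer, criterion, grad_accum_steps=8):
    """Generic gradient-accumulation loop (sketch)."""
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        # scale the loss so accumulated gradients match a real large batch
        loss = criterion(model(x), y) / grad_accum_steps
        loss.backward()
        if (step + 1) % grad_accum_steps == 0:
            optimizer.step()       # one update per grad_accum_steps batches
            optimizer.zero_grad()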

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 2060"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.29",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.10.12",
        "version": "#1 ZEN SMP PREEMPT_DYNAMIC Wed, 02 Aug 2023 10:40:11 +0000"
    }
}

Additional context

No response

[Bug] AttributeError: module 'torch' has no attribute 'autocast'

Describe the bug

Training crashes with AttributeError: module 'torch' has no attribute 'autocast' (full traceback below under Logs).

To Reproduce

https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/delightful_tts/train_delightful_tts.py

Expected behavior

No response

Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1591, in fit 
    self._fit()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1544, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1309, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1172, in train_step
    num_optimizers=len(self.optimizer),
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1021, in _optimize
    with torch.autocast(device_type=device, dtype=dtype, enabled=config.mixed_precision):
AttributeError: module 'torch' has no attribute 'autocast'

Environment

torch version: 1.8.1+cu102. I have checked several versions of the torch docs; torch.autocast does not seem to exist in this torch version.
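A possible compatibility shim, assuming one wants to keep supporting torch versions before 1.10 (where the device-agnostic torch.autocast does not exist yet, while torch.cuda.amp.autocast has been available since 1.6); this is a sketch, not the Trainer's code:

import contextlib
import torch

def autocast_ctx(device: str, enabled: bool):
    """Return an autocast context that also works on torch < 1.10 (sketch)."""
    if hasattr(torch, "autocast"):
        return torch.autocast(device_type=device, enabled=enabled)
    if device == "cuda":
        return torch.cuda.amp.autocast(enabled=enabled)
    # CPU autocast is not available on old versions; fall back to a no-op
    return contextlib.nullcontext()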

Additional context

No response

[Bug] Unable to run distributed training using TTS recipe for yourtts

Describe the bug

I've been trying to train YourTTS on a Google Compute instance, but it doesn't seem to work with trainer.distribute.
Previously I could run it, but it would get to the same point in initialization and then one of the training workers would crash, with the others freezing.
I am running largely unchanged code from the provided recipe; I have simply reduced the worker count to fit the cloud instance and added my own dataset.
It previously trained fine without distributed training until it ran out of VRAM, and training locally on a 3090 works fine, if slowly.

Also, TTS is installed at the latest version; I'm not sure why collect_env_info.py didn't pick that up.

To Reproduce

  1. Run CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py on google compute instance
  2. Wait several seconds
  3. Error.

Expected behavior

Runs the training script with processing split between the GPUs.

Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1666, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1618, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1350, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/TTS/tts/models/vits.py", line 263, in __getitem__
    item = self.samples[idx]
TypeError: list indices must be integers or slices, not list

Environment

{       
    "CUDA": {
        "GPU": [
            "Tesla T4",
            "Tesla T4",
            "Tesla T4",
            "Tesla T4"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.27",
        "numpy": "1.23.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "",
        "python": "3.10.10",
        "version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
    }
}

Additional context

No response

[Feature request] Save a checkpoint when interrupting the training (ctrl - c)

🚀 Feature Description
Hi,
Sometimes while training, I need my GPU (e.g. to do some work with Whisper, or because I need to switch off the computer). So I have to interrupt the training, and sometimes this happens right between two checkpoints (e.g. checkpoints are saved every 10k iterations and I am 7k past the previous one).
In that case I lose all the training progress achieved since the previous checkpoint.

It would therefore be more comfortable if a checkpoint were saved when I interrupt the training, so that I can later resume right from it.

Solution

When the training process is interrupted (Ctrl-C), make Coqui save a checkpoint at the current step (as it does when save_step is reached).
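A minimal sketch of the idea inside Trainer.fit(); save_checkpoint() here stands for whatever routine save_step currently triggers and is an assumed name, not the existing API:

def fit(self):
    try:
        self._fit()
    except KeyboardInterrupt:
        # write an emergency checkpoint at the current step before the usual
        # cleanup, mirroring what happens when save_step is reached
        self.save_checkpoint()
        raise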

Alternative Solutions

I could lower save_step, but then checkpoints would be too close to each other.

Additional context

[Bug] Tests taking forever

Describe the bug

The num_gpus test hangs.

To Reproduce

Check the CI action logs.

Expected behavior

No response

Logs

No response

Environment

CI Action instances

Additional context

No response

[Bug] data = [self.dataset[idx] for idx in possibly_batched_index] TypeError: 'int' object is not iterable

Describe the bug

possibly_batched_index is not a list.

To Reproduce

python -m trainer.distribute --gpus "0,1" --script train_multi.py --restore_path /root/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth

Expected behavior

No response

Logs

> TRAINING (2022-08-26 01:41:21)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1517, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1281, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
TypeError: 'int' object is not iterable

 ! Run is kept in logs/multi_be/vits_ljs_speaker_embedded-August-26-2022_01+41AM-0000000
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1517, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.7/site-packages/trainer/trainer.py", line 1281, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
TypeError: 'int' object is not iterable

Environment

docker pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

Additional context

No response

Missing dependencies

The dependencies in requirements.txt do not include:

  • soundfile
  • tensorboardX

There is no requirements file for the tests, but they require:

  • torchvision

[Bug] ValueError: not allowed to raise maximum limit (rlimit)

Describe the bug

Error while training (full traceback in the Logs section below):

  • I tried with sudo, same error
  • I am using the docker image nvidia/cuda:11.7.0-base-ubuntu22.04
  • The default value in the docker container for resource.getrlimit(resource.RLIMIT_NOFILE) is (1048576, 1048576)

Due to these lines:

Trainer/trainer/trainer.py

Lines 653 to 660 in 9879d3d

if platform.system() != "Windows":
    # https://github.com/pytorch/pytorch/issues/973
    import resource  # pylint: disable=import-outside-toplevel

    rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
# set and initialize Pytorch runtime
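A defensive variant of that block (a sketch): keep the attempt, but fall back gracefully when the container refuses the change instead of aborting training.

import platform

if platform.system() != "Windows":
    # https://github.com/pytorch/pytorch/issues/973
    import resource  # pylint: disable=import-outside-toplevel

    rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
    except ValueError:
        # some containers/kernels refuse the change; a low NOFILE soft limit
        # only risks "too many open files" later, so continue instead of crashing
        pass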

To Reproduce

  1. Install coqui-tts in an nvidia/cuda:11.7.0-base-ubuntu22.04 docker container
  2. Try to train a VITS model
  3. This error is thrown (even with sudo)

Expected behavior

No errors

Logs

| > stats_path:None
2023-06-14T07:29:43.025431079Z  | > base:10
2023-06-14T07:29:43.025437149Z  | > hop_length:256
2023-06-14T07:29:43.025444429Z  | > win_length:1024
2023-06-14T07:29:43.025450699Z  > initialization of speaker-embedding layers.
2023-06-14T07:29:43.025462919Z Traceback (most recent call last):
2023-06-14T07:29:43.025469199Z   File "/workspace/coqui-tts/train.py", line 320, in <module>
2023-06-14T07:29:43.025476859Z     trainer = Trainer(
2023-06-14T07:29:43.025484659Z   File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 405, in __init__
2023-06-14T07:29:43.025494939Z     self.use_cuda, self.num_gpus = self.setup_training_environment(args=args, config=config, gpu=gpu)
2023-06-14T07:29:43.025500099Z   File "/usr/local/lib/python3.10/dist-packages/trainer/trainer.py", line 632, in setup_training_environment
2023-06-14T07:29:43.025543959Z     resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
2023-06-14T07:29:43.025560229Z ValueError: not allowed to raise maximum limit

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-FHHL-16GB"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.20",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020"
    }
}

Additional context

No response

[Bug] Unable to use WandbLogger

Describe the bug

I tried using the WandB logger for training in the TTS repo, but it didn't work.

To Reproduce

  • Clone TTS repo
  • Modify the LJSpeech recipe's dataset path
  • Run:
CUDA_VISIBLE_DEVICES=0 python recipes/ljspeech/vits_tts/train_vits.py \
    --coqpit.dashboard_logger wandb \
    --coqpit.project_name FakeName \
    --coqpit.wandb_entity FakeEntity \

It crashes with this error:

Traceback (most recent call last):
  File "runs/train_vits.py", line 85, in <module>
    eval_samples=eval_samples,
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 359, in __init__
    self.dashboard_logger = logger_factory(config, output_path)
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/logging/__init__.py", line 36, in logger_factory
    entity=config.wandb_entity,
TypeError: Can't instantiate abstract class WandbLogger with abstract methods add_audio, add_figure, add_scalar

Expected behavior

It should work just like the default Tensorboard logger

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.2",
        "TTS": "0.6.1",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.7.11",
        "version": "#202202230823 SMP PREEMPT Wed Feb 23 14:53:24 UTC 2022"
    }
}

Additional context

No response

Remove AMP

Describe the bug

Hi everybody,

Trainer is great, but it still uses APEX, which is deprecated and tends to cause problems.
Could you remove it and/or replace it with native AMP?

Best regards

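For reference, the APEX-free path on recent PyTorch looks roughly like the generic torch.cuda.amp sketch below (not the Trainer's actual code):

import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)

def train_step(model, batch, criterion, optimizer, use_amp=True):
    """Generic mixed-precision step using torch.cuda.amp instead of APEX (sketch)."""
    x, y = batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()
    return loss.detach()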

To Reproduce

Just run Trainer on an Nvidia GPU

Expected behavior

No response

Logs

No response

Environment

not relevant

Additional context

No response
