Comments (8)

pseeth commented on May 30, 2024

Hey @listener17, sorry about that! I'll look into it this week.

For now, can you try launching via torchrun, even if on a single GPU? The relevant command is

torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

Just curious if that works.

from descript-audio-codec.

listener17 commented on May 30, 2024

@pseeth: thanks.
It's stuck for ages, even though I use a smaller batch size (4) and only one discriminator for each of the 2 discriminator types.

user@v100:~/user/descript-audio-codec$ torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/
Accelerator(
  amp : bool = False
)

listener17 commented on May 30, 2024

@pseeth:
If I use this:

export CUDA_VISIBLE_DEVICES=2
torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

I get this error:

Accelerator(
  amp : bool = False
)
Traceback (most recent call last):
  File "/home/user/descript-audio-codec/scripts/train.py", line 433, in <module>
    with Accelerator() as accel:
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/argbind/argbind.py", line 159, in cmd_func
    return func(*cmd_args, **kwargs)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/audiotools/ml/accelerator.py", line 71, in __init__
    torch.cuda.device(self.local_rank) if torch.cuda.is_available() else None
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in __init__
    self.idx = _get_device_index(device, optional=True)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/_utils.py", line 26, in _get_device_index
    device = torch.device(device)
RuntimeError: Invalid device string: '0'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5276) of binary: /home/user/anaconda3/envs/dac/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/dac/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-04_06:07:46
  host      : v100.com.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5276)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
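
The first traceback above ends in `torch.device(device)` rejecting the string '0', which suggests that `local_rank` reached `torch.cuda.device` as a string rather than an integer. Environment variables are always strings, so a value read from `LOCAL_RANK` (which torchrun sets) has to be converted before it is used as a device index. The sketch below shows that conversion in isolation. The helper name `resolve_local_rank` is hypothetical, and whether audiotools reads the variable this way is an assumption, not something the thread confirms.

```python
import os


def resolve_local_rank(env=None):
    """Return the local rank as an int, defaulting to 0 when LOCAL_RANK is unset.

    Hypothetical helper for illustration; not part of audiotools or this repo.
    """
    env = os.environ if env is None else env
    raw = env.get("LOCAL_RANK")  # environment values are strings, e.g. "0"
    return 0 if raw is None else int(raw)


# Simulate what a torchrun worker would see, without touching the real environment.
rank_from_torchrun = resolve_local_rank({"LOCAL_RANK": "0"})
rank_without_launcher = resolve_local_rank({})
print(type(rank_from_torchrun).__name__, rank_from_torchrun, rank_without_launcher)
```

An integer obtained this way is a valid device index for calls such as `torch.cuda.set_device`. Note also that with `CUDA_VISIBLE_DEVICES=2` the process sees a single GPU, which is indexed as device 0 inside the process.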

listener17 commented on May 30, 2024

@pseeth:
The error seems to be at line 235 of https://github.com/descriptinc/descript-audio-codec/blob/main/scripts/train.py:
out = state.generator(signal.audio_data, signal.sample_rate)

Exception has occurred: TypeError
'int' object is not iterable
  File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
    out = state.generator(signal.audio_data, signal.sample_rate)
  File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
    train_loop(state, batch, accel, lambdas)
  File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
    train(args, accel)
TypeError: 'int' object is not iterable

listener17 commented on May 30, 2024

On a different GPU server, I'm getting a similar but different error message at the same place:

Exception has occurred: TypeError
zip argument #1 must support iteration
  File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
    out = state.generator(signal.audio_data, signal.sample_rate)
  File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
    train_loop(state, batch, accel, lambdas)
  File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
    train(args, accel)
TypeError: zip argument #1 must support iteration

listener17 commented on May 30, 2024

@pseeth and @eeishaan:

FYI:
python -m pytest tests is also not working.

BUT, if I add:

import sys 
sys.stdout.reconfigure(encoding="utf-8")

at the top of https://github.com/descriptinc/descript-audio-codec/blob/main/tests/test_train.py
Then, python -m pytest tests passes!

I tried the same trick with train.py, but the training still does not work!
But maybe this gives you some hints.
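
For anyone trying the same workaround, here is a slightly more defensive version of the stdout trick, as a minimal sketch. `reconfigure` only exists on `io.TextIOWrapper` streams (Python 3.7+), so the hypothetical helper `force_utf8` checks for it first. This would matter if stdout has been replaced by an object without that method. Why the encoding affects the tests is not established in this thread.

```python
import io
import sys


def force_utf8(stream=None):
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True when the encoding was changed, False when the stream
    has no reconfigure() method. Hypothetical helper for illustration.
    """
    stream = sys.stdout if stream is None else stream
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(encoding="utf-8")
    return True


# Demonstrate on a throwaway ASCII stream instead of the real stdout.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
changed = force_utf8(ascii_stream)
print(changed, ascii_stream.encoding)
```

Calling `force_utf8()` with no argument applies the same change to `sys.stdout`.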

listener17 commented on May 30, 2024

I created a clean conda environment, followed your installation steps, and ... it was not working.

However, by luck, the training was working on my colleague's (unclean) environment.
So, I simply used exactly that environment .... and the training works now :-)

zaptrem commented on May 30, 2024

> I created a clean conda environment, followed your installation steps, and ... it was not working.
>
> However, by luck, the training was working on my colleague's (unclean) environment. So, I simply used exactly that environment .... and the training works now :-)

Can you share the environment and reopen the issue? We're hitting the same thing.

Edit: A colleague says adding the following fixed it:

import matplotlib
matplotlib.use('Agg')
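
Agg is matplotlib's non-interactive backend: it renders to memory buffers and needs no display, which is why it is commonly selected on headless servers. A minimal sketch of the fix as a guarded helper follows. The function name `use_headless_backend` is hypothetical, and placing the call near the top of the training script, before any plotting code runs, is an assumption, since the thread does not say where the lines were added.

```python
def use_headless_backend():
    """Select matplotlib's Agg backend and return the active backend name.

    Returns None if matplotlib is not installed. Hypothetical helper
    for illustration; not part of descript-audio-codec.
    """
    try:
        import matplotlib
    except ImportError:
        return None
    matplotlib.use("Agg")  # non-interactive backend, no display required
    return matplotlib.get_backend()


backend = use_headless_backend()
print(backend)
```

Whether a GUI backend was the actual cause of the hang or the `TypeError` reported earlier in the thread is not confirmed here; the report only says that this change fixed training for one setup.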
