Comments (8)
Hey @listener17 , sorry about that! I'll look into it this week.
For now, can you try launching via torchrun, even if on a single GPU? The relevant command is:
torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/
Just curious if that works.
from descript-audio-codec.
@pseeth: thanks.
It hangs for ages, even though I use a smaller batch size (4) and only one discriminator for each of the 2 discriminator types.
user@v100:~/user/descript-audio-codec$ torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/
Accelerator(
amp : bool = False
)
@pseeth:
If I use this:
export CUDA_VISIBLE_DEVICES=2
torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/
I get this error:
Accelerator(
amp : bool = False
)
Traceback (most recent call last):
File "/home/user/descript-audio-codec/scripts/train.py", line 433, in <module>
with Accelerator() as accel:
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/argbind/argbind.py", line 159, in cmd_func
return func(*cmd_args, **kwargs)
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/audiotools/ml/accelerator.py", line 71, in __init__
torch.cuda.device(self.local_rank) if torch.cuda.is_available() else None
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in __init__
self.idx = _get_device_index(device, optional=True)
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/_utils.py", line 26, in _get_device_index
device = torch.device(device)
RuntimeError: Invalid device string: '0'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5276) of binary: /home/user/anaconda3/envs/dac/bin/python
Traceback (most recent call last):
File "/home/user/anaconda3/envs/dac/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-04_06:07:46
host : v100.com.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5276)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
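A note on the `Invalid device string: '0'` line: the quotes around 0 suggest that a string, not an integer, reached `torch.device`. torchrun exports `LOCAL_RANK` as an environment variable, and environment variables are always strings, so one plausible cause (an assumption, not confirmed in this thread) is that the rank was passed on without an `int()` conversion. A minimal sketch of that conversion, where `local_rank_from_env` is a hypothetical helper and not part of audiotools or DAC:

```python
import os


def local_rank_from_env(env=None, default=0):
    """Return LOCAL_RANK as an int; torchrun exports it as a string."""
    env = os.environ if env is None else env
    raw = env.get("LOCAL_RANK")
    if raw is None or raw.strip() == "":
        # Not launched through torchrun: fall back to a single-process rank.
        return default
    return int(raw)


# Hypothetical usage: pass the integer, never the raw string, to device APIs.
rank = local_rank_from_env({"LOCAL_RANK": "0"})
print(type(rank).__name__, rank)
```

With `CUDA_VISIBLE_DEVICES=2` the process sees a single GPU that is indexed as device 0, so rank 0 is still the right index once it is an integer.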
@pseeth:
The error seems to be at line 235 of https://github.com/descriptinc/descript-audio-codec/blob/main/scripts/train.py:
out = state.generator(signal.audio_data, signal.sample_rate)
Exception has occurred: TypeError
'int' object is not iterable
File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
out = state.generator(signal.audio_data, signal.sample_rate)
File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
train_loop(state, batch, accel, lambdas)
File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
train(args, accel)
TypeError: 'int' object is not iterable
On a different GPU server, I get a similar but different error message at the same place:
Exception has occurred: TypeError
zip argument #1 must support iteration
File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
out = state.generator(signal.audio_data, signal.sample_rate)
File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
train_loop(state, batch, accel, lambdas)
File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
train(args, accel)
TypeError: zip argument #1 must support iteration
FYI:
python -m pytest tests
is also not working.
BUT, if I add:
import sys
sys.stdout.reconfigure(encoding="utf-8")
at the top of https://github.com/descriptinc/descript-audio-codec/blob/main/tests/test_train.py
Then, python -m pytest tests
passes!
I tried the same trick with train.py, but the training still does not work!
Still, maybe this gives you some hints.
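The `sys.stdout.reconfigure` workaround points at a stdout encoding that is not UTF-8 in that shell (an assumption; the thread does not show the locale). A small sketch that checks the encoding first and only reconfigures when needed; `ensure_utf8` is a hypothetical helper name:

```python
import io


def ensure_utf8(stream):
    """Switch a text stream to UTF-8 if it currently uses another encoding."""
    current = (getattr(stream, "encoding", None) or "").lower()
    normalized = current.replace("-", "").replace("_", "")
    # TextIOWrapper.reconfigure exists on Python 3.7+; other stream objects
    # (for example some capture wrappers) may not provide it.
    if normalized != "utf8" and hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return getattr(stream, "encoding", None)


# Demo on an in-memory stream that starts out as ASCII-only.
demo = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
print(ensure_utf8(demo))
```

Calling `ensure_utf8(sys.stdout)` at the top of a script would mirror the two-line fix above while leaving streams that are already UTF-8 untouched.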
I created a clean conda environment, followed your installation steps, and ... it was not working.
However, by luck, the training was working on my colleague's (unclean) environment.
So, I simply used exactly that environment .... and the training works now :-)
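Since a fresh environment fails and an older one works, comparing installed package versions between the two is a direct way to narrow this down. A minimal sketch using only the standard library; the package list is an assumption about which dependencies matter here, and `installed_versions` / `diff_versions` are hypothetical helpers:

```python
from importlib import metadata

# Assumed list of relevant distributions; adjust to the real dependency set.
PACKAGES = ["torch", "torchaudio", "descript-audiotools", "argbind", "matplotlib", "numpy"]


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(env_a, env_b):
    """Return {name: (version_in_a, version_in_b)} for every mismatch."""
    keys = sorted(set(env_a) | set(env_b))
    return {k: (env_a.get(k), env_b.get(k)) for k in keys if env_a.get(k) != env_b.get(k)}


if __name__ == "__main__":
    # Run this in each environment and compare the printed lists.
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Running it in both environments and feeding the two dictionaries to `diff_versions` lists only the packages that differ; `pip freeze` in each environment gives the complete picture.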
> I created a clean conda environment, followed your installation steps, and ... it was not working.
> However, by luck, the training was working on my colleague's (unclean) environment. So, I simply used exactly that environment .... and the training works now :-)
Can you share the environment and reopen the issue? We're hitting the same thing.
Edit: Colleague says adding the following fixed it:
import matplotlib
matplotlib.use('Agg')
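`matplotlib.use('Agg')` selects a non-interactive backend that renders to files and needs no display, which is why it can matter on a headless GPU server. The same choice can be made without touching imports through the `MPLBACKEND` environment variable, as long as it is set before matplotlib is first imported. A small sketch; whether the backend is the root cause of the errors above is not confirmed in this thread, and `select_headless_backend` is a hypothetical helper:

```python
import os


def select_headless_backend(env=None):
    """Set MPLBACKEND to Agg; call before the first matplotlib import."""
    env = os.environ if env is None else env
    env["MPLBACKEND"] = "Agg"
    return env["MPLBACKEND"]


select_headless_backend()
# Any later `import matplotlib.pyplot as plt` in this process (or in child
# processes that inherit the environment) should then default to Agg.
```

Exporting `MPLBACKEND=Agg` in the shell before running torchrun has the same effect and requires no code change.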