Comments (1)
@carlos-havier As mentioned in the PyTorch Lightning documentation, the downside of using ddp_notebook is:
"GPU operations such as moving tensors to the GPU or calling torch.cuda functions before invoking Trainer.fit is not allowed."
This means that no CUDA tensors can exist before Trainer.fit is called. By default, when training on a GPU, PyTorch Lightning saves the state_dict in the checkpoint with its tensors on CUDA. So when you load from the checkpoint, those tensors are restored to the GPU and CUDA is initialized. You can verify this with a simple check:
print(torch.cuda.is_initialized())
This can be placed before and after calling:
pl_model = LT_timm_model.load_from_checkpoint(def_log_chkpt)
You'll observe that CUDA is initialized when calling load_from_checkpoint, and once CUDA is initialized here, it cannot be re-initialized in a different context as required by ddp_notebook.
The Fix:
Set map_location to the CPU when calling load_from_checkpoint:
pl_model = LT_timm_model.load_from_checkpoint(def_log_chkpt, map_location=torch.device('cpu'))
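To see the effect of map_location in isolation, here is a minimal sketch that uses plain torch.save and torch.load on a toy checkpoint rather than Lightning's load_from_checkpoint. The toy model, the file name toy.ckpt, and the helper load_state_on_cpu are assumptions made up for illustration; they are not part of the original notebook.

```python
import os
import tempfile

import torch
from torch import nn


def load_state_on_cpu(path: str) -> dict:
    """Load a checkpoint file, remapping every stored tensor to the CPU."""
    # map_location keeps torch.load from restoring tensors to the device
    # they were saved from, so a GPU-saved checkpoint does not touch CUDA.
    return torch.load(path, map_location=torch.device("cpu"))


# Hypothetical stand-in for a trained model; a real checkpoint would come
# from a Lightning training run.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = os.path.join(tmp, "toy.ckpt")  # hypothetical file name
    torch.save({"state_dict": model.state_dict()}, ckpt_path)

    before = torch.cuda.is_initialized()
    checkpoint = load_state_on_cpu(ckpt_path)
    after = torch.cuda.is_initialized()

# Collect the device types of all restored tensors.
devices = {t.device.type for t in checkpoint["state_dict"].values()}

print("CUDA initialized before load:", before)
print("CUDA initialized after load:", after)
print("Tensor devices:", devices)
```

If both flags stay False and every tensor reports the cpu device, the load did not touch CUDA, which is the condition ddp_notebook needs before Trainer.fit.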
For your reference, I have added a few lines to debug based on your notebook here.