Hi, I've been following the instructions in the README to train the model. However, I keep running into CUDA out-of-memory errors, even on a 16 GB GPU.
I have tried the following fixes, but none has worked so far:
- Decreasing batch_size to 4 -> 2 -> 1
- Decreasing num_data_workers to 2 -> 1
- Calling torch.cuda.empty_cache()
- Calling gc.collect()
- Running a notebook version on Google Colab and Kaggle
For the last point (Google Colab and Kaggle), training runs until the end of validation in the first epoch, and then the entire website crashes.
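For context, the two cleanup calls from the list above amount to roughly the following. This is a minimal sketch rather than the exact code from my run: the helper name `free_cached_memory` is made up for illustration, and the `torch` import is made optional only so the snippet also runs on a machine without PyTorch. As far as I understand, `torch.cuda.empty_cache()` only releases cached blocks that no live tensor is using, so it cannot shrink the memory the model and its activations actually need, which may be why it made no difference here.

```python
import gc

try:
    import torch  # optional here so the sketch still runs without PyTorch installed
except ImportError:
    torch = None


def free_cached_memory():
    """Run Python garbage collection, then release PyTorch's unused cached CUDA blocks.

    Returns memory figures in MiB, or an empty dict when CUDA is unavailable.
    (Hypothetical helper name, used only for this illustration.)
    """
    gc.collect()  # drop unreachable Python objects first so their tensors can be freed
    if torch is None or not torch.cuda.is_available():
        return {}
    torch.cuda.empty_cache()  # hand cached, currently unused blocks back to the driver
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated() / mib,
        "reserved_mib": torch.cuda.memory_reserved() / mib,
    }


print(free_cached_memory())
```

Calling this between training steps shows whether the allocated figure drops at all after cleanup.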
The following is a more detailed log of the error I received:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 624, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1061, in _run
    results = self._run_stage()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1140, in _run_stage
    self._run_train()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1163, in _run_train
    self.fit_loop.run()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 239, in _run_optimization
    closure()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1443, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 280, in training_step
    return self.model(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/home/chenweiyi/Grapher/model/litgrapher.py", line 102, in training_step
    logits_nodes, logits_edges= self.model(text_input_ids,
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/Grapher/model/grapher.py", line 58, in forward
    output = self.transformer(input_ids=text,
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1660, in forward
    decoder_outputs = self.decoder(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1052, in forward
    layer_outputs = layer_module(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 684, in forward
    self_attention_outputs = self.layer[0](
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 590, in forward
    attention_output = self.SelfAttention(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 520, in forward
    scores = torch.matmul(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.92 GiB total capacity; 10.48 GiB already allocated; 11.50 MiB free; 10.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
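To make the numbers in the final error line easier to compare, here is a small sketch that parses them and computes the gap between reserved and allocated memory. The function names `to_mib` and `parse_oom` are hypothetical helpers written for this post, and the regular expressions assume the message format shown in this log, which may differ in other PyTorch versions.

```python
import re

# The figures from the OutOfMemoryError line in the log, copied verbatim.
message = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.92 GiB total capacity; "
    "10.48 GiB already allocated; 11.50 MiB free; 10.68 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a (number, unit) pair such as ("10.92", "GiB") to MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(msg):
    """Extract the memory figures from the OOM message, all in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {name: to_mib(*re.search(p, msg).groups()) for name, p in patterns.items()}


stats = parse_oom(message)
gap = stats["reserved"] - stats["allocated"]  # cached by PyTorch but not in use
print(f"total capacity:       {stats['total']:.1f} MiB")
print(f"reserved - allocated: {gap:.1f} MiB")
print(f"requested vs. free:   {stats['requested']:.1f} MiB vs. {stats['free']:.1f} MiB")
```

If I read these figures correctly, reserved memory exceeds allocated memory by only about 205 MiB, so fragmentation (the case `max_split_size_mb` is meant for) is probably not the main problem. The log also reports 10.92 GiB total capacity on GPU 0 rather than 16 GB, so this process may have been placed on a smaller GPU than the one I expected.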