Comments (14)
In my case, GPU utilization was over 80% with my 2x V100s. Although my Dataset
class does not spawn worker threads to fetch data from the corpus, that actually does not matter for performance on a properly provisioned system (sufficient CPUs and RAM) with a suitable vocabulary size. How about testing whether my Dataset
loader is the bottleneck? Change the
Lines 28 to 50 in 71ebf91
function code as below:
def _fetch_one(self) -> Dict[str, List[int]]:
    # Return dummy all-zero tokens instead of reading from the corpus.
    indices = [0] * (self.seq_len + 1)
    return {'input': indices[:-1], 'output': indices[1:]}
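To see what this patch measures, the dummy method can be timed in isolation; a minimal sketch, where DummyDataset is a hypothetical stand-in that keeps only the seq_len state the patched method needs:

```python
import time
from typing import Dict, List

class DummyDataset:
    """Hypothetical stand-in carrying the patched _fetch_one above."""
    def __init__(self, seq_len: int):
        self.seq_len = seq_len

    def _fetch_one(self) -> Dict[str, List[int]]:
        # All-zero dummy tokens: no corpus I/O, no tokenization.
        indices = [0] * (self.seq_len + 1)
        return {'input': indices[:-1], 'output': indices[1:]}

ds = DummyDataset(seq_len=64)
start = time.perf_counter()
for _ in range(10_000):
    ds._fetch_one()
print(f"10k dummy fetches: {time.perf_counter() - start:.3f}s")
```

If training throughput does not improve with this patch in place, the bottleneck is not the data pipeline.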
from gpt2.
First of all, you did not append a backslash (\) to the end of the --gpus 4
parameter line. Because of that, the arguments after the --gpus 4
line may be ignored. I don't think this is the root cause, but please show me the result after fixing that bug first.
Sorry, that was just how it was displayed. This is what I ran:
python -m gpt2 train --train_corpus ../build/corpus.train.txt \
                     --eval_corpus ../build/corpus.test.txt \
                     --vocab_path ../build/vocab.txt \
                     --dims 1024 \
                     --batch_train 128 \
                     --batch_eval 128 \
                     --seq_len 64 \
                     --total_steps 3000 \
                     --eval_steps 500 \
                     --save_steps 3000 \
                     --gpus 4 \
                     --save_checkpoint_path ckpt-gpt2.pth \
                     --save_model_path gpt2-pretrained.pth
and then pressed ENTER to run it.
It is still stuck, as follows:
Train GPT-2 model: 0%| | 0/3000 [00:00<?, ?it/s]
When I run nvidia-smi, the multi-GPU processes seem to be up, but training is stuck:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3542447 C /root/anaconda3/bin/python 2315MiB |
| 1 3542448 C /root/anaconda3/bin/python 2320MiB |
| 2 3542449 C /root/anaconda3/bin/python 2315MiB |
| 3 3542450 C /root/anaconda3/bin/python 2320MiB |
+-----------------------------------------------------------------------------+
How long did you wait during the freeze? Due to the distributed training environment, it usually takes a few minutes before training starts. In my case, 2x V100s required about 2 to 3 minutes.
It had been running for hours, so I canceled it.
However, a single GPU runs at a speed of about 1.5 it/s.
What about two GPUs? Can you show me the result with 2 and 3 GPUs?
I've tested --gpus from 2 to 4.
There was no improvement.
Maybe the dataloader doesn't allow multithreading.
I ran this model on 2x V100s. I think distributed reduction would be the problem. Can you check if the GPU memory usage increases depending on the batch size?
And check whether TCP port 8000 is available as well.
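The port check can be scripted with the standard library alone; a minimal sketch, where the loopback host and the 0.5 s timeout are my assumptions:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success, i.e. something is listening.
        return s.connect_ex((host, port)) != 0

print(port_is_free(8000))
```

If this prints False while no trainer is running, another process already holds the port and the distributed initialization will hang waiting on it.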
The GPU memory usage looks OK when I set the batch size to 64,
and port 8000 is available.
I think it is a communication problem between the multiple GPU cards.
When I set CUDA_VISIBLE_DEVICES=0,1 and --gpus 2, it doesn't work.
But when I set CUDA_VISIBLE_DEVICES=0,2 and --gpus 2, it works.
Maybe only GPUs 0 and 2 (or 1 and 3) are able to communicate with each other. (Running nvidia-smi topo -m shows the link topology between each GPU pair.)
I find that Volatile GPU-Util is too low; most of the time it is 10% or even 0%,
cycling 0% -> 10% -> 99%.
How can I make it work the way DataLoader does when num_workers is set?
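For illustration, DataLoader's num_workers spawns worker processes that prepare samples while the GPU computes; the same overlap idea can be sketched dependency-free with threads, where fetch is a hypothetical stand-in for the dataset's _fetch_one:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def fetch(seq_len: int) -> Dict[str, List[int]]:
    # Stand-in for Dataset._fetch_one: dummy all-zero token ids.
    indices = [0] * (seq_len + 1)
    return {'input': indices[:-1], 'output': indices[1:]}

# Workers prepare the next batch concurrently, so sample preparation
# overlaps with the forward/backward pass the main thread would run.
with ThreadPoolExecutor(max_workers=4) as pool:
    batch = list(pool.map(fetch, [64] * 8))

print(len(batch))  # prints 8
```

PyTorch uses worker processes rather than threads for this, so CPU-bound tokenization also parallelizes; the sketch only shows the prefetch structure.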
Thank you very much.
I think this may be another point I need to learn.
At present, my GPUs are busy running.
At the same time, I need to understand gpt2 more deeply.
Related Issues (9)
- Inquiry about the Dataset HOT 1
- Confusions on Usage HOT 3
- Is Apex useful for GPT-2? HOT 2
- Activaiton Function HOT 3
- Training spec HOT 2
- Training spec #2 HOT 2
- bidirectional training in GPT2 HOT 1
- Which kind of tokenizer do you use? It looks like WordPiece, not BPE. HOT 1