
Comments (14)

affjljoo3581 commented on June 2, 2024

In my case, GPU utilization was over 80% with my 2x V100s. Although my Dataset class does not spawn worker threads to fetch data from the corpus, that does not hurt performance on a properly provisioned system (sufficient CPUs and RAM) with a suitable vocabulary size. How about testing whether my Dataset loader is the bottleneck? Change the

def _fetch_one(self) -> Dict[str, List[int]]:
    while True:
        # Read a subword-tokenized sequence from the corpus.
        line = self.corpus_fp.readline()
        if not line:
            # Raise an error when all sequences have been fetched.
            if not self.repeat:
                raise StopIteration()
            # Otherwise, move back to the start of the corpus.
            self.corpus_fp.seek(0)
            continue

        # Use token indices rather than the token names directly.
        indices = [self.vocab[t] for t in line.split()]
        if len(indices) + 2 > self.seq_len:
            continue

        # Decorate the sequence with additional tokens and pad to length.
        indices = [self.vocab.bos_idx] + indices + [self.vocab.eos_idx]
        indices += [self.vocab.pad_idx] * (self.seq_len - len(indices) + 1)

        return {'input': indices[:-1], 'output': indices[1:]}

function code to the following:

    def _fetch_one(self) -> Dict[str, List[int]]:
        # Return a constant dummy sequence so that no file reading or
        # vocabulary lookup happens at all.
        indices = [0] * (self.seq_len + 1)
        return {'input': indices[:-1], 'output': indices[1:]}
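
If GPU utilization rises after this change, the data loader was the bottleneck. You can also time the original _fetch_one directly with a rough loop (a minimal sketch; dataset stands for an instance of the Dataset class, which is an assumption about how it is constructed):

    import time

    # Rough benchmark: how many sequences per second does the original
    # _fetch_one produce? `dataset` is a hypothetical Dataset instance.
    start = time.time()
    for _ in range(1000):
        dataset._fetch_one()
    print(f'{1000 / (time.time() - start):.1f} sequences/sec')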


affjljoo3581 commented on June 2, 2024

First of all, you did not append a backslash (\) to the end of the --gpus 4 parameter line. Because of that, the arguments after the --gpus 4 line may be ignored. I don't think this is the root cause, but please show me the result after fixing that first.


liygzting commented on June 2, 2024

Sorry, this is the actual format of the command:

    python -m gpt2 train --train_corpus ../build/corpus.train.txt \
        --eval_corpus ../build/corpus.test.txt \
        --vocab_path ../build/vocab.txt \
        --dims 1024 \
        --batch_train 128 \
        --batch_eval 128 \
        --seq_len 64 \
        --total_steps 3000 \
        --eval_steps 500 \
        --save_steps 3000 \
        --gpus 4 \
        --save_checkpoint_path ckpt-gpt2.pth \
        --save_model_path gpt2-pretrained.pth

Then I press ENTER to run it, but it is still stuck, as follows:

Train GPT-2 model: 0%| | 0/3000 [00:00<?, ?it/s]

When I run nvidia-smi, the multiple GPUs seem to be up, but training is stuck:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3542447      C   /root/anaconda3/bin/python                 2315MiB  |
|    1   3542448      C   /root/anaconda3/bin/python                 2320MiB  |
|    2   3542449      C   /root/anaconda3/bin/python                 2315MiB  |
|    3   3542450      C   /root/anaconda3/bin/python                 2320MiB  |
+-----------------------------------------------------------------------------+


liygzting commented on June 2, 2024

[screenshot: GPU status]


affjljoo3581 commented on June 2, 2024

How long did you wait after it froze? Due to the distributed training environment, it usually takes a few minutes before training starts. In my case, 2x V100s required about 2 to 3 minutes.


liygzting commented on June 2, 2024

It had been running for hours, so I canceled it. However, a single GPU runs at a speed of about 1.5 it/s.


affjljoo3581 commented on June 2, 2024

What about two GPUs? Can you show me the results with 2 and 3 GPUs?


liygzting commented on June 2, 2024

I've tested --gpus from 2 to 4, and there was no improvement. Maybe the dataloader doesn't allow multithreading.

[screenshots: results with 2 and 3 GPUs]


affjljoo3581 commented on June 2, 2024

I ran this model on 2x V100s. I think the distributed reduction might be the problem. Can you check whether the GPU memory usage increases with the batch size?
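
One way to check is to print the per-GPU memory counters from inside the training process (a minimal sketch using PyTorch's built-in counters; run it right after a training step so the numbers reflect the current batch size):

    import torch

    # Report allocated and reserved memory for every visible GPU.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 2 ** 20
        reserved = torch.cuda.memory_reserved(i) / 2 ** 20
        print(f'cuda:{i}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB')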


affjljoo3581 commented on June 2, 2024

And check whether TCP port 8000 is available as well.
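
The simplest test is to try binding the port (a minimal sketch; if the bind raises OSError, another process already holds port 8000):

    import socket

    # Try to bind TCP port 8000 on the local interface.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(('127.0.0.1', 8000))
            print('port 8000 is free')
        except OSError:
            print('port 8000 is already in use')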


liygzting commented on June 2, 2024

The GPU memory usage looks OK when I set the batch size to 64, and port 8000 is available.


liygzting commented on June 2, 2024

I think it is a communication problem between the GPUs. When I set CUDA_VISIBLE_DEVICES=0,1 and --gpus 2, it doesn't work, but when I set CUDA_VISIBLE_DEVICES=0,2 and --gpus 2, it works. Maybe only GPUs 0 and 2, or 1 and 3, are able to communicate.
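
Peer-to-peer access between device pairs can be probed from PyTorch (a minimal sketch; pairs that report no P2P support fall back to copies staged through host memory, which can expose exactly this kind of topology problem):

    import torch

    # Check which GPU pairs support direct peer-to-peer access.
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f'GPU {i} -> GPU {j}: P2P {"yes" if ok else "no"}')

Running nvidia-smi topo -m shows the same connectivity matrix at the system level.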


liygzting commented on June 2, 2024

I find that the Volatile GPU-Util is too low; most of the time it is 10% or even 0% (it cycles 0% -> 10% -> 99%). How can I make the loader keep the GPUs busy, like setting num_workers on a DataLoader?
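
For reference, on a standard PyTorch DataLoader, worker processes are enabled like this (a minimal sketch with a stand-in map-style dataset; whether it applies here depends on how this project feeds data, since its Dataset is a custom streaming class):

    from torch.utils.data import DataLoader, Dataset

    class ToyDataset(Dataset):
        # Stand-in map-style dataset; replace with the real one.
        def __len__(self):
            return 1024

        def __getitem__(self, idx):
            return idx

    # num_workers > 0 spawns subprocesses that prefetch batches so the
    # GPUs do not sit idle waiting for data.
    loader = DataLoader(ToyDataset(), batch_size=64, num_workers=4,
                        pin_memory=True)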


liygzting commented on June 2, 2024

Thank you very much. I think this is another point I need to learn about. At the moment, my GPUs are busy running; meanwhile, I need to understand GPT-2 more deeply.

