
pytorch-multi-gpu-training's Issues

Project

Hi, I'd like to ask: did you actually get this project running?

ddp_train.py does not fix the random seed

The DDP code in the README contains the following lines:

import random

import numpy as np
import torch

# fix the random seed for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

But these lines are missing from ddp_train.py itself. Is that intentional?
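
For reference, a minimal sketch of how seeding is often handled in a DDP script. The set_seed helper and the per-rank offset are assumptions for illustration, not something ddp_train.py contains; offsetting by rank keeps augmentation and dropout streams distinct per process, while DDP still starts every replica from identical weights because it broadcasts rank 0's parameters when the model is wrapped.

import random

import numpy as np
import torch
import torch.distributed as dist


def set_seed(base_seed=42):
    # Hypothetical helper, not part of ddp_train.py.
    rank = dist.get_rank() if dist.is_initialized() else 0
    seed = base_seed + rank  # per-rank offset: an assumption, see above
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)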

Why does accuracy shrink in proportion to the number of GPUs?

The code is as follows:

for epoch in range(10):
    acc_num = 0
    for i, (inputs, labels) in enumerate(train_loader):
        # forward
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs[0], labels)
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # log
        if args.local_rank == 0 and i % 5 == 0:
            tb_writer.add_scalar('loss', loss.item(), i)
        acc_num += (outputs[0].argmax(1) == labels).sum()
    if args.local_rank == 0:
        tb_writer.add_scalar('acc', acc_num / len(train_dataset), epoch)
        print(f"acc: {acc_num / len(train_dataset)}")

With 1 GPU: acc 89%
With 3 GPUs: acc 29%
With 4 GPUs: acc 17%
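
A plausible cause, for the record: with a DistributedSampler, each process iterates over only about len(train_dataset) / world_size samples, yet acc_num (counted on a single rank) is divided by the full dataset length, so the reported figure shrinks roughly in proportion to the GPU count, which matches the drop above. A minimal sketch of one fix, assuming the variable names from the loop quoted above: sum the correct counts over all ranks before dividing.

import torch
import torch.distributed as dist

# Sketch: replace the per-epoch logging above with a cross-rank sum.
# acc_num is the count accumulated in the inner loop on this rank.
if not torch.is_tensor(acc_num):
    acc_num = torch.tensor(acc_num, device=device)
dist.all_reduce(acc_num, op=dist.ReduceOp.SUM)  # sum correct counts over GPUs
if args.local_rank == 0:
    acc = acc_num.item() / len(train_dataset)   # denominator now matches
    tb_writer.add_scalar('acc', acc, epoch)
    print(f"acc: {acc}")

Alternatively, dividing each rank's count by len(train_loader.sampler) would give a per-shard accuracy without the collective call.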

Single-machine multi-GPU training with Accelerate

Accelerate's single-machine multi-GPU training looks similar to the DDP approach described in this project: each process owns one GPU during training. Does each process then also hold its own copy of the training data (the dataset), so that memory use becomes several times the single-GPU figure? I only know that Accelerate shards the dataloader; I don't know whether the dataset built in the earlier stage is shared between processes. Below is Claude 3's answer; I'd appreciate your advice.
[screenshot: Claude 3's answer]
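
For what it's worth, a minimal sketch of the usual Accelerate setup (train_dataset, model, and optimizer are placeholder names). accelerator.prepare swaps in a sharding sampler so each process draws a distinct subset of batches, but every process still constructs its own Dataset object, so an in-memory dataset is duplicated once per process; sharing generally requires memory-mapped storage (e.g. np.memmap or Arrow-backed datasets), which the OS can share across processes.

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()

# Each process runs this whole script, so each builds its own copy
# of train_dataset; data held in process RAM is duplicated per GPU.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# prepare() re-wraps the dataloader so every process sees a distinct
# shard of batches; it does not make the dataset itself shared.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)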

Runtime error

It errors out when run because some environment variables are missing: RANK, WORLD_SIZE, MASTER_ADDR, etc.
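
Those variables are normally set by the PyTorch launcher rather than by hand. A minimal sketch, assuming the script is this repo's ddp_train.py:

# Launching with torchrun sets RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT for every worker process:
#
#   torchrun --nproc_per_node=4 ddp_train.py
#
import torch.distributed as dist

# With those variables present, env:// initialization works:
dist.init_process_group(backend="nccl", init_method="env://")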

Models on different GPUs diverge during DDP multi-GPU training.

Hello, I'd like to ask a question. When I train a model across multiple GPUs with DDP, I find that the model parameters on the individual GPUs are not identical. I load the model and distribute the data in the DDP style; everything I've read says that in this setup gradients are synchronized across GPUs at every batch and the update is applied consistently, so the parameters on every GPU should stay identical. But when I log them, as shown in the screenshots, the models on the GPUs differ. Do you know what might cause this?
[screenshots: per-GPU parameter logs]
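
For debugging, a small sketch (the helper name is made up) that checks whether each rank's parameters match rank 0's. DDP broadcasts parameters from rank 0 when the model is wrapped and keeps them aligned through gradient averaging, so a mismatch usually means the ranks were compared at different steps, the underlying model.module was modified outside DDP, or the comparison ran before wrapping.

import torch
import torch.distributed as dist

def check_param_sync(ddp_model):
    # Hypothetical diagnostic: call on every rank at the same point
    # in training (broadcast is a collective, all ranks must join).
    for name, param in ddp_model.module.named_parameters():
        ref = param.detach().clone()
        dist.broadcast(ref, src=0)  # everyone receives rank 0's copy
        if not torch.equal(ref, param.detach()):
            print(f"rank {dist.get_rank()}: {name} differs from rank 0")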
