If you are using a previous version of PyTorch:
- pytorch
- torchvision
License: MIT License
Now I am stuck here:
Namespace(backend='tcp', init_method='tcp://192.168.1.20:12345', rank=0, steps=20, world_size=3)
What should I do?
Dear author, thank you for sharing the code.
I ran the code on a single machine with world_size 2 and ranks 0 and 1, and it trained to completion without problems. But when I run it across two machines, e.g. 192.168.2.13 and 192.168.2.14, configured with world_size 2, rank 0 and world_size 2, rank 1, the code blocks at nn.parallel.DistributedDataParallel(model) and then times out; training never starts. This has puzzled me for several days, so could you please give me some advice?
Hi all, in the mnist example I did not find any synchronization code (e.g., all_reduce, recv, send, etc.).
How do we guarantee that the replicated models in different processes are the same?
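For reference, DistributedDataParallel averages gradients across processes with an all_reduce during backward(), which is why no explicit send/recv appears in the example. Below is a minimal sketch of the equivalent manual averaging (not the repo's code, assuming PyTorch >= 1.0 and an already-initialized process group):

    import torch.distributed as dist

    def average_gradients(model):
        # Sum each parameter's gradient over all processes, then divide by
        # world size so every replica ends up with the same averaged gradient.
        world_size = float(dist.get_world_size())
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
                param.grad.data /= world_size

    # Usage inside the training loop:
    #   loss.backward()
    #   average_gradients(model)   # not needed when DistributedDataParallel wraps the model
    #   optimizer.step()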
Hi,
Thanks for your helpful tutorial. I am working on one machine with 2 GPUs, and I am also playing around with torch.distributed. After studying your code carefully, I tried to write my own version. Here is a simplified version of my code:
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F


def parse_args():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns.
    parser.add_argument('--local_rank', dest='local_rank', type=int, default=-1)
    return parser.parse_args()


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(256, 19, kernel_size=3, stride=2, padding=1)
        self.linear = nn.Linear(512, 10)  # defined but unused in forward

    def forward(self, x):
        H, W = x.size()[2:]
        x = self.conv1(x)
        x = self.conv2(x)
        logits = self.conv3(x)
        # Upsample the 19-channel score map back to the input resolution.
        logits = F.interpolate(logits, (H, W), mode='bilinear', align_corners=False)
        return logits


def train():
    args = parse_args()
    # Bind this process to its own GPU before creating any CUDA tensors.
    torch.cuda.set_device(args.local_rank)
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE set by the launcher.
    dist.init_process_group(backend='nccl', init_method='env://')

    net = Net()
    net.train()
    net.cuda()
    net = nn.parallel.DistributedDataParallel(
        net,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
    )

    optim = torch.optim.SGD(
        net.parameters(),
        lr=1e-3,
        momentum=0.9,
        weight_decay=5e-4)
    criteria = nn.CrossEntropyLoss()

    for i in range(10000):
        # Dummy data: a batch of 2 images with per-pixel labels in [0, 18).
        img = torch.randn(2, 3, 768, 768).cuda()
        lb = torch.randint(0, 18, [2, 768, 768]).cuda()

        optim.zero_grad()
        out = net(img)
        # CrossEntropyLoss accepts (N, C, H, W) logits with (N, H, W) targets directly.
        loss = criteria(out, lb)
        loss.backward()
        optim.step()


if __name__ == "__main__":
    train()
By running python -m torch.distributed.launch --nproc_per_node=2 main.py, I got a weird error. Could you please show me what mistake I have made?
Hello, I am new to PyTorch! I have a question about the mnist project. How do the 2 workers interact with each other in your code?
It seems that your code does not partition the dataset so that each process sees different training images during an iteration. For example, if we have
Training images A, B, C, D
2 processes
Batch size 2
then, during an iteration, we may want one process to get A, B and the other to get C, D.
To achieve this, it seems that the official example provided by PyTorch uses DistributedSampler.
Am I missing something?
Thanks
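For reference, a minimal sketch of how DistributedSampler shards a dataset so each rank draws a different subset (dataset and num_epochs are placeholders, and the process group is assumed to be initialized already):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # DistributedSampler splits the dataset indices by rank and world size.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)

    for epoch in range(num_epochs):
        # Reshuffle the shards differently each epoch.
        sampler.set_epoch(epoch)
        for images, labels in loader:
            ...  # each rank now sees a disjoint part of the data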
In https://github.com/narumiruna/pytorch-distributed-example/blob/torch170/toy/README.md, --node_rank should be 1 for Rank 2; otherwise Python raises RuntimeError: Address already in use.
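For reference, a hedged sketch of what the two launch commands could look like with distinct node ranks (the address and port below are placeholders, not the README's exact values):

    # On the first node (node_rank 0)
    python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
        --master_addr=192.168.0.1 --master_port=23456 --nproc_per_node=1 main.py

    # On the second node (node_rank 1); reusing node_rank=0 makes both processes
    # try to act as the master and bind the same port, hence "Address already in use".
    python -m torch.distributed.launch --nnodes=2 --node_rank=1 \
        --master_addr=192.168.0.1 --master_port=23456 --nproc_per_node=1 main.py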
I am trying to use the new features in PyTorch 1.1. Does this repo support PyTorch 1.0.1?
When I try to run main.py, an error occurs at this line:
accuracy = 1.0 * correct / len(test_dataloader.dataset)
I don't know the reason.
Hi, I would like to set up distributed PyTorch, but an error was reported when I ran
python main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 3
and I don't know what to do. Any suggestion or reference project would be appreciated. Thanks.
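Without the exact error this is only a guess, but a common cause of this symptom: with --world-size 3, init_process_group blocks until all three ranks have joined, so the same command must be started three times with ranks 0, 1, and 2, e.g.:

    # Terminal 1
    python main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 3
    # Terminal 2
    python main.py --init-method tcp://127.0.0.1:23456 --rank 1 --world-size 3
    # Terminal 3
    python main.py --init-method tcp://127.0.0.1:23456 --rank 2 --world-size 3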