pytorch-distributed-example's People

Contributors

kh4l, narumiruna

pytorch-distributed-example's Issues

init error

My program now hangs at this point:
Namespace(backend='tcp', init_method='tcp://192.168.1.20:12345', rank=0, steps=20, world_size=3)
What should I do?
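
For context, my reading of this hang (not confirmed in the thread): with world_size=3, init_process_group blocks until all three ranks have connected to the rendezvous address, so rank 0 waits forever if ranks 1 and 2 were never started. A minimal sketch of launching all three ranks with the script's own flags:

# launch_all.py -- hypothetical helper, not part of the repo.
# init_process_group only returns once all world_size peers have joined,
# so each of the three ranks must actually be started somewhere.
import subprocess

for rank in range(3):  # world_size=3 requires three processes in total
    subprocess.Popen([
        'python', 'main.py',
        '--init-method', 'tcp://192.168.1.20:12345',
        '--rank', str(rank),
        '--world-size', '3',
    ])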

Function blocks at DistributedDataParallel when running the code on multiple machines

Dear author, thank you for sharing the code.
I ran the code on a single machine with world_size 2 and ranks 0 and 1, and it trained to completion without problems. But when I run it across multiple machines, e.g. two machines 192.168.2.13 and 192.168.2.14 configured with world_size 2 / rank 0 and world_size 2 / rank 1, the code blocks at nn.parallel.DistributedDataParallel(model) and then times out. Training never starts. This has puzzled me for several days; could you please give me some advice?
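
A common cause of exactly this symptom (an assumption on my part; the thread does not confirm the diagnosis) is that NCCL or Gloo binds to the wrong network interface on one of the machines, so the ranks never finish their handshake and DistributedDataParallel times out. A sketch of pinning the interface before initialization; 'eth0' is a placeholder for whichever NIC carries the 192.168.2.x traffic on each host:

import os

import torch.distributed as dist

# Tell both backends which NIC to use ('eth0' is a placeholder --
# check ifconfig / ip addr on each machine).
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'

# rank 0 runs on 192.168.2.13; use rank=1 on 192.168.2.14
dist.init_process_group(
    backend='nccl',
    init_method='tcp://192.168.2.13:23456',
    rank=0,
    world_size=2,
)

Also make sure the chosen port is reachable through any firewall on both hosts.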

About MNIST example

Hi all, in the MNIST example I did not find any synchronization code (e.g., all_reduce, recv, send, etc.).

How do we guarantee that the replicated models in different processes are the same?
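
For reference, the usual way such examples keep replicas identical (a sketch of the general pattern, not necessarily this repo's exact code): every process starts from the same initial weights, and after each backward pass the gradients are averaged across processes with all_reduce, so every optimizer step applies the same update.

import torch.distributed as dist


def average_gradients(model):
    # Sum each gradient across all ranks, then divide by the world
    # size; afterwards every rank holds the same averaged gradient.
    world_size = float(dist.get_world_size())
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size

# inside the training loop:
#   loss.backward()
#   average_gradients(model)
#   optimizer.step()

nn.parallel.DistributedDataParallel performs the same averaging automatically inside backward(), which is why no explicit all_reduce appears in code that uses it.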

Would you please show me how to use single node multi-gpu mode?

Hi,

Thanks for your helpful tutorial. I am working on one machine with 2 GPUs and am also experimenting with torch.distributed. After studying your code carefully, I tried to write a version of my own. Here is a simplified version of my code:

import argparse

import torch
import torch.nn as nn
import torch.nn.functional as F


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', dest='local_rank', type=int, default=-1)
    return parser.parse_args()


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(256, 19, kernel_size=3, stride=2, padding=1)
        self.linear = nn.Linear(512, 10)  # note: never used in forward()

    def forward(self, x):
        H, W = x.size()[2:]
        x = self.conv1(x)
        x = self.conv2(x)
        logits = self.conv3(x)
        logits = F.interpolate(logits, (H, W), mode='bilinear')
        return logits


def train():
    args = parse_args()

    # bind this process to its GPU, then join the process group
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')

    net = Net()
    net.train()
    net.cuda()
    net = nn.parallel.DistributedDataParallel(
        net,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
    )

    optim = torch.optim.SGD(
        net.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
    criteria = nn.CrossEntropyLoss()

    for i in range(10000):
        # random images and per-pixel labels stand in for a real dataset
        img = torch.randn(2, 3, 768, 768).cuda()
        lb = torch.randint(0, 18, [2, 768, 768]).cuda()
        optim.zero_grad()
        out = net(img)
        loss = criteria(out, lb)
        loss.backward()
        optim.step()


if __name__ == '__main__':
    train()

by running: python -m torch.distributed.launch --nproc_per_node=2 main.py, I got a weird error. Would you please show me what mistake I have made?
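
One likely culprit, though this is my assumption rather than something confirmed in the thread: self.linear is constructed in __init__ but never used in forward(), and DistributedDataParallel raises an error during backward when a registered parameter receives no gradient. Two possible fixes, sketched:

# Option 1: delete the unused layer so every registered parameter
# participates in forward/backward (remove self.linear from __init__).

# Option 2 (PyTorch >= 1.1): let DDP skip parameters that get no gradient.
net = nn.parallel.DistributedDataParallel(
    net,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True,
)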

mnist problem

Hello, I am new to PyTorch! I have a question about the MNIST project: how do the 2 workers interact with each other in your code?
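
To make the interaction visible (a standalone sketch, not code from the repo): after init_process_group, workers exchange tensors through collectives or point-to-point calls. A minimal point-to-point example for 2 workers:

import torch
import torch.distributed as dist

# assumes init_process_group was already called with world_size=2,
# using a backend that supports point-to-point ops (e.g. gloo)
if dist.get_rank() == 0:
    t = torch.tensor([1.0, 2.0, 3.0])
    dist.send(t, dst=1)   # blocking send to rank 1
else:
    t = torch.zeros(3)
    dist.recv(t, src=0)   # blocking receive from rank 0
print(dist.get_rank(), t)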

Sampling data

It seems that your code does not partition the dataset, so the processes may train on the same images within an iteration. For example, if we have

Training images A, B, C, D 
2 processes
Batch size 2

then, during an iteration, we may want one process to get A, B and the other to get C, D.
To achieve this, it seems that the official example provided by PyTorch uses DistributedSampler (see the sketch after this question).

Am I missing something?
Thanks
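
For completeness, a minimal DistributedSampler sketch (my illustration of the approach mentioned above, assuming init_process_group has already been called): the sampler gives each rank a disjoint shard of the indices, so with 2 processes and batch size 2 one rank gets A, B and the other gets C, D.

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# toy 4-sample dataset standing in for A, B, C, D
dataset = TensorDataset(torch.arange(4).float().unsqueeze(1),
                        torch.zeros(4, dtype=torch.long))

# splits the indices across ranks: each process sees a disjoint shard
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the shard assignment each epoch
    for x, y in loader:
        pass  # training step goes here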

Error: PyTorch built without distributed support

Hi, I would like to get distributed PyTorch working,
but this error was reported when I entered the command
"python main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 3"
and I don't know what to do.
Any suggestion or reference project would be appreciated.
Thanks
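
That message usually means the installed binary was compiled without distributed support (some older macOS and Windows wheels were). A quick way to check, a sketch rather than anything from the thread:

import torch.distributed as dist

# False => this build was compiled without USE_DISTRIBUTED;
# install a build that includes it (e.g. a Linux wheel or conda
# package) or compile PyTorch from source with it enabled.
print(dist.is_available())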
