If you are using a previous version of PyTorch:
- pytorch
- torchvision
License: MIT License
Now I am stuck here:
Namespace(backend='tcp', init_method='tcp://192.168.1.20:12345', rank=0, steps=20, world_size=3)
What should I do?
Dear author, thank you for sharing the code.
I ran the code on a single machine with world_size 2 and ranks 0 and 1, and it trained to completion without problems. But when I run it across two machines, e.g. 192.168.2.13 and 192.168.2.14, configured with world_size 2, rank 0 and world_size 2, rank 1, the code blocks at nn.parallel.DistributedDataParallel(model) and then times out; training never starts. This has puzzled me for several days, so could you please give me some advice?
Hi all, in the mnist example I did not find any synchronization code (e.g., all_reduce, recv, send, etc.).
How do we guarantee that the replicated models in different processes are the same?
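For reference, DistributedDataParallel averages gradients across processes with an all_reduce during backward(), which is why no explicit send/recv appears in the example. Below is a minimal sketch of the equivalent manual averaging (not the repo's code, assuming PyTorch >= 1.0 and an already-initialized process group):

    import torch.distributed as dist

    def average_gradients(model):
        # Sum each parameter's gradient over all processes, then divide by
        # world size so every replica ends up with the same averaged gradient.
        world_size = float(dist.get_world_size())
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
                param.grad.data /= world_size

    # Usage inside the training loop:
    #   loss.backward()
    #   average_gradients(model)   # not needed when DistributedDataParallel wraps the model
    #   optimizer.step()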
Hi,
Thanks for your helpful tutorial. I am working on one machine with 2 GPUs, and I am also playing around with torch.distributed. After studying your code carefully, I tried to write my own version. Here is a simplified version of my code:
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F


def parse_args():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns.
    parser.add_argument('--local_rank', dest='local_rank', type=int, default=-1)
    return parser.parse_args()


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(256, 19, kernel_size=3, stride=2, padding=1)
        self.linear = nn.Linear(512, 10)  # defined but unused in forward

    def forward(self, x):
        H, W = x.size()[2:]
        x = self.conv1(x)
        x = self.conv2(x)
        logits = self.conv3(x)
        # Upsample the 19-channel score map back to the input resolution.
        logits = F.interpolate(logits, (H, W), mode='bilinear', align_corners=False)
        return logits


def train():
    args = parse_args()
    # Bind this process to its own GPU before creating any CUDA tensors.
    torch.cuda.set_device(args.local_rank)
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE set by the launcher.
    dist.init_process_group(backend='nccl', init_method='env://')

    net = Net()
    net.train()
    net.cuda()
    net = nn.parallel.DistributedDataParallel(
        net,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
    )

    optim = torch.optim.SGD(
        net.parameters(),
        lr=1e-3,
        momentum=0.9,
        weight_decay=5e-4)
    criteria = nn.CrossEntropyLoss()

    for i in range(10000):
        # Dummy data: a batch of 2 images with per-pixel labels in [0, 18).
        img = torch.randn(2, 3, 768, 768).cuda()
        lb = torch.randint(0, 18, [2, 768, 768]).cuda()

        optim.zero_grad()
        out = net(img)
        # CrossEntropyLoss accepts (N, C, H, W) logits with (N, H, W) targets directly.
        loss = criteria(out, lb)
        loss.backward()
        optim.step()


if __name__ == "__main__":
    train()
By running python -m torch.distributed.launch --nproc_per_node=2 main.py, I got a weird error. Could you please show me what mistake I have made?
Hello, I am new to PyTorch! I have a question about the mnist project. How do the 2 workers interact with each other in your code?
It seems that your code does not partition the dataset so that each process sees different training images during an iteration. For example, if we have
Training images A, B, C, D
2 processes
Batch size 2
then, during an iteration, we may want one process to get A, B and the other to get C, D.
To achieve this, it seems that the official example provided by PyTorch uses DistributedSampler.
Am I missing something?
Thanks
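For reference, a minimal sketch of how DistributedSampler shards a dataset so each rank draws a different subset (dataset and num_epochs are placeholders, and the process group is assumed to be initialized already):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # DistributedSampler splits the dataset indices by rank and world size.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)

    for epoch in range(num_epochs):
        # Reshuffle the shards differently each epoch.
        sampler.set_epoch(epoch)
        for images, labels in loader:
            ...  # each rank now sees a disjoint part of the data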
In https://github.com/narumiruna/pytorch-distributed-example/blob/torch170/toy/README.md, --node_rank should be 1 for Rank 2; otherwise Python raises RuntimeError: Address already in use.
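For reference, a hedged sketch of what the two launch commands could look like with distinct node ranks (the address and port below are placeholders, not the README's exact values):

    # On the first node (node_rank 0)
    python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
        --master_addr=192.168.0.1 --master_port=23456 --nproc_per_node=1 main.py

    # On the second node (node_rank 1); reusing node_rank=0 makes both processes
    # try to act as the master and bind the same port, hence "Address already in use".
    python -m torch.distributed.launch --nnodes=2 --node_rank=1 \
        --master_addr=192.168.0.1 --master_port=23456 --nproc_per_node=1 main.py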
I am trying to use the new features in PyTorch 1.1. Does this repo support PyTorch 1.0.1?
When I try to run main.py, an error occurs at this line:
accuracy = 1.0 * correct / len(test_dataloader.dataset)
I don't know the reason.
Hi, I would like to set up distributed PyTorch, but an error was reported when I ran
python main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 3
and I don't know what to do. Any suggestion or reference project would be appreciated. Thanks.
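Without the exact error this is only a guess, but a common cause of this symptom: with --world-size 3, init_process_group blocks until all three ranks have joined, so the same command must be started three times with ranks 0, 1, and 2, e.g.:

    # Terminal 1
    python main.py --init-method tcp://127.0.0.1:23456 --rank 0 --world-size 3
    # Terminal 2
    python main.py --init-method tcp://127.0.0.1:23456 --rank 1 --world-size 3
    # Terminal 3
    python main.py --init-method tcp://127.0.0.1:23456 --rank 2 --world-size 3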