
torch-distlearn's Introduction

DistLearn

Some common distributed learning algorithms built in Torch with the help of the torch-ipc library.

AllReduceSGD

Spreads the computation of gradients for mini-batch of items across N processes. Uses AllReduce to quickly sum the gradients and distribute the total back out to every process.
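
The example below assumes a tree describing how the N processes are connected. A rough sketch of constructing one for processes on a single machine, using torch-ipc's LocalhostTree helper (the constructor arguments here are an assumption based on the --nodeIndex and --numNodes flags used in the examples):

-- This process's position in the group would normally come from
-- command-line flags (--nodeIndex and --numNodes in the examples);
-- hard-coded here for brevity.
local nodeIndex, numNodes = 1, 4
-- Build an AllReduce tree connecting the local processes
-- (LocalhostTree is the helper the bundled examples use; the exact
-- signature is an assumption, see the examples for the real call).
local tree = require 'ipc.LocalhostTree'(nodeIndex, numNodes)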

local allReduceSGD = require 'distlearn.AllReduceSGD'(tree)
-- Make sure all the nodes start with the same parameter values
allReduceSGD.synchronizeParameters(params)
for _ = 1,epochs do
   for _ = 1,steps do
      -- Compute your gradients as normal
      local grads = computeYourGrads(...)
      -- Sum and normalize them
      allReduceSGD.sumAndNormalizeGradients(grads)
      -- Do your SGD as normal
      SGD(params, grads)
   end
   -- Before validating we should make sure all nodes have
   -- the exact same parameter values
   allReduceSGD.synchronizeParameters(params)
   -- Validate...
end

When used in combination with Dataset, you can quickly parallelize the processing of large datasets without a ton of effort. See the MNIST example for a complete working setup.
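
As a rough sketch of how the pieces fit together (the partition option names below are assumptions; see the MNIST example for the exact API), each node can load a disjoint slice of the data and feed its mini-batches into the loop above:

local Dataset = require 'dataset.Dataset'

-- Give each node its own partition of the training data
-- (option names assumed from the MNIST example).
local trainingDataset = Dataset('mnist/train.t7', {
   partition = nodeIndex,    -- this node's index, 1..numNodes
   partitions = numNodes,    -- total number of nodes
})

-- Each node then iterates over its own partition as usual; the
-- AllReduce step above takes care of combining the gradients.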

AllReduceEA

We also have an AllReduce-based implementation of the Elastic Averaging algorithm described in Deep learning with Elastic Averaging SGD. It's just as easy to add to your training script; only two parameters are required, tau and alpha. Tau is the number of steps to run before averaging the nodes, and alpha is the weight used during the averaging step. You can read more about our implementation of AllReduceEA.

-- Use a tau of 10 and an alpha of 0.2
local allReduceEA = require 'distlearn.AllReduceEA'(tree, 10, 0.2)
-- Make sure all the nodes start with the same parameter values
allReduceEA.synchronizeParameters(params)
for _ = 1,epochs do
   for _ = 1,steps do
      -- Compute your gradients as normal
      local grads = computeYourGrads(...)
      -- Do your SGD as normal
      SGD(params, grads)
      -- Average the params
      allReduceEA.averageParameters(params)
   end
   -- Make sure the centers haven't drifted too far apart due to
   -- accumulated floating point error
   allReduceEA.synchronizeCenter(params)
   -- Validate...
end

See a complete working example of EA and MNIST.

License

Licensed under the Apache License, Version 2.0. See LICENSE file.

torch-distlearn's People

Contributors

juliaferraioli, zakattacktwitter

torch-distlearn's Issues

Convergence speed

I followed the sample code and added distlearn to my RNN code. Training throughput goes up by about 3.5 times with 4 GPUs. However, convergence is actually slower than in the 1-GPU case: the script can process 3.5 times more training data with 4 GPUs, yet it takes a similar or longer time to reach the same accuracy as 1 GPU. I have tuned the parameters alpha (0.02 to 0.5) and tau (2 to 10 mini-batch updates) with different values. I also tried calling synchronizeParameters at different intervals, from once per epoch (tens of hours) down to every few minutes.
After those changes, convergence is still slower than the 1-GPU case. Any hint as to the problem?

Thanks

Yun

Segmentation fault

Hi, when running the example, it sometimes fails with the message below. Do you have any idea why?

Thanks,
Chien-Lin

line 10: 12026 Segmentation fault th cifar10.lua --numNodes 4 --nodeIndex 1 --batchSize 10 --cuda --gpu 1

"CUDA IPC not possible between GPUs" & "CUDA IPC enabled between GPUs"

Hi, I tried to get distlearn working across multiple GPUs on my local machine (32 CPU cores and 8 Tesla K80 GPUs) but received the messages below. It seems to enable CUDA IPC between some GPU pairs automatically, but I get fairly low accuracy and each epoch takes a very long time to compute. Do you have any suggestions?

Thank you.

Chien-Lin

------------------------ Input
th mnist2.lua --numNodes 8 --nodeIndex 1 --cuda --gpu 1 &
th mnist2.lua --numNodes 8 --nodeIndex 2 --cuda --gpu 2 &
th mnist2.lua --numNodes 8 --nodeIndex 3 --cuda --gpu 3 &
th mnist2.lua --numNodes 8 --nodeIndex 4 --cuda --gpu 4 &
th mnist2.lua --numNodes 8 --nodeIndex 5 --cuda --gpu 5 &
th mnist2.lua --numNodes 8 --nodeIndex 6 --cuda --gpu 6 &
th mnist2.lua --numNodes 8 --nodeIndex 7 --cuda --gpu 7 &
th mnist2.lua --numNodes 8 --nodeIndex 8 --cuda --gpu 8 &
------------------------ Output
INFO: torch-ipc: CUDA IPC not possible between GPU2 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC not possible between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU4 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU6
INFO: torch-ipc: CUDA IPC not possible between GPU6 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC not possible between GPU7 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU5
INFO: torch-ipc: CUDA IPC not possible between GPU5 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU1 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU1
INFO: torch-ipc: CUDA IPC enabled between GPU4 and GPU5
INFO: torch-ipc: CUDA IPC enabled between GPU5 and GPU4
INFO: torch-ipc: CUDA IPC enabled between GPU7 and GPU6
INFO: torch-ipc: CUDA IPC enabled between GPU6 and GPU7
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU2
INFO: torch-ipc: CUDA IPC enabled between GPU6 and GPU4
INFO: torch-ipc: CUDA IPC enabled between GPU4 and GPU6

Question about AllReduceEA

From the code and the algorithm presented (https://github.com/twitter/torch-distlearn/blob/master/lua/AllReduceEA.md), it seems like the all-reduce step involves synchronization between workers.

The algorithm published in the original paper does not require such synchronization (sec 3.1: each worker maintains its own clock, t_i). Is AllReduceEA then an algorithm for synchronous EASGD, for which only the formulation was presented in the paper (sec 3)?

If so, are there any comparisons between synchronous and asynchronous EASGD?

Apologies if I have misunderstood this.

RDMA support ?

Does torch-distlearn support RDMA, i.e. GPUDirect?

Async EASGD

I'm not sure if it belongs here, but I haven't found a better way to contact you:

Take a look at the Async EASGD implementations I did using your torch-ipc and torch-distlearn primitives.

Feel free to contact me

Lior

Example using optim?

Can you provide an example of how to use AllReduceSGD with the optim package?

I'm confused about whether or not I should normalize gradParameters as they do in [this example](https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua). The behavior of AllReduceSGD.sumAndNormalizeGradients isn't clear to me, specifically whether (1) it sums and normalizes the accumulated gradients across all nodes, or (2) it takes the batch size given to the torch-dataset package into account.

I'm curious about the last part because you normalize by the per-node batch size in your CIFAR example.

Thanks.
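
A minimal sketch of one way to wire them together, assuming sumAndNormalizeGradients accepts the flattened gradient tensor returned by getParameters() (the README examples pass the table returned by model:parameters(), so this is an assumption), with model, criterion, input, target and tree set up as usual:

local optim = require 'optim'
local allReduceSGD = require 'distlearn.AllReduceSGD'(tree)

local params, gradParams = model:getParameters()
-- Start every node from the same parameter values
allReduceSGD.synchronizeParameters(params)

local optimState = { learningRate = 0.01 }

local function feval()
   gradParams:zero()
   local output = model:forward(input)
   local loss = criterion:forward(output, target)
   model:backward(input, criterion:backward(output, target))
   -- Per the README, this sums each node's gradients and normalizes
   -- the total, so every node applies the same combined gradient.
   -- A criterion that averages over the local batch (the nn default)
   -- needs no extra division by batch size here.
   allReduceSGD.sumAndNormalizeGradients(gradParams)
   return loss, gradParams
end

optim.sgd(feval, params, optimState)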

Do you have any number for the expected speed-up between one GPU and multiple GPUs?

Hi, do you have any numbers for the expected speed-up between one GPU and multiple GPUs? My results for the MNIST and CIFAR-10 datasets are shown below. Do they look correct?

Thank you,
Chien-Lin

MNIST
Input: 1024 (= 1x32x32) dimensions; 60,000 images
Output: 10 dimensions, a sparse vector
Network: 2 Convolution NNs + 1 linear NN
GPU1 total time: 138.024 sec, Batch size: per node = 1, total = 1
GPU2 total time: 108.990 sec, Batch size: per node = 1, total = 2
GPU4 total time:  89.639 sec, Batch size: per node = 1, total = 4
GPU6 total time:  82.023 sec, Batch size: per node = 1, total = 6

CIFAR-10
Input: 3x32x32 dimensions; 50,000 images
Output: 10 dimensions, a sparse vector
Network: 4 Convolution NNs + 1 linear NN
GPU1 total time: 128.232 sec, Batch size: per node = 32, total = 32
GPU2 total time: 117.435 sec, Batch size: per node = 16, total = 32
GPU4 total time:  98.507 sec, Batch size: per node =  8, total = 32

Does this work on AWS?

I tried to get distlearn working between two GPUs on an AWS node but received this error:

INFO: torch-ipc: CUDA IPC not possible between GPU2 and GPU3
INFO: torch-ipc: CUDA IPC not possible between GPU3 and GPU2

The AWS GPUs are virtual; does this preclude them from communicating?

Thanks!

OpenBLAS Warning

First of all, thanks very much for sharing the code. It does help my research.

When I run mnist.lua in the example folder, I get the following warning. Does it matter?
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

Thanks and best,

Jun

Problems with Using CPU as a node

Hello, I'm currently using the following setup:
a 32-core server with 2 Titan X GPUs.

I want to use this package to research and implement asynchronous algorithms for deep learning, with more than 2 workers. My plan was to use 2 nodes for the GPUs and additional nodes running on the CPU.
However, even when I use a single CPU node, it uses all of the free cores, which causes the machine's performance to drop significantly. Using more than one CPU node just splits the cores among the nodes. Have you run into this problem?

  1. Can I decide how many cores are active instead of just setting the number of CPU nodes? If so, can I leave some cores free for memory transfers? Do you think that would fix the drop in performance? (See the sketch after this list.)
  2. Has anyone tried using GPUs and CPU cores together?
  3. This package is built on top of the torch-ipc package. I have more GPUs in other machines and wonder whether I can use them with torch-distlearn. Can I tweak it so it doesn't only run locally?
    I've noticed that the implementation uses LocalhostTree. Can it easily be changed?
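
On point 1, a minimal sketch of capping how many cores a single CPU node uses, via Torch's setnumthreads; whether that alone leaves enough headroom for memory transfers and fixes the slowdown is an assumption:

require 'torch'

-- Limit this process (one CPU node) to a fixed number of
-- OpenMP/BLAS threads instead of letting it grab every free core.
torch.setnumthreads(4)
print('threads in use: ' .. torch.getnumthreads())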

Thanks in advance,
It would be a tremendous help,
Lior

Multi Node Support

Hey guys,
I'm looking to implement this in a multi-node environment managed by SLURM. As far as I can tell, this code has not been developed for such an environment.

I have been able to get the client-server model from https://github.com/twitter/torch-ipc running. Before I go ahead and begin development on my own multi-node implementation of allReduce using that as an example, I wanted to ask whether I'm doing anything incorrectly, or whether there are other existing multi-node Torch implementations of SGD.

Thanks
