ML Distributed Training

Demo: https://youtu.be/OOPVA-eqBTY

Introduction

This project leverages the power of multiple GPUs to reduce the training time of complex models through data parallelism, using 2 approaches (both cluster layouts are described to TensorFlow via the TF_CONFIG environment variable; see the sketch after this list):

  1. Multi-worker Training using 2 PCs with GeForce RTX GPUs as Workers, connected via:
    • Local area network (LAN).
    • VPN tunnel using OpenVPN (not included in the demo).
  2. Parameter Server Training using 5 machines in a LAN:
    • 2 laptops as Parameter Servers, connected via 5 GHz Wi-Fi.
    • 2 PCs with GeForce RTX GPUs as Workers.
    • 1 CPU-only PC as the Coordinator.
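
A minimal sketch of the TF_CONFIG settings these two layouts imply; the IP addresses and ports are hypothetical placeholders, and each machine sets its own task type and index:

    import json
    import os

    # Multi-worker layout: the 2 GPU PCs (hypothetical LAN addresses).
    # Each machine uses its own "index" (0 on the first PC, 1 on the second).
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["192.168.1.10:12345", "192.168.1.11:12345"],
        },
        "task": {"type": "worker", "index": 0},
    })

    # Parameter-server layout: 2 ps laptops, 2 GPU workers, and the
    # CPU-only coordinator listed as "chief" (addresses again hypothetical).
    ps_tf_config = {
        "cluster": {
            "worker": ["192.168.1.10:12345", "192.168.1.11:12345"],
            "ps": ["192.168.1.20:12345", "192.168.1.21:12345"],
            "chief": ["192.168.1.30:12345"],
        },
        "task": {"type": "chief", "index": 0},  # varies per machine
    }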

Dataset

We used our self-built 30VNFoods dataset, which includes collected and labeled images of 30 famous Vietnamese dishes. The dataset is divided into:

  • 17,581 images for training.
  • 2,515 images for validation.
  • 5,040 images for testing.

In addition, we also used the small TensorFlow flowers dataset of about 3,700 flower images, organized into 5 folders corresponding to 5 types of flowers (daisy, dandelion, roses, sunflowers, tulips).

Setup

Setting             Value
Image size          (224, 224)
Batch size/worker   32
Optimizer           Adam
Learning rate       0.001
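
These settings map directly onto a standard tf.keras input pipeline. A minimal sketch, assuming the 30VNFoods training images sit in a local directory (the path is a placeholder):

    import tensorflow as tf

    # Path is a placeholder; labels are inferred from subdirectory names.
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "30VNFoods/train",
        image_size=(224, 224),  # every image resized to 224 x 224
        batch_size=32,          # batch size per worker (see table)
    )
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)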

The iperf3 tool is used to measure the network bandwidth between machines (run iperf3 -s on one machine and iperf3 -c <server-ip> on another).

1. Multi-worker Training
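
A minimal sketch of the MultiWorkerMirroredStrategy pattern this approach uses, continuing from the Setup sketch above. The same script runs on both GPU PCs, each with its own TF_CONFIG; the model architecture is only a placeholder, as the README does not name the one from the report:

    import tensorflow as tf

    # The strategy reads TF_CONFIG; create it at program start.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    with strategy.scope():
        # Variables created in this scope are replicated on every worker
        # and synchronized with collective all-reduce after each step.
        model = tf.keras.applications.MobileNetV2(
            input_shape=(224, 224, 3), weights=None, classes=30)
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])

    # train_ds as in the Setup sketch; tf.data auto-sharding splits it
    # across the two workers during fit.
    model.fit(train_ds, epochs=10)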

2. Parameter Server Training
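
A minimal coordinator-side sketch, assuming TF_CONFIG is set as in the Introduction and the ps/worker machines are already running servers (see the first issue below for that part). The architecture is again a placeholder, and steps_per_epoch is needed because the workers consume a repeated dataset:

    import tensorflow as tf

    # Runs on the coordinator (the CPU-only PC, task type "chief").
    cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)

    with strategy.scope():
        # Variables are placed on (and sharded across) the 2 ps tasks; the
        # workers fetch them, compute gradients, and send updates back.
        model = tf.keras.applications.MobileNetV2(
            input_shape=(224, 224, 3), weights=None, classes=30)
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])

    def dataset_fn(input_context):
        # Executed on each worker, so the path must exist on every worker;
        # sharding via input_context is sketched in the Issues section.
        return tf.keras.utils.image_dataset_from_directory(
            "30VNFoods/train", image_size=(224, 224), batch_size=32).repeat()

    # DatasetCreator defers dataset creation to the workers; with a repeated
    # dataset, steps_per_epoch must be given (~17,581 images / 32 ≈ 550).
    model.fit(tf.keras.utils.experimental.DatasetCreator(dataset_fn),
              epochs=10, steps_per_epoch=550)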

Results

Training method    Dataset    Connection   Avg. seconds/epoch
Single-worker      flowers    LAN          14
Multi-worker       flowers    LAN          18
Multi-worker       flowers    VPN tunnel   635
Multi-worker       30VNFoods  LAN          184
Parameter Server   30VNFoods  LAN          115

⇒ For more information, see Report.pdf.


Issues

Role of ps and worker

Hey there!
Just wanted to clarify something regarding the code for ps and worker. I've recently started working with this kind of distributed training, so pardon my silly queries.

As far as I understand, the ps serves parameters to the workers, while the latter fetch them. Aside from the difference in tf_config, I've noticed no code for fetching/serving parameters dedicated only to the ps or only to the workers. Both share the same code.

I wanted to know: how do they coordinate with one another?
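
For context, this matches the standard TF2 parameter server pattern: the same script is launched on every machine, and a process's role comes entirely from its TF_CONFIG. The ps and worker processes just start a tf.distribute.Server and block, while only the coordinator drives training. A sketch:

    import tensorflow as tf

    cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

    if cluster_resolver.task_type in ("worker", "ps"):
        # ps and workers run identical code: start a gRPC server and wait.
        # The coordinator then remotely places variables on ps tasks and
        # training steps on worker tasks, so no dedicated fetch/serve code
        # is needed here.
        server = tf.distribute.Server(
            cluster_resolver.cluster_spec(),
            job_name=cluster_resolver.task_type,
            task_index=cluster_resolver.task_id,
            protocol="grpc",
            start=True)
        server.join()  # blocks forever, serving the coordinator's requests
    else:
        # Only the coordinator ("chief") reaches this branch.
        strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)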

A question about parameter server training

Hi, your code really helps! I have one question:
In the coordinator's train_dataset_fn, you use shard to split the data across workers, and the input parameter input_context.input_pipeline_id indicates the worker index, so I would expect every worker to call train_dataset_fn to get its own part of the data. Your code, however, shows only the coordinator using train_dataset_fn.
Can you explain how input_context.input_pipeline_id works?
Thanks!
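
For context, a hedged sketch of how this parameter behaves: the coordinator only defines the function; TensorFlow serializes it and runs it once per worker, filling in a different InputContext on each worker, so input_pipeline_id ends up being that worker's pipeline index:

    import tensorflow as tf

    def train_dataset_fn(input_context):
        # Defined on the coordinator but executed on each worker. Every
        # worker receives a different InputContext, so shard() leaves each
        # worker with a distinct 1/num_input_pipelines slice of the data.
        dataset = tf.data.Dataset.range(1000)  # placeholder dataset
        dataset = dataset.shard(input_context.num_input_pipelines,
                                input_context.input_pipeline_id)
        return dataset.batch(32).repeat()  # 32 = per-worker batch size

    # The coordinator only wraps the function; creation happens worker-side:
    # per_worker_ds = strategy.distribute_datasets_from_function(train_dataset_fn)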
