Does anyone know if it's possible to distribute the training on several hosts to reduc


Distributing training on several hosts about neuraltalk2 HOT 3 OPEN

karpathy commented on August 24, 2024

Distributing training on several hosts

from neuraltalk2.

Comments (3)

SaMnCo commented on August 24, 2024

Quick additional questions in the same spirit:

There are many options in train.lua. Any advice on the best settings (assuming I have unlimited compute power), and on how each variable influences the quality of the training?
What is the recommended way to minimize the size of the model while keeping acceptable performance?
Does the size of the images impact the model?
I see a "-start_from" option, which makes me think I can improve models and/or build the model iteratively. If I split my training set into subsets and train on each separately, can I aggregate the results somehow? (Note that this would clearly indicate it's possible to scale out.) What would be the potential downsides of this approach?

Many thanks,


dazoulay commented on August 24, 2024

Hi SaMnCo, did you figure any of these questions out? Your input would be greatly appreciated. Thank you.


SaMnCo commented on August 24, 2024

Hi @dazoulay, sorry for the slow answer; I've been out of office for a little while with poor net access. Anyway...

I didn't make much progress on these, but I have some new input:
For the training parameters, I see more and more people using a model to actually learn what the best settings would be. Imagine you orchestrate training runs with various settings, collect results at different points in time, compare them, then learn from that to adjust and converge towards the best settings. It's another layer of ML/DL on top. This seems to be a successful approach, but I haven't tested it myself.
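To make the "train, compare at checkpoints, keep what works" loop concrete, here is a minimal sketch using successive halving, which is a much simpler technique than learning the settings with a second model. Everything in it is an assumption for illustration: `simulated_val_loss` is a hypothetical stand-in for a real training run (it is not neuraltalk2 behaviour), and `learning_rate` is just an example setting.

```python
import math
import random


def simulated_val_loss(config, steps):
    """Hypothetical stand-in for a real training run.

    Returns a made-up validation loss that improves with more steps and is
    lowest when learning_rate is near 1e-3. Replace with a real train/eval call.
    """
    lr_penalty = abs(math.log10(config["learning_rate"]) + 3.0)
    return lr_penalty + 1.0 / (1.0 + steps / 1000.0)


def successive_halving(configs, start_steps=1000, rounds=3):
    """Evaluate all configs on a small budget, keep the better half, repeat."""
    survivors = list(configs)
    steps = start_steps
    for _ in range(rounds):
        # Compare every surviving config at the same point in training.
        scored = sorted(survivors, key=lambda c: simulated_val_loss(c, steps))
        survivors = scored[:max(1, len(scored) // 2)]
        steps *= 2  # survivors get a larger training budget next round
    return survivors[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
candidates = [{"learning_rate": 10 ** rng.uniform(-5, -1)} for _ in range(8)]
best = successive_halving(candidates)
print(best)
```

In a real setup each config would be a separate training job, possibly on its own machine, and the comparison step would read validation scores from the saved checkpoints.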

For the 4th item: essentially, the -start_from option lets you give training an existing model to start from and improve.
Regarding scale-out, as far as I got, you can consider 2 types of scaling:

Train several models in parallel, compare results, keep the best model: this can be thought of as data scaling, as the various models trained on different machines do not communicate.
Use a network of machines to train on the same set. AFAIK, the only frameworks allowing that are TensorFlow, DL4J and Caffe, all using Spark as the underlying engine to scale. The main drawback is that Spark is sort of a "star network", with a central orchestrator making many decisions. That means evaluating and communicating back to the orchestration node can (and will!) become the bottleneck. I submitted the idea of using SDNs to improve communication between nodes, which could help, but again it would be up to the orchestrator to "predict" the best network and set it up. Nevertheless, this seems the most promising for now, until Google releases more of the scale-out aspects of TensorFlow.
Note: the bottleneck here is about speed. If you have all the time in the world, this approach still fixes the "size" issue and lets you go beyond the RAM of your video cards.
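For the second type of scaling, a toy sketch of synchronous data-parallel training with a central orchestrator may help show where the bottleneck comes from. It is plain Python on a one-parameter model, with workers simulated in a single process; the function names and the data are assumptions for illustration and have nothing to do with Spark or neuraltalk2 internals.

```python
def local_gradient(w, shard):
    """Mean gradient of the squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def train_star(shards, steps=200, lr=0.01):
    """Synchronous data parallelism with a central orchestrator (toy version)."""
    w = 0.0  # the single shared parameter, held by the orchestrator
    for _ in range(steps):
        # Each simulated worker computes a gradient on its own shard of the data.
        grads = [local_gradient(w, shard) for shard in shards]
        # Every gradient passes through the orchestrator on every step;
        # with real machines this exchange is the communication hot spot.
        w -= lr * sum(grads) / len(grads)
    return w


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with true w = 3.0
shards = [data[0::2], data[1::2]]                   # two simulated workers
w = train_star(shards)
print(round(w, 3))
```

Note that every step needs one round trip per worker to the central node, so adding workers adds traffic at the orchestrator even though each worker only holds part of the data.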

I am involved in several DL projects at the moment, but moving away from Torch. I may get more info in the upcoming weeks, but won't necessarily update here. Check out my account for DL projects.
