Hi there, I want to run distributed training on two servers, each of which has 4 GPUs.

FileNotFoundError Issues while running on 2 nodes (megatron-lm, CLOSED)

nvidia commented on May 24, 2024

FileNotFoundError Issues while running on 2 nodes

Comments (4)

weigao266 commented on May 24, 2024

Hi there, I have run into exactly the same problem. Did you find a way to solve it? For now I have to manually copy the *.npy files from node 0 to node 1 twice whenever I run a new model.

zarzen commented on May 24, 2024

I think the system assumes the cluster is running on a shared file system like Lustre, so that every node can read the index files written by node 0.
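If it helps to confirm that this is what is going on, here is a minimal sketch (not part of Megatron-LM) that can be run on each node to see which cache files are actually visible there. The function name `missing_index_files`, the cache directory, and the file names are all hypothetical placeholders; substitute the paths from your own FileNotFoundError.

```python
from pathlib import Path


def missing_index_files(cache_dir, expected_names):
    """Return the expected file names that are NOT visible under cache_dir on this node.

    cache_dir and expected_names are placeholders: pass the directory and the
    .npy file names that appear in your own error message.
    """
    cache = Path(cache_dir)
    return [name for name in expected_names if not (cache / name).is_file()]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    missing = missing_index_files("/path/to/dataset/cache", ["example_index.npy"])
    print("missing on this node:", missing)
```

If node 0 reports nothing missing while node 1 lists the files, the nodes are not sharing the directory, which would match the error.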

japarada commented on May 24, 2024

Did you ever find an answer to this problem? Thanks

mfdj2002 commented on May 24, 2024

If you don't plan to use a shared file system like Lustre, you'd have to manually copy the dataset cache to all your nodes and comment out self.unique_description_hash in megatron_dataset.py; otherwise it will try to load a new set of .npy files every time. Hope it helps!
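For the copying part, something like the sketch below could replace doing it by hand. It assumes rsync and passwordless SSH are available between the nodes, and that the cache lives at the same path on every node; the host names and the cache path are hypothetical. It does not touch megatron_dataset.py, so the hash change described above still has to be made separately.

```python
import subprocess


def build_rsync_commands(cache_dir, hosts):
    """Build one rsync command per host that mirrors cache_dir to the same path there."""
    # A trailing slash makes rsync copy the directory contents rather than
    # nesting the directory inside the destination.
    src = cache_dir.rstrip("/") + "/"
    return [["rsync", "-a", src, f"{host}:{src}"] for host in hosts]


def sync_cache(cache_dir, hosts):
    """Run the rsync commands from node 0; raises CalledProcessError if a copy fails."""
    for cmd in build_rsync_commands(cache_dir, hosts):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical cache path and host name; replace with your own.
    for cmd in build_rsync_commands("/path/to/dataset/cache", ["node1"]):
        print(" ".join(cmd))
```

Run it on node 0 after the cache has been built there, and call `sync_cache` instead of printing once the commands look right.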
