
Comments (7)

mrT23 avatar mrT23 commented on August 18, 2024

Please give more details when you open an issue. What have you tried?

Anyway, for DDP use the standard PyTorch multi-GPU command:

```
python -u -m torch.distributed.launch --nproc_per_node=2 train_single_label.py
```

Currently, the script does not save the model. There is nothing special about saving a DDP model, just `torch.save(...)` (on rank == 0 only, of course).
You are welcome to open a merge request to add the saving feature.
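A minimal sketch of that rank-0-only saving, in case it helps anyone writing the merge request (the `save_checkpoint` helper and its arguments are hypothetical, not part of the repo):

```python
import torch
import torch.distributed as dist


def save_checkpoint(model, path):
    """Save a (possibly DDP-wrapped) model's weights from rank 0 only."""
    # Outside a distributed run there is a single process; treat it as rank 0.
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        # Unwrap the DDP container so the checkpoint loads into a plain model.
        module = model.module if hasattr(model, "module") else model
        torch.save(module.state_dict(), path)
```

Under DDP every rank holds an identical copy of the weights, so saving from one rank is sufficient and avoids concurrent writes to the same file.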

from imagenet21k.

jaffe-fly avatar jaffe-fly commented on August 18, 2024

When I use the above command:

```
root@1ebda974bc7a:/home/ImageNet21K# python -u -m torch.distributed.launch --nproc_per_node=2 train_single_label.py --batch_size=4 --data_path=/mnt/dataset --model_name=mobilenetv3_large_100 --model_path=/home/ImageNet21K/mobilenetv3_large_100_miil_21k.pth --epochs=10
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
creating model mobilenetv3_large_100...
creating model mobilenetv3_large_100...
done
done
```

the program seems to block there and not continue

[screenshot of the stalled output]

Is this normal?


Stephen-Hao avatar Stephen-Hao commented on August 18, 2024

By using `python -u -m torch.distributed.launch --nproc_per_node=2 train_single_label.py`, I encounter the same problem.


mrT23 avatar mrT23 commented on August 18, 2024

@jaffe-fly thanks for the feedback.
I changed the order: `torch.distributed.init_process_group` should be called before model initialization.
Please update and let me know if it is fixed.
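For reference, a minimal sketch of that ordering (not the repo's actual code; it uses the `gloo` backend on CPU with single-process defaults so it runs standalone, whereas `torch.distributed.launch` sets the environment variables per process):

```python
import os
import torch.distributed as dist
import torch.nn as nn


def build_ddp_model():
    # torch.distributed.launch sets these per process; the defaults below
    # only let this sketch run as a single standalone process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # 1. Initialize the process group FIRST, before creating the model,
    #    so the ranks rendezvous before any collective work happens.
    dist.init_process_group(backend="gloo")

    # 2. Only then build the model and wrap it in DDP.
    model = nn.Linear(8, 4)
    return nn.parallel.DistributedDataParallel(model)
```

Constructing the model (and especially the DDP wrapper, which broadcasts the initial weights) before the process group exists is what can make the ranks hang the way the screenshot shows.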


Stephen-Hao avatar Stephen-Hao commented on August 18, 2024

[screenshot of the training output]

@mrT23 Now it works well, thanks a lot!


jaffe-fly avatar jaffe-fly commented on August 18, 2024

@mrT23 it's OK now, thank you.

But there is one warning:

UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate


mrT23 avatar mrT23 commented on August 18, 2024

Thanks @jaffe-fly and @Stephen-Hao.

The warning about `lr_scheduler` does not hurt the training results.
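For anyone who wants to silence the warning anyway, the fix is just the call order inside the training loop; a minimal standalone sketch (the model and loss here are placeholders, not the repo's training code):

```python
import torch
from torch import nn, optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate after every epoch.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(2):
    loss = model(torch.randn(8, 4)).sum()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # since PyTorch 1.1.0: step the optimizer first...
    scheduler.step()   # ...then the scheduler, once per epoch
```

Calling `scheduler.step()` before the first `optimizer.step()` only makes PyTorch skip the first value of the schedule, which is why the results are unaffected.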

