
Comments (7)

mrT23 avatar mrT23 commented on August 18, 2024

Please give more details when you open an issue. What have you tried?

Anyway, for DDP use the standard PyTorch multi-GPU command:

```
python -u -m torch.distributed.launch --nproc_per_node=2 train_single_label.py
```

Currently, the script does not save the model. There is nothing special about saving a DDP model, just `torch.save(...)` (on rank == 0 only, of course).
You are welcome to open a merge request to add the saving feature.
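A minimal sketch of that rank-0-only saving, in case it helps anyone writing the merge request (the `save_checkpoint` helper and its arguments are hypothetical, not part of the repo):

```python
import torch
import torch.distributed as dist


def save_checkpoint(model, path):
    """Save a (possibly DDP-wrapped) model's weights from rank 0 only."""
    # Outside a distributed run there is a single process; treat it as rank 0.
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        # Unwrap the DDP container so the checkpoint loads into a plain model.
        module = model.module if hasattr(model, "module") else model
        torch.save(module.state_dict(), path)
```

Under DDP every rank holds an identical copy of the weights, so saving from one rank is sufficient and avoids concurrent writes to the same file.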

from imagenet21k.

jaffe-fly avatar jaffe-fly commented on August 18, 2024

When I use the above command:

```
root@1ebda974bc7a:/home/ImageNet21K# python -u -m torch.distributed.launch --nproc_per_node=2 train_single_label.py --batch_size=4 --data_path=/mnt/dataset --model_name=mobilenetv3_large_100 --model_path=/home/ImageNet21K/mobilenetv3_large_100_miil_21k.pth --epochs=10
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
creating model mobilenetv3_large_100...
creating model mobilenetv3_large_100...
done
done
```

the program seems to block there and not continue

[screenshot of the stalled output]

Is this normal?


Stephen-Hao avatar Stephen-Hao commented on August 18, 2024

By using `python -u -m torch.distributed.launch --nproc_per_node=2 train_single_label.py`, I encounter the same problem.


mrT23 avatar mrT23 commented on August 18, 2024

@jaffe-fly thanks for the feedback.
I changed the order: `torch.distributed.init_process_group` should be called before model initialization.
Please update and let me know if it is fixed.
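For reference, a minimal sketch of that ordering (not the repo's actual code; it uses the `gloo` backend on CPU with single-process defaults so it runs standalone, whereas `torch.distributed.launch` sets the environment variables per process):

```python
import os
import torch.distributed as dist
import torch.nn as nn


def build_ddp_model():
    # torch.distributed.launch sets these per process; the defaults below
    # only let this sketch run as a single standalone process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # 1. Initialize the process group FIRST, before creating the model,
    #    so the ranks rendezvous before any collective work happens.
    dist.init_process_group(backend="gloo")

    # 2. Only then build the model and wrap it in DDP.
    model = nn.Linear(8, 4)
    return nn.parallel.DistributedDataParallel(model)
```

Constructing the model (and especially the DDP wrapper, which broadcasts the initial weights) before the process group exists is what can make the ranks hang the way the screenshot shows.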


Stephen-Hao avatar Stephen-Hao commented on August 18, 2024

[screenshot of the training output]

@mrT23 Now it works well, thanks a lot!


jaffe-fly avatar jaffe-fly commented on August 18, 2024

@mrT23 it's OK now, thank you.

But there is one warning:

UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate


mrT23 avatar mrT23 commented on August 18, 2024

Thanks @jaffe-fly and @Stephen-Hao.

The warning about `lr_scheduler` does not hurt the training results.
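For anyone who wants to silence the warning anyway, the fix is just the call order inside the training loop; a minimal standalone sketch (the model and loss here are placeholders, not the repo's training code):

```python
import torch
from torch import nn, optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate after every epoch.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(2):
    loss = model(torch.randn(8, 4)).sum()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # since PyTorch 1.1.0: step the optimizer first...
    scheduler.step()   # ...then the scheduler, once per epoch
```

Calling `scheduler.step()` before the first `optimizer.step()` only makes PyTorch skip the first value of the schedule, which is why the results are unaffected.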

