Comments (4)

dimitar10 commented on June 24, 2024

Hello, yes it is possible to run it on a single GPU. You need to edit train.sh to run a command similar to the following:

nnunet_use_progress_bar=1 CUDA_VISIBLE_DEVICES=0 \
        python3 -m torch.distributed.launch --master_port=4322 --nproc_per_node=1 \
        ./train.py --fold=${fold} --config=$CONFIG --resume='local_latest' --npz

Note the changes to CUDA_VISIBLE_DEVICES and --nproc_per_node compared to the default values. Basically, this will still use the default _DDP trainer if you haven't edited the network_trainer entry in your config file, but it will run on a single GPU. The proper way would be to use a non-DDP trainer, but that requires some more modifications.
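
If you want to double-check that the process really only sees one GPU after exporting CUDA_VISIBLE_DEVICES, here is a minimal sketch (assuming PyTorch is available in your environment, not code from this repo):

import torch

# With CUDA_VISIBLE_DEVICES=0 exported, only one device should be visible
# to the process, and it is re-indexed as cuda:0.
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # the physical GPU selected above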

Also, there might be a typo in train.py: you might need to change the --local-rank argument to --local_rank. At least, that was one of the issues in my case.

Hope this helps.

Heanhu commented on June 24, 2024

Hello, I changed the --local-rank argument to --local_rank, but it still reports an error:
usage: train.py [-h] [--network NETWORK] [--network_trainer NETWORK_TRAINER] [--task TASK] [--task_pretrained TASK_PRETRAINED] [--fold FOLD]
[--model MODEL] [--disable_ds DISABLE_DS] [--resume RESUME] [-val] [-c] [-p P] [--use_compressed_data] [--deterministic]
[--fp32] [--dbs] [--npz] [--valbest] [--vallatest] [--find_lr] [--val_folder VAL_FOLDER] [--disable_saving]
[--disable_postprocessing_on_folds] [-pretrained_weights PRETRAINED_WEIGHTS] [--config FILE] [--batch_size BATCH_SIZE]
[--max_num_epochs MAX_NUM_EPOCHS] [--initial_lr INITIAL_LR] [--min_lr MIN_LR] [--opt_eps EPSILON] [--opt_betas BETA [BETA ...]]
[--weight_decay WEIGHT_DECAY] [--local_rank LOCAL_RANK] [--world-size WORLD_SIZE] [--rank RANK]
[--total_batch_size TOTAL_BATCH_SIZE] [--hdfs_base HDFS_BASE] [--optim_name OPTIM_NAME] [--lrschedule LRSCHEDULE]
[--warmup_epochs WARMUP_EPOCHS] [--val_final] [--is_ssl] [--is_spatial_aug_only] [--mask_ratio MASK_RATIO]
[--loss_name LOSS_NAME] [--plan_update PLAN_UPDATE] [--crop_size CROP_SIZE [CROP_SIZE ...]] [--reclip RECLIP [RECLIP ...]]
[--pretrained] [--disable_decoder] [--model_params MODEL_PARAMS] [--layer_decay LAYER_DECAY] [--drop_path PCT]
[--find_zero_weight_decay] [--n_class N_CLASS]
[--deep_supervision_scales DEEP_SUPERVISION_SCALES [DEEP_SUPERVISION_SCALES ...]] [--fix_ds_net_numpool] [--skip_grad_nan]
[--merge_femur] [--is_sigmoid] [--max_loss_cal MAX_LOSS_CAL]
train.py: error: unrecognized arguments: --local-rank=0
Could you help me?
Thank you.

2DangFilthy commented on June 24, 2024

Hello, I'm facing the same problem. Have you solved it?

dimitar10 commented on June 24, 2024

@Heanhu @2DangFilthy

The argument change to --local_rank in train.py that I suggested,

parser.add_argument("--local-rank", type=int) # must pass

apparently is not necessary. According to argparse's docs, internal hyphens in option names are automatically converted to underscores for the resulting attribute name, so args.local_rank is available either way. Perhaps try deleting any __pycache__ dirs you might have; sometimes these can cause issues. If you are running the train.sh script, it should work.
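
To illustrate what I mean about argparse, here is a minimal standalone sketch (not code from this repo):

import argparse

# argparse converts internal hyphens in an option name to underscores when it
# builds the attribute name, so "--local-rank" is stored as args.local_rank.
parser = argparse.ArgumentParser()
parser.add_argument("--local-rank", type=int, default=0)
args = parser.parse_args(["--local-rank=3"])
print(args.local_rank)  # -> 3

Note that this conversion affects the attribute name, not the flag itself: the spelling used on the command line still has to match whatever the parser declares.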
