
Can this code train with GPU？ about automl HOT 11 CLOSED

google commented on May 17, 2024 47

Can this code be trained on a GPU?

from automl.

Comments (11)

mingxingtan commented on May 17, 2024 45

Some command line examples

Train on GPU:

python main.py --training_file_pattern=/coco_tfrecord/train* --model_name=efficientdet-d0 --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False

Eval on GPU:

# Assuming /tmp/efficientdet-d0/ contains your checkpoint.
python main.py --mode=eval --model_name=efficientdet-d0 --model_dir=/tmp/efficientdet-d0/ --validation_file_pattern=/coco_tfrecord/val* --val_json_file=/coco_tfrecord/instances_val2017.json --hparams="use_bfloat16=false" --use_tpu=False

Run inference on a single image:

# pip install pytype pycocotools
python model_inspect.py --runmode=infer --model_name=efficientdet-d0 --ckpt_path=/tmp/efficientdet-d0/ --input_image=/tmp/img1.jpg --output_image_dir=/tmp/det1/

I will add a tutorial colab soon.
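
For anyone scripting these runs, the training command above can also be assembled in Python. This is only a minimal sketch: `build_train_command` is a hypothetical helper, not part of the repo, and it simply mirrors the flags quoted in this comment.

```python
import shlex


def build_train_command(train_pattern, model_name, model_dir, use_tpu=False):
    """Assemble the main.py training command as an argument list.

    Hypothetical helper: it only mirrors the flags quoted in the comment above.
    """
    return [
        "python",
        "main.py",
        f"--training_file_pattern={train_pattern}",
        f"--model_name={model_name}",
        f"--model_dir={model_dir}",
        "--hparams=use_bfloat16=false",
        f"--use_tpu={use_tpu}",
    ]


cmd = build_train_command(
    train_pattern="/coco_tfrecord/train*",
    model_name="efficientdet-d0",
    model_dir="/tmp/efficientnet/",
)

# shlex.quote keeps the glob pattern from being expanded if the line is pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues with the `train*` pattern.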


kaikaizhu commented on May 17, 2024 10

If I only have GPUs, can I use the trained weights provided by this project to test my own pictures?


airqj commented on May 17, 2024 9

@mingxingtan
The flag "--use_tpu=False" makes training use the CPU instead of the TPU, and it is very slow.
Do we need to change some code to train EfficientDet on a GPU?


bhack commented on May 17, 2024 2

And Edge TPU (Coral)? Also, will it be available on TF Hub?


mingxingtan commented on May 17, 2024 1

@liminghuiv that comment was wrong and I have just fixed it. Estimator will automatically use the GPU if you have one; otherwise it uses the CPU.
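
A quick way to check whether TensorFlow can see a GPU at all, before starting a run, is sketched below. `visible_gpus` is a hypothetical helper written for this thread; it assumes only that TensorFlow may or may not be installed, and it falls back to the `tf.config.experimental` namespace for older releases such as 2.0.

```python
def visible_gpus():
    """Return the GPU devices TensorFlow can see, or [] if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Newer releases expose tf.config.list_physical_devices; older ones
    # (for example 2.0) only have it under tf.config.experimental.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        list_devices = tf.config.experimental.list_physical_devices
    return list(list_devices("GPU"))


gpus = visible_gpus()
print(f"{len(gpus)} GPU(s) visible to TensorFlow")
```

An empty list usually points at the installation (a CPU-only TensorFlow build, or missing CUDA libraries) rather than at the flags passed to main.py.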


hoangphucITJP commented on May 17, 2024

@kaikaizhu, I guess you can, according to https://cloud.google.com/tpu/docs/using-estimator-api:

Models written using TPUEstimator work across CPUs, GPUs, single TPU devices, and whole TPU pods, generally with no code changes.

and the TPUEstimator is used in this repo:
https://github.com/google/automl/blob/master/efficientdet/main.py#L239


liminghuiv commented on May 17, 2024

I am also interested in training with a GPU. Is there any tutorial? Thanks a lot.


Jilliansea commented on May 17, 2024

@mingxingtan Hi, I want to run detection on a list of images given in a 'txt' file. I changed the code of build_input, but it still fails in post-processing. Because the inference batch size is 1, when all images are sent to the model they are still treated as one batch, so the number of anchors becomes larger than the index...
So, could you please publish inference code that converts the ckpt to a pb and runs inference with the pb model on multiple images?


liminghuiv commented on May 17, 2024

Hi @mingxingtan , thanks for taking a look at it. I checked the efficientdet/main.py code:
line 53:
flags.DEFINE_bool('use_tpu', True, 'Use TPUs rather than CPUs')
It seems that it will use the CPU instead of the GPU if we set use_tpu to False.


ruodingt commented on May 17, 2024

Hi @mingxingtan
Thanks so much for sharing your fantastic work.

I have a similar problem: the TPU estimator does not train on my GPU.
I am using TensorFlow 2.0.0 (a Docker image from the official TF Docker Hub).

Although the system has a V100 GPU, it still trains only on the CPU.
Could you give me some tips?

Thank you.


I0529 03:07:38.554332 139669136197440 main.py:383] {'name': 'efficientdet-d0', 'act_type': 'swish', 'image_size': (512, 512), 'input_rand_hflip': True, 'train_scale_min': 0.1, 'train_scale_max': 2.0, 'autoaugment_policy': None, 'use_augmix': False, 'augmix_params': (3, -1, 1), 'num_classes': 20, 'skip_crowd_during_training': True, 'label_id_mapping': None, 'min_level': 3, 'max_level': 7, 'num_scales': 3, 'aspect_ratios': [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)], 'anchor_scale': 4.0, 'is_training_bn': True, 'momentum': 0.9, 'optimizer': 'sgd', 'learning_rate': 0.08, 'lr_warmup_init': 0.008, 'lr_warmup_epoch': 1.0, 'first_lr_drop_epoch': 200.0, 'second_lr_drop_epoch': 250.0, 'poly_lr_power': 0.9, 'clip_gradients_norm': 10.0, 'num_epochs': 18000, 'data_format': 'channels_last', 'alpha': 0.25, 'gamma': 1.5, 'delta': 0.1, 'box_loss_weight': 50.0, 'iou_loss_type': None, 'iou_loss_weight': 1.0, 'weight_decay': 4e-05, 'strategy': '', 'precision': None, 'box_class_repeats': 3, 'fpn_cell_repeats': 3, 'fpn_num_filters': 64, 'separable_conv': True, 'apply_bn_for_resampling': True, 'conv_after_downsample': False, 'conv_bn_act_pattern': False, 'use_native_resize_op': True, 'pooling_type': None, 'fpn_name': None, 'fpn_weight_method': None, 'fpn_config': None, 'survival_prob': None, 'lr_decay_method': 'cosine', 'moving_average_decay': 0.9998, 'ckpt_var_scope': None, 'var_exclude_expr': '.*/class-predict/.*', 'backbone_name': 'efficientnet-b0', 'backbone_config': None, 'var_freeze_expr': None, 'resnet_depth': 50, 'model_name': 'efficientdet-d0', 'iterations_per_loop': 100, 'model_dir': '../output/exp-001-baseline-d0', 'num_shards': 1, 'num_examples_per_epoch': 2000, 'backbone_ckpt': '/home/appuser/project/pretrained/efficientnet-b0', 'ckpt': None, 'val_json_file': None, 'testdev_dir': None, 'mode': 'train_and_eval', 'DATA_CONF': {'CATEGORIES_IN_RANGE': ['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external', 'stain-tooth-internal'], 'EVAL_SCOPE': 
['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external', 'stain-tooth-internal'], 'METRIC_SCOPE': ['calculus_tooth', 'tooth-decay', 'tooth-whitespot', 'gum-gingivitis', 'stain-tooth-external'], 'EVAL': ['coco_stack_out/user-data-2020-Apr-R3_B10M3-34.json'], 'IMAGE_BASEDIR': '../data/', 'SUB_MASK_CATEGORY': ['calculus'], 'TRAIN': ['coco_stack_out/web_decay_600-26-full.json', 'coco_stack_out/gingivitis_web_490-31-full.json', 'coco_stack_out/calculus_web_230-28-full.json', 'coco_stack_out/mturk_mar_2020r-30-full.json', 'coco_stack_out/legacy_decay-25-full.json', 'coco_stack_out/mturk50_mar16_ro-37.json', 'coco_stack_out/tooth_crawl_web_A-36.json', 'coco_stack_out/spotty_stain_web_A-35.json']}}
I0529 03:07:38.554489 139669136197440 main.py:274] Starting training cycle, epoch: 0 / 18000.
INFO:tensorflow:Using config: {'_model_dir': '../output/exp-001-baseline-d0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0729556cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=1, num_cores_per_replica=8, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=[[1, 4, 2, 1], {'mean_num_positives': None, 'source_ids': None, 'groundtruth_data': None, 'image_scales': None, 'box_targets_3': [1, 4, 2, 1], 'cls_targets_3': [1, 4, 2, 1], 'box_targets_4': [1, 4, 2, 1], 'cls_targets_4': [1, 4, 2, 1], 'box_targets_5': [1, 4, 2, 1], 'cls_targets_5': [1, 4, 2, 1], 'box_targets_6': [1, 4, 2, 1], 'cls_targets_6': [1, 4, 2, 1], 'box_targets_7': [1, 4, 2, 1], 'cls_targets_7': [1, 4, 2, 1]}], eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0529 03:07:38.554966 139669136197440 estimator.py:212] Using config: {'_model_dir': '../output/exp-001-baseline-d0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0729556cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=1, num_cores_per_replica=8, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=[[1, 4, 2, 1], {'mean_num_positives': None, 'source_ids': None, 'groundtruth_data': None, 'image_scales': None, 'box_targets_3': [1, 4, 2, 1], 'cls_targets_3': [1, 4, 2, 1], 'box_targets_4': [1, 4, 2, 1], 'cls_targets_4': [1, 4, 2, 1], 'box_targets_5': [1, 4, 2, 1], 'cls_targets_5': [1, 4, 2, 1], 'box_targets_6': [1, 4, 2, 1], 'cls_targets_6': [1, 4, 2, 1], 'box_targets_7': [1, 4, 2, 1], 'cls_targets_7': [1, 4, 2, 1]}], eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0529 03:07:38.555698 139669136197440 tpu_context.py:221] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W0529 03:07:38.556049 139669136197440 tpu_context.py:223] eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0529 03:07:38.561795 139669136197440 deprecation.py:506] From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0529 03:07:38.562173 139669136197440 deprecation.py:323] From /home/appuser/.local/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2020-05-29 03:07:38.570108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-29 03:07:38.573494: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.574404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-05-29 03:07:38.574616: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-05-29 03:07:38.575953: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-05-29 03:07:38.577151: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-05-29 03:07:38.577451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-05-29 03:07:38.579049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-05-29 03:07:38.580268: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-05-29 03:07:38.584087: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-29 03:07:38.584177: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.585136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-29 03:07:38.586012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
WARNING:tensorflow:From /home/appuser/project/efficientdet/dataloader.py:344: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
W0529 03:07:38.623125 139669136197440 deprecation.py:323] From /home/appuser/project/efficientdet/dataloader.py:344: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
WARNING:tensorflow:Entity <function InputReader.__call__.<locals>._dataset_parser at 0x7f073ebbabf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: LIVE_VARS_IN
W0529 03:07:39.093873 139669136197440 ag_logging.py:146] Entity <function InputReader.__call__.<locals>._dataset_parser at 0x7f073ebbabf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: LIVE_VARS_IN
INFO:tensorflow:Calling model_fn.
I0529 03:07:39.843184 139669136197440 estimator.py:1147] Calling model_fn.
INFO:tensorflow:Running train on CPU
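
The device-related lines are easy to lose in a log this long. The sketch below pulls them out with regular expressions. `summarize_device_log` is a hypothetical helper, and the patterns are based only on the message formats visible in the log above, so they may need adjusting for other TensorFlow versions.

```python
import re


def summarize_device_log(log_text):
    """Extract GPU discovery and run-mode messages from a TensorFlow log.

    The patterns assume the message formats shown in the log above.
    """
    gpu_names = re.findall(r"^name: (.+?) major:", log_text, flags=re.MULTILINE)
    visible = re.findall(r"Adding visible gpu devices: ([\d, ]+)", log_text)
    run_modes = re.findall(r"Running (\w+) on (\S+)", log_text)
    return {
        "gpu_names": gpu_names,
        "visible_gpu_ids": [ids.strip() for ids in visible],
        "run_modes": run_modes,
    }


# Lines copied from the log above, shortened to the relevant messages.
sample_log = """\
2020-05-29 03:07:38.574404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-05-29 03:07:38.586012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
INFO:tensorflow:Running train on CPU
"""

summary = summarize_device_log(sample_log)
for key, value in summary.items():
    print(key, value)
```

In this log, TensorFlow does find the V100 and adds it as a visible device, while the Estimator message still reports "on CPU". The parsed lines alone do not show where the ops were actually placed, so checking GPU utilization during training (for example with nvidia-smi) would be a reasonable next step.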


linkrain-a commented on May 17, 2024

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

Find your answer through this link:
https://stackoverflow.com/questions/44232898/memoryerror-in-tensorflow-and-successful-numa-node-read-from-sysfs-had-negativ
