Comments (6)

mnslarcher commented on May 17, 2024

Hi @mingxingtan, first of all, thanks for sharing the EfficientDet code. It would be fantastic if you could say when you expect the current code to support training with multiple GPUs. If this will not be added soon, do you have any suggestions on how it could be implemented? I am currently digging deeper into TensorFlow, and I still have no idea how difficult it would be for me to do this myself.

Thanks,
Mario

from automl.

goldwater668 commented on May 17, 2024

I would also like to know how to train with multiple GPUs.

roadcode commented on May 17, 2024

Same question. It seems the code only supports training on a single GPU.

LucasSloan commented on May 17, 2024

I've tried following this tutorial on doing distributed training with estimator. I changed the run config to use a MirroredStrategy like so:

  strategy = tf.distribute.MirroredStrategy()
  run_config = tf.estimator.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      evaluation_master=FLAGS.eval_master,
      model_dir=FLAGS.model_dir,
      log_step_count_steps=FLAGS.iterations_per_loop,
      session_config=config_proto,
      tpu_config=tpu_config,
      train_distribute=strategy,
  )

However, when I ran the code I got this stack trace:

Traceback (most recent call last):
  File "main.py", line 394, in <module>
    tf.app.run(main)
  File "/usr/lib/python3/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/lucas/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/lucas/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "main.py", line 362, in main
    steps=int(FLAGS.num_examples_per_epoch / FLAGS.train_batch_size))
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/lucas/.local/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1159, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1222, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1258, in _actual_train_model_distributed
    input_fn, ModeKeys.TRAIN, strategy)
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1012, in _get_iterator_from_input_fn
    lambda input_context: self._call_input_fn(input_fn, mode,
  File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1050, in make_input_fn_iterator
    input_fn, replication_mode)
  File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 577, in make_input_fn_iterator
    input_fn, replication_mode=replication_mode)
  File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 552, in _make_input_fn_iterator
    self._container_strategy())
  File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/input_lib.py", line 719, in __init__
    result = input_fn(ctx)
  File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1013, in <lambda>
    input_context))

I thought that might be because the InputReader.__call__ function in dataloader.py didn't take an input context, but I fixed that too, following the guide, and got the same stack trace.

LucasSloan commented on May 17, 2024

I tried again on TF 2.1 and got a different error:

  File "main.py", line 394, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "main.py", line 362, in main
    steps=int(FLAGS.num_examples_per_epoch / FLAGS.train_batch_size))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3054, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 149, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3049, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1171, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1234, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1314, in _actual_train_model_distributed
    self.config))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2095, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 763, in _call_for_each_replica
    fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 201, in _call_for_each_replica
    coord.join(threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 986, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 265, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.pyct.error_utils.KeyError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:2884 _call_model_fn  *
        return super(TPUEstimator, self)._call_model_fn(features, labels, mode,
    /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1161 _call_model_fn  *
        model_fn_results = self._model_fn(features=features, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:3144 _model_fn  *
        estimator_spec = model_fn_wrapper.call_without_tpu(
    /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:1678 call_without_tpu  *
        return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
    /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:2011 _call_model_fn  *
        estimator_spec = self._model_fn(features=features, **kwargs)
    /efficientdet/det_model_fn.py:604 efficientdet_model_fn  *
        return _model_fn(
    /efficientdet/det_model_fn.py:402 _model_outputs  *
        return model(features, config=hparams_config.Config(params))
    /efficientdet/efficientdet_arch.py:543 efficientdet  *
        if not config and not model_name:
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/logical.py:28 not_
        if tensor_util.is_tensor(a):
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:1000 is_tensor
        getattr(x, "is_tensor_like", False))
    /efficientdet/hparams_config.py:48 __getattr__
        return self.__dict__[k]

    KeyError: 'is_tensor_like'
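
The last frames of this trace suggest a likely cause: getattr(x, "is_tensor_like", False) only falls back to its default when the lookup raises AttributeError, while the __getattr__ in hparams_config.py indexes self.__dict__ directly and so raises KeyError for unknown names. Below is a minimal sketch of that behavior and one possible fix. The class names BrokenConfig and FixedConfig and the helper probe are hypothetical and are not taken from the repository; only the line `return self.__dict__[k]` comes from the trace.

```python
class BrokenConfig:
    """Hypothetical stand-in for a config whose __getattr__ indexes __dict__."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __getattr__(self, k):
        # Only called when normal lookup fails. A missing key raises KeyError,
        # which getattr(obj, name, default) does not swallow.
        return self.__dict__[k]


class FixedConfig(BrokenConfig):
    """Same config, but missing names raise AttributeError as Python expects."""

    def __getattr__(self, k):
        try:
            return self.__dict__[k]
        except KeyError:
            raise AttributeError(k) from None


def probe(obj):
    """Repeat the lookup from the trace and report what happens."""
    try:
        return getattr(obj, "is_tensor_like", False)
    except KeyError as e:
        return "KeyError: %s" % e


print(probe(BrokenConfig(name="efficientdet-d0")))
print(probe(FixedConfig(name="efficientdet-d0")))
```

With the AttributeError variant, the getattr default is used and the lookup no longer aborts. Whether this alone is enough to make multi-GPU training work is not established in this thread.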

mnslarcher commented on May 17, 2024

Hi @LucasSloan, fsx950223 has an open PR about this and, in another issue in this repo, he said that he was able to train with multiple GPUs. You can find his code in his fork of this repo; I haven't tried it yet. He uses Horovod in his implementation.
