
Comments (7)

Bingrong89 avatar Bingrong89 commented on June 4, 2024 3

Hi, I ran into the same problem, which also stems from the function metric_harness. I could not reproduce it by saving the arrays involved and running the function on them elsewhere.
My simple solution was to turn off the function, since it is not needed for training. :)
Hope you get a better answer than this~

from multinerf.

jonbarron avatar jonbarron commented on June 4, 2024 2

Yeah, the culprit is SSIM computation OOMing. Since SSIM isn't necessary for training I'd just comment out that code (I think there's a flag for disabling SSIM computation too).
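
For anyone who would rather guard the SSIM call than comment it out, here is a minimal sketch of the idea. It is not multinerf's actual harness: the class name `SimpleMetricHarness` and the `compute_ssim` argument are hypothetical, and the PSNR here works on flat lists of floats purely for illustration.

```python
import math


class SimpleMetricHarness:
    """Hypothetical stand-in for a metric harness (not multinerf's class)."""

    def __init__(self, ssim_fn=None, compute_ssim=True):
        # ssim_fn is any callable (pred, gt) -> float; compute_ssim is a
        # made-up switch that lets evaluation skip the memory-hungry call.
        self.ssim_fn = ssim_fn
        self.compute_ssim = compute_ssim and ssim_fn is not None

    def __call__(self, pred, gt):
        # PSNR for values in [0, 1]: -10 * log10(mean squared error).
        mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
        metrics = {"psnr": -10.0 * math.log10(mse) if mse > 0 else float("inf")}
        if self.compute_ssim:
            metrics["ssim"] = float(self.ssim_fn(pred, gt))
        return metrics


def exploding_ssim(pred, gt):
    # Simulates an SSIM implementation that runs out of memory.
    raise MemoryError("simulated OOM in SSIM")


harness = SimpleMetricHarness(ssim_fn=exploding_ssim, compute_ssim=False)
metrics = harness([0.5, 0.5], [0.4, 0.6])
```

With `compute_ssim=False` the failing function is never called, so evaluation still reports PSNR.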


bmild avatar bmild commented on June 4, 2024

Try reducing the batch size by setting Config.batch_size = 4096 (or smaller) in llff_256.gin.


Bingrong89 avatar Bingrong89 commented on June 4, 2024

That worked! Apologies for not thinking of something so basic as batch size.


Beniko95J avatar Beniko95J commented on June 4, 2024

Hi, I also hit an OOM and tried to solve it by setting Config.batch_size = 4096. However, training still fails at 5000/250000, when something like Rendering chunk 0/47 starts. My GPU memory is always almost fully used (e.g. 24046MiB / 24268MiB) no matter what batch size I set. Do you have any idea why this happens?

Thanks!
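
One possible reason memory looks full regardless of batch size: by default JAX preallocates a large fraction of GPU memory at startup, so tools like nvidia-smi report it as used even when the model needs far less. The sketch below sets the documented JAX environment variables for this; the helper name `configure_jax_gpu_memory` is just an illustration, and it has to run before `jax` is first imported. Whether this fixes the OOM here is not confirmed.

```python
import os


def configure_jax_gpu_memory(preallocate=False, mem_fraction=None):
    """Set JAX GPU allocator environment variables (call before importing jax)."""
    # "false" makes JAX allocate GPU memory on demand instead of up front.
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        # Only relevant when preallocation is enabled: the fraction of total
        # GPU memory that JAX reserves at startup.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"


configure_jax_gpu_memory(preallocate=False)
# import jax  # import only after the variables above are set
```

Disabling preallocation can increase fragmentation, so it is mainly useful for checking how much memory training actually needs.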


jonbarron avatar jonbarron commented on June 4, 2024

Try reducing render_chunk_size.
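
To illustrate why a smaller chunk size lowers peak memory during evaluation, here is a rough sketch of chunked rendering. It is not multinerf's rendering code: `render_in_chunks` and `fake_render` are hypothetical names, and rays are plain numbers here. The point is only that at most `chunk_size` rays are processed at once.

```python
def render_in_chunks(rays, render_fn, chunk_size):
    """Apply render_fn to rays in slices of at most chunk_size elements."""
    outputs = []
    for start in range(0, len(rays), chunk_size):
        # Only this slice is passed to the renderer at once, so peak memory
        # scales with chunk_size rather than with the full image.
        chunk = rays[start:start + chunk_size]
        outputs.extend(render_fn(chunk))
    return outputs


chunk_lengths = []


def fake_render(chunk):
    # Stand-in renderer: records the chunk length and doubles each "ray".
    chunk_lengths.append(len(chunk))
    return [2 * r for r in chunk]


image = render_in_chunks(list(range(10)), fake_render, chunk_size=4)
```

The trade-off is more, smaller calls to the renderer, which usually makes evaluation somewhat slower.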


Beniko95J avatar Beniko95J commented on June 4, 2024

Thank you for the reply. I am sorry for the misleading description. Actually, the error happens after Rendering chunk 44/47, so maybe render_chunk_size is okay?

Here is the full log from after the error happened. It contains CUDNN_STATUS_ALLOC_FAILED, so I guess the problem is related to OOM. I tried setting Config.batch_size = 1, but training still eats up all my GPU memory...

5000/250000: loss=0.11740, psnr=15.222, lr=1.82e-03 | data=0.11632, dist=0.00012, inte=0.00097, 56903 r/s
Rendering chunk 0/47
Rendering chunk 4/47
Rendering chunk 8/47
Rendering chunk 12/47
Rendering chunk 16/47
Rendering chunk 20/47
Rendering chunk 24/47
Rendering chunk 28/47
Rendering chunk 32/47
Rendering chunk 36/47
Rendering chunk 40/47
Rendering chunk 44/47
Eval 5000: 10.034s., 78373 rays/sec
2022-08-24 11:00:58.242217: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:727] None of the algorithms provided by cuDNN heuristics worked; trying fallback algorithms.  Conv: (f32[3072,1,758]{2,1,0}, u8[0]{0}) custom-call(f32[3072,1,768]{2,1,0}, f32[1,1,11]{2,1,0}), window={size=11}, dim_labels=bf0_oi0->bf0, custom_call_target="__cudnn$convForward", backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"
Traceback (most recent call last):
  File "/home/users/zjiang/nerf/multinerf/train.py", line 288, in <module>
    app.run(main)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/users/zjiang/nerf/multinerf/train.py", line 242, in main
    metric = metric_harness(
  File "/home/users/zjiang/nerf/multinerf/internal/image.py", line 136, in __call__
    ssim = float(self.ssim_fn(rgb_pred, rgb_gt))
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/api.py", line 527, in cache_miss
    out_flat = xla.xla_call(
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/core.py", line 1937, in bind
    return call_bind(self, fun, *args, **params)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/core.py", line 1953, in call_bind
    outs = top_trace.process_call(primitive, fun_, tracers, params)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/core.py", line 687, in process_call
    return primitive.impl(f, *tracers, **params)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/dispatch.py", line 208, in _xla_call_impl
    compiled_fun = xla_callable(fun, device, backend, name, donated_invars,
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/linear_util.py", line 295, in memoized_fun
    ans = call(fun, *args)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/dispatch.py", line 257, in _xla_callable_uncached
    return lower_xla_callable(fun, device, backend, name, donated_invars, False,
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/dispatch.py", line 849, in compile
    self._executable = XlaCompiledComputation.from_xla_computation(
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/dispatch.py", line 956, in from_xla_computation
    compiled = compile_or_get_cached(backend, xla_computation, options,
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/dispatch.py", line 921, in compile_or_get_cached
    return backend_compile(backend, computation, compile_options, host_callbacks)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/profiler.py", line 294, in wrapper
    return func(*args, **kwargs)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/dispatch.py", line 865, in backend_compile
    return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv = (f32[3072,1,758]{2,1,0}, u8[0]{0}) custom-call(f32[3072,1,768]{2,1,0} %bitcast.4, f32[1,1,11]{2,1,0} %bitcast.5), window={size=11}, dim_labels=bf0_oi0->bf0, custom_call_target="__cudnn$convForward", metadata={op_name="jit(ssim)/jit(main)/vmap(jit(convolve))/jit(_conv)/conv_general_dilated[window_strides=(1,) padding=((0, 0),) lhs_dilation=(1,) rhs_dilation=(1,) dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 1, 2), rhs_spec=(0, 1, 2), out_spec=(0, 1, 2)) feature_group_count=1 batch_group_count=1 lhs_shape=(3072, 1, 768) rhs_shape=(1, 1, 11) precision=(<Precision.HIGHEST: 2>, <Precision.HIGHEST: 2>) preferred_element_type=None]" source_file="/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/dm_pix/_src/metrics.py" source_line=177}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"

Original error: INTERNAL: All algorithms tried for %cudnn-conv = (f32[3072,1,758]{2,1,0}, u8[0]{0}) custom-call(f32[3072,1,768]{2,1,0} %bitcast.4, f32[1,1,11]{2,1,0} %bitcast.5), window={size=11}, dim_labels=bf0_oi0->bf0, custom_call_target="__cudnn$convForward", metadata={op_name="jit(ssim)/jit(main)/vmap(jit(convolve))/jit(_conv)/conv_general_dilated[window_strides=(1,) padding=((0, 0),) lhs_dilation=(1,) rhs_dilation=(1,) dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 1, 2), rhs_spec=(0, 1, 2), out_spec=(0, 1, 2)) feature_group_count=1 batch_group_count=1 lhs_shape=(3072, 1, 768) rhs_shape=(1, 1, 11) precision=(<Precision.HIGHEST: 2>, <Precision.HIGHEST: 2>) preferred_element_type=None]" source_file="/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/dm_pix/_src/metrics.py" source_line=177}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm.  Per-algorithm errors:
  Profiling failure on cuDNN engine eng1{k2=4,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k2=2,k19=0,k4=0,k5=0,k6=0,k7=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k2=2,k19=0,k4=1,k5=0,k6=0,k7=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng0{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k2=4,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k3=0,k2=3}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k2=1,k19=0,k4=0,k5=1,k6=0,k7=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k5=1,k6=0,k7=0,k2=1,k19=0,k4=2}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng48{k2=15,k13=0,k14=0,k6=0,k22=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng4{}: UNKNOWN: CUDNN_STATUS_INTERNAL_ERROR
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k2=0,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng1{k3=0,k2=2}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k7=0,k2=0,k19=0,k4=0,k5=1,k6=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng1{k2=1,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k2=1,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng0{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng1{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/users/zjiang/nerf/multinerf/train.py", line 288, in <module>
    app.run(main)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/users/zjiang/nerf/multinerf/train.py", line 242, in main
    metric = metric_harness(
  File "/home/users/zjiang/nerf/multinerf/internal/image.py", line 136, in __call__
    ssim = float(self.ssim_fn(rgb_pred, rgb_gt))
jaxlib.xla_extension.XlaRuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv = (f32[3072,1,758]{2,1,0}, u8[0]{0}) custom-call(f32[3072,1,768]{2,1,0} %bitcast.4, f32[1,1,11]{2,1,0} %bitcast.5), window={size=11}, dim_labels=bf0_oi0->bf0, custom_call_target="__cudnn$convForward", metadata={op_name="jit(ssim)/jit(main)/vmap(jit(convolve))/jit(_conv)/conv_general_dilated[window_strides=(1,) padding=((0, 0),) lhs_dilation=(1,) rhs_dilation=(1,) dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 1, 2), rhs_spec=(0, 1, 2), out_spec=(0, 1, 2)) feature_group_count=1 batch_group_count=1 lhs_shape=(3072, 1, 768) rhs_shape=(1, 1, 11) precision=(<Precision.HIGHEST: 2>, <Precision.HIGHEST: 2>) preferred_element_type=None]" source_file="/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/dm_pix/_src/metrics.py" source_line=177}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"

Original error: INTERNAL: All algorithms tried for %cudnn-conv = (f32[3072,1,758]{2,1,0}, u8[0]{0}) custom-call(f32[3072,1,768]{2,1,0} %bitcast.4, f32[1,1,11]{2,1,0} %bitcast.5), window={size=11}, dim_labels=bf0_oi0->bf0, custom_call_target="__cudnn$convForward", metadata={op_name="jit(ssim)/jit(main)/vmap(jit(convolve))/jit(_conv)/conv_general_dilated[window_strides=(1,) padding=((0, 0),) lhs_dilation=(1,) rhs_dilation=(1,) dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 1, 2), rhs_spec=(0, 1, 2), out_spec=(0, 1, 2)) feature_group_count=1 batch_group_count=1 lhs_shape=(3072, 1, 768) rhs_shape=(1, 1, 11) precision=(<Precision.HIGHEST: 2>, <Precision.HIGHEST: 2>) preferred_element_type=None]" source_file="/home/users/zjiang/anaconda3/envs/multinerf/lib/python3.9/site-packages/dm_pix/_src/metrics.py" source_line=177}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm.  Per-algorithm errors:
  Profiling failure on cuDNN engine eng1{k2=4,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k2=2,k19=0,k4=0,k5=0,k6=0,k7=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k2=2,k19=0,k4=1,k5=0,k6=0,k7=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng0{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k2=4,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k3=0,k2=3}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k2=1,k19=0,k4=0,k5=1,k6=0,k7=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k5=1,k6=0,k7=0,k2=1,k19=0,k4=2}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng48{k2=15,k13=0,k14=0,k6=0,k22=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng4{}: UNKNOWN: CUDNN_STATUS_INTERNAL_ERROR
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k2=0,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng1{k3=0,k2=2}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng34{k7=0,k2=0,k19=0,k4=0,k5=1,k6=0}: UNKNOWN: CUDNN_STATUS_EXECUTION_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng1{k2=1,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{k2=1,k3=0}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng0{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng1{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'
  Profiling failure on cuDNN engine eng28{}: UNKNOWN: CUDNN_STATUS_ALLOC_FAILED
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc(4369): 'status'

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.
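
The last line of the log suggests a flag, and it can be set from Python before JAX is imported. The sketch below appends it to any existing XLA_FLAGS value; the helper name `add_xla_flag` is hypothetical. Note that the failing convolution input is only f32[3072,1,768] (roughly 9 MB), which suggests the GPU was already nearly full when SSIM ran, so it is not certain that a fallback algorithm alone avoids the error.

```python
import os

# Flag quoted in the XLA error message above.
STRICT_PICKER_OFF = "--xla_gpu_strict_conv_algorithm_picker=false"


def add_xla_flag(flag, env=os.environ):
    """Append flag to XLA_FLAGS in env, keeping any flags already present."""
    existing = env.get("XLA_FLAGS", "").split()
    if flag not in existing:
        existing.append(flag)
    env["XLA_FLAGS"] = " ".join(existing)
    return env["XLA_FLAGS"]


add_xla_flag(STRICT_PICKER_OFF)
# import jax  # import only after XLA_FLAGS has been set
```

Equivalently, the variable can be exported in the shell before launching train.py.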

