TPU type: v3-8


Tests & training fail on Google TPU VM (multinerf)

google-research commented on June 4, 2024


Comments (4)

Palisand commented on June 4, 2024

Here is what I ran in the VM, up to and including the unit test run:

sudo apt update
sudo apt install -y wget git

# miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
bash Miniconda3-py39_4.12.0-Linux-x86_64.sh
source .bashrc

# COLMAP - https://colmap.github.io/install.html

sudo apt install -y \
    cmake \
    build-essential \
    libboost-program-options-dev \
    libboost-filesystem-dev \
    libboost-graph-dev \
    libboost-system-dev \
    libboost-test-dev \
    libeigen3-dev \
    libsuitesparse-dev \
    libfreeimage-dev \
    libmetis-dev \
    libgoogle-glog-dev \
    libgflags-dev \
    libglew-dev \
    qtbase5-dev \
    libqt5opengl5-dev \
    libcgal-dev

sudo apt install -y libatlas-base-dev libsuitesparse-dev
git clone https://ceres-solver.googlesource.com/ceres-solver
cd ceres-solver
git checkout $(git describe --tags) # Checkout the latest release
mkdir build
cd build
cmake .. -DBUILD_TESTING=OFF -DBUILD_EXAMPLES=OFF
make -j
sudo make install

cd ~
git clone https://github.com/colmap/colmap.git
cd colmap
git checkout dev
mkdir build
cd build
cmake ..
make -j
sudo make install
colmap -h

## Install & configure MultiNERF

cd ~
git clone https://github.com/google-research/multinerf.git
cd multinerf
conda create --name multinerf python=3.9
conda activate multinerf
conda install pip
pip install --upgrade pip
pip install -r requirements.txt
pip install tensorflow==2.9.1  # match TPU software version
git clone https://github.com/rmbrualla/pycolmap.git ./internal/pycolmap
./scripts/run_all_unit_tests.sh


jonbarron commented on June 4, 2024

How are you running this on a Google TPU? We train our models on Google TPUs, but through the internal interface, which differs from the publicly available one. I don't think this code has been run through the external interface yet. Have you verified that you can run other models on the TPUs you're using? The issue seems to sit at a lower level than this codebase, maybe a jax/cuda/driver issue?
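One quick way to check the lower-level stack before involving this codebase is to ask JAX which backend it picked up and to run a trivial computation on it. The sketch below is only an illustration: the helper name `describe_backend` is made up for this example, and it assumes nothing beyond the public `jax.devices()` and `jax.default_backend()` calls. On a working v3-8 TPU VM one would expect the backend to be reported as `tpu` with 8 devices.

```python
def describe_backend():
    """Summarize the backend JAX selected, or report that JAX is not importable.

    Hypothetical helper for illustration; not part of multinerf.
    """
    try:
        import jax
        import jax.numpy as jnp
    except ImportError:
        return {"jax_installed": False}

    devices = jax.devices()
    # A tiny computation forces the backend to compile and execute something.
    checksum = float(jnp.ones((8, 8)).sum())
    return {
        "jax_installed": True,
        "backend": jax.default_backend(),  # "tpu" expected on a working TPU VM
        "device_count": len(devices),
        "devices": [str(d) for d in devices],
        "checksum": checksum,  # sum of an 8x8 array of ones
    }


if __name__ == "__main__":
    for key, value in describe_backend().items():
        print(f"{key}: {value}")
```

If this reports `cpu`, or fails before printing anything, the problem is likely in the jax/libtpu installation rather than in the multinerf code.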


Palisand commented on June 4, 2024

Ah, I see. I am using the publicly available interface, following Google's Cloud TPU documentation. I haven't verified other models.

To create the TPU VM, I ran:

gcloud config set project multinerf
gcloud services enable tpu.googleapis.com
gcloud beta services identity create --service tpu.googleapis.com
gcloud alpha compute tpus tpu-vm create tpu-multinerf --zone us-central1-b --accelerator-type v3-8 --version tpu-vm-tf-2.9.1

I then SSHed into the VM:

gcloud alpha compute tpus tpu-vm ssh tpu-multinerf --zone us-central1-b

Then I ran the commands listed in my first comment.

Before using the TPU VM, I tested these commands locally, in a Docker container running Ubuntu 20.04 (just like the VM). The tests succeeded in the container.


Palisand commented on June 4, 2024

I tried again from scratch. This time, I removed jax, jaxlib, and tensorflow from requirements.txt and then I ran:

pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install tensorflow==2.9.1
pip install -r requirements.txt

The jax install command comes from https://github.com/google/jax/#pip-installation-google-cloud-tpu
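Since the install order changed here, it may help to record which versions actually ended up in the environment. A minimal sketch using only the standard library follows. The helper name `installed_versions` is hypothetical, and the package list is just the set of names mentioned in this thread.

```python
from importlib import metadata


def installed_versions(packages=("jax", "jaxlib", "tensorflow", "numpy")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the `multinerf` conda environment and pasting the output alongside the test log makes it easier to compare against another setup.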

Some tests still fail, but at least they're not aborted. Here's some partial test output:

FAIL: test_construct_ray_warps_extents_log (tests.coord_test.CoordTest)
tests.coord_test.CoordTest.test_construct_ray_warps_extents_log
test_construct_ray_warps_extents_log(<CompiledFunction of <function _one_to_one_unop.<locals>.<lambda> at 0x7faae18e0e50>>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, *testcase_params)
  File "/home/palisand/multinerf/tests/coord_test.py", line 194, in test_construct_ray_warps_extents
    np.testing.assert_allclose(
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-05

Mismatched elements: 43 / 100 (43%)
Max absolute difference: 0.00045204
Max relative difference: 7.56597e-05
 x: array([ 2.400275,  1.668342,  2.044059,  6.927439,  1.078879,  0.115673,
        0.290508,  0.686725,  0.240134,  0.186716,  0.534207,  0.409191,
        2.090983,  0.41522 ,  0.722983,  1.309822,  0.97231 ,  0.64675 ,...
 y: array([ 2.400219,  1.668289,  2.044053,  6.927345,  1.078887,  0.115672,
        0.290508,  0.686718,  0.240134,  0.186729,  0.534207,  0.409187,
        2.090986,  0.415209,  0.722943,  1.309913,  0.972313,  0.646793,...

======================================================================
FAIL: test_pos_enc_25_2 (tests.coord_test.CoordTest)
tests.coord_test.CoordTest.test_pos_enc_25_2
test_pos_enc_25_2(25, 2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, *testcase_params)
  File "/home/palisand/multinerf/tests/coord_test.py", line 127, in test_pos_enc
    self.assertLess(max_err, tol)
AssertionError: 2.3317099 not less than 2

======================================================================
FAIL: test_pos_enc_30_2 (tests.coord_test.CoordTest)
tests.coord_test.CoordTest.test_pos_enc_30_2
test_pos_enc_30_2(30, 2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/absl/testing/parameterized.py", line 314, in bound_param_test
    return test_method(self, *testcase_params)
  File "/home/palisand/multinerf/tests/coord_test.py", line 127, in test_pos_enc
    self.assertLess(max_err, tol)
AssertionError: 109575406000.0 not less than 2

----------------------------------------------------------------------
Ran 21 tests in 30.823s

FAILED (failures=3)
.FFFF.
======================================================================
FAIL: test_mse_to_psnr_golden (tests.image_test.ImageTest)
tests.image_test.ImageTest.test_mse_to_psnr_golden
A lazy golden test for mse_to_psnr.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/palisand/multinerf/tests/image_test.py", line 127, in test_mse_to_psnr_golden
    np.testing.assert_allclose(psnr, psnr_gt, atol=1E-5, rtol=1E-5)
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-05

Mismatched elements: 28 / 64 (43.8%)
Max absolute difference: 0.00061035
Max relative difference: 0.00023579
 x: array([43.429222, 42.739685, 42.050312, 41.360874, 40.671413, 39.982204,
       39.29292 , 38.603603, 37.91449 , 37.225014, 36.535473, 35.846165,
       35.156944, 34.46737 , 33.777996, 33.088787, 32.399387, 31.709982,...
 y: array([43.429447, 42.74009 , 42.050735, 41.361378, 40.672024, 39.982666,
       39.29331 , 38.603954, 37.914597, 37.22524 , 36.535885, 35.846527,
       35.15717 , 34.46781 , 33.778458, 33.0891  , 32.399746, 31.710388,...

======================================================================
FAIL: test_psnr_mse_round_trip (tests.image_test.ImageTest)
tests.image_test.ImageTest.test_psnr_mse_round_trip
PSNR -> MSE -> PSNR is a no-op.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/palisand/multinerf/tests/image_test.py", line 63, in test_psnr_mse_round_trip
    np.testing.assert_allclose(
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-05

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0.00024223
Max relative difference: 1.21116638e-05
 x: array(20.000242, dtype=float32)
 y: array(20.)

======================================================================
FAIL: test_srgb_linearize (tests.image_test.ImageTest)
tests.image_test.ImageTest.test_srgb_linearize
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/palisand/multinerf/tests/image_test.py", line 81, in test_srgb_linearize
    np.testing.assert_allclose(
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-05

Mismatched elements: 3524 / 10000 (35.2%)
Max absolute difference: 0.00025249
Max relative difference: 0.00018565
 x: array([-1.      , -0.9996  , -0.9992  , ...,  2.999342,  2.999746,
        3.00015 ], dtype=float32)
 y: array([-1.    , -0.9996, -0.9992, ...,  2.9992,  2.9996,  3.    ],
      dtype=float32)

======================================================================
FAIL: test_srgb_to_linear_golden (tests.image_test.ImageTest)
tests.image_test.ImageTest.test_srgb_to_linear_golden
A lazy golden test for srgb_to_linear.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/palisand/multinerf/tests/image_test.py", line 108, in test_srgb_to_linear_golden
    np.testing.assert_allclose(linear, linear_gt, atol=1E-5, rtol=1E-5)
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/palisand/miniconda3/envs/multinerf/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-05

Mismatched elements: 19 / 64 (29.7%)
Max absolute difference: 6.875396e-05
Max relative difference: 0.00015924
 x: array([0.      , 0.001229, 0.002457, 0.003725, 0.005261, 0.007113,
       0.009299, 0.011834, 0.014733, 0.018009, 0.02167 , 0.025736,
       0.030215, 0.035118, 0.040456, 0.04624 , 0.05248 , 0.059185,...
 y: array([0.      , 0.001229, 0.002457, 0.003725, 0.005261, 0.007113,
       0.0093  , 0.011835, 0.014732, 0.018007, 0.021671, 0.025736,
       0.030215, 0.035118, 0.040456, 0.04624 , 0.052479, 0.059184,...

----------------------------------------------------------------------
Ran 6 tests in 9.887s

FAILED (failures=4)
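To put the `assert_allclose` failures above in perspective: the documented criterion is `abs(actual - desired) <= atol + rtol * abs(desired)`, evaluated per element. The sketch below replays that check in plain Python on a few `(x, y)` pairs copied from the output above. The helper names are made up for this illustration. It only restates the arithmetic of the check and says nothing about why the TPU results differ.

```python
def within_tolerance(actual, desired, rtol=1e-5, atol=1e-5):
    """Elementwise criterion documented for numpy.testing.assert_allclose."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


def check_pairs(pairs, rtol=1e-5, atol=1e-5):
    """Return (abs_diff, allowed, passed) for each (actual, desired) pair."""
    report = []
    for actual, desired in pairs:
        abs_diff = abs(actual - desired)
        allowed = atol + rtol * abs(desired)
        report.append((abs_diff, allowed, within_tolerance(actual, desired, rtol, atol)))
    return report


# First six elements of x and y from test_construct_ray_warps_extents_log.
coord_pairs = [
    (2.400275, 2.400219),
    (1.668342, 1.668289),
    (2.044059, 2.044053),
    (6.927439, 6.927345),
    (1.078879, 1.078887),
    (0.115673, 0.115672),
]
# The single element from test_psnr_mse_round_trip.
psnr_pair = [(20.000242, 20.0)]

if __name__ == "__main__":
    for abs_diff, allowed, passed in check_pairs(coord_pairs + psnr_pair):
        print(f"diff={abs_diff:.3e} allowed={allowed:.3e} passed={passed}")
```

For the PSNR round trip, the allowed error is `1e-5 + 1e-5 * 20 = 2.1e-4`, while the reported absolute difference is about `2.4e-4`, so that test misses the tolerance by a small margin rather than by orders of magnitude.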
