Comments (10)
Hi @HuifengShrimp,
We don't need a GPU to run SPU.
Could you please
- post the complete error trace, and
- clarify the steps you used to run this example?
Thank you!
from spu.
After pulling the whole project, creating the docker container, and installing the requirements, I ran:
bazel run -c opt //examples/python/utils:nodectl -- up
It showed:
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'tf_runtime' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'llvm-raw' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11_bazel' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_protobuf' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_googletest' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_gflags_gflags' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_grpc_grpc' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'boringssl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'zlib' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'rules_python' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11' because it already exists.
INFO: Build option --compilation_mode has changed, discarding analysis cache.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/lib/gtl/BUILD:133:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core/lib/gtl:map_util: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:476:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_idl_options_proto: target '@com_github_brpc_brpc//:cc_brpc_idl_options_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:487:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_internal_proto: target '@com_github_brpc_brpc//:cc_brpc_internal_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1281:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:lib_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1603:16: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:framework_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'tf_cuda_library', the error might have been caused by the macro implementation
INFO: Analyzed target //examples/python/utils:nodectl (0 packages loaded, 19720 targets configured).
INFO: Found 1 target...
Target //examples/python/utils:nodectl up-to-date:
bazel-bin/examples/python/utils/nodectl
INFO: Elapsed time: 4.327s, Critical Path: 1.41s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action
2022-09-23 02:44:34.251902: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:165] libtpu.so already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.
2022-09-23 02:44:34.251981: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:259] Libtpu path is: libtpu.so
2022-09-23 02:44:34.255195: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-09-23 02:44:34.255825: I external/org_tensorflow/tensorflow/core/tpu/tpu_executor_dlsym_initializer.cc:68] Libtpu path is: libtpu.so
2022-09-23 02:44:34.295153: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-09-23 02:44:34.719075: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst
2022-09-23 02:44:34.719154: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-09-23 02:44:36,420] [Process-1] Starting grpc server at 127.0.0.1:9920
[2022-09-23 02:44:36,424] [Process-2] Starting grpc server at 127.0.0.1:9921
[2022-09-23 02:44:36,432] [Process-3] Starting grpc server at 127.0.0.1:9922
[2022-09-23 02:44:36,439] [Process-4] Starting grpc server at 127.0.0.1:9923
[2022-09-23 02:44:36,444] [Process-5] Starting grpc server at 127.0.0.1:9924
[2022-09-23 02:45:29,914] [Process-1] Run : builtin_spu_init at node:0
I0923 02:45:29.932346 124729 external/com_github_brpc_brpc/src/brpc/server.cpp:1066] Server[yasl::link::internal::ReceiverServiceImpl] is serving on port=9930.
I0923 02:45:29.934043 124729 external/com_github_brpc_brpc/src/brpc/server.cpp:1069] Check out http://9a985799fac5:9930 in web browser.
I0923 02:45:30.035466 124757 external/com_github_brpc_brpc/src/brpc/socket.cpp:2202] Checking Socket{id=0 addr=127.0.0.1:9931} (0x7f69d8068440)
I0923 02:45:30.035975 124743 external/com_github_brpc_brpc/src/brpc/socket.cpp:2262] Revived Socket{id=0 addr=127.0.0.1:9931} (0x7f69d8068440) (Connectable)
[2022-09-23 02:45:30,963] [Process-1] spu-runtime (SPU) initialized
[2022-09-23 02:46:09,953] [Process-1] Run : builtin_spu_run at node:0
[2022-09-23 02:46:09,963] [Process-4] RunR: builtin_fetch_object at node:3
[2022-09-23 02:46:09,965] [Process-4] Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn
result = fn(self, *args, **kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object
return server._globals[ObjectRef(refid, server.node_id)]
KeyError: ObjRef(26b6d05a-81a1-4375-bf85-440fda95c2eb at node:3)
[2022-09-23 02:46:09,969] [Process-1] Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in Run
args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))
File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in tree_map
return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in <genexpr>
return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in <lambda>
args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 347, in _get_object
obj = self._node_clients[ref.origin_nodeid].get(ref)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 265, in get
return self._call(self._stub.RunReturn, builtin_fetch_object, ref.uuid)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 246, in _call
raise Exception("remote exception", result)
Exception: ('remote exception', Exception('Traceback (most recent call last):\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn\n result = fn(self, *args, **kwargs)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object\n return server._globals[ObjectRef(refid, server.node_id)]\nKeyError: ObjRef(26b6d05a-81a1-4375-bf85-440fda95c2eb at node:3)\n'))
[2022-09-23 02:46:15,896] [Process-5] RunR: builtin_fetch_object at node:4
[2022-09-23 02:46:15,899] [Process-5] Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn
result = fn(self, *args, **kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object
return server._globals[ObjectRef(refid, server.node_id)]
KeyError: ObjRef(c7e739ae-8a73-4556-ac11-49b57c8d85e4 at node:4)
And the process was stuck. So I started a new terminal and ran:
bazel run //examples/python/ml:ss_lr
It showed the following error:
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'tf_runtime' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'llvm-raw' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11_bazel' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_protobuf' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_googletest' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_gflags_gflags' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_grpc_grpc' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'boringssl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'zlib' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'rules_python' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11' because it already exists.
INFO: Build option --compilation_mode has changed, discarding analysis cache.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/lib/gtl/BUILD:133:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core/lib/gtl:map_util: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:476:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_idl_options_proto: target '@com_github_brpc_brpc//:cc_brpc_idl_options_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:487:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_internal_proto: target '@com_github_brpc_brpc//:cc_brpc_internal_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1281:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:lib_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1603:16: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:framework_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'tf_cuda_library', the error might have been caused by the macro implementation
INFO: Analyzed target //examples/python/ml:ss_lr (0 packages loaded, 19718 targets configured).
INFO: Found 1 target...
Target //examples/python/ml:ss_lr up-to-date:
bazel-bin/examples/python/ml/ss_lr
INFO: Elapsed time: 1.823s, Critical Path: 0.13s
INFO: 4 processes: 4 internal.
INFO: Build completed successfully, 4 total actions
INFO: Build completed successfully, 4 total actions
2022-09-23 02:45:28.158577: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/examples/python/ml/ss_lr.py", line 284, in <module>
model = sslr.fit(
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/examples/python/ml/ss_lr.py", line 184, in fit
spu_ds = self.spu(place_dataset)(xs, y)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 683, in __call__
results = [future.result() for future in futures]
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 683, in <listcomp>
results = [future.result() for future in futures]
File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 253, in run
return self._call(self._stub.Run, fn, *args, **kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 246, in _call
raise Exception("remote exception", result)
Exception: ('remote exception', Exception('Traceback (most recent call last):\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in Run\n args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))\n File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in tree_map\n return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))\n File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in \n return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in \n args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 347, in _get_object\n obj = self._node_clients[ref.origin_nodeid].get(ref)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 265, in get\n return self._call(self._stub.RunReturn, builtin_fetch_object, ref.uuid)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 246, in _call\n raise Exception("remote exception", result)\nException: ('remote exception', Exception('Traceback (most recent call last):\n File 
"/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn\n result = fn(self, *args, **kwargs)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object\n return server._globals[ObjectRef(refid, server.node_id)]\nKeyError: ObjRef(26b6d05a-81a1-4375-bf85-440fda95c2eb at node:3)\n'))\n'))
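For readers puzzling over the KeyError: the traceback shows the node looking up an object reference in a per-process dictionary (server._globals). The toy model below is not SPU's code; ToyNode, put, and fetch are hypothetical names chosen for illustration. It shows one way such a lookup can fail: a reference created against one node process is presented to a fresh process that never stored it. The thread does not confirm that this is what happened here.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRef:
    """Hashable handle: an id plus the node that owns the object (toy version)."""
    refid: str
    origin_nodeid: int


class ToyNode:
    """Hypothetical stand-in for a node server with an in-memory object store."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self._globals = {}  # ObjectRef -> value, lives only as long as the process

    def put(self, value) -> ObjectRef:
        ref = ObjectRef(str(uuid.uuid4()), self.node_id)
        self._globals[ref] = value
        return ref

    def fetch(self, refid: str):
        # Same shape as the failing line in the traceback: a plain dict lookup,
        # so an unknown reference surfaces as KeyError.
        return self._globals[ObjectRef(refid, self.node_id)]


if __name__ == "__main__":
    node = ToyNode(3)
    ref = node.put([1, 2, 3])
    print(node.fetch(ref.refid))      # the owning process can resolve the ref

    restarted = ToyNode(3)            # same node id, but an empty store
    try:
        restarted.fetch(ref.refid)
    except KeyError as err:
        print("KeyError:", err)       # the stale ref is unknown to the new process
```

If a similar error appears, restarting both the node processes and the client so that they share a fresh state is a cheap first thing to try.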
Hi @HuifengShrimp,
Are you using secretflow/secretflow-gcc11-anolis-dev as the docker image? If not, could you please switch to this one?
Yes, I used this command to start a container:
docker run -d -it --name spu-gcc11-anolis-dev-$(whoami) \
  --mount type=bind,source="$(pwd)",target=/home/admin/dev/ \
  -w /home/admin/dev \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --cap-add=NET_ADMIN \
  --privileged=true \
  secretflow/secretflow-gcc11-anolis-dev:latest
And docker images shows:
Hi @HuifengShrimp,
Could you please confirm that your machine meets the minimum requirements (8c16g, i.e. 8 CPU cores and 16 GB of memory)?
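A quick way to check this from inside the container is sketched below. The thresholds come from the "8c16g" figure above; the function names and the 1 GiB slack are assumptions made for illustration (a machine sold as 16 GB usually reports slightly less than 16 GiB). Note that os.cpu_count() reports the host's CPUs and does not reflect container CPU limits.

```python
import os

MIN_CORES = 8       # "8c" from the stated minimum
MIN_MEM_GIB = 16.0  # "16g" from the stated minimum


def meets_minimum(cores, mem_gib, min_cores=MIN_CORES,
                  min_mem_gib=MIN_MEM_GIB, slack_gib=1.0):
    """Return True if the machine has enough cores and (roughly) enough RAM."""
    return cores >= min_cores and mem_gib >= min_mem_gib - slack_gib


def detect_resources():
    """Return (cores, total memory in GiB); memory is None if it cannot be read."""
    cores = os.cpu_count() or 0
    try:
        # Available on Linux; other platforms may not support these names.
        total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        mem_gib = total_bytes / 2**30
    except (ValueError, OSError, AttributeError):
        mem_gib = None
    return cores, mem_gib


if __name__ == "__main__":
    cores, mem_gib = detect_resources()
    if mem_gib is None:
        print(f"{cores} cores, memory size unknown")
    else:
        ok = meets_minimum(cores, mem_gib)
        print(f"{cores} cores, {mem_gib:.1f} GiB RAM, meets 8c16g: {ok}")
```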
Yes, our machine meets the requirements:
Can you try the following steps? FYI, you need two terminals:
- Build:
bazel build //examples/python/... -c opt
- In the first terminal:
bazel-bin/examples/python/utils/nodectl up
- In the second terminal:
bazel-bin/examples/python/ml/ss_lr
Best
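Because the client in the second terminal talks to the node servers started in the first, it can help to wait until those servers are listening before launching the client. The sketch below polls the gRPC ports that appear in the nodectl log earlier in this thread (127.0.0.1:9920 to 9924); other configurations may use different ports, and the helper names are hypothetical. It only automates the ordering of the two steps and is not claimed to address the cause of the earlier KeyError.

```python
import socket
import time

# Ports taken from the "Starting grpc server at 127.0.0.1:99xx" log lines above.
NODE_PORTS = [9920, 9921, 9922, 9923, 9924]


def port_is_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host, ports, deadline_s=60.0, poll_s=1.0):
    """Poll until every port accepts connections; False if the deadline passes."""
    end = time.monotonic() + deadline_s
    pending = set(ports)
    while True:
        pending = {p for p in pending if not port_is_open(host, p)}
        if not pending:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    if wait_for_ports("127.0.0.1", NODE_PORTS, deadline_s=5.0):
        print("all node servers are listening; start the client now")
    else:
        print("node servers not reachable; is 'nodectl up' running?")
```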
Thank you very much! I restarted the whole process and ran jax_lr.py:
bazel run -c opt //examples/python/utils:nodectl -- up
bazel run //examples/python/ml:jax_lr
and got the AUC scores successfully.
But when I ran ss_lr.py with the commands you gave, I got the following messages:
2022-09-23 13:53:03.461759: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst
/* error: missing value */
{}WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
train time 239.13434767723083
predict time 72.12441802024841
Traceback (most recent call last):
File "/home/admin/dev/bazel-bin/examples/python/ml/ss_lr.runfiles/spulib/examples/python/ml/ss_lr.py", line 299, in <module>
print(f"auc {roc_auc_score(ppd.get(y), ppd.get(yhat))}")
File "/usr/local/lib64/python3.8/site-packages/sklearn/metrics/_ranking.py", line 566, in roc_auc_score
return _average_binary_score(
File "/usr/local/lib64/python3.8/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/usr/local/lib64/python3.8/site-packages/sklearn/metrics/_ranking.py", line 337, in _binary_roc_auc_score
raise ValueError(
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
Hi @HuifengShrimp,
The default dataset used by ss_lr is a mock one that is very unbalanced; it should not be turned on by default for the demo. I'll fix it.
Meanwhile, please change MOCK_DS to False and try again.
Best
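To see why an unbalanced label set triggers the ValueError above: ROC AUC compares the scores of positive samples against those of negative samples, so it is undefined when one of the two groups is empty. The sketch below is a small pure-Python illustration of that idea (a pairwise, rank-based AUC), not scikit-learn's implementation; class_counts and roc_auc are hypothetical helper names.

```python
from collections import Counter


def class_counts(y_true):
    """Count how many samples fall in each class label."""
    return Counter(int(v) for v in y_true)


def roc_auc(y_true, y_score):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Raises ValueError when only one class is present,
    since there are then no (positive, negative) pairs to compare.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Only one class present in y_true; AUC is undefined.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

    labels = [0, 0, 0, 0]  # e.g. a slice of a very unbalanced dataset
    print(class_counts(labels))
    try:
        roc_auc(labels, [0.2, 0.1, 0.3, 0.4])
    except ValueError as err:
        print("ValueError:", err)
```

Checking class_counts on the evaluation labels before computing the metric makes this failure mode visible early.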
I set MOCK_DS=False and got the results, thank you.