Comments (10)

6fj commented on May 31, 2024

Hi @HuifengShrimp

We don't need a GPU to run SPU.

And could you please

  • post the complete error trace.
  • clarify the steps to run this example.

Thank you!

from spu.

HuifengShrimp commented on May 31, 2024

After pulling the whole project, creating the Docker container, and installing the requirements, I ran:
bazel run -c opt //examples/python/utils:nodectl -- up
It showed:

DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'tf_runtime' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'llvm-raw' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11_bazel' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_protobuf' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_googletest' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_gflags_gflags' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_grpc_grpc' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'boringssl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'zlib' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'rules_python' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11' because it already exists.
INFO: Build option --compilation_mode has changed, discarding analysis cache.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/lib/gtl/BUILD:133:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core/lib/gtl:map_util: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:476:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_idl_options_proto: target '@com_github_brpc_brpc//:cc_brpc_idl_options_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:487:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_internal_proto: target '@com_github_brpc_brpc//:cc_brpc_internal_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1281:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:lib_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1603:16: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:framework_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'tf_cuda_library', the error might have been caused by the macro implementation
INFO: Analyzed target //examples/python/utils:nodectl (0 packages loaded, 19720 targets configured).
INFO: Found 1 target...
Target //examples/python/utils:nodectl up-to-date:
bazel-bin/examples/python/utils/nodectl
INFO: Elapsed time: 4.327s, Critical Path: 1.41s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action
2022-09-23 02:44:34.251902: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:165] libtpu.so already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.
2022-09-23 02:44:34.251981: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:259] Libtpu path is: libtpu.so
2022-09-23 02:44:34.255195: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-09-23 02:44:34.255825: I external/org_tensorflow/tensorflow/core/tpu/tpu_executor_dlsym_initializer.cc:68] Libtpu path is: libtpu.so
2022-09-23 02:44:34.295153: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-09-23 02:44:34.719075: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst
2022-09-23 02:44:34.719154: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-09-23 02:44:36,420] [Process-1] Starting grpc server at 127.0.0.1:9920
[2022-09-23 02:44:36,424] [Process-2] Starting grpc server at 127.0.0.1:9921
[2022-09-23 02:44:36,432] [Process-3] Starting grpc server at 127.0.0.1:9922
[2022-09-23 02:44:36,439] [Process-4] Starting grpc server at 127.0.0.1:9923
[2022-09-23 02:44:36,444] [Process-5] Starting grpc server at 127.0.0.1:9924
[2022-09-23 02:45:29,914] [Process-1] Run : builtin_spu_init at node:0
I0923 02:45:29.932346 124729 external/com_github_brpc_brpc/src/brpc/server.cpp:1066] Server[yasl::link::internal::ReceiverServiceImpl] is serving on port=9930.
I0923 02:45:29.934043 124729 external/com_github_brpc_brpc/src/brpc/server.cpp:1069] Check out http://9a985799fac5:9930 in web browser.
I0923 02:45:30.035466 124757 external/com_github_brpc_brpc/src/brpc/socket.cpp:2202] Checking Socket{id=0 addr=127.0.0.1:9931} (0x7f69d8068440)
I0923 02:45:30.035975 124743 external/com_github_brpc_brpc/src/brpc/socket.cpp:2262] Revived Socket{id=0 addr=127.0.0.1:9931} (0x7f69d8068440) (Connectable)
[2022-09-23 02:45:30,963] [Process-1] spu-runtime (SPU) initialized
[2022-09-23 02:46:09,953] [Process-1] Run : builtin_spu_run at node:0
[2022-09-23 02:46:09,963] [Process-4] RunR: builtin_fetch_object at node:3
[2022-09-23 02:46:09,965] [Process-4] Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn
result = fn(self, *args, **kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object
return server._globals[ObjectRef(refid, server.node_id)]
KeyError: ObjRef(26b6d05a-81a1-4375-bf85-440fda95c2eb at node:3)

[2022-09-23 02:46:09,969] [Process-1] Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in Run
args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))
File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in tree_map
return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in
return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in
args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 347, in _get_object
obj = self._node_clients[ref.origin_nodeid].get(ref)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 265, in get
return self._call(self._stub.RunReturn, builtin_fetch_object, ref.uuid)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 246, in _call
raise Exception("remote exception", result)
Exception: ('remote exception', Exception('Traceback (most recent call last):\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn\n result = fn(self, *args, **kwargs)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object\n return server._globals[ObjectRef(refid, server.node_id)]\nKeyError: ObjRef(26b6d05a-81a1-4375-bf85-440fda95c2eb at node:3)\n'))

[2022-09-23 02:46:15,896] [Process-5] RunR: builtin_fetch_object at node:4
[2022-09-23 02:46:15,899] [Process-5] Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn
result = fn(self, *args, **kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object
return server._globals[ObjectRef(refid, server.node_id)]
KeyError: ObjRef(c7e739ae-8a73-4556-ac11-49b57c8d85e4 at node:4)
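The nested "remote exception" text in the trace above is hard to read because the inner tracebacks are embedded as strings with literal \n escapes. Below is a minimal sketch of a log post-processing helper that unescapes the text and reports the deepest error line. The name `innermost_error` is hypothetical and not part of SPU; it only assumes that each traceback ends in a line of the form `SomeError: message`.

```python
import re

# Matches a final traceback line such as "KeyError: ObjRef(... at node:3)".
_ERROR_LINE = re.compile(r"^(\w*(?:Error|Exception)): (.*)$")


def innermost_error(text):
    """Return (exception_name, message) for the deepest nested error, or None.

    Hypothetical helper, not part of SPU: it only post-processes log text.
    """
    # Nested remote tracebacks are embedded as repr() strings, so their
    # newlines show up as the two characters backslash + n.
    flat = text.replace("\\n", "\n")
    found = None
    for line in flat.splitlines():
        match = _ERROR_LINE.match(line.strip())
        if match:
            # Keep overwriting: the last match in the text is the innermost one.
            found = (match.group(1), match.group(2))
    return found
```

Applied to the trace above, this should point at the `KeyError` raised in `builtin_fetch_object` rather than the outer `Exception` wrappers.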

And the process was stuck. So I started a new terminal and ran:
bazel run //examples/python/ml:ss_lr
It showed the following error:

DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'tf_runtime' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'llvm-raw' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11_bazel' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_protobuf' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_google_googletest' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_gflags_gflags' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'com_github_grpc_grpc' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'boringssl' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'zlib' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'rules_python' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/third_party/repo.bzl:124:14:
Warning: skipping import of repository 'pybind11' because it already exists.
INFO: Build option --compilation_mode has changed, discarding analysis cache.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/lib/gtl/BUILD:133:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core/lib/gtl:map_util: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:476:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_idl_options_proto: target '@com_github_brpc_brpc//:cc_brpc_idl_options_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/com_github_brpc_brpc/BUILD.bazel:487:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_internal_proto: target '@com_github_brpc_brpc//:cc_brpc_internal_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1281:11: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:lib_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'cc_library', the error might have been caused by the macro implementation
WARNING: /root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/external/org_tensorflow/tensorflow/core/BUILD:1603:16: in linkstatic attribute of cc_library rule @org_tensorflow//tensorflow/core:framework_internal: setting 'linkstatic=1' is recommended if there are no object files. Since this rule was created by the macro 'tf_cuda_library', the error might have been caused by the macro implementation
INFO: Analyzed target //examples/python/ml:ss_lr (0 packages loaded, 19718 targets configured).
INFO: Found 1 target...
Target //examples/python/ml:ss_lr up-to-date:
bazel-bin/examples/python/ml/ss_lr
INFO: Elapsed time: 1.823s, Critical Path: 0.13s
INFO: 4 processes: 4 internal.
INFO: Build completed successfully, 4 total actions
INFO: Build completed successfully, 4 total actions
2022-09-23 02:45:28.158577: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/examples/python/ml/ss_lr.py", line 284, in
model = sslr.fit(
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/examples/python/ml/ss_lr.py", line 184, in fit
spu_ds = self.spu(place_dataset)(xs, y)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 683, in call
results = [future.result() for future in futures]
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 683, in
results = [future.result() for future in futures]
File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 253, in run
return self._call(self._stub.Run, fn, *args, **kwargs)
File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-fastbuild/bin/examples/python/ml/ss_lr.runfiles/spulib/spu/binding/util/distributed.py", line 246, in _call
raise Exception("remote exception", result)
Exception: ('remote exception', Exception('Traceback (most recent call last):\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in Run\n args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))\n File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in tree_map\n return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))\n File "/usr/local/lib/python3.8/site-packages/jax/_src/tree_util.py", line 184, in \n return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 327, in \n args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 347, in _get_object\n obj = self._node_clients[ref.origin_nodeid].get(ref)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 265, in get\n return self._call(self._stub.RunReturn, builtin_fetch_object, ref.uuid)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 246, in _call\n raise Exception("remote exception", result)\nException: ('remote exception', Exception('Traceback (most recent call last):\n File 
"/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 312, in RunReturn\n result = fn(self, *args, **kwargs)\n File "/root/.cache/bazel/_bazel_root/eceb46742416a02f6a0f8d92bc74468c/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/binding/util/distributed.py", line 262, in builtin_fetch_object\n return server._globals[ObjectRef(refid, server.node_id)]\nKeyError: ObjRef(26b6d05a-81a1-4375-bf85-440fda95c2eb at node:3)\n'))\n'))`

6fj commented on May 31, 2024

Hi @HuifengShrimp

Are you using secretflow/secretflow-gcc11-anolis-dev as the Docker image? If not, could you please switch to this one?

HuifengShrimp commented on May 31, 2024

Yes, I used this command to start a container:
docker run -d -it --name spu-gcc11-anolis-dev-$(whoami) \
  --mount type=bind,source="$(pwd)",target=/home/admin/dev/ \
  -w /home/admin/dev \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --cap-add=NET_ADMIN \
  --privileged=true \
  secretflow/secretflow-gcc11-anolis-dev:latest

And the output of docker images shows:

image

6fj commented on May 31, 2024

Hi @HuifengShrimp,

Could you please confirm that your machine meets the minimum requirements - 8c16g?
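One way to check this from inside the container is sketched below. It assumes that "8c16g" means 8 CPU cores and 16 GB of RAM, and that the host is Linux (it relies on `os.sysconf`). The helper names are hypothetical, and the values are what the kernel reports, which can differ from any cgroup limits placed on the container.

```python
import os


def machine_resources():
    """Return (logical_cpu_count, total_ram_gib) as reported by the kernel."""
    cpus = os.cpu_count() or 0
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    phys_pages = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    ram_gib = page_size * phys_pages / 2**30
    return cpus, ram_gib


def meets_minimum(cpus, ram_gib, min_cpus=8, min_ram_gib=16.0):
    """Compare against the assumed 8-core / 16 GB minimum."""
    return cpus >= min_cpus and ram_gib >= min_ram_gib


if __name__ == "__main__":
    cpus, ram_gib = machine_resources()
    print(f"{cpus} CPUs, {ram_gib:.1f} GiB RAM, ok={meets_minimum(cpus, ram_gib)}")
```

A machine sold as 16 GB usually reports slightly less than 16 GiB, so `min_ram_gib` may need a little slack in practice.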

HuifengShrimp commented on May 31, 2024

Yes, our machine meets the requirements:
image

anakinxc commented on May 31, 2024

Hi @HuifengShrimp

Can you try the following steps? FYI, you need two terminals:

  1. Build
    bazel build //examples/python/... -c opt
  2. In first terminal
    bazel-bin/examples/python/utils/nodectl up
  3. In second terminal
    bazel-bin/examples/python/ml/ss_lr
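Because the second terminal depends on the nodes started in the first, it can help to confirm that the node servers are reachable before launching the example. The sketch below polls TCP ports until they accept connections. The function names are hypothetical, and the port range 9920-9924 is taken from the "Starting grpc server" lines in the log above, so it may differ in other configurations. An open port only shows that the server is listening, not that the node has finished initializing.

```python
import socket
import time


def port_is_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host, ports, deadline_s=60.0, poll_s=0.5):
    """Poll until every port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    pending = set(ports)
    while pending:
        # Drop the ports that are reachable now; keep retrying the rest.
        pending = {p for p in pending if not port_is_open(host, p)}
        if not pending:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)
    return True
```

For the setup in this thread, the call would be `wait_for_ports("127.0.0.1", range(9920, 9925))`, run after `nodectl up` and before `ss_lr`.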

Best

HuifengShrimp commented on May 31, 2024

Thank you very much! I restarted the whole process and ran jax_lr.py:
bazel run -c opt //examples/python/utils:nodectl -- up
bazel run //examples/python/ml:jax_lr
and got the AUC scores successfully.
But when I ran ss_lr.py with the commands you gave, I got the following messages:

2022-09-23 13:53:03.461759: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
train time 239.13434767723083
predict time 72.12441802024841
Traceback (most recent call last):
File "/home/admin/dev/bazel-bin/examples/python/ml/ss_lr.runfiles/spulib/examples/python/ml/ss_lr.py", line 299, in
print(f"auc {roc_auc_score(ppd.get(y), ppd.get(yhat))}")
File "/usr/local/lib64/python3.8/site-packages/sklearn/metrics/_ranking.py", line 566, in roc_auc_score
return _average_binary_score(
File "/usr/local/lib64/python3.8/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/usr/local/lib64/python3.8/site-packages/sklearn/metrics/_ranking.py", line 337, in _binary_roc_auc_score
raise ValueError(
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

anakinxc commented on May 31, 2024

Hi @HuifengShrimp,

The default dataset used by ss_lr is a mock one, which is very unbalanced, and it should not be turned on by default for the demo. I'll fix it.

Meanwhile, please change here to False and try again.
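To illustrate why the score is undefined in that case: ROC AUC can be read as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, so it needs at least one sample of each class. The sketch below is a plain-Python, pairwise version written for illustration only. It is not the scikit-learn implementation, it assumes 0/1 labels, and it is quadratic in the number of samples.

```python
def roc_auc(y_true, y_score):
    """Rank-based ROC AUC: P(score of a positive > score of a negative).

    Ties count as 0.5. Raises ValueError when only one class is present,
    because there are then no positive/negative pairs to compare.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t != 1]
    if not pos or not neg:
        raise ValueError("ROC AUC is undefined: only one class in y_true")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

With a very unbalanced dataset, the labels being evaluated can end up containing a single class, which is the situation the `ValueError` above reports.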

Best

HuifengShrimp commented on May 31, 2024

I set MOCK_DS=false and got the results, thank you.
