Giter Club home page Giter Club logo

large-scale-lm-tutorials's Issues

03_distributed_programming.ipynb 를 따라하는데 broadcast에서 오류가 발생합니다.

안녕하세요... 공개해주신 노트북으로 파이토치 병렬처리 개념을 알아가고 있습니다.

03_disstributed_programming.ipynb 에서 P2P Communication 까지는 소스 실행이 잘되고 있었습니다.

그런데 Collective Communication 의 첫번째 broadcast 연산의 샘플 프로그램을 따라서 실행하다가 오류가 발생합니다. dist.broadcast(tensor, src=0)을 실행하기까지는 출력이 되었는데, 이후 오류가 발생합니다.

혹시 이 오류가 왜 발생하는지 알 수 있을까요?

제가 테스트 한 환경은 다음과 같습니다.
윈도우 10 Enterprise 22H2 버전이며,
wsl2 기반 윈도우 docker desktop 에서 ubuntu20.04 버전을 실행하고
있습니다.
파이썬은 3.8.10 이며,
pytorch 버전은 1.13.1+cu116 입니다.
GPU는 1080ti 2장이 설치되어 있는 상황입니다.

테스트 실행 결과는 다음과 같습니다.

root@9023c839d35c:/data/hf_test# python -m torch.distributed.launch --nproc_per_node=2 broadcast.py
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


torch.cuda.current_device() : 0
torch.cuda.current_device() : 1
before rank 0: tensor([[ 0.5138, -1.4212],
[-0.8317, 1.0614]], device='cuda:0')

before rank 1: tensor([[0., 0.],
[0., 0.]], device='cuda:1')

Traceback (most recent call last):
File "broadcast.py", line 21, in
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in format
return object.format(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in init
nonzero_finite_vals = torch.masked_select(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f93e939f457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f93e93693ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f9414a47c64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f9414a1f3e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f9414a22054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f943f11fe23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f93e937f9e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f93e937faf9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f943f37dc68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f943f37df85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f945f86d083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

Traceback (most recent call last):
File "broadcast.py", line 21, in
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in format
return object.format(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in init
nonzero_finite_vals = torch.masked_select(
RuntimeError: numel: integer multiplication overflow
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f32834b2457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f328347c3ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f32aeb5ac64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f32aeb323e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f32aeb35054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f32d9232e23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f32834929e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3283492af9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f32d9490c68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f32d9490f85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f32f9980083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 260) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

broadcast.py FAILED

Failures:
[1]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 261)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 261

Root Cause (first observed failure):
[0]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 260)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 260

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.