Comments (14)
We are seeing the same error for LLM, LLMEngine, and AsyncLLMEngine.
Interestingly, we find that wrapping everything in the Python script inside if __name__ == '__main__': temporarily bypasses the issue.
For test.py being the following:
import os
from vllm import LLM, SamplingParams

prompts = ['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful chatbot who always responds to requests.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is ramen?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n']
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
    disable_log_stats=False,
)
outputs = llm.generate(prompts, sampling_params)
Running python test.py gives:
(venv-vllm) ben@sakura-h100-3:/nvme0n1/ben/test$ python test.py
/nvme0n1/ben/venvs/venv-vllm/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 06-19 01:40:01 config.py:632] Defaulting to use mp for distributed inference
INFO 06-19 01:40:01 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/nvme0n1/ben/venvs/venv-vllm/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 06-19 01:40:04 config.py:632] Defaulting to use mp for distributed inference
INFO 06-19 01:40:04 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/nvme0n1/ben/test/test.py", line 8, in <module>
llm = LLM(
File "/nvme0n1/ben/vllm/vllm/entrypoints/llm.py", line 144, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/nvme0n1/ben/vllm/vllm/engine/llm_engine.py", line 371, in from_engine_args
engine = cls(
File "/nvme0n1/ben/vllm/vllm/engine/llm_engine.py", line 223, in __init__
self.model_executor = executor_class(
File "/nvme0n1/ben/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
super().__init__(*args, **kwargs)
File "/nvme0n1/ben/vllm/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/nvme0n1/ben/vllm/vllm/executor/multiproc_gpu_executor.py", line 48, in _init_executor
self.workers = [
File "/nvme0n1/ben/vllm/vllm/executor/multiproc_gpu_executor.py", line 49, in <listcomp>
ProcessWorkerWrapper(
File "/nvme0n1/ben/vllm/vllm/executor/multiproc_worker_utils.py", line 162, in __init__
self.process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
ERROR 06-19 01:40:05 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 2334411 died, exit code: 1
INFO 06-19 01:40:05 multiproc_worker_utils.py:123] Killing local vLLM worker processes
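The RuntimeError in the trace above is standard Python behaviour for the spawn start method: every child process re-imports the parent's main module, so any top-level code that itself starts processes runs again during the child's bootstrap. A minimal standalone sketch of the guard idiom (plain Python, nothing vLLM-specific):

import multiprocessing as mp

def work():
    # placeholder task for the child process
    print("hello from the worker")

if __name__ == '__main__':
    # Under the spawn start method the child re-imports this module.
    # Without this guard, Process(...).start() would run again during that
    # re-import and raise the same "bootstrapping phase" RuntimeError.
    mp.set_start_method('spawn')
    p = mp.Process(target=work)
    p.start()
    p.join()

The wrapped script below applies the same idiom to the vLLM example.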
For test-wrap.py being the following:
import os
from vllm import LLM, SamplingParams

if __name__ == '__main__':
    prompts = ['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful chatbot who always responds to requests.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is ramen?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n']
    sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=2,
        disable_log_stats=False,
    )
    outputs = llm.generate(prompts, sampling_params)
Running python test-wrap.py gives:
(venv-vllm) ben@sakura-h100-3:/nvme0n1/ben/test$ python test-wrapped.py
/nvme0n1/ben/venvs/venv-vllm/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 06-19 01:43:57 config.py:632] Defaulting to use mp for distributed inference
INFO 06-19 01:43:57 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:00 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:00 utils.py:672] Found nccl from library libnccl.so.2
INFO 06-19 01:44:00 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:00 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-19 01:44:00 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-19 01:44:01 custom_all_reduce_utils.py:196] generating GPU P2P access cache in /home/ben/.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-19 01:44:06 custom_all_reduce_utils.py:208] reading GPU P2P access cache from /home/ben/.config/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:06 custom_all_reduce_utils.py:208] reading GPU P2P access cache from /home/ben/.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-19 01:44:07 weight_utils.py:218] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:07 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-19 01:44:09 model_runner.py:160] Loading model weights took 7.4829 GB
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:10 model_runner.py:160] Loading model weights took 7.4829 GB
INFO 06-19 01:44:11 distributed_gpu_executor.py:56] # GPU blocks: 61915, # CPU blocks: 4096
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:12 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:12 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-19 01:44:12 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-19 01:44:12 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:16 custom_all_reduce.py:267] Registering 2275 cuda graph addresses
INFO 06-19 01:44:16 custom_all_reduce.py:267] Registering 2275 cuda graph addresses
INFO 06-19 01:44:16 model_runner.py:965] Graph capturing finished in 4 secs.
(VllmWorkerProcess pid=2335588) INFO 06-19 01:44:16 model_runner.py:965] Graph capturing finished in 4 secs.
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.98s/it, est. speed input: 16.18 toks/s, output: 145.13 toks/s]
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Note that there is a CUDA IPC tensor error and a leaked-semaphore warning at the end, but at least the text generation task completes successfully.
This might be related to unprotected usage of Python's concurrent.futures library, although the only place I could find vLLM using this library is here.
from vllm.
Just checked. For v0.4.3, the default backend (ray) works, albeit with a small error at the end: [rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
If we set distributed_executor_backend='mp', it is broken with the same error described in the threads above.
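For anyone who needs a stop-gap on v0.4.3, the backend can also be forced to ray explicitly when constructing the engine; a minimal sketch, assuming the distributed_executor_backend argument mentioned above is accepted by LLM() in your installed version:

from vllm import LLM, SamplingParams

# Force the Ray backend as a stop-gap while the 'mp' path is broken
# (assumption: LLM() accepts this argument in this version, as implied above).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)
outputs = llm.generate(["What is ramen?"], SamplingParams(temperature=0.0, max_tokens=64))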
from vllm.
cc @njhill
from vllm.
hi, folks, can you try to set an environment variable export VLLM_WORKER_MULTIPROC_METHOD=fork, and then run vllm again?
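(If you would rather set this from inside the script than in the shell, a minimal sketch; the variable just needs to be set before vLLM creates its worker processes:)

import os

# "fork" avoids the main-module re-import that the spawn start method performs;
# set it before importing/constructing anything that spawns vLLM workers.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "fork"

from vllm import LLM, SamplingParams  # imported after the variable is set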
from vllm.
hi, folks, can you try to set an environment variable export VLLM_WORKER_MULTIPROC_METHOD=fork, and then run vllm again?
I also encountered a similar error as above; adding this environment variable resolved it.
from vllm.
I attempted to set the environment variable export VLLM_WORKER_MULTIPROC_METHOD=fork as suggested and reran the vLLM application. Unfortunately, I'm still encountering errors. This time, a KeyError occurred in the multiprocessing.resource_tracker module, indicating a potential issue with process management under the fork start method.
The traceback highlights a removal operation on a missing key in a resource cache.
Here’s the relevant part of the error message:
(llm) z5327441@k091:/scratch/pbs.5466392.kman.restech.unsw.edu.au $ export VLLM_WORKER_MULTIPROC_METHOD=fork
(llm) z5327441@k091:/scratch/pbs.5466392.kman.restech.unsw.edu.au $ python test.py
2024-06-19 20:27:18,261 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 10.197.40.91:6379...
2024-06-19 20:27:18,269 INFO worker.py:1744 -- Connected to Ray cluster. View the dashboard at 10.197.40.91:8265
INFO 06-19 20:27:18 config.py:623] Defaulting to use mp for distributed inference
INFO 06-19 20:27:18 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=3599698) INFO 06-19 20:27:20 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3599698) INFO 06-19 20:27:20 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-19 20:27:20 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-19 20:27:20 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3599698) INFO 06-19 20:27:20 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-19 20:27:20 custom_all_reduce_utils.py:170] generating GPU P2P access cache in /home/z5327441/.config/vllm/gpu_p2p_access_cache_for_7,6,5,4,3,2,1,0.json
2024-06-19 20:27:22,430 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 10.197.40.91:6379...
2024-06-19 20:27:22,438 INFO worker.py:1744 -- Connected to Ray cluster. View the dashboard at 10.197.40.91:8265
2024-06-19 20:27:22,475 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 10.197.40.91:6379...
2024-06-19 20:27:22,483 INFO worker.py:1744 -- Connected to Ray cluster. View the dashboard at 10.197.40.91:8265
INFO 06-19 20:27:22 config.py:623] Defaulting to use mp for distributed inference
INFO 06-19 20:27:22 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct)
INFO 06-19 20:27:22 config.py:623] Defaulting to use mp for distributed inference
INFO 06-19 20:27:22 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/scratch/pbs.5466392.kman.restech.unsw.edu.au/models/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=3600016) INFO 06-19 20:27:23 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3600019) INFO 06-19 20:27:23 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 06-19 20:27:23 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3600016) INFO 06-19 20:27:23 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3600016) INFO 06-19 20:27:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-19 20:27:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-19 20:27:23 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3600019) INFO 06-19 20:27:23 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-19 20:27:23 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3600019) INFO 06-19 20:27:23 pynccl.py:63] vLLM is using nccl==2.20.5
Traceback (most recent call last):
File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_deecb814'
INFO 06-19 20:27:24 custom_all_reduce_utils.py:170] generating GPU P2P access cache in /home/z5327441/.config/vllm/gpu_p2p_access_cache_for_7,6,5,4,3,2,1,0.json
[rank0]: Traceback (most recent call last):
[rank0]: File "<string>", line 1, in <module>
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
[rank0]: exitcode = _main(fd, parent_sentinel)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
[rank0]: prepare(preparation_data)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
[rank0]: _fixup_main_from_path(data['init_main_from_path'])
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
[rank0]: main_content = runpy.run_path(main_path,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/runpy.py", line 289, in run_path
[rank0]: return _run_module_code(code, init_globals, run_name,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/runpy.py", line 96, in _run_module_code
[rank0]: _run_code(code, mod_globals, init_globals,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/test.py", line 13, in <module>
[rank0]: llm = LLM(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]: self._init_executor()
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 65, in _init_executor
[rank0]: self._run_workers("init_device")
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 357, in init_worker_distributed_environment
[rank0]: ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 655, in ensure_model_parallel_initialized
[rank0]: initialize_model_parallel(tensor_model_parallel_size,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 616, in initialize_model_parallel
[rank0]: _TP = GroupCoordinator(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 157, in __init__
[rank0]: self.ca_comm = CustomAllreduce(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 174, in __init__
[rank0]: if not _can_p2p(rank, world_size):
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 78, in _can_p2p
[rank0]: if not gpu_p2p_access_check(rank, i):
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 174, in gpu_p2p_access_check
[rank0]: cache[f"{_i}->{_j}"] = can_actually_p2p(_i, _j)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 123, in can_actually_p2p
[rank0]: pi.start()
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/process.py", line 121, in start
[rank0]: self._popen = self._Popen(self)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
[rank0]: return Popen(process_obj)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]: super().__init__(process_obj)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]: self._launch(process_obj)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
[rank0]: prep_data = spawn.get_preparation_data(process_obj._name)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
[rank0]: _check_not_importing_main()
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
[rank0]: raise RuntimeError('''
[rank0]: RuntimeError:
[rank0]: An attempt has been made to start a new process before the
[rank0]: current process has finished its bootstrapping phase.
[rank0]: This probably means that you are not using fork to start your
[rank0]: child processes and you have forgotten to use the proper idiom
[rank0]: in the main module:
[rank0]: if __name__ == '__main__':
[rank0]: freeze_support()
[rank0]: ...
[rank0]: The "freeze_support()" line can be omitted if the program
[rank0]: is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_2038dbcf'
INFO 06-19 20:27:24 custom_all_reduce_utils.py:170] generating GPU P2P access cache in /home/z5327441/.config/vllm/gpu_p2p_access_cache_for_7,6,5,4,3,2,1,0.json
[rank0]: Traceback (most recent call last):
[rank0]: File "<string>", line 1, in <module>
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
[rank0]: exitcode = _main(fd, parent_sentinel)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
[rank0]: prepare(preparation_data)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
[rank0]: _fixup_main_from_path(data['init_main_from_path'])
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
[rank0]: main_content = runpy.run_path(main_path,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/runpy.py", line 289, in run_path
[rank0]: return _run_module_code(code, init_globals, run_name,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/runpy.py", line 96, in _run_module_code
[rank0]: _run_code(code, mod_globals, init_globals,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/test.py", line 13, in <module>
[rank0]: llm = LLM(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]: self._init_executor()
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 65, in _init_executor
[rank0]: self._run_workers("init_device")
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 357, in init_worker_distributed_environment
[rank0]: ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 655, in ensure_model_parallel_initialized
[rank0]: initialize_model_parallel(tensor_model_parallel_size,
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 616, in initialize_model_parallel
[rank0]: _TP = GroupCoordinator(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 157, in __init__
[rank0]: self.ca_comm = CustomAllreduce(
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 174, in __init__
[rank0]: if not _can_p2p(rank, world_size):
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 78, in _can_p2p
[rank0]: if not gpu_p2p_access_check(rank, i):
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 174, in gpu_p2p_access_check
[rank0]: cache[f"{_i}->{_j}"] = can_actually_p2p(_i, _j)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 123, in can_actually_p2p
[rank0]: pi.start()
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/process.py", line 121, in start
[rank0]: self._popen = self._Popen(self)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
[rank0]: return Popen(process_obj)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]: super().__init__(process_obj)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]: self._launch(process_obj)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
[rank0]: prep_data = spawn.get_preparation_data(process_obj._name)
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
[rank0]: _check_not_importing_main()
[rank0]: File "/scratch/pbs.5466392.kman.restech.unsw.edu.au/miniforge3/envs/llm/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
[rank0]: raise RuntimeError('''
[rank0]: RuntimeError:
[rank0]: An attempt has been made to start a new process before the
[rank0]: current process has finished its bootstrapping phase.
[rank0]: This probably means that you are not using fork to start your
[rank0]: child processes and you have forgotten to use the proper idiom
[rank0]: in the main module:
[rank0]: if __name__ == '__main__':
[rank0]: freeze_support()
[rank0]: ...
[rank0]: The "freeze_support()" line can be omitted if the program
[rank0]: is not going to be frozen to produce an executable.
*** SIGTERM received at time=1718792844 on cpu 39 ***
PC: @ 0x14ee9615645c (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x14ee9615acf0 (unknown) (unknown)
[2024-06-19 20:27:24,847 E 3600016 3599783] logging.cc:343: *** SIGTERM received at time=1718792844 on cpu 39 ***
[2024-06-19 20:27:24,847 E 3600016 3599783] logging.cc:343: PC: @ 0x14ee9615645c (unknown) pthread_cond_wait@@GLIBC_2.3.2
[2024-06-19 20:27:24,847 E 3600016 3599783] logging.cc:343: @ 0x14ee9615acf0 (unknown) (unknown)
*** SIGTERM received at time=1718792844 on cpu 37 ***
PC: @ 0x14bb61b8c45c (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x14bb61b90cf0 (unknown) (unknown)
[2024-06-19 20:27:24,892 E 3600019 3599784] logging.cc:343: *** SIGTERM received at time=1718792844 on cpu 37 ***
[2024-06-19 20:27:24,892 E 3600019 3599784] logging.cc:343: PC: @ 0x14bb61b8c45c (unknown) pthread_cond_wait@@GLIBC_2.3.2
[2024-06-19 20:27:24,892 E 3600019 3599784] logging.cc:343: @ 0x14bb61b90cf0 (unknown) (unknown)
from vllm.
@zixuzixu These are two separate issues:
- KeyError: '/psm_deecb814' (see #5468); it should not crash your program.
- "An attempt has been made to start a new process before the current process has finished its bootstrapping phase." This is a bug, and I think the latest code in #5669 should solve the problem. Please give it a try.
from vllm.
For the latest commit on the main branch (#5648), adding export VLLM_WORKER_MULTIPROC_METHOD=fork does not work:
(venv-vllm) ben@sakura-h100-3:/nvme0n1/ben/test$ export VLLM_WORKER_MULTIPROC_METHOD=fork
(venv-vllm) ben@sakura-h100-3:/nvme0n1/ben/test$ python test.py
/nvme0n1/ben/venvs/venv-vllm/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 06-20 01:29:02 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=3681069) INFO 06-20 01:29:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681068) INFO 06-20 01:29:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681067) INFO 06-20 01:29:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 06-20 01:29:08 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681068) INFO 06-20 01:29:08 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681067) INFO 06-20 01:29:08 utils.py:672] Found nccl from library libnccl.so.2
INFO 06-20 01:29:08 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681069) INFO 06-20 01:29:08 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681068) INFO 06-20 01:29:08 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681067) INFO 06-20 01:29:08 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681069) INFO 06-20 01:29:08 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-20 01:29:11 custom_all_reduce_utils.py:196] generating GPU P2P access cache in /home/ben/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
/nvme0n1/ben/venvs/venv-vllm/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/nvme0n1/ben/venvs/venv-vllm/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 06-20 01:29:13 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
INFO 06-20 01:29:13 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=3681346) INFO 06-20 01:29:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681347) INFO 06-20 01:29:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681350) INFO 06-20 01:29:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681352) INFO 06-20 01:29:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681345) INFO 06-20 01:29:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681351) INFO 06-20 01:29:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3681352) INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681351) INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681350) INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681352) INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681351) INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681350) INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681345) INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681345) INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681346) INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681347) INFO 06-20 01:29:18 utils.py:672] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3681346) INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3681347) INFO 06-20 01:29:18 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-20 01:29:23 custom_all_reduce_utils.py:196] generating GPU P2P access cache in /home/ben/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
[rank0]: Traceback (most recent call last):
[rank0]: File "<string>", line 1, in <module>
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
[rank0]: exitcode = _main(fd, parent_sentinel)
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
[rank0]: prepare(preparation_data)
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
[rank0]: _fixup_main_from_path(data['init_main_from_path'])
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
[rank0]: main_content = runpy.run_path(main_path,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 289, in run_path
[rank0]: return _run_module_code(code, init_globals, run_name,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
[rank0]: _run_code(code, mod_globals, init_globals,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/nvme0n1/ben/test/test.py", line 10, in <module>
[rank0]: llm = LLM(
[rank0]: File "/nvme0n1/ben/vllm/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/nvme0n1/ben/vllm/vllm/engine/llm_engine.py", line 384, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/nvme0n1/ben/vllm/vllm/engine/llm_engine.py", line 230, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/executor_base.py", line 41, in __init__
[rank0]: self._init_executor()
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/multiproc_gpu_executor.py", line 68, in _init_executor
[rank0]: self._run_workers("init_device")
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/multiproc_gpu_executor.py", line 122, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/nvme0n1/ben/vllm/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/nvme0n1/ben/vllm/vllm/worker/worker.py", line 358, in init_worker_distributed_environment
[rank0]: ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/parallel_state.py", line 656, in ensure_model_parallel_initialized
[rank0]: initialize_model_parallel(tensor_model_parallel_size,
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/parallel_state.py", line 617, in initialize_model_parallel
[rank0]: _TP = GroupCoordinator(
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/parallel_state.py", line 158, in __init__
[rank0]: self.ca_comm = CustomAllreduce(
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce.py", line 174, in __init__
[rank0]: if not _can_p2p(rank, world_size):
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce.py", line 78, in _can_p2p
[rank0]: if not gpu_p2p_access_check(rank, i):
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 201, in gpu_p2p_access_check
[rank0]: result = can_actually_p2p(batch_src, batch_tgt)
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 138, in can_actually_p2p
[rank0]: p_src.start()
[rank0]: File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
[rank0]: self._popen = self._Popen(self)
[rank0]: File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
[rank0]: return Popen(process_obj)
[rank0]: File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]: super().__init__(process_obj)
[rank0]: File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]: self._launch(process_obj)
[rank0]: File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
[rank0]: prep_data = spawn.get_preparation_data(process_obj._name)
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
[rank0]: _check_not_importing_main()
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
[rank0]: raise RuntimeError('''
[rank0]: RuntimeError:
[rank0]: An attempt has been made to start a new process before the
[rank0]: current process has finished its bootstrapping phase.
[rank0]: This probably means that you are not using fork to start your
[rank0]: child processes and you have forgotten to use the proper idiom
[rank0]: in the main module:
[rank0]: if __name__ == '__main__':
[rank0]: freeze_support()
[rank0]: ...
[rank0]: The "freeze_support()" line can be omitted if the program
[rank0]: is not going to be frozen to produce an executable.
INFO 06-20 01:29:23 custom_all_reduce_utils.py:196] generating GPU P2P access cache in /home/ben/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
[rank0]: Traceback (most recent call last):
[rank0]: File "<string>", line 1, in <module>
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
[rank0]: exitcode = _main(fd, parent_sentinel)
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
[rank0]: prepare(preparation_data)
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
[rank0]: _fixup_main_from_path(data['init_main_from_path'])
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
[rank0]: main_content = runpy.run_path(main_path,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 289, in run_path
[rank0]: return _run_module_code(code, init_globals, run_name,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
[rank0]: _run_code(code, mod_globals, init_globals,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/nvme0n1/ben/test/test.py", line 10, in <module>
[rank0]: llm = LLM(
[rank0]: File "/nvme0n1/ben/vllm/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/nvme0n1/ben/vllm/vllm/engine/llm_engine.py", line 384, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/nvme0n1/ben/vllm/vllm/engine/llm_engine.py", line 230, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/executor_base.py", line 41, in __init__
[rank0]: self._init_executor()
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/multiproc_gpu_executor.py", line 68, in _init_executor
[rank0]: self._run_workers("init_device")
[rank0]: File "/nvme0n1/ben/vllm/vllm/executor/multiproc_gpu_executor.py", line 122, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/nvme0n1/ben/vllm/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/nvme0n1/ben/vllm/vllm/worker/worker.py", line 358, in init_worker_distributed_environment
[rank0]: ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/parallel_state.py", line 656, in ensure_model_parallel_initialized
[rank0]: initialize_model_parallel(tensor_model_parallel_size,
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/parallel_state.py", line 617, in initialize_model_parallel
[rank0]: _TP = GroupCoordinator(
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/parallel_state.py", line 158, in __init__
[rank0]: self.ca_comm = CustomAllreduce(
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce.py", line 174, in __init__
[rank0]: if not _can_p2p(rank, world_size):
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce.py", line 78, in _can_p2p
[rank0]: if not gpu_p2p_access_check(rank, i):
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 201, in gpu_p2p_access_check
[rank0]: result = can_actually_p2p(batch_src, batch_tgt)
[rank0]: File "/nvme0n1/ben/vllm/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 138, in can_actually_p2p
[rank0]: p_src.start()
[rank0]: File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
[rank0]: self._popen = self._Popen(self)
[rank0]: File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
[rank0]: return Popen(process_obj)
[rank0]: File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]: super().__init__(process_obj)
[rank0]: File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]: self._launch(process_obj)
[rank0]: File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
[rank0]: prep_data = spawn.get_preparation_data(process_obj._name)
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
[rank0]: _check_not_importing_main()
[rank0]: File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
[rank0]: raise RuntimeError('''
[rank0]: RuntimeError:
[rank0]: An attempt has been made to start a new process before the
[rank0]: current process has finished its bootstrapping phase.
[rank0]: This probably means that you are not using fork to start your
[rank0]: child processes and you have forgotten to use the proper idiom
[rank0]: in the main module:
[rank0]: if __name__ == '__main__':
[rank0]: freeze_support()
[rank0]: ...
[rank0]: The "freeze_support()" line can be omitted if the program
[rank0]: is not going to be frozen to produce an executable.
ERROR 06-20 01:29:24 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3681352 died, exit code: -15
INFO 06-20 01:29:24 multiproc_worker_utils.py:123] Killing local vLLM worker processes
ERROR 06-20 01:29:24 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3681347 died, exit code: -15
INFO 06-20 01:29:24 multiproc_worker_utils.py:123] Killing local vLLM worker processes
For #5669, it works with or without export VLLM_WORKER_MULTIPROC_METHOD=fork, albeit with a small error after text generation is completed:
INFO 06-20 02:08:42 model_runner.py:965] Graph capturing finished in 6 secs.
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.71s/it, est. speed input: 18.70 toks/s, output: 168.34 toks/s]
ERROR 06-20 02:08:45 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3690565 died, exit code: -15
INFO 06-20 02:08:45 multiproc_worker_utils.py:123] Killing local vLLM worker processes
from vllm.
This is expected. #5669 contains fixes for the error you mentioned.
from vllm.
After building from the latest code #5669, everything is working now! (I faced some challenges with setting the g++ version and the cudatoolkit version without sudo permissions, but I managed to resolve them.)
Thank you so much for your help! I appreciate your efforts. Should I close this issue?
from vllm.
You can keep it open until #5669 is merged; I still have to fix some test failures there.
from vllm.
Hi. Encountering the same issue. Is there a workaround?
VLLM_WORKER_MULTIPROC_METHOD=fork not working.
from vllm.
Hi. Encountering the same issue. Is there a workaround? VLLM_WORKER_MULTIPROC_METHOD=fork not working.
There are currently three ways to work around this:
- install a previous version:
pip install vllm==0.4.3
- build from source using the latest fix
git clone git@github.com:vllm-project/vllm.git
cd vllm
gh pr checkout 5669
pip install .
- Wait for the next update, which should include the latest fixes.
The issue has been addressed in the latest pull request, but it hasn't been included in a build yet.
from vllm.
@zixuzixu Thank you.
But the issue happens on vllm 0.4.3 as well. I opened an issue.
(BTW I'm using Docker).
from vllm.