Comments (3)
Update: upgraded to vLLM 0.5.0.post1 and added ray to requirements-neuron.txt; still not working:
WARNING 06-14 08:15:40 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 06-14 08:15:45 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-14 08:15:45 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=True, max_log_len=None)
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
INFO 06-14 08:15:47 config.py:623] Defaulting to use ray for distributed inference
WARNING 06-14 08:15:47 config.py:437] Possibly too large swap space. 8.00 GiB out of the 15.27 GiB total CPU memory is allocated for the swap space.
INFO 06-14 08:15:47 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-14 08:15:48 utils.py:465] Pin memory is not supported on Neuron.
Loading checkpoint shards:  25%|██▌ | 1/4 [00:37<01:51, 37.24s/it]
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 28, in <module>
    subprocess.check_call(shlex.split(" ".join(sys.argv[1:])))
  File "/opt/conda/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python3', '-m', 'vllm.entrypoints.openai.api_server', '--port=8000', '--model=meta-llama/Meta-Llama-3-8B-Instruct', '--tensor-parallel-size=2', '--disable-log-requests', '--enable-prefix-caching', '--gpu-memory-utilization=0.9']' died with <Signals.SIGKILL: 9>.
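The SIGKILL during checkpoint loading points at the host OOM killer rather than a vLLM error, which fits the 15.27 GiB total CPU memory warning above for a bf16 8B model; note also that the engine config reports device_config=cpu. A quick way to rule out the device side first (a sketch, assuming the image ships the standard aws-neuronx-tools; the container name is a placeholder):

docker exec <container> ls -l /dev/neuron*   # tp=2 needs the cores behind these devices visible
docker exec <container> neuron-ls            # lists each Neuron device and its cores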
Ray is not required for the Neuron device.
I see you are attaching only one core to the container when calling docker run; for tp=2, at least 2 Neuron cores should be attached. Can you please modify the docker run command to include these two devices?
--device=/dev/neuron0 --device=/dev/neuron1
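For reference, a minimal sketch of the adjusted command; the image name and port mapping are placeholders, while the server arguments are taken verbatim from the traceback above:

docker run --rm -p 8000:8000 \
    --device=/dev/neuron0 --device=/dev/neuron1 \
    <your-vllm-neuron-image> \
    python3 -m vllm.entrypoints.openai.api_server \
        --port=8000 \
        --model=meta-llama/Meta-Llama-3-8B-Instruct \
        --tensor-parallel-size=2 \
        --disable-log-requests \
        --enable-prefix-caching \
        --gpu-memory-utilization=0.9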
@ashrafMahgoub Reading the AWS docs, it seems that each Neuron device should have two Neuron cores. In that case, requesting a single device should be enough? With EKS, I tried requesting 2 devices on an inf2 instance that has a single Inferentia2 chip and it failed: Could not open the nd2
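For the EKS path, a minimal sketch of a pod spec, assuming the AWS Neuron device plugin is installed and exposes the aws.amazon.com/neuron resource (pod and image names are placeholders, not from this thread); on Inferentia2 each chip carries two NeuronCores, so one device should cover tp=2:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vllm-neuron                 # placeholder name
spec:
  containers:
  - name: vllm
    image: your-vllm-neuron-image   # placeholder image
    command: ["python3", "-m", "vllm.entrypoints.openai.api_server",
              "--model=meta-llama/Meta-Llama-3-8B-Instruct",
              "--tensor-parallel-size=2"]
    resources:
      limits:
        aws.amazon.com/neuron: "1"  # one Inferentia2 chip = two NeuronCores
EOF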
Related Issues (20)
- [Usage]: example/offline_inference_chat.py run error. HOT 1
- [Usage]: How to stop VLLM during generation ? HOT 2
- [RFC]: Pinned Caching with Automatic Prefix Caching (Related to Anthropic Prompt Caching API) HOT 12
- [Feature]: Official ROCm Binary to Speed Up vLLM Installation
- [Performance]: V0.6.0 version of the model, benchmarking found that the number of successful responses accounted for half of the number of requests, which is why HOT 7
- [Bug]: guided generation can't always finish generating the requested structure HOT 13
- [Bug]: vllm v0.6.0 profiler report GPUExecutorAsync object has no attribute '_run_workers' on ROCm and NV H20 HOT 2
- [Bug]: Kernel died while waiting for execute reply in Kaggle TPU VM v3-8 (2024-08-22) HOT 2
- [Usage]: how to shutdown vllm server HOT 1
- [Bug]: MistralTokenizer object has no attribute 'get_vocab' HOT 5
- [Feature]: Conditional Prompt Inclusion in `generate` Function for Streaming Efficiency
- [Bug]: internvl2 multi-prompt input with one image each get RuntimeError HOT 2
- [Misc]: vLLM v0.6.0 CUDA 12 missing wheel file HOT 1
- [Bug]: vLLM 0.6.0 produces CUDA error when loading quantized models (FP8) HOT 1
- [Usage]: Execution speed of non-Lora requests HOT 6
- [Bug]: Crash after few multi image calls HOT 2
- [Performance]: INFO 09-11 12:41:50 spec_decode_worker.py:790] SpecDecodeWorker scoring_time_ms is slow HOT 3
- [Bug]: Can't load any gemma 2 model
- [Bug]: OOM when running llama3.1-8B-Instruct HOT 1
- [Bug]: OOM When trying to start server on Sagemaker