Comments (6)
Thanks, it worked.
Where can I find a list of all parameters? The parameter you mentioned is not documented here: https://github.com/ollama/ollama/blob/main/docs/modelfile.md
Defaults:
- 8 GPU layers for gemma2:27b
- 22 GPU layers for gemma2:9b
What worked:
- 6 GPU layers for gemma2:27b
- 19 GPU layers for gemma2:9b
Modelfile gemma2l:
FROM gemma2:27b
PARAMETER num_gpu 6
ollama create gemma2l -f pathtofile/gemma2l
Modelfile gemma2s:
FROM gemma2
PARAMETER num_gpu 19
ollama create gemma2s -f pathtofile/gemma2s
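For reference, a separate Modelfile is not strictly required: Modelfile PARAMETERs such as num_gpu can also be passed per request in the options field of an /api/generate call. A sketch of the request body, using the working value from above:

```json
{
  "model": "gemma2:27b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 6 }
}
```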
Logs
Jun 28 13:40:14 archlinux ollama[211707]: time=2024-06-28T13:40:14.215+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4270095903/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 33021"
Jun 28 13:40:36 archlinux ollama[211707]: time=2024-06-28T13:40:36.915+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4270095903/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 18 --parallel 1 --port 36953"
Jun 28 13:40:51 archlinux ollama[211707]: time=2024-06-28T13:40:51.293+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4270095903/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --parallel 1 --port 33949"
Jun 28 13:44:57 archlinux ollama[1036]: time=2024-06-28T13:44:57.033+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 41111"
Jun 28 13:45:39 archlinux ollama[1036]: time=2024-06-28T13:45:39.994+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 41445"
Jun 28 13:53:46 archlinux ollama[1036]: time=2024-06-28T13:53:46.235+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --parallel 1 --port 45083"
Jun 28 14:08:13 archlinux ollama[1036]: time=2024-06-28T14:08:13.101+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 31 --parallel 1 --port 41101"
Jun 28 14:11:10 archlinux ollama[1036]: time=2024-06-28T14:11:10.351+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 42557"
Jun 28 14:23:09 archlinux ollama[1036]: time=2024-06-28T14:23:09.476+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 43707"
Jun 28 14:26:33 archlinux ollama[1036]: time=2024-06-28T14:26:33.030+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 1024 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 45287"
Jun 28 14:27:35 archlinux ollama[1036]: time=2024-06-28T14:27:35.763+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 18 --parallel 1 --port 43549"
Jun 28 15:24:47 archlinux ollama[1036]: time=2024-06-28T15:24:47.997+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 35471"
from ollama.
It seems like there's an issue with VRAM. But why can't this model run on my system when Mixtral runs with reasonable performance? Can some parameters be changed to make it run?
Jun 28 13:53:46 archlinux ollama[1036]: ggml_cuda_init: found 1 CUDA devices:
Jun 28 13:53:46 archlinux ollama[1036]: Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
Jun 28 13:53:46 archlinux ollama[1036]: llm_load_tensors: ggml ctx size = 0.49 MiB
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: offloading 8 repeating layers to GPU
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: offloaded 8/47 layers to GPU
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: CPU buffer size = 14898.60 MiB
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: CUDA0 buffer size = 2430.56 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: n_ctx = 2048
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: n_batch = 512
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: n_ubatch = 512
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: flash_attn = 0
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: freq_base = 10000.0
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: freq_scale = 1
Jun 28 13:53:55 archlinux ollama[1036]: llama_kv_cache_init: CUDA_Host KV buffer size = 608.00 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_kv_cache_init: CUDA0 KV buffer size = 128.00 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: KV self size = 736.00 MiB, K (f16): 368.00 MiB, V (f16): 368.00 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
Jun 28 13:53:55 archlinux ollama[1036]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1431.85 MiB on device 0: cudaMalloc failed: out of memory
Jun 28 13:53:55 archlinux ollama[1036]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1501405184
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: failed to allocate compute buffers
Jun 28 13:53:55 archlinux ollama[1036]: llama_init_from_gpt_params: error: failed to create context with model '/var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72'
Jun 28 13:53:56 archlinux ollama[16132]: ERROR [load_model] unable to load model | model="/var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72" tid="134527161749504" timestamp=1719575636
from ollama.
Modelfile gemma2s:
FROM gemma2
PARAMETER num_ctx 1024
The model does not load with a smaller num_ctx of 1024 either; same error. It says out of memory, but it isn't: nearly the entire 4096 MB is available (13/4096 MB used).
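A rough way to see why the default of 8 layers fails while 6 works: sum the per-layer weight and KV-cache costs from the log lines above and compare against the 4 GB card, leaving room for the ~1.4 GiB compute buffer whose allocation failed. A minimal sketch, assuming uniform layer sizes and a guessed overhead constant:

```python
# Back-of-envelope VRAM budget for choosing num_gpu, using the figures
# from the gemma2:27b log lines above (4 GB RTX 3050 Laptop GPU).
# The OVERHEAD constant is a rough guess, not a measured value.

VRAM_TOTAL_MIB = 4096
WEIGHTS_PER_LAYER = 2430.56 / 8   # CUDA0 weight buffer / 8 offloaded layers
KV_PER_LAYER = 128.0 / 8          # CUDA0 KV buffer / 8 offloaded layers
COMPUTE_BUFFER = 1431.85          # size of the allocation that failed
OVERHEAD = 200                    # CUDA context etc. (assumption)

def fits(num_gpu_layers: int) -> bool:
    """True if this many offloaded layers should fit in VRAM."""
    used = (num_gpu_layers * (WEIGHTS_PER_LAYER + KV_PER_LAYER)
            + COMPUTE_BUFFER + OVERHEAD)
    return used <= VRAM_TOTAL_MIB

# largest layer count that fits under this budget
best = max(n for n in range(47) if fits(n))
print(best)  # → 7
```

Under these assumptions the default of 8 layers just overshoots the 4096 MiB budget once the compute buffer is counted, which matches the observation that 6 layers load fine.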
from ollama.
1ed4f52 resolves (for me) the problem of OOM during model load. You can get the model to load without this patch by setting num_gpu lower (search the logs for --n-gpu-layers to see what the default value is for your config).
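Searching the logs for those defaults can be scripted; a small sketch that scans "starting llama server" lines for the chosen values (on a systemd setup you might feed it the output of `journalctl -u ollama`, which is an assumption about how Ollama is installed):

```python
import re

def extract_gpu_layers(log_text: str) -> list[int]:
    """Pull every --n-gpu-layers value out of Ollama server log text."""
    return [int(n) for n in re.findall(r"--n-gpu-layers (\d+)", log_text)]

sample = 'msg="starting llama server" cmd="... --n-gpu-layers 22 --parallel 1"'
print(extract_gpu_layers(sample))  # → [22]
```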
from ollama.
The available options are listed in the API doc: Line 288 in 1ed4f52
from ollama.
Thanks.
from ollama.