Comments (6)
Thanks, it worked.
Where can I find a list of all parameters? The parameter you mentioned is not documented here: https://github.com/ollama/ollama/blob/main/docs/modelfile.md
Defaults:
- 8 GPU layers for gemma2:27b
- 22 GPU layers for gemma2:9b
What worked:
- 6 GPU layers for gemma2:27b
- 19 GPU layers for gemma2:9b
Modelfile gemma2l:
FROM gemma2:27b
PARAMETER num_gpu 6
ollama create gemma2l -f pathtofile/gemma2l
Modelfile gemma2s:
FROM gemma2
PARAMETER num_gpu 19
ollama create gemma2s -f pathtofile/gemma2s
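For reference, a separate Modelfile is not strictly required: Modelfile PARAMETERs such as num_gpu can also be passed per request in the options field of an /api/generate call. A sketch of the request body, using the working value from above:

```json
{
  "model": "gemma2:27b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 6 }
}
```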
Logs
Jun 28 13:40:14 archlinux ollama[211707]: time=2024-06-28T13:40:14.215+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4270095903/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 33021"
Jun 28 13:40:36 archlinux ollama[211707]: time=2024-06-28T13:40:36.915+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4270095903/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 18 --parallel 1 --port 36953"
Jun 28 13:40:51 archlinux ollama[211707]: time=2024-06-28T13:40:51.293+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama4270095903/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --parallel 1 --port 33949"
Jun 28 13:44:57 archlinux ollama[1036]: time=2024-06-28T13:44:57.033+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 41111"
Jun 28 13:45:39 archlinux ollama[1036]: time=2024-06-28T13:45:39.994+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 41445"
Jun 28 13:53:46 archlinux ollama[1036]: time=2024-06-28T13:53:46.235+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --parallel 1 --port 45083"
Jun 28 14:08:13 archlinux ollama[1036]: time=2024-06-28T14:08:13.101+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-b26e6713dc749dda35872713fa19a568040f475cc71cb132cff332fe7e216462 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 31 --parallel 1 --port 41101"
Jun 28 14:11:10 archlinux ollama[1036]: time=2024-06-28T14:11:10.351+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 42557"
Jun 28 14:23:09 archlinux ollama[1036]: time=2024-06-28T14:23:09.476+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 43707"
Jun 28 14:26:33 archlinux ollama[1036]: time=2024-06-28T14:26:33.030+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 1024 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 45287"
Jun 28 14:27:35 archlinux ollama[1036]: time=2024-06-28T14:27:35.763+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 18 --parallel 1 --port 43549"
Jun 28 15:24:47 archlinux ollama[1036]: time=2024-06-28T15:24:47.997+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama2544715102/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/.ollama/models/blobs/sha256-e84ed7399c82fbf7dbd6cdef3f12d356c3cdb5512e5d8b2a9898080cbcdd72e5 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 22 --parallel 1 --port 35471"
from ollama.
It seems like there's an issue with VRAM. But why can't this model run on my system when Mixtral runs with reasonable performance? Can some parameters be changed to make it run?
Jun 28 13:53:46 archlinux ollama[1036]: ggml_cuda_init: found 1 CUDA devices:
Jun 28 13:53:46 archlinux ollama[1036]: Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
Jun 28 13:53:46 archlinux ollama[1036]: llm_load_tensors: ggml ctx size = 0.49 MiB
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: offloading 8 repeating layers to GPU
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: offloaded 8/47 layers to GPU
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: CPU buffer size = 14898.60 MiB
Jun 28 13:53:54 archlinux ollama[1036]: llm_load_tensors: CUDA0 buffer size = 2430.56 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: n_ctx = 2048
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: n_batch = 512
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: n_ubatch = 512
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: flash_attn = 0
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: freq_base = 10000.0
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: freq_scale = 1
Jun 28 13:53:55 archlinux ollama[1036]: llama_kv_cache_init: CUDA_Host KV buffer size = 608.00 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_kv_cache_init: CUDA0 KV buffer size = 128.00 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: KV self size = 736.00 MiB, K (f16): 368.00 MiB, V (f16): 368.00 MiB
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
Jun 28 13:53:55 archlinux ollama[1036]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1431.85 MiB on device 0: cudaMalloc failed: out of memory
Jun 28 13:53:55 archlinux ollama[1036]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1501405184
Jun 28 13:53:55 archlinux ollama[1036]: llama_new_context_with_model: failed to allocate compute buffers
Jun 28 13:53:55 archlinux ollama[1036]: llama_init_from_gpt_params: error: failed to create context with model '/var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72'
Jun 28 13:53:56 archlinux ollama[16132]: ERROR [load_model] unable to load model | model="/var/lib/ollama/.ollama/models/blobs/sha256-b6ee2328408ebc031359e9745973b09963df9269468d37e1ea7912862aadec72" tid="134527161749504" timestamp=1719575636
from ollama.
Modelfile gemma2s:
FROM gemma2
PARAMETER num_ctx 1024
The model does not load with a smaller num_ctx of 1024 either; same error. It says out of memory, but it isn't: nearly the entire 4096 MB is available (13/4096 MB used).
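A rough way to see why the default of 8 layers fails while 6 works: sum the per-layer weight and KV-cache costs from the log lines above and compare against the 4 GB card, leaving room for the ~1.4 GiB compute buffer whose allocation failed. A minimal sketch, assuming uniform layer sizes and a guessed overhead constant:

```python
# Back-of-envelope VRAM budget for choosing num_gpu, using the figures
# from the gemma2:27b log lines above (4 GB RTX 3050 Laptop GPU).
# The OVERHEAD constant is a rough guess, not a measured value.

VRAM_TOTAL_MIB = 4096
WEIGHTS_PER_LAYER = 2430.56 / 8   # CUDA0 weight buffer / 8 offloaded layers
KV_PER_LAYER = 128.0 / 8          # CUDA0 KV buffer / 8 offloaded layers
COMPUTE_BUFFER = 1431.85          # size of the allocation that failed
OVERHEAD = 200                    # CUDA context etc. (assumption)

def fits(num_gpu_layers: int) -> bool:
    """True if this many offloaded layers should fit in VRAM."""
    used = (num_gpu_layers * (WEIGHTS_PER_LAYER + KV_PER_LAYER)
            + COMPUTE_BUFFER + OVERHEAD)
    return used <= VRAM_TOTAL_MIB

# largest layer count that fits under this budget
best = max(n for n in range(47) if fits(n))
print(best)  # → 7
```

Under these assumptions the default of 8 layers just overshoots the 4096 MiB budget once the compute buffer is counted, which matches the observation that 6 layers load fine.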
from ollama.
1ed4f52 resolves (for me) the problem of OOM during model load. You can get the model to load without this patch by setting num_gpu lower (search the logs for --n-gpu-layers to see what the default value is for your config).
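Searching the logs for those defaults can be scripted; a small sketch that scans "starting llama server" lines for the chosen values (on a systemd setup you might feed it the output of `journalctl -u ollama`, which is an assumption about how Ollama is installed):

```python
import re

def extract_gpu_layers(log_text: str) -> list[int]:
    """Pull every --n-gpu-layers value out of Ollama server log text."""
    return [int(n) for n in re.findall(r"--n-gpu-layers (\d+)", log_text)]

sample = 'msg="starting llama server" cmd="... --n-gpu-layers 22 --parallel 1"'
print(extract_gpu_layers(sample))  # → [22]
```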
from ollama.
The available options are listed in the API doc: Line 288 in 1ed4f52
from ollama.
Thanks.
from ollama.