Comments (26)
I'll have a look at this tonight, thanks for reporting!
from serge.
In the case where it's still happening, can you run the following:
docker compose up -d
docker compose exec api bash
llama -m weights/your_model.bin
see if it loads & starts outputting stuff. If it doesn't please send me the traceback.
7B model works for me, but 13B is not happy:
root@8730b2d5f0fd:/usr/src/app# llama -m weights/ggml-alpaca-13B-q4_0.bin
main: seed = 1679613493
llama_model_load: loading model from 'weights/ggml-alpaca-13B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'weights/ggml-alpaca-13B-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
llama_init_from_file: failed to load model
main: error: failed to load model 'weights/ggml-alpaca-13B-q4_0.bin'
(MacBook Pro M1 Max, 32 GB RAM)
from serge.
7B
run command: llama -m weights/ggml-alpaca-7B-q4_0.bin
main: seed = 1679625493
llama_model_load: loading model from 'weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'weights/ggml-alpaca-7B-q4_0.bin'
llama_model_load: ..............................Killed
13B
run command: llama -m weights/ggml-alpaca-13B-q4_0.bin
main: seed = 1679625532
llama_model_load: loading model from 'weights/ggml-alpaca-13B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'weights/ggml-alpaca-13B-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
llama_init_from_file: failed to load model
main: error: failed to load model 'weights/ggml-alpaca-13B-q4_0.bin'
I'm using Docker on a Mac mini 2018 (Intel i5).
from serge.
Indeed, low RAM is often the issue, but there's no need for a graphics card with this repo. I've updated the RAM requirements in the README; make sure you have enough free!
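If you want to sanity-check how much memory is actually available where llama runs, one quick way (assuming the compose service is called api, as used elsewhere in this thread, and that free is present in the image) is:
# memory visible from inside the api container
docker compose exec api free -h
# or watch live memory usage of all containers from the host
docker stats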
from serge.
Same issue, but I chose 30B.
from serge.
Don't know what hardware you guys are on, but it took 10+ minutes for me to get a response on an i5-6500 using the smallest model.
from serge.
Can you check the logs of the api container and see if it's maybe converting the model and hanging because of that?
Also, are you all on Windows with WSL?
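For reference, a common way to follow those logs with Docker Compose (again assuming the service is named api) is:
# stream the api container's logs
docker compose logs -f api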
from serge.
I have an i5-7400, but the important thing is that I used Alpaca with llama directly, without the UI, and it responded in 10 seconds on the same model.
from serge.
I use WSL.
from serge.
Where can I see the logs?
from serge.
I use WSL 2, Windows 11, Intel i7 11800.
from serge.
I'm on an Ubuntu Cloud Image running on Proxmox with the host CPU passed through.
Trying to figure out how to run the API directly on a different node to see if there is a difference.
File weights/ggml-alpaca-7B-q4_0.bin already converted
INFO: 172.18.0.2:56482 - "GET /models HTTP/1.1" 200 OK
INFO: 172.18.0.2:56498 - "GET /chats HTTP/1.1" 200 OK
INFO: 172.18.0.3:46222 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:46224 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:46236 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:50102 - "GET /models HTTP/1.1" 200 OK
INFO: 172.18.0.3:47664 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:47680 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:55744 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:48880 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.2:35214 - "GET /models HTTP/1.1" 200 OK
INFO: 172.18.0.2:35224 - "GET /chats HTTP/1.1" 200 OK
INFO: 172.18.0.2:48390 - "POST /chat?temp=0.1&top_k=50&max_length=256&top_p=0.95&model=ggml-alpaca-13B-q4_0.bin&repeat_last_n=64&repeat_penalty=1.3&preprompt=Below+is+an+instruction+that+describes+a+task.+Write+a+response+that+appropriately+completes+the+request.+The+response+must+be+accurate%2C+concise+and+evidence-based+whenever+possible.+A+complete+answer+is+always+ended+by+%5Bend+of+text%5D. HTTP/1.1" 200 OK
INFO: 172.18.0.2:48390 - "GET /chats HTTP/1.1" 200 OK
INFO: 172.18.0.2:48402 - "GET /chat/25b1b3b5-226c-4143-9491-07a8f7acfd51 HTTP/1.1" 200 OK
INFO: 172.18.0.3:54912 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:54920 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:54926 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:46472 - "GET /chat/25b1b3b5-226c-4143-9491-07a8f7acfd51/question?prompt=[PROMPT] HTTP/1.1" 200 OK
INFO: 172.18.0.3:44188 - "GET /models HTTP/1.1" 200 OK
from serge.
Yeah, using https://github.com/ggerganov/llama.cpp with
./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins
is outrageously faster than serge.
from serge.
Thank you for making this.
If there's anything I can do to help, please let me know.
from serge.
Same here, running on macOS on a MacBook Pro M2, with Podman.
INFO: 10.89.0.8:45776 - "GET /chat/637636b2-7e17-4508-a5a9-be04d1c6e894/question?prompt=What+is+a+bridge%3F HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 436, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 84, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 227, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 230, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 219, in stream_response
async for data in self.body_iterator:
File "/usr/src/app/main.py", line 161, in event_generator
async for output in generate(
File "/usr/src/app/utils/generate.py", line 65, in generate
raise ValueError(error_output.decode("utf-8"))
ValueError: main: seed = 1679580662
llama_model_load: loading model from '/usr/src/app/weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
from serge.
Keeping an eye on this; I met the same issue on an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz.
from serge.
Same problem here, on WSL.
from serge.
Hey everyone! Can you try to grab the latest main, rebuild the docker container and tell me if it's still happening?
In the case where it's still happening, can you run the following:
docker compose up -d
docker compose exec api bash
llama -m weights/your_model.bin --n_parts 1
see if it loads & starts outputting stuff. If it doesn't please send me the traceback.
from serge.
Similar issues here as with @panicsteve.
7B model works well for me.
13B and 30B are failing (with different errors):
/usr/src/app# llama -m weights/ggml-alpaca-13B-q4_0.bin
main: seed = 1679620095
llama_model_load: loading model from 'weights/ggml-alpaca-13B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'weights/ggml-alpaca-13B-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
llama_init_from_file: failed to load model
main: error: failed to load model 'weights/ggml-alpaca-13B-q4_0.bin'
/usr/src/app# llama -m weights/ggml-alpaca-30B-q4_0.bin
main: seed = 1679622447
llama_model_load: loading model from 'weights/ggml-alpaca-30B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
Segmentation fault
I'm using Docker on WSL.
from serge.
WSL 2:
root@f951b7cb75a6:/usr/src/app# llama -m weights/ggml-alpaca-30B-q4_0.bin
main: seed = 1679648468
llama_model_load: loading model from 'weights/ggml-alpaca-30B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
Segmentation fault
from serge.
Stumbled across this issue too. Could be you are running out of memory...
On Windows/WSL2, Docker containers are restricted in how much memory they can use. At https://learn.microsoft.com/en-us/windows/wsl/wsl-config you can find information on increasing the memory and CPU available to WSL2.
Not sure if that helps anyone, but it fixed the issue with output like #27 (comment) for me.
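As a rough sketch (example values only, adjust to your machine), the file lives at %UserProfile%\.wslconfig on the Windows host and might look like:
# %UserProfile%\.wslconfig
[wsl2]
memory=16GB
processors=8
After saving it, run wsl --shutdown from PowerShell and restart Docker so the new limits take effect.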
from serge.
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
For 30B
from serge.
Hey everyone! Can you try to grab the latest main, rebuild the docker container and tell me if it's still happening?
In the case where it's still happening, can you run the following:
docker compose up -d
docker compose exec api bash
llama -m weights/your_model.bin
see if it loads & starts outputting stuff. If it doesn't please send me the traceback.
root@69f29a3e3123:/usr/src/app# llama -m weights/ggml-alpaca-7B-q4_0.bin
main: seed = 1679694795
llama_model_load: loading model from 'weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'weights/ggml-alpaca-7B-q4_0.bin'
llama_model_load: ...................................Killed
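"Killed" during loading usually means the kernel's OOM killer stopped the process. One way to confirm (from the WSL/host shell, since dmesg may be restricted inside the container) is, for example:
# look for out-of-memory kills in the kernel log
dmesg | grep -i -E 'killed process|out of memory'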
from serge.
main: seed = 1679712774
llama_model_load: loading model from 'weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'weights/ggml-alpaca-7B-q4_0.bin'
llama_model_load: ...................Killed
from serge.
Most of the time, the error is caused by low RAM or VRAM.
Using a research graphics card seems to be the answer.
from serge.
I'm getting this:
root@personal-gpt-tasks-546548ffbb-69c85:/app# llama -m /mnt/data/weights/ggml-alpaca-7B-q4_0.bin --n_parts 1
main: seed = 1679775129
llama_model_load: loading model from '/mnt/data/weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
Illegal instruction (core dumped)
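"Illegal instruction" typically points to a CPU instruction-set mismatch, e.g. a llama binary built with AVX/AVX2 running on a CPU or VM that doesn't expose those flags. This is only a guess, but you can check what the CPU advertises to the container with:
# list SIMD-related CPU flags visible here
grep -o -w -E 'avx|avx2|avx512f|fma|sse4_2' /proc/cpuinfo | sort -u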
from serge.