Comments (26)
I'll have a look at this tonight, thanks for reporting!
from serge.
In the case where it's still happening, can you run the following:
docker compose up -d
docker compose exec api bash
llama -m weights/your_model.bin
see if it loads & starts outputting stuff. If it doesn't please send me the traceback.
7B model works for me, but 13B is not happy:
root@8730b2d5f0fd:/usr/src/app# llama -m weights/ggml-alpaca-13B-q4_0.bin
main: seed = 1679613493
llama_model_load: loading model from 'weights/ggml-alpaca-13B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'weights/ggml-alpaca-13B-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
llama_init_from_file: failed to load model
main: error: failed to load model 'weights/ggml-alpaca-13B-q4_0.bin'
(MacBook Pro M1 Max, 32 GB RAM)
from serge.
7B
run command: llama -m weights/ggml-alpaca-7B-q4_0.bin
main: seed = 1679625493
llama_model_load: loading model from 'weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'weights/ggml-alpaca-7B-q4_0.bin'
llama_model_load: ..............................Killed
13B
run command: llama -m weights/ggml-alpaca-13B-q4_0.bin
main: seed = 1679625532
llama_model_load: loading model from 'weights/ggml-alpaca-13B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'weights/ggml-alpaca-13B-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
llama_init_from_file: failed to load model
main: error: failed to load model 'weights/ggml-alpaca-13B-q4_0.bin'
I'm using Docker on a Mac mini 2018 (Intel i5).
from serge.
Indeed, low RAM is often the issue, but there's no need for a graphics card with this repo. I've updated the RAM requirements in the README; make sure you have enough free!
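If you want to sanity-check how much memory is actually available where llama runs, one quick way (assuming the compose service is called api, as used elsewhere in this thread, and that free is present in the image) is:
# memory visible from inside the api container
docker compose exec api free -h
# or watch live memory usage of all containers from the host
docker stats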
from serge.
Same issue, but I chose 30B.
from serge.
Don't know what hardware you guys are on, but it took 10+ minutes for me to get a response on an i5-6500 using the smallest model.
from serge.
Can you check the logs of the api container and see if it's maybe converting the model and hanging because of that?
Also, are you all on Windows with WSL?
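For reference, a common way to follow those logs with Docker Compose (again assuming the service is named api) is:
# stream the api container's logs
docker compose logs -f api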
from serge.
I have an i5-7400, but the important thing is that I used Alpaca with llama directly, without the UI, and it responded in 10 seconds on the same model.
from serge.
I use WSL.
from serge.
Where can I see the logs?
from serge.
I use WSL 2, Windows 11, Intel i7 11800.
from serge.
I'm on an Ubuntu Cloud Image running on Proxmox with the host CPU passed through.
Trying to figure out how to run the API directly on a different node to see if there is a difference.
File weights/ggml-alpaca-7B-q4_0.bin already converted
INFO: 172.18.0.2:56482 - "GET /models HTTP/1.1" 200 OK
INFO: 172.18.0.2:56498 - "GET /chats HTTP/1.1" 200 OK
INFO: 172.18.0.3:46222 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:46224 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:46236 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:50102 - "GET /models HTTP/1.1" 200 OK
INFO: 172.18.0.3:47664 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:47680 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:55744 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:48880 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.2:35214 - "GET /models HTTP/1.1" 200 OK
INFO: 172.18.0.2:35224 - "GET /chats HTTP/1.1" 200 OK
INFO: 172.18.0.2:48390 - "POST /chat?temp=0.1&top_k=50&max_length=256&top_p=0.95&model=ggml-alpaca-13B-q4_0.bin&repeat_last_n=64&repeat_penalty=1.3&preprompt=Below+is+an+instruction+that+describes+a+task.+Write+a+response+that+appropriately+completes+the+request.+The+response+must+be+accurate%2C+concise+and+evidence-based+whenever+possible.+A+complete+answer+is+always+ended+by+%5Bend+of+text%5D. HTTP/1.1" 200 OK
INFO: 172.18.0.2:48390 - "GET /chats HTTP/1.1" 200 OK
INFO: 172.18.0.2:48402 - "GET /chat/25b1b3b5-226c-4143-9491-07a8f7acfd51 HTTP/1.1" 200 OK
INFO: 172.18.0.3:54912 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:54920 - "GET /chat/1db1f49e-ca45-4a13-8aee-f6a82bde7b76 HTTP/1.1" 200 OK
INFO: 172.18.0.3:54926 - "GET /chat/8a32fda3-cc0e-46d0-afed-e7c079a588e4 HTTP/1.1" 200 OK
INFO: 172.18.0.3:46472 - "GET /chat/25b1b3b5-226c-4143-9491-07a8f7acfd51/question?prompt=[PROMPT] HTTP/1.1" 200 OK
INFO: 172.18.0.3:44188 - "GET /models HTTP/1.1" 200 OK
from serge.
Yeah, using https://github.com/ggerganov/llama.cpp with
./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins
is outrageously faster than serge.
from serge.
Thank you for making this.
If there's anything I can do to help, please let me know.
from serge.
Same here, running on macOS on a MacBook Pro M2, with Podman.
INFO: 10.89.0.8:45776 - "GET /chat/637636b2-7e17-4508-a5a9-be04d1c6e894/question?prompt=What+is+a+bridge%3F HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 436, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 84, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 227, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 230, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/sse_starlette/sse.py", line 219, in stream_response
async for data in self.body_iterator:
File "/usr/src/app/main.py", line 161, in event_generator
async for output in generate(
File "/usr/src/app/utils/generate.py", line 65, in generate
raise ValueError(error_output.decode("utf-8"))
ValueError: main: seed = 1679580662
llama_model_load: loading model from '/usr/src/app/weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
from serge.
Keeping an eye on this; I met the same issue on an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz.
from serge.
Same problem here, on WSL.
from serge.
Hey everyone! Can you try to grab the latest main, rebuild the docker container and tell me if it's still happening?
In the case where it's still happening, can you run the following:
docker compose up -d
docker compose exec api bash
llama -m weights/your_model.bin --n_parts 1
see if it loads & starts outputting stuff. If it doesn't please send me the traceback.
from serge.
Similar issues here as with @panicsteve.
7B model works well for me.
13B and 30B are failing (with different errors):
/usr/src/app# llama -m weights/ggml-alpaca-13B-q4_0.bin
main: seed = 1679620095
llama_model_load: loading model from 'weights/ggml-alpaca-13B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'weights/ggml-alpaca-13B-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
llama_init_from_file: failed to load model
main: error: failed to load model 'weights/ggml-alpaca-13B-q4_0.bin'
/usr/src/app# llama -m weights/ggml-alpaca-30B-q4_0.bin
main: seed = 1679622447
llama_model_load: loading model from 'weights/ggml-alpaca-30B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
Segmentation fault
I'm using Docker on WSL.
from serge.
WSL 2:
root@f951b7cb75a6:/usr/src/app# llama -m weights/ggml-alpaca-30B-q4_0.bin
main: seed = 1679648468
llama_model_load: loading model from 'weights/ggml-alpaca-30B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 20951.50 MB
Segmentation fault
from serge.
Stumbled across this issue too. Could be you are running out of memory...
On Windows/WSL2, Docker containers are restricted in how much memory they can use. At https://learn.microsoft.com/en-us/windows/wsl/wsl-config you can find information on increasing the memory and CPU available to WSL2.
Not sure if that helps anyone, but it fixed the issue with output like #27 (comment) for me.
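As a rough sketch (example values only, adjust to your machine), the file lives at %UserProfile%\.wslconfig on the Windows host and might look like:
# %UserProfile%\.wslconfig
[wsl2]
memory=16GB
processors=8
After saving it, run wsl --shutdown from PowerShell and restart Docker so the new limits take effect.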
from serge.
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
For 30B
from serge.
Hey everyone! Can you try to grab the latest main, rebuild the docker container and tell me if it's still happening?
In the case where it's still happening, can you run the following:
docker compose up -d
docker compose exec api bash
llama -m weights/your_model.bin
see if it loads & starts outputting stuff. If it doesn't please send me the traceback.
root@69f29a3e3123:/usr/src/app# llama -m weights/ggml-alpaca-7B-q4_0.bin
main: seed = 1679694795
llama_model_load: loading model from 'weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'weights/ggml-alpaca-7B-q4_0.bin'
llama_model_load: ...................................Killed
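"Killed" during loading usually means the kernel's OOM killer stopped the process. One way to confirm (from the WSL/host shell, since dmesg may be restricted inside the container) is, for example:
# look for out-of-memory kills in the kernel log
dmesg | grep -i -E 'killed process|out of memory'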
from serge.
main: seed = 1679712774
llama_model_load: loading model from 'weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'weights/ggml-alpaca-7B-q4_0.bin'
llama_model_load: ...................Killed
from serge.
Most of the time, the error is caused by low RAM or VRAM.
Using a research graphics card seems to be the answer.
from serge.
I'm getting this:
root@personal-gpt-tasks-546548ffbb-69c85:/app# llama -m /mnt/data/weights/ggml-alpaca-7B-q4_0.bin --n_parts 1
main: seed = 1679775129
llama_model_load: loading model from '/mnt/data/weights/ggml-alpaca-7B-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
Illegal instruction (core dumped)
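"Illegal instruction" typically points to a CPU instruction-set mismatch, e.g. a llama binary built with AVX/AVX2 running on a CPU or VM that doesn't expose those flags. This is only a guess, but you can check what the CPU advertises to the container with:
# list SIMD-related CPU flags visible here
grep -o -w -E 'avx|avx2|avx512f|fma|sse4_2' /proc/cpuinfo | sort -u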
from serge.