Comments (9)
We have made some breaking changes to Qwen-1.5's int4 checkpoint in the 2024-05-21 release. Old int4 checkpoints (generated by ipex-llm 20240520 or earlier) cannot be loaded with the new ipex-llm (20240521 or later); please regenerate the int4 checkpoint with ipex-llm 20240521 or later.
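For reference, a minimal sketch of regenerating an int4 checkpoint with a recent ipex-llm, using the low-bit save/load API from the ipex-llm docs (the model and output paths are hypothetical placeholders):

```python
# Minimal sketch: regenerate a Qwen-1.5 int4 checkpoint with ipex-llm 20240521
# or later. Model and output paths are hypothetical placeholders.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen1.5-7B-Chat"    # original full-precision model
int4_path = "./qwen1.5-7b-chat-int4"   # where the new int4 checkpoint goes

# Quantize to int4 while loading, then save the low-bit checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
)
model.save_low_bit(int4_path)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.save_pretrained(int4_path)

# Later runs load the regenerated checkpoint directly:
model = AutoModelForCausalLM.load_low_bit(int4_path, trust_remote_code=True)
```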
OK, got it. Does the new version have any improvements, such as quantization accuracy or RAM usage?
Yes, there should be some improvements in speed and RAM, but not much.
I regenerated the Qwen-7B int4 model and ran it on my laptop (Ultra 7 155H), but the warm-up stage takes a very long time (more than 5 minutes). Do you have any advice?
Did you set SYCL_CACHE_PERSISTENT=1? See https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration
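For anyone following along, a sketch of enabling the variable from Python. The documented approach is to set it in the shell before launching Python; doing it from Python before torch/ipex-llm are imported should be equivalent, on the assumption that the SYCL runtime reads the variable when it initializes:

```python
# Minimal sketch: enable the persistent SYCL JIT cache. The documented way is
# `set SYCL_CACHE_PERSISTENT=1` in CMD before launching Python; setting it from
# Python before torch / ipex-llm are imported should be equivalent, since the
# SYCL runtime reads the variable when it initializes (an assumption here).
import os
os.environ["SYCL_CACHE_PERSISTENT"] = "1"

import torch                                             # import only after the env var is set
from ipex_llm.transformers import AutoModelForCausalLM   # pulls in the XPU runtime
```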
Yes, I have set it. I found that warm-up is much faster in CPU mode (about 10-20 s) but slower in XPU mode.
The CPU doesn't need JIT compilation, while the GPU does.
On CPU: load model -> quantization -> inference.
On GPU: load model -> quantization -> JIT compilation -> inference. This JIT compilation is what we call warm-up, and it takes about ten minutes.
Setting SYCL_CACHE_PERSISTENT=1 stores the GPU JIT code on disk so it doesn't need to be compiled again the second time you run it. If you are using PowerShell, please use CMD instead (the `set` command is CMD syntax).
Could you check whether C:\Users\<user name>\AppData\Roaming\libsycl_cache exists? If it exists, please delete it. Then set SYCL_CACHE_PERSISTENT=1 and run inference (this run will take a long time, about 10 minutes, because it needs to regenerate the JIT code cache). After it finishes, you should see the regenerated C:\Users\<user name>\AppData\Roaming\libsycl_cache. With the cache, subsequent inference should have no warm-up (SYCL_CACHE_PERSISTENT=1 is still required).
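A sketch that makes the warm-up visible by timing the first and second generation on XPU (the checkpoint path is a hypothetical placeholder from the earlier example):

```python
# Minimal sketch: time the first (warm-up) and second generation on XPU.
# The int4 checkpoint path is a hypothetical placeholder.
import os
os.environ["SYCL_CACHE_PERSISTENT"] = "1"  # persist the JIT cache across runs

import time
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

int4_path = "./qwen1.5-7b-chat-int4"
model = AutoModelForCausalLM.load_low_bit(int4_path, trust_remote_code=True)
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(int4_path, trust_remote_code=True)

inputs = tokenizer("Hello", return_tensors="pt").to("xpu")
with torch.inference_mode():
    for label in ("first run (warm-up)", "second run"):
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=32)
        torch.xpu.synchronize()  # wait for the GPU before reading the clock
        print(f"{label}: {time.perf_counter() - start:.1f} s")
```

In a fresh process with no cache, the first run should include the multi-minute JIT compilation; once libsycl_cache is populated, later process launches should skip it.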
OK, I will try, thank you very much. If libsycl_cache exists, even if I finish the inference process, restart, and reload the model, is there no need for a warm-up?
Yes.