Comments (4)
Hi @ElliottDyson
Self-speculative decoding with vLLM is not available right now. Will let you know when it's available.
BTW, Speculative decoding support in vLLM is also in progress (https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/).
from ipex-llm.
Hello there,
I was wondering whether it's possible to have self-speculative decoding operate with IQ2 as the draft model and FP8 as the target model (it has been shown that FP8 very rarely differs in accuracy from FP16).
A look into integrating the following 1.58-bit quant method would also be interesting: ggerganov/llama.cpp#5999
I was also curious whether llama.cpp quants other than 4-bit are compatible at all, as I noticed the examples you provided only use 4-bit quantisations. My interest here is the ability to offload some number of layers to the GPU and keep the remaining layers on the CPU, which is an incredibly useful feature for working with much larger models and/or longer context lengths.
Thanks
@ElliottDyson currently we have only optimized IQ2 for memory footprint, not yet for speed, so using IQ2 as the draft model may not be faster than INT4; using FP8 as the target model may be possible.
And we do support llama.cpp-compatible IQ2 and IQ1 models through our C++ backend (see https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html).
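On the partial-offload question: the llama.cpp backend linked above accepts the standard llama.cpp flags, so a run might look like the sketch below. The model path is a placeholder, and `-ngl` (`--n-gpu-layers`) controls how many layers land on the GPU:

```shell
# Placeholder model path; any llama.cpp-compatible GGUF should work,
# including IQ2/IQ1 quants. -ngl offloads the first N layers to the GPU;
# the remaining layers stay on the CPU.
./main -m ./models/llama-2-7b.IQ2_XS.gguf \
  -p "Once upon a time" -n 64 -c 4096 -ngl 20
```

Lowering `-ngl` trades speed for GPU memory, which is what makes larger models or longer contexts fit.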
Sorry, one more thing I forgot to ask: is it possible to use your self-speculative decoding method, or your custom IQ2 quantisation, with vLLM in any way? Only the typical low-bit quants are mentioned in the docs, and I can't find the source for "ipex_llm.vllm.entrypoints.llm" to work this out myself. I also had a thought that may work better than configuring for the various custom llama.cpp quants such as IQ2: integrating CPU layer offloading directly into the core methods you are using here. It's just a possible alternative in case it's any easier.
Again, thank you for all the work your team have been doing here!
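On the CPU layer-offloading idea above: at its core it is just a per-layer device assignment, in the spirit of llama.cpp's `-ngl` flag. A minimal sketch, where the key names and the `xpu` device string are illustrative assumptions rather than an ipex-llm API:

```python
def build_device_map(n_layers, n_gpu_layers, gpu="xpu", cpu="cpu"):
    """Assign the first n_gpu_layers transformer layers to the GPU and
    the rest to the CPU, mirroring llama.cpp's -ngl behaviour.
    n_gpu_layers is clamped to [0, n_layers]."""
    n_gpu = max(0, min(n_layers, n_gpu_layers))
    return {f"layers.{i}": (gpu if i < n_gpu else cpu)
            for i in range(n_layers)}
```

A map like this is what frameworks consume to place each layer's weights (and route activations) across devices, so the GPU budget caps memory use while the CPU picks up the remainder.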
Just tried this combination, thank you. FP8 as the target and INT4 as the draft worked very well. Looking forward to the potential of an even speedier lower-precision draft model! 😁
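For intuition on why a speedier draft model would help further: in the standard speculative-decoding analysis, if each drafted token is accepted with probability `alpha` and the draft proposes `k` tokens per pass, you can estimate tokens per target pass and the resulting speedup. The `draft_cost` ratio (draft time per token divided by target time per token) is exactly where a faster low-bit draft pays off. These formulas are the textbook approximation, not measurements of ipex-llm:

```python
def expected_tokens(alpha, k):
    """Expected tokens generated per target-model pass when the draft
    proposes k tokens and each is accepted with probability alpha."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, draft_cost):
    """Approximate wall-clock speedup over plain decoding, where
    draft_cost is the draft's per-token time relative to the target's."""
    return expected_tokens(alpha, k) / (k * draft_cost + 1)
```

For example, with alpha = 0.8, k = 4, and a draft that costs 10% of the target per token, the estimate is roughly 2.4x; halving the draft cost pushes that higher, which is why a fast IQ2 draft would be attractive once it is speed-optimized.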