Comments (4)
Specific optimizations for smaller models (~100M parameters).
- Improve sampling efficiency.
- We may need to merge more models.
This should not be prioritized, because the core technique of CacheFlow (memory saving) is not helpful for small models at all; still, they may benefit from iteration-level scheduling.
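The distinction drawn above can be made concrete with a toy sketch. The following is a hypothetical, much-simplified simulation of iteration-level scheduling (the idea vLLM inherits from Orca): requests join and leave the running batch at every decoding iteration rather than only at batch boundaries. The function name, the `(name, num_steps)` request tuples, and the batch limit are all invented for illustration; this is not vLLM's actual scheduler.

```python
# Toy simulation of iteration-level scheduling: waiting requests are
# admitted at every decoding step, so a finished request's slot is
# reused immediately instead of idling until the whole batch drains.
from collections import deque

def iteration_level_schedule(requests, max_batch=4):
    """requests: list of (name, num_steps). Returns the batch contents
    at each decoding iteration."""
    waiting = deque(requests)
    running = []  # list of [name, steps_left]
    trace = []
    while waiting or running:
        # Admit new requests at iteration granularity.
        while waiting and len(running) < max_batch:
            name, steps = waiting.popleft()
            running.append([name, steps])
        trace.append([name for name, _ in running])
        # One decoding iteration: every running request emits one token.
        for r in running:
            r[1] -= 1
        # Finished requests leave the batch immediately.
        running = [r for r in running if r[1] > 0]
    return trace

steps = iteration_level_schedule([("a", 3), ("b", 1), ("c", 2)], max_batch=2)
# "c" is admitted the iteration after "b" finishes, without waiting for "a".
```

Under request-level (static) batching, "c" would instead have to wait until both "a" and "b" completed, which is why iteration-level scheduling can help throughput even when memory savings do not apply.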
After the C++ version, we might need to rerun all the experiments with the new implementation.
@zhuohan123 can this work be considered complete?
If you're interested: I've been building a C++-native deep learning framework for the past few years that I want to open-source soon. The framework aims for optimal performance; here is my framework training AlexNet (most of the kernels here are cuBLASLt and cuDNN):
I'd certainly like for it to be part of vLLM. Is this something there'd be interest in? If so, I can make sure I support (or add support for) all of the pieces you need and get it connected. I can provide access to the private repo on request.
Related Issues (20)
- [Bug]: ValueError: Can't set signal handler for SIGINT while SIGINT is being deferred within a DeferSigint context when tp>1
- [Bug]: Chat templates not working
- [Misc] [CI]: AMD test flaky on main CI
- [Bug]: Typo in rocm_flash_attn.py
- [Feature]: Support HuggingFaceM4/idefics2-8b as vision model
- [Bug][Chunked prefill]: head size has to be power of two
- [Bug]: Invalid Device Ordinal on ROCm
- [Bug]: async llm engine failed unexpectedly (using mixtral-8x7b with tp=4)
- [Bug]: benchmark trtllm failed
- [Bug]: vllm 0.4.0 fails to start in WSL2 on Windows 10
- [Feature]: Phi2 LoRA support
- ImportError : /opt/conda/lib/python3.10/site-packages/vllm/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
- [Usage]: Why it is not a model parallel when I use LLM from vllm
- [Usage]: How to get the latency of each request with benchmark_serving.py
- [Usage]: if I want to run a 34B model, like yi-34B-chat, how can I use multi GPU, I just have A100 40G
- [Bug]: Processed prompts: 5%|▌ | 429/8535 [00:27<08:37, 15.68it/s] RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
- [Feature]: AMD ROCm 6.1 Support
- [Feature]: No `outlines` strong dependency
- [Bug]: Unable to process request
- [Misc]: How to access the KV cache directly?