Comments (6)
Just coming here from having installed 7 GPUs on a Threadripper: I'm unable to run Llama 3 70B on 7x24 GB cards, so I have a valid use case for this and would like to help with what I can as well.
Can you perhaps guide me, even vaguely, on the problems I should be aware of when attempting to make this change? I was thinking of starting by reading the tests and going from there to get an idea of how to approach it.
I would also like some way to run Llama 3 8B or something similar locally while I piece this together for vLLM, so if there is an alternative that might work, such as llama.cpp or any other project without this particular restriction, it could in the meantime help me get an idea of how this is implemented.
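For anyone wondering why a 7-GPU setup is rejected in the first place: vLLM splits attention heads evenly across tensor-parallel ranks, and Llama 3 70B has 64 query heads and 8 KV heads, neither of which is divisible by 7. A minimal sketch of that constraint (illustrative only, not the actual vLLM source):

```python
# Sketch of the head-divisibility constraint behind tensor_parallel_size
# (illustrative; not copied from the vLLM code).

def check_tensor_parallel(num_attention_heads: int,
                          num_kv_heads: int,
                          tensor_parallel_size: int) -> None:
    """Each tensor-parallel rank must receive a whole number of heads."""
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({num_attention_heads}) must be "
            f"divisible by tensor parallel size ({tensor_parallel_size}).")
    if (num_kv_heads % tensor_parallel_size != 0
            and tensor_parallel_size % num_kv_heads != 0):
        raise ValueError(
            f"Number of KV heads ({num_kv_heads}) is not compatible with "
            f"tensor parallel size ({tensor_parallel_size}).")

check_tensor_parallel(64, 8, 8)       # OK: 8 query heads and 1 KV head per GPU
try:
    check_tensor_parallel(64, 8, 7)   # 64 heads cannot be split evenly over 7 GPUs
except ValueError as e:
    print(e)
```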
The biggest problem that you will have is that we have a central KV cache management system which manages the mapping from physical to logical blocks. There is a large implicit assumption that each shard has the same mapping, since there is one set of metadata passed to each shard during the forward pass.
It will require a deep understanding and refactoring of the entire system to unwind this assumption.
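To make that assumption concrete, here is a toy sketch (not actual vLLM code, class and field names made up) of one physical-block mapping being produced centrally and reused on every tensor-parallel rank; if one shard had a smaller KV cache than the others, the shared block IDs would stop being valid there:

```python
# Toy sketch of the "one mapping for all shards" assumption (not vLLM code).
# A central manager allocates physical blocks once, and the same block table
# is sent to every tensor-parallel worker as attention metadata.

from dataclasses import dataclass, field

@dataclass
class ToyBlockManager:
    num_gpu_blocks: int                       # assumed identical on every shard
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_gpu_blocks))

    def allocate(self, n: int) -> list[int]:
        # Block IDs double as physical slot indices in each worker's cache.
        return [self.free.pop() for _ in range(n)]

manager = ToyBlockManager(num_gpu_blocks=1024)
block_table = manager.allocate(4)             # e.g. [1023, 1022, 1021, 1020]

# Every rank receives the *same* block_table and indexes its local KV cache
# tensor with it, so a shard with fewer than 1024 blocks would index out of range.
for rank in range(7):
    print(f"rank {rank} writes KV into physical blocks {block_table}")
```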
I'd be happy to try and implement this, but I'm less familiar with this part of the vLLM codebase.
Can you please refer me to the primary files I'd have to change for this, or give a brief explanation of the central KV cache management system you mentioned?
(I am familiar with PagedAttention and block management, but less so in the distributed inference context.)
You're going to have to touch many subsystems.
- The Scheduler, BlockManager, and Attention, to get a sense of how the KV cache is managed.
- Worker + ModelExecutor handle the tensor parallelism (a rough sketch of the even head split this currently assumes follows below).
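Rough sketch of the even split that the tensor-parallel path assumes today, next to a hypothetical memory-proportional split that uneven sharding would need (function names and the rounding scheme are made up for illustration):

```python
# Today each rank gets num_heads // tp_size heads; an uneven split would have
# to partition heads (and KV cache capacity) according to each GPU's memory.

def even_split(num_heads: int, tp_size: int) -> list[int]:
    assert num_heads % tp_size == 0           # current assumption
    return [num_heads // tp_size] * tp_size

def proportional_split(num_heads: int, gpu_mem_gib: list[int]) -> list[int]:
    # Hypothetical: give each rank a share of heads proportional to its memory,
    # then hand out the rounding remainder so the shares still sum to num_heads.
    total = sum(gpu_mem_gib)
    shares = [num_heads * m // total for m in gpu_mem_gib]
    for i in range(num_heads - sum(shares)):
        shares[i % len(shares)] += 1
    return shares

print(even_split(64, 8))                      # [8, 8, 8, 8, 8, 8, 8, 8]
print(proportional_split(64, [24] * 7))       # e.g. [10, 9, 9, 9, 9, 9, 9]
```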
Supporting this case will add complications, particularly for:
- managing KV cache memory on each shard
- making sure attention happens locally on each shard
- loading quantized weights
I think in theory this could be done, but it would require quite a bit of refactoring across many parts of the vLLM system to support it.
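As a toy illustration of the first point: sizing the KV cache per shard instead of globally means the ranks no longer agree on the number of blocks, so a scheduler would either have to plan against the smallest shard or track a separate mapping per rank (all numbers below are made up):

```python
# Hypothetical per-shard KV cache sizing for heterogeneous GPUs
# (not how vLLM works today: one block count is currently used for all ranks).

BLOCK_SIZE = 16                               # tokens per block
BYTES_PER_TOKEN = 2 * 80 * 1024 * 2           # K+V, 80 layers, toy per-shard KV dim, fp16

def blocks_per_rank(free_bytes: list[int]) -> list[int]:
    """Each rank sizes its own cache from whatever memory its GPU has left."""
    return [b // (BLOCK_SIZE * BYTES_PER_TOKEN) for b in free_bytes]

free_gib = [20, 20, 20, 20, 20, 20, 12]       # memory left after weights, per GPU
counts = blocks_per_rank([g * 1024**3 for g in free_gib])
print(counts)       # ranks no longer agree on num_gpu_blocks
print(min(counts))  # simplest (wasteful) fix: schedule against the minimum
```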
Hey, I implemented this feature (:
Would really appreciate your feedback on this: #5367