Comments (6)
Just coming here from having installed 7 GPUs on a Threadripper: I'm unable to run Llama 3 70B on 7x24 GB cards, so I have a valid use case for this and would like to help with what I can as well.
Can you perhaps guide me, even vaguely, on the problems I should be aware of when attempting to make this change? I was thinking of starting by reading the tests and going from there to get an idea of how to approach it.
I would also like some way to run Llama 3 8B or something similar locally while I piece this together for vLLM, so if there is an alternative that might work, such as llama.cpp or any other project without this particular restriction, it could in the meantime help me get an idea of how this is implemented.
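For anyone wondering why a 7-GPU setup is rejected in the first place: vLLM splits attention heads evenly across tensor-parallel ranks, and Llama 3 70B has 64 query heads and 8 KV heads, neither of which is divisible by 7. A minimal sketch of that constraint (illustrative only, not the actual vLLM source):

```python
# Sketch of the head-divisibility constraint behind tensor_parallel_size
# (illustrative; not copied from the vLLM code).

def check_tensor_parallel(num_attention_heads: int,
                          num_kv_heads: int,
                          tensor_parallel_size: int) -> None:
    """Each tensor-parallel rank must receive a whole number of heads."""
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({num_attention_heads}) must be "
            f"divisible by tensor parallel size ({tensor_parallel_size}).")
    if (num_kv_heads % tensor_parallel_size != 0
            and tensor_parallel_size % num_kv_heads != 0):
        raise ValueError(
            f"Number of KV heads ({num_kv_heads}) is not compatible with "
            f"tensor parallel size ({tensor_parallel_size}).")

check_tensor_parallel(64, 8, 8)       # OK: 8 query heads and 1 KV head per GPU
try:
    check_tensor_parallel(64, 8, 7)   # 64 heads cannot be split evenly over 7 GPUs
except ValueError as e:
    print(e)
```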
The biggest problem that you will have is that we have a central KV cache management system which manages the mapping from physical to logical blocks. There is a large implicit assumption that each shard has the same mapping, since there is one set of metadata passed to each shard during the forward pass.
It will require a deep understanding and refactoring of the entire system to unwind this assumption.
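To make that assumption concrete, here is a toy sketch (not actual vLLM code, class and field names made up) of one physical-block mapping being produced centrally and reused on every tensor-parallel rank; if one shard had a smaller KV cache than the others, the shared block IDs would stop being valid there:

```python
# Toy sketch of the "one mapping for all shards" assumption (not vLLM code).
# A central manager allocates physical blocks once, and the same block table
# is sent to every tensor-parallel worker as attention metadata.

from dataclasses import dataclass, field

@dataclass
class ToyBlockManager:
    num_gpu_blocks: int                       # assumed identical on every shard
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_gpu_blocks))

    def allocate(self, n: int) -> list[int]:
        # Block IDs double as physical slot indices in each worker's cache.
        return [self.free.pop() for _ in range(n)]

manager = ToyBlockManager(num_gpu_blocks=1024)
block_table = manager.allocate(4)             # e.g. [1023, 1022, 1021, 1020]

# Every rank receives the *same* block_table and indexes its local KV cache
# tensor with it, so a shard with fewer than 1024 blocks would index out of range.
for rank in range(7):
    print(f"rank {rank} writes KV into physical blocks {block_table}")
```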
I'd be happy to try and implement this, but I'm less familiar with this part of the vLLM codebase.
Can you please refer me to the primary files I'd have to change for this, or give a brief explanation of the central KV cache management system you mentioned?
(I am familiar with PagedAttention and block management, but less so in the distributed inference context.)
You're going to have to touch many subsystems.
- The Scheduler, BlockManager, and Attention, to get a sense of how the KV cache is managed.
- Worker + ModelExecutor handle the tensor parallelism (a rough sketch of the even head split this currently assumes follows below).
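Rough sketch of the even split that the tensor-parallel path assumes today, next to a hypothetical memory-proportional split that uneven sharding would need (function names and the rounding scheme are made up for illustration):

```python
# Today each rank gets num_heads // tp_size heads; an uneven split would have
# to partition heads (and KV cache capacity) according to each GPU's memory.

def even_split(num_heads: int, tp_size: int) -> list[int]:
    assert num_heads % tp_size == 0           # current assumption
    return [num_heads // tp_size] * tp_size

def proportional_split(num_heads: int, gpu_mem_gib: list[int]) -> list[int]:
    # Hypothetical: give each rank a share of heads proportional to its memory,
    # then hand out the rounding remainder so the shares still sum to num_heads.
    total = sum(gpu_mem_gib)
    shares = [num_heads * m // total for m in gpu_mem_gib]
    for i in range(num_heads - sum(shares)):
        shares[i % len(shares)] += 1
    return shares

print(even_split(64, 8))                      # [8, 8, 8, 8, 8, 8, 8, 8]
print(proportional_split(64, [24] * 7))       # e.g. [10, 9, 9, 9, 9, 9, 9]
```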
Supporting this case will add complications, particularly for:
- managing KV cache memory on each shard
- making sure attention happens locally on each shard
- loading quantized weights
I think in theory this could be done, but it would require quite a bit of refactoring across many parts of the vLLM system to support it.
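As a toy illustration of the first point: sizing the KV cache per shard instead of globally means the ranks no longer agree on the number of blocks, so a scheduler would either have to plan against the smallest shard or track a separate mapping per rank (all numbers below are made up):

```python
# Hypothetical per-shard KV cache sizing for heterogeneous GPUs
# (not how vLLM works today: one block count is currently used for all ranks).

BLOCK_SIZE = 16                               # tokens per block
BYTES_PER_TOKEN = 2 * 80 * 1024 * 2           # K+V, 80 layers, toy per-shard KV dim, fp16

def blocks_per_rank(free_bytes: list[int]) -> list[int]:
    """Each rank sizes its own cache from whatever memory its GPU has left."""
    return [b // (BLOCK_SIZE * BYTES_PER_TOKEN) for b in free_bytes]

free_gib = [20, 20, 20, 20, 20, 20, 12]       # memory left after weights, per GPU
counts = blocks_per_rank([g * 1024**3 for g in free_gib])
print(counts)       # ranks no longer agree on num_gpu_blocks
print(min(counts))  # simplest (wasteful) fix: schedule against the minimum
```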
Hey, I implemented this feature (:
Would really appreciate your feedback on this: #5367