Comments (6)

robertgshaw2-neuralmagic commented on September 26, 2024

I'm just coming here having installed 7 GPUs on a Threadripper; I'm unable to run Llama 3 70B on 7x 24 GB cards, so I have a valid use case for this and would like to help with what I can as well.

Could you perhaps give me rough guidance on the problems I should be aware of in attempting to make this change? I was thinking of starting by reading the tests and going from there to get an idea of how to make it happen.

I would also like a way to run Llama 3 8B or something like that locally while I piece this together for vLLM, so if there is any alternative that might work in the meantime, like llama.cpp or any other project without this particular restriction, it would help me get an idea of how it's implemented.

The biggest problem you will have is that we have a central KV cache management system which manages the mapping between logical and physical blocks. There is a large implicit assumption that each shard has the same mapping, since there is one set of metadata passed to each shard during the forward pass.

It will require a deep understanding and refactoring of the entire system to unwind this assumption.
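
To make that assumption concrete, here is a minimal, hypothetical sketch (names like CentralBlockManager and run_forward_pass are illustrative, not actual vLLM classes): one block table is computed centrally and the identical metadata is handed to every tensor-parallel shard, which only works if every shard has the same KV cache capacity.

```python
# Hypothetical sketch of the shared-block-table assumption -- not vLLM code.
from dataclasses import dataclass, field


@dataclass
class CentralBlockManager:
    # Assumes every shard can hold exactly this many physical blocks.
    num_gpu_blocks: int
    free_blocks: list = field(default_factory=list)

    def __post_init__(self):
        self.free_blocks = list(range(self.num_gpu_blocks))

    def allocate(self, num_blocks):
        # One logical -> physical mapping is built once, centrally ...
        return [self.free_blocks.pop() for _ in range(num_blocks)]


def run_forward_pass(shards, block_table):
    # ... and the identical metadata (the block table) is passed to every
    # shard. If shards had different KV cache capacities (e.g. uneven
    # tensor parallelism over mixed GPUs), this shared table could reference
    # physical blocks that do not exist on a smaller shard.
    return [shard.attention(block_table) for shard in shards]
```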

NadavShmayo commented on September 26, 2024

I'm just coming here having installed 7 GPUs on a Threadripper; I'm unable to run Llama 3 70B on 7x 24 GB cards, so I have a valid use case for this and would like to help with what I can as well.
Could you perhaps give me rough guidance on the problems I should be aware of in attempting to make this change? I was thinking of starting by reading the tests and going from there to get an idea of how to make it happen.
I would also like a way to run Llama 3 8B or something like that locally while I piece this together for vLLM, so if there is any alternative that might work in the meantime, like llama.cpp or any other project without this particular restriction, it would help me get an idea of how it's implemented.

The biggest problem you will have is that we have a central KV cache management system which manages the mapping between logical and physical blocks. There is a large implicit assumption that each shard has the same mapping, since there is one set of metadata passed to each shard during the forward pass.

It will require a deep understanding and refactoring of the entire system to unwind this assumption.

I'd be happy to try to implement this, but I'm less familiar with this part of the vLLM codebase.
Can you please refer me to the primary files I'd have to change for this, or give a brief explanation of the central KV cache management system you mentioned?
(I am familiar with PagedAttention and block management, but less so in the distributed inference context.)

robertgshaw2-neuralmagic commented on September 26, 2024

I'm just coming here having installed 7 GPUs on a Threadripper; I'm unable to run Llama 3 70B on 7x 24 GB cards, so I have a valid use case for this and would like to help with what I can as well.
Could you perhaps give me rough guidance on the problems I should be aware of in attempting to make this change? I was thinking of starting by reading the tests and going from there to get an idea of how to make it happen.
I would also like a way to run Llama 3 8B or something like that locally while I piece this together for vLLM, so if there is any alternative that might work in the meantime, like llama.cpp or any other project without this particular restriction, it would help me get an idea of how it's implemented.

The biggest problem you will have is that we have a central KV cache management system which manages the mapping between logical and physical blocks. There is a large implicit assumption that each shard has the same mapping, since there is one set of metadata passed to each shard during the forward pass.
It will require a deep understanding and refactoring of the entire system to unwind this assumption.

I'd be happy to try to implement this, but I'm less familiar with this part of the vLLM codebase. Can you please refer me to the primary files I'd have to change for this, or give a brief explanation of the central KV cache management system you mentioned? (I am familiar with PagedAttention and block management, but less so in the distributed inference context.)

You're going to have to touch many subsystems (see the sketch after this list):

  • The Scheduler, BlockManager, and Attention, to get a sense of how the KV cache is managed.
  • The Worker and ModelExecutor, which handle the tensor parallelism.
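
For intuition on why both sides are involved, here is a small illustrative calculation (not vLLM code; split_heads is a made-up helper): splitting attention heads unevenly across 7 shards means each shard stores a different amount of KV cache per token, so the scheduler-side block accounting and the worker-side weight sharding both have to change.

```python
def split_heads(num_heads: int, tp_size: int) -> list[int]:
    """Distribute attention heads across shards as evenly as possible."""
    base, rem = divmod(num_heads, tp_size)
    return [base + (1 if rank < rem else 0) for rank in range(tp_size)]


heads = split_heads(num_heads=64, tp_size=7)
print(heads)  # [10, 9, 9, 9, 9, 9, 9]
# Rank 0 holds ~11% more KV cache per token than the other ranks, so a single
# global "blocks per GPU" number no longer describes every shard: the
# Scheduler/BlockManager would need per-shard accounting, and the
# Worker/ModelExecutor would need to load unevenly partitioned weights.
```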

robertgshaw2-neuralmagic commented on September 26, 2024

Supporting this case will add complications, particularly for:

  • managing KV cache memory on each shard (a back-of-the-envelope sketch follows below)
  • making sure attention happens locally on each shard
  • loading quantized weights

I think in theory this could be done, but it would require quite a bit of refactoring across many parts of the vLLM system to support it.
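
As a rough sketch of the first point (the helper and the numbers below are illustrative, not vLLM internals): when shards hold different numbers of KV heads, the number of cache blocks that fits in each shard's free memory differs, and a scheduler that insists on one shared block table is limited by the most constrained shard.

```python
def kv_blocks_per_shard(free_bytes, heads_per_shard, head_dim=128,
                        block_size=16, dtype_bytes=2):
    """How many KV cache blocks fit on each shard.

    Per-block bytes on shard i:
        2 (K and V) * block_size tokens * heads_per_shard[i] * head_dim * dtype_bytes
    """
    return [
        free // (2 * block_size * heads * head_dim * dtype_bytes)
        for free, heads in zip(free_bytes, heads_per_shard)
    ]


# Seven cards with ~20 GB each left for KV cache, heads split 10/9/9/9/9/9/9:
capacities = kv_blocks_per_shard([20 * 2**30] * 7, [10, 9, 9, 9, 9, 9, 9])
print(capacities)
# With a single shared block table, only min(capacities) blocks are usable,
# so either memory on the less loaded shards is wasted or the block
# accounting has to become per-shard.
```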

theycallmeloki commented on September 26, 2024

I'm just coming here having installed 7 GPUs on a Threadripper; I'm unable to run Llama 3 70B on 7x 24 GB cards, so I have a valid use case for this and would like to help with what I can as well.

Could you perhaps give me rough guidance on the problems I should be aware of in attempting to make this change? I was thinking of starting by reading the tests and going from there to get an idea of how to make it happen.

I would also like a way to run Llama 3 8B or something like that locally while I piece this together for vLLM, so if there is any alternative that might work in the meantime, like llama.cpp or any other project without this particular restriction, it would help me get an idea of how it's implemented.

NadavShmayo commented on September 26, 2024

I'm just coming here having installed 7 GPUs on a Threadripper; I'm unable to run Llama 3 70B on 7x 24 GB cards, so I have a valid use case for this and would like to help with what I can as well.
Could you perhaps give me rough guidance on the problems I should be aware of in attempting to make this change? I was thinking of starting by reading the tests and going from there to get an idea of how to make it happen.
I would also like a way to run Llama 3 8B or something like that locally while I piece this together for vLLM, so if there is any alternative that might work in the meantime, like llama.cpp or any other project without this particular restriction, it would help me get an idea of how it's implemented.

The biggest problem you will have is that we have a central KV cache management system which manages the mapping between logical and physical blocks. There is a large implicit assumption that each shard has the same mapping, since there is one set of metadata passed to each shard during the forward pass.
It will require a deep understanding and refactoring of the entire system to unwind this assumption.

I'd be happy to try to implement this, but I'm less familiar with this part of the vLLM codebase. Can you please refer me to the primary files I'd have to change for this, or give a brief explanation of the central KV cache management system you mentioned? (I am familiar with PagedAttention and block management, but less so in the distributed inference context.)

You're going to have to touch many subsystems:

  • The Scheduler, BlockManager, and Attention, to get a sense of how the KV cache is managed.
  • The Worker and ModelExecutor, which handle the tensor parallelism.

Hey, I implemented this feature (:
I would really appreciate your feedback on #5367.
