To test rocm57:
- make sure your Linux host is ROCm container (docker) ready (see the AMD ROCm documentation; WSL or any sort of Linux VM will NOT work!)
- `cd rocm57`, then `sh ./buildimage.sh`; this will create the `mlcllmrocm57:v0.1` base image; you can do a `docker images` to verify
- from the top level, `cd test` and check that the `cache` directory exists and that you have write permission to it; this is where all the weights will be downloaded the first time, and they will be managed for you
- run `sh startrocm57chat.sh gemma-2b-it-q4f16_1` - this should download and convert all the weights (the first time you run it) and then start a chat session with the extra-small gemma model (on your ROCm supported GPU)
- try the prompt `list the states in the usa` to get a somewhat stable reply
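Put together, and assuming the working directories noted in the comments (the individual steps above are authoritative), the test run looks like this:

```sh
# from the rocm folder: build the mlcllmrocm57:v0.1 base image
cd rocm57
sh ./buildimage.sh
docker images                 # verify the new image is listed

# from the repository top level: check the weights cache directory
cd test
ls -ld cache                  # must exist and be writable by you

# first run: downloads and converts the weights, then opens the chat
sh startrocm57chat.sh gemma-2b-it-q4f16_1
```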
A set of docker container templates for your scaled production deployment of GPU-accelerated mlc_llm, based on the recent work on the SLM and JIT flow and the OpenAI-compatible APIs, including function calling.
These containers are designed to be:
- minimalist - nothing non-essential is included; you can layer on your own security policy for example
- non-opinionated - use CNCF k8s, docker compose, swarm, or whatever you have for orchestration (see the sketch after this list)
- adaptive and composable - nobody knows what you intend to do with these containers, and we don't guess
- compatible - with multi-GPU support maturing and batching still in testing, these containers should survive upcoming changes without needing to be severely revamped.
- practical NOW - usable and deployable TODAY with 2024/2025 level workstation/consumer hardware and mlc-ai
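As a sketch of the "non-opinionated" point above: the base image can be driven directly by whatever tooling you already use. The `docker run` line below is illustrative only, using AMD's commonly documented ROCm device flags; the exact entrypoint, volumes, and flags these templates actually rely on are in the scripts under `/bin`.

```sh
# illustrative only: run the rocm57 base image by hand, outside the
# provided scripts, as a starting point for your own orchestration;
# the device flags follow AMD's usual ROCm container guidance
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  mlcllmrocm57:v0.1 \
  /bin/bash
```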
- Make sure your system has one of the supported GPU driver stacks (CUDA, ROCm, and so on) and that you have the toolkit and container extension (docker support) already installed and tested
- Go to `/rocm/rocm57` or `/cuda/cuda122` and build the base docker image according to the `README.md` there.
- Run the command line chat or serve REST APIs immediately by running the scripts in the `/bin` directory, following the `README.md` there.
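As a concrete sketch of those steps on the CUDA side - assuming `cuda122` provides a `buildimage.sh` analogous to the rocm57 one, and that the chat script is launched from `/bin` (the actual commands are in the respective `README.md` files):

```sh
# build the CUDA 12.2 base image (exact script name: see /cuda/cuda122/README.md)
cd cuda/cuda122
sh ./buildimage.sh

# start an interactive, GPU-accelerated chat from the script templates
cd ../../bin
sh startcuda122chat.sh Llama-2-7b-chat-hf-q4f32_1
```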
Base containers are segregated by GPU acceleration stack. See the README in each sub-folder for more information.
```
cuda
|-- cuda122
rocm
|-- rocm57
bin
test
```
The `bin` folder has the template executables that will start the containers.
The `test` folder contains the tests.
This structure enables the greater community to easily contribute new tested templates for other cuda and rocm releases, for example.
Managing the huge physical size of the weights of an LLM is a major hurdle when deploying modern LLMs in production or experimental environments at any scale. Couple this with the need to compile neural network support libraries for every combination and permutation of supported GPU hardware and OS, and an impossibly frustrating out-of-box user experience is guaranteed.
The latest improvements in the JIT and SLM flow for MLC_LLM specifically address this, and these docker container templates further enhance the out-of-box UX, down to one single easy-to-use command line (with automatic cached LLM weights management).
Users of such images can simply decide to run "llama2 7b on cuda 12.2" and, in one single command, immediately pull down an image onto their workstation and start running AI apps served by an already GPU-accelerated Llama 2. The weights are downloaded directly from huggingface and converted specifically for the user's GPU hardware and OS the first time the command is executed; any subsequent invocation starts instantly using the already converted weights.
As an example, the command to start an interactive chat with this LLM on CUDA 12.2 accelerated Linux is:
```sh
startcuda122chat.sh Llama-2-7b-chat-hf-q4f32_1
```
One container template is supplied for REST API serving, and another is available for interactive command line chat with any supported LLM.
There is no loss of flexibility in using these containers: the REST API implementation already supports batching - the ability to handle multiple concurrent inferences at the same time - and any future improvements in MLC_AI should be picked up by these containers without severe revamping.
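A REST serving session could look like the sketch below. The serve script name, port, and endpoint path are assumptions here (the request body follows the usual OpenAI-compatible chat completion shape); check the `/bin` scripts and their `README.md` for the actual names.

```sh
# hypothetical serve script name and port - see /bin for the real ones
sh startcuda122serve.sh Llama-2-7b-chat-hf-q4f32_1 &

# OpenAI-compatible chat completion request against the served model
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Llama-2-7b-chat-hf-q4f32_1",
        "messages": [{"role": "user", "content": "list the states in the usa"}]
      }'
```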
Tests are kept global because they apply to mlc_ai running across any supported GPU configuration.