
llm-vscode Inference Server

This repository serves as an alternative endpoint server for the llm-vscode extension (formerly known as the Hugging Face VSCode extension). In contrast to LucienShui/huggingface-vscode-endpoint-server, the main objective here is to integrate support for quantized open-source LLMs tailored for coding tasks into the llm-vscode extension. Such an integration would make self-hosting a code completion service not only more accessible but also more cost-effective and faster, even on smaller GPUs and CPUs.

Tested with:

I will test more models later.

Usage

Install requirements

pip install -r requirements.txt

Serving

python api_server.py --trust-remote-code --model [/path/to/model/folder]

By default, the server runs on localhost using port 8000. You can also specify a different port by using the --port flag.

Since the api_server.py in this repository is adapted from vLLM's api_server.py, it inherits the same arguments. You can refer to vLLM's arg_utils.py to review all the supported command-line arguments.

For quantized models, you should append the following arguments: --quantization awq --dtype half. For example:

python api_server.py --trust-remote-code --model [/path/to/model/folder] --quantization awq --dtype half
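
For a quick sanity check of a quantized model outside the server, you can load it with vLLM's offline Python API using the same options. This is only a minimal sketch, assuming vLLM is installed from requirements.txt; replace the model path with your own:

from vllm import LLM, SamplingParams

# Load the quantized model with the same options passed to the server
# (the path is a placeholder -- adjust it to your setup).
llm = LLM(
    model="/path/to/model/folder",
    quantization="awq",
    dtype="half",
    trust_remote_code=True,
)

outputs = llm.generate(["def quick_sort"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)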

Config in VSCode

  1. Open VSCode, go to Preferences -> Settings, and navigate to the Hugging Face Code section.

  2. Set Config Template to Custom.

  3. Set Model ID or Endpoint to http://localhost:8000/generate, replacing the port number if you are using a different one.

API

  • Request:

    curl http://localhost:8000/generate -d '{"inputs": "def quick_sort", "parameters": {"max_new_tokens": 64}}'
  • Response:

    {
        "generated_text": "def quick_sort(numbers):\n    if len(numbers) < 2:\n        return numbers\n    else:\n        pivot = numbers[-1]\n        less = [\n            el\n            for ind, el in enumerate(numbers[:-1])\n            if el <= pivot and ind != -1\n",
        "status": 200
    }
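
The same request can be sent from Python. Below is a small client sketch using the requests library (not part of this repository's requirements), mirroring the curl example above:

import requests

# Build the same payload as the curl example and print the completion.
payload = {
    "inputs": "def quick_sort",
    "parameters": {"max_new_tokens": 64},
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json()["generated_text"])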

TODO

  • Test more models
  • Test distributed serving

References


llm-vscode-inference-server's Issues

Can't pip install requirements.txt on CPU-only system

I saw that the README mentions running on CPU as a goal. Is that supported yet, or is there still work to be done to achieve it?
Currently I'm seeing this error when I try pip install -r requirements.txt:

      RuntimeError: Cannot find CUDA_HOME. CUDA must be available to build the package.

Keeps responding back with tokens

I keep getting FIM tokens back in the responses. Am I supposed to scrub them directly in the code, or is there a setting in the llm-vscode extension that has to be used? (One possible approach is sketched after the example below.)

<fim_prefix>
import debugpy


# create a class called car
class Car:
    # create a method called drive
    def drive(self):
        print("driving")


# create an object called my_car
my_car =    <fim_suffix>
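
One possible approach (a sketch only, not code from this repository) is to strip the FIM control tokens from the completion server-side before returning it; on the client side, the extension's llm.tokensToClear setting can also be used to clear specific tokens.

# Sketch of a server-side post-processing step. The token list is an
# assumption based on StarCoder-style models; adjust it to what your model emits.
FIM_TOKENS = ("<fim_prefix>", "<fim_middle>", "<fim_suffix>", "<|endoftext|>")

def strip_fim_tokens(text: str) -> str:
    for token in FIM_TOKENS:
        text = text.replace(token, "")
    return text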

[Bug] TypeError: SamplingParams.__init__() got an unexpected keyword argument 'return_full_text'

When using the inference server for the first time with the model TheBloke/CodeLlama-7B-Instruct-AWQ, the request fails with the following traceback:

  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PATH%}/llm-vscode-inference-server/api_server.py", line 34, in generate
    sampling_params = SamplingParams(max_tokens=max_new_tokens,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: SamplingParams.__init__() got an unexpected keyword argument 'return_full_text'
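
A possible workaround (a sketch, not the repository's actual code) is to filter out request parameters that vLLM's SamplingParams does not accept before constructing it, so unknown keys such as return_full_text are ignored instead of raising:

import inspect

from vllm import SamplingParams

# Hypothetical helper: keep only keyword arguments accepted by SamplingParams.
def build_sampling_params(params: dict) -> SamplingParams:
    allowed = set(inspect.signature(SamplingParams.__init__).parameters) - {"self"}
    return SamplingParams(**{k: v for k, v in params.items() if k in allowed})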

Extension settings in VS Code:

    "llm.attributionWindowSize": 256,
    "llm.configTemplate": "Custom",
    "llm.contextWindow": 2048,
    "llm.fillInTheMiddle.enabled": false,
    "llm.fillInTheMiddle.middle": " <MID>",
    "llm.fillInTheMiddle.prefix": "<PRE> ",
    "llm.fillInTheMiddle.suffix": " <SUF>",
    "llm.lsp.logLevel": "debug",
    "llm.maxNewTokens": 500,
    "llm.modelIdOrEndpoint": "http://localhost:8000/generate",
    "llm.temperature": 0.2,
    "llm.tokenizer": {"repository": "TheBloke/CodeLlama-7B-Instruct-AWQ"},
    "llm.tokensToClear": ["<EOT>"],

pip freeze output:

aiosignal==1.3.1
anyio==3.7.1
attrs==23.1.0
certifi==2023.7.22
charset-normalizer==3.3.0
click==8.1.7
cmake==3.27.6
fastapi==0.103.2
filelock==3.12.4
frozenlist==1.4.0
fsspec==2023.9.2
h11==0.14.0
httptools==0.6.0
huggingface-hub==0.17.3
idna==3.4
Jinja2==3.1.2
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
lit==17.0.1
MarkupSafe==2.1.3
mpmath==1.3.0
msgpack==1.0.7
networkx==3.1
ninja==1.11.1
numpy==1.26.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==23.2
pandas==2.1.1
protobuf==4.24.3
psutil==5.9.5
pyarrow==13.0.0
pydantic==1.10.13
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
PyYAML==6.0.1
ray==2.7.0
referencing==0.30.2
regex==2023.8.8
requests==2.31.0
rpds-py==0.10.3
safetensors==0.3.3
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
starlette==0.27.0
sympy==1.12
tokenizers==0.13.3
torch==2.0.1
tqdm==4.66.1
transformers==4.33.3
triton==2.0.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.6
uvicorn==0.23.2
uvloop==0.17.0
vllm==0.2.0
watchfiles==0.20.0
websockets==11.0.3
xformers==0.0.22

Issue on CUDA version and Torch on vllm

While building a service with Docker, the following error is raised.

output

291.9   The detected CUDA version (12.1) mismatches the version that was used to compile 
291.9   PyTorch (11.7). Please make sure to use the same CUDA versions. 
291.9
291.9   ----------------------------------------
291.9   ERROR: Failed building wheel for vllm
291.9 Failed to build vllm
293.3 ERROR: Could not build wheels for vllm which use PEP 517 and cannot be installed directly

Dockerfile

FROM nvidia/cuda:11.6.2-devel-ubuntu20.04

# Install Python 3.8
RUN apt-get update && apt-get install -y python3.8 python3-pip && apt-get clean

# Set Python 3.8 as the default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1

# Set the working directory in the container
WORKDIR /app

# Install git
RUN apt-get update && apt-get install -y git && apt-get clean

# Copy the current directory contents into the container at /app
COPY . /app

ENV CUDA_HOME="/usr/local/cuda"
ENV FORCE_CUDA="1"

# Install the required packages
RUN pip install --no-cache-dir -r requirements.txt

# Expose port 8000 for the app to listen on
EXPOSE 8000

# Define the command to run the app
CMD ["python", "api_server.py", "--trust-remote-code", "--model", "/path/to/model/folder"]

Docker-compose


version: '3.8'  # Consider using a more recent version

services:
  llm-server:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./Documents/WizardCoder-Python-13B-V1.0:/app/models
    environment:
      - MODEL_PATH=/app/models
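
As a quick way to confirm the mismatch reported above, you can compare the CUDA version PyTorch was built against with the toolkit visible inside the build container. This is only a sketch, assuming both torch and nvcc are available there:

import subprocess

import torch

# The CUDA version PyTorch was compiled with and the version nvcc reports
# should match before building vllm from source.
print("torch built with CUDA:", torch.version.cuda)
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(nvcc.stdout.strip().splitlines()[-1])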
