
llm-vscode Inference Server

This repository serves as an alternative endpoint server for the llm-vscode extension (formerly known as the Hugging Face VSCode extension). In contrast to LucienShui/huggingface-vscode-endpoint-server, the main objective here is to integrate support for quantized open-source LLMs tailored for coding tasks into the llm-vscode extension. Such an integration would make self-hosting a code completion service not only more accessible but also more cost-effective and faster, even on smaller GPUs and CPUs.

Tested with:

I will test more models later.

Usage

Install requirements

pip install -r requirements.txt

Serving

python api_server.py --trust-remote-code --model [/path/to/model/folder]

By default, the server runs on localhost using port 8000. You can also specify a different port by using the --port flag.

Since the api_server.py in this repository is adapted from vLLM's api_server.py, it inherits the same arguments. You can refer to vLLM's arg_utils.py to review all the supported command-line arguments.

For quantized models, you should append the following arguments: --quantization awq --dtype half. For example:

python api_server.py --trust-remote-code --model [/path/to/model/folder] --quantization awq --dtype half
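
For a quick sanity check of a quantized model outside the server, you can load it with vLLM's offline Python API using the same options. This is only a minimal sketch, assuming vLLM is installed from requirements.txt; replace the model path with your own:

from vllm import LLM, SamplingParams

# Load the quantized model with the same options passed to the server
# (the path is a placeholder -- adjust it to your setup).
llm = LLM(
    model="/path/to/model/folder",
    quantization="awq",
    dtype="half",
    trust_remote_code=True,
)

outputs = llm.generate(["def quick_sort"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)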

Config in VSCode

  1. Open VSCode, go to Preferences -> Settings, and navigate to the Hugging Face Code section.

  2. Set Config Template to Custom.

  3. Set Model ID or Endpoint to http://localhost:8000/generate, replacing the port number if you are using a different one.

API

  • Request:

    curl http://localhost:8000/generate -d '{"inputs": "def quick_sort", "parameters": {"max_new_tokens": 64}}'
  • Response:

    {
        "generated_text": "def quick_sort(numbers):\n    if len(numbers) < 2:\n        return numbers\n    else:\n        pivot = numbers[-1]\n        less = [\n            el\n            for ind, el in enumerate(numbers[:-1])\n            if el <= pivot and ind != -1\n",
        "status": 200
    }
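
The same request can be sent from Python. Below is a small client sketch using the requests library (not part of this repository's requirements), mirroring the curl example above:

import requests

# Build the same payload as the curl example and print the completion.
payload = {
    "inputs": "def quick_sort",
    "parameters": {"max_new_tokens": 64},
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json()["generated_text"])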

TODO

  • Test more models
  • Test distributed serving

References


llm-vscode-inference-server's Issues

Can't pip install requirements.txt on CPU-only system

I saw that the README mentions running on CPU as a goal. Is that supported yet, or is there still work to be done to achieve it?
Currently I'm seeing this error when I try pip install -r requirements.txt:

      RuntimeError: Cannot find CUDA_HOME. CUDA must be available to build the package.

Keeps responding back with tokens

I keep getting FIM tokens back in the responses. Am I supposed to scrub them directly in the code, or is there a setting in the llm-vscode extension that has to be used? (One possible approach is sketched after the example below.)

<fim_prefix>
import debugpy


# create a class called car
class Car:
    # create a method called drive
    def drive(self):
        print("driving")


# create an object called my_car
my_car =    <fim_suffix>
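
One possible approach (a sketch only, not code from this repository) is to strip the FIM control tokens from the completion server-side before returning it; on the client side, the extension's llm.tokensToClear setting can also be used to clear specific tokens.

# Sketch of a server-side post-processing step. The token list is an
# assumption based on StarCoder-style models; adjust it to what your model emits.
FIM_TOKENS = ("<fim_prefix>", "<fim_middle>", "<fim_suffix>", "<|endoftext|>")

def strip_fim_tokens(text: str) -> str:
    for token in FIM_TOKENS:
        text = text.replace(token, "")
    return text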

[Bug] TypeError: SamplingParams.__init__() got an unexpected keyword argument 'return_full_text'

When using the inference server for the first time with the model TheBloke/CodeLlama-7B-Instruct-AWQ, the request fails with the following traceback:

  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PATH%}/llm-vscode-inference-server/api_server.py", line 34, in generate
    sampling_params = SamplingParams(max_tokens=max_new_tokens,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: SamplingParams.__init__() got an unexpected keyword argument 'return_full_text'
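
A possible workaround (a sketch, not the repository's actual code) is to filter out request parameters that vLLM's SamplingParams does not accept before constructing it, so unknown keys such as return_full_text are ignored instead of raising:

import inspect

from vllm import SamplingParams

# Hypothetical helper: keep only keyword arguments accepted by SamplingParams.
def build_sampling_params(params: dict) -> SamplingParams:
    allowed = set(inspect.signature(SamplingParams.__init__).parameters) - {"self"}
    return SamplingParams(**{k: v for k, v in params.items() if k in allowed})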

Extension settings in VS Code:

    "llm.attributionWindowSize": 256,
    "llm.configTemplate": "Custom",
    "llm.contextWindow": 2048,
    "llm.fillInTheMiddle.enabled": false,
    "llm.fillInTheMiddle.middle": " <MID>",
    "llm.fillInTheMiddle.prefix": "<PRE> ",
    "llm.fillInTheMiddle.suffix": " <SUF>",
    "llm.lsp.logLevel": "debug",
    "llm.maxNewTokens": 500,
    "llm.modelIdOrEndpoint": "http://localhost:8000/generate",
    "llm.temperature": 0.2,
    "llm.tokenizer": {"repository": "TheBloke/CodeLlama-7B-Instruct-AWQ"},
    "llm.tokensToClear": ["<EOT>"],

pip freeze output:

aiosignal==1.3.1
anyio==3.7.1
attrs==23.1.0
certifi==2023.7.22
charset-normalizer==3.3.0
click==8.1.7
cmake==3.27.6
fastapi==0.103.2
filelock==3.12.4
frozenlist==1.4.0
fsspec==2023.9.2
h11==0.14.0
httptools==0.6.0
huggingface-hub==0.17.3
idna==3.4
Jinja2==3.1.2
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
lit==17.0.1
MarkupSafe==2.1.3
mpmath==1.3.0
msgpack==1.0.7
networkx==3.1
ninja==1.11.1
numpy==1.26.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==23.2
pandas==2.1.1
protobuf==4.24.3
psutil==5.9.5
pyarrow==13.0.0
pydantic==1.10.13
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
PyYAML==6.0.1
ray==2.7.0
referencing==0.30.2
regex==2023.8.8
requests==2.31.0
rpds-py==0.10.3
safetensors==0.3.3
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
starlette==0.27.0
sympy==1.12
tokenizers==0.13.3
torch==2.0.1
tqdm==4.66.1
transformers==4.33.3
triton==2.0.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.6
uvicorn==0.23.2
uvloop==0.17.0
vllm==0.2.0
watchfiles==0.20.0
websockets==11.0.3
xformers==0.0.22

Issue on CUDA version and Torch on vllm

While building a service with Docker, the following error is raised.

output

291.9   The detected CUDA version (12.1) mismatches the version that was used to compile 
291.9   PyTorch (11.7). Please make sure to use the same CUDA versions. 
291.9
291.9   ----------------------------------------
291.9   ERROR: Failed building wheel for vllm
291.9 Failed to build vllm
293.3 ERROR: Could not build wheels for vllm which use PEP 517 and cannot be installed directly

Dockerfile

FROM nvidia/cuda:11.6.2-devel-ubuntu20.04

# Install Python 3.8
RUN apt-get update && apt-get install -y python3.8 python3-pip && apt-get clean

# Set Python 3.8 as the default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1

# Set the working directory in the container
WORKDIR /app

# Install git
RUN apt-get update && apt-get install -y git && apt-get clean

# Copy the current directory contents into the container at /app
COPY . /app

ENV CUDA_HOME="/usr/local/cuda"
ENV FORCE_CUDA="1"

# Install the required packages
RUN pip install --no-cache-dir -r requirements.txt

# Expose port 8000 for the app to listen on
EXPOSE 8000

# Define the command to run the app
CMD ["python", "api_server.py", "--trust-remote-code", "--model", "/path/to/model/folder"]

Docker-compose


version: '3.8'  # Consider using a more recent version

services:
  llm-server:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./Documents/WizardCoder-Python-13B-V1.0:/app/models
    environment:
      - MODEL_PATH=/app/models
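
As a quick way to confirm the mismatch reported above, you can compare the CUDA version PyTorch was built against with the toolkit visible inside the build container. This is only a sketch, assuming both torch and nvcc are available there:

import subprocess

import torch

# The CUDA version PyTorch was compiled with and the version nvcc reports
# should match before building vllm from source.
print("torch built with CUDA:", torch.version.cuda)
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(nvcc.stdout.strip().splitlines()[-1])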
