
PyTorch (LibTorch) Backend

The Triton backend for PyTorch. You can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page. This backend is designed to run TorchScript models using the PyTorch C++ API. All models created in PyTorch using the Python API must be traced/scripted to produce a TorchScript model.
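
For example, a minimal sketch of producing such a model file (the torchvision ResNet-50 model and the file layout shown are only illustrative; the backend loads the serialized TorchScript file, named model.pt by default, from the model's version directory):

import torch
import torchvision

# Any eager-mode PyTorch model; ResNet-50 is used here only as an example.
model = torchvision.models.resnet50(pretrained=True)
model.eval()

# Option 1: scripting, which preserves data-dependent control flow.
scripted = torch.jit.script(model)

# Option 2: tracing with a representative input tensor.
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# Save the TorchScript model where Triton expects it, e.g.
# <model_repository>/<model_name>/1/model.pt
scripted.save("model.pt")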

Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the general Triton documentation available in the main server repo. If you don't find your answer there you can ask questions on the main Triton issues page.

Build the PyTorch Backend

Use a recent cmake to build. First install the required dependencies.

$ apt-get install patchelf rapidjson-dev python3-dev

An appropriate PyTorch container from NGC must be used. For example, to build a backend that uses the 21.02 version of the PyTorch container from NGC:

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_PYTORCH_DOCKER_IMAGE="nvcr.io/nvidia/pytorch:21.02-py3" ..
$ make install

The following required Triton repositories will be pulled and used in the build. By default, the "main" branch/tag will be used for each repo, but the listed CMake arguments can be used to override this (an example follows the list).

  • triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
  • triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
  • triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]
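
For example, to pin all three repositories to the release branch matching the container version (the r21.02 tag here is only illustrative), pass the overrides alongside the other options:

$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \
        -DTRITON_PYTORCH_DOCKER_IMAGE="nvcr.io/nvidia/pytorch:21.02-py3" \
        -DTRITON_BACKEND_REPO_TAG=r21.02 \
        -DTRITON_CORE_REPO_TAG=r21.02 \
        -DTRITON_COMMON_REPO_TAG=r21.02 ..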

Build the PyTorch Backend With Custom PyTorch

Currently, Triton requires that a specially patched version of PyTorch be used with the PyTorch backend. The full source for these PyTorch versions is available as Docker images from NGC. For example, the PyTorch version compatible with the 21.02 release of Triton is available as nvcr.io/nvidia/pytorch:21.02-py3.

Copy over the LibTorch and Torchvision headers and libraries from the PyTorch NGC container into local directories. You can see which headers and libraries are needed/copied from the Docker image.
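
One way to do this is with docker cp; the container-internal paths below are assumptions that vary between container versions, so verify them inside the container (for example with python -c "import torch; print(torch.__file__)") before copying:

$ docker create --name pytorch_src nvcr.io/nvidia/pytorch:21.02-py3
$ # Illustrative paths only; adjust to where torch/torchvision live in your container.
$ docker cp pytorch_src:/opt/conda/lib/python3.8/site-packages/torch ./torch
$ docker cp pytorch_src:/opt/conda/lib/python3.8/site-packages/torchvision ./torchvision
$ docker rm pytorch_src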

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_PYTORCH_INCLUDE_PATHS="<PATH_PREFIX>/torch;<PATH_PREFIX>/torch/torch/csrc/api/include;<PATH_PREFIX>/torchvision" -DTRITON_PYTORCH_LIB_PATHS="<LIB_PATH_PREFIX>" ..
$ make install

Using the PyTorch Backend

Parameters

Triton exposes some flags to control the execution mode of the TorchScript models through the Parameters section of the model's 'config.pbtxt' file.

  • DISABLE_OPTIMIZED_EXECUTION: Boolean flag to disable the optimized execution of TorchScript models. By default, optimized execution is enabled.

The initial calls to a loaded TorchScript model can take significantly longer than subsequent calls. Because of this model warmup cost, Triton also allows models to be executed without these optimizations. In some models, optimized execution does not improve performance, and in other cases it impacts performance negatively.

The section of the model config file specifying this parameter will look like:

parameters: {
    key: "DISABLE_OPTIMIZED_EXECUTION"
    value: {
        string_value: "true"
    }
}
  • INFERENCE_MODE: Boolean flag to enable InferenceMode execution of TorchScript models. By default, inference mode is disabled.

InferenceMode is a new RAII guard analogous to NoGradMode to be used when you are certain your operations will have no interactions with autograd. Compared to NoGradMode, code run under this mode gets better performance by skipping additional autograd bookkeeping such as view tracking and version counter bumps.

Please note that in some models InferenceMode might not benefit performance, and in fewer cases it might impact performance negatively.
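
For background, the same mechanism is exposed in the Python API as torch.inference_mode (available in recent PyTorch releases); the snippet below only illustrates the semantics and is independent of the backend, which applies the guard in C++:

import torch

# Load a TorchScript model; the filename is only an example.
model = torch.jit.load("model.pt")
model.eval()

example_input = torch.randn(1, 3, 224, 224)

# Under inference mode, autograd bookkeeping (gradient tracking, view and
# version-counter updates) is skipped, which typically lowers per-call overhead.
with torch.inference_mode():
    output = model(example_input)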

The section of the model config file specifying this parameter will look like:

parameters: {
    key: "INFERENCE_MODE"
    value: {
        string_value: "true"
    }
}
  • ENABLE_NVFUSER: Boolean flag to enable the NvFuser (CUDA Graph Fuser) optimization for TorchScript models. If not specified, the default PyTorch fuser is used. If ENABLE_NVFUSER is specified, the ENABLE_TENSOR_FUSER configuration (see below) is ignored.

Please note that models generated using tracing in older PyTorch versions might not work correctly with NvFuser. We recommend using scripting and a recent version of PyTorch to generate these models.

The section of the model config file specifying this parameter will look like:

parameters: {
    key: "ENABLE_NVFUSER"
    value: {
        string_value: "true"
    }
}
  • ENABLE_WEIGHT_SHARING: Boolean flag to enable model instances on the same device to share weights. This optimization should not be used with stateful models. If not specified, weight sharing is disabled. A sketch that combines this parameter with multiple model instances follows the config example below.

The section of the model config file specifying this parameter will look like:

parameters: {
    key: "ENABLE_WEIGHT_SHARING"
    value: {
        string_value: "true"
    }
}
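
Weight sharing only has an effect when more than one execution instance of the model is placed on the same device. As an illustrative config.pbtxt fragment (the instance count and GPU index are arbitrary), the parameter can be combined with an instance_group like this:

instance_group [
    {
        count: 2
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]
parameters: {
    key: "ENABLE_WEIGHT_SHARING"
    value: {
        string_value: "true"
    }
}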
  • Additional Optimizations: Three additional boolean parameters are available to disable certain Torch optimizations that can sometimes cause latency regressions in models with complex execution modes and dynamic shapes. If not specified, all are enabled by default; an example of overriding one of them follows this list.

    • ENABLE_JIT_EXECUTOR

    • ENABLE_JIT_PROFILING

    • ENABLE_TENSOR_FUSER
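
Each of these is set in the same way as the parameters above. For example, to disable the tensor fuser (the choice of parameter here is only an example):

parameters: {
    key: "ENABLE_TENSOR_FUSER"
    value: {
        string_value: "false"
    }
}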

Important Note

  • The execution of PyTorch models on GPU is asynchronous in nature (see the asynchronous execution notes in the PyTorch CUDA semantics documentation for more details). Consequently, an error in PyTorch model execution may be raised during one of the next few inference requests to the server. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 when launching the server forces synchronous execution and helps in correctly debugging failing cases; see the example after this list.

    • The PyTorch model in such cases may or may not recover from the failed state, and a restart of the server may be required to continue serving successfully.
  • Multiple instances of a PyTorch model on GPU do not always increase performance. Due to thread-specific caching in PyTorch, multiple instances of the same model can interact negatively. Setting the parameter DISABLE_OPTIMIZED_EXECUTION to "true" in the model configuration may help in some cases to avoid these negative interactions due to model-specific caching and improve multi-instance performance.
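
As a sketch of the debugging suggestion above (the image tag and model repository path are placeholders), the environment variable can be set when launching the server:

$ docker run --gpus all --rm \
      -e CUDA_LAUNCH_BLOCKING=1 \
      -v /path/to/model_repository:/models \
      nvcr.io/nvidia/tritonserver:21.02-py3 \
      tritonserver --model-repository=/models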
