Triton Client Libraries and Examples

To simplify communication with Triton, the Triton project provides several client libraries and examples of how to use those libraries. Ask questions or report problems in the main Triton issues page.

The provided client libraries are:

  • C++ and Python APIs that make it easy to communicate with Triton from your C++ or Python application. Using these libraries you can send either HTTP/REST or GRPC requests to Triton to access all its capabilities: inferencing, status and health, statistics and metrics, model repository management, etc. These libraries also support using system and CUDA shared memory for passing inputs to and receiving outputs from Triton.

  • Java API (contributed by Alibaba Cloud PAI Team) that makes it easy to communicate with Triton from your Java application using HTTP/REST requests. For now, only a limited feature subset is supported.

  • The protoc compiler can generate a GRPC API in a large number of programming languages.

There are also many example applications that show how to use these libraries. Many of these examples use models from the example model repository.

  • C++ and Python versions of image_client, an example application that uses the C++ or Python client library to execute image classification models on Triton. See Image Classification Example.

  • Several simple C++ examples show how to use the C++ library to communicate with Triton to perform inferencing and other tasks. The C++ examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix. See Simple Example Applications.

  • Several simple Python examples show how to use the Python library to communicate with Triton to perform inferencing and other tasks. The Python examples demonstrating the HTTP/REST client are named with a simple_http_ prefix and the examples demonstrating the GRPC client are named with a simple_grpc_ prefix. See Simple Example Applications.

  • Several simple Java examples show how to use the Java API to communicate with Triton to perform inferencing and other tasks.

  • A couple of Python examples that communicate with Triton using a Python GRPC API generated by the protoc compiler. grpc_client.py is a simple example that shows basic API usage. grpc_image_client.py is functionally equivalent to image_client but uses a generated GRPC client stub to communicate with Triton.

Getting the Client Libraries And Examples

The easiest way to get the Python client library is to use pip to install the tritonclient module. You can also download the C++, Python and Java client libraries from Triton GitHub release, or download a pre-built Docker image containing the client libraries from NVIDIA GPU Cloud (NGC).

It is also possible to build the client libraries with cmake.

Download Using Python Package Installer (pip)

The GRPC and HTTP client libraries are available as a Python package that can be installed using a recent version of pip.

$ pip install tritonclient[all]

Using all installs both the HTTP/REST and GRPC client libraries. There are two optional packages, grpc and http, that can be used to install support for just that protocol. For example, to install only the HTTP/REST client library use:

$ pip install tritonclient[http]

There is another optional package, cuda, that must be installed in order to use the cuda_shared_memory utilities. The all specification installs the cuda package by default, but otherwise cuda must be specified explicitly to install the client with cuda_shared_memory support.

$ pip install tritonclient[http,cuda]

The components of the install packages are:

  • http
  • grpc [ service_pb2, service_pb2_grpc, model_config_pb2 ]
  • utils [Linux distributions also include shared_memory and cuda_shared_memory]

The Linux version of the package also includes the perf_analyzer binary. The perf_analyzer binary is built on Ubuntu 20.04 and may not run on other Linux distributions. To run the perf_analyzer the following dependency must be installed:

$ sudo apt update
$ sudo apt install libb64-dev

To reiterate, the Windows installation does not include perf_analyzer or the shared_memory/cuda_shared_memory components.

Download From GitHub

The client libraries and the perf_analyzer executable can be downloaded from the Triton GitHub release page corresponding to the release you are interested in. The client libraries are found in the "Assets" section of the release page in a tar file named after the version of the release and the OS, for example, v2.3.0_ubuntu2004.clients.tar.gz.

The pre-built libraries can be used on the corresponding host system or you can install them into the Triton container to have both the clients and server in the same container.

$ mkdir clients
$ cd clients
$ wget https://github.com/triton-inference-server/server/releases/download/<tarfile_path>
$ tar xzf <tarfile_name>

After installing, the libraries can be found in lib/, the headers in include/, the Python wheel files in python/, and the jar files in java/. The bin/ and python/ directories contain the built examples that you can learn more about below.

The perf_analyzer binary is built on Ubuntu 20.04 and may not run on other Linux distributions. To use the C++ libraries or perf_analyzer executable you must install some dependencies.

$ apt-get update
$ apt-get install curl libcurl4-openssl-dev libb64-dev

Download Docker Image From NGC

A Docker image containing the client libraries and examples is available from NVIDIA GPU Cloud (NGC). Before attempting to pull the container ensure you have access to NGC. For step-by-step instructions, see the NGC Getting Started Guide.

Use docker pull to get the client libraries and examples container from NGC.

$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

Where <xx.yy> is the version that you want to pull. Within the container the client libraries are in /workspace/install/lib, the corresponding headers in /workspace/install/include, and the Python wheel files in /workspace/install/python. The image will also contain the built client examples.

Important Note: When running either the server or the client using Docker containers and using the CUDA shared memory feature, you need to add the --pid host flag when launching the containers. The reason is that CUDA IPC APIs require the PIDs of the source and destination of the exported pointer to be different. Otherwise, Docker isolates each container in its own PID namespace, which can result in the source and destination PIDs being equal. The error is always observed when both containers are started in non-interactive mode.

Build Using CMake

The client library build is performed using CMake. To build the client libraries and examples with all features, first change directory to the root of this repo and checkout the release version of the branch that you want to build (or the main branch if you want to build the under-development version).

$ git checkout main

If building the Java client you must first install Maven and a JDK appropriate for your OS. For example, for Ubuntu you should install the default-jdk package:

$ apt-get install default-jdk maven

Building on Windows vs. non-Windows requires different invocations because Triton on Windows does not yet support all the build options.

Non-Windows

Use cmake to configure the build. You should adjust the flags depending on the components of the Triton client you are working on and would like to build. For example, to build Perf Analyzer with the Triton C API, use -DTRITON_ENABLE_PERF_ANALYZER=ON -DTRITON_ENABLE_PERF_ANALYZER_C_API=ON. You can also use the TRITON_ENABLE_PERF_ANALYZER_TFS and TRITON_ENABLE_PERF_ANALYZER_TS flags to enable/disable support for the TensorFlow Serving and TorchServe backends, respectively, in perf analyzer. The following command demonstrates how to build the client with all features:

$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_HTTP=ON -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_PERF_ANALYZER=ON -DTRITON_ENABLE_PERF_ANALYZER_C_API=ON -DTRITON_ENABLE_PERF_ANALYZER_TFS=ON -DTRITON_ENABLE_PERF_ANALYZER_TS=ON -DTRITON_ENABLE_PYTHON_HTTP=ON -DTRITON_ENABLE_PYTHON_GRPC=ON -DTRITON_ENABLE_JAVA_HTTP=ON -DTRITON_ENABLE_GPU=ON -DTRITON_ENABLE_EXAMPLES=ON -DTRITON_ENABLE_TESTS=ON ..

If you are building on a release branch (or on a development branch that is based off of a release branch), then you must also use additional cmake arguments to point to that release branch for repos that the client build depends on. For example, if you are building the r21.10 client branch then you need to use the following additional cmake flags:

-DTRITON_COMMON_REPO_TAG=r21.10
-DTRITON_THIRD_PARTY_REPO_TAG=r21.10
-DTRITON_CORE_REPO_TAG=r21.10

Then use make to build the clients and examples.

$ make cc-clients python-clients java-clients

When the build completes the libraries and examples can be found in the install directory.

Windows

To build the clients you must install an appropriate C++ compiler and other dependencies required for the build. The easiest way to do this is to create the Windows min Docker image and then perform the build within a container launched from that image.

> docker run  -it --rm win10-py3-min powershell

It is not necessary to use Docker or the win10-py3-min container for the build, but if you do not you must install the appropriate dependencies onto your host system.

Next use cmake to configure the build. If you are not building within the win10-py3-min container then you will likely need to adjust the CMAKE_TOOLCHAIN_FILE location in the following command.

$ mkdir build
$ cd build
$ cmake -DVCPKG_TARGET_TRIPLET=x64-windows -DCMAKE_TOOLCHAIN_FILE='/vcpkg/scripts/buildsystems/vcpkg.cmake' -DCMAKE_INSTALL_PREFIX=install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_PYTHON_GRPC=ON -DTRITON_ENABLE_GPU=OFF -DTRITON_ENABLE_EXAMPLES=ON -DTRITON_ENABLE_TESTS=ON ..

If you are building on a release branch (or on a development branch that is based off of a release branch), then you must also use additional cmake arguments to point to that release branch for repos that the client build depends on. For example, if you are building the r21.10 client branch then you need to use the following additional cmake flags:

-DTRITON_COMMON_REPO_TAG=r21.10
-DTRITON_THIRD_PARTY_REPO_TAG=r21.10
-DTRITON_CORE_REPO_TAG=r21.10

Then use msbuild.exe to build.

$ msbuild.exe cc-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly
$ msbuild.exe python-clients.vcxproj -p:Configuration=Release -clp:ErrorsOnly

When the build completes the libraries and examples can be found in the install directory.

Client Library APIs

The C++ client API exposes a class-based interface. The commented interface is available in grpc_client.h, http_client.h, common.h.

The Python client API provides similar capabilities as the C++ API. The commented interface is available in grpc and http.
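
As a quick orientation, here is a minimal sketch of the Python HTTP client flow; the server address, model name simple, and tensor names INPUT0/OUTPUT0 are placeholders to be replaced with your own model's configuration.

import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor and attach the data to send.
input0 = httpclient.InferInput("INPUT0", [1, 16], "INT32")
input0.set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))

# Request a specific output tensor.
output0 = httpclient.InferRequestedOutput("OUTPUT0")

# Run inference and read the result back as a numpy array.
result = client.infer(model_name="simple", inputs=[input0], outputs=[output0])
print(result.as_numpy("OUTPUT0"))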

The Java client API provides similar capabilities as the Python API with similar classes and methods. For more information please refer to the Java client directory.

HTTP Options

SSL/TLS

The client library allows communication across a secured channel using the HTTPS protocol. Setting these SSL options alone does not ensure secure communication; the Triton server should be running behind an https:// proxy such as nginx. The client can then establish a secure channel to the proxy. The qa/L0_https directory in the server repository demonstrates how this can be achieved.

For C++ client, see HttpSslOptions struct that encapsulates these options in http_client.h.

For Python client, look for the following options in http/__init__.py:

  • ssl
  • ssl_options
  • ssl_context_factory
  • insecure

The C++ and Python examples demonstrate how to use SSL/TLS settings on the client side.
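
As an illustration, an HTTPS connection from the Python client might be configured as below; the proxy address is a placeholder, and the choice of gevent's default HTTPS context factory is an assumption borrowed from the SSL example pattern, not the only valid value.

import gevent.ssl
import tritonclient.http as httpclient

# Connect through an HTTPS-terminating proxy (e.g. nginx) sitting in front of Triton.
client = httpclient.InferenceServerClient(
    url="localhost:443",  # placeholder proxy address
    ssl=True,
    ssl_context_factory=gevent.ssl._create_default_https_context,  # assumed factory
    insecure=True,        # skip hostname verification; for testing only
)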

Compression

The client library exposes options to enable on-wire compression for HTTP transactions.

For C++ client, see request_compression_algorithm and response_compression_algorithm parameters in the Infer and AsyncInfer functions in http_client.h. By default, the parameter is set as CompressionType::NONE.

Similarly, for Python client, see request_compression_algorithm and response_compression_algorithm parameters in infer and async_infer functions in http/__init__.py.
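
For instance, in Python the compression preference is passed per request. The snippet below continues the HTTP client sketch from the Client Library APIs section (client, inputs, and outputs are assumed from that sketch); gzip and deflate are the expected values.

# Compress the request body and ask for a compressed response.
result = client.infer(
    model_name="simple",  # placeholder model name
    inputs=inputs,
    outputs=outputs,
    request_compression_algorithm="gzip",   # or "deflate"
    response_compression_algorithm="gzip",
)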

The C++ and Python examples demonstrate how to use the compression options.

Python AsyncIO Support (Beta)

This feature is currently in beta and may be subject to change.

Advanced users may call the Python client via async and await syntax. The infer example demonstrates how to infer with AsyncIO.
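
A minimal sketch of the AsyncIO flow is shown below; it assumes the tritonclient.http.aio module re-uses the InferInput/InferRequestedOutput classes from tritonclient.http, and the model name simple and tensor names are placeholders.

import asyncio
import numpy as np
import tritonclient.http.aio as aiohttpclient
from tritonclient.http import InferInput, InferRequestedOutput

async def main():
    # Methods on the AsyncIO client are coroutines and must be awaited.
    client = aiohttpclient.InferenceServerClient(url="localhost:8000")
    input0 = InferInput("INPUT0", [1, 16], "INT32")  # placeholder tensor
    input0.set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))
    result = await client.infer(
        "simple", inputs=[input0], outputs=[InferRequestedOutput("OUTPUT0")])
    print(result.as_numpy("OUTPUT0"))
    await client.close()

asyncio.run(main())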

If using SSL/TLS with AsyncIO, look for the ssl and ssl_context options in http/aio/__init__.py

Python Client Plugin API (Beta)

This feature is currently in beta and may be subject to change.

The Triton Client Plugin API lets you register custom plugins to add or modify request headers. This is useful if you have a gateway in front of Triton Server that requires extra headers for each request, such as HTTP Authorization. By registering the plugin, your gateway will work with Python clients without additional configuration. Note that Triton Server does not implement authentication or authorization mechanisms, and similarly, Triton Server is not the direct consumer of the additional headers.

The plugin must implement the __call__ method. The signature of the __call__ method should look like below:

class MyPlugin:
    def __call__(self, request):
        """This method will be called for every HTTP request. Currently, the only
        field that can be accessed by the request object is the `request.headers`
        field. This field must be updated in-place.
        """
        request.headers['my-header-key'] = 'my-header-value'

After the plugin implementation is complete, you can register the plugin by calling register on the InferenceServerClient object.

from tritonclient.http import InferenceServerClient

client = InferenceServerClient(...)

# Register the plugin
my_plugin = MyPlugin()
client.register_plugin(my_plugin)

# All the method calls will update the headers according to the plugin
# implementation.
client.infer(...)

To unregister the plugin, you can call the client.unregister_plugin() function.

Basic Auth

You can register the BasicAuth plugin that implements Basic Authentication.

from tritonclient.grpc.auth import BasicAuth
from tritonclient.grpc import InferenceServerClient

basic_auth = BasicAuth('username', 'password')
client = InferenceServerClient('...')

client.register_plugin(basic_auth)

The example above shows how to register the plugin for gRPC client. The BasicAuth plugin can be registered similarly for HTTP and AsyncIO clients.

GRPC Options

SSL/TLS

The client library allows communication across a secured channel using gRPC protocol.

For C++ client, see SslOptions struct that encapsulates these options in grpc_client.h.

For Python client, look for the following options in grpc/__init__.py:

  • ssl
  • root_certificates
  • private_key
  • certificate_chain

The C++ and Python examples demonstrate how to use SSL/TLS settings on the client side. For information on the corresponding server-side parameters, refer to the server documentation.
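
In Python these options are passed when constructing the gRPC client, roughly as sketched below; the certificate file paths are placeholders.

import tritonclient.grpc as grpcclient

# Establish a mutually-authenticated TLS channel to Triton's gRPC endpoint.
client = grpcclient.InferenceServerClient(
    url="localhost:8001",
    ssl=True,
    root_certificates="ca.crt",      # placeholder path to the CA bundle
    private_key="client.key",        # placeholder path to the client key
    certificate_chain="client.crt",  # placeholder path to the client certificate
)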

Compression

The client library also exposes options to use on-wire compression for gRPC transactions.

For C++ client, see compression_algorithm parameter in the Infer, AsyncInfer and StartStream functions in grpc_client.h. By default, the parameter is set as GRPC_COMPRESS_NONE.

Similarly, for Python client, see compression_algorithm parameter in infer, async_infer and start_stream functions in grpc/__init__.py.
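
For example, with a grpcclient.InferenceServerClient named client and previously constructed inputs and outputs (assumed here), compression is selected per call:

# Compress the request message on the wire; "gzip" and "deflate" are the expected values.
result = client.infer(
    model_name="simple",  # placeholder model name
    inputs=inputs,
    outputs=outputs,
    compression_algorithm="gzip",
)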

The C++ and Python examples demonstrate how to configure compression for clients. For information on the corresponding server-side parameters, refer to the server documentation.

GRPC KeepAlive

Triton exposes GRPC KeepAlive parameters with the default values for both client and server described here.

You can find a KeepAliveOptions struct/class that encapsulates these parameters in both the C++ and Python client libraries.

There is also a C++ and Python example demonstrating how to set up these parameters on the client side. For information on the corresponding server-side parameters, refer to the server documentation.
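
A sketch of the Python usage follows; the values mirror the documented defaults and should be tuned only after consulting the KeepAlive guidance referenced above.

import tritonclient.grpc as grpcclient

# Configure client-side gRPC keepalive pings.
keepalive_options = grpcclient.KeepAliveOptions(
    keepalive_time_ms=2**31 - 1,           # how often to send a keepalive ping
    keepalive_timeout_ms=20000,            # how long to wait for a ping ack
    keepalive_permit_without_calls=False,  # only ping while calls are in flight
    http2_max_pings_without_data=2,
)

client = grpcclient.InferenceServerClient(
    url="localhost:8001", keepalive_options=keepalive_options
)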

Custom GRPC Channel Arguments

Advanced users may require specific client-side GRPC Channel Arguments that are not currently exposed by Triton through direct means. To support this, Triton allows users to pass custom channel arguments upon creating a GRPC client. When using this option, it is up to the user to pass a valid combination of arguments for their use case; Triton cannot feasibly test every possible combination of channel arguments.

There is a C++ and Python example demonstrating how to construct and pass these custom arguments upon creating a GRPC client.

You can find a comprehensive list of possible GRPC Channel Arguments here.
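
A sketch in Python, assuming the InferenceServerClient constructor exposes a channel_args parameter that accepts a list of (key, value) tuples as in the referenced example; the specific channel arguments and values shown are illustrative only.

import tritonclient.grpc as grpcclient

# Pass raw gRPC channel arguments through to the underlying channel.
channel_args = [
    ("grpc.max_receive_message_length", 64 * 1024 * 1024),  # illustrative value
    ("grpc.keepalive_time_ms", 30000),                       # illustrative value
]

client = grpcclient.InferenceServerClient(
    url="localhost:8001", channel_args=channel_args
)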

Python AsyncIO Support (Beta)

This feature is currently in beta and may be subject to change.

Advanced users may call the Python client via async and await syntax. The infer and stream examples demonstrate how to infer with AsyncIO.
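
A minimal sketch with the gRPC AsyncIO client is shown below; it assumes tritonclient.grpc.aio mirrors the synchronous gRPC API with awaitable methods and re-uses InferInput/InferRequestedOutput from tritonclient.grpc, and the model name simple is a placeholder.

import asyncio
import numpy as np
import tritonclient.grpc.aio as aiogrpcclient
from tritonclient.grpc import InferInput, InferRequestedOutput

async def main():
    client = aiogrpcclient.InferenceServerClient(url="localhost:8001")
    input0 = InferInput("INPUT0", [1, 16], "INT32")  # placeholder tensor
    input0.set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))
    # infer() is a coroutine in the AsyncIO client.
    result = await client.infer(
        "simple", inputs=[input0], outputs=[InferRequestedOutput("OUTPUT0")])
    print(result.as_numpy("OUTPUT0"))
    await client.close()

asyncio.run(main())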

Request Cancellation

Starting from r23.10, the Triton Python gRPC client can issue cancellation of in-flight requests. This can be done by calling cancel() on the CallContext object returned by the async_infer() API.

  ctx = client.async_infer(...)
  ctx.cancel()

For streaming requests, cancel_requests=True can be passed to the stop_stream() API to terminate all in-flight requests sent via this stream.

  client.start_stream()
  for _ in range(10):
    client.async_stream_infer(...)

  # Cancels all pending requests on stream closure rather than blocking until requests complete
  client.stop_stream(cancel_requests=True)

See more details about these APIs in grpc/_client.py.

For gRPC AsyncIO requests, an AsyncIO task wrapping an infer() coroutine can be safely cancelled.

  infer_task = asyncio.create_task(aio_client.infer(...))
  infer_task.cancel()

For gRPC AsyncIO streaming requests, cancel() can be called on the asynchronous iterator returned by stream_infer() API.

  responses_iterator = aio_client.stream_infer(...)
  responses_iterator.cancel()

See more details about these APIs in grpc/aio/__init__.py.

See request_cancellation in the server user guide to learn how this is handled on the server side. If writing your own gRPC clients in the language of your choice, consult the gRPC guide on cancellation.

Simple Example Applications

This section describes several of the simple example applications and the features that they illustrate.

Bytes/String Datatype

Some frameworks support tensors where each element in the tensor is variable-length binary data. Each element can hold a string or an arbitrary sequence of bytes. On the client this datatype is BYTES (see Datatypes for information on supported datatypes).

The Python client library uses numpy to represent input and output tensors. For BYTES tensors the dtype of the numpy array should be 'np.object_' as shown in the examples. For backwards compatibility with previous versions of the client library, 'np.bytes_' can also be used for BYTES tensors. However, using 'np.bytes_' is not recommended because using this dtype will cause numpy to remove all trailing zeros from each array element. As a result, binary sequences ending in zero(s) will not be represented correctly.
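
For example, a BYTES input can be constructed in Python as sketched below; the tensor name and shape are placeholders.

import numpy as np
import tritonclient.http as httpclient

# Elements of a BYTES tensor are arbitrary byte strings, so the numpy dtype
# must be np.object_ (np.bytes_ would silently strip trailing zero bytes).
data = np.array([b"hello", b"\x00\x01\x02\x00"], dtype=np.object_).reshape(1, 2)

input0 = httpclient.InferInput("INPUT0", list(data.shape), "BYTES")  # placeholder name
input0.set_data_from_numpy(data)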

BYTES tensors are demonstrated in the C++ example applications simple_http_string_infer_client.cc and simple_grpc_string_infer_client.cc. String tensors are demonstrated in the Python example applications simple_http_string_infer_client.py and simple_grpc_string_infer_client.py.

System Shared Memory

Using system shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases.

Using system shared memory is demonstrated in the C++ example applications simple_http_shm_client.cc and simple_grpc_shm_client.cc and in the Python example applications simple_http_shm_client.py and simple_grpc_shm_client.py.

Python does not have a standard way of allocating and accessing shared memory, so as an example a simple system shared memory module is provided that can be used with the Python client library to create, set, and destroy system shared memory.
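
The flow with that module looks roughly like the sketch below (Linux only); the region name, shared memory key, and tensor details are placeholders and error handling is omitted.

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:8000")

# Create a system shared memory region and copy the input data into it.
input_data = np.arange(16, dtype=np.int32).reshape(1, 16)
byte_size = input_data.size * input_data.itemsize
shm_handle = shm.create_shared_memory_region("input_data", "/input_simple", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])

# Tell Triton about the region, then point the input at it instead of sending bytes.
client.register_system_shared_memory("input_data", "/input_simple", byte_size)
input0 = httpclient.InferInput("INPUT0", [1, 16], "INT32")  # placeholder tensor
input0.set_shared_memory("input_data", byte_size)

# ... run client.infer(...) as usual, then clean up.
client.unregister_system_shared_memory("input_data")
shm.destroy_shared_memory_region(shm_handle)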

CUDA Shared Memory

Using CUDA shared memory to communicate tensors between the client library and Triton can significantly improve performance in some cases.

Using CUDA shared memory is demonstrated in the C++ example applications simple_http_cudashm_client.cc and simple_grpc_cudashm_client.cc and in the Python example applications simple_http_cudashm_client.py and simple_grpc_cudashm_client.py.

Python does not have a standard way of allocating and accessing shared memory, so as an example a simple CUDA shared memory module is provided that can be used with the Python client library to create, set, and destroy CUDA shared memory. The module currently supports numpy arrays (example usage) and DLPack tensors (example usage).
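
The CUDA variant follows the same pattern, roughly as sketched below; the region name, device ID, and tensor details are placeholders and error handling is omitted.

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

# Allocate a CUDA shared memory region on GPU 0 and copy the input into it.
input_data = np.arange(16, dtype=np.float32).reshape(1, 16)
byte_size = input_data.size * input_data.itemsize
cuda_handle = cudashm.create_shared_memory_region("input_data", byte_size, 0)
cudashm.set_shared_memory_region(cuda_handle, [input_data])

# Register the region with Triton using its raw CUDA IPC handle.
client.register_cuda_shared_memory(
    "input_data", cudashm.get_raw_handle(cuda_handle), 0, byte_size)

input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")  # placeholder tensor
input0.set_shared_memory("input_data", byte_size)

# ... run client.infer(...) as usual, then clean up.
client.unregister_cuda_shared_memory("input_data")
cudashm.destroy_shared_memory_region(cuda_handle)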

Client API for Stateful Models

When performing inference using a stateful model, a client must identify which inference requests belong to the same sequence and also when a sequence starts and ends.

Each sequence is identified with a sequence ID that is provided when an inference request is made. It is up to the client to create a unique sequence ID. For each sequence, the first inference request should be marked as the start of the sequence and the last inference request should be marked as the end of the sequence.
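
With the Python gRPC client the sequence information is passed per request, roughly as sketched below; the model name, tensor name, and sequence ID are placeholders.

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
values = [0, 1, 2, 3]
sequence_id = 1000  # placeholder unique sequence ID chosen by the client

def callback(result, error):
    # Stream responses (or errors) arrive asynchronously through this callback.
    print(error if error else result.as_numpy("OUTPUT"))

client.start_stream(callback=callback)
for i, value in enumerate(values):
    input0 = grpcclient.InferInput("INPUT", [1, 1], "INT32")  # placeholder tensor
    input0.set_data_from_numpy(np.array([[value]], dtype=np.int32))
    client.async_stream_infer(
        model_name="simple_sequence",         # placeholder stateful model
        inputs=[input0],
        sequence_id=sequence_id,
        sequence_start=(i == 0),              # mark the first request
        sequence_end=(i == len(values) - 1),  # mark the last request
    )
client.stop_stream()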

The use of sequence IDs and start and end flags is demonstrated in the C++ example application simple_grpc_sequence_stream_infer_client.cc and in the Python example application simple_grpc_sequence_stream_infer_client.py.

Image Classification Example

The image classification example that uses the C++ client API is available at src/c++/examples/image_client.cc. The Python version of the image classification client is available at src/python/examples/image_client.py.

To use image_client (or image_client.py) you must first have a running Triton that is serving one or more image classification models. The image_client application requires that the model have a single image input and produce a single classification output. If you don't have a model repository with image classification models see QuickStart for instructions on how to create one.

Once Triton is running you can use the image_client application to send inference requests. You can specify a single image or a directory holding images. Here we send a request to the inception_graphdef model for an image from the qa/images directory.

$ image_client -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG

The Python version of the application accepts the same command-line arguments.

$ python image_client.py -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
     0.826384 (505) = COFFEE MUG

The image_client and image_client.py applications use the client libraries to talk to Triton. By default image_client instructs the client library to use HTTP/REST protocol, but you can use the GRPC protocol by providing the -i flag. You must also use the -u flag to point at the GRPC endpoint on Triton.

$ image_client -i grpc -u localhost:8001 -m inception_graphdef -s INCEPTION qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG

By default the client prints the most probable classification for the image. Use the -c flag to see more classifications.

$ image_client -m inception_graphdef -s INCEPTION -c 3 qa/images/mug.jpg
Request 0, batch size 1
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO

The -b flag allows you to send a batch of images for inferencing. The image_client application will form the batch from the image or images that you specified. If the batch is bigger than the number of images then image_client will just repeat the images to fill the batch.

$ image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images/mug.jpg
Request 0, batch size 2
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO
Image 'qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO

Provide a directory instead of a single image to perform inferencing on all images in the directory.

$ image_client -m inception_graphdef -s INCEPTION -c 3 -b 2 qa/images
Request 0, batch size 2
Image '/opt/tritonserver/qa/images/car.jpg':
    0.819196 (818) = SPORTS CAR
    0.033457 (437) = BEACH WAGON
    0.031232 (480) = CAR WHEEL
Image '/opt/tritonserver/qa/images/mug.jpg':
    0.754130 (505) = COFFEE MUG
    0.157077 (969) = CUP
    0.002880 (968) = ESPRESSO
Request 1, batch size 2
Image '/opt/tritonserver/qa/images/vulture.jpeg':
    0.977632 (24) = VULTURE
    0.000613 (9) = HEN
    0.000560 (137) = EUROPEAN GALLINULE
Image '/opt/tritonserver/qa/images/car.jpg':
    0.819196 (818) = SPORTS CAR
    0.033457 (437) = BEACH WAGON
    0.031232 (480) = CAR WHEEL

The grpc_image_client.py application behaves the same as the image_client except that instead of using the client library it uses the GRPC generated library to communicate with Triton.

Ensemble Image Classification Example Application

In comparison to the image classification example above, this example uses an ensemble of an image-preprocessing model implemented as a DALI backend and a TensorFlow Inception model. The ensemble model allows you to send the raw image binaries in the request and receive classification results without preprocessing the images on the client.

To try this example you should follow the DALI ensemble example instructions.

client's Issues

Converting InferenceRequest to InferInput

I have a decoupled model (python backend) that receives requests from a client and sends them to another downstream model; the intermediate model only processes some inputs and passes the rest to the next model.

Currently, I'm converting inputs to numpy arrays first and then wrapping them in InferInput.

for input_name in self.input_names[1:]:
    data_ = pb_utils.get_input_tensor_by_name(request, input_name)\
        .as_numpy()\
        .reshape(-1)
    input_ = triton_grpc.InferInput(input_name, data_.shape, "FP32" if data_.dtype == np.float32 else "INT32")
    input_.set_data_from_numpy(data_)
    inputs.append(input_)

However, I think the .as_numpy() and .set_data_from_numpy() functions do some (de)serialization, and using a for loop to copy most of the inputs is a little bit inefficient.

Is there a way to convert InferenceRequest to InferInput more efficiently?

Thanks!

Unable to use triton client with shared memory in C++ (Jetpack 6 device)

I am using the tritonserver + client igpu release (tritonserver2.41.0-igpu.tar.gz) on a jetpack 6 device. I want to use the shared memory functions with the triton client which are declared in shm_utils.h and defined in shm_utils.cc. However, the header is not found in Triton Client's include directory leading to a compilation error.

On making the following changes to src/c++/library/CMakeLists.txt and building the client from source, I was able to import the header and use the shared memory functions. (Triton Client Branch - r23.12)

@@ -84,12 +84,12 @@ if(TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER)
   # libgrpcclient object build
   set(
       REQUEST_SRCS
-      grpc_client.cc common.cc
+      grpc_client.cc common.cc shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      grpc_client.h common.h ipc.h
+      grpc_client.h common.h ipc.h shm_utils.h
   )
 
   add_library(
@@ -257,12 +257,12 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_PERF_ANALYZER)
   # libhttpclient object build
   set(
       REQUEST_SRCS
-      http_client.cc common.cc cencode.c
+      http_client.cc common.cc cencode.c shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      http_client.h common.h ipc.h cencode.h
+      http_client.h common.h ipc.h cencode.h shm_utils.h
   )
 
   add_library(
@@ -394,6 +394,7 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER
       FILES
       ${CMAKE_CURRENT_SOURCE_DIR}/common.h
       ${CMAKE_CURRENT_SOURCE_DIR}/ipc.h
+      ${CMAKE_CURRENT_SOURCE_DIR}/shm_utils.h
       DESTINATION include
   )
 

Are the shared memory cc and header files not included by default, or am I not including them correctly during compilation?

Performance Analyzer cannot collect metrics on Jetson Xavier

I have deployed Triton on Jetson Xavier and used Performance Analyzer to measure model performance during inference. Latency and throughput are correctly measured, but when I try to collect metrics using the --collect-metrics option, the following messages appear:

WARNING: Unable to parse 'nv_gpu_utilization' metric.
WARNING: Unable to parse 'nv_gpu_power_usage' metric.
WARNING: Unable to parse 'nv_gpu_memory_used_bytes' metric.
WARNING: Unable to parse 'nv_gpu_memory_total_bytes' metric.

The command I am using to launch the inferences is:
/usr/local/bin/perf_analyzer --collect-metrics -m 3D_fp32_05_batchd -b 1 --concurrency-range 1

Is it possible to solve this problem?

Thanks.

How to ensure `load_model` applies to the same server pod as `infer`?

In a k8s environment, there are multiple server replicas. We use the Python client, and on the server side we use explicit model control mode.

Now we do something like

from tritonclient.http import InferenceServerClient

server_url = "triton.{namespace}.svc.cluster.local:8000"
triton_client = InferenceServerClient(url=server_url)

if not triton_client.is_model_ready(model_name):
    triton_client.load_model(model_name)

triton_client.infer(
    model_name,
    model_version=model_version,
    inputs=triton_inputs,
    outputs=triton_outputs,
)

Since server_url is the k8s service endpoint, I guess it relies on the k8s load balancer to choose a random pod. In this case, how do we ensure that is_model_ready, load_model, and infer all apply to the same pod?

Memory leak in SharedMemoryTensor.__dlpack__

Hello, a memory leak was detected when executing this code. The code was run on Python 3.10, triton-client 2.41.1, torch 2.1.2.

import torch
import tritonclient.utils.cuda_shared_memory as cudashm

n1 = 1000
n2 = 1000
gpu_tensor = torch.ones([n1, n2]).cuda(0)
byte_size = 4 * n1 * n2
shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
while True:
    cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
    smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", [n1, n2])
    generated_torch_tensor = torch.from_dlpack(smt)

The leak occurs when the __dlpack__ method is called in torch.from_dlpack(smt).

For the golang gRPC client, when calling the ModelInfer interface, how do I parse useful values from ModelInferResponse?

ModelInferResponse is defined as follows

type ModelInferResponse struct {
        // ohter ...

	Outputs []*ModelInferResponse_InferOutputTensor `protobuf:"bytes,5,rep,name=outputs,proto3" json:"outputs,omitempty"`
	RawOutputContents [][]byte `protobuf:"bytes,6,rep,name=raw_output_contents,json=rawOutputContents,proto3" json:"raw_output_contents,omitempty"`
}

Calling the ModelInfer interface returns this response. How do I parse the contents of the RawOutputContents field according to the Datatype? Also, why is Outputs[i].Contents empty?

make cc-clients: Could not find requested file: RapidJSON-targets.cmake

The cmake build is not successful:

 ❯ cmake --version
cmake version 3.21.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_EXAMPLES=ON ..
make cc-clients


...
[ 92%] Performing configure step for 'cc-clients'
loading initial cache file /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/tmp/cc-clients-cache-Release.cmake
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


-- RapidJSON found. Headers:
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found Python: /usr/bin/python3.10 (found version "3.10.13") found components: Interpreter
-- Found Protobuf: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/protobuf/bin/protoc-3.19.4.0 (found version "3.19.4.0")
-- Using protobuf 3.19.4.0
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.1.1f")
-- Found c-ares: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/c-ares/lib/cmake/c-ares/c-ares-config.cmake (found version "1.17.2")
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
-- Found Python: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter
-- Using protobuf 3.19.4.0
-- Using protobuf 3.19.4.0
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeOutput.log".
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeError.log".
make[3]: *** [CMakeFiles/cc-clients.dir/build.make:96: cc-clients/src/cc-clients-stamp/cc-clients-configure] Error 1
make[2]: *** [CMakeFiles/Makefile2:119: CMakeFiles/cc-clients.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:126: CMakeFiles/cc-clients.dir/rule] Error 2
make: *** [Makefile:124: cc-clients] Error 2

AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

I have a Triton Server that runs in Docker, where I initialized the CLIP model. I wrote some simple code to try inference and get the output of this model, but I get the error AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

Here is my client code:

from transformers import CLIPProcessor
from PIL import Image
import tritonclient.http as httpclient

if __name__ == "__main__":

    triton_client = httpclient.InferenceServerClient(url="localhost:8003")

    # Example of tracing an image processing:
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("9997103.jpg").convert('RGB')

    inputs = processor(images=image, return_tensors="pt")['pixel_values']

    inputs = []
    outputs = []

    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
    inputs[0].set_data_from_numpy(image)

    outputs.append(triton_client.InferRequestedOutput("output__0", binary_data=False))

    results = triton_client.infer(
        model_name='clip',
        inputs=inputs,
        outputs=outputs,
    )

    print(results.as_numpy("output__0"))

Here's the error I'm getting

Traceback (most recent call last):
  File "main.py", line 18, in <module>
    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

Help me please

DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported.

Description
test_cuda_shared_memory.py fails when the batch size dimension is smaller than 2. I think this issue is from pytorch, but I'm just wondering if there's any workaround to make it work with pytorch versions >1.12.

Workaround
Use torch version 1.12. I tested it and it's working fine.

Triton Information
What version of Triton client are you using?
compiled from latest version b0b5b27

To Reproduce

import unittest

import torch
import tritonclient.utils.cuda_shared_memory as cudashm


class DLPackTest(unittest.TestCase):
    """
    Testing DLPack implementation in CUDA shared memory utilities
    """

    def test_from_gpu(self):
        # Create GPU tensor via PyTorch and CUDA shared memory region with
        # enough space
        tensor_shape = (1,2,4)
        gpu_tensor = torch.ones(tensor_shape).cuda(0)
        byte_size = gpu_tensor.nelement() * gpu_tensor.element_size()

        shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)

        # Set data from DLPack specification of PyTorch tensor
        cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])

        # Make sure the DLPack specification of the shared memory region can
        # be consumed by PyTorch
        smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32",  tensor_shape)
        generated_torch_tensor = torch.from_dlpack(smt)
        self.assertTrue(torch.allclose(gpu_tensor, generated_torch_tensor))

        cudashm.destroy_shared_memory_region(shm_handle)

Make perf_analyzer work on macbook

It looks like perf_analyzer doesn't work on a MacBook, and pip install tritonclient doesn't include perf_analyzer. Is this true? Is it possible to support it?
