triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.

License: BSD 3-Clause "New" or "Revised" License

CMake 2.88% C++ 67.37% C 0.22% Shell 0.26% Go 0.27% Java 2.86% Scala 0.25% Python 25.74% JavaScript 0.16%

client's Issues

DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported.

Description
test_cuda_shared_memory.py fails when the batch dimension is smaller than 2, for example with tensor_shape = (1, 2, 4). I think the issue originates in PyTorch, but I'm wondering whether there is a workaround that makes the test pass with PyTorch versions newer than 1.12.

Workaround
Use torch 1.12. I tested it and it works fine.

Triton Information
What version of Triton client are you using?
Compiled from the latest commit, b0b5b27.

To Reproduce

import unittest

import torch
import tritonclient.utils.cuda_shared_memory as cudashm


class DLPackTest(unittest.TestCase):
    """
    Testing DLPack implementation in CUDA shared memory utilities
    """

    def test_from_gpu(self):
        # Create GPU tensor via PyTorch and CUDA shared memory region with
        # enough space
        tensor_shape = (1,2,4)
        gpu_tensor = torch.ones(tensor_shape).cuda(0)
        byte_size = gpu_tensor.nelement() * gpu_tensor.element_size()

        shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)

        # Set data from DLPack specification of PyTorch tensor
        cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])

        # Make sure the DLPack specification of the shared memory region can
        # be consumed by PyTorch
        smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", tensor_shape)
        generated_torch_tensor = torch.from_dlpack(smt)
        self.assertTrue(torch.allclose(gpu_tensor, generated_torch_tensor))

        cudashm.destroy_shared_memory_region(shm_handle)
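
One possible explanation, which the report does not confirm, is that newer PyTorch releases export a stride of 1 for any dimension of extent 1, so a strict C-order check rejects shapes such as (1, 2, 4). The sketch below is a hypothetical helper (is_c_contiguous is not part of tritonclient) showing a contiguity check that skips extent-1 dimensions, since their stride never affects addressing. Strides are assumed to be given in elements, as in DLPack.

```python
def is_c_contiguous(shape, strides):
    """Return True if (shape, strides) describes a C-order contiguous layout.

    Strides are counted in elements. Dimensions of extent 1 are skipped
    because their stride is never used to compute an address.
    """
    if strides is None:
        # DLPack uses "no strides" to mean compact row-major storage.
        return True
    expected = 1
    # Walk from the innermost (fastest-varying) dimension outwards.
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent == 1:
            continue
        if stride != expected:
            return False
        expected *= extent
    return True


# Strides a strict check would accept, and the normalized form for shape (1, 2, 4).
print(is_c_contiguous((1, 2, 4), (8, 4, 1)))
print(is_c_contiguous((1, 2, 4), (1, 4, 1)))
```

Both calls describe the same memory layout, so a check written this way treats them identically.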

Converting InferenceRequest to InferInput

I have a decoupled model (python backend) that receives requests from a client and sends the request to another downstream model, and the intermediate model only processes some inputs and passes the rest to the next model.

Currently, I convert the inputs to numpy arrays first and then wrap them in InferInput.

for input_name in self.input_names[1:]:
    data_ = (
        pb_utils.get_input_tensor_by_name(request, input_name)
        .as_numpy()
        .reshape(-1)
    )
    datatype = "FP32" if data_.dtype == np.float32 else "INT32"
    input_ = triton_grpc.InferInput(input_name, data_.shape, datatype)
    input_.set_data_from_numpy(data_)
    inputs.append(input_)

However, I think the .as_numpy() and .set_data_from_numpy() functions do some (de)serialization, and using a for loop to copy most of the inputs is a little bit inefficient.

Is there a way to convert InferenceRequest to InferInput more efficiently?

Thanks!

Memory leak in SharedMemoryTensor.__dlpack__

Hello, I detected a memory leak when running the code below with Python 3.10, triton-client 2.41.1 and torch 2.1.2.

import torch
import tritonclient.utils.cuda_shared_memory as cudashm

n1 = 1000
n2 = 1000
gpu_tensor = torch.ones([n1, n2]).cuda(0)
byte_size = 4 * n1 * n2
shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
while True:
    cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
    smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", [n1, n2])
    generated_torch_tensor = torch.from_dlpack(smt)

The leak occurs when the __dlpack__ function is called inside torch.from_dlpack(smt).

Revert urllib3 version pin

Hi. We are trying to integrate with OIP model servers over at the Feast feature store and need to add mlserver and tritonclient as optional dependencies. The problem is that we already depend on snowflake-connector-python, which still has a strict urllib3<2.0.0 requirement for Python 3.9. I saw in #457 that the urllib3 version pin here was added only to avoid vulnerability reports. I'm not sure which vulnerability that referred to, but it seems that urllib3 still plans to ship security fixes for v1. Is it possible to revert the version pin, or to allow something like (>=1.26.18,<2 or >=2.0.7)? Thanks.
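
To make the requested range concrete, here is a minimal sketch of the acceptance rule "at least 1.26.18 and below 2, or at least 2.0.7". The helper names are hypothetical, and the parser only handles plain dotted releases such as 1.26.18 (no pre-release or post-release suffixes).

```python
def parse_version(text):
    """Turn a plain dotted release such as '1.26.18' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def urllib3_allowed(version):
    """Return True if version is in [1.26.18, 2) or is at least 2.0.7."""
    v = parse_version(version)
    in_v1_range = (1, 26, 18) <= v < (2,)
    in_v2_range = v >= (2, 0, 7)
    return in_v1_range or in_v2_range


for candidate in ("1.26.17", "1.26.18", "2.0.6", "2.0.7"):
    print(candidate, urllib3_allowed(candidate))
```

Note that comma-separated clauses in a pip version specifier are combined with "and", so the "or" in this rule would have to be expressed differently in a requirements file, for example by excluding the unwanted 2.0.x releases one by one.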

input_data

Hi,

Do I need to create a JSON file that contains every input declared across all of my config.pbtxt files?

For example, if I have two models and each model has 2 inputs, should input_data.json contain sample input data for all 4 inputs?

Thanks
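
Assuming the question is about the JSON file passed to perf_analyzer with --input-data, and given that perf_analyzer profiles the single model named with -m, a file should only need entries for that one model's inputs. The sketch below writes such a document for one model with two inputs. The input names, shapes and values are placeholders, not taken from the question.

```python
import json


def build_input_data(inputs):
    """Build a perf_analyzer-style input-data document for ONE model.

    inputs maps an input name to a (shape, flat_values) pair.
    """
    step = {
        name: {"content": list(values), "shape": list(shape)}
        for name, (shape, values) in inputs.items()
    }
    return {"data": [step]}


# Placeholder names and shapes; use the ones from this model's config.pbtxt.
model_a_inputs = {
    "INPUT0": ([2], [1.0, 2.0]),
    "INPUT1": ([2], [3.0, 4.0]),
}

document = build_input_data(model_a_inputs)
print(json.dumps(document, indent=2))
```

Under that assumption, the second model would get its own file, built the same way from its own two inputs.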

AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

I have a Triton server running in Docker, serving a CLIP model. I wrote some simple code to run inference and get the model's output, but I get the error AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

Here is my client code:

from transformers import CLIPProcessor
from PIL import Image
import tritonclient.http as httpclient

if __name__ == "__main__":

    triton_client = httpclient.InferenceServerClient(url="localhost:8003")

    # Example of tracing an image processing:
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("9997103.jpg").convert('RGB')

    inputs = processor(images=image, return_tensors="pt")['pixel_values']

    inputs = []
    outputs = []

    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
    inputs[0].set_data_from_numpy(image)

    outputs.append(triton_client.InferRequestedOutput("output__0", binary_data=False))

    results = triton_client.infer(
        model_name='clip',
        inputs=inputs,
        outputs=outputs,
    )

    print(results.as_numpy("output__0"))

Here's the error I'm getting

Traceback (most recent call last):
  File "main.py", line 18, in <module>
    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

Help me please
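
The traceback points at triton_client.InferInput. In tritonclient, InferInput and InferRequestedOutput are classes of the tritonclient.http (and tritonclient.grpc) module, not attributes of the client object. The snippet above also overwrites the processor output with an empty list, passes the PIL image instead of an array, and uses "TYPE_FP32", which is the config.pbtxt spelling of the datatype that the client calls "FP32". A minimal sketch of the request construction follows; build_request is a hypothetical helper, and the module is passed in as an argument so the same code can be used with either protocol.

```python
def build_request(client_module, pixel_values,
                  input_name="input__0", output_name="output__0"):
    """Create the input and output lists expected by InferenceServerClient.infer.

    client_module is the imported tritonclient.http (or tritonclient.grpc)
    module; pixel_values is a NumPy float32 array.
    """
    # InferInput lives on the module, not on the client instance.
    infer_input = client_module.InferInput(
        input_name, list(pixel_values.shape), "FP32"
    )
    infer_input.set_data_from_numpy(pixel_values)
    requested_output = client_module.InferRequestedOutput(output_name)
    return [infer_input], [requested_output]


# Usage sketch (assumes tritonclient[http] is installed and the server is up):
#   import tritonclient.http as httpclient
#   client = httpclient.InferenceServerClient(url="localhost:8003")
#   pixel_values = processor(images=image, return_tensors="np")["pixel_values"]
#   inputs, outputs = build_request(httpclient, pixel_values)
#   results = client.infer(model_name="clip", inputs=inputs, outputs=outputs)
#   print(results.as_numpy("output__0"))
```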

Unable to use triton client with shared memory in C++ (Jetpack 6 device)

I am using the tritonserver + client iGPU release (tritonserver2.41.0-igpu.tar.gz) on a JetPack 6 device. I want to use the Triton client's shared memory functions, which are declared in shm_utils.h and defined in shm_utils.cc. However, the header is missing from the Triton client's include directory, which leads to a compilation error.

After making the following changes to src/c++/library/CMakeLists.txt and building the client from source (Triton client branch r23.12), I was able to include the header and use the shared memory functions.

@@ -84,12 +84,12 @@ if(TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER)
   # libgrpcclient object build
   set(
       REQUEST_SRCS
-      grpc_client.cc common.cc
+      grpc_client.cc common.cc shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      grpc_client.h common.h ipc.h
+      grpc_client.h common.h ipc.h shm_utils.h
   )
 
   add_library(
@@ -257,12 +257,12 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_PERF_ANALYZER)
   # libhttpclient object build
   set(
       REQUEST_SRCS
-      http_client.cc common.cc cencode.c
+      http_client.cc common.cc cencode.c shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      http_client.h common.h ipc.h cencode.h
+      http_client.h common.h ipc.h cencode.h shm_utils.h
   )
 
   add_library(
@@ -394,6 +394,7 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER
       FILES
       ${CMAKE_CURRENT_SOURCE_DIR}/common.h
       ${CMAKE_CURRENT_SOURCE_DIR}/ipc.h
+      ${CMAKE_CURRENT_SOURCE_DIR}/shm_utils.h
       DESTINATION include
   )

Are the shared memory cc and header files not included by default, or am I not including them correctly during compilation?

Add support for FetchContent or find_package

Neither find_package() nor FetchContent works out of the box for a standalone C++ CMake app.

find_package

Compile tritonclient manually and set CMAKE_PREFIX_PATH to the install folder, or install tritonclient globally. In either case, the following minimal CMake file fails:

# tritonclient
find_package(TritonCommon REQUIRED)
find_package(TritonClient REQUIRED)

...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)

CMake Error at CMakeLists.txt:43 (add_executable):
  Target "test" links to target "protobuf::libprotobuf" but the target was
  not found.  Perhaps a find_package() call is missing for an IMPORTED
  target, or an ALIAS target is missing?

FetchContent

# tritonclient
FetchContent_Declare(
    tritonclient
    GIT_REPOSITORY https://github.com/triton-inference-server/client
    GIT_TAG r24.04
)
set(TRITON_ENABLE_CC_GRPC ON)
set(TRITON_COMMON_REPO_TAG r24.04)
set(TRITON_THIRD_PARTY_REPO_TAG r24.04)
set(TRITON_CORE_REPO_TAG r24.04)
FetchContent_MakeAvailable(tritonclient)

...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)

CMake Error at CMakeLists.txt:55 (add_executable):
  Target "test" links to target "TritonClient::grpcclient" but the target
  was not found.  Perhaps a find_package() call is missing for an IMPORTED
  target, or an ALIAS target is missing?

What is currently the recommended way to create a standalone app that depends on tritonclient? It seems that the link targets are broken for find_package, and that FetchContent is not supported at all because no targets are exported.

How to ensure `load_model` applies to the same server pod as `infer`?

In a k8s environment there are multiple server replicas. We use the Python client, and the server runs in explicit model control mode.

Currently we do something like this:

server_url = "triton.{namespace}.svc.cluster.local:8000"
triton_client = InferenceServerClient(url=server_url)

if not triton_client.is_model_ready(model_name):
    triton_client.load_model(model_name)

triton_client.infer(
    model_name,
    model_version=model_version,
    inputs=triton_inputs,
    outputs=triton_outputs,
)

Since server_url is the k8s Service endpoint, I guess it relies on the k8s load balancer to choose a random pod. In that case, how do we ensure that is_model_ready, load_model and infer all reach the same pod?
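
One possible approach, sketched here under the assumption that the replicas are also exposed through a headless Service (so that DNS returns one address per pod), is to resolve the individual pod addresses and create one client per address. Every call made through a given client then goes to the same pod. resolve_pod_urls and the host name in the usage comment are hypothetical.

```python
import socket


def resolve_pod_urls(host, port, resolver=socket.getaddrinfo):
    """Return a sorted list of 'ip:port' strings, one per address behind host."""
    records = resolver(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each record is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = {sockaddr[0] for *_, sockaddr in records}
    return [f"{ip}:{port}" for ip in sorted(addresses)]


# Usage sketch (hypothetical headless Service name):
#   for url in resolve_pod_urls("triton-headless.my-namespace.svc.cluster.local", 8000):
#       client = InferenceServerClient(url=url)
#       if not client.is_model_ready(model_name):
#           client.load_model(model_name)
```

Loading the model on every resolved pod, as in the usage comment, is one way to make a later infer call through the load-balanced Service safe regardless of which replica receives it.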

urllib dependency is present when using [grpc, cuda] options

From my reading of the code, urllib is only used by the HTTP backend, but the top-level requirements.txt lists urllib as a dependency. It should appear only in requirements_http.txt, not in the top-level requirements.txt.

This creates unnecessary conflicts when one wants to depend only on gRPC and not on the HTTP backend, such as the dependency conflict reported in #648.

Make perf_analyzer work on macbook

It looks like perf_analyzer doesn't work on a MacBook, and pip install tritonclient doesn't include perf_analyzer. Is this true? Is it possible to support it?

make cc-clients: Could not find requested file: RapidJSON-targets.cmake

The CMake configure step does not complete successfully.

 ❯ cmake --version
cmake version 3.21.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_EXAMPLES=ON ..
make cc-clients


...
[ 92%] Performing configure step for 'cc-clients'
loading initial cache file /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/tmp/cc-clients-cache-Release.cmake
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


-- RapidJSON found. Headers:
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found Python: /usr/bin/python3.10 (found version "3.10.13") found components: Interpreter
-- Found Protobuf: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/protobuf/bin/protoc-3.19.4.0 (found version "3.19.4.0")
-- Using protobuf 3.19.4.0
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.1.1f")
-- Found c-ares: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/c-ares/lib/cmake/c-ares/c-ares-config.cmake (found version "1.17.2")
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
-- Found Python: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter
-- Using protobuf 3.19.4.0
-- Using protobuf 3.19.4.0
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeOutput.log".
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeError.log".
make[3]: *** [CMakeFiles/cc-clients.dir/build.make:96: cc-clients/src/cc-clients-stamp/cc-clients-configure] Error 1
make[2]: *** [CMakeFiles/Makefile2:119: CMakeFiles/cc-clients.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:126: CMakeFiles/cc-clients.dir/rule] Error 2
make: *** [Makefile:124: cc-clients] Error 2

Performance Analyzer cannot collect metrics on Jetson Xavier

I have deployed Triton on Jetson Xavier and used Performance Analyzer to measure model performance during inference. Latency and throughput are correctly measured, but when I try to collect metrics using --collect-metrics option, the following message appears:

WARNING: Unable to parse 'nv_gpu_utilization' metric.
WARNING: Unable to parse 'nv_gpu_power_usage' metric.
WARNING: Unable to parse 'nv_gpu_memory_used_bytes' metric.
WARNING: Unable to parse 'nv_gpu_memory_total_bytes' metric.

The command I am using to launch the inferences is:
/usr/local/bin/perf_analyzer --collect-metrics -m 3D_fp32_05_batchd -b 1 --concurrency-range 1

Is it possible to solve this problem?

Thanks.
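
A quick way to narrow this down is to check whether the server's metrics endpoint (by default http://localhost:8002/metrics) reports these GPU metrics at all. The sketch below is a hypothetical helper that scans Prometheus-format text for the metric names from the warnings. It assumes the warnings mean the names are absent from the server's response, which the report does not confirm.

```python
def present_metrics(metrics_text, wanted):
    """Return the subset of wanted metric names that have at least one sample."""
    found = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{label="value"} 1.0   or   name 1.0
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in wanted:
            found.add(name)
    return found


GPU_METRICS = {
    "nv_gpu_utilization",
    "nv_gpu_power_usage",
    "nv_gpu_memory_used_bytes",
    "nv_gpu_memory_total_bytes",
}

# Usage sketch (assumes the default metrics port):
#   from urllib.request import urlopen
#   text = urlopen("http://localhost:8002/metrics").read().decode()
#   print(sorted(GPU_METRICS - present_metrics(text, GPU_METRICS)))
```

If the printed list contains all four names, the server is not exporting GPU metrics on this device, and perf_analyzer has nothing to parse.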

For the Go gRPC client, how do I parse useful values from the ModelInferResponse returned by ModelInfer?

ModelInferResponse is defined as follows:

type ModelInferResponse struct {
	// other fields ...

	Outputs []*ModelInferResponse_InferOutputTensor `protobuf:"bytes,5,rep,name=outputs,proto3" json:"outputs,omitempty"`
	RawOutputContents [][]byte `protobuf:"bytes,6,rep,name=raw_output_contents,json=rawOutputContents,proto3" json:"raw_output_contents,omitempty"`
}

After calling the ModelInfer interface and receiving the response, how do I parse the contents of the RawOutputContents field according to each output's Datatype?
Also, why is Outputs[i].Contents empty?
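
As far as I understand the protocol, tensor data is returned either in each output's Contents or in raw_output_contents, not both. When the raw form is used, Contents is left empty and RawOutputContents[i] holds the bytes for Outputs[i]: the flattened, row-major tensor in little-endian byte order, to be interpreted using Outputs[i].Datatype and Outputs[i].Shape. The sketch below shows the decoding in Python for a few fixed-size datatypes. decode_raw_output is a hypothetical helper, and BYTES tensors (which carry a length prefix per element) are not handled. The same layout can be read in Go with encoding/binary and binary.LittleEndian.

```python
import struct

# struct format characters for a few fixed-size Triton datatypes.
_FORMATS = {"FP32": "f", "FP64": "d", "INT32": "i", "INT64": "q", "UINT8": "B"}


def decode_raw_output(datatype, raw):
    """Decode one raw_output_contents entry into a flat list of Python numbers."""
    fmt = _FORMATS[datatype]
    count = len(raw) // struct.calcsize("<" + fmt)
    # "<" selects little-endian with no padding.
    return list(struct.unpack(f"<{count}{fmt}", raw))


# Round trip with bytes packed locally, standing in for a server response.
example = struct.pack("<3f", 1.0, 2.0, 3.0)
print(decode_raw_output("FP32", example))
```

The flat list can then be reshaped according to the output's Shape field.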
