triton-inference-server / client
Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
License: BSD 3-Clause "New" or "Revised" License
Description
test_cuda_shared_memory.py fails when the batch dimension is smaller than 2. I think the issue comes from PyTorch, but I am wondering whether there is a workaround that makes it work with PyTorch versions later than 1.12.
Workaround
Use torch 1.12. I tested it and it works fine.
Triton Information
What version of Triton client are you using?
compiled from latest version b0b5b27
To Reproduce
import unittest

import torch
import tritonclient.utils.cuda_shared_memory as cudashm


class DLPackTest(unittest.TestCase):
    """
    Testing DLPack implementation in CUDA shared memory utilities
    """

    def test_from_gpu(self):
        # Create GPU tensor via PyTorch and CUDA shared memory region with
        # enough space
        tensor_shape = (1, 2, 4)
        gpu_tensor = torch.ones(tensor_shape).cuda(0)
        byte_size = gpu_tensor.nelement() * gpu_tensor.element_size()
        shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
        # Set data from DLPack specification of PyTorch tensor
        cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
        # Make sure the DLPack specification of the shared memory region can
        # be consumed by PyTorch
        smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", tensor_shape)
        generated_torch_tensor = torch.from_dlpack(smt)
        self.assertTrue(torch.allclose(gpu_tensor, generated_torch_tensor))
        cudashm.destroy_shared_memory_region(shm_handle)
I have a decoupled model (python backend) that receives requests from a client and forwards them to a downstream model. This intermediate model processes only some of the inputs and passes the rest to the next model unchanged.
Currently, I convert the inputs to numpy arrays first and then wrap them in InferInput.
for input_name in self.input_names[1:]:
    data_ = pb_utils.get_input_tensor_by_name(request, input_name)\
        .as_numpy()\
        .reshape(-1)
    input_ = triton_grpc.InferInput(input_name, data_.shape, "FP32" if data_.dtype == np.float32 else "INT32")
    input_.set_data_from_numpy(data_)
    inputs.append(input_)
However, I think the .as_numpy() and .set_data_from_numpy() functions do some (de)serialization, and using a for loop to copy most of the inputs is a little bit inefficient.
Is there a way to convert InferenceRequest to InferInput more efficiently?
Thanks!
Hello, a memory leak was detected when executing this code. The code was run on Python 3.10, triton-client 2.41.1, and torch 2.1.2.
import torch
import tritonclient.utils.cuda_shared_memory as cudashm

n1 = 1000
n2 = 1000
gpu_tensor = torch.ones([n1, n2]).cuda(0)
byte_size = 4 * n1 * n2  # 4 bytes per FP32 element
shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
# Loop forever so that the memory growth becomes visible
while True:
    cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
    smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", [n1, n2])
    generated_torch_tensor = torch.from_dlpack(smt)
The leak occurs when the dlpack function is called in torch.from_dlpack(smt)
Hi. We are trying to integrate with OIP model servers over at the feast feature store and need to add mlserver and tritonclient as optional dependencies. The problem is that we already depend on snowflake-connector-python, which still has a strict urllib3<2.0.0 requirement for Python 3.9. I saw that the urllib3 version pin here (#457) was added only to avoid vulnerability reports. I am not sure which vulnerability that referred to, but it seems urllib3 still plans to ship security fixes for v1. Is it possible to revert the version pin, or to allow something like (>=1.26.18,<2 or >=2.0.7)? Thanks
Hi,
Do I need to create a JSON file containing every input declared across all of my config.pbtxt files?
For example, if I have two models with two inputs each, should input_data.json contain sample data for all four inputs, or only for some of them?
Thanks
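A minimal sketch, assuming the question refers to the file passed to perf_analyzer with --input-data. Since perf_analyzer profiles the single model named with -m, the file would normally describe only that model's inputs, so two models would mean two files (or two runs). The input names INPUT0 and INPUT1, the shapes, and the helper name build_input_data are hypothetical; the "data", "content" and "shape" keys follow the perf_analyzer input-data layout as I understand it.

```python
import json


def build_input_data(inputs):
    """Build a perf_analyzer-style input-data document for ONE model.

    `inputs` maps an input name to a (flat_values, shape) pair.
    Hypothetical helper for illustration only.
    """
    entry = {
        name: {"content": list(values), "shape": list(shape)}
        for name, (values, shape) in inputs.items()
    }
    # "data" is a list of requests; a single request is enough for a sketch.
    return {"data": [entry]}


# Hypothetical model with two inputs of four elements each.
doc = build_input_data(
    {
        "INPUT0": ([0.0, 0.0, 0.0, 0.0], [4]),
        "INPUT1": ([1, 2, 3, 4], [4]),
    }
)
text = json.dumps(doc, indent=2)
```

Writing `text` to a file such as input_data_model_a.json (a made-up name) and pointing --input-data at it while profiling that model would be the intended use.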
I have a Triton server running in Docker, where I loaded the CLIP model. I wrote some simple code to run inference and read the model's output, but I get the error AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'
Here is my client code:
from transformers import CLIPProcessor
from PIL import Image
import tritonclient.http as httpclient
if __name__ == "__main__":
    triton_client = httpclient.InferenceServerClient(url="localhost:8003")
    # Example of tracing an image processing:
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("9997103.jpg").convert('RGB')
    inputs = processor(images=image, return_tensors="pt")['pixel_values']
    inputs = []
    outputs = []
    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
    inputs[0].set_data_from_numpy(image)
    outputs.append(triton_client.InferRequestedOutput("output__0", binary_data=False))
    results = triton_client.infer(
        model_name='clip',
        inputs=inputs,
        outputs=outputs,
    )
    print(results.as_numpy("output__0"))
Here's the error I'm getting
Traceback (most recent call last):
File "main.py", line 18, in <module>
inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'
Help me please
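A hedged note on the likely cause: in the tritonclient Python package, InferInput and InferRequestedOutput are classes of the tritonclient.http module, not attributes of the InferenceServerClient object, so they would be written as httpclient.InferInput(...) and httpclient.InferRequestedOutput(...). In addition, the client expects datatype strings such as "FP32", while "TYPE_FP32" is the spelling used in config.pbtxt. The helper below (a hypothetical name, not part of tritonclient) sketches the conversion between the two spellings:

```python
def config_dtype_to_client_dtype(config_dtype: str) -> str:
    """Convert a config.pbtxt data type (e.g. "TYPE_FP32") to the
    datatype string the client API expects (e.g. "FP32").

    Hypothetical helper for illustration; not part of tritonclient.
    """
    # config.pbtxt names string tensors TYPE_STRING; the client calls them BYTES.
    if config_dtype == "TYPE_STRING":
        return "BYTES"
    prefix = "TYPE_"
    if config_dtype.startswith(prefix):
        return config_dtype[len(prefix):]
    # Already in client form (or unknown): return it unchanged.
    return config_dtype
```

Two further things in the posted code seem worth checking: inputs = [] overwrites the processor output, and set_data_from_numpy expects a NumPy array, so the pixel_values tensor would need converting (for example with .numpy()) instead of passing the PIL image.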
I am using the tritonserver + client igpu release (tritonserver2.41.0-igpu.tar.gz) on a JetPack 6 device. I want to use the Triton client's shared memory functions, which are declared in shm_utils.h and defined in shm_utils.cc. However, the header is not found in the Triton client's include directory, which leads to a compilation error.
After making the following changes to src/c++/library/CMakeLists.txt and building the client from source (Triton client branch r23.12), I was able to include the header and use the shared memory functions.
@@ -84,12 +84,12 @@ if(TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER)
# libgrpcclient object build
set(
REQUEST_SRCS
- grpc_client.cc common.cc
+ grpc_client.cc common.cc shm_utils.cc
)
set(
REQUEST_HDRS
- grpc_client.h common.h ipc.h
+ grpc_client.h common.h ipc.h shm_utils.h
)
add_library(
@@ -257,12 +257,12 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_PERF_ANALYZER)
# libhttpclient object build
set(
REQUEST_SRCS
- http_client.cc common.cc cencode.c
+ http_client.cc common.cc cencode.c shm_utils.cc
)
set(
REQUEST_HDRS
- http_client.h common.h ipc.h cencode.h
+ http_client.h common.h ipc.h cencode.h shm_utils.h
)
add_library(
@@ -394,6 +394,7 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER
FILES
${CMAKE_CURRENT_SOURCE_DIR}/common.h
${CMAKE_CURRENT_SOURCE_DIR}/ipc.h
+ ${CMAKE_CURRENT_SOURCE_DIR}/shm_utils.h
DESTINATION include
)
Are the shared memory cc and header files not included by default, or am I not including them correctly during compilation?
Neither find_package() nor FetchContent work out of the box for a standalone c++ cmake app.
Compile tritonclient manually and set CMAKE_PREFIX_PATH to the install folder. Alternatively install tritonclient globally. In any case, the following minimal cmake fails:
# tritonclient
find_package(TritonCommon REQUIRED)
find_package(TritonClient REQUIRED)
...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:43 (add_executable):
Target "test" links to target "protobuf::libprotobuf" but the target was
not found. Perhaps a find_package() call is missing for an IMPORTED
target, or an ALIAS target is missing?
# tritonclient
FetchContent_Declare(
tritonclient
GIT_REPOSITORY https://github.com/triton-inference-server/client
GIT_TAG r24.04
)
set(TRITON_ENABLE_CC_GRPC ON)
set(TRITON_COMMON_REPO_TAG r24.04)
set(TRITON_THIRD_PARTY_REPO_TAG r24.04)
set(TRITON_CORE_REPO_TAG r24.04)
FetchContent_MakeAvailable(tritonclient)
...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:55 (add_executable):
Target "test" links to target "TritonClient::grpcclient" but the target
was not found. Perhaps a find_package() call is missing for an IMPORTED
target, or an ALIAS target is missing?
What is the recommended way to create a standalone app with tritonclient dependency currently? It seems that for find_package the link targets are broken and for FetchContent there is no support at all because no targets are exported.
In a k8s environment, there are multiple server replicas. We use the Python client, and on the server side we use explicit mode.
Now we do something like
server_url = "triton.{namespace}.svc.cluster.local:8000"
triton_client = InferenceServerClient(url=server_url)
if not triton_client.is_model_ready(model_name):
    triton_client.load_model(model_name)
triton_client.infer(
    model_name,
    model_version=model_version,
    inputs=triton_inputs,
    outputs=triton_outputs,
)
Since server_url is the k8s service endpoint, I guess it relies on the k8s load balancer to choose a random pod. In this case, how do we ensure that is_model_ready, load_model and infer all reach the same pod?
urllib is only used by the http backend, as far as I can tell from the code, but the top-level requirements.txt lists urllib as a dependency. It should be listed only in requirements_http.txt, not in the top-level requirements.txt.
This creates unnecessary conflicts when one wants to depend only on the grpc backend and not the http backend, such as the dependency conflict reported in #648.
It looks like perf_analyzer doesn't work on a MacBook, and pip install tritonclient doesn't include perf_analyzer. Is this true? Is it possible to support it?
cmake is not successful
❯ cmake --version
cmake version 3.21.0
CMake suite maintained and supported by Kitware (kitware.com/cmake).
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_EXAMPLES=ON ..
make cc-clients
...
[ 92%] Performing configure step for 'cc-clients'
loading initial cache file /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/tmp/cc-clients-cache-Release.cmake
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
include could not find requested file:
/home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)
-- RapidJSON found. Headers:
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found Python: /usr/bin/python3.10 (found version "3.10.13") found components: Interpreter
-- Found Protobuf: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/protobuf/bin/protoc-3.19.4.0 (found version "3.19.4.0")
-- Using protobuf 3.19.4.0
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.1.1f")
-- Found c-ares: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/c-ares/lib/cmake/c-ares/c-ares-config.cmake (found version "1.17.2")
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
-- Found Python: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter
-- Using protobuf 3.19.4.0
-- Using protobuf 3.19.4.0
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
include could not find requested file:
/home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
library/CMakeLists.txt:49 (find_package)
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
library/CMakeLists.txt:49 (find_package)
-- Configuring incomplete, errors occurred!
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeOutput.log".
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeError.log".
make[3]: *** [CMakeFiles/cc-clients.dir/build.make:96: cc-clients/src/cc-clients-stamp/cc-clients-configure] Error 1
make[2]: *** [CMakeFiles/Makefile2:119: CMakeFiles/cc-clients.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:126: CMakeFiles/cc-clients.dir/rule] Error 2
make: *** [Makefile:124: cc-clients] Error 2
I have deployed Triton on Jetson Xavier and used Performance Analyzer to measure model performance during inference. Latency and throughput are correctly measured, but when I try to collect metrics using --collect-metrics option, the following message appears:
WARNING: Unable to parse ‘nv_gpu_utilization’ metric.
WARNING: Unable to parse ‘nv_gpu_power_usage’ metric.
WARNING: Unable to parse ‘nv_gpu_memory_used_bytes’ metric.
WARNING: Unable to parse ‘nv_gpu_memory_total_bytes’ metric.
The command I am using to launch the inferences is:
/usr/local/bin/perf_analyzer --collect-metrics -m 3D_fp32_05_batchd -b 1 --concurrency-range 1
Is it possible to solve this problem?
Thanks.
ModelInferResponse is defined as follows:
type ModelInferResponse struct {
	// other ...
	Outputs []*ModelInferResponse_InferOutputTensor `protobuf:"bytes,5,rep,name=outputs,proto3" json:"outputs,omitempty"`
	RawOutputContents [][]byte `protobuf:"bytes,6,rep,name=raw_output_contents,json=rawOutputContents,proto3" json:"raw_output_contents,omitempty"`
}
I call the ModelInfer interface to get the response. How should the contents of the RawOutputContents field be parsed according to the Datatype?
Also, why is Outputs[i].Contents empty?
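A minimal sketch of the decoding, written in Python with only the standard library and under these assumptions: RawOutputContents[i] holds the bytes of Outputs[i], fixed-size types are stored as little-endian values in row-major order, and BYTES tensors store each element as a 4-byte little-endian length followed by that many bytes. As far as I understand the protocol, a tensor's data travels either in the typed Contents field or in the raw field, not both, which would explain why Outputs[i].Contents is empty. The function name decode_raw_tensor is hypothetical.

```python
import struct

# struct format characters for fixed-size datatypes (subset, for illustration)
_FORMATS = {
    "INT8": "b", "INT16": "h", "INT32": "i", "INT64": "q",
    "UINT8": "B", "UINT16": "H", "UINT32": "I", "UINT64": "Q",
    "FP16": "e", "FP32": "f", "FP64": "d",
}


def decode_raw_tensor(raw: bytes, datatype: str):
    """Decode one raw output buffer into a flat Python list."""
    if datatype == "BYTES":
        # Each element: 4-byte little-endian length, then the payload.
        items, offset = [], 0
        while offset < len(raw):
            (length,) = struct.unpack_from("<I", raw, offset)
            offset += 4
            items.append(raw[offset:offset + length])
            offset += length
        return items
    if datatype == "BOOL":
        # One byte per element; any non-zero byte counts as True.
        return [b != 0 for b in raw]
    fmt = _FORMATS[datatype]
    count = len(raw) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), raw))


# Example buffers built by hand (not taken from a real response).
floats = decode_raw_tensor(struct.pack("<3f", 1.0, 2.5, -3.0), "FP32")
strings = decode_raw_tensor(
    struct.pack("<I", 2) + b"hi" + struct.pack("<I", 3) + b"abc", "BYTES"
)
```

The flat list can then be reshaped with Outputs[i].Shape. In Go, the same logic could be written with encoding/binary (binary.LittleEndian) and math.Float32frombits.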