
Comments (20)

szalpal commented on June 9, 2024

Hi @Edwardmark !

Thank you for the extensive description of the problem. I suspect your issue might be connected to the "gpu" backend of the external_source operator in DALI. Currently, GPU input is not yet supported - we are finishing this effort (#53). It's going to be released in tritonserver:21.06.

Should you like to verify that it's indeed about the GPU input, please update your tritonserver to 21.04. With this version, we added a missing error log in DALI Backend (#43).


Edwardmark commented on June 9, 2024

@szalpal I changed the dali_det_post pipeline as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nvidia.dali as dali
import nvidia.dali.fn as fn

pipe = dali.pipeline.Pipeline(batch_size=32, num_threads=8)
with pipe:
    nmsed_boxes = fn.external_source(device='cpu', name="NMSED_BOXES")
    scale_ratio = fn.external_source(device='cpu', name="SCALE_RATIO_INPUT")

    # Rescale the boxes by the smaller of the two resize ratios.
    ratio = fn.reductions.min(scale_ratio)
    nmsed_boxes /= ratio
    pipe.set_outputs(nmsed_boxes)

pipe.serialize(filename="1/model.dali")

But I got the same error:

I0520 02:15:41.783194 133528 ensemble_scheduler.cc:509] Internal response allocation: nmsed_classes, size 400, addr 0x7fb0844b0e00, memory type 2, type id 0
I0520 02:15:41.788463 133528 ensemble_scheduler.cc:524] Internal response release: size 4, addr 0x7fb0844b0200
I0520 02:15:41.788483 133528 ensemble_scheduler.cc:524] Internal response release: size 1600, addr 0x7fb0844b0400
I0520 02:15:41.788489 133528 ensemble_scheduler.cc:524] Internal response release: size 400, addr 0x7fb0844b0c00
I0520 02:15:41.788496 133528 ensemble_scheduler.cc:524] Internal response release: size 400, addr 0x7fb0844b0e00
I0520 02:15:41.788517 133528 infer_request.cc:502] prepared: [0x0x7fadd40015e0] request id: , model: dali_det_post, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fadd40019b8] input: NMSED_BOXES, type: FP32, original shape: [1,100,4], batch + shape: [1,100,4], shape: [100,4]
[0x0x7fadd4001868] input: SCALE_RATIO_INPUT, type: FP32, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7fadd4001868] input: SCALE_RATIO_INPUT, type: FP32, original shape: [1,2], batch + shape: [1,2], shape: [2]
[0x0x7fadd40019b8] input: NMSED_BOXES, type: FP32, original shape: [1,100,4], batch + shape: [1,100,4], shape: [100,4]
original requested outputs:
SCALED_NMSED_BOXES_OUTPUT
requested outputs:
SCALED_NMSED_BOXES_OUTPUT

tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed
> /app/model_repository/ensemble-face_det-ucs/grpc_client.py(182)main()
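One way to narrow this down is to call the dali_det_post model directly, outside the ensemble. Below is a minimal gRPC client sketch, assuming the server listens on localhost:8001 and using dummy inputs that match the shapes in the log above (the output name SCALED_NMSED_BOXES_OUTPUT is taken from the log):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Dummy inputs matching the shapes reported in the log: [1,100,4] and [1,2].
boxes = grpcclient.InferInput("NMSED_BOXES", [1, 100, 4], "FP32")
boxes.set_data_from_numpy(np.random.rand(1, 100, 4).astype(np.float32))
ratio = grpcclient.InferInput("SCALE_RATIO_INPUT", [1, 2], "FP32")
ratio.set_data_from_numpy(np.array([[0.5, 0.5]], dtype=np.float32))

output = grpcclient.InferRequestedOutput("SCALED_NMSED_BOXES_OUTPUT")
result = client.infer(model_name="dali_det_post",
                      inputs=[boxes, ratio], outputs=[output])
print(result.as_numpy("SCALED_NMSED_BOXES_OUTPUT").shape)  # expect (1, 100, 4)

If this standalone call also crashes, the problem lies in the DALI model itself rather than in the ensemble wiring.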

In addition, my first preprocessing model is defined as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os

import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

pipe = dali.pipeline.Pipeline(batch_size=32, num_threads=8)
with pipe:
    expect_output_size = (640., 640.)
    images = fn.external_source(device='cpu', name="IMAGE_RAW")
    images = fn.image_decoder(images, device="mixed", output_type=types.RGB)
    raw_shapes = fn.shapes(images, dtype=types.INT32)
    # Resize so the larger side fits into 640x640, keeping the aspect ratio.
    images = fn.resize(
        images,
        mode='not_larger',
        size=expect_output_size,
    )
    resized_shapes = fn.shapes(images, dtype=types.INT32)
    # Per-axis resize ratio; keep only the H and W entries.
    ratio = fn.slice(resized_shapes / raw_shapes, 0, 2, axes=[0])
    images = fn.crop_mirror_normalize(images, mean=[0.], std=[255.],
                                      output_layout='CHW')
    # Pad H and W up to a multiple of 640, i.e. to the full 640x640 canvas.
    images = fn.pad(images, axis_names="HW", align=expect_output_size)
    pipe.set_outputs(images, ratio)

os.system('rm -rf 1 && mkdir -p 1')
pipe.serialize(filename="1/model.dali")
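As a side note, external_source pipelines like this one can be sanity-checked locally, outside Triton, by feeding the inputs by hand. A minimal sketch for checking just the decoding step, assuming a GPU is available (device_id=0) and that "test.jpg" is an encoded image on disk; the input name matches the external_source above:

import numpy as np
import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

test_pipe = dali.pipeline.Pipeline(batch_size=1, num_threads=2, device_id=0)
with test_pipe:
    raw = fn.external_source(device='cpu', name="IMAGE_RAW")
    decoded = fn.image_decoder(raw, device="mixed", output_type=types.RGB)
    test_pipe.set_outputs(decoded)
test_pipe.build()

# Feed one encoded image as raw bytes and run a single iteration.
test_pipe.feed_input("IMAGE_RAW", [np.fromfile("test.jpg", dtype=np.uint8)])
images, = test_pipe.run()
print(images.as_cpu().at(0).shape)  # decoded HWC image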

Any advice on how to make it work, please? Thanks. @szalpal


Edwardmark commented on June 9, 2024

@szalpal I changed the version to 21.04 and changed all inputs to cpu, but still no error log is shown, and I get the same log as below. What is your advice? Thanks.
The output is the same as with 21.03.

I0520 02:58:13.877026 1181 plan_backend.cc:2447] Running face_det-ucs_0_gpu0 with 1 requests
I0520 02:58:13.877071 1181 plan_backend.cc:3378] Optimization profile default [0] is selected for face_det-ucs_0_gpu0
I0520 02:58:13.877337 1181 plan_backend.cc:2869] Context with profile default [0] is being executed for face_det-ucs_0_gpu0
I0520 02:58:14.543531 1181 infer_response.cc:139] add response output: output: num_detections, type: INT32, shape: [1,1]
I0520 02:58:14.543578 1181 ensemble_scheduler.cc:509] Internal response allocation: num_detections, size 4, addr 0x7f7bf04b0200, memory type 2, type id 0
I0520 02:58:14.543609 1181 infer_response.cc:139] add response output: output: nmsed_boxes, type: FP32, shape: [1,100,4]
I0520 02:58:14.543621 1181 ensemble_scheduler.cc:509] Internal response allocation: nmsed_boxes, size 1600, addr 0x7f7bf04b0400, memory type 2, type id 0
I0520 02:58:14.543642 1181 infer_response.cc:139] add response output: output: nmsed_scores, type: FP32, shape: [1,100]
I0520 02:58:14.543653 1181 ensemble_scheduler.cc:509] Internal response allocation: nmsed_scores, size 400, addr 0x7f7bf04b0c00, memory type 2, type id 0
I0520 02:58:14.543672 1181 infer_response.cc:139] add response output: output: nmsed_classes, type: FP32, shape: [1,100]
I0520 02:58:14.543683 1181 ensemble_scheduler.cc:509] Internal response allocation: nmsed_classes, size 400, addr 0x7f7bf04b0e00, memory type 2, type id 0
I0520 02:58:14.544713 1181 ensemble_scheduler.cc:524] Internal response release: size 4, addr 0x7f7bf04b0200
I0520 02:58:14.544741 1181 ensemble_scheduler.cc:524] Internal response release: size 1600, addr 0x7f7bf04b0400
I0520 02:58:14.544749 1181 ensemble_scheduler.cc:524] Internal response release: size 400, addr 0x7f7bf04b0c00
I0520 02:58:14.544764 1181 ensemble_scheduler.cc:524] Internal response release: size 400, addr 0x7f7bf04b0e00
I0520 02:58:14.544789 1181 infer_request.cc:497] prepared: [0x0x7f79300016b0] request id: , model: dali_det_post, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f7930001a88] input: NMSED_BOXES, type: FP32, original shape: [1,100,4], batch + shape: [1,100,4], shape: [100,4]
[0x0x7f7930001938] input: SCALE_RATIO_INPUT, type: FP32, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f7930001938] input: SCALE_RATIO_INPUT, type: FP32, original shape: [1,2], batch + shape: [1,2], shape: [2]
[0x0x7f7930001a88] input: NMSED_BOXES, type: FP32, original shape: [1,100,4], batch + shape: [1,100,4], shape: [100,4]
original requested outputs:
SCALED_NMSED_BOXES_OUTPUT
requested outputs:
SCALED_NMSED_BOXES_OUTPUT

tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed
> /app/model_repository_2104/ensemble-face_det-ucs/grpc_client.py(182)main()


szalpal commented on June 9, 2024

@Edwardmark

It's possible that, even though you changed the ExternalSource to "cpu", the bug still prevents normal processing. Anyhow, we've just merged the GPU input feature upstream. It's going to be released in tritonserver:21.06; however, it's very easy to run the upstream dali_backend with the latest tritonserver release.

Could you try it out and verify whether the GPU input solves your problem, or whether we need to dig deeper? The instructions for building the dali_backend docker image are here: Docker build


Edwardmark commented on June 9, 2024

@szalpal It works, thank you very much.


Edwardmark commented on June 9, 2024

@szalpal How can I build the docker image without downloading the git repositories? I mean, if I download the related git repos beforehand, what changes should I make to the CMakeLists in dali_backend? When building the docker image, I get the following errors, which look like a network problem:

Step 12/19 : RUN mkdir build_in_ci && cd build_in_ci &&
    cmake -D CMAKE_INSTALL_PREFIX=/opt/tritonserver
          -D CMAKE_BUILD_TYPE=Release
          -D TRITON_COMMON_REPO_TAG="r$TRITON_VERSION"
          -D TRITON_CORE_REPO_TAG="r$TRITON_VERSION"
          -D TRITON_BACKEND_REPO_TAG="r$TRITON_VERSION"
          .. &&
    make -j"$(grep ^processor /proc/cpuinfo | wc -l)" install
 ---> Running in e11becb3e19f
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build configuration: Release
-- RapidJSON found. Headers: /usr/include
-- RapidJSON found. Headers: /usr/include
Scanning dependencies of target repo-core-populate
[ 11%] Creating directories for 'repo-core-populate'
[ 22%] Performing download step (git clone) for 'repo-core-populate'
Cloning into 'repo-core-src'...
Switched to a new branch 'r21.05'
Branch 'r21.05' set up to track remote branch 'r21.05' from 'origin'.
[ 33%] No patch step for 'repo-core-populate'
[ 44%] Performing update step for 'repo-core-populate'
fatal: unable to access 'https://github.com/triton-inference-server/core.git/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.
CMake Error at /dali/build_in_ci/_deps/repo-core-subbuild/repo-core-populate-prefix/tmp/repo-core-populate-gitupdate.cmake:55 (message):
  Failed to fetch repository
  'https://github.com/triton-inference-server/core.git'


make[2]: *** [CMakeFiles/repo-core-populate.dir/build.make:117: repo-core-populate-prefix/src/repo-core-populate-stamp/repo-core-populate-update] Error 1
make[1]: *** [CMakeFiles/Makefile2:96: CMakeFiles/repo-core-populate.dir/all] Error 2
make: *** [Makefile:104: all] Error 2

CMake Error at /usr/local/share/cmake-3.17/Modules/FetchContent.cmake:912 (message):
  Build step for repo-core failed: 2
Call Stack (most recent call first):
  /usr/local/share/cmake-3.17/Modules/FetchContent.cmake:1003 (__FetchContent_directPopulate)
  /usr/local/share/cmake-3.17/Modules/FetchContent.cmake:1044 (FetchContent_Populate)
  CMakeLists.txt:72 (FetchContent_MakeAvailable)


szalpal commented on June 9, 2024

@Edwardmark ,

As far as I know, cloning the git repos is unfortunately inherent to building backends in Triton. Is there a particular reason you would like to clone the repos beforehand? If you want to use the latest tritonserver version (21.05), I merged the PR that enables that today (#68), so you can clone the upstream dali_backend.


Edwardmark commented on June 9, 2024

@szalpal Because the network is not always reliable, I want to clone the repos beforehand and then use the local copies to make the build process quicker.


szalpal commented on June 9, 2024

@Edwardmark ,

I see. It would be possible to tweak the root CMakeLists.txt file to achieve what you want. However, it is not in our scope right now (and I doubt it ever will be), so we will not implement it; you would need to try it yourself.

IMPORTANT: this is a rough explanation of a workaround, and we certainly do not support, nor plan to support, this way of building in the foreseeable future. We also highly discourage changing the build procedure like this for production environments.

The point is that there are three repos that need to be acquired to properly build any backend: core, common and backend. Our build procedure acquires them with these three declarations:

FetchContent_Declare(
  repo-common
  GIT_REPOSITORY https://github.com/triton-inference-server/common.git
  GIT_TAG ${TRITON_COMMON_REPO_TAG}
  GIT_SHALLOW ON
)
FetchContent_Declare(
  repo-core
  GIT_REPOSITORY https://github.com/triton-inference-server/core.git
  GIT_TAG ${TRITON_CORE_REPO_TAG}
  GIT_SHALLOW ON
)
FetchContent_Declare(
  repo-backend
  GIT_REPOSITORY https://github.com/triton-inference-server/backend.git
  GIT_TAG ${TRITON_BACKEND_REPO_TAG}
  GIT_SHALLOW ON
)

Should you like to acquire them from your disk instead, first clone all three repos you need; then you can switch from fetching content from a git repository to fetching content from a disk location by changing the GIT_TAG, GIT_SHALLOW and GIT_REPOSITORY options. The documentation of the FetchContent functions might be helpful here:
https://cmake.org/cmake/help/latest/module/FetchContent.html
https://cmake.org/cmake/help/latest/module/ExternalProject.html#command:externalproject_add
Pay particular attention to the Directory Options of the ExternalProject_Add directive.


Edwardmark commented on June 9, 2024

@szalpal Thank you very much.


Edwardmark commented on June 9, 2024

@szalpal Could you please give me more hints on how to change the GIT_TAG, GIT_SHALLOW and GIT_REPOSITORY options? Thanks. I changed the lines as follows:


FetchContent_Declare(
  repo-common
  SOURCE_DIR /dali/common/
)
FetchContent_Declare(
  repo-core
  SOURCE_DIR /dali/core/
)
FetchContent_Declare(
  repo-backend
  SOURCE_DIR /dali/backend/
)
FetchContent_MakeAvailable(repo-common repo-core repo-backend)

Is that right?
The directories /dali/common/, /dali/core/ and /dali/backend/ were obtained by:

 git clone https://github.com/triton-inference-server/common.git
 git clone https://github.com/triton-inference-server/core.git
 git clone https://github.com/triton-inference-server/backend.git

I built the docker image successfully.


szalpal commented on June 9, 2024

@Edwardmark ,

what is the problem you are facing?


Edwardmark commented on June 9, 2024

@szalpal I just want to make sure that the way I tried (above) is the correct way to replace the git repos with local ones.
The docker build process succeeds, but when I run the server, it crashes:

I0617 08:17:59.462826 81 dali_backend.cc:269] Triton TRITONBACKEND API version: 1.0
I0617 08:17:59.462836 81 dali_backend.cc:273] 'dali' TRITONBACKEND API version: 1.4
 Segmentation fault (core dumped)

How should I deal with that?


szalpal commented on June 9, 2024

@Edwardmark ,

As I mentioned above, we do not support, nor plan to support, this kind of build procedure. Therefore I unfortunately won't be able to answer all the questions, simply because I haven't tried or tested it.

The error you're facing occurs because the server verifies the API version the backend was built with. Be sure to use the proper version of the backend.git repo, i.e. one with the following defines:

#define TRITONBACKEND_API_VERSION_MAJOR 1
#define TRITONBACKEND_API_VERSION_MINOR 0


Edwardmark commented on June 9, 2024

> As I mentioned above, we do not support, nor plan to support, this kind of build procedure. Therefore I unfortunately won't be able to answer all the questions, simply because I haven't tried or tested it.
>
> The error you're facing occurs because the server verifies the API version the backend was built with. Be sure to use the proper version of the backend.git repo, i.e. one with the following defines:
>
> #define TRITONBACKEND_API_VERSION_MAJOR 1
> #define TRITONBACKEND_API_VERSION_MINOR 0

I checked out the r21.05 branch, and the problem is solved. Thank you very much, @szalpal!


Edwardmark commented on June 9, 2024

@szalpal Do I have to install nvidia-dali-nightly?
https://github.com/triton-inference-server/dali_backend/blob/main/docker/Dockerfile.release#L65
Thanks.


Edwardmark commented on June 9, 2024

@szalpal Thanks.


szalpal commented on June 9, 2024

> @szalpal Do I have to install nvidia-dali-nightly?
> https://github.com/triton-inference-server/dali_backend/blob/main/docker/Dockerfile.release#L65
> Thanks.

Not necessarily. We recommend using the latest DALI release.
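For reference, a quick way to check which DALI version is installed in the container, assuming a Python environment with DALI available:

import nvidia.dali
print(nvidia.dali.__version__)  # e.g. "1.2.0"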


Edwardmark commented on June 9, 2024

> @Edwardmark
>
> It's possible that, even though you changed the ExternalSource to "cpu", the bug still prevents normal processing. Anyhow, we've just merged the GPU input feature upstream. It's going to be released in tritonserver:21.06; however, it's very easy to run the upstream dali_backend with the latest tritonserver release.
>
> Could you try it out and verify whether the GPU input solves your problem, or whether we need to dig deeper? The instructions for building the dali_backend docker image are here: Docker build

If I use DALI 1.2, will the dali_backend support GPU input?


szalpal commented on June 9, 2024

> If I use DALI 1.2, will the dali_backend support GPU input?

@Edwardmark, yes, although we don't guarantee backwards compatibility. Only the latest DALI version is properly tested and maintained.

