
Comments (11)

philipp-schmidt commented on May 18, 2024

Triton does support dynamic input shapes, and so does TensorRT.

In fact, by changing #define BATCH_SIZE 1 in main.cpp, you can specify the maximum batch size. Triton will then allow you to send anything up to this batch size to be computed.

E.g. if you set this to 32, you can send up to 32 images at once, but sending just a single one is still perfectly fine. So you can decide whether you want to do the batching logic on the server side or the client side.

Maybe you can test that out with perf_client, as described here. perf_client allows you to set the batch size via the -b flag (if I'm not mistaken, check the docs).
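
As a rough illustration, sending a batch from Python with the tritonclient package could look like the sketch below; the model name "yolov4", the tensor names, and the 608x608 input shape are assumptions here, so check the deployed model's config.pbtxt and the repo's client code for the real values.

```python
import numpy as np
import tritonclient.http as httpclient

# Assumed model/tensor names and input shape -- adjust to your deployment.
client = httpclient.InferenceServerClient(url="localhost:8000")
batch = np.zeros((8, 3, 608, 608), dtype=np.float32)  # 8 preprocessed images

inp = httpclient.InferInput("input", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

# Any batch size up to the model's configured max_batch_size is accepted.
result = client.infer(model_name="yolov4", inputs=[inp])
detections = result.as_numpy("detections")  # assumed output tensor name
```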

philipp-schmidt commented on May 18, 2024

Yes, batch size is only limited by GPU memory. And TensorRT also optimizes your network differently if you have a higher batch size (some optimization options are only selected with higher batch sizes).
At some point you won't be able to increase the max batch size any further and your code will crash with an OOM error.
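
For context, this is roughly how the batch-size limit feeds into TensorRT's tuning when building an engine with the Python API and an explicit-batch optimization profile; the repo itself builds its engine in C++, so treat this only as a hedged sketch, with the input name and 608x608 shape as assumptions.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# min/opt/max shapes tell TensorRT which batch sizes to tune kernels for;
# a larger max batch size also needs more GPU memory at runtime.
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  min=(1, 3, 608, 608),
                  opt=(8, 3, 608, 608),
                  max=(32, 3, 608, 608))
config.add_optimization_profile(profile)
```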

philipp-schmidt commented on May 18, 2024

If your only goal is to maximize throughput (and effective latency and the number of concurrent applications are not an issue), then your concurrency (number of scripts, etc.) should be at least the number of GPUs you have (maybe plus a few extra to benefit from overlapping memory transfers), and you can find out your optimal batch size through testing. 8 or 16 is probably what you will go for; anything higher will give you no additional performance boost for such a big network. 32 or 64 might be viable options for smaller networks like yolov4-tiny.
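
perf_client is the proper tool for this kind of testing, but as a quick sketch you could also sweep batch sizes from a short script; the model and tensor names below are again assumptions.

```python
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

for batch_size in (1, 2, 4, 8, 16, 32):
    batch = np.zeros((batch_size, 3, 608, 608), dtype=np.float32)
    inp = httpclient.InferInput("input", list(batch.shape), "FP32")
    inp.set_data_from_numpy(batch)

    client.infer("yolov4", [inp])  # warm-up request
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        client.infer("yolov4", [inp])
    elapsed = time.perf_counter() - start
    print(f"batch {batch_size}: {runs * batch_size / elapsed:.1f} images/s")
```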

For real-time applications, batch size 1 will give you the best latency, and latency will get worse with every additional client until you can't meet your real-time targets anymore.

If the repo has helped you so far, make sure to star it so others have an easier time finding it.

philipp-schmidt commented on May 18, 2024

You will be running one completely independent Triton server per node. Then you run a fixed number of scripts on every node (e.g. 4 for 2 GPUs and 8 for 4 GPUs), each of which handles a single folder at a time. Batch size is fixed; try 8 and 16 and decide on one. Image resolution will be rescaled to 608x608 because that's what YOLOv4 takes by default (you can choose a different version, but the pretrained ones are only available with smaller inputs).

There really is no need to load balance in any sense; you just give the system enough to do to keep it busy. By the way, Triton will handle and manage your GPU memory: as long as one full batch fits into GPU memory, Triton will handle copying your data from RAM to GPU, so there is no OOM no matter how many clients use the server at once. With this setup, as long as your scripts can fetch data at a fast enough pace, the GPUs will always be at 100%.
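
A minimal sketch of such a per-folder worker, assuming the model is served as "yolov4" with an "input" tensor, a "detections" output, and 608x608 preprocessing (all names, the preprocessing, and the folder layout are assumptions, not the repo's actual client code):

```python
import sys
from pathlib import Path

import cv2
import numpy as np
import tritonclient.http as httpclient

BATCH_SIZE = 8
client = httpclient.InferenceServerClient(url="localhost:8000")

def preprocess(path: Path) -> np.ndarray:
    img = cv2.imread(str(path))
    img = cv2.resize(img, (608, 608)).astype(np.float32) / 255.0
    return img.transpose(2, 0, 1)  # HWC -> CHW

def infer_folder(folder: Path):
    images = sorted(folder.glob("*.jpg"))
    for i in range(0, len(images), BATCH_SIZE):
        chunk = images[i:i + BATCH_SIZE]
        batch = np.stack([preprocess(p) for p in chunk])
        inp = httpclient.InferInput("input", list(batch.shape), "FP32")
        inp.set_data_from_numpy(batch)
        result = client.infer("yolov4", [inp])
        yield chunk, result.as_numpy("detections")

if __name__ == "__main__":
    # Each script instance gets one "main folder" and works through it.
    for folder in sorted(Path(sys.argv[1]).iterdir()):
        if not folder.is_dir():
            continue
        for files, detections in infer_folder(folder):
            pass  # per-folder post-processing goes here
```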

ontheway16 commented on May 18, 2024

Thank you for all the clarifications.

ontheway16 commented on May 18, 2024

@philipp-schmidt So the batch size is actually a GPU-memory-limited parameter? Or, in other words, should it be chosen so that every batch fills up the GPU memory effectively? (Especially in the case of the same model and a fixed image input size.)

philipp-schmidt commented on May 18, 2024

@ontheway16 I have added benchmark results for batch sizes 8 and 4 because I was curious whether they would give another increase in performance, and they did.

ontheway16 commented on May 18, 2024

@philipp-schmidt Great test, exactly what I need, thanks. I was looking to assess what would happen in the case of a multiple-scripts and multi-GPU combo. So can we conclude that it's better to limit the number of concurrent inference scripts to the number of GPUs, while keeping the batch size around 8?

ontheway16 commented on May 18, 2024

@philipp-schmidt Sure, starred, thanks for all the help. I have some other questions about batch processing and related topics with the YOLOv4 & Triton combo. I believe many other people have these questions in mind and the answers will surely help others, so thank you for your kind help. This question is about the sequence batcher.

I have four physical nodes (PCs) that I want to use for inference serving (my application is not real-time; top-level accuracy is my priority). One PC has 4 GPUs and the others have 2 each. All GPUs are identical. I am seeking the optimal setup for these.

My plan is this: the image files waiting to be inferenced sit in their own local folders, about 50-80 images per folder, and their results will be final-evaluated per folder. The backend Python script will send the images of one folder at a time for inference and, after getting the results, further process them per folder, then move on to another folder (there are thousands of folders).
According to your test results and the opinions above, my plan is to split these folders across the number of GPUs (2 or 4 main folders) and run concurrent inference scripts, one on each main folder.
The question is: would using the sequence batcher be a better approach here? Or should I let the script infer the images in the subfolders in a mixed fashion, put the results into a DB table, and let the backend find the ones belonging to a specific folder and apply the further processing per folder? There are too many options (direct, oldest, etc.) and I have yet to figure out what performs best in terms of load balancing. Your opinions are very valuable here, thanks again.

philipp-schmidt commented on May 18, 2024

Are your images available on every node or only on a single one? Is the task of loading those images and pre-/post-processing them heavily CPU or disk-IO bound?

The question is: would using the sequence batcher be a better approach here?

Most likely not. The sequence batcher is useful if you have an application that relies on batch size 1 and can't be altered to use larger batch sizes (e.g. realtime video analysis), but you want to leverage performance gains on the server side by batching multiple requests from many concurrent clients into larger and more efficient batch sizes. It's again a throughput and latency tradeoff. In your case you can easily create sufficiently large batch sizes on the client side by simply loading enough images before running inference, which also keeps your code logic simple.

I have yet to figure out what performs best in terms of load balancing

If I understand the setup correctly, running one script per folder, letting it finish the full folder, and then having multiple scripts doing that concurrently should be sufficient for your task, or not?

ontheway16 commented on May 18, 2024

Are your images available on every node or only on a single one? Is the task of loading those images and pre-/post-processing them heavily CPU or disk-IO bound?

My images require no preprocessing (image-processing wise), and they will be sitting in the local folders of each PC (no duplicates, unique images per PC). Post-processing will be some simple math operations on the OD results and recording the findings to a cloud database.

Most likely not. The sequence batcher is useful if you have an application that relies on batch size 1 and can't be altered to use larger batch sizes (e.g. realtime video analysis), but you want to leverage performance gains on the server side by batching multiple requests from many concurrent clients into larger and more efficient batch sizes. It's again a throughput and latency tradeoff. In your case you can easily create sufficiently large batch sizes on the client side by simply loading enough images before running inference, which also keeps your code logic simple.

Now I understand the logic behind the sequence batcher; yes, in that case the highest batch size possible will be the solution.

If I understand the setup correctly, running one script per folder, letting it finish the full folder, and then having multiple scripts doing that concurrently should be sufficient for your task, or not?

After thinking for a while, the ultimate goal should be load balancing within each multi-GPU PC. Image sizes will be fixed and the same for all images all the time (probably 1216x1216, or whatever gives the highest accuracy), and there will be a constant number of images per folder. So if the folder separation of results is solved on the client side (i.e., using image names from the inference output), there shouldn't be any problems.

In that case, I guess it will be enough to run a single script per PC, let the Triton server handle the concurrent inferences (aka 'instances', correct?), and feed all available GPUs with the highest number of images possible 'before getting OOM'.

I couldn't find it now, but I formerly saw somewhere that there are -b and -c parameters. If I am not wrong, setting '-b 8 -c 4' for a 4-GPU node will mostly solve my load-balancing needs, I assume? Since the whole aim is consuming the GPUs effectively.
