Deploying Diverse AI Model Categories from Public Model Zoo Using NVIDIA Triton Inference Server

Nowadays, a huge number of implementations of state-of-the-art (SOTA) models and modeling solutions are available for different frameworks like TensorFlow, ONNX, PyTorch, Keras, MXNet, and so on. These models can be used for out-of-the-box inference if you are interested in categories already in the datasets, or they can be embedded in custom business scenarios with minor fine-tuning.

This post gives you an overview of prevalent DL model categories and walks you through end-to-end examples of deploying these models using NVIDIA Triton Inference Server. The client applications can be used as is or modified to fit your use case. I walk you through the deployment of public image classification, object detection, and image segmentation models using Triton Inference Server. The steps outlined in this post can also be applied to other open-source models with minor changes.

Deep learning inference challenges

Recent years have seen remarkable advancements in deep learning (DL). By solving complex problems that had hampered the AI community for years, DL has reshaped the field. It is now used in a rapidly growing range of applications across industries, from healthcare and aerospace engineering to autonomous driving and user authentication.

Deep learning, however, has various challenges when it comes to inference:

  • Support of multiple frameworks
  • Ease of use
  • Cost of deployment

Support of multiple frameworks

The first key challenge is supporting multiple model frameworks.

Developers and data scientists today are using various frameworks for their production models. For instance, there can be difficulties modifying the system for testing and deployment if a machine learning project is written in Keras, but a team member has more experience with TensorFlow. 

Also, converting the models can be expensive and complicated, especially if new data is required for their training. Teams therefore need a serving application that can support each of those models.

Ease of use

The next key challenge is to have a serving application that can support different inference queries and use cases. 

In some applications, you’re focused on real-time online inferencing where the priority is to minimize latency as much as possible. On the other hand, there might be use cases that require you to do offline batch inferencing where you’re focused on maximizing throughput. 

It’s essential to have solutions that can support each type of query and use case and optimize for them.

Cost of deployment

The next challenge is managing the cost of deployment and lowering the cost of inference. 

A key part of this is having one serving application that can run on a mixed infrastructure. Otherwise, you might end up creating one serving solution for CPU, another for GPU, and different ones for deploying in the cloud, in the data center, and at the edge. That drives up costs and leads to an implementation that does not scale.

Triton Inference Server

Triton Inference Server is an open-source inference serving application that runs inference on both CPU and GPU in different environments. It supports various backends, including TensorRT, PyTorch, TensorFlow, ONNX, and Python. To maximize hardware utilization, NVIDIA Triton allows concurrent execution of different models. In addition, dynamic batching groups inference queries together to maximize throughput for different types of queries. For more information, see NVIDIA Triton Inference Server.
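
To make the idea of dynamic batching more concrete, the following toy sketch groups queued requests into batches of a bounded size. It only illustrates the grouping idea and is not how Triton's scheduler is implemented; the function name form_batches and its parameters are hypothetical.

```python
from collections import deque


def form_batches(pending, max_batch_size):
    """Group queued requests into batches of at most max_batch_size items."""
    queue = deque(pending)
    batches = []
    while queue:
        batch = []
        # Keep pulling requests until the batch is full or the queue is empty.
        while queue and len(batch) < max_batch_size:
            batch.append(queue.popleft())
        batches.append(batch)
    return batches


# Five individual requests arrive; they are served as three batches.
requests = ["req0", "req1", "req2", "req3", "req4"]
for batch in form_batches(requests, max_batch_size=2):
    print(batch)
```

Larger batches generally favor throughput, while smaller ones favor latency, which is the trade-off described under ease of use.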

Figure 2. Triton Inference Server architecture

Quickstart with NVIDIA Triton

The easiest way to install and run NVIDIA Triton is to use the pre-built Docker image available from NGC.

Server: Pull the Docker image

Pull the image using the following command:

$ docker pull

NVIDIA Triton is optimized to provide the best inferencing performance by using GPUs, but it can also work on CPU-only systems. In both cases, you can use the same Docker image.

Use the following command to run NVIDIA Triton with the example model repository that you just created:

docker run --gpus=1 --rm --net=host -v /path/to/the/repo/server/models:/models tritonserver --model-repository=/models --exit-on-error=false --repository-poll-secs=10 --model-control-mode="poll"
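
The directory mounted at /models is expected to follow the Triton model repository layout: one directory per model containing a config.pbtxt file and a numbered version subdirectory that holds the model file. The following sketch creates that layout in a temporary directory for the three models used in this post. The helper name create_model_dir is hypothetical, and the temporary directory stands in for /path/to/the/repo/server/models.

```python
import tempfile
from pathlib import Path


def create_model_dir(repo_root, model_name, model_filename, version="1"):
    """Create <repo>/<model>/<version>/ and return where the files belong."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    return {
        "config": model_dir / "config.pbtxt",     # model configuration file
        "model": version_dir / model_filename,    # serialized model
    }


# Model names and file names as used later in this post.
models = {
    "classification": "model.onnx",
    "detection": "model.graphdef",
    "segmentation": "model.onnx",
}

with tempfile.TemporaryDirectory() as repo_root:
    for name, filename in models.items():
        paths = create_model_dir(repo_root, name, filename)
        print(paths["config"].relative_to(repo_root),
              paths["model"].relative_to(repo_root))
```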

Client: Get the client libraries

Use docker pull to get the client libraries.

$ docker pull

In this command, the tag after the image name is the version to pull.

To start the client, run the following command:

$ docker run -it --rm --net=host /path/to/the/repo/client/:/python_examples

End-to-end model deployment

The NVIDIA Triton project provides several client libraries in C++ and Python to simplify communication. These APIs make communicating with NVIDIA Triton easy. With the help of these APIs, the client applications process the input and communicate with NVIDIA Triton to perform inferencing.

Figure 3. Workflow of client application interaction with Triton Inference Server

In general, the interaction of client applications with NVIDIA Triton can be summarized as follows:

  • Input
  • Preprocess
  • Inference
  • Postprocess
  • Output

Input: Depending upon the application type, one or more inputs are read to be inferred by the neural network.

Preprocess: Preprocessing data is a common first step in the deep learning workflow to prepare raw data in a format the network can accept. Examples include image resizing, normalization, and noise removal from input data.

Inference: For the inference part, a client initially serializes the inference request into a message and sends it to Triton Inference Server. The message travels over the network from the client to the server and gets deserialized. The request is placed on the queue. The request is removed from the queue and computed. The completed request is serialized in a message and sent back to the client. The message travels over the network from the server to the client. The message arrives at the client and is deserialized.

Postprocess: When the message arrives at the client application, it is processed as a completed inference request. Depending upon the network type and application use case, post-processing is applied. For example, in object detection, postprocessing involves suppressing the superfluous boxes, aiding in selecting the best possible boxes, and mapping them back to the input image.

Output: After inference and processing, depending upon the application, the output can be stored, displayed, or passed to the network.
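
The five steps can be sketched as a small pipeline. In this sketch, the inference step is replaced by a local stand-in function so that the example runs without a server; fake_infer and the three-label list are hypothetical and only illustrate how the stages hand data to each other.

```python
import numpy as np


def preprocess(image_hwc):
    """Scale uint8 pixels to [-1, 1] and reorder HWC to CHW."""
    scaled = image_hwc.astype(np.float32) / 127.5 - 1.0
    return np.transpose(scaled, (2, 0, 1))


def fake_infer(batch):
    """Stand-in for the server call (hypothetical).

    A real client would serialize `batch`, send it to Triton, and
    deserialize the response. Here the mean of each channel is used
    as a dummy three-class score vector.
    """
    return batch.mean(axis=(2, 3))


def postprocess(scores, labels):
    """Pick the highest-scoring label for each item in the batch."""
    return [labels[i] for i in scores.argmax(axis=1)]


# Input: a dummy 4x4 RGB image in which the green channel is brightest.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[..., 1] = 255

batch = preprocess(image)[np.newaxis, ...]   # Preprocess, then add batch dim
scores = fake_infer(batch)                   # Inference (stubbed)
result = postprocess(scores, ["red", "green", "blue"])  # Postprocess
print(result)                                # Output
```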

Image classification

Image classification is the task of comprehending an entire image and assigning it a label. Typically, the image contains a single object, which is analyzed and classified. For more information, see image classification.

Server: Download the model

Download the ResNet-18 image classification model from the ONNX model zoo:

$ cd /path/to/the/repo/server/models/classification/1
$ wget && mv resnet18-v1-7.onnx model.onnx

The following code example shows the model configuration file:

name: "classification"
platform: "onnxruntime_onnx"
max_batch_size : 1
input [
  {
    name: "data"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 3, 224, 224 ] }
  }
]
output [
  {
    name: "resnetv15_dense0_fwd"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    reshape { shape: [ 1000 ] }
    label_filename: "labels.txt"
  }
]

Name, platform, and backend

The name property is optional. If the name of the model is not specified in the configuration, it is assumed to be the same as the model repository directory containing the model. The model is executed by the NVIDIA Triton backend, which is simply a wrapper around the DL frameworks like TensorFlow, PyTorch, TensorRT, and so on. For more information, see backend.

Maximum batch size

The maximum batch size that a model can support is indicated by the max_batch_size property. A value of zero indicates that batching is not supported. For more information, see batch size.
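
When max_batch_size is greater than zero, the dims in the configuration describe a single item and Triton adds a leading batch dimension to the tensor shape. When it is zero, dims is the full shape. The following sketch computes the shape a client would send under that rule; the helper name full_input_shape is hypothetical.

```python
def full_input_shape(dims, max_batch_size, batch=1):
    """Return the tensor shape a client sends for the given configuration."""
    if max_batch_size == 0:
        # Batching is not supported: dims already is the full shape.
        return list(dims)
    if not 1 <= batch <= max_batch_size:
        raise ValueError(
            f"batch must be between 1 and {max_batch_size}, got {batch}"
        )
    # Batching is supported: prepend the batch dimension.
    return [batch] + list(dims)


# dims and max_batch_size as in the classification configuration above.
print(full_input_shape([3, 224, 224], max_batch_size=1))  # [1, 3, 224, 224]
# A model with max_batch_size 0 keeps its dims unchanged.
print(full_input_shape([3, -1, -1], max_batch_size=0))    # [3, -1, -1]
```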

Inputs and outputs

For each model, the expected input, output, and data types must be specified in the model configuration file. Based on the input and output tensors, different data types are allowed. For more information, see Datatypes. 
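
On the client side, the array sent to the server has to match the configured data type. The following sketch maps the two data types used in this post to their NumPy equivalents and checks an array before sending; the dictionary and the helper name check_dtype are hypothetical.

```python
import numpy as np

# Configuration data types used in this post and their NumPy counterparts.
CONFIG_TO_NUMPY = {
    "TYPE_FP32": np.float32,
    "TYPE_UINT8": np.uint8,
}


def check_dtype(array, config_type):
    """Raise TypeError if the array dtype does not match the config type."""
    expected = np.dtype(CONFIG_TO_NUMPY[config_type])
    if array.dtype != expected:
        raise TypeError(f"expected {expected}, got {array.dtype}")
    return array


# The classification input is TYPE_FP32, so a float32 array passes.
check_dtype(np.zeros((3, 224, 224), dtype=np.float32), "TYPE_FP32")
print("dtype check passed")
```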

The image classification model accepts a single input, and after the inference returns a single output.

In a separate console, launch the image_client example from the NGC NVIDIA Triton container.

Client: Run the image classification client

To run the image classification client, use the following command:

$ python3 /python_examples/examples/classification/ -m classification -s INCEPTION /python_examples/examples/images/tabby.jpg

First, inputs are preprocessed according to the model. For this model, inception scaling is applied, which scales the input as follows:

if scaling == 'INCEPTION':
    scaled = (typed / 127.5) - 1
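
As a quick check of what this scaling does, the following sketch applies the same formula to a few pixel values. It shows that the uint8 range [0, 255] is mapped to [-1, 1]; the function name inception_scale is hypothetical.

```python
import numpy as np


def inception_scale(pixels):
    """Map uint8 pixel values in [0, 255] to float32 values in [-1, 1]."""
    typed = pixels.astype(np.float32)
    return (typed / 127.5) - 1


pixels = np.array([0, 51, 255], dtype=np.uint8)
scaled = inception_scale(pixels)
print(scaled.dtype, scaled.min(), scaled.max())
```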

The inference request is sent to NVIDIA Triton, and the responses are appended:
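
A minimal sketch of such a request is shown here, assuming the Python tritonclient HTTP library is installed and a server is reachable on the default HTTP port. The model, input, and output names come from the configuration file above; the function name classify, the URL, and top_k are assumptions for illustration, and this is not the client script used in this post.

```python
import numpy as np


def classify(image_chw, url="localhost:8000", top_k=3):
    """Send one preprocessed CHW image to the 'classification' model.

    Assumes a running Triton server at `url` and the tritonclient[http]
    package. Returns the top_k classification strings from the server.
    """
    # Imported lazily so that the module loads without tritonclient.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url=url)

    # max_batch_size is 1 in the configuration, so add a batch dimension.
    batch = np.expand_dims(image_chw.astype(np.float32), axis=0)

    infer_input = httpclient.InferInput("data", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)

    # class_count asks the server to return the top-k classes as strings.
    output = httpclient.InferRequestedOutput(
        "resnetv15_dense0_fwd", class_count=top_k
    )

    response = client.infer(
        "classification", inputs=[infer_input], outputs=[output]
    )
    return response.as_numpy("resnetv15_dense0_fwd")
```

Calling classify with a preprocessed array of shape (3, 224, 224) would return an array of classification strings, provided the server and model are available.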


Finally, the responses obtained from the server are post-processed.

postprocess(response, output_name, FLAGS.batch_size, supports_batching)

For the classification case, the model returns a single classification output for the input image. The class is decoded and printed in the console.

for results in output_array:
    if not supports_batching:
        results = [results]
    for result in results:
        if output_array.dtype.type == np.object_:
            # Byte strings are decoded character by character
            cls = "".join(chr(x) for x in result).split(':')
        else:
            cls = result.split(':')
        print("    {} ({}) = {}".format(cls[0], cls[1], cls[2]))
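
Each classification entry returned by the server is a string of the form score:index:label. The following sketch parses such entries into typed tuples. The sample values are made up for illustration and are not actual model output; the function name decode_classes is hypothetical.

```python
import numpy as np


def decode_classes(entries):
    """Parse 'score:index:label' entries into (float, int, str) tuples."""
    decoded = []
    for entry in entries:
        text = entry.decode("utf-8") if isinstance(entry, bytes) else str(entry)
        # Split at most twice so that labels containing ':' stay intact.
        score, index, label = text.split(":", 2)
        decoded.append((float(score), int(index), label))
    return decoded


# Illustrative entries only, shaped like a top-2 classification response.
sample = np.array([b"0.75:281:TABBY", b"0.125:282:TIGER CAT"], dtype=np.object_)
for score, index, label in decode_classes(sample):
    print(f"{label} (class {index}): {score}")
```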

For more information, see the classification client script in the code for this post.

Figure 4 shows the sample output.

Figure 4. Classification label assigned to the image by the classification network

Object detection

The process of finding instances of objects of a particular class within an image is known as object detection. The problem of object detection combines classification with localization. It also examines more plausible scenarios in which an image might contain several objects. For more information, see object detection.

Server: Download the model

Download the faster_rcnn_inception_v2_coco object detection model:

$ cd /path/to/the/repo/server/models/detection/1
$ wget && tar xvf faster_rcnn_inception_v2_coco_2018_01_28.tar.gz && cp faster_rcnn_inception_v2_coco_2018_01_28/frozen_inference_graph.pb ./model.graphdef &&  rm -r faster_rcnn_inception_v2_coco_2018_01_28 faster_rcnn_inception_v2_coco_2018_01_28.tar.gz

The following code example shows the model configuration file for the object detection model:

name: "detection"
platform: "tensorflow_graphdef"
max_batch_size: 1
input [
  {
    name: "image_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NHWC
    dims: [ 600, 1024, 3 ]
  }
]
output [
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
    reshape { shape: [ 100, 4 ] }
  },
  {
    name: "detection_classes"
    data_type: TYPE_FP32
    dims: [ 100 ]
    reshape { shape: [ 1, 100 ] }
  },
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "num_detections"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape { shape: [] }
  }
]

The detection model accepts a single image as an input and returns four different outputs.

Client: Run the object detection client 

To run the object detection client, use the following command:

$ python3 /python_examples/examples/detection/ -m detection /python_examples/examples/images/car.jpg

The object detection model returns four different outputs, which are decoded in the post-processing step:

detection_boxes = results.as_numpy(output_name[0].name)
detection_classes = results.as_numpy(output_name[1].name)
detection_scores = results.as_numpy(output_name[2].name)
num_detections = results.as_numpy(output_name[3].name)

At the end, the bounding boxes are drawn on the input as follows:

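for idx, detection_box in enumerate(detection_boxes[0,0:int(num_detections),:]):
    # Boxes are normalized [y_min, x_min, y_max, x_max]; width and height are
    # assumed to hold the input image size in pixels.
    y_min = int(detection_box[0] * height)
    x_min = int(detection_box[1] * width)
    y_max = int(detection_box[2] * height)
    x_max = int(detection_box[3] * width)
    start_point = (x_min,y_min)
    end_point = (x_max,y_max)
    shape = (start_point, end_point)
    draw.rectangle(shape, outline ="red")
    draw.text((int((x_min+x_max)/2),y_min), "class-"+str(int(detection_classes[0,idx])), fill=(0,0,0))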
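
The boxes returned by TensorFlow object detection models are normalized to [0, 1] and ordered as y_min, x_min, y_max, x_max, so they have to be scaled by the image size before drawing. The following sketch does this conversion with NumPy and also drops low-scoring boxes. The score threshold and the function name boxes_to_pixels are assumptions added for illustration and are not part of the client in this post.

```python
import numpy as np


def boxes_to_pixels(boxes, scores, width, height, min_score=0.5):
    """Convert normalized boxes to integer (x_min, y_min, x_max, y_max)."""
    keep = scores >= min_score
    # Normalized order is [y_min, x_min, y_max, x_max].
    scale = np.array([height, width, height, width], dtype=np.float32)
    pixel = (boxes[keep] * scale).astype(int)
    # Reorder columns to the (x, y) corner order used for drawing.
    return pixel[:, [1, 0, 3, 2]]


boxes = np.array(
    [[0.25, 0.5, 0.75, 1.0],   # confident detection
     [0.0, 0.0, 1.0, 1.0]],    # low-score detection, filtered out
    dtype=np.float32,
)
scores = np.array([0.9, 0.3], dtype=np.float32)

# Image size matching the 600 x 1024 input in the configuration.
print(boxes_to_pixels(boxes, scores, width=1024, height=600))
```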

For more information, see the detection client script in the code for this post.

Figure 5 shows the sample output.

Figure 5. Using object detection to identify and locate vehicles (source:

Image segmentation

The process of clustering parts of an image that correspond to the same object class is known as image segmentation. Image segmentation entails splitting images or video frames into multiple objects or segments. For more information, see image segmentation.

Server: Download the model

To download the model, use the following commands:

$ cd /path/to/the/repo/server/models/segmentation/1
$ wget &&  mv fcn-resnet50-11.onnx model.onnx

The following code example shows the model configuration file for the image segmentation model:

name: "segmentation"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, -1, -1 ]
    reshape { shape: [ 1, 3, -1, -1 ] }
  }
]
output [
  {
    name: "out"
    data_type: TYPE_FP32
    dims: [ -1, 21, -1, -1 ]
  }
]

Client: Run the image segmentation client

To run the image segmentation client, run the following commands:

$ pip install opencv-python
$ python3 /python_examples/examples/segmentation/ -m segmentation -s INCEPTION /python_examples/examples/images/people.jpg

The segmentation model accepts a single input and returns a single output. After inferencing, the segmented and blended images are generated from the model output.

# generate segmented image
result_img = colorize(raw_labels)
# generate blended image
blended_img = cv2.addWeighted(image[:, :, ::-1], 0.5, result_img, 0.5, 0)
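
The following sketch shows one way the two steps could work using only NumPy: the per-class scores are reduced to a label map with argmax, each label is mapped to a color, and the result is blended with the input image at equal weights. The colorize and blend functions here are hypothetical stand-ins, the palette is random, and treating class 0 as background is an assumption.

```python
import numpy as np


def colorize(labels, num_classes=21, seed=0):
    """Map an (H, W) array of class ids to an (H, W, 3) uint8 color image."""
    rng = np.random.default_rng(seed)
    palette = rng.integers(0, 256, size=(num_classes, 3)).astype(np.uint8)
    palette[0] = 0  # class 0 is drawn black (assumed to be background)
    return palette[labels]


def blend(image, overlay, alpha=0.5):
    """Weighted sum of two uint8 images, clipped back to uint8."""
    mixed = alpha * image.astype(np.float32) + (1 - alpha) * overlay.astype(np.float32)
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)


# Dummy model output with shape (batch, classes, H, W), as in the config.
scores = np.zeros((1, 21, 2, 2), dtype=np.float32)
scores[0, 15, 0, 0] = 5.0                 # one pixel belongs to class 15

raw_labels = scores[0].argmax(axis=0)     # (H, W) label map
image = np.full((2, 2, 3), 200, dtype=np.uint8)
blended_img = blend(image, colorize(raw_labels))
print(raw_labels)
print(blended_img.shape, blended_img.dtype)
```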

For more information, see the segmentation client script in the code for this post.

Figure 6 shows the sample output.

Figure 6. Annotated image for semantic image segmentation. Source:


Try Triton Inference Server today on GPU, CPU, or both. The NVIDIA Triton Inference Server container can be downloaded from NGC, and its source code is available on the /triton-inference-server GitHub repo.

  • For documentation, see Triton Inference Server on GitHub.
  • If you’re looking for a hands-on skills lab, see Efficient Cloud-based Deployment of Deep Learning Models using Triton Inference Server and TensorRT.
  • For scalable model deployment with NVIDIA Triton Inference Server, see Fast and Scalable AI Model Deployment with NVIDIA Triton Inference Server.
  • See Simplifying AI Inference in Production with NVIDIA Triton.
  • If you’re interested in inferencing of large models, see Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server.
  • For a copy of the code used in this post, see /arslana/triton_blog_1 on GitLab.

Source: NVIDIA