Real-time Serving for XGBoost, Scikit-Learn RandomForest, LightGBM, and More

William Hicks

Dive into how the NVIDIA Triton Inference Server offers highly optimized real-time serving of forest models by using the Forest Inference Library backend.

NVIDIA


AWS, Azure, Docker, Flask, Helm, JSON, Kubernetes, LightGBM, Python, PyTorch, scikit-learn, TensorFlow, Vertex AI, XGBoost

Overview

The article discusses the deployment of tree-based models like XGBoost and LightGBM using the NVIDIA Triton Inference Server, emphasizing its capabilities for real-time serving and GPU acceleration. It highlights the importance of these models in tabular data analysis and provides insights into the features of the Triton Inference Server, including support for multiple frameworks and dynamic batching.

What You'll Learn

1. How to deploy an XGBoost model using the FIL backend
2. Why GPU acceleration is crucial for maintaining low latency in complex models
3. When to use dynamic batching for optimizing throughput

Prerequisites & Requirements

Understanding of machine learning models and their deployment
Familiarity with NVIDIA Triton Inference Server and its components (optional)

Key Questions Answered

How does NVIDIA Triton Inference Server support tree-based models?

NVIDIA Triton Inference Server supports tree-based models like XGBoost, LightGBM, and Scikit-Learn through its Forest Inference Library (FIL) backend. This allows for efficient deployment of these models alongside deep learning models, leveraging GPU acceleration for improved throughput and reduced latency.

What are the performance benefits of using the FIL backend?

Using the FIL backend on an NVIDIA DGX-1 server with eight V100 GPUs enables over 400K inferences per second with p99 latency under 2ms. This represents about 20 times higher throughput compared to CPU deployments, allowing for more complex models without sacrificing performance.

What formats are supported for model serialization in Triton?

NVIDIA Triton Inference Server supports several serialization formats for models, including XGBoost binary format, XGBoost JSON, LightGBM text format, and Treelite binary checkpoint files. This flexibility allows users to deploy models from various frameworks seamlessly.
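As a rough illustration of how these formats relate to a Triton model repository, the sketch below maps each format to the `model_type` value and default file name that the FIL backend is assumed to expect. The mapping reflects the FIL backend documentation as best recalled and should be checked against the version you deploy; `model_path` and the `fraud` model name are hypothetical helpers introduced only for this example.

```python
from pathlib import Path

# Assumed mapping: FIL `model_type` value -> default serialized file name.
# Verify these against the FIL backend documentation for your release.
FIL_FORMATS = {
    "xgboost": "xgboost.model",              # XGBoost binary format
    "xgboost_json": "xgboost.json",          # XGBoost JSON
    "lightgbm": "model.txt",                 # LightGBM text format
    "treelite_checkpoint": "checkpoint.tl",  # Treelite binary checkpoint
}


def model_path(repo_root, model_name, model_type, version=1):
    """Return where a serialized model would sit in a Triton model repository.

    Layout assumed: <repo_root>/<model_name>/<version>/<model file>
    """
    if model_type not in FIL_FORMATS:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return Path(repo_root) / model_name / str(version) / FIL_FORMATS[model_type]


if __name__ == "__main__":
    # Hypothetical model name, for illustration only.
    print(model_path("model_repository", "fraud", "xgboost_json").as_posix())
```

The model's `config.pbtxt` would sit one level up, next to the numbered version directory.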

Key Statistics & Figures

Throughput

over 400K inferences per second

Achieved using the FIL backend on an NVIDIA DGX-1 server with eight V100 GPUs.

Latency

p99 latency under 2ms

Maintained while deploying a sophisticated fraud detection model on GPU.

Performance improvement

about 20x higher throughput than on CPU

Demonstrated in the example of deploying a fraud detection model.

Technologies & Tools


Backend

NVIDIA Triton Inference Server

Used for real-time serving of machine learning models, including tree-based models.

Machine Learning Framework

XGBoost

One of the tree-based model frameworks supported by the FIL backend.

Machine Learning Framework

LightGBM

Another tree-based model framework supported by the FIL backend.

Machine Learning Framework

Scikit-learn

Includes support for Random Forest models through the FIL backend.

Key Actionable Insights

1. Utilize the dynamic batching feature of NVIDIA Triton Inference Server to improve throughput.
Dynamic batching lets the server combine multiple incoming requests into a single batch, which uses the hardware more efficiently, raises throughput, and can help keep latency low under heavy load. This is particularly useful in high-demand applications where response time is critical.

2. Leverage GPU acceleration for deploying complex models to maintain low latency.
By deploying models on NVIDIA GPUs, you can achieve significantly higher throughput while keeping latency manageable, making it feasible to use more sophisticated models in production environments.

3. Explore the FIL backend for serving tree-based models alongside deep learning models.
The FIL backend enables a unified serving architecture, allowing organizations to deploy both tree-based and deep learning models without the need for custom code, simplifying the deployment process.
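The dynamic batching and GPU placement described in these insights are set in a model's `config.pbtxt`. The following is a minimal sketch that renders such a file from Python. The tensor names `input__0` and `output__0`, the `model_type` parameter, and all default values are assumptions drawn from typical FIL backend examples, not settings given in the article; a real deployment will likely need more parameters. `render_fil_config` is a hypothetical helper.

```python
def render_fil_config(name, num_features, model_type="xgboost_json",
                      max_batch_size=8192, use_gpu=True, max_queue_delay_us=100):
    """Render a minimal config.pbtxt for a FIL-backed model as a string.

    All defaults are illustrative assumptions, not tuned recommendations.
    """
    kind = "KIND_GPU" if use_gpu else "KIND_CPU"
    return f"""name: "{name}"
backend: "fil"
max_batch_size: {max_batch_size}
input [
  {{
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {num_features} ]
  }}
]
output [
  {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }}
]
instance_group [ {{ kind: {kind} }} ]
parameters [
  {{
    key: "model_type"
    value: {{ string_value: "{model_type}" }}
  }}
]
dynamic_batching {{
  max_queue_delay_microseconds: {max_queue_delay_us}
}}
"""


if __name__ == "__main__":
    # Hypothetical model with 30 input features.
    print(render_fil_config("fraud", num_features=30))
```

The `max_queue_delay_microseconds` value bounds how long the server may hold a request while waiting to form a larger batch, so it is the main knob for trading a little latency for throughput.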

Common Pitfalls

1. Relying on custom Flask servers for serving models can lead to poor performance.

Many users resort to building their own serving solutions, which often lack the optimizations provided by dedicated frameworks like NVIDIA Triton Inference Server, resulting in suboptimal throughput and latency.
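Instead of a hand-written Flask endpoint, a client can send requests to Triton's standard HTTP inference endpoint. The sketch below only builds the request URL and JSON body, so it runs without a server. The input tensor name `input__0`, the `localhost:8000` address, and the `fraud` model name are assumptions for illustration, and `build_infer_request` and `infer_url` are hypothetical helpers.

```python
import json


def build_infer_request(rows, input_name="input__0"):
    """Build a KServe-v2-style inference request body for a batch of rows.

    `rows` is a list of equal-length feature lists; values are sent as FP32.
    """
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be a non-empty rectangular list")
    flat = [float(x) for row in rows for x in row]  # row-major flattening
    return {
        "inputs": [{
            "name": input_name,
            "shape": [len(rows), len(rows[0])],
            "datatype": "FP32",
            "data": flat,
        }]
    }


def infer_url(host, model_name):
    """Return the HTTP inference URL for a model on the given host:port."""
    return f"http://{host}/v2/models/{model_name}/infer"


if __name__ == "__main__":
    body = build_infer_request([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
    print(infer_url("localhost:8000", "fraud"))
    print(json.dumps(body))
```

The resulting body could be sent with any HTTP client as a POST request; the official `tritonclient` package wraps the same protocol.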