
Performance Guide

In case you’re interested in optimizing the memory usage, latency or throughput of a PyTorch model served with TorchServe, this is the guide for you.

We have also created a quick checklist of extra things to try beyond what is covered on this page. You can find the checklist here.

Optimizing PyTorch

There are many tricks for optimizing PyTorch models for production, including but not limited to distillation, quantization, fusion, pruning, and setting environment variables. We encourage you to benchmark and see what works best for you.

In general, optimizing models is hard, and the easiest approach is often to export to a runtime such as ORT, TensorRT, IPEX, or FasterTransformer. We have many examples of how to integrate these runtimes on the TorchServe GitHub page. If your favorite runtime is not supported, please feel free to open a PR.

torch.compile

Starting with PyTorch 2.0, torch.compile provides an out-of-the-box speedup (~1.8x) for a large number of models. You can refer to this dashboard, which tracks this on a nightly basis.

Models that have been fully optimized with torch.compile show performance improvements of up to 10x.

With smaller batch sizes, passing mode="reduce-overhead" to torch.compile can improve performance because it makes use of CUDA graphs.
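As a minimal sketch of the mode selection described above: the helper below picks a torch.compile mode from the expected batch size. The function names and the threshold of 8 are illustrative assumptions, not TorchServe or PyTorch defaults, so benchmark to find the cutoff that suits your model.

```python
def choose_compile_mode(batch_size: int, small_batch_threshold: int = 8) -> str:
    """Return a torch.compile mode for the expected batch size.

    The threshold is a hypothetical value chosen for illustration only.
    """
    if batch_size <= small_batch_threshold:
        # Small batches tend to be dominated by per-kernel launch overhead,
        # which "reduce-overhead" targets via CUDA graphs.
        return "reduce-overhead"
    return "default"


def compile_for_inference(model, batch_size: int):
    """Compile an nn.Module for inference with the chosen mode."""
    # Imported lazily so choose_compile_mode works without PyTorch installed.
    import torch

    model.eval()
    return torch.compile(model, mode=choose_compile_mode(batch_size))
```

In a custom handler, a call such as compile_for_inference(self.model, batch_size=1) would typically go at the end of initialize, so compilation cost is paid once at worker startup rather than on the first request.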

You can find all the examples of torch.compile with TorchServe here.

Details on the torch.compile GenAI examples can be found at this link.

ONNX and ORT support

TorchServe has native support for ONNX models which can be loaded via ORT for both accelerated CPU and GPU inference. ONNX operates a bit differently from a regular PyTorch model in that when you’re running the conversion you need to explicitly set and name your input and output dimensions. See this example.

At a high level, TorchServe allows you to:

  1. Package serialized ONNX weights: torch-model-archiver --serialized-file model.onnx ...

  2. Load those weights from base_handler.py using ort_session = ort.InferenceSession(self.model_pt_path, providers=providers, sess_options=sess_options), which supports reasonable defaults for both CPU and GPU inference

  3. Define custom pre- and post-processing functions in a custom handler, so that data is passed in the format your ONNX model expects
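The loading step above can be sketched as follows. This is an illustrative approximation rather than the code in base_handler.py: the helper names are hypothetical, and the provider preference order (CUDA first, then CPU) is an assumption about what "reasonable defaults" means.

```python
def choose_providers(available):
    """Keep the preferred ONNX Runtime providers that are actually available.

    Preference order is an assumption: GPU first, CPU as the fallback.
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


def load_onnx_session(model_path: str):
    """Create an ONNX Runtime session for the model at model_path."""
    # Imported lazily so choose_providers works without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    sess_options = ort.SessionOptions()
    return ort.InferenceSession(
        model_path, providers=providers, sess_options=sess_options
    )
```

Listing the CPU provider last means the same handler can run unchanged on hosts without a GPU.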

To use ONNX with a GPU on TorchServe Docker, build an image with the NVIDIA CUDA runtime as the base image, as shown here.

TensorRT

TorchServe also supports models optimized via TensorRT. To leverage the TensorRT runtime you can convert your model by following these instructions and once you’re done you’ll have serialized weights which you can load with torch.jit.load().

After conversion, there is no difference in how PyTorch treats a TorchScript model vs a TensorRT model.

Better Transformer

Better Transformer from PyTorch implements a backwards-compatible fast path of torch.nn.TransformerEncoder for Transformer Encoder Inference and does not require model authors to modify their models. BetterTransformer improvements can exceed 2x in speedup and throughput for many common execution scenarios. You can find more information on Better Transformer here and here.

Optimizing TorchServe

The main settings to vary when tuning TorchServe performance from config.properties are batch_size and batch_delay. A larger batch size means higher throughput at the cost of higher latency, because requests wait longer for a batch to fill.

The second most important settings are the number of workers and the number of GPUs, which have a dramatic impact on CPU and GPU performance.
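Batching can also be set per model at registration time. The sketch below builds a registration URL for the TorchServe management API, assuming the default management address http://localhost:8081 and the batch_size, max_batch_delay (in milliseconds) and initial_workers query parameters. The function name, archive name and values are illustrative only.

```python
from urllib.parse import urlencode


def registration_url(
    mar_name: str,
    batch_size: int,
    max_batch_delay_ms: int,
    management: str = "http://localhost:8081",  # assumed default management address
) -> str:
    """Build the URL for a POST request that registers a model with batching."""
    query = urlencode(
        {
            "url": mar_name,
            "batch_size": batch_size,
            # Longest time the server waits to fill a batch before running it.
            "max_batch_delay": max_batch_delay_ms,
            "initial_workers": 1,
        }
    )
    return f"{management}/models?{query}"


# Hypothetical archive name; send this URL with an HTTP POST to register.
url = registration_url("resnet-18.mar", batch_size=8, max_batch_delay_ms=50)
```

Sweeping batch_size and the delay over a few values while watching throughput and tail latency is usually the quickest way to find a workable trade-off.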

Concurrency And Number of Workers

TorchServe exposes configurations that let you set the number of worker threads on CPUs and GPUs. One config property in particular can speed up the server depending on the workload. Note: this property has a bigger impact under heavy workloads.

TorchServe On CPU

If working with TorchServe on a CPU you can improve performance by setting the following in your config.properties:

cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core

These settings improve performance significantly through launcher core pinning. The theory behind this improvement is discussed in this blog which can be quickly summarized as:

  • In a hyperthreading enabled system, avoid logical cores by setting thread affinity to physical cores only via core pinning.

  • In a multi-socket system with NUMA, avoid cross-socket remote memory access by setting thread affinity to a specific socket via core pinning.

TorchServe on GPU

The number_of_gpu config property tells the server how many GPUs to use per model. When multiple models are registered with the server, it applies to all of them. Setting it to a low value (e.g. 0 or 1) leaves GPUs under-utilized. Conversely, setting it to a high value (>= the number of GPUs available on the system) spawns that many workers per model, which causes unnecessary contention for GPUs and can lead to sub-optimal scheduling of threads to GPUs.

ValueToSet = (Number of Hardware GPUs) / (Number of Unique Models)
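A small sketch of the formula above. The function name is hypothetical, and two choices are assumptions not stated in the formula: the division is rounded down, and the result is clamped to at least 1 whenever the host has a GPU.

```python
def gpus_per_model(hardware_gpus: int, unique_models: int) -> int:
    """Suggest a number_of_gpu value: hardware GPUs divided by unique models."""
    if unique_models < 1:
        raise ValueError("unique_models must be at least 1")
    if hardware_gpus < 1:
        # No GPUs on the host, so models run on CPU.
        return 0
    # Assumption: round down, but never below 1 when a GPU exists.
    return max(1, hardware_gpus // unique_models)
```

For example, a host with 8 GPUs serving 2 unique models gives a value of 4.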

NVIDIA MPS

While NVIDIA GPUs allow multiple processes to run CUDA kernels, this comes with its own drawbacks, namely:

  • The execution of the kernels is generally serialized

  • Each process creates its own CUDA context, which occupies additional GPU memory

To get around these drawbacks, you can utilize the NVIDIA Multi-Process Service (MPS) to increase performance. You can find more information on how to utilize NVIDIA MPS with TorchServe here.

NVIDIA DALI

The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It can be used as a portable drop-in replacement for the built-in data loaders and data iterators in popular deep learning frameworks. DALI provides a collection of highly optimized building blocks for loading and processing image, video, and audio data. You can find an example of DALI integration with TorchServe here.

Benchmarking

To make it easier to compare various model and TorchServe configurations, we've added a few helper scripts here that output performance data such as p50, p90, and p99 latency in a clean report. They mostly require you to specify some configuration via JSON or YAML. You can find more information on TorchServe benchmarking here.
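For reference, this is roughly what the p50, p90 and p99 figures in such a report mean. The sketch below uses the nearest-rank method on a list of measured latencies. It is not the code used by the TorchServe benchmark scripts, and the function names are illustrative.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest sample that is >= p% of all samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact in floating point.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms):
    """Summarize request latencies as p50, p90 and p99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}


# Synthetic latencies of 1 ms to 100 ms, one request each.
report = latency_report(list(range(1, 101)))
```

Tail percentiles such as p99 are usually more informative than the mean when tuning batch size and worker counts, since batching delays show up there first.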

Profiling

TorchServe has native support for the PyTorch profiler which will help you find performance bottlenecks in your code.

If you created a custom handle or initialize method that overrides the one in BaseHandler, you must define the self.manifest attribute to be able to run _infer_with_profiler.

export ENABLE_TORCH_PROFILER=TRUE

Visit this link to learn more about the PyTorch profiler.

More Resources

TorchServe on the Animated Drawings App

For some insight into fine-tuning TorchServe performance in an application, take a look at this article. The case study uses the Animated Drawings App from Meta to show how TorchServe performance can be improved.

Performance Checklist

We have also created a quick checklist of extra things to try beyond what is covered on this page. You can find the checklist here.
