2022

Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels

We are releasing Kernl under the Apache 2 license, a library that makes PyTorch model inference significantly faster. With a single line of code, we applied its optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (> 6X faster generation, and we are only halfway through the optimizations!). This was made possible by writing custom GPU kernels in Triton, OpenAI's new GPU programming language, and by leveraging TorchDynamo.
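For context, the "one line of code" usage looks roughly like the sketch below. This is a minimal, hedged example: the `optimize_model` entry point and the `kernl.model_optimization` import path follow the Kernl README as we understand it, and the `bert-base-uncased` model plus fp16 autocast are illustrative assumptions (the Triton kernels target GPU half-precision inference).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Import path assumed from the Kernl README; adjust if the package layout differs.
from kernl.model_optimization import optimize_model

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

# The single added line: replace supported PyTorch ops with custom Triton kernels.
optimize_model(model)

inputs = tokenizer("Kernl makes inference fast.", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```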

What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformer models (inference)

TL;DR: TorchDynamo (a prototype from the PyTorch team) with the nvfuser backend (from Nvidia) makes Bert inference on PyTorch more than 3X faster most of the time (it depends on the input shape), by adding a single line of code to the Python script; the tool itself is model agnostic. The surprising thing is that during the benchmark we saw no drawback to using this library: the acceleration comes essentially for free. On the same model, TensorRT is (of course) much faster, at least 5X (and even more at batch size 1, which is impressive), but it comes with its own complexity. As the tool is still a prototype, better performance is to be expected as some backends mature, in particular fx2trt (aka TensorRT mixed with PyTorch)!
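As a rough illustration of that single line, the sketch below wraps inference in a TorchDynamo optimization context. It is an assumption-laden example: the prototype API (`torchdynamo.optimize(...)` used as a context manager) and the "nvfuser" backend identifier may differ between versions, and in recent PyTorch releases the equivalent entry point is `torch.compile`.

```python
import torch
import torchdynamo  # prototype package; later merged into PyTorch as torch.compile / torch._dynamo
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
inputs = tokenizer("A sentence to encode.", return_tensors="pt").to("cuda")

# The added line: run the model under TorchDynamo with an nvfuser-based backend.
# The backend name is an assumption; the prototype exposed several fusion backends.
with torchdynamo.optimize("nvfuser"), torch.inference_mode():
    outputs = model(**inputs)
```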