2022

October 26, 2022
in Kernl, Optimization, Transformers
5 min read

Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels

We are releasing Kernl under Apache 2 license, a library to make PyTorch models inference significantly faster. With 1 line of code we applied the optimizations and made Bert up to 12X faster than Hugging Face baseline. T5 is also covered in this first release (> 6X speed up generation and we are still halfway in the optimizations!). This has been possible because we wrote custom GPU kernels with the new OpenAI programming language Triton and leveraged TorchDynamo.

August 3, 2022
in Benchmarking, Transformers
5 min read

What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)

TL;DR: TorchDynamo (prototype from PyTorch team) plus nvfuser (from Nvidia) backend makes Bert (the tool is model agnostic) inference on PyTorch > 3X faster most of the time (it depends on input shape) by just adding a single line of code in Python script. The surprising thing is that during the benchmark, we have not seen any drawback implied by the use of this library, the acceleration just comes for free. On the same model, TensorRT is (of course) much faster, > 5X at least (and even more at batch size 1 which is impressive) but comes with its own complexity. The tool being a prototype, better performances are to be expected with more mature support of some backends, in particular regarding fx2trt (aka TensorRT mixed with PyTorch)!

May 24, 2022
in Optimization, Transformers
4 min read

What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)

We made autoregressive transformer based models like T5-large 2X faster than 🤗 Hugging Face Pytorch with 3 simple tricks:

February 9, 2022
in Optimization, Transformers
3 min read

What we learned by accelerating by 5X Hugging Face generative language models

2 trends ongoing in the NLP ecosystem: bigger language model and better text generation. Both are NLP game changers (zero shot, etc.) but they bring their own challenges: how to perform inference with them? At what cost? GPU or CPU ? etc.