What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformer models (inference)
TorchDynamo (a prototype from the PyTorch team) combined with the nvfuser backend (from Nvidia) makes BERT inference on PyTorch more than 3X faster most of the time (it depends on input shape), by just adding a single line of code. The tool itself is model agnostic.
The surprising thing is that during the benchmark we did not observe any drawback from using this library: the acceleration just comes for free.
On the same model, TensorRT is (of course) much faster, at least 5X (and even more at batch size 1, which is impressive), but it comes with its own complexity.
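For contrast, a sketch of the usual TensorRT path illustrates that complexity: export the model to ONNX, then build an engine with `trtexec` (the CLI shipped with TensorRT), declaring dynamic input shapes up front. Model name, shape ranges, and input names are illustrative assumptions.

```shell
# Hypothetical sketch of a typical PyTorch -> ONNX -> TensorRT pipeline.
# Model name, input names, and shape ranges below are illustrative assumptions.
python -m transformers.onnx --model=bert-base-uncased onnx/

trtexec --onnx=onnx/model.onnx \
        --fp16 \
        --minShapes=input_ids:1x16,attention_mask:1x16 \
        --optShapes=input_ids:8x128,attention_mask:8x128 \
        --maxShapes=input_ids:32x512,attention_mask:32x512 \
        --saveEngine=bert.plan
```

Unlike the one-line TorchDynamo change, you must pick shape ranges, precision, and a deployment runtime yourself, which is where most of the extra effort goes.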
The tool being a prototype, better performance is to be expected as support for some backends matures, in particular fx2trt (aka TensorRT mixed with PyTorch)!