A blog fully dedicated to AI

Discover the expertise of our engineering community

Recent articles

Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2

The code is available at

Llama, the most widely discussed machine learning model of 2023, was recently upgraded with the release of Llama V2. Its new licensing terms have sparked significant excitement in the field and reinforced its position at the forefront of the movement to run models locally. This movement emphasizes low-level optimizations, with a particular focus on platforms like the MacBook Pro, as evidenced by the llama.cpp project and the numerous published quantization schemes. Like its contemporaries, Llama V2 is built on the Transformer architecture. Its distinguishing features include Rotary Positional Embeddings (RoPE) in place of conventional positional encoding, RMSNorm in place of LayerNorm, and the SiLU activation function in the feed-forward layers.
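To make the difference between RMSNorm and LayerNorm concrete, here is a minimal pure-Python sketch of both, along with SiLU. It is an illustration only, not the code from the article: the function names, the `eps` default and the use of plain lists are our assumptions, and real implementations work on tensors, typically inside fused GPU kernels.

```python
import math

def layer_norm(x, eps=1e-6):
    # LayerNorm (without learned scale/bias): subtract the mean,
    # then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, weight=None, eps=1e-6):
    # RMSNorm: no mean subtraction, only a rescaling by the
    # root mean square of the vector, followed by a learned gain.
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    if weight is None:
        weight = [1.0] * len(x)  # hypothetical default: identity gain
    return [v * scale * w for v, w in zip(x, weight)]

def silu(v):
    # SiLU (also called swish): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

hidden = [3.0, 4.0]
print(rms_norm(hidden))
print(layer_norm(hidden))
print([silu(v) for v in hidden])
```

RMSNorm skips the mean computation, so it needs one reduction over the vector instead of two, which is part of why it is a good candidate for fusion with neighbouring operations.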

Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl

We are happy to announce support for the OpenAI Whisper model (ASR task) in Kernl.

We focused on high quality transcription in a latency sensitive scenario, meaning:

  • whisper-large-v2 weights
  • beam search with a beam size of 5 (as recommended in the related paper)

We measured a 2.3x speedup on an Nvidia A100 GPU (2.4x on an RTX 3090) compared to the Hugging Face implementation, using FP16 mixed precision to transcribe the librispeech test set (over 2600 examples). For now, the OpenAI implementation is not yet PyTorch 2.0 compliant.

Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels

We are releasing Kernl, a library that makes PyTorch model inference significantly faster, under the Apache 2 license. With a single line of code, we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (generation is more than 6X faster, and we are only halfway through the optimizations!). This was possible because we wrote custom GPU kernels in Triton, the new programming language from OpenAI, and leveraged TorchDynamo.

What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)

TL;DR: TorchDynamo (a prototype from the PyTorch team) combined with the nvfuser backend (from Nvidia) makes Bert inference on PyTorch more than 3X faster most of the time (it depends on the input shape), just by adding a single line of code to the Python script. The tool itself is model agnostic. Surprisingly, we did not see any drawback from using this library during the benchmark: the acceleration comes for free. On the same model, TensorRT is (of course) much faster, at least 5X (and even more at batch size 1, which is impressive), but it comes with its own complexity. Since the tool is a prototype, better performance is to be expected as support for some backends matures, in particular fx2trt (aka TensorRT mixed with PyTorch)!

4.5 times faster Hugging Face transformer inference by modifying some Python AST

Recently, the 🤗 Hugging Face team released a commercial product called Infinity to perform inference with very high performance (that is, very fast compared to a PyTorch + FastAPI deployment). Unfortunately, it is a paid product costing 20K for one model deployed on a single machine (no information on price scaling is publicly available), according to their product director.

Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec

We just launched a new open source Python library to help optimize Transformer model inference and prepare for deployment in production.

It’s a follow-up to a proof of concept shared here. The scripts have been converted to a Python library (Apache 2 license) that can be used in any NLP project, and the documentation has been reworked. We also added direct TensorRT support, which provides another performance boost compared to the ORT+TRT backend. It will usually give you 5X faster inference compared to vanilla PyTorch, and up to 10X in specific cases. On an RTX 3090, perf_analyzer reports a throughput of over 2800 inferences per second!