A blog fully dedicated to AI

Discover the expertise of our engineering community

Recent articles

July 20, 2023
in Kernl, Large Language Models
21 min read

Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2

The code is available at https://github.com/ELS-RD/kernl/tree/main/experimental/llama-v2.

Llama, the most widely discussed machine learning model in 2023, has recently received an upgrade with the release of Llama V2. Its new licensing terms have sparked significant excitement in the field, reaffirming its position at the forefront of the movement to run models locally. This movement emphasizes low-level optimizations, with a particular focus on platforms like the MacBook Pro, as evidenced by the llama.cpp project and numerous published quantization schemes. Like its contemporaries, Llama V2 is built on the Transformer architecture. Its distinguishing attributes include Rotary Positional Embeddings (RoPE) in place of conventional positional encoding, RMSNorm in place of LayerNorm, and the SiLU activation function in the feed-forward components.
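As a rough illustration of two of the building blocks named above, here is a minimal pure-Python sketch of RMSNorm and SiLU. It is not the Kernl implementation; the function names, the `eps` default, and the list-based vectors are assumptions made for readability.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by a learned weight.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    eps is an assumed small constant for numerical stability.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


hidden = [3.0, 4.0]
normed = rms_norm(hidden, weight=[1.0, 1.0])
activated = [silu(v) for v in normed]
```

With unit weights, the normalized vector has a mean square close to 1, which is the property RMSNorm is designed to provide.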

February 9, 2023
in Kernl, Optimization, Transformers
9 min read

Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl

We are happy to announce support for the OpenAI Whisper model (ASR task) in Kernl.

We focused on high-quality transcription in a latency-sensitive scenario, meaning:

whisper-large-v2 weights
beam search with 5 beams (as recommended in the related paper)

We measured a 2.3x speedup on an Nvidia A100 GPU (2.4x on an RTX 3090) compared to the Hugging Face implementation using FP16 mixed precision when transcribing the LibriSpeech test set (over 2600 examples). For now, the OpenAI implementation is not yet PyTorch 2.0 compliant.

October 26, 2022
in Kernl, Optimization, Transformers
5 min read

Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels

We are releasing Kernl under the Apache 2 license, a library that makes PyTorch model inference significantly faster. With 1 line of code, we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (over 6X faster generation, and we are only halfway through the optimizations!). This was possible because we wrote custom GPU kernels in Triton, the new programming language from OpenAI, and leveraged TorchDynamo.

August 3, 2022
in Benchmarking, Transformers
5 min read

What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)

TL;DR: TorchDynamo (a prototype from the PyTorch team) with the nvfuser backend (from Nvidia) makes Bert inference on PyTorch more than 3X faster most of the time (it depends on the input shape), just by adding a single line of code to the Python script. The tool itself is model agnostic. Surprisingly, during the benchmark we did not see any drawback from using this library: the acceleration just comes for free. On the same model, TensorRT is (of course) much faster, at least 5X (and even more at batch size 1, which is impressive), but it comes with its own complexity. Since the tool is a prototype, better performance is to be expected as support for some backends matures, in particular fx2trt (aka TensorRT mixed with PyTorch)!

May 24, 2022
in Optimization, Transformers
4 min read

What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)

We made autoregressive transformer-based models like T5-large 2X faster than 🤗 Hugging Face PyTorch with 3 simple tricks:

February 9, 2022
in Optimization, Transformers
3 min read

What we learned by accelerating by 5X Hugging Face generative language models

Two trends are ongoing in the NLP ecosystem: bigger language models and better text generation. Both are NLP game changers (zero-shot, etc.), but they bring their own challenges: how do you perform inference with them? At what cost? On GPU or CPU? And so on.

December 29, 2021
in Optimization, Transformers
3 min read

4.5 times faster Hugging Face transformer inference by modifying some Python AST

Recently, the 🤗 Hugging Face team released a commercial product called Infinity to perform inference with very high performance (that is, very fast compared to a PyTorch + FastAPI deployment). Unfortunately, it's a paid product costing 20K for one model deployed on a single machine (no information on price scaling is publicly available), according to their product director.

December 10, 2021
in Optimization, Transformers
2 min read

1st ever method to perform GPU quantization on most 🤗 HF transformer models: > 2X faster inference!

Quantization is a technique that significantly accelerates inference by replacing high-precision tensors with lower-precision representations in a way that keeps accuracy intact (or close to it).

It is quite common for CPU inference but much less so on GPU, even though the performance boost is significant.
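To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It only illustrates the general principle and is not the method from the article; the function names and the choice of a single scale per tensor are assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumption: symmetric, per-tensor quantization with the scale
    derived from the largest absolute value.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why accuracy can stay close to the full-precision model when the value range is well chosen.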

November 24, 2021
in Optimization, Transformers
3 min read

Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec

We just launched a new open source Python library to help optimize Transformer model inference and prepare it for deployment in production.

It's a follow-up to a previously shared proof of concept. The scripts have been converted to a Python library (Apache 2 license) that can be used in any NLP project, and the documentation has been reworked. We also added direct TensorRT support, which provides another performance boost compared to the ORT+TRT backend. It will usually give you 5X faster inference compared to vanilla PyTorch, and up to 10X in specific cases. On an RTX 3090, perf_analyzer reports a throughput of over 2800 inferences per second!

November 5, 2021
in Optimization, Transformers
21 min read

Hugging Face Transformer Inference Under 1 Millisecond Latency

Go to production with Microsoft and Nvidia open source tooling