Skip to content


Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2

The code is available at

Llama, the most widely discussed machine learning model in 2023, has recently received an upgrade with the release of Llama V2. Its new licensing terms have sparked significant excitement in the field, reaffirming its position at the forefront of the local model run movement. This movement emphasizes low-level optimizations, with a particular focus on platforms like MacBook Pro, evidenced by the llama.cpp project and numerous published quantization schemes. Like its contemporaries, Llama V2's design rests on the Transformer architecture. However, its distinct attributes include the use of Rotary Positional Embeddings (RoPE) over conventional positional encoding, RMSNorm replacing LayerNorm, and the integration of the SILU function in the feed-forward components.

Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl

We are happy to announce the support of OpenAI Whisper model (ASR task) on Kernl.

We focused on high quality transcription in a latency sensitive scenario, meaning:

  • whisper-large-v2 weights
  • beam search 5 (as recommended in the related paper)

We measured a 2.3x speedup on Nvidia A100 GPU (2.4x on 3090 RTX) compared to Hugging Face implementation using FP16 mixed precision on transcribing librispeech test set (over 2600 examples). For now, OpenAI implementation is not yet PyTorch 2.0 compliant.