Recently, the 🤗 Hugging Face team released a commercial product called Infinity to perform inference with very high performance (read: very fast compared to a PyTorch + FastAPI deployment). Unfortunately, it's a paid product costing $20K for one model deployed on a single machine (no public information on price scaling), according to their product director.
Quantization is a technique to significantly accelerate inference by replacing high-precision tensors with lower-precision representations, in a way that keeps accuracy intact (or close to it).
It's quite common for CPU inference, much less so on GPU, even though the performance boost is significant.
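To make the idea concrete, here is a minimal sketch of post-training dynamic quantization with stock PyTorch (the model name is just an example, and this is not the optimization path of our library itself): the `nn.Linear` weights are stored as int8 and activations are quantized on the fly, which mainly pays off on CPU.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load a float32 model (model name is just an example).
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# Post-training dynamic quantization: nn.Linear weights are converted
# to int8, activations are quantized dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```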
Python library to optimize Hugging Face Transformer models for inference: < 0.5 ms latency / 2850 infer/sec
We just launched a new open source Python library to help optimize Transformer model inference and prepare deployment to production.
It's a follow-up to a proof of concept shared here. The scripts have been converted into a Python library (Apache 2 license) that can be used in any NLP project, and the documentation has been reworked. We also added direct TensorRT support, which provides another performance boost compared to the ORT+TRT backend. It will usually give you 5X faster inference than vanilla PyTorch, and up to 10X in specific cases. On an RTX 3090, perf_analyzer reports a throughput of over 2800 inferences per second!
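For the ORT+TRT path mentioned above, usage looks roughly like the sketch below. The ONNX file path and input tensor names are assumptions for illustration (check the library documentation for the exact export it produces); ONNX Runtime tries the execution providers in order, so TensorRT is used when available and CUDA or CPU otherwise.

```python
import numpy as np
import onnxruntime as ort

# Dummy inputs; in practice these come from a Hugging Face tokenizer.
# The file name and input names are placeholders for this sketch.
input_ids = np.ones((1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)

# Providers are tried in order: TensorRT first, then plain CUDA,
# then CPU as a last resort.
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
logits = session.run(
    None, {"input_ids": input_ids, "attention_mask": attention_mask}
)[0]
```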
Go to production with Microsoft and Nvidia open source tooling
Optimization of Hugging Face Transformer models to get inference latency below 1 millisecond, plus deployment on a production-ready inference server
I just released a project showing how to optimize big NLP models and deploy them on the Nvidia Triton inference server.
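Once a model is live on Triton, any HTTP or gRPC client can query it. Below is a minimal sketch using the official tritonclient Python package; the model name and tensor names are hypothetical and must match the config.pbtxt of the model you actually register in Triton's model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's default HTTP endpoint.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Tensor names and shapes are hypothetical; they must match
# the deployed model's config.pbtxt.
input_ids = httpclient.InferInput("input_ids", [1, 16], "INT64")
input_ids.set_data_from_numpy(np.ones((1, 16), dtype=np.int64))

attention_mask = httpclient.InferInput("attention_mask", [1, 16], "INT64")
attention_mask.set_data_from_numpy(np.ones((1, 16), dtype=np.int64))

response = client.infer(
    model_name="transformer_onnx_model",  # hypothetical model name
    inputs=[input_ids, attention_mask],
)
logits = response.as_numpy("logits")  # output name also hypothetical
```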