Tags
The following is a list of relevant tags, with related articles listed under each:
Anonymization
BERT
CUDA
CUDA Graph
Data Science
Deep Learning
Flair
GPT-3
GPU Quantization
Hugging Face
- Divide Hugging Face Transformers training time by 2 or more with dynamic padding and uniform length batching
- 1st ever method to perform GPU quantization on most 🤗 HF transformer models: > 2X faster inference!
- 4.5 times faster Hugging Face transformer inference by modifying some Python AST
- Hugging Face Transformer Inference Under 1 Millisecond Latency
- Optimization of Hugging Face Transformer models to get Inference < 1 Millisecond Latency + deployment on production ready inference server
- Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec
- What we learned by accelerating by 5X Hugging Face generative language models
- What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
Justice
Kernel
- Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
- Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2
- Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
Llama
Machine Learning
- NER algo benchmark: spaCy, Flair, m-BERT and camemBERT on anonymizing French commercial legal cases
- Divide Hugging Face Transformers training time by 2 or more with dynamic padding and uniform length batching
NLP
Nvidia Triton
- 4.5 times faster Hugging Face transformer inference by modifying some Python AST
- Optimization of Hugging Face Transformer models to get Inference < 1 Millisecond Latency + deployment on production ready inference server
- What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
ONNX Runtime
- 4.5 times faster Hugging Face transformer inference by modifying some Python AST
- Hugging Face Transformer Inference Under 1 Millisecond Latency
- Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec
- What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)
- What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
OpenAI Triton
OpenAI Whisper
Programming
Python
PyTorch
spaCy
T5
- Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
- What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)
Technology
TensorRT
- 4.5 times faster Hugging Face transformer inference by modifying some Python AST
- Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec
- What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)
- What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)