# sglang-flash-attn3
Pre-built Flash Attention 3 (forward-only) CUDA kernels from sgl-flash-attn, packaged for the HuggingFace `kernels` library. Requires Hopper (sm_90+).

Kernel source: `kernels-community/sgl-flash-attn3`
## Usage

```bash
pip install kernels
```

```python
from kernels import get_kernel

fa3 = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")

# Varlen attention over packed sequences (prefill/extend)
fa3.flash_attn_varlen_func(q, k, v, cu_seqlens_q, cu_seqlens_k, causal=True)

# Decode against an existing KV cache
fa3.flash_attn_with_kvcache(q, k_cache, v_cache, cache_seqlens=cache_seqlens, causal=True)

fa3.is_fa3_supported()  # True on H100/H200
```
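The varlen entry point expects cumulative sequence-length offsets (`cu_seqlens_q`, `cu_seqlens_k`) that mark where each packed sequence begins. A minimal sketch of how such offsets are typically built from per-sequence lengths (`build_cu_seqlens` is a hypothetical helper, not part of this package; in practice you would produce an `int32` CUDA tensor, e.g. via `torch.cumsum`):

```python
from itertools import accumulate

def build_cu_seqlens(seqlens):
    # Cumulative offsets with a leading 0: sequence i occupies
    # rows cu[i]:cu[i+1] of the packed q/k/v tensors.
    return [0, *accumulate(seqlens)]

# Three packed sequences of lengths 3, 5, and 2:
offsets = build_cu_seqlens([3, 5, 2])  # -> [0, 3, 8, 10]
```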
## SGLang Integration

Entry point: `python/sglang/srt/layers/attention/flashattention_backend.py`

Original:

```python
from sgl_kernel.flash_attn import flash_attn_varlen_func as flash_attn_varlen_func_fa3
from sgl_kernel.flash_attn import flash_attn_with_kvcache as flash_attn_with_kvcache_fa3
```

Replace with:

```python
from kernels import get_kernel

_fa3_mod = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")
flash_attn_varlen_func_fa3 = _fa3_mod.flash_attn_varlen_func
flash_attn_with_kvcache_fa3 = _fa3_mod.flash_attn_with_kvcache
```

The same pattern applies in five other files:

- `dual_chunk_flashattention_backend.py`
- `nsa_backend.py`
- `xpu_backend.py`
- `vision.py`
- `multimodal_gen/runtime/layers/attention/backends/flash_attn.py`
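Instead of replacing the import outright, a guarded fallback keeps the native build preferred when it is installed. A sketch under the assumption that both modules expose the same two functions (this `load_fa3` helper is illustrative, not taken from the SGLang source):

```python
def load_fa3():
    """Prefer the native sgl_kernel FA3 build; fall back to the
    pre-built Hub package when sgl_kernel is not installed."""
    try:
        from sgl_kernel.flash_attn import (
            flash_attn_varlen_func,
            flash_attn_with_kvcache,
        )
    except ImportError:
        from kernels import get_kernel
        mod = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")
        flash_attn_varlen_func = mod.flash_attn_varlen_func
        flash_attn_with_kvcache = mod.flash_attn_with_kvcache
    return flash_attn_varlen_func, flash_attn_with_kvcache
```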
## Benchmarks

H100 NVL, Qwen2.5-3B-Instruct, FA3 backend. All deltas are within run-to-run noise; loading the kernels from the Hub causes no performance regression.
| Scenario | Native sgl_kernel FA3 (tok/s) | HF Hub FA3 (tok/s) | Δ |
|---|---|---|---|
| Short Gen (128→32) | 40,934 | 39,878 | -2.6% |
| Long Gen (256→1024) | 25,054 | 26,239 | +4.7% |
| Long Prefill (2048→128) | 53,833 | 54,283 | +0.8% |
| Bursty (512→256, 16 rps) | 6,518 | 6,527 | +0.1% |
| High Concurrency (256→256) | 40,666 | 40,522 | -0.4% |
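The Δ column is simply the relative throughput change, (hub − native) / native. Recomputing it from the table values:

```python
# (native sgl_kernel tok/s, HF Hub tok/s) per scenario, from the table above
results = {
    "Short Gen (128->32)":         (40934, 39878),
    "Long Gen (256->1024)":        (25054, 26239),
    "Long Prefill (2048->128)":    (53833, 54283),
    "Bursty (512->256, 16 rps)":   (6518, 6527),
    "High Concurrency (256->256)": (40666, 40522),
}

deltas = {
    name: round((hub - native) / native * 100, 1)
    for name, (native, hub) in results.items()
}
# e.g. deltas["Short Gen (128->32)"] == -2.6
```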
## Credits

- Tri Dao - Flash Attention 3
- SGLang - `sgl_kernel` FA3 implementation
- HuggingFace - kernel-builder infrastructure