Post ⚡ Qwen3.5, up to 1.4× faster. Same quality. Less latency. We applied FlashHead to the Qwen3.5 family: a novel drop-in replacement for the LM head with measurably lower latency on edge hardware. Benchmarks and models below. 📊 embedl/Edge-Inference-Benchmarks 🤗 https://huggingface.co/collections/embedl/qwen35
NVIDIA Jetson AGX Orin Collection Models optimized and benchmarked for NVIDIA Jetson AGX Orin. Memory-efficient and latency-optimized variants designed for real-time edge inference. • 8 items • Updated 4 days ago • 3
NVIDIA Jetson AGX Thor Collection Models validated and performance-optimized for NVIDIA Jetson AGX Thor. Tailored for high-performance edge AI workloads. • 7 items • Updated 4 days ago • 1
FlashHead Collection Efficient drop-in replacement for the classification head in language model inference. https://github.com/embedl/flash-head • 24 items • Updated 4 days ago • 2
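The collection description does not spell out how the replacement head works, so here is a minimal, purely illustrative sketch of one common way to cut LM-head latency: replace the single large hidden-dim × vocab matmul with a coarse-to-fine search (score a few token clusters first, then compute exact logits only inside the winning cluster). The cluster layout, sizes, and the `fast_head` helper below are assumptions for illustration, not FlashHead's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab, n_clusters = 64, 1024, 16
tokens_per_cluster = vocab // n_clusters

# Full LM head weight matrix (vocab x hidden_dim), as a stand-in baseline.
W = rng.standard_normal((vocab, hidden_dim)).astype(np.float32)

# Hypothetical stage-1 index: one centroid per contiguous block of token rows.
centroids = W.reshape(n_clusters, tokens_per_cluster, hidden_dim).mean(axis=1)

def fast_head(h: np.ndarray) -> int:
    """Approximate argmax over the vocab via coarse-to-fine search."""
    c = int(np.argmax(centroids @ h))                      # stage 1: pick a cluster
    rows = W[c * tokens_per_cluster:(c + 1) * tokens_per_cluster]
    return c * tokens_per_cluster + int(np.argmax(rows @ h))  # stage 2: exact within cluster

h = rng.standard_normal(hidden_dim).astype(np.float32)
token = fast_head(h)
```

Stage 1 costs `n_clusters` dot products and stage 2 costs `tokens_per_cluster`, versus `vocab` for the full matmul, which is where a latency win on edge hardware could come from under these assumptions.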
EdgeN Collection Quantization strategy where most weights are converted to INT4, activations remain in FP16, and sensitive layers are preserved in FP16. • 4 items • Updated 4 days ago • 1
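The mixed-precision recipe above can be sketched in a few lines: symmetric per-tensor INT4 for most weights, FP16 activations, and FP16 weights for layers flagged as sensitive. The layer names, the per-tensor scaling, and the `sensitive` set below are illustrative assumptions; EdgeN's actual sensitivity criterion and quantization granularity are not described in the source.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor INT4: integer values in [-8, 7] plus an FP16 scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # int8 storage, int4 range
    return q, np.float16(scale)

def dequantize(q: np.ndarray, scale: np.float16) -> np.ndarray:
    return q.astype(np.float16) * scale

rng = np.random.default_rng(0)
layers = {                                     # hypothetical layer names
    "attn.qkv": rng.standard_normal((8, 8)).astype(np.float32),
    "mlp.up": rng.standard_normal((8, 8)).astype(np.float32),
    "lm_head": rng.standard_normal((8, 8)).astype(np.float32),
}
sensitive = {"lm_head"}                        # assumed sensitivity choice

packed = {}
for name, w in layers.items():
    if name in sensitive:
        packed[name] = ("fp16", w.astype(np.float16))   # preserved in FP16
    else:
        packed[name] = ("int4", quantize_int4(w))       # INT4 weights

x = rng.standard_normal(8).astype(np.float16)  # activations stay FP16
outputs = {}
for name, (kind, payload) in packed.items():
    w16 = payload if kind == "fp16" else dequantize(*payload)
    outputs[name] = w16 @ x                    # FP16 matmul at inference time
```

The INT4 tensors are stored in `int8` arrays here for simplicity; a real deployment would pack two 4-bit values per byte to realize the memory saving.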
Qwen3.5 Collection Qwen/Qwen3.5 variants optimized by embedl. • 6 items • Updated 4 days ago • 1
NVIDIA Jetson Orin Nano Collection Ultra-efficient model variants optimized for Jetson Orin Nano. Designed for constrained edge environments requiring low memory footprint. • 5 items • Updated 4 days ago • 4