---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
tags:
  - text-generation-inference
---

Llama-3.2-1B-Instruct-FlashHead-W4A16

An optimized version of Llama-3.2-1B-Instruct using quantization and FlashHead, Embedl's efficient replacement for the language-model head that reduces model size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • Custom vLLM generation via embedl-models

FlashHead matches the baseline Llama-3.2-1B within rounding on standard evaluations (MMLU-Pro, HellaSwag, GSM8K, etc.) and, in combination with quantization, achieves H200-level latency on RTX Ada GPUs.


Model Details

| Field | Value |
|---|---|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Quantization (W4A16) - large reduction in memory footprint with negligible impact on accuracy.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |

Combined with W4A16 quantization, FlashHead improves end-to-end generation speed by roughly 1.75× over the W4A16 baseline (3.73× over BF16), while maintaining accuracy parity.
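
To get a rough read on these numbers on your own hardware, the sketch below times single-prompt generation through the FlashHead vLLM wrapper. The prompt, token budget, and timing loop are illustrative assumptions, not the exact harness used to produce the table.

import time
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
llm = LLM(model=model_id, trust_remote_code=True)
sampling = SamplingParams(max_tokens=256, temperature=0.0)

prompt = "Summarize the benefits of small language models."
llm.generate([prompt], sampling)  # warm-up run to exclude one-time startup cost

start = time.perf_counter()
output = llm.generate([prompt], sampling)
elapsed = time.perf_counter() - start

generated = len(output[0].outputs[0].token_ids)  # tokens actually produced
print(f"{generated / elapsed:.1f} tokens/sec (batch size = 1)")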


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|---|---|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead matches baseline performance exactly across all evaluation benchmarks.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
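
A quick, illustrative sanity check (not part of the package) is to confirm that the FlashHead wrapper imports and that the pinned vLLM build is picked up:

import vllm
from embedl.models.vllm import LLM  # FlashHead-enabled entry point used in the examples below

print("vLLM version:", vllm.__version__)  # expected: 0.10.2 (pinned dependency)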


Usage Examples

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

sampling = SamplingParams(max_tokens=128, temperature=0.0)
llm = LLM(model=model_id, trust_remote_code=True)

prompt = "Write a haiku about coffee."
output = llm.generate([prompt], sampling)
print(output[0].outputs[0].text)
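
Because this is an Instruct model, chat-formatted prompts generally work best. The following sketch assumes the repository ships a standard Llama 3.2 chat template with its tokenizer; it renders the messages into a single prompt string before calling generate().

from transformers import AutoTokenizer
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain why a smaller LM head speeds up decoding."},
]
# Render the chat template into a plain prompt string for generate().
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate([prompt], SamplingParams(max_tokens=128, temperature=0.0))
print(output[0].outputs[0].text)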

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
asyncio.run(
    run_repl(
        model=model_id
    )
)


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Hugging Face transformers generation
  • Advanced mixed precision quantization
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries: sales@embedl.com

Technical Issues & Early Access: https://github.com/embedl/embedl-models

More Information & Model Releases: https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com