---
model-index:
  - name: >-
      Granite-4.0-H-Tiny — MLX (Apple Silicon), **6-bit** (with guidance for
      2/3/4/5-bit)
    results: []
license: apache-2.0
language:
  - en
tags:
  - ibm
  - granite
  - mlx
  - apple-silicon
  - mamba2
  - transformer
  - hybrid
  - moe
  - long-context
  - instruct
  - quantized
  - 6bit
pipeline_tag: text-generation
library_name: mlx
base_model:
  - ibm-granite/granite-4.0-h-tiny
---
|
|
|
|
|
# Granite-4.0-H-Tiny — **MLX 6-bit** (Apple Silicon) |
|
|
**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary) |
|
|
|
|
|
This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **6-bit**. |
|
|
Among MLX quant variants, **6-bit** offers the **highest fidelity** while still fitting comfortably on modern M-series Macs. If your workload involves **precise extraction, structured outputs, or long contexts**, 6-bit is usually the best on-device choice. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔢 Choosing a quantization level (MLX variants)
|
|
Use this table as a **practical** guide for a ~7B hybrid MoE LM on Apple Silicon. (Figures vary by device/context.) |
|
|
|
|
|
| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose | |
|
|
|---|---:|:---:|---|---| |
|
|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests | |
|
|
| **3-bit** | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise | Great default on M1/M2/M3/M4 |
|
|
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | If 3-bit misses details | |
|
|
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavier docs/structured outputs | |
|
|
| **6-bit** *(this repo)* | **~9.5–11 GB** | 🔥🔥 | **Highest MLX fidelity** | Best quality on-device if RAM permits | |
|
|
|
|
|
**Tips** |
|
|
- Prefer **6-bit** when you have ~10–12 GB free and want maximum quality. |
|
|
- Use **3-bit/4-bit** for tighter RAM with good latency and strong baseline quality. |
|
|
- For JSON/structured extraction, consider **temperature 0.0** and **schema-style prompts**. |
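
This repo ships the **6-bit** weights; if you want one of the lighter variants from the table, you can convert the FP16 base model yourself. A minimal sketch using `mlx_lm.convert` (assumes a recent `mlx-lm`; the output path is illustrative):

```bash
# Quantize the base model to 4-bit MLX weights; set --q-bits to 2, 3, 5, or 6
# for the other variants. --q-group-size 64 is the usual fidelity/RAM default.
python -m mlx_lm.convert \
  --hf-path ibm-granite/granite-4.0-h-tiny \
  --mlx-path ./granite-4.0-h-tiny-mlx-4bit \
  -q --q-bits 4 --q-group-size 64
```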
|
|
|
|
|
--- |
|
|
|
|
|
## 🔎 About Granite 4.0 (context for this build) |
|
|
- **Architecture:** Hybrid **Mamba-2 + softmax attention**; *H* tiers add **Mixture-of-Experts (MoE)** for sparse activation and efficiency. |
|
|
- **Model tier:** **H-Tiny** (~7B total params with ~1B active via MoE) — designed for **long-context** use and efficient serving. |
|
|
- **License:** **Apache-2.0** (permissive, enterprise-friendly). |
|
|
- **Use cases:** Instruction following, long-context assistants, RAG backends, structured outputs. |
|
|
|
|
|
> This card documents the **MLX 6-bit** conversion. For lower-RAM devices, see the 2/3/4/5-bit guidance below. |
|
|
|
--- |
|
|
|
|
|
## 📦 Contents of this repository |
|
|
- `config.json` (MLX), `mlx_model*.safetensors` (**6-bit** shards) |
|
|
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json` |
|
|
- Any auxiliary metadata (e.g., `model_index.json`) |
|
|
|
|
|
This build targets **macOS** on **Apple Silicon (M-series)** and runs on the GPU via MLX's **Metal** backend.
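
A quick sanity check that MLX can see the GPU (assumes `mlx` is installed):

```bash
# Prints Device(gpu, 0) on an M-series Mac; Device(cpu, 0) means the Metal
# backend is unavailable and inference will fall back to CPU.
python -c "import mlx.core as mx; print(mx.default_device())"
```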
|
|
|
|
|
--- |
|
|
|
|
|
## ✅ Intended use |
|
|
- **High-fidelity** instruction following and summarization |
|
|
- **Long-context** reasoning and retrieval-augmented generation (RAG) |
|
|
- **Structured extraction** (JSON, key–value) and document parsing |
|
|
- On-device prototyping where **answer faithfulness** matters |
|
|
|
|
|
## ⚠️ Limitations |
|
|
- As with any quantization, small quality regressions versus FP16 can occur, most often on complex math and code.
|
|
- **Token limits** and **KV-cache growth** still apply for very long contexts. |
|
|
- Always add your own **guardrails/safety** for sensitive deployments. |
|
|
## 🚀 Quickstart (CLI — MLX) |
|
|
|
|
|
**Deterministic generation** |
|
|
```bash
# MLX selects the Metal GPU automatically; no device flag is needed.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following meeting notes in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```
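
**Structured extraction**

For JSON or structured extraction (per the tips above), pin the temperature to `0.0` and put the schema in the prompt. A minimal sketch; the field names and schema below are illustrative, not part of the model:

```bash
# Schema-style prompt at temperature 0.0 for reproducible JSON output.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --max-tokens 512 \
  --temp 0.0 \
  --seed 0 \
  --prompt 'Extract these fields from the text below and return ONLY a JSON object:
{"title": string, "date": string, "action_items": [string]}

<your text>'
```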