Susant-Achary committed
Commit 11ddab8 · verified · 1 Parent(s): c8b6ef2

Update README.md

Files changed (1):
  1. README.md +89 -4
README.md CHANGED
@@ -1,10 +1,95 @@
  ---
+ model-index:
+ - name: Granite-4.0-H-Tiny — MLX (Apple Silicon), **6-bit** (with guidance for 2/3/4/5-bit)
+   results: []
  license: apache-2.0
- library_name: mlx
+ language:
+ - en
  tags:
- - language
- - granite-4.0
+ - ibm
+ - granite
  - mlx
+ - apple-silicon
+ - mamba2
+ - transformer
+ - hybrid
+ - moe
+ - long-context
+ - instruct
+ - quantized
+ - 6bit
  pipeline_tag: text-generation
- base_model: ibm-granite/granite-4.0-h-tiny
+ library_name: mlx
+ ---
+
+ # Granite-4.0-H-Tiny — **MLX 6-bit** (Apple Silicon)
+ **Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)
+
+ This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **6-bit**.
+ Among MLX quant variants, **6-bit** offers the **highest fidelity** while still fitting comfortably on modern M-series Macs. If your workload involves **precise extraction, structured outputs, or long contexts**, 6-bit is usually the best on-device choice.
+
+ ---
+
+ ## 🔢 Choosing a quantization level (MLX variants)
+ Use this table as a **practical** guide for a ~7B hybrid MoE LM on Apple Silicon. (Figures vary by device/context; a rough RAM sanity check follows the table.)
+
+ | Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
+ |---|---:|:---:|---|---|
+ | **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
+ | **3-bit** | ~5–6 GB | **🔥🔥🔥🔥** | Direct, concise | Great default on M1/M2/M3/M4 |
+ | **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | If 3-bit misses details |
+ | **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavier docs/structured outputs |
+ | **6-bit** *(this repo)* | **~9.5–11 GB** | 🔥🔥 | **Highest MLX fidelity** | Best quality on-device if RAM permits |
+
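+ A rough sanity check on these figures, assuming ~7B total parameters (weights only; the KV cache and runtime buffers account for the rest of the peak):
+
+ ```python
+ # Back-of-the-envelope RAM estimate for the quantized weights alone.
+ total_params = 7e9        # approximate total parameter count for the H-Tiny tier
+ bits_per_weight = 6       # this repository's quantization level
+
+ weight_gb = total_params * bits_per_weight / 8 / 1e9
+ print(f"quantized weights alone: ~{weight_gb:.1f} GB")  # ~5.2 GB
+ # Per-group quantization scales, the KV cache for long prompts, and runtime
+ # buffers push the practical peak toward the ~9.5–11 GB shown in the table.
+ ```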
+ **Tips**
+ - Prefer **6-bit** when you have ~10–12 GB free and want maximum quality.
+ - Use **3-bit/4-bit** for tighter RAM with good latency and strong baseline quality.
+ - For JSON/structured extraction, consider **temperature 0.0** and **schema-style prompts** (see the sketch below).
+
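+ A minimal sketch of that last tip with the `mlx-lm` Python API (the schema, example text, and `<this-repo-id>` placeholder are illustrative, not part of this repo):
+
+ ```python
+ from mlx_lm import load, generate
+
+ # Placeholder id; substitute this repository's actual Hub id.
+ model, tokenizer = load("<this-repo-id>")
+
+ # Schema-style prompt: state the output contract explicitly and ask for JSON only.
+ request = (
+     "Reply with JSON only, matching this schema: "
+     '{"vendor": string, "date": string, "amount": number}\n'
+     "Text: Invoice from Acme Corp dated 2024-03-01 for $1,250.00."
+ )
+
+ # Wrap the request in the model's chat template for instruct-style behavior.
+ messages = [{"role": "user", "content": request}]
+ prompt = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=False
+ )
+
+ # mlx-lm decodes greedily unless a sampler is supplied (i.e., temperature 0.0),
+ # so repeated runs give reproducible structured output.
+ text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
+ print(text)
+ ```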
+ ---
+
+ ## 🔎 About Granite 4.0 (context for this build)
+ - **Architecture:** Hybrid **Mamba-2 + softmax attention**; *H* tiers add **Mixture-of-Experts (MoE)** for sparse activation and efficiency.
+ - **Model tier:** **H-Tiny** (~7B total params with ~1B active via MoE) — designed for **long-context** use and efficient serving.
+ - **License:** **Apache-2.0** (permissive, enterprise-friendly).
+ - **Use cases:** Instruction following, long-context assistants, RAG backends, structured outputs.
+
+ > This card documents the **MLX 6-bit** conversion. For lower-RAM devices, see the 2/3/4/5-bit guidance below.
+
+ ---
+
+ ## 📦 Contents of this repository
+ - `config.json` (MLX), `mlx_model*.safetensors` (**6-bit** shards)
+ - Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
+ - Any auxiliary metadata (e.g., `model_index.json`)
+
+ This build targets **macOS** on **Apple Silicon (M-series)** using **Metal/MPS**.
+
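+ Before pulling the full weights, the config can be fetched on its own to confirm what the build contains (a sketch assuming `huggingface_hub` is installed; `<this-repo-id>` is a placeholder, and keys are read defensively because converter versions differ):
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Download only config.json, not the multi-GB weight shards.
+ cfg_path = hf_hub_download("<this-repo-id>", "config.json")
+ with open(cfg_path) as f:
+     cfg = json.load(f)
+
+ # MLX conversions usually record their quantization settings in config.json;
+ # .get() avoids assuming exact key names.
+ print(cfg.get("model_type"), cfg.get("quantization"))
+ ```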
  ---
+
+ ## ✅ Intended use
+ - **High-fidelity** instruction following and summarization
+ - **Long-context** reasoning and retrieval-augmented generation (RAG)
+ - **Structured extraction** (JSON, key–value) and document parsing
+ - On-device prototyping where **answer faithfulness** matters
+
+ ## ⚠️ Limitations
+ - As with any quantization, small regressions vs FP16 can occur (complex math/code).
+ - **Token limits** and **KV-cache growth** still apply for very long contexts.
+ - Always add your own **guardrails/safety** for sensitive deployments.
+
+ ## 🚀 Quickstart (CLI — MLX)
+
+ **Deterministic generation**
+ ```bash
+ python -m mlx_lm.generate \
+   --model <this-repo-id> \
+   --prompt "Summarize the following meeting notes in 5 bullet points:\n<your text>" \
+   --max-tokens 256 \
+   --temp 0.0 \
+   --seed 0