Model Card for RWKV-Qwen3-32B-Hybrid-GGUF
This model requires a custom fork of llama.cpp with the RWKV079 implementation
Model Overview
Model Name: RWKV-Qwen3-32B-Hybrid-GGUF
Repository: OpenMOSE/RWKV-Qwen3-32B-hxa079-Low
Format: GGUF (for llama.cpp) with imatrix quantization
Year: 2025
Release phase: alpha
Description
RWKV-Qwen3-32B-Hybrid-GGUF is an experimental large language model that combines the strengths of traditional transformer architecture with the efficiency of RWKV (Receptance Weighted Key Value) mechanisms. This model is specifically optimized for inference in memory-constrained environments while maintaining excellent context length capabilities.
Technical Specifications
Model Parameters
- Parameter Count: 32 Billion parameters
- Architecture: RWKV079 + GQA (Grouped-Query Attention) Hybrid Linear Attention
- Base Model: Alibaba Qwen3-32B
- Suitable Context Length: 32,768 tokens (passkey retrieval up to 80k)
- Layers: 64 total (56 RWKV, 8 NoPE GQA)
Key Innovation
The model achieves remarkable efficiency by:
- Converting 87.5% of attention layers from the base Qwen3-32B model to RWKV architecture
- Reducing the KV (Key-Value) cache to 1/8 of its original size (see the rough estimate below)
- Enabling superior long-context inference in VRAM-limited environments
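As a rough illustration of the cache savings (a sketch assuming Qwen3-32B's published attention configuration of 8 KV heads with a head dimension of 128, and an FP16 cache):
KV cache per token per attention layer ≈ 2 (K and V) × 8 KV heads × 128 head dim × 2 bytes ≈ 4 KiB
Base Qwen3-32B, 64 attention layers: 64 × 4 KiB = 256 KiB per token ≈ 8 GiB at 32,768 tokens
This hybrid, 8 GQA layers: 8 × 4 KiB = 32 KiB per token ≈ 1 GiB at 32,768 tokens
(The weights themselves are unchanged, which is why the practical gains listed below are smaller than the raw 8x cache reduction.)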
Performance Benefits
Compared to the base model, RWKV-Qwen3-32B-Hybrid offers:
- 2-4x longer context length capability (theoretical)
- 2-4x larger batch size for simultaneous inference
- Significantly reduced memory footprint while maintaining model quality
Installation and Usage
Prerequisites
This model requires a custom fork of llama.cpp with RWKV079 implementation, based on mollysophia's RWKV7 implementation.
Setup Instructions
- Clone the repository:
git clone https://github.com/OpenMOSE/llama.cpp
cd llama.cpp
git checkout hxa079
Building the Project (Linux)
For CUDA (NVIDIA GPUs):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
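To confirm the build succeeded, the resulting binary can print its build information (the --version flag is assumed from upstream llama.cpp):
./build/bin/llama-cli --version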
For ROCm (AMD GPUs):
First, identify your GPU architecture:
- AMD Radeon RX 79xx series → gfx1100
- AMD Instinct MI300 series → gfx942
- AMD Instinct MI100 → gfx908
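If you are unsure which code your card reports, rocminfo (shipped with ROCm) prints it; a minimal check, assuming the tool is on your PATH and the first listed GPU is the one you will build for:
rocminfo | grep -m1 -o 'gfx[0-9a-f]*'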
Then build with the appropriate target:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16
Note: Replace gfx1100 with your GPU's architecture code
Running the Model
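Before running, download a GGUF file. A minimal sketch using huggingface-cli, assuming the quantized files are hosted in the repository listed above (narrow the --include pattern to the quantization that fits your VRAM):
huggingface-cli download OpenMOSE/RWKV-Qwen3-32B-hxa079-Low --include "*.gguf" --local-dir ./models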
Standard Inference:
./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1
With KV Cache Quantization:
./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1 -ctv q8_0 -ctk q8_0
Extreme Low VRAM Mode (fits on a 16 GB GPU):
./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1 -ctv q8_0 -ctk q8_0 --override-tensor "time_mix_g1=CPU,time_mix_g2=CPU,time_mix_w1=CPU,time_mix_w2=CPU"
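The same flags should also work with llama.cpp's OpenAI-compatible HTTP server, assuming the fork builds the llama-server target like upstream (the port and context size below are illustrative):
./build/bin/llama-server -m YOUR_MODEL_PATH --jinja -fa 1 -ctv q8_0 -ctk q8_0 -c 32768 --port 8080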
Important: The --jinja flag enables the model's chat template, which runs in non-thinking mode by default. To use thinking mode, you'll need to prepare and apply a separate Jinja template.
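Once such a template is prepared, recent upstream llama.cpp can load it via --chat-template-file; whether this fork exposes the same flag is an assumption, and my_template.jinja is a placeholder name:
./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja --chat-template-file my_template.jinja -fa 1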
Important Limitations and Notes
Current Limitations:
- Model Compatibility: This branch exclusively supports RWKV079 models - other model types will not function
Supported Hardware:
- NVIDIA GPUs (via CUDA)
- AMD GPUs (via ROCm)
- CPU inference
- Apple Silicon (Metal)
Acknowledgments
This project was made possible through:
- Substantial computational support from Recursal.AI
- Special thanks to SmerkyG for invaluable guidance and mentorship
- Inspired by RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
- Reference code: https://github.com/recursal/RADLADS-paper
We extend our heartfelt gratitude to all contributors and supporters who made this experimental model possible.
Disclaimer
EXPERIMENTAL MODEL: This model is created purely for experimental and research purposes.
No Warranty: The creators make no guarantees regarding:
- Model performance
- Output quality
- Suitability for any particular use case
- Results accuracy
Users should thoroughly evaluate the model for their specific needs before deployment in any application.
License
Apache-2.0. Please refer to the repository for specific license information. As this model is based on Qwen3-32B, users should also comply with the original Qwen model's licensing terms.
Contact and Support
For issues, questions, or contributions, please visit the GitHub repository or open an issue in the project's issue tracker.
2025 OpenMOSE