Model Card for MaximusLLM (190M)
MaximusLLM is a long-context language model designed around a hyper-efficient architecture and training recipe. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to an optimized standard Cross-Entropy baseline.
Model Details
Model Description
- Developed by: Yousef Gamaleldin (Independent Researcher)
- Model type: Transformer with Bifurcated Latent Attention
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: Trained from scratch (Base) followed by Instruction Pre-training.
- Tokenizer: Gemma 3 (262,144 vocab size)
Model Sources
- Repository: yousefg/MaximusLLM
- Technical Reports:
- MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training
- Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA
Bias, Risks, and Limitations
MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations.
How to Get Started with the Model
```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

# Load the Gemma 3 tokenizer shipped with the checkpoint (262,144 vocab)
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
Training Details
Training Data
- Pre-training: A high-quality subset of HuggingFaceFW/fineweb-edu.
- Narrative Alignment: roneneldan/TinyStories to stabilize linguistic fluidity.
- Instruction Alignment: HuggingFaceH4/ultrachat_200k using a multi-turn conversational format.
Training Procedure
Maximus utilizes a specialized training pipeline to maintain FP32 master weight stability while achieving FP16 throughput.
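To illustrate why the master weights are kept in FP32 (this is a toy sketch, not the repository's training code): when a weight is stored only in reduced precision, small gradient steps round away entirely. The example below simulates a low-precision format by rounding to 3 significant digits; real FP16 behaves analogously near its ~1e-3 relative epsilon.

```python
# Toy demonstration: small updates vanish in low precision unless they are
# accumulated in a full-precision master copy of the weight.

def to_low_precision(x, sig_digits=3):
    """Round x to a fixed number of significant digits (stand-in for FP16)."""
    if x == 0:
        return 0.0
    return float(f"{x:.{sig_digits - 1}e}")

weight_lp = 1.00          # weight stored only in low precision
weight_master = 1.00      # full-precision (FP32-style) master copy
update = 1e-4             # a small gradient step

for _ in range(100):
    # Low-precision-only update: 1.0 + 1e-4 rounds straight back to 1.0.
    weight_lp = to_low_precision(weight_lp + update)
    # Master-weight update: accumulate in full precision; cast only for compute.
    weight_master += update

print(weight_lp)                # stuck at 1.0 -> all 100 updates were lost
print(round(weight_master, 6))  # 1.01 -> progress preserved
```

In the real pipeline the forward/backward pass runs under FP16 autocast for throughput, while the optimizer applies updates to the FP32 master weights.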
Training Hyperparameters
- Optimizers:
- Muon: Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (Pre-train) and 0.005 (SFT).
- AdamW: Applied to Embeddings, Head, and Norms (LR 4e-4).
- Loss Function: MAXIS Loss (Unnormalized Ghost Logits + Matryoshka Auxiliary loss).
- Precision: FP32 Master Weights, FP16 Mixed Precision (Autocast).
- Effective Batch Size: 64 to 256 (via Gradient Accumulation).
- Context Length: Scaled from 2,048 to 8,192 native (Long-context phase).
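The effective batch sizes above are reached through gradient accumulation: k micro-batch gradients are averaged before a single optimizer step, which is mathematically equivalent to one step on a batch k times larger. A minimal pure-Python sketch on a toy one-weight least-squares model (the model and data are illustrative, not from the training run):

```python
# Gradient accumulation sketch: 2 micro-batches of size 2 reproduce exactly
# one update with effective batch size 4. Model: y = w * x, squared loss.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # (x, target) pairs

def grad(w, batch):
    # d/dw of mean over batch of (w*x - y)^2  ->  mean of 2*(w*x - y)*x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

lr = 0.01
micro_batches = [data[:2], data[2:]]

# (a) accumulate micro-batch gradients, then apply one update
w_acc = 0.0
g = sum(grad(w_acc, mb) for mb in micro_batches) / len(micro_batches)
w_acc -= lr * g

# (b) a single update on the full batch (effective batch size 4)
w_full = 0.0
w_full -= lr * grad(w_full, data)

print(w_acc, w_full)  # identical updates
```

This is why accumulation trades wall-clock time for batch size without changing the optimization trajectory (for loss functions that average over the batch).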
Speeds, Sizes, Times
- Throughput: 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- VRAM Savings: 38.7% reduction in peak memory usage.
- Scaling: $O(N \cdot K)$ attention complexity (N = sequence length, K = compressed KV size) achieved via Query Chunking and KV-compression.
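A schematic sketch of how query chunking against a compressed KV set yields $O(N \cdot K)$ cost: scores are computed one query chunk at a time, so the peak score-matrix size is `chunk_size * K` rather than `N * N`. All names here are illustrative, not the repository's API.

```python
# Chunked score computation against K compressed KV slots: total work is
# N * K dot products, and peak memory for scores is CHUNK * K, independent of N.

N, K, D, CHUNK = 8, 3, 4, 2   # queries, compressed KV slots, head dim, chunk

queries = [[float(i + j) for j in range(D)] for i in range(N)]
keys = [[float(j - i) for j in range(D)] for i in range(K)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

peak_scores = 0
all_scores = []
for start in range(0, N, CHUNK):
    chunk = queries[start:start + CHUNK]
    scores = [[dot(q, k) for k in keys] for q in chunk]  # a CHUNK x K block
    peak_scores = max(peak_scores, len(scores) * len(scores[0]))
    all_scores.extend(scores)

print(peak_scores)      # CHUNK * K = 6, regardless of sequence length N
print(len(all_scores))  # N rows of K scores -> O(N * K) total work
```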
Technical Specifications
Model Architecture and Objective
MaximusLLM utilizes three core innovations:
- MAXIS Loss: A Matryoshka-structured loss using Dynamic Variance Ghost Logits to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
- RandNLA Attention: Bifurcates the KV-cache into a Top-K Detail Path (lossless) and a Causal Kronecker Sketch Path (compressed background). It uses an Asymmetric Causal Mask to remain strictly autoregressive.
- Fisher SVD: Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
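The diagonal empirical Fisher used above can be sketched numerically: accumulate the squared per-sample gradient of the loss with respect to each weight, $F_{ii} = \sum_n (\partial L_n / \partial W_i)^2$. The toy linear model and data below are illustrative only; the paper applies this statistic to initialize latent spaces.

```python
# Toy diagonal empirical Fisher: F_ii = sum over samples of (dL/dW_i)^2.

weights = [0.5, -1.0]                               # a 2-weight linear model
samples = [([1.0, 2.0], 1.0), ([0.0, 1.0], -2.0)]   # (inputs, target) pairs

def per_sample_grad(w, x, y):
    # L = (w . x - y)^2  ->  dL/dw_i = 2 * (w . x - y) * x_i
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2 * err * xi for xi in x]

fisher_diag = [0.0, 0.0]
for x, y in samples:
    g = per_sample_grad(weights, x, y)
    fisher_diag = [f + gi ** 2 for f, gi in zip(fisher_diag, g)]

print(fisher_diag)  # larger entries mark weights the loss is most sensitive to
```

Weighting (or whitening) parameters by this diagonal before an SVD keeps the directions that matter most to the loss, which is the intuition behind preserving pre-trained behavior across architectural transitions.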
Compute Infrastructure
Hardware
- Primary: NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- Secondary: Benchmarked on NVIDIA L4 (24GB VRAM).
Software
- Framework: PyTorch 2.5+ (2.9+ for training)
- Compiler: torch.compile (hollow compilation of inner blocks for stability)
Citation
MAXIS Loss:
@article{gamaleldin2026maxis,
title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
author={Gamaleldin, Yousef},
journal={SSRN: Artificial Intelligence eJournal},
year={2026}
}
RandNLA Attention:
@article{gamaleldin2026randnla,
title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
author={Gamaleldin, Yousef},
journal={SSRN: Artificial Intelligence eJournal},
year={2026}
}
Model Card Contact
Yousef Gamaleldin - yrafat38@gmail.com