Model Card for MaximusLLM (190M)

MaximusLLM is a long-context language model built around a hyper-efficient architecture and training pipeline. It introduces a new paradigm for scaling to long contexts while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.

Model Details

Model Description

  • Developed by: Yousef Gamaleldin (Independent Researcher)
  • Model type: Transformer with Bifurcated Latent Attention
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: Trained from scratch (Base) followed by Instruction Pre-training.
  • Tokenizer: Gemma 3 (262,144 vocab size)

Model Sources

  • Repository: yousefg/MaximusLLM
  • Technical Reports:
    • MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training
    • Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA

Bias, Risks, and Limitations

MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations.

How to Get Started with the Model

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn
from transformers import AutoTokenizer

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

# Load the tokenizer (Gemma 3, see Model Details above).
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))

Training Details

Training Data

  1. Pre-training: A high-quality subset of HuggingFaceFW/fineweb-edu.
  2. Narrative Alignment: roneneldan/TinyStories to stabilize linguistic fluidity.
  3. Instruction Alignment: HuggingFaceH4/ultrachat_200k using a multi-turn conversational format.
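The multi-turn conversational format follows the Gemma-style turn markers shown in the quick-start prompt above. A minimal illustrative helper (the function name is an assumption; the repo's actual formatting code may differ) could render a conversation like this:

```python
def to_chat_format(turns):
    """Render (role, text) pairs into Gemma-style turn markers and
    leave the prompt open for the model's next turn."""
    out = []
    for role, text in turns:
        out.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    out.append("<start_of_turn>model\n")
    return "".join(out)

prompt = to_chat_format([("user", "What is the capital of France?")])
print(prompt)
```

This reproduces the exact prompt string used in the "How to Get Started" section.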

Training Procedure

Maximus utilizes a specialized training pipeline to maintain FP32 master weight stability while achieving FP16 throughput.

Training Hyperparameters

  • Optimizers:
    • Muon: Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (Pre-train) and 0.005 (SFT).
    • AdamW: Applied to Embeddings, Head, and Norms (LR 4e-4).
  • Loss Function: MAXIS Loss (Unnormalized Ghost Logits + Matryoshka Auxiliary loss).
  • Precision: FP32 Master Weights, FP16 Mixed Precision (Autocast).
  • Effective Batch Size: 64 to 256 (via Gradient Accumulation).
  • Context Length: Scaled from 2,048 to 8,192 native (Long-context phase).
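The dual-optimizer split above routes 2D weight matrices and everything else into separate parameter groups. A hedged sketch of that routing (plain SGD stands in for Muon here; the module layout is illustrative, not the actual model):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),   # -> AdamW group
    torch.nn.Linear(64, 64),        # weight -> Muon-style group, bias -> AdamW
    torch.nn.LayerNorm(64),         # -> AdamW group
)

matrix_params, other_params = [], []
for module in model.modules():
    # Embedding weights are also 2D, so route by module role, not just ndim.
    if isinstance(module, torch.nn.Linear):
        matrix_params.append(module.weight)
        if module.bias is not None:
            other_params.append(module.bias)
    elif isinstance(module, (torch.nn.Embedding, torch.nn.LayerNorm)):
        other_params.extend(module.parameters())

muon_like = torch.optim.SGD(matrix_params, lr=0.02)   # stand-in for Muon
adamw = torch.optim.AdamW(other_params, lr=4e-4)
```

The role-based routing (rather than a bare `p.ndim == 2` check) matters because embedding tables are 2D but belong in the AdamW group per the table above.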

Speeds, Sizes, Times

  • Throughput: 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
  • VRAM Savings: 38.7% reduction in peak memory usage.
  • Scaling: $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.
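The $O(N \cdot K)$ claim can be illustrated with a toy chunked-attention loop: queries are processed in fixed-size chunks against a compressed KV cache of $K$ entries, so cost grows with $N \cdot K$ instead of $N^2$. All names, sizes, and the omission of the causal mask are simplifications for illustration:

```python
import torch

def chunked_attention(q, k, v, chunk=128):
    """q: (N, d); k, v: (K, d) with K << N (the compressed cache)."""
    outs = []
    for i in range(0, q.shape[0], chunk):
        qc = q[i:i + chunk]                                      # (chunk, d)
        attn = torch.softmax(qc @ k.T / k.shape[-1] ** 0.5, dim=-1)
        outs.append(attn @ v)                                    # (chunk, d)
    return torch.cat(outs)                                       # (N, d)

N, K, d = 1024, 64, 32
q, k, v = torch.randn(N, d), torch.randn(K, d), torch.randn(K, d)
out = chunked_attention(q, k, v)
assert out.shape == (N, d)
```

Because each query row only ever scores against the $K$ compressed entries, peak activation memory is bounded by the chunk size rather than the full sequence length.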

Technical Specifications

Model Architecture and Objective

MaximusLLM utilizes three core innovations:

  1. MAXIS Loss: A Matryoshka-structured loss using Dynamic Variance Ghost Logits to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
  2. RandNLA Attention: Bifurcates the KV-cache into a Top-K Detail Path (lossless) and a Causal Kronecker Sketch Path (compressed background). It uses an Asymmetric Causal Mask to remain strictly autoregressive.
  3. Fisher SVD: Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
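The Fisher term above, $\sum (\frac{\partial L}{\partial W})^2$, is the diagonal empirical Fisher: accumulated squared gradients. A hedged sketch of using it to weight an SVD-based low-rank factorization (the row-wise weighting scheme here is an illustrative assumption, not necessarily the repo's exact method):

```python
import torch

torch.manual_seed(0)
W = torch.randn(32, 32, requires_grad=True)

# Accumulate squared gradients over batches: the diagonal empirical Fisher.
fisher = torch.zeros_like(W)
for _ in range(16):
    x = torch.randn(8, 32)
    loss = (x @ W).square().mean()          # stand-in loss
    (g,) = torch.autograd.grad(loss, W)
    fisher += g ** 2

# Scale rows by their Fisher importance before the SVD, then undo the scale,
# so the low-rank init preserves the directions the loss is most sensitive to.
row_fisher = fisher.sum(dim=1, keepdim=True).sqrt() + 1e-8
U, S, Vh = torch.linalg.svd(row_fisher * W.detach())
rank = 8
A = (U[:, :rank] * S[:rank]) / row_fisher   # (32, rank), importance unscaled
W_low = A @ Vh[:rank]                       # rank-8 approximation of W
assert W_low.shape == W.shape
```

Row-wise (rather than element-wise) unscaling keeps the result an exact rank-`rank` product, which is what a latent-space initialization needs.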

Compute Infrastructure

Hardware

  • Primary: NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
  • Secondary: Benchmarked on NVIDIA L4 (24GB VRAM).

Software

  • Framework: PyTorch 2.5+ (2.9+ for training).
  • Compiler: torch.compile (Hollow-compilation of inner blocks for stability).

Citation

MAXIS Loss:

@article{gamaleldin2026maxis,
  title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}

RandNLA Attention:

@article{gamaleldin2026randnla,
  title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}

Model Card Contact

Yousef Gamaleldin - yrafat38@gmail.com
