NabuOCR: Neural Cuneiform Transliteration
NabuOCR is an OCR model for transcribing ancient cuneiform tablets directly from images to Unicode. Named after Nabu, the Mesopotamian god of writing and scribes, this model bridges a 5,000-year gap between humanity's earliest writing system and cutting-edge computer vision.
NabuOCR was made for the ERNIE AI Developer Challenge; you can watch the submission video here: https://www.youtube.com/embed/hqmjepRLdfU?si=aJHpWdc12ThgWIxD
Overview
NabuOCR processes images of cuneiform tablets and outputs Unicode transcriptions of cuneiform signs. While Assyriologists typically use ATF (ASCII Transliteration Format), ATF's complexity proved too challenging for the 0.9B model within training constraints. Unicode transcription is a meaningful intermediate step: a model that can reliably identify which signs appear on a tablet is doing real work, even if a human still needs to add the scholarly apparatus.
Built by fine-tuning PaddleOCR-VL on cuneiform tablet images, NabuOCR can handle multi-view images of tablets and produce transcriptions of each face using markers like @obverse, @reverse, @left, @right, @top, and @bottom.
Features
NabuOCR is based on the efficient 0.9B parameter PaddleOCR-VL model with an expanded tokenizer that includes all unique cuneiform signs from the dataset plus special face markers. The model was trained on diverse tablet conditions from multiple periods.
It employs end-to-end transcription rather than a multi-stage pipeline, allowing it to leverage full tablet context when making predictions. It handles multi-view images containing obverse, reverse, and edge views all at once.
Example Output
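The following is an illustrative sketch of the output format only; the signs below are placeholders, not a transcription of any real tablet. Each face marker is followed by the Unicode signs read from that face, one tablet line per output line:

```
@obverse
𒀭𒈗𒆠
𒂗𒆤𒆷
@reverse
𒀭𒈾𒁍
```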
Training
Base Model
NabuOCR is built on PaddleOCR-VL with an expanded tokenizer vocabulary to include cuneiform Unicode codepoints and special face markers (@obverse, @reverse, @left, @right, @top, @bottom).
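A minimal sketch of what that vocabulary expansion looks like, assuming a Hugging Face-style tokenizer and model; the checkpoint name and the tiny sign list are illustrative, not the actual training code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Checkpoint name and loading details are assumptions; PaddleOCR-VL may need trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("PaddlePaddle/PaddleOCR-VL", trust_remote_code=True)

face_markers = ["@obverse", "@reverse", "@left", "@right", "@top", "@bottom"]
train_targets = ["𒀭𒈗𒆠", "𒂗𒆤"]  # illustrative; really the full dataset's Unicode targets
cuneiform_signs = sorted({ch for target in train_targets for ch in target})

tokenizer.add_tokens(cuneiform_signs)                                   # one token per sign
tokenizer.add_special_tokens({"additional_special_tokens": face_markers})
model.resize_token_embeddings(len(tokenizer))                           # grow embeddings to match
```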
Dataset
The training data was built from the Cuneiform Digital Library Initiative (CDLI). Starting from 135,255 ATF transliterations, aggressive filtering removed damaged tablets, tablets outside the Sumerian/Akkadian scope, entries without images, and low-quality photographs (black-and-white images or noisy backgrounds). The result was 33,257 high-quality examples, split into 32,257 training samples and 1,000 held-out test samples. The ATF was converted to Unicode for the final training targets.
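The conversion itself amounts to mapping each ATF sign reading to its Unicode codepoint. A minimal sketch, assuming a sign-list lookup; the mapping shown covers only a few readings, and a real converter must also handle determinatives, compound signs, and broken readings:

```python
import unicodedata

# Tiny illustrative mapping from ATF readings to official Unicode sign names.
# A full converter needs a complete sign list (e.g. derived from ORACC).
ATF_TO_SIGN = {
    "an": "CUNEIFORM SIGN AN",
    "lugal": "CUNEIFORM SIGN LUGAL",
    "ki": "CUNEIFORM SIGN KI",
}

def atf_line_to_unicode(line: str) -> str:
    """Convert one ATF line like 'lugal ki' into Unicode cuneiform."""
    return "".join(unicodedata.lookup(ATF_TO_SIGN[r]) for r in line.split())

print(atf_line_to_unicode("lugal ki"))  # -> 𒈗𒆠
```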
SFT
The model was trained using Unsloth's FastVisionModel wrapper for full fine-tuning with gradient checkpointing (a configuration sketch follows the hyperparameter list):
- Epochs: 2 (~32,000 steps)
- Batch size: 2
- Learning rate: 2e-5 with linear decay
- Warmup: 5% of training steps
- Optimizer: AdamW (8-bit)
- Precision: BF16
- Max sequence length: 16,000 tokens
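A sketch of how these hyperparameters map onto an Unsloth + TRL training setup. The checkpoint name, dataset handling, and collator details are assumptions; only the hyperparameters listed above come from the actual run:

```python
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "PaddlePaddle/PaddleOCR-VL",           # assumed base checkpoint name
    full_finetuning=True,                  # full fine-tune, not LoRA
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,           # assumed: 32,257 image/Unicode pairs
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    args=SFTConfig(
        num_train_epochs=2,
        per_device_train_batch_size=2,
        learning_rate=2e-5,
        lr_scheduler_type="linear",
        warmup_ratio=0.05,
        optim="adamw_8bit",
        bf16=True,
        max_seq_length=16_000,
        remove_unused_columns=False,       # required for vision inputs
        dataset_text_field="",             # text comes from the vision collator
        dataset_kwargs={"skip_prepare_dataset": True},
    ),
)
trainer.train()
```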
GRPO
Group Relative Policy Optimization (GRPO) was applied on top of the SFT checkpoint using the DR-GRPO loss. Unlike SFT, which learns directly from ground-truth targets, GRPO generates multiple completions per image, scores them with reward functions, and updates the model to favor higher-scoring outputs.
- LoRA rank: 256 (RSLoRA with α=16)
- Trainable parameters: 239M of 1.2B (20%)
- Generations per prompt: 4
- Batch size: 16
- Learning rate: 5e-6 with cosine decay
- Warmup: 3% of training steps
- Optimizer: AdamW (8-bit)
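A sketch of the corresponding setup, assuming Unsloth's LoRA wrapper and TRL's GRPO trainer; the checkpoint name and any settings not listed above are assumptions, and `loss_type="dr_grpo"` selects the DR-GRPO loss mentioned above:

```python
from unsloth import FastVisionModel
from trl import GRPOTrainer, GRPOConfig

model, tokenizer = FastVisionModel.from_pretrained("boatbomber/NabuOCR-sft")  # assumed SFT checkpoint
model = FastVisionModel.get_peft_model(
    model,
    r=256,
    lora_alpha=16,
    use_rslora=True,                      # rank-stabilized LoRA scaling
)

args = GRPOConfig(
    num_generations=4,                    # completions sampled per image
    per_device_train_batch_size=16,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_8bit",
    loss_type="dr_grpo",                  # DR-GRPO variant of the GRPO loss
)

trainer = GRPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,          # assumed prompt/image dataset
    reward_funcs=[combined_reward],       # see the reward sketch below (signature simplified)
)
trainer.train()
```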
The reward function combined five components: weighted Token Error Rate using glyph visual similarity and curriculum learning, length deviation penalty, repetition penalty, line structure accuracy, and cuneiform character ratio. The adapter was merged back into the base model at 16-bit precision.
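A simplified sketch of such a combined reward. The weights are illustrative, and the glyph-similarity weighting and curriculum schedule are omitted for brevity (plain edit distance stands in for the weighted TER):

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Plain edit distance between two sign sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_cuneiform(ch: str) -> bool:
    # Covers the Unicode cuneiform blocks (signs, numbers/punctuation, Early Dynastic).
    return 0x12000 <= ord(ch) <= 0x1254F

def combined_reward(prediction: str, reference: str) -> float:
    pred_body = prediction.replace("\n", "")
    ref_body = reference.replace("\n", "")

    # 1. Accuracy: 1 - TER (edit distance normalized by reference length).
    ter = levenshtein(pred_body, ref_body) / max(len(ref_body), 1)
    accuracy = max(0.0, 1.0 - ter)

    # 2. Length deviation penalty.
    length_dev = abs(len(pred_body) - len(ref_body)) / max(len(ref_body), 1)

    # 3. Repetition penalty: fraction of immediately repeated 3-grams.
    grams = [pred_body[i:i + 3] for i in range(len(pred_body) - 2)]
    repetition = sum(g1 == g2 for g1, g2 in zip(grams, grams[3:])) / max(len(grams), 1)

    # 4. Line-structure accuracy: how close the line count is to the reference.
    pred_lines, ref_lines = prediction.count("\n") + 1, reference.count("\n") + 1
    structure = 1.0 - abs(pred_lines - ref_lines) / max(ref_lines, 1)

    # 5. Cuneiform character ratio (face markers excluded).
    body = re.sub(r"@\w+", "", pred_body)
    ratio = sum(map(is_cuneiform, body)) / max(len(body), 1)

    # Illustrative weights; the actual run's weights are not published here.
    return 0.5 * accuracy - 0.1 * length_dev - 0.1 * repetition + 0.15 * structure + 0.15 * ratio
```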
Story
For a more detailed account of how this model was trained, see STORY.md. For the training code, see training/.
Performance
Evaluated on a held-out test set of 1,000 tablets using Token Error Rate (TER). Lower is better; 0% means a perfect transcription.
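TER here is the sign-level edit distance normalized by reference length; stated minimally, reusing the `levenshtein` helper from the reward sketch above:

```python
ter = levenshtein(predicted_signs, reference_signs) / len(reference_signs)
```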
Usage
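A minimal inference sketch, assuming the checkpoint loads through `transformers` with remote code enabled; the prompt string and generation settings are assumptions, not the model's documented interface:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

repo = "boatbomber/NabuOCR"
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype="bfloat16")

image = Image.open("tablet.jpg").convert("RGB")   # multi-view photo of the tablet
inputs = processor(text="Transcribe this tablet.", images=image, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids, skip_special_tokens=False)[0])
# Expect face markers such as @obverse / @reverse followed by Unicode signs.
```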
Best Practices
- Provide high-resolution images when possible (a minimum of 800×800 pixels is recommended) and include all visible sides of the tablet in a single image.
- Ensure the photographs are well lit and have high contrast so the signs are legible, and crop out excessive background (see the preprocessing sketch below).
- For more details on the best image format, see the CDLI guidelines.
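A small preprocessing sketch following those recommendations, using Pillow; the thresholds are illustrative, and background cropping is left to the photographer:

```python
from PIL import Image, ImageOps

def prepare_tablet_image(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")

    # Upscale so the short side reaches the recommended 800 px minimum.
    short = min(img.size)
    if short < 800:
        scale = 800 / short
        img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)

    # Stretch the histogram for higher contrast; clip 1% of extreme pixels.
    return ImageOps.autocontrast(img, cutoff=1)
```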
Limitations
NabuOCR performs best on well-preserved tablets with clear impressions and may struggle with heavily damaged or eroded sections.
Note that the model supports only Sumerian and Akkadian, and its support for complex literary texts with unusual sign variants is limited.
Citation
If you use NabuOCR in your research, please cite:
```bibtex
@software{nabuocr2025,
  title={NabuOCR: Neural Cuneiform Transliteration},
  author={Williams, Zack},
  year={2025},
  url={https://huggingface.co/boatbomber/NabuOCR}
}
```
Acknowledgments
- Built on PaddleOCR-VL
- Training data courtesy of the Cuneiform Digital Library Initiative (CDLI)
- ATF format specification from ORACC
- Inspired by CuneiML: A Cuneiform Dataset for Machine Learning