SQL OCR LoRA (synthetic, CPU-friendly)

This repository hosts a tiny GPT-2–style LoRA adapter trained on a synthetic SQL Q&A corpus that mimics table-structure reasoning prompts. The model and tokenizer are initialized from scratch to avoid external downloads and keep the pipeline CPU-friendly.

Model Details

  • Architecture: GPT-2-style causal LM (2 layers, 4 heads, hidden size 128)
  • Tokenizer: Word-level tokenizer trained on the synthetic prompts/answers with special tokens [BOS], [EOS], [PAD], [UNK]
  • Task: Text generation / instruction following for SQL-style outputs
  • Base model: local-synthetic-gpt2 (initialized from scratch; see the sketch below)
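
Both the model and the tokenizer are created locally rather than downloaded. The following is a minimal initialization sketch assuming the hyperparameters listed above; the exact setup script is not part of this card, and synthetic_texts is a placeholder name for the corpus of prompt/answer strings.

from transformers import GPT2Config, GPT2LMHeadModel
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a word-level tokenizer on the synthetic corpus with the special tokens listed above
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordLevelTrainer(special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"])
tokenizer.train_from_iterator(synthetic_texts, trainer)  # synthetic_texts: list of prompt/answer strings (placeholder)

# Initialize a tiny GPT-2-style model from scratch (no pretrained weights)
config = GPT2Config(n_layer=2, n_head=4, n_embd=128, vocab_size=tokenizer.get_vocab_size())
model = GPT2LMHeadModel(config)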

Training

  • Data: 64 synthetic Spider-inspired text pairs combining schema prompts with target SQL answers (no real images)
  • Batch size: 2 (gradient accumulation 1)
  • Max steps: 30
  • Precision: fp32 on CPU
  • LoRA: rank 8, alpha 16 on the c_attn modules (a configuration sketch follows this list)
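
The hyperparameters above map onto a peft/transformers configuration roughly as follows. This is an illustrative sketch, not the exact training script; output_dir is a placeholder.

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# Wrap the scratch GPT-2 model with a LoRA adapter on the fused attention projection
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# CPU-friendly fp32 run: small batch, no gradient accumulation, 30 optimizer steps
training_args = TrainingArguments(
    output_dir="sql-ocr-lora",          # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=30,
)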

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hub repository
model = AutoModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")

# Build a schema-style prompt and generate a SQL-style completion
text = "<|system|>Given the database schema displayed above for database 'sales_0', analyze relations...<|end|><|user|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
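
Because generate returns the prompt tokens followed by the continuation, you can decode only the newly generated part:

# Slice off the prompt tokens before decoding to print just the generated text
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))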

Limitations & Notes

  • This is a demonstration LoRA trained on synthetic text-only data; it is not a production OCR or SQL model.
  • The tokenizer and model are tiny and intended for quick CPU experiments only.
  • Because training is fully synthetic, outputs will be illustrative rather than accurate for real schemas.