SQL OCR LoRA (synthetic, CPU-friendly)

This repository hosts a tiny GPT-2–style LoRA adapter trained on a synthetic SQL Q&A corpus that mimics table-structure reasoning prompts. The model and tokenizer are initialized from scratch to avoid external downloads and keep the pipeline CPU-friendly.

Model Details

  • Architecture: GPT-2-style causal LM (2 layers, 4 heads, hidden size 128)
  • Tokenizer: Word-level tokenizer trained on the synthetic prompts/answers with special tokens [BOS], [EOS], [PAD], [UNK]
  • Task: Text generation / instruction following for SQL-style outputs
  • Base model: local-synthetic-gpt2 (initialized from scratch; see the sketch below)
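
Both the model and the tokenizer are created locally rather than downloaded. The following is a minimal initialization sketch assuming the hyperparameters listed above; the exact setup script is not part of this card, and synthetic_texts is a placeholder name for the corpus of prompt/answer strings.

from transformers import GPT2Config, GPT2LMHeadModel
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a word-level tokenizer on the synthetic corpus with the special tokens listed above
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordLevelTrainer(special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"])
tokenizer.train_from_iterator(synthetic_texts, trainer)  # synthetic_texts: list of prompt/answer strings (placeholder)

# Initialize a tiny GPT-2-style model from scratch (no pretrained weights)
config = GPT2Config(n_layer=2, n_head=4, n_embd=128, vocab_size=tokenizer.get_vocab_size())
model = GPT2LMHeadModel(config)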

Training

  • Data: 64 synthetic Spider-inspired text pairs combining schema prompts with target SQL answers (no real images)
  • Batch size: 2 (gradient accumulation 1)
  • Max steps: 30
  • Precision: fp32 on CPU
  • LoRA: rank 8, alpha 16 on the c_attn modules (a configuration sketch follows this list)
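
The hyperparameters above map onto a peft/transformers configuration roughly as follows. This is an illustrative sketch, not the exact training script; output_dir is a placeholder.

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# Wrap the scratch GPT-2 model with a LoRA adapter on the fused attention projection
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# CPU-friendly fp32 run: small batch, no gradient accumulation, 30 optimizer steps
training_args = TrainingArguments(
    output_dir="sql-ocr-lora",          # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=30,
)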

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hub repository
model = AutoModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")

# Build a schema-style prompt and generate a SQL-style completion
text = "<|system|>Given the database schema displayed above for database 'sales_0', analyze relations...<|end|><|user|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
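
Because generate returns the prompt tokens followed by the continuation, you can decode only the newly generated part:

# Slice off the prompt tokens before decoding to print just the generated text
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))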

Limitations & Notes

  • This is a demonstration LoRA trained on synthetic text-only data; it is not a production OCR or SQL model.
  • The tokenizer and model are tiny and intended for quick CPU experiments only.
  • Because training is fully synthetic, outputs will be illustrative rather than accurate for real schemas.