Upload README.md with huggingface_hub
README.md
---
tags:
- lora
- transformers
base_model: local-synthetic-gpt2
license: mit
pipeline_tag: text-generation
---

# SQL OCR LoRA (synthetic, CPU-friendly)

This repository hosts a tiny GPT-2–style LoRA adapter trained on a synthetic SQL Q&A corpus that mimics table-structure reasoning prompts. The model and tokenizer are initialized from scratch to avoid external downloads and keep the pipeline CPU-friendly.

## Model Details
- **Architecture:** GPT-2 style causal LM (2 layers, 4 heads, hidden size 128)
- **Tokenizer:** Word-level tokenizer trained on the synthetic prompts/answers with special tokens `[BOS]`, `[EOS]`, `[PAD]`, `[UNK]` (see the sketch after this list)
- **Task:** Text generation / instruction following for SQL-style outputs
- **Base model:** `local-synthetic-gpt2` (initialized from scratch)

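For context, the sketch below shows one way a model and tokenizer of this shape could be assembled from scratch with `transformers` and `tokenizers`. It illustrates the details above; it is not the repository's actual training script, and the placeholder `corpus` texts are assumptions.

```python
# Minimal sketch (assumed, not the original training script): build a tiny
# GPT-2-style config and a word-level tokenizer matching the Model Details.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

corpus = ["SELECT name FROM users ;", "How many orders per customer ?"]  # placeholder texts

# Word-level tokenizer with the special tokens listed above.
tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"])
tok.train_from_iterator(corpus, trainer=trainer)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    bos_token="[BOS]", eos_token="[EOS]", pad_token="[PAD]", unk_token="[UNK]",
)

# 2 layers, 4 heads, hidden size 128, initialized from scratch (no download).
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_layer=2, n_head=4, n_embd=128,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
```
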
## Training
- **Data:** 64 synthetic Spider-inspired text pairs combining schema prompts with target SQL answers (no real images)
- **Batch size:** 2 (gradient accumulation 1)
- **Max steps:** 30
- **Precision:** fp32 on CPU
- **Regularization:** LoRA rank 8, alpha 16 on `c_attn` modules (see the configuration sketch after this list)

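A hedged sketch of how these hyperparameters map onto a PEFT configuration is shown below; the output directory and optimizer defaults are assumptions, and `model` refers to the base causal LM (for example the one sketched under Model Details).

```python
# Sketch of the LoRA setup listed above (an assumption, not the original script).
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# Rank-8 / alpha-16 adapters on the GPT-2 attention projection `c_attn`.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
peft_model = get_peft_model(model, lora_config)  # `model` is the base causal LM

# Batch size 2, no gradient accumulation, 30 steps; fp32 is the default, so no
# mixed-precision flags are set. `use_cpu` keeps the run on CPU (older
# transformers versions use `no_cuda=True` instead).
training_args = TrainingArguments(
    output_dir="sql-ocr-lora",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=30,
    use_cpu=True,
)
```
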
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and the word-level tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")

# Schema-style instruction prompt; the model completes the user/assistant turn.
text = "<|system|>Given the database schema displayed above for database 'sales_0', analyze relations...<|end|><|user|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
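If the repository stores only the adapter weights rather than a merged checkpoint (an assumption here, depending on how the adapter was exported), the model may need to be loaded through PEFT instead:

```python
# Only needed if the repo contains LoRA adapter weights rather than a merged model.
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
```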

## Limitations & Notes
- This is a demonstration LoRA trained on synthetic text-only data; it is **not** a production OCR or SQL model.
- The tokenizer and model are tiny and intended for quick CPU experiments only.
- Because training is fully synthetic, outputs will be illustrative rather than accurate for real schemas.