---
tags:
- lora
- transformers
base_model: local-synthetic-gpt2
license: mit
task: text-generation
---

# SQL OCR LoRA (synthetic, CPU-friendly)

This repository hosts a tiny GPT-2–style LoRA adapter trained on a synthetic SQL Q&A corpus that mimics table-structure reasoning prompts. The model and tokenizer are initialized from scratch to avoid external downloads and keep the pipeline CPU-friendly.

## Model Details

- **Architecture:** GPT-2 style causal LM (2 layers, 4 heads, hidden size 128)
- **Tokenizer:** Word-level tokenizer trained on the synthetic prompts/answers with special tokens `[BOS]`, `[EOS]`, `[PAD]`, `[UNK]` (see the sketch below)
- **Task:** Text generation / instruction following for SQL-style outputs
- **Base model:** `local-synthetic-gpt2` (initialized from scratch)
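
The configuration and tokenizer can be reproduced roughly as follows. This is a minimal sketch rather than the exact training script; the `corpus` list and the initial `vocab_size` are illustrative assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

# Tiny GPT-2 style config matching the sizes listed above; vocab_size is a placeholder.
config = GPT2Config(n_layer=2, n_head=4, n_embd=128, vocab_size=2000)
model = GPT2LMHeadModel(config)

# Word-level tokenizer trained on the synthetic prompts/answers.
# `corpus` stands in for the 64 synthetic pairs described under Training.
corpus = ["How many customers are there? SELECT count(*) FROM customers"]
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordLevelTrainer(special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="[BOS]",
    eos_token="[EOS]",
    pad_token="[PAD]",
    unk_token="[UNK]",
)
# Keep the embedding table in sync with the actual tokenizer vocabulary.
model.resize_token_embeddings(len(hf_tokenizer))
```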

## Training

- **Data:** 64 synthetic Spider-inspired text pairs combining schema prompts with target SQL answers (no real images)
- **Batch size:** 2 (gradient accumulation 1)
- **Max steps:** 30
- **Precision:** fp32 on CPU
- **LoRA:** rank 8, alpha 16, applied to the `c_attn` modules (see the sketch below)
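
A rough sketch of the adapter and trainer settings listed above, reusing the `model` from the previous snippet; dataset preparation and the `Trainer` call are omitted, and the output directory is an assumption.

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# LoRA rank 8, alpha 16 on the GPT-2 attention projection.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
peft_model = get_peft_model(model, lora_config)

# Matches the hyperparameters above; runs in fp32 on CPU when no GPU is present.
training_args = TrainingArguments(
    output_dir="sql-ocr-lora",          # assumed path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=30,
    fp16=False,
)
```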

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")

text = "<|system|>Given the database schema displayed above for database 'sales_0', analyze relations...<|end|><|user|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
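
If you prefer to apply the adapter explicitly with `peft`, the pattern below should work, assuming the scratch-initialized base checkpoint is saved locally (the `./local-synthetic-gpt2` path is an assumption).

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("./local-synthetic-gpt2")  # assumed local path
model = PeftModel.from_pretrained(base, "JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")
```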

## Limitations & Notes

- This is a demonstration LoRA trained on synthetic text-only data; it is **not** a production OCR or SQL model.
- The tokenizer and model are tiny and intended for quick CPU experiments only.
- Because training is fully synthetic, outputs will be illustrative rather than accurate for real schemas.