Upload README.md with huggingface_hub
README.md
---
tags:
- lora
- transformers
base_model: local-synthetic-gpt2
license: mit
pipeline_tag: text-generation
---

# SQL OCR LoRA (synthetic, CPU-friendly)

This repository hosts a tiny GPT-2–style LoRA adapter trained on a synthetic SQL Q&A corpus that mimics table-structure reasoning prompts. The model and tokenizer are initialized from scratch to avoid external downloads and keep the pipeline CPU-friendly.

## Model Details
- **Architecture:** GPT-2 style causal LM (2 layers, 4 heads, hidden size 128)
- **Tokenizer:** Word-level tokenizer trained on the synthetic prompts/answers with special tokens `[BOS]`, `[EOS]`, `[PAD]`, `[UNK]` (see the sketch after this list)
- **Task:** Text generation / instruction following for SQL-style outputs
- **Base model:** `local-synthetic-gpt2` (initialized from scratch)

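For context, the sketch below shows one way a model and tokenizer of this shape could be assembled from scratch with `transformers` and `tokenizers`. It illustrates the details above; it is not the repository's actual training script, and the placeholder `corpus` texts are assumptions.

```python
# Minimal sketch (assumed, not the original training script): build a tiny
# GPT-2-style config and a word-level tokenizer matching the Model Details.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

corpus = ["SELECT name FROM users ;", "How many orders per customer ?"]  # placeholder texts

# Word-level tokenizer with the special tokens listed above.
tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"])
tok.train_from_iterator(corpus, trainer=trainer)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    bos_token="[BOS]", eos_token="[EOS]", pad_token="[PAD]", unk_token="[UNK]",
)

# 2 layers, 4 heads, hidden size 128, initialized from scratch (no download).
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_layer=2, n_head=4, n_embd=128,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
```
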
## Training
- **Data:** 64 synthetic Spider-inspired text pairs combining schema prompts with target SQL answers (no real images)
- **Batch size:** 2 (gradient accumulation 1)
- **Max steps:** 30
- **Precision:** fp32 on CPU
- **Regularization:** LoRA rank 8, alpha 16 on `c_attn` modules (see the configuration sketch after this list)

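A hedged sketch of how these hyperparameters map onto a PEFT configuration is shown below; the output directory and optimizer defaults are assumptions, and `model` refers to the base causal LM (for example the one sketched under Model Details).

```python
# Sketch of the LoRA setup listed above (an assumption, not the original script).
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# Rank-8 / alpha-16 adapters on the GPT-2 attention projection `c_attn`.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
peft_model = get_peft_model(model, lora_config)  # `model` is the base causal LM

# Batch size 2, no gradient accumulation, 30 steps; fp32 is the default, so no
# mixed-precision flags are set. `use_cpu` keeps the run on CPU (older
# transformers versions use `no_cuda=True` instead).
training_args = TrainingArguments(
    output_dir="sql-ocr-lora",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=30,
    use_cpu=True,
)
```
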
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and the word-level tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")

# Schema-style instruction prompt; the model completes the user/assistant turn.
text = "<|system|>Given the database schema displayed above for database 'sales_0', analyze relations...<|end|><|user|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
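If the repository stores only the adapter weights rather than a merged checkpoint (an assumption here, depending on how the adapter was exported), the model may need to be loaded through PEFT instead:

```python
# Only needed if the repo contains LoRA adapter weights rather than a merged model.
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
```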

## Limitations & Notes
- This is a demonstration LoRA trained on synthetic text-only data; it is **not** a production OCR or SQL model.
- The tokenizer and model are tiny and intended for quick CPU experiments only.
- Because training is fully synthetic, outputs will be illustrative rather than accurate for real schemas.