JohnnyZeppelin committed
Commit b64831d · verified · 1 Parent(s): 31f7c97

Upload README.md with huggingface_hub

Files changed (1): README.md (+41 -3)
README.md CHANGED
@@ -1,3 +1,41 @@
- ---
- license: mit
- ---

---
tags:
- lora
- transformers
base_model: local-synthetic-gpt2
license: mit
pipeline_tag: text-generation
---

# SQL OCR LoRA (synthetic, CPU-friendly)

This repository hosts a tiny GPT-2–style LoRA adapter trained on a synthetic SQL Q&A corpus that mimics table-structure reasoning prompts. The model and tokenizer are initialized from scratch to avoid external downloads and keep the pipeline CPU-friendly.

## Model Details
- **Architecture:** GPT-2 style causal LM (2 layers, 4 heads, 128 hidden size); a reconstruction is sketched after this list
- **Tokenizer:** Word-level tokenizer trained on the synthetic prompts/answers with special tokens `[BOS]`, `[EOS]`, `[PAD]`, `[UNK]`
- **Task:** Text generation / instruction following for SQL-style outputs
- **Base model:** `local-synthetic-gpt2` (initialized from scratch)
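
For reference, the architecture and tokenizer described above can be reconstructed roughly as follows. This is a minimal sketch, not the exact training script: the `corpus_texts` iterable is a hypothetical stand-in for the synthetic prompt/answer strings, and the resulting vocabulary size depends on that corpus.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import GPT2Config, GPT2LMHeadModel

# Word-level tokenizer with the special tokens listed above.
# `corpus_texts` is a hypothetical iterable of the synthetic training strings.
tok = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordLevelTrainer(special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"])
tok.train_from_iterator(corpus_texts, trainer=trainer)

# Tiny GPT-2 style config matching the card: 2 layers, 4 heads, hidden size 128.
config = GPT2Config(
    vocab_size=tok.get_vocab_size(),
    n_layer=2,
    n_head=4,
    n_embd=128,
)
base_model = GPT2LMHeadModel(config)  # randomly initialized, no download needed
```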

## Training
- **Data:** 64 synthetic Spider-inspired text pairs combining schema prompts with target SQL answers (no real images)
- **Batch size:** 2 (gradient accumulation 1)
- **Max steps:** 30
- **Precision:** fp32 on CPU
- **Regularization:** LoRA rank 8, alpha 16 on `c_attn` modules (see the sketch below)
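
The hyperparameters above map onto a `peft`/`transformers` setup along these lines. Again a sketch under stated assumptions: it reuses the hypothetical `base_model` from the previous snippet, and `output_dir` is an arbitrary choice.

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

# LoRA rank 8, alpha 16, applied to GPT-2's fused attention projection `c_attn`.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Batch size 2, no gradient accumulation, 30 optimizer steps; fp32 is the
# default precision when no mixed-precision flags are set.
args = TrainingArguments(
    output_dir="sql-ocr-lora",  # arbitrary
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=30,
)
```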

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tiny model and word-level tokenizer from this repository.
model = AutoModelForCausalLM.from_pretrained("JohnnyZeppelin/sql-ocr")
tokenizer = AutoTokenizer.from_pretrained("JohnnyZeppelin/sql-ocr")

# Prompts follow the chat-style format used for the synthetic training pairs.
text = "<|system|>Given the database schema displayed above for database 'sales_0', analyze relations...<|end|><|user|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the repository stores only the LoRA adapter weights, loading through `AutoModelForCausalLM` relies on the PEFT integration in recent `transformers` releases, so install `peft` alongside `transformers`.

## Limitations & Notes
- This is a demonstration LoRA trained on synthetic text-only data; it is **not** a production OCR or SQL model.
- The tokenizer and model are tiny and intended for quick CPU experiments only.
- Because training is fully synthetic, outputs will be illustrative rather than accurate for real schemas.