RichardErkhov committed (verified)
Commit 25449c2 · 1 Parent(s): 589f139

uploaded readme

Files changed (1):
  1. README.md (+280, -0)

README.md ADDED:
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


magistrate-3.2-3b-base - bnb 4bits
- Model creator: https://huggingface.co/macadeliccc/
- Original model: https://huggingface.co/macadeliccc/magistrate-3.2-3b-base/

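A minimal sketch of loading this bnb 4-bit upload with Transformers and bitsandbytes; the repository id below is a placeholder assumption (substitute this repo's actual id), and `accelerate` is needed for `device_map="auto"`:

```python
# Minimal sketch: load a checkpoint saved with a bitsandbytes 4-bit quantization config.
# Requires: transformers, bitsandbytes, accelerate. The repo id is an assumed placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "RichardErkhov/magistrate-3.2-3b-base-4bits"  # placeholder: use this repo's actual id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",           # dispatch the quantized weights to the available GPU(s)
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
)

prompt = "The Supreme Court held that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
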

Original model description:
---
library_name: transformers
license: llama3.2
license_link: https://huggingface.co/meta-llama/Llama-3.2-3B/blob/main/LICENSE.txt
base_model: meta-llama/Llama-3.2-3B
datasets:
- macadeliccc/US-SupremeCourtVerdicts
- macadeliccc/US-FederalLaws
tags:
- generated_from_trainer
- llama-3
- spectrum
- axolotl
language:
- en
pipeline_tag: text-generation
---
# Magistrate 3.2 3B

Continued pretraining of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) on roughly 250M tokens of legal text, using no synthetic legal data.

The model achieves the following results on the evaluation set:
- Loss: 0.6802

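Assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for the Hugging Face Trainer), this corresponds to a validation perplexity of roughly 1.97:

```python
# Perplexity implied by the reported eval loss
# (assumes the loss is mean per-token cross-entropy in nats).
import math

eval_loss = 0.6802
print(f"perplexity ~= {math.exp(eval_loss):.2f}")  # ~1.97
```
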
An instruct version is available [here]().

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.1`
```yaml
base_model: meta-llama/Llama-3.2-3B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: json
    data_files: "data/amendments_with_content_converted.json"
    type: completion
  - path: json
    data_files: "data/federal_rules_converted.json"
    type: completion
  - path: json
    data_files: "data/cornell_legal_encyclopedias_converted.json"
    type: completion
  - path: json
    data_files: "data/pocket_guide_for_judges_converted.json"
    type: completion
  - path: json
    data_files: "data/us_federal_code.json"
    type: completion
  - path: json
    data_files: "data/us_supreme_court_summaries_converted.json"
    type: completion
  - path: json
    data_files: "data/us_supreme_court_converted.json"
    type: completion
  - path: json
    data_files: "data/ucfr.json"
    type: completion
  - path: json
    data_files: "data/map-code-filtered.json"
    type: completion

dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out

sequence_len: 8192
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

# adapter: lora
# lora_model_dir:
# lora_r: 128
# lora_alpha: 32
# lora_dropout: 0.05
# lora_target_linear: true
# lora_fan_in_fan_out:
# lora_modules_to_save:
# - embed_tokens
# - lm_head

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.0.mlp.down_proj
- model.layers.1.mlp.down_proj
- model.layers.17.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.18.mlp.down_proj
- model.layers.5.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.2.mlp.down_proj
- model.layers.4.mlp.down_proj
- model.layers.6.mlp.down_proj
- model.layers.3.mlp.down_proj
- model.layers.16.mlp.down_proj
- model.layers.15.mlp.down_proj
- model.layers.13.mlp.down_proj
# mlp.gate_proj layers
- model.layers.0.mlp.gate_proj
- model.layers.1.mlp.gate_proj
- model.layers.2.mlp.gate_proj
- model.layers.3.mlp.gate_proj
- model.layers.22.mlp.gate_proj
- model.layers.21.mlp.gate_proj
- model.layers.20.mlp.gate_proj
- model.layers.23.mlp.gate_proj
- model.layers.19.mlp.gate_proj
- model.layers.4.mlp.gate_proj
- model.layers.18.mlp.gate_proj
- model.layers.17.mlp.gate_proj
- model.layers.5.mlp.gate_proj
- model.layers.24.mlp.gate_proj
# mlp.up_proj layers
- model.layers.4.mlp.up_proj
- model.layers.3.mlp.up_proj
- model.layers.5.mlp.up_proj
- model.layers.6.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.2.mlp.up_proj
- model.layers.8.mlp.up_proj
- model.layers.14.mlp.up_proj
- model.layers.13.mlp.up_proj
- model.layers.11.mlp.up_proj
- model.layers.9.mlp.up_proj
- model.layers.1.mlp.up_proj
- model.layers.15.mlp.up_proj
- model.layers.12.mlp.up_proj
# self_attn.k_proj layers
- model.layers.25.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.19.self_attn.k_proj
- model.layers.20.self_attn.k_proj
- model.layers.17.self_attn.k_proj
- model.layers.24.self_attn.k_proj
- model.layers.23.self_attn.k_proj
- model.layers.18.self_attn.k_proj
- model.layers.21.self_attn.k_proj
- model.layers.27.self_attn.k_proj
- model.layers.15.self_attn.k_proj
- model.layers.10.self_attn.k_proj
- model.layers.6.self_attn.k_proj
- model.layers.5.self_attn.k_proj
# self_attn.o_proj layers

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_32bit

# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 0 # no noisy embedding to ensure maximal memorization

lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 690
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>

```

</details><br>
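
Each dataset in the config above uses axolotl's `completion` type, which trains on raw text read (by default) from a `text` field. A minimal sketch of peeking at one of these files; the file path is taken from the config, and the `text` field name is assumed from that default:

```python
# Minimal sketch: inspect one of the completion-format JSON datasets listed in the config above.
# Assumes records expose a "text" field (axolotl's default for `type: completion`).
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="data/us_supreme_court_summaries_converted.json",
    split="train",
)
print(ds)                   # number of rows and column names
print(ds[0]["text"][:300])  # preview of the first document
```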



## Model description

This is a base model trained on US Supreme Court proceedings, the US federal code, and federal regulations.

## Intended uses & limitations

This model is intended for research purposes. You are liable for all model outputs.

## Training and evaluation data

The training data consists of US Supreme Court verdicts, federal regulations, laws, and treaties.

Additional resources from institutions such as CLL have been included to fill gaps in coverage of legal industry jargon.

## Training procedure

This is a Spectrum top-35% fine-tune: only the layers listed under `unfrozen_parameters` in the config above are trained, while the rest of the model stays frozen. Thanks to the Cognitive Computations team for their work on Spectrum.

Methodology based on Cohere's paper: [To Code, or Not To Code? Exploring Impact of Code in Pre-training](https://arxiv.org/abs/2408.10914)
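
Not the actual training code, but a minimal sketch of what the `unfrozen_parameters` patterns above amount to: freeze every parameter, then unfreeze only those whose names match a listed pattern (only a small subset of the patterns is reproduced here):

```python
# Sketch of Spectrum-style selective unfreezing, assuming the patterns are applied as regexes.
import re

from transformers import AutoModelForCausalLM

# Illustrative subset of the `unfrozen_parameters` patterns from the config above.
unfrozen_patterns = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.0.mlp.down_proj",
    r"model.layers.25.self_attn.k_proj",
]

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")  # gated; requires access

for name, param in model.named_parameters():
    param.requires_grad = any(re.search(pattern, name) for pattern in unfrozen_patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")
```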

### Training hyperparameters

The following hyperparameters were used during training (the effective batch size arithmetic is sketched after the list):
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 690
- num_epochs: 3

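The effective train batch size follows from the per-device batch size, gradient accumulation, and GPU count:

```python
# Effective batch size implied by the settings above.
micro_batch_size = 2              # per-device train batch size
gradient_accumulation_steps = 4
num_devices = 2

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)     # 16, matching total_train_batch_size above

# With sequence_len 8192 and sample packing, each optimizer step sees roughly
# 16 * 8192 = 131,072 tokens.
```
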
### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.3589        | 0.0004 | 1    | 1.5640          |
| 0.9936        | 0.4984 | 1154 | 0.9440          |
| 0.8384        | 0.9968 | 2308 | 0.8392          |
| 0.8226        | 1.4963 | 3462 | 0.7802          |
| 0.6568        | 1.9949 | 4616 | 0.7059          |
| 0.5163        | 2.4923 | 5770 | 0.6886          |
| 0.492         | 2.9922 | 6924 | 0.6802          |


### Framework versions

- Transformers 4.45.0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.20.0