RichardErkhov committed (verified)
Commit 25449c2 · 1 Parent(s): 589f139

uploaded readme

Files changed (1):
  1. README.md (+280, -0)

README.md ADDED:
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


magistrate-3.2-3b-base - bnb 4bits
- Model creator: https://huggingface.co/macadeliccc/
- Original model: https://huggingface.co/macadeliccc/magistrate-3.2-3b-base/

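A minimal sketch of loading this bnb 4-bit upload with Transformers and bitsandbytes; the repository id below is a placeholder assumption (substitute this repo's actual id), and `accelerate` is needed for `device_map="auto"`:

```python
# Minimal sketch: load a checkpoint saved with a bitsandbytes 4-bit quantization config.
# Requires: transformers, bitsandbytes, accelerate. The repo id is an assumed placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "RichardErkhov/magistrate-3.2-3b-base-4bits"  # placeholder: use this repo's actual id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",           # dispatch the quantized weights to the available GPU(s)
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
)

prompt = "The Supreme Court held that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
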

Original model description:
---
library_name: transformers
license: llama3.2
license_link: https://huggingface.co/meta-llama/Llama-3.2-3B/blob/main/LICENSE.txt
base_model: meta-llama/Llama-3.2-3B
datasets:
- macadeliccc/US-SupremeCourtVerdicts
- macadeliccc/US-FederalLaws
tags:
- generated_from_trainer
- llama-3
- spectrum
- axolotl
language:
- en
pipeline_tag: text-generation
---
# Magistrate 3.2 3B

Continued pretraining of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) on roughly 250M tokens of legal text, using no synthetic legal data.

The model achieves the following results on the evaluation set:
- Loss: 0.6802

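Assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for the Hugging Face Trainer), this corresponds to a validation perplexity of roughly 1.97:

```python
# Perplexity implied by the reported eval loss
# (assumes the loss is mean per-token cross-entropy in nats).
import math

eval_loss = 0.6802
print(f"perplexity ~= {math.exp(eval_loss):.2f}")  # ~1.97
```
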
An instruct version is available [here]().

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.1`
```yaml
base_model: meta-llama/Llama-3.2-3B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: json
    data_files: "data/amendments_with_content_converted.json"
    type: completion
  - path: json
    data_files: "data/federal_rules_converted.json"
    type: completion
  - path: json
    data_files: "data/cornell_legal_encyclopedias_converted.json"
    type: completion
  - path: json
    data_files: "data/pocket_guide_for_judges_converted.json"
    type: completion
  - path: json
    data_files: "data/us_federal_code.json"
    type: completion
  - path: json
    data_files: "data/us_supreme_court_summaries_converted.json"
    type: completion
  - path: json
    data_files: "data/us_supreme_court_converted.json"
    type: completion
  - path: json
    data_files: "data/ucfr.json"
    type: completion
  - path: json
    data_files: "data/map-code-filtered.json"
    type: completion

dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out

sequence_len: 8192
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

# adapter: lora
# lora_model_dir:
# lora_r: 128
# lora_alpha: 32
# lora_dropout: 0.05
# lora_target_linear: true
# lora_fan_in_fan_out:
# lora_modules_to_save:
# - embed_tokens
# - lm_head

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.0.mlp.down_proj
- model.layers.1.mlp.down_proj
- model.layers.17.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.18.mlp.down_proj
- model.layers.5.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.2.mlp.down_proj
- model.layers.4.mlp.down_proj
- model.layers.6.mlp.down_proj
- model.layers.3.mlp.down_proj
- model.layers.16.mlp.down_proj
- model.layers.15.mlp.down_proj
- model.layers.13.mlp.down_proj
# mlp.gate_proj layers
- model.layers.0.mlp.gate_proj
- model.layers.1.mlp.gate_proj
- model.layers.2.mlp.gate_proj
- model.layers.3.mlp.gate_proj
- model.layers.22.mlp.gate_proj
- model.layers.21.mlp.gate_proj
- model.layers.20.mlp.gate_proj
- model.layers.23.mlp.gate_proj
- model.layers.19.mlp.gate_proj
- model.layers.4.mlp.gate_proj
- model.layers.18.mlp.gate_proj
- model.layers.17.mlp.gate_proj
- model.layers.5.mlp.gate_proj
- model.layers.24.mlp.gate_proj
# mlp.up_proj layers
- model.layers.4.mlp.up_proj
- model.layers.3.mlp.up_proj
- model.layers.5.mlp.up_proj
- model.layers.6.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.2.mlp.up_proj
- model.layers.8.mlp.up_proj
- model.layers.14.mlp.up_proj
- model.layers.13.mlp.up_proj
- model.layers.11.mlp.up_proj
- model.layers.9.mlp.up_proj
- model.layers.1.mlp.up_proj
- model.layers.15.mlp.up_proj
- model.layers.12.mlp.up_proj
# self_attn.k_proj layers
- model.layers.25.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.19.self_attn.k_proj
- model.layers.20.self_attn.k_proj
- model.layers.17.self_attn.k_proj
- model.layers.24.self_attn.k_proj
- model.layers.23.self_attn.k_proj
- model.layers.18.self_attn.k_proj
- model.layers.21.self_attn.k_proj
- model.layers.27.self_attn.k_proj
- model.layers.15.self_attn.k_proj
- model.layers.10.self_attn.k_proj
- model.layers.6.self_attn.k_proj
- model.layers.5.self_attn.k_proj
# self_attn.o_proj layers

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: paged_adamw_32bit

# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 0 # no noisy embedding to ensure maximal memorization

lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:

warmup_steps: 690
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>

```

</details><br>
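
Each dataset in the config above uses axolotl's `completion` type, which trains on raw text read (by default) from a `text` field. A minimal sketch of peeking at one of these files; the file path is taken from the config, and the `text` field name is assumed from that default:

```python
# Minimal sketch: inspect one of the completion-format JSON datasets listed in the config above.
# Assumes records expose a "text" field (axolotl's default for `type: completion`).
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="data/us_supreme_court_summaries_converted.json",
    split="train",
)
print(ds)                   # number of rows and column names
print(ds[0]["text"][:300])  # preview of the first document
```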



## Model description

This is a base model trained on US Supreme Court proceedings, the US federal code, and federal regulations.

## Intended uses & limitations

This model is intended for research purposes. You are liable for all model outputs.

## Training and evaluation data

The training data consists of US Supreme Court verdicts, federal regulations, laws, and treaties.

Additional resources from institutions such as CLL have been included to fill gaps in coverage of legal industry jargon.

## Training procedure

This is a Spectrum top-35% fine-tune: only the layers listed under `unfrozen_parameters` in the config above are trained, while the rest of the model stays frozen. Thanks to the Cognitive Computations team for their work on Spectrum.

Methodology based on Cohere's paper: [To Code, or Not To Code? Exploring Impact of Code in Pre-training](https://arxiv.org/abs/2408.10914)
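
Not the actual training code, but a minimal sketch of what the `unfrozen_parameters` patterns above amount to: freeze every parameter, then unfreeze only those whose names match a listed pattern (only a small subset of the patterns is reproduced here):

```python
# Sketch of Spectrum-style selective unfreezing, assuming the patterns are applied as regexes.
import re

from transformers import AutoModelForCausalLM

# Illustrative subset of the `unfrozen_parameters` patterns from the config above.
unfrozen_patterns = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.0.mlp.down_proj",
    r"model.layers.25.self_attn.k_proj",
]

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")  # gated; requires access

for name, param in model.named_parameters():
    param.requires_grad = any(re.search(pattern, name) for pattern in unfrozen_patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")
```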

### Training hyperparameters

The following hyperparameters were used during training (the effective batch size arithmetic is sketched after the list):
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 690
- num_epochs: 3

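The effective train batch size follows from the per-device batch size, gradient accumulation, and GPU count:

```python
# Effective batch size implied by the settings above.
micro_batch_size = 2              # per-device train batch size
gradient_accumulation_steps = 4
num_devices = 2

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)     # 16, matching total_train_batch_size above

# With sequence_len 8192 and sample packing, each optimizer step sees roughly
# 16 * 8192 = 131,072 tokens.
```
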
### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.3589        | 0.0004 | 1    | 1.5640          |
| 0.9936        | 0.4984 | 1154 | 0.9440          |
| 0.8384        | 0.9968 | 2308 | 0.8392          |
| 0.8226        | 1.4963 | 3462 | 0.7802          |
| 0.6568        | 1.9949 | 4616 | 0.7059          |
| 0.5163        | 2.4923 | 5770 | 0.6886          |
| 0.492         | 2.9922 | 6924 | 0.6802          |


### Framework versions

- Transformers 4.45.0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.20.0