Thank you for the quants.
I have tested them with llama.cpp:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 7399 (254098a27)
built with GNU 13.3.0 for Linux x86_64
and the following startup options:
.../llama-server
--model /.../GLM-4.6V-Q4_K_S-00001-of-00002.gguf
--threads -1
--n-gpu-layers 99
--jinja
--ctx-size 60000
--flash-attn on
--cache-type-k q8_0
--cache-type-v q8_0
--temp 1.0
--top-p 0.95
--top-k 40
--repeat_penalty 1.1
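For anyone reproducing this, here is a minimal sketch for checking that the server answers once it is up, assuming llama-server's default OpenAI-compatible endpoint on 127.0.0.1:8080 (adjust if you pass --host/--port):

```python
# Minimal sketch: query llama-server's OpenAI-compatible chat endpoint.
# Assumes the default host/port (127.0.0.1:8080); change it if --host/--port are set.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Describe this setup in one sentence."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
    # Print the first completion returned by the server.
    print(body["choices"][0]["message"]["content"])
```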
It demonstrated good performance on my hardware and gave reasonably good answers and results on my workloads.
Personally, I don't think this is a great text model at all, but I don't think that comes from the quant itself. Glad you're liking it, though.