Thank you for the quants.
I have tested them with llama.cpp:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 7399 (254098a27)
built with GNU 13.3.0 for Linux x86_64
and the following startup options:
.../llama-server
--model /.../GLM-4.6V-Q4_K_S-00001-of-00002.gguf
--threads -1
--n-gpu-layers 99
--jinja
--ctx-size 60000
--flash-attn on
--cache-type-k q8_0
--cache-type-v q8_0
--temp 1.0
--top-p 0.95
--top-k 40
--repeat_penalty 1.1
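For anyone reproducing this, here is a minimal sketch for checking that the server answers once it is up, assuming llama-server's default OpenAI-compatible endpoint on 127.0.0.1:8080 (adjust if you pass --host/--port):

```python
# Minimal sketch: query llama-server's OpenAI-compatible chat endpoint.
# Assumes the default host/port (127.0.0.1:8080); change it if --host/--port are set.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Describe this setup in one sentence."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
    # Print the first completion returned by the server.
    print(body["choices"][0]["message"]["content"])
```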
It demonstrated good performance on my hardware and gave reasonably good answers and results on my workloads.
Personally, I don't think this is a great text model at all, but I don't think that comes from the quant itself. Glad you're liking it, though.