128K context does not work (possibly because YaRN meta information is missing?)

#8 by Lissanro

Thank you for providing the quants, but I think I discovered an issue. The example command in the model card mentions --ctx-size 65536, so I assumed these quants support the full 128K context length mentioned on the official card at https://huggingface.co/inclusionAI/Ling-1T. But once I actually downloaded them, they failed to go beyond a 32K context length. This is very unfortunate, since Roo Code and many other real-world use cases require a larger context.

According to the discussion at https://github.com/ikawrakow/ik_llama.cpp/issues/873, ik_llama.cpp supports YaRN, and ikawrakow suggested there may be something wrong with the meta information in the GGUF. Could you please check? If you discover a solution, I wonder if it is possible to fix the already-downloaded quant, since re-downloading takes over a week on my 4G connection.
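In case it helps with checking, here is a minimal sketch of how the metadata can be inspected with the gguf_dump.py script from llama.cpp's gguf-py/scripts directory (the script path and flags are assumed from a mainline llama.cpp checkout and may differ in ik_llama.cpp; the model path is just my local one):

# Dump only the key/value metadata (no tensor listing) and search for
# rope/context keys; absent YaRN keys would confirm the suspicion.
python3 gguf-py/scripts/gguf_dump.py --no-tensors \
  /mnt/neuro/models/Ling-1T/Ling-1T-smol-IQ4_KSS-00001-of-00011.gguf \
  | grep -Ei 'rope|context_length'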

Thanks to a suggestion by CISC at https://github.com/ikawrakow/ik_llama.cpp/issues/873, a solution was found that enables the full 128K context length! Here is the complete command that worked for me, for reference:

numactl --cpunodebind=0 --interleave=all ~/pkgs/ik_llama.cpp/build/bin/llama-server \
--jinja --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 --override-kv bailingmoe2.context_length=int:131072 \
--model /mnt/neuro/models/Ling-1T/Ling-1T-smol-IQ4_KSS-00001-of-00011.gguf \
--ctx-size 131072 --n-gpu-layers 62 --tensor-split 12,26,31,31 -ctk q8_0 -ctv q8_0 -b 4096 -ub 4096 -ger \
-ot "blk\.(4|5)\.ffn_.*=CUDA0" \
-ot "blk\.(7|8)\.ffn_.*=CUDA1" \
-ot "blk\.(9|10)\.ffn_.*=CUDA2" \
-ot "blk\.(11|12)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--threads 64 --host 0.0.0.0 --port 5000 \
--slot-save-path /var/cache/ik_llama.cpp/ling
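To confirm the extended window actually took effect, one quick sanity check (assuming ik_llama.cpp's server still exposes the /props endpoint inherited from mainline llama-server) is:

# Should report n_ctx matching --ctx-size, i.e. 131072 here.
curl -s http://localhost:5000/props | grep -o '"n_ctx":[0-9]*'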

This turned out to be an issue with the quant meta information missing the YaRN parameters, not a bug in ik_llama.cpp. The important part missing from the example command in the model card is this: --jinja --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 --override-kv bailingmoe2.context_length=int:131072. These flags set the correct chat template and YaRN parameters, and update the declared context length to match.
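For reference, the GGUF keys these flags compensate for would, assuming the standard llama.cpp key naming under the bailingmoe2 architecture prefix, look like this if they were present in the file:

bailingmoe2.rope.scaling.type = yarn
bailingmoe2.rope.scaling.factor = 4.0
bailingmoe2.rope.scaling.original_context_length = 32768
bailingmoe2.context_length = 131072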

Thanks for the follow-up. Yes, I changed the model card example to 32K after realizing the model doesn't support the 64K I originally provided.

The details, which you now already know, are in this buried discussion with magikRUKKOLA: https://github.com/ikawrakow/ik_llama.cpp/discussions/839#discussioncomment-14745117

The relevant bits for extending context beyond 32k seem to be:
--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 --override-kv bailingmoe2.context_length=int:131072

Or, if you only need 64K, extend it by only 2x, since longer extensions tend toward worse output quality:
--rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768 --override-kv bailingmoe2.context_length=int:65536
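The general pattern seems to be rope-scale = target context / yarn-orig-ctx, with --ctx-size and the context_length override both set to the target. For example, a hypothetical 96K middle ground (an illustration only, not from the model card) would be:

--rope-scaling yarn --rope-scale 3 --yarn-orig-ctx 32768 --override-kv bailingmoe2.context_length=int:98304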

I'll add to the model card, thanks!
