nixcfg

Author	SHA1	Message	Date
Harald Hoyer	ee396ffd42	chore(halo): more parallel	2026-05-21 20:54:08 +02:00
Harald Hoyer	70da67555f	chore(halo): llama.cpp update	2026-05-21 20:46:06 +02:00
Harald Hoyer	1376ab0ba0	chore(halo): reduce ubatch size	2026-05-21 08:47:39 +02:00
Harald Hoyer	6c5ce8742c	fix(halo): only one model	2026-05-20 14:23:42 +02:00
Harald Hoyer	5ee2f65337	chore(halo): tune llama models.ini and drop 35B-A3B model	2026-05-20 14:23:42 +02:00
		Serve only Qwen3.6-27B; remove the unused 35B-A3B preset.
		Tuning:
		- Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p) out of the [*] defaults into [Qwen3.6-27B] so they no longer leak onto other models; draft-mtp in particular only works on MTP-weighted models.
		- Drop the duplicate parallel key from [*].
		- Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo.
		- Add threads-batch = 16 to use all cores for prefill while keeping generation at threads = 8 under full GPU offload.
Harald Hoyer	ac70c57c15	chore(halo): preload both llama models and tune preset	2026-05-20 07:14:26 +02:00
		Preload Qwen3.6-27B and Qwen3.6-35B-A3B at startup (load-on-startup) so both are warm immediately under --models-max 2, set parallel = 1 as the [*] fallback for any other model, and adjust per-model context size and draft depth.
Harald Hoyer	31e491e314	Revert "fix(halo): 27 only"	2026-05-20 07:05:27 +02:00
		This reverts commit `72e7bf613f`.
Harald Hoyer	72e7bf613f	fix(halo): 27 only	2026-05-20 02:14:08 +02:00
Harald Hoyer	807a3d0d8e	fix(halo): context	2026-05-20 01:21:10 +02:00
Harald Hoyer	0edf975c30	feat(halo): serve multiple llama models via models.ini preset	2026-05-20 00:23:50 +02:00
		Replace the per-model llama-server units with a single service that uses llama-server's --models-preset (models.ini) and --models-max 2, so the 35B-A3B and 27B models are loaded on demand from one config. Drop the now-redundant 27B / 27B-MTP / coder-next variant files and the unused CacheDirectory + slot-save-path KV-slot handling.

10 commits
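
For orientation, the preset layout described in commits ac70c57c15 and 5ee2f65337 might look roughly like the sketch below. It is an illustration assembled only from the keys and values named in those commit messages, not the repository's actual models.ini. The `;` comment syntax and the boolean form of load-on-startup are assumptions, and keys whose values the log does not state (context size, draft depth, sampling parameters) are left out.

```ini
; Hypothetical sketch of models.ini as of 5ee2f65337 (not the real file).

[*]
; Fallback applied to any model without its own section (ac70c57c15).
parallel = 1

[Qwen3.6-27B]
; Assumed value syntax; the log only names the key.
load-on-startup = true
; Kept per-model because draft-mtp only works on MTP-weighted models.
spec-type = draft-mtp
; Prefill tuning for the Strix Halo iGPU.
ubatch-size = 512
threads-batch = 16
; Generation threads under full GPU offload.
threads = 8
```

Later commits in the list (1376ab0ba0 reduces the ubatch size, ee396ffd42 raises parallelism) change some of these values again; the log does not state the new numbers.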