Commit graph

11 commits

Author SHA1 Message Date
689389ebf8 chore(halo): rename model to coder and add ngram-simple speculation
Rename the Qwen3.6-27B model section to "coder" so it matches the
opencode provider config, and add ngram-simple to the speculative
decoding chain alongside draft-mtp.
2026-05-21 22:07:57 +02:00
ee396ffd42 chore(halo): more parallel 2026-05-21 20:54:08 +02:00
70da67555f chore(halo): llama.cpp update 2026-05-21 20:46:06 +02:00
1376ab0ba0 chore(halo): reduce ubatch size 2026-05-21 08:47:39 +02:00
6c5ce8742c fix(halo): only one model 2026-05-20 14:23:42 +02:00
5ee2f65337 chore(halo): tune llama models.ini and drop 35B-A3B model
Serve only Qwen3.6-27B; remove the unused 35B-A3B preset.

Tuning:
- Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p)
  out of the [*] defaults into [Qwen3.6-27B] so they no longer leak onto
  other models; draft-mtp in particular only works on MTP-weighted models.
- Drop the duplicate parallel key from [*].
- Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo.
- Add threads-batch = 16 to use all cores for prefill while keeping
  generation at threads = 8 under full GPU offload.
2026-05-20 14:23:42 +02:00
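
The tuning described in 5ee2f65337 might leave models.ini looking roughly like the sketch below. This is a reconstruction from the commit message only, not the actual file. The key names and the values 512, 16 and 8 come from the message. Placing ubatch-size, threads-batch and threads in the model section, setting spec-type to draft-mtp, and using ";" for comments are all assumptions. The sampling values are not given in the log, so they are left out.

```ini
; Hypothetical reconstruction of models.ini after 5ee2f65337 (not the real file).

[*]
; Shared defaults only. The model-specific keys and the duplicate
; "parallel" key were removed from this section.

[Qwen3.6-27B]
; Assumed value: draft-mtp only works on models that ship MTP weights.
spec-type = draft-mtp
; Raised from 256 for faster iGPU prefill on Strix Halo.
ubatch-size = 512
; Use all cores for prefill, keep generation at 8 threads under full GPU offload.
threads-batch = 16
threads = 8
; temp, top-p, top-k and min-p also moved here; their values are not in the log.
```

Keeping the speculative-decoding and sampling keys in the model section means any other model added later falls back to the plain [*] defaults instead of inheriting settings it cannot use.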
ac70c57c15 chore(halo): preload both llama models and tune preset
Preload Qwen3.6-27B and Qwen3.6-35B-A3B at startup (load-on-startup)
so both are warm immediately under --models-max 2, set parallel = 1
as the [*] fallback for any other model, and adjust per-model context
size and draft depth.
2026-05-20 07:14:26 +02:00
31e491e314 Revert "fix(halo): 27 only"
This reverts commit 72e7bf613f.
2026-05-20 07:05:27 +02:00
72e7bf613f fix(halo): 27 only 2026-05-20 02:14:08 +02:00
807a3d0d8e fix(halo): context 2026-05-20 01:21:10 +02:00
0edf975c30 feat(halo): serve multiple llama models via models.ini preset
Replace the per-model llama-server units with a single service that
uses llama-server's --models-preset (models.ini) and --models-max 2,
so the 35B-A3B and 27B models are loaded on demand from one config.

Drop the now-redundant 27B / 27B-MTP / coder-next variant files and
the unused CacheDirectory + slot-save-path KV-slot handling.
2026-05-20 00:23:50 +02:00