Serve only Qwen3.6-27B; remove the unused 35B-A3B preset.
Tuning:
- Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p)
out of the [*] defaults into [Qwen3.6-27B] so they no longer leak onto
other models; draft-mtp in particular only works on models that ship
MTP weights.
- Drop the duplicate parallel key from [*].
- Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo.
- Add threads-batch = 16 to use all cores for prefill while keeping
generation at threads = 8 under full GPU offload.
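A rough sketch of the resulting layout, using only keys and values named
in this message. Which section ubatch-size, threads and threads-batch
live in is an assumption, and the sampling values are left out rather
than guessed:

```ini
; models.ini (sketch, not the committed file)
[*]
; shared defaults only; placing these three keys here is an assumption
ubatch-size = 512
threads = 8
threads-batch = 16

[Qwen3.6-27B]
; model-specific keys, moved out of [*]
spec-type = draft-mtp
; temp, top-p, top-k and min-p are also set here (values omitted)
```

Keeping spec-type out of [*] means a model without MTP weights never
inherits draft-mtp by default.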
Preload Qwen3.6-27B and Qwen3.6-35B-A3B at startup (load-on-startup)
so both are warm immediately under --models-max 2. Set parallel = 1
as the [*] fallback for any other model, and adjust per-model context
size and draft depth.
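A minimal sketch of the preload setup. The "true" value syntax is an
assumption, and the per-model context size and draft depth keys are
omitted because their values are not stated here:

```ini
; models.ini (sketch, not the committed file)
[*]
; fallback for any model without its own section
parallel = 1

[Qwen3.6-27B]
load-on-startup = true

[Qwen3.6-35B-A3B]
load-on-startup = true
```

With --models-max 2, both preloaded models fit within the limit at the
same time.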
Replace the per-model llama-server units with a single service that
uses llama-server's --models-preset (models.ini) and --models-max 2,
so the 35B-A3B and 27B models are loaded on demand from one config.
Drop the now-redundant 27B / 27B-MTP / coder-next variant files and
the unused CacheDirectory + slot-save-path KV-slot handling.
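The single service could be sketched as the unit fragment below. The
unit name and both paths are placeholders, not the committed values;
only the two flags come from this message:

```ini
; llama-server.service (sketch; paths are placeholders)
[Service]
ExecStart=/path/to/llama-server --models-preset /path/to/models.ini --models-max 2
```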
The override was set defensively without knowing the actual GPU arch.
If ROCm supports the card natively, it is at best a no-op and at worst
masks the real arch. Add it back with the right value if the service
actually fails to detect the GPU.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs llama.cpp's ROCm build under DynamicUser, with the HF model cache
in StateDirectory (survives systemctl clean) and KV slot saves in
CacheDirectory. Listens on :8000.
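A hedged sketch of the unit settings described above. The directory
name "llama" and the binary path are placeholders, and the model
arguments and the environment variable that points the HF cache at the
state directory are left out:

```ini
; llama-server.service (sketch; names and paths are placeholders)
[Service]
DynamicUser=yes
; persistent state: the model cache lives under /var/lib/llama
StateDirectory=llama
; disposable cache: KV slot saves live under /var/cache/llama
CacheDirectory=llama
; %C expands to the cache directory root (/var/cache for system units)
ExecStart=/path/to/llama-server --port 8000 --slot-save-path %C/llama
```

By default systemctl clean removes the cache directory but not the
state directory, which is why the downloaded models go in the latter.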
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `opencode` to the `packages` list for both the HALO and AMD system configurations.
- This makes the `opencode` tool available in the development environment on both systems.