Serve only Qwen3.6-27B; remove the unused 35B-A3B preset.

Tuning:
- Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p) out of the [*] defaults into [Qwen3.6-27B] so they no longer leak onto other models; draft-mtp in particular only works on MTP-weighted models.
- Drop the duplicate parallel key from [*].
- Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo.
- Add threads-batch = 16 to use all cores for prefill while keeping generation at threads = 8 under full GPU offload.
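
For orientation, the resulting models.ini might look roughly like the fragment below. This is a sketch, not the committed file. It assumes that draft-mtp is the value of spec-type, that the batch and thread settings sit in the shared [*] section, and that `;` starts a comment. The sampling values are not stated in the commit message, so they are left out rather than guessed.

```ini
; Sketch only: key placement is an assumption, not the committed file.

[*]
; Shared defaults. Model-specific keys and the duplicate
; parallel entry no longer live here.
; ubatch-size was 256 before this change.
ubatch-size = 512
; Generation threads under full GPU offload.
threads = 8
; Prefill threads (all cores).
threads-batch = 16

[Qwen3.6-27B]
; Only valid for models that ship MTP weights.
spec-type = draft-mtp
; temp, top-p, top-k and min-p are also set in this section;
; their values are not given in the commit message.
```

Scoping the speculative-decoding and sampling keys to the model section means a preset added later starts from neutral defaults instead of inheriting settings tuned for this model.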
| Name |
|---|
| default.nix |
| hardware-configuration.nix |
| llama-server.nix |
| models.ini |
| sound.nix |
| wyoming.nix |
| xremap.nix |