nixcfg

Author	SHA1	Message	Date
Harald Hoyer	87dfe74daa	chore(halo): max models	2026-05-22 10:37:28 +02:00
Harald Hoyer	c845f7c385	fix(halo): use ROCm-enabled btop so the GPU panel shows the iGPU. The base module installs btop without ROCm support, so btop can't read the Strix Halo iGPU (no rocm_smi at runtime). Add the rocmSupport build with hiPrio to win the bin/btop collision against the base package.	2026-05-22 10:34:34 +02:00
Harald Hoyer	fc3e40686a	chore(halo): add fast model	2026-05-22 09:11:26 +02:00
Harald Hoyer	ab729a0720	feat(halo): serve bge-m3 embeddings alongside coder. Add a multilingual bge-m3 embedding model to the llama-server preset and raise --models-max to 2 so it stays co-resident with the coder model. This gives the RAG stack a local embeddings endpoint without a second service, keeping all inference on halo. Embedding-specific overrides (ubatch-size, context, pooling) are pinned since the global defaults would truncate or misconfigure embedding requests.	2026-05-22 00:35:54 +02:00
Harald Hoyer	a1b55fe2ec	chore(halo): model update	2026-05-22 00:21:23 +02:00
Harald Hoyer	ccd8750899	chore(halo): set spec-draft-p-min for coder model. Add a 0.74 confidence threshold so speculative drafting stops early once the draft model's predicted token probability drops below it, favoring shorter, higher-acceptance draft sequences.	2026-05-21 23:15:09 +02:00
Harald Hoyer	3a070413e4	chore(halo): upgrade coder model to Q8 quant and bump spec draft. Switch the coder model from Q6_K to the UD-Q8_K_XL quant for better output quality, and raise spec-draft-n-max from 4 to 5 to allow longer speculative draft sequences.	2026-05-21 23:11:00 +02:00
Harald Hoyer	689389ebf8	chore(halo): rename model to coder and add ngram-simple speculation. Rename the Qwen3.6-27B model section to "coder" so it matches the opencode provider config, and add ngram-simple to the speculative decoding chain alongside draft-mtp.	2026-05-21 22:07:57 +02:00
Harald Hoyer	ee396ffd42	chore(halo): more parallel	2026-05-21 20:54:08 +02:00
Harald Hoyer	70da67555f	chore(halo): llama.cpp update	2026-05-21 20:46:06 +02:00
Harald Hoyer	1376ab0ba0	chore(halo): reduce ubatch size	2026-05-21 08:47:39 +02:00
Harald Hoyer	6c5ce8742c	fix(halo): only one model	2026-05-20 14:23:42 +02:00
Harald Hoyer	5ee2f65337	chore(halo): tune llama models.ini and drop 35B-A3B model. Serve only Qwen3.6-27B; remove the unused 35B-A3B preset. Tuning: - Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p) out of the [*] defaults into [Qwen3.6-27B] so they no longer leak onto other models; draft-mtp in particular only works on MTP-weighted models. - Drop the duplicate parallel key from [*]. - Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo. - Add threads-batch = 16 to use all cores for prefill while keeping generation at threads = 8 under full GPU offload.	2026-05-20 14:23:42 +02:00
Harald Hoyer	ac70c57c15	chore(halo): preload both llama models and tune preset. Preload Qwen3.6-27B and Qwen3.6-35B-A3B at startup (load-on-startup) so both are warm immediately under --models-max 2, set parallel = 1 as the [*] fallback for any other model, and adjust per-model context size and draft depth.	2026-05-20 07:14:26 +02:00
Harald Hoyer	31e491e314	Revert "fix(halo): 27 only". This reverts commit `72e7bf613f`.	2026-05-20 07:05:27 +02:00
Harald Hoyer	72e7bf613f	fix(halo): 27 only	2026-05-20 02:14:08 +02:00
Harald Hoyer	807a3d0d8e	fix(halo): context	2026-05-20 01:21:10 +02:00
Harald Hoyer	0edf975c30	feat(halo): serve multiple llama models via models.ini preset. Replace the per-model llama-server units with a single service that uses llama-server's --models-preset (models.ini) and --models-max 2, so the 35B-A3B and 27B models are loaded on demand from one config. Drop the now-redundant 27B / 27B-MTP / coder-next variant files and the unused CacheDirectory + slot-save-path KV-slot handling.	2026-05-20 00:23:50 +02:00
Harald Hoyer	b4063fda66	feat(halo): MTP --parallel 2	2026-05-19 23:48:53 +02:00
Harald Hoyer	8bd096ff8d	feat(halo): inc. mtp to 6	2026-05-19 06:40:13 +02:00
Harald Hoyer	38d2d4f4ae	fix(halo): q6_k with mtp 2	2026-05-15 07:47:43 +02:00
Harald Hoyer	1e3b2fc9a7	feat(halo): unsloth MTP	2026-05-13 19:42:54 +02:00
Harald Hoyer	dadfb07914	fix(halo): set `--alias halo-8000`	2026-05-13 14:52:49 +02:00
Harald Hoyer	e440bf39fd	feat(halo): llama-server-27B-MTP.nix	2026-05-12 16:16:15 +02:00
Harald Hoyer	ca4ee90828	feat(halo): coder next	2026-05-11 12:22:34 +02:00
Harald Hoyer	7b04b55ce8	feat(halo): cache-ram 0	2026-05-10 20:50:08 +02:00
Harald Hoyer	04342222a2	fix(halo): 27b	2026-05-10 20:46:12 +02:00
Harald Hoyer	689cdec28d	feat(halo): activate qwen 27b	2026-05-10 20:44:38 +02:00
Harald Hoyer	bef528e26a	feat(halo): use qwen-35b-a3b	2026-05-10 20:44:38 +02:00
Harald Hoyer	d47bb6e15b	feat(halo): add different llama servers	2026-05-07 14:54:48 +02:00
Harald Hoyer	b548126fb8	fix(halo): fix systemd description for llama	2026-05-07 14:40:18 +02:00
Harald Hoyer	02b3c73376	fix(halo): fix systemd description for llama	2026-05-06 14:03:28 +02:00
Harald Hoyer	7ebd97629d	feat(halo): use am17an/Qwen3.6-27B-MTP-GGUF:Q8_0 with MTP spec	2026-05-06 14:01:31 +02:00
Harald Hoyer	a95417da8b	feat(halo): use unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL	2026-05-06 13:02:20 +02:00
Harald Hoyer	da88a9b2d6	fix(halo): drop speculative HSA_OVERRIDE_GFX_VERSION from llama-server. Was set defensively without knowing the actual GPU arch; if ROCm supports the card natively, the override is at best a no-op and at worst masks the real arch. Add it back with the right value if the service actually fails to detect the GPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:42:17 +02:00
Harald Hoyer	b11e5c8356	feat(halo): add llama-server systemd unit for Qwen3.6-35B-A3B. Runs llama.cpp's ROCm build under DynamicUser, with the HF model cache in StateDirectory (survives systemctl clean) and KV slot saves in CacheDirectory. Listens on :8000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:42:17 +02:00
Harald Hoyer	f74928ce5f	chore: nix fmt	2026-05-03 14:57:49 +02:00
Harald Hoyer	e96bf83dfd	feat(halo): add python313Packages.huggingface-hub	2026-05-03 09:00:13 +02:00
Harald Hoyer	6d0186eadb	chore: statix fix	2026-04-20 10:09:24 +02:00
Harald Hoyer	866bc090ca	feat(halo): add llama-cpp-rocm and real docker	2026-03-17 10:03:54 +01:00
Harald Hoyer	3db2e2a226	fix: halo	2026-03-17 10:03:54 +01:00
Harald Hoyer	6029a6908b	feat(halo): add user to audio group	2026-03-10 14:21:35 +01:00
Harald Hoyer	d82ac6915c	feat(halo): add linger	2026-03-10 14:21:35 +01:00
Harald Hoyer	0e4f847145	feat(halo): add sound	2026-03-10 13:35:24 +01:00
Harald Hoyer	e4ff822a88	feat(halo): open more ports	2026-02-26 11:48:26 +01:00
Harald Hoyer	04150f10d4	nix fmt	2026-02-24 13:25:42 +01:00
Harald Hoyer	27343e49bd	feat(halo): add wyoming whisper-cpp	2026-02-24 13:16:28 +01:00
Harald Hoyer	b5e7d5031a	feat(halo): add wyoming	2026-02-24 10:56:58 +01:00
Harald Hoyer	4722d5ef88	fix(halo): remove rocm patches	2026-02-17 16:45:23 +01:00
Harald Hoyer	14670bb0fa	fix(halo): patch rocm-runtime	2026-02-17 13:05:35 +01:00

61 commits