nixcfg

Author	SHA1	Message	Date
Harald Hoyer	87dfe74daa	chore(halo): max models	2026-05-22 10:37:28 +02:00
Harald Hoyer	ab729a0720	feat(halo): serve bge-m3 embeddings alongside coder. Add a multilingual bge-m3 embedding model to the llama-server preset and raise --models-max to 2 so it stays co-resident with the coder model. This gives the RAG stack a local embeddings endpoint without a second service, keeping all inference on halo. Embedding-specific overrides (ubatch-size, context, pooling) are pinned since the global defaults would truncate or misconfigure embedding requests.	2026-05-22 00:35:54 +02:00
Harald Hoyer	6c5ce8742c	fix(halo): only one model	2026-05-20 14:23:42 +02:00
Harald Hoyer	0edf975c30	feat(halo): serve multiple llama models via models.ini preset. Replace the per-model llama-server units with a single service that uses llama-server's --models-preset (models.ini) and --models-max 2, so the 35B-A3B and 27B models are loaded on demand from one config. Drop the now-redundant 27B / 27B-MTP / coder-next variant files and the unused CacheDirectory + slot-save-path KV-slot handling.	2026-05-20 00:23:50 +02:00
Harald Hoyer	dadfb07914	fix(halo): set `--alias halo-8000`	2026-05-13 14:52:49 +02:00
Harald Hoyer	689cdec28d	feat(halo): activate qwen 27b	2026-05-10 20:44:38 +02:00
Harald Hoyer	bef528e26a	feat(halo): use qwen-35b-a3b	2026-05-10 20:44:38 +02:00
Harald Hoyer	d47bb6e15b	feat(halo): add different llama servers	2026-05-07 14:54:48 +02:00
Harald Hoyer	b548126fb8	fix(halo): fix systemd description for llama	2026-05-07 14:40:18 +02:00
Harald Hoyer	02b3c73376	fix(halo): fix systemd description for llama	2026-05-06 14:03:28 +02:00
Harald Hoyer	7ebd97629d	feat(halo): use am17an/Qwen3.6-27B-MTP-GGUF:Q8_0 with MTP spec	2026-05-06 14:01:31 +02:00
Harald Hoyer	a95417da8b	feat(halo): use unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL	2026-05-06 13:02:20 +02:00
Harald Hoyer	da88a9b2d6	fix(halo): drop speculative HSA_OVERRIDE_GFX_VERSION from llama-server. Was set defensively without knowing the actual GPU arch; if ROCm supports the card natively, the override is at best a no-op and at worst masks the real arch. Add it back with the right value if the service actually fails to detect the GPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:42:17 +02:00
Harald Hoyer	b11e5c8356	feat(halo): add llama-server systemd unit for Qwen3.6-35B-A3B. Runs llama.cpp's ROCm build under DynamicUser, with the HF model cache in StateDirectory (survives systemctl clean) and KV slot saves in CacheDirectory. Listens on :8000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:42:17 +02:00

14 commits
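
For orientation, the unit described in the earliest commit (b11e5c8356) might look roughly like the NixOS fragment below. This is a hypothetical sketch, not the repository's actual module: the unit name, the directory names, the `modelRepo` placeholder, the `--host` value, and the use of `LLAMA_CACHE` to place the model cache are all assumptions made for illustration. Only DynamicUser, StateDirectory for the model cache, CacheDirectory for KV slot saves, the ROCm build, and port 8000 come from the commit message.

```nix
# Hypothetical sketch of a llama-server unit; names and paths are assumptions.
{ pkgs, ... }:
let
  # Assumption: nixpkgs' llama-cpp package built with ROCm support.
  llama = pkgs.llama-cpp.override { rocmSupport = true; };

  # Placeholder only; the real Hugging Face repo is not given in the log.
  modelRepo = "example/placeholder-GGUF";
in
{
  systemd.services.llama-server = {
    description = "llama.cpp server";
    wantedBy = [ "multi-user.target" ];
    wants = [ "network-online.target" ];
    after = [ "network-online.target" ];

    # Assumption: point llama.cpp's download cache at the state directory,
    # which `systemctl clean` leaves alone by default.
    environment.LLAMA_CACHE = "/var/lib/llama-server";

    serviceConfig = {
      DynamicUser = true;
      StateDirectory = "llama-server"; # /var/lib/llama-server (model cache)
      CacheDirectory = "llama-server"; # /var/cache/llama-server (KV slot saves)
      ExecStart = builtins.concatStringsSep " " [
        "${llama}/bin/llama-server"
        "-hf ${modelRepo}"
        "--host 0.0.0.0" # assumption: listen on all interfaces
        "--port 8000"
        "--slot-save-path /var/cache/llama-server"
      ];
      Restart = "on-failure";
    };
  };
}
```

Later commits (0edf975c30 onward) replace this per-model layout with a single service driven by a models.ini preset and drop the CacheDirectory and slot-save-path handling, so the sketch reflects only the initial state.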