nixcfg

Author	SHA1	Message	Date
Harald Hoyer	87dfe74daa	chore(halo): max models	2026-05-22 10:37:28 +02:00
Harald Hoyer	c845f7c385	fix(halo): use ROCm-enabled btop so the GPU panel shows the iGPU The base module installs btop without ROCm support, so btop can't read the Strix Halo iGPU (no rocm_smi at runtime). Add the rocmSupport build with hiPrio to win the bin/btop collision against the base package.	2026-05-22 10:34:34 +02:00
Harald Hoyer	bad775f49a	fix(mx): point nextcloud bot at the fast model halo's llama-server now runs in router mode where the model field selects a preset (coder/fast/bge-m3); the old "halo-8000" name is no longer valid. Use the fast MoE model for the Talk bot's responses.	2026-05-22 10:17:13 +02:00
Harald Hoyer	bc0d79db57	feat(rag): route the fast model and use it for synthesis by default Expose halo's [fast] MoE preset through the LiteLLM gateway and make it the rag CLI's default chat model (overridable via RAG_CHAT_MODEL), so query synthesis is quicker than the larger coder model.	2026-05-22 09:15:59 +02:00
Harald Hoyer	89c82cf804	feat(rag): install on m4 and amd, default endpoints to sgx Add the rag CLI to the m4 and amd hosts and point its default API_BASE and QDRANT_URL at sgx (where the gateway and Qdrant run) instead of localhost. The services live on sgx, so a localhost default only worked there; sgx resolves to itself on sgx, so this default is correct on every host and leaves only RAG_API_KEY to set.	2026-05-22 09:12:22 +02:00
Harald Hoyer	fc3e40686a	chore(halo): add fast model	2026-05-22 09:11:26 +02:00
Harald Hoyer	116d04665d	fix(sgx): move LiteLLM off port 4000 to avoid uptime-kuma clash Uptime Kuma already binds 4000, so the gateway never got the port and requests hit the wrong service. Move LiteLLM to 4001 and update the rag CLI default endpoint to match.	2026-05-22 07:08:26 +02:00
Harald Hoyer	95668b71a7	feat(sgx): add CLI RAG stack (Qdrant + embeddings gateway + rag tool) Stand up document retrieval as shared, client-agnostic primitives rather than locking it inside Open WebUI: - Qdrant as the LAN-reachable vector store - LiteLLM gains a bge-m3 route so sgx:4000 also serves /v1/embeddings - a thin `rag` CLI (ingest/query, optional coder synthesis) usable from any machine and from scripts Embeddings and synthesis run on halo via the gateway; the CLI is configured entirely through RAG_* env vars.	2026-05-22 00:35:54 +02:00
Harald Hoyer	ab729a0720	feat(halo): serve bge-m3 embeddings alongside coder Add a multilingual bge-m3 embedding model to the llama-server preset and raise --models-max to 2 so it stays co-resident with the coder model. This gives the RAG stack a local embeddings endpoint without a second service, keeping all inference on halo. Embedding-specific overrides (ubatch-size, context, pooling) are pinned since the global defaults would truncate or misconfigure embedding requests.	2026-05-22 00:35:54 +02:00
Harald Hoyer	a1b55fe2ec	chore(halo): model update	2026-05-22 00:21:23 +02:00
Harald Hoyer	9986d286b1	refactor(openwebui): drop stale backend env vars now managed via UI The Ollama/OpenAI connection env vars are PersistentConfig: read only on first launch and thereafter owned by Open WebUI's DB. They no longer reflected the live backend, so remove them and document that connections are configured through the admin UI.	2026-05-21 23:15:47 +02:00
Harald Hoyer	fdefdf31b2	feat(litellm): add LiteLLM gateway on sgx fronting halo's llama-server Exposes an OpenAI-compatible endpoint on sgx:4000 (LAN-reachable) that routes the `coder` model to halo's llama-server, so clients get a stable gateway with per-key auth instead of hardcoding halo's address. Master key is sourced from a sops-encrypted env file.	2026-05-21 23:15:47 +02:00
Harald Hoyer	ccd8750899	chore(halo): set spec-draft-p-min for coder model Add a 0.74 confidence threshold so speculative drafting stops early once the draft model's predicted token probability drops below it, favoring shorter, higher-acceptance draft sequences.	2026-05-21 23:15:09 +02:00
Harald Hoyer	3a070413e4	chore(halo): upgrade coder model to Q8 quant and bump spec draft Switch the coder model from Q6_K to the UD-Q8_K_XL quant for better output quality, and raise spec-draft-n-max from 4 to 5 to allow longer speculative draft sequences.	2026-05-21 23:11:00 +02:00
Harald Hoyer	689389ebf8	chore(halo): rename model to coder and add ngram-simple speculation Rename the Qwen3.6-27B model section to "coder" so it matches the opencode provider config, and add ngram-simple to the speculative decoding chain alongside draft-mtp.	2026-05-21 22:07:57 +02:00
Harald Hoyer	ee396ffd42	chore(halo): more parallel	2026-05-21 20:54:08 +02:00
Harald Hoyer	70da67555f	chore(halo): llama.cpp update	2026-05-21 20:46:06 +02:00
Harald Hoyer	1376ab0ba0	chore(halo): reduce ubatch size	2026-05-21 08:47:39 +02:00
Harald Hoyer	6c5ce8742c	fix(halo): only one model	2026-05-20 14:23:42 +02:00
Harald Hoyer	5ee2f65337	chore(halo): tune llama models.ini and drop 35B-A3B model Serve only Qwen3.6-27B; remove the unused 35B-A3B preset. Tuning: - Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p) out of the [] defaults into [Qwen3.6-27B] so they no longer leak onto other models; draft-mtp in particular only works on MTP-weighted models. - Drop the duplicate parallel key from []. - Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo. - Add threads-batch = 16 to use all cores for prefill while keeping generation at threads = 8 under full GPU offload.	2026-05-20 14:23:42 +02:00
Harald Hoyer	5b44e037a1	feat(halo): add `song <URL>` command to convert via song.link Resolves the URL through the Odesli public API (api.song.link) and replies with the canonical song.link page plus per-platform deep links (Spotify, Apple Music, YouTube/YT Music, Tidal, Deezer, Amazon Music, SoundCloud). Country is pinned to DE.	2026-05-20 09:42:11 +02:00
Harald Hoyer	ac70c57c15	chore(halo): preload both llama models and tune preset Preload Qwen3.6-27B and Qwen3.6-35B-A3B at startup (load-on-startup) so both are warm immediately under --models-max 2, set parallel = 1 as the [*] fallback for any other model, and adjust per-model context size and draft depth.	2026-05-20 07:14:26 +02:00
Harald Hoyer	31e491e314	Revert "fix(halo): 27 only" This reverts commit `72e7bf613f`.	2026-05-20 07:05:27 +02:00
Harald Hoyer	72e7bf613f	fix(halo): 27 only	2026-05-20 02:14:08 +02:00
Harald Hoyer	807a3d0d8e	fix(halo): context	2026-05-20 01:21:10 +02:00
Harald Hoyer	0edf975c30	feat(halo): serve multiple llama models via models.ini preset Replace the per-model llama-server units with a single service that uses llama-server's --models-preset (models.ini) and --models-max 2, so the 35B-A3B and 27B models are loaded on demand from one config. Drop the now-redundant 27B / 27B-MTP / coder-next variant files and the unused CacheDirectory + slot-save-path KV-slot handling.	2026-05-20 00:23:50 +02:00
Harald Hoyer	ae068cfd84	feat(mx): increase halo bot timeout	2026-05-19 23:52:46 +02:00
Harald Hoyer	b4063fda66	feat(halo): MTP --parallel 2	2026-05-19 23:48:53 +02:00
Harald Hoyer	8bd096ff8d	feat(halo): inc. mtp to 6	2026-05-19 06:40:13 +02:00
Harald Hoyer	492362fa31	feat(amd): enable Wake-on-LAN on enp7s0	2026-05-16 13:40:25 +02:00
Harald Hoyer	38d2d4f4ae	fix(halo): q6_k with mtp 2	2026-05-15 07:47:43 +02:00
Harald Hoyer	1e3b2fc9a7	feat(halo): unsloth MTP	2026-05-13 19:42:54 +02:00
Harald Hoyer	42c52bd87f	refactor(mx): drive opencode bot via direct chat-completions API The bot no longer shells out to `opencode run`. Instead it POSTs to the OpenAI-compatible /chat/completions endpoint exposed by llama-server on halo.hoyer.tail:8000 directly. This removes the Bun/sqlite cold-start overhead per request, drops the pkgs.opencode runtime dependency, and eliminates the ExecStartPre dance that materialized config.json into the service's $HOME. Conversation history is now stored as a proper OpenAI `messages` list with system/user/assistant roles, instead of the XML blob that was inlined into a single `opencode run` argument. The interactive opencode setup (config/opencode/config.json) is unchanged — only the bot stops depending on it. The module gains a `modelBaseUrl` option; `model` is now the bare model name (`halo-8000`) without the provider/ prefix that the opencode CLI required.	2026-05-13 16:38:58 +02:00
Harald Hoyer	d8e8293c0e	feat(mx): add Nextcloud Talk opencode bot pointing at halo.hoyer.tail:8000 Mirrors the existing nextcloud-claude-bot setup but invokes `opencode run` against the local `halo-8000` provider/model. The bot listens on 127.0.0.1:8086, is exposed via the `/_opencode-bot/` location on nc.hoyer.xyz, and uses `@Halo` as its mention trigger in group chats. The opencode config (config/opencode/config.json) is installed into the service's $HOME/.config/opencode/ on each start, so the bot picks up the same provider definition the user uses interactively. The model map keys are renamed to `halo-8000` / `halo-8001` so the canonical `provider/model` reference works without an alias indirection.	2026-05-13 15:08:18 +02:00
Harald Hoyer	dadfb07914	fix(halo): set `--alias halo-8000`	2026-05-13 14:52:49 +02:00
Harald Hoyer	f9a2e0d301	chore(x1,amd): disable cratedocs-mcp service Keep it enabled only on sgx.	2026-05-13 11:35:59 +02:00
Harald Hoyer	4ce7bcf354	fix(mx): make tailscale exit-node advertisement actually apply tailscale set is strict about boolean flags and silently ignores --advertise-exit-node without =true. Result: the tailscaled-set unit ran cleanly but AdvertiseRoutes stayed null. Spell the value out so the flag takes effect.	2026-05-13 09:28:20 +02:00
Harald Hoyer	67b7c3a9fd	feat(headscale): add ACL policy, isolate mx, make mx an exit node Introduces a headscale ACL policy (file-mode) plus matching client config: - New systems/x86_64-linux/attic/headscale-policy.hujson: * tag:llm restricts a node to talking only to halo:8000 * all other harald@ nodes have full mesh access to each other * harald@ nodes can route internet traffic via approved exit nodes * autoApprovers.exitNode = [tag:llm] auto-approves the exit route advertised by any tag:llm node (currently mx) - attic headscale.nix: wire policy.mode = "file" / policy.path to the .hujson above. - mx default.nix: enable useRoutingFeatures = "server" (needed for IP forwarding) and add extraSetFlags = ["--advertise-exit-node"] so the flag is reapplied on every activation, not just initial login. Operational steps after deploy: headscale nodes tag -i 10 -t tag:llm	2026-05-13 09:06:40 +02:00
Harald Hoyer	87bdaf15da	fix(attic): keep headscale domain as headscale.hoyer.xyz Avoid breaking existing clients and the registered OIDC redirect URI by keeping the original domain. Only the host backing it changes (mx -> attic); DNS just needs to be repointed.	2026-05-13 08:48:37 +02:00
Harald Hoyer	12c25bcde8	refactor(attic): move headscale from mx to attic Headscale is moving off the mx mailserver onto the attic cache host. The new public URL is https://headscale.hoyer.world. - Switch from useACMEHost = "hoyer.xyz" (mx wildcard DNS-01) to enableACME = true, since attic only has HTTP-01 configured. - Move headscale port to 8081 to avoid clashing with atticd on 8080. - Drop the 192.168.178.254 LAN nameserver from dns.nameservers.global, which isn't reachable from the Hetzner instance. Operational steps still required on attic: - Provision /var/lib/headscale/client_secret - Migrate the headscale state DB from mx - Point headscale.hoyer.world DNS at attic - Update the Nextcloud OIDC client's redirect URI	2026-05-13 08:42:46 +02:00
Harald Hoyer	e440bf39fd	feat(halo): llama-server-27B-MTP.nix	2026-05-12 16:16:15 +02:00
Harald Hoyer	ca4ee90828	feat(halo): coder next	2026-05-11 12:22:34 +02:00
Harald Hoyer	7b04b55ce8	feat(halo): cache-ram 0	2026-05-10 20:50:08 +02:00
Harald Hoyer	04342222a2	fix(halo): 27b	2026-05-10 20:46:12 +02:00
Harald Hoyer	689cdec28d	feat(halo): activate qwen 27b	2026-05-10 20:44:38 +02:00
Harald Hoyer	bef528e26a	feat(halo): use qwen-35b-a3b	2026-05-10 20:44:38 +02:00
Harald Hoyer	d47bb6e15b	feat(halo): add different llama servers	2026-05-07 14:54:48 +02:00
Harald Hoyer	b548126fb8	fix(halo): fix systemd description for llama	2026-05-07 14:40:18 +02:00
Harald Hoyer	02b3c73376	fix(halo): fix systemd description for llama	2026-05-06 14:03:28 +02:00
Harald Hoyer	7ebd97629d	feat(halo): use am17an/Qwen3.6-27B-MTP-GGUF:Q8_0 with MTP spec	2026-05-06 14:01:31 +02:00

1 2 3 4 5 ...

446 commits