Commit graph

437 commits

Author SHA1 Message Date
a1b55fe2ec chore(halo): model update 2026-05-22 00:21:23 +02:00
9986d286b1 refactor(openwebui): drop stale backend env vars now managed via UI
The Ollama/OpenAI connection env vars are PersistentConfig: read only on
first launch and thereafter owned by Open WebUI's DB. They no longer
reflected the live backend, so remove them and document that connections
are configured through the admin UI.
2026-05-21 23:15:47 +02:00
fdefdf31b2 feat(litellm): add LiteLLM gateway on sgx fronting halo's llama-server
Exposes an OpenAI-compatible endpoint on sgx:4000 (LAN-reachable) that
routes the `coder` model to halo's llama-server, so clients get a stable
gateway with per-key auth instead of hardcoding halo's address. Master
key is sourced from a sops-encrypted env file.
2026-05-21 23:15:47 +02:00
ccd8750899 chore(halo): set spec-draft-p-min for coder model
Add a 0.74 confidence threshold so speculative drafting stops early
once the draft model's predicted token probability drops below it,
favoring shorter, higher-acceptance draft sequences.
2026-05-21 23:15:09 +02:00
3a070413e4 chore(halo): upgrade coder model to Q8 quant and bump spec draft
Switch the coder model from Q6_K to the UD-Q8_K_XL quant for better
output quality, and raise spec-draft-n-max from 4 to 5 to allow longer
speculative draft sequences.
2026-05-21 23:11:00 +02:00
689389ebf8 chore(halo): rename model to coder and add ngram-simple speculation
Rename the Qwen3.6-27B model section to "coder" so it matches the
opencode provider config, and add ngram-simple to the speculative
decoding chain alongside draft-mtp.
2026-05-21 22:07:57 +02:00
ee396ffd42 chore(halo): more parallel 2026-05-21 20:54:08 +02:00
70da67555f chore(halo): llama.cpp update 2026-05-21 20:46:06 +02:00
1376ab0ba0 chore(halo): reduce ubatch size 2026-05-21 08:47:39 +02:00
6c5ce8742c fix(halo): only one model 2026-05-20 14:23:42 +02:00
5ee2f65337 chore(halo): tune llama models.ini and drop 35B-A3B model
Serve only Qwen3.6-27B; remove the unused 35B-A3B preset.

Tuning:
- Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p)
  out of the [*] defaults into [Qwen3.6-27B] so they no longer leak onto
  other models; draft-mtp in particular only works on MTP-weighted models.
- Drop the duplicate parallel key from [*].
- Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo.
- Add threads-batch = 16 to use all cores for prefill while keeping
  generation at threads = 8 under full GPU offload.
2026-05-20 14:23:42 +02:00
Harald Hoyer
5b44e037a1 feat(halo): add song <URL> command to convert via song.link
Resolves the URL through the Odesli public API (api.song.link) and
replies with the canonical song.link page plus per-platform deep links
(Spotify, Apple Music, YouTube/YT Music, Tidal, Deezer, Amazon Music,
SoundCloud). Country is pinned to DE.
2026-05-20 09:42:11 +02:00
ac70c57c15 chore(halo): preload both llama models and tune preset
Preload Qwen3.6-27B and Qwen3.6-35B-A3B at startup (load-on-startup)
so both are warm immediately under --models-max 2, set parallel = 1
as the [*] fallback for any other model, and adjust per-model context
size and draft depth.
2026-05-20 07:14:26 +02:00
31e491e314 Revert "fix(halo): 27 only"
This reverts commit 72e7bf613f.
2026-05-20 07:05:27 +02:00
72e7bf613f fix(halo): 27 only 2026-05-20 02:14:08 +02:00
807a3d0d8e fix(halo): context 2026-05-20 01:21:10 +02:00
0edf975c30 feat(halo): serve multiple llama models via models.ini preset
Replace the per-model llama-server units with a single service that
uses llama-server's --models-preset (models.ini) and --models-max 2,
so the 35B-A3B and 27B models are loaded on demand from one config.

Drop the now-redundant 27B / 27B-MTP / coder-next variant files and
the unused CacheDirectory + slot-save-path KV-slot handling.
2026-05-20 00:23:50 +02:00
ae068cfd84 feat(mx): increase halo bot timeout 2026-05-19 23:52:46 +02:00
b4063fda66 feat(halo): MTP --parallel 2 2026-05-19 23:48:53 +02:00
8bd096ff8d feat(halo): inc. mtp to 6 2026-05-19 06:40:13 +02:00
492362fa31 feat(amd): enable Wake-on-LAN on enp7s0 2026-05-16 13:40:25 +02:00
38d2d4f4ae fix(halo): q6_k with mtp 2 2026-05-15 07:47:43 +02:00
1e3b2fc9a7 feat(halo): unsloth MTP 2026-05-13 19:42:54 +02:00
42c52bd87f refactor(mx): drive opencode bot via direct chat-completions API
The bot no longer shells out to `opencode run`. Instead it POSTs to the
OpenAI-compatible /chat/completions endpoint exposed by llama-server on
halo.hoyer.tail:8000 directly. This removes the Bun/sqlite cold-start
overhead per request, drops the pkgs.opencode runtime dependency, and
eliminates the ExecStartPre dance that materialized config.json into the
service's $HOME.

Conversation history is now stored as a proper OpenAI `messages` list
with system/user/assistant roles, instead of the XML blob that was
inlined into a single `opencode run` argument. The interactive opencode
setup (config/opencode/config.json) is unchanged — only the bot stops
depending on it.

The module gains a `modelBaseUrl` option; `model` is now the bare model
name (`halo-8000`) without the provider/ prefix that the opencode CLI
required.
2026-05-13 16:38:58 +02:00
d8e8293c0e feat(mx): add Nextcloud Talk opencode bot pointing at halo.hoyer.tail:8000
Mirrors the existing nextcloud-claude-bot setup but invokes `opencode run`
against the local `halo-8000` provider/model. The bot listens on
127.0.0.1:8086, is exposed via the `/_opencode-bot/` location on
nc.hoyer.xyz, and uses `@Halo` as its mention trigger in group chats.

The opencode config (config/opencode/config.json) is installed into the
service's $HOME/.config/opencode/ on each start, so the bot picks up the
same provider definition the user uses interactively. The model map keys
are renamed to `halo-8000` / `halo-8001` so the canonical
`provider/model` reference works without an alias indirection.
2026-05-13 15:08:18 +02:00
dadfb07914 fix(halo): set --alias halo-8000 2026-05-13 14:52:49 +02:00
f9a2e0d301 chore(x1,amd): disable cratedocs-mcp service
Keep it enabled only on sgx.
2026-05-13 11:35:59 +02:00
4ce7bcf354 fix(mx): make tailscale exit-node advertisement actually apply
tailscale set is strict about boolean flags and silently ignores
--advertise-exit-node without =true. Result: the tailscaled-set unit
ran cleanly but AdvertiseRoutes stayed null. Spell the value out so the
flag takes effect.
2026-05-13 09:28:20 +02:00
67b7c3a9fd feat(headscale): add ACL policy, isolate mx, make mx an exit node
Introduces a headscale ACL policy (file-mode) plus matching client config:

- New systems/x86_64-linux/attic/headscale-policy.hujson:
  * tag:llm restricts a node to talking only to halo:8000
  * all other harald@ nodes have full mesh access to each other
  * harald@ nodes can route internet traffic via approved exit nodes
  * autoApprovers.exitNode = [tag:llm] auto-approves the exit route
    advertised by any tag:llm node (currently mx)

- attic headscale.nix: wire policy.mode = "file" / policy.path to
  the .hujson above.

- mx default.nix: enable useRoutingFeatures = "server" (needed for IP
  forwarding) and add extraSetFlags = ["--advertise-exit-node"] so the
  flag is reapplied on every activation, not just initial login.

Operational steps after deploy:
  headscale nodes tag -i 10 -t tag:llm
2026-05-13 09:06:40 +02:00
87bdaf15da fix(attic): keep headscale domain as headscale.hoyer.xyz
Avoid breaking existing clients and the registered OIDC redirect URI by
keeping the original domain. Only the host backing it changes (mx -> attic);
DNS just needs to be repointed.
2026-05-13 08:48:37 +02:00
12c25bcde8 refactor(attic): move headscale from mx to attic
Headscale is moving off the mx mailserver onto the attic cache host.
The new public URL is https://headscale.hoyer.world.

- Switch from useACMEHost = "hoyer.xyz" (mx wildcard DNS-01) to
  enableACME = true, since attic only has HTTP-01 configured.
- Move headscale port to 8081 to avoid clashing with atticd on 8080.
- Drop the 192.168.178.254 LAN nameserver from dns.nameservers.global,
  which isn't reachable from the Hetzner instance.

Operational steps still required on attic:
- Provision /var/lib/headscale/client_secret
- Migrate the headscale state DB from mx
- Point headscale.hoyer.world DNS at attic
- Update the Nextcloud OIDC client's redirect URI
2026-05-13 08:42:46 +02:00
e440bf39fd feat(halo): llama-server-27B-MTP.nix 2026-05-12 16:16:15 +02:00
ca4ee90828 feat(halo): coder next 2026-05-11 12:22:34 +02:00
7b04b55ce8 feat(halo): cache-ram 0 2026-05-10 20:50:08 +02:00
04342222a2 fix(halo): 27b 2026-05-10 20:46:12 +02:00
689cdec28d feat(halo): activate qwen 27b 2026-05-10 20:44:38 +02:00
bef528e26a feat(halo): use qwen-35b-a3b 2026-05-10 20:44:38 +02:00
d47bb6e15b feat(halo): add different llama servers 2026-05-07 14:54:48 +02:00
b548126fb8 fix(halo): fix systemd description for llama 2026-05-07 14:40:18 +02:00
02b3c73376 fix(halo): fix systemd description for llama 2026-05-06 14:03:28 +02:00
7ebd97629d feat(halo): use am17an/Qwen3.6-27B-MTP-GGUF:Q8_0 with MTP spec 2026-05-06 14:01:31 +02:00
a95417da8b feat(halo): use unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL 2026-05-06 13:02:20 +02:00
3a1cb7487a refactor(opencode): extract serve service into shared NixOS module
New `metacfg.services.opencode` module under modules/nixos/services/opencode/
with options for port, user, homeDir, sopsFile, and extraPackages. User and
homeDir default off `metacfg.user`. Host configs for amd and sgx reduce to
enabling the module and pointing at their respective sops file.

Service PATH gains jq, yq-go, python3, gh, gnutar, gzip, unzip, wget,
diffutils, patch, file, tree, bun, uv, ast-grep, claude-code, and tmux for
agent ergonomics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:43:27 +02:00
da88a9b2d6 fix(halo): drop speculative HSA_OVERRIDE_GFX_VERSION from llama-server
Was set defensively without knowing the actual GPU arch; if ROCm
supports the card natively, the override is at best a no-op and at
worst masks the real arch. Add it back with the right value if the
service actually fails to detect the GPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:42:17 +02:00
b11e5c8356 feat(halo): add llama-server systemd unit for Qwen3.6-35B-A3B
Runs llama.cpp's ROCm build under DynamicUser, with the HF model cache
in StateDirectory (survives systemctl clean) and KV slot saves in
CacheDirectory. Listens on :8000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:42:17 +02:00
624a72737c fix(opencode): narrow LD_LIBRARY_PATH to libstdc++ only
The full nix-ld library list shadowed nix's own curl, breaking
libnixstore.so with "CURL_OPENSSL_4 not found". The prebuilt node
watcher binding only needs libstdc++/libgcc_s, so use stdenv.cc.cc.lib
and let nix-built tools resolve their own deps via RUNPATH.
2026-05-04 08:58:37 +02:00
0d5fb73022 fix(amd): opencode 2026-05-03 16:31:02 +02:00
5693009488 fix(opencode): set LD_LIBRARY_PATH for prebuilt node bindings
The file watcher binding (and other node-precompiled .node modules
loaded via dlopen) failed with "libstdc++.so.6: cannot open shared
object file" because systemd services don't inherit the user shell's
LD path. Reuse the nix-ld library list so the service sees the same
common libraries unwrapped binaries get globally.
2026-05-03 16:29:24 +02:00
441df05d86 fix(opencode): add git and dev tools to service PATH
The opencode-serve unit ran with systemd's minimal default PATH, so
shell commands invoked by the agent (git, make, nix, node, rg, etc.)
were not found. Set systemd.services.opencode-serve.path on both sgx
and amd to a common dev toolset.
2026-05-03 16:09:31 +02:00
0e723e2da8 feat(amd): add opencode web server at opencode.amd.hoyer.world
Mirror of the sgx opencode setup: systemd service on port 4196 fronted
by nginx with a per-host ACME cert (DNS-01 via internetbs). Adds amd
key + path rule to .sops.yaml so secrets under .secrets/amd/ encrypt
for the host.
2026-05-03 15:55:15 +02:00