The base module installs btop without ROCm support, so btop can't read
the Strix Halo iGPU (no rocm_smi at runtime). Add the rocmSupport build
with hiPrio to win the bin/btop collision against the base package.
halo's llama-server now runs in router mode where the model field selects
a preset (coder/fast/bge-m3); the old "halo-8000" name is no longer valid.
Use the fast MoE model for the Talk bot's responses.
Expose halo's [fast] MoE preset through the LiteLLM gateway and make it
the rag CLI's default chat model (overridable via RAG_CHAT_MODEL), so
query synthesis is quicker than the larger coder model.
Add the rag CLI to the m4 and amd hosts and point its default API_BASE
and QDRANT_URL at sgx (where the gateway and Qdrant run) instead of
localhost. The services live on sgx, so a localhost default only worked
there; sgx resolves to itself on sgx, so this default is correct on every
host and leaves only RAG_API_KEY to set.
Uptime Kuma already binds 4000, so the gateway never got the port and
requests hit the wrong service. Move LiteLLM to 4001 and update the rag
CLI default endpoint to match.
Stand up document retrieval as shared, client-agnostic primitives rather
than locking it inside Open WebUI:
- Qdrant as the LAN-reachable vector store
- LiteLLM gains a bge-m3 route so sgx:4000 also serves /v1/embeddings
- a thin `rag` CLI (ingest/query, optional coder synthesis) usable from
any machine and from scripts
Embeddings and synthesis run on halo via the gateway; the CLI is
configured entirely through RAG_* env vars.
Add a multilingual bge-m3 embedding model to the llama-server preset and
raise --models-max to 2 so it stays co-resident with the coder model.
This gives the RAG stack a local embeddings endpoint without a second
service, keeping all inference on halo. Embedding-specific overrides
(ubatch-size, context, pooling) are pinned since the global defaults
would truncate or misconfigure embedding requests.
The Ollama/OpenAI connection env vars are PersistentConfig: read only on
first launch and thereafter owned by Open WebUI's DB. They no longer
reflected the live backend, so remove them and document that connections
are configured through the admin UI.
Exposes an OpenAI-compatible endpoint on sgx:4000 (LAN-reachable) that
routes the `coder` model to halo's llama-server, so clients get a stable
gateway with per-key auth instead of hardcoding halo's address. Master
key is sourced from a sops-encrypted env file.
Add a 0.74 confidence threshold so speculative drafting stops early
once the draft model's predicted token probability drops below it,
favoring shorter, higher-acceptance draft sequences.
Switch the coder model from Q6_K to the UD-Q8_K_XL quant for better
output quality, and raise spec-draft-n-max from 4 to 5 to allow longer
speculative draft sequences.
Rename the Qwen3.6-27B model section to "coder" so it matches the
opencode provider config, and add ngram-simple to the speculative
decoding chain alongside draft-mtp.
Serve only Qwen3.6-27B; remove the unused 35B-A3B preset.
Tuning:
- Move model-specific keys (spec-type, sampling temp/top-p/top-k/min-p)
out of the [*] defaults into [Qwen3.6-27B] so they no longer leak onto
other models; draft-mtp in particular only works on MTP-weighted models.
- Drop the duplicate parallel key from [*].
- Bump ubatch-size 256 -> 512 for faster iGPU prefill on Strix Halo.
- Add threads-batch = 16 to use all cores for prefill while keeping
generation at threads = 8 under full GPU offload.
Resolves the URL through the Odesli public API (api.song.link) and
replies with the canonical song.link page plus per-platform deep links
(Spotify, Apple Music, YouTube/YT Music, Tidal, Deezer, Amazon Music,
SoundCloud). Country is pinned to DE.
Preload Qwen3.6-27B and Qwen3.6-35B-A3B at startup (load-on-startup)
so both are warm immediately under --models-max 2, set parallel = 1
as the [*] fallback for any other model, and adjust per-model context
size and draft depth.
Replace the per-model llama-server units with a single service that
uses llama-server's --models-preset (models.ini) and --models-max 2,
so the 35B-A3B and 27B models are loaded on demand from one config.
Drop the now-redundant 27B / 27B-MTP / coder-next variant files and
the unused CacheDirectory + slot-save-path KV-slot handling.
The bot no longer shells out to `opencode run`. Instead it POSTs to the
OpenAI-compatible /chat/completions endpoint exposed by llama-server on
halo.hoyer.tail:8000 directly. This removes the Bun/sqlite cold-start
overhead per request, drops the pkgs.opencode runtime dependency, and
eliminates the ExecStartPre dance that materialized config.json into the
service's $HOME.
Conversation history is now stored as a proper OpenAI `messages` list
with system/user/assistant roles, instead of the XML blob that was
inlined into a single `opencode run` argument. The interactive opencode
setup (config/opencode/config.json) is unchanged — only the bot stops
depending on it.
The module gains a `modelBaseUrl` option; `model` is now the bare model
name (`halo-8000`) without the provider/ prefix that the opencode CLI
required.
Mirrors the existing nextcloud-claude-bot setup but invokes `opencode run`
against the local `halo-8000` provider/model. The bot listens on
127.0.0.1:8086, is exposed via the `/_opencode-bot/` location on
nc.hoyer.xyz, and uses `@Halo` as its mention trigger in group chats.
The opencode config (config/opencode/config.json) is installed into the
service's $HOME/.config/opencode/ on each start, so the bot picks up the
same provider definition the user uses interactively. The model map keys
are renamed to `halo-8000` / `halo-8001` so the canonical
`provider/model` reference works without an alias indirection.
tailscale set is strict about boolean flags and silently ignores
--advertise-exit-node without =true. Result: the tailscaled-set unit
ran cleanly but AdvertiseRoutes stayed null. Spell the value out so the
flag takes effect.
Introduces a headscale ACL policy (file-mode) plus matching client config:
- New systems/x86_64-linux/attic/headscale-policy.hujson:
* tag:llm restricts a node to talking only to halo:8000
* all other harald@ nodes have full mesh access to each other
* harald@ nodes can route internet traffic via approved exit nodes
* autoApprovers.exitNode = [tag:llm] auto-approves the exit route
advertised by any tag:llm node (currently mx)
- attic headscale.nix: wire policy.mode = "file" / policy.path to
the .hujson above.
- mx default.nix: enable useRoutingFeatures = "server" (needed for IP
forwarding) and add extraSetFlags = ["--advertise-exit-node"] so the
flag is reapplied on every activation, not just initial login.
Operational steps after deploy:
headscale nodes tag -i 10 -t tag:llm
Avoid breaking existing clients and the registered OIDC redirect URI by
keeping the original domain. Only the host backing it changes (mx -> attic);
DNS just needs to be repointed.
Headscale is moving off the mx mailserver onto the attic cache host.
The new public URL is https://headscale.hoyer.world.
- Switch from useACMEHost = "hoyer.xyz" (mx wildcard DNS-01) to
enableACME = true, since attic only has HTTP-01 configured.
- Move headscale port to 8081 to avoid clashing with atticd on 8080.
- Drop the 192.168.178.254 LAN nameserver from dns.nameservers.global,
which isn't reachable from the Hetzner instance.
Operational steps still required on attic:
- Provision /var/lib/headscale/client_secret
- Migrate the headscale state DB from mx
- Point headscale.hoyer.world DNS at attic
- Update the Nextcloud OIDC client's redirect URI