Stand up document retrieval as shared, client-agnostic primitives rather
than locking it inside Open WebUI:
- Qdrant as the LAN-reachable vector store
- LiteLLM gains a bge-m3 route so sgx:4000 also serves /v1/embeddings
- a thin `rag` CLI (ingest/query, optional coder synthesis) usable from
any machine and from scripts
Embeddings and synthesis run on halo via the gateway; the CLI is
configured entirely through RAG_* env vars.
Exposes an OpenAI-compatible endpoint on sgx:4000 (LAN-reachable) that
routes the `coder` model to halo's llama-server, so clients get a stable
gateway with per-key auth instead of hardcoding halo's address. Master
key is sourced from a sops-encrypted env file.