# Tolkien — Infrastructure Agent for Valinor

## Overview

Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.

**Architecture**: Client-server model.

- **Server** runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
- **CLI client** is a thin REPL on the operator's machine that connects to the server over HTTPS

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Language | Python (uv) | Matches infra-agent pattern, good Anthropic SDK support |
| LLM | Anthropic API (direct) | Simpler than Vertex AI for homelab use |
| Interface | CLI REPL → server API | Start with CLI, add Telegram/Signal later |
| Registry | Gitea container registry | Zero additional setup, `gitea.jpnadas.xyz/jpnadas/tolkien` |
| Deployment | Helm chart in valinor, ArgoCD | Self-managed, same pattern as all other apps |
| CI | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds |

## Infrastructure Context

### Valinor Repository Structure

```
valinor/
├── apps/                  # Helm-based apps (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/             # Raw k8s manifests (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, *.yaml
├── ansible/               # Docker Compose stacks on external hosts
│   ├── stacks/isildur/    # Pi 5 — Caddy, Unifi Controller, HAProxy
│   ├── stacks/iluvatar/   # ZFS NAS — iSCSI, NFS
│   ├── playbooks/
│   └── inventory.yml
├── terraform/             # IaC for supporting services
│   ├── vault/             # Vault policies + k8s auth roles
│   ├── minio/             # Buckets, users, S3 creds → Vault
│   ├── cloudflare/        # DNS records
│   ├── arr/               # Arr stack config
│   └── netbox/            # NetBox resources
└── applicationset.yaml    # ArgoCD ApplicationSet definitions
```

### Cluster Nodes

| Node | Role | Notes |
|------|------|-------|
| merry | k3s controller | Dedicated SSD (etcd only) |
| sam | k3s worker | SATA SSD (Longhorn + databases) |
| pippin | k3s worker | SATA SSD (Longhorn + databases) |
| rosie | k3s worker | NVMe SSD (fastest, Longhorn + databases) |
| isildur | external | Pi 5 — Caddy, Unifi, HAProxy, Gitea |
| iluvatar | external | Optiplex — ZFS NAS, iSCSI, NFS |

### Storage Classes

| Class | Backend | Use Case |
|-------|---------|----------|
| `longhorn` | Distributed SSD (replica=2) | Configs, caches, app state |
| `local-path` | Node-local SSD | CNPG databases (PG replication handles HA) |
| `bulk-storage` | ZFS on iluvatar (iSCSI) | Media, downloads |
| `frigate-storage` | ZFS on iluvatar (iSCSI) | Frigate recordings |

### Key Services

- **ArgoCD**: GitOps sync (auto-sync + prune + self-heal)
- **Vault**: Secret management (Vault Secrets Operator in cluster)
- **MinIO**: S3-compatible storage for CNPG backups
- **CNPG**: CloudNativePG for PostgreSQL (local-path + streaming replication)
- **Longhorn**: Distributed block storage
- **VolSync**: PVC backup/restore
- **cert-manager**: TLS certificates
- **ingress-nginx**: Ingress controller
- **MetalLB**: Bare-metal load balancer

### Documentation

- **lord-of-the-rings** (`~/Personal/lord-of-the-rings/`): All homelab docs
  - `homelab/valinor/` — per-app docs, migration plans, storage docs
  - `homelab/` — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
- **valinor CLAUDE.md**: Contains repo conventions, app structure, skills

### Git Workflow

- Never push directly to main
- Feature/fix branches → PR via `tea` CLI → merge in Gitea (see the sketch below)
- ArgoCD auto-syncs from the main branch
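Phase 2's write_code tool is expected to drive this same flow programmatically. A minimal sketch of the convention from Python, assuming `git` and a logged-in `tea` are on the PATH in the checkout environment; the branch name, title, and body are whatever the change calls for, and `tea` flag names may differ between versions:

```python
import subprocess

def run(*cmd: str, cwd: str) -> str:
    """Run a command inside the repo checkout, failing loudly on error."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True)
    return result.stdout

def open_pr(repo_dir: str, branch: str, title: str, body: str) -> str:
    # Never commit to main: every change rides a feature/fix branch.
    run("git", "checkout", "-b", branch, cwd=repo_dir)
    run("git", "add", "-A", cwd=repo_dir)
    run("git", "commit", "-m", title, cwd=repo_dir)
    run("git", "push", "-u", "origin", branch, cwd=repo_dir)
    # Open the PR in Gitea via the tea CLI; merging happens in Gitea, not here.
    return run("tea", "pulls", "create", "--title", title, "--description", body, cwd=repo_dir)
```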
---

## Phase 1: Project Skeleton + CLI REPL

### Goal

A working REPL that sends messages to Claude and gets responses, with no tools yet.

### Project Structure

```
tolkien/
├── pyproject.toml       # uv project config
├── plan.md              # This file
├── README.md
├── tolkien/
│   ├── __init__.py
│   ├── __main__.py      # CLI entrypoint (repl or serve)
│   ├── agent.py         # Claude tool-calling loop (Anthropic SDK)
│   ├── session.py       # Conversation state + history trimming
│   ├── cli.py           # REPL client (connects to server API)
│   ├── server.py        # HTTP API server (FastAPI or Flask)
│   ├── config.py        # Settings (env vars, defaults)
│   └── tools/
│       └── __init__.py  # Tool registry + dispatch
└── tests/
    └── ...
```

### Entrypoints

- `python -m tolkien serve` — Start the API server
- `python -m tolkien repl` — Start the CLI REPL (thin client to the server)

### API Design (Server)

```
POST   /sessions                → Create a new session, returns {session_id}
POST   /sessions/{id}/messages  → Send a message, returns {response}
GET    /sessions/{id}           → Get session state
DELETE /sessions/{id}           → End session
GET    /healthz                 → Health check
```

### Session Model

- In-memory dict (session_id → Session)
- Session holds the conversation history (messages list)
- History trimming: keep the first turn + last N turns (like infra-agent's 20-turn cap)
- Session expiry after 30 min of inactivity

### Agent Loop

Based on infra-agent's `_drive_to_end_turn()`; a sketch follows the list:

1. Build the messages list (system prompt + conversation history)
2. Call the Claude API with tools
3. If the response contains tool_use → execute the tools concurrently → append results
4. Loop until stop_reason == "end_turn"
5. Return the assistant text
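A minimal sketch of that loop with the Anthropic Python SDK. The registry wiring matches what Phase 2 describes (tool schemas plus a name → handler dispatch table); the model id is a placeholder, and tool calls run sequentially here where the real loop would use a ThreadPoolExecutor:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def drive_to_end_turn(system: str, messages: list, tools: list[dict], dispatch: dict) -> str:
    """Run the tool-use loop until Claude stops with end_turn, then return its text."""
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model id
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # end_turn: concatenate the text blocks and hand them back
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back as one user turn
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = dispatch[block.name](**block.input)
                results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
        messages.append({"role": "user", "content": results})
```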
### Dependencies

- `anthropic` — Claude API SDK
- `httpx` — CLI client HTTP calls
- `flask` or `fastapi` + `uvicorn` — API server
- `rich` — CLI REPL formatting
- `python-dotenv` — env var loading

### Tasks

- [ ] Initialize the uv project with pyproject.toml
- [ ] Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
- [ ] Implement agent.py with the Claude loop (no tools yet)
- [ ] Implement session.py
- [ ] Implement server.py with session + message endpoints
- [ ] Implement cli.py REPL that POSTs to the server
- [ ] Implement __main__.py with serve/repl subcommands
- [ ] Test locally: run the server, connect with the REPL, have a conversation

---

## Phase 2: Tool System

### Goal

Give the agent tools to query and understand the homelab infrastructure.

### Tool Architecture

Each tool module exports:

- `TOOLS: list[dict]` — Claude tool schemas (name, description, input_schema)
- `DISPATCH: dict[str, Callable]` — name → handler function

`tools/__init__.py` aggregates all modules into a single registry.

### Tools to Implement

#### `tools/kubectl.py` — Kubernetes Queries

- **Allowed verbs**: get, describe, logs, top, auth can-i
- **Blocked**: create, delete, apply, patch, edit, exec, port-forward
- Timeout: 60s
- Output truncation: 12,000 chars (see the sketch after this list)
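A minimal sketch of this module, which also shows the `TOOLS`/`DISPATCH` contract from Tool Architecture in practice. It assumes `kubectl` is on the PATH inside the server container with credentials already in place; the schema wording is illustrative:

```python
import shlex
import subprocess

ALLOWED_VERBS = {"get", "describe", "logs", "top", "auth"}  # "auth" only for "auth can-i"
TIMEOUT_S = 60
MAX_OUTPUT = 12_000

def run_kubectl(command: str) -> str:
    """Run a read-only kubectl command; reject anything outside the verb allowlist."""
    args = shlex.split(command)
    if not args or args[0] not in ALLOWED_VERBS:
        return f"error: verb not allowed (allowed: {sorted(ALLOWED_VERBS)})"
    if args[0] == "auth" and args[1:2] != ["can-i"]:
        return "error: only 'auth can-i' is allowed"
    try:
        result = subprocess.run(
            ["kubectl", *args], capture_output=True, text=True, timeout=TIMEOUT_S
        )
    except subprocess.TimeoutExpired:
        return "error: kubectl timed out after 60s"
    # Truncate long output before it reaches the model
    return (result.stdout + result.stderr)[:MAX_OUTPUT]

TOOLS = [{
    "name": "kubectl",
    "description": "Run a read-only kubectl command (get, describe, logs, top, auth can-i).",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "kubectl arguments, e.g. 'get pods -n media'"},
        },
        "required": ["command"],
    },
}]

DISPATCH = {"kubectl": run_kubectl}
```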
#### `tools/gitea.py` — Gitea via `tea` CLI

- List/view issues
- List/view PRs
- Create issues, create PRs (draft)
- Search repos
- Blocked: delete operations

#### `tools/argocd.py` — ArgoCD Status

- `argocd app list` — all apps and sync status
- `argocd app get <app>` — detailed app status
- `argocd app diff <app>` — pending changes
- `argocd app history <app>` — deployment history
- Read-only: no sync, rollback, or delete operations

#### `tools/vault.py` — Vault Metadata Only

- `vault kv metadata get <path>` — check that a secret exists, see versions
- `vault kv list <path>` — list secret paths
- **Explicitly blocked**: `vault kv get` (cannot read actual secret values)
- Policy: `capabilities = ["read", "list"]` on `secret/metadata/*` (read is needed for `metadata get`), deny on `secret/data/*`

#### `tools/docs.py` — Lord of the Rings Documentation

- Read files from `~/Personal/lord-of-the-rings/`
- List available docs (directory listing)
- Search docs content (grep)
- In the server context: mount lotr as a volume or clone it

#### `tools/web.py` — Web Fetch

- Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
- HTTPS only
- Timeout: 15s
- Output truncation: 12,000 chars

#### `tools/write_code.py` — Claude Code Subprocess

- Clones the valinor repo → runs `claude --print` → creates a branch + PR
- Risk classification for changes (optional, simpler than infra-agent's)
- Always creates draft PRs, never merges

### Tasks

- [ ] Implement the tool registry in tools/__init__.py
- [ ] Implement each tool module
- [ ] Wire tools into agent.py (pass to Claude API, dispatch results)
- [ ] Add concurrent tool execution (ThreadPoolExecutor)
- [ ] Test each tool in isolation
- [ ] Test the full loop: REPL → server → Claude → tool → response

---

## Phase 3: System Prompt + Knowledge

### Goal

Give the agent enough context to be genuinely useful for homelab operations.

### System Prompt Sections

1. **Identity**: You are Tolkien, an infrastructure agent for the valinor homelab
2. **Infrastructure overview**: Cluster nodes, storage classes, key services
3. **Repository structure**: How valinor is organized (apps/, manifests/, ansible/, terraform/)
4. **Tool usage guide**: When to use each tool, with examples
5. **Workflows**: Step-by-step guides for common operations:
   - Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
   - Checking app health (argocd status, kubectl pods, logs)
   - Investigating issues (kubectl describe, logs, events)
   - Checking backup status (volsync, CNPG S3 backups)
   - Managing external hosts (ansible stacks)
6. **Documentation references**: How to find and cite lotr docs
7. **Safety rules**: Read-only by default, write_code for changes, always draft PRs

### Tasks

- [ ] Write the system prompt in agent.py (or a separate prompt.py)
- [ ] Test with real scenarios (deploy app, check status, troubleshoot)
- [ ] Iterate on the prompt based on results

---

## Phase 4: Containerize + Deploy

### Goal

Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.

### Dockerfile

```dockerfile
FROM python:3.13-slim
# Install the CLIs the tools shell out to: kubectl, tea, argocd, vault, claude
# (pinned download/install steps to be filled in)
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir uv && uv sync --frozen
# Serve the API; assumes server.py exposes a FastAPI app object named `app`
ENTRYPOINT ["uv", "run", "uvicorn", "tolkien.server:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Helm Chart (in valinor repo)

```
valinor/apps/tolkien/
├── config.yaml        # Chart reference (bjw-s app-template)
├── values.yaml        # Image, env, probes, volumes
├── vault-auth.yaml    # VaultAuth for k8s auth
└── vault-secret.yaml  # VaultStaticSecret (ANTHROPIC_API_KEY)
```

### Networking

- Ingress: `tolkien.jpnadas.xyz` (or internal-only via ClusterIP + VPN)
- Consider: API key auth or mTLS for the API endpoint (don't expose it unauthenticated)

### API Authentication

A simple shared secret for now:

- Server checks the `Authorization: Bearer <token>` header
- Token stored in Vault, configured in the CLI via an env var
- Can upgrade to mTLS or OAuth later

### Volume Mounts

- lord-of-the-rings docs: either a git-clone init container or a PVC
- valinor repo: cloned on demand for the write_code tool
- kubectl: ServiceAccount with a read-only ClusterRole

### RBAC

- ServiceAccount `tolkien` with a read-only ClusterRole:
  - get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
  - get/list on nodes
  - logs on pods
- No write permissions

### Vault

- Policy: read `secret/data/tolkien/*`; read + list `secret/metadata/*`
- Kubernetes auth role bound to the `tolkien` ServiceAccount

### Tasks

- [ ] Write the Dockerfile
- [ ] Build and push to the Gitea registry manually (first time)
- [ ] Create valinor/apps/tolkien/ with all manifests
- [ ] Add a Vault terraform module for tolkien
- [ ] Create RBAC manifests (ClusterRole + ClusterRoleBinding)
- [ ] Deploy via ArgoCD
- [ ] Test CLI → server connectivity
- [ ] Set up a Gitea Actions runner in the cluster (separate task)
- [ ] Set up a CI pipeline (.gitea/workflows/) for build + push on tag

---

## Phase 5: Messaging Integration (Future)

### Goal

Add Telegram or Signal as alternative interfaces.

### Approach

- Add a `telegram.py` handler (python-telegram-bot library)
- Same session/agent backend, just a different input/output transport
- Bot token in Vault
- Webhook mode (Telegram pushes to tolkien's API)

### Tasks

- [ ] Choose Telegram vs Signal
- [ ] Implement the bot handler
- [ ] Add a webhook endpoint to the server
- [ ] Deploy and test

---

## Open Questions

1. **Ansible access**: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to the external hosts — a big blast radius. Start read-only?
2. **write_code scope**: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
3. **Streaming responses**: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers (see the sketch after this list).
4. **Rate limiting**: Any concern about Anthropic API costs? Could add a simple per-session token budget.
5. **lotr access in-cluster**: Git clone as an init container (stale), mount from a shared PVC, or fetch on demand via the Gitea API? The Gitea API is simplest and always fresh.
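On question 3: the Anthropic Python SDK supports streaming out of the box, so the work is mostly in relaying chunks through the server API (e.g. SSE or chunked HTTP). A minimal client-side sketch, with the model id as a placeholder:

```python
import anthropic

client = anthropic.Anthropic()

# Stream a single reply and print tokens as they arrive.
with client.messages.stream(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    messages=[{"role": "user", "content": "Which apps are failing to sync?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()
```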