Tolkien — Infrastructure Agent for Valinor
Overview
Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.
Architecture: Client-server model.
- Server runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
- CLI client is a thin REPL on the operator's machine that connects to the server over HTTPS
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Language | Python (uv) | Matches infra-agent pattern, good Anthropic SDK support |
| LLM | Anthropic API (direct) | Simpler than Vertex AI for homelab use |
| Interface | CLI REPL → server API | Start with CLI, add Telegram/Signal later |
| Registry | Gitea container registry | Zero additional setup, gitea.jpnadas.xyz/jpnadas/tolkien |
| Deployment | Helm chart in valinor, ArgoCD | Self-managed, same pattern as all other apps |
| CI | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds |
Infrastructure Context
Valinor Repository Structure
valinor/
├── apps/ # Helm-based apps (ArgoCD ApplicationSet)
│ └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/ # Raw k8s manifests (ArgoCD ApplicationSet)
│ └── <app>/config.yaml, *.yaml
├── ansible/ # Docker Compose stacks on external hosts
│ ├── stacks/isildur/ # Pi 5 — Caddy, Unifi Controller, HAProxy
│ ├── stacks/iluvatar/ # ZFS NAS — iSCSI, NFS
│ ├── playbooks/
│ └── inventory.yml
├── terraform/ # IaC for supporting services
│ ├── vault/ # Vault policies + k8s auth roles
│ ├── minio/ # Buckets, users, S3 creds → Vault
│ ├── cloudflare/ # DNS records
│ ├── arr/ # Arr stack config
│ └── netbox/ # NetBox resources
└── applicationset.yaml # ArgoCD ApplicationSet definitions
Cluster Nodes
| Node | Role | Notes |
|---|---|---|
| merry | k3s controller | Dedicated SSD (etcd only) |
| sam | k3s worker | SATA SSD (Longhorn + databases) |
| pippin | k3s worker | SATA SSD (Longhorn + databases) |
| rosie | k3s worker | NVMe SSD (fastest, Longhorn + databases) |
| isildur | external | Pi 5 — Caddy, Unifi, HAProxy, Gitea |
| iluvatar | external | Optiplex — ZFS NAS, iSCSI, NFS |
Storage Classes
| Class | Backend | Use Case |
|---|---|---|
| longhorn | Distributed SSD (replica=2) | Configs, caches, app state |
| local-path | Node-local SSD | CNPG databases (PG replication handles HA) |
| bulk-storage | ZFS on iluvatar (iSCSI) | Media, downloads |
| frigate-storage | ZFS on iluvatar (iSCSI) | Frigate recordings |
Key Services
- ArgoCD: GitOps sync (auto-sync + prune + self-heal)
- Vault: Secret management (Vault Secrets Operator in cluster)
- MinIO: S3-compatible storage for CNPG backups
- CNPG: CloudNativePG for PostgreSQL (local-path + streaming replication)
- Longhorn: Distributed block storage
- VolSync: PVC backup/restore
- cert-manager: TLS certificates
- ingress-nginx: Ingress controller
- MetalLB: Bare-metal load balancer
Documentation
- lord-of-the-rings (`~/Personal/lord-of-the-rings/`): all homelab docs
  - `homelab/valinor/` — per-app docs, migration plans, storage docs
  - `homelab/` — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
- valinor CLAUDE.md: Contains repo conventions, app structure, skills
Git Workflow
- Never push directly to main
- Feature/fix branches → PR via the `tea` CLI → merge in Gitea
- ArgoCD auto-syncs from the main branch
Phase 1: Project Skeleton + CLI REPL
Goal
A working REPL that sends messages to Claude and gets responses, with no tools yet.
Project Structure
tolkien/
├── pyproject.toml # uv project config
├── plan.md # This file
├── README.md
├── tolkien/
│ ├── __init__.py
│ ├── __main__.py # CLI entrypoint (repl or serve)
│ ├── agent.py # Claude tool-calling loop (Anthropic SDK)
│ ├── session.py # Conversation state + history trimming
│ ├── cli.py # REPL client (connects to server API)
│ ├── server.py # HTTP API server (FastAPI or Flask)
│ ├── config.py # Settings (env vars, defaults)
│ └── tools/
│ └── __init__.py # Tool registry + dispatch
└── tests/
└── ...
Entrypoints
- `python -m tolkien serve` — start the API server
- `python -m tolkien repl` — start the CLI REPL (thin client to the server)
API Design (Server)
POST /sessions → Create a new session, returns {session_id}
POST /sessions/{id}/messages → Send a message, returns {response}
GET /sessions/{id} → Get session state
DELETE /sessions/{id} → End session
GET /healthz → Health check
Session Model
- In-memory dict (session_id → Session)
- Session holds conversation history (messages list)
- History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
- Session expiry after 30 min inactivity
Agent Loop
Based on infra-agent's _drive_to_end_turn():
- Build messages list (system prompt + conversation history)
- Call Claude API with tools
- If tool_use in response → execute tools concurrently → append results
- Loop until stop_reason == "end_turn"
- Return assistant text
Dependencies
- `anthropic` — Claude API SDK
- `httpx` — CLI client HTTP calls
- `flask` or `fastapi` + `uvicorn` — API server
- `rich` — CLI REPL formatting
- `python-dotenv` — env var loading
Tasks
- Initialize uv project with pyproject.toml
- Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
- Implement agent.py with Claude loop (no tools yet)
- Implement session.py
- Implement server.py with session + message endpoints
- Implement cli.py REPL that POSTs to the server
- Implement `__main__.py` with serve/repl subcommands
- Test locally: run server, connect with REPL, have a conversation
Phase 2: Tool System
Goal
Give the agent tools to query and understand the homelab infrastructure.
Tool Architecture
Each tool module exports:
- `TOOLS: list[dict]` — Claude tool schemas (name, description, input_schema)
- `DISPATCH: dict[str, Callable]` — name → handler function
tools/__init__.py aggregates all modules into a single registry.
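A sketch of that aggregation, written as a pure function so it can be tested against fake modules (the real `tools/__init__.py` would call it with the actual tool modules):

```python
def build_registry(modules) -> tuple[list[dict], dict]:
    """Merge each module's TOOLS / DISPATCH into one schema list + dispatch map.

    Raises on duplicate tool names so one module can't silently shadow another.
    """
    tools: list[dict] = []
    dispatch: dict = {}
    for mod in modules:
        for schema in mod.TOOLS:
            name = schema["name"]
            if name in dispatch:
                raise ValueError(f"duplicate tool name: {name}")
            tools.append(schema)
            dispatch[name] = mod.DISPATCH[name]
    return tools, dispatch

# In tools/__init__.py, roughly:
# TOOLS, DISPATCH = build_registry([kubectl, gitea, argocd, vault, docs, web, write_code])
```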
Tools to Implement
tools/kubectl.py — Kubernetes Queries
- Allowed verbs: get, describe, logs, top, auth can-i
- Blocked: create, delete, apply, patch, edit, exec, port-forward
- Timeout: 60s
- Output truncation: 12,000 chars
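A sketch of the guard rails above. Validation is split from execution so the allowlist logic can be tested without a cluster; the function names are illustrative:

```python
import shlex
import subprocess

ALLOWED_VERBS = {"get", "describe", "logs", "top", "auth"}
MAX_OUTPUT = 12_000  # chars
TIMEOUT = 60         # seconds


def build_kubectl_argv(command: str) -> list[str]:
    """Validate a kubectl command string and return the argv to run.

    Only read-only verbs pass; `auth` is further restricted to `auth can-i`.
    """
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_VERBS:
        raise ValueError(f"verb not allowed: {argv[0] if argv else '(empty)'}")
    if argv[0] == "auth" and argv[1:2] != ["can-i"]:
        raise ValueError("only `auth can-i` is permitted")
    return ["kubectl", *argv]


def run_kubectl(command: str) -> str:
    out = subprocess.run(build_kubectl_argv(command), capture_output=True,
                         text=True, timeout=TIMEOUT)
    return (out.stdout + out.stderr)[:MAX_OUTPUT]  # truncate long output
```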
tools/gitea.py — Gitea via tea CLI
- List/view issues
- List/view PRs
- Create issues, create PRs (draft)
- Search repos
- Blocked: delete operations
tools/argocd.py — ArgoCD Status
- `argocd app list` — all apps and sync status
- `argocd app get <name>` — detailed app status
- `argocd app diff <name>` — pending changes
- `argocd app history <name>` — deployment history
- Read-only: no sync, rollback, or delete operations
tools/vault.py — Vault Metadata Only
- `vault kv metadata get <path>` — check a secret exists, see versions
- `vault kv list <path>` — list secret paths
- Explicitly blocked: `vault kv get` (cannot read actual secret values)
- Policy: `capabilities = ["list"]` on `secret/metadata/*`, deny on `secret/data/*`
tools/docs.py — Lord of the Rings Documentation
- Read files from `~/Personal/lord-of-the-rings/`
- List available docs (directory listing)
- Search docs content (grep)
- In server context: mount lotr as a volume or clone it
tools/web.py — Web Fetch
- Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
- HTTPS only
- Timeout: 15s
- Output truncation: 12,000 chars
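The HTTPS-only and allowlist checks can be done with `urllib.parse` before any request is made. The hostnames below are illustrative stand-ins for the real allowlist:

```python
from urllib.parse import urlparse

# Illustrative allowlist; the real set would list the approved doc sites.
ALLOWED_HOSTS = {"kubernetes.io", "helm.sh", "argo-cd.readthedocs.io"}


def check_url(url: str) -> str:
    """Enforce HTTPS + host allowlist before fetching; returns the URL if OK."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("HTTPS only")
    host = parsed.hostname or ""
    # Accept exact matches and subdomains of allowlisted hosts.
    if host not in ALLOWED_HOSTS and not any(host.endswith("." + h) for h in ALLOWED_HOSTS):
        raise ValueError(f"host not allowlisted: {host}")
    return url
```

The actual fetch (via `httpx` with the 15 s timeout and 12,000-char truncation) would only run after `check_url` passes.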
tools/write_code.py — Claude Code Subprocess
- Clones the valinor repo → runs `claude --print` → creates branch + PR
- Risk classification for changes (optional, simpler than infra-agent's)
- Always creates draft PRs, never merges
Tasks
- Implement tool registry in `tools/__init__.py`
- Implement each tool module
- Wire tools into agent.py (pass to Claude API, dispatch results)
- Add concurrent tool execution (ThreadPoolExecutor)
- Test each tool in isolation
- Test full loop: REPL → server → Claude → tool → response
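The concurrent tool execution task can be sketched with `ThreadPoolExecutor`, catching per-tool failures so one broken tool doesn't kill the whole turn (the error-to-`tool_result` convention and worker count are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor


def run_tools_concurrently(tool_calls, dispatch, max_workers=4):
    """Run a batch of tool_use requests in parallel, preserving request order.

    `tool_calls` is a list of (tool_use_id, name, kwargs); exceptions become
    error-flagged tool_result blocks instead of crashing the agent loop.
    """
    def run_one(call):
        tool_use_id, name, kwargs = call
        try:
            content, is_error = dispatch[name](**kwargs), False
        except Exception as exc:  # surface the failure to the model
            content, is_error = f"tool error: {exc}", True
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": content, "is_error": is_error}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, tool_calls))  # map preserves input order
```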
Phase 3: System Prompt + Knowledge
Goal
Give the agent enough context to be genuinely useful for homelab operations.
System Prompt Sections
- Identity: You are Tolkien, an infrastructure agent for the valinor homelab
- Infrastructure overview: Cluster nodes, storage classes, key services
- Repository structure: How valinor is organized (apps/, manifests/, ansible/, terraform/)
- Tool usage guide: When to use each tool, with examples
- Workflows: Step-by-step for common operations:
- Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
- Checking app health (argocd status, kubectl pods, logs)
- Investigating issues (kubectl describe, logs, events)
- Checking backup status (volsync, CNPG S3 backups)
- Managing external hosts (ansible stacks)
- Documentation references: How to find and cite lotr docs
- Safety rules: Read-only by default, write_code for changes, always draft PRs
Tasks
- Write system prompt in agent.py (or separate prompt.py)
- Test with real scenarios (deploy app, check status, troubleshoot)
- Iterate on prompt based on results
Phase 4: Containerize + Deploy
Goal
Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.
Dockerfile
FROM python:3.13-slim
# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API
Helm Chart (in valinor repo)
valinor/apps/tolkien/
├── config.yaml # Chart reference (bjw-s app-template)
├── values.yaml # Image, env, probes, volumes
├── vault-auth.yaml # VaultAuth for k8s auth
└── vault-secret.yaml # VaultStaticSecret (ANTHROPIC_API_KEY)
Networking
- Ingress: `tolkien.jpnadas.xyz` (or internal-only via ClusterIP + VPN)
- Consider API key auth or mTLS for the API endpoint (don't expose it unauthenticated)
API Authentication
Simple shared secret for now:
- Server checks the `Authorization: Bearer <token>` header
- Token stored in Vault, configured in the CLI via an env var
- Can upgrade to mTLS or OAuth later
Volume Mounts
- lord-of-the-rings docs: either git-clone init container or PVC
- Valinor repo: clone on-demand for write_code tool
- kubectl: ServiceAccount with read-only ClusterRole
RBAC
- ServiceAccount `tolkien` with a read-only ClusterRole:
  - get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
  - get/list on nodes
  - logs on pods
- No write permissions
Vault
- Policy: read `secret/data/tolkien/*`, list `secret/metadata/*`
- Kubernetes auth role bound to the `tolkien` ServiceAccount
Tasks
- Write Dockerfile
- Build and push to Gitea registry manually (first time)
- Create valinor/apps/tolkien/ with all manifests
- Add Vault terraform module for tolkien
- Create RBAC manifests (ClusterRole + ClusterRoleBinding)
- Deploy via ArgoCD
- Test CLI → server connectivity
- Set up Gitea Actions runner in cluster (separate task)
- Set up CI pipeline (.gitea/workflows/) for build + push on tag
Phase 5: Messaging Integration (Future)
Goal
Add Telegram or Signal as alternative interfaces.
Approach
- Add a `telegram.py` handler (python-telegram-bot library)
- Same session/agent backend, just a different input/output transport
- Bot token in Vault
- Webhook mode (Telegram pushes to tolkien's API)
Tasks
- Choose Telegram vs Signal
- Implement bot handler
- Add webhook endpoint to server
- Deploy and test
Open Questions
- Ansible access: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to external hosts — big blast radius. Start read-only?
- write_code scope: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
- Streaming responses: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers.
- Rate limiting: Any concern about Anthropic API costs? Could add a simple per-session token budget.
- lotr access in-cluster: Git clone as init container (stale) vs. mount from a shared PVC vs. fetch on demand via Gitea API? Gitea API is simplest and always fresh.