Tolkien — Infrastructure Agent for Valinor

Overview

Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.

Architecture: Client-server model.

  • Server runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
  • CLI client is a thin REPL on the operator's machine that connects to the server over HTTPS

Key Decisions

| Decision   | Choice                          | Rationale                                                 |
|------------|---------------------------------|-----------------------------------------------------------|
| Language   | Python (uv)                     | Matches infra-agent pattern, good Anthropic SDK support   |
| LLM        | Anthropic API (direct)          | Simpler than Vertex AI for homelab use                    |
| Interface  | CLI REPL → server API           | Start with CLI, add Telegram/Signal later                 |
| Registry   | Gitea container registry        | Zero additional setup, gitea.jpnadas.xyz/jpnadas/tolkien  |
| Deployment | Helm chart in valinor, ArgoCD   | Self-managed, same pattern as all other apps              |
| CI         | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds               |

Infrastructure Context

Valinor Repository Structure

valinor/
├── apps/                    # Helm-based apps (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/               # Raw k8s manifests (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, *.yaml
├── ansible/                 # Docker Compose stacks on external hosts
│   ├── stacks/isildur/      # Pi 5 — Caddy, Unifi Controller, HAProxy
│   ├── stacks/iluvatar/     # ZFS NAS — iSCSI, NFS
│   ├── playbooks/
│   └── inventory.yml
├── terraform/               # IaC for supporting services
│   ├── vault/               # Vault policies + k8s auth roles
│   ├── minio/               # Buckets, users, S3 creds → Vault
│   ├── cloudflare/          # DNS records
│   ├── arr/                 # Arr stack config
│   └── netbox/              # NetBox resources
└── applicationset.yaml      # ArgoCD ApplicationSet definitions

Cluster Nodes

| Node     | Role           | Notes                                    |
|----------|----------------|------------------------------------------|
| merry    | k3s controller | Dedicated SSD (etcd only)                |
| sam      | k3s worker     | SATA SSD (Longhorn + databases)          |
| pippin   | k3s worker     | SATA SSD (Longhorn + databases)          |
| rosie    | k3s worker     | NVMe SSD (fastest, Longhorn + databases) |
| isildur  | external       | Pi 5 — Caddy, Unifi, HAProxy, Gitea      |
| iluvatar | external       | Optiplex — ZFS NAS, iSCSI, NFS           |

Storage Classes

| Class           | Backend                     | Use Case                                   |
|-----------------|-----------------------------|--------------------------------------------|
| longhorn        | Distributed SSD (replica=2) | Configs, caches, app state                 |
| local-path      | Node-local SSD              | CNPG databases (PG replication handles HA) |
| bulk-storage    | ZFS on iluvatar (iSCSI)     | Media, downloads                           |
| frigate-storage | ZFS on iluvatar (iSCSI)     | Frigate recordings                         |

Key Services

  • ArgoCD: GitOps sync (auto-sync + prune + self-heal)
  • Vault: Secret management (Vault Secrets Operator in cluster)
  • MinIO: S3-compatible storage for CNPG backups
  • CNPG: CloudNativePG for PostgreSQL (local-path + streaming replication)
  • Longhorn: Distributed block storage
  • VolSync: PVC backup/restore
  • cert-manager: TLS certificates
  • ingress-nginx: Ingress controller
  • MetalLB: Bare-metal load balancer

Documentation

  • lord-of-the-rings (~/Personal/lord-of-the-rings/): All homelab docs
    • homelab/valinor/ — per-app docs, migration plans, storage docs
    • homelab/ — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
  • valinor CLAUDE.md: Contains repo conventions, app structure, skills

Git Workflow

  • Never push directly to main
  • Feature/fix branches → PR via tea CLI → merge in Gitea
  • ArgoCD auto-syncs from main branch

Phase 1: Project Skeleton + CLI REPL

Goal

A working REPL that sends messages to Claude and gets responses, with no tools yet.

Project Structure

tolkien/
├── pyproject.toml           # uv project config
├── plan.md                  # This file
├── README.md
├── tolkien/
│   ├── __init__.py
│   ├── __main__.py          # CLI entrypoint (repl or serve)
│   ├── agent.py             # Claude tool-calling loop (Anthropic SDK)
│   ├── session.py           # Conversation state + history trimming
│   ├── cli.py               # REPL client (connects to server API)
│   ├── server.py            # HTTP API server (FastAPI or Flask)
│   ├── config.py            # Settings (env vars, defaults)
│   └── tools/
│       └── __init__.py      # Tool registry + dispatch
└── tests/
    └── ...

Entrypoints

  • python -m tolkien serve — Start the API server
  • python -m tolkien repl — Start the CLI REPL (thin client to the server)
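
A minimal sketch of __main__.py, assuming argparse subcommands; cli.run_repl, server.run, and Settings are placeholder names rather than a final API:

import argparse

from tolkien import cli, server
from tolkien.config import Settings


def main() -> None:
    parser = argparse.ArgumentParser(prog="tolkien")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("serve", help="Start the API server")
    sub.add_parser("repl", help="Start the CLI REPL (thin client)")
    args = parser.parse_args()

    settings = Settings()  # config.py loads env vars / .env defaults
    if args.command == "serve":
        server.run(settings)
    else:
        cli.run_repl(settings)


if __name__ == "__main__":
    main()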

API Design (Server)

POST /sessions              → Create a new session, returns {session_id}
POST /sessions/{id}/messages → Send a message, returns {response}
GET  /sessions/{id}         → Get session state
DELETE /sessions/{id}       → End session
GET  /healthz               → Health check
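
A rough FastAPI sketch of the core endpoints (GET and DELETE on sessions follow the same pattern); the in-memory dict and the run_agent_turn echo stub are placeholders for session.py and agent.py:

import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
sessions: dict[str, list] = {}  # session_id -> message history (placeholder for session.py)


class MessageIn(BaseModel):
    content: str


def run_agent_turn(history: list, content: str) -> str:
    # Placeholder: agent.py's Claude loop replaces this echo.
    history.append({"role": "user", "content": content})
    return f"(echo) {content}"


@app.post("/sessions")
def create_session():
    session_id = str(uuid.uuid4())
    sessions[session_id] = []
    return {"session_id": session_id}


@app.post("/sessions/{session_id}/messages")
def send_message(session_id: str, msg: MessageIn):
    if session_id not in sessions:
        raise HTTPException(status_code=404, detail="unknown session")
    return {"response": run_agent_turn(sessions[session_id], msg.content)}


@app.get("/healthz")
def healthz():
    return {"status": "ok"}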

Session Model

  • In-memory dict (session_id → Session)
  • Session holds conversation history (messages list)
  • History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
  • Session expiry after 30 min inactivity
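
A possible shape for the Session object, assuming the 20-turn cap and 30-minute TTL above; names and constants are illustrative:

from dataclasses import dataclass, field
from time import monotonic

SESSION_TTL_SECONDS = 30 * 60  # expire after 30 min of inactivity
MAX_TURNS = 20                 # keep the first turn plus the last 20


@dataclass
class Session:
    messages: list[dict] = field(default_factory=list)
    last_activity: float = field(default_factory=monotonic)

    def touch(self) -> None:
        self.last_activity = monotonic()

    @property
    def expired(self) -> bool:
        return monotonic() - self.last_activity > SESSION_TTL_SECONDS

    def trimmed(self) -> list[dict]:
        # A turn is a user/assistant pair: keep the first pair for context
        # plus the most recent MAX_TURNS pairs.
        if len(self.messages) <= 2 * (MAX_TURNS + 1):
            return self.messages
        return self.messages[:2] + self.messages[-2 * MAX_TURNS:]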

Agent Loop

Based on infra-agent's _drive_to_end_turn():

  1. Build messages list (system prompt + conversation history)
  2. Call Claude API with tools
  3. If tool_use in response → execute tools concurrently → append results
  4. Loop until stop_reason == "end_turn"
  5. Return assistant text
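
A hedged sketch of that loop with the Anthropic SDK; the model name and the execute_tool stub are placeholders (the real dispatch arrives in Phase 2, where tool execution also becomes concurrent):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def execute_tool(name: str, args: dict) -> str:
    # Placeholder: Phase 2 wires this to the tool registry's DISPATCH table.
    return f"tool {name} is not implemented yet"


def drive_to_end_turn(system_prompt: str, messages: list[dict], tools: list[dict]) -> str:
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model choice
            max_tokens=4096,
            system=system_prompt,
            messages=messages,
            tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # end_turn (or max_tokens): return the concatenated text blocks
            return "".join(block.text for block in response.content if block.type == "text")
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})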

Dependencies

  • anthropic — Claude API SDK
  • httpx — CLI client HTTP calls
  • flask or fastapi + uvicorn — API server
  • rich — CLI REPL formatting
  • python-dotenv — env var loading

Tasks

  • Initialize uv project with pyproject.toml
  • Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
  • Implement agent.py with Claude loop (no tools yet)
  • Implement session.py
  • Implement server.py with session + message endpoints
  • Implement cli.py REPL that POSTs to the server
  • Implement __main__.py with serve/repl subcommands
  • Test locally: run server, connect with REPL, have a conversation

Phase 2: Tool System

Goal

Give the agent tools to query and understand the homelab infrastructure.

Tool Architecture

Each tool module exports:

  • TOOLS: list[dict] — Claude tool schemas (name, description, input_schema)
  • DISPATCH: dict[str, Callable] — name → handler function

tools/__init__.py aggregates all modules into a single registry.
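
A sketch of what that aggregation could look like (module names match the list below; execute_tool is an assumed helper used by agent.py):

from tolkien.tools import argocd, docs, gitea, kubectl, vault, web, write_code

_MODULES = [kubectl, gitea, argocd, vault, docs, web, write_code]

# Flatten every module's schemas and handlers into one registry.
TOOLS: list[dict] = [schema for module in _MODULES for schema in module.TOOLS]
DISPATCH: dict = {name: fn for module in _MODULES for name, fn in module.DISPATCH.items()}


def execute_tool(name: str, args: dict) -> str:
    handler = DISPATCH.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    return handler(**args)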

Tools to Implement

tools/kubectl.py — Kubernetes Queries

  • Allowed verbs: get, describe, logs, top, auth can-i
  • Blocked: create, delete, apply, patch, edit, exec, port-forward
  • Timeout: 60s
  • Output truncation: 12,000 chars
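
A sketch of the kubectl wrapper under those constraints, assuming a single free-form args string as the tool input (verb allowlisting, timeout, and truncation as above):

import shlex
import subprocess

ALLOWED_VERBS = {"get", "describe", "logs", "top", "auth"}  # "auth" covers "auth can-i"
TIMEOUT_SECONDS = 60
MAX_OUTPUT_CHARS = 12_000

TOOLS = [{
    "name": "kubectl",
    "description": "Run a read-only kubectl command (get, describe, logs, top, auth can-i).",
    "input_schema": {
        "type": "object",
        "properties": {"args": {"type": "string", "description": "Everything after 'kubectl'"}},
        "required": ["args"],
    },
}]


def run_kubectl(args: str) -> str:
    parts = shlex.split(args)
    if not parts or parts[0] not in ALLOWED_VERBS:
        return f"Blocked: only {sorted(ALLOWED_VERBS)} are allowed."
    try:
        result = subprocess.run(
            ["kubectl", *parts], capture_output=True, text=True, timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return f"kubectl timed out after {TIMEOUT_SECONDS}s"
    return (result.stdout + result.stderr)[:MAX_OUTPUT_CHARS]


DISPATCH = {"kubectl": run_kubectl}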

tools/gitea.py — Gitea via tea CLI

  • List/view issues
  • List/view PRs
  • Create issues, create PRs (draft)
  • Search repos
  • Blocked: delete operations

tools/argocd.py — ArgoCD Status

  • argocd app list — all apps and sync status
  • argocd app get <name> — detailed app status
  • argocd app diff <name> — pending changes
  • argocd app history <name> — deployment history
  • Read-only: no sync, rollback, or delete operations

tools/vault.py — Vault Metadata Only

  • vault kv metadata get <path> — check secret exists, see versions
  • vault kv list <path> — list secret paths
  • Explicitly blocked: vault kv get (cannot read actual secret values)
  • Policy: capabilities = ["read", "list"] on secret/metadata/* (read is required for metadata get), deny on secret/data/*

tools/docs.py — Lord of the Rings Documentation

  • Read files from ~/Personal/lord-of-the-rings/
  • List available docs (directory listing)
  • Search docs content (grep)
  • In server context: mount lotr as a volume or clone it
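
A sketch of the docs helpers, assuming the repo is available on the local filesystem at the path below; handlers only, tool schemas omitted:

from pathlib import Path

DOCS_ROOT = (Path.home() / "Personal" / "lord-of-the-rings").resolve()  # volume/clone path in-cluster
MAX_OUTPUT_CHARS = 12_000


def _resolve(path: str) -> Path | None:
    target = (DOCS_ROOT / path).resolve()
    return target if target.is_relative_to(DOCS_ROOT) else None


def list_docs(subdir: str = "") -> str:
    base = _resolve(subdir)
    if base is None:
        return "Blocked: path escapes the docs root."
    return "\n".join(str(p.relative_to(DOCS_ROOT)) for p in sorted(base.rglob("*.md")))


def read_doc(path: str) -> str:
    target = _resolve(path)
    if target is None or not target.is_file():
        return "Not found."
    return target.read_text()[:MAX_OUTPUT_CHARS]


def search_docs(pattern: str) -> str:
    hits = []
    for p in DOCS_ROOT.rglob("*.md"):
        for lineno, line in enumerate(p.read_text(errors="ignore").splitlines(), start=1):
            if pattern.lower() in line.lower():
                hits.append(f"{p.relative_to(DOCS_ROOT)}:{lineno}: {line.strip()}")
    return "\n".join(hits[:200]) or "No matches."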

tools/web.py — Web Fetch

  • Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
  • HTTPS only
  • Timeout: 15s
  • Output truncation: 12,000 chars
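
A sketch of the fetch handler with httpx; the allowlisted domains here are illustrative, not a decided list:

from urllib.parse import urlparse

import httpx

ALLOWED_DOMAINS = {"kubernetes.io", "helm.sh", "argo-cd.readthedocs.io"}  # illustrative allowlist
TIMEOUT_SECONDS = 15
MAX_OUTPUT_CHARS = 12_000


def fetch_url(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return "Blocked: HTTPS URLs only."
    if parsed.hostname not in ALLOWED_DOMAINS:
        return f"Blocked: {parsed.hostname} is not on the allowlist."
    try:
        response = httpx.get(url, timeout=TIMEOUT_SECONDS)
    except httpx.HTTPError as exc:
        return f"Fetch failed: {exc}"
    return response.text[:MAX_OUTPUT_CHARS]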

tools/write_code.py — Claude Code Subprocess

  • Clones valinor repo → runs claude --print → creates branch + PR
  • Risk classification for changes (optional, simpler than infra-agent's)
  • Always creates draft PRs, never merges
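
A rough sketch of the flow; the clone URL is an assumption, and the exact tea invocation for opening the draft PR is left open:

import subprocess
import tempfile

VALINOR_REPO = "https://gitea.jpnadas.xyz/jpnadas/valinor.git"  # assumed clone URL


def write_code(task: str, branch: str) -> str:
    workdir = tempfile.mkdtemp(prefix="tolkien-")

    def run(*cmd: str) -> None:
        subprocess.run(cmd, cwd=workdir, check=True, capture_output=True, text=True)

    subprocess.run(["git", "clone", VALINOR_REPO, workdir], check=True)
    run("git", "checkout", "-b", branch)
    run("claude", "--print", task)  # Claude Code edits the checkout non-interactively
    run("git", "add", "-A")
    run("git", "commit", "-m", f"tolkien: {task[:60]}")
    run("git", "push", "origin", branch)
    # Draft PR creation via the tea CLI goes here (exact invocation to be decided)
    return f"Pushed branch {branch}; open a draft PR from it."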

Tasks

  • Implement tool registry in tools/__init__.py
  • Implement each tool module
  • Wire tools into agent.py (pass to Claude API, dispatch results)
  • Add concurrent tool execution (ThreadPoolExecutor)
  • Test each tool in isolation
  • Test full loop: REPL → server → Claude → tool → response

Phase 3: System Prompt + Knowledge

Goal

Give the agent enough context to be genuinely useful for homelab operations.

System Prompt Sections

  1. Identity: You are Tolkien, an infrastructure agent for the valinor homelab
  2. Infrastructure overview: Cluster nodes, storage classes, key services
  3. Repository structure: How valinor is organized (apps/, manifests/, ansible/, terraform/)
  4. Tool usage guide: When to use each tool, with examples
  5. Workflows: Step-by-step for common operations:
    • Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
    • Checking app health (argocd status, kubectl pods, logs)
    • Investigating issues (kubectl describe, logs, events)
    • Checking backup status (volsync, CNPG S3 backups)
    • Managing external hosts (ansible stacks)
  6. Documentation references: How to find and cite lotr docs
  7. Safety rules: Read-only by default, write_code for changes, always draft PRs

Tasks

  • Write system prompt in agent.py (or separate prompt.py)
  • Test with real scenarios (deploy app, check status, troubleshoot)
  • Iterate on prompt based on results

Phase 4: Containerize + Deploy

Goal

Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.

Dockerfile

FROM python:3.13-slim
# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API

Helm Chart (in valinor repo)

valinor/apps/tolkien/
├── config.yaml          # Chart reference (bjw-s app-template)
├── values.yaml          # Image, env, probes, volumes
├── vault-auth.yaml      # VaultAuth for k8s auth
└── vault-secret.yaml    # VaultStaticSecret (ANTHROPIC_API_KEY)

Networking

  • Ingress: tolkien.jpnadas.xyz (or internal-only via ClusterIP + VPN)
  • Consider: API key auth or mTLS for the API endpoint (don't expose unauthenticated)

API Authentication

Simple shared secret for now:

  • Server checks Authorization: Bearer <token> header
  • Token stored in Vault, configured in CLI via env var
  • Can upgrade to mTLS or OAuth later
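
A sketch of the bearer-token check as a FastAPI dependency; the TOLKIEN_API_TOKEN env var name is an assumption:

import os
import secrets
from uuid import uuid4

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()
API_TOKEN = os.environ["TOLKIEN_API_TOKEN"]  # assumed env var name, injected from Vault


def require_token(request: Request) -> None:
    header = request.headers.get("Authorization", "")
    token = header.removeprefix("Bearer ")
    if not header.startswith("Bearer ") or not secrets.compare_digest(token, API_TOKEN):
        raise HTTPException(status_code=401, detail="invalid or missing token")


@app.get("/healthz")
def healthz():
    # Probes stay unauthenticated so the kubelet can reach them.
    return {"status": "ok"}


@app.post("/sessions", dependencies=[Depends(require_token)])
def create_session():
    # Same handler as Phase 1, now gated behind the shared token.
    return {"session_id": str(uuid4())}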

Volume Mounts

  • lord-of-the-rings docs: either git-clone init container or PVC
  • Valinor repo: clone on-demand for write_code tool
  • kubectl: ServiceAccount with read-only ClusterRole

RBAC

  • ServiceAccount tolkien with read-only ClusterRole:
    • get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
    • get/list on nodes
    • logs on pods
  • No write permissions

Vault

  • Policy: read secret/data/tolkien/*, list secret/metadata/*
  • Kubernetes auth role bound to tolkien ServiceAccount

Tasks

  • Write Dockerfile
  • Build and push to Gitea registry manually (first time)
  • Create valinor/apps/tolkien/ with all manifests
  • Add Vault terraform module for tolkien
  • Create RBAC manifests (ClusterRole + ClusterRoleBinding)
  • Deploy via ArgoCD
  • Test CLI → server connectivity
  • Set up Gitea Actions runner in cluster (separate task)
  • Set up CI pipeline (.gitea/workflows/) for build + push on tag

Phase 5: Messaging Integration (Future)

Goal

Add Telegram or Signal as alternative interfaces.

Approach

  • Add a telegram.py handler (python-telegram-bot library)
  • Same session/agent backend, just different input/output transport
  • Bot token in Vault
  • Webhook mode (Telegram pushes to tolkien's API)
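
A sketch of the webhook endpoint using the raw Telegram Bot HTTP API rather than python-telegram-bot, just to show the transport; run_agent_turn_for_chat stands in for the shared session/agent backend:

import os

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]  # assumed env var, sourced from Vault
TELEGRAM_API = f"https://api.telegram.org/bot{BOT_TOKEN}"


def run_agent_turn_for_chat(chat_id: int, text: str) -> str:
    # Placeholder: look up or create a Session keyed by chat_id, then run the agent loop.
    return f"received: {text}"


@app.post("/telegram/webhook")
async def telegram_webhook(request: Request):
    update = await request.json()
    message = update.get("message")
    if not message or "text" not in message:
        return {"ok": True}
    chat_id = message["chat"]["id"]
    reply = run_agent_turn_for_chat(chat_id, message["text"])
    async with httpx.AsyncClient() as client:
        await client.post(f"{TELEGRAM_API}/sendMessage", json={"chat_id": chat_id, "text": reply})
    return {"ok": True}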

Tasks

  • Choose Telegram vs Signal
  • Implement bot handler
  • Add webhook endpoint to server
  • Deploy and test

Open Questions

  1. Ansible access: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to external hosts — big blast radius. Start read-only?
  2. write_code scope: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
  3. Streaming responses: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers.
  4. Rate limiting: Any concern about Anthropic API costs? Could add a simple per-session token budget.
  5. lotr access in-cluster: Git clone as init container (stale) vs. mount from a shared PVC vs. fetch on demand via Gitea API? Gitea API is simplest and always fresh.