Tolkien — Infrastructure Agent for Valinor

Overview

Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.

Architecture: Client-server model.

  • Server runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
  • CLI client is a thin REPL on the operator's machine that connects to the server over HTTPS

Key Decisions

| Decision   | Choice                          | Rationale                                                 |
|------------|---------------------------------|-----------------------------------------------------------|
| Language   | Python (uv)                     | Matches infra-agent pattern, good Anthropic SDK support   |
| LLM        | Anthropic API (direct)          | Simpler than Vertex AI for homelab use                    |
| Interface  | CLI REPL → server API           | Start with CLI, add Telegram/Signal later                 |
| Registry   | Gitea container registry        | Zero additional setup, gitea.jpnadas.xyz/jpnadas/tolkien  |
| Deployment | Helm chart in valinor, ArgoCD   | Self-managed, same pattern as all other apps              |
| CI         | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds               |

Infrastructure Context

Valinor Repository Structure

valinor/
├── apps/                    # Helm-based apps (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/               # Raw k8s manifests (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, *.yaml
├── ansible/                 # Docker Compose stacks on external hosts
│   ├── stacks/isildur/      # Pi 5 — Caddy, Unifi Controller, HAProxy
│   ├── stacks/iluvatar/     # ZFS NAS — iSCSI, NFS
│   ├── playbooks/
│   └── inventory.yml
├── terraform/               # IaC for supporting services
│   ├── vault/               # Vault policies + k8s auth roles
│   ├── minio/               # Buckets, users, S3 creds → Vault
│   ├── cloudflare/          # DNS records
│   ├── arr/                 # Arr stack config
│   └── netbox/              # NetBox resources
└── applicationset.yaml      # ArgoCD ApplicationSet definitions

Cluster Nodes

| Node     | Role           | Notes                                    |
|----------|----------------|------------------------------------------|
| merry    | k3s controller | Dedicated SSD (etcd only)                |
| sam      | k3s worker     | SATA SSD (Longhorn + databases)          |
| pippin   | k3s worker     | SATA SSD (Longhorn + databases)          |
| rosie    | k3s worker     | NVMe SSD (fastest, Longhorn + databases) |
| isildur  | external       | Pi 5 — Caddy, Unifi, HAProxy, Gitea      |
| iluvatar | external       | Optiplex — ZFS NAS, iSCSI, NFS           |

Storage Classes

| Class           | Backend                     | Use Case                                   |
|-----------------|-----------------------------|--------------------------------------------|
| longhorn        | Distributed SSD (replica=2) | Configs, caches, app state                 |
| local-path      | Node-local SSD              | CNPG databases (PG replication handles HA) |
| bulk-storage    | ZFS on iluvatar (iSCSI)     | Media, downloads                           |
| frigate-storage | ZFS on iluvatar (iSCSI)     | Frigate recordings                         |

Key Services

  • ArgoCD: GitOps sync (auto-sync + prune + self-heal)
  • Vault: Secret management (Vault Secrets Operator in cluster)
  • MinIO: S3-compatible storage for CNPG backups
  • CNPG: CloudNativePG for PostgreSQL (local-path + streaming replication)
  • Longhorn: Distributed block storage
  • VolSync: PVC backup/restore
  • cert-manager: TLS certificates
  • ingress-nginx: Ingress controller
  • MetalLB: Bare-metal load balancer

Documentation

  • lord-of-the-rings (~/Personal/lord-of-the-rings/): All homelab docs
    • homelab/valinor/ — per-app docs, migration plans, storage docs
    • homelab/ — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
  • valinor CLAUDE.md: Contains repo conventions, app structure, skills

Git Workflow

  • Never push directly to main
  • Feature/fix branches → PR via tea CLI → merge in Gitea
  • ArgoCD auto-syncs from main branch

Phase 1: Project Skeleton + CLI REPL

Goal

A working REPL that sends messages to Claude and gets responses, with no tools yet.

Project Structure

tolkien/
├── pyproject.toml           # uv project config
├── plan.md                  # This file
├── README.md
├── tolkien/
│   ├── __init__.py
│   ├── __main__.py          # CLI entrypoint (repl or serve)
│   ├── agent.py             # Claude tool-calling loop (Anthropic SDK)
│   ├── session.py           # Conversation state + history trimming
│   ├── cli.py               # REPL client (connects to server API)
│   ├── server.py            # HTTP API server (FastAPI or Flask)
│   ├── config.py            # Settings (env vars, defaults)
│   └── tools/
│       └── __init__.py      # Tool registry + dispatch
└── tests/
    └── ...

Entrypoints

  • python -m tolkien serve — Start the API server
  • python -m tolkien repl — Start the CLI REPL (thin client to the server)
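
A minimal sketch of __main__.py, assuming argparse subcommands; cli.run_repl, server.run, and Settings are placeholder names rather than a final API:

import argparse

from tolkien import cli, server
from tolkien.config import Settings


def main() -> None:
    parser = argparse.ArgumentParser(prog="tolkien")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("serve", help="Start the API server")
    sub.add_parser("repl", help="Start the CLI REPL (thin client)")
    args = parser.parse_args()

    settings = Settings()  # config.py loads env vars / .env defaults
    if args.command == "serve":
        server.run(settings)
    else:
        cli.run_repl(settings)


if __name__ == "__main__":
    main()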

API Design (Server)

POST /sessions              → Create a new session, returns {session_id}
POST /sessions/{id}/messages → Send a message, returns {response}
GET  /sessions/{id}         → Get session state
DELETE /sessions/{id}       → End session
GET  /healthz               → Health check
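
A rough FastAPI sketch of the core endpoints (GET and DELETE on sessions follow the same pattern); the in-memory dict and the run_agent_turn echo stub are placeholders for session.py and agent.py:

import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
sessions: dict[str, list] = {}  # session_id -> message history (placeholder for session.py)


class MessageIn(BaseModel):
    content: str


def run_agent_turn(history: list, content: str) -> str:
    # Placeholder: agent.py's Claude loop replaces this echo.
    history.append({"role": "user", "content": content})
    return f"(echo) {content}"


@app.post("/sessions")
def create_session():
    session_id = str(uuid.uuid4())
    sessions[session_id] = []
    return {"session_id": session_id}


@app.post("/sessions/{session_id}/messages")
def send_message(session_id: str, msg: MessageIn):
    if session_id not in sessions:
        raise HTTPException(status_code=404, detail="unknown session")
    return {"response": run_agent_turn(sessions[session_id], msg.content)}


@app.get("/healthz")
def healthz():
    return {"status": "ok"}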

Session Model

  • In-memory dict (session_id → Session)
  • Session holds conversation history (messages list)
  • History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
  • Session expiry after 30 min inactivity
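
A possible shape for the Session object, assuming the 20-turn cap and 30-minute TTL above; names and constants are illustrative:

from dataclasses import dataclass, field
from time import monotonic

SESSION_TTL_SECONDS = 30 * 60  # expire after 30 min of inactivity
MAX_TURNS = 20                 # keep the first turn plus the last 20


@dataclass
class Session:
    messages: list[dict] = field(default_factory=list)
    last_activity: float = field(default_factory=monotonic)

    def touch(self) -> None:
        self.last_activity = monotonic()

    @property
    def expired(self) -> bool:
        return monotonic() - self.last_activity > SESSION_TTL_SECONDS

    def trimmed(self) -> list[dict]:
        # A turn is a user/assistant pair: keep the first pair for context
        # plus the most recent MAX_TURNS pairs.
        if len(self.messages) <= 2 * (MAX_TURNS + 1):
            return self.messages
        return self.messages[:2] + self.messages[-2 * MAX_TURNS:]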

Agent Loop

Based on infra-agent's _drive_to_end_turn():

  1. Build messages list (system prompt + conversation history)
  2. Call Claude API with tools
  3. If tool_use in response → execute tools concurrently → append results
  4. Loop until stop_reason == "end_turn"
  5. Return assistant text
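
A hedged sketch of that loop with the Anthropic SDK; the model name and the execute_tool stub are placeholders (the real dispatch arrives in Phase 2, where tool execution also becomes concurrent):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def execute_tool(name: str, args: dict) -> str:
    # Placeholder: Phase 2 wires this to the tool registry's DISPATCH table.
    return f"tool {name} is not implemented yet"


def drive_to_end_turn(system_prompt: str, messages: list[dict], tools: list[dict]) -> str:
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model choice
            max_tokens=4096,
            system=system_prompt,
            messages=messages,
            tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # end_turn (or max_tokens): return the concatenated text blocks
            return "".join(block.text for block in response.content if block.type == "text")
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})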

Dependencies

  • anthropic — Claude API SDK
  • httpx — CLI client HTTP calls
  • flask or fastapi + uvicorn — API server
  • rich — CLI REPL formatting
  • python-dotenv — env var loading

Tasks

  • Initialize uv project with pyproject.toml
  • Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
  • Implement agent.py with Claude loop (no tools yet)
  • Implement session.py
  • Implement server.py with session + message endpoints
  • Implement cli.py REPL that POSTs to the server
  • Implement __main__.py with serve/repl subcommands
  • Test locally: run server, connect with REPL, have a conversation

Phase 2: Tool System

Goal

Give the agent tools to query and understand the homelab infrastructure.

Tool Architecture

Each tool module exports:

  • TOOLS: list[dict] — Claude tool schemas (name, description, input_schema)
  • DISPATCH: dict[str, Callable] — name → handler function

tools/__init__.py aggregates all modules into a single registry.
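
A sketch of what that aggregation could look like (module names match the list below; execute_tool is an assumed helper used by agent.py):

from tolkien.tools import argocd, docs, gitea, kubectl, vault, web, write_code

_MODULES = [kubectl, gitea, argocd, vault, docs, web, write_code]

# Flatten every module's schemas and handlers into one registry.
TOOLS: list[dict] = [schema for module in _MODULES for schema in module.TOOLS]
DISPATCH: dict = {name: fn for module in _MODULES for name, fn in module.DISPATCH.items()}


def execute_tool(name: str, args: dict) -> str:
    handler = DISPATCH.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    return handler(**args)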

Tools to Implement

tools/kubectl.py — Kubernetes Queries

  • Allowed verbs: get, describe, logs, top, auth can-i
  • Blocked: create, delete, apply, patch, edit, exec, port-forward
  • Timeout: 60s
  • Output truncation: 12,000 chars
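
A sketch of the kubectl wrapper under those constraints, assuming a single free-form args string as the tool input (verb allowlisting, timeout, and truncation as above):

import shlex
import subprocess

ALLOWED_VERBS = {"get", "describe", "logs", "top", "auth"}  # "auth" covers "auth can-i"
TIMEOUT_SECONDS = 60
MAX_OUTPUT_CHARS = 12_000

TOOLS = [{
    "name": "kubectl",
    "description": "Run a read-only kubectl command (get, describe, logs, top, auth can-i).",
    "input_schema": {
        "type": "object",
        "properties": {"args": {"type": "string", "description": "Everything after 'kubectl'"}},
        "required": ["args"],
    },
}]


def run_kubectl(args: str) -> str:
    parts = shlex.split(args)
    if not parts or parts[0] not in ALLOWED_VERBS:
        return f"Blocked: only {sorted(ALLOWED_VERBS)} are allowed."
    try:
        result = subprocess.run(
            ["kubectl", *parts], capture_output=True, text=True, timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return f"kubectl timed out after {TIMEOUT_SECONDS}s"
    return (result.stdout + result.stderr)[:MAX_OUTPUT_CHARS]


DISPATCH = {"kubectl": run_kubectl}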

tools/gitea.py — Gitea via tea CLI

  • List/view issues
  • List/view PRs
  • Create issues, create PRs (draft)
  • Search repos
  • Blocked: delete operations

tools/argocd.py — ArgoCD Status

  • argocd app list — all apps and sync status
  • argocd app get <name> — detailed app status
  • argocd app diff <name> — pending changes
  • argocd app history <name> — deployment history
  • Read-only: no sync, rollback, or delete operations

tools/vault.py — Vault Metadata Only

  • vault kv metadata get <path> — check secret exists, see versions
  • vault kv list <path> — list secret paths
  • Explicitly blocked: vault kv get (cannot read actual secret values)
  • Policy: capabilities = ["read", "list"] on secret/metadata/* (read is required for metadata get), deny on secret/data/*

tools/docs.py — Lord of the Rings Documentation

  • Read files from ~/Personal/lord-of-the-rings/
  • List available docs (directory listing)
  • Search docs content (grep)
  • In server context: mount lotr as a volume or clone it
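
A sketch of the docs helpers, assuming the repo is available on the local filesystem at the path below; handlers only, tool schemas omitted:

from pathlib import Path

DOCS_ROOT = (Path.home() / "Personal" / "lord-of-the-rings").resolve()  # volume/clone path in-cluster
MAX_OUTPUT_CHARS = 12_000


def _resolve(path: str) -> Path | None:
    target = (DOCS_ROOT / path).resolve()
    return target if target.is_relative_to(DOCS_ROOT) else None


def list_docs(subdir: str = "") -> str:
    base = _resolve(subdir)
    if base is None:
        return "Blocked: path escapes the docs root."
    return "\n".join(str(p.relative_to(DOCS_ROOT)) for p in sorted(base.rglob("*.md")))


def read_doc(path: str) -> str:
    target = _resolve(path)
    if target is None or not target.is_file():
        return "Not found."
    return target.read_text()[:MAX_OUTPUT_CHARS]


def search_docs(pattern: str) -> str:
    hits = []
    for p in DOCS_ROOT.rglob("*.md"):
        for lineno, line in enumerate(p.read_text(errors="ignore").splitlines(), start=1):
            if pattern.lower() in line.lower():
                hits.append(f"{p.relative_to(DOCS_ROOT)}:{lineno}: {line.strip()}")
    return "\n".join(hits[:200]) or "No matches."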

tools/web.py — Web Fetch

  • Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
  • HTTPS only
  • Timeout: 15s
  • Output truncation: 12,000 chars
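
A sketch of the fetch handler with httpx; the allowlisted domains here are illustrative, not a decided list:

from urllib.parse import urlparse

import httpx

ALLOWED_DOMAINS = {"kubernetes.io", "helm.sh", "argo-cd.readthedocs.io"}  # illustrative allowlist
TIMEOUT_SECONDS = 15
MAX_OUTPUT_CHARS = 12_000


def fetch_url(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return "Blocked: HTTPS URLs only."
    if parsed.hostname not in ALLOWED_DOMAINS:
        return f"Blocked: {parsed.hostname} is not on the allowlist."
    try:
        response = httpx.get(url, timeout=TIMEOUT_SECONDS)
    except httpx.HTTPError as exc:
        return f"Fetch failed: {exc}"
    return response.text[:MAX_OUTPUT_CHARS]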

tools/write_code.py — Claude Code Subprocess

  • Clones valinor repo → runs claude --print → creates branch + PR
  • Risk classification for changes (optional, simpler than infra-agent's)
  • Always creates draft PRs, never merges
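
A rough sketch of the flow; the clone URL is an assumption, and the exact tea invocation for opening the draft PR is left open:

import subprocess
import tempfile

VALINOR_REPO = "https://gitea.jpnadas.xyz/jpnadas/valinor.git"  # assumed clone URL


def write_code(task: str, branch: str) -> str:
    workdir = tempfile.mkdtemp(prefix="tolkien-")

    def run(*cmd: str) -> None:
        subprocess.run(cmd, cwd=workdir, check=True, capture_output=True, text=True)

    subprocess.run(["git", "clone", VALINOR_REPO, workdir], check=True)
    run("git", "checkout", "-b", branch)
    run("claude", "--print", task)  # Claude Code edits the checkout non-interactively
    run("git", "add", "-A")
    run("git", "commit", "-m", f"tolkien: {task[:60]}")
    run("git", "push", "origin", branch)
    # Draft PR creation via the tea CLI goes here (exact invocation to be decided)
    return f"Pushed branch {branch}; open a draft PR from it."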

Tasks

  • Implement tool registry in tools/__init__.py
  • Implement each tool module
  • Wire tools into agent.py (pass to Claude API, dispatch results)
  • Add concurrent tool execution (ThreadPoolExecutor)
  • Test each tool in isolation
  • Test full loop: REPL → server → Claude → tool → response

Phase 3: System Prompt + Knowledge

Goal

Give the agent enough context to be genuinely useful for homelab operations.

System Prompt Sections

  1. Identity: You are Tolkien, an infrastructure agent for the valinor homelab
  2. Infrastructure overview: Cluster nodes, storage classes, key services
  3. Repository structure: How valinor is organized (apps/, manifests/, ansible/, terraform/)
  4. Tool usage guide: When to use each tool, with examples
  5. Workflows: Step-by-step for common operations:
    • Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
    • Checking app health (argocd status, kubectl pods, logs)
    • Investigating issues (kubectl describe, logs, events)
    • Checking backup status (volsync, CNPG S3 backups)
    • Managing external hosts (ansible stacks)
  6. Documentation references: How to find and cite lotr docs
  7. Safety rules: Read-only by default, write_code for changes, always draft PRs

Tasks

  • Write system prompt in agent.py (or separate prompt.py)
  • Test with real scenarios (deploy app, check status, troubleshoot)
  • Iterate on prompt based on results

Phase 4: Containerize + Deploy

Goal

Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.

Dockerfile

FROM python:3.13-slim
# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API

Helm Chart (in valinor repo)

valinor/apps/tolkien/
├── config.yaml          # Chart reference (bjw-s app-template)
├── values.yaml          # Image, env, probes, volumes
├── vault-auth.yaml      # VaultAuth for k8s auth
└── vault-secret.yaml    # VaultStaticSecret (ANTHROPIC_API_KEY)

Networking

  • Ingress: tolkien.jpnadas.xyz (or internal-only via ClusterIP + VPN)
  • Consider: API key auth or mTLS for the API endpoint (don't expose unauthenticated)

API Authentication

Simple shared secret for now:

  • Server checks Authorization: Bearer <token> header
  • Token stored in Vault, configured in CLI via env var
  • Can upgrade to mTLS or OAuth later
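
A sketch of the bearer-token check as a FastAPI dependency; the TOLKIEN_API_TOKEN env var name is an assumption:

import os
import secrets
from uuid import uuid4

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()
API_TOKEN = os.environ["TOLKIEN_API_TOKEN"]  # assumed env var name, injected from Vault


def require_token(request: Request) -> None:
    header = request.headers.get("Authorization", "")
    token = header.removeprefix("Bearer ")
    if not header.startswith("Bearer ") or not secrets.compare_digest(token, API_TOKEN):
        raise HTTPException(status_code=401, detail="invalid or missing token")


@app.get("/healthz")
def healthz():
    # Probes stay unauthenticated so the kubelet can reach them.
    return {"status": "ok"}


@app.post("/sessions", dependencies=[Depends(require_token)])
def create_session():
    # Same handler as Phase 1, now gated behind the shared token.
    return {"session_id": str(uuid4())}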

Volume Mounts

  • lord-of-the-rings docs: either git-clone init container or PVC
  • Valinor repo: clone on-demand for write_code tool
  • kubectl: ServiceAccount with read-only ClusterRole

RBAC

  • ServiceAccount tolkien with read-only ClusterRole:
    • get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
    • get/list on nodes
    • logs on pods
  • No write permissions

Vault

  • Policy: read secret/data/tolkien/*, list secret/metadata/*
  • Kubernetes auth role bound to tolkien ServiceAccount

Tasks

  • Write Dockerfile
  • Build and push to Gitea registry manually (first time)
  • Create valinor/apps/tolkien/ with all manifests
  • Add Vault terraform module for tolkien
  • Create RBAC manifests (ClusterRole + ClusterRoleBinding)
  • Deploy via ArgoCD
  • Test CLI → server connectivity
  • Set up Gitea Actions runner in cluster (separate task)
  • Set up CI pipeline (.gitea/workflows/) for build + push on tag

Phase 5: Messaging Integration (Future)

Goal

Add Telegram or Signal as alternative interfaces.

Approach

  • Add a telegram.py handler (python-telegram-bot library)
  • Same session/agent backend, just different input/output transport
  • Bot token in Vault
  • Webhook mode (Telegram pushes to tolkien's API)
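
A sketch of the webhook endpoint using the raw Telegram Bot HTTP API rather than python-telegram-bot, just to show the transport; run_agent_turn_for_chat stands in for the shared session/agent backend:

import os

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]  # assumed env var, sourced from Vault
TELEGRAM_API = f"https://api.telegram.org/bot{BOT_TOKEN}"


def run_agent_turn_for_chat(chat_id: int, text: str) -> str:
    # Placeholder: look up or create a Session keyed by chat_id, then run the agent loop.
    return f"received: {text}"


@app.post("/telegram/webhook")
async def telegram_webhook(request: Request):
    update = await request.json()
    message = update.get("message")
    if not message or "text" not in message:
        return {"ok": True}
    chat_id = message["chat"]["id"]
    reply = run_agent_turn_for_chat(chat_id, message["text"])
    async with httpx.AsyncClient() as client:
        await client.post(f"{TELEGRAM_API}/sendMessage", json={"chat_id": chat_id, "text": reply})
    return {"ok": True}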

Tasks

  • Choose Telegram vs Signal
  • Implement bot handler
  • Add webhook endpoint to server
  • Deploy and test

Open Questions

  1. Ansible access: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to external hosts — big blast radius. Start read-only?
  2. write_code scope: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
  3. Streaming responses: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers.
  4. Rate limiting: Any concern about Anthropic API costs? Could add a simple per-session token budget.
  5. lotr access in-cluster: Git clone as init container (stale) vs. mount from a shared PVC vs. fetch on demand via Gitea API? Gitea API is simplest and always fresh.