# Tolkien — Infrastructure Agent for Valinor

## Overview

Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.

**Architecture**: Client-server model.

- **Server** runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
- **CLI client** is a thin REPL on the operator's machine that connects to the server over HTTPS

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Language | Python (uv) | Matches infra-agent pattern, good Anthropic SDK support |
| LLM | Anthropic API (direct) | Simpler than Vertex AI for homelab use |
| Interface | CLI REPL → server API | Start with CLI, add Telegram/Signal later |
| Registry | Gitea container registry | Zero additional setup, `gitea.jpnadas.xyz/jpnadas/tolkien` |
| Deployment | Helm chart in valinor, ArgoCD | Self-managed, same pattern as all other apps |
| CI | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds |

## Infrastructure Context

### Valinor Repository Structure

```
valinor/
├── apps/                    # Helm-based apps (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/               # Raw k8s manifests (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, *.yaml
├── ansible/                 # Docker Compose stacks on external hosts
│   ├── stacks/isildur/      # Pi 5 — Caddy, Unifi Controller, HAProxy
│   ├── stacks/iluvatar/     # ZFS NAS — iSCSI, NFS
│   ├── playbooks/
│   └── inventory.yml
├── terraform/               # IaC for supporting services
│   ├── vault/               # Vault policies + k8s auth roles
│   ├── minio/               # Buckets, users, S3 creds → Vault
│   ├── cloudflare/          # DNS records
│   ├── arr/                 # Arr stack config
│   └── netbox/              # NetBox resources
└── applicationset.yaml      # ArgoCD ApplicationSet definitions
```

### Cluster Nodes

| Node | Role | Notes |
|------|------|-------|
| merry | k3s controller | Dedicated SSD (etcd only) |
| sam | k3s worker | SATA SSD (Longhorn + databases) |
| pippin | k3s worker | SATA SSD (Longhorn + databases) |
| rosie | k3s worker | NVMe SSD (fastest, Longhorn + databases) |
| isildur | external | Pi 5 — Caddy, Unifi, HAProxy, Gitea |
| iluvatar | external | Optiplex — ZFS NAS, iSCSI, NFS |

### Storage Classes

| Class | Backend | Use Case |
|-------|---------|----------|
| `longhorn` | Distributed SSD (replica=2) | Configs, caches, app state |
| `local-path` | Node-local SSD | CNPG databases (PG replication handles HA) |
| `bulk-storage` | ZFS on iluvatar (iSCSI) | Media, downloads |
| `frigate-storage` | ZFS on iluvatar (iSCSI) | Frigate recordings |

### Key Services

- **ArgoCD**: GitOps sync (auto-sync + prune + self-heal)
- **Vault**: Secret management (Vault Secrets Operator in cluster)
- **MinIO**: S3-compatible storage for CNPG backups
- **CNPG**: CloudNativePG for PostgreSQL (local-path + streaming replication)
- **Longhorn**: Distributed block storage
- **VolSync**: PVC backup/restore
- **cert-manager**: TLS certificates
- **ingress-nginx**: Ingress controller
- **MetalLB**: Bare-metal load balancer

### Documentation

- **lord-of-the-rings** (`~/Personal/lord-of-the-rings/`): All homelab docs
  - `homelab/valinor/` — per-app docs, migration plans, storage docs
  - `homelab/` — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
- **valinor CLAUDE.md**: Contains repo conventions, app structure, skills

### Git Workflow

- Never push directly to main
- Feature/fix branches → PR via `tea` CLI → merge in Gitea
- ArgoCD auto-syncs from main branch

---

## Phase 1: Project Skeleton + CLI REPL

### Goal

A working REPL that sends messages to Claude and gets responses, with no tools yet.

### Project Structure

```
tolkien/
├── pyproject.toml           # uv project config
├── plan.md                  # This file
├── README.md
├── tolkien/
│   ├── __init__.py
│   ├── __main__.py          # CLI entrypoint (repl or serve)
│   ├── agent.py             # Claude tool-calling loop (Anthropic SDK)
│   ├── session.py           # Conversation state + history trimming
│   ├── cli.py               # REPL client (connects to server API)
│   ├── server.py            # HTTP API server (FastAPI or Flask)
│   ├── config.py            # Settings (env vars, defaults)
│   └── tools/
│       └── __init__.py      # Tool registry + dispatch
└── tests/
    └── ...
```

### Entrypoints

- `python -m tolkien serve` — Start the API server
- `python -m tolkien repl` — Start the CLI REPL (thin client to the server)

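A minimal sketch of how `__main__.py` could dispatch these two subcommands (the `server.run()` and `cli.run_repl()` entrypoints are assumptions, not settled API):

```python
"""tolkien.__main__ — dispatches the serve/repl subcommands."""
import argparse

from tolkien import cli, server  # hypothetical module-level entrypoints


def main() -> None:
    parser = argparse.ArgumentParser(prog="tolkien")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("serve", help="Start the API server")
    sub.add_parser("repl", help="Start the CLI REPL (thin client)")
    args = parser.parse_args()

    if args.command == "serve":
        server.run()    # e.g. uvicorn.run(app, ...) inside server.py
    else:
        cli.run_repl()  # REPL loop that POSTs to the server


if __name__ == "__main__":
    main()
```
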
### API Design (Server)

```
POST   /sessions                 → Create a new session, returns {session_id}
POST   /sessions/{id}/messages   → Send a message, returns {response}
GET    /sessions/{id}            → Get session state
DELETE /sessions/{id}            → End session
GET    /healthz                  → Health check
```

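A rough FastAPI sketch of these routes, assuming in-memory sessions for now; the agent call is stubbed out and names are illustrative:

```python
# server.py — minimal sketch of the session API (FastAPI); auth and error handling omitted
import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
SESSIONS: dict[str, list[dict]] = {}  # session_id → message history (placeholder for Session objects)


class MessageIn(BaseModel):
    content: str


@app.post("/sessions")
def create_session() -> dict:
    session_id = str(uuid.uuid4())
    SESSIONS[session_id] = []
    return {"session_id": session_id}


@app.post("/sessions/{session_id}/messages")
def send_message(session_id: str, message: MessageIn) -> dict:
    if session_id not in SESSIONS:
        raise HTTPException(status_code=404, detail="unknown session")
    history = SESSIONS[session_id]
    history.append({"role": "user", "content": message.content})
    # reply = agent.run_turn(history)  # the Claude tool-calling loop from agent.py
    reply = "(agent reply placeholder)"
    history.append({"role": "assistant", "content": reply})
    return {"response": reply}


@app.get("/sessions/{session_id}")
def get_session(session_id: str) -> dict:
    if session_id not in SESSIONS:
        raise HTTPException(status_code=404, detail="unknown session")
    return {"session_id": session_id, "messages": SESSIONS[session_id]}


@app.delete("/sessions/{session_id}")
def delete_session(session_id: str) -> dict:
    SESSIONS.pop(session_id, None)
    return {"deleted": session_id}


@app.get("/healthz")
def healthz() -> dict:
    return {"status": "ok"}
```
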
### Session Model

- In-memory dict (session_id → Session)
- Session holds conversation history (messages list)
- History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
- Session expiry after 30 min inactivity

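A possible shape for the session store, assuming a dataclass per session with trimming and expiry handled as described above (constants and names are illustrative):

```python
# session.py — sketch of the in-memory session store
import time
import uuid
from dataclasses import dataclass, field

MAX_TURNS = 20            # keep first turn + last N messages, like infra-agent
EXPIRY_SECONDS = 30 * 60  # 30 min inactivity


@dataclass
class Session:
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    messages: list[dict] = field(default_factory=list)
    last_active: float = field(default_factory=time.monotonic)

    def append(self, message: dict) -> None:
        self.messages.append(message)
        self.last_active = time.monotonic()
        self._trim()

    def _trim(self) -> None:
        # Keep the first turn for context plus the most recent MAX_TURNS messages.
        if len(self.messages) > MAX_TURNS + 1:
            self.messages = self.messages[:1] + self.messages[-MAX_TURNS:]

    @property
    def expired(self) -> bool:
        return time.monotonic() - self.last_active > EXPIRY_SECONDS


SESSIONS: dict[str, Session] = {}


def reap_expired() -> None:
    """Drop sessions idle for longer than EXPIRY_SECONDS."""
    for sid in [s for s, sess in SESSIONS.items() if sess.expired]:
        del SESSIONS[sid]
```
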
### Agent Loop

Based on infra-agent's `_drive_to_end_turn()`:

1. Build messages list (system prompt + conversation history)
2. Call Claude API with tools
3. If tool_use in response → execute tools concurrently → append results
4. Loop until stop_reason == "end_turn"
5. Return assistant text

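A sketch of that loop with the Anthropic SDK; the model id is a placeholder and the tool registry import assumes the Phase 2 layout:

```python
# agent.py — sketch of the tool-calling loop (Anthropic SDK)
import anthropic

from tolkien.tools import TOOLS, dispatch_tool  # hypothetical registry, see Phase 2

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def run_turn(system_prompt: str, messages: list[dict]) -> str:
    """Drive one user turn to completion, executing tool calls until end_turn."""
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",   # placeholder model id
            max_tokens=4096,
            system=system_prompt,
            messages=messages,
            tools=TOOLS,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # end_turn (or max_tokens): return the assistant's text blocks
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute every requested tool and feed the results back as a user message.
        # Sequential here; a ThreadPoolExecutor can parallelize this later (Phase 2 task).
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = dispatch_tool(block.name, block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": results})
```
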
### Dependencies

- `anthropic` — Claude API SDK
- `httpx` — CLI client HTTP calls
- `flask` or `fastapi` + `uvicorn` — API server
- `rich` — CLI REPL formatting
- `python-dotenv` — env var loading

### Tasks

- [ ] Initialize uv project with pyproject.toml
- [ ] Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
- [ ] Implement agent.py with Claude loop (no tools yet)
- [ ] Implement session.py
- [ ] Implement server.py with session + message endpoints
- [ ] Implement cli.py REPL that POSTs to the server
- [ ] Implement `__main__.py` with serve/repl subcommands
- [ ] Test locally: run server, connect with REPL, have a conversation

---

## Phase 2: Tool System

### Goal

Give the agent tools to query and understand the homelab infrastructure.

### Tool Architecture

Each tool module exports:

- `TOOLS: list[dict]` — Claude tool schemas (name, description, input_schema)
- `DISPATCH: dict[str, Callable]` — name → handler function

`tools/__init__.py` aggregates all modules into a single registry.

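A sketch of that aggregation, assuming each module exposes `TOOLS` and `DISPATCH` as above (module names mirror the plan but are not final):

```python
# tools/__init__.py — aggregate every tool module into one registry
from typing import Any, Callable

from tolkien.tools import argocd, docs, gitea, kubectl, vault, web, write_code  # hypothetical modules

_MODULES = [kubectl, gitea, argocd, vault, docs, web, write_code]

TOOLS: list[dict] = [schema for mod in _MODULES for schema in mod.TOOLS]
DISPATCH: dict[str, Callable[..., str]] = {
    name: handler for mod in _MODULES for name, handler in mod.DISPATCH.items()
}


def dispatch_tool(name: str, tool_input: dict[str, Any]) -> str:
    """Run a tool by name; return its output (or an error string) for the tool_result block."""
    handler = DISPATCH.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    try:
        return handler(**tool_input)
    except Exception as exc:  # surface tool failures to the model instead of crashing the loop
        return f"Tool {name} failed: {exc}"
```
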
### Tools to Implement

#### `tools/kubectl.py` — Kubernetes Queries

- **Allowed verbs**: get, describe, logs, top, auth can-i
- **Blocked**: create, delete, apply, patch, edit, exec, port-forward
- Timeout: 60s
- Output truncation: 12,000 chars

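A sketch of the command wrapper with the verb allowlist, timeout, and truncation described above (the tool schema itself is omitted):

```python
# tools/kubectl.py — read-only kubectl wrapper (sketch)
import subprocess

ALLOWED_VERBS = {"get", "describe", "logs", "top", "auth"}  # "auth" only for "auth can-i"
TIMEOUT_SECONDS = 60
MAX_OUTPUT_CHARS = 12_000


def run_kubectl(args: list[str]) -> str:
    """Run a kubectl query if the verb is allowed; truncate long output."""
    if not args or args[0] not in ALLOWED_VERBS:
        return f"Verb not allowed: {args[0] if args else '(none)'}"
    if args[0] == "auth" and args[1:2] != ["can-i"]:
        return "Only 'kubectl auth can-i' is allowed"
    try:
        proc = subprocess.run(
            ["kubectl", *args],
            capture_output=True,
            text=True,
            timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return "kubectl timed out after 60s"
    output = proc.stdout + proc.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        output = output[:MAX_OUTPUT_CHARS] + "\n... [truncated]"
    return output
```
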
#### `tools/gitea.py` — Gitea via `tea` CLI

- List/view issues
- List/view PRs
- Create issues, create PRs (draft)
- Search repos
- Blocked: delete operations

#### `tools/argocd.py` — ArgoCD Status

- `argocd app list` — all apps and sync status
- `argocd app get <name>` — detailed app status
- `argocd app diff <name>` — pending changes
- `argocd app history <name>` — deployment history
- Read-only: no sync, rollback, or delete operations

#### `tools/vault.py` — Vault Metadata Only

- `vault kv metadata get <path>` — check secret exists, see versions
- `vault kv list <path>` — list secret paths
- **Explicitly blocked**: `vault kv get` (cannot read actual secret values)
- Policy: `capabilities = ["read", "list"]` on `secret/metadata/*`, deny on `secret/data/*` (metadata gets need `read` on the metadata path; listing needs `list`)

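A sketch of the two metadata operations as subprocess wrappers (the 30s timeout is an assumption):

```python
# tools/vault.py — metadata-only Vault access (sketch); secret values are never read
import subprocess

MAX_OUTPUT_CHARS = 12_000


def vault_metadata(path: str) -> str:
    """Show versions/metadata for a secret without reading its value."""
    return _run(["vault", "kv", "metadata", "get", path])


def vault_list(path: str) -> str:
    """List secret paths under a prefix."""
    return _run(["vault", "kv", "list", path])


def _run(cmd: list[str]) -> str:
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return (proc.stdout + proc.stderr)[:MAX_OUTPUT_CHARS]
```
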
#### `tools/docs.py` — Lord of the Rings Documentation

- Read files from `~/Personal/lord-of-the-rings/`
- List available docs (directory listing)
- Search docs content (grep)
- In server context: mount lotr as a volume or clone it

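A sketch of the docs helpers, assuming a local checkout at `DOCS_ROOT`; in the server context the root would point at the cloned or mounted copy:

```python
# tools/docs.py — read-only access to the lord-of-the-rings docs (sketch)
from pathlib import Path

DOCS_ROOT = Path("~/Personal/lord-of-the-rings").expanduser()  # in-cluster: clone or mounted volume


def list_docs() -> str:
    """Return the relative paths of all markdown files under the docs root."""
    return "\n".join(str(p.relative_to(DOCS_ROOT)) for p in sorted(DOCS_ROOT.rglob("*.md")))


def read_doc(relative_path: str) -> str:
    """Read a single doc, refusing paths that escape the docs root."""
    target = (DOCS_ROOT / relative_path).resolve()
    if not target.is_relative_to(DOCS_ROOT.resolve()):
        return "Path escapes the docs root"
    return target.read_text()


def search_docs(pattern: str) -> str:
    """Naive grep over all markdown files; returns matching lines with their file paths."""
    hits = []
    for p in DOCS_ROOT.rglob("*.md"):
        for i, line in enumerate(p.read_text(errors="ignore").splitlines(), start=1):
            if pattern.lower() in line.lower():
                hits.append(f"{p.relative_to(DOCS_ROOT)}:{i}: {line.strip()}")
    return "\n".join(hits[:200])  # cap output
```
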
#### `tools/web.py` — Web Fetch

- Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
- HTTPS only
- Timeout: 15s
- Output truncation: 12,000 chars

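A sketch of the allowlisted fetch with `httpx`; the allowlist contents here are examples only:

```python
# tools/web.py — allowlisted HTTPS fetch (sketch)
from urllib.parse import urlparse

import httpx

ALLOWED_HOSTS = {"kubernetes.io", "argo-cd.readthedocs.io", "helm.sh"}  # example allowlist
TIMEOUT_SECONDS = 15
MAX_OUTPUT_CHARS = 12_000


def fetch_url(url: str) -> str:
    """Fetch an allowlisted HTTPS URL and return the (truncated) body text."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return "Only https:// URLs are allowed"
    if parsed.hostname not in ALLOWED_HOSTS:
        return f"Host not in allowlist: {parsed.hostname}"
    response = httpx.get(url, timeout=TIMEOUT_SECONDS, follow_redirects=True)
    return response.text[:MAX_OUTPUT_CHARS]
```
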
#### `tools/write_code.py` — Claude Code Subprocess

- Clones valinor repo → runs `claude --print` → creates branch + PR
- Risk classification for changes (optional, simpler than infra-agent's)
- Always creates draft PRs, never merges

### Tasks

- [ ] Implement tool registry in `tools/__init__.py`
- [ ] Implement each tool module
- [ ] Wire tools into agent.py (pass to Claude API, dispatch results)
- [ ] Add concurrent tool execution (ThreadPoolExecutor)
- [ ] Test each tool in isolation
- [ ] Test full loop: REPL → server → Claude → tool → response

---

## Phase 3: System Prompt + Knowledge

### Goal

Give the agent enough context to be genuinely useful for homelab operations.

### System Prompt Sections

1. **Identity**: You are Tolkien, an infrastructure agent for the valinor homelab
2. **Infrastructure overview**: Cluster nodes, storage classes, key services
3. **Repository structure**: How valinor is organized (apps/, manifests/, ansible/, terraform/)
4. **Tool usage guide**: When to use each tool, with examples
5. **Workflows**: Step-by-step for common operations:
   - Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
   - Checking app health (argocd status, kubectl pods, logs)
   - Investigating issues (kubectl describe, logs, events)
   - Checking backup status (volsync, CNPG S3 backups)
   - Managing external hosts (ansible stacks)
6. **Documentation references**: How to find and cite lotr docs
7. **Safety rules**: Read-only by default, write_code for changes, always draft PRs

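One option, if the prompt moves to a separate `prompt.py`, is to assemble it from named section constants; a sketch with abbreviated section text:

```python
# prompt.py — assemble the system prompt from named sections (sketch; text abbreviated)
IDENTITY = "You are Tolkien, an infrastructure agent for the valinor homelab."
INFRASTRUCTURE = "Cluster: merry (controller), sam/pippin/rosie (workers); external hosts: isildur, iluvatar. ..."
REPO_STRUCTURE = "valinor/ is organized into apps/, manifests/, ansible/, terraform/. ..."
TOOL_GUIDE = "Use kubectl for cluster state, argocd for sync status, vault for secret metadata only. ..."
WORKFLOWS = "To deploy a new app: check the chart, create config.yaml + values.yaml, add terraform if needed. ..."
DOCS = "Cite documentation from the lord-of-the-rings repo when relevant. ..."
SAFETY = "You are read-only by default. Use write_code for changes and always open draft PRs; never merge."

SYSTEM_PROMPT = "\n\n".join([
    IDENTITY, INFRASTRUCTURE, REPO_STRUCTURE, TOOL_GUIDE, WORKFLOWS, DOCS, SAFETY,
])
```
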
### Tasks

- [ ] Write system prompt in agent.py (or separate prompt.py)
- [ ] Test with real scenarios (deploy app, check status, troubleshoot)
- [ ] Iterate on prompt based on results

---

## Phase 4: Containerize + Deploy

### Goal

Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.

### Dockerfile

```dockerfile
FROM python:3.13-slim

# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API
```

### Helm Chart (in valinor repo)

```
valinor/apps/tolkien/
├── config.yaml         # Chart reference (bjw-s app-template)
├── values.yaml         # Image, env, probes, volumes
├── vault-auth.yaml     # VaultAuth for k8s auth
└── vault-secret.yaml   # VaultStaticSecret (ANTHROPIC_API_KEY)
```

### Networking

- Ingress: `tolkien.jpnadas.xyz` (or internal-only via ClusterIP + VPN)
- Consider: API key auth or mTLS for the API endpoint (don't expose it unauthenticated)

### API Authentication

Simple shared secret for now:

- Server checks `Authorization: Bearer <token>` header
- Token stored in Vault, configured in CLI via env var
- Can upgrade to mTLS or OAuth later

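A sketch of the bearer-token check as a FastAPI dependency (the `TOLKIEN_API_TOKEN` env var name is an assumption):

```python
# server.py — shared-secret auth for the API (sketch)
import os

from fastapi import Header, HTTPException

API_TOKEN = os.environ["TOLKIEN_API_TOKEN"]  # hypothetical env var, injected from Vault


def require_auth(authorization: str = Header(default="")) -> None:
    """Reject requests whose Authorization header does not carry the shared bearer token."""
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="invalid or missing token")
```

Routes would then declare `Depends(require_auth)`, and the CLI sends the same token from its own environment.
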
### Volume Mounts

- lord-of-the-rings docs: either git-clone init container or PVC
- Valinor repo: clone on-demand for write_code tool
- kubectl: ServiceAccount with read-only ClusterRole

### RBAC

- ServiceAccount `tolkien` with read-only ClusterRole:
  - get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
  - get/list on nodes
  - logs on pods
- No write permissions

### Vault

- Policy: read `secret/data/tolkien/*`, read + list `secret/metadata/*`
- Kubernetes auth role bound to `tolkien` ServiceAccount

### Tasks

- [ ] Write Dockerfile
- [ ] Build and push to Gitea registry manually (first time)
- [ ] Create valinor/apps/tolkien/ with all manifests
- [ ] Add Vault terraform module for tolkien
- [ ] Create RBAC manifests (ClusterRole + ClusterRoleBinding)
- [ ] Deploy via ArgoCD
- [ ] Test CLI → server connectivity
- [ ] Set up Gitea Actions runner in cluster (separate task)
- [ ] Set up CI pipeline (.gitea/workflows/) for build + push on tag

---

## Phase 5: Messaging Integration (Future)

### Goal

Add Telegram or Signal as alternative interfaces.

### Approach

- Add a `telegram.py` handler (python-telegram-bot library)
- Same session/agent backend, just different input/output transport
- Bot token in Vault
- Webhook mode (Telegram pushes to tolkien's API)

### Tasks

- [ ] Choose Telegram vs Signal
- [ ] Implement bot handler
- [ ] Add webhook endpoint to server
- [ ] Deploy and test

---

## Open Questions

1. **Ansible access**: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to external hosts — big blast radius. Start read-only?
2. **write_code scope**: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
3. **Streaming responses**: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers.
4. **Rate limiting**: Any concern about Anthropic API costs? Could add a simple per-session token budget.
5. **lotr access in-cluster**: Git clone as init container (stale) vs. mount from a shared PVC vs. fetch on demand via Gitea API? Gitea API is simplest and always fresh.