diff --git a/plan.md b/plan.md
new file mode 100644
index 0000000..1d5f404
--- /dev/null
+++ b/plan.md
@@ -0,0 +1,337 @@
# Tolkien — Infrastructure Agent for Valinor

## Overview

Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.

**Architecture**: Client-server model.
- **Server** runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
- **CLI client** is a thin REPL on the operator's machine that connects to the server over HTTPS

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Language | Python (uv) | Matches infra-agent pattern, good Anthropic SDK support |
| LLM | Anthropic API (direct) | Simpler than Vertex AI for homelab use |
| Interface | CLI REPL → server API | Start with CLI, add Telegram/Signal later |
| Registry | Gitea container registry | Zero additional setup, `gitea.jpnadas.xyz/jpnadas/tolkien` |
| Deployment | Helm chart in valinor, ArgoCD | Self-managed, same pattern as all other apps |
| CI | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds |

## Infrastructure Context

### Valinor Repository Structure
```
valinor/
├── apps/                  # Helm-based apps (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/             # Raw k8s manifests (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, *.yaml
├── ansible/               # Docker Compose stacks on external hosts
│   ├── stacks/isildur/    # Pi 5 — Caddy, Unifi Controller, HAProxy
│   ├── stacks/iluvatar/   # ZFS NAS — iSCSI, NFS
│   ├── playbooks/
│   └── inventory.yml
├── terraform/             # IaC for supporting services
│   ├── vault/             # Vault policies + k8s auth roles
│   ├── minio/             # Buckets, users, S3 creds → Vault
│   ├── cloudflare/        # DNS records
│   ├── arr/               # Arr stack config
│   └── netbox/            # NetBox resources
└── applicationset.yaml    # ArgoCD ApplicationSet definitions
```

### Cluster Nodes
| Node | Role | Notes |
|------|------|-------|
| merry | k3s controller | Dedicated SSD (etcd only) |
| sam | k3s worker | SATA SSD (Longhorn + databases) |
| pippin | k3s worker | SATA SSD (Longhorn + databases) |
| rosie | k3s worker | NVMe SSD (fastest, Longhorn + databases) |
| isildur | external | Pi 5 — Caddy, Unifi, HAProxy, Gitea |
| iluvatar | external | Optiplex — ZFS NAS, iSCSI, NFS |

### Storage Classes
| Class | Backend | Use Case |
|-------|---------|----------|
| `longhorn` | Distributed SSD (replica=2) | Configs, caches, app state |
| `local-path` | Node-local SSD | CNPG databases (PG replication handles HA) |
| `bulk-storage` | ZFS on iluvatar (iSCSI) | Media, downloads |
| `frigate-storage` | ZFS on iluvatar (iSCSI) | Frigate recordings |

### Key Services
- **ArgoCD**: GitOps sync (auto-sync + prune + self-heal)
- **Vault**: Secret management (Vault Secrets Operator in cluster)
- **MinIO**: S3-compatible storage for CNPG backups
- **CNPG**: CloudNativePG for PostgreSQL (local-path + streaming replication)
- **Longhorn**: Distributed block storage
- **VolSync**: PVC backup/restore
- **cert-manager**: TLS certificates
- **ingress-nginx**: Ingress controller
- **MetalLB**: Bare-metal load balancer

### Documentation
- **lord-of-the-rings** (`~/Personal/lord-of-the-rings/`): All homelab docs
  - `homelab/valinor/` — per-app docs, migration plans, storage docs
  - `homelab/` — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
- **valinor CLAUDE.md**: Contains repo conventions, app structure, skills

### Git Workflow
- Never push directly to main
- Feature/fix branches → PR via `tea` CLI → merge in Gitea
- ArgoCD auto-syncs from main branch

---

## Phase 1: Project Skeleton + CLI REPL

### Goal
A working REPL that sends messages to Claude and gets responses, with no tools yet.

### Project Structure
```
tolkien/
├── pyproject.toml       # uv project config
├── plan.md              # This file
├── README.md
├── tolkien/
│   ├── __init__.py
│   ├── __main__.py      # CLI entrypoint (repl or serve)
│   ├── agent.py         # Claude tool-calling loop (Anthropic SDK)
│   ├── session.py       # Conversation state + history trimming
│   ├── cli.py           # REPL client (connects to server API)
│   ├── server.py        # HTTP API server (FastAPI or Flask)
│   ├── config.py        # Settings (env vars, defaults)
│   └── tools/
│       └── __init__.py  # Tool registry + dispatch
└── tests/
    └── ...
```

### Entrypoints
- `python -m tolkien serve` — Start the API server
- `python -m tolkien repl` — Start the CLI REPL (thin client to the server)

### API Design (Server)
```
POST   /sessions                → Create a new session, returns {session_id}
POST   /sessions/{id}/messages  → Send a message, returns {response}
GET    /sessions/{id}           → Get session state
DELETE /sessions/{id}           → End session
GET    /healthz                 → Health check
```

### Session Model
- In-memory dict (session_id → Session)
- Session holds conversation history (messages list)
- History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
- Session expiry after 30 min inactivity

### Agent Loop
Based on infra-agent's `_drive_to_end_turn()`:
1. Build messages list (system prompt + conversation history)
2. Call Claude API with tools
3. If tool_use in response → execute tools concurrently → append results
4. Loop until stop_reason == "end_turn"
5. Return assistant text
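To make this concrete, a minimal sketch of the loop and the trimming rule using the `anthropic` SDK. The model id, `MAX_TURNS` value, and helper names are illustrative, not settled, and the Phase 2 concurrency (ThreadPoolExecutor) is omitted for brevity:

```python
# Sketch of the agent loop (agent.py), not a final implementation.
# Assumes TOOLS (schemas) and DISPATCH (name -> handler) from tolkien.tools.
import anthropic

MAX_TURNS = 20  # mirrors infra-agent's history cap


def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the first turn plus the last MAX_TURNS entries.

    Naive sketch: a real version must not split a tool_use/tool_result pair.
    """
    if len(messages) <= MAX_TURNS + 1:
        return messages
    return messages[:1] + messages[-MAX_TURNS:]


def drive_to_end_turn(client: anthropic.Anthropic, system_prompt: str,
                      messages: list[dict], tools: list[dict],
                      dispatch: dict) -> str:
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model id
            max_tokens=4096,
            system=system_prompt,
            messages=trim_history(messages),
            tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # end_turn: return the concatenated assistant text blocks
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool (sequentially here; Phase 2 adds a
        # ThreadPoolExecutor) and feed the results back as a user message.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": dispatch[block.name](**block.input),
                })
        messages.append({"role": "user", "content": results})
```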
### Dependencies
- `anthropic` — Claude API SDK
- `httpx` — CLI client HTTP calls
- `flask` or `fastapi` + `uvicorn` — API server
- `rich` — CLI REPL formatting
- `python-dotenv` — env var loading

### Tasks
- [ ] Initialize uv project with pyproject.toml
- [ ] Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
- [ ] Implement agent.py with Claude loop (no tools yet)
- [ ] Implement session.py
- [ ] Implement server.py with session + message endpoints
- [ ] Implement cli.py REPL that POSTs to the server
- [ ] Implement __main__.py with serve/repl subcommands
- [ ] Test locally: run server, connect with REPL, have a conversation

---

## Phase 2: Tool System

### Goal
Give the agent tools to query and understand the homelab infrastructure.

### Tool Architecture
Each tool module exports:
- `TOOLS: list[dict]` — Claude tool schemas (name, description, input_schema)
- `DISPATCH: dict[str, Callable]` — name → handler function

`tools/__init__.py` aggregates all modules into a single registry.
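A minimal sketch of that aggregation, assuming each submodule defines `TOOLS` and `DISPATCH` as above (module names match the tool list that follows):

```python
# tolkien/tools/__init__.py: sketch of the registry aggregation, not final.
from typing import Callable

from . import argocd, docs, gitea, kubectl, vault, web, write_code

TOOLS: list[dict] = []
DISPATCH: dict[str, Callable[..., str]] = {}

for module in (kubectl, gitea, argocd, vault, docs, web, write_code):
    for schema in module.TOOLS:
        # Tool names must be globally unique for Claude's tool-use dispatch
        if schema["name"] in DISPATCH:
            raise ValueError(f"duplicate tool name: {schema['name']}")
    TOOLS.extend(module.TOOLS)
    DISPATCH.update(module.DISPATCH)
```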
### Tools to Implement

#### `tools/kubectl.py` — Kubernetes Queries
- **Allowed verbs**: get, describe, logs, top, auth can-i
- **Blocked**: create, delete, apply, patch, edit, exec, port-forward
- Timeout: 60s
- Output truncation: 12,000 chars

#### `tools/gitea.py` — Gitea via `tea` CLI
- List/view issues
- List/view PRs
- Create issues, create PRs (draft)
- Search repos
- Blocked: delete operations

#### `tools/argocd.py` — ArgoCD Status
- `argocd app list` — all apps and sync status
- `argocd app get <app>` — detailed app status
- `argocd app diff <app>` — pending changes
- `argocd app history <app>` — deployment history
- Read-only: no sync, rollback, or delete operations

#### `tools/vault.py` — Vault Metadata Only
- `vault kv metadata get <path>` — check secret exists, see versions
- `vault kv list <path>` — list secret paths
- **Explicitly blocked**: `vault kv get` (cannot read actual secret values)
- Policy: `capabilities = ["read", "list"]` on `secret/metadata/*` (metadata get needs read), deny on `secret/data/*`

#### `tools/docs.py` — Lord of the Rings Documentation
- Read files from `~/Personal/lord-of-the-rings/`
- List available docs (directory listing)
- Search docs content (grep)
- In server context: mount lotr as a volume or clone it

#### `tools/web.py` — Web Fetch
- Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
- HTTPS only
- Timeout: 15s
- Output truncation: 12,000 chars

#### `tools/write_code.py` — Claude Code Subprocess
- Clones valinor repo → runs `claude --print` → creates branch + PR
- Risk classification for changes (optional, simpler than infra-agent's)
- Always creates draft PRs, never merges

### Tasks
- [ ] Implement tool registry in tools/__init__.py
- [ ] Implement each tool module
- [ ] Wire tools into agent.py (pass to Claude API, dispatch results)
- [ ] Add concurrent tool execution (ThreadPoolExecutor)
- [ ] Test each tool in isolation
- [ ] Test full loop: REPL → server → Claude → tool → response

---

## Phase 3: System Prompt + Knowledge

### Goal
Give the agent enough context to be genuinely useful for homelab operations.

### System Prompt Sections
1. **Identity**: You are Tolkien, an infrastructure agent for the valinor homelab
2. **Infrastructure overview**: Cluster nodes, storage classes, key services
3. **Repository structure**: How valinor is organized (apps/, manifests/, ansible/, terraform/)
4. **Tool usage guide**: When to use each tool, with examples
5. **Workflows**: Step-by-step for common operations:
   - Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
   - Checking app health (argocd status, kubectl pods, logs)
   - Investigating issues (kubectl describe, logs, events)
   - Checking backup status (volsync, CNPG S3 backups)
   - Managing external hosts (ansible stacks)
6. **Documentation references**: How to find and cite lotr docs
7. **Safety rules**: Read-only by default, write_code for changes, always draft PRs

### Tasks
- [ ] Write system prompt in agent.py (or separate prompt.py)
- [ ] Test with real scenarios (deploy app, check status, troubleshoot)
- [ ] Iterate on prompt based on results

---

## Phase 4: Containerize + Deploy

### Goal
Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.
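To make that CLI → server hop concrete, a client-side sketch of the calls cli.py would make. The endpoint shapes follow the Phase 1 API design; the bearer token follows the API Authentication section below; `TOLKIEN_URL` and `TOLKIEN_TOKEN` are illustrative env var names, not settled config:

```python
# Sketch of the thin CLI client's server calls (cli.py), not final.
import os

import httpx

BASE = os.environ["TOLKIEN_URL"]  # e.g. https://tolkien.jpnadas.xyz
HEADERS = {"Authorization": f"Bearer {os.environ['TOLKIEN_TOKEN']}"}


def new_session() -> str:
    r = httpx.post(f"{BASE}/sessions", headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["session_id"]


def send(session_id: str, text: str) -> str:
    r = httpx.post(
        f"{BASE}/sessions/{session_id}/messages",
        headers=HEADERS,
        json={"message": text},
        timeout=300,  # agent turns with tool calls can be slow
    )
    r.raise_for_status()
    return r.json()["response"]
```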
### Dockerfile
```dockerfile
FROM python:3.13-slim
# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API
```

### Helm Chart (in valinor repo)
```
valinor/apps/tolkien/
├── config.yaml        # Chart reference (bjw-s app-template)
├── values.yaml        # Image, env, probes, volumes
├── vault-auth.yaml    # VaultAuth for k8s auth
└── vault-secret.yaml  # VaultStaticSecret (ANTHROPIC_API_KEY)
```

### Networking
- Ingress: `tolkien.jpnadas.xyz` (or internal-only via ClusterIP + VPN)
- Consider: API key auth or mTLS for the API endpoint (don't expose unauthenticated)

### API Authentication
Simple shared secret for now (see the sketch at the end of this document):
- Server checks `Authorization: Bearer <token>` header
- Token stored in Vault, configured in CLI via env var
- Can upgrade to mTLS or OAuth later

### Volume Mounts
- lord-of-the-rings docs: either git-clone init container or PVC
- Valinor repo: clone on-demand for write_code tool
- kubectl: ServiceAccount with read-only ClusterRole

### RBAC
- ServiceAccount `tolkien` with read-only ClusterRole:
  - get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
  - get/list on nodes
  - logs on pods
- No write permissions

### Vault
- Policy: read `secret/data/tolkien/*`, list `secret/metadata/*`
- Kubernetes auth role bound to `tolkien` ServiceAccount

### Tasks
- [ ] Write Dockerfile
- [ ] Build and push to Gitea registry manually (first time)
- [ ] Create valinor/apps/tolkien/ with all manifests
- [ ] Add Vault terraform module for tolkien
- [ ] Create RBAC manifests (ClusterRole + ClusterRoleBinding)
- [ ] Deploy via ArgoCD
- [ ] Test CLI → server connectivity
- [ ] Set up Gitea Actions runner in cluster (separate task)
- [ ] Set up CI pipeline (.gitea/workflows/) for build + push on tag

---

## Phase 5: Messaging Integration (Future)

### Goal
Add Telegram or Signal as alternative interfaces.

### Approach
- Add a `telegram.py` handler (python-telegram-bot library)
- Same session/agent backend, just different input/output transport
- Bot token in Vault
- Webhook mode (Telegram pushes to tolkien's API)

### Tasks
- [ ] Choose Telegram vs Signal
- [ ] Implement bot handler
- [ ] Add webhook endpoint to server
- [ ] Deploy and test

---

## Open Questions

1. **Ansible access**: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to external hosts — big blast radius. Start read-only?
2. **write_code scope**: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
3. **Streaming responses**: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers.
4. **Rate limiting**: Any concern about Anthropic API costs? Could add a simple per-session token budget.
5. **lotr access in-cluster**: Git clone as init container (stale) vs. mount from a shared PVC vs. fetch on demand via Gitea API? Gitea API is simplest and always fresh.
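---

For reference, a minimal sketch of the shared-secret check described under API Authentication in Phase 4, assuming the FastAPI option; the `TOLKIEN_API_TOKEN` name and route wiring are illustrative:

```python
# Sketch of the bearer-token guard for server.py, not final.
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()


def require_token(request: Request) -> None:
    header = request.headers.get("Authorization", "")
    expected = f"Bearer {os.environ['TOLKIEN_API_TOKEN']}"
    # Constant-time comparison to avoid timing side channels
    if not secrets.compare_digest(header.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="invalid or missing token")


@app.post("/sessions", dependencies=[Depends(require_token)])
def create_session() -> dict:
    session_id = secrets.token_hex(16)
    # A real implementation would register a Session object here
    return {"session_id": session_id}


@app.get("/healthz")  # health check stays unauthenticated for k8s probes
def healthz() -> dict:
    return {"status": "ok"}
```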