diff --git a/plan.md b/plan.md
new file mode 100644
index 0000000..1d5f404
--- /dev/null
+++ b/plan.md
@@ -0,0 +1,337 @@
# Tolkien — Infrastructure Agent for Valinor

## Overview

Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.

**Architecture**: Client-server model.
- **Server** runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
- **CLI client** is a thin REPL on the operator's machine that connects to the server over HTTPS

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Language | Python (uv) | Matches infra-agent pattern, good Anthropic SDK support |
| LLM | Anthropic API (direct) | Simpler than Vertex AI for homelab use |
| Interface | CLI REPL → server API | Start with CLI, add Telegram/Signal later |
| Registry | Gitea container registry | Zero additional setup, `gitea.jpnadas.xyz/jpnadas/tolkien` |
| Deployment | Helm chart in valinor, ArgoCD | Self-managed, same pattern as all other apps |
| CI | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds |

## Infrastructure Context

### Valinor Repository Structure
```
valinor/
├── apps/                  # Helm-based apps (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/             # Raw k8s manifests (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, *.yaml
├── ansible/               # Docker Compose stacks on external hosts
│   ├── stacks/isildur/    # Pi 5 — Caddy, Unifi Controller, HAProxy
│   ├── stacks/iluvatar/   # ZFS NAS — iSCSI, NFS
│   ├── playbooks/
│   └── inventory.yml
├── terraform/             # IaC for supporting services
│   ├── vault/             # Vault policies + k8s auth roles
│   ├── minio/             # Buckets, users, S3 creds → Vault
│   ├── cloudflare/        # DNS records
│   ├── arr/               # Arr stack config
│   └── netbox/            # NetBox resources
└── applicationset.yaml    # ArgoCD ApplicationSet definitions
```

### Cluster Nodes
| Node | Role | Notes |
|------|------|-------|
| merry | k3s controller | Dedicated SSD (etcd only) |
| sam | k3s worker | SATA SSD (Longhorn + databases) |
| pippin | k3s worker | SATA SSD (Longhorn + databases) |
| rosie | k3s worker | NVMe SSD (fastest, Longhorn + databases) |
| isildur | external | Pi 5 — Caddy, Unifi, HAProxy, Gitea |
| iluvatar | external | Optiplex — ZFS NAS, iSCSI, NFS |

### Storage Classes
| Class | Backend | Use Case |
|-------|---------|----------|
| `longhorn` | Distributed SSD (replica=2) | Configs, caches, app state |
| `local-path` | Node-local SSD | CNPG databases (PG replication handles HA) |
| `bulk-storage` | ZFS on iluvatar (iSCSI) | Media, downloads |
| `frigate-storage` | ZFS on iluvatar (iSCSI) | Frigate recordings |

### Key Services
- **ArgoCD**: GitOps sync (auto-sync + prune + self-heal)
- **Vault**: Secret management (Vault Secrets Operator in cluster)
- **MinIO**: S3-compatible storage for CNPG backups
- **CNPG**: CloudNativePG for PostgreSQL (local-path + streaming replication)
- **Longhorn**: Distributed block storage
- **VolSync**: PVC backup/restore
- **cert-manager**: TLS certificates
- **ingress-nginx**: Ingress controller
- **MetalLB**: Bare-metal load balancer

### Documentation
- **lord-of-the-rings** (`~/Personal/lord-of-the-rings/`): All homelab docs
  - `homelab/valinor/` — per-app docs, migration plans, storage docs
  - `homelab/` — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
- **valinor CLAUDE.md**: Contains repo conventions, app structure, skills

### Git Workflow
- Never push directly to main
- Feature/fix branches → PR via `tea` CLI → merge in Gitea
- ArgoCD auto-syncs from main branch

---

## Phase 1: Project Skeleton + CLI REPL

### Goal
A working REPL that sends messages to Claude and gets responses, with no tools yet.

### Project Structure
```
tolkien/
├── pyproject.toml       # uv project config
├── plan.md              # This file
├── README.md
├── tolkien/
│   ├── __init__.py
│   ├── __main__.py      # CLI entrypoint (repl or serve)
│   ├── agent.py         # Claude tool-calling loop (Anthropic SDK)
│   ├── session.py       # Conversation state + history trimming
│   ├── cli.py           # REPL client (connects to server API)
│   ├── server.py        # HTTP API server (FastAPI or Flask)
│   ├── config.py        # Settings (env vars, defaults)
│   └── tools/
│       └── __init__.py  # Tool registry + dispatch
└── tests/
    └── ...
```

### Entrypoints
- `python -m tolkien serve` — Start the API server
- `python -m tolkien repl` — Start the CLI REPL (thin client to the server)

### API Design (Server)
```
POST   /sessions                → Create a new session, returns {session_id}
POST   /sessions/{id}/messages  → Send a message, returns {response}
GET    /sessions/{id}           → Get session state
DELETE /sessions/{id}           → End session
GET    /healthz                 → Health check
```

### Session Model
- In-memory dict (session_id → Session)
- Session holds conversation history (messages list)
- History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
- Session expiry after 30 min inactivity

### Agent Loop
Based on infra-agent's `_drive_to_end_turn()`:
1. Build messages list (system prompt + conversation history)
2. Call Claude API with tools
3. If tool_use in response → execute tools concurrently → append results
4. Loop until stop_reason == "end_turn"
5. Return assistant text
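To make this concrete, a minimal sketch of the loop and the trimming rule using the `anthropic` SDK. The model id, `MAX_TURNS` value, and helper names are illustrative, not settled, and the Phase 2 concurrency (ThreadPoolExecutor) is omitted for brevity:

```python
# Sketch of the agent loop (agent.py), not a final implementation.
# Assumes TOOLS (schemas) and DISPATCH (name -> handler) from tolkien.tools.
import anthropic

MAX_TURNS = 20  # mirrors infra-agent's history cap


def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the first turn plus the last MAX_TURNS entries.

    Naive sketch: a real version must not split a tool_use/tool_result pair.
    """
    if len(messages) <= MAX_TURNS + 1:
        return messages
    return messages[:1] + messages[-MAX_TURNS:]


def drive_to_end_turn(client: anthropic.Anthropic, system_prompt: str,
                      messages: list[dict], tools: list[dict],
                      dispatch: dict) -> str:
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model id
            max_tokens=4096,
            system=system_prompt,
            messages=trim_history(messages),
            tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # end_turn: return the concatenated assistant text blocks
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool (sequentially here; Phase 2 adds a
        # ThreadPoolExecutor) and feed the results back as a user message.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": dispatch[block.name](**block.input),
                })
        messages.append({"role": "user", "content": results})
```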
### Dependencies
- `anthropic` — Claude API SDK
- `httpx` — CLI client HTTP calls
- `flask` or `fastapi` + `uvicorn` — API server
- `rich` — CLI REPL formatting
- `python-dotenv` — env var loading

### Tasks
- [ ] Initialize uv project with pyproject.toml
- [ ] Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
- [ ] Implement agent.py with Claude loop (no tools yet)
- [ ] Implement session.py
- [ ] Implement server.py with session + message endpoints
- [ ] Implement cli.py REPL that POSTs to the server
- [ ] Implement __main__.py with serve/repl subcommands
- [ ] Test locally: run server, connect with REPL, have a conversation

---

## Phase 2: Tool System

### Goal
Give the agent tools to query and understand the homelab infrastructure.

### Tool Architecture
Each tool module exports:
- `TOOLS: list[dict]` — Claude tool schemas (name, description, input_schema)
- `DISPATCH: dict[str, Callable]` — name → handler function

`tools/__init__.py` aggregates all modules into a single registry.
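A minimal sketch of that aggregation, assuming each submodule defines `TOOLS` and `DISPATCH` as above (module names match the tool list that follows):

```python
# tolkien/tools/__init__.py: sketch of the registry aggregation, not final.
from typing import Callable

from . import argocd, docs, gitea, kubectl, vault, web, write_code

TOOLS: list[dict] = []
DISPATCH: dict[str, Callable[..., str]] = {}

for module in (kubectl, gitea, argocd, vault, docs, web, write_code):
    for schema in module.TOOLS:
        # Tool names must be globally unique for Claude's tool-use dispatch
        if schema["name"] in DISPATCH:
            raise ValueError(f"duplicate tool name: {schema['name']}")
    TOOLS.extend(module.TOOLS)
    DISPATCH.update(module.DISPATCH)
```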
### Tools to Implement

#### `tools/kubectl.py` — Kubernetes Queries
- **Allowed verbs**: get, describe, logs, top, auth can-i
- **Blocked**: create, delete, apply, patch, edit, exec, port-forward
- Timeout: 60s
- Output truncation: 12,000 chars

#### `tools/gitea.py` — Gitea via `tea` CLI
- List/view issues
- List/view PRs
- Create issues, create PRs (draft)
- Search repos
- Blocked: delete operations

#### `tools/argocd.py` — ArgoCD Status
- `argocd app list` — all apps and sync status
- `argocd app get <app>` — detailed app status
- `argocd app diff <app>` — pending changes
- `argocd app history <app>` — deployment history
- Read-only: no sync, rollback, or delete operations

#### `tools/vault.py` — Vault Metadata Only
- `vault kv metadata get <path>` — check secret exists, see versions
- `vault kv list <path>` — list secret paths
- **Explicitly blocked**: `vault kv get` (cannot read actual secret values)
- Policy: `capabilities = ["read", "list"]` on `secret/metadata/*` (metadata get needs read), deny on `secret/data/*`

#### `tools/docs.py` — Lord of the Rings Documentation
- Read files from `~/Personal/lord-of-the-rings/`
- List available docs (directory listing)
- Search docs content (grep)
- In server context: mount lotr as a volume or clone it

#### `tools/web.py` — Web Fetch
- Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
- HTTPS only
- Timeout: 15s
- Output truncation: 12,000 chars

#### `tools/write_code.py` — Claude Code Subprocess
- Clones valinor repo → runs `claude --print` → creates branch + PR
- Risk classification for changes (optional, simpler than infra-agent's)
- Always creates draft PRs, never merges

### Tasks
- [ ] Implement tool registry in tools/__init__.py
- [ ] Implement each tool module
- [ ] Wire tools into agent.py (pass to Claude API, dispatch results)
- [ ] Add concurrent tool execution (ThreadPoolExecutor)
- [ ] Test each tool in isolation
- [ ] Test full loop: REPL → server → Claude → tool → response

---

## Phase 3: System Prompt + Knowledge

### Goal
Give the agent enough context to be genuinely useful for homelab operations.

### System Prompt Sections
1. **Identity**: You are Tolkien, an infrastructure agent for the valinor homelab
2. **Infrastructure overview**: Cluster nodes, storage classes, key services
3. **Repository structure**: How valinor is organized (apps/, manifests/, ansible/, terraform/)
4. **Tool usage guide**: When to use each tool, with examples
5. **Workflows**: Step-by-step for common operations:
   - Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
   - Checking app health (argocd status, kubectl pods, logs)
   - Investigating issues (kubectl describe, logs, events)
   - Checking backup status (volsync, CNPG S3 backups)
   - Managing external hosts (ansible stacks)
6. **Documentation references**: How to find and cite lotr docs
7. **Safety rules**: Read-only by default, write_code for changes, always draft PRs

### Tasks
- [ ] Write system prompt in agent.py (or separate prompt.py)
- [ ] Test with real scenarios (deploy app, check status, troubleshoot)
- [ ] Iterate on prompt based on results

---

## Phase 4: Containerize + Deploy

### Goal
Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.
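To make that CLI → server hop concrete, a client-side sketch of the calls cli.py would make. The endpoint shapes follow the Phase 1 API design; the bearer token follows the API Authentication section below; `TOLKIEN_URL` and `TOLKIEN_TOKEN` are illustrative env var names, not settled config:

```python
# Sketch of the thin CLI client's server calls (cli.py), not final.
import os

import httpx

BASE = os.environ["TOLKIEN_URL"]  # e.g. https://tolkien.jpnadas.xyz
HEADERS = {"Authorization": f"Bearer {os.environ['TOLKIEN_TOKEN']}"}


def new_session() -> str:
    r = httpx.post(f"{BASE}/sessions", headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["session_id"]


def send(session_id: str, text: str) -> str:
    r = httpx.post(
        f"{BASE}/sessions/{session_id}/messages",
        headers=HEADERS,
        json={"message": text},
        timeout=300,  # agent turns with tool calls can be slow
    )
    r.raise_for_status()
    return r.json()["response"]
```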
### Dockerfile
```dockerfile
FROM python:3.13-slim
# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API
```

### Helm Chart (in valinor repo)
```
valinor/apps/tolkien/
├── config.yaml        # Chart reference (bjw-s app-template)
├── values.yaml        # Image, env, probes, volumes
├── vault-auth.yaml    # VaultAuth for k8s auth
└── vault-secret.yaml  # VaultStaticSecret (ANTHROPIC_API_KEY)
```

### Networking
- Ingress: `tolkien.jpnadas.xyz` (or internal-only via ClusterIP + VPN)
- Consider: API key auth or mTLS for the API endpoint (don't expose unauthenticated)

### API Authentication
Simple shared secret for now (see the sketch at the end of this document):
- Server checks `Authorization: Bearer <token>` header
- Token stored in Vault, configured in CLI via env var
- Can upgrade to mTLS or OAuth later

### Volume Mounts
- lord-of-the-rings docs: either git-clone init container or PVC
- Valinor repo: clone on-demand for write_code tool
- kubectl: ServiceAccount with read-only ClusterRole

### RBAC
- ServiceAccount `tolkien` with read-only ClusterRole:
  - get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
  - get/list on nodes
  - logs on pods
- No write permissions

### Vault
- Policy: read `secret/data/tolkien/*`, list `secret/metadata/*`
- Kubernetes auth role bound to `tolkien` ServiceAccount

### Tasks
- [ ] Write Dockerfile
- [ ] Build and push to Gitea registry manually (first time)
- [ ] Create valinor/apps/tolkien/ with all manifests
- [ ] Add Vault terraform module for tolkien
- [ ] Create RBAC manifests (ClusterRole + ClusterRoleBinding)
- [ ] Deploy via ArgoCD
- [ ] Test CLI → server connectivity
- [ ] Set up Gitea Actions runner in cluster (separate task)
- [ ] Set up CI pipeline (.gitea/workflows/) for build + push on tag

---

## Phase 5: Messaging Integration (Future)

### Goal
Add Telegram or Signal as alternative interfaces.

### Approach
- Add a `telegram.py` handler (python-telegram-bot library)
- Same session/agent backend, just different input/output transport
- Bot token in Vault
- Webhook mode (Telegram pushes to tolkien's API)

### Tasks
- [ ] Choose Telegram vs Signal
- [ ] Implement bot handler
- [ ] Add webhook endpoint to server
- [ ] Deploy and test

---

## Open Questions

1. **Ansible access**: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to external hosts — big blast radius. Start read-only?
2. **write_code scope**: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
3. **Streaming responses**: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers.
4. **Rate limiting**: Any concern about Anthropic API costs? Could add a simple per-session token budget.
5. **lotr access in-cluster**: Git clone as init container (stale) vs. mount from a shared PVC vs. fetch on demand via Gitea API? Gitea API is simplest and always fresh.
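---

For reference, a minimal sketch of the shared-secret check described under API Authentication in Phase 4, assuming the FastAPI option; the `TOLKIEN_API_TOKEN` name and route wiring are illustrative:

```python
# Sketch of the bearer-token guard for server.py, not final.
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()


def require_token(request: Request) -> None:
    header = request.headers.get("Authorization", "")
    expected = f"Bearer {os.environ['TOLKIEN_API_TOKEN']}"
    # Constant-time comparison to avoid timing side channels
    if not secrets.compare_digest(header.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="invalid or missing token")


@app.post("/sessions", dependencies=[Depends(require_token)])
def create_session() -> dict:
    session_id = secrets.token_hex(16)
    # A real implementation would register a Session object here
    return {"session_id": session_id}


@app.get("/healthz")  # health check stays unauthenticated for k8s probes
def healthz() -> dict:
    return {"status": "ok"}
```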