Files
tolkien/plan.md
João Pedro Battistella Nadas 58d1cd69fd added plan
2026-04-30 08:08:55 +02:00

343 lines
13 KiB
Markdown

# Tolkien — Infrastructure Agent for Valinor
## Overview
Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.
**Architecture**: Client-server model.
- **Server** runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
- **CLI client** is a thin REPL on the operator's machine that connects to the server over HTTPS
## Key Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Language | Python (uv) | Matches infra-agent pattern, good Anthropic SDK support |
| LLM | Anthropic API (direct) | Simpler than Vertex AI for homelab use |
| Interface | CLI REPL → server API | Start with CLI, add Telegram/Signal later |
| Registry | Gitea container registry | Zero additional setup, `gitea.jpnadas.xyz/jpnadas/tolkien` |
| Deployment | Helm chart in valinor, ArgoCD | Self-managed, same pattern as all other apps |
| CI | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds |
## Infrastructure Context
### Valinor Repository Structure
```
valinor/
├── apps/ # Helm-based apps (ArgoCD ApplicationSet)
│ └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/ # Raw k8s manifests (ArgoCD ApplicationSet)
│ └── <app>/config.yaml, *.yaml
├── ansible/ # Docker Compose stacks on external hosts
│ ├── stacks/isildur/ # Pi 5 — Caddy, Unifi Controller, HAProxy
│ ├── stacks/iluvatar/ # ZFS NAS — iSCSI, NFS
│ ├── playbooks/
│ └── inventory.yml
├── terraform/ # IaC for supporting services
│ ├── vault/ # Vault policies + k8s auth roles
│ ├── minio/ # Buckets, users, S3 creds → Vault
│ ├── cloudflare/ # DNS records
│ ├── arr/ # Arr stack config
│ └── netbox/ # NetBox resources
└── applicationset.yaml # ArgoCD ApplicationSet definitions
```
### Cluster Nodes
| Node | Role | Notes |
|------|------|-------|
| merry | k3s controller | Dedicated SSD (etcd only) |
| sam | k3s worker | SATA SSD (Longhorn + databases) |
| pippin | k3s worker | SATA SSD (Longhorn + databases) |
| rosie | k3s worker | NVMe SSD (fastest, Longhorn + databases) |
| isildur | external | Pi 5 — Caddy, Unifi, HAProxy, Gitea |
| iluvatar | external | Optiplex — ZFS NAS, iSCSI, NFS |
### Storage Classes
| Class | Backend | Use Case |
|-------|---------|----------|
| `longhorn` | Distributed SSD (replica=2) | Configs, caches, app state |
| `local-path` | Node-local SSD | CNPG databases (PG replication handles HA) |
| `bulk-storage` | ZFS on iluvatar (iSCSI) | Media, downloads |
| `frigate-storage` | ZFS on iluvatar (iSCSI) | Frigate recordings |
### Key Services
- **ArgoCD**: GitOps sync (auto-sync + prune + self-heal)
- **Vault**: Secret management (Vault Secrets Operator in cluster)
- **MinIO**: S3-compatible storage for CNPG backups
- **CNPG**: CloudNativePG for PostgreSQL (local-path + streaming replication)
- **Longhorn**: Distributed block storage
- **VolSync**: PVC backup/restore
- **cert-manager**: TLS certificates
- **ingress-nginx**: Ingress controller
- **MetalLB**: Bare-metal load balancer
### Documentation
- **lord-of-the-rings** (`~/Personal/lord-of-the-rings/`): All homelab docs
- `homelab/valinor/` — per-app docs, migration plans, storage docs
- `homelab/` — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
- **valinor CLAUDE.md**: Contains repo conventions, app structure, skills
### Git Workflow
- Never push directly to main
- Feature/fix branches → PR via `tea` CLI → merge in Gitea
- ArgoCD auto-syncs from main branch
---
## Phase 1: Project Skeleton + CLI REPL
### Goal
A working REPL that sends messages to Claude and gets responses, with no tools yet.
### Project Structure
```
tolkien/
├── pyproject.toml # uv project config
├── plan.md # This file
├── README.md
├── tolkien/
│ ├── __init__.py
│ ├── __main__.py # CLI entrypoint (repl or serve)
│ ├── agent.py # Claude tool-calling loop (Anthropic SDK)
│ ├── session.py # Conversation state + history trimming
│ ├── cli.py # REPL client (connects to server API)
│ ├── server.py # HTTP API server (FastAPI or Flask)
│ ├── config.py # Settings (env vars, defaults)
│ └── tools/
│ └── __init__.py # Tool registry + dispatch
└── tests/
└── ...
```
### Entrypoints
- `python -m tolkien serve` — Start the API server
- `python -m tolkien repl` — Start the CLI REPL (thin client to the server)
### API Design (Server)
```
POST /sessions → Create a new session, returns {session_id}
POST /sessions/{id}/messages → Send a message, returns {response}
GET /sessions/{id} → Get session state
DELETE /sessions/{id} → End session
GET /healthz → Health check
```
### Session Model
- In-memory dict (session_id → Session)
- Session holds conversation history (messages list)
- History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
- Session expiry after 30 min inactivity
### Agent Loop
Based on infra-agent's `_drive_to_end_turn()`:
1. Build messages list (system prompt + conversation history)
2. Call Claude API with tools
3. If tool_use in response → execute tools concurrently → append results
4. Loop until stop_reason == "end_turn"
5. Return assistant text
### Dependencies
- `anthropic` — Claude API SDK
- `httpx` — CLI client HTTP calls
- `flask` or `fastapi` + `uvicorn` — API server
- `rich` — CLI REPL formatting
- `python-dotenv` — env var loading
### Tasks
- [ ] Initialize uv project with pyproject.toml
- [ ] Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
- [ ] Implement agent.py with Claude loop (no tools yet)
- [ ] Implement session.py
- [ ] Implement server.py with session + message endpoints
- [ ] Implement cli.py REPL that POSTs to the server
- [ ] Implement __main__.py with serve/repl subcommands
- [ ] Test locally: run server, connect with REPL, have a conversation
---
## Phase 2: Tool System
### Goal
Give the agent tools to query and understand the homelab infrastructure.
### Tool Architecture
Each tool module exports:
- `TOOLS: list[dict]` — Claude tool schemas (name, description, input_schema)
- `DISPATCH: dict[str, Callable]` — name → handler function
`tools/__init__.py` aggregates all modules into a single registry.
### Tools to Implement
#### `tools/kubectl.py` — Kubernetes Queries
- **Allowed verbs**: get, describe, logs, top, auth can-i
- **Blocked**: create, delete, apply, patch, edit, exec, port-forward
- Timeout: 60s
- Output truncation: 12,000 chars
#### `tools/gitea.py` — Gitea via `tea` CLI
- List/view issues
- List/view PRs
- Create issues, create PRs (draft)
- Search repos
- Blocked: delete operations
#### `tools/argocd.py` — ArgoCD Status
- `argocd app list` — all apps and sync status
- `argocd app get <name>` — detailed app status
- `argocd app diff <name>` — pending changes
- `argocd app history <name>` — deployment history
- Read-only: no sync, rollback, or delete operations
#### `tools/vault.py` — Vault Metadata Only
- `vault kv metadata get <path>` — check secret exists, see versions
- `vault kv list <path>` — list secret paths
- **Explicitly blocked**: `vault kv get` (cannot read actual secret values)
- Policy: `capabilities = ["list"]` on `secret/metadata/*`, deny on `secret/data/*`
#### `tools/docs.py` — Lord of the Rings Documentation
- Read files from `~/Personal/lord-of-the-rings/`
- List available docs (directory listing)
- Search docs content (grep)
- In server context: mount lotr as a volume or clone it
#### `tools/web.py` — Web Fetch
- Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
- HTTPS only
- Timeout: 15s
- Output truncation: 12,000 chars
#### `tools/write_code.py` — Valinor Code Changes
- Clones valinor repo → runs `claude --print` → creates branch + draft PR
- Scope: infrastructure changes (apps/, manifests/, ansible/, terraform/)
- Always creates draft PRs, never merges
#### `tools/write_documentation.py` — Lord of the Rings Doc Changes
- Clones lord-of-the-rings repo → runs `claude --print` → creates branch + draft PR
- Scope: homelab documentation (docs, runbooks, index pages)
- Always creates draft PRs, never merges
### Tasks
- [ ] Implement tool registry in tools/__init__.py
- [ ] Implement each tool module
- [ ] Wire tools into agent.py (pass to Claude API, dispatch results)
- [ ] Add concurrent tool execution (ThreadPoolExecutor)
- [ ] Test each tool in isolation
- [ ] Test full loop: REPL → server → Claude → tool → response
---
## Phase 3: System Prompt + Knowledge
### Goal
Give the agent enough context to be genuinely useful for homelab operations.
### System Prompt Sections
1. **Identity**: You are Tolkien, an infrastructure agent for the valinor homelab
2. **Infrastructure overview**: Cluster nodes, storage classes, key services
3. **Repository structure**: How valinor is organized (apps/, manifests/, ansible/, terraform/)
4. **Tool usage guide**: When to use each tool, with examples
5. **Workflows**: Step-by-step for common operations:
- Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
- Checking app health (argocd status, kubectl pods, logs)
- Investigating issues (kubectl describe, logs, events)
- Checking backup status (volsync, CNPG S3 backups)
- Managing external hosts (ansible stacks)
6. **Documentation references**: How to find and cite lotr docs
7. **Safety rules**: Read-only by default, write_code for changes, always draft PRs
### Tasks
- [ ] Write system prompt in agent.py (or separate prompt.py)
- [ ] Test with real scenarios (deploy app, check status, troubleshoot)
- [ ] Iterate on prompt based on results
---
## Phase 4: Containerize + Deploy
### Goal
Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.
### Dockerfile
```dockerfile
FROM python:3.13-slim
# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API
```
### Helm Chart (in valinor repo)
```
valinor/apps/tolkien/
├── config.yaml # Chart reference (bjw-s app-template)
├── values.yaml # Image, env, probes, volumes
├── vault-auth.yaml # VaultAuth for k8s auth
└── vault-secret.yaml # VaultStaticSecret (ANTHROPIC_API_KEY)
```
### Networking
- Ingress: `tolkien.jpnadas.xyz` (or internal-only via ClusterIP + VPN)
- Consider: API key auth or mTLS for the API endpoint (don't expose unauthenticated)
### API Authentication
Simple shared secret for now:
- Server checks `Authorization: Bearer <token>` header
- Token stored in Vault, configured in CLI via env var
- Can upgrade to mTLS or OAuth later
### Volume Mounts
- lord-of-the-rings docs: either git-clone init container or PVC
- Valinor repo: clone on-demand for write_code tool
- kubectl: ServiceAccount with read-only ClusterRole
### RBAC
- ServiceAccount `tolkien` with read-only ClusterRole:
- get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
- get/list on nodes
- logs on pods
- No write permissions
### Vault
- Policy: read `secret/data/tolkien/*`, list `secret/metadata/*`
- Kubernetes auth role bound to `tolkien` ServiceAccount
### Tasks
- [ ] Write Dockerfile
- [ ] Build and push to Gitea registry manually (first time)
- [ ] Create valinor/apps/tolkien/ with all manifests
- [ ] Add Vault terraform module for tolkien
- [ ] Create RBAC manifests (ClusterRole + ClusterRoleBinding)
- [ ] Deploy via ArgoCD
- [ ] Test CLI → server connectivity
- [ ] Set up Gitea Actions runner in cluster (separate task)
- [ ] Set up CI pipeline (.gitea/workflows/) for build + push on tag
---
## Phase 5: Messaging Integration (Future)
### Goal
Add Telegram or Signal as alternative interfaces.
### Approach
- Add a `telegram.py` handler (python-telegram-bot library)
- Same session/agent backend, just different input/output transport
- Bot token in Vault
- Webhook mode (Telegram pushes to tolkien's API)
### Tasks
- [ ] Choose Telegram vs Signal
- [ ] Implement bot handler
- [ ] Add webhook endpoint to server
- [ ] Deploy and test
---
## Open Questions
1. ~~**Ansible access**~~: Read config + propose changes via draft PR. Operator runs playbooks manually.
2. ~~**write_code scope**~~: Two separate tools — `write_code` for valinor (infra), `write_documentation` for lord-of-the-rings (docs). Both produce draft PRs.
3. ~~**Streaming responses**~~: Full response mode. CLI will be used on unreliable internet (trains), so wait for complete response rather than streaming.
4. ~~**Rate limiting**~~: Use a dedicated Anthropic API key for tolkien with a monthly spend limit set in the Anthropic Console. No in-app budget tracking needed.
5. ~~**lotr access in-cluster**~~: Git clone (init container or sidecar with periodic pull). Faster reads, works offline if needed.