# Tolkien — Infrastructure Agent for Valinor

## Overview

Tolkien is an infrastructure agent for homelab management. It understands the valinor GitOps repository, the TuringPi k3s cluster, external hosts managed via Ansible, Terraform resources, and the lord-of-the-rings documentation repository.

**Architecture**: Client-server model.

- **Server** runs in the k3s cluster (has access to kubectl, vault, argocd, tea, docs)
- **CLI client** is a thin REPL on the operator's machine that connects to the server over HTTPS

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Language | Python (uv) | Matches infra-agent pattern, good Anthropic SDK support |
| LLM | Anthropic API (direct) | Simpler than Vertex AI for homelab use |
| Interface | CLI REPL → server API | Start with CLI, add Telegram/Signal later |
| Registry | Gitea container registry | Zero additional setup, `gitea.jpnadas.xyz/jpnadas/tolkien` |
| Deployment | Helm chart in valinor, ArgoCD | Self-managed, same pattern as all other apps |
| CI | Gitea Actions runner in cluster | Gitea host (isildur) is too weak for builds |

## Infrastructure Context

### Valinor Repository Structure

```
valinor/
├── apps/                    # Helm-based apps (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, values.yaml, *.yaml
├── manifests/               # Raw k8s manifests (ArgoCD ApplicationSet)
│   └── <app>/config.yaml, *.yaml
├── ansible/                 # Docker Compose stacks on external hosts
│   ├── stacks/isildur/      # Pi 5 — Caddy, Unifi Controller, HAProxy
│   ├── stacks/iluvatar/     # ZFS NAS — iSCSI, NFS
│   ├── playbooks/
│   └── inventory.yml
├── terraform/               # IaC for supporting services
│   ├── vault/               # Vault policies + k8s auth roles
│   ├── minio/               # Buckets, users, S3 creds → Vault
│   ├── cloudflare/          # DNS records
│   ├── arr/                 # Arr stack config
│   └── netbox/              # NetBox resources
└── applicationset.yaml      # ArgoCD ApplicationSet definitions
```

### Cluster Nodes

| Node | Role | Notes |
|------|------|-------|
| merry | k3s controller | Dedicated SSD (etcd only) |
| sam | k3s worker | SATA SSD (Longhorn + databases) |
| pippin | k3s worker | SATA SSD (Longhorn + databases) |
| rosie | k3s worker | NVMe SSD (fastest, Longhorn + databases) |
| isildur | external | Pi 5 — Caddy, Unifi, HAProxy, Gitea |
| iluvatar | external | Optiplex — ZFS NAS, iSCSI, NFS |

### Storage Classes

| Class | Backend | Use Case |
|-------|---------|----------|
| `longhorn` | Distributed SSD (replica=2) | Configs, caches, app state |
| `local-path` | Node-local SSD | CNPG databases (PG replication handles HA) |
| `bulk-storage` | ZFS on iluvatar (iSCSI) | Media, downloads |
| `frigate-storage` | ZFS on iluvatar (iSCSI) | Frigate recordings |

### Key Services

- **ArgoCD**: GitOps sync (auto-sync + prune + self-heal)
- **Vault**: Secret management (Vault Secrets Operator in cluster)
- **MinIO**: S3-compatible storage for CNPG backups
- **CNPG**: CloudNativePG for PostgreSQL (local-path + streaming replication)
- **Longhorn**: Distributed block storage
- **VolSync**: PVC backup/restore
- **cert-manager**: TLS certificates
- **ingress-nginx**: Ingress controller
- **MetalLB**: Bare-metal load balancer

### Documentation

- **lord-of-the-rings** (`~/Personal/lord-of-the-rings/`): All homelab docs
  - `homelab/valinor/` — per-app docs, migration plans, storage docs
  - `homelab/` — infrastructure docs (MinIO, ArgoCD, Vault, hardware)
- **valinor CLAUDE.md**: Contains repo conventions, app structure, skills

### Git Workflow

- Never push directly to main
- Feature/fix branches → PR via `tea` CLI → merge in Gitea
- ArgoCD auto-syncs from main branch

---

## Phase 1: Project Skeleton + CLI REPL

### Goal

A working REPL that sends messages to Claude and gets responses, with no tools yet.

### Project Structure

```
tolkien/
├── pyproject.toml           # uv project config
├── plan.md                  # This file
├── README.md
├── tolkien/
│   ├── __init__.py
│   ├── __main__.py          # CLI entrypoint (repl or serve)
│   ├── agent.py             # Claude tool-calling loop (Anthropic SDK)
│   ├── session.py           # Conversation state + history trimming
│   ├── cli.py               # REPL client (connects to server API)
│   ├── server.py            # HTTP API server (FastAPI or Flask)
│   ├── config.py            # Settings (env vars, defaults)
│   └── tools/
│       └── __init__.py      # Tool registry + dispatch
└── tests/
    └── ...
```

### Entrypoints

- `python -m tolkien serve` — Start the API server
- `python -m tolkien repl` — Start the CLI REPL (thin client to the server)

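A minimal sketch of how `__main__.py` could dispatch these two subcommands (the `server.run()` and `cli.run_repl()` entrypoints are assumptions, not settled API):

```python
"""tolkien.__main__ — dispatches the serve/repl subcommands."""
import argparse

from tolkien import cli, server  # hypothetical module-level entrypoints


def main() -> None:
    parser = argparse.ArgumentParser(prog="tolkien")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("serve", help="Start the API server")
    sub.add_parser("repl", help="Start the CLI REPL (thin client)")
    args = parser.parse_args()

    if args.command == "serve":
        server.run()    # e.g. uvicorn.run(app, ...) inside server.py
    else:
        cli.run_repl()  # REPL loop that POSTs to the server


if __name__ == "__main__":
    main()
```
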
### API Design (Server)

```
POST   /sessions                 → Create a new session, returns {session_id}
POST   /sessions/{id}/messages   → Send a message, returns {response}
GET    /sessions/{id}            → Get session state
DELETE /sessions/{id}            → End session
GET    /healthz                  → Health check
```

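A rough FastAPI sketch of these routes, assuming in-memory sessions for now; the agent call is stubbed out and names are illustrative:

```python
# server.py — minimal sketch of the session API (FastAPI); auth and error handling omitted
import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
SESSIONS: dict[str, list[dict]] = {}  # session_id → message history (placeholder for Session objects)


class MessageIn(BaseModel):
    content: str


@app.post("/sessions")
def create_session() -> dict:
    session_id = str(uuid.uuid4())
    SESSIONS[session_id] = []
    return {"session_id": session_id}


@app.post("/sessions/{session_id}/messages")
def send_message(session_id: str, message: MessageIn) -> dict:
    if session_id not in SESSIONS:
        raise HTTPException(status_code=404, detail="unknown session")
    history = SESSIONS[session_id]
    history.append({"role": "user", "content": message.content})
    # reply = agent.run_turn(history)  # the Claude tool-calling loop from agent.py
    reply = "(agent reply placeholder)"
    history.append({"role": "assistant", "content": reply})
    return {"response": reply}


@app.get("/sessions/{session_id}")
def get_session(session_id: str) -> dict:
    if session_id not in SESSIONS:
        raise HTTPException(status_code=404, detail="unknown session")
    return {"session_id": session_id, "messages": SESSIONS[session_id]}


@app.delete("/sessions/{session_id}")
def delete_session(session_id: str) -> dict:
    SESSIONS.pop(session_id, None)
    return {"deleted": session_id}


@app.get("/healthz")
def healthz() -> dict:
    return {"status": "ok"}
```
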
### Session Model

- In-memory dict (session_id → Session)
- Session holds conversation history (messages list)
- History trimming: keep first turn + last N turns (like infra-agent's 20-turn cap)
- Session expiry after 30 min inactivity

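A possible shape for the session store, assuming a dataclass per session with trimming and expiry handled as described above (constants and names are illustrative):

```python
# session.py — sketch of the in-memory session store
import time
import uuid
from dataclasses import dataclass, field

MAX_TURNS = 20            # keep first turn + last N messages, like infra-agent
EXPIRY_SECONDS = 30 * 60  # 30 min inactivity


@dataclass
class Session:
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    messages: list[dict] = field(default_factory=list)
    last_active: float = field(default_factory=time.monotonic)

    def append(self, message: dict) -> None:
        self.messages.append(message)
        self.last_active = time.monotonic()
        self._trim()

    def _trim(self) -> None:
        # Keep the first turn for context plus the most recent MAX_TURNS messages.
        if len(self.messages) > MAX_TURNS + 1:
            self.messages = self.messages[:1] + self.messages[-MAX_TURNS:]

    @property
    def expired(self) -> bool:
        return time.monotonic() - self.last_active > EXPIRY_SECONDS


SESSIONS: dict[str, Session] = {}


def reap_expired() -> None:
    """Drop sessions idle for longer than EXPIRY_SECONDS."""
    for sid in [s for s, sess in SESSIONS.items() if sess.expired]:
        del SESSIONS[sid]
```
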
### Agent Loop

Based on infra-agent's `_drive_to_end_turn()`:

1. Build messages list (system prompt + conversation history)
2. Call Claude API with tools
3. If tool_use in response → execute tools concurrently → append results
4. Loop until stop_reason == "end_turn"
5. Return assistant text

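A sketch of that loop with the Anthropic SDK; the model id is a placeholder and the tool registry import assumes the Phase 2 layout:

```python
# agent.py — sketch of the tool-calling loop (Anthropic SDK)
import anthropic

from tolkien.tools import TOOLS, dispatch_tool  # hypothetical registry, see Phase 2

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def run_turn(system_prompt: str, messages: list[dict]) -> str:
    """Drive one user turn to completion, executing tool calls until end_turn."""
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",   # placeholder model id
            max_tokens=4096,
            system=system_prompt,
            messages=messages,
            tools=TOOLS,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # end_turn (or max_tokens): return the assistant's text blocks
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute every requested tool and feed the results back as a user message.
        # Sequential here; a ThreadPoolExecutor can parallelize this later (Phase 2 task).
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = dispatch_tool(block.name, block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": results})
```
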
### Dependencies

- `anthropic` — Claude API SDK
- `httpx` — CLI client HTTP calls
- `flask` or `fastapi` + `uvicorn` — API server
- `rich` — CLI REPL formatting
- `python-dotenv` — env var loading

### Tasks

- [ ] Initialize uv project with pyproject.toml
- [ ] Implement config.py (ANTHROPIC_API_KEY, SERVER_URL, etc.)
- [ ] Implement agent.py with Claude loop (no tools yet)
- [ ] Implement session.py
- [ ] Implement server.py with session + message endpoints
- [ ] Implement cli.py REPL that POSTs to the server
- [ ] Implement `__main__.py` with serve/repl subcommands
- [ ] Test locally: run server, connect with REPL, have a conversation

---

## Phase 2: Tool System

### Goal

Give the agent tools to query and understand the homelab infrastructure.

### Tool Architecture

Each tool module exports:

- `TOOLS: list[dict]` — Claude tool schemas (name, description, input_schema)
- `DISPATCH: dict[str, Callable]` — name → handler function

`tools/__init__.py` aggregates all modules into a single registry.

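A sketch of that aggregation, assuming each module exposes `TOOLS` and `DISPATCH` as above (module names mirror the plan but are not final):

```python
# tools/__init__.py — aggregate every tool module into one registry
from typing import Any, Callable

from tolkien.tools import argocd, docs, gitea, kubectl, vault, web, write_code  # hypothetical modules

_MODULES = [kubectl, gitea, argocd, vault, docs, web, write_code]

TOOLS: list[dict] = [schema for mod in _MODULES for schema in mod.TOOLS]
DISPATCH: dict[str, Callable[..., str]] = {
    name: handler for mod in _MODULES for name, handler in mod.DISPATCH.items()
}


def dispatch_tool(name: str, tool_input: dict[str, Any]) -> str:
    """Run a tool by name; return its output (or an error string) for the tool_result block."""
    handler = DISPATCH.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    try:
        return handler(**tool_input)
    except Exception as exc:  # surface tool failures to the model instead of crashing the loop
        return f"Tool {name} failed: {exc}"
```
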
### Tools to Implement

#### `tools/kubectl.py` — Kubernetes Queries

- **Allowed verbs**: get, describe, logs, top, auth can-i
- **Blocked**: create, delete, apply, patch, edit, exec, port-forward
- Timeout: 60s
- Output truncation: 12,000 chars

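A sketch of the command wrapper with the verb allowlist, timeout, and truncation described above (the tool schema itself is omitted):

```python
# tools/kubectl.py — read-only kubectl wrapper (sketch)
import subprocess

ALLOWED_VERBS = {"get", "describe", "logs", "top", "auth"}  # "auth" only for "auth can-i"
TIMEOUT_SECONDS = 60
MAX_OUTPUT_CHARS = 12_000


def run_kubectl(args: list[str]) -> str:
    """Run a kubectl query if the verb is allowed; truncate long output."""
    if not args or args[0] not in ALLOWED_VERBS:
        return f"Verb not allowed: {args[0] if args else '(none)'}"
    if args[0] == "auth" and args[1:2] != ["can-i"]:
        return "Only 'kubectl auth can-i' is allowed"
    try:
        proc = subprocess.run(
            ["kubectl", *args],
            capture_output=True,
            text=True,
            timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return "kubectl timed out after 60s"
    output = proc.stdout + proc.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        output = output[:MAX_OUTPUT_CHARS] + "\n... [truncated]"
    return output
```
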
#### `tools/gitea.py` — Gitea via `tea` CLI

- List/view issues
- List/view PRs
- Create issues, create PRs (draft)
- Search repos
- Blocked: delete operations

#### `tools/argocd.py` — ArgoCD Status

- `argocd app list` — all apps and sync status
- `argocd app get <name>` — detailed app status
- `argocd app diff <name>` — pending changes
- `argocd app history <name>` — deployment history
- Read-only: no sync, rollback, or delete operations

#### `tools/vault.py` — Vault Metadata Only

- `vault kv metadata get <path>` — check secret exists, see versions
- `vault kv list <path>` — list secret paths
- **Explicitly blocked**: `vault kv get` (cannot read actual secret values)
- Policy: `capabilities = ["read", "list"]` on `secret/metadata/*`, deny on `secret/data/*` (metadata gets need `read` on the metadata path; listing needs `list`)

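A sketch of the two metadata operations as subprocess wrappers (the 30s timeout is an assumption):

```python
# tools/vault.py — metadata-only Vault access (sketch); secret values are never read
import subprocess

MAX_OUTPUT_CHARS = 12_000


def vault_metadata(path: str) -> str:
    """Show versions/metadata for a secret without reading its value."""
    return _run(["vault", "kv", "metadata", "get", path])


def vault_list(path: str) -> str:
    """List secret paths under a prefix."""
    return _run(["vault", "kv", "list", path])


def _run(cmd: list[str]) -> str:
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return (proc.stdout + proc.stderr)[:MAX_OUTPUT_CHARS]
```
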
#### `tools/docs.py` — Lord of the Rings Documentation

- Read files from `~/Personal/lord-of-the-rings/`
- List available docs (directory listing)
- Search docs content (grep)
- In server context: mount lotr as a volume or clone it

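A sketch of the docs helpers, assuming a local checkout at `DOCS_ROOT`; in the server context the root would point at the cloned or mounted copy:

```python
# tools/docs.py — read-only access to the lord-of-the-rings docs (sketch)
from pathlib import Path

DOCS_ROOT = Path("~/Personal/lord-of-the-rings").expanduser()  # in-cluster: clone or mounted volume


def list_docs() -> str:
    """Return the relative paths of all markdown files under the docs root."""
    return "\n".join(str(p.relative_to(DOCS_ROOT)) for p in sorted(DOCS_ROOT.rglob("*.md")))


def read_doc(relative_path: str) -> str:
    """Read a single doc, refusing paths that escape the docs root."""
    target = (DOCS_ROOT / relative_path).resolve()
    if not target.is_relative_to(DOCS_ROOT.resolve()):
        return "Path escapes the docs root"
    return target.read_text()


def search_docs(pattern: str) -> str:
    """Naive grep over all markdown files; returns matching lines with their file paths."""
    hits = []
    for p in DOCS_ROOT.rglob("*.md"):
        for i, line in enumerate(p.read_text(errors="ignore").splitlines(), start=1):
            if pattern.lower() in line.lower():
                hits.append(f"{p.relative_to(DOCS_ROOT)}:{i}: {line.strip()}")
    return "\n".join(hits[:200])  # cap output
```
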
#### `tools/web.py` — Web Fetch

- Allowlisted domains: Kubernetes docs, Helm chart docs, ArgoCD docs, etc.
- HTTPS only
- Timeout: 15s
- Output truncation: 12,000 chars

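A sketch of the allowlisted fetch with `httpx`; the allowlist contents here are examples only:

```python
# tools/web.py — allowlisted HTTPS fetch (sketch)
from urllib.parse import urlparse

import httpx

ALLOWED_HOSTS = {"kubernetes.io", "argo-cd.readthedocs.io", "helm.sh"}  # example allowlist
TIMEOUT_SECONDS = 15
MAX_OUTPUT_CHARS = 12_000


def fetch_url(url: str) -> str:
    """Fetch an allowlisted HTTPS URL and return the (truncated) body text."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return "Only https:// URLs are allowed"
    if parsed.hostname not in ALLOWED_HOSTS:
        return f"Host not in allowlist: {parsed.hostname}"
    response = httpx.get(url, timeout=TIMEOUT_SECONDS, follow_redirects=True)
    return response.text[:MAX_OUTPUT_CHARS]
```
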
#### `tools/write_code.py` — Claude Code Subprocess

- Clones valinor repo → runs `claude --print` → creates branch + PR
- Risk classification for changes (optional, simpler than infra-agent's)
- Always creates draft PRs, never merges

### Tasks

- [ ] Implement tool registry in `tools/__init__.py`
- [ ] Implement each tool module
- [ ] Wire tools into agent.py (pass to Claude API, dispatch results)
- [ ] Add concurrent tool execution (ThreadPoolExecutor)
- [ ] Test each tool in isolation
- [ ] Test full loop: REPL → server → Claude → tool → response

---

## Phase 3: System Prompt + Knowledge

### Goal

Give the agent enough context to be genuinely useful for homelab operations.

### System Prompt Sections

1. **Identity**: You are Tolkien, an infrastructure agent for the valinor homelab
2. **Infrastructure overview**: Cluster nodes, storage classes, key services
3. **Repository structure**: How valinor is organized (apps/, manifests/, ansible/, terraform/)
4. **Tool usage guide**: When to use each tool, with examples
5. **Workflows**: Step-by-step for common operations:
   - Deploying a new app (check chart, create config.yaml + values.yaml, terraform if needed)
   - Checking app health (argocd status, kubectl pods, logs)
   - Investigating issues (kubectl describe, logs, events)
   - Checking backup status (volsync, CNPG S3 backups)
   - Managing external hosts (ansible stacks)
6. **Documentation references**: How to find and cite lotr docs
7. **Safety rules**: Read-only by default, write_code for changes, always draft PRs

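One option, if the prompt moves to a separate `prompt.py`, is to assemble it from named section constants; a sketch with abbreviated section text:

```python
# prompt.py — assemble the system prompt from named sections (sketch; text abbreviated)
IDENTITY = "You are Tolkien, an infrastructure agent for the valinor homelab."
INFRASTRUCTURE = "Cluster: merry (controller), sam/pippin/rosie (workers); external hosts: isildur, iluvatar. ..."
REPO_STRUCTURE = "valinor/ is organized into apps/, manifests/, ansible/, terraform/. ..."
TOOL_GUIDE = "Use kubectl for cluster state, argocd for sync status, vault for secret metadata only. ..."
WORKFLOWS = "To deploy a new app: check the chart, create config.yaml + values.yaml, add terraform if needed. ..."
DOCS = "Cite documentation from the lord-of-the-rings repo when relevant. ..."
SAFETY = "You are read-only by default. Use write_code for changes and always open draft PRs; never merge."

SYSTEM_PROMPT = "\n\n".join([
    IDENTITY, INFRASTRUCTURE, REPO_STRUCTURE, TOOL_GUIDE, WORKFLOWS, DOCS, SAFETY,
])
```
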
### Tasks

- [ ] Write system prompt in agent.py (or separate prompt.py)
- [ ] Test with real scenarios (deploy app, check status, troubleshoot)
- [ ] Iterate on prompt based on results

---

## Phase 4: Containerize + Deploy

### Goal

Run tolkien as a service in the k3s cluster, accessible via CLI from the operator's machine.

### Dockerfile

```dockerfile
FROM python:3.13-slim

# Install: kubectl, tea, argocd CLI, vault CLI, claude CLI
# Copy project, install deps with uv
# Entrypoint: gunicorn/uvicorn serving the API
```

### Helm Chart (in valinor repo)

```
valinor/apps/tolkien/
├── config.yaml         # Chart reference (bjw-s app-template)
├── values.yaml         # Image, env, probes, volumes
├── vault-auth.yaml     # VaultAuth for k8s auth
└── vault-secret.yaml   # VaultStaticSecret (ANTHROPIC_API_KEY)
```

### Networking

- Ingress: `tolkien.jpnadas.xyz` (or internal-only via ClusterIP + VPN)
- Consider: API key auth or mTLS for the API endpoint (don't expose it unauthenticated)

### API Authentication

Simple shared secret for now:

- Server checks `Authorization: Bearer <token>` header
- Token stored in Vault, configured in CLI via env var
- Can upgrade to mTLS or OAuth later

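A sketch of the bearer-token check as a FastAPI dependency (the `TOLKIEN_API_TOKEN` env var name is an assumption):

```python
# server.py — shared-secret auth for the API (sketch)
import os

from fastapi import Header, HTTPException

API_TOKEN = os.environ["TOLKIEN_API_TOKEN"]  # hypothetical env var, injected from Vault


def require_auth(authorization: str = Header(default="")) -> None:
    """Reject requests whose Authorization header does not carry the shared bearer token."""
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="invalid or missing token")
```

Routes would then declare `Depends(require_auth)`, and the CLI sends the same token from its own environment.
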
### Volume Mounts

- lord-of-the-rings docs: either git-clone init container or PVC
- Valinor repo: clone on-demand for write_code tool
- kubectl: ServiceAccount with read-only ClusterRole

### RBAC

- ServiceAccount `tolkien` with read-only ClusterRole:
  - get/list/watch on pods, services, deployments, statefulsets, events, configmaps, PVCs, ingresses
  - get/list on nodes
  - logs on pods
- No write permissions

### Vault

- Policy: read `secret/data/tolkien/*`, read + list `secret/metadata/*`
- Kubernetes auth role bound to `tolkien` ServiceAccount

### Tasks

- [ ] Write Dockerfile
- [ ] Build and push to Gitea registry manually (first time)
- [ ] Create valinor/apps/tolkien/ with all manifests
- [ ] Add Vault terraform module for tolkien
- [ ] Create RBAC manifests (ClusterRole + ClusterRoleBinding)
- [ ] Deploy via ArgoCD
- [ ] Test CLI → server connectivity
- [ ] Set up Gitea Actions runner in cluster (separate task)
- [ ] Set up CI pipeline (.gitea/workflows/) for build + push on tag

---

## Phase 5: Messaging Integration (Future)

### Goal

Add Telegram or Signal as alternative interfaces.

### Approach

- Add a `telegram.py` handler (python-telegram-bot library)
- Same session/agent backend, just different input/output transport
- Bot token in Vault
- Webhook mode (Telegram pushes to tolkien's API)

### Tasks

- [ ] Choose Telegram vs Signal
- [ ] Implement bot handler
- [ ] Add webhook endpoint to server
- [ ] Deploy and test

---

## Open Questions

1. **Ansible access**: Should tolkien be able to run ansible playbooks, or just read the config? Running playbooks from in-cluster would need SSH keys to external hosts — big blast radius. Start read-only?
2. **write_code scope**: Should it only modify valinor, or also lord-of-the-rings? Probably both (code + docs).
3. **Streaming responses**: Should the CLI stream Claude's response as it generates, or wait for the full response? Streaming is better UX for long answers.
4. **Rate limiting**: Any concern about Anthropic API costs? Could add a simple per-session token budget.
5. **lotr access in-cluster**: Git clone as init container (stale) vs. mount from a shared PVC vs. fetch on demand via Gitea API? Gitea API is simplest and always fresh.