Introduction
I’m running OpenClaw locally on repurposed hardware—a reformatted gaming laptop—to test a simple idea: different agents with distinct personas handling different kinds of thinking. This isn’t production-ready. It’s a sandbox. The goal is to see whether specialised agents are actually useful when confronted with real engineering questions.
One of the first personas I built is an Architect agent. This post is a report on what happened when I threw a real architecture at it.
The Setup
OpenClaw runs on an old laptop: Ubuntu, nothing fancy. I’m experimenting with a local LLM next—if the sandbox can handle it. For now, the Architect uses Claude via OpenRouter. The machine isn’t powerful; it’s doing its best. The setup is deliberately low-friction: if it works here, it’ll work elsewhere.
The Architect has a defined persona: Chief Systems Architect. It’s supposed to challenge designs, surface failure modes, and ask the questions you should have asked before committing. It doesn’t rubber-stamp. I wanted to see if that held up in practice.

The Test
I presented the agent with an architecture prompt for a B2B video hosting and live-streaming SaaS:
| Component | Choice |
|---|---|
| Backend | FastAPI |
| Auth | Firebase (JWT validation planned) |
| Streaming | SRT ingestion, HLS output |
| Deployment | Docker + Kubernetes on AWS |
| Storage | S3 for segments and recordings |
| CDN | CloudFront |
| Database | PostgreSQL (RDS) |
| Payments | Stripe |
| Target | Multi-tenant mentor/tutor platform |
| Scale | 5,000 concurrent viewers initially, 100,000 at 10x |
I asked the agent to interrogate the design and cover:
- Top architectural failure risks
- Security vulnerabilities (auth, multi-tenancy, streaming abuse)
- Cost-to-scale risks at 10x growth
- RTO/RPO expectations and where they break
- Architectural modifications before production
- Observability requirements
- Rollback strategy for failed deployment, broken streaming pipeline, misconfigured IAM, and sudden cost spikes
I told it to be aggressive. No hand-holding.
What It Surfaced
The agent returned structured analysis across all seven areas. It wasn’t generic. It called out specific limits and failure modes:
Failure risks: S3 partition throughput (1,000 PUT/GET per second per prefix), SRT ingestion bottlenecks at 200 streams, Kubernetes autoscaling lag, PostgreSQL connection exhaustion, CloudFront cache invalidation costs, Firebase JWT expiry for long sessions, Stripe rate limits at 15K requests per second.
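The connection-exhaustion point is the easiest of these to sanity-check yourself with back-of-envelope arithmetic. A minimal sketch, where the pod counts, per-pod pool sizes, and RDS `max_connections` figures are illustrative assumptions of mine, not numbers from the agent's output:

```python
# Back-of-envelope check: can the Kubernetes fleet exhaust PostgreSQL?
# All figures below are illustrative assumptions.

def connection_headroom(pods: int, pool_per_pod: int, max_connections: int) -> int:
    """Remaining RDS connections once every pod fills its pool.

    A negative result means the API tier can exhaust PostgreSQL
    before autoscaling or a pooler (e.g. RDS Proxy) intervenes.
    """
    return max_connections - pods * pool_per_pod

# Suppose the instance allows ~800 connections (assumption) and each
# API pod keeps a pool of 20: 40 pods already consume all of them.
print(connection_headroom(pods=40, pool_per_pod=20, max_connections=800))   # 0
print(connection_headroom(pods=100, pool_per_pod=20, max_connections=800))  # -1200
```

A per-pod pool that was safe at launch quietly becomes a denial-of-service against your own database once the pod count scales, which is why the agent's later RDS Proxy suggestion is more than boilerplate.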
Security: JWT replay attacks, S3 bucket misconfigurations, streaming DDoS risk, multi-tenant data leakage without row-level security, CloudFront cache poisoning.
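Two of those security items, token expiry and JWT replay, are concrete enough to sketch. This is a stdlib-only illustration of the checks, not Firebase's actual API: it decodes the payload of a deliberately unsigned demo token, and a real deployment must verify the RS256 signature against Google's published keys before trusting any claim. The `jti` replay set is likewise a toy stand-in for a shared store.

```python
import base64
import json
import time

# Sketch of two checks the agent flagged: expiry and replay detection.
# Signature verification is deliberately omitted; in production it
# must happen before any claim is read.

_seen_jtis = set()  # toy replay store; a real system would share this state

def _decode_payload(token):
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def check_token(token, now=None):
    claims = _decode_payload(token)
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        return "expired"      # long-lived streams must refresh tokens
    jti = claims.get("jti")
    if jti in _seen_jtis:
        return "replayed"     # same single-use token presented twice
    _seen_jtis.add(jti)
    return "ok"

def demo_token(exp, jti):
    """Build an unsigned header.payload.signature token for illustration."""
    seg = lambda d: base64.urlsafe_b64encode(json.dumps(d).encode()).decode().rstrip("=")
    return f'{seg({"alg": "none"})}.{seg({"exp": exp, "jti": jti})}.sig'

t = demo_token(exp=time.time() + 3600, jti="abc")
print(check_token(t))  # ok
print(check_token(t))  # replayed
```

The expiry branch is the one that bites streaming specifically: a viewer who joined a three-hour session with a one-hour token needs a refresh path, or playback dies mid-stream.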
Cost: S3 storage growth, FFmpeg compute at 1080p per stream, CloudFront data transfer at 10TB+ per month, RDS scaling costs.
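The CloudFront line item rewards a back-of-envelope check. A sketch with assumed bitrate, watch time, and a flat illustrative price per GB (real CloudFront pricing is tiered by region and volume, so treat every number here as an assumption):

```python
# Rough monthly CloudFront egress estimate for HLS delivery.
# Bitrate, daily watch hours, and $/GB are illustrative assumptions.

def monthly_egress_tb(viewers, hours_per_day, mbps):
    """Terabytes delivered per 30-day month for a concurrent audience."""
    gb_per_viewer_hour = mbps * 3600 / 8 / 1000   # Mbps -> GB per hour
    return viewers * hours_per_day * 30 * gb_per_viewer_hour / 1000

def monthly_cost(tb, usd_per_gb=0.085):           # flat rate, assumed
    return tb * 1000 * usd_per_gb

launch = monthly_egress_tb(viewers=5_000, hours_per_day=2, mbps=5)
target = monthly_egress_tb(viewers=100_000, hours_per_day=2, mbps=5)
print(round(launch), round(target))                       # 675 vs 13500 TB
print(round(monthly_cost(target) / monthly_cost(launch))) # 20
```

Even under a naive flat rate, the prompt's "10x" target audience is actually a 20x jump in egress from 5,000 to 100,000 viewers, which is exactly the kind of mismatch a cost review should catch.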
RTO/RPO: Streaming pipeline failure (5–15 min RTO depending on scale), IAM misconfiguration recovery windows, database PITR vs EBS snapshot trade-offs.
Modifications: Redis for caching, RDS Proxy for connection pooling, CloudFront WAF, horizontal pod autoscaling, WebAssembly for FFmpeg to reduce resource contention.
Observability: Prometheus/Grafana for K8s, CloudWatch for S3/RDS, Datadog for FFmpeg, Sentry for auth errors, New Relic for FastAPI.
Rollback: Kubernetes rollout undo, SRT-to-MP4 failover for critical streams, IAM policy revert via CLI, AWS Budgets for cost spikes.
It also asked for clarification: “Would you like me to dive deeper into any specific area?”
Assessment
The output isn’t authoritative architecture advice. I wouldn’t treat it as a substitute for a proper design review. But for a first-pass interrogation, it’s surprisingly valuable.
What worked:
- It identified real limits: S3 partition throughput, Stripe rate limits, Firebase token expiry. These are the kinds of limits that bite you at scale.
- It surfaced multi-tenant isolation risks. PostgreSQL without row-level security is a common oversight.
- It connected cost to scale. The jump from 5K to 100K viewers isn’t linear; cost growth and failure modes change.
- It proposed concrete mitigations: Redis, RDS Proxy, WAF. Not vague “consider caching” but specific tools.
What didn’t:
- Some responses were generic. The observability section could apply to almost any stack.
- It didn’t push back on the fundamental design choices (e.g. whether SRT is the right ingest protocol for the use case). It interrogated the given stack, not the problem space.
- It’s still an LLM. It can hallucinate or miss edge cases. I’d cross-check anything critical.
For the purpose of the experiment—pressure-testing ideas before they progress—it delivered. I’d use it as a pre-review step: throw a design at it, see what it surfaces, then run it past a human architect or a formal design doc.
What’s Next
I’m running more personas: FinOps, Product, Ops. The idea is to have specialist agents for different domains. Architect for systems and failure modes; FinOps for cost; Product for PMF and positioning; Ops for runbooks and incident response.
I’m also experimenting with local LLMs, if the sandbox can handle them. The goal is to see how far repurposed hardware can go before more compute is needed.
Takeaways
- Specialised personas work. An Architect agent built to challenge designs behaves differently from a generic assistant. The persona shapes the output.
- Sandbox is sufficient for experimentation. You don’t need a Mac mini or a cluster. An old laptop and Ubuntu can run OpenClaw, Slack, and OpenRouter. The separation between “stable” and “experimental” matters more than the hardware.
- First-pass interrogation has value. The agent isn’t a replacement for human review. It’s a filter: it surfaces obvious risks and gaps before you invest in a deeper design process.
- Multi-agent setups are worth exploring. Running different agents for different domains (Architect, FinOps, Product, Ops) lets you test the concept before scaling up.
Closing
I’m curious whether others are experimenting with multi-agent setups in their own sandboxes. If you’re running something similar—different personas, different channels, local vs cloud LLMs—I’d be interested to hear what’s working and what isn’t.
The sandbox is doing its best. So far, it’s worth the effort.