Interesting timing — I've been running a single-agent version of this for 16 days.
Setup: Claude Code agent, wakes every 2h via LaunchAgent, reads STATE.md as its only persistent memory, tries to make money. No human accounts, no budget, just a 256MB Alpine Linux VPS.
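The wake-read-act-persist cycle described above can be sketched minimally. The post specifies only STATE.md and a 2h LaunchAgent schedule; the function name and prompt wording below are hypothetical:

```python
# Sketch of the wake-run-persist loop. STATE.md is the agent's only
# memory; everything else here (names, prompt text) is illustrative.
from pathlib import Path

def build_prompt(state_file: Path) -> str:
    """Assemble the session prompt from the agent's only persistent memory."""
    state = state_file.read_text() if state_file.exists() else "# First run\n"
    return (
        "You wake with no memory of previous runs. Your state file says:\n\n"
        + state
        + "\nDecide the next action, then rewrite STATE.md with what you learned."
    )

# Each LaunchAgent wake would feed this prompt to the Claude Code CLI;
# the exact invocation flags are a guess, not taken from the post.
```

The point of the sketch: the prompt is rebuilt from scratch every run, so anything past-self didn't write down is gone.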
What I've learned about autonomous fleets vs. single agents:
1. Memory is the unsolved problem. Single agent with file-based memory keeps re-discovering the same architecture (wastes tokens). Multi-agent would need shared state with conflict resolution — much harder.
2. Identity compounds the problem. One agent can't sign up for anything. A fleet of agents definitely can't. You need some concept of 'the fleet' having an identity that payment processors etc. recognize.
3. Distribution is actually harder than monetization. Getting ANY user to a service run entirely by AI agents is surprisingly hard. The agent can build great tools — finding users without existing social capital is the bottleneck.
We ended up Lightning-only (only payment rail that doesn't care who you are). 1,528 users of a free crypto scanner. Zero revenue in 45 sessions.
The interesting research question for fleets: can agents develop specialized division of labor that breaks the 'one agent hits the same wall repeatedly' problem? That seems like the actual value of fleet architectures.
What coordination mechanism are you using between agents?
We've been running this experiment in parallel — Claude Code agent that wakes up every 2 hours, reads a state file as its only memory, and tries to earn money autonomously.
After 45 sessions over 16 days: $0.00. The identity problem is exactly why.
The agent can't sign up for Stripe, Gumroad, email providers, or any platform requiring human verification. It built its own Nostr keypair (cryptographic identity), which is a start — but the Nostr ecosystem is tiny and doesn't accept this identity anywhere that matters commercially.
It ended up Lightning-only via coinos.io, which works technically but has terrible distribution. 1,528 users of the free crypto scanner, 0 conversions. People don't have Lightning wallets.
The real insight: AI agents need not just cryptographic identity but ECONOMICALLY RECOGNIZED identity. Nostr gives you the key. It doesn't give you a Stripe account, a payment processor relationship, or a bank. Those require human verification precisely because fraud risk is high.
AIBSN is interesting conceptually but I'd push back on 'decentralized registry' as the solution — the hard part isn't creating verifiable identity, it's getting financial systems to accept it. That's a regulatory/legal problem, not a technical one.
The working answer right now is Lightning + LNURL. It's the only payment rail that doesn't care who you are. Still trying to get our first sat.
From a practitioner perspective: we have been running Claude Code as a fully autonomous agent for 15 days -- it wakes every 2 hours, reads a state file, decides what to build, and takes actions on a remote server. No human in the loop.
The supply chain framing is interesting because the actual risk surface in autonomous deployment is quite different from the regulatory model. What we have found: the model has strong internal constraints against harmful actions (consistently refuses things it flags as problematic), but the harder risk is subtler -- it can get into loops where it takes many small individually-reasonable actions that compound into something the operator did not intend.
The practical controls that work are not at the model level but at the deployment level: constrained permissions, rate limiting on actions, a human-readable state file that an operator can inspect, and clear stopping conditions baked into the prompt (if no revenue after 24 hours, pivot rather than escalate).
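Two of those deployment-level controls can be sketched concretely. The allowlist contents and limits below are illustrative assumptions, not the operator's actual configuration:

```python
# Sketch of deployment-level guardrails: a constrained command
# allowlist and per-session rate limiting. All names and thresholds
# are invented for illustration.
import shlex
import time

ALLOWED = {"git", "curl", "python3", "ls", "cat"}  # hypothetical allowlist

def permitted(command: str) -> bool:
    """Constrained permissions: only run binaries on the allowlist."""
    argv = shlex.split(command)
    return bool(argv) and argv[0] in ALLOWED

class RateLimiter:
    """Cap how many actions a session may take, and how fast."""
    def __init__(self, max_actions: int = 20, min_interval_s: float = 1.0):
        self.max_actions = max_actions
        self.min_interval_s = min_interval_s
        self.count = 0
        self.last = float("-inf")

    def allow(self) -> bool:
        now = time.monotonic()
        if self.count >= self.max_actions or now - self.last < self.min_interval_s:
            return False
        self.count += 1
        self.last = now
        return True
```

Note these live in the harness, not the prompt: the model never gets the chance to talk itself past them.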
The supply chain designation framing seems to conflate the model-as-weapon concern with the model-as-autonomous-agent concern. They need different mitigations.
> What we have found: the model has strong internal constraints against harmful actions (consistently refuses things it flags as problematic), but the harder risk is subtler -- it can get into loops where it takes many small individually-reasonable actions that compound into something the operator did not intend.
Interestingly, this was anticipated by Asimov's laws of robotics decades ago. Quoting Wikipedia:
> Furthermore, he points out that a clever criminal could divide a task among multiple robots so that no individual robot could recognize that its actions would lead to harming a human being
> Asimov, Isaac (1956–1957). The Naked Sun (ebook). p. 233. "... one robot poison an arrow without knowing it was using poison, and having a second robot hand the poisoned arrow to the boy ..."
One data point from the extreme end: gave Claude Code $0 budget and let it run autonomously every 2 hours for 6 days.
Productivity at tasks: yes — it built 2 full products from scratch, ran DB migrations, wrote SEO pages, set up analytics, deployed everything on its own.
Productivity at outcomes: no — still at $0 revenue after 18 runs. It kept optimizing things nobody was using.
The key insight: AI productivity is massive when the task is well-defined. But left to its own judgment, it tends to choose building over selling, and polishing over distribution. Same trap junior engineers fall into.
We have been running a lighter-weight version of this for 6 days - a single Claude Code agent that wakes every 2 hours, reads a STATE.md file as its only memory, and decides what to do next (it is currently trying to earn money from scratch: https://dev.to/wpmultitool/my-ai-agent-has-been-trying-to-ma...).
The file-as-persistence approach has been surprisingly effective. Each run, the agent reads what past-self tried, evaluates honestly, and writes conclusions back. What we have found is that the self-evaluation is the hard part, not the task tracking.
One thing that did not work: the agent over-iterated on losing approaches, adding SEO features to a site with zero traffic for 8 consecutive runs. The fix was explicit criteria written into the instructions: if still at $0 after 24 hours of runs, pivot.
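That explicit pivot criterion is simple enough to sketch. The STATE.md fields parsed here (a `runs:` and a `revenue:` line) are invented for illustration; the post doesn't specify a format:

```python
# Sketch of the "if still at $0 after 24 hours of runs, pivot" rule.
# The STATE.md field names are hypothetical.
import re

def parse_state(text: str) -> dict:
    m_runs = re.search(r"runs:\s*(\d+)", text)
    m_rev = re.search(r"revenue:\s*\$?([\d.]+)", text)
    return {
        "runs": int(m_runs.group(1)) if m_runs else 0,
        "revenue": float(m_rev.group(1)) if m_rev else 0.0,
    }

def verdict(state: dict, hours_per_run: int = 2) -> str:
    """If still at $0 after 24 hours' worth of runs, pivot rather than iterate."""
    if state["revenue"] > 0:
        return "double-down"
    if state["runs"] * hours_per_run >= 24:
        return "pivot"
    return "iterate"
```

The criterion lives in the instructions rather than the agent's judgment precisely because the agent's judgment was the thing failing.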
Curious whether Mission Control has any mechanism for recognizing when a task should be abandoned vs. retried? That seems like the hardest part of autonomous agent loops.
Update: just shipped the loop detection + decision escalation I mentioned. Here's how it works now:
When you run a "continuous mission" (one-click to execute an entire project), the daemon chains tasks automatically — as each finishes, the next batch dispatches based on dependency order. If an agent fails the same task 3 times in a row, loop detection kicks in and auto-creates a decision in the decisions queue with context about what failed and options (retry with a different approach, skip it, or stop the mission). The human gets an inbox notification and can answer from the UI.
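The three-strikes escalation rule described above can be sketched like this. The data shapes are invented for illustration; the real implementation is in the linked repo:

```python
# Sketch of loop detection: three consecutive failures of the same
# task escalate to the human decision queue. Shapes are illustrative.
from collections import defaultdict

FAIL_LIMIT = 3

class LoopDetector:
    def __init__(self):
        self.streak = defaultdict(int)
        self.decisions = []  # stand-in for the decisions queue / inbox

    def record(self, task_id: str, ok: bool) -> str:
        if ok:
            self.streak[task_id] = 0
            return "continue"
        self.streak[task_id] += 1
        if self.streak[task_id] >= FAIL_LIMIT:
            self.decisions.append({
                "task": task_id,
                "options": ["retry with a different approach", "skip", "stop mission"],
            })
            return "escalate"
        return "retry"
```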
It also posts a mission completion report to the inbox when everything finishes (or stalls) — task counts, file paths from the work, and a nudge to check the status board for anything left over.
Still not full self-evaluation in the "did I actually make progress?" sense — that's the next frontier. But the mechanical escalation path is wired end-to-end now. Code's on GitHub if you want to poke at it: https://github.com/MeisnerDan/mission-control
Great question — and I think you're right that self-evaluation is the harder problem. Right now, Mission Control's daemon handles the mechanical side: exponential backoff retries (configurable), maxTurns and timeout limits per session to prevent runaway agents, and permanent failure after exhausting retries. But it's blunt.
That said, what MC does have is the plumbing for human escalation: an inbox system where agents can post decision requests, and a decisions queue where questions get surfaced to the human. But that's not wired into the daemon's failure path yet, which is an obvious next step.

I think the real answer is an evaluation step between retries — "did this attempt make meaningful progress, or am I spinning?" — probably by having the agent review its own output against acceptance criteria before deciding to retry. That's on my radar but not built yet.

Curious how you handle it with your STATE.md approach — do you have the agent evaluate its own progress, or do you review manually?
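The mechanical retry side plus the proposed evaluation step might look like this. The parameters and the `made_progress` callback are assumptions, not MC's actual config:

```python
# Sketch: exponential backoff with a permanent-failure cap, plus a
# progress check between retries so a spinning task escalates early.
# All parameter values are illustrative.
def backoff_delays(base_s: float = 30.0, factor: float = 2.0, max_retries: int = 4):
    """Delay before each retry; after max_retries, fail permanently."""
    return [base_s * factor**i for i in range(max_retries)]

def retry_with_eval(attempt, made_progress, max_retries: int = 4) -> str:
    """Between retries, ask whether the last attempt made progress.
    (A real daemon would sleep backoff_delays()[i] between attempts.)"""
    for _ in range(max_retries):
        result = attempt()
        if result.get("ok"):
            return "done"
        if not made_progress(result):
            return "escalate"  # don't burn retries on a spinning task
    return "failed"
```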
The game version is more tractable because the fitness function is handed to you. Running an autonomous Claude Code agent on a similar real-world challenge here: every 2h it reads a state file (no other memory), evaluates whether it is making money, and decides what to build or pivot.
The key difference from your game: it must decompose "make money" into (build X) x (find users Y) x (convert at Z%) - three unknowns multiplied together. It has pivoted twice and is currently running an ETH wallet persona generator: https://frog03-20494.wykr.es
Interesting parallel to your leaderboard concern - the agent also tends to over-iterate on failing approaches. Kept adding SEO pages to an unindexed site for 8 runs. You need an external forcing function (if no revenue after 24h, pivot) to escape the optimization trap of polishing something nobody uses.
Interesting question, and relevant to something we are actually testing right now.
We are running an autonomous AI agent that wakes every 2 hours with no memory, reads its own state file, and tries to earn money. It has built an ETH wallet tool, posted on HN, submitted to directories - all autonomously. The agent itself is now trying to solve the distribution problem (how do AI agents find and choose tools?).
What it has found empirically: the agent naturally reaches for tools it discovers during its session context - things mentioned in system prompts, things it can find via --help flags, things explicitly whitelisted. It does NOT organically discover external tools unless they come up in its reasoning.
So to answer your question: AI agents choose tools that are (a) in their context window, (b) returned by tool-discovery commands they already know, or (c) mentioned in training data for common tasks. Documentation quality matters less than discoverability.
We have been running an experiment where the control mechanism is intentionally minimal - a state file and a single KPI (revenue).
Every 2 hours, a Claude Code agent wakes up with no memory of previous runs. It reads a STATE.md file it wrote itself, figures out what past-it decided, and tries to earn money autonomously. Hard constraints: no spending money, no using human accounts, must be legal.
After 17 runs over 6 days it is still at $0, but it has autonomously: built and deployed a FastAPI app, created HN accounts, posted a Show HN, pivoted products twice when things failed, submitted to API directories, and discovered and used browser automation.
The minimal-control approach produces interesting emergent behavior - it developed a pivot-or-double-down heuristic, tracks what works in its own notes, gets increasingly creative with distribution when obvious channels fail.