Hacker News | unsaved159's comments

Quite the opposite: review is the bottleneck. LLMs generate more code, which means more cognitive load to comprehend and review it all. Review isn't optional when it's a production system that other people use.

After seeing how astonishingly poor LLMs are at decision-making while writing code (even the best ones, like Claude Opus or GPT 5.4), I naturally stop trusting them enough in other areas of life to "just have a conversation with them and get all the answers".

It's all fun and games while the stakes are non-existent, but if the question really matters, would you trust an LLM so fully that you stop thinking for yourself?


I am curious: what are you using multiple agents for? In my experience, without supervision even the most advanced models degrade quickly.

Day to day I run a dev pod (implementor + QA + frontend design), a review pod doing adversarial review with one Claude and one Codex, and an orchestrator pair. The best flex I can offer to illustrate real work getting done: the longest single rig I've kept running continuously was about 4 days. That means a large implementation spec executed with a test-driven-dev approach from obra superpowers, plus independent deep contextual code reviews at milestones (my own skill pack), plus automated Vercel agent-browser testing along the way. So currently it's a closed SDLC loop that is only limited by the amount of work I give it. The "babysitting agents" part moves me up a layer, to watching for spec drift and handling weird edge cases that come up. So it's not set-and-forget, but you can definitely have it work on something real overnight and get that "my agents shipped code while I slept" kind of outcome. I watch a demo video in the morning to see what they built, then do my own code-review spot checks of the PRs.

The original motivation for making OpenRig is that this pattern works well. I've been doing it for months now, and I'm sure many people have gotten something similar to work, but the topology is fragile: sessions die, your laptop needs a reboot, and you lose the setup that took weeks to perfect. OpenRig makes the topology itself a first-class thing, like docker-compose but for the topology of Claude Code / Codex instances on your machine, along with all the specific context and configs you fine-tuned.

Regarding supervision, that is the key question for sure. I can't really babysit more than 4-5 agents without feeling like I've lost the plot a bit. So the demo pod in the onboarding includes an example of a pattern I use where two orchestrators run as a "high availability" pair, so I really only interact with one agent for the workstream: the orch-lead. The peer is there to monitor and absorb the lead's mental model in realtime, and can take over the rig if the lead hits its context limit or something else goes wrong.


What use cases did you find this approach works for, and which doesn't it? Any observations on which topologies work better?

I tried doing the same for maintaining OSS projects. So far, the best I could manage is to get the agents to autonomously do ~80% of the work. But then I have to manually review each potential PR, and in almost every case work further with an agent, providing live guidance to fix it. This takes about as much time as working without the swarm. So far I've found the swarm is mostly useful for the initial scouting: mapping out what work needs to be done in the first place and storing it in a nice JSON file.

From my observations, all it takes is one mistake from an agent; from there, the architecture just snowballs into chaos as future work builds on top of the incorrect initial approach.


Yeah, I can definitely relate to the snowballing. I am mostly building web apps (Python/TypeScript), so YMMV. Have you tried pairing Codex with Claude? It's the gateway drug for agent topologies and definitely worth trying. Claude is better at understanding your intent, but at the expense of making lots of mistakes; Codex makes fewer mistakes, but at the expense of over-engineering. Together they are not perfect, but significantly more accurate; they complement each other well. So Codex reviews Claude, and using TDD is even better, because Codex will gate each change Claude makes. You can apply this pattern to implementation, reviews, PM, even research.

OpenRig has a spec called implementation-pair which lets you try this pretty easily. There is another one called adversarial-review, which is the same topology with different starter context / instructions that make the agents less constructive and more combative. You'll get a feel for which one you need for a task pretty quickly. Lots of people have made this pattern into skills, but I think OpenRig is probably the easiest happy path to try it, because the two agents can literally type into each other's terminals using "rig send" and "rig capture", and see each other's screens using tmux, as if you were the one typing the commands. Now you just sit back and watch them find and fix bugs. You don't need OpenRig to do this, just tmux, but raw tmux is a little fiddly to get working, which is why I made the rig send command as a tmux wrapper.
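For anyone curious what the raw-tmux version looks like: here is a minimal sketch of a `rig send` / `rig capture`-style wrapper. The function names, the `rig:0.1` pane target, and the structure are my own illustration of the general idea, not OpenRig's actual implementation; only the underlying `tmux send-keys` and `tmux capture-pane` commands are real.

```python
import subprocess

def build_send_cmd(pane: str, text: str) -> list[str]:
    """Build the tmux argv that types `text` into another agent's pane.

    `pane` is a tmux target like "rig:0.1" (session:window.pane).
    """
    return ["tmux", "send-keys", "-t", pane, text, "Enter"]

def build_capture_cmd(pane: str) -> list[str]:
    """Build the tmux argv that reads the visible contents of a pane."""
    return ["tmux", "capture-pane", "-t", pane, "-p"]

def rig_send(pane: str, text: str) -> None:
    """Actually type into the peer's terminal (needs a running tmux server)."""
    subprocess.run(build_send_cmd(pane, text), check=True)

def rig_capture(pane: str) -> str:
    """Read what the peer agent currently sees on its screen."""
    return subprocess.run(
        build_capture_cmd(pane), check=True, capture_output=True, text=True
    ).stdout

# Example: a reviewer pane asking the implementor pane to rerun the tests.
cmd = build_send_cmd("rig:0.1", "pytest -x")
```

The fiddly parts a wrapper hides are exactly the quoting of `text` and remembering the `session:window.pane` target syntax.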

Nice idea! Though it's hard to spare attention for reading the text while jumping over the obstacles :) Maybe something where the text is naturally part of the gameplay?

I agree with this. I can't read the text while I'm playing the game, and I can't play the game while I'm reading the text.

Nice idea for a diary app, but "can't edit yesterday" is off-putting for me. Such a constraint should not be something the software imposes on you; it should be a person's own mental policy, if they so wish. I want full control over my data, without arbitrary restrictions. Another thing is easy deployment. I would love to give it a shot, but I need something like this to be available on both mobile and desktop, and that would mean server deployment with all the headache of managing a server and backups...

I am currently using daily notes with Obsidian + the Calendar plugin. It is also E2EE, available on all devices, syncs without problems, and stores plain old files, so I am not afraid of vendor lock-in and can back up any way I want.


Text is an inherently limited medium for communication. It doesn't make much difference if some words were cleaned up. Overthinking the hidden meanings of a text message is a good way to misinterpret people. Just ask to meet face-to-face when in doubt.


Not really an AI problem, more like garbage coworkers.


Literally never in my life have I received anything like what that website suggests via email or DMs. Curating your social circle is the answer.


Oh how I wish I could curate my coworkers...


I got one at my office, who at some point decided to use ChatGPT to write Asana tickets that clearly weren't vetted.


They get a magic wand to turn their words into software, and still complain the wand is not their favourite colour.


It's not clear to me why this is needed. You can just write a markdown spec without any side projects, then tell an agent to code it.


Why not Context7?


Context7 is great, but ultimately it's just a pre-generated static summarization that might not include the specific answers the agent needs. I take a slightly different approach where the actual source code is scanned for each question, so it's much more targeted and never out of date.
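To make "scan the actual source per question" concrete, here is a toy sketch of the general shape of such retrieval: walk the repo, match each file's lines against the query, and return hits with a little surrounding context. The function name `scan_source` and the demo file are my own illustration, not the parent's implementation.

```python
import os
import tempfile

def scan_source(root: str, query: str, context: int = 2) -> list[str]:
    """Find lines containing `query` in any file under `root`,
    returning each hit with `context` lines on either side."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    lines = fh.read().splitlines()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for i, line in enumerate(lines):
                if query in line:
                    snippet = lines[max(0, i - context): i + context + 1]
                    hits.append(f"{path}:{i + 1}\n" + "\n".join(snippet))
    return hits

# Self-contained demo on a throwaway repo with one file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "api.py"), "w") as f:
        f.write("import os\n\ndef connect(url):\n    return os.getenv('TOKEN')\n")
    results = scan_source(tmp, "connect")
```

Because it rereads the files on every query, the answers can never drift out of date the way a cached summary can; the trade-off is scan latency on large repos.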

