This paper evaluates three control strategies for untrusted agents: deferral to trusted models, resampling, and critical action deferral. Initial testing showed resampling and critical action deferral achieving 96% safety. However, adversarial testing revealed resampling crashes to 17% safety when attackers can detect resampling or simulate monitors, while critical action deferral remained robust against all attack strategies.
HaluMem introduces the first benchmark for evaluating hallucinations in agent memory systems at the operation level. Through three evaluation tasks (memory extraction, updating, and question answering), it reveals that existing memory systems generate and accumulate hallucinations during the early stages (extraction and updating), which then propagate as errors downstream. The benchmark uses two datasets spanning different context scales to systematically reveal these failure modes.
OpenHands SDK provides a complete architectural redesign for building production software development agents. It balances simplicity (few lines of code for basic agents) with extensibility (custom tools, memory management) while delivering seamless local-to-remote execution, integrated security, and connections to various interfaces (VS Code, command line, APIs).
TL;DR
Problem: "Tool overload" is a critical bottleneck for AI agents. Providing an LLM with a large, static list of tools bloats the context window, degrading performance, increasing costs, and reducing accuracy.
Solution: Implement a "select, then execute" architectural pattern. Use a lightweight "router" agent to first retrieve a small, relevant subset of tools for a specific task. Then, a more capable "specialist" agent uses that curated set to execute the request.
Benefits: Lower latency and cost (fewer tokens), higher tool-selection precision, a scalable architecture for large tool catalogs, and improved reliability.
Pattern: This pattern is a form of Retrieval-Augmented Generation (RAG) applied to tools, often called Retrieval-Augmented Tool Selection (RATS). It can be combined with State-Based Gating for even greater precision.
How: This post provides a complete, production-aware implementation using Google's Agent Development Kit (ADK).
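The "select, then execute" pattern above can be sketched in a few lines of framework-agnostic Python. This is a minimal illustration, not the ADK API: the tool names and descriptions are made up, and the router uses a toy bag-of-words cosine similarity in place of the embedding model and vector store a production system would use.

```python
# Minimal sketch of Retrieval-Augmented Tool Selection (RATS).
# Tool catalog, scoring scheme, and function names are illustrative
# assumptions; a real system would use embeddings and the ADK tool API.
from collections import Counter
import math

# A large static catalog would bloat the specialist's context window;
# the router narrows it to a small relevant subset first.
TOOLS = {
    "get_weather": "fetch the current weather forecast for a city",
    "send_email": "compose and send an email to a recipient",
    "create_invoice": "generate a billing invoice for a customer",
    "search_flights": "search for airline flights between two airports",
}

def _vec(text):
    # Toy stand-in for an embedding: token counts.
    return Counter(text.lower().split())

def _cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def route_tools(query, k=2):
    """Router step: score every tool description against the query
    and return only the top-k most relevant tool names."""
    q = _vec(query)
    ranked = sorted(TOOLS, key=lambda n: _cosine(q, _vec(TOOLS[n])),
                    reverse=True)
    return ranked[:k]

# The specialist agent would then be given only this curated subset,
# keeping its prompt small and its tool-selection precision high.
subset = route_tools("what's the weather forecast in Berlin?")
print(subset[0])  # the weather tool ranks first for this query
```

The key design choice is that retrieval happens *before* the capable model sees the request: the router is cheap and fast, and the specialist only ever reasons over `k` tools instead of the full catalog.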
LLMs are token-completion engines. Any correspondence between their output and the truth, or authoritative sources, is a function of having been trained on text like that. The additional wrinkle is that generalization from training (a desired property; otherwise it's just a memorization engine) produces text that is only plausibly truthful: it merely resembles the training data.
Getting beyond this is a tricky dark art. There isn't any simple fix; there's nowhere to put an if statement.
LLMs don't have a concept of sources for their statements.
Ask one to give you some literature recommendations on something it has just explained to you. You'll get plenty of plausible-sounding papers that don't exist.
Humans know, to some extent, why they know something (they read it in a textbook, a colleague mentioned it). LLMs don't seem to.
Ask a human to provide accurate citations for any random thing they know and they won't be able to do a good job either. They'd probably have to search to find it, even if they know they got it from a document originally and have some clear memory of what it said.
LLMs can remember their sources. It's just additional knowledge, there's nothing special about it.
When you ask an LLM to tell you the height of Mount Everest, it clearly has a map of mountains to heights, in some format. Using exactly the same mapping structure, it can remember a source document for the height.
Humans did research and remembered sources before the Internet was a thing.
But also, can you give an example where an LLM with access to the Internet can find a primary source?
I don't think learning to refer to sources is something inherently impossible for LLMs, but it is very different to the kind of implicit knowledge they seem to excel at.
Then they just paste in the first link, or follow some other programmed heuristic; they aren't like a human who puts in effort to find something relevant. An LLM with internet access isn't smarter than just asking Google Search.
Yes, humans won't lie to you about it; they will research and come up with sources. Current LLMs don't do that when asked for sources (unless they invoke a tool); they come back to you with hallucinated links that look like links they were trained on.
Unfortunately it's not an uncommon experience when reading academic papers in some fields to find citations that, when checked, don't actually support the cited claim or sometimes don't even contain it. The papers will exist but beyond that they might as well be "hallucinations".
Humans can speak bullshit when they don't want to put in the effort; these LLMs always do it. That is the difference. We need to create the part that humans do when they put in the deliberate work to properly track down sources; that kind of thinking isn't captured in the text, so LLMs don't learn it.