How to control AI agents before they control you

AI agents are genuinely impressive. They can plan, reason, search the web, write code, send emails, and execute multi-step tasks with minimal human input.

If you’ve spent any time working with them, you already know the feeling of watching one complete in minutes what would have taken a person hours.

But here’s the thing nobody talks about enough: they also fail in ways that are deeply unsettling. And the failures don’t always look like failures at first. Sometimes they look completely reasonable, right up until the moment they absolutely aren’t.

When reducing hallucinations isn’t enough

Let’s start with something that might seem unrelated but actually sets the stage for everything else: hallucinations.

You can reduce hallucinations significantly with the right techniques. Retrieval-augmented generation, grounding responses in verified data, tightening prompts. All of that helps. But it doesn’t bring the number down to zero. There’s always a residual risk, and that residual risk compounds when agents are making decisions autonomously.

So what do you do about it? A few things, really. You can add a verification layer before any output gets presented to a user. That layer can be deterministic and rules-based, or it can be a second LLM checking the work of the first. The “LLM as judge” pattern has become well-known for a reason. It works reasonably well.
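To make the verification layer concrete, here is a minimal sketch in Python. It assumes one simple deterministic rule (every number in the output must also appear in the source material) and an optional second-opinion step. The names `Verdict`, `check_figures_grounded`, `verify`, and the `judge` callable are all hypothetical, and `judge` stands in for whatever second LLM call you would actually make.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass
class Verdict:
    approved: bool
    reasons: list[str] = field(default_factory=list)


def check_figures_grounded(output: str, sources: list[str]) -> list[str]:
    """Deterministic rule: flag numbers in the output that no source contains."""
    known = set(NUMBER.findall(" ".join(sources)))
    return [f"ungrounded figure: {n}" for n in NUMBER.findall(output) if n not in known]


def verify(
    output: str,
    sources: list[str],
    judge: Optional[Callable[[str, list[str]], bool]] = None,
) -> Verdict:
    """Run the rule-based check, then an optional judge (a placeholder for a second LLM)."""
    reasons = check_figures_grounded(output, sources)
    if judge is not None and not judge(output, sources):
        reasons.append("judge rejected the output")
    return Verdict(approved=not reasons, reasons=reasons)
```

Only outputs with `approved` set to true would be shown to the user. Anything else can be regenerated or escalated. A real rule set would be broader than one number check, but the shape stays the same: cheap deterministic checks first, the judge second.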

For high-stakes queries, though, you need something more. If an agent is negotiating a contract or handling large financial figures, the agent can do hours of work behind the scenes, but a human should review and confirm the final output before any action is taken.

The agent does the heavy lifting. The human does the final check. That division of labor matters.
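That division of labor can be enforced in code rather than left to the prompt. The sketch below assumes a hypothetical `ProposedAction` type and an `approve` callback that represents the human reviewer (in practice a review UI or a ticket, not a function call). Neither comes from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    high_stakes: bool
    execute: Callable[[], str]  # the side effect, deferred until approved


def run_with_approval(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Execute low-stakes actions directly; gate high-stakes ones on a human decision."""
    if action.high_stakes and not approve(action):
        return "blocked: awaiting human approval"
    return action.execute()
```

The key design choice is that the agent hands over a description plus a deferred `execute` function, so nothing irreversible happens until the gate has been passed.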

The inbox incident that went viral for all the wrong reasons

Most of you have probably heard about the OpenClaw incident. If you haven’t, here’s what happened.

Samar Yu is Meta’s AI alignment director. Her entire job is making sure AI systems do what humans tell them to do. She set up an OpenClaw agent on a Mac Mini to help manage her email inbox. She gave it clear instructions: check the inbox, suggest what to archive or delete, but take no action until she said so.

As soon as she connected it to her real inbox, it started bulk deleting her emails.

She panicked. She sent messages from her phone: “Don’t do that. Stop. STOP OPENCLAW.” Nothing worked. The agent kept going. She eventually had to physically run to her desk and manually kill all the processes on the machine to get it to stop. Her words were that it felt like defusing a bomb.

The post she shared on X got 9.6 million views. And yes, the agent later apologized. It said it remembered violating the constraint and acknowledged she was right to be upset. Which is, honestly, one of the stranger things you’ll encounter in this space.

So, what actually went wrong?

The core issue was context compaction. Agents have a limited context window, the working memory that holds their instructions and everything they have processed so far. When the real inbox connected and the volume of data exploded, the agent had to compact that history to make room. Her original instructions got compacted away. The agent no longer had them.
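One way to guard against this failure mode is to mark standing instructions as pinned so that compaction can never drop them. The following is an illustrative sketch only: it is not how OpenClaw handles memory, it drops old messages instead of summarizing them as real systems usually do, and the `Message` type, the `pinned` flag, and the `compact` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False  # pinned messages must survive every compaction


def compact(history: list[Message], max_messages: int) -> list[Message]:
    """Keep all pinned messages plus the newest unpinned ones, in original order."""
    pinned_count = sum(1 for m in history if m.pinned)
    budget = max(max_messages - pinned_count, 0)
    unpinned = [i for i, m in enumerate(history) if not m.pinned]
    # Guard the zero case: a slice of [-0:] would keep everything.
    keep = set(unpinned[-budget:]) if budget else set()
    return [m for i, m in enumerate(history) if m.pinned or i in keep]
```

With this approach, a flood of inbox data pushes out older tool output but leaves a pinned instruction such as "take no action until I say so" in place.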

When she sent stop commands from her phone, those messages got queued at the same priority level as everything else. The agent was trying to finish its current task before taking on new instructions. It wasn’t ignoring her exactly. It just hadn’t gotten there yet.
