The problem with AI explaining AI

The promise of AI systems that can analyze and explain other AI systems has captivated researchers for years.

As language models grow larger and more complex, the dream of automating the painstaking work of understanding how they function becomes increasingly appealing. 

But new research from a team spanning MIT, Technion, and Northeastern University suggests we might be getting ahead of ourselves.

The paper, “Pitfalls in Evaluating Interpretability Agents,” takes a hard look at how we evaluate AI systems designed to perform mechanistic interpretability.

These are the tools researchers use to peek inside neural networks and understand which components are responsible for specific behaviors. 

Think of it as reverse-engineering the brain of an AI model to figure out how it arrives at its answers.


The allure of automated analysis

The researchers built a sophisticated system powered by Claude Opus 4.1 that mimics how a human researcher would analyze AI components. Unlike a simple preset program, this agent acts more like a graduate student, iteratively learning about the model.

Key capabilities:

  • Formulates hypotheses about model behavior
  • Designs and runs tests to probe specific components
  • Analyzes results and refines understanding
  • Clusters components by shared functionality
  • Produces explanations that appear to match human research
💡
When tested on six well-known circuit analysis tasks, the agent appeared competitive with human experts, identifying which attention heads were responsible for tasks like tracking objects in sentences or comparing numbers.

The memorization trap

One of the most striking discoveries was that Claude Opus 4.1 had essentially memorized some of the research it was supposed to be replicating independently. 

When prompted directly, the model could recite detailed information about the “Indirect Object Identification” circuit, including specific layer numbers and component functions from published papers.

This creates a fundamental problem. If your evaluation system has already seen the answers, how can you tell if it’s genuinely reasoning through the problem or just recalling what it knows? 

The researchers found that even when they didn’t explicitly mention which task they were analyzing, Claude could often infer the answer from contextual clues and produce explanations that looked like genuine analysis but were actually sophisticated pattern matching.

When ground truth isn’t so solid

Human expert explanations, often treated as the gold standard, aren’t always reliable. In some cases, the AI agent actually contradicted published findings, but further analysis showed the AI was correct.

Key insights:

  • Some components labeled as “previous-token heads” attended to the previous token only 42% of the time
  • Groups labeled “value fetcher heads” included components that didn’t behave consistently across hundreds of tests
  • AI explanations sometimes corrected human labels, showing that expert analyses can be incomplete or misleading
  • This raises the question: if evaluations rely on human labels that are imperfect or subjective, what are we really measuring?

💡 Takeaway:
Human-defined “ground truth” is not always reliable, so evaluating AI interpretability against it can produce misleading results.
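To make the “42% of the time” finding concrete, here is a minimal sketch of the kind of behavioral check involved (an illustration, not the paper’s actual methodology): for each query position, test whether a head’s strongest attention weight lands on the immediately preceding token, then report the hit rate. The 5×5 attention matrix below is a hypothetical pattern, not data from any real model.

```python
import numpy as np

def previous_token_rate(attention):
    """Fraction of query positions (excluding position 0, which has
    no previous token) whose strongest attention weight falls on the
    immediately preceding token.

    attention: (seq_len, seq_len) row-stochastic matrix, where rows
    are query positions and columns are key positions.
    """
    seq_len = attention.shape[0]
    hits = sum(
        int(np.argmax(attention[q]) == q - 1)
        for q in range(1, seq_len)
    )
    return hits / (seq_len - 1)

# Hypothetical causal attention pattern: queries 1-3 attend mostly
# to the previous token; query 4 attends mostly to position 0.
att = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0, 0.0],
    [0.1, 0.1, 0.7, 0.1, 0.0],
    [0.6, 0.1, 0.1, 0.1, 0.1],
])

rate = previous_token_rate(att)  # 3 of 4 queries hit -> 0.75
```

A head whose rate hovers near 42% on real inputs, as the paper reports for some labeled components, is behaving quite differently from what its “previous-token head” label implies.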


The limits of outcome-based evaluation

The current approach to evaluating these systems focuses almost entirely on whether they reach the same conclusions as human researchers. 

But this misses something crucial: the scientific process itself. 

Two researchers might arrive at the same conclusion through completely different investigative paths. 

One might run dozens of carefully designed experiments, while another might make an educated guess based on prior knowledge.

💡
The researchers found that their agent did engage in sophisticated experimental design, creating novel test cases to validate hypotheses. But the evaluation framework provided no way to reward this behavior.

A system that genuinely investigates and one that cleverly guesses receive the same score if they reach the same conclusion.


A new approach: Functional interchangeability

To address these limitations, the researchers propose a novel evaluation method based on functional interchangeability. 

The idea is simple: if two components truly share the same function, swapping their weights should leave the model’s behavior largely unchanged.

By measuring how much the model’s outputs change when components are swapped, they created an unsupervised metric that doesn’t rely on human labels.

When they tested this approach, they found it generally aligned with expert-defined clusters while avoiding the pitfalls of memorization and subjective ground truth.
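The swap test can be sketched in a few lines. The toy two-head model below is an illustrative assumption, not the paper’s setup: each “head” is a linear map, their outputs are concatenated and projected, and we measure the KL divergence between the model’s output distribution before and after swapping the two heads’ weights. Identical heads are trivially interchangeable (zero divergence); functionally different heads are not.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "model": two head weight matrices whose outputs are
# concatenated, projected, and softmaxed. Shapes are arbitrary.
d, k = 4, 3
W_out = rng.normal(size=(2 * k, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def model(x, W_head1, W_head2):
    h = np.concatenate([W_head1 @ x, W_head2 @ x])
    return softmax(h @ W_out)

def swap_divergence(x, Wa, Wb):
    """KL divergence between the model's output with heads (Wa, Wb)
    and with the two heads' weights swapped. Near-zero divergence
    suggests the components are functionally interchangeable."""
    p = model(x, Wa, Wb)
    q = model(x, Wb, Wa)
    return float(np.sum(p * np.log(p / q)))

x = rng.normal(size=d)
W_same = rng.normal(size=(k, d))
W_diff = rng.normal(size=(k, d))

# Identical heads: swapping changes nothing, divergence is zero.
kl_same = swap_divergence(x, W_same, W_same.copy())
# Different heads: swapping shifts the output distribution.
kl_diff = swap_divergence(x, W_same, W_diff)
```

The appeal of this metric is that it needs no human labels at all: interchangeability is read directly off the model’s measured behavior.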

This metric isn’t perfect. It only addresses some of the evaluation challenges, and it’s limited to certain types of components.

But it represents an important step toward more robust evaluation methods that don’t depend entirely on human judgment.


What this means for AI interpretability

These findings arrive at a critical moment for AI safety and transparency. As models become more powerful and autonomous, understanding how they work becomes increasingly important.

But this research suggests that our tools for understanding AI systems, and especially our methods for evaluating those tools, need serious refinement.

The memorization problem is particularly concerning as we move toward using AI systems to analyze behaviors that haven’t been documented in published literature.

💡
If our evaluation methods can’t distinguish between genuine analysis and sophisticated recall, how can we trust these systems to help us understand novel AI behaviors?

The subjectivity of ground truth explanations also highlights a deeper challenge in interpretability research. Human understanding of these systems is itself limited and evolving. Building evaluation frameworks on this shifting foundation risks compounding errors and biases.


Looking ahead

This research serves as a crucial reality check. Before we hand over the complex task of understanding AI systems to other AI systems, we need to ensure our evaluation methods are up to the challenge.

The authors call for more principled benchmarks that can assess not just whether automated systems reach the right answers, but how they arrive at those answers.

They advocate for evaluation methods that are robust to memorization, sensitive to the reasoning process, and grounded in measurable model behavior rather than subjective human judgment.

As AI systems become more autonomous and take on increasingly open-ended scientific roles, getting evaluation right isn’t just an academic exercise. It’s essential for building interpretability tools we can actually trust.

This research reminds us that in the rush to automate everything, we shouldn’t forget to question our assumptions about what constitutes understanding in the first place.
