Coding agents have gotten remarkably good at fixing bugs. Success rates on the benchmark suites designed to measure this capability, like SWE-bench, keep climbing. But these aggregate metrics obscure something crucial: they do not tell us which specific skills are actually driving the gains.
When an agent successfully resolves a bug, that success comes from at least three distinct capabilities working together. The agent had to understand the repository structure well enough to find relevant files. It had to pinpoint the exact lines within those files that mattered. It had to diagnose what was wrong and write a correct fix. A binary resolved/unresolved label tells us these three things worked in concert, but not which ones are weak links. Maybe an agent fails 90% of the time because it explores repositories poorly, not because it can’t write patches. Or maybe the opposite is true. Current benchmarks can’t tell us.
This is the fundamental measurement problem that SWE-Explore sets out to solve. Rather than evaluate the entire pipeline, the benchmark isolates one critical phase: repository exploration. This seemingly small shift in focus reveals something important about how coding agents actually work and where the real bottlenecks lie.
Decomposing a complex skill
The insight underlying SWE-Explore is that a complex problem can be understood by breaking it into measurable parts. Current benchmarks treat coding task completion as a holistic prediction problem. An issue either gets resolved or it doesn’t. But this masks what’s actually happening underneath.
The abstraction is easy to see once the pipeline is laid out: three overlapping capabilities get compressed into a single number. You lose the ability to diagnose whether an agent’s failures stem from poor repository understanding, inaccurate line-level localization, or weak repair logic. This is like assessing a doctor by only checking recovery rates, without examining whether they correctly diagnosed the disease or ordered the right tests.
By isolating exploration as a standalone evaluation target, SWE-Explore makes it possible to measure something more granular: given a repository and an issue description, can the agent return a ranked list of relevant code regions efficiently? This single question opens up a much clearer picture of what modern coding agents are actually good at.
Defining exploration precisely
Exploration, in this framing, means the ranked list of code regions an agent thinks are worth examining before attempting any repair. It’s the pre-reading phase of problem-solving, the phase where a developer orients themselves to understand the landscape: which files are involved, which functions call which, and where error messages originate.
The benchmark defines this as a retrieval problem with specific constraints. An explorer gets a fixed line budget, like a developer with limited time to read code before diving into fixes. Within that budget, the explorer returns a ranked list of lines it considers relevant. The question is fundamentally empirical: which lines would someone actually need to read to understand and fix this bug?
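To make the budgeted-retrieval framing concrete, here is a minimal sketch of how a ranked list of lines could be scored against a set of relevant lines under a fixed line budget. The passage above does not specify SWE-Explore's actual metrics, so the two measures used here (recall within the budget and the rank of the first relevant line) are illustrative assumptions, as are the names `recall_within_budget` and `first_hit_rank`.

```python
from typing import List, Optional, Set, Tuple

# A line is identified by (file path, 1-based line number).
Line = Tuple[str, int]


def recall_within_budget(ranked: List[Line], relevant: Set[Line], budget: int) -> float:
    """Fraction of relevant lines found among the first `budget` ranked lines."""
    if not relevant:
        return 0.0
    seen = set(ranked[:budget])  # anything past the budget is never "read"
    return len(seen & relevant) / len(relevant)


def first_hit_rank(ranked: List[Line], relevant: Set[Line], budget: int) -> Optional[int]:
    """1-based rank of the first relevant line within the budget, or None."""
    for rank, line in enumerate(ranked[:budget], start=1):
        if line in relevant:
            return rank
    return None


# Hypothetical explorer output and ground truth for one issue.
ranked = [("a.py", 10), ("b.py", 3), ("a.py", 11), ("c.py", 7)]
relevant = {("a.py", 10), ("a.py", 11), ("d.py", 1)}

print(recall_within_budget(ranked, relevant, budget=2))
print(first_hit_rank(ranked, relevant, budget=2))
```

Raising the budget from 2 to 3 lets the explorer reach a second relevant line, which is the trade-off the budget is meant to expose: an explorer that ranks critical lines early scores well even when it is allowed to read very little.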
This differs from traditional code search in three ways: it operates at line granularity rather than file level; ranking matters (finding critical code early beats finding it eventually); and relevance is specific to the bug rather than generic. The framing reflects reality: developers don’t examine entire repositories uniformly. They prioritize based on what might matter.
Deriving ground truth from successful paths
The clever part is figuring out what correct exploration actually looks like without requiring humans to manually annotate every instance. Instead, the researchers extracted ground truth from agents that successfully solved issues. When an agent fixes a bug, it leaves a trail: which files did it open, which line ranges did it examine?
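As a rough illustration of turning such a trail into ground truth, the sketch below merges the line ranges an agent viewed into non-overlapping regions per file. This is an assumed simplification, not the researchers' documented procedure: the `View` record format and the `merge_views` helper are hypothetical, and a real pipeline would likely also filter out ranges that turned out to be irrelevant.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One viewing action from a trajectory: (file path, start line, end line), inclusive.
View = Tuple[str, int, int]


def merge_views(views: Iterable[View]) -> Dict[str, List[Tuple[int, int]]]:
    """Merge overlapping or adjacent viewed ranges into disjoint regions per file."""
    by_file: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for path, start, end in views:
        by_file[path].append((start, end))

    merged: Dict[str, List[Tuple[int, int]]] = {}
    for path, ranges in by_file.items():
        ranges.sort()
        regions = [ranges[0]]
        for start, end in ranges[1:]:
            last_start, last_end = regions[-1]
            if start <= last_end + 1:  # overlaps or touches the previous region
                regions[-1] = (last_start, max(last_end, end))
            else:
                regions.append((start, end))
        merged[path] = regions
    return merged


# Hypothetical trail left by one successful agent run.
trail = [
    ("a.py", 10, 20),
    ("a.py", 15, 30),
    ("b.py", 1, 5),
    ("a.py", 40, 45),
    ("a.py", 31, 33),
]
print(merge_views(trail))
```

Merging first keeps a region the agent revisited several times from being counted more than once when the viewed lines are later compared against an explorer's ranked output.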


