Are you still manually fighting with LaTeX and TikZ to create publication-quality figures?

Scientists spend enormous time hand-crafting publication-quality figures, yet existing automated systems typically handle only one figure type at a time and produce static images that cannot be tweaked. The assumption underlying this limitation is straightforward: throw more data and model capacity at the problem, and eventually one system will master them all.

This assumption is wrong, not because we lack compute or data, but because it misunderstands what a figure actually is. A scientific figure is not a monolithic prediction task. It is a structured composition of discrete semantic components, where errors occur locally. A bar chart fails because its y-axis label is misplaced, not because the entire visualization is fundamentally flawed. A phylogenetic tree fails when a branch angle is off by five degrees. A molecule diagram fails when a bond is the wrong color. These are not problems that a bigger backbone model solves through scale alone. They are problems that demand intelligent coordination among specialists.

This is the core insight behind Crafter, a multi-agent harness for scientific figure generation that achieves something existing systems cannot: it generalizes across completely different figure types and input conditions without architectural changes. Rather than training one model harder, the system deploys multiple specialized agents that debate and refine specific components until they converge on a good figure.

When monolithic models meet diverse problems

Researchers need to generate bar charts from captions, phylogenetic trees from sketch inputs, molecule diagrams from reference images, and dozens of other figure types under widely varying input conditions. Existing systems each carve out a narrow slice of this problem space. SciFig targets bar charts from text. AutoFigure-Edit handles figure editing but requires raster inputs. Pixels-Paths works with multi-agent frameworks but for different structured outputs.

Each system optimizes for one task type and one input modality.

When you task a single model with solving all of these problems simultaneously, it learns to average. It produces mediocre compromises that work reasonably well across all cases but excel at none. This is an architectural problem: the system is being asked to compress entirely different reasoning patterns into a single bottleneck.

The real issue surfaces when you examine failure modes. They are almost never global catastrophes. A generated figure usually gets most things right. Instead, failures cluster in specific locations: a misplaced element, a wrong styling choice, a label in the wrong position. These are localized problems that benefit from localized solutions, not wholesale regeneration.

Rethinking generation as coordinated problem-solving

Crafter reframes figure generation as a multi-agent conversation rather than a single neural network’s dream. The architecture consists of four specialized roles that iterate until convergence.

The intent reasoner begins the process. It does not generate a figure. Instead, it reads whatever input the user provides, whether caption, sketch, reference image, or a combination, and produces a semantic representation of what success looks like. This representation becomes the common currency that all downstream agents use to evaluate proposals and exchange feedback. By decoupling intent interpretation from rendering, the system can handle any input modality without retraining.
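To make the idea of a shared semantic representation concrete, here is a minimal sketch in Python. It is not Crafter's actual schema: the names `Component`, `Intent`, `build_intent`, and `unmet_components` are hypothetical, and the intent reasoner is replaced by a stand-in that takes the required components directly rather than inferring them with a model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """One semantic element the figure must contain (hypothetical schema)."""
    role: str
    properties: tuple = ()  # sorted (key, value) pairs


@dataclass
class Intent:
    """Modality-independent description of what a successful figure contains."""
    figure_type: str
    components: list = field(default_factory=list)
    modalities: tuple = ()  # which inputs the user actually supplied


def build_intent(figure_type, required, **inputs):
    """Stand-in for the intent reasoner; a real one would infer `required`."""
    present = tuple(sorted(name for name, value in inputs.items()
                           if value is not None))
    components = [Component(role, tuple(sorted(props.items())))
                  for role, props in required.items()]
    return Intent(figure_type, components, present)


def unmet_components(intent, proposal):
    """Return the roles a proposal is missing or gets wrong."""
    unmet = []
    for comp in intent.components:
        actual = proposal.get(comp.role)
        if actual is None or any(actual.get(k) != v
                                 for k, v in comp.properties):
            unmet.append(comp.role)
    return unmet


intent = build_intent(
    "bar_chart",
    {"y_axis_label": {"text": "Accuracy (%)"}, "bars": {"count": 3}},
    caption="Accuracy of three models",
    sketch=None,
)
proposal = {"bars": {"count": 3}, "y_axis_label": {"text": "Acc"}}
print(intent.modalities)
print(unmet_components(intent, proposal))
```

Because downstream checks read only the `Intent`, nothing in `unmet_components` depends on whether the user supplied a caption, a sketch, or a reference image.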

Crafter coordinates four specialized agents: an intent reasoner interprets user intent into semantic structure; a plan generator proposes multiple candidate plans; an image generator renders each plan; a critic evaluates all options against the intent and feeds back to refine plans.

The plan generator does not produce one figure. It proposes K candidate plans, each representing a different approach to satisfying the intent. This matters because committing to the wrong approach early is expensive, but filtering bad approaches before rendering is cheap. By generating alternatives upfront, the system explores a broader space than greedy decoding ever would. Each plan is a structured specification of what elements should appear, where, and with what properties.
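The propose, render, critique, and refine cycle described above can be sketched as a small orchestration loop. This is an illustrative sketch under stated assumptions, not Crafter's implementation: `refine`, the `prefilter` hook, and the three toy agents are hypothetical names, plans are plain dictionaries, and the "renderer" simply returns the plan. Real agents would call models.

```python
def refine(intent, propose, render, critique, k=3, max_rounds=5,
           prefilter=lambda intent, plan: True):
    """Propose k plans per round, keep the critic's best, feed issues back."""
    best_score, best_plan = float("-inf"), None
    feedback = None
    for _ in range(max_rounds):
        # Cheap filtering happens before any rendering (assumed hook).
        plans = [p for p in propose(intent, k, feedback)
                 if prefilter(intent, p)]
        if not plans:
            continue
        results = [(plan, *critique(intent, render(plan))) for plan in plans]
        plan, score, issues = max(results, key=lambda r: r[1])
        if score > best_score:
            best_score, best_plan = score, plan
        if not issues:
            break  # converged: the critic found nothing left to fix
        feedback = (plan, issues)
    return best_score, best_plan


# Toy agents, for illustration only.
def toy_propose(intent, k, feedback):
    if feedback is None:
        # Round one: k different starting points covering different roles.
        roles = list(intent)
        return [{r: dict(intent[r]) for r in roles[i::k]} for i in range(k)]
    plan, issues = feedback
    candidates = []
    for j in range(1, k + 1):
        fixed = dict(plan)
        for role in issues[:j]:  # repair only the flagged components
            fixed[role] = dict(intent[role])
        candidates.append(fixed)
    return candidates


def toy_render(plan):
    return plan  # stand-in for an image


def toy_critique(intent, image):
    issues = [role for role in intent if image.get(role) != intent[role]]
    return len(intent) - len(issues), issues


intent = {
    "bars": {"count": 3},
    "x_axis": {"label": "Model"},
    "y_axis_label": {"text": "Accuracy (%)"},
    "legend": {"position": "top-right"},
}
score, plan = refine(intent, toy_propose, toy_render, toy_critique, k=2)
print(score, sorted(plan))
```

Note that the feedback carries only the flagged roles, so each round repairs specific components instead of regenerating the whole figure.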
