PrezEval: Benchmarking AI Agents on Professional Slides

April 6, 2026 · 6 min read

Goal

How well can an AI agent reproduce professional consulting slides from visual guidance?

After building Verso, we’ve come to believe that our approach produces far better results than the alternatives.

But let’s put numbers on that.

PrezEval is a benchmark that measures exactly this. Given a target slide image and the original source presentation (with the correct layout pre-selected), an agent must edit the slide to match the target as closely as possible. A vision-language model then scores the result by comparing structure, content, hierarchy, and styling.

This task is deceptively hard. Real consulting slides are dense, precise artifacts: a misaligned chart legend, a missing axis label, or a wrong color in a heatmap cell all count as failures. The benchmark tests not just whether an agent can write text to a slide, but whether it can handle charts, tables, custom shapes, multi-column layouts, and brand-specific styling, all at once.

Benchmark Building

Source material

We curated 61 slides from 10 professional presentation decks spanning major consulting and advisory firms: McKinsey, Bain, BCG, PwC, EY, and Deloitte, as well as law firms Cleary Gottlieb and Mattos Filho. These are real-world decks covering topics from healthcare economics to energy transitions to consumer privacy regulation.

The slides were selected to maximize visual complexity and diversity of elements. Here is what the benchmark contains:

| Element | Slides | Share |
| --- | --- | --- |
| Charts (bar, line, pie, combo…) | 33 | 54% |
| Multi-column layouts | 24 | 39% |
| Logos and icons | 17* | 28% |
| Tables | 14 | 23% |
| Dense text layouts | 13 | 21% |
| Complex diagrams / timelines | 8 | 13% |
| Maps | 5 | 8% |
| Custom composite shapes | 3 | 5% |

*Counting only substantive illustrative icons, not company logos (which appear on ~45 slides).

What makes it hard

Task setup

For each of the 61 tasks, the agent receives:

- the target slide rendered as an image, and
- the original source presentation, with the correct layout pre-selected.

The agent then edits the slide through tool calls, and the final result is rendered as a PNG and scored by a vision-language model evaluator. The evaluator rates each result on an integer scale of 1 to 5, since research shows that a compact integer scale maximizes human-LLM alignment for LLM-as-a-judge setups. We then convert ratings to a 0-100% score for readability.
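The exact rating-to-percentage conversion isn't spelled out above; a minimal sketch, assuming a linear mapping where a rating of 1 maps to 0% and 5 maps to 100%:

```python
def rating_to_score(rating: int) -> float:
    """Map a 1-5 integer evaluator rating onto a 0-100% scale.

    Assumes a linear mapping: 1 -> 0%, 3 -> 50%, 5 -> 100%.
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be in [1, 5], got {rating}")
    return (rating - 1) / 4 * 100
```

Under this mapping, a benchmark score of 49.6% corresponds to an average rating close to 3 out of 5.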

Results

We compared three configurations:

| Config | Score | Time | Steps | Tasks |
| --- | --- | --- | --- | --- |
| Verso Medium | 49.6% | 207.7s | 8.8 | 61/61 |
| Verso Fast | 38.9% | 157.5s | 9.5 | 61/61 |
| Claude for PowerPoint | 36.5% | 176.5s | 11.6 | 61/61 |

Verso Medium achieves the highest score at 49.6%: most reproductions capture the right structure and content but have noticeable differences in styling or positioning.

Verso Fast trades accuracy for speed, completing tasks 24% faster while scoring 38.9%. Interestingly, it uses more steps on average (9.5 vs 8.8), suggesting the smaller model takes more exploratory actions.

Claude for PowerPoint scores 36.5% despite using the most steps (11.6) and significantly more compute.

Score breakdown by content type

Breaking down scores by what the slide contains reveals clear patterns:

| Content type | Verso Medium | Claude for PPT |
| --- | --- | --- |
| Dense text | 66.8% | 48.3% |
| Non-chart slides | 63.5% | 44.8% |
| Tables | 48.3% | 38.3% |
| Diagrams | 47.3% | 25.0% |
| Charts | 38.0% | 29.5% |
| Maps | 12.5% | 12.5% |
| Overall | 49.5% | 36.5% |

Text-heavy slides are the easiest category, while maps are the hardest (equally bad for both agents). Charts, which make up 54% of the benchmark, pull the overall score down significantly.
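A breakdown like this can be produced by averaging per-task scores within each content category. A sketch with a hypothetical `scores_by_category` helper (the data layout is assumed; a single task can carry several category tags, e.g. both charts and multi-column):

```python
from collections import defaultdict

def scores_by_category(results):
    """Average per-task scores (0-100) for each content category.

    `results` is a list of (categories, score) pairs, where `categories`
    is the set of content tags a task carries. Because tasks can belong
    to multiple categories, the per-category averages overlap and do not
    sum to the overall score.
    """
    buckets = defaultdict(list)
    for categories, score in results:
        for category in categories:
            buckets[category].append(score)
    return {cat: sum(s) / len(s) for cat, s in buckets.items()}
```

This also explains why the category rows in the table above are not independent: a chart-heavy slide with a table contributes to both the "Charts" and "Tables" averages.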

Where Verso excels

Verso consistently scores well on structured text slides: formatted legal text, multi-section layouts with colored boxes, table-of-contents style pages, and multi-column icon layouts. On these, both Verso Medium and Verso Fast achieve near-perfect scores (75-100%), while Claude for PowerPoint typically lags well behind.

What remains hard

About 20% of the benchmark is essentially unsolved: all three agents score 25% or below. The common failure modes:

Where Verso still has room to grow

On about 15 tasks, Verso variants still struggle (scoring 25% or below). These tend to be slides with large structured grids, brand logos embedded in charts, or decorative elements. This suggests specific opportunities to improve Verso’s handling of these patterns.

All results, including per-task generated vs. reference images and evaluator critiques, are available in the PrezEval repository.