SWE-bench Verified: Real-World Agent Performance Leaderboard
Top closed models resolve 55–65% of real GitHub issues end-to-end; leading open-weight agents trail at 28–34%.
Synthesized Answer
Base case for global vertical-agent ARR in 2026: $14–18B, growing to $48–65B by 2028. Software engineering and customer support lead today; legal, healthcare RCM, and financial back-office drive the next leg.
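The growth rate implied by that base case can be backed out with a short calculation. This is a minimal sketch, assuming the low end of the 2026 range pairs with the low end of the 2028 range (and high with high) and that growth compounds annually over the two years; the function name `implied_cagr` is illustrative and not part of the page.

```python
def implied_cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` over `years`."""
    return (end / start) ** (1 / years) - 1


# ARR ranges from the synthesized answer, in $B (low, high).
arr_2026 = (14.0, 18.0)
arr_2028 = (48.0, 65.0)
years = 2028 - 2026

# Assumption: pair low with low and high with high.
low = implied_cagr(arr_2026[0], arr_2028[0], years)
high = implied_cagr(arr_2026[1], arr_2028[1], years)

print(f"Implied CAGR: {low:.0%} to {high:.0%}")
```

Under those assumptions the base case implies roughly 85% to 90% annual growth. Pairing the ranges differently (low 2026 with high 2028, for example) would widen that band.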
Key Points
Vertical agents are out-scaling horizontal copilots in revenue per seat by 3–5×.
Benchmarks show 60% task completion, but production deployments report only 25–40%; eval-set leakage is suspected.
Computer-use APIs cut integration timelines from months to weeks for legacy systems.
Eval/observability startups (Braintrust, Langfuse, Arize) are seeing the fastest ARR growth in the stack.
Enterprises with agents in production: 31% (18pp YoY; Gartner Q1 2026)
SWE-bench Verified (top closed): 65% (22pp YoY; real GitHub issue resolution)
Vertical agent gross margin: 72% (9pp YoY; median across 40 disclosed startups)
Inference cost per agent task: $0.18 (−54% YoY; blended, 30-min analyst-equivalent)
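The year-over-year figures on these cards mix two kinds of change: percentage-point deltas for the three rate metrics and a relative change for inference cost. A rough sketch of how prior-year values could be backed out follows. It assumes the unsigned "pp" deltas are increases (the page shows no sign for them) and that −54% is a relative change on the prior-year cost; the helper names are hypothetical.

```python
def prior_from_pp(current_pct: float, delta_pp: float) -> float:
    """Prior-year rate when the change is given in percentage points."""
    return current_pct - delta_pp


def prior_from_relative(current: float, rel_change: float) -> float:
    """Prior-year value when the change is a relative fraction (e.g. -0.54)."""
    return current / (1 + rel_change)


# Assumption: the pp deltas below are increases over the prior year.
prior_adoption = prior_from_pp(31, 18)    # enterprises with agents in production, %
prior_swe_bench = prior_from_pp(65, 22)   # SWE-bench Verified (top closed), %
prior_margin = prior_from_pp(72, 9)       # vertical agent gross margin, %

# Relative change: $0.18 today after a 54% drop.
prior_cost = prior_from_relative(0.18, -0.54)

print(prior_adoption, prior_swe_bench, prior_margin, round(prior_cost, 2))
```

Under those assumptions the prior-year values would be about 13%, 43%, 63%, and $0.39 per task. The distinction matters: subtracting 54 from a dollar figure, or dividing a rate by a pp delta, would give meaningless results.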
Workflow-specific agents in legal, sales, and finance are achieving 40–70% task automation with a bounded eval surface.
Screen-reading + structured action agents now handle multi-app workflows; reliability gates remain the production bottleneck.
Human performance on GAIA is 92%; best agent score is 49%. Long-horizon planning is the dominant failure mode.
31% of enterprises have at least one agent in production; 78% cite evaluation tooling as the #1 blocker to scale.
Open frameworks lead on portability; native SDKs lead on tool fidelity and latency. Convergence expected within 12 months.
Demo mode · All sources, insights, and data are mock-generated for illustration.