Formal Verification Benchmarks Are the Key to Ironclad Software Infrastructure

Formal verification is the gold standard for software correctness, and we argue that AI could finally make it practical at industrial scale.

Seriously. We are inviting you to dream of a world where critical infrastructure ships with ironclad guarantees of safety. Where we can patch our buggy systems once and for all. Where even vibe code can be verified and trusted. 

But are we actually moving in that direction? If so, what is missing—and how do we accelerate the process to ensure we get there?

For all their incredible utility, modern LLMs are spectacular mimics. Having ingested the entire internet—including my old grad school Rocq spaghetti proofs—they appear highly capable of synthesizing complex mathematics. And yet, LLMs are notorious for hallucinating or lying. How do we have confidence that AIs are reasoning about code, not just parrotting things they have memorized? To drive them towards the future we envision, we must understand their capabilities and failures.

The foundation that makes the verification revolution possible is something surprisingly mundane: measurement – building benchmarks that capture the realities of verification engineering, expose where models fail, and drive progress toward industrial-scale verification.

But taming the AI models that enable scalable verification is a massive undertaking, and no single team can do it in a silo. We need a community-wide effort to build the rigorous tests that will actually push this field forward. We must design high quality formal verification benchmarks.

Galois is leading the charge with two new projects—Metron and Praxis—measuring AI capability in reasoning about code, and how well it can use this reasoning to harden and upgrade existing codebases. 

Why Are Benchmarks Important?

“To think is to forget differences, generalize, make abstractions.”

     - Jorge Luis Borges, “Funes, the Memorious”

To confidently assess the correctness of AI output in FV applications, we need a way to measure and compare it against well-reasoned and agreed-upon benchmarks.

Benchmarks tell us if LLMs generalize their reasoning abilities beyond memorization. We know these models generalize to some extent, but their massive pre-training data blurs the line between genuine reasoning and sophisticated regurgitation. How do we tell which is which? Recent systems have already demonstrated the ability to verify nontrivial cryptographic software and synthesize complex proofs, but we need more confidence that these capabilities will generalize to new domains, scale to industrial workflows, or robustly handle unfamiliar verification challenges.

Benchmarks help us understand when and how LLMs fail. We mostly use AI as a black box. We don’t yet know which verification tasks models naturally master during training and what improves with scale. More importantly, when an AI gets stuck, we don’t know if the bottleneck is a fundamental flaw in its reasoning, shortcoming of its training, or just a workflow roadblock that better prompting and external tools could fix. Well-designed benchmarks provide these intermediary signals that indicate the boundary of current capabilities and help characterize failure modes. 

Moreover, benchmarks shape what frontier labs optimize for. Making rigorous verification matter and performance easily measurable is one of the clearest ways to steer the field toward a future where machine-checked proofs of software correctness become abundant, cheap, and woven into the fabric of everyday computing.

What Makes a Good Benchmark?

AI capabilities are evolving rapidly, and verification benchmarks must evolve with them. But the direction is already clear: the next generation of evaluations must satisfy three core requirements.

  1. Real-world tasks. Benchmarks are only useful if they measure the capabilities we actually care about. We want AI systems that can reason about the correctness of production-grade infrastructure, not just solve puzzles that don’t reflect reality. Accordingly, benchmark tasks should reflect the realities of deployable software and the kinds of verification problems that emerge in real engineering environments.
  2. Maps of capability. Pass/fail evaluations are useful (and using them was key to expanding capabilities of LLMs!), but they conceal the most important information: where reasoning breaks down. Just as there are many ways to get a question wrong, there are also potentially many ways of getting a question right. Strong benchmarks should expose intermediate signals that help researchers understand finer-grained details about how models fail, recover, decompose problems, maintain invariants, and navigate long-horizon verification workflows.
  3. The Hidden exam. Evaluation data must remain genuinely held-out. In a world where frontier models are trained on vast portions of the public internet, open-source benchmarks quickly stop measuring reasoning and start measuring recall. To have benchmarks that  evaluate generalization, we need high confidence that models were never trained on the underlying tasks, proofs, or specifications during training. Benchmarks like FrontierMath and ARC-AGI have already demonstrated the value of closed evaluations and private holdout sets–and also the difficulty in making and maintaining such!. Future verification benchmarks should build on this foundation.

The revolution has already Begun: Pioneers in Benchmarking Verification 

For the first time, formal verification benchmarks are giving us a picture of how AI systems reason about software correctness. Benchmarks like SV-COMP, VeriBench, VeriContest, VeriSoftBench, DafnyBench, Verina and FormalBench are helping raise the standard for evaluating AI systems in software engineering. They introduced machine-checked proofs, specification inference, repository-scale evaluation, and consistency checks between code, tests, and formal invariants. Most importantly, they gave us the first real map of where frontier models still struggle: decomposition, invariants, proof repair, and long-horizon reasoning.

But this is only the start. We now have the opportunity (nay, the duty!) to build benchmarks that capture the messy reality of real-world verification: OS interactions, evolving repositories, ambiguous specifications, concurrency, novel security failures, and long-horizon engineering workflows. We must move beyond static pass/fail tasks and reveal where reasoning breaks down, how models recover from failure, and whether they can generalize beyond memorized patterns.

This is not a challenge any one lab or benchmark will solve alone. The field needs shared infrastructure, evolving evaluation standards, and open collaboration between the formal methods and AI communities.

A Call to Arms

Formal verification has always promised radically more trustworthy software, but its cost confined it to a handful of extraordinary projects. LLMs may finally change that. Yet capability claims are not enough: if we cannot distinguish genuine reasoning from memorized proof patterns, we cannot trust these systems with critical infrastructure.

This is why benchmarks matter. They determine what the field measures, optimizes, and ultimately learns to build. The next generation of verification benchmarks must become shared infrastructure: dynamic, reproducible, resistant to contamination, and grounded in real engineering workflows. So let’s stop guessing. Let’s standardize the yardstick, compete on proven correctness, and build systems that can truly prove code correct.

No more guessing. No more mimicry. It is time to build the high-quality, real-world benchmarks that actually measure what we care about, and start engineering a future we can trust.

This is the second in a series of posts on scaling formal guarantees in the AI era. Stay tuned for our next installment, exploring the need to scale and unify the formal verification ecosystem.