Artificial intelligence is poised to trigger an "industrial revolution" in formal verification by automating complex proof synthesis – making the process dramatically faster, cheaper, and easier for non-experts to use. If AI can automate theorem proving and software verification, then verification can scale. And if verification can scale, then security may no longer depend on racing to patch the next vulnerability, but on proving the vulnerabilities cannot exist in the first place.
The implications are profound: critical infrastructure could ship with ironclad safety guarantees, AI-generated code could be mathematically verified before deployment, and organizations could build systems that are secure by construction.
Yet to realize this vision, we must confront a fundamental understanding gap: Today's Large Language Models are spectacular mimics, notorious for hallucinating or lying. We need to know when they are genuinely reasoning about complex code versus simply parroting memorized proof patterns, and we must figure out how best to drive their capabilities toward a secure future.
Answering these questions requires rigorous measurement, yet current evaluation methods are falling short. Early benchmarking efforts were a necessary start, but are ultimately insufficient – they rely far too heavily on pedagogical toy examples and binary pass/fail metrics, and miss the messy, long-horizon realities of actual software engineering, such as operating system interactions and concurrency.
To push the field forward, we need high-quality, realistic formal verification benchmarks that capture real-world complexity, expose exactly where AI reasoning breaks down, and provide the definitive yardstick required to achieve industrial-scale verification.
To bridge this gap, Galois is developing Metron—a rigorous evaluation framework created in collaboration with Kestrel Institute and Kry10, and funded under the ARIA programme Mathematics for Safe AI with input from the FMxAI community.
Metron is designed to measure the state-of-the-art performance in AI’s ability to formally reason about code at multiple levels of granularity. These benchmarks will help us understand exactly when and how AI models fail, revealing fundamental flaws in reasoning, shortcomings in training, or simply workflow issues that better tooling could fix. Ultimately, well-crafted benchmarks that accurately reflect real-world workflows and challenges shape what frontier labs optimize for. By making rigorous verification performance easily measurable, this project provides the incentive and the necessary software tooling needed to steer AI development toward a future where machine-checked proofs of software correctness are abundant, cheap, and woven into the fabric of everyday computing
To accurately measure AI capabilities, this project executes on three high-level principles: testing on real-world tasks, mapping granular capabilities, and maintaining a strictly private "hidden exam."
Specifically, we achieve this through the following technical approaches: