Metron

Challenge

Artificial intelligence is poised to trigger an "industrial revolution" in formal verification by automating complex proof synthesis – making the process dramatically faster, cheaper, and easier for non-experts to use. If AI can automate theorem proving and software verification, then verification can scale. And if verification can scale, then security may no longer depend on racing to patch the next vulnerability, but on proving the vulnerabilities cannot exist in the first place.

The implications are profound: critical infrastructure could ship with ironclad safety guarantees, AI-generated code could be mathematically verified before deployment, and organizations could build systems that are secure by construction.

Yet to realize this vision, we must confront a fundamental understanding gap: Today's Large Language Models are spectacular mimics, notorious for hallucinating or lying. We need to know when they are genuinely reasoning about complex code versus simply parroting memorized proof patterns, and we must figure out how best to drive their capabilities toward a secure future.

Answering these questions requires rigorous measurement, yet current evaluation methods are falling short. Early benchmarking efforts were a necessary start, but are ultimately insufficient – they rely far too heavily on pedagogical toy examples and binary pass/fail metrics, and miss the messy, long-horizon realities of actual software engineering, such as operating system interactions and concurrency.

To push the field forward, we need high-quality, realistic formal verification benchmarks that capture real-world complexity, expose exactly where AI reasoning breaks down, and provide the definitive yardstick required to achieve industrial-scale verification.

Solution

To bridge this gap, Galois is developing Metron—a rigorous evaluation framework created in collaboration with Kestrel Institute and Kry10, and funded under the ARIA programme Mathematics for Safe AI with input from the FMxAI community.

Metron is designed to measure the state-of-the-art performance in AI’s ability to formally reason about code at multiple levels of granularity. These benchmarks will help us understand exactly when and how AI models fail, revealing fundamental flaws in reasoning, shortcomings in training, or simply workflow issues that better tooling could fix. Ultimately, well-crafted benchmarks that accurately reflect real-world workflows and challenges shape what frontier labs optimize for. By making rigorous verification performance easily measurable, this project provides the incentive and the necessary software tooling needed to steer AI development toward a future where machine-checked proofs of software correctness are abundant, cheap, and woven into the fabric of everyday computing

How Does it Work?

To accurately measure AI capabilities, this project executes on three high-level principles: testing on real-world tasks, mapping granular capabilities, and maintaining a strictly private "hidden exam."

Specifically, we achieve this through the following technical approaches:

Real-World Formal Verification Tasks: Rather than testing on toy problems, we evaluate AI on deployable industrial systems leveraging world-class expertise. Galois brings over 25 years of top-tier experience across formal methods, automated reasoning, and cryptography that make us well-suited to ground the evaluation in reality. The Kestrel Institute is an industry pioneer in formal synthesis and automated proof repair (using tools like ACL2), while Kry10 is a premier leader in the formally verified seL4 microkernel ecosystem. Together, this complementary consortium provides the deep, practical expertise necessary to benchmark AI against complex, production-grade proofs.
Closed-Sourced Evaluation: To guarantee we are measuring genuine reasoning rather than internet recall, human formal methods experts will develop and curate the new problem sets. This dataset will be heavily secured as a private hold-out to ensure it is never inadvertently included in the pre-training data of frontier models.
Open-Sourced Tooling: To properly map capabilities, we are open-sourcing the toolchains, APIs, and metrics. This allows the benchmarks to plug directly into the reinforcement learning (RL) training loops of frontier AI labs. Anyone can use the tools, metrics, and infrastructure we develop in Metron. And the more people who use these high-assurance techniques, the more the community benefits. Open source sets the foundation for a more secure future.

Value Add

FV for Trustworthy AI: Developing a standardized benchmark to evaluate AI reasoning for formal verification provides the tooling and evidence to enable training of frontier AI models for genuine safety and correctness rather than convincing mimicry
AI for Scalable FV: Using trustworthy AI to automate and scale formal verification tasks, making the process faster, cheaper, and easier – transforming it from a manual, high-cost specialty into an abundant, scalable standard
Towards a Secure by Construction World: The Metron benchmark is a crucial stepping stone in moving the industry beyond the endless, reactive cycle of patching vulnerabilities towards a future where software is correct by construction ‍
Accelerating a Virtuous Cycle: Building a shared, open-source foundation of metrics and tools that allow the community to build on and compound progress, learn from failures, and evolve tools rapidly

GO DEEPER:

Artificial Intelligence & Machine Learning