Let me get right to the point without any nonsense about aliens:
Interactive theorem proving tools such as Lean are the most powerful and trustworthy kind of formal methods tool. They have been used to formally verify important things such as cryptographic libraries, compilers, and operating systems. Unfortunately, even experts find ITP proofs time-consuming and error-prone. That’s why it’s exciting—and very surprising!—to find that Claude Code is so good at ITP. Today, Claude Code can complete many complex proof steps independently, but it still needs a ‘project manager’ (me) to guide it through the whole formalization. But I think Claude Code points to a world where experts aren’t necessary, and theorem provers can be used by many more people.
The rest of this post digs into what Claude Code can actually do. But if you’re interested in automated reasoning or formal verification, I recommend you stop reading, go sign up for Claude Code, Gemini CLI, Aider, Codex, or some other coding agent, and try it out on a problem you know well. It’ll cost about $20 / month for something useful, and maybe $100 / month for access to a state-of-the-art model. I reckon you’ll be able to get surprising successes (and interesting failures) with about two hours of work.
(If you actually do this, email me and tell me how it went).
There’s a deep tradeoff in automated reasoning between restricting the underlying math—which tends to make reasoning easier to automate—and making it more powerful and general. For example, an SMT solver can only solve queries expressed in a simple logical language. As a result, SMT solvers are predictable enough to run a billion times a day as part of AWS’s S3 bucket security controls. No-one has to write a proof on a whiteboard; instead, the solver solves each theorem quickly and cheaply.[1]
At the other end of the spectrum, interactive theorem proving tools are designed to be as general as possible so we can use them for Real Math™—for example, Fermat’s Last Theorem. Unfortunately, this power makes ITP tools notoriously difficult to use. Nick Benton memorably wrote in 2006 “I have rarely felt as stupid and frustrated as I did during my first few weeks using Coq.” I taught myself Lean this year—almost twenty years after Benton. The specific issues he pointed to are long solved, but I still felt stupid and frustrated, far beyond any other kind of tool I’ve used.
Why is ITP so hard? Some reasons are quite obvious: interfaces are confusing, libraries are sparse, documentation is poor, error messages are mysterious. But these aren’t interesting problems, and anyway they are getting fixed. Lean, in particular, is improving at an astonishing pace thanks to focused work by the Lean FRO.
Deeper than this, and keenly felt if not always said, ITP is cognitively demanding in multiple ways. It requires the ability to juggle abstractions as in pen-and-paper mathematics, a willingness to wrangle complex constraints in the exotic languages used by ITP tools, and a tolerance for microscopic pedantry which most people find almost impossible to summon.
I once joked that the dominant ITP strategy was the “PSMGS cluster” - Proof Search by Miserable Graduate Students. The most impressive results in ITP have been achieved by years of very tedious work by very clever people, slowly grinding each key theorem down to a nub, case by case by case. A few supremely talented people can do this well, but they are vanishingly rare and their time is very expensive. This puts a very hard limit on how broadly ITP tools can be used, because very few potential projects pencil out in terms of costs and benefits.
Sadly I am not supremely talented (I am a lazy idiot) and so I found learning Lean annoying and difficult.
So I started looking for shortcuts.
What about SMT solvers? Lean has impressive SMT support including the new grind tactic. That’s useful when it works, but SMT only works on a very limited set of math properties. Other kinds of non-AI automation such as hammers are similar: very useful, but typically only on a limited range of problems.
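As a rough illustration (a toy example of my own, not one from my project), this is the kind of restricted goal that this style of automation can close with no hand-written proof at all:

```lean
-- Linear arithmetic over Nat: squarely inside what SMT-style automation
-- handles. Which tactic closes it depends on your Lean/Mathlib version;
-- `omega` certainly does, and `grind` targets goals of this shape.
example (a b c : Nat) (h1 : a ≤ b) (h2 : b ≤ c) : a + 1 ≤ c + 1 := by
  omega
```

Step outside that fragment (quantifying over functions, reasoning about user-defined structures) and this kind of automation quickly stops helping.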
It’s 2025, can’t AI help me? There’s been a lot of work on AI mathematics; I even wrote about some previous results. I installed the fine-tuned AI models from a couple of recent papers, tried them out on some example tasks, and… none of them really helped.[2]
The problem (I realized) is that the models I had chosen were designed to prove individual Lean theorems. That’s only one small part of how I use Lean for ITP. When I’m formalizing a theory, I’m typically doing some combination of the following tasks:

1. Developing the underlying mathematical theory, and checking that it is conceptually correct.
2. Designing the overall structure of the formalization, including which Lean types and representations to use, and refactoring it when those choices change.
3. Decomposing the main theorems into smaller, provable lemmas.
4. Proving individual theorems and lemmas.
5. Writing the definitions that the theorems are about.
(There are other tasks I might do. For example, I sometimes write property-based tests using Plausible, and any long-lived proof will eventually need to be updated and repaired.)
There’s often a push-pull between these tasks. To give a recent example, suppose I want to prove that a function is associative (task 4). I realize that this theorem is false—oops—and figure out this is because I’ve defined the function using list concatenation (task 5). Maybe this is a problem I can avoid by decomposing other theorems differently (task 3). But in this case, I realize it’s a hard requirement, and I need to replace Lean lists with some other type—for example, a finite map—and then refactor the whole formalization accordingly (task 2). I also need to check whether my theory is mathematically correct, or if I’ve made some conceptual mistake which renders important theorems unprovable (task 1).
My most frustrating experiences with Lean have rarely involved proofs of individual theorems. More often, some previous decision has unexpectedly blocked me, and I’m facing a choice between several different complex refactors to fix it. This is what happened with the example above—in the end, I decided to rewrite everything with finite maps, which meant rewriting several definitions and proofs.
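To give a flavor of the representation issue (a simplified, hypothetical sketch, not my project’s actual definitions):

```lean
-- With a list representation, two permission maps holding the same entries
-- in different orders are different lists, so equations we want about
-- composition (commutativity, and depending on how overlapping entries are
-- resolved, associativity) can fail as stated.
def composeList (a b : List (Nat × Bool)) : List (Nat × Bool) :=
  a ++ b

-- With a function-based finite-map representation, composition combines
-- entries pointwise, and the algebraic properties reduce to properties of
-- the underlying `combine` operation.
def composeFun (combine : Bool → Bool → Bool)
    (a b : Nat → Option Bool) : Nat → Option Bool := fun k =>
  match a k, b k with
  | some x, some y => some (combine x y)
  | some x, none   => some x
  | none,   some y => some y
  | none,   none   => none
```

The switch is conceptually small, but every definition and proof that touched the old representation has to change with it, which is exactly the kind of refactor I mean.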
Realistic uses of Lean look more like software engineering on proofs: proof engineering, in other words. Proofs of individual theorems are important, but much of the work with Lean involves adjacent tasks such as selecting abstractions, refactoring code, analyzing requirements, and fixing failures.
Okay (I thought) maybe what I need is something different, more like an AI designed for software engineering.
Claude Code is an AI coding agent, and no, I didn’t understand what that was either, so let me explain. A chatbot like ChatGPT will typically take a single request and give a single answer, while an AI agent can break down a request into many sub-tasks and then perform them. This is particularly useful for software engineering, where a single large task might require reading documents, modifying files, running tools, and so on.
More practically speaking, you install Claude Code as a command line application. It pops up a text box a bit like ChatGPT, you type in requests, and the agent works on them. For example, you might ask:

- “Summarize the definitions in this file and explain how they relate to each other.”
- “Write and prove a lemma showing that the compose function is commutative.”
- “The proof of compose_is_assoc is failing to build. Figure out why and fix it.”
Let’s consider the last request, asking for a fix to compose_is_assoc. Claude Code might break down the task as follows:

1. Search the repository for compose_is_assoc and read the file containing it, along with the relevant definitions.
2. Run lake build to reproduce the failure and read the error messages.
3. Form a hypothesis about what is wrong and edit the proof.
4. Run lake build again, and repeat the previous two steps until the proof is accepted.
5. Report back with a summary of what changed.
(You may be thinking that letting an AI agent run arbitrary command-line tools seems hilariously dangerous. It is! By default, each command requires user confirmation, with a user-configurable whitelist. However, footguns abound and I strongly recommend against experimenting on data you care about. Designing AI agents so they can be used safely and securely seems like an important and very under-studied problem.)
Behind the scenes, each step above might involve several calls into Anthropic’s cloud-hosted AI models. For a complex request like this one, the whole process can take several minutes. This can actually be kind of boring as a user - you push go and watch the agent grind away.
Obviously, the AI agent makes mistakes of all kinds: syntactic, semantic, conceptual. What I’ve found is that given a tool that can detect mistakes, the agent can often correct them. For example, suppose we want to prove a particular theorem. Typically, the agent will create a first version, and then loop: (1) run lake build to see if the current proof passes, (2) read the error messages, (3) think about what might be wrong, and (4) try to correct the errors. After several iterations, the proof is often correct.
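Here’s a hypothetical miniature of what that loop looks like (a toy theorem, not one from my project):

```lean
-- Attempt 1 (what the agent might write first):
--   theorem add_comm' (a b : Nat) : a + b = b + a := by rfl
-- `lake build` rejects it: `rfl` can't close this goal, since the two sides
-- aren't definitionally equal. That error goes back into the loop, and the
-- next attempt reaches for a tactic that actually applies:
theorem add_comm' (a b : Nat) : a + b = b + a := by
  omega
```

In my real project the loops were much longer, but the shape was the same: generate, build, read the errors, revise.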
I think this is part of why Claude Code is surprisingly good at theorem proving. Lean is very strict about the programs it will accept. This makes writing Lean arduous for humans, but it means an AI agent gets detailed feedback. To me, this suggests we may want to design tools differently for AI agents than for human users. Instead of trying to avoid failures and outputting only highly processed information that a human can digest, we may want to be stricter and produce more information about what’s going wrong.
I used Claude Code to work on a Lean formalization of an old paper, Deny-Guarantee Reasoning (Dodds, Feng, Parkinson, Vafeiadis - 2009). This paper isn’t too mathematically complex, but since I last worked on it 15 years ago, I'd forgotten almost all of the details. The theory includes:

- Deny and guarantee permissions, which describe what state changes a thread and its environment may perform.
- An algebra for composing these permissions, whose operations need properties such as commutativity and associativity.
- A program logic built on top of the permissions, along with the lemmas needed to make it sound.
I started from an empty repository and built the formalization as follows:

1. I asked Claude Code to draft a formalization plan based on the paper, breaking the theory down into definitions and theorems.
2. I worked through the plan one item at a time, prompting the agent to write the definitions and prove the theorems.
3. After each step, I reviewed the result and nudged the agent back on track when it went wrong.
To my surprise, this approach worked. You can see the Lean development on GitHub here. I’m about 50% of the way through the formalization plan, with 2,535 lines of Lean code total, and about 1,232 lines of proof.
The agent was able to work on all the different levels I talked about above, and it was able to figure out where switches between levels were needed. The refactoring from lists to finite maps I talked about above was something the agent did for me (although this was at the edge of its capability - it went down one blind alley, and I supervised it carefully on the second attempt).
Before you get too excited, here’s a killer caveat: I think formalizing using AI was way slower than if I’d done it by hand. Let me explain.
My experience as a ‘project manager’ for Claude was quite varied:

- Many requests went through cleanly, needing little or no intervention from me.
- Many more produced small, local mistakes: syntactic, semantic, or conceptual slips of the kind described above.
- A few produced deep, persistent mistakes, where the agent had misunderstood something fundamental about the theory.
I found that Claude Code could resolve most mistakes with a small amount of nudging. I can also imagine that many of these mistakes could be avoided or resolved more quickly if the agent had access to better tools. For example, about halfway through the project, I installed the lean-mcp-lsp package, which lets the agent query the proof state, search the code, run snippets of test code, and use a few other small features. This made Claude Code noticeably better at proving theorems, I suspect because it had more ways to diagnose errors and test hypotheses.
The last type of error, deep and persistent mistakes, was rare but very costly. In these situations, the agent itself was confused, and often made many changes that embedded its confused understanding. Unpicking its plausible-sounding nonsense was very time-consuming.[3] In these cases, my experience with ITP was absolutely necessary, because it allowed me to step back and see the whole picture in a way that an AI agent today mostly cannot. The high cost of these rare errors is the reason I think the project probably took longer with the AI agent than it would have done by hand.
I also don’t feel anywhere near as confident in the formalization as I would if it had been produced by an expert. The theorems are checked by Lean and I think we should have high trust in them. But more than 60% of the Lean code consists of definitions, not theorems. The fact that the definitions are consistent enough that we can prove properties about them should also give us a bit of confidence. And the definitions look reasonably plausible, which should give us a bit more. But to be truly certain, we would have to carefully audit the whole formalization, which itself would be very arduous. In the best case, we would have a small set of core definitions which are obviously correct, but I suspect for most formalizations, that simply won’t be possible.
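To illustrate why the definitions deserve scrutiny (a deliberately silly example of my own, not something from the project), Lean will happily check theorems about a definition that doesn’t say what we meant:

```lean
-- A deliberately wrong definition of "sorted": it accepts every list.
def sortedBad (xs : List Nat) : Prop := True

-- Lean verifies this proof, but the theorem is worthless, because the
-- definition doesn't capture what "sorted" was supposed to mean.
theorem every_list_is_sorted (xs : List Nat) : sortedBad xs := True.intro
```

The kernel only guarantees that the proofs follow from the definitions; it can’t tell us whether the definitions mean what we intended.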
So Claude Code can write Lean proofs, how surprising is that really? Oh traveller, let us journey through the mists of time to the inconceivably distant year of… *cue wibbling flashback effect* … 2024.
As of last year, if you wanted to formalize a paper in Lean, your options were (1) go to grad school and learn to do it yourself, (2) hope one of the few theorem proving experts got interested in your problem, or (3) give up. There was no next-best tool which AI agents are supplanting; this capability simply did not exist.
What’s even crazier to me is that Claude Code isn’t designed or marketed as a theorem proving tool. I think its success here comes from related AI capabilities—agency, long-term planning, task decomposition, software engineering, and traditional math. It seems likely that Anthropic trains its models on Lean math benchmarks, and who knows, maybe they’ve thrown in some random formal methods problems too. But I think it’s also possible I’m one of the first humans to ever apply Claude Code to a formal methods paper.
Of course, the experiment I’ve described above isn’t Claude Code autonomously formalizing my old paper from end to end. Rather, the agent took a series of individually impressive but self-contained steps, with me evaluating the result and nudging it back on track when necessary. That’s consistent with METR’s benchmarking, which shows current AI agents can’t yet perform very long tasks. However, METR’s results also show this ability increasing rapidly with each new AI model. If this trend continues, we may soon have AI models that can prove theorems as well as or better than human experts, just thanks to their native capabilities.
Even if AI doesn’t get much smarter, Claude Code will be more useful when it can just think faster. Managing the agent is currently kind of frustrating because it involves a lot of dead time. Each request takes maybe 5-10 minutes to complete, during which you can’t do much else of use, and can’t easily swap to another task.
I think there are also easy opportunities to make Claude Code as it is today more effective. Adding lean-mcp-lsp made a big difference, which suggests even simple tools can help if they give the agent more information. We could also give the agent access to the theorem-focused AI models I mentioned earlier and let the agent decide when to use them. Many of the errors I encountered seemed like random failures, and we might be able to avoid them by running multiple agents in parallel and picking winners. The bar for improvement is honestly quite low: in my experiments, I did the dumbest thing I could think of and it worked pretty well.
As long as I’ve been in the field, automated reasoning has evolved slowly. We are used to improvements of a few percent, achieved through clever, careful work. Claude Code breaks that pattern. Even with the limitations it has today, it can do things that seemed utterly out of reach even a year ago. Even more surprising, this capability doesn’t come from some fancy solver or novel algorithm; Claude Code wasn’t designed for theorem proving at all!
I think what Claude Code really points to is the bitter lesson coming for formal methods, just as it did for image recognition, language translation, and program synthesis. Just as in those fields, in the long run I think formal methods ideas will be essential in making AI-driven proof successful. That’s because AIs will need many of the same tools as humans do to decompose, analyze, and debug proofs. But before that happens, I think we will see much of the clever, careful work we esteem rendered obsolete by AI-driven tools that have no particular domain expertise. The lesson will be bitter indeed, and many people are resisting it.
I think the result will be worth the pain, however. The reason ITP has never been widely adopted is that it is simply too cognitively demanding for most humans. Claude Code points to a future where theorem proving is solved - cheap, abundant, and automatic. I think that would be a good future, and if it happens, we should be ready and know what problem we want to solve next.
Further reading - Galois intern Twain Byrnes on using AI to reason about memory safety of C code: Escaping Isla Nublar: Coming around to LLMs for Formal Methods
[1] There’s a reasonable argument that a large fraction of all distinct theorems proved in human history have been proved by the AWS Zelkova access control tool.
[2] I’m not going to point to the specific papers, because their scientific results seem strong and my evaluation of the models was very unsystematic - really just installing them and poking around.
[3] METR recently released an entertaining study on coding agents where they asked open-source developers to complete AI-assisted engineering tasks. On average, the developers thought that the AI agent increased their productivity, but the study showed their productivity actually decreased. One explanation is that current AI agents can consistently get code nearly right, but the gap to really right is bigger than we think.