2026-06-11

Opus 4.8 One-Shots an Algorithm With Its Proof: Formal Verification Is Becoming a Hard Benchmark

A developer used Opus 4.8 to autonomously produce a polygon-intersection algorithm with a Lean proof of correctness; earlier models could not. A proof either checks or it does not, which is more honest than a leaderboard, but one case is not a general capability.

formal-verification coding evaluation

Opus 4.8 One-Shots an Algorithm With Its Proof: Formal Verification Is Becoming a Hard Benchmark — Photo / Unsplash

Summary

A developer (GitHub user schildep) published a polygon-intersection algorithm he calls, “to my knowledge, the first formally verified implementation of an intersection algorithm for polygons.” The thing worth recording is not the algorithm but how it was produced: he gave Opus 4.8 a one-line formal specification, the model worked autonomously for about eight hours, and out came an implementation plus a proof of correctness that passes the Lean 4 checker. When he tried the same task with earlier models, he kept getting stuck.

The signal here is valuable because the test is hard in the literal sense: a formal proof is either accepted by the Lean checker or it is not, with no gray zone of “looks right.” It cannot be inflated by memorizing a question bank or by matching an evaluator’s taste, the way leaderboards can. So “Opus 4.8 one-shots, previous models failed” is a clean capability signal. But state it precisely: this is one case on a problem that can be described exactly in a single line of spec, not the general claim that “the model can now do formal verification.” Below, three direct answers: what this actually proves about the model, why formal verification makes a good benchmark, and which readings go too far.

What happened

The task is polygon intersection: given two multipolygons (which may have holes, self-intersections, and overlapping edges), construct a new one whose interior is exactly the intersection of the two inputs. Computational geometry algorithms like this are notoriously hard to verify by ordinary testing, because much of the complexity hides in rare special configurations (edges that exactly overlap, vertices that coincide, a hole sized right at the critical point where the boundary grouping stops being unique). There are infinitely many configurations, so classical testing cannot be exhaustive.

Instead of testing, he used the Lean 4 proof assistant to formalize “intersection” into one specification: the interior of the result equals the intersection of the two input interiors, and had the agent produce an implementation plus a proof satisfying it. The whole setup deliberately shrinks what a human has to review. A reviewer reads three files totaling 87 lines of simple Lean specification (mostly basic geometric definitions and the algorithm interface) and runs the Lean checker. The proved implementation (the ...Proofs.lean and ...Impl.lean files) was written autonomously by the agent; the author says he never read it and does not need to, because the trust comes from the checker, not the model. He also checked that the proof depends only on three accepted-as-trusted axioms (propext, Classical.choice, Quot.sound), to stop the agent from slipping in an assumption that would let a proof slide through.

The gap between models is the most informative part of this project, because it is a longitudinal comparison by one person toward one goal. Per his log: he started early in the year on Opus 4.5 and 4.6; 4.5 was the first model that could handle non-trivial Lean proofs, but only when he first wrote out a rigorous proof sketch would it translate that into Lean. The lemma that “the interior definition is independent of ray direction,” for instance, had to be split into many small steps and fed in, and he was stuck there until 4.7 shipped. 4.7 could take larger steps, could prove that an intersection exists for any two polygons, but still needed him to supply the idea of using Eulerian circuits and to hint, in separate steps, how to handle tricky special cases. The day after 4.8 released, he ran two sessions in parallel in ultracode mode: one to reprove the main theorem from scratch with no hints in an isolated container (nearly all the prior work), another to extend the algorithm to the overlapping-edge special cases where 4.7 had repeatedly failed. Both succeeded after a few hours of autonomous work. One concrete change he observed in 4.8: faced with an intermediate theorem that might itself be wrong, earlier models would dive in and burn time, while 4.8 became suspicious and pivoted to a different strategy on its own, and at one point spawned parallel subagents to try several routes.

He also recorded the cost honestly: forcing an agent to formally verify its work tends to produce code that is slower or ignores practical considerations the spec does not capture. He guesses this comes both from the difficulty of verification pushing toward simpler code, and from the near-absence of formally verified practical software in the training data.

Why it matters

As a model benchmark, formal verification is scarce because it is honest. Most frontier-model capability claims today rest on leaderboards, and leaderboards have a structural weakness: the question bank may be in the training set, the evaluation may be tuned to reward a certain answering style, and a point or two of headroom is hard to attribute to real skill versus overfitting. A formal proof has no such gray zone. The Lean checker is a deterministic system; it accepts or rejects a proof regardless of how confident the model sounds or how pretty the process looks. A model either submitted a proof that checks or it did not. That hardness of the test is exactly why it is a better measure of “can the model actually carry a long chain of rigorous reasoning all the way through” than chasing a score.

It also happens to address the hardest trust problem in agentic coding. Normally, when you let an agent write code, you either review it line by line or write a pile of unit tests and pray the edge cases are covered, and both are expensive and incomplete. This project moves the anchor of trust from “read the agent’s few hundred lines of implementation” to “read 87 lines of spec and trust a deterministic checker.” The agent’s implementation and proof can be treated as a black box, judged correct without anyone reading them. This is a fundamentally different mode of collaboration from reviewing code: the human only has to state the problem correctly (write the spec), correctness is machine-guaranteed, and the heavy lifting in between is outsourced to the model with no need to trust it. For any domain that cares about strong correctness guarantees (the author’s cited NASA example computes keep-out zones for autonomous vehicles), this path is far more meaningful than “have the AI write more tests.”

The cost curve is where this might actually change a judgment. The NASA 2021 work the author cites verified a different polygon algorithm (a merge, not an intersection); that was 2021, before capable large models, and the paper notes the authors manually wrote 700 lemmas to produce the proof. Formal verification has historically been worth it only for very critical software, precisely because of that human cost. If an agent can automate the lemma-writing and proof-filling, the bar drops. But be clear: the realistic precondition today is that the problem can be stated exactly in a short spec. Polygon intersection qualifies; a great deal of real software does not.

Builder impact

If you build with coding agents, the transferable lesson here is not “use Opus 4.8,” it is the workflow that moves the anchor of trust. Find the small core of your problem that can be specified precisely and where correctness is expensive (geometry, concurrency, state-machine transitions, protocols, permission decisions), isolate it from the pile of code that merely “looks like it runs,” write a formal spec for that core alone, and have the agent fill in the implementation and proof while you review only the spec and run the checker. The author’s structure (a human reads 87 lines, treats the rest as a black box) is a reusable design: the implementation can grow and be optimized by later models while the spec a human must read stays the same size.

But do not rush to apply it to everything. The pattern holds only when the goal can be written as a short spec, and most business logic simply cannot be formalized that cleanly (writing “computes money correctly” as a spec is itself an open problem). The author also recorded the real cost: code forced to carry a proof tends to be slower and to ignore engineering considerations, and his first next step is to go back and fix performance. So the realistic use is on a critical component, not in expecting an agent to attach a proof to your whole system. Where it does apply, it replaces “review the agent’s code line by line plus a pile of unit tests and pray you covered the edges” with “review a short spec and trust a deterministic checker,” which is a genuine reduction in work.

Research impact

If you design model evaluations, this case points to an underrated class of benchmark: have the model produce checked proofs in proof assistants like Lean or Rocq. The appeal is not just difficulty, it is that the test is objective and naturally contamination-resistant. A fresh formal spec that has never been in any training set either gets proved or it does not, and there is no question bank to memorize. Compared with yet another coding benchmark that gets saturated and suspected of leakage, this kind of task separates “genuinely does long rigorous reasoning” from “matches patterns well.” The author inadvertently gives a longitudinal data point: on the same hard problem, 4.5 could only translate human-written sketches, 4.7 could take big steps given the idea, and 4.8 could formulate and execute whole strategies and self-correct on dead ends. A metric of “how long a stretch of proof the model can carry autonomously before a human has to take over” may characterize reasoning progress better than a single pass rate.

The research trap to avoid is reading a single point as a curve. This is an n=1 observation, from an author who knows Lean well and knows how to build the scaffolding (separating spec from implementation/proof, using main theorems as checkpoints). Hand the same model to someone who cannot write the spec and does not know Lean, and it almost certainly does not reproduce the result. Model capability is a necessary condition for this outcome, not a sufficient one; the scaffolding and the asker’s expertise are doing real work too.

What to ignore

Ignore the generalization that “the model can now do formal verification.” The accurate statement is: on a problem that can be described exactly in a one-line spec, with scaffolding the author built, Opus 4.8 autonomously produced a proof that passes the Lean checker. That is far from “the model can formally verify arbitrary software.” Most real systems cannot even be given that clean a spec. If you wanted to verify that a trading system “never miscomputes money,” just formalizing “computes money correctly” is an open problem. That a one-line spec can pin down the goal is the reason this problem was chosen, not a general capability the model conquered.

Do not read “eight hours of autonomous work” as efficient or cheap, either. The author says 4.8 ran about eight hours, and that verification tends to produce slower, less practical code; the first item on his “next steps” list is to measure and improve performance and simplify proofs that take detours. This is a state where the capability has arrived but engineering efficiency and cost are nowhere near mature. It shows “it can be done,” not “it is worth doing.”

Finally, do not treat the HN front page (93 points) as evidence of importance. What is worth recording is the hard test (the proof passed a deterministic checker) and the one author’s longitudinal comparison across four model generations; the upvotes only say some people found it interesting. The genuinely useful discussion on HN was not in the score but in the follow-up questions: commenters noted that the verified core uses exact rationals, not floating point (formally verifying floats is much harder), and that the WebAssembly is compiled from the Lean core while the outer JS UI is not verified. Those details are the reminder: the significance of this result only becomes clear once you pin down what was verified and under what assumptions, not from a sentence like “AI proved an algorithm.”

FAQ

Can you trust AI-written Lean proofs without human review?

In this project, yes, but the trust does not come from the model. The author read only 87 lines of specification (basic geometric definitions and the algorithm interface); the implementation and proof themselves were written autonomously by the agent and never read by a human. Correctness is checked by the deterministic Lean checker. The author also verified the proof depends only on three trusted axioms (propext, Classical.choice, Quot.sound) to stop the agent from quietly introducing an assumption. So the trust holds only if the spec itself is correct and the proof genuinely passes the deterministic checker.

When is formal verification worth the cost?

Historically only for very critical software, because the cost was high. The NASA 2021 work the author cites (computing keep-out zones for autonomous vehicles) had humans manually write 700 lemmas to produce its proof. If an agent can automate that part, the bar for verified software drops, but for now it is realistic only for problems whose goal can be stated precisely in a short specification, like polygon intersection.

What is the real gap between Opus 4.8, 4.7, and 4.5 on Lean proofs?

Per the author's log: 4.5 was the first model that could handle non-trivial Lean proofs, but only translated a rigorous proof sketch the human wrote first; 4.7 took larger steps but still needed the human to supply the key idea (e.g. Eulerian circuits) and hints for tricky special cases; 4.8 could formulate and execute large proof strategies on its own, became suspicious of likely-wrong intermediate theorems and pivoted, and at one point spawned parallel subagents to try multiple strategies. This is a longitudinal comparison within one project by one author, sample size one.

Sources

No official primary source available; this analysis is based on reliable secondary reporting (named outlets, cross-confirmed).