Dark Mode Toggle

A shallow dive into formal verification

2026 May 18 See all posts

A shallow dive into formal verification

Special thanks to Yoichi Hirai, Justin Drake, Nadim Kobeissi and Alex Hicks for feedback and review

Over the last couple of months, a new programming paradigm has been rapidly gaining traction within Ethereum's frontier research and development circles, and many other corners of computing: writing code directly either in very low-level languages (eg. EVM bytecode, assembly language) or in Lean, and verifying its correctness with automatically-checkable mathematical proofs written in Lean.

If done right, this has potential to both output extremely efficient code, and be far more secure than the way programming has been done before. Yoichi Hirai calls this the "final form of software development". This post will attempt to demystify the basics of what is going on here, what formal verification of software can do, and where its weaknesses and limits are, in Ethereum and beyond.

What is formal verification?

Formal verification refers to writing proofs of mathematical theorems in such a way that these theorems can be checked automatically. To give a reasonably simple but still interesting example, let's take a basic theorem about the Fibonacci sequence: every third number is even, the others are odd.

1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 ...

A simple way to prove this is proof by induction, walking forward three at a time.

First, the base case. Let F₁ = F₂ = 1 and F₃ = 2. By inspection, we see that the statement ("F_i is even for multiples of 3, odd otherwise") is true, up to x = 3.

Now, the inductive case. Assume that the statement is true up to 3k+3, ie. we already know that F_3k+1, F_3k+2, F_3k+3 are odd, odd, even. We can compute the parity of the next triple:

F_3k+4 = F_3k+2 + F_3k+3 = odd + even = odd

F_3k+5 = F_3k+3 + F_3k+4 = even + odd = odd

F_3k+6 = F_3k+4 + F_3k+5 = odd + odd = even

Thus, we've gone from knowing that the statement is true up to 3k+3, to knowing that the statement is true up to 3k+6. We can apply this inference over and over again, and thereby be sure that the rule holds for all integers.

This argument is convincing to a human. But what if you want to prove something a hundred times more complicated, and you want to be really really sure you didn't make a mistake? Well, you can make a proof that is convincing to a computer.

Here's what that looks like:

-- Fibonacci with fib 0 = 0, fib 1 = 1, fib 2 = 1 (indices offset by 1)
def fib : Nat → Nat
  | 0     => 0
  | 1     => 1
  | n + 2 => fib (n + 1) + fib n

-- Claim: fib (3k+1) is odd, fib (3k+2) is odd, fib (3k+3) is even.
-- Equivalently: every third Fibonacci number starting from fib 3 is even.
-- We prove all three at once by induction on k, since each case
-- of the next block is built from the previous block.

theorem fib_triple (k : Nat) :
    fib (3 * k + 1) % 2 = 1 ∧
    fib (3 * k + 2) % 2 = 1 ∧
    fib (3 * k + 3) % 2 = 0 := by
  induction k with
  | zero => decide
  | succ k ih =>
    -- Rewrite the new indices into the form (something) + 2 so fib unfolds.
    refine ⟨?_, ?_, ?_⟩
    · show (fib (3 * k + 3) + fib (3 * k + 2)) % 2 = 1
      omega
    · show (fib (3 * k + 3) + fib (3 * k + 2) + fib (3 * k + 3)) % 2 = 1
      omega
    · show (fib (3 * k + 3) + fib (3 * k + 2) + fib (3 * k + 3)
      + (fib (3 * k + 3) + fib (3 * k + 2))) % 2 = 0
      omega

It's the same reasoning, but expressed in Lean, a programming language often used for writing and verifying mathematical proofs.

This looks different than the "human" proof I gave above, and for good reason: what is intuitive to a computer (in the old sense of "computer" - a "deterministic" program made up of if/then statements - not an LLM) is very different from what is intuitive to a human. In the above proof, you're not highlighting the fact that fib(3k+4) = fib(3k+3) + fib(3k+2), rather you're highlighting the fact that fib(3k+3) + fib(3k+2) is odd, and Lean's appropriately grandiosely-named omega tactic automatically combines that with its knowledge about the definition of fib(3k+4). In more complicated proofs, you sometimes have to explicitly specify at each step what law of mathematics allows you to take the step you're taking, sometimes with obscure names like Prod.mk.inj. But on the other hand, you can expand huge polynomial expressions in one step, and justify it simply by writing single-line expressions like "omega" or "ring".

The unintuitiveness and tediousness here is a big part of why, despite the existence of machine-verifiable proofs for almost 60 years, the area has remained niche. But on the other hand, a lot of formerly impossible things are now rapidly becoming possible, thanks to the rapid progress of AI.

Verifying computer programs

So far you might be thinking: okay, so we can have computers verify proofs about mathematical theorems, so we finally know for sure which of our crazy new results about prime numbers or whatever are true, and which are just mistakes inside hundred-page long pdf papers. Maybe we'll even figure out if Shinichi Mochizuki is right about the ABC conjecture! But aside from curiosity value, so what?

There are many possible answers. But one answer that is very dear to me, is verifying correctness of computer programs, especially programs that are doing anything cryptographic or security-related. After all, a computer program is a mathematical object, and so proving that a computer program behaves in a certain way is a mathematical theorem.

Suppose, for example, you wanted to prove whether or not an encrypted messenger like Signal is actually secure. You could write down what "secure" mathematically means in this context. At a high level, you're proving that, assuming some cryptographic assumptions are true, only someone who has the private key can learn anything about the contents of the messages. In reality, there are many different security properties that matter.

Well, as it turns out, there is a group trying to figure out exactly this! This is what one of their security theorems looks like:

theorem passive_secrecy_le_ddh
    (g : G)
    (adv : PassiveAdversary G SK) :
    passiveSecrecyAdvantage (F := F) g adv ≤
    ProbComp.boolDistAdvantage
      (DiffieHellman.ddhExpReal (F := F) g (ddhReduction adv))
      (DiffieHellman.ddhExpRand (F := F) g (ddhReduction adv))

And here's how Leanstral summarizes what this means:

The passive_secrecy_le_ddh theorem is a tight reduction showing that X3DH's
passive message secrecy is at least as hard as the DDH assumption under the Random Oracle Model.

If an adversary can break X3DH's passive message secrecy, then they
can also break DDH. Since DDH is assumed to be hard, X3DH is also secure against passive attacks.

This theorem proves that if an adversary can passively observe Signal's key
exchange messages, they cannot distinguish the resulting session key from a random key with better than negligible probability.

If you combine this with a proof that the AES encryption is implemented correctly, you get a proof that the Signal protocol's encryption is secure against passive attackers.

Similar verification projects exist that prove secure implementations of TLS and other pieces of in-browser cryptography.

If you formally verify end-to-end, then you are proving not just that some description of the protocol is secure in theory, but that the specific piece of code that the user runs is secure in practice. And from a user's perspective, this greatly improves trustlessness: in order to fully trust the code, you don't need to check over the entire code, you simply need to check over the statements that are proven about it.

Now, there are some big important asterisks to keep in mind here, particularly with regard to what that all-important word "secure" actually means. It's very easy to forget to prove the claims that are actually important. It's very easy to find situations where there is no simpler description of the claims that need to be proven than the code itself. It's very easy to sneak in assumptions into the proofs that end up not actually true. And it's very easy to decide that only one part of the system really needs to be formally proven, but then still get hit by a critical bug in the other parts (or even hardware). Even the Lean implementation itself can have bugs. But before we talk about all of these annoying nuances, let's first dig deeper into the kind of utopia that formal verification done ideally and correctly could potentially bring.

Formal verification for security

Bugs in computer code are scary.

Bugs in computer code become more scary when you put cryptocurrency into immutable onchain smart contracts from which North Korea can automatically drain all your money with no recourse if there's a bug in the code.

Bugs in computer code become even more scary when this all gets wrapped in zero-knowledge proofs, so if someone manages to hack the zero-knowledge proof system, they can extract all the money, and we have no idea what went wrong - or worse, when something has gone wrong.

Bugs in computer code become even even more scary when we have powerful AI models, like Claude Mythos but after two more years of improvement, that can automate discovering them.

Some people are responding to this reality by advocating giving up on the fundamental idea of smart contracts, or even that the cyber domain even can be a domain where defenders can have an asymmetric advantage against attackers.

Some quotes:

to harden a system you need to spend more tokens discovering exploits than attackers will spend exploiting them

And:

We built our profession around deterministic code. Write it, test it, ship it, know it works - but in my experience that contract is breaking. Inside the top few percent of operators at truly AI-native companies, the codebase has started to become something you believe works, with a probability you can no longer precisely state.

Worse, some think that the only solution is to give up on open source.

This would be a bleak future for cybersecurity. It's especially an extremely bleak future for those of us who care about internet decentralization and freedom. The entire cypherpunk ethos is fundamentally based on the idea that on the internet, the defender has an advantage: it's much easier to build a digital "castle" (whether that means encryption, signatures or proofs) than to destroy one. If we lose that, then internet security can only come from economies of scale, from going all over the world to chase down possible attackers, and more generally from a binary choice between domination and doom.

I disagree, and I have a much more optimistic vision of the future of cybersecurity.

I think the challenge brought by powerful AI bug-finding is a serious challenge, but it is a transitional challenge. Once the dust settles and we get into the new equilibrium, we will get to something far more defense-favoring than what we had before.

Mozilla agrees with me. Quoting them:

You may need to reprioritize everything else to bring relentless and single-minded focus to the task, but there is light at the end of the tunnel. We are extremely proud of how our team rose to meet this challenge, and others will too. Our work isn't finished, but we've turned the corner and can glimpse a future much better than just keeping up. Defenders finally have a chance to win, decisively.

...

The defects are finite, and we are entering a world where we can finally find them all.

Now, if you Ctrl+F for the words "formal" and "verification" in Mozilla's post, you will find zero hits for either. The positive future of cybersecurity does not strictly depend on formal verification, or for that matter any other single technology.

What does it depend on? Basically, this chart:

There are many technologies that have contributed to number go down over the decades:

Type systems
Memory-safe languages
Improvements in software architecture (including sandboxing, permissions, and more generally explicitly thinking about "trusted computing base" vs "other code")
Better testing methodologies
Growing bodies of knowledge about safe and unsafe coding patterns
Growing bodies of pre-written and audited software libraries

Formal verification, aided by AI, should be viewed not as totally new paradigm, but as a powerful accelerant of a trend and a paradigm that was already marching forward.

Formal verification is not a panacea. But it is particularly well-suited for situations where the goal is much simpler than the implementation. This is particularly true in some of the most devilishly hard pieces of technology that we will need to deploy in the next major iteration of Ethereum: quantum-resistant signatures, STARKs, consensus algorithms, and ZK-EVMs.

A STARK is a very complex piece of software. But the core security property that it achieves is simple to understand and formalize: if you see a proof that points to a hash H of program P, input x, and output y, then either (i) the hash algorithm used in the STARK is broken, or (ii) P(x) = y . And so we have the Arklib project, which is trying to create a fully formally-verified implementation of a STARK (see also, VCV-io, which provides the foundational oracle computation infrastructure that can be used to formally verify all kinds of other cryptographic protocols, many of which are dependencies in a STARK). is formally-verifying all kinds of other cryptographic algorithms, many of which are dependencies in a STARK).

Even more ambitious is evm-asm: a project building an entire EVM implementation that is formally verified. Here, the security property is not so clean: basically, the goal is to prove equivalence to a different implementation of the EVM written in Lean - though the implementation can be written to maximize intuitiveness and readability without regard for concrete efficiency. It's possible that we'll have ten implementations of the EVM, all provably equivalent to each other, all with the same fatal flaw that somehow lets an attacker drain all the ETH from an address they have no permissions for. But that's vastly less likely than one EVM implementation having that kind of flaw today. And another security property whose importance we came to understand only after painful experience, DoS resistance, is easy to specify.

Two other important areas are:

Byzantine fault-tolerant consensus. Here too, it's difficult to formally specify all the desired security properties, but given how prevalent bugs have been it's worth trying - and so we have work-in-progress implementations and proofs of Lean consensus in Lean.
Smart contract programming languages: see formal verification in Vyper and Verity.

A huge part of the value-add in all of these cases is that the proofs are truly end-to-end. Often, the nastiest bugs are interaction bugs, that sit at the edge of two sub-systems that are considered separately. For a human, it's too difficult to reason about the entire system end-to-end. But an automated rule-checking system can.

Formal verification for efficiency

Let's look again at evm-asm. This is an EVM implementation. But it's an EVM implementation written directly in RISC-V assembly.

Literally.

Here's the ADD opcode:

import EvmAsm.Rv64.Program

namespace EvmAsm.Evm64

open EvmAsm.Rv64

/-- 256-bit EVM ADD: binary, pops 2, pushes 1.
    Limb 0: LD, LD, ADD, SLTU (carry), SD (5 instructions).
    Limbs 1-3: LD, LD, ADD, SLTU (carry1), ADD (carryIn), SLTU (carry2), OR (carryOut), SD (8 each).
    Then ADDI sp, sp, 32.
    Registers: x12=sp, x7=acc, x6=operand, x5=carry, x11=carry1. -/
def evm_add : Program :=
  -- Limb 0 (5 instructions)
  LD .x7 .x12 0 ;; LD .x6 .x12 32 ;;
  ADD .x7 .x7 .x6 ;; SLTU .x5 .x7 .x6 ;; SD .x12 .x7 32 ;;
  -- Limb 1 (8 instructions)
  LD .x7 .x12 8 ;; LD .x6 .x12 40 ;;
  ADD .x7 .x7 .x6 ;; SLTU .x11 .x7 .x6 ;;
  ADD .x7 .x7 .x5 ;; SLTU .x6 .x7 .x5 ;;
  OR' .x5 .x11 .x6 ;; SD .x12 .x7 40 ;;
  -- Limb 2 (8 instructions)
  LD .x7 .x12 16 ;; LD .x6 .x12 48 ;;
  ADD .x7 .x7 .x6 ;; SLTU .x11 .x7 .x6 ;;
  ADD .x7 .x7 .x5 ;; SLTU .x6 .x7 .x5 ;;
  OR' .x5 .x11 .x6 ;; SD .x12 .x7 48 ;;
  -- Limb 3 (8 instructions)
  LD .x7 .x12 24 ;; LD .x6 .x12 56 ;;
  ADD .x7 .x7 .x6 ;; SLTU .x11 .x7 .x6 ;;
  ADD .x7 .x7 .x5 ;; SLTU .x6 .x7 .x5 ;;
  OR' .x5 .x11 .x6 ;; SD .x12 .x7 56 ;;
  -- sp adjustment
  ADDI .x12 .x12 32

end EvmAsm.Evm64

RISC-V was chosen because the ZK-EVM provers being built generally work by proving RISC-V and compiling Ethereum clients to RISC-V. So if you have an EVM implementation written in RISC-V directly, this should be the fastest possible implementation that you can get. RISC-V can also be emulated inside regular computers very efficiently (and RISC-V laptops exist). Of course, to be truly end-to-end, you have to formally verify the RISC-V implementation (or prover arithmetization) itself, but don't worry - that exists too.

Writing code directly in assembly is the sort of thing we used to do all the time, fifty years ago. Since then, we've moved away from that, toward writing code in high-level languages. High-level languages take a penalty in efficiency, but in exchange they are must faster to write code in, and importantly, much faster to understand other people's code - a necessary thing for security.

With the combination of formal verification and AI, we have an opportunity to go "back to the future". Specifically, we have AI write the assembly, and then write a formal proof verifying that the assembly has the desired properties. At the very least, the desired property can just be perfect equivalence to an implementation optimized for readability and written in some human-friendly high-level language.

Instead of there being a single code object that has to balance between readability and efficiency, we have two separate objects: one (the assembly implementation) that optimizes for efficiency only, taking into account the needs of the specific environment it's executing in, and another (the security claims, or the high-level-language implementation) that optimizes for readability only, and then we have a mathematical proof that proves the equivalence between the two. A user can (automatically) verify this proof once, and then from that point forward, they just run the fast version.

This is powerful, and there is a reason why Yoichi Hirai has called it "the final form of software development".

Formal verification is not a panacea

There is a tradition in cryptography and computer science that is almost as old as the tradition of formal methods: the tradition of criticizing formal methods (or reliance on "proofs" more generally). This literature is rich with practical examples. Let's start with hand-written proofs from the older and simpler ages of cryptography, here criticized by Menezes and Koblitz in 2004:

In 1979 Rabin [51] produced an encryption function that ... was
in some sense "provably" secure, that is, it had a reductionist security property.
Reductionist security claim. Someone who can find messages m from the
ciphertext y must also be able to factor n.

...

Soon after Rabin proposed his encryption scheme, Rivest (see [63]) pointed
out that, ironically, the very feature that gave it an extra measure of security
would also lead to total collapse if it were confronted with a different type
of adversary, called a "chosen-ciphertext" attacker. Namely, suppose that the
adversary could somehow fool Alice into decrypting a ciphertext of its own
choosing. The adversary could then follow the same procedure that Sam used
in the previous paragraph to factor n.

Menezes and Koblitz proceed to give a few more examples. The common pattern: designing cryptographic protocols around making them more "provable" often makes them less "natural", which makes it more likely that they break down in some situation that the designer did not even consider.

Now, let's go back to machine-verifiable proofs and code. Here's a paper from 2011 finding bugs in formally verified C compilers:

The second CompCert problem we found was illustrated by two bugs that resulted in generation of code like this:

stwu r1, -44432(r1)

Here, a large PowerPC stack frame is being allocated. The problem is that the 16-bit displacement field is overflowed. CompCert's PPC semantics failed to specify a constraint on the width of this immediate value, on the assumption that the assembler would catch out-of-range values.

And a paper from 2022:

In CompCert-KVX, commit e2618b31 fixed a bug— "nand" instructions would be printed as "and"; "nand" is selected only for the rare ~(a & b) pattern. The bug was found by compiling randomly generated programs.

And today, in 2026, here's Nadim Kobeissi describing vulnerabilities in formally-verified software in Cryspen:

In November 2025, Filippo Valsorda independently reported [50] that libcrux-ml- dsa v0.0.3 produced different public keys and signatures on different platforms given identical deterministic inputs. The bug resided in the _vxarq_u64 intrin- sic wrapper (Listing 1.1), which implements the XAR operation used in SHA-3's Keccak-f permutation. The fallback passed incorrect arguments to the shift oper- ations, corrupting SHA-3 digests on ARM64 platforms without hardware SHA-3 support. This is a Type I failure: the intrinsic was marked ^[1], and the entire NEON backend had no completed proofs for runtime safety or correctness

And:

The libcrux-psq crate [13] implements a post-quantum pre-shared-key protocol.
In the decrypt_out method, the AES-GCM 128 decryption path calls .unwrap()
on the decryption result instead of propagating the error (Listing 1.3). A single
malformed ciphertext crashes the process

The above four issues all fall into two types:

Situations where only part of the code is being verified (because verifying the rest is too hard), and it turns out that the non-verified code is more buggy (and in more critical ways) than the author imagined
Situations where the author forgets to specify a critical property that needs to be proven.

Nadim's post contains a categorization of formal verification of failure modes; he also gives other types of failure modes (eg. another major one is situations where "the formal specification itself is wrong, or the proofs contain false claims that are silently accepted by the build system").

Finally, we can look at formal verification failures at the boundary of software and hardware. A common issue here is verifying side-channel attack resistance. Even if you have a perfectly secure form of encryption protecting your messages, if someone a few meters away can pick up on electrical fluctuations and extract your private key after a few hundred thousand encryptions, you're still insecure. Here's an article on "differential power analysis", a now well-understood example of such a technique.

Differential power analysis is a common type of side channel attack. Source: Wikipedia

There have been attempts to prove security against such attackers. However, any such proof requires some kind of mathematical model of an attacker, that you can prove security against. Sometimes, the "d-probing model" is used: we assume there is a known limit on how many locations in a circuit an attacker can query. However, there are forms of leakage that such a model does not capture.

A common issue, observed in this article, is transitional leakage: if you can observe a signal that depends not on the value at a location but on the change in that value, then often that is enough to recover the information you need from two values (the old and the new) rather than just one. This article gives a taxonomy of other forms of leakage.

These lines of criticism of formal verification have, over the decades, helped to make formal verification much better. We are much better at watching out for these kinds of issues than we used to be. But even today, it's not perfect.

Zooming out, there is a common thread here. Formal verification is powerful. But regardless of the marketing that makes it sound like formal verification gives you "provable correctness", so-called "provable correctness" fundamentally does not prove that software (or hardware) is "correct".

As understood by most human beings, "correct" means something like: "the behavior of the thing matches the user's understanding of the developer's intentions".

And "secure" means something like: "the behavior of the thing does not deviate from the user's expectations in a way that is adverse to the user's interests".

In both cases, correctness and security boil down to a comparison between a mathematical object and human intention or expectation. Human intention and expectation are technically mathematical objects - after all, human brains are part of the universe, which follows laws of physics that you could simulate if you had enough compute. But they are incredibly complex mathematical objects, that neither computers or we ourselves can understand or even read. For all intents and purposes, they are black boxes; we only understand anything about our intentions and expectations at all because we each have many years of experience observing our own thoughts and making inferences about the thoughts of others.

And because we cannot stick raw human intention inside a computer, formal verification has no way to prove comparisons against human intentions. And so, "provable correctness" and "provable security" are not, in fact, proving what we human beings understand by "correctness" and "security" - until we can simulate human brains, nothing can.

So what is the helpful thing that formal verification is doing?

I like to view test suites, type systems and formal verification, as all being different implementations of the same underlying approach to programming language safety - probably the only approach that makes any sense. They are all about redundantly specifying our intentions in different ways, and then automatically checking that these different specifications are compatible with each other.

Take, for example, this python code:

def fib(n: int) -> int:
    if n < 0:
        raise Exception("Negative values not supported")
    elif 0 <= n < 2:
        return n
    else:
        return fib(n-1) + fib(n-2)

if __name__ == '__main__':
    assert [fib(i) for i in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
    assert fib(15) == 610

Here, you are expressing your intention in three different ways:

Explicitly, by implementing the Fibonacci formula in code
Indirectly, via a type system (specifying that the input, output and recursive intermediate steps are integers)
Via a "bag of examples" approach: the test cases

Running the file checks the formula against the examples. A type checker can verify that the types are compatible: adding two integers together is a valid thing to do, and produces another integer. Type systems are often a good way to check your work in physics: if you are computing an acceleration, and you get an answer whose units are \(\frac{m}{s}\) and not \(\frac{m}{s^2}\), you know you've done something wrong. And a bag-of-examples "definition", which test cases are an instance of, is often just a much more natural way for humans to deal with concepts than a direct explicit definition.

The more different ways in which you can specify your intent, ideally very different ways that require you to approach the issue with different styles of thinking, the more likely it is that, if all of those expressions prove compatible with each other, you actually expressed the thing that you wanted.

Safe programming is about expressing your intention in multiple different ways, and then automatically verifying that the expressions are all compatible with each other.

Formal verification lets you extend this approach even further. With formal verification, you can specify your intent in an almost arbitrary number of different redundant ways, and the program only verifies if they are all compatible. You can specify an optimized implementation and a very inefficient but human-readable implementation, and verify that they match. You can ask ten of your friends to provide lists of mathematical properties that your program should have, and then check if it passes them all - and if it doesn't, figure out if the program is wrong or the mathematical property. And you can use AI to do all of the above extremely efficiently.

So how do I start?

Realistically, you won't be writing proofs yourself. The entire reason why formal methods have not taken off is that most people can't wrap their heads around writing the damn things. Can you tell me what the code below means?

  /-- Helper: pointwise ≤ at the foldl level with an accumulator. -/
private theorem foldl_acc_le (ds1 ds2 : List Nat) (w : Nat) (a b : Nat) (hAcc : a ≤ b)
    (hLE : Forall₂ (· ≤ ·) ds1 ds2) :
    List.foldl (λ acc d => acc * w + d) a ds1 ≤
    List.foldl (λ acc d => acc * w + d) b ds2 := by
  match ds1, ds2, hLE with
  | [], [], .nil => exact hAcc
  | d1::ds1', d2::ds2', .cons hd htl =>
    simp [List.foldl]
    refine foldl_acc_le ds1' ds2' w (a * w + d1) (b * w + d2) ?_ htl
    exact Nat.add_le_add (Nat.mul_le_mul hAcc (Nat.le_refl _)) hd

(If you want to know, it's one of the many sub-lemmas in this proof of one particular security claim about a variant of SPHINCS signaures - namely, that, unless there is a hash collision, a signature for one message will require a value higher up on at least one hash ladder than a signature for any other message - and hence requires information that cannot be computed from that other signature)

Instead of writing code and proofs by hand, you ask AI to write the program for you (either writing in Lean directly, or if you want speed, writing in assembly), and prove any desired properties along the way.

One of the nice things about this task is that it's inherently self-verifying, and so you don't need to supervise - you can just let the AI run for hours all on its own. The worst thing that can happen is that it will go in circles without making progress (or, as my leanstral did on one occasion, decide to make its own job easier by replacing the statement that it's asked to prove). The thing you need to check at the end is that the statement proven matches what you want.

In the SPHINCS signature variant, here's the final statement:

theorem wots_fullDigits_incomparable
    {dig1 dig2 : List Nat} {w l1 l2 : Nat}
    (hw : 0 < w)
    (hLen1 : dig1.length = l1) (hLen2 : dig2.length = l1)
    (hBound1 : ∀ d ∈ dig1, d < w) (hBound2 : ∀ d ∈ dig2, d < w)
    (hL2suff : l1 * (w - 1) < w ^ l2)
    (hNeq : dig1 ≠ dig2) :
    ¬ Forall₂ (· ≤ ·) (wotsFullDigits dig1 w l1 l2) (wotsFullDigits dig2 w l1 l2) ∧
    ¬ Forall₂ (· ≤ ·) (wotsFullDigits dig2 w l1 l2) (wotsFullDigits dig1 w l1 l2)

This is actually bordering on the edge of readability:

IF the digits generated from one hash digest (dig1) are not equal to the digits generated from another hash digest (dig2)

THEN it's not true that

for all digits, dig1's digit <= dig2's digit
for all digits, dig2's digit <= dig1's digit

in the "extended digits" generated by adding a checksum (wotsFullDigits). That is, inevitably in some places the digits in the extension of dig1 will be higher and in other places the digits in the extension of dig2 will be higher.

For writing proofs with LLMs, I've found Claude to be sufficient, and Deepseek 4 Pro to be sufficient. Leanstral, a smaller open-weights model specifically fine-tuned for writing Lean, is a promising alternative; with 119B params, 6B active per token, you can run it locally albeit slowly (I get ~15 tok/sec on my laptop). According to benchmarks, Leanstral outperforms much larger general-purpose models:

In my personal experience so far, it's somewhat worse than Deepseek 4 Pro, but still effective.

Formal verification will not solve all our problems. But if we want to have a model of internet security that is not based on everyone trusting a few powerful organizations, we need to be able to instead trust code - including trusting code in the face of powerful AI adversaries. AI-assisted formal verification lets us take large steps in doing just that.

Like blockchains and ZK-SNARKs, AI and formal verification are very complementary technologies.

Blockchains give you open verifiability and censorship resistance at the cost of privacy and scalability, and ZK-SNARKs give you back ... privacy and scalability (in fact, even more than you had before).

AI gives you the ability to write large volumes of code at the cost of accuracy, and formal verification gives you back ... accuracy (in fact, even more than you had before).

By default, AI will enable large amounts of very sloppy code, and the number of bugs will increase. Indeed, in some situations it's the correct tradeoff to let the bugs increase: if the bugs are mild, even buggy software is better than not having that software. But there is an optimistic future for cybersecurity here: software will (continue to) split into "insecure edge pieces" around a "secure core".

The insecure edge pieces will be run in sandboxes, and trusted with the minimum authority they need to do their job. The secure core will administer everything. If the secure core breaks, everything breaks - your personal data, your money, and more. But if some insecure edge breaks, the secure core protects you.

When it comes to the secure core, we don't let the buggy code multiply. We act aggressively to keep the size of the secure core small, and indeed even shrink it further. Instead, we put the entire extra horsepower that AI brings into making the secure core more secure, so that it can handle the extremely high trust burdens we are putting on it in a highly digitized society.

The kernel of your operating system (or at least part of it) will be one such secure core. Ethereum will be another. Hopefully, the hardware you use at least for all non-performance-intensive computations will be a third. Something involving IoT will be a fourth. At least here in these secure cores, the old maxim that bugs are inevitable and you can only strive to find them before the attacker does will become false, replaced by a more hopeful world where you get actual security. But if you want to give up your assets and data to software that is written poorly and may accidentally send it all into a black hole, well, you'll have that freedom too.