A GKR Tutorial

2025 Oct 19



Special thanks to Lev Soukhanov, Zhenfei Zhang and Zachary Williamson for feedback and review

If you're following the "cryptography side of crypto", at this point you've likely heard a lot about ultra-fast ZK-provers: ZK-EVM provers proving the Ethereum L1 in real-time with only about fifty consumer GPUs, people proving 2 million Poseidon hashes per second on consumer laptops, and zk-ML systems proving LLM inference with increasing speed.

In this post, I will explain in detail a family of protocols that is behind the extreme speed of many of these proving systems: GKR.

I will focus on a GKR implementation for proving Poseidon hashes (and other computations that follow a similar pattern). To see GKR in the context of more generic circuit evaluation, see Justin Thaler's note and this Lambdaclass post.

What is GKR and why is it so fast?

Suppose that you have a computation that is "big" in two dimensions: it involves processing data through at least a medium number of (low-degree) "layers", and it involves applying the same function to a large number of inputs. Something like this:



As it turns out, a lot of the biggest computation we do fits into this kind of pattern. Cryptographers will notice that many computationally intensive proving operations involve proving lots of hashes, and each hash internally has that exact kind of internal structure. AI researchers will notice that a neural network, the core building block of an LLM, has exactly that internal structure (both because you can prove inference of many tokens in parallel, and because within each token, you have a structure mixing element-wise neural layers with global matrix multiplication layers, which technically violate the cross-input independence in the diagram above but in practice are very easy to put into a GKR system).

GKR is a cryptographic scheme that is tailor-made for this pattern. It achieves its efficiencies by avoiding the need to make commitments to any of the intermediate layers: you only need to commit to the inputs and the outputs. "Commitments" here mean putting the data into a cryptographic data structure, either a KZG or a Merkle tree, that allows you to prove claims about specific pieces of that data. The cheapest commitment, a Merkle tree of an erasure coding (used in STARKs), requires you to hash 4-16 bytes for each byte you commit to: hundreds of additions and multiplications when the underlying operation you're proving at that step might be a single multiplication. GKR avoids all of this, except at the very beginning and the very end.

Note that GKR is not "zero knowledge": it only handles succinctness, not privacy. If you want zero knowledge, wrap the GKR proof in a ZK-SNARK or ZK-STARK.

Sumchecks: the core building block

Suppose you have a multivariate polynomial \(P(x_1, x_2 ... x_N)\), which is low-degree in each dimension. You want to prove the sum of evaluations of that polynomial across a hypercube - that is, you want to prove the value of \(\sum_{i_k \in \{0,1\}} P(i_1, i_2 ... i_N)\). There is a standard technique that reduces this problem to that of evaluating \(P\) at a random coordinate. That is, you can convert an "obligation" to prove \(\sum_{i_k \in \{0,1\}} P(i_1, i_2 ... i_N)\) into an obligation to prove \(P(c_1, c_2 ... c_N)\), for some \(c_1, c_2 ... c_N\) that get randomly generated as part of the proving process. Here is what it looks like for a multilinear polynomial (one that is linear in each variable; so \(x_ix_j\) is allowed but not eg. \(x_i^2\)):



At each step, the prover provides a sum for each half of the current hypercube. These sums (and any previous randomness) get hashed together and used as randomness to choose an evaluation coordinate. The prover then uses a simple linear combination of the two halves of the (dimension N) hypercube to compute the half-size (dimension N-1) hypercube of evaluations where the first coordinate is set to the evaluation coordinate that was just randomly chosen. In the diagram above, the prover starts with evaluations at ((0,0,0), (0,0,1) ... (1,1,1)), and takes the \(x_1 = 0\) half plus 6 times the difference between the \(x_1 = 1\) half and the \(x_1 = 0\) half to compute the half-size hypercube of evaluations at ((6,0,0), (6,0,1), (6,1,0), (6,1,1)).

This is a standard "random linear combination" trick: proving that a linear claim (in this case, a sum) is true for multiple values (here, two halves of the hypercube) by proving it for a random linear combination of those values. A similar idea is used in bulletproofs. And like in bulletproofs, we use the random linear combination trick recursively: The prover continues turning a claim about a hypercube into a claim about a hypercube half the size, until the "hypercube" shrinks to one single value. This value is the evaluation of the original multivariate polynomial at a random point (here, (6, 4, 7)). If the prover can prove that evaluation, then they've proven the original sum.
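To make the mechanics concrete, here is a minimal sketch of this sumcheck for a multilinear polynomial over a small prime field. Everything here (the modulus, the array layout with the first variable as the high-order index bit, and the use of random.randrange in place of Fiat-Shamir hashing) is an illustrative choice of mine, not a fixed part of the protocol.

import random

MOD = 89  # the same small prime modulus used in the worked examples later in this post

def sumcheck_multilinear(evals):
    """evals: a list of length 2**n holding the polynomial's evaluations over the
    hypercube, with the first variable as the high-order bit of the index."""
    claimed = sum(evals) % MOD                   # the sum being proven
    point = []
    while len(evals) > 1:
        half = len(evals) // 2
        lo, hi = evals[:half], evals[half:]      # the x_1 = 0 half and the x_1 = 1 half
        s0, s1 = sum(lo) % MOD, sum(hi) % MOD    # the prover sends these two sums
        assert (s0 + s1) % MOD == claimed        # the verifier checks them against the claim
        c = random.randrange(MOD)                # challenge (Fiat-Shamir hashing in practice)
        point.append(c)
        # both sides fold: f(c, rest) = (1-c) * f(0, rest) + c * f(1, rest)
        evals = [((1 - c) * a + c * b) % MOD for a, b in zip(lo, hi)]
        claimed = ((1 - c) * s0 + c * s1) % MOD  # the sum of the folded hypercube
    # what remains is the claim that the polynomial, evaluated at `point`, equals
    # `claimed`; proving that is handled elsewhere
    return point, claimed

point, value = sumcheck_multilinear([random.randrange(MOD) for _ in range(8)])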

A sumcheck is not useful by itself: after all, it simply converts one unproven claim about a bunch of numbers into another unproven claim about a bunch of numbers. But it is an incredibly useful and powerful ingredient in a whole bunch of proof systems, including GKR. It is also very efficient: particularly, the prover does not need to commit to anything, and the prover and verifier only need to hash a small constant amount of data per round to generate the random coordinates.

Now, let's get into GKR. For this post, we'll focus on a single type of computation: proving a large number of Poseidon2 hashes. The Poseidon2 hash function is quite simple; here's a full graphical description of the core permutation function that it uses:



Essentially, you do many rounds that alternate between multiplying the whole data by a matrix, and cubing and adding a constant to each element. The rounds in the middle are "partial rounds", where both operations are simplified: the matrix is an "easy" matrix (\(diag + J\) in the diagram above) that can be evaluated as \(x_i \rightarrow x_i * d_i + sum(x_0, x_1 ... x_{15})\), and only the first value gets cubed. This combination of full and partial rounds has been deemed optimal for the tradeoff between performance and security, but it's not essential to the construction: you could do just full rounds with a random matrix if you wanted to.
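To make the round structure concrete, here is a rough sketch of one full round and one partial round over a prime field. The 16x16 matrix, the diagonal entries \(d_i\) and the round constants are placeholders that a real Poseidon2 instance would pin down to specific values; as the next paragraph notes, the exact details vary between versions.

def full_round(state, matrix, constants, p=2**31 - 2**24 + 1):
    # multiply the whole 16-element state by the 16x16 matrix...
    state = [sum(matrix[i][j] * state[j] for j in range(16)) % p for i in range(16)]
    # ...then cube each element and add a round constant
    return [(x**3 + c) % p for x, c in zip(state, constants)]

def partial_round(state, diag, constant, p=2**31 - 2**24 + 1):
    # the "easy" matrix diag + J:  x_i -> x_i * d_i + sum(x_0, x_1 ... x_15)
    s = sum(state) % p
    state = [(x * d + s) % p for x, d in zip(state, diag)]
    # only the first value gets cubed (with a constant added)
    state[0] = (state[0]**3 + constant) % p
    return state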

There are many versions and parameter choices of Poseidon2; the above is not meant to accurately depict any specific live implementation, rather it shows the general pattern.

GKR is well-suited to prove a large number of executions like this in parallel. Let us go through how it works. We will start with a simplified version of Poseidon, where there are no partial rounds and no matrices - instead, you just do repeated \(x \rightarrow x^3 + r_i\). GKR works by running through the computation backwards:

Here's the top layer. For cleaner exposition (ie. to avoid the numbers blowing up in size), in this example we will do all the math modulo 89. In reality, \(V\) is over a 32-bit prime field (often KoalaBear (\(2^{31} - 2^{24} + 1\))), and \(p_i\) (and hence \(W\) later on in this post) are over a degree-4 extension field (often represented as elements of the form \(x = a_0 + a_1v + a_2v^2 + a_3v^3\) where \(v^4 = 3\)) to ensure (close to) 128-bit security.
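As a concrete reference for the \(V_i\) layer notation used below, here is a minimal sketch (in the simplified setting, and mod 89 as in the example) of the forward computation whose layers GKR will then walk through backwards. The inputs and round constants would just be whatever the statement being proven specifies.

def compute_layers(inputs, round_constants, p=89):
    # layers[0] is the list of inputs; each later layer applies x -> x^3 + r_i
    # elementwise, so layers[i] is the hypercube of values that V_i extends
    layers = [list(inputs)]
    for r in round_constants:
        layers.append([(x**3 + r) % p for x in layers[-1]])
    return layers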



Note that we have not yet done any sumchecks. The first sumcheck we do will actually be on the second-last layer (ie. \(V_{31}\)). And to do this we will need to introduce some extra machinery. Particularly, the goal is not to prove the basic sum \(\sum_{i_k \in \{0,1\}} V_{31}(i_1, i_2 ... i_N)\). Instead, we are proving \(\sum_{i_k \in \{0,1\}} (V_{31}(i_1, i_2 ... i_N)^3 + r_{31}) * W_{31}(i_1, i_2 ... i_N)\).

\(W_{31}\) here is the "weights" that correspond to "evaluate at \(p_{32}\)". Any evaluation of a multilinear polynomial can be represented by weights. For example, consider the 2D multilinear polynomial with evaluations [[1, 5], [3, 9]]:



Now, let's say we want the evaluation at (4, 6). We can compute this! An easy way is to take the difference (eg. \(x_{01} - x_{00}\)) along a horizontal or vertical line and re-apply it over and over to "extend" that line:



Here, [[15, -18], [-20, 24]] is the "weights" corresponding to "evaluate at (4, 6)". We can also compute the weights via a nice formula: \([[(1-4) * (1-6), (1-4) * 6], [4 * (1-6), 4 * 6]]\). A similar thing works in higher dimensions.
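Here is a quick sketch checking the 2x2 example numerically; the dictionary layout is just my own way of indexing the four corners.

# corner evaluations of the example polynomial, and the target point (4, 6)
evals = {(0, 0): 1, (0, 1): 5, (1, 0): 3, (1, 1): 9}
x, y = 4, 6
# the weights formula: corner (i, j) gets the product of an "x-side" factor
# (x if i = 1, 1-x if i = 0) and a "y-side" factor (y if j = 1, 1-y if j = 0)
weights = {(i, j): (i * x + (1 - i) * (1 - x)) * (j * y + (1 - j) * (1 - y))
           for i in (0, 1) for j in (0, 1)}
# weights == {(0,0): 15, (0,1): -18, (1,0): -20, (1,1): 24}, matching [[15, -18], [-20, 24]]
value = sum(weights[c] * evals[c] for c in evals)  # 81, the evaluation at (4, 6)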

Going back to GKR. The expression we wanted to prove the value of is \(\sum_{i_k \in \{0,1\}} (V_{31}(i_1, i_2 ... i_N)^3 + r_{31}) * W_{31}(i_1, i_2 ... i_N)\). Because \(W_{31}\) is the weights corresponding to "evaluate at \(p_{32}\)", this expression is exactly the same thing as saying "evaluate \(V_{31}^3 + r_{31}\) at \(p_{32}\)", which is itself the same thing as \(V_{32}(p_{32})\). Now, we do the sumcheck. It's like we did before, except at each step we provide partial sums not of the values \(V_{31}\) themselves, but of the expression \((V_{31}^3 + r_{31}) * W_{31}\). Additionally, there is another nuance: because this is a degree-4 expression (degree 3 coming from \(V_{31}^3\) and degree 1 coming from \(W\)), we need to provide five evaluations at each step, instead of just the left and right side.

This is all complex, so to avoid adding even more pain let's just set \(r_{31}\) to zero so we can just focus on the cubing. Here we go:



As the bot loves to say, let's go through this step by step. The prover starts off with the claim \(sum(V_{31}^3 * W_{31}) = 83\) (remember: this is the same thing as saying \(V_{32}(p_{32}) = 83\)). The prover computes and provides the five partial sums of \(sum(V_{31}^3(x, ...) * W_{31}(x, ...))\) for \(x \in \{0,1,2,3,4\}\). The prover gets a random coordinate \(c_0 = 9\), and computes the half-sized hypercubes \(V_{31}(9, ...)\) and \(W_{31}(9, ...)\). The prover then repeats.
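Here is a sketch of what a single prover round of this sumcheck might look like, mod 89, with \(V\) and \(W\) stored as flat lists and (by my own convention here) the variable being folded taken as the high-order index bit. In the real protocol the challenge \(c\) is derived by hashing the five sums; here it is simply passed in.

def cubic_sumcheck_round(V, W, c, p=89):
    half = len(V) // 2
    V0, V1 = V[:half], V[half:]
    W0, W1 = W[:half], W[half:]
    # extend V(x, ...) and W(x, ...) to x = 0..4 by linear interpolation, and
    # provide the five partial sums of V(x, ...)^3 * W(x, ...)
    sums = []
    for x in range(5):
        Vx = [((1 - x) * a + x * b) % p for a, b in zip(V0, V1)]
        Wx = [((1 - x) * a + x * b) % p for a, b in zip(W0, W1)]
        sums.append(sum(v**3 * w for v, w in zip(Vx, Wx)) % p)
    # once the challenge c is known, both hypercubes get folded to half size
    V_next = [((1 - c) * a + c * b) % p for a, b in zip(V0, V1)]
    W_next = [((1 - c) * a + c * b) % p for a, b in zip(W0, W1)]
    return sums, V_next, W_next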

The verifier is able to check each step against the previous step. Particularly, the verifier checks that the first two sums provided in each round add up to the previous round's total (\(sum_0 + sum_1 = total\)), and then computes the next round's total as the degree-4 Lagrange interpolation of the five provided sums, evaluated at that round's random coordinate.

Here's the code to compute the Lagrange coefficients:

from fractions import Fraction

def deg4_lagrange_weights(x):
    """
    Return weights w_0, w_1, w_2, w_3, w_4 such that
    for any degree-4 polynomial p,
    p(x) = sum_k w_k * p(k)  with k in {0, 1, 2, 3, 4}.
    """
    nodes = (0, 1, 2, 3, 4)
    denoms = (24, -6, 4, -6, 24)  # Π_{m≠k} (k - m) for k = 0,1,2,3,4
    coeffs = []
    for idx, k in enumerate(nodes):
        num = 1
        for m in nodes:
            if m != k:
                num *= (x - m)
        # exact rational arithmetic; to work in a prime field, reduce the
        # numerator and the inverted denominator modulo the field prime
        coeffs.append(Fraction(num, denoms[idx]))
    return tuple(coeffs)

You can check the example above yourself, and see that the linear combination of the values in each round with coefficients generated by this equals the next round's total (modulo 89).
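For concreteness, here is a sketch of what that per-round verifier check might look like in code, reducing the (rational) Lagrange coefficients modulo 89; the function name and calling convention are just illustrative.

def check_round(prev_total, sums, c, p=89):
    # the first two sums must add up to the previous round's total...
    assert (sums[0] + sums[1]) % p == prev_total % p
    # ...and the next round's total is the Lagrange interpolation of the five
    # sums, evaluated at the random coordinate c (working modulo p)
    next_total = 0
    for w, s in zip(deg4_lagrange_weights(c), sums):
        next_total += s * w.numerator * pow(w.denominator, -1, p)
    return next_total % p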

At the end, the prover provides the evaluation \(V_{31}(9, 16, 8) = 39\). Because \(W\) has a mathematical structure, the verifier can compute \(W_{31}(9, 16, 8) = 85\) themselves. Then, the verifier checks the equation (remember: you can check an equation between polynomials by checking an evaluation at a random point): \(39^3 * 85 = 87\) (all modulo 89 in our example).

What have we done? We have converted an unproven claim of the form \(V_{32}(p_{32}) = y_{32}\) into an unproven claim of the form \(V_{31}(p_{31}) = y_{31}\), all without using any commitments!!!

We then proceed and keep going, until we get to a claim \(V_0(p_0) = y_0\). Because \(V_0\) is just the inputs, and the verifier has the inputs, the verifier can just check this directly. Let us recap the entire flow in diagram form:



Now, let's get back to one thing I left out here: the matrix multiplication layers. It turns out that there is a fairly easy way to generate weights that represent the idea "multiply \(V\) by a matrix \(M\) that only combines values that are within the same batch of 16, and then evaluate \(VM\) at point \(p\)". There is also a fairly easy way to evaluate the polynomial generated by those weights at some other point \(q\) without actually needing to compute \(M\) (cryptographers seem to love Star Trek, so they sometimes say "materialize" \(M\)). This is important for the verifier to be able to check the link between \(V_{n-1}(p_{n-1})\) and \(V_n(p_n)\).

But there is also a much dumber way to do this, which does not lose much efficiency (in fact, it increases efficiency slightly on the prover side): just run 16 sumchecks in parallel, sharing the same randomness so they all get the same coordinates. The verifier's "state" \(V_{n-1}(p_{n-1})\) will be a size-16 object, and the verifier can apply the matrix to that directly.
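A sketch of what that looks like on the verifier side: the state is a vector of 16 claimed evaluations (one per parallel sumcheck, all at the same point), and because the matrix only mixes values within a batch of 16, applying it to those evaluations is just an ordinary matrix-vector product over the field. The function below is purely illustrative.

def apply_matrix_to_claims(M, claims, p):
    # claims[j] is the j-th parallel sumcheck's claimed evaluation at the shared
    # point; by linearity, M applied to these gives the evaluations of the
    # matrix-output columns at that same point
    return [sum(M[i][j] * claims[j] for j in range(16)) % p for i in range(16)]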

Optimizations

We can do a major optimization to the sumcheck above, reducing the number of sums computed and provided per round from 5 to 3. This saving comes from two sources.

First, because we know the verifier can check \(sum_0 + sum_1\) against the previous total, we can instead leave out \(sum_1\), and ask the verifier to just compute \(sum_1 = total - sum_0\) themselves.

A second, more involved optimization is usually called Gruen's trick. Let us go back to the definition of the weights \(W\). Here is some python code for computing it:

def generate_weights(eval_point):
    weights = [1]
    for c in eval_point[::-1]:
        # split each existing weight into a c part and a (1-c) part
        R = [w * c for w in weights]
        L = [w - r for (w, r) in zip(weights, R)]
        weights = [0] * (len(weights) * 2)
        weights[::2] = L
        weights[1::2] = R
    return weights

Let's first focus on the first round. Notice that the only part of this calculation that depends on the first coordinate is the very last step. What happens if the prover computes \(V^3 * W\) using a half-sized version of \(W\) where we just skip the last step?

Here are the \(W_{half}\) and \(W\) for our example above (derived from \(p = [2, 7, 18]\)), side by side:



As with our earlier example, this is just \([[(1-7) * (1-18), (1-7) * 18], [7 * (1-18), 7 * 18]]\), modulo 89. The evaluations of \(W\) at some \(x\) can be found by taking \(W_{half}\) and multiplying in \(x * c + (1-x) * (1-c)\) where \(c\) is the current evaluation point (in this case, the \(2\) from \([2, 7, 18]\)). At \(x=0\), this just means multiplying by \(1 - 2 = -1\) (modulo 89), and you should be able to visually check that the front face of the right cube is exactly that.
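If you want to check this relationship numerically with the generate_weights code above (with my assumption about its ordering: the first coordinate of the evaluation point ends up as the lowest bit of the index, so the \(x = 0\) half is the even-index entries):

m = 89
W_full = [w % m for w in generate_weights([2, 7, 18])]
W_half = [w % m for w in generate_weights([7, 18])]
# the x = 0 slice of W is W_half times (1 - 2), and the x = 1 slice is W_half times 2
assert W_full[0::2] == [(w * (1 - 2)) % m for w in W_half]
assert W_full[1::2] == [(w * 2) % m for w in W_half]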

You can do the same thing with linear combinations of \(W\): given the sum \(hsum_x = sum(V_{31}^3(x, ...) * W_{half}(...))\), you can multiply it by \(x * c + (1-x) * (1-c)\) to get the "real sum" \(sum_x = sum(V_{31}^3(x, ...) * W_{31}(x, ...))\).

This is another way to allow the prover to send four values instead of five: because \(W_{half}\) is independent of the current dimension, its evaluation will be the same for the top half of \(V\) and the bottom half, and for that matter for the extension \(V(2, ...)\). This means that, in the current dimension, \(V^3 * W_{half}\) is only degree-3 (and not degree-4), and so you only need four values (\(hsum_0 ... hsum_3\)) to compute \(hsum\) at the randomly chosen coordinate with Lagrange coefficients.

To check against the previous total, the verifier would check if \(hsum_0 * (1 - c) + hsum_1 * c = total\). But even better, we can combine the two tricks to let the prover only send three values! We do this by allowing the verifier themselves to compute \(hsum_1 = \frac{total - hsum_0 * (1 - c)}{c}\).
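Here is a sketch of the verifier's per-round work with both tricks combined, mod 89. The prover sends only \(hsum_0\), \(hsum_2\) and \(hsum_3\); c is the coordinate of the evaluation point that this round folds over, and r is the round's random challenge. The exact bookkeeping varies between implementations; this is just one way to lay it out.

def gruen_round(total, hsum_0, hsum_2, hsum_3, c, r, p=89):
    # recover the omitted value from the previous total:
    #   total = hsum_0 * (1 - c) + hsum_1 * c
    hsum_1 = ((total - hsum_0 * (1 - c)) * pow(c, -1, p)) % p
    # hsum(x) is only degree 3, so interpolate it at r from its values at 0..3
    hsums = [hsum_0 % p, hsum_1, hsum_2 % p, hsum_3 % p]
    nodes, denoms = (0, 1, 2, 3), (-6, 2, -2, 6)  # prod_{m != k} (k - m)
    hsum_r = 0
    for k, d in zip(nodes, denoms):
        num = 1
        for node in nodes:
            if node != k:
                num *= (r - node)
        hsum_r += hsums[k] * num * pow(d % p, -1, p)
    # multiply back in the factor that W_half left out to get the next total
    return (hsum_r * (r * c + (1 - r) * (1 - c))) % p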

A third, smaller, optimization is that when \(r_i\) is not zero, we don't need to handle it inside the sumcheck; instead, the verifier can simply subtract it from \(V_i(p_i)\) at the appropriate time.

Moving to "real" Poseidon2

The next source of optimization comes from moving to the real version of Poseidon2, where ¾ of the rounds only cube one value. This means that 15 out of 16 values in each such round stay the same.

Unfortunately, we can't skip doing a sumcheck over them entirely, because if the evaluation point for the cubed value shifts from \(p_i\) to \(p_{i-1}\), we need to shift the evaluation point for the other values as well. But the good news is that (with Gruen's trick) it's only a linear sumcheck. The even better news is that you can batch it: you can prove \(sum(V_{(i, 1)}) ... sum(V_{(i, 15)})\) by proving \(sum(\sum_{j=1}^{15} V_{(i,j)} * \alpha ^ j)\) for some random \(\alpha\). Thus, most of the prover's work is just computing the random linear combination \(\sum_{j=1}^{15} V_{(i,j)} * \alpha ^ j\). \(V_{(i,0)}\) gets cubed, so it does have to be handled with a cubic sumcheck (but sharing the same randomness), but this is now a much smaller cost.
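As a sketch of that batching step (with placeholder names): most of the prover's work for the 15 un-cubed values is just this random linear combination, after which a single linear sumcheck over the combined hypercube suffices.

def batch_uncubed_columns(columns, alpha, p):
    # columns[j-1] holds the hypercube of values V_(i,j), for j = 1..15,
    # each as a flat list of the same length
    combined = [0] * len(columns[0])
    coeff = alpha
    for col in columns:
        combined = [(acc + coeff * v) % p for acc, v in zip(combined, col)]
        coeff = (coeff * alpha) % p
    return combined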

You can find some sample code implementing these algorithms here: https://github.com/ethereum/research/tree/master/gkr

Polynomial commitments

One thing that I swept under the rug in the above analysis is that in this whole example, the verifier is working with the full list of inputs and outputs "in the clear". For the protocol as written to be secure, the verifier would have to hash that entire list to generate the Fiat-Shamir randomness. Thus, to receive a proof of the evaluations of \(N\) hashes, the verifier would have to compute... \(N\) hashes.

In a real-world protocol, no one is interested in the evaluations of lots of hashes in isolation. Instead, the idea is to prove the hashes, and then expose the set of hash inputs and hash outputs as a "lookup table" that can be used to prove some bigger computation that includes those hashes.



For this, you could "just use FRI". The verifier needs to get a proof of \(V_{32}(p_{32})\) and \(V_0(p_0)\), which would turn into a linear combination over the values encoded in the commitments. This can be done, and is certainly much faster than proving the hashes inside the FRI. An alternative is to use polynomial commitment schemes that are native to multilinear polynomials. A common one is BaseFold, though there have been other more recent improvements. Realistically, the choice depends on what proof system you are using for the non-hash part of your computation.

One important security consideration is an observation from 2025: if the circuit being proved can also compute the Fiat–Shamir hash used to derive challenges fast enough within its own depth, a malicious prover might be able to predict challenges, and thus cheat the proof system. This is not an issue for using GKR to prove hashes as in the example above, because (i) the scheme is designed around one fixed circuit that does not create room for an attacker to use these tricks, and (ii) the attack involves feeding GKR a circuit that internally computes both a hash and other computation, and so by definition there is not enough depth in a hash-only circuit to execute it. However, for more complicated and general-purpose GKR use cases, it is an issue; one mitigation is to adjust the hash used to compute Fiat-Shamir challenges so that it requires more rounds to compute than the number of rounds the circuit has.

How efficient is GKR?

Let's continue to focus on the proving Poseidon hashes use case. Here is a spreadsheet that tallies up all the costs of a demo implementation that I made:



Roughly a 15x theoretical overhead (!!). This compares to a ~100x theoretical overhead for proving Poseidon with traditional STARKs. The gains come from the fact that the largest costs of a traditional STARK proof (see my prover here, see the "regular STARK" tab of my theorycraft spreadsheet here) are hashing and FFT'ing the trace that consists of all intermediate values (though there is a clever trick that allows the proof to only require one trace value per inner round). With GKR, we don't have to do any of that at all: the prover is only committing to the input and the output; for everything in the middle the prover is only doing a (much lighter) sumcheck at each round, and for linear work (ie. matmul) the prover needs to do nothing at all.

But these numbers are just theory. There are two practical reasons why practice might differ from theory, that pull in opposite directions:

  1. Sumchecks require more complicated and more memory-intensive shuffling of data than naive hashing does. In practice, sumchecks are often memory-bound.
  2. Real-world hashing workloads are often at least partially serial, whereas sumchecks are massively parallelizable.

And on top of this, there is the possibility that one or the other just happens to be harder to optimize in a given software or hardware context. Does my implementation match up with the theory? Well, let's see:



Under 10x!!! Though I would not take this number as the "true overhead" either; it most likely means that my implementation of the raw execution is under-optimized.

Now, there still are some optimizations we have not tried. Particularly, you can adjust the sumcheck so that the "hypercube" is an unbalanced prism, where the first dimension has a size larger than 2. This makes the first round larger, but makes all subsequent rounds smaller; the tradeoff is fine because in the first round \(V\) is in the base KoalaBear prime field whereas in later rounds everything is in the extension field, which is much more expensive. If you're hashing larger amounts of data, you could also give up on the 2→1 structure, and hash eg. 4→1 or even more; as the width increases, the overhead of GKR approaches zero (!!!).

Hence, proving hashes with single-digit overhead is very possible.

In this post, hashes were just an example; GKR is a family of techniques that is well-suited for proving many kinds of computations. If a computation can be expressed in the "batch and layers" format, where each layer can be expressed as a low-degree polynomial, then GKR can be applied directly. But even if there are cross-interactions between different values in a large computation, there are often tricks that can be used. This is relevant for proving LLM inference, and many other situations. Most provers that strive for very high efficiency and very low overhead are going to use something like GKR in some form.