Exploring circle STARKs

2024 Jul 23


This article assumes familiarity with the basics of how SNARKs and STARKs work; if you are not familiar, I recommend the first few sections in this article. Special thanks to Eli ben-Sasson, Shahar Papini, Avihu Levy and others at Starkware for feedback and discussion.

The most important trend in STARK protocol design over the last two years has been the switch to working over small fields. The earliest production implementations of STARKs worked over 256-bit fields - arithmetic modulo large numbers such as 21888...95617 \(\approx 1.51 * 2^{253}\) - which made these protocols naturally compatible with verifying elliptic curve-based signatures, and made them easy to reason about. But this led to inefficiency: in most cases we don't actually have good ways to make use of these larger numbers, and so they ended up as mostly wasted space, and even more wasted computation, since arithmetic over 4x bigger numbers takes ~9x more computation time. To deal with this, STARKs have started working over smaller fields: first Goldilocks (modulus \(2^{64} - 2^{32} + 1\)) and then Mersenne31 and BabyBear (\(2^{31} - 1\) and \(2^{31} - 2^{27} + 1\), respectively).

This switch has already led to demonstrated massive improvements in proving speed, most notably Starkware being able to prove 620,000 Poseidon2 hashes per second on an M3 laptop. Particularly, this means that, provided we're willing to trust Poseidon2 as a hash function, one of the hardest parts of making an efficient ZK-EVM is effectively solved. But how do these techniques work, and how do cryptographic proofs, which typically require large numbers for security, get built over these fields? And how do these protocols compare to even more exotic constructions such as Binius? This post will explore some of these nuances, with a particular eye to a construction called Circle STARKs (implemented in Starkware's stwo, Polygon's plonky3, and my own implementation in (sort of) python), which has some unique properties designed to be compatible with the highly efficient Mersenne31 field.

Issues common to small fields

One of the most important "tricks" when making hash-based proofs (or really, any kind of proof) is the idea of proving things about evaluations of a polynomial at a random point, as a substitute for proving things about the underlying polynomials.

For example, suppose that a proof system requires you to generate a commitment to a polynomial, \(A\), which must satisfy \(A^3(x) + x - A(\omega*x) = x^N\) (a pretty common type of claim to prove in ZK-SNARK protocols). The protocol can require you to pick a random coordinate \(r\), and prove that \(A(r) + r - A(\omega*r) = r^N\). And then in turn, to prove that \(A(r) = c\), you prove that \(Q = \frac{A - c}{X - r}\) is a polynomial (as opposed to a fractional expression).
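As a quick sanity check of the quotient trick, here is a minimal sketch (a toy example over an arbitrary prime, not part of any real protocol): dividing \(A - c\) by \(X - r\) with synthetic division leaves zero remainder exactly when \(A(r) = c\), which is what makes \(Q\) a genuine polynomial.

```python
M = 2**31 - 1  # toy modulus; any prime field works for this check

def divide_by_linear(coeffs, r, p=M):
    # Divide the polynomial sum(coeffs[i] * x^i) by (x - r) using synthetic division.
    # Returns (quotient_coeffs, remainder); remainder == 0 iff the polynomial is 0 at r.
    quotient = []
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * r + c) % p
        quotient.append(acc)
    remainder = quotient.pop()   # the final accumulator value is the remainder
    return list(reversed(quotient)), remainder

# A toy polynomial A(x) = 3x^3 + 8x^2 + 2x + 6, and an arbitrary point r
A = [6, 2, 8, 3]
r = 5
c = sum(coef * pow(r, i, M) for i, coef in enumerate(A)) % M   # c = A(r)
A_minus_c = [(A[0] - c) % M] + A[1:]
Q, rem = divide_by_linear(A_minus_c, r)
assert rem == 0   # (A - c) is divisible by (X - r), so Q really is a polynomial
```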

If you know \(r\) ahead of time, you can always cheat these protocols. In this case, you could just set \(A(r)\) to be zero, retrofit \(A(\omega * r)\) to satisfy the equation, and then let \(A\) be the line that passes through those two points. And similarly for the second step, if you know \(r\) ahead of time, you can generate whatever \(Q\) you want, and then retrofit \(A\) to match it, even if \(A\) is a fractional (or other non-polynomial) expression.

To prevent these attacks, we need to choose \(r\) after the attacker provides \(A\) (the "Fiat-Shamir heuristic" is a fancy name for setting \(r\) to be the hash of \(A\)). Importantly, we need to choose \(r\) from a set large enough that the attacker cannot guess it.

In elliptic curve based protocols and even 2019-era STARKs, this was trivial: all of our math was done over 256-bit numbers, so we choose \(r\) as a random 256-bit number, and we're fine. With STARKs over smaller fields, we have a problem: there are only about two billion possible values of \(r\) to choose from, and so an attacker wanting to make a fake proof need only try two billion times - a lot of work, but quite doable for a determined attacker!

There are two natural solutions to this problem:

  1. Performing multiple random checks
  2. Extension fields

The approach of performing multiple random checks is intuitively appealing and simple: instead of checking at one coordinate, you repeat the check at each of four random coordinates. This is theoretically doable, but there is an efficiency issue. If you're dealing with degree < \(N\) polynomials over a size \(p\) field, it's actually possible for an attacker to craft bad polynomials that "look" good in \(N\) positions. Hence, their chance of breaking one round of the protocol is \(\frac{N}{p}\). If eg. \(p = 2^{31} - 1\) and \(N = 2^{24}\), that means you only get seven bits of security per round, and so you need to do not four, but around 18 rounds, to be properly robust against such attackers. Ideally, we would have something where we do \(k\) times more work but only have to subtract \(\log_2(N)\) bits from the security level once, rather than once per round.
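Concretely, with the numbers above: \(\log_2(p) - \log_2(N) \approx 31 - 24 = 7\) bits of security per round, and \(18 \times 7 = 126\) bits across 18 rounds, which is roughly the standard 128-bit target.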

This gets us to the other solution: extension fields. Extension fields are like complex numbers, but over finite fields: we imagine into existence a new value, call it \(i\), and declare that \(i^2 = -1\). Multiplication becomes: \((a+bi) * (c+di) = (ac - bd) + (ad + bc)i\). We can now operate over pairs \((a,b)\) rather than just single numbers. Assuming we're working over size \(\approx 2^{31}\) fields like Mersenne or BabyBear, this gets us up to having \(\approx 2^{62}\) values from which to choose \(r\). To go even higher, we apply the same technique again, except we already used \(i\) so we need to define a new value differently: in Mersenne31, we pick \(w\) where \(w^2 = -2i-1\). Multiplication now becomes \((a + bi + cw + diw) * (e + fi + gw + hiw) = ...\)


OK fine, here's the code implementation. It's not optimal (you can improve it with Karatsuba), but it shows the principles.
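Here is a minimal sketch in Python, assuming the representation \(a + bi + cw + diw\) with \(i^2 = -1\) and \(w^2 = -2i - 1\) as above:

```python
M31 = 2**31 - 1

def ext_mul(x, y, p=M31):
    # Multiply two elements of the degree-4 extension field, each given as a
    # 4-tuple (a, b, c, d) representing a + b*i + c*w + d*i*w,
    # where i^2 = -1 and w^2 = -2i - 1.
    a, b, c, d = x
    e, f, g, h = y
    # Treat x = A + B*w and y = C + D*w, where A, B, C, D are "complex" numbers in i.
    ac_re, ac_im = a*e - b*f, a*f + b*e          # A*C
    bd_re, bd_im = c*g - d*h, c*h + d*g          # B*D
    adbc_re = a*g - b*h + c*e - d*f              # 1-part of A*D + B*C
    adbc_im = a*h + b*g + c*f + d*e              # i-part of A*D + B*C
    # x*y = A*C + B*D*(-2i - 1) + (A*D + B*C)*w
    return ((ac_re - bd_re + 2*bd_im) % p,       # coefficient of 1
            (ac_im - bd_im - 2*bd_re) % p,       # coefficient of i
            adbc_re % p,                         # coefficient of w
            adbc_im % p)                         # coefficient of i*w

# Example: multiply two extension-field elements
print(ext_mul((1, 2, 3, 4), (5, 6, 7, 8)))
```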


Now, we have \(\approx 2^{124}\) values to choose \(r\) from, which is high enough for our security needs: if we are dealing with degree < \(2^{20}\) polynomials, we get 104 bits of security from one round. If we want to be paranoid and go up to the more widely-accepted 128 bit security level, we can add some proof of work into the protocol.

Note that we only actually use this extension field in the FRI protocol, and other cases where random linear combinations are required. The bulk of the math is done over only the "base field" (modulo \(2^{31}-1\) or \(15 * 2^{27} + 1\)), and almost all of the data that is hashed is over the base field, so you only hash four bytes per value. This lets us both benefit from the efficiency of small fields, and retain the ability to dip into a larger field when we need to do so for security.

Regular FRI

When building a SNARK or STARK, the first step is typically arithmetization: reducing an arbitrary computation problem into an equation where some of the variables and coefficients are polynomials (eg. the equation often looks like \(C(T(x), T(next(x))) = Z(x) * H(x)\), where \(C\), \(next\) and \(Z\) are provided and the solver needs to provide \(T\) and \(H\)). Once you have such an equation, a solution to the equation corresponds to a solution to the underlying computational problem.

To prove that you have a solution, you need to prove that the values that you are proposing actually are real polynomials (as opposed to fractions, or datasets that look like one polynomial in one place and a different polynomial in another place, or...), and have a certain maximum degree. In order to do this, we apply a random linear combination trick iteratively:

  1. Suppose you have evaluations of a polynomial \(A\), and you want to prove that it has degree \(< 2^{20}\)
  2. Consider \(B\) and \(C\), defined by \(B(x^2) = \frac{A(x) + A(-x)}{2}\) and \(C(x^2) = \frac{A(x) - A(-x)}{2x}\)
  3. Choose a random \(r\), and commit to \(D = B + rC\)

Essentially, what's going on is that \(B\) isolates the even coefficients of \(A\), and \(C\) isolates the odd coefficients. Given \(B\) and \(C\), you can recover \(A\): \(A(x) = B(x^2) + xC(x^2)\). And if \(A\) really has degree \(< 2^{20}\), then \(B\) and \(C\) both have degree \(< 2^{19}\). And \(D\), being a random linear combination of \(B\) and \(C\), must then also have degree \(< 2^{19}\).
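As a sanity check, here is a small sketch (over a toy prime; the polynomial and the evaluation point are arbitrary) verifying that the \(B\) and \(C\) defined this way really are the even and odd halves of \(A\):

```python
import random

p = 2**31 - 1   # any odd prime works for this check

def eval_poly(coeffs, x):
    out = 0
    for c in reversed(coeffs):   # Horner's rule
        out = (out * x + c) % p
    return out

A = [random.randrange(p) for _ in range(8)]   # a random degree < 8 polynomial
B = A[0::2]                                   # even coefficients of A
C = A[1::2]                                   # odd coefficients of A

x = random.randrange(1, p)
a_pos, a_neg = eval_poly(A, x), eval_poly(A, (-x) % p)
# B(x^2) = (A(x) + A(-x)) / 2   and   C(x^2) = (A(x) - A(-x)) / (2x)
assert eval_poly(B, x*x % p) == (a_pos + a_neg) * pow(2, -1, p) % p
assert eval_poly(C, x*x % p) == (a_pos - a_neg) * pow(2*x, -1, p) % p
# And A(x) = B(x^2) + x*C(x^2)
assert a_pos == (eval_poly(B, x*x % p) + x * eval_poly(C, x*x % p)) % p
```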

We've reduced a "prove degree \(< 2^{20}\)" problem into a "prove degree \(< 2^{19}\)" problem. Repeat this 20 times, and you get the technique that is called "Fast Reed-Solomon Interactive Oracle Proofs of Proximity", or "FRI". If someone tries to push something through this technique which is not a degree \(< 2^{20}\) polynomial, then the second-round output will (with probability \(\approx 1 - \frac{1}{2^{124}}\)) not be a degree \(< 2^{19}\) polynomial, the third-round output will not be degree \(< 2^{18}\), and so on, and the final check at the end will fail. A dataset which is equal to a degree \(< 2^{20}\) polynomial in most positions has some chance of passing through the scheme, but in order to construct such a dataset you need to know the underlying polynomial, so even such a slightly-defective proof is a convincing argument that the prover could generate a "real" proof if they wanted to. There are further technical complexities in proving that this holds for all possible inputs; understanding the fine details of this has been a major focus of academic STARK research over the last five years.

Let's look into what's going on here in more detail, and what properties are necessary to make this all work. At each step, we're reducing the degree by a factor of 2, and we're also reducing the domain (the set of points we're looking at) by a factor of 2. The former is what makes FRI work at all. The latter is what makes it so blazing fast: because each round is 2x smaller than the previous, the total cost is \(O(N)\) instead of \(O(N*log(N))\).

To do this domain reduction, we needed a two-to-one map: \(\{x, -x\} \rightarrow x^2\). What's nice about this two-to-one map is that it's repeatable: if you start with a multiplicative subgroup (a set \(\{1, \omega, \omega^2 ... \omega^{n-1}\}\)), then you start off with a set where for any \(x\) in the set, \(-x\) is also in the set (as if \(x = \omega^k\), \(-x = \omega^{k\pm\frac{n}{2}}\)), and if you then square it to get \(\{1, (\omega^2), (\omega^2)^2 ... (\omega^2)^{\frac{n}{2}-1}\}\), then the exact same property applies, and so you can keep reducing all the way down to one value (though in practice we usually stop a little bit earlier).
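Here is a tiny sketch of that repeated halving, using the multiplicative group of the integers modulo 17 as a toy example (the field and its generator 3 are assumptions of this sketch, not anything specific to STARKs):

```python
p, g = 17, 3   # toy field; 3 generates the full multiplicative group mod 17

domain = [pow(g, k, p) for k in range(p - 1)]   # size-16 multiplicative subgroup
while len(domain) > 1:
    # every x in the domain has -x in the domain too...
    assert all((p - x) % p in domain for x in domain)
    # ...and squaring maps the domain 2-to-1 onto a half-sized subgroup
    domain = sorted(set(x * x % p for x in domain))
    print(len(domain), domain)   # sizes 8, 4, 2, 1
```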



You can think of this as being an operation of taking a line that goes around a circle, and stretching that line until it makes two rotations along that circle. A point at x degrees becomes a point at 2x degrees. Each point from 0...179 degrees has a corresponding point at 180...359 degrees that it ends up overlapping with. And you can repeat this procedure again and again.

For this to work, you need the original multiplicative subgroup to have a size with a large power of 2 as a factor. BabyBear has modulus \(15 * 2^{27} + 1\), and so the largest possible subgroup is all nonzero values - hence, size \(15 * 2^{27}\). This is very friendly to the above technique. You could take a subgroup of size \(2^{27}\), or you could just take that full set, do the FRI to reduce the polynomial all the way down to degree 15, and then check the degree directly at the end. Mersenne31, however, does not work in this way. The modulus is \(2^{31} - 1\), and so the multiplicative subgroup has size \(2^{31} - 2\). This can be divided by 2 only once. From there forward, we have no way to do an FFT - at least not using the technique above.

This is a tragedy, because Mersenne31 is a super-convenient field to do arithmetic in using existing 32-bit CPU/GPU operations. If you add two numbers, the result may be above \(2^{31}-1\), but you can reduce it by doing \(x \rightarrow x + (x >> 31)\), where \(>>\) is a bit shift. For multiplication, you can do something similar, though you need to use a special (but commonly available) opcode that returns the "high-order bits" of a multiplication result (ie. \(floor(\frac{xy}{2^{32}})\)). This allows arithmetic to be around 1.3x more efficient than BabyBear. If we could do FRI over Mersenne31, it would make things significantly better for us.
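A rough sketch of these reduction tricks (written with Python big integers, so the shifts and masks just emulate what a 32-bit implementation would do):

```python
M31 = (1 << 31) - 1

def add_m31(a, b):
    # a, b are assumed to already be reduced into [0, M31)
    s = a + b                        # at most 2^32 - 4: fits in 32 bits
    s = (s & M31) + (s >> 31)        # fold the top bit back down, since 2^31 = 1 (mod M31)
    return s if s < M31 else s - M31

def mul_m31(a, b):
    prod = a * b                     # a 32-bit implementation would use the "high bits" opcode here
    s = (prod & M31) + (prod >> 31)  # 2^31 = 1 (mod M31), so high and low parts just add
    s = (s & M31) + (s >> 31)        # one more fold to get below 2^31
    return s if s < M31 else s - M31

assert add_m31(M31 - 1, 5) == 4
assert mul_m31(2**20, 2**15) == (2**35) % M31
```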

Circle FRI

Here is where the clever trick of circle STARKs comes in. Given a prime \(p\), it turns out that we also have easy access to a group of size \(p+1\) that has similar two-to-one properties: the set of points \((x,y)\) where \(x^2 + y^2 = 1\). Let's look at this structure modulo 31:



The points follow an addition law, which might feel very familiar if you've recently done either trigonometry or complex multiplication:

\((x_1, y_1) + (x_2, y_2) = (x_1x_2 - y_1y_2, x_1y_2 + x_2y_1)\)

The doubling form is:

\(2 * (x, y) = (2x^2 - 1, 2xy)\)

Now, let's focus on only the points that are in "odd" positions on this circle:


Now, here is our FFT. First, we collapse all the points down to a single line. Our equivalent of the \(B(x^2)\) and \(C(x^2)\) formulas that we had in regular FRI is:

\(f_0(x) = \frac{F(x,y) + F(x,-y)}{2}\) \(f_1(x) = \frac{F(x,y) - F(x,-y)}{2y}\)

We can then take a random linear combination, and we get a one-dimensional \(F\) that is over a subset of the x line:



From the second round onward, the map changes:

\(f_0(2x^2-1) = \frac{F(x) + F(-x)}{2}\) \(f_1(2x^2-1) = \frac{F(x) - F(-x)}{2x}\)

And this map actually takes the above set, and reduces its size in half each time! What is going on here is that each \(x\) is in some sense "standing in" for two points: \((x,y)\) and \((x,-y)\). And \(x \rightarrow 2x^2-1\) is the point doubling law above. Hence, we take the \(x\) coordinate of two opposite points on the circle, and convert it into the \(x\) coordinate of the doubled point.

For example, if we take the second-rightmost value, \(2\), and apply the map, we get \(2(2^2) - 1 = 7\). If we go back to the original circle, \((2,11)\) is the third point going counterclockwise from the right, and so if we double it, we get the sixth point going counterclockwise from the right, which is... \((7, 13)\).
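Here's a small sketch checking this with the addition and doubling laws from above, modulo 31:

```python
P = 31

def circle_add(p1, p2):
    # (x1, y1) + (x2, y2) = (x1*x2 - y1*y2, x1*y2 + x2*y1)
    (x1, y1), (x2, y2) = p1, p2
    return ((x1*x2 - y1*y2) % P, (x1*y2 + x2*y1) % P)

def circle_double(pt):
    # 2 * (x, y) = (2x^2 - 1, 2xy)
    x, y = pt
    return ((2*x*x - 1) % P, (2*x*y) % P)

pt = (2, 11)
assert (pt[0]**2 + pt[1]**2) % P == 1           # (2, 11) is on the circle x^2 + y^2 = 1
assert circle_double(pt) == (7, 13)             # doubling it lands on (7, 13), as above
assert circle_add(pt, pt) == circle_double(pt)  # the two laws agree
```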

This could have all been done two-dimensionally, but operating over one dimension makes things more efficient.

Circle FFTs

An algorithm closely related to FRI is the fast Fourier transform, which takes a set of \(n\) evaluations of a degree \(< n\) polynomial and converts it into the \(n\) coefficients of the polynomial. An FFT follows the same path as a FRI, except instead of generating a random linear combination \(f_0\) and \(f_1\) at each step, it just recursively applies a half-sized FFT on both, and then takes the output of \(FFT(f_0)\) as the even coefficients and \(FFT(f_1)\) as the odd coefficients.
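Here's a minimal sketch of that recursive structure for the regular (multiplicative subgroup) case, turning evaluations into coefficients; the toy field and domain in the usage example are assumptions of this sketch, not anything from a real STARK:

```python
def fft(evals, domain, p):
    # evals[i] = P(domain[i]), where domain lists the powers of a root of unity in
    # natural order, so that domain[i + n/2] = -domain[i]. Returns P's coefficients.
    n = len(evals)
    if n == 1:
        return list(evals)
    half = n // 2
    f0, f1 = [], []
    for i in range(half):
        x, a, b = domain[i], evals[i], evals[i + half]      # a = P(x), b = P(-x)
        f0.append((a + b) * pow(2, -1, p) % p)              # B(x^2) = (P(x) + P(-x)) / 2
        f1.append((a - b) * pow(2 * x, -1, p) % p)          # C(x^2) = (P(x) - P(-x)) / (2x)
    sub_domain = [x * x % p for x in domain[:half]]
    even = fft(f0, sub_domain, p)   # coefficients of B -> even coefficients of P
    odd = fft(f1, sub_domain, p)    # coefficients of C -> odd coefficients of P
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

p = 17
w = pow(3, (p - 1) // 4, p)                 # an order-4 root of unity (3 generates the group mod 17)
domain = [pow(w, i, p) for i in range(4)]
coeffs = [6, 2, 8, 3]                       # P(x) = 6 + 2x + 8x^2 + 3x^3
evals = [sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p for x in domain]
assert fft(evals, domain, p) == coeffs
```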

The circle group also supports an FFT, which is also constructed from FRI along similar lines. However, a key difference is that the objects that circle FFTs (and circle FRI) work over are not technically polynomials. Rather, they are what mathematicians call a Riemann-Roch space: in this case, polynomials "modulo" the circle (\(x^2 + y^2 - 1 = 0\)). That is, we treat any multiple of \(x^2 + y^2 - 1\) as being equal to zero. Another way of thinking about it is: we only allow degree-1 powers of \(y\): as soon as we get a \(y^2\) term, we replace it with \(1 - x^2\).

One other thing that this implies is that the "coefficients" that a circle FFT outputs are not monomials like in a regular FFT (eg. if a regular FFT outputs \([6, 2, 8, 3]\), then we know this means \(P(x) = 3x^3 + 8x^2 + 2x + 6\)). Instead, the coefficients are in a strange basis specific to circle FFTs:

\(\{1, y, x, xy, 2x^2-1, 2x^2y-y, 2x^3-x, 2x^3y-xy, 8 x^4 - 8 x^2 + 1...\}\)

The good news is that as a developer, you can almost completely ignore this. STARKs never give you a need to know the coefficients. Instead, you can just always store "polynomials" as a set of evaluations on a particular domain. The only place you need to use FFTs is to perform (the Riemann-Roch space analogue of) low-degree extension: given \(N\) values, generate \(k*N\) values that are on that same polynomial. In that case, you can do an FFT to generate the coefficients, append \((k-1)*N\) zeroes to those coefficients, and then do an inverse-FFT to get back your larger set of evaluations.
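For example, here's a toy version of that extend-by-padding recipe, over the same toy field as the sketches above (the final re-evaluation is done point-by-point for brevity; a real implementation would do that step with an inverse FFT over the larger domain):

```python
p, g = 17, 3   # toy field; 3 generates its multiplicative group

def eval_poly(coeffs, x):
    out = 0
    for c in reversed(coeffs):   # Horner's rule
        out = (out * x + c) % p
    return out

coeffs = [6, 2, 8, 3]            # pretend an FFT just gave us these N = 4 coefficients
padded = coeffs + [0] * 4        # appending zeroes doesn't change the polynomial
w8 = pow(g, (p - 1) // 8, p)     # an order-8 root of unity
big_domain = [pow(w8, i, p) for i in range(8)]
extended = [eval_poly(padded, x) for x in big_domain]   # 2N evaluations of the same polynomial
# the original N evaluations sit at every other point of the bigger domain
print(extended[::2], extended)
```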

Circle FFTs are not the only type of "exotic FFT". Elliptic curve FFTs are even more powerful, because they work over any finite field (prime, binary, etc). However, ECFFTs are even more complex to understand and less efficient, and so because we can use circle FFTs for \(p = 2^{31}-1\), we do.

From here, let's get into some of the more esoteric minutiae that will be different for someone implementing circle STARKs, as compared to regular STARKs.

Quotienting

A common thing that you do in STARK protocols is you take quotients at specific points, either deliberately chosen or randomly chosen. For example, if you want to prove that \(P(x) = y\), you do so by providing \(Q = \frac{P - y}{X - x}\), and proving that \(Q\) is a polynomial (as opposed to a fractional value). Randomly choosing evaluation points is used in the DEEP-FRI protocol, which lets FRI be secure with fewer Merkle branches.

Here, we get to one subtle challenge: in the circle group, there is no line function, analogous to \(X - x\) for regular FRI, that passes through only one point. This is visible geometrically:



You could make a line function tangent to one point \((P_x, P_y)\), but that would pass through the point "with multiplicity 2" - that is, for a polynomial to be a multiple of that line function, it would have to fulfill a much stricter condition than just being zero at that point. Hence, you can't prove an evaluation at only one point. So what do we do? Basically, we bite the bullet, and prove an evaluation at two points, adding a dummy point whose evaluation we don't need to care about.


A line function: \(ax + by + c\). If you turn it into an equation by forcing it to equal 0, then you might recognize it as a line in what high school math calls "standard form".


If we have a polynomial \(P\) that equals \(v_1\) at \(P_1\), and \(v_2\) at \(P_2\), then we choose an interpolant \(I\): a line function that equals \(v_1\) at \(P_1\), and \(v_2\) at \(P_2\). This can be as simple as \(v_1 + (v_2 - v_1) * \frac{y - (P_1)_y}{(P_2)_y - (P_1)_y}\). We then prove that \(P\) equals \(v_1\) at \(P_1\), and \(v_2\) at \(P_2\) by subtracting \(I\) (so \(P-I\) equals zero at both points), dividing by \(L\) (the line function between \(P_1\) and \(P_2\)), and proving that the quotient \(\frac{P - I}{L}\) is a polynomial.

Vanishing polynomials

In a STARK, the polynomial equation you're trying to prove often looks like \(C(P(x), P(next(x))) = Z(x) * H(x)\), where \(Z(x)\) is a polynomial that equals zero across your entire original evaluation domain. In "regular" STARKs, that function is just \(x^n - 1\). In circle STARKs, the equivalent is:

\(Z_1(x,y) = y\)

\(Z_2(x,y) = x\)

\(Z_{n+1}(x,y) = (2 * Z_n(x,y)^2) - 1\)

Notice that you can derive the vanishing polynomial from the folding function: in regular STARKs, you're repeating \(x \rightarrow x^2\), here you're repeating \(x \rightarrow 2x^2-1\), though you're doing something different for the first round, because the first round is special.
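Written as code, the recursion is just the folding map iterated on \(x\), with the special first round handled separately (a minimal sketch):

```python
def vanishing_poly(n, x, y, p):
    # Z_1(x, y) = y;  Z_2(x, y) = x;  Z_{k+1}(x, y) = 2 * Z_k(x, y)^2 - 1
    if n == 1:
        return y % p
    val = x % p
    for _ in range(n - 2):
        val = (2 * val * val - 1) % p   # the same x -> 2x^2 - 1 map used for folding
    return val
```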

Reverse bit order

In STARKs, evaluations of a polynomial are typically arranged not in the "natural" order (\(P(1)\), \(P(\omega)\), \(P(\omega^2)\) ... \(P(\omega^{n-1})\)), but rather what I call "reverse bit order":

\(P(1)\), \(P(\omega^{\frac{n}{2}})\), \(P(\omega^{\frac{n}{4}})\), \(P(\omega^{\frac{3n}{4}})\), \(P(\omega^{\frac{n}{8}})\), \(P(\omega^{\frac{5n}{8}})\), \(P(\omega^{\frac{3n}{8}})\), \(P(\omega^{\frac{7n}{8}})\), \(P(\omega^{\frac{n}{16}})\)...

If we set \(n = 16\), and we focus just on which powers of \(\omega\) we're evaluating at, the list looks like this:

\(\{0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15\}\)

This ordering has the key property that values which get grouped together early on in a FRI evaluation are put beside each other in the ordering. For example, the first step of FRI groups together \(x\) and \(-x\). In the \(n=16\) case, \(\omega^8 = -1\), so \(P(\omega^i)\) gets grouped with \(P(-\omega^i) = P(\omega^{i+8})\). And, as we can see, those are exactly the pairs that are right beside each other. The second step of FRI groups together \(P(\omega^i)\), \(P(\omega^{i+4})\), \(P(\omega^{i+8})\) and \(P(\omega^{i+12})\). And, those are exactly the groups of four that we see. And so forth. This makes FRI much more space-efficient, because it lets you provide one Merkle proof for both of the values that get folded together (or, if you fold \(k\) rounds at a time, all \(2^k\) of the values) simultaneously.

In circle STARKs, the folding structure is a bit different: in the first step we group together \((x, y)\) with \((x, -y)\), in the second step \(x\) with \(-x\), and in subsequent steps \(p\) with \(q\), selecting \(p\) and \(q\) such that \(Q^i(p) = -Q^i(q)\) where \(Q^i\) is the map \(x \rightarrow 2x^2-1\) repeated \(i\) times. If we think of the points in terms of their position along the circle, at each step this looks like the first point getting paired with the last, the second with the second last, etc.

To adjust reverse bit order to reflect this folding structure, we reverse every bit except the last. We keep the last bit, and we also use it to determine whether or not to flip the other bits.



A size-16 folded reverse bit order looks as follows:

\(\{0, 15, 8, 7, 4, 11, 12, 3, 2, 13, 10, 5, 6, 9, 14, 1\}\)

If you look at the circle in the previous section, the 0th, 15th, 8th and 7th points (going counterclockwise, starting from the right) are of the form \((x, y)\), \((x, -y)\), \((-x, -y)\) and \((-x, y)\), which is exactly what we need.
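Here's a short sketch of both orderings, reproducing the two size-16 lists above:

```python
def reverse_bits(i, bit_count):
    out = 0
    for _ in range(bit_count):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out

def rbo(n):
    # ordinary reverse bit order
    bits = n.bit_length() - 1
    return [reverse_bits(i, bits) for i in range(n)]

def folded_rbo(n):
    # reverse every bit except the last; the last bit also decides whether to flip the rest
    bits = n.bit_length() - 1
    out = []
    for i in range(n):
        last = i & 1
        rest = reverse_bits(i >> 1, bits - 1)
        if last:
            rest = (1 << (bits - 1)) - 1 - rest   # flip the other bits
        out.append(rest * 2 + last)
    return out

assert rbo(16) == [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
assert folded_rbo(16) == [0, 15, 8, 7, 4, 11, 12, 3, 2, 13, 10, 5, 6, 9, 14, 1]
```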

Efficiency

Circle STARKs (and 31-bit-prime STARKs in general) are very efficient. A realistic computation that is being proven in a circle STARK would most likely involve a few types of computation:

  1. Native arithmetic, used for "business logic" such as counting
  2. Native arithmetic, used for cryptography (eg. hash functions like Poseidon)
  3. Lookup arguments, a generic way to do many kinds of computation efficiently by implementing them via reading values from tables

The key measure of efficiency is: are you using the entire space in the computational trace to do useful work, or are you leaving a lot of wasted space? In large-field SNARKs, there is a lot of wasted space: business logic and lookup tables mostly involve computation over small numbers (often the numbers are under N in an N-step computation, so under \(2^{25}\) in practice), but you have to pay the cost of using a size-\(2^{256}\) field anyway. Here, the field is size \(2^{31}\), so the wasted space is not large. "Designed-for-SNARKs" low-arithmetic-complexity hashes (eg. Poseidon) use every bit of each number in the trace in any field.

Hence, circle STARKs actually get pretty close to optimal! Binius is even stronger, because it lets you mix-and-match fields of different sizes and thereby get even more efficient bit packing for everything. Binius also opens up options for doing 32-bit addition without incurring the overhead of lookup tables. However, those gains come at the cost of (in my opinion) significantly higher theoretical complexity, whereas circle STARKs (and even more so BabyBear-based regular STARKs) are conceptually quite simple.

Conclusion: what do I think about circle STARKs?

Circle STARKs don't impose too many extra complexities on developers compared to regular STARKs. In the process of making an implementation, the above three issues are essentially the only differences that I saw compared to regular FRI. The underlying math behind the "polynomials" that circle FRI operates over is quite counterintuitive, and takes a while to understand and appreciate. But it just so happens that this complexity is hidden away in such a way that it's not that visible to developers. The complexity of circle math is encapsulated, not systemic.

Understanding circle FRI and circle FFTs can also be a good intellectual gateway to understanding other "exotic FFTs": most notably binary-field FFTs as used in Binius and in LibSTARK before it, and also spookier constructions such as elliptic curve FFTs, which use few-to-1 maps that work nicely with elliptic curve point operations.

With the combination of Mersenne31, BabyBear, and binary-field techniques like Binius, it does feel like we are approaching the limits of efficiency of the "base layer" of STARKs. At this point, I am expecting the frontiers of STARK optimization to move to making maximally-efficient arithmetizations of primitives like hash functions and signatures (and optimizing those primitives themselves for that purpose), making recursive constructions to enable more parallelization, arithmetizing VMs to improve developer experience, and other higher-level tasks.