Plinko PIR tutorial

2025 Nov 25



Special thanks to Alex Hoover, Keewoo Lee and Ali for feedback and review

One underrated form of privacy that is not well satisfied by ZK-SNARKs, traditional versions of FHE, or other commonly talked about methods is private reads. Users want to learn things from a large database, without revealing what they're reading. Some use cases of this include reading Ethereum state (eg. an account's balance) without revealing to the node which account you care about, or querying a search engine's index without it learning what you searched for.

There is a category of privacy protocols that are directly designed to solve this problem, called private information retrieval (PIR). A server has a database \(D\). A client has an index \(i\). The client sends the server a query, and the server replies with an answer that allows the client to reconstruct \(D[i]\), all without the server learning anything about \(i\). In this post I will describe some basics that underlie all PIR protocols, and then describe Plinko, a protocol that efficiently achieves this.

Basics of PIR

To understand more complicated PIR protocols, it's best to start with "classic two-server PIR". Classic two-server PIR assumes that a client is making queries to two servers, and it trusts that at least one of the two is honest. Here's how the protocol works:

  1. The client picks a uniformly random subset of the indices, and a second subset that is identical except that index \(i\) is toggled (added if it was absent, removed if it was present).
  2. The client sends one subset to each server.
  3. Each server replies with the XOR of \(D[k]\) over all indices \(k\) in the subset it received.
  4. The client XORs the two replies together; every other cell appears in both replies or in neither, so what remains is \(D[i]\).



The key idea is that each server receives what looks like a completely random subset of the indices. There is not even a slight bias from mixing in \(i\), because that modification is as likely to add the value as it is to remove it. The two servers give XOR-sums of subsets that are almost identical, but differ in one place: one includes \(D[i]\) and the other doesn't. The client gets these two values, and takes the difference between the two, which gives \(D[i]\).
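Here is a minimal toy sketch of this protocol in Python, assuming each cell is represented as an integer (the helper names are just for illustration):

```python
import secrets
from functools import reduce

def xor_subset(D, subset):
    # XOR together the cells whose index is selected in the bitmask `subset`
    return reduce(lambda a, b: a ^ b, (D[k] for k in range(len(D)) if subset[k]), 0)

def make_queries(n, i):
    # Two subsets that each look uniformly random on their own, differing only at index i
    q1 = [secrets.randbelow(2) for _ in range(n)]
    q2 = list(q1)
    q2[i] ^= 1
    return q1, q2

def private_read(D1, D2, i):
    # D1, D2 are the two servers' identical copies of the database
    q1, q2 = make_queries(len(D1), i)
    a1 = xor_subset(D1, q1)   # answer from server 1
    a2 = xor_subset(D2, q2)   # answer from server 2
    return a1 ^ a2            # everything cancels except D[i]

D = [secrets.randbits(256) for _ in range(64)]   # 64 cells of 256 bits each
assert private_read(D, D, 5) == D[5]
```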

This naturally generalizes to "cells" that are bigger than one bit; you could have each \(D[i]\) have eg. 1000 bits, and the protocol is exactly the same.

This protocol has three big weaknesses:

  1. You have to trust that one of the two servers is honest.
  2. The client has communication overhead equal to the number of cells (this can be much smaller than the total size of the data if the cells are big, but it's still a lot).
  3. The server has to XOR half the total data to respond to a single query.

There are two well-understood modifications to classic two-server PIR that solve (1).

One strategy is PIR with preprocessing. Notice that in the protocol above, you can compute one of the two random sample sets before you even know which \(i\) you want to query for. You generate thousands of these random queries. You have an initial phase where you download (but do not need to store) the whole file \(D\), and you compute the XOR sums for these random sets of indices locally. Essentially, you are acting as one of the two servers - and in this way, satisfy the 1-of-2 trust assumption all by yourself, so you do not need to trust the other server. This approach does require a huge amount of upfront bandwidth, but you can do it ahead of time, or do it in the background and use a more inefficient form of PIR to answer any queries that you get before it finishes.
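Here is a toy sketch of this strategy, continuing the example above. During preprocessing, the client streams the database once and stores (subset, XOR) pairs; at query time, it plays the role of the second server itself:

```python
import secrets
from functools import reduce

def xor_subset(D, subset):
    return reduce(lambda a, b: a ^ b, (D[k] for k in range(len(D)) if subset[k]), 0)

def precompute_hints(D, num_hints):
    # Done once, while streaming the whole database: store (subset, xor) pairs.
    # The client is effectively acting as its own "second server".
    hints = []
    for _ in range(num_hints):
        subset = [secrets.randbelow(2) for _ in range(len(D))]
        hints.append((subset, xor_subset(D, subset)))
    return hints

def query_with_hint(server_D, hint, i):
    # At query time: toggle index i in a stored subset, send it to the single
    # server, and XOR the server's reply with the locally stored XOR.
    subset, local_xor = hint
    query = list(subset)
    query[i] ^= 1
    answer = xor_subset(server_D, query)   # this part is computed by the server
    return answer ^ local_xor
```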

The other strategy is single-server PIR. Basically, you use some variant of homomorphic encryption (often simpler than full-on FHE) to encrypt your query, in such a way that the server can compute on your encrypted query to give you an encrypted answer. This has high computational overhead on the server side, and so is not yet feasible for very large datasets. This strategy can be combined with PIR with preprocessing, so you can theoretically avoid the initial download by doing that procedure homomorphically encrypted on the server side.

Plinko solves (2) and (3) - or at least, it decreases both communication overhead and server-side compute overhead from \(O(N)\) to \(O(\sqrt{N})\) where \(N\) is the number of cells. It uses the preprocessing strategy to remove the trust assumption. Theoretically, you can do the preprocessing in FHE; it is currently an open problem to do this efficiently enough.

In Plinko's language, there is a setup phase, where the client (by downloading the whole dataset, or in principle via FHE) generates "hints" that it stores locally, and a query phase, where the client uses those hints to make queries with only \(O(\sqrt{N})\) communication and server-side work. There is also an easy way for the client to update hints if the dataset changes. Now, let's get into the protocol!

Setup phase

Treat the database as being a square, with \(\sqrt{N}\) rows and \(\sqrt{N}\) columns:



The client generates a random master seed \(S\). For each row \(i\), the client uses the master seed to generate a row seed \(S_i\). The client generates roughly \(128 * \sqrt{N}\) hints. Each hint is generated as follows: for hint index \(j\), the client pseudorandomly selects \(\frac{\sqrt{N}}{2} + 1\) of the rows, computes a column \(c_i = H(S_i, j)\) for each selected row \(i\), and stores the XOR of the cells \((i, c_i)\) across those rows.



The "hash" that is used to compute the \(c_i\) values could be a regular hash, but that leads to (tolerable but significant) inefficiency during the query phase. So instead, Plinko prefers special type of hash called an "invertible PRF". The core property that it achieves is that given \(S_i\) and \(c_i\), you can easily recover all \(j\) values that lead to that \(c_i\). In practical terms, this means that for any cell \((r_i, c_i)\), you can easily find all hints that "include" that cell.

Additionally, the setup phase generates some "backup hints". These are constructed like regular hints, except for each index \(j\), you choose \(\frac{\sqrt{N}}{2}\) rows, and compute a pair of hints: one over the chosen subset of rows, and the other for its complement (ie. the \(\frac{\sqrt{N}}{2}\) rows that were not chosen):



This whole process can be done "streaming", without requiring the client to store any data at any point in time other than the hints themselves. In total, the client needs to download the whole data set and store data equal to roughly \(64 * \sqrt{N}\) cells, plus \(\sqrt{N}\) cells per backup hint (the client will need one backup hint for each query that the client will make).

If desired, the process can also be done in FHE, which means the client would only need to download the hints, but the server would need to do FHE work over the whole dataset; this is of course expensive, and optimizing it is an ongoing problem.

Query phase

Suppose you are interested in a specific coordinate \((x, y)\). First, you discover a hint index \(j\) that generates a hint that contains \((x, y)\). This is where the "invertible PRF" mechanism helps: it lets you immediately run the function in reverse knowing \(S_x\) and \(y\) to compute all \(j\)'s such that \(H(S_x, j) = y\). However, if you want to avoid this complexity, you could just set \(H(S_x, j) = sha256(S_x, \lfloor \frac{j}{16} \rfloor)[j \% 16: (j \% 16) + 2]\) and then compute on average \(\frac{\sqrt{N}}{16}\) hashes every time.

Then, you determine the \(\frac{\sqrt{N}}{2} + 1\) rows that the hint generates. You take out row \(x\). You then send the server a message that contains:

  1. The hint's remaining \(\frac{\sqrt{N}}{2}\) rows, each paired with the column \(c_i\) that the hint chose for it.
  2. The other \(\frac{\sqrt{N}}{2}\) rows (including row \(x\) itself), each paired with a random "junk" column.

You send these in a random order, so the server does not know which is which. As far as the server can tell, you just sent it a random set of points, one per row, with the rows split in half in an arbitrary way. The server XORs both sets of cells, and replies with both outputs.

The client then locally takes its hint, XORs it with the (non-junk) value returned by the server, and it has its answer.



The server does a computation based on \(\sqrt{N}\) cells, and returns that data to the client. The server knows nothing about which cell the client wants (provided the client does not reuse hints), because the client gave the server every point in the set in its hint except the point it actually wants. And without knowing the hash that generated these subsets, to the server, all the row and column choices look completely random.
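Here is a toy version of the query flow, reusing the [rows, cols, value] hint format from the sketch above (the message encoding and helper names are just for illustration):

```python
import secrets
from functools import reduce

def xor_points(D, points):
    return reduce(lambda a, b: a ^ b, (D[r][c] for r, c in points), 0)

def build_query(hint, x, n):
    # The hint is [rows, cols, value] and is assumed to contain the cell (x, y)
    rows, cols, _ = hint
    real_set = [(r, cols[r]) for r in rows if r != x]            # the hint minus (x, y)
    junk_set = [(r, secrets.randbelow(n))                        # random columns for the
                for r in range(n) if r not in rows or r == x]    # other rows, including x
    pair = [real_set, junk_set]
    which = secrets.randbelow(2)      # send the two halves in a random order
    return pair[which], pair[1 - which], which

def server_respond(D, set_a, set_b):
    # The server just XORs the cells in each half and returns both results
    return xor_points(D, set_a), xor_points(D, set_b)

def reconstruct(hint, answers, which):
    # XOR the stored hint value with the non-junk answer to recover D[x][y]
    return hint[2] ^ answers[which]
```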

Backup queries

To avoid using a hint twice (and thus leaking data), when the client makes a query, it discards the hint that it used, and "promotes" a backup hint into being a regular hint. Remember that each "backup hint" is a pair of hints, one that is based on a random subset of the rows, and the other based on its complement. The algorithm is simple:

  1. Take an unused backup hint pair, and keep the half whose row set does not contain row \(x\).
  2. Add row \(x\) to that half, with column \(y\).
  3. XOR the value \(D[x][y]\) that you just learned into that half's stored XOR.

Now, you have a set of \(\frac{\sqrt{N}}{2} + 1\) rows with an XOR of one randomly-chosen column from each row: exactly the same format as a regular hint.
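A sketch of that promotion step in the same toy representation (how the backup pair is stored is an assumption):

```python
def promote_backup(backup_pair, x, y, learned_value):
    # backup_pair is a pair of [rows, cols, value] hints over complementary row sets
    half_a, half_b = backup_pair
    # Keep the half whose row set does not contain row x...
    half = half_a if x not in half_a[0] else half_b
    rows, cols, value = half
    # ...add the cell (x, y) we just learned, and fold its value into the XOR.
    return [rows + [x], {**cols, x: y}, value ^ learned_value]
```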



Updating the dataset

Suppose a value in the dataset changes. Then, you can find all hints that contain that cell (this is what the invertible PRF makes easy), and simply do a direct update: mix in the XOR of the old value and the new value.
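In the toy representation from above, the update is one XOR per affected hint (here we just scan all hints; with an invertible PRF the matching hints can be found directly):

```python
def update_hints(hints, r, c, old_value, new_value):
    # XOR-ing with (old ^ new) removes the old cell value and adds the new one
    delta = old_value ^ new_value
    for hint in hints:
        rows, cols, _ = hint
        if r in cols and cols[r] == c:
            hint[2] ^= delta
```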



And that's the whole protocol!

Concrete efficiency

In asymptotic terms, everything in Plinko is \(O(\sqrt{N})\): the size of the hints that the client needs to store, the data transmitted per query (technically that's \(O(\sqrt{N} * log(N))\)), and the server-side compute costs. But it helps to put this all into the context of a real-world workload. So let's use the real-world workload that I am the most familiar with: Ethereum.

Suppose that you are querying a dataset with 10 billion values, each of which is 32 bytes. This is a pretty good description of what the Ethereum state tree will look like in a couple of years, once it gets significantly bigger due to scaling. We use cuckoo hashing to convert a key-value store into a "flat" lookup table of the type that Plinko handles.

To include Merkle branches, one simple way (though not the optimal way) is to treat them separately as queries into smaller databases; if it's a binary tree, then the first level is full-sized, the second level is half-sized, the third level is quarter-sized, etc; because PIR costs are all proportional to sqrt, this means that our total costs will all be multiplied by \(1 + 1 + \frac{1}{\sqrt{2}} + \frac{1}{2} + \frac{1}{2\sqrt{2}} + ... = 3 + \sqrt{2} \approx 4.414\). Ethereum's state tree is (for now) a less-efficient hexary tree, but you can ZK-prove equivalence to a binary tree to get the efficiencies of a binary tree today.
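A quick check of that multiplier: it is the 1 for the database itself, plus a geometric series with ratio \(\frac{1}{\sqrt{2}}\) over the tree levels:

```python
import math

# 1 for the database itself, plus sqrt-costs of tree levels of size N, N/2, N/4, ...
multiplier = 1 + sum((1 / math.sqrt(2)) ** k for k in range(200))
print(multiplier, 3 + math.sqrt(2))   # both ~4.414
```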



But first, let's provide raw costs without the Merkle branches. This implies a 100000×100000 grid, where each cell is 32 bytes (256 bits). The client would need to store 100000×128 hints, which is 100000 × 128 × 32 bytes ~= 390 MB, plus some extra backup hints.

For each query, the client would need to send the server a partitioning of the rows into two parts, and 100000 column indices. We could be super-clever and do slightly better than this, but if we do it naively, the row partition is 100000 bits (12.2 kB) and the column indices are 100000 * ceil(log2(100000)) = 1.7m bits ≈ 207.5 kB. These two sum up to roughly 220 kB per query.
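The same arithmetic as a quick computation (kB here means KiB):

```python
import math

rows = cols = 100_000
partition_kb = rows / 8 / 1024                              # one bit per row
indices_kb = rows * math.ceil(math.log2(cols)) / 8 / 1024   # 17 bits per column index
print(partition_kb, indices_kb, partition_kb + indices_kb)  # ~12.2, ~207.5, ~219.7
```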

If we introduce Merkle branches, the hint costs blow up to about 1.68 GB. With clever work, we can actually avoid the per-query communication costs increasing by 4.414x, because we can adjust the PIR in such a way that we use the same (or rather, analogous) query indices for the datasets representing different levels of the tree.

Server-side costs involve reading and XOR'ing 100000 cells, or 3.05MB per query; the XOR is negligible, but the read load is significant. Adding Merkle branches would blow this up to about 13.4 MB in theory, though the realistic added cost may well be negligible because clever engineering can ensure that matching values are always in the same page (memory pages are usually 4096 bytes).

Our main tool to fiddle with these numbers is to make the rectangle unbalanced (have different width and height): we can increase hint storage by 2x to reduce query size by 2x. Realistically, this is a good tradeoff: storing even 10 GB of hints is not that hard, but especially if nodes are sending multiple queries every 200 milliseconds, 220 kB per query is higher than we would be comfortable with.

In the case of a search engine, the size of the objects would be much larger (eg. 10 kB per text webpage), but the number of queries may be smaller. Hence, query complexity would be relatively low, but hint size would be large. To compensate for this, it may make sense to make the rectangle unbalanced in the other direction.

Beyond Plinko: comparing to the alternatives

One already-existing alternative to Plinko is TreePIR. TreePIR works by using a "puncturable PRF": a hash function where you can provide a key that allows evaluating it at all points except for one chosen point. This allows you to generate a representation of the query columns ("values generated from a seed \(S\) except the \(j\)'th value") that is logarithmic-sized - much more compact than providing the whole list directly:



The full approach is somewhat more clever than the above diagram, which as drawn would reveal the excluded index 01100, violating privacy. TreePIR works by providing the yellow values without revealing their (horizontal) position in the tree. There are \(2^{log_2(\sqrt{N})} = \sqrt{N}\) possible sets of leaf values, and thus \(\sqrt{N}\) possible sets of columns, that can be generated by taking the left or right choice of position for each yellow value. The server has an \(O(\sqrt{N} * log(N))\) time algorithm to compute the XOR sum of chosen cells for every possible arrangement simultaneously. The client knows which of these sums corresponds to the set that it actually needs. It uses single-server PIR to ask for it.

This reduces communication complexity to logarithmic, at the cost of increased server work, and at the cost of removing the "invertible PRF" mechanism, forcing the client to "grind" through many possible hints until they find a matching hint.

There are various theoretical lower bounds for how much overhead a PIR protocol needs to have. In reality, however, most of these bounds come with caveats, and so the "real" lower bounds are far fewer. For example, if you have obfuscation, you can make any of these protocols require only a constant real-time communication overhead - but of course, today, such protocols are still quite far away from practicality. The harder thing to reduce is server-side computation. Because it's "just" XOR, the workload in Plinko is already not a big deal. But if we want to remove the client-side \(O(N)\) download requirement, then we need the server side to do much heavier computation.

With Plinko, PIR is much further along than it was a decade ago, but there is still theoretically much further that these protocols could go.