Memory access is O(N^⅓)
2025 Oct 05
In computer science, we often compare the efficiency of algorithms by
describing their runtime as a function of the size of the input. Sorting is
O(n * log(n)), meaning that sorting a list of N items takes an amount of
time proportional to the number of items multiplied by its logarithm.
Matrix multiplication is somewhere between O(n^2.37)
and O(n^2.8), depending on the choice of algorithm. But these estimates
are all relative to some model of how long it takes for the underlying
machine to perform some basic underlying operations. Typically,
arithmetic operations (addition, multiplication, division...) are
considered to take one unit of time for fixed-size numbers, and memory
accesses are also considered to take one unit of time.
In this post, I will argue that this choice for memory access is
wrong. Memory access, both in theory and in practice, takes O(N^⅓) time:
if your memory is 8x bigger, it will take 2x longer to do a read or
write to it. I will also show an example of an application where this
concretely matters.
The theoretical argument
Imagine you have a processor sitting in the middle of a pile of
memory. The processor's ability to talk to any individual piece of
memory is bounded by the speed of light, ergo the delay is proportional
to distance. In a three-dimensional world, you can fit 8x as much memory
within 2x the distance from you.

Double the distance, eight times the memory.
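To make that concrete, here is a minimal sketch of the bound the picture implies. The memory density and the use of the raw speed of light are illustrative assumptions; only the scaling matters:

```python
import math

# Illustrative assumptions, not real hardware numbers:
SIGNAL_SPEED = 3e8      # meters per second (speed of light)
BYTES_PER_M3 = 1e15     # assumed memory density, bytes per cubic meter

def latency_bound(n_bytes):
    """Round-trip delay to the farthest cell if N bytes fill a sphere around the processor."""
    volume = n_bytes / BYTES_PER_M3
    radius = (3 * volume / (4 * math.pi)) ** (1 / 3)   # radius grows as N^(1/3)
    return 2 * radius / SIGNAL_SPEED

# 8x the memory -> 2x the latency bound:
print(latency_bound(8 * 2**30) / latency_bound(2**30))  # ~2.0
```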
This covers sequential access. In practice, of course, a CPU
is not literally situated inside of a homogeneous cube of memory, and
electrical signals don't travel exactly in a straight line (and, for
that matter, are slower than light). But, as we will see later, the
model is surprisingly close to accurate in practice.
For parallel access, things get more subtle and depend on the medium.
If you think of read and write as occupying a "wire" of the needed
length for a fixed length of time, then you get the same result: 8x the
memory, 2x the wire length * time. If you think of it as a signal made
out of light that takes some amount of energy, where its intensity needs
to take into account dispersal, then 8x the memory implies 2x the length
which implies 4x the energy required for a single access. This can be
mitigated by having repeaters repeat the signal in the middle, but that
starts to get us closer again to wires. So O(N^⅓) is probably the better
model.
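As a toy calculation of those two parallel-access models (illustrative only; the constants don't matter, just the exponents):

```python
def wire_cost(memory_ratio):
    # Cost ~ wire length ~ (memory)^(1/3): 8x the memory -> 2x the cost.
    return memory_ratio ** (1 / 3)

def free_space_light_energy(memory_ratio):
    # Energy ~ distance^2 due to dispersal, and distance ~ (memory)^(1/3):
    # 8x the memory -> 4x the energy, i.e. O(N^(2/3)) without repeaters.
    return memory_ratio ** (2 / 3)

print(wire_cost(8), free_space_light_energy(8))  # 2.0 4.0
```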
The empirical argument
Computers in reality have different types of memory: registers,
different levels of cache, and RAM. We'll omit SSDs and HDDs here
because they are designed for additional requirements on top: persistence
without power, and lower cost per bit.
We can ask a question: how long (in nanoseconds) does it take to
access a type of memory of which an average laptop has N bytes? Here's
GPT's answer:

As it turns out, a naive formula that treats access time as proportional
to the cube root of the size of the memory being accessed gets us surprisingly close.
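For instance, here is a quick check of that fit against ballpark laptop numbers (the sizes and latencies below are rough figures I am assuming for illustration, not measurements):

```python
# (memory type, approximate size in bytes, approximate access latency in ns) -- assumed ballpark values
MEMORY_LEVELS = [
    ("L1 cache", 64 * 2**10,  1),
    ("L2 cache",  1 * 2**20,  4),
    ("L3 cache", 32 * 2**20, 12),
    ("DRAM",     16 * 2**30, 80),
]

# Naive model: latency ≈ k * size^(1/3), with k anchored to the L1 point.
k = MEMORY_LEVELS[0][2] / MEMORY_LEVELS[0][1] ** (1 / 3)

for name, size, latency in MEMORY_LEVELS:
    predicted = k * size ** (1 / 3)
    print(f"{name:8}  assumed ~{latency} ns, cube-root model ~{predicted:.1f} ns")
```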
We can also do the same for bandwidth:

Note that here the fit is considerably worse, which is to be
expected: compared to latency, bandwidth is much less about fundamental
physics and much more about architectural choices. L3 cache is not built
for mass throughput in the same way that DRAM is, and so it offers
roughly the same throughput despite being much closer to the computation.
Where does this matter?
I can give a concrete example, which I ran into myself a year ago, of
why this all matters: optimized implementations of various
algorithms that precompute and reuse intermediate values.
Often, especially in cryptography, it is the case that you have a
mathematical procedure that involves N steps, where in each step,
depending on one bit of the input, you either "mix in" a particular
value or you don't. The "mixing in" is often an associative
operation. Such a procedure can be done naively in N steps, in N/8 steps
if you precompute a 256-entry table for each group of 8 steps, in N/16
steps if you precompute a 65536-entry table for each group of 16 steps,
and so on. Examples of this
include elliptic curve multiplication, binary field math (see eg. my
implementation here), and many other algorithms.

A surprisingly widely applicable way to optimize
cryptography.
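Here is a minimal sketch of the pattern, using XOR as a stand-in for the associative "mix in" operation (in a real use it would be, say, elliptic curve point addition or binary field multiplication):

```python
from functools import reduce
from operator import xor  # associative stand-in for the "mix in" operation

def precompute_tables(step_values, window):
    """For each group of `window` steps, precompute the mix of every subset
    of that group's values: 2**window entries per group."""
    tables = []
    for start in range(0, len(step_values), window):
        chunk = step_values[start:start + window]
        table = []
        for bits in range(2 ** len(chunk)):
            acc = 0  # identity element for XOR
            for i, value in enumerate(chunk):
                if (bits >> i) & 1:
                    acc = xor(acc, value)
            table.append(acc)
        tables.append(table)
    return tables

def windowed_mix(input_bits, tables, window):
    """One table lookup + one mix per group: N/window mixes instead of N."""
    acc = 0
    mask = 2 ** window - 1
    for t, table in enumerate(tables):
        acc = xor(acc, table[(input_bits >> (t * window)) & mask])
    return acc

# Usage: 16 steps, 4-bit windows -> 4 mixes instead of 16.
values = list(range(1, 17))
tables = precompute_tables(values, 4)
x = 0b1011001110001101
naive = reduce(xor, (v for i, v in enumerate(values) if (x >> i) & 1), 0)
assert windowed_mix(x, tables, 4) == naive
```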
How big should you make the precomputed table? If you treat a memory
access as O(1), then the answer is clear: as big as your machine has
memory. But if you treat memory access as O(N^⅓), then it becomes a more
subtle tradeoff: you have to minimize N^⅓ / log(N), where N is now the
size of the table, and the optimal value is always some "interior
solution" (ie. not 1 and not "as much as you can") whose exact position
depends on the constants involved.
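A toy version of that tradeoff, with assumed constants (a 256-bit input, 8 bytes per table entry) chosen only to show that the minimum is interior:

```python
def total_cost(window_bits, input_bits=256, bytes_per_entry=8):
    """Steps * per-access cost, with the per-access cost modeled as the cube
    root of the table size in bytes (the O(N^(1/3)) model)."""
    table_bytes = (2 ** window_bits) * bytes_per_entry
    steps = input_bits / window_bits
    return steps * table_bytes ** (1 / 3)

costs = {w: total_cost(w) for w in range(1, 33)}
best = min(costs, key=costs.get)
print(best, round(costs[best], 1))  # the optimum is a few bits, not 1 and not 32
```

With these constants the optimum lands around a 4-bit window; real constants (cache sizes, the relative cost of a mix versus a lookup) move it around, but it stays interior.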
In my binary field code, I found that an 8-bit precomputation table
(in that context, 2^24 items taking up 128 MB) led to faster
computations than a 16-bit precomputation table (2^32 items
taking up 8 GB): while the latter fit into RAM, the former could fit
into cache, and the faster access time of the former was decisive.
In the longer term, we are in an era where we are
approaching the limits of general-purpose CPUs, and people are
increasingly exploring ASICs for various tasks. Here, the structure of
the ASIC matters. If you can break up a task into many parts,
each of which is highly local, then memory access in each part will be
O(1). GPUs are already often very good at getting precisely
these kinds of efficiencies. But if the task requires a lot of memory
interdependencies, then you will get lots of O(N^⅓) terms. An open
problem is coming up with mathematical models of computation that are
simple but do a good job of capturing these nuances.