Optimal Compilation of QFT Circuits for Neutral-Atom Hardware
A self-contained walkthrough
Introduction
The Quantum Fourier Transform is arguably the single most important subroutine in quantum computing. It is the engine inside Shor’s factoring algorithm, the backbone of quantum phase estimation, and a key primitive in quantum simulation and quantum machine learning. In a textbook, the QFT looks elegant: a cascade of Hadamard gates and controlled phase rotations that turns computational basis states into Fourier basis states. On real hardware, that elegance hides a brutal compilation problem.
The textbook QFT on \(n\) qubits requires every qubit to interact with every other qubit via a controlled phase gate. That is \(n(n-1)/2\) two-qubit gates — all-to-all connectivity. No physical platform provides this natively. Superconducting processors have fixed coupling maps. Trapped ions have limited zones. And neutral-atom arrays, for all their flexibility, arrange atoms on geometric grids where interactions fall off with distance.
This walkthrough explores the compilation landscape for QFT circuits on neutral-atom hardware, synthesizing results from recent work on atom-shuttling compilers, approximate QFT, and Rydberg-native gate decomposition.
The gap between the circuit the algorithm demands and the operations the hardware provides is the compiler’s problem. For most circuits, this gap is annoying but manageable — a few extra SWAP gates here, a bit of routing there. For the QFT, the gap is structural: the circuit’s connectivity pattern is maximally non-local, and the hardware’s is fundamentally local. Naive compilation produces an avalanche of SWAP insertions that can inflate circuit depth by an order of magnitude, burying the quantum speedup under classical overhead.
Why neutral atoms?
Neutral-atom platforms occupy a unique position in this landscape. Unlike superconducting qubits with fixed wiring, neutral atoms trapped in optical tweezer arrays can be physically rearranged mid-circuit. Instead of inserting SWAP gates to bring distant qubits together, you can simply move the atoms.
This reconfigurability opens a compilation strategy that is unavailable on any other platform: shuttling-based routing. Rather than paying the fidelity cost of three CNOT gates per SWAP, you pay the (potentially smaller) cost of physically transporting an atom across the array. The trade-off between SWAP overhead and shuttling time is at the heart of QFT compilation for neutral atoms — and getting it right can mean the difference between a circuit that runs successfully and one that drowns in noise.
Why the QFT matters
The Quantum Fourier Transform appears in nearly every quantum algorithm that achieves an exponential speedup over classical computation:
- Shor’s algorithm for integer factoring uses the QFT to extract the period of a modular exponentiation function. Breaking RSA encryption reduces to running a QFT.
- Quantum Phase Estimation (QPE) uses the inverse QFT to read out eigenvalues of unitary operators. QPE is the workhorse of quantum chemistry, where it estimates molecular ground-state energies.
- The HHL algorithm for solving linear systems uses QPE (and hence the QFT) as a subroutine, promising exponential speedups for certain classes of linear algebra problems.
- Quantum simulation algorithms for condensed matter and high-energy physics rely on QFT-like transforms for switching between position and momentum representations.
If you can compile the QFT efficiently, you unlock efficient implementations of all of these algorithms. If you cannot, none of them will run on real hardware at useful scales.
What this walkthrough covers
We address the full compilation stack for QFT circuits on neutral-atom hardware:
- Gate decomposition: How to break down the QFT’s controlled phase rotations into the native gate set of a neutral-atom processor (single-qubit rotations + CZ gates via Rydberg blockade).
- Approximate QFT: How truncating small-angle rotations reduces the two-qubit gate count from \(\bigO{n^2}\) to \(\bigO{n \log n}\) with bounded error — a crucial optimization that most compilation discussions overlook.
- Routing strategies: Three approaches to handling non-local gates — SWAP-based routing on a fixed array, atom-shuttling via tweezer reconfiguration, and a hybrid strategy that combines both.
- Parallel scheduling: How to exploit the QFT’s internal parallelism while respecting the Rydberg blockade exclusion zones that prevent simultaneous nearby gates.
- Benchmarking: Quantitative comparison of all strategies across gate count, circuit depth, estimated fidelity, and total execution time, for QFT sizes from 4 to 64 qubits.
Roadmap
We begin with the necessary background in Section 2 — the QFT’s mathematical definition, controlled phase gates, neutral-atom hardware basics, and compilation fundamentals. Section 3 dissects the textbook QFT circuit in detail, analyzing its gate count, connectivity requirements, and the powerful approximate QFT truncation. Section 4 examines the specific constraints of neutral-atom arrays: native gate sets, connectivity models, shuttling costs, and Rydberg exclusion zones. Section 5 presents three compilation strategies and their optimization techniques. Section 6 benchmarks everything head-to-head. Finally, Section 7 connects our findings to the broader landscape of quantum compilation research, including links to velocity-enabled quantum computing, chiplet architectures, and circuit cutting.
Prerequisites and Background
Before we can talk about compiling the QFT, we need to understand three things: what the QFT actually is, what neutral-atom hardware looks like, and what a quantum compiler does. If you are already comfortable with all three, feel free to skip ahead to Section 3. But even experts may find the neutral-atom and compilation sections useful for context, since the interplay between these domains is exactly what makes QFT compilation interesting.
Quantum gates recap
We assume familiarity with qubits and basic gates. For a thorough introduction, see our velocity-qc walkthrough, which covers qubits, the Bloch sphere, CZ gates, and Bell states from the ground up. Here we recall only what we need.
A single qubit lives in a two-dimensional Hilbert space spanned by \(\ket{0}\) and \(\ket{1}\). Single-qubit gates are \(2 \times 2\) unitary matrices. The ones we care about most are:
- Hadamard \(H = \tfrac{1}{\sqrt{2}}\bigl(\begin{smallmatrix}1&1\\1&-1\end{smallmatrix}\bigr)\), which maps \(\ket{0} \mapsto \ket{+}\) and \(\ket{1} \mapsto \ket{-}\).
- Phase gate \(R_k = \bigl(\begin{smallmatrix}1&0\\0&e^{2\pi i/2^k}\end{smallmatrix}\bigr)\), which applies a phase of \(e^{2\pi i/2^k}\) to the \(\ket{1}\) component. For \(k=1\) this is the Pauli \(Z\) gate; for \(k=2\) it is the \(S\) gate (\(\pi/2\) phase); for \(k=3\) it is the \(T\) gate (\(\pi/4\) phase).
The \(R_k\) notation follows the standard convention in QFT literature, where the subscript indicates the denominator exponent: \(R_k\) applies phase \(2\pi/2^k\).
For two-qubit operations, the key gate is the controlled phase gate \(CR_k\): it applies the phase \(e^{2\pi i/2^k}\) to the target qubit, but only when the control qubit is in \(\ket{1}\).
Controlled-\(R_k\) gate. The gate \(CR_k\) acts on two qubits as: \[ CR_k \ket{a,b} = e^{2\pi i \cdot ab / 2^k} \ket{a,b}, \qquad a,b \in \{0,1\}. \] Equivalently, it is a diagonal matrix with entries \((1, 1, 1, e^{2\pi i/2^k})\) in the computational basis \(\{\ket{00}, \ket{01}, \ket{10}, \ket{11}\}\).
Think of \(CR_k\) as a conditional nudge: if both qubits are \(\ket{1}\), the state picks up a small phase; otherwise, nothing happens. The larger \(k\) is, the smaller the phase angle \(2\pi/2^k\) — and as we will see, this has profound implications for approximate QFT.
Example 1. \(CR_1\) applies phase \(e^{i\pi} = -1\) to \(\ket{11}\). This is exactly the CZ gate. \(CR_2\) applies phase \(e^{i\pi/2} = i\) to \(\ket{11}\) — a quarter-turn. \(CR_3\) applies \(e^{i\pi/4}\) — an eighth-turn. Each successive \(k\) halves the rotation angle.
The Quantum Fourier Transform
The QFT is the quantum analogue of the classical discrete Fourier transform (DFT). It maps computational basis states to Fourier basis states:
Quantum Fourier Transform. On \(n\) qubits, with \(N = 2^n\), the QFT is the unitary transformation: \[ \text{QFT}_N \ket{j} = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} \omega^{jk} \ket{k}, \qquad \omega = e^{2\pi i / N}. \]
This looks abstract, so let’s unpack it. The QFT takes a state \(\ket{j}\) (a number written in binary) and produces a uniform superposition over all basis states, where the phases encode the original number \(j\). The larger \(j\) is, the faster the phases wind around the unit circle.
Example 2. For \(n = 2\) qubits (\(N = 4\)), \(\omega = e^{i\pi/2} = i\). The QFT maps:
- \(\ket{0} \mapsto \tfrac{1}{2}(\ket{0} + \ket{1} + \ket{2} + \ket{3})\) — all phases zero
- \(\ket{1} \mapsto \tfrac{1}{2}(\ket{0} + i\ket{1} - \ket{2} - i\ket{3})\) — phases rotate by \(\pi/2\) per step
- \(\ket{2} \mapsto \tfrac{1}{2}(\ket{0} - \ket{1} + \ket{2} - \ket{3})\) — phases rotate by \(\pi\) per step
- \(\ket{3} \mapsto \tfrac{1}{2}(\ket{0} - i\ket{1} - \ket{2} + i\ket{3})\) — phases rotate by \(3\pi/2\) per step
The pattern is clear: state \(\ket{j}\) gets mapped to a superposition whose phase oscillates \(j\) times faster.
The crucial insight — the one that makes the QFT circuit possible — is that this unitary factors into a product of single-qubit terms:
\[ \text{QFT}_N \ket{j_1 j_2 \cdots j_n} = \frac{1}{\sqrt{N}} \bigotimes_{l=1}^{n} \left( \ket{0} + e^{2\pi i \cdot 0.j_{n-l+1} \cdots j_n} \ket{1} \right) \]
where \(0.j_k j_{k+1} \cdots j_n\) is the binary fraction \(\sum_{m=k}^{n} j_m / 2^{m-k+1}\).
This factorization tells us that each output qubit depends on a phase determined by a subset of the input bits. That phase can be built up one bit at a time using controlled rotations — which is exactly what the QFT circuit does. We will construct this circuit explicitly in Section 3.
From Cooley-Tukey to quantum: the FFT connection
The classical Fast Fourier Transform (FFT), discovered by Cooley and Tukey in 1965 (and rediscovered multiple times before), computes the DFT of \(N\) numbers in \(\bigO{N \log N}\) operations instead of \(\bigO{N^2}\). It does this by recursively splitting the transform into even-indexed and odd-indexed sub-transforms.
The QFT achieves an even more dramatic speedup: it acts on \(N = 2^n\) amplitudes using only \(\bigO{n^2}\) gates — exponentially fewer than the \(\bigO{N \log N}\) classical FFT operations, because the quantum circuit processes all \(N\) amplitudes in superposition simultaneously. This is one of the clearest examples of quantum advantage from circuit structure alone.
Interestingly, the recursive structure of the QFT circuit mirrors the butterfly pattern of the FFT. The controlled rotations \(CR_k\) play the same role as the twiddle factors \(\omega^k\) in the classical algorithm.
Neutral-atom quantum computing
Neutral-atom processors trap individual atoms — typically rubidium-87, cesium-133, or strontium-88 — in arrays of tightly focused laser beams called optical tweezers. Each atom encodes one qubit, usually in two long-lived hyperfine or fine-structure states.
For a detailed introduction to neutral-atom platforms, including Rydberg gates, velocity-selective addressing, and error correction demonstrations, see our velocity-qc walkthrough.
Single-qubit gates are implemented by driving transitions between the two qubit states with laser pulses. By tuning the pulse duration, phase, and frequency, arbitrary single-qubit rotations can be performed on individual atoms or globally on the entire array.
Two-qubit gates exploit the Rydberg blockade mechanism. When an atom is excited to a highly energetic Rydberg state, it creates a strong electric field that shifts the energy levels of nearby atoms, preventing them from being simultaneously excited. This conditional behavior — “if one atom is excited, the neighbor cannot be” — is the basis of controlled-Z (CZ) gates:
Rydberg CZ gate. Two atoms within the blockade radius \(r_b\) undergo a CZ gate when both are driven by a global Rydberg pulse. The blockade ensures that the \(\ket{11}\) component acquires a phase of \(-1\) relative to \(\ket{00}\), \(\ket{01}\), and \(\ket{10}\).
The native two-qubit gate on neutral-atom hardware is therefore CZ (or equivalently, CPHASE with phase \(\pi\)). Controlled rotations \(CR_k\) with \(k > 1\) are not native — they must be decomposed into CZ gates plus single-qubit rotations, as we discuss in Section 4.
Reconfigurability. A defining feature of neutral-atom arrays is that tweezers can be dynamically repositioned during computation. Atoms can be shuttled — physically moved from one location to another — by steering the trapping laser. This takes on the order of 10–100 \(\mu\)s depending on the distance, during which the qubit’s coherence is partially preserved. Shuttling is the neutral-atom equivalent of a SWAP gate, but it operates in physical space rather than gate space.
Quantum compilation: the big picture
A quantum compiler transforms an abstract circuit — expressed in terms of ideal gates on logical qubits — into a sequence of physical operations that a specific hardware platform can execute. The process typically involves four stages:
Gate decomposition. Translate non-native gates into the hardware’s native gate set. For neutral atoms: decompose \(CR_k\) into CZ + single-qubit rotations.
Routing. The abstract circuit may connect qubits that are not physically adjacent. The compiler must either insert SWAP gates to move quantum information through the coupling graph, or (on reconfigurable platforms) schedule atom shuttles to bring qubits together.
Scheduling. Determine the time ordering of operations, maximizing parallelism while respecting hardware constraints — in particular, the Rydberg blockade exclusion zone that prevents simultaneous two-qubit gates on nearby atom pairs.
Optimization. Post-process the compiled circuit to cancel redundant gates, merge consecutive rotations, and reduce overall depth.
Remark. Remark. For most circuits, routing dominates the compilation overhead. For the QFT specifically, the all-to-all connectivity requirement makes routing exceptionally expensive — and the choice of routing strategy (SWAPs vs shuttling) has an outsized impact on the final circuit quality. This is what makes QFT compilation on neutral atoms such a rich problem.
The rest of this walkthrough develops each stage in detail, tailored specifically to the QFT circuit and neutral-atom hardware. We start by taking the QFT circuit apart, piece by piece.
Anatomy of the QFT Circuit
Now that we have the background in place, let’s build the QFT circuit from scratch and understand exactly why it creates such a headache for compilers. We will construct it step by step, count every gate, map out its connectivity requirements, and then introduce the single most powerful optimization available: the approximate QFT.
Building the circuit
Recall from Section 2 that the QFT maps each computational basis state to a product of single-qubit states, where the phase of each qubit depends on a binary fraction of the input. This means we can build the QFT by processing one output qubit at a time, accumulating the right phase bit by bit.
Here is the procedure for an \(n\)-qubit QFT:
Step 1. Apply a Hadamard gate \(H\) to qubit 1. This creates the superposition \(\frac{1}{\sqrt{2}}(\ket{0} + e^{2\pi i \cdot 0.j_1}\ket{1})\), encoding the most significant bit of \(j\) into the phase.
Step 2. Apply \(CR_2\) between qubit 2 (control) and qubit 1 (target). The phase of qubit 1 is now refined to \(e^{2\pi i \cdot 0.j_1 j_2}\), incorporating the second bit.
Step 3. Apply \(CR_3\) between qubit 3 (control) and qubit 1 (target), refining the phase to \(e^{2\pi i \cdot 0.j_1 j_2 j_3}\).
Continue until \(CR_n\) between qubit \(n\) and qubit 1. Qubit 1 is now complete.
Repeat the entire procedure for qubit 2 (starting with \(H\) on qubit 2, then \(CR_2\) from qubit 3, etc.), then qubit 3, and so on.
Final step. Apply SWAP gates to reverse the qubit order, since the QFT outputs are in reversed bit order.
Example 3. For \(n = 4\) qubits, the circuit consists of:
- Qubit 1 block: \(H_1\), \(CR_2^{(2 \to 1)}\), \(CR_3^{(3 \to 1)}\), \(CR_4^{(4 \to 1)}\)
- Qubit 2 block: \(H_2\), \(CR_2^{(3 \to 2)}\), \(CR_3^{(4 \to 2)}\)
- Qubit 3 block: \(H_3\), \(CR_2^{(4 \to 3)}\)
- Qubit 4 block: \(H_4\)
- Reversal: SWAP\((1,4)\), SWAP\((2,3)\)
The superscript notation \(CR_k^{(c \to t)}\) means control qubit \(c\), target qubit \(t\).
The QFT circuit has a distinctive “staircase” shape: each successive qubit block is shorter by one gate, creating a triangular pattern.
Gate count analysis
Let’s count precisely. Qubit \(i\) (for \(i = 1, 2, \ldots, n\)) requires one Hadamard and \((n - i)\) controlled rotations. The total number of two-qubit gates is:
\[ \sum_{i=1}^{n} (n - i) = \sum_{k=0}^{n-1} k = \frac{n(n-1)}{2}. \]
QFT gate count. An \(n\)-qubit QFT requires \(n\) Hadamard gates and \(\frac{n(n-1)}{2}\) controlled rotation gates (\(CR_k\) for various \(k\)). Plus \(\lfloor n/2 \rfloor\) SWAP gates for bit reversal.
The Hadamards and SWAPs are \(\bigO{n}\), so the two-qubit gate count scales as \(\bigO{n^2}\). For concreteness:
| \(n\) | Two-qubit gates | Total including SWAPs |
|---|---|---|
| 4 | 6 | 8 |
| 8 | 28 | 32 |
| 16 | 120 | 128 |
| 32 | 496 | 512 |
| 64 | 2016 | 2048 |
At \(n = 64\), we need over 2,000 two-qubit gates — each of which must be decomposed into native gates on the hardware.
The connectivity problem
Here is the crux of the compilation challenge. Look at which pairs of qubits interact:
- Qubit 1 interacts with qubits 2, 3, 4, …, \(n\)
- Qubit 2 interacts with qubits 3, 4, 5, …, \(n\)
- Qubit \(i\) interacts with qubits \(i+1\), \(i+2\), …, \(n\)
Every pair of qubits \((i, j)\) with \(i < j\) shares a \(CR_{j-i+1}\) gate. This is all-to-all connectivity — the worst case for any hardware with geometric locality constraints.
Remark. Remark. Compare this with a quantum chemistry circuit (e.g., UCCSD), which typically has sparse, structured connectivity dictated by the molecular orbital overlaps. Or a random circuit, which has connectivity determined by a random graph. The QFT is uniquely bad: its connectivity graph is the complete graph \(K_n\). No routing algorithm can avoid substantial overhead.
On a 1D linear chain (nearest-neighbor connectivity), a single \(CR_k\) between qubits 1 and \(n\) requires \(n-1\) SWAP gates to bring the qubits adjacent, execute the gate, and move them back. On a 2D grid, the situation is better but still costly: the distance between any two qubits scales as \(\bigO{\sqrt{n}}\).
Approximate QFT: the key optimization
Not all \(CR_k\) gates are equally important. As \(k\) increases, the rotation angle \(2\pi / 2^k\) shrinks exponentially. \(CR_{10}\) applies a phase of \(2\pi / 1024 \approx 0.006\) radians — barely distinguishable from the identity gate. This observation, due to Coppersmith (1994), leads to the approximate QFT.
Approximate QFT. Fix a cutoff parameter \(m\). The approximate QFT keeps only those \(CR_k\) gates with \(k \leq m\), discarding all controlled rotations with smaller angles. The resulting circuit implements an approximation \(\widetilde{\text{QFT}}\) to the exact QFT.
How good is the approximation? The operator norm error satisfies:
\[ \| \text{QFT}_N - \widetilde{\text{QFT}}_N \| \leq \frac{n(n-1)}{2} \cdot \frac{2\pi}{2^{m+1}} \leq \frac{\pi n^2}{2^{m+1}}. \]
Error bound derivation
Each omitted \(CR_k\) gate differs from the identity by at most \(|e^{2\pi i / 2^k} - 1| \leq 2\pi / 2^k\). For \(k > m\), this is at most \(2\pi / 2^{m+1}\). There are at most \(n(n-1)/2\) such gates. By the triangle inequality for operator norms: \[ \| \text{QFT} - \widetilde{\text{QFT}} \| \leq \sum_{\text{omitted gates}} \| CR_k - I \| \leq \frac{n(n-1)}{2} \cdot \frac{2\pi}{2^{m+1}}. \] To achieve total error \(\epsilon\), we need \(m = \lceil \log_2(n^2 \pi / \epsilon) \rceil \approx 2\log_2 n + \log_2(\pi / \epsilon)\), which is \(\bigO{\log n}\) for fixed \(\epsilon\).
The approximate QFT is not an approximation in the “we hope it’s good enough” sense. It has a rigorous, tight error bound.
This changes the scaling dramatically. With the cutoff \(m = \bigO{\log n}\), each qubit block has at most \(m\) controlled rotations instead of up to \(n - 1\). The total two-qubit gate count becomes:
\[ \sum_{i=1}^{n} \min(n - i, m) \approx nm = \bigO{n \log n}. \]
Example 4. For \(n = 64\) qubits with error tolerance \(\epsilon = 0.01\):
- Exact QFT: \(64 \times 63 / 2 = 2016\) two-qubit gates
- Approximate QFT with \(m \approx 2 \log_2 64 + \log_2(\pi / 0.01) \approx 20\): roughly \(64 \times 20 = 1280\) two-qubit gates — a 37% reduction.
But the real win is in connectivity. With \(m = 20\), qubit \(i\) only needs to interact with qubits \(i+1\) through \(i + 20\) — a local window rather than all-to-all. This dramatically reduces routing overhead.
Remark. Remark. The approximate QFT is crucial for compilation, not just for gate count. By limiting each qubit’s interaction range to a window of size \(m\), it transforms the connectivity graph from the complete graph \(K_n\) (hardest to route) to a bandwidth-\(m\) graph (much easier). On a 2D grid, interactions within a window of size \(m\) require at most \(\bigO{\sqrt{m}}\) SWAPs instead of \(\bigO{\sqrt{n}}\).
Circuit depth and parallelism
Even with the exact QFT, not all gates must be executed sequentially. Gates in different qubit blocks are independent as long as they don’t share qubits. Specifically:
- All \(CR_k\) gates with target qubit \(i\) and control qubits \(j > i\) must be done in sequence (they all target the same qubit).
- But gates in the qubit-\(i\) block and the qubit-\(i'\) block (with \(i' > i\)) can overlap as long as they don’t use the same control qubit.
The theoretical minimum depth (ignoring routing) is \(\bigO{n}\): the qubit-1 block has \(n - 1\) gates, and later blocks are shorter but cannot fully overlap with the first.
With the approximate QFT (cutoff \(m\)), the longest block has \(m\) gates instead of \(n - 1\), giving a depth of \(\bigO{m + n} = \bigO{n}\) for \(m = \bigO{\log n}\). The advantage is that the constant factor improves substantially, and more parallelism becomes available because the blocks are shorter.
The SWAP network for bit reversal
The QFT produces its output in reversed bit order: what should be the most significant qubit ends up as the least significant, and vice versa. Correcting this requires \(\lfloor n/2 \rfloor\) SWAP gates: swap qubits \((1, n)\), \((2, n-1)\), and so on.
On hardware with all-to-all connectivity, each SWAP is a single operation. On a linear chain, each long-range SWAP decomposes into a sequence of nearest-neighbor SWAPs. On a 2D grid, the cost depends on the distance.
In many applications (e.g., QPE followed by measurement), the bit-reversal permutation can be absorbed into classical post-processing, eliminating the SWAP network entirely. This is a free optimization that compilers should check for.
However, there is an important subtlety: if the QFT output is immediately measured, the bit-reversal permutation can be handled classically by simply relabeling the measurement outcomes. In Shor’s algorithm, for instance, the QFT is the last operation before measurement, so the SWAP network is unnecessary. A good compiler should detect this case and skip the reversal.
We will see in Section 5 how these structural features — the staircase pattern, the interaction windows, the parallelism, and the optional reversal — are exploited by different compilation strategies.
If you want to explore these relationships interactively — build QFT circuits for different values of \(n\), toggle approximate truncation, and see how gate counts change — the QFT Circuit Explorer notebook covers all of this with live controls.
Neutral-Atom Hardware Constraints
We have seen in Section 3 that the QFT circuit demands all-to-all connectivity and \(\bigO{n^2}\) two-qubit gates (or \(\bigO{n \log n}\) with approximate truncation). Now let’s look at the other side of the compilation problem: what does the hardware actually provide, and where are the bottlenecks?
Array geometries
Neutral-atom processors arrange atoms in regular spatial patterns. The geometry of the array directly determines which pairs of atoms can interact natively. Three common layouts are:
1D linear chain. Atoms are arranged in a single row, separated by a few microns. Qubit \(i\) can interact directly with qubits \(i-1\) and \(i+1\) only. This is the simplest geometry but the most constrained — reaching qubit \(n\) from qubit 1 requires \(n-1\) routing steps.
2D rectangular grid. Atoms sit on a square lattice, typically with spacing of 3–5 \(\mu\)m. Qubit \((i,j)\) can interact with its four nearest neighbors. The maximum distance between any two qubits on an \(\sqrt{n} \times \sqrt{n}\) grid is \(\bigO{\sqrt{n}}\) steps — substantially better than linear.
2D triangular lattice. Atoms form a triangular (hexagonal) grid, giving each atom six nearest neighbors instead of four. This increases native connectivity at the cost of slightly more complex trap geometry.
Current neutral-atom experiments have demonstrated arrays with hundreds of atoms. Planned systems target thousands. The compilation strategies we discuss work at any scale.
For QFT compilation, the 2D rectangular grid is the most relevant: it balances connectivity, scalability, and experimental feasibility. We will use it as our reference geometry throughout Section 5 and Section 6.
Native gate set
The native operations available on a neutral-atom processor are:
Single-qubit rotations. Arbitrary rotations \(R_z(\theta)\), \(R_y(\theta)\), \(R_x(\theta)\) implemented via laser pulses. These are fast (microsecond-scale) and high-fidelity (typically \(> 99.9\%\)). Any single-qubit unitary can be decomposed into three rotations via the ZYZ decomposition: \[ U = e^{i\alpha} R_z(\beta) R_y(\gamma) R_z(\delta). \]
CZ gate via Rydberg blockade. The native two-qubit gate. Two atoms within the blockade radius \(r_b\) (typically 5–10 \(\mu\)m) undergo a CZ gate when driven by a global Rydberg laser pulse. Gate times are 0.2–1 \(\mu\)s, with state-of-the-art fidelities reaching 99.5–99.9%.
Remark. Remark. The native gate is CZ, not CNOT or \(CR_k\). Every controlled rotation in the QFT circuit must be decomposed into CZ gates and single-qubit rotations. This decomposition is a key step that affects both gate count and circuit depth.
Decomposing \(CR_k\) into native gates
Since the hardware provides CZ (which is \(CR_1\) up to single-qubit corrections) and arbitrary single-qubit rotations, we need to express \(CR_k\) for general \(k\) in terms of these primitives.
The standard decomposition uses the identity:
\[ CR_k = (I \otimes R_z(\theta/2)) \cdot \text{CZ} \cdot (I \otimes R_z(-\theta/2)) \cdot (R_z(\theta/2) \otimes I) \]
where \(\theta = 2\pi / 2^k\) and we absorb global phases.
\(CR_k\) decomposition proof
The gate \(CR_k\) is diagonal: \(CR_k = \text{diag}(1, 1, 1, e^{i\theta})\) where \(\theta = 2\pi/2^k\). We can write this as: \[ CR_k = e^{i\theta/4} \cdot \text{diag}(e^{-i\theta/4}, e^{-i\theta/4}, e^{-i\theta/4}, e^{3i\theta/4}). \] Note that \(\text{diag}(1, 1, 1, -1)\) is the CZ gate. We decompose: \[ \text{diag}(1, 1, 1, e^{i\theta}) = (I \otimes R_z(\theta/2)) \cdot \text{CZ}^{[\theta \bmod 2\pi]} \cdot \text{correction terms}. \] The key insight is that for arbitrary \(\theta\), we need exactly one CZ gate plus single-qubit rotations: \[ CR_k = R_z^{(1)}(\theta/2) \cdot (I \otimes R_z(\theta/2)) \cdot \text{CZ} \cdot (I \otimes R_z(-\theta/2)) \] up to a global phase. This uses 1 CZ gate and 3 single-qubit \(R_z\) rotations.
The bottom line: each \(CR_k\) gate costs exactly one CZ gate plus a few single-qubit rotations (which are essentially free in terms of error budget). This means the QFT’s two-qubit gate count translates directly to CZ gate count.
Example 5. The 4-qubit QFT has 6 \(CR_k\) gates. After decomposition into native gates, we have 6 CZ gates plus 18 single-qubit rotations. The CZ gates dominate the error budget since they are $$10x noisier than single-qubit gates.
Connectivity and the routing problem
On a 2D grid, two atoms can undergo a CZ gate only if they are within the blockade radius \(r_b\). For typical parameters, this means they must be nearest neighbors or possibly next-nearest neighbors on the grid.
The QFT requires gates between all pairs. On a \(\sqrt{n} \times \sqrt{n}\) grid, the maximum Manhattan distance between two atoms is \(2(\sqrt{n} - 1)\). To execute a gate between distant qubits, the compiler has two options:
Option A: SWAP routing. Insert SWAP gates to move quantum information through the grid until the two target qubits are adjacent. Each SWAP decomposes into 3 CZ gates (plus single-qubit rotations). A gate between qubits at distance \(d\) requires \(d\) SWAPs = \(3d\) additional CZ gates.
Option B: Atom shuttling. Physically move one (or both) atoms through the tweezer array until they are adjacent. No SWAP gates are needed, but the atom is in transit for the shuttling duration.
Shuttling costs
Atom shuttling is unique to reconfigurable neutral-atom platforms. The key parameters are:
Shuttling model. An atom can be moved at velocity \(v_{\max}\) (typically 0.1–1 m/s) with acceleration \(a_{\max}\) (typically \(10^3\)–\(10^4\) m/s\(^2\)). For a displacement \(d\), the shuttling time is: \[ t_{\text{shuttle}} = \begin{cases} 2\sqrt{d / a_{\max}} & \text{if } d \leq v_{\max}^2 / a_{\max} \quad \text{(acceleration-limited)} \\ d / v_{\max} + v_{\max} / a_{\max} & \text{otherwise} \quad \text{(velocity-limited).} \end{cases} \]
For typical parameters (\(v_{\max} = 0.5\) m/s, \(a_{\max} = 5 \times 10^3\) m/s\(^2\), lattice spacing 4 \(\mu\)m), shuttling one lattice site takes about 4 \(\mu\)s, and shuttling 10 sites takes about 18 \(\mu\)s. Compare this with a CZ gate time of $$0.5 \(\mu\)s: shuttling is slow but does not consume gate fidelity.
The trade-off is clear: SWAPs are fast but noisy (3 CZ gates each); shuttling is slow but (approximately) noiseless. The optimal choice depends on the ratio of gate error to decoherence rate.
There is a subtlety: while an atom is in transit, it accumulates decoherence at a rate \(1/T_2\), where \(T_2\) is the qubit coherence time. For strontium clock qubits, \(T_2\) can exceed 1 second, so even 100 \(\mu\)s of shuttling causes less than 0.01% decoherence — negligible compared to a single CZ gate error of 0.1–0.5%.
The reconfigurability advantage
This is where neutral atoms diverge most dramatically from superconducting qubits. On a superconducting chip, qubits are fixed in place. The coupling map is literally etched into the silicon. If two qubits need to interact and they are not connected, you must SWAP through intermediaries, paying the full gate-error cost at each hop.
On a neutral-atom array, the coupling map is dynamic. Need qubits 1 and 50 to interact? Move atom 1 next to atom 50. The operation is not a gate — it is a physical displacement of the trapping laser, which can be done with negligible qubit error. The cost is time, not fidelity.
For the QFT, with its all-to-all connectivity requirement, this is potentially game-changing. Instead of \(\bigO{n}\) SWAPs to connect the most distant pair (each costing 3 noisy CZ gates), you pay \(\bigO{\sqrt{n}}\) microseconds of mostly-noiseless shuttling. We quantify this trade-off precisely in Section 6.
Rydberg blockade exclusion zones
There is an important parallelism constraint that affects scheduling. The Rydberg blockade mechanism that enables CZ gates also prevents simultaneous gates on nearby atom pairs.
When atom A is excited to a Rydberg state for a CZ gate with atom B, any atom C within the blockade radius \(r_b\) of A (or B) is affected by the Rydberg interaction. If C is simultaneously participating in its own CZ gate, the unwanted cross-talk can cause errors.
Exclusion zone. Two CZ gates can execute simultaneously only if all four involved atoms are pairwise separated by more than the blockade radius \(r_b\). On a grid with spacing \(a\), this translates to a minimum separation of \(\lceil r_b / a \rceil\) lattice sites between any two simultaneous gate pairs.
For typical parameters (\(r_b \approx 8\) \(\mu\)m, \(a = 4\) \(\mu\)m), the exclusion zone is \(\lceil 8/4 \rceil = 2\) lattice sites. This means that on a 2D grid, at most one out of every $$4–9 atom pairs can execute a CZ gate simultaneously, depending on the geometry.
Example 6. On an \(8 \times 8\) grid with exclusion radius 2, we can execute at most \(\sim 8\) CZ gates simultaneously (one every \(2 \times 2\) block). An 8-qubit QFT with 28 CZ gates requires at least \(\lceil 28 / 8 \rceil = 4\) time steps even with perfect scheduling — and in practice more, because the gate pairs do not tile the grid perfectly.
This constraint is critical for Section 5, where we design scheduling algorithms that maximize parallelism subject to exclusion zones.
Error model
For benchmarking purposes (see Section 6), we use the following error model, representative of current state-of-the-art neutral-atom processors:
| Parameter | Value |
|---|---|
| Single-qubit gate fidelity | 99.97% |
| CZ gate fidelity | 99.5% |
| Single-qubit gate time | 1 \(\mu\)s |
| CZ gate time | 0.5 \(\mu\)s |
| Shuttling decoherence rate | \(10^{-5}\) per \(\mu\)s |
| Atom loss rate during shuttling | \(10^{-4}\) per move |
| \(T_1\) (lifetime) | 5 s |
| \(T_2\) (coherence) | 1 s |
| SPAM fidelity | 99.8% |
| Rydberg blockade radius | 8 \(\mu\)m |
| Lattice spacing | 4 \(\mu\)m |
| Exclusion zone | 2 sites |
These parameters are achievable with current strontium and rubidium platforms. See our velocity-qc walkthrough for demonstrated fidelities on strontium-88.
With this hardware model in hand, we are ready to design compilation strategies. The question is: how do we take the QFT circuit from Section 3 and map it onto this hardware, minimizing the total error while respecting all the constraints we have described?
Compilation Strategies
We now have all the pieces: the QFT circuit structure from Section 3 (with its all-to-all connectivity and approximate truncation), and the neutral-atom hardware model from Section 4 (with its native CZ gates, reconfigurable arrays, and Rydberg exclusion zones). The compiler’s job is to bridge the gap between these two, producing a physically executable circuit that minimizes total error.
We present three compilation strategies, each exploiting different aspects of the hardware. Then we discuss post-compilation optimizations that apply regardless of the strategy chosen.
The compilation pipeline
Every strategy follows the same five-stage pipeline:
- Gate decomposition. Translate each \(CR_k\) into one CZ gate plus single-qubit rotations, as described in Section 4.
- Approximate truncation. Optionally drop all \(CR_k\) with \(k > m\), reducing gate count from \(\bigO{n^2}\) to \(\bigO{n \log n}\) with bounded error (see Section 3.4).
- Routing. Handle non-local gates — the part where the three strategies diverge.
- Scheduling. Assign time slots to all gates, maximizing parallelism while respecting Rydberg exclusion zones.
- Post-optimization. Cancel redundant gates and merge single-qubit rotations.
Stages 1 and 2 are the same for all strategies. The differences lie in stage 3 (routing), which cascades into stages 4 and 5.
Strategy 1: SWAP-based routing
The most straightforward approach, and the one used by general-purpose compilers like Qiskit’s SABRE. Atoms stay in fixed positions; non-local gates are enabled by inserting SWAP gates that permute qubit positions through the grid.
Algorithm. Given a QFT circuit with a list of two-qubit gates to execute:
- Maintain a qubit-to-position mapping \(\pi: \{1, \ldots, n\} \to \text{grid positions}\).
- For each gate \(CR_k^{(i \to j)}\) in topological order:
- Compute the shortest path between \(\pi(i)\) and \(\pi(j)\) on the grid.
- Insert SWAP gates along this path to bring \(i\) and \(j\) adjacent.
- Execute the CZ gate.
- Update \(\pi\) to reflect the new qubit positions.
Each SWAP decomposes into 3 CZ gates, so the overhead is 3 CZ gates per routing step. On an \(\sqrt{n} \times \sqrt{n}\) grid, the average distance between two random qubits is \(\bigO{\sqrt{n}}\), so the average routing cost per gate is \(\bigO{\sqrt{n}}\) SWAPs \(= \bigO{\sqrt{n}}\) additional CZ gates.
Example 7. For the 8-qubit QFT on a \(3 \times 3\) grid (with one unused position), the exact QFT has 28 \(CR_k\) gates \(= 28\) CZ gates. With SWAP-based routing, each gate requires on average $$2 SWAPs \(= 6\) additional CZ gates. Total: roughly \(28 + 28 \times 6 = 196\) CZ gates — a 7x overhead.
SABRE-style routers use heuristics (lookahead, decay) to reduce SWAP count, but the fundamental scaling is hard to beat for all-to-all circuits.
Strengths:
- Simple to implement; well-studied in the literature.
- No atom motion required; works on any platform.
- Deterministic: the same circuit always produces the same compiled output.
Weaknesses:
- Massive gate overhead for long-range gates. Each SWAP costs 3 CZ gates, each with $$0.5% error.
- The accumulated error from routing SWAPs can exceed the total error budget for the QFT itself.
- The QFT’s all-to-all pattern is the worst case for SWAP routing.
Strategy 2: Shuttling-based routing
Instead of inserting SWAP gates, we physically move atoms to bring interacting pairs adjacent. This is unique to reconfigurable neutral-atom platforms.
Algorithm. Given a QFT circuit:
- Maintain the physical positions of all atoms.
- For each gate \(CR_k^{(i \to j)}\) in topological order:
- Compute the displacement needed to bring atom \(i\) adjacent to atom \(j\) (or vice versa).
- Schedule the atom shuttling operation.
- Execute the CZ gate.
- Optionally return the atom to its original position (or leave it for the next gate).
Shuttling cost model. Moving an atom across \(d\) lattice sites takes time \(t_{\text{shuttle}}(d)\) as defined in Section 4. The qubit accumulates decoherence \(\epsilon_{\text{shuttle}} = d \cdot 10^{-5}\) per \(\mu\)s of travel. There is no gate overhead — no SWAPs are inserted.
The advantage is stark: instead of 3 CZ gates ($\(1.5% error) per SWAP, we pay only the shuttling decoherence (\)$0.02% for a 10-site move). The cost is time: shuttling 10 sites takes $$18 \(\mu\)s, compared to $$1.5 \(\mu\)s for three sequential CZ gates.
Trajectory optimization. A naive implementation moves each atom back to its original position after every gate, paying the full round-trip cost. A smarter approach plans atom trajectories across multiple gates, minimizing total displacement. For the QFT, there is a natural pattern: qubit 1 needs to interact with qubits 2, 3, 4, …, \(n\) in sequence. Instead of shuttling qubit 1 back and forth, we can sweep it across the array, executing gates as it passes each target.
Example 8. For the qubit-1 block of a 16-qubit QFT on a \(4 \times 4\) grid, qubit 1 needs to interact with qubits 2 through 16 (or up to \(m\) of them in the approximate QFT). If we sweep atom 1 along a Hamiltonian path through the grid, it visits each target atom exactly once, executing the CZ gate at each stop. Total shuttling distance: $$16 lattice sites. Total shuttling time: $$30 \(\mu\)s. Compare with SWAP routing for the same block: $$45 CZ gates and $$22 \(\mu\)s of gate time, but with $$22% accumulated gate error.
Strengths:
- Eliminates SWAP gate overhead entirely.
- Total error dominated by gate errors (CZ gates in the QFT), not routing.
- Enables trajectory optimization that exploits QFT’s sequential structure.
Weaknesses:
- Total execution time is longer due to shuttling.
- Requires coherence times that exceed the total shuttling duration.
- Atoms in transit cannot participate in other gates — limits parallelism.
- Heating from repeated shuttling may degrade qubit quality over many cycles.
Strategy 3: Hybrid routing
The hybrid strategy combines SWAP and shuttling routing, using each where it is most efficient:
- Short-range gates (distance \(\leq d_{\text{thresh}}\)): Use SWAP routing. The SWAP overhead is small (1–2 SWAPs), and the gates execute quickly.
- Long-range gates (distance \(> d_{\text{thresh}}\)): Use atom shuttling. The SWAP overhead would be prohibitive, but shuttling handles long distances with minimal error.
Hybrid threshold. Define the break-even distance \(d_{\text{thresh}}\) as the distance at which the total error from SWAP routing equals the total error from shuttling: \[ 3d \cdot \epsilon_{\text{CZ}} = d \cdot t_{\text{site}} \cdot \gamma_{\text{decoherence}} \] where \(\epsilon_{\text{CZ}}\) is the CZ error rate, \(t_{\text{site}}\) is the per-site shuttling time, and \(\gamma_{\text{decoherence}}\) is the decoherence rate during shuttling. For our reference parameters: \(3d \times 0.005 = d \times 4\,\mu\text{s} \times 10^{-5}\), giving \(d_{\text{thresh}}\) that is always favorable for shuttling. In practice, \(d_{\text{thresh}}\) also accounts for time costs, parallelism impacts, and atom loss probability.
The hybrid strategy is adaptive — it uses the right tool for each gate rather than committing to a single approach.
For the QFT specifically, the hybrid strategy has a natural interpretation: gates between nearby qubits in the staircase (small \(|i - j|\)) use SWAPs, while gates between distant qubits (large \(|i - j|\)) use shuttling. With approximate QFT truncation at \(m = \bigO{\log n}\), the maximum interaction distance is bounded by \(m\), making the hybrid strategy nearly equivalent to pure SWAP routing for most gates.
Strengths:
- Best of both worlds: low gate overhead for short-range, low time overhead for long-range.
- Naturally complements approximate QFT (which limits interaction range).
Weaknesses:
- More complex scheduling: must interleave gate operations with atom transport.
- The threshold \(d_{\text{thresh}}\) depends on hardware parameters that may vary across the array.
Gate cancellation and optimization
After routing, the compiled circuit contains many redundant operations. Three optimization passes can significantly reduce the final gate count:
1. Adjacent CZ cancellation. Two consecutive CZ gates on the same qubit pair cancel: \(\text{CZ}^2 = I\). This happens frequently in SWAP-based routing, where the SWAPs for one gate may partially overlap with the SWAPs for the next gate.
2. Rotation merging. Consecutive single-qubit rotations on the same qubit can be merged into a single rotation: \(R_z(\alpha) \cdot R_z(\beta) = R_z(\alpha + \beta)\). The \(CR_k\) decomposition produces many \(R_z\) gates that can be combined with the rotations from adjacent decompositions.
3. Commutation-based reordering. Some gates commute and can be reordered to create new cancellation opportunities. In particular:
- \(R_z\) commutes with CZ (since both are diagonal in the computational basis).
- This means \(R_z\) gates can be “pushed through” CZ gates to merge with other \(R_z\) gates.
Example 9. Consider two consecutive QFT gates: \(CR_2^{(2 \to 1)}\) followed by \(CR_3^{(3 \to 1)}\), both targeting qubit 1. After decomposition: \[ [R_z(\pi/4) \otimes R_z(\pi/4)] \cdot \text{CZ}_{12} \cdot [I \otimes R_z(-\pi/4)] \cdot [R_z(\pi/8) \otimes R_z(\pi/8)] \cdot \text{CZ}_{13} \cdot [I \otimes R_z(-\pi/8)] \] The \(R_z(\pi/4)\) on qubit 1 from the first gate and the \(R_z(\pi/8)\) on qubit 1 from the second gate merge into \(R_z(3\pi/8)\), saving one gate.
Optimization impact analysis
For an \(n\)-qubit exact QFT with SWAP-based routing on a 2D grid:
- Before optimization: \(\sim n(n-1)/2\) CZ gates (QFT) \(+ \sim 3n^{3/2}\) CZ gates (SWAPs) \(+ \sim 3n^2\) single-qubit rotations.
- After CZ cancellation: typically 10–15% reduction in CZ count, depending on routing details.
- After rotation merging: single-qubit gate count reduced by $$40%, though this has minimal impact on total error since \(R_z\) errors are $$100x smaller than CZ errors.
- After commutation reordering: additional 5–10% CZ reduction from newly exposed cancellation pairs.
Total optimization reduces CZ count by 15–25%. This is meaningful but does not change the asymptotic scaling. The dominant optimization remains the approximate QFT truncation from Section 3.4, which reduces the input gate count before routing even begins.
Scheduling: maximizing parallelism
The final compilation step assigns each gate to a time slot. The scheduler must respect three constraints:
- Data dependencies. A gate cannot start until all preceding gates on the same qubits have completed.
- Rydberg exclusion zones. Two CZ gates cannot execute simultaneously if any of their atoms are within the blockade radius.
- Shuttling conflicts. An atom in transit cannot participate in a gate. In the shuttling and hybrid strategies, the scheduler must avoid scheduling gates on atoms that are currently being moved.
We use a greedy list scheduler: process gates in topological order; for each gate, assign it to the earliest time slot that satisfies all three constraints. This is not globally optimal, but it is fast (\(\bigO{g \log g}\) for \(g\) gates) and produces near-optimal results for the QFT’s regular structure.
ILP-based optimal scheduling is possible for small instances (\(n \leq 8\)) but scales exponentially. The greedy scheduler is within 5–10% of optimal for QFT circuits in our benchmarks.
The QFT’s staircase structure offers good parallelism. While the qubit-1 block is executing its later gates (with control qubits far from qubit 2), the qubit-2 block can begin its early gates. With approximate QFT and \(m = \bigO{\log n}\), each block is short, and the overlap between blocks increases — reducing total depth.
If you want to see these scheduling trade-offs in action — watch gate timelines shift as you toggle exclusion zones, switch between routing strategies, and adjust the approximate QFT cutoff — the Gate Decomposition & Scheduling notebook has interactive visualizations for all of this.
Benchmarking and Results
We now put the three compilation strategies from Section 5 head-to-head, using the hardware model from Section 4 and QFT circuits of varying size. The goal is concrete: for each strategy and each circuit size, how many gates does the compiled circuit contain, how deep is it, how long does it take to execute, and what fidelity can we expect?
Benchmark setup
Circuits. We compile the QFT for \(n = 4, 8, 16, 32, 64\) qubits, in both exact and approximate (\(m = \lceil 2\log_2 n + 8 \rceil\)) variants. The approximate truncation parameter \(m\) is chosen to keep the approximation error below \(\epsilon = 0.01\).
Hardware model. A 2D rectangular atom array with the parameters from Section 4: CZ fidelity 99.5%, single-qubit fidelity 99.97%, CZ time 0.5 \(\mu\)s, lattice spacing 4 \(\mu\)m, blockade radius 8 \(\mu\)m (exclusion zone = 2 sites), shuttling speed 0.5 m/s.
Initial placement. Qubits are placed on the grid in row-major order: qubit 1 at position \((0,0)\), qubit 2 at \((0,1)\), etc. This is a natural layout that keeps consecutive qubits close, which benefits the QFT’s staircase pattern.
Baseline. Naive textbook compilation with linear nearest-neighbor SWAP routing (no look-ahead, no optimization passes).
Strategies compared:
- SWAP — SABRE-style SWAP routing with look-ahead, on the fixed grid.
- Shuttle — Pure shuttling-based routing with trajectory optimization.
- Hybrid — SWAP for distance \(\leq 3\) sites, shuttling for distance \(> 3\) sites.
- Baseline — Naive linear SWAP routing (for reference).
Gate count results
The most direct metric is the total number of CZ gates in the compiled circuit, since CZ dominates the error budget.
| \(n\) | Exact QFT (raw) | Baseline | SWAP | Shuttle | Hybrid |
|---|---|---|---|---|---|
| 4 | 6 | 18 | 12 | 6 | 8 |
| 8 | 28 | 196 | 98 | 28 | 42 |
| 16 | 120 | 1,560 | 680 | 120 | 216 |
| 32 | 496 | 11,904 | 4,340 | 496 | 1,040 |
| 64 | 2,016 | 91,648 | 29,500 | 2,016 | 4,800 |
The “raw” column is the QFT’s intrinsic CZ count before any routing. The shuttle strategy matches it exactly because shuttling adds no gates.
The pattern is striking. Shuttling-based routing adds zero CZ gates — the compiled circuit has exactly as many CZ gates as the original QFT, because routing is handled by physical atom transport rather than gate insertion. SWAP routing inflates the gate count by 5–15x, with the overhead growing as \(\bigO{n^{3/2}}\) on a 2D grid. The hybrid falls in between, using shuttling for the worst offenders.
Approximate QFT impact
With approximate truncation at \(m = \lceil 2\log_2 n + 8 \rceil\):
| \(n\) | \(m\) | Exact CZ | Approx CZ | Reduction | Approx + SWAP | Approx + Shuttle |
|---|---|---|---|---|---|---|
| 4 | 12 | 6 | 6 | 0% | 12 | 6 |
| 8 | 14 | 28 | 28 | 0% | 98 | 28 |
| 16 | 16 | 120 | 112 | 7% | 560 | 112 |
| 32 | 18 | 496 | 416 | 16% | 2,912 | 416 |
| 64 | 20 | 2,016 | 1,280 | 37% | 12,160 | 1,280 |
Remark. Remark. The approximate QFT’s benefit is twofold: it reduces the raw gate count (the “Reduction” column), and it limits the interaction range, which drastically reduces SWAP routing overhead. The SWAP column shows roughly 50–60% fewer gates with approximate QFT compared to exact QFT, because shorter-range interactions require fewer SWAPs.
Circuit depth
Circuit depth measures the number of sequential time steps, accounting for parallelism. Lower depth means faster execution and less idle decoherence.
| \(n\) | SWAP (exact) | SWAP (approx) | Shuttle (exact) | Shuttle (approx) | Hybrid (approx) |
|---|---|---|---|---|---|
| 4 | 18 | 18 | 24 | 24 | 20 |
| 8 | 112 | 95 | 86 | 72 | 68 |
| 16 | 480 | 320 | 240 | 185 | 195 |
| 32 | 1,820 | 980 | 680 | 420 | 480 |
| 64 | 6,900 | 3,200 | 1,920 | 960 | 1,100 |
An important pattern emerges: for small circuits (\(n \leq 8\)), SWAP routing is competitive in depth because SWAPs are fast (gate operations) while shuttling is slow (physical transport). But for larger circuits, the avalanche of SWAP gates makes the SWAP-compiled circuit deeper than the shuttling-compiled one. The crossover occurs around \(n = 12\)–\(16\).
Depth is measured in units of CZ gate time (0.5 \(\mu\)s). Shuttling time steps are longer but fewer in number. Total wall-clock time is shown in the next table.
Estimated fidelity
We estimate the overall circuit fidelity as:
\[ F \approx \prod_{\text{gates}} F_{\text{gate}} \cdot \prod_{\text{idle}} e^{-t_{\text{idle}} / T_2} \approx (1 - \epsilon_{\text{CZ}})^{N_{\text{CZ}}} \cdot (1 - \epsilon_{\text{1q}})^{N_{\text{1q}}} \cdot e^{-T_{\text{total}} / T_2}. \]
This is a simplified model (it ignores correlated errors, leakage, and error cancellation), but it captures the dominant effects.
| \(n\) | Baseline | SWAP (exact) | SWAP (approx) | Shuttle (exact) | Shuttle (approx) | Hybrid (approx) |
|---|---|---|---|---|---|---|
| 4 | 91.4% | 94.2% | 94.2% | 97.1% | 97.1% | 96.1% |
| 8 | 37.7% | 61.3% | 61.3% | 87.0% | 87.0% | 81.2% |
| 16 | 0.04% | 3.4% | 6.6% | 55.0% | 57.2% | 34.6% |
| 32 | ~0% | ~0% | 0.01% | 8.4% | 12.7% | 0.6% |
| 64 | ~0% | ~0% | ~0% | ~0% | 0.2% | ~0% |
Remark. Remark. The fidelity numbers tell a sobering story. Even with the best compilation strategy (shuttling + approximate QFT), the 32-qubit QFT runs at only 12.7% fidelity. This does not mean the QFT is useless at this scale — error mitigation techniques, better hardware, and error correction can all improve the situation. But it does mean that compilation strategy matters enormously: the difference between 12.7% (shuttle + approx) and $$0% (baseline) is the difference between a noisy but extractable signal and pure noise.
Total execution time
Wall-clock time includes both gate execution and atom transport:
| \(n\) | SWAP (approx) | Shuttle (approx) | Hybrid (approx) |
|---|---|---|---|
| 4 | 9 \(\mu\)s | 38 \(\mu\)s | 16 \(\mu\)s |
| 8 | 48 \(\mu\)s | 142 \(\mu\)s | 78 \(\mu\)s |
| 16 | 160 \(\mu\)s | 485 \(\mu\)s | 260 \(\mu\)s |
| 32 | 490 \(\mu\)s | 1,400 \(\mu\)s | 720 \(\mu\)s |
| 64 | 1,600 \(\mu\)s | 4,200 \(\mu\)s | 2,100 \(\mu\)s |
The shuttling strategy takes 2–3x longer in wall-clock time. But recall that with \(T_2 = 1\) s, even 4,200 \(\mu\)s is only 0.4% of the coherence time. Time is cheap compared to gate errors — which is why shuttling wins on fidelity despite losing on speed.
Scaling summary
| Strategy | CZ scaling | Depth scaling | Fidelity trend | Best regime |
|---|---|---|---|---|
| Baseline (naive SWAP) | \(\bigO{n^{7/2}}\) | \(\bigO{n^3}\) | Unusable above \(n = 8\) | Never |
| SWAP (optimized) | \(\bigO{n^{5/2}}\) | \(\bigO{n^2}\) | Competitive to \(n \approx 12\) | Small circuits, fixed arrays |
| Shuttle | \(\bigO{n^2}\) (exact), \(\bigO{n \log n}\) (approx) | \(\bigO{n^{3/2}}\) | Best fidelity at all sizes | Long coherence, reconfigurable hardware |
| Hybrid | \(\bigO{n^2}\) | \(\bigO{n^{3/2}}\) | Between SWAP and shuttle | Moderate coherence, mixed constraints |
The compiler’s real job
These benchmarks make one thing clear: for the QFT on neutral atoms, the choice of compilation strategy is not an incremental optimization — it is a qualitative change. Naive compilation renders the circuit unusable at \(n > 8\). SWAP-based routing extends the useful range to perhaps \(n \approx 16\). Shuttling-based routing pushes it to \(n \approx 32\) with current hardware fidelities.
The compiler is not just “lowering” an abstract circuit to hardware. It is deciding whether the computation succeeds or fails.
Outlook and Perspective
We have walked through the full compilation pipeline for QFT circuits on neutral-atom hardware: from the textbook circuit’s all-to-all connectivity requirements, through hardware constraints and gate decomposition, to three distinct routing strategies with detailed benchmarking. Let’s step back and consider the broader picture.
Summary of key findings
The central result is that compilation strategy is not a detail — it determines whether the QFT runs or fails. On current hardware:
- Naive compilation produces circuits that are unusable beyond 8 qubits.
- SWAP-based routing (the default in most compilers) extends the useful range to $$16 qubits but is ultimately limited by the \(\bigO{n^{5/2}}\) gate overhead.
- Shuttling-based routing eliminates SWAP overhead entirely, achieving the theoretical minimum gate count. Combined with approximate QFT, it pushes the useful range to $$32 qubits with current fidelities.
- Approximate QFT is the single most impactful optimization, reducing gate count by up to 37% and — more importantly — limiting interaction range from all-to-all to a bounded window, which dramatically reduces routing overhead for all strategies.
The approximate QFT is often presented as a theoretical curiosity. These results show it is a practical necessity for any QFT implementation beyond small demonstrations.
Connection to velocity-enabled quantum computing
The shuttling costs in our analysis assume conventional atom transport: tweezers physically drag atoms across the array at speeds limited by trap depth and heating rates. But what if shuttling could be made faster and cheaper?
This is precisely what the velocity-enabled approach offers. As described in our velocity-qc walkthrough, atoms can be addressed selectively based on their velocity rather than their position. The key insight for QFT compilation is that velocity-selective addressing could enable gates during transport: an atom in transit could participate in single-qubit rotations (via the Doppler-shifted laser) without stopping. This would allow the compiler to overlap shuttling with gate execution, reducing the effective shuttling time overhead.
Moreover, the “fly-by” CZ gate demonstrated in that work — where two atoms interact as they pass through each other’s Rydberg blockade radius while in motion — could eliminate the stop-interact-resume overhead of our shuttling model entirely. A velocity-aware QFT compiler would schedule atom trajectories such that qubits interact during coordinated fly-bys, potentially achieving the gate-count efficiency of shuttling with the speed of SWAP routing.
Remark. Remark. Velocity-aware compilation for the QFT is an open problem. The challenge is that fly-by gate fidelity depends on the atoms’ relative velocity (which must be precisely controlled), and the scheduling becomes a continuous trajectory optimization problem rather than a discrete graph problem. But the potential payoff — combining the best of shuttling (no SWAP overhead) and SWAP routing (fast execution) — makes this a compelling research direction.
Connection to chiplet architectures
Our analysis assumes a monolithic atom array. But as neutral-atom processors scale to thousands of qubits, chiplet architectures become relevant: multiple smaller arrays connected by photonic links or shared optical cavities.
The Chipmunq compiler addresses fault-tolerant compilation on chiplet hardware, working at the QEC patch level. For the QFT specifically, chiplet boundaries introduce an additional routing challenge: inter-chiplet links are typically 10–100x noisier than intra-chiplet connections (exactly the scenario Chipmunq is designed for).
A QFT compiled across chiplets would need to minimize the number of \(CR_k\) gates that cross chiplet boundaries. The approximate QFT helps here too: by limiting interaction range to \(m = \bigO{\log n}\), it ensures that most gates are between nearby qubits, which can be placed on the same chiplet. A clever initial partitioning — assigning consecutive blocks of qubits to the same chiplet — would confine the majority of interactions within chiplet boundaries.
Connection to circuit cutting
For very large QFT instances that exceed a single processor’s capacity, circuit cutting offers an alternative. The DAScut framework shows how to decompose circuits into smaller fragments that run independently, combining the results classically.
The QFT has interesting properties for circuit cutting. The approximate QFT already limits interaction range, creating natural “cut points” between qubits that are far apart. A \(CR_k\) gate with \(k > m\) is omitted entirely in the approximate QFT — so cutting the circuit at these points introduces zero additional error beyond the approximation already made.
Furthermore, DAScut’s structure-aware deduplication could exploit the QFT’s repetitive pattern: each qubit block has a nearly identical structure (H + sequence of \(CR_k\)), so isomorphic sub-circuit reuse could reduce the classical post-processing overhead dramatically.
Open problems
Several important questions remain:
Optimal atom trajectory planning. Our shuttling strategy uses simple sweep paths. For the QFT’s specific structure, there may be globally optimal trajectories that minimize total shuttling time while respecting all gate dependencies. This is related to the traveling salesman problem and likely NP-hard in general, but the QFT’s regularity may admit efficient special-case algorithms.
Array geometry co-design. We assumed a rectangular grid. But what if the array geometry itself were optimized for the QFT? A hyperbolic or logarithmic spacing — where qubits that interact via large-\(k\) gates are placed closer together — could reduce routing overhead further. Co-designing the array layout and the compilation strategy is an unexplored frontier.
Fault-tolerant QFT compilation. All our analysis is at the physical level. A fault-tolerant QFT would encode each qubit in an error-correcting code (e.g., the surface code) and execute \(CR_k\) gates via lattice surgery. The compilation challenges compound: routing surface code patches is far more constrained than routing bare qubits, and the approximate QFT’s error analysis must account for logical error rates, not just gate infidelities.
Machine-learning-assisted compilation. The QFT’s regular structure makes it an excellent target for learned compilation policies. A reinforcement learning agent trained on small QFT instances could discover routing and scheduling strategies that generalize to larger sizes — potentially outperforming hand-designed heuristics. Recent work on ML-driven quantum compilation suggests this is feasible.
Real-time adaptive compilation. Current compilers produce a fixed schedule before execution begins. But neutral-atom platforms can detect atom loss and position errors mid-circuit. An adaptive compiler that re-routes around missing atoms in real time could significantly improve the success rate of large QFT circuits.
Closing thought
The Quantum Fourier Transform is simultaneously one of the most important algorithms in quantum computing and one of the hardest to compile efficiently. Its all-to-all connectivity requirement is a stress test for any compiler, and its hierarchical structure of controlled rotations creates opportunities for optimization that general-purpose tools miss.
Neutral-atom hardware, with its unique reconfigurability, offers a path through the compilation bottleneck that no other platform provides. Shuttling-based routing eliminates the SWAP overhead that cripples the QFT on fixed-connectivity hardware. Approximate QFT truncation converts an all-to-all circuit into a bounded-range one with rigorous error guarantees. Together, these techniques extend the useful range of QFT implementations by roughly an order of magnitude in qubit count.
But much remains to be done. The velocity-enabled toolbox, chiplet architectures, and circuit cutting each offer additional levers that have barely been explored in the QFT context. The compiler problem is open, the optimal trajectories are unknown, and the best hardware configuration may not yet have been tried. For researchers at the intersection of quantum algorithms and neutral-atom hardware, there is a wealth of problems waiting.