Results
All results are reported as the decoder's logical error rate divided by MWPM's. Lower is better; 1.0x = identical to MWPM.
Conditions: p=0.01, d=3 (80 detectors, 222 edges) and d=5 (240 detectors, 930 edges), r=10 unless noted. MWPM baseline: 11.8% (d=3), 10.9% (d=5).
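For reference, a baseline like this can be measured with a few lines of stim + PyMatching. The sketch below is a minimal version, assuming a stim-generated rotated-memory circuit with uniform noise p (the detector counts match those above, but the exact circuit family and noise parameters are an assumption):

```python
import numpy as np
import stim
import pymatching

def mwpm_logical_error_rate(d: int, rounds: int, p: float, shots: int = 100_000) -> float:
    # Assumed circuit family: stim's built-in rotated surface-code memory experiment.
    circuit = stim.Circuit.generated(
        "surface_code:rotated_memory_z",
        distance=d,
        rounds=rounds,
        after_clifford_depolarization=p,
        before_measure_flip_probability=p,
        after_reset_flip_probability=p,
    )
    dem = circuit.detector_error_model(decompose_errors=True)
    matcher = pymatching.Matching.from_detector_error_model(dem)

    # Sample detection events plus the true logical flips, then decode in a batch.
    dets, obs = circuit.compile_detector_sampler().sample(shots, separate_observables=True)
    preds = matcher.decode_batch(dets)
    return float(np.mean(np.any(preds != obs, axis=1)))
```

Whether `mwpm_logical_error_rate(3, 10, 0.01)` reproduces the 11.8% / 10.9% figures exactly depends on matching the original noise model.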
Here's the bottom line -- every decoder I built over ~50 experiments, ranked against MWPM. The short version: Sinkhorn + local search gets remarkably close.
Table 1: Full Decoders Overview
| Approach | d=3 vs MWPM | d=5 vs MWPM | Time Deterministic? |
|---|---|---|---|
| MWPM (PyMatching) | 1.000x | 1.000x | No |
| Sinkhorn + 2-opt | 1.030x | 1.256x | Yes |
| Sinkhorn + 3-opt | 1.044x | 1.183x | Yes |
| Sinkhorn (best tau/iter) | 1.135x | 1.306x | Yes |
| Dijkstra-greedy | 1.128x | 1.990x | No |
| Sparse Sinkhorn (3 iter) | 1.254x | 1.993x | Yes |
| Union-Find (odd-only) | 1.218x | 2.433x | No |
| Chain phase1 only | 2.93x | - | Yes |
Sinkhorn best: tau=2.0 iter=20 (d=3), tau=0.5 iter=20 (d=5).
Before tackling the full 3D problem, here's what happened in 2D (single measurement round). At d=3, learned decoders actually beat MWPM - by learning where it makes mistakes.
Table 2: 2D (Single-Round) Results
| Approach | d=3 vs MWPM | d=5 vs MWPM | Time Deterministic? |
|---|---|---|---|
| MWPM (PyMatching) | 1.000x | 1.000x | No |
| Shift-OR 2D | 1.000x | 1.000x | Yes |
| NN + AND/XOR features | ~0.91x | - | Yes |
| PP + MWPM Distillation | 0.90x | 1.108x | Yes |
All r=1 (single round), p=0.01. PP+Distillation also beats MWPM at d=3 r=3: 0.948x. All approaches degrade at d=5+ and with more rounds.
Why explore Sinkhorn relaxations?
MWPM provides excellent decoding accuracy but involves irregular, branch-heavy computation that is difficult to parallelize or implement in hardware with strict latency guarantees. Sinkhorn relaxations provide a different structure: deterministic runtime, highly parallel matrix operations, and compatibility with GPUs or specialized accelerators. This makes them interesting to explore as bounded-time decoding approximations.
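To make that structure concrete, here is a minimal sketch of a Sinkhorn relaxation over the pairwise path costs between flagged detectors. The greedy rounding at the end and the (omitted) boundary handling are illustrative choices, not necessarily what the experiments used:

```python
import numpy as np

def sinkhorn_match(cost: np.ndarray, tau: float = 2.0, n_iter: int = 20):
    """Soft-match flagged detectors, then round greedily.

    cost[i, j] = shortest-path weight between flagged detectors i and j.
    Boundary handling (odd syndrome weight) is omitted for brevity.
    """
    n = cost.shape[0]
    K = np.exp(-cost / tau)          # temperature-scaled affinities
    np.fill_diagonal(K, 0.0)         # a detector cannot pair with itself
    for _ in range(n_iter):          # alternate row / column normalization
        K = K / K.sum(axis=1, keepdims=True)
        K = K / K.sum(axis=0, keepdims=True)
    P = 0.5 * (K + K.T)              # symmetrize the soft assignment

    # Greedy rounding: repeatedly take the strongest remaining pair.
    pairs, used = [], set()
    order = sorted(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: -P[ij])
    for i, j in order:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs
```

The only operations are exponentials, row/column sums, and element-wise divides, which is what makes the runtime deterministic and GPU-friendly.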
Once I committed to Sinkhorn, I needed to find the right temperature and iteration count. This was a systematic grid search -- and the landscape turned out to be surprisingly non-monotonic.
Table 3: Sinkhorn Parameter Sweep
Ratio vs MWPM across tau (temperature) and n_iter (number of Sinkhorn iterations).
d=3, r=10, p=0.01 (MWPM err: 11.8%)
| tau \ iter | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|
| 0.05 | 2.215x | 2.290x | 2.335x | 2.356x | 2.369x |
| 0.1 | 1.326x | 1.324x | 1.340x | 1.330x | 1.343x |
| 0.2 | 1.154x | 1.165x | 1.148x | 1.138x | 1.145x |
| 0.5 | 1.162x | 1.164x | 1.161x | 1.149x | 1.145x |
| 1 | 1.146x | 1.151x | 1.146x | 1.150x | 1.151x |
| 2 | 1.139x | 1.137x | 1.135x | 1.137x | 1.137x |
| 5 | 1.148x | 1.144x | 1.147x | 1.149x | 1.147x |
Best: 1.135x (tau=2.0, iter=20).
d=5, r=10, p=0.01 (MWPM err: 10.9%)
| tau \ iter | 5 | 10 | 20 | 50 |
|---|---|---|---|---|
| 0.2 | 1.531x | 1.457x | 1.373x | 1.361x |
| 0.5 | 1.406x | 1.330x | 1.306x | 1.306x |
| 1 | 1.440x | 1.385x | 1.351x | 1.342x |
| 2 | 1.773x | 1.725x | 1.700x | 1.648x |
| 5 | 1.958x | 1.911x | 1.879x | 1.868x |
Best: 1.306x (tau=0.5, iter=20).
Raw Sinkhorn gets you to 1.135x. The question was: can local search close the remaining gap? Turns out 2-opt is the sweet spot -- 3-opt helps at d=5 but the compute cost is brutal.
Table 4: Post-Processing Improvement Chain
| Stage | d=3 vs MWPM | d=5 vs MWPM | Change vs previous stage |
|---|---|---|---|
| Sinkhorn only | 1.135x | 1.306x | - |
| + 2-opt local search | 1.030x | 1.256x | -9.3% (d=3), -3.8% (d=5) |
| + 3-opt local search | 1.044x | 1.183x | +1.4% (d=3), -5.8% (d=5) |
2-opt is the big win at d=3. 3-opt helps at d=5 but costs 50.8 s (impractical). Note: at d=3, 3-opt is slightly worse than 2-opt alone (1.044x vs 1.030x) -- the larger move set settles into a worse local minimum.
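What the "+ 2-opt" stage does, roughly: take the rounded pairing and keep re-wiring any two pairs while that lowers the total path weight. The sketch below is a simplified pass, not the tuned implementation behind these numbers:

```python
def two_opt(pairs, cost):
    """Repeatedly re-wire two matched pairs if one of the two alternative
    pairings of the same four detectors has lower total weight.

    pairs: list of (i, j) index pairs; cost: symmetric numpy array of path weights.
    """
    improved = True
    while improved:
        improved = False
        for a in range(len(pairs)):
            for b in range(a + 1, len(pairs)):
                (i, j), (k, l) = pairs[a], pairs[b]
                best = cost[i, j] + cost[k, l]
                # The two alternative re-pairings of the same four detectors.
                for p1, p2 in (((i, k), (j, l)), ((i, l), (j, k))):
                    if cost[p1] + cost[p2] < best:
                        pairs[a], pairs[b] = p1, p2
                        best = cost[p1] + cost[p2]
                        improved = True
    return pairs
```

3-opt does the same over triples of pairs, which is where the much larger cost comes from.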
This was the biggest chapter of the exploration -- six different NN architectures, from simple PingPong to hybrid Sinkhorn+NN. I tried everything I could think of. The punchline: NNs learn features beautifully at d=3, but none of them crack the matching problem at d=5.
Table 5: NN Decoder Architectures
| Model | d | r | Params | Test Acc | MWPM Err |
|---|---|---|---|---|---|
| PingPong (d=3, r=3) | | | | | |
| PP 2R baseline | 3 | 3 | 49.9K | 88.0% | 4.5% |
| Scale A (8ch 128x64) | 3 | 3 | 50.2K | 86.8% | 4.5% |
| Scale B (16ch 256x128) | 3 | 3 | 142.9K | 90.1% | 4.5% |
| Scale D (32ch 256x128) | 3 | 3 | 202.0K | 90.3% | 4.5% |
| Scale C (16ch 512x256) | 3 | 3 | 347.6K | 91.1% | 4.5% |
| Temporal Mixing (d=3, r=10) | | | | | |
| tconv PP_2R | 3 | 10 | 416.2K | 74.0% | 11.6% |
| tconv PP_4R | 3 | 10 | 417.3K | 76.4% | 11.6% |
| tconv Inter_k3 | 3 | 10 | 417.8K | 78.3% | 11.6% |
| tconv Inter_k5 | 3 | 10 | 418.8K | 70.0% | 11.6% |
| tconv Post_k3 | 3 | 10 | 417.0K | 64.2% | 11.6% |
| DEM-Attention | | | | | |
| demattn PP_2R | 3 | 10 | 416.2K | 65.1% | 11.4% |
| demattn Inter_k3 | 3 | 10 | 417.8K | 67.9% | 11.4% |
| demattn Attn_2L | 3 | 10 | 417.3K | 86.4% | 11.6% |
| demattn Attn_4L | 3 | 10 | 419.5K | 85.4% | 11.6% |
| demattn Attn_2L | 5 | 5 | 1.33M | 81.4% | 5.5% |
| demattn Attn_4L | 5 | 5 | 1.33M | 83.1% | 5.5% |
| Hybrid (Sinkhorn + NN) | | | | | |
| hybrid sinkhorn P1_Med | 3 | 10 | 307.8K | 87.2% | 11.6% |
| hybrid binary1hop P1_Med | 3 | 10 | 307.8K | 86.7% | 11.6% |
| hybrid b1h+txorpadcross P1_Med | 3 | 10 | 594.5K | 87.8% | 11.6% |
| hybrid binary2hop P1_Med | 3 | 10 | 717.4K | 86.0% | 11.6% |
| hybrid cost2hop P1_Med | 3 | 10 | 717.4K | 83.4% | 11.6% |
| hybrid sinkhorn P1_Med | 5 | 10 | 901.7K | 71.2% | 10.8% |
| hybrid b1h+txorpadcross P1_Med | 5 | 10 | 1.82M | 67.9% | 10.8% |
| hybrid binary1hop P1_Med | 5 | 10 | 901.7K | 61.8% | 10.8% |
Best d=3 r=10 NN: hybrid b1h+txorpadcross (87.8%). All NNs degrade sharply at d=5.
Test Accuracy vs Model Size (PingPong scaling): diminishing returns -- 7x more params (A→C) gives only +4% accuracy at d=3 and +4% at d=5.
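To make the "temporal mixing" idea concrete, here is an illustrative PyTorch sketch of a tconv-style block: per-round detector features mixed by a 1D convolution over the round axis. The layer sizes and readout head are assumptions, not the exact models in Table 5:

```python
import torch
import torch.nn as nn

class TemporalConvMixer(nn.Module):
    """Illustrative temporal-mixing block (not one of the exact tconv models above)."""

    def __init__(self, dets_per_round: int, channels: int = 16, kernel: int = 3):
        super().__init__()
        self.embed = nn.Linear(dets_per_round, channels)
        self.tconv = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))

    def forward(self, syndromes: torch.Tensor) -> torch.Tensor:
        # syndromes: (batch, rounds, dets_per_round) detector firings as floats
        x = self.embed(syndromes)            # per-round feature embedding
        x = self.tconv(x.transpose(1, 2))    # mix information across rounds
        return self.head(torch.relu(x))      # logit for the logical observable
```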
After the NN decoder chapter, I realized NNs shine at pruning, not decoding. A tiny 346-parameter network can tell you which edges matter -- and then you hand the pruned graph to MWPM. This is where neural networks actually earn their keep.
Table 6: Edge Pruner Comparison
| Model | d | Params | AUC | Recall@3 | Recall@5 | MWPM@3 | MWPM@5 |
|---|---|---|---|---|---|---|---|
| MatOps | 3 | 0 | 0.973 | 92.5% | 97.0% | 1.07x | 1.02x |
| Linear | 3 | 11 | 0.979 | 92.3% | 96.9% | 1.08x | 1.02x |
| PP 2R 8ch | 3 | 346 | 0.988 | 95.4% | 98.5% | 1.01x | 1.00x |
| PP 2R 16ch | 3 | 1,202 | 0.989 | 95.8% | 98.5% | 1.01x | 1.00x |
| MatOps | 5 | 0 | 0.960 | 88.6% | 93.7% | 1.33x | 1.17x |
| Linear | 5 | 11 | 0.961 | 88.7% | 93.7% | 1.33x | 1.17x |
| PP 2R 8ch | 5 | 346 | 0.970 | 88.7% | 93.8% | 1.24x | 1.06x |
| PP 4R 16ch | 5 | 2,290 | 0.983 | 90.5% | 95.6% | 1.14x | 1.01x |
MWPM@K = MWPM error rate on graph pruned to top-K edges per detector, divided by unpruned MWPM. At d=3, PP 2R 16ch with K=5 recovers exact MWPM quality (1.00x).
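Concretely, the MWPM@K evaluation amounts to keeping only the top-K highest-scoring edges per detector and rebuilding the PyMatching graph from what survives. The edge-list bookkeeping below is an assumption about the format, not the exact pipeline:

```python
import pymatching

def mwpm_on_pruned_graph(edges, weights, fault_ids, scores, num_detectors, k=3):
    """edges: list of (u, v) detector pairs, with v=None for boundary edges;
    weights / fault_ids: the corresponding MWPM edge data from the DEM;
    scores: per-edge pruner outputs (higher = more likely part of the error)."""
    keep = set()
    for det in range(num_detectors):
        incident = [i for i, (u, v) in enumerate(edges) if det in (u, v)]
        incident.sort(key=lambda i: -scores[i])
        keep.update(incident[:k])             # top-k edges per detector survive

    m = pymatching.Matching()
    for i in sorted(keep):
        u, v = edges[i]
        if v is None:
            m.add_boundary_edge(u, weight=weights[i], fault_ids=fault_ids[i])
        else:
            m.add_edge(u, v, weight=weights[i], fault_ids=fault_ids[i])
    return m                                  # decode as usual; compare against the unpruned matcher
```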
Pruner: MWPM error on pruned graph vs K (candidates per detector), shown for d=3 and d=5. Lower is better. At K=3, the PP pruner reaches 1.01x at d=3 but only 1.24x at d=5 -- the d=5 quality wall.
I also explored what happens when you abandon MWPM entirely. Union-Find, greedy Dijkstra, sparse Sinkhorn -- they all hit the same wall at d=5, clustering around 2.0x. That consistency told me the wall is structural, not algorithmic.
Table 7: Alternative Matching Methods
| Method | d | Error Rate | vs MWPM | Time Deterministic? |
|---|---|---|---|---|
| Union-Find (odd-only) | 3 | 14.4% | 1.218x | No |
| Dijkstra-greedy | 3 | 13.4% | 1.128x | No |
| Sparse Sinkhorn (tau=1.0, 3 iter) | 3 | 14.9% | 1.254x | Yes |
| Chain phase1 only | 3 | 34.7% | 2.93x | Yes |
| Union-Find (odd-only) | 5 | 26.5% | 2.433x | No |
| Dijkstra-greedy | 5 | 21.7% | 1.990x | No |
| Sparse Sinkhorn (tau=1.0, 3 iter) | 5 | 21.7% | 1.993x | Yes |
All d=5 alternatives cluster around 2.0x - the structural quality wall for non-MWPM methods.
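For concreteness, the kind of matcher "Dijkstra-greedy" suggests: repeatedly pair the two closest unmatched flagged detectors by shortest-path distance. This is a plausible reading of the method, with boundary handling omitted:

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def greedy_match(adj, flagged):
    """adj: weighted adjacency matrix of the detector graph (dense or sparse);
    flagged: indices of detectors that fired."""
    dist = dijkstra(adj, indices=flagged)[:, flagged]   # pairwise path lengths
    np.fill_diagonal(dist, np.inf)
    remaining = list(range(len(flagged)))
    pairs = []
    while len(remaining) >= 2:
        sub = dist[np.ix_(remaining, remaining)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        pairs.append((flagged[remaining[a]], flagged[remaining[b]]))
        for idx in sorted((a, b), reverse=True):        # drop matched detectors
            remaining.pop(idx)
    return pairs
```

The greedy commitment is exactly the "selection hurts" failure mode described in the key insights below: each early pairing can force a worse global pairing later.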
Training Curves
Training loss curves for the d=3, r=10 edge pruners (solid = train loss, dashed = validation loss; all models at p=0.01).
After ~50 experiments, these are the lessons that stuck. Each one was earned the hard way.
Key Insights
1. The decoder operates on the detector graph, not the lattice
Surface-code diagrams emphasize the lattice, but the decoder operates on the detector graph produced by the DEM. Every approach built on grid topology failed; every approach built on DEM topology did better, and all of the successful ones implicitly worked in that representation.
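This is easy to see in code: the matching graph falls straight out of the stim detector error model, with no reference to lattice coordinates. A minimal sketch (composite error mechanisms with separators are skipped for brevity):

```python
import math
import stim

def dem_edges(dem: stim.DetectorErrorModel):
    """One edge per graph-like error mechanism: its detectors, observables, weight."""
    edges = []
    for inst in dem.flattened():
        if inst.type != "error":
            continue
        p = inst.args_copy()[0]
        dets = [t.val for t in inst.targets_copy() if t.is_relative_detector_id()]
        obs = [t.val for t in inst.targets_copy() if t.is_logical_observable_id()]
        if len(dets) in (1, 2):                 # boundary edge or detector-detector edge
            edges.append((dets, obs, math.log((1 - p) / p)))
    return edges
```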
2. The DEM connectivity matrix is the keystone
Projecting through it (DeepPipe), iterating through it (PingPong), and masking attention with it (DEM-attention) are the three most productive architectural ideas.
3. Global matching pressure is hard to avoid
Multiple approaches converged toward structures similar to MWPM. No local feature, no NN, no iterative message passing can substitute for minimum-weight matching at d≥5. This suggests that global pairing may be an unavoidable component of effective decoders.
4. Pruning helps — selection hurts
Reducing the search space improves performance, but committing to edges early destroys global optimality. NNs can identify which edges are likely errors (AUC 0.99+), but they cannot determine which consistent subset to select. Premature commitment is catastrophic in matching problems.
5. Sinkhorn+2opt is the quality champion
1.030x at d=3, 1.256x at d=5, with deterministic runtime -- the closest any fixed-time decoder gets to MWPM.
6. Binary features > cost features
NNs work better with 0/1 presence indicators than floating-point costs. The network learns implicit weighting. Cost2hop underperforms binary1hop at both d=3 and d=5.
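A small sketch of the distinction, assuming per-detector 1-hop features built from the detector-graph adjacency (the exact feature layout inside the hybrid models is not shown here):

```python
import numpy as np

def one_hop_features(syndrome, adj, costs, binary=True):
    """syndrome: (n,) 0/1 detector firings; adj: (n, n) 0/1 detector-graph
    adjacency; costs: (n, n) edge weights. Row i describes detector i's
    neighbourhood."""
    fired = adj * syndrome[None, :]       # which neighbours of each detector fired
    if binary:
        return fired                      # 0/1 presence indicators (the stronger input)
    return fired * costs                  # same support, weighted by edge cost
```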
7. 2-hop features don't help
Adding 2-hop connectivity (binary or cost) does not improve over 1-hop, and often hurts. At d=5, binary2hop and cost2hop plateau at 60.2% accuracy.
8. Distance-5 introduces a structural difficulty wall
Many heuristics that perform well at d=3 degrade sharply at d=5. All fixed-graph decoders cluster around 2.0x at d=5 regardless of method. NNs drop 15-27 percentage points. This suggests that decoding pressure becomes global rather than local as distance increases.