Results
All results are reported as the decoder's logical error rate divided by MWPM's. Lower is better; 1.0x = identical to MWPM.
Conditions: p=0.01, d=3 (80 detectors, 222 edges) and d=5 (240 detectors, 930 edges), r=10 unless noted. MWPM baseline: 11.8% (d=3), 10.9% (d=5).
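For reference, a baseline like this can be measured with a few lines of stim + PyMatching. The sketch below is a minimal version, assuming a stim-generated rotated-memory circuit with uniform noise p (the detector counts match those above, but the exact circuit family and noise parameters are an assumption):

```python
import numpy as np
import stim
import pymatching

def mwpm_logical_error_rate(d: int, rounds: int, p: float, shots: int = 100_000) -> float:
    # Assumed circuit family: stim's built-in rotated surface-code memory experiment.
    circuit = stim.Circuit.generated(
        "surface_code:rotated_memory_z",
        distance=d,
        rounds=rounds,
        after_clifford_depolarization=p,
        before_measure_flip_probability=p,
        after_reset_flip_probability=p,
    )
    dem = circuit.detector_error_model(decompose_errors=True)
    matcher = pymatching.Matching.from_detector_error_model(dem)

    # Sample detection events plus the true logical flips, then decode in a batch.
    dets, obs = circuit.compile_detector_sampler().sample(shots, separate_observables=True)
    preds = matcher.decode_batch(dets)
    return float(np.mean(np.any(preds != obs, axis=1)))
```

Whether `mwpm_logical_error_rate(3, 10, 0.01)` reproduces the 11.8% / 10.9% figures exactly depends on matching the original noise model.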
Here's the bottom line -- every decoder I built over ~50 experiments, ranked against MWPM. The short version: Sinkhorn + local search gets remarkably close.
Table 1: Full Decoders Overview
| Approach | d=3 vs MWPM | d=5 vs MWPM | Time Deterministic? |
|---|---|---|---|
| MWPM (PyMatching) | 1.000x | 1.000x | No |
| Sinkhorn + 2-opt | 1.030x | 1.256x | Yes |
| Sinkhorn + 3-opt | 1.044x | 1.183x | Yes |
| Sinkhorn (best tau/iter) | 1.135x | 1.306x | Yes |
| Dijkstra-greedy | 1.128x | 1.990x | No |
| Sparse Sinkhorn (3 iter) | 1.254x | 1.993x | Yes |
| Union-Find (odd-only) | 1.218x | 2.433x | No |
| Chain phase1 only | 2.93x | - | Yes |
Sinkhorn best: tau=2.0 iter=20 (d=3), tau=0.5 iter=20 (d=5).
Before tackling the full 3D problem, here's what happened in 2D (single measurement round). At d=3, learned decoders actually beat MWPM - by learning where it makes mistakes.
Table 2: 2D (Single-Round) Results
| Approach | d=3 vs MWPM | d=5 vs MWPM | Time Deterministic? |
|---|---|---|---|
| MWPM (PyMatching) | 1.000x | 1.000x | No |
| Shift-OR 2D | 1.000x | 1.000x | Yes |
| NN + AND/XOR features | ~0.91x | - | Yes |
| PP + MWPM Distillation | 0.90x | 1.108x | Yes |
All r=1 (single round), p=0.01. PP+Distillation also beats MWPM at d=3 r=3: 0.948x. All approaches degrade at d=5+ and with more rounds.
Why explore Sinkhorn relaxations?
MWPM provides excellent decoding accuracy but involves irregular, branch-heavy computation that is difficult to parallelize or implement in hardware with strict latency guarantees. Sinkhorn relaxations provide a different structure: deterministic runtime, highly parallel matrix operations, and compatibility with GPUs or specialized accelerators. This makes them interesting to explore as bounded-time decoding approximations.
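To make that structure concrete, here is a minimal sketch of a Sinkhorn relaxation over the pairwise path costs between flagged detectors. The greedy rounding at the end and the (omitted) boundary handling are illustrative choices, not necessarily what the experiments used:

```python
import numpy as np

def sinkhorn_match(cost: np.ndarray, tau: float = 2.0, n_iter: int = 20):
    """Soft-match flagged detectors, then round greedily.

    cost[i, j] = shortest-path weight between flagged detectors i and j.
    Boundary handling (odd syndrome weight) is omitted for brevity.
    """
    n = cost.shape[0]
    K = np.exp(-cost / tau)          # temperature-scaled affinities
    np.fill_diagonal(K, 0.0)         # a detector cannot pair with itself
    for _ in range(n_iter):          # alternate row / column normalization
        K = K / K.sum(axis=1, keepdims=True)
        K = K / K.sum(axis=0, keepdims=True)
    P = 0.5 * (K + K.T)              # symmetrize the soft assignment

    # Greedy rounding: repeatedly take the strongest remaining pair.
    pairs, used = [], set()
    order = sorted(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: -P[ij])
    for i, j in order:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs
```

The only operations are exponentials, row/column sums, and element-wise divides, which is what makes the runtime deterministic and GPU-friendly.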
Once I committed to Sinkhorn, I needed to find the right temperature and iteration count. This was a systematic grid search -- and the landscape turned out to be surprisingly non-monotonic.
Table 3: Sinkhorn Parameter Sweep
Ratio vs MWPM across tau (temperature) and n_iter (number of Sinkhorn iterations).
d=3, r=10, p=0.01 (MWPM err: 11.8%)
| tau \ iter | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|
| 0.05 | 2.215x | 2.290x | 2.335x | 2.356x | 2.369x |
| 0.1 | 1.326x | 1.324x | 1.340x | 1.330x | 1.343x |
| 0.2 | 1.154x | 1.165x | 1.148x | 1.138x | 1.145x |
| 0.5 | 1.162x | 1.164x | 1.161x | 1.149x | 1.145x |
| 1 | 1.146x | 1.151x | 1.146x | 1.150x | 1.151x |
| 2 | 1.139x | 1.137x | 1.135x | 1.137x | 1.137x |
| 5 | 1.148x | 1.144x | 1.147x | 1.149x | 1.147x |
Best: 1.135x (tau=2.0, iter=20).
d=5, r=10, p=0.01 (MWPM err: 10.9%)
| tau \ iter | 5 | 10 | 20 | 50 |
|---|---|---|---|---|
| 0.2 | 1.531x | 1.457x | 1.373x | 1.361x |
| 0.5 | 1.406x | 1.330x | 1.306x | 1.306x |
| 1 | 1.440x | 1.385x | 1.351x | 1.342x |
| 2 | 1.773x | 1.725x | 1.700x | 1.648x |
| 5 | 1.958x | 1.911x | 1.879x | 1.868x |
Best: 1.306x (tau=0.5, iter=20).
Raw Sinkhorn gets you to 1.135x. The question was: can local search close the remaining gap? Turns out 2-opt is the sweet spot -- 3-opt helps at d=5 but the compute cost is brutal.
Table 4: Post-Processing Improvement Chain
| Stage | d=3 vs MWPM | d=5 vs MWPM | Change vs previous stage |
|---|---|---|---|
| Sinkhorn only | 1.135x | 1.306x | - |
| + 2-opt local search | 1.030x | 1.256x | -9.3% (d=3), -3.8% (d=5) |
| + 3-opt local search | 1.044x | 1.183x | +1.4% (d=3), -5.8% (d=5) |
2-opt is the big win at d=3. 3-opt helps at d=5 but costs 50.8 s (impractical). Note: at d=3, 3-opt is slightly worse than 2-opt alone (1.044x vs 1.030x) -- the larger move set settles into a worse local minimum.
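What the "+ 2-opt" stage does, roughly: take the rounded pairing and keep re-wiring any two pairs while that lowers the total path weight. The sketch below is a simplified pass, not the tuned implementation behind these numbers:

```python
def two_opt(pairs, cost):
    """Repeatedly re-wire two matched pairs if one of the two alternative
    pairings of the same four detectors has lower total weight.

    pairs: list of (i, j) index pairs; cost: symmetric numpy array of path weights.
    """
    improved = True
    while improved:
        improved = False
        for a in range(len(pairs)):
            for b in range(a + 1, len(pairs)):
                (i, j), (k, l) = pairs[a], pairs[b]
                best = cost[i, j] + cost[k, l]
                # The two alternative re-pairings of the same four detectors.
                for p1, p2 in (((i, k), (j, l)), ((i, l), (j, k))):
                    if cost[p1] + cost[p2] < best:
                        pairs[a], pairs[b] = p1, p2
                        best = cost[p1] + cost[p2]
                        improved = True
    return pairs
```

3-opt does the same over triples of pairs, which is where the much larger cost comes from.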
This was the biggest chapter of the exploration -- six different NN architectures, from simple PingPong to hybrid Sinkhorn+NN. I tried everything I could think of. The punchline: NNs learn features beautifully at d=3, but none of them crack the matching problem at d=5.
Table 5: NN Decoder Architectures
| Model | d | r | Params | Test Acc | MWPM Err |
|---|---|---|---|---|---|
| PingPong (d=3, r=3) | | | | | |
| PP 2R baseline | 3 | 3 | 49.9K | 88.0% | 4.5% |
| Scale A (8ch 128x64) | 3 | 3 | 50.2K | 86.8% | 4.5% |
| Scale B (16ch 256x128) | 3 | 3 | 142.9K | 90.1% | 4.5% |
| Scale D (32ch 256x128) | 3 | 3 | 202.0K | 90.3% | 4.5% |
| Scale C (16ch 512x256) | 3 | 3 | 347.6K | 91.1% | 4.5% |
| Temporal Mixing (d=3, r=10) | | | | | |
| tconv PP_2R | 3 | 10 | 416.2K | 74.0% | 11.6% |
| tconv PP_4R | 3 | 10 | 417.3K | 76.4% | 11.6% |
| tconv Inter_k3 | 3 | 10 | 417.8K | 78.3% | 11.6% |
| tconv Inter_k5 | 3 | 10 | 418.8K | 70.0% | 11.6% |
| tconv Post_k3 | 3 | 10 | 417.0K | 64.2% | 11.6% |
| DEM-Attention | | | | | |
| demattn PP_2R | 3 | 10 | 416.2K | 65.1% | 11.4% |
| demattn Inter_k3 | 3 | 10 | 417.8K | 67.9% | 11.4% |
| demattn Attn_2L | 3 | 10 | 417.3K | 86.4% | 11.6% |
| demattn Attn_4L | 3 | 10 | 419.5K | 85.4% | 11.6% |
| demattn Attn_2L | 5 | 5 | 1.33M | 81.4% | 5.5% |
| demattn Attn_4L | 5 | 5 | 1.33M | 83.1% | 5.5% |
| Hybrid (Sinkhorn + NN) | | | | | |
| hybrid sinkhorn P1_Med | 3 | 10 | 307.8K | 87.2% | 11.6% |
| hybrid binary1hop P1_Med | 3 | 10 | 307.8K | 86.7% | 11.6% |
| hybrid b1h+txorpadcross P1_Med | 3 | 10 | 594.5K | 87.8% | 11.6% |
| hybrid binary2hop P1_Med | 3 | 10 | 717.4K | 86.0% | 11.6% |
| hybrid cost2hop P1_Med | 3 | 10 | 717.4K | 83.4% | 11.6% |
| hybrid sinkhorn P1_Med | 5 | 10 | 901.7K | 71.2% | 10.8% |
| hybrid b1h+txorpadcross P1_Med | 5 | 10 | 1.82M | 67.9% | 10.8% |
| hybrid binary1hop P1_Med | 5 | 10 | 901.7K | 61.8% | 10.8% |
Best d=3 r=10 NN: hybrid b1h+txorpadcross (87.8%). All NNs degrade sharply at d=5.
Test Accuracy vs Model Size (PingPong scaling): diminishing returns -- 7x more params (A→C) gives only +4% accuracy at d=3 and +4% at d=5.
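To make the "temporal mixing" idea concrete, here is an illustrative PyTorch sketch of a tconv-style block: per-round detector features mixed by a 1D convolution over the round axis. The layer sizes and readout head are assumptions, not the exact models in Table 5:

```python
import torch
import torch.nn as nn

class TemporalConvMixer(nn.Module):
    """Illustrative temporal-mixing block (not one of the exact tconv models above)."""

    def __init__(self, dets_per_round: int, channels: int = 16, kernel: int = 3):
        super().__init__()
        self.embed = nn.Linear(dets_per_round, channels)
        self.tconv = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))

    def forward(self, syndromes: torch.Tensor) -> torch.Tensor:
        # syndromes: (batch, rounds, dets_per_round) detector firings as floats
        x = self.embed(syndromes)            # per-round feature embedding
        x = self.tconv(x.transpose(1, 2))    # mix information across rounds
        return self.head(torch.relu(x))      # logit for the logical observable
```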
After the NN decoder chapter, I realized NNs shine at pruning, not decoding. A tiny 346-parameter network can tell you which edges matter -- and then you hand the pruned graph to MWPM. This is where neural networks actually earn their keep.
Table 6: Edge Pruner Comparison
| Model | d | Params | AUC | Recall@3 | Recall@5 | MWPM@3 | MWPM@5 |
|---|---|---|---|---|---|---|---|
| MatOps | 3 | 0 | 0.973 | 92.5% | 97.0% | 1.07x | 1.02x |
| Linear | 3 | 11 | 0.979 | 92.3% | 96.9% | 1.08x | 1.02x |
| PP 2R 8ch | 3 | 346 | 0.988 | 95.4% | 98.5% | 1.01x | 1.00x |
| PP 2R 16ch | 3 | 1,202 | 0.989 | 95.8% | 98.5% | 1.01x | 1.00x |
| MatOps | 5 | 0 | 0.960 | 88.6% | 93.7% | 1.33x | 1.17x |
| Linear | 5 | 11 | 0.961 | 88.7% | 93.7% | 1.33x | 1.17x |
| PP 2R 8ch | 5 | 346 | 0.970 | 88.7% | 93.8% | 1.24x | 1.06x |
| PP 4R 16ch | 5 | 2,290 | 0.983 | 90.5% | 95.6% | 1.14x | 1.01x |
MWPM@K = MWPM error rate on graph pruned to top-K edges per detector, divided by unpruned MWPM. At d=3, PP 2R 16ch with K=5 recovers exact MWPM quality (1.00x).
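Concretely, the MWPM@K evaluation amounts to keeping only the top-K highest-scoring edges per detector and rebuilding the PyMatching graph from what survives. The edge-list bookkeeping below is an assumption about the format, not the exact pipeline:

```python
import pymatching

def mwpm_on_pruned_graph(edges, weights, fault_ids, scores, num_detectors, k=3):
    """edges: list of (u, v) detector pairs, with v=None for boundary edges;
    weights / fault_ids: the corresponding MWPM edge data from the DEM;
    scores: per-edge pruner outputs (higher = more likely part of the error)."""
    keep = set()
    for det in range(num_detectors):
        incident = [i for i, (u, v) in enumerate(edges) if det in (u, v)]
        incident.sort(key=lambda i: -scores[i])
        keep.update(incident[:k])             # top-k edges per detector survive

    m = pymatching.Matching()
    for i in sorted(keep):
        u, v = edges[i]
        if v is None:
            m.add_boundary_edge(u, weight=weights[i], fault_ids=fault_ids[i])
        else:
            m.add_edge(u, v, weight=weights[i], fault_ids=fault_ids[i])
    return m                                  # decode as usual; compare against the unpruned matcher
```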
Pruner: MWPM error on pruned graph vs K (candidates per detector), shown for d=3 and d=5. Lower is better. At K=3, the PP pruner reaches 1.01x at d=3 but only 1.24x at d=5 -- the d=5 quality wall.
I also explored what happens when you abandon MWPM entirely. Union-Find, greedy Dijkstra, sparse Sinkhorn -- they all hit the same wall at d=5, clustering around 2.0x. That consistency told me the wall is structural, not algorithmic.
Table 7: Alternative Matching Methods
| Method | d | Error Rate | vs MWPM | Time Deterministic? |
|---|---|---|---|---|
| Union-Find (odd-only) | 3 | 14.4% | 1.218x | No |
| Dijkstra-greedy | 3 | 13.4% | 1.128x | No |
| Sparse Sinkhorn (tau=1.0, 3 iter) | 3 | 14.9% | 1.254x | Yes |
| Chain phase1 only | 3 | 34.7% | 2.93x | Yes |
| Union-Find (odd-only) | 5 | 26.5% | 2.433x | No |
| Dijkstra-greedy | 5 | 21.7% | 1.990x | No |
| Sparse Sinkhorn (tau=1.0, 3 iter) | 5 | 21.7% | 1.993x | Yes |
All d=5 alternatives cluster around 2.0x - the structural quality wall for non-MWPM methods.
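For concreteness, the kind of matcher "Dijkstra-greedy" suggests: repeatedly pair the two closest unmatched flagged detectors by shortest-path distance. This is a plausible reading of the method, with boundary handling omitted:

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def greedy_match(adj, flagged):
    """adj: weighted adjacency matrix of the detector graph (dense or sparse);
    flagged: indices of detectors that fired."""
    dist = dijkstra(adj, indices=flagged)[:, flagged]   # pairwise path lengths
    np.fill_diagonal(dist, np.inf)
    remaining = list(range(len(flagged)))
    pairs = []
    while len(remaining) >= 2:
        sub = dist[np.ix_(remaining, remaining)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        pairs.append((flagged[remaining[a]], flagged[remaining[b]]))
        for idx in sorted((a, b), reverse=True):        # drop matched detectors
            remaining.pop(idx)
    return pairs
```

The greedy commitment is exactly the "selection hurts" failure mode described in the key insights below: each early pairing can force a worse global pairing later.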
Training Curves
Training loss curves for the d=3, r=10 edge pruners (solid = train loss, dashed = validation loss; all models at p=0.01).
After ~50 experiments, these are the lessons that stuck. Each one was earned the hard way.
Key Insights
1. The decoder operates on the detector graph, not the lattice
Surface-code diagrams emphasize the lattice, but the decoder operates on the detector graph produced by the DEM. Every approach built on grid topology failed; every approach built on DEM topology did better, and all of the successful ones implicitly worked in that representation.
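This is easy to see in code: the matching graph falls straight out of the stim detector error model, with no reference to lattice coordinates. A minimal sketch (composite error mechanisms with separators are skipped for brevity):

```python
import math
import stim

def dem_edges(dem: stim.DetectorErrorModel):
    """One edge per graph-like error mechanism: its detectors, observables, weight."""
    edges = []
    for inst in dem.flattened():
        if inst.type != "error":
            continue
        p = inst.args_copy()[0]
        dets = [t.val for t in inst.targets_copy() if t.is_relative_detector_id()]
        obs = [t.val for t in inst.targets_copy() if t.is_logical_observable_id()]
        if len(dets) in (1, 2):                 # boundary edge or detector-detector edge
            edges.append((dets, obs, math.log((1 - p) / p)))
    return edges
```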
2. The DEM connectivity matrix is the keystone
Projecting through it (DeepPipe), iterating through it (PingPong), and masking attention with it (DEM-attention) are the three most productive architectural ideas.
3. Global matching pressure is hard to avoid
Multiple approaches converged toward structures similar to MWPM. No local feature, no NN, no iterative message passing can substitute for minimum-weight matching at d≥5. This suggests that global pairing may be an unavoidable component of effective decoders.
4. Pruning helps — selection hurts
Reducing the search space improves performance, but committing to edges early destroys global optimality. NNs can identify which edges are likely errors (AUC 0.99+), but they cannot determine which consistent subset to select. Premature commitment is catastrophic in matching problems.
5. Sinkhorn+2opt is the quality champion
1.030x at d=3, 1.256x at d=5, with deterministic runtime -- the closest any fixed-time decoder gets to MWPM.
6. Binary features > cost features
NNs work better with 0/1 presence indicators than floating-point costs. The network learns implicit weighting. Cost2hop underperforms binary1hop at both d=3 and d=5.
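A small sketch of the distinction, assuming per-detector 1-hop features built from the detector-graph adjacency (the exact feature layout inside the hybrid models is not shown here):

```python
import numpy as np

def one_hop_features(syndrome, adj, costs, binary=True):
    """syndrome: (n,) 0/1 detector firings; adj: (n, n) 0/1 detector-graph
    adjacency; costs: (n, n) edge weights. Row i describes detector i's
    neighbourhood."""
    fired = adj * syndrome[None, :]       # which neighbours of each detector fired
    if binary:
        return fired                      # 0/1 presence indicators (the stronger input)
    return fired * costs                  # same support, weighted by edge cost
```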
7. 2-hop features don't help
Adding 2-hop connectivity (binary or cost) does not improve over 1-hop, and often hurts. At d=5, binary2hop and cost2hop plateau at 60.2% accuracy.
8. Distance-5 introduces a structural difficulty wall
Many heuristics that perform well at d=3 degrade sharply at d=5. All fixed-graph decoders cluster around 2.0x at d=5 regardless of method. NNs drop 15-27 percentage points. This suggests that decoding pressure becomes global rather than local as distance increases.