Practical refinement

min2phase

Shuang Chen (cs0x7f)'s Java implementation pushes Kociemba two-phase to its practical limit. Symmetry compression shrinks tables by an order of magnitude, huge prune tables tighten phase-1 lower bounds, and the search interleaves phase-1 candidates with phase-2 to minimise total length. It's what csTimer and TNoodle ship.

Project identity

min2phase is written by Chinese speedcuber Shuang Chen (Chen Shuang, GitHub: cs0x7f) and released under dual GPLv3 + MIT. The Java package is literally cs.min2phase. The earliest public traces go back to Chinese cubing forums in the early 2010s; it has long been the bundled solver behind csTimer and TNoodle, and the upstream is still actively maintained by the author as of 2023.

The algorithmic skeleton is still Kociemba's two-phase. What sets min2phase apart is that every engineering knob has been turned to its limit: prune tables compressed by symmetry classes into a few hundred KB, ~200 ms startup, ~0.8 ms average solve, average length 19–20.6 moves depending on probe budget. After csTimer compiles it to JS in the browser, virtually every scramble a WCA-aware local timer hands you was generated by this solver.

Scope

This page is about the 3x3 two-phase solver. Inside csTimer, cs0x7f also wrote standalone solvers for 2x2 / 4x4 / 5x5 / Square-1 / Skewb with the same family of tricks; their tables are different and out of scope here.

Five core differences vs vanilla Kociemba

Herbert Kociemba's 1992 algorithm splits the 3x3 solve into G0→G1→solved. Phase 1 forces the cube into the subgroup ⟨U,R²,F²,D,L²,B²⟩ (zero edge orientation, zero corner orientation, four UD-slice edges in their layer). Phase 2 finishes the permutation inside that subgroup. The original implementation uses dense prune tables of tens of MB, and feeds the first phase-1 solution straight into phase-2, which is not necessarily globally shortest. min2phase keeps the same split but rebuilds five things.

Aspect	Vanilla Kociemba	min2phase
Prune-table storage	Dense raw-coord arrays, tens of MB	Sym-class compressed, 4-bit packed, ~1 MB total
Heuristic lower bound	UDSliceTwist or UDSliceFlip alone	Max of three: UDSliceTwist + UDSliceFlip + TwistFlip
Multi-axis + inverse	Single axis, forward only	3 G1 axes × forward / inverse = 6 orientations in parallel
Phase 1 ↔ Phase 2 interleaving	Commits to first phase-1 solution	Iterates on total length with allowShorter + pre-moves
Pre-scramble	None	MAX_PRE_MOVES=20, conjugate-equivalent re-roots

Stacked together: at maxDepth=21 the full-init average is 0.8 ms per solve at average length 20.6; raising probeMin to 10000 brings the average down to 18.63 moves, within striking distance of the proven 20-move God's number.

Two layers: CubieCube vs CoordCube

CubieCube is the cubie-level representation: 8 corners + 12 edges, each carrying (position, orientation). In code it is literally two arrays:

// CubieCube.java
byte[] ca = {0, 1, 2, 3, 4, 5, 6, 7};               // 8 corners
byte[] ea = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22}; // 12 edges

// ca[i] = orientation << 3 | position
// ea[i] = position << 1 | orientation

Each byte packs both fields: corners have 2 bits of twist (0–2) and 3 bits of slot (0–7); edges have 1 bit of flip and 4 bits of slot. Composing two CubieCubes (acting one cube state on another) is just permutation algebra:

// Corner multiplication: prod = a * b
static void CornMult(CubieCube a, CubieCube b, CubieCube prod) {
    for (int corn = 0; corn < 8; corn++) {
        int oriA = a.ca[b.ca[corn] & 7] >> 3;
        int oriB = b.ca[corn] >> 3;
        prod.ca[corn] = (byte) (a.ca[b.ca[corn] & 7] & 7
                              | (oriA + oriB) % 3 << 3);
    }
}

CubieCube is complete but doing 8+12 byte-array multiplications per node is too slow. CoordCube is its coordinate projection: collapse a cube state into a handful of integers (twist / flip / slice / …), and precompute, for each coordinate, where every one of the 18 moves takes it. Search then becomes table lookups instead of array ops:

// CoordCube.java — every coordinate gets a move table
static char[][] UDSliceMove = new char[N_SLICE][N_MOVES];   // 495  × 18
static char[][] TwistMove   = new char[N_TWIST_SYM][N_MOVES]; // 324  × 18 (sym-compressed)
static char[][] FlipMove    = new char[N_FLIP_SYM][N_MOVES];  // 336  × 18
static char[][] CPermMove   = new char[N_PERM_SYM][N_MOVES2]; // 2768 × 10
static char[][] EPermMove   = new char[N_PERM_SYM][N_MOVES2]; // 2768 × 10
static char[][] MPermMove   = new char[N_MPERM][N_MOVES2];    // 24   × 10

Names ending in SYM are symmetry-compressed versions; raw tables are indexed by the raw coordinate value. Each cell stores a char (16 bits) — wide enough for the (class<<3 | sym) packing used by sym-coords.

The six core coordinates

The point of coordinates: slice the full 4.3×10¹⁹-state space along some equivalence and label each slice with one integer, such that the 18 moves' action on the label is precomputable. min2phase uses six of them:

Coord	Size	Meaning	Phase
`Twist`	2187 = 3⁷	Orientation of 8 corners (7 independent)	P1
`Flip`	2048 = 2¹¹	Orientation of 12 edges (11 independent)	P1
`UDSlice`	495 = C(12,4)	Position of FL/FR/BL/BR among 12 edges, ignoring permutation among themselves	P1
`CPerm` / `EPerm`	40320 = 8!	Permutation of 8 corners / 8 UD-layer edges	P2
`MPerm`	24 = 4!	Permutation of the 4 E-slice edges within E layer	P2
`CComb` (Parity+Pos)	140 = 70 × 2	Position of 4 U-layer corners (70) × edge parity (×2)	P2

Phase 1 wants Twist, Flip and UDSlice all simultaneously zero — equivalently, zero corner-orientation, zero edge-orientation, and four E-slice edges back in the E layer. That is exactly the G1 subgroup definition. Phase 2 then drives CPerm/EPerm/MPerm to zero inside G1.

Parity is not a free coordinate; cs0x7f folds it into the high bit of CComb (70 corner positions × 2 parity bits = 140), saving one table.

Symmetry classes: collapsing 16 equivalent states into one

The cube has 48 spatial symmetries; the UD-axis-preserving subgroup has 16 elements (generated by S_U4 quarter-turns, the LR mirror S_LR2 and the F-axis 180° S_F2). If b = S⁻¹ a S for some such S, then a and b have the exact same optimal solution length — they are the same state viewed from a different orientation. So we can collapse each 16-orbit into one symmetry class, storing only the representative + a small sym index back.

Coord	raw	sym class	ratio
Twist	2187	324	6.75×
Flip	2048	336	6.10×
EPerm / CPerm	40320	2768	14.57×

The ratio falls short of 16× because some states are self-symmetric and need extra entries (selfSym). The packed format is (class<<3 | sym), fitting in a 16-bit char. Move transitions on a sym-coord use precomputed SymMult / SymMove / Sym8Move tables to translate move m into its equivalent in the representative's frame:

// CoordCube.doMovePrun — phase 1 transition on a sym-coord
twist = TwistMove[cc.twist][CubieCube.Sym8Move[m << 3 | cc.tsym]];
tsym  = (twist & 7) ^ cc.tsym;
twist >>= 3;

It is set up inside CubieCube.initFlipSym2Raw / initTwistSym2Raw / initPermSym2Raw, producing the bidirectional maps FlipS2R / TwistS2R / EPermS2R.

Why isn't UDSlice compressed?

UDSlice has only 495 values to begin with, and it is aligned with the UD axis — most symmetries in the group don't preserve "those four edges are in E", so compressing it is more trouble than it's worth. It stays a raw coordinate; the symmetry compression only applies to the other coordinate, giving raw×sym tables like UDSliceTwistPrun[495 × 324].

Huge prune tables + 4-bit packing

The phase-1 heuristic is the max of three tables:

// CoordCube.setWithPrun  (phase 1 lower-bound)
prun = max(
    getPruning(TwistFlipPrun,    twist<<11 | flipIdx),     // 2187 × 2048 → "huge"
    getPruning(UDSliceTwistPrun, twist*N_SLICE + sliceX),  // 324  × 495
    getPruning(UDSliceFlipPrun,  flip *N_SLICE + sliceX)); // 336  × 495

The first one — TwistFlipPrun — is cs0x7f's key addition, the "Huge prune table". It has N_TWIST_SYM × N_FLIP = 324 × 2048 = 663,552 cells, 4 bits each, totalling 332 KB. It supplies a joint lower bound on corner+edge orientation alone, abstracting UDSlice away entirely. The accuracy is visible in pruningValue.txt — by depth 9 it already covers 99.97% of states:

// pruningValue.txt — TwistFlipPrun depth distribution
 1         3
 2        15
 3       106
 4       951
 5      9826
 6     89711
 7    447496
 8    662070
 9    663552   ← full coverage at depth 9, total 663,552 cells

The other two saturate only around depth 14; the huge table gives a bound that is on average 1–2 moves tighter, which cascades into exponential pruning of the phase-1 tree. Algorithm.md gives the per-table averages: UDSliceTwist 6.76, UDSliceFlip 6.85, TwistFlip 7.18 — the huge table is tightest but the real win is taking max over all three at every node.

All prune tables are 4-bit packed: one int holds 8 nibbles, each nibble being that cell's distance to solved mod 16. Read/write are one-liners:

static int  getPruning(int[] t, int i) { return  t[i >> 3] >> (i << 2) & 0xf; }
static void setPruning(int[] t, int i, int v) { t[i >> 3] ^= v << (i << 2); }

The (i << 2) wraps mod 32 in Java, so the same expression yields both the word offset and the nibble bit offset, saving an (i & 7) * 4. TwistFlipPrun is 332 KB, UDSliceTwistPrun / UDSliceFlipPrun ~80 KB each, MCPermPrun 33 KB, EPermCCombPPrun 195 KB — together right around the advertised ~1 MB.

Construction is BFS with an inverse-fill switch: forward-fill while the frontier is small, then once depth exceeds INV_DEPTH switch to scanning unfilled cells and asking whether any successor is already filled — cheaper when the unfilled set has shrunk. See CoordCube.initRawSymPrun.

Phase 1: IDA* with triple-max heuristic

Phase 1 is IDA* iterated by total depth length1 from 0 upwards. At each iteration limit, a DFS expands all 18 moves and calls doMovePrun to get the successor's heuristic; the branch is pruned the moment prun + depth > maxl:

// Search.phase1 (excerpt)
for (int axis = 0; axis < 18; axis += 3) {
    if (axis == lm || axis == lm - 9) continue;       // skip same / opposite axis
    for (int power = 0; power < 3; power++) {
        int m = axis + power;
        if (skipMoves != 0 && (skipMoves & 1 << m) != 0) continue;

        int prun = nodeUD[maxl].doMovePrun(node, m, true);
        if      (prun >  maxl) break;        // whole axis ruled out
        else if (prun == maxl) continue;     // this power ruled out

        if (USE_CONJ_PRUN) {
            prun = nodeUD[maxl].doMovePrunConj(node, m);
            if (prun >  maxl) break;
            if (prun == maxl) continue;
        }

        move[depth1 - maxl] = m;
        int ret = phase1(nodeUD[maxl], ssym & moveCubeSym[m], maxl - 1, axis);
        if (ret == 0) return 0;
        if (ret >= 2) break;
    }
}

Three things to notice. (1) Among the three powers (U / U2 / U') on the same axis, continue rules out just that power while break kills the whole axis. cs0x7f signals the grain via "> maxl" vs "== maxl". (2) skipMoves uses selfSym: if the current state has a non-trivial stabilizer, we can drop first-moves that are symmetry-equivalent to ones already tried, avoiding duplicate expansions. (3) USE_CONJ_PRUN queries the same prune table at one conjugate of the state — effectively doubling the heuristic for free, picking up another move or two of pruning.

When node.prun == 0 && maxl < 5 and allowShorter permits, phase1 chooses to exit early — the hook into phase 2 that the next section unpacks.

Phase 2: 10-move search inside G1

Inside phase 2 the move set shrinks to 10: U, U2, U', D, D2, D', R2, F2, L2, B2 — anything else would leave G1. Three coordinates fully describe the state: EPerm (UD edges), CPerm (corners), MPerm (E-slice internal). The heuristic is again a max:

// Search.initPhase2 — phase-2 lower bound
int prun = max(
    getPruning(MCPermPrun,       cperm * N_MPERM + MPermConj[mid][csym]),
    getPruning(EPermCCombPPrun,  edge  * N_COMB  + CCombPConj[Perm2CombP[corn]][...]),
    getPruning(EPermCCombPPrun,  edgei * N_COMB  + CCombPConj[Perm2CombP[corni]][...])  // inverse
);

The third term is the same heuristic evaluated on the inverse of the current state. cs0x7f makes getPermSymInv an O(1) lookup, so it's free. Phase-2 IDA* runs within maxDep2 (≤12). USE_COMBP_PRUN folds parity into the 70–139 range of CComb so it contributes to the pruning bound automatically.

// Search.phase2 — branching with parity-aware move skip
int moveMask = Util.ckmv2bit[lm];
for (int m = 0; m < 10; m++) {
    if ((moveMask >> m & 1) != 0) { m += 0x42 >> m & 3; continue; }
    int midx  = MPermMove[mid][m];
    int cornx = CPermMove[corn][SymMoveUD[csym][m]];
    int csymx = SymMult[cornx & 0xf][csym];
    cornx >>= 4;
    int edgex = EPermMove[edge][SymMoveUD[esym][m]];
    int esymx = SymMult[edgex & 0xf][esym];
    edgex >>= 4;
    // … prun checks … then recurse phase2(edgex, esymx, cornx, csymx, midx, maxl-1, …);
}

m += 0x42 >> m & 3 is one of cs0x7f's bit-LUT tricks: 0x42 = 0100 0010₂ encodes "at m=1 skip 2, at m=6 skip 2", letting the loop hop past sibling powers on the same axis without an inner branch.

Phase-1 ↔ Phase-2 interleaving

Vanilla Kociemba commits to the first phase-1 solution and chains phase-2, returning a legal but not necessarily shortest answer (say 23 moves). The issue: a 7-move phase 1 + 16-move phase 2 may lose to a 9-move phase 1 + 12-move phase 2 = 21, because the longer phase 1 lands at a friendlier point in G1. Vanilla won't look back.

min2phase iterates the outer loop by total length length1. The top-level search() loop:

// Search.search() — outer loop on total length
for (length1 = 0; length1 < solLen; length1++) {
    maxDep2 = Math.min(MAX_DEPTH2, solLen - length1 - 1);
    for (urfIdx = 0; urfIdx < 6; urfIdx++) {
        if ((conjMask & 1 << urfIdx) != 0) continue;   // axis already covered by selfSym
        if (phase1PreMoves(maxPreMoves, -30,
                           urfCubieCube[urfIdx],
                           (int)(selfSym & 0xffff)) == 0) {
            return solution == null ? "Error 8" : solution.toString();
        }
    }
}

For each value of length1, every phase-1 solution found is immediately handed to phase 2 (initPhase2); a phase1+phase2 chain of total ≤ solLen is accepted. The instant one is accepted, maxDep2 is tightened to solLen - length1 - 1 and the search keeps probing more phase-1 routes at this same length — a different phase-1 path may unlock a shorter phase 2. probeMin sets how many phase-2 probes must elapse before early exit. That's the "find shorter solutions" knob in the README.

Two more multipliers: 3 axes × forward/inverse. urfCubieCube[0..5] stores the same cube viewed through six URF-conjugates and inverses; phase 1 runs once on each — matching Algorithm.md's "try three G1 targets + the inverse state". Pre-moves rewrite Phase1 · Phase2 · PreMoves = Scramble⁻¹, letting phase 1 not start at the literal scramble — another factor of candidates. conjMask uses the selfSym bitmap to skip urfIdx values already covered by a symmetry equivalence.

Termination

(a) Solution at length ≤ maxDepth and probe ≥ probeMin → return; (b) probe ≥ probeMax with no solution → "Error 8"; (c) length1 exhausts solLen with no solution → "Error 7" (no solution within the given maxDepth).

Benchmark: the author's numbers

Benchmark.md ships two tables (author's environment). The first is target-length vs solve time:

Init	Unlimited	21 moves	20 moves	Init time
Full	0.685 ms	0.805 ms	4.62 ms	195 ms
Partial	4.12 ms	4.52 ms	19.0 ms	60 ms

Full init precomputes every prune table (~195 ms), after which each solve costs 0.8 ms; partial init only fills BFS up to a shallow depth — init drops to 60 ms but solves are 5–6× slower. The second table shows the probe-budget trade-off:

probeMin	Avg length	Time
5	20.63	0.827 ms
10	20.48	1.07 ms
20	20.27	1.55 ms
50	19.95	2.83 ms
100	19.68	4.87 ms
200	19.48	8.58 ms
500	19.29	19.4 ms
1000	19.11	36.3 ms
10000	18.63	—

Typical csTimer setup: probeMin = 0 (accept the first legal-length-≤21 solution); one scramble costs ~1 ms. For fewest-move analysis, raise probeMin to 1000+ and the average compresses to 19.1 moves at a tens-of-ms cost per solve. Memory per the README: ~1 MB with the TwistFlip table, ~0.7 MB without. That budget is precisely why this fits inside a browser JS bundle.

Engineering details

Search is stateful but solution() is synchronized. Use one Search instance per thread; do not call solution() / next() on the same instance concurrently.
Search.init() is global and one-shot. All CoordCube static tables are process-wide. Subsequent init() calls are short-circuited by initLevel.
Allocation-free hot loop. nodeUD/nodeRL/nodeFB are 21 preallocated CoordCube slots; urfCubieCube[0..5] and preMoveCubes[1..MAX_PRE_MOVES] likewise. The entire inner search reuses these slots — zero per-node allocation.
continueSearch / next(). After the first solution the search state is kept (isRec = true); next(probeMax, probeMin, verbose) resumes IDA* in place to keep finding shorter solutions until the probe budget runs out.
Auto move-merging. Same-axis consecutive moves (R + R → R2, R2 + R' → R) are merged on append, including across the phase-1 / phase-2 join. The reported length is post-merge.
Four verbose flags. USE_SEPARATOR / INVERSE_SOLUTION / APPEND_LENGTH / OPTIMAL_SOLUTION. OPTIMAL_SOLUTION swaps to searchopt() which iterates all three axes' phase 1 simultaneously, producing a provably optimal solve but an order of magnitude slower. csTimer leaves it off.

Deployment: from Java to browser JS

min2phase ships as Java. Putting it inside csTimer is less about hand-porting and more about exploiting Java's strictness and zero external deps. Historically csTimer used GWT (Google Web Toolkit) to compile cs.min2phase to equivalent JS; later cs0x7f maintained a hand-translated JS port at cstimer/src/lib/scramble_333.js, almost line-for-line aligned with the Java source, swapping char[][] for Uint16Array and int[] for Int32Array.

TNoodle (the WCA's official scramble generator) takes a different route: it runs cs.min2phase as a jar on the JVM, hashes the scrambles, and feeds them into the WCA competition system. csTimer and TNoodle share the same algorithm but on different runtimes — TNoodle generates every official WCA scramble.

The iframe at /cstimer on this site runs that JS port of min2phase; the default 3x3 scramble at /scramble/gen is also it (random state + min2phase solve, scramble keeps the inverse). /scramble/analyzer uses cubeopt-wasm — a more aggressive SAB + multi-thread approach — running in parallel.

Closing

min2phase is the textbook case of "engineering the last drop out of an existing algorithm". Kociemba's two-phase framework is not touched; symmetry compression + the huge prune + multi-axis views + full-length IDA* iteration drop the same algorithm from tens of MB and seconds to 1 MB and milliseconds. It doesn't propose a new algorithm — it just makes the old one cheap enough to live in a browser tab. Which is why every WCA scrambler of the last decade looks the way it does.