MPI tuning on the WRF host: what we have actually measured
WRF performance tuning is a deep rabbit hole, and most of the published guidance is for academic clusters with very different hardware to a 12 vCPU VPS. The interesting result on our setup is that the biggest single wall-clock win so far came not from tuning MPI but from cutting the domain extent. That story is in the trimmed-domain post. This post is about the MPI configuration itself: what we run, what we have measured directly, and what is still educated guessing from sensible defaults.
What we know. The shipping default is `RASP_SCHEDULED_MPI_RANKS=10` on the live worker. The upgraded-worker benchmark ran WRF in roughly 13 minutes wall-clock with that setup on the trimmed domain; the live `raspuk` worker takes roughly 28 to 30 minutes for WRF and roughly 33 to 36 minutes end-to-end. Quilting (dedicated I/O ranks, controlled by `nio_tasks_per_group`) is disabled in the shipping namelist. That came out of an earlier experiment that compared quilted and non-quilted runs at the same domain and provider and found no clear runtime win on this hardware - the disk I/O is fast enough that it does not bottleneck the compute, so the dedicated I/O ranks are pure overhead.
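As a rough illustration of why idle I/O ranks cost compute, here is a minimal sketch of how the quilting settings carve up the rank budget. The shipping namelist is not reproduced in this post, so the values below are assumptions: `nio_tasks_per_group = 0` with `nio_groups = 1` is, as far as we understand it, WRF's standard way of switching quilting off. The helper names `quilt_block` and `compute_ranks` are hypothetical.

```python
def quilt_block(nio_tasks_per_group: int = 0, nio_groups: int = 1) -> str:
    """Render a &namelist_quilt block for namelist.input (assumed values)."""
    lines = [
        "&namelist_quilt",
        f" nio_tasks_per_group = {nio_tasks_per_group},",
        f" nio_groups = {nio_groups},",
        "/",
    ]
    return "\n".join(lines)


def compute_ranks(total_ranks: int, nio_tasks_per_group: int = 0,
                  nio_groups: int = 1) -> int:
    """Ranks left for the solver once dedicated I/O ranks are taken out."""
    io_ranks = nio_tasks_per_group * nio_groups
    if io_ranks >= total_ranks:
        raise ValueError("quilting would leave no compute ranks")
    return total_ranks - io_ranks


if __name__ == "__main__":
    print(quilt_block())                                 # quilting disabled
    print(compute_ranks(10))                             # all 10 ranks compute
    print(compute_ranks(10, nio_tasks_per_group=2))      # 8 compute, 2 I/O
```

With 10 ranks in total, every rank handed to I/O is a rank the solver does not get, which is why quilting only pays off when output is actually the bottleneck.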
What we have measured directly. The full-vs-trimmed domain comparison on the upgraded benchmark worker (75 minutes end-to-end versus 16 minutes end-to-end on the same source cycle). The quilted-vs-non-quilted comparison at the same domain. The basic validation that the live cycle runs deterministically and produces artifacts that pass the post-process schema checks. That is essentially the measurement footprint we have. Everything else in the rank-count or decomposition space is educated guessing from defaults.
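For scale, the arithmetic behind the full-vs-trimmed comparison, using only the two end-to-end figures quoted above (both are rounded, so treat the result as approximate):

```python
# End-to-end minutes on the upgraded benchmark worker, same source cycle.
FULL_DOMAIN_MIN = 75.0
TRIMMED_DOMAIN_MIN = 16.0

speedup = FULL_DOMAIN_MIN / TRIMMED_DOMAIN_MIN
saved_min = FULL_DOMAIN_MIN - TRIMMED_DOMAIN_MIN

print(f"speedup: {speedup:.1f}x, saved: {saved_min:.0f} min per cycle")
```

That works out to a speedup of roughly 4.7x, or about 59 minutes saved per cycle.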
What we have not measured. The actual scaling curve as a function of rank count - whether 10 is the sweet spot or whether 8 or 16 would be faster on the same domain. The decomposition shape (WRF's `nproc_x` and `nproc_y`), which is currently auto-picked. The effect of pinning ranks to specific cores. The interaction with hyperthreading on the worker's CPU. None of these are exotic measurements - any one of them is a few hours of benchmark time - but none of them have been done.
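To give a sense of how small the decomposition search space is, here is a sketch that enumerates the candidate `nproc_x` by `nproc_y` grids for a given rank count. It assumes the product must equal the number of compute ranks, which is how we understand WRF's requirement when both values are set explicitly. The function names are hypothetical, and "closest to square" is only a common heuristic; we are not claiming it matches what WRF's auto-pick does.

```python
from typing import List, Tuple


def decompositions(ranks: int) -> List[Tuple[int, int]]:
    """All (nproc_x, nproc_y) pairs whose product equals the rank count."""
    if ranks < 1:
        raise ValueError("rank count must be positive")
    return [(x, ranks // x) for x in range(1, ranks + 1) if ranks % x == 0]


def squarest(ranks: int) -> Tuple[int, int]:
    """The pair with the smallest |nproc_x - nproc_y| (first one on ties)."""
    return min(decompositions(ranks), key=lambda p: abs(p[0] - p[1]))


if __name__ == "__main__":
    for n in (8, 10, 12, 16):
        print(n, decompositions(n), "squarest:", squarest(n))
```

For the shipping 10 ranks there are only four candidates (1x10, 2x5, 5x2, 10x1), so an exhaustive comparison is a handful of runs rather than a research project.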
Why not. Because the trimmed-domain change took end-to-end wall-clock from roughly 75 minutes to roughly 16 minutes on the benchmark worker in one shot, and the next biggest lever (subset GRIB ingest, covered separately) trimmed another few minutes. Anything we could buy from MPI re-tuning is likely small compared to those two. The headroom on the current cycle is comfortable enough that detailed MPI work sits below higher-impact items on the roadmap.
What would change that. A 2 km UK domain would change it. At 8 times the cell count, the WRF wall-clock would balloon and MPI efficiency would matter much more. So the natural point to revisit MPI tuning is right before the 2 km work. By then we will hopefully be on a CPU with more memory channels, where the answers might be different anyway.
What we would do if we revisited it tomorrow. Three benchmarks. First, a sweep across rank counts (4, 6, 8, 10, 12, 16) on a fixed test cycle, measuring WRF wall-clock to find the actual optimum on this hardware. Second, a comparison of the auto-decomposition against forced decompositions (1xN, 2xN, NxN-like) to see whether the auto-pick is even close to optimal. Third, a quilting comparison with bigger output volumes than the current trimmed-output stream, to confirm that the no-quilt finding generalises rather than being an artifact of how little we currently write.
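The first of those benchmarks could be driven by something as small as the sketch below. It is not our actual tooling: the launch line `mpirun -np <ranks> ./wrf.exe` is an assumption about how WRF is started, the function names are hypothetical, and on a 12 vCPU worker the 16-rank case would probably need the launcher's oversubscription option, which we have left out rather than guess at. The runner is injectable so the sweep logic can be exercised without MPI installed.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable, List

RANK_COUNTS = (4, 6, 8, 10, 12, 16)  # the sweep proposed in the text


def wrf_command(ranks: int) -> List[str]:
    """Assumed launch line; adjust to the real launcher and working dir."""
    return ["mpirun", "-np", str(ranks), "./wrf.exe"]


def time_command(cmd: List[str]) -> float:
    """Run a command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def sweep(rank_counts: Iterable[int],
          run: Callable[[List[str]], float] = time_command,
          build: Callable[[int], List[str]] = wrf_command) -> Dict[int, float]:
    """Time one run per rank count and return {ranks: seconds}."""
    return {n: run(build(n)) for n in rank_counts}


def fastest(results: Dict[int, float]) -> int:
    """Rank count with the lowest measured wall-clock."""
    return min(results, key=results.get)


if __name__ == "__main__":
    results = sweep(RANK_COUNTS)
    for ranks, seconds in sorted(results.items()):
        print(f"{ranks:>2} ranks: {seconds / 60:.1f} min")
    print("fastest:", fastest(results))
```

A single run per rank count is the bare minimum; repeating each point a few times on the same fixed test cycle would show whether the differences are larger than run-to-run noise.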
Until then, the shipping config is what it is. Documented, deterministic, fast enough for the four-cycle-a-day schedule with comfortable headroom. The detailed MPI sweep waits for the next domain-resolution change to make it worthwhile.
