Optimisation · 7 min read

MPI decompositions, hyperthreads, and a lost week of compute

Getting WRF to run fast on a 12-vCPU box is a solved problem if you know the answer. We did not know the answer. Here is the week of wrong turns that eventually got the UK 4 km cycle down to ~33 minutes.

The WRF performance problem looks simple from a distance: you have a domain, you have a CPU, the CPU has cores, decompose the domain across the cores and run it. In practice every choice in that sentence has a sharp edge. MPI decomposition layout, thread count, whether to use hyperthreads, how I/O is configured, where the memory lives. We got enough of these wrong to lose a week of development time, and the reasons were not in any of the tutorials we were following.

The box is a 12-vCPU Hetzner dedicated server with 47 GiB of RAM. On paper, a comfortable WRF host for a modest UK domain. In practice, vCPU means hyperthreaded cores: 6 physical, 12 logical. This is the first trap.
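A quick way to see the physical/logical split before committing to a rank count, assuming a Linux host where the standard `nproc` and `lscpu` tools are available:

```shell
# Logical CPUs -- this is what "vCPU" counts, and what a naive -np uses.
nproc

# Physical topology: "Thread(s) per core: 2" means half the logical
# CPUs are hyperthread siblings sharing the same hardware.
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))'
```

If threads per core is 2, the useful MPI rank count is half what the vCPU count suggests.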

Attempt 1: 12 MPI ranks, one per vCPU. This is the obvious thing. It was also wrong. WRF runtime on the full UK 4 km cycle came in at 68 minutes. The cores were showing high utilisation, so we assumed this was just how long it took, and moved on to other things.

Attempt 2: looking at the decomposition. WRF defaults to an auto-picked MPI decomposition, and the auto-pick for 12 ranks is 4x3. For our specific domain geometry (tall and skinny, UK latitudes giving more north-south extent than east-west), 4x3 produced oddly-shaped tiles with high boundary-to-interior ratios. We forced 2x6 instead (6 ranks north-south, 2 east-west). Runtime dropped to 54 minutes. That was a 20 percent free win from one namelist change.
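The boundary-to-interior effect is easy to quantify. A sketch, using hypothetical grid dimensions as a stand-in for the real domain (the actual UK 4 km grid sizes are not shown here): each tile exchanges halo data proportional to its perimeter, while useful work is proportional to its area, so on a tall, skinny domain a tall, skinny decomposition produces squarer tiles and less exchange per unit of work.

```python
def halo_ratio(nx, ny, px, py):
    """Perimeter-to-area ratio of one tile when an nx (east-west) by
    ny (north-south) grid is split across a px-by-py rank layout."""
    tile_x, tile_y = nx / px, ny / py
    return 2 * (tile_x + tile_y) / (tile_x * tile_y)

# Hypothetical tall-and-skinny grid, stand-in for the real UK domain.
nx, ny = 160, 360

auto = halo_ratio(nx, ny, 4, 3)    # auto-picked layout: 40x120 tiles
forced = halo_ratio(nx, ny, 2, 6)  # forced layout: squarer 80x60 tiles

print(f"4x3: {auto:.4f}  2x6: {forced:.4f}")
# prints: 4x3: 0.0667  2x6: 0.0583
```

In the namelist, the forced layout goes in `&domains` as `nproc_x = 2, nproc_y = 6`.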

Attempt 3: hyperthreads. Reading more carefully around WRF performance, we learned that WRF's computation is memory-bandwidth bound, not compute bound. Hyperthread siblings share memory bandwidth and cache, so running two MPI ranks on one physical core does not speed things up: both ranks end up waiting on the same memory, and the shared cache often thrashes, making things slower. We dropped to 6 MPI ranks (one per physical core) in a 1x6 decomposition. Runtime dropped to 41 minutes. The hyperthreads had been costing us roughly 25 percent.
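The launch line also matters: with 6 ranks on 12 logical CPUs, the OS scheduler can still land two ranks on sibling hyperthreads unless they are pinned. A sketch of the launch, assuming Open MPI (MPICH uses different flag names):

```shell
# One rank per physical core, pinned so the scheduler cannot migrate
# two ranks onto sibling hyperthreads of the same core.
mpirun -np 6 --bind-to core --map-by core ./wrf.exe
```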

Attempt 4: OpenMP. WRF supports hybrid MPI+OpenMP: multiple processes, multiple threads per process. We tried 6 MPI ranks with 2 OMP threads each (12 threads total across the 12 vCPUs). Runtime went up to 48 minutes: at this domain size, the overhead of OpenMP synchronisation inside WRF exceeded the benefit of the extra threads. We went back to pure MPI.
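For reference, the hybrid launch looked roughly like this, assuming a WRF build compiled with the dm+sm option and Open MPI flag names:

```shell
# 6 ranks x 2 OpenMP threads = 12 threads total. Requires a dm+sm build.
export OMP_NUM_THREADS=2
mpirun -np 6 --bind-to core --map-by slot:PE=2 ./wrf.exe
```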

Attempt 5: I/O. Default WRF I/O serialises through rank 0, which was now a bottleneck on the smaller rank count. We enabled quilting (dedicated I/O ranks): 5 compute ranks, 1 I/O rank. Runtime: 37 minutes. Net win of 4 minutes because the compute no longer stalls waiting for disk.
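Quilting is configured in the namelist; a minimal sketch of our 5+1 arrangement, launched with a total of 6 ranks so the last one becomes the I/O server:

```
&namelist_quilt
 nio_tasks_per_group = 1,   ! one dedicated I/O rank per group
 nio_groups          = 1,   ! one group: 6 total ranks = 5 compute + 1 I/O
/
```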

Attempt 6: output pruning. WRF will write every field at every output interval, including dozens we do not need. We pruned the wrfout stream to only the fields the post-processing pipeline consumes (roughly 25 out of 120+). Runtime: 33 minutes 41 seconds. Another 3 to 4 minutes saved by not writing what we would not read.
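Pruning is driven by a field-list file referenced from `&time_control`; a sketch, where the filename is a placeholder rather than our actual file:

```
&time_control
 iofields_filename       = "pruned_fields.txt",  ! hypothetical filename
 ignore_iofields_warning = .true.,
/
```

Each line of the referenced file uses WRF's `-:h:0:FIELD_A,FIELD_B` syntax, meaning "remove these (here hypothetical) fields from history stream 0."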

That is the shipping configuration: 5 compute MPI ranks in a 1x5 layout, 1 I/O quilt rank, no OpenMP, pruned output. Roughly a 50 percent improvement over the naive starting point, achieved in six tuning passes across about a week. Most of the gain came from understanding that the platform was memory-bandwidth constrained, not core-count constrained, which is not how the WRF documentation frames the problem.

What would help further. A CPU with a larger last-level cache would keep more of the halo data cache-resident, which might buy another 10 percent. Faster NVMe (the box has NVMe, but not the latest generation) would trim I/O further. A CPU with more memory channels would help most of all, because the bottleneck is memory bandwidth, not core count. None of these are worth the cost at current scale.

The lesson: before you optimise anything, know which resource you are actually bounded on. We assumed compute. It was memory. The week of decomposition fiddling was useful but secondary to the hyperthread decision, which was purely a resource-understanding problem. Every WRF operator running on commodity hardware should profile the memory pattern before tuning decomposition.

Shipping config details live on the WRF model page. Runtime numbers are monitored per-cycle and posted on the status page.