Pruning the worker: how four cycles a day stay off a disk-full alert
Each WRF cycle leaves behind a timestamped working directory worth a few gigabytes. At four cycles a day, that fills the disk in a fortnight. Here is the small retention helper that keeps the live worker tidy without losing benchmark scratch.
Each Convek cycle produces a timestamped run directory under the worker's runtime root: WPS intermediates, the WRF working directory, post-processing artefacts, the published Convek bundle. End to end, a single 48-hour UK cycle leaves around two to three gigabytes on disk once you include the files we keep for inspection as well as the trimmed output we ship. At four cycles a day, every day, the disk fills in roughly two weeks.
There are two reasonable answers to this: don't write the intermediates at all (clean as you go), or write them and prune on a fixed retention rule. We picked the second. Keeping the last few cycles around is genuinely useful: when something looks wrong in the published artefacts, the first thing you want is the WRF working directory of the cycle that produced them, still on disk, still re-runnable. Cleaning eagerly throws that option away to save disk that costs almost nothing on the box we already pay for.
The retention rule that shipped is `prune_run_directories(root, keep=N, preserve_run_ids=...)`. It walks the runtime root, groups directories whose names match the timestamped run pattern (a prefix plus `YYYYMMDDTHHMMSSZ`) by their prefix, sorts each group newest-first by stamp, and deletes everything past the `keep` count. Anything that does not match the timestamped pattern is left alone: benchmark scratch folders, manually named experiments, anything an operator has parked there on purpose. That is the bit that took the most thought, and the bit most naive cleanup scripts get wrong.
The `preserve_run_ids` set is a small but important escape hatch. If a cycle is currently being inspected, or is referenced by an in-flight comparison, the scheduler can pass that run id in explicitly and the cleanup will skip it even if it would otherwise have aged out. That keeps the cleanup safe to run on a timer without coordinating with whatever else might be touching the same directory tree.
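To make the rule concrete, here is a minimal sketch of what a helper with this behaviour could look like. It is not the shipped code. The regular expression `RUN_DIR_RE`, the decision to treat a run id as the directory name, the skipping of symlinks, and the returned list of removed paths are all assumptions made for illustration.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: "<prefix>" followed by "YYYYMMDDTHHMMSSZ",
# for example "c00_20250101T000000Z". The real pattern may differ.
RUN_DIR_RE = re.compile(r"^(?P<prefix>.*?)(?P<stamp>\d{8}T\d{6}Z)$")


def prune_run_directories(root, keep=3, preserve_run_ids=()):
    """Keep the newest `keep` timestamped run directories per prefix.

    Directories whose names do not match the timestamped pattern are never
    touched. Returns the list of paths that were removed (an assumption;
    the real helper may return nothing).
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    preserved = set(preserve_run_ids)  # assumed to hold directory names

    groups = defaultdict(list)
    for entry in Path(root).iterdir():
        # Only real directories are candidates; symlinks are skipped here
        # as a precaution (an assumption, not documented behaviour).
        if entry.is_symlink() or not entry.is_dir():
            continue
        match = RUN_DIR_RE.match(entry.name)
        if match is None:
            continue  # benchmark scratch, named experiments: left alone
        groups[match["prefix"]].append((match["stamp"], entry))

    removed = []
    for runs in groups.values():
        # Fixed-width stamps sort chronologically as plain strings.
        runs.sort(key=lambda item: item[0], reverse=True)
        for _stamp, path in runs[keep:]:
            if path.name in preserved:
                continue  # explicitly protected, even though it aged out
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

In this sketch a preserved run that is still inside the `keep` window counts towards `keep`, while a preserved run that has aged out is simply skipped. Whether the shipped helper counts it the same way is not stated above.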
How it runs in practice. The scheduled cycle launcher invokes the retention helper just before each cycle starts, with `keep=3`, so the directory tree settles at around 12 run directories at steady state (three per cycle prefix, four cycle prefixes a day). Each run is a few gigabytes and the disk on the live worker is several hundred gigabytes, so steady-state usage sits comfortably under 10 percent without any operator involvement. There is no email when it runs, because there is no failure mode worth waking anyone for: if the cleanup itself fails, the next cycle just adds to the backlog and someone notices the disk-usage chart before anything actually breaks.
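The "prune first, never block the cycle" ordering can be sketched as a small wrapper. This is an illustration under stated assumptions, not the real launcher. The function name `launch_cycle`, the logger name, and passing the two steps in as callables are all hypothetical.

```python
import logging
from typing import Callable

log = logging.getLogger("cycle_launcher")  # hypothetical logger name


def launch_cycle(prune: Callable[[], object], run_cycle: Callable[[], object]) -> bool:
    """Run retention, then the cycle; a prune failure never blocks the cycle.

    Returns True if pruning succeeded and False if it failed and was skipped.
    """
    pruned_ok = True
    try:
        prune()
    except Exception:
        # Log and move on: the next cycle adds to the backlog, and the
        # disk-usage chart is the signal, not an alert.
        log.exception("retention failed; continuing with the cycle")
        pruned_ok = False
    run_cycle()
    return pruned_ok
```

A caller might bind the arguments with something like `functools.partial(prune_run_directories, root, keep=3)`, where `root` stands for the runtime root. Catching every exception is a deliberate choice in this sketch, so that a cleanup bug cannot stop a forecast from running.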
What the helper does not try to do. It does not inspect file ages within a run directory, only the directory's timestamped name. It does not touch non-timestamped scratch; that is left to the operator. It does not attempt to handle multi-host shared storage, because the live setup is a single worker. None of that is sophisticated. All of it is sufficient for the scale we run at, which is the right bar for an operational helper that needs to be boring and reliable rather than clever.
This is the kind of plumbing that does not make a forecast better and does not make the API faster, but is exactly what makes a four-cycle-a-day pipeline run unattended for weeks without an operator dropping into a shell. The interesting work happens in WRF, in the post-processing, and in the API. The boring work is what lets the interesting work keep running.
