The first piece of validation: a baseline vs candidate diff at named points
Every previous optimisation post says 'we'll measure this when validation lands'. The first piece of that pipeline shipped this month - a head-to-head diff between two cycle runs at named pilot sites. Here is what it does, what it does not, and why it was the right first step.
Almost every optimisation post on this blog ends with the same sentence: 'numbers will follow when the validation pipeline is in place'. That phrase has covered a Thompson microphysics comparison, a MYNN-2.5 PBL test, a mixed-layer LCL upgrade, an ERA5 soil moisture initialisation, a radiation-interval sweep, and several others. Validation has been the gating dependency for all of them. This post is about the first piece of that pipeline actually landing in the repo.
The first piece is deliberately small. It is a Python script, `compare_convek_runs.py`, that takes two completed cycle artefact bundles - a baseline and a candidate - and reports how the published fields differ at a list of named pilot sites. It does not score against observations. It does not produce a public dashboard. It is a head-to-head diff between two model runs, nothing more. That is the smallest useful thing the validation pipeline can do, and it had to ship before any of the bigger pieces had anywhere to plug in.
What it actually compares. For every site you pass in (latitude/longitude pairs, which you can pass more than once), the script finds the nearest grid cell in each run independently, loads the published `site.json` and `sounding.json` rows for that cell, and walks a fixed list of fields. Site-level: `wstar_ms`, `hglider_agl_m`, `cloudbase_agl_ft`, `surface_wind_speed_ms`, `surface_wind_dir_deg`, `boundary_layer_m`, `day_rating`. Sounding-level: `temperature_c`, `dewpoint_c`, `wind_speed_ms`, `wind_dir_deg`, `altitude_m`. For each field it counts how many forecast hours changed and reports the maximum absolute difference. The output is JSON, sorted, suitable for piping straight into a comparison report or into a future regression dataset.
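As a rough sketch of that loop - assuming a bundle layout of one directory per published cell and per-hour value lists inside `site.json`; all names and structure here are illustrative, not the script's actual internals:

```python
import json
from pathlib import Path

# Illustrative sketch of the per-site, per-field comparison described above.
# Assumes one directory per published cell named "lat_lon" and that each field
# in site.json is a list of hourly values; sounding.json would be walked the same way.

SITE_FIELDS = ["wstar_ms", "hglider_agl_m", "cloudbase_agl_ft",
               "surface_wind_speed_ms", "surface_wind_dir_deg",
               "boundary_layer_m", "day_rating"]

def nearest_cell(run_dir: Path, lat: float, lon: float) -> Path:
    """Pick the published cell directory closest to (lat, lon) by flat lat/lon distance."""
    def dist(cell: Path) -> float:
        clat, clon = map(float, cell.name.split("_"))  # assumed "lat_lon" directory naming
        return (clat - lat) ** 2 + (clon - lon) ** 2
    return min((p for p in run_dir.iterdir() if p.is_dir()), key=dist)

def diff_field(baseline_hours: list, candidate_hours: list) -> dict:
    """Count changed forecast hours and report the worst-case absolute difference."""
    changed, max_abs = 0, 0.0
    for b, c in zip(baseline_hours, candidate_hours):
        if b != c:
            changed += 1
            if isinstance(b, (int, float)) and isinstance(c, (int, float)):
                max_abs = max(max_abs, abs(b - c))  # strings (e.g. day_rating) stay binary
    return {"hours_changed": changed, "max_abs_diff": max_abs}

def compare_site(baseline: Path, candidate: Path, lat: float, lon: float) -> dict:
    b_cell = nearest_cell(baseline, lat, lon)   # nearest cell looked up
    c_cell = nearest_cell(candidate, lat, lon)  # independently in each run
    b = json.loads((b_cell / "site.json").read_text())
    c = json.loads((c_cell / "site.json").read_text())
    return {f: diff_field(b[f], c[f]) for f in SITE_FIELDS}
```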
Why this shape. Three reasons it is the smallest useful first step. First, it forces a clean baseline. Before the helper existed, comparing two runs meant eyeballing two folders' worth of JSON, which scales badly the moment you want to compare more than one site. Second, it surfaces a category of bug that pure unit tests miss: a config change that compiles, runs, and produces apparently sensible output, but quietly shifts every cloudbase value by 200 ft. The diff catches that immediately. Third, it is observation-free, which means it works today, on any pair of runs, without a parallel ingest pipeline for radiosondes or pilot data.
What it deliberately does not do. It does not tell you whether the candidate is more correct than the baseline. Run-to-run agreement is not validation in the textbook sense - both runs could be wrong in identical ways and the diff would say everything is fine. The textbook version, scoring against radiosondes and against XContest activity, is the next layer up the stack and is described in the sounding and XContest validation post. The diff harness is the layer underneath: the regression suite that lets you know your config change did what you expected before you go to the expense of scoring it against observations.
How it slots in. The expected workflow is: run the live config on a chosen cycle, run the candidate config on the same cycle, run the diff at a list of pilot sites that exercise different parts of the domain (Long Mynd, Lasham, Sutton Bank, Aboyne, Talgarth, Bidford), and read the JSON. If the fields move far less than the candidate change should produce, the candidate is not doing what you intended and you stop there. If a field changes wildly more than expected, the candidate has a bug and you debug. Only once the diff looks plausible is it worth burning the cycles to run a longer parallel campaign.
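Read mechanically, that triage step might look something like the snippet below. The report layout, the chosen fields, and the rough expected magnitudes are assumptions for the sketch, not the script's documented schema:

```python
import json
from pathlib import Path

# Illustrative triage pass over the diff report. Assumes the report is shaped
# {site: {field: {"hours_changed": ..., "max_abs_diff": ...}}}; thresholds are
# placeholders you would set per candidate change.

EXPECTED_SHIFT = {"cloudbase_agl_ft": 300.0, "wstar_ms": 0.5}  # rough expected magnitudes

def triage(report_path: str) -> str:
    report = json.loads(Path(report_path).read_text())
    any_change = False
    for site, fields in report.items():
        for field, stats in fields.items():
            if stats["hours_changed"] == 0:
                continue
            any_change = True
            expected = EXPECTED_SHIFT.get(field)
            if expected is not None and stats["max_abs_diff"] > 10 * expected:
                return f"suspicious: {site}/{field} moved by {stats['max_abs_diff']}"
    return "changes look plausible" if any_change else "candidate appears to be a no-op"

print(triage("diff_report.json"))  # hypothetical output path
```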
Some honest limits. The nearest-cell lookup uses a flat lat/lon distance, which is fine over a UK domain but would need replacing with a proper great-circle distance for larger domains. The diff treats string values (like `day_rating` labels) as a binary changed/not-changed, which is conservative but loses information about which way the rating shifted. The `max_abs_diff` summary is a worst-case statistic, not a distribution - if you want RMSE or a histogram you have to compute it from the underlying values rather than the summary. None of these are blocking for what the helper is for, but each is a place future versions can be more ambitious.
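For reference, the great-circle replacement is a one-function change; the haversine form below is the standard formula rather than code from the repo:

```python
import math

def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance in kilometres - a drop-in replacement for
    the flat lat/lon distance if the domain ever grows beyond the UK."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```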
What lands next. The 12Z radiosonde stream against Larkhill and Herstmonceux, with the diff harness's nearest-cell logic generalised to nearest-radiosonde-site. That is the first observation-grounded score the pipeline can produce. It depends on the format and ingest of the radiosonde data, which is the actual engineering bottleneck, not the comparison logic. The diff harness is what we hang that work on, not the work itself - but you have to have the hanging point before you can start.
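The generalisation itself is essentially a small registry plus the lookup the harness already has. A minimal sketch, with approximate coordinates and hypothetical names:

```python
# Sketch only: generalising the nearest-cell lookup from pilot sites to
# radiosonde launch sites. Coordinates are approximate; function names are guesses.
RADIOSONDE_SITES = {
    "Larkhill":     (51.20, -1.80),
    "Herstmonceux": (50.89, 0.32),
}

def nearest_cell_for(lat: float, lon: float, cells: list[tuple[float, float]]) -> tuple[float, float]:
    """Same flat lat/lon distance the pilot-site lookup uses today."""
    return min(cells, key=lambda c: (c[0] - lat) ** 2 + (c[1] - lon) ** 2)

def radiosonde_cells(cells: list[tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Map each radiosonde site to the published grid cell nearest to it."""
    return {name: nearest_cell_for(lat, lon, cells)
            for name, (lat, lon) in RADIOSONDE_SITES.items()}
```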
If you are an integrator wondering when Convek will start quoting bias numbers in marketing material, this post describes the prerequisite for that answer. The repository now has a regression baseline. The observation-grounded validation work that builds on it is the next visible step. When the first sounding-grounded numbers land, they will be in the validation post, and they will be honest about what the harness was set up to score and what it cannot.
