Validating against soundings and XContest traces
A forecast is not useful unless it is right. We verify Convek against radiosonde soundings and XContest flight activity. Here is the validation setup and what it says about where we are weak.
Every field in the API carries an implicit claim of accuracy. If `day_rating` says `good`, pilots who go flying on the strength of it should get a genuinely good-day experience. If `hglider_agl_m` says 1800 m, climbs should top out near 1800 m. None of this is automatic: forecasts can be self-consistent and entirely wrong. Validation is the pipeline that checks the model against reality, and it is as much a permanent piece of infrastructure as the model run itself.
Two independent validation sources are in use. Radiosonde soundings give us the meteorological truth: temperature, humidity, and wind as a function of height, twice a day from a handful of sites in the UK. XContest gives us the pilot truth: what people actually flew, where, and for how long, on the days the forecast covered.
The sounding comparison runs automatically after every 12Z cycle lands. For each of Larkhill, Herstmonceux, and Lerwick (when available), we extract the forecast sounding at the radiosonde site from the just-completed run, time-matched to the 12Z observation. Two quantities get scored: boundary layer height (against the well-mixed layer top in the observed profile) and LCL (against the condensation level in the observed profile). Both are written to a validation log that feeds a Grafana dashboard.
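The per-site scoring step can be sketched as follows. This is an illustrative sketch, not the real Convek pipeline: the function name, record layout, and the sample values are all invented, and it assumes the forecast and observed BL height and LCL have already been extracted from the model run and the 12Z radiosonde profile.

```python
from datetime import datetime, timezone

def score_site(site, forecast, observed):
    """Return an error record for BL height and LCL at one site.

    forecast / observed: dicts with 'bl_height_m' and 'lcl_m' keys.
    Error convention: forecast minus observed, so positive = overprediction.
    (Hypothetical layout; the real validation log schema may differ.)
    """
    return {
        "site": site,
        "valid_time": datetime(2026, 3, 1, 12, tzinfo=timezone.utc).isoformat(),
        "bl_height_error_m": forecast["bl_height_m"] - observed["bl_height_m"],
        "lcl_error_m": forecast["lcl_m"] - observed["lcl_m"],
    }

# Example: one matched 12Z pair for Larkhill (made-up numbers).
record = score_site(
    "Larkhill",
    forecast={"bl_height_m": 1650.0, "lcl_m": 1400.0},
    observed={"bl_height_m": 1580.0, "lcl_m": 1460.0},
)
print(record["bl_height_error_m"])  # 70.0  (forecast BL too deep)
print(record["lcl_error_m"])        # -60.0 (forecast cloudbase too low)
```

One record like this per site per cycle is what the validation log accumulates; the dashboard statistics are computed over the accumulated records.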
Current status of the sounding verification: across Q1 2026, forecast BL height at Larkhill showed a +80 m mean bias with an RMS error of 260 m, and forecast LCL at the same site a -120 ft mean bias with an RMS error of 340 ft. Herstmonceux is similar; Lerwick is slightly worse (more variable maritime airmasses, smaller validation set). These numbers are competitive with published figures for operational WRF setups at similar resolution. They are also comfortably inside the noise a pilot experiences day-to-day, which is the level that matters.
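For clarity on what those two statistics mean, here is a minimal sketch of mean bias and RMS error over a season of matched forecast-minus-observation differences. The sample errors below are made up for illustration, not real Q1 2026 data.

```python
import math

def bias_and_rms(errors):
    """errors: list of (forecast - observed) differences, one per matched pair.

    Bias is the mean signed error (cancellation allowed, so it measures
    systematic over/underprediction); RMS penalises scatter regardless of sign.
    """
    n = len(errors)
    bias = sum(errors) / n
    rms = math.sqrt(sum(e * e for e in errors) / n)
    return bias, rms

# Invented BL-height errors in metres for five verification days.
errors_m = [120.0, -40.0, 250.0, -90.0, 160.0]
bias, rms = bias_and_rms(errors_m)
print(round(bias, 1), round(rms, 1))  # 80.0 149.8
```

Note that RMS is always at least as large as the absolute bias: a small bias with a large RMS means the errors are large but cancel, which is a very different failure mode from a consistent offset.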
The XContest comparison is more indirect. We cannot verify `wstar_ms` directly from flight traces (flight vario data is not exposed in XContest). What we can do is correlate `day_rating` with flying activity: on `excellent` days, do people fly big flights? On `poor` days, does anyone fly at all? We run a rolling correlation between forecast `day_rating` (sampled at known launch sites) and total XContest distance flown from those sites on the same day.
Current correlation: 0.74 Spearman rank correlation across the 2025 summer season, which means the ordering of days by flying activity largely agrees with the ordering by `day_rating`. A correlation of 1.0 would mean perfect agreement, which is not achievable given that weather alone does not determine whether people go out (weekend versus weekday matters, visibility matters, how tired pilots are from last weekend matters). 0.74 is solid. The upper bound given the other sources of variance is probably around 0.85.
Where we are weak. Cross-checking the two validation streams shows two systematic problems. First: we underpredict cloudbase on high-pressure days with clean arriving airmasses. The correction in the post-processing helps but does not fully remove the bias. This is a microphysics issue we are not sure how to fix without going to a more expensive scheme. Second: we overpredict convective onset on stable-overnight days. The model breaks the inversion 20 to 30 minutes early on those days, a known feature of non-local PBL schemes. We are comfortable with this as long as users know about it. Hence posts like this.
Where the validation is blind. We do not directly verify `wstar_ms` against instrument data. There is a long-standing research programme to fly paragliders with high-quality vario loggers specifically for this purpose, but we have not committed to building that comparison yet. If any instrument maker integrating Convek wants to ship anonymised vario data back, we would be very interested. It is the single best data source not currently in our verification pipeline.
Validation does not mean perfect. It means knowing where the forecast is trustworthy and where it is not, and telling users honestly. A model whose weaknesses are documented is more useful than a model whose weaknesses are hidden behind a clean API, because users can calibrate their own usage. The intention is for every future optimisation post to cite validation data rather than intuition.
A public validation page is in development. Until then, if you are integrating Convek into a pilot-facing product and want to see the raw verification numbers, email us and we will share them; see the contact page.