When the Standard Demands Something the Test Cannot Deliver

15 June 2026

By Michael Mannion

Testing can never feasibly deliver the failure-rate statistics that an international standard like ISO 26262 demands. This article explains why, why testing is still necessary, and why collecting data must not end in the lab.

A stern executive in a suit patterned with spreadsheet figures, a speech bubble declaring: “I accept no more than 1 failure per 11'400 years!” (ISO 26262, ASIL D: 10⁻⁸ / h). He's not wrong.

A safety standard opens with a number that looks innocent: a dangerous-failure rate on the order of 10⁻⁸ per hour, or one failure in a hundred million hours of operation. It is the bar a high-integrity automotive function is asked to clear. It reads like an engineering target concocted in a CEO’s fever dream. It is, in fact, a statement about the limits of human knowledge.

And the bar is not plucked from the air. Behind the wheel, real life kills at a rate of roughly one fatality per million hours of driving — about 10⁻⁶ per hour. A modern car runs dozens of safety-critical electronic functions at once, and most crashes trace to human error, not technical faults; so for electronics and software not to worsen the odds a driver already accepts, each function is apportioned a fraction of that budget — two orders of magnitude below the field rate, at 10⁻⁸ per hour. The number isn’t a fantasy. It is the road’s own fatality rate, divided up.

Ask what it costs to demonstrate that bar by experiment, however, and a statistical rule of thumb — the “rule of three” — hits hard. Run N trials, observe no failure, and the most you may claim — at ninety-five per cent confidence — is a failure rate of roughly 3 ÷ N. Invert it: to show a rate of 10⁻⁸ you need three hundred million flawless trials. You read that right: Three. Hundred. Million.
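The arithmetic behind the rule of three can be sketched in a few lines of Python. With zero failures in N independent trials, the exact 95 per cent upper bound on the per-trial failure probability p solves (1 − p)^N = 0.05, and because −ln(0.05) is almost exactly 3, the bound is close to 3 ÷ N. The function names are illustrative, and treating each trial as one independent hour of operation is an assumption made for the example.

```python
import math

def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-trial failure probability after n_trials
    independent trials with no failure observed."""
    alpha = 1.0 - confidence
    # Solve (1 - p)^n = alpha for p; expm1 keeps precision when p is tiny.
    return -math.expm1(math.log(alpha) / n_trials)

def trials_needed(target_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of failure-free trials whose bound reaches target_rate."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log1p(-target_rate))

if __name__ == "__main__":
    n = 300_000_000
    print(f"exact bound after {n:,} clean trials: {zero_failure_upper_bound(n):.3e}")
    print(f"rule-of-three approximation:          {3 / n:.3e}")
    print(f"clean trials needed for 1e-8:         {trials_needed(1e-8):,}")
```

The exact requirement comes out just under three hundred million, which is why the rule of thumb is a safe shorthand at this scale.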

In the currency of the road, that is tens of thousands of years of incident-free driving, or the hundreds of millions of miles the autonomy world keeps quietly rediscovering. And that is the generous version, where nothing ever goes wrong. To measure such a rate, that is, to see failures, count them and get a handle on the actual frequency, you would need ten times more.
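To put those figures in road terms, the sketch below converts operating hours to years and, for the measuring case, estimates how tightly a rate can be pinned down once failures are actually observed. It assumes failures arrive as a Poisson process at a constant rate, so the relative standard error of the estimate is about 1/√k for k expected failures. The constants and function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # continuous operation, one vehicle

def hours_to_years(hours: float) -> float:
    """Convert operating hours to years of round-the-clock operation."""
    return hours / HOURS_PER_YEAR

def expected_failures(rate_per_hour: float, hours: float) -> float:
    """Expected failure count for a constant-rate (Poisson) process."""
    return rate_per_hour * hours

def relative_standard_error(expected_count: float) -> float:
    """Approximate relative standard error of a rate estimated from a
    Poisson count with the given expectation."""
    return 1.0 / math.sqrt(expected_count)

if __name__ == "__main__":
    demonstrate_hours = 3e8                  # zero-failure demonstration
    measure_hours = 10 * demonstrate_hours   # "ten times more"
    k = expected_failures(1e-8, measure_hours)
    print(f"demonstration: {hours_to_years(demonstrate_hours):,.0f} years")
    print(f"measurement:   {hours_to_years(measure_hours):,.0f} years")
    print(f"expected failures while measuring: {k:.0f}")
    print(f"relative standard error: {relative_standard_error(k):.0%}")
```

Under these assumptions, ten times the exposure yields about thirty expected failures at a true rate of 10⁻⁸ per hour, enough to estimate the rate to within roughly 18 per cent.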

So the number the standard demands lies far beyond the reach of testing. Testing is induction: you observe the finite and infer about the open, unobserved future, and the rule of three prices that inference to the decimal. Probabilistic testing has a calculable reach, and it stops orders of magnitude short of where the standard points. This is not a flaw to be engineered away. It is the edge of what experiment can do.

And beyond that edge lies something worse than a statistical wall. SOTIF (ISO 21448, Safety Of The Intended Functionality) gives it a name: the unknown unknowns — the triggering conditions and failure modes no one thought to put in the catalogue, much less test for. No confidence interval touches a failure you never imagined; more samples of the scenarios you did foresee buy you nothing against the ones you didn’t. Which is why field testing is not optional padding — functional safety gives the operation phase its own part of the standard. It is the only place the missing hundred million hours can ever accumulate — across a fleet, across years — where the unimaginable eventually reveals itself.

So what is probabilistic testing good for if it cannot hope to deliver the fidelity the standard demands? This is the point the cynic misses. Field data, on its own, is anecdote. A near-miss in fog means nothing until you can say whether it is drift or merely weather, and you can only say that against a baseline the upfront experiments have built. The original probabilistic testing fixes the reference, draws the line between signal and noise, and clears away the common, measurable failures before they ship. The road, then, will seek out the rare residual. Omit probabilistic testing and the field stops being evidence, and reduces instead to a body count.

Of course, the software engineer knows this. Unit tests and code review catch what they can, cheaply; observability watches production for the rest, and no one calls the tests pointless because bugs still reach users. Phased drug trials catch the gross and the common; pharmacovigilance waits for the rare and the slow, and no one calls clinical trials pointless because such monitoring exists.

And this is no longer merely prudent; it is mandated, and not only on the road. Wherever a system’s behaviour is a distribution rather than a fact, the standards have reached the same verdict: road vehicles undergo mandatory market surveillance of the in-service fleet under EU type-approval law, medical devices carry an obligatory post-market surveillance process under the EU MDR, and high-risk AI a required post-market monitoring plan — the field half written into law. Bound what you can, monitor what you cannot, and — above all — know which is which.

Certainty was never an option. Every safety claim ever made was a leap of faith to some degree. Probabilistic testing does not shorten the leap — it turns on the lights. The thing distinguishing an engineer from a gambler is that the former can see the edge they leap from.

feotest is an open-source probabilistic testing framework for Rust, designed with automotive and medical domains in mind. It is built for the reachable half of the discipline this article describes — a measure → verify loop that fixes a statistically defined baseline, reports confidence-bounded floors instead of point scores, conditions every claim on its operating domain, and turns each release into a drift check against that baseline. It bounds what you can measure and gives the field the reference it needs to hunt the rest — and it is honest about the difference.
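As an illustration of what a confidence-bounded floor means in practice, here is a minimal Python sketch. It is not feotest's API, and the article does not say which interval feotest uses; the one-sided Wilson score bound is swapped in here as an assumption, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def wilson_lower_bound(successes: int, trials: int,
                       confidence: float = 0.95) -> float:
    """One-sided Wilson score lower bound on a success probability.

    Illustrative only: the choice of the Wilson interval is an
    assumption, not something the frameworks are known to use.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    z = NormalDist().inv_cdf(confidence)  # one-sided normal quantile
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = p_hat + z * z / (2.0 * trials)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / trials
                           + z * z / (4.0 * trials * trials))
    return (centre - margin) / denom

if __name__ == "__main__":
    # Hypothetical lab run: 990 passes out of 1000 trials.
    print(f"point score:      {990 / 1000:.4f}")
    print(f"95% floor:        {wilson_lower_bound(990, 1000):.4f}")
    # A perfect run still yields a floor below 1.
    print(f"perfect-run floor: {wilson_lower_bound(1000, 1000):.4f}")
```

The floor always sits below the point score and rises toward it only as the sample grows, which is the sense in which a bound is more honest than a single number.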

And the loop need not end in the lab. Both feotest and its Java sibling punit carry a sentinel: a lightweight runtime agent that re-evaluates the very same distributional contract against the live system, on whatever cadence operations choose, and emits its verdicts to a log or a webhook. It is the same test the experiments ran, now pointed at production: the committed baseline is the reference each run is judged against, so when the live system falls below it the verdict flips — and you learn of the regression against a known line, not by surprise. That is precisely the half of the discipline the lab can never reach.
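The idea of judging a live window against a committed baseline can be sketched as a one-sided exact binomial test. This is a hedged illustration, not the sentinel's actual decision rule, which the article does not specify; `sentinel_verdict`, the baseline value and the window counts are all hypothetical.

```python
import math

def _log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binomial_lower_tail(successes: int, trials: int, p0: float) -> float:
    """P(X <= successes) for X ~ Binomial(trials, p0), with 0 < p0 < 1."""
    return sum(math.exp(_log_binom_pmf(k, trials, p0))
               for k in range(successes + 1))

def sentinel_verdict(successes: int, trials: int,
                     baseline: float, alpha: float = 0.05) -> str:
    """Return FAIL when the live window is improbably bad for a system
    that still meets the committed baseline success rate."""
    p_value = binomial_lower_tail(successes, trials, baseline)
    return "FAIL" if p_value < alpha else "PASS"

if __name__ == "__main__":
    baseline = 0.98  # hypothetical committed baseline success rate
    print(sentinel_verdict(495, 500, baseline))  # window above baseline
    print(sentinel_verdict(475, 500, baseline))  # window well below it
```

A window that merely wobbles around the baseline passes; only a shortfall too large to attribute to sampling noise flips the verdict, which is the distinction between drift and weather drawn earlier.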

Learn more at mavai.org/projects/feotest.

Sources

Source	Relevance
ISO 26262-5:2018 — Road vehicles: Functional safety, Part 5 (Hardware)	Defines the random-hardware-failure target for the highest integrity level, ASIL D: ≤10⁻⁸ dangerous failures per hour — the number the essay opens on.
Shalev-Shwartz, Shammah & Shashua, On a Formal Model of Safe and Scalable Self-Driving Cars (2017)	States the real-world figure the target is anchored to: ~10⁻⁶ fatalities per hour of human driving.
ISO 21448:2022 — Road vehicles: Safety of the intended functionality (SOTIF)	Names the residual beyond statistics — functional insufficiencies and the unknown unknowns — and (Clause 13) makes field monitoring an operation-phase activity.
ISO 26262-7:2018 — Functional safety, Part 7 (Production, operation, service)	Gives the operation and field phase its own part of the functional-safety standard.
Regulation (EU) 2018/858 — Approval and market surveillance of motor vehicles	EU type-approval law: each Member State must run a market-surveillance authority that re-tests in-service vehicles — the automotive field half written into law.
ISO/TR 20416:2020 — Medical devices: Post-market surveillance for manufacturers	The medical-device counterpart: a mandated, systematic process for monitoring performance in the field (per EU MDR 2017/745).
EU AI Act, Article 72: Post-market monitoring	Requires providers of high-risk AI systems to monitor real-world performance across the system’s lifetime — the field half written into law.
feotest	Open-source, Rust-native probabilistic-testing framework: the measure → verify loop for the reachable, lab half of the discipline.
punit	The Java sibling framework; like feotest it carries a sentinel for continuous, in-the-field drift detection against a committed baseline.