Testing can never feasibly deliver the failure-rate statistics that an international standard like ISO 26262 demands. This article explains why, why testing is still necessary, and why collecting data must not end in the lab.

A safety standard opens with a number that looks innocent: a dangerous-failure rate on the order of 10⁻⁸ — one in a hundred million hours of operation. It is the bar a high-integrity automotive function is asked to clear. It reads like an engineering target concocted in a CEO’s fever dream. It is, in fact, a statement about the limits of human knowledge.
And the bar is not plucked from the air. Behind the wheel, real life kills at a rate of roughly one fatality per million hours of driving — about 10⁻⁶ per hour. A modern car runs dozens of safety-critical electronic functions at once, and most crashes trace to human error, not technical faults; so for electronics and software not to worsen the odds a driver already accepts, each function is apportioned a fraction of that budget — two orders of magnitude below the field rate, at 10⁻⁸ per hour. The number isn’t a fantasy. It is the road’s own fatality rate, divided up.
Ask what it costs to demonstrate that bar by experiment, however, and a statistical rule of thumb — the “rule of three” — hits hard. Run N trials, observe no failure, and the most you may claim — at ninety-five per cent confidence — is a failure rate of roughly 3 ÷ N. Invert it: to show a rate of 10⁻⁸ you need three hundred million flawless trials. You read that right: Three. Hundred. Million.
In the currency of the road, that is tens of thousands of years of incident-free driving, or the hundreds of millions of miles the autonomy world keeps quietly rediscovering. And that is the generous version, where nothing ever goes wrong. To measure such a rate (to see failures and count them, i.e. to get a handle on the actual frequency) you would need ten times more.
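
The arithmetic behind the rule of three can be checked in a few lines. What follows is a minimal sketch, not part of any standard: it assumes independent trials, treats each hour of operation as one trial, and converts hours to years as if a single vehicle ran around the clock. The function names are invented for illustration, and the Poisson model used for the "ten times more" figure is likewise an assumption.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # assumption: one vehicle operating continuously


def zero_failure_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Exact upper bound on the per-trial failure probability after
    `trials` independent trials with zero observed failures."""
    # Solve (1 - p) ** trials = 1 - confidence for p.
    # expm1 keeps precision when p is tiny.
    return -math.expm1(math.log(1.0 - confidence) / trials)


def trials_needed(target_rate: float, confidence: float = 0.95) -> int:
    """Failure-free trials required before the bound drops to `target_rate`."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-target_rate))


def relative_std_error(rate: float, exposure: float) -> float:
    """Relative standard error of a rate estimated by counting failures,
    assuming failures arrive as a Poisson process (an assumption)."""
    expected_failures = rate * exposure
    return 1.0 / math.sqrt(expected_failures)


if __name__ == "__main__":
    n = trials_needed(1e-8)
    print(f"N=100: rule of three {3 / 100:.4f}, exact {zero_failure_upper_bound(100):.4f}")
    print(f"failure-free hours needed for 1e-8: {n:,}")
    print(f"as continuous operation: {n / HOURS_PER_YEAR:,.0f} years")
    print(f"relative error at 10x exposure: {relative_std_error(1e-8, 10 * n):.0%}")
```

Under these assumptions the exact requirement lands just under three hundred million failure-free hours, roughly 34,000 years of continuous operation. Ten times that exposure would be expected to show about thirty failures, enough to pin the rate down to within roughly 18 per cent.
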
So the number the standard demands lies far beyond the reach of testing. Testing is induction: you observe the finite and infer about the open, unobserved future, and the rule of three prices that inference to the decimal. Probabilistic testing has a calculable reach, and it stops orders of magnitude short of where the standard points. This is not a flaw to be engineered away. It is the edge of what experiment can do.
And beyond that edge lies something worse than a statistical wall. SOTIF (ISO 21448, Safety Of The Intended Functionality) gives it a name: the unknown unknowns — the triggering conditions and failure modes no one thought to put in the catalogue, much less test for. No confidence interval touches a failure you never imagined; more samples of the scenarios you did foresee buy you nothing against the ones you didn’t. Which is why field testing is not optional padding — functional safety gives the operation phase its own part of the standard. It is the only place the missing hundred million hours can ever accumulate — across a fleet, across years — where the unimaginable eventually reveals itself.
So what is probabilistic testing good for if it cannot hope to deliver the fidelity the standard demands? This is the point the cynic misses. Field data, on its own, is anecdote. A near-miss in fog means nothing until you can say whether it is drift or merely weather, and you can only say that against a baseline the upfront experiments have built. The original probabilistic testing fixes the reference, draws the line between signal and noise, and clears away the common, measurable failures before they ship. The road, then, will seek out the rare residual. Omit probabilistic testing and the field stops being evidence, and reduces instead to a body count.
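
The role of the baseline can be made concrete with a small sketch. It assumes the lab experiments have produced a baseline failure rate and asks how surprising a batch of field observations would be if that rate still held. The names `binomial_tail` and `is_drift`, the threshold `alpha` and the numbers in the example are all hypothetical; this is a plain exact binomial test, not the API of feotest or punit.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def is_drift(baseline_rate: float, field_trials: int,
             field_failures: int, alpha: float = 0.01) -> bool:
    """Flag drift when the field failure count would be this high
    with probability below `alpha` under the lab baseline."""
    return binomial_tail(field_trials, field_failures, baseline_rate) < alpha


if __name__ == "__main__":
    # Hypothetical baseline: 2% failure rate measured in the lab,
    # so about 10 failures are expected in 500 field trials.
    print(is_drift(0.02, 500, 12))  # within ordinary variation: weather
    print(is_drift(0.02, 500, 25))  # far beyond the baseline: drift
```

Without the baseline rate there is nothing to compare the field count against, which is the sense in which field data alone is anecdote.
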
Of course, the software engineer knows this. Unit tests and code review catch what they can, cheaply; observability watches production for the rest, and no one calls the tests pointless because bugs still reach users. Medicine knows it too: phased drug trials catch the gross and the common; pharmacovigilance waits for the rare and the slow, and no one calls clinical trials pointless because such monitoring exists.
And this is no longer merely prudent; it is mandated, and not only on the road. Wherever a system’s behaviour is a distribution rather than a fact, the standards and regulations have reached the same verdict: road vehicles undergo mandatory market surveillance of the in-service fleet under EU type-approval law, medical devices carry an obligatory post-market surveillance process under the EU MDR, and high-risk AI systems require a post-market monitoring plan under the EU AI Act. The field half is written into law. Bound what you can, monitor what you cannot, and, above all, know which is which.
Certainty was never an option. Every safety claim ever made was a leap of faith to some degree. Probabilistic testing does not shorten the leap — it turns on the lights. The thing distinguishing an engineer from a gambler is that the former can see the edge they leap from.
Sources
| Source | Relevance |
|---|---|
| ISO 26262-5:2018 — Road vehicles: Functional safety, Part 5 (Hardware) | Defines the random-hardware-failure target for the highest integrity level, ASIL D: ≤10⁻⁸ dangerous failures per hour — the number the essay opens on. |
| Shalev-Shwartz, Shammah & Shashua, On a Formal Model of Safe and Scalable Self-Driving Cars (2017) | States the real-world figure the target is anchored to: ~10⁻⁶ fatalities per hour of human driving. |
| ISO 21448:2022 — Road vehicles: Safety of the intended functionality (SOTIF) | Names the residual beyond statistics — functional insufficiencies and the unknown unknowns — and (Clause 13) makes field monitoring an operation-phase activity. |
| ISO 26262-7:2018 — Functional safety, Part 7 (Production, operation, service) | Gives the operation and field phase its own part of the functional-safety standard. |
| Regulation (EU) 2018/858 — Approval and market surveillance of motor vehicles | EU type-approval law: each Member State must run a market-surveillance authority that re-tests in-service vehicles — the automotive field half written into law. |
| ISO/TR 20416:2020 — Medical devices: Post-market surveillance for manufacturers | The medical-device counterpart: a mandated, systematic process for monitoring performance in the field (per EU MDR 2017/745). |
| EU AI Act, Article 72: Post-market monitoring | Requires providers of high-risk AI systems to monitor real-world performance across the system’s lifetime — the field half written into law. |
| feotest | Open-source, Rust-native probabilistic-testing framework: the measure → verify loop for the reachable, lab half of the discipline. |
| punit | The Java sibling framework; like feotest it carries a sentinel for continuous, in-the-field drift detection against a committed baseline. |
© 2026 Michael Mannion. Licensed under CC BY 4.0 — share and adapt freely with attribution.
