A new aircraft does not get cleared to fly because its blueprint is internally consistent. It gets cleared because someone strapped it to a rig, ran it through scripted abuse, instrumented every joint, and — after hours and hours of tests — produced evidence that the design survives a controlled approximation of the real world. That ritual has a name in aerospace and motorsport — shakedown — and it exists because the cost of guessing is paid in catastrophic failures, in human lives, and in the heavy material losses that follow.
Software shipped to production has the same shape of decision and almost none of the same ritual. We tag, we deploy, we wait. Go or no-go.
Here is an uncomfortable observation. As an industry, we automated the cheap half of trust. Types check. Schemas validate. Contracts get signed off in OpenAPI. Unit tests are green. Coverage is up and to the right. Every one of those is a statement about the symbolic world of the codebase — the world where our abstractions agree with each other. And that world is, by construction, the world that is easy to certify automatically.
Symbols can sprawl into endless glosses; evidence stays finite — the same map-versus-ground tension as in The Relative Trap. Here the focus is narrower: what we automated in the pipeline, and what still has to meet reality at the go/no-go moment.
The expensive half — proving the system behaves the way we intend when wired to real infrastructure, real sequences, real faults, real timing — we still answer with a vibe.
We answer it in war rooms after the fact.
We answer it with a senior engineer who has “a feeling” about the merge.
We answer it with a staging environment that nobody fully trusts, and a rollback plan we hope we don’t have to use.
We compensate for all of it by not deploying on Fridays.
The lesson is narrower and harder than “test more.”
That mission-control go/no-go call belongs to a category of decision the toolchain has not yet earned the right to make.
Pipelines today can tell you the code is consistent with itself. They cannot tell you, with reproducible evidence, that the system survives the part of reality you actually care about — the broker that stalls, the migration that races the seed, the consumer that lags, the dependency that flaps.
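The first of those faults — the broker that stalls — is the kind of thing that can be scripted and replayed rather than discovered in production. A minimal sketch, using only the standard library; the names (`start_stalling_broker`, `fetch_with_timeout`) and the pass criterion are illustrative assumptions, not anything the text prescribes:

```python
import socket
import threading
import time

def start_stalling_broker(host="127.0.0.1"):
    """Fake broker that accepts a connection and then goes silent,
    simulating a stalled dependency. Returns the bound port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def run():
        conn, _ = srv.accept()
        time.sleep(60)  # stall: hold the connection open, send nothing
        conn.close()

    threading.Thread(target=run, daemon=True).start()
    return port

def fetch_with_timeout(port, timeout=0.2):
    """Client under test: must give up within `timeout` instead of hanging."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
        s.settimeout(timeout)
        s.sendall(b"PING\n")
        try:
            return s.recv(1024)
        except socket.timeout:
            return None  # degraded but bounded: caller can retry or fail fast

port = start_stalling_broker()
start = time.monotonic()
result = fetch_with_timeout(port)
elapsed = time.monotonic() - start

# The evidence we actually want: the client degrades within a bounded window.
assert result is None
assert elapsed < 5.0
```

The point is not the toy broker; it is that the assertion at the bottom is the go/no-go question stated as a repeatable check instead of a feeling.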
Until that evidence is as cheap to collect as a unit test is to write, we will keep calling faith “rigor” and paying the difference in incidents, postmortems, and quietly eroded customer trust.
The next leverage point in CI is not faster tests or smarter linters or another contract format. It is realistic tests — scripted, repeatable, instrumented, boring to run — that turn “I think we’re safe to ship” into an artifact a regulator, a teammate, or a future-you can replay. The teams that close that gap will stop arguing about whether AI wrote the code or a human did, because the question will have moved. The question will be: show me the evidence.
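What "an artifact you can replay" might mean in practice: a scenario driven entirely by a recorded seed, packaged with its inputs, outcome, and a digest so two runs can be compared byte-for-byte. Everything here is a hypothetical stand-in — the scenario, the pass criterion, and the record shape are assumptions for illustration:

```python
import hashlib
import json
import random

def run_scenario(seed):
    """Run a stand-in fault scenario deterministically from a seed,
    so any reviewer can replay the exact same sequence of events."""
    rng = random.Random(seed)
    events = [rng.choice(["ok", "slow", "drop"]) for _ in range(10)]
    survived = events.count("drop") < 4  # stand-in pass criterion
    return events, survived

def evidence_record(seed):
    """Package the run as a replayable artifact: inputs, outcome, and a
    digest a teammate (or a regulator) can verify independently."""
    events, survived = run_scenario(seed)
    record = {"seed": seed, "events": events, "survived": survived}
    record["digest"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

a = evidence_record(42)
b = evidence_record(42)  # replay: the artifact is identical, not a vibe
assert a == b
```

The same shape scales up: swap the toy scenario for real fault injection, keep the seed, the outcome, and the digest, and "I think we're safe to ship" becomes a file in the build output.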
Right now, for most teams, the honest answer is that there isn’t any.
That is the gap.