Recently on the Cup o' Go podcast, the hosts discussed whether AI-generated changelists should be accepted, rejected, or treated differently. The conversation circled around trust, responsibility, and reputation. Should we reject PRs that say "co-authored by Claude"? Should we trust the submitter if they stake their name on it? Should AI involvement matter at all?
They almost reached the right answer: judge the code on its merits.
Correct.
But then they defined "merit" in social terms (belief, reputation, whether the author reviewed it, whether they’re trusted) - and that’s where we part ways.
Belief is not a quality attribute.
Code is not theology. It is either verifiable or it is not.
If you submit a PR to one of my projects and it does not come with extremely high-quality tests - tests that prove behavior under constraints, edge cases, and failure modes - it will be denied. I don’t care if it was written by a junior developer, a staff engineer, or an AI model trained on the entire internet. No tests, no merge.
This is not anti-AI. It is AI-native engineering.
The Coordination Shift Has Already Happened
We are now living in a world where coding agents can:
- Draft entire features in minutes
- Refactor subsystems instantly
- Generate scaffolding, glue, and repetitive logic without fatigue
- Produce plausible unit tests that look correct at a glance
Execution cost has collapsed.
Verification cost has not.
That inversion is the real story. It’s the same structural shift described in The Coordination Shift - when execution becomes cheap, coordination and governance become the bottleneck.
AI-generated PRs are not the problem. They are the stress test.
When generation is cheap, reputation is no longer a reliable filter. A trusted engineer can produce garbage quickly. An unknown contributor can generate thousands of lines of plausible-looking code in an afternoon.
The intake boundary must change.
Merit must be redefined:
Merit = demonstrable, reproducible correctness under constraints.
Nothing else scales.
Reputation Is a Weak Signal Now
In the podcast discussion, one thread stood out: the idea that the submitter "believes in the code" and stakes their reputation on it. That worked in a world where code was expensive to produce. Reputation was a proxy for effort and judgment. However, in an AI-augmented reality, effort and authorship are decoupled.
Reputation no longer implies:
- Deep understanding of the change
- Exhaustive exploration of edge cases
- Manual implementation effort
- Careful construction of invariants
It may still imply judgment - but that judgment must be visible. The only scalable way to make judgment visible is through executable proof. Tests are not documentation. They are contracts with reality.
AI-Generated Slop Is Not the Real Threat
The real threat is something subtler: plausible slop.
Code that:
- Compiles
- Passes shallow unit tests
- Looks clean in review
- Satisfies the happy path
But does not survive adversarial thinking.
Most AI-generated unit tests today are structurally weak. They confirm what the code already assumes. They rarely:
- Model boundary conditions
- Simulate concurrency hazards
- Explore malformed input
- Validate failure semantics
- Assert invariants across state transitions
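To make the contrast concrete, here is a minimal Go sketch. ParseLimit is a hypothetical function invented for illustration, not from any real codebase; the first test is the kind of happy-path confirmation generated tests tend to produce, the second encodes boundaries, malformed input, and failure semantics.

```go
package limits

import (
	"fmt"
	"strconv"
	"testing"
)

// ParseLimit is a stand-in function under test: it parses a positive
// integer limit in the range [1, 10000] and rejects anything else.
func ParseLimit(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse limit %q: %w", s, err)
	}
	if n < 1 || n > 10000 {
		return 0, fmt.Errorf("limit %d out of range [1, 10000]", n)
	}
	return n, nil
}

// Shallow, generated-style test: confirms the happy path the code already assumes.
func TestParseLimitHappyPath(t *testing.T) {
	got, err := ParseLimit("100")
	if err != nil || got != 100 {
		t.Fatalf("ParseLimit(\"100\") = %d, %v; want 100, nil", got, err)
	}
}

// Stronger test: boundary conditions, malformed input, and failure semantics.
func TestParseLimitEdges(t *testing.T) {
	cases := []struct {
		name    string
		in      string
		want    int
		wantErr bool
	}{
		{"lower bound", "1", 1, false},
		{"upper bound", "10000", 10000, false},
		{"zero is rejected", "0", 0, true},
		{"above upper bound", "10001", 0, true},
		{"negative", "-5", 0, true},
		{"empty string", "", 0, true},
		{"non-numeric", "ten", 0, true},
		{"overflow", "99999999999999999999", 0, true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			got, err := ParseLimit(tc.in)
			if (err != nil) != tc.wantErr {
				t.Fatalf("ParseLimit(%q) error = %v, wantErr %v", tc.in, err, tc.wantErr)
			}
			if !tc.wantErr && got != tc.want {
				t.Fatalf("ParseLimit(%q) = %d, want %d", tc.in, got, tc.want)
			}
		})
	}
}
```

The edge-case table is longer than the function it exercises. That imbalance is the point.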
If your verification harness is weak, AI does not make you faster. It makes you fragile at scale. And the fragility compounds.
Why 40% Test Code Is Not Excessive
I often see raised eyebrows when I mention that large codebases I’ve worked on contain ~40% tests - unit, integration, end-to-end, not just trivial checks.
In the AI era, that number is not extreme. It is rational.
High verification density gives you:
- Refactorability
- Optionality
- Safe exploration
- Fast iteration without regression fear
It transforms the codebase into a self-verifying system. Without that, you get:
- Review fatigue
- Merge-by-glance culture
- Implicit tribal knowledge
- Silent decay masked by green dashboards
AI does not remove the need for engineering discipline. It amplifies the consequences of lacking it.
This Is Not About Banning AI
Some maintainers respond by banning AI-generated PRs. That is defensive and shortsighted.
AI is not the variable that matters.
Proof is.
The only intake policy that scales in an AI-augmented ecosystem is brutally simple:
- Behavior changes require behavioral proof
- Edge cases must be encoded
- Failure modes must be demonstrated
- Invariants must be executable
If a change increases semantic surface area, the test delta should often exceed the code delta. That is not bureaucracy. That is physics.
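Concretely, "invariants must be executable" means the invariant lives in the test suite rather than in a comment. A minimal sketch, assuming a hypothetical Account type: the test drives thousands of state transitions with a fixed seed so any failure is reproducible, and asserts the invariant after every step.

```go
package account

import (
	"errors"
	"math/rand"
	"testing"
)

// Account is a stand-in type: Withdraw must fail rather than let the
// balance go negative. That is the invariant we want to keep executable.
type Account struct{ balance int }

var ErrInsufficientFunds = errors.New("insufficient funds")

func (a *Account) Deposit(n int) { a.balance += n }

func (a *Account) Withdraw(n int) error {
	if n > a.balance {
		return ErrInsufficientFunds
	}
	a.balance -= n
	return nil
}

func (a *Account) Balance() int { return a.balance }

// TestBalanceNeverNegative encodes the invariant across many random
// state transitions, with a fixed seed so failures are reproducible.
func TestBalanceNeverNegative(t *testing.T) {
	rng := rand.New(rand.NewSource(42))
	var acct Account
	for i := 0; i < 10_000; i++ {
		amount := rng.Intn(100)
		if rng.Intn(2) == 0 {
			acct.Deposit(amount)
		} else if err := acct.Withdraw(amount); err != nil && !errors.Is(err, ErrInsufficientFunds) {
			t.Fatalf("step %d: unexpected error: %v", i, err)
		}
		// The invariant must hold after every transition, not just at the end.
		if acct.Balance() < 0 {
			t.Fatalf("step %d: balance went negative: %d", i, acct.Balance())
		}
	}
}
```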
Verification Dominates Execution
In The Centaur Manifest, one of the core tenets is: verification dominates execution.
This is not philosophical. It is economic.
When code generation cost trends toward zero:
- The scarcest resource becomes correctness under uncertainty.
- The bottleneck becomes integration and proof.
- The highest leverage activity becomes encoding invariants as tests.
The fastest teams in the AI era will not be those who generate the most code.
They will be those who can:
- Encode constraints automatically
- Run guardrails continuously
- Treat CI as arbiter
- Separate outcome validation from output production
They will look conservative to legacy organizations, and they will actually be moving faster.
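Here is what "encode constraints automatically" and "treat CI as arbiter" can look like in practice: the constraint is written as an ordinary test, and the pipeline enforces it on every PR. A minimal Go sketch, with hypothetical package paths, that fails the build if a layering rule is broken:

```go
package arch

import (
	"go/parser"
	"go/token"
	"io/fs"
	"path/filepath"
	"strings"
	"testing"
)

// TestStorageDoesNotImportHTTP encodes an architectural constraint as an
// executable guardrail: nothing under ./storage may import the HTTP layer.
// The directory and import path are placeholders; adapt them to the real
// module layout.
func TestStorageDoesNotImportHTTP(t *testing.T) {
	const forbidden = "example.com/myapp/internal/http"

	err := filepath.WalkDir("storage", func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || !strings.HasSuffix(path, ".go") {
			return err
		}
		f, perr := parser.ParseFile(token.NewFileSet(), path, nil, parser.ImportsOnly)
		if perr != nil {
			return perr
		}
		for _, imp := range f.Imports {
			if strings.Trim(imp.Path.Value, `"`) == forbidden {
				t.Errorf("%s imports %s, which violates the layering constraint", path, forbidden)
			}
		}
		return nil
	})
	if err != nil {
		t.Fatalf("walking storage/: %v", err)
	}
}
```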
Consensus Culture Will Struggle With This
There is also a social layer here.
Demanding strong tests forces clarity. It eliminates ambiguity. It makes competence legible. It removes wiggle room. In consensus-heavy cultures, it is often easier to trust the contributor than to demand executable proof. Trusting feels collaborative. It feels kind.
In an AI-augmented environment, that posture is fatal.
Because AI scales volume.
Without a proof-first stance, you get:
- Mountains of plausible PRs
- Review bottlenecks
- Social pressure to merge
- Diffusion of accountability
If generation is cheap, discipline must increase, not decrease.
The Real Question Is Structural
The podcast framed the question as:
Should we accept AI-generated code?
The structural question is:
How do we redesign verification for near-zero generation cost?
If your answer is:
"Trust the author."
You are still thinking in a pre-AI execution economy.
If your answer is:
"Show me the invariant."
You are adapting.
A Simple Policy for the AI Era
Here is the rule. If the code cannot stand on its own merit, it does not merge.
And merit means:
- Reproducible behavior
- Explicit edge-case coverage
- Clear failure semantics
- Encoded constraints
- Automated proof
Not vibes.
Not faith.
Not reputation.
Just evidence.
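For instance, "clear failure semantics" is not a review comment, it is an assertion. A minimal sketch with a hypothetical Slow dependency: the test demonstrates the failure mode - cancellation - and pins down exactly what the caller gets back.

```go
package fetch

import (
	"context"
	"errors"
	"testing"
	"time"
)

// Slow is a stand-in for a dependency call that must honor cancellation.
// The failure semantics we care about: on a cancelled context the caller
// gets ctx.Err() back, not a partial result and not a hang.
func Slow(ctx context.Context) (string, error) {
	select {
	case <-time.After(time.Second):
		return "done", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// TestSlowHonorsCancellation demonstrates the failure mode instead of
// describing it in prose: cancel the context, expect context.Canceled.
func TestSlowHonorsCancellation(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // fail before the work even starts

	got, err := Slow(ctx)
	if !errors.Is(err, context.Canceled) {
		t.Fatalf("Slow() error = %v, want context.Canceled", err)
	}
	if got != "" {
		t.Fatalf("Slow() = %q, want empty result on cancellation", got)
	}
}
```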
Execution is no longer the scarce asset. Judgment and proof are. If you have not internalized that yet, you are not working in an AI-augmented reality. You are still arguing about authorship in a world where authorship is no longer the point.