The V-Model Trap

There is a seductive argument circulating in software circles right now: if large language models can generate implementation from a sufficiently clear specification, then the old systems-engineering V-model is suddenly the right way to build software. Requirements first. Design against the requirements. Implement. Verify. Validate. Keep the specification and the system in sync. Let the code become disposable.

On the surface, this sounds responsible. It borrows the seriousness of aerospace, defense, formal traceability, verification discipline, and engineering maturity. It also flatters a managerial fantasy that has never quite died: that the hard part of software was always the absence of sufficiently precise instructions, and that AI now gives us the machine that can turn those instructions into working code.

That argument is almost exactly backwards.

The V-model works well in a world where execution is the dominant bottleneck, change is expensive, the operating environment is comparatively knowable, and the cost of late failure is catastrophic. It is not a stupid model. NASA’s own systems-engineering material emphasizes requirements, decomposition, verification, validation, documentation, lifecycle control, and traceability in domains where mission assurance and physical constraints carry enormous weight (NASA Systems Engineering Handbook). Those disciplines belong where requirements can be stabilized, where failure has extreme consequences, and where the system must be proven against explicit constraints before it is allowed to operate.

But that is precisely why the analogy is dangerous when imported wholesale into AI-augmented software product development. Most software is not a spacecraft. Most software is a discovery system embedded in changing users, changing workflows, changing dependencies, changing incentives, and changing organizational understanding. The hard part is not merely building the specified thing correctly. The hard part is discovering which intervention deserves to exist, proving that it creates value, and adapting fast enough when reality contradicts the original assumption.

AI does not fix that problem. It intensifies it.

The V-Model Optimizes for Output Conformance

The V-model is fundamentally a conformance machine. On the left side, requirements are decomposed into architecture, subsystem requirements, detailed design, and implementation. On the right side, corresponding tests verify that the realized system satisfies those requirements at increasing levels of integration. Its central virtue is traceability: every lower-level artifact should relate back to a higher-level requirement, and every requirement should have a verification strategy.
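
The traceability rule described above can be sketched as a small check. This is only an illustrative sketch: the names Requirement, parent, verified_by, and traceability_gaps are hypothetical and do not come from any particular requirements tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Requirement:
    req_id: str
    parent: Optional[str]  # higher-level requirement this one derives from (None at the top)
    verified_by: List[str] = field(default_factory=list)  # ids of tests that verify it


def traceability_gaps(reqs: List[Requirement]) -> Dict[str, List[str]]:
    """Report requirements that break the two traceability rules."""
    known = {r.req_id for r in reqs}
    # Rule 1: every lower-level requirement must trace to a known parent.
    orphans = [r.req_id for r in reqs if r.parent is not None and r.parent not in known]
    # Rule 2: every requirement must have at least one verification.
    unverified = [r.req_id for r in reqs if not r.verified_by]
    return {"orphans": orphans, "unverified": unverified}


reqs = [
    Requirement("SYS-1", None, ["T-1"]),
    Requirement("SUB-1", "SYS-1", []),       # traced, but nothing verifies it
    Requirement("SUB-2", "SYS-9", ["T-2"]),  # verified, but its parent does not exist
]
gaps = traceability_gaps(reqs)
```

Note what the check can and cannot say: it confirms that every requirement is linked and tested, not that any requirement deserved to exist.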


That is useful when the requirements are the right anchor. In product software, that assumption is often the entire dispute.

A requirement is not revealed truth. It is an institutionalized hypothesis. It may represent user research, stakeholder preference, procurement language, a workflow assumption, a legacy process, a political compromise, or an attempted interpretation of a market signal. The V-model can verify that an output conforms to a requirement, but it does not inherently prove that the requirement was worth having.

Output and outcome are not the same thing. An output is what the system produces: a feature, workflow, report, integration, interface, dashboard, migration, or service. An outcome is the observable change that the output was supposed to create: reduced time-to-resolution, better activation, fewer failed deployments, improved compliance posture, lower operational risk, higher conversion, or some other meaningful shift in reality.

The V-model is comfortable with outputs because outputs can be specified. Outcomes are more uncomfortable because outcomes can falsify the specification.

A team can build every requirement correctly and still fail. The product can pass acceptance tests and still leave user behavior unchanged. The workflow can match the approved process and still increase operational friction. The dashboard can show exactly what was requested and still support no real decision. The AI-generated implementation can be technically coherent, fully traced, and perfectly aligned with yesterday’s misunderstanding.

That is not engineering maturity. That is high-speed waste.

Software Already Learned This Lesson

The irony is that software engineering learned this lesson before AI entered the room.

Winston Royce’s 1970 paper, often treated as the origin of waterfall, is frequently misremembered as a simple defense of sequential development. It is not. Royce presents the simple sequential model and then immediately warns that such an implementation is risky and invites failure, because testing occurs too late to cheaply discover design and requirement problems (Royce, "Managing the Development of Large Software Systems"). Barry Boehm’s spiral model later reframed software development as risk-driven rather than primarily document-driven or code-driven, explicitly because uncertainty and risk must be discovered and reduced throughout the work, not merely specified away at the beginning (Boehm, "A Spiral Model of Software Development and Enhancement").

Fred Brooks made the deeper point in "No Silver Bullet." The essential difficulty of software is not typing source code. It is the specification, design, and testing of the conceptual construct: deciding what abstract machine should exist and how it should behave under a messy set of constraints (Brooks, "No Silver Bullet"). AI reduces some accidental difficulty. It can generate code, scaffolding, tests, documentation, migrations, and interface glue with astonishing speed. But it does not remove the essential difficulty of deciding what conceptual construct should exist.

If anything, by making representation cheaper, AI makes conceptual error more dangerous. The wrong idea can now be implemented beautifully, repeatedly, and at scale.

That is why the return of specification-first thinking under the banner of AI is not progress. It is a regression with better tooling.

"Validation" Is Not the Same as Outcome Discovery

A defender of the V-model may object that the model includes validation, not merely verification. That is true, and the distinction deserves precision. In systems engineering, verification asks whether the system was built according to specified requirements, while validation asks whether it provides the needed capability in its intended environment. NASA’s systems-engineering handbook is explicit about this distinction (NASA Systems Engineering Handbook).

That still does not solve the product-software problem.

The issue is not whether formal validation exists. The issue is what is being validated, when, against whom, and how easily the system is allowed to change when validation fails. In many systems-engineering domains, the mission, operating environment, stakeholder need, and physical constraints are expensive to change and comparatively stable. Validation often confirms that the realized system satisfies an intended use that is already substantially understood.

In market-facing software, the intended use itself is often under discovery. Users may not behave as expected. The buyer may value a different outcome. The workflow may reveal a hidden constraint. The integration point may be politically or operationally impossible. The apparent problem may turn out to be a symptom of a deeper system design failure. The first real customer conversation may invalidate months of internal alignment.

This is why the Agile Manifesto’s principles remain relevant. The point was never ceremony, standups, story points, or the later industrialization of "agile" into process theater. The original principles emphasized early and continuous delivery, customer collaboration, working software, and welcoming changing requirements, even late in development, because change can become a source of competitive advantage (Agile Manifesto principles).

AI does not repeal that insight. It raises the cost of ignoring it.

Requirements Are Hypotheses Until Reality Says Otherwise

The strongest evidence against specification-first product development comes from experimentation research, not from ideology.

Ron Kohavi and colleagues have spent years documenting online controlled experiments at scale in companies such as Microsoft, Amazon, Google, LinkedIn, Netflix, and others. Their work repeatedly shows how often plausible, prioritized, internally supported product ideas fail to improve the metrics they were designed to improve. The lesson is not that human judgment is useless. The lesson is that internal confidence is a poor substitute for external evidence in complex sociotechnical systems (Kohavi et al., "Online Randomized Controlled Experiments at Scale"; Kohavi et al., "Unexpected Results in Online Controlled Experiments").
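
To make "external evidence" concrete, here is a minimal sketch of the kind of comparison a controlled experiment rests on: a pooled two-proportion z-test on conversion counts from a control and a treatment group. The function name and the numbers are illustrative assumptions, not taken from the cited papers, and real experimentation platforms add much more (sample-size planning, guardrail metrics, multiple-testing control).

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float, float]:
    """Return (absolute lift, z statistic, two-sided p-value) for B versus A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: 1,000 users per arm, 100 vs 150 conversions.
lift, z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

The design point is that the decision rule is written down before the result is known; a confident roadmap item and a doubtful one are judged by the same arithmetic.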

The same logic appears in entrepreneurship research. Camuffo and colleagues ran a randomized controlled trial teaching startups to use a more scientific approach: form hypotheses, make predictions, test assumptions, and decide based on evidence rather than narrative confidence. Their work supports the idea that entrepreneurial and product decisions should be treated as falsifiable hypotheses under uncertainty, not as plans that become correct through internal commitment (Camuffo et al., "A Scientific Approach to Entrepreneurial Decision Making").

That is the core epistemic failure of the V-model as a governing metaphor for product software. It makes requirements look mature before reality has disciplined them. It can organize confidence before evidence. It can make an internally approved idea appear professional, traceable, and complete before users, operations, or the market have had a chance to reject it.

AI makes that failure cheaper per artifact and more expensive per system.

AI Collapses Execution Cost, Not Uncertainty

AI-assisted development changes the economics of software work, but not in the simple way its boosters often claim. It can reduce the cost of first implementation, especially in constrained and well-scoped tasks. A controlled experiment on GitHub Copilot found that developers completed a programming task 55.8% faster with Copilot than without it (Peng et al., "The Impact of AI on Developer Productivity").

But the empirical picture is not a universal speedup story. A randomized controlled trial by METR, using experienced open-source developers working in mature repositories they knew well, found that early-2025 AI tools made participants take 19% longer on their assigned tasks, even though the participants expected AI to make them faster (METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"). The point is not that Copilot is good and agentic tooling is bad, or the reverse. The point is that AI is not a generic acceleration function.

The Harvard/BCG "jagged frontier" study gives the right conceptual frame. Dell’Acqua and colleagues found that consultants using AI performed better on tasks inside the AI capability frontier, but on a task designed to be outside that frontier, consultants using AI were 19 percentage points less likely to produce correct solutions than those working without it (Dell’Acqua et al., "Navigating the Jagged Technological Frontier").

This is fatal to the simplistic "the spec becomes king" argument. If AI performance depends on task shape, context quality, operator judgment, capability-frontier recognition, and verification strength, then the valuable artifact is not a static specification. The valuable system is the whole control loop: intent, constraints, decomposition, executable checks, telemetry, review, customer evidence, and the ability to change course.

AI makes the first serious artifact cheaper. It does not make the first idea correct.

The Bottleneck Moves Downstream

Even when AI improves production speed, it can move the burden elsewhere. Song, Agarwal, and Wen studied GitHub Copilot adoption in open-source software development and found project-level productivity gains, but also a 41.6% increase in integration time, plausibly due to higher coordination costs (Song et al., "The Impact of Generative AI on Collaborative Open-Source Software Development"). Another study on AI-assisted programming in open-source projects found that productivity gains can be driven by less-experienced contributors while increasing rework burden on experienced core developers, who review more code and produce less original code themselves after Copilot introduction (Xu et al., "AI-assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden").

This is exactly what we should expect. When generation gets cheaper, unvalidated output increases. When output increases, integration, review, verification, and maintenance become more important. The bottleneck does not disappear. It moves.

DORA’s delivery metrics make the same point from another direction. High-performing delivery is not measured merely by output volume. It is measured by flow and stability: lead time for changes, deployment frequency, failed deployment recovery time, change failure rate, and related delivery/reliability measures (DORA metrics). A software organization that generates more code faster while increasing instability, review queues, integration delay, or recovery time is not necessarily improving. It may only be converting one bottleneck into another.
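
As a rough sketch of what measuring flow and stability rather than volume can look like, the snippet below derives the four measures named above from a list of deployment records. The Deployment record and its field names are assumptions made for illustration; DORA does not prescribe this data model, and real pipelines would pull these timestamps from version control and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    recovered_at: Optional[datetime] = None # when service was restored, if it failed


def delivery_metrics(deploys: List[Deployment], window_days: int) -> dict:
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


t0 = datetime(2025, 1, 1)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True, recovered_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = delivery_metrics(deploys, window_days=2)
```

Nothing in this calculation counts lines of code or tickets closed, which is the point: a team can double its output and see every one of these numbers get worse.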

That is the Coordination Shift: when execution accelerates, the scarce capabilities move toward coordination, verification, governance, boundary design, and judgment. The AI-era delivery problem is not "can we produce more?" The problem is whether the organization can keep machine-speed output attached to evidence, coherence, and operational reality.

AI Turns Prototypes Into Pretotypes

The better analogy for AI-augmented software is not the V-model. It is the compression of the distance between hypothesis and evidence.

Lean Startup framed the minimum viable product as a vehicle for validated learning, not as a smaller version of a finished product. Its unit of progress is validated learning, and its core loop is build-measure-learn (Lean Startup methodology; Eric Ries on MVPs). Pretotyping made the same point more bluntly: make sure you are building the right "it" before you build "it" right (Savoia, "Pretotype It"). Effectuation adds a complementary entrepreneurial logic: act from available means, limit exposure through affordable loss, use stakeholder commitment, and let contingencies shape the venture under uncertainty (Effectuation principles).

AI changes the economics of those ideas. The previous generation often used fake doors, explainer videos, concierge tests, Wizard-of-Oz systems, and other pretotypes because building the real thing was too expensive. Dropbox famously used an explainer video as its MVP; the video validated that people wanted the product because they actually signed up, not because a focus group approved the concept (TechCrunch on Dropbox’s MVP).

Today, AI often makes the working prototype cheaper than the presentation about the prototype. In many domains, what used to require a fake demonstration can now be a real but bounded implementation. That is a profound change in time-to-evidence.

But the thinking is the same. The point is not to build more confidently from a fixed spec. The point is to reduce uncertainty faster.

Dropbox could perhaps be built as a real prototype today instead of a video. It could even be built under a V-model. But that would not guarantee the same product, the same adoption, or the same learning. The important thing about the Dropbox story was not that the artifact was fake or real. The important thing was that the artifact tested a value hypothesis before the company committed to building the fully realized system.

AI compresses the build. It does not eliminate the need for the test.

This is the logic of Just-In-Time Productization: AI compresses the distance between a credible customer problem, a bounded service hypothesis, a working prototype, and a real evidence-producing encounter. The product is not defined first and validated later. The product is formed through contact with reality.

The Specification Is Not the Source of Truth

The claim that "the spec becomes the valuable part and code becomes disposable" contains a useful half-truth. In AI-augmented development, explicit intent becomes more important than ever. Vague desires produce vague systems. Hidden assumptions produce hidden defects. Architecture trapped in someone’s head becomes a coordination bottleneck. Requirements in Word documents, architecture in memory, tests in another system, and operations in a fourth are all signs of a broken delivery system.

But the conclusion should not be that the static specification becomes the throne.

The specification is not the source of truth. Reality is the source of truth. The specification is a hypothesis about how to intervene in reality.

That difference changes everything. A good AI-era specification is not a frozen requirements document. It is a living intent artifact. It explains the desired outcome, the current context, the non-negotiable constraints, the evidence that would indicate progress, the failure modes that must be avoided, the interfaces that must hold, and the proof obligations required before integration. It is closer to an executable theory of change than to a contract for output production.

The moment evidence shows that the output does not produce the intended outcome, the requirements must be allowed to change. If they cannot change, the organization is not doing outcome-driven software development. It is doing requirements compliance with better tooling.

This is why the V-model is so attractive to output-oriented organizations. It lets them preserve the fiction that the hard decision happened at the beginning. It makes the rest of the work look like disciplined execution. But in uncertain software domains, the hardest decisions often happen after first contact with users, operations, integration, data, or the market.

AI Is Not a Shortcut to Competence

There is also a human-capital problem with the specification-first framing. It suggests that if requirements are clear enough, implementation can be delegated and verified later. That underestimates the skill required to supervise AI-generated systems.

Shen and Tamkin studied developers learning a new asynchronous Python library with and without AI assistance. They found that AI use impaired conceptual understanding, code reading, and debugging abilities on average, without delivering significant efficiency gains overall. Participants who fully delegated coding tasks sometimes gained productivity, but at the cost of learning the library; the higher-learning AI patterns involved cognitive engagement, such as asking conceptual questions or seeking explanations (Shen and Tamkin, "How AI Impacts Skill Formation").

This is crucial. The AI-era engineer cannot merely be a requirements typist or acceptance clerk. They must understand enough to challenge the model, detect wrongness, constrain implementation, design verification, interpret failures, and decide when the system should change direction. The more powerful the generator, the more valuable the human verifier becomes.

Security research reinforces the same point. Pearce and colleagues found that Copilot could generate code containing security vulnerabilities in security-relevant scenarios. Later large-scale work by Schreiber and Tippe found CWE instances in AI-attributed public GitHub code, while also noting that most analyzed AI-generated files had no identifiable CWE-mapped vulnerability (Pearce et al., "Asleep at the Keyboard?"; Schreiber and Tippe, "Security Vulnerabilities in AI-Generated Code"). The sensible conclusion is not that AI code is unusable. It is that AI output must be treated as governed material.

The faster code arrives, the more consequential guardrails become.

The Organizational Topology Problem

The V-model also ignores the organizational topology in which AI-generated software lands.

Conway’s Law remains one of the most durable observations in software engineering: organizations that design systems are constrained to produce designs that mirror their communication structures (Conway, "How Do Committees Invent?"). If the organization is fragmented, handoff-heavy, and weakly aligned around outcomes, AI does not repair that topology. It accelerates local work inside the same broken structure.

Burns and Stalker’s classic distinction between mechanistic and organic management systems is useful here. Mechanistic systems fit more stable conditions; organic systems fit changing conditions where knowledge and authority must move closer to the situation (Burns and Stalker, "The Management of Innovation" excerpt). The V-model belongs to a mechanistic control logic: stabilize intent, decompose work, execute, verify. Product software under uncertainty requires a more adaptive control logic: form intent, act in small batches, observe, learn, and revise.

AI makes this structural mismatch sharper. As argued in "Your Organization Is The Bottleneck," faster execution by itself does not create value. It creates output. If that output cannot be validated quickly against reality, it becomes inventory: more code waiting to be integrated, more prototypes waiting to be reviewed, more plausible systems waiting for someone to decide whether any of it should continue.

In organizations with little vertical ownership, AI-generated output is especially dangerous. Requirements are written upstream. Design is interpreted elsewhere. Implementation happens downstream. Testing is separated. Operations inherits the result. Product outcome is someone else’s metric. This is not a value system. It is an output conveyor belt.

AI can accelerate every station on that conveyor belt and still produce no outcome.

Everyone Gets Faster

The V-model argument also misunderstands competition. It compares AI-augmented V-model delivery against the old world, where development was slower. In that comparison, even a bad model can look good. If AI lets a team implement a large upfront specification faster than before, the improvement is real.

But that is not the relevant comparison.

The relevant comparison is against other AI-augmented organizations. When everyone has access to faster generation, the advantage does not go to the organization that implements the upfront specification fastest. It goes to the organization that discovers the right thing fastest, proves it fastest, and then hardens it fastest.

If one organization uses AI to implement the first surface area of a large upfront specification, while another uses AI to run bounded experiments against real users, the second organization has a structural advantage. It is not merely shipping faster. It is learning faster. It can discard wrong requirements before they metastasize. It can converge on fit while the first organization is still celebrating conformance.

This is why output velocity is a weak measure of AI transformation. The real advantage is not producing more artifacts. It is shortening the path from uncertainty to evidence.

Where V-Model Discipline Still Belongs

None of this means that every V-model practice is useless. Traceability, verification, formal acceptance criteria, and requirements discipline all have legitimate roles. In safety-critical, regulated, contractual, or physically constrained domains, V-model-like discipline may remain appropriate for parts of the system. Even in product software, there are subsystems where conformance is the correct goal: cryptographic behavior, compliance controls, safety boundaries, data-retention rules, interoperability contracts, financial calculations, and infrastructure invariants.

The mistake is using the V-model as the governing metaphor for software value creation.

A better approach is to place V-model discipline inside an outcome-driven loop. Use rigorous verification where the constraint is known. Use traceability where accountability requires it. Use formal tests where correctness can be specified. Use model-based discipline where the model is actually a defensible representation of reality. But do not confuse those local verification needs with the global product-development process.

At the product level, the primary loop should be outcome-first and evidence-driven. At the subsystem level, some components may need V-model rigor. That is a very different claim from saying the V-model is the right model for AI-augmented software development.

It is not.

The Better Model: Intent, Constraints, Evidence, Guardrails

The AI-era alternative is not chaos. It is not "vibes." It is not typing prompts until something works. It is a more disciplined control system than the V-model, but the discipline sits in different places.

Start with intent, not a frozen specification. Define the outcome sought, the context, the customer or operational problem, and the reason the work deserves attention. State the constraints explicitly: security, reliability, cost, compliance, data access, architectural boundaries, operational blast radius, and ethical limits.

Then define the proof obligations. What must be true before this can be integrated? What must tests demonstrate? What must telemetry show? What would falsify the hypothesis? What customer behavior would count as evidence? What operational signal would cause a rollback? What decision would be made if the result is neutral or negative?

Only then should AI-augmented execution begin.
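
One way to picture that gate is a small intent record that refuses to release work until the outcome, the constraints, and the proof obligations are written down. This is a hypothetical sketch, not a prescribed format: Intent, ProofObligation, and blockers are names invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProofObligation:
    question: str     # e.g. "What would falsify the hypothesis?"
    answer: str = ""  # stays empty until the team commits to an answer


@dataclass
class Intent:
    outcome: str      # the observable change the work is meant to create
    context: str      # the customer or operational problem behind it
    constraints: List[str] = field(default_factory=list)
    obligations: List[ProofObligation] = field(default_factory=list)


def blockers(intent: Intent) -> List[str]:
    """List everything still missing before execution should begin."""
    problems = []
    if not intent.outcome.strip():
        problems.append("no outcome stated")
    if not intent.constraints:
        problems.append("no explicit constraints")
    if not intent.obligations:
        problems.append("no proof obligations defined")
    problems += [
        f"unanswered: {o.question}" for o in intent.obligations if not o.answer.strip()
    ]
    return problems


intent = Intent(
    outcome="Cut median time-to-resolution for support tickets",
    context="Agents re-enter customer data across two tools",
    constraints=["no new access to payment data"],
    obligations=[
        ProofObligation("What operational signal would cause a rollback?",
                        "error rate above the current baseline for one hour"),
        ProofObligation("What would falsify the hypothesis?"),
    ],
)
open_items = blockers(intent)
```

An empty list from blockers is the signal that execution may start. The record stays editable afterwards, so an answer that evidence later contradicts can be revised rather than defended.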

The resulting implementation should be small enough to learn from, instrumented enough to interpret, and bounded enough to fail safely. Requirements should evolve when evidence invalidates them. Tests should become more precise as the system matures. Architecture should stabilize after learning, not before it. Productization should follow proof, not precede it.

This is not less professional than the V-model. It is more honest about uncertainty.

Build the Right Thing, Then Build It Right

The slogan "build it right" is incomplete. In software, and especially in AI-augmented software, the stronger discipline is: first make sure you are building the right thing; then build the right thing right.

AI changes the cost of both halves. It makes the first serious prototype cheaper. It makes iteration faster. It makes verification more automatable. It makes documentation easier to produce. It makes refactoring less painful. It makes previously expensive experiments affordable.

But it does not decide what creates value. It does not create market fit. It does not know the organization’s real constraints unless those constraints are surfaced. It does not guarantee that a requirement is valuable because the requirement was clearly written. It does not rescue an output-oriented organization from its inability to learn.

The V-model is attractive because it promises control. In AI-augmented software, that promise is often false control. It controls conformance to a theory before the theory has been sufficiently tested.

The organizations that learn this quickly will move differently. They will govern experiments rather than worship specifications. They will automate proof rather than accumulate approvals. They will shorten time-to-evidence rather than maximize output. They will use AI to compress learning, not just implementation.

The organizations that do not learn it will still get faster. They will produce more requirements, more code, more tests, more documentation, more demos, and more internally coherent systems.

They will just build the wrong things at machine speed.
