Jun 30, 2025

AI4SE and the V-Model: The Case for Shoot-and-Forget BDD

Milhan Kim

How LLM-assisted black-box testing changes the economics of verification and reduces knowledge asymmetry in software teams.

Introduction

AI for software engineering does not make the V-model obsolete. It changes the cost structure around the artifacts that the V-model depends on. The core idea of the model, pairing specification on the left side with verification on the right, still matters. What changes is that large language models reduce the time required to turn natural-language requirements into reviewable test assets.

This matters most in black-box testing. When teams can generate acceptance-oriented tests from requirements in minutes rather than days, validation moves closer to the moment of specification. The lifecycle becomes more iterative, but the V-model’s underlying concern with traceability and verification remains intact.

[Figure: V-model diagram]

The classic V-model pairs development activities with corresponding verification and validation activities.

Leon Osborne, Jeffrey Brummond, Robert Hart, Mohsen (Moe) Zarean Ph.D., P.E, Steven Conger; redrawn by User:Slashme from Clarus Concept of Operations, FHWA-JPO-05-072 (2005).

Why the V-Model Still Matters

The traditional V-model is often criticized for rigidity, but its central insight remains useful: every important design decision should eventually have a corresponding validation mechanism. Requirements should connect to acceptance tests, system design to integration testing, and implementation details to lower-level verification.

The difficulty has always been economic. High-quality black-box testing is expensive. It requires someone to read requirements carefully, translate them into behavior, write executable scenarios, maintain those scenarios over time, and keep them aligned with product changes. Because that work is costly, many teams defer it, narrow it, or collapse it into ad hoc manual checking.

Most modern teams still follow the V-model's pairing logic in compressed form. Agile practice did not abolish the V-model so much as shrink it into repeated local cycles.

Where AI Changes the Economics

Black-box testing evaluates a system from the outside against expected behavior rather than internal implementation. Historically, this has been one of the most labor-intensive parts of software delivery. Teams frequently underinvest in it because every new scenario imposes authoring, review, and maintenance costs.

LLMs change that equation by lowering the cost of converting natural-language requirements into executable behavior descriptions. In my own team, a GitHub Copilot-based agent can turn a GitHub issue into a first-pass Gherkin feature file in roughly ten minutes. The artifact still needs review, but the expensive translation step becomes dramatically cheaper.

Scenario: Password Reset
  Given the test user is on the login page
  When the test user clicks on "Forgot Password"
  And enters their registered email
  Then they should receive a password reset link

I use the phrase shoot-and-forget BDD for this workflow. A stakeholder or engineer describes the intended behavior in ordinary language, an agent generates a reviewable BDD artifact, and the artifact enters the normal pull-request and CI pipeline. The value lies not in eliminating review, but in reducing the cost of producing something reviewable.
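One way to keep review cheap is to let CI reject structurally broken scenarios before a human reads them. The following is a minimal sketch of such a gate, not a Gherkin parser: the function name `lint_scenario` and the house rule that every scenario needs at least one Given, one When, and one Then step in that order are assumptions made for illustration. Gherkin itself does not require all three.

```python
STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


def lint_scenario(text):
    """Return a list of structural problems in one Gherkin scenario.

    Hypothetical house rule (not a Gherkin requirement): a scenario
    must contain Given, When, and Then steps, first seen in that order.
    """
    # Drop blank lines, comments (#) and tag lines (@) before checking.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    lines = [ln for ln in lines if not ln.startswith(("#", "@"))]

    problems = []
    if lines and lines[0].startswith("Scenario:"):
        steps = lines[1:]
    else:
        problems.append("missing 'Scenario:' header")
        steps = lines

    seen = []  # primary keywords in the order they appear
    for step in steps:
        keyword = step.split(" ", 1)[0]
        if keyword not in STEP_KEYWORDS:
            problems.append(f"unknown step keyword: {step!r}")
        elif keyword in ("And", "But"):
            # And/But continue the previous step, so one must exist.
            if not seen:
                problems.append(f"{keyword!r} step has no preceding step")
        else:
            seen.append(keyword)

    missing = [k for k in ("Given", "When", "Then") if k not in seen]
    problems.extend(f"no {k} step" for k in missing)
    if not missing and not (
        seen.index("Given") < seen.index("When") < seen.index("Then")
    ):
        problems.append("steps are not in Given/When/Then order")
    return problems
```

Run against the password-reset scenario above, the function returns an empty list. A scenario that starts directly with a When step is reported as having no Given step. A real pipeline would more likely use the parser that ships with the team's BDD framework.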

What Shoot-and-Forget BDD Improves

This workflow offers three practical gains.

Earlier validation: acceptance-oriented thinking moves closer to requirements authoring rather than waiting for a late QA phase.
Broader scenario coverage: once scenario production becomes cheap, teams can afford to encode more edge cases and alternative flows.
Better traceability: requirements, issues, feature files, and executed tests can be linked with less manual effort.
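The traceability gain can be made concrete with a small sketch. It assumes a tagging convention that is not described in this article: each scenario carries one or more `@issue-<number>` tags on the line above it. The function names and the second scenario in the usage example are likewise hypothetical.

```python
import re
from collections import defaultdict

# Assumed convention: scenarios are tagged like "@issue-101".
ISSUE_TAG = re.compile(r"@issue-(\d+)")


def trace_issues(feature_text):
    """Map issue number -> names of the scenarios tagged with it."""
    links = defaultdict(list)
    pending = []  # issue numbers from tag lines awaiting a scenario
    for raw in feature_text.splitlines():
        line = raw.strip()
        if line.startswith("@"):
            pending.extend(int(n) for n in ISSUE_TAG.findall(line))
        elif line.startswith("Feature:"):
            # Simplification: feature-level tags are ignored here.
            pending = []
        elif line.startswith("Scenario:"):
            name = line[len("Scenario:"):].strip()
            for issue in pending:
                links[issue].append(name)
            pending = []
    return dict(links)


def untested_issues(open_issues, links):
    """Return the issue numbers that no scenario refers to."""
    return sorted(set(open_issues) - set(links))
```

A usage example:

```python
feature = """Feature: Account recovery

  @issue-101
  Scenario: Password Reset
    Given the test user is on the login page

  @issue-101 @issue-102
  Scenario: Expired reset link
    Given the test user has an expired reset link
"""
links = trace_issues(feature)
print(links)                                   # {101: ['Password Reset', 'Expired reset link'], 102: ['Expired reset link']}
print(untested_issues([101, 102, 103], links)) # [103]
```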

The important point is that AI does not make testing automatic in the strong sense. It makes test authoring cheaper and easier to iterate on. That is enough to change team behavior.

Reliability Comes from Artifacts, Not from Model Determinism

A common objection is that LLM output is probabilistic. That objection is correct but incomplete. The right operational pattern is not to rely on the model’s answer at runtime. It is to capture the model’s output as a reviewable artifact and then run that artifact deterministically in the delivery pipeline.

Once a generated test is versioned, reviewed, and executed in CI, the non-determinism of the original generation step matters far less. The workflow becomes: probabilistic generation, deterministic execution. That distinction is essential if AI-assisted testing is to remain auditable.
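The split between probabilistic generation and deterministic execution can be sketched as a content check: record a digest of each artifact at review time, and have CI refuse to run anything whose content no longer matches. The function names, the manifest shape, and the file path below are assumptions for illustration. In practice, version control already provides this guarantee, since a reviewed commit pins the exact file content.

```python
import hashlib


def fingerprint(artifact_text):
    """Return a SHA-256 digest of the artifact, ignoring CRLF/LF differences."""
    normalized = artifact_text.replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drifted(artifacts, manifest):
    """Return paths whose current content differs from the reviewed digest.

    artifacts: dict of path -> current file text
    manifest:  dict of path -> digest recorded when the file was reviewed
    A path with no manifest entry counts as drifted (never reviewed).
    """
    return sorted(
        path
        for path, text in artifacts.items()
        if manifest.get(path) != fingerprint(text)
    )
```

A usage example, with a hypothetical path:

```python
reviewed = {"features/reset.feature": "Scenario: Password Reset\n"}
manifest = {path: fingerprint(text) for path, text in reviewed.items()}

# The same content passes; a regenerated variant is flagged for re-review.
print(drifted(reviewed, manifest))  # []
regenerated = {"features/reset.feature": "Scenario: Reset password\n"}
print(drifted(regenerated, manifest))  # ['features/reset.feature']
```

The design choice is that the model is never consulted at execution time. Regenerating a scenario produces a new artifact that has to pass review again.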

Organizational Effects: Less Asymmetry, More Shared Inspection

The organizational consequence is not that non-engineers suddenly become engineers. It is that the cost of inspection falls. Product managers, QA, TPMs, and other adjacent roles can participate more directly in behavior specification because the shared artifact is closer to natural language than to implementation detail.

This narrows, though it does not eliminate, knowledge asymmetry. In principal-agent terms, stakeholders gain a better ability to inspect whether delivered behavior matches intended behavior. In game-theoretic terms, the payoff for hoarding knowledge decreases when explanation and verification become cheaper.
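The game-theoretic claim can be illustrated with a toy model. Everything numeric here is invented for illustration and is not taken from this article: four roles each choose to share (1) or hoard (0), the benefit of sharing grows with the number of other sharers, and the premium from hoarding shrinks as a role's own AI adoption `alpha_i` rises. The function names and the default parameters `gain` and `hoard_premium` are hypothetical.

```python
from itertools import product

ROLES = ("Dev", "PM", "QA", "TPM")


def share_advantage(alpha_i, others_sharing, gain=1.0, hoard_premium=5.0):
    """Hypothetical payoff(share) - payoff(hoard) for one role.

    Sharing pays more when more of the others already share; the
    hoarding premium falls linearly as the role's AI adoption rises.
    """
    return gain * (1 + others_sharing) - hoard_premium * (1 - alpha_i)


def best_response(alpha_i, others_sharing):
    """Return 1 (share) if sharing is strictly better, else 0 (hoard)."""
    return 1 if share_advantage(alpha_i, others_sharing) > 0 else 0


def stable_profiles(alphas):
    """Enumerate all share/hoard profiles; keep the pure Nash equilibria."""
    stable = []
    for profile in product((0, 1), repeat=len(alphas)):
        total = sum(profile)
        # A profile is stable if every role is already playing its best
        # response to the number of *other* roles that share.
        if all(
            best_response(alpha, total - choice) == choice
            for alpha, choice in zip(alphas, profile)
        ):
            stable.append(profile)
    return stable


if __name__ == "__main__":
    for level in (0.0, 0.5, 0.9):
        equilibria = stable_profiles([level] * len(ROLES))
        print(level, [dict(zip(ROLES, p)) for p in equilibria])
```

Under these invented parameters, uniform adoption of 0.5 leaves two stable profiles, everyone hoarding and everyone sharing. At 0.9 only the all-share profile is stable, and at 0.0 only the all-hoard profile is. The numbers carry no empirical weight. The sketch only shows the mechanism: when hoarding pays less, the hoarding equilibrium can disappear.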

[Interactive simulation: each of the four roles (Dev, PM, QA, TPM) has its own AI-adoption level α_i, set to 50% by default. For each role, the view shows the best response when 0 to 3 of the other players already share, then lists which of the 16 full strategy profiles are stable under the current role mix.]

The important claim here should remain modest. AI does not erase politics, incentives, or expertise gradients. It does, however, make some forms of verification cheaper, and cheaper verification tends to improve alignment.

Conclusion

The V-model remains useful because software still requires explicit links between intent and validation. What AI changes is the price of maintaining those links. LLM-assisted black-box testing, especially in a BDD workflow, makes it practical to produce more reviewable verification artifacts earlier and more often.

That shift has both technical and organizational consequences. Technically, verification becomes more continuous. Organizationally, behavior becomes easier to inspect across roles. The result is not the end of engineering discipline, but a lower-cost path to exercising it.