Milhan Kim
Senior Software Engineer working in high-performance labeling for Autonomous Driving.
I’m an SE researcher and practitioner; I read and write papers, analyze, and architect systems. I’m interested in philosophy, psychology, management, and, naturally, machine learning.
I upload random ideas, findings, and thoughts on this website. My personal blog is https://milhan.lol, where it’s mostly about astrophotography.
Recent Uploads
- AI4SE and the V-Model; The case of Shoot-and-forget BDD
Introduction
Software engineering is undergoing a paradigm shift as AI for Software Engineering (AI4SE), particularly large language models (LLMs), enters the development lifecycle. Nowhere is this more evident than in the transformation of the traditional V-model of system and software development.
The V-Model
Image credit: Leon Osborne, Jeffrey Brummond, Robert Hart, Mohsen (Moe) Zarean Ph.D., P.E., Steven Conger; redrawn by User:Slashme. Extracted from Clarus Concept of Operations, Publication No. FHWA-JPO-05-072, Federal Highway Administration (FHWA), 2005.
The V-model is a classic software process that emphasizes a rigorous, sequential relationship between development phases and corresponding testing phases. Each stage of requirements or design on the left “wing” of the V has a mirrored verification or validation step on the right wing, culminating in system validation against the initial requirements. This model promotes upfront planning and traceability between artifacts, but it has also been criticized for rigidity and late discovery of defects. Today, AI-driven tools are reshaping this model—making testing far more iterative and integrated, and enabling non-technical stakeholders to actively participate in creating technical artifacts.
Each development phase on the left side:
- requirements
- analysis
- system design
- architecture design
- module design
- coding
has a corresponding testing phase on the right:
- unit testing
- integration testing
- system testing
- acceptance testing
%%{init: {'theme':'base','themeVariables':{ 'primaryColor':'#DCEFFE', 'primaryBorderColor':'#76B3FE', 'edgeLabelBackground':'#FFFFFF', 'tertiaryColor':'#F0F8FF' }}}%%
flowchart TD
    R(Requirements):::left
    SD(System Design):::left
    AD(Architecture Design):::left
    MD(Module Design):::left
    IM(Implementation):::left
    R --> SD --> AD --> MD --> IM
    UT(Unit Test):::right
    IT(Integration Test):::right
    ST(System Test):::right
    AT(Acceptance Test):::right
    IM --> UT --> IT --> ST --> AT --> R
    classDef left fill:#DCEFFE,stroke:#76B3FE,color:#034694;
    classDef right fill:#DCF8E8,stroke:#34BA7C,color:#0B4F3C;
    class R,SD,AD,MD,IM left;
    class UT,IT,ST,AT right;
The V-Model in a linear view.
This model enforces strong traceability and planning for verification and validation, but follows a linear, sequential flow.
In this article, we analyze how AI4SE is transforming the V-model, with a focus on the economics of black-box testing and on cross-functional collaboration. I apply principles from management theory and game theory to understand shifts in team dynamics, knowledge asymmetry, and incentives. The result is a vision of development where behavior-driven testing is continuous (not just an end-phase activity) and product managers (PMs), product owners (POs), technical program managers (TPMs), scrum masters, and other non-engineers can directly shape and verify the software. The goal is to provide a thought-leadership perspective on these changes for a technically literate, managerial audience.
The Traditional V-Model and Its Limits
The V-model (Verification and Validation model) has long been used to structure system development. It visualizes a project in a V-shape: moving down the left side for definition and build phases, then up the right side for testing phases. For example, requirements are defined at the top-left and validated via acceptance testing at the top-right; system design is verified by integration testing; module design by unit testing, and so on. The strength of this model lies in clear verification steps tied to each specification stage and in early planning of tests (even during requirements analysis, one plans the acceptance tests). This ensures that testing isn’t an afterthought and that each requirement is eventually validated.
However, the V-model is essentially a linear lifecycle. It assumes that if you plan well and follow the sequence, you’ll catch issues in the corresponding test phase. In practice, this rigidity has drawbacks. Changes late in the process are costly, and misunderstandings in requirements might not surface until the final validation. There is little room for iterative refinement or unplanned exploration; everything follows a predetermined plan.
As management theorists like Peter Drucker and W. Edwards Deming have noted, such heavy upfront planning and hierarchy can falter in fast-changing environments. The traditional model can lead to a “frozen middle,” where feedback and innovation slow down. In an era where requirements evolve rapidly and quality needs to be assured continuously, the pure V-model feels inflexible.
Another issue is knowledge asymmetry between roles and phases. In the classic setup, business stakeholders define requirements and testers verify them, but only engineers truly understand the system internals during development. This often creates communication gaps or even power imbalances; engineers become gatekeepers of technical knowledge, and non-technical team members must largely trust their judgments until tests validate the outcomes.
In economic terms, this resembles a principal–agent problem: those who own the product vision (principals) rely on those who implement it (agents), but have less information about the technical work. The agent (developer) has more information and may act in self-interest (e.g. saying a feature is “too hard” or deferring tests) while the principal lacks visibility. The incomplete and asymmetric information allows an agent to act opportunistically in ways that diverge from the principal’s goals. Traditional processes tried to counter this with documentation, sign-offs, and structured testing, but the information gap remained.
flowchart LR
    subgraph "Sprint1"
        P1["Stories & Requirements"]
        D1["Design & Implementation"]
        T1["Test & Review"]
        P1 --> D1 --> T1 --> P1
    end
    subgraph "Sprint2"
        P2["Stories & Requirements"]
        D2["Design & Implementation"]
        T2["Test & Review"]
        P2 --> D2 --> T2 --> P2
    end
    Sprint1 --> Sprint2
The V-Model in modern (iterative) configuration.
It’s important to note that while the V-model may be considered “traditional,” its core idea of mapping validation to every development step remains valuable. In fact, most development work today still follows a V-model in miniature. Agile and iterative methods essentially break one large V-cycle into many smaller V-cycles (each sprint or feature is like a mini V-model with its own design, implementation, and testing). In other words, teams haven’t discarded the V-model’s principles of verification; they’ve just compressed and repeated them. This means it’s not enough to dismiss the V-model as outdated; we are all still using some form of it, whether we acknowledge it or not. The key is using it in a flexible, iterative way.
In summary, the V-model ensures thorough verification and validation, but its sequential nature and information silos pose challenges for today’s fast-paced, collaborative development. This is where AI4SE begins to make a profound impact—introducing more agility, continuous testing, and knowledge sharing into the model without losing the traceability that the V-model championed.
AI in the Software Lifecycle: LLMs Change the Game (-theoretic payoffs)
AI4SE refers to applying modern AI techniques (machine learning, NLP, etc.) to software engineering tasks. Large language models (LLMs) have recently shown they can generate code, explain complex concepts, and even produce test cases from natural language descriptions. In effect, coding is becoming easier and more automated, and some aspects of engineering are being “democratized.” Tools like GitHub Copilot already enable developers to generate boilerplate code or unit tests with simple prompts. But beyond assisting coders, these AI tools allow people without coding expertise to contribute in new ways, in theory.
For example, people imagine scenarios like these:
- A product manager uses an LLM-based tool to prototype an application or query a dataset without writing actual code, even refining features on their own.
- A TPM tests ideas before an engineer ever gets involved.
Within engineering teams, AI is changing the workflow. Developers use LLM “co-pilots” to generate functions or suggest design patterns, acting as force-multipliers (suddenly, every engineer can be more productive with AI help). Engineering managers and tech leads use AI to analyze codebases or generate documentation, saving time on grunt work. In essence, AI is taking on the labor of reading, writing, and synthesizing—tasks that scale with data and code—allowing humans to focus on decision-making and creative problem-solving.
In reality, we see:
- Engineering teams invest most of their time in solving highly technical problems. Boilerplate occurs rarely, and LLMs aren’t ready to solve the truly complex problems yet.
- PMs, POs, TPMs, engineering managers, and other leaders are extremely busy. When engineers can create the same artifact in a fraction of the time, it’s not rational for these folks to engage in “vibe coding” (casual coding for its own sake) during real work.
Instead, I want to showcase one success path I’ve discovered: black-box testing, the practice of verifying a system against its specifications from an external perspective.
Black-Box Testing: From Costly Phase to Continuous Activity — Shoot-and-forget BDD
Revisiting Black-Box (Oracle) Testing
Black-box testing means testing software from the outside, against its requirements, without knowing the internal code. In the V-model, black-box testing activities occur in stages like system testing and acceptance testing—critical but often late phases. Traditionally, black-box testing is labor-intensive: QA engineers must derive test cases from requirements, script them, run them, and maintain them when requirements or UIs change. This effort has always been significant in terms of cost and time. Ensuring broad test coverage with manual or scripted tests is so expensive that teams often prioritize a subset of scenarios, potentially missing edge cases until users find them.
- Black-box testing: testing a system without knowing how it’s constructed (external behavior only).
- White-box testing: testing based on knowledge of how the system is built (internal logic).
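To make the distinction above concrete, here is a toy Python illustration; the password-policy function and tests are invented for this example, not taken from any real system:

```python
import re


def is_strong_password(pw: str) -> bool:
    """Hypothetical module under test: at least 12 chars, with letters and digits."""
    return len(pw) >= 12 and bool(re.search(r"[A-Za-z]", pw)) and bool(re.search(r"\d", pw))


def test_black_box():
    # Black-box: derived from the requirement alone, no knowledge of the implementation.
    assert is_strong_password("correcthorse42")
    assert not is_strong_password("short1")


def test_white_box():
    # White-box: written with the code open, deliberately targeting the digit-check branch.
    assert not is_strong_password("onlylettersherenodigits")
```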
Practical AI for Software Engineering: Accelerated Black-Box Testing
LLMs can dramatically shift the economics of black-box testing. AI-powered test generation can turn natural language statements directly into executable test cases within minutes. In my team, it takes around 10 minutes. This makes it feasible to generate many more test scenarios than before, at a fraction of the manual effort. For instance, given a requirement like “It should be able to reset a password using the registered email,” an AI can produce a behavior-driven test scenario in Gherkin syntax. In my team, a GitHub Copilot-based coding agent converts a GitHub issue into a Gherkin feature file:
Scenario: Password Reset
  Given the test user is on the login page
  When the test user clicks on "Forgot Password"
  And enters their registered email
  Then they should receive a password reset link
This was once a task that QA or developers had to do by hand—translating specs into test steps. Now it’s almost automated, effectively creating failing test cases that highlight unimplemented features. In effect, LLMs can interpret the intent behind requirements and produce test cases that validate those requirements. These generated tests are automatically published as a pull request. (Sometimes developers tweak the .feature files afterward, but that’s a fraction of the time compared to writing them from scratch.) Additionally, LLMs generate stub step definitions for the tests (which initially fail), often reusing existing common building blocks and following internal naming taxonomies via a server-side index.
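For illustration, here is a minimal sketch of what such generated stub step definitions might look like, assuming a Python pytest-bdd setup; the file paths and step wording are hypothetical and would normally be aligned with a team's existing step library:

```python
# Illustrative stub step definitions for the generated feature file above.
from pytest_bdd import scenarios, given, when, then

# Bind every scenario in the generated feature file to this test module.
scenarios("features/password_reset.feature")


@given("the test user is on the login page")
def on_login_page():
    raise NotImplementedError("stub: navigate to the login page")


@when('the test user clicks on "Forgot Password"')
def click_forgot_password():
    raise NotImplementedError("stub: click the 'Forgot Password' link")


@when("enters their registered email")
def enter_registered_email():
    raise NotImplementedError("stub: type the registered email address")


@then("they should receive a password reset link")
def reset_link_received():
    raise NotImplementedError("stub: assert a reset email was sent")
```

Because every step raises until someone implements it, the pull request lands as a set of failing tests that document the not-yet-implemented behavior.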
I call this approach “shoot-and-forget BDD” (from the perspective of a TPM writing the scenario in my team).
The implications are profound:
- Easy to review and highly feasible: The benefit of shoot-and-forget BDD is clear. The input is human language and the output is structured human language. This is basically pattern matching—exactly where ML excels.
- True black-box testing at scale: Achieving broad black-box test coverage has historically been expensive. Engineers writing tests for their own systems can fall into a principal–agent trap, and integration tests are often based on how we expect the system to work (i.e. white-box assumptions). Now, behavior scenarios written with minimal inside knowledge can cover many aspects of the system’s intended behavior, addressing the higher-level requirements (the upper parts of the V-model’s left wing).
- Faster, cheaper, iterative test creation: It becomes “write a new requirement and forget (the tests).” Teams can generate hundreds of test cases—something impossible to do manually within an Agile sprint. Because creating tests is so much faster (as easy as writing a user story), it’s now practical to do continuous and ad-hoc testing throughout development, not just plan a fixed test suite upfront.
Some teams report that AI-based observability platforms can even analyze real production logs and generate new test flows based on actual user behavior. This means the test suite can evolve as the product evolves, covering edge cases humans might overlook. Such exploratory testing becomes feasible because an AI can quickly take a new scenario description and produce a runnable test, which can then be executed and tracked.
Going through robust QA/QC for even a small system is still costly, and that fundamental truth won’t change. But ironically, the classic V-model works better with this practice—we actually enjoy the benefits of the V-model’s rigor. The more thorough our internal validation and verification, the less pressure on external QA/QC phases.
It turns out this approach is also traceable. Because these AI-generated tests originate from natural language requirements or user scenarios, they can be tied back to their source information. In a BDD approach, tests are written in a language that business stakeholders can read, ensuring each test case maps to a specific requirement or user story. LLMs enhance this by automating the generation of those BDD scenarios from the requirements themselves. The outcome is that every requirement can have one or many corresponding black-box tests, and if a requirement changes, new tests can be generated just as easily. This achieves something like a traceability matrix (linking formal requirements to JIRA tickets, GitHub issues, feature files, and releases), which was a core goal of the V-model—now achieved with far less manual toil.
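As a sketch of what one such traceability link can look like (the schema, identifiers, and URLs below are invented for illustration, not a specific tool's format):

```python
from dataclasses import dataclass, field


@dataclass
class TraceLink:
    """One row of a lightweight traceability matrix (illustrative schema only)."""
    requirement_id: str                      # formal requirement or user-story key
    issue_url: str                           # the GitHub issue / JIRA ticket it came from
    feature_file: str                        # the generated Gherkin .feature file
    scenarios: list[str] = field(default_factory=list)
    release: str | None = None               # release in which the behavior shipped


# Hypothetical example: a password-reset requirement traced end to end.
password_reset = TraceLink(
    requirement_id="REQ-142",
    issue_url="https://github.com/example/app/issues/87",
    feature_file="features/password_reset.feature",
    scenarios=["Password Reset"],
    release="2025.03",
)
```

When a requirement changes, regenerating the feature file and updating the corresponding row is cheap, which is what makes the matrix practical to keep current.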
Reliability
An important consideration when using LLMs for test generation is reliability. By default, an LLM’s output can be probabilistic or non-deterministic; the same prompt might yield slightly different test code on different runs, or an AI might “hallucinate” a test scenario that doesn’t exactly match the requirement. Relying on an LLM’s ad-hoc answers each time would be risky and hard to reproduce.
The solution—and a key philosophy we must adopt when using AI in software development—is to use the LLM to generate a reviewable artifact (such as a test script or specification) and then automate that artifact in the pipeline. Once the AI produces a test case, that test becomes part of the codebase—subject to code review, version control, and repeated execution. This approach ensures the software’s behavior is validated in a deterministic way, even though the AI that generated the test is nondeterministic. In essence, we get the creativity and speed of the AI combined with the rigorous repeatability of traditional automation. Industry practitioners emphasize this difference: LLM-based coding assistants may produce different outputs if prompted repeatedly, whereas a deterministic test generation tool will always produce the same output for the same input. By capturing the AI’s output as a fixed artifact, teams can eliminate the AI’s randomness from the testing process. The tests will run exactly the same in CI/CD every time, increasing trust in the results.
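A minimal sketch of this "generate once, commit, then automate the artifact" pattern, assuming a placeholder generate_scenarios() wrapper around whatever LLM service is in use (the function and paths are hypothetical):

```python
# regenerate_feature.py - run by a human (or a bot opening a PR), never by CI.
from pathlib import Path


def generate_scenarios(requirement_text: str) -> str:
    """Placeholder for an LLM call that returns Gherkin text for a requirement.

    The real implementation would call your LLM provider of choice; its output
    is nondeterministic, which is exactly why we snapshot it into the repo.
    """
    raise NotImplementedError


def snapshot_feature(requirement_text: str, target: Path) -> None:
    """Write the generated scenarios to a .feature file tracked by git."""
    gherkin = generate_scenarios(requirement_text)
    target.write_text(gherkin, encoding="utf-8")
    # From here on, code review and version control own the artifact.
    # CI only ever runs the committed file, so test runs stay deterministic.


if __name__ == "__main__":
    snapshot_feature(
        "It should be able to reset a password using the registered email",
        Path("features/password_reset.feature"),
    )
```

The nondeterministic step happens exactly once, offline; the pipeline then executes the reviewed .feature file the same way on every run.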
Micro-Economics, Game-Theoretic Analysis
From a ledger perspective, AI-driven testing yields obvious benefits: costs go down and value goes up. On the cost side, automating test-case generation and maintenance slashes the human effort needed for comprehensive testing.
Status Quo
From a game-theoretic standpoint (viewing team interactions as a strategic game), AI is changing the “payoff matrix” for sharing knowledge vs. hoarding it. Historically, an engineer might gain a form of job security or influence by being the only one who understands a critical component (holding a knowledge silo). This could create an incentive to guard information—a non-cooperative strategy to ensure one’s importance. Meanwhile, a PM had little choice but to trust the engineer’s estimates and explanations, operating at an informational disadvantage. This scenario is akin to an asymmetric game where one player has more information and thus more power. Such asymmetry can breed mistrust or suboptimal outcomes (like overly padded estimates or missing customer needs).
Impact of AI-driven Transparency
If the PM can ask an LLM to explain the code or generate an alternative solution, the information asymmetry diminishes. The engineer no longer gains by hoarding knowledge; in fact, since the PM can get a second opinion from AI, the engineer now has incentives to be more forthcoming and collaborative to maintain trust. In game theory terms, the interaction moves closer to a symmetric information game, which supports a more cooperative equilibrium. When all players have more equal access to information, strategies that involve deception or withholding are far less viable because they can be discovered or worked around. The stable strategy becomes collaboration: everyone shares and works together because that produces the best collective outcome, and there’s less advantage in going solo. Essentially, AI tools help make certain knowledge common to all (or at least much easier to obtain), and common knowledge is a known facilitator of coordination in game theory.
In summary, this means a fundamental shift in incentives for each role:
- Before AI Adoption (traditional setup): The developer’s best move was often to keep expertise and information closely guarded (maintain a knowledge silo) because the product owner or manager had no easy way to verify technical claims. The PM was forced to trust the developer’s statements and estimates, often operating with incomplete information and little leverage.
- After AI Adoption (AI-driven transparency): Now the developer gains little by hoarding knowledge—any attempt to do so can be quickly uncovered or bypassed by AI analysis. Instead, the developer is incentivized to collaborate and share, since the PM can and will verify specifics with AI if needed. The PM no longer has to fly blind; they can independently inspect code or generate tests using AI, leading to a more transparent, trust-based working relationship.
Interactive Simulation: Adjust each slider to set the AI adoption level for Dev, PM, QA, and TPM. The payoff matrix updates instantly and highlights Nash equilibrium rows.
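For readers without the interactive version, here is a stripped-down Python sketch of the kind of game the simulation plays out; the payoff numbers are invented purely to illustrate how rising AI adoption erodes the value of hoarding knowledge:

```python
from itertools import product

DEV_MOVES = ("hoard", "share")
PM_MOVES = ("trust", "verify_with_ai")


def payoffs(dev: str, pm: str, ai_adoption: float) -> tuple[float, float]:
    """Toy (dev, pm) payoffs; the numbers are illustrative, not measured data.

    Hoarding pays the developer only while the PM cannot check claims.
    As ai_adoption rises, verification gets cheap and hoarding gets exposed.
    """
    if dev == "hoard":
        dev_gain = 3.0 * (1.0 - ai_adoption)        # silo value shrinks with AI
        pm_gain = 1.0 if pm == "trust" else 2.0 * ai_adoption
        if pm == "verify_with_ai":
            dev_gain -= 2.0 * ai_adoption           # exposed hoarding costs trust
    else:  # share
        dev_gain = 2.0 + ai_adoption                # shared context compounds with AI
        pm_gain = 2.0 + 2.0 * ai_adoption
    return dev_gain, pm_gain


def pure_nash_equilibria(ai_adoption: float) -> list[tuple[str, str]]:
    """Return strategy pairs where neither player gains by deviating alone."""
    equilibria = []
    for dev, pm in product(DEV_MOVES, PM_MOVES):
        d0, p0 = payoffs(dev, pm, ai_adoption)
        dev_ok = all(payoffs(alt, pm, ai_adoption)[0] <= d0 for alt in DEV_MOVES)
        pm_ok = all(payoffs(dev, alt, ai_adoption)[1] <= p0 for alt in PM_MOVES)
        if dev_ok and pm_ok:
            equilibria.append((dev, pm))
    return equilibria


for level in (0.0, 0.5, 1.0):
    print(level, pure_nash_equilibria(level))
```

Run as-is, the toy model prints hoarding (with the PM forced to trust) as the lone equilibrium at zero adoption, and sharing as the only equilibrium behavior once adoption is moderate to high, which is the qualitative shift described above.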
Back to AI for SE: Attacking the Principal–Agent Problem
Another way to view the incentive shift is through principal–agent theory. The principal–agent problem arises largely from misaligned goals and information gaps. AI4SE attacks this by closing the information gap. The “principal” (say, a product owner or engineering manager) can verify and even do parts of the “agent’s” work independently with AI’s aid, increasing transparency. The agent (engineer) knows that the principal has more visibility now (for example, the PM could run an independent AI analysis to review recent code changes), which discourages any temptation to shirk or mislead. In essence, LLMs act as a real-time monitoring and enabling mechanism; they reduce the need for heavy oversight or blind trust because the knowledge to evaluate work is accessible on demand. Monitoring costs drop and trust can build. Ideally, this leads to better alignment: everyone is working toward the same goal with the same understanding, rather than guarding their own turf.
Team Dynamics and Equilibria in the AI-Assisted Era
As AI levels the playing field, we may witness a shift toward flatter team structures and new collaborative equilibria. Practically, this means team interactions become more about jointly solving problems and less about negotiating hand-offs or protecting domains. Engineers, PMs, QA, and other roles find their day-to-day work involves more overlap and shared language.
- Convergence of Roles (Multi-skilled Teams): While each team member still focuses on their specialty, the skills and activities of different roles now overlap much more. Over time, each team member becomes more T-shaped—deep in their specialty but able to contribute across a broad range of tasks. The equilibrium is a team of multi-skilled individuals each supported by AI, rather than strictly siloed specialists. This can increase mutual respect and understanding, as everyone has at least a basic grasp of others’ work (with the AI as their on-demand tutor).
- Incentives to Share Knowledge: With AI agents able to capture and distribute knowledge (e.g. summarizing a design into documentation, or answering questions in a chat), hoarding information makes much less sense. Teams will gravitate towards open information-sharing norms. I anticipate new incentives (perhaps set by management or company culture) that reward collaboration and teaching. In game-theory terms, cooperation becomes the dominant strategy: if one team member tries to keep critical knowledge to themselves, they’ll be quickly outpaced by teams that share and thus move faster—whether they like it or not.
- Leadership and Management Changes: As hierarchy is flattened by AI-enabled transparency, the role of a manager shifts from controlling information flow to enabling and coaching. Middle management, in particular, can be streamlined; fewer “translation layers” are needed when AI helps executives, managers, and engineers communicate directly and clearly. Industry observers have noted that AI-driven tools allow businesses to operate with leaner management structures by lowering the costs of acquiring, processing, and verifying information. I see a lot of potential for automating middle managers’ traditional tasks (gathering status updates, preparing reports, relaying information) through AI-generated reports and templates. Managers, freed from those duties, will focus more on setting direction, defining success metrics, and developing the team’s skills. The hierarchy becomes flatter as one manager can oversee a larger team with help from AI, and decision-making chains shorten. In essence, leadership becomes more about guiding a well-informed team than micromanaging tasks. (Indeed, Harvard Business Review and others have discussed how AI might redefine managerial roles, potentially eliminating some layers of hierarchy and transforming leadership into a more facilitative role.)
- Tension Shift: Ultimately, with LLMs integrated into workflows, the team achieves a more cooperative equilibrium. Everyone has access to the information they need (or can get it with AI), and everyone can contribute to solving the problem at hand, albeit in different ways. This changes old tensions; for example, the classic dev vs. test “us vs. them” mentality fades when developers use AI to generate tests and testers use AI to understand code. Ideally, the new equilibrium is a positive-sum game: the combined output of an AI-augmented, collaborative team is greater than before, which incentivizes continued cooperation. If any member deviates (say, an engineer refuses to use AI assistance and thus slows down the team), they’ll feel pressure to adapt because the rest of the team is moving faster with new tools. Over time, we expect norms to solidify around AI-augmented collaboration, much like how norms solidified around version control or agile ceremonies in earlier eras. Teams that embrace the technology and new ways of working will outperform those that don’t, reinforcing the trend (and likely forcing laggards to catch up).
Of course, challenges remain. There is a risk that if “anyone can code” with AI, then coding might become commoditized and the craft of software engineering could lose some status or bargaining power. One engineer-blogger mused whether widespread LLM adoption could lead to a form of de-skilling of programming. In other words, programmers wouldn’t become less skilled per se, but the job might be perceived as less of a specialized craft, potentially reducing its reward (pay, prestige) in the long run. The worry is that companies might hire more “prompt engineers” or citizen developers at lower cost, while expert software craftsmen become less differentiated. And if misused, an overdose of generated code can dramatically decrease software quality.
On the other hand, truly expert engineers may be even more in demand for the complex, critical, and highly technical tasks—analogous to how anyone can shoot a short video today, but not everyone can direct a blockbuster film. It is hard to predict exactly how the talent market will shape up.
Conclusion
AI4SE, driven by powerful LLMs, is transforming the software development landscape in ways that upend traditional models like the V-model. By automating and accelerating tasks, AI makes formerly sequential, costly activities (like black-box testing turned into shoot-and-forget BDD) continuous, cheap, and richly informative. By translating between natural language and code, AI enables people outside of engineering to contribute directly to technical work, reducing knowledge silos. Management theory suggests that this empowerment and transparency will flatten hierarchies and align incentives, while game theory implies that teams will settle into more cooperative and efficient patterns when information asymmetry is reduced. We are essentially witnessing the software process become more fluid and equitable—without losing the discipline of verification and validation that models like the V aimed to ensure.
In practical terms, embracing AI4SE means evolving how we collaborate and manage. Teams that leverage LLMs for testing can achieve higher quality at lower cost, with tests evolving alongside the software. Non-technical stakeholders, armed with AI copilots, can inject their domain knowledge directly into the development process, resulting in products that better fit user needs (and faster feedback loops when they don’t). Engineers and technical leads, rather than feeling undermined, must focus on the sophisticated challenges and act as coordinators of ideas coming from many quarters. The economics of development shift: mundane work is automated, so the scarce resource is no longer coding hours but human creativity and strategic thinking. This will likely change how we measure productivity and how we reward team contributions, putting more emphasis on design, innovation, and coordination.
While the transformation is still underway, the trajectory is clear. AI4SE is not just an efficiency booster or academic concept—it’s a catalyst for a more inclusive and collaborative engineering culture. Much like DevOps broke down silos between development and operations, AI-assisted development breaks down silos between technical and non-technical contributors, between planning and testing, and even between management and execution. Organizations that understand and harness this will foster teams that are both highly innovative and disciplined—a new competitive equilibrium in the industry.
The traditional V-model isn’t so much discarded as it is augmented and iterated upon. Verification and validation steps still happen, but they are woven throughout the process with AI as an ever-present assistant. Requirements, development, and testing all converse in real-time via natural language and code generation. This makes the development lifecycle more like a continuous loop than a strict V shape—perhaps an evolving spiral of constant refinement and feedback. For tech leaders and managers, the message is clear: leveraging AI4SE enables your teams to move faster and smarter. It means rethinking roles, training staff to work alongside AI, and fostering an environment where human creativity and AI capability unite. Those who embrace this transformation stand to deliver higher-quality software, align teams more closely with business goals, and ultimately create more value in a competitive marketplace. The future of software development will be written by human–AI teams—and those teams are already reshaping the process today.
Key Takeaways:
- Continuous, AI-Driven Testing: Black-box testing is becoming inexpensive and ongoing. LLMs can generate and even execute tests from natural language specs throughout development, improving quality and keeping requirements and tests in sync. Testing shifts from a late-phase cost to a continuous activity, catching issues early when they are cheaper to fix.
- Empowered Stakeholders: AI tools enable non-engineers to directly create or modify technical artifacts. Product managers can prototype features or derive tests from user stories, and domain experts can query system behavior without writing code. This flattens team structure and lets knowledge from any source flow into the product more easily.
- Reduced Knowledge Asymmetry: By providing on-demand expertise, LLMs reduce technical gatekeeping. Information once confined to specialists is now accessible to all team members (e.g. an LLM explaining a module in plain English). With less information asymmetry, team incentives shift toward transparency and trust—a more “one team” culture where everyone works with the same facts.
- New Team Equilibria: As AI levels the field, teams reach a new balance where collaboration is the norm. Engineers focus on complex problems and architecture, QA ensures AI-generated tests truly capture business intent, and managers orchestrate rather than dictate. The result is a highly collaborative, cross-functional workflow where AI handles grunt work and humans focus on creativity and strategy. Overall team productivity and innovation increase, benefiting everyone.
- The first and the last surname gTLD, *.kim*
It has been a while since I bought the first top-level domain for a family name, .kim. I did some research, and .kim will most likely be the only one for a very long time. Obviously, I own milhan.kim. For years, https://milhan.kim had redirected to my LinkedIn profile.
Today marks the first day this domain starts to play its role.
🎉 initial commit