LIGHTS OUT

In the software cycle, the factory floor is going dark. Not from failure, but from the quiet removal of the need for anyone to be there at all.

By Marcelo Kanhan. Photography: Amino and Nika Dominguez via Lummi.ai

Section I

The Threshold

Between 2020 and 2024, AI coding assistance improved steadily but predictably. Each generation did more — finishing a line, then a function, then a full module — but the workflow stayed the same: a human wrote, a machine suggested, a human decided whether the suggestion was any good. The AI was useful the way a fast intern is useful. It saved time. It didn’t change what the job was.

Sometime in late 2025, that gradual improvement stopped being gradual. The industry’s most rigorous benchmark, SWE-bench Verified, tests AI models against 500 real software problems drawn from actual open-source projects, evaluated not on whether the code compiles but on whether it solves the problem — in an isolated environment, with no human guidance, against production-quality standards. By March 2026, the top models clustered above 80%, resolving four out of five real-world software problems correctly on the first attempt. Six frontier models from different companies, all within a percentage point of each other — as if the industry had converged on a shared ceiling.

Then, in April 2026, Anthropic released Claude Mythos Preview and the ceiling moved again: 93.9% on SWE-bench Verified, 77.8% on the harder SWE-bench Pro that replaced it. Mythos wasn’t released commercially — Anthropic withheld it from the public, deploying it exclusively for cybersecurity defense through Project Glasswing, a coalition of twelve major technology companies. The model had autonomously discovered thousands of previously unknown security vulnerabilities in every major operating system and every major web browser, including one bug in OpenBSD that had been present for 27 years. The capability was too dangerous to sell and too valuable not to use.

The trajectory is clear even if the specific numbers will be revised: AI-generated code that works on first pass has become the baseline expectation, not the aspiration. The models haven’t just gotten better at writing code; they’ve crossed a threshold where the question is no longer “can the machine write this?” but “should a human bother to?”

Lights on
A senior engineer in 2023 used AI to draft perhaps 30% of the code, reviewed all of it line by line, and was responsible for every character that shipped. The AI was a fast assistant that needed supervision on every output.
Lights off
A senior engineer in 2026 writes the specification, designs the test scenario, and watches a score. The question has shifted from “does this code look right?” to “did this behavior satisfy the specification I wrote?” The judgment hasn’t disappeared; it has moved upstream, from the code to the intent behind it.
Section II

The Agentic Loop

Once code works reliably on first pass, the next question becomes inevitable: what happens when the machine doesn’t just write a piece of the software but runs the entire cycle — writing, testing, failing, fixing, and shipping — without a human reviewing what happens in between?

Most people still picture AI-assisted development as a programmer at a keyboard with a very good autocomplete: the human writes, the AI suggests, the human reviews, the human ships. That picture is already obsolete. In July 2025, StrongDM — a production security company serving real enterprise customers, not a research lab — began building software under two rules that its CTO Justin McCarthy stated as a charter: “Code must not be written by humans. Code must not be reviewed by humans.” Three engineers produced roughly 32,000 lines of production security software under those constraints by February 2026, and the software is running in production with paying customers.

The interesting question is not whether this works — it demonstrably does — but how you validate software that no person has ever read. StrongDM’s answer reveals the mechanics of lights-out production. They write what they call scenarios: descriptions of what the software should do from a user’s perspective, stored outside the codebase and deliberately hidden from the AI during development — functioning as holdout tests that prevent the system from optimizing for the test rather than the behavior. To verify integrations, they built a “Digital Twin Universe: behavioral clones of the third-party services our software depends on — replicating their APIs, edge cases, and observable behaviors.” McCarthy’s metric for capacity: “If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.”
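The holdout idea can be made concrete with a small sketch. This is not StrongDM’s implementation: the `Scenario` type, the `score` function, and the username example are all hypothetical, chosen only to show scenarios that live apart from the code they judge and that yield a score rather than a code review.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types for illustration; not taken from any real tool.
Candidate = Callable[[str], str]


@dataclass(frozen=True)
class Scenario:
    """A user-level expectation stored where the coding agent cannot read it."""
    name: str
    check: Callable[[Candidate], bool]


def raises_value_error(func: Candidate, arg: str) -> bool:
    """Return True if calling func(arg) raises ValueError."""
    try:
        func(arg)
    except ValueError:
        return True
    return False


def score(candidate: Candidate, scenarios: Sequence[Scenario]) -> float:
    """Fraction of holdout scenarios the candidate satisfies."""
    if not scenarios:
        raise ValueError("at least one scenario is required")
    passed = sum(1 for s in scenarios if s.check(candidate))
    return passed / len(scenarios)


# Stand-in for agent-generated code that no person has read.
def candidate_normalize(username: str) -> str:
    return username.strip().lower()


# The holdout set: the agent was never shown these checks.
HOLDOUT = [
    Scenario("trims whitespace", lambda f: f("  alice ") == "alice"),
    Scenario("lowercases input", lambda f: f("Bob") == "bob"),
    Scenario("rejects empty names", lambda f: raises_value_error(f, "")),
]

result = score(candidate_normalize, HOLDOUT)
```

In this toy run the candidate satisfies two of the three scenarios. The miss (empty names are accepted instead of rejected) never surfaces as a line of code someone reads; it surfaces only as a lower score, which is the signal the engineers act on.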

This pattern is the agentic loop. Claude Code reads the full codebase, plans across files, executes, runs tests, iterates on failures, and commits fixes. OpenAI’s Codex commits code and comments on pull requests without human intervention. Cursor 3, launched in April 2026, rebuilt its interface around the premise — what its developers describe as “a bet that you’ll manage agents, not write code.” Three architectures converging on the same conclusion: the human reviews the outcome, not the process.
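Stripped of tooling, the loop described above has a simple shape. The sketch below is an assumption-laden illustration, not the internals of Claude Code, Codex, or Cursor: `agentic_loop`, `stub_propose`, and `stub_run_tests` are hypothetical names, and the “model” is a stub that writes a buggy first draft and corrects it once it sees failing tests.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Callable[[int, int], int]


def agentic_loop(
    propose: Callable[[List[str]], Candidate],
    run_tests: Callable[[Candidate], List[str]],
    max_iterations: int = 5,
) -> Tuple[Optional[Candidate], int]:
    """Propose, test, feed failures back; stop on a green run or when the budget ends."""
    feedback: List[str] = []
    for attempt in range(1, max_iterations + 1):
        candidate = propose(feedback)
        failures = run_tests(candidate)
        if not failures:
            return candidate, attempt  # the point where a real agent would commit
        feedback = failures
    return None, max_iterations


# Hypothetical stand-in for a model: buggy first draft, corrected after feedback.
def stub_propose(feedback: List[str]) -> Candidate:
    if not feedback:
        return lambda a, b: a - b
    return lambda a, b: a + b


def stub_run_tests(candidate: Candidate) -> List[str]:
    """Return failure messages; an empty list means every test passed."""
    failures = []
    if candidate(2, 3) != 5:
        failures.append("add(2, 3) should be 5")
    if candidate(0, 0) != 0:
        failures.append("add(0, 0) should be 0")
    return failures


accepted, attempts = agentic_loop(stub_propose, stub_run_tests)
```

The human-facing output is the pair `(accepted, attempts)`: whether the loop reached a passing state and how many tries it took. Nothing in that result requires anyone to read the candidate itself, which is the sense in which the human reviews the outcome rather than the process.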

Eran Kahana at Stanford Law’s CodeX project framed the legal dimension: “When software is ‘grown’ rather than written, when replicas stand in for real services, and when quality is measured by probability rather than certainty, who is responsible?”

Lights on
A development team ships a release after two weeks of planning, writing, reviewing, testing, and a carefully scheduled deploy with rollback plans.
Lights off
An agent fleet resolves twelve issues overnight, each tested against behavioral scenarios and validated by a digital twin. The engineers review a dashboard of scores in the morning. The code was never read by a person.
Section III

Software Without Versions

When the agentic loop writes, tests, fixes, and deploys in a continuous cycle, one of the oldest conventions in software starts to lose its purpose: the version number.

Software has traditionally moved through named states — version 1.0, version 2.0, the quarterly release, the carefully scheduled Tuesday deploy. The version was a punctuation mark, a moment of declared stability where the system paused and said: this is what we have, this is what we intend. The convention made sense because software changed slowly, expensively, and with human review at every step.

The agentic loop removes the rationale for stability-by-punctuation. In modern software operations, CI/CD — continuous integration and continuous delivery, the practice of merging, testing, and deploying code automatically and frequently rather than in scheduled releases — has already compressed the release cycle from months to hours. When AI agents are embedded in that pipeline, the cycle compresses further: the system runs, identifies problems, fixes them, tests the fixes, and deploys the improvements in a flow that doesn’t pause for a human to declare a version number.

“We will continue to invest in the IDE until codebases are self-driving.” — Cursor

A self-driving codebase doesn’t stop at intersections to confirm direction with a human; it navigates continuously. The design question shifts from how to stabilize these systems to how to build for perpetual flux: what does integrity mean when the software running this afternoon is not the software that ran this morning?

Lights on
Version 4.2.1, released March 15. Changelog reviewed by three engineers. Documentation updated. The next release is in six weeks.
Lights off
The codebase has been improved 340 times since Monday. No version number was assigned. The test suites pass. The system is not the same system it was this morning, and that is how it’s supposed to work.
Section IV

Atomic Applications

When code generation approaches zero marginal cost, the logic of the durable application inverts — and this is the economic consequence of everything that precedes it.

The application-as-investment model worked because writing software was expensive: months of engineering time, testing cycles, deployment infrastructure. Adobe Creative Suite, Salesforce, SAP — durable, generic, designed for many users across many workflows. The installed base existed because building a specific alternative cost more than living with the compromise.

When a financial analyst can generate the exact interface needed for a particular scenario — once, for that use, discarded when done — the installed base as an economic concept loses its rationale. Anthropic and OpenAI both explore this direction: ephemeral UI, AI-generated interfaces that appear in context, serve their purpose, and vanish.

The counter-argument is worth taking seriously. Andreas Kirsch’s “Flawed Ephemeral Software Hypothesis” argues that making generation cheaper shifts the bottleneck to validation, integration, and ergonomics. Software isn’t ephemeral — it’s malleable. Both propositions appear true simultaneously at different layers. The economy of generation collapses; the economy of quality does not.

Lights on
Your company buys a project management tool for $15,000/year, spends three months configuring it, trains 200 employees, and lives with its model of how projects work for the next five years.
Lights off
A project manager describes what they need in natural language and receives a working interface in minutes — tailored to this team, this quarter, this specific set of deliverables. When the project ends, the interface is discarded.
Section V

The Human as Seed

Here is what the dark factory does not automate: wanting the right thing.

Everything described so far assumes a human upstream who specified what to build and how to verify it. The factory runs dark because the design office did its work precisely enough that the factory doesn’t need supervision. The question pressing against the boundary now is whether the design office itself can go dark.

In content marketing, it largely already has. Fully automated pipelines — where AI identifies keyword gaps, generates topic lists, writes articles, handles SEO, and publishes without human intervention — are commercially available and used at scale. The seed itself is synthetic, derived from search gap analysis and algorithmic trend detection rather than from a human with something to say.

The results are visible. “Slop” — AI-generated content recognizable by its blandness — became Merriam-Webster’s Word of the Year in 2025, the same year that an estimated 90% of content marketers began incorporating AI writing tools. The factory runs at scale, the lights are off, and the output is precisely what you’d expect from a system optimized for volume over substance.

In software, the frontier is different. “Spec-driven development” is emerging as a named paradigm. Addy Osmani advises developers to “pour your mentorship into the spec” and notes that “if the agent produces something that technically meets the spec but doesn’t feel right, trust your judgement.” Thoughtworks acknowledges “there’s not yet a systematic way to evaluate specs.”

The FANUC factory in Yamanashi has run lights-out since 2001 — robots building robots, 50 units per 24-hour shift, unsupervised for up to 30 days. The design office stays lit. The lights-out software factory follows the same logic.

What would a synthetic seed actually require? At minimum, three things. First, the ability to identify a genuine unmet need, not by analyzing keyword gaps but by understanding what people struggle with in practice. Second, the capacity to distinguish between a specification that technically satisfies a requirement and one that would actually be worth using: taste. Third, the willingness to say no, to decide that a particular application shouldn’t exist. This third capacity is the one that the Medvi case, examined in Section VI, demonstrates by its absence.

None of these are technically impossible. But they require a model of the world that extends beyond software into the texture of lived experience — and the distance between “possible in principle” and “reliable in practice” is the same distance that separates the design office from the factory floor.

Lights on
A product manager writes requirements based on years of user research, domain expertise, and the accumulated intuition of having watched people use — and curse at — the previous version.
Lights off
An AI identifies a market gap from usage data, generates a specification, and deploys a working application — all without a human deciding that this product should exist. The question is whether it would be worth using, or software slop.
Section VI

The Drive Beneath the Automation

The aspiration to build businesses that run without you predates AI by decades, and it is the economic motive that pulls the entire lights-out architecture forward — not technological enthusiasm but the oldest aspiration in capitalism: decouple revenue from labor.

1990s
Build a website, make money while you sleep
Affiliate marketing, early e-commerce, set-it-and-forget-it passive income sites
Real productivity gains wrapped in fantasy marketing

2010s
Build a funnel, make money while you sleep
Information products, Shopify dropshipping, the automated email sequence as business model
Bottleneck narrowed without being eliminated

2020s
Use AI, make money while you sleep
Automated content, AI customer service, code generation, the dark factory as business model
Fantasy approximately true for a narrowing set of cases

Matthew Gallagher spent two months and $20,000 to build Medvi, a GLP-1 telehealth provider, using more than a dozen AI tools. As profiled by Erin Griffith in the New York Times in April 2026, the company reported $401 million in revenue in its first full year and was projecting $1.8 billion for 2026. Two full-time employees. The architecture: AI wrote the code, produced the copy, generated ads, and managed customer service, while regulated infrastructure was outsourced to partners whose workforces don’t appear on Medvi’s headcount. (The $1.8 billion figure is a revenue projection, not a valuation.)

Then the reporting caught up. Multiple investigations found fabricated physician profiles and before-and-after images altered with AI. Gallagher acknowledged to the Times that the initial site used AI-generated photos. The FDA had issued a warning letter in February 2026. Lawsuits alleging deceptive marketing followed. A data breach through partner OpenLoop Health was reported, though the full scope remains contested.

The Medvi case is instructive because the dark factory ran precisely as designed: it generated at scale without meaningful human review. What it generated included fabricated credentials and marketing that triggered regulatory action. The lights went out not just on execution but on judgment. The seed was genuine. The factory amplified it faithfully, including the parts that needed a human to say “no, not like this.”

The drive beneath the automation is leverage. The factory runs dark because engineers love elegant systems and because capital loves margins, and these impulses reinforce each other. The question the Medvi case forces is whether “approximately true” is good enough when the factory has no conscience, only throughput.

Lights on
A telehealth company employs 200 people: doctors, engineers, compliance officers, customer service. Every ad is reviewed, every physician credentialed, every interaction has a human in the loop.
Lights off
A telehealth company employs two people and an AI stack. The doctors are outsourced, the code generated, the ads generated, the customer service generated. The factory runs at scale. Nobody checked whether the doctors were real.
Section VII

The Lit Room

The factory is dark. The question is whether the design office stays lit.

Everything in this article traces a single arc: AI-generated code crossed the threshold of first-pass reliability, which enabled agentic loops that write, test, and ship without human review, which made versioning an artifact of a slower era, which inverted the economics of durable applications toward disposable ones, which redefined the human role from operator to seed. Each step enabled the next; together they describe a cascade already in motion.

The frontier that has not been crossed is the seed itself. Spec-driven development assumes a human who knows what to build, for whom, and why. The content marketing industry has demonstrated that this assumption can be bypassed at scale — and that what emerges tends toward slop. Whether the same dynamic applies to software is the open question, and the answer determines whether the design office eventually goes dark too.

The consequences extend well beyond software. If the agentic loop can build and ship code autonomously, it can operate in any domain where work can be specified and output verified: legal documents, financial models, scientific literature, architectural design. The industries where human expertise justified human labor are precisely the industries where first-pass AI capability is reaching the threshold that software crossed in 2025.

For businesses, the immediate implication is not that employees become unnecessary but that the location of human value shifts. The organizations that continue to deploy humans where the factory can run dark are paying for reassurance, not quality. The organizations that move humans upstream — into specification, judgment, and the editorial decisions about what the factory should produce — are building for the architecture that’s emerging.

The factory runs dark. Now, the question worth focusing on is what keeps the lights on in the design office.

Lights on
A world where humans build, review, test, and ship — where every artifact passes through human hands and human judgment before it reaches a user.
Lights off
A world where humans decide what should exist, and machines handle everything else — where the value of a person is not what they can produce but what they can imagine, specify, and refuse to accept.

Follow the practice

Lights Out is part of an ongoing body of work by Marcelo Kanhan on technology, organization, and the human condition.
