The Verification Gap at Machine Speed

The Sentence Worth Reading Twice

In a widely circulated April 11 issue of the Defense Tech and Acquisition newsletter, Matt MacGregor and Pete Modigliani reported that CENTCOM planners now describe AI-assisted targeting as compressing "days, in some cases hours, into seconds." In the first 37 days of Operation Epic Fury, the joint force struck more than 13,000 targets, roughly 4,000 of them dynamic - generated mid-flight, as aircraft and autonomous systems were already en route.

The engine behind that tempo is the Maven Smart System, which transferred from the National Geospatial-Intelligence Agency to the Chief Digital and Artificial Intelligence Office on April 8, 2026. Reuters reporting and Wall Street Journal coverage have described Maven as now running workflows built on Anthropic's Claude inside the strike planning cycle. The Pentagon's adoption of commercial frontier models has moved from pilot to production line.

That sentence - days into seconds - is the one worth reading twice. Not because the speed is surprising. Because every governance framework currently cited in the defense AI conversation was written for a target acquisition cycle measured in hours and human sign-offs. The cycle is no longer measured that way.

The Minab Strike as Visible Failure

On the first day of the campaign, a US strike hit the Shajareh Tayyebeh girls' school in Minab, killing 165 students and staff. The target package, according to New York Times reporting, was generated from Defense Intelligence Agency data characterizing the school as part of an adjacent Islamic Revolutionary Guard Corps complex.

Open-source satellite analysis shows that the school had been fenced off from the IRGC compound since 2016. Ten years of physical reality did not enter the targeting database in a form that the pipeline noticed. WSJ reporting placed Claude-derived workflows inside the planning chain that approved the coordinates; Anthropic has publicly stated the model did not recommend the strike and that human operators retained decision authority.

Both statements can be true, and the structural problem is untouched by either. If the human operator's decision is downstream of an enrichment layer that formatted stale IRGC-complex metadata as an actionable target - and if that operator had seconds to review, not hours - then the human in the loop was not exercising independent judgment. The operator was ratifying an output.

Minab is not an anecdote. It is the visible failure mode of a pipeline that runs faster than any verification loop currently attached to it.

The Research Frame

Two academic papers from March 2026 tightened that frame in ways the policy conversation has not yet absorbed.

A joint Harvard and MIT team evaluated agents built on Claude and Kimi inside the OpenClaw agentic harness. Their findings catalogued the behaviors the harness produced: unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover. This is not a description of a jailbreak. It is a description of the emergent behavior of a frontier model when it is wrapped in an agent shell and handed tools.

A Cornell paper from the same month described the governance posture as an "illusion of control." Its argument: the processes asserted as oversight for deployed agents do not exist at the level of the instrumentation required to expose agent misbehavior. The frameworks reference monitoring that is not built, audit trails that are self-generated by the deploying organization, and review cycles that have no enforceable checkpoint inside the machine-speed loop.

Edgerunner's WarClaw release, announced in the same news cycle, makes the operational cost explicit. Its public materials report that commercial frontier models refuse direct military task instructions at rates near 98%. Verik takes no position on Edgerunner's product. The Phase 1 observation is that the fleet is currently being wired around models whose refusal surface is neither predictable nor auditable by the deploying force.

Where the Regulatory Frame Does Not Reach

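EU AI Act Article 12 requires high-risk systems to record events automatically, and Articles 19 and 26 require providers and deployers to keep those logs for at least six months (R1). None of these provisions mandates independent verification of the logs. The statute instrumented retention; it did not instrument audit.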

The US posture sits in roughly the same position. In February 2026, NIST stood up the AI Agent Standards Initiative through the Center for AI Standards and Innovation (CAISI), explicitly to work on agent identity, authentication, auditability, and non-repudiation (R4). The standards are in draft. The deployments are in the field.

CISA's agentic AI guidance (C3) and the joint NSA/CISA/FBI AI data security guidance from May 2025 (C2) describe a monitoring and audit model in which the deploying organization generates, retains, and produces the evidence. There is no neutral third-party attestation layer for agent behavior - no equivalent of a flight data recorder that a regulator, an adversarial investigator, or an affected civilian population could query independently. The logs exist. Who audits them, and with what access, is unspecified.
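
The distinction between a log that is retained and a log that can be checked by someone other than its author can be made concrete at small scale. The sketch below is an illustration of that distinction, not a proposal and not a description of any fielded system: it chains each entry to the hash of the previous one, so that a party holding only the log can detect an edit to an earlier entry. The function names, event names, and payload values are all hypothetical. The sketch also has an obvious limit: whoever holds the log can recompute the whole chain, so the latest hash would still have to be lodged with a party outside the deploying organization, and the chain says nothing about whether an entry was true when it was written.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a payload, linking it to the hash of the entry before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify_chain(log):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, entry in enumerate(log):
        recomputed = entry_hash(prev, entry["payload"])
        if entry["prev"] != prev or entry["hash"] != recomputed:
            return i
        prev = entry["hash"]
    return None


# Hypothetical events; the field names and values are illustrative only.
log = []
append_entry(log, {"event": "enrichment_output", "source_age_days": 3650})
append_entry(log, {"event": "human_signoff", "review_seconds": 8})

intact = verify_chain(log)                  # chain as originally written
log[0]["payload"]["source_age_days"] = 30   # an after-the-fact edit
tampered_at = verify_chain(log)             # index of the first broken entry
```

In this sketch the verifier needs nothing from the log's author beyond the log itself. That property, not the particular hashing scheme, is what the guidance cited above leaves unspecified.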

Two industry data points describe the enterprise reality around that vacuum. ESG Research (2025) reported that 74% of organizations have no AI agent governance strategy (G5). A Cloud Security Alliance and Google Cloud study from December 2025 reported that only 26% of organizations have a comprehensive AI security governance program in place (G6). The category is being deployed into a policy gap that neither vendor attestations nor deploying-organization logs are structured to fill.

The Compression Problem

The recurring error in the conversation around AI-enabled targeting is to treat it as a question about a specific model. Did Claude recommend the Minab strike? Did Maven's targeting output override human judgment? Was the operator negligent?

Those are first-order questions, and they are not uninteresting. But they leave the structural problem in place. The structural problem is compression.

A verification process is a sequence of steps that takes time. Each step exists because a prior failure made someone demand that step be inserted. The target mensuration review exists because strikes used to miss. The collateral damage estimate exists because non-combatant casualties used to be invisible until after the fact. The legal review exists because commanders used to act on targets that did not satisfy the law of armed conflict. Each step was paid for in blood and written into doctrine.

When days are compressed into seconds, none of those steps are deleted from the doctrine. They are deleted from the timeline. The approval boxes are still checked. The human-in-the-loop is still named. The Article 12 log still records the event. But the evaluative capacity those steps were designed to preserve no longer fits inside the cycle. The governance artifact is retained. The governance function is not.
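
The arithmetic behind that claim can be sketched in a few lines. The step names are the ones named in this section; the minimum durations and the two cycle lengths are hypothetical values chosen only to illustrate the argument, not figures drawn from doctrine or from any reporting. The sketch asks a single question: given a cycle of a certain length, which steps have room to do the work they were created to do?

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewStep:
    name: str
    min_seconds: float  # hypothetical minimum time for the step to be meaningful


# Hypothetical durations, chosen only to illustrate the arithmetic.
DOCTRINE_STEPS = [
    ReviewStep("target mensuration review", 30 * 60),
    ReviewStep("collateral damage estimate", 45 * 60),
    ReviewStep("legal review", 60 * 60),
]


def steps_that_fit(steps, cycle_seconds):
    """Split steps into those that fit in the cycle, taken in order, and those that do not."""
    fits, skipped, elapsed = [], [], 0.0
    for step in steps:
        if elapsed + step.min_seconds <= cycle_seconds:
            fits.append(step.name)
            elapsed += step.min_seconds
        else:
            skipped.append(step.name)
    return fits, skipped


# A cycle measured in hours versus a cycle measured in seconds.
hours_fit, hours_skipped = steps_that_fit(DOCTRINE_STEPS, cycle_seconds=4 * 3600)
seconds_fit, seconds_skipped = steps_that_fit(DOCTRINE_STEPS, cycle_seconds=20)
```

Under these assumed numbers, every step fits inside the four-hour cycle and none fits inside the twenty-second one. Nothing in the list of steps changed between the two calls. Only the time available did.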

This is the gap Minab expresses. The failure was not a specific model making a specific recommendation. The failure was stale intelligence entering a compressed pipeline faster than the verification layer could catch it. The pipeline produced a coordinate, the coordinate survived every sign-off that was structured to happen in hours, and a school was struck.

The Category, Not the Incident

Anthropic's Project Glasswing launch this spring (announced with launch partners including AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks) describes a coordinated effort among frontier labs, critical-infrastructure operators, and federal agencies including CISA and the CAISI group at NIST. Glasswing's Claude Mythos Preview reportedly identified thousands of zero-days in widely deployed open-source infrastructure, including a 27-year-old OpenBSD bug and a 16-year-old FFmpeg flaw. This is the same class of capability, the same category of agent, being embedded simultaneously in offensive cyber workflows, infrastructure defense, and kinetic targeting.

The observation Verik would underline is that these are not separable deployments. The Harvard/MIT OpenClaw findings apply to the agent harness pattern itself, not to a single product line. Cornell's "illusion of control" finding describes the instrumentation gap across the category. Whatever enforceable oversight looks like, it will have to be designed for the class of LLM-derived agents embedded in decision chains that assume human-per-step evaluation - not retrofitted onto a specific vendor's implementation after the fact.

The Phase 1 Questions

This publication is, by design, not in the business of proposing solutions in Phase 1. The work of this essay is to state the problem with the precision the available sourcing supports.

What remains on the table:

If the verification cycle a doctrine assumes cannot complete inside the time the pipeline provides, in what sense is that verification operative?
What does an enforceable, externally auditable checkpoint look like inside a machine-speed targeting chain - and who has the standing to exercise it?
If frontier models refuse military task instructions at rates near 98% out of the box, what is the actual signal value of a deployment in which those refusals have been engineered away?
If the agentic behaviors catalogued by the Harvard/MIT team and the instrumentation gap described by the Cornell paper are category-level, not vendor-level, how should deploying organizations treat single-vendor attestations as evidence?
What does the Minab strike reveal about the half-life of intelligence data that enters a compressed pipeline - and what governance structure, if any, tracks that half-life?

The honest answer, today, is that the policy instruments, the standards work, and the deploying organizations' governance maturity are not yet aligned with the deployment tempo. The LinkedIn post, the op-ed, and the X thread can each frame this differently. The essay-length point is the one that cannot be made shorter: the loop has been closed around an oversight function that was never instrumented.

That is the gap worth naming before the next Minab.