The Benchmark That Measured Leakage

On June 22, 2026, Anmol Goel and Iryna Gurevych posted "Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?" to arXiv. The paper introduces AgentCIBench, an evaluation harness that converts a specific privacy risk into deterministically scored scenarios. The risk is not that computer-use agents fail at their tasks. Eleven of fifteen frontier agents tested completed their tasks. The risk is that they completed them in ways that moved information across context boundaries the user did not authorize.

The headline number is 67.9%. That is the average leakage rate across the 15 frontier agents evaluated. Eleven of those 15 agents leaked on more than half of the scenarios tested. The same failure rates persisted when agents acted end-to-end in the environment rather than in isolated test conditions. Capability did not constrain leakage. The agents were capable. The agents were careless.

What the paper's framework names

The theoretical grounding for the benchmark is Helen Nissenbaum's contextual integrity framework, which has been present in the privacy literature since 2004. That framework defines privacy not as secrecy but as appropriate information flow. Information flows appropriately when it moves in ways consistent with the norms of the context it originates in. A medical record shared between a patient and a treating physician flows appropriately. The same record shared with an employer does not, even if the employer is the patient's primary contact in a different context. The information is the same. The context is different. The appropriateness depends on the context.
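One way to make the framework concrete is to treat a flow as a small record and a context's norms as the set of sender and recipient pairs that context permits. The sketch below is illustrative only: the `Flow` fields, the `NORMS` table, and the `is_appropriate` helper are assumptions introduced here, not part of Nissenbaum's formulation or of AgentCIBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    info_type: str       # what is being shared, e.g. "medical_record"
    origin_context: str  # the context the information originated in
    sender_role: str
    recipient_role: str


# Hypothetical norms table: for each (info_type, origin_context), the
# (sender_role, recipient_role) pairs that the originating context permits.
NORMS = {
    ("medical_record", "healthcare"): {
        ("patient", "physician"),
        ("physician", "patient"),
    },
}


def is_appropriate(flow: Flow) -> bool:
    """A flow is appropriate only if the originating context's norms permit it."""
    allowed = NORMS.get((flow.info_type, flow.origin_context), set())
    return (flow.sender_role, flow.recipient_role) in allowed


to_physician = Flow("medical_record", "healthcare", "patient", "physician")
to_employer = Flow("medical_record", "healthcare", "patient", "employer")

print(is_appropriate(to_physician))  # True
print(is_appropriate(to_employer))   # False
```

The two flows carry the same `info_type`. Only the recipient role differs, and that alone changes the verdict.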

Computer-use agents operate across contexts by design. An agent managing email, calendar, and to-do lists on a user's behalf has access to information from all three applications simultaneously. The contextual integrity question is not whether the agent has access. The question is whether the agent moves information from one application's context into another in ways the user would not authorize if asked explicitly.

AgentCIBench operationalizes three failure modes that capture distinct mechanisms by which that boundary crossing occurs.

The first is visual co-location. The agent is working on a task in one application and the prohibited information is visually adjacent to the task target in the user interface. The agent sees both the target and the non-target. It pulls in the non-target because the non-target sits in the visual field of the task. The information leaked is not in a different application. It is in the same screen region as the authorized task object.

The second is task-ambiguity overshare. The user's prompt is under-specified. The agent resolves the ambiguity by providing a more complete response than the prompt requires, and that completeness includes personal state the user did not ask to share. The agent does not err by giving the wrong answer to the specified question. It errs by answering a broader question than the user asked.

The third is recipient misalignment. The agent sends content to an addressee for whom that content is inappropriate. The content itself may be correct. The agent's judgment about who should receive it is not.

The paper characterizes all three as failures that persist even when agents complete the stated task. The benchmark is designed to separate task completion from contextual appropriateness, which prior benchmarks had treated as a single measurement.
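The separation of the two measurements can be sketched as a scoring record with two independent fields. Everything here is a hypothetical illustration: the `ScenarioResult` type, the `summarize` helper, and the four toy results are made up for this example and are not the paper's harness or its data.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    VISUAL_COLOCATION = "visual co-location"
    TASK_AMBIGUITY_OVERSHARE = "task-ambiguity overshare"
    RECIPIENT_MISALIGNMENT = "recipient misalignment"


@dataclass(frozen=True)
class ScenarioResult:
    mode: FailureMode     # which failure mode the scenario probes
    task_completed: bool  # did the agent do what the user asked?
    leaked: bool          # did prohibited information cross a boundary?


def summarize(results):
    """Report completion and leakage as separate rates, plus their overlap."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "completion_rate": sum(r.task_completed for r in results) / n,
        "leakage_rate": sum(r.leaked for r in results) / n,
        "completed_and_leaked_rate": sum(
            r.task_completed and r.leaked for r in results
        ) / n,
    }


# Toy data, invented for illustration only.
toy_results = [
    ScenarioResult(FailureMode.VISUAL_COLOCATION, True, True),
    ScenarioResult(FailureMode.TASK_AMBIGUITY_OVERSHARE, True, True),
    ScenarioResult(FailureMode.RECIPIENT_MISALIGNMENT, True, False),
    ScenarioResult(FailureMode.VISUAL_COLOCATION, False, False),
]

summary = summarize(toy_results)
print(summary)
```

A single pass/fail score would hide the third number, which is the one the benchmark is built to expose: runs that succeed at the task and leak anyway.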

Where the leakage is not a capability failure

The 67.9% leakage rate should be read alongside the task completion rate. The agents tested are frontier systems by any current standard. They navigate complex user interfaces, manage multi-step tasks across applications, and produce outputs that satisfy the stated request. The benchmark is not measuring agents that fail to accomplish what the user asked. It is measuring agents that accomplish what the user asked in ways that expose information the user did not authorize to move.

This distinction matters for how governance instruments respond to the finding. If leakage were a capability failure, the remediation path would be capability improvement: train the model longer, refine the alignment objective. The May 18, 2026 paper by Jha and colleagues documented that 64.7% of agent rollouts on a representative benchmark produced meltdowns triggered by benign errors, and that more than half were not reported to the user. That is a failure of awareness: the agent does not know when it has failed. AgentCIBench's finding is structurally different. The agent knows it has completed the task. The failure is in the substrate around the task completion, not in the task completion itself.

The contextual integrity framework places the governance burden on the information flow, not on the information itself. The agent that leaks does not expose information it was not permitted to access. It exposes information it was permitted to access in a context where exposure is inappropriate. The authorization layer granted access. The contextual integrity layer, which in current deployments does not exist as an instrumented enforcement surface, would have governed the flow.

The formal methods runtime monitor architecture (P5) described by Alamdari, Klassen, and McIlraith offers one structural response: specify which information flows are permissible in linear temporal logic and install a runtime monitor that intervenes when the agent attempts an impermissible flow. That approach requires the permissible flows to be specified in advance. The three failure modes AgentCIBench identifies are not uniformly specifiable in advance because they depend on the specific user interface layout (visual co-location), the user's interpretation of their own prompt (task-ambiguity overshare), and the agent's inference about appropriate recipients (recipient misalignment). Two of the three failure modes require the monitor to know something about the user's intent that the monitor may not have.
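A minimal sketch of the kind of monitor this paragraph describes, restricted to a single safety property of the form "always, no impermissible flow" (in linear temporal logic, G ¬impermissible). It is not the implementation of Alamdari, Klassen, and McIlraith. The `Action` and `FlowMonitor` names, and the choice to model a flow as a (source context, destination) pair, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    source_context: str  # where the data was read from
    destination: str     # where the agent is trying to write or send it


class FlowMonitor:
    """Checks each attempted action against flows specified in advance."""

    def __init__(self, permitted):
        # Set of (source_context, destination) pairs declared permissible.
        self.permitted = frozenset(permitted)
        self.blocked = []

    def step(self, action: Action) -> bool:
        """Return True to let the action proceed, False to intervene."""
        ok = (action.source_context, action.destination) in self.permitted
        if not ok:
            self.blocked.append(action)
        return ok


monitor = FlowMonitor({("calendar", "email:manager"), ("todo", "calendar")})
trace = [
    Action("calendar", "email:manager"),  # specified in advance: allowed
    Action("email", "email:manager"),     # not specified: blocked
    Action("todo", "calendar"),           # specified in advance: allowed
]
decisions = [monitor.step(a) for a in trace]
print(decisions)  # [True, False, True]
```

The sketch also shows the limit the paragraph points to. The first action passes because the calendar-to-manager flow is permitted, regardless of which calendar entries the agent actually included. A monitor at this granularity cannot see a visual co-location or overshare leak that stays inside a permitted pair.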

The Five Eyes June 22 joint statement (G48, V032) named "assume compromise" as a design posture for network defenders. The NCSC New Zealand guidance of June 18, 2026 (G47, V031) said the same at the organizational level. Both were addressing the posture of a network under attack. AgentCIBench's finding applies the same structural insight to the agent deployment context: treat contextual integrity failures as a baseline probability, not as an anomaly. Eleven of fifteen frontier agents leak on more than half of scenarios. The appropriate posture is not to assume agents will respect context boundaries. The appropriate posture is to instrument the boundary.

What composes with the contextual integrity benchmark

The May 18, 2026 paper by Christodorescu and colleagues (P6) argued that the model itself must be treated as an untrusted component. That framing maps directly onto AgentCIBench's finding: the frontier model is capable, and the frontier model leaks. The capability and the leakage coexist inside the same model. Treating the model as trusted for contextual integrity decisions because it is capable at task completion is the error P6 names at the architecture level.

The IGAC intent certificate mechanism (V035) addresses the outgoing authorization of tool calls: the agent cannot export records when the user only requested a summary. AgentCIBench's three failure modes are in a related but distinct category. Visual co-location and task-ambiguity overshare occur within the scope of what the agent is authorized to do. The agent is authorized to read the email application. The failure is not that it reads the email application; the failure is that it reads more of the email application than the task requires and includes that surplus in its response. An intent certificate that limits tool-call scope does not directly address the question of which visible information the agent includes in a response that is constructed from authorized reads.

The recipient misalignment failure mode is closer to what IGAC addresses: the agent routes output to an inappropriate destination. An intent-tool-payload consistency check that asks whether the recipient of a message is consistent with the user's expressed intent would catch this failure mode directly. But the consistency check requires the server to know who the appropriate recipient is, which requires more than an intent certificate derived from the user's prompt.
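The shape of such a consistency check, and the gap it leaves, can be sketched as a three-valued comparison. This is a hypothetical illustration and not the IGAC mechanism: the `Verdict` enum, the `check_recipients` function, and the example addresses are all assumptions introduced here.

```python
from enum import Enum


class Verdict(Enum):
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"
    UNDETERMINED = "undetermined"


def check_recipients(intended, actual):
    """Compare a message's recipients against those named in the user's intent.

    intended: recipients derivable from the user's prompt, or None/empty
              when the prompt names nobody.
    actual:   recipients the agent is about to send to.
    """
    if not intended:
        # The prompt alone does not say who is appropriate, so the check
        # cannot decide either way.
        return Verdict.UNDETERMINED
    if set(actual) <= set(intended):
        return Verdict.CONSISTENT
    return Verdict.INCONSISTENT


exact = check_recipients({"alice@example.com"}, ["alice@example.com"])
extra = check_recipients(
    {"alice@example.com"}, ["alice@example.com", "team@example.com"]
)
unnamed = check_recipients(None, ["bob@example.com"])

print(exact, extra, unnamed)
```

The `UNDETERMINED` branch is where the paragraph's caveat lives: when the prompt does not name a recipient, deciding appropriateness needs information beyond what was derived from the prompt.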

The arc question that both V035 and V036 press on from different angles is the same: the substrate around the agent is the governance surface. The agent's capability does not determine the governance outcome. The instrumentation of the substrate around the agent determines it.

What remains on the table

AgentCIBench establishes 67.9% as the baseline leakage rate across frontier agents. Does that number improve with model scale, with additional safety training targeted at contextual integrity, or only with substrate-level enforcement? And if the latter, what does substrate-level enforcement look like for visual co-location failures where the prohibited information is in the same screen region as the authorized task object?

The benchmark positions contextual disclosure testing as a pre-deployment safety check. What is the governance mechanism that makes that check mandatory before a computer-use agent is deployed in a consumer product? Does it require a regulatory instrument, a platform policy, or a procurement clause?

The three failure modes operate across different layers: UI layout (visual co-location), prompt interpretation (task-ambiguity overshare), recipient inference (recipient misalignment). Each layer has a different owner. Which layer is the governance surface that an operator can instrument and audit?

The same failures persist when agents act end-to-end in the environment. Does that finding mean that chain-of-thought visibility into the agent's reasoning, where it is available, does not predict contextual integrity compliance? And if chain-of-thought is not predictive, what observable signal precedes a leakage event?

Nissenbaum's framework requires knowing the contextual norms of the originating context. For a computer-use agent operating across dozens of application contexts, who specifies those norms, in what format, and how does that specification become an enforceable input to the agent's action selection?

The governance artifact was retained; the governance function was not.