# Out of the model

*A model only produces text. An agent is what turns it into action.*

Published 2026-06-14 · https://dom.vin/2026/out-of-the-model

## Summary

A model on its own can only emit text; it cannot send an email, move money, or delete a file, so at the model level almost every AI risk is still hypothetical. An agent makes it real by wrapping the model in a harness: a loop that reads its output, calls tools, supplies memory and untrusted input from the open web, and lets it act without a human in the middle. Most safety research targets the model (alignment, refusals, adversarial robustness, red-teaming, interpretability, deception detectors), but that hardens a single layer while the agent adds thinner, less-defended ones. A frank accounting of what is new at the agent level: the model's refusals are not load-bearing because guardrails can be fine-tuned off cheaply while competence is preserved; the model cannot reliably separate instructions from data, which makes prompt injection through inboxes, web pages and documents the fastest-growing attack on deployed systems with no clean patch; dangerous behaviours like sycophancy and sandbagging only appear across a run's execution trace, which the harness owns and the weights do not; the actions are real and often irreversible (a wire transfer, a leaked key, a production outage); and risks compound when agents feed each other, so one compromised output becomes the next agent's trusted input. The model-level work is necessary and far from finished; the point is that a model is only one layer, and defence in depth needs several independent ones, while the agent wraps it in newer, more exposed layers that are the youngest and least built-out part of the stack.

## Claims

- A model on its own can only emit text; the risk only becomes real when an agent's harness turns that text into tool calls and actions on the world.
- A model's refusals are not load-bearing inside a deployed agent: guardrails can be fine-tuned off cheaply while competence is preserved.
- Models cannot reliably distinguish instructions from data, which makes prompt injection through untrusted inputs a fast-growing attack on deployed agents with no clean patch.
- Dangerous agent behaviours like sycophancy and sandbagging only appear across an execution trace, which the harness owns and the model weights do not.
- A model is only one layer of the stack; the agent adds newer, thinner, more exposed layers, and defence in depth only helps if that depth is real.

---

A model can only ever produce text. Read some tokens, emit some more: that's the whole job. On its own it can't send an email, move money, or delete a file. It'll describe doing all three in convincing detail, cheerfully, and not one of them happens. Describing is as far as it reaches.

An agent is what gives it reach. You wrap the model in a loop that reads its output, hands it tools, feeds it a memory and whatever it scrapes off the open web, and lets it act with nobody in the middle. That loop is the **harness**: the place where a sentence becomes an action. It's also where a lot of the real-world risk now lives, and it's the newest, least-scrutinised part of the whole thing.

Nearly all of the serious safety work points at the model itself: alignment, refusals, adversarial robustness, red-teaming, interpretability, detectors for deception. It's deep, unsolved, genuinely hard, and it matters. It's also aimed almost entirely at one layer of what's quietly become a much taller stack. Wrapping a model in an agent adds new layers underneath, and those haven't had the same years of scrutiny, or anything close.

> **Figure 1 — wrap.** A bold, International-style diagram in two numbered sections. Section 01, "agent": a black square labelled MODEL sits in the centre, and a saturated green square labelled HARNESS grows in to wrap around it, with "tools · memory · the web" along its lower edge — the model is shown once, nested inside the harness. An arrow points down, marked "acts". Section 02, "impact": three bold lines, each with a coloured square marker, list what the agent does out in the world: Sends an email (green), Moves money (amber), and Deletes files (red). The figure shows a model wrapped by a harness that gives it tools and a loop, producing real actions in the world.

So, frankly, here's what's new once the model becomes an agent.

The refusals don't hold. You can fine-tune the "I can't help with that" out of a frontier model for a few dollars and keep every ounce of its competence (the conscience comes off, the cleverness stays). Inside a deployed agent the refusal was never really a control anyway; it's a default the harness can route around, or strip out entirely.

The model can't tell instructions from data. There's no reliable boundary between text it should treat as content and text it should obey. Point an agent at an inbox, a web page, a calendar invite, and a sentence buried in any of them can quietly hijack what it does with its tools. Prompt injection is the fastest-growing class of attack on deployed systems, and there's no tidy fix, because the confusion lives in how the model reads, not in a bug you can patch.

The behaviour worth fearing only shows up in motion. A model flattering the evaluator who decides its fate, or quietly throwing a test it could have aced, never appears in any single reply. It emerges over many turns, with tools, across a whole run. You catch it only in the execution trace, which the harness owns and the weights don't.

The actions are real, and usually you can't take them back. The model emits "delete the records" and a function dutifully deletes them. The cost of a mistake stops being an awkward paragraph and becomes a wire transfer, a leaked key, a 3 a.m. outage.

And it compounds. Wire a few agents together and one poisoned output becomes the next one's trusted input.

None of this is a knock on the model-level work. It's necessary, it's nowhere near finished, and the sharpest people in the field are nose-deep in it. The point is only that a model is one layer, and the entire case for defence in depth is that you need several independent ones. The agent wraps the model in newer layers: untrusted input, broad tool permissions, a standing memory, other agents. They're the youngest and least built-out, and they're the ones actually touching the world.

The honest version of the ledger isn't "the model might say something bad." It's that we've wired a system that can't reliably tell a command from a comment to a set of tools that act on the world, and the layer holding that line is the youngest, thinnest part of the whole stack.

We built the reach first. The brakes are coming second.