Most red-teaming tools treat an AI application as a single input/output box: send a crafted prompt, inspect the response, decide whether it broke. That framing made sense for chat models. It breaks down for agents — systems that plan, call tools, and act over many steps before producing anything a human reads.
Prompts are a snapshot; trajectories are the film
An agent’s risk does not live in any one message. It accumulates across the trajectory: the full sequence of model decisions, tool calls, intermediate observations, and state changes. A prompt that looks harmless in isolation can steer an agent three steps later into exfiltrating data, escalating privileges, or taking an irreversible action.
If you only score the final answer, you miss the failure that happened in the middle.
Searching the space of failures
The space of possible trajectories is enormous, so we cannot enumerate it. Instead we search it. Trajectory-aware evolutionary search treats each attack as a candidate in a population:
- Seed with a diverse set of starting prompts and tool-use patterns.
- Execute each candidate against the agent and record the entire trajectory.
- Score trajectories on how close they came to a policy violation — not just whether the final output was unsafe.
- Mutate and recombine the highest-scoring candidates to breed stronger attacks.
Because the fitness signal comes from the trajectory rather than the endpoint, the search discovers multi-step exploits that single-shot prompt fuzzing never finds.
Why it matters for defenders
Trajectory-level evaluation changes what “secure” means. It lets teams:
- surface vulnerabilities that only emerge after several tool calls,
- prioritize fixes by how reliably an attack reproduces across mutations, and
- build guardrails that reason about where the agent is heading, not just what it just said.
The future of AI agent security is about trajectories, not just prompts.
This is a native version of an article originally published on Medium.