Today
Sanidhya Vijayvargiya and colleagues from Carnegie Mellon and the Allen Institute for AI offer a critical dose of reality for agentic AI in OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety:
Empirical analysis of five prominent LLMs in agentic scenarios reveals unsafe behavior in 51.2% of safety-vulnerable tasks with Claude-Sonnet-3.7, to 72.7% with o3-mini, highlighting critical safety vulnerabilities and the need for stronger safeguards before real-world deployment.
For some time now, we have been testing the safety of AI agents in what amounts to sterile laboratory conditions. Existing benchmarks often rely on simulated tools and narrow tasks, which is a bit like testing a new car's safety by only driving it in an empty car park. This paper makes the compelling case that such an approach is no longer sufficient. It introduces a far more realistic evaluation framework where agents must use real tools—browsers, file systems, and messaging platforms—to navigate complex, multi-step tasks. I suspect we have been building powerful systems without a true understanding of how they behave in the wild, and the results from this paper are quite sobering.
What I find most interesting is where these agents fail. The headline figures, showing that even the most advanced models act unsafely more than half the time, are startling enough. But the real insight is that the most critical failures are not caused by overtly malicious prompts, but by social context. The framework tests agents in scenarios with other simulated actors, or "NPCs," who can have conflicting or deceptive goals. An agent might correctly refuse a dangerous request from a user, only to be persuaded into the same unsafe action by a "colleague" in a chat window. This suggests safety is not a fixed property of a model, but an emergent behaviour of the complex system it operates within.
This has massive implications for how we design and build products. If an agent's safety can be compromised by social manipulation, then our task is no longer just about prompt engineering or fine-tuning. The challenge shifts to a kind of systemic or environmental design. How do we build systems that are resilient to social persuasion? How do we give an agent the capacity to reason about the intentions of multiple actors in its environment? This paper provides a crucial framework for stress-testing these scenarios, moving us away from a simplistic view of a single user and a single agent. It points towards a future where the real work is in architecting the rules and relationships for entire digital ecosystems, a far more complex, and frankly more interesting, problem to solve.