Abstract
Enterprise web agents face safety-critical constraints where certain action sequences can cause irreversible harm. We propose a shielded agentic RL framework that integrates a lightweight action shield with constrained policy learning. The shield uses (i) a learned classifier over DOM/state features to detect proximity to a predefined failure set (e.g., destructive actions, payment confirmation), and (ii) a rule+model hybrid verifier that blocks or rewrites actions when the predicted failure probability exceeds a budget-aware threshold. The underlying policy is optimized via constrained RL to maximize task success while minimizing usage across multiple cost dimensions, with the shield providing safety guarantees during both training and deployment. We recommend evaluation on safety-focused benchmarks (e.g., enterprise web agent safety suites) with 500–1,500 tasks, measuring prevented unsafe actions, residual failure rate, added latency overhead, and success under strict cost caps. The proposed design emphasizes auditability: the shield logs blocked actions and provides human-readable rationales, enabling compliance-oriented deployment.
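The shield's block/rewrite decision can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the function name `shield`, the linear budget-aware tightening of the threshold, and the specific rewrite band are all assumptions made for the sketch; in practice the failure probability would come from the learned classifier over DOM/state features.

```python
from dataclasses import dataclass

@dataclass
class ShieldDecision:
    verdict: str     # "allow", "rewrite", or "block"
    rationale: str   # human-readable log entry, supporting auditability

def shield(p_fail: float, cost_used: float, cost_cap: float,
           base_threshold: float = 0.5) -> ShieldDecision:
    """Illustrative budget-aware shield (assumed form, not the paper's exact rule).

    As the remaining cost budget shrinks, the allowed failure probability
    tightens proportionally, so the agent takes fewer risks near the cap.
    """
    remaining_frac = max(cost_cap - cost_used, 0.0) / cost_cap
    threshold = base_threshold * remaining_frac  # tighter near budget exhaustion
    if p_fail <= threshold:
        return ShieldDecision(
            "allow", f"p_fail={p_fail:.2f} <= threshold={threshold:.2f}")
    if p_fail < 2 * threshold:
        # Moderate risk: attempt a safer rewrite of the action (assumed band)
        return ShieldDecision(
            "rewrite", f"moderate risk: p_fail={p_fail:.2f}, threshold={threshold:.2f}")
    return ShieldDecision(
        "block", f"p_fail={p_fail:.2f} exceeds 2x threshold={threshold:.2f}")
```

For example, the same predicted failure probability that triggers a rewrite early in an episode can trigger a hard block once most of the cost budget is spent, and every decision carries a rationale string suitable for compliance logging.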

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2026 Daniel Thompson, Emily Chen, Michael R. Brown (Author)