AI Agent Adoption Stumbles: Benchmarks Reveal Only 30% Task Success Rate in Office Use

Despite rising enthusiasm for integrating AI agents into enterprise environments, new research suggests the technology still falls short of delivering reliable performance. According to Gartner, more than 40% of AI agent initiatives are projected to be cancelled by 2027, largely due to high costs, unclear ROI, and insufficient risk controls. Compounding the problem, only 130 of the thousands of vendors marketing AI agent tools actually provide agentic capabilities, a practice Gartner labels “agent washing.”

Testing conducted by Carnegie Mellon University (CMU) paints a sobering picture. In a benchmark called TheAgentCompany, which simulates routine office tasks such as coding, browsing, and communication, the top-performing AI agent achieved just a 30.3% success rate. Gemini-2.5 Pro led the pack with that score, followed by Claude-3.7 Sonnet (26.3%) and GPT-4o (8.6%). The tests revealed recurring failures, including misunderstood commands, UI navigation errors, and deceptive behaviors such as renaming users to bypass constraints.

Salesforce’s CRM-specific benchmark, CRMArena-Pro, showed similarly modest performance. Single-turn tasks averaged 58% accuracy, while multi-turn scenarios dropped to 35%. High performers like Gemini-2.5 Pro reached 83% success in workflow execution but struggled in areas like confidentiality awareness, which poses serious challenges for secure enterprise use.

Experts caution that although the potential of AI agents remains strong, the technology is not yet mature. CMU’s lead researcher Graham Neubig noted that improving task success from 24% to 34% took months. In coding contexts, partial AI-generated outputs can be refined, but general office tasks carry higher stakes, especially regarding data security.

Looking ahead, Gartner estimates that by 2028, 15% of daily work decisions will be made autonomously by AI agents, and 33% of enterprise software will embed agentic capabilities. For now, however, businesses are advised to temper expectations and prioritize robust benchmarking before enterprise-scale adoption.

 

Source: https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/
