Nearly 1 in 20 AI requests fail in production as capacity limits become the primary bottleneck to scaling AI reliably
As AI adoption accelerates, operational complexity – not model intelligence – is becoming the primary barrier to reliable AI at scale, according to new data from Datadog, Inc. (NASDAQ: DDOG), the AI-powered observability and security platform.
Datadog’s State of AI Engineering 2026 report, based on real-world data from thousands of organizations running AI in production, highlights a compounding complexity challenge as AI systems scale. Nearly seven in ten companies (69%) now use three or more models alongside increasingly complex agent workflows. Around 5% of AI model requests fail in production, with nearly 60% of those failures caused by capacity limits – leading to slowdowns, errors, and broken experiences in AI-powered applications.
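To put the two headline percentages together, the short sketch below works through the arithmetic. It treats the report's "around 5%" and "nearly 60%" as round figures, which is an assumption, and the function name and the 10,000-request sample are illustrative only.

```python
def capacity_failures(total_requests, failure_pct=5, capacity_share_pct=60):
    """Estimate failed requests and how many of them stem from capacity limits.

    The default percentages are rounded from the report's "around 5%" and
    "nearly 60%" (an assumption made for illustration).
    """
    failed = total_requests * failure_pct // 100
    capacity_limited = failed * capacity_share_pct // 100
    return failed, capacity_limited


# Hypothetical sample of 10,000 production requests.
failed, capacity_limited = capacity_failures(10_000)
share_of_all_requests_pct = capacity_limited * 100 // 10_000
print(failed, capacity_limited, share_of_all_requests_pct)
```

Under these rounded figures, roughly 3 in every 100 requests would fail because of capacity limits alone.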
Additional key findings:
- Multi-model is now the norm: OpenAI remains the most widely used provider at 63% share, alongside rising adoption of Google Gemini and Anthropic Claude, which grew by 20 and 23 percentage points, respectively.
- Agent framework adoption doubled year-over-year, accelerating development but also introducing more moving parts into production systems.
- The amount of data sent to AI models per request is also rising: the average number of tokens per request more than doubled for ‘median use’ teams (50th percentile of usage volume) and quadrupled for heavy users (90th percentile).
“AI is starting to look a lot like the early days of cloud,” said Yanbing Li, Chief Product Officer at Datadog. “The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won’t just build better models – they’ll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago.”
Speed Requires Control
Competitive pressure is accelerating AI deployment across startups and large enterprises alike. But as systems scale, speed without control creates risk. Failures are increasingly driven by system design, including fragmented workflows, excessive retries, and inefficient routing.
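As one illustration of keeping retries under control, the sketch below caps the number of attempts and spaces them out with exponential backoff. It is a minimal example and not taken from the report: `CapacityError`, `call_with_backoff`, and `fake_request` are hypothetical names, and a real client would catch whatever rate-limit or capacity error its provider's SDK actually raises.

```python
import time


class CapacityError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or capacity error."""


def call_with_backoff(request, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call request(), retrying capacity errors up to max_attempts in total."""
    for attempt in range(max_attempts):
        try:
            return request()
        except CapacityError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)  # wait 0.5s, then 1s, then 2s, ...


# Usage: a fake request that hits a capacity limit twice, then succeeds.
outcomes = iter([CapacityError(), CapacityError(), "ok"])
delays = []  # records the waits instead of actually sleeping


def fake_request():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_backoff(fake_request, sleep=delays.append)
print(result, delays)
```

Bounding the attempts keeps a capacity problem from turning into a retry storm. Production code would typically also add random jitter to the delays, which is left out here to keep the example deterministic.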
“The next wave of agent failures won’t be about what agents can’t do but what teams can’t observe,” said Guillermo Rauch, CEO at Vercel, the company behind Next.js and a leading platform for building AI-powered web applications. “We built agentic infrastructure at Vercel because agents need the same production feedback loops as great software. Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential.”
“Innovation alone isn’t enough,” added Li. “To scale AI with confidence, organizations need real-time visibility across the entire stack – from GPU utilization to model behavior to agent workflows. Visibility and operational control are what allow teams to move fast without sacrificing reliability or governance. At scale, how you operate AI may matter more than the models you choose.”
Read the full report, The State of AI Engineering 2026, to learn how Datadog is investing in AI observability to help teams operate and scale AI systems in production.
Report Methodology
Datadog analyzed anonymized usage data from thousands of customers using LLMs in production environments, with global coverage across industries and geographies.
The post AI Is Hitting Operational Limits as Companies Rush to Scale: Datadog Report first appeared on AI-Tech Park.


