The Double Thank You Moment Between Kubernetes and LLMs

Large language models (LLMs) may dominate AI-related headlines, but the underlying infrastructure that makes them work reliably at scale rarely does. 

Kubernetes, the open source container cluster manager, is not only enabling the AI era by orchestrating inference at scale, but also evolving under the demands of AI workloads, a mutually reinforcing cycle, according to Jonathan Bryce, executive director of the Cloud Native Computing Foundation (CNCF).

“We are in the middle of what I think is a huge shift from the traditional workloads of applications to AI applications,” Bryce told AIM in an interview. For context, Kubernetes is maintained by CNCF. 

While performance, response times, and uptime remain priorities, hardware requirements have evolved to suit AI workloads, said Bryce, pointing to GPU utilisation.

The high cost of GPUs makes orchestration efficiency critical. This is also why recent developments within the cloud native community have prioritised GPU scheduling, allocation, and workload placement. It also puts a spotlight on the networking capabilities Kubernetes offers.

“Kubernetes has always had these networking concepts available to applications that allow containers to move within pods, and that’s becoming really key to building high-performance AI applications, specifically inference,” said Bryce. 

Bryce, like many voices in the industry today, highlighted that inference is set to become the most critical AI workload, surpassing model training for the time being. And because not all AI inference runs in GPU datacentres, some of it will need to operate on laptops, phones, cars, and other edge systems, where orchestration becomes a priority.

Several open source frameworks are exploring how to run these workloads efficiently, such as Ray, Red Hat’s new llm-d, and ByteDance’s AIBrix.

“The common factor: Kubernetes and the Kubernetes primitives,” Bryce said. 

Kubernetes remains the core foundation for orchestrating LLM workloads, handling deployment, scaling, fault‑tolerance, and hardware abstraction. And recent developments across the Kubernetes ecosystem further extend its capabilities for LLM inference and serving.

In June, the Gateway API Inference Extension was introduced to Kubernetes. Unlike generic HTTP load balancers, it enables inference-aware routing that accounts for session state, model identity, and resource usage, tailored to long-lived, GPU-intensive LLM requests. Google Cloud’s GKE Inference Gateway, released in July, builds on these capabilities.

It routes LLM requests based on GPU-specific metrics like KV cache usage, enabling better throughput and lower latency. It supports multiple models behind a single endpoint and allows autoscaling based on AI workload patterns, optimised for vLLM and various GPU types.
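The routing decision described above can be sketched in a few lines. The snippet below is illustrative only: the metric and replica names are hypothetical, not the actual Gateway API Inference Extension schema or GKE Inference Gateway internals. It shows the core idea of scoring model-server replicas by KV cache headroom, with queue depth as a tiebreaker.

```python
from dataclasses import dataclass

# Hypothetical per-replica metrics an inference-aware gateway might scrape.
@dataclass
class ReplicaMetrics:
    name: str
    kv_cache_utilisation: float  # fraction of GPU KV cache in use, 0.0-1.0
    queue_depth: int             # requests waiting on this replica

def pick_replica(replicas: list[ReplicaMetrics]) -> ReplicaMetrics:
    """Route to the replica with the most KV-cache headroom,
    breaking ties by the shortest request queue."""
    return min(replicas, key=lambda r: (r.kv_cache_utilisation, r.queue_depth))

replicas = [
    ReplicaMetrics("vllm-0", kv_cache_utilisation=0.92, queue_depth=3),
    ReplicaMetrics("vllm-1", kv_cache_utilisation=0.40, queue_depth=5),
    ReplicaMetrics("vllm-2", kv_cache_utilisation=0.40, queue_depth=1),
]
print(pick_replica(replicas).name)  # vllm-2
```

A generic HTTP load balancer would round-robin across all three replicas; a gateway that sees GPU-level metrics avoids the replica whose cache is nearly full, which is what drives the throughput and latency gains.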

Furthermore, Red Hat AI recently announced llm-d, a Kubernetes-native distributed inference framework built on top of vLLM, one of the most widely used open source frameworks for accelerating AI inference today.

In a dual-node NVIDIA H100 cluster, llm-d achieved ‘3x lower time-to-first-token’ and ‘50–100% higher QPS (queries per second) SLA-compliant performance’ compared to a baseline. 

Google Cloud, a key contributor in the llm-d project, said, “Early tests by Google Cloud using llm-d show 2x improvements in time-to-first-token for use cases like code completion, enabling more responsive applications.”

How AI Is Shaping Kubernetes Development

While Kubernetes enables AI workloads, AI is also influencing Kubernetes’ own evolution, and the open source community’s responsiveness to real-world needs is a big part of that. 

“What’s been happening with Kubernetes is as people are taking AI to production, they’re realising Kubernetes has the right types of workload orchestration to manage these complex environments — with networking, with different types of hardware, with different SLAs [service level agreements], and they’re pushing updates into Kubernetes or related projects,” Bryce explained.

One example is improved workload placement control. For high-performance inference, simply moving a model between GPUs can be inefficient because of context and key-value (KV) cache portability issues. 

“How do I make sure that requests are going to a GPU that has its cache? How do I share or distribute the cache using things like LM cache?” Bryce said. “These are differences Kubernetes didn’t support a year ago and does now because of the work that folks are writing.”

Bryce said these aren’t just experiments or fun ideas that developers are exploring; they are driven by real-world needs.

The changes have made Kubernetes a more capable platform for LLM inference workloads, accommodating mixed hardware environments, tighter performance constraints, and more complex service-level requirements.

Opportunities for Developers in the Cloud Native AI Era

For developers considering where to focus their skills in the AI age, Bryce sees the cloud-native ecosystem as a long-term bet. 

While Kubernetes is still the most prominent project at the CNCF, Bryce points to OpenTelemetry as the fastest-growing. 

With AI systems acting as partial black boxes, instrumentation is critical. “You’re going to need more information out of these systems to understand what’s really happening,” Bryce said. 

“Being able to instrument them and get better data out of them… that’s going to be a huge area of innovation,” he added. 
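The span pattern that OpenTelemetry formalises can be sketched with nothing but the standard library. The example below is a stdlib-only illustration of instrumenting the stages of an LLM request, not the real opentelemetry-sdk API; the span and attribute names are hypothetical.

```python
import time
from contextlib import contextmanager

spans = []  # in a real system, spans would be exported to a collector

@contextmanager
def span(name: str, **attributes):
    """Record a named, timed span with arbitrary attributes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            **attributes,
        })

with span("llm.request", model="demo-model"):
    with span("llm.prefill", prompt_tokens=512):
        time.sleep(0.01)   # stand-in for prompt processing
    with span("llm.decode", completion_tokens=64):
        time.sleep(0.02)   # stand-in for token generation

print([s["name"] for s in spans])  # inner spans close (and record) first
```

Attaching model-specific attributes such as token counts to each span is what turns a partial black box into something operators can reason about: you can see where the time went and correlate it with the workload.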

Another growth area Bryce points to is platform engineering, which involves designing internal platforms to make developers more effective. 

“Almost all [companies in the CNCF ecosystem] are implementing platform engineering in some way… If you can learn a skill set that makes your organisation’s developers more effective, that’s hugely valuable,” Bryce said.

That said, LLMs are only as effective as the infrastructure supporting them at scale. Kubernetes, with its advancing GPU orchestration, application-aware networking, and real-world-driven feature development, is emerging as the backbone for inference workloads, if it has not already established itself as such.

“It’s underappreciated how much the CNCF is actually at the centre of AI innovation… our community makes technology work reliably at scale,” added Bryce.

The post The Double Thank You Moment Between Kubernetes and LLMs appeared first on Analytics India Magazine.
