Kubernetes Native llm-d Could Be a ‘Turning Point in Enterprise AI’ for Inferencing


Over the past two years, powerful AI models, both open source and proprietary, have enabled a wide range of use cases for individuals and organisations. However, deploying these models in production-ready environments involves several challenges, particularly around inference and cost-effectiveness.

Red Hat, the US-based open-source enterprise software provider, has unveiled a new framework that aims to solve this problem. It is called ‘llm-d’, a Kubernetes-native distributed inference framework built on top of vLLM, one of the most widely used open-source inference engines.

“[llm-d] amplifies the power of vLLM to transcend single-server limitations and unlock production at scale for AI inference,” said Red Hat. 

Built in collaboration with tech giants like Google Cloud, IBM Research, NVIDIA, AMD, Cisco, and Intel, the framework optimises how AI models are served and run in demanding environments like data centres with several GPUs. 

llm-d Achieves ‘3x Lower Time-to-First-Token’

llm-d’s gains come from several specific techniques in its architecture. For example, it features ‘Prefill and Decode Disaggregation’, which separates input context processing (prefill) from token generation (decode). Treating these as two distinct operations allows them to be distributed across multiple servers, improving efficiency.
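
To make the idea concrete, here is a minimal Python sketch of the disaggregation pattern. It is purely illustrative and does not reflect llm-d’s actual APIs: prefill runs once over the whole prompt to build a KV cache, while decode repeatedly consumes that cache to emit tokens, so the two phases can be placed on different workers.

```python
# Illustrative sketch of prefill/decode disaggregation (not llm-d's real API).
# Prefill builds the KV cache from the full prompt; decode consumes that cache
# one token at a time, so the two phases can live on different servers.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for a per-request key/value cache."""
    prompt_tokens: list[int]
    entries: list[tuple[int, int]] = field(default_factory=list)  # placeholder K/V pairs


def prefill_worker(prompt_tokens: list[int]) -> KVCache:
    """Compute-heavy pass over the whole prompt; runs once per request."""
    cache = KVCache(prompt_tokens=prompt_tokens)
    for tok in prompt_tokens:
        cache.entries.append((tok, tok * 2))  # placeholder for attention K/V
    return cache


def decode_worker(cache: KVCache, max_new_tokens: int) -> list[int]:
    """Latency-sensitive generation loop; can run on a different node than prefill."""
    output = []
    last = cache.prompt_tokens[-1]
    for _ in range(max_new_tokens):
        next_tok = (last + len(cache.entries)) % 50_000  # placeholder for sampling
        cache.entries.append((next_tok, next_tok * 2))
        output.append(next_tok)
        last = next_tok
    return output


if __name__ == "__main__":
    kv = prefill_worker([101, 7592, 2088, 102])   # "prefill" node
    print(decode_worker(kv, max_new_tokens=5))    # "decode" node
```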

Furthermore, KV (key-value) Cache Offloading significantly reduces the memory burden on GPUs by shifting the KV cache to more cost-effective storage, such as CPU memory or network-attached storage.
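
A rough sketch of the offloading idea, assuming PyTorch is available (names such as OffloadingKVStore are hypothetical, not llm-d’s implementation): idle requests’ KV blocks are parked in CPU memory and copied back to the accelerator only when decoding resumes.

```python
# Illustrative sketch of KV cache offloading (assumes PyTorch; not llm-d's API).
# Idle requests' KV blocks are moved from scarce GPU memory to cheaper CPU RAM
# and copied back only when the request resumes decoding.
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


class OffloadingKVStore:
    def __init__(self):
        self.gpu_blocks: dict[str, torch.Tensor] = {}   # hot, actively decoding
        self.cpu_blocks: dict[str, torch.Tensor] = {}   # offloaded, idle

    def add(self, request_id: str, kv_block: torch.Tensor) -> None:
        self.gpu_blocks[request_id] = kv_block.to(DEVICE)

    def offload(self, request_id: str) -> None:
        """Free accelerator memory by parking the block in host memory."""
        block = self.gpu_blocks.pop(request_id)
        self.cpu_blocks[request_id] = block.to("cpu")

    def reload(self, request_id: str) -> torch.Tensor:
        """Bring the block back to the accelerator when decoding resumes."""
        block = self.cpu_blocks.pop(request_id).to(DEVICE)
        self.gpu_blocks[request_id] = block
        return block


if __name__ == "__main__":
    store = OffloadingKVStore()
    store.add("req-1", torch.randn(2, 8, 64))  # toy [layers, heads, dim] block
    store.offload("req-1")                     # GPU memory reclaimed
    print(store.reload("req-1").shape)         # back on the accelerator
```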

The framework is also based on Kubernetes-powered clusters and controllers, which facilitate the efficient scheduling of compute and storage resources.

In a dual-node NVIDIA H100 cluster, llm-d achieved ‘3x lower time-to-first-token’ and ‘50–100% higher QPS (queries per second) SLA-compliant performance’ compared to a baseline. This means faster responses and higher throughput while meeting service-level agreements.

Google Cloud, a key contributor to the llm-d project, said, “Early tests by Google Cloud using llm-d show 2x improvements in time-to-first-token for use cases like code completion, enabling more responsive applications.”

In addition, llm-d features AI-aware network routing, which schedules requests to servers and accelerators whose caches are already ‘hot’, minimising redundant computation. The framework is also flexible enough to run across NVIDIA GPUs, Google TPUs, and AMD and Intel hardware.
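
The routing idea can be sketched roughly as follows, as an illustration rather than the actual llm-d gateway logic: hash the prompt prefix, send repeat prefixes back to the replica that already holds them in cache, and fall back to the least-loaded replica on a miss.

```python
# Illustrative sketch of cache-aware routing (not the actual llm-d gateway logic).
# Requests whose prompt prefix was recently served go back to the same replica
# so its "hot" KV cache can be reused; otherwise pick the least-loaded replica.
import hashlib


class CacheAwareRouter:
    def __init__(self, replicas: list[str]):
        self.replicas = replicas
        self.load = {r: 0 for r in replicas}        # in-flight requests per replica
        self.prefix_owner: dict[str, str] = {}      # prefix hash -> replica

    @staticmethod
    def _prefix_key(prompt: str, prefix_chars: int = 256) -> str:
        return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()

    def route(self, prompt: str) -> str:
        key = self._prefix_key(prompt)
        replica = self.prefix_owner.get(key)
        if replica is None:                         # cache miss: balance by load
            replica = min(self.replicas, key=lambda r: self.load[r])
            self.prefix_owner[key] = replica
        self.load[replica] += 1
        return replica


if __name__ == "__main__":
    router = CacheAwareRouter(["pod-a", "pod-b"])
    prompt = "You are a helpful assistant. Summarise the following report..."
    print(router.route(prompt))
    print(router.route(prompt))  # routed to the same pod, reusing its warm cache
```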

“Distributed inference is the future of GenAI—and most teams don’t have time to build custom monoliths,” said Red Hat in a post on X. “llm-d helps you adopt production-grade serving patterns using Kubernetes, vLLM, and Inference Gateway.”

“I think Red Hat’s launch of llm-d could mark a turning point in Enterprise AI,” said Armand Ruiz, VP of AI Platform at IBM. 

“While much of the recent focus has been on training LLMs, the real challenge is scaling inference, the process of delivering AI outputs quickly and reliably in production,” he added. 

Companies have increasingly focused on solutions for scaling AI inference, both in hardware and software. Over the past two years, companies like Cerebras, Groq, and SambaNova have developed and scaled a series of hardware infrastructure products to accelerate AI inference. 

“We [Groq] need to be one of the most important compute providers in the world. Our goal by the end of 2027 is to provide at least half of the world’s AI inference compute,” said Jonathan Ross, founder and CEO of Groq, earlier this year.  

Moreover, last year, NVIDIA CEO Jensen Huang said that one of the challenges NVIDIA currently faces is generating tokens at incredibly low latency.

Extensive Research Into Inference Optimisation Strategies

Although there is an increasing focus on inference-specific hardware, substantial advancements have also been made in software frameworks and architectures for scaling AI inference. 

A study titled ‘Taming the Titans: A Survey of Efficient LLM Inference Serving’, released last month by researchers from Huawei Cloud and Soochow University in China, surveyed techniques and emerging research tackling the problem of LLM inference.

The survey covers inference optimisation methods at both the instance level and the cluster level, alongside several emerging scenarios.

At the instance level, optimisations include efficient model placement through parallelism and offloading, advanced request scheduling algorithms, and various KV cache optimisation methods. At the cluster level, the focus is on GPU cluster deployment and load balancing to ensure efficient resource utilisation across multiple instances.
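
As a toy illustration of one scheduling family the survey covers, shortest-job-first ordering (a sketch under my own assumptions, not code from the paper) serves requests expected to finish soonest first to reduce average latency.

```python
# Illustrative sketch of instance-level request scheduling (shortest-job-first).
# Requests with the smallest estimated decode length are batched first.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    est_tokens: int                        # estimated decode length drives priority
    request_id: str = field(compare=False)


class SJFScheduler:
    """Serve requests expected to finish soonest first to cut average latency."""
    def __init__(self):
        self._queue: list[Request] = []

    def submit(self, request_id: str, est_tokens: int) -> None:
        heapq.heappush(self._queue, Request(est_tokens, request_id))

    def next_batch(self, batch_size: int) -> list[str]:
        batch = []
        while self._queue and len(batch) < batch_size:
            batch.append(heapq.heappop(self._queue).request_id)
        return batch


if __name__ == "__main__":
    sched = SJFScheduler()
    sched.submit("long-report", est_tokens=2048)
    sched.submit("quick-chat", est_tokens=64)
    print(sched.next_batch(batch_size=2))  # ['quick-chat', 'long-report']
```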

The paper also examines emerging scenarios, such as serving models for long contexts, retrieval-augmented generation (RAG), mixture of experts (MoE), LoRA, and more. 

The vLLM team also announced a ‘Production Stack’ in March, another enterprise-grade inference solution. The open-source stack is designed for Kubernetes-native deployment and focuses on efficient resource utilisation through distributed KV cache sharing and intelligent autoscaling based on demand patterns.
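
A minimal sketch of demand-based autoscaling, assuming a simple QPS model rather than the Production Stack’s actual logic: the desired replica count tracks observed traffic divided by per-replica capacity, clamped to configured bounds.

```python
# Illustrative sketch of demand-based autoscaling (not the Production Stack's code):
# scale the replica count from observed QPS versus per-replica capacity, within bounds.
import math


def desired_replicas(observed_qps: float,
                     qps_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    """Return how many serving replicas are needed for the current demand."""
    needed = math.ceil(observed_qps / qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    print(desired_replicas(observed_qps=120, qps_per_replica=25))  # -> 5
    print(desired_replicas(observed_qps=3, qps_per_replica=25))    # -> 1
```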

“Early adopters report 30-40% cost reduction in real-world deployment compared to traditional serving solutions while maintaining or improving response times,” said LMCache Lab, a co-creator of the vLLM Production Stack based at the University of Chicago.

