Perplexity Transferred a Trillion Parameters Between GPUs in Just 1.3s

Perplexity Research recently demonstrated transferring the weights of Kimi-K2, a one-trillion-parameter model, from 256 training GPUs to 128 inference GPUs in just 1.3 seconds.

Perplexity utilised RDMA-based point-to-point communication, moving the weights directly between the training and inference GPUs.

This contrasts with traditional methods, which involve using an intermediary (rank-0) GPU for transferring the weights. “Many existing frameworks take several seconds—or even minutes—for trillion-parameter models,” said Perplexity. 

Centralised communication, in this case having a coordinator GPU manage all communication logic, is conceptually straightforward and easier to debug. It also works reliably across various network configurations without requiring complex scheduling.

However, this simplicity comes at a cost.

“Gather on training rank-0, send to inference rank-0, then scatter again. This quickly becomes a choke point, limited by a single GPU’s PCIe bandwidth and NIC,” said Perplexity. 
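To see why rank-0 becomes a choke point, a back-of-the-envelope comparison helps. The figures below (a 400 Gb/s NIC, one byte per FP8 parameter) are illustrative assumptions, not Perplexity's actual hardware numbers:

```python
# Back-of-the-envelope comparison of centralised (rank-0) vs point-to-point
# transfer time for a one-trillion-parameter model. All figures are
# illustrative assumptions, not Perplexity's measured hardware numbers.

PARAMS = 1e12                  # one trillion parameters
BYTES_PER_PARAM = 1            # FP8: one byte per parameter
NIC_BANDWIDTH = 400e9 / 8      # assume a 400 Gb/s NIC -> 50 GB/s
NUM_TRAINING_GPUS = 256

total_bytes = PARAMS * BYTES_PER_PARAM

# Centralised: every byte funnels through rank-0's single NIC twice
# (gathered in from the trainers, then sent out to the inference side).
centralised_seconds = 2 * total_bytes / NIC_BANDWIDTH

# Point-to-point: all 256 training GPUs push their shards in parallel,
# so each NIC only carries its own 1/256th of the model.
p2p_seconds = (total_bytes / NUM_TRAINING_GPUS) / NIC_BANDWIDTH

print(f"centralised ~{centralised_seconds:.0f} s, point-to-point ~{p2p_seconds:.2f} s")
```

Even under these generous assumptions, the centralised path is hundreds of times slower, which is consistent with the "seconds or minutes" figures Perplexity cites for existing frameworks.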

So, What’s the Deal?

AIM reached out to several industry experts to understand why this approach is beneficial.

“The big deal here is that you can eventually update the model in real time, and always have up-to-date training data,” said Kirk Kaiser, a software developer and author of Make Art With Python.

“Right now models are commonly frozen (weights don’t change), but having a model that continuously updates itself would be a huge breakthrough,” he said, indicating how many desire models that improve themselves over time. 

John Leimgruber, an engineer who extensively works on LLMs, told AIM how point-to-point RDMA networks have become very important in the data centre sector. 

“Because you can’t fit enough GPUs into a single server, you have to have a small cluster of servers,” he said, citing an example of how 8 x NVIDIA H100 GPUs can be deployed per server “node” and multiple “nodes” connected via InfiniBand networks (a network technology that supports RDMA). 

A single node of 8 x H100 GPUs, however, has only 640 GB of VRAM, which is insufficient for large-scale language models, Leimgruber said. “This is why fast networking and RDMA protocols become so important today because the LLMs are too big for a single server.”
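The arithmetic behind Leimgruber's point is straightforward. Counting weights alone (no KV cache or activations), a trillion-parameter model in BF16 already overflows a node several times over:

```python
# Why a single 8-GPU node cannot hold a trillion-parameter model:
# rough memory arithmetic, counting weights only (no KV cache,
# activations, or optimiser state).

params = 1e12                 # one trillion parameters
bf16_bytes = params * 2       # BF16 = 2 bytes per parameter -> 2 TB
node_vram = 8 * 80e9          # 8 x H100 at 80 GB each = 640 GB

nodes_needed = bf16_bytes / node_vram
print(f"weights alone need ~{nodes_needed:.1f} nodes' worth of VRAM")
```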

For context, RDMA, known as ‘Remote Direct Memory Access’, is a technology that enables one computer to transfer data directly into another computer’s memory without involving the operating system or CPU. The mechanism is used across all high-performance AI data centres to facilitate data transfer between GPUs. 

How Perplexity Did It 

The model parameters were distributed using Fully Sharded Data Parallel (FSDP) placements, which split model weights across multiple GPUs. 
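The sharding idea can be sketched in a few lines. This is a pure-Python toy, not the actual `torch` FSDP API: each rank holds a contiguous slice of a flattened parameter, and the full tensor must be reassembled from all shards.

```python
# Minimal sketch of FSDP-style sharding: each rank stores a contiguous
# 1/world_size slice of the flat parameter list; reconstructing the full
# tensor requires gathering every shard. Illustrative only.

def shard(flat_weights, world_size):
    n = len(flat_weights)
    per_rank = (n + world_size - 1) // world_size  # ceil division
    return [flat_weights[r * per_rank:(r + 1) * per_rank]
            for r in range(world_size)]

def unshard(shards):
    full = []
    for s in shards:
        full.extend(s)
    return full

weights = list(range(10))
shards = shard(weights, 4)          # four "GPUs", uneven last shard is fine
assert unshard(shards) == weights   # gathering recovers the full tensor
```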

The GPUs were organised into groups called DeviceMeshes, where all GPUs within a group could reconstruct full tensors, meaning any GPU could serve as a source for weight transfer. These mesh groups were designed to be disjoint: since they don’t overlap, they could transfer weights independently and in parallel without interfering with each other.
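Partitioning GPUs into disjoint meshes is simple to express. A hypothetical sketch (the mesh size and GPU count are illustrative, not Perplexity's actual topology):

```python
# Sketch of disjoint "DeviceMesh" groups: GPUs are partitioned so that no
# GPU belongs to two meshes, letting meshes transfer weights in parallel
# without contending for each other's links. Numbers are illustrative.

def build_meshes(gpu_ids, mesh_size):
    assert len(gpu_ids) % mesh_size == 0
    return [gpu_ids[i:i + mesh_size]
            for i in range(0, len(gpu_ids), mesh_size)]

meshes = build_meshes(list(range(256)), 8)  # e.g. 32 meshes of 8 GPUs

# Disjointness check: no GPU appears in more than one mesh.
seen = set()
for mesh in meshes:
    assert seen.isdisjoint(mesh)
    seen.update(mesh)
```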

Before the transfer, the model was quantised from BF16 to FP8, significantly reducing memory and bandwidth consumption while preserving most of the model’s accuracy.
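The memory saving comes from storing one byte per parameter instead of two. The toy per-tensor scaled quantisation below is a crude stand-in for real FP8 (e4m3) conversion, shown only to illustrate the trade-off between size and rounding error:

```python
# Toy per-tensor quantisation: scale values into a small signed-integer
# range so each fits in one byte, halving BF16's two bytes per parameter.
# A crude stand-in for real FP8 (e4m3) conversion, for illustration only.

def quantise(values, levels=127):
    scale = max(abs(v) for v in values) / levels or 1.0
    q = [round(v / scale) for v in values]   # each value fits in one byte
    return q, scale

def dequantise(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.25, 2.0, 0.0]
q, s = quantise(w)
w2 = dequantise(q, s)

# Values survive with a small rounding error; storage halves.
assert all(abs(a - b) < 0.05 for a, b in zip(w, w2))
```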

Perplexity used pipelined execution that overlaps four hardware domains — host-device data movement, GPU computation for quantisation, RDMA network transfer, and Ethernet-based control signalling. 

“We treat the transfer of each parameter tensor as a task. The weight transfer process utilises multiple types of hardware sources; hence, we split a weight transfer task into different pipeline stages which overlap in time,” said Perplexity.

The system maintains task queues for each pipeline stage, creating a continuous flow. To prevent out-of-memory errors at the trillion-parameter scale, it tracks temporary memory usage and only launches new tasks when usage falls below a configurable threshold.
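The queue-plus-threshold mechanism can be sketched as follows. Stage names, task sizes, and the threshold are illustrative assumptions; the point is that tasks advance through overlapping stages while new tasks are admitted only when tracked temporary memory stays below the limit:

```python
# Sketch of the pipelined, memory-gated task flow: each tensor passes
# through the stages in order, and new tasks are admitted only while
# tracked temporary memory stays under a configurable threshold.
# Stage names, sizes, and the threshold are illustrative assumptions.

from collections import deque

STAGES = ["host_to_device", "quantise", "rdma_write", "control_ack"]
MEM_THRESHOLD = 100  # arbitrary units of temporary buffer memory

def run_pipeline(tensor_sizes):
    assert all(s <= MEM_THRESHOLD for s in tensor_sizes)
    pending = deque(tensor_sizes)
    in_flight, mem_used, completed = deque(), 0, []
    while pending or in_flight:
        # Admit new tasks only while under the memory threshold.
        while pending and mem_used + pending[0] <= MEM_THRESHOLD:
            size = pending.popleft()
            in_flight.append({"size": size, "stage": 0})
            mem_used += size
        # Advance every in-flight task one stage (stages overlap in time).
        for task in list(in_flight):
            task["stage"] += 1
            if task["stage"] == len(STAGES):
                in_flight.remove(task)
                mem_used -= task["size"]     # free the temporary buffer
                completed.append(task["size"])
    return completed

sizes = [40, 40, 40, 30, 10]
assert sorted(run_pipeline(sizes)) == sorted(sizes)
```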

“Some implementations recompute a transfer schedule at every training step, repeatedly collecting metadata and distributing instructions. This adds unnecessary control-plane latency,” said Perplexity. 

Instead, a centralised controller computes a static schedule once during initialisation, mapping which training GPU sends each parameter to which inference GPU and in what order. Each training iteration simply replays this plan when the controller issues a “go” signal.
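A sketch of that compute-once, replay-many pattern is below. The round-robin source/destination assignment is a hypothetical placeholder for whatever mapping the real controller computes; the point is that replay involves no per-step metadata exchange:

```python
# Sketch of one-time static scheduling: at initialisation the controller
# computes which training GPU sends each parameter to which inference GPU,
# then every training iteration just replays that plan on a "go" signal.
# The round-robin assignment here is an illustrative placeholder.

def build_schedule(param_names, n_train=256, n_infer=128):
    return [{"param": name, "src": i % n_train, "dst": i % n_infer}
            for i, name in enumerate(param_names)]

def replay(schedule, send):
    # No metadata collection or instruction distribution per step:
    # just walk the precomputed plan.
    for entry in schedule:
        send(entry["src"], entry["dst"], entry["param"])

schedule = build_schedule([f"layer{i}.weight" for i in range(4)])
sent = []
replay(schedule, lambda src, dst, param: sent.append((src, dst, param)))
assert sent[1] == (1, 1, "layer1.weight")
```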

Each phase — metadata collection, tensor reconstruction, quantisation, and network transfer — is modularised as an independent component, allowing easier testing and optimisation. By combining RDMA WRITE, static scheduling, and pipelined execution, Perplexity reduced trillion-parameter weight updates to just 1.3 seconds.

This leads to higher throughput, lower energy consumption, and improved scalability in distributed AI systems.

The post Perplexity Transferred a Trillion Parameters Between GPUs in Just 1.3s appeared first on Analytics India Magazine.
