AMD’s AI GPU Strategy Is Paying Off Big


O for OpenAI, O for Oracle, and O for the orders stacking up on AMD’s ledger. 

The Lisa Su-led chipmaker, days after announcing a deal to deploy 6 GW of compute for OpenAI, potentially built on its GPUs, has landed another major deal, this time with Oracle.

This deal involves a commitment to deploy 50,000 of the upcoming AMD MI450 GPUs starting in Q3 2026, with further expansion planned into 2027 and beyond. 

These commitments are in addition to the existing AMD GPU deployments by companies like Meta, Microsoft, and others. 

At the same time, AMD also showcased ‘Helios’, its rack-scale solution packing 72 MI450 GPUs. The roadmap ahead is clear: the MI450 is slated for release in the second half of next year and will be built on TSMC’s 2-nanometer node. 

AMD is running a two-horse race with NVIDIA, and on paper, its GPUs look competitive. NVIDIA, for instance, has yet to reveal concrete plans to adopt the 2nm process. 

Current AMD GPUs, such as the MI355X, feature higher memory capacity, and the company claims they outperform NVIDIA’s Blackwell B200 GPU. They also cost significantly less than NVIDIA’s offerings. 

Then why the disparity in market share? Because spec sheets and internal benchmarks have not translated into real-world performance. The reason: AMD’s Achilles’ heel is the software underneath its GPUs.

Source: SemiAnalysis

And AMD Is Going All In to Fix It 

In NVIDIA’s case, the CUDA ecosystem, given its maturity, delivers shorter development cycles, fewer surprises in production, reliability at scale, and easier access to expertise. 

Because tools, libraries, and community converge on CUDA, users get predictable performance. 

However, AMD’s ROCm, although open-source, has been reported to frustrate developers due to poor out-of-the-box usability, buggy libraries, and insufficient testing. 

GPU research firm SemiAnalysis stated in the past that this is a key reason why, despite having a lower total cost of ownership, AMD’s GPUs deliver worse training performance per dollar compared to NVIDIA’s GPUs. 

While large-scale enterprises using AMD’s systems can afford to invest in custom software tooling to tune ROCm for their workloads, smaller developers or startups typically cannot.

However, AMD appears to be taking these problems quite seriously.

Beneath these marquee deals lies a factor far more significant to AMD’s long-term future: a comprehensive overhaul of ROCm. The seventh iteration is the company’s most substantial effort to date. 

On MI300-series GPUs, ROCm 7 delivers a 3.5x boost in inference throughput and a 3x speedup in training compared to its predecessor. 

Anush Elangovan, VP of AI software at AMD, told AIM about the significant changes with ROCm 7.0. It brings distributed inference support through integration with frameworks such as vLLM, llm-d, and SGLang. 

“You can now do prefill, decode, and disaggregation — which means that you could actually scale to a massive size deployment with a few nodes,” he said. 
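As a minimal sketch of what that integration looks like in practice, the snippet below runs offline batch inference through vLLM’s Python API, assuming a ROCm-enabled vLLM build on an AMD node; the model name and parallelism degree are illustrative, not AMD’s reference configuration.

```python
# Minimal sketch: offline batch inference with vLLM on AMD GPUs.
# Assumes a ROCm-enabled vLLM build; model and parallelism are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model choice
    tensor_parallel_size=8,  # shard the model across 8 GPUs in one node
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarise prefill/decode disaggregation in one paragraph."], sampling
)
print(outputs[0].outputs[0].text)
```

Disaggregated prefill/decode serving itself is configured at the cluster level, for example via llm-d, which is beyond the scope of this single-node sketch.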

Elangovan said that fine-tuning large models is possible on a single AMD GPU because it is equipped with ~288 GB of High Bandwidth Memory (HBM). “You can run or fine-tune a full Llama 405B with one GPU.”
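As a rough sanity check on that claim, here is a back-of-envelope calculation of weight memory alone for a 405B-parameter model at common precisions. It suggests that fitting the model into ~288 GB assumes aggressively quantised weights, and that fine-tuning on one GPU implies parameter-efficient methods, since optimizer state and activations add further overhead.

```python
# Back-of-envelope: weight memory for a 405B-parameter model at common precisions.
# Fine-tuning additionally needs activations and (for full fine-tuning) optimizer
# state, so the single-GPU claim implies quantised weights and/or adapter methods.
PARAMS = 405e9

for precision, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{weights_gb:,.0f} GB of weights")

# FP16 ~810 GB, FP8 ~405 GB, FP4 ~203 GB -> only FP4-class weights fit in 288 GB.
```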

He pointed to the various enterprise capabilities bundled with ROCm 7. “We have a resource manager which allows you to manage your workloads, your developer ecosystem, and your MLOps platform,” said Elangovan. 

Moreover, AMD also claimed that in internal benchmarks, ROCm 7.0 preview builds running DeepSeek-R1 on MI355X GPUs demonstrated 1.3x higher FP8 inference throughput compared to NVIDIA’s B200 platform under similar conditions.

Elangovan added that ROCm 7 is now available on Windows, which makes it easier to use. He said AMD will continue to expand support for capabilities and applications popular with the open-source AI community, citing widely used tools like Ollama and ComfyUI. For detailed information on everything new in ROCm 7.0, see AMD’s documentation.

Elangovan acknowledged the importance of providing continuous support for developers, who have often complained about documentation issues and difficulty contacting the company. 

“We want to have the largest software ecosystem on AI, we want to double down on all parts of it [support] and not just documentation,” he said. 

“It starts with evangelism, documentation, easy-to-use samples, and access to compute,” he said, adding that the company wants to bring all of these together so that clients have a touchpoint with AMD from ideation to deployment. 

Elangovan also highlighted how the improvements with ROCm stem from the company’s improved software development processes. 

“Every commit gets tested by CI/CD across all capabilities — from core libraries, to frameworks, and serving solutions.” 

He said that AMD now adheres to the ‘trunk shippable’ process, meaning that ROCm can be delivered to customers at any stage of development. 

This approach was evident at launch: support for new frameworks and libraries was available with ROCm 7 from day one, much to the relief of users who had previously encountered issues with out-of-the-box support. 

Several developers have already acknowledged the improvements. 

ROCm 7 Showing Improvements

SemiAnalysis, historically one of AMD’s harsher critics, said that the quality of AMD’s software is ‘totally different’ from last year, when it encountered several ROCm-specific bugs. 

“Today, the frequency in running [into] ROCm bugs is orders of magnitude lower. AMD hardware is pretty good & the software is getting better every night.” 

“It isn’t just us saying this, but many of AMD’s Instinct GPU customers are saying this too,” added the GPU research firm.

According to SemiAnalysis’s new InferenceMAX benchmark, the MI300X running vLLM delivers just 5-10% lower performance-per-dollar compared to NVIDIA’s H100 across all user interaction levels on Llama 3 70B workloads. 

The newer MI325X offers cost efficiency competitive with NVIDIA’s H200, while the upcoming MI355X demonstrates competitive performance against the B200 on quantised models.

Benchmark setup: 8 GPUs serving 64 concurrent users, running GPT-OSS 120B with a 1k-tokens-in/1k-tokens-out workload at FP4 precision.

The AMD MI355X and NVIDIA’s B200 offer similar performance on the task.

| GPU | Latency (s) | Token Throughput per GPU (Tokens/s/GPU) | Cost/Million Tokens | Interactivity (Tokens/s/user) |
|---|---|---|---|---|
| NVIDIA Blackwell B200 | 5.349 | 2679.174 | $0.20 | 172.987 |
| AMD MI355X | 5.469 | 2587.187 | $0.16 | 170.52 |

The NVIDIA H200 offers better performance than the AMD MI325X, albeit by just 10-15%.

| GPU | Latency (s) | Token Throughput per GPU (Tokens/s/GPU) | Cost/Million Tokens | Interactivity (Tokens/s/user) |
|---|---|---|---|---|
| NVIDIA H200 | 7.216 | 1991.603 | $0.20 | 128.084 |
| AMD MI325X | 8.514 | 2587.187 | $0.22 | 109.948 |

(Source: SemiAnalysis InferenceMAX)
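As a sanity check on how the cost column relates to the throughput column, cost per million tokens is simply the hourly GPU price divided by tokens produced per hour. The sketch below reproduces that arithmetic with an assumed $2.00/hour rental rate, which is purely illustrative and not SemiAnalysis’s actual pricing input.

```python
# Illustrative: deriving cost per million tokens from per-GPU throughput and an
# assumed hourly rental price. The $2.00/hr rate is hypothetical, not the
# pricing input SemiAnalysis used.
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1e6

for gpu, throughput in [("B200", 2679.174), ("MI355X", 2587.187)]:
    print(f"{gpu}: ${cost_per_million_tokens(2.00, throughput):.2f} per million tokens")
# Both land near $0.21 at equal pricing; a lower hourly rate for the MI355X is
# what would pull its figure down toward the $0.16 in the table.
```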

Several developers on various forums have also reported improvements since the arrival of ROCm 7. One developer said on Reddit that ROCm 7 has improved driver stability, memory management, multi-GPU communication, and more. 

But AMD’s philosophy with ROCm remains unchanged: keep it open source and, in Elangovan’s words, let developers own their destiny. 

“ROCm means open. ROCm means accessible. ROCm means performant. And across all of the GPUs that we have — that’s what ROCm means,” he said.

And AMD seems to be on a dream run. Its deal with OpenAI is its largest GPU deployment yet. 

Given the ecosystem validation, roadmap, and ROCm momentum, we may see AMD catch up to NVIDIA sooner than expected.
