AMD's AI Future Is Rack Scale 'Helios'
rbanffy | 133 points | 8 days ago | morethanmoore.substack.com
Minks|7 days ago
ROCm really is hit or miss depending on the use case.
Plus their consumer card support is questionable, to say the least. I really wish it were a viable alternative, but swapping to CUDA saved me some headaches and a ton of time.
Having to run MIOpen benchmarks for HIP can take forever.
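The MIOpen tuning pain can at least be bounded from the framework side. A minimal sketch, assuming a ROCm build of PyTorch (where the cudnn flags below are mapped onto MIOpen) and MIOpen's MIOPEN_FIND_MODE environment variable; treat the exact knobs as assumptions rather than a guaranteed fix:

    import os

    # Assumption: MIOPEN_FIND_MODE is MIOpen's knob for how exhaustive its
    # per-shape kernel search is; "FAST" trades tuning time for possibly
    # slower kernels.
    os.environ.setdefault("MIOPEN_FIND_MODE", "FAST")

    import torch
    import torch.nn as nn

    # Assumption: on ROCm builds, this flag is mapped onto MIOpen's
    # exhaustive search; leaving it off avoids long first-run benchmarking.
    torch.backends.cudnn.benchmark = False

    device = "cuda" if torch.cuda.is_available() else "cpu"
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)
    x = torch.randn(8, 64, 128, 128, device=device)
    print(conv(x).shape)  # first call may still trigger a (shorter) find step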
m_mueller|7 days ago
Exactly the same thing has been said over and over again, ever since CUDA took off for scientific computing around 2010. I don’t really understand why, 15 years later, AMD still hasn’t been able to copy the recipe, and frankly it may be too late now with all that mindshare in NVIDIA’s software stack.
bayindirh|7 days ago
Just remember that 4 of the top 10 Top500 systems run on AMD Instinct cards, based on the latest June 2025 list announced at ISC Hamburg.
NVIDIA has a moat for smaller systems, but that is not true for clusters.
As long as you have a team to work with the hardware you have, performance beats mindshare.
aseipp|7 days ago
The Top500 is an irrelevant comparison; of course AMD is going to give direct support to single institutions that hand them hundreds of millions of dollars and help make their products work acceptably. They would be dead if they didn't. Nvidia does the same thing for its major clients, and yet its products still actually work on day 1 on consumer hardware, too.
Nvidia of course has a shitload more money, and they've been doing this for longer, but that's just life.
> smaller systems
El Capitan is estimated to cost around $700 million or something with like 50k deployed MI300 GPUs. xAI's Colossus cluster alone is estimated to be north of $2 billion with over 100k GPUs, and that's one of ~dozens of deployed clusters Nvidia has developed in the past 5 years. AI is a vastly bigger market in every dimension, from profits to deployments.
wmf|7 days ago
HPC has probably been holding AMD back from the much larger AI market.
pjmlp|7 days ago
Custom builds, with top-paid employees assigned to keep the customer happy.
bayindirh|7 days ago
What do you mean?
pjmlp|7 days ago
Besides sibling comment, HPC labs are the kind of customers that get hardware companies to fly in engineers when there is a problem bringing down the compute cluster.
convolvatron|7 days ago
Presumably that in HPC you can dump enough money into individual users to make the platform useful in a way that is impossible in a more horizontal market. In HPC it used to be fairly common to get one of only 5 machines with a processor architecture that had never existed before, dump a bunch of energy into making it work for you, and then throw it all out after 6 years.
bigyabai|7 days ago
It's just not easy. Even if AMD was willing to invest in the required software, they would need a competitive GPU architecture to make the most of it. It's a lot easier to split 'cheap raster' and 'cheap inference' into two products, despite Nvidia's success.
7speter|6 days ago
Well, AMD is supposed to be releasing UDNA next year, which will presumably ‘unite’ capabilities like raster and inference within one architecture.
alecco|7 days ago
Jensen knows what he is doing with the CUDA stack and workstations. AMD needs to beat that, rather than just thinking about bigger hardware. Most people are not going to risk years learning an arcane stack for an architecture that is used by less than 10% of the GPGPU market.
hyperbovine|7 days ago
I'm willing to bet almost nobody you know calls the CUDA API directly. What AMD needs to focus on is getting the ROCm backend going for XLA and PyTorch (a rough sketch of what that looks like for end users is below). That would unlock a big slice of the market right there.
They should also be dropping free AMD GPUs off helicopters, as Nvidia did a decade or so ago, in order to build up an academic userbase. Academia is getting totally squeezed by industry when it comes to AI compute. We're mostly running on hardware that's 2 or 3 generations out of date. If AMD came out with a well-supported GPU that cost half what an A100 sells for, voilà, you'd have cohort after cohort of grad students training models on AMD and then taking that know-how into industry.
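To make the "nobody calls CUDA directly" point concrete, here is a minimal sketch of what end users actually write, assuming a ROCm build of PyTorch, which exposes AMD GPUs through the familiar torch.cuda API so the same script runs on either vendor's hardware:

    import torch

    # On ROCm builds of PyTorch, AMD GPUs sit behind the torch.cuda API,
    # so typical user code is unchanged between vendors.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    x = torch.randn(1024, 1024, device=device)
    y = x @ x.T  # dispatched to rocBLAS on AMD, cuBLAS on NVIDIA
    print(device, y.mean().item())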
bwfan123|7 days ago
Indeed. The user-facing software stack components, PyTorch and JAX/XLA, are owned by Meta and Google and are open source. Further, the open-source models (Llama/DeepSeek) are largely hardware-agnostic. There is really no user or ecosystem lock-in. Also, clouds are highly incentivized to have multiple hardware alternatives.
pjmlp|7 days ago
HN keeps forgetting that game development and VFX exist.
hyperbovine|7 days ago
What fraction of Nvidia revenue comes from those applications?
akshayt|6 days ago
About 0.1% from professional visualization in Q1 this year
pjmlp|7 days ago
Let's put it this way: they need graphics cards, and CUDA is now relatively common.
For example, OTOY OctaneRender, one of the key renderers in Hollywood.
7speter|6 days ago
AMD isn’t doing what you’re proposing, but it seems Intel is a few months out from this.
aseipp|7 days ago
There already is ROCm support for PyTorch. Then there's stuff like this: https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-b...
They have improved since that article, by a decent amount from my understanding. But by now it isn't enough to have "a backend". The historical efforts have soured that narrative so badly that it won't be enough to just have a pytorch-rocm PyPI package; some of that flak is unfair, though not completely unsubstantiated. Frankly, they need to deliver better software, across all their offerings, for multiple successive generations before the bad optics around their software stack start fading. Meanwhile, their competitors have already moved on to their next-gen architectures since that article was written.
You are correct that people don't really invoke CUDA APIs much, but that's partially because those APIs actually work and deliver good performance, so things can actually be built on top of them.
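For what it's worth, which backend a given PyTorch wheel was built against is visible at runtime; a small check, assuming the torch.version.hip attribute that ROCm builds ship (it is None or absent on CUDA builds):

    import torch

    # A ROCm build reports a HIP version; a CUDA build reports a CUDA version.
    hip = getattr(torch.version, "hip", None)
    cuda = getattr(torch.version, "cuda", None)

    if hip is not None:
        print(f"PyTorch built against ROCm/HIP {hip}")
    elif cuda is not None:
        print(f"PyTorch built against CUDA {cuda}")
    else:
        print("CPU-only build")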
pjmlp|7 days ago
Additionally, when people discuss CUDA they always think about C, ignoring that it has been C++-first since CUDA 3.0, that it also has Fortran support, and that NVIDIA has always embraced having multiple languages able to play in PTX land as well.
And as of 2025, there is a Python CUDA JIT DSL as well.
Also, even if not the very latest version, the CUDA SDK works on any consumer laptop with NVIDIA hardware, so anyone can slowly get into CUDA, even if their hardware isn't that great.
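For reference, the longest-standing Python route to JIT-compiled CUDA kernels is Numba's CUDA target; a minimal sketch (not necessarily the 2025 DSL the parent refers to), assuming numba with CUDA support and an NVIDIA GPU:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y, out):
        i = cuda.grid(1)          # global thread index
        if i < out.size:
            out[i] = a * x[i] + y[i]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads
    # Host arrays are copied to the GPU and back automatically.
    saxpy[blocks, threads](np.float32(2.0), x, y, out)
    print(out[:4])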
cedws|7 days ago
At this point it looks to me like something is seriously broken internally at AMD resulting in their software stack being lacklustre. They’ve had a lot of time to talk to customers about their problems and spin up new teams, but as far as I’ve heard there’s been very little progress, despite the enormous incentives. I think Lisa Su is a great CEO but perhaps not shaking things up enough in the software department. She is from a hardware background after all.
bwfan123|7 days ago
There used to be a time when hardware vendors begrudgingly put out sample driver code consisting of one file with 5,000 lines of C code that just barely worked. The quality of the software was not really a priority, as most of the revenue came from hardware sales. That was reflected in the quality of hires and the incentive structures.
rbanffy|7 days ago
Indeed. The stories I hear about software support for their entry-level hardware aren't great. Having a good on-ramp is essential.
OTOH, by emphasizing datacenter hardware, they can cover a relatively small portfolio and maximize access to it via cloud providers.
As much as I'd love to see an entry-level MI350-A workstation, that's not something that will likely happen.