Nvidia’s Spectrum-X Ethernet will power the world’s largest AI supercomputer: 200,000 Hopper GPUs

Credit: Nvidia

One of the challenges of building high-end AI data centers is connecting tens of thousands of GPUs so they run in concert, seamlessly, which makes the network interconnect as important as the GPUs themselves. To build xAI’s Colossus supercomputer, which currently runs 100,000 Nvidia Hopper processors and will expand to 200,000 H100 and H200 GPUs in the coming months, xAI selected Nvidia’s Spectrum-X Ethernet platform.

Nvidia’s Spectrum-X platform includes the Spectrum SN5600 Ethernet switch, which is built on the Spectrum-4 switch ASIC and offers port speeds of up to 800 Gbps. The platform works with Nvidia’s BlueField-3 SuperNICs to deliver the speed and efficiency needed to move the massive data streams that AI training requires. With Spectrum-X, Colossus sustains 95% data throughput and virtually eliminates network latency issues and packet loss, enabling seamless operation at unprecedented scale.
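
Those figures lend themselves to a quick back-of-envelope check. The short Python sketch below multiplies the SN5600’s 800 Gbps port speed by the 95% sustained-throughput figure cited above; the one-NIC-per-GPU assumption used for the aggregate number is purely illustrative, not Nvidia’s published Colossus topology.

    # Back-of-envelope estimate of effective bandwidth on an 800 Gbps
    # Spectrum-X link at the ~95% sustained throughput cited above.
    # The one-NIC-per-GPU assumption is illustrative, not from Nvidia.
    PORT_SPEED_GBPS = 800          # SN5600 port speed, per the article
    EFFECTIVE_UTILIZATION = 0.95   # sustained data throughput cited above
    NUM_GPUS = 100_000             # current Colossus deployment

    effective_per_port = PORT_SPEED_GBPS * EFFECTIVE_UTILIZATION
    aggregate_tbps = effective_per_port * NUM_GPUS / 1_000  # Gbps -> Tbps

    print(f"Effective per-port bandwidth: {effective_per_port:.0f} Gbps")
    print(f"Aggregate (one NIC per GPU): {aggregate_tbps:,.0f} Tbps")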

Nvidia says traditional Ethernet would struggle to handle such scale, typically suffering heavy congestion and low data throughput. In contrast, Spectrum-X’s adaptive routing, congestion control, and performance isolation technologies address these issues, ensuring a stable, high-performance environment.
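
To make that contrast concrete, here is a minimal toy model in Python, not Nvidia’s actual algorithm: static hash-style placement assigns flows to links without regard to load, while a simple adaptive policy sends each new flow to the currently least-loaded link, keeping the busiest link far closer to the average.

    import random

    random.seed(42)  # reproducible toy data

    NUM_LINKS = 4
    flows = [random.randint(1, 100) for _ in range(32)]  # flow sizes, arbitrary units

    # Static placement (ECMP-style hashing): the link is a fixed function
    # of the flow and ignores how loaded each link already is.
    static_load = [0] * NUM_LINKS
    for i, size in enumerate(flows):
        static_load[i % NUM_LINKS] += size

    # Adaptive placement: each new flow goes to the least-loaded link.
    adaptive_load = [0] * NUM_LINKS
    for size in flows:
        adaptive_load[adaptive_load.index(min(adaptive_load))] += size

    print("static   link loads:", static_load, "-> max:", max(static_load))
    print("adaptive link loads:", adaptive_load, "-> max:", max(adaptive_load))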

“AI is becoming mission critical and requires increased performance, security, scalability and cost effectiveness,” said Gilad Shainer, senior vice president of networking at Nvidia. “The Nvidia Spectrum-X Ethernet networking platform is designed to provide innovators like xAI with faster processing, analysis and execution of AI workloads, accelerating the development, deployment and time to market of AI solutions.”

With 100,000 Hopper GPUs, xAI’s Colossus is already one of the world’s most powerful supercomputers for AI training. Yet it was built in just 122 days, a rapid deployment that stands in stark contrast to the typical timelines for such massive systems, which often stretch over months or even years. That efficiency carried over to its operational setup: training began just 19 days after the first equipment was delivered and installed.

It remains to be seen how long it will take xAI to install the additional 100,000 Hopper GPUs, but it’s safe to say that Colossus will remain the world’s most powerful AI supercomputer for a while, at least until Microsoft and Oracle roll out their Blackwell-based systems.

“Colossus is the most powerful training system in the world,” Elon Musk said on X. “Great work from the xAI team, NVIDIA and our many partners/vendors.”