Google covers its compute engine bases because it has to

From the moment search engine giant Google decided to go cloud, and then a few years later realized that businesses were not ready to buy full-blown platform services that hid the underlying hardware but instead wanted lower-level infrastructure services that gave them more options – along with more responsibilities – it was inevitable that Google Cloud would have to buy compute engines from Intel, AMD, and Nvidia for its server fleet.

And the profit margins that Intel once commanded for CPUs, that AMD now commands, and that Nvidia still commands for GPUs and will for the foreseeable future also meant it was inevitable that Google would create its own CPUs and AI accelerators to try to drive down the total cost of ownership of its server fleet, especially for internal work such as search engine indexing, ad serving, video serving, and data analytics in its countless forms and at enormous scale.

So every time a Google Cloud event rolls around, as one did this week, we get a little more information about which compute engines Google is buying or building as it assembles its server fleets. Google does not launch products the way normal chip vendors do, with detailed roadmaps and a multitude of SKUs with their feeds, speeds, slots, and watts. We have to piece it together over time and wait for a retrospective paper to be published a few years from now to find out what Google is actually doing today.

It is annoying. But Google has always been secretive – its IT is most definitely a competitive advantage – while also being a bit conflicted in that it wants to brag about its ingenuity, because that is what attracts the next wave of innovators to the company. All of the hyperscalers and big cloud builders are like this. You would be, too, if you had such fierce competitors and as much at stake in protecting and growing your businesses.

With that said, let’s get into what Google revealed during its keynote about its compute engines, and start with the in-house “Trillium” TPU v6 AI accelerators.

We did an analysis of the Trillium accelerators back in June – which seems like such a long time ago – and provided as much detail as we could about this sixth generation of Google’s AI accelerators. There were a lot more questions than answers about the TPU v6 devices and the systems that use them, as we pointed out at the time. But we now have some relative performance figures for inference and training, as well as a sense of the relative price/performance between the TPU v5e and TPU v6 compute engines.

Amin Vahdat, who used to run networking at Google and is now general manager of machine learning, systems, and cloud AI, reiterated some of the key aspects of the Trillium TPU in his keynote at the Google Cloud App Dev & Infrastructure Summit. The TPU v6 has 4.7X the peak performance of the TPU v5e it (sort of) replaces in the lineup, and it has twice the HBM memory capacity and bandwidth as well as twice the interchip interconnect (ICI) bandwidth between adjacent TPUs in a system.

Google also provided real-world benchmarks for training and inference, which are useful. This is what the training comparison between TPU v5e and TPU v6 looks like:

The average performance increase between the current and prior TPU across these five different training benchmarks is 3.85X, which Google rounds up to 4X in its presentations. We added the share of peak performance that each benchmark achieves relative to the 4.7X speedup inherent in the chippery.
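The arithmetic here is simple enough to sketch. The per-benchmark speedups below are hypothetical placeholders – only their 3.85X average comes from Google’s chart – but the share-of-peak calculation is the one we used:

```python
# Hypothetical per-benchmark TPU v5e -> TPU v6 speedups; only their 3.85X
# average is from Google's chart. The 4.7X peak speedup is from the specs.
speedups = [3.60, 3.70, 3.90, 4.00, 4.05]
peak = 4.7

average = sum(speedups) / len(speedups)

print(f"average speedup: {average:.2f}X")
for s in speedups:
    # Each benchmark's realized share of the theoretical peak speedup.
    print(f"{s:.2f}X is {s / peak:.1%} of the 4.7X peak")
```

On this math, the 3.85X average works out to roughly 82 percent of the 4.7X peak.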

For inference, Google only showed Trillium performance against the TPU v5e on Stability AI’s Stable Diffusion XL text-to-image model, which was just announced in late July and is the state of the art:

The novelty of this code could explain why the performance delta between the TPU v5e and TPU v6 is less than two-thirds of the 4.7X delta in peak performance.

It would have been better to see a handful of different inference benchmarks. For instance, where are benchmark results on Google’s JetStream inference engine? And for that matter, where are tests pitting the TPU v5p against the Trillium chip?

In its blog describing the benchmarks, Google had this to say: “We design TPUs to maximize performance per dollar, and Trillium is no exception, demonstrating a nearly 1.8X increase in performance per dollar over v5e and approximately a 2X increase in performance per dollar compared to v5p. This makes Trillium our highest performing TPU yet.”

We tried to use this data to reverse engineer TPU v6 pricing from this comparison, and it does not add up. For one thing, is Google talking about training or inference in these price/performance comparisons, and is it using real-world benchmarks or theoretical peak performance? Given the divergent prices of the TPU v5p and TPU v5e instances, it is hard to see how they can be so close in the price/performance multiples that the TPU v6 delivers. We dug around and found that even though Trillium instances are only in technical preview, pricing has been announced. So we have updated our TPU feature and pricing table. Take a look:
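Here is the kind of back-of-the-envelope math we tried, under the assumption – which Google does not state – that its perf-per-dollar multiples are based on peak performance:

```python
# Reverse engineering the implied TPU v6 price premium from Google's claims.
# Assumption (not stated by Google): perf/$ multiples use peak performance.
peak_speedup_vs_v5e = 4.7          # TPU v6 vs TPU v5e, from the specs
perf_per_dollar_vs_v5e = 1.8       # from Google's blog
measured_training_speedup = 3.85   # average across the five benchmarks

# perf/$ = perf / price, so: price ratio = perf ratio / (perf/$ ratio)
implied_price_ratio_peak = peak_speedup_vs_v5e / perf_per_dollar_vs_v5e
implied_price_ratio_measured = measured_training_speedup / perf_per_dollar_vs_v5e

print(f"implied v6/v5e price ratio (peak basis):     {implied_price_ratio_peak:.2f}X")
print(f"implied v6/v5e price ratio (measured basis): {implied_price_ratio_measured:.2f}X")
```

Depending on the basis, that implies a TPU v6 instance priced somewhere between 2.1X and 2.6X the TPU v5e – and neither figure squares neatly with the announced pricing.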

As usual, items in bold red italics are estimates we made in the absence of actual data.

As you can see from this table, the TPU v5p has much larger pods and much higher HBM memory bandwidth than the TPU v5e, and half the INT8 integer and BF16 floating point performance of the TPU v6. As far as we know, the TPU v6 pod size tops out at 256 accelerators in a single pod, which peaks at 474 petaflops at INT8 precision. Vahdat confirmed this, and then extrapolated beyond the pod.

“Trillium can scale from a single, high-bandwidth, low-latency ICI domain of 256 chips to tens of thousands of chips in a building-scale supercomputer interconnected by a multi-petabyte-per-second datacenter network,” Vahdat explained. “Trillium delivers an unprecedented 91 exaflops in a single cluster, four times that of the largest cluster we built with our previous generation of TPUs. Customers love our Trillium TPUs, and we are seeing unprecedented demand for this sixth generation.”
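Those figures hang together, assuming the 91 exaflops cluster number is at the same INT8 precision as the 474 petaflops pod number – a quick sanity check:

```python
# Sanity-checking Vahdat's scaling claims, assuming both figures are INT8.
pod_chips = 256
pod_petaflops = 474.0
cluster_exaflops = 91.0

per_chip_petaflops = pod_petaflops / pod_chips               # ~1.85 Pflops/chip
pods_per_cluster = cluster_exaflops * 1000 / pod_petaflops   # ~192 pods
chips_per_cluster = round(pods_per_cluster) * pod_chips      # ~49,000 chips

print(f"{per_chip_petaflops:.2f} petaflops per chip at INT8")
print(f"~{pods_per_cluster:.0f} pods, ~{chips_per_cluster:,} chips per cluster")
```

That works out to roughly 192 pods and about 49,000 chips in a cluster, which squares with the “tens of thousands of chips” claim.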

Given that the TPU v6 instances are only in technical preview, it must be a relatively small handful of very important customers doing the praising.

Vahdat also showed some images of the Trillium equipment. Here is a TPU v6 system board with four TPU v6 compute engines:

And here are some racks of this style of Trillium iron, suggestively posed with their fronts exposed.

And now for a pivot to Nvidia GPU infrastructure, which Google Cloud needs to build so businesses can, if they wish, deploy the Nvidia AI Enterprise software stack on cloud infrastructure, and which Google and Nvidia are also tuning to run Google’s beloved JAX framework (written in Python) and its cross-platform XLA compiler, which is fluent in both TPU and GPU.

Google has already launched A3 and A3 Mega instances based on Nvidia “Hopper” H100 GPU accelerators with 80 GB and 96 GB of HBM3 memory, and Vahdat took the opportunity to preview the new A3 Ultra instances that will soon be available on Google Cloud, based on the Hopper H200 GPU with its fatter 141 GB of HBM3E memory. A3 Ultra instances will launch “later this year” and will include Google’s “Titanium” offload engine coupled with Nvidia ConnectX-7 SmartNICs, which will deliver 3.2 Tb/sec of bandwidth interconnecting the GPUs in the cluster using Google’s tweaks to RoCE Ethernet switching.

Vahdat did not say much about Nvidia’s announced “Blackwell” GPUs, but said the company has “a few working Nvidia GB200 NVL72 racks” and is “actively working to bring this technology to our customers.”

Vahdat also added that C4A instances based on Google’s “Cypress” Axion Arm server processors are now generally available. Google announced the first Axion chip in April, but apparently had two chips in the works: the other, codenamed “Maple,” was based on technology licensed from Marvell, while Cypress is based on Arm’s Neoverse V2 cores. The Axion processors are also paired with Titanium offload engines.

Google said that C4A instances had 64% better price/performance on SPEC integer tests and up to 60% better power efficiency than “current generation X86-based instances,” but did not clarify what those instances were. It added that C4A instances delivered 10% better performance than other Arm instances available on other clouds. It did not say how the Axion processor stacked up against an Intel “Granite Rapids” Xeon 6 or AMD “Turin” Epyc 9005 processor.

And for fun, Google showed this bang for your buck chart:

What we did not know until now is what the Axion C4A instances look like. Here are the feeds and speeds for the standard editions of the C4A instances, which have 4 GB of memory per vCPU:

There are highcpu configurations of the Axion C4A instances that have 2 GB per vCPU and highmem configurations that have 8 GB of memory per vCPU. And as the fine print says, the Neoverse V2 cores in the Axion chip do not support simultaneous multithreading, so one core is one thread is one vCPU.
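Those fixed ratios make the shape math trivial. A sketch of how memory scales with vCPU count – the vCPU counts below are illustrative, not a list of actual C4A shapes:

```python
# GB-per-vCPU ratios for the C4A families, from Google's specs.
# The vCPU counts below are illustrative examples, not actual shapes.
ratios = {"highcpu": 2, "standard": 4, "highmem": 8}

for vcpus in (4, 16, 64):
    shapes = {family: vcpus * gb for family, gb in ratios.items()}
    print(f"{vcpus} vCPUs -> {shapes} (GB of memory)")
```

And because there is no SMT, the vCPU count is also the physical core count.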

Here is the hourly pricing for the standard instances in Google’s Northern Virginia (US-East4) region:

C4A instances are available in the US-Central1 (Iowa), US-East4 (Virginia), US-East1 (South Carolina), EU-West1 (Belgium), EU-West4 (Netherlands), EU-West3 (Frankfurt), and Asia-Southeast1 (Singapore) regions; availability in other regions is expected “soon.”

We look forward to comparing the AWS Graviton 4, Google Cloud C4A, and Microsoft Azure Cobalt 100 Arm server chips running in these respective clouds. Hopefully Microsoft will roll out the Cobalt 100 at its Ignite 2024 conference in a few weeks so we can.
