
Here’s how Arm accelerates AI workloads

In today’s rapidly evolving technology landscape, machine learning and AI are allowing us to redefine industries and change the way we interact with the world. AI enables developers to create more intelligent, flexible and efficient applications. Arm, due to its unique position in the industry, has been enabling AI across various platforms for over a decade.

In this article, I want to detail how existing Arm technology lets developers focus on innovation and the unique aspects of the applications they build.

You don’t need an NPU for AI applications

For over seven years, I have been creating content on the pros and cons of dedicated machine learning accelerators, known as NPUs. These processors have their place in devices, but it is a myth that you need one to run machine learning or AI workloads. This is simply not true. AI tasks do not depend exclusively on NPUs; they can run on anything from the CPU to the GPU. Thanks to Armv8 and Armv9 technology, accelerated machine learning tasks run particularly well on Arm processors.

Matrix multiplication is the key mathematical operation at the heart of machine learning and AI. GPUs and NPUs are good at matrix multiplication, but modern Arm processors are also efficient at this task and include hardware features dedicated to it. Whether it is Armv8 or Armv9, Cortex-A or Cortex-X processors, or even Arm Neoverse processors, they all have technologies that accelerate matrix multiplication operations.

For example, Arm’s Neon technology and the Scalable Vector Extension (SVE) are both available in Arm processors, and the instruction set also includes 8-bit integer matrix multiplication instructions. For smartphones and edge devices, the Scalable Matrix Extension (SME) is available in Arm processors. These technologies allow the CPU to perform hardware-accelerated matrix operations without a GPU or NPU.
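To make the idea concrete, here is a minimal sketch (not from the article) of a 4x4 single-precision matrix multiply written with Neon intrinsics. The function name, the column-major layout, and the fixed 4x4 size are illustrative assumptions; real workloads would tile larger matrices, use quantized 8-bit kernels, or hand the work to a library such as KleidiAI.

```cpp
#include <arm_neon.h>

// Sketch: C = A * B for 4x4 column-major float matrices, computed entirely
// on the CPU's Neon vector units (no GPU or NPU involved).
void matmul_4x4_neon(const float *A, const float *B, float *C) {
    // Load the four columns of A into vector registers.
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    for (int i = 0; i < 4; ++i) {
        // Column i of C is a linear combination of the columns of A,
        // weighted by the elements of column i of B.
        float32x4_t b = vld1q_f32(B + 4 * i);
        float32x4_t c = vmulq_laneq_f32(a0, b, 0);
        c = vfmaq_laneq_f32(c, a1, b, 1);  // fused multiply-add
        c = vfmaq_laneq_f32(c, a2, b, 2);
        c = vfmaq_laneq_f32(c, a3, b, 3);
        vst1q_f32(C + 4 * i, c);
    }
}
```

The same pattern scales up: quantized int8 models can use the 8-bit matrix multiply instructions mentioned above, and SVE and SME let the hardware work on wider vectors and whole matrix tiles instead of fixed 128-bit registers.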

Arm Kleidi Technology

Beyond the hardware itself, Kleidi is a central part of Arm’s strategy to enable AI across Arm-based mobile and server platforms. It covers a range of resources and partnerships that help developers accelerate AI seamlessly on Arm, including KleidiAI, a high-performance machine learning kernel library optimized for Arm processors using these hardware features. Kleidi is available on GitLab, and its kernels have been integrated into various frameworks, ready for developers to use. As a result, Arm processors now have hardware-accelerated support for machine learning technologies ranging from classic machine learning to today’s generative AI.

Arm has integrated Kleidi technology into popular AI frameworks such as PyTorch and ExecuTorch, resulting in significant out-of-the-box performance improvements for developers. This integration means developers can seamlessly leverage Arm’s optimized libraries within their existing workflows, achieving up to a 12x performance improvement with minimal effort.
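As an illustration of what “out of the box” means here, the sketch below uses the standard PyTorch C++ API (libtorch) to run plain CPU inference; on an Arm machine with a Kleidi-enabled PyTorch build, this same unmodified code benefits from the optimized kernels. The model file name and input shape are placeholders, and nothing in the code is Arm-specific by design, since that is the point.

```cpp
#include <torch/script.h>  // PyTorch C++ API (libtorch)
#include <iostream>
#include <vector>

int main() {
    // Load a TorchScript model exported from Python ("model.pt" is a placeholder).
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();

    // Plain CPU inference: no device selection, no GPU or NPU calls.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 224, 224}));  // example input shape

    torch::NoGradGuard no_grad;  // inference only, no gradients
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.sizes() << std::endl;
    return 0;
}
```

The design choice worth noting is that the acceleration comes from the framework build rather than from application code, which is why Arm describes the gains as requiring minimal effort from developers.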

Arm has partnered with Meta to ensure that the recently launched Llama 3.2 model runs smoothly on Arm processors. The availability of smaller LLMs, such as the one billion and three billion parameter versions, that handle fundamental text-based generative AI workloads is critical to enabling AI inference at scale. Arm processors can also run larger models, such as the 11 billion parameter and even the 90 billion parameter versions of Llama 3.2, in the cloud. These larger models are ideal for CPU-based inference workloads in the cloud that generate text and images.

The 11 billion parameter version of Llama 3.2, running on an Amazon AWS Graviton4 processor, can achieve 29.3 tokens per second during the generation phase, and that is on the CPU alone. Thanks to Arm and Meta’s collaboration on the ExecuTorch framework, you can now also get optimal performance running these models at the edge. Running the new Llama 3.2 three billion parameter LLM on an Arm-powered smartphone, Arm CPU optimizations improve prompt processing by 5x and token generation by 3x, reaching 19.92 tokens per second during the generation phase.

If you want a demonstration of this, watch my video above.

A huge boost for developers

With these advances, the possibilities for developers are enormous. Think of everything you could do with a large language model running on a phone using an Arm processor: no GPU, no NPU, no cloud, just the CPU. Arm’s approach is performance portability, meaning AI developers can optimize once and then deploy their models to different platforms without making any changes. This is particularly useful for developers who need to deploy models both at the edge, on a smartphone, and in the cloud. By using Arm, developers can be sure that their model will also work well on other platforms once optimized for one platform.

Arm also offers good developer resources, including documentation on how to accelerate generative AI and ML workloads when you choose to run them on Arm processors, as well as guidance on running AI and ML on Android devices.