It’s Time to Orchestrate AI Hardware for Maximum Effect

B. Shimmin

Summary Bullets:

  • There are many AI-savvy chipsets on the market right now, each fine-tuned to support specific AI workloads, development frameworks, or vendor platforms.
  • But what if developers could flexibly combine AI-specific hardware resource pools on the fly, on-premises as well as in the cloud?

There’s certainly enough buzz in the industry right now about artificial intelligence (AI). If you look beyond the doomsday predictions of a machine uprising, the prevailing view is that AI is a veritable Swiss Army knife, able to cut through any and all problems, ready to assemble opportunity out of nothing more than data. It seems that every vendor has one or two machine learning (ML) and deep learning (DL) frameworks lying about. It’s no wonder. There’s TensorFlow, Caffe, Theano, Torch, and many, many more to choose from, most of which are open source and quite accessible to the broader developer community.

Boundless industry optimism and abundant developer tools aside, what’s a little more difficult to come by is the hardware necessary to make the most out of AI. Even a seemingly mundane AI task, such as training a DL neural network to classify a data point as ‘yes’ or ‘no,’ can be quite demanding depending on the size and complexity of the data set. That’s why Google created its own Tensor Processing Unit (TPU) chip, with which it could speed up its TensorFlow DL routines to cut down latency for user-facing AI decisions like ‘which ad should we show?’ Google estimates that each TPU is capable of delivering up to 225,000 predictions per second. A regular old CPU can muster just over 5,000. That’s a lot more potential outcomes to consider, comparatively.
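To put those two throughput figures side by side, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above; since the TPU figure is an "up to" estimate and the CPU figure is "just over 5,000," the result should be read as a rough ceiling rather than a benchmark.

```python
# Figures quoted in the article; both are approximate.
tpu_predictions_per_sec = 225_000  # "up to" 225,000 per TPU
cpu_predictions_per_sec = 5_000    # "just over" 5,000 per CPU

# Ratio of the two quoted throughput figures.
speedup = tpu_predictions_per_sec / cpu_predictions_per_sec
print(f"TPU advantage: roughly {speedup:.0f}x")
```

In other words, the quoted numbers imply a gap of about 45 to 1 for this kind of prediction workload.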

And thanks to the economies of scale available on public clouds like Google Cloud Platform, all developers can readily buy time on AI-tuned hardware. Such hardware is fast becoming a differentiator among AI players. Google has its TPUs. Amazon and Microsoft are offering time on their NVIDIA-built graphics processing units (GPUs). All of these vendors are using field programmable gate arrays (FPGAs) for special AI projects and to accelerate internal data center operations. And more hardware innovations are on the way, with Microsoft, Baidu, Intel, and Xilinx all doing some interesting work in combining various architectures. Microsoft’s new Project Brainwave, for example, combines aspects of FPGAs and ‘ye olde’ ASICs to purportedly match the performance of hard-coded chips.

Which one is best at speeding up difficult AI tasks? That depends on which DL framework you’re using (Google TensorFlow or Microsoft Cognitive Toolkit), what kind of precision you need (high vs. low), as well as the size/complexity of the data set and the task at hand (model training vs. real-time decisions). The point is that for each use case, there is an optimal hardware solution.

But what if you could commit to a single platform and still have the option to choose the best AI hardware for each job? While traveling in China last week at the Huawei Analyst Summit, I was surprised to discover that the Chinese ICT powerhouse was building toward just this kind of flexibility.

Like the other vendors mentioned here, Huawei is working on AI acceleration as a part of its public cloud platform, with both GPUs and FPGAs within its Atlas offering (announced last September). The company’s approach is in line with the rest of the industry in terms of offering AI-specific hardware. But Huawei is also working on a unique proposition with Atlas and other high-performance computing (HPC) offerings, which will let customers make use of AI-specific hardware within flexible resource pools, both on-cloud and on-premises.

Why is this important? Currently, AI developers have to provision physical resource nodes that combine one kind of chip with storage and memory. And predominantly, this is done on public clouds. With some workloads, like DL training, the cloud is the only financially viable platform, though there are a few on-premises options emerging, as with IBM’s PowerAI deep learning software running on its Power Systems hardware.

What Huawei is building is a new kind of platform that blurs the boundaries between CPUs, servers, and data centers. Huawei calls this boundless computing. Rather than provision a physical node like an x86 box to perform a given task, developers simply provision a complex task that will automatically run discrete workloads on the most appropriate piece of hardware.

The secret sauce here is what Huawei calls a ‘hardware composer,’ which calls into action a select micro module that could be memory or storage or even a chip, be that x86, system on a chip (SoC), GPU, NPU, FPGA, etc. That’s some serious flexibility.
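To make the idea concrete, here is a minimal sketch of what composing a task out of a hardware pool might look like. This is purely illustrative and is not Huawei's implementation or API: the `Module`, `Workload`, and `compose` names, the preference lists, and the first-match placement rule are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """A hypothetical pooled hardware module (chip type is the 'kind')."""
    name: str
    kind: str  # e.g. "x86", "GPU", "FPGA", "NPU"


@dataclass(frozen=True)
class Workload:
    """A hypothetical discrete workload within a larger task."""
    name: str
    preferred: tuple  # module kinds, most suitable first


def compose(workloads, pool):
    """Give each workload the first unassigned module matching its preferences."""
    available = list(pool)
    plan = {}
    for w in workloads:
        chosen = None
        for kind in w.preferred:
            chosen = next((m for m in available if m.kind == kind), None)
            if chosen is not None:
                break
        if chosen is None:
            raise RuntimeError(f"no module available for {w.name}")
        available.remove(chosen)  # a module serves one workload at a time
        plan[w.name] = chosen.name
    return plan


# Made-up pool and task, for illustration only.
pool = [Module("cpu-0", "x86"), Module("gpu-0", "GPU"), Module("fpga-0", "FPGA")]
task = [
    Workload("train-model", ("GPU", "x86")),
    Workload("serve-predictions", ("FPGA", "GPU", "x86")),
    Workload("preprocess", ("GPU", "x86")),
]
plan = compose(task, pool)
print(plan)
```

Here the developer describes the task and its workloads, not the box it runs on. Training lands on the GPU, real-time serving on the FPGA, and preprocessing falls back to the x86 module because the only GPU is already taken. A real composer would presumably weigh cost, locality, and utilization as well; this sketch only shows the shift from provisioning nodes to provisioning tasks.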

What do you think?
