
Backend.AI News: Announcing Graphcore IPU Integration

조규진 · 5 min read

Backend.AI now supports Graphcore's IPU. In this post, we introduce the IPU and how Backend.AI supports it.

Graphcore IPU

Today, the mainstream approach to deep learning acceleration is to program the compute units of graphics processing units (GPUs). As performance and the amount of computation grow, however, power consumption grows as well, which has led to various attempts to improve the performance-per-watt ratio. One representative approach is the AI-specific processor: a chip whose compute units are specialized for neural network operations and which handles low-precision integer and floating-point types such as int8/fp8/fp16 natively in order to raise the performance-per-watt ratio. The IPU (Intelligence Processing Unit) is one such processor, developed by Graphcore to accelerate AI and ML workloads.
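To make the low-precision point concrete, here is a minimal NumPy sketch (not IPU-specific, and fp8 is omitted because NumPy has no native fp8 type) showing how int8 and fp16 values take a quarter and half of the memory of fp32, which is part of how such processors improve performance per watt:

```python
import numpy as np

# Bytes per element for a few common precisions.
for dtype in (np.float32, np.float16, np.int8):
    print(np.dtype(dtype).name, "->", np.dtype(dtype).itemsize, "byte(s) per element")

# The same matrix multiplication in fp32 and fp16: fp16 halves the memory
# footprint and traffic at the cost of reduced numerical precision.
a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
c_fp32 = a @ b
c_fp16 = a.astype(np.float16) @ b.astype(np.float16)

print("max abs error of the fp16 result:",
      float(np.max(np.abs(c_fp32 - c_fp16.astype(np.float32)))))
```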

Architecture

The IPU consists of tiles, IPU-Exchange™️ interface, IPU-Link™️ interface, and PCIe interface.

  • Tiles: The minimum unit of computation in the IPU, consisting of an IPU-Core™️ and SRAM. A single IPU contains many tiles[1]. In the Bow IPU, the total on-chip SRAM is 900MB, which is smaller than the memory of GPUs with comparable performance, but the tile memory is faster than the shared memory (L1 cache) of such GPUs[2] and has latency comparable to the L2 cache of CPUs[3].

  • IPU-Exchange™️: The on-chip interconnect that moves data between tiles within a single IPU. It provides non-blocking, all-to-all exchange with up to 11TB/s of bandwidth.

  • IPU-Link™️: The interface for transferring data between IPUs when multiple IPUs are connected. The Bow IPU has 10 IPU-Links™️, providing 320GB/s of chip-to-chip bandwidth.

  • PCIe: The interface for connecting to the host computer. It supports PCIe Gen4 x16 and provides up to 64GB/s of bidirectional bandwidth.

(References: https://www.graphcore.ai/bow-processors; "Dissecting the Graphcore IPU Architecture via Microbenchmarking", Citadel, 2019, arXiv:1912.03413 [cs.DC])

Bow IPU

The Bow IPU is the latest-generation IPU, built with Wafer-on-Wafer (WoW) technology. It has a total of 1,472 compute cores and 900MB of in-processor memory, and can perform up to 350 teraFLOPS of computation.
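As a rough sanity check on these figures (using only the numbers quoted above, and assuming the memory is spread evenly across tiles), the per-tile SRAM works out to a bit over 600KB:

```python
# Back-of-the-envelope split of the Bow IPU's 900MB of on-chip SRAM across
# its 1,472 tiles, assuming an even distribution (figures from the text above).
total_sram_kb = 900 * 1000      # 900 MB expressed in KB (decimal units assumed)
num_tiles = 1_472

per_tile_kb = total_sram_kb / num_tiles
print(f"~{per_tile_kb:.0f} KB of SRAM per tile")   # ~611 KB
```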

Bow 2000

(Bow-2000 diagram) The Bow-2000 is a 1U blade equipped with four Bow IPU chips. InfiniBand interfaces are provided for data transfer between multiple Bow-2000s.

IPU Pod

An IPU cluster consisting of multiple Bow 2000s and a single host system.

  • Host System: A server that drives computation on the IPUs in the Pod. It is connected to the Bow-2000s via RoCE (RDMA over Converged Ethernet), enabling low-latency data transfer from memory during computation.
  • Bow-2000: The Bow-2000s are connected to each other through InfiniBand interfaces, so data can move between IPUs without being affected by traffic between the host and the Bow-2000s.
  • V-IPU: Used to bundle IPU chips together and partition them.

Comparison with Traditional GPUs

Advantages

  • It has a larger number of finer-grained compute units[1] than a GPU, so it can process more operations simultaneously.
  • With the high bandwidth and low latency of the IPU-Exchange™️ and IPU-Link™️ interconnects, it can efficiently handle multi-IPU and multi-node operations using multiple Bow IPUs.

Disadvantages

  • Out of the box, it is difficult to use in multi-user environments through containers. Similar to NVIDIA's nvidia-docker, Graphcore provides the gc-docker CLI tool to run containers with access to IPUs, but in this case the container uses Docker's host network, which can be a drawback in terms of security (illustrated in the sketch below).
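To illustrate the network-isolation difference, here is a small sketch using the Docker SDK for Python rather than gc-docker itself; the ubuntu image is just a stand-in, not an IPU setup, and the point is only to contrast host networking with the default isolated bridge network:

```python
import docker

client = docker.from_env()

# Host networking: the container shares the host's network namespace and
# sees the host's interfaces directly -- convenient, but weaker isolation.
host_net = client.containers.run(
    "ubuntu:22.04", "cat /proc/net/dev", network_mode="host", remove=True
)

# Default bridge networking: the container only sees its own virtual interface.
bridge_net = client.containers.run(
    "ubuntu:22.04", "cat /proc/net/dev", remove=True
)

print("host-network container sees:\n", host_net.decode())
print("bridge-network container sees:\n", bridge_net.decode())
```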

Backend.AI with Graphcore IPU

Backend.AI is developed on the basic premise of multi-user environments that demand strong security, in the cloud and the enterprise. The limitation that made gc-docker hard to use in multi-user environments was resolved through collaboration with Graphcore and an update to the Backend.AI Container Pilot engine. As a result, Graphcore IPUs can now be used in multi-user environments with complete isolation and no degradation in performance.

  • Fast and convenient resource allocation: IPU partitions can be allocated to isolated containers with just a few clicks.
  • Isolated network environment: With Backend.AI, sessions run in isolation without exposing containers to the host network.
  • Combining multiple IPU Pods: Using Backend.AI's cluster session feature, which groups multiple distributed containers into a single compute session, distributed training is possible even across multiple IPU Pods (a rough sketch follows below).
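As a rough illustration of what requesting IPU resources for such a session could look like from the Backend.AI client SDK: the session/ComputeSession classes follow the public client API, but the exact parameters shown, the image tag, and especially the ipu.device resource-slot name are assumptions made for illustration, not the confirmed interface.

```python
# Hypothetical sketch only: the resource-slot name "ipu.device", the image tag,
# and the cluster options are illustrative assumptions, not a confirmed interface.
from ai.backend.client.session import Session

with Session() as api:
    compute_session = api.ComputeSession.get_or_create(
        "graphcore/pytorch:latest",                    # hypothetical IPU-enabled image
        resources={"cpu": 8, "mem": "32g", "ipu.device": 4},
        cluster_size=2,                                # e.g. containers spanning two IPU Pods
    )
    print("created session:", compute_session.name)    # attribute access also illustrative
```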

Conclusion

Backend.AI effectively abstracts the interfaces for recognizing and allocating hardware such as storage and AI accelerators. This design makes it much easier to support new hardware flexibly, and support for Graphcore IPUs was no exception. With this update, Graphcore IPUs have become the fourth family of AI accelerators supported in Backend.AI, following CUDA GPUs, ROCm GPUs, and Google TPUs. Going forward, Backend.AI will keep pace with new and powerful hardware as it is released, continuing to add value on top of it. Please look forward to it!

[1] 1,472 compute cores in the Bow IPU

[2] Based on the NVIDIA Turing T4

[3] Based on Intel Skylake/Kaby Lake/Coffee Lake