Backend.AI: March 2022 Update

Lablup · 6 min read

Hello! We would like to inform you of the improvements made in Backend.AI 22.03 / Enterprise R2 for the March 2022 update!

This is an integrated release covering six months of updates; major bug fixes and essential feature improvements will also be reflected in Backend.AI 21.09 / Enterprise R2. Backend.AI 21.03 and earlier versions will no longer receive updates, except for urgent security patches for existing customers.

Enhanced MLOps Support

Full support for AppProxy v2

Previously, users had to go through the Manager to reach applications running in containers when connecting via AppProxy. This was fine for web applications that only required simple interactions, but it was difficult to scale for more demanding workloads such as model serving. This release adds support for the new and improved AppProxy v2, which allows direct access to container applications without routing traffic through the Manager, while the existing method remains available. As a result, container applications can now be served with a horizontally scalable architecture.
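
To picture the scaling benefit, consider the following conceptual sketch (an illustration of the idea only, not AppProxy's actual implementation; the worker endpoints are hypothetical): because v2 proxy workers forward traffic directly to containers instead of funneling it through the Manager, serving capacity grows simply by adding workers to the pool.

    # Conceptual sketch of horizontally scalable app routing (not actual
    # AppProxy code). Stateless proxy workers forward traffic directly
    # to the container, so the Manager is no longer a bottleneck.
    import hashlib

    # Hypothetical AppProxy v2 worker endpoints; add entries to scale out.
    PROXY_WORKERS = [
        "proxy-worker-1.example.com:10200",
        "proxy-worker-2.example.com:10200",
        "proxy-worker-3.example.com:10200",
    ]

    def route(session_id: str) -> str:
        """Pin each session to one worker deterministically while
        spreading the overall load across the pool."""
        digest = hashlib.sha256(session_id.encode()).digest()
        return PROXY_WORKERS[digest[0] % len(PROXY_WORKERS)]

    print(route("my-serving-session"))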

Support for batch session dependencies and session status webhooks

When creating a batch session, a new option lets you specify existing batch sessions as dependencies, so that a task is scheduled and executed only after the previous tasks have completed successfully. In addition to the existing per-session event streaming API, a new option actively calls a webhook URL with the session's scheduling and execution status whenever that status changes. These features can be used directly at the API and SDK level, or through a publicly available MLOps pipeline interface.
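
For instance, a pipeline controller only needs a small HTTP endpoint to receive these status notifications. The sketch below is a minimal example; the session_id and status field names are assumptions for illustration, and the exact payload schema is defined by the Backend.AI API.

    # Minimal sketch of a session-status webhook receiver using aiohttp.
    # The payload field names are assumptions for illustration; see the
    # Backend.AI API reference for the actual schema.
    from aiohttp import web

    async def on_session_status(request: web.Request) -> web.Response:
        payload = await request.json()
        session_id = payload.get("session_id")  # hypothetical field name
        status = payload.get("status")          # hypothetical field name
        print(f"session {session_id} -> {status}")
        # e.g., kick off the next pipeline stage when a batch session succeeds
        return web.Response(status=204)

    app = web.Application()
    app.router.add_post("/hooks/backendai/session-status", on_session_status)

    if __name__ == "__main__":
        web.run_app(app, port=8080)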

New hardware platform support

Integration with Dell PowerScale storage

The Dell PowerScale storage backend has been integrated with the Storage Proxy, providing more detailed real-time statistics and monitoring features when using this storage.

Cross-architecture cluster configuration support

You can now register and use container images built for multiple architectures in the registry, and the CPU architecture of the Backend.AI Agent installed on each compute node is recognized so that the scheduler selects and assigns the image variant matching that architecture. To distribute such multi-architecture images, a new cr.backend.ai/multiarch repository has been opened.
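
Conceptually, the scheduling behavior works like the sketch below (a simplified illustration, not the actual Manager code; the image tags are hypothetical): a multi-architecture image carries one variant per architecture, and the variant matching the agent's reported architecture is chosen.

    # Simplified sketch of architecture-aware image selection (not the
    # actual Backend.AI scheduler code; the image tags are hypothetical).
    IMAGE_VARIANTS = {
        "x86_64": "cr.backend.ai/multiarch/python:3.9-ubuntu20.04-amd64",
        "aarch64": "cr.backend.ai/multiarch/python:3.9-ubuntu20.04-arm64",
    }

    def select_image(agent_arch: str) -> str:
        """Pick the image variant matching the agent's CPU architecture."""
        try:
            return IMAGE_VARIANTS[agent_arch]
        except KeyError:
            raise RuntimeError(f"no image variant built for {agent_arch}")

    # An agent that reported "aarch64" is assigned the arm64 variant.
    print(select_image("aarch64"))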

Stability and performance improvements

Automatic round-down applied when allocating GPU fraction

Previously, when allocating GPU fractions, an allocation failure was returned if the available GPU resources were too fragmented relative to a pre-specified quantization size. (e.g., if 1.0 fGPU was requested, the remaining GPU capacity was limited to 0.33, 0.33, and 0.34 per device, and the quantization size was 0.1, the request was treated as an allocation failure.)

Starting with this version, the allocation is automatically rounded down to the quantization size, so the session is created normally with slightly smaller resources instead of returning an error. (e.g., for the same 1.0 fGPU request under the same conditions, 0.3, 0.3, and 0.3 are allocated and the request succeeds.)
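
The new behavior corresponds roughly to the following round-down rule, shown here as a sketch of the idea rather than the actual allocator code:

    # Sketch of quantized round-down for fractional GPU allocation
    # (illustrative only, not the actual Backend.AI allocator).
    from decimal import Decimal

    def round_down(amount: Decimal, quantum: Decimal) -> Decimal:
        """Round amount down to the nearest multiple of quantum."""
        return (amount // quantum) * quantum

    quantum = Decimal("0.1")
    remaining_per_device = [Decimal("0.33"), Decimal("0.33"), Decimal("0.34")]

    allocated = [round_down(r, quantum) for r in remaining_per_device]
    print(allocated)       # [Decimal('0.3'), Decimal('0.3'), Decimal('0.3')]
    print(sum(allocated))  # 0.9 -- slightly less than the requested 1.0 fGPU,
                           # but the session is created instead of failing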

Since session information now accurately reflects the actual resource usage even when the allocated capacity is smaller than the requested capacity, this change improves convenience for customers who make heavy use of fractional GPU scaling.

Database Usage Optimization

To alleviate the database bottleneck caused by large numbers of users simultaneously creating and deleting sessions, we have migrated the authentication-key-based session count tracking and the container-based real-time statistics collection to Redis-based operations. We have also optimized the resource usage auto-correction query, which previously caused excessive overhead within a single transaction.
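
The idea behind the migration can be sketched as follows (the key names are hypothetical and this is not the actual implementation): per-access-key session counts are kept as Redis counters updated on session start and stop, so they no longer have to be recomputed from database rows inside a transaction.

    # Sketch of Redis-based session count tracking (hypothetical key
    # names; not the actual Backend.AI implementation). Counter updates
    # are O(1) in-memory operations, avoiding COUNT queries against the
    # sessions table inside a database transaction.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def on_session_created(access_key: str) -> None:
        r.incr(f"keypair.{access_key}.num_sessions")

    def on_session_terminated(access_key: str) -> None:
        r.decr(f"keypair.{access_key}.num_sessions")

    def concurrent_sessions(access_key: str) -> int:
        value = r.get(f"keypair.{access_key}.num_sessions")
        return int(value) if value is not None else 0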

Python 3.10 Upgrade

The Backend.AI server engine execution environment has been upgraded to Python 3.10.

Miscellaneous

  • Adjusted certain container images that use jemalloc (a system package installed in the image) to prevent runtime conflicts with the container resource constraint layer.
  • Disabled the repeated storage capacity scan of container working directories, which caused excessive performance degradation in certain environments.

Other Improvements in User Interface and Usability

  • The file browser now runs in a dedicated container managed by the Storage Proxy, providing better file I/O performance. (BETA)
  • You can now set a pending timeout per resource group to automatically cancel a session creation request if it is not scheduled within the specified period.
  • You can specify the types of sessions allowed to run for each resource group.
  • You can mount a virtual folder at an arbitrary absolute path inside a container.
  • You can mount subdirectories of a virtual folder under /home/work or at an arbitrary absolute path inside a container.
  • Session information is now displayed more concisely in the WebUI.
  • The WebUI now distinguishes between batch sessions and interactive sessions.
  • The CLI now provides consistent JSON output not only for queries but also for mutation operations that change server-side information.
  • The backend.ai-client and backend.ai-client-cli packages have been separated to reduce the potential for dependency conflicts when integrating the SDK with other applications and frameworks. (BETA)
  • Storage Proxy error details are now conveyed through the Manager API with more information about the error cause and the related file paths.

Development and Research Framework Support

  • TensorFlow 2.7/2.8 support
    • Supports TensorFlow 2.7/2.8. The TF 2.8 image omits some components because compatible versions of ecosystem packages such as TFX have not yet been released.
  • PyTorch 1.10/1.11 support
  • NGC TensorFlow/PyTorch 22.03, Triton 22.03 support
    • Supports the March 2022 version of the NGC TensorFlow image.
    • Supports the March 2022 version of the NGC PyTorch image.
    • Supports the March 2022 version of Nvidia Triton service image.
  • GPU-accelerated Julia 1.7 and FluxML support
    • Supports CUDA GPU-accelerated Julia 1.7.
    • Provides FluxML with CUDA 11.3-based GPU acceleration.
  • (Cloud) RStudio support
    • Provides RStudio on Backend.AI Cloud directly from the web browser, with no desktop app required.
  • Weights & Biases integration support (BETA)
    • Provides integrated support for Weights & Biases (W&B) on Backend.AI, making it easy to run W&B. It is still in beta and will be officially released at the end of April.