Announcing Amazon SageMaker HyperPod training operator

Today, we’re announcing the general availability of the Amazon SageMaker HyperPod training operator, a purpose-built Kubernetes extension for resilient foundation model training on HyperPod.

Amazon SageMaker HyperPod empowers customers to accelerate AI model development across hundreds or thousands of GPUs with built-in resiliency, decreasing model training time by up to 40%. As training clusters expand, recovery from training interruptions becomes increasingly disruptive. Traditionally, failure recovery requires a complete job restart across all nodes when even a single training process fails, resulting in additional downtime and increased costs. Moreover, identifying and resolving critical training issues such as stalled GPUs, low training throughput, and numerical instabilities typically requires complex custom monitoring code, further extending development timelines and delaying time to market.

With the HyperPod training operator, customers can further enhance training resilience for Kubernetes workloads. Instead of a full job restart when failures occur, the HyperPod training operator performs surgical recovery, selectively restarting only the affected training resources for faster recovery from faults. It also introduces customizable hanging-job monitoring, configured through simple YAML policies, to help overcome problematic training scenarios such as stalled training batches, non-numeric loss values, and performance degradation. Getting started is simple: create a HyperPod cluster, install the training operator add-on, optionally define custom recovery policies for hanging jobs, and launch training.
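As an illustration only, a hanging-job recovery policy of the kind described above might look like the following YAML sketch. The field names and values here (`logMonitoringConfiguration`, `expectedRecurringFrequencyInSeconds`, and the regex patterns) are assumptions for illustration, not the operator's documented schema; consult the SageMaker HyperPod documentation for the actual specification.

```yaml
# Hypothetical sketch of a hanging-job monitoring policy.
# Field names below are illustrative assumptions, not the real CRD schema.
logMonitoringConfiguration:
  # Flag the job as stalled if no new training batch is logged
  # within the expected window (stalled training batches).
  - name: BatchProgress
    logPattern: ".*step=[0-9]+.*"
    expectedRecurringFrequencyInSeconds: 300
  # Treat non-numeric loss values (NaN/Inf) as a fault that
  # should trigger recovery of the affected workers.
  - name: LossSanity
    logPattern: ".*loss=(nan|inf).*"
    stopPattern: true
  # Flag sustained throughput below a minimum threshold
  # (performance degradation).
  - name: Throughput
    logPattern: ".*throughput=([0-9.]+).*"
    metricThreshold: 100.0
    comparisonOperator: LessThan
```

When a policy like this fires, the operator would restart only the affected training resources rather than the entire job, in line with the surgical recovery behavior described above.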

This release is generally available in all AWS Regions where SageMaker HyperPod is currently supported.

See the documentation to learn more.

Source: Amazon AWS