Amazon CloudWatch Container Insights now auto-discovers the health status of your SageMaker HyperPod nodes running on EKS and visualizes it in curated dashboards, so you can monitor node availability for operational excellence. Using the out-of-the-box dashboards, you can identify unhealthy nodes easily and mitigate issues quickly to keep training durations efficient.
Container Insights works with SageMaker to collect deep health check test results for HyperPod nodes and displays them in preset dashboards to help you understand the health and performance of your nodes and identify whether they are ready for scheduling. Container Insights helps you optimize training durations by classifying failing nodes as “pending reboot” or “pending replacement,” and by guiding you on maintaining node health when automatic node replacement is disabled. If auto-recovery is enabled, you gain visibility into node mutations and delays in your training jobs, and can see how your tasks resume from the last checkpoint.
Getting started with Container Insights is easy. You can onboard by installing the CloudWatch Observability EKS Add-on or the latest CloudWatch agent into your clusters, or by upgrading your Helm charts to the latest CloudWatch agent version. Once configured, you can navigate to the Container Insights console and view your SageMaker HyperPod node health status out-of-the-box.
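As an illustration of the add-on path, here is a minimal Python sketch that requests the CloudWatch Observability EKS Add-on through the EKS `CreateAddon` API using boto3. The helper names `build_addon_request` and `install_addon` and the cluster name in the usage note are hypothetical and not part of this announcement. The sketch assumes boto3 is installed, AWS credentials are configured, and the IAM permissions the CloudWatch agent needs are already in place. See the Container Insights user guide for the authoritative steps.

```python
from typing import Any, Dict

# Name of the CloudWatch Observability add-on in the EKS add-on catalog.
ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(cluster_name: str) -> Dict[str, Any]:
    """Build keyword arguments for the EKS CreateAddon call (hypothetical helper)."""
    if not cluster_name:
        raise ValueError("cluster_name must be non-empty")
    return {"clusterName": cluster_name, "addonName": ADDON_NAME}


def install_addon(cluster_name: str, region: str) -> str:
    """Request installation of the add-on and return its reported status.

    Assumes credentials and agent IAM permissions are already configured.
    """
    import boto3  # imported lazily so build_addon_request works without boto3

    eks = boto3.client("eks", region_name=region)
    response = eks.create_addon(**build_addon_request(cluster_name))
    return response["addon"]["status"]
```

For example, `install_addon("my-hyperpod-cluster", "us-west-2")` would submit the request for a cluster with that (hypothetical) name. Installation is asynchronous, so the returned status describes the add-on at request time, not its final state.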
SageMaker HyperPod node health observability is now available in Container Insights for EKS in all commercial regions where SageMaker HyperPod is available. HyperPod node health metrics follow observation-based pricing; see the Container Insights pricing page for details. For further information, see the Container Insights user guide.
Source: Amazon AWS