Monitoring
Note: Amazon CloudWatch supports metric resolutions as fine as 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics: at 1-second resolution, metrics are available for 3 hours. For more information about the resolution and lifespan of CloudWatch metrics, see the Amazon CloudWatch API Reference.
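The trade-off between resolution and lifespan can be sketched as a small lookup. This is a minimal illustration, not an AWS API: the helper name `finest_period_seconds` is hypothetical, and only the 3-hour tier comes from the note above. The remaining tiers (15 days at 60 seconds, 63 days at 300 seconds, 455 days at 3600 seconds) are assumed from CloudWatch's published retention schedule and should be checked against the current CloudWatch documentation.

```python
from datetime import timedelta

# (maximum data age, finest period in seconds still retained at that age).
# Only the first tier is stated in the note above; the others are assumptions
# taken from CloudWatch's published retention schedule.
RETENTION_TIERS = [
    (timedelta(hours=3), 1),
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_period_seconds(age):
    """Return the finest period (in seconds) retained for data of the given age.

    Returns None when the data is older than the longest retention tier.
    """
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return period
    return None


if __name__ == "__main__":
    for age in (timedelta(hours=1), timedelta(days=1), timedelta(days=30)):
        print(age, "->", finest_period_seconds(age), "second period")
```

A helper like this can be used to pick a period before querying metrics, so that a request for older data does not ask for a resolution that is no longer stored.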
Tip: To profile your training job at a finer resolution, down to 100-millisecond (0.1-second) granularity, and store the training metrics indefinitely in Amazon S3 for custom analysis at any time, consider using SageMaker Debugger. SageMaker Debugger provides built-in rules that automatically detect common training issues, including hardware resource utilization issues (such as CPU, GPU, and I/O bottlenecks) and non-converging model issues (such as overfitting, vanishing gradients, and exploding tensors). It also provides visualizations through Studio and its profiling report. To explore these visualizations, see the SageMaker Debugger documentation.
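As a rough sketch of what requesting 100-millisecond profiling might look like, the helper below builds a profiler configuration dictionary. The helper name `build_profiler_config` and the bucket path are hypothetical. The field names `S3OutputPath` and `ProfilingIntervalInMilliseconds`, and the list of allowed intervals, are assumptions based on the `ProfilerConfig` structure of the SageMaker `CreateTrainingJob` API; verify them against the current API reference before relying on them.

```python
# Intervals assumed to be accepted by the profiler, in milliseconds.
# This list is an assumption; confirm it in the SageMaker API reference.
ALLOWED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)


def build_profiler_config(s3_output_path, interval_ms=100):
    """Build a ProfilerConfig-style dictionary for a training job request.

    s3_output_path: S3 URI where profiling output is stored.
    interval_ms: system metric sampling interval in milliseconds.
    """
    if not s3_output_path.startswith("s3://"):
        raise ValueError("s3_output_path must be an S3 URI starting with s3://")
    if interval_ms not in ALLOWED_INTERVALS_MS:
        raise ValueError(
            f"interval_ms must be one of {ALLOWED_INTERVALS_MS}, got {interval_ms}"
        )
    return {
        "S3OutputPath": s3_output_path,
        "ProfilingIntervalInMilliseconds": interval_ms,
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not a real bucket.
    print(build_profiler_config("s3://example-bucket/profiler-output", 100))
```

The resulting dictionary is intended to be passed as the profiler configuration of a training job request. Validating the interval locally gives an early error instead of a rejected API call.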
Note
1. The /aws/sagemaker/NotebookInstances/[LifecycleConfigHook] log stream is created when you create a notebook instance with a lifecycle configuration. For more information, see the documentation on notebook instance lifecycle configurations.
2. For Inference Pipelines, if you don't provide container names, the platform uses **container-1, container-2**, and so on, corresponding to the order provided in the SageMaker model.
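The default naming in item 2 can be illustrated with a short sketch. The function `default_container_names` is hypothetical and only mirrors the naming pattern described above for the case where no container names are provided. It does not call any SageMaker API.

```python
def default_container_names(container_count):
    """Return the default names for an inference pipeline's containers.

    Names follow the order of the containers in the SageMaker model:
    container-1 for the first, container-2 for the second, and so on.
    """
    if container_count < 0:
        raise ValueError("container_count must not be negative")
    return [f"container-{index}" for index in range(1, container_count + 1)]


if __name__ == "__main__":
    # A pipeline with three unnamed containers.
    print(default_container_names(3))
```

Knowing the default names in advance helps when searching logs or metrics for a specific container in a pipeline that was created without explicit names.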
Important: The following examples may not work for all endpoints, because certain features may exclude an endpoint. For the list of those features, see the corresponding documentation page.