Training
Tip As you increase the batch size, adjust your hyperparameters (such as the learning rate) so that your model still trains to a satisfactory convergence.
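A minimal sketch (not from the SageMaker documentation) of the common linear learning-rate scaling heuristic; the reference batch size and learning rate below are assumptions you would replace with your own tuned values:

```python
# Linear learning-rate scaling heuristic (illustrative values, not SageMaker defaults):
# keep the ratio of learning rate to global batch size roughly constant as the
# batch size grows, then fine-tune from there.
BASE_BATCH_SIZE = 256     # batch size the base learning rate was tuned for (assumption)
BASE_LEARNING_RATE = 0.1  # learning rate known to converge at the base batch size (assumption)

def scaled_learning_rate(global_batch_size: int) -> float:
    """Scale the learning rate linearly with the global batch size."""
    return BASE_LEARNING_RATE * global_batch_size / BASE_BATCH_SIZE

print(scaled_learning_rate(1024))  # 0.4
```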
Note
The ml instance types used by SageMaker training have the same number of GPUs as the corresponding p3 instance types. For example, ml.p3.8xlarge has the same number of GPUs as p3.8xlarge: four.
Tip SageMaker Training Compiler only compiles DL models for training on GPU instances managed by SageMaker. To compile your model for inference and deploy it to run anywhere in the cloud and at the edge, use SageMaker Neo.
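As a rough illustration (not the exact snippet from the SageMaker docs), Training Compiler is typically enabled by passing a TrainingCompilerConfig to a supported framework estimator; the script name, IAM role, instance type, and framework versions below are placeholders:

```python
# Hedged sketch: enable SageMaker Training Compiler on a Hugging Face estimator.
# All names, versions, and the IAM role below are placeholders/assumptions.
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",                                # your training script (placeholder)
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder IAM role
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.21",
    pytorch_version="1.11",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),              # turns on Training Compiler
)
estimator.fit({"train": "s3://my-bucket/train"})           # placeholder input location
```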
Tip To learn more about how to configure Amazon FSx for Lustre or Amazon EFS with your VPC configuration using the SageMaker Python SDK estimators, see the SageMaker Python SDK documentation.
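The following is a minimal sketch (not copied from the SDK docs) of attaching a training job to a VPC that can reach your file system; the image URI, role, subnet, and security group IDs are placeholders:

```python
# Hedged sketch: configure the estimator with the VPC subnets and security groups
# that have access to your Amazon FSx for Lustre or Amazon EFS file system.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                       # placeholder image
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder IAM role
    instance_type="ml.p3.8xlarge",
    instance_count=1,
    subnets=["subnet-0123456789abcdef0"],                   # subnet(s) that can reach the file system
    security_group_ids=["sg-0123456789abcdef0"],            # security group allowing file system traffic
)
```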
Tip The data input mode integrations with Amazon S3, Amazon EFS, and FSx for Lustre are the recommended ways to configure a data source optimally, but you aren't strictly limited to them. You can strategically improve data loading performance using the SageMaker managed storage options and input modes, or you can write your own data reading logic directly in your training container. For example, you can read from a different data source, write your own S3 data loader class, or use a third-party framework's data loading functions within your training script. However, you must make sure that you specify paths that SageMaker can recognize.
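For example, a training script can bypass the managed channels entirely and stream objects straight from Amazon S3; the bucket and key below are hypothetical:

```python
# Hedged sketch of custom data-reading logic inside a training container:
# read directly from S3 with boto3 instead of (or in addition to) the data
# that SageMaker mounts under /opt/ml/input/data/<channel>.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-training-bucket", Key="data/train.csv")  # placeholders
raw_bytes = obj["Body"].read()
# ... parse raw_bytes with your own loader before feeding it to the model ...
```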
Tip If you use a custom training container, make sure you install the SageMaker training toolkit, which helps set up the environment for SageMaker training jobs. Otherwise, you must specify the required environment variables explicitly in your Dockerfile. For more information, see the SageMaker training toolkit repository on GitHub.
Tip
When you specify directory_path, make sure that you provide the Amazon FSx file system path starting with MountName.
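A sketch of what this looks like with the SageMaker Python SDK's FileSystemInput, assuming an estimator configured with VPC access as shown earlier; the file system ID and mount name are placeholders:

```python
# Hedged sketch: directory_path starts with the FSx for Lustre mount name.
from sagemaker.inputs import FileSystemInput

train_fs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder FSx for Lustre file system ID
    file_system_type="FSxLustre",
    directory_path="/3x8abcde/train",        # "/<MountName>/<path>" (placeholder mount name)
    file_system_access_mode="ro",
)
estimator.fit({"train": train_fs})           # estimator as configured for your VPC
```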
Tip
When you specify DirectoryPath, make sure that you provide the Amazon FSx file system path starting with MountName.
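The low-level equivalent with the CreateTrainingJob API (for example through boto3) uses a FileSystemDataSource entry; the IDs and paths below are placeholders:

```python
# Hedged sketch of the InputDataConfig structure for CreateTrainingJob;
# DirectoryPath again starts with the file system's mount name.
input_data_config = [
    {
        "ChannelName": "train",
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": "fs-0123456789abcdef0",   # placeholder
                "FileSystemType": "FSxLustre",
                "DirectoryPath": "/3x8abcde/train",       # "/<MountName>/<path>" (placeholder)
                "FileSystemAccessMode": "ro",
            }
        },
    }
]
```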
Tip To learn more, see the AWS machine learning blog post that discusses case studies and performance benchmarks of data sources and input modes.
Note This feature is available in the SageMaker Python SDK v2.98.0 and later.
Note Currently, only one instance group of a heterogeneous cluster can be specified in the distribution configuration.
Note
In this example, the original datasets are reused for incremental training; however, you can use different datasets, such as ones that contain newly added samples. Upload the new datasets to Amazon S3 and adjust the data_channels variable used to train the new model.
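A minimal sketch of rebuilding data_channels for the incremental run, assuming the new datasets are already uploaded to an S3 bucket of yours (all URIs are placeholders) and that new_estimator is the estimator set up for the incremental training job:

```python
# Hedged sketch: point the channels at the new datasets before fitting.
from sagemaker.inputs import TrainingInput

train_data = TrainingInput("s3://my-bucket/incremental/train")            # placeholder URI
validation_data = TrainingInput("s3://my-bucket/incremental/validation")  # placeholder URI

data_channels = {"train": train_data, "validation": validation_data}
new_estimator.fit(inputs=data_channels, logs=True)
```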
Note
Unless your training job completes quickly, we recommend that you use checkpointing with managed spot training. SageMaker built-in algorithms and marketplace algorithms that do not checkpoint are currently limited to a MaxWaitTimeInSeconds of 3600 seconds (60 minutes).
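The following is a rough sketch (not the documentation's exact example) of enabling managed spot training together with checkpointing on a generic estimator; the image URI, role, and checkpoint location are placeholders:

```python
# Hedged sketch: spot training with checkpoints synced to S3.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                       # placeholder image
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder IAM role
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    use_spot_instances=True,
    max_run=3600,                                           # MaxRuntimeInSeconds
    max_wait=7200,                                          # MaxWaitTimeInSeconds; must be >= max_run
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",        # placeholder checkpoint location
)
```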
Note All warm pool instance usage counts toward your SageMaker training resource limit. Increasing your warm pool resource limit does not increase your instance limit, but allocates a subset of your resource limit to warm pool training.
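As a sketch, assuming the keep_alive_period_in_seconds estimator argument available in recent SageMaker Python SDK versions (the image URI and role below are placeholders):

```python
# Hedged sketch: request a warm pool by setting a keep-alive period on the estimator.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                       # placeholder image
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder IAM role
    instance_type="ml.g5.xlarge",
    instance_count=1,
    keep_alive_period_in_seconds=1800,                      # keep instances warm for 30 minutes
)
```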
Note
Training job names have date/time suffixes. The example training job names my-training-job-1 and my-training-job-2 should be replaced with actual training job names. You can use the estimator.latest_training_job.job_name command to fetch the actual training job name.
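For example, assuming an estimator that has already launched a job:

```python
# Hedged sketch: fetch the actual (date/time-suffixed) training job name after fit().
estimator.fit(wait=False)
job_name = estimator.latest_training_job.job_name
print(job_name)  # e.g. "my-training-job-2023-01-01-00-00-00-000" (illustrative suffix)
```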
Note This feature is available through the SageMaker framework estimator classes. Supported frameworks are PyTorch v1.10 or later and TensorFlow v2.6 or later.
Note
The instance_type and instance_count argument pair and the instance_groups argument of the SageMaker estimator class are mutually exclusive. For homogeneous cluster training, use the instance_type and instance_count argument pair. For heterogeneous cluster training, use instance_groups.
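A sketch of a heterogeneous cluster definition, assuming a PyTorch estimator; the group names, instance types, counts, script, role, and versions are placeholders:

```python
# Hedged sketch: define two instance groups and pass them via instance_groups
# instead of the instance_type/instance_count pair.
from sagemaker.instance_group import InstanceGroup
from sagemaker.pytorch import PyTorch

data_group = InstanceGroup("data_group", "ml.c5.18xlarge", 2)   # e.g. CPU instances for data preprocessing
dnn_group = InstanceGroup("dnn_group", "ml.p4d.24xlarge", 1)    # e.g. GPU instances for model training

estimator = PyTorch(
    entry_point="train.py",                                 # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder IAM role
    framework_version="1.12",
    py_version="py38",
    instance_groups=[data_group, dnn_group],
)
```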
Note
To find a complete list of available framework containers, framework versions, and Python versions, see the AWS Deep Learning Containers GitHub repository.
Note When using the SageMaker data parallel library, make sure the instance group consists of instance types that the library supports.
Important Only three built-in algorithms currently support incremental training: Object Detection, Image Classification, and Semantic Segmentation.
Important SageMaker Managed Warm Pools are a billable resource. For more information, see Amazon SageMaker pricing.
Note This feature is available in recent versions of the SageMaker Python SDK.
Note Amazon CloudWatch supports high-resolution custom metrics, and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see the Amazon CloudWatch API Reference.
Tip If you want to profile your training job with a finer resolution down to 100-millisecond (0.1 second) granularity and store the training metrics indefinitely in Amazon S3 for custom analysis at any time, consider using Amazon SageMaker Debugger. SageMaker Debugger provides built-in rules to automatically detect common training issues; it detects hardware resource utilization issues (such as CPU, GPU, and I/O bottlenecks) and non-converging model issues (such as overfitting, vanishing gradients, and exploding tensors). SageMaker Debugger also provides visualizations through Studio and its profiling report. To explore the Debugger visualizations, see the SageMaker Debugger documentation.
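A rough sketch (values are illustrative, not recommendations) of turning on Debugger profiling with a finer system-metrics sampling interval through the SageMaker Python SDK; the script, role, and versions are placeholders:

```python
# Hedged sketch: profile system metrics every 100 ms with SageMaker Debugger.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.pytorch import PyTorch

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=100,            # sample CPU/GPU/I-O metrics every 100 ms
    framework_profile_params=FrameworkProfile(),   # also collect framework-level profiling data
)

estimator = PyTorch(
    entry_point="train.py",                                 # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder IAM role
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
    profiler_config=profiler_config,
)
```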
Tip To create an augmented manifest file, use Amazon SageMaker Ground Truth and create a labeling job. For more information about the output from a labeling job, see the Amazon SageMaker Ground Truth documentation.
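A sketch of passing an augmented manifest file as a training channel, assuming an existing estimator; the manifest URI and attribute names are placeholders that must match your labeling job's output:

```python
# Hedged sketch: an augmented manifest channel defined with TrainingInput.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/labels/output.manifest",        # placeholder manifest location
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "my-label-attribute"],   # JSON keys to feed to the algorithm (placeholders)
    record_wrapping="RecordIO",
    input_mode="Pipe",                                      # augmented manifests are typically used with Pipe mode
)
estimator.fit({"train": train_input})
```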