Training

This page was generated from content adapted from the AWS Developer Guide.

Distributed Training

  • Tip Adjust your hyperparameters to ensure that your model trains to a satisfactory convergence as you increase its batch size; the sketch after this list shows one common approach (linear learning-rate scaling).

  • Note The ml instance types used by SageMaker training have the same number of GPUs as the corresponding p3 instance types. For example, ml.p3.8xlarge has four GPUs, the same as p3.8xlarge.
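
As a rough illustration of the batch-size tip above, the following sketch scales the learning rate linearly with the global batch size when launching a distributed PyTorch job. The script name, role ARN, S3 path, and the batch_size/learning_rate hyperparameter names are assumptions for illustration, not a fixed SageMaker contract.

```python
# A minimal sketch, assuming the SageMaker Python SDK is installed.
# All names marked "placeholder" below are hypothetical.
from sagemaker.pytorch import PyTorch

base_lr = 0.001          # tuned for a per-GPU batch size of 32
gpus_per_instance = 4    # e.g., ml.p3.8xlarge has 4 GPUs
per_gpu_batch = 32
global_batch = per_gpu_batch * gpus_per_instance

estimator = PyTorch(
    entry_point="train.py",                                # placeholder script
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder role
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.p3.8xlarge",
    instance_count=1,
    hyperparameters={
        "batch_size": global_batch,
        # Linear scaling rule: grow the learning rate with the batch size.
        "learning_rate": base_lr * (global_batch / per_gpu_batch),
    },
)
estimator.fit("s3://amzn-s3-demo-bucket/train")            # placeholder S3 path
```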

Training Compiler

Access Training Data

  • Tip To learn more about how to configure Amazon FSx for Lustre or Amazon EFS with your VPC configuration using the SageMaker Python SDK estimators, see Use File Systems as Training Inputs in the SageMaker Python SDK documentation. A file-system input sketch follows this list.

  • Tip The input mode integrations with Amazon S3, Amazon EFS, and FSx for Lustre are the recommended ways to configure a data source, but you are not strictly limited to them. You can write your own data reading logic directly in your training container: for example, read from a different data source, write your own S3 data loader class, or use a third-party framework's data loading functions within your training script. However, make sure that you specify paths that SageMaker can recognize. A custom-loader sketch follows this list.

  • Tip If you use a custom training container, make sure you install the SageMaker training toolkit, which helps set up the environment for SageMaker training jobs. Otherwise, you must specify the environment variables explicitly in your Dockerfile. For more information, see Create a container with your own algorithms and models.

  • Tip When you specify directory_path (SageMaker Python SDK) or DirectoryPath (CreateTrainingJob API), make sure that you provide the Amazon FSx file system path starting with MountName, as in the file-system sketch below.

  • Tip To learn more, see Choose the best data source for your Amazon SageMaker training job. This AWS machine learning blog post further discusses case studies and performance benchmarks of data sources and input modes.
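
As a minimal sketch of the file system tips above, the following configures an FSx for Lustre input whose directory_path starts with the file system's MountName. The file system ID, mount name, image URI, role ARN, and VPC IDs are placeholders, and a VPC configuration on the estimator is required for file system inputs.

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

fsx_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder FSx file system ID
    file_system_type="FSxLustre",
    # The path must start with the file system's MountName (here "abcdefgh"),
    # followed by the directory that holds the training data.
    directory_path="/abcdefgh/training-data",
    file_system_access_mode="ro",
)

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerRole",                       # placeholder
    instance_type="ml.c5.xlarge",
    instance_count=1,
    subnets=["subnet-0123456789abcdef0"],          # VPC configuration is
    security_group_ids=["sg-0123456789abcdef0"],   # required for this input
)
estimator.fit({"training": fsx_input})
```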
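
And as a sketch of the custom data-reading tip, the following reads objects directly from Amazon S3 inside a training script with boto3 instead of relying on a SageMaker input channel. The bucket and prefix are placeholders, and error handling is omitted for brevity.

```python
import boto3

def iter_training_records(bucket="amzn-s3-demo-bucket", prefix="data/"):
    """Yield the raw bytes of every object under the given S3 prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            yield body.read()

for record in iter_training_records():
    ...  # feed each record into your framework's data pipeline
```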

Heterogeneous Cluster Training

  • Note This feature is available in the SageMaker Python SDK v2.98.0 and later.

  • Note This feature is available through the SageMaker PyTorch and TensorFlow framework estimator classes. Supported frameworks are PyTorch v1.10 or later and TensorFlow v2.6 or later.

  • Note The instance_type and instance_count argument pair and the instance_groups argument of the SageMaker estimator class are mutually exclusive. For homogeneous cluster training, use the instance_type and instance_count argument pair. For heterogeneous cluster training, use instance_groups, as in the sketch after this list.

  • Note To find a complete list of available framework containers, framework versions, and Python versions, see SageMaker Framework Containers in the AWS Deep Learning Containers GitHub repository.

  • Note Currently, only one instance group of a heterogeneous cluster can be specified in the distribution configuration.

  • Note When using the SageMaker data parallel library, make sure the instance group consists of instance types supported by the library.
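
A minimal sketch of the notes above, assuming the SageMaker Python SDK v2.98.0 or later. The group names, instance types, counts, script name, and role ARN are illustrative, and only the GPU group is passed to the distribution configuration.

```python
from sagemaker.instance_group import InstanceGroup
from sagemaker.pytorch import PyTorch

data_group = InstanceGroup("data_group", "ml.c5.18xlarge", 2)    # CPU data workers
train_group = InstanceGroup("train_group", "ml.p3.16xlarge", 1)  # GPU trainers

estimator = PyTorch(
    entry_point="train.py",                                # placeholder script
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder role
    framework_version="1.13",
    py_version="py39",
    instance_groups=[data_group, train_group],  # replaces instance_type/instance_count
    distribution={
        "smdistributed": {"dataparallel": {"enabled": True}},
        # Only one instance group can be specified here.
        "instance_groups": [train_group],
    },
)
```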

Incremental Training

Managed Spot Training

  • Note Unless your training job will complete quickly, we recommend that you use checkpointing with managed spot training, as in the sketch below. SageMaker built-in algorithms and marketplace algorithms that do not checkpoint are currently limited to a MaxWaitTimeInSeconds of 3600 seconds (60 minutes).
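
A minimal sketch of managed spot training with checkpointing enabled. The image URI, role ARN, and bucket are placeholders; max_wait must be greater than or equal to max_run.

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerRole",                       # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
    use_spot_instances=True,
    max_run=3600,   # cap on actual training time, in seconds
    max_wait=7200,  # total time to wait, including spot interruptions
    checkpoint_s3_uri="s3://amzn-s3-demo-bucket/checkpoints",  # sync checkpoints to S3
)
```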

Managed Warm Pools

  • Important SageMaker Managed Warm Pools are a billable resource. For more information, see Billing.

  • Note All warm pool instance usage counts toward your SageMaker training resource limit. Increasing your warm pool resource limit does not increase your instance limit, but allocates a subset of your resource limit to warm pool training.

  • Note This feature is available in the SageMaker Python SDK v2.110.0 and later.

  • Note Training job names have date/time suffixes. Replace the example training job names my-training-job-1 and my-training-job-2 with actual training job names. You can use the estimator.latest_training_job.job_name attribute to fetch the actual training job name, as in the sketch below.
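
A minimal sketch of the notes above, assuming the SageMaker Python SDK v2.110.0 or later. The image URI, role ARN, and S3 path are placeholders; keep_alive_period_in_seconds keeps the provisioned instances alive after the job finishes so that a matching job can reuse them.

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerRole",                       # placeholder
    instance_type="ml.g5.xlarge",
    instance_count=1,
    keep_alive_period_in_seconds=1800,  # retain the warm pool for 30 minutes
)

estimator.fit("s3://amzn-s3-demo-bucket/train")  # first job provisions the warm pool
print(estimator.latest_training_job.job_name)    # actual (date/time-suffixed) job name
estimator.fit("s3://amzn-s3-demo-bucket/train")  # a matching job can reuse the pool
```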

Monitor and Analyze Using CloudWatch Metrics

Use Augmented Manifest Files

  • Tip To create an augmented manifest file, use Amazon SageMaker Ground Truth and create a labeling job. For more information about the output from a labeling job, see Output Data. A sketch of passing an augmented manifest file as a training input follows.
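
A minimal sketch of passing an augmented manifest file as a training input. The S3 URI and attribute names are placeholders; augmented manifest channels use Pipe input mode with RecordIO record wrapping.

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://amzn-s3-demo-bucket/labeling-output/output.manifest",  # placeholder
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "my-label-job"],  # JSON keys to stream, placeholders
    record_wrapping="RecordIO",
    input_mode="Pipe",
)
# estimator.fit({"train": train_input})  # pass as a channel to any estimator
```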
