Cluster Management

This guide will help you create and manage your first HyperPod cluster using the CLI.

Prerequisites

Before you begin, ensure you have:

- An AWS account with appropriate permissions for SageMaker HyperPod
- AWS CLI configured with your credentials
- HyperPod CLI installed (``pip install sagemaker-hyperpod``)

Note

Region Configuration: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.

Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
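
The two rules in this note can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the CLI's actual implementation: ``resolve_region`` and its parameters are hypothetical names, and raising an error when no region is available at all is an assumption.

.. code-block:: python

   from typing import Optional


   def resolve_region(explicit: Optional[str], configured_default: Optional[str]) -> str:
       """Pick the region a command would target under the rule described above."""
       # An explicitly supplied --region value takes precedence.
       if explicit:
           return explicit
       # Otherwise fall back to the default region from the AWS configuration.
       if configured_default:
           return configured_default
       # Assumption: with neither value available there is nothing to target.
       raise ValueError("No region given and no default region configured")


   # Uniqueness is scoped per region, so the same stack name may exist
   # in two different regions without conflict.
   existing = {("us-west-2", "my-stack")}
   target = (resolve_region(None, "us-east-1"), "my-stack")
   name_is_free = target not in existing
   print(target, name_is_free)

Treating the pair of region and stack name as the key makes it clear why a name that is taken in one region can still be used in another.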

Creating Your First Cluster

1. Start with a Clean Directory

Start each cluster configuration in a new, empty directory:

.. code-block:: bash

   mkdir my-hyperpod-cluster
   cd my-hyperpod-cluster

2. Initialize a New Cluster Configuration

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp init cluster-stack

This creates three files:

- ``config.yaml``: The main configuration file you'll use to customize your cluster
- ``cfn_params.jinja``: A reference template for CloudFormation parameters
- ``README.md``: Usage guide with instructions and examples

Important

The ``resource_name_prefix`` parameter in the generated ``config.yaml`` file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. During cluster creation, a unique identifier is automatically appended to this prefix to ensure resource uniqueness.

3. Configure Your Cluster

You can configure your cluster in two ways:

Option 1: Edit config.yaml directly

The config.yaml file contains key parameters like:

.. code-block:: yaml

   template: cluster-stack
   namespace: kube-system
   stage: gamma
   resource_name_prefix: sagemaker-hyperpod-eks

Option 2: Use CLI/SDK commands (Pre-Deployment)

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp configure --resource-name-prefix your-resource-prefix

Note

The hyp configure command only modifies local configuration files. It does not affect existing deployed clusters.

4. Create the Cluster

Warning

Cluster Stack Name Uniqueness: Cluster stack names must be unique within each AWS region. Ensure your resource_name_prefix in config.yaml generates a unique stack name for the target region to avoid deployment conflicts.

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp create --region your-region

This will:

- Validate your configuration
- Create a timestamped folder in the run directory
- Initialize the cluster creation process

5. Monitor Your Cluster

Check the status of your cluster:

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp describe cluster-stack your-cluster-name --region your-region

   .. tab-item:: SDK

      .. code-block:: python

         from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

         # Describe a specific cluster stack
         response = HpClusterStack.describe("your-cluster-name", region="your-region")
         print(f"Stack Status: {response['Stacks'][0]['StackStatus']}")
         print(f"Stack Name: {response['Stacks'][0]['StackName']}")

Note

Region-Specific Stack Names: Cluster stack names are unique within each AWS region. When describing a stack, ensure you specify the correct region where the stack was created, or the command will fail to find the stack.
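
Checking the status once is rarely enough during creation, so a small polling helper can be useful. The sketch below assumes only the response shape shown in the SDK example above (``response['Stacks'][0]['StackStatus']``) and the CloudFormation convention that a status ending in ``_IN_PROGRESS`` is still changing. ``is_terminal`` and ``wait_for_stack`` are hypothetical helper names, not part of the SDK, and the demonstration uses canned responses rather than a real AWS call.

.. code-block:: python

   import time
   from typing import Callable


   def is_terminal(status: str) -> bool:
       """CloudFormation statuses ending in _IN_PROGRESS are still changing."""
       return not status.endswith("_IN_PROGRESS")


   def wait_for_stack(describe: Callable[[], dict], interval: float = 30.0,
                      max_polls: int = 120) -> str:
       """Poll describe() until the stack reaches a terminal status.

       describe is any zero-argument callable returning a response shaped
       like the SDK example above, for instance:
       lambda: HpClusterStack.describe("your-cluster-name", region="your-region")
       """
       status = "UNKNOWN"
       for attempt in range(max_polls):
           status = describe()["Stacks"][0]["StackStatus"]
           if is_terminal(status):
               return status
           # Sleep between polls, but not after the final attempt.
           if attempt < max_polls - 1:
               time.sleep(interval)
       raise TimeoutError(f"Stack still {status} after {max_polls} polls")


   # Offline demonstration with canned responses instead of a real AWS call.
   _responses = iter([
       {"Stacks": [{"StackName": "demo", "StackStatus": "CREATE_IN_PROGRESS"}]},
       {"Stacks": [{"StackName": "demo", "StackStatus": "CREATE_COMPLETE"}]},
   ])
   final_status = wait_for_stack(lambda: next(_responses), interval=0)
   print(final_status)

A terminal status is not necessarily a successful one: ``ROLLBACK_COMPLETE`` and ``CREATE_FAILED`` also end the loop, so the caller still needs to inspect the returned value.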

List all clusters:

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp list cluster-stack --region your-region

   .. tab-item:: SDK

      .. code-block:: python

         from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

         # List all CloudFormation stacks (including cluster stacks)
         stacks = HpClusterStack.list(region="your-region")
         for stack in stacks['StackSummaries']:
             print(f"Stack: {stack['StackName']}, Status: {stack['StackStatus']}")
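
Because the listing covers all CloudFormation stacks, it can help to narrow the result down to the ones that belong to a given cluster configuration. The sketch below works on sample data shaped like ``stacks['StackSummaries']`` from the SDK example above. It assumes that cluster stack names begin with the ``resource_name_prefix`` from ``config.yaml`` and that deleted stacks may appear in the listing; neither is confirmed here, and ``cluster_stacks`` is a hypothetical helper name.

.. code-block:: python

   from typing import Iterable, List


   def cluster_stacks(summaries: Iterable[dict], prefix: str) -> List[dict]:
       """Keep stacks whose name starts with prefix and that are not deleted."""
       return [
           s for s in summaries
           if s["StackName"].startswith(prefix)
           and s["StackStatus"] != "DELETE_COMPLETE"
       ]


   # Sample data shaped like stacks["StackSummaries"]; not real output.
   sample = [
       {"StackName": "sagemaker-hyperpod-eks-a1", "StackStatus": "CREATE_COMPLETE"},
       {"StackName": "sagemaker-hyperpod-eks-b2", "StackStatus": "DELETE_COMPLETE"},
       {"StackName": "unrelated-network", "StackStatus": "CREATE_COMPLETE"},
   ]
   matches = cluster_stacks(sample, "sagemaker-hyperpod-eks")
   names = [s["StackName"] for s in matches]
   print(names)

The same filter doubles as a quick pre-flight check: a non-empty result for a prefix you are about to deploy suggests a possible name conflict in that region.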

Common Operations

Update a Cluster

Important

Runtime vs Configuration Commands:

- ``hyp update cluster`` modifies existing, deployed clusters (runtime settings like instance groups, node recovery)
- ``hyp configure`` modifies local ``config.yaml`` files before cluster creation

Use the appropriate command based on whether your cluster is already deployed or not.

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp update cluster \
             --cluster-name your-cluster-name \
             --instance-groups "[]" \
             --region your-region

Reset Configuration

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp reset

Best Practices

Always validate your configuration before submission:

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp validate

Note

This command performs syntactic validation only, checking the config.yaml file against the appropriate schema. It checks:

- YAML syntax: Ensures the file is valid YAML
- Required fields: Verifies all mandatory fields are present
- Data types: Confirms field values match expected types (string, number, boolean, array)
- Schema structure: Validates against the template's defined structure

It does not verify that the values themselves are valid (e.g., whether AWS regions exist, instance types are available, or resources can be created).

In addition:

- Use meaningful resource prefixes to easily identify your clusters
- Monitor cluster status regularly after creation
- Keep your configuration files in version control for reproducibility
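
To make the limits of syntactic validation concrete, the sketch below checks a configuration that has already been parsed into a dictionary against a tiny schema. It is not how ``hyp validate`` is implemented: ``SCHEMA`` and ``validate_syntax`` are hypothetical, the field names are taken from the ``config.yaml`` example earlier in this guide, and treating all four fields as required strings is an assumption.

.. code-block:: python

   from typing import Any, Dict, List

   # Hypothetical schema: field name -> expected Python type.
   SCHEMA: Dict[str, type] = {
       "template": str,
       "namespace": str,
       "stage": str,
       "resource_name_prefix": str,
   }


   def validate_syntax(config: Dict[str, Any],
                       schema: Dict[str, type] = SCHEMA) -> List[str]:
       """Return a list of problems; an empty list means the check passed."""
       errors = []
       for field, expected in schema.items():
           if field not in config:
               errors.append(f"missing required field: {field}")
           elif not isinstance(config[field], expected):
               actual = type(config[field]).__name__
               errors.append(f"{field}: expected {expected.__name__}, got {actual}")
       return errors


   # Wrong type and missing fields are caught.
   bad = {"template": "cluster-stack", "namespace": 42}
   bad_errors = validate_syntax(bad)

   # Well-typed but meaningless values are NOT caught: the check never asks
   # whether such a template or stage actually exists.
   odd = {
       "template": "no-such-template",
       "namespace": "kube-system",
       "stage": "not-a-real-stage",
       "resource_name_prefix": "sagemaker-hyperpod-eks",
   }
   odd_errors = validate_syntax(odd)
   print(bad_errors, odd_errors)

The second case is the important one: a configuration can pass this kind of check and still fail at deployment time, which is why monitoring the stack after creation remains necessary.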

Next Steps

After creating your cluster, you can:

Connect to your cluster:

.. tab-set::

   .. tab-item:: CLI

      .. code-block:: bash

         hyp set-cluster-context --cluster-name your-cluster-name

- Start training jobs with PyTorch
- Deploy inference endpoints
- Monitor cluster resources and performance

For more detailed information on specific commands, use the --help flag:

.. code-block:: bash

   hyp <command> --help
