Cluster Operations

Advanced Cluster Creation

Resource Pool Configuration

A resource pool in an MLDE Cluster is a set of agents that all run equivalent hardware. Some workloads require larger GPUs to hold a larger model in memory, or to simply train it faster. Distributed training is faster when there are more GPUs per agent, requiring less network traffic between agents. However having larger GPUs and more of them on a machine comes with increased cost. These trade-offs impact the kinds of agents you want to run, and you might want to configure several options on your cluster so your team can choose the appropriate trade-off for their particular workload.

With a resource pool, agents will scale up and down as needed with job submission. You can configure your resource pool with a maximum to help control costs. Scaling up can take a few minutes, depending on hardware availability, and nodes will scale down automatically when they have been idle for 10 minutes.

Cluster Actions

Except as noted below, other operations should not cause down time of the service for your team or interrupt running workloads. However, there is an inherent risk in applying any changes to your cluster, so consider scheduling these events at time when the impact of any interruption to your team will be minimal.

All these actions (and other options) are available from the ⋮ menu next to that cluster on the home page.

Cluster Menu

Pausing and Resuming

Pausing a cluster causes all compute resources to scale down completely. This is intended to reduce expenses to a minimum when your team is expected to not be using the cluster for an extended time. When paused, the cluster cannot be interacted with so experiments cannot be viewed, jobs cannot be submitted, etc. Any experiments, Jupyter notebooks, or other running commands will be interrupted. Experiment progress since the last checkpoint, and any other state that has not been saved to persistent storage, will be lost.

A cluster can be resumed in 1-2 minutes. Experiments that were running at the time the cluster was paused will be resumed, and the service can be used again.

Version Upgrades

Consult the Release Notes for a summary of changes before moving to a new release. After selecting Upgrade from the cluster menu, simply select the version you are ready to move to.

Reprovisioning Master

The size of the master node may need to be adjusted if you run many experiments at a large scale. To do this, simply select the appropriate instance type after selecting Reprovision Master from the cluster menu.

This operation is expected to take a couple of minutes, during which the service will be unavailable. This may cause longer delays for running workloads.

Updating Resource Pools

If you wish to switch the kinds of GPUs you use, you can change the instance type of an existing resource pool. You can also add and remove resource pools as your needs evolve. To access this menu, select Update Resource Pools from the cluster menu.

If you change or remove the configuration for a particular resource pool, that may cause problems for workloads currently running in that resource pool. It is recommended to finish or pause all work in a resource pool before modifying it.

Deleting

Deleting a cluster will permanently delete all of the resources associated with the cluster, including all checkpoints, experiment history, etc. Deleted clusters will continue to be listed for another 10 minutes to confirm that the action was completed successfully.