HPE Machine Learning Development Environment Software User Manual

HPE Machine Learning Development Environment Software (HPE MLDES) provides access to managed cloud infrastructure for training AI models at scale. You get access to AI clusters running the HPE Machine Learning Development Environment without having to provision or configure the hardware yourself. HPE MLDES clusters scale automatically with demand, so you only pay for what you use. This makes it a quick way to start training deep-learning models at scale and to collaborate with your team.

Create Your Cluster

To get started, create a cluster to train a model on. Click the New Cluster button, choose a name for your cluster, and select the Standard configuration. Alternatively, choose the Pro configuration, which comes with more powerful GPUs. You can also customize the configuration in the Advanced menu and modify it later.

Train a Model

While your cluster is launching, install the Determined command-line tool. You must already have Python installed.

pip install determined
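
If the install fails, it can help to confirm that the prerequisites are on your PATH. The sketch below is one way to do that. The helper name require_commands is hypothetical, and the command names python3 and pip are assumptions that may differ on your system.

```shell
# Report any required commands that are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check for Python and pip before installing the CLI.
# (python3 and pip are assumed names; adjust them for your system.)
if require_commands python3 pip; then
    echo "prerequisites found"
else
    echo "install Python and pip first"
fi
```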

Get the cluster URL (you can copy it using the clipboard button next to View Cluster) and set the DET_MASTER environment variable to it. You may want to set this permanently for your shell, for example by adding it to .bashrc.

export DET_MASTER=<cluster URL>
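
To make the setting permanent, one option is to append the export line to your shell startup file. The sketch below assumes a Bourne-style shell. The helper name persist_det_master and the example URL are hypothetical, and the demonstration writes to a scratch file rather than your real .bashrc.

```shell
# Append an export line for DET_MASTER to a shell startup file, once.
# Usage: persist_det_master <cluster-url> [rc-file]
persist_det_master() {
    url="$1"
    rc_file="${2:-$HOME/.bashrc}"
    line="export DET_MASTER=$url"
    # Skip the append if the exact line is already present.
    if [ -f "$rc_file" ] && grep -qxF -- "$line" "$rc_file"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$rc_file"
}

# Demonstration against a scratch file, so your real .bashrc is untouched.
# The URL below is a made-up placeholder, not a real cluster.
demo_rc="$(mktemp)"
persist_det_master "https://example-cluster.invalid" "$demo_rc"
persist_det_master "https://example-cluster.invalid" "$demo_rc"   # second call adds nothing
cat "$demo_rc"
```

To use it for real, call persist_det_master with your cluster URL and no second argument, then open a new shell or source the startup file.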

Once the cluster is running, use the CLI to log in to the cluster through your browser.

det auth login

Now you can download an example model and train it using your cluster.

curl -O https://docs.determined.ai/latest/_downloads/61c6df286ba829cb9730a0407275ce50/mnist_pytorch.tgz
tar xzf mnist_pytorch.tgz
cd mnist_pytorch
det experiment create const.yaml .

The experiment, along with its progress and metrics, can now be found in the cluster's web UI.

You can replace const.yaml with distributed.yaml to try distributed training over multiple GPUs, or with adaptive.yaml to try automatic hyperparameter optimization.

For complete model-developer documentation matching your cluster's version, visit the Docs tab in your cluster.
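
If you want to try several of these configurations in one go, a small loop is enough. The sketch below reuses the det experiment create command shown above. The helper name submit_configs is hypothetical, and it assumes you are inside the mnist_pytorch directory with DET_MASTER set.

```shell
# Submit one experiment per configuration file from the current directory.
# Usage: submit_configs <config.yaml>...
submit_configs() {
    for cfg in "$@"; do
        if [ ! -f "$cfg" ]; then
            echo "skipping $cfg: file not found" >&2
            continue
        fi
        echo "submitting $cfg"
        # Stop at the first submission that fails.
        det experiment create "$cfg" . || return 1
    done
}

# Run from inside the mnist_pytorch directory, for example:
#   submit_configs const.yaml distributed.yaml adaptive.yaml
```

Each submission creates a separate experiment, so expect one entry per configuration file in the web UI.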

Working with Checkpoints

Please consult the MLDE docs for full details on working with checkpoints.

Your experiments will likely generate checkpoints. Use the following command to view all checkpoints associated with an experiment:

det experiment list-checkpoints <experiment-id>

To download a checkpoint from an HPE MLDES cluster, use this command:

det checkpoint download --mode=master <checkpoint UUID>

The option --mode=master explicitly specifies that the download is proxied through the cluster master, which is the only supported download mode on HPE MLDES clusters.

For consistency with open-source Determined, you can also omit --mode=master:

det checkpoint download <checkpoint UUID>

Caveat: When --mode is not specified, it defaults to auto. The CLI first attempts to download the checkpoint directly from checkpoint storage and then falls back to a proxied download through the cluster master. This form is easier to remember but adds a small overhead, and on some occasions the CLI might fail to switch to the proxied download automatically.
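
If you prefer the shorter form but want a safety net for the cases where the automatic switch fails, you can wrap the two commands yourself. The sketch below uses only the two det checkpoint download forms shown above. The helper name download_checkpoint is hypothetical, and the sketch assumes the CLI exits with a non-zero status when a download fails.

```shell
# Try the default (auto) download first; if it fails, retry through the master.
# Usage: download_checkpoint <checkpoint-uuid>
download_checkpoint() {
    uuid="$1"
    if det checkpoint download "$uuid"; then
        return 0
    fi
    echo "auto mode failed; retrying with --mode=master" >&2
    det checkpoint download --mode=master "$uuid"
}

# Example (replace the placeholder with a real checkpoint UUID):
#   download_checkpoint <checkpoint UUID>
```

A failed first attempt may leave partial files behind, so check the output directory before relying on the result.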

Add Your Team

Back in the HPE MLDES web portal, click the Members tab. Your user is currently an admin of the organization and has all permissions. This includes the ability to add other team members. Click New Member and enter their email address. Send them a link, and they'll be able to log in and collaborate with you!