Machine Learning on Levante#

How to set up conda#

The default python3 environment available on Levante via modules provides a basic set of libraries, but machine learning with Python often requires additional packages. An easy way to gain access to them is to set up your own conda environment.
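For example, a new environment could be created roughly like this (a minimal sketch: the environment name mypython3 and the package selection are only examples, and it assumes conda is already available in your shell, e.g. from your own Miniconda/Miniforge installation):

# create and activate a dedicated environment for your ML work
conda create -n mypython3 python=3.11 numpy pandas scikit-learn
conda activate mypython3
# add the deep learning framework of your choice, e.g. via pip
pip install torch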

How to get a custom kernel into jupyterhub.dkrz.de#

Jupyter notebooks and jupyterhub.dkrz.de are suitable environments to get started with ML on Levante. However, the default kernels available on jupyterhub.dkrz.de do not include many of the standard Python frameworks used for machine learning. You can address this by setting up your own kernel.
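A common way to do this, sketched here under the assumption that you use a conda environment as described above, is to register the environment as a Jupyter kernel with ipykernel; the kernel should then show up in the kernel list on jupyterhub.dkrz.de:

# run inside the activated conda environment
pip install ipykernel
python -m ipykernel install --user --name mypython3 --display-name "ML (mypython3)"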

Setting up your ML project#

Many people start from scratch when setting up a repository for their ML project, but usually end up with similar repository structures and tools. To reduce the amount of configuration required, our colleagues at Helmholtz Munich have set up the repository template “Quicksetup-ai”, which we recommend using. It also makes use of Hydra to simplify the configuration of ML experiments. All details can be found here:

HelmholtzAI-Consultants-Munich/Quicksetup-ai
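One way to get started, assuming the template is hosted on GitHub under the name above, is to clone it and use it as the basis of your own project:

# clone the template into a new project directory (directory name is an example)
git clone https://github.com/HelmholtzAI-Consultants-Munich/Quicksetup-ai.git my-ml-project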

Tailoring ANN training for batch scripts#

To enable easy scripting for batch jobs and hyperparameter tuning, we recommend setting up your ANN training scripts so that most parameters can be set via command line flags: the input data, the features to use, hyperparameters such as learning rate and batch size, and also the number and configuration of layers. This way, hyperparameter tuning can also be used to systematically test different network architectures, taking a lot of heavy trial-and-error effort off your shoulders.
For example, consider a main script that accepts various parameters, as this help text illustrates:
usage: Model.py [-h] [-d DATA] [-e EPOCHS] [-p PATIENCE] [-bs BATCH_SIZE]
                [-lr LEARNING_RATE] [-wd WEIGHT_DECAY] [-optim {Adam,SGD}]
                [-nni] [--n-layers {1,2,3}]
                [--units1 UNITS1] [--dropout1 DROPOUT1]
                [--units2 UNITS2] [--dropout2 DROPOUT2]
                [--units3 UNITS3] [--dropout3 DROPOUT3]

optional arguments:
  -h                   Print help and exit
  -d DATA              Input data
  -e EPOCHS            Maximum number of epochs to train
  -p PATIENCE          Stop training prematurely if results do not improve for n epochs
  -bs BATCH_SIZE       Batch size
  -lr LEARNING_RATE    Learning rate
  -wd WEIGHT_DECAY     Weight decay
  -optim {Adam,SGD}    Optimizer
  -nni                 Special flag to use if the script is run via NNI
  --n-layers {1,2,3}   Number of hidden layers
  --units1 UNITS1      Units in hidden layer 1
  --dropout1 DROPOUT1  Dropout rate after hidden layer 1
  --units2 UNITS2      Units in hidden layer 2
  --dropout2 DROPOUT2  Dropout rate after hidden layer 2
  --units3 UNITS3      Units in hidden layer 3
  --dropout3 DROPOUT3  Dropout rate after hidden layer 3

This could easily be extended to cover, e.g., not only architectures with different numbers of dense and dropout layers, but also variants with or without convolutional or LSTM layers, along with their respective parameters.
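For illustration, a training run with such a script might be launched like this (a hypothetical call; the data path and all hyperparameter values are placeholders):

# example call of the training script with explicit hyperparameters
python Model.py -d /work/<path-to-your-data>/train_data.nc \
    -e 200 -p 20 -bs 64 -lr 0.001 -wd 0.0001 -optim Adam \
    --n-layers 2 --units1 128 --dropout1 0.2 --units2 64 --dropout2 0.1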

Automating training and hyperparameter tuning using NNI#

Several parameters are crucial for getting good results with ANNs; the learning rate, for example, usually needs to be tested with multiple different values. With a main script set up as in the example above, it is easy to automate testing of different parameter values and search for an optimal combination.
There are several frameworks which can help to perform such automated hyperparameter tuning. Of the frameworks available, we currently recommend NNI as a lightweight entry point for beginners. It not only takes care of the automation, but also provides a comfortable interface for monitoring the tuning and a nice way of recording individual experiments. A good place to start is the QuickStart guide.
One big advantage of using NNI is that it assigns a unique ID to each experiment, which can later be used to find the results again. All important information, including a log file with the output of the training command, is stored in a folder named after this ID in your home directory. NNI can be installed via pip with:
pip install nni
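The nnictl command in the SLURM script below expects an experiment configuration file, here called config_nni.yml, together with a search space definition. The following is only a rough sketch of how the two files could look, assuming NNI 2.x and the Model.py interface shown above; the parameter ranges, trial numbers and tuner choice are examples and need to be adapted to your problem:

# write a minimal NNI experiment configuration (NNI 2.x style)
cat > config_nni.yml <<'EOF'
experimentName: ann_tuning_example
searchSpaceFile: search_space.json
trialCommand: python Model.py -nni -d /work/<path-to-your-data>/train_data.nc
trialConcurrency: 4
trialGpuNumber: 1
maxTrialNumber: 100
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
EOF

# search space sampled by the tuner; the trial script maps these keys to its parameters
cat > search_space.json <<'EOF'
{
  "lr": {"_type": "loguniform", "_value": [0.0001, 0.1]},
  "bs": {"_type": "choice", "_value": [32, 64, 128]},
  "n_layers": {"_type": "choice", "_value": [1, 2, 3]},
  "units1": {"_type": "choice", "_value": [32, 64, 128]},
  "dropout1": {"_type": "uniform", "_value": [0.0, 0.5]}
}
EOF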
Setting NNI up to run on Levante, in this example with 4 GPUs on a single node, can be easily achieved with a simple SLURM run script (adapt to your account, conda environment, paths etc. as necessary):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --account=<your project>
#SBATCH --nodes=1
#SBATCH --gpus=4
#SBATCH --time=12:00:00
#SBATCH --exclusive

source ~/.bashrc
scriptdir=/work/<path-to-your-working-dir>
conda activate mypython3
port=8080
nnictl create -c $scriptdir/config_nni.yml --port $port
sleep 11h 59m
nnictl stop
Note the sleep command here: “nnictl create” starts the NNI server and then returns, leaving the server running in the background, so we simply wait and then properly terminate the server before the SLURM job allocation time runs out. If you discover after such jobs that NNI would need more time to sample your parameter space, increase the sleep duration and the job time accordingly.
In order to access the NNI web interface and monitor the progress of the tuning while the job is running, you need to create an SSH tunnel to Levante. Replace the node number with the node the job was assigned to (you can check this via SLURM’s squeue command); the web interface should then be accessible locally at localhost:8080.
ssh -L 8080:127.0.0.1:8080 <your-account>@levante<node-number>.dkrz.de