Machine Learning on Levante#
How to set up conda#
While the default python3 environment offered on Levante via modules provides a basic set of libraries, ML with Python often requires additional packages. An easy way to gain access to them is to set up your own conda environment.
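For example, an environment with some typical ML packages might be set up as follows. The environment name mypython3 matches the batch script further below; the module name and package list are only an illustration and should be adapted to your needs:

# Load a conda installation, e.g. via Levante's module system, or use your
# own Miniconda/Mambaforge installation instead (assumption; adjust as needed)
module load python3

# Create and activate an environment with some common ML packages
conda create -n mypython3 python=3.10 numpy pandas scikit-learn matplotlib
conda activate mypython3

# Frameworks such as PyTorch or TensorFlow can be added on top, e.g.:
pip install torch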
How to get a custom kernel into jupyterhub.dkrz.de#
Jupyter notebooks and jupyterhub.dkrz.de are suitable environments to get started with ML on Levante. However, the default kernels available on jupyterhub.dkrz.de do not include many of the standard frameworks used for Machine Learning in Python. You can address this by setting up your own kernel.
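Assuming a conda environment like the one above, registering it as a Jupyter kernel boils down to installing ipykernel into the environment and registering it for your user; the kernel definition then lands in ~/.local/share/jupyter/kernels, from where jupyterhub.dkrz.de should be able to pick it up:

conda activate mypython3
pip install ipykernel

# Register the environment as a user-level kernel; --name and
# --display-name can be chosen freely
python -m ipykernel install --user --name mypython3 --display-name "Python (mypython3)"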
Setting up your ML project#
Many people start from scratch when setting up a repository for their ML project, but usually end up with similar repository structures and tools. To reduce the amount of configuration required, our colleagues at Helmholtz Munich have set up the repository template “Quicksetup-ai”, which we recommend using. It also makes use of Hydra to simplify the configuration of ML experiments. All details can be found in the template's repository.
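To give a flavour of what Hydra provides: a training entry point decorated with hydra.main reads its configuration from a YAML file, and any value can be overridden on the command line. A minimal sketch, assuming a config file configs/train.yaml with an optimizer.lr entry (both names are illustrative):

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # Values come from configs/train.yaml and can be overridden on the
    # command line, e.g.: python train.py optimizer.lr=1e-4
    print(cfg.optimizer.lr)

if __name__ == "__main__":
    main()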
Tailoring ANN training for batch scripts#
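A training script that exposes its hyperparameters as command-line arguments is easy to drive from batch scripts and from tuning tools such as NNI (see below). The help output of such a script, here called Model.py, might look like this: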
usage: Model.py [-h] [-d DATA] [-e EPOCHS] [-p PATIENCE] [-bs BATCH_SIZE]
[-lr LEARNING_RATE] [-wd WEIGHT_DECAY] [-optim {Adam,SGD}]
[-nni] [--n-layers {1,2,3}]
[--units1 UNITS1] [--dropout1 DROPOUT1]
[--units2 UNITS2] [--dropout2 DROPOUT2]
[--units3 UNITS3] [--dropout3 DROPOUT3]
optional arguments:
-h Print help and exit
-d DATA Input data
-e EPOCHS Maximum number of epochs to train
-p PATIENCE Stop training prematurely if results do not improve for n epochs
-bs BATCH_SIZE Batch size
-lr LEARNING_RATE Learning rate
-wd WEIGHT_DECAY Weight decay
-optim {Adam,SGD} Optimizer
-nni Special flag to use if script is run via NNI
--n-layers {1,2,3} Number of dense layers
--units1 UNITS1 Number of units in the first dense layer
--dropout1 DROPOUT1 Dropout rate after the first dense layer
--units2 UNITS2 Number of units in the second dense layer
--dropout2 DROPOUT2 Dropout rate after the second dense layer
--units3 UNITS3 Number of units in the third dense layer
--dropout3 DROPOUT3 Dropout rate after the third dense layer
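For illustration, a run with two dense layers could then be launched like this (the data file name and all values are arbitrary):

python Model.py -d train_data.nc -e 200 -p 20 -bs 64 -lr 1e-3 -wd 1e-5 \
    -optim Adam --n-layers 2 --units1 128 --dropout1 0.2 --units2 64 --dropout2 0.1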
This scheme could easily be extended, e.g., to cover not only architectures with a varying number of dense and dropout layers, but also optional convolutional or LSTM layers and their parameters; see the sketch below.
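One way to implement such an interface is to generate the per-layer options in a loop, which also keeps the suggested extensions cheap. A minimal argparse sketch (only a subset of the options above is shown, and all defaults are illustrative):

import argparse

parser = argparse.ArgumentParser(description="Train an ANN")
parser.add_argument("-d", dest="data", help="Input data")
parser.add_argument("-e", dest="epochs", type=int, default=100,
                    help="Maximum number of epochs to train")
parser.add_argument("-nni", action="store_true",
                    help="Fetch hyperparameters via NNI instead of the CLI")
parser.add_argument("--n-layers", type=int, choices=[1, 2, 3], default=1)
# Generating per-layer options in a loop makes it easy to add further
# layer types (convolutional, LSTM, ...) later on
for i in (1, 2, 3):
    parser.add_argument(f"--units{i}", type=int, default=32)
    parser.add_argument(f"--dropout{i}", type=float, default=0.0)

args = parser.parse_args()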
Automating training and hyperparameter tuning using NNI#
NNI can be installed into the conda environment set up above via pip:

pip install nni
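Inside Model.py, the -nni flag shown above would then switch the script to NNI's trial API: hyperparameters are requested from the tuner, and intermediate and final results are reported back. A minimal sketch of that code path (the defaults and the stand-in training loop are illustrative):

import nni

# Illustrative defaults, normally taken from the parsed CLI arguments
params = {"lr": 1e-3, "epochs": 3}
# Ask the tuner for the next hyperparameter set (only meaningful when the
# script is launched as part of an NNI experiment)
params.update(nni.get_next_parameter() or {})

for epoch in range(params["epochs"]):
    val_loss = 1.0 / (epoch + 1)               # stand-in for a real training epoch
    nni.report_intermediate_result(val_loss)   # plotted live in the NNI web UI

nni.report_final_result(val_loss)              # the metric the tuner optimizes

The experiment itself is started and stopped with nnictl, which can be wrapped in a batch job such as the following: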
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --account=<your project>
#SBATCH --nodes=1
#SBATCH --gpus=4
#SBATCH --time=12:00:00
#SBATCH --exclusive

# Make conda available in the batch environment and activate the ML environment
source ~/.bashrc
scriptdir=/work/<path-to-your-working-dir>
conda activate mypython3

# Start the NNI experiment, keep it running for just under the job's
# walltime, then shut it down cleanly
port=8080
nnictl create -c $scriptdir/config_nni.yml --port $port
sleep 11h 59m
nnictl stop
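The config_nni.yml referenced above describes what NNI should run and how to tune it. A sketch of what it might contain, assuming NNI's v2 YAML format (the search space file and all values are illustrative):

searchSpaceFile: search_space.json       # ranges for lr, units1, dropout1, ...
trialCommand: python Model.py -d <your-data> -nni
trialCodeDirectory: .
trialConcurrency: 4                      # one trial per GPU of the node
trialGpuNumber: 1
maxExperimentDuration: 11h
maxTrialNumber: 100
tuner:
  name: TPE
  classArgs:
    optimize_mode: minimize
trainingService:
  platform: local
  useActiveGpu: true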
While the job is running, you can monitor the experiment through NNI's web interface. Find out on which node the job is running (e.g., using the squeue command); the website should then be accessible locally on localhost:8080 after opening an SSH tunnel to that node:

ssh -L 8080:127.0.0.1:8080 <your-account>@levante<node-number>.dkrz.de