Machine Learning on Mistral

How to get started with ML on Mistral

Because most machine learning Python packages release new versions frequently, and because projects typically require a large variety of packages, we recommend setting up a custom conda environment in which you can install the specific packages and versions you need.

Regarding conda packages, we recommend the following:

  • In general, our experience shows that pytorch is more flexible in the long term than keras/tensorflow, despite an initially steeper learning curve. That learning curve can be flattened further by using pytorch-lightning, which takes care of significant parts of the boilerplate code required to set up a pytorch training loop.

  • To make use of GPUs on Mistral with pytorch, it is often easiest to install the pytorch-gpu package instead of the regular pytorch, torchvision, cudatoolkit etc. packages, and thereby circumvent potential issues with outdated native dependencies from the Mistral default environment. This works well with Python 3.8.
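A possible setup following these recommendations could look like the commands below; the environment name ml-env is just an illustrative choice, not a required name:

```shell
# Create a fresh conda environment with Python 3.8 (name "ml-env" is arbitrary)
conda create -n ml-env python=3.8

# Activate it before installing further packages
conda activate ml-env

# Install pytorch-gpu from conda-forge instead of the separate
# pytorch/torchvision/cudatoolkit packages
conda install -c conda-forge pytorch-gpu
```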

For small, initial data exploration and processing, you can also use Jupyter notebooks. For ML training tasks beyond this first experimental stage, however, we recommend switching from notebooks to regular Python modules, e.g., when you move on to large-scale hyperparameter tuning as described further below.

Tailoring ANN training for batch scripts

To enable easy scripting for batch jobs and hyperparameter tuning, we recommend setting up your ANN training scripts so that most parameters can be set via command-line flags: data input, features to use, hyperparameters such as learning rate and batch size, but also the number and configuration of layers. This way, hyperparameter tuning can also be used to systematically test different network architectures, taking a lot of heavy trial-and-error effort off your shoulders.
For example, consider a main script that is able to consume various parameters like this help text illustrates:
usage: [-h] [-d DATA] [-e EPOCHS] [-p PATIENCE] [-bs BATCH_SIZE]
                [-lr LEARNING_RATE] [-wd WEIGHT_DECAY] [-optim {Adam,SGD}]
                [-nni] [--n-layers {1,2,3}]
                [--units1 UNITS1] [--dropout1 DROPOUT1]
                [--units2 UNITS2] [--dropout2 DROPOUT2]
                [--units3 UNITS3] [--dropout3 DROPOUT3]

optional arguments:
  -h                   Print help and exit
  -d DATA              Input data
  -e EPOCHS            Maximum number of epochs to train
  -p PATIENCE          Stop training prematurely if results do not improve for n epochs
  -bs BATCH_SIZE       Batch size
  -lr LEARNING_RATE    Learning rate
  -wd WEIGHT_DECAY     Weight decay
  -optim {Adam,SGD}    Optimizer
  -nni                 Special flag to use if script is run via NNI
  --n-layers {1,2,3}   Number of hidden layers
  --units1 UNITS1      Units in the first layer
  --dropout1 DROPOUT1  Dropout rate after the first layer
  --units2 UNITS2      Units in the second layer
  --dropout2 DROPOUT2  Dropout rate after the second layer
  --units3 UNITS3      Units in the third layer
  --dropout3 DROPOUT3  Dropout rate after the third layer

This could easily be extended to cover, e.g., not only architectures with different dense and dropout layers, but also architectures with or without convolutional or LSTM layers, along with different parameters for them.
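A CLI like the help text above could be built with Python's argparse module; the following sketch uses the flag names from the example, while the default values are illustrative assumptions:

```python
import argparse

def build_parser():
    """Build a CLI exposing the training parameters from the help text above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--data", help="Input data")
    parser.add_argument("-e", "--epochs", type=int, default=100,
                        help="Maximum number of epochs to train")
    parser.add_argument("-p", "--patience", type=int, default=10,
                        help="Stop training prematurely if results do not improve for n epochs")
    parser.add_argument("-bs", "--batch-size", type=int, default=32, help="Batch size")
    parser.add_argument("-lr", "--learning-rate", type=float, default=1e-3, help="Learning rate")
    parser.add_argument("-wd", "--weight-decay", type=float, default=0.0, help="Weight decay")
    parser.add_argument("-optim", choices=["Adam", "SGD"], default="Adam", help="Optimizer")
    parser.add_argument("-nni", action="store_true",
                        help="Special flag to use if script is run via NNI")
    parser.add_argument("--n-layers", type=int, choices=[1, 2, 3], default=1,
                        help="Number of hidden layers")
    # One units/dropout pair per possible layer, so the architecture
    # itself becomes a tunable parameter.
    for i in (1, 2, 3):
        parser.add_argument(f"--units{i}", type=int, default=64)
        parser.add_argument(f"--dropout{i}", type=float, default=0.0)
    return parser

# Example: parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(
    ["-e", "50", "-lr", "0.01", "--n-layers", "2", "--units2", "128"]
)
```

In the actual training script, the resulting args object would then be passed on to wherever the model and training loop are constructed.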

Automating training and hyperparameter tuning using NNI

Several parameters are crucial to getting good results with ANNs. The learning rate, for example, is an important parameter for which multiple different values should be tried. With a main script set up as in the example above, it is easy to automate testing of different parameters in search of an optimal combination.
There are several frameworks that can help perform such automated hyperparameter tuning. Of the frameworks available, we currently recommend NNI, which not only takes care of automation, but also offers a comfortable interface for monitoring tuning runs and a nice way of recording individual experiments. A good place to start is the QuickStart guide.
One big advantage of using NNI is that it assigns a unique ID to each experiment, which can later be used to find the results again. All important information, including a log file with the output of the training command, is stored in a folder named after this unique ID in your home directory. NNI can be installed via pip with:
pip install nni
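NNI reads the parameters to explore from a search-space definition. A minimal search space matching some of the flags from the example script might look like the following; the parameter names and value ranges here are illustrative assumptions:

```json
{
  "lr": {"_type": "loguniform", "_value": [1e-5, 1e-1]},
  "bs": {"_type": "choice", "_value": [32, 64, 128]},
  "optim": {"_type": "choice", "_value": ["Adam", "SGD"]},
  "n_layers": {"_type": "choice", "_value": [1, 2, 3]}
}
```

Inside the training script, the parameter set chosen for the current trial can be fetched with nni.get_next_parameter() (e.g., when the -nni flag is set), and results are reported back with nni.report_intermediate_result() and nni.report_final_result().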

On Mistral, you can access the web interface if you are running NNI on a GPU node and are inside the DKRZ network (e.g., via VPN or by being in a DKRZ building). The server address and port are displayed when an experiment is started.