Getting Started with slk

file version: 02 May 2022

current slk version: 3.3.21

Note

Please also read Known Issues before you use slk for the first time.

Note

Currently (May 2022), slk retrieve is only available on nodes of the slk, compute, shared and interactive partitions on Levante. All other slk commands are available on all nodes.

Introduction

slk and slk_helpers are the command line interfaces that have to be used to interact with the HSM system StrongLink (tape archive) at DKRZ. The previous HSM system HPSS, which was deactivated in October 2021, had pftp as its command line interface. pftp does not work anymore. slk can be used to archive data to tape and to list, search and retrieve archived data on/from tape. It is provided by StrongBox Data Solutions. The slk_helpers provide additional functionality and are developed and maintained at DKRZ. Before you can use the slk and/or slk_helpers commands for the first time, you must call slk login and log in with your DKRZ/luv login credentials.

Warning

Please be aware that some slk commands do not print textual output to the SLURM log when used in batch jobs. Please capture the exit code $? of slk as described in the example batch scripts. Diagnostic information as well as archival/retrieval summaries are written to the slk log ~/.slk/slk-cli.log. Additionally, the command line output might be misleading in some situations as documented in the Known Issues. Please run slk archive and slk retrieve with 6 GB of memory (--mem=6GB in batch jobs).

Overview

The slk and slk_helpers commands behave very similarly to familiar commands on the Linux terminal:

  • slk archive and slk retrieve => cp (limited availability of slk retrieve, see Retrievals from tape)

  • slk delete => rm

  • slk move and slk rename => mv

  • slk chmod => chmod

  • slk group => chgrp

  • slk owner => chown (only admins)

  • slk search and slk_helpers search_limited => find (slk search currently deactivated; please use slk_helpers search_limited)

  • slk_helpers mkdir => mkdir (slk archive automatically creates namespaces)

  • slk_helpers checksum -t sha512 => sha512sum

  • slk_helpers exists => test -e (also returns the resource id)

  • slk_helpers size => get file size, e.g. equal to stat --printf="%s"

Not all slk and slk_helpers commands are listed above. Please have a look into the slk manual and slk_helpers manual to get a list of all available commands.

What we know as a folder or directory is called namespace in StrongLink.

Load slk module

slk is available as a module on all mistral and levante nodes. Run

$ module load slk

to load the most recent slk version. If you wish to use a specific old release (e.g. slk version 3.3.21), please do

$ module load slk/3.3.21

On which nodes to run slk

Warning

Please run archival and retrieval of large or many files on compute nodes (or similar; not on login nodes) and request 6GB of memory (--mem=6GB).

The memory usage of slk is limited to 4 GB. slk archive and slk retrieve reach this limit when large files or many files are archived/retrieved. The other slk commands commonly use considerably less memory. Additionally, large archival and retrieval tasks cause a high CPU load. Running these tasks on the login nodes might affect other users working on these nodes. Hence, we encourage all users to perform these tasks on exclusive nodes with a sufficient amount of memory (e.g. SLURM batch jobs on compute nodes). To account for overhead, 6 GB of memory should be allocated in the job script via --mem=6GB (examples: Batch Script Examples; --mem in sbatch Manual: https://slurm.schedmd.com/sbatch.html#OPT_mem).
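
A minimal job script sketch for the recommendation above. The partition name shared is one of the partitions named in this document; the job name, the account bm0146, the time limit and the function name archive_and_report are placeholders chosen for illustration, not prescribed values. SRC is assumed to be a directory, hence -R.

```shell
#!/bin/bash
#SBATCH --job-name=slk_archive   # placeholder job name
#SBATCH --partition=shared       # one of the partitions named in this document
#SBATCH --account=bm0146         # placeholder: replace with your project account
#SBATCH --mem=6GB                # 4 GB slk limit plus overhead
#SBATCH --time=08:00:00          # placeholder time limit

# Load slk only where the module system is available.
if command -v module >/dev/null 2>&1; then
    module load slk
fi

# Archive directory SRC into namespace DST and report the exit code,
# because slk archive itself writes nothing to the job log.
archive_and_report() {
    local src="$1" dst="$2" rc=0
    slk archive -R "$src" "$dst" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "slk archive failed with exit code $rc; see ~/.slk/slk-cli.log" >&2
    else
        echo "slk archive finished successfully"
    fi
    return "$rc"
}

# Run only when source and destination are passed to the script, e.g.
#   sbatch archive_job.sh /work/bm0146/k204221/results /arch/bm0146/k204221/my_data
if [ "$#" -ge 2 ]; then
    archive_and_report "$1" "$2"
fi
```

The script name archive_job.sh in the usage comment is likewise hypothetical.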

Running many slk archive or slk retrieve commands in parallel on one compute node will not necessarily increase the total transfer rate because at some point the transfer rate is limited by the hardware. We still lack experience on how many parallel slk archive / retrieve calls are reasonable. Please make sure to allocate sufficient memory for slk. Generally, please avoid running too many individual parallel slk retrieve calls. Please have a look into Retrieve a file or namespace further below for additional remarks on the usage of slk retrieve.

Login

Call slk login and use your common DKRZ credentials to log in. Unlike pftp, slk login does not open another shell; instead, it creates a login token. This login token allows the usage of the other slk commands. It is located in ~/.slk/config.json and is valid for 30 days.

$ slk login

All levante nodes share the same home directory. Therefore, you need to log in only once every 30 days and you can do so on any levante node. The same is true for mistral. However, you have to be logged in separately on levante and mistral.
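
Since batch jobs cannot prompt for credentials, it can help to check the token before submitting. The sketch below assumes that the modification time of ~/.slk/config.json approximates the time of the last login; this is an assumption, not documented slk behaviour, and the function name slk_token_is_fresh is made up for this example. It relies on the -mmin test of GNU find.

```shell
# Return 0 if the token file exists and was modified less than
# MAX_DAYS days ago; return 1 otherwise.
# Usage: slk_token_is_fresh [TOKEN_FILE] [MAX_DAYS]
slk_token_is_fresh() {
    local token_file="${1:-$HOME/.slk/config.json}"
    local max_days="${2:-30}"
    [ -f "$token_file" ] || return 1
    # find prints the file name only if it is older than max_days
    if [ -n "$(find "$token_file" -mmin +"$((max_days * 24 * 60))")" ]; then
        return 1
    fi
    return 0
}

# Possible use before submitting a job:
#   slk_token_is_fresh || echo "please run slk login first" >&2
```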

Archive a file or directory

slk archive can be used to archive files or directories. A progress bar is printed when slk archive is used in interactive mode. When slk archive is used in a batch script, no output will be written into the job log. An example slk archive call would be:

$ slk archive /work/bm0146/k204221/some_file.nc /arch/bm0146/k204221/my_data/

slk archive allows the usage of * as a wildcard (but not ? or [...]). Add -R to archive directories recursively. Without -R, slk archive works semi-recursively (see Known Issues for details).

List content of a namespace

slk list automatically prints its findings in a pagination mode with 25 items per page. That means that you see the first 25 results and have to type Return/Enter to show the next 25 results. The pagination mode is deactivated when the output of slk list is piped (|) into another command – e.g. cat, less or more.

$ slk list /arch/bm0146/k204221/my_data | cat
-rw-rw-rw-- k204221     bm0146          1.2K   27 Mar 2020  borehole_01.nc
-rw-rw-rw-- k204221     bm0146          1.2K   04 Mar 2021  nc3.nc
-rw-rw-rw-- k204221     bm0146          1.2K   04 Mar 2021  nc_k_2.nc
-rw-rw-rw-- k204221     bm0146          4.0M   04 Mar 2021  nc_k_3.nc
-rw-rw-rw-- k204221     bm0146          4.0M   04 Mar 2021  nc_k_4.nc
-rw-rw-rw-- k204221     bm0146         13.1K   08 Dec 2020  small.nc
-rwxrwx-w-- k204221     bm0146        105.5M   08 Nov 2019  small_BPb4-Sl-mT_00062104_00040000000_01040000000.AGM07807972.freeze.nc
-rwxrwx-w-- k204221     bm0146        105.5M   14 Nov 2019  small_BPb4-Sl-mT_00062104_00040000000_01040000000.AGM07807972.nc
-rw-rw-rw-- k204221     bm0146          1.2K   23 Mar 2021  test.nc
-rw-rw-rw-- k204221     bm0146          1.2K   23 Mar 2021  zonk.nc
-rw-rw-rw-- k204221     bm0146          1.2K   28 Jun 2021  some_file.nc
Files: 11
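
Because pagination is switched off when the output is piped, the listing can be processed by other tools. The following sketch extracts only the file names; it assumes the column layout shown above (name in the last column, a final Files: N summary line) and names without spaces. The function name slk_list_names is made up for this example.

```shell
# Read slk list output on stdin and print the last column (the name)
# of every entry row; the "Files: N" summary line has too few fields
# and is skipped.
slk_list_names() {
    awk 'NF >= 7 && $1 != "Files:" { print $NF }'
}

# Possible use:
#   slk list /arch/bm0146/k204221/my_data | slk_list_names
```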

Ensure that slk archive terminated properly and that all files were archived completely

Files might be archived incompletely if slk archive was killed manually, by a timeout of a SLURM job or by a disconnected ssh session. Currently, such incomplete files are also displayed by slk list. Therefore, we ask you to check whether slk archive finished correctly each time you archive data. If it did not finish correctly, please check all archived files for completeness; the status of each affected file should be checked.

Check that slk archive terminated properly

The following can be used to check whether slk archive finished without an error:

slk archive /path/to/file/to/be/archived /arch/proj/user/test
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call"
else
    echo "archival successful"
fi

If slk archive was killed due to a timeout of a SLURM job or similar, the archival can be resumed by running the same command again. Completely archived files won’t be archived again and incomplete files will be overwritten.
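
Since a repeated call skips completely archived files, simply calling slk archive again is a safe way to resume. A minimal sketch of such a retry loop follows; the function name archive_with_retry and the retry count are illustrative choices, not part of slk.

```shell
# Call slk archive up to MAX_TRIES times and stop at the first success.
# Usage: archive_with_retry MAX_TRIES [slk archive arguments...]
archive_with_retry() {
    local max_tries="$1" try=1 rc=1
    shift
    while [ "$try" -le "$max_tries" ]; do
        rc=0
        slk archive "$@" || rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "archival successful after $try attempt(s)"
            return 0
        fi
        echo "attempt $try failed with exit code $rc" >&2
        try=$((try + 1))
    done
    # all attempts failed: hand the last exit code to the caller
    return "$rc"
}

# Possible use:
#   archive_with_retry 3 -R /work/bm0146/k204221/results /arch/bm0146/k204221/my_data
```

A loop like this only helps with failures inside a running job; after a SLURM timeout the job itself has to be submitted again.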

Check the integrity / completeness of archived files

A file that is listed by slk list has not necessarily been archived completely. The command cp on a Linux system behaves the same way: if cp is terminated during the copying process, partly copied files remain in the filesystem. Therefore, files that were archived by a killed slk archive should be checked manually for integrity. The following three methods can be used for this purpose. The first method should be sufficient in most situations.

(a) StrongLink calculates two types of checksums for files that have been completely archived. If StrongLink is very busy, it might take a few hours until the checksums are calculated. If one is able to obtain the checksums via slk_helpers checksum RESOURCE then the file has been completely archived with high probability. If no checksum is available one week after archival then probably something is wrong. Please contact support@dkrz.de in this case.

(b) The sha512 checksum is taken from StrongLink, the sha512 checksum is calculated from the source file and both checksums are compared.

(c) The file is retrieved from the HSM after it has been written to tape (see here on how to check this). Then the retrieved file is compared to the original file - e.g. via checksum or by a (binary) diff.

Incompletely archived files will not be written from the HSM cache to tape. If a file has not been written to tape one week after archival then probably something is wrong. Please contact support@dkrz.de in this case.
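
Method (b) can be sketched as follows. The helper sha512_matches is made up for this example and only compares a local file against a checksum passed as a string. How the checksum is extracted from the output of slk_helpers checksum -t sha512 is an assumption in the usage comment (first whitespace-separated field); please check the actual output format first.

```shell
# Compare the sha512 checksum of LOCAL_FILE with EXPECTED (hex string).
# Returns 0 on a match and 1 otherwise (including an unreadable file).
sha512_matches() {
    local local_file="$1" expected="$2" actual
    actual=$(sha512sum "$local_file" 2>/dev/null | awk '{ print $1 }')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Possible use (output parsing of slk_helpers is an assumption):
#   remote=$(slk_helpers checksum -t sha512 /arch/bm0146/k204221/my_data/some_file.nc | awk '{ print $1 }')
#   sha512_matches /work/bm0146/k204221/some_file.nc "$remote" || echo "checksum mismatch" >&2
```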

Retrieve a file or namespace

slk retrieve is the counterpart of slk archive. It behaves much like slk archive (progress bar; -R), but the * wildcard works only for files and not for namespaces. Currently (May 2022), slk retrieve is only available on nodes of the slk, compute, shared and interactive partitions on Levante. If you wish to use slk retrieve interactively, please start an interactive batch session via the interactive partition with salloc (Data Processing on Levante).

slk retrieve /arch/bm0146/k204221/my_data/nc3.nc /work/bm0146/k204221/results

Running many parallel retrievals is inefficient when data have to be read from tape. Instead, we suggest performing a search and then retrieving the search results as suggested in this example script. This will decrease the overall time needed for the retrieval because the files are read more efficiently from tape. When the retrieved files are read from the HSM cache instead of from tape, performance does not decline much when many small retrievals are done.

Note

Striping is mandatory for retrievals of files > 10 GB on mistral to sustain high lustre I/O performance for all users. We recommend using striping on levante for the time being. This recommendation will change at some point in the future.

The permissions of all retrieved files are rw------- and have to be adapted manually. Using umask and setfacl in advance does not work.
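
Adapting the permissions after a retrieval can be as simple as the sketch below. The target permissions (read access for the group) and the function name open_for_group are only examples; choose what fits your project.

```shell
# Give the group read access to everything below DIR. The capital X
# adds search permission on directories (and execute permission only
# on files that are already executable).
open_for_group() {
    chmod -R g+rX "$1"
}

# Possible use after a retrieval:
#   slk retrieve /arch/bm0146/k204221/my_data/nc3.nc /work/bm0146/k204221/results
#   open_for_group /work/bm0146/k204221/results
```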

Move, rename and delete files or namespaces

Move a file from one namespace to another. The file’s name cannot be changed by this command.

$ slk move /arch/bm0146/k204221/my_data/nc_k_2.nc /arch/bm0146/k204221/old_data

Rename a file. The file’s location cannot be changed by this command.

$ slk rename /arch/bm0146/k204221/my_data/nc_k_3.nc a_netcdf_file.nc

Delete a file. If slk delete is applied to a namespace, it deletes all files in this namespace without confirmation. Files in sub-namespaces are not deleted. To delete the whole namespace with all sub-, subsub-, …-namespaces and their content, please append -R.

$ slk delete /arch/bm0146/k204221/my_data/zonk.nc

Search files

slk provides search functionality via the command slk search. Currently, slk search is deactivated due to an internal issue. Please only use slk_helpers search_limited for now. slk_helpers search_limited fails if more than 1000 files are found (details on the slk_helpers manual page). Both search commands return a search ID which can be used with slk list and slk retrieve to list and retrieve, respectively, the found data. Below you’ll find three example tasks for searches. More examples and explanations are given here: Metadata in StrongLink.

Task: find all *.nc files in the namespace /ex/am/ple/data

slk_helpers search_limited '{"$and": [{"path": {"$gte": "/ex/am/ple/data"}}, {"resources.name": {"$regex": ".nc$"}}]}'

Task: find all netCDF files, which have a global attribute project with the value ABC. For netcdf.Project below please note that netcdf is written lower case and Project starts with an upper case P although the global attribute might have been written with lower case p.

slk_helpers search_limited '{"netcdf.Project": "ABC"}'

Task: find all INDEX.txt files that either belong to user k204221 (uid: 25301) in /arch or that are stored in the namespace /double/bm0146.

slk_helpers search_limited '{"$and": [{"resources.name": "INDEX.txt"}, {"$or": [{"$and": [{"resources.posix_uid": 25301}, {"path": {"$gte": "/arch"}}]}, {"path": {"$gte": "/double/bm0146"}}]}]}'

The search queries are written in JSON. You can use jq to print the search queries in a human-readable way. Details are given in Metadata in StrongLink.
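
Writing the JSON by hand is error-prone, so a small helper can assemble recurring queries. The sketch below reproduces the structure of the first example task above; the function name build_name_query is made up for this example, and the arguments are inserted verbatim, so they must not contain double quotes or backslashes.

```shell
# Print a query that matches files below NAMESPACE whose name
# matches REGEX.
# Usage: build_name_query NAMESPACE REGEX
build_name_query() {
    printf '{"$and": [{"path": {"$gte": "%s"}}, {"resources.name": {"$regex": "%s"}}]}\n' "$1" "$2"
}

# Possible uses:
#   build_name_query /ex/am/ple/data '.nc$' | jq .      # pretty-print the query
#   slk_helpers search_limited "$(build_name_query /ex/am/ple/data '.nc$')"
```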

Debugging

Check the exit code of slk by reading $?, which holds the exit code of the previously called command.

See the log file ~/.slk/slk-cli.log for detailed error messages.
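
Both hints can be combined in a small wrapper, sketched here. The function name run_and_show_log and the variable SLK_LOG_FILE are hypothetical and exist only in this example; slk itself does not read that variable.

```shell
# Run a command; if it fails, print its exit code and the last lines
# of the slk log, then return the same exit code.
run_and_show_log() {
    local log_file="${SLK_LOG_FILE:-$HOME/.slk/slk-cli.log}" rc=0
    "$@" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "command failed with exit code $rc: $*" >&2
        if [ -f "$log_file" ]; then
            tail -n 20 "$log_file" >&2
        fi
    fi
    return "$rc"
}

# Possible use:
#   run_and_show_log slk archive /work/bm0146/k204221/some_file.nc /arch/bm0146/k204221/my_data/
```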

packems

Packems was adapted to slk and the new HSM system. Please have a look into the packems manual for details and usage of packems: https://code.mpimet.mpg.de/projects/esmenv/wiki/Packems.

pyslk

Most slk and slk_helpers commands are available as functions pyslk.pyslk.slk_COMMAND(...). These functions are simple wrappers that print the text which the slk/slk_helpers commands normally print to the command line. Somewhat more advanced wrappers are available for a few commands via pyslk.parsers.slk_COMMAND_.... pyslk is installed on mistral and levante in the latest python3 modules. It can also be downloaded separately but requires slk to be installed. Details on the availability are listed here: pyslk availability.

A few usage examples:

> from pyslk import pyslk as pslk
> from pyslk import parsers as psr
> pslk.slk_list('/arch')
drwxrwxr-x- 7003        1001                   06 Aug 2021  aa0049
drwxrwxr-x- 7003        1151                   06 Aug 2021  aa0238
drwxrwx-w-- 7003        1079                   06 Aug 2021  ab0036
drwxrwx-w-- 7003        1007                   06 Aug 2021  ab0051
...
> pslk.slk_version()
SCLI Version 3.3.10
> psr.slk_list_formatted('/arch')
    permissions    owner   group  size  day  month  year  filename
0   drwxrwxr-x-  a270003  aa0049         06   Aug   2021    aa0049
1   drwxrwxr-x-  a270003  aa0238         06   Aug   2021    aa0238
2   drwxrwx-w--  a270003  ab0036         06   Aug   2021    ab0036
3   drwxrwx-w--  a270003  ab0051         06   Aug   2021    ab0051
4   ...
>

further content