Getting Started with slk#
file version: 25 Aug 2023
current software versions: slk version 3.3.91; slk_helpers version 1.9.7
Note
Please also read Known Issues before you use slk for the first time.
Introduction#
slk and slk_helpers are the command line interfaces used to interact with the HSM system StrongLink (tape archive) at DKRZ. slk can be used to archive data to tape and to list, search and retrieve archived data on/from tape. It is provided by StrongLink. The slk_helpers provide additional functionality and are developed and maintained at DKRZ. Before you can use the slk and/or slk_helpers commands for the first time, you must call slk login and log in with your DKRZ/luv login credentials.
The slk and slk_helpers commands behave very similarly to well-known commands on the Linux terminal. Not all slk and slk_helpers commands are listed below. Please have a look into the slk manual and slk_helpers manual for lists of all available commands.
cp => slk archive and slk retrieve
ls => slk list and slk_helpers list_search
rm => slk delete
mv => slk move and slk rename
chmod => slk chmod
chgrp => slk group
chown => slk owner (only admins)
find => slk search and slk_helpers search_limited
mkdir => slk_helpers mkdir (slk archive automatically creates namespaces)
sha512sum => slk_helpers checksum -t sha512
test -e => slk_helpers exists (also returns the resource id)
get file size, e.g. equal to stat --printf="%s" => slk_helpers size
A folder or directory is called a namespace in StrongLink.
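For orientation, a few of these commands in action (the paths below are placeholders):
$ slk list /arch/bm0146/k204221/my_data
$ slk_helpers exists /arch/bm0146/k204221/my_data/some_file.nc
$ slk_helpers size /arch/bm0146/k204221/my_data/some_file.nc
$ slk_helpers checksum -t sha512 /arch/bm0146/k204221/my_data/some_file.nc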
Note
If you use slk retrieve
on the login nodes of Levante then you can only retrieve one file per slk
call. Please use the interactive
partition to run slk retrieve
interactively (Run slk in the “interactive” partition) or use slk retrieve
in a batch job (Run slk as batch job). Please note the warning below.
Warning
Please run archival/retrieval of large/many files on the shared, compute or interactive nodes and request 6 GB of memory (--mem=6GB). If your slk is killed with a message like /sw/[...]/bin/slk: line 16: [...] Killed, then please inform the DKRZ support (support@dkrz.de) and allocate 8 GB or 10 GB of memory.
Warning
Please be aware that some slk commands do not print textual output to the SLURM log when used in batch jobs. Please capture the exit code $? of slk as described in slk in SLURM jobs. slk archive prints verbose output in both interactive and batch mode via -vv.
Start using slk#
Load slk module#
slk is available as a module on all Levante nodes. Just do
$ module load slk
to load the most recent slk version. If you wish to use a specific old release (e.g. slk version 3.3.81), please do
$ module load slk/3.3.81
Please be aware that old slk
releases are not officially supported anymore and might not be 100% compatible with the current version of StrongLink.
Login#
Call slk login and use your common DKRZ credentials to log in. You do not log in to another shell with slk login (as with pftp); instead, a login token is created. This login token allows you to use the other slk commands. It is stored in ~/.slk/config.json and is valid for 30 days.
$ slk login
All Levante nodes share the same home directory. Therefore, you need to login only once every 30 days and you can run it on any Levante node.
Reminder login token expires#
If you wish to be reminded when the login token is due to expire, you can set up a SLURM job that checks the expiration date of the login token on a daily basis. You might use this script for this purpose. It sends an email to you when the token expires in 30, 5, 4, 3, 2 and 1 days. The email is sent to the email address stored in your user profile at https://luv.dkrz.de . The whole output is written into the file slk_token_expire_check.log which is automatically kept below a size of 1100 lines. When finished, the script submits itself to start again the next day at the same time. Please submit the script as follows, replacing YOUR_PROJECT_ACCOUNT by a valid project/account with compute time:
sbatch --begin="now+1day" --account=YOUR_PROJECT_ACCOUNT ./slk_token_expire_email.sh
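The linked script already implements all of this; the sketch below only illustrates the self-resubmission pattern (partition, memory, time limit and the expiry check itself are placeholders):
#!/bin/bash
#SBATCH --job-name=slk_token_expire_check
#SBATCH --partition=shared
#SBATCH --account=YOUR_PROJECT_ACCOUNT
#SBATCH --mem=1GB
#SBATCH --time=00:05:00
#SBATCH --output=slk_token_expire_check.log
#SBATCH --open-mode=append

# placeholder: check here whether the slk login token in ~/.slk/config.json
# expires soon and send a reminder email if it does (see the linked script)

# re-submit this script so that it runs again tomorrow at the same time
sbatch --begin="now+1day" --account=YOUR_PROJECT_ACCOUNT "$0"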
slk in SLURM jobs#
Please capture the exit code ($?
) of each slk
command which is run in a SLURM/batch job because some slk
commands do not print textual output when used in batch jobs (see How do I capture exit codes?; example: example batch scripts).
Please run slk archive
and slk retrieve
with 6 GB of memory (--mem=6GB
in batch jobs). Detailed diagnostic information is printed into the slk log file: ~/.slk/slk-cli.log
. The command line output might be misleading in some situations as documented in the Known Issues. If your slk
is killed with a message like /sw/[...]/bin/slk: line 16: [...] Killed
, then please inform the DKRZ support (support@dkrz.de) and re-run the slk
command with 8 GB
or 10 GB
of memory.
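As a minimal sketch (account, partition, paths and time limit are placeholders), a batch script that retrieves one file and reports the slk exit code could look like this:
#!/bin/bash
#SBATCH --job-name=slk_retrieve_example
#SBATCH --partition=shared
#SBATCH --account=YOUR_PROJECT_ACCOUNT
#SBATCH --mem=6GB
#SBATCH --time=08:00:00

module load slk

# some slk commands print no textual output in batch jobs, so capture the exit code
slk retrieve /arch/bm0146/k204221/my_data/some_file.nc /scratch/k/k204221/retrievals
exit_code=$?

echo "slk retrieve exited with code ${exit_code}"
if [ ${exit_code} -ne 0 ]; then
    echo "slk retrieve failed; please check ~/.slk/slk-cli.log for details"
fi
exit ${exit_code}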
slk memory footprint#
slk archive and slk retrieve might use up to 4 GB of memory and cause high CPU load when large files or many files are archived/retrieved. Running these tasks on the login nodes might affect other users working on these nodes. Hence, we encourage all users to perform these tasks on nodes of the compute, interactive or shared partitions and with a sufficient amount of memory allocated. In order to account for overhead and similar, 6 GB of memory should be allocated in the job script via --mem=6GB (examples: Batch Script Examples; --mem in sbatch Manual: https://slurm.schedmd.com/sbatch.html#OPT_mem).
slk data transfer rate#
Transferring files with a single call of slk archive
or slk retrieve
reaches transfer rates of up to 1 GB/s and might even slightly exceed this rate under favorable conditions. The transfer rate can be lower if the StrongLink constellation is under high load. Running several slk archive or slk retrieve commands in parallel on one node will not necessarily increase the total transfer rate because at some point the transfer rate is limited by the hardware. We are still lacking experience on how many parallel slk archive / retrieve calls are reasonable. Generally, please avoid running too many individual parallel calls of slk retrieve but try to bundle the retrieval of individual files as suggested in search and retrieval of search results. If you run several calls of slk in parallel, please make sure to allocate sufficient memory.
Archival#
slk archive can be used to archive files or directories. Similar to cp, it prints no output by default. A progress bar is printed when -v is set in interactive mode. A detailed list of all processed files is printed when -vv is set (also in batch mode). The exit code of slk archive should be captured and printed if the command is used in a batch script. The variable $? holds the exit code of the preceding command (see How do I capture exit codes?; extended examples: Archival script templates and Retrieval script templates). An example slk archive call would be:
$ slk archive /work/bm0146/k204221/some_file.nc /arch/bm0146/k204221/my_data/
slk archive allows the usage of * as a wildcard (but not ? or [...]). Add -R to do recursive archival of directories.
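For illustration (the paths are placeholders), a recursive archival of a whole directory and an archival of all matching files via the wildcard could look like this:
$ slk archive -R /work/bm0146/k204221/experiment_output /arch/bm0146/k204221/my_data
$ slk archive /work/bm0146/k204221/experiment_output/*.nc /arch/bm0146/k204221/my_data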
Warning
If slk archive was terminated unexpectedly, all archived files should be checked for completeness. Files which are listed by slk list are not necessarily complete. Incomplete files have a size of 0 bytes.
Warning
Please avoid archiving more than 5 TB to 10 TB with one call of slk archive. Archiving larger amounts of data at once might cause slk archive to fail (details: How much data can I archive at once?).
Warning
Please do not archive many small files but pack them into tar balls or similar. It is more efficient and faster to retrieve one file of 1 GB size instead of ten files of 1 MB size. Details are provided here. Additionally, the tapes are stressed more by many single reading operations, which increases the probability of tape failures.
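One simple way to follow this advice (paths are placeholders) is to pack the small files into a tar ball first and archive only the tar ball:
$ cd /work/bm0146/k204221
$ tar -cf small_files_collection.tar small_files_directory/
$ slk archive small_files_collection.tar /arch/bm0146/k204221/my_data/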
Pack & Archive#
The package packems, which was developed by MPI-M and DKRZ for the HPSS, has been adapted to the StrongLink system / slk. This package simplifies packing and archiving multiple data files to tape as well as their retrieval. It consists of three command line programs:
a pack-&-archive tool packems,
a list-archived-content tool listems, and
a retrieve-&-unpack tool unpackems.
Currently, there is a low probability that slk archive and other slk commands fail with connection timeouts. When this happens, the packems workflow fails as well. Therefore, we propose to run packems only to generate tar balls and to run slk archive manually afterwards. This would be an example workflow (download links for the scripts are provided below the code section):
# set some variables for better understanding
# TODO: please set these environment variables according to your needs
source_path=/work/ab1234/c567890/scenario_xyz/output
tmp_path=/scratch/c/c567890/packems_tmp
destination_path=/arch/ab1234/model_output
tar_base_name=scenario_xyz
# load packems module
module load packems slk slk_helpers
# run packems
# if you are in a SLURM job on a compute node you can increase `-j 2` to higher values -- e.g. `-j 16`
# TODO: please add additional flags to the packems call according to your needs
packems -j 2 ${source_path} -d ${tmp_path} -o ${tar_base_name} --no-archive
# write infos on packems and the packems call into a file which will be the header of the INDEX.txt file later on
echo "/sw/spack-levante/packems-2.0.2-cspaky/bin/packems" > ${tmp_path}/INDEX.txt.header
# TODO: please replace the packems call in the next line by the packems call from four lines above
echo "packems -j 2 ${source_path} -d ${tmp_path} -o ${tar_base_name} --no-archive" >> ${tmp_path}/INDEX.txt.header
echo "version: 2.0.2" >> ${tmp_path}/INDEX.txt.header
echo "---" >> ${tmp_path}/INDEX.txt.header
# archive tar balls
slk archive ${tmp_path}/${tar_base_name}*.tar ${destination_path}
# generate parts of the INDEX.txt based on the *.tar.idx files (same folder as tar balls)
./gen_index_part.sh ${tmp_path}
# update the packems INDEX.txt file
./update_packems_index.sh ${tmp_path} ${destination_path}
Download of the used scripts:
For details on the usage of packems
please have a look into the packems manual.
List content#
slk list will print the content of a namespace similar to ls -la. The file sizes are printed in a human-readable format by default. Please set -b to print the file sizes in bytes.
$ slk list /arch/bm0146/k204221/my_data
-rw-rw-rw-- k204221 bm0146 1.2K 27 Mar 2020 13:18 borehole_01.nc
-rw-rw-rw-- k204221 bm0146 1.2K 04 Mar 2021 10:13 nc3.nc
-rw-rw-rw-- k204221 bm0146 1.2K 04 Mar 2021 09:29 nc_k_2.nc
-rw-rw-rw-- k204221 bm0146 4.0M 04 Mar 2021 17:22 nc_k_3.nc
-rw-rw-rw-- k204221 bm0146 4.0M 04 Mar 2021 10:02 nc_k_4.nc
-rw-rw-rw-- k204221 bm0146 13.1K 08 Dec 2020 22:29 small.nc
-rwxrwx-w-t k204221 bm0146 105.5M 08 Nov 2019 09:05 small_BPb4-Sl-mT_00062104_00040000000_01040000000.AGM07807972.freeze.nc
-rwxrwx-w-t k204221 bm0146 105.5M 14 Nov 2019 03:23 small_BPb4-Sl-mT_00062104_00040000000_01040000000.AGM07807972.nc
-rw-rw-rw-- k204221 bm0146 1.2K 23 Mar 2021 22:41 test.nc
-rw-rw-rw-- k204221 bm0146 1.2K 23 Mar 2021 20:04 zonk.nc
-rw-rw-rw-- k204221 bm0146 1.2K 28 Jun 2021 14:26 some_file.nc
Files: 11
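To get the exact file sizes in bytes instead of the human-readable format, add -b:
$ slk list -b /arch/bm0146/k204221/my_data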
Validate archivals#
This content has been moved to section Validate archivals on the page Archivals to tape.
Retrievals#
Warning
If you retrieve more than 10 files at once, please run slk_helpers gfbt PATH -R --count-tapes or slk_helpers gfbt --search-id SEARCH_ID --count-tapes first. If the files are located on more than five tapes, please split the retrieval into multiple parts as described here (see also short command manual and usage examples). Currently, the whole StrongLink system slows down considerably when single retrievals/recalls access too many tapes at once. This issue is being treated with high priority and is expected to be solved in the future. Therefore, splitting retrievals as described above is very important to keep the StrongLink system running fast.
slk retrieve is the counterpart of slk archive. It behaves much the same as slk archive (-R for recursive retrieval), but the * wildcard works only for files and not for namespaces. If used on a Levante login node, slk retrieve can only retrieve one file per call. To retrieve more than one file at once, please run slk retrieve on nodes of the compute, shared and interactive partitions of Levante. If you wish to use slk retrieve interactively, please start an interactive batch session via the interactive partition with salloc (Run slk in the “interactive” partition and Data Processing on Levante).
$ slk retrieve /arch/bm0146/k204221/my_data/nc3.nc /work/bm0146/k204221/results
If slk retrieve
is run to retrieve a file, it checks whether this file is already stored in the HSM cache or not.
If the file is stored in the HSM cache then it is copied directly from the cache to Levante.
If the file is not stored in the HSM cache then a recall job is started in StrongLink which copies the file from tape to the HSM cache. When the file has arrived in the HSM cache it is automatically copied from the cache to Levante.
The same workflow applies when a set of files is retrieved. When multiple files which are stored on one tape should be retrieved, it is inefficient to retrieve each file with a single call of slk retrieve because – in the worst case – the tape might be loaded and ejected for each file individually. Instead, all files which are located on one tape should be retrieved with one call of slk retrieve. There are two ways to do this:
Perform a search for all files that should be retrieved and then retrieve the search results as suggested in Run a search query and retrieve search results and this example script (a sketch is shown after these two options). This will decrease the overall time needed for the retrieval because the files are read more efficiently from tape. In this case, data from more than one tape might be accessed. If one of these tapes is not available at the moment, the whole retrieval might fail. Additionally, the StrongLink system will slow down if you retrieve data from more than 10 tapes with one call of slk retrieve. Please check in advance via slk_helpers gfbt --search-id SEARCH_ID --count-tapes.
The best performance and error control is reached if all needed data stored on one tape are retrieved/recalled with one call of slk retrieve / slk recall. At the same time, each slk retrieve call should target only one tape. This can be done via the command slk_helpers group_files_by_tape / slk_helpers gfbt. The usage of this command is described here (see also short command manual and usage examples).
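A minimal sketch of the first approach (query, search ID and paths are placeholders; the actual search ID has to be taken from the output of the search command):
$ slk search '{"$and": [{"path": {"$gte": "/arch/bm0146/k204221/my_data"}}, {"resources.name": {"$regex": ".nc$"}}]}'
# note the search ID printed by slk search -- assume it is 12345 here
$ slk_helpers gfbt --search-id 12345 --count-tapes
# if not too many tapes are involved, retrieve all search results in one go
$ slk retrieve 12345 /work/bm0146/k204221/results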
Each copy process from tape to the HSM cache has its own job ID. This ID is printed into the slk log ~/.slk/slk-cli.log by slk retrieve and slk recall. If your retrieval/recall takes very long, you can check the status of the copy job via slk_helpers job_status JOB_ID. If the StrongLink system is very busy, jobs are queued. You can check the queue length via slk_helpers job_queue. For details please read Waiting and processing time of retrievals.
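For example (the job ID is a placeholder taken from ~/.slk/slk-cli.log):
$ slk_helpers job_status 123456
$ slk_helpers job_queue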
Note
We recommend using striping on Levante for the time being. Some folders are already striped; please check your target directory in advance.
Move, rename and delete files or namespaces#
Move a file from one namespace to another. The file’s name cannot be changed by this command.
$ slk move /arch/bm0146/k204221/my_data/nc_k_2.nc /arch/bm0146/k204221/old_data
Warning
slk move automatically overwrites target files if they exist (similar to mv). Please run slk move with -i to avoid this.
Rename a file. The file’s location cannot be changed by this command.
$ slk rename /arch/bm0146/k204221/my_data/nc_k_3.nc a_netcdf_file.nc
Delete a file. If slk delete is applied to a namespace, it deletes all files in this namespace without confirmation. Files in sub-namespaces are not deleted. To delete the whole namespace with all sub-, subsub-, …-namespaces and their content, please append -R. You may supply more than one file to slk delete to delete several files at once.
$ slk delete /arch/bm0146/k204221/my_data/zonk.nc
Search files#
Files stored in StrongLink can be searched via search queries, which are written in JSON. You can use jq
to print the search queries in a human-readable way. Details on the format of search queries are given in Metadata in StrongLink.
slk provides search functionality via the command slk search (activated again since mid Oct. 2022). We also provide slk_helpers search_limited, which intentionally fails if more than 1000 files are found (details on the slk_helpers manual page). Both search commands return a search ID that can be used with slk list and slk retrieve to list and retrieve, respectively, the found data. Below you will find three example tasks for searches. More examples and explanations are given in Metadata in StrongLink.
Task: find all *.nc
files in the namespace /ex/am/ple/data
slk_helpers search_limited '{"$and": [{"path": {"$gte": "/ex/am/ple/data"}}, {"resources.name": {"$regex": ".nc$"}}]}'
Task: find all netCDF files which have a global attribute project with the value ABC. Regarding netcdf.Project below, please note that netcdf is written in lower case and Project starts with an upper-case P, although the global attribute might have been written with a lower-case p.
slk_helpers search_limited '{"netcdf.Project": "ABC"}'
Task: find all INDEX.txt
files that either belong to user k204221
(uid: 25301) in /arch
or that are stored in the namespace /double/bm0146
.
slk_helpers search_limited '{"$and": [{"resources.name": "INDEX.txt"}, {"$or": [{"$and": [{"resources.posix_uid": 25301}, {"path": {"$gte": "/arch"}}]}, {"path": {"$gte": "/double/bm0146"}}]}]}'
The slk_helpers
provide the command gen_file_query
with which simple file queries can be generated. The query for the first task could be generated as follows:
slk_helpers gen_file_query '/ex/am/ple/data/.nc$'
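If you want to run the generated query directly, you can pass the output of gen_file_query to search_limited via command substitution (a sketch, assuming gen_file_query prints only the query itself):
$ slk_helpers search_limited "$(slk_helpers gen_file_query '/ex/am/ple/data/.nc$')"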
Run slk in the “interactive” partition#
If you want to archive/retrieve large files or many files interactively, please use slk archive / slk retrieve via the interactive partition and allocate 6 GB of memory. This is done by running this command on one of the login nodes of Levante:
salloc --mem=6GB --partition=interactive --account=YOUR_PROJECT_ACCOUNT
Your user account has to be associated with a project account (YOUR_PROJECT_ACCOUNT) which has compute time available. Please contact support@dkrz.de if you are not a member of such a project but need to use salloc or sbatch for data retrieval or archival.
Run slk as batch job#
A “batch job” denotes a script or a program which runs on one of the Levante compute or shared nodes. The script/program is submitted via sbatch to the SLURM resource manager.
sbatch MY_JOB_SCRIPT.sh
Please always run slk
jobs with 6 GB of memory allocated: either via a parameter to the sbatch
command (--mem=6GB
) or in the header of the batch script (#SBATCH --mem=6GB
).
You can check the status of your jobs (queued, running, finishing, …) via
squeue -u $USER
Details on the usage of sbatch and similar SLURM-related commands are given in SLURM Introduction. Exemplary batch scripts for archival and retrieval are given in the sections Archival script templates and Retrieval script templates, respectively.
Your user account has to be associated with a project account (YOUR_PROJECT_ACCOUNT) which has compute time available. Please contact support@dkrz.de if you are not a member of such a project but need to use salloc or sbatch for data retrieval or archival.
Debugging#
Please have a look into the slk log file in ~/.slk/slk-cli.log
for detailed error messages. If you send error reports or questions on failed slk
calls to the DKRZ support, please attach your slk log.
pyslk#
Most slk and slk_helpers commands are available as functions pyslk.pyslk.slk_COMMAND(...). These functions are simple wrappers that print the text which the slk/slk_helpers commands normally print to the command line. Slightly more advanced wrappers are available for a few commands via pyslk.parsers.slk_COMMAND_.... pyslk is installed on Levante in the latest python3 module. It can also be downloaded separately; however, it needs slk to be installed. Details on the availability are listed here: pyslk availability.
A few usage examples:
> from pyslk import pyslk as pslk
> from pyslk import parsers as psr
> pslk.slk_list('/arch')
drwxrwxr-x- 7003 1001 06 Aug 2021 10:17 aa0049
drwxrwxr-x- 7003 1151 06 Aug 2021 07:04 aa0238
drwxrwx-w-- 7003 1079 06 Aug 2021 10:13 ab0036
drwxrwx-w-- 7003 1007 06 Aug 2021 11:56 ab0051
...
> pslk.slk_version()
SCLI Version 3.3.21
> psr.slk_list_formatted('/arch')
permissions owner group size day month year filename
0 drwxrwxr-x- a270003 aa0049 06 Aug 2021 10:17 aa0049
1 drwxrwxr-x- a270003 aa0238 06 Aug 2021 07:04 aa0238
2 drwxrwx-w-- a270003 ab0036 06 Aug 2021 10:13 ab0036
3 drwxrwx-w-- a270003 ab0051 06 Aug 2021 11:56 ab0051
4 ...
>