Retrievals from tape¶
file version: 02 May 2022
current slk version: 3.3.21
Note
Currently, slk retrieve is available on Levante in the partitions slk, compute, shared and interactive and on Mistral in the partition slk. It is not available on login nodes. If you wish to use slk retrieve interactively, please start an interactive batch session via the interactive partition with salloc (Data Processing on Levante). The slk partition will not exist permanently but will be removed in a few weeks. We still consider this a testing phase of slk retrieve.
Introduction and Summary¶
slk retrieve has been available on Levante nodes of the partitions slk, compute, shared and interactive since 19 April 2022. However, it is still considered to be in a test phase.
Since calendar week 13 of 2022, a SLURM queue slk on Levante has provided the possibility to use slk retrieve. It will remain open until the whole test phase of slk retrieve is finished. At the moment, each user can submit a maximum of two jobs to this queue in parallel. Four jobs in total are allowed to run in parallel. An example batch script to run slk retrieve in the slk or compute queues is provided below.
A similar queue with half the number of parallel jobs is offered on Mistral (one job per user; two jobs in total). The retrieval of large files (> 1 GB) might cause considerable performance degradation of the old Lustre filesystem connected to Mistral. This would affect all users on Mistral. The performance degradation does not occur if striping with a factor of 8 is applied when files > 1 GB are retrieved. Therefore, jobs submitted to the slk queue on Mistral are manually reviewed by DKRZ staff to check whether data are retrieved into a striped folder. Because of this manual review, the queue on Mistral is deactivated overnight.
Note
Please note the section Aggregate file retrievals in order to speed up your retrievals.
Recommendations for usage of slk retrieve¶
High memory usage: Please allocate 6 GB of memory for slk retrieve and slk archive calls (argument for sbatch: --mem=6GB). Otherwise, your commands might be killed by the operating system. If you plan to run three retrievals in parallel, please allocate 18 GB, and so on.
On Mistral, mandatory: The target folder of a retrieval has to be striped as follows:
lfs setstripe -S 4M -c 8 TARGET_FOLDER
On Levante, recommended: The target folder of a retrieval should be striped as follows:
lfs setstripe -E 1G -c 1 -S 1M -E 4G -c 4 -S 1M -E -1 -c 8 -S 1M TARGET_FOLDER
Check exit codes: Information on the success or failure of slk archive and slk retrieve will not be printed into the SLURM log automatically. We strongly suggest checking the exit code of each slk archive / slk retrieve call and printing it to the job log. If slk archive was terminated unexpectedly, all archived files should be checked for completeness. See the respective section in Getting Started for details. Please also be aware that some issues on Levante might cause slk to crash randomly (see section slk issues on Levante on page Known Issues).
slk partition: Please only run slk retrieve via this queue and don’t process or unpack the retrieved data in this queue. For the latter tasks, please submit a job to one of the other queues. This maximizes the number of users who can retrieve data.
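The memory recommendation scales linearly with the number of slk calls that run at the same time. The following is a minimal sketch of that arithmetic, assuming 6 GB per call as recommended above; the function name slk_mem_request is hypothetical and only serves as an illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of slk): print the sbatch memory argument
# for N parallel slk retrieve / slk archive calls, assuming 6 GB per call.
slk_mem_request() {
    local n_parallel="$1"
    local gb_per_call=6
    echo "--mem=$(( n_parallel * gb_per_call ))GB"
}

# three parallel retrievals, as in the example in the text
mem_argument=$(slk_mem_request 3)
echo "${mem_argument}"
```

For three parallel retrievals this prints --mem=18GB, which matches the value named in the recommendation.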
Speed up your retrievals¶
When you retrieve data from the HSM, these data are first copied from tape to the HSM cache and, in a second step, copied from the HSM cache to your local filesystem (Lustre). If the data are already located in the cache, the first step is automatically omitted. Commonly, copying a file from tape to cache takes longer than copying it from cache to the Lustre filesystem. Therefore, it is good to know where the needed files are currently stored in order to estimate the time needed for retrieval. Below are some hints on how to improve the speed of your file retrievals.
Is a file stored in the HSM cache already or exclusively on tape?¶
The output of slk list indicates whether a file is stored in the HSM cache or not. If the 11th character of the permissions string is a t, then the file is stored exclusively on tape. If it is a -, then the file is stored in the cache. In the latter case, the listing does not tell whether the file is additionally stored on tape; for example, a file that was archived shortly before slk list was run might not have been transferred to tape yet. Example:
$ slk list /arch/ex/am/ple
-rw-r--r--- k204221 bm0146 11 B 02 Mar 2021 file_1.txt
-rw-r--r--t k204221 bm0146 16 B 02 Mar 2021 file_2.txt
-rw-r--r--t k204221 bm0146 15 B 02 Mar 2021 file_3.txt
Example explained: The file file_1.txt is stored in the cache and can be quickly retrieved. The files file_2.txt and file_3.txt are only stored on tape and their retrieval will take more time.
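The check of the 11th character can also be scripted. The following is a minimal sketch, assuming the listing format shown in the example and file names without blanks; the function name classify_listing is hypothetical. It runs on a copy of the sample listing so that it works without access to the HSM.

```shell
#!/bin/bash
# Classify the entries of an 'slk list' style listing by the 11th character
# of the permissions string: 't' = exclusively on tape, '-' = in the cache.
# Assumption: file names contain no blanks.
classify_listing() {
    local perms rest
    while read -r perms rest; do
        case "${perms}" in
            ??????????t) echo "tape-only ${rest##* }" ;;
            ??????????-) echo "cached ${rest##* }" ;;
            *) ;;  # summary line ("Files: N") or anything else: skip
        esac
    done
}

# Sample listing copied from the example in the text. In real use, the
# output of 'slk list NAMESPACE' would be piped into the function instead.
sample_listing='-rw-r--r--- k204221 bm0146 11 B 02 Mar 2021 file_1.txt
-rw-r--r--t k204221 bm0146 16 B 02 Mar 2021 file_2.txt
-rw-r--r--t k204221 bm0146 15 B 02 Mar 2021 file_3.txt'

classification=$(printf '%s\n' "${sample_listing}" | classify_listing)
echo "${classification}"
```

For the sample listing, file_1.txt is reported as cached and the other two files as tape-only.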
Additionally, slk_helpers features a command iscached which prints information on the storage location of a file. Please note that the exit code of this command is 1 if the tested file is not cached. Example:
$ slk_helpers iscached /arch/ex/am/ple/file_2.txt
File is not cached
$ echo $?
1
$ slk_helpers iscached /arch/ex/am/ple/file_1.txt
File is cached
$ echo $?
0
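Because iscached reports its result via the exit code, it can be used in scripts without parsing its text output. The sketch below prints those files of a given list that still have to be read from tape. The function name list_uncached is hypothetical, and treating every exit code other than 0 and 1 as a failed check is an assumption, since only the codes 0 and 1 appear in the example.

```shell
#!/bin/bash
# Hypothetical helper: print every given HSM path that is not cached.
# Relies only on the exit code of 'slk_helpers iscached':
#   0 = cached, 1 = not cached; anything else is ASSUMED to be an error.
list_uncached() {
    local hsm_path exit_code
    for hsm_path in "$@"; do
        exit_code=0
        slk_helpers iscached "${hsm_path}" > /dev/null 2>&1 || exit_code=$?
        case "${exit_code}" in
            0) ;;                     # cached: quick retrieval expected
            1) echo "${hsm_path}" ;;  # not cached: has to be read from tape
            *) >&2 echo "could not check ${hsm_path} (exit code ${exit_code})" ;;
        esac
    done
}

# usage on a node where slk_helpers is available:
#   list_uncached /arch/ex/am/ple/file_1.txt /arch/ex/am/ple/file_2.txt
```

The number of printed paths gives a rough idea of how much of a retrieval will be limited by tape access.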
Aggregate file retrievals¶
When several files are to be retrieved, it is most efficient to retrieve all files at once with one call of slk retrieve instead of retrieving each file with an individual call of slk retrieve. First, all files that are stored on one tape will be read from that tape at once. Individual slk retrieve calls would cause the tape to be ejected and brought back to its shelf after each retrieval. Second, one slk retrieve command might use several tape drives to copy files from distinct tapes in parallel. Using one slk retrieve call per file does not allow using this feature. Therefore, it is useful to aggregate several file retrievals into one.
Recursive retrievals¶
When an entire namespace or most files of one namespace are to be retrieved, this namespace should be retrieved via one recursive slk retrieve call. Example:
$ slk list /ex/am/ple/data | cat
-rw-r--r--- k204221 bm0146 1.2M 10 Jun 2020 INDEX.txt
-rw-r--r--t k204221 bm0146 19.5G 05 Jun 2020 out_data_001.tar
-rw-r--r--t k204221 bm0146 19.0G 05 Jun 2020 out_data_002.tar
-rw-r--r--t k204221 bm0146 19.4G 05 Jun 2020 out_data_003.tar
-rw-r--r--t k204221 bm0146 19.3G 05 Jun 2020 out_data_004.tar
-rw-r--r--t k204221 bm0146 19.1G 05 Jun 2020 out_data_005.tar
-rw-r--r--t k204221 bm0146 7.8G 05 Jun 2020 out_data_006.tar
Files: 7
$ slk retrieve -R /ex/am/ple/data /tar/get/folder
[ ] 100% complete. Files retrieved: 7/7, [105.3G/105.3G].
$ ls /tar/get/folder
INDEX.txt out_data_001.tar out_data_002.tar out_data_003.tar out_data_004.tar out_data_005.tar out_data_006.tar
Run a search query and retrieve search results¶
If only some but not all files of one namespace, or a set of files distributed over different namespaces, are to be retrieved, it is reasonable to define a search query to find these files and retrieve them via their search ID. In the example below, the files out_data_002.tar, out_data_005.tar and out_data_006.tar are to be retrieved:
$ slk list /ex/am/ple/data | cat
-rw-r--r--- k204221 bm0146 1.2M 10 Jun 2020 INDEX.txt
-rw-r--r--t k204221 bm0146 19.5G 05 Jun 2020 out_data_001.tar
-rw-r--r--t k204221 bm0146 19.0G 05 Jun 2020 out_data_002.tar
-rw-r--r--t k204221 bm0146 19.4G 05 Jun 2020 out_data_003.tar
-rw-r--r--t k204221 bm0146 19.3G 05 Jun 2020 out_data_004.tar
-rw-r--r--t k204221 bm0146 19.1G 05 Jun 2020 out_data_005.tar
-rw-r--r--t k204221 bm0146 7.8G 05 Jun 2020 out_data_006.tar
Files: 7
$ slk_helpers search_limited '{"$and": [{"path": {"$gte": "/ex/am/ple/data"}}, {"resources.name": {"$regex": "out_data_00[256].tar"}}]}'
slk list Search continuing. ..
Search ID: 65621
$ slk list 65621 | cat
-rw-r--r--t k204221 bm0146 19.0G 05 Jun 2020 out_data_002.tar
-rw-r--r--t k204221 bm0146 19.1G 05 Jun 2020 out_data_005.tar
-rw-r--r--t k204221 bm0146 7.8G 05 Jun 2020 out_data_006.tar
Files: 3
$ slk retrieve 65621 /tar/get/folder
[ ] 100% complete. Files retrieved: 3/3, [45.9G/45.9G].
$ ls /tar/get/folder
out_data_002.tar out_data_005.tar out_data_006.tar
In this example, the namespace /ex/am/ple/data and all of its sub-namespaces are searched recursively. If the search should be performed non-recursively, only in /ex/am/ple/data, then the operator $max_depth can be used:
$ slk_helpers search_limited '{"$and": [{"path": {"$gte": "/ex/am/ple/data", "$max_depth": 1}}, {"resources.name": {"$regex": "out_data_00[256].tar"}}]}'
slk list Search continuing. ..
Search ID: 65622
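Writing such queries by hand is error-prone because of the nested quotes and the $ signs. The following is a minimal sketch that assembles a query of the form used in the examples from a namespace, a file name regex and an optional depth limit. The function name build_search_query is hypothetical, and the sketch assumes that neither the namespace nor the regex contains double quotes or backslashes, because no JSON escaping is done.

```shell
#!/bin/bash
# Hypothetical helper: assemble a search query string of the form used in
# the examples in the text.
#   $1 = namespace, $2 = regex for the file name, $3 = optional $max_depth
# Assumption: $1 and $2 contain no double quotes or backslashes.
build_search_query() {
    local namespace="$1"
    local name_regex="$2"
    local max_depth="${3:-}"
    local path_clause
    if [ -n "${max_depth}" ]; then
        path_clause=$(printf '{"$gte": "%s", "$max_depth": %s}' "${namespace}" "${max_depth}")
    else
        path_clause=$(printf '{"$gte": "%s"}' "${namespace}")
    fi
    printf '{"$and": [{"path": %s}, {"resources.name": {"$regex": "%s"}}]}\n' \
        "${path_clause}" "${name_regex}"
}

query=$(build_search_query /ex/am/ple/data 'out_data_00[256].tar' 1)
echo "${query}"
```

The resulting string can then be passed to the search command as one quoted argument, for example slk_helpers search_limited "${query}".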
Limits¶
slk retrieve SEARCH_ID TARGET_FOLDER will retrieve all files found by the search query SEARCH_ID into the same TARGET_FOLDER. Thus, if there are several files with the same name but different source locations, they will be written into TARGET_FOLDER and renamed as shown in the example below:
$ slk_helpers search_limited '{"$and": [{"path": {"$gte": "/ex/am/ple"}}, {"resources.name": "INDEX.txt"}]}'
slk list Search continuing. ..
Search ID: 65621
$ slk list 65621 | cat
-rw-r--r--- k204221 bm0146 1.6M 11 Sep 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 1.2M 10 Jun 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 924.9K 11 Sep 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 1.3M 11 Sep 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 924.9K 11 Sep 2020 INDEX.txt
Files: 5
$ slk retrieve 65621 .
[ ] 100% complete. Files retrieved: 5/5, [5.8M/5.8M].
$ ls
INDEX.DUPLICATE_FILENAME.49058705493.1.txt INDEX.DUPLICATE_FILENAME.49058705494.1.txt INDEX.DUPLICATE_FILENAME.49058705496.1.txt INDEX.DUPLICATE_FILENAME.49058705520.1.txt INDEX.txt
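After a retrieval by search ID, it can be useful to check whether any files were renamed in this way. The following is a minimal sketch, assuming that renamed files always contain the string .DUPLICATE_FILENAME. in their name as in the example above (the naming pattern is inferred from that single example). The function name list_renamed_duplicates is hypothetical, and the demonstration uses a temporary folder with empty files instead of a real retrieval.

```shell
#!/bin/bash
# Hypothetical helper: list files in a target folder whose names follow the
# DUPLICATE_FILENAME pattern seen in the example (assumed naming scheme).
list_renamed_duplicates() {
    find "$1" -maxdepth 1 -type f -name '*.DUPLICATE_FILENAME.*' | sort
}

# Demonstration on a temporary folder that mimics a target folder
demo_dir=$(mktemp -d)
touch "${demo_dir}/INDEX.txt" \
      "${demo_dir}/INDEX.DUPLICATE_FILENAME.49058705493.1.txt" \
      "${demo_dir}/INDEX.DUPLICATE_FILENAME.49058705494.1.txt"

duplicate_count=$(list_renamed_duplicates "${demo_dir}" | wc -l | tr -d ' ')
echo "renamed duplicates found: ${duplicate_count}"
rm -r "${demo_dir}"
```

If the count is not zero, the source location of each renamed file has to be determined separately, because the new name does not contain the source path.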
Striping¶
Please see Speedup Retrievals with Striping. Striping is mandatory on Mistral and recommended on Levante. In the future, it will become unnecessary or even counterproductive to use striping on Levante. This will be announced when the time comes.
Script template¶
Three script templates for three use cases are printed below and available for download:
several retrievals of single files:
retrieve_slurm_template_single_files.sh
one recursive retrieval of a whole namespace:
retrieve_slurm_template_recursive.sh
search and retrieval of search results:
retrieve_slurm_template_search.sh
When you use these templates, you need to make a few adaptations:
replace /work/xz1234/ex/am/ple by the actual target folder on the Lustre filesystem
replace xz1234 in --account=xz1234 by your project account name
replace /path/to/your/archived/ by the namespace path to your data on the HSM
several retrievals of single files¶
#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_retr_job # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=slk # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --account=xz1234 # Charge resources on this project
#SBATCH --mem=6GB
# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set target folder for retrieval
target_folder=/work/xz1234/ex/am/ple
# create folder to retrieve into (target folder)
mkdir -p ${target_folder}
# set striping for target folder
# see https://docs.dkrz.de/doc/hsm/striping.html
# ON LEVANTE
lfs setstripe -E 1G -c 1 -S 1M -E 4G -c 4 -S 1M -E -1 -c 8 -S 1M ${target_folder}
# ON MISTRAL
#lfs setstripe -S 4M -c 8 ${target_folder}
# ~~~~~~~~~~~~ retrievals ~~~~~~~~~~~~
# do the retrieval
echo "doing 'slk retrieve'"
# ~~~~~~~~~~~~ doing single-file retrievals ~~~~~~~~~~~~
# You can do multiple retrievals in one script, but based on our
# experience, 10 to 15 separate retrievals are enough to cover the
# 08:00 hours maximum time for running the slurm jobs. Therefore, it
# is recommended to run your retrieval scripts as follows:
# first retrieval and capture exit code (get $? in line after slk command)
slk retrieve /path/to/your/archived/file1 ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk retrieve call"
else
    echo "retrieval successful"
fi
# second retrieval and capture exit code (get $? in line after slk cmd)
slk retrieve /path/to/your/archived/file2 ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk retrieve call"
else
    echo "retrieval successful"
fi
# ...
# ...
# fifteenth retrieval and capture exit code (get $? in line after slk cmd)
slk retrieve /path/to/your/archived/file15 ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk retrieve call"
else
    echo "retrieval successful"
fi
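The repeated retrieve-and-check blocks of this template can also be written as a loop. The following is a minimal sketch of that idea, not part of the downloadable templates; the function name retrieve_files is hypothetical. It calls slk retrieve once per file, logs each exit code and returns a non-zero status if any retrieval failed. Where possible, one aggregated retrieval (recursive or via a search ID) remains preferable to many single-file calls.

```shell
#!/bin/bash
# Hypothetical helper: retrieve several single files one after another and
# report the exit code of each 'slk retrieve' call.
#   $1 = target folder, remaining arguments = files on the HSM
retrieve_files() {
    local target_folder="$1"
    shift
    local n_failed=0
    local source_file exit_code
    for source_file in "$@"; do
        exit_code=0
        slk retrieve "${source_file}" "${target_folder}" || exit_code=$?
        if [ "${exit_code}" -ne 0 ]; then
            >&2 echo "an error occurred in slk retrieve call for ${source_file} (exit code ${exit_code})"
            n_failed=$(( n_failed + 1 ))
        else
            echo "retrieval successful: ${source_file}"
        fi
    done
    echo "retrievals failed: ${n_failed} of $#"
    # return status 0 only if every retrieval succeeded
    [ "${n_failed}" -eq 0 ]
}

# usage inside the job script, after 'module load slk':
#   retrieve_files "${target_folder}" /path/to/your/archived/file1 /path/to/your/archived/file2
```

The return status of the function can be used to set the exit status of the whole job, so that failed retrievals are visible in the SLURM accounting as well.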
one recursive retrieval of a whole namespace¶
#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_retr_job # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=slk # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --account=xz1234 # Charge resources on this project
#SBATCH --mem=6GB
# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set target folder for retrieval
target_folder=/work/xz1234/ex/am/ple
# create folder to retrieve into (target folder)
mkdir -p ${target_folder}
# set striping for target folder
# see https://docs.dkrz.de/doc/hsm/striping.html
# ON LEVANTE
lfs setstripe -E 1G -c 1 -S 1M -E 4G -c 4 -S 1M -E -1 -c 8 -S 1M ${target_folder}
# ON MISTRAL
#lfs setstripe -S 4M -c 8 ${target_folder}
# ~~~~~~~~~~~~ doing recursive retrievals ~~~~~~~~~~~~
# If you wish to retrieve all files from a specific folder, you can use
# the recursive option (-R). Please use this option only if you need more
# than 90% of the files in a certain namespace directory, and don't
# retrieve a whole directory of e.g. 500 files for only 20 files. An
# example for recursive retrieval is as follows:
slk retrieve -R /path/to/your/archived/directory ${target_folder}
# '$?' captures the exit code of the previous command (you can put it in
# the next line after each slk command).
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk retrieve call"
else
    echo "retrieval successful"
fi
search and retrieval of search results¶
#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_retr_job # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=slk # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --account=xz1234 # Charge resources on this project
#SBATCH --mem=6GB
# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set target folder for retrieval
target_folder=/work/xz1234/ex/am/ple
# create folder to retrieve into (target folder)
mkdir -p ${target_folder}
# set striping for target folder
# see https://docs.dkrz.de/doc/hsm/striping.html
# ON LEVANTE
lfs setstripe -E 1G -c 1 -S 1M -E 4G -c 4 -S 1M -E -1 -c 8 -S 1M ${target_folder}
# ON MISTRAL
#lfs setstripe -S 4M -c 8 ${target_folder}
# ~~~~~~~~~~~~ doing the search ~~~~~~~~~~~~
# Set up a search that finds the files which you want to retrieve and
# capture the resulting search id. See this FAQ entry for alternatives
# to capture the search id: "Can the search ID of slk search be captured
# by a shell variable?". We do this in two steps in order to be able to
# capture the exit code of the search command.
search_id_raw=`slk_helpers search_limited '{"$and": [{"path": {"$gte": "/path/to/your/archived"}}, {"resources.name": {"$regex": "out_data_00[256].tar"}}]}'`
# '$?' captures the exit code of the previous command (you can put it in
# the next line after each slk command).
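if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk_helpers search_limited call"
    # without a valid search ID the retrieval below cannot work
    exit 1
else
    echo "search successful"
fi
# quote the variable so that 'tail -n 1' sees the line with the search ID
search_id=`echo "${search_id_raw}" | tail -n 1 | sed 's/[^0-9]*//g'`
echo "The search ID is ${search_id}"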
#
# If we delimit the search query by `"` instead of `'` and escape all
# `$` and `"` in the query, then we might even use environment variables
# in the search query
#
# source_namespace=/path/to/your/archived
# slk_helpers search_limited "{\"\$and\": [{\"path\": {\"\$gte\": \"${source_namespace}\"}}, {\"resources.name\": {\"\$regex\": \"out_data_00[256].tar\"}}]}"
#
# ~~~~~~~~~~~~ retrieving the search results ~~~~~~~~~~~~
# If you wish to retrieve a set of files that have been found by a search
# you can use "slk retrieve SEARCH_ID TARGET_FOLDER"
slk retrieve ${search_id} ${target_folder}
# '$?' captures the exit code of the previous command (you can put it in
# the next line after each slk command).
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk retrieve call"
else
    echo "retrieval successful"
fi
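The extraction of the search ID can be made stricter than removing all non-digits. The following is a minimal sketch, assuming that the search command prints a line of the form "Search ID: NUMBER" as in the example output on this page; the function name extract_search_id is hypothetical. It returns an empty string when no such line exists, so that digits in an error message are not mistaken for a search ID.

```shell
#!/bin/bash
# Hypothetical helper: extract the search ID from the text printed by the
# search command. Only a line of the form "Search ID: NUMBER" is accepted
# (format assumed from the example output).
extract_search_id() {
    printf '%s\n' "$1" | sed -n 's/^Search ID: *\([0-9][0-9]*\) *$/\1/p' | tail -n 1
}

# Sample output copied from the example; in the job script this would be
# the content of the variable that captured the search output.
sample_output='slk list Search continuing. ..
Search ID: 65621'

search_id=$(extract_search_id "${sample_output}")
if [ -z "${search_id}" ]; then
    >&2 echo "no search ID found in the search output"
else
    echo "The search ID is ${search_id}"
fi
```

Checking for an empty result before calling slk retrieve avoids starting a retrieval with a missing or wrong search ID.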