Important news for slk usage#
retrievals from tape#
Please do not retrieve files from more than 10 tapes with one call of slk retrieve / slk recall. Using more tapes in one call of slk retrieve might slow down the whole StrongLink system considerably.
First, count the number of tapes as follows:
$ slk_helpers gfbt /arch/bm0146/k204221/iow -R --count-tapes
10 tapes with single-tape files
0 tapes with multi-tape files
Note: This also works for file lists and search IDs as described here.
Second, generate one search ID per tape:
$ slk_helpers gfbt /arch/bm0146/k204221/iow/ -R --gen-search-id
cached (AVAILABLE ): 417725
C25543L6 (AVAILABLE ): 417715
C25566L6 (AVAILABLE ): 417716
M12208M8 (AVAILABLE ): 417717
M20471M8 (AVAILABLE ): 417718
M12211M8 (AVAILABLE ): 417719
C25570L6 (AVAILABLE ): 417720
M12215M8 (AVAILABLE ): 417721
C25539L6 (AVAILABLE ): 417722
B09208L5 (AVAILABLE ): 417723
M12217M8 (AVAILABLE ): 417724
Third, run one retrieval or recall per search ID. Please run only three to five retrievals/recalls in parallel so that other users can get their data as well. You might use a batch script like the sketch below to submit the retrievals as SLURM jobs.
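A minimal sketch of such a SLURM job script, assuming the hypothetical search ID 417715 from the listing above, a hypothetical destination under /scratch, and that slk is provided as a module; job name, partition, account and paths must be adapted to your own project:

#!/bin/bash
#SBATCH --job-name=slk_retrieve_417715
#SBATCH --partition=shared
#SBATCH --account=bm0146
#SBATCH --ntasks=1
#SBATCH --time=08:00:00

# load the slk module (assumption: slk is available as a module)
module load slk

# retrieve all files matched by one search ID into a scratch directory;
# the search ID and the destination are placeholders
slk retrieve 417715 /scratch/k/k204221/retrieval

Submitting one such job per search ID, three to five at a time, keeps the load on StrongLink moderate.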
large archivals#
We recommend not transferring more than 5 TB with one call of slk archive, and we strongly recommend not transferring more than 10 TB with one call of slk archive. The probability that slk archive fails unexpectedly rises with its run time and with the amount of transferred data. This is caused by several factors which the archiving user cannot fully control. Please read the section below if slk archive fails during a transfer.
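One way to stay below these limits is to split a large archival into several smaller calls, for example one call per subdirectory. A minimal sketch, assuming hypothetical source and archive paths and that each subdirectory is well below 5 TB:

for dir in /work/bm0146/k204221/experiment/*/ ; do
    # one slk archive call per subdirectory keeps each transfer small
    slk archive -R "${dir}" /arch/bm0146/k204221/experiment/
done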
slk timeouts#
Users might experience timeout errors with slk and slk_helpers. These timeout errors might appear when slk starts (archive.dkrz.de: Name or service not known or no route to host) or while slk is running (Connection reset or Connection timeout has expired). These timeouts might indicate that the StrongLink system or one particular StrongLink node is under high load. However, the timeout thresholds in slk seem to be too low, which causes timeout errors when there is just increased latency in the network or a minor delay in the reply of the DKRZ DNS. We requested an increase of the timeout values but have not received an updated slk version yet. Until then, the only workaround is to run slk commands a second time if a timeout occurs.
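Instead of retrying manually, an slk call can be wrapped in a simple retry loop. A minimal sketch, assuming that slk exits with a non-zero code when a timeout occurs (the listed path is a placeholder):

for attempt in 1 2 3; do
    slk list /arch/bm0146/k204221/iow && break
    # non-zero exit code: wait briefly, then retry
    echo "slk attempt ${attempt} failed; retrying in 30 s ..." >&2
    sleep 30
done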
failed archivals#
If slk archive fails repeatedly, please have a look into the slk log file (~/.slk/slk-cli.log) and run slk archive with -vv (double verbose mode) to print a list of archived, skipped and failed files.
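Such a double-verbose re-run might look like this; source and archive paths are placeholders:

$ slk archive -R -vv /work/bm0146/k204221/results /arch/bm0146/k204221/results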
When slk archive fails or is killed while the data transfer is ongoing, please run the same call of slk archive repeatedly until it finishes successfully. This only applies when slk archive fails during the transfer due to timeout or connection-loss errors. Files which have already been successfully archived might nevertheless be flagged as incomplete. These files are marked as Partial File when listed by slk list:
$ slk list /dkrz_test/netcdf/example
-rwxr-xr-x- k204221 bm0146 553.9M 19 Jul 2021 02:18 file_500mb_d.nc
-rw-r--r--- k204221 bm0146 553.9M 19 Jul 2021 02:18 file_500mb_e.nc
-rw-r--r--- k204221 bm0146 553.9M 19 Jul 2021 02:18 file_500mb_f.nc (Partial File)
-rw-r--r--- k204221 bm0146 554.0M 19 Jul 2021 02:18 file_500mb_g.nc (Partial File)
Files: 4
The Partial File flag is not displayed if the file was moved or renamed or if the permissions, group or owner of the file were changed. This is a known slk bug. Please run slk_helpers has_no_flag_partial -v to check whether one or multiple files are flagged as partial:
$ slk_helpers has_no_flag_partial /dkrz_test/netcdf/20230504c -R -v
/dkrz_test/netcdf/20230504c/file_500mb_d.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_f.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_g.nc has partial flag
Number of files without partial flag: 7/10
Warning: has_flag_partial was renamed to has_no_flag_partial in slk_helpers 1.9.0. This was also changed in these guidelines.
estimate waiting time for retrievals / recalls#
The command slk_helpers job_queue was extended to provide some interpretation of the length of StrongLink’s recall job queue. Users can now obtain a very rough estimate of the expected waiting time of newly submitted recall jobs. The command description is here and a few example calls are given below:
$ slk_helpers job_queue
total read jobs: 110
active read jobs: 12
queued read jobs: 98
$ slk_helpers job_queue --interpret N
3
$ slk_helpers job_queue --interpret T
long
or like this when the recall queue is empty:
$ slk_helpers job_queue
total read jobs: 4
active read jobs: 4
queued read jobs: 0
$ slk_helpers job_queue --interpret N
0
$ slk_helpers job_queue --interpret T
none
$ slk_helpers job_queue --interpret D
no queue, waiting time in the queue: none
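The numeric interpretation lends itself to scripting, for example to delay new recalls while the queue is long. A minimal sketch, assuming that --interpret N prints larger values for longer queues (0 for an empty queue, as shown above) and reusing the hypothetical search ID 417715 from the retrieval example:

while [ "$(slk_helpers job_queue --interpret N)" -gt 2 ]; do
    # queue is long: wait ten minutes before checking again
    sleep 600
done
slk recall 417715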