Important news for slk usage#

retrievals from tape#

Please do not retrieve files from more than 10 tapes with one call of slk retrieve or slk recall. Using more tapes in one call might slow down the whole StrongLink system considerably.

First, count the number of tapes as follows:

$ slk_helpers gfbt /arch/bm0146/k204221/iow -R --count-tapes
10 tapes with single-tape files
0 tapes with multi-tape files

Note

This also works for file lists and search IDs as described here.

Second, generate one search ID per tape:

$ slk_helpers gfbt /arch/bm0146/k204221/iow/ -R --full --count-files
  cached (AVAILABLE  ): 417725
C25543L6 (AVAILABLE  ): 417715
C25566L6 (AVAILABLE  ): 417716
M12208M8 (AVAILABLE  ): 417717
M20471M8 (AVAILABLE  ): 417718
M12211M8 (AVAILABLE  ): 417719
C25570L6 (AVAILABLE  ): 417720
M12215M8 (AVAILABLE  ): 417721
C25539L6 (AVAILABLE  ): 417722
B09208L5 (AVAILABLE  ): 417723
M12217M8 (AVAILABLE  ): 417724

Third, run one retrieval or recall per search ID. Please run only three to five retrievals/recalls in parallel so that other users can access their data as well. You might use these batch scripts to submit the retrievals as SLURM jobs.
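
A heavily simplified sketch of such a SLURM job script is shown below. The partition, account, run time, memory, search ID and destination directory are placeholders or taken from the example output above; it assumes that slk retrieve accepts a search ID and a destination directory:

#!/bin/bash
#SBATCH --job-name=slk_retrieve_417715   # one SLURM job per search ID
#SBATCH --partition=shared               # placeholder: choose a suitable partition
#SBATCH --account=bm0146                 # placeholder: your project account
#SBATCH --time=48:00:00
#SBATCH --mem=6GB

# retrieve all files found by the search with ID 417715 (taken from the
# example output above) into a destination directory of your choice
slk retrieve 417715 /scratch/k/k204221/retrieval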

Note

Please see basic and advanced usage hints for slk_helpers.

large archivals#

We recommend not to transfer more than 5 TB with one call of slk archive and strongly recommend not to transfer more than 10 TB with one call. The probability that slk archive fails unexpectedly rises with increasing run time and with the amount of transferred data. This is caused by several factors which cannot be fully controlled by the archiving user. Please read the section on failed archivals below if slk archive fails during a transfer.
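
If a directory tree contains considerably more data than this, the transfer can be split into several smaller calls of slk archive, for example one call per subdirectory. A minimal sketch, assuming the call form slk archive -R <source> <target namespace>; the paths are placeholders:

# archive a large directory tree in several smaller pieces,
# one call of slk archive per subdirectory (paths are placeholders)
for dir in /work/bm0146/k204221/experiment/*/ ; do
    slk archive -R "${dir}" /arch/bm0146/k204221/experiment
done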

slk timeouts#

Users might experience timeout errors with slk and the slk_helpers. These timeout errors might appear when slk starts (archive.dkrz.de: Name or service not known or no route to host) or while slk is running (Connection reset or Connection timeout has expired). Such timeouts might indicate that the StrongLink system or one particular StrongLink node is under high load. However, the timeout thresholds in slk seem to be too low, which causes timeout errors when there is merely increased latency in the network or a minor delay in the reply of the DKRZ DNS. We requested an increase of the timeout values but have not received an updated slk version yet. Until then, the only workaround is to run slk commands a second time if a timeout occurs.
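
In scripts, this workaround can be applied by repeating a command when it fails, assuming that slk exits with a non-zero code in that case. A sketch; the path, the number of attempts and the waiting time are arbitrary examples:

# repeat an slk command up to three times if it fails, e.g. due to a timeout
for attempt in 1 2 3; do
    slk list /arch/bm0146/k204221/iow && break
    echo "attempt ${attempt} failed, retrying in 60 s" >&2
    sleep 60
done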

failed archivals#

If slk archive fails repeatedly, please have a look into the slk log file (~/.slk/slk-cli.log) and run slk archive with -vv (double verbose mode) to print a list of archived, skipped and failed files.
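
For example (the paths are placeholders; the -vv flag and the log file location are those mentioned above):

# re-run the archival in double verbose mode ...
slk archive -vv -R /work/bm0146/k204221/results /arch/bm0146/k204221/results
# ... and inspect the most recent entries of the slk log file
tail -n 50 ~/.slk/slk-cli.log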

When slk archive fails or is killed while the data transfer is ongoing, please run the same call of slk archive repeatedly until it finishes successfully. This only applies when slk archive fails during the transfer due to timeout or connection-loss errors. Files which have already been successfully archived might nevertheless be flagged as incomplete. These files are marked as Partial File when listed by slk list:

$ slk list /dkrz_test/netcdf/example
-rwxr-xr-x- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_d.nc
-rw-r--r--- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_e.nc
-rw-r--r--- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_f.nc (Partial File)
-rw-r--r--- k204221     bm0146        554.0M   19 Jul 2021 02:18 file_500mb_g.nc (Partial File)
Files: 4

The Partial File flag is not displayed if the file was moved or renamed or if the permissions, group or owner of the file were changed. This is a known slk bug. Please run slk_helpers has_no_flag_partial -v to check whether one or more files are flagged as partial:

$ slk_helpers has_no_flag_partial /dkrz_test/netcdf/20230504c -R -v
/dkrz_test/netcdf/20230504c/file_500mb_d.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_f.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_g.nc has partial flag
Number of files without partial flag: 7/10
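
If partial files remain, one possible approach is to repeat the identical slk archive call until no file is flagged as partial any more, as recommended above. A sketch, assuming that re-archiving with the same call replaces the partial files; the source path is a hypothetical placeholder and a real script should limit the number of attempts:

# repeat the archival as long as has_no_flag_partial still reports partial files
while slk_helpers has_no_flag_partial /dkrz_test/netcdf/20230504c -R -v | grep -q "has partial flag"; do
    slk archive -R /work/bm0146/k204221/20230504c /dkrz_test/netcdf/20230504c
done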

Warning

has_flag_partial was renamed to has_no_flag_partial in slk_helpers 1.9.0. This was also changed in these guidelines.

estimate waiting time for retrievals / recalls#

The command slk_helpers job_queue was extended to provide some interpretation of the length of StrongLink’s recall job queue. Users can now obtain a very rough estimate of the expected waiting time for newly submitted recall jobs. The command description is here and a few example calls are given below:

$ slk_helpers job_queue
total read jobs: 110
active read jobs: 12
queued read jobs: 98

$ slk_helpers job_queue --interpret N
3

$ slk_helpers job_queue --interpret T
long

or like this:

$ slk_helpers job_queue
total read jobs: 4
active read jobs: 4
queued read jobs: 0

$ slk_helpers job_queue --interpret N
0

$ slk_helpers job_queue --interpret T
none

$ slk_helpers job_queue --interpret D
no queue, waiting time in the queue: none
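
If the queue is long, it can make sense to delay new recalls until the queue has become shorter. This can be automated, for example, by polling the number of queued read jobs reported by slk_helpers job_queue. A sketch; the threshold of 50 jobs, the polling interval and the search ID are arbitrary examples, and it assumes that slk recall accepts a search ID as used in the retrieval workflow above:

# wait until fewer than 50 read jobs are queued, then submit one recall
while [ "$(slk_helpers job_queue | awk '/queued read jobs/ {print $NF}')" -ge 50 ]; do
    sleep 600
done
slk recall 417715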