slk pitfalls (read this!)

file version: 18 November 2021

current slk version: 3.3.16

Run only one slk archive/retrieve per node and user

slk is fast but memory-hungry, and it runs many threads in parallel. Running many slk calls in parallel on one node as one user (a) uses up a lot of memory and (b) causes issues in the thread management. How many slk commands can safely run in parallel per node and user depends on the amount of data that is archived or retrieved and on the available memory. Therefore, we suggest running only one slk command per user and node.

Running slk archive in one terminal window on mistralpp1 and slk list in another terminal window on the same node will not kill the slk archive. However, additionally running slk list -R NAMESPACE_WITH_MANY_SUBNAMESPACES on the same node might cause issues.
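One way to follow this suggestion in scripts is to serialize slk calls through a lock file. The following is only a minimal sketch and not a feature of slk: it assumes that the flock utility (util-linux) is installed and that ${TMPDIR:-/tmp} is local to the node. The function name run_one_slk_at_a_time and the lock file location are hypothetical.

```shell
# Sketch: let only one wrapped command per user run at a time on this node.
# Assumptions: flock (util-linux) is installed; ${TMPDIR:-/tmp} is node-local.
run_one_slk_at_a_time() {
    lock_file="${TMPDIR:-/tmp}/slk-$(id -un).lock"
    (
        # wait until no other wrapped call of this user holds the lock
        flock 9 || exit 1
        # run the command given as arguments; its exit code is passed on
        "$@"
    ) 9>"${lock_file}"
}
```

A call could then look like run_one_slk_at_a_time slk archive /work/project/user/data /ex/am/ple/blub. Calls that are not started through the wrapper are not affected by the lock.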

Running commands without -R

Non-recursiveness is interpreted differently in StrongLink than in POSIX. If a namespace/directory (not a file) is given as input to the commands slk archive, slk retrieve and slk tag, all files in this namespace/directory are affected. In contrast, cp and rm would throw an error that -r is missing. When -R is set, all sub-namespaces are also affected.

slk writes no output in non-interactive mode

All slk commands except for slk list do not print output to the stdout and stderr streams (== command line output) when they run in non-interactive mode, i.e. in SLURM jobs. Please catch the exit code of each slk archive call and check whether it is equal to 0. If not, an error occurred. Details on the error can be found in the slk log file ~/.slk/slk-cli.log. However, when you run many slk commands in parallel, the slk log becomes hard to read. Please print the time stamp (e.g. via date) when the error occurred so that you can find the details in the slk log later on. See the next code block on how to do this.

The exit code of the previous program call is stored in $?. Example:

$ slk archive /work/project/user/data /ex/am/ple/blub
...
$ echo $?
0
# or 1 or higher

In a bash/batch script it could look like this:

# ...
slk archive /work/project/user/data /ex/am/ple/blub
exit_code=$?

# print exit code with prefix so that it is easy to `grep`
echo "exit code: ${exit_code}"
if [ ${exit_code} -ne 0 ]; then
    #  print date
    date
fi
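The same pattern can be wrapped into a small function so that every call prints its time stamps and exit code in a form that is easy to grep. This is a sketch only; run_with_timestamp is a hypothetical helper name and the chosen date format is an arbitrary example, not the format of the slk log.

```shell
# Sketch: run a command and print grep-able lines with time stamps and exit code.
# "run_with_timestamp" is a hypothetical helper name, not an slk feature.
run_with_timestamp() {
    echo "start: $(date '+%Y-%m-%d %H:%M:%S') command: $*"
    "$@"
    exit_code=$?
    echo "end: $(date '+%Y-%m-%d %H:%M:%S') exit code: ${exit_code}"
    # pass the exit code of the wrapped command on to the caller
    return "${exit_code}"
}
```

In a batch script this could be used as run_with_timestamp slk archive /work/project/user/data /ex/am/ple/blub.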

slk never writes to stderr

Error output of slk is written to the stdout stream instead of the stderr stream. If slk output in non-interactive mode were activated (it is not!), you would find all error output in the SLURM stdout (not stderr) file when running jobs on mistral.

difference: slk move and slk rename

The Linux mv can both move and rename files. slk move can only move files/namespaces from one namespace to another namespace. Renaming can only be performed by slk rename. Both commands can only target one file/namespace at a time. Wildcards are not supported.
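Because wildcards are not supported, moving several files requires one slk move call per file. A minimal sketch of such a loop follows. It assumes that the argument order is slk move SOURCE DESTINATION_NAMESPACE; the function name slk_move_each is hypothetical.

```shell
# Sketch: move several files/namespaces into one destination namespace,
# one "slk move" call per source because wildcards are not supported.
# Assumption: the argument order is "slk move SOURCE DESTINATION_NAMESPACE".
slk_move_each() {
    destination="$1"
    shift
    for source_path in "$@"; do
        # stop at the first failing call and pass its exit code on
        slk move "${source_path}" "${destination}" || return $?
    done
}
```

Example call: slk_move_each /ex/am/ple/new /ex/am/ple/old/a.nc /ex/am/ple/old/b.nc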

slk archive compares file size and timestamp prior to overwriting files

slk archive compares file size and timestamp to decide whether to overwrite a file or not. rsync does it the same way. There might be rare situations when an archived file should be overwritten by another file with the same name, size and timestamp: this would fail.

Availability of archived data and modified metadata might be delayed by a few seconds

StrongLink is a distributed system. Metadata is stored in a distributed metadata database. Some operations might take a few seconds until their results are visible because they have to be synchronized amongst different nodes.

Please wait a few seconds before you retrieve a file that was just archived.

A file listed by slk list is not necessarily available for retrieval yet

The location, name and size of a file are metadata. These metadata are written into the StrongLink metadata database when an archival process starts. slk list only prints metadata. Hence, if slk list lists a file, which is e.g. part of a file set currently uploaded in a batch job, this file is not necessarily fully uploaded yet. Similarly, aborted slk archive calls can produce a file’s metadata entry without correct data. Such a file can be retrieved without error. Please see failed or canceled slk archive and slk retrieve calls leave file fragments for details on file fragments.

failed or canceled slk archive and slk retrieve calls leave file fragments

issues during archival

A file fragment remains in StrongLink if slk archive did not terminate properly during an archival process. Metadata is available for this file fragment and it can be retrieved. It has no checksum. The latter is due to the fact that some metadata – like checksums – will be written after the archival process has finished successfully. The existence of checksums can be checked via slk_helpers checksum GNS_PATH. In the case of netCDF files, the header section might be copied properly. Thus, an ncdump -h might be successfully applied on a file fragment.

These fragments might occur when a user aborts slk archive (CTRL + C), an SSH connection breaks or a SLURM job is killed due to a timeout. More than one file might be affected because multiple files can be archived in parallel.
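The checksum check mentioned above can be scripted. The sketch below relies on an assumption that is not confirmed by this page: that slk_helpers checksum exits with a non-zero code or prints nothing as long as no checksum is stored. The function name has_checksum is hypothetical.

```shell
# Sketch: treat an archived file as complete only if StrongLink reports a checksum.
# Assumption (not confirmed by this page): "slk_helpers checksum" exits non-zero
# or prints nothing while no checksum is stored.
has_checksum() {
    checksum_output=$(slk_helpers checksum "$1" 2>/dev/null) || return 1
    [ -n "${checksum_output}" ]
}
```

Usage could look like has_checksum /ex/am/ple/blub/file.nc || echo "possible file fragment".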

issues during retrieval

If slk retrieve does not terminate properly during a retrieval process, a file fragment might be created. These file fragments have temporary file names that contain the original FILENAME: ~FILENAME14620203101828317173.slkretrieve. The reasons for improper termination of slk retrieve are the same as for slk archive. More than one file might be affected because multiple files can be retrieved in parallel.
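Leftover fragments can be searched for by their file name. The following sketch assumes that all fragments follow the naming pattern shown above (leading ~ and trailing .slkretrieve); find_retrieve_fragments is a hypothetical helper name.

```shell
# Sketch: list leftover retrieval fragments below a directory (default: .).
# Assumption: fragments are named like ~FILENAME<digits>.slkretrieve.
find_retrieve_fragments() {
    find "${1:-.}" -type f -name '~*.slkretrieve'
}
```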

Commonly, a file was correctly retrieved when it has its original filename and when the exit code of slk retrieve is 0 (echo $? directly after retrieval). To be 100% sure that the file was correctly retrieved, you can compare the checksum of the retrieved file with the checksum stored in StrongLink. If there is no checksum stored in StrongLink, the source file is already incomplete.
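The local half of this comparison can be sketched as follows. The example assumes that the checksum stored in StrongLink is a SHA-512 checksum and that sha512sum (GNU coreutils) is available; the expected value has to be obtained from StrongLink separately. verify_sha512 is a hypothetical helper name.

```shell
# Sketch: compare the SHA-512 checksum of a local file with an expected value.
# Assumptions: the stored checksum is of type SHA-512; sha512sum is installed.
verify_sha512() {
    local_file="$1"
    expected="$2"
    # sha512sum prints "<checksum>  <file>"; keep only the checksum
    actual=$(sha512sum "${local_file}" | cut -d ' ' -f 1)
    [ "${actual}" = "${expected}" ]
}
```

The function returns 0 if both checksums are equal and a non-zero code otherwise, so it can be used directly in an if statement.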

Pagination mode of slk list

When slk list is used in interactive mode without piping its output into another command, it prints its output in “pagination mode”. This means that only 25 results are printed “per page” and the user has to “turn the page” manually by pressing Return/Enter. Turning a page back is not possible. Even if there are fewer than 25 results, pagination mode is entered and the user has to press Return/Enter to leave it. When pagination mode is left in the regular way, the terminal is cleared as CTRL + L does. This behaviour is by design and cannot be changed. If you want to prevent the terminal from being cleared or do not want to browse through 30 pages, abort slk list with CTRL + C. We recommend using slk list in combination with cat, less, more or similar tools in order to avoid the pagination mode. Below you will find an example.

Please note that the output of slk list NAMESPACE and slk list NAMESPACE | cat differs in the last line. This might be important when you create scripts around slk list.

slk list in pagination mode:

$ slk list /k204221_test
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  20210624_test
drwxrwxrwx- k204221     bm0146                 25 Jun 2021  20210625_test
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  abc
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  blubber
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  defg
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  memory_issue_testing
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data_b
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  test_20210617
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test_20210622
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  testing
Files 1-12 of 12

Avoid pagination mode of slk list:

$ slk list /k204221_test | cat
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  20210624_test
drwxrwxrwx- k204221     bm0146                 25 Jun 2021  20210625_test
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  abc
drwxrwxrwx- k204221     bm0146                 24 Jun 2021  blubber
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  defg
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  memory_issue_testing
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  sbds_test_data_b
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  test_20210617
drwxrwxrwx- k204221     bm0146                 22 Jun 2021  test_20210622
drwxrwxrwx- k204221     ka1209                 22 Jun 2021  testing
Files: 12
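For scripts that only need the listed entries, the summary line can be filtered out. The sketch below covers both forms of the last line shown in the two examples above; strip_list_summary is a hypothetical helper name.

```shell
# Sketch: remove the summary line from "slk list" output read on stdin,
# covering both forms shown above ("Files: 12" and "Files 1-12 of 12").
strip_list_summary() {
    grep -Ev '^Files(:| [0-9]+-[0-9]+ of) [0-9]+$'
}
```

Usage: slk list /k204221_test | strip_list_summary. Because the output is piped into another command, pagination mode is avoided as well. Note that grep returns exit code 1 if no line is left after filtering.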

slk tag cannot be applied on individual files

slk tag cannot be applied on individual files but only on namespaces. If it is applied on a namespace, all files in this namespace are assigned the metadata provided in the slk tag call. The namespace itself does not get any metadata assigned. If -R is set, all files in sub-namespaces are assigned the metadata as well.

slk does not have a --version flag

Instead, it has a version command: slk version

Update interval of progress bars (slk archive, group, owner, retrieve, tag)

Progress bars are updated per file or per block of n files. If you archive a folder with three files of 99 GB, 550 MB and 450 MB, the progress bar will not change for about 99% of the archival time, while the 99 GB file is being archived, and will then jump from 0% to 99%. If you tag a few files, the progress bar will remain at 0% for a long time and suddenly jump to 100%.

Using slk list to print search results

slk list prints only the file names, regardless of whether it prints the content of a namespace or the result of a search. However, a search might find files in arbitrary namespaces. Thus, it would be helpful to print the path/namespace of each file when search results are listed. This is not the case. Currently, you cannot find out in which namespace(s) your search results are located.

slk performance on different node types

We suggest running slk archive and slk retrieve on the mistralpp and compute/compute2 nodes. The run time on the mistralpp nodes depends considerably on the activity of other users on these nodes.

Please do not run slk archive and retrieve on the mistral login nodes (mlogin10X) when you archive large amounts of data because slk causes high CPU load and uses much memory.

The available memory per job on the shared nodes is very low. Therefore, slk archive and slk retrieve are slower than on other nodes. The run time can be expected to be two to four times as long as on the mistralpp and compute/compute2 nodes.

group memberships of user updated on login

If a user is added to a new group/project, this information is not automatically passed to StrongLink. Instead, the user has to run slk login again. Background: StrongLink caches LDAP data of each user and only updates its cache on a new login.

slk retrieve does not overwrite files but creates duplicates

When a file already exists at the target location, slk retrieve retrieves a copy and inserts .DUPLICATE_FILENAME.[ID].[VERSION] between the name and the extension of the file. However, slk retrieve will overwrite these DUPLICATE files without warning. Consecutive retrievals will overwrite such a file even if it has been modified.

VERSION indicates the file version in StrongLink. If you modify a file and archive it a second time, the version will be incremented by one. Commonly, the version is not visible to you. Old file versions are not kept. Metadata of old versions is partly kept.

Do not archive such a DUPLICATE file because it might overwrite itself during retrieval.
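Before a directory is archived, it can be checked for such files. A minimal sketch, assuming the naming scheme described above; list_duplicate_files is a hypothetical helper name.

```shell
# Sketch: print DUPLICATE files below a directory (default: .) so that they can
# be renamed or removed before archival. Returns 0 only if any were found.
list_duplicate_files() {
    found=$(find "${1:-.}" -type f -name '*.DUPLICATE_FILENAME.*')
    [ -n "${found}" ] && printf '%s\n' "${found}"
}
```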

“slk retrieve /source/ /target” and “slk retrieve /source /target” are not the same

slk retrieve works the same as rsync with respect to a / appended to the source path.

With / appended to the source path:

$ ls /ex/am/ple/bm0146/k20422/dm/retrieve_us
test.txt

$ slk retrieve -R /ex/am/ple/bm0146/k20422/dm/retrieve_us/ .
...

$ ls .
test.txt

Without / in the end of the source path:

$ ls /ex/am/ple/bm0146/k20422/dm/retrieve_us
test.txt

$ slk retrieve -R /ex/am/ple/bm0146/k20422/dm/retrieve_us .
...

$ ls .
retrieve_us

$ ls ./retrieve_us
test.txt
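The rule illustrated by the two examples can be expressed as a small helper that prints the local directory in which the content of the source namespace ends up. This is a sketch that only mirrors the behaviour shown above and does not call slk; retrieve_target_dir is a hypothetical name.

```shell
# Sketch: print the local directory in which the content of SOURCE ends up,
# following the trailing-slash rule illustrated above.
retrieve_target_dir() {
    source_path="$1"
    target_dir="$2"
    case "${source_path}" in
        # trailing slash: content is placed directly in the target directory
        */) printf '%s\n' "${target_dir}" ;;
        # no trailing slash: a directory named like the namespace is created
        *)  printf '%s/%s\n' "${target_dir}" "$(basename "${source_path}")" ;;
    esac
}
```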

slk group and slk owner do not print visible error messages when they fail

Note

slk group and slk owner are currently deactivated.

When slk group and slk owner are recursively applied to a folder with many files in it, the slk commands already start modifying the first files while StrongLink is still collecting files. The progress bar will show 99% to 100% the whole time while the file count rises:

$ slk group -R 200524 /ex/am/ple/bm0146/k20422/dm/group_example
[========================================|] 100% complete. Files changed: 10/10, [150M/150M].
[========================================|] 100% complete. Files changed: 11/11, [152M/152M].
[========================================|] 100% complete. Files changed: 19/19, [214M/214M].
...

If some files cannot be modified, this is indicated as follows:

$ slk group -R 200524 /ex/am/ple/bm0146/k20422/dm/group_example
[========================================|] 100% complete. Files changed: 15426/15583, [7.9T/8.0T]. Files failed: 157.

However, when slk group finishes, we do not know whether all files were processed or whether slk group was stopped partway (see next example):

$ slk group -R 200524 /k204221_test/testing/stability_20211012_size_500mb_40
[========================================|] 100% complete. Files changed: 15426/15583, [7.9T/8.0T]. Files failed: 157.
$ slk group -R 200524 /k204221_test/testing/stability_20211012_size_500mb_40
[=======================================/] 100% complete. Files changed: 31204/31227, [16.0T/16.1T]. Files failed: 157.

Both slk group calls were applied to the same folder. Therefore, the number of modified files should be the same, but it is not. The reason for this discrepancy is that the first slk group command stopped with exit code 1 after 15583 files. Hence, it is important either to capture the exit code of slk group or to have a look into the slk log (~/.slk/slk-cli.log) afterwards.

slk archive might create namespaces with “.” and “..” as names but slk retrieve interprets them

. and .. will be considered as normal names of namespaces in StrongLink. slk move and slk rename prevent the usage of . and .. (and moving into these). However, slk archive does not prevent this yet. The examples below should clarify this.

When namespaces with names . and .. are retrieved, these names are interpreted by the shell.

# create source data
$ mkdir none dot
$ echo "none" > none/a.txt
$ echo "." > dot/a.txt

# archival
$ slk archive none/a.txt /ex/am/ple/
[========================================\] 100% complete. Files archived: 1/1, [5B/5B].
$ slk archive dot/a.txt /ex/am/ple/.
[========================================-] 100% complete. Files archived: 1/1, [2B/2B].

# see what was archived
$ slk list /ex/am/ple | cat
drwxrwx---- stronglink  group0                 10 Nov 2021  .
-rw-r--r--- stronglink  group0             5   10 Nov 2021  a.txt
Files: 2
$ slk list /ex/am/ple/. | cat
-rw-r--r--- stronglink  group0             2   10 Nov 2021  a.txt

# retrieve top folder recursively
$ slk retrieve -R /ex/am/ple retr_overwrite_20211109_a
[========================================|] 100% complete. Files retrieved: 2/2, [7B/7B].

# check what is there
$ ls -la retr_overwrite_20211109_a/overwrite_20211109_a/
total 9
drwxr-xr-x 2 k204221 bm0146 4096 Nov 10 00:12 .
drwxr-xr-x 3 k204221 bm0146 4096 Nov 10 00:12 ..
-rw------- 1 k204221 bm0146    2 Nov 10 00:12 a.DUPLICATE_FILENAME.52933184010.1.txt
-rw------- 1 k204221 bm0146    5 Nov 10 00:12 a.txt

slk bad_input returns exit code 0

slk BAD_INPUT (like slk acrhvie) prints the help and returns 0 as exit code. It is said to return exit code 0 because the help is printed successfully. However, it should be 1 or higher.
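Scripts that rely on the exit code can guard against mistyped commands themselves. The following sketch checks the command name before slk is called. The list of accepted commands only contains commands named on this page and is not a complete list of slk commands; slk_checked and its return code 2 are hypothetical choices.

```shell
# Sketch: refuse unknown slk commands before calling slk, because slk itself
# exits with 0 on bad input. The list only contains commands named on this page.
slk_checked() {
    case "$1" in
        archive|retrieve|list|tag|move|rename|group|owner|chmod|version|login)
            slk "$@" ;;
        *)
            echo "unknown slk command: $1" >&2
            return 2 ;;
    esac
}
```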

slk cannot handle a path with // (double slash)

slk does not substitute // by /. Instead, it creates or looks for a namespace with an empty string as name (// => / + empty string + /). Empty strings as names for namespaces are prohibited. Therefore, commands fail when there is a // in a file path.
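Double slashes often appear when paths are assembled from variables in scripts. A minimal sketch that collapses repeated slashes before a path is passed to slk; squeeze_slashes is a hypothetical helper name.

```shell
# Sketch: collapse every run of slashes in a path into a single slash.
squeeze_slashes() {
    printf '%s\n' "$1" | tr -s '/'
}
```

Example: slk list "$(squeeze_slashes "${base}/${sub}")" | cat, where base and sub are variables of your script.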

Filtering slk list results with “*”

use * to replace parts of the file name

This works fine:

$ slk list /ex/am/ple/\*.nc
...

$ slk list '/ex/am/ple/*.nc'
...

The user needs to prevent the * from being interpreted by the shell (bash, ksh, ...). This can be done with either of the two approaches above.

escape * to print the content of a namespace containing * in its name

Assuming we have a namespace with the name *, which is allowed, we might try this to list its content:

$ slk list '/ex/am/ple/*'
...

The quotes successfully prevent the shell from interpreting the *. However, when a * is in the path, slk list automatically goes into “filter mode”. This means that the content of the namespace /ex/am/ple is filtered for entries with the name *. Hence, we only get * printed and not its content.

using * to replace parts of namespace names

Using * to replace parts of the names of namespaces does not work. Example:

$ slk list /ex/am/ple/\*/\*.nc
...

$ slk list '/ex/am/ple/*/*.nc'
...

These two list commands will look for *.nc in /ex/am/ple and not in every sub-namespace of /ex/am/ple.
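If the names of the sub-namespaces are known, one possible workaround is to apply the file name pattern to each of them in a loop. This is a sketch only; list_pattern_in_namespaces is a hypothetical helper name and the namespaces have to be passed explicitly.

```shell
# Sketch: apply one file-name pattern to several explicitly named namespaces.
list_pattern_in_namespaces() {
    pattern="$1"
    shift
    for namespace in "$@"; do
        # the quotes keep the shell from expanding the * in the pattern;
        # "| cat" avoids the pagination mode of slk list
        slk list "${namespace%/}/${pattern}" | cat
    done
}
```

Example call: list_pattern_in_namespaces '*.nc' /ex/am/ple/a /ex/am/ple/b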

slk chmod -R modifies many more file permissions than it should

slk chmod -R creates a tree of all files and of all namespaces in which these files are located. It seems to iterate over the tree in a wrong way so that each file's permissions are modified not once but 2^[namespace_depth - 1] times.

example 1

$ echo "abc" > test.txt

$ slk archive test.txt /ex/am/ple/ex1/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z
[========================================/] 100% complete. Files archived: 1/1, [...].

# that's OK
$ slk chmod -R 755 /ex/am/ple/ex1/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z
[========================================\] 100% complete. Files changed: 1/1, [4B/4B].

# that's not OK
$ slk chmod -R 755 /ex/am/ple/ex1
^C                          ^C===========\] 100% complete. Files changed: 4431/4431, [...].

example 2

# archive five files into one parent namespace
$ slk archive *.nc /ex/am/ple/ex2a
[========================================/] 100% complete. Files archived: 5/5, [10.8K/10.8K].
$ slk chmod -R 755 /ex/am/ple/ex2a
[========================================\] 100% complete. Files changed: 6/6, [10.8K/10.8K].
# ================>>>>>>>>>>>> SIX RESOURCES MODIFIED <<<<<<<<<<<<================

# archive five files into three sub-namespaces:
$ slk archive *.nc /ex/am/ple/ex2b/d1/d2/d3
[========================================-] 100% complete. Files archived: 5/5, [10.8K/10.8K].
$ slk chmod -R 755 /ex/am/ple/ex2b
[========================================|] 100% complete. Files changed: 55/55, [86.3K/86.3K].
# ================>>>>>>>>>>>> FIFTY-FIVE RESOURCES MODIFIED <<<<<<<<<<<<================

example 3

echo "abc" > test.txt

# no subfolder; n=0
slk archive test.txt /ex/am/ple/test01
slk chmod -R 755 /ex/am/ple/test01
# => 2 resources (1x file, 1x namespace)

# subfolder; n=1
slk archive test.txt /ex/am/ple/test01/test02
slk chmod -R 755 /ex/am/ple/test01
# => 5 resources (2x same file, 3x namespaces: 2x test02, 1x test01)

# subfolder in subfolder; n=2
slk archive test.txt /ex/am/ple/test01/test02/test03
slk chmod -R 755 /ex/am/ple/test01
# => 11 resources (4x same file, 7x namespaces: 4x test03, 2x test02, 1x test01)

# subfolder in subfolder in subfolder; n=3
slk archive test.txt /ex/am/ple/test01/test02/test03/test04
slk chmod -R 755 /ex/am/ple/test01
# => 23 resources (8x same file, 15x namespaces: 8x test04, 4x test03, 2x test02, 1x test01)

# ... n ...
...
# => 2^n * FILES + 2^(n+1) - 1 resources (2^n times each file; 2^(n+1) - 1 namespaces)
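The formula from example 3 can be evaluated with shell arithmetic to estimate how many resources a recursive slk chmod will report. The sketch below assumes, as in the examples, that all files are located in one namespace n levels below the namespace that slk chmod -R is applied to; chmod_resource_count is a hypothetical helper name.

```shell
# Sketch: number of resources "slk chmod -R" reports for FILES files stored
# n namespace levels below the namespace it is applied to:
#   2^n * FILES + 2^(n+1) - 1
chmod_resource_count() {
    files="$1"
    n="$2"
    echo $(( (1 << n) * files + (1 << (n + 1)) - 1 ))
}
```

For the second case of example 2 (five files, n=3), chmod_resource_count 5 3 prints 55, which matches the number reported by slk chmod -R there.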