Known Issues (read this!)#

file version: 16 Jan 2023

current software versions: slk version 3.3.81; slk_helpers version 1.7.1

slk issues on Levante#

slk/slk_helpers terminate directly after start with “Name or service not known” or “Unhandled error occurred”#

The error which slk prints to the command line is:

ERROR: Unhandled error occurred, please check logs

However, if you take a look into the slk log (~/.slk/slk-cli.log) then you will find the actual error:

2022-04-07 21:45:30 ERROR archive.dkrz.de: Name or service not known

The slk_helpers print this message directly to the command line:

archive.dkrz.de: Name or service not known

The error is the same.

Reason#

slk / slk_helpers do not get any reply from StrongLink. On Levante, the routing to the StrongLink constellation seems to fail from time to time.

Solution#

You cannot do anything about this except run slk again a few seconds later.
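Since the error is transient, the retry can be scripted. A minimal sketch (the retry count and sleep interval are arbitrary choices, not DKRZ recommendations):

```shell
# Retry a command a few times because the "Name or service not known"
# error is transient; the pause between attempts (in seconds) can be
# overridden via RETRY_SLEEP.
retry_slk() {
    local attempt
    for attempt in 1 2 3; do
        "$@" && return 0                    # success: stop retrying
        echo "attempt ${attempt} failed; retrying" >&2
        sleep "${RETRY_SLEEP:-10}"
    done
    return 1                                # all attempts failed
}

# usage, e.g.:
# retry_slk slk list /arch/bm0146
```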

slk archive/retrieve may use much memory and CPU time – careful with parallel slk calls on one node#

slk archive and slk retrieve are fast but memory hungry. Additionally, these commands run many threads in parallel when large files or many files are transferred. As a rule of thumb, 6 GB of memory should be assumed for each archival and retrieval call. Be aware that on most Levante nodes, 2 GB of memory is allocated to each physical CPU core. This limit is strictly enforced by the operating system, and processes exceeding the allowed memory usage will be killed. Therefore, 6 GB of memory should be allocated in batch jobs via --mem=6GB.

Running many slk archive or slk retrieve calls in parallel in one job (a) occupies a lot of memory and (b) causes issues in the thread management. Additionally, the transfer speed might not improve if many archivals or retrievals run in parallel on one node due to hardware limitations. Please consider aggregating several retrievals via a search and retrieving the search results (see Aggregate file retrievals).
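In a batch job, the memory requirement from above can be requested like this (partition, account and paths are placeholders, adjust them to your project):

```shell
#!/bin/bash
#SBATCH --partition=shared     # shared/interactive are recommended for slk
#SBATCH --ntasks=1
#SBATCH --mem=6GB              # rule of thumb: 6 GB per slk archive/retrieve
#SBATCH --account=xz0123       # placeholder project account
#SBATCH --time=08:00:00

module load slk
slk retrieve /arch/ex/am/ple/file.nc /work/xz0123/output
```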

slk archive/retrieve is killed#

If you receive this output

/sw/spack-levante/slk-3.3.21-5xnsgp/bin/slk: line 16: 3673728 Killed

then the slk call was killed because its RAM usage was too high. Please see slk archive/retrieve may use much memory and CPU time – careful with parallel slk calls on one node for details.

slk is hanging / unresponsive#

A few issues might cause a slk call to hang. However, a long run time does not necessarily mean that slk hangs: slk retrieve and slk recall might idle for a long time if many files have to be copied from tape at the same time.

reason: Lustre file system is hanging#

Please check whether /home is hanging. If /home is hanging, slk cannot access its login token and cannot write into its log.

reason: slk retrieve does not hang but the tape recall takes very long#

When many retrieve/recall requests of files from tape are processed, the individual calls of slk retrieve might take longer than normal because the tape requests are queued. In this situation, slk retrieve might look like it is hanging. Instead, it is waiting for files to be copied from tape to the HSM cache.

reason: one or more source files have 0 byte size#

Please check whether you are archiving a file of 0 Byte size. slk archive and slk retrieve hang when such a file is archived or retrieved, respectively.
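Such files can be found beforehand, e.g. with find:

```shell
# Print all files of 0 byte size below a directory. slk archive hangs
# on such files, so remove or exclude them before archiving.
find_empty_files() {
    find "$1" -type f -size 0
}

# usage, e.g.:
# find_empty_files /work/project/user/data
```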

Non-recursive is semi-recursive!?#

If a namespace/directory (not a file) is given as input to the command slk tag, all files directly in this namespace/directory are affected but no files in sub-directories. When -R is set, all sub-namespaces are affected as well.

slk writes no output in non-interactive mode#

All slk commands except for slk list do not print output to the stdout and stderr streams (i.e. command line output) when they run in non-interactive mode, e.g. in SLURM jobs. In the case of slk archive you can set -vv in order to print a list of processed files to the terminal / SLURM log. In all other situations when slk is used in batch jobs, please catch the exit codes of your slk calls and check whether they are equal to 0. If not, an error occurred or files were skipped. Details on the error can be found in the slk log file ~/.slk/slk-cli.log. However, when you run many slk commands in parallel, the slk log becomes hard to read. Please print the hostname and a time stamp (e.g. via date) into the SLURM log when an error occurred, so that you can find the details in the slk log later on.

The exit code of the previous program call is stored in $?. Example:

$ slk archive /work/project/user/data /ex/am/ple/blub
...
$ echo $?
0
# or 1 or higher

In a bash/batch script it could look like this:

# ...
slk archive /work/project/user/data /ex/am/ple/blub
exit_code=$?

# print exit code with prefix so that it is easy to `grep`
echo "exit code: ${exit_code}"
if [ ${exit_code} -ne 0 ]; then
    # print date and hostname for finding the details in the slk log
    date
    hostname
fi

slk never writes to stderr#

Error messages of slk are written to the stdout stream instead of the stderr stream. If slk were writing output in non-interactive mode (it does not!), you would find all error output in the SLURM stdout (not stderr) file.

slk move cannot rename files#

The Linux command mv can move and rename files. slk move can only move files/namespaces from one namespace to another. Renaming can only be performed by slk rename. Both commands can only target one file/namespace at a time. Wildcards are not supported.

slk archive compares file size and timestamp prior to overwriting files#

slk archive compares file size and timestamp to decide whether to overwrite a file or not. rsync behaves the same way. There might be rare situations in which an archived file should be overwritten by another file with the same name, size and timestamp: this would fail.
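If such a file really has to be overwritten, updating its modification time first should make the comparison fail and trigger the transfer. This is a sketch derived from the comparison rule above, not a documented slk feature:

```shell
# Bump the modification time so that the size/timestamp pair no longer
# matches the archived copy; the next slk archive then overwrites it.
force_rearchive() {
    touch "$1"
    # slk archive "$1" /ex/am/ple/blub     # then re-run the archival
}
```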

Availability of archived data and modified metadata is delayed by a few seconds#

StrongLink is a distributed system. Metadata is stored in a distributed metadata database. Some operations might take a few seconds until their results are visible because they have to be synchronized amongst different nodes.

Please wait a few seconds before you retrieve a file that was just archived.

A file listed by slk list is not necessarily available for retrieval yet#

The location, name and size of a file are metadata. These metadata are written into the StrongLink metadata database when an archival process starts. slk list only prints metadata. Hence, a file which is currently transferred by slk archive will be listed by slk list even before it has been fully transferred. Similarly, aborted slk archive calls can produce a file’s metadata entry without correct data. Such a file can be retrieved without error. Please see failed or canceled slk archive and slk retrieve calls leave file fragments for details on file fragments.

failed/canceled slk archive/retrieve calls leave file fragments#

issues during archival#

A file fragment remains in StrongLink if slk archive did not terminate properly during an archival process. Metadata is available for this file fragment and it can be retrieved. In most cases such a file has no checksum: if a file has no checksum a few days after archival, it is incomplete. However, if a file has a checksum, it is not necessarily complete. The checksum of a file can be obtained via slk_helpers checksum GNS_PATH. It is strongly recommended to verify the checksums of all archived files whenever the corresponding slk archive call was terminated prematurely. In the case of netCDF files, the header section might have been copied properly; thus, an ncdump -h might succeed on a file fragment.

These fragments might occur when a user aborts slk archive (CTRL + C), an ssh connection breaks or a SLURM job is killed due to a timeout. More than one file might be affected because multiple files can be archived in parallel.
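A sketch of such a verification; the checksum type is an assumption here (sha512 is used as an example), please check which type slk_helpers checksum actually reports and adjust accordingly:

```shell
# Compare StrongLink's stored checksum with a locally computed one.
# ASSUMPTION: the checksum type is sha512; adjust if `slk_helpers
# checksum` reports a different type.
verify_archived_file() {
    echo "StrongLink checksum of $1:"
    slk_helpers checksum "$1"
    echo "local sha512 checksum of $2:"
    sha512sum "$2" | cut -d ' ' -f 1
}

# usage, e.g.:
# verify_archived_file /ex/am/ple/blub/file.nc /work/project/user/data/file.nc
```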

slk does not have a --version flag#

Instead, it has a version command: slk version

slk performance on different node types#

We recommend running slk archive and slk retrieve via the shared and interactive partitions – and in special cases on compute.

On login nodes: only small files should be archived so that these nodes are not slowed down by slk; slk retrieve can only retrieve one file at a time.

On shared nodes, the available I/O bandwidth might be a limiting factor for the transfer rate of slk archive and slk retrieve. Therefore, the transfer rate might be higher on exclusive nodes.

group memberships of a user are only updated on login#

If a user is added to a new group/project, this information is not automatically passed to StrongLink. Instead, the user has to run slk login again. Background: StrongLink caches LDAP data of each user and only updates its cache on a new login.

“slk retrieve /source/ /target” and “slk retrieve /source /target” are not the same#

slk retrieve works the same as rsync with respect to a / appended to the source path.

If / is appended to the source path then only the folder’s content is transferred to the target location; the folder itself is not created there.

$ ls /ex/am/ple/bm0146/k20422/dm/retrieve_us
test.txt

$ slk retrieve -R /ex/am/ple/bm0146/k20422/dm/retrieve_us/ .
...

$ ls .
test.txt

If no / is appended to the source path then the source folder is created at the target location and the folder’s content is transferred into this created folder.

$ ls /ex/am/ple/bm0146/k20422/dm/retrieve_us
test.txt

$ slk retrieve -R /ex/am/ple/bm0146/k20422/dm/retrieve_us .
...

$ ls .
retrieve_us

$ ls ./retrieve_us
test.txt

Filtering slk list results with “*”#

Bash wildcards/globs partly work with slk; slk list understands *.

use * to replace parts of the file name#

This works fine:

$ slk list /ex/am/ple/\*.nc
...

$ slk list '/ex/am/ple/*.nc'
...

The user needs to prevent the * from being interpreted by bash/ksh/… . This can be done with either of the two approaches above.

escape * to print the content of a namespace containing * in its name#

Assuming we have a namespace with the name *, which is allowed, we might try this to list its content:

$ slk list '/ex/am/ple/\*'
...

This successfully prevents slk list from interpreting the * as a wildcard. However, when a * is present in the path, slk list still automatically switches into “filter mode”: the content of the namespace /ex/am/ple is filtered for entries named *. Hence, we just get the namespace * itself printed and not its content.

using * to replace parts of namespace names#

Using * to replace parts of the names of namespaces does not work. However, the full name of a namespace can be replaced by *. Example:

$ slk list /ex/am/ple
drwxrwxrwx- k204221  ka1209             0   30 Nov 2022 10:52 aa11
drwxrwxrwx- k204221  ka1209             0   30 Nov 2022 10:52 aa22
drwxrwxrwx- k204221  ka1209             0   30 Nov 2022 10:52 bb00

$ slk list /ex/am/ple/\*/\*.nc
...

$ slk list '/ex/am/ple/*/*.nc'
...

These last two list commands will look for *.nc in every sub-namespace of /ex/am/ple. However, the following list command will not work:

$ slk list '/ex/am/ple/a*/*.nc'
ERROR: Cannot run "list /ex/am/ple/a*/*.nc": file or directory named '*.nc' was not found.

How to search non-recursively in a namespace#

By default, slk search searches recursively in a namespace provided via path:

slk search '{"path":{"$gte":"/arch/bm0146/k204221/test_files"}}'

It is not possible to use the $eq operator in this context. Instead, another key-value pair "$max_depth": 1 has to be inserted as follows:

slk search '{"path":{"$gte":"/arch/bm0146/k204221/test_files", "$max_depth":1}}'

Alternatively, you can obtain the object id of the particular namespace via slk_helpers exists and then use it in your search query as the value for the search field resources.parent_id (see the example code in slk Usage Examples).
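A sketch of the resources.parent_id approach; the assumption here is that the field takes the numeric object id of the namespace, which first has to be looked up via slk_helpers exists:

```shell
# Build an RQL query that matches only direct children of one namespace.
# ASSUMPTION: resources.parent_id takes the numeric object id of the
# namespace, obtained beforehand via `slk_helpers exists`.
build_parent_id_query() {
    printf '{"resources.parent_id": {"$eq": %s}}' "$1"
}

# usage, e.g. (12345 is a placeholder id):
# slk search "$(build_parent_id_query 12345)"
```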

Terminal cursor disappears if slk command with progress bar is canceled#

If a slk command with a progress bar is canceled by the user, the shell cursor might disappear. It can be made visible again by (a) running reset or (b) starting vim and quitting it directly (:q!).

error “conflict with jdk/…” when the slk module is loaded#

slk needs a specific Java version that is automatically loaded with slk. Having other Java versions loaded in parallel might cause unwanted side effects. Therefore, the system throws an error message and aborts.

slk needs at least Java version 13#

You might encounter an error like this:

$ slk list 12
CLI tools require Java 13 (found 1)

slk needs a specific Java version. This Java version is automatically loaded when the slk module is loaded. If you have another Java version loaded explicitly, please unload it prior to loading the slk module. If you have already loaded slk, please: (1) unload slk, (2) unload all Java modules and (3) load slk again.
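The reload sequence as commands; the Java module name jdk is taken from the conflict message above, adjust it if your loaded module differs:

```shell
module unload slk     # (1) unload slk
module unload jdk     # (2) unload explicitly loaded Java modules
module load slk       # (3) load slk again; pulls in the required Java
slk version           # verify that slk works now
```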

slk search yields RQL parse error#

ERROR: Search failed. Reason: RQL parse error: No period found in collection field name ().

Either: please use ' around your search query instead of " to prevent operators starting with $ from being evaluated as bash variables.

Or: please escape the $ of each query operator when you use " as delimiters of the query string.

Or: please check your JSON query carefully. It might be helpful to print the query in a human-readable way via echo QUERY | jq.
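The first two suggestions side by side; both lines produce the identical query string (the path is a placeholder):

```shell
# Single quotes: the shell passes the $gte operator through literally.
q1='{"path": {"$gte": "/ex/am/ple"}}'

# Double quotes: each $ of a query operator must be escaped, otherwise
# the shell expands $gte as a (most likely empty) variable and the
# RQL parser fails.
q2="{\"path\": {\"\$gte\": \"/ex/am/ple\"}}"

# Either variant can be passed to slk search, e.g.:
# slk search "$q1"
```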

slk login asks me to provide a hostname and/or a domain#

If you are asked for this information, the configuration is faulty. Please contact support@dkrz.de and tell us which machine you are working on.

Archival fails and Java NullPointerException in the log#

This error message is printed in the log:

2021-07-13 08:33:03 ERROR Unexpected exception
java.lang.NullPointerException: null
    at com.stronglink.slkcli.api.websocket.NodeThreadPools.getBestPool(NodeThreadPools.kt:28) ~[slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.archive.Archive.upload(Archive.kt:191) ~[slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.archive.Archive.uploadResource(Archive.kt:165) ~[slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.archive.Archive.archive(Archive.kt:77) [slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.SlkCliMain.run(SlkCliMain.kt:169) [slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.SlkCliMainKt.main(SlkCliMain.kt:103) [slk-cli-tools-3.1.62.jar:?]
2021-07-13 08:33:03 INFO

This error indicates an API issue. A reason might be that one or more StrongLink nodes went offline and the other nodes have not taken over their connections yet. Please notify support@dkrz.de if you experience this error.

slk ERROR: Unhandled error occurred, please check logs#

Please have a look into your slk log: ~/.slk/slk-cli.log.

slk archive: Exception …: lateinit property websocket has not been initialized#

Full error message on the command line:

Exception in thread "Thread-357" kotlin.UninitializedPropertyAccessException: lateinit property websocket has not been initialized
at com.stronglink.slkcli.queue.ArchiveWebsocketWorker.closeConnection(ArchiveWebsocketWorker.kt:146)
at com.stronglink.slkcli.queue.WebsocketWorker.run(WebsocketWorker.kt:67)

Error message in the log:

2022-03-01 13:50:28 ERROR Error in websocket worker
java.util.concurrent.CompletionException: java.net.http.WebSocketHandshakeException
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) ~[?:?]

Reason#

Probably, slk archive was run with --streams 10 or a similarly high number like --streams 16 or --streams 32.

Solution#

Please use slk archive --streams N with a maximum value of 4 for N. Transfer rates of 1 to 2 GB/s are possible with this configuration when the system is not busy.

slk delete failed, but nevertheless file was deleted#

Issue description#

We run slk delete /abc/def/ghi.txt but slk delete fails for an unknown reason. Repeated calls of slk delete /abc/def/ghi.txt fail because the target file does not exist anymore.

Reason#

The reason has not been fully identified yet. The most probable explanation: when slk delete sends a deletion request to StrongLink, it waits a certain time for the response of the StrongLink instance. If the response does not arrive, or if the reply of another confirmation step does not return in time (= timeout), slk assumes that the command failed, even though the deletion was actually performed.

Solution#

Please carefully check whether files were actually deleted when a slk delete did not finish successfully.
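Such a check can be scripted with slk_helpers exists; the assumption here is that the command returns exit code 0 if and only if the path exists:

```shell
# Report whether a file is still present in StrongLink after a
# seemingly failed slk delete.
# ASSUMPTION: `slk_helpers exists` returns 0 when the path exists.
check_deleted() {
    if slk_helpers exists "$1" > /dev/null 2>&1; then
        echo "still exists: $1"
    else
        echo "gone: $1"
    fi
}

# usage, e.g.:
# check_deleted /abc/def/ghi.txt
```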

slk list will take very long when many search results are found#

Issue description#

slk list SEARCH_ID seems to do nothing.

Reason#

slk list SEARCH_ID first collects all search results and then prints them. The run time of slk list scales linearly with the number of results (20 s to 60 s per 1000 results). Hence, if you want to print a list of 10000 files found by slk search, you might have to wait 5 minutes or more until the list is printed.

As an alternative to slk list, you can run slk_helpers list_search on the same SEARCH_ID, which continuously prints search results as they are collected.

Solution#

Please refine your search. The section Search files by metadata might help in this context.

slk retrieve returns exit code 1 if one or more files are skipped#

The current situation is this:

$ slk3.3.64 retrieve -R /dkrz_test/netcdf/20221215a z/ -s
[========================================|] 100% complete. Files retrieved: 4/4, [5.0K/5.0K].
$ echo $?
0

$ slk3.3.64 retrieve -R /dkrz_test/netcdf/20221215a z/ -s
[========================================\] 100% complete. Files retrieved: 0/4, [0B/5.0K]. Files skipped: 4.
$ echo $?
1

It should be this:

$ slk3.3.64 retrieve -R /dkrz_test/netcdf/20221215a z/ -s
[========================================|] 100% complete. Files retrieved: 4/4, [5.0K/5.0K].
$ echo $?
0

$ slk3.3.64 retrieve -R /dkrz_test/netcdf/20221215a z/ -s
[========================================\] 100% complete. Files retrieved: 0/4, [0B/5.0K]. Files skipped: 4.
$ echo $?
0