Known Issues (read this!)#

file version: 07 Dec 2023

current software versions: slk version 3.3.91; slk_helpers version 1.10.2; slk wrappers version 1.2.1

slk/slk_helpers terminate directly after start with “Name or service not known” or “Unhandled error occurred”#

The error that slk prints to the command line is:

ERROR: Unhandled error occurred, please check logs

However, if you take a look into the slk log (~/.slk/slk-cli.log) then you will find the actual error:

2022-04-07 21:45:30 ERROR archive.dkrz.de: Name or service not known

slk_helpers prints this message directly to the command line:

archive.dkrz.de: Name or service not known

The error is the same.

Reason#

slk / slk_helpers do not get any reply from StrongLink. On Levante, the routing to the StrongLink constellation does not seem to work from time to time.

Solution#

You cannot do anything about this except running slk again a few seconds later.
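A minimal retry sketch in bash (the target path is only a placeholder; adjust the command, the number of attempts and the waiting time to your needs):

# run the slk command up to three times with a short pause in between
for attempt in 1 2 3; do
    if slk list /ex/am/ple; then
        break
    fi
    echo "slk failed (attempt ${attempt}), retrying in 10 seconds ..."
    sleep 10
done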

slk list prints “Error getting namespace children for namespace:”#

You run slk list on a namespace/folder and see this:

$ slk list /arch/pd1309/forcings/reanalyses/ERA5/year1987/
Error getting namespace children for namespace: 49083768018

Reason#

A connection timeout to StrongLink occurred but slk list just prints a generic error. The correct error can be found in the slk log file (~/.slk/slk-cli.log):

2023-12-04 12:27:47 levante4.lvt.dkrz.de 2567653 ERROR Get namespace children error: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>

slk archive/retrieve may use much memory and CPU time – careful with parallel slk calls on one node#

slk archive and slk retrieve are fast but memory hungry. Additionally, these commands run many threads in parallel when large files or many files are transferred. As a rule of thumb, 6 GB of memory should be assumed for each archival and retrieval call. Be aware that on most Levante nodes, 2 GB of memory is allocated to each physical CPU core. This limit is strictly enforced by the operating system and processes exceeding the allowed memory usage will be killed. Therefore, 6 GB of memory should be allocated in batch jobs via --mem=6GB.
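A minimal sketch of a batch script that requests 6 GB of memory; job name, account, partition, time limit and paths are placeholders and have to be adjusted:

#!/bin/bash
# placeholders below: adjust job name, account, partition, time limit and paths
#SBATCH --job-name=slk_archive
#SBATCH --partition=shared
#SBATCH --account=xz0123
#SBATCH --mem=6GB
#SBATCH --time=08:00:00

module load slk
slk archive /work/project/user/data /ex/am/ple/blub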

Running many slk archive or slk retrieve in parallel in one job (a) occupies a lot of memory and (b) causes issues in the thread management. Additionally, the transfer speed might not improve if many archivals or retrievals run in parallel on one node due to hardware limitations. Please consider aggregating several retrievals via a search and retrieval of the search results (see Aggregate file retrievals).

slk archive/retrieve is killed#

If you receive this output

/sw/spack-levante/slk-3.3.21-5xnsgp/bin/slk: line 16: 3673728 Killed

then the slk call was killed because its RAM usage was too high. Please see slk archive/retrieve may use much memory and CPU time – careful with parallel slk calls on one node for details.

slk is hanging / unresponsive#

A few issues might cause an slk call to hang. However, a long run time does not necessarily mean that slk hangs: slk retrieve and slk recall might idle for a long period if many files have to be copied from tape at the same time.

reason: Lustre file system is hanging#

Please check whether /home is hanging. If /home is hanging, slk cannot access its login token and cannot write into its log.

reason: slk retrieve does not hang but the tape recall takes very long#

When many retrieve/recall requests of files from tape are processed, the individual calls of slk retrieve might take longer than normal because the tape requests are queued. In this situation, slk retrieve might look like it is hanging. Instead, it is waiting for files to be copied from tape to the HSM cache.

reason: one or more source files have 0 byte size#

Please check whether you are archiving a file of 0 byte size. slk archive and slk retrieve hang when such a file is archived or retrieved, respectively.
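To find such files before archiving, you can list all files of size 0 below the source directory (the path is a placeholder):

$ find /work/project/user/data -type f -size 0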

Non-recursive is semi-recursive!?#

If a namespace/directory (not a file) is given as input to the command slk tag, all files in this namespace/directory are affected but not the files in sub-directories. When -R is set, all sub-namespaces are also affected.

slk writes no output in non-interactive mode#

Many slk commands do not print output to the stdout and stderr streams (i.e. the command line output) but write their output into a buffer. Thus, the output is not captured when the commands are run in non-interactive mode – e.g. in SLURM jobs. In the case of slk archive you can set -vv in order to print a list of processed files to stdout (i.e. the terminal / SLURM log). In all other situations when slk is used in batch jobs, please catch the exit codes of your slk calls and check whether they are equal to 0. If not, an error occurred or files were skipped. Details on the error can be found in the slk log file ~/.slk/slk-cli.log. However, when you run many slk commands in parallel, the slk log becomes hard to read. Please print the hostname and a time stamp (e.g. via date) when the error occurred into the SLURM log to be able to find the details in the slk log later on. Printing the pid of the slk call does not help, because this is only the pid of a wrapper script around the actual slk process.

The exit code of the previous program call is stored in $?. Example:

$ slk archive /work/project/user/data /ex/am/ple/blub
...
$ echo $?
0
# or 1 or higher

In a bash/batch script it could look like this:

# ...
slk archive /work/project/user/data /ex/am/ple/blub
exit_code=$?

# print exit code with prefix so that it is easy to `grep`
echo "exit code: ${exit_code}"
if [ ${exit_code} -ne 0 ]; then
    # print date and hostname so that the error can be located in ~/.slk/slk-cli.log
    date
    hostname
fi

slk never writes to stderr#

Error messages of slk are written to the stdout stream instead of the stderr stream. If slk wrote output in non-interactive mode (which it does not), you would find all error output in the SLURM stdout (not stderr) file.

slk move cannot rename files#

The Linux command mv can move and rename files. slk move can only move files/namespaces from one namespace to another namespace. Renaming can only be performed by slk rename. Both commands can only target one file/namespace at a time. Wildcards are not supported.
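Hedged examples (paths and names are placeholders; please check slk move --help and slk rename --help for the exact syntax):

# move a file into another namespace; the file keeps its name
$ slk move /ex/am/ple/file.nc /ex/am/other

# rename a file within its namespace
$ slk rename /ex/am/ple/file.nc file_new_name.nc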

slk archive compares file size and timestamp prior to overwriting files#

slk archive compares file size and timestamp to decide whether to overwrite a file or not – rsync behaves the same way. There might be rare situations in which an archived file should be overwritten by another file with the same name, size and timestamp: this would fail.

Availability of archived data and modified metadata is delayed by a few seconds#

StrongLink is a distributed system. Metadata is stored in a distributed metadata database. Some operations might take a few seconds until their results are visible because they have to be synchronized amongst different nodes.

Please wait a few seconds before you retrieve a file that was just archived.

A file listed by slk list is not necessarily available for retrieval yet#

The location, name and size of a file are metadata. These metadata are written into the StrongLink metadata database when an archival process starts. slk list only prints metadata. Hence, a file which is currently transferred by slk archive will be listed by slk list even before it has been fully transferred. Similarly, aborted slk archive calls can produce a file’s metadata entry without correct data. Such a file can be retrieved without error. Please see failed/canceled slk archive calls leave incomplete files for details.

failed/canceled slk archive calls leave incomplete files#

Incomplete/partial files may remain in StrongLink if slk archive was interrupted during an archival process. More than one file might be affected because multiple files can be archived in parallel. These files are denoted as partial files by StrongLink. Metadata is available for the partial files and they are listed by slk list. In most cases these files do not have checksums. Reasons for interruption might be:

  • user aborts slk archive – e.g. via CTRL + C

  • an ssh connection breaks

  • a SLURM job is killed due to a timeout.

  • a too large amount of data is archived at once causing slk archive to fail (i.e. more than 5 to 10 TB)

Assume that the archival into this destination namespace failed: /arch/ab1234/c567890/test. There might be (a) partial files and (b) files with a ‘partial’ flag in this namespace. The partial files (a) have only been partly archived and should be considered as corrupted. These files are flagged as partial (b). However, completely archived files might also be flagged as partial when slk archive failed shortly after they were completely archived. Files which are flagged as partial cannot be retrieved. Therefore, the flag should be removed from all completely archived files.

Solution#

The partial files can be identified by a verify job. It checks whether the actual file size matches the expected file size which was stored in StrongLink during the initialization of the archival process. The files which are flagged as partial can be identified by slk_helpers has_no_flag_partial.

Run a verify job to find partial files#

Please check whether there are partial files in /arch/ab1234/c567890/test. This is done by starting a verify job and collecting its results. For details please check the section Verify file size on page Archivals to tape.

# submit a verify job for the destination folder
$ slk_helpers submit_verify_job /arch/ab1234/c567890/test -R
Submitting up to 1 verify job(s) based on results of search id 576002:
search results: pages 1 to 1 of 1; visible search results: 10; submitted verify job: 176395
Number of submitted verify jobs: 1

# ... after some time ...
# check if the job finished => status "COMPLETED"
$ slk_helpers job_status 176395
COMPLETED

# collect the results
$ slk_helpers result_verify_job 176395
Errors:
Resource content size does not match record: /arch/ab1234/c567890/test/file_001gb_b.nc
Resource content size does not match record: /arch/ab1234/c567890/test/file_001gb_f.nc
Erroneous files: 2

The two files file_001gb_b.nc and file_001gb_f.nc are partial files. They should be re-archived (the partial files are overwritten automatically) or deleted.
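A sketch of the re-archival; the source path is a placeholder and has to point to the original copies of the files:

$ slk archive /work/project/user/data/file_001gb_b.nc /arch/ab1234/c567890/test
$ slk archive /work/project/user/data/file_001gb_f.nc /arch/ab1234/c567890/test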

Run ‘has_no_flag_partial’ to find files flagged as ‘partial’#

Please run the command slk_helpers has_no_flag_partial with the parameter -v to get a list of all flagged files. If -v is not set, the command only checks whether no file or at least one file has this flag set. -vv prints the status of all targeted files.

$ slk_helpers has_no_flag_partial -v -R /arch/ab1234/c567890/test
/arch/ab1234/c567890/test/file_001gb_b.nc has partial flag
/arch/ab1234/c567890/test/file_001gb_e.nc has partial flag
/arch/ab1234/c567890/test/file_001gb_f.nc has partial flag
/arch/ab1234/c567890/test/file_001gb_i.nc has partial flag
Number of files without partial flag: 8/12

Thus, 4 of 12 files are flagged as partial files. Two of these were identified as actually partial files (see above; *_b.nc and *_f.nc). They will probably lose their flag after re-archival. The files *_e.nc and *_i.nc, however, are falsely flagged as partial because they were not found by the verify job above. Please also try to re-archive these files, just to make sure. If they are OK, they are skipped and the flag is not removed.

slk does not have a --version flag#

Instead, it has a version command: slk version

slk performance on different node types#

We recommend running slk archive and slk retrieve on nodes of the shared and interactive partitions (see Run slk in the “interactive” partition and Run slk as batch job). Please avoid running slk archive and slk retrieve on login and compute nodes when larger amounts of data are to be transferred.

On login nodes, only small files should be archived so that these nodes are not slowed down by slk. slk retrieve can only retrieve one file at a time on these nodes.

On shared nodes, the available I/O bandwidth might be a limiting factor for the transfer rate of slk archive and slk retrieve. Therefore, the transfer rate might be higher on exclusive nodes.

group memberships of a user are only updated on login#

If a user is added to a new group/project, this information is not automatically passed to StrongLink. Instead, the user has to run slk login again. Background: StrongLink caches LDAP data of each user and only updates its cache on a new login.

Filtering slk list results with “*”#

Bash wildcards/globs partly work with slk. slk list understands *.

use * to replace parts of the file name#

This works fine:

$ slk list /ex/am/ple/\*.nc
...

$ slk list '/ex/am/ple/*.nc'
...

The user needs to prevent the bash/ksh/… from interpreting the *. This can be done with one of the two approaches shown above.

escape * to print the content of a namespace containing * in its name#

Assume we have a namespace with the name *, which is allowed. Then we might try to list its content like this:

$ slk list '/ex/am/ple/\*'
...

This successfully prevents slk list from interpreting the *. However, when a * is in the path, slk list automatically switches into “filter mode”. This means that the content of the namespace /ex/am/ple is filtered for entries with the name *. Hence, we will just get * printed and not its content.

using * to replace parts of namespace names#

Using * to replace parts of namespace names does not work. However, the full name of a namespace can be replaced by *. Example:

$ slk list /ex/am/ple
drwxrwxrwx- k204221  ka1209             0   30 Nov 2022 10:52 aa11
drwxrwxrwx- k204221  ka1209             0   30 Nov 2022 10:52 aa22
drwxrwxrwx- k204221  ka1209             0   30 Nov 2022 10:52 bb00

$ slk list /ex/am/ple/\*/\*.nc
...

$ slk list '/ex/am/ple/*/*.nc'
...

These last two list commands will look for *.nc in every sub-namespace of /ex/am/ple. However, the following list command will not work:

$ slk list '/ex/am/ple/a*/*.nc'
ERROR: Cannot run "list /ex/am/ple/a*/*.nc": file or directory named '*.nc' was not found.

How to search non-recursively in a namespace#

By default, slk search searches recursively in a namespace provided via path:

slk search '{"path":{"$gte":"/arch/bm0146/k204221/test_files"}}'

It is not possible to use the $eq operator in this context. Instead, another key-value pair "$max_depth": 1 has to be inserted as follows:

slk search '{"path":{"$gte":"/arch/bm0146/k204221/test_files", "$max_depth":1}}'

Alternatively, you can get the object id of the particular namespace via slk_helpers exists and then use it in your search query as the value of the search field resources.parent_id (see the example code in slk Usage Examples).
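A hedged sketch of this approach; the id used below is made up and the exact output of slk_helpers exists may differ – please see slk Usage Examples for the authoritative query:

# get the object id of the namespace (printed if the path exists)
$ slk_helpers exists /arch/bm0146/k204221/test_files

# use the printed id as the value of resources.parent_id (id below is made up)
$ slk search '{"resources.parent_id": {"$eq": 49058410801}}'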

Terminal cursor disappears if slk command with progress bar is canceled#

If a slk command with a progress bar is canceled by the user, the shell cursor might disappear. One can make it re-appear by (a) running reset or (b) starting vim and leaving it directly (:q!).
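For example:

# option (a): reset the terminal
$ reset

# option (b): start vim and leave it directly
$ vim
:q!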

error “conflict with jdk/…” when the slk module is loaded#

slk needs a specific Java version that is automatically loaded with slk. Having other Java versions loaded in parallel might cause unwanted side effects. Therefore, the module system throws an error message and aborts.

slk needs at least Java version 13#

You might encounter an error like this:

$ slk list 12
CLI tools require Java 13 (found 1)

slk needs a specific Java version. This Java version is automatically loaded when we load the slk module. If you have another Java module loaded explicitly, please unload it prior to loading the slk module. If you loaded slk already, please: (1) unload slk, (2) unload all Java modules and (3) load slk again.
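A sketch of the module commands; the name of the Java module (here jdk) may differ, please check module list:

$ module unload slk
$ module unload jdk
$ module load slk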

slk search yields RQL parse error#

ERROR: Search failed. Reason: RQL parse error: No period found in collection field name ().

Either: Please consider using ' around your search query instead of " to prevent operators starting with $ from being evaluated as bash variables.

Or: Please escape $’s belonging to query operators when you use " as delimiters of the query string.

Or: Please check your JSON query carefully. It might be valuable to print the query in a human readable way with echo QUERY | jq.
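Examples of the three options, using the query from How to search non-recursively in a namespace as a stand-in:

# single quotes: bash does not expand the $ of query operators
$ slk search '{"path": {"$gte": "/arch/bm0146/k204221/test_files"}}'

# double quotes: escape each $ that belongs to a query operator
$ slk search "{\"path\": {\"\$gte\": \"/arch/bm0146/k204221/test_files\"}}"

# print the query in a human-readable way to spot JSON errors
$ echo '{"path": {"$gte": "/arch/bm0146/k204221/test_files"}}' | jq .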

slk login asks me to provide a hostname and/or a domain#

If you are asked for this information, the configuration is faulty. Please contact support@dkrz.de and tell us on which machine you are working.

Archival fails and Java NullPointerException in the log#

This error message is printed in the log:

2021-07-13 08:33:03 ERROR Unexpected exception
java.lang.NullPointerException: null
    at com.stronglink.slkcli.api.websocket.NodeThreadPools.getBestPool(NodeThreadPools.kt:28) ~[slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.archive.Archive.upload(Archive.kt:191) ~[slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.archive.Archive.uploadResource(Archive.kt:165) ~[slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.archive.Archive.archive(Archive.kt:77) [slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.SlkCliMain.run(SlkCliMain.kt:169) [slk-cli-tools-3.1.62.jar:?]
    at com.stronglink.slkcli.SlkCliMainKt.main(SlkCliMain.kt:103) [slk-cli-tools-3.1.62.jar:?]
2021-07-13 08:33:03 INFO

This error indicates that there is an API issue. A reason might be that one or more StrongLink nodes went offline and the other nodes did not take over their connections yet. Please notify support@dkrz.de if you experience this error.

slk ERROR: Unhandled error occurred, please check logs#

Please have a look into your slk log: ~/.slk/slk-cli.log.

slk archive: Exception …: lateinit property websocket has not been initialized#

Full error message on the command line:

Exception in thread "Thread-357" kotlin.UninitializedPropertyAccessException: lateinit property websocket has not been initialized
at com.stronglink.slkcli.queue.ArchiveWebsocketWorker.closeConnection(ArchiveWebsocketWorker.kt:146)
at com.stronglink.slkcli.queue.WebsocketWorker.run(WebsocketWorker.kt:67)

Error message in the log:

2022-03-01 13:50:28 ERROR Error in websocket worker
java.util.concurrent.CompletionException: java.net.http.WebSocketHandshakeException
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) ~[?:?]

Reason#

Probably, slk archive was run with --streams 10 or a similarly high number such as --streams 16 or --streams 32.

Solution#

Please use slk archive --streams N with a maximum value of 4 for N. Transfer rates of 1 to 2 GB/s are possible with this configuration when the system is not busy.
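Example (paths are placeholders):

$ slk archive --streams 4 /work/project/user/data /ex/am/ple/blub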

slk delete failed, but nevertheless file was deleted#

Issue description#

We run slk delete /abc/def/ghi.txt but slk delete fails for an unknown reason. Repeated calls of slk delete /abc/def/ghi.txt fail because the target file does not exist anymore.

Reason#

The reason has not been fully identified yet. The most probable explanation is: when slk delete sends a deletion request to StrongLink, it waits a certain time for the response of the StrongLink instance. If the response does not arrive, or if the reply of another confirmation step does not return in time (= timeout), slk assumes that the command failed.

Solution#

Please carefully check whether files were actually deleted when a slk delete did not finish successfully.
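One way to check is to test whether the file still exists, for example via slk_helpers exists or slk list (the path below is the placeholder from the issue description):

$ slk_helpers exists /abc/def/ghi.txt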

slk list will take very long when many search results are found#

Issue description#

slk list SEARCH_ID seems to do nothing.

Reason#

slk list SEARCH_ID first collects all search results and then prints them. The run time of slk list scales linearly with the number of search results (20 s to 60 s per 1000 results). Hence, if you want to print a list of 10000 files which were found by slk search, you might have to wait 5 minutes until the list is printed.

As an alternative to slk list, you can run slk_helpers list_search on the same SEARCH_ID, which continuously prints the collected search results.
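Example (SEARCH_ID is a placeholder for the actual search id):

$ slk_helpers list_search SEARCH_ID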

Solution#

Please refine your search. The section Search files by metadata might help in this context.

slk search -user and -group do not work#

slk search -user <USER> and slk search -group <GROUP> do not work even if USER or GROUP exist.

“Connection reset”, “Connection timeout has expired”, “Name or service not known”, “Unable to resolve hostname” and “Host not reachable” errors#

Issue description#

The Unable to resolve hostname and Host not reachable errors occur directly after an slk command has been started.

The Connection reset and Connection timeout has expired errors occur while a file transfer via slk archive or slk retrieve is running.

Connection reset

Exception in thread "Thread-3" java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:323)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:350)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:803)
        at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
        at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
        ...

Connection timeout has expired

ERROR Connect timeout has expired [url=https://archive.dkrz.de//api/v2/udm_schemas, connect_timeout=unknown ms] io.ktor.network.sockets.ConnectTimeoutException: Connect timeout has expired [url=https://archive.dkrz.de//api/v2/udm_schemas, connect_timeout=unknown ms]
    at io.ktor.client.features.HttpTimeoutKt.ConnectTimeoutException(HttpTimeout.kt:183) ~[slk-cli-tools-3.3.76.jar:?]
    at io.ktor.client.engine.okhttp.OkUtilsKt.mapOkHttpException(OkUtils.kt:75) ~[slk-cli-tools-3.3.76.jar:?]
    at io.ktor.client.engine.okhttp.OkUtilsKt.access$mapOkHttpException(OkUtils.kt:1) ~[slk-cli-tools-3.3.76.jar:?]
    at io.ktor.client.engine.okhttp.OkHttpCallback.onFailure(OkUtils.kt:39) ~[slk-cli-tools-3.3.76.jar:?]

Reason#

If you experience Connection reset, timeout, Host not reachable or similar errors, then StrongLink or the DKRZ DNS might take too long to reply to requests of the slk client. Due to the way the load management of the StrongLink system is set up, some requests might run into timeouts.

Solution#

In April 2023, we requested the StrongLink development team to increase the timeout values – hardcoded in slk – in order to prevent these errors from happening. In the slk_helpers, these timeouts have already been increased.

Currently, there is nothing a user can do except run the failed slk command a second time. If Connection reset or Connection timeout has expired is thrown, the StrongLink system might be under high load. It might help to wait a few hours before the commands are run again – particularly if the error occurred multiple times in a short time interval. When Unable to resolve hostname or Host not reachable is thrown, the slk command can be rerun directly.