Known Issues (read this!)#
file version: 17 Sept 2024
current software versions: slk version 3.3.91; slk_helpers version 1.12.10; slk wrappers version 1.2.1
slk/slk_helpers terminate directly after start with “Name or service not known” or “Unhandled error occurred”#
The error which slk prints to the command line is:
ERROR: Unhandled error occurred, please check logs
However, if you take a look into the slk log (~/.slk/slk-cli.log), you will find the actual error:
2022-04-07 21:45:30 ERROR archive.dkrz.de: Name or service not known
slk_helpers prints this message directly to the command line:
archive.dkrz.de: Name or service not known
The error is the same.
Reason#
slk / slk_helpers does not get any reply from StrongLink. On Levante, the routing to the StrongLink constellation does not seem to work from time to time.
Solution#
You cannot do anything about this error; just run slk again a few seconds later.
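Since the failure is transient, the re-run can be automated with a small retry wrapper. This is only a sketch; the attempt count, the 10-second pause and the path in the example are illustrative:

```shell
# Run a command up to a given number of times, pausing between failed attempts.
retry() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        if [ "$i" -lt "$attempts" ]; then
            echo "attempt ${i}/${attempts} failed; waiting 10 s" >&2
            sleep 10
        fi
        i=$((i + 1))
    done
    return 1
}

# example (illustrative path):
# retry 3 slk list /arch/ab1234/c567890
```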
slk list prints “Error getting namespace children for namespace:”#
You run slk list on a namespace/folder and see this:
$ slk list /arch/pd1309/forcings/reanalyses/ERA5/year1987/
Error getting namespace children for namespace: 49083768018
Reason#
A connection timeout to StrongLink occurred, but slk list just prints a generic error. The actual error can be found in the slk log file (~/.slk/slk-cli.log):
2023-12-04 12:27:47 levante4.lvt.dkrz.de 2567653 ERROR Get namespace children error: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
The gateway timeout is returned when one or multiple StrongLink nodes do not respond or only respond after a long delay. This might be caused by overloaded StrongLink nodes – e.g. by many extensive searches or a large number of archivals.
Solution#
Please try the command again. If it fails repeatedly, please wait some time before retrying.
slk archive/retrieve may use much memory and CPU time – careful with parallel slk calls on one node#
slk archive and slk retrieve are fast but memory hungry. Additionally, these commands run many threads in parallel when large files or many files are transferred. As a rule of thumb, 6 GB of memory should be assumed for each archival and retrieval call. Be aware that on most Levante nodes, 2 GB of memory is allocated to each physical CPU core. This limit is strictly enforced by the operating system, and processes exceeding the allowed memory usage will be killed. Therefore, 6 GB of memory should be allocated in batch jobs via --mem=6GB.
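In a SLURM job script, the memory request could look like the following sketch; the job name, partition, account, time limit and paths are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=archive_data
#SBATCH --partition=shared       # placeholder: pick the partition that fits your workflow
#SBATCH --account=ab1234         # placeholder project account
#SBATCH --mem=6GB                # reserve 6 GB, as recommended for one slk call
#SBATCH --time=08:00:00

module load slk
slk archive /work/project/user/data /ex/am/ple/blub
```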
Running many slk archive or slk retrieve commands in parallel in one job (a) occupies a lot of memory and (b) causes issues in the thread management. Additionally, the transfer speed might not improve if many archivals or retrievals run in parallel on one node due to hardware limitations. Please consider aggregating several retrievals via a search and a retrieval of the search results (see Aggregate file retrievals).
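A sketch of such an aggregation; the namespace path, the destination directory and the printed search id are illustrative:

```shell
# 1) search for all files below one namespace (illustrative path)
$ slk search '{"path": {"$gte": "/arch/ab1234/c567890/exp1"}}'
# slk prints a search id, e.g. 123456

# 2) retrieve all search results with a single call (123456 is illustrative)
$ slk retrieve 123456 /scratch/a/a000000/restore
```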
slk archive/retrieve is killed#
If you receive this output
/sw/spack-levante/slk-3.3.21-5xnsgp/bin/slk: line 16: 3673728 Killed
then the slk call was killed because its RAM usage was too high. Please see slk archive/retrieve may use much memory and CPU time – careful with parallel slk calls on one node for details.
slk is hanging / unresponsive#
A few issues might cause a slk call to hang. However, a long run time does not necessarily mean that slk hangs: slk retrieve and slk recall might idle for a long period if many files have to be copied from tape at the same time.
reason: Lustre file system is hanging#
Please check whether /home is hanging. If /home is hanging, slk cannot access its login token and cannot write into its log.
reason: slk retrieve does not hang but the tape recall takes very long#
When many retrieve/recall requests of files from tape are processed, the individual calls of slk retrieve might take longer than normal because the tape requests are queued. In this situation, slk retrieve might look like it is hanging. Instead, it is waiting for files to be copied from tape to the HSM cache.
reason: one or more source files have 0 byte size#
Please check whether you are archiving a file of 0 Byte size. slk archive and slk retrieve hang when such a file is archived or retrieved, respectively.
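A quick check for empty files before archiving can avoid the hang. This is a minimal sketch; the path in the usage example is illustrative:

```shell
# Print all zero-byte regular files below a directory; return 1 if any exist.
check_zero_byte_files() {
    empty=$(find "$1" -type f -size 0)
    if [ -n "$empty" ]; then
        echo "zero-byte files found:" >&2
        echo "$empty" >&2
        return 1
    fi
    return 0
}

# example (illustrative path):
# check_zero_byte_files /work/project/user/data || echo "fix empty files before archiving"
```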
Non-recursive is semi-recursive!?#
If a namespace/directory (not a file) is given as input to the command slk tag, all files in this namespace/directory are affected but no files in sub-directories. When -R is set, all sub-namespaces are also affected.
slk writes no output in non-interactive mode#
Many slk commands do not print output to the stdout and stderr streams (== command line output) but write their output into a buffer. Thus, the output is not captured when the commands are run in non-interactive mode – i.e. in SLURM jobs. In the case of slk archive, you can set -vv in order to print a list of processed files to stdout (i.e. the terminal / SLURM log). In all other situations when slk is used in batch jobs, please catch the exit codes of your slk calls and check whether they are equal to 0. If not, an error occurred or files were skipped. Details on the error can be found in the slk log file ~/.slk/slk-cli.log. However, when you run many slk commands in parallel, the slk log becomes hard to read. Please print the hostname and the time stamp (i.e. via date) of the error into the SLURM log to be able to find the details in the slk log later on. Printing the pid of the slk call does not help, because this is only the pid of a wrapper script around the actual slk process.
The exit code of the previous program call is stored in $?. Example:
$ slk archive /work/project/user/data /ex/am/ple/blub
...
$ echo $?
0
# or 1 or higher
In a bash/batch script it could look like this:
# ...
slk archive /work/project/user/data /ex/am/ple/blub
exit_code=$?
# print exit code with prefix so that it is easy to `grep`
echo "exit code: ${exit_code}"
if [ ${exit_code} -ne 0 ]; then
    # print date and hostname
    date
    hostname
fi
slk never writes to stderr#
Error messages of slk are written to the stdout stream instead of the stderr stream. If slk wrote output in non-interactive mode (it does not!), you would find all error output in the SLURM stdout (not stderr) file.
slk move cannot rename files#
The Linux command mv can move and rename files. slk move can only move files/namespaces from one namespace to another namespace. Renaming can only be performed by slk rename. Both commands can only target one file/namespace at a time. Wildcards are not supported.
slk archive compares file size and timestamp prior to overwriting files#
slk archive compares file size and timestamp to decide whether to overwrite a file or not. rsync does it the same way. There might be rare situations when an archived file should be overwritten by another file with the same name, size and timestamp: this would fail.
Availability of archived data and modified metadata is delayed by a few seconds#
StrongLink is a distributed system. Metadata is stored in a distributed metadata database. Some operations might take a few seconds until their results are visible because they have to be synchronized amongst different nodes.
Please wait a few seconds before you retrieve a file that was just archived.
A file listed by slk list is not necessarily available for retrieval yet#
The location, name and size of a file are metadata. These metadata are written into the StrongLink metadata database when an archival process starts. slk list only prints metadata. Hence, a file which is currently transferred by slk archive will be listed by slk list even before it has been fully transferred. Similarly, aborted slk archive calls can produce a file’s metadata entry without correct data. Such a file can be retrieved without error. Please see failed or canceled slk archive and slk retrieve calls leave file fragments for details on file fragments.
failed/canceled slk archive calls leave incomplete files#
Incomplete/partial files may remain in StrongLink if slk archive was interrupted during an archival process. More than one file might be affected because multiple files can be archived in parallel. These files are denoted as partial files by StrongLink. Metadata is available for the partial files and they are listed by slk list. In most cases these files do not have checksums. Reasons for the interruption might be:
the user aborts slk archive – e.g. via CTRL + C
an ssh connection breaks
a SLURM job is killed due to a timeout
a too large amount of data is archived at once, causing slk archive to fail (i.e. more than 5 to 10 TB)
Assume that the archival into this destination namespace failed: /arch/ab1234/c567890/test. There might be (a) partial files and (b) files with a ‘partial’ flag in this namespace. The partial files (a) have only been partly archived and should be considered corrupted. These files are flagged as partial (b). However, completely archived files might also be flagged as partial when the slk archive failed shortly after they were completely archived. Files which are flagged as partial cannot be retrieved. Therefore, the flag should be removed from all completely archived files.
Solution#
The files which are flagged as partial can be identified by slk_helpers has_no_flag_partial. Completely archived files might also be flagged as partial. If files are reported as partial although repeated archivals succeed, please inform support@dkrz.de and we will perform additional checks.
Run ‘has_no_flag_partial’ to find files flagged as ‘partial’#
Please run the command slk_helpers has_no_flag_partial with the parameter -v to get a list of all flagged files. If -v is not set, the command only checks whether no files or at least one file has this flag set. -vv prints the status of all targeted files.
$ slk_helpers has_no_flag_partial -v -R /arch/ab1234/c567890/test
/dkrz_test/netcdf/20230925a/file_001gb_b.nc has partial flag
/dkrz_test/netcdf/20230925a/file_001gb_e.nc has partial flag
/dkrz_test/netcdf/20230925a/file_001gb_f.nc has partial flag
/dkrz_test/netcdf/20230925a/file_001gb_i.nc has partial flag
Number of files without partial flag: 8/12
Thus, 4 files of 12 are flagged as partial files. However, please also try to re-archive these files. If they are skipped and the flag is not removed, please notify us via support@dkrz.de and we will perform additional checks.
slk does not have a --version flag#
Instead, it has a version command: slk version
slk performance on different node types#
We recommend running slk archive and slk retrieve on nodes of the shared and interactive partitions (see Run slk in the “interactive” partition and Run slk as batch job). Please avoid running slk archive and slk retrieve on login and compute nodes when larger amounts of data should be transferred.
On login nodes, only small files should be archived so that these nodes are not slowed down by slk. slk retrieve can only retrieve one file at a time on these nodes.
On shared nodes, the available I/O bandwidth might be a limiting factor for the transfer rate of slk archive and slk retrieve. Therefore, the transfer rate might be higher on exclusive nodes.
group memberships of user updated on login#
If a user is added to a new group/project, this information is not automatically passed to StrongLink. Instead, the user has to run slk login again. Background: StrongLink caches LDAP data of each user and only updates its cache on a new login.
LDAP user not known to StrongLink prior to first login#
If a user has never logged in to StrongLink, their user account will not exist in StrongLink (i.e. chown to this user is not possible). Background: there are many users listed in the DKRZ LDAP who will never access StrongLink. Keeping all these users in the StrongLink user database is not reasonable.
Filtering slk list results with “*”#
Bash wildcards/globs partly work with slk. slk list understands *.
use * to replace parts of the file name#
This works fine:
$ slk list /ex/am/ple/\*.nc
...
$ slk list '/ex/am/ple/*.nc'
...
The user needs to prevent the bash/ksh/… from interpreting the *. This can be done by either of the two approaches above.
escape * to print the content of a namespace containing * in its name#
Assume we have a namespace with the name *, which is allowed. Then we might do this to list its content:
$ slk list '/ex/am/ple/\*'
...
This successfully prevents the shell from interpreting the *. However, when a * is in the path, slk list automatically switches into “filter mode”. This means that the content of the namespace /ex/am/ple will be filtered for content with the name *. Hence, we will just get * printed and not its content.
using * to replace parts of namespace names#
Using * to replace parts of namespace names does not work. However, the full name of a namespace can be replaced by *. Example:
$ slk list /ex/am/ple
drwxrwxrwx- k204221 ka1209 0 30 Nov 2022 10:52 aa11
drwxrwxrwx- k204221 ka1209 0 30 Nov 2022 10:52 aa22
drwxrwxrwx- k204221 ka1209 0 30 Nov 2022 10:52 bb00
$ slk list /ex/am/ple/\*/\*.nc
...
$ slk list '/ex/am/ple/*/*.nc'
...
These last two list commands will look for *.nc in every sub-namespace of /ex/am/ple. However, the following list command will not work:
$ slk list '/ex/am/ple/a*/*.nc'
ERROR: Cannot run "list /ex/am/ple/a*/*.nc": file or directory named '*.nc' was not found.
How to search non-recursively in a namespace#
By default, slk search searches recursively in a namespace provided via path:
slk search '{"path":{"$gte":"/arch/bm0146/k204221/test_files"}}'
It is not possible to use the $eq operator in this context. Instead, another key-value pair "$max_depth": 1 has to be inserted as follows:
slk search '{"path":{"$gte":"/arch/bm0146/k204221/test_files", "$max_depth":1}}'
Alternatively, you can get the object id of the particular namespace via slk_helpers exists and then, in your search query, use it as the value for the search field resources.parent_id (see this example code in slk Usage Examples).
Terminal cursor disappears if slk command with progress bar is canceled#
If a slk command with a progress bar is canceled by the user, the shell cursor might disappear. One can make it re-appear by (a) running reset or (b) starting vim and leaving it directly (:q!).
error “conflict with jdk/…” when the slk module is loaded#
slk needs a specific Java version that is automatically loaded with slk. Having other Java versions loaded in parallel might cause unwanted side effects. Therefore, the system throws an error message and aborts.
slk needs at least Java version 13#
You might encounter an error like this:
$ slk list 12
CLI tools require Java 13 (found 1)
slk needs a specific Java version. This Java version is automatically loaded when the slk module is loaded. If you have loaded another Java module explicitly, please unload it prior to loading the slk module. If you have already loaded slk, please: (1) unload slk, (2) unload all Java modules and (3) load slk again.
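On systems using environment modules, the unload/reload sequence could look like the following sketch; jdk stands for whichever Java module is actually loaded:

```shell
module unload slk
module unload jdk      # placeholder: unload whichever Java module is loaded
module load slk        # loads the Java version that slk needs
```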
slk search yields RQL parse error#
ERROR: Search failed. Reason: RQL parse error: No period found in collection field name ().
Either: please use ' around your search query instead of " to prevent operators starting with $ from being evaluated as bash variables.
Or: please escape the $’s belonging to query operators when you use " as delimiters of the query string.
Or: please check your JSON query carefully. It might be valuable to print the query in a human-readable way with echo QUERY | jq.
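For example, single quotes keep the $ operators away from the shell, and jq confirms that the query is valid JSON (the path is illustrative):

```shell
# single quotes prevent the shell from expanding $gte as a variable
echo '{"path": {"$gte": "/arch/bm0146/k204221/test_files"}}' | jq .
```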
slk login asks me to provide a hostname and/or a domain#
If you are asked for this information the configuration is faulty. Please contact support@dkrz.de and tell us on which machine you are working.
Archival fails and Java NullPointerException in the log#
This error message is printed in the log:
2021-07-13 08:33:03 ERROR Unexpected exception
java.lang.NullPointerException: null
at com.stronglink.slkcli.api.websocket.NodeThreadPools.getBestPool(NodeThreadPools.kt:28) ~[slk-cli-tools-3.1.62.jar:?]
at com.stronglink.slkcli.archive.Archive.upload(Archive.kt:191) ~[slk-cli-tools-3.1.62.jar:?]
at com.stronglink.slkcli.archive.Archive.uploadResource(Archive.kt:165) ~[slk-cli-tools-3.1.62.jar:?]
at com.stronglink.slkcli.archive.Archive.archive(Archive.kt:77) [slk-cli-tools-3.1.62.jar:?]
at com.stronglink.slkcli.SlkCliMain.run(SlkCliMain.kt:169) [slk-cli-tools-3.1.62.jar:?]
at com.stronglink.slkcli.SlkCliMainKt.main(SlkCliMain.kt:103) [slk-cli-tools-3.1.62.jar:?]
2021-07-13 08:33:03 INFO
This error indicates an API issue. A reason might be that one or more StrongLink nodes went offline and the other nodes did not take over their connections yet. Please notify support@dkrz.de if you experience this error.
slk ERROR: Unhandled error occurred, please check logs#
Please have a look into your slk log: ~/.slk/slk-cli.log.
slk archive: Exception …: lateinit property websocket has not been initialized#
Full error message on the command line:
Exception in thread "Thread-357" kotlin.UninitializedPropertyAccessException: lateinit property websocket has not been initialized
at com.stronglink.slkcli.queue.ArchiveWebsocketWorker.closeConnection(ArchiveWebsocketWorker.kt:146)
at com.stronglink.slkcli.queue.WebsocketWorker.run(WebsocketWorker.kt:67)
Error message in the log:
2022-03-01 13:50:28 ERROR Error in websocket worker
java.util.concurrent.CompletionException: java.net.http.WebSocketHandshakeException
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) ~[?:?]
Reason#
Probably, slk archive was run with --streams 10 or a similarly high number like --streams 16 or --streams 32.
Solution#
Please use slk archive --streams N with a maximum value of 4 for N. Transfer rates of 1 to 2 GB/s are possible with this configuration when the system is not busy.
slk delete failed, but nevertheless file was deleted#
Issue description#
We run slk delete /abc/def/ghi.txt, but slk delete fails for an unknown reason. Repeated calls of slk delete /abc/def/ghi.txt fail because the target file does not exist anymore.
Reason#
The reason has not been fully identified yet. This is the most probable explanation: when slk delete sends a deletion request to StrongLink, it waits a certain time for the response of the StrongLink instance. If the response does not arrive, or if the reply of another confirmation step does not return in time (= timeout), slk assumes that the command failed.
Solution#
Please carefully check whether files were actually deleted when a slk delete did not finish successfully.
slk list will take very long when many search results are found#
Issue description#
slk list SEARCH_ID seems to do nothing.
Reason#
slk list SEARCH_ID collects all search results first and then prints them. The run time of slk list scales linearly with the number of results (20 s to 60 s per 1000 results). Hence, if you want to print a list of 10000 files which were found by slk search, you might have to wait 5 minutes until the list is printed.
As an alternative to slk list, you can run slk_helpers list_clone_search on the same SEARCH_ID, which allows selecting a range of results to print. This command works only with search ids and not with paths of namespaces or files. If you wish to list the same information for one individual file, please run slk_helpers list_clone_file <file_path>. There is no list_clone_* command for listing the content of a namespace.
Solution#
Please refine your search. The section Search files by metadata might help in this context.
slk search -user and -group do not work#
slk search -user <USER> and slk search -group <GROUP> do not work, even if USER or GROUP exist.
“Connection reset”, “Connection timeout has expired”, “Name or service not known”, “Unable to resolve hostname” and “Host not reachable” errors#
Issue description#
The Unable to resolve hostname and Host not reachable errors occur directly after a slk command has been started.
The Connection reset and Connection timeout has expired errors occur while a file transfer via slk archive or slk retrieve is running.
Connection reset
Exception in thread "Thread-3" java.net.SocketException: Connection reset
at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:323)
at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:350)
at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:803)
at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
...
Connection timeout has expired
ERROR Connect timeout has expired [url=https://archive.dkrz.de//api/v2/udm_schemas, connect_timeout=unknown ms] io.ktor.network.sockets.ConnectTimeoutException: Connect timeout has expired [url=https://archive.dkrz.de//api/v2/udm_schemas, connect_timeout=unknown ms]
at io.ktor.client.features.HttpTimeoutKt.ConnectTimeoutException(HttpTimeout.kt:183) ~[slk-cli-tools-3.3.76.jar:?]
at io.ktor.client.engine.okhttp.OkUtilsKt.mapOkHttpException(OkUtils.kt:75) ~[slk-cli-tools-3.3.76.jar:?]
at io.ktor.client.engine.okhttp.OkUtilsKt.access$mapOkHttpException(OkUtils.kt:1) ~[slk-cli-tools-3.3.76.jar:?]
at io.ktor.client.engine.okhttp.OkHttpCallback.onFailure(OkUtils.kt:39) ~[slk-cli-tools-3.3.76.jar:?]
Reason#
If you experience Connection reset, timeout, Host not reachable or similar errors, then StrongLink or the DKRZ DNS might take too long to reply to requests of the slk client. Due to the way the load management of the StrongLink system is set up, some requests might run into timeouts.
Solution#
In April 2023, we asked the StrongLink development team to increase the timeout values – hardcoded in slk – in order to prevent these errors from happening. In the slk_helpers, these timeouts have already been increased.
Currently, there is nothing a user can do except running the failed slk command a second time. If Connection reset or Connection timeout has expired is thrown, the StrongLink system might be under high load. It might help to wait a few hours before the commands are run another time – particularly if the error occurred multiple times in a short time interval. When Unable to resolve hostname or Host not reachable is thrown, the slk command can be rerun directly.