Archivals to tape#

file version: 09 Jul 2024

current software versions: slk version 3.3.91; slk_helpers version 1.12.10; slk wrappers 1.2.2

run slk archive#

slk archive is available on all Levante nodes. Basic examples of slk archive calls are:

# archive one file, absolute path
$ slk archive /work/bm0146/k204221/some_file.nc /arch/ab01234/c567890/my_data_1/
# archive one file, relative path
$ slk archive some_file.nc /arch/ab01234/c567890/my_data_3/
# archive folder recursively, absolute path
$ slk archive -R /work/bm0146/k204221/some_folder /arch/ab01234/c567890/my_data_4/
# archive folder recursively, relative path, skip hidden files and folders
$ slk archive -x -R some_folder /arch/ab01234/c567890/my_data_5/
# archive multiple files
$ slk archive some_file_a.nc some_file_b.nc some_file_c.nc /arch/ab01234/c567890/my_data_6/
# archive multiple files using wildcards
$ slk archive file_?.nc /arch/ab01234/c567890/my_data_7/
$ slk archive year200[0123].nc /arch/ab01234/c567890/my_data_8/

slk archive may need up to 6 GB of memory (incl. overhead). If you want to archive a single file of about 1 GB, you are welcome to do this on the Levante login nodes. If you want to archive multiple GB of data, please run slk archive on the interactive or shared nodes and allocate 6 GB of memory. When your slk is killed with a message similar to the following one, too little memory was allocated.

/sw/[...]/bin/slk: line 16: [...] Killed

If you want to run multiple slk archive calls in parallel on one node, please allocate at least 5 GB per call of slk archive. However, we recommend running only one slk archive, but with many files at once.

Interactive sessions are started with salloc:

$ salloc --mem=6GB --partition=interactive --account=<YOUR_PROJECT_ACCOUNT>
$ slk archive example.nc /arch/ab0123/c456789
...
$ exit

See also Run slk in the “interactive” partition and Data Processing on Levante for details.

Different batch scripts for archiving data are provided at the bottom of this document. The script archive_slurm_template_multiple_files.sh (section archive multiple files) is the simplest one and will cover most use cases.

# first, adapt source files and target
# then submit:
$ sbatch --account=<YOUR_PROJECT_ACCOUNT> ./archive_slurm_template_multiple_files.sh

slk archive does not always finish successfully. Please check whether it did. Three ways to do this are described in section check if archival command was successful below.

If slk archive fails, please run the same command again. Missing and incomplete existing files are archived. Complete existing files are skipped. If slk archive fails repeatedly, we recommend running it with -vv and checking the log file for error messages or Java exceptions.

Please consider running slk archive with -x to prevent the archival of hidden files and folders (details). Hidden folders such as .git and .svn might contain many small files and hidden files may contain login information or similar.

file size#

We recommend a size between 10 GB and 200 GB per archived file. We strongly recommend not to archive files below 1 GB. Each archived file consumes at least 1 GB of tape quota. Please pack smaller files into tar balls prior to their archival. packems supports you in this process (see below).

Why?

  • lower limit: It is very inefficient to retrieve a large number of small files from tape compared to a low number of large files of the same total size. This is because on average more tapes have to be read and more spooling has to be done per tape when more files are retrieved. The tape health is reduced and it takes more time to get all small files back.

  • upper limit: Due to the system setup of our current HSM system, the transfer speed between tape drives and HSM-cache decreases when files are larger than 200 GB to 250 GB. This might change in future.

Retrieving 100 files of 100 MB size takes much longer than retrieving one file of 10 GB size. Additionally, the lifetime of the tape(s) is considerably reduced. When a file is read from tape, the tape drive spools the tape to the start of the file, stops, reads the file with increasing speed until the top speed of 300 MB/s is reached and stops towards the end of this file. In this case, this operation is repeated up to 100 times, which stresses the tape and takes time. If the files are stored on multiple tapes, the user has to wait additional time for the tapes to be transported from their shelves to the tape drives.
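Before archiving a folder, it can help to see which files fall below the recommended lower limit and should be packed first. The following is a minimal sketch using standard find; list_small_files is a hypothetical helper name (not part of slk), and 1 GB is assumed to mean 10^9 bytes.

```shell
# list_small_files DIR MIN_BYTES
# Print every regular file below DIR that is smaller than MIN_BYTES bytes.
# (hypothetical helper for illustration, not part of slk or packems)
list_small_files() {
    # "-size -Nc" matches files whose size is less than N bytes
    find "$1" -type f -size "-${2}c" | sort
}

# Example call (assumption: 1 GB = 10^9 bytes):
# list_small_files /work/bm0146/k204221/some_folder 1000000000
```

If this prints many files, packing them into tar balls as described in section pack files is probably the better route.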

Warning

When you run slk archive -R to archive a folder with model output of nice size, there might be hidden files and/or folders which you accidentally archive as well. These hidden folders, such as .git and .svn, might contain many small files. Please set -x to ignore these files (details).

Please do not archive more than 3 TB with one call of slk archive. There is a high probability that the connection to StrongLink is interrupted if you try to do so. The probability of connection interruptions increases with the size of the archival. No error message is printed to the terminal; the error is only written to the slk log (~/.slk/slk-cli.log):

2022-11-24 11:16:22 INFO  Executing command: "archive -R /work/ab1234/c567890/much_data /arch/zy0987/c567890/target"
2022-11-24 11:18:25 ERROR Unexpected exception
java.io.IOException: unexpected end of stream on https://archive.dkrz.de/...
        at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:202) ~[slk-cli-tools-3.3.21.jar:?]
[...]
[...]
[...]
        ... 16 more
2022-11-24 11:18:25 INFO
Archive report
===============
Status: incomplete
Total files uploaded: 0/85083 files [0B/20.3T]

Please search for unexpected end of stream on https://archive.dkrz.de/... to find such events.
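Such events can be counted with a plain text search of the log. This is a minimal sketch; count_stream_errors is a hypothetical helper name, and the log path in the example call is the one given above.

```shell
# count_stream_errors LOGFILE
# Print how many log lines report an interrupted connection to StrongLink.
# (hypothetical helper for illustration, not part of slk)
count_stream_errors() {
    # -F: search for the fixed string, -c: print the number of matching lines;
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -F -c "unexpected end of stream on https://archive.dkrz.de" "$1" || true
}

# Example call:
# count_stream_errors ~/.slk/slk-cli.log
```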

Please do not run more than two or three slk archive calls in parallel because running more than that for a long time might cause a high load on multiple StrongLink nodes. This would cause slow system response and a higher probability of connection timeouts for all users.

If you need to archive more than 50 TB at once, please contact us in advance via support@dkrz.de .
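The two size limits from this section (3 TB per call, 50 TB before contacting support) can be checked in a script before an archival is started. The sketch below is only an illustration: classify_archive_size and its output words are hypothetical, and 1 TB is assumed to mean 10^12 bytes.

```shell
# classify_archive_size BYTES
# Print "ok" (up to 3 TB), "split" (more than 3 TB) or
# "contact-support" (more than 50 TB).
# (hypothetical helper; the output words are made up for this example)
classify_archive_size() {
    tb=1000000000000   # assumption: 1 TB = 10^12 bytes
    if [ "$1" -gt $((50 * tb)) ]; then
        echo "contact-support"
    elif [ "$1" -gt $((3 * tb)) ]; then
        echo "split"
    else
        echo "ok"
    fi
}

# Example call with GNU du (-s: summary, -b: size in bytes):
# classify_archive_size "$(du -sb /work/bm0146/k204221/some_folder | cut -f1)"
```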

pack files#

Files below 1 GB size should be packed into tar balls or zip files of a maximum size of 200 GB. We recommend to use uncompressed tar balls.

You can create the tar balls manually or use packems to do the job for you. A folder or a list of files is provided as input to packems. packems automatically packs the provided files into one or multiple tar balls of a maximum size of 100 GB. In a next step, it copies the new tar balls into the tape archive. Additionally, packems creates and archives a file INDEX.txt, which contains a list of all packed files and their tar ball. packems can do all tasks in one step but we recommend running packing and archiving separately.

Please do not run packems on the Levante login nodes but on shared or interactive nodes as described above for slk archive. Because packems uses slk archive for archival, sufficient memory has to be allocated.

packems: basic archival#

# pack data with packems
# -d ... => local tmp destination of the tar balls
# -S ... => destination of the tar balls in the tape archive
# -o data_a => prefix of the tar ball names
# --no-archive => only pack and no archival yet
# last argument => folder to pack recursively
$ packems \
    -d /scratch/k/k204221/packems \
    -S /arch/bm0146/k204221/archived \
    -o data_a \
    --no-archive \
    /work/bm0146/k204221/archive_this
...

# look what is there
$ ls
data_a_001.tar   data_a_002.tar   data_a_003.tar   ...

# archive the tar balls and create index file
# --archive-only => do only archive data
$ packems \
    -d /scratch/k/k204221/packems \
    -S /arch/bm0146/k204221/archived \
    -o data_a \
    --archive-only \
    /work/bm0146/k204221/archive_this
...

packems: change tar ball size#

Set the maximum size of the tar balls to 50 GB.

# pack data with packems
# -t ...  => target size (a few tar balls might slightly exceed the value)
# -m ...  => hard maximum at least if the source files are large
$ packems \
    -d /scratch/k/k204221/packems \
    -S /arch/bm0146/k204221/archived \
    -o data_a \
    -t 30 -m 50 \
    --no-archive \
    /work/bm0146/k204221/archive_this
...

packems: list of files as input#

We want to archive all *.nc files from /work/bm0146/k204221/archive_this. For this purpose, we run find ... and pipe its output into packems.

# pack data with packems
# -i -  => read input from ``stdin``
$ find /work/bm0146/k204221/archive_this -type f -name '*.nc' | \
    packems \
    -i - \
    -d /scratch/k/k204221/packems \
    -S /arch/bm0146/k204221/archived \
    -o data_a \
    -t 30 -m 50 \
    --no-archive
...

We might also pipe the content of a file into packems.

# pack data with packems
# -i -  => read input from ``stdin``
$ cat file_list.txt | \
    packems \
    -i - \
    -d /scratch/k/k204221/packems \
    -S /arch/bm0146/k204221/archived \
    -o data_a \
    -t 30 -m 50 \
    --no-archive
...

owner, group and permissions#

permissions#

The permissions of the original file are transferred to the archived file. The permissions can be modified by the owner with this command:

slk chmod <PERMISSIONS> /arch/ab1234/test.nc

owner#

The owner of the archived file is the archiving user. If files should be handed over to another user, please contact support@dkrz.de .

group#

The group of the archived file is the default group of the user. Thus, the group is not adapted to the target namespace. The owner of files can change their group with this command

slk group <GROUP> /arch/ab1234/test.nc

Skip and ignore files#

slk archive automatically skips archiving a file when

  • a file with the same name already exists in the destination location,

  • both files are equal in size and

  • both files have the same modification time.

“modification time” means the mtime timestamp of the file and not the archival or modification time in StrongLink.

Skipping a file is considered as success/not-failed. In this respect, slk archive works the same as rsync.

If the parameter -x is set, slk archive ignores all hidden files and directories. A file or directory is considered as hidden when its name starts with a dot. E.g. .git, .ipynb_checkpoints and .ssh folders or .gitignore and .config files.

When you run slk archive with -vv, then skipped files are listed but ignored files are not listed.

check if archival command was successful#

There are three ways to check:

  • look into the textual output of slk archive,

  • capture the exit code $? (0: successful archival),

  • look into the slk log file: ~/.slk/slk-cli.log

slk archive may skip the archival of files. Skipping a file is considered as success/not-failed.

check success: evaluate text output#

slk archive prints

  • Non-recursive Archive completed if the archival of all files was successful and

  • Non-recursive Archive failed if the archival of at least one file failed.

Skipped files are not considered as failed.

In the terminal it will look like this:

# archive a new file
$ slk archive a_file.nc /arch/ab0123/c456789
Non-recursive Archive completed

# archive a file which already exists and will be skipped
$ slk archive already_existing_file.nc /arch/ab0123/c456789
Non-recursive Archive completed

# archival will fail because no write permissions in destination location
$ slk archive a_file.nc /arch/no_permissions
Non-recursive Archive failed

You can print the archival status of each file by appending the flag -vv. In this case, you see whether a file was archived successfully (SUCCESSFUL), was skipped (SKIPPED) or failed (FAILED).

# archive a new file
$ slk archive a_file.nc /arch/ab0123/c456789 -vv
a_file.nc SUCCESSFUL
Non-recursive Archive completed

# archive a file which already exists and will be skipped
$ slk archive already_existing_file.nc /arch/ab0123/c456789 -vv
already_existing_file.nc SKIPPED
Non-recursive Archive completed

# archival will fail because no write permissions in destination location
$ slk archive a_file.nc /arch/no_permissions -vv
a_file.nc FAILED
Non-recursive Archive failed

slk archive prints no file status if the source file does not exist, even if -vv is set.

# archival will fail because source file does not exist
$ slk archive non_existing_file.nc /arch/ab0123/c456789 -vv
Non-recursive Archive failed
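For archivals with many files, the per-file status lines printed with -vv can be summarised instead of read line by line. The following is a minimal sketch that counts the three status words shown above. summarize_archive_output is a hypothetical helper name, and the sketch assumes that the status lines are written to standard output and that the status is the last word of each line.

```shell
# summarize_archive_output
# Read the output of "slk archive ... -vv" from stdin and print how many
# files were reported as SUCCESSFUL, SKIPPED and FAILED.
# (hypothetical helper for illustration, not part of slk)
summarize_archive_output() {
    awk '
        $NF == "SUCCESSFUL" { ok++ }
        $NF == "SKIPPED"    { skipped++ }
        $NF == "FAILED"     { failed++ }
        END { printf "successful=%d skipped=%d failed=%d\n", ok, skipped, failed }
    '
}

# Example call:
# slk archive *.nc /arch/ab0123/c456789 -vv | summarize_archive_output
```

Note that in such a pipeline $? holds the exit code of the last command of the pipe and not that of slk archive, so the exit code has to be checked separately if it is needed.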

check success: capture exit code#

Each command returns an exit code when it ends. The exit code is not visible but the exit code of the most recent command is stored in the variable $?. An exit code of 0 indicates success. Exit codes >0 indicate a failure or have a special meaning depending on the command.

slk archive returns the exit code 0 on success and 1 on failure:

$ slk archive a_file.nc /arch/ab0123/c456789
Non-recursive Archive completed
$ echo $?
0

$ slk archive non_existing_file.nc /arch/ab0123/c456789
Non-recursive Archive failed
$ echo $?
1

All slk commands only return 0 and 1. In contrast, the slk_helpers return 0, 1, 2 and 3. The exact meaning differs from command to command. In the case of slk_helpers exists, 0 means yes and 1 means no; in the third example below, 2 is returned for a file without read permissions:

$ slk_helpers exists /.../existing_file.nc > /dev/null 2>&1
$ echo $?
0

$ slk_helpers exists /.../non_existing_file.nc > /dev/null 2>&1
$ echo $?
1

$ slk_helpers exists /.../no_read_permissions.nc > /dev/null 2>&1
$ echo $?
2

Exit codes are very useful in bash scripts:

#!/bin/bash

slk archive a_file.nc /arch/ab0123/c456789
exit_code=$?

if [ $exit_code -eq 0 ]; then
    echo "archival successful; first try"
    # do some more stuff ...
else
    >&2 echo "error occurred during archival; wait and retry"
    sleep 10
    slk archive a_file.nc /arch/ab0123/c456789
    exit_code=$?
    if [ $exit_code -eq 0 ]; then
        echo "archival successful; second try"
        # do some more stuff ...
    else
        >&2 echo "error occurred during archival; failed twice; exiting"
        exit 1
    fi
fi
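The nested if blocks above grow quickly when more than two attempts are wanted. The same idea can be written as a small loop. This is a sketch under assumptions: retry is a hypothetical function name, and the number of attempts and the waiting time are arbitrary example values.

```shell
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it returns exit code 0, at most MAX_TRIES times.
# Returns 0 on success and 1 if all attempts failed.
# (hypothetical helper for illustration, not part of slk)
retry() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        >&2 echo "attempt $try of $max_tries failed"
        try=$((try + 1))
        # wait before the next attempt, but not after the last one
        if [ "$try" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example call: up to three attempts, 10 seconds apart
# retry 3 10 slk archive a_file.nc /arch/ab0123/c456789
```

Because slk archive skips files that are already complete in the destination, repeating the identical call is safe.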

check success: slk log file#

slk archive only prints sparse information to the terminal. Most error messages and an archival report are printed to the slk log file in ~/.slk/slk-cli.log. Please do not grant other users read permissions to ~/.slk because your slk login token is also stored in that folder.

A successful archival looks like this in the log:

2024-06-12 11:10:53 levante.dkrz.de 197384 INFO  Executing command: "archive ..."
2024-06-12 11:11:00 levante.dkrz.de 197384 INFO  Non-recursive Archive completed

Archive report
===============
Status: success
Total files uploaded: 1/1 files [3B/3B]

A failed archival might look similar to this in the log:

2024-06-12 11:17:01 levante.dkrz.de 205127 INFO  Executing command: "archive ..."
2024-06-12 11:17:02 levante.dkrz.de 205127 ERROR Unexpected exception
java.nio.file.NoSuchFileException: file_20.txt
[...]

Archive report
===============
Status: incomplete
Total files uploaded: 0/0 files [0B/0B]

Another failed archival might look like this in the log:

2023-03-25 01:34:25 xU22 97259 INFO  Executing command: "archive ..."
2023-03-25 01:34:27 xU22 97259 ERROR No active nodes. Shutting down...
2023-03-25 01:34:27 xU22 97259 ERROR Failed to upload resource: [...]
GNS Path: [...]
Error: Code: 500, Reason: CONNECTION_ERROR, Message: Cannot connect to websocket, Detailed Message: Cannot connect to websocket
2023-03-25 01:34:29 xU22 97259 INFO  Non-recursive Archive failed

Archive report
===============
Status: incomplete
Total files uploaded: 0/1 files [0B/1.5K]
Total files failed: 1/1 files [0B/1.5K]
        Connection Error: 1
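In a script, the Status line of the most recent archive report can be read from the log. The following is a minimal sketch based only on the report layout shown above; last_archive_status is a hypothetical helper name. If several slk commands write to the log at the same time, the last report need not belong to your own call.

```shell
# last_archive_status LOGFILE
# Print the value of the last "Status:" line in LOGFILE,
# e.g. "success" or "incomplete"; print nothing if there is none.
# (hypothetical helper for illustration, not part of slk)
last_archive_status() {
    awk '$1 == "Status:" { status = $2 } END { if (status != "") print status }' "$1"
}

# Example call:
# last_archive_status ~/.slk/slk-cli.log
```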

slk archive failed#

What to do when slk archive failed? If you are in a hurry, you can try the quick solution. However, if you have a bit more time, it might be good to find out why slk archive failed. The quick solution is appropriate in most situations, but not in all.

When slk archive fails, one or more files will probably be flagged as partial file. The important facts on such files are:

  • A file which is not flagged as partial file has been archived completely.

  • A file which is flagged as partial file may be an incomplete or a complete file.

  • slk list does not reliably highlight files flagged as partial file

  • slk_helpers has_no_flag_partial -v reliably lists all files flagged as partial file

  • a normal user cannot remove a partial file flag from a completely archived file

quick solution#

If the archival was interrupted, please run the same call of slk archive a second time. slk archive will only transfer those files which

  • have not already been archived,

  • have only been partly archived (internally flagged as partial file) or

  • have been modified since the first archival (see skip rules).

You can run slk archive repeatedly until it succeeds. Afterwards, please check for files flagged as partial and notify us via support@dkrz.de so that we can remove the (false) flags (details). Flagged files are blocked for retrieval.

Find out why archival failed#

You will find the most common reasons for slk archive to fail in the table below. If you experience other reasons for failure, please notify us so that we can extend this table.

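=============================================================  =======================================================================================
reason for failure                                             solution
=============================================================  =======================================================================================
manually killed by the user (e.g. via CTRL + C)                re-run same slk archive call
broken ssh connection                                          re-run same slk archive call
timeout of a SLURM job                                         re-run same slk archive call
archival of a large amount of data (> 3 TB)                    split archival into multiple calls of slk archive; contact support@dkrz.de when > 50 TB
killed by the operating system (e.g. allowed memory exceeded)  do not run slk archive on a Levante login node; allocate more memory (--mem=6GB)
connection timeout to StrongLink                               wait a few minutes and then re-run same slk archive call
no permissions to write into destination path                  obtain permissions
source file(s) do not exist                                    check source files or command
=============================================================  =======================================================================================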

partial file flag#

Files which are incompletely archived are flagged as partial file. However, also completely archived files might be flagged as partial file when slk archive is interrupted directly after these files were completely archived.

slk list appends (partial file) in most situations when a file is flagged as partial file. However, when permissions, ownership, group, path or name of the flagged file were changed, the info (partial file) is not printed by slk list. Therefore, please do not use slk list to determine whether a file is flagged as partial file or not. Please use slk_helpers has_no_flag_partial -v for this purpose.

When you run the failed slk archive command again, missing or incomplete files are archived properly and the partial file flag is removed. However, completely archived files are not touched and the partial file flag will not be removed from these files.

Listing incompletely archived files#

Please run slk_helpers has_no_flag_partial to quickly get a list of possibly incompletely archived files.

Please submit a verify job with slk_helpers submit_verify_job and later collect its results with slk_helpers result_verify_job in order to get a list of actually incompletely archived files.

Background on what happens when slk archive fails#

When slk archive starts to archive files, it first creates a 0-byte file in the destination location for each source file. Each of these 0-byte files is flagged as partial file. The actual size of each source file is stored hidden in StrongLink. When a file has been completely archived, slk and StrongLink need some time until the partial file flag is removed. The time span between completing the archival and removal of the flag increases with the amount of transferred data and with the load on the connected StrongLink node. When slk loses the connection to StrongLink before the flag has been removed, the flag remains set.

If a call of slk archive, which transfers many files, is killed abruptly, each destination file will have one of these three conditions:

  1. file is complete; partial file is not set anymore

  2. file is complete; partial file is still set

  3. file is incomplete; partial file is still set

slk list appends (partial file) to each freshly archived file with this flag. However, when permissions, ownership, group, path or name of the flagged file were changed, the info (partial file) is not printed by slk list. Therefore, slk list does not reliably print information on this flag. You can list all partial file-flagged files in a namespace by slk_helpers has_no_flag_partial -R -v <namespace>.

When slk archive is run a second time, it skips all files, which are already in the destination and match in size and modification date. The partial file flag is ignored when this comparison is done. All other files are archived (again) and their partial file flags are removed. However, the partial file flag is not removed from skipped files because their metadata is not touched at all.

Therefore, the partial file flag is a necessary but not a sufficient condition for a file being actually incomplete: each incompletely archived file is flagged as partial file, but not every flagged file is incomplete.

Files which are flagged as partial file are blocked for retrieval. A user has no possibility to remove a partial file flag from a completely archived file. This has to be done by the StrongLink support. If you own such files, please contact us via support@dkrz.de and send us a list of these files. In advance, please make sure via slk archive -vv that these files were actually completely archived. This can also be checked via a verify job.

example failed archival#

We want to archive some netCDF files from the current folder to /dkrz_test/techtalk/021. This archival fails and some files are flagged as partial files.

$ slk archive *.nc /dkrz_test/techtalk/021
# some reason ...
Non-recursive Archive failed

$ slk list /dkrz_test/techtalk/021
...   1.1G  ...  file_001gb_a.nc (Partial File)
...      0  ...  file_001gb_b.nc (Partial File)
...   1.1G  ...  file_001gb_c.nc
...   1.1G  ...  file_001gb_d.nc
...   1.1G  ...  file_001gb_e.nc (Partial File)
...   1.1G  ...  file_001gb_f.nc (Partial File)
...      0  ...  file_001gb_g.nc (Partial File)
...   1.1G  ...  file_001gb_h.nc (Partial File)
... 144.4M  ...  file_001gb_i.nc (Partial File)
...      0  ...  file_001gb_j.nc (Partial File)
Files: 10

If we now modify the permissions of one file, slk list does not print the (Partial File) info anymore.

$ slk chmod +r /dkrz_test/techtalk/021/file_001gb_i.nc

$ slk list /dkrz_test/techtalk/021
...   1.1G  ...  file_001gb_a.nc (Partial File)
...      0  ...  file_001gb_b.nc (Partial File)
...   1.1G  ...  file_001gb_c.nc
...   1.1G  ...  file_001gb_d.nc
...   1.1G  ...  file_001gb_e.nc (Partial File)
...   1.1G  ...  file_001gb_f.nc (Partial File)
...      0  ...  file_001gb_g.nc (Partial File)
...   1.1G  ...  file_001gb_h.nc (Partial File)
... 144.4M  ...  file_001gb_i.nc
...      0  ...  file_001gb_j.nc (Partial File)
Files: 10

We can simply run the failed archival command again as shown further below. If this is not possible or fails again, one should verify the files as shown next.

In order to see which files are actually flagged as partial files, we could run slk_helpers has_no_flag_partial -v as described further below. In order to find out which files are actually incomplete/partial, we could run a verify job (see below for details).

# run verify job
$ slk_helpers submit_verify_job /dkrz_test/techtalk/021 -R
Submitting up to 1 verify job(s) based on results of search id 732325:
search results: pages 1 to 1 of 1; visible search results: 10; submitted verify job: 247340
Number of submitted verify jobs: 1

# wait for job to be finished
$ slk_helpers job_status 247340
QUEUED (28)
$ slk_helpers job_status 247340
COMPLETED

# collect results when the job has completed
$ slk_helpers result_verify_job 247340
Errors:
Resource content size does not match record: /dkrz_test/techtalk/021/file_001gb_g.nc
Resource content size does not match record: /dkrz_test/techtalk/021/file_001gb_j.nc
Resource content size does not match record: /dkrz_test/techtalk/021/file_001gb_b.nc
Resource content size does not match record: /dkrz_test/techtalk/021/file_001gb_i.nc
Resource content size does not match record: /dkrz_test/techtalk/021/file_001gb_e.nc
Erroneous files: 5

The five listed files are defective. The defects of files *_b.nc, *_g.nc, *_i.nc and *_j.nc are obvious from their sizes. However, the file *_e.nc is also defective, which cannot be seen directly in the output of slk list above.
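If the list of affected files is needed in a script, the paths can be cut out of the result. The sketch below only handles the error message shown in this example; other messages that result_verify_job may print are not covered. failed_verify_paths is a hypothetical helper name.

```shell
# failed_verify_paths
# Read the output of "slk_helpers result_verify_job <job id>" from stdin
# and print the paths reported with a size mismatch, one per line.
# (hypothetical helper for illustration, not part of slk_helpers)
failed_verify_paths() {
    sed -n 's/^Resource content size does not match record: //p'
}

# Example call, using the job id from above:
# slk_helpers result_verify_job 247340 | failed_verify_paths
```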

We run slk archive again:

$ slk archive *.nc /dkrz_test/techtalk/021 -vv
file_001gb_a.nc SKIPPED
file_001gb_b.nc SUCCESSFUL
file_001gb_c.nc SKIPPED
file_001gb_d.nc SKIPPED
file_001gb_e.nc SUCCESSFUL
file_001gb_f.nc SKIPPED
file_001gb_g.nc SUCCESSFUL
file_001gb_h.nc SKIPPED
file_001gb_i.nc SUCCESSFUL
file_001gb_j.nc SUCCESSFUL
Non-recursive Archive completed

After that, we can list the namespace’s content again:

$ slk list /dkrz_test/techtalk/021
... 1.1G  ...  file_001gb_a.nc
... 1.1G  ...  file_001gb_b.nc
... 1.1G  ...  file_001gb_c.nc
... 1.1G  ...  file_001gb_d.nc (Partial File)
... 1.1G  ...  file_001gb_e.nc
... 1.1G  ...  file_001gb_f.nc
... 1.1G  ...  file_001gb_g.nc
... 1.1G  ...  file_001gb_h.nc (Partial File)
... 1.1G  ...  file_001gb_i.nc
... 1.1G  ...  file_001gb_j.nc
Files: 10

The two files *_d.nc and *_h.nc, which were already completely archived during the first archival and skipped during the second, are still flagged as partial file. The command slk_helpers has_no_flag_partial -v will return the same. These flags cannot be removed by users. Retrieval of flagged files is not permitted. Please send an email to support@dkrz.de to have the flags removed.

Validate archivals#

In this section we describe methods to identify defective files, e.g. incompletely archived files (partial files). Defective files can be archived again with the same slk archive command with which they were archived in the first place. Please make sure that slk archive finished correctly. Complete/intact files are automatically skipped. Methods to verify files are these:

  • check if a file has already been written to tape

  • check if a file is flagged as partial file

  • run a verify job and collect its results (waiting time possible)

  • compare checksums of the source file and from StrongLink

When you archive important data, we strongly recommend running at least a verify job after archival. You can skip this if the files have already been written to tape. Additionally, for very important data, the checksums should be compared in order to identify bit flips or issues of which we are not aware.

Note

We are not aware of any data which were archived since January 2022 and were written to tape but had any defects. Therefore, we currently assume that a file is correct / complete if it is on tape or if a verify job did not report this file as being defective / incomplete.

check if a file is on tape#

StrongLink performs a verify job prior to writing a file to tape. Files which fail are not written to tape. Hence, a file being on tape implies that the file passed a verification. You can use the command slk_helpers is_on_tape to check whether a file or all files in a namespace have already been written onto tape.

# check a single file which is on tape
$ slk_helpers is_on_tape /arch/bm0146/k204221/iow/INDEX.txt
File is on tape

# check a directory of which all contained files are on tape
$ slk_helpers is_on_tape /arch/bm0146/k204221/iow -R
All files are on tape

# check a directory of which not all contained files are on tape
$ slk_helpers is_on_tape /dkrz_test/techtalk/001 -R
Not all files are on tape.

If you wish to print how many files were checked or to generate a list of files which have not been written to tape yet, please run the command with -v.

# check a single file which is on tape
$ slk_helpers is_on_tape /arch/bm0146/k204221/iow/INDEX.txt -R -v
Number of files stored on tape: 1/1

# check a directory of which all contained files are on tape
$ slk_helpers is_on_tape /arch/bm0146/k204221/iow -R -v
Number of files stored on tape: 23/23

# check a directory of which not all contained files are on tape
$ slk_helpers is_on_tape /dkrz_test/techtalk/001 -R -v
/dkrz_test/techtalk/001/file_01.txt is not on tape
/dkrz_test/techtalk/001/file_02.txt is not on tape
/dkrz_test/techtalk/001/file_00.txt is not on tape
Number of files stored on tape: 0/3

If you wish to print all checked files, please run the command with -vv.

# check a single file which is on tape
$ slk_helpers is_on_tape /arch/bm0146/k204221/iow/INDEX.txt -R -vv
/arch/bm0146/k204221/iow/INDEX.txt is on tape
Number of files stored on tape: 1/1

# check a directory of which all contained files are on tape
$ slk_helpers is_on_tape /arch/bm0146/k204221/iow -R -vv
/arch/bm0146/k204221/iow/iow_data_002.tar is on tape
/arch/bm0146/k204221/iow/iow_data_001.tar is on tape
[...]
/arch/bm0146/k204221/iow/iow_data2_003.tar is on tape
Number of files stored on tape: 23/23

# check a directory of which not all contained files are on tape
$ slk_helpers is_on_tape /dkrz_test/techtalk/001 -R -vv
/dkrz_test/techtalk/001/file_01.txt is not on tape
/dkrz_test/techtalk/001/file_02.txt is not on tape
/dkrz_test/techtalk/001/file_00.txt is not on tape
Number of files stored on tape: 0/3
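The summary line printed with -v or -vv can be evaluated in a script, for example to decide whether a verify job is still needed. This is a minimal sketch that relies only on the line format shown above; all_files_on_tape and its return values are hypothetical. Whether the exit code of is_on_tape itself already carries this information is not covered here.

```shell
# all_files_on_tape
# Read the output of "slk_helpers is_on_tape ... -v" from stdin.
# Return 0 if the summary line reports N/N files on tape,
# 1 if fewer files are on tape, and 2 if no summary line was found.
# (hypothetical helper for illustration, not part of slk_helpers)
all_files_on_tape() {
    counts=$(sed -n 's|^Number of files stored on tape: \([0-9][0-9]*\)/\([0-9][0-9]*\).*|\1 \2|p' | tail -n 1)
    [ -n "$counts" ] || return 2
    # split "N M" into the positional parameters $1 and $2
    set -- $counts
    [ "$1" -eq "$2" ]
}

# Example call:
# slk_helpers is_on_tape /arch/bm0146/k204221/iow -R -v | all_files_on_tape \
#     && echo "all files are on tape"
```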

check if file flagged as “partial”#

The only defects of which we are aware in files archived by users from Levante since January 2022 are/were incompletely archived files caused by aborted archivals. Incompletely archived files are flagged as partial file. Hence, checking a file for this flag is a simple way to see whether a file might be incomplete. However, completely archived files may also be flagged as partial when slk archive does not finish properly.

Note

A partial file is not necessarily incomplete but an incomplete file is definitely flagged as partial file.

Please use slk_helpers has_no_flag_partial -v to check whether one file or multiple files are flagged as partial file.

$ slk_helpers has_no_flag_partial /dkrz_test/netcdf/20230504c -R -v
/dkrz_test/netcdf/20230504c/file_500mb_d.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_f.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_g.nc has partial flag
Number of files without partial flag: 7/10
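To collect the flagged files in a list, for example to re-check them or to send them to support, the paths can be cut out of this output. A minimal sketch that relies only on the line format shown above; partial_flagged_paths is a hypothetical helper name.

```shell
# partial_flagged_paths
# Read the output of "slk_helpers has_no_flag_partial ... -v" from stdin
# and print the paths of all files reported with a partial flag.
# (hypothetical helper for illustration, not part of slk_helpers)
partial_flagged_paths() {
    sed -n 's/ has partial flag$//p'
}

# Example call:
# slk_helpers has_no_flag_partial /dkrz_test/netcdf/20230504c -R -v \
#     | partial_flagged_paths > flagged_files.txt
```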

Please do not trust the output of slk list with respect to the existence of the partial file flag because the flag might be hidden in some situations.

$ slk list /dkrz_test/netcdf/20230504c
-rwxr-xr-x- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_d.nc
-rw-r--r--- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_e.nc
-rw-r--r--- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_f.nc (Partial File)
-rw-r--r--- k204221     bm0146        554.0M   19 Jul 2021 02:18 file_500mb_g.nc (Partial File)
Files: 4

The (Partial File) info is not displayed if the file was moved or renamed or if the permissions, group or owner of the file were changed. This is a known slk bug.

If you wish a partial file flag to be removed please do as follows.

  1. Check whether a file is flagged as partial file with the command slk_helpers has_no_flag_partial -v <FILE_PATH> (validation: check if file flagged as “partial”).

  2. Run slk archive again. If the file is skipped, it should be correct. Please be aware that bit flips and other bit-wise corruptions are not captured by this.

  3. Please report properly archived flagged files to support@dkrz.de so that we can have the flag removed.

See also

Further examples on the usage of slk_helpers has_no_flag_partial are on page slk usage examples.

run verify job#

When slk archive is aborted, some files might only be incompletely/partially archived. In the beginning of the archival of files, StrongLink stores the expected size of each source file in its database. The expected size is hidden to a normal user but a user can tell StrongLink to compare expected size of a file against its real size in order to identify incompletely archived files. This is done by verify jobs.

Verify jobs can only target files in the cache. StrongLink automatically runs a verify job on each file it wants to write to tape. Files which fail this verification are not written to tape.

Verify jobs are submitted via the command slk_helpers submit_verify_job. They run for a few seconds to a few minutes. The results of the verify job – i.e. a list of incomplete files or files with other issues – are obtained via the command slk_helpers result_verify_job. Verify jobs are submitted to the same queueing system as recall / retrieval jobs and might need to wait if the queue is full.

Please start a verify job as follows:

$ slk_helpers submit_verify_job /dkrz_test/netcdf/20230925a -R
Submitting up to 1 verify job(s) based on results of search id 576002:
search results: pages 1 to 1 of 1; visible search results: 10; submitted verify job: 176395
Number of submitted verify jobs: 1

A verify job with the id 176395 was submitted. It is in the same queue as recall jobs are. Thus, if many files are recalled and the StrongLink queue is well filled, verify jobs might need to wait some time until they are processed.

The command performs a search in the background which might take some time when StrongLink is under heavy load. You might run the command with -v in order to see at which point the command is waiting for StrongLink.

$ slk_helpers submit_verify_job /dkrz_test/netcdf/20230925a -R -v
Generating search query.
Search query is: '{"$and":[{"path":{"$gte":"/dkrz_test/netcdf/20230925a"}},{"smart_pool":"slpstor"}]}'.
Starting search query.
Search ID is: 576005.
Search continuing. ......
Submitting up to 1 verify job(s) based on results of search id 576005:
Collecting search results from page 1 to page 1
    Collecting search results  1 to 1000
Collected 10 search results from page 1 to page 1
Generate verify query
Submit verify query
search results: pages 1 to 1 of 1; visible search results: 10; submitted verify job: 176396
Number of submitted verify jobs: 1
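
In a script, the job id has to be taken from this output. The following is a minimal sketch, assuming the ids always appear as "submitted verify job: <ID>" as in the examples above. The function name extract_verify_job_ids is hypothetical.

```shell
# Hypothetical helper (not part of slk_helpers): read the output of
# 'slk_helpers submit_verify_job' from stdin and print the id of each
# submitted verify job, one id per line.
# Assumption: ids are printed as "submitted verify job: <ID>".
extract_verify_job_ids() {
    sed -n 's/.*submitted verify job: \([0-9][0-9]*\).*/\1/p'
}
```

A possible call would be: job_id=$(slk_helpers submit_verify_job /dkrz_test/netcdf/20230925a -R | extract_verify_job_ids). If more than one verify job is submitted, the variable holds several lines.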

When the verify job has been started please wait until it is finished. The job status is checked as follows:

$ slk_helpers job_status 176395
QUEUED (31)

$ slk_helpers job_status 176395
QUEUED (5)

$ slk_helpers job_status 176395
PROCESSING

$ slk_helpers job_status 176395
PROCESSING

# wait a few seconds or minutes ...
$ slk_helpers job_status 176395
COMPLETED
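
The repeated status checks can be wrapped into a small polling loop. This is a sketch only: the states QUEUED, PROCESSING and COMPLETED are taken from the example above, whereas treating every other output as a failure is an assumption. The function name wait_for_job and the default interval of 30 seconds are hypothetical.

```shell
# Hypothetical helper: poll 'slk_helpers job_status <JOB_ID>' until the
# job is neither queued nor processing anymore.
# Usage: wait_for_job JOB_ID [POLL_INTERVAL_SECONDS]
# Returns 0 if the final status is COMPLETED and 1 for any other status.
# Assumption: any status other than QUEUED/PROCESSING/COMPLETED (including
# empty output) is treated as a failure.
wait_for_job() {
    job_id=$1
    interval=${2:-30}
    while true; do
        status=$(slk_helpers job_status "$job_id")
        case "$status" in
            QUEUED*|PROCESSING*)
                # job still waiting or running: check again later
                sleep "$interval"
                ;;
            COMPLETED*)
                echo "job $job_id completed"
                return 0
                ;;
            *)
                >&2 echo "job $job_id ended with status: $status"
                return 1
                ;;
        esac
    done
}
```

For example, wait_for_job 176395 60 && slk_helpers result_verify_job 176395 would fetch the results once the job has finished.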

The results of the verify job can be fetched via slk_helpers result_verify_job:

$ slk_helpers result_verify_job 176395
Errors:
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_b.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_c.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_a.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_f.nc
Erroneous files: 4

Four size-mismatch errors were detected. In this case, these files should be re-archived or deleted from the archive. Resource content size does not match record is the default error when files were incompletely archived. Thus, it is the most common output. If the result_verify_job command detects an unexpected error or an error which can only be solved by the DKRZ support, it will tell the user to notify the DKRZ staff.

$ slk_helpers result_verify_job 247338
Warning: Missing key in the JSON input: attributes.best_store; resource id:         80527401010; resource path: /dkrz_test/techtalk/020/file_001gb_a.nc
Warning: Resource has an unclear caching state; resource id: 80527401010; resource  path: /dkrz_test/techtalk/020/file_001gb_a.nc
Errors:
Resource content size does not match record: /dkrz_test/techtalk/020/file_001gb_e.nc
Resource content size does not match record: /dkrz_test/techtalk/020/file_001gb_f.nc
File not found: /dkrz_test/techtalk/020/file_001gb_a.nc
Erroneous files: 3 (some errors have to be solved by the DKRZ support; please contact support@dkrz.de)
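
To re-archive or delete the affected files, their paths have to be extracted from this output. The following is a minimal sketch, assuming the error lines have exactly the format shown above. The function name size_mismatch_files is hypothetical.

```shell
# Hypothetical helper: read the output of 'slk_helpers result_verify_job'
# from stdin and print the paths of all files with a size mismatch.
# Other errors (e.g. "File not found") and warnings are ignored here.
size_mismatch_files() {
    sed -n 's/^Resource content size does not match record: //p'
}
```

A possible call would be: slk_helpers result_verify_job 176395 | size_mismatch_files > incomplete_files.txt. Errors which have to be solved by the DKRZ support are not covered by this sketch.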

compare checksums#

StrongLink calculates two types of checksums for files: sha512 and adler32. It might take a few hours after the archival until the checksums are calculated. If no checksum is available a day after the archival finished and the file size is larger than 0 bytes, please contact support@dkrz.de.

The checksums from StrongLink are obtained via slk_helpers checksum RESOURCE. The sha512 checksum of a local file is calculated via sha512sum.

# archive a file
$ slk archive test.nc /arch/bm0146/k204221/test_data
[========================================\] 100% complete. Files archived: 1/1, [1.7K/1.7K].

# wait some hours ...

# calculate the checksum of the local file
$ sha512sum test.nc
22ef50dcbd179775b5a6e632b02d8b99ddf16609f342a66c1fae818ed42a49d5a33af3dd8e059fa7a743f5b615620f2ad87a3d01bf3e2e0cde0e8a607bc1f15d  test.nc

# get the checksum of the archived file
$ slk_helpers checksum -t sha512 /arch/bm0146/k204221/test_data/test.nc
22ef50dcbd179775b5a6e632b02d8b99ddf16609f342a66c1fae818ed42a49d5a33af3dd8e059fa7a743f5b615620f2ad87a3d01bf3e2e0cde0e8a607bc1f15d
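
Instead of comparing the two long strings by eye, the comparison can be left to the shell. This is a minimal sketch, assuming that slk_helpers checksum -t sha512 prints the checksum as the first field of its output as shown above. The function name sha512_matches is hypothetical.

```shell
# Hypothetical helper: compare the sha512 checksum of a local file with a
# given checksum string (e.g. the output of 'slk_helpers checksum -t sha512').
# Usage: sha512_matches LOCAL_FILE EXPECTED_CHECKSUM
# Returns 0 if both checksums are equal and 1 otherwise (also if the
# local checksum could not be calculated).
sha512_matches() {
    local_sum=$(sha512sum "$1" | awk '{ print $1 }')
    [ -n "$local_sum" ] && [ "$local_sum" = "$2" ]
}
```

For the example above, the call could look like this: sha512_matches test.nc "$(slk_helpers checksum -t sha512 /arch/bm0146/k204221/test_data/test.nc)" && echo "checksums are equal".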

Archival wrapper for SLURM#

In contrast to slk retrieve, we do not provide SLURM wrapper scripts for slk archive in the slk module on Levante. Instead, you will find several SLURM script templates for archivals below.

Archival script templates#

Several script templates for different use cases are printed below and available for download.

When you use these templates, you need to make a few adaptations (not every script needs all of them):

  • modify src_folder: replace /work/xz1234/elp/ma/xe by the actual source folder on the Lustre file system

  • modify target_folder: replace /arch/xz1234/${USER}/ex/am/ple by something appropriate for your project

  • modify src_file: replace file.nc by the name of the file which should be archived

Please run/submit these scripts via sbatch as described in Run slk as batch job and SLURM Introduction.

archive multiple files#

#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=arch_files # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=shared # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --mem=6GB

# make 'module' available when script is submitted from certain environments
source /sw/etc/profile.levante

# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set the source folder
src_folder=/work/xz1234/elp/ma/xe
# set target folder for archival
target_folder=/arch/xz1234/${USER}/ex/am/ple

# ~~~~~~~~~~~~ archivals ~~~~~~~~~~~~
# do the archival
echo "doing 'slk archive'"
# ~~~~~~~~~~~~ doing multi-file archival ~~~~~~~~~~~~
# You can archive multiple files at once -- either by listing them or by
# using wildcard expressions.
slk archive -vv ${src_folder}/file01.nc ${src_folder}/file02.nc ${src_folder}/*.tar ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call"
else
    echo "archival of all files successful"
fi

several archivals of single files#

#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_arch_job # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=shared # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --mem=6GB

# make 'module' available when script is submitted from certain environments
source /sw/etc/profile.levante

# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set the source folder
src_folder=/work/xz1234/elp/ma/xe
# set target folder for archival
target_folder=/arch/xz1234/${USER}/ex/am/ple

# ~~~~~~~~~~~~ archivals ~~~~~~~~~~~~
# do the archival
echo "doing 'slk archive'"
# ~~~~~~~~~~~~ doing single-file archivals ~~~~~~~~~~~~
# You can do multiple archivals in one script. The exit code of each
# archival should be captured afterwards (get $? in line after slk command)
slk archive ${src_folder}/file01.nc ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call 1"
else
    echo "archival 1 successful"
fi

# second archival and capture exit code (get $? in line after slk cmd)
slk archive ${src_folder}/file02.nc ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call 2"
else
    echo "archival 2 successful"
fi
# ...
# ...
# fifteenth archival and capture exit code (get $? in line after slk cmd)
slk archive ${src_folder}/file15.nc ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call 15"
else
    echo "archival 15 successful"
fi
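
The repeated blocks above can also be written as a loop. The following is a sketch under the assumption that slk archive returns a non-zero exit code on failure, as the checks in the template imply. The function name archive_each is hypothetical.

```shell
# Hypothetical helper: archive each given file with its own 'slk archive'
# call and count the failed calls.
# Usage: archive_each TARGET_FOLDER FILE [FILE ...]
# Returns 0 if all archivals succeeded and 1 otherwise.
archive_each() {
    target=$1
    shift
    failed=0
    for src in "$@"; do
        if slk archive "$src" "$target"; then
            echo "archival successful: $src"
        else
            >&2 echo "an error occurred in slk archive call for: $src"
            failed=$((failed + 1))
        fi
    done
    echo "failed archivals: $failed"
    [ "$failed" -eq 0 ]
}
```

Within the template it could be called as archive_each ${target_folder} ${src_folder}/file*.nc. A single slk archive call with many files remains the recommended variant; the loop is only meant for cases in which separate calls are required.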

archival of one file with delayed checksum check#

This template/example consists of two files:

archive_slurm_template_single_file_with_checksum_check.sh#

#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_arch_job # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=shared # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --mem=6GB

# make 'module' available when script is submitted from certain environments
source /sw/etc/profile.levante

# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set the source folder
src_folder=/work/xz1234/elp/ma/xe
src_file=file.nc
# set target folder for archival
target_folder=/arch/xz1234/${USER}/ex/am/ple
# set a file to write the result of the checksum comparison into
checksum_result_file=${src_folder}/${src_file}.chk

# ~~~~~~~~~~~~ archivals ~~~~~~~~~~~~
# do the archival
echo "doing 'slk archive'"
# We run the archival and capture the exit code ...
slk archive ${src_folder}/${src_file} ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call"
    exit 1
else
    echo "archival successful"
fi

# ... then we calculate the checksum and ...
checksum_src_file_raw=`sha512sum ${src_folder}/${src_file}`
if [ $? -ne 0 ]; then
    >&2 echo "checksum could not be calculated"
    exit 1
else
    echo "calculation of checksum successful: ${checksum_src_file_raw}"
fi
echo $checksum_src_file_raw > ${src_folder}/${src_file}.sha512

# ... submit a delayed job for retrieving the checksum from StrongLink
sbatch --begin="now+2hours" --account=${SLURM_JOB_ACCOUNT} ./archive_slurm_template_get_and_compare_checksum.sh ${src_folder}/${src_file}.sha512 ${target_folder}/${src_file} ${checksum_result_file}

archive_slurm_template_get_and_compare_checksum.sh#

#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_checksum # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=shared # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --mem=6GB

# make 'module' available when script is submitted from certain environments
source /sw/etc/profile.levante

# ~~~~~~~~~~~~ get and print arguments ~~~~~~~~~~~~
if [ "$#" -ne 3 ]; then
    echo -1
    >&2 echo "need three input arguments (got $#): FILE_CONTAINING_CHECKSUM_OF_SRC_FILE RESOURCE_PATH_HSM CHECKSUM_COMPARISON_RESULT_FILE"
    exit 1
fi

checksum_file=$1
resource_path_hsm=$2
checksum_result_file=$3

echo "~~~ got this input: ~~~"
echo "checksum_file: ${checksum_file}"
echo "resource_path_hsm: ${resource_path_hsm}"
echo "checksum_result_file: ${checksum_result_file}"


# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk


# ~~~~~~~~~~~~ get source file's checksum ~~~~~~~~~~~~
if [ ! -f ${checksum_file} ]; then
    >&2 echo "file containing the checksum of the source file does not exist: '${checksum_file}'"
    exit 1
fi
checksum_src_file_raw=`cat ${checksum_file}`
checksum_src_file=`echo ${checksum_src_file_raw} | awk '{ print $1 }'`


# ~~~~~~~~~~~~ check if HSM file is available ~~~~~~~~~~~~
# first we check whether the resource/file actually exists in the HSM
echo "doing 'slk_helpers exists'"
slk_helpers exists ${resource_path_hsm}
exit_code=$?
if [ $exit_code -ne 0 ]; then
    if [ $exit_code -eq 1 ]; then
        >&2 echo "file '${resource_path_hsm}' does not exist in HSM; stop obtaining a checksum"
        exit 1
    else
        >&2 echo "an unknown error occurred in 'slk_helpers exists ${resource_path_hsm}' call; exit code: ${exit_code}"
        exit 1
    fi
else
    echo "file exists in HSM ('$resource_path_hsm')"
fi


# ~~~~~~~~~~~~ get HSM checksum ~~~~~~~~~~~~
echo "doing 'slk_helpers checksum -t sha512'"
# We request the checksum from StrongLink and capture the exit code ...
checksum_hsm_file_raw=`slk_helpers checksum -t sha512 ${resource_path_hsm}`
exit_code=$?
if [ $exit_code -ne 0 ]; then
    if [ $exit_code -eq 1 ]; then
        echo "checksum of '${resource_path_hsm}' not yet calculated by StrongLink; resubmitting this job"
        # pass the checksum file (not the checksum itself) because the script expects a file as first argument
        sbatch --begin="now+2hours" --account=${SLURM_JOB_ACCOUNT} ${0} ${checksum_file} ${resource_path_hsm} ${checksum_result_file}
        exit 0
    else
        >&2 echo "an error occurred in slk_helpers checksum call; exit code: ${exit_code}"
        exit 1
    fi
else
    echo "getting checksum successful"
fi
checksum_hsm_file=`echo ${checksum_hsm_file_raw} | awk '{ print $1 }'`


# ~~~~~~~~~~~~ compare if checksums are equal ~~~~~~~~~~~~
echo "Result of checksum comparison will be written into ${checksum_result_file} (first line: 0 == checksums equal; 1 == checksums differ)"
if [ "${checksum_src_file}" = "${checksum_hsm_file}" ]; then
    echo "checksums are equal: ${checksum_src_file}"
    exit_code=0
else
    echo "checksums are unequal: ${checksum_src_file} and ${checksum_hsm_file}"
    exit_code=1
fi
echo "${exit_code}" > ${checksum_result_file}
echo "# 0 == checksums equal; 1 == checksums differ" >> ${checksum_result_file}
echo "checksum src file: ${checksum_src_file_raw}" >> ${checksum_result_file}
echo "checksum HSM file: ${checksum_hsm_file} ${resource_path_hsm}" >> ${checksum_result_file}

exit ${exit_code}
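
The result file written by this script can be evaluated later, for example before the source file is deleted. The following is a minimal sketch based on the file layout produced above (first line 0 or 1). The function name checksums_equal is hypothetical.

```shell
# Hypothetical helper: evaluate the result file written by
# archive_slurm_template_get_and_compare_checksum.sh.
# Usage: checksums_equal CHECKSUM_RESULT_FILE
# Returns 0 if the first line of the file is "0" (checksums equal) and
# 1 otherwise (checksums differ, or the file is missing / not yet written).
checksums_equal() {
    [ -f "$1" ] && [ "$(head -n 1 "$1")" = "0" ]
}
```

With the variable names of the first template, a call could look like this: checksums_equal ${src_folder}/${src_file}.chk && echo "archived copy verified". A missing result file only means that the delayed job has not finished yet.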