Archivals to tape#

file version: 08 Dec 2023

current software versions: slk version 3.3.91; slk_helpers version 1.10.2; slk wrappers 1.2.1

Introduction and Summary#

The slk archive command is available on all nodes of Levante. If you plan to archive a large file (larger than a few GB) or many files, please do not do this on login nodes but via the partitions compute, shared or interactive. Please allocate 6 GB of memory (--mem=6GB). If your slk is killed with a message like /sw/[...]/bin/slk: line 16: [...] Killed, please inform the DKRZ support (support@dkrz.de) and allocate 8 GB or 10 GB of memory. If you wish to use slk archive interactively, please start an interactive batch session via the interactive partition with salloc as follows (see also Run slk in the “interactive” partition; details on salloc: Data Processing on Levante):

salloc --mem=6GB --partition=interactive --account=YOUR_PROJECT_ACCOUNT

Warning

Please avoid archiving more than approximately 5 TB with one call of slk archive. Archiving larger amounts of data at once might cause slk archive to fail (details: How much data can I archive at once?).

Warning

Archiving files that are too small or too large causes problems. The optimal file size is between 10 GB and 200 GB. The transfer speed between tape and HSM cache is lower for files larger than approximately 250 GB due to cache limitations. Please do not archive many small files but pack them into tar balls or similar. You can use packems for this purpose. It is more efficient and faster to retrieve one file of 1 GB size than ten files of 1 MB size. The read rate reaches up to 300 MB/s when data are streamed continuously. The long waiting time for tape retrievals is mainly caused by (a) the time the robot arm needs to transport a tape to the tape drive and (b) the waiting time for free tape drives. Additionally, many single read operations put more stress on the tapes, which increases the probability of tape failures. However, when a file of 1 TB or more is archived, its retrieval will take very long. This is disadvantageous if one only needs a small part of such a large file.
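Packing small files with plain tar could look like the following minimal sketch. All directory and file names are made up and are created in a temporary folder; packems, mentioned above, is not shown here.

```shell
#!/bin/bash
# Sketch: pack all files of one directory into a single tar ball.
# All paths are placeholders created in a temporary directory.
work_dir=$(mktemp -d)
mkdir "${work_dir}/small_files"
for i in 01 02 03; do
    echo "dummy content ${i}" > "${work_dir}/small_files/file_${i}.txt"
done

# -c: create, -f: name of the tar ball, -C: change into the directory
# first so that the tar ball contains relative paths only
tar_ball="${work_dir}/small_files.tar"
tar -cf "${tar_ball}" -C "${work_dir}" small_files

# list the content to verify that all files were packed
n_packed=$(tar -tf "${tar_ball}" | grep -c 'file_.*\.txt$')
echo "packed ${n_packed} files into ${tar_ball}"

# the tar ball (not the single files) would then be archived, e.g.:
#   slk archive "${tar_ball}" <TARGET_NAMESPACE>
```

Listing the tar ball before the archival is a cheap way to notice an incomplete pack while the source files still exist.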

Warning

If the transfer of a file via slk archive is interrupted, an incomplete version of this file will remain in the HSM. This incomplete file is listed by slk list, has a size of 0 bytes and can be retrieved. We strongly recommend running the same call of slk archive a second time. This will archive all missing and incomplete files again. In contrast to what we communicated in the past, it is not sufficient to check the existence of a checksum in StrongLink. Please see What to do when slk archive was interrupted/killed? and Check the integrity / completeness of archived files for details.

Useful information on slk archive#

  • High memory usage: Please allocate 6 GB of memory for each call of slk archive (argument for sbatch: --mem=6GB). Otherwise, your commands might be killed by the operating system. If you plan to run three archivals in parallel, please allocate 18 GB – and so on.

  • Check exit codes: Information on the success or failure of slk archive will not be printed into the SLURM log automatically. We strongly suggest checking the exit code of each slk archive call and printing it to the job log. The variable $? holds the exit code of the preceding command (see How do I capture exit codes?; usage examples: our example scripts).

  • Group of archived files: slk archive always sets the group of an archived file to the default group of the archiving user. The group is not set to the group of the target namespace, i.e. the group of the project into whose namespace the file is archived. The group has to be adapted manually.

  • Resume an archival: The same call of slk archive can be run a second time if it did not finish properly the first time. Files which were fully transferred the first time will be skipped when slk archive is run a second time. Partly/incompletely archived files will be copied again. If a file was modified between the two calls, it will be copied again, too. slk archive compares file size and time stamp.

  • Amount of data archived at once: We suggest archiving no more than approximately 5 TB with one call of slk archive (details: How much data can I archive at once?).
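The memory rule from the first item of this list can be written down as a small calculation. A sketch, assuming three parallel archivals; the variable names are made up.

```shell
#!/bin/bash
# 6 GB per slk archive call, as recommended above
mem_per_archival_gb=6
n_parallel_archivals=3   # assumption for this example

mem_total_gb=$(( mem_per_archival_gb * n_parallel_archivals ))
echo "#SBATCH --mem=${mem_total_gb}GB"
```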

What to do when slk archive was interrupted/killed?#

Warning

In the past, we proposed that checking whether a checksum exists in StrongLink was sufficient to determine whether a file had been completely archived or not. This is not a recommended procedure anymore because it might fail. Please see Check the integrity / completeness of archived files for details.

short answer#

If the archival of several files was interrupted, please run the same call of slk archive a second time. slk archive will only transfer those files which

  • have not already been archived,

  • have only been partly archived (internally flagged as partial file) or

  • have been modified since the first archival.

You can identify partly archived files via verify jobs. Please also check whether additional files are flagged as partial. Completely archived files might be falsely flagged as partial file under certain conditions. These flagged files are blocked for retrieval. If you find any files flagged as partial please notify support@dkrz.de so that the flag is removed.

long answer#

A call of slk archive might be interrupted for the following reasons, among others:

  • manually killed by the user (e.g. via CTRL + C),

  • broken ssh connection,

  • timeout of a SLURM job,

  • archival of a large amount of data (> 10 TB) while the StrongLink system is under high load or

  • killed by the operating system (e.g. allowed memory exceeded).

The files which were being transferred when slk archive was interrupted are internally flagged as partial file. Some of these files might be incomplete and others might be complete (details further below). Incompletely archived files are not automatically deleted but will remain in StrongLink until the user takes further action. You can identify these files via verify jobs. Please be aware that incomplete files are listed by slk list and may even have a checksum.

Files which are flagged as partial are blocked for retrieval. slk list should print partial file for each file with such a flag. However, under certain conditions (a bug) partial file is not displayed. Therefore, please do not trust the output of slk list in this context and use slk_helpers has_no_flag_partial <TARGET_PATH> -R -v to check whether files are flagged as partial. If you own files which are flagged as partial but have been completely archived, please send a list to support@dkrz.de so that the flags are removed.

How does the false flagging happen? After a file has been transferred completely, slk archive does a quick check of the file. If slk archive is interrupted after the completion of the file transfer and before the quick check is done, the file is flagged as partial file although it has been completely archived. Repeated slk archive calls will skip this file and will not remove the flag. This is a bug. Please tell us if files are repeatedly listed as failed by slk archive -vv and not as skipped.

Example:

# run slk archive -vv => all files should be SKIPPED or SUCCESSFUL
# if files are FAILED => run slk archive -vv again
$ slk archive *.nc /dkrz_test/netcdf/20230914c -vv
file_001gb_a.nc SKIPPED
file_001gb_b.nc SKIPPED
file_001gb_c.nc SKIPPED
file_001gb_d.nc SUCCESSFUL
Non-recursive Archive completed

# check if partial flag is set for any file
$ slk_helpers has_no_flag_partial -R -v /dkrz_test/netcdf/20230914c
/dkrz_test/netcdf/20230914c/file_001gb_b.nc has partial flag
/dkrz_test/netcdf/20230914c/file_001gb_c.nc has partial flag
Number of files without partial flag: 2/4

# notify support@dkrz.de that files with partial flag are in
#  /dkrz_test/netcdf/20230914c
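The check described in the comments of the example above can be automated by counting the status words at the end of the output lines. A minimal sketch that works on captured output; it assumes that the per-file lines look exactly like in the example (file name, blank, status), the FAILED line in the sample is constructed for illustration, and the function name is made up.

```shell
#!/bin/bash
# count_status STATUS: read 'slk archive -vv' output from stdin and print
# how many lines end with the given status word
count_status() {
    # grep -c prints 0 but returns exit code 1 if nothing matches
    grep -c " $1\$" || true
}

# captured output; in a real job it could be written to a file first,
# e.g. via: slk archive -vv ... | tee archive.log
sample_output='file_001gb_a.nc SKIPPED
file_001gb_b.nc FAILED
file_001gb_c.nc SKIPPED
file_001gb_d.nc SUCCESSFUL
Non-recursive Archive completed'

n_failed=$(echo "${sample_output}" | count_status FAILED)
n_skipped=$(echo "${sample_output}" | count_status SKIPPED)
echo "failed: ${n_failed}, skipped: ${n_skipped}"
if [ "${n_failed}" -gt 0 ]; then
    echo "at least one file FAILED => run the same slk archive call again"
fi
```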

See also

A more detailed version of this example is on the page slk usage examples under Failed archival - check if files are flagged as partial

How much data can I archive at once?#

We suggest archiving no more than 5 TB to 10 TB with one call of slk archive. If you archive more than that while the StrongLink system is under high load, the transfer might be interrupted unexpectedly. The slk log (~/.slk/slk-cli.log) will show this error (look for unexpected end of stream on https://archive.dkrz.de/...):

2022-11-24 11:16:22 INFO  Executing command: "archive -R /work/ab1234/c567890/much_data /arch/zy0987/c567890/target
2022-11-24 11:18:25 ERROR Unexpected exception
java.io.IOException: unexpected end of stream on https://archive.dkrz.de/...
        at
okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:202)
~[slk-cli-tools-3.3.21.jar:?]
[...]
[...]
[...]
        ... 16 more
2022-11-24 11:18:25 INFO
Archive report
===============
Status: incomplete
Total files uploaded: 0/85083 files [0B/20.3T]

If you want to keep archiving more than 10 TB at once (e.g. 100 TB), please be prepared to run slk archive repeatedly. In the end, a summary similar to this one should be printed to the log:

2022-11-25 11:21:10 INFO
Archive report
===============
Status: incomplete
Total files uploaded: 0/85083 files [0B/20.3T]
Total files skipped: 85083/85083 files [20.3T/20.3T]
        Unchanged files: 85083

or this one:

2022-11-25 09:46:55 INFO
Archive report
===============
Status: incomplete
Total files uploaded: 4342/85083 files [1.2B/20.3T]
Total files skipped: 80741/85083 files [20.3T/20.3T]
    Unchanged files: 80741
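One way to stay below the suggested limit is to split a long file list into batches and to archive each batch with its own slk archive call. The sketch below groups files greedily by size. The function name, the tiny demo limit and the demo files are made up, file names containing blanks are not handled, and whether archiving per batch fits your directory layout has to be checked for your case.

```shell
#!/bin/bash
# make_batches LIMIT_BYTES FILE...: print one line per file in the form
# "<batch number> <file>"; a new batch is started whenever adding the next
# file would push the current batch above LIMIT_BYTES
make_batches() {
    local limit=$1; shift
    local batch=1 batch_size=0 size f
    for f in "$@"; do
        size=$(wc -c < "${f}")         # file size in bytes
        if [ "${batch_size}" -gt 0 ] && [ $(( batch_size + size )) -gt "${limit}" ]; then
            batch=$(( batch + 1 ))
            batch_size=0
        fi
        batch_size=$(( batch_size + size ))
        echo "${batch} ${f}"
    done
}

# demo with four files of 400 bytes each and a limit of 1000 bytes;
# for real archivals the limit would be around 5 TB
demo_dir=$(mktemp -d)
for i in 1 2 3 4; do
    head -c 400 /dev/zero > "${demo_dir}/file_${i}.nc"
done
batches=$(make_batches 1000 "${demo_dir}"/file_*.nc)
echo "${batches}"
```

In the demo, the first two files end up in batch 1 and the other two in batch 2. Each batch would then be passed to one slk archive call whose exit code is checked before the next batch is started.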

Validate archivals#

For the reasons given above, we ask you to check whether slk archive finished correctly when doing archivals.

If slk archive did not finish correctly, we strongly recommend re-running the same slk archive command or checking all archived files for completeness. Files might be archived incompletely if slk archive was killed manually, by a timeout of a SLURM job or by a disconnected ssh session. Currently, such files are also displayed by slk list and, rarely, checksums might be calculated for them. However, incomplete files have been displayed as files with 0 byte size since the StrongLink update in October 2022. If the archived files are very important, we recommend performing a validation / integrity check as described further below.

If slk archive finished properly (exit code 0), it can be assumed that the file(s) was/were archived completely and correctly. This assumption is supported by the results of extensive archival-retrieval tests. However, bit flips and similar events might occur and corrupt files. Therefore, if you archive very important data, please compare the checksums as described below.

Check that slk archive terminated properly#

Whether slk archive finished without an error can be checked as follows:

slk archive /path/to/file/to/be/archived /arch/proj/user/test
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call"
else
    echo "archival successful"
fi

slk archive might exit with a non-zero exit code or be killed due to a timeout of a SLURM job, an unexpected error, manual user interaction or similar. In this case, the archival can be resumed by calling the same slk archive command a second time (see also What to do when slk archive was interrupted/killed?). Completely archived files will not be archived again and incomplete files will be overwritten. When you run slk archive -vv, the archival status of each file will be printed. Possible statuses are SUCCESSFUL, SKIPPED and FAILED. If you wish to check for partially archived files, please have a look at the section Verify file size further below.
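Since a repeated call skips completely archived files, resuming can be scripted as a bounded retry loop. A minimal sketch: the function name and the number of attempts are assumptions, and the loop is demonstrated with a stand-in command so that it can be tried without slk.

```shell
#!/bin/bash
# run_with_retries MAX_ATTEMPTS CMD [ARGS...]: run the command until it
# returns exit code 0 or until MAX_ATTEMPTS is reached
run_with_retries() {
    local max_attempts=$1; shift
    local attempt=1
    while [ "${attempt}" -le "${max_attempts}" ]; do
        if "$@"; then
            echo "attempt ${attempt}: successful"
            return 0
        fi
        >&2 echo "attempt ${attempt}: command failed"
        attempt=$(( attempt + 1 ))
    done
    return 1
}

# stand-in for slk archive: fails on the first call, succeeds afterwards
state_file=$(mktemp)
flaky_cmd() {
    if [ -s "${state_file}" ]; then return 0; fi
    echo "called" > "${state_file}"
    return 1
}

if run_with_retries 3 flaky_cmd; then
    retry_result="ok"
else
    retry_result="failed"
fi
echo "overall result: ${retry_result}"

# real usage would look like:
#   run_with_retries 3 slk archive -vv /path/to/file/to/be/archived /arch/proj/user/test
```

A small upper bound on the attempts keeps a job from looping until its time limit when the error is permanent, e.g. a wrong target path.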

Check the integrity / completeness of archived files#

Note

In the past, we proposed that checking whether a checksum exists in StrongLink was sufficient to determine whether a file had been completely archived or not. This is not a recommended procedure anymore because it might fail.

There are three ways to check whether a file has been archived completely (and correctly):

  1. Run a verify job to verify the file size (Verify file size).

  2. Compare the checksums of the original file with the checksum that StrongLink calculated for the archived file (validation: comparing checksums).

  3. Retrieve the archived file and compare it against the original file (validation: retrieving the file). Please avoid this currently (12/2023).

When many recall jobs are running (== many retrievals from tape), the verify jobs have to wait in the same queue as the recall jobs. Thus, they might need to wait a long time. As an alternative, you might check whether files are flagged as partial files:

  1. Check whether a file is flagged as partial file with the command slk_helpers has_no_flag_partial -v <FILE_PATH> (validation: check if file flagged as “partial”).

  2. Run slk archive again. If the file is skipped, it should be correct. Please be aware that bit flips and other bit-wise corruptions are not captured by this.

  3. Please report flagged files which have been archived properly to support@dkrz.de so that we can have the flag removed.

Note

Please be aware that a file might be falsely flagged as partial file although it has been completely archived as described in What to do when slk archive was interrupted/killed?. If a file is listed as partial file by slk_helpers has_no_flag_partial, please run slk archive again for this or for all files. If partial files are skipped by slk archive they are falsely flagged. Please contact support@dkrz.de in this case.

See also

A detailed example on the usage of slk_helpers has_no_flag_partial is on page slk usage examples under Failed archival - check if files are flagged as partial.

Verify file size#

Note

In the past, we proposed that checking whether a checksum exists in StrongLink was sufficient to determine whether a file had been completely archived or not. This is not a recommended procedure anymore because it might fail.

If slk archive did not finish properly (non-zero exit code or killed for whatever reason), some files might only be partly archived. You can use verify jobs to identify such partial files.

Please start a verify job as follows:

$ slk_helpers submit_verify_job /dkrz_test/netcdf/20230925a -R
Submitting up to 1 verify job(s) based on results of search id 576002:
search results: pages 1 to 1 of 1; visible search results: 10; submitted verify job: 176395
Number of submitted verify jobs: 1

A verify job with the id 176395 was submitted. It waits in the same queue as recall jobs. Thus, if many files are being recalled and the StrongLink queue is well filled, verify jobs might need to wait some time until they are processed.

The command performs a search in the background which might take some time when StrongLink is under heavy load. You might run the command with -v in order to see at which point the command is waiting for StrongLink.

$ slk_helpers submit_verify_job /dkrz_test/netcdf/20230925a -R -v
Generating search query.
Search query is: '{"$and":[{"path":{"$gte":"/dkrz_test/netcdf/20230925a"}},{"smart_pool":"slpstor"}]}'.
Starting search query.
Search ID is: 576005.
Search continuing. ......
Submitting up to 1 verify job(s) based on results of search id 576005:
Collecting search results from page 1 to page 1
    Collecting search results  1 to 1000
Collected 10 search results from page 1 to page 1
Generate verify query
Submit verify query
search results: pages 1 to 1 of 1; visible search results: 10; submitted verify job: 176396
Number of submitted verify jobs: 1

When the verify job has been started please wait until it is finished. The job status is checked as follows:

$ slk_helpers job_status 176395
QUEUED (31)

$ slk_helpers job_status 176395
QUEUED (5)

$ slk_helpers job_status 176395
PROCESSING

$ slk_helpers job_status 176395
PROCESSING

# wait a few seconds or minutes ...
$ slk_helpers job_status 176395
COMPLETED
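Polling the job status can be wrapped into a loop. The sketch below takes the status command as arguments so that it can be tried without StrongLink access; the function name, the polling interval and the stand-in command are made up. Only the states shown above (QUEUED, PROCESSING, COMPLETED) are handled; any other output ends the loop with an error.

```shell
#!/bin/bash
# wait_for_job INTERVAL_SECONDS CMD [ARGS...]: call the status command
# repeatedly until it prints COMPLETED
wait_for_job() {
    local interval=$1; shift
    local status
    while true; do
        status=$("$@")
        case "${status}" in
            COMPLETED)
                echo "job completed"
                return 0 ;;
            QUEUED*|PROCESSING)
                sleep "${interval}" ;;
            *)
                >&2 echo "unexpected job status: '${status}'"
                return 1 ;;
        esac
    done
}

# stand-in for 'slk_helpers job_status <JOB_ID>': replays a fixed sequence
status_queue=$(mktemp)
printf 'QUEUED (5)\nPROCESSING\nCOMPLETED\n' > "${status_queue}"
fake_job_status() {
    head -n 1 "${status_queue}"
    tail -n +2 "${status_queue}" > "${status_queue}.tmp"
    mv "${status_queue}.tmp" "${status_queue}"
}

if wait_for_job 0 fake_job_status; then
    wait_result="completed"
else
    wait_result="error"
fi

# real usage would look like:
#   wait_for_job 60 slk_helpers job_status 176395
```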

The results of the verify job can be fetched via slk_helpers result_verify_job:

$ slk_helpers result_verify_job 176395
Errors:
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_b.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_c.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_a.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_f.nc
Erroneous files: 4

Four size-mismatch errors were detected. In this case, these files should be re-archived or deleted from the archive.
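To re-archive or delete the affected files, their paths have to be extracted from the job result. A minimal sketch that filters captured output with sed; it assumes that the error lines look exactly like the ones above, and the function name is made up.

```shell
#!/bin/bash
# print the paths of all files with a size mismatch; input: output of
# 'slk_helpers result_verify_job <JOB_ID>' on stdin
list_size_mismatches() {
    sed -n 's/^Resource content size does not match record: //p'
}

# shortened copy of the output shown above
sample_result='Errors:
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_b.nc
Resource content size does not match record: /dkrz_test/netcdf/20230925a/file_001gb_c.nc
Erroneous files: 2'

mismatches=$(echo "${sample_result}" | list_size_mismatches)
echo "${mismatches}"

# real usage would look like:
#   slk_helpers result_verify_job 176395 | list_size_mismatches > mismatches.txt
```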

validation: check if file flagged as “partial”#

Incompletely archived files are flagged as partial file. However, completely archived files may also be flagged as partial when slk archive does not finish properly. Please use slk_helpers has_no_flag_partial -v to check whether one file or multiple files are flagged as partial.

$ slk_helpers has_no_flag_partial /dkrz_test/netcdf/20230504c -R -v
/dkrz_test/netcdf/20230504c/file_500mb_d.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_f.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_g.nc has partial flag
Number of files without partial flag: 7/10

slk list should indicate which files are flagged as such but fails to do so in some situations.

$ slk list /dkrz_test/netcdf/20230504c
-rwxr-xr-x- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_d.nc
-rw-r--r--- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_e.nc
-rw-r--r--- k204221     bm0146        553.9M   19 Jul 2021 02:18 file_500mb_f.nc (Partial File)
-rw-r--r--- k204221     bm0146        554.0M   19 Jul 2021 02:18 file_500mb_g.nc (Partial File)
Files: 4

The Partial File label is not displayed if the file was moved or renamed or if the permissions, group or owner of the file were changed. This is a known slk bug.
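For a report to the support, the paths of the flagged files can be filtered from the output of slk_helpers has_no_flag_partial. A minimal sketch, assuming the output lines look like in the example above; the function name is made up.

```shell
#!/bin/bash
# print the paths of all files flagged as partial; input: output of
# 'slk_helpers has_no_flag_partial -R -v <PATH>' on stdin
list_partial_files() {
    sed -n 's/ has partial flag$//p'
}

# copy of the output shown above
sample_output='/dkrz_test/netcdf/20230504c/file_500mb_d.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_f.nc has partial flag
/dkrz_test/netcdf/20230504c/file_500mb_g.nc has partial flag
Number of files without partial flag: 7/10'

partial_files=$(echo "${sample_output}" | list_partial_files)
echo "${partial_files}"
```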

Note

Please be aware that a file might be falsely flagged as partial file although it has been completely archived as described in What to do when slk archive was interrupted/killed?. If a file is listed as partial file by slk_helpers has_no_flag_partial -v, please run slk archive again for this or for all files. If partial files are skipped by slk archive they are falsely flagged. Please contact support@dkrz.de in this case.

See also

A detailed example on the usage of slk_helpers has_no_flag_partial is on page slk usage examples under Failed archival - check if files are flagged as partial.

validation: comparing checksums#

StrongLink calculates two types of checksums for files: sha512 and adler32. It might take a few hours after the archival until the checksums are calculated. If no checksum is available a day after the archival finished and the file size is larger than 0 byte, please contact support@dkrz.de.

The checksums from StrongLink are obtained via slk_helpers checksum RESOURCE. The sha512 checksum of a local file is calculated via sha512sum.

# archive a file
$ slk archive test.nc /arch/bm0146/k204221/test_data
[========================================\] 100% complete. Files archived: 1/1, [1.7K/1.7K].

# wait some hours ...

# calculate the checksum of the local file
$ sha512sum test.nc
22ef50dcbd179775b5a6e632b02d8b99ddf16609f342a66c1fae818ed42a49d5a33af3dd8e059fa7a743f5b615620f2ad87a3d01bf3e2e0cde0e8a607bc1f15d  test.nc

# get the checksum of the archived file
$ slk_helpers checksum -t sha512 /arch/bm0146/k204221/test_data/test.nc
22ef50dcbd179775b5a6e632b02d8b99ddf16609f342a66c1fae818ed42a49d5a33af3dd8e059fa7a743f5b615620f2ad87a3d01bf3e2e0cde0e8a607bc1f15d
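The comparison itself can be delegated to sha512sum, which in check mode reads lines of the form "checksum, two blanks, file name". A sketch with a temporary demo file: the variable hsm_checksum stands for the output of slk_helpers checksum -t sha512 and is filled from the local file here only to keep the example self-contained.

```shell
#!/bin/bash
demo_file=$(mktemp)
echo "some test data" > "${demo_file}"

# on Levante this would be the checksum stored in StrongLink:
#   hsm_checksum=$(slk_helpers checksum -t sha512 /arch/bm0146/k204221/test_data/test.nc)
hsm_checksum=$(sha512sum "${demo_file}" | awk '{ print $1 }')

# sha512sum -c verifies the local file against the given checksum;
# exit code 0 == checksums equal
if echo "${hsm_checksum}  ${demo_file}" | sha512sum -c - > /dev/null 2>&1; then
    check_result="equal"
else
    check_result="differ"
fi
echo "checksums ${check_result}"
```

Letting sha512sum do the comparison avoids comparing two 128-character strings by eye.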

validation: retrieving the file#

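Please avoid doing this for the time being (see the list in Check the integrity / completeness of archived files; status: 12/2023). This restriction was originally announced to last until mid-2023, when more tape drives were expected to become available.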

When we want to be as sure as possible that a file is correctly archived on tape, we need to wait until the file is written onto tape and retrieve it from there. After a file has been written onto tape, it remains in the HSM cache for a few hours or days. As long as the file is in the HSM cache, it will be retrieved from there instead of from tape. Therefore, we need to wait some time until the file version in the HSM cache is deleted. Then we can run slk retrieve and get the file from tape.

Note

Files of a size of a few MB and below are not deleted from the HSM cache and will remain there forever.

# archive a file
$ slk archive test.nc /arch/bm0146/k204221/test_data
[========================================\] 100% complete. Files archived: 1/1, [1.7K/1.7K].

# wait some days ...

# check if file is still in the HSM cache
$ slk_helpers iscached /arch/bm0146/k204221/test_data/test.nc
File is cached
# still cached; wait more time

# wait some days ...

# check again whether file is in the cache
$ slk_helpers iscached /arch/bm0146/k204221/test_data/test.nc
File is not cached

# now, we retrieve the archived file
$ slk retrieve /arch/bm0146/k204221/test_data/test.nc compare/
...

# compare the two files; e.g. via their checksums and also get the checksum from StrongLink
$ sha512sum test.nc
22ef50dcbd179775b5a6e632b02d8b99ddf16609f342a66c1fae818ed42a49d5a33af3dd8e059fa7a743f5b615620f2ad87a3d01bf3e2e0cde0e8a607bc1f15d  test.nc
$ sha512sum compare/test.nc
22ef50dcbd179775b5a6e632b02d8b99ddf16609f342a66c1fae818ed42a49d5a33af3dd8e059fa7a743f5b615620f2ad87a3d01bf3e2e0cde0e8a607bc1f15d  compare/test.nc
$ slk_helpers checksum -t sha512 /arch/bm0146/k204221/test_data/test.nc
22ef50dcbd179775b5a6e632b02d8b99ddf16609f342a66c1fae818ed42a49d5a33af3dd8e059fa7a743f5b615620f2ad87a3d01bf3e2e0cde0e8a607bc1f15d

Archival wrapper for SLURM#

We do not provide an sbatch wrapper script for slk archive. Large archivals might fail (see How much data can I archive at once?) and correctly archived files might be flagged as partial when slk archive fails (What to do when slk archive was interrupted/killed?). If we provided smoothly running sbatch wrapper scripts, errors in the SLURM log files might be overlooked and source files of failed archivals might be deleted.

Archival script templates#

Several script templates for different use cases are printed below.

When you use these templates, you need to make a few adaptations (not every script needs all of them):

  • modify src_folder: replace /work/xz1234/elp/ma/xe by the actual source folder on the lustre file system

  • modify target_folder: replace /arch/xz1234/${USER}/ex/am/ple by something appropriate for your project

  • replace xz1234 in --account=xz1234 by your project account name (in all relevant scripts)

  • modify src_file: replace file.nc by a correct filename

  • modify rtrv_folder: replace /scratch/${USER:0:1}/${USER}/validation by a temporary target folder for your validation retrievals

Please run/submit these scripts via sbatch as described in Run slk as batch job and SLURM Introduction.

several archivals of single files#

#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_arch_job # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=shared # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --account=xz1234 # Charge resources on this project
#SBATCH --mem=6GB

# make 'module' available when script is submitted from certain environments
source /sw/etc/profile.levante

# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set the source folder
src_folder=/work/xz1234/elp/ma/xe
# set target folder for archival
target_folder=/arch/xz1234/${USER}/ex/am/ple

# ~~~~~~~~~~~~ archivals ~~~~~~~~~~~~
# do the archival
echo "doing 'slk archive'"
# ~~~~~~~~~~~~ doing single-file archivals ~~~~~~~~~~~~
# You can do multiple archivals in one script. The exit code of each
# archival should be captured afterwards (get $? in line after slk command)
slk archive ${src_folder}/file01.nc ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call 1"
else
    echo "archival 1 successful"
fi

# second archival and capture exit code (get $? in line after slk cmd)
slk archive ${src_folder}/file02.nc ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call 2"
else
    echo "archival 2 successful"
fi
# ...
# ...
# fifteenth archival and capture exit code (get $? in line after slk cmd)
slk archive ${src_folder}/file15.nc ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call 15"
else
    echo "archival 15 successful"
fi
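The repeated blocks of this template can also be written as a loop that remembers whether an archival failed. A sketch under the assumption that one slk archive call per file is wanted: the function name and the variable ARCHIVE_CMD are made up, and a stand-in command is used so that the logic can be tried without slk.

```shell
#!/bin/bash
# archive_each TARGET_FOLDER FILE...: call ${ARCHIVE_CMD} once per file,
# report the result of each call and return 1 if any call failed
archive_each() {
    local target_folder=$1; shift
    local n_failed=0 f
    for f in "$@"; do
        # ARCHIVE_CMD is left unquoted on purpose so that "slk archive"
        # is split into command and sub-command
        if ${ARCHIVE_CMD} "${f}" "${target_folder}"; then
            echo "archival successful: ${f}"
        else
            >&2 echo "an error occurred in slk archive call for: ${f}"
            n_failed=$(( n_failed + 1 ))
        fi
    done
    [ "${n_failed}" -eq 0 ]
}

# stand-in for the demo: fails for file02.nc only;
# in a real job script set: ARCHIVE_CMD="slk archive"
fake_archive() { [ "$(basename "$1")" != "file02.nc" ]; }
ARCHIVE_CMD=fake_archive

if archive_each /arch/xz1234/ex/am/ple file01.nc file02.nc file03.nc 2>/dev/null; then
    overall="ok"
else
    overall="failed"
fi
echo "overall result: ${overall}"
```

Returning a non-zero exit code when any file failed makes the failure visible in the SLURM accounting if the job script ends with exit $?.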

archival of one file with delayed checksum check#

This template/example consists of two files:

archive_slurm_template_single_file_with_checksum_check.sh#

#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_arch_job # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=shared # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --account=xz1234 # Charge resources on this project
#SBATCH --mem=6GB

# make 'module' available when script is submitted from certain environments
source /sw/etc/profile.levante

# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk
# set the source folder
src_folder=/work/xz1234/elp/ma/xe
src_file=file.nc
# set target folder for archival
target_folder=/arch/xz1234/${USER}/ex/am/ple
# set a file to write the result of the checksum comparison into
checksum_result_file=${src_folder}/${src_file}.chk

# ~~~~~~~~~~~~ archivals ~~~~~~~~~~~~
# do the archival
echo "doing 'slk archive'"
# We run the archival and capture the exit code ...
slk archive ${src_folder}/${src_file} ${target_folder}
if [ $? -ne 0 ]; then
    >&2 echo "an error occurred in slk archive call"
    exit 1
else
    echo "archival successful"
fi

# ... then we calculate the checksum and ...
checksum_src_file_raw=`sha512sum ${src_folder}/${src_file}`
if [ $? -ne 0 ]; then
    >&2 echo "checksum could not be calculated"
    exit 1
else
    echo "calculation of checksum successful: ${checksum_src_file_raw}"
fi
echo $checksum_src_file_raw > ${src_folder}/${src_file}.sha512

# ... submit a delayed job for retrieving the checksum from StrongLink
sbatch --begin="now+2hours" ./archive_slurm_template_get_and_compare_checksum.sh ${src_folder}/${src_file}.sha512 ${target_folder}/${src_file} ${checksum_result_file}

archive_slurm_template_get_and_compare_checksum.sh#

#!/bin/bash
# HINT:
# * You can change the values right of the "=" as you wish.
# * The "%j" in the log file names means that the job id will be inserted
#SBATCH --job-name=test_slk_checksum # Specify job name
#SBATCH --output=test_job.o%j # name for standard output log file
#SBATCH --error=test_job.e%j # name for standard error output log
#SBATCH --partition=shared # partition name
#SBATCH --ntasks=1 # max. number of tasks to be invoked
#SBATCH --time=08:00:00 # Set a limit on the total run time
#SBATCH --account=xz1234 # Charge resources on this project
#SBATCH --mem=6GB

# make 'module' available when script is submitted from certain environments
source /sw/etc/profile.levante

# ~~~~~~~~~~~~ get and print arguments ~~~~~~~~~~~~
if [ "$#" -ne 3 ]; then
    echo -1
    >&2 echo "need three input arguments (got $#): FILE_CONTAINING_CHECKSUM_OF_SRC_FILE RESOURCE_PATH_HSM CHECKSUM_COMPARISON_RESULT_FILE"
    exit 1
fi

checksum_file=$1
resource_path_hsm=$2
checksum_result_file=$3

echo "~~~ got this input: ~~~"
echo "checksum_file: ${checksum_file}"
echo "resource_path_hsm: ${resource_path_hsm}"
echo "checksum_result_file: ${checksum_result_file}"


# ~~~~~~~~~~~~ preparation ~~~~~~~~~~~~
module load slk


# ~~~~~~~~~~~~ get source file's checksum ~~~~~~~~~~~~
if [ ! -f ${checksum_file} ]; then
    >&2 echo "file containing the checksum of the source file does not exist: '${checksum_file}'"
    exit 1
fi
checksum_src_file_raw=`cat ${checksum_file}`
checksum_src_file=`echo ${checksum_src_file_raw} | awk '{ print $1 }'`


# ~~~~~~~~~~~~ check if HSM file is available ~~~~~~~~~~~~
# first we check whether the resource/file actually exists in the HSM
echo "doing 'slk_helpers exists'"
slk_helpers exists ${resource_path_hsm}
exit_code=$?
if [ $exit_code -ne 0 ]; then
    if [ $exit_code -eq 1 ]; then
        >&2 echo "file '${resource_path_hsm}' does not exist in the HSM; stop obtaining a checksum"
        exit 1
    else
        >&2 echo "an unknown error occurred in 'slk_helpers exists ${resource_path_hsm}' call; exit code: ${exit_code}"
        exit 1
    fi
else
    echo "file exists in HSM ('$resource_path_hsm')"
fi


# ~~~~~~~~~~~~ get HSM checksum ~~~~~~~~~~~~
echo "doing 'slk_helpers checksum -t sha512'"
# We get the checksum from StrongLink and capture the exit code ...
checksum_hsm_file_raw=`slk_helpers checksum -t sha512 ${resource_path_hsm}`
exit_code=$?
if [ $exit_code -ne 0 ]; then
    if [ $exit_code -eq 1 ]; then
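        echo "checksum of '${resource_path_hsm}' not yet calculated by StrongLink; resubmitting this job"
        # pass the checksum FILE (not the checksum value) because the
        # resubmitted job reads the checksum from this file again
        sbatch --begin="now+2hours" ${0} ${checksum_file} ${resource_path_hsm} ${checksum_result_file}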
        exit 0
    else
        >&2 echo "an error occurred in slk_helpers checksum call; exit code: ${exit_code}"
        exit 1
    fi
else
    echo "getting checksum successful"
fi
checksum_hsm_file=`echo ${checksum_hsm_file_raw} | awk '{ print $1 }'`


# ~~~~~~~~~~~~ compare if checksums are equal ~~~~~~~~~~~~
echo "Result of checksum comparison will be written into ${checksum_result_file} (first line: 0 == checksums equal; 1 == checksums differ)"
if [ "${checksum_src_file}" = "${checksum_hsm_file}" ]; then
    echo "checksums are equal: ${checksum_src_file}"
    exit_code=0
else
    echo "checksums are unequal: ${checksum_src_file} and ${checksum_hsm_file}"
    exit_code=1
fi
echo "${exit_code}" > ${checksum_result_file}
echo "# 0 == checksums equal; 1 == checksums differ" >> ${checksum_result_file}
echo "checksum src file: ${checksum_src_file_raw}" >> ${checksum_result_file}
echo "checksum HSM file: ${checksum_hsm_file} ${resource_path_hsm}" >> ${checksum_result_file}

exit ${exit_code}