File Search and Metadata#
file version: 08 Dec 2023
current software versions: slk version 3.3.91; slk_helpers version 1.10.2; slk wrappers version: 1.2.1
Metadata in StrongLink#
StrongLink allows users to search for, find and retrieve files based on their metadata. Metadata are stored in metadata fields, e.g. `title`. Each metadata field is part of one metadata schema. Basic file metadata, e.g. owner and size, are automatically extracted from every archived file and stored in a schema called `resources`. When netCDF files are archived, many global attributes and a few variable attributes are extracted and written into the schemata `netcdf` and `netcdf_header`. Metadata from common picture, document, video and audio formats are also extracted and written into the respective schemata. All available metadata schemata, their content and the file types to which they are applied are listed in our Metadata schema reference. All metadata fields except those of the `resources` schema can be manually modified by the user.
Searches are defined via JSON-formatted search queries and are performed via `slk search`. Simple queries to search for files based on their name and location can be generated with `slk_helpers gen_file_query`.
Details on the query language are provided in Reference: StrongLink query language. Several example search queries are provided in slk Usage Examples.
Warning
A metadata field like `Title` might exist in several metadata schemata. A file may be associated not just with one metadata schema but with an arbitrary number of them, e.g. `document`, `netcdf` and `netcdf_header`. Such a file might have different values set for the respective `*.Title` fields, e.g.: `document.Title: "Great data of Max Mustermann"`, `netcdf.Title: "Sea surface temperature of the North Sea from 1900 to 1999"` and `netcdf_header.Title: "sst North Sea 20th century"`.
Set metadata#
slk tag#
The content of metadata fields can be set and modified with `slk tag`. Metadata fields in the `resources` schema, such as timestamps or ownership, are read-only.
Example:
$ slk tag /arch/bm0146/k204221/test_files netcdf.Title="A great data set"
`slk tag` can be applied to files, directories and search IDs. The latter feature is useful when many files in different locations should receive the same metadata (e.g. the same project or the same author). `slk_helpers gen_file_query` can be used to generate an appropriate search query for this purpose.
Example:
$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892
$ slk tag 117892 netcdf.Title="A great data set"
Search continuing... ...
Search ID: 117892
[========================================/] 100% complete Metadata applied to 1 of 1 resources. Finishing up... ...
slk_helpers json2hsm#
`slk tag` is very slow. Updating many files with different metadata per file would therefore take very long. For this purpose, we provide the command `slk_helpers json2hsm`. The metadata are first written into a JSON file, which can contain metadata records for an arbitrary number of files. The command is run as follows:
slk_helpers json2hsm metadata_file.json
The JSON file needs to follow a certain structure which is defined here.
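For orientation only, a minimal input file might look as follows. This sketch is inferred from the `slk_helpers hsm2json` output format shown further below (both commands use the same JSON structure); the linked reference remains authoritative, and the path and `netcdf.Title` value are made-up examples:

```json
[ {
  "path" : "/arch/bm0146/k204221/test_files/one_file.nc",
  "tags" : {
    "netcdf.Title" : "A great data set"
  }
} ]
```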
`slk_helpers json2hsm` has many options to control the import process. All parameters are listed and briefly explained here. `--schema`/`-s` selects a single metadata schema from the input JSON for import; metadata fields from all other schemata are ignored. I.e. the following command will only import metadata of the schema `netcdf`:
slk_helpers json2hsm -s netcdf metadata_file.json
`slk_helpers json2hsm` stops the metadata import if it finds an error in one of the metadata records. If, instead, all good metadata records should be imported and defective ones simply skipped, the parameter `--skip-bad-metadata-sets`/`-k` can be used as follows:
slk_helpers json2hsm --skip-bad-metadata-sets metadata_file.json
`slk_helpers json2hsm` collects all metadata updates and starts applying them after the whole JSON input file has been read. If each metadata record should instead be written directly after it has been read, the parameter `--instant-metadata-record-update` can be specified. This is particularly useful in combination with the parameter `--restart-file RESTARTFILE`/`-r RESTARTFILE`. If this parameter is set, the paths of all files whose metadata have already been updated are written to RESTARTFILE. When `slk_helpers json2hsm` is started with this parameter, all paths from RESTARTFILE are read and the listed files are skipped. This allows resuming an interrupted call of `slk_helpers json2hsm`.
slk_helpers json2hsm --instant-metadata-record-update --restart-file my_restart.txt metadata_file.json
If the target metadata records contain erroneous metadata which should be purged before new metadata is written, the parameter `--write-mode CLEAN` can be specified. When set, each updated metadata schema is emptied before new metadata is written. Metadata schemata to which no new metadata are written are not affected.
slk_helpers json2hsm --write-mode CLEAN metadata_file.json
Print metadata#
Three different commands are available to print metadata from the HSM:
# export metadata of one or more resources as JSON
$ slk tag -display /arch/bm0146/k204221/test.nc
# export metadata of one resource as plain text
$ slk_helpers metadata /arch/bm0146/k204221/test.nc
# export metadata of one or more resources as JSON + have some parameters for fine tuning
$ slk_helpers hsm2json /arch/bm0146/k204221/test.nc
slk tag -display#
`slk tag -display RESOURCE` prints the metadata of RESOURCE to the command line as JSON. When applied to a directory or a search ID, `slk tag -display` prints the metadata of multiple files. In both cases, the JSON structure is the same as the one used by `slk_helpers hsm2json` and `slk_helpers json2hsm` (JSON structure described here).
slk_helpers metadata#
`slk_helpers metadata` prints the metadata of one file as easily human-readable plain text.
$ slk_helpers metadata /dkrz_test/netcdf/20221014a/test_netcdf_03.nc
netcdf
Project: values project
Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
Var_Std_Name: latitude,longitude,height
Data:
Var_Name: time,lat,lon,z,data_var
Time_Min: 961549200000
Time_Max: 961722000000
Contact: values contact,values contact_email,
netcdf_header
Project: values project
Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
Var_Std_Name: latitude,longitude,height
Var_Name: time,lat,lon,z,data_var
Time_Min: 961549200000
Time_Max: 961722000000
Doi: values doi
If the output should be parsed, the parameter --alternative-output-format
might be useful:
$ slk_helpers metadata /dkrz_test/netcdf/20221014a/test_netcdf_03.nc --alternative-output-format
netcdf.Var_Name: time,lat,lon,z,data_var
netcdf_header.Time_Min: 2000-06-21 03:00:00
netcdf_header.Doi: values doi
netcdf.Var_Std_Name: latitude,longitude,height
netcdf.Data:
netcdf_header.Var_Name: time,lat,lon,z,data_var
netcdf_header.Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
netcdf.Time_Min: 2000-06-21 03:00:00
netcdf.Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
netcdf.Project: values project
netcdf.Contact: values contact,values contact_email,
netcdf.Time_Max: 2000-06-23 03:00:00
netcdf_header.Project: values project
netcdf_header.Time_Max: 2000-06-23 03:00:00
netcdf_header.Var_Std_Name: latitude,longitude,height
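Since each line of this output has the form `schema.Field: value`, it can be parsed with standard shell tools. A minimal sketch, with the output captured in a variable for illustration (the sample content is copied from the listing above):

```shell
# sample --alternative-output-format output, captured here as a plain string
metadata='netcdf.Var_Name: time,lat,lon,z,data_var
netcdf.Time_Min: 2000-06-21 03:00:00'

# each line is "schema.Field: value"; print only the wanted field,
# stripping its "schema.Field: " prefix with sed
var_name=$(printf '%s\n' "$metadata" | sed -n 's/^netcdf\.Var_Name: //p')
echo "$var_name"   # time,lat,lon,z,data_var
```

In practice, the variable would be filled via `metadata=$(slk_helpers metadata FILE --alternative-output-format)`.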
slk_helpers hsm2json#
`slk_helpers hsm2json` works similarly to `slk tag -display`. However, it does not accept search IDs. On the other hand, it has more parameters than `slk tag`, which allow some fine-tuning.
The option `--schema`/`-s` selects the metadata schemata whose metadata should be exported. If `--schema` is set, metadata of all schemata which are not listed are ignored. The following command will only export metadata fields of the schema `netcdf_header`:
$ slk_helpers hsm2json /dkrz_test/netcdf/20221014a/test_netcdf_03.nc --schema netcdf_header
[ {
"path" : "/dkrz_test/netcdf/20221014a/test_netcdf_03.nc",
"id" : 61035459010,
"tags" : {
"netcdf_header.Time_Min" : "2000-06-21 03:00:00",
"netcdf_header.Project" : "values project",
"netcdf_header.Time_Max" : "2000-06-23 03:00:00",
"netcdf_header.Doi" : "values doi",
"netcdf_header.Var_Name" : "time,lat,lon,z,data_var",
"netcdf_header.Var_Long_Name" : "latitude,longitude,height,mass_concentration_of_nitric_acid_in_air",
"netcdf_header.Var_Std_Name" : "latitude,longitude,height"
},
"provenance" : {
"timeStampISO" : "2022-10-16T03:08:09.285378838",
"software" : "slk_helpers",
"timeStampMillis" : 1665882489284,
"formatVersion" : "2.0.0",
"softwareVersion" : "1.5.3"
}
} ]
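As the export is plain JSON, individual fields can be extracted with `jq` (which is also used further below to pretty-print queries). A sketch, assuming the record shown above was saved to `out.json` (recreated here in reduced form for illustration):

```shell
# a reduced version of the hsm2json record shown above
cat > out.json <<'EOF'
[ {
  "path" : "/dkrz_test/netcdf/20221014a/test_netcdf_03.nc",
  "tags" : { "netcdf_header.Doi" : "values doi" }
} ]
EOF

# pick one metadata field from the first record; -r prints the raw string
jq -r '.[0].tags["netcdf_header.Doi"]' out.json   # values doi
```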
If we wish to write the metadata into a JSON file, we can either pipe the output of `slk_helpers hsm2json` into a file or provide the parameter `--outfile OUTPUTFILE`/`-o OUTPUTFILE`. The latter can be combined with `--print-summary`, which prints a summary of the metadata export to the command line.
$ slk_helpers hsm2json /dkrz_test/netcdf/20221014a/test_netcdf_03.nc --print-summary --outfile out.json
Export Summary:
1: total number of metadata records read from the HSM (until now)
------
1: number of metadata records read from the HSM and written out
0: number of metadata records read from the HSM but not written because this program exited with an error
0: number of metadata records not read because considered as corrupt
0: number of metadata records skipped because processed in the past
`slk_helpers hsm2json` first collects the metadata of all source files and prints/writes them out after the last metadata record has been read. In some use cases it might be useful to print metadata directly after they have been read from StrongLink; the parameter `--instant-metadata-record-output` is meant for this purpose. It is particularly useful in combination with the parameter `--restart-file RESTARTFILE`/`-r RESTARTFILE`. If this parameter is set, the paths of all files whose metadata have already been printed/written are written to RESTARTFILE. When `slk_helpers hsm2json` is started with this parameter, all paths from RESTARTFILE are read and no metadata of the listed files are exported from StrongLink. This allows resuming an interrupted call of `slk_helpers hsm2json`.
$ slk_helpers hsm2json -R /dkrz_test/netcdf/20221014a --instant-metadata-record-output --restart-file my_restart.txt --print-summary --outfile out.json
Export Summary:
9: total number of metadata records read from the HSM (until now)
------
9: number of metadata records read from the HSM and written out
0: number of metadata records read from the HSM but not written because this program exited with an error
0: number of metadata records not read because considered as corrupt
0: number of metadata records skipped because processed in the past
$ slk_helpers hsm2json -R /dkrz_test/netcdf/20221014a --instant-metadata-record-output --restart-file my_restart.txt --print-summary --outfile out.json
Export Summary:
9: total number of metadata records read from the HSM (until now)
------
0: number of metadata records read from the HSM and written out
0: number of metadata records read from the HSM but not written because this program exited with an error
0: number of metadata records not read because considered as corrupt
9: number of metadata records skipped because processed in the past
$ cat my_restart.txt
/dkrz_test/netcdf/20221014a/test_netcdf_03.nc
/dkrz_test/netcdf/20221014a/test_netcdf_01.nc
/dkrz_test/netcdf/20221014a/test_netcdf_header.nc
/dkrz_test/netcdf/20221014a/test_netcdf_c.nc
/dkrz_test/netcdf/20221014a/test_netcdf_a.nc
/dkrz_test/netcdf/20221014a/test_netcdf_b.nc
/dkrz_test/netcdf/20221014a/test_netcdf_02.nc
/dkrz_test/netcdf/20221014a/test_netcdf_d.nc
/dkrz_test/netcdf/20221014a/test_netcdf_04.nc
The JSON structure is described here.
Search files by metadata#
The command `slk search` allows searching for files by their metadata. Users can either search for file name, user name and group name via simple flags or formulate complex search queries on all available metadata fields. Search queries in StrongLink have to be written in a special JSON-based query language. It is possible to match strings with regular expressions (`$regex` operator) or to use number comparison operators such as `$gte` (table containing all operators).
# search for "Max" as value in the metadata field "Producer" of the schema "image"
$ slk search '{"image.Producer":"Max"}'
Search continuing. .....
Search ID: 9
# Search for a producer with the name "Max M[...]" using regular expressions
$ slk search '{"image.Producer": {"$regex": "Max M"}}'
Search continuing. .....
Search ID: 11
# every producer string which contains "Max M" will be matched -- also "Karl Max Mueller-Meyer"
# To make sure that we only match producers whose first name is exactly "Max"
# and who have one single last name, we need to add "^" at the beginning and "$"
# at the end. "[a-z]" means "any lowercase letter" between "a" and "z". "*"
# means "the preceding expression might occur zero or more times".
$ slk search '{"image.Producer": {"$regex": "^Max M[a-z]*$"}}'
Search continuing. .....
Search ID: 11
# find a file based on location and name
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221","$max_depth":1}},{"resources.name": "one_file.nc"}]}'
Search continuing. ..
Search ID: 117892
If you want to search for a file by name or location, you can generate a query string via `slk_helpers gen_file_query` (more `gen_file_query` example applications here).
# find a file by name and location; let the query be generated
$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892
# find files by name and location recursively; let the query be generated
$ slk_helpers gen_file_query -R /arch/bm0146/k204221/INDEX.txt
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}'
Search continuing. ...
Search ID: 117899
# reduce the copy&paste work in the previous example
$ search_query=`slk_helpers gen_file_query -R /arch/bm0146/k204221/test_files/one_file.nc`
$ search_id=$(eval "slk search '"${search_query}"' | tail -n 1 | cut -c12-20")
# now you can further work with the search ID in a variable
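Instead of cutting fixed character columns (`cut -c12-20`), the ID can also be extracted by matching the `Search ID:` prefix, which is more robust against varying ID lengths. A sketch with the `slk search` output captured in a variable for illustration:

```shell
# example slk search output, captured as a plain string for illustration
search_output='Search continuing. ..
Search ID: 117892'

# print only the line starting with "Search ID: " and strip that prefix
search_id=$(printf '%s\n' "$search_output" | sed -n 's/^Search ID: //p')
echo "$search_id"   # 117892
```

In practice, the variable would be filled via `search_output=$(slk search "$search_query")`.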
The output of a search request is a search ID. In order to list the search results, the search ID is used as input for `slk list` or `slk_helpers list_search`. The output of `slk list` resembles the output of `ls`; it does not print the paths of the search results. `slk_helpers list_search`, in contrast, prints the full path of each search result but skips certain other information.
$ slk list 117899 | tail -n 5
-rw-r--r--- k204221 bm0146 1.5K 10 Nov 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 1.2M 10 Jun 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 924.9K 11 Sep 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 1.3M 11 Sep 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 924.9K 11 Sep 2020 INDEX.txt
Files: 22
$ slk_helpers list_search 117899 | tail -n 5
-rw-r--r--- 1208734 /arch/bm0146/k204221/iow2_test/INDEX.txt
-rw-r--r--- 947084 /arch/bm0146/k204221/exp/jsbach/INDEX.txt
-rw-r--r--- 1347992 /arch/bm0146/k204221/exp/hamocc/INDEX.txt
-rw-r--r--- 947084 /arch/bm0146/k204221/exp/echam/INDEX.txt
Resources: 22
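The path column of `slk_helpers list_search` can be extracted with `awk`, e.g. to feed the result list into further commands; the trailing `Resources:` summary line has to be dropped. A sketch on sample output copied from above:

```shell
# sample list_search output, captured as a plain string for illustration
list_output='-rw-r--r--- 1208734 /arch/bm0146/k204221/iow2_test/INDEX.txt
-rw-r--r--- 947084 /arch/bm0146/k204221/exp/echam/INDEX.txt
Resources: 2'

# keep only the third column (the path); the summary line is skipped
# automatically because its third field is empty and does not start with "/"
paths=$(printf '%s\n' "$list_output" | awk '$3 ~ /^\// {print $3}')
echo "$paths"
```

In practice, the variable would be filled via `list_output=$(slk_helpers list_search $search_id)`.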
The search ID can also be used as input for `slk retrieve` and `slk tag` – see Run a search query and retrieve search results and Set metadata, respectively. The SLURM job script “generate search string and retrieve files” shows one application of `slk_helpers gen_file_query` and `slk_helpers search` in combination with `slk retrieve`.
See also
Further query examples are given below. Available query operators are given in the Reference: StrongLink query language. See also StrongLink Command Line Interface Guide from page 6 onwards.
Print a search query in a human-readable way#
We have got this search query and want to analyze it:
slk search '{"$and": [{"resources.name": "INDEX.txt"}, {"$or": [{"$and": [{"resources.posix_uid": 25301}, {"path": {"$gte": "/arch"}}]}, {"path": {"$gte": "/double/bm0146"}}]}]}'
The search queries are written in JSON. You can use `jq` to print them in a human-readable way:
$ echo '{"$and": [{"resources.name": "INDEX.txt"}, {"$or": [{"$and": [{"resources.posix_uid": 25301}, {"path": {"$gte": "/arch"}}]}, {"path": {"$gte": "/double/bm0146"}}]}]}' | jq
{
"$and": [
{
"resources.name": "INDEX.txt"
},
{
"$or": [
{
"$and": [
{
"resources.posix_uid": 25301
},
{
"path": {
"$gte": "/arch"
}
}
]
},
{
"path": {
"$gte": "/double/bm0146"
}
}
]
}
]
}
Example queries with explanations#
The examples are partly taken from the StrongLink Command Line Interface Guide.
| Query | Purpose |
|---|---|
| `{"resources.size": {"$gt": 1048576}}` | Find files greater than one megabyte (sizes are in bytes) |
| `{"path": {"$gte": "/arch/bm0146/k204221"}}` | Find files in a specific namespace (recursively) |
| `{"path": {"$gte": "/arch/bm0146/k204221", "$max_depth": 1}}` | Find files in a specific namespace (non-recursively) |
| … | Find files of a specific MIME type |
| `{"resources.posix_uid": 25301}` | Find files for a specific UID |
| … | Find files for a specific GID |
| … | Find files modified since a specific date |
| … | Find files based on user-defined metadata; the user-defined schema and field name form the field, e.g. `netcdf.Title` |
| `{"resources.posix_uid": 25301}` | Find files of user k204221 (who has UID 25301) |
| … | Find images whose metadata field matches a given value |
| `{"resources.name": "INDEX.txt"}` | Search for all files with the name `INDEX.txt` |
| `{"resources.name": {"$regex": "INDEX.txt"}}` | Search for all files whose names match a regular expression |
| `{"$or": [{"resources.posix_uid": 24855}, {"resources.posix_uid": 25301}]}` | Find files which belong either to user 24855 or to user 25301 |
| `{"$and": [{"path": {"$gte": "/arch/bm0146/k204221", "$max_depth": 1}}, {"resources.name": "one_file.nc"}]}` | Find files with the name `one_file.nc` in a specific namespace |
Advanced query examples#
# two types of delimiters
$ slk search '{"resources.size":{"$gt": 1048576}}'
$ slk search "{\"resources.size\":{\"\$gt\": 1048576}}"
# using shell variables in calls of slk search
# ~~~~~~~~~~~~~~~~~~~~ method one ~~~~~~~~~~~~~~~~~~~~
$ id k204221 -u
25301
$ slk search "{\"resources.posix_uid\":25301}"
...
# ~~~~~~~~~~~~~~~~~~~~ method two ~~~~~~~~~~~~~~~~~~~~
$ export uid=`id k204221 -u`
$ slk search "{\"resources.posix_uid\":$uid}"
...
# ~~~~~~~~~~~~~~~~~~~~ method three ~~~~~~~~~~~~~~~~~~~~
$ slk search "{\"resources.posix_uid\":`id k204221 -u`}"
...
Warning
The example shell commands are meant for `bash`. If you are using `csh` or `tcsh`, they do not work as printed here but have to be adapted. Please contact DKRZ support (support@dkrz.de) if you require assistance.