File Search and Metadata#

file version: 08 Dec 2023

current software versions: slk version 3.3.91; slk_helpers version 1.10.2; slk wrappers version: 1.2.1

Set metadata#

slk tag#

The content of metadata fields can be set and modified with slk tag. Metadata fields in the resources schema like timestamps or ownership are read-only.

Example:

$ slk tag /arch/bm0146/k204221/test_files netcdf.Title="A great data set"

slk tag can be applied to files, directories and search IDs. Tagging by search ID is useful when many files in different locations should receive the same metadata (e.g. the same project or the same author). slk_helpers gen_file_query can be used to generate the appropriate search query for this purpose.

Example:

$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}

$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892

$ slk tag 117892 netcdf.Title="A great data set"
Search continuing... ...
Search ID: 117892
[========================================/] 100% complete Metadata applied to 1 of 1 resources. Finishing up... ...
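
The three steps shown above (generate a query, run the search, tag the search ID) can be wrapped in a small bash function. The following is only a minimal sketch and not part of slk or slk_helpers: the function name tag_file_via_search is hypothetical, and the sketch assumes that slk search prints a line of the form "Search ID: <number>" as in the example output above.

```shell
#!/bin/bash
# Sketch: tag the resources matched by a generated file query.
# tag_file_via_search is a hypothetical helper name; it only calls
# slk_helpers gen_file_query, slk search and slk tag as shown above.
tag_file_via_search() {
    local path="$1" tag="$2" query search_id
    # let slk_helpers build the JSON search query for the given path
    query=$(slk_helpers gen_file_query "$path") || return 1
    # assumption: the search ID is printed as "Search ID: <number>"
    search_id=$(slk search "$query" \
        | sed -n 's/^Search ID: *\([0-9][0-9]*\).*/\1/p' | tail -n 1)
    if [ -z "$search_id" ]; then
        echo "no search ID found for ${path}" >&2
        return 1
    fi
    # apply the metadata to all resources of this search
    slk tag "$search_id" "$tag"
}
```

A call could look like this: tag_file_via_search /arch/bm0146/k204221/test_files/one_file.nc 'netcdf.Title=A great data set'. Extracting the ID with a pattern match instead of a fixed character range keeps the function working if the number of digits in the search ID changes.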

slk_helpers json2hsm#

slk tag is very slow, so updating many files with different metadata per file takes a long time. For this purpose, we provide the command slk_helpers json2hsm. The metadata are first written into a JSON file, which can contain metadata records for an arbitrary number of files. The command is run as follows:

slk_helpers json2hsm metadata_file.json

The JSON file needs to follow a certain structure which is defined here.

slk_helpers json2hsm has many options to control the import process. All parameters are listed and briefly explained here. --schema/-s restricts the import to one metadata schema from the input JSON; metadata fields from all other schemata are ignored. For example, the following command imports only metadata of the schema netcdf:

slk_helpers json2hsm -s netcdf metadata_file.json

slk_helpers json2hsm stops the metadata import if it finds an error in one of the metadata records. If all valid metadata records should be imported and defective ones should simply be skipped, the parameter --skip-bad-metadata-sets/-k can be used as follows:

slk_helpers json2hsm --skip-bad-metadata-sets metadata_file.json

By default, slk_helpers json2hsm collects all metadata updates and starts applying them only after the whole JSON input file has been read. If each metadata record should instead be written directly after it has been read, the parameter --instant-metadata-record-update can be specified. This is particularly useful in combination with the parameter --restart-file RESTARTFILE/-r RESTARTFILE. If this parameter is set, the paths of all files whose metadata have already been updated are written to RESTARTFILE. When slk_helpers json2hsm is started with this parameter, all paths from RESTARTFILE are read and the listed files are skipped. This makes it possible to resume an interrupted call of slk_helpers json2hsm.

slk_helpers json2hsm --instant-metadata-record-update --restart-file my_restart.txt metadata_file.json
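
Resuming with a restart file can be automated with a simple retry loop. The following bash sketch is an illustration under assumptions: the function name import_with_restart is hypothetical, and it assumes that slk_helpers json2hsm returns a non-zero exit code when a run is interrupted or fails. Retrying only helps with interruptions; a defective metadata record will fail again on every attempt.

```shell
#!/bin/bash
# Sketch: repeat the import until it succeeds or max_tries is reached.
# import_with_restart is a hypothetical helper name.
# Assumption: json2hsm exits with a non-zero code on interruption/error.
import_with_restart() {
    local json_file="$1" restart_file="$2" max_tries="${3:-3}" try=1
    while [ "$try" -le "$max_tries" ]; do
        # files already listed in the restart file are skipped on each retry
        if slk_helpers json2hsm --instant-metadata-record-update \
               --restart-file "$restart_file" "$json_file"; then
            return 0
        fi
        echo "json2hsm attempt ${try} of ${max_tries} failed" >&2
        try=$((try + 1))
    done
    return 1
}
```

A call could look like this: import_with_restart metadata_file.json my_restart.txt 5.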

If the target metadata records contain erroneous metadata which should be purged before new metadata is written, the parameter --write-mode CLEAN can be specified. When set, each updated metadata schema is emptied before new metadata is written. Metadata schemata to which no new metadata are written are not affected.

slk_helpers json2hsm --write-mode CLEAN metadata_file.json

Search files by metadata#

The command slk search searches for files by their metadata. Users can either search by file name, user name and group name via simple flags or formulate complex search queries on all available metadata fields. Search queries in StrongLink have to be written in a special query language whose format is JSON. It is possible to match strings with regular expressions ($regex operator) or to use numeric comparison operators such as $gte (table containing all operators).

# search for "Max" as value in the metadata field "Producer" of the schema "image"
$ slk search '{"image.Producer":"Max"}'
Search continuing. .....
Search ID: 9

# Search for a producer with the name "Max M[...]" using regular expressions
$ slk search '{"image.Producer": {"$regex": "Max M"}}'
Search continuing. .....
Search ID: 11
# every producer string which contains "Max M" will be matched -- also "Karl Max Mueller-Meyer"

# To make sure that we only match producers with exactly one first name "Max"
# and exactly one last name starting with "M", we need to add "^" at the
# beginning and "$" at the end. "[a-z]" means "any lowercase letter" between
# "a" and "z". "*" means "the preceding expression may occur zero or more times".
$ slk search '{"image.Producer": {"$regex": "^Max M[a-z]*$"}}'
Search continuing. .....
Search ID: 11

# find a file based on location and name
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221","$max_depth":1}},{"resources.name": "one_file.nc"}]}'
Search continuing. ..
Search ID: 117892
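
Unbalanced braces are easy to introduce when typing nested queries by hand. As a hedged sketch, a small bash function can assemble a $regex query from a field name and a pattern. The function name regex_query is hypothetical and not part of slk; the sketch does not escape double quotes or backslashes in the pattern, so such patterns still need manual care.

```shell
#!/bin/bash
# Sketch: build a {"FIELD":{"$regex":"PATTERN"}} query string.
# regex_query is a hypothetical helper name.
# $1: metadata field in the form schema.field
# $2: regular expression (must not contain double quotes or backslashes)
regex_query() {
    # single quotes keep "$regex" from being expanded by the shell
    printf '{"%s":{"$regex":"%s"}}\n' "$1" "$2"
}
```

The result can be passed directly to slk search, for example: slk search "$(regex_query image.Producer '^Max M[a-z]*$')".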

If you want to search for a file by name or location, you can generate a query string via slk_helpers gen_file_query (more gen_file_query example applications here).

# find a file by name and location; let the query be generated by slk_helpers gen_file_query
$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892

# find files by name and location recursively; let the query be generated
$ slk_helpers gen_file_query -R /arch/bm0146/k204221/INDEX.txt
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}'
Search continuing. ...
Search ID: 117899

# reduce the copy&paste work in the previous example
$ search_query=$(slk_helpers gen_file_query -R /arch/bm0146/k204221/test_files/one_file.nc)
$ search_id=$(slk search "${search_query}" | tail -n 1 | cut -c12-20)
# now you can further work with the search ID in a variable

The output of a search request is a search ID. To list the search results, pass the search ID to slk list or slk_helpers list_search. The output of slk list resembles the output of ls -l and does not include the path of each search result. In contrast, slk_helpers list_search prints the full path of each search result but omits certain other information.

$ slk list 117899 | tail -n 5
-rw-r--r--- k204221     bm0146          1.5K   10 Nov 2020  INDEX.txt
-rw-r--r--- k204221     bm0146          1.2M   10 Jun 2020  INDEX.txt
-rw-r--r--- k204221     bm0146        924.9K   11 Sep 2020  INDEX.txt
-rw-r--r--- k204221     bm0146          1.3M   11 Sep 2020  INDEX.txt
-rw-r--r--- k204221     bm0146        924.9K   11 Sep 2020  INDEX.txt
Files: 22

$ slk_helpers list_search 117899 | tail -n 5
-rw-r--r---         1208734 /arch/bm0146/k204221/iow2_test/INDEX.txt
-rw-r--r---          947084 /arch/bm0146/k204221/exp/jsbach/INDEX.txt
-rw-r--r---         1347992 /arch/bm0146/k204221/exp/hamocc/INDEX.txt
-rw-r--r---          947084 /arch/bm0146/k204221/exp/echam/INDEX.txt
Resources: 22
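
If only the paths are needed, for example to feed them into another script, they can be cut out of the slk_helpers list_search output. The following bash sketch assumes that every result line has exactly the layout shown above (permissions, size in bytes, absolute path) and that the summary line starts with "Resources:"; the function name paths_from_list_search is hypothetical.

```shell
#!/bin/bash
# Sketch: print only the path column of "slk_helpers list_search" output.
# paths_from_list_search is a hypothetical helper name.
# Assumption: lines look like "<permissions> <size> <absolute path>".
paths_from_list_search() {
    # keep everything from the first "/" after the size column;
    # lines that do not match (e.g. "Resources: 22") are dropped
    sed -n 's|^[^ ][^ ]* *[0-9][0-9]* *\(/.*\)$|\1|p'
}
```

It reads from standard input, so it can be used in a pipe: slk_helpers list_search 117899 | paths_from_list_search. Matching the whole remainder of the line keeps paths that contain spaces intact.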

The search_ID can also be used as input to slk retrieve and slk tag – see Run a search query and retrieve search results and Set metadata, respectively. The SLURM job script “generate search string and retrieve files” shows one application of slk_helpers gen_file_query and slk_helpers search in combination with slk retrieve.

See also

Further query examples are given below. Available query operators are given in the Reference: StrongLink query language. See also StrongLink Command Line Interface Guide from page 6 onwards.

Example queries with explanations#

The examples are partly taken from the StrongLink Command Line Interface Guide.

Example queries copied from the manual.#

| Query | Purpose |
|---|---|
| {"resources.size":{"$gte": 1048576}} | Find files of at least 1 MiB (1048576 bytes; sizes are given in bytes) |
| {"path":{"$gte":"/arch/project"}} | Find files in a specific namespace (recursively) |
| {"path":{"$gte":"/arch/project", "$max_depth": 1}} | Find files in a specific namespace (non-recursively) |
| {"resources.mimetype":"image/jpeg"} | Find files of a specific MIME type |
| {"resources.posix_uid":999} | Find files for a specific UID |
| {"resources.posix_gid":999} | Find files for a specific GID |
| {"resources.mtime":{"$gt":"2020-10-10"}} | Find files modified after a specific date |
| {"project.name":"hadron"} | Find files based on user-defined metadata. The query key is the user-defined schema name and the field name joined by a dot. For example, to query the name field in the Project schema, use Project.name. |
| {"resources.posix_uid":25301} | Find files of user k204221 (who has UID 25301) |
| {"image.Producer":"Max"} | Find images whose metadata field Producer is set to Max |
| {"resources.name": "search_me.jpg"} | Find all files with the name search_me.jpg |
| {"resources.name": {"$regex": "file_[0-9].nc"}} | Find all files whose names match the regular expression file_[0-9].nc |
| {"$or": [{"resources.posix_uid":24855},{"resources.posix_uid":25301}]} | Find files which belong to either user 24855 or user 25301 |
| {"$and":[{"resources.name": "surface_iow_day3d_temp_emep_2003.nc"}, {"resources.posix_uid": 25301}]} | Find files with the name surface_iow_day3d_temp_emep_2003.nc which belong to user k204221 (who has UID 25301) |

Advanced query examples#

# two ways of quoting the query (single quotes, or double quotes with escaping)
$ slk search '{"resources.size":{"$gt": 1048576}}'
$ slk search "{\"resources.size\":{\"\$gt\": 1048576}}"

# using shell variables in calls of slk search
# ~~~~~~~~~~~~~~~~~~~~ method one ~~~~~~~~~~~~~~~~~~~~
$ id k204221 -u
25301
$ slk search "{\"resources.posix_uid\":25301}"
...
# ~~~~~~~~~~~~~~~~~~~~ method two ~~~~~~~~~~~~~~~~~~~~
$ export uid=`id k204221 -u`
$ slk search "{\"resources.posix_uid\":$uid}"
...
# ~~~~~~~~~~~~~~~~~~~ method three ~~~~~~~~~~~~~~~~~~~
$ slk search "{\"resources.posix_uid\":`id k204221 -u`}"
...
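
Escaping every double quote by hand becomes error-prone once a query combines several values. As a hedged alternative, the query string can be assembled with printf. The bash sketch below builds an $or query over several numeric UIDs, following the $or example from the table above; the function name uid_or_query is hypothetical and not part of slk.

```shell
#!/bin/bash
# Sketch: build an "$or" query over one or more numeric UIDs.
# uid_or_query is a hypothetical helper name.
uid_or_query() {
    local sep="" uid
    # refuse anything that is not a plain non-negative integer
    for uid in "$@"; do
        case "$uid" in
            ''|*[!0-9]*) echo "not a numeric UID: ${uid}" >&2; return 1 ;;
        esac
    done
    printf '{"$or":['
    for uid in "$@"; do
        printf '%s{"resources.posix_uid":%s}' "$sep" "$uid"
        sep=","
    done
    printf ']}\n'
}
```

A call could look like this: slk search "$(uid_or_query "$(id -u k204221)" 24855)". Because the JSON skeleton sits in single-quoted printf format strings, neither the double quotes nor the "$" of the operator need to be escaped.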

Warning

The example shell commands are meant for bash. If you are using csh or tcsh then they do not work as printed here but have to be adapted. Please contact DKRZ support (support@dkrz.de) if you require assistance.