File Search and Metadata

file version: 25 August 2022

current software versions: slk version 3.3.21; slk_helpers version 1.2.4

Set metadata

The content of metadata fields can be set and modified with slk tag. Metadata fields in the resources schema like timestamps or ownership are read-only.

Example:

$ slk tag /arch/bm0146/k204221/test_files netcdf.Title="A great data set"

Currently, slk tag accepts only namespaces and search IDs as input. If applied to a namespace, the metadata of all files in this namespace is set or modified. slk tag does not work with an individual file as input. Instead, a search for this file has to be performed first, and the resulting search ID then has to be used as input for slk tag. slk_helpers gen_file_query can be used to generate the appropriate search query for this purpose.

Example:

$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}

$ slk_helpers search_limited '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892

$ slk tag 117892 netcdf.Title="A great data set"
Search continuing... ...
Search ID: 117892
[========================================/] 100% complete Metadata applied to 1 of 1 resources. Finishing up... ...
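The three steps shown above (generate the query, run the search, tag the search result) can be wrapped in a small bash function. The following is only a minimal sketch: the function name tag_single_file is hypothetical and not part of slk or slk_helpers, and it assumes that slk_helpers search_limited prints a line of the form "Search ID: <number>" on standard output, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of slk or slk_helpers): set one metadata
# field on a single archived file by going through a search.
tag_single_file() {
    local file_path="$1"    # e.g. /arch/bm0146/k204221/test_files/one_file.nc
    local field_value="$2"  # e.g. 'netcdf.Title=A great data set'
    local query search_id

    # step 1: let slk_helpers generate the search query for this file
    query=$(slk_helpers gen_file_query "${file_path}") || return 1

    # step 2: run the search and keep only the number after "Search ID: "
    # (assumed output format, taken from the example above)
    search_id=$(slk_helpers search_limited "${query}" \
        | sed -n 's/^Search ID: *//p' | tail -n 1)

    # stop if no purely numeric search ID was found
    case "${search_id}" in
        ''|*[!0-9]*) echo "could not determine search ID" >&2; return 1 ;;
    esac

    # step 3: apply the metadata to the search result
    slk tag "${search_id}" "${field_value}"
}
```

A call could look like this: tag_single_file /arch/bm0146/k204221/test_files/one_file.nc 'netcdf.Title=A great data set'. The check on the search ID is there so that slk tag is never called with an empty or malformed argument if the search fails.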

Note

A part of the netCDF metadata (mainly global attributes) is copied into a metadata schema. The full metadata extracted from netCDF files is stored in the field netcdf.Data and is read-only. Modifying the metadata in the netCDF metadata schema will not modify the read-only metadata.

Search files by metadata

Note

Currently, slk search is not available due to an internal technical issue. Please use slk_helpers search_limited instead until slk search becomes fully available.

The command slk search allows users to search for files by their metadata. Users can either search by file name, user name and group name via simple flags or formulate complex search queries on all available metadata fields. Search queries in StrongLink have to be written in a special query language whose format is JSON. It is possible to match strings with regular expressions ($regex operator) or to use number comparison operators such as $gte (see the table containing all operators).

# search for "Max" as value in the metadata field "Producer" of the schema "image"
$ slk search '{"image.Producer":"Max"}'
Search continuing. .....
Search ID: 9

# alternatively, use slk_helpers search_limited
$ slk_helpers search_limited '{"image.Producer":"Max"}'
Search continuing. .....
Search ID: 10

# Search for a producer with the name "Max M[...]" using regular expressions
$ slk_helpers search_limited '{"image.Producer": {"$regex": "Max M"}}'
Search continuing. .....
Search ID: 11
# every producer string which contains "Max M" will be matched -- also "Karl Max Mueller-Meyer"

# To match only producers whose sole first name is "Max" and who have a
# single last name, add "^" at the beginning and "$" at the end.
# "[a-z]" means "any lowercase letter from a to z". "*" means "the preceding
# expression may occur zero or more times".
$ slk_helpers search_limited '{"image.Producer": {"$regex": "^Max M[a-z]*$"}}'
Search continuing. .....
Search ID: 11

# find a file based on location and name
$ slk_helpers search_limited '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221","$max_depth":1}},{"resources.name": "one_file.nc"}]}'
Search continuing. ..
Search ID: 117892
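Before a $regex query is submitted, the effect of anchoring can be tried out locally on a few sample strings. The sketch below uses grep -E as a stand-in; it is an assumption that the StrongLink regular expression engine behaves like POSIX extended regular expressions for such simple patterns. The function name regex_matches is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the sample string matches the pattern.
# grep -E only stands in for the StrongLink regex engine, which may differ
# in detail.
regex_matches() {
    local pattern="$1" sample="$2"
    printf '%s\n' "${sample}" | grep -Eq -- "${pattern}"
}

# unanchored pattern: matches anywhere in the string
regex_matches 'Max M' 'Karl Max Mueller-Meyer' && echo "unanchored: match"
# anchored pattern: the whole string must be "Max M" plus lowercase letters
regex_matches '^Max M[a-z]*$' 'Karl Max Mueller-Meyer' || echo "anchored: no match"
regex_matches '^Max M[a-z]*$' 'Max Mueller' && echo "anchored: match"
```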

If you want to search for a file by name or location, you can generate a query string via slk_helpers gen_file_query (see the gen_file_query example applications for further use cases).

# find a file by name and location; let the query be generated by gen_file_query
$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}
$ slk_helpers search_limited '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892

# find files by name and location recursively; let the query be generated
$ slk_helpers gen_file_query -R /arch/bm0146/k204221/INDEX.txt
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}
$ slk_helpers search_limited '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}'
Search continuing. ...
Search ID: 117899

# reduce the copy&paste work in the previous example
$ search_query=$(slk_helpers gen_file_query -R /arch/bm0146/k204221/test_files/one_file.nc)
$ search_id=$(slk_helpers search_limited "${search_query}" | tail -n 1 | cut -c12-20)
# now you can work further with the search ID stored in the variable search_id

The output of a search request is a search_ID. To list the search results, pass the search_ID to slk list or slk_helpers list_search. The output of slk list resembles the output of ls -l; it does not print the path of a search result. In contrast, slk_helpers list_search prints the full path of each search result but omits certain other information.

$ slk list 117899 | tail -n 5
-rw-r--r--- k204221     bm0146          1.5K   10 Nov 2020  INDEX.txt
-rw-r--r--- k204221     bm0146          1.2M   10 Jun 2020  INDEX.txt
-rw-r--r--- k204221     bm0146        924.9K   11 Sep 2020  INDEX.txt
-rw-r--r--- k204221     bm0146          1.3M   11 Sep 2020  INDEX.txt
-rw-r--r--- k204221     bm0146        924.9K   11 Sep 2020  INDEX.txt
Files: 22

$ slk_helpers list_search 117899 | tail -n 5
-rw-r--r---         1208734 /arch/bm0146/k204221/iow2_test/INDEX.txt
-rw-r--r---          947084 /arch/bm0146/k204221/exp/jsbach/INDEX.txt
-rw-r--r---         1347992 /arch/bm0146/k204221/exp/hamocc/INDEX.txt
-rw-r--r---          947084 /arch/bm0146/k204221/exp/echam/INDEX.txt
Resources: 22
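If the search results are to be processed further in a script, the path column of slk_helpers list_search can be extracted. The following sketch assumes the output layout shown above (permissions, size in bytes, full path, and a final summary line starting with "Resources:"); the function name search_result_paths is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the output of "slk_helpers list_search" on
# standard input and print only the full paths, one per line.
# Assumed line layout: <permissions> <size in bytes> <absolute path>
# Lines that do not fit this layout (e.g. "Resources: 22") are dropped.
search_result_paths() {
    sed -n -E 's#^[-dl][^ ]* +[0-9]+ +(/.*)$#\1#p'
}

# intended use (requires a valid search ID):
#   slk_helpers list_search 117899 | search_result_paths
```

Everything from the first "/" after the size column is taken as the path, so paths containing blanks are kept intact.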

The search_ID can also be used as input to slk retrieve and slk tag – see Run a search query and retrieve search results and Set metadata, respectively. The SLURM job script “generate search string and retrieve files” shows one application of slk_helpers gen_file_query and slk_helpers search_limited in combination with slk retrieve.

See also

Further query examples are given below. Available query operators are listed in the Reference: StrongLink query language. See also the StrongLink Command Line Interface Guide from page 6 onwards.

Example queries with explanations

The examples are partly taken from the StrongLink Command Line Interface Guide.

Example queries copied from the manual.

| Query | Purpose |
|---|---|
| {"resources.size":{"$gte": 1048576}} | Find files of one megabyte or more (sizes are in bytes) |
| {"path":{"$gte":"/arch/project"}} | Find files in a specific namespace (recursively) |
| {"path":{"$gte":"/arch/project", "$max_depth": 1}} | Find files in a specific namespace (non-recursively) |
| {"resources.mimetype":"image/jpeg"} | Find files of a specific MIME type |
| {"resources.posix_uid":999} | Find files for a specific UID |
| {"resources.posix_gid":999} | Find files for a specific GID |
| {"resources.mtime":{"$gt":"2020-10-10"}} | Find files modified since a specific date |
| {"project.name":"hadron"} | Find files based on user-defined metadata. The query field is composed of the user-defined schema name and the field name. For example, to query the name field in the project schema, use project.name. |
| {"resources.posix_uid":25301} | Find files of user k204221 (who has UID 25301) |
| {"image.Producer":"Max"} | Find images whose metadata field Producer is set to Max |
| {"resources.name": "search_me.jpg"} | Search for all files with the name search_me.jpg |
| {"resources.name": {"$regex": "file_[0-9].nc"}} | Search for all files whose names match the regular expression file_[0-9].nc |
| {"$or": [{"resources.posix_uid":24855},{"resources.posix_uid":25301}]} | Find files which belong either to user 24855 or to user 25301 |
| {"$and":[{"resources.name": "surface_iow_day3d_temp_emep_2003.nc"}, {"resources.posix_uid": 25301}]} | Find files with the name surface_iow_day3d_temp_emep_2003.nc which belong to user k204221 (who has UID 25301) |
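The $and and $or queries above follow one pattern: a list of complete sub-queries wrapped in an operator. A minimal bash sketch of composing such queries from individual conditions is given here; the function name combine_queries is hypothetical, and the sub-queries are passed through unchanged, without any JSON validation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: combine JSON sub-queries with "$and" or "$or".
# usage: combine_queries and|or '<sub-query>' '<sub-query>' [...]
combine_queries() {
    local operator="$1"; shift
    local joined="" part
    for part in "$@"; do
        # put a comma in front of every sub-query except the first one
        joined+="${joined:+,}${part}"
    done
    printf '{"$%s":[%s]}\n' "${operator}" "${joined}"
}

by_name='{"resources.name":"search_me.jpg"}'
by_owner='{"resources.posix_uid":25301}'
combine_queries and "${by_name}" "${by_owner}"
# prints {"$and":[{"resources.name":"search_me.jpg"},{"resources.posix_uid":25301}]}
```

Keeping each condition in its own single-quoted variable avoids escaping the double quotes and the $ signs of the operators.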

Advanced query examples

# two types of delimiters
$ slk search '{"resources.size":{"$gt": 1048576}}'
$ slk search "{\"resources.size\":{\"\$gt\": 1048576}}"

# using shell variables in calls of slk search
# ~~~~~~~~~~~~~~~~~~~~ method one ~~~~~~~~~~~~~~~~~~~~
$ id -u k204221
25301
$ slk search "{\"resources.posix_uid\":25301}"
...
# ~~~~~~~~~~~~~~~~~~~~ method two ~~~~~~~~~~~~~~~~~~~~
$ export uid=`id -u k204221`
$ slk search "{\"resources.posix_uid\":$uid}"
...
# ~~~~~~~~~~~~~~~~~~~~ method three ~~~~~~~~~~~~~~~~~~~~
$ slk search "{\"resources.posix_uid\":`id -u k204221`}"
...
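The backslash escaping needed inside double quotes can be avoided by keeping the JSON template in single quotes and inserting the variable with printf. This is only a sketch of that idea; the function name uid_query is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the query for all files owned by a numeric UID.
uid_query() {
    local uid="$1"
    # refuse anything that is not a plain number so that no stray
    # characters end up in the JSON string
    case "${uid}" in
        ''|*[!0-9]*) echo "not a numeric UID: ${uid}" >&2; return 1 ;;
    esac
    # the template is single-quoted, so the double quotes need no escaping
    printf '{"resources.posix_uid":%s}\n' "${uid}"
}
```

The generated string can then be passed to the search command, for example slk search "$(uid_query "$(id -u k204221)")".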

Warning

The example shell commands are meant for bash. If you are using csh or tcsh then they do not work as printed here but have to be adapted. Please contact DKRZ support (support@dkrz.de) if you require assistance.