File Search and Metadata#
file version: 08 Dec 2023
current software versions: slk version 3.3.91; slk_helpers version 1.10.2; slk wrappers version: 1.2.1
Metadata in StrongLink#
StrongLink allows users to search for, find and retrieve files based on their metadata. Metadata are stored in metadata fields, e.g. `title`. Each metadata field is part of one metadata schema. Basic file metadata, e.g. owner and size, are automatically extracted from every archived file and stored in a schema called `resources`. When netCDF files are archived, many global attributes and a few variable attributes are extracted and written into the schemata `netcdf` and `netcdf_header`. Metadata from common picture, document, video and audio formats are also extracted and written into the respective schemata. All available metadata schemata, their content and the file types to which they are applied are listed in our Metadata schema reference. All metadata fields except those of the `resources` schema can be manually modified by the user.
Searches are defined via JSON-formatted search queries and are performed via `slk search`. Simple queries to search for files based on their name and location can be generated with `slk_helpers gen_file_query`.
Details on the query language are provided in Reference: StrongLink query language. Several example search queries are provided in slk Usage Examples.
Warning
A metadata field like `Title` might exist in several metadata schemata. A file may be associated not just with one metadata schema but with an arbitrary number of them, e.g. `document`, `netcdf` and `netcdf_header`. Such a file might have different values set for the respective `*.Title` fields, e.g.: `document.Title: "Great data of Max Mustermann"`, `netcdf.Title: "Sea surface temperature of the North Sea from 1900 to 1999"` and `netcdf_header.Title: "sst North Sea 20th century"`.
Set metadata#
slk tag#
The content of metadata fields can be set and modified with `slk tag`. Metadata fields in the `resources` schema, such as timestamps or ownership, are read-only.
Example:
$ slk tag /arch/bm0146/k204221/test_files netcdf.Title="A great data set"
`slk tag` can be applied to files, directories and search IDs. The latter feature is useful when many files in different locations should receive the same metadata (e.g. the same project or the same author). `slk_helpers gen_file_query` can be used to generate an appropriate search query for this purpose.
Example:
$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892
$ slk tag 117892 netcdf.Title="A great data set"
Search continuing... ...
Search ID: 117892
[========================================/] 100% complete Metadata applied to 1 of 1 resources. Finishing up... ...
slk_helpers json2hsm#
`slk tag` is very slow. Updating many files with different metadata per file would therefore take very long. For this purpose, we provide the command `slk_helpers json2hsm`. The metadata are first written into a JSON file, which can contain metadata records for an arbitrary number of files. The command is run as follows:
slk_helpers json2hsm metadata_file.json
The JSON file needs to follow a certain structure which is defined here.
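For orientation only, a minimal input file might look as follows. This sketch is inferred from the `slk_helpers hsm2json` output format shown further below (both commands use the same JSON structure); the linked reference remains authoritative, and the path and `netcdf.Title` value are made-up examples:

```json
[ {
  "path" : "/arch/bm0146/k204221/test_files/one_file.nc",
  "tags" : {
    "netcdf.Title" : "A great data set"
  }
} ]
```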
`slk_helpers json2hsm` has many options to control the import process. All parameters are listed and briefly explained here. `--schema`/`-s` selects a single metadata schema from the input JSON for import; metadata fields from all other schemata are ignored. I.e. the following command will only import metadata of the schema `netcdf`:
slk_helpers json2hsm -s netcdf metadata_file.json
`slk_helpers json2hsm` stops the metadata import if it finds an error in one of the metadata records. If, instead, all good metadata records should be imported and defective ones simply skipped, the parameter `--skip-bad-metadata-sets`/`-k` can be used as follows:
slk_helpers json2hsm --skip-bad-metadata-sets metadata_file.json
`slk_helpers json2hsm` collects all metadata updates and starts applying them after the whole JSON input file has been read. If each metadata record should instead be written directly after it has been read, the parameter `--instant-metadata-record-update` can be specified. This is particularly useful in combination with the parameter `--restart-file RESTARTFILE`/`-r RESTARTFILE`. If this parameter is set, the paths of all files whose metadata have already been updated are written to RESTARTFILE. When `slk_helpers json2hsm` is started with this parameter, all paths from RESTARTFILE are read and the listed files are skipped. This allows resuming an interrupted call of `slk_helpers json2hsm`.
slk_helpers json2hsm --instant-metadata-record-update --restart-file my_restart.txt metadata_file.json
If the target metadata records contain erroneous metadata which should be purged before new metadata is written, the parameter `--write-mode CLEAN` can be specified. When set, each updated metadata schema is emptied before new metadata is written. Metadata schemata to which no new metadata are written are not affected.
slk_helpers json2hsm --write-mode CLEAN metadata_file.json
Print metadata#
Three different commands are available to print metadata from the HSM:
# export metadata of one or more resources as JSON
$ slk tag -display /arch/bm0146/k204221/test.nc
# export metadata of one resource as plain text
$ slk_helpers metadata /arch/bm0146/k204221/test.nc
# export metadata of one or more resources as JSON + have some parameters for fine tuning
$ slk_helpers hsm2json /arch/bm0146/k204221/test.nc
slk tag -display#
`slk tag -display RESOURCE` prints the metadata of RESOURCE to the command line as JSON. When applied to a directory or a search ID, `slk tag -display` prints the metadata of multiple files. In both cases, the JSON structure is the same as the one used by `slk_helpers hsm2json` and `slk_helpers json2hsm` (JSON structure described here).
slk_helpers metadata#
`slk_helpers metadata` prints the metadata of one file as easily human-readable plain text.
$ slk_helpers metadata /dkrz_test/netcdf/20221014a/test_netcdf_03.nc
netcdf
Project: values project
Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
Var_Std_Name: latitude,longitude,height
Data:
Var_Name: time,lat,lon,z,data_var
Time_Min: 961549200000
Time_Max: 961722000000
Contact: values contact,values contact_email,
netcdf_header
Project: values project
Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
Var_Std_Name: latitude,longitude,height
Var_Name: time,lat,lon,z,data_var
Time_Min: 961549200000
Time_Max: 961722000000
Doi: values doi
If the output should be parsed, the parameter --alternative-output-format
might be useful:
$ slk_helpers metadata /dkrz_test/netcdf/20221014a/test_netcdf_03.nc --alternative-output-format
netcdf.Var_Name: time,lat,lon,z,data_var
netcdf_header.Time_Min: 2000-06-21 03:00:00
netcdf_header.Doi: values doi
netcdf.Var_Std_Name: latitude,longitude,height
netcdf.Data:
netcdf_header.Var_Name: time,lat,lon,z,data_var
netcdf_header.Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
netcdf.Time_Min: 2000-06-21 03:00:00
netcdf.Var_Long_Name: latitude,longitude,height,mass_concentration_of_nitric_acid_in_air
netcdf.Project: values project
netcdf.Contact: values contact,values contact_email,
netcdf.Time_Max: 2000-06-23 03:00:00
netcdf_header.Project: values project
netcdf_header.Time_Max: 2000-06-23 03:00:00
netcdf_header.Var_Std_Name: latitude,longitude,height
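Since each line of this output has the form `schema.Field: value`, it can be parsed with standard shell tools. A minimal sketch, with the output captured in a variable for illustration (the sample content is copied from the listing above):

```shell
# sample --alternative-output-format output, captured here as a plain string
metadata='netcdf.Var_Name: time,lat,lon,z,data_var
netcdf.Time_Min: 2000-06-21 03:00:00'

# each line is "schema.Field: value"; print only the wanted field,
# stripping its "schema.Field: " prefix with sed
var_name=$(printf '%s\n' "$metadata" | sed -n 's/^netcdf\.Var_Name: //p')
echo "$var_name"   # time,lat,lon,z,data_var
```

In practice, the variable would be filled via `metadata=$(slk_helpers metadata FILE --alternative-output-format)`.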
slk_helpers hsm2json#
`slk_helpers hsm2json` works similarly to `slk tag -display`. However, it does not accept search IDs. On the other hand, it has more parameters than `slk tag`, which allow some fine-tuning.
The option `--schema`/`-s` selects the metadata schemata whose metadata should be exported. If `--schema` is set, metadata of all schemata which are not listed are ignored. The following command will only export metadata fields of the schema `netcdf_header`:
$ slk_helpers hsm2json /dkrz_test/netcdf/20221014a/test_netcdf_03.nc --schema netcdf_header
[ {
"path" : "/dkrz_test/netcdf/20221014a/test_netcdf_03.nc",
"id" : 61035459010,
"tags" : {
"netcdf_header.Time_Min" : "2000-06-21 03:00:00",
"netcdf_header.Project" : "values project",
"netcdf_header.Time_Max" : "2000-06-23 03:00:00",
"netcdf_header.Doi" : "values doi",
"netcdf_header.Var_Name" : "time,lat,lon,z,data_var",
"netcdf_header.Var_Long_Name" : "latitude,longitude,height,mass_concentration_of_nitric_acid_in_air",
"netcdf_header.Var_Std_Name" : "latitude,longitude,height"
},
"provenance" : {
"timeStampISO" : "2022-10-16T03:08:09.285378838",
"software" : "slk_helpers",
"timeStampMillis" : 1665882489284,
"formatVersion" : "2.0.0",
"softwareVersion" : "1.5.3"
}
} ]
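As the export is plain JSON, individual fields can be extracted with `jq` (which is also used further below to pretty-print queries). A sketch, assuming the record shown above was saved to `out.json` (recreated here in reduced form for illustration):

```shell
# a reduced version of the hsm2json record shown above
cat > out.json <<'EOF'
[ {
  "path" : "/dkrz_test/netcdf/20221014a/test_netcdf_03.nc",
  "tags" : { "netcdf_header.Doi" : "values doi" }
} ]
EOF

# pick one metadata field from the first record; -r prints the raw string
jq -r '.[0].tags["netcdf_header.Doi"]' out.json   # values doi
```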
If we wish to write the metadata into a JSON file, we can either pipe the output of `slk_helpers hsm2json` into a file or provide the parameter `--outfile OUTPUTFILE`/`-o OUTPUTFILE`. The latter can be combined with `--print-summary`, which prints a summary of the metadata export to the command line.
$ slk_helpers hsm2json /dkrz_test/netcdf/20221014a/test_netcdf_03.nc --print-summary --outfile out.json
Export Summary:
1: total number of metadata records read from the HSM (until now)
------
1: number of metadata records read from the HSM and written out
0: number of metadata records read from the HSM but not written because this program exited with an error
0: number of metadata records not read because considered as corrupt
0: number of metadata records skipped because processed in the past
`slk_helpers hsm2json` first collects the metadata of all source files and prints/writes them out after the last metadata record has been read. In some use cases it might be useful to print metadata directly after they have been read from StrongLink; the parameter `--instant-metadata-record-output` is meant for this purpose. It is particularly useful in combination with the parameter `--restart-file RESTARTFILE`/`-r RESTARTFILE`. If this parameter is set, the paths of all files whose metadata have already been printed/written are written to RESTARTFILE. When `slk_helpers hsm2json` is started with this parameter, all paths from RESTARTFILE are read and no metadata of the listed files are exported from StrongLink. This allows resuming an interrupted call of `slk_helpers hsm2json`.
$ slk_helpers hsm2json -R /dkrz_test/netcdf/20221014a --instant-metadata-record-output --restart-file my_restart.txt --print-summary --outfile out.json
Export Summary:
9: total number of metadata records read from the HSM (until now)
------
9: number of metadata records read from the HSM and written out
0: number of metadata records read from the HSM but not written because this program exited with an error
0: number of metadata records not read because considered as corrupt
0: number of metadata records skipped because processed in the past
$ slk_helpers hsm2json -R /dkrz_test/netcdf/20221014a --instant-metadata-record-output --restart-file my_restart.txt --print-summary --outfile out.json
Export Summary:
9: total number of metadata records read from the HSM (until now)
------
0: number of metadata records read from the HSM and written out
0: number of metadata records read from the HSM but not written because this program exited with an error
0: number of metadata records not read because considered as corrupt
9: number of metadata records skipped because processed in the past
$ cat my_restart.txt
/dkrz_test/netcdf/20221014a/test_netcdf_03.nc
/dkrz_test/netcdf/20221014a/test_netcdf_01.nc
/dkrz_test/netcdf/20221014a/test_netcdf_header.nc
/dkrz_test/netcdf/20221014a/test_netcdf_c.nc
/dkrz_test/netcdf/20221014a/test_netcdf_a.nc
/dkrz_test/netcdf/20221014a/test_netcdf_b.nc
/dkrz_test/netcdf/20221014a/test_netcdf_02.nc
/dkrz_test/netcdf/20221014a/test_netcdf_d.nc
/dkrz_test/netcdf/20221014a/test_netcdf_04.nc
The JSON structure is described here.
Search files by metadata#
The command `slk search` allows searching for files by their metadata. Users can either search for file name, user name and group name via simple flags or formulate complex search queries on all available metadata fields. Search queries in StrongLink have to be written in a special JSON-based query language. It is possible to match strings with regular expressions (`$regex` operator) or to use number comparison operators such as `$gte` (table containing all operators).
# search for "Max" as value in the metadata field "Producer" of the schema "image"
$ slk search '{"image.Producer":"Max"}'
Search continuing. .....
Search ID: 9
# Search for a producer with the name "Max M[...]" using regular expressions
$ slk search '{"image.Producer": {"$regex": "Max M"}}'
Search continuing. .....
Search ID: 11
# every producer string which contains "Max M" will be matched -- also "Karl Max Mueller-Meyer"
# To make sure that we only match producers whose first name is exactly "Max"
# and who have one single last name, we need to add "^" at the beginning and "$"
# at the end. "[a-z]" means "any lowercase letter" between "a" and "z". "*"
# means "the preceding expression might occur zero or more times".
$ slk search '{"image.Producer": {"$regex": "^Max M[a-z]*$"}}'
Search continuing. .....
Search ID: 11
# find a file based on location and name
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221","$max_depth":1}},{"resources.name": "one_file.nc"}]}'
Search continuing. ..
Search ID: 117892
If you want to search for a file by name or location, you can generate a query string via `slk_helpers gen_file_query` (more `gen_file_query` example applications here).
# find a file by name and location; let the query be generated
$ slk_helpers gen_file_query /arch/bm0146/k204221/test_files/one_file.nc
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221/test_files","$max_depth":1}},{"resources.name":{"$regex":"one_file.nc"}}]}'
Search continuing. ..
Search ID: 117892
# find files by name and location recursively; let the query be generated
$ slk_helpers gen_file_query -R /arch/bm0146/k204221/INDEX.txt
{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}
$ slk search '{"$and":[{"path":{"$gte":"/arch/bm0146/k204221"}},{"resources.name":{"$regex":"INDEX.txt"}}]}'
Search continuing. ...
Search ID: 117899
# reduce the copy&paste work in the previous example
$ search_query=`slk_helpers gen_file_query -R /arch/bm0146/k204221/test_files/one_file.nc`
$ search_id=$(eval "slk search '"${search_query}"' | tail -n 1 | cut -c12-20")
# now you can further work with the search ID in a variable
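Instead of cutting fixed character columns (`cut -c12-20`), the ID can also be extracted by matching the `Search ID:` prefix, which is more robust against varying ID lengths. A sketch with the `slk search` output captured in a variable for illustration:

```shell
# example slk search output, captured as a plain string for illustration
search_output='Search continuing. ..
Search ID: 117892'

# print only the line starting with "Search ID: " and strip that prefix
search_id=$(printf '%s\n' "$search_output" | sed -n 's/^Search ID: //p')
echo "$search_id"   # 117892
```

In practice, the variable would be filled via `search_output=$(slk search "$search_query")`.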
The output of a search request is a search ID. In order to list the search results, the search ID is used as input for `slk list` or `slk_helpers list_search`. The output of `slk list` resembles the output of `ls`; it does not print the paths of the search results. `slk_helpers list_search`, in contrast, prints the full path of each search result but skips certain other information.
$ slk list 117899 | tail -n 5
-rw-r--r--- k204221 bm0146 1.5K 10 Nov 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 1.2M 10 Jun 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 924.9K 11 Sep 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 1.3M 11 Sep 2020 INDEX.txt
-rw-r--r--- k204221 bm0146 924.9K 11 Sep 2020 INDEX.txt
Files: 22
$ slk_helpers list_search 117899 | tail -n 5
-rw-r--r--- 1208734 /arch/bm0146/k204221/iow2_test/INDEX.txt
-rw-r--r--- 947084 /arch/bm0146/k204221/exp/jsbach/INDEX.txt
-rw-r--r--- 1347992 /arch/bm0146/k204221/exp/hamocc/INDEX.txt
-rw-r--r--- 947084 /arch/bm0146/k204221/exp/echam/INDEX.txt
Resources: 22
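The path column of `slk_helpers list_search` can be extracted with `awk`, e.g. to feed the result list into further commands; the trailing `Resources:` summary line has to be dropped. A sketch on sample output copied from above:

```shell
# sample list_search output, captured as a plain string for illustration
list_output='-rw-r--r--- 1208734 /arch/bm0146/k204221/iow2_test/INDEX.txt
-rw-r--r--- 947084 /arch/bm0146/k204221/exp/echam/INDEX.txt
Resources: 2'

# keep only the third column (the path); the summary line is skipped
# automatically because its third field is empty and does not start with "/"
paths=$(printf '%s\n' "$list_output" | awk '$3 ~ /^\// {print $3}')
echo "$paths"
```

In practice, the variable would be filled via `list_output=$(slk_helpers list_search $search_id)`.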
The search ID can also be used as input for `slk retrieve` and `slk tag` – see Run a search query and retrieve search results and Set metadata, respectively. The SLURM job script “generate search string and retrieve files” shows one application of `slk_helpers gen_file_query` and `slk_helpers search` in combination with `slk retrieve`.
See also
Further query examples are given below. Available query operators are given in the Reference: StrongLink query language. See also StrongLink Command Line Interface Guide from page 6 onwards.
Print a search query in a human-readable way#
We have got this search query and want to analyze it:
slk search '{"$and": [{"resources.name": "INDEX.txt"}, {"$or": [{"$and": [{"resources.posix_uid": 25301}, {"path": {"$gte": "/arch"}}]}, {"path": {"$gte": "/double/bm0146"}}]}]}'
The search queries are written in JSON. You can use `jq` to print them in a human-readable way:
$ echo '{"$and": [{"resources.name": "INDEX.txt"}, {"$or": [{"$and": [{"resources.posix_uid": 25301}, {"path": {"$gte": "/arch"}}]}, {"path": {"$gte": "/double/bm0146"}}]}]}' | jq
{
"$and": [
{
"resources.name": "INDEX.txt"
},
{
"$or": [
{
"$and": [
{
"resources.posix_uid": 25301
},
{
"path": {
"$gte": "/arch"
}
}
]
},
{
"path": {
"$gte": "/double/bm0146"
}
}
]
}
]
}
Example queries with explanations#
The examples are partly taken from the StrongLink Command Line Interface Guide.
| Query | Purpose |
|---|---|
| `{"resources.size": {"$gt": 1048576}}` | Find files greater than one megabyte (sizes are in bytes) |
| `{"path": {"$gte": "/arch/bm0146/k204221"}}` | Find files in a specific namespace (recursively) |
| `{"path": {"$gte": "/arch/bm0146/k204221", "$max_depth": 1}}` | Find files in a specific namespace (non-recursively) |
| … | Find files of a specific MIME type |
| `{"resources.posix_uid": 25301}` | Find files for a specific UID |
| … | Find files for a specific GID |
| … | Find files modified since a specific date |
| … | Find files based on user-defined metadata; the user-defined schema and field name form the field, e.g. `netcdf.Title` |
| `{"resources.posix_uid": 25301}` | Find files of user k204221 (who has UID 25301) |
| … | Find images whose metadata field matches a given value |
| `{"resources.name": "INDEX.txt"}` | Search for all files with the name `INDEX.txt` |
| `{"resources.name": {"$regex": "INDEX.txt"}}` | Search for all files whose names match a regular expression |
| `{"$or": [{"resources.posix_uid": 24855}, {"resources.posix_uid": 25301}]}` | Find files which belong either to user 24855 or to user 25301 |
| `{"$and": [{"path": {"$gte": "/arch/bm0146/k204221", "$max_depth": 1}}, {"resources.name": "one_file.nc"}]}` | Find files with the name `one_file.nc` in a specific namespace |
Advanced query examples#
# two types of delimiters
$ slk search '{"resources.size":{"$gt": 1048576}}'
$ slk search "{\"resources.size\":{\"\$gt\": 1048576}}"
# using shell variables in calls of slk search
# ~~~~~~~~~~~~~~~~~~~~ method one ~~~~~~~~~~~~~~~~~~~~
$ id k204221 -u
25301
$ slk search "{\"resources.posix_uid\":25301}"
...
# ~~~~~~~~~~~~~~~~~~~~ method two ~~~~~~~~~~~~~~~~~~~~
$ export uid=`id k204221 -u`
$ slk search "{\"resources.posix_uid\":$uid}"
...
# ~~~~~~~~~~~~~~~~~~~~ method three ~~~~~~~~~~~~~~~~~~~~
$ slk search "{\"resources.posix_uid\":`id k204221 -u`}"
...
Warning
The example shell commands are meant for `bash`. If you are using `csh` or `tcsh`, they do not work as printed here but have to be adapted. Please contact DKRZ support (support@dkrz.de) if you require assistance.