Fast Lustre File System find and disk usage Tools

The standard Unix commands du and find can take minutes to hours on the Lustre file system. Because this is impractical for large projects, we provide two fast alternatives, lustre_find and lustre_du, which take a few seconds to a few minutes even for large searches. For data privacy reasons, searches are limited to your own projects under /work.

While [luv](https://luv.dkrz.de) already reports what fraction of your files has not been used for a long time, these tools help you find out where those files are actually located.

Note

These tools query a dedicated metadata database in the background. The database gets updated nightly. If you have created, modified, or deleted files during the day, those changes will not appear until after the next update.

All interaction happens through the commands described below; direct database access is neither required nor recommended.

Available Tools

The following commands are available:

| Command | Purpose |
|---|---|
| `lustre_find` | Search for files and paths matching a pattern |
| `lustre_du` | Compute directory usage (sizes, file counts, etc.) |

Each command supports --help for an overview of usage and options.

Common Features

- Works on projects under /work.
- Wildcards follow the standard shell glob syntax:
  - `*` matches any sequence of characters
  - `?` matches a single character
- Output can be saved as .parquet, .csv or .json.
- Filtering supports DuckDB SQL predicates, e.g. "size > 1e6 AND atime_ms < '2025-11-01'" (both the outer double quotes and the inner single quotes are required).
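As a rough illustration of the wildcard rules above, the following sketch uses the shell's own pattern matching in a `case` statement. The helper `glob_match` and the sample paths are made up for this illustration, and the sketch assumes that `*` may also span `/` characters. The examples on this page suggest this, but it remains an assumption about the tools' behaviour.

```shell
# Hypothetical helper: succeeds if path $2 matches glob pattern $1.
# Uses the shell's built-in pattern matching; it does not query Lustre.
glob_match() {
    case "$2" in
        $1) return 0 ;;   # unquoted $1 is interpreted as a pattern
        *)  return 1 ;;
    esac
}

# '*' matches any sequence of characters (here including '/').
if glob_match "/work/ik1017/*.nc" "/work/ik1017/run1/tas.nc"; then
    echo "match"
fi

# '?' matches exactly one character, so 'run?' matches 'run1' but not 'run10'.
if ! glob_match "/work/ik1017/run?/tas.nc" "/work/ik1017/run10/tas.nc"; then
    echo "no match"
fi
```

Because the patterns are quoted when passed to the helper, the shell hands them over unchanged instead of expanding them against the local directory, which is the same reason the patterns given to lustre_find should be quoted.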

The following fields can be added via --add or used in --filter:

| Field | Description |
|---|---|
| `size` | File size in bytes |
| `ctime_ms` | Change time (inode metadata change), in milliseconds; accepts dates as quoted 'YYYY-MM-DD' strings |
| `mtime_ms` | Modification time (content change), in milliseconds |
| `atime_ms` | Last access time, in milliseconds |
| `crtime_ms` | Creation time, in milliseconds |
| `path` | Full absolute path |
| `ino` | Inode number (unique per filesystem) |
| `uid` | User ID of the file owner |
| `gid` | Group ID of the file's group |
| `projid` | Project ID used for project-based quotas |
| `mode` | File type and permissions (POSIX `st_mode` bitmask; ACL flags like `+` in `ls -l` are not included) |
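The `mode` field packs file type and permission bits into a single integer. The sketch below shows how such a value could be split with shell arithmetic. The value 33188 is a made-up example (octal 100644), and the octal masks are the conventional Linux `st_mode` layout rather than anything specific to these tools.

```shell
# Example st_mode value (hypothetical): 33188 decimal = 0100644 octal.
mode=33188

# Lower 12 bits: permissions plus setuid/setgid/sticky.
perms=$(printf '%o' $(( mode & 07777 )))

# Upper bits: file type (0100000 = regular file, 0040000 = directory,
# 0120000 = symbolic link on Linux).
ftype=$(printf '%o' $(( mode & 0170000 )))

echo "permissions: $perms"   # 644, i.e. rw-r--r--
echo "type bits:   $ftype"   # 100000, i.e. regular file
```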

lustre_find

lustre_find locates paths or files matching a given pattern. It behaves similarly to find on Linux but is backed by the Lustre metadata database, which makes it significantly faster for large directories.

Examples

Warning

When using wildcards (*, ?), quote the pattern to avoid shell interpretation.

You can only search within your own projects under /work.

# Find all .nc files in a project
lustre_find "/work/ik1017/CMIP6/data/CMIP6/*.nc" # adjust to one of your projects

# Add extra columns to the output (e.g. uid, gid)
lustre_find "/work/ik1017/*.nc" --add uid --add gid

# Filter by file size and access time using SQL. Note the quotes!
lustre_find "/work/ik1017/*.nc" \
    --filter "size > 5e6 AND atime_ms >= '2025-06-15'"

# Change the number of rows printed
lustre_find "/work/ik1017/*.nc" --max_rows 50

# Save the result to a file (here CSV; .parquet and .json are also supported)
lustre_find "/work/ik1017/*.nc" --save /your_path_to_store_result/results.csv

Key Options

| Option | Description |
|---|---|
| `PATTERN` | GLOB-like path (use `*` and `?` wildcards) |
| `--project` | Project name (e.g. `ik1017`). Optional; auto-detected. |
| `--add COL` | Show additional metadata columns (e.g. `uid` or `gid`) |
| `--filter` | SQL predicate, e.g. `"size > 1e6 AND atime_ms = '2025-11-01'"` |
| `--max_rows N` | Show at most N rows |
| `--save PATH` | Save results to `.parquet`, `.csv` or `.json` |
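The two levels of quoting in `--filter` are easy to get wrong when part of the predicate comes from a shell variable. A minimal sketch, assuming a hypothetical cutoff date and size threshold: the double quotes let the shell expand the variables and pass the predicate as one argument, while the single quotes inside mark the date as an SQL string.

```shell
# Hypothetical values for illustration.
cutoff="2025-06-15"
min_size="5e6"

# Double quotes: the shell expands variables, the predicate stays one argument.
# Single quotes: the date reaches the tool as a quoted SQL string.
filter="size > ${min_size} AND atime_ms >= '${cutoff}'"

printf '%s\n' "$filter"
# Prints: size > 5e6 AND atime_ms >= '2025-06-15'

# The predicate can then be passed on unchanged, e.g.:
#   lustre_find "/work/ik1017/*.nc" --filter "$filter"
```

Wrapping the whole predicate in single quotes instead would stop the shell from expanding `${cutoff}` and `${min_size}`, so the variables would reach the tool as literal text.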

lustre_du

lustre_du provides fast, directory-level summaries similar to the UNIX du command. You can view total size, number of files, or aggregated metadata for any directory prefix.

Examples

Warning

You can only search within your own projects under /work.

# Show total size of a directory in /work
lustre_du /work/ik1017/ # adjust to one of your projects

# Limit display to largest subdirectories
lustre_du /work/ik1017/ --max_rows 20

# Get additional information through added columns (e.g. uid)
lustre_du /work/ik1017/ --add uid

Key Options

| Option | Description |
|---|---|
| `PATH` | Path in `/work` |
| `--filter` | SQL predicate applied before counting, e.g. `"atime_ms > '2025-06-15' AND size >= 5000"` (both the outer double quotes and the inner single quotes are required) |
| `--add col[:agg]` | Additional columns to aggregate (repeatable). Allowed aggs: `any`, `max`, `min`, `avg`, `sum`, `count`. Default: `any`. |
| `--max_rows N` | Limit number of rows shown for subdirectories |
| `--save PATH` | Save results. Pass a directory (written as Parquet) or a file path ending in `.parquet`, `.csv` or `.json` to select the format |

Note

Sizes might differ slightly from those reported by du because block size rather than file size is used. As blocks are 4096 bytes, sizes reported by lustre_du and lustre_find are always a multiple of 4096 and might therefore be slightly larger than the size plain du would report.
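To make the rounding concrete, the following sketch rounds a byte count up to the next multiple of 4096. The helper name `round_to_block` is hypothetical, and rounding up is an assumption about how block usage is accounted (sparse files, for instance, can occupy fewer blocks than their length suggests).

```shell
block_size=4096

# Hypothetical helper: round a size in bytes up to a whole number of blocks.
round_to_block() {
    echo $(( ($1 + block_size - 1) / block_size * block_size ))
}

round_to_block 1       # 4096
round_to_block 4096    # 4096
round_to_block 5000    # 8192
```

Under this assumption the overhead is at most 4095 bytes per file, so it matters mainly for directories with very many small files.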

While du traverses the directory tree from a root inode, lustre_du aggregates over all paths that match the given path string. This makes it possible to start from multiple roots that share the given prefix. To start from a single directory root, as du does, append / to the path string. For example, du /work/ik1017/CMIP6 traverses only that directory, while lustre_du /work/ik1017/CMIP6 would also include /work/ik1017/CMIP6input/ and /work/ik1017/CMIP6_tmp/.
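The effect of the trailing slash can be reproduced with plain shell prefix matching. This is only an illustration of the matching rule described above, not of how lustre_du is implemented; the helper `has_prefix` and the file names are hypothetical.

```shell
# Hypothetical helper: succeeds if path $2 starts with the string $1.
has_prefix() {
    case "$2" in
        "$1"*) return 0 ;;   # quoted $1 is taken literally
        *)     return 1 ;;
    esac
}

for p in /work/ik1017/CMIP6/a.nc \
         /work/ik1017/CMIP6input/b.nc \
         /work/ik1017/CMIP6_tmp/c.nc; do
    # Without a trailing slash, all three paths match.
    if has_prefix /work/ik1017/CMIP6 "$p"; then
        echo "CMIP6  matches $p"
    fi
    # With a trailing slash, only the path inside CMIP6/ matches.
    if has_prefix /work/ik1017/CMIP6/ "$p"; then
        echo "CMIP6/ matches $p"
    fi
done
```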