Fast Lustre File System Find and Disk Usage Tools

The basic Unix commands du and find can take minutes to hours on the Lustre file system. As this is impractical for large projects, we provide two fast alternatives, lustre_find and lustre_du, which take a few seconds to a few minutes even for large searches. For reasons of basic data privacy, your searches are limited to your own projects under /work.

While [luv](https://luv.dkrz.de) already shows what fraction of your files has been unused for a long time, these tools help you find out where those files are actually located.

Note

These tools query a dedicated metadata database in the background. The database gets updated nightly. If you have created, modified, or deleted files during the day, those changes will not appear until after the next update.

All interaction happens through the command-line tools described below; direct database access is neither required nor recommended.

Available Tools

The following commands are available:

| Command | Purpose |
| --- | --- |
| lustre_find | Search for files and paths matching a pattern |
| lustre_du | Compute directory usage (sizes, file counts, etc.) |

Each command supports --help for an overview of usage and options.

Common Features

  • Works on projects under /work.

  • Wildcards follow the standard shell (GLOB) syntax:

    • * matches any sequence of characters

    • ? matches a single character

  • Output can be saved as .parquet, .csv or .json.

  • Filtering supports DuckDB SQL predicates, e.g. "size > 1e6 AND atime_ms < '2025-11-01'". Both kinds of quotes are needed: double quotes around the whole predicate and single quotes around the date.
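
Because a predicate mixes double and single quotes, it can be easier to build it in a shell variable first. The following is a minimal sketch, assuming GNU date is available (as on most Linux systems); the variable names and the 180-day cutoff are illustrative only, and the lustre_find call runs only where the tool is installed:

```shell
# Compute a cutoff date 180 days in the past (GNU date syntax, an assumption).
cutoff=$(date -d "180 days ago" +%F)

# Double quotes delimit the predicate for the shell;
# single quotes mark the date literal inside the SQL predicate.
filter="size > 1e6 AND atime_ms < '${cutoff}'"
printf 'filter: %s\n' "$filter"

# Run the search only where lustre_find is installed.
if command -v lustre_find >/dev/null 2>&1; then
    lustre_find "/work/ik1017/*.nc" --filter "$filter"   # adjust to one of your projects
fi
```

Passing "$filter" in double quotes keeps the whole predicate as a single argument, so the single quotes around the date reach the tool unchanged.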

The following fields can be added via --add or used in --filter:

| Field | Description |
| --- | --- |
| size | File size in bytes |
| ctime_ms | Change time (inode metadata change), in milliseconds |
| mtime_ms | Modification time (content change), in milliseconds |
| atime_ms | Last access time, in milliseconds |
| crtime_ms | Creation time, in milliseconds |
| path | Full absolute path |
| ino | Inode number (unique per filesystem) |
| uid | User ID of the file owner |
| gid | Group ID of the file's owning group |
| projid | Project ID used for project-based quotas |
| mode | File type and permissions (POSIX st_mode bitmask; ACL flags like + in ls -l are not included) |

In filters, time fields accept dates as quoted 'YYYY-MM-DD' strings.
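
Since mode is a POSIX st_mode bitmask, file type and permission bits can be separated with the standard masks. The following is a rough sketch, assuming the column is reported as a decimal integer (the output format is not specified here); the value 33188 is a made-up example:

```shell
# Example value as it might appear in the mode column (assumption: decimal integer).
mode=33188

# Lower 12 bits: permission bits including setuid/setgid/sticky, printed in octal.
perms=$(printf '%o' $(( mode & 07777 )))

# Remaining upper bits: file type (POSIX S_IFMT mask, octal 0170000).
case $(( mode & 0170000 )) in
    32768) ftype="regular file" ;;   # S_IFREG, octal 0100000
    16384) ftype="directory" ;;      # S_IFDIR, octal 0040000
    40960) ftype="symbolic link" ;;  # S_IFLNK, octal 0120000
    *)     ftype="other" ;;
esac

printf '%s: %s, permissions %s\n' "$mode" "$ftype" "$perms"
# prints "33188: regular file, permissions 644"
```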

lustre_find

lustre_find locates paths or files matching a given pattern. It behaves similarly to find on Linux but is backed by the Lustre metadata database, which makes it significantly faster for large directories.

Examples

Warning

When using wildcards (*, ?), quote the pattern to avoid shell interpretation.

You can only search within your own projects under /work.
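
To see why quoting matters, the effect can be reproduced locally without the Lustre tools. This small sketch uses a scratch directory (created with mktemp, names illustrative) and counts how many arguments a command would receive in each case:

```shell
# Create a scratch directory with two matching files.
tmp=$(mktemp -d)
touch "$tmp/a.nc" "$tmp/b.nc"

# Unquoted pattern: the shell expands it into one argument per matching file.
set -- "$tmp"/*.nc
unquoted_count=$#

# Quoted pattern: it is passed through as a single, literal argument.
set -- "$tmp/*.nc"
quoted_count=$#

printf 'unquoted: %d argument(s), quoted: %d argument(s)\n' \
    "$unquoted_count" "$quoted_count"
# prints "unquoted: 2 argument(s), quoted: 1 argument(s)"

rm -rf "$tmp"
```

With the quoted form, the pattern itself reaches the tool, which can then match it against the metadata database instead of receiving whatever the shell found on disk.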

# Find all .nc files in a project
lustre_find "/work/ik1017/CMIP6/data/CMIP6/*.nc" # adjust to one of your projects

# Add extra columns to the output (e.g. uid, gid)
lustre_find "/work/ik1017/*.nc" --add uid --add gid

# Filter by file size and access time using SQL. Note the quotes!
lustre_find "/work/ik1017/*.nc" \
    --filter "size > 5e6 AND atime_ms >= '2025-06-15'"

# Change the number of rows printed
lustre_find "/work/ik1017/*.nc" --max_rows 50

# Save result to a CSV file (.parquet and .json are also supported)
lustre_find "/work/ik1017/*.nc" --save /your_path_to_store_result/results.csv

Key Options

| Option | Description |
| --- | --- |
| PATTERN | GLOB-like path (use * and ? wildcards) |
| --project | Project name (e.g. ik1017). Optional; auto-detected. |
| --add COL | Show additional metadata columns (e.g. uid or gid) |
| --filter | SQL predicate, e.g. "size > 1e6 AND atime_ms < '2025-11-01'" |
| --max_rows N | Show at most N rows |
| --save PATH | Save results to .parquet, .csv or .json |
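
A saved CSV file can be post-processed with standard tools. The column layout of saved files is not described here, so the sample below is purely hypothetical: it assumes a header row containing path and size columns, and stands in for a file written with --save. Check the header of your actual file before adapting it.

```shell
# Hypothetical sample standing in for a file written by --save results.csv.
csv=$(mktemp)
cat > "$csv" <<'EOF'
path,size
/work/ik1017/a.nc,6000000
/work/ik1017/b.nc,9000000
EOF

# Locate the "size" column by its header name, then sum it over all rows.
total=$(awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "size") col = i; next }
    col     { sum += $col }
    END     { printf "%d", sum }
' "$csv")

printf 'total bytes: %s\n' "$total"
rm -f "$csv"
```

Splitting on commas is only safe when no path contains a comma or a quote character; for such files a CSV-aware tool is the better choice.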

lustre_du

lustre_du provides fast, directory-level summaries similar to the UNIX du command. You can view total size, number of files, or aggregated metadata for any directory prefix.

Examples

Warning

You can only search within your own projects under /work.

# Show total size of a directory in /work
lustre_du /work/ik1017/ # adjust to one of your projects

# Limit display to largest subdirectories
lustre_du /work/ik1017/ --max_rows 20

# Get additional information through added columns (e.g. uid)
lustre_du /work/ik1017/ --add uid

Key Options

| Option | Description |
| --- | --- |
| PATH | Path in /work |
| --filter | SQL predicate applied before counting, e.g. "atime_ms > '2025-06-15' AND size >= 5000" (needs both quotes) |
| --add col[:agg] | Additional columns to aggregate (repeatable). Allowed aggs: any, max, min, avg, sum, count. Default: any. |
| --max_rows N | Limit number of rows shown for subdirectories |
| --save PATH | Save results to a directory (written as Parquet) or to a file whose extension (.parquet, .csv or .json) selects the format |
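
The options above can be combined, for example to see when each subdirectory was last accessed or modified. This is a sketch only: it assumes the time fields can be aggregated with max like any other column, the project path and filter are placeholders, and the command runs only where lustre_du is installed.

```shell
# Placeholders: adjust the path to one of your own projects.
project_dir="/work/ik1017/"
filter="size >= 5000"   # predicate applied before counting

if command -v lustre_du >/dev/null 2>&1; then
    # Latest access and modification time per subdirectory, 20 rows at most.
    lustre_du "$project_dir" \
        --add atime_ms:max \
        --add mtime_ms:max \
        --filter "$filter" \
        --max_rows 20
else
    echo "lustre_du is not available on this machine"
fi
```

Subdirectories whose latest access time lies far in the past are natural candidates for archiving or cleanup.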