PrePARE

The PrePARE software tool is provided by PCMDI (Program for Climate Model Diagnosis and Intercomparison) to verify that CMIP6 files conform to the CMIP6 data protocol. The CMIP6 data protocol comprises requirements set out in different documents published by the CMIP6 WIP (WGCM Infrastructure Panel). These documents are translated into .json-formatted Controlled Vocabularies and tables that PrePARE can read.

PrePARE performs 10 different tests, which can be summarized by the following points:

  1. Checks for invariably and conditionally required global attributes and for valid values of these attributes.

  2. Checks that file names and paths conform to the project’s data reference syntax (DRS).

  3. Checks for required variable attributes.

  4. Checks the coordinates: some variables are requested on specific coordinates, which need to be provided in the files in a compliant format.

In the following, we run PrePARE for a subset of CMIP6 pool data.

Preparation

We provide a conda environment that contains all quality checkers for the CMIP data pool. If you would like to work in bash, you can simply activate the environment. In order to use it as a kernel for Jupyter notebooks, you need to install it with ipykernel. Afterwards, restart the server and select the new kernel for the quality assurance notebook.

  1. The following loads conda by writing it into your PATH. If you already have conda available or the kernel for the environment, you can skip this step:

# Capture the current PATH and prepend the system-wide conda installation to it
originpath = !echo $PATH
%env PATH=/sw/rhel6-x64/conda/anaconda3-bleeding_edge/bin:{originpath[0]}
  2. The following block activates the environment and contains a (commented out) line for installing a kernel for Jupyter notebooks in JupyterHub:

%%bash
# The following line activates the conda environment for working in a shell:
source activate /work/bm0021/conda-envs/quality-assurance
# The following (commented out) line installs a kernel for the conda environment:
#envkernel conda --name=cmip-quality-assurance --user /work/bm0021/conda-envs/quality-assurance

Since PrePARE cannot be used as a Python module directly, we only import it as pp to get the path of its executable script, which we then run in a shell controlled by the subprocess library.

import cmip6_cv.PrePARE.PrePARE as pp
import subprocess

We also import some useful packages:

# copy2 copies a file and preserves its metadata
from shutil import copy2
# tqdm displays a progress bar for loops
from tqdm import tqdm

Since the data standard evolves over time, we need to find the matching version for the datasets which should be tested. For that, we need git to check out the corresponding version of the data standard tables, named cmip6-cmor-tables. You can clone the tables repository via:

import git
import os
import re
# The following clones the cmip6-cmor-tables repository if it is not yet available:
#working_path="/mnt/lustre01/work/bm0021/"
working_path="/home/dkrz/k204210/"
cmip6_cmor_tables_url="https://github.com/PCMDI/cmip6-cmor-tables.git"
if "cmip6-cmor-tables" not in os.listdir(working_path):
    git.Git(working_path).clone(cmip6_cmor_tables_url)

One table in the tables repository contains only the global attributes and no information about variables: CMIP6_CV.json, where ‘CV’ stands for Controlled Vocabulary. In contrast to the tables which contain variables, only the most recent version of the global attributes table is valid, because this file is hardly ever changed but rather extended. Whenever we check out a different version of the tables repository, we need to copy the most recent global attributes CV into that version. Therefore, we first copy this CV to a safe place named recentCV.

recentCV = working_path+"CMIP6-CV-20210419.json"
copy2(working_path+"cmip6-cmor-tables/Tables/CMIP6_CV.json", recentCV)
'/home/dkrz/k204210/CMIP6-CV-20210419.json'

Settings

The following variables are important for PrePARE and will be defined:

  • logChunk is the directory that will hold the results (logfiles) of PrePARE

  • cmip6-cmor-table-path is the directory of the input tables

  • exec is the PrePARE executable which we will run in bash

prepareSetting = {
    "exec" : pp.__file__ ,
    #"logChunk":"/mnt/lustre01/work/bm0021/prepare-test/",
    "logChunk":"/home/dkrz/k204210/prepare-test/",
    "cmip6-cmor-table-path" : working_path+"cmip6-cmor-tables/Tables" }
prepareSetting["exec"]
'/home/dkrz/k204210/miniconda3/lib/python3.7/site-packages/cmip6_cv/PrePARE/PrePARE.py'

Initialization

We read in the dataset list, load the git repository of the tables and copy the most recent Controlled Vocabulary for required attributes.

g = git.Git(prepareSetting["cmip6-cmor-table-path"])
g.reset("--hard")
g.checkout("master")
copy2(prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json", recentCV)
'/home/dkrz/k204210/CMIP6-CV-20210419.json'

Assume we want to test the dataset dset_id in the directory trunk:

trunk="/mnt/lustre02/work/ik1017/CMIP6/data/"
dset_id="CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710"

In order to find out which data standard version was used for the creation of the files to be tested, we need to retrieve the value of the global attribute data_specs_version from one file of the dataset. We assign a corresponding attribute data_specs_version to the dset_id and combine both in a dictionary.

dsets_to_test={dset_id : { "dset_path":trunk+'/'.join(dset_id.split('.')),
                         "data_specs_version":""
                        }
             }

The function addSpecs retrieves the specs attribute by using the bash tool ncdump -h, which shows the header of a file including all global attributes.

def addSpecs(entry):
    try:
        fileinpath = [os.path.join(entry["dset_path"],f) for f in os.listdir(entry["dset_path"]) if os.path.isfile(os.path.join(entry["dset_path"],f))]
    except OSError:
        # Return an empty string if the dataset path cannot be read
        return ""
#    ncdump_exec="/sw/rhel6-x64/netcdf/netcdf_c-4.4.1.1-gcc48/bin/ncdump"
    # Extract the value of the global attribute data_specs_version from the file header
    dsv = !ncdump -h {fileinpath[0]} | grep data_specs_version | cut -d '"' -f 2
    dsv = ''.join(dsv)
    return dsv
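
If ncdump is not available, the attribute can also be read directly in Python. The following is only a minimal sketch of such an alternative, assuming the netCDF4 package is installed in the environment; the function name addSpecs_py is hypothetical and not used in the rest of this notebook.

import netCDF4

def addSpecs_py(entry):
    # Sketch: read data_specs_version from the first file of a dataset with netCDF4
    try:
        files = sorted(f for f in os.listdir(entry["dset_path"])
                       if os.path.isfile(os.path.join(entry["dset_path"], f)))
    except OSError:
        return ""
    if not files:
        return ""
    with netCDF4.Dataset(os.path.join(entry["dset_path"], files[0])) as ds:
        # getattr falls back to an empty string if the attribute is missing
        return getattr(ds, "data_specs_version", "")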

And now we apply it to all datasets in the dsets_to_test dictionary:

for dset, entry in dsets_to_test.items():
    entry["data_specs_version"] = addSpecs(entry)

Retrieving all versions of the cmip6-cmor-tables repository

We use the tags of the version releases and reformat their values to conform to the data_specs_version format.

tags = reversed(g.tag("-n").split("\n"))
tagdict = {"data_specs_versions":[]}
for tag in tags :
    tl = tag.split(" ", 1)[0]
    tllen = len(tl.split("."))
    if tllen > 3 :
        continue
    dsvnumber = tl.split(".")[tllen-1]
    dsvnumber = "".join(filter(str.isdigit, dsvnumber))
    dsv = "['01.00."+dsvnumber+"']"
    if dsv not in tagdict["data_specs_versions"] :
        tagdict["data_specs_versions"].append(dsv)
        tagdict[dsv]={"tag_label":tl,
                      "description":tag.split(" ",1)[1]}
print(tagdict['data_specs_versions'])
["['01.00.32']", "['01.00.31']", "['01.00.30']", "['01.00.29']", "['01.00.28']", "['01.00.27']", "['01.00.24']", "['01.00.23']", "['01.00.22']", "['01.00.21']", "['01.00.20']", "['01.00.19']", "['01.00.18']", "['01.00.17']", "['01.00.16']", "['01.00.15']", "['01.00.14']", "['01.00.13']", "['01.00.12']", "['01.00.11']"]

Application

We loop over the datasets to be checked. Note that for many different datasets from different sources, it might be helpful to loop over the different data_specs_versions instead, so that we check out each cmip6-cmor-tables repository version only once; a possible grouping is sketched below.
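
The following is a minimal sketch of such a grouping, based on the dsets_to_test dictionary defined above; the resulting dsets_by_specs dictionary is only illustrative and not used in the remainder of this notebook.

from collections import defaultdict

# Group dataset ids by their data_specs_version so that each
# cmip6-cmor-tables version would have to be checked out only once
dsets_by_specs = defaultdict(list)
for dset_id, dsetatts in dsets_to_test.items():
    dsets_by_specs[dsetatts["data_specs_version"]].append(dset_id)

for specs_version, dset_ids in dsets_by_specs.items():
    print(specs_version, len(dset_ids), "dataset(s)")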

For the PrePARE run itself, we define the function checkSubset where:

  • We skip datasets for which we do not have a corresponding table version.

  • We define a unique logPath for each dataset to be tested, built from logChunk, the data_specs_version and the dset_id. PrePARE creates missing directories on its own, which we exploit. If the log directory already contains data, we skip the test to avoid duplicate checks.

  • We check out the matching cmip6-cmor-tables version and overwrite its CV with the most recent one saved at the beginning of this script.

  • We run PrePARE as a subprocess on the dataset directory.

def checkSubset(dset_id, dsetatts):
    if not "['"+dsetatts["data_specs_version"]+"']" in tagdict["data_specs_versions"] :
        print("No matching tag for data_specs_version {}".format(dsetatts["data_specs_version"]))
        return
    logPath=prepareSetting["logChunk"]+dsetatts["data_specs_version"].split('.')[2].split("'")[0]+"/"+dset_id
    if os.path.exists(logPath) and len(os.listdir(logPath)) != 0:
        return
    tag2checkout = tagdict["['"+dsetatts["data_specs_version"]+"']"]["tag_label"]
    g.reset("--hard")
    g.checkout(tag2checkout)
    copy2(recentCV, prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json")
    #
    a = subprocess.run("{0} -l {1} --table-path {2} {3}".format(
                           prepareSetting["exec"],
                           logPath,
                           prepareSetting["cmip6-cmor-table-path"],
                           dsetatts["dset_path"]),
                       capture_output=True, shell=True)
    print(a)
for dset_id, dsetatts in dsets_to_test.items() :
    checkSubset(dset_id, dsetatts)
CompletedProcess(args='/home/dkrz/k204210/miniconda3/lib/python3.7/site-packages/cmip6_cv/PrePARE/PrePARE.py -l /home/dkrz/k204210/prepare-test/30/CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710 --table-path /home/dkrz/k204210/cmip6-cmor-tables/Tables /mnt/lustre02/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710', returncode=0, stdout=b'Check netCDF file(s): 100% | 18/18 files\nNumber of files scanned: 18\nNumber of file with error(s): 0\n', stderr=b'')

Results

As we let PrePARE write logfiles for each dataset, we have to collect the results to get an overview. Each logfile starts with a summary of how many files were scanned and how many of them failed.

If 0 files have failed, the dataset (assuming we get one logfile per dataset) has passed the checks. The following lines of a logfile are less clearly structured, so we parse them for keywords. We can distinguish between two error categories; the maximal severity of the errors, max_severity, is updated with every new match of an error keyword.

  • Critical errors

    • if the filename or filepath does not conform to the data standard

    • if the data structure could not be parsed

    • are identified by the error keywords filename, Permission denied, not understood and SKIPPED

  • Minor issues

    • if a value of a required global attribute could not be found

    • are identified by the error keywords Warning and CV FAIL

errorSeverity=["Passed", "Minor Issue", "Major Issue"]
parsedict={"meta": ["filename", "creation_date", "dset_id", "specs_version"],
           "filenoDict":{"checked": 'files scanned: (\d+)',
                        "failed": 'with error\(s\): (\d+)'
                       },
           "errorDict":{"filename": 2,
                        "Warning" : 1,
                        "CV FAIL" : 1,
                        "Permission denied" : 2,
                        "not understood" : 2,
                        "SKIPPED" : 2},
          }
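
To illustrate what these patterns match, here is a quick check against the two summary lines as they appear in the PrePARE output above; the example strings are copied from that output and only used for this demonstration.

# Example summary lines, copied from the PrePARE stdout shown above
example_lines = ["Number of files scanned: 18",
                 "Number of file with error(s): 0"]

print(re.search(parsedict["filenoDict"]["checked"], example_lines[0]).group(1))  # -> '18'
print(re.search(parsedict["filenoDict"]["failed"], example_lines[1]).group(1))   # -> '0'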

We subdivide the parsing into two functions, parse_file and collect_errors. collect_errors is only executed if parse_file detects failed files. As an argument, we provide not only the path to the logfile but rather a dictionary that will be filled with all the metadata needed to assess the PrePARE results.

def collect_errors(dset_entry) :
    errors=[]
    max_severity=0
    for line in open(dset_entry["logfile_name"]):
        for errorKeyword in parsedict["errorDict"].keys() :
            match = re.findall(errorKeyword, line)
            if match:
                errors.append(errorKeyword)
                max_severity=max(max_severity,int(parsedict["errorDict"][errorKeyword]))
    dset_entry["errors"]=tuple(errors)
    dset_entry["max_severity"]=max_severity
def parse_file(dset_entry):
    checkedFiles=[]
    failedFiles=[]
    for line in open(dset_entry["logfile_name"]):
        match = re.search(parsedict["filenoDict"]["checked"], line)
        if match:
            checkedFiles.append(''.join(match.group(1)))
        match = re.search(parsedict["filenoDict"]["failed"], line)
        if match:
            failedFiles.append(''.join(match.group(1)))
    if not checkedFiles or not failedFiles :
        print(dset_entry["logfile_name"], checkedFiles, failedFiles)
    dset_entry["checked"]=int(checkedFiles[0])
    dset_entry["failed"]=int(failedFiles[0])
    dset_entry["passed"]=dset_entry["checked"]-dset_entry["failed"]
    if not dset_entry["failed"] == 0 :
        collect_errors(dset_entry)

We finally collect all results in a dictionary prepare_dict where the dset_ids are the keys. For that, we loop over all logfiles.

prepare_dict = {}
specs_paths=os.listdir(prepareSetting["logChunk"])
for specs_path in tqdm(specs_paths):
    for dirpath, dirnames, logfile_names in os.walk(os.path.join(prepareSetting["logChunk"], specs_path)):
        for logfile_name in logfile_names :
            dset_entry = {"logfile_name":os.path.join(dirpath, logfile_name),
                          "creation_date":logfile_name.split(".")[0].split("-")[1],
                          "dset_id":dirpath[len(os.path.join(prepareSetting["logChunk"], specs_path))+1:],
                          "specs_version": "01.00."+specs_path}

            parse_file(dset_entry)
            prepare_dict[dset_entry["dset_id"]]=dset_entry
100%|██████████| 1/1 [00:00<00:00, 558.87it/s]
prepare_dict
{'CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710': {'logfile_name': '/home/dkrz/k204210/prepare-test/30/CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710/PrePARE-20210621-171057.log',
  'creation_date': '20210621',
  'dset_id': 'CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710',
  'specs_version': '01.00.30',
  'checked': 18,
  'failed': 0,
  'passed': 18}}
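
For a larger number of datasets, an overview table can be handy. The following is a minimal sketch that simply tabulates prepare_dict, assuming pandas is available in the environment; it is not required for the workflow above.

import pandas as pd

# Build an overview table with one row per dataset
overview = pd.DataFrame.from_dict(prepare_dict, orient="index")
# The "errors" and "max_severity" entries only exist for datasets with failed files,
# so we restrict the view to the columns that are always present
print(overview[["specs_version", "checked", "failed", "passed"]])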