Data Preparation and Quality Checking#
Systematic analysis across big climate data repositories is only feasible if the data is written in a common format, with files structured similarly and with sufficient metadata stored uniformly. The data has to be readable and understandable by both humans and machines.
Climate data to be published within consortium projects like CMIP therefore has to follow community conventions and meet the project’s data standard. DKRZ develops, maintains, provides and runs tools for data preparation and quality assurance in order to ensure that the data repositories are of high quality.
A project data standard usually builds on a community convention but adds project specific requirements like additional metadata or variable requests.
This is an overview of widely accepted conventions and standards:
Standardizing data based on established conventions supports the findability/discovery and reusability of data. Several conventions for standardizing netCDF files in the context of Earth System modeling exist. The most prominent ones are these:
CF (Climate and Forecast Conventions): improves the reusability of data; standardizes the representation of data variables, spatio-temporal coordinate systems and derived grids in netCDF files; extended by a standard name table for biogeochemical variables.
ACDD (Attribute Conventions for Data Discovery): improves the findability/discovery of data; standardizes many global attributes for bibliographic purposes and searchability, so that the metadata needed for discovery is stored in the netCDF file itself.
UGRID (Unstructured GRID conventions): represents unstructured grids; an extension of the CF conventions for representing unstructured model grids.
CMIP: compliance required for all official CMIP data; based on the CF conventions but prescribes additional global attributes and allows at most one data variable per file.
AtMoDat Standard: compliance required for all data published within the AtMoDat project; based on the CF conventions and similar to the CMIP standard.
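As a toy illustration of the kind of variable metadata these conventions standardize, the sketch below represents a variable’s attributes as a plain Python dictionary (instead of a real netCDF file) and tests for the CF attributes `standard_name` and `units`. All names and values here are made-up examples; real checkers inspect far more than this.

```python
# Toy sketch: CF-style variable metadata as a plain dict
# (a real file would carry these as netCDF variable attributes).
cf_attrs = {
    "standard_name": "air_temperature",  # from the CF standard name table
    "units": "K",                        # physical units of the variable
    "long_name": "Near-Surface Air Temperature",
}

def missing_cf_attributes(attrs):
    """Return the attributes a project standard would typically flag as missing."""
    expected = ("standard_name", "units")
    return [name for name in expected if name not in attrs]

print(missing_cf_attributes(cf_attrs))        # -> []
print(missing_cf_attributes({"units": "K"}))  # -> ['standard_name']
```

The checkers described further below perform this kind of test, among many others, directly on netCDF files.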
In large community projects like CMIP, 2000 unique variables are requested from different submodels of the Earth System Model and for thousands of experiments. Each of these variables is individual with respect to its coordinates, frequency or aggregation type. Preparing the data is therefore often a complex task which requires considerable effort. DKRZ develops and provides tools to support standardization.
Applying the Climate Model Output Rewriter (CMOR), a library developed at PCMDI, ensures that output data complies with a project standard. The program rewrites model data in a compliant way and saves it in a netCDF file, one variable per file. The dataset is provided with all necessary and suitable metadata. Because applying CMOR involves invoking various program functions which require extensive information about the input data, inexperienced users face a long training period or processing time.
This is why multiple programs have been developed to simplify this process. These were tuned for specific climate models and applications and named similarly, e.g. cmorlite. Over time, the meaning of the term cmorization shifted towards standardizing model data in general rather than applying the CMOR library. However, the effort needed for developing a parallel cmor-like tool was often underestimated, so that data standard compliance could not be reached by cmorlite applications.
It is highly recommended to use the CMOR library to ensure a high data quality standard. CMOR is provided on Mistral under “/sw/rhel6-x64/cmor-3.6.0-gcc64/”.
CMOR was integrated into the worldwide accepted and widely used Climate Data Operators package (CDO). The operator ‘cmor’ provides the experienced CDO user with a familiar look and feel, along with the support infrastructure built around the CDO toolkit. Because CDO’s interface already extracts metadata from the input file, the information that has to be provided by the user is reduced to a minimum. cdo cmor is further developed at DKRZ and provided on Mistral under “/work/bm0021/cdo_incl_cmor/”.
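A minimal invocation might look like the following sketch. The MIP-table name `Amon` and the file name are placeholders, and real runs usually need additional parameters (e.g. pointing to a user configuration), so consult the cdo cmor documentation for the exact parameter list:

```shell
# Sketch: rewrite a model output file according to a MIP table
# ('Amon' and the file name are placeholders; further parameters
# are typically required in practice):
$ cdo cmor,Amon input_file.nc
```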
Earth System model data can be corrupted in many ways, e.g.:
Coordinate gaps caused by simulation run errors
Suspicious data like duplicates, outliers or missing values
Inconsistencies over many files of the same data set
Attribute mismatches between global or variable attributes and the project data standard
Non-compliance with conventions the community agreed on, like CF or ACDD
DKRZ supports the application of different quality checkers to test for these issues.
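As a toy illustration of the first item above, the sketch below scans a time coordinate, given as a plain list of time values, for gaps. The function name, the values and the 6-hourly spacing are made up for the example; the checkers described below automate tests of this kind on real netCDF files:

```python
def find_time_gaps(times, step):
    """Return (previous, current) pairs where the spacing deviates from `step`.

    `times` is a sorted list of time values (e.g. hours since a reference
    date); `step` is the expected constant spacing.
    """
    return [
        (prev, curr)
        for prev, curr in zip(times, times[1:])
        if curr - prev != step
    ]

# 6-hourly time axis with one missing step between 12 and 24:
times = [0, 6, 12, 24, 30]
print(find_time_gaps(times, step=6))  # -> [(12, 24)]
```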
The CEDA CF Checker tests netCDF files against the CF conventions and the CF standard name table. Requirements and recommendations are checked. Several versions of the CF Conventions are available. It is written in Python and can be used on the command line, in a web browser (hosted by CEDA) or imported as a Python library (e.g. in a Jupyter notebook). The development of the CEDA CF Checker dates back nearly to the beginning of the CF Conventions.
Usage of the CEDA CF Checker on mistral#
```shell
$ module load python3/2021.01-gcc-9.1.0
$ cfchecks example_file.nc
...
```
Usage of the CEDA CF Checker on levante#
```shell
$ module load python3/2022.01-gcc-11.2.0
$ cfchecks example_file.nc
...
```
The IOOS Compliance Checker is a Python-based tool for data providers to check local or remote netCDF files for completeness and compliance with the CF, ACDD, and IOOS Metadata Profile standards. The Compliance Checker can be used as a command-line tool or as a library that can be integrated into other software. A web-based version also exists, which reaches a broader audience and improves the checker’s accessibility: providers can simply provide a link or upload their datasets and get the full suite of capabilities that the Compliance Checker offers.
Usage of the IOOS Compliance Checker on mistral#
```shell
$ module load python3/2021.01-gcc-9.1.0
# print available tests
$ compliance-checker
...
# check file against CF version 1.7
$ compliance-checker -t cf:1.7 example_file.nc
...
```
Usage of the IOOS Compliance Checker on levante#
```shell
$ module load python3/2022.01-gcc-11.2.0
# print available tests
$ compliance-checker
...
# check file against CF version 1.7
$ compliance-checker -t cf:1.7 example_file.nc
...
```
AtMoDat Checker Plugin#
Colleagues from the University of Leipzig and DKRZ developed an extension of the IOOS Compliance Checker within the AtMoDat project to check for a project-specific standard. The plugin is available on GitHub. If you plan to develop your own plugin and have basic questions, you are welcome to contact us.
PrePARE is a checker that has been specifically developed to validate files against the CMIP6 standard. In particular, it is run before publishing files to ESGF. It may not work properly on CMIP5 files.
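A typical invocation might look like the following sketch; the table path is a placeholder for a local checkout of the CMIP6 CMOR tables:

```shell
# Validate a file against the CMIP6 standard; --table-path points to a
# local copy of the CMIP6 CMOR tables (placeholder path):
$ PrePARE --table-path /path/to/cmip6-cmor-tables/Tables example_file.nc
```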