ESGF databrowser#

The Earth System Grid Federation (ESGF) maintains a global system of federated data centers that provide access to the largest archive of climate model data worldwide. ESGF portals, like the one hosted by the DKRZ, do not host any data themselves; they are an interface through which users access model data distributed across several data centers, also called data nodes. Through them we can access the output of the climate models contributing to the next assessment report of the Intergovernmental Panel on Climate Change (IPCC) through the Coupled Model Intercomparison Project (CMIP). The WGCM Infrastructure Panel is the official CMIP document home. See a CMIP intro video here. Through these portals we will be able to find data from the following projects:

  • CMIP6: Coupled Model Intercomparison Project Phase 6

  • input4MIPs: Boundary Condition and Forcing Datasets for CMIP6

  • CMIP5: Coupled Model Intercomparison Project Phase 5

  • CORDEX: Coordinated Regional Climate Downscaling Experiment

  • Obs4MIPs: Observations for Climate Model Intercomparisons

  • MPI-GE: MPI Grand Ensemble

  • ISI-MIP: Inter-Sectoral Impact Model Intercomparison Project

For more information on how to publish your own data into the ESGF using DKRZ resources, please click here.

Help#

$ freva --esgf --help

The query is of the form key=value. the key might be repeated and/or negated with the
'_not_' suffix (e.g. model_not_=MPI-ESM-LR experiment=decadal2000 experiment=decadal2001)

Simple query:
    freva --esgf model=MPI-ESM-LR experiment=decadal2001 variable=tas distrib=False

The search API is described here: https://github.com/ESGF/esgf.github.io/wiki/ESGF_Search_REST_API
Some special query keys:
distrib: (*true*, false) search globally or only at DKRZ (MPI data and replicas)
latest : (true, false, *unset*) search for the latest version, older ones or all.
replica: (true, false, *unset*) search only for replicas, non-replicas, or all.

Usage: freva --esgf [options]

Options:
  -d, --debug           turn on debugging info and show stack trace on
                        exceptions.
  -h, --help            show this help message and exit
  --datasets            List the name of the datasets instead of showing the
                        urls.
  --show-facet=FACET    <list> List all values for the given facet (might be
                        defined multiple times). The results show the possible
                        values of the selected facet according to the given
                        constraints and the number of *datasets* (not files)
                        that selecting such value as a constraint will result
                        (faceted search)
  --opendap             Show Opendap endpoints instead of the http default
                        ones (or skip them if none found)
  --gridftp             Show GridFTP endpoints instead of the http default
                        ones (or skip them if none found)
  --download-script=FILE
                        <file> Download wget_script for getting the files
                        instead of displaying anything (only http)
  --query=QUERY         <list> Display results from <list> queried fields

Usage#

The --esgf command provides a search syntax to look up model data across all the ESGF portals. It is derived from the ESGF search service API, but it has been simplified for use from the command line and resembles --databrowser as closely as possible. Despite the similarities, the two commands rely on different backends, which offer different query possibilities. This command expects a list of attribute=value (or key=value) pairs.

Note

Please note that, unlike with --databrowser, the queries are case sensitive.

Similarly, tab completion does not work with this command.
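As a rough illustration of the key=value form, the sketch below assembles the argument list for a query from Python dictionaries. The helper name build_esgf_args is hypothetical and not part of freva; only the key=value syntax and the `_not_` suffix come from the help text above. The sketch only builds the command and does not run it.

```python
def build_esgf_args(include, exclude=None):
    """Return an argument list for `freva --esgf` (hypothetical helper).

    `include` and `exclude` map a facet name to one value or a list of values.
    Excluded values use the `_not_` suffix described in the help text.
    """
    def as_list(value):
        return list(value) if isinstance(value, (list, tuple)) else [value]

    args = ["freva", "--esgf"]
    # A key may be repeated: one key=value pair per requested value
    for key, values in include.items():
        args += [f"{key}={v}" for v in as_list(values)]
    # Negated constraints: key_not_=value
    for key, values in (exclude or {}).items():
        args += [f"{key}_not_={v}" for v in as_list(values)]
    return args


cmd = build_esgf_args(
    {"experiment": ["decadal2000", "decadal2001"], "variable": "tas"},
    exclude={"model": "MPI-ESM-LR"},
)
print(" ".join(cmd))
```

Because the queries are case sensitive, the facet values have to be passed exactly as they are spelled in the ESGF index.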

For example, suppose we want the URLs of all the files stored locally at DKRZ (distrib=false) that hold the latest version (latest=true) of the variable tas (variable=tas) for the experiments decadal1960 to decadal1965. The {1960..1965} range is a bash brace expansion and not part of the search API:

$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} variable=tas distrib=false latest=true
http://cmip3.dkrz.de/thredds/fileServer/cmip5/output1/CCCma/CanCM4/decadal1965/day/atmos/day/r10i1p1/v20120531/tas/tas_day_CanCM4_decadal1965_r10i1p1_19660101-19751231.nc
http://cmip3.dkrz.de/thredds/fileServer/cmip5/output1/CCCma/CanCM4/decadal1965/day/atmos/day/r10i2p1/v20120531/tas/tas_day_CanCM4_decadal1965_r10i2p1_19660101-19751231.nc
...

Before listing everything, we advise checking how many files there would be by “piping” the output to wc -l (line count):

$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} variable=tas distrib=false latest=true | wc -l
151390
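The experiment=decadal{1960..1965} argument in the commands above is expanded by bash before freva is started, so the command receives six separate experiment=... pairs. A minimal Python sketch of that expansion, for illustration only:

```python
# What bash produces for experiment=decadal{1960..1965}
years = range(1960, 1966)  # the bash range {1960..1965} includes both ends
expanded = [f"experiment=decadal{year}" for year in years]
print(" ".join(expanded))
```

This is why repeating a key is allowed in the query: each expanded pair adds one more accepted value for experiment.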

Instead of the URLs, we can list the datasets together with their versions using --datasets:

$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --datasets
cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.atmos.3hr.r1i1p1 - version: 1
cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.atmos.3hr.r4i1p1 - version: 1
cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.land.3hr.r1i1p1 - version: 20121109
...

As with --databrowser, we can retrieve the values of certain facets from the query instead of the files themselves. For that we use --show-facet=FACET. Several facets can be listed at once, separated by commas:

$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --show-facet=experiment,variable
[experiment]
        decadal1960: 1247
        decadal1961: 510
        decadal1962: 457
        decadal1963: 457
        decadal1964: 530
        decadal1965: 1296
[variable]
        abs550aer: 84
        ageice: 30
        albisccp: 18
        areacella: 40
        areacello: 40
...
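If the facet counts are needed in a script, the text layout shown above ([facet] headers followed by value: count lines) is simple to parse. The following is a sketch that assumes the output keeps exactly this layout; parse_facet_counts is a hypothetical helper, and the sample is a shortened copy of the output above.

```python
def parse_facet_counts(text):
    """Parse `--show-facet` style text into {facet: {value: dataset_count}}."""
    facets = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new facet section such as [experiment]
            current = line[1:-1]
            facets[current] = {}
        elif current is not None and ":" in line:
            # A "value: count" line belonging to the current facet
            value, _, count = line.rpartition(":")
            facets[current][value.strip()] = int(count)
    return facets


sample = """[experiment]
        decadal1960: 1247
        decadal1961: 510
[variable]
        abs550aer: 84
        ageice: 30
"""
counts = parse_facet_counts(sample)
print(counts["experiment"])
print(sum(counts["experiment"].values()))
```

Keep in mind that the numbers count datasets, not files, as stated in the help text.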

The --show-facet option gives a compact summary. Alternatively, we can query fields directly with --query (e.g., --query=variable,experiment), which lists every instance of each requested field in JSON format:

$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --query=experiment,variable
[
{
  "experiment": [
    "decadal1965"
  ],
  "score": 1.0,
  "variable": [
    "psl",
    "ta",
    "ua",
    "va"
  ]
},
{
  "experiment": [
    "decadal1960"
  ],
  "score": 1.0,
  "variable": [
    "snc",
    "snm",
    "snw"
  ]
},
...

]
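Since --query returns JSON, the result can be processed with any JSON library. The sketch below groups the variables by experiment; the sample string is a hand-shortened, valid-JSON copy of the two records shown above (the "..." in the listing is an abbreviation and is not part of the real output).

```python
import json
from collections import defaultdict

# Two records copied from the example output above
sample = """[
  {"experiment": ["decadal1965"], "score": 1.0, "variable": ["psl", "ta", "ua", "va"]},
  {"experiment": ["decadal1960"], "score": 1.0, "variable": ["snc", "snm", "snw"]}
]"""

# Collect the set of variables seen for each experiment
variables_by_experiment = defaultdict(set)
for record in json.loads(sample):
    for experiment in record.get("experiment", []):
        variables_by_experiment[experiment].update(record.get("variable", []))

for experiment in sorted(variables_by_experiment):
    print(experiment, sorted(variables_by_experiment[experiment]))
```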

With --opendap we get the Opendap endpoints instead of the default http ones. Note that files without an Opendap endpoint are skipped!

$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --opendap | head
http://esgf-node.cmcc.it/thredds/dodsC/esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196011010130-196012312230.nc.html
http://esgf-node.cmcc.it/thredds/dodsC/esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196101010130-196112312230.nc.html
http://esgf-node.cmcc.it/thredds/dodsC/esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196201010130-196212312230.nc.html
...

Similarly, with --gridftp we get the GridFTP (GSI-authenticated FTP) endpoints:

$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --gridftp
gsiftp://esgf-node.cmcc.it:2811//esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196011010130-196012312230.nc
gsiftp://esgf-node.cmcc.it:2811//esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196101010130-196112312230.nc
gsiftp://esgf-node.cmcc.it:2811//esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196201010130-196212312230.nc
...

# instead of the original 151390 files considerably fewer are shown now:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --gridftp | wc -l
27868
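The three kinds of endpoints seen so far can be told apart by their URL. The sketch below is a heuristic based only on the example URLs in this section (thredds/fileServer for http, thredds/dodsC for Opendap, the gsiftp scheme for GridFTP); access_type is a hypothetical helper, and other data nodes may use different paths.

```python
from urllib.parse import urlparse


def access_type(url):
    """Guess the access method of an ESGF file URL (heuristic, see lead-in)."""
    parts = urlparse(url)
    if parts.scheme == "gsiftp":
        return "gridftp"
    if "/dodsC/" in parts.path:
        return "opendap"
    if "/fileServer/" in parts.path:
        return "http"
    return "unknown"


# Shortened, made-up paths that follow the patterns of the URLs above
examples = [
    "http://cmip3.dkrz.de/thredds/fileServer/cmip5/output1/file.nc",
    "http://esgf-node.cmcc.it/thredds/dodsC/esg_dataroot/cmip5/file.nc.html",
    "gsiftp://esgf-node.cmcc.it:2811//esg_dataroot/cmip5/file.nc",
]
for url in examples:
    print(access_type(url), url)
```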

To retrieve these files, we can download them manually with the wget command, e.g.:

$ wget http://esgf2.dkrz.de/thredds/fileServer/lta_dataroot/cmip5/output1/BCC/bcc-csm1-1/decadal1960/3hr/atmos/3hr/r1i1p1/v1/tas/tas_3hr_bcc-csm1-1_decadal1960_r1i1p1_196101010000-197512312100.nc

However, this process becomes tedious when dealing with many files. For that reason freva --esgf offers an option to create a wget script automatically, that is, a bash script written around wget to simplify the data download. To use it, we add the --download-script flag followed by its destination file (path_to_file/script.wget), as in the following example:

$ freva --esgf --download-script /tmp/script.wget  project=CMIP5 experiment=decadal{1960..1965} variable=tas distrib=false latest=true
Download script successfully saved to /tmp/script.wget

Execution of the script.wget#

To execute the script we only have to make it executable and run it:

$ chmod +x /tmp/script.wget
$ /tmp/script.wget

********************************************************************************
*                                                                              *
* Note that new functionality to allow authentication without the need for     *
* certificates is available with this version of the wget script.  To enable,  *
* use the "-H" option and enter your OpenID and password when prompted:        *
*                                                                              *
* $ script.wget -H [options...]                                                *
*                                                                              *
* For a full description of the available options use the help option:         *
*                                                                              *
* $ script.wget -h                                                             *
*                                                                              *
********************************************************************************
...

While doing so, we have two main options to download the data:
  1. via ESGF openID + password: straightforward, but we need to enter the credentials every time we execute the script.

  2. via certificate: more work, but once installed we won’t need to type anything afterwards.

In both cases we need to create an ESGF account, for example through DKRZ’s portal, by following these steps:

  1. go to the registration page

  2. fill in all the information, including:

    1. <username>

    2. password

  3. you will get:

    • openID: https://esgf-data.dkrz.de/esgf-idp/openid/<username>

1. Execute the script using the ESGF openID + password#

For that we only need to add the -H flag and follow the instructions:

$ /tmp/script.wget -H
Running script.wget version: 1.3.2
Use script.wget -h for help.

Warning! The total number of files was 2338 but this script will only process 1000.
There were files with the same name which were requested to be download to the same directory. To avoid overwriting the previous downloaded one they were skipped.
Please use the parameter 'download_structure' to set up unique directories for them.
Script created for 916 file(s)
(The count won't match if you manually edit this file!)



Enter your openid :  https://esgf-data.dkrz.de/esgf-idp/openid/<username>
Enter password : <password>

The script will then download the data.

2. Execute the script with a certificate#

We can generate the needed certificate through the ESGF PyClient. This is a Python package designed for interacting with the Earth System Grid Federation system. Currently this package contains API code for calling the ESGF Search API within client code.

For more in-depth information about this package please go here.

2.1 Installation of the esgf-pyclient#

We will create a conda-environment for that package.

For that we need to run the following command:

$ conda create -c conda-forge -n esgf-pyclient python=3.X pip esgf-pyclient

This creates a conda environment named esgf-pyclient (-n) with Python 3.X (replace X with the desired minor version) and installs the pip and esgf-pyclient packages from the conda-forge channel (-c).

We can also select a specific Python version by loading one of the modules available at the DKRZ and creating the conda environment from there, for example:

$ module avail
...
python3/2020.02-gcc-9.1.0
python3/2020.09-gcc-9.1.0
python3/2021.01-gcc-9.1.0
python3/unstable
...

$ module load python3/unstable
$ conda create -n esgf-pyclient -c conda-forge esgf-pyclient

Here the Python version is the latest one installed at DKRZ, in this case 3.8.6 (2022.02.24).

We need two additional packages. To install them, we first activate our conda environment (so that the packages are installed there) and then run pip:

$ conda activate esgf-pyclient
(esgf-pyclient) $ pip install myproxyclient defusedxml

Here you can find some basic commands to help you navigate through your conda environment.

2.2 Creation of the certificate#

To obtain the MyProxy credentials to allow downloading files or using secured OpenDAP we need to do the following:

$ conda activate esgf-pyclient
$ python
>>> from pyesgf.logon import LogonManager
>>> lm = LogonManager()
>>> lm.logoff()
>>> lm.is_logged_on()

Note

When you run it for the first time you need to set bootstrap=True.

>>> OPENID = 'https://esgf-data.dkrz.de/esgf-idp/openid/<username>'
>>> lm.logon_with_openid(openid=OPENID, password=None, bootstrap=True)
>>> lm.is_logged_on()

Note

You may be prompted for your username if it is not available via your OpenID.

Alternatively, obtain the MyProxy credentials from the MyProxy host in interactive mode, which asks for your <username> and password:

>>> myproxy_host = 'esgf-data.dkrz.de'
>>> lm.logon(hostname=myproxy_host, interactive=True, bootstrap=True)
>>> lm.is_logged_on()

Note

See the pyesgf.logon module documentation for details of how to use myproxy username instead of OpenID.
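The interactive steps above can be wrapped in a small function for use in scripts. This is a sketch rather than part of pyesgf: ensure_logged_on and its first_run parameter are hypothetical names, while is_logged_on and logon_with_openid are the LogonManager methods used in the session above. The manager is passed in as an argument so that the function itself does not need to import pyesgf.

```python
def ensure_logged_on(lm, openid, first_run=False):
    """Log on with an OpenID unless credentials are already present.

    `lm` is expected to be a pyesgf.logon.LogonManager instance.
    Set `first_run=True` on the first run so that bootstrap=True is passed.
    """
    if lm.is_logged_on():
        # Existing credentials found, nothing to do
        return True
    # password=None, as in the session above (the password is asked for interactively)
    lm.logon_with_openid(openid=openid, password=None, bootstrap=first_run)
    return lm.is_logged_on()


# Usage (requires esgf-pyclient and an ESGF account):
#   from pyesgf.logon import LogonManager
#   openid = "https://esgf-data.dkrz.de/esgf-idp/openid/<username>"
#   ensure_logged_on(LogonManager(), openid, first_run=True)
```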

2.3 Execution of the script#

The ESGF certificate has been created and stored in ~/.esgf/, so the script can now be run without the -H flag:

$ /tmp/script.wget