ESGF databrowser#
The Earth System Grid Federation (ESGF) maintains a global system of federated data centers that provide access to the largest archive of climate model data worldwide. ESGF portals, like the one hosted by the DKRZ, are an interface for users to access model data distributed across several data centers (also called data nodes), although the portals themselves do not host any data. Through them we can access the output of the climate models contributing to the next assessment report of the Intergovernmental Panel on Climate Change (IPCC) through the Coupled Model Intercomparison Project (CMIP). The WGCM Infrastructure Panel is the official CMIP document home. See a CMIP intro video here. Through these portals we will be able to find data from the following projects:
CMIP6: Coupled Model Intercomparison Project Phase 6
input4MIPs: Boundary Condition and Forcing Datasets for CMIP6
CMIP5: Coupled Model Intercomparison Project Phase 5
CORDEX: Coordinated Regional Climate Downscaling Experiment
Obs4MIPs: Observations for Climate Model Intercomparisons
MPI-GE: MPI Grand Ensemble
ISI-MIP: Inter-Sectoral Impact Model Intercomparison Project
For more information on how to publish your own data into the ESGF using DKRZ resources, please click here.
Help#
$ freva --esgf --help
The query is of the form key=value. the key might be repeated and/or negated with the
'_not_' suffix (e.g. model_not_=MPI-ESM-LR experiment=decadal2000 experiment=decadal2001)
Simple query:
freva --esgf model=MPI-ESM-LR experiment=decadal2001 variable=tas distrib=False
The search API is described here: https://github.com/ESGF/esgf.github.io/wiki/ESGF_Search_REST_API
Some special query keys:
distrib: (*true*, false) search globally or only at DKRZ (MPI data and replicas)
latest : (true, false, *unset*) search for the latest version, older ones or all.
replica: (true, false, *unset*) search only for replicas, non-replicas, or all.
Usage: freva --esgf [options]
Options:
-d, --debug turn on debugging info and show stack trace on
exceptions.
-h, --help show this help message and exit
--datasets List the name of the datasets instead of showing the
urls.
--show-facet=FACET <list> List all values for the given facet (might be
defined multiple times). The results show the possible
values of the selected facet according to the given
constraints and the number of *datasets* (not files)
that selecting such value as a constraint will result
(faceted search)
--opendap List the name of the datasets instead of showing the
urls.
--gridftp Show Opendap endpoints instead of the http default
ones (or skip them if none found)
--download-script=FILE
<file> Download wget_script for getting the files
instead of displaying anything (only http)
--query=QUERY <list> Display results from <list> queried fields
Usage#
The --esgf command provides a search syntax for looking up model data across all the ESGF portals. It is derived from the ESGF search service API, although it has been simplified for use from the command line and resembles --databrowser as closely as possible. Despite the similarities, the two commands rely on different backends, which have different query possibilities. This command expects a list of attribute=value (or key=value) pairs.
Note
Unlike with --databrowser, the queries are case sensitive. Similarly, tab completion does not work with this command.
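As a rough illustration of how a list of key=value pairs with repeated keys relates to a query against the ESGF search REST API, the following sketch groups the pairs and encodes them as a URL query string. It only illustrates the idea and is not how freva builds its requests internally. The endpoint `SEARCH_URL` and the helper `build_search_url` are assumptions introduced for this example.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the DKRZ index node (not taken from the text above).
SEARCH_URL = "https://esgf-data.dkrz.de/esg-search/search"


def build_search_url(pairs, base=SEARCH_URL):
    """Group key=value strings (keys may repeat) and encode them as a URL."""
    params = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        params.setdefault(key, []).append(value)
    # doseq=True writes one key=value element per entry of each list.
    return base + "?" + urlencode(params, doseq=True)


url = build_search_url([
    "project=CMIP5",
    "experiment=decadal1960",
    "experiment=decadal1961",
    "variable=tas",
    "distrib=false",
    "latest=true",
])
print(url)
```

No request is sent here; the sketch only shows that repeating a key simply repeats the corresponding query parameter.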
For example, suppose we want to search for the URLs of all the files stored locally at DKRZ (distrib=false) holding the latest version (latest=true) of the variable tas (variable=tas) for the experiments decadal1960 to decadal1965. The {1960..1965} range is a bash brace expansion and not part of the search API:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} variable=tas distrib=false latest=true
http://cmip3.dkrz.de/thredds/fileServer/cmip5/output1/CCCma/CanCM4/decadal1965/day/atmos/day/r10i1p1/v20120531/tas/tas_day_CanCM4_decadal1965_r10i1p1_19660101-19751231.nc
http://cmip3.dkrz.de/thredds/fileServer/cmip5/output1/CCCma/CanCM4/decadal1965/day/atmos/day/r10i2p1/v20120531/tas/tas_day_CanCM4_decadal1965_r10i2p1_19660101-19751231.nc
...
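Because the brace expansion is performed by bash, it is not available when the command is launched without a shell, for instance from a Python script. A minimal sketch of expanding the experiment years explicitly is shown below; the helper name `esgf_command` is hypothetical and only assembles the argument list.

```python
import shlex


def esgf_command(first_year, last_year, **facets):
    """Build the freva --esgf argument list with one experiment key per year."""
    args = ["freva", "--esgf"]
    # Equivalent of the bash construct experiment=decadal{first..last}.
    args += [f"experiment=decadal{year}" for year in range(first_year, last_year + 1)]
    args += [f"{key}={value}" for key, value in facets.items()]
    return args


cmd = esgf_command(1960, 1965, project="CMIP5", variable="tas",
                   distrib="false", latest="true")
print(shlex.join(cmd))
```

On a machine where freva is installed, the resulting list could be handed to `subprocess.run(cmd, capture_output=True, text=True)`.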
We advise first checking how many files there would be by piping the output to wc -l, which counts the lines:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} variable=tas distrib=false latest=true | wc -l
151390
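Beyond the total count, it can help to see how the files are spread over the experiments. The sketch below counts URLs per experiment, assuming that the experiment is the fourth path element after `cmip5`, as in the two example URLs shown earlier. The helper `experiment_of` is hypothetical.

```python
from collections import Counter

# Two URLs copied from the example output of the search.
urls = [
    "http://cmip3.dkrz.de/thredds/fileServer/cmip5/output1/CCCma/CanCM4/decadal1965/day/atmos/day/r10i1p1/v20120531/tas/tas_day_CanCM4_decadal1965_r10i1p1_19660101-19751231.nc",
    "http://cmip3.dkrz.de/thredds/fileServer/cmip5/output1/CCCma/CanCM4/decadal1965/day/atmos/day/r10i2p1/v20120531/tas/tas_day_CanCM4_decadal1965_r10i2p1_19660101-19751231.nc",
]


def experiment_of(url):
    """Return the experiment, assumed to be the 4th path element after 'cmip5'."""
    parts = url.split("/")
    return parts[parts.index("cmip5") + 4]


per_experiment = Counter(experiment_of(url) for url in urls)
print(per_experiment)
```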
Instead of the URLs, we can list the datasets together with their versions using --datasets:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --datasets
cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.atmos.3hr.r1i1p1 - version: 1
cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.atmos.3hr.r4i1p1 - version: 1
cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.land.3hr.r1i1p1 - version: 20121109
...
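If the dataset listing needs further processing, each line can be split into the dataset name and its version. This is a minimal sketch that assumes the `name - version: N` layout shown in the output above; `parse_dataset_line` is a hypothetical helper.

```python
# Lines copied from the example output of --datasets.
lines = [
    "cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.atmos.3hr.r1i1p1 - version: 1",
    "cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.atmos.3hr.r4i1p1 - version: 1",
    "cmip5.output1.BCC.bcc-csm1-1.decadal1960.3hr.land.3hr.r1i1p1 - version: 20121109",
]


def parse_dataset_line(line):
    """Split 'name - version: N' into (name, version)."""
    name, _, version = line.rpartition(" - version: ")
    return name, version.strip()


datasets = dict(parse_dataset_line(line) for line in lines)
for name, version in datasets.items():
    print(version, name)
```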
As with --databrowser, we can also retrieve certain facets from the query instead of the files themselves. For that we use --show-facet=FACET. We can list several facets at the same time by separating them with commas:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --show-facet=experiment,variable
[experiment]
decadal1960: 1247
decadal1961: 510
decadal1962: 457
decadal1963: 457
decadal1964: 530
decadal1965: 1296
[variable]
abs550aer: 84
ageice: 30
albisccp: 18
areacella: 40
areacello: 40
...
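The facet listing is easy to turn into a nested dictionary for further use. The sketch below assumes the layout shown above, with a `[facet]` header followed by `value: count` lines, and uses a shortened copy of that output; `parse_facets` is a hypothetical helper.

```python
sample = """\
[experiment]
decadal1960: 1247
decadal1961: 510
decadal1962: 457
decadal1963: 457
decadal1964: 530
decadal1965: 1296
[variable]
abs550aer: 84
ageice: 30
"""


def parse_facets(text):
    """Parse '[facet]' headers and 'value: count' lines into nested dicts."""
    facets = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            facets[current] = {}
        elif ":" in line and current is not None:
            value, _, count = line.rpartition(":")
            facets[current][value.strip()] = int(count)
    return facets


facets = parse_facets(sample)
print(sum(facets["experiment"].values()))  # total datasets over the six experiments
```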
The --show-facet option gives a compact summary. Alternatively, we can query the facets directly (e.g., --query=variable,experiment), which lists the values of each requested facet for every result in JSON format:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --query=experiment,variable
[
{
"experiment": [
"decadal1965"
],
"score": 1.0,
"variable": [
"psl",
"ta",
"ua",
"va"
]
},
{
"experiment": [
"decadal1960"
],
"score": 1.0,
"variable": [
"snc",
"snm",
"snw"
]
},
...
]
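Since the result is JSON, it can be read with any JSON parser. The following sketch collects the variables found per experiment; it uses a hand-written sample containing only the two records shown above (the elided part is left out) and assumes every record has `experiment` and `variable` lists.

```python
import json

# Sample with the two records shown above; the remaining records are omitted.
raw = """
[
  {"experiment": ["decadal1965"], "score": 1.0,
   "variable": ["psl", "ta", "ua", "va"]},
  {"experiment": ["decadal1960"], "score": 1.0,
   "variable": ["snc", "snm", "snw"]}
]
"""

variables = {}
for record in json.loads(raw):
    for experiment in record["experiment"]:
        # Merge the variables of all records belonging to the same experiment.
        variables.setdefault(experiment, set()).update(record["variable"])

for experiment in sorted(variables):
    print(experiment, sorted(variables[experiment]))
```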
With --opendap we show OPeNDAP endpoints instead of the default HTTP ones. Note that files without such an endpoint are skipped!
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --opendap | head
http://esgf-node.cmcc.it/thredds/dodsC/esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196011010130-196012312230.nc.html
http://esgf-node.cmcc.it/thredds/dodsC/esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196101010130-196112312230.nc.html
http://esgf-node.cmcc.it/thredds/dodsC/esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196201010130-196212312230.nc.html
...
Similarly, with --gridftp we show GSI-authenticated FTP (GridFTP) access endpoints:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --gridftp
gsiftp://esgf-node.cmcc.it:2811//esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196011010130-196012312230.nc
gsiftp://esgf-node.cmcc.it:2811//esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196101010130-196112312230.nc
gsiftp://esgf-node.cmcc.it:2811//esg_dataroot/cmip5/output1/CMCC/CMCC-CM/decadal1960/3hr/atmos/3hr/r1i1p1/v20170725/clt/clt_3hr_CMCC-CM_decadal1960_r1i1p1_196201010130-196212312230.nc
...
# instead of the original 151390 files considerably fewer are shown now:
$ freva --esgf project=CMIP5 experiment=decadal{1960..1965} distrib=false latest=true --gridftp | wc -l
27868
To retrieve these files, we can download them manually with the wget command, e.g.:
$ wget http://esgf2.dkrz.de/thredds/fileServer/lta_dataroot/cmip5/output1/BCC/bcc-csm1-1/decadal1960/3hr/atmos/3hr/r1i1p1/v1/tas/tas_3hr_bcc-csm1-1_decadal1960_r1i1p1_196101010000-197512312100.nc
However, this process can be too tedious when dealing with many files. For that reason, freva --esgf offers an option to create a wget script automatically, that is, a bash script written around wget to simplify data download. For that, we need to add the --download-script flag and its destination file (path_to_file/script.wget), as in the following example:
$ freva --esgf --download-script /tmp/script.wget project=CMIP5 experiment=decadal{1960..1965} variable=tas distrib=false latest=true
Download script successfully saved to /tmp/script.wget
Execution of the script.wget#
To execute the script we only have to make it executable and run it:
$ chmod +x /tmp/script.wget
$ /tmp/script.wget
********************************************************************************
* *
* Note that new functionality to allow authentication without the need for *
* certificates is available with this version of the wget script. To enable, *
* use the "-H" option and enter your OpenID and password when prompted: *
* *
* $ script.wget -H [options...] *
* *
* For a full description of the available options use the help option: *
* *
* $ script.wget -h *
* *
********************************************************************************
...
The script requires authentication with ESGF, which can be done in two ways:
via ESGF OpenID + password: straightforward, but we need to enter the credentials every time we execute the script.
via certificate: more work, but once installed we won’t need to type anything afterwards.
For both cases we need to create an ESGF account through (for example) DKRZ’s portal. We need to follow these instructions:
go to the registration page
fill in all the information, including:
<username>
password
you will get:
openID:
https://esgf-data.dkrz.de/esgf-idp/openid/<username>
1. Execute the script using the ESGF openID + password#
For that we only need to add the -H flag and follow the instructions:
$ /tmp/script.wget -H
Running script.wget version: 1.3.2
Use script.wget -h for help.
Warning! The total number of files was 2338 but this script will only process 1000.
There were files with the same name which were requested to be download to the same directory. To avoid overwriting the previous downloaded one they were skipped.
Please use the parameter 'download_structure' to set up unique directories for them.
Script created for 916 file(s)
(The count won't match if you manually edit this file!)
Enter your openid : https://esgf-data.dkrz.de/esgf-idp/openid/<username>
Enter password : <password>
The script will then download the data.
2. Execute the script with a certificate#
We can generate the needed certificate through the ESGF PyClient. This is a Python package designed for interacting with the Earth System Grid Federation system. Currently this package contains API code for calling the ESGF Search API within client code.
For more in-depth information about this package, please go here.
2.1 Installation of the esgf-pyclient#
We will create a conda environment for that package by running the following command:
$ conda create -c conda-forge -n esgf-pyclient python=3.X pip esgf-pyclient
This creates a conda environment named (-n) esgf-pyclient with Python 3.X (replace X with the desired minor version), installing pip and the esgf-pyclient package from the conda-forge channel (-c).
We can also select a specific Python version by loading one of the modules available at DKRZ and creating the conda environment from there, for example:
$ module avail
...
python3/2020.02-gcc-9.1.0
python3/2020.09-gcc-9.1.0
python3/2021.01-gcc-9.1.0
python3/unstable
...
$ module load python3/unstable
$ conda create -n esgf-pyclient -c conda-forge esgf-pyclient
Here the Python version would be the latest one installed at DKRZ, in this case 3.8.6 (as of 2022.02.24).
We will need to install two additional packages. For that we first activate our conda environment, so that the packages are installed there, and then run pip:
$ conda activate esgf-pyclient
(esgf-pyclient) $ pip install myproxyclient defusedxml
Here you can find some basic commands to help you navigate through your conda environment.
2.2 Creation of the certificate#
To obtain the MyProxy credentials that allow downloading files or using secured OPeNDAP, we need to do the following:
$ conda activate esgf-pyclient
$ python
>>> from pyesgf.logon import LogonManager
>>> lm = LogonManager()
>>> lm.logoff()
>>> lm.is_logged_on()
Note
When you run it for the first time you need to set bootstrap=True.
>>> OPENID = 'https://esgf-data.dkrz.de/esgf-idp/openid/<username>'
>>> lm.logon_with_openid(openid=OPENID, password=None, bootstrap=True)
>>> lm.is_logged_on()
Note
You may be prompted for your username if it is not available via your OpenID.
We can also obtain MyProxy credentials from the MyProxy host in interactive mode, which asks for the <username> and password:
>>> myproxy_host = 'esgf-data.dkrz.de'
>>> lm.logon(hostname=myproxy_host, interactive=True, bootstrap=True)
>>> lm.is_logged_on()
Note
See the pyesgf.logon module documentation for details of how to use a MyProxy username instead of an OpenID.
2.3 Execution of the script#
The ESGF certificate has been created and stored in ~/.esgf/, so the script can now be executed without entering credentials:
$ /tmp/script.wget
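Before running the script unattended, it may be worth checking that the credentials are actually in place. The sketch below looks for a file named `credentials.pem` inside ~/.esgf/. That file name is an assumption (to our knowledge it is the default used by pyesgf, but verify it on your system), and `has_esgf_credentials` is a hypothetical helper.

```python
from pathlib import Path


def has_esgf_credentials(esgf_dir=None):
    """Return True if a credentials file exists in the ESGF directory."""
    # Default location mentioned in the text; the file name is an assumption.
    esgf_dir = Path(esgf_dir) if esgf_dir else Path.home() / ".esgf"
    return (esgf_dir / "credentials.pem").is_file()


print(has_esgf_credentials())
```

This only checks that the file exists, not whether the certificate is still valid; `lm.is_logged_on()` from the session above remains the more reliable test.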