This lesson is in the early stages of development (Alpha version)

Data provenance


Teaching: 10 min
Exercises: 20 min
Questions

  • How can I keep track of my data processing steps?

Objectives

  • Automate the process of recording the history of what was entered at the command line to produce a given data file or image.

We’ve now successfully created a command line program that calculates and plots the precipitation climatology for a given season. The last step is to capture the provenance of that plot. In other words, we need a log of all the data processing steps that were taken from the initial download of the data file to the end result (i.e. the .png image).

The simplest way to do this is to follow the lead of the NCO and CDO command line tools, which insert a record of what was executed at the command line into the history attribute of the output netCDF file.

import xarray as xr

csiro_pr_file = 'data/'
dset = xr.open_dataset(csiro_pr_file)
print(dset.attrs['history'])

Fri Dec  8 10:05:56 2017: ncatted -O -a history,pr,d,,
Fri Dec 01 08:01:43 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/CSIRO-Mk3-6-0/historical/mon/atmos/r1i1p1/pr/latest/
2011-07-27T02:26:04Z CMOR rewrote data to comply with CF standards and CMIP5 requirements.
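To make the format of these records concrete, here is a minimal sketch of how a "timestamp: command" line could be built by hand and placed in front of an existing history string. It uses only the Python standard library. The helper names make_record and prepend_record are hypothetical, and this is not a description of how NCO or CDO work internally.

```python
import sys
from datetime import datetime


def make_record(argv, when):
    """Return a 'timestamp: command' line in the style shown above."""
    stamp = when.strftime("%a %b %d %H:%M:%S %Y")
    return f"{stamp}: {' '.join(argv)}"


def prepend_record(record, old_history):
    """Put the newest record first, followed by the existing history."""
    if old_history:
        return record + "\n" + old_history
    return record


# An existing history entry (copied from the output above)
old_history = ("2011-07-27T02:26:04Z CMOR rewrote data to comply with "
               "CF standards and CMIP5 requirements.")

# sys.argv holds whatever was typed to start the current Python process
record = make_record(sys.argv, datetime.now())
new_history = prepend_record(record, old_history)
print(new_history)
```

Placing the newest entry at the top means the history reads in reverse chronological order, matching the records shown above.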

Fortunately, there is a Python package called cmdline-provenance that creates NCO/CDO-style records of what was executed at the command line. We can use it to generate a new command line record:

import cmdline_provenance as cmdprov
new_record = cmdprov.new_log()
print(new_record)
2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ -f /Users/dirving/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json

(i.e. this is the command that was run to launch the Jupyter notebook we’re using.)

Generate a log file

In order to capture the complete provenance of the precipitation plot, add a few lines of code to the end of the main function in your script so that it:

  1. Extracts the history attribute from the input file and combines it with the current command line entry (using the cmdprov.new_log function)
  2. Outputs a log file containing that information (using cmdprov.write_log; the file should have the same name as the plot, replacing .png with .txt)

(Hint: The documentation for cmdline-provenance explains the process.)


Make the following additions to your script (code omitted from this abbreviated version of the script is denoted ...):

import cmdline_provenance as cmdprov

...

def main(inargs):

    ...

    new_log = cmdprov.new_log(infile_history={inargs.pr_file: dset.attrs['history']})
    fname, extension = inargs.output_file.split('.')
    cmdprov.write_log(fname+'.txt', new_log)
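One thing to watch in the solution above: inargs.output_file.split('.') unpacks into exactly two values, so it raises a ValueError if the output file name contains more than one dot (or none at all). The following is a minimal sketch of a more forgiving way to derive the log file name, using pathlib from the standard library. The function name log_name and the example file names are hypothetical.

```python
from pathlib import Path


def log_name(output_file):
    """Swap the final extension of the plot file name for .txt."""
    return str(Path(output_file).with_suffix(".txt"))


# Hypothetical file names, including one with extra dots in the path
print(log_name("pr_climatology.png"))
print(log_name("plots/v1.2/pr_climatology.JJA.png"))
```

Only the last extension is replaced, so dots elsewhere in the directory or file name are left alone.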

At the conclusion of this lesson your script should look something like the following:

import pdb
import argparse

import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import cartopy.crs as ccrs
import cmocean
import cmdline_provenance as cmdprov

def convert_pr_units(darray):
    """Convert kg m-2 s-1 to mm day-1.

    Args:
      darray (xarray.DataArray): Precipitation data

    """

    assert darray.units == 'kg m-2 s-1', "Program assumes input units are kg m-2 s-1"
    darray.data = darray.data * 86400
    darray.attrs['units'] = 'mm/day'
    return darray

def apply_mask(darray, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.

    Args:
      darray (xarray.DataArray): Data to mask
      sftlf_file (str): Land surface fraction file
      realm (str): Realm to mask

    """

    dset = xr.open_dataset(sftlf_file)
    assert realm in ['land', 'ocean'], """Valid realms are 'land' or 'ocean'"""
    if realm == 'land':
        masked_darray = darray.where(dset['sftlf'].data < 50)
    else:
        masked_darray = darray.where(dset['sftlf'].data > 50)
    return masked_darray

def create_plot(clim, model_name, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.

    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model_name (str): Name of the climate model
      season (str): Season
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar

    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    # extend and cmap below are illustrative choices
    clim.sel(season=season).plot.contourf(ax=ax,
                                          levels=levels,
                                          extend='max',
                                          transform=ccrs.PlateCarree(),
                                          cbar_kwargs={'label': clim.units},
                                          cmap=cmocean.cm.haline_r)
    ax.coastlines()
    if gridlines:
        ax.gridlines()
    title = '%s precipitation climatology (%s)' %(model_name, season)
    plt.title(title)

def main(inargs):
    """Run the program."""

    dset = xr.open_dataset(inargs.pr_file)
    clim = dset['pr'].groupby('time.season').mean('time', keep_attrs=True)
    clim = convert_pr_units(clim)

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    create_plot(clim, dset.attrs['model_id'], inargs.season,
                gridlines=inargs.gridlines, levels=inargs.cbar_levels)
    plt.savefig(inargs.output_file, dpi=200)

    new_log = cmdprov.new_log(infile_history={inargs.pr_file: dset.attrs['history']})
    fname, extension = inargs.output_file.split('.')
    cmdprov.write_log(fname+'.txt', new_log)

if __name__ == '__main__':
    description='Plot the precipitation climatology for a given season.'
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument("--gridlines", action="store_true", default=False,
                        help="Include gridlines on the plot")
    parser.add_argument("--cbar_levels", type=float, nargs='*', default=None,
                        help='list of levels / tick marks to appear on the colorbar')
    parser.add_argument("--mask", type=str, nargs=2,
                        metavar=('SFTLF_FILE', 'REALM'), default=None,
                        help="""Provide sftlf file and realm to mask ('land' or 'ocean')""")

    args = parser.parse_args()

    main(args)

Key Points

  • It is possible (in only a few lines of code) to record the provenance of a data file or image.