Workflow vocabulary

Creating a climate product usually involves multiple steps:

  1. Selecting datasets
  2. Subsetting the data to a specific region and period
  3. Either regridding the multiple datasets to a common grid or computing spatial averages
  4. Computing climate indices
  5. Creating graphs or tables from the results

Typically, each step involves calling one or more individual processes, and it’s convenient to combine these steps into a workflow. Here we use workflow to mean a formal description of the logical organization and ordering of individual processes. The workflow logic is encapsulated in a JSON file using a vocabulary (called a schema) that we describe below.

Workflows are built by combining Workflow_task objects into Group_of_task objects. Tasks and groups are executed sequentially or in parallel, as specified by the tasks and parallel_groups fields of the Workflow object (see the Workflow schema below).
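
For illustration, a small workflow running two tasks in sequence could look like the following sketch. The WPS URL, process identifiers, input names and data URL are hypothetical placeholders; the schema elements used here (Workflow, Workflow_task, Input_description) are described below.

  {
    "name": "subset_and_index",
    "tasks": [
      {
        "name": "subset",
        "url": "https://example.org/wps",
        "identifier": "subset_bbox",
        "inputs": {
          "resource": ["https://example.org/data/tasmax.nc"],
          "lat0": "45.0",
          "lat1": "50.0",
          "lon0": "-75.0",
          "lon1": "-70.0"
        }
      },
      {
        "name": "heat_wave_index",
        "url": "https://example.org/wps",
        "identifier": "heat_wave_frequency",
        "linked_inputs": {
          "resource": {"task": "subset", "as_reference": true}
        }
      }
    ]
  }

The second task only starts once the first has completed, because its resource input is linked to the output of the subset task and requested as a reference (URL).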

Workflow
High-level object describing a workflow.
Object Properties:
 
  • name (str) – Workflow name
  • tasks (list) – Array of Workflow_task. [optional*]
  • parallel_groups (list) – Array of Group_of_task executed in parallel over multiple processes. [optional*]

Note

Either tasks or parallel_groups must be specified.

Workflow_task

Describes an individual task.

Object Properties:
 
  • name (str) – Unique name given to each workflow task
  • url (str) – Url of the WPS provider
  • identifier (str) – Identifier of a WPS process
  • inputs (dict) – Dictionary of inputs that must be fed to the WPS process. The key is the input name and the value is either the input data or an array of data if multiple values are allowed for this input. [optional]
  • linked_inputs (dict) – Dictionary of dynamic inputs that must be fed to the WPS process and obtained by the output of other tasks. The key is the input name (or None) and the value is an Input_description dictionary or an array of them if multiple values are allowed for this input. [optional]
  • progress_range (list) – Array [start, end] defining the overall progress range of this task. Default value is [0, 100]. [optional]

Note

  • Setting a linked_inputs key to None allows planning the execution of a task after another one without feeding any output of the previous task to an input.
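
For example, a Workflow_task entry combining static inputs, a linked input and a custom progress range could look like this sketch (the URL, process identifier, input names and the referenced task name are hypothetical placeholders):

  {
    "name": "plot_results",
    "url": "https://example.org/wps",
    "identifier": "simple_plot",
    "inputs": {"variable": "tasmax"},
    "linked_inputs": {
      "resource": {"task": "heat_wave_index", "as_reference": true}
    },
    "progress_range": [80, 100]
  }

While this task runs, the overall workflow progress reported spans 80 to 100.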

Group_of_task

Describes a group of tasks to be run concurrently.

Object Properties:
 
  • name (str) – Name of the group of tasks.
  • max_processes (int) – Maximum number of processes to run concurrently to process the data.
  • map (Input_description) – Object describing what has to be mapped inside the group, or an array of data to be mapped directly.
  • reduce (Input_description) – Object describing what has to be reduced before leaving the group.
  • tasks (list) – Array of Workflow_task to run concurrently inside the group.
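
As a sketch, a group that maps an array of dataset URLs over concurrent subsetting tasks and reduces their outputs before leaving the group could be declared as follows. The process identifier, input names and URLs are hypothetical placeholders, and the example assumes, per the Input_description schema below, that a task inside the group receives its mapped value by referring to the group name:

  {
    "name": "per_dataset_subset",
    "max_processes": 4,
    "map": ["https://example.org/data/tasmax_model_a.nc",
            "https://example.org/data/tasmax_model_b.nc"],
    "reduce": {"task": "subset_one_dataset"},
    "tasks": [
      {
        "name": "subset_one_dataset",
        "url": "https://example.org/wps",
        "identifier": "subset_bbox",
        "inputs": {"lat0": "45.0", "lat1": "50.0", "lon0": "-75.0", "lon1": "-70.0"},
        "linked_inputs": {
          "resource": {"task": "per_dataset_subset"}
        }
      }
    ]
  }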

Input_description

Identifies the output of a process included in the workflow that serves as an input to a downstream process.

Object Properties:
 
  • task (str) – Name of the task from which the input should be taken (if taken from a Group_of_task, the value comes from its map element).
  • output (str) – Output name in the task given above. Not required if the task has only one output or when referring to the map or reduce element of a group. [optional]
  • as_reference (boolean) – Specifies whether the task output should be obtained as a reference (URL) or as data directly. Defaults to false if omitted. [optional]

Note

The workflow executor can, of course, match a reference output to an input expecting a reference and a data output to an input expecting data. It can also read the value behind a reference output to feed an input expecting data. However, linking a data output to an input expecting a reference will raise an exception.
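
For instance, inside a task's linked_inputs, two Input_description entries could look like this sketch (the task, output and input names are hypothetical placeholders): the first requests the output as a URL reference, while the second omits as_reference and therefore receives the data directly:

  {
    "linked_inputs": {
      "resource": {"task": "subset", "output": "output", "as_reference": true},
      "summary": {"task": "subset", "output": "log_file"}
    }
  }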