Configuration file

Usage

NEDAS configuration is driven by YAML files and runtime argument parsing. The NEDAS/config/default.yml file defines all entries and their default values. At runtime, a customized configuration file can be supplied with -c CONFIG_FILE. The CONFIG_FILE does not need to define every entry in default.yml, only the ones relevant to the particular experiment. In addition, entries of simple types (not compound types such as list, tuple and dict) can be given a new value at runtime with --key value, which makes it easy to rerun the same experiment with only one or two parameters changed.
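The layering described above can be pictured as a sequence of dictionary updates. The following is only a sketch and not the NEDAS implementation: the function `resolve_config` and the three example dictionaries are hypothetical, and it assumes that command-line values take precedence over the configuration file, which in turn takes precedence over default.yml.

```python
from copy import deepcopy

# Hypothetical stand-ins for the three configuration layers.
defaults = {"nens": 20, "work_dir": "work", "debug": False}  # from default.yml
config_file = {"nens": 40}                                   # from -c CONFIG_FILE
cli_overrides = {"debug": True}                              # from --debug True


def resolve_config(defaults, config_file, cli_overrides):
    """Merge the layers so that later layers win (assumed precedence)."""
    resolved = deepcopy(defaults)
    resolved.update(config_file)    # experiment-specific entries
    resolved.update(cli_overrides)  # simple entries changed at runtime
    return resolved


resolved = resolve_config(defaults, config_file, cli_overrides)
print(resolved)
```

Entries that appear in neither the configuration file nor the command line (here `work_dir`) keep their default values.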

To run a NEDAS experiment on the command line:

python -m NEDAS -c CONFIG_FILE --key value

Alternatively, in an interactive environment such as a Jupyter notebook, the configuration object config can be initialized directly with

from NEDAS.config import Config
config = Config(config_file='CONFIG_FILE', key=value)

The config object can then be used to initialize and run the analysis scheme:

from NEDAS.schemes import get_scheme
scheme = get_scheme(config)
scheme()

Description of entries

System paths and runtime environment

Entry	Description	Default (from `NEDAS/config/default.yml`)
`work_dir`	Working directory for running the analysis scheme.	‘work’
`directories`	Runtime directory structure defined by format strings.	See details in Table 1.
`python_env`	Initialization script to enter the python environment. If not None, at runtime `". {python_env}"` will source this script before running the python command.	None
`io_mode`	I/O mode. `'online'` keeps model/dataset data in memory; `'offline'` uses files on disk.	‘offline’
`job_submit`	Runtime job submitter settings. These options are forwarded to the job submitter (see NEDAS.job_submitters package).	See details in Table 2.
`nproc`	Total number of processors used when a step is executed under MPI.	1
`nproc_mem`	Number of processors in a “member group” when distributing ensemble members. If not set in YAML, the code sets `nproc_mem = nproc` and computes `nproc_rec = nproc / nproc_mem`. Must evenly divide `nproc`.	None (interpreted as `nproc`)
`nproc_util`	Number of processors to use for utility steps (preprocess, postprocess, diagnose, etc.). If not set in YAML, the code uses `nproc_util = nproc`.	None
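The defaulting rules for the processor counts in the table above can be sketched as follows. The helper `resolve_proc_layout` is hypothetical and only mirrors the rules as stated here (missing `nproc_mem` and `nproc_util` fall back to `nproc`, and `nproc_mem` must evenly divide `nproc`). It is not the code NEDAS runs.

```python
def resolve_proc_layout(nproc, nproc_mem=None, nproc_util=None):
    """Fill in unset processor counts following the rules in the table."""
    if nproc_mem is None:
        nproc_mem = nproc  # default: one member group spanning all processors
    if nproc % nproc_mem != 0:
        raise ValueError(f"nproc_mem={nproc_mem} must evenly divide nproc={nproc}")
    nproc_rec = nproc // nproc_mem
    if nproc_util is None:
        nproc_util = nproc  # utility steps use all processors by default
    return {
        "nproc": nproc,
        "nproc_mem": nproc_mem,
        "nproc_rec": nproc_rec,
        "nproc_util": nproc_util,
    }


print(resolve_proc_layout(8, nproc_mem=4))
print(resolve_proc_layout(8))
```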

Table 1. Breakdown of `directories` dictionary
Key	Description	Default
`cycle_dir`	Directory for each analysis cycle.	‘{work_dir}/cycle/{time:%Y%m%d%H%M}’
`forecast_dir`	Directory for the ensemble forecast step.	‘{work_dir}/cycle/{time:%Y%m%d%H%M}/{model_name}’
`analysis_dir`	Directory for the assimilation step (outer-loop iteration `iter` is part of the path).	‘{work_dir}/cycle/{time:%Y%m%d%H%M}/analysis/{iter}’
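Because the entries in Table 1 are ordinary Python format strings, their expansion can be previewed with `str.format`. The sketch below uses the default templates. The values in `context` (in particular the `model_name` placeholder `'qg'`) are illustrative assumptions, and how NEDAS itself fills in the fields at runtime may differ.

```python
from datetime import datetime

directories = {
    "cycle_dir": "{work_dir}/cycle/{time:%Y%m%d%H%M}",
    "forecast_dir": "{work_dir}/cycle/{time:%Y%m%d%H%M}/{model_name}",
    "analysis_dir": "{work_dir}/cycle/{time:%Y%m%d%H%M}/analysis/{iter}",
}

# Illustrative values only; at runtime these come from the config object.
context = {
    "work_dir": "work",
    "time": datetime(2001, 1, 7, 0, 0),
    "model_name": "qg",
    "iter": 0,
}

# str.format ignores keyword arguments that a template does not use.
paths = {key: template.format(**context) for key, template in directories.items()}
for key, path in paths.items():
    print(f"{key}: {path}")
```

The `{time:%Y%m%d%H%M}` field works because `datetime` objects accept `strftime` codes as their format specification.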

Table 2. Breakdown of `job_submit` dictionary.
Key	Description	Examples
`host`	Host machine type. Machine-specific behavior can be defined in the corresponding subclass in NEDAS.job_submitters package.	‘local’, ‘betzy’, …
`scheduler`	Scheduler type.	None, ‘slurm’, ‘oar’, ‘pbs’, …
`project`	Project number for resource allocation.	None, ‘nn2993k’, …
`queue`	Name of the scheduler queue to submit jobs to (HPC).	None, ‘normal’, ‘devel’, …
`parallel_mode`	Parallelization strategy to request from the job submitter.	‘serial’, ‘mpi’, ‘openmp’

Runtime logging

Entry	Description	Default
`debug`	If True, show extra debug messages and output intermediate data during runtime.	False
`timer`	If True, show elapsed time for major steps in the workflow.	True
`interactive`	If True, allow ANSI escape codes (colors, cursor movement) in terminal output. If None, auto-detected from the terminal environment.	True
`quiet`	If True, suppress most runtime status output.	False
`call_stack`	Current call stack context string, set automatically at runtime.	None
`call_stack_max_level`	Maximum call stack depth to display in status output. If None, all levels are shown.	2
`is_notebook`	If True, adapt output formatting for Jupyter notebooks. If None, auto-detected.	None
`cols`	Terminal width in characters used for formatting status lines. If None, auto-detected from the terminal.	None
`anchor`	Number of characters reserved for the left (description) part of status lines.	50
`tabspace`	Number of spaces per call stack level indentation.	4
`progress_bar_width`	Width of the progress bar in characters.	10

Analysis scheme design parameters

Key	Description	Default
`nens`	Ensemble size.	20
`run_preproc`	Whether to run the preprocessing step.	True
`run_forecast`	Whether to run the ensemble forecast step.	True
`run_analysis`	Whether to run the analysis (assimilation) step.	True
`run_postproc`	Whether to run the postprocessing step after assimilation.	True
`run_diagnose`	Whether to run the diagnostic tools.	True
`save_checkpoint`	If True, save checkpoints of model state and observations between cycles.	False
`step`	Used by `NEDAS.schemes.filter`. If None, will run the entire workflow. Otherwise, will only run the specified step. (Valid step names depend on the scheme; for the filter scheme these include `run_all`, `prepare_truth`, `prepare_init_ensemble`, `preprocess`, `perturb`, `filter`, `postprocess`, `ensemble_forecast`, and `diagnose`.)	None

Time controls

Key	Description	Example
`time_start`	Start time of the period of interest.	2001-01-01T00:00:00Z
`time_end`	End time of the period of interest.	2001-01-30T00:00:00Z
`time_analysis_start`	Time of the first analysis cycle. Defaults to `time_start` if not set.	2001-01-07T00:00:00Z
`time_analysis_end`	Time of the last analysis cycle. Defaults to `time_end` if not set.	2001-01-28T00:00:00Z
`cycle_period`	Interval in hours between analysis cycles.	12
`time`	Time of the current analysis cycle, set automatically at runtime. If None, will start at `time_start`.	None
`obs_time_steps`	Time steps in hours relative to the analysis for the observations.	[0]
`obs_time_scale`	Smoothing window in hours for observations.	0
`state_time_steps`	Time steps in hours relative to the analysis for the state variables.	[0]
`state_time_scale`	Smoothing window in hours for state variables.	0
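To see which analysis times the time controls imply, the cycles can be enumerated from `time_analysis_start` to `time_analysis_end` in steps of `cycle_period` hours. This is a sketch under the assumption that both end points are included (the table describes them as the first and last analysis cycle). The function name `analysis_cycle_times` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def analysis_cycle_times(time_analysis_start, time_analysis_end, cycle_period):
    """List analysis times, both end points included (assumption)."""
    if cycle_period <= 0:
        raise ValueError("cycle_period must be a positive number of hours")
    step = timedelta(hours=cycle_period)
    times = []
    t = time_analysis_start
    while t <= time_analysis_end:
        times.append(t)
        t += step
    return times


# Example values taken from the table above.
start = datetime(2001, 1, 7, tzinfo=timezone.utc)
end = datetime(2001, 1, 28, tzinfo=timezone.utc)
cycles = analysis_cycle_times(start, end, cycle_period=12)
print(len(cycles), cycles[0].isoformat(), cycles[-1].isoformat())
```

With the example values, 21 days at a 12-hour period give 43 analysis times.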

Analysis grid definition

The grid_def entry is a dictionary with the following entries:

Key	Description	Example
`type`	Type of grid to use for the analysis step. If ‘custom’, the other entries will be used as kwargs in initializing a regular grid, see details in Table 3. If a model name is specified, the corresponding model grid will be used instead.	‘custom’, ‘qg’, etc.
`mask`	Mask for invalid points in the domain. If not None, the model name specifies which model generates the mask for the analysis grid.	None, ‘qg’, etc.

Table 3. Additional kwargs for custom regular grid generation.
Key	Description	Example
`proj`	Map projection defined as PROJ4 strings	None, ‘+proj=stere +lat_0=90 +lon_0=-45’
`xmin`	X coordinate start	0
`xmax`	X coordinate end	128
`ymin`	Y coordinate start	0
`ymax`	Y coordinate end	128
`dx`	Grid spacing. Note: the coordinates and grid spacing should be in meters, but if `proj` is None they can be nondimensional.	1
`centered`	If True, the coordinates are defined at the center of each grid box.	False
`cyclic_dim`	The dimension(s) that are cyclic	None, ‘x’, ‘y’, or ‘xy’
`distance_type`	Type of distance function	‘cartesian’ or ‘spherical’
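As a rough illustration of how `xmin`, `xmax`, `dx` and `centered` might combine into grid coordinates, the sketch below builds one coordinate axis. It rests on two assumptions that are not stated in this document: the end value is treated as exclusive, and `centered=True` shifts every point by half a grid spacing. The helper `axis_coords` is hypothetical, and the actual grid constructor may behave differently.

```python
def axis_coords(vmin, vmax, dx, centered=False):
    """Coordinates along one axis of a regular grid (assumed conventions)."""
    if dx <= 0:
        raise ValueError("dx must be positive")
    npoints = int(round((vmax - vmin) / dx))  # assumption: vmax is exclusive
    offset = 0.5 * dx if centered else 0.0    # assumption: half-cell shift
    return [vmin + offset + i * dx for i in range(npoints)]


# Example values from Table 3: a nondimensional 128-point axis.
x_corner = axis_coords(0, 128, 1)
x_center = axis_coords(0, 128, 1, centered=True)
print(len(x_corner), x_corner[0], x_corner[-1])
print(len(x_center), x_center[0], x_center[-1])
```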

State definition

The state_def entry is a list; each item is a dictionary that defines one model state variable:

Key	Description	Example
`name`	Model state variable name. Corresponds to a key in `Model.variables` implemented in the model interface.	‘streamfunc’
`model_src`	Name of the model this variable comes from. Should be one of the keys in `model_def`.	‘qg.fortran’
`var_type`	Variable type.	‘field’, or ‘scalar’
`err_type`	Error distribution type.	‘normal’

The model_def entry is a dictionary, with model_name as keys pointing to a dictionary of model-specific configuration parameters.

Key	Description	Example
`config_file`	YAML configuration file for the model. If not specified, `default.yml` in the corresponding model module directory is used. Additional entries added below overwrite the settings in the YAML file, which makes it easier to set up twin experiments.	None, ‘{nedas_root}/models/qg/fortran/default.yml’
`model_env`	Initialization script for model. At runtime `". {model_env}"` will source this script before running the model forecast.	‘setup.src’
`model_code_dir`	Path to the model code directory.	‘{nedas_root}/models/qg/fortran’
`nproc_per_run`	Number of processors to use for a model forecast.	1
`nproc_per_util`	Number of processors to use for utility functions.	1
`walltime`	Maximum runtime in seconds for model forecast.	3600
`restart_dt`	Model restart file saving interval in hours.	24
`forcing_dt`	Model boundary condition interval in hours.	24
`ens_run_strategy`	Strategy for running tasks involving an ensemble of tasks. ‘scheduler’: run each member as a separate job and distribute the workload using a `Scheduler`. ‘batch’: run all members in a single job.	‘scheduler’ or ‘batch’
`use_job_array`	Whether to use job array functionality when submitting the jobs via `JobSubmitter`.	False
`ens_init_dir`	Directory where the initial ensemble restart files are located.	‘{work_dir}/init_ens’
`truth_dir`	Directory where the truth files are located. This is required when using synthetic observations that are generated from a truth run.	‘{work_dir}/truth’

Observation definition

The obs_def entry is a list; each item is a dictionary that defines one observation variable:

Key	Description	Example
`name`	Observation variable name. Corresponds to a key in `Dataset.variables` implemented in the dataset interface.	‘velocity’
`dataset_src`	Name of the dataset the observation comes from. Should be one of the keys in `dataset_def`.	‘synthetic’
`model_src`	Name of the model from which to compute the observation priors.	‘qg.fortran’
`nobs`	Number of observations. If generating synthetic random observation network, use this to control the density.	3000
`err`	Error definition dictionary.	See details in Table 4
`hroi`	Horizontal localization distance, radius of influence beyond which the observation impact is tapered to zero. In the same units as grid coordinates	inf, 10, etc.
`vroi`	Vertical localization distance, in the same units as `z_coords`	inf
`troi`	Temporal localization distance	inf
`impact_on_state`	Dictionary of impact factors of this observation on the state variables, keyed by state variable name. Unlisted variables have a default impact of 1.0.	{ ‘streamfunc’: 0 }, which turns off the impact on streamfunc
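The default behavior of `impact_on_state` amounts to a dictionary lookup with a fallback of 1.0. A minimal sketch, in which the helper `impact_factor` and the second variable name `'temperature'` are hypothetical:

```python
def impact_factor(impact_on_state, state_name):
    """Impact of an observation on one state variable; unlisted means 1.0."""
    return (impact_on_state or {}).get(state_name, 1.0)


impact_on_state = {"streamfunc": 0}
print(impact_factor(impact_on_state, "streamfunc"))   # listed: impact switched off
print(impact_factor(impact_on_state, "temperature"))  # unlisted: default impact
print(impact_factor(None, "streamfunc"))              # entry not set at all
```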

Table 4. Breakdown of the observation error definition dictionary.
Key	Description	Example
`type`	Type of error distribution.	‘normal’
`std`	Observation error standard deviation.	1.0
`hcorr`	Horizontal correlation length in observation error.	0
`vcorr`	Vertical correlation length in observation error.	0
`tcorr`	Temporal correlation length in observation error.	0
`cross_corr`	Cross-variable correlation in observation error. A dictionary {variable_name: corr} listing the correlation between self and other variable_name. Auto-correlation is always 1, so there is no need to include self in the dictionary.	{‘streamfunc’: 0}

The dataset_def entry is a dictionary, with dataset_name as keys pointing to a dictionary of dataset-specific configuration parameters.

Key	Description	Example
`model_src`	Name of the model used for computing observation priors for this dataset. Should be one of the keys in `model_def`.	‘qg.fortran’
`config_file`	YAML configuration file for the dataset. If not specified, will use `default.yml` in the corresponding dataset module directory. Additional entries added below will overwrite the settings in the YAML file.	None
`dataset_dir`	Path to the dataset files. (For synthetic observations this can be left empty.)	None
`obs_window_min`	Start of the observation window, hours relative to the analysis time.	-6
`obs_window_max`	End of the observation window, hours relative to the analysis time.	0
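The `model_src` and `dataset_src` entries tie the four definitions together: each one should name a key in `model_def` or `dataset_def`. The sketch below checks those cross-references on a plain dictionary. The function `check_references` is hypothetical and is not part of NEDAS, and the example values are taken from the tables above.

```python
def check_references(config):
    """Return messages for *_src names that have no matching definition."""
    models = config.get("model_def") or {}
    datasets = config.get("dataset_def") or {}
    problems = []
    for rec in config.get("state_def") or []:
        if rec.get("model_src") not in models:
            problems.append(f"state_def '{rec.get('name')}': unknown model_src")
    for rec in config.get("obs_def") or []:
        if rec.get("dataset_src") not in datasets:
            problems.append(f"obs_def '{rec.get('name')}': unknown dataset_src")
        if rec.get("model_src") not in models:
            problems.append(f"obs_def '{rec.get('name')}': unknown model_src")
    for name, rec in datasets.items():
        src = (rec or {}).get("model_src")
        if src is not None and src not in models:
            problems.append(f"dataset_def '{name}': unknown model_src")
    return problems


config = {
    "model_def": {"qg.fortran": {"nproc_per_run": 1}},
    "state_def": [{"name": "streamfunc", "model_src": "qg.fortran"}],
    "dataset_def": {"synthetic": {"model_src": "qg.fortran",
                                  "obs_window_min": -6, "obs_window_max": 0}},
    "obs_def": [{"name": "velocity", "dataset_src": "synthetic",
                 "model_src": "qg.fortran"}],
}
print(check_references(config))
```

An empty list means every reference resolves. A misspelled `dataset_src`, for example, would be reported as one message.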

Synthetic observations are enabled by using a synthetic dataset in obs_def (e.g. dataset_src: synthetic) and providing a corresponding entry in dataset_def.

Some additional parameters:

Key	Description	Default (from `NEDAS/config/default.yml`)
`shuffle_obs`	Whether to randomize the order of observations.	False
`z_coords_from`	Where the reference vertical coordinates come from.	‘mean’
`interp_method`	Interpolation method used when mapping between grids.	‘linear’

Perturbation

The top-level perturb entry controls the optional perturbation step. In the default configuration it is left empty/None (no perturbation).

If enabled, perturb should be a list of dictionaries. Each dictionary defines a perturbation to apply to one ensemble member and one or more variables (see NEDAS.core.perturb).

Key	Description	Example
`variable`	Variable name (string) or list of variable names to perturb.	‘streamfunc’
`model_src`	Model name the variable(s) come from (a key in `model_def`).	‘qg.fortran’
`type`	Perturbation type string. The first token selects the main method: `gaussian`, `powerlaw`, or `displace`. Additional options can be appended with commas (e.g. `gaussian,exp`).	‘gaussian’
`amp`	Perturbation amplitude.	0.1
`hcorr`	Horizontal correlation length (needed by `gaussian` and `displace`).	15
`tcorr`	Temporal correlation length (hours) used to correlate perturbations between cycles/time steps.	0
`powerlaw`	Power-law exponent (needed by `powerlaw` perturbations).	4
`bounds`	Optional value bounds `[vmin, vmax]` enforced after perturbation.	[0, inf]
`seed`	Optional random seed.	1234

If no perturbation is needed, leave perturb empty/None.
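The `type` string and the per-method requirements in the table can be checked before a run. The sketch below splits the string on commas, takes the first token as the main method, and verifies that the keys the table marks as needed are present. The function `parse_perturb_entry` is hypothetical and does not reproduce the parsing in NEDAS.core.perturb.

```python
# Keys the table marks as needed by each main method.
REQUIRED_KEYS = {
    "gaussian": ["hcorr"],
    "displace": ["hcorr"],
    "powerlaw": ["powerlaw"],
}


def parse_perturb_entry(entry):
    """Split `type` into (method, options) and check the needed keys."""
    tokens = [tok.strip() for tok in entry["type"].split(",") if tok.strip()]
    if not tokens:
        raise ValueError("perturbation type is empty")
    method, options = tokens[0], tokens[1:]
    if method not in REQUIRED_KEYS:
        raise ValueError(f"unknown perturbation method '{method}'")
    missing = [key for key in REQUIRED_KEYS[method] if key not in entry]
    if missing:
        raise ValueError(f"'{method}' perturbation needs {missing}")
    return method, options


entry = {
    "variable": "streamfunc",
    "model_src": "qg.fortran",
    "type": "gaussian,exp",
    "amp": 0.1,
    "hcorr": 15,
    "tcorr": 0,
}
print(parse_perturb_entry(entry))
```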

Assimilation method

The following parameters help NEDAS locate the correct analysis scheme and assimilation components.

Key	Description	Default / example
`scheme`	Type of analysis scheme to use.	‘filter’
`assimilator_def`	Assimilator configuration dictionary. The assimilator class is chosen by `assimilator_def.type`.	See below.
`updator_def`	Updator configuration dictionary (applies increments to produce posterior state). Alignment-based updators are selected via `updator_def.type` and further configured through `updator_def.config_file`.	See `NEDAS/config/default.yml`
`covariance_def`	Covariance configuration dictionary.	See `NEDAS/config/default.yml`

Breakdown of `assimilator_def`.
Key	Description	Default
`type`	Assimilator type.	‘ETKF’
`config_file`	Optional YAML configuration file for the assimilator. If not specified, the assimilator module default is used.	None

Covariance inflation parameters are stored in the inflation_def entry as a dictionary.

Key	Description	Example
`type`	Type of inflation (post/prior, multiplicative/RTPP).	‘post,multiplicative’
`adaptive`	Whether to run an adaptive inflation scheme.	False
`coef`	Static inflation coefficient.	1.0
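For orientation, multiplicative inflation in its textbook form scales each member's deviation from the ensemble mean by a constant factor and leaves the mean unchanged. The sketch below applies that form to a scalar ensemble. It assumes that `coef` plays the role of this factor, it does not cover the RTPP or adaptive variants, and the function `inflate_multiplicative` is hypothetical.

```python
def inflate_multiplicative(ensemble, coef):
    """Scale deviations from the ensemble mean by `coef` (textbook form)."""
    mean = sum(ensemble) / len(ensemble)
    return [mean + coef * (member - mean) for member in ensemble]


ensemble = [1.0, 2.0, 3.0]
print(inflate_multiplicative(ensemble, 1.0))  # coef = 1.0 leaves members unchanged
print(inflate_multiplicative(ensemble, 2.0))  # spread doubles, mean stays at 2.0
```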

Covariance localization settings are separately defined for the spatial and temporal components. The localization_def entry is a dictionary with keys horizontal, vertical and temporal each pointing to a dictionary that defines its localization function parameters.

Key	Description	Example
`type`	Type of localization kernel to use. Implemented types include `gaspari_cohn`, `step`, and `exponential`.	‘gaspari_cohn’
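The `gaspari_cohn` kernel refers to the fifth-order piecewise rational function of Gaspari and Cohn, which decreases smoothly from 1 at zero distance to exactly 0 at a finite distance. The sketch below evaluates the standard formula for a single distance. It assumes that the localization distance `roi` is the point where the kernel reaches zero, in line with the description of `hroi` as the radius of influence. The NEDAS implementation may use a different scaling convention.

```python
def gaspari_cohn(dist, roi):
    """Gaspari-Cohn taper: 1 at dist=0, 0 for dist >= roi (assumed scaling)."""
    if roi <= 0:
        raise ValueError("roi must be positive")
    r = abs(dist) / (0.5 * roi)  # kernel support is 2 half-widths
    if r <= 1.0:
        return -0.25 * r**5 + 0.5 * r**4 + 0.625 * r**3 - (5.0 / 3.0) * r**2 + 1.0
    if r <= 2.0:
        return (r**5 / 12.0 - 0.5 * r**4 + 0.625 * r**3 + (5.0 / 3.0) * r**2
                - 5.0 * r + 4.0 - 2.0 / (3.0 * r))
    return 0.0


for d in (0, 2.5, 5, 7.5, 10, 12):
    print(d, round(gaspari_cohn(d, roi=10), 4))
```

Setting `roi` to `inf` gives a weight of 1 at every finite distance, which corresponds to no localization.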

State and observation transforms can be configured with the transform_def entry, which is a list of dictionaries each defining one transform to apply (see NEDAS.assim_tools.transforms).

Breakdown of a `transform_def` entry.
Key	Description	Example
`type`	Transform type. Built-in types include `scale_bandpass` (for multiscale DA) and `identity`.	‘scale_bandpass’
`decompose_obs`	If True, apply the same transform decomposition to observations as well as state variables.	False

Multiscale approach configuration:

Key	Description	Example
`niter`	Number of outer-loop iterations, e.g. number of scale components in a multiscale approach.	1
`iter`	Current iteration number	0
`resolution_level`	Resolution level (n) for the analysis grid. The analysis grid will have a resolution `dx * 2**n` where `dx` is the grid spacing defined in `grid_def`.	[0]
`character_length`	Characteristic length (in grid coordinate units) for each scale (large to small).	[16]
`localize_scale_fac`	Scale factor for localization distances.	[1]
`obs_err_scale_fac`	Scale factor for observation error inflation.	[1]
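The list-valued multiscale entries hold one value per outer-loop iteration, and the analysis grid spacing follows `dx * 2**n`. The sketch below expands them into per-iteration settings. The function `scale_settings` and the two-scale example values are hypothetical, and the sketch assumes every list has exactly `niter` elements.

```python
def scale_settings(dx, niter, resolution_level, character_length,
                   localize_scale_fac, obs_err_scale_fac):
    """Expand the per-scale lists into one settings dict per iteration."""
    lists = {
        "resolution_level": resolution_level,
        "character_length": character_length,
        "localize_scale_fac": localize_scale_fac,
        "obs_err_scale_fac": obs_err_scale_fac,
    }
    for name, values in lists.items():
        if len(values) != niter:
            raise ValueError(f"{name} needs {niter} values, got {len(values)}")
    return [
        {
            "iter": i,
            "analysis_dx": dx * 2 ** resolution_level[i],
            "character_length": character_length[i],
            "localize_scale_fac": localize_scale_fac[i],
            "obs_err_scale_fac": obs_err_scale_fac[i],
        }
        for i in range(niter)
    ]


# Hypothetical two-scale setup: a coarse pass followed by a full-resolution pass.
settings = scale_settings(dx=1, niter=2, resolution_level=[1, 0],
                          character_length=[16, 4],
                          localize_scale_fac=[2, 1], obs_err_scale_fac=[1, 1])
for item in settings:
    print(item)
```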

Diagnostic methods

The diag entry is a list. Each element is a dictionary defining a diagnostic method to be run.

Key	Description	Example
`method`	Name of the diagnostic method (Python module path under `NEDAS/diag`).	‘misc.convert_output’
`config_file`	Optional YAML configuration file for the method. If not specified, the method module default is used.	None
`model_src`	Which model the diagnostic is applied to.	‘qg.fortran’
`variables`	List of variables to process.	[‘streamfunc’]
`grid_def`	Optional output grid definition; if omitted, the model grid is used.	None
`file`	Output filename format string.	‘{work_dir}/output/mem{member:03}_{time:%Y-%m-%dT%H}.nc’