Configuration file
==================

Usage
-----

NEDAS is configured through YAML files and runtime arguments.
The ``NEDAS/config/default.yml`` file defines every entry and its default value.
At runtime, a customized configuration file can be supplied with ``-c CONFIG_FILE``.
The ``CONFIG_FILE`` does not need to define every entry in ``default.yml``,
only the ones relevant to the particular experiment.
In addition, entries of simple types (not compound types such as list, tuple and dict) can be
given a new value at runtime with ``--key value``,
which makes it easy to rerun the same experiment with only one or two parameters changed.
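
The layering described above can be pictured as successive dictionary updates.
The following is a minimal sketch, not the NEDAS implementation: the dictionaries
stand in for parsed YAML content, ``merge_config`` is a hypothetical helper, and it
assumes that command-line values take precedence over the custom file, which in turn
takes precedence over ``default.yml``.

.. code-block:: python

   # Stand-ins for parsed YAML content (values borrowed from the tables below).
   defaults = {"nens": 20, "debug": False, "work_dir": "work"}
   experiment = {"nens": 40}         # entries from a custom CONFIG_FILE
   cli_overrides = {"debug": True}   # e.g. parsed from ``--debug True``

   def merge_config(*layers):
       """Merge dictionaries left to right; later layers take precedence."""
       merged = {}
       for layer in layers:
           merged.update(layer)
       return merged

   config = merge_config(defaults, experiment, cli_overrides)
   print(config)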

To run a NEDAS experiment from the command line:

.. code-block:: bash

   python -m NEDAS -c CONFIG_FILE --key value


Alternatively, in an interactive environment such as a Jupyter notebook,
the configuration object ``config`` can be initialized directly:

.. code-block:: python

   from NEDAS.config import Config
   config = Config(config_file='CONFIG_FILE', key=value)

The ``config`` object can then be used to initialize and run the analysis scheme:

.. code-block:: python

   from NEDAS.schemes import get_scheme
   scheme = get_scheme(config)
   scheme()

Description of entries
----------------------

System paths and runtime environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Entry
     - Description
     - Default (from ``NEDAS/config/default.yml``)
   * - ``work_dir``
     - Working directory for running the analysis scheme.
     - 'work'
   * - ``directories``
     - Runtime directory structure defined by format strings.
     - See details in Table 1.
   * - ``python_env``
     - Initialization script to enter the python environment.

       If not None, at runtime ``". {python_env}"`` will source

       this script before running the python command.
     - None
   * - ``io_mode``
     - I/O mode.

       ``'online'`` keeps model/dataset data in memory;
       
       ``'offline'`` uses files on disk.
     - 'offline'
   * - ``job_submit``
     - Runtime job submitter settings.

       These options are forwarded to the job submitter
       
       (see :doc:`NEDAS.job_submitters`).
     - See details in Table 2.
   * - ``nproc``
     - Total number of processors used when a step is
     
       executed under MPI.
     - 1
   * - ``nproc_mem``
     - Number of processors in a "member group" when 
     
       distributing ensemble members.

       If not set in YAML, the code sets ``nproc_mem = nproc``
       
       and computes ``nproc_rec = nproc / nproc_mem``.

       Must evenly divide ``nproc``.
     - None
     
       (interpreted as ``nproc``)
   * - ``nproc_util``
     - Number of processors to use for utility steps (preprocess,
     
       postprocess, diagnose, etc.).

       If not set in YAML, the code uses ``nproc_util = nproc``.
     - None
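
The relationship between ``nproc``, ``nproc_mem`` and ``nproc_rec`` stated in the table
can be sketched as follows. ``resolve_nproc`` is a hypothetical helper written for
illustration only; it is not part of NEDAS.

.. code-block:: python

   def resolve_nproc(nproc, nproc_mem=None):
       """Return (nproc_mem, nproc_rec) following the rules in the table above."""
       if nproc_mem is None:
           nproc_mem = nproc   # unset is interpreted as ``nproc``
       if nproc % nproc_mem != 0:
           raise ValueError(
               f"nproc_mem={nproc_mem} must evenly divide nproc={nproc}"
           )
       return nproc_mem, nproc // nproc_mem

   print(resolve_nproc(8))      # (8, 1)
   print(resolve_nproc(8, 4))   # (4, 2)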

.. list-table:: Table 1. Breakdown of ``directories`` dictionary
   :header-rows: 1

   * - Key
     - Description
     - Default
   * - ``cycle_dir``
     - Directory for each analysis cycle.
     - | '{work_dir}/cycle/
       | {time:%Y%m%d%H%M}'
   * - ``forecast_dir``
     - Directory for the ensemble forecast step.
     - | '{work_dir}/cycle/
       | {time:%Y%m%d%H%M}/
       | {model_name}'
   * - ``analysis_dir``
     - Directory for the assimilation step
     
       (outer-loop iteration ``iter`` is part of the path).
     - | '{work_dir}/cycle/
       | {time:%Y%m%d%H%M}/
       | analysis/{iter}'
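
The entries in Table 1 are ordinary Python format strings. The sketch below shows how
such strings expand with ``str.format``. It assumes that the line breaks in the table
are for layout only, and the values passed in are illustrative rather than taken from
a real run.

.. code-block:: python

   from datetime import datetime

   # Default format strings from Table 1, written on one line each.
   directories = {
       "cycle_dir": "{work_dir}/cycle/{time:%Y%m%d%H%M}",
       "forecast_dir": "{work_dir}/cycle/{time:%Y%m%d%H%M}/{model_name}",
       "analysis_dir": "{work_dir}/cycle/{time:%Y%m%d%H%M}/analysis/{iter}",
   }

   # Illustrative values; at runtime these come from the configuration.
   values = {
       "work_dir": "work",
       "time": datetime(2001, 1, 7, 0, 0),
       "model_name": "qg.fortran",
       "iter": 0,
   }

   # Unused keyword arguments are ignored by str.format.
   paths = {key: fmt.format(**values) for key, fmt in directories.items()}
   for key, path in paths.items():
       print(f"{key}: {path}")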

.. list-table:: Table 2. Breakdown of ``job_submit`` dictionary.
   :header-rows: 1

   * - Key
     - Description
     - Examples
   * - ``host``
     - Host machine type.

       Machine-specific behavior can be defined 
       
       in the corresponding subclass
       
       in :doc:`NEDAS.job_submitters`.
     - 'local', 'betzy', ...
   * - ``scheduler``
     - Scheduler type.
     - None, 'slurm', 'oar', 'pbs', ...
   * - ``project``
     - Project number for resource allocation.
     - None, 'nn2993k', ...
   * - ``queue``
     - Name of the scheduler queue to
     
       submit jobs to (HPC).
     - None, 'normal', 'devel', ...
   * - ``parallel_mode``
     - Parallelization strategy to request
     
       from the job submitter.
     - 'serial', 'mpi', 'openmp'

Runtime logging
^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Entry
     - Description
     - Default
   * - ``debug``
     - If True, show extra debug messages and output intermediate
     
       data during runtime.
     - False
   * - ``timer``
     - If True, show elapsed time for major steps in the workflow.
     - True
   * - ``interactive``
     - If True, allow ANSI escape codes (colors, cursor movement)
     
       in terminal output.

       If None, auto-detected from the terminal environment.
     - True
   * - ``quiet``
     - If True, suppress most runtime status output.
     - False
   * - ``call_stack``
     - Current call stack context string, set automatically
     
       at runtime.
     - None
   * - ``call_stack_max_level``
     - Maximum call stack depth to display in status output.

       If None, all levels are shown.
     - 2
   * - ``is_notebook``
     - If True, adapt output formatting for Jupyter notebooks.

       If None, auto-detected.
     - None
   * - ``cols``
     - Terminal width in characters used for formatting status lines.

       If None, auto-detected from the terminal.
     - None
   * - ``anchor``
     - Number of characters reserved for the left (description)
     
       part of status lines.
     - 50
   * - ``tabspace``
     - Number of spaces per call stack level indentation.
     - 4
   * - ``progress_bar_width``
     - Width of the progress bar in characters.
     - 10

Analysis scheme design parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Default
   * - ``nens``
     - Ensemble size.
     - 20
   * - ``run_preproc``
     - Whether to run the preprocessing step.
     - True
   * - ``run_forecast``
     - Whether to run the ensemble forecast step.
     - True
   * - ``run_analysis``
     - Whether to run the analysis (assimilation) step.
     - True
   * - ``run_postproc``
     - Whether to run the postprocessing step after assimilation.
     - True
   * - ``run_diagnose``
     - Whether to run the diagnostic tools.
     - True
   * - ``save_checkpoint``
     - If True, save checkpoints of model state and observations
     
       between cycles.
     - False
   * - ``step``
     - Used by :mod:`NEDAS.schemes.filter`.

       If None, will run the entire workflow.

       Otherwise, will only run the specified step.

       (Valid step names depend on the scheme; for the filter scheme
       
       these include ``run_all``, ``prepare_truth``, 
       
       ``prepare_init_ensemble``, ``preprocess``, ``perturb``,
       
       ``filter``, ``postprocess``, ``ensemble_forecast``,
       
       and ``diagnose``.)
     - None

Time controls
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``time_start``
     - Start time of the period of interest.
     - 2001-01-01T00:00:00Z
   * - ``time_end``
     - End time of the period of interest.
     - 2001-01-30T00:00:00Z
   * - ``time_analysis_start``
     - Time of the first analysis cycle.

       Defaults to ``time_start`` if not set.
     - 2001-01-07T00:00:00Z
   * - ``time_analysis_end``
     - Time of the last analysis cycle.

       Defaults to ``time_end`` if not set.
     - 2001-01-28T00:00:00Z
   * - ``cycle_period``
     - Interval in hours between analysis cycles.
     - 12
   * - ``time``
     - Time of the current analysis cycle,
     
       set automatically at runtime.

       If None, will start at ``time_start``.
     - None
   * - ``obs_time_steps``
     - Time steps in hours relative to the analysis time

       for the observations.
     - [0]
   * - ``obs_time_scale``
     - Smoothing window in hours for observations.
     - 0
   * - ``state_time_steps``
     - Time steps in hours relative to the analysis time

       for the state variables.
     - [0]
   * - ``state_time_scale``
     - Smoothing window in hours for state variables.
     - 0
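
As an illustration of how the time controls fit together, the sketch below lists the
analysis cycle times implied by the example values in the table. ``analysis_times`` is
a hypothetical helper, and treating ``time_analysis_end`` as inclusive is an assumption
based on its description as the time of the last analysis cycle.

.. code-block:: python

   from datetime import datetime, timedelta, timezone

   time_analysis_start = datetime(2001, 1, 7, tzinfo=timezone.utc)
   time_analysis_end = datetime(2001, 1, 28, tzinfo=timezone.utc)
   cycle_period = 12   # hours

   def analysis_times(start, end, period_hours):
       """Yield analysis cycle times from start to end (inclusive)."""
       t = start
       while t <= end:
           yield t
           t += timedelta(hours=period_hours)

   cycles = list(analysis_times(time_analysis_start, time_analysis_end, cycle_period))
   print(len(cycles), cycles[0].isoformat(), cycles[-1].isoformat())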

Analysis grid definition
^^^^^^^^^^^^^^^^^^^^^^^^

The ``grid_def`` entry is a dictionary with the following entries:

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``type``
     - Type of grid to use for the analysis step.

       If 'custom', the other entries will be used as kwargs in

       initializing a regular grid, see details in Table 3.

       If a model name is specified, the corresponding

       model grid will be used instead.
     - 'custom', 'qg', etc.
   * - ``mask``
     - Mask for invalid points in the domain.

       If not None, the model name specifies which model generates

       the mask for the analysis grid.
     - None, 'qg', etc.

.. list-table:: Table 3. Additional kwargs for custom regular grid generation.
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``proj``
     - Map projection defined as PROJ4 strings
     - None,

       '+proj=stere +lat_0=90 +lon_0=-45'
   * - ``xmin``
     - X coordinate start
     - 0
   * - ``xmax``
     - X coordinate end
     - 128
   * - ``ymin``
     - Y coordinate start
     - 0
   * - ``ymax``
     - Y coordinate end
     - 128
   * - ``dx``
     - Grid spacing

       Note: the coordinates and grid spacing

       should be in meters. But if proj is None,

       they can be nondimensional.
     - 1
   * - ``centered``
     - If True, the coordinates are defined

       at the center of each grid box.
     - False
   * - ``cyclic_dim``
     - The dimension(s) that are cyclic
     - None, 'x', 'y', or 'xy'
   * - ``distance_type``
     - Type of distance function
     - 'cartesian' or 'spherical'

State definition
^^^^^^^^^^^^^^^^

The ``state_def`` entry is a list; each item is a dictionary that defines one model state variable:

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``name``
     - Model state variable name.

       Corresponds to a key in ``Model.variables``

       implemented in the model interface.
     - 'streamfunc'
   * - ``model_src``
     - Name of the model this variable comes from.

       Should be one of the keys in ``model_def``.
     - 'qg.fortran'
   * - ``var_type``
     - Variable type.
     - 'field', or 'scalar'
   * - ``err_type``
     - Error distribution type.
     - 'normal'

The ``model_def`` entry is a dictionary whose keys are model names, each pointing to a dictionary of model-specific configuration parameters.

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``config_file``
     - YAML configuration file for the model.

       If not specified, will use ``default.yml``

       in the corresponding model module directory.

       Additional entries added below will overwrite

       the settings in the YAML file, making it easier

       to set up twin experiments.
     - None,

       | '{nedas_root}/models/qg/
       | fortran/default.yml'
   * - ``model_env``
     - Initialization script for model.

       At runtime ``". {model_env}"`` will source

       this script before running the model forecast.
     - 'setup.src'
   * - ``model_code_dir``
     - Path to the model code directory.
     - '{nedas_root}/models/qg/fortran'
   * - ``nproc_per_run``
     - Number of processors to use for a model forecast.
     - 1
   * - ``nproc_per_util``
     - Number of processors to use for utility functions.
     - 1
   * - ``walltime``
     - Maximum runtime in seconds for model forecast.
     - 3600
   * - ``restart_dt``
     - Model restart file saving interval in hours.
     - 24
   * - ``forcing_dt``
     - Model boundary condition interval in hours.
     - 24
   * - ``ens_run_strategy``
     - Strategy for running tasks that involve

       an ensemble of members.

       'scheduler': run each member as a separate job

       and distribute the workload using a :class:`Scheduler`.

       'batch': run all members in a single job.
     - 'scheduler' or 'batch'
   * - ``use_job_array``
     - Whether to use job array functionality when

       submitting the jobs via :class:`JobSubmitter`.
     - False
   * - ``ens_init_dir``
     - Directory where the initial ensemble restart files

       are located.
     - '{work_dir}/init_ens'
   * - ``truth_dir``
     - Directory where the truth files are located.

       This is required when using synthetic observations 
       
       that are generated from a truth run.
     - '{work_dir}/truth'
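
Because every ``model_src`` in ``state_def`` should be a key in ``model_def``, a simple
consistency check can catch typos early. The sketch below uses plain dictionaries with
the example values from the tables; ``check_state_def`` is a hypothetical helper and
not a NEDAS function.

.. code-block:: python

   # Configuration fragments as they might look after loading the YAML file.
   model_def = {
       "qg.fortran": {"nproc_per_run": 1, "walltime": 3600, "restart_dt": 24},
   }
   state_def = [
       {"name": "streamfunc", "model_src": "qg.fortran",
        "var_type": "field", "err_type": "normal"},
   ]

   def check_state_def(state_def, model_def):
       """Return error messages for entries whose model_src is not in model_def."""
       errors = []
       for rec in state_def:
           if rec["model_src"] not in model_def:
               errors.append(
                   f"{rec['name']}: unknown model_src {rec['model_src']!r}"
               )
       return errors

   print(check_state_def(state_def, model_def))   # []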

Observation definition
^^^^^^^^^^^^^^^^^^^^^^

The ``obs_def`` entry is a list; each item is a dictionary that defines one observation variable:

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``name``
     - Observation variable name.

       Corresponds to a key in ``Dataset.variables``

       implemented in the dataset interface.
     - 'velocity'
   * - ``dataset_src``
     - Name of the dataset the observation comes from.

       Should be one of the keys in ``dataset_def``.
     - 'synthetic'
   * - ``model_src``
     - Name of the model from which to compute the

       observation priors.
     - 'qg.fortran'
   * - ``nobs``
     - Number of observations.

       When generating a synthetic random observation network,

       use this to control the density.
     - 3000
   * - ``err``
     - Error definition dictionary.
     - See details in Table 4
   * - ``hroi``
     - Horizontal localization distance,

       radius of influence beyond which the observation

       impact is tapered to zero.

       In the same units as grid coordinates
     - inf, 10, etc.
   * - ``vroi``
     - Vertical localization distance,

       in the same units as ``z_coords``
     - inf
   * - ``troi``
     - Temporal localization distance
     - inf
   * - ``impact_on_state``
     - Impact factors of this observation

       on the state variables.

       Unlisted variables have a default impact of 1.0.
     - { 'streamfunc': 0 },

       which turns off the

       impact on streamfunc
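
The default behaviour of ``impact_on_state`` amounts to a dictionary lookup with a
fallback of 1.0. A minimal sketch, in which ``impact_factor`` is a hypothetical helper
and ``'temperature'`` is an invented variable name used only to show the fallback:

.. code-block:: python

   def impact_factor(obs_rec, state_name):
       """Look up the impact of an observation on a state variable (default 1.0)."""
       impacts = obs_rec.get("impact_on_state") or {}
       return impacts.get(state_name, 1.0)

   obs_rec = {"name": "velocity", "impact_on_state": {"streamfunc": 0}}
   print(impact_factor(obs_rec, "streamfunc"))    # 0
   print(impact_factor(obs_rec, "temperature"))   # 1.0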

.. list-table:: Table 4. Breakdown of the observation error definition dictionary.
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``type``
     - Type of error distribution.
     - 'normal'
   * - ``std``
     - Observation error standard deviation.
     - 1.0
   * - ``hcorr``
     - Horizontal correlation length in observation error.
     - 0
   * - ``vcorr``
     - Vertical correlation length in observation error.
     - 0
   * - ``tcorr``
     - Temporal correlation length in observation error.
     - 0
   * - ``cross_corr``
     - Cross-variable correlation in observation error. A dictionary

       {variable_name: corr} listing the correlation between self

       and other variable_name. Auto-correlation is always 1,

       so there is no need to include self in the dictionary.
     - {'streamfunc': 0}

The ``dataset_def`` entry is a dictionary whose keys are dataset names, each pointing to a dictionary of dataset-specific configuration parameters.

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``model_src``
     - Name of the model used for computing
     
       observation priors for this dataset.

       Should be one of the keys in ``model_def``.
     - 'qg.fortran'
   * - ``config_file``
     - YAML configuration file for the dataset.

       If not specified, will use ``default.yml``

       in the corresponding dataset module directory.

       Additional entries added below will overwrite

       the settings in the YAML file.
     - None
   * - ``dataset_dir``
     - Path to the dataset files.

       (For synthetic observations this can be left empty.)
     - None
   * - ``obs_window_min``
     - Start of the observation window, hours relative to 
     
       the analysis time.
     - -6
   * - ``obs_window_max``
     - End of the observation window, hours relative to 
     
       the analysis time.
     - 0
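
The observation window can be read as a range of time offsets around the analysis time.
The sketch below selects observation times using the example values ``-6`` and ``0``.
``in_obs_window`` is a hypothetical helper, and treating both ends of the window as
inclusive is an assumption.

.. code-block:: python

   from datetime import datetime

   def in_obs_window(obs_time, analysis_time, window_min=-6, window_max=0):
       """True if obs_time is within [window_min, window_max] hours of analysis_time."""
       offset_hours = (obs_time - analysis_time).total_seconds() / 3600
       return window_min <= offset_hours <= window_max

   analysis_time = datetime(2001, 1, 7, 0)
   print(in_obs_window(datetime(2001, 1, 6, 21), analysis_time))   # True  (-3 h)
   print(in_obs_window(datetime(2001, 1, 7, 3), analysis_time))    # False (+3 h)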

Synthetic observations are enabled by using a synthetic dataset in ``obs_def`` (e.g. ``dataset_src: synthetic``)
and providing a corresponding entry in ``dataset_def``.

Some additional parameters:

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Default (from ``NEDAS/config/default.yml``)
   * - ``shuffle_obs``
     - Whether to randomize the order of observations.
     - False
   * - ``z_coords_from``
     - Where the reference vertical coordinates come from.
     - 'mean'
   * - ``interp_method``
     - Interpolation method used when mapping between grids.
     - 'linear'

Perturbation
^^^^^^^^^^^^

The top-level ``perturb`` entry controls the optional perturbation step.
In the default configuration it is left empty/None (no perturbation).

If enabled, ``perturb`` should be a list of dictionaries. Each dictionary defines a perturbation to apply
to one ensemble member and one or more variables (see :mod:`NEDAS.core.perturb`).

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``variable``
     - Variable name (string) or list of variable names 
     
       to perturb.
     - 'streamfunc'
   * - ``model_src``
     - Model name the variable(s) come from 
     
       (a key in ``model_def``).
     - 'qg.fortran'
   * - ``type``
     - Perturbation type string.

       The first token selects the main method: 
       
       ``gaussian``, ``powerlaw``, or ``displace``.

       Additional options can be appended with commas 
       
       (e.g. ``gaussian,exp``).
     - 'gaussian'
   * - ``amp``
     - Perturbation amplitude.
     - 0.1
   * - ``hcorr``
     - Horizontal correlation length
     
       (needed by ``gaussian`` and ``displace``).
     - 15
   * - ``tcorr``
     - Temporal correlation length (hours) used

       to correlate perturbations between cycles/time steps.
     - 0
   * - ``powerlaw``
     - Power-law exponent
     
       (needed by ``powerlaw`` perturbations).
     - 4
   * - ``bounds``
     - Optional value bounds ``[vmin, vmax]`` enforced
     
       after perturbation.
     - [0, inf]
   * - ``seed``
     - Optional random seed.
     - 1234

If no perturbation is needed, leave ``perturb`` empty/None.
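
The structure of the ``type`` string and the effect of ``bounds`` can be sketched as
follows. Both functions are hypothetical helpers for illustration; in particular,
interpreting "enforced" as clipping values to ``[vmin, vmax]`` is an assumption.

.. code-block:: python

   def parse_perturb_type(type_str):
       """Split 'method,opt1,opt2' into the main method and a list of options."""
       tokens = [tok.strip() for tok in type_str.split(",")]
       method, options = tokens[0], tokens[1:]
       if method not in ("gaussian", "powerlaw", "displace"):
           raise ValueError(f"unknown perturbation method: {method!r}")
       return method, options

   def apply_bounds(values, bounds):
       """Clip perturbed values to the closed interval [vmin, vmax]."""
       vmin, vmax = bounds
       return [min(max(v, vmin), vmax) for v in values]

   print(parse_perturb_type("gaussian,exp"))             # ('gaussian', ['exp'])
   print(apply_bounds([-0.2, 0.5], [0, float("inf")]))   # [0, 0.5]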

Assimilation method
^^^^^^^^^^^^^^^^^^^

The following parameters help NEDAS locate the correct analysis scheme and assimilation components.

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Default / example
   * - ``scheme``
     - Type of analysis scheme to use.
     - 'filter'
   * - ``assimilator_def``
     - Assimilator configuration dictionary.

       The assimilator class is chosen by
       
       ``assimilator_def.type``.
     - See below.
   * - ``updator_def``
     - Updator configuration dictionary
     
       (applies increments to produce posterior state).

       Alignment-based updators are selected via 
       
       ``updator_def.type`` and further configured
       
       through ``updator_def.config_file``.
     - See ``NEDAS/config/default.yml``
   * - ``covariance_def``
     - Covariance configuration dictionary.
     - See ``NEDAS/config/default.yml``

.. list-table:: Breakdown of ``assimilator_def``.
   :header-rows: 1

   * - Key
     - Description
     - Default
   * - ``type``
     - Assimilator type.
     - 'ETKF'
   * - ``config_file``
     - Optional YAML configuration file for the assimilator.

       If not specified, the assimilator module default is used.
     - None

Covariance inflation parameters are stored in the ``inflation_def`` entry as a dictionary.

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``type``
     - Type of inflation (post/prior, multiplicative/RTPP).
     - 'post,multiplicative'
   * - ``adaptive``
     - Whether to run an adaptive inflation scheme.
     - False
   * - ``coef``
     - Static inflation coefficient.
     - 1.0
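
For readers unfamiliar with multiplicative inflation, the textbook form scales each
member's deviation from the ensemble mean by a constant factor. The sketch below
illustrates that idea on a scalar ensemble. It is not the NEDAS implementation, and
whether ``coef`` multiplies the perturbations (as here) or the variance is an assumption.

.. code-block:: python

   def inflate(ensemble, coef):
       """Scale each member's deviation from the ensemble mean by coef."""
       mean = sum(ensemble) / len(ensemble)
       return [mean + coef * (x - mean) for x in ensemble]

   prior = [1.0, 2.0, 3.0]
   print(inflate(prior, 1.0))   # [1.0, 2.0, 3.0]
   print(inflate(prior, 2.0))   # [0.0, 2.0, 4.0]

With ``coef = 1.0``, the example value in the table, the ensemble is left unchanged.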

Covariance localization settings are defined separately for the spatial and temporal components.
The ``localization_def`` entry is a dictionary with keys ``horizontal``, ``vertical`` and ``temporal``,
each pointing to a dictionary that defines its localization function parameters.

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``type``
     - Type of localization kernel to use.

       Implemented types include ``gaspari_cohn``,
       
       ``step``, and ``exponential``.
     - 'gaspari_cohn'
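
As background on the ``gaspari_cohn`` kernel, the sketch below evaluates the standard
fifth-order piecewise rational function of Gaspari and Cohn (1999). It assumes that the
localization distance (e.g. ``hroi`` in ``obs_def``) is the distance at which the weight
reaches zero; how NEDAS scales the distance internally is not specified here, so treat
the mapping from ``roi`` to the kernel half-width as an assumption.

.. code-block:: python

   def gaspari_cohn(dist, roi):
       """Localization weight: 1 at dist=0, tapering to exactly 0 for dist >= roi."""
       r = abs(dist) / (roi / 2.0)   # kernel half-width c = roi / 2, support is 2c
       if r <= 1.0:
           return (-0.25 * r**5 + 0.5 * r**4 + 0.625 * r**3
                   - (5.0 / 3.0) * r**2 + 1.0)
       if r < 2.0:
           return (r**5 / 12.0 - 0.5 * r**4 + 0.625 * r**3
                   + (5.0 / 3.0) * r**2 - 5.0 * r + 4.0 - 2.0 / (3.0 * r))
       return 0.0

   for d in (0, 2.5, 5, 7.5, 10):
       print(d, round(gaspari_cohn(d, roi=10), 4))

An infinite ``roi`` gives a weight of 1 at every finite distance, which corresponds to
no localization.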

State and observation transforms can be configured with the ``transform_def`` entry,
which is a list of dictionaries each defining one transform to apply
(see :mod:`NEDAS.assim_tools.transforms`).

.. list-table:: Breakdown of a ``transform_def`` entry.
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``type``
     - Transform type.

       Built-in types include ``scale_bandpass`` (for multiscale DA)
       
       and ``identity``.
     - 'scale_bandpass'
   * - ``decompose_obs``
     - If True, apply the same transform decomposition to 
     
       observations as well as state variables.
     - False

Multiscale approach configuration:

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``niter``
     - Number of outer-loop iterations, e.g. number of 
     
       scale components in a multiscale approach.
     - 1
   * - ``iter``
     - Current iteration number
     - 0
   * - ``resolution_level``
     - Resolution level (n) for the analysis grid.

       The analysis grid will have a resolution ``dx * 2**n``

       where ``dx`` is the grid spacing defined in ``grid_def``.
     - [0]
   * - ``character_length``
     - Characteristic length (in grid coordinate units) 
     
       for each scale (large to small).
     - [16]
   * - ``localize_scale_fac``
     - Scale factor for localization distances.
     - [1]
   * - ``obs_err_scale_fac``
     - Scale factor for observation error inflation.
     - [1]
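
The ``resolution_level`` rule is simple arithmetic. The sketch below evaluates
``dx * 2**n`` for a hypothetical two-scale setup. It assumes that the list-valued
entries hold one value per outer-loop iteration and are indexed by ``iter``;
``analysis_dx`` is a helper name invented for this example.

.. code-block:: python

   def analysis_dx(dx, resolution_level, it):
       """Grid spacing of the analysis grid at outer-loop iteration ``it``."""
       return dx * 2 ** resolution_level[it]

   dx = 1                      # grid spacing from ``grid_def``
   niter = 2
   resolution_level = [1, 0]   # hypothetical values, large scale first

   for it in range(niter):
       print(f"iter {it}: analysis grid spacing = {analysis_dx(dx, resolution_level, it)}")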

Diagnostic methods
^^^^^^^^^^^^^^^^^^

The ``diag`` entry is a list. Each element is a dictionary defining a diagnostic method to be run.

.. list-table::
   :header-rows: 1

   * - Key
     - Description
     - Example
   * - ``method``
     - Name of the diagnostic method 
     
       (Python module path under ``NEDAS/diag``).
     - 'misc.convert_output'
   * - ``config_file``
     - Optional YAML configuration file
     
       for the method.

       If not specified, the method module
       
       default is used.
     - None
   * - ``model_src``
     - Which model the diagnostic is applied to.
     - 'qg.fortran'
   * - ``variables``
     - List of variables to process.
     - ['streamfunc']
   * - ``grid_def``
     - Optional output grid definition; 
     
       if omitted, the model grid is used.
     - None
   * - ``file``
     - Output filename format string.
     - | '{work_dir}/output/
       | mem{member:03}_
       | {time:%Y-%m-%dT%H}.nc'
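
The ``file`` entry is a Python format string. The sketch below expands the example from
the table, assuming the line breaks there are for layout only. The member index and
time are illustrative values.

.. code-block:: python

   from datetime import datetime

   file_fmt = "{work_dir}/output/mem{member:03}_{time:%Y-%m-%dT%H}.nc"
   path = file_fmt.format(
       work_dir="work",
       member=5,                          # zero-padded to three digits by ``:03``
       time=datetime(2001, 1, 7, 12),
   )
   print(path)   # work/output/mem005_2001-01-07T12.nc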