mlwiz.evaluation

evaluation.config

Configuration wrapper with attribute-style access.

The Config class exposes dictionary keys as attributes and supports pretty printing.

class mlwiz.evaluation.config.Config(config_dict: dict)

Bases: object

Simple class to manage the configuration dictionary as a Python object with fields.

Parameters:: config_dict (dict) – the configuration dictionary

get(key: str, default: object | None = None) → object

Returns the value associated with the key if it is present in the dictionary, otherwise the specified default value.

Parameters:

key (str) – the key to look up in the dictionary
default (object) – the default object

Returns:

a value from the dictionary

items() → list

Return a view on (key, value) pairs.

Returns:: View over (key, value) pairs.
Return type:: dict_items

keys() → set

Return a view on the configuration keys.

Returns:: View over keys in the dictionary.
Return type:: dict_keys
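
As a rough illustration of the attribute-style access described above, the following is a minimal stand-in and not the mlwiz implementation. The class name AttrConfig and the sample keys are hypothetical.

```python
class AttrConfig:
    """Hypothetical stand-in mimicking the documented Config behaviour."""

    def __init__(self, config_dict: dict):
        # Keep a reference to the wrapped dictionary.
        self._config_dict = config_dict

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails, so dictionary
        # keys become readable as attributes.
        data = self.__dict__.get("_config_dict", {})
        try:
            return data[name]
        except KeyError:
            raise AttributeError(name) from None

    def get(self, key, default=None):
        # Same contract as dict.get: fall back to the default if missing.
        return self._config_dict.get(key, default)

    def keys(self):
        return self._config_dict.keys()

    def items(self):
        return self._config_dict.items()


cfg = AttrConfig({"lr": 0.01, "device": "cpu"})
print(cfg.lr)                 # attribute-style access
print(cfg.get("epochs", 10))  # missing key, default returned
print(list(cfg.keys()))
```
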

evaluation.evaluator

Ray-based experiment evaluation and result aggregation.

Implements the RiskAssesser workflow and helpers for progress updates and notifications.

class mlwiz.evaluation.evaluator.RiskAssesser(outer_folds: int, inner_folds: int, experiment_class: Callable[[...], Experiment], exp_path: str, splits_filepath: str, model_configs: Grid | RandomSearch, risk_assessment_training_runs: int, model_selection_training_runs: int, higher_is_better: bool, gpus_per_task: float, base_seed: int = 42, training_timeout_seconds: int = -1)

Bases: object

Class implementing a K-Fold technique to do Risk Assessment (estimate of the true generalization performances) and K-Fold Model Selection (select the best hyper-parameters for each external fold).

Parameters:

outer_folds (int) – The number K of outer TEST folds. You should have generated the splits accordingly
inner_folds (int) – The number K of inner VALIDATION folds. You should have generated the splits accordingly
experiment_class – (Callable[…, Experiment]): the experiment class to be instantiated
exp_path (str) – The folder in which to store all results
splits_filepath (str) – The splits filepath with additional meta information
model_configs – (Union[Grid, RandomSearch]): an object storing all possible model configurations, e.g., config.base.Grid
risk_assessment_training_runs (int) – number of final training runs to mitigate bad initializations
model_selection_training_runs (int) – number of training runs to mitigate bad initializations at model selection time
higher_is_better (bool) – whether the best model for each external fold should be selected by higher or lower score values
gpus_per_task (float) – Number of gpus to assign to each experiment. Can be < 1.
base_seed (int) – Seed used to generate experiments seeds. Used to replicate results. Default is 42
training_timeout_seconds (int) – optional timeout limit per experiment in seconds

_create_dataset_getter(outer_k: int, inner_k: int | None) → DataProvider: Instantiates and configures a dataset provider for the requested folds.

_request_termination(): Signals all workers and the UI to terminate gracefully.

compute_best_hyperparameters(folder: str, outer_k: int, no_configurations: int, skip_config_ids: List[int])

Chooses the best hyper-parameters configuration using the proper validation mean score.

Parameters:

folder (str) – the model selection folder associated with outer fold k
outer_k (int) – the current outer fold to consider. Used for Telegram updates
no_configurations (int) – number of possible configurations
skip_config_ids – list of configuration ids to skip
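
The selection rule described above (best mean validation score, with some configurations skipped) can be sketched as follows. This is an illustration under assumptions, not the method's actual code: the function name pick_best_config and the in-memory validation_scores mapping are hypothetical, whereas the real method reads results from the model selection folder.

```python
from statistics import mean


def pick_best_config(validation_scores, higher_is_better, skip_config_ids=()):
    """Return (config_id, mean_score) of the winning configuration.

    validation_scores maps a configuration id to the list of validation
    scores obtained on the inner folds (hypothetical in-memory layout).
    """
    candidates = {
        config_id: mean(scores)
        for config_id, scores in validation_scores.items()
        if config_id not in skip_config_ids
    }
    if not candidates:
        raise ValueError("every configuration was skipped")
    # Pick the largest or smallest mean depending on the metric direction.
    chooser = max if higher_is_better else min
    best_id = chooser(candidates, key=candidates.get)
    return best_id, candidates[best_id]


scores = {1: [0.80, 0.90], 2: [0.70, 0.75], 3: [0.95, 0.90]}
best_id, best_mean = pick_best_config(scores, higher_is_better=True)
skipped_id, _ = pick_best_config(scores, True, skip_config_ids=[3])
lowest_id, _ = pick_best_config(scores, higher_is_better=False)
print(best_id, skipped_id, lowest_id)
```
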

compute_final_runs_score_per_fold(outer_k: int)

Computes the average scores for the final runs of a specific outer fold

Parameters:: outer_k (int) – id of the outer fold from 0 to K-1

compute_risk_assessment_result(): Aggregates the outer fold results and computes the Training and Test mean/std

model_selection(kfold_folder: str, outer_k: int, debug: bool, execute_config_id: int | None, skip_config_ids: List[int])

Performs model selection.

Parameters:

kfold_folder – The root folder for model selection
outer_k – the current outer fold to consider
debug – if True, sequential execution is performed and logs are printed to screen
execute_config_id – if debug mode is enabled, it will prioritize the execution of this configuration. It assumes indices start from 1. Use this to debug specific configurations.
skip_config_ids – if provided, the listed configurations will not be considered for model selection. Use it, for instance, when a run is taking too long to execute and you decide it is not worth waiting for.

process_config_results_across_inner_folds(config_folder: str, config: Config)

Averages the results for each configuration across inner folds and stores it into a file.

Parameters:

config_folder (str)
config (Config) – the configuration object

process_model_selection_runs(inner_fold_config_folder: str, inner_k: int)

Computes the average performance across the training runs of a specific configuration on a specific inner fold split.

Parameters:

inner_fold_config_folder (str) – an inner fold experiment folder of a specific configuration
inner_k (int) – the inner fold id

risk_assessment(debug: bool, execute_config_id: int | None = None, skip_config_ids: List[int] | None = None)

Performs risk assessment to evaluate the performances of a model.

Parameters:

debug – if True, sequential execution is performed and logs are printed to screen
execute_config_id – if debug mode is enabled, it will prioritize the execution of this configuration for each model selection procedure. It assumes indices start from 1. Use this to debug specific configurations.
skip_config_ids – if provided, the listed configurations will not be considered for model selection. Use it, for instance, when a run is taking too long to execute and you decide it is not worth waiting for.

run_final_model(outer_k: int, debug: bool)

Performs the final runs once the best model for outer fold outer_k has been chosen.

Parameters:

outer_k (int) – the current outer fold to consider
debug (bool) – if True, sequential execution is performed and logs are printed to screen

wait_configs(skip_config_ids: List[int]) → bool

Waits for configurations to terminate and updates the state of the progress manager

Returns:: True if all runs completed successfully, False otherwise.
Return type:: bool

mlwiz.evaluation.evaluator._get_ray_num_gpus_per_task(default: float = 0.0) → float

Return the Ray GPU request per task from the environment.

This exists primarily to keep module import side-effect free (e.g. during Sphinx autodoc) when the variable is unset or malformed.

mlwiz.evaluation.evaluator._make_termination_checker(progress_actor, min_interval: float = 0.2) → Callable[[], bool]: Creates a closure that checks for termination requests without hammering the actor.
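
The idea of a rate-limited termination check can be sketched with a plain closure. This is only an illustration: the real helper queries a Ray progress actor, while the poll callable below is a hypothetical substitute, and latching the result once termination is seen is a design choice of this sketch.

```python
import time
from typing import Callable


def make_termination_checker(
    poll: Callable[[], bool], min_interval: float = 0.2
) -> Callable[[], bool]:
    """Wrap poll so it is invoked at most once every min_interval seconds."""
    last_check = float("-inf")
    cached = False

    def should_terminate() -> bool:
        nonlocal last_check, cached
        now = time.monotonic()
        # Reuse the cached answer if termination was already requested
        # or if the previous poll happened too recently.
        if cached or now - last_check < min_interval:
            return cached
        last_check = now
        cached = bool(poll())
        return cached

    return should_terminate


calls = []


def fake_poll() -> bool:
    # Hypothetical stand-in for a remote call to the progress actor.
    calls.append(1)
    return False


check = make_termination_checker(fake_poll, min_interval=60.0)
results = [check() for _ in range(5)]
print(len(calls))  # the underlying poll ran once despite five checks
```
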

mlwiz.evaluation.evaluator._mean_std_ci(values: numpy.ndarray) → Tuple[float, float, float]: Computes mean, std, and 95% confidence interval for the provided values.

mlwiz.evaluation.evaluator._push_progress_update(progress_actor, payload: dict): Safely forwards progress updates to the shared actor.

mlwiz.evaluation.evaluator._set_cuda_memory_limit_from_env(): Best-effort limit of per-process GPU memory based on the configured Ray GPU fraction. No-op if CUDA is unavailable or the value is invalid.

mlwiz.evaluation.evaluator.extract_and_sum_elapsed_seconds(file_path)

Sum per-run elapsed time entries from an experiment log file.

The evaluator writes elapsed-time markers to the experiment log in the form:

Total time of the experiment in seconds: <SECONDS>

This helper scans the file for all such entries and returns their sum.

Parameters:: file_path (str | os.PathLike) – Path to the experiment log file.
Returns:: Sum of all matched elapsed seconds.
Return type:: float

Side effects:: Reads the file from disk.
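
A minimal sketch of the scan described above, assuming the marker appears exactly in the documented form and that the value is a plain decimal number. The function name sum_elapsed_seconds and the regular expression are illustrative and may differ from the library's code.

```python
import re
import tempfile
from pathlib import Path

# Matches the documented marker followed by an integer or decimal value.
ELAPSED_RE = re.compile(
    r"Total time of the experiment in seconds:\s*(\d+(?:\.\d+)?)"
)


def sum_elapsed_seconds(file_path) -> float:
    """Sum every elapsed-time entry found in the log file."""
    text = Path(file_path).read_text()
    return float(sum(float(value) for value in ELAPSED_RE.findall(text)))


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "experiment.log"
    log_file.write_text(
        "epoch 1 done\n"
        "Total time of the experiment in seconds: 12.5\n"
        "resumed run\n"
        "Total time of the experiment in seconds: 30\n"
    )
    total = sum_elapsed_seconds(log_file)

print(total)
```
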

mlwiz.evaluation.evaluator.run_test(experiment_class: Callable[[...], Experiment], dataset_getter: Callable[[...], DataProvider], best_config: dict, outer_k: int, run_id: int, final_run_exp_path: str, final_run_torch_path: str, exp_seed: int, training_timeout_seconds: int, logger: Logger, progress_actor=None) → Tuple[int, int, float]

Ray job that performs a risk assessment run and returns bookkeeping information for the progress manager.

Parameters:

experiment_class – (Callable[…, Experiment]): the class of the experiment to instantiate
dataset_getter – (Callable[…, DataProvider]): the class of the data provider to instantiate
best_config (dict) – the best configuration to use for this specific outer fold
run_id (int) – the id of the final run (for bookkeeping reasons)
final_run_exp_path (str) – path of the experiment root folder
final_run_torch_path (str) – path where to store the results of the experiment
exp_seed (int) – seed of the experiment
training_timeout_seconds (int) – timeout for the experiment in seconds
logger (Logger) – a logger to log information in the appropriate file

Returns:

a tuple with outer fold id, final run id, and time elapsed

mlwiz.evaluation.evaluator.run_valid(experiment_class: Callable[[...], Experiment], dataset_getter: Callable[[...], DataProvider], config: dict, config_id: int, run_id: int, fold_run_exp_folder: str, fold_run_results_torch_path: str, exp_seed: int, training_timeout_seconds: int, logger: Logger, progress_actor=None) → Tuple[int, int, int, int, float]

Ray job that performs a model selection run and returns bookkeeping information for the progress manager.

Parameters:

experiment_class – (Callable[…, Experiment]): the class of the experiment to instantiate
dataset_getter – (Callable[…, DataProvider]): the class of the data provider to instantiate
config (dict) – the configuration of this specific experiment
config_id (int) – the id of the configuration (for bookkeeping reasons)
run_id (int) – the id of the training run (for bookkeeping reasons)
fold_run_exp_folder (str) – path of the experiment root folder
fold_run_results_torch_path (str) – path where to store the results of the experiment
exp_seed (int) – seed of the experiment
training_timeout_seconds (int) – timeout for the experiment in seconds
logger (Logger) – a logger to log information in the appropriate file

Returns:

a tuple with outer fold id, inner fold id, config id, run id, and time elapsed

mlwiz.evaluation.evaluator.send_telegram_update(bot_token: str, bot_chat_ID: str, bot_message: str)

Sends a message using Telegram APIs. Markdown can be used.

Parameters:

bot_token (str) – token of the user’s bot
bot_chat_ID (str) – identifier of the chat where to write the message
bot_message (str) – the message to be sent

evaluation.grid

Grid-search configuration expansion.

The Grid class enumerates all combinations from a YAML-defined hyperparameter space.

class mlwiz.evaluation.grid.Grid(configs_dict: dict)

Bases: object

Class that implements grid-search. It computes all possible configurations starting from a suitable config file.

Parameters:: configs_dict (dict) – the configuration dictionary specifying the different configurations to try

_gen_configs() → List[dict]

Takes a dictionary of key:list pairs and computes all possible combinations.

Returns:: A list of all possible configurations in the form of dictionaries

_gen_helper(cfgs_dict: dict) → dict: Helper generator that yields one possible configuration at a time.

_list_helper(values: object) → object: Recursively parses lists of possible options for a given hyper-parameter.

property exp_name: str

Computes the name of the root folder

Returns:: the name of the root folder as made of EXP-NAME_DATASET-NAME

property num_configs: int

Computes the number of configurations to try during model selection

Returns:: the number of configurations
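
The expansion of key:list pairs into all combinations can be sketched with itertools.product. This covers only a flat search space; the function name expand_grid and the sample keys are hypothetical, and the actual class also parses nested options through its helpers.

```python
from itertools import product


def expand_grid(space: dict) -> list:
    """Return every combination of a flat key -> options mapping."""
    keys = list(space)
    # Treat a non-list value as a single fixed option.
    options = [v if isinstance(v, list) else [v] for v in space.values()]
    return [dict(zip(keys, combo)) for combo in product(*options)]


configs = expand_grid(
    {"lr": [0.1, 0.01], "hidden_units": [32, 64], "device": "cpu"}
)
print(len(configs))
print(configs[0])
```

The number of generated dictionaries is the product of the option counts, which is what a property such as num_configs would report for a flat space.
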

evaluation.random_search

Random-search configuration sampling.

The RandomSearch class samples configurations from a YAML-defined search space.

class mlwiz.evaluation.random_search.RandomSearch(configs_dict: dict)

Bases: Grid

Class that implements random-search. It computes all possible configurations starting from a suitable config file.

Parameters:: configs_dict (dict) – the configuration dictionary specifying the different configurations to try

_dict_helper(configs: dict)

Recursively parses a dictionary

Returns:: A dictionary

_gen_helper(cfgs_dict: dict) → Iterator[Dict[str, Any]]

Takes a dictionary of key:list pairs and computes all possible combinations.

Returns:: A list of all possible configurations in the form of dictionaries

_sampler_helper(configs: dict)

Samples one or more hyperparameters and returns them as a dictionary.

Returns:
A dictionary

evaluation.util

Utilities for sampling, instantiation, and results analysis.

Includes random-search samplers and helpers to load runs, instantiate datasets/models, and inspect artifacts.

mlwiz.evaluation.util._collect_metric_samples(exp_folder: str, metric_key: str, set_key: str) → Tuple[numpy.ndarray, str]

Collect metric samples for an experiment and split.

If only one outer fold is present, this falls back to using the final-run results as samples; otherwise it uses one sample per outer fold (the fold mean).

Parameters:

exp_folder (str) – Path to the experiment folder.
metric_key (str) – Metric name to extract.
set_key (str) – Split name (case-insensitive): 'training', 'validation', or 'test'.

Returns:

(samples, source) where source is either 'final_runs' or 'outer_fold_means'.

Return type:

Tuple[numpy.ndarray, str]

Raises:

ValueError – If set_key is invalid or no folds are found.

mlwiz.evaluation.util._df_to_latex_table(df, no_decimals=2, model_as_row=True)

Convert an assessment-results DataFrame to a formatted LaTeX table.

Parameters:

df (pandas.DataFrame) – DataFrame where each row describes a model/dataset pair and includes score columns such as test and test_std.
no_decimals (int) – Number of decimal places to display.
model_as_row (bool) – If True, models are rows and datasets are columns; otherwise datasets are rows and models are columns.

Returns:

LaTeX table string produced by pandas.DataFrame.to_latex().

Return type:

str

mlwiz.evaluation.util._list_outer_fold_ids(exp_folder: str) → List[int]

List outer fold identifiers available in an experiment’s assessment folder.

Parameters:: exp_folder (str) – Path to the experiment folder.
Returns:: Sorted list of outer fold ids found under <exp_folder>/MODEL_ASSESSMENT.
Return type:: list[int]
Raises:: FileNotFoundError – If the assessment folder is missing.
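
A sketch of how such a listing might work, assuming outer fold folders are named OUTER_FOLD_<id> as in the path shown for _load_final_run_metric_samples. The function name list_outer_fold_ids is illustrative and the library's own parsing may differ.

```python
import re
import tempfile
from pathlib import Path


def list_outer_fold_ids(exp_folder) -> list:
    """Return the sorted ids of OUTER_FOLD_<id> folders under MODEL_ASSESSMENT."""
    assessment = Path(exp_folder) / "MODEL_ASSESSMENT"
    if not assessment.is_dir():
        raise FileNotFoundError(f"missing assessment folder: {assessment}")
    ids = []
    for entry in assessment.iterdir():
        match = re.fullmatch(r"OUTER_FOLD_(\d+)", entry.name)
        if entry.is_dir() and match:
            ids.append(int(match.group(1)))
    # Numeric sort, so fold 10 comes after fold 2.
    return sorted(ids)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "MODEL_ASSESSMENT"
    for fold in (2, 10, 1):
        (root / f"OUTER_FOLD_{fold}").mkdir(parents=True)
    (root / "assessment_results.json").write_text("{}")  # ignored: not a fold folder
    fold_ids = list_outer_fold_ids(tmp)
    try:
        list_outer_fold_ids(Path(tmp) / "does_not_exist")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(fold_ids)
```
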

mlwiz.evaluation.util._load_final_run_metric_samples(exp_folder: str, outer_fold_id: int, set_key: str, metric_key: str) → numpy.ndarray

Load per-run metric samples from cached final-run results.

This scans run_{i}_results.dill files under: <exp_folder>/MODEL_ASSESSMENT/OUTER_FOLD_<id>/final_run<i>/ and extracts metric_key from the selected split.

Parameters:

exp_folder (str) – Path to the experiment folder.
outer_fold_id (int) – Outer fold id.
set_key (str) – Split name (case-insensitive): 'training', 'validation', or 'test'.
metric_key (str) – Metric name to extract.

Returns:

1D array of metric samples (dtype=float).

Return type:

numpy.ndarray

Raises:

KeyError – If metric_key is missing from a run result.
ValueError – If no run results are found.

mlwiz.evaluation.util._summarize_samples(samples: numpy.ndarray, confidence_level: float)

Compute mean/std and a normal-approximation confidence interval half-width.

Parameters:

samples (numpy.ndarray) – 1D array of metric samples (must be non-empty).
confidence_level (float) – Confidence level in the (0, 1) range (e.g., 0.95).

Returns:

(mean, std, ci_half_width).

Return type:

tuple[float, float, float]
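
The normal-approximation half-width can be sketched with the standard library alone. Two assumptions are made here: the sample standard deviation (n - 1 denominator) is used, and the half-width is z * std / sqrt(n). The library may use a different std convention, and the name summarize is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(samples, confidence_level=0.95):
    """Return (mean, std, ci_half_width) under a normal approximation."""
    n = len(samples)
    if n == 0:
        raise ValueError("samples must be non-empty")
    m = mean(samples)
    # Assumption: sample std; defined as 0.0 for a single sample.
    s = stdev(samples) if n > 1 else 0.0
    # Two-sided critical value, about 1.96 for a 0.95 confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return m, s, z * s / sqrt(n)


m, s, half_width = summarize([1.0, 2.0, 3.0, 4.0], confidence_level=0.95)
print(m, s, half_width)
```
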

mlwiz.evaluation.util.choice(*args)

Sample one value uniformly at random from the provided arguments.

Parameters:: *args – Candidate values to sample from.
Returns:: One of the provided values.
Return type:: object

mlwiz.evaluation.util.create_dataframe(config_list: List[dict], key_mappings: List[Tuple[str, Callable]])

Creates a pandas DataFrame from a list of configuration dictionaries and key mappings.

Parameters:

config_list – List[dict] A list of dictionaries, where each dictionary represents a configuration. Each configuration must contain an exp_folder key and may include nested keys corresponding to hyperparameter names.
key_mappings – List[Tuple[str, Callable]] A list of tuples where: - The first element (str) is the hyperparameter name to extract from the configurations. - The second element (Callable) is a transformation function to apply to the extracted value.

Returns:

A DataFrame containing rows generated from config_list with columns for exp_folder and the specified key_mappings. If a mapping value is missing, the corresponding DataFrame cell will contain None.

Return type:

pandas.DataFrame

mlwiz.evaluation.util.create_latex_table_from_assessment_results(exp_metadata, metric_key='main_score', no_decimals=2, model_as_row=True, use_single_outer_fold=False) → str

Creates a LaTeX table from a list of experiment folders, each containing assessment results.

Parameters:

exp_metadata (list[tuple[str, str, str]]) – A list of (path to the experiment folder, model name, dataset name) tuples.
metric_key (str) – The key for the metric to extract. Default is ‘main_score’.
no_decimals (int) – The number of rounded decimal places to display in the LaTeX table.
model_as_row (bool) – If True, models are rows and datasets are columns. If False, the opposite.
use_single_outer_fold (bool) – If True, only the first outer fold is used. This is useful when there is a single outer fold: the std in the assessment file is then 0, so the std is recovered across the final runs of that fold instead.

mlwiz.evaluation.util.filter_experiments(config_list: List[dict], logic: bool = 'AND', parameters: dict = {})

Filters the list of configurations returned by the method retrieve_experiments according to a dictionary. The dictionary contains the keys and values of the configuration files you are looking for.

If you specify more than one key/value pair to look for, then the logic parameter specifies whether you want to filter using the AND/OR rule.

For a key, you can specify more than one possible value you are interested in by passing a list as the value, for instance {'device': 'cpu', 'lr': [0.1, 0.01]}

Parameters:

config_list – The list of configuration files
logic – if AND, a configuration is selected iff all conditions are satisfied. If OR, a config is selected when at least one of the criteria is met.
parameters – dictionary with parameters used to filter the configurations

Returns:

a list of filtered configurations like the one in input
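
The AND/OR rule can be sketched on flat dictionaries as follows. This is a simplified illustration: the function name filter_configs is hypothetical, only top-level keys are inspected, and returning everything when no parameters are given is a choice of this sketch. The library function may also look inside nested configuration entries.

```python
def filter_configs(config_list, logic="AND", parameters=None):
    """Keep the configurations matching the key/value criteria."""
    parameters = parameters or {}
    if logic not in ("AND", "OR"):
        raise ValueError("logic must be 'AND' or 'OR'")
    if not parameters:
        return list(config_list)

    def matches(config, key, wanted):
        # A list of wanted values means "any of these is acceptable".
        allowed = wanted if isinstance(wanted, list) else [wanted]
        return key in config and config[key] in allowed

    combine = all if logic == "AND" else any
    return [
        config
        for config in config_list
        if combine(matches(config, k, v) for k, v in parameters.items())
    ]


configs = [
    {"device": "cpu", "lr": 0.1},
    {"device": "cuda", "lr": 0.01},
    {"device": "cuda", "lr": 0.5},
]
criteria = {"device": "cpu", "lr": [0.1, 0.01]}
and_result = filter_configs(configs, "AND", criteria)
or_result = filter_configs(configs, "OR", criteria)
print(len(and_result), len(or_result))
```
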

mlwiz.evaluation.util.get_scores_from_assessment_results(exp_folder, metric_key='main_score') → dict

Extracts scores from the configuration dictionary.

Parameters:

exp_folder (str) – The path to the experiment folder.
metric_key (str) – The key for the metric to extract. Default is 'main_score'.

mlwiz.evaluation.util.get_scores_from_outer_results(exp_folder, outer_fold_id, metric_key='main_score') → dict

Extracts scores from the configuration dictionary.

Parameters:

exp_folder (str) – The path to the experiment folder.
outer_fold_id (int) – The ID of the outer fold, from 1 on.
metric_key (str) – The key for the metric to extract. Default is 'main_score'.

mlwiz.evaluation.util.instantiate_data_provider_from_config(config: dict, splits_filepath: str, n_outer_folds: int, n_inner_folds: int) → DataProvider

Instantiate a data provider from a configuration file.

Parameters:

config (dict) – the configuration file
splits_filepath (str) – the path to data splits file
n_outer_folds (int) – the number of outer folds
n_inner_folds (int) – the number of inner folds

Returns:

an instance of DataProvider, i.e., the data provider

mlwiz.evaluation.util.instantiate_dataset_from_config(config: dict) → DatasetInterface

Instantiate a dataset from a configuration file.

Parameters:: config (dict) – the configuration file
Returns:: an instance of DatasetInterface, i.e., the dataset

mlwiz.evaluation.util.instantiate_model_from_config(config: dict, dataset: DatasetInterface) → ModelInterface

Instantiate a model from a configuration file.

Parameters:

config (dict) – the configuration file
dataset (DatasetInterface) – the dataset used in the experiment

Returns:

an instance of ModelInterface, i.e., the model

mlwiz.evaluation.util.load_checkpoint(checkpoint_path: str, model: ModelInterface, device: torch.device)

Load a checkpoint from a checkpoint file into a model.

Parameters:

checkpoint_path (str) – the checkpoint file path
model (ModelInterface) – the model
device (torch.device) – the device, e.g., "cpu" or "cuda"

mlwiz.evaluation.util.loguniform(*args)

Performs a log-uniform random selection.

Parameters:: *args – a tuple of (log min, log max, [base]) to use. Base 10 is used if the third argument is not available.
Returns:: a randomly chosen value
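
One plausible reading of the (log min, log max, [base]) arguments is that the exponent is drawn uniformly and then the base is raised to it. The sketch below follows that reading; it is an assumption about the semantics, not a copy of the library code, and the name loguniform_sketch is hypothetical.

```python
import random


def loguniform_sketch(log_min, log_max, base=10):
    """Draw base ** u with u uniform in [log_min, log_max]."""
    exponent = random.uniform(log_min, log_max)
    return base ** exponent


random.seed(0)  # fixed seed only to make the demonstration repeatable
samples = [loguniform_sketch(-4, -1) for _ in range(1000)]
print(min(samples), max(samples))
```

Under this reading, loguniform_sketch(-4, -1) yields values between roughly 1e-4 and 1e-1, with each decade equally likely, which is the usual motivation for sampling learning rates on a log scale.
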

mlwiz.evaluation.util.normal(*args)

Sample a value from a univariate normal distribution.

Parameters:: *args – Arguments forwarded to random.normalvariate() ((mu, sigma)).
Returns:: Sampled value.
Return type:: float

mlwiz.evaluation.util.randint(*args)

Sample an integer uniformly at random from an interval.

Parameters:: *args – Arguments forwarded to random.randint() ((a, b)).
Returns:: Sampled integer in the closed interval [a, b].
Return type:: int

mlwiz.evaluation.util.retrieve_best_configuration(model_selection_folder) → dict

Once the experiments are done, retrieves the winning configuration from a specific model selection folder, and returns it as a dictionary.

Parameters:: model_selection_folder – path to the folder of a model selection, that is, your_results_path/…./MODEL_SELECTION/
Returns:: a dictionary with info about the best configuration

mlwiz.evaluation.util.retrieve_experiments(model_selection_folder, skip_results_not_found: bool = False) → List[dict]

Once the experiments are done, retrieves the config_results.json files of all configurations in a specific model selection folder, and returns them as a list of dictionaries

Parameters:

model_selection_folder – path to the folder of a model selection, that is, your_results_path/…./MODEL_SELECTION/
skip_results_not_found – whether to skip an experiment if a config_results.json file has not been produced yet. Useful when analyzing experiments while others still run.

Returns:

a list of dictionaries, one per configuration, each with an extra key “exp_folder” which identifies the config folder.

mlwiz.evaluation.util.statistical_significance(highlighted_exp_metadata: Tuple[str, str, str], other_exp_metadata: List[Tuple[str, str, str]], metric_key: str = 'main_score', set_key: str = 'test', confidence_level: float = 0.95) → pandas.DataFrame

Compares the statistical significance of a highlighted model against a list of other experiments using a Welch’s t-test.

Parameters:

highlighted_exp_metadata (tuple[str, str, str]) – (experiment_folder, model_name, dataset_name) for the reference model.
other_exp_metadata (list[tuple[str, str, str]]) – List of (experiment_folder, model_name, dataset_name) for the models to compare against the highlighted one.
metric_key (str) – The metric to compare. Default is “main_score”.
set_key (str) – Which dataset split to consider: “training”, “validation”, or “test”. Default is “test”.
confidence_level (float) – Confidence level for CI computation and significance test. Default is 0.95.

Returns:

Each row contains the mean/std/CI for the highlighted and compared models plus the p-value (two-sided) and a boolean flag indicating whether the difference is statistically significant at the provided confidence level.

Return type:

pandas.DataFrame

Notes

If multiple outer folds are present, their averaged scores are used as samples.
If only one outer fold exists, the scores of the final runs are used as samples.
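
For reference, the statistic behind Welch's t-test and its Welch-Satterthwaite degrees of freedom can be computed with the standard library alone. This sketch stops before the p-value, which requires the Student's t distribution (for example via scipy.stats.ttest_ind with equal_var=False). The function name welch_t and the sample scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    # Squared standard errors of the two sample means.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up per-fold scores for a highlighted model and a competitor.
highlighted = [0.90, 0.92, 0.91, 0.93, 0.89]
competitor = [0.85, 0.80, 0.88, 0.83, 0.84]
t_stat, dof = welch_t(highlighted, competitor)
print(t_stat, dof)
```

Unlike Student's t-test, this statistic does not assume the two groups share the same variance, which suits comparisons between models whose scores vary by different amounts across folds or runs.
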

mlwiz.evaluation.util.uniform(*args)

Sample a float uniformly at random from an interval.

Parameters:: *args – Arguments forwarded to random.uniform() (typically (a, b)).
Returns:: Sampled value in the interval.
Return type:: float