mlwiz.evaluation
evaluation.config
Configuration wrapper with attribute-style access.
The Config class exposes dictionary keys as attributes and supports pretty printing.
- class mlwiz.evaluation.config.Config(config_dict: dict)
Bases: object
Simple class to manage the configuration dictionary as a Python object with fields.
- Parameters:
config_dict (dict) – the configuration dictionary
- get(key: str, default: object | None = None) object
Returns the value associated with the key if it is present in the dictionary, otherwise the specified default value.
- Parameters:
key (str) – the key to look up in the dictionary
default (object) – the value to return when the key is missing
- Returns:
a value from the dictionary
- items() list
Return a view on (key, value) pairs.
- Returns:
View over (key, value) pairs.
- Return type:
dict_items
- keys() set
Return a view on the configuration keys.
- Returns:
View over keys in the dictionary.
- Return type:
dict_keys
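The attribute-style access described above can be illustrated with a minimal sketch. ConfigSketch is a hypothetical stand-in written for this example, not the Config implementation; it only assumes that the wrapper stores the dictionary and resolves unknown attributes against it.

```python
class ConfigSketch:
    """Hypothetical stand-in for Config: dictionary keys readable as attributes."""

    def __init__(self, config_dict: dict):
        # Keep a reference to the original dictionary.
        self._config_dict = config_dict

    def __getattr__(self, name: str):
        # Called only when normal attribute lookup fails.
        # Going through __dict__ avoids recursion if _config_dict is not set yet.
        try:
            return self.__dict__["_config_dict"][name]
        except KeyError:
            raise AttributeError(name) from None

    def get(self, key: str, default=None):
        # Same contract as dict.get: value if present, otherwise the default.
        return self._config_dict.get(key, default)

    def keys(self):
        return self._config_dict.keys()

    def items(self):
        return self._config_dict.items()


cfg = ConfigSketch({"lr": 0.01, "device": "cpu"})
learning_rate = cfg.lr                   # attribute-style access
batch_size = cfg.get("batch_size", 32)   # falls back to the default
```

Resolving attributes lazily in `__getattr__` keeps the dictionary as the single source of truth, so `keys()` and `items()` can simply return the dictionary views.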
evaluation.evaluator
Ray-based experiment evaluation and result aggregation.
Implements the RiskAssesser workflow and helpers for progress updates and notifications.
- class mlwiz.evaluation.evaluator.RiskAssesser(outer_folds: int, inner_folds: int, experiment_class: Callable[[...], Experiment], exp_path: str, splits_filepath: str, model_configs: Grid | RandomSearch, risk_assessment_training_runs: int, model_selection_training_runs: int, higher_is_better: bool, gpus_per_task: float, base_seed: int = 42, training_timeout_seconds: int = -1)
Bases: object
Class implementing a K-Fold technique to do Risk Assessment (an estimate of the true generalization performance) and K-Fold Model Selection (selecting the best hyper-parameters for each external fold).
- Parameters:
outer_folds (int) – The number K of outer TEST folds. You should have generated the splits accordingly
inner_folds (int) – The number K of inner VALIDATION folds. You should have generated the splits accordingly
experiment_class (Callable[…, Experiment]) – the experiment class to be instantiated
exp_path (str) – The folder in which to store all results
splits_filepath (str) – The splits filepath with additional meta information
model_configs (Union[Grid, RandomSearch]) – an object storing all possible model configurations, e.g., config.base.Grid
risk_assessment_training_runs (int) – number of final training runs to mitigate bad initializations
model_selection_training_runs (int) – number of training runs to mitigate bad initializations at model selection time
higher_is_better (bool) – whether the best model for each external fold should be selected by higher or lower score values
gpus_per_task (float) – Number of GPUs to assign to each experiment. Can be < 1.
base_seed (int) – Seed used to generate the experiment seeds. Used to replicate results. Default is 42
training_timeout_seconds (int) – optional timeout limit per experiment in seconds
- _create_dataset_getter(outer_k: int, inner_k: int | None) DataProvider
Instantiates and configures a dataset provider for the requested folds.
- _request_termination()
Signals all workers and the UI to terminate gracefully.
- compute_best_hyperparameters(folder: str, outer_k: int, no_configurations: int, skip_config_ids: List[int])
Chooses the best hyper-parameters configuration using the proper validation mean score.
- Parameters:
folder (str) – the model selection folder associated with outer fold k
outer_k (int) – the current outer fold to consider. Used for telegram updates
no_configurations (int) – number of possible configurations
skip_config_ids – list of configuration ids to skip
- compute_final_runs_score_per_fold(outer_k: int)
Computes the average scores for the final runs of a specific outer fold
- Parameters:
outer_k (int) – id of the outer fold from 0 to K-1
- compute_risk_assessment_result()
Aggregates the outer fold results and computes the training and test mean/std.
- model_selection(kfold_folder: str, outer_k: int, debug: bool, execute_config_id: int | None, skip_config_ids: List[int])
Performs model selection.
- Parameters:
kfold_folder – The root folder for model selection
outer_k – the current outer fold to consider
debug – if True, sequential execution is performed and logs are printed to screen
execute_config_id – if debug mode is enabled, it will prioritize the execution of this configuration. It assumes indices start from 1. Use this to debug specific configurations.
skip_config_ids – if provided, the listed configurations will not be considered for model selection. Use it, for instance, when a run is taking too long to execute and you decide it is not worth waiting for.
- process_config_results_across_inner_folds(config_folder: str, config: Config)
Averages the results for each configuration across inner folds and stores it into a file.
- Parameters:
config_folder (str)
config (Config) – the configuration object
- process_model_selection_runs(inner_fold_config_folder: str, inner_k: int)
Computes the average performance across the training runs of a specific configuration on a specific inner fold split.
- Parameters:
inner_fold_config_folder (str) – an inner fold experiment folder of a specific configuration
inner_k (int) – the inner fold id
- risk_assessment(debug: bool, execute_config_id: int | None = None, skip_config_ids: List[int] | None = None)
Performs risk assessment to evaluate the performances of a model.
- Parameters:
debug – if True, sequential execution is performed and logs are printed to screen
execute_config_id – if debug mode is enabled, it will prioritize the execution of this configuration for each model selection procedure. It assumes indices start from 1. Use this to debug specific configurations.
skip_config_ids – if provided, the listed configurations will not be considered for model selection. Use it, for instance, when a run is taking too long to execute and you decide it is not worth waiting for.
- run_final_model(outer_k: int, debug: bool)
Performs the final runs once the best model for outer fold outer_k has been chosen.
- Parameters:
outer_k (int) – the current outer fold to consider
debug (bool) – if True, sequential execution is performed and logs are printed to screen
- wait_configs(skip_config_ids: List[int]) bool
Waits for configurations to terminate and updates the state of the progress manager
- Returns:
True if all runs completed successfully, False otherwise.
- Return type:
bool
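The selection rule described for compute_best_hyperparameters (best mean validation score, direction given by higher_is_better, some configuration ids skipped) can be sketched as follows. The function pick_best_config and the in-memory score dictionary are hypothetical; the class itself reads results from the model selection folder, and that file handling is not reproduced here.

```python
from statistics import mean
from typing import Dict, List


def pick_best_config(
    val_scores: Dict[int, List[float]],
    higher_is_better: bool,
    skip_config_ids: List[int],
) -> int:
    """Return the id of the configuration with the best mean validation score."""
    # Average the per-inner-fold validation scores of each non-skipped configuration.
    candidates = {
        config_id: mean(scores)
        for config_id, scores in val_scores.items()
        if config_id not in skip_config_ids
    }
    if not candidates:
        raise ValueError("no configurations left after skipping")
    # The direction of the comparison depends on the metric.
    chooser = max if higher_is_better else min
    return chooser(candidates, key=candidates.get)


# Configuration ids start from 1; each list holds one score per inner fold.
scores = {1: [0.80, 0.82], 2: [0.90, 0.88], 3: [0.95, 0.97]}
best_accuracy_like = pick_best_config(scores, higher_is_better=True, skip_config_ids=[3])
best_loss_like = pick_best_config(scores, higher_is_better=False, skip_config_ids=[])
```

In the first call configuration 3 would win on its mean score, but it is skipped, so the choice falls on the next best candidate.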
- mlwiz.evaluation.evaluator._get_ray_num_gpus_per_task(default: float = 0.0) float
Return the Ray GPU request per task from the environment.
This exists primarily to keep module import side-effect free (e.g. during Sphinx autodoc) when the variable is unset or malformed.
- mlwiz.evaluation.evaluator._make_termination_checker(progress_actor, min_interval: float = 0.2) Callable[[], bool]
Creates a closure that checks for termination requests without hammering the actor.
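A throttled closure of this kind can be sketched as below. The names make_throttled_checker and query are hypothetical: query stands for whatever call asks the progress actor whether termination was requested. Caching a positive answer permanently is an assumption of this sketch, not a documented behavior.

```python
import time
from typing import Callable


def make_throttled_checker(
    query: Callable[[], bool], min_interval: float = 0.2
) -> Callable[[], bool]:
    """Wrap `query` so it is invoked at most once every `min_interval` seconds."""
    last_check = float("-inf")  # forces a real query on the first call
    cached = False

    def should_terminate() -> bool:
        nonlocal last_check, cached
        if cached:
            # Assumption: once termination is requested, it stays requested.
            return True
        now = time.monotonic()
        if now - last_check >= min_interval:
            cached = bool(query())
            last_check = now
        return cached

    return should_terminate


calls = []


def fake_query() -> bool:
    # Stand-in for a remote call; records how often it is actually invoked.
    calls.append(1)
    return False


checker = make_throttled_checker(fake_query, min_interval=60.0)
results = [checker() for _ in range(5)]
```

With a 60-second interval, the five consecutive checks reach fake_query only once; the other four reuse the cached answer.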
- mlwiz.evaluation.evaluator._mean_std_ci(values: numpy.ndarray) Tuple[float, float, float]
Computes mean, std, and 95% confidence interval for the provided values.
- mlwiz.evaluation.evaluator._push_progress_update(progress_actor, payload: dict)
Safely forwards progress updates to the shared actor.
- mlwiz.evaluation.evaluator._set_cuda_memory_limit_from_env()
Best-effort limit of per-process GPU memory based on the configured Ray GPU fraction. No-op if CUDA is unavailable or the value is invalid.
- mlwiz.evaluation.evaluator.extract_and_sum_elapsed_seconds(file_path)
Sum per-run elapsed time entries from an experiment log file.
The evaluator writes elapsed-time markers to the experiment log in the form:
Total time of the experiment in seconds: <SECONDS>
This helper scans the file for all such entries and returns their sum.
- Parameters:
file_path (str | os.PathLike) – Path to the experiment log file.
- Returns:
Sum of all matched elapsed seconds.
- Return type:
float
- Side effects:
Reads the file from disk.
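A minimal sketch of this scan, assuming the marker text quoted above and a plain decimal number after it. The function name sum_elapsed_seconds and the regular expression are illustrative and may differ from the library's own pattern.

```python
import os
import re
import tempfile
from pathlib import Path

# Pattern for the marker quoted above; the accepted number format is an assumption.
ELAPSED_RE = re.compile(
    r"Total time of the experiment in seconds:\s*([0-9]+(?:\.[0-9]+)?)"
)


def sum_elapsed_seconds(file_path) -> float:
    """Sum every elapsed-time marker found in the log file."""
    text = Path(file_path).read_text()
    return float(sum(float(match) for match in ELAPSED_RE.findall(text)))


# Usage on a throwaway log with two markers (e.g. a run that was resumed once).
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "experiment.log")
    Path(log_path).write_text(
        "epoch 1 ...\n"
        "Total time of the experiment in seconds: 12.5\n"
        "epoch 2 ...\n"
        "Total time of the experiment in seconds: 30\n"
    )
    total = sum_elapsed_seconds(log_path)
    Path(os.path.join(tmp, "empty.log")).write_text("no markers here\n")
    empty_total = sum_elapsed_seconds(os.path.join(tmp, "empty.log"))
```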
- mlwiz.evaluation.evaluator.run_test(experiment_class: Callable[[...], Experiment], dataset_getter: Callable[[...], DataProvider], best_config: dict, outer_k: int, run_id: int, final_run_exp_path: str, final_run_torch_path: str, exp_seed: int, training_timeout_seconds: int, logger: Logger, progress_actor=None) Tuple[int, int, float]
Ray job that performs a risk assessment run and returns bookkeeping information for the progress manager.
- Parameters:
experiment_class (Callable[…, Experiment]) – the class of the experiment to instantiate
dataset_getter (Callable[…, DataProvider]) – the class of the data provider to instantiate
best_config (dict) – the best configuration to use for this specific outer fold
run_id (int) – the id of the final run (for bookkeeping reasons)
final_run_exp_path (str) – path of the experiment root folder
final_run_torch_path (str) – path where to store the results of the experiment
exp_seed (int) – seed of the experiment
training_timeout_seconds (int) – timeout for the experiment in seconds
logger (Logger) – a logger to log information in the appropriate file
- Returns:
a tuple with outer fold id, final run id, and time elapsed
- mlwiz.evaluation.evaluator.run_valid(experiment_class: Callable[[...], Experiment], dataset_getter: Callable[[...], DataProvider], config: dict, config_id: int, run_id: int, fold_run_exp_folder: str, fold_run_results_torch_path: str, exp_seed: int, training_timeout_seconds: int, logger: Logger, progress_actor=None) Tuple[int, int, int, int, float]
Ray job that performs a model selection run and returns bookkeeping information for the progress manager.
- Parameters:
experiment_class (Callable[…, Experiment]) – the class of the experiment to instantiate
dataset_getter (Callable[…, DataProvider]) – the class of the data provider to instantiate
config (dict) – the configuration of this specific experiment
config_id (int) – the id of the configuration (for bookkeeping reasons)
run_id (int) – the id of the training run (for bookkeeping reasons)
fold_run_exp_folder (str) – path of the experiment root folder
fold_run_results_torch_path (str) – path where to store the results of the experiment
exp_seed (int) – seed of the experiment
training_timeout_seconds (int) – timeout for the experiment in seconds
logger (Logger) – a logger to log information in the appropriate file
- Returns:
a tuple with outer fold id, inner fold id, config id, run id, and time elapsed
- mlwiz.evaluation.evaluator.send_telegram_update(bot_token: str, bot_chat_ID: str, bot_message: str)
Sends a message using Telegram APIs. Markdown can be used.
- Parameters:
bot_token (str) – token of the user’s bot
bot_chat_ID (str) – identifier of the chat where to write the message
bot_message (str) – the message to be sent
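For orientation, a message of this kind is typically sent through the Telegram Bot API sendMessage method with parse_mode set to Markdown. The sketch below builds such a request with the standard library; build_telegram_url and send_update are hypothetical names, and the HTTP client actually used by the library is not specified here.

```python
import urllib.parse
import urllib.request


def build_telegram_url(bot_token: str, bot_chat_id: str, bot_message: str) -> str:
    """Build a Bot API sendMessage URL with Markdown parsing enabled."""
    query = urllib.parse.urlencode(
        {"chat_id": bot_chat_id, "parse_mode": "Markdown", "text": bot_message}
    )
    return f"https://api.telegram.org/bot{bot_token}/sendMessage?{query}"


def send_update(bot_token: str, bot_chat_id: str, bot_message: str) -> bytes:
    """Perform the request; needs network access and a valid bot token."""
    url = build_telegram_url(bot_token, bot_chat_id, bot_message)
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


# Only the URL is built here; "TOKEN" and "12345" are placeholders.
url = build_telegram_url("TOKEN", "12345", "Outer fold 1 *completed*")
```

Encoding the message with urlencode matters because Markdown characters such as `*` and spaces are not safe in a query string.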
evaluation.grid
Grid-search configuration expansion.
The Grid class enumerates all combinations from a YAML-defined hyperparameter space.
- class mlwiz.evaluation.grid.Grid(configs_dict: dict)
Bases: object
Class that implements grid-search. It computes all possible configurations starting from a suitable config file.
- Parameters:
configs_dict (dict) – the configuration dictionary specifying the different configurations to try
- _gen_configs() List[dict]
Takes a dictionary of key:list pairs and computes all possible combinations.
- Returns:
A list of all possible configurations in the form of dictionaries
- _gen_helper(cfgs_dict: dict) dict
Helper generator that yields one possible configuration at a time.
- _list_helper(values: object) object
Recursively parses lists of possible options for a given hyper-parameter.
- property exp_name: str
Computes the name of the root folder
- Returns:
the name of the root folder, in the form EXP-NAME_DATASET-NAME
- property num_configs: int
Computes the number of configurations to try during model selection
- Returns:
the number of configurations
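The expansion of key:list pairs into all combinations can be sketched with itertools.product. The function expand_grid is hypothetical and handles only a flat dictionary; the recursive parsing that _list_helper performs on nested options is not reproduced.

```python
from itertools import product
from typing import Dict, Iterator


def expand_grid(space: Dict[str, list]) -> Iterator[dict]:
    """Yield one dictionary per combination of the listed values."""
    keys = list(space)
    # product iterates the Cartesian product of the value lists, last key fastest.
    for combo in product(*(space[key] for key in keys)):
        yield dict(zip(keys, combo))


space = {"lr": [0.1, 0.01], "hidden_units": [32, 64], "device": ["cpu"]}
configs = list(expand_grid(space))
num_configs = len(configs)  # 2 * 2 * 1 combinations
```

Yielding configurations one at a time mirrors the generator style of _gen_helper and avoids materializing a large grid unless the caller asks for a list.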
evaluation.random_search
Random-search configuration sampling.
The RandomSearch class samples configurations from a YAML-defined search space.
- class mlwiz.evaluation.random_search.RandomSearch(configs_dict: dict)
Bases: Grid
Class that implements random-search. It computes all possible configurations starting from a suitable config file.
- Parameters:
configs_dict (dict) – the configuration dictionary specifying the different configurations to try
- _dict_helper(configs: dict)
Recursively parses a dictionary
- Returns:
A dictionary
- _gen_helper(cfgs_dict: dict) Iterator[Dict[str, Any]]
Takes a dictionary of key:list pairs and computes all possible combinations.
- Returns:
A list of all possible configurations in the form of dictionaries
- _sampler_helper(configs: dict)
Samples one or more hyperparameters and returns the sampled value (or, when several are sampled, a dict of values).
- Returns:
A dictionary
evaluation.util
Utilities for sampling, instantiation, and results analysis.
Includes random-search samplers and helpers to load runs, instantiate datasets/models, and inspect artifacts.
- mlwiz.evaluation.util._collect_metric_samples(exp_folder: str, metric_key: str, set_key: str) Tuple[numpy.ndarray, str]
Collect metric samples for an experiment and split.
If only one outer fold is present, this falls back to using the final-run results as samples; otherwise it uses one sample per outer fold (the fold mean).
- Parameters:
exp_folder (str) – Path to the experiment folder.
metric_key (str) – Metric name to extract.
set_key (str) – Split name (case-insensitive): 'training', 'validation', or 'test'.
- Returns:
(samples, source) where source is either 'final_runs' or 'outer_fold_means'.
- Return type:
Tuple[numpy.ndarray, str]
- Raises:
ValueError – If set_key is invalid or no folds are found.
- mlwiz.evaluation.util._df_to_latex_table(df, no_decimals=2, model_as_row=True)
Convert an assessment-results DataFrame to a formatted LaTeX table.
- Parameters:
df (pandas.DataFrame) – DataFrame where each row describes a model/dataset pair and includes score columns such as test and test_std.
no_decimals (int) – Number of decimal places to display.
model_as_row (bool) – If True, models are rows and datasets are columns; otherwise datasets are rows and models are columns.
- Returns:
LaTeX table string produced by pandas.DataFrame.to_latex().
- Return type:
str
- mlwiz.evaluation.util._list_outer_fold_ids(exp_folder: str) List[int]
List outer fold identifiers available in an experiment’s assessment folder.
- Parameters:
exp_folder (str) – Path to the experiment folder.
- Returns:
Sorted list of outer fold ids found under <exp_folder>/MODEL_ASSESSMENT.
- Return type:
list[int]
- Raises:
FileNotFoundError – If the assessment folder is missing.
- mlwiz.evaluation.util._load_final_run_metric_samples(exp_folder: str, outer_fold_id: int, set_key: str, metric_key: str) numpy.ndarray
Load per-run metric samples from cached final-run results.
This scans run_{i}_results.dill files under <exp_folder>/MODEL_ASSESSMENT/OUTER_FOLD_<id>/final_run<i>/ and extracts metric_key from the selected split.
- Parameters:
exp_folder (str) – Path to the experiment folder.
outer_fold_id (int) – Outer fold id.
set_key (str) – Split name (case-insensitive): 'training', 'validation', or 'test'.
metric_key (str) – Metric name to extract.
- Returns:
1D array of metric samples (dtype=float).
- Return type:
numpy.ndarray
- Raises:
KeyError – If metric_key is missing from a run result.
ValueError – If no run results are found.
- mlwiz.evaluation.util._summarize_samples(samples: numpy.ndarray, confidence_level: float)
Compute mean/std and a normal-approximation confidence interval half-width.
- Parameters:
samples (numpy.ndarray) – 1D array of metric samples (must be non-empty).
confidence_level (float) – Confidence level in the (0, 1) range (e.g., 0.95).
- Returns:
(mean, std, ci_half_width).
- Return type:
tuple[float, float, float]
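A normal-approximation half-width is commonly computed as z * std / sqrt(n), where z is the standard normal quantile at 0.5 + confidence_level / 2. The sketch below follows that formula with the standard library; the function name summarize and the use of the sample standard deviation (n - 1 in the denominator) are assumptions, since the entry above does not state which estimator is used.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def summarize(
    samples: Sequence[float], confidence_level: float
) -> Tuple[float, float, float]:
    """Return (mean, std, ci_half_width) under a normal approximation."""
    if len(samples) == 0:
        raise ValueError("samples must be non-empty")
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must lie in (0, 1)")
    n = len(samples)
    sample_mean = mean(samples)
    # Assumption: sample standard deviation; defined as 0.0 for a single sample.
    sample_std = stdev(samples) if n > 1 else 0.0
    # Two-sided interval: put (1 - confidence_level) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
    return sample_mean, sample_std, z * sample_std / sqrt(n)


m, s, half_width = summarize([0.8, 0.9, 1.0], confidence_level=0.95)
interval = (m - half_width, m + half_width)
```

With very few samples a Student t quantile would give a wider interval than the normal quantile used here.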
- mlwiz.evaluation.util.choice(*args)
Sample one value uniformly at random from the provided arguments.
- Parameters:
*args – Candidate values to sample from.
- Returns:
One of the provided values.
- Return type:
object
- mlwiz.evaluation.util.create_dataframe(config_list: List[dict], key_mappings: List[Tuple[str, Callable]])
Creates a pandas DataFrame from a list of configuration dictionaries and key mappings.
- Parameters:
config_list (List[dict]) – A list of dictionaries, where each dictionary represents a configuration. Each configuration must contain an exp_folder key and may include nested keys corresponding to hyperparameter names.
key_mappings (List[Tuple[str, Callable]]) – A list of tuples where the first element (str) is the hyperparameter name to extract from the configurations and the second element (Callable) is a transformation function to apply to the extracted value.
- Returns:
A DataFrame containing rows generated from config_list with columns for exp_folder and the specified key_mappings. If a mapping value is missing, the corresponding DataFrame cell will contain None.
- Return type:
pandas.DataFrame
- mlwiz.evaluation.util.create_latex_table_from_assessment_results(exp_metadata, metric_key='main_score', no_decimals=2, model_as_row=True, use_single_outer_fold=False) str
Creates a LaTeX table from a list of experiment folders, each containing assessment results.
- Parameters:
exp_metadata (list[tuple[str, str, str]]) – A list of (path to the experiment folder, model name, dataset name) tuples.
metric_key (str) – The key for the metric to extract. Default is ‘main_score’.
no_decimals (int) – The number of rounded decimal places to display in the LaTeX table.
model_as_row (bool) – If True, models are rows and datasets are columns. If False, the opposite.
use_single_outer_fold (bool) – If True, only the first outer fold is used. This is useful when there is a single outer fold: the std in the assessment file is then 0, so the std is recovered across the final runs of that fold instead.
- mlwiz.evaluation.util.filter_experiments(config_list: List[dict], logic: bool = 'AND', parameters: dict = {})
Filters the list of configurations returned by the method retrieve_experiments according to a dictionary. The dictionary contains the keys and values of the configuration files you are looking for.
If you specify more than one key/value pair to look for, then the logic parameter specifies whether you want to filter using the AND/OR rule.
For a key, you can specify more than one possible value you are interested in by passing a list as the value, for instance {‘device’: ‘cpu’, ‘lr’: [0.1, 0.01]}
- Parameters:
config_list – The list of configuration files
logic – if AND, a configuration is selected iff all conditions are satisfied. If OR, a config is selected when at least one of the criteria is met.
parameters – dictionary with parameters used to filter the configurations
- Returns:
a list of filtered configurations like the one in input
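The AND/OR rule can be sketched as follows. The function filter_configs is hypothetical and matches only top-level keys; whether the library also searches nested keys of a configuration is not stated above, so that is left out. Returning every configuration when no parameters are given is likewise a choice made for this sketch.

```python
from typing import Any, Dict, List, Optional


def filter_configs(
    config_list: List[Dict[str, Any]],
    logic: str = "AND",
    parameters: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Keep configurations whose values satisfy all (AND) or any (OR) criteria."""
    if logic not in ("AND", "OR"):
        raise ValueError("logic must be 'AND' or 'OR'")
    if not parameters:
        return list(config_list)  # nothing to filter on
    combine = all if logic == "AND" else any

    def matches(config: Dict[str, Any], key: str, wanted: Any) -> bool:
        # A list means "any of these values is acceptable" for this key.
        allowed = wanted if isinstance(wanted, list) else [wanted]
        return key in config and config[key] in allowed

    return [
        config
        for config in config_list
        if combine(matches(config, key, wanted) for key, wanted in parameters.items())
    ]


configs = [
    {"device": "cpu", "lr": 0.1},
    {"device": "cuda", "lr": 0.01},
    {"device": "cuda", "lr": 0.5},
]
criteria = {"device": "cpu", "lr": [0.1, 0.01]}
both = filter_configs(configs, "AND", criteria)
either = filter_configs(configs, "OR", criteria)
```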
- mlwiz.evaluation.util.get_scores_from_assessment_results(exp_folder, metric_key='main_score') dict
Extracts scores from the configuration dictionary.
- Parameters:
exp_folder (str) – The path to the experiment folder.
metric_key (str) – The key for the metric to extract. Default is ‘main_score’.
- mlwiz.evaluation.util.get_scores_from_outer_results(exp_folder, outer_fold_id, metric_key='main_score') dict
Extracts scores from the configuration dictionary.
- Parameters:
exp_folder (str) – The path to the experiment folder.
outer_fold_id (int) – The ID of the outer fold, from 1 on.
metric_key (str) – The key for the metric to extract. Default is ‘main_score’.
- mlwiz.evaluation.util.instantiate_data_provider_from_config(config: dict, splits_filepath: str, n_outer_folds: int, n_inner_folds: int) DataProvider
Instantiate a data provider from a configuration file.
- Parameters:
config (dict) – the configuration file
splits_filepath (str) – the path to the data splits file
n_outer_folds (int) – the number of outer folds
n_inner_folds (int) – the number of inner folds
- Returns:
an instance of DataProvider, i.e., the data provider
- mlwiz.evaluation.util.instantiate_dataset_from_config(config: dict) DatasetInterface
Instantiate a dataset from a configuration file.
- Parameters:
config (dict) – the configuration file
- Returns:
an instance of DatasetInterface, i.e., the dataset
- mlwiz.evaluation.util.instantiate_model_from_config(config: dict, dataset: DatasetInterface) ModelInterface
Instantiate a model from a configuration file.
- Parameters:
config (dict) – the configuration file
dataset (DatasetInterface) – the dataset used in the experiment
- Returns:
an instance of ModelInterface, i.e., the model
- mlwiz.evaluation.util.load_checkpoint(checkpoint_path: str, model: ModelInterface, device: torch.device)
Load a checkpoint from a checkpoint file into a model.
- Parameters:
checkpoint_path (str) – the checkpoint file path
model (ModelInterface) – the model
device (torch.device) – the device, e.g., “cpu” or “cuda”
- mlwiz.evaluation.util.loguniform(*args)
Performs a log-uniform random selection.
- Parameters:
*args – a tuple of (log min, log max, [base]) to use. Base 10 is used if the third argument is not available.
- Returns:
a randomly chosen value
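One common reading of a log-uniform selection with (log min, log max, [base]) arguments is to draw the exponent uniformly and raise the base to it. The sketch below follows that reading; loguniform_sketch is a hypothetical name, and interpreting the first two arguments as exponents is an assumption based on the parameter description.

```python
import random


def loguniform_sketch(log_min: float, log_max: float, base: float = 10.0) -> float:
    """Draw the exponent uniformly in [log_min, log_max], then return base ** exponent."""
    exponent = random.uniform(log_min, log_max)
    return base ** exponent


random.seed(0)  # fixed seed so the example is repeatable
# Learning-rate style range: values between 10**-4 and 10**-1.
samples = [loguniform_sketch(-4, -1) for _ in range(1000)]
# Same idea with base 2: values between 2**3 and 2**6.
sizes = [loguniform_sketch(3, 6, 2) for _ in range(1000)]
```

Sampling the exponent rather than the value spreads the draws evenly across orders of magnitude, which is usually the intent for quantities such as learning rates.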
- mlwiz.evaluation.util.normal(*args)
Sample a value from a univariate normal distribution.
- Parameters:
*args – Arguments forwarded to random.normalvariate() ((mu, sigma)).
- Returns:
Sampled value.
- Return type:
float
- mlwiz.evaluation.util.randint(*args)
Sample an integer uniformly at random from an interval.
- Parameters:
*args – Arguments forwarded to random.randint() ((a, b)).
- Returns:
Sampled integer in the closed interval [a, b].
- Return type:
int
- mlwiz.evaluation.util.retrieve_best_configuration(model_selection_folder) dict
Once the experiments are done, retrieves the winning configuration from a specific model selection folder and returns it as a dictionary
- Parameters:
model_selection_folder – path to the folder of a model selection, that is, your_results_path/…./MODEL_SELECTION/
- Returns:
a dictionary with info about the best configuration
- mlwiz.evaluation.util.retrieve_experiments(model_selection_folder, skip_results_not_found: bool = False) List[dict]
Once the experiments are done, retrieves the config_results.json files of all configurations in a specific model selection folder, and returns them as a list of dictionaries
- Parameters:
model_selection_folder – path to the folder of a model selection, that is, your_results_path/…./MODEL_SELECTION/
skip_results_not_found – whether to skip an experiment if a config_results.json file has not been produced yet. Useful when analyzing experiments while others still run.
- Returns:
a list of dictionaries, one per configuration, each with an extra key “exp_folder” which identifies the config folder.
- mlwiz.evaluation.util.statistical_significance(highlighted_exp_metadata: Tuple[str, str, str], other_exp_metadata: List[Tuple[str, str, str]], metric_key: str = 'main_score', set_key: str = 'test', confidence_level: float = 0.95) pandas.DataFrame
Tests whether the difference between a highlighted model and each of a list of other experiments is statistically significant, using Welch’s t-test.
- Parameters:
highlighted_exp_metadata (tuple[str, str, str]) – (experiment_folder, model_name, dataset_name) for the reference model.
other_exp_metadata (list[tuple[str, str, str]]) – List of (experiment_folder, model_name, dataset_name) for the models to compare against the highlighted one.
metric_key (str) – The metric to compare. Default is “main_score”.
set_key (str) – Which dataset split to consider: “training”, “validation”, or “test”. Default is “test”.
confidence_level (float) – Confidence level for CI computation and significance test. Default is 0.95.
- Returns:
Each row contains the mean/std/CI for the highlighted and compared models plus the p-value (two-sided) and a boolean flag indicating whether the difference is statistically significant at the provided confidence level.
- Return type:
pandas.DataFrame
Notes
If multiple outer folds are present, their averaged scores are used as samples.
If only one outer fold exists, the scores of the final runs are used as samples.
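For reference, Welch’s t-test compares two sample means without assuming equal variances. The sketch below computes only the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library; welch_t is a hypothetical helper, and the sample values are made up for illustration. Turning these into a two-sided p-value requires the Student t distribution, for instance via scipy.stats.ttest_ind with equal_var=False.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two samples")
    # Squared standard errors of the two means (sample variance / n).
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both groups have zero variance")
    t_stat = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Illustrative samples: e.g. one score per outer fold for two models.
highlighted = [0.90, 0.92, 0.94]
other = [0.80, 0.85, 0.90]
t_stat, dof = welch_t(highlighted, other)
```

The degrees of freedom always fall between the smaller of the two (n - 1) values and n_a + n_b - 2, which is why few samples per model make the test conservative.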
- mlwiz.evaluation.util.uniform(*args)
Sample a float uniformly at random from an interval.
- Parameters:
*args – Arguments forwarded to random.uniform() (typically (a, b)).
- Returns:
Sampled value in the interval.
- Return type:
float