mlwiz.experiment

experiment.experiment

Experiment wrapper for building and running training jobs.

Defines Experiment, which instantiates models/engines from configs and runs validation/test loops.

class mlwiz.experiment.experiment.Experiment(model_configuration: dict, exp_path: str, exp_seed: int)

Bases: object

Class that handles a single standard experiment.

Parameters:
  • model_configuration (dict) – the dictionary holding the experiment-specific configuration

  • exp_path (str) – path to the experiment folder

  • exp_seed (int) – the seed to use for the experiment

_cleanup_ddp()

Tear down process group.

_return_class_and_args(key: str) → Tuple[Callable[[...], object], dict]

Returns the class and arguments associated with a specific key in the configuration file.

Parameters:
  • config – the configuration dictionary

  • key – a string representing a particular class in the configuration dictionary

Returns:

a tuple (class, dict of arguments), or (None, None) if the key is not present in the config dictionary
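As a rough illustration of the lookup described above, the sketch below resolves a configuration entry into a class and its arguments. The entry layout (either a dotted-path string, or a dict with class_name and args keys) and the helper names load_class and return_class_and_args are assumptions made for this example, not the library's confirmed implementation.

```python
import importlib
from typing import Callable, Optional, Tuple


def load_class(dotted_path: str) -> Callable[..., object]:
    """Import 'package.module.Name' and return the object called Name."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def return_class_and_args(
    config: dict, key: str
) -> Tuple[Optional[Callable[..., object]], Optional[dict]]:
    """Return (class, args) for `key`, or (None, None) if the key is absent."""
    if key not in config:
        return None, None
    entry = config[key]
    # Assumed layout 1: the entry is just a dotted path, with no arguments.
    if isinstance(entry, str):
        return load_class(entry), {}
    # Assumed layout 2: {"class_name": "<dotted path>", "args": {...}}.
    return load_class(entry["class_name"]), dict(entry.get("args") or {})


# Hypothetical configuration that uses standard-library classes as stand-ins.
example_config = {
    "counter": {"class_name": "collections.Counter", "args": {"a": 2}},
    "mapping": "collections.OrderedDict",
}
cls, args = return_class_and_args(example_config, "counter")
counter = cls(**args)  # instantiate the resolved class with its arguments
```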

_run_ddp(mode: str, dataset_getter, training_timeout_seconds, logger, progress_callback: Callable[[dict], None] | None = None, should_terminate: Callable[[], bool] | None = None)

Spawn one local process per visible GPU and return rank-0 result.

_run_test_impl(dataset_getter, training_timeout_seconds, logger, progress_callback: Callable[[dict], None] | None = None, should_terminate: Callable[[], bool] | None = None, ddp_rank: int | None = None, ddp_world_size: int = 1)

Internal final (test) run used by both single-process and DDP paths.

_run_valid_impl(dataset_getter, training_timeout_seconds, logger, progress_callback: Callable[[dict], None] | None = None, should_terminate: Callable[[], bool] | None = None, ddp_rank: int | None = None, ddp_world_size: int = 1)

Internal validation run used by both single-process and DDP paths.

_set_worker_device(ddp_rank: int | None)

Set per-rank device in config.

_setup_ddp(rank: int, world_size: int, master_port: int)

Initialize the process group for this rank.

_should_use_ddp() → bool

Enable DDP when multiple CUDA devices are visible in this process.
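The rule above can be sketched as a simple device-count check. The function name should_use_ddp and its visible_devices parameter are hypothetical and exist only to make the sketch testable without a GPU; the real method may apply further conditions.

```python
from typing import Optional


def should_use_ddp(visible_devices: Optional[int] = None) -> bool:
    """Return True when more than one CUDA device is visible."""
    if visible_devices is None:
        try:
            import torch  # imported lazily so the sketch runs without PyTorch
        except ImportError:
            return False
        # device_count() respects CUDA_VISIBLE_DEVICES for this process.
        visible_devices = (
            torch.cuda.device_count() if torch.cuda.is_available() else 0
        )
    return visible_devices > 1


two_gpus = should_use_ddp(2)
one_gpu = should_use_ddp(1)
```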

_wrap_ddp_model(model, ddp_rank: int | None)

Wrap model in DDP (single-device and model-parallel cases).

create_engine(config: Config, model: ModelInterface) → TrainingEngine

Utility that instantiates the training engine. It looks for the following pre-defined fields in the configuration file: loss, scorer, optimizer, scheduler, gradient_clipper, early_stopper and plotter. Each of them should be a class implementing the EventHandler interface.

Parameters:
  • config (Config) – the configuration dictionary

  • model (ModelInterface) – the model to be trained

Returns:

a TrainingEngine object

create_model(dim_input_features: int | Tuple[int], dim_target: int, config: Config) → ModelInterface

Instantiates a model that implements the ModelInterface interface.

Parameters:
  • dim_input_features (Union[int, Tuple[int]]) – number of input features

  • dim_target (int) – target dimension

  • config (Config) – the configuration dictionary

Returns:

a model that implements the ModelInterface interface

run_test(dataset_getter, training_timeout_seconds, logger, progress_callback: Callable[[dict], None] | None = None, should_terminate: Callable[[], bool] | None = None)

This function returns the training, validation and test results for a final run. Do not use the test set to train the model or for early stopping! If possible, rely on already available subclasses of this class.

It implements a simple training scheme.

Parameters:
  • dataset_getter (DataProvider) – a data provider

  • training_timeout_seconds (int) – timeout for the experiment in seconds

  • logger (Logger) – the logger

Returns:

a tuple of training, validation and test dictionaries. Each dictionary has two keys:

  • LOSS (as defined in mlwiz.static)

  • SCORE (as defined in mlwiz.static)

For instance, training_results[SCORE] is a dictionary itself with other fields to be used by the evaluator.

run_valid(dataset_getter, training_timeout_seconds, logger, progress_callback: Callable[[dict], None] | None = None, should_terminate: Callable[[], bool] | None = None)

This function returns the training and validation results for a model selection run. Do not attempt to load the test set inside this method! If possible, rely on already available subclasses of this class.

It implements a simple training scheme.

Parameters:
  • dataset_getter (DataProvider) – a data provider

  • training_timeout_seconds (int) – timeout for the experiment in seconds

  • logger (Logger) – the logger

Returns:

a tuple of training and validation dictionaries. Each dictionary has two keys:

  • LOSS (as defined in mlwiz.static)

  • SCORE (as defined in mlwiz.static)

For instance, training_results[SCORE] is a dictionary itself with other fields to be used by the evaluator.
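The progress_callback and should_terminate arguments in the signature are not listed among the parameters. Judging only from their type hints, the first receives a dict and the second returns a bool. Below is a minimal sketch of callables with those shapes. The names make_callbacks, updates and stop_event, and the sample payload, are hypothetical; the actual content of the progress dict is not specified here.

```python
import threading
from typing import Callable, List, Tuple


def make_callbacks() -> Tuple[
    Callable[[dict], None], Callable[[], bool], List[dict], threading.Event
]:
    """Build a progress collector and a cooperative stop check."""
    updates: List[dict] = []
    stop_event = threading.Event()

    def progress_callback(payload: dict) -> None:
        # Store a copy so later mutation by the caller does not affect history.
        updates.append(dict(payload))

    def should_terminate() -> bool:
        # Becomes True once another thread (or the user) calls stop_event.set().
        return stop_event.is_set()

    return progress_callback, should_terminate, updates, stop_event


progress_callback, should_terminate, updates, stop_event = make_callbacks()
progress_callback({"example_field": 1})  # made-up payload, for illustration only
stop_event.set()  # request termination
```

Callables like these could then be passed as progress_callback=progress_callback and should_terminate=should_terminate to run_valid or run_test.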

mlwiz.experiment.experiment._ddp_worker(rank: int, world_size: int, mode: str, experiment_spec, dataset_getter_spec, training_timeout_seconds, logger_spec, master_port: int, result_queue, progress_queue, stop_flag)

Worker used by DDP spawn.

mlwiz.experiment.experiment._find_free_port() → int

Return a free localhost TCP port.
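A common way to obtain such a port is to bind a socket to port 0 and let the operating system choose. The sketch below shows that general technique; it is not necessarily how the library's helper is written, and the name find_free_port is used only for the example.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port on localhost and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the OS to pick any currently free port.
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


port = find_free_port()
```

The socket is closed before the port number is used, so another process could in principle claim the port in the meantime.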

mlwiz.experiment.experiment._to_dotted_path(obj) → str

Return the dotted path for a class/function object.
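One plausible way to build such a path is to join the object's __module__ and __qualname__ attributes. The sketch below assumes that approach; the name to_dotted_path is used only for the example.

```python
from collections import OrderedDict


def to_dotted_path(obj) -> str:
    """Return 'module.QualifiedName' for a class or function object."""
    return f"{obj.__module__}.{obj.__qualname__}"


path = to_dotted_path(OrderedDict)
```

A string of this form can be sent to a spawned worker and resolved back into the object there, which avoids pickling the class or function itself.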

mlwiz.experiment.experiment._to_queue_safe(obj)

Convert tensors/numpy values to plain Python before queue transfer.
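The conversion could look roughly like the recursive sketch below. It relies on duck typing (the detach, cpu and tolist methods that tensors and numpy arrays expose) so that it runs without PyTorch or numpy installed. The name to_queue_safe and the choice to turn tuples into lists are assumptions of this example.

```python
from array import array


def to_queue_safe(obj):
    """Recursively replace array-like values with plain Python objects."""
    if isinstance(obj, dict):
        return {key: to_queue_safe(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples are returned as lists in this sketch.
        return [to_queue_safe(item) for item in obj]
    if hasattr(obj, "detach"):
        # Tensor-like object: drop the autograd graph and move to CPU first.
        obj = obj.detach().cpu()
    if hasattr(obj, "tolist"):
        # numpy arrays, numpy scalars and tensors all provide tolist().
        return obj.tolist()
    return obj


# array.array stands in for a numpy array here because it also has tolist().
safe = to_queue_safe(
    {"loss": array("d", [0.5, 0.25]), "epochs": (1, 2), "name": "run"}
)
```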