mlwiz.data
data.dataset
Dataset interfaces and reference dataset implementations.
Defines DatasetInterface and iterable variants plus small built-in datasets.
- class mlwiz.data.dataset.Cora(storage_folder: str, raw_dataset_folder: str | None = None, transform_train: Callable | None = None, transform_eval: Callable | None = None, pre_transform: Callable | None = None, **kwargs)
Bases:
DatasetInterface
Note: For graph datasets we still use torch.save/load, since PyG >=2.6.0 specifies the safe globals (PyTorch 2.5) and torch.save/load is much faster and more efficient (space/time) at storing Data objects.
- static _load_dataset(dataset_filepath)
Load the processed PyG dataset using torch.load().
- Parameters:
dataset_filepath (str | pathlib.Path) – Path to the serialized dataset.
- Returns:
The deserialized dataset.
- Return type:
object
- static _save_dataset(dataset, dataset_filepath)
Save the processed PyG dataset using torch.save().
- Parameters:
dataset (object) – Dataset object to serialize (typically a PyG InMemoryDataset or list of Data objects).
dataset_filepath (str | pathlib.Path) – Destination path.
- Side effects:
Writes to disk.
- property dim_input_features: int | Tuple[int]
Return the Cora node feature dimension (1433).
- property dim_target: int | Tuple[int]
Return the Cora target dimension (7 classes).
- process_dataset() List[object]
Processes the dataset to the self.dataset_folder folder. It should generate files according to the self.dataset_file_names list.
- class mlwiz.data.dataset.DatasetInterface(storage_folder: str, raw_dataset_folder: str | None = None, transform_train: Callable | None = None, transform_eval: Callable | None = None, pre_transform: Callable | None = None, **kwargs)
Bases:
object
Class that defines a number of properties essential to all dataset implementations inside MLWiz. These properties are used by the training engine and forwarded to the model to be trained.
Useful for small to medium datasets where a single file can contain the whole data. In case the dataset is too large and needs to be split into chunks, check mlwiz.data.dataset.IterableDatasetInterface.
Please note that in order to use transformations you need to use classes like mlwiz.data.provider.SubsetTrainEval.
- Parameters:
storage_folder (str) – path to folder where to store the dataset
raw_dataset_folder (Optional[str]) – path to raw data folder where raw data is stored
transform_train (Optional[Callable]) – transformations to apply to each sample at training time
transform_eval (Optional[Callable]) – transformations to apply to each sample at eval time
pre_transform (Optional[Callable]) – transformations to apply to each sample at dataset creation time
- static _load_dataset(dataset_filepath)
Load a previously saved processed dataset object from disk.
- Parameters:
dataset_filepath (str | pathlib.Path) – Path to the stored dataset.
- Returns:
The deserialized dataset representation.
- Return type:
object
- static _save_dataset(dataset, dataset_filepath)
Persist a processed dataset object to disk.
The default implementation uses dill-based serialization. Subclasses may override this method to use alternative serializers (e.g., torch.save() for PyG Data objects).
- Parameters:
dataset (object) – In-memory dataset representation to store.
dataset_filepath (str | pathlib.Path) – Destination path.
- Side effects:
Writes to disk.
- property dataset_filename: str
Return the filename of the serialized processed dataset.
- property dataset_filepath: Path
Return the full path to the serialized processed dataset file.
- property dataset_folder: Path
Return the folder where processed dataset artifacts are stored.
- property dim_input_features: int | Tuple[int]
Specifies the number of input features or a tuple if there are more, for instance node and edge features in graphs.
- property dim_target: int | Tuple[int]
Specifies the dimension of each target vector or a tuple if there are more.
- property name: str
Return the dataset name (defaults to the class name).
- process_dataset() List[object]
Build the processed dataset in memory.
This method is called automatically by __init__() when the processed dataset file does not exist yet.
- Returns:
The processed samples. Each sample is typically a tuple (x, y) where x is the model input and y is the target, but MLWiz does not enforce a specific structure.
- Return type:
List[object]
- property raw_dataset_folder: Path
Return the folder where raw dataset files are stored.
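The load-or-process behaviour described for DatasetInterface can be sketched without MLWiz installed. The classes below are hypothetical stand-ins, not the library's implementation: CachedDataset, ToySquares, and the file name dataset.pkl are assumptions, and the standard-library pickle module replaces the dill-based serializer.

```python
import pickle
import tempfile
from pathlib import Path


class CachedDataset:
    """Hypothetical stand-in for the load-or-process pattern (not MLWiz code)."""

    def __init__(self, storage_folder):
        self.dataset_folder = Path(storage_folder) / self.name
        self.dataset_folder.mkdir(parents=True, exist_ok=True)
        self.dataset_filepath = self.dataset_folder / "dataset.pkl"  # assumed name
        self.was_processed = False
        if not self.dataset_filepath.exists():
            # Processed file is missing: build the samples and persist them.
            self._save_dataset(self.process_dataset(), self.dataset_filepath)
            self.was_processed = True
        self.samples = self._load_dataset(self.dataset_filepath)

    @property
    def name(self):
        # Mirrors the documented default: the dataset name is the class name.
        return self.__class__.__name__

    @staticmethod
    def _save_dataset(dataset, dataset_filepath):
        # pickle stands in for dill to keep the sketch stdlib-only.
        with open(dataset_filepath, "wb") as f:
            pickle.dump(dataset, f)

    @staticmethod
    def _load_dataset(dataset_filepath):
        with open(dataset_filepath, "rb") as f:
            return pickle.load(f)

    def process_dataset(self):
        raise NotImplementedError


class ToySquares(CachedDataset):
    def process_dataset(self):
        # Each sample is an (x, y) tuple, the typical but not mandatory layout.
        return [(x, x * x) for x in range(5)]


storage = tempfile.mkdtemp()
first = ToySquares(storage)   # processes the data and writes the file
second = ToySquares(storage)  # finds the file and only loads it
```

A subclass that needs a different serializer would override only the two static methods, which is the extension point the reference describes for PyG-based datasets.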
- class mlwiz.data.dataset.IterableDatasetInterface(*args: Any, **kwargs: Any)
Bases:
IterableDataset
Class that implements the Iterable-style dataset, including multi-process data loading (https://pytorch.org/docs/stable/data.html#iterable-style-datasets). Useful when the dataset is too big and is split into chunks of files stored on disk. Each chunk can hold a single sample or a set of samples, and shuffling can be done sample-wise or chunk-wise. Must be combined with an appropriate mlwiz.data.provider.IterableDataProvider.
NOTE 1: We assume the splitter will split the dataset with respect to the number of files stored on disk, so be sure that the length of your dataset reflects that number. Then, examples will be provided sequentially, so if each file holds more than one sample, we will still be able to create a batch of samples from one or multiple files.
NOTE 2: NEVER override the __len__() method, as it varies dynamically with the url_indices argument.
- Parameters:
storage_folder (str) – path to root folder where to store the dataset
raw_dataset_folder (Optional[str]) – path to raw data folder where raw data is stored
transform_train (Optional[Callable]) – transformations to apply to each sample at training time
transform_eval (Optional[Callable]) – transformations to apply to each sample at eval time
pre_transform (Optional[Callable]) – transformations to apply to each sample at dataset creation time
- property dataset_filepaths: List[Path]
Return the full paths to all dataset files on disk.
- Returns:
Paths derived from url_indices().
- Return type:
List[pathlib.Path]
- property dataset_folder: Path
Return the folder where dataset files are stored.
- property dataset_name: str
Return the dataset name (defaults to the class name).
- property dim_input_features: int | Tuple[int]
Specifies the number of input features or a tuple if there are more, for instance node and edge features in graphs.
- property dim_target: int | Tuple[int]
Specifies the dimension of each target vector or a tuple if there are more.
- process_dataset(pre_transform: Callable | None)
Processes the dataset to the self.dataset_folder folder. It should generate and store files according to the self.url_indices list.
- Parameters:
pre_transform (Optional[Callable]) – transformations to apply to each sample at dataset creation time
- property raw_dataset_folder: Path
Return the folder where raw dataset files are stored.
Notes
Only valid when raw_dataset_folder was provided at initialization time.
- set_eval(is_eval: bool)
Set whether iteration should apply training or evaluation transforms.
- Parameters:
is_eval (bool) – If True, transform_eval is applied during iteration; otherwise transform_train is applied.
- Side effects:
Updates the internal _eval flag used by __iter__().
- shuffle_urls(value: bool)
Shuffles urls associated to individual files stored on disk
- Parameters:
value (bool) – whether to shuffle urls
- shuffle_urls_elements(value: bool)
Shuffles elements contained in each file (associated with an url). Use this method when a single file stores multiple samples and you want to provide them in shuffled order. IMPORTANT: in this case we assume that each file contains a list of Data objects!
- Parameters:
value (bool) – whether to shuffle the elements contained in each file
- splice(start: int, end: int)
Use this method to assign portions of the dataset to load to different workers, otherwise they will load the same samples.
- Parameters:
start (int) – the index where to start
end (int) – the index where to stop
- subset(indices: List[int])
Use this method to modify the dataset by taking a subset of samples. WARNING: It PERMANENTLY changes the object URLs, so you have to create a copy of the original object before calling this method. It is not a memory intensive process to create a copy since this dataset works as iterable and loads data from disk on the fly.
- Parameters:
indices (List[int]) – the indices to keep
- property url_indices: List[Path]
Specify the list of dataset file names (relative to dataset_folder).
Each entry represents a file containing either a single sample or a list of samples.
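The difference between chunk-wise and sample-wise shuffling can be illustrated with a small generator. This is only a sketch under stated assumptions: iterate_chunks and its arguments are hypothetical names, in-memory lists stand in for files on disk, and the actual __iter__() of the library may behave differently.

```python
import random


def iterate_chunks(chunks, shuffle_urls=False, shuffle_elements=False, seed=0):
    """Yield samples from a list of chunks (each chunk stands in for one file).

    Hypothetical illustration of chunk-wise and sample-wise shuffling.
    """
    rng = random.Random(seed)
    order = list(range(len(chunks)))
    if shuffle_urls:
        rng.shuffle(order)  # permute the order in which files are visited
    for chunk_id in order:
        samples = list(chunks[chunk_id])
        if shuffle_elements:
            rng.shuffle(samples)  # permute samples inside the current file
        yield from samples


# Three "files", each holding two samples.
chunks = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
sequential = list(iterate_chunks(chunks))
shuffled = list(iterate_chunks(chunks, shuffle_urls=True, shuffle_elements=True))
```

In this sketch, samples from the same file stay adjacent even when both flags are on, because shuffling happens within one file at a time; the dataset length still counts files, which matches NOTE 1.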
- class mlwiz.data.dataset.MNIST(storage_folder: str, raw_dataset_folder: str | None = None, transform_train: Callable | None = None, transform_eval: Callable | None = None, pre_transform: Callable | None = None, **kwargs)
Bases:
DatasetInterface
Torchvision MNIST dataset wrapper stored as a single processed file.
- property dim_input_features: int | Tuple[int]
Return the flattened MNIST input dimension (28 * 28).
- property dim_target: int | Tuple[int]
Return the MNIST target dimension (10 classes).
- process_dataset() List[object]
Processes the dataset to the self.dataset_folder folder. It should generate files according to the self.dataset_file_names list.
- class mlwiz.data.dataset.MNISTTemporal(storage_folder: str, raw_dataset_folder: str | None = None, transform_train: Callable | None = None, transform_eval: Callable | None = None, pre_transform: Callable | None = None, **kwargs)
Bases:
DatasetInterface
MNIST variant where each sample is a sequence of 28 timesteps.
- property dim_input_features: int | Tuple[int]
Return the per-timestep input dimension (28).
- property dim_target: int | Tuple[int]
Return the MNIST target dimension (10 classes).
- process_dataset() List[object]
Processes the dataset to the self.dataset_folder folder. It should generate files according to the self.dataset_file_names list.
- class mlwiz.data.dataset.NCI1(storage_folder: str, raw_dataset_folder: str | None = None, transform_train: Callable | None = None, transform_eval: Callable | None = None, pre_transform: Callable | None = None, **kwargs)
Bases:
DatasetInterface
Note: For graph datasets we still use torch.save/load, since PyG >=2.6.0 specifies the safe globals (PyTorch 2.5) and torch.save/load is much faster and more efficient (space/time) at storing Data objects.
- static _load_dataset(dataset_filepath)
Load the processed PyG dataset using torch.load().
- Parameters:
dataset_filepath (str | pathlib.Path) – Path to the serialized dataset.
- Returns:
The deserialized dataset.
- Return type:
object
- static _save_dataset(dataset, dataset_filepath)
Save the processed PyG dataset using torch.save().
- Parameters:
dataset (object) – Dataset object to serialize (typically a sequence of PyG Data objects).
dataset_filepath (str | pathlib.Path) – Destination path.
- Side effects:
Writes to disk.
- property dim_input_features: int | Tuple[int]
Return the NCI1 node feature dimension (37).
- property dim_target: int | Tuple[int]
Return the NCI1 target dimension (2 classes).
- process_dataset() List[object]
Processes the dataset to the self.dataset_folder folder. It should generate files according to the self.dataset_file_names list.
- class mlwiz.data.dataset.ToyIterableDataset(*args: Any, **kwargs: Any)
Bases:
IterableDatasetInterface
Small synthetic iterable dataset used for tests/examples.
- property dim_input_features: int | Tuple[int]
Specifies the number of node features (after pre-processing, but in the end it depends on the model that is implemented).
- property dim_target: int | Tuple[int]
Specifies the dimension of each target vector.
- process_dataset(pre_transform: Callable | None)
Creates a fake dataset and stores it to the self.processed_dir folder. Each file will contain a list of 20 fake samples.
- property url_indices: List[Path]
Specifies the list of file names where you plan to store portions of the large dataset
- class mlwiz.data.dataset._ReshapeMNISTTemporal(*args: Any, **kwargs: Any)
Bases:
Module
Transform that reshapes MNIST images into a length-28 sequence.
data.provider
Fold-aware data loader construction for MLWiz.
Provides DataProvider variants that expose train/validation/test loaders for nested CV.
- class mlwiz.data.provider.DataProvider(storage_folder: str, splits_filepath: str, dataset_class: Callable[[...], DatasetInterface], data_loader_class: Callable[[...], torch.utils.data.DataLoader] | Callable[[...], torch_geometric.loader.DataLoader], data_loader_args: dict, outer_folds: int, inner_folds: int)
Bases:
object
A DataProvider object retrieves the correct data according to the external and internal data splits. It can additionally be used to augment the data, or to create a specific type of data loader. The base class does nothing special, but here is where the i-th element of a dataset could be pre-processed before constructing the mini-batches.
IMPORTANT: if the dataset is to be shuffled, you MUST use a mlwiz.data.sampler.RandomSampler object to determine the permutation.
- Parameters:
storage_folder (str) – the path of the root folder in which data is stored
splits_filepath (str) – the filepath of the splits, with additional metadata
dataset_class (Callable[…, mlwiz.data.dataset.DatasetInterface]) – the class of the dataset
data_loader_class (Union[Callable[…, torch.utils.data.DataLoader], Callable[…, torch_geometric.loader.DataLoader]]) – the class of the data loader to use
data_loader_args (dict) – the arguments of the data loader
outer_folds (int) – the number of outer folds for risk assessment. 1 means hold-out, >1 means k-fold
inner_folds (int) – the number of inner folds for model selection. 1 means hold-out, >1 means k-fold
- _get_dataset(**kwargs: dict) DatasetInterface
Instantiates the dataset. Relies on the parameters stored in the dataset_kwargs.pt file.
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset. Not used in the base version
- Returns:
a DatasetInterface object
- _get_loader(indices: list, is_eval: bool, **kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Instantiates the data loader.
- Parameters:
indices (sequence) – Indices in the whole set selected for subset
is_eval – false if training, true otherwise
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- _get_splitter() Splitter
Instantiates the splitter with the parameters stored in the file self.splits_filepath.
- Returns:
a Splitter object
- _require_exp_seed() None
Raises a RuntimeError if the seed has not been specified
- _require_outer_and_inner_k() None
Raises a RuntimeError if the outer and inner folds have not been specified
- _require_outer_k() None
Raises a RuntimeError if the outer fold has not been specified
- get_dim_input_features() int
Returns the number of node features of the dataset
- Returns:
the value of the property dim_input_features in the dataset
- get_dim_target() int
Returns the dimension of the target for the task
- Returns:
the value of the property dim_target in the dataset
- get_inner_train(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the training set for model selection associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_inner_val(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the validation set for model selection associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_test(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the test set for risk assessment associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_train(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the training set for risk assessment associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_val(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the validation set for risk assessment associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- set_exp_seed(seed: int)
Sets the experiment seed to give to the DataLoader. Helps with reproducibility.
- Parameters:
seed (int) – id of the seed
- set_inner_k(k)
Sets the parameter k of the model selection procedure. Called by the evaluation modules to load the correct subset of the data.
- Parameters:
k (int) – the id of the fold, ranging from 0 to K-1.
- set_outer_k(k: int)
Sets the parameter k of the risk assessment procedure. Called by the evaluation modules to load the correct data subset.
- Parameters:
k (int) – the id of the fold, ranging from 0 to K-1.
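The call order implied by the setters and getters above (seed first, then outer fold, then inner fold) can be sketched with a duck-typed stub. StubProvider is a hypothetical stand-in that returns strings instead of data loaders; which state each getter requires is an assumption modelled on the documented _require_* helpers, not a statement about the library.

```python
class StubProvider:
    """Hypothetical stand-in exposing the documented method names."""

    def __init__(self, outer_folds, inner_folds):
        self.outer_folds, self.inner_folds = outer_folds, inner_folds
        self.outer_k = self.inner_k = self.exp_seed = None

    def set_exp_seed(self, seed):
        self.exp_seed = seed

    def set_outer_k(self, k):
        self.outer_k = k

    def set_inner_k(self, k):
        self.inner_k = k

    def _require(self, *names):
        # Fail early if a piece of state has not been set (assumed behaviour).
        for n in names:
            if getattr(self, n) is None:
                raise RuntimeError(f"{n} has not been specified")

    def get_inner_train(self):
        self._require("exp_seed", "outer_k", "inner_k")
        return f"train(outer={self.outer_k}, inner={self.inner_k})"

    def get_inner_val(self):
        self._require("exp_seed", "outer_k", "inner_k")
        return f"val(outer={self.outer_k}, inner={self.inner_k})"

    def get_outer_test(self):
        self._require("exp_seed", "outer_k")
        return f"test(outer={self.outer_k})"


provider = StubProvider(outer_folds=2, inner_folds=3)
provider.set_exp_seed(42)
visited = []
for outer_k in range(provider.outer_folds):
    provider.set_outer_k(outer_k)
    for inner_k in range(provider.inner_folds):
        provider.set_inner_k(inner_k)
        visited.append((provider.get_inner_train(), provider.get_inner_val()))
    visited.append(provider.get_outer_test())
```

With 2 outer and 3 inner folds the loop visits six inner train/validation pairs and two outer test sets.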
- class mlwiz.data.provider.IterableDataProvider(storage_folder: str, splits_filepath: str, dataset_class: Callable[[...], DatasetInterface], data_loader_class: Callable[[...], torch.utils.data.DataLoader] | Callable[[...], torch_geometric.loader.DataLoader], data_loader_args: dict, outer_folds: int, inner_folds: int)
Bases:
DataProvider
A DataProvider object that allows fetching data from an Iterable-style Dataset (see mlwiz.data.dataset.IterableDatasetInterface).
- _get_loader(indices: list, is_eval: bool, **kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Instantiates the data loader, passing to the dataset an additional url_indices argument with the indices to fetch. This is because each time this method is called with different indices, a separate instance of the dataset is created.
- Parameters:
indices (sequence) – Indices in the whole set selected for subset
is_eval – false if training, true otherwise
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- class mlwiz.data.provider.SingleGraphDataProvider(storage_folder: str, splits_filepath: str, dataset_class: Callable[[...], DatasetInterface], data_loader_class: Callable[[...], torch.utils.data.DataLoader] | Callable[[...], torch_geometric.loader.DataLoader], data_loader_args: dict, outer_folds: int, inner_folds: int)
Bases:
DataProvider
A DataProvider subclass that only works with mlwiz.data.splitter.SingleGraphSplitter.
- Parameters:
storage_folder (str) – the path of the root folder in which data is stored
splits_filepath (str) – the filepath of the splits, with additional metadata
dataset_class (Callable[…, mlwiz.data.dataset.DatasetInterface]) – the class of the dataset
data_loader_class (Union[Callable[…, torch.utils.data.DataLoader], Callable[…, torch_geometric.loader.DataLoader]]) – the class of the data loader to use
data_loader_args (dict) – the arguments of the data loader
outer_folds (int) – the number of outer folds for risk assessment. 1 means hold-out, >1 means k-fold
inner_folds (int) – the number of inner folds for model selection. 1 means hold-out, >1 means k-fold
- _get_dataset(**kwargs: dict) DatasetInterface
Compared to superclass method, this always returns a new instance of the dataset, optionally passing extra arguments specified at runtime.
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset. Not used in the base version
- Returns:
a DatasetInterface object
- _get_loader(eval_indices: list, training_indices: list, **kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Compared to superclass method, returns a dataloader with the single graph augmented with additional fields. These are training_indices with the indices that refer to training nodes (usually always available) and eval_indices, which specify which are the indices on which to evaluate (can be validation or test).
- Parameters:
eval_indices (list) – indices of the nodes on which to evaluate (validation or test)
training_indices (list) – indices of the training nodes
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- _get_splitter()
Instantiates the splitter with the parameters stored in the file self.splits_filepath. Only works with mlwiz.data.splitter.SingleGraphSplitter.
- Returns:
a Splitter object
- get_inner_train(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the training set for model selection associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_inner_val(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the validation set for model selection associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_test(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the test set for risk assessment associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_train(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the training set for risk assessment associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- get_outer_val(**kwargs: dict) torch.utils.data.DataLoader | torch_geometric.loader.DataLoader
Returns the validation set for risk assessment associated with specific outer and inner folds
- Parameters:
kwargs (dict) – a dictionary of additional arguments to be passed to the dataset being loaded. Not used in the base version
- Returns:
a Union[torch.utils.data.DataLoader, torch_geometric.loader.DataLoader] object
- class mlwiz.data.provider.SubsetTrainEval(*args: Any, **kwargs: Any)
Bases:
Subset
Extension of PyTorch Subset to differentiate between training and evaluation subsets.
- Parameters:
dataset (DatasetInterface) – The whole Dataset
indices (sequence) – Indices in the whole set selected for subset
is_eval (bool) – False if training, True otherwise
- mlwiz.data.provider._iterable_worker_init_fn(worker_id: int, exp_seed: int)
Sets the seeds for the worker and computes the range of sample ids to fetch.
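One common way to give each worker its own range of ids, following the pattern shown in the PyTorch documentation for iterable-style datasets, is a contiguous near-equal partition. The helper below is a hypothetical sketch (worker_range is not an MLWiz function) and the library may compute the ranges differently.

```python
import math


def worker_range(num_items, num_workers, worker_id):
    """Return the [start, end) slice of item ids assigned to one worker.

    Hypothetical helper: contiguous, near-equal shares per worker.
    """
    per_worker = math.ceil(num_items / num_workers)
    start = min(worker_id * per_worker, num_items)
    end = min(start + per_worker, num_items)
    return start, end


# Ten files shared among three workers.
ranges = [worker_range(10, 3, w) for w in range(3)]
```

Each (start, end) pair could then be passed to a method such as splice(start, end) so that workers do not load the same samples.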
- mlwiz.data.provider.seed_worker(exp_seed, worker_id)
Used to set a different, but reproducible, seed for all data-retriever workers. Without this, all workers will retrieve the data in the same order (important for Iterable-style datasets).
- Parameters:
exp_seed (int) – base seed to be used for reproducibility
worker_id (int) – id number of the worker
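As an illustration of "different, but reproducible" seeds, the sketch below derives one seed per worker from the base seed. The combination rule exp_seed + worker_id is an assumption made for this example, and derive_worker_seed and sample_order are hypothetical names; the library may combine the two values differently and also seeds other random number generators.

```python
import random


def derive_worker_seed(exp_seed, worker_id):
    # Assumed combination rule: offset the base seed by the worker id.
    return exp_seed + worker_id


def sample_order(exp_seed, worker_id, n=5):
    # Each worker draws its permutation from its own seeded generator.
    rng = random.Random(derive_worker_seed(exp_seed, worker_id))
    order = list(range(n))
    rng.shuffle(order)
    return order


seeds = [derive_worker_seed(42, w) for w in range(3)]
```

Calling sample_order twice with the same arguments returns the same permutation, which is the reproducibility property the function is meant to provide.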
data.sampler
Custom samplers used by MLWiz data loaders.
Includes RandomSampler, which records the applied permutation.
- class mlwiz.data.sampler.RandomSampler(*args: Any, **kwargs: Any)
Bases:
RandomSampler
This sampler wraps the dataset and saves the random permutation applied to the samples, so that it will be available for further use (e.g. for saving embeddings in the original samples order). The permutation is saved in the ‘permutation’ attribute.
- Parameters:
data_source (mlwiz.data.DatasetInterface) – the dataset object
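The reason for storing the permutation is that outputs computed in shuffled order can be mapped back to the original sample order. A minimal sketch, assuming a hypothetical RecordingSampler that mimics only the 'permutation' attribute described above:

```python
import random


class RecordingSampler:
    """Hypothetical stand-in: yields a random permutation and remembers it."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self._rng = random.Random(seed)
        self.permutation = None

    def __iter__(self):
        self.permutation = self._rng.sample(range(self.num_samples), self.num_samples)
        return iter(self.permutation)


data = ["s0", "s1", "s2", "s3", "s4"]
sampler = RecordingSampler(len(data), seed=1)
# Pretend these are embeddings computed in shuffled order.
shuffled_outputs = [data[i].upper() for i in sampler]

# Undo the shuffle: position j of the output belongs to sample permutation[j].
restored = [None] * len(data)
for position, original_index in enumerate(sampler.permutation):
    restored[original_index] = shuffled_outputs[position]
```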
data.splitter
Dataset split generation and persistence utilities.
Defines splitters and fold containers used for hold-out and (nested) cross-validation.
- class mlwiz.data.splitter.Fold(train_idxs, val_idxs=None, test_idxs=None)
Bases:
object
Simple class that stores training, validation, and test indices.
- Parameters:
train_idxs (Union[list, tuple]) – training indices
val_idxs (Union[list, tuple]) – validation indices. Default is None
test_idxs (Union[list, tuple]) – test indices. Default is None
- class mlwiz.data.splitter.InnerFold(train_idxs, val_idxs=None, test_idxs=None)
Bases:
Fold
Simple extension of the Fold class that returns a dictionary with training and validation indices (model selection).
- todict() dict
Creates a dictionary with the training/validation indices.
- Returns:
a dict with keys ['train', 'val'] associated with the respective indices
- class mlwiz.data.splitter.OuterFold(train_idxs, val_idxs=None, test_idxs=None)
Bases:
Fold
Simple extension of the Fold class that returns a dictionary with training and test indices (risk assessment)
- todict() dict
Creates a dictionary with the training/validation/test indices.
- Returns:
a dict with keys ['train', 'val', 'test'] associated with the respective indices
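The two todict() variants differ only in the keys they expose. The sketch below reproduces that shape with hypothetical stand-in classes (the Sketch suffix marks them as illustrations, not the library's classes).

```python
class FoldSketch:
    """Hypothetical stand-in for the documented Fold container."""

    def __init__(self, train_idxs, val_idxs=None, test_idxs=None):
        self.train_idxs = train_idxs
        self.val_idxs = val_idxs
        self.test_idxs = test_idxs


class InnerFoldSketch(FoldSketch):
    def todict(self):
        # Model selection only needs training and validation indices.
        return {"train": self.train_idxs, "val": self.val_idxs}


class OuterFoldSketch(FoldSketch):
    def todict(self):
        # Risk assessment additionally carries the held-out test indices.
        return {"train": self.train_idxs, "val": self.val_idxs, "test": self.test_idxs}


inner = InnerFoldSketch([0, 1, 2], val_idxs=[3]).todict()
outer = OuterFoldSketch([0, 1, 2, 3], val_idxs=[4], test_idxs=[5, 6]).todict()
```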
- class mlwiz.data.splitter.SingleGraphSplitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)
Bases:
Splitter
A splitter for a single graph dataset that randomly splits nodes into training/validation/test
- Parameters:
n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold
n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold
seed (int) – random seed for reproducibility (on the same machine)
stratify (bool) – whether to apply stratification or not (should be true for classification tasks)
shuffle (bool) – whether to apply shuffle or not
inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is 0.1
outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is 0.1
test_ratio (float) – percentage of test set for hold_out model assessment. Default is 0.1
- split(dataset: DatasetInterface, targets: numpy.ndarray | None = None)
Compared with the superclass version, the only difference is that the range of indices spans across the number of nodes of the single graph taken into consideration.
- Parameters:
dataset (DatasetInterface) – the Dataset object
targets (np.ndarray) – targets used for stratification. Default is None
- class mlwiz.data.splitter.Splitter(n_outer_folds: int, n_inner_folds: int, seed: int, stratify: bool, shuffle: bool, inner_val_ratio: float = 0.1, outer_val_ratio: float = 0.1, test_ratio: float = 0.1)
Bases:
object
Class that generates and stores the data splits at dataset creation time.
- Parameters:
n_outer_folds (int) – number of outer folds (risk assessment). 1 means hold-out, >1 means k-fold
n_inner_folds (int) – number of inner folds (model selection). 1 means hold-out, >1 means k-fold
seed (int) – random seed for reproducibility (on the same machine)
stratify (bool) – whether to apply stratification or not (should be true for classification tasks)
shuffle (bool) – whether to apply shuffle or not
inner_val_ratio (float) – percentage of validation set for hold_out model selection. Default is 0.1
outer_val_ratio (float) – percentage of validation set for hold_out model assessment (final training runs). Default is 0.1
test_ratio (float) – percentage of test set for hold_out model assessment. Default is 0.1
- _get_splitter(n_splits: int, stratified: bool, eval_ratio: float)
Instantiates the appropriate splitter to use depending on the situation
- Parameters:
n_splits (int) – the number of different splits to create
stratified (bool) – whether to perform stratification. Works with classification tasks only!
eval_ratio (float) – the amount of evaluation (validation/test) data to use in case n_splits==1 (i.e., hold-out data split)
- Returns:
a Splitter object
- _splitter_args() dict
Returns a dict with all the splitter’s arguments for subsequent re-loading at experiment time.
- Returns:
a dict containing all splitter’s arguments.
- check_splits_overlap(skip_check: bool = False)
Checks whether the created splits overlap. If they do, an error message is returned.
- Parameters:
skip_check (bool) – whether to skip this check
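An overlap check of this kind amounts to intersecting the index sets of a fold pairwise. The function below is a hypothetical sketch operating on the dictionary shape returned by todict(); it is not the library's check_splits_overlap().

```python
def find_overlaps(fold):
    """Return the pairs of split names that share at least one index."""
    names = [n for n in ("train", "val", "test") if fold.get(n) is not None]
    overlaps = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if set(fold[a]) & set(fold[b]):
                overlaps.append((a, b))
    return overlaps


clean = {"train": [0, 1, 2], "val": [3], "test": [4, 5]}
leaky = {"train": [0, 1, 2], "val": [2, 3], "test": [4, 5]}
```

Here find_overlaps(clean) is empty, while find_overlaps(leaky) reports the train/val pair because index 2 appears in both.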
- get_targets(dataset: DatasetInterface) Tuple[bool, numpy.ndarray]
Reads the entire dataset and returns the targets.
- Parameters:
dataset (DatasetInterface) – the dataset
- Returns:
a tuple of two elements. The first element is a boolean, which is True if target values exist or an exception has not been thrown. The second value holds the actual targets or None, depending on the first boolean value.
- classmethod load(path: str)
Loads the data splits from disk.
- Parameters:
path (str) – the path of the yaml file with the splits
- Returns:
a Splitter object
- save(path: str)
Saves the split as a dictionary into a torch file. The keys of the dictionary are:
* seed (int)
* splitter_class (str)
* splitter_args (dict)
* outer_folds (list of dicts)
* inner_folds (list of lists of dicts)
- Parameters:
path (str) – filepath where to save the object
- split(dataset: DatasetInterface, targets: numpy.ndarray | None = None)
Computes the splits and stores them in the list fields self.outer_folds and self.inner_folds. IMPORTANT: calling split() sets the seed of numpy, torch, and random for reproducibility.
- Parameters:
dataset (DatasetInterface) – the Dataset object
targets (np.ndarray) – targets used for stratification. Default is None
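The relationship between the two list fields (one entry per outer fold, and for each of them a list of inner folds built from the outer training indices only) can be sketched as follows. This is a simplified illustration, not the library's algorithm: kfold and nested_split are hypothetical helpers, stratification and the outer validation split are omitted, and a local random generator is used instead of seeding numpy, torch, and random globally.

```python
import random


def kfold(indices, k):
    """Return k (rest, held_out) pairs, holding out every k-th index in turn."""
    pairs = []
    for i in range(k):
        held_out = indices[i::k]
        held_set = set(held_out)
        rest = [x for x in indices if x not in held_set]
        pairs.append((rest, held_out))
    return pairs


def nested_split(n_samples, n_outer, n_inner, seed):
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    outer_folds, inner_folds = [], []
    for outer_train, test in kfold(indices, n_outer):
        outer_folds.append({"train": outer_train, "test": test})
        # Inner folds partition only the outer training indices.
        inner_folds.append(
            [{"train": tr, "val": va} for tr, va in kfold(outer_train, n_inner)]
        )
    return outer_folds, inner_folds


outer_folds, inner_folds = nested_split(n_samples=12, n_outer=3, n_inner=2, seed=0)
```

Because every inner fold is drawn from the outer training indices, the outer test indices never appear in model selection.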
- property stratify: bool
Whether the splitter has to apply target stratification or not
- class mlwiz.data.splitter._NoShuffleTrainTestSplit(test_ratio)
Bases:
object
Class that implements a very simple training/test split. Can be used to further split training data into training and validation.
- Parameters:
test_ratio – percentage of data to use for evaluation.
- split(idxs, y=None)
Splits the data.
- Parameters:
idxs – the indices to split according to the test_ratio parameter
y – Unused argument
- Returns:
a list of a single tuple (train indices, test/eval indices)
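A split without shuffling can be as simple as cutting the index sequence once. The sketch below assumes the evaluation part is taken from the end of the sequence, which the reference does not state; no_shuffle_split is a hypothetical name.

```python
def no_shuffle_split(idxs, test_ratio):
    """Hold out the trailing `test_ratio` fraction of idxs without shuffling.

    Assumption: the evaluation part is taken from the end of the sequence.
    """
    n_test = int(round(len(idxs) * test_ratio))
    cut = len(idxs) - n_test
    # Same return shape as documented: a list with one (train, eval) tuple.
    return [(list(idxs[:cut]), list(idxs[cut:]))]


splits = no_shuffle_split(list(range(10)), test_ratio=0.2)
```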
data.util
Dataset preprocessing and loading helpers.
Implements config-driven dataset/splitter instantiation and utilities like preprocess_data() and load_dataset().
- mlwiz.data.util.check_argument(cls: object, arg_name: str) bool
Checks whether arg_name is in the signature of a method or class.
- Parameters:
cls (object) – the class to inspect
arg_name (str) – the name to look for
- Returns:
True if the name was found, False otherwise
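A signature check of this kind can be written with the standard inspect module. The sketch below is one possible approach, not necessarily how check_argument is implemented; has_argument and Example are hypothetical names.

```python
import inspect


def has_argument(obj, arg_name):
    """Return True if `arg_name` appears in the call signature of `obj`."""
    return arg_name in inspect.signature(obj).parameters


class Example:
    def __init__(self, storage_folder, pre_transform=None):
        self.storage_folder = storage_folder
        self.pre_transform = pre_transform
```

For a class, inspect.signature reports the parameters of its constructor without self, so has_argument(Example, "pre_transform") is true and has_argument(Example, "url_indices") is false.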
- mlwiz.data.util.get_or_create_dir(path: str) str
Creates directories associated to the specified path if they are missing, and it returns the path string.
- Parameters:
path (str) – the path
- Returns:
the same path as the given argument
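The described behaviour maps directly onto os.makedirs with exist_ok=True. A minimal sketch, using the hypothetical name ensure_dir rather than the library function:

```python
import os
import tempfile


def ensure_dir(path):
    # Create the directory and any missing parents; do nothing if it exists.
    os.makedirs(path, exist_ok=True)
    return path


root = tempfile.mkdtemp()
nested = ensure_dir(os.path.join(root, "a", "b"))
again = ensure_dir(nested)  # second call is a no-op and returns the same path
```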
- mlwiz.data.util.load_dataset(storage_folder: str, dataset_class: Callable, **kwargs: dict) object
Loads the dataset using the dataset_kwargs.pt file created when parsing the data config file.
- Parameters:
storage_folder (str) – path of the folder that contains the dataset folder
dataset_class (Callable) – the class of the dataset to instantiate with the parameters stored in the dataset_kwargs.pt file.
kwargs (dict) – additional arguments to be passed to the dataset (potentially provided by a DataProvider)
- Returns:
a dataset object
- mlwiz.data.util.preprocess_data(options: dict) dict
One of the main functions of the MLWiz library. Used to create the dataset and its associated files that ensure the correct functioning of the data loading steps.
- Parameters:
options (dict) – a dictionary of dataset/splitter arguments as defined in the data configuration file used.
- mlwiz.data.util.single_graph_collate(batch)
Collate function for single-graph datasets.
PyTorch/PyG data loaders build a list of samples for each batch. For single-graph workflows, the loader is typically configured with batch_size=1 and each sample already contains all needed information. This collate function returns the single element in the batch list.
- Parameters:
batch (list) – Batch list produced by a DataLoader.
- Returns:
The first (and expected only) element of batch.
- Return type:
object
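The collate behaviour described above is small enough to sketch directly. The function name first_element_collate and the dictionary standing in for a graph object are assumptions for illustration only.

```python
def first_element_collate(batch):
    # With batch_size=1 the loader hands over a one-element list;
    # return that element unchanged instead of stacking anything.
    return batch[0]


# A plain dict stands in for the single graph object.
graph = {"x": [[1.0], [2.0]], "edge_index": [[0, 1], [1, 0]]}
collated = first_element_collate([graph])
```

The returned object is the very same graph that was placed in the batch, so any fields attached to it beforehand (such as training or evaluation indices) are preserved.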