ClassicMondrian

class ClassicMondrian(dataset: Dataset, k: int, group_anonymization: GroupAnonymization = GroupAnonymizationBuiltIn.SUMMARIZATION)

Bases: LocalRecodingAlgorithm

Implementation of the Classic Mondrian algorithm.

Classic Mondrian uses a top-down, greedy domain-partitioning approach. It recursively splits the data on the median of the QID attribute with the widest range in the current local region, and stops when no allowable split remains, that is, when no split would produce two groups of size ≥ k each (the k-anonymity requirement).

Parameters:
  • dataset (Dataset) – The Dataset object holding the original data and its metadata.

  • k (int) – The privacy parameter k.

  • group_anonymization (GroupAnonymization) – The method used to anonymize the groups produced by local recoding. Use one of the example methods in GroupAnonymizationBuiltIn, or supply a custom function with the signature custom_group_anonymization(group: list, props: Any) -> list. Default: GroupAnonymizationBuiltIn.SUMMARIZATION
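
A custom group anonymization function only has to match the signature given above. The following is a minimal sketch, not a built-in method of the library: the name range_group_anonymization is hypothetical, and it assumes that group is a list of equal-length records containing only comparable QID values and that props can be ignored. The actual structure of group and props should be checked against the library before relying on this.

```python
from typing import Any


def range_group_anonymization(group: list, props: Any = None) -> list:
    """Replace every value in a group with its column's "min-max" range.

    Assumption: `group` is a list of equal-length records (lists or tuples)
    of comparable values; `props` is accepted but unused in this sketch.
    """
    if not group:
        return []
    columns = list(zip(*group))  # transpose rows into columns
    summaries = []
    for col in columns:
        lo, hi = min(col), max(col)
        # A single repeated value needs no range.
        summaries.append(str(lo) if lo == hi else f"{lo}-{hi}")
    # Every record in the group receives the same generalized values.
    return [list(summaries) for _ in group]


# Hypothetical group of three records with two numeric QIDs each.
group = [[25, 47906], [31, 47902], [28, 47906]]
anonymized = range_group_anonymization(group, props=None)
```

Each record in the returned list carries identical generalized values, so the records of one group become indistinguishable on those columns.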

Methods
  • do_classic_mondrian – The recursive core of the Classic Mondrian algorithm.

  • do_local_recoding – Prepare data and initiate the Mondrian partitioning process.

  • sort_qids_idx – Determine the order of QID attributes to attempt splitting.

do_classic_mondrian(slice_data: ndarray)

The recursive core of the Classic Mondrian algorithm.

Splits a group into two smaller sub-groups at the median of the chosen dimension. A split is only accepted if both resulting sub-groups contain at least k records.

Parameters:

slice_data (numpy.ndarray) – The records within the current partition.

Returns:

list – A list of final partitions.
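
The recursion described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's implementation: the name mondrian_partitions is hypothetical, all QID columns are assumed numeric, raw (unnormalized) ranges decide the order of dimensions, and records equal to the median go to the left sub-group.

```python
import numpy as np


def mondrian_partitions(data: np.ndarray, k: int) -> list:
    """Recursively split `data` (rows = records, columns = numeric QIDs)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Try dimensions from widest to narrowest value range.
    ranges = data.max(axis=0) - data.min(axis=0)
    for dim in np.argsort(-ranges, kind="stable"):
        median = np.median(data[:, dim])
        left_mask = data[:, dim] <= median
        left, right = data[left_mask], data[~left_mask]
        # Accept the split only if both halves keep at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian_partitions(left, k) + mondrian_partitions(right, k)
    # No allowable split on any dimension: this partition is final.
    return [data]


records = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]])
partitions = mondrian_partitions(records, k=2)
```

With six records and k = 2, the first split on the second column yields two groups of three; neither can be split again without leaving a group of one record, so the recursion stops there.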

do_local_recoding()

Prepare data and initiate the Mondrian partitioning process.

Encodes (factorizes) categorical attributes as integers so that they can be split on a median, then calls the recursive partitioner.

Returns:

list – The collection of partitions (equivalence classes) produced by the algorithm.
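
One way to turn categorical values into integers for median-based splitting is shown below. This is a sketch only: the helper name factorize_column is hypothetical, and the code assignment (sorted category order, via numpy.unique) is an assumption; the library may assign codes differently.

```python
import numpy as np


def factorize_column(values):
    """Map each distinct category to an integer code (sorted category order)."""
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes, categories


original = ["nurse", "clerk", "nurse", "pilot"]
codes, categories = factorize_column(original)

# The original labels can be recovered by indexing categories with the codes.
restored = categories[codes].tolist()
```

Keeping the categories array alongside the codes makes it possible to map partitions back to the original labels after splitting.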

sort_qids_idx(slice_data: ndarray)

Determine the order of QID attributes to attempt splitting.

Heuristically orders the QID attributes so that those with the widest normalized range are tried first, in order to minimize information loss. When normalized ranges are tied, the attribute with more distinct values comes first.

Parameters:

slice_data (numpy.ndarray) – The records within the current partition.

Returns:

list – Sorted list of QID indices for splitting attempts.
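
The ordering heuristic can be illustrated with a small standalone function. The name sort_qids_by_range and the global_ranges argument are hypothetical, and normalizing each local range by the attribute's range over the whole dataset is an assumption about what "normalized" means here.

```python
import numpy as np


def sort_qids_by_range(slice_data: np.ndarray, global_ranges: np.ndarray) -> list:
    """Order column indices by widest normalized range, then distinct count."""
    n_cols = slice_data.shape[1]
    local_ranges = slice_data.max(axis=0) - slice_data.min(axis=0)
    # Normalize by each attribute's full-dataset range (assumed definition);
    # a zero global range is replaced by 1 to avoid division by zero.
    safe_ranges = np.where(global_ranges > 0, global_ranges, 1)
    normalized = local_ranges / safe_ranges
    distinct = [len(np.unique(slice_data[:, i])) for i in range(n_cols)]
    # Widest normalized range first; ties broken by more distinct values.
    return sorted(range(n_cols), key=lambda i: (-normalized[i], -distinct[i]))


slice_data = np.array([[10, 0, 1], [20, 5, 1], [20, 10, 2]])
global_ranges = np.array([10, 10, 4])
order = sort_qids_by_range(slice_data, global_ranges)
```

Columns 0 and 1 both have a normalized range of 1.0, but column 1 has three distinct values against two for column 0, so it is tried first; column 2 has the narrowest normalized range and comes last.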