Datafly
- class Datafly(dataset: Dataset, k: int, suppression_threshold: int = 0)
Bases: Algorithm
Implementation of the Datafly algorithm.
Datafly is an iterative heuristic: at each step it generalizes the attribute with the highest cardinality (number of unique values), and it repeats until the dataset satisfies k-anonymity.
If the optional suppression threshold is set, Datafly generalizes only until the dataset, excluding up to suppression_threshold records, satisfies k-anonymity.
- Parameters:
dataset (Dataset) – The Dataset object holding the original data and its metadata.
k (int) – The privacy parameter k.
suppression_threshold (int, default 0) – The number of allowed suppressed records, i.e. the maximum number of records that can be removed (suppressed) from the dataset to satisfy k-anonymity.
- Variables:
suppression_threshold (int) – The number of allowed suppressed records.
hierarchies_tracking (dict) – A mapping of attribute names to their current generalization level in the hierarchy.
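The stopping condition described above can be illustrated with a minimal, standalone sketch. It does not use this library's API: the function names `undersized_records` and `satisfies_with_suppression` are hypothetical, and treating the threshold as inclusive ("at most") is an assumption based on the parameter description.

```python
from collections import Counter


def undersized_records(rows, qid_idx, k):
    """Count records whose quasi-identifier combination occurs fewer than k times."""
    groups = Counter(tuple(row[i] for i in qid_idx) for row in rows)
    return sum(size for size in groups.values() if size < k)


def satisfies_with_suppression(rows, qid_idx, k, suppression_threshold=0):
    """True if removing at most suppression_threshold records yields k-anonymity.

    Assumption: the threshold is inclusive (<=); the real class may differ.
    """
    return undersized_records(rows, qid_idx, k) <= suppression_threshold


# Columns: age range, ZIP prefix, diagnosis. Only the first two are quasi-identifiers.
rows = [
    ("30-39", "021**", "flu"),
    ("30-39", "021**", "cold"),
    ("40-49", "021**", "flu"),
]

strict = satisfies_with_suppression(rows, [0, 1], k=2)
relaxed = satisfies_with_suppression(rows, [0, 1], k=2, suppression_threshold=1)
```

Here the single "40-49" record forms a group of size 1, so the data is not 2-anonymous as it stands, but it becomes 2-anonymous once that one record is suppressed.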
Methods
anonymize() – Run the Datafly algorithm.
pick_attribute(np_data, qids_idx, qids) – Pick the attribute with the highest cardinality.
- anonymize()
Run the Datafly algorithm.
Iteratively generalizes the attribute with the highest cardinality. If the current data does not satisfy k-anonymity but the number of outlying records (those in groups smaller than k) is below the suppression_threshold, those records are removed to finalize anonymization.
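The loop described above can be sketched in plain Python. This is an illustrative approximation, not the implementation behind `anonymize()`: the function `datafly_sketch`, the representation of hierarchies as one value-mapping dict per level, and the inclusive threshold comparison are all assumptions made for the example.

```python
from collections import Counter


def datafly_sketch(rows, qid_idx, hierarchies, k, suppression_threshold=0):
    """Generalize the highest-cardinality column until few enough outliers remain.

    hierarchies[col] is a list of dicts (hypothetical format), one per
    generalization level, mapping a current value to its more general value.
    """
    rows = [list(r) for r in rows]
    levels = {col: 0 for col in qid_idx}  # current generalization level per column

    def key(r):
        return tuple(r[i] for i in qid_idx)

    while True:
        groups = Counter(key(r) for r in rows)
        outliers = sum(n for n in groups.values() if n < k)
        if outliers <= suppression_threshold:
            # Suppress the remaining outliers and stop.
            return [r for r in rows if groups[key(r)] >= k], levels

        # Only columns with an unused hierarchy level can be generalized further.
        candidates = [c for c in qid_idx if levels[c] < len(hierarchies[c])]
        if not candidates:
            raise ValueError("hierarchies exhausted before reaching k-anonymity")

        # Datafly heuristic: pick the column with the most distinct values.
        col = max(candidates, key=lambda c: len({r[c] for r in rows}))
        mapping = hierarchies[col][levels[col]]
        for r in rows:
            r[col] = mapping[r[col]]
        levels[col] += 1


rows = [("31", "02138"), ("34", "02139"), ("38", "02138"),
        ("39", "02139"), ("52", "02139")]
hierarchies = {
    0: [{"31": "30-39", "34": "30-39", "38": "30-39", "39": "30-39", "52": "50-59"},
        {"30-39": "*", "50-59": "*"}],
    1: [{"02138": "0213*", "02139": "0213*"}, {"0213*": "*"}],
}

anonymized, levels = datafly_sketch(rows, [0, 1], hierarchies, k=2,
                                    suppression_threshold=1)
```

In this run the age column (five distinct values) is generalized once, after which only the single "50-59" record sits in a group smaller than k, so it is suppressed rather than triggering another round of generalization.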
- pick_attribute(np_data, qids_idx, qids)
Pick the attribute with the highest cardinality.
This heuristic is used to decide which attribute to generalize next, aiming to reduce the uniqueness of records as quickly as possible.
- Parameters:
np_data (numpy.ndarray) – The current state of the data in a NumPy array format.
qids_idx (list) – The column indices of the Quasi-Identifiers.
qids (list) – The names of the Quasi-Identifiers.
- Returns:
int – The column index of the attribute with the highest cardinality.
str – The name of the attribute with the highest cardinality.
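A minimal sketch of this selection step, assuming `qids_idx` and `qids` are parallel lists and that cardinality is counted with `numpy.unique`. The function name `pick_attribute_sketch` is hypothetical, and the actual method may break ties or count values differently.

```python
import numpy as np


def pick_attribute_sketch(np_data, qids_idx, qids):
    """Return (column index, name) of the quasi-identifier with the most unique values."""
    cardinalities = [len(np.unique(np_data[:, i])) for i in qids_idx]
    best = int(np.argmax(cardinalities))  # first maximum wins on ties
    return qids_idx[best], qids[best]


# Columns: age, ZIP code, diagnosis. Only the first two are quasi-identifiers.
data = np.array([
    ["31", "02138", "flu"],
    ["34", "02139", "cold"],
    ["38", "02138", "flu"],
])

idx, name = pick_attribute_sketch(data, [0, 1], ["age", "zip"])
```

The age column has three distinct values and the ZIP column two, so age is selected as the next attribute to generalize.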