Datafly


class Datafly(dataset: Dataset, k: int, suppression_threshold: int = 0)

Bases: Algorithm

Implementation of the Datafly algorithm.

Datafly applies an iterative heuristic to generalize attributes with the highest cardinality (number of unique values) until the dataset satisfies k-anonymity.

If an optional suppression threshold is set, Datafly stops generalizing once the records that still violate k-anonymity are few enough to be suppressed within that threshold, so that the remaining records satisfy k-anonymity.
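To make the stopping condition concrete, the following is a minimal sketch of how the number of records violating k-anonymity could be counted. It is not this library's code: the function name `records_violating_k` and the plain-tuple data layout are assumptions made for illustration.

```python
from collections import Counter


def records_violating_k(rows, qid_indices, k):
    """Count rows whose quasi-identifier combination occurs fewer than k times.

    Hypothetical helper for illustration; not part of the documented API.
    """
    # Project each row onto its quasi-identifier columns.
    keys = [tuple(row[i] for i in qid_indices) for row in rows]
    # Size of each equivalence class (rows sharing the same QID values).
    class_sizes = Counter(keys)
    return sum(1 for key in keys if class_sizes[key] < k)


rows = [
    ("30-39", "F", "flu"),
    ("30-39", "F", "cold"),
    ("40-49", "M", "flu"),
]
violations = records_violating_k(rows, qid_indices=[0, 1], k=2)
```

Here only the third row sits in an equivalence class smaller than k, so a suppression threshold of 1 would allow it to be removed instead of generalizing further.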

Parameters:

dataset (Dataset) – The Dataset object holding the original data and its metadata.
k (int) – The privacy parameter k.
suppression_threshold (int, default 0) – The maximum number of records that may be removed (suppressed) from the dataset to satisfy k-anonymity.

Variables:

suppression_threshold (int) – The number of allowed suppressed records.
hierarchies_tracking (dict) – A mapping of attribute names to their current generalization level in the hierarchy.

Methods

`anonymize`	Run the Datafly algorithm.
`pick_attribute`	Pick the attribute with the highest cardinality.

anonymize()

Run the Datafly algorithm.

Iteratively generalizes the attribute with the highest cardinality. If the data state does not satisfy k-anonymity but the number of outlying records is below the suppression_threshold, those records are removed to finalize anonymization.
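The loop described above can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the library's `anonymize` method: the function `datafly_sketch`, the representation of hierarchies as per-column lists of value-to-parent dictionaries, and the inclusive comparison against the threshold are all assumptions.

```python
from collections import Counter


def datafly_sketch(rows, qid_indices, hierarchies, k, suppression_threshold=0):
    """Generalize the highest-cardinality QID until k-anonymity holds.

    hierarchies maps a column index to a list of dicts; the dict at
    position n maps level-n values to level-(n+1) values (assumed layout).
    Returns the anonymized rows and the generalization level per column.
    """
    rows = [list(r) for r in rows]
    levels = {i: 0 for i in qid_indices}
    while True:
        keys = [tuple(r[i] for i in qid_indices) for r in rows]
        sizes = Counter(keys)
        # Rows in equivalence classes smaller than k.
        outliers = {n for n, key in enumerate(keys) if sizes[key] < k}
        # Assumption: the threshold is treated as inclusive.
        if len(outliers) <= suppression_threshold:
            kept = [tuple(r) for n, r in enumerate(rows) if n not in outliers]
            return kept, levels
        # Columns that still have a generalization level left.
        candidates = [i for i in qid_indices if levels[i] < len(hierarchies[i])]
        if not candidates:
            raise ValueError("hierarchies exhausted before reaching k-anonymity")
        # Pick the column with the most distinct values.
        target = max(candidates, key=lambda i: len({r[i] for r in rows}))
        mapping = hierarchies[target][levels[target]]
        for r in rows:
            r[target] = mapping[r[target]]
        levels[target] += 1


rows = [("31", "F"), ("35", "F"), ("42", "M"), ("47", "M")]
hierarchies = {
    0: [
        {"31": "30-39", "35": "30-39", "42": "40-49", "47": "40-49"},
        {"30-39": "*", "40-49": "*"},
    ],
    1: [{"F": "*", "M": "*"}],
}
anonymized, levels = datafly_sketch(rows, [0, 1], hierarchies, k=2)
```

In this example one generalization step on the age column is enough: the four rows collapse into two equivalence classes of size 2, and the sex column is left at its original level.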

pick_attribute(np_data, qids_idx, qids)

Pick the attribute with the highest cardinality.

This heuristic is used to decide which attribute to generalize next, aiming to reduce the uniqueness of records as quickly as possible.

Parameters:

np_data (numpy.ndarray) – The current state of the data in a NumPy array format.
qids_idx (list) – The column indices of the Quasi-Identifiers.
qids (list) – The names of the Quasi-Identifiers.

Returns:

int – The column index of the attribute with the highest cardinality.
str – The name of the attribute with the highest cardinality.
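As a rough illustration of this heuristic, the selection could be written with NumPy as below. This is a sketch, not the library's implementation: the standalone function name `pick_highest_cardinality` is hypothetical, and it assumes `qids_idx` and `qids` are parallel lists.

```python
import numpy as np


def pick_highest_cardinality(np_data, qids_idx, qids):
    """Return (column index, name) of the QID with the most unique values.

    Hypothetical standalone version; assumes qids_idx and qids are aligned.
    """
    # Number of distinct values in each quasi-identifier column.
    counts = [len(np.unique(np_data[:, idx])) for idx in qids_idx]
    # argmax returns the first maximum, so ties go to the earliest QID.
    best = int(np.argmax(counts))
    return qids_idx[best], qids[best]


data = np.array([
    ["31", "F", "flu"],
    ["35", "F", "cold"],
    ["42", "M", "flu"],
])
idx, name = pick_highest_cardinality(data, [0, 1], ["age", "sex"])
```

With three distinct ages and two distinct sexes, the age column (index 0) is selected for the next generalization step.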
