Datafly

class Datafly(dataset: Dataset, k: int, suppression_threshold: int = 0)

Bases: Algorithm

Implementation of the Datafly algorithm.

Datafly applies an iterative heuristic to generalize attributes with the highest cardinality (number of unique values) until the dataset satisfies k-anonymity.

If the optional suppression threshold is set, Datafly stops generalizing once the dataset, minus at most suppression_threshold records, satisfies k-anonymity.
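The stopping condition above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function names, the tuple-based row layout, and the inclusive comparison against the threshold are all assumptions made for the example.

```python
from collections import Counter


def count_violating_records(rows, qid_positions, k):
    """Count records whose quasi-identifier combination occurs fewer than k times.

    Hypothetical helper: `rows` is a list of tuples and `qid_positions`
    lists the column positions of the quasi-identifiers.
    """
    keys = [tuple(row[i] for i in qid_positions) for row in rows]
    class_sizes = Counter(keys)  # size of each equivalence class
    return sum(1 for key in keys if class_sizes[key] < k)


def is_acceptable(rows, qid_positions, k, suppression_threshold=0):
    """True if removing at most `suppression_threshold` records yields k-anonymity.

    Assumption: the bound is treated as inclusive (<=).
    """
    return count_violating_records(rows, qid_positions, k) <= suppression_threshold


# Two records share ("3*", "130**"); the third one is alone in its class.
rows = [
    ("3*", "130**", "flu"),
    ("3*", "130**", "cold"),
    ("4*", "148**", "flu"),
]

print(count_violating_records(rows, [0, 1], 2))             # 1
print(is_acceptable(rows, [0, 1], 2))                       # False
print(is_acceptable(rows, [0, 1], 2, suppression_threshold=1))  # True
```

With the default threshold of 0 the check reduces to plain k-anonymity; a positive threshold lets generalization stop earlier at the cost of dropping a few records.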

Parameters:
  • dataset (Dataset) – The Dataset object holding the original data and its metadata.

  • k (int) – The privacy parameter k: every combination of quasi-identifier values must be shared by at least k records.

  • suppression_threshold (int, default 0) –

    The maximum number of records that may be removed (suppressed) from the dataset to satisfy k-anonymity.

Variables:
  • suppression_threshold (int) – The number of allowed suppressed records.

  • hierarchies_tracking (dict) – A mapping of attribute names to their current generalization level in the hierarchy.

Methods

  • anonymize() – Run the Datafly algorithm.

  • pick_attribute(np_data, qids_idx, qids) – Pick the attribute with the highest cardinality.

anonymize()

Run the Datafly algorithm.

Iteratively generalizes the attribute with the highest cardinality. If the current data does not yet satisfy k-anonymity but the number of outlying records (records in equivalence classes smaller than k) is below the suppression_threshold, those records are removed to finalize anonymization.
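The loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `datafly_sketch`, `generalize_value`, the character-masking hierarchy, the `max_level` cap, and the inclusive threshold comparison are all hypothetical choices made to keep the example self-contained.

```python
from collections import Counter


def generalize_value(value, level):
    """Hypothetical hierarchy: mask the last `level` characters with '*'."""
    keep = max(len(value) - level, 0)
    return value[:keep] + "*" * (len(value) - keep)


def datafly_sketch(rows, qid_positions, k, suppression_threshold=0, max_level=5):
    """Generalize the highest-cardinality quasi-identifier until few enough outliers remain."""
    original = [tuple(r) for r in rows]
    current = list(original)
    levels = {i: 0 for i in qid_positions}  # plays the role of a level tracker

    while True:
        keys = [tuple(r[i] for i in qid_positions) for r in current]
        sizes = Counter(keys)
        outliers = {idx for idx, key in enumerate(keys) if sizes[key] < k}

        # Assumption: the bound is inclusive; the text above says "below".
        if len(outliers) <= suppression_threshold:
            kept = [r for idx, r in enumerate(current) if idx not in outliers]
            return kept, levels

        # Only attributes with hierarchy levels left can be generalized further.
        candidates = [i for i in qid_positions if levels[i] < max_level]
        if not candidates:
            raise ValueError("hierarchies exhausted before reaching k-anonymity")

        # Heuristic: generalize the attribute with the most distinct values.
        target = max(candidates, key=lambda i: len({r[i] for r in current}))
        levels[target] += 1
        current = [
            tuple(
                generalize_value(orig[i], levels[i]) if i == target else cur[i]
                for i in range(len(cur))
            )
            for orig, cur in zip(original, current)
        ]


rows = [("13053", "28"), ("13068", "29"), ("13068", "21"), ("13053", "23")]
anonymized, levels = datafly_sketch(rows, qid_positions=[0, 1], k=2)
print(levels)      # {0: 0, 1: 1}
print(anonymized)  # ages become "2*", ZIP codes stay untouched
```

In this example one generalization step on the age column is enough: it has four distinct values against two for the ZIP code, so it is picked first, and afterwards every equivalence class holds two records.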

pick_attribute(np_data, qids_idx, qids)

Pick the attribute with the highest cardinality.

This heuristic is used to decide which attribute to generalize next, aiming to reduce the uniqueness of records as quickly as possible.

Parameters:
  • np_data (numpy.ndarray) – The current state of the data in a NumPy array format.

  • qids_idx (list) – The column indices of the Quasi-Identifiers.

  • qids (list) – The names of the Quasi-Identifiers.

Returns:

  • int – The column index of the attribute with the highest cardinality.

  • str – The name of the attribute with the highest cardinality.
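A minimal sketch of this heuristic, assuming `np_data` is a two-dimensional array with one column per attribute. The function name `pick_attribute_sketch` and the tie-breaking rule (first attribute wins) are assumptions for illustration, not the library's documented behavior.

```python
import numpy as np


def pick_attribute_sketch(np_data, qids_idx, qids):
    """Return (column index, name) of the quasi-identifier with the most unique values."""
    # Cardinality = number of distinct values in each quasi-identifier column.
    cardinalities = [len(np.unique(np_data[:, idx])) for idx in qids_idx]
    # np.argmax returns the first maximum, so ties go to the earlier attribute.
    best = int(np.argmax(cardinalities))
    return qids_idx[best], qids[best]


data = np.array(
    [
        ["13053", "28", "flu"],
        ["13068", "29", "cold"],
        ["13068", "21", "flu"],
    ]
)

# ZIP code has 2 distinct values, age has 3, so age is picked.
print(pick_attribute_sketch(data, [0, 1], ["zip", "age"]))  # (1, 'age')
```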