NCP

class NCP

Bases: object

Normalized Certainty Penalty (NCP).

NCP applies a value-wise penalty to each value of every QID attribute and normalizes the result to [0, 1] by the data size \(|D|\) and the number of QIDs \(|Q|\). A lower NCP indicates lower information loss.

\[NCP = \frac{1}{|D|} * \sum^{|D|}\frac{\sum P_{num}(v_{num}) + \sum P_{cat}(v_{cat})}{|Q|}\]

where \(P_{num}(v_{num})\) is the penalty for a value of a numerical QID attribute and \(P_{cat}(v_{cat})\) is the penalty for a value of a categorical QID attribute. \(P_{num}\) and \(P_{cat}\) are calculated differently depending on the anonymization method.
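As a minimal sketch of the averaging in the formula above, the snippet below assumes the per-value penalties have already been computed. The function name `ncp_from_penalties` and its input layout (one row per record, one penalty per QID) are hypothetical and not part of this class.

```python
def ncp_from_penalties(penalties):
    """Average per-value penalties into a single NCP score.

    `penalties` is a hypothetical input: one row per record, and each
    row holds one penalty in [0, 1] per QID attribute.
    """
    if not penalties:
        raise ValueError("penalties must contain at least one record")
    n_records = len(penalties)   # |D|
    n_qids = len(penalties[0])   # |Q|
    # Inner term: a record's summed penalties divided by |Q|.
    per_record = [sum(row) / n_qids for row in penalties]
    # Outer term: mean over all |D| records.
    return sum(per_record) / n_records


# Three records with two QIDs each (one numerical, one categorical).
example = [
    [0.5, 0.0],
    [0.5, 1.0],
    [0.0, 0.0],
]
score = ncp_from_penalties(example)  # (0.25 + 0.75 + 0.0) / 3
```

Because every penalty lies in [0, 1], dividing by \(|Q|\) and then by \(|D|\) keeps the score in [0, 1] as well.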

Methods

calculate_for_generalization

Calculate NCP for generalization anonymization.

calculate_for_local_recoding_mean_mode

Calculate NCP for a local recoding algorithm with mean-mode group anonymization.

calculate_for_local_recoding_summarization

Calculate NCP for a local recoding algorithm with summarization group anonymization.

static calculate_for_generalization(org_data: DataFrame, anon_data: DataFrame, hierarchies: HierarchiesDict, qids_idx: list, is_categorical: list)

Calculate NCP for generalization anonymization.

When a numerical value is generalized to a (local) numerical range, it becomes ambiguous within that range. Thus, \(P_{num} = \frac{local\_range}{global\_range}\).

A categorical value becomes ambiguous among the leaves under the common ancestor of its equivalence class. Thus, \(P_{cat} = \frac{leaves\_under\_common\_ancestor}{all\_leaves}\).

Parameters:
  • org_data (DataFrame) – The original data.

  • anon_data (DataFrame) – The anonymized data.

  • hierarchies (HierarchiesDict) – Hierarchy definitions for the QID attributes.

  • qids_idx (list) – The column indices of the QID attributes.

  • is_categorical (list) – A list of booleans indicating if a QID attribute is categorical.

Returns:

float – The NCP score.
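The two penalties described for calculate_for_generalization can be illustrated with a small standalone sketch. It follows the formulas as written above; the helper names and the example values are hypothetical and are not taken from the library's implementation.

```python
def numerical_penalty(local_low, local_high, global_low, global_high):
    """P_num = local_range / global_range (hypothetical helper)."""
    global_range = global_high - global_low
    if global_range == 0:
        # Assumption: a constant attribute carries no information to lose.
        return 0.0
    return (local_high - local_low) / global_range


def categorical_penalty(leaves_under_common_ancestor, all_leaves):
    """P_cat = leaves_under_common_ancestor / all_leaves (hypothetical helper)."""
    return leaves_under_common_ancestor / all_leaves


# An age generalized to [30, 40] when ages in the data span [20, 70].
p_num = numerical_penalty(30, 40, 20, 70)   # 10 / 50 = 0.2

# A city generalized to a region that covers 3 of the 12 leaf cities.
p_cat = categorical_penalty(3, 12)          # 0.25
```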

static calculate_for_local_recoding_mean_mode(org_data: DataFrame, groups: list, qids_idx: list, is_categorical: list)

Calculate NCP for a local recoding algorithm with mean-mode group anonymization.

When a numerical value is generalized to a (local) numerical range, it becomes ambiguous within that range. Thus, \(P_{num} = \frac{local\_range}{global\_range}\).

For a categorical value, if the original value differs from the mode of its equivalence class, the entire value is lost. Thus, \(P_{cat}(v_{cat}) = 1 \text{ if } v_{cat} \neq mode,\ 0 \text{ otherwise}\).

Parameters:
  • org_data (DataFrame) – The original data.

  • groups (list) – The anonymized data.

  • qids_idx (list) – The column indices of the QID attributes.

  • is_categorical (list) – A list of booleans indicating if a QID attribute is categorical.

Returns:

float – The NCP score.
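The categorical penalty described for calculate_for_local_recoding_mean_mode can be sketched for a single equivalence class as follows. The function `mode_penalties` is a hypothetical helper, and breaking ties by first occurrence is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter


def mode_penalties(values):
    """Return 1 for each value that differs from the group's mode, else 0.

    Hypothetical helper; ties for the mode go to the value seen first.
    """
    mode = Counter(values).most_common(1)[0][0]
    return [0 if value == mode else 1 for value in values]


# One equivalence class for a categorical QID attribute.
group = ["flu", "flu", "cold", "flu"]
penalties = mode_penalties(group)  # [0, 0, 1, 0]
```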

static calculate_for_local_recoding_summarization(org_data: DataFrame, groups: list, qids_idx: list, is_categorical: list)

Calculate NCP for a local recoding algorithm with summarization group anonymization.

When a numerical value is generalized to a (local) numerical range, it becomes ambiguous within that range. Thus, \(P_{num} = \frac{local\_range}{global\_range}\).

A categorical value becomes ambiguous among the unique values of its QID attribute in its equivalence class, denoted as \(Q^{EQ}(v_{cat})\). Thus, \(P_{cat}(v_{cat}) = \frac{1}{|Q^{EQ}(v_{cat}).unique|}\).

Parameters:
  • org_data (DataFrame) – The original data.

  • groups (list) – The anonymized data.

  • qids_idx (list) – The column indices of the QID attributes.

  • is_categorical (list) – A list of booleans indicating if a QID attribute is categorical.

Returns:

float – The NCP score.
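Taking the categorical formula for calculate_for_local_recoding_summarization exactly as written above, a per-group penalty might be computed as in this sketch. The helper name is hypothetical, and the code mirrors the documented formula rather than the library's source.

```python
def summarization_cat_penalty(group_values):
    """P_cat = 1 / |unique values in the equivalence class|, as documented.

    Hypothetical helper operating on one categorical QID attribute
    within one equivalence class.
    """
    unique_count = len(set(group_values))
    if unique_count == 0:
        raise ValueError("group_values must not be empty")
    return 1 / unique_count


# One equivalence class with three distinct occupations.
group = ["nurse", "doctor", "nurse", "surgeon"]
p_cat = summarization_cat_penalty(group)  # 1 / 3
```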

See also

k_anonymization.algorithms.local_recoding.GroupAnonymizationBuiltIn.SUMMARIZATION

Anonymize a group by creating a summary range or set.