OKA

class OKA(dataset: Dataset, k: int, group_anonymization: GroupAnonymization = GroupAnonymizationBuiltIn.SUMMARIZATION, seed: int = None, parallel: bool = False, cpu_cores: int = 3)

Bases: LocalRecodingAlgorithm

Implementation of the One-Pass K-Means (OKA) clustering algorithm.

OKA adapts the idea of the K-Means clustering algorithm. It initializes all clusters (groups) of records at once, each seeded with one randomly chosen record, and then assigns the remaining records one by one to the cluster that yields the minimal information loss. OKA then performs a one-time adjustment step: the records furthest from the centroid are removed from clusters of size > k and redistributed to clusters of size < k, until every cluster contains at least k records.

Parameters:
  • dataset (Dataset) – The Dataset object holding the original data and its metadata.

  • k (int) – The privacy parameter k.

  • group_anonymization (GroupAnonymization) – The method to anonymize the resulting clusters after applying local recoding. It is possible to use an example method in GroupAnonymizationBuiltIn, or create a custom method custom_group_anonymization(group: list, props: Any) -> list. Default: GroupAnonymizationBuiltIn.SUMMARIZATION

  • seed (int) – Random seed for the initial record selection to ensure reproducibility.

  • parallel (bool) – Boolean flag to enable parallel processing.

  • cpu_cores (int) – The number of CPU cores to utilize when parallel is True.
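As an illustration of the custom_group_anonymization(group: list, props: Any) -> list signature mentioned above, here is a minimal sketch of a custom method. It assumes that group is a list of purely numeric rows and it ignores props entirely; both are assumptions made for this example, since the actual record layout and the contents of props are not described here.

```python
from statistics import mean
from typing import Any


def custom_group_anonymization(group: list, props: Any) -> list:
    """Hypothetical example: replace every column with the cluster mean."""
    if not group:
        return []
    # Transpose rows into columns and average each column.
    averaged = [mean(column) for column in zip(*group)]
    # Repeat the averaged row once per record so the group size is unchanged.
    return [list(averaged) for _ in group]


# Assumed record layout: [age, income]
cluster = [[25, 50000], [35, 70000], [30, 60000]]
anonymized = custom_group_anonymization(cluster, props=None)
```

After the call, all three rows in anonymized are identical, so the records in the cluster can no longer be told apart by these columns.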

Variables:
  • is_parallel (bool) – Whether the algorithm is running in parallel mode.

  • information_loss (float) – The total information loss calculated across all clusters.

  • rand_idx (list) – The indices of the records randomly selected to serve as initial cluster seeds.

Methods

  • do_local_recoding – Perform the OKA clustering algorithm.

  • find_best_cluster – Find the closest cluster centroid for a given record.

  • get_adjusting_records – Extract excess records from clusters that exceed size k.

  • init_clusters – Initialize all clusters, each with a random record.

do_local_recoding()

Perform the OKA clustering algorithm.

The workflow consists of:

  1. Initialize clusters with random seeds.

  2. Clustering stage: Assign each remaining record to the closest cluster, measured by its distance to the cluster centroid.

  3. Adjustment stage: Move records from “over-full” clusters (size > k) to “under-full” clusters (size < k) so that every cluster contains at least k records.

Returns:

list – The final list of clusters.
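The three stages can be sketched in plain Python as follows. This is a simplified illustration and not the library's implementation: it assumes numeric records, Euclidean distance to the centroid as the assignment criterion, and a trimming rule that cuts every over-full cluster down to exactly k before refilling. The name toy_oka is hypothetical.

```python
import math
import random


def centroid(cluster):
    """Column-wise mean of a non-empty cluster of numeric records."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def toy_oka(records, k, seed=None):
    n_clusters = len(records) // k
    if n_clusters == 0:
        raise ValueError("need at least k records")
    rng = random.Random(seed)

    # 1. Initialization: one random seed record per cluster.
    seed_idx = set(rng.sample(range(len(records)), n_clusters))
    clusters = [[records[i]] for i in sorted(seed_idx)]

    # 2. Clustering stage: each remaining record joins the nearest centroid.
    for i, rec in enumerate(records):
        if i not in seed_idx:
            best = min(clusters, key=lambda c: math.dist(rec, centroid(c)))
            best.append(rec)

    # 3. Adjustment stage: free the furthest records of over-full clusters.
    spare = []
    for cluster in clusters:
        if len(cluster) > k:
            center = centroid(cluster)
            cluster.sort(key=lambda r: math.dist(r, center))
            while len(cluster) > k:
                spare.append(cluster.pop())
    # Refill under-full clusters with the nearest freed records.
    for cluster in clusters:
        while len(cluster) < k:
            center = centroid(cluster)
            nearest = min(spare, key=lambda r: math.dist(r, center))
            spare.remove(nearest)
            cluster.append(nearest)
    # Any records still unassigned go back to their nearest cluster.
    for rec in spare:
        min(clusters, key=lambda c: math.dist(rec, centroid(c))).append(rec)
    return clusters


records = [(float(i), float(i % 3)) for i in range(10)]
result = toy_oka(records, k=3, seed=0)
```

Because the number of clusters is the record count divided by k, rounded down, the freed records are always sufficient to bring every cluster up to k in this sketch.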

find_best_cluster(record: list, clusters: list)

Find the closest cluster centroid for a given record.

Calculates the distance between the given record and every current cluster centroid, and selects the most similar (closest) cluster.

Parameters:
  • record (list) – The record to be assigned.

  • clusters (list[list]) – The list of clusters.

Returns:

int – The index of the cluster with the minimum distance.
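A minimal sketch of this lookup, assuming numeric records and Euclidean distance (the distance measure actually used is not specified here). The function name find_best_cluster_sketch is hypothetical.

```python
import math


def find_best_cluster_sketch(record, clusters):
    """Return the index of the cluster whose centroid is nearest to record."""
    def centroid(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]

    distances = [math.dist(record, centroid(c)) for c in clusters]
    return distances.index(min(distances))


clusters = [[[1.0, 1.0], [3.0, 1.0]], [[10.0, 10.0]]]
best = find_best_cluster_sketch([2.0, 2.0], clusters)  # centroid [2.0, 1.0] is nearest
```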

get_adjusting_records(clusters: list)

Extract excess records from clusters that exceed size k.

Used during the adjustment stage to free up records that can be reassigned to clusters that are not yet k-anonymous (that is, clusters with fewer than k records).

Parameters:
  • clusters (list) – The list of clusters with more than k members.

Returns:

list – A list of records removed from the provided clusters.
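The extraction step might look like the sketch below. It assumes numeric records, Euclidean distance, and that each over-full cluster is trimmed to exactly k records. Unlike the documented method, this hypothetical helper takes k as an explicit argument, and it computes each centroid once rather than after every removal.

```python
import math


def get_adjusting_records_sketch(clusters, k):
    """Trim each over-full cluster to k records and return the removed ones."""
    removed = []
    for cluster in clusters:
        if len(cluster) <= k:
            continue
        center = [sum(col) / len(cluster) for col in zip(*cluster)]
        # Sort nearest-first so that pop() removes the furthest record.
        cluster.sort(key=lambda rec: math.dist(rec, center))
        while len(cluster) > k:
            removed.append(cluster.pop())
    return removed


clusters = [[[0.0], [1.0], [2.0], [9.0]], [[5.0], [6.0]]]
freed = get_adjusting_records_sketch(clusters, k=2)
```

In this example the first cluster has centroid 3.0, so 9.0 and then 0.0 are removed, while the second cluster already has k records and is left untouched.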

init_clusters()

Initialize all clusters, each with a random record.

The number of seeds, and therefore of clusters, is floor(D / k), where D is the total number of records and k is the privacy parameter.

Returns:

list – A list of initialized cluster objects.
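The seed count and selection can be sketched as below. This is an assumption-laden illustration: clusters are represented as plain lists, random.Random is used for reproducibility, and the helper name init_clusters_sketch is hypothetical. The returned index list plays the role of rand_idx.

```python
import random


def init_clusters_sketch(records, k, seed=None):
    """Pick floor(D / k) distinct records at random, one per new cluster."""
    rng = random.Random(seed)
    n_seeds = len(records) // k  # integer division rounds down
    rand_idx = rng.sample(range(len(records)), n_seeds)
    clusters = [[records[i]] for i in rand_idx]
    return clusters, rand_idx


records = [[i] for i in range(10)]
clusters, rand_idx = init_clusters_sketch(records, k=3, seed=42)
```

With D = 10 and k = 3 this yields 3 clusters; once each holds k records, one record is left over and has to join an existing cluster.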