OKA
- class OKA(dataset: Dataset, k: int, group_anonymization: GroupAnonymization = <function GroupAnonymizationBuiltIn.SUMMARIZATION>, seed: int = None, parallel: bool = False, cpu_cores: int = 3)
Bases: LocalRecodingAlgorithm
Implementation of the One-Pass K-Means (OKA) clustering algorithm.
OKA adopts the idea of the K-Means clustering algorithm. It initializes all clusters (groups) of records at once, each seeded with one randomly chosen record, and then assigns the remaining records one by one to the cluster that yields the minimal information loss. Finally, OKA performs a one-time adjustment step: in clusters with more than k records, the records furthest from the centroid are removed and redistributed to clusters with fewer than k records, until every cluster contains at least k records.
- Parameters:
dataset (Dataset) – The Dataset object holding the original data and its metadata.
k (int) – The privacy parameter k.
group_anonymization (GroupAnonymization) – The method used to anonymize the resulting clusters after local recoding. Use one of the built-in methods in GroupAnonymizationBuiltIn, or supply a custom method custom_group_anonymization(group: list, props: Any) -> list. Default: GroupAnonymizationBuiltIn.SUMMARIZATION
seed (int) – Random seed for the initial record selection to ensure reproducibility.
parallel (bool) – Boolean flag to enable parallel processing.
cpu_cores (int) – The number of CPU cores to use when parallel is True.
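The custom-method signature described for group_anonymization can be illustrated with a minimal sketch. It assumes each record in group is a list of numeric values and leaves props unused, since this page does not specify its contents. The name range_anonymization is hypothetical and not part of the library.

```python
from typing import Any


def range_anonymization(group: list, props: Any = None) -> list:
    """Replace every value in a cluster with its column-wise 'min-max' range.

    Assumption: `group` is a list of equally long records holding numbers.
    `props` is accepted only to match the documented signature and is unused.
    """
    if not group:
        return []
    columns = list(zip(*group))  # transpose records into columns
    summary = [
        str(min(col)) if min(col) == max(col) else f"{min(col)}-{max(col)}"
        for col in columns
    ]
    # Every record in the cluster receives the same generalized values.
    return [list(summary) for _ in group]


cluster = [[25, 50000], [31, 52000], [28, 50000]]
anonymized = range_anonymization(cluster)
```

A function of this shape would presumably be passed as the group_anonymization argument in place of the built-in default.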
- Variables:
is_parallel (bool) – Whether the algorithm is running in parallel mode.
information_loss (float) – The total information loss calculated across all clusters.
rand_idx (list) – The indices of the records randomly selected to serve as initial cluster seeds.
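Methods
do_local_recoding() – Perform the OKA clustering algorithm.
find_best_cluster(record, clusters) – Find the closest cluster centroid for a given record.
get_adjusting_records(clusters) – Extract excess records from clusters that exceed size k.
init_clusters() – Initialize all clusters, each with a random record.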
- do_local_recoding()
Perform the OKA clustering algorithm.
The workflow consists of:
Initialize clusters with random seeds.
Clustering stage: Assign every record to the closest cluster based on distance to centroid.
Adjustment stage: Rebalance records from “over-full” clusters (> k) to “under-full” clusters (< k) to ensure all clusters are valid.
- Returns:
list – The final list of clusters.
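The three stages of the workflow can be sketched end to end as follows. This is a simplified illustration, not the library's implementation: it assumes purely numeric records, uses Euclidean distance to the centroid as the similarity measure, and places any records left over after the adjustment stage in their nearest cluster. The names oka_sketch and centroid are hypothetical.

```python
import math
import random


def centroid(cluster):
    """Column-wise mean of the records in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def oka_sketch(records, k, seed=None):
    """Simplified one-pass k-means over numeric records (illustrative only)."""
    rng = random.Random(seed)
    n_clusters = len(records) // k  # round_down(D / k)
    seed_idx = set(rng.sample(range(len(records)), n_clusters))

    # 1. Initialization: one random record per cluster.
    clusters = [[records[i]] for i in sorted(seed_idx)]

    # 2. Clustering stage: assign each remaining record to the nearest centroid.
    for i, rec in enumerate(records):
        if i in seed_idx:
            continue
        best = min(range(n_clusters),
                   key=lambda c: math.dist(rec, centroid(clusters[c])))
        clusters[best].append(rec)

    # 3. Adjustment stage: pull the furthest records out of over-full clusters.
    spare = []
    for cl in clusters:
        while len(cl) > k:
            cen = centroid(cl)
            far = max(cl, key=lambda r: math.dist(r, cen))
            cl.remove(far)
            spare.append(far)

    # Fill under-full clusters first; once none remain, leftovers
    # (an assumption of this sketch) go to their nearest cluster.
    for rec in spare:
        under = [c for c in range(n_clusters) if len(clusters[c]) < k]
        candidates = under or range(n_clusters)
        best = min(candidates,
                   key=lambda c: math.dist(rec, centroid(clusters[c])))
        clusters[best].append(rec)
    return clusters
```

Because round_down(D/k) clusters can hold at most D records with k each, the records freed in step 3 are always enough to bring every under-full cluster up to k, provided the dataset has at least k records.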
- find_best_cluster(record: list, clusters: list)
Find the closest cluster centroid for a given record.
Calculates the distance between the given record and every current cluster centroid to find the most similar cluster.
- Parameters:
record (list) – The record to be assigned.
clusters (list[list]) – The list of clusters.
- Returns:
int – The index of the cluster with the minimum distance.
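As a rough illustration of the nearest-centroid search, the sketch below assumes numeric records and Euclidean distance. The distance measure the library actually uses is not stated on this page, and find_best_cluster_sketch is a hypothetical stand-in rather than the method itself.

```python
import math


def find_best_cluster_sketch(record, clusters):
    """Return the index of the cluster whose centroid is nearest to `record`."""
    best_idx, best_dist = -1, math.inf
    for idx, cluster in enumerate(clusters):
        # Centroid = column-wise mean of the cluster's records.
        center = [sum(col) / len(cluster) for col in zip(*cluster)]
        d = math.dist(record, center)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx


clusters = [[[1.0, 1.0], [3.0, 1.0]], [[10.0, 10.0]]]
nearest = find_best_cluster_sketch([2.5, 0.0], clusters)
```

Here the first cluster has centroid (2.0, 1.0), which is much closer to the record than (10.0, 10.0), so index 0 is returned.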
- get_adjusting_records(clusters: list)
Extract excess records from clusters that exceed size k.
Used during the adjustment stage to free up records that can be reassigned to clusters that are not yet k-anonymous.
- Parameters:
clusters (list) – The list of clusters with more than k members.
- Returns:
list – A list of records removed from the provided clusters.
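One way to picture the extraction step is the sketch below. It assumes numeric records, Euclidean distance, and a centroid computed once per cluster before any record is removed. It also takes k as an explicit argument, whereas the documented method receives only clusters. The function name is hypothetical.

```python
import math


def get_adjusting_records_sketch(clusters, k):
    """Trim each over-full cluster to k records, returning the removed ones."""
    removed = []
    for cluster in clusters:
        excess = len(cluster) - k
        if excess <= 0:
            continue  # cluster does not exceed size k
        center = [sum(col) / len(cluster) for col in zip(*cluster)]
        # Sort nearest-first, so the furthest records sit at the end.
        cluster.sort(key=lambda rec: math.dist(rec, center))
        removed.extend(cluster[-excess:])
        del cluster[-excess:]
    return removed


clusters = [[[0.0], [1.0], [2.0], [10.0]], [[5.0], [6.0]]]
removed = get_adjusting_records_sketch(clusters, k=3)
```

In this example only the first cluster is over-full, and its outlying record [10.0] is the one released for redistribution.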
- init_clusters()
Initialize all clusters, each with a random record.
The number of seeds is calculated as round_down(D/k), where D is the total number of records.
- Returns:
list – A list of initialized cluster objects.
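The seed-count formula can be made concrete with a small sketch. It assumes clusters are plain lists of records and returns the chosen indices alongside them for inspection, whereas the documented method returns only the clusters. The name init_clusters_sketch is hypothetical.

```python
import random


def init_clusters_sketch(records, k, seed=None):
    """Pick floor(D / k) distinct random records and start one cluster from each."""
    rng = random.Random(seed)
    n_seeds = len(records) // k  # round_down(D / k)
    rand_idx = rng.sample(range(len(records)), n_seeds)
    clusters = [[records[i]] for i in rand_idx]
    return clusters, rand_idx


records = [[age] for age in range(20, 31)]  # D = 11 records
clusters, rand_idx = init_clusters_sketch(records, k=3, seed=42)
```

With D = 11 and k = 3, round_down(11/3) gives three single-record clusters. Passing the same seed reproduces the same selection.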