OKA

`OKA`

class OKA(dataset: Dataset, k: int, group_anonymization: GroupAnonymization = GroupAnonymizationBuiltIn.SUMMARIZATION, seed: int = None, parallel: bool = False, cpu_cores: int = 3)

Bases: LocalRecodingAlgorithm

Implementation of the One-Pass K-Means (OKA) clustering algorithm.

OKA adopts the idea of the K-Means clustering algorithm. It initializes all clusters (groups) of records at once, each with a randomly chosen seed record, and assigns the remaining records to them one by one, choosing for each record the cluster with the smallest information loss. OKA then performs a one-time adjustment step: the records furthest from the centroid in clusters of size > k are removed and redistributed to clusters of size < k, until every cluster contains at least k records.

Parameters:

dataset (Dataset) – The Dataset object holding the original data and its metadata.
k (int) – The privacy parameter k.
group_anonymization (GroupAnonymization) – The method used to anonymize the resulting clusters after local recoding. Use one of the built-in methods in GroupAnonymizationBuiltIn, or supply a custom function custom_group_anonymization(group: list, props: Any) -> list. Default: GroupAnonymizationBuiltIn.SUMMARIZATION.
seed (int) – Random seed for the initial record selection to ensure reproducibility.
parallel (bool) – Boolean flag to enable parallel processing.
cpu_cores (int) – The number of CPU cores to utilize when parallel is True.

Variables:

is_parallel (bool) – Whether the algorithm is running in parallel mode.
information_loss (float) – The total information loss calculated across all clusters.
rand_idx (list) – The indices of the records randomly selected to serve as initial cluster seeds.

Methods

`do_local_recoding`	Perform the OKA clustering algorithm.
`find_best_cluster`	Find the closest cluster centroid for a given record.
`get_adjusting_records`	Extract excess records from clusters that exceed size k.
`init_clusters`	Initialize all clusters, each with a random record.

do_local_recoding()

Perform the OKA clustering algorithm.

The workflow consists of:

1. Initialization: Create the clusters, each seeded with one randomly selected record.
2. Clustering stage: Assign every remaining record to the closest cluster, based on its distance to the cluster centroid.
3. Adjustment stage: Move records from "over-full" clusters (size > k) to "under-full" clusters (size < k) so that every cluster holds at least k records.

Returns:

list – The final list of clusters.
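The three stages above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes purely numeric records, uses Euclidean distance to the centroid as a stand-in for the information-loss measure, and the names `oka_sketch`, `centroid`, and `nearest` are hypothetical helpers introduced only for this example.

```python
import math
import random


def centroid(cluster):
    """Column-wise mean of a cluster of numeric records."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def nearest(record, clusters):
    """Index of the cluster whose centroid is closest to the record."""
    distances = [math.dist(record, centroid(c)) for c in clusters]
    return distances.index(min(distances))


def oka_sketch(records, k, seed=None):
    """Hypothetical one-pass k-means sketch (assumes numeric records)."""
    if len(records) < k:
        raise ValueError("need at least k records")
    rng = random.Random(seed)

    # Stage 1: floor(D / k) clusters, each seeded with one random record.
    seed_idx = set(rng.sample(range(len(records)), len(records) // k))
    clusters = [[records[i]] for i in sorted(seed_idx)]

    # Stage 2: assign every remaining record to the closest cluster.
    for i, rec in enumerate(records):
        if i not in seed_idx:
            clusters[nearest(rec, clusters)].append(rec)

    # Stage 3a: free the furthest records from clusters larger than k.
    freed = []
    for c in clusters:
        if len(c) > k:
            center = centroid(c)
            c.sort(key=lambda r: math.dist(r, center))
            freed.extend(c[k:])
            del c[k:]

    # Stage 3b: top up clusters smaller than k with the closest freed records.
    for c in clusters:
        while len(c) < k:
            center = centroid(c)
            rec = min(freed, key=lambda r: math.dist(r, center))
            freed.remove(rec)
            c.append(rec)

    # Any records still unassigned go back to their closest cluster.
    for rec in freed:
        clusters[nearest(rec, clusters)].append(rec)
    return clusters


records = [[25, 50], [31, 62], [47, 40], [52, 81], [38, 55], [60, 73], [29, 48]]
result = oka_sketch(records, k=3, seed=42)
```

With 7 records and k = 3 the sketch builds two clusters. Because the number of clusters is floor(D / k), the records freed in stage 3a are always enough to bring every under-full cluster up to k.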

find_best_cluster(record: list, clusters: list)

Find the closest cluster centroid for a given record.

Calculates the distance between the given record and every current cluster centroid, and selects the most similar (closest) cluster.

Parameters:

record (list) – The record to be assigned.
clusters (list[list]) – The list of clusters.

Returns:

int – The index of the cluster with the minimum distance.
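As a rough illustration of the nearest-centroid lookup, the sketch below assumes numeric records and Euclidean distance; the library's actual distance measure may differ. `centroid` and `find_best_cluster_sketch` are hypothetical names used only here.

```python
import math


def centroid(cluster):
    """Column-wise mean of a cluster of numeric records (assumed representation)."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def find_best_cluster_sketch(record, clusters):
    """Return the index of the cluster whose centroid is closest to the record."""
    distances = [math.dist(record, centroid(c)) for c in clusters]
    return distances.index(min(distances))


clusters = [[[25, 50], [29, 48]], [[60, 73], [52, 81]]]
best = find_best_cluster_sketch([31, 62], clusters)  # centroids: [27, 49] and [56, 77]
```

Here the record [31, 62] lies about 13.6 from the first centroid and about 29.2 from the second, so the first cluster (index 0) is selected.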

get_adjusting_records(clusters: list)

Extract excess records from clusters that exceed size k.

Used during the adjustment stage to free up records that can be reassigned to clusters that are not yet k-anonymous, that is, clusters with fewer than k records.

Parameters:

clusters (list) – The list of clusters with more than k members.

Returns:

list – A list of records removed from the provided clusters.
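One way to picture the extraction step is sketched below. It assumes numeric records and Euclidean distance, and it takes k as an explicit argument, whereas the documented method receives only `clusters` (presumably reading k from the instance). `get_adjusting_records_sketch` is a hypothetical name.

```python
import math


def get_adjusting_records_sketch(clusters, k):
    """Trim each over-full cluster to k records, returning the furthest ones."""
    removed = []
    for cluster in clusters:
        if len(cluster) <= k:
            continue  # nothing to free from this cluster
        n = len(cluster)
        center = [sum(col) / n for col in zip(*cluster)]
        # Keep the k records closest to the centroid; free the rest.
        cluster.sort(key=lambda r: math.dist(r, center))
        removed.extend(cluster[k:])
        del cluster[k:]
    return removed


clusters = [[[1, 1], [2, 2], [3, 3], [10, 10]], [[20, 20], [21, 21]]]
freed = get_adjusting_records_sketch(clusters, k=2)
```

In this example the first cluster has centroid [4, 4], so [10, 10] and [1, 1] are the two furthest records and are freed; the second cluster already has exactly k records and is left untouched.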

init_clusters()

Initialize all clusters, each with a random record.

The number of seeds is calculated as round_down(D/k), where D is the total number of records.

Returns:

list – A list of initialized cluster objects.
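The seeding rule can be illustrated with a short sketch. It represents each cluster as a plain list holding one record, which is an assumption; the library's cluster objects may look different. `init_clusters_sketch` is a hypothetical name, and the returned `rand_idx` mirrors the documented attribute of the same name.

```python
import random


def init_clusters_sketch(records, k, seed=None):
    """Pick round_down(D / k) distinct records at random, one per new cluster."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    num_clusters = len(records) // k
    rand_idx = rng.sample(range(len(records)), num_clusters)
    clusters = [[records[i]] for i in rand_idx]
    return clusters, rand_idx


records = [[25, 50], [31, 62], [47, 40], [52, 81], [38, 55], [60, 73], [29, 48]]
clusters, rand_idx = init_clusters_sketch(records, k=3, seed=0)
```

With D = 7 records and k = 3, round_down(7/3) = 2, so two single-record clusters are created.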
