KMember
- class KMember(dataset: Dataset, k: int, group_anonymization: GroupAnonymization = GroupAnonymizationBuiltIn.SUMMARIZATION, seed: int = None, parallel: bool = False, cpu_cores: int = 3)
Bases: LocalRecodingAlgorithm
Implementation of the K-Member clustering algorithm.
K-Member greedily constructs one cluster (group) of records at a time until the whole dataset is divided into groups of at least k records. It starts the first cluster by randomly selecting an initial record (the seed), then finds and adds the k-1 other records that minimize the cluster’s information loss. From the second cluster onward, it picks as the new seed the record furthest from the previous seed, and repeats the record selection process until fewer than k records remain. Finally, each remaining record is added to whichever existing cluster minimizes the resulting information loss.
- Parameters:
dataset (Dataset) – The Dataset object holding the original data and its metadata.
k (int) – The privacy parameter k.
group_anonymization (GroupAnonymization) – The method used to anonymize the resulting clusters after applying local recoding. You can use one of the example methods in GroupAnonymizationBuiltIn, or create a custom method custom_group_anonymization(group: list, props: Any) -> list. Default: GroupAnonymizationBuiltIn.SUMMARIZATION
seed (int) – Random seed for the initial record selection to ensure reproducibility.
parallel (bool) – Boolean flag to enable parallel processing.
cpu_cores (int) – The number of CPU cores to utilize when parallel is True.
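As an illustration of the custom method signature mentioned for group_anonymization, here is a minimal sketch. Only the signature (group: list, props: Any) -> list comes from this page. The function name, the choice to return one anonymized record per input record, and the decision to ignore props (whose contents are not documented here) are all assumptions made for the example, and it assumes every column holds comparable values.

```python
from typing import Any


def range_group_anonymization(group: list, props: Any = None) -> list:
    """Hypothetical custom method: replace each value with its column's range.

    `props` is accepted only to match the documented signature; this sketch
    does not use it because its contents are not described on this page.
    """
    if not group:
        return []
    summaries = []
    for column in zip(*group):  # iterate column by column
        low, high = min(column), max(column)
        summaries.append(str(low) if low == high else f"{low}-{high}")
    # Assumption: every record in the group receives the same summary row.
    return [list(summaries) for _ in group]


anonymized = range_group_anonymization([[25, 50000], [31, 50000], [28, 62000]])
```

Under these assumptions, every record in the group becomes indistinguishable from the others, which is the property a group anonymization step is meant to provide.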
- Variables:
is_parallel (bool) – Whether the algorithm is running in parallel mode.
information_loss (float) – The total information loss calculated across all clusters.
See also
k_anonymization.core.ParallelUtility wrapper for parallelizing tasks across multiple CPU cores.
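Methods
do_local_recoding() – Perform the K-Member clustering algorithm.
find_best_cluster(clusters, r) – Assign an orphaned record to the most compatible existing cluster.
find_best_record(data, cluster) – Find the record that minimizes the information loss of the given cluster.
find_furthest_record_from_r(r, data) – Find the most distant record from the given record r.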
- do_local_recoding()
Perform the K-Member clustering algorithm.
The workflow consists of:
1. Pick a seed record (a random record for the first iteration, the record furthest from the previous seed otherwise).
2. Build a cluster of size k by greedily adding the records that minimize information loss.
3. Repeat steps 1 and 2 until fewer than k records remain.
4. Distribute the remaining records to the most suitable existing clusters.
- Returns:
list – The final list of clusters.
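The workflow above can be sketched in plain Python as follows. This is not the library's implementation. The toy loss (cluster size times the summed per-attribute ranges), the Manhattan distance, the use of the loss increase when placing leftovers, and all function names are assumptions chosen so the example runs on numeric records without any dependency.

```python
import random


def toy_info_loss(cluster):
    """Assumed loss: cluster size times the sum of per-attribute ranges."""
    return len(cluster) * sum(max(c) - min(c) for c in zip(*cluster))


def toy_distance(a, b):
    """Assumed distance: Manhattan distance between two numeric records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_member_sketch(records, k, seed=None):
    if len(records) < k:
        raise ValueError("need at least k records")
    rng = random.Random(seed)
    remaining = list(records)
    clusters = []
    prev_seed = None
    while len(remaining) >= k:
        # Step 1: random seed first, then the record furthest from the last seed.
        if prev_seed is None:
            idx = rng.randrange(len(remaining))
        else:
            idx = max(range(len(remaining)),
                      key=lambda i: toy_distance(prev_seed, remaining[i]))
        prev_seed = remaining.pop(idx)
        cluster = [prev_seed]
        # Step 2: greedily grow the cluster to size k.
        while len(cluster) < k:
            best = min(range(len(remaining)),
                       key=lambda i: toy_info_loss(cluster + [remaining[i]]))
            cluster.append(remaining.pop(best))
        clusters.append(cluster)
    # Step 4: place leftovers where they increase the loss the least.
    for r in remaining:
        best = min(range(len(clusters)),
                   key=lambda i: toy_info_loss(clusters[i] + [r])
                   - toy_info_loss(clusters[i]))
        clusters[best].append(r)
    return clusters


records = [[1, 1], [2, 2], [3, 3], [10, 10], [11, 11], [12, 12], [5, 5]]
clusters = k_member_sketch(records, k=3, seed=0)
```

Whatever seed is chosen, every cluster in the result holds at least k records and each input record appears in exactly one cluster.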
- find_best_cluster(clusters: list, r: list)
Assign an orphaned record to the most compatible existing cluster.
Used at the end of the process to assign any remaining records to existing clusters while minimizing added information loss.
- Parameters:
clusters (list[list]) – The list of already formed clusters.
r (list) – The record to be assigned.
- Returns:
tuple – (index of the best cluster, the resulting information loss).
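A minimal sketch of this selection, under stated assumptions: records are numeric, the loss is a toy measure (cluster size times the summed per-attribute ranges), and the second element of the returned tuple is the increase in loss caused by adding r. The page does not specify the metric or exactly which loss value is returned, so both are illustrative choices, as are the function names.

```python
def toy_info_loss(cluster):
    """Assumed loss: cluster size times the sum of per-attribute ranges."""
    return len(cluster) * sum(max(c) - min(c) for c in zip(*cluster))


def find_best_cluster_sketch(clusters, r):
    """Return (index of the best cluster, loss increase from adding r)."""
    best_index, best_increase = None, None
    for i, cluster in enumerate(clusters):
        increase = toy_info_loss(cluster + [r]) - toy_info_loss(cluster)
        if best_increase is None or increase < best_increase:
            best_index, best_increase = i, increase
    return best_index, best_increase


clusters = [[[1, 1], [2, 2]], [[10, 10], [12, 12]]]
result = find_best_cluster_sketch(clusters, [3, 3])  # (0, 8)
```

Here the record [3, 3] raises the loss of the first cluster from 4 to 12, but would raise the second from 8 to 54, so the first cluster is chosen.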
- find_best_record(data, cluster)
Find the record that minimizes the information loss of the given cluster.
Iterates through the available records, calculates the potential increase in information loss if each record were added to the given cluster, then picks the record that causes the lowest loss.
- Parameters:
data (list[list]) – The list of available records.
cluster (list[list]) – The given cluster.
- Returns:
tuple – (the best record, its index, the resulting information loss).
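The greedy choice can be illustrated with a short sketch. The loss function below (cluster size times the summed per-attribute ranges) is an assumption for numeric records, not the metric the library uses, and the function names are hypothetical. Because the cluster is fixed, minimizing the resulting loss and minimizing the increase select the same record.

```python
def toy_info_loss(cluster):
    """Assumed loss: cluster size times the sum of per-attribute ranges."""
    return len(cluster) * sum(max(c) - min(c) for c in zip(*cluster))


def find_best_record_sketch(data, cluster):
    """Return (best record, its index in data, loss of cluster + record)."""
    best_index = min(range(len(data)),
                     key=lambda i: toy_info_loss(cluster + [data[i]]))
    best_record = data[best_index]
    return best_record, best_index, toy_info_loss(cluster + [best_record])


data = [[9, 9], [3, 4], [20, 1]]
cluster = [[1, 1], [2, 2]]
result = find_best_record_sketch(data, cluster)  # ([3, 4], 1, 15)
```

The candidates would yield losses of 48, 15, and 60 respectively, so the record at index 1 is selected.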
- find_furthest_record_from_r(r: list, data: list)
Find the most distant record from the given record r.
This is used to find a “seed” for a new cluster that is far away from the previously processed cluster.
- Parameters:
r (list) – The given record.
data (list[list]) – The list of available records.
- Returns:
tuple – (the furthest record, its index).
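A sketch of the furthest-record search, assuming numeric records and Manhattan distance. The page does not state which distance measure is used, so the metric and the function names here are illustrative only.

```python
def toy_distance(a, b):
    """Assumed distance: Manhattan distance between two numeric records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_furthest_record_sketch(r, data):
    """Return (furthest record from r, its index in data)."""
    index = max(range(len(data)), key=lambda i: toy_distance(r, data[i]))
    return data[index], index


result = find_furthest_record_sketch([1, 1], [[2, 2], [10, 0], [5, 5]])
# result == ([10, 0], 1): the distances are 2, 10, and 8
```

Seeding each new cluster far from the previous seed tends to spread clusters across the data rather than crowding them around one region.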