Dataset#

class Dataset(name: str)#

Bases: object

Management class for handling a dataset and its properties.

This class serves as a central hub for a dataset, orchestrating the loading of raw data, attribute properties, and generalization hierarchies.

Parameters:

name (str) – The name of the dataset. This should correspond to a directory in k_anonymization/datasets/.

Variables:
  • all_datasets (list) – A class-level list containing all instantiated Dataset objects.

  • name (str) – The name assigned to the dataset instance.
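
The class-level all_datasets list can be pictured with the minimal sketch below. It assumes instances are appended to the list in the constructor; RegisteredDataset and the dataset names are hypothetical stand-ins, not the library's code.

```python
class RegisteredDataset:
    """Hypothetical stand-in for Dataset; shows only the registry behavior."""

    # Defined on the class, so every instance shares the same list.
    all_datasets = []

    def __init__(self, name: str):
        self.name = name
        # Assumption: each new instance registers itself on construction.
        RegisteredDataset.all_datasets.append(self)


adult = RegisteredDataset("adult")    # hypothetical dataset name
census = RegisteredDataset("census")  # hypothetical dataset name
names = [d.name for d in RegisteredDataset.all_datasets]
```

Because the list lives on the class, it is shared by all instances and keeps growing for the lifetime of the process.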

Methods

_repr_html_

Get the HTML representation of the dataset.

describe

Render an interactive dataframe showing dataset info.

reload_df

Read the CSV file of the Dataset from disk.

sample

Create a stratified sample of the dataset.

_repr_html_()#

Get the HTML representation of the dataset.

Returns:

str – HTML string for Jupyter display.

describe()#

Render an interactive dataframe showing dataset info.

reload_df()#

Read the CSV file of the Dataset from disk.

sample(n=None, frac=None, seed=None, ignore_index=True)#

Create a stratified sample of the dataset.

Samples rows based on the distribution of the target attribute.

Parameters:
  • n (int, optional) – Number of rows to sample.

  • frac (float, optional) – Fraction of data to sample.

  • seed (int, optional) – Random state for reproducibility.

  • ignore_index (bool, default True) – If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns:

SampleDataset – A new dataset object containing the sampled records.
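
The idea of stratified sampling can be illustrated with the standard-library sketch below. It is not the library's implementation: stratified_sample, the row layout, and the column names are assumptions made for illustration, and only the frac and seed parameters are mirrored.

```python
import random
from collections import defaultdict


def stratified_sample(rows, target, frac, seed=None):
    """Sample the same fraction of rows from each group of equal target value."""
    rng = random.Random(seed)  # a seeded generator makes the sample reproducible

    # Group the rows by the value of the target attribute.
    groups = defaultdict(list)
    for row in rows:
        groups[row[target]].append(row)

    sampled = []
    for group in groups.values():
        # Draw the same fraction from every group, without replacement.
        k = round(len(group) * frac)
        sampled.extend(rng.sample(group, k))
    return sampled


# Hypothetical data: 10 rows with income ">50K" and 30 rows with "<=50K".
rows = [{"age": 20 + i, "income": "<=50K" if i % 4 else ">50K"} for i in range(40)]
subset = stratified_sample(rows, target="income", frac=0.5, seed=0)
```

Drawing the same fraction from each group keeps the proportions of the target values in the sample close to those in the full data.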

Attributes

all_datasets

df

Lazy-load the dataset as an ITableDF.

hierarchies

Lazy-load the generalization hierarchies for this dataset.

info

Generate a summary DataFrame of the dataset's attributes.

is_categorical

A list of booleans indicating, for each QID attribute, whether it is categorical.

path

Get the absolute file path to the dataset directory.

props

Load and cache the dataset properties from a JSON file.

qids

A list of names of the QID attributes.

qids_categorial

A list of names of the categorical QID attributes.

qids_idx

The column indices of the QID attributes.

qids_idx_categorial

The column indices of the categorical QID attributes.

qids_idx_numerical

The column indices of the numerical QID attributes.

qids_numerical

A list of names of the numerical QID attributes.

target

The sensitive attribute.

all_datasets = []#

df#

Lazy-load the dataset as an ITableDF.

Returns:

ITableDF – The underlying data stored in the CSV file.

hierarchies#

Lazy-load the generalization hierarchies for this dataset.

Returns:

HierarchiesDict – Hierarchy definitions for the QID attributes.

info#

Generate a summary DataFrame of the dataset’s attributes.

Provides a breakdown of each column’s status as a QID, numerical, or categorical attribute, along with the count of unique values.

Returns:

pandas.DataFrame – A summary table with metadata labels as the index.

is_categorical#

A list of booleans indicating, for each QID attribute, whether it is categorical.

Returns:

list

path#

Get the absolute file path to the dataset directory.

Returns:

str – The directory path where the dataset assets are stored.

props#

Load and cache the dataset properties from a JSON file.

The file is expected to be located at path/props.json.

Returns:

dict – A dictionary containing metadata such as QID indices and categorical flags.
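
One common way to get load-and-cache behavior is functools.cached_property, sketched below. This is an assumption about the pattern, not the library's code: PropsHolder and the qids_idx key are hypothetical, and only the props.json location under path comes from the description above.

```python
import json
import tempfile
from functools import cached_property
from pathlib import Path


class PropsHolder:
    """Hypothetical stand-in for Dataset; shows only the cached JSON load."""

    def __init__(self, path):
        self.path = Path(path)

    @cached_property
    def props(self) -> dict:
        # Runs on first access only; the result is then stored on the instance.
        with open(self.path / "props.json", encoding="utf-8") as fh:
            return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny props.json; the "qids_idx" key is a made-up example.
    (Path(tmp) / "props.json").write_text(
        json.dumps({"qids_idx": [0, 1]}), encoding="utf-8"
    )
    holder = PropsHolder(tmp)
    first = holder.props   # reads the file from disk
    second = holder.props  # served from the cache, no second read
```

With this pattern, later changes to the file on disk are not seen until a new instance is created or the cached value is cleared.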

qids#

A list of names of the QID attributes.

Returns:

list

qids_categorial#

A list of names of the categorical QID attributes.

Returns:

list

qids_idx#

The column indices of the QID attributes.

Returns:

list

qids_idx_categorial#

The column indices of the categorical QID attributes.

Returns:

list

qids_idx_numerical#

The column indices of the numerical QID attributes.

Returns:

list

qids_numerical#

A list of names of the numerical QID attributes.

Returns:

list
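
The relationship between the QID attributes can be sketched as follows. The sketch assumes that is_categorical lines up position by position with qids_idx; the column names and values are invented for illustration and do not come from any bundled dataset.

```python
# Hypothetical column layout: three QIDs followed by the sensitive attribute.
columns = ["age", "zipcode", "sex", "income"]
qids_idx = [0, 1, 2]
is_categorical = [False, True, True]  # assumed to be one flag per entry of qids_idx

# Names of all QID attributes, in column order.
qids = [columns[i] for i in qids_idx]

# Split the QID column indices by their categorical flag.
qids_idx_categorial = [i for i, cat in zip(qids_idx, is_categorical) if cat]
qids_idx_numerical = [i for i, cat in zip(qids_idx, is_categorical) if not cat]

# Map the index lists back to attribute names.
qids_categorial = [columns[i] for i in qids_idx_categorial]
qids_numerical = [columns[i] for i in qids_idx_numerical]
```

Under this assumption, the categorical and numerical lists together cover every entry of qids_idx exactly once.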

target#

The sensitive attribute.

Returns:

str