Dataset

`Dataset`#

class Dataset(name: str)#

Bases: object

Management class for handling a dataset and its properties.

This class serves as a central hub for a dataset, orchestrating the loading of raw data, attribute properties, and generalization hierarchies.

Parameters:

name (str) – The name of the dataset. This should correspond to a directory in k_anonymization/datasets/.

Variables:

all_datasets (list) – A class-level list containing all instantiated Dataset objects.
name (str) – The name assigned to the dataset instance.
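The class-level `all_datasets` list can be pictured with a small stand-in class. This is only a sketch: `RegistryDemo` and the names "adult" and "census" are hypothetical, and the assumption that each instance appends itself during construction is not confirmed by this page.

```python
class RegistryDemo:
    """Hypothetical stand-in; not the library's Dataset class."""

    # Lives on the class, so every instance shares the same list.
    all_datasets = []

    def __init__(self, name: str):
        self.name = name
        # Assumption: each new instance registers itself on construction.
        RegistryDemo.all_datasets.append(self)


a = RegistryDemo("adult")    # placeholder names
b = RegistryDemo("census")
names = [d.name for d in RegistryDemo.all_datasets]
print(names)
```

Because the list is a class attribute, `a.all_datasets` and `b.all_datasets` refer to the same object.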

Methods

`_repr_html_`	Get the HTML representation of the dataset.
`describe`	Render an interactive dataframe showing dataset info.
`reload_df`	Read the CSV file of the Dataset from disk.
`sample`	Create a stratified sample of the dataset.

_repr_html_()#

Get the HTML representation of the dataset.

Returns: str – HTML string for Jupyter display.
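Jupyter looks for a `_repr_html_` method on the object that ends a cell and renders the returned string as HTML. The sketch below shows that protocol in isolation. `HtmlDemo` and its markup are invented for illustration and say nothing about the HTML that `Dataset` actually produces.

```python
from html import escape


class HtmlDemo:
    """Hypothetical stand-in showing the _repr_html_ protocol."""

    def __init__(self, name: str, rows: list):
        self.name = name
        self.rows = rows

    def _repr_html_(self) -> str:
        # Jupyter calls this method, when present, to build rich output.
        header = f"<b>{escape(self.name)}</b>"
        items = "".join(f"<li>{escape(str(r))}</li>" for r in self.rows)
        return f"{header}<ul>{items}</ul>"


html = HtmlDemo("demo", ["a", "b"])._repr_html_()
print(html)
```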

describe()#

Render an interactive dataframe showing dataset info.

reload_df()#

Read the CSV file of the Dataset from disk.

sample(n=None, frac=None, seed=None, ignore_index=True)#

Create a stratified sample of the dataset.

Samples rows based on the distribution of the target attribute.

Parameters:

n (int, optional) – Number of rows to sample.
frac (float, optional) – Fraction of data to sample.
seed (int, optional) – Random state for reproducibility.
ignore_index (bool, default True) – If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns:

SampleDataset – A new dataset object containing the sampled records.
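To make "stratified" concrete, the following standard-library sketch draws the same fraction from each group of rows that share a target value, so the sample keeps roughly the original class proportions. It is not the library's implementation: `stratified_sample`, the column names, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, target, frac, seed=None):
    """Sample `frac` of the rows from each group of equal `target` value."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[target]].append(row)

    sampled = []
    for value in sorted(groups):
        members = groups[value]
        # Round per group so each class keeps roughly its original share.
        k = round(len(members) * frac)
        sampled.extend(rng.sample(members, k))
    return sampled


# Toy data: 30 rows with "<=50K" and 10 rows with ">50K" (hypothetical values).
rows = [{"age": 20 + i, "income": "<=50K" if i % 4 else ">50K"} for i in range(40)]
sample = stratified_sample(rows, target="income", frac=0.5, seed=0)
print(len(sample))
```

With `frac=0.5`, the 30/10 split in the toy data becomes 15/5 in the sample, and passing the same `seed` reproduces the same rows.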

Attributes

`all_datasets`
`df`	Lazy-load the dataset as an ITableDF.
`hierarchies`	Lazy-load the generalization hierarchies for this dataset.
`info`	Generate a summary DataFrame of the dataset's attributes.
`is_categorical`	A list of booleans indicating whether each QID attribute is categorical.
`path`	Get the absolute file path to the dataset directory.
`props`	Load and cache the dataset properties from a JSON file.
`qids`	A list of names of the QID attributes.
`qids_categorial`	A list of names of the categorical QID attributes.
`qids_idx`	The column indices of the QID attributes.
`qids_idx_categorial`	The column indices of the categorical QID attributes.
`qids_idx_numerical`	The column indices of the numerical QID attributes.
`qids_numerical`	A list of names of the numerical QID attributes.
`target`	The sensitive attribute.

all_datasets = []#

df#

Lazy-load the dataset as an ITableDF.

Returns: ITableDF – The underlying data stored in the CSV file.

hierarchies#

Lazy-load the generalization hierarchies for this dataset.

Returns: HierarchiesDict – Hierarchy definitions for the QID attributes.

info#

Generate a summary DataFrame of the dataset’s attributes.

Provides a breakdown of each column’s status as a QID, numerical, or categorical attribute, along with the count of unique values.

Returns: pandas.DataFrame – A summary table with metadata labels as the index.

is_categorical#

A list of booleans indicating whether each QID attribute is categorical.

Returns: list

path#

Get the absolute file path to the dataset directory.

Returns: str – The directory path where the dataset assets are stored.

props#

Load and cache the dataset properties from a JSON file.

The file is expected to be located at path/props.json.

Returns: dict – A dictionary containing metadata such as QID indices and categorical flags.
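The load-and-cache behaviour described here can be sketched with `functools.cached_property`: the JSON file is read on first access and the resulting dict is reused afterwards. `PropsDemo` is a hypothetical stand-in, the `qids_idx` key is invented, and whether `Dataset` uses this exact mechanism is an assumption. Only the `props.json` file name comes from the text above.

```python
import json
import tempfile
from functools import cached_property
from pathlib import Path


class PropsDemo:
    """Hypothetical stand-in; only the props.json file name is from the docs."""

    def __init__(self, path: str):
        self.path = path

    @cached_property
    def props(self) -> dict:
        # Runs once on first access; later accesses reuse the stored dict.
        with open(Path(self.path) / "props.json", encoding="utf-8") as fh:
            return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    # The key below is invented for illustration only.
    Path(tmp, "props.json").write_text(json.dumps({"qids_idx": [0, 1]}), encoding="utf-8")
    demo = PropsDemo(tmp)
    first = demo.props
    second = demo.props  # served from the cache, no second read

print(first, first is second)
```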

qids#

A list of names of the QID attributes.

Returns: list

qids_categorial#

A list of names of the categorical QID attributes.

Returns: list

qids_idx#

The column indices of the QID attributes.

Returns: list

qids_idx_categorial#

The column indices of the categorical QID attributes.

Returns: list

qids_idx_numerical#

The column indices of the numerical QID attributes.

Returns: list

qids_numerical#

A list of names of the numerical QID attributes.

Returns: list
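The QID attributes above relate to one another in a simple way, sketched below with plain lists. The column names and values are hypothetical, the local variable names are chosen for readability rather than copied from the class, and the sketch assumes the categorical flags are stored in the same order as the QID indices, which this page does not state.

```python
columns = ["age", "zipcode", "sex", "disease"]  # hypothetical column names
qids_idx = [0, 1, 2]                             # positions of the QID columns
is_categorical = [False, True, True]             # one flag per QID, same order (assumed)

# Names of all QID columns, looked up by position.
qids = [columns[i] for i in qids_idx]

# Split the indices by flag, then map each subset back to names.
idx_categorical = [i for i, cat in zip(qids_idx, is_categorical) if cat]
idx_numerical = [i for i, cat in zip(qids_idx, is_categorical) if not cat]
names_categorical = [columns[i] for i in idx_categorical]
names_numerical = [columns[i] for i in idx_numerical]

print(qids, names_categorical, names_numerical)
```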

target#

The sensitive attribute.

Returns: str
