Dataset
- class Dataset(name: str)
Bases: object
Management class for handling a dataset and its properties.
This class serves as a central hub for a dataset, orchestrating the loading of raw data, attribute properties, and generalization hierarchies.
- Parameters:
name (str) – The name of the dataset. This should correspond to a directory in k_anonymization/datasets/.
- Variables:
all_datasets (list) – A class-level list containing all instantiated Dataset objects.
name (str) – The name assigned to the dataset instance.
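The class-level registry described above can be pictured with a minimal sketch. `RegisteredDataset` and the dataset names are hypothetical stand-ins, not the library's code; the sketch only shows how a list defined on the class is shared by every instance.

```python
class RegisteredDataset:
    """Hypothetical stand-in for Dataset; only the registry behaviour is shown."""

    # Defined on the class, so every instance appends to the same list.
    all_datasets = []

    def __init__(self, name: str):
        self.name = name
        type(self).all_datasets.append(self)


# The dataset names below are invented for illustration.
a = RegisteredDataset("adult")
b = RegisteredDataset("census")
names = [d.name for d in RegisteredDataset.all_datasets]
print(names)
```

Because the list lives on the class, it keeps growing for as long as the process runs, which is worth keeping in mind in long notebook sessions.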
Methods
_repr_html_() – Get the HTML representation of the dataset.
describe() – Render an interactive dataframe showing dataset info.
reload_df() – Read the CSV file of the Dataset from disk.
sample([n, frac, seed, ignore_index]) – Create a stratified sample of the dataset.
- _repr_html_()
Get the HTML representation of the dataset.
- Returns:
str – HTML string for Jupyter display.
- describe()
Render an interactive dataframe showing dataset info.
- reload_df()
Read the CSV file of the Dataset from disk.
- sample(n=None, frac=None, seed=None, ignore_index=True)
Create a stratified sample of the dataset.
Samples rows based on the distribution of the target attribute.
- Parameters:
n (int, optional) – Number of rows to sample.
frac (float, optional) – Fraction of data to sample.
seed (int, optional) – Random state for reproducibility.
ignore_index (bool, default True) – If True, the resulting index will be labeled 0, 1, …, n - 1.
- Returns:
SampleDataset – A new dataset object containing the sampled records.
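To make "stratified" concrete, here is a minimal standard-library sketch of sampling the same fraction from each group of target values, so the sample keeps the target distribution. It is not the library's implementation: the function `stratified_sample`, the column names `age` and `income`, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, target, frac, seed=None):
    """Draw roughly `frac` of the rows from each group of target values."""
    rng = random.Random(seed)  # seeded generator for reproducibility

    # Group the rows by the value of the target attribute.
    groups = defaultdict(list)
    for row in rows:
        groups[row[target]].append(row)

    # Sample the same fraction from every group.
    sampled = []
    for members in groups.values():
        k = round(len(members) * frac)
        sampled.extend(rng.sample(members, k))
    return sampled


# Toy data: 30 rows with income "<=50K" and 10 rows with ">50K".
rows = [{"age": 20 + i, "income": "<=50K" if i % 4 else ">50K"} for i in range(40)]
sample = stratified_sample(rows, target="income", frac=0.5, seed=0)
high = sum(1 for r in sample if r["income"] == ">50K")
print(len(sample), high)
```

The 3:1 ratio between the two income groups in the toy data is preserved in the sample, which is the property a stratified sample is meant to guarantee.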
Attributes
df – Lazy-load the dataset as an ITableDF.
hierarchies – Lazy-load the generalization hierarchies for this dataset.
info – Generate a summary DataFrame of the dataset's attributes.
is_categorical – A list of booleans indicating if a QID attribute is categorical.
path – Get the absolute file path to the dataset directory.
props – Load and cache the dataset properties from a JSON file.
qids – A list of names of the QID attributes.
qids_categorial – A list of names of the categorical QID attributes.
qids_idx – The column indices of the QID attributes.
qids_idx_categorial – The column indices of the categorical QID attributes.
qids_idx_numerical – The column indices of the numerical QID attributes.
qids_numerical – A list of names of the numerical QID attributes.
target – The sensitive attribute.
- all_datasets = []
- df
Lazy-load the dataset as an ITableDF.
- Returns:
ITableDF – The underlying data stored in the CSV file.
- hierarchies
Lazy-load the generalization hierarchies for this dataset.
- Returns:
HierarchiesDict – Hierarchy definitions for the QID attributes.
- info
Generate a summary DataFrame of the dataset’s attributes.
Provides a breakdown of each column’s status as a QID, numerical, or categorical attribute, along with the count of unique values.
- Returns:
pandas.DataFrame – A summary table with metadata labels as the index.
- is_categorical
A list of booleans indicating if a QID attribute is categorical.
- Returns:
list
- path
Get the absolute file path to the dataset directory.
- Returns:
str – The directory path where the dataset assets are stored.
- props
Load and cache the dataset properties from a JSON file.
The file is expected to be located at path/props.json.
- Returns:
dict – A dictionary containing metadata such as QID indices and categorical flags.
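The load-and-cache behaviour can be sketched with `functools.cached_property`. This is an illustration under stated assumptions, not the library's code: `PropsHolder` is a hypothetical class, and the JSON keys `qids_idx` and `is_categorical` are invented to mirror the attribute names on this page.

```python
import json
import tempfile
from functools import cached_property
from pathlib import Path


class PropsHolder:
    """Hypothetical stand-in; only lazy loading of props.json is shown."""

    def __init__(self, path):
        self.path = Path(path)
        self.loads = 0  # counts how often the file is actually read

    @cached_property
    def props(self):
        # Runs on first access only; the result is then stored on the instance.
        self.loads += 1
        with open(self.path / "props.json", encoding="utf-8") as fh:
            return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    # Key names are assumptions made for this example.
    content = {"qids_idx": [0, 1], "is_categorical": [False, True]}
    Path(tmp, "props.json").write_text(json.dumps(content), encoding="utf-8")

    holder = PropsHolder(tmp)
    first = holder.props   # reads the file
    second = holder.props  # served from the cache
    print(holder.loads)
```

One consequence of caching is that edits to the JSON file on disk are not picked up until a new object is created or the cached value is cleared.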
- qids
A list of names of the QID attributes.
- Returns:
list
- qids_categorial
A list of names of the categorical QID attributes.
- Returns:
list
- qids_idx
The column indices of the QID attributes.
- Returns:
list
- qids_idx_categorial
The column indices of the categorical QID attributes.
- Returns:
list
- qids_idx_numerical
The column indices of the numerical QID attributes.
- Returns:
list
- qids_numerical
A list of names of the numerical QID attributes.
- Returns:
list
- target
The sensitive attribute.
- Returns:
str
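The QID attributes above are closely related, and a small sketch may help show how they could fit together. It assumes that `is_categorical` holds one flag per entry of `qids_idx`, in the same order; the documentation does not state this explicitly. The column names are hypothetical, and plain lists stand in for the real Dataset properties.

```python
# Hypothetical column names; the last one plays the role of the target.
columns = ["age", "zipcode", "education", "disease"]

# Assumed inputs: QID column indices and one categorical flag per QID.
qids_idx = [0, 1, 2]
is_categorical = [False, True, True]

# Names of all QID attributes.
qids = [columns[i] for i in qids_idx]

# Split the QID indices by their categorical flag.
qids_idx_categorial = [i for i, cat in zip(qids_idx, is_categorical) if cat]
qids_idx_numerical = [i for i, cat in zip(qids_idx, is_categorical) if not cat]

# Map the split indices back to column names.
qids_categorial = [columns[i] for i in qids_idx_categorial]
qids_numerical = [columns[i] for i in qids_idx_numerical]

print(qids, qids_categorial, qids_numerical)
```

Under this assumption, every QID falls into exactly one of the categorical or numerical lists.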