
Multiple Models Interface

In this example, we conduct deep performance profiling of four models. This demonstration shows how to create input arguments for Virny, how to compute overall and disparity metrics with a metric computation interface, and how to build static visualizations based on the computed metrics. For that, we will use the compute_metrics_with_config interface, which can compute metrics for multiple models. We will need to complete the following steps:

  • Initialize input variables

  • Compute subgroup metrics

  • Perform disparity metrics composition using the Metric Composer

  • Create static visualizations using the Metric Visualizer

Import dependencies

import os
import pandas as pd
from pprint import pprint
from datetime import datetime, timezone

from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from virny.utils.custom_initializers import create_config_obj, read_model_metric_dfs, create_models_config_from_tuned_params_df
from virny.user_interfaces.multiple_models_api import compute_metrics_with_config
from virny.preprocessing.basic_preprocessing import preprocess_dataset
from virny.custom_classes.metrics_visualizer import MetricsVisualizer
from virny.custom_classes.metrics_composer import MetricsComposer
from virny.utils.model_tuning_utils import tune_ML_models
from virny.datasets.base import BaseDataLoader

Initialize Input Variables

Based on the library flow, we need to create three input objects for a user interface:

  • A config yaml, a file with configuration parameters for the metric computation interfaces.

  • A dataset class, a wrapper around the user’s raw dataset that includes its descriptive attributes, such as the target column, numerical columns, and categorical columns. This class must inherit from the BaseDataset class, which was created for user convenience.

  • Finally, a models config, a Python dictionary where keys are model names and values are initialized models to analyze. This dictionary makes it possible to conduct audits in different analysis modes and to analyze different types of models, as sketched below.
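For illustration, a models config is simply a dictionary of initialized estimators. A minimal sketch (the model names and hyperparameter values here are arbitrary):

# Hypothetical models config: model display names mapped to initialized estimators
sample_models_config = {
    'DecisionTreeClassifier': DecisionTreeClassifier(max_depth=10, random_state=42),
    'LogisticRegression': LogisticRegression(C=1.0, max_iter=250, random_state=42),
}

In this example, however, we will build the models config automatically from hyperparameter tuning results.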

DATASET_SPLIT_SEED = 42
MODELS_TUNING_SEED = 42
TEST_SET_FRACTION = 0.2
models_params_for_tuning = {
    'DecisionTreeClassifier': {
        'model': DecisionTreeClassifier(random_state=MODELS_TUNING_SEED),
        'params': {
            "max_depth": [20, 30],
            "min_samples_split" : [0.1],
            "max_features": ['sqrt'],
            "criterion": ["gini", "entropy"]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(random_state=MODELS_TUNING_SEED),
        'params': {
            'penalty': ['l2'],
            'C' : [0.0001, 0.1, 1, 100],
            'solver': ['newton-cg', 'lbfgs'],
            'max_iter': [250],
        }
    },
    'RandomForestClassifier': {
        'model': RandomForestClassifier(random_state=MODELS_TUNING_SEED),
        'params': {
            "max_depth": [6, 10],
            "min_samples_leaf": [1],
            "n_estimators": [50, 100],
            "max_features": [0.6]
        }
    },
    'XGBClassifier': {
        'model': XGBClassifier(random_state=MODELS_TUNING_SEED, verbosity=0),
        'params': {
            'learning_rate': [0.1],
            'n_estimators': [200],
            'max_depth': [5, 7],
            'lambda':  [10, 100]
        }
    }
}

Create a config object

The compute_metrics_with_config interface requires that your yaml file include the following parameters:

  • dataset_name: str, a name of your dataset; it will be used to name files with metrics.

  • bootstrap_fraction: float, the fraction of the train set, in the range [0.0, 1.0], used to fit models during bootstrap (usually more than 0.5).

  • random_state: int, a seed to control the randomness of the whole model evaluation pipeline.

  • n_estimators: int, the number of bootstrap estimators used to compute subgroup stability metrics.

  • sensitive_attributes_dct: dict, a dictionary where keys are sensitive attribute names (including intersectional attributes) and values are the disadvantaged values for these attributes. Intersectional attribute names must join the individual sensitive attributes with '&'. You do not need to specify disadvantaged values for intersectional groups, since they are derived from the disadvantaged values of each individual attribute in the pair, as given in sensitive_attributes_dct.

Note that the disadvantaged value in the sensitive attributes dictionary must match the value in the original dataset. For example, if the distinct values of the sex column in the original dataset are 'F' and 'M', and pre-processing maps them to 0 and 1 respectively, you still need to set the disadvantaged value to 'F' or 'M' in the dictionary.
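For instance, if the raw sex column contains 'F' and 'M' and pre-processing encodes them as 0 and 1, the yaml entry should still use the raw value:

sensitive_attributes_dct: {'sex': 'F'}  # the raw dataset value, not the encoded 0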

ROOT_DIR = os.path.join('docs', 'examples')
config_yaml_path = os.path.join(ROOT_DIR, 'experiment_config.yaml')
config_yaml_content = """
dataset_name: COMPAS_Without_Sensitive_Attributes
bootstrap_fraction: 0.8
random_state: 42
n_estimators: 50  # In practice, use more than 100 estimators; 50 is set here only to speed up this example
sensitive_attributes_dct: {'sex': 1, 'race': 'African-American', 'sex&race': None}
"""

with open(config_yaml_path, 'w', encoding='utf-8') as f:
    f.write(config_yaml_content)
config = create_config_obj(config_yaml_path=config_yaml_path)
SAVE_RESULTS_DIR_PATH = os.path.join(ROOT_DIR, 'results', f'{config.dataset_name}_Metrics_{datetime.now(timezone.utc).strftime("%Y%m%d__%H%M%S")}')
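The created config object exposes the yaml parameters as attributes; for example, config.dataset_name and config.sensitive_attributes_dct are used throughout the rest of this example:

print(config.dataset_name)              # COMPAS_Without_Sensitive_Attributes
print(config.sensitive_attributes_dct)  # {'sex': 1, 'race': 'African-American', 'sex&race': None}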

Preprocess the dataset and create a BaseFlowDataset class

Based on the BaseDataset class, your dataset class should include the following attributes:

  • Obligatory attributes: dataset, target, features, numerical_columns, categorical_columns

  • Optional attributes: X_data, y_data, columns_with_nulls

For more details, please refer to the library documentation.

class CompasWithoutSensitiveAttrsDataset(BaseDataLoader):
    """
    Dataset class for COMPAS dataset that does not contain sensitive attributes among feature columns
     to test blind classifiers

    Parameters
    ----------
    subsample_size
        Subsample size to create based on the input dataset

    """
    def __init__(self, dataset_path, subsample_size: int = None):
        df = pd.read_csv(dataset_path)
        if subsample_size:
            df = df.sample(subsample_size)

        # Initial data types transformation
        int_columns = ['recidivism', 'age', 'age_cat_25 - 45', 'age_cat_Greater than 45',
                       'age_cat_Less than 25', 'c_charge_degree_F', 'c_charge_degree_M', 'sex']
        int_columns_dct = {col: "int" for col in int_columns}
        df = df.astype(int_columns_dct)

        # Define params
        target = 'recidivism'
        numerical_columns = ['juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count']
        categorical_columns = ['age_cat_25 - 45', 'age_cat_Greater than 45', 'age_cat_Less than 25',
                               'c_charge_degree_F', 'c_charge_degree_M']

        super().__init__(
            full_df=df,
            target=target,
            numerical_columns=numerical_columns,
            categorical_columns=categorical_columns
        )
data_loader = CompasWithoutSensitiveAttrsDataset(dataset_path=os.path.join('virny', 'datasets', 'data', 'COMPAS.csv'))
data_loader.X_data[data_loader.X_data.columns[:5]].head()
   juv_fel_count  juv_misd_count  juv_other_count  priors_count  age_cat_25 - 45
0            0.0       -2.340451              1.0    -15.010999                1
1            0.0        0.000000              0.0      0.000000                1
2            0.0        0.000000              0.0      0.000000                0
3            0.0        0.000000              0.0      6.000000                1
4            0.0        0.000000              0.0      7.513697                1
column_transformer = ColumnTransformer(transformers=[
    ('categorical_features', OneHotEncoder(handle_unknown='ignore', sparse=False), data_loader.categorical_columns),
    ('numerical_features', StandardScaler(), data_loader.numerical_columns),
])
base_flow_dataset = preprocess_dataset(data_loader=data_loader, 
                                       column_transformer=column_transformer,
                                       sensitive_attributes_dct=config.sensitive_attributes_dct,
                                       test_set_fraction=TEST_SET_FRACTION,
                                       dataset_split_seed=DATASET_SPLIT_SEED)
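As a quick sanity check, the returned BaseFlowDataset keeps the train and test partitions produced by the split. The attribute names below (X_train_val, X_test) are an assumption about the BaseFlowDataset interface and may differ between library versions:

# Assumed attribute names; the test partition should hold ~20% of the rows
print(base_flow_dataset.X_train_val.shape, base_flow_dataset.X_test.shape)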

Tune models and create a models config for metrics computation

tuned_params_df, models_config = tune_ML_models(models_params_for_tuning, base_flow_dataset, config.dataset_name, n_folds=3)
tuned_params_df
2024/06/02, 00:28:22: Tuning DecisionTreeClassifier...
2024/06/02, 00:28:23: Tuning for DecisionTreeClassifier is finished [F1 score = 0.6554846983071246, Accuracy = 0.6575048862828714]

2024/06/02, 00:28:23: Tuning LogisticRegression...
2024/06/02, 00:28:23: Tuning for LogisticRegression is finished [F1 score = 0.6483823116804865, Accuracy = 0.6520611566087312]

2024/06/02, 00:28:23: Tuning RandomForestClassifier...
2024/06/02, 00:28:24: Tuning for RandomForestClassifier is finished [F1 score = 0.6569271025126497, Accuracy = 0.6586904492688075]

2024/06/02, 00:28:24: Tuning XGBClassifier...
2024/06/02, 00:28:24: Tuning for XGBClassifier is finished [F1 score = 0.6623616224585352, Accuracy = 0.6646105242187331]
  Dataset_Name                         Model_Name              F1_Score  Accuracy_Score  Model_Best_Params
0 COMPAS_Without_Sensitive_Attributes  DecisionTreeClassifier  0.655485  0.657505        {'criterion': 'gini', 'max_depth': 20, 'max_fe...
1 COMPAS_Without_Sensitive_Attributes  LogisticRegression      0.648382  0.652061        {'C': 1, 'max_iter': 250, 'penalty': 'l2', 'so...
2 COMPAS_Without_Sensitive_Attributes  RandomForestClassifier  0.656927  0.658690        {'max_depth': 10, 'max_features': 0.6, 'min_sa...
3 COMPAS_Without_Sensitive_Attributes  XGBClassifier           0.662362  0.664611        {'lambda': 100, 'learning_rate': 0.1, 'max_dep...
now = datetime.now(timezone.utc)
date_time_str = now.strftime("%Y%m%d__%H%M%S")
tuned_df_path = os.path.join(ROOT_DIR, 'results', 'models_tuning', f'tuning_results_{config.dataset_name}_{date_time_str}.csv')
tuned_params_df.to_csv(tuned_df_path, sep=",", columns=tuned_params_df.columns, float_format="%.4f", index=False)

Create models_config from the saved tuned_params_df for higher reliability

models_config = create_models_config_from_tuned_params_df(models_params_for_tuning, tuned_df_path)
pprint(models_config)
{'DecisionTreeClassifier': DecisionTreeClassifier(max_depth=20, max_features='sqrt', min_samples_split=0.1,
                       random_state=42),
 'LogisticRegression': LogisticRegression(C=1, max_iter=250, random_state=42, solver='newton-cg'),
 'RandomForestClassifier': RandomForestClassifier(max_depth=10, max_features=0.6, random_state=42),
 'XGBClassifier': XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, lambda=100, learning_rate=0.1,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=5, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=None, num_parallel_tree=None,
              predictor=None, ...)}

Subgroup Metric Computation

Next, we pass the BaseFlowDataset object, the models config, and the config yaml to the metric computation interface and execute it. The interface uses subgroup analyzers to compute different sets of metrics for each privileged and disadvantaged group. For now, our library supports the Subgroup Variance Analyzer and the Subgroup Error Analyzer, but it is easily extensible to other analyzers. When the variance and error analyzers complete metric computation, their metrics are combined, returned in a matrix format, and stored in a file if a save location is defined.

metrics_dct = compute_metrics_with_config(base_flow_dataset, config, models_config, SAVE_RESULTS_DIR_PATH,
                                          notebook_logs_stdout=True)
Analyze multiple models:   0%|          | 0/4 [00:00<?, ?it/s]
Classifiers testing by bootstrap:   0%|          | 0/50 [00:00<?, ?it/s]
Classifiers testing by bootstrap:   0%|          | 0/50 [00:00<?, ?it/s]
Classifiers testing by bootstrap:   0%|          | 0/50 [00:00<?, ?it/s]
Classifiers testing by bootstrap:   0%|          | 0/50 [00:00<?, ?it/s]
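The returned metrics_dct maps each model name from the models config to a dataframe with that model's computed metrics:

pprint(list(metrics_dct.keys()))
# e.g., ['DecisionTreeClassifier', 'LogisticRegression', 'RandomForestClassifier', 'XGBClassifier']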

Look at several columns in the top rows of the computed metrics

sample_model_metrics_df = metrics_dct[list(models_config.keys())[0]]
sample_model_metrics_df[sample_model_metrics_df.columns[:6]].head(20)
Metric overall sex_priv sex_dis race_priv race_dis
0 IQR 0.093218 0.092883 0.093302 0.095182 0.091952
1 Overall_Uncertainty 0.899836 0.909407 0.897446 0.896719 0.901847
2 Std 0.076228 0.077296 0.075962 0.075141 0.076929
3 Mean_Prediction 0.520117 0.572049 0.507149 0.581026 0.480839
4 Aleatoric_Uncertainty 0.869944 0.875791 0.868484 0.866015 0.872477
5 Statistical_Bias 0.422194 0.416842 0.423530 0.418523 0.424561
6 Epistemic_Uncertainty 0.029893 0.033616 0.028963 0.030704 0.029369
7 Jitter 0.148098 0.159899 0.145152 0.138860 0.154056
8 Label_Stability 0.786591 0.766825 0.791527 0.801256 0.777134
9 TPR 0.687898 0.573333 0.709596 0.578231 0.737654
10 TNR 0.687179 0.808824 0.650334 0.756554 0.628931
11 PPV 0.639053 0.623188 0.641553 0.566667 0.669468
12 FNR 0.312102 0.426667 0.290404 0.421769 0.262346
13 FPR 0.312821 0.191176 0.349666 0.243446 0.371069
14 Accuracy 0.687500 0.725118 0.678107 0.693237 0.683801
15 F1 0.662577 0.597222 0.673861 0.572391 0.701909
16 Selection-Rate 0.480114 0.327014 0.518343 0.362319 0.556075
17 Sample_Size 1056.000000 211.000000 845.000000 414.000000 642.000000

Disparity Metric Composition

To compose disparity metrics, the Metric Composer should be applied. The Metric Composer is responsible for the second stage of the model audit. Currently, it computes our custom error disparity, stability disparity, and uncertainty disparity metrics, but extending it with new disparity metrics is straightforward. We have noticed that more and more disparity metrics have appeared over the last decade, yet most of them are based on the same group-specific metrics. Hence, separating the computation of group-specific and disparity metrics allows us to experiment with different combinations of group-specific metrics and to avoid recomputing group metrics for each new set of disparity metrics.
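For intuition, composed disparity metrics are simple differences or ratios of the group-specific values computed above. A minimal sketch using the DecisionTreeClassifier accuracies from the previous table:

# Difference metrics subtract the privileged group's value from the
# disadvantaged group's value (numbers taken from the table above)
acc_sex_priv, acc_sex_dis = 0.725118, 0.678107
accuracy_difference_sex = acc_sex_dis - acc_sex_priv  # ≈ -0.047012, matching the first row of the output below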

models_metrics_dct = read_model_metric_dfs(SAVE_RESULTS_DIR_PATH, model_names=list(models_config.keys()))
metrics_composer = MetricsComposer(models_metrics_dct, config.sensitive_attributes_dct)

Compute composed metrics

models_composed_metrics_df = metrics_composer.compose_metrics()
models_composed_metrics_df
Metric sex race sex&race Model_Name
0 Accuracy_Difference -0.047012 -0.009436 -0.039300 DecisionTreeClassifier
1 Aleatoric_Uncertainty_Difference -0.007307 0.006463 0.000802 DecisionTreeClassifier
2 Aleatoric_Uncertainty_Ratio 0.991656 1.007463 1.000922 DecisionTreeClassifier
3 Epistemic_Uncertainty_Difference -0.004654 -0.001335 -0.003381 DecisionTreeClassifier
4 Epistemic_Uncertainty_Ratio 0.861563 0.956510 0.892966 DecisionTreeClassifier
... ... ... ... ... ...
71 Disparate_Impact 1.465176 1.537383 1.596796 XGBClassifier
72 Std_Difference 0.000151 0.002984 0.002995 XGBClassifier
73 Std_Ratio 1.003178 1.065098 1.064903 XGBClassifier
74 Equalized_Odds_TNR -0.076968 -0.101583 -0.123015 XGBClassifier
75 Equalized_Odds_TPR 0.153535 0.152053 0.155233 XGBClassifier

76 rows × 5 columns
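Since the composed metrics arrive as a single long dataframe, standard pandas filtering is enough to inspect one disparity metric across all models; for example:

# Compare Accuracy_Difference across models and group definitions
models_composed_metrics_df[models_composed_metrics_df['Metric'] == 'Accuracy_Difference']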

Metric Visualization

The Metrics Visualizer allows us to build static visualizations for the computed metrics. It unifies different preprocessing methods for the computed metrics and creates the various data formats required for the visualizations. Hence, users can simply call methods of the MetricsVisualizer class and get custom plots for diverse metric analysis.

visualizer = MetricsVisualizer(models_metrics_dct, models_composed_metrics_df, config.dataset_name,
                               model_names=list(models_config.keys()),
                               sensitive_attributes_dct=config.sensitive_attributes_dct)
visualizer.create_overall_metrics_bar_char(
    metric_names=['Accuracy', 'F1', 'TPR', 'TNR', 'PPV', 'Selection-Rate'],
    plot_title="Accuracy Metrics"
)
visualizer.create_overall_metrics_bar_char(
    metric_names=['Aleatoric_Uncertainty', 'Overall_Uncertainty', 'Label_Stability', 'Std', 'IQR', 'Jitter'],
    plot_title="Stability and Uncertainty Metrics"
)
visualizer.create_overall_metric_heatmap(
    model_names=list(models_params_for_tuning.keys()),
    metrics_lst=visualizer.all_accuracy_metrics + visualizer.all_uncertainty_metrics,
    tolerance=0.005,
)

[Image: overall metric heatmap]

visualizer.create_disparity_metric_heatmap(
    model_names=list(models_params_for_tuning.keys()),
    metrics_lst=[
        # Error disparity metrics
        'Equalized_Odds_TPR',
        'Equalized_Odds_FPR',
        'Disparate_Impact',
        # Stability disparity metrics
        'Label_Stability_Difference',
        'IQR_Difference',
        'Std_Ratio',
    ],
    groups_lst=config.sensitive_attributes_dct.keys(),
    tolerance=0.005,
)

[Image: disparity metric heatmap]