Multiple Models Interface With Postprocessor

In this example, we audit two models together with a postprocessor from AIF360, visualize the metrics, and create an analysis report. For that, we use the compute_metrics_with_config interface, which can compute metrics for multiple models. The steps are as follows:

Initialize input variables
Compute subgroup metrics
Perform disparity metrics composition using the Metric Composer
Create static visualizations using the Metric Visualizer

Import dependencies

import os
from pprint import pprint
from datetime import datetime, timezone

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from aif360.algorithms.postprocessing import EqOddsPostprocessing

from virny.utils.custom_initializers import create_config_obj, read_model_metric_dfs, create_models_config_from_tuned_params_df
from virny.user_interfaces.multiple_models_api import compute_metrics_with_config
from virny.preprocessing.basic_preprocessing import preprocess_dataset
from virny.custom_classes.metrics_visualizer import MetricsVisualizer
from virny.custom_classes.metrics_composer import MetricsComposer
from virny.utils.model_tuning_utils import tune_ML_models

Initialize Input Variables

Based on the library flow, we need to create 3 input objects for a user interface:

A config YAML file that holds the configuration parameters for the metric computation interfaces.
A dataset class that wraps the user’s raw dataset and includes its descriptive attributes, such as a target column, numerical columns, and categorical columns. This class must inherit from the BaseDataset class, which was created for user convenience.
Finally, a models config, which is a Python dictionary where keys are model names and values are initialized models for analysis. This dictionary makes it possible to run audits in different analysis modes and to analyze different types of models.

DATASET_SPLIT_SEED = 42
MODELS_TUNING_SEED = 42
TEST_SET_FRACTION = 0.2

models_params_for_tuning = {
    'LogisticRegression': {
        'model': LogisticRegression(random_state=MODELS_TUNING_SEED),
        'params': {
            'penalty': ['l2'],
            'C' : [0.0001, 0.1, 1, 100],
            'solver': ['newton-cg', 'lbfgs'],
            'max_iter': [250],
        }
    },
    'RandomForestClassifier': {
        'model': RandomForestClassifier(random_state=MODELS_TUNING_SEED),
        'params': {
            "max_depth": [6, 10],
            "min_samples_leaf": [1],
            "n_estimators": [50, 100],
            "max_features": [0.6]
        }
    },
}
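
To get a feel for the size of the search space defined above, the grids can be expanded into their Cartesian product. This is a minimal sketch that assumes an exhaustive search over every combination; how tune_ML_models actually explores the space is not shown in this example. The grids are copied as plain dictionaries (the name param_grids and the helper expand_grid are hypothetical) so that the snippet runs without scikit-learn.

```python
from itertools import product

# Plain-dict copies of the 'params' grids defined above (no estimators needed here).
param_grids = {
    'LogisticRegression': {
        'penalty': ['l2'],
        'C': [0.0001, 0.1, 1, 100],
        'solver': ['newton-cg', 'lbfgs'],
        'max_iter': [250],
    },
    'RandomForestClassifier': {
        'max_depth': [6, 10],
        'min_samples_leaf': [1],
        'n_estimators': [50, 100],
        'max_features': [0.6],
    },
}


def expand_grid(grid):
    """Return every hyperparameter combination in the grid as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Number of candidate configurations per model under an exhaustive search.
grid_sizes = {name: len(expand_grid(grid)) for name, grid in param_grids.items()}
print(grid_sizes)  # {'LogisticRegression': 8, 'RandomForestClassifier': 4}
```

With cross-validation, each candidate is fitted once per fold, so the total number of fits grows as the number of candidates times the number of folds.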

Create a config object

The compute_metrics_with_config interface requires that your YAML file include the following parameters:

dataset_name: str, a name of your dataset; it will be used to name files with metrics.
bootstrap_fraction: float, the fraction of the training set, in the range [0.0, 1.0], used to fit each model in the bootstrap (usually more than 0.5).
computation_mode: str, the name of the computation mode, either 'default' or 'error_analysis'. While the default mode measures metrics for sex_priv and sex_dis, the error_analysis mode measures metrics for (sex_priv, sex_priv_correct, sex_priv_incorrect) and (sex_dis, sex_dis_correct, sex_dis_incorrect). This lets a user analyze how certain a model is about its incorrect predictions.
random_state: int, a seed to control the randomness of the whole model evaluation pipeline.
n_estimators: int, the number of estimators for bootstrap to compute subgroup stability metrics.
sensitive_attributes_dct: dict, a dictionary where keys are sensitive attribute names (including intersectional attributes), and values are disadvantaged values for these attributes. Intersectional attributes must include '&' between sensitive attributes. You do not need to specify disadvantaged values for intersectional groups since they will be derived from disadvantaged values in sensitive_attributes_dct for each separate sensitive attribute in this intersectional pair.
postprocessing_sensitive_attribute: str, a name of a sensitive attribute to use for postprocessing.
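
The two parameters that shape the output the most are sensitive_attributes_dct and computation_mode. The sketch below illustrates, under stated assumptions, how disadvantaged values for an intersectional key could be derived from the single-attribute entries and which subgroup names each computation mode yields. Both helper functions are hypothetical and are not part of the library; the naming of intersectional subgroups in particular is an assumption based on the single-attribute pattern.

```python
def derive_intersectional_values(sensitive_attributes_dct):
    """Fill in disadvantaged values for 'a&b' keys from the single-attribute entries."""
    derived = {}
    for attr, dis_value in sensitive_attributes_dct.items():
        if '&' in attr:
            # Each part of the intersectional key reuses its own disadvantaged value.
            derived[attr] = {part: sensitive_attributes_dct[part] for part in attr.split('&')}
        else:
            derived[attr] = dis_value
    return derived


def subgroup_names(sensitive_attributes_dct, computation_mode):
    """List the subgroup names measured in the given computation mode."""
    names = ['overall']
    for attr in sensitive_attributes_dct:
        for group in ('priv', 'dis'):
            names.append(f'{attr}_{group}')
            if computation_mode == 'error_analysis':
                # Error analysis additionally splits each group by prediction correctness.
                names.append(f'{attr}_{group}_correct')
                names.append(f'{attr}_{group}_incorrect')
    return names


attrs = {'male': '0', 'race': 'Non-White', 'male&race': None}
derived = derive_intersectional_values(attrs)
default_names = subgroup_names({'male': '0'}, 'default')
error_names = subgroup_names({'male': '0'}, 'error_analysis')

print(derived['male&race'])  # {'male': '0', 'race': 'Non-White'}
print(default_names)         # ['overall', 'male_priv', 'male_dis']
print(error_names)
```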

Note that a disadvantaged value in the sensitive attribute dictionary must match the value in the original dataset. For example, if the distinct values of the sex column in the original dataset are 'F' and 'M', and preprocessing turns them into 0 and 1 respectively, you still need to set the disadvantaged value to 'F' or 'M' in the sensitive attribute dictionary.
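
The following toy snippet shows why the original value matters. It is only an illustration with made-up data: group membership is selected on the raw column, so a disadvantaged value of 'F' matches rows there, while the same value matches nothing in the encoded column.

```python
# Raw values of a sensitive column and their encoded counterparts after preprocessing.
raw_sex = ['F', 'M', 'M', 'F', 'M']
encoded_sex = [0 if value == 'F' else 1 for value in raw_sex]

# The config refers to the ORIGINAL value of the disadvantaged group.
sensitive_attributes_dct = {'sex': 'F'}
dis_value = sensitive_attributes_dct['sex']

# Selecting on the raw column finds the disadvantaged rows.
dis_mask_raw = [value == dis_value for value in raw_sex]

# Selecting on the encoded column with the same value finds nothing.
dis_mask_encoded = [value == dis_value for value in encoded_sex]

print(sum(dis_mask_raw))      # 2
print(sum(dis_mask_encoded))  # 0
```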

ROOT_DIR = os.path.join('docs', 'examples')
config_yaml_path = os.path.join(ROOT_DIR, 'experiment_config.yaml')
config_yaml_content = """
dataset_name: Law_School
bootstrap_fraction: 0.8
computation_mode: error_analysis
random_state: 42
n_estimators: 50  # In practice, use more than 100 estimators; a smaller number is used only for this example
sensitive_attributes_dct: {'male': '0', 'race': 'Non-White', 'male&race': None}
postprocessing_sensitive_attribute: 'race_binary'
"""

with open(config_yaml_path, 'w', encoding='utf-8') as f:
    f.write(config_yaml_content)

config = create_config_obj(config_yaml_path=config_yaml_path)
SAVE_RESULTS_DIR_PATH = os.path.join(ROOT_DIR, 'results', f'{config.dataset_name}_Metrics_{datetime.now(timezone.utc).strftime("%Y%m%d__%H%M%S")}')

Preprocess the dataset, create a BaseFlowDataset class, and define a postprocessor

from virny.datasets import LawSchoolDataset

data_loader = LawSchoolDataset()
data_loader.X_data[data_loader.X_data.columns[:5]].head()

	decile1b	decile3	lsat	ugpa	zfygpa
0	10.0	10.0	44.0	3.5	1.33
1	5.0	4.0	29.0	3.5	-0.11
2	8.0	7.0	37.0	3.4	0.63
3	8.0	7.0	43.0	3.3	0.67
4	3.0	2.0	41.0	3.3	-0.67

column_transformer = ColumnTransformer(transformers=[
    ('categorical_features', OneHotEncoder(handle_unknown='ignore', sparse_output=False), data_loader.categorical_columns),
    ('numerical_features', StandardScaler(), data_loader.numerical_columns),
])

# Create a binary race column for postprocessing since aif360 postprocessors can postprocess a dataset only based on binary columns.
data_loader.X_data['race_binary'] = data_loader.X_data['race'].apply(lambda x: 1 if x == 'White' else 0)

base_flow_dataset = preprocess_dataset(data_loader=data_loader,
                                       column_transformer=column_transformer,
                                       sensitive_attributes_dct=config.sensitive_attributes_dct,
                                       test_set_fraction=TEST_SET_FRACTION,
                                       dataset_split_seed=DATASET_SPLIT_SEED)
base_flow_dataset.X_train_val['race_binary'] = data_loader.X_data.loc[base_flow_dataset.X_train_val.index, 'race_binary']
base_flow_dataset.X_test['race_binary'] = data_loader.X_data.loc[base_flow_dataset.X_test.index, 'race_binary']

# Define a postprocessor
privileged_groups = [{'race_binary': 1}]
unprivileged_groups = [{'race_binary': 0}]
postprocessor = EqOddsPostprocessing(
    privileged_groups=privileged_groups,
    unprivileged_groups=unprivileged_groups,
    seed=None  # Set postprocessor's seed to None to avoid similar predictions during the bootstrap
)

Tune models and create a models config for metrics computation

tuned_params_df, models_config = tune_ML_models(models_params_for_tuning, base_flow_dataset, config.dataset_name, n_folds=3)
tuned_params_df

2024/06/02, 00:35:52: Tuning LogisticRegression...
2024/06/02, 00:35:54: Tuning for LogisticRegression is finished [F1 score = 0.6563618630035558, Accuracy = 0.8987258083904316]

2024/06/02, 00:35:54: Tuning RandomForestClassifier...
2024/06/02, 00:35:56: Tuning for RandomForestClassifier is finished [F1 score = 0.6538551003755212, Accuracy = 0.8980646712345234]

	Dataset_Name	Model_Name	F1_Score	Accuracy_Score	Model_Best_Params
0	Law_School	LogisticRegression	0.656362	0.898726	{'C': 100, 'max_iter': 250, 'penalty': 'l2', '...
1	Law_School	RandomForestClassifier	0.653855	0.898065	{'max_depth': 10, 'max_features': 0.6, 'min_sa...

now = datetime.now(timezone.utc)
date_time_str = now.strftime("%Y%m%d__%H%M%S")
tuned_df_path = os.path.join(ROOT_DIR, 'results', 'models_tuning', f'tuning_results_{config.dataset_name}_{date_time_str}.csv')
os.makedirs(os.path.dirname(tuned_df_path), exist_ok=True)  # Make sure the output directory exists before saving
tuned_params_df.to_csv(tuned_df_path, sep=",", columns=tuned_params_df.columns, float_format="%.4f", index=False)

Create models_config from the saved tuned_params_df, so that the audited models are built from the stored tuning results:

models_config = create_models_config_from_tuned_params_df(models_params_for_tuning, tuned_df_path)
pprint(models_config)

{'LogisticRegression': LogisticRegression(C=100, max_iter=250, random_state=42, solver='newton-cg'),
 'RandomForestClassifier': RandomForestClassifier(max_depth=10, max_features=0.6, n_estimators=50,
                       random_state=42)}
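
For readers curious how best parameters can be recovered from a saved CSV, here is a minimal sketch using only the standard library. It is not the implementation of create_models_config_from_tuned_params_df; the in-memory CSV is an illustrative stand-in for the saved file, with its single row based on the LogisticRegression output shown above.

```python
import ast
import csv
import io

# Illustrative stand-in for the saved tuning results file.
csv_text = (
    "Dataset_Name,Model_Name,F1_Score,Accuracy_Score,Model_Best_Params\n"
    "Law_School,LogisticRegression,0.6564,0.8987,"
    "\"{'C': 100, 'max_iter': 250, 'penalty': 'l2', 'solver': 'newton-cg'}\"\n"
)

best_params_by_model = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # The parameters are stored as the string form of a dict; literal_eval parses it safely.
    best_params_by_model[row['Model_Name']] = ast.literal_eval(row['Model_Best_Params'])

print(best_params_by_model['LogisticRegression'])
```

With scikit-learn estimators, the parsed dictionary could then be applied with estimator.set_params(**params).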

Subgroup Metric Computation

After the variables are passed to a user interface, the interface uses subgroup analyzers to compute different sets of metrics for each privileged and disadvantaged subgroup. Currently, the library supports the Subgroup Variance Analyzer and the Subgroup Error Analyzer, but it can be extended with other analyzers. When the variance and error analyzers finish computing metrics, their results are combined, returned in a matrix format, and saved to a file if a results directory is specified.
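
To make the idea of subgroup metrics concrete, the sketch below computes a few error metrics per subgroup and a label stability score from bootstrap predictions on toy data. It is an illustration, not the library's analyzers: the function names are hypothetical, and label stability is assumed here to be the normalized absolute difference between the numbers of positive and negative predictions for a sample, averaged over samples.

```python
def confusion_rates(y_true, y_pred):
    """Compute basic error metrics for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        'TPR': tp / (tp + fn) if tp + fn else float('nan'),
        'TNR': tn / (tn + fp) if tn + fp else float('nan'),
        'Accuracy': (tp + tn) / len(y_true),
        'Selection-Rate': (tp + fp) / len(y_true),
    }


def subgroup_metrics(y_true, y_pred, is_dis, attr):
    """Compute the same metrics for the overall, privileged, and disadvantaged groups."""
    masks = {
        'overall': [True] * len(y_true),
        f'{attr}_priv': [not flag for flag in is_dis],
        f'{attr}_dis': list(is_dis),
    }
    result = {}
    for name, mask in masks.items():
        group_true = [t for t, keep in zip(y_true, mask) if keep]
        group_pred = [p for p, keep in zip(y_pred, mask) if keep]
        result[name] = confusion_rates(group_true, group_pred)
    return result


def label_stability(bootstrap_preds):
    """Average |n_pos - n_neg| / n_estimators over samples (assumed definition)."""
    n_estimators = len(bootstrap_preds)
    per_sample = []
    for sample_preds in zip(*bootstrap_preds):
        n_pos = sum(sample_preds)
        n_neg = n_estimators - n_pos
        per_sample.append(abs(n_pos - n_neg) / n_estimators)
    return sum(per_sample) / len(per_sample)


# Toy data: the first four samples belong to the disadvantaged group.
y_true = [1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
is_dis = [True, True, True, True, False, False, False, False]
metrics = subgroup_metrics(y_true, y_pred, is_dis, attr='sex')

# Predictions of four bootstrap estimators for three samples.
stability = label_stability([[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 1, 1]])

print(metrics['sex_dis']['Accuracy'], metrics['sex_priv']['Accuracy'])  # 0.5 1.0
```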

metrics_dct = compute_metrics_with_config(dataset=base_flow_dataset,
                                          config=config,
                                          models_config=models_config,
                                          save_results_dir_path=SAVE_RESULTS_DIR_PATH,
                                          postprocessor=postprocessor,
                                          notebook_logs_stdout=True)

Analyze multiple models:   0%|          | 0/2 [00:00<?, ?it/s]
Enabled a postprocessing mode
Classifiers testing by bootstrap:   0%|          | 0/50 [00:00<?, ?it/s]
Enabled a postprocessing mode
Classifiers testing by bootstrap:   0%|          | 0/50 [00:00<?, ?it/s]

Look at several columns in the top rows of the computed metrics:

sample_model_metrics_df = metrics_dct[list(models_config.keys())[0]]
sample_model_metrics_df[sample_model_metrics_df.columns[:6]].head(20)

	Metric	overall	male_priv	male_priv_correct	male_priv_incorrect	male_dis
0	Jitter	0.044141	0.040939	0.035644	0.094502	0.048374
1	Label_Stability	0.949913	0.953970	0.961893	0.873803	0.944554
2	TPR	0.994903	0.994884	1.000000	0.000000	0.994930
3	TNR	0.078704	0.073394	1.000000	0.000000	0.084112
4	PPV	0.903092	0.913712	1.000000	0.000000	0.889015
5	FNR	0.005097	0.005116	0.000000	1.000000	0.005070
6	FPR	0.921296	0.926606	0.000000	1.000000	0.915888
7	Accuracy	0.899760	0.910051	1.000000	0.000000	0.886161
8	F1	0.946777	0.952572	1.000000	0.000000	0.938995
9	Selection-Rate	0.987260	0.988598	0.992575	0.948357	0.985491
10	Sample_Size	4160.000000	2368.000000	2155.000000	213.000000	1792.000000

Disparity Metric Composition

Metrics Composer is responsible for this second stage of the model audit. Currently, it computes our custom group fairness and stability metrics, and it can be extended with new group metrics. Many new group metrics have appeared over the last decade, but most of them are built on the same subgroup metrics. Separating the computation of subgroup and group metrics therefore lets one experiment with different combinations of subgroup metrics without recomputing them for each new set of group metrics.
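
The relationship between subgroup and group metrics can be checked by hand. The sketch below takes the male_priv and male_dis values for LogisticRegression from the subgroup metrics table shown earlier and composes a few disparity metrics as differences (dis minus priv) or ratios (dis over priv). These formulas are inferred from the numbers in this example rather than taken from the MetricsComposer source, and the compose function is hypothetical.

```python
# Subgroup metrics for LogisticRegression, copied from the table shown earlier.
male_priv = {'Accuracy': 0.910051, 'FPR': 0.926606, 'Selection-Rate': 0.988598,
             'Label_Stability': 0.953970, 'Jitter': 0.040939}
male_dis = {'Accuracy': 0.886161, 'FPR': 0.915888, 'Selection-Rate': 0.985491,
            'Label_Stability': 0.944554, 'Jitter': 0.048374}


def compose(priv, dis):
    """Compose group metrics from subgroup metrics (formulas inferred from this example)."""
    return {
        'Accuracy_Difference': dis['Accuracy'] - priv['Accuracy'],
        'Equalized_Odds_FPR': dis['FPR'] - priv['FPR'],
        'Statistical_Parity_Difference': dis['Selection-Rate'] - priv['Selection-Rate'],
        'Disparate_Impact': dis['Selection-Rate'] / priv['Selection-Rate'],
        'Label_Stability_Ratio': dis['Label_Stability'] / priv['Label_Stability'],
        'Label_Stability_Difference': dis['Label_Stability'] - priv['Label_Stability'],
        'Jitter_Difference': dis['Jitter'] - priv['Jitter'],
    }


composed = compose(male_priv, male_dis)
for name, value in composed.items():
    print(f'{name}: {value:.6f}')
```

The printed values can be compared with the male column of the composed metrics for LogisticRegression.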

models_metrics_dct = read_model_metric_dfs(SAVE_RESULTS_DIR_PATH, model_names=list(models_config.keys()))

metrics_composer = MetricsComposer(models_metrics_dct, config.sensitive_attributes_dct)

Compute composed metrics

models_composed_metrics_df = metrics_composer.compose_metrics()

models_composed_metrics_df

	Metric	male	race	male&race	Model_Name
0	Accuracy_Difference	-0.023890	-0.196227	-0.174183	LogisticRegression
1	Equalized_Odds_FNR	-0.000047	-0.005823	-0.005454	LogisticRegression
2	Equalized_Odds_FPR	-0.010718	0.129278	0.098266	LogisticRegression
3	Jitter_Difference	0.007435	0.034351	0.049795	LogisticRegression
4	Label_Stability_Ratio	0.990130	0.943974	0.924259	LogisticRegression
5	Label_Stability_Difference	-0.009416	-0.053678	-0.072383	LogisticRegression
6	Statistical_Parity_Difference	-0.003107	0.015031	0.013838	LogisticRegression
7	Disparate_Impact	0.996857	1.015261	1.014032	LogisticRegression
8	Equalized_Odds_TNR	0.010718	-0.129278	-0.098266	LogisticRegression
9	Equalized_Odds_TPR	0.000047	0.005823	0.005454	LogisticRegression
10	Accuracy_Difference	-0.020693	-0.158407	-0.134267	RandomForestClassifier
11	Equalized_Odds_FNR	0.004134	0.020908	0.029136	RandomForestClassifier
12	Equalized_Odds_FPR	-0.058218	-0.104439	-0.140207	RandomForestClassifier
13	Jitter_Difference	0.009800	0.093877	0.101423	RandomForestClassifier
14	Label_Stability_Ratio	0.981678	0.844858	0.825698	RandomForestClassifier
15	Label_Stability_Difference	-0.017405	-0.149755	-0.166575	RandomForestClassifier
16	Statistical_Parity_Difference	-0.013514	-0.061529	-0.076446	RandomForestClassifier
17	Disparate_Impact	0.986242	0.937586	0.922193	RandomForestClassifier
18	Equalized_Odds_TNR	0.058218	0.104439	0.140207	RandomForestClassifier
19	Equalized_Odds_TPR	-0.004134	-0.020908	-0.029136	RandomForestClassifier

Metric Visualization

Metric Visualizer allows us to build static visualizations for the computed metrics. It unifies different preprocessing methods for the computed metrics and creates various data formats required for visualizations. Hence, users can simply call methods of the MetricsVisualizer class and get custom plots for diverse metric analysis.

visualizer = MetricsVisualizer(models_metrics_dct, models_composed_metrics_df, config.dataset_name,
                               model_names=list(models_config.keys()),
                               sensitive_attributes_dct=config.sensitive_attributes_dct)

visualizer.create_overall_metrics_bar_char(
    metric_names=['Accuracy', 'F1', 'TPR', 'TNR', 'PPV', 'Selection-Rate'],
    plot_title="Accuracy Metrics"
)

visualizer.create_overall_metrics_bar_char(
    metric_names=['Label_Stability', 'Jitter'],
    plot_title="Stability Metrics"
)

visualizer.create_overall_metric_heatmap(
    model_names=list(models_params_for_tuning.keys()),
    metrics_lst=['Accuracy', 'F1', 'TNR', 'TPR', 'FNR', 'FPR', 'Label_Stability', 'Jitter'],
    tolerance=0.005,
)


visualizer.create_disparity_metric_heatmap(
    model_names=list(models_params_for_tuning.keys()),
    metrics_lst=[
        # Error disparity metrics
        'Equalized_Odds_TPR',
        'Equalized_Odds_FPR',
        'Disparate_Impact',
        # Stability disparity metrics
        'Label_Stability_Difference',
        'Jitter_Difference',
    ],
    groups_lst=config.sensitive_attributes_dct.keys(),
    tolerance=0.005,
)
