Demo 1: Regression problem
Demo: Fairness on regression problems
# !pip install -e fairsense
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from deel.fairsense.data_management.factory import from_numpy, from_pandas
from deel.fairsense.data_management.processing import one_hot_encode
from deel.fairsense.indices.confidence_intervals import with_confidence_intervals
from deel.fairsense.indices.cvm import cvm_indices
from deel.fairsense.indices.standard_metrics import disparate_impact
from deel.fairsense.indices.sobol import sobol_indices
from deel.fairsense.utils.dataclasses import IndicesInput, IndicesOutput
from deel.fairsense.utils.fairness_objective import y_true, squared_error, y_pred
from deel.fairsense.visualization.plots import cat_plot
from deel.fairsense.visualization.text import format_with_intervals
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
# Note: load_boston was removed in scikit-learn 1.2, so this demo needs an older version.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
First, we compute some indices on the training data to see whether the dataset is biased.
The first step consists of building the IndicesInput object that stores the data. Setting the target to y_true means that we analyse the data itself; it can instead be set to y_pred to analyse the predictions, or to squared_error to analyse the error. This parameter can be changed afterwards.
data = load_boston()
# construct IndicesInput object
indices_inputs = from_numpy(
x=data.data,
y=data.target,
feature_names=data.feature_names,
target=y_true,
)
indices_inputs.x.head()
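To make the three choices of target concrete, here is a minimal sketch of what each one could compute. The function names, their signatures and the toy model are assumptions made for illustration; this is not the fairsense implementation.

```python
import numpy as np

# Hypothetical stand-ins for the three targets. Names and signatures are
# assumptions for illustration, not the fairsense API.
def target_y_true(model, x, y):
    """Analyse the data: the quantity of interest is the label itself."""
    return np.asarray(y, dtype=float)

def target_y_pred(model, x, y):
    """Analyse the predictions of the model."""
    return np.asarray(model(x), dtype=float)

def target_squared_error(model, x, y):
    """Analyse the error: per-sample squared difference."""
    residual = np.asarray(y, dtype=float) - np.asarray(model(x), dtype=float)
    return residual ** 2

# Toy model (assumption): predicts the sum of the features.
def toy_model(x):
    return x.sum(axis=1)

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([2.0, 9.0])

labels = target_y_true(toy_model, x, y)
predictions = target_y_pred(toy_model, x, y)
errors = target_squared_error(toy_model, x, y)
```

All three return one value per sample, which is why the same sensitivity indices can be computed on any of them.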
We can then apply preprocessing such as one-hot encoding.
# apply one hot encoding
indices_inputs = one_hot_encode(indices_inputs, ["CHAS", "RAD"])
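As a reminder of what one-hot encoding does to a categorical column, here is a small NumPy sketch. The helper one_hot and the sample values are hypothetical; the sketch says nothing about how one_hot_encode stores or groups the resulting columns.

```python
import numpy as np

def one_hot(column):
    """Return the sorted categories and one indicator column per category."""
    column = np.asarray(column)
    categories = np.unique(column)
    # Compare every entry with every category: shape (n_samples, n_categories).
    encoded = (column[:, None] == categories[None, :]).astype(int)
    return categories, encoded

# Made-up values for a categorical column such as RAD.
rad = np.array([1, 4, 24, 4, 1])
categories, encoded = one_hot(rad)
```

Each row of encoded contains exactly one 1, in the column of the category that the sample belongs to.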
Indices computation: CVM
As this is a regression problem, we use the CVM (Cramér-von Mises) indices for the sensitivity analysis.
We then call the index computation function. The results are stored in an IndicesOutput object, whose raw values can be accessed with .values. Note that 0 means total independence and 1 means total dependence.
indices_outputs = cvm_indices(indices_inputs)
indices_outputs.values
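For intuition about this 0-to-1 scale, the sketch below computes Chatterjee's rank correlation, a rank-based dependence measure with a similar reading: close to 0 for independent samples and close to 1 when one variable is a function of the other. It is not the estimator used by cvm_indices; the function name and the toy data are assumptions, and the formula used here assumes there are no ties.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation of y given x (no-ties formula)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    order = np.argsort(x)                 # sort the samples by x
    y_sorted = y[order]
    # Rank of each y value (1..n), in the order given by increasing x.
    ranks = np.argsort(np.argsort(y_sorted)) + 1
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n ** 2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)

# y is a (non-monotonic) function of x: strong dependence.
xi_dependent = chatterjee_xi(x, x ** 2)
# y is drawn independently of x: no dependence.
xi_independent = chatterjee_xi(x, rng.uniform(-1.0, 1.0, size=2000))
```

Note that the dependent case uses y = x², which a linear correlation coefficient would miss; this is the kind of relationship that dependence-based indices are meant to capture.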
We can now plot these easily using the appropriate function from the visualization module. The two main parameters are plot_per and kind:
- plot_per (str): can be either variable or index. When set to variable, there is one graph per variable, each showing the values of all indices. Conversely, when set to index, there is one graph per index, each showing the values for all variables.
- kind (str): the kind of visualization to produce; one of strip, swarm, box, violin, boxen, point, bar.
Feel free to play with it!
cat_plot(indices_outputs, plot_per="index", kind="bar")
plt.show()
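One way to picture the two plot_per layouts is as two groupings of the same table of values. The sketch below uses made-up numbers and a hypothetical helper panels; it does not reproduce how cat_plot is implemented.

```python
# Toy index values (made-up numbers, for illustration only).
values = {
    "CRIM": {"CVM": 0.30, "CVM_indep": 0.10},
    "RM": {"CVM": 0.50, "CVM_indep": 0.35},
}

def panels(values, plot_per):
    """Return {panel title: {bar label: value}} for the chosen layout."""
    if plot_per == "variable":
        # One panel per variable, one bar per index.
        return {var: dict(idx) for var, idx in values.items()}
    if plot_per == "index":
        # One panel per index, one bar per variable.
        out = {}
        for var, idx in values.items():
            for name, value in idx.items():
                out.setdefault(name, {})[var] = value
        return out
    raise ValueError("plot_per must be 'variable' or 'index'")

per_variable = panels(values, "variable")
per_index = panels(values, "index")
```

Grouping per index makes it easy to rank variables for a given index, while grouping per variable makes it easy to compare indices for a given variable.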
Confidence intervals
It is also possible to decorate any index function with with_confidence_intervals, which uses bootstrapping to compute confidence intervals. We can also use the + operator to compute multiple indices simultaneously. Results with confidence intervals can be visualized either textually with format_with_intervals or graphically with cat_plot.
cvm_with_ci = with_confidence_intervals(n_splits=10)(cvm_indices)
indices_outputs_ci = cvm_with_ci(indices_inputs)
format_with_intervals(indices_outputs_ci, quantile=0.05)
cat_plot(indices_outputs_ci, plot_per="index", kind="box")
plt.show()
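The idea behind bootstrapped intervals can be sketched in a few lines: resample the rows with replacement, recompute the statistic on each resample, and read the interval off the quantiles of the results. The helpers bootstrap_runs and interval below are hypothetical and use a plain correlation as the statistic; the exact resampling scheme used by with_confidence_intervals may differ.

```python
import numpy as np

def bootstrap_runs(statistic, x, y, n_splits=10, seed=0):
    """Recompute statistic(x, y) on n_splits resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    runs = []
    for _ in range(n_splits):
        idx = rng.integers(0, n, size=n)   # row indices, with replacement
        runs.append(statistic(x[idx], y[idx]))
    return np.array(runs)

def interval(runs, quantile=0.05):
    """Lower and upper bounds taken at quantile and 1 - quantile."""
    return float(np.quantile(runs, quantile)), float(np.quantile(runs, 1 - quantile))

def correlation(x, y):
    return float(np.corrcoef(x, y)[0, 1])

# Toy data (assumption): y depends linearly on x, plus noise.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

runs = bootstrap_runs(correlation, x, y, n_splits=200)
low, high = interval(runs, quantile=0.05)
```

More resamples give a smoother estimate of the interval at the cost of recomputing the statistic more often, which is the trade-off behind a parameter such as n_splits.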
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
Similarly, we build the IndicesInput objects:
indices_inputs_train = from_numpy(
x=X_train,
y=y_train,
feature_names=data.feature_names,
)
indices_inputs_test = from_numpy(
x=X_test,
y=y_test,
feature_names=data.feature_names,
)
Then we train a basic model: a random forest. Note that this analysis can be applied to any callable that accepts NumPy arrays as inputs.
model = RandomForestRegressor(250, max_depth=5, min_samples_leaf=3)
model.fit(indices_inputs_train.x, indices_inputs_train.y_true)
train_score = model.score(indices_inputs_train.x, indices_inputs_train.y_true)
val_score = model.score(indices_inputs_test.x, indices_inputs_test.y_true)
print(f"train score: {train_score}, val score {val_score}")
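To illustrate what "any callable" can mean here, the sketch below shows two options: a plain function, and the bound predict method of an object with a scikit-learn style interface. Both models and their coefficients are made up for illustration.

```python
import numpy as np

# Hypothetical hand-written model: any function mapping an (n, d) array
# to n predictions is a callable in this sense.
WEIGHTS = np.array([0.5, -1.0, 2.0])  # made-up coefficients

def linear_model(x):
    return np.asarray(x) @ WEIGHTS

class MeanPredictor:
    """Toy estimator with a scikit-learn style fit/predict interface."""

    def fit(self, x, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x):
        return np.full(len(x), self.mean_)

x = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
y = np.array([3.0, -1.0])

callable_from_function = linear_model
callable_from_estimator = MeanPredictor().fit(x, y).predict  # bound method

pred_a = callable_from_function(x)
pred_b = callable_from_estimator(x)
```

Passing a bound method such as estimator.predict is what lets a fitted estimator be used wherever a plain function is expected.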
Compute indices
We set the model and the objective:
indices_inputs_train.model = model.predict
indices_inputs_train.objective = y_pred
indices_inputs_test.model = model.predict
indices_inputs_test.objective = y_pred
cvm_with_ci = with_confidence_intervals(n_splits=10)(cvm_indices)
sobol_with_ci = with_confidence_intervals(n_splits=10)(sobol_indices)
indices_outputs_train = cvm_with_ci(indices_inputs_train) + sobol_with_ci(indices_inputs_train)
format_with_intervals(indices_outputs_train, quantile=0.1)
cat_plot(indices_outputs_train, plot_per="variable", kind="box", col_wrap=4)
plt.show()
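For readers new to Sobol indices, the sketch below estimates first-order indices with a basic pick-freeze scheme on a toy function. It assumes independent, uniformly distributed inputs and a hypothetical helper first_order_sobol; sobol_indices may rely on a different estimator and different assumptions.

```python
import numpy as np

def first_order_sobol(f, n_features, n_samples=100_000, seed=0):
    """Pick-freeze estimate of first-order Sobol indices (independent inputs)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(-1.0, 1.0, size=(n_samples, n_features))
    b = rng.uniform(-1.0, 1.0, size=(n_samples, n_features))
    f_a = f(a)
    variance = f_a.var()
    indices = np.empty(n_features)
    for i in range(n_features):
        c = b.copy()
        c[:, i] = a[:, i]   # keep ("freeze") feature i, resample all others
        f_c = f(c)
        # f_a and f_c share only feature i, so their covariance estimates
        # the variance explained by that feature alone.
        covariance = np.mean(f_a * f_c) - f_a.mean() * f_c.mean()
        indices[i] = covariance / variance
    return indices

# Toy function (assumption): the second feature weighs twice as much.
def toy_function(x):
    return x[:, 0] + 2.0 * x[:, 1]

sobol = first_order_sobol(toy_function, n_features=2)
```

For this toy function the analytical first-order indices are 0.2 and 0.8, since the second term contributes four times the variance of the first; the estimates should land close to those values.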
Compare indices from target=y_true with indices from target=y_pred
These results are interesting, but we would like to compare them with the indices obtained with target=y_true.
# copy so that adding columns does not modify indices_outputs_ci.runs in place
merged_indices = indices_outputs_ci.runs.copy()
merged_indices[["CVM_model", "CVM_indep_model"]] = indices_outputs_train.runs[["CVM", "CVM_indep"]]
merged_indices = IndicesOutput(merged_indices[["CVM_model", "CVM", "CVM_indep_model", "CVM_indep"]])
cat_plot(merged_indices, plot_per="variable", kind="box", col_wrap=4)
plt.show()
As we can see, the model tends to increase the influence of many variables.
III) Analysis of the sensitivity of the error
Now we want to see whether some variables influence the error of the model.
indices_inputs_train.objective = squared_error
indices_inputs_test.objective = squared_error
cvm_with_ci = with_confidence_intervals(n_splits=30)(cvm_indices)
indices_outputs_error_test = cvm_with_ci(indices_inputs_test)
format_with_intervals(indices_outputs_error_test, quantile=0.1)
cat_plot(indices_outputs_error_test, plot_per="variable", kind="box", col_wrap=4)
plt.show()
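Why analyse the error separately from the predictions? A model can be unbiased on average while its error is unevenly distributed across groups. The toy example below, built entirely on synthetic data with a hypothetical binary attribute, shows a squared error that depends strongly on a variable that the predictions do not use at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x1 = rng.normal(size=n)
group = rng.integers(0, 2, size=n)             # hypothetical binary attribute
noise_scale = np.where(group == 1, 2.0, 0.5)   # group 1 has noisier labels
y = x1 + noise_scale * rng.normal(size=n)

def model(x1):
    """Toy model: predicts the conditional mean of y and ignores group."""
    return x1

residual = y - model(x1)
squared_error = residual ** 2

mean_residual = residual.mean()                      # close to 0: no average bias
mse_group_0 = squared_error[group == 0].mean()       # about 0.5 ** 2
mse_group_1 = squared_error[group == 1].mean()       # about 2.0 ** 2
```

An analysis of the predictions alone would find no influence of group here, whereas an analysis of the squared error reveals it.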