Quick start: how to read indices?¶
# !pip install fairsense
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from deel.fairsense.indices import cvm_indices, sobol_indices, with_confidence_intervals
from deel.fairsense.utils.dataclasses import IndicesInput
from deel.fairsense.utils.fairness_objective import y_pred
from deel.fairsense.visualization import cat_plot, format_with_intervals
The data¶
In this example we will highlight the properties of Sobol indices using a very simple distribution: a three-variable Gaussian distribution whose variances and covariances can be controlled.
def gaussian_data_generator(sigma12, sigma13, sigma23, N, var1=1.0, var2=1.0, var3=1.0):
    # Covariance matrix: variances on the diagonal, covariances off the diagonal.
    cov = np.array(
        [[var1, sigma12, sigma13], [sigma12, var2, sigma23], [sigma13, sigma23, var3]]
    )
    # Draw N samples from a zero-mean Gaussian with that covariance.
    x = np.random.multivariate_normal(mean=np.array([0, 0, 0]), cov=cov, size=N)
    return pd.DataFrame(x, columns=[0, 1, 2])
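As a sanity check on the generator above, the following sketch draws samples with the same kind of covariance structure and compares the empirical covariance with the requested one. It uses a seeded `np.random.default_rng` and a large sample size purely for reproducibility; both are illustrative choices made here, whereas the function above relies on the global `np.random` state.

```python
import numpy as np

# Target covariance: unit variances, cov(X_0, X_1) = cov(X_0, X_2) = 0.5.
target_cov = np.array(
    [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
)

# Seeded generator so the check is reproducible (illustrative choice).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(3), cov=target_cov, size=100_000)

# Rows are observations, so tell np.cov that the variables are in columns.
empirical_cov = np.cov(samples, rowvar=False)
max_abs_error = np.abs(empirical_cov - target_cov).max()

print(empirical_cov.round(2))
print("largest deviation from target:", max_abs_error)
```

With unit variances, the covariance arguments `sigma12`, `sigma13` and `sigma23` coincide with correlation coefficients, which is what the later examples rely on.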
Intro: Computing the indices¶
To compute the indices, we must first build an IndicesInput object, which is done by providing a dataset, a model and an objective.
model = lambda x: x["0"] # "f(X_0, X_1, X_2) -> X_0"
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
We can now compute indices using the functions provided in deel.fairsense.indices. Results can be stacked using the + operator.
results = sobol_indices(inputs, n=10**3) + cvm_indices(inputs)
results.values.head()
We can also wrap the usual index functions to compute confidence intervals (estimated using k-fold splits of the data).
sobol_with_ci = with_confidence_intervals(n_splits=31)(sobol_indices)
results = sobol_with_ci(inputs, n=10**3)
format_with_intervals(results)
How to read the indices¶
GSA indices quantify how much a variable influences the variance of the output of a function. It is important to recall that a variable can be influential in several ways:
- a variable can be influential by itself
- a variable can be influential because it is correlated with an influential variable
- a variable can be influential because of a joint effect with another variable (for example, cold + no salt on road + rain => slippery road: each variable is not influential by itself but becomes influential when combined with the others)
There are four Sobol indices, which together give clues about how a variable is influential.
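For reference, the two classical quantities behind these indices can be written as below. These are the standard textbook definitions, not formulas taken from the library's documentation; the notation is an assumption made here, with $Y = f(X_0, X_1, X_2)$ and $X_{\sim i}$ denoting every input except $X_i$.

```latex
% First-order index: share of the output variance explained by X_i alone.
S_i = \frac{\operatorname{Var}\left(\mathbb{E}[Y \mid X_i]\right)}{\operatorname{Var}(Y)}

% Total index: share of the output variance that remains
% when every input except X_i is fixed.
ST_i = \frac{\mathbb{E}\left[\operatorname{Var}(Y \mid X_{\sim i})\right]}{\operatorname{Var}(Y)}
     = 1 - \frac{\operatorname{Var}\left(\mathbb{E}[Y \mid X_{\sim i}]\right)}{\operatorname{Var}(Y)}
```

When the inputs are independent, $0 \leq S_i \leq ST_i \leq 1$, and the gap $ST_i - S_i$ measures the part of the influence of $X_i$ that comes from joint effects with other variables.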
Sobol indices and input distribution¶
For the function f(X_0, X_1, X_2) -> X_0 + X_1, the Sobol indices show that X_0 and X_1 are equally influential:
model = lambda x: x["0"] + x["1"] + 0.001 * x["2"] # "f(X_0, X_1, X_2) -> X_0 + X_1" (plus a negligible 0.001 * X_2 term)
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()
However, it is important to recall that Sobol indices account for the input distribution. For instance, when $ var(X_0) = 10 $ and $ var(X_1) = 1 $, the indices reveal that in practice $ X_0 $ is more influential than $ X_1 $:
model = lambda x: x["0"] + x["1"] # "f(X_0, X_1, X_2) -> X_0 + X_1"
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., var1=10, N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()
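The effect of the input variances can be checked by hand. For an additive model with independent inputs, $E[Y \mid X_i]$ equals $X_i$ plus a constant, so the classical first-order index reduces to $var(X_i) / var(Y)$, here $10/11$ for $X_0$ and $1/11$ for $X_1$. The sketch below verifies this with plain NumPy; it does not call the library, and the seed and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent Gaussian inputs with var(X_0) = 10 and var(X_1) = var(X_2) = 1.
variances = np.array([10.0, 1.0, 1.0])
x = rng.normal(scale=np.sqrt(variances), size=(200_000, 3))

# f(X_0, X_1, X_2) = X_0 + X_1
y = x[:, 0] + x[:, 1]

# Variance share of each active input in the output variance.
shares = np.array([np.var(x[:, 0]), np.var(x[:, 1])]) / np.var(y)
expected = np.array([10.0, 1.0]) / 11.0

print("empirical shares:", shares.round(3))
print("expected shares: ", expected.round(3))
```

The function itself treats $X_0$ and $X_1$ symmetrically; the difference in the indices comes entirely from the input distribution.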
Sobol indices and correlations¶
When a variable is influential through correlations, the Sobol indices account for its influence, while the independent Sobol indices do not. This can be observed in this example where: $$ f(X_0, X_1, X_2) = X_0 $$ and \(X_1\) and \(X_2\) are both correlated with \(X_0\) with a correlation coefficient of 0.5.
model = lambda x: x["0"] # "f(X_0, X_1, X_2) -> X_0"
data = gaussian_data_generator(sigma12=0.5, sigma13=0.5, sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()
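To see why a correlated variable picks up a non-zero index even though the function ignores it, note that for jointly Gaussian variables with unit variance, $E[X_0 \mid X_1] = \rho X_1$, so the classical first-order index of $X_1$ is $\rho^2 = 0.25$. The sketch below estimates this with a least-squares fit. It is a NumPy-only illustration under the Gaussian assumption, not the estimator used by the library, and the helper name `first_order_gaussian` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same covariance structure as the example above: corr(X_0, X_1) = corr(X_0, X_2) = 0.5.
cov = np.array([[1.0, 0.5, 0.5], [0.5, 1.0, 0.0], [0.5, 0.0, 1.0]])
x = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=200_000)
y = x[:, 0]  # f(X_0, X_1, X_2) = X_0


def first_order_gaussian(xi, y):
    # Hypothetical helper: for jointly Gaussian (xi, y), E[y | xi] is linear
    # in xi, so a least-squares line recovers the conditional expectation.
    slope, intercept = np.polyfit(xi, y, deg=1)
    return np.var(slope * xi + intercept) / np.var(y)


indices = np.array([first_order_gaussian(x[:, i], y) for i in range(3)])
print(indices.round(3))
```

The values for $X_1$ and $X_2$ should land close to 0.25 although neither appears in the function, which is the influence through correlation described above.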
Sobol indices and joint effects¶
When a variable is influential through joint effects, the total Sobol indices account for its influence, while the plain Sobol indices do not. This can be observed in this example where: $$ f(X_0, X_1, X_2) = \begin{cases} 10 X_0 & \text{if } X_1 > 0 \text{ and } X_2 > 0 \\ -10 X_0 & \text{otherwise} \end{cases} $$
model = lambda x: x["0"] * (((x["1"] > 0) * (x["2"] > 0) * 20) + -10) # "f(x) -> 10*X_0 if (X_1 > 0) && (X_2 > 0) else -10*X_0"
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()
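The gap between first-order and total indices can be reproduced with a small pick-freeze Monte Carlo estimate written in plain NumPy. This is a sketch using standard Saltelli/Jansen-style estimators for independent inputs, not necessarily what `sobol_indices` does internally. The model mirrors the lambda above, and the seed and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000


def model(x):
    # 10 * X_0 when X_1 > 0 and X_2 > 0, otherwise -10 * X_0
    return x[:, 0] * ((x[:, 1] > 0) * (x[:, 2] > 0) * 20 - 10)


# Two independent sample matrices of independent standard Gaussian inputs.
a = rng.standard_normal((n, 3))
b = rng.standard_normal((n, 3))
y_a, y_b = model(a), model(b)
var_y = np.var(np.concatenate([y_a, y_b]))

first_order, total = np.empty(3), np.empty(3)
for i in range(3):
    ab_i = a.copy()
    ab_i[:, i] = b[:, i]  # matrix A with column i taken from B
    y_ab = model(ab_i)
    # First-order estimate: covariance between outputs that share only X_i.
    first_order[i] = np.mean(y_b * (y_ab - y_a)) / var_y
    # Total estimate: half the mean squared change when only X_i is resampled.
    total[i] = 0.5 * np.mean((y_a - y_ab) ** 2) / var_y

print("first-order:", first_order.round(2))
print("total:      ", total.round(2))
```

For this model the first-order indices of $X_1$ and $X_2$ are zero in theory, because $E[Y \mid X_1] = E[X_0] \cdot E[g \mid X_1] = 0$ (where $g$ is the $\pm 10$ multiplier), while their total indices are about 0.5: the two variables matter only in combination with $X_0$ and each other.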