Quick start: how to read indices?

# !pip install fairsense

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from deel.fairsense.indices import cvm_indices, sobol_indices, with_confidence_intervals
from deel.fairsense.utils.dataclasses import IndicesInput
from deel.fairsense.utils.fairness_objective import y_pred
from deel.fairsense.visualization import cat_plot, format_with_intervals

The data

In this example we highlight properties of Sobol indices using a very simple distribution: a three-variable Gaussian distribution whose variances and covariances can be controlled.

def gaussian_data_generator(sigma12, sigma13, sigma23, N, var1=1.0, var2=1.0, var3=1.0):
    # Covariance matrix of (X_0, X_1, X_2).
    cov = np.array(
        [[var1, sigma12, sigma13], [sigma12, var2, sigma23], [sigma13, sigma23, var3]]
    )
    x = np.random.multivariate_normal(mean=np.array([0, 0, 0]), cov=cov, size=N)
    # String column names, so that the models below can index x["0"], x["1"], x["2"].
    return pd.DataFrame(x, columns=["0", "1", "2"])
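
Before using the generator, it can help to check that a chosen set of covariances is valid and that the samples reproduce it. The sketch below uses only NumPy; `build_cov` is a hypothetical helper that mirrors the matrix layout used above and is not part of fairsense.

```python
import numpy as np

def build_cov(sigma12, sigma13, sigma23, var1=1.0, var2=1.0, var3=1.0):
    # Hypothetical helper: same matrix layout as gaussian_data_generator.
    return np.array(
        [[var1, sigma12, sigma13], [sigma12, var2, sigma23], [sigma13, sigma23, var3]]
    )

cov = build_cov(sigma12=0.5, sigma13=0.5, sigma23=0.0)

# A covariance matrix must be symmetric positive semi-definite:
# all eigenvalues must be >= 0 (up to rounding).
eigenvalues = np.linalg.eigvalsh(cov)
is_valid = bool(np.all(eigenvalues >= -1e-12))

# Draw samples and compare the empirical covariance with the target.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=50_000)
empirical_cov = np.cov(x, rowvar=False)
max_abs_error = float(np.max(np.abs(empirical_cov - cov)))

print("valid covariance:", is_valid)
print("largest deviation from target:", round(max_abs_error, 3))
```

Not every combination of sigma12, sigma13 and sigma23 yields a valid covariance matrix, so a check of this kind is worth running when the values are changed.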

Intro: Computing the indices

To compute the indices, we first build an IndicesInput object from a dataset, a model and an objective.

model = lambda x: x["0"] # "f(X_0, X_1, X_2) -> X_0"
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)

We can now compute indices using the functions provided in deel.fairsense.indices. Results can be stacked using the + operator.

results = sobol_indices(inputs, n=10**3) + cvm_indices(inputs)
results.values.head()

	S	ST	S_ind	ST_ind	CVM	CVM_indep
0	1.000000	1.000000	0.947373	0.961075	0.993082	0.891087
1	0.001122	0.000267	0.000237	0.000392	0.034323	0.000000
2	0.000000	0.002494	0.001184	0.000971	0.035442	0.000000

We can also wrap an index function to obtain confidence intervals, computed using k-fold splits over the data:

sobol_with_ci = with_confidence_intervals(n_splits=31)(sobol_indices)
results = sobol_with_ci(inputs, n=10**3)
format_with_intervals(results)

100%|██████████| 31/31 [00:02<00:00, 15.19it/s]

	S	ST	S_ind	ST_ind
0	0.98 [0.92, 1.00]	0.99 [0.94, 1.00]	0.93 [0.76, 1.00]	0.95 [0.81, 1.00]
1	0.03 [0.00, 0.12]	0.04 [0.01, 0.17]	0.00 [0.00, 0.00]	0.00 [0.00, 0.01]
2	0.02 [0.00, 0.10]	0.04 [0.00, 0.16]	0.00 [0.00, 0.00]	0.00 [0.00, 0.01]
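
The general idea behind split-based intervals can be sketched without fairsense: evaluate a statistic on several subsets of the data, then report the spread of the values. This illustrates the principle only. How with_confidence_intervals actually splits and aggregates is not described here, so evaluating on each disjoint fold and using empirical quantiles are assumptions, and `split_interval` and `share_of_x0` are hypothetical helpers.

```python
import numpy as np

def split_interval(statistic, data, n_splits=31, level=0.95, seed=0):
    """Hypothetical helper: evaluate `statistic` on `n_splits` disjoint folds
    of `data` and summarize the spread with empirical quantiles."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), n_splits)
    values = np.array([statistic(data[idx]) for idx in folds])
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(values, [alpha, 1.0 - alpha])
    return float(np.median(values)), float(lower), float(upper)

def share_of_x0(d):
    # Var(X_0) / Var(X_0 + X_1): the first-order index of X_0 for the additive
    # model X_0 + X_1 with independent inputs (true value 0.5 here).
    return np.var(d[:, 0]) / np.var(d[:, 0] + d[:, 1])

rng = np.random.default_rng(1)
data = rng.normal(size=(3100, 2))  # two independent standard Gaussian inputs
center, lower, upper = split_interval(share_of_x0, data)
print(f"{center:.2f} [{lower:.2f}, {upper:.2f}]")
```

With more splits each subset gets smaller, so the interval widens: the reported range reflects the variability of the estimator at the subset size, not only at the full sample size.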

How to read the indices

Global sensitivity analysis (GSA) indices quantify how much a variable influences the variance of a function's output. It is important to recall that a variable can be influential in several ways:

a variable can be influential by itself
a variable can be influential because it is correlated with an influential variable
a variable can be influential because of a joint effect with other variables (for example, cold + no salt on the road + rain => slippery road: no variable is influential by itself, but each becomes influential when combined with the others)

There are four Sobol indices (S, ST, S_ind and ST_ind), which together give clues about how a variable is influential.
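
For reference, when the inputs are independent the two classical indices have standard definitions, written here for an output $Y = f(X_0, X_1, X_2)$ with $X_{\sim i}$ denoting all inputs except $X_i$:

$$ S_i = \frac{\operatorname{Var}\left(\mathbb{E}[Y \mid X_i]\right)}{\operatorname{Var}(Y)}, \qquad ST_i = \frac{\mathbb{E}\left[\operatorname{Var}(Y \mid X_{\sim i})\right]}{\operatorname{Var}(Y)} = 1 - \frac{\operatorname{Var}\left(\mathbb{E}[Y \mid X_{\sim i}]\right)}{\operatorname{Var}(Y)} $$

The first-order index $S_i$ measures what $X_i$ explains on its own, while the total index $ST_i$ also counts its interactions with the other inputs, so $S_i \leq ST_i$ in the independent case. With correlated inputs several generalizations of these quantities exist, which is why the S_ind and ST_ind columns are reported as well. Their exact construction is not spelled out on this page, so no formula is given for them here.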

Sobol indices and input distribution

For the function f(X_0, X_1, X_2) -> X_0 + X_1, the Sobol indices show that X_0 and X_1 are equally influential:

model = lambda x: x["0"] + x["1"] + 0.001 * x["2"] # "f(X_0, X_1, X_2) -> X_0 + X_1", plus a negligible 0.001 * X_2 term
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()

However, it is important to recall that Sobol indices account for the input distribution. For instance, when $ var(X_0) = 10 $ and $ var(X_1) = 1 $, the indices reveal that in practice $ X_0 $ is more influential than $ X_1 $:

model = lambda x: x["0"] + x["1"] # "f(X_0, X_1, X_2) -> X_0 + X_1"
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., var1=10, N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()
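
For an additive linear model with independent inputs, the first-order indices have a closed form, $S_i = a_i^2 \, var(X_i) / \sum_j a_j^2 \, var(X_j)$, which makes the effect of the input variances easy to check by hand. The sketch below evaluates this formula for the two settings above and compares it with a pick-freeze Monte Carlo estimate. It assumes independent, centered Gaussian inputs; `linear_sobol` and `pick_freeze_first_order` are hypothetical helpers, not fairsense functions.

```python
import numpy as np

def linear_sobol(coefs, variances):
    """Closed-form first-order indices of sum_i coefs[i] * X_i, independent X_i."""
    contrib = np.asarray(coefs, dtype=float) ** 2 * np.asarray(variances, dtype=float)
    return contrib / contrib.sum()

def pick_freeze_first_order(f, variances, i, n=100_000, seed=0):
    """Monte Carlo estimate of S_i for independent centered Gaussian inputs."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(np.asarray(variances, dtype=float))
    a = rng.normal(size=(n, len(std))) * std
    b = rng.normal(size=(n, len(std))) * std
    b[:, i] = a[:, i]  # freeze X_i, resample every other input
    ya, yb = f(a), f(b)
    return float(np.cov(ya, yb)[0, 1] / np.var(ya, ddof=1))

def f(x):
    # f(X_0, X_1, X_2) = X_0 + X_1, on a plain (n, 3) array
    return x[:, 0] + x[:, 1]

equal = linear_sobol([1, 1, 0], [1.0, 1.0, 1.0])
unequal = linear_sobol([1, 1, 0], [10.0, 1.0, 1.0])
estimate = pick_freeze_first_order(f, [10.0, 1.0, 1.0], i=0)

print("unit variances:", equal.round(3))
print("var(X_0) = 10: ", unequal.round(3))
print("pick-freeze estimate of S_0:", round(estimate, 3))
```

The closed form gives 0.5 for both inputs with unit variances, and 10/11 (about 0.91) versus 1/11 (about 0.09) when var(X_0) = 10: the function is unchanged, only the input distribution moved.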

Sobol indices and correlations

When a variable is influential through correlations, the Sobol indices account for its influence, while the independent Sobol indices (S_ind, ST_ind) do not. This can be observed in this example, where $$ f(X_0, X_1, X_2) = X_0 $$ and $X_1$ and $X_2$ are each correlated with $X_0$ with a correlation coefficient of 0.5.

model = lambda x: x["0"] # "f(X_0, X_1, X_2) -> X_0"
data = gaussian_data_generator(sigma12=0.5, sigma13=0.5, sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()
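
For jointly Gaussian inputs this case can be checked by hand: with $f = X_0$ and unit variances, $E[X_0 \mid X_1] = \rho X_1$, so the classical first-order index $Var(E[Y \mid X_1]) / Var(Y)$ of $X_1$ equals $\rho^2 = 0.25$ even though $X_1$ never enters the function. A minimal NumPy sketch of that check, independent of fairsense (estimating the index by the squared sample correlation is valid here only because the conditional expectation is linear):

```python
import numpy as np

rng = np.random.default_rng(0)
# Same covariance structure as above: X_1 and X_2 each correlated 0.5 with X_0.
cov = np.array([[1.0, 0.5, 0.5], [0.5, 1.0, 0.0], [0.5, 0.0, 1.0]])
x = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=100_000)
y = x[:, 0]  # f(X_0, X_1, X_2) = X_0

# Squared correlation between the output and each input.
first_order = [float(np.corrcoef(y, x[:, i])[0, 1] ** 2) for i in range(3)]
print([round(s, 3) for s in first_order])
```

The values for X_1 and X_2 should land close to 0.25, which is the share of the output variance that each of them can predict purely through its correlation with X_0.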

Sobol indices and joint effects

When a variable is influential through joint effects, the total Sobol indices (ST) account for its influence, while the first-order Sobol indices (S) do not. This can be observed in this example, where: $$ f(X_0, X_1, X_2) = \begin{cases} 10 X_0 & \text{if } X_1 > 0 \text{ and } X_2 > 0 \\ -10 X_0 & \text{otherwise} \end{cases} $$

model = lambda x: x["0"] * (((x["1"] > 0) * (x["2"] > 0) * 20) + -10) #  "f(x) -> 10*X_0 if (X_1 > 0) && (X_2 > 0) else: -10*X_0"
data = gaussian_data_generator(sigma12=0., sigma13=0., sigma23=0., N=10**3)
objective = y_pred
inputs = IndicesInput(model=model, x=data, objective=objective)
results = sobol_indices(inputs, n=10**3)# + cvm_indices(inputs)
cat_plot(results, plot_per="index", kind="bar")
plt.show()
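
The gap between first-order and total indices for this model can be reproduced with a small pick-freeze experiment. This is a sketch under stated assumptions: the inputs are independent standard Gaussians, the total index uses the Jansen estimator, and `sobol_pair` is a hypothetical helper rather than the fairsense implementation.

```python
import numpy as np

def f(x):
    # Same model as above on a plain (n, 3) array:
    # 10 * X_0 if X_1 > 0 and X_2 > 0, else -10 * X_0.
    return x[:, 0] * ((x[:, 1] > 0) * (x[:, 2] > 0) * 20 - 10)

def sobol_pair(f, i, dim=3, n=200_000, seed=0):
    """Return (S_i, ST_i) for independent standard Gaussian inputs."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=(n, dim))
    ya = f(a)
    var_y = np.var(ya, ddof=1)

    # First order: keep X_i from `a`, take every other input from `b`.
    frozen = b.copy()
    frozen[:, i] = a[:, i]
    s = np.cov(ya, f(frozen))[0, 1] / var_y

    # Total (Jansen): keep everything from `a` except X_i, taken from `b`.
    resampled = a.copy()
    resampled[:, i] = b[:, i]
    st = np.mean((ya - f(resampled)) ** 2) / (2.0 * var_y)
    return float(s), float(st)

for i in range(3):
    s, st = sobol_pair(f, i)
    print(f"X_{i}: S = {s:.2f}, ST = {st:.2f}")
```

Analytically, X_0 has S = 0.25 and ST = 1, while X_1 and X_2 each have S = 0 and ST = 0.5: on their own they say nothing about the output, yet they decide the sign of the multiplier applied to X_0.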