
indices module

Statistical Parity - Disparate Impact - Demographic Parity

The rate of positive (value-1) predictions must be equal across groups: the predictor is independent of the protected variable.

  • S binary
    \(P(f(X)=1|S=0) = P(f(X)=1|S=1)\)

  • S continuous or discrete
    \(P(f(X)=1|S) = P(f(X)=1)\)
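For intuition only, here is a minimal sketch (not part of the fairsense API) estimating the binary-S criterion from a sample of predictions; the column names `y_pred` and `s` are hypothetical.

```python
# Minimal sketch: estimate P(f(X)=1 | S=s) from a sample of predictions.
# "y_pred" and "s" are hypothetical column names, not fairsense conventions.
import pandas as pd

df = pd.DataFrame({
    "y_pred": [1, 0, 1, 1, 0, 1, 0, 0],   # binary predictions f(X)
    "s":      [0, 0, 0, 0, 1, 1, 1, 1],   # binary sensitive variable S
})

rates = df.groupby("s")["y_pred"].mean()  # P(f(X)=1 | S=s) for each group
print(rates)                              # statistical parity holds when the rates are equal
print(rates[1] / rates[0])                # classical disparate impact ratio
```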

Avoiding Disparate Treatment

The probability that an input leads to prediction 1 should be equal regardless of the value of the sensitive variable.

  • S binary
    \(P(f(X)=1|X_S=x,S=0) = P(f(X)=1|X_S=x,S=1)\)
    where \(X_S\) represents \(X\) without the sensitive variable.
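As a hedged illustration of this criterion (not a fairsense function), one can check that predictions are unchanged when only the sensitive column is flipped; `model` is assumed to expose a scikit-learn-style `predict` method.

```python
# Sketch: fraction of rows whose prediction changes when the binary sensitive
# column is flipped while X_S is held fixed (0 for a model that ignores S).
import numpy as np
import pandas as pd

def disparate_treatment_rate(model, x: pd.DataFrame, s_col: str) -> float:
    x_flipped = x.copy()
    x_flipped[s_col] = 1 - x_flipped[s_col]  # flip the binary sensitive variable
    return float(np.mean(model.predict(x) != model.predict(x_flipped)))
```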

Equality of Odds

The rates of true and false predictions from the different groups must be equal. Independence between the error of the model and the protected variable.

  • S binary
    \(P(f(X)=1|Y=i,S=0) = P(f(X)=1|Y=i,S=1), \quad i=0,1\)

  • S general
    \(P(f(X)=1|Y=i,S) = P(f(X)=1|Y=i), \quad i=0,1\)
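For intuition, the criterion can be estimated by conditioning the positive-prediction rate on both the label and the group; this sketch uses hypothetical inputs and is not part of the fairsense API.

```python
# Sketch: table of P(f(X)=1 | Y=i, S=s), one row per label i and one column
# per group s; equality of odds holds when each row is constant across groups.
import pandas as pd

def odds_table(y_true, y_pred, s) -> pd.DataFrame:
    df = pd.DataFrame({"y": y_true, "pred": y_pred, "s": s})
    return df.groupby(["y", "s"])["pred"].mean().unstack("s")
```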

Avoiding Disparate Mistreatment

The probability that a prediction is false should be equal regardless of the value of the sensitive variable.

  • S binary
    \(P(f(X)\ne Y|S=1) = P(f(X)\ne Y|S=0)\)
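Again as a sketch only (hypothetical column names): the criterion compares misclassification rates across groups.

```python
# Sketch: P(f(X) != Y | S=s) for each group; disparate mistreatment is
# avoided when these error rates coincide.
import pandas as pd

df = pd.DataFrame({
    "y_true": [1, 0, 1, 0, 1, 0],
    "y_pred": [1, 1, 0, 0, 1, 0],
    "s":      [0, 0, 0, 1, 1, 1],
})
error_by_group = (df["y_true"] != df["y_pred"]).groupby(df["s"]).mean()
print(error_by_group)
```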

Global Sensitivity Analysis

GSA is used to quantify the influence of a set of features on the outcome.
Sobol' indices are based on correlations and need access to the function, while the Cramer-Von Mises (CVM) indices are rank-based and need only a sample of evaluations.

Sobol' indices
Four indices that quantify how much of the output variance can be explained by the variance of \(X_i\).

|  | Correlation Between Variables | Joined Contributions |
| --- | --- | --- |
| \(Sob_i\) | ✔️ |  |
| \(SobT_i\) | ✔️ | ✔️ |
| \(Sob_i^{ind}\) |  |  |
| \(SobT_i^{ind}\) |  | ✔️ |
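To give intuition on these indices, here is a toy computation (independent of fairsense) of a first-order index on a function whose variance decomposition is known in closed form.

```python
# Toy illustration: for y = x1 + x2 with independent inputs, the first-order
# index of x1 is Var(E[y | x1]) / Var(y) = Var(x1) / (Var(x1) + Var(x2)).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(scale=1.0, size=100_000)
x2 = rng.normal(scale=2.0, size=100_000)
y = x1 + x2

s1 = np.var(x1) / np.var(y)  # E[y | x1] = x1 here, so this estimates Sob_1
print(s1)                    # close to 1 / (1 + 4) = 0.2
```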

Cramer-Von Mises' indices
The two CVM indices are an extension of the Sobol' indices that quantifies more than just the second-order influence of the inputs on the output.
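As a rough illustration of what "rank-based, needing only a sample of evaluations" means, the sketch below computes Chatterjee's rank coefficient on (x, y) pairs; fairsense's cvm_indices uses its own estimator, so this is an analogy rather than the library's implementation.

```python
# Sketch: Chatterjee's xi coefficient, a rank-based dependence measure that
# only needs a sample of (x, y) pairs (no access to the function itself).
# Shown for intuition; this is not the estimator used by cvm_indices.
import numpy as np

def chatterjee_xi(x: np.ndarray, y: np.ndarray) -> float:
    n = len(x)
    order = np.argsort(x)                     # sort the pairs by x
    ranks = np.argsort(np.argsort(y[order]))  # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n ** 2 - 1)
```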

For further details about GSA in Fairness

Use-case Recap

|  | Disparate Impact | Avoiding Disparate Treatment | Equality of Odds | Avoiding Disparate Mistreatment | Sobol' indices | Cramer-Von Mises' indices |
| --- | --- | --- | --- | --- | --- | --- |
| S binary | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| S discrete | ✔️ |  | ✔️ |  | ✔️ | ✔️ |
| S continuous | ✔️ |  | ✔️ |  | ✔️ | ✔️ |

disparate_impact(index_input, group_reduction=np.mean)

Compute the disparate impact.

Warning

Disparate impact/equality of odds can only be computed on classification problems, and on categorical variables. Continuous variables are dropped and their output is replaced by np.nan.

Note

When applied with target=classification_error, this function computes the equality of odds.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| index_input | IndicesInput | The fairness problem to study. | required |
| group_reduction |  | The method used to compute the indices for a group of variables. By default the average of the values of each group is applied. | np.mean |

Returns:

| Type | Description |
| --- | --- |
| IndicesOutput | IndicesOutput object, containing the DI indices, one line per variable group and one column for each index. |

Source code in deel\fairsense\indices\standard_metrics.py
def disparate_impact(
    index_input: IndicesInput, group_reduction=np.mean
) -> IndicesOutput:
    """
    Compute the disparate impact.

    Warning:
        disparate impact/equality of odds can only be computed on classification
        problems, and on categorical variables. Continuous variables are dropped and
        their output is replaced by `np.nan`.

    Note:
         When applied with `target=classification_error` this function computes the
         equality of odds.

    Args:
        index_input (IndicesInput): The fairness problem to study.
        group_reduction: the method used to compute the indices for a group of
            variables. By default the average of the values of each group is applied.

    Returns:
        IndicesOutput object, containing the DI indices, one line per variable group
        and one column for each index.

    """
    df = index_input.x
    y = index_input.compute_objective()
    df["outputs"] = y.values if hasattr(y, "values") else y
    dis = []
    for group in index_input.variable_groups:
        group_output = []
        for var in group:
            group_output.append(_disparate_impact_single_variable(df, var))
        dis.append(group_reduction(group_output))
    data = np.expand_dims(np.array(dis), axis=-1)
    index = index_input.merged_groups
    results = pd.DataFrame(data=data, columns=["DI"], index=index)
    return IndicesOutput(results)
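A hedged usage sketch, not taken from the fairsense documentation: the import path is inferred from the "Source code in deel\fairsense\indices\standard_metrics.py" line above, and the IndicesInput arguments mirror the constructor call visible in the with_confidence_intervals source further down; my_model, x_df, y_series and the variable groups are placeholders.

```python
# Usage sketch only; paths and argument values are assumptions (see above).
from deel.fairsense.indices.standard_metrics import disparate_impact

inputs = IndicesInput(                      # IndicesInput import path not shown on this page
    model=my_model.predict,                 # hypothetical fitted model
    x=x_df,                                 # pandas DataFrame of features
    y_true=y_series,                        # ground-truth labels
    variable_groups=[["gender"], ["age"]],  # hypothetical grouping of columns
)
out = disparate_impact(inputs)              # IndicesOutput with a single "DI" column
print(out.values)                           # underlying DataFrame (see with_confidence_intervals source)
```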

sobol_indices(inputs, n=1000, N=None)

Compute all Sobol' indices for all variables.

Warning

This index may fail silently if all values of one variable are identical (constant), which may occur when applying one-hot encoding with a large number of splits.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| inputs | IndicesInput | The fairness problem to study. | required |
| n |  | Number of samples used to compute the Sobol' indices. | 1000 |
| N |  | Number of samples used to compute the marginals. | None |

Returns:

| Type | Description |
| --- | --- |
| IndicesOutput | IndicesOutput object, containing the Sobol' indices, one line per variable group and one column for each index. |

Source code in deel\fairsense\indices\sobol.py
def sobol_indices(inputs: IndicesInput, n=1000, N=None) -> IndicesOutput:
    """
    Compute all Sobol' indices for all variables.

    Warning:
        this index may fail silently if all values of one variable are identical
        (constant), which may occur when applying one-hot encoding with a large
        number of splits.

    Args:
        inputs (IndicesInput): The fairness problem to study.
        n: number of samples used to compute the Sobol' indices
        N: number of samples used to compute the marginals

    Returns:
        IndicesOutput object, containing the Sobol' indices, one line per variable
        group and one column for each index.

    """
    x = inputs.x
    cov = (x + np.random.normal(scale=1e-5, size=x.shape)).cov()
    orig_cols = []
    for group in inputs.variable_groups:
        orig_cols += group
    f_inv = _compute_marginal_inv_cumul_dist(x[orig_cols].values, N)
    sobol_table = []
    for i in range(len(inputs.variable_groups)):
        sobol_table.append(
            _sobol_indices_at_i(
                inputs.compute_objective, i, inputs.variable_groups, n, cov, f_inv
            )
        )
    sobol_table = np.vstack(sobol_table)
    sobol_table[:, 2:] = np.roll(sobol_table[:, 2:], -1, axis=0)
    return IndicesOutput(
        pd.DataFrame(
            data=sobol_table,
            index=inputs.merged_groups,
            columns=["S", "ST", "S_ind", "ST_ind"],
        )
    )

cvm_indices(index_input)

Compute the CVM indices of a fairness problem. Set FairnessProblem.result as a Dataframe containing the indices.

Warning

This index may fail silently if all values of one variable are identical (constant), which may occur when applying one-hot encoding with a large number of splits. It may also yield erroneous results when used without enough data, which might occur when used with confidence intervals.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| index_input | IndicesInput | The fairness problem to study. | required |

Returns:

| Type | Description |
| --- | --- |
| IndicesOutput | IndicesOutput object, containing the CVM indices, one line per variable group and one column for each index. |

Source code in deel\fairsense\indices\cvm.py
def cvm_indices(index_input: IndicesInput) -> IndicesOutput:
    """Compute the CVM indices of a fairness problem.
    Set FairnessProblem.result as a Dataframe containing the indices.

    Warning:
        this index may fail silently if all values of one variable are identical
        (constant), which may occur when applying one-hot encoding with a large
        number of splits. It may also yield erroneous results when used without
        enough data, which might occur when used with confidence intervals.

    Args:
        index_input (IndicesInput): The fairness problem to study.

    Returns:
        IndicesOutput object, containing the CVM indices, one line per variable group
        and one column for each index.

    """
    # __check_arg_cvm(index_input, cols)
    df = pd.DataFrame(index_input.x, columns=index_input.x.columns)
    df["outputs"] = pd.DataFrame(index_input.compute_objective())
    return IndicesOutput(_analyze(df, "outputs", cols=index_input.variable_groups))

with_confidence_intervals(n_splits=31, shuffle=False, random_state=None)

Function decorator that allows computing confidence intervals using the naive method. The input data is split into n_splits folds, and the indices are computed on each fold.

Warnings

No correction is applied to the output (a small number of splits will lead to overconfident intervals, and a large number of splits will lead to a large variance due to the lack of data in each fold).

This decorator must be applied to one of the indices computation functions from the indices module.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_splits |  | Positive integer: number of splits. | 31 |
| shuffle |  | Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled. | False |
| random_state |  | When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls. | None |

Returns:

| Type | Description |
| --- | --- |
|  | The original index computation function, enriched to compute confidence intervals. |

Source code in deel\fairsense\indices\confidence_intervals.py
def with_confidence_intervals(n_splits=31, shuffle=False, random_state=None):
    """
    Function decorator that allows computing confidence intervals using the naive
    method. The input data is split into n_splits folds, and the indices are
    computed on each fold.

    Warnings:
            No correction is applied to the output (a small number of splits will
            lead to overconfident intervals, and a large number of splits will lead
            to a large variance due to the lack of data in each fold).

    This decorator must be applied to one of the indices computation functions from
    the indices module.

    Args:
        n_splits: positive integer, number of splits.
        shuffle:  Whether to shuffle the data before splitting into batches. Note that
            the samples within each split will not be shuffled.
        random_state: When `shuffle` is True, `random_state` affects the ordering of
            the indices, which controls the randomness of each fold. Otherwise, this
            parameter has no effect. Pass an int for reproducible output across
            multiple function calls.

    Returns:
        the original index computation function, enriched to compute confidence
        intervals.

    """

    kf = KFold(n_splits, shuffle=shuffle, random_state=random_state)

    def confidence_computation_fct(function):
        def call_function(inputs: IndicesInput, *args, **kwargs):
            # get full inputs
            x = inputs.x
            y = inputs.y_true
            fold_results = []
            # repeat indices computation on each fold
            for _, split in tqdm(kf.split(x, y), total=n_splits, ncols=80):
                # build input for the fold
                x_fold = x.iloc[split]
                y_fold = y.iloc[split] if y is not None else None
                fold_inputs = IndicesInput(
                    model=inputs.model,
                    x=x_fold,
                    y_true=y_fold,
                    variable_groups=inputs.variable_groups,
                    objective=inputs.objective,
                )
                # compute the result for the fold
                fold_results.append(function(fold_inputs, *args, **kwargs))
            # merge results to compute values and confidence intervals
            fvalues = [f.values for f in fold_results]
            runs = pd.concat(fvalues)
            return IndicesOutput(runs)

        return call_function

    return confidence_computation_fct
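A hedged usage sketch of the decorator, based only on the signature and source above; `inputs` is assumed to be an IndicesInput built as in the earlier examples.

```python
# Sketch: wrap an indices function so it is evaluated on each fold and the
# per-fold results are merged into a single IndicesOutput.
di_with_ci = with_confidence_intervals(n_splits=10, shuffle=True, random_state=0)(
    disparate_impact
)
out = di_with_ci(inputs)  # `inputs` is an IndicesInput, as in the source above
print(out.values)         # one row per (fold, variable group), per the merging code
```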