Representer Point Selection - L2
Using a completely different notion of influence than the techniques in the modules deel.influenciae.influence and deel.influenciae.trac_in, this is the first method to use the representer point theorem for kernels to attribute an influence to datapoints in the training dataset. In particular, it posits that the classification function can be approximated by:
\[ \Phi(x_t, \theta) = \sum_{i=1}^{n} k(x_t, x_i, \alpha_i) \]
where \(\Phi\) is the function that maps the embedding of a point \(x_t\) at the network's last layer to the classification logits, \(k\) is a kernel evaluated between \(x_t\) and each of the \(n\) training points \(x_i\), and \(\alpha_i\) is the coefficient attached to \(x_i\). However, this function must be learned with strong \(\ell^2\) regularization, which requires training a surrogate model.
The influence of a training datapoint is then given by its coefficient \(\alpha_i\) for the predicted label \(j\).
This method does not require computing inverse-Hessian-vector products, so it is comparatively cheap to evaluate once the surrogate model has been learned.
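The decomposition described above can be illustrated with a small, library-independent sketch. It assumes the kernel takes the form \(\alpha_i f_i^\top f_t\), with \(f_i\) the last-layer embedding of training point \(i\), as in the usual statement of the representer point theorem; this form, the variable names and the numbers are illustrative assumptions, not the library's implementation.

```python
# Minimal sketch with made-up numbers (not library code).

def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

# Last-layer embeddings f_i of three training points (2 features each).
train_features = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
# Hypothetical representer coefficients alpha_i, one value per class (2 classes).
alphas = [[0.2, -0.2], [-0.1, 0.1], [0.3, -0.3]]
n_classes, n_features = 2, 2

# Weights of the linear surrogate layer implied by the coefficients:
# W[c] = sum_i alpha_i[c] * f_i
weights = [
    [sum(a[c] * f[d] for a, f in zip(alphas, train_features))
     for d in range(n_features)]
    for c in range(n_classes)
]

test_feature = [1.0, 1.0]

# Logits computed directly with the surrogate layer.
logits = [dot(weights[c], test_feature) for c in range(n_classes)]

# Contribution of each training point to each logit: alpha_i[c] * <f_i, f_t>.
contributions = [
    [a[c] * dot(f, test_feature) for c in range(n_classes)]
    for a, f in zip(alphas, train_features)
]
# Summing the contributions over training points recovers the logits.
kernel_logits = [sum(row[c] for row in contributions) for c in range(n_classes)]

# Influence of each training point: its contribution to the predicted label.
predicted = max(range(n_classes), key=lambda c: logits[c])
influences = [row[predicted] for row in contributions]
```

In this toy setting the third training point contributes most to the predicted logit, so it would be reported as the most influential.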
RepresenterPointL2
A class implementing a method to compute the influence of training points through
the representer point theorem for kernels.
__init__(self,
model: keras.engine.training.Model,
train_set: tf.Dataset,
loss_function: Union[Callable[[tf.Tensor, tf.Tensor], tf.Tensor], keras.losses.Loss],
lambda_regularization: float,
scaling_factor: float = 0.1,
epochs: int = 100,
layer_index: int = -1)
Parameters

model : keras.engine.training.Model
A TF2 model that has already been trained

train_set : tf.Dataset
A batched TF dataset with the points with which the model was trained

loss_function : Union[Callable[[tf.Tensor, tf.Tensor], tf.Tensor], keras.losses.Loss]
The loss function with which the model was trained. This loss function MUST NOT be reduced.

lambda_regularization : float
The coefficient for the regularization of the surrogate last layer that needs to be trained for this method

scaling_factor : float = 0.1
The scaling factor for the SGD backtracking line-search optimizer used to fit the surrogate linear model

epochs : int = 100
The number of epochs for which to fit the surrogate linear model

layer_index : int = -1
Index of the layer that outputs the logits
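The requirement that the loss function not be reduced means it must return one value per sample rather than a single scalar for the batch. With Keras losses this is typically obtained by constructing the loss with reduction=tf.keras.losses.Reduction.NONE. The pure-Python sketch below only illustrates the difference between the two behaviours; the function name and the numbers are hypothetical.

```python
import math

def cross_entropy_per_sample(probabilities, labels):
    """Return one cross-entropy value per sample (no reduction over the batch)."""
    return [-math.log(p[y]) for p, y in zip(probabilities, labels)]

# Predicted class probabilities for a batch of three samples, and their labels.
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, 0]

# Unreduced loss: one value per sample, which is the shape the method expects.
per_sample = cross_entropy_per_sample(probs, labels)

# Reduced loss: a single scalar, from which per-point values cannot be recovered.
reduced = sum(per_sample) / len(per_sample)
```

A plausible reason for the requirement is that influence is attributed to individual datapoints, which needs a loss value (and gradient) for each of them.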
compute_influence_values(self,
train_set: tf.Dataset,
device: Optional[str] = None) -> tf.Dataset
Compute the influence score for each sample of the provided (full or partial) model's training dataset.
Parameters

train_set : tf.Dataset
A TF dataset with the (full or partial) model's training dataset.

device : Optional[str] = None
Device where the computation will be executed
Return

train_set : tf.Dataset
A dataset containing the tuple: (batch of training samples, influence score)
compute_influence_vector(self,
train_set: tf.Dataset,
save_influence_vector_ds_path: Optional[str] = None,
device: Optional[str] = None) -> tf.Dataset
Compute the influence vector for each sample of the provided (full or partial) model's training dataset.
Parameters

train_set : tf.Dataset
A TF dataset with the (full or partial) model's training dataset.

save_influence_vector_ds_path : Optional[str] = None
The path to save or load the influence vector of the training dataset. If specified, load the dataset if it has already been computed, otherwise, compute the influence vector and then save it in the specified path.

device : Optional[str] = None
Device where the computation will be executed
Return

inf_vect_ds : tf.Dataset
A dataset containing the tuple: (batch of training samples, influence vector)
compute_top_k_from_training_dataset(self,
train_set: tf.Dataset,
k: int,
order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>) -> Tuple[tf.Tensor, tf.Tensor]
Compute the k most influential datapoints of the model's training dataset by computing
Cook's distance for each point individually.
Parameters

train_set : tf.Dataset
A TF dataset containing the points on which the model was trained.

k : int
An integer with the number of most important samples we wish to keep

order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>
Either ORDER.DESCENDING or ORDER.ASCENDING, depending on whether we wish to find the top-k or bottom-k samples, respectively.
Return

training_samples, influences_values : Tuple[tf.Tensor, tf.Tensor]
A tuple of tensors.
 training_samples: A tensor containing the k most influential samples of the training dataset for the model provided.
 influences_values: The influence score corresponding to these k most influential samples.
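The selection step can be sketched without TensorFlow. The example below assumes the influence values have already been computed and picks the k largest or smallest according to an ordering flag. The Order enum is a hypothetical stand-in for deel.influenciae.utils.sorted_dict.ORDER, and top_k_samples is an illustrative helper, not a library function.

```python
import heapq
from enum import Enum

class Order(Enum):
    """Hypothetical stand-in for the library's ORDER enum."""
    ASCENDING = "ascending"
    DESCENDING = "descending"

def top_k_samples(samples, influence_values, k, order=Order.DESCENDING):
    """Return the k samples with the largest (or smallest) influence values."""
    # Pair every influence value with the index of its sample.
    pairs = zip(influence_values, range(len(samples)))
    select = heapq.nlargest if order is Order.DESCENDING else heapq.nsmallest
    chosen = select(k, pairs)
    return [samples[i] for _, i in chosen], [value for value, _ in chosen]

samples = ["a", "b", "c", "d"]
influence_values = [0.1, 0.9, -0.4, 0.5]

most_samples, most_values = top_k_samples(samples, influence_values, k=2)
least_samples, least_values = top_k_samples(
    samples, influence_values, k=2, order=Order.ASCENDING
)
```

Descending order returns the most influential samples first, while ascending order returns the least influential ones.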
estimate_influence_values_in_batches(self,
dataset_to_evaluate: tf.Dataset,
train_set: tf.Dataset,
influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>,
load_influence_vector_path: Optional[str] = None,
save_influence_vector_path: Optional[str] = None,
save_influence_value_path: Optional[str] = None,
device: Optional[str] = None) -> tf.Dataset
Estimates the influence that each point in the provided training dataset has on each of the test points.
This can provide some insight into what makes the model predict a certain way for the given test points,
and thus provides data-centric explanations.
Parameters

dataset_to_evaluate : tf.Dataset
A TF dataset containing the test samples for which to compute the effect of removing each of the provided training points (individually).

train_set : tf.Dataset
A TF dataset containing the model's training dataset (partial or full).

influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>
An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.
Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.

load_influence_vector_path : Optional[str] = None
The path to load the influence vectors (if they have already been calculated).

save_influence_vector_path : Optional[str] = None
The path to save the computed influence vector.

save_influence_value_path : Optional[str] = None
The path to save the computed influence values.

device : Optional[str] = None
Device where the computation will be executed
Return

influence_value_dataset : tf.Dataset
A dataset containing the tuple: (samples_to_evaluate, dataset).
 samples_to_evaluate: The batch of samples to evaluate.
 dataset: Dataset containing tuples of batch of the training dataset and their influence score.
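The nested structure of the returned dataset can be pictured with plain Python lists. In the sketch below every evaluated batch is paired with a list of (training batch, scores) tuples. The pair_score argument is a hypothetical placeholder for whatever the method uses to score one training point against one evaluated point, and the layout of the scores (one row per evaluated sample, one column per training sample) is an assumption made for illustration.

```python
def dot(u, v):
    """Inner product, used here only as a stand-in scoring function."""
    return sum(a * b for a, b in zip(u, v))

def influence_values_in_batches(eval_batches, train_batches, pair_score):
    """Yield (eval_batch, [(train_batch, scores), ...]) for every evaluated batch.

    scores[t][i] is the score of training point i for evaluated point t.
    """
    for eval_batch in eval_batches:
        per_train_batch = []
        for train_batch in train_batches:
            scores = [
                [pair_score(x_t, x_i) for x_i in train_batch]
                for x_t in eval_batch
            ]
            per_train_batch.append((train_batch, scores))
        yield eval_batch, per_train_batch

# Two evaluated batches (sizes 1 and 2) and two training batches (sizes 2 and 1).
eval_batches = [[[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]]
train_batches = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]

result = list(influence_values_in_batches(eval_batches, train_batches, dot))
```

Each element of result mirrors the documented tuple: the evaluated batch first, followed by the training batches together with their scores.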
predict_with_kernel(self,
samples_to_evaluate: Tuple[tf.Tensor, ...]) -> tf.Tensor
Uses the learned kernel to approximate the model's predictions on a group of samples.
Parameters

samples_to_evaluate : Tuple[tf.Tensor, ...]
A single batch of tensors with the samples for which we wish to approximate the model's predictions
Return

predictions : tf.Tensor
A tensor with an approximation of the model's predictions
top_k(self,
dataset_to_evaluate: tf.Dataset,
train_set: tf.Dataset,
k: int = 5,
nearest_neighbors: deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors = ,
influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>,
load_influence_vector_ds_path: Optional[str] = None,
save_influence_vector_ds_path: Optional[str] = None,
save_top_k_ds_path: Optional[str] = None,
order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>,
d_type: tensorflow.python.framework.dtypes.DType = tf.float32,
device: Optional[str] = None) -> tf.Dataset
Find the top-k closest elements in the training dataset for each element of the dataset to evaluate.
The method returns a dataset containing, for each sample to evaluate, the top-k influence values and the corresponding top-k training samples.
Parameters

dataset_to_evaluate : tf.Dataset
The dataset containing the samples that will be compared to the training dataset

train_set : tf.Dataset
The dataset used to train the model.

k : int = 5
The number of most influential samples to retain from the training dataset

nearest_neighbors : deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors =
The nearest neighbor method. The default method is a linear search

influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>
An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.
Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.

load_influence_vector_ds_path : Optional[str] = None
The path to load the influence vectors (if they have already been calculated).

save_influence_vector_ds_path : Optional[str] = None
The path to save the computed influence vector.

save_top_k_ds_path : Optional[str] = None
The path to save the result of the computation of the top-k elements

order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>
Either ORDER.DESCENDING or ORDER.ASCENDING, depending on whether we wish to find the top-k or bottom-k samples, respectively.

d_type : tensorflow.python.framework.dtypes.DType = tf.float32
The datatype of the tensors.

device : Optional[str] = None
Device where the computation will be executed
Return

top_k_dataset : tf.Dataset
A dataset containing the tuple (samples_to_evaluate, influence_values, training_samples).
 samples_to_evaluate: Top-k samples to evaluate.
 influence_values: Top-k influence values for each sample to evaluate.
 training_samples: Top-k training samples for each sample to evaluate.