Sample boundary
For a completely different notion of the influence or importance of datapoints, we propose measuring the distance that separates each datapoint from the decision boundary and assigning a higher influence score to the points closest to it. It makes sense for these examples to be the most influential: had they been absent, the model would likely have placed the decision boundary elsewhere.
In particular, we define the influence score as follows:
$$ \mathcal{I}_{SB} (z) = - \lVert z - z_{adv} \rVert^2 \, , $$ where \(z\) is the datapoint under study and \(z_{adv}\) is the adversarial example with the lowest possible budget, obtained through the DeepFool method.
This technique is based on a simple idea we had, and as such, there is no paper associated with it. We decided to include it because its performance appears to depend less on the choice of model and training schedule, while it still obtains acceptable results on our mislabeled point detection benchmark.
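As a minimal sketch of the score described above, consider the special case of an affine binary classifier f(x) = w·x + b, where the closest point on the decision boundary is simply the orthogonal projection of z onto the hyperplane f(x) = 0. The function name `linear_boundary_score` and the toy weights are hypothetical and are not part of the library; the score is taken as the negative squared distance so that points closer to the boundary score higher.

```python
def linear_boundary_score(z, w, b):
    """Return (z_adv, score) for the affine binary classifier f(x) = w.x + b.

    Hypothetical helper for illustration only; not part of the library.
    """
    f = sum(wi * zi for wi, zi in zip(w, z)) + b
    w_norm_sq = sum(wi * wi for wi in w)
    # Orthogonal projection of z onto the hyperplane f(x) = 0: this is the
    # closest boundary point, i.e. the minimal-budget adversarial example.
    z_adv = [zi - f / w_norm_sq * wi for zi, wi in zip(z, w)]
    sq_dist = sum((zi - ai) ** 2 for zi, ai in zip(z, z_adv))
    # Negative squared distance: closer to the boundary means a higher score.
    return z_adv, -sq_dist


# Toy classifier with boundary x0 + x1 = 1 (assumed values).
w, b = [1.0, 1.0], -1.0
_, near_score = linear_boundary_score([0.6, 0.6], w, b)
_, far_score = linear_boundary_score([3.0, 3.0], w, b)
print(near_score > far_score)
```

For a deep network no such closed form exists, which is why an iterative method such as DeepFool is needed to estimate \(z_{adv}\).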
Notebooks
SampleBoundaryCalculator
A class implementing an influence score based on the distance of a sample to the
boundary of the classifier.
The distance to the boundary is estimated using the DeepFool method.
[https://arxiv.org/abs/1511.04599]
__init__(self,
model: keras.engine.training.Model,
step_nbr: int = 100,
eps: float = 1e-06)
Parameters

model : keras.engine.training.Model
A TF2 model that has already been trained

step_nbr : int = 100
Number of iterations used to search for the closest adversarial example

eps : float = 1e-06
Tolerance below which the difference between two logits is small enough to treat them as equal
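To give an intuition for how `step_nbr` and `eps` might interact, here is a rough sketch of a DeepFool-style iteration for a binary classifier, where `f` plays the role of the difference between the two logits. It is not the library's implementation: the helper `deepfool_binary`, the toy function `f` and the exact stopping rule are assumptions made for illustration.

```python
def deepfool_binary(z, f, grad_f, step_nbr=100, eps=1e-6):
    """Iteratively move z toward the boundary f(x) = 0 (hypothetical sketch)."""
    x = list(z)
    start_side = f(x) >= 0
    for _ in range(step_nbr):
        fx = f(x)
        # Stop once the logit gap is below eps (the two logits are treated as
        # equal, so x sits on the boundary) or once the predicted class flips.
        if abs(fx) < eps or (fx >= 0) != start_side:
            break
        g = grad_f(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # flat gradient: no direction to move in
        # Step onto the boundary of the local linear approximation of f.
        x = [xi - fx / g_norm_sq * gi for xi, gi in zip(x, g)]
    return x


# Toy nonlinear "logit gap" whose decision boundary is the unit circle.
def f(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0


def grad_f(x):
    return [2.0 * x[0], 2.0 * x[1]]


z = [2.0, 0.0]
z_adv = deepfool_binary(z, f, grad_f)
score = -sum((zi - ai) ** 2 for zi, ai in zip(z, z_adv))
print(z_adv, score)
```

In this sketch a larger `step_nbr` only matters when the linearized steps converge slowly, while a smaller `eps` demands that the final point sit closer to the true boundary.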
compute_influence_values(self,
train_set: tf.Dataset,
device: Optional[str] = None) -> tf.Dataset
Compute the influence score for each sample of the provided (full or partial) model's training dataset.
Parameters

train_set : tf.Dataset
A TF dataset with the (full or partial) model's training dataset.

device : Optional[str] = None
Device where the computation will be executed
Return

train_set : tf.Dataset
A dataset containing the tuple: (batch of training samples, influence score)
compute_top_k_from_training_dataset(self,
train_set: tf.Dataset,
k: int,
order: deel.influenciae.utils.sorted_dict.ORDER = ) -> Tuple[tf.Tensor, tf.Tensor]
Compute the k most influential datapoints of the model's training dataset by computing
the influence score for each point individually.
Parameters

train_set : tf.Dataset
A TF dataset containing the points on which the model was trained.

k : int
The number of most influential samples to keep

order : deel.influenciae.utils.sorted_dict.ORDER
Either ORDER.DESCENDING or ORDER.ASCENDING, depending on whether we wish to find the top-k or bottom-k samples, respectively.
Return

training_samples, influences_values : Tuple[tf.Tensor, tf.Tensor]
A tuple of tensors.
 training_samples: A tensor containing the k most influential samples of the training dataset for the model provided.
 influences_values: The influence score corresponding to these k most influential samples.
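The selection step described above can be illustrated with a small, library-independent sketch that works on plain Python lists rather than TF datasets. The `Order` enum and the `top_k` helper are hypothetical stand-ins for illustration and do not reproduce the library's `ORDER` type or its internals.

```python
import heapq
from enum import Enum, auto


class Order(Enum):
    """Hypothetical stand-in for the library's ORDER enum."""
    ASCENDING = auto()
    DESCENDING = auto()


def top_k(samples, scores, k, order):
    """Return the k samples with the highest (or lowest) scores and those scores."""
    pairs = zip(scores, range(len(samples)))
    # DESCENDING keeps the largest scores (top-k), ASCENDING the smallest (bottom-k).
    pick = heapq.nlargest if order is Order.DESCENDING else heapq.nsmallest
    best = pick(k, pairs)
    return [samples[i] for _, i in best], [s for s, _ in best]


# Toy data: scores are negative squared distances to the boundary (assumed values).
samples = ["a", "b", "c", "d"]
scores = [-0.5, -0.02, -12.5, -1.0]
top_samples, top_scores = top_k(samples, scores, 2, Order.DESCENDING)
bottom_samples, bottom_scores = top_k(samples, scores, 2, Order.ASCENDING)
print(top_samples, bottom_samples)
```

Using a heap-based selection avoids sorting the whole dataset when k is much smaller than the number of training samples.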