TracIn¶


This method estimates influence without the expensive inverse Hessian-vector product computations, but it requires information that is only available at training time. It leverages the fundamental theorem of calculus to estimate the influence of the training points by looking at how the loss at a given point evolves across model checkpoints. Concretely, the influence takes the following form:

\[ \mathcal{I} (z, z') = \sum_{i=1}^{k} \eta_i \, \nabla_\theta \ell (\theta_{t_i}, z) \cdot \nabla_\theta \ell (\theta_{t_i}, z') \]

where \(\theta_{t_i}\) are the model's weights at epoch \(t_i\) and \(\eta_i\) is the learning rate at that same epoch.
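
As a rough illustration of the sum above, the following pure-Python sketch evaluates it on made-up numbers. The function names and the flattened-list representation of the gradients are assumptions for this example only; they are not part of the library.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product between two flattened gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def tracin_influence(
    grads_z: List[Sequence[float]],
    grads_z_prime: List[Sequence[float]],
    learning_rates: Sequence[float],
) -> float:
    """Sum over checkpoints of eta_i * <grad loss(z), grad loss(z')>."""
    return sum(
        eta * dot(g, g_prime)
        for eta, g, g_prime in zip(learning_rates, grads_z, grads_z_prime)
    )


# Two checkpoints, two parameters each (illustrative values only).
grads_z = [[1.0, 2.0], [0.5, -1.0]]
grads_z_prime = [[3.0, 0.0], [2.0, 2.0]]
learning_rates = [0.1, 0.01]

# 0.1 * 3.0 + 0.01 * (-1.0), i.e. approximately 0.29
influence = tracin_influence(grads_z, grads_z_prime, learning_rates)
```

A positive value suggests that a gradient step on one point tends to reduce the loss on the other; a negative value suggests the opposite.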

Just like RPS-L2, this method does not need an instance of the InverseHessianVectorProduct class, but it does require some of the model's checkpoints and the learning rate used at each of them.

Notebooks¶

Using TracIn

`TracIn`¶

A class implementing an influence score based on the TracIn method proposed in https://arxiv.org/pdf/2002.08484.pdf

`__init__(self, models: List[deel.influenciae.common.model_wrappers.InfluenceModel], learning_rates: Union[float, List[float]])`¶

Parameters

models : List[deel.influenciae.common.model_wrappers.InfluenceModel]
- A list of TF2.X models implementing the InfluenceModel interface, taken at different steps (epochs) of the training.
learning_rates : Union[float, List[float]]
- Learning rate or list of learning rates used during the training.
  If learning_rates is a list, it must have the same length as the list of models.
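
The contract between `models` and `learning_rates` can be sketched as follows. `expand_learning_rates` is a hypothetical helper written for this example, and broadcasting a single float to every checkpoint is an assumption about how a scalar is interpreted, not a statement about the library's internals.

```python
from typing import List, Sequence, Union


def expand_learning_rates(
    learning_rates: Union[float, Sequence[float]], n_models: int
) -> List[float]:
    """Return one learning rate per checkpoint (hypothetical helper)."""
    if isinstance(learning_rates, (int, float)):
        # Assumption: a scalar means the same rate at every checkpoint.
        return [float(learning_rates)] * n_models
    rates = [float(lr) for lr in learning_rates]
    if len(rates) != n_models:
        raise ValueError(
            f"expected {n_models} learning rates, got {len(rates)}"
        )
    return rates


# Constant learning rate shared by three checkpoints.
constant = expand_learning_rates(1e-3, n_models=3)
# One learning rate per checkpoint, e.g. from a decaying schedule.
scheduled = expand_learning_rates([1e-2, 1e-3, 1e-4], n_models=3)
```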

`compute_influence_values(self, train_set: tf.Dataset, device: Optional[str] = None) -> tf.Dataset`¶

Compute the influence score for each sample of the provided (full or partial) model's training dataset.

Parameters

train_set : tf.Dataset
- A TF dataset with the (full or partial) model's training dataset.
device : Optional[str] = None
- Device where the computation will be executed

Return

train_set : tf.Dataset
- A dataset containing the tuple: (batch of training samples, influence score)

`compute_influence_vector(self, train_set: tf.Dataset, save_influence_vector_ds_path: Optional[str] = None, device: Optional[str] = None) -> tf.Dataset`¶

Compute the influence vector for each sample of the provided (full or partial) model's training dataset.

Parameters

train_set : tf.Dataset
- A TF dataset with the (full or partial) model's training dataset.
save_influence_vector_ds_path : Optional[str] = None
- The path to save or load the influence vector of the training dataset. If specified, the dataset is loaded when it has already been computed; otherwise, the influence vector is computed and then saved to the specified path.
device : Optional[str] = None
- Device where the computation will be executed

Return

inf_vect_ds : tf.Dataset
- A dataset containing the tuple: (batch of training samples, influence vector)
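
One way to picture an influence vector for this method is as a per-sample vector whose dot product with another sample's vector yields the TracIn sum. The sketch below builds such a vector by scaling each checkpoint's gradient by the square root of its learning rate and concatenating the results. This construction is an assumption made for illustration; this page does not state how the library builds its influence vectors.

```python
import math
from typing import List, Sequence


def influence_vector(
    grads: List[Sequence[float]], learning_rates: Sequence[float]
) -> List[float]:
    """Concatenate sqrt(eta_i) * gradient over all checkpoints."""
    vector: List[float] = []
    for eta, grad in zip(learning_rates, grads):
        scale = math.sqrt(eta)
        vector.extend(scale * g for g in grad)
    return vector


# Illustrative per-checkpoint gradients for two samples.
learning_rates = [0.1, 0.01]
v_train = influence_vector([[1.0, 2.0], [0.5, -1.0]], learning_rates)
v_test = influence_vector([[3.0, 0.0], [2.0, 2.0]], learning_rates)

# Each sqrt(eta_i) factor appears twice in the product, so the dot
# product equals sum_i eta_i * <g_i, g'_i>.
influence = sum(a * b for a, b in zip(v_train, v_test))
```

Under this assumption, vectors for the training set can be computed once, cached, and reused against many evaluation samples.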

`compute_top_k_from_training_dataset(self, train_set: tf.Dataset, k: int, order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>) -> Tuple[tf.Tensor, tf.Tensor]`¶

Compute the k most influential data-points of the model's training dataset by computing Cook's distance for each point individually.

Parameters

train_set : tf.Dataset
- A TF dataset containing the points on which the model was trained.
k : int
- An integer with the number of most important samples we wish to keep
order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>
- Either ORDER.DESCENDING or ORDER.ASCENDING depending on if we wish to find the top-k or bottom-k samples, respectively.

Return

training_samples, influences_values : Tuple[tf.Tensor, tf.Tensor]
- A tuple of tensors.
  - training_samples: A tensor containing the k most influential samples of the training dataset for the model provided.
  - influences_values: The influence score corresponding to these k most influential samples.
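
The effect of the `order` argument can be illustrated with a small standalone sketch. The `Order` enum and `top_k_samples` function below are hypothetical stand-ins defined for this example; they only mirror the descending (top-k) versus ascending (bottom-k) behaviour described above.

```python
import heapq
from enum import Enum, auto
from typing import List, Sequence, Tuple


class Order(Enum):
    """Hypothetical stand-in for the library's ORDER enum."""
    ASCENDING = auto()
    DESCENDING = auto()


def top_k_samples(
    samples: Sequence[str],
    scores: Sequence[float],
    k: int,
    order: Order = Order.DESCENDING,
) -> Tuple[List[str], List[float]]:
    """Keep the k samples with the largest (or smallest) scores."""
    pick = heapq.nlargest if order is Order.DESCENDING else heapq.nsmallest
    best = pick(k, enumerate(scores), key=lambda item: item[1])
    indices = [index for index, _ in best]
    return [samples[i] for i in indices], [scores[i] for i in indices]


samples = ["a", "b", "c", "d"]
scores = [0.2, 1.5, -0.3, 0.9]

most_influential = top_k_samples(samples, scores, k=2)
least_influential = top_k_samples(samples, scores, k=2, order=Order.ASCENDING)
```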

`estimate_influence_values_in_batches(self, dataset_to_evaluate: tf.Dataset, train_set: tf.Dataset, influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>, load_influence_vector_path: Optional[str] = None, save_influence_vector_path: Optional[str] = None, save_influence_value_path: Optional[str] = None, device: Optional[str] = None) -> tf.Dataset`¶

Estimates the influence that each point in the provided training dataset has on each of the test points. This can provide some insights as to what makes the model predict a certain way for the given test points, and thus presents data-centric explanations.

Parameters

dataset_to_evaluate : tf.Dataset
- A TF dataset containing the test samples for which to compute the effect of removing each of the provided training points (individually).
train_set : tf.Dataset
- A TF dataset containing the model's training dataset (partial or full).
influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>
- An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.
  Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.
load_influence_vector_path : Optional[str] = None
- The path to load the influence vectors (if they have already been calculated).
save_influence_vector_path : Optional[str] = None
- The path to save the computed influence vector.
save_influence_value_path : Optional[str] = None
- The path to save the computed influence values.
device : Optional[str] = None
- Device where the computation will be executed

Return

influence_value_dataset : tf.Dataset
- A dataset containing the tuple: (samples_to_evaluate, dataset).
  - samples_to_evaluate: The batch of samples to evaluate.
  - dataset: A dataset containing tuples of a batch of the training dataset and the corresponding influence scores.
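
The nested shape of this return value can be mimicked with plain lists. In the sketch below, every sample is represented directly by a precomputed vector and the influence is taken to be a dot product; both choices, as well as the function names, are assumptions made for illustration rather than the library's implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Batch = List[Vector]


def batch_influences(test_batch: Batch, train_batch: Batch) -> List[List[float]]:
    """Dot product of every test vector with every training vector."""
    return [
        [sum(a * b for a, b in zip(test_vec, train_vec)) for train_vec in train_batch]
        for test_vec in test_batch
    ]


def estimate_in_batches(
    test_batches: List[Batch], train_batches: List[Batch]
) -> List[Tuple[Batch, List[Tuple[Batch, List[List[float]]]]]]:
    """For each test batch, pair every training batch with its scores."""
    return [
        (
            test_batch,
            [
                (train_batch, batch_influences(test_batch, train_batch))
                for train_batch in train_batches
            ],
        )
        for test_batch in test_batches
    ]


test_batches = [[[1.0, 0.0]], [[0.0, 2.0]]]
train_batches = [[[1.0, 1.0], [2.0, 0.0]], [[0.0, 3.0]]]

result = estimate_in_batches(test_batches, train_batches)
first_test_batch, per_train_batch = result[0]
```

Each outer element corresponds to one batch of samples to evaluate, and each inner element pairs a training batch with one row of scores per evaluated sample.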

`top_k(self, dataset_to_evaluate: tf.Dataset, train_set: tf.Dataset, k: int = 5, nearest_neighbors: deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors = , influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>, load_influence_vector_ds_path: Optional[str] = None, save_influence_vector_ds_path: Optional[str] = None, save_top_k_ds_path: Optional[str] = None, order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>, d_type: tensorflow.python.framework.dtypes.DType = tf.float32, device: Optional[str] = None) -> tf.Dataset`¶

Find the top-k closest elements in the training dataset for each element of the dataset to evaluate. The method returns a dataset whose elements contain, for each sample to evaluate, the top-k influence values and the corresponding top-k training samples.

Parameters

dataset_to_evaluate : tf.Dataset
- The dataset containing the samples that will be compared to the training dataset.
train_set : tf.Dataset
- The dataset used to train the model.
k : int = 5
- The number of most influential samples to retain from the training dataset.
nearest_neighbors : deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors =
- The nearest neighbor method. The default method is a linear search
influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>
- An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.
  Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.
load_influence_vector_ds_path : Optional[str] = None
- The path to load the influence vectors (if they have already been calculated).
save_influence_vector_ds_path : Optional[str] = None
- The path to save the computed influence vector.
save_top_k_ds_path : Optional[str] = None
- The path to save the result of the computation of the top-k elements
order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>
- Either ORDER.DESCENDING or ORDER.ASCENDING depending on if we wish to find the top-k or bottom-k samples, respectively.
d_type : tensorflow.python.framework.dtypes.DType = tf.float32
- The data-type of the tensors.
device : Optional[str] = None
- Device where the computation will be executed

Return

top_k_dataset : tf.Dataset
- A dataset containing the tuple (samples_to_evaluate, influence_values, training_samples).
  - samples_to_evaluate: The batch of samples to evaluate.
  - influence_values: Top-k influence values for each sample to evaluate.
  - training_samples: Top-k training sample for each sample to evaluate.
