Arnoldi Influence Calculator¶
This class implements the method introduced in Scaling Up Influence Functions (Schioppa et al., AAAI 2022). The paper proposes a series of memory and computational optimizations, based on the Arnoldi iteration, that speed up the computation of inverse-Hessian-vector products. These optimizations allow the authors to approximately compute influence functions on full-size vision models (up to a ViT-L with 300M parameters).
In essence, the optimizations can be summarized as follows:
- build an orthonormal basis for the Krylov subspaces of a random vector (in the desired dimensionality).
- find the eigenvalues and eigenvectors of the restriction of the Hessian matrix in that restricted subspace.
- keep only the \(k\) largest eigenvalues and their corresponding eigenvectors, and create a projection matrix \(G\) into this space.
- use forward-over-backward auto-differentiation to directly compute the JVPs in this reduced space.
Because these optimizations are specific to this method, the inverse-Hessian-vector product (IHVP) operation is implemented inside the class, so no separate IHVP object is required. In addition, for the moment, it can only be applied to individual points.
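The steps summarized above can be illustrated with a small NumPy sketch. This is not the library's implementation: the function name `arnoldi_sketch`, the `hvp` callable and the toy Hessian are assumptions made for illustration, explicit matrix-vector products stand in for the forward-over-backward auto-differentiation, and selecting eigenvalues by magnitude is one possible reading of "largest".

```python
import numpy as np

def arnoldi_sketch(hvp, n_params, subspace_dim, k, seed=0):
    """Toy Arnoldi iteration followed by a top-k eigen-decomposition.

    hvp is a hypothetical callable v -> H @ v (a Hessian-vector product).
    Returns (eig_vals, G) where G has shape (k, n_params).
    """
    rng = np.random.default_rng(seed)
    basis = np.zeros((subspace_dim + 1, n_params))
    v0 = rng.standard_normal(n_params)
    basis[0] = v0 / np.linalg.norm(v0)
    hess = np.zeros((subspace_dim + 1, subspace_dim))  # projected (Hessenberg) matrix
    for j in range(subspace_dim):
        w = hvp(basis[j])
        # Two Gram-Schmidt passes against the basis built so far, for stability.
        for _ in range(2):
            for i in range(j + 1):
                coeff = w @ basis[i]
                hess[i, j] += coeff
                w = w - coeff * basis[i]
        hess[j + 1, j] = np.linalg.norm(w)
        if hess[j + 1, j] < 1e-12:
            break  # the Krylov subspace stopped growing
        basis[j + 1] = w / hess[j + 1, j]
    small = hess[:subspace_dim, :subspace_dim]
    small = 0.5 * (small + small.T)  # symmetrize before the eigenvalue computation
    eig_vals, eig_vecs = np.linalg.eigh(small)
    keep = np.argsort(-np.abs(eig_vals))[:k]  # assumption: largest in magnitude
    G = eig_vecs[:, keep].T @ basis[:subspace_dim]
    return eig_vals[keep], G

# Toy symmetric "Hessian" with three dominant eigenvalues.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
spectrum = np.concatenate([[100.0, 50.0, 25.0], np.linspace(0.1, 1.0, 27)])
hessian = (q * spectrum) @ q.T
eig_vals, G = arnoldi_sketch(lambda v: hessian @ v, n_params=30, subspace_dim=15, k=3)
print(eig_vals)
```

The rows of `G` are orthonormal, so `G @ v` gives the coordinates of a parameter-space vector `v` in the reduced space, where the Hessian is approximately diagonal.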
Notebooks¶
ArnoldiInfluenceCalculator¶
A class implementing an influence score that reduces the dimension of the IHVP computation
through the Arnoldi algorithm, as per https://arxiv.org/pdf/2112.03052.pdf
This allows the calculator to be used on models with a large number of weights
in a time-efficient manner. In theory, the influence score being calculated is the
same as that of the other calculators in the influence sub-package.
__init__(self,
model: deel.influenciae.common.model_wrappers.InfluenceModel,
train_dataset: tf.Dataset,
subspace_dim: int,
force_hermitian: bool,
k_largest_eig_vals: int,
dtype: tf.dtypes = tf.float32)
Parameters
-
model : deel.influenciae.common.model_wrappers.InfluenceModel
The TF2.X model implementing the InfluenceModel interface.
-
train_dataset : tf.Dataset
A batched TF dataset with the points with which the model was trained.
-
subspace_dim : int
The dimension of the Krylov subspace for the Arnoldi algorithm.
-
force_hermitian : bool
A boolean indicating if we should force the projected matrix to be hermitian before the eigenvalue computation.
-
k_largest_eig_vals : int
An integer for the amount of top eigenvalues to keep for the influence estimations.
-
dtype : tf.dtypes = tf.float32
Numeric type for the Krylov basis (tf.float32 by default).
arnoldi(self,
dim: int) -> Tuple[tf.Tensor, tf.Tensor]
Builds the projection of the inverse of the Hessian onto the Krylov subspaces.
Parameters
-
dim : int
The dimension of the Krylov basis.
Return
-
eig_vals : tf.Tensor
The eigenvalues of the projection (first element of the returned tuple).
-
G : tf.Tensor
The projection matrix (second element of the returned tuple).
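To make the role of the two returned values concrete, the sketch below shows one way an influence score between two points could be approximated from them: both gradients are projected with G, and the inverse Hessian reduces to a division by the eigenvalues. The function name and the toy values are illustrative assumptions, NumPy arrays stand in for tensors, and sign conventions for influence scores vary.

```python
import numpy as np

def projected_influence(eig_vals, G, grad_test, grad_train):
    """Approximate grad_test^T H^{-1} grad_train in the reduced space.

    eig_vals: (k,) eigenvalues kept by the Arnoldi step.
    G: (k, p) projection matrix with orthonormal rows.
    grad_test, grad_train: (p,) per-sample gradients (hypothetical inputs).
    """
    p_test = G @ grad_test    # coordinates of the test gradient in the subspace
    p_train = G @ grad_train  # coordinates of the training gradient
    # In the eigenbasis the inverse Hessian is diagonal: divide by eigenvalues.
    return float(p_test @ (p_train / eig_vals))

# Toy example: H = diag(4, 2, 1), keeping the two largest eigenvalues.
toy_eig_vals = np.array([4.0, 2.0])
toy_G = np.eye(3)[:2]
score = projected_influence(
    toy_eig_vals, toy_G, np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
)
print(score)
```

Only the k retained directions contribute to the score; the component along the discarded third direction is ignored.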
compute_influence_values(self,
train_set: tf.Dataset,
device: Optional[str] = None) -> tf.Dataset
Compute the influence score for each sample of the provided (full or partial) model's training dataset.
Parameters
-
train_set : tf.Dataset
A TF dataset with the (full or partial) model's training dataset.
-
device : Optional[str] = None
Device where the computation will be executed
Return
-
train_set : tf.Dataset
A dataset containing the tuple: (batch of training samples, influence score)
compute_influence_vector(self,
train_set: tf.Dataset,
save_influence_vector_ds_path: Optional[str] = None,
device: Optional[str] = None) -> tf.Dataset
Compute the influence vector for each sample of the provided (full or partial) model's training dataset.
Parameters
-
train_set : tf.Dataset
A TF dataset with the (full or partial) model's training dataset.
-
save_influence_vector_ds_path : Optional[str] = None
The path used to save or load the influence vectors of the training dataset. If specified and the dataset has already been computed, it is loaded from this path; otherwise, the influence vectors are computed and then saved to this path.
-
device : Optional[str] = None
Device where the computation will be executed
Return
-
inf_vect_ds : tf.Dataset
A dataset containing the tuple: (batch of training samples, influence vector)
compute_top_k_from_training_dataset(self,
train_set: tf.Dataset,
k: int,
order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>) -> Tuple[tf.Tensor, tf.Tensor]
Compute the k most influential data-points of the model's training dataset by computing
Cook's distance for each point individually.
Parameters
-
train_set : tf.Dataset
A TF dataset containing the points on which the model was trained.
-
k : int
The number of most influential samples to keep.
-
order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>
Either ORDER.DESCENDING or ORDER.ASCENDING depending on if we wish to find the top-k or bottom-k samples, respectively.
Return
-
training_samples, influences_values : Tuple[tf.Tensor, tf.Tensor]
A tuple of tensors.
- training_samples: A tensor containing the k most influential samples of the training dataset for the model provided.
- influences_values: The influence scores corresponding to these k most influential samples.
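The selection step can be pictured with a few lines of NumPy. This sketch only illustrates how descending and ascending order yield the top-k or bottom-k samples; the function name, the boolean `descending` flag (standing in for the ORDER enum) and the toy scores are assumptions, not the library's code.

```python
import numpy as np

def select_k_by_score(samples, scores, k, descending=True):
    """Return the k samples with the largest (or smallest) scores.

    descending=True mimics ORDER.DESCENDING (top-k),
    descending=False mimics ORDER.ASCENDING (bottom-k).
    """
    idx = np.argsort(scores)  # indices sorted by increasing score
    if descending:
        idx = idx[::-1]       # largest scores first
    idx = idx[:k]
    return samples[idx], scores[idx]

toy_samples = np.arange(5) * 10
toy_scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
top_samples, top_scores = select_k_by_score(toy_samples, toy_scores, k=2)
bottom_samples, bottom_scores = select_k_by_score(
    toy_samples, toy_scores, k=2, descending=False
)
```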
estimate_influence_values_in_batches(self,
dataset_to_evaluate: tf.Dataset,
train_set: tf.Dataset,
influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>,
load_influence_vector_path: Optional[str] = None,
save_influence_vector_path: Optional[str] = None,
save_influence_value_path: Optional[str] = None,
device: Optional[str] = None) -> tf.Dataset
Estimates the influence that each point in the provided training dataset has on each of the test points.
This can provide some insights as to what makes the model predict a certain way for the given test points,
and thus presents data-centric explanations.
Parameters
-
dataset_to_evaluate : tf.Dataset
A TF dataset containing the test samples for which to compute the effect of removing each of the provided training points (individually).
-
train_set : tf.Dataset
A TF dataset containing the model's training dataset (partial or full).
-
influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>
An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.
Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.
-
load_influence_vector_path : Optional[str] = None
The path to load the influence vectors (if they have already been calculated).
-
save_influence_vector_path : Optional[str] = None
The path to save the computed influence vector.
-
save_influence_value_path : Optional[str] = None
The path to save the computed influence values.
-
device : Optional[str] = None
Device where the computation will be executed
Return
-
influence_value_dataset : tf.Dataset
A dataset containing the tuple: (samples_to_evaluate, dataset).
- samples_to_evaluate: The batch of samples to evaluate.
- dataset: A dataset containing tuples of training batches and their influence scores.
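The nested structure of this return value, together with the idea of caching influence vectors in memory, can be sketched as follows. Plain Python lists of NumPy arrays stand in for TF datasets, and `grad_fn` and `influence_vector_fn` are hypothetical callables; none of these names come from the library.

```python
import numpy as np

def influence_in_batches(test_batches, train_batches, grad_fn, influence_vector_fn):
    """Yield (test_batch, [(train_batch, scores), ...]) tuples.

    grad_fn and influence_vector_fn are hypothetical callables mapping a
    batch of n samples to an (n, p) array of per-sample vectors.
    """
    # Compute the training influence vectors once and keep them in memory,
    # in the spirit of the in-memory caching option.
    cached = [(batch, influence_vector_fn(batch)) for batch in train_batches]
    for test_batch in test_batches:
        test_grads = grad_fn(test_batch)  # shape (n_test, p)
        # scores[i, j]: influence of training sample j on test sample i.
        per_train = [(batch, test_grads @ vecs.T) for batch, vecs in cached]
        yield test_batch, per_train

# Toy stand-ins: identity "gradients" and doubled "influence vectors".
toy_test = [np.eye(2)]
toy_train = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([[5.0, 6.0]])]
results = list(influence_in_batches(toy_test, toy_train, lambda x: x, lambda x: 2.0 * x))
```

Each element of `results` pairs one test batch with a list holding, for every training batch, the matrix of influence scores between the two batches.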
top_k(self,
dataset_to_evaluate: tf.Dataset,
train_set: tf.Dataset,
k: int = 5,
nearest_neighbors: deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors = LinearNearestNeighbors(),
influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>,
load_influence_vector_ds_path: Optional[str] = None,
save_influence_vector_ds_path: Optional[str] = None,
save_top_k_ds_path: Optional[str] = None,
order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>,
d_type: tensorflow.python.framework.dtypes.DType = tf.float32,
device: Optional[str] = None) -> tf.Dataset
Find, in the training dataset, the top-k closest elements for each element of the dataset to evaluate.
The method returns a dataset containing tuples of:
(samples to evaluate, top-k influence values for each sample to evaluate, top-k training samples for each sample to evaluate)
Parameters
-
dataset_to_evaluate : tf.Dataset
The dataset containing the samples that will be compared to the training dataset.
-
train_set : tf.Dataset
The dataset used to train the model.
-
k : int = 5
The number of most influential samples to retain from the training dataset.
-
nearest_neighbors : deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors = LinearNearestNeighbors()
The nearest neighbor method. The default method is a linear search.
-
influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>
An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.
Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.
-
load_influence_vector_ds_path : Optional[str] = None
The path to load the influence vectors (if they have already been calculated).
-
save_influence_vector_ds_path : Optional[str] = None
The path to save the computed influence vector.
-
save_top_k_ds_path : Optional[str] = None
The path to save the result of the top-k computation.
-
order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>
Either ORDER.DESCENDING or ORDER.ASCENDING depending on if we wish to find the top-k or bottom-k samples, respectively.
-
d_type : tensorflow.python.framework.dtypes.DType = tf.float32
The data-type of the tensors.
-
device : Optional[str] = None
Device where the computation will be executed
Return
-
top_k_dataset : tf.Dataset
A dataset containing the tuple (samples_to_evaluate, influence_values, training_samples).
- samples_to_evaluate: The samples to evaluate.
- influence_values: Top-k influence values for each sample to evaluate.
- training_samples: Top-k training samples for each sample to evaluate.
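As a rough picture of what a linear search over influence values produces, the sketch below takes a precomputed matrix of influence scores (one row per sample to evaluate, one column per training sample) and extracts the top-k entries of each row. The function name, the array layout and the toy values are assumptions for illustration and do not reflect the library's internals.

```python
import numpy as np

def top_k_per_test_sample(influence_matrix, train_samples, k):
    """For each row (test sample), return the k largest influence values
    and the corresponding training samples.

    influence_matrix: (n_test, n_train) array of influence scores.
    train_samples: (n_train, ...) array of training samples.
    """
    # Sort each row by decreasing influence and keep the first k columns.
    order = np.argsort(-influence_matrix, axis=1)[:, :k]
    top_values = np.take_along_axis(influence_matrix, order, axis=1)
    top_samples = train_samples[order]  # shape (n_test, k, ...)
    return top_values, top_samples

toy_matrix = np.array([[0.1, 0.5, 0.3],
                       [0.9, 0.2, 0.4]])
toy_train_samples = np.array([[0, 0], [1, 1], [2, 2]])
top_values, top_train = top_k_per_test_sample(toy_matrix, toy_train_samples, k=2)
```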