Representer Point Selection - Local Jacobian Expansion

Introduced as an improvement over Representer Point Selection - L2, this technique replaces the surrogate model with a local Taylor expansion on the Jacobian matrix, which allows the model's last layer to be approximately decomposed into a kernel. In short, it proposes the following formula for computing influence values:

\[ \Theta_L^\dagger \phi (x_t) = \sum_i \alpha_i \phi (x_i)^T \phi (x_t) \]
\[ \alpha_i = \Theta_L \frac{1}{\phi (x_i) \, n} - \frac{1}{n} H_{\Theta_L}^{-1} \frac{\partial L (x_i, y_i, \Theta)}{\partial \Theta_L \phi (x_i)} \]

In particular, the \(\alpha_i\) for the predicted label \(j\) is taken as the influence of data-point \(i\).
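The decomposition above can be checked numerically. The following is a minimal NumPy sketch, not the library's implementation: `phi_train`, `alpha` and `phi_test` are hypothetical random stand-ins for the feature maps and representer coefficients, and the last-layer weights are built from them so that the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 6, 4, 3                        # training points, feature dim, classes

phi_train = rng.normal(size=(n, d))      # phi(x_i), one row per training point
alpha = rng.normal(size=(n, c))          # alpha_i, one row per training point
phi_test = rng.normal(size=d)            # phi(x_t)

# Last-layer weights written as a sum of representer terms:
# theta = sum_i alpha_i phi(x_i)^T, shape (c, d). This is an assumption
# made for the illustration, not something read from a trained model.
theta = alpha.T @ phi_train

# Left-hand side of the formula: the logits of the test point.
logits = theta @ phi_test

# Right-hand side: sum_i alpha_i * (phi(x_i)^T phi(x_t)).
kernel = phi_train @ phi_test            # shape (n,)
contributions = alpha * kernel[:, None]  # shape (n, c)
reconstructed = contributions.sum(axis=0)

# Coefficients for the predicted label j, one value per training point.
j = int(np.argmax(logits))
alpha_j = alpha[:, j]
```

Here `alpha_j` plays the role of the per-point quantity described above, while `contributions[:, j]` shows how much each training point adds to the logit of the predicted label for this particular test point.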

Like all the other methods based on computing inverse-hessian-vector products, it performs these computations with the help of objects of the class InverseHessianVectorProduct, which are capable of doing so efficiently.
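To illustrate what an inverse-hessian-vector product is, here is a small NumPy sketch under stated assumptions. It does not use the InverseHessianVectorProduct class: `hessian` is a hypothetical symmetric positive-definite matrix, `grad` a hypothetical gradient, and the product is obtained by solving a linear system rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5

# Hypothetical symmetric positive-definite Hessian and gradient vector.
a = rng.normal(size=(d, d))
hessian = a @ a.T + d * np.eye(d)
grad = rng.normal(size=d)

# Inverse-hessian-vector product: solve H x = g instead of computing H^{-1}.
ihvp = np.linalg.solve(hessian, grad)
```

Solving the system avoids materialising the inverse, which is generally cheaper and numerically better behaved than an explicit inversion followed by a matrix-vector product.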

RepresenterPointLJE

Representer Point Selection via Local Jacobian Expansion for Post-hoc Classifier Explanation of Deep Neural Networks and Ensemble Models https://proceedings.neurips.cc/paper/2021/file/c460dc0f18fc309ac07306a4a55d2fd6-Paper.pdf

__init__(self,
         influence_model: deel.influenciae.common.model_wrappers.InfluenceModel,
         dataset: tf.Dataset,
         ihvp_calculator_factory: deel.influenciae.common.ihvp_factory.InverseHessianVectorProductFactory,
         n_samples_for_hessian: Optional[int] = None,
         target_layer: Union[int, str] = -1,
         shuffle_buffer_size: int = 10000,
         epsilon: float = 1e-05)

Parameters

  • influence_model : deel.influenciae.common.model_wrappers.InfluenceModel

    • The TF2.X model implementing the InfluenceModel interface.

  • ihvp_calculator_factory : deel.influenciae.common.ihvp_factory.InverseHessianVectorProductFactory

    • An InverseHessianVectorProductFactory for creating new instances of the InverseHessianVectorProduct class.

  • n_samples_for_hessian : Optional[int] = None

    • An integer for the number of samples from the training dataset that will be used for the computation of the Hessian matrix.

      If None, the whole dataset will be used.

  • target_layer : Union[int, str] = -1

    • Either a string or an integer identifying the layer on which to compute the influence-related quantities.

  • shuffle_buffer_size : int = 10000

    • An integer with the buffer size for the training set's shuffle operation.

  • epsilon : float = 1e-05

    • An epsilon value to prevent division by zero.
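As a rough illustration of the trade-off behind `n_samples_for_hessian`, the sketch below estimates a Hessian from a random subsample and compares it with the estimate on the full data. It assumes a squared-error loss on a linear layer, for which the Hessian with respect to the weights is proportional to \(\Phi^T \Phi / n\); the class itself works with the model's actual loss, and every name in the block is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 3
features = rng.normal(size=(n, d))       # hypothetical feature maps phi(x_i)

def hessian_estimate(feats):
    """Hessian of a squared-error loss on a linear layer, up to a constant."""
    return feats.T @ feats / len(feats)

# Estimate using every training sample.
full_hessian = hessian_estimate(features)

# Estimate using a shuffled subsample of the training set.
n_samples_for_hessian = 500
subset = rng.permutation(n)[:n_samples_for_hessian]
sub_hessian = hessian_estimate(features[subset])

# Largest entry-wise gap between the two estimates.
max_abs_error = np.abs(full_hessian - sub_hessian).max()
```

A smaller subsample makes the Hessian cheaper to compute at the cost of a noisier estimate.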

compute_influence_values(self,
                         train_set: tf.Dataset,
                         device: Optional[str] = None) -> tf.Dataset

Compute the influence score for each sample of the provided (full or partial) model's training dataset.

Parameters

  • train_set : tf.Dataset

    • A TF dataset with the (full or partial) model's training dataset.

  • device : Optional[str] = None

    • Device where the computation will be executed

Return

  • train_set : tf.Dataset

    • A dataset containing the tuple: (batch of training samples, influence score)


compute_influence_vector(self,
                         train_set: tf.Dataset,
                         save_influence_vector_ds_path: Optional[str] = None,
                         device: Optional[str] = None) -> tf.Dataset

Compute the influence vector for each sample of the provided (full or partial) model's training dataset.

Parameters

  • train_set : tf.Dataset

    • A TF dataset with the (full or partial) model's training dataset.

  • save_influence_vector_ds_path : Optional[str] = None

    • The path to save or load the influence vector of the training dataset. If specified, load the dataset if it has already been computed, otherwise, compute the influence vector and then save it in the specified path.

  • device : Optional[str] = None

    • Device where the computation will be executed

Return

  • inf_vect_ds : tf.Dataset

    • A dataset containing the tuple: (batch of training samples, influence vector)
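The load-or-compute behaviour described for `save_influence_vector_ds_path` can be sketched as follows. This is a simplified stand-in that caches a NumPy array in a `.npy` file; the library stores TF datasets, and `load_or_compute` and `expensive_influence_vectors` are hypothetical helpers written only for this illustration.

```python
import os
import tempfile

import numpy as np

def load_or_compute(path, compute_fn):
    """Load the array stored at `path` if present, else compute and save it."""
    if path is not None and os.path.exists(path):
        return np.load(path)
    result = compute_fn()
    if path is not None:
        np.save(path, result)
    return result

calls = []

def expensive_influence_vectors():
    """Hypothetical expensive computation; records how often it runs."""
    calls.append(1)
    return np.arange(6, dtype=np.float32).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "influence_vectors.npy")
    first = load_or_compute(cache_path, expensive_influence_vectors)   # computes and saves
    second = load_or_compute(cache_path, expensive_influence_vectors)  # loads from disk
```

With a path set, the expensive computation runs only on the first call; with `path=None` it runs every time.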


compute_top_k_from_training_dataset(self,
                                    train_set: tf.Dataset,
                                    k: int,
                                    order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>) -> Tuple[tf.Tensor, tf.Tensor]

Compute the k most influential data-points of the model's training dataset by computing Cook's distance for each point individually.

Parameters

  • train_set : tf.Dataset

    • A TF dataset containing the points on which the model was trained.

  • k : int

    • An integer with the number of most important samples we wish to keep

  • order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>

    • Either ORDER.DESCENDING or ORDER.ASCENDING depending on if we wish to find the top-k or bottom-k samples, respectively.

Return

  • training_samples, influences_values : Tuple[tf.Tensor, tf.Tensor]

    • A tuple of tensors.

      - training_samples: A tensor containing the k most influential samples of the training dataset for the model provided.

      - influences_values: The influence score corresponding to these k most influential samples.
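The selection step can be sketched with NumPy as follows. This is an illustration only: `Order` is a local stand-in for the library's ORDER enum, `select_top_k` is a hypothetical helper, and the scores are made-up numbers rather than real influence values.

```python
from enum import Enum, auto

import numpy as np

class Order(Enum):
    """Local stand-in for the library's ORDER enum (hypothetical)."""
    ASCENDING = auto()
    DESCENDING = auto()

def select_top_k(samples, scores, k, order=Order.DESCENDING):
    """Return the k samples with the largest (or smallest) scores."""
    idx = np.argsort(scores, kind="stable")   # ascending by score
    if order is Order.DESCENDING:
        idx = idx[::-1]
    idx = idx[:k]
    return samples[idx], scores[idx]

samples = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
scores = np.array([0.3, -1.2, 2.5, 0.0, 1.1])

top_samples, top_scores = select_top_k(samples, scores, k=2)
bottom_samples, bottom_scores = select_top_k(samples, scores, k=2,
                                             order=Order.ASCENDING)
```

With descending order the helper keeps the two highest-scoring samples; with ascending order it keeps the two lowest-scoring ones.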


estimate_influence_values_in_batches(self,
                                     dataset_to_evaluate: tf.Dataset,
                                     train_set: tf.Dataset,
                                     influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>,
                                     load_influence_vector_path: Optional[str] = None,
                                     save_influence_vector_path: Optional[str] = None,
                                     save_influence_value_path: Optional[str] = None,
                                     device: Optional[str] = None) -> tf.Dataset

Estimates the influence that each point in the provided training dataset has on each of the test points. This can provide some insights as to what makes the model predict a certain way for the given test points, and thus presents data-centric explanations.

Parameters

  • dataset_to_evaluate : tf.Dataset

    • A TF dataset containing the test samples for which to compute the effect of removing each of the provided training points (individually).

  • train_set : tf.Dataset

    • A TF dataset containing the model's training dataset (partial or full).

  • influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>

    • An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.

      Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.

  • load_influence_vector_path : Optional[str] = None

    • The path to load the influence vectors (if they have already been calculated).

  • save_influence_vector_path : Optional[str] = None

    • The path to save the computed influence vector.

  • save_influence_value_path : Optional[str] = None

    • The path to save the computed influence values.

  • device : Optional[str] = None

    • Device where the computation will be executed

Return

  • influence_value_dataset : tf.Dataset

    • A dataset containing the tuple: (samples_to_evaluate, dataset).

      - samples_to_evaluate: The batch of samples to evaluate.

      - dataset: A dataset containing tuples of (batch of training samples, their influence scores on the samples to evaluate).
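The nested shape of this return value can be mimicked with plain Python lists. The sketch below assumes each influence value is a dot product between a per-sample vector of the evaluated point and one of the training point, as the kernel form of the formula suggests; `batches` and `estimate_in_batches` are hypothetical helpers, not the library's code.

```python
import numpy as np

def batches(array, batch_size):
    """Yield consecutive slices of `array` with at most `batch_size` rows."""
    for start in range(0, len(array), batch_size):
        yield array[start:start + batch_size]

def estimate_in_batches(test_vectors, train_vectors, batch_size):
    """Build [(test_batch, [(train_batch, scores), ...]), ...]."""
    results = []
    for test_batch in batches(test_vectors, batch_size):
        per_train_batch = []
        for train_batch in batches(train_vectors, batch_size):
            # scores[t, i]: value of training point i for evaluated point t.
            scores = test_batch @ train_batch.T
            per_train_batch.append((train_batch, scores))
        results.append((test_batch, per_train_batch))
    return results

rng = np.random.default_rng(3)
test_vectors = rng.normal(size=(3, 4))    # hypothetical vectors for test points
train_vectors = rng.normal(size=(5, 4))   # hypothetical vectors for training points
results = estimate_in_batches(test_vectors, train_vectors, batch_size=2)
```

Each outer entry pairs one batch of evaluated samples with an inner list that walks over the training batches, mirroring the (samples_to_evaluate, dataset) tuples described above.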


top_k(self,
      dataset_to_evaluate: tf.Dataset,
      train_set: tf.Dataset,
      k: int = 5,
      nearest_neighbors: deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors = ,
      influence_vector_in_cache: deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>,
      load_influence_vector_ds_path: Optional[str] = None,
      save_influence_vector_ds_path: Optional[str] = None,
      save_top_k_ds_path: Optional[str] = None,
      order: deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>,
      d_type: tensorflow.python.framework.dtypes.DType = tf.float32,
      device: Optional[str] = None) -> tf.Dataset

Find the top-k closest elements in the training dataset for each element of the dataset to evaluate. The method returns a dataset that pairs each sample to evaluate with its top-k influence values and the corresponding top-k training samples.

Parameters

  • dataset_to_evaluate : tf.Dataset

    • The dataset containing the samples that will be compared to the training dataset.

  • train_set : tf.Dataset

    • The dataset used to train the model.

  • k : int = 5

    • The number of most influential samples to retain from the training dataset.

  • nearest_neighbors : deel.influenciae.utils.nearest_neighbors.BaseNearestNeighbors =

    • The nearest neighbor method. The default method is a linear search

  • influence_vector_in_cache : deel.influenciae.common.base_influence.CACHE = <CACHE.MEMORY: 0>

    • An enum indicating if intermediary values are to be cached (either in memory or on the disk) or not.

      Options include CACHE.MEMORY (0) for caching in memory, CACHE.DISK (1) for the disk and CACHE.NO_CACHE (2) for no optimization.

  • load_influence_vector_ds_path : Optional[str] = None

    • The path to load the influence vectors (if they have already been calculated).

  • save_influence_vector_ds_path : Optional[str] = None

    • The path to save the computed influence vector.

  • save_top_k_ds_path : Optional[str] = None

    • The path to save the result of the computation of the top-k elements

  • order : deel.influenciae.utils.sorted_dict.ORDER = <ORDER.DESCENDING: 2>

    • Either ORDER.DESCENDING or ORDER.ASCENDING depending on if we wish to find the top-k or bottom-k samples, respectively.

  • d_type : tensorflow.python.framework.dtypes.DType = tf.float32

    • The data-type of the tensors.

  • device : Optional[str] = None

    • Device where the computation will be executed

Return

  • top_k_dataset : tf.Dataset

    • A dataset containing the tuple (samples_to_evaluate, influence_values, training_samples).

      - samples_to_evaluate: The batch of samples to evaluate.

      - influence_values: Top-k influence values for each sample to evaluate.

      - training_samples: Top-k training sample for each sample to evaluate.
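A linear search for the top-k training samples of each evaluated sample can be sketched as below. It is a NumPy illustration under assumptions, not the library's nearest-neighbor class: `influence` is a made-up matrix with one row per evaluated sample and one column per training sample, and `linear_top_k` is a hypothetical helper.

```python
import numpy as np

def linear_top_k(influence, k):
    """Per row, indices and values of the k largest entries, best first."""
    order = np.argsort(-influence, axis=1)[:, :k]
    values = np.take_along_axis(influence, order, axis=1)
    return order, values

# Rows: samples to evaluate. Columns: training samples.
influence = np.array([[0.1, 0.9, -0.3, 0.5],
                      [2.0, -1.0, 0.0, 1.5]])
train_samples = np.array([[10.0], [11.0], [12.0], [13.0]])

indices, values = linear_top_k(influence, k=2)
top_training_samples = train_samples[indices]   # shape (n_test, k, 1)
```

The result has the same layout as the return value described above: for each evaluated sample, k influence values and the k training samples they belong to.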