GradCAM¶
View colab tutorial | View source | 📰 Paper
Grad-CAM is a technique for producing visual explanations for Convolutional Neural Networks (CNNs). It relies on both the gradients and the feature maps of the last convolutional layer.
Quote
Grad-CAM uses the gradients of any target concept (say logits for “dog” or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
-- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization (2016).
More precisely, to obtain the localization map for a prediction \(f(x)\), we need to compute the weight \(w_k\) associated with each feature map channel \(A^k \in \mathbb{R}^{W \times H}\). As we use the last convolutional layer, \(k\) indexes the filters of that layer, and \(Z\) is the number of pixels in each feature map (\(Z = W \times H\), e.g. 7x7 for ResNet50):

\[ w_k = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial f(x)}{\partial A^k_{ij}} \]

We then use these weights to aggregate the feature maps and obtain our Grad-CAM attribution \(\phi\):

\[ \phi = \max\left(0, \sum_k w_k A^k\right) \]

Notice that \(\phi \in \mathbb{R}^{W \times H}\): the size of the explanation thus depends on the spatial size \((W, H)\) of the last feature map. In order to compare it to the original input \(x\), we upsample \(\phi\) using bicubic interpolation.
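To make the two formulas above concrete, here is a minimal sketch of the computation with tf.GradientTape. It is an illustration only, not Xplique's internal implementation; model, last_conv_name, image and class_index are placeholder names, and the sketch assumes a Keras functional model:

import tensorflow as tf

def grad_cam_sketch(model, last_conv_name, image, class_index):
    # Sub-model mapping the input to (feature maps A^k, logits f(x)).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        feature_maps, logits = grad_model(image[tf.newaxis])  # add batch dim
        score = logits[0, class_index]                        # f(x) for the target class
    grads = tape.gradient(score, feature_maps)                # d f(x) / d A^k_ij
    # w_k: spatial average of the gradients (the 1/Z sum over i, j).
    weights = tf.reduce_mean(grads, axis=(1, 2))              # shape (1, K)
    # phi = max(0, sum_k w_k A^k), then bicubic upsampling to the input size.
    cam = tf.nn.relu(tf.einsum('bhwk,bk->bhw', feature_maps, weights))
    return tf.image.resize(cam[..., tf.newaxis], image.shape[:2],
                           method='bicubic')[0, ..., 0]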
Example¶
from xplique.attributions import GradCAM
# load images, labels and model
# ...
method = GradCAM(model)
explanations = method.explain(images, labels)
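The labels passed to explain are one-hot encodings, one per input (see the explain documentation below). A minimal sketch of preparing them, assuming integer class ids class_ids and a 1000-class model, both placeholder names:

import tensorflow as tf
from xplique.attributions import GradCAM

# Turn integer class ids into one-hot targets of shape (N, output_size).
labels = tf.one_hot(class_ids, depth=1000)
explanations = GradCAM(model).explain(images, labels)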
Notebooks¶
GradCAM¶
Used to compute the Grad-CAM visualization method.
__init__(self,
model: keras.src.engine.training.Model,
output_layer: Union[str, int, None] = None,
batch_size: Optional[int] = 32,
operator: Union[xplique.commons.operators_operations.Tasks, str,
Callable[[keras.src.engine.training.Model, tensorflow.python.framework.tensor.Tensor, tensorflow.python.framework.tensor.Tensor], float], None] = None,
conv_layer: Union[str, int, None] = None)
Parameters

- model : keras.src.engine.training.Model
  The model from which we want to obtain explanations.

- output_layer : Union[str, int, None] = None
  Layer to target for the outputs (e.g. logits or after softmax). If an int is provided it will be interpreted as a layer index. If a string is provided it will look for the layer name. Defaults to the last layer. It is recommended to use the layer before Softmax.

- batch_size : Optional[int] = 32
  Number of inputs to explain at once; if None, compute all at once.

- operator : Union[xplique.commons.operators_operations.Tasks, str, Callable[[keras.src.engine.training.Model, tensorflow.python.framework.tensor.Tensor, tensorflow.python.framework.tensor.Tensor], float], None] = None
  Function g to explain; g takes 3 parameters (f, x, y) and should return a scalar, with f the model, x the inputs and y the targets. If None, the standard operator g(f, x, y) = f(x)[y] is used. An illustrative sketch is given after this list.

- conv_layer : Union[str, int, None] = None
  Layer to target for the Grad-CAM algorithm. If an int is provided it will be interpreted as a layer index. If a string is provided it will look for the layer name. If None, the last convolutional layer is used.
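As an illustration of these parameters, the sketch below instantiates GradCAM on a Keras ResNet50 with an explicit operator that mirrors the documented default g(f, x, y) = f(x)[y], and targets a named convolutional layer. The layer name "conv5_block3_out" is assumed to be the last convolutional block of Keras' ResNet50; adapt it to your own model:

import tensorflow as tf
from xplique.attributions import GradCAM

model = tf.keras.applications.ResNet50()

# Operator g(f, x, y): score of the target class for each input,
# i.e. the documented default g(f, x, y) = f(x)[y] written explicitly.
def class_score(model, inputs, targets):
    return tf.reduce_sum(model(inputs) * targets, axis=-1)

method = GradCAM(model,
                 batch_size=64,
                 operator=class_score,
                 conv_layer="conv5_block3_out")  # assumed last conv block name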
explain(self,
inputs: Union[tf.Dataset, tensorflow.python.framework.tensor.Tensor, numpy.ndarray],
targets: Union[tensorflow.python.framework.tensor.Tensor, numpy.ndarray, None] = None) -> tensorflow.python.framework.tensor.Tensor
Compute and resize the explanations to match the shape of the inputs.
Accepts a Tensor, a numpy array or a tf.data.Dataset (in that case, targets must be left to None, as they are included in the Dataset).
Parameters

- inputs : Union[tf.Dataset, tensorflow.python.framework.tensor.Tensor, numpy.ndarray]
  Dataset, Tensor or Array. Input samples to be explained. If a Dataset is given, targets should not be provided (they are included in the Dataset). Expected shapes: (N, W), (N, T, W) or (N, H, W, C). More information in the documentation.

- targets : Union[tensorflow.python.framework.tensor.Tensor, numpy.ndarray, None] = None
  Tensor or Array. One-hot encoding of the model's output from which an explanation is desired. One encoding per input and only one output at a time; the expected shape is therefore (N, output_size). More information in the documentation.
Return

- grad_cam : tensorflow.python.framework.tensor.Tensor
  Grad-CAM explanations, with the same shape as the inputs except for the channel dimension.
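When the inputs come as a tf.data.Dataset, the one-hot targets travel with the inputs and the targets argument is omitted, as described above. A minimal sketch, with images and labels as placeholders for your own data:

import tensorflow as tf

# The Dataset yields (input, one-hot target) pairs; `targets` is left to None.
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
explanations = method.explain(dataset)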