
Grad-CAM

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

Gradient-weighted Class Activation Mapping

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra

code-torch: https://github.com/ramprs/grad-cam/

code-pytorch: https://github.com/jacobgil/pytorch-grad-cam

code-keras: https://github.com/jacobgil/keras-grad-cam


1. Notes

I recently came across Grad-CAM while reading the person re-ID paper Re-Identification with Consistent Attentive Siamese Networks, so this post is a quick study of Grad-CAM rather than a deep dive.

2018-12-18

While using this post I found it hard to take in at a glance, so I have reorganized it following the reference links.

2018-12-20

Re-reading the person re-ID paper, I found that the network architecture it builds on is GAIN, so I have added a description of GAIN and its code. The focus of this post is now Grad-CAM and GAIN.

2. Background

A number of methods have been developed for interpreting and visualizing deep models, including but not limited to Deconvolution, Guided-Backpropagation, CAM, and Grad-CAM.

Deconvolution and Guided-Backpropagation produce fine-grained visualizations, while CAM and Grad-CAM produce class-discriminative heatmaps.

For implementations and sample outputs of the various visualization methods, see: https://github.com/utkuozbulak/pytorch-cnn-visualizations

Reference: https://blog.csdn.net/geek_wh2016/article/details/81060315

2.1 Deconvolution

Deconvolution: Visualizing and Understanding Convolutional Networks

code: https://github.com/kvfrans/feature-visualization

Overview: this paper is the seminal work on CNN visualization (published in 2013 by Matthew Zeiler, a star student of LeCun). It mainly addresses two questions:

  1. Why do CNNs perform so well?
  2. How might CNNs be improved?

Implementation: for a CNN, visualization runs the forward pass in reverse, i.e., Unpooling + ReLU + Deconv:

  1. Unpooling: record the location of each max-pooling maximum (the "switches" table); when unpooling, place the max value back at its recorded location and fill the rest with zeros. (A minimal sketch of switch-based unpooling follows this list.)
  2. ReLU: apply ReLU again.
  3. Deconv: convolve the features with the transpose of the corresponding forward kernels.
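
To make the switch mechanism concrete, here is a minimal NumPy sketch of max-pooling with recorded switches and the matching unpooling (function names are illustrative, not from the paper's code):

import numpy as np

def max_pool_with_switches(x, k=2):
    # k x k max-pool that also records each window's argmax (the "switch")
    h, w = x.shape
    out = np.zeros((h // k, w // k), dtype=x.dtype)
    switches = np.zeros_like(out, dtype=np.int64)  # flat index within each window
    for i in range(h // k):
        for j in range(w // k):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            switches[i, j] = window.argmax()
            out[i, j] = window.max()
    return out, switches

def unpool_with_switches(y, switches, k=2):
    # place each value back at its recorded location; zeros everywhere else
    h, w = y.shape
    out = np.zeros((h * k, w * k), dtype=y.dtype)
    for i in range(h):
        for j in range(w):
            di, dj = divmod(int(switches[i, j]), k)
            out[i*k + di, j*k + dj] = y[i, j]
    return out

x = np.arange(16, dtype=np.float32).reshape(4, 4)
y, sw = max_pool_with_switches(x)
print(unpool_with_switches(y, sw))  # maxima restored in place, zeros elsewhere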


Figure: structure of the Deconvolution visualization.

2.2 Guided-Backpropagation

Guided-Backpropagation: Striving for Simplicity: The All Convolutional Net

Vanilla backpropagation, deconvnet, and guided backpropagation are all backward passes; they differ only in how the gradient is treated when passing through a ReLU layer. The paper explains this in detail.

With $f^l$ the forward activation entering a ReLU and $R^{l+1}$ the signal arriving from the layer above, the three rules are:

- Backpropagation: $R^l = (f^l > 0) \cdot R^{l+1}$
- Deconvnet: $R^l = (R^{l+1} > 0) \cdot R^{l+1}$
- Guided backpropagation: $R^l = (f^l > 0) \cdot (R^{l+1} > 0) \cdot R^{l+1}$
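
A quick NumPy illustration of the three rules on toy values:

import numpy as np

f = np.array([-1.0, 2.0, 3.0, -4.0])   # forward input to the ReLU
R = np.array([ 0.5, -1.0, 2.0,  3.0])  # signal arriving from the layer above

backprop = (f > 0) * R                  # mask by the forward activation
deconv = (R > 0) * R                    # mask by the sign of the incoming signal
guided = (f > 0) * (R > 0) * R          # mask by both

print(backprop)  # [ 0. -1.  2.  0.]
print(deconv)    # [0.5 0.  2.  3. ]
print(guided)    # [0. 0. 2. 0.]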

The paper also proposes replacing pooling with strided convolutions and studies the effectiveness of this all-convolutional structure.

The results look as follows:

Figure: outputs of the three backpropagation variants.

As the figure shows, Guided-Backpropagation extracts the features that matter for classification, but the result is not class-discriminative.

2.3 CAM

CAM: Learning Deep Features for Discriminative Localization

Overview: the paper revisits the effectiveness of global average pooling (GAP) and details how GAP gives a CNN strong object-localization ability.
Idea: drop the fully connected layers and use GAP instead.
Implementation: see the figures below; a minimal sketch follows after them.

Figure: CAM network structure.

Figure: CAM results.
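
A minimal NumPy sketch of the idea (names are illustrative, not from the paper's code): after GAP and a single linear layer, the class-c heatmap is just the classifier-weighted sum of the final feature maps:

import numpy as np

def compute_cam(A, W_fc, c):
    # A: final conv feature maps (K, H, W); W_fc: classifier weights (n_class, K)
    cam = np.tensordot(W_fc[c], A, axes=1)                    # weighted sum -> (H, W)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize for display
    return cam                                                # upsample before overlaying

A = np.random.rand(512, 14, 14).astype(np.float32)
W_fc = np.random.rand(20, 512).astype(np.float32)
heatmap = compute_cam(A, W_fc, c=3)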

2.4 Grad-CAM

Grad-CAM is an improved version of CAM. The difference is that Grad-CAM obtains the feature-map weights by backpropagation, whereas CAM takes them directly from the classifier's weights.

Grad-CAM can be attached to any architecture without modifying it, while CAM requires the network to use GAP.

A detailed description follows below.

2.5 GAIN

GAIN: Tell Me Where to Look: Guided Attention Inference Network

code: https://github.com/alokwhitewolf/Guided-Attention-Inference-Network

GAIN builds on Grad-CAM. Grad-CAM can only visualize and explain what an already-trained network attends to; it cannot steer the network. GAIN turns the attention map into a training signal, guiding the network to correct its mistakes and attend to the right locations.

Problem: when recognizing boats, the network attends to the water surface rather than the boats.

Figure: misplaced attention.

Implementation: the attention map is used to mask the object out of the image, and the network is trained to minimize the class scores of the masked image.

Figure: GAIN network architecture.

In the overall architecture there is only one network; the two streams share the same weights.

Loss function: the total loss is the classification loss plus the attention-mining loss, where the latter averages the masked image's scores over the $n$ ground-truth classes:

$$L_{self} = L_{cl} + \alpha L_{am}, \qquad L_{am} = \frac{1}{n} \sum_c S^c(I^{*c})$$

where $S^c(\cdot)$ is the class-$c$ score and $I^{*c}$ is the image with the class-$c$ attention region masked out.

Extension: if extra ground-truth supervision is available (e.g., segmentation masks), the network can be extended with an additional loss on the attention map.

Figure: extended network architecture.

2.5.1 GAIN-code

Step 1: train the classification network

FCN:

self.conv1_1 = L.Convolution2D(3, 64, 3, 1, 1)
self.conv1_2 = L.Convolution2D(64, 64, 3, 1, 1)

self.conv2_1 = L.Convolution2D(64, 128, 3, 1, 1)
self.conv2_2 = L.Convolution2D(128, 128, 3, 1, 1)

self.conv3_1 = L.Convolution2D(128, 256, 3, 1, 1)
self.conv3_2 = L.Convolution2D(256, 256, 3, 1, 1)
self.conv3_3 = L.Convolution2D(256, 256, 3, 1, 1)

self.conv4_1 = L.Convolution2D(256, 512, 3, 1, 1)
self.conv4_2 = L.Convolution2D(512, 512, 3, 1, 1)
self.conv4_3 = L.Convolution2D(512, 512, 3, 1, 1)

self.conv5_1 = L.Convolution2D(512, 512, 3, 1, 1)
self.conv5_2 = L.Convolution2D(512, 512, 3, 1, 1)
self.conv5_3 = L.Convolution2D(512, 512, 3, 1, 1)

self.fc6 = L.Convolution2D(512, 4096, 7, 1, 0)
self.fc7 = L.Convolution2D(4096, 4096, 1, 1, 0)
self.score_fr = L.Convolution2D(4096, n_class, 1, 1, 0)
# Not shown in this excerpt: the FCN8s skip/upsampling layers used below
# (score_pool3, score_pool4, upscore2, upscore_pool4, upscore8).

def segment(self, x, t=None):
    # conv1
    self.conv1_1.pad = (100, 100)  # FCN trick: pad so fc6's 7x7 conv fits any input
    h = F.relu(self.conv1_1(x))
    conv1_1 = h
    h = F.relu(self.conv1_2(conv1_1))
    conv1_2 = h
    h = _max_pooling_2d(conv1_2)
    pool1 = h  # 1/2

    # conv2
    h = F.relu(self.conv2_1(pool1))
    conv2_1 = h
    h = F.relu(self.conv2_2(conv2_1))
    conv2_2 = h
    h = _max_pooling_2d(conv2_2)
    pool2 = h  # 1/4

    # conv3
    h = F.relu(self.conv3_1(pool2))
    conv3_1 = h
    h = F.relu(self.conv3_2(conv3_1))
    conv3_2 = h
    h = F.relu(self.conv3_3(conv3_2))
    conv3_3 = h
    h = _max_pooling_2d(conv3_3)
    pool3 = h  # 1/8

    # conv4
    h = F.relu(self.conv4_1(pool3))
    h = F.relu(self.conv4_2(h))
    h = F.relu(self.conv4_3(h))
    h = _max_pooling_2d(h)
    pool4 = h  # 1/16

    # conv5
    h = F.relu(self.conv5_1(pool4))
    h = F.relu(self.conv5_2(h))
    h = F.relu(self.conv5_3(h))
    h = _max_pooling_2d(h)
    pool5 = h  # 1/32

    # fc6
    h = F.relu(self.fc6(pool5))
    h = F.dropout(h, ratio=.5)
    fc6 = h  # 1/32

    # fc7
    h = F.relu(self.fc7(fc6))
    h = F.dropout(h, ratio=.5)
    fc7 = h  # 1/32

    # score_fr
    h = self.score_fr(fc7)
    score_fr = h  # 1/32

    # score_pool3
    h = self.score_pool3(pool3)
    score_pool3 = h  # 1/8

    # score_pool4
    h = self.score_pool4(pool4)
    score_pool4 = h  # 1/16

    # upscore2
    h = self.upscore2(score_fr)
    upscore2 = h  # 1/16

    # score_pool4c: crop score_pool4 to the upsampled size
    h = score_pool4[:, :,
                    5:5 + upscore2.shape[2],
                    5:5 + upscore2.shape[3]]
    score_pool4c = h  # 1/16

    # fuse_pool4
    h = upscore2 + score_pool4c
    fuse_pool4 = h  # 1/16

    # upscore_pool4
    h = self.upscore_pool4(fuse_pool4)
    upscore_pool4 = h  # 1/8

    # score_pool3c: crop score_pool3 to the upsampled size
    h = score_pool3[:, :,
                    9:9 + upscore_pool4.shape[2],
                    9:9 + upscore_pool4.shape[3]]
    score_pool3c = h  # 1/8

    # fuse_pool3
    h = upscore_pool4 + score_pool3c
    fuse_pool3 = h  # 1/8

    # upscore8
    h = self.upscore8(fuse_pool3)
    upscore8 = h  # 1/1

    # score: crop back to the input size
    h = upscore8[:, :, 31:31 + x.shape[2], 31:31 + x.shape[3]]
    score = h  # 1/1
    self.score = score

    if t is None:
        assert not chainer.config.train
        return

    loss = F.softmax_cross_entropy(score, t, normalize=True)
    if np.isnan(float(loss.data)):
        raise ValueError('Loss is nan.')
    chainer.report({'loss': loss}, self)
    self.conv1_1.pad = (1, 1)  # restore the normal padding
    return loss

FCN-v1.0: plain classification

self.conv1_1 = L.Convolution2D(3, 64, 3, 1, 1)
self.conv1_2 = L.Convolution2D(64, 64, 3, 1, 1)

self.conv2_1 = L.Convolution2D(64, 128, 3, 1, 1)
self.conv2_2 = L.Convolution2D(128, 128, 3, 1, 1)

self.conv3_1 = L.Convolution2D(128, 256, 3, 1, 1)
self.conv3_2 = L.Convolution2D(256, 256, 3, 1, 1)
self.conv3_3 = L.Convolution2D(256, 256, 3, 1, 1)

self.conv4_1 = L.Convolution2D(256, 512, 3, 1, 1)
self.conv4_2 = L.Convolution2D(512, 512, 3, 1, 1)
self.conv4_3 = L.Convolution2D(512, 512, 3, 1, 1)

self.conv5_1 = L.Convolution2D(512, 512, 3, 1, 1)
self.conv5_2 = L.Convolution2D(512, 512, 3, 1, 1)
self.conv5_3 = L.Convolution2D(512, 512, 3, 1, 1)

self.fc6_cl = L.Linear(512, 4096)
self.fc7_cl = L.Linear(4096, 4096)
self.score_cl = L.Linear(4096, n_class - 1)  # disregard class 0 for classification

self.final_conv_layer = 'conv5_3'
self.grad_target_layer = 'prob'
self.freezed_layers = ['fc6_cl', 'fc7_cl', 'score_cl']

def classify(self, x, is_training=True):
    with chainer.using_config('train', False):
        # conv1
        h = F.relu(self.conv1_1(x))
        h = F.relu(self.conv1_2(h))
        h = _max_pooling_2d(h)

        # conv2
        h = F.relu(self.conv2_1(h))
        h = F.relu(self.conv2_2(h))
        h = _max_pooling_2d(h)

        # conv3
        h = F.relu(self.conv3_1(h))
        h = F.relu(self.conv3_2(h))
        h = F.relu(self.conv3_3(h))
        h = _max_pooling_2d(h)

        # conv4
        h = F.relu(self.conv4_1(h))
        h = F.relu(self.conv4_2(h))
        h = F.relu(self.conv4_3(h))
        h = _max_pooling_2d(h)

        # conv5
        h = F.relu(self.conv5_1(h))
        h = F.relu(self.conv5_2(h))
        h = F.relu(self.conv5_3(h))
        h = _max_pooling_2d(h)
        h = _average_pooling_2d(h)

    with chainer.using_config('train', is_training):
        h = F.relu(F.dropout(self.fc6_cl(h), .5))
        h = F.relu(F.dropout(self.fc7_cl(h), .5))
        h = self.score_cl(h)
        # output: 1 x 20

    return h

loss:

# cl_output = classify(image)  -> cl_output: 1 x 20
# target: 1 x 20 with entries in {0, 1}; 1 means the image has that label, 0 means it does not.
# Multi-label loss: classes are not mutually exclusive, so each class is treated
# as an independent binary classification (sigmoid cross-entropy).
loss = F.sigmoid_cross_entropy(cl_output, target, normalize=True)
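
For instance, with the VOC-style ground truth `array([0, 14])` used in the `stream_cl` example further down, the 1 x 20 multi-hot `target` could be built like this (a sketch; variable names are illustrative):

import numpy as np

n_class = 20
gt_labels = np.array([0, 14])              # ground-truth class indices
target = np.zeros((1, n_class), dtype=np.int32)
target[0, gt_labels] = 1                   # 1 where a label is present, 0 elsewhere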

Step 2: train GAIN

self.GAIN_functions = collections.OrderedDict([
    ('conv1_1', [self.conv1_1, F.relu]),
    ('conv1_2', [self.conv1_2, F.relu]),
    ('pool1', [_max_pooling_2d]),

    ('conv2_1', [self.conv2_1, F.relu]),
    ('conv2_2', [self.conv2_2, F.relu]),
    ('pool2', [_max_pooling_2d]),

    ('conv3_1', [self.conv3_1, F.relu]),
    ('conv3_2', [self.conv3_2, F.relu]),
    ('conv3_3', [self.conv3_3, F.relu]),
    ('pool3', [_max_pooling_2d]),

    ('conv4_1', [self.conv4_1, F.relu]),
    ('conv4_2', [self.conv4_2, F.relu]),
    ('conv4_3', [self.conv4_3, F.relu]),
    ('pool4', [_max_pooling_2d]),

    ('conv5_1', [self.conv5_1, F.relu]),
    ('conv5_2', [self.conv5_2, F.relu]),
    ('conv5_3', [self.conv5_3, F.relu]),
    ('pool5', [_max_pooling_2d]),

    ('avg_pool', [_average_pooling_2d]),

    ('fc6_cl', [self.fc6_cl, F.relu]),
    ('fc7_cl', [self.fc7_cl, F.relu]),
    ('prob', [self.score_cl, F.sigmoid]),
])
self.final_conv_layer = 'conv5_3'
self.grad_target_layer = 'prob'
self.freezed_layers = ['fc6_cl', 'fc7_cl', 'score_cl']

Obtaining the mask from the classification result:

def stream_cl(self, inp, label=None):
    # inp: 1 x 3 x 281 x 500
    # label: ground-truth class indices, e.g. array([0, 14])
    # returns: gcam: attention mask resized to the input size;
    #          h: class scores, shape (1, 20); class_id: a single class index
    h = inp
    for key, funcs in self.GAIN_functions.items():
        for func in funcs:
            h = func(h)
        if key == self.final_conv_layer:
            activation = h
        if key == self.grad_target_layer:
            break

    gcam, class_id = self.get_gcam(h, activation, (inp.shape[-2], inp.shape[-1]), label=label)
    return gcam, h, class_id

def get_gcam(self, end_output, activations, shape, label):
    # end_output: class scores, shape 1 x 20
    # activations: final conv feature maps, shape 1 x 512 x 18 x 32
    # shape: target spatial size, e.g. (281, 500)
    # label: ground-truth class indices
    self.cleargrads()
    class_id = self.set_init_grad(end_output, label)
    end_output.backward(retain_grad=True)
    grad = activations.grad_var
    # global-average-pool the gradients -> one weight per channel (alpha_k^c)
    grad = F.average_pooling_2d(grad, (grad.shape[-2], grad.shape[-1]), 1)
    grad = F.expand_dims(F.reshape(grad, (grad.shape[0] * grad.shape[1], grad.shape[2], grad.shape[3])), 0)
    weights = activations
    weights = F.expand_dims(F.reshape(weights, (weights.shape[0] * weights.shape[1], weights.shape[2], weights.shape[3])), 0)
    # a 1x1 convolution of the feature maps with the pooled gradients computes the
    # weighted sum over channels; ReLU and resizing give the Grad-CAM map
    gcam = F.resize_images(F.relu(F.convolution_2d(weights, grad, None, 1, 0)), shape)
    return gcam, class_id

$L_{cl}$

gcam, cl_scores, class_id = self._optimizers['main'].target.stream_cl(image, gt_labels)
cl_loss = F.sigmoid_cross_entropy(cl_scores, target, normalize=True)

$L_{am}$

masked_output = self._optimizers['main'].target.stream_am(masked_image)
masked_output = F.sigmoid(masked_output)
am_loss = masked_output[0][class_id][0]

Note: $L_{cl}$ and $L_{am}$ are computed with fully shared network weights.
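
For reference, the `masked_image` passed to `stream_am` above comes from soft-masking the input with the thresholded attention map, $I^{*c} = I - T(A^c) \odot I$ with $T(A^c) = \mathrm{sigmoid}(\omega(A^c - \sigma))$. A minimal Chainer-style sketch (the helper name is illustrative; the paper uses $\omega = 10$, $\sigma = 0.5$ on an attention map normalized to [0, 1]):

import chainer.functions as F

def get_masked_image(image, gcam, sigma=0.5, omega=10):
    # gcam: attention map, same spatial size as image, normalized to [0, 1]
    mask = F.sigmoid(omega * (gcam - sigma))                   # T(A^c): soft threshold
    return image - image * F.broadcast_to(mask, image.shape)  # I^{*c}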

3. Introduction

A good visualization should be both high-resolution and class-discriminative.

Figures: example images.

4. Approach

CAM
In CAM, the final fully connected layer is replaced by GAP (see the CAM figure above), so the classification score can be written as

$$y^c = \sum_k w_k^c \cdot \frac{1}{Z} \sum_i \sum_j A_{ij}^k$$

where $y^c$ is the score for class $c$; $w_k^c$ is the contribution of the $k$-th feature map to class $c$, i.e. a weight of the final layer; $A_{ij}^k$ is the value of the $k$-th feature map at position $(i, j)$; and $Z = h \cdot w$ is the spatial size of a feature map.

The CAM output map is then

$$L_{CAM}^c = \sum_k w_k^c A^k$$

Grad-CAM
In Grad-CAM, the weights are obtained by backpropagation:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$

and the Grad-CAM output map is

$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big)$$

It can be shown that the Grad-CAM and CAM formulas are variants of the same expression: for a GAP network, $\alpha_k^c$ reduces to $w_k^c$ up to the constant $1/Z$.
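
As a sanity check on these formulas, here is a minimal PyTorch sketch of computing $\alpha_k^c$ and the Grad-CAM map with autograd (the shapes and the stand-in GAP classifier head are illustrative):

import torch

A = torch.rand(1, 512, 14, 14, requires_grad=True)  # final conv feature maps
head = torch.rand(512, 20)                          # stand-in classifier weights
y = A.mean(dim=(2, 3)) @ head                       # GAP then linear -> scores (1, 20)

c = int(y.argmax())                                 # explain the top-scoring class
y[0, c].backward()                                  # dy^c / dA

alpha = A.grad.mean(dim=(2, 3))                     # (1, 512): alpha_k^c, GAP of gradients
cam = torch.relu((alpha[:, :, None, None] * A.detach()).sum(dim=1))  # (1, 14, 14)

# For this GAP head, alpha equals head[:, c] / (14 * 14): the CAM weights up to 1/Z.
print(torch.allclose(alpha[0], head[:, c] / (14 * 14)))  # True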

Guided Grad-CAM
Guided Grad-CAM is the element-wise product of the Grad-CAM map and the Guided-Backpropagation map, yielding a class-discriminative, high-resolution visualization.

Figure: Guided Grad-CAM network architecture.

The authors also use it to analyze samples the CNN misclassifies.
Figure: Guided Grad-CAM on misclassified samples.

5. Code

Below we analyze the PyTorch implementation.

5.1 Grad-CAM

Computing Grad-CAM:

Backward pass: first compute the classification output for the image, `output` (say of size 1*5 for 5 classes); pick the highest-scoring class, say class 2, so one-hot = [0, 1, 0, 0, 0]; reduce `sum(one_hot * output)` to a single scalar; then backpropagate from that scalar.

cam: (H, W), values in (0, 1)

import cv2
import numpy as np
import torch
from torch.autograd import Variable


class FeatureExtractor(object):
    """Class for extracting activations and
    registering gradients from targeted intermediate layers.

    Usage:
        outputs, x = feature_extractor(x)
        gradients = feature_extractor.gradients
    """

    def __init__(self, model, target_layers):
        self.model = model
        self.target_layers = target_layers
        self.gradients = []

    def save_gradient(self, grad):
        self.gradients.append(grad)

    def __call__(self, x):
        """
        :param x: N*C*H*W, a picture
        :return: outputs: list, activations of the target layers (A in the equation)
                 x: feature map output of the model, N*C*H*W
        """
        outputs = []
        self.gradients = []
        for name, module in self.model._modules.items():
            x = module(x)
            if name in self.target_layers:
                x.register_hook(self.save_gradient)
                outputs += [x]
        return outputs, x


class ModelOutputs(object):
    """Class for making a forward pass, and getting:
    1. The network output.
    2. Activations from intermediate targeted layers.
    3. Gradients from intermediate targeted layers.

    Usage:
        target_activations, output = model_outputs(x)
        gradients = model_outputs.get_gradients()
    """

    def __init__(self, model, target_layers):
        self.model = model
        self.feature_extractor = FeatureExtractor(self.model.features, target_layers)

    def get_gradients(self):
        return self.feature_extractor.gradients

    def __call__(self, x):
        """
        :param x: N*C*H*W, a picture
        :return: target_activations: list, activations of the target layers (A in the equation)
                 output: tensor, classification scores, N*n_class (y in the equation)
        """
        target_activations, output = self.feature_extractor(x)
        output = output.view(output.size(0), -1)
        output = self.model.classifier(output)
        return target_activations, output


class GradCam(object):
    """Class for computing the Grad-CAM mask.

    Usage:
        mask = grad_cam(input)
    """

    def __init__(self, model, target_layer_names, use_cuda):
        self.model = model
        self.model.eval()
        self.cuda = use_cuda
        if self.cuda:
            self.model = model.cuda()

        self.extractor = ModelOutputs(self.model, target_layer_names)

    def forward(self, input):
        return self.model(input)

    def __call__(self, input, index=None):
        """
        :param input: N*C*H*W, a picture
        :param index: int, class index; defaults to the top-scoring class
        :return: cam: H*W, values in (0, 1) -- L_{Grad-CAM}^c
        """
        if self.cuda:
            features, output = self.extractor(input.cuda())
        else:
            features, output = self.extractor(input)

        if index is None:
            index = np.argmax(output.cpu().data.numpy())

        one_hot = np.zeros((1, output.size()[-1]), dtype=np.float32)
        one_hot[0][index] = 1
        # (tested: requires_grad=False would also work here)
        one_hot = Variable(torch.from_numpy(one_hot), requires_grad=True)
        if self.cuda:
            one_hot = torch.sum(one_hot.cuda() * output)
        else:
            one_hot = torch.sum(one_hot * output)

        self.model.features.zero_grad()
        self.model.classifier.zero_grad()
        one_hot.backward(retain_graph=True)

        grads_val = self.extractor.get_gradients()[-1].cpu().data.numpy()

        target = features[-1]
        target = target.cpu().data.numpy()[0, :]

        # alpha_k^c: global-average-pool the gradients over the spatial dims
        weights = np.mean(grads_val, axis=(2, 3))[0, :]
        cam = np.zeros(target.shape[1:], dtype=np.float32)

        # weighted sum of the feature maps, then ReLU and min-max normalization
        for i, w in enumerate(weights):
            cam += w * target[i, :, :]

        cam = np.maximum(cam, 0)
        cam = cv2.resize(cam, (224, 224))
        cam = cam - np.min(cam)
        cam = cam / np.max(cam)
        return cam

Displaying Grad-CAM:

def show_cam_on_image(img, mask):
    # img: float32 BGR image in [0, 1]; mask: the (H, W) cam in [0, 1]
    heatmap = cv2.applyColorMap(np.uint8(255 * mask), cv2.COLORMAP_JET)
    heatmap = np.float32(heatmap) / 255
    cam = heatmap + np.float32(img)
    cam = cam / np.max(cam)
    cv2.imwrite("cam.jpg", np.uint8(255 * cam))
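
Putting the pieces together, the repository's demo drives these functions roughly as follows (a sketch: `target_layer_names=["35"]` selects VGG19's last conv-stage activation as in the repo's example, and the preprocessing here is simplified relative to the repo's ImageNet normalization):

import cv2
import numpy as np
import torch
from torchvision import models

grad_cam = GradCam(model=models.vgg19(pretrained=True),
                   target_layer_names=["35"], use_cuda=False)

img = np.float32(cv2.resize(cv2.imread("examples/both.png", 1), (224, 224))) / 255
input = torch.from_numpy(img.transpose(2, 0, 1)[None]).requires_grad_(True)

mask = grad_cam(input, index=None)   # index=None -> explain the top-scoring class
show_cam_on_image(img, mask)         # writes cam.jpg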

5.2 GuidedBackpropReLUModel

gb: (C, H, W), unbounded values

class GuidedBackpropReLU(Function):
    # follows the original repo's legacy autograd Function style,
    # where a Function instance is itself callable

    def forward(self, input):
        # standard ReLU forward: keep positive inputs, zero the rest
        positive_mask = (input > 0).type_as(input)
        output = torch.addcmul(torch.zeros(input.size()).type_as(input), input, positive_mask)
        self.save_for_backward(input, output)
        return output

    def backward(self, grad_output):
        input, output = self.saved_tensors
        grad_input = None

        # guided backprop: mask by both the forward activation (> 0)
        # and the incoming gradient (> 0)
        positive_mask_1 = (input > 0).type_as(grad_output)
        positive_mask_2 = (grad_output > 0).type_as(grad_output)
        grad_input = torch.addcmul(torch.zeros(input.size()).type_as(input),
                                   torch.addcmul(torch.zeros(input.size()).type_as(input),
                                                 grad_output, positive_mask_1),
                                   positive_mask_2)

        return grad_input


class GuidedBackpropReLUModel:
    def __init__(self, model, use_cuda):
        self.model = model
        self.model.eval()
        self.cuda = use_cuda
        if self.cuda:
            self.model = model.cuda()

        # replace every ReLU in the feature extractor with GuidedBackpropReLU
        for idx, module in self.model.features._modules.items():
            if module.__class__.__name__ == 'ReLU':
                self.model.features._modules[idx] = GuidedBackpropReLU()

    def forward(self, input):
        return self.model(input)

    def __call__(self, input, index=None):
        if self.cuda:
            output = self.forward(input.cuda())
        else:
            output = self.forward(input)

        if index is None:
            index = np.argmax(output.cpu().data.numpy())

        one_hot = np.zeros((1, output.size()[-1]), dtype=np.float32)
        one_hot[0][index] = 1
        # (tested: requires_grad=False would also work here)
        one_hot = Variable(torch.from_numpy(one_hot), requires_grad=True)
        if self.cuda:
            one_hot = torch.sum(one_hot.cuda() * output)
        else:
            one_hot = torch.sum(one_hot * output)

        # self.model.features.zero_grad()
        # self.model.classifier.zero_grad()
        one_hot.backward(retain_graph=True)

        # the explanation is the gradient with respect to the input image
        output = input.grad.cpu().data.numpy()
        output = output[0, :, :, :]

        return output

5.3 Guided Grad-CAM

# broadcast the (H, W) Grad-CAM mask across the gb channels,
# then take the element-wise product
cam_mask = np.zeros(gb.shape)
for i in range(0, gb.shape[0]):
    cam_mask[i, :, :] = mask

cam_gb = np.multiply(cam_mask, gb)
utils.save_image(torch.from_numpy(cam_gb), 'cam_gb.jpg')

6. Results

Figure: original image.

Figure: Grad-CAM.

Figure: Guided-Backpropagation.

Figure: Guided Grad-CAM.