residual_attention

0. Preface

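paper: SenseTime, CVPR 2017, Residual Attention Network for Image Classification
code: Caffe, Netscope (a visualization tool for Caffe networks), PyTorch
paper: ECCV 2018, CBAM: Convolutional Block Attention Module
code: PyTorch
paper: GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond
code: PyTorch
This paper is also about attention maps and mainly targets classification. To be honest, I did not really grasp what is novel about it, probably because I do not know how attention-map research has developed or how far earlier work had already gone. Judging from the result figures, it looks somewhat weaker than the Dual Attention paper. Still, anything accepted at CVPR must have real strengths; I am just not yet at the level to see them.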

1. Residual Attention Network

Each Attention Module is split into two branches: a mask branch and a trunk branch.

$H_{i,c}(x)=(1+M_{i,c}(x))*F_{i,c}(x)$

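where $c$ indexes the channel, $i$ ranges over all spatial positions, $F_{i,c}(x)$ is the trunk branch output, and $M_{i,c}(x)\in[0,1]$ is the mask branch output.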
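
As a quick illustration of the formula above, here is a minimal PyTorch sketch of attention residual learning. The function name `attention_residual` and the tensor shapes are my own assumptions, and the mask is assumed to have already gone through a sigmoid so that it lies in [0, 1].

```python
import torch

def attention_residual(trunk_out, mask):
    """H = (1 + M) * F, applied element-wise over positions and channels."""
    # mask is expected to lie in [0, 1], e.g. the output of a sigmoid
    return (1 + mask) * trunk_out

trunk = torch.randn(2, 8, 4, 4)                 # trunk branch output, shape (N, C, H, W)
mask = torch.sigmoid(torch.randn(2, 8, 4, 4))   # soft mask with the same shape
out = attention_residual(trunk, mask)
```

With M = 0 the module returns the trunk output unchanged, and with M = 1 it doubles it, so the mask can emphasize features but never zeroes out the trunk branch.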

1.1 Spatial Attention and Channel Attention

Mixed attention:

$f_1(x_{i,c})=\frac{1}{1+\exp (-x_{i,c})}$

Channel Attention:

$f_2(x_{i,c})=\frac{x_{i,c}}{\lVert x_i \rVert}$

Spatial Attention:

$f_3(x_{i,c})=\frac{1}{1+\exp (-(x_{i,c}-\text{mean}_c)/\text{std}_c)}$
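
The three activations can be sketched in PyTorch as below. This assumes an input of shape (N, C, H, W), that the norm in $f_2$ is the L2 norm over channels at each position, and that $mean_c$ and $std_c$ are computed over the spatial positions of each channel's feature map. The function names and the small `eps` added to avoid division by zero are my own additions, not part of the formulas.

```python
import torch

def mixed_attention(x):
    # f1: plain sigmoid at every position and channel
    return torch.sigmoid(x)

def channel_attention(x, eps=1e-12):
    # f2: L2-normalize across channels (dim=1) at each spatial position
    norm = x.pow(2).sum(dim=1, keepdim=True).sqrt()
    return x / (norm + eps)

def spatial_attention(x, eps=1e-12):
    # f3: standardize each channel's feature map over H and W, then sigmoid
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True)
    return torch.sigmoid((x - mean) / (std + eps))

x = torch.randn(2, 3, 4, 4)
m1, m2, m3 = mixed_attention(x), channel_attention(x), spatial_attention(x)
```

Note that only $f_1$ and $f_3$ are bounded to (0, 1); $f_2$ yields a unit-length channel vector at each position and can contain negative values.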

2. code

The code is rather convoluted, and I have not understood it yet.

3. CBAM

4. code

Simple and clear.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, in_planes, ratio=16):
        super(ChannelAttention, self).__init__()
        # Squeeze each channel to a single value in two ways: average and max.
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        # Shared two-layer MLP written as 1x1 convolutions; `ratio` sets the bottleneck width.
        self.fc1   = nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False)
        self.relu1 = nn.ReLU()
        self.fc2   = nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False)

        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = self.fc2(self.relu1(self.fc1(self.avg_pool(x))))
        max_out = self.fc2(self.relu1(self.fc1(self.max_pool(x))))
        out = avg_out + max_out
        # Per-channel weights in (0, 1), shape (N, C, 1, 1)
        return self.sigmoid(out)

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)

        self.ca = ChannelAttention(planes * 4)
        self.sa = SpatialAttention()

        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        # CBAM: reweight channels first, then spatial positions, before the residual add
        out = self.ca(out) * out
        out = self.sa(out) * out

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out
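
`Bottleneck` above instantiates `SpatialAttention`, which these notes do not show. A minimal sketch that follows the CBAM paper's description (channel-wise average and max pooling, concatenation, one convolution, sigmoid) might look like the following. The `kernel_size` argument and the layer names are assumptions, not necessarily the exact code of the referenced repository.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # Odd kernel with "same" padding so the map keeps the input's H and W.
        assert kernel_size % 2 == 1, "kernel_size is assumed to be odd"
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Pool along the channel axis to get two (N, 1, H, W) descriptors.
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)
        # Stack them and let one convolution produce the spatial map.
        attn = self.conv(torch.cat([avg_out, max_out], dim=1))
        return self.sigmoid(attn)

sa = SpatialAttention()
feat = torch.randn(2, 16, 8, 8)
spatial_map = sa(feat)          # shape (2, 1, 8, 8)
refined = spatial_map * feat    # broadcasts over the channel axis
```

Because the map has a single channel, `self.sa(out) * out` in `Bottleneck` rescales every channel at a given position by the same weight.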

5. GCNet

It combines ideas from the SE block and the NL (non-local) block.

The code is rather convoluted, so I am skipping it for now.
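
Setting the official code aside, the block as described in the paper can be sketched in a few lines: a simplified non-local step pools the feature map into one global context vector with softmax attention, an SE-style bottleneck transforms it, and the result is added back to every position. This is a simplified reading of the paper, not the official implementation; the class name `GlobalContextBlock` and the default `ratio=16` are assumptions.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    def __init__(self, channels, ratio=16):
        super().__init__()
        hidden = max(channels // ratio, 1)
        # Context modeling: one attention logit per spatial position.
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)
        # Transform: SE-style bottleneck with LayerNorm in the middle.
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        weights = torch.softmax(self.attn(x).reshape(n, 1, h * w), dim=2)  # (N, 1, HW)
        feats = x.reshape(n, c, h * w)                                     # (N, C, HW)
        context = torch.bmm(feats, weights.transpose(1, 2))                # (N, C, 1)
        context = context.reshape(n, c, 1, 1)
        # Fusion: the same transformed context is added at every position.
        return x + self.transform(context)

gc_block = GlobalContextBlock(32)
gc_in = torch.randn(2, 32, 5, 5)
gc_out = gc_block(gc_in)
```

Unlike a full non-local block, the attention weights here do not depend on the query position, so one context vector is shared by all positions.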

6. Optimizer

https://zhuanlan.zhihu.com/p/64882877?utm_source=wechat_session&utm_medium=social&utm_oi=589386839161311232

code

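The choice of optimizer can also have a large effect on the result. For now I am only recording common parameter settings for SGD and Adam.

Vanilla SGD:

Learning rate too small (0.20): the optimum is not reached within a limited number of steps, and there is no oscillation at any point.
Learning rate too large (0.61): causes violent oscillation. Late-stage oscillation grows as the learning rate increases; at 0.52 it is already too large to converge.
Moderate learning rate (0.42): reaches the optimum, with oscillation early on and none later.
Overall, very sensitive to the learning rate.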
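
The three regimes above can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-dimensional quadratic f(x) = a·x²/2, which is my own choice and not the function behind the numbers quoted above, so the thresholds differ. On this function each step multiplies x by (1 - lr·a): the iterate shrinks monotonically when lr·a < 1, flips sign every step but still converges when 1 < lr·a < 2, and diverges when lr·a > 2.

```python
def gradient_descent(lr, a=1.0, x0=1.0, steps=50):
    """Minimize f(x) = a * x**2 / 2 with plain gradient descent; return the trajectory."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        grad = a * x          # f'(x) = a * x
        x = x - lr * grad     # each step multiplies x by (1 - lr * a)
        xs.append(x)
    return xs

# small, moderate-to-large, and too-large learning rates for a = 1
for lr in (0.1, 1.5, 2.1):
    traj = gradient_descent(lr)
    print(f"lr={lr}: final |x| = {abs(traj[-1]):.3e}")
```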

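SGD with momentum:

A moderate momentum (beta=0.42) damps the oscillation of a large learning rate (lr=0.61) and reaches the optimum.
A larger momentum (beta=0.81) introduces even more oscillation at the large learning rate (lr=0.61).
The usual practice is a large momentum (beta=0.9) with a small learning rate (lr=0.02 or 0.03), which accelerates toward the optimum fairly smoothly.
With lr=0.01, no choice of beta reaches the optimum: it either falls short or overshoots.
With beta=0.9 the result is still fairly sensitive to the learning rate: 0.01 does not reach the optimum, 0.02 stops just short of it, and 0.03 ends up just past it, which is a noticeable difference.
Common setting: beta=0.9, weight_decay=5e-4, nesterov=True, then tune lr.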

torch.optim.SGD(params, lr, momentum=0.9, weight_decay=5e-4, nesterov=True)
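
To make the role of `momentum` and `nesterov` concrete, here is a plain-Python sketch of the update rule in the form `torch.optim.SGD` uses (velocity v = beta·v + grad, then x = x - lr·v, with dampening 0 and no weight decay). It runs on the toy quadratic f(x) = a·x²/2, which is my own choice for illustration; the function name `sgd_momentum` is likewise hypothetical.

```python
def sgd_momentum(lr, beta, a=1.0, x0=1.0, steps=100, nesterov=False):
    """Minimize f(x) = a * x**2 / 2 with momentum SGD; return the trajectory."""
    x, v = x0, 0.0
    xs = [x]
    for _ in range(steps):
        grad = a * x
        v = beta * v + grad                        # velocity accumulates past gradients
        step = grad + beta * v if nesterov else v  # Nesterov looks ahead along the velocity
        x = x - lr * step
        xs.append(x)
    return xs

plain = sgd_momentum(lr=0.02, beta=0.0)      # beta=0 reduces to vanilla SGD
heavy = sgd_momentum(lr=0.02, beta=0.9)
nest = sgd_momentum(lr=0.02, beta=0.9, nesterov=True)
print(abs(plain[-1]), abs(heavy[-1]), abs(nest[-1]))
```

On this problem, with the same small learning rate, the momentum runs end much closer to the optimum after 100 steps than the run without momentum, which matches the "large momentum, small learning rate" practice noted above.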

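Adam:

Insensitive to the learning rate
Stable training
Ends up near the optimum
Parameter settings: beta1=0.9, beta2=0.999, eps=1e-8

Although Adam is insensitive to the initial learning rate and trains stably, the final accuracy it reaches is lower than that of a well hand-tuned SGD.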

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=1e-4)
# betas=(0.5, 0.999) is the more recommended setting
torch.optim.Adam(params, lr=0.001, betas=(0.5, 0.999), eps=1e-08, weight_decay=1e-4)
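
One intuition for why Adam is less sensitive to the learning rate shows up in a single update, sketched below in plain Python for a scalar parameter. It follows the standard Adam rule with bias correction and leaves out weight decay; the function name `adam_step` is hypothetical. On the very first step the update is lr·g/(|g| + eps), so its size is close to lr whatever the scale of the gradient.

```python
import math

def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# First step from x = 0 with gradients of very different scales
for g in (1e-3, 1.0, 1e3):
    x1, _, _ = adam_step(0.0, g, 0.0, 0.0, t=1)
    print(f"grad={g}: step size = {abs(x1):.6f}")
```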