【Paper】 Convolutional Block Attention Module - Paper Summary

CBAM : Convolutional Block Attention Module

0. Abstract

  1. our module sequentially infers attention maps along two separate dimensions, channel and spatial
  2. the attention maps are multiplied with the input feature map for adaptive feature refinement (adaptively refined features)
  3. a lightweight and general module; it can be integrated into any CNN architecture

1. Introduction

  1. Recent network architectures
    1. recent research has mainly investigated depth, width (#channels), and cardinality (the number of identically shaped building blocks).
    2. VGGNet, ResNet, and GoogLeNet have become deeper for rich representation (stronger expression of important features).
    3. GoogLeNet and Wide ResNet (2016) showed that width is another important factor.
    4. Xception and ResNeXt increase the cardinality of a network; cardinality saves the total number of parameters and yields more powerful results than simply increasing depth or width.
    5. significant visual attention papers
      • [16] A recurrent neural network for image generation
      • [17] Spatial transformer networks
  2. Emphasize meaningful features along the two principal dimensions, the channel (depth) axis and the spatial (x, y) axes -> channel and spatial attention modules -> learning which information to emphasize or suppress.
    • (figure: the channel attention and spatial attention sub-modules applied in sequence)
  3. Contribution
    1. Can be widely applied to boost the representation power of CNNs
    2. Extensive ablation studies
    3. Performance of various networks is greatly improved

2. Related Work

  1. Network engineering

    1. ResNet / ResNeXt / Inception-ResNet
    2. WideResNet : a larger number of convolutional filters and reduced depth
      • (figure: WideResNet block)
    3. PyramidNet : a strict generalization of WideResNet in which the width increases gradually with depth.
      • (figure: PyramidNet block)
    4. ResNeXt : uses grouped convolutions and verifies the effect of cardinality (see the snippet right after this list).
      • (figure: ResNeXt block with grouped convolutions)
    5. DenseNet : concatenates the input features with the output features
      • (figure: DenseNet dense connections)
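To make cardinality concrete, here is a minimal PyTorch illustration of a grouped convolution as used by ResNeXt; the channel counts and group number are my own example values, not taken from the paper:

```python
import torch.nn as nn

# ResNeXt-style grouped convolution: groups=32 corresponds to cardinality 32.
# Each group convolves only its own 256/32 = 8 input channels, so this layer
# has ~32x fewer parameters than a dense 3x3 convolution over all 256 channels.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=32)
```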
  2. Attention

    • (2017) Residual attention network for image classification
      • (figure: Residual Attention Network)
      • encoder-decoder style attention module
      • by refining the feature maps, performance is good and the network is robust to noisy inputs
      • but it requires more computation and parameters
    • (2017) Squeeze-and-excitation networks
      • (figure: Squeeze-and-Excitation block)
      • Exploit the inter-channel relationship
      • uses global average-pooled features to compute channel-wise attention ('what' to focus on) -> CBAM suggests using max-pooled features as well.
      • it misses the spatial attention that decides 'where' to focus.
    • (2019) Spatial and channel-wise attention in convolutional networks for image captioning

3. Convolutional Block Attention Module

(figure: CBAM overview, the channel attention module followed by the spatial attention module)

  • ⊗ : element-wise multiplication
  • channel attention values are broadcasted along the spatial dimension
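  • overall process (Section 3 of the paper): F' = M_c(F) ⊗ F, then F'' = M_s(F') ⊗ F', where M_c and M_s are the channel and spatial attention maps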

  • Channel attention module

    • Previous approaches make the model learn the extent of the target object [33] or compute spatial statistics [28].
    • exploits the inter-channel relationship of features.
    • each channel is considered as a feature detector.
    • ‘what’ is meaningful
    • average pooling and max-pooling -> two different spatial context descriptors (F^c_avg and F^c_max)
    • M_c(F) = σ( MLP(AvgPool(F)) + MLP(MaxPool(F)) ) = σ( W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)) )
    • W_0 and W_1 are shared for both inputs, and W_0 is followed by the ReLU activation; r = the reduction ratio of the MLP hidden layer.
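A minimal PyTorch sketch of the channel attention module as described above (class and variable names are my own, not the authors' code):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c: a shared MLP applied to average- and max-pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP: W_0 (reduce by r) -> ReLU -> W_1 (restore), used for both inputs
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # MLP(F^c_avg)
        mx = self.mlp(x.amax(dim=(2, 3)))    # MLP(F^c_max)
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale                     # broadcast along the spatial dimension
```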
  • Spatial attention module

    • The design philosophy is symmetric with the channel attention branch.
    • [34] (Paying More Attention to Attention): pooling along the channel axis can be effective in highlighting informative regions.
    • the two pooled maps are concatenated into one feature (both descriptors)
    • M_s(F) = σ( f^{7×7}([AvgPool(F); MaxPool(F)]) ) = σ( f^{7×7}([F^s_avg; F^s_max]) ) : encodes where to emphasize or suppress
    • σ : the sigmoid function; f^{7×7} : a convolution with a 7 × 7 filter size
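A matching PyTorch sketch of the spatial attention module (again my own naming; a sketch, not the reference implementation):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """M_s: a 7x7 convolution over the concatenated channel-pooled maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2 input channels: the channel-wise average map and max map
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                           # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)           # F^s_avg: (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)            # F^s_max: (B, 1, H, W)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                            # broadcast along the channel dimension
```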
  • Arrangement of attention modules

    • the sequential arrangement gives a better result than a parallel arrangement (as in the first figure: the two modules in series, not side by side in parallel)
    • this is shown by our experimental results.
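Putting the two sketches above together in the sequential (channel-first) arrangement; a sketch of the module, not the authors' released code:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sequential arrangement: channel attention first, then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)  # sketch above
        self.spatial = SpatialAttention(kernel_size)          # sketch above

    def forward(self, x):
        x = self.channel(x)   # F'  = M_c(F)  ⊗ F
        x = self.spatial(x)   # F'' = M_s(F') ⊗ F'
        return x

# quick shape check: the refined feature map keeps the input shape
f = torch.randn(2, 64, 32, 32)
assert CBAM(64)(f).shape == f.shape
```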

4. Experiments

  • We apply CBAM on the convolution outputs in each block (see the block sketch after the ablation list below)
  • (figure: CBAM placed inside each block)
  • top-5 error, top-1 error : lower is better
  • FLOPs = FLoating point OPerations, a count of the arithmetic operations a model performs
    GFLOPs = giga (10^9) FLOPs
    (roughly, how much computation the model requires; not to be confused with FLOPS, operations per second, a hardware-throughput measure)
  • we empirically show the effectiveness of our design choice.
    • Channel attention
      • a shared MLP
      • using both poolings
      • r (the reduction ratio) is set to 16.
    • Spatial attention
      • channel pooling (average- and max-pooling along the channel axis)
      • convolution layer with a large kernel size
    • Arrangement : sequential, channel attention first, works best
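As noted at the top of this section, CBAM is applied on the convolution outputs in each block. A sketch of that placement inside a simplified residual block (the BasicBlock structure here is my own illustration, not the paper's exact code):

```python
import torch.nn as nn

class BasicBlockWithCBAM(nn.Module):
    """Simplified ResNet BasicBlock (equal in/out channels, stride 1);
    CBAM refines the conv output before the identity shortcut is added."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.cbam = CBAM(channels)  # from the sketch in Section 3

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.cbam(out)        # attention applied to the conv output
        return self.relu(out + x)   # then the residual addition
```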

4.2 Image Classification on ImageNet-1K

4.3 Network Visualization with Grad-CAM

  • Grad-CAM is a recently proposed visualization method.
  • Grad-CAM uses gradients in order to calculate the importance of the spatial locations.
  • (figure: Grad-CAM visualizations comparing baseline and CBAM networks)
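A minimal sketch of the Grad-CAM computation described above (simplified to a single image; `target_layer` is whichever convolutional layer you want to visualize):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, target_layer):
    """Weight each feature map of target_layer by the spatially averaged
    gradient of the class score, sum the weighted maps, then apply ReLU."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image.unsqueeze(0))[0, target_class]  # image: (C, H, W)
    model.zero_grad()
    score.backward()                                    # gradients of the class score
    h1.remove(); h2.remove()

    A, dA = acts[0][0], grads[0][0]        # activations and gradients: (C, H, W)
    weights = dA.mean(dim=(1, 2))          # per-channel importance
    cam = F.relu((weights[:, None, None] * A).sum(dim=0))
    return cam / (cam.max() + 1e-8)        # (H, W) heatmap, normalized to [0, 1]
```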

5. Conclusion

  • we observed that our module induces the network to focus on the target object properly.
  • We hope CBAM become an important component of various network architectures.

Things I don't know yet


© All rights reserved by Junha Song.