【Paper】 Convolutional Block Attention Module - Paper Summary
CBAM : Convolutional Block Attention Module
0. Abstract
- our module sequentially infers attention maps along two separate dimensions (channel and spatial)
- the attention maps are multiplied with the input feature map for adaptive feature refinement
- a lightweight and general module; it can be integrated into any CNN architecture
1. Introduction
- Recent network architectures
- recent research has mainly investigated depth, width (#channels), and cardinality (the number of building blocks of the same form)
- VGGNet, ResNet, and GoogLeNet have become deeper for rich representation (expressive power for important features)
- GoogLeNet and Wide ResNet (2016) show that width is another important factor
- Xception and ResNeXt increase the cardinality of a network; cardinality saves the total number of parameters while yielding stronger results than simply increasing depth or width
- significant visual attention papers
- [16] A recurrent neural network for image generation
- [17] Spatial transformer networks
- emphasize meaningful features along the two principal dimensions, the channel (depth) axis and the spatial (x, y) axes -> channel and spatial attention modules -> learn which information to emphasize or suppress
- Contribution
- Can be widely applied to boost representation power of CNNs
- Extensive ablation studies
- Performance of various networks is greatly improved
2. Related Work
Network engineering
- ResNet / ResNeXt / Inception-ResNet
- WideResNet : a larger number of convolutional filters and reduced depth
- PyramidNet : a strict generalization of WideResNet where the width increases gradually
- ResNeXt : uses grouped convolutions and verifies the effect of cardinality
- DenseNet : Concatenates the input features with the output features
Attention
- (2017) Residual attention network for image classification
- encoder-decoder style attention module
- by refining the feature maps, performance improves and the network becomes robust to noisy inputs
- but it costs more computation and parameters
- (2017) Squeeze-and-excitation networks
- Exploit the inter-channel relationship
- uses global average-pooled features to compute channel-wise attention ('what' to focus on) -> CBAM suggests using max-pooled features as well
- misses spatial attention, which decides 'where' to focus
- (2019) Spatial and channel-wise attention in convolutional networks for image captioning
3. Convolutional Block Attention Module
- ⊗ : element-wise multiplication
- channel attention values are broadcasted along the spatial dimension
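- the overall refinement (Eq. 1 of the paper): for an input feature map F, F' = M_c(F) ⊗ F, then F'' = M_s(F') ⊗ F', where M_c is the C×1×1 channel attention map and M_s is the 1×H×W spatial attention map; ⊗ broadcasts the attention values before element-wise multiplication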
Channel attention module
- in the past, approaches made the model learn the extent of the target object or computed spatial statistics [33][28]
- exploiting the inter-channel relationship of features.
- each channel is considered as a feature detector
- ‘what’ is meaningful
- average-pooling and max-pooling -> two different spatial context descriptors (F^c_avg and F^c_max)
- the MLP weights W0 and W1 are shared for both inputs, and a ReLU activation follows W0; r = reduction ratio (see the sketch below)
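From the paper, the channel attention is M_c(F) = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max))). A minimal PyTorch sketch of this module (class and variable names are mine, not from the authors' code):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), with a shared MLP."""
    def __init__(self, channels, r=16):
        super().__init__()
        # shared two-layer MLP (W1 ∘ ReLU ∘ W0), implemented with 1x1 convs
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # F^c_avg path
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # F^c_max path
        return torch.sigmoid(avg + mx)  # shape (B, C, 1, 1), broadcast over H×W
```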
Spatial attention module
- The design philosophy is symmetric with the channel attention branch.
- [34] Paying more attention to attention : pooling along the channel axis can be effective in highlighting informative regions
- the concatenated feature (both descriptors) encodes where to emphasize or suppress
- σ : the sigmoid function; the convolution filter size is 7 × 7 (see the sketch below)
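From the paper, M_s(F) = σ(f^{7×7}([F^s_avg ; F^s_max])), where f^{7×7} is a convolution with a 7 × 7 filter. A minimal PyTorch sketch (names are mine), continuing from the channel attention sketch above:

```python
class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(f7x7([AvgPool(F); MaxPool(F)])), pooled along the channel axis."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)   # F^s_avg: (B, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)  # F^s_max: (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)
```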
Arrangement of attention modules
- the sequential arrangement gives a better result than a parallel arrangement (as in the figure: the two modules in series, not side by side in parallel)
- this is the authors' experimental result; the channel-first order works best (see the sketch below)
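Putting the two sketches together in the sequential, channel-first order the paper found best (again a sketch under my naming, not the authors' code):

```python
class CBAM(nn.Module):
    """Sequential arrangement: channel attention first, then spatial attention."""
    def __init__(self, channels, r=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = self.ca(x) * x  # F'  = M_c(F) ⊗ F
        x = self.sa(x) * x  # F'' = M_s(F') ⊗ F'
        return x
```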
4. Experiments
- We apply CBAM on the convolution outputs in each block (see the integration sketch after this list)
- top-5 error, top-1 error : lower is better
- FLOPs = floating-point operations, the total operation count of a forward pass; GFLOPs = 10^9 FLOPs (a measure of how much computation, i.e. GPU work, a model needs)
- we empirically show the effectiveness of our design choice.
- Channel attention
- a shared MLP
- using both poolings
- r (the reduction ratio) is set to 16
- Spatial attention
- channel pooling
- convolution layer with a large kernel size
- Arrangement
- the sequential, channel-first order gives the best result
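The paper applies CBAM on the convolution outputs in each block; for a ResNet-style block that means refining the residual branch before the skip connection is added. A sketch of that wiring, reusing the CBAM class above (the exact layout of this basic 3×3 block is my assumption):

```python
class BasicBlockCBAM(nn.Module):
    """ResNet-style basic block with CBAM refining the conv output before the add."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.cbam = CBAM(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.cbam(out)       # refine the block's conv output
        return self.relu(out + x)  # residual connection after refinement
```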
4.2 Image Classification on ImageNet-1K
4.3 Network Visualization with Grad-CAM
- Grad-CAM is a recently proposed visualization method.
- Grad-CAM uses gradients in order to calculate the importance of the spatial locations.
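- for reference, Grad-CAM (from [18], not specific to CBAM) weights each feature map A^k by the global-average-pooled gradient of the class score y^c, α^c_k = (1/Z) Σ_i Σ_j ∂y^c/∂A^k_ij, and forms the localization map L^c = ReLU(Σ_k α^c_k A^k)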
5. Conclusion
- we observed that our module induces the network to focus on the target object properly
- We hope CBAM become an important component of various network architectures.
Unfamiliar concepts
- Grad-CAM [18] : a method to visualize where a trained model looks when making a prediction
- Top-down semantic aggregation for accurate one shot detection [30]
- GFLOPs / Top-1 Error / Top-5 Error