
BAM: Bottleneck Attention Module

Jongchan Park*†1

Sanghyun Woo*2

Joon-Young Lee3

In So Kweon2

1 Lunit Inc., Seoul, Korea

2 School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea

3 Adobe Research, San Jose, CA, USA

1 Implementation Details

In order to perform fair comparisons, we have created our benchmark platform in the PyTorch framework [1] based on the open-sourced projects [1, 2, 5, 7, 8, 10, 12, 13]. Our unified framework has allowed us to simply plug in our module (BAM) while keeping all other settings the same. All the networks are trained using stochastic gradient descent. On CIFAR, we train for 300 epochs. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet, we train models for 90 epochs. The learning rate is initially set to 0.1 and is divided by 10 at epochs 30 and 60. On the MS COCO detection dataset, we take our ImageNet-pretrained models and train for 490K iterations. The initial learning rate is set to 0.001 and is divided by 10 at 350K iterations. We use a weight decay of 10^-4 and a Nesterov momentum [11] of 0.9 without dampening. Throughout the experiments, we used a fixed random seed.
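The training recipe above corresponds to standard PyTorch components. Below is a minimal sketch under that assumption, not the authors' released code; the model is a stand-in placeholder, and only the ImageNet schedule (milestones at epochs 30 and 60 over 90 epochs) is shown:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 10)  # placeholder; in practice, a network with BAM plugged in

# SGD with Nesterov momentum 0.9, weight decay 1e-4, and no dampening (Sec. 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=1e-4, dampening=0, nesterov=True)

# ImageNet schedule: initial LR 0.1, divided by 10 at epochs 30 and 60, 90 epochs total.
# (On CIFAR the milestones would sit at 50% and 75% of 300 epochs, i.e. 150 and 225.)
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

torch.manual_seed(0)  # a fixed random seed is used throughout the experiments

for epoch in range(90):
    # ... one training epoch over the dataset would run here ...
    scheduler.step()  # decay the learning rate at the scheduled epochs
```

The MS COCO detection schedule in the text is iteration-based (0.001, divided by 10 at 350K of 490K iterations) and would use an iteration-level decay instead of the epoch-level scheduler above.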

2 The effectiveness of BAM

In Fig. 1, we visualize our attention maps and compare them with the baseline feature maps for a thorough analysis of the accuracy improvement. We compare two models trained on ImageNet-1K: ResNet50 and ResNet50 + BAM. We select three examples that the baseline model fails to classify correctly while the model with BAM succeeds. We gather all the 3D attention maps at the bottlenecks and examine their distributions with respect to the channel and spatial axes respectively. For visualizing the 2D spatial attention maps, we averaged the attention maps over the channel axis and resized them. All the 2D maps are normalized according to the global statistics at each stage, computed over the whole ImageNet-1K training set. For visualizing the channel attention profiles, we averaged our attention map over the spatial axis and uniformly sampled 200 channels, similar to [3].
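The visualization procedure can be sketched as follows, assuming each BAM attention map is available as a (C, H, W) tensor; the function names and the per-stage normalization statistics are illustrative placeholders rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def spatial_attention_view(att_3d, out_size, stage_mean, stage_std):
    """Average a (C, H, W) attention map over the channel axis, resize it to the
    input resolution, and normalize it with global per-stage statistics
    (placeholders for values computed over the ImageNet-1K training set)."""
    att_2d = att_3d.mean(dim=0, keepdim=True).unsqueeze(0)   # shape (1, 1, H, W)
    att_2d = F.interpolate(att_2d, size=out_size, mode='bilinear', align_corners=False)
    return (att_2d - stage_mean) / stage_std

def channel_attention_profile(att_3d, num_samples=200):
    """Average over the spatial axes and uniformly sample channels for plotting."""
    profile = att_3d.mean(dim=(1, 2))                         # shape (C,)
    idx = torch.linspace(0, profile.numel() - 1, num_samples).long()
    return profile[idx]

# Dummy example with a stage-1-sized map (256 channels, 56x56 spatial resolution)
att = torch.rand(256, 56, 56)
spatial_map = spatial_attention_view(att, out_size=(224, 224), stage_mean=0.5, stage_std=0.1)
profile = channel_attention_profile(att)
```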

© 2018. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

*Both authors have equally contributed. †The work was done while the author was at KAIST.


[Figure 1 panels: for each example image ("baseball", "macaque", "indigo bird"), the input image, the stage-wise feature maps of the baseline network, and, for the baseline network + BAM, the features before BAM, the BAM attention maps, and the features after BAM at stages 1–3, together with channel-wise attention profiles (attention value vs. channel index) for BAM 1–3. The baseline's wrong predictions are "acorn", "proboscis monkey", and "jay bird"; the model with BAM predicts "baseball", "macaque", and "indigo bird" correctly.]

Figure 1: Visualizing the attention process of BAM. In order to provide an intuitive understanding of BAM's role, we visualize the image classification process using images that the baseline (ResNet50) fails to classify correctly while the model with BAM succeeds. Using the models trained on ImageNet-1K, we gather all the 3D attention maps from each bottleneck and examine their distributions spatially and channel-wise. We can clearly observe that the BAM module successfully drives the network to focus on the target while the baseline model fails.

As shown in Fig. 1, we can observe that the BAM module drives the network to focus on the target gradually, while the baseline model shows more scattered feature activations. Note that accurate targeting is important for fine-grained classification, as the incorrect answers of the baseline are reasonable errors. At the first stage, we observe high variance along the channel axis and enhanced 2D feature maps after BAM. Since the theoretical receptive field size at the first bottleneck is 35, compared to the input image size of 224, the features contain only local information of the input. Therefore, the filters of the attention map at this stage act as a local feature denoiser. We can infer that both channel and spatial attention contribute together to selectively refine local features, learning what ('channel') and where ('spatial') to focus or suppress. The second stage shows an intermediate characteristic of the first and final stages. At the final stage, the module generates binary-like 2D attention maps focusing on the target object. In terms of channels, the attention profile shows a few spikes with low variance. We conjecture that this is because there is enough information about 'what' to focus on at this stage. Even though it is noisy, note that the features before applying the module show high activations around the target, indicating that the network already has a strong clue about what to focus on. By comparing the features of the baseline and the features before/after BAM, we verify that BAM accurately focuses on the target object while the baseline features are still scattered. The visualization of the overall attention process demonstrates the efficacy of BAM, which refines the features using two complementary attentions jointly to focus on more meaningful information. Moreover, the stage-by-stage gradual focusing resembles a hierarchical human perception process [4, 6, 9], suggesting that BAM drives the network to mimic the human visual system effectively.
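As a small sanity check of the theoretical receptive field of 35 quoted above, the script below accumulates kernel sizes and strides over the layers we assume precede the first BAM placement in ResNet50: the 7x7 stem convolution (stride 2), the 3x3 max-pooling (stride 2), and the three 3x3 convolutions of the first bottleneck stage. This layer list is our reconstruction, not part of the original text:

```python
# (kernel_size, stride) of each layer that affects the spatial receptive field;
# 1x1 convolutions are omitted since they do not enlarge it.
layers = [
    (7, 2),                    # ResNet50 stem: 7x7 conv, stride 2
    (3, 2),                    # 3x3 max-pooling, stride 2
    (3, 1), (3, 1), (3, 1),    # 3x3 convs of the three stage-1 bottleneck blocks
]

rf, jump = 1, 1                # receptive field and cumulative stride at the input
for k, s in layers:
    rf += (k - 1) * jump       # each layer widens the field by (k-1) input-pixel jumps
    jump *= s

print(rf)                      # -> 35, matching the value quoted in the text
```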

3 Additional Visualization Results

We show more visualization results of the attention process. All the results in this section are produced from the ResNet50 baseline (with BAM) tested on the ImageNet validation set. In Sec. 3.1, correctly classified examples with BAM are listed with their intermediate activations and attention maps. We have selected examples where the baseline with BAM succeeds and the baseline fails. On the contrary, in Sec. 3.2, examples are selected where the baseline with BAM fails and the baseline succeeds. Figures are best viewed in color.



3.1 Successful Cases with BAM

[Figure 2 panels: for each example ("toy poodle", "titi monkey", "German shepherd", "giant schnauzer", "killer whale", "spider monkey", "Shetland sheepdog", "Rhodesian ridgeback", "wood rabbit", "American egret"), the input image and, at stages 1–3 of the baseline network + BAM, the features before BAM, the BAM attention maps, and the features after BAM; in every case the prediction matches the ground-truth label.]

Figure 2: Successful cases with BAM. The shown examples are the intermediate activations and BAM attention maps when the baseline+BAM succeeds and the baseline fails. Figure best viewed in color.


[Figure 3 panels: further successful cases ("German shepherd", "Bouvier des Flandres", "Scottish deerhound", "thunder snake", "komondor", "black stork", "ibex", "spider monkey", "cairn", "silky terrier"), laid out as in Figure 2; in every case the prediction of the baseline network + BAM matches the ground-truth label.]

Figure 3: Successful cases with BAM. The shown examples are the intermediate activations and BAM attention maps when the baseline+BAM succeeds and the baseline fails. Figure best viewed in color.


3.2 Failure Cases with BAM

[Figure 4 panels: failure cases laid out as in Figure 2. The ground-truth labels and the (incorrect) predictions of the baseline network + BAM are: "king crab" → "corn", "wooden spoon" → "ladle", "slide rule" → "ruler", "wall clock" → "bird house", "organ" → "tile roof", "stopwatch" → "digital watch", "beer bottle" → "soda bottle", "cash machine" → "screen", "desktop computer" → "monitor", "restaurant" → "goblet".]

Figure 4: Failure cases with BAM. The shown examples are the intermediate activations and BAM attention maps when the baseline+BAM fails and the baseline succeeds. Figure best viewed in color.


References

[1] PyTorch. http://pytorch.org/. Accessed: 2018-04-20.

[2] Xinlei Chen and Abhinav Gupta. An implementation of faster RCNN with study for region sampling. arXiv preprint arXiv:1702.02138, 2017.

[3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.

[4] David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 148(3):574–591, 1959.

[5] Jason Kuen. WideResNet PyTorch implementation. https://github.com/xternalz/WideResNet-pytorch.git, 2017.

[6] David Marr and A Vision. A computational investigation into the human representation and processing of visual information. WH San Francisco: Freeman and Company, 1(2), 1982.

[7] marvis. MobileNet PyTorch implementation. https://github.com/marvis/pytorch-mobilenet, 2017.

[8] PyTorch. torchvision. https://github.com/pytorch/vision, 2017.

[9] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.

[10] Pau Rodriguez. ResNeXt PyTorch implementation. https://github.com/prlz77/ResNeXt.pytorch, 2017.

[11] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.

[12] Andreas Veit. DenseNet PyTorch implementation. https://github.com/andreasveit/densenet-pytorch.git, 2017.

[13] Wei Yang. PreResNet PyTorch implementation. https://github.com/bearpaw/pytorch-classification, 2017.

