
Published as a conference paper at ICLR 2020

ONCE-FOR-ALL: TRAIN ONE NETWORK AND SPECIALIZE IT FOR EFFICIENT DEPLOYMENT ON DIVERSE HARDWARE PLATFORMS

Han Cai1, Chuang Gan2, Tianzhe Wang1, Zhekai Zhang1, Song Han1

1 Massachusetts Institute of Technology, 2 MIT-IBM Watson AI Lab
{hancai, chuangg, songhan}@mit.edu

ABSTRACT

We address the challenging problem of efficient deep learning model deployment across many devices and diverse constraints, from general-purpose hardware to specialized accelerators. Conventional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally prohibitive (causing CO2 emission as much as 5 cars' lifetime (Strubell et al., 2019)) and thus unscalable. To reduce the cost, our key idea is to decouple model training from architecture search. To this end, we propose to train a once-for-all network (OFA) that supports diverse architectural settings (depth, width, kernel size, and resolution). Given a deployment scenario, we can then quickly get a specialized sub-network by selecting from the OFA network without additional training. To prevent interference between many sub-networks during training, we also propose a novel progressive shrinking algorithm, which can train a surprisingly large number of sub-networks (> 10^19) simultaneously. Extensive experiments on various hardware platforms (CPU, GPU, mCPU, mGPU, FPGA accelerator) show that OFA consistently outperforms SOTA NAS methods (up to 4.0% ImageNet top1 accuracy improvement over MobileNetV3) while reducing GPU hours and CO2 emission by orders of magnitude. In particular, OFA achieves a new SOTA 80.0% ImageNet top1 accuracy under the mobile setting (<600M FLOPs). Code and pre-trained models are released at https://github.com/mit-han-lab/once-for-all.

1 INTRODUCTION

Deep Neural Networks (DNNs) deliver state-of-the-art accuracy in many machine learning applications. However, the explosive growth in model size and computation cost gives rise to new challenges on how to efficiently deploy these deep learning models on diverse hardware platforms, since they have to meet different hardware efficiency constraints (e.g., latency, energy). For instance, one mobile application on App Store has to support a diverse range of hardware devices, from a high-end Samsung Note10 with a dedicated neural network accelerator to a 5-year-old Samsung S6 with a much slower processor. With different hardware resources (e.g., on-chip memory size, #arithmetic units), the optimal neural network architecture varies significantly. Even running on the same hardware, under different battery conditions or workloads, the best model architecture also differs a lot.

Given different hardware platforms and efficiency constraints (defined as deployment scenarios), researchers either design compact models specialized for mobile (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018) or accelerate the existing models by compression (Han et al., 2016; He et al., 2018) for efficient deployment. However, designing specialized DNNs for every scenario is engineer-expensive and computationally expensive, whether with human-based methods or NAS, since such methods need to repeat the network design process and retrain the designed network from scratch for each case. Their total cost grows linearly as the number of deployment scenarios increases, which will result in excessive energy consumption and CO2 emission (Strubell et al., 2019). This makes them unable to handle the vast amount of hardware devices (23.14 billion IoT devices till 2018¹) and highly dynamic deployment environments (different battery conditions, different latency requirements, etc.).



Figure 1: Left: a single once-for-all network is trained to support versatile architectural configurations including depth, width, kernel size, and resolution. Given a deployment scenario, a specialized sub-network is directly selected from the once-for-all network without training. Middle: this approach reduces the cost of specialized deep learning deployment from O(N) to O(1). Right: once-for-all network followed by model selection can derive many accuracy-latency trade-offs by training only once, compared to conventional methods that require repeated training.


This paper introduces a new solution to tackle this challenge – designing a once-for-all network that can be directly deployed under diverse architectural configurations, amortizing the training cost. The inference is performed by selecting only part of the once-for-all network. It flexibly supports different depths, widths, kernel sizes, and resolutions without retraining. A simple example of Once-for-All (OFA) is illustrated in Figure 1 (left). Specifically, we decouple the model training stage and the model specialization stage. In the model training stage, we focus on improving the accuracy of all sub-networks that are derived by selecting different parts of the once-for-all network. In the model specialization stage, we sample a subset of sub-networks to train an accuracy predictor and latency predictors. Given the target hardware and constraint, a predictor-guided architecture search (Liu et al., 2018) is conducted to get a specialized sub-network, and the cost is negligible. As such, we reduce the total cost of specialized neural network design from O(N) to O(1) (Figure 1, middle).

However, training the once-for-all network is a non-trivial task, since it requires joint optimization of the weights to maintain the accuracy of a large number of sub-networks (more than 10^19 in our experiments). It is computationally prohibitive to enumerate all sub-networks to get the exact gradient in each update step, while randomly sampling a few sub-networks in each step will lead to significant accuracy drops. The challenge is that different sub-networks are interfering with each other, making the training process of the whole once-for-all network inefficient. To address this challenge, we propose a progressive shrinking algorithm for training the once-for-all network. Instead of directly optimizing the once-for-all network from scratch, we propose to first train the largest neural network with maximum depth, width, and kernel size, then progressively fine-tune the once-for-all network to support smaller sub-networks that share weights with the larger ones. As such, it provides better initialization by selecting the most important weights of larger sub-networks, and the opportunity to distill smaller sub-networks, which greatly improves the training efficiency.

We extensively evaluated the effectiveness of OFA on ImageNet with many hardware platforms (CPU, GPU, mCPU, mGPU, FPGA accelerator) and efficiency constraints. Under all deployment scenarios, OFA consistently improves the ImageNet accuracy by a significant margin compared to SOTA hardware-aware NAS methods while saving the GPU hours, dollars, and CO2 emission by orders of magnitude. On the ImageNet mobile setting (less than 600M FLOPs), OFA achieves a new SOTA 80.0% top1 accuracy with 595M FLOPs. To the best of our knowledge, this is the first time that the SOTA ImageNet top1 accuracy reaches 80% under the mobile setting.

2 RELATED WORK

Efficient Deep Learning. Many efficient neural network architectures have been proposed to improve hardware efficiency, such as SqueezeNet (Iandola et al., 2016), MobileNets (Howard et al., 2017; Sandler et al., 2018), ShuffleNets (Ma et al., 2018; Zhang et al., 2018), etc. Orthogonal to architecting efficient neural networks, model compression (Han et al., 2016) is another very effective technique for efficient deep learning, including network pruning that removes redundant units (Han et al., 2015) or redundant channels (He et al., 2018; Liu et al., 2017), and quantization that reduces the bit width for the weights and activations (Han et al., 2016; Courbariaux et al., 2015; Zhu et al., 2017).

¹ https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/



Neural Architecture Search. Neural architecture search (NAS) focuses on automating the architecture design process (Zoph & Le, 2017; Zoph et al., 2018; Real et al., 2019; Cai et al., 2018a; Liu et al., 2019). Early NAS methods (Zoph et al., 2018; Real et al., 2019; Cai et al., 2018b) search for high-accuracy architectures without taking hardware efficiency into consideration. Therefore, the produced architectures (e.g., NASNet, AmoebaNet) are not efficient for inference. Recent hardware-aware NAS methods (Cai et al., 2019; Tan et al., 2019; Wu et al., 2019) directly incorporate the hardware feedback into architecture search. As a result, they are able to improve inference efficiency. However, given new inference hardware platforms, these methods need to repeat the architecture search process and retrain the model, leading to prohibitive GPU hours, dollars, and CO2 emission. They are not scalable to a large number of deployment scenarios. The individually trained models do not share any weight, leading to a large total model size and high downloading bandwidth.

Dynamic Neural Networks. To improve the efficiency of a given neural network, some work explored skipping part of the model based on the input image. For example, Wu et al. (2018); Liu & Deng (2018); Wang et al. (2018) learn a controller or gating modules to adaptively drop layers; Huang et al. (2018) introduce early-exit branches in the computation graph; Lin et al. (2017) adaptively prune channels based on the input feature map; Kuen et al. (2018) introduce a stochastic downsampling point to reduce the feature map size adaptively. Recently, Slimmable Nets (Yu et al., 2019; Yu & Huang, 2019b) propose to train a model to support multiple width multipliers (e.g., 4 different global width multipliers), building upon existing human-designed neural networks (e.g., MobileNetV2 0.35, 0.5, 0.75, 1.0). Such methods can adaptively fit different efficiency constraints at runtime; however, they still inherit a pre-designed neural network (e.g., MobileNetV2), which limits the degree of flexibility (e.g., only the width multiplier can adapt) and the ability to handle new deployment scenarios where the pre-designed neural network is not optimal. In this work, in contrast, we enable a much more diverse architecture space (depth, width, kernel size, and resolution) and a significantly larger number of architectural settings (10^19 vs. 4 (Yu et al., 2019)). Thanks to the diversity and the large design space, we can derive new specialized neural networks for many different deployment scenarios rather than working on top of an existing neural network that limits the optimization headroom. However, it is more challenging to train the network to achieve this flexibility, which motivates us to design the progressive shrinking algorithm to tackle this challenge.

3 METHOD

3.1 PROBLEM FORMALIZATION

Denoting the weights of the once-for-all network as W_o and the architectural configurations as {arch_i}, we can formalize the problem as

\min_{W_o} \sum_{arch_i} \mathcal{L}_{\mathrm{val}}\big( C(W_o, arch_i) \big),    (1)

where C(W_o, arch_i) denotes a selection scheme that selects part of the model from the once-for-all network W_o to form a sub-network with architectural configuration arch_i. The overall training objective is to optimize W_o to make each supported sub-network maintain the same level of accuracy as independently training a network with the same architectural configuration.
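To make the objective concrete, here is a minimal, hypothetical PyTorch sketch (not the released implementation): a toy weight-shared network whose `arch` configuration only controls depth, updated by summing the losses of a few sampled sub-networks, which approximates the sum over {arch_i} in Eq. (1).

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySupernet(nn.Module):
    """Toy stand-in for the once-for-all network W_o: sub-networks share the
    same blocks and differ only in how many of them they use (elastic depth)."""
    def __init__(self, max_depth=4, width=32, num_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(width, width) for _ in range(max_depth)])
        self.head = nn.Linear(width, num_classes)

    def forward(self, x, arch):
        # C(W_o, arch): select the first arch["depth"] shared blocks.
        for block in self.blocks[: arch["depth"]]:
            x = F.relu(block(x))
        return self.head(x)

def training_step(net, optimizer, x, y, num_sampled_archs=2):
    """Approximate Eq. (1) by sampling a few sub-networks per update step
    instead of enumerating all of them; the losses are summed on shared weights."""
    optimizer.zero_grad()
    loss = torch.zeros(())
    for _ in range(num_sampled_archs):
        arch = {"depth": random.choice([2, 3, 4])}
        loss = loss + F.cross_entropy(net(x, arch), y)
    loss.backward()
    optimizer.step()
    return loss.item()

net = ToySupernet()
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
print(training_step(net, opt, x, y))
```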

3.2 ARCHITECTURE SPACE

Our once-for-all network provides one model but supports many sub-networks of different sizes, covering four important dimensions of convolutional neural network (CNN) architectures, i.e., depth, width, kernel size, and resolution. Following the common practice of many CNN models (He et al., 2016; Sandler et al., 2018; Huang et al., 2017), we divide a CNN model into a sequence of units with gradually reduced feature map size and increased channel numbers. Each unit consists of a sequence of layers where only the first layer has stride 2 if the feature map size decreases (Sandler et al., 2018). All the other layers in the units have stride 1.


[Figure 2 diagram: the full network is trained first (K = 7, D = 4, W = 6); elastic kernel size (K ∈ {3, 5, 7}), elastic depth (D ∈ {2, 3, 4}), and elastic width (W ∈ {3, 4, 6}) are then added to the sampling space in turn, fine-tuning the shared weights (and kernel transformation matrices) at each stage; the resolution R ∈ {128, 132, ..., 224} is elastic throughout.]

Figure 2: Illustration of the progressive shrinking process to support different depth D, width W, kernel size K, and resolution R. It leads to a large space comprising diverse sub-networks (> 10^19).


We allow each unit to use arbitrary numbers of layers (denoted as elastic depth); for each layer, we allow arbitrary numbers of channels (denoted as elastic width) and arbitrary kernel sizes (denoted as elastic kernel size). In addition, we also allow the CNN model to take arbitrary input image sizes (denoted as elastic resolution). For example, in our experiments, the input image size ranges from 128 to 224 with a stride of 4; the depth of each unit is chosen from {2, 3, 4}; the width expansion ratio in each layer is chosen from {3, 4, 6}; the kernel size is chosen from {3, 5, 7}. Therefore, with 5 units, we have roughly ((3 × 3)^2 + (3 × 3)^3 + (3 × 3)^4)^5 ≈ 2 × 10^19 different neural network architectures, and each of them can be used under 25 different input resolutions. Since all of these sub-networks share the same weights (i.e., W_o) (Cheung et al., 2019), we only require 7.7M parameters to store all of them. Without sharing, the total model size would be prohibitive.
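As a quick sanity check on this count (a small illustrative calculation, not taken from the released code):

```python
# Each layer picks one of 3 kernel sizes and one of 3 expansion ratios (3 x 3 = 9
# options); a unit has 2, 3, or 4 such layers; the network has 5 units.
per_unit = (3 * 3) ** 2 + (3 * 3) ** 3 + (3 * 3) ** 4   # 7371 configurations per unit
total_archs = per_unit ** 5                              # ~2.2e19 architectures
resolutions = len(range(128, 225, 4))                    # 25 elastic input resolutions
print(f"{total_archs:.1e} architectures, each usable at {resolutions} resolutions")
```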

3.3 TRAINING THE ONCE-FOR-ALL NETWORK

Naïve Approach. Training the once-for-all network can be cast as a multi-objective problem, where each objective corresponds to one sub-network. From this perspective, a naïve training approach is to directly optimize the once-for-all network from scratch using the exact gradient of the overall objective, which is derived by enumerating all sub-networks in each update step, as shown in Eq. (1). The cost of this approach is linear in the number of sub-networks. Therefore, it is only applicable to scenarios where a limited number of sub-networks are supported (Yu et al., 2019); in our case, it is computationally prohibitive to adopt this approach.

Another naïve training approach is to sample a few sub-networks in each update step rather than enumerating all of them, which avoids the prohibitive cost. However, with such a large number of sub-networks that share weights and thus interfere with each other, we find it suffers from a significant accuracy drop. In the following section, we introduce a solution to address this challenge by adding a progressive shrinking training order to the training process. Correspondingly, we refer to the naïve training approach as random order.

Progressive Shrinking. The once-for-all network comprises many sub-networks of different sizes where small sub-networks are nested in large sub-networks. To prevent interference between the sub-networks, we propose to enforce a training order from large sub-networks to small sub-networks in a progressive manner. We name this training order progressive shrinking (PS). An example of the training process with PS is provided in Figure 2, where we start with training the largest neural network with the maximum kernel size (i.e., 7), depth (i.e., 4), and width (i.e., 6). Next, we progressively fine-tune the network to support smaller sub-networks by gradually adding them into the sampling space (larger sub-networks may also be sampled). Specifically, after training the largest network, we first support elastic kernel size, which can choose from {3, 5, 7} at each layer, while the depth and width remain the maximum values. Then, we support elastic depth and elastic width sequentially, as shown in Figure 2. The resolution is elastic throughout the whole training process, which is implemented by sampling different image sizes for each batch of training data. We also use the knowledge distillation technique after training the largest neural network (Hinton et al., 2015; Ashok et al., 2018; Yu & Huang, 2019b). It combines two loss terms using both the soft labels given by the largest neural network and the real labels.
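The combined loss can be sketched as follows; the temperature and weighting are illustrative assumptions rather than values reported in the paper:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=1.0, alpha=0.5):
    """Hard-label cross-entropy plus a soft-label term from the largest network
    (the teacher), used when fine-tuning the smaller sub-networks."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft
```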



Figure 3: Left: Kernel transformation matrix for elastic kernel size. Right: Progressive shrinking for elastic depth. Instead of skipping each layer independently, we keep the first D layers and skip the last (4 − D) layers. The weights of the early layers are shared.


Figure 4: Progressive shrinking for elastic width. In this example, we progressively support 4, 3, and 2 channel settings. We perform channel sorting and pick the most important channels (with large L1 norm) to initialize the smaller channel settings. The important channels' weights are shared.

Compared to the naïve approach, PS prevents small sub-networks from interfering with large sub-networks, since the large sub-networks are already well-trained when the once-for-all network is fine-tuned to support small sub-networks. Additionally, during fine-tuning, the model is optimized in the local space around the well-trained large sub-networks by using a small learning rate and revisiting (i.e., sampling) well-trained large sub-networks. The small sub-networks share weights with the large ones; therefore, PS allows initializing small sub-networks with the most important weights of well-trained large sub-networks, which expedites the training process. We describe the details of the PS training flow as follows:

• Elastic Kernel Size (Figure 3 left). We let the center of a 7x7 convolution kernel also serve as a 5x5 kernel, the center of which can in turn serve as a 3x3 kernel. Therefore, the kernel size becomes elastic. The challenge is that the centered sub-kernels (e.g., 3x3 and 5x5) are shared and need to play multiple roles (as an independent kernel and as part of a larger kernel). The weights of the centered sub-kernels may need different distributions or magnitudes for these different roles, and forcing them to be the same degrades the performance of some sub-networks. Therefore, we introduce kernel transformation matrices when sharing the kernel weights. We use separate kernel transformation matrices for different layers; within each layer, the kernel transformation matrices are shared among different channels. As such, we only need 25 × 25 + 9 × 9 = 706 extra parameters to store the kernel transformation matrices in each layer, which is negligible (a code sketch follows this list).

• Elastic Depth (Figure 3 right). To derive a sub-network that has D layers in a unit that originally has N layers, we keep the first D layers and skip the last N − D layers, rather than keeping any D layers as done in current NAS methods (Cai et al., 2019; Wu et al., 2019). As such, one depth setting only corresponds to one combination of layers. In the end, the weights of the first D layers are shared between large and small models.

• Elastic Width (Figure 4). Width means the number of channels. We give each layer the flexibility to choose different channel expansion ratios. Following the progressive shrinking scheme, we first train a full-width model. Then we introduce a channel sorting operation to support partial widths. It reorganizes the channels according to their importance, which is calculated based on the L1 norm of a channel's weights; a larger L1 norm means the channel is more important. For example, when shrinking from a 4-channel layer to a 3-channel layer, we select the 3 most important channels, whose weights are shared with the 4-channel layer (Figure 4 left and middle). Thereby, smaller sub-networks are initialized with the most important channels of the once-for-all network, which is already well trained. This channel sorting operation preserves the accuracy of larger sub-networks (see the sketch after this list).
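The two weight-sharing mechanisms above can be sketched roughly as follows. The module and function names, initialization, and tensor shapes are our own illustrative assumptions (standard PyTorch depthwise-convolution layout); the released implementation differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticKernelDepthwiseConv(nn.Module):
    """Elastic kernel size (Figure 3, left): one shared 7x7 depthwise kernel whose
    center is mapped to 5x5 / 3x3 kernels through per-layer transformation matrices
    (25x25 + 9x9 = 706 extra parameters per layer, shared across channels)."""
    def __init__(self, channels):
        super().__init__()
        self.weight7 = nn.Parameter(torch.randn(channels, 1, 7, 7) * 0.01)
        self.transform5 = nn.Parameter(torch.eye(25))
        self.transform3 = nn.Parameter(torch.eye(9))

    def get_kernel(self, kernel_size):
        if kernel_size == 7:
            return self.weight7
        # Crop the 5x5 center, flatten, and apply the 25x25 transformation.
        w5 = (self.weight7[:, :, 1:6, 1:6].reshape(-1, 25) @ self.transform5).reshape(-1, 1, 5, 5)
        if kernel_size == 5:
            return w5
        # Crop the 3x3 center of the 5x5 kernel and apply the 9x9 transformation.
        return (w5[:, :, 1:4, 1:4].reshape(-1, 9) @ self.transform3).reshape(-1, 1, 3, 3)

    def forward(self, x, kernel_size=7):
        w = self.get_kernel(kernel_size)
        return F.conv2d(x, w, padding=kernel_size // 2, groups=x.shape[1])

def sort_channels_by_importance(conv_weight):
    """Elastic width (Figure 4): rank output channels by the L1 norm of their
    weights so that shrinking keeps (and shares) the most important channels."""
    importance = conv_weight.abs().sum(dim=(1, 2, 3))   # one L1 norm per output channel
    order = torch.argsort(importance, descending=True)
    return conv_weight[order], order
```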


| Sub-network | Params | FLOPs | Random Order top1 (%) | PS (ours) top1 (%) | Δ Acc. |
|---|---|---|---|---|---|
| D=2, W=3, K=3 | 3.4M | 121M | 68.0 | 70.5 | +2.5 |
| D=2, W=3, K=7 | 3.5M | 151M | 69.1 | 71.9 | +2.8 |
| D=2, W=6, K=3 | 4.7M | 223M | 70.6 | 74.1 | +3.5 |
| D=2, W=6, K=7 | 4.8M | 283M | 71.6 | 75.0 | +3.4 |
| D=4, W=3, K=3 | 4.4M | 226M | 71.5 | 74.8 | +3.3 |
| D=4, W=3, K=7 | 4.6M | 293M | 72.3 | 75.7 | +3.4 |
| D=4, W=6, K=3 | 7.3M | 433M | 73.1 | 76.8 | +3.7 |
| D=4, W=6, K=7 | 7.7M | 566M | 73.8 | 77.3 | +3.5 |
| Mbv3-L (reference) | 5.4M | 219M | 75.2 | 75.2 | - |

Table 1: ImageNet top1 accuracy (%) of sub-networks under resolution 224 × 224. "(D = d, W = w, K = k)" denotes a sub-network with d layers in each unit, where each layer has a width expansion ratio w and kernel size k. "Mbv3-L" denotes MobileNetV3-Large; its accuracy (75.2%) is shown as a reference and does not depend on the training order.

3.4 SPECIALIZED MODEL DEPLOYMENT WITH ONCE-FOR-ALL NETWORK

Having trained a once-for-all network, the next stage is to derive the specialized sub-network for a given deployment scenario. The goal is to search for a neural network that satisfies the efficiency (e.g., latency, energy) constraints on the target hardware while optimizing the accuracy. Since OFA decouples model training from architecture search, we do not need any training cost in this stage. Furthermore, we build neural-network-twins to predict the latency and accuracy given a neural network architecture, providing quick feedback on model quality. This eliminates the repeated search cost by substituting the measured accuracy/latency with predicted accuracy/latency (twins).

Specifically, we randomly sample 16K sub-networks with different architectures and input image sizes, then measure their accuracy on 10K validation images sampled from the original training set. These [architecture, accuracy] pairs are used to train an accuracy predictor that predicts the accuracy of a model given its architecture and input image size². We also build a latency lookup table (Cai et al., 2019) on each target hardware platform to predict the latency. Given the target hardware and latency constraint, we conduct an evolutionary search (Real et al., 2019) based on the neural-network-twins to get a specialized sub-network. Since the cost of searching with neural-network-twins is negligible, we only need 40 GPU hours to collect the data pairs, and the cost stays constant regardless of the number of deployment scenarios.
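The specialization stage can be sketched as a predictor-guided, mutation-only evolutionary loop. The accuracy surrogate and latency table below are toy placeholders for the trained accuracy predictor and the measured lookup table, and the real search also uses crossover; everything here is illustrative.

```python
import copy
import random

CHOICES = {"depth": [2, 3, 4], "kernel": [3, 5, 7], "width": [3, 4, 6]}

def predict_accuracy(arch):
    # Placeholder for the trained accuracy predictor ("twin").
    return 0.70 + 0.01 * arch["depth"] + 0.002 * arch["kernel"] + 0.005 * arch["width"]

def predict_latency(arch, table):
    # Placeholder for the per-platform latency lookup table: per-layer cost summed over depth.
    return arch["depth"] * table[(arch["kernel"], arch["width"])]

def mutate(arch):
    child = copy.deepcopy(arch)
    key = random.choice(list(CHOICES))
    child[key] = random.choice(CHOICES[key])
    return child

def evolutionary_search(table, latency_limit, population=16, generations=20):
    pop = [{k: random.choice(v) for k, v in CHOICES.items()} for _ in range(population)]
    best = None  # may stay None if no feasible architecture is ever found
    for _ in range(generations):
        feasible = [a for a in pop if predict_latency(a, table) <= latency_limit]
        feasible.sort(key=predict_accuracy, reverse=True)
        if feasible and (best is None or predict_accuracy(feasible[0]) > predict_accuracy(best)):
            best = feasible[0]
        parents = feasible[: max(2, population // 4)] or pop[:2]
        pop = parents + [mutate(random.choice(parents)) for _ in range(population - len(parents))]
    return best

# Toy latency table keyed by (kernel size, width expansion ratio).
table = {(k, w): 0.3 * k + 0.5 * w for k in CHOICES["kernel"] for w in CHOICES["width"]}
print(evolutionary_search(table, latency_limit=10.0))
```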

4 EXPERIMENTS

In this section, we first apply the progressive shrinking algorithm to train the once-for-all network on ImageNet (Deng et al., 2009). Then we demonstrate the effectiveness of our trained once-for-all network on various hardware platforms (Samsung S7 Edge, Note8, Note10, Google Pixel1, Pixel2, LG G8, NVIDIA 1080Ti, V100 GPUs, Jetson TX2, Intel Xeon CPU, Xilinx ZU9EG, and ZU3EG FPGAs) with different latency constraints.

4.1 TRAINING THE ONCE-FOR-ALL NETWORK ON IMAGENET

Training Details. We use the same architecture space as MobileNetV3 (Howard et al., 2019). For training the full network, we use the standard SGD optimizer with Nesterov momentum 0.9 and weight decay 3e-5. The initial learning rate is 2.6, and we use the cosine schedule (Loshchilov & Hutter, 2016) for learning rate decay. The full network is trained for 180 epochs with batch size 2048 on 32 GPUs. Then we follow the schedule described in Figure 2 to further fine-tune the full network³. The whole training process takes around 1,200 GPU hours on V100 GPUs. This is a one-time training cost that can be amortized by many deployment scenarios.
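For reference, the cosine schedule mentioned above follows the standard formulation; in this generic sketch only the initial learning rate (2.6) is taken from the text:

```python
import math

def cosine_lr(step, total_steps, base_lr=2.6, min_lr=0.0):
    """Cosine learning-rate decay (Loshchilov & Hutter, 2016):
    anneal from base_lr down to min_lr over total_steps."""
    progress = min(step / max(1, total_steps), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```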

Results. Table 1 reports the top1 accuracy of sub-networks derived from the once-for-all networks that are trained with our progressive shrinking (PS) algorithm and with random order, respectively. Due to space limits, we take 8 sub-networks for comparison, and each of them is denoted as "(D = d, W = w, K = k)". It represents a sub-network that has d layers for all units, while the expansion ratio and kernel size are set to w and k for all layers. Compared to random order, PS can improve the ImageNet accuracy of sub-networks by a significant margin under all architectural settings. Specifically, without architecture optimization, PS achieves 74.8% top1 accuracy using 226M FLOPs under the architecture setting (D=4, W=3, K=3), which is on par with MobileNetV3-Large. In contrast, random order only achieves 71.5%, which is 3.3% lower.

² Details of the accuracy predictor are provided in Appendix A.
³ Implementation details can be found in Appendix B.


| Model | ImageNet Top1 (%) | FLOPs | Mobile latency | Search cost (GPU hours) | Training cost (GPU hours) | GPU hours (N=40) | CO2e, lbs (N=40) | AWS cost (N=40) |
|---|---|---|---|---|---|---|---|---|
| MobileNetV2 [28] | 72.0 | 300M | 66ms | 0 | 150N | 6k | 1.7k | $18.4k |
| MobileNetV2 #1200 | 73.5 | 300M | 66ms | 0 | 1200N | 48k | 13.6k | $146.9k |
| NASNet-A [41] | 74.0 | 564M | - | 48,000N | - | 1,920k | 544.5k | $5875.2k |
| DARTS [22] | 73.1 | 595M | - | 96N | 250N | 14k | 4.0k | $42.8k |
| MnasNet [30] | 74.0 | 317M | 70ms | 40,000N | - | 1,600k | 453.8k | $4896.0k |
| FBNet-C [33] | 74.9 | 375M | - | 216N | 360N | 23k | 6.5k | $70.4k |
| ProxylessNAS [4] | 74.6 | 320M | 71ms | 200N | 300N | 20k | 5.7k | $61.2k |
| SinglePathNAS [8] | 74.7 | 328M | - | 288 + 24N | 384N | 17k | 4.8k | $52.0k |
| AutoSlim [35] | 74.2 | 305M | 63ms | 180 | 300N | 12k | 3.4k | $36.7k |
| MobileNetV3-Large [14] | 75.2 | 219M | 58ms | - | 180N | 7.2k | 1.8k | $22.2k |
| OFA w/o PS | 72.4 | 235M | 59ms | 40 | 1200 | 1.2k | 0.34k | $3.7k |
| OFA w/ PS | 76.0 | 230M | 58ms | 40 | 1200 | 1.2k | 0.34k | $3.7k |
| OFA w/ PS #25 | 76.4 | 230M | 58ms | 40 | 1200 + 25N | 2.2k | 0.62k | $6.7k |
| OFA w/ PS #75 | 76.9 | 230M | 58ms | 40 | 1200 + 75N | 4.2k | 1.2k | $13.0k |

Table 2: Comparison with SOTA hardware-aware NAS methods on the Pixel1 phone. OFA decouples model training from architecture search. The search cost and training cost both stay constant as the number of deployment scenarios grows. "#25" denotes the specialized sub-networks are fine-tuned for 25 epochs after grabbing weights from the once-for-all network. "CO2e" denotes CO2 emission, which is calculated based on Strubell et al. (2019). AWS cost is calculated based on the price of on-demand P3.16xlarge instances.
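The totals in Table 2 follow directly from the per-scenario costs. A small illustrative helper (GPU hours only; the CO2 and dollar figures in the table are derived from these hours):

```python
def total_gpu_hours(search_per_scenario, train_per_scenario, one_time=0, n=40):
    """Design cost that scales with the number of deployment scenarios N,
    versus OFA's mostly one-time cost (numbers taken from Table 2)."""
    return one_time + n * (search_per_scenario + train_per_scenario)

print(total_gpu_hours(200, 300))                   # ProxylessNAS: 20,000 GPU hours
print(total_gpu_hours(0, 0, one_time=40 + 1200))   # OFA w/ PS: 1,240 GPU hours
print(total_gpu_hours(0, 25, one_time=40 + 1200))  # OFA w/ PS #25: 2,240 GPU hours
```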


Figure 5: OFA saves orders of magnitude design cost compared to NAS methods.


4.2 SPECIALIZED SUB-NETWORKS FOR DIFFERENT HARDWARE AND CONSTRAINTS

We apply our trained once-for-all network to get different specialized sub-networks for diverse hardware platforms: from the cloud to the edge. On cloud devices, the latency for GPU is measured with batch size 64 on NVIDIA 1080Ti and V100 with PyTorch 1.0 + cuDNN. The CPU latency is measured with batch size 1 on an Intel Xeon E5-2690 v4 + MKL-DNN. On edge devices, including mobile phones, we use Samsung, Google, and LG phones with TF-Lite, batch size 1; for mobile GPU, we use Jetson TX2 with PyTorch 1.0 + cuDNN, batch size 16; for embedded FPGA, we use Xilinx ZU9EG and ZU3EG FPGAs with Vitis AI⁴, batch size 1.

Comparison with NAS on Mobile Devices. Table 2 reports the comparison between OFA and state-of-the-art hardware-aware NAS methods on the mobile phone (Pixel1). OFA is much more efficient than NAS when handling multiple deployment scenarios since the cost of OFA is constant while that of the others is linear in the number of deployment scenarios (N). With N = 40, the total CO2 emission of OFA is 16× lower than ProxylessNAS, 19× lower than FBNet, and 1,300× lower than MnasNet (Figure 5). Without retraining, OFA achieves 76.0% top1 accuracy on ImageNet, which is 0.8% higher than MobileNetV3-Large while maintaining similar mobile latency. We can further improve the top1 accuracy to 76.4% by fine-tuning the specialized sub-network for 25 epochs and to 76.9% by fine-tuning for 75 epochs. Besides, we also observe that OFA with PS achieves 3.6% better accuracy than without PS, showing the effectiveness of PS.

OFA under Different Computational Resource Constraints. Figure 6 summarizes the results of OFA under different FLOPs and Pixel1 latency constraints. OFA achieves 79.1% ImageNet top1 accuracy with 389M FLOPs, being 2.8% more accurate than EfficientNet-B0, which has similar FLOPs. With 595M FLOPs, OFA reaches a new SOTA 80.0% ImageNet top1 accuracy under the mobile setting (<600M FLOPs), which is 0.2% higher than EfficientNet-B2 while using 1.68× fewer FLOPs. More importantly, OFA runs much faster than EfficientNets on hardware. Specifically, with 143ms Pixel1 latency, OFA achieves 80.1% ImageNet top1 accuracy, being 0.3% more accurate and 2.6× faster than EfficientNet-B2.

⁴ https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html



Figure 6: OFA achieves 80.0% top1 accuracy with 595M FLOPs and 80.1% top1 accuracy with 143ms Pixel1 latency, setting a new SOTA ImageNet top1 accuracy on the mobile setting.


Figure 7: OFA consistently outperforms MobileNetV3 on mobile platforms.


Figure 7 reports detailed comparisons between OFA and MobileNetV3 on six mobile devices. Remarkably, OFA can produce the entire trade-off curve, with many points over a wide range of latency constraints, by training only once (green curve). This is impossible for previous NAS methods (Tan et al., 2019; Cai et al., 2019) due to the prohibitive training cost.

OFA for Diverse Hardware Platforms. Besides the mobile platforms, we extensively studied the effectiveness of OFA on six additional hardware platforms (Figure 8) using the ProxylessNAS architecture space (Cai et al., 2019). OFA consistently improves the trade-off between accuracy and latency by a significant margin, especially on GPUs, which have more parallelism. With similar latency as MobileNetV2 0.35, "OFA #25" improves the ImageNet top1 accuracy from MobileNetV2's 60.3% to 72.6% (+12.3% improvement) on the 1080Ti GPU. Detailed architectures of our specialized models are shown in Figure 11. It reveals the insight that using the same model for different deployment scenarios with only the width multiplier modified has a limited impact on efficiency improvement: the accuracy drops quickly as the latency constraint gets tighter.



Figure 8: Specialized OFA models consistently achieve significantly higher ImageNet accuracy with similar latency than non-specialized neural networks on CPU, GPU, mGPU, and FPGA. More remarkably, specializing for a new hardware platform does not add training cost using OFA.


Figure 9: OFA models improve the arithmetic intensity (OPS/Byte) and utilization (GOPS/s) compared with MobileNetV2 and MnasNet (measured results on Xilinx ZU9EG and ZU3EG FPGAs).

OFA for Specialized Hardware Accelerators. There has been plenty of work on NAS for general-purpose hardware, but little work has focused on specialized hardware accelerators. We quantitatively analyzed the performance of OFA on two FPGA accelerators (ZU3EG and ZU9EG) using Xilinx Vitis AI with 8-bit quantization, and discuss two design principles.

Principle 1: memory access is expensive, computation is cheap. An efficient CNN should do as much computation as possible with a small memory footprint. This ratio is defined as the arithmetic intensity (OPs/Byte). The higher the OPs/Byte, the less memory-bounded the model is and the easier it is to parallelize. Thanks to OFA's diverse choices of sub-network architectures (10^19; Section 3.3) and the OFA model twin that can quickly give accuracy/latency feedback (Section 3.4), the evolutionary search can automatically find a CNN architecture with higher arithmetic intensity. As shown in Figure 9, OFA's arithmetic intensity is 48%/43% higher than that of MobileNetV2 and MnasNet (MobileNetV3 is not supported by Xilinx Vitis AI). Removing the memory bottleneck results in 70%-90% higher utilization and GOPS/s, pushing the operating point to the upper right in the roofline model (Williams et al., 2009), as shown in Figure 10 (70%-90% looks small on the log scale, but it is significant).
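As a toy roofline-style illustration of arithmetic intensity, consider a single 1x1 (point-wise) convolution and assume every input, weight, and output element is moved from off-chip memory exactly once and stored in one byte (the 8-bit quantized setting). Real accelerators tile and cache, so these numbers only illustrate the trend that wider layers reuse each fetched byte more:

```python
def pointwise_conv_arithmetic_intensity(h, w, c_in, c_out, bytes_per_element=1):
    """Ops per byte of memory traffic for a 1x1 convolution on an h x w feature map."""
    ops = 2 * h * w * c_in * c_out                                   # multiply + add per MAC
    traffic = (h * w * c_in + c_in * c_out + h * w * c_out) * bytes_per_element
    return ops / traffic

print(pointwise_conv_arithmetic_intensity(14, 14, 64, 64))    # ~55 ops/byte
print(pointwise_conv_arithmetic_intensity(14, 14, 128, 128))  # ~96 ops/byte
```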

Principle 2: the CNN architecture should be co-designed with the hardware accelerator's cost model. The FPGA accelerator has a specialized depth-wise engine that is pipelined with the point-wise engine. The pipeline throughput is perfectly matched for 3x3 kernels. As a result, OFA's searched model only has 3x3 kernels on FPGA (Figure 11, a), even though 5x5 and 7x7 kernels are also in the search space. Additionally, large kernels sometimes cause an "out of BRAM" error on FPGA, incurring a high cost. On the Intel Xeon CPU, however, more than 50% of the operations use large kernels. Both the FPGA and GPU models are wider than the CPU model, due to the large parallelism of the computation array.



Figure 10: Quantitative study of OFA's roofline model on Xilinx ZU9EG and ZU3EG FPGAs (log scale). The OFA model increases the arithmetic intensity by 33%/43% and GOPS/s by 72%/92% on these two FPGAs compared with MnasNet.


(a) 4.1ms latency on Xilinx ZU3EG (batch size = 1).


(b) 10.9ms latency on Intel Xeon CPU (batch size = 1).


(c) 14.9ms latency on NVIDIA 1080Ti (batch size = 64).

Figure 11: OFA can design specialized models for different hardware and different latency constraints. "MB4 3x3" means "mobile block with expansion ratio 4, kernel size 3x3". FPGA and GPU models are wider than the CPU model due to larger parallelism. Different hardware has different cost models, leading to different optimal CNN architectures. OFA provides a unified and efficient design methodology.

5 CONCLUSION

We proposed Once-for-All (OFA), a new methodology that decouples model training from architecture search for efficient deep learning deployment on a large number of hardware platforms. Unlike previous approaches that design and train a neural network for each deployment scenario, we designed a once-for-all network that supports different architectural configurations, including elastic depth, width, kernel size, and resolution. It reduces the training cost (GPU hours, energy consumption, and CO2 emission) by orders of magnitude compared to conventional methods. To prevent sub-networks of different sizes from interfering with each other, we proposed a progressive shrinking algorithm that enables a large number of sub-networks to achieve the same level of accuracy as training them independently. Experiments on a diverse range of hardware platforms and efficiency constraints demonstrated the effectiveness of our approach. OFA provides an automated ecosystem to efficiently design efficient neural networks with the hardware cost model in the loop.


ACKNOWLEDGMENTS

We thank the NSF Career Award #1943349, MIT-IBM Watson AI Lab, Google-Daydream Research Award, Samsung, Intel, Xilinx, SONY, and the AWS Machine Learning Research Award for supporting this research. We thank Samsung, Google, and LG for donating mobile phones.


A DETAILS OF THE ACCURACY PREDICTOR

We use a three-layer feedforward neural network that has 400 hidden units in each layer as the accuracy predictor. Given a model, we encode each layer in the neural network into a one-hot vector based on its kernel size and expand ratio, and we assign zero vectors to layers that are skipped. Besides, we have an additional one-hot vector that represents the input image size. We concatenate these vectors into a large vector that represents the whole neural network architecture and input image size, which is then fed to the three-layer feedforward neural network to get the predicted accuracy. In our experiments, this simple accuracy prediction model provides very accurate predictions. At convergence, the root-mean-square error (RMSE) between predicted accuracy and estimated accuracy on the test set is only 0.21%. Figure 12 shows the relationship between the RMSE of the accuracy prediction model and the final results (i.e., the accuracy of selected sub-networks). We find that lower RMSE typically leads to better final results.
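To make the encoding and predictor structure concrete, the following is a minimal PyTorch sketch. The candidate kernel sizes {3, 5, 7} and expand ratios {3, 4, 6} follow the search space described in this paper; the maximum layer count, the resolution grid, and the reading of "three-layer" as three hidden layers of 400 units are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the accuracy predictor (PyTorch).
# KERNEL_SIZES and EXPAND_RATIOS follow the paper's search space;
# MAX_LAYERS and IMAGE_SIZES are illustrative assumptions.
import torch
import torch.nn as nn

KERNEL_SIZES = [3, 5, 7]
EXPAND_RATIOS = [3, 4, 6]
IMAGE_SIZES = [128, 144, 160, 176, 192, 208, 224]  # assumed resolution grid
MAX_LAYERS = 20                                     # assumed upper bound on layers

PER_LAYER_DIM = len(KERNEL_SIZES) * len(EXPAND_RATIOS)
INPUT_DIM = MAX_LAYERS * PER_LAYER_DIM + len(IMAGE_SIZES)


class AccuracyPredictor(nn.Module):
    """Three feedforward layers with 400 hidden units each, plus a scalar head."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(INPUT_DIM, 400), nn.ReLU(inplace=True),
            nn.Linear(400, 400), nn.ReLU(inplace=True),
            nn.Linear(400, 400), nn.ReLU(inplace=True),
            nn.Linear(400, 1),  # predicted top-1 accuracy
        )

    def forward(self, x):
        return self.net(x)


def encode_subnet(kernel_sizes, expand_ratios, image_size):
    """One one-hot vector per layer (kernel size x expand ratio), zero vectors
    for skipped layers, plus an extra one-hot vector for the input image size."""
    feats = torch.zeros(MAX_LAYERS, PER_LAYER_DIM)
    for i, (k, e) in enumerate(zip(kernel_sizes, expand_ratios)):
        if k is None:  # layer skipped by elastic depth -> leave zeros
            continue
        idx = KERNEL_SIZES.index(k) * len(EXPAND_RATIOS) + EXPAND_RATIOS.index(e)
        feats[i, idx] = 1.0
    res = torch.zeros(len(IMAGE_SIZES))
    res[IMAGE_SIZES.index(image_size)] = 1.0
    return torch.cat([feats.flatten(), res])


# Example: a hypothetical 4-layer sub-network (third layer skipped) at 160x160.
predictor = AccuracyPredictor()
x = encode_subnet([3, 5, None, 7], [4, 6, None, 3], 160)
print(predictor(x.unsqueeze(0)))  # untrained output; train with MSE on (arch, acc) pairs
```

In practice the predictor would be fit by regression on (architecture encoding, measured accuracy) pairs collected from the OFA network.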

Figure 12 data — RMSE of the accuracy prediction model (%) vs. accuracy of the selected sub-network (%): (16.3, 72.4), (8.7, 72.2), (4.5, 72.7), (2.3, 72.8), (1.0, 74.1), (0.5, 74.7), (0.2, 75.1).

Figure 12: Performance of selected sub-networks using different accuracy prediction models.

B IMPLEMENTATION DETAILS OF PROGRESSIVE SHRINKING

After training the full network, we first have one stage of fine-tuning to incorporate elastic kernel size. In this stage (i.e., K ∈ [7, 5, 3]), we sample one sub-network in each update step. The network is fine-tuned for 125 epochs with an initial learning rate of 0.96. All other training settings are the same as training the full network.

Next, we have two stages of fine-tuning to incorporate elastic depth. We sample two sub-networks and aggregate their gradients in each update step. The first stage (i.e., D ∈ [4, 3]) takes 25 epochs with an initial learning rate of 0.08, while the second stage (i.e., D ∈ [4, 3, 2]) takes 125 epochs with an initial learning rate of 0.24.

Finally, we have two stages of fine-tuning to incorporate elastic width. We sample four sub-networks and aggregate their gradients in each update step. The first stage (i.e., W ∈ [6, 4]) takes 25 epochs with an initial learning rate of 0.08, while the second stage (i.e., W ∈ [6, 4, 3]) takes 125 epochs with an initial learning rate of 0.24.
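The stage schedule above can be summarized as a small training driver. The sketch below is illustrative only: the supernet, sub-network sampler, data loader, optimizer factory, and loss are assumed to be provided, and their interfaces (e.g., sample_subnet, supernet(images, cfg)) are hypothetical rather than the released OFA API; only the epochs, initial learning rates, and number of sub-networks sampled per update follow the text.

```python
# Illustrative driver for the progressive-shrinking fine-tuning schedule.
# Stage constants (epochs, initial LR, sub-networks per update) follow the
# text above; all model/data interfaces are hypothetical placeholders.
STAGES = [
    # (stage name,               epochs, init_lr, subnets_per_update)
    ("elastic kernel K=[7,5,3]",    125,   0.96,   1),
    ("elastic depth  D=[4,3]",       25,   0.08,   2),
    ("elastic depth  D=[4,3,2]",    125,   0.24,   2),
    ("elastic width  W=[6,4]",       25,   0.08,   4),
    ("elastic width  W=[6,4,3]",    125,   0.24,   4),
]


def progressive_shrinking_finetune(supernet, sample_subnet, loader,
                                   make_optimizer, loss_fn):
    for stage, epochs, init_lr, n_subnets in STAGES:
        optimizer = make_optimizer(supernet.parameters(), lr=init_lr)
        for _ in range(epochs):
            for images, labels in loader:
                optimizer.zero_grad()
                # Sample several sub-networks and aggregate their gradients
                # before taking a single optimizer step.
                for _ in range(n_subnets):
                    cfg = sample_subnet(stage)  # hypothetical sampler
                    loss = loss_fn(supernet(images, cfg), labels) / n_subnets
                    loss.backward()
                optimizer.step()
```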
