ORDERING CHAOS: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS FOR EDGE DEVICES

Byung Hoon Ahn 1 †, Jinwon Lee 2, Jamie Menjay Lin 2, Hsin-Pai Cheng 3 †, Jilei Hou 2, Hadi Esmaeilzadeh 1

† Work done as an intern at Qualcomm AI Research. 1 University of California, San Diego. 2 Qualcomm AI Research. 3 Duke University. Correspondence to: Byung Hoon Ahn <bhahn@eng.ucsd.edu>.

Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).

ABSTRACT

Recent advances in automating machine learning through Neural Architecture Search and Random Network Generators have yielded networks that deliver higher accuracy given the same hardware resource constraints, e.g., memory capacity, bandwidth, number of functional units. Many of these emergent networks, however, comprise irregular wirings (connections) that complicate their execution by deviating from the conventional regular patterns of layer, node connectivity, and computation. The irregularity leads to a new problem space where the schedule and order of nodes significantly affect the activation memory footprint during inference. Concurrently, there is an increasing general demand to deploy neural models onto resource-constrained edge devices due to efficiency, connectivity, and privacy concerns. To enable such a transition from cloud to edge for the irregularly wired neural networks, we set out to devise a compiler optimization that caps and minimizes the footprint to the limitations of the edge device. This optimization is a search for the schedule of the nodes in an intractably large space of possible solutions. We offer and leverage the insight that partial schedules lead to repeated subpaths for search and use the graph properties to generate a signature for these repetitions. These signatures enable the use of Dynamic Programming as a basis for the optimization algorithm. However, due to the sheer number of neurons and connections, the search space may remain prohibitively large. As such, we devise an Adaptive Soft Budgeting technique that, during dynamic programming, performs a light-weight meta-search to find the appropriate memory budget for pruning suboptimal paths. Nonetheless, schedules from any scheduling algorithm, including ours, are still bound to the topology of the neural graph under compilation. To alleviate this intrinsic restriction, we develop an Identity Graph Rewriting scheme that leads to even lower memory footprint without changing the mathematical integrity of the neural network. We evaluate our proposed algorithms and schemes using representative irregularly wired neural networks. Compared to TensorFlow Lite, a widely used framework for edge devices, the proposed framework provides 1.86× reduction in memory footprint and 1.76× reduction in off-chip traffic with an average of less than one minute extra compilation time.

1 INTRODUCTION

A growing body of work focuses on Automating Machine Learning (AutoML) using Neural Architecture Search (NAS) (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and now even Random Network Generators (Xie et al., 2019; Wortsman et al., 2019), which emit models with irregular wirings, and shows that such irregularly wired neural networks can significantly enhance classification performance. These networks that deviate from regular topology can even adapt to some of the constraints of the hardware (e.g., memory capacity, bandwidth, number of functional units), rendering themselves especially useful in targeting edge devices. Therefore, lifting the regularity condition provides significant freedom for NAS and expands the search space (Cortes et al., 2017; Zhang et al., 2019; Xie et al., 2019).

The general objective is to enable deployment of neural intelligence even on stringently constrained devices by trading off regular wiring of neurons for higher resource efficiency. Importantly, pushing neural execution to the edge is one way to address the growing concerns about privacy (Mireshghallah et al., 2020) and to enable effective use where connectivity to the cloud is restricted (Wu et al., 2019). However, a new challenge arises regarding orchestrating the execution of these irregularly wired neural networks on edge devices, as the working memory footprint during execution frequently surpasses the strict cap on the memory capacity of these devices. The lack of a multi-level memory hierarchy in these micro devices exacerbates the problem, because the network cannot even be executed if the footprint exceeds the capacity. To that end, despite the significant potential of irregularly wired neural networks, their complicated execution pattern,


in contrast to previously streamlined execution of models with regular topology, renders conventional frameworks futile in taking these networks to the edge due to their large peak memory footprint. While peak memory footprint is largely dependent on the scheduling of neurons, current deep learning compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Abadi et al., 2016; Paszke et al., 2019; Jia et al., 2014) rely on basic topological ordering algorithms that are oblivious to peak memory footprint and instead focus on an orthogonal problem of tiling and kernel-level optimization. This paper is an initial step towards embedding peak memory footprint as a first-grade constraint in deep learning schedulers to unleash the potential of the emergent irregularly wired neural networks. As such, this paper makes the following contributions:

(1) Memory-aware scheduling for irregularly wired neural networks. Scheduling for these networks is a topological ordering problem, which enumerates an intractably large space of possible schedules. We offer and leverage the insight that partial schedules lead to repeated subpaths for search and use the graph properties to generate a signature for these repetitions, while embedding a notion of the running memory usage. These signatures enable the use of Dynamic Programming as a basis for the optimization algorithm.

(2) Adaptive soft budgeting for tractable compilation time. Even with the dynamic programming as the base, due to the sheer number of neurons and connections, the search space may remain too large (exponentially large) in practice. As such, we devise an Adaptive Soft Budgeting technique that uses a lightweight meta-search mechanism to find the appropriate memory budget for pruning the suboptimal paths. This technique aims to find an inflection point beyond which tighter budgets may lead to no solution and looser budgets prolong the scheduling substantially, putting the optimization in a position of questionable utility.

(3) Identity graph rewriting for enabling higher potential in memory reduction. Any scheduling algorithm, including ours, is still bound to the topology of the neural graph under compilation. To relax this intrinsic restriction, we devise an Identity Graph Rewriting scheme that exchanges subgraphs, leading to a lower memory footprint without altering the mathematical integrity of the neural network.

Results show that our adaptive scheduling algorithm improves peak memory footprint for irregularly wired neural networks by 1.68× compared to TensorFlow Lite, the de facto framework for edge devices. Our graph rewriting technique provides an opportunity to lower the peak memory footprint by an additional 10.7%. Furthermore, our framework can even bring about 1.76× reduction in off-chip traffic for devices with a multi-level memory hierarchy, and even eliminate the traffic in some cases by confining the memory footprint below the on-chip memory capacity. These gains come at an average of less than one minute of extra compilation time.

Figure 1: Architecture of network models from NAS and Random Network Generators ((a) RandWire, (b) SwiftNet). The topology of such networks includes distinctive irregular wirings between the nodes.

2 CHALLENGES AND OUR APPROACH

2.1 Irregularly Wired Neural Networks

Recent excitement in Automated Machine Learning (AutoML) (Feurer et al., 2015; Dean, 2017; He et al., 2018; Elthakeb et al., 2018; Wang et al., 2019; Laredo et al., 2019) aims to achieve human out of the loop in developing machine learning systems. This includes Neural Architecture Search (NAS) (Zoph & Le, 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Network Generators (Xie et al., 2019; Wortsman et al., 2019) that focus on automation of designing neural architectures. Figure 1 demonstrates that networks of this regime are characterized by their distinctive irregular graph topology, with much more irregular wirings (dataflow) compared to conventional networks with regular graph topology. This paper refers to these networks as irregularly wired neural networks.

Figure 2: Top-1 ImageNet accuracy vs. number of multiply-and-accumulate operations (billions), where irregularly wired neural networks show higher performance for the same compute than regular topology neural networks (top-left is better). A plot for the number of parameters displays a similar trend.


Figure 3: (a) SwiftNet Cell A; (b) CDF of the peak memory footprint for the different possible schedules of a given irregularly wired neural network. Under a 250 KB constraint, only 4.1% of schedules satisfy the constraint and only 0.04% of schedules are optimal.

From the performance perspective, these networks have been shown to outperform manually designed architectures in terms of accuracy while using fewer resources. In fact, the majority of winning neural architectures in competitions with the primary goal of reducing resources (Gauen et al., 2017) rely on NAS, suggesting its effectiveness in that respect. Figure 2 plots the accuracy of different models given their computation. The figure clearly shows that the Pareto frontier of irregularly wired neural networks from NAS and Random Network Generators is better than that of the hand-designed models with regular topology. This indicates that the efficiency in terms of accuracy given fixed resources is better with the irregularly wired neural networks.

2.2 Challenges

Many existing compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Paszke et al., 2019; Abadi et al., 2016; Jia et al., 2014) rely on basic topological ordering algorithms to schedule the graph. While the current approach may be sufficient to run conventional networks on server-class machines, such a scheme may be unfit for running irregularly wired neural networks on resource-constrained edge devices. This is because, unlike running networks with regular topology, running irregular networks results in a varied range of memory footprint depending on the schedule. For instance, given the constraints of a representative edge device (SparkFun Edge: 250KB weight/activation memory and 60M MACs), Figure 3(b) shows that only 4.1% of the schedules barely meet the hard memory constraint, while only 0.04% would achieve the optimal peak memory. In reality, such a limitation will prevent further exploration regarding the diversity and innovation of network design, and in order to allow the edge computing regime to take full advantage of the irregularly wired neural networks, this limitation should be alleviated, if not removed.

2.3 Design Objectives

Scheduling algorithm. To address this issue, our work aims to find a schedule of nodes s* from the search space S that would minimize the peak memory footprint μ_peak. S enumerates all possible orderings of the nodes v ∈ V, where V is the set of all nodes within a graph G:

$s^* = \arg\min_{s} \mu_{peak}(s, G) \quad \text{for } s \in S$ \qquad (1)

The most straightforward way to schedule is a brute-force approach, which just enumerates S and picks the one with the minimum peak memory footprint. While this extreme method may find an optimal solution, it is too costly in terms of time due to its immense complexity, Θ(|V|!), where |V| denotes the number of nodes in the graph. One way to improve is to narrow down the search space to focus only on the topological orderings S_T ⊂ S. However, this will still suffer from a complexity with an upper bound of O(|V|!) (it takes days to schedule a DAG with merely 30 nodes). In fact, previous works (Bruno & Sethi, 1976; Bernstein et al., 1989) already prove that optimal scheduling for DAGs is NP-complete. On the other extreme are heuristics for topological ordering such as Kahn's algorithm (Kahn, 1962), with complexity of O(|V|+|E|), where |V| and |E| are the number of nodes and edges. However, as demonstrated in Figure 3, such a method may yield a suboptimal schedule of nodes that will not run on the target hardware. To this end, we explore dynamic programming combined with adaptive soft budgeting for scheduling to achieve an optimal solution s*, while keeping the graph constant, without adding too much overhead in terms of time. We explain our algorithms in depth in Sections 3.1 and 3.2.
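To make the objective concrete, the sketch below (not the paper's implementation) shows how the peak memory footprint μ_peak(s, G) of a given schedule can be evaluated, and uses Kahn's algorithm as the kind of memory-oblivious baseline scheduler discussed above. The graph representation (a dict of successor lists plus per-node activation sizes) is an assumption made purely for illustration.

```python
from collections import deque

def peak_memory(schedule, succs, size):
    """Evaluate mu_peak(s, G) for a schedule over a DAG given as successor lists."""
    preds = {n: set() for n in succs}
    for n, outs in succs.items():
        for m in outs:
            preds[m].add(n)
    remaining_uses = {n: len(outs) for n, outs in succs.items()}
    live, current, peak = set(), 0, 0
    for u in schedule:
        live.add(u)
        current += size[u]                 # allocate u's output activation
        peak = max(peak, current)
        for p in preds[u]:                 # release inputs with no further consumers
            remaining_uses[p] -= 1
            if remaining_uses[p] == 0 and p in live:
                live.remove(p)
                current -= size[p]
    return peak

def kahn_schedule(succs):
    """Kahn's O(|V|+|E|) topological ordering: valid, but oblivious to memory."""
    indegree = {n: 0 for n in succs}
    for outs in succs.values():
        for m in outs:
            indegree[m] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for m in succs[u]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return order

if __name__ == "__main__":
    succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}   # toy DAG
    size = {"A": 4, "B": 2, "C": 2, "D": 1}                      # activation sizes
    s = kahn_schedule(succs)
    print(s, peak_memory(s, succs, size))                        # e.g. ['A', 'B', 'C', 'D'] 8
```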

Graph rewriting. Any scheduling algorithm, including ours, is intrinsically bounded by the graph topology. Therefore, we explore transforming the search space through graph rewriting (Plump, 1999). Graph rewriting is generally concerned with substituting a certain pattern in the graph with a different pattern to achieve a certain objective. For a computational dataflow graph, leveraging distributive, associative, and commutative properties within the computation of the graph, graph rewriting can maintain the semantics while bringing significant improvements regarding some objective. For example, in general programs, $\sum_i \log x_i$ can be represented as $\sum_{\text{odd } i} \log x_i + \sum_{\text{even } i} \log x_i$ or $\log \prod_i x_i$, while $x + x$ can be translated to $x \times 2$ or $x \ll 1$.

Likewise, we bring this insight to neural networks to find a set of possible transformations X that can rewrite the original graph G to a new graph G′, which would also change our search space S to one with a lower peak memory footprint:

$X^* = \arg\min_{X} \mu_{peak}(s^*, X(G))$ \qquad (2)

We identify a set of candidate patterns for transformation χ: g → g′ (g ∈ G and g′ ∈ G′), which constitutes X. While transforming the graph, our method keeps the mathematical integrity of the graph intact, thus it is not an approximation method. We embed this systematic way to improve the peak memory footprint and the search space as identity graph rewriting, and we address this technique in Section 3.3.


Figure 4: Overall workflow of SERENITY, memory-aware scheduling of irregularly wired neural networks: the Identity Graph Rewriter rewrites the graph to alleviate the activation memory footprint of the graph, the Dynamic Programming-based Scheduler finds a memory-optimal schedule given an input graph, and Adaptive Soft Budgeting adaptively manages a soft budget (flag = 'no solution' / 'timeout' / 'solution') to speed up scheduling.

3 SERENITY: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS

As discussed in Section 2, the objective is reducing the peak memory footprint while executing irregularly wired neural networks. We propose SERENITY, memory-aware scheduling that targets devices with restricted resources (e.g., edge devices). Figure 4 summarizes the overall scheduling process, highlighting the major contributions of our approach. The input to SERENITY is a graph of an irregularly wired neural network G, which in fact acts as an intermediate representation (IR) during the scheduling process. We augment this IR with the metadata of the nodes such as the operation type, input/output edges, input/output shapes, and memory cost. Then the graph rewriter transforms the graph G → G′ to relax the memory costs of memory-intensive patterns with the goal of reducing the peak memory footprint μ_peak of G. SERENITY schedules the graph to an optimal schedule s* using the dynamic programming-based scheduler. However, since the scheduling may be slow due to the complexity, we scale down the search space by leveraging divide-and-conquer, which partitions the graph into multiple subgraphs. Then, we augment the scheduler with an adaptive soft budgeting, which prunes suboptimal paths by adaptively finding a budget for thresholding through a swift meta-search, to speed up the scheduling process. This section focuses on the innovations of SERENITY: dynamic programming-based scheduling, divide-and-conquer, adaptive soft budgeting, and graph rewriting, which are explained in detail in Sections 3.1, 3.2, and 3.3, respectively.

3.1 Dynamic Programming-based Scheduling: Achieving Optimal Peak Memory Footprint

Our goal for the scheduling algorithm is to minimize the peak memory footprint μ_peak(s, G). As stated in Section 2.3, recursive algorithms that cover the entire search space S, or the subspace of all topological orderings S_T ⊂ S, take an impractically long time. This is primarily due to the repetitive re-computation of subproblems that upper bounds the algorithm by O(|V|!). Therefore, we leverage dynamic programming (Bellman, 1961; 1966; Held & Karp, 1962), which includes a memoization scheme that has been shown to be effective in reducing the complexity of time-intensive algorithms by reusing solutions from their subproblems, while still finding the optimal solution by sweeping the entire search space.

Identifying a signature to enable dynamic programming. The first step to applying dynamic programming to a new problem is characterizing the structure of an optimal solution s* = s*_n (s*_n is an optimal solution for n nodes). Then, it requires identifying a recursive relationship between the optimal solution of a subproblem s*_i and the original problem s*_{i+1}, and we do this by analyzing the straightforward recursive topological ordering, which, while inefficient, sweeps the entire search space. In essence, a topological ordering algorithm is a repeated process of identifying a set of nodes that are available for scheduling and iterating over that set for recursion. In graph theory, such a set of nodes available for scheduling is called a zero-indegree set z, where z is a set of nodes for which all of their incoming edges and the corresponding predecessor nodes (indegree) have been scheduled. Figure 5 demonstrates the recursion tree of the different topological ordering algorithms, where the height of the tree is the search step and every path from the root to a leaf is a topological ordering s ∈ S_T. The figure highlights the redundant z in the recursive topological ordering in the recursion tree, then merges these z to make them unique, identifying them as the signature for repetition and preventing the aforementioned re-computation. This makes the scheduling for z into a unique subproblem that constitutes the dynamic programming-based topological ordering.
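A small sketch of the signature idea follows, using the same hypothetical graph representation as the earlier sketch: any two partial schedules that have scheduled the same set of nodes leave behind the same zero-indegree frontier, so that set can serve as a memoization key.

```python
def zero_indegree(scheduled, succs):
    """Return the zero-indegree set z after the nodes in `scheduled` have run."""
    preds = {n: set() for n in succs}
    for n, outs in succs.items():
        for m in outs:
            preds[m].add(n)
    done = set(scheduled)
    return frozenset(n for n in succs if n not in done and preds[n] <= done)

succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# Two different partial schedules reach the same frontier, hence the same subproblem.
print(zero_indegree(["A", "B", "C"], succs) == zero_indegree(["A", "C", "B"], succs))  # True
```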

Figure 5: Illustration of identifying the redundant zero-indegree set z and making z unique throughout the topological ordering algorithm to reduce re-computation.


Figure 6: Visualization of scheduling the node u_8 = H during the search step i = 8. Starting from s_8, μ_8, and μ_peak,8, the figure shows how the algorithm calculates s_9, μ_9, and μ_peak,9.

Integrating the peak memory footprint constraint. On top of the dynamic programming formulation, which shows potential for optimizing the search space significantly, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint μ_{i+1} and its corresponding peak μ_peak,i+1 in each search step i to select the optimal path s*_{i+1} for memoization. Here, we clarify the process of a search step, explaining the details of calculating μ_peak,i+1 and saving s_{i+1} for each search step i. In each search step, we start with a number of unique zero-indegree sets z_i (signatures) saved in the i-th entry of the memoization M_i. For each z_i, we append the schedule up to that point, s_i, the sum of activations in the memory, μ_i, for the signature z_i, and the peak memory footprint of s_i, denoted μ_peak,i. Therefore, in each search step i, we start with s_i, μ_i, and μ_peak,i for s_i. Then, when we iterate over z_i to schedule a new node u_i, its output activation is appended to s_i to form s_{i+1} and is allocated in the memory. The size of u_i, the product ∏(u_i.shape), where shape is a property of the activation tensor that includes channels, height, width, and the precision (e.g., byte, float), is added to μ_i, so μ_{i+1} ← μ_i + ∏(u_i.shape).

Then, we use μ_{i+1} as μ_peak to update μ_peak,i+1 (the peak memory footprint for s_{i+1}). Since some predecessors of u_i will not be used anymore after allocating u_i, we update the outdegrees of the nodes by decrementing them. Having updated the outdegrees, we will be left with a zero-outdegree set that denotes the nodes that are ready for deallocation. We deallocate the nodes in the set and update μ_{i+1} accordingly.

To demonstrate scheduling of a node u_i, Figure 6 simulates scheduling a node u_8 = H in step i = 8. In the figure, (1) H is appended to s_8 and allocated to memory as it is scheduled, and then the scheduler records the maximum of μ_peak,8 and the sum of all activations in the memory at this point as μ_peak,9. Then it recalculates the outdegrees of the predecessor nodes of H: D and E's outdegrees are decremented from one to zero. (2) Then these nodes are deallocated, and the sum of the activation memory here is recorded as μ_9.

Algorithm 1: Dynamic Programming-based Scheduling

Input: graph G
Output: optimal schedule s*

# initialize memoization
s_0 ← [], μ_0, μ_peak,0 ← 0, z_0 ← zero-indegree(s_0, G)
M_0[z_0] ← (s_0, μ_0, μ_peak,0)
# iterate over search steps
for i = 0 to n−1 do
  # iterate over (schedule, current memory, peak memory)
  for z_i, (s_i, μ_i, μ_peak,i) in M_i do
    for u_i in z_i do
      s_{i+1} ← s_i.append(u_i)                        # allocate
      z_{i+1} ← zero-indegree(s_{i+1}, G)
      μ_{i+1}, μ_peak ← μ_i + ∏(u_i.shape)
      μ_peak,{i+1} ← max(μ_peak,i, μ_peak)
      for p_i in u_i.preds do
        if p_i is in zero-outdegree(s_{i+1}, G) then
          μ_{i+1} ← μ_{i+1} − ∏(p_i.shape)             # deallocate
        end if
      end for
      # memoize schedule with least peak memory
      if μ_peak,{i+1} ≤ M_{i+1}[z_{i+1}].μ_peak,{i+1} then
        M_{i+1}[z_{i+1}] ← (s_{i+1}, μ_{i+1}, μ_peak,{i+1})
      end if
    end for
  end for
end for
s*, μ*_peak ← M_n[·].s_n, M_n[·].μ_peak,n              # solution

Finding the schedule with optimal peak memory footprint. After scheduling u_i, we save the new signature into M_{i+1} for the next search step i+1. Since the goal of this work is to minimize the overall μ_peak, we identify the corresponding optimal schedule s*_{i+1} for each z_{i+1} by only saving the s_{i+1} with the minimum μ_peak,i+1. We integrate the aforementioned steps of scheduling u_i and updating M_{i+1} to complete the proposed dynamic programming-based scheduling algorithm. Algorithm 1 summarizes the algorithm. As a first step, the algorithm starts by initializing the memoization table M_0, then the algorithm iterates over the different search steps. In each search step i, the algorithm performs the above-illustrated memory allocation for all u_i in z_i, saving s_{i+1}, μ_{i+1}, and μ_peak,i+1. After iterating over all search steps up to n−1, s* is saved in M_n with a unique entry, for n being the number of nodes in G. We provide the proof for the optimality of the peak memory footprint in the supplementary material.

Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V| × 2^|V|), which is significantly faster than the exhaustive search of S_T with an upper bound complexity of O(|V|!).
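The following is a compact sketch of Algorithm 1 in Python, under the same hypothetical graph representation as the earlier sketches; it is illustrative rather than the authors' code. The memoization key is the set of already-scheduled nodes, and each entry keeps the best (schedule, current memory, peak memory) found for that signature.

```python
def dp_schedule(succs, size):
    """Dynamic programming-based scheduling: minimize peak activation memory."""
    preds = {n: set() for n in succs}
    for n, outs in succs.items():
        for m in outs:
            preds[m].add(n)
    nodes = set(succs)
    best = {frozenset(): ([], 0, 0)}            # signature -> (s, mu, mu_peak)
    for _ in range(len(nodes)):
        nxt = {}
        for done, (s, mu, mu_peak) in best.items():
            frontier = [n for n in nodes - done if preds[n] <= done]
            for u in frontier:
                mu_new = mu + size[u]           # allocate u's output
                peak_new = max(mu_peak, mu_new)
                done_new = frozenset(done | {u})
                for p in preds[u]:              # deallocate inputs whose consumers are all scheduled
                    if all(c in done_new for c in succs[p]):
                        mu_new -= size[p]
                if done_new not in nxt or peak_new < nxt[done_new][2]:
                    nxt[done_new] = (s + [u], mu_new, peak_new)
        best = nxt
    (s_opt, _, peak_opt), = best.values()       # single entry remains: all nodes scheduled
    return s_opt, peak_opt

succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
size = {"A": 4, "B": 2, "C": 2, "D": 1}
print(dp_schedule(succs, size))                 # an optimal schedule with peak 8
```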


Figure 7: Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).

Due to the space limitation, we present the derivation of the algorithm complexity in the supplementary material.

3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling

While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.

Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass shaped, because many NAS and Random Network Generators design cells with a single input and a single output, then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach can reduce the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time will decrease from |V| × 2^|V| to |V| × 2^(|V|/N), so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.

As such, Figure 7 shows that this insight can be extended to our problem setting, where we can first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage is partitioning the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule s_g (conquer). Considering that the number of nodes in the subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
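A minimal sketch of the combine step is shown below, assuming the graph has already been partitioned into single-input/single-output cells (as in Figure 7) and reusing a per-subgraph scheduler such as the dp_schedule sketch above, passed in as an argument. For brevity it ignores the boundary activation carried between consecutive cells, which a full implementation would account for.

```python
def schedule_by_cells(cells, scheduler):
    """cells: list of (succs, size) pairs, one per partitioned subgraph (divide)."""
    full_schedule, overall_peak = [], 0
    for succs, size in cells:
        s, peak = scheduler(succs, size)    # conquer: schedule each subgraph optimally
        full_schedule += s                  # combine: concatenate the sub-schedules
        overall_peak = max(overall_peak, peak)
    return full_schedule, overall_peak
```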

Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stage of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ*, while all other solutions can be discarded, setting some budget τ that is greater than or equal to μ* and pruning suboptimal schedules whose μ_peak exceeds τ can focus the search on a smaller search space S′_T ⊂ S_T while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increase τ) if a program fails due to stack overflow (= 'no solution' due to an overly aggressive pruning) and selling off excess memory (decrease τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme, applied while scheduling, to cut down the overall number of explored schedules. Figure 8 illustrates the overall idea by first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).

Figure 8(a) depicts a certain point while scheduling G where nodes G, H, F, and J can be scheduled.

Figure 8: Illustration of the adaptive soft budgeting. (a) While both paths s1 and s2 lead to the same z′, their μ and μ_peak vary, and we can prune schedules that yield a higher μ_peak than a given budget τ. (b) Adaptive soft budgeting starts by setting a hard budget τ_max as the maximum value for the soft budget τ, then conducts a binary search for a τ higher than τ* so that it finds a solution, yet not so high that scheduling fails to complete quickly.


In particular, the figure compares two possible solutions, s1 and s2, which schedule H → F and F → H, respectively, given τ = 36. While s1 and s2 both start from z with μ = 32, scheduling H leads to μ_peak = 32 + 3 (H) = 35, whereas scheduling F or J leads to μ_peak = 32 + 6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 will fail because their μ_peak = 38 exceeds 36. So, as long as we set the budget τ higher than μ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < μ* leads to no solution, because the optimal path would be pruned away.

Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to μ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)) to adaptively gain an idea of the range for τ. We use the peak memory footprint from this sequence as our hard budget τ_max, and in contrast, we call the adaptively changing τ a soft budget. Since τ_max ≥ μ*, we know that any τ ≥ τ_max does not need to be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that first runs the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τ_new ← (τ_new + τ_old)/2) if it finds 'no solution' and decreases τ (τ_new ← τ_new/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with τ.

Algorithm 2: Adaptive Soft Budgeting

Input: graph G
Output: optimal schedule s*

τ_max ← μ(KahnsAlgorithm(G), G)                        # hard budget
τ_old, τ_new ← τ_max
flag ← 'no solution'
repeat
  # binary search for τ: decrease τ if 'timeout'
  # and increase τ if 'no solution'
  if flag is 'timeout' then
    τ_old ← τ_new, τ_new ← τ_new/2                     # simultaneous
  else if flag is 'no solution' then
    τ_old ← τ_new, τ_new ← (τ_new + τ_old)/2           # simultaneous
  end if
  if flag is 'solution' then
    s* ← schedule                                      # optimal schedule
  end if
until flag is 'solution'

Figure 9: Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.


3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint

Reorganizing the computational graph of irregularly wired neural networks may lead to a significant reduction in the peak memory footprint μ_peak during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce μ_peak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.

Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint μ_peak occurs when the output y ∈ R^n is being computed while the concatenated branches of the input x ∈ R^n are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat, so that the partitioned conv can be computed as soon as the input x_i becomes available.


Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates and sums up the result of convolving channels in conv. However, using the distributive property of $\sum_i$ and ∗, these transform to a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from $\sum x_i + y$ to $\max(w_i \ast x_i) + y$, which becomes more effective when there are more incoming edges to concat.

$y = \left[\sum_i w_{1i} \ast x_i, \ldots, \sum_i w_{mi} \ast x_i\right]$ (concat+conv) \qquad (3)

$\phantom{y} = \sum_i \left[w_{1i} \ast x_i, \ldots, w_{mi} \ast x_i\right]$ \qquad (4)

$\phantom{y} = \sum_i \left[w_{1i}, \ldots, w_{mi}\right] \ast x_i$ \qquad (5)

$\phantom{y} = \sum_i \left[w_i \ast x_i\right]$ (partial conv+add) \qquad (6)
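As a quick numerical check of Equations 3-6, the sketch below uses PyTorch (an assumption made for illustration; any convolution backend would do) to confirm that a convolution over concatenated inputs equals the sum of channel-wise partial convolutions.

```python
import torch
import torch.nn.functional as F

c1, c2, m = 3, 5, 4                                   # input channel splits, output channels
x1 = torch.randn(1, c1, 8, 8)
x2 = torch.randn(1, c2, 8, 8)
W = torch.randn(m, c1 + c2, 3, 3)

y_concat = F.conv2d(torch.cat([x1, x2], dim=1), W, padding=1)        # concat + conv
y_partial = (F.conv2d(x1, W[:, :c1], padding=1)                      # partial conv + add
             + F.conv2d(x2, W[:, c1:], padding=1))
print(torch.allclose(y_concat, y_partial, atol=1e-5))                # True
```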

Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation yet achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat+conv case, the peak memory footprint μ_peak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among the different kernels to kernel-wise partition the depthconv that follows the concat, so that each input x_i is computed into smaller feature maps without residing in the memory for too long. As such, Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce μ_peak significantly.

$y = \left[w_1 \ast x_1, \ldots, w_n \ast x_n\right]$ (concat+depthconv) \qquad (7)

$\phantom{y} = \left[[w_1 \ast x_1], \ldots, [w_n \ast x_n]\right]$ (partial depthconv+concat) \qquad (8)
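Analogously, the following sketch (again assuming PyTorch purely for illustration) checks Equations 7-8: a depthwise convolution after a concat commutes with the concat once its kernels are partitioned per input branch.

```python
import torch
import torch.nn.functional as F

c1, c2 = 3, 5
x1, x2 = torch.randn(1, c1, 8, 8), torch.randn(1, c2, 8, 8)
W = torch.randn(c1 + c2, 1, 3, 3)                     # one kernel per channel (depthwise)

y_concat = F.conv2d(torch.cat([x1, x2], dim=1), W, padding=1, groups=c1 + c2)
y_partial = torch.cat([F.conv2d(x1, W[:c1], padding=1, groups=c1),
                       F.conv2d(x2, W[c1:], padding=1, groups=c2)], dim=1)
print(torch.allclose(y_concat, y_partial))            # True
```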

Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
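A toy sketch of such a matching pass is shown below; the Node structure and operator names are hypothetical stand-ins for whatever IR the compiler uses.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str
    inputs: List["Node"] = field(default_factory=list)

def find_rewrite_candidates(nodes):
    """Return (concat, conv) pairs matching the partitioning patterns of Section 3.3."""
    return [(inp, n) for n in nodes
            for inp in n.inputs
            if n.op in ("conv", "depthwise_conv")
            and inp.op == "concat" and len(inp.inputs) > 1]

x1, x2 = Node("relu"), Node("relu")
cat = Node("concat", [x1, x2])
conv = Node("conv", [cat])
print([(c.op, n.op) for c, n in find_rewrite_candidates([x1, x2, cat, conv])])  # [('concat', 'conv')]
```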

Table 1: Specification of the networks used for evaluation.

NETWORK    TYPE   DATASET    # MAC     # WEIGHT   TOP-1 ACCURACY
DARTS      NAS    ImageNet   574.0M    4.7M       73.3%
SwiftNet   NAS    HPD        57.4M     249.7K     95.1%
RandWire   RAND   CIFAR10    111.0M    1.2M       93.6%
RandWire   RAND   CIFAR100   160.0M    4.7M       74.5%

4 EVALUATION

We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also experiment with the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.

4.1 Methodology

Benchmarks and datasets. Table 1 lists the details of the networks, representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human presence detection dataset. The RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (# MAC), number of parameters (# WEIGHT), and top-1 accuracy on their respective dataset.

4.2 Experimental Results

Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figures illustrate that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph.

¹ TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc


Figure 10: Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy), for dynamic programming with the memory allocator and for dynamic programming with graph rewriting and the memory allocator (geometric means of 1.68× and 1.86×, respectively).

Figure 11: Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy), swept over 32KB, 64KB, 128KB, and 256KB on-chip memory configurations. In several configurations only SERENITY fits on-chip, and in some cases SERENITY removes the off-chip communication entirely.

In addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.

Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (N/A in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort of reducing the memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, and hence the power consumption and inference speed.
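For intuition, a simplified sketch of such a clairvoyant measurement is given below (not the paper's evaluation code): given the full access trace implied by a schedule, Belady's policy evicts the resident tensor whose next use is farthest in the future. Unit-sized tensors and a capacity counted in tensors are simplifying assumptions.

```python
def belady_offchip_traffic(trace, capacity):
    """Count off-chip fetches under Belady's optimal (clairvoyant) replacement."""
    resident, traffic = set(), 0
    for i, t in enumerate(trace):
        if t in resident:
            continue                      # on-chip hit
        traffic += 1                      # fetch from off-chip
        if len(resident) >= capacity:
            def next_use(r):              # next reuse position of a resident tensor
                later = [j for j in range(i + 1, len(trace)) if trace[j] == r]
                return later[0] if later else float("inf")
            resident.remove(max(resident, key=next_use))
        resident.add(t)
    return traffic

trace = ["x1", "w1", "x1", "x2", "w2", "x1", "x2"]    # hypothetical access trace
print(belady_offchip_traffic(trace, capacity=2))      # 5
```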

Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory allocation.

Figure 12: Memory footprint over time while running SwiftNet Cell A with and without the memory allocator: (a) memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB); (b) memory footprint without the memory allocator.

The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB → 250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB → 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. Then it shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB → 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.

Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with the graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.

Speedup from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speedup from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of the algorithms, the number of nodes, and the corresponding scheduling time.


Figure 13: Scheduling time evaluation for SERENITY across the evaluated cells, with and without graph rewriting (means of 40.6 secs and 48.8 secs, respectively).

Table 2: Comparison of the scheduling time for different algorithms to schedule SwiftNet. (1), (2), and (3) represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.

GRAPH REWRITING   ALGORITHM      # NODES AND PARTITIONS   SCHEDULING TIME
No                (1)            62 = 62                  N/A
No                (1)+(2)        62 = 21+19+22            565.3 secs
No                (1)+(2)+(3)    62 = 21+19+22            37.9 secs
Yes               (1)            92 = 92                  N/A
Yes               (1)+(2)        92 = 33+28+29            7.2 hours
Yes               (1)+(2)+(3)    92 = 33+28+29            111.9 secs

A straightforward implementation of the aforementioned (1) dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, additional application of the (2) divide-and-conquer ((1)+(2)) leads to a measurable scheduling time: 565.3 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying (3) adaptive soft budgeting ((1)+(2)+(3)) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.

5 RELATED WORKS

The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention and aims to enable their execution on memory-constrained edge devices.

Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there have been limited studies about the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.

Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.

Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, still remains orthogonal to these inspiring efforts.

6 CONCLUSION

As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.


ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.

Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.

Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.

Bellman, R. Dynamic programming. Science, 1966.

Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.

Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.

Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.

Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.

Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.

Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.

Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.

Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.

Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.

Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.

Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.


Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.

Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.

Kahn, A. B. Topological sorting of large networks. CACM, 1962.

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.

Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.

Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.

Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.

NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.

Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.

Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.

Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.

Zhu, M. and Gupta, S. To prune or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.


A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

[Figure 14: two scatter plots of Top-1 ImageNet accuracy (%) versus (a) number of multiply-and-accumulate operations (billions) and (b) number of parameters (millions), comparing irregularly wired neural networks against regular topology neural networks; top-left is better.]

(a) ImageNet accuracy vs. number of multiply-and-accumulate operations
(b) ImageNet accuracy vs. number of parameters

Figure 14: ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

[Figure 15: bar chart of the peak memory footprint (KB) for each benchmark cell (DARTS Normal Cell on ImageNet; SwiftNet Cells A-C on the human presence / visual wake words dataset; RandWire Cells A-C on CIFAR10 and CIFAR100), comparing TensorFlow Lite, Dynamic Programming + Memory Allocator, and Dynamic Programming + Graph Rewriting + Memory Allocator; smaller is better.]

Figure 15: Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.

C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i, \mu_{peak,i})$ in $S_{T_i}$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.

Proof. If $i = 0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i \ge 1$, assume that a (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^* \in z_i$, and achieves $\mu^*$. In such a case, let $v_i$ be replaced with the (optimal) $u_i^*$, which will result in $\mu_{peak} \leftarrow \min\big(\mu_i + \prod v_i.\mathrm{shape},\ \mu_i + \prod u_i^*.\mathrm{shape}\big)$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\mathrm{shape}$ for all $p_i \in \big(u_i.\mathrm{preds} \cap \mathrm{zero\text{-}outdegree}(s_{i+1}, G)\big)$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime} \le \mu^*$ due to the $\min$ operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.

D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring $S_T$ and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we construct the graph $G$ in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.

[Figure 16: graph G with a single entry node A, a single exit node Z, and independent single-node branches (B, C, D, ..., W, X, Y) between them.]

Figure 16: Topology of G to demonstrate the upper bound complexity of each algorithm.

First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores $S_T$. Since there is a single entry node and a single exit node, there will be $|V| - 2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes $(|V|-2)!$ and the overall complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield:

$$
\begin{aligned}
&1 + 1\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 1 + \binom{|V|-2}{0}\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 2 + \sum_{i=0}^{|V|-2}\binom{|V|-2}{i}\times(|V|-2-i) \\
&= 2 + (|V|-2)\times 2^{|V|-3} \\
&\le (|V|-2)\times 2^{|V|-2} \quad \text{for } |V|\ge 4 \\
&\le |V|\times 2^{|V|}
\end{aligned}
$$

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V|\times 2^{|V|})$. By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
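To make this last comparison concrete, the following short derivation (added here as an illustrative sketch, not part of the original appendix) applies Stirling's approximation to the two bounds:

$$
|V|! \;\approx\; \sqrt{2\pi|V|}\left(\frac{|V|}{e}\right)^{|V|}
\quad\Longrightarrow\quad
\frac{|V|!}{|V|\times 2^{|V|}} \;\approx\; \frac{\sqrt{2\pi|V|}}{|V|}\left(\frac{|V|}{2e}\right)^{|V|} \;\longrightarrow\; \infty,
$$

since $|V|/(2e) > 1$ once $|V| \ge 6$. For example, at $|V| = 30$ the recursive topological ordering explores on the order of $30! \approx 2.7\times 10^{32}$ orderings, whereas $|V|\times 2^{|V|} \approx 3.2\times 10^{10}$, consistent with the earlier observation that exhaustively scheduling a DAG with merely 30 nodes takes days.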



in contrast to the previously streamlined execution of models with regular topology, renders conventional frameworks futile in taking these networks to the edge due to their large peak memory footprint. While the peak memory footprint is largely dependent on the scheduling of neurons, current deep learning compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Abadi et al., 2016; Paszke et al., 2019; Jia et al., 2014) rely on basic topological ordering algorithms that are oblivious to the peak memory footprint and instead focus on the orthogonal problem of tiling and kernel-level optimization. This paper is an initial step towards embedding the peak memory footprint as a first-grade constraint in deep learning schedulers to unleash the potential of the emergent irregularly wired neural networks. As such, this paper makes the following contributions:

(1) Memory-aware scheduling for irregularly wired neural networks. Scheduling for these networks is a topological ordering problem, which enumerates an intractably large space of possible schedules. We offer and leverage the insight that partial schedules lead to repeated subpaths for search, and use the graph properties to generate a signature for these repetitions while embedding a notion of the running memory usage. These signatures enable the use of Dynamic Programming as a basis for the optimization algorithm.
(2) Adaptive soft budgeting for tractable compilation time. Even with dynamic programming as the base, due to the sheer number of neurons and connections, the search space may remain too large (exponentially large) in practice. As such, we devise an Adaptive Soft Budgeting technique that uses a lightweight meta-search mechanism to find the appropriate memory budget for pruning the suboptimal paths. This technique aims to find an inflection point beyond which tighter budgets may lead to no solution and looser budgets prolong the scheduling substantially, putting the optimization in a position of questionable utility.
(3) Identity graph rewriting for enabling higher potential in memory reduction. Any scheduling algorithm, including ours, is still bound to the topology of the neural graph under compilation. To relax this intrinsic restriction, we devise an Identity Graph Rewriting scheme that exchanges subgraphs for ones leading to a lower memory footprint without altering the mathematical integrity of the neural network.

Results show that our adaptive scheduling algorithm improves the peak memory footprint for irregularly wired neural networks by 1.68x compared to TensorFlow Lite, the de facto framework for edge devices. Our graph rewriting technique provides an opportunity to lower the peak memory footprint by an additional 10.7%. Furthermore, our framework can even bring about a 1.76x reduction in off-chip traffic for devices with a multi-level memory hierarchy, and can even eliminate the traffic in some cases by confining the memory footprint below the on-chip memory capacity. These gains come at an average of less than one minute of extra compilation time.

(a) RandWire (b) SwiftNet

Figure 1: Architecture of network models from NAS and Random Network Generators. The topology of such networks includes distinctive irregular wirings between the nodes.

2 CHALLENGES AND OUR APPROACH

2.1 Irregularly Wired Neural Networks

Recent excitement in Automated Machine Learning (AutoML) (Feurer et al., 2015; Dean, 2017; He et al., 2018; Elthakeb et al., 2018; Wang et al., 2019; Laredo et al., 2019) aims to take the human out of the loop in developing machine learning systems. This includes Neural Architecture Search (NAS) (Zoph & Le, 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Network Generators (Xie et al., 2019; Wortsman et al., 2019) that focus on the automation of designing neural architectures. Figure 1 demonstrates that networks of this regime are characterized by their distinctive irregular graph topology, with much more irregular wirings (dataflow) compared to conventional networks with regular graph topology. This paper refers to these networks as irregularly wired neural networks.

[Figure 2: scatter plot of Top-1 ImageNet accuracy (%) versus number of multiply-and-accumulate operations (billions) for irregularly wired neural networks and regular topology neural networks; top-left is better.]

Figure 2: ImageNet accuracy vs. number of multiply-and-accumulate operations, where irregularly wired neural networks show higher performance for the same compute than regular topology neural networks. The plot for the number of parameters also displays a similar trend.


[Figure 3: (a) the graph of SwiftNet Cell A (concat and conv nodes); (b) cumulative distribution of the peak memory footprint (KB) over the different possible schedules, with a 250 KB constraint marked: 4.1% of schedules satisfy the constraint and 0.04% of schedules are optimal.]

(a) SwiftNet Cell A
(b) CDF of peak memory for different possible schedules

Figure 3: CDF of the peak memory footprint for the different possible schedules of a given irregularly wired neural network.

From the performance perspective, these networks have been shown to outperform manually designed architectures in terms of accuracy while using fewer resources. In fact, the majority of winning neural architectures in competitions with the primary goal of reducing resources (Gauen et al., 2017) rely on NAS, suggesting its effectiveness in that respect. Figure 2 plots the accuracy of different models given their computation. The figure clearly shows that the Pareto frontier of irregularly wired neural networks from NAS and Random Network Generators is better than that of the hand-designed models with regular topology. This indicates that the efficiency in terms of accuracy given fixed resources is better with the irregularly wired neural networks.

2.2 Challenges

Many existing compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Paszke et al., 2019; Abadi et al., 2016; Jia et al., 2014) rely on basic topological ordering algorithms to schedule the graph. While the current approach may be sufficient to run conventional networks on server-class machines, such a scheme may be unfit for running irregularly wired neural networks on resource-constrained edge devices. This is because, unlike running networks with regular topology, running irregular networks results in a varied range of memory footprints depending on the schedule. For instance, given the constraints of a representative edge device (SparkFun Edge: 250KB weight/activation memory and 60M MACs), Figure 3(b) shows that barely 4.1% of the schedules meet the hard memory constraint, while only 0.04% would achieve the optimal peak memory. In reality, such a limitation will prevent further exploration regarding the diversity and innovation of network design, and in order to allow the edge computing regime to take full advantage of irregularly wired neural networks, this limitation should be alleviated, if not removed.

2.3 Design Objectives

Scheduling algorithm. To address this issue, our work aims to find a schedule of nodes s* from the search space S that would minimize the peak memory footprint mu_peak. S enumerates all possible orderings of the nodes v in V, where V is the set of all nodes within a graph G:

$$
s^* = \operatorname*{argmin}_{s}\ \mu_{peak}(s, G) \quad \text{for } s \in S \tag{1}
$$

The most straightforward way to schedule is a brute-force approach, which just enumerates S and picks the one with the minimum peak memory footprint. While this extreme method may find an optimal solution, it is too costly in terms of time due to its immense complexity, Theta(|V|!), where |V| denotes the number of nodes in the graph. One way to improve is to narrow down the search space to focus only on the topological orderings S_T, a subset of S. However, this will still suffer from a complexity with an upper bound of O(|V|!) (it takes days to schedule a DAG with merely 30 nodes). In fact, previous works (Bruno & Sethi, 1976; Bernstein et al., 1989) already prove that optimal scheduling for DAGs is NP-complete. On another extreme are heuristics for topological ordering, such as Kahn's algorithm (Kahn, 1962), with complexity of O(|V|+|E|), where V and E are the number of nodes and edges. However, as demonstrated in Figure 3, such a method may yield a suboptimal schedule of nodes which will not run on the target hardware. To this end, we explore dynamic programming combined with adaptive soft budgeting for scheduling to achieve an optimal solution s* while keeping the graph constant, without adding too much overhead in terms of time. We explain our algorithms in depth in Sections 3.1 and 3.2.
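As a concrete illustration of the objective in Equation 1, the Python sketch below (our own toy example; the graph, node names, and activation sizes are hypothetical and not taken from the paper's benchmarks) computes mu_peak(s, G) for a schedule by simulating allocation of each node's output and deallocation of activations once all of their consumers have been scheduled, then brute-forces every topological ordering of a tiny graph:

from itertools import permutations

# A toy irregularly wired graph (hypothetical): node -> (activation size, successors).
graph = {
    "A": (1, ["B1", "C"]),
    "B1": (5, ["B2"]),
    "B2": (1, ["D"]),
    "C": (5, ["D"]),
    "D": (1, []),
}

def peak_memory(schedule, graph):
    """Simulate a schedule and return its peak activation memory footprint."""
    remaining_uses = {n: len(succs) for n, (_, succs) in graph.items()}
    live, current, peak = set(), 0, 0
    for node in schedule:
        size, _ = graph[node]
        current += size                        # allocate the output of `node`
        peak = max(peak, current)
        live.add(node)
        for pred, (psize, psuccs) in graph.items():
            if pred in live and node in psuccs:
                remaining_uses[pred] -= 1
                if remaining_uses[pred] == 0:  # every consumer has been scheduled
                    current -= psize           # deallocate the predecessor
                    live.discard(pred)
    return peak

def is_topological(order, graph):
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[n] < pos[s] for n, (_, succs) in graph.items() for s in succs)

orders = [p for p in permutations(graph) if is_topological(p, graph)]
print(min(peak_memory(o, graph) for o in orders),
      max(peak_memory(o, graph) for o in orders))   # 7 11 for this toy graph

Even on this five-node graph the peak ranges from 7 to 11 units depending on the order; that gap is what the scheduler described in Section 3 exploits at scale, where brute force is no longer feasible.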

Graph rewriting. Any scheduling algorithm, including ours, is intrinsically bounded by the graph topology. Therefore, we explore transforming the search space through graph rewriting (Plump, 1999). Graph rewriting is generally concerned with substituting a certain pattern in the graph with a different pattern to achieve a certain objective. For a computational dataflow graph, leveraging the distributive, associative, and commutative properties within the computation of the graph, graph rewriting can maintain the semantics while bringing significant improvements regarding some objective. For example, in general programs, sum_i log x_i can be represented as sum_{odd i} log x_i + sum_{even i} log x_i or log prod_i x_i, while x + x can be translated to x*2 or x<<1.

Likewise, we bring this insight to neural networks to find a set of possible transformations X that can rewrite the original graph G to a new graph G' that would also change our search space S to one with a lower peak memory footprint:

$$
\mathcal{X}^* = \operatorname*{argmin}_{\mathcal{X}}\ \mu_{peak}\big(s^*, \mathcal{X}(G)\big) \tag{2}
$$

We identify a set of candidate patterns for transformation, chi: g -> g' (g in G and g' in G'), which constitutes X. While transforming the graph, our method keeps the mathematical integrity of the graph intact, thus it is not an approximation method. We embed this systematic way to improve the peak memory footprint and the search space as identity graph rewriting, and we address this technique in Section 3.3.


[Figure 4: overall workflow: an input graph G passes through the Identity Graph Rewriter (rewrite the graph to alleviate the activation memory footprint), the Dynamic Programming-based Scheduler (find a memory-optimal schedule given an input graph), and Adaptive Soft Budgeting (adaptively manage the soft budget tau given the per-step time limit T, using the flags 'no solution' / 'timeout' / 'solution'), producing the schedule s.]

Figure 4: Overall workflow of SERENITY: memory-aware scheduling of irregularly wired neural networks.

3 SERENITY: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS

As discussed in Section 2, the objective is reducing the peak memory footprint while executing irregularly wired neural networks. We propose SERENITY, memory-aware scheduling that targets devices with restricted resources (e.g., edge devices). Figure 4 summarizes the overall scheduling process, highlighting the major contributions of our approach. The input to SERENITY is a graph of an irregularly wired neural network G, which in fact acts as an intermediate representation (IR) during the scheduling process. We augment this IR with the metadata of the nodes, such as the operation type, input/output edges, input/output shapes, and memory cost. Then the graph rewriter transforms the graph G -> G' to relax the memory costs of memory-intensive patterns with the goal of reducing the peak memory footprint mu_peak of G. SERENITY schedules the graph to an optimal schedule s* using the dynamic programming-based scheduler. However, since the scheduling may be slow due to the complexity, we scale down the search space by leveraging divide-and-conquer, which partitions the graph into multiple subgraphs. Then, we augment the scheduler with adaptive soft budgeting, which prunes suboptimal paths by adaptively finding a budget for thresholding through a swift meta-search, to speed up the scheduling process. This section focuses on the innovations of SERENITY: dynamic programming-based scheduling, divide-and-conquer, adaptive soft budgeting, and graph rewriting, which are explained in detail in Sections 3.1, 3.2, and 3.3, respectively.

3.1 Dynamic Programming-based Scheduling: Achieving Optimal Peak Memory Footprint

Our goal for the scheduling algorithm is to minimize the peak memory footprint mu_peak(s, G). As stated in Section 2.3, recursive algorithms that cover the entire search space S, or the subspace of all topological orderings S_T, take an impractically long time. This is primarily due to the repetitive re-computation of subproblems that upper bounds the algorithm by O(|V|!). Therefore, we leverage dynamic programming (Bellman, 1961; 1966; Held & Karp, 1962), which includes a memoization scheme that has been shown to be effective in reducing the complexity of time-intensive algorithms by reusing solutions from their subproblems, while still finding the optimal solution by sweeping the entire search space.

Identifying signature to enable dynamic programming. The first step to applying dynamic programming to a new problem is characterizing the structure of an optimal solution s* = s*_n (s*_n is an optimal solution for n nodes). Then it requires identifying a recursive relationship between the optimal solution of a subproblem s*_i and the original problem s*_{i+1}, and we do this by analyzing the straightforward recursive topological ordering, which, while inefficient, sweeps the entire search space. In essence, a topological ordering algorithm is a repeated process of identifying a set of nodes that are available for scheduling and iterating over that set for recursion. In graph theory, such a set of nodes available for scheduling is called a zero-indegree set z, where z is a set of nodes for which all of their incoming edges and the corresponding predecessor nodes (indegree) have been scheduled. Figure 5 demonstrates the recursion tree of the different topological ordering algorithms, where the height of the tree is the search step and every path from the root to a leaf is a topological ordering s in S_T. The figure highlights the redundant z in the recursive topological ordering, then merges these z to make them unique, identifying it as the signature for repetition and preventing the aforementioned re-computation. This makes the scheduling for z into a unique subproblem that constitutes the dynamic programming-based topological ordering.

[Figure 5: recursion trees over an example graph G, comparing recursive topological ordering (with redundant zero-indegree sets z across search steps) and dynamic programming-based topological ordering (with unique zero-indegree sets z kept for memoization).]

Figure 5: Illustration of identifying redundant zero-indegree sets z and making z unique (square) throughout the topological ordering algorithm to reduce re-computation.


[Figure 6: one search step on an example graph G: starting from s_8 = A, B, C, D, E, F, I, J with z_8 = {G, H}, the scheduler (0) starts from the current activation memory, (1) schedules and allocates u_8 = H (mu_peak,9 = max(mu_peak,8, mu_peak)), and (2) deallocates D and E once their outdegree drops from 1 to 0, yielding mu_9 and s_9 = A, B, C, D, E, F, I, J, H.]

Figure 6: Visualization of scheduling the node u_8 = H during the search step i = 8. Starting from s_8, mu_8, and mu_peak,8, the figure shows how the algorithm calculates s_9, mu_9, and mu_peak,9.

Integrating the peak memory footprint constraint. On top of the dynamic programming formulation that shows potential for optimizing the search space significantly, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint mu_{i+1} and its corresponding peak mu_peak,i+1 in each search step i to select the optimal path s*_{i+1} for memoization. Here we clarify the process of a search step, explaining the details of calculating mu_peak,i+1 and saving s_{i+1} for each search step i. In each search step, we start with a number of unique zero-indegree sets z_i (signatures) saved in the i-th entry of the memoization M_i. For each z_i, we keep the schedule up to that point s_i, the sum of activations in memory mu_i for the signature z_i, and the peak memory footprint of s_i, denoted mu_peak,i. Therefore, in each search step i, we start with s_i, mu_i, and mu_peak,i for s_i. Then, when we iterate over z_i to schedule a new node u_i, its output activation is appended to s_i to form s_{i+1} and is allocated in memory. The size of u_i, the product prod(u_i.shape), where shape is a property of the activation tensor that includes channels, height, width, and the precision (e.g., byte, float), is added to mu_i, so mu_{i+1} <- mu_i + prod(u_i.shape). Then we use mu_{i+1} as mu_peak to update mu_peak,i+1 (the peak memory footprint for s_{i+1}). Since some predecessors of u_i will not be used anymore after allocating u_i, we update the outdegrees of these nodes by decrementing them. Having updated the outdegrees, we will be left with a zero-outdegree set that denotes the nodes that are ready for deallocation. We deallocate the nodes in that set and update mu_{i+1} accordingly.

To demonstrate the scheduling of a node u_i, Figure 6 simulates scheduling the node u_8 = H at i = 8. In the figure, (1) H is appended to s_8 and allocated in memory as it is scheduled, and then the scheduler records the maximum of mu_peak,8 and the sum of all activations in memory at this point as mu_peak,9. Then it recalculates the outdegrees of the predecessor nodes of H: D and E's outdegrees are decremented from one to zero. (2) Then these nodes are deallocated, and the sum of the activation memory here is recorded as mu_9.

Algorithm 1 Dynamic Programming-based Scheduling
1: Input: graph G
2: Output: optimal schedule s*
3: // initialize memoization
4: s_0 <- [], mu_0, mu_peak,0 <- 0, z_0 <- zero-indegree(s_0, G)
5: M_0[z_0] <- (s_0, mu_0, mu_peak,0)
6: // iterate over search steps
7: for i = 0 to n-1 do
8:   // iterate over (schedule, current memory, peak memory)
9:   for z_i, (s_i, mu_i, mu_peak,i) in M_i do
10:    for u_i in z_i do
11:      s_{i+1} <- s_i.append(u_i)                     // allocate
12:      z_{i+1} <- zero-indegree(s_{i+1}, G)
13:      mu_{i+1}, mu_peak <- mu_i + prod(u_i.shape)
14:      mu_peak,i+1 <- max(mu_peak,i, mu_peak)
15:      for p_i in u_i.preds do
16:        if p_i is in zero-outdegree(s_{i+1}, G) then
17:          mu_{i+1} <- mu_{i+1} - prod(p_i.shape)      // deallocate
18:        end if
19:      end for
20:      // memoize schedule with the least peak memory
21:      if mu_peak,i+1 <= M_{i+1}[z_{i+1}].mu_peak,i+1 then
22:        M_{i+1}[z_{i+1}] <- (s_{i+1}, mu_{i+1}, mu_peak,i+1)
23:      end if
24:    end for
25:  end for
26: end for
27: s*, mu*_peak <- M_n[.].s_n, M_n[.].mu_peak,n         // solution

Finding the schedule with optimal peak memory footprint. After scheduling u_i, we save the new signature into M_{i+1} for the next search step i+1. Since the goal of this work is to minimize the overall mu_peak, we identify the corresponding optimal schedule s*_{i+1} for each z_{i+1} by only saving the s_{i+1} with the minimum mu_peak,i+1. We integrate the aforementioned steps of scheduling u_i and updating M_{i+1} to complete the proposed dynamic programming-based scheduling algorithm. Algorithm 1 summarizes the algorithm. As a first step, the algorithm starts by initializing the memoization table M_0, then the algorithm iterates over the different search steps. In each search step i, the algorithm performs the above-illustrated memory allocation for all u_i in z_i and saves s_{i+1}, mu_{i+1}, and mu_peak,i+1. After iterating over all search steps up to n-1, s* is saved in M_n with a unique entry, for n being the number of nodes in G. We provide the proof for the optimality of the peak memory footprint in the supplementary material.
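To complement the pseudo-code in Algorithm 1, the sketch below is a compact re-implementation of the same idea under our reading of the algorithm (an illustration, not the authors' released code). It reuses the `graph` encoding from the earlier sketch, memoizes on the zero-indegree set as the signature, and keeps only the least-peak schedule per signature:

def dp_schedule(graph):
    """Return (schedule, peak) minimizing peak activation memory (sketch of Algorithm 1)."""
    preds = {n: [] for n in graph}
    for n, (_, succs) in graph.items():
        for s in succs:
            preds[s].append(n)

    def zero_indegree(scheduled):
        return frozenset(n for n in graph if n not in scheduled
                         and all(p in scheduled for p in preds[n]))

    # memo: signature (zero-indegree set) -> (schedule, current memory, peak memory)
    memo = {zero_indegree(frozenset()): ((), 0, 0)}
    for _ in range(len(graph)):
        nxt = {}
        for z, (s, mu, peak) in memo.items():
            for u in z:
                s1 = s + (u,)
                scheduled = frozenset(s1)
                mu1 = mu + graph[u][0]                   # allocate u's output
                peak1 = max(peak, mu1)
                for p in preds[u]:                       # deallocate dead predecessors
                    if all(c in scheduled for c in graph[p][1]):
                        mu1 -= graph[p][0]
                z1 = zero_indegree(scheduled)
                if z1 not in nxt or peak1 < nxt[z1][2]:  # keep least-peak schedule per signature
                    nxt[z1] = (s1, mu1, peak1)
        memo = nxt
    (s, _, peak), = memo.values()   # after n steps a single signature (empty set) remains
    return list(s), peak

On the toy graph from the earlier sketch, dp_schedule returns the same optimal peak (7) as the brute force, while keeping only one partial schedule per (step, signature) pair instead of enumerating every ordering.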

Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V| x 2^|V|), which is significantly faster than the exhaustive search of S_T with an upper bound complexity of O(|V|!).


[Figure 7: an example graph is partitioned into subgraphs g_1 (A, B, C, D) and g_2 (E, F, G, H); each subgraph is scheduled to s_{g1} and s_{g2}, and the sub-schedules are concatenated into s.]

Figure 7: Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).

Due to the space limitation, we present the derivation of the algorithm complexity in the supplementary material.

3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling

While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.

Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass shaped, because many NAS and Random Network Generators design cells with a single input and a single output and then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach can reduce the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time will decrease from |V| x 2^|V| to |V| x 2^(|V|/N), so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.

As such, Figure 7 shows that this insight can be extended to our problem setting, where we can first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage is partitioning the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule s_g (conquer). Considering that the number of nodes in the subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
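A minimal sketch of this combine step is below (our own illustration: the `cells` list, the assumption that cells have already been partitioned, and the omission of the activation carried across cell boundaries are all simplifications, not the paper's implementation); it reuses dp_schedule from the earlier sketch:

def schedule_by_cells(cells):
    """Divide-and-conquer: schedule each cell independently, then concatenate."""
    full_schedule, overall_peak = [], 0
    for cell in cells:                     # divide: one subgraph per cell
        s, peak = dp_schedule(cell)        # conquer: optimal schedule per subgraph
        full_schedule.extend(s)            # combine: concatenate sub-schedules
        overall_peak = max(overall_peak, peak)
    return full_schedule, overall_peak

With N roughly equal cells, the exponential term in the scheduling time drops from 2^|V| to 2^(|V|/N), matching the reduction described above.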

Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stages of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget tau* = mu*, while all other solutions can be discarded, setting some budget tau that is greater than or equal to mu* and pruning suboptimal schedules whose mu_peak exceeds tau can focus the search on a smaller search space S'_T (a subset of S_T) while still achieving the optimal schedule s*. On top of this, we develop a meta-search for tau. This is inspired by engineers buying a larger memory (increasing tau) if a program fails due to stack overflow (= 'no solution' due to overly aggressive pruning) and selling off excess memory (decreasing tau) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme while scheduling, to cut down the overall number of explored schedules. Figure 8 illustrates the overall idea by first showing how some schedules are pruned with regard to a given budget tau in Figure 8(a), then the implication of different tau on scheduling time in Figure 8(b).

Figure 8(a) depicts a certain point while scheduling G where the nodes G, H, F, and J can be scheduled.

[Figure 8: (a) a partial recursion tree over an example graph G where z = {G, H, F, J} with mu = 32, annotated with mu next to each node and mu_peak next to each edge; schedules whose mu_peak exceeds the budget tau = 36 are pruned. (b) the relation between the soft budget tau and the number of explored schedules (hence scheduling time), bounded by the optimal budget tau* and the hard budget tau_max.]

(a) While paths s_1 and s_2 both lead to the same z', their mu and mu_peak vary, and we can prune schedules that yield a higher mu_peak than a given budget tau. Numbers next to a box or circle are mu, and numbers next to edges are mu_peak.

(b) Adaptive soft budgeting starts by setting a hard budget tau_max as the maximum value for the soft budget tau. It then conducts a binary search for a tau higher than tau*, so that it finds a solution, yet not too high, so that scheduling completes quickly.

Figure 8: Illustration of adaptive soft budgeting: (a) shows how schedules are pruned and (b) illustrates how the soft budget tau relates to the number of explored schedules.


In particular, the figure compares two possible solutions s_1 and s_2, which schedule H then F and F then H, respectively, given tau = 36. While s_1 and s_2 both start from z with mu = 32, scheduling H leads to mu_peak = 32+3 (H) = 35, whereas scheduling F or J leads to mu_peak = 32+6 (F or J) = 38. Therefore, since we assume tau = 36, s_2 and s_3 will fail because their mu_peak = 38 exceeds 36. So, as long as we set the budget tau higher than mu*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a tau < mu* leads to no solution, because the optimal path would be pruned away.

Having established the possibility of pruning, our question boils down to discovering a tau that is greater than or equal to mu*, which we call an optimal budget tau*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for tau, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)) to adaptively gain an idea of the range for tau. We use the peak memory footprint of this sequence as our hard budget tau_max and, in contrast, we call the adaptively changing tau a soft budget. Since tau_max >= mu*, we know that any tau >= tau_max does not need to be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that first runs the scheduling algorithm with tau and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases tau (tau_new <- (tau_new + tau_old)/2) if it finds 'no solution' and decreases tau (tau_new <- tau_new/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with tau.

Algorithm 2 Adaptive Soft Budgeting
1: Input: graph G
2: Output: optimal schedule s*
3: tau_max <- mu(KahnsAlgorithm(G), G)     // hard budget
4: tau_old, tau_new <- tau_max
5: flag <- 'no solution'
6: repeat
7:   // binary search for tau: decrease tau if 'timeout'
8:   // and increase tau if 'no solution'
9:   if flag is 'timeout' then
10:    // simultaneously:
11:    tau_old <- tau_new, tau_new <- tau_new/2
12:  else if flag is 'no solution' then
13:    // simultaneously:
14:    tau_old <- tau_new, tau_new <- (tau_new + tau_old)/2
15:  end if
16:  if flag is 'solution' then
17:    s* <- schedule                        // optimal schedule
18:  end if
19: until flag is 'solution'
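In code, the meta-search is a thin wrapper around a budgeted scheduler. The sketch below mirrors Algorithm 2 under our assumptions: budgeted_schedule(G, tau, T) is a hypothetical routine (not shown) that prunes any partial schedule whose mu_peak exceeds tau, aborts a search step that runs longer than T, and returns ('solution', s), ('timeout', None), or ('no solution', None); peak_memory is the helper from the earlier sketch.

from collections import deque

def kahn_topological_order(graph):
    """Kahn's algorithm: any valid topological order, used only to obtain the hard budget."""
    indeg = {n: 0 for n in graph}
    for _, (_, succs) in graph.items():
        for s in succs:
            indeg[s] += 1
    ready = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in graph[n][1]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

def adaptive_soft_budgeting(graph, budgeted_schedule, step_time_limit):
    """Binary-search a soft budget tau between mu* and the hard budget (sketch of Algorithm 2)."""
    tau_max = peak_memory(kahn_topological_order(graph), graph)   # hard budget
    tau_old = tau_new = tau_max
    while True:
        flag, schedule = budgeted_schedule(graph, tau_new, step_time_limit)
        if flag == "solution":
            return schedule                          # optimal schedule within the budget
        if flag == "timeout":                        # too little pruning: tighten the budget
            tau_old, tau_new = tau_new, tau_new / 2
        elif flag == "no solution":                  # pruned the optimum away: relax the budget
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2

The tuple assignments perform the "simultaneous" updates of Algorithm 2, so the relaxation step averages the current budget with the previous one rather than with its own halved value.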

[Figure 9: before/after diagrams of the two rewriting patterns: concat+conv rewritten into partial conv+add (channel-wise partitioning) and concat+depthwise conv rewritten into partial depthwise conv+concat (kernel-wise partitioning), together with the corresponding peak memory expressions; x_i denotes the i-th input, y the output, and w_ij the j-th channel of the i-th kernel.]

Figure 9: Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.

3.3 Identity Graph Rewriting: Improving the Search Space for a Better Peak Memory Footprint

Reorganizing the computational graph of irregularly wired neural networks may lead to a significant reduction in the peak memory footprint mu_peak during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce mu_peak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.

Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [.]) that takes multiple branches of the input prior to a convolution (conv: *). While executing such a pattern, the peak memory footprint mu_peak occurs when the output y is being computed while the concatenated branches of the input x are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding, seemingly excessive, memory cost. To this end, we channel-wise partition the conv that follows the concat so that the partitioned conv can be computed as soon as the input x_i becomes available.


Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates over and sums up the result of convolving channels in conv. However, using the distributive property of the summation and *, these transform to a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes the concat from the graph, leading to a lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from sum(x_i) + y to max(w_i * x_i) + y, which becomes more effective when there are more incoming edges to the concat.

$$
\begin{aligned}
y &= \Big[\textstyle\sum_i w_{1i} * x_i,\ \dots,\ \textstyle\sum_i w_{mi} * x_i\Big] &&\text{(concat+conv)} &&(3)\\
  &= \textstyle\sum_i \big[\,w_{1i} * x_i,\ \dots,\ w_{mi} * x_i\,\big] && &&(4)\\
  &= \textstyle\sum_i \big[\,w_{1i},\ \dots,\ w_{mi}\,\big] * x_i && &&(5)\\
  &= \textstyle\sum_i \big[\,w_i * x_i\,\big] &&\text{(partial conv+add)} &&(6)
\end{aligned}
$$
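The identity behind Equations 3-6 is easy to check numerically. The snippet below is an illustrative 1x1-convolution check that we add here (it is not from the paper; the branch and channel counts are arbitrary): it verifies that summing partial convolutions of the branches reproduces the convolution over the concatenated input.

import numpy as np

# Toy setting: three branches with 2, 3, and 4 channels, m = 5 output channels,
# 1x1 convolution, so each "convolution" is a channel-mixing matrix multiply.
rng = np.random.default_rng(0)
branches = [rng.standard_normal((c, 8, 8)) for c in (2, 3, 4)]       # x_i
weights  = [rng.standard_normal((5, c)) for c in (2, 3, 4)]          # w_i, split channel-wise

# concat + conv: y = W * concat(x_1, x_2, x_3)
x = np.concatenate(branches, axis=0)                                  # (9, 8, 8)
W = np.concatenate(weights, axis=1)                                   # (5, 9)
y_concat_conv = np.tensordot(W, x, axes=([1], [0]))                   # (5, 8, 8)

# partial conv + add: y = sum_i w_i * x_i, each branch consumed as soon as it is ready
y_partial = sum(np.tensordot(w, b, axes=([1], [0]))
                for w, b in zip(weights, branches))

print(np.allclose(y_concat_conv, y_partial))   # True: same output, lower peak footprint

The same style of check applies to Equations 7-8 by keeping the per-branch outputs separate and concatenating them instead of summing.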

Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat+conv case, the peak memory footprint mu_peak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among the different kernels to kernel-wise partition the depthconv that follows the concat, so that each input x_i is computed into smaller feature maps without residing in the memory for too long. As such, Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce mu_peak significantly.

$$
\begin{aligned}
y &= \big[\,w_1 * x_1,\ \dots,\ w_n * x_n\,\big] &&\text{(concat+depthconv)} &&(7)\\
  &= \big[\,[w_1 * x_1],\ \dots,\ [w_n * x_n]\,\big] &&\text{(partial depthconv+concat)} &&(8)
\end{aligned}
$$

Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to a lower memory cost.
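As a rough illustration of that pattern-matching step (our own sketch: the `nodes` encoding and operator names are assumptions for illustration, not SERENITY's actual intermediate representation), a rewriter only needs to locate a concat whose sole consumer is a conv or a depthconv and replace the pair with per-branch partial operations followed by an add or a concat:

def rewrite_concat_conv(nodes):
    """Replace concat->conv / concat->depthconv pairs with partial per-branch ops.

    `nodes` maps node id -> {"op": str, "inputs": [ids]}; returns a rewritten copy.
    """
    consumers = {}
    for nid, n in nodes.items():
        for i in n["inputs"]:
            consumers.setdefault(i, []).append(nid)

    rewritten = dict(nodes)
    for nid, n in nodes.items():
        if n["op"] != "concat" or len(consumers.get(nid, [])) != 1:
            continue
        succ = consumers[nid][0]
        if rewritten[succ]["op"] not in ("conv", "depthconv"):
            continue
        # one partial op per concatenated branch, combined by add (conv) or concat (depthconv)
        partial_ids = []
        for k, branch in enumerate(n["inputs"]):
            pid = f"{succ}_partial{k}"
            rewritten[pid] = {"op": "partial_" + rewritten[succ]["op"], "inputs": [branch]}
            partial_ids.append(pid)
        combine = "add" if rewritten[succ]["op"] == "conv" else "concat"
        rewritten[succ] = {"op": combine, "inputs": partial_ids}
        del rewritten[nid]                       # the original concat disappears
    return rewritten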

Table 1: Specification of the networks used for evaluation.

NETWORK  | TYPE | DATASET  | # MAC  | # WEIGHT | TOP-1 ACCURACY
DARTS    | NAS  | ImageNet | 574.0M | 4.7M     | 73.3
SwiftNet | NAS  | HPD      | 57.4M  | 249.7K   | 95.1
RandWire | RAND | CIFAR10  | 111.0M | 1.2M     | 93.6
RandWire | RAND | CIFAR100 | 160.0M | 4.7M     | 74.5

4 EVALUATION

We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme(1) for both. Furthermore, we also experiment with the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.

4.1 Methodology

Benchmarks and datasets. Table 1 lists the details of the networks, representative of irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, only the first cell, because it has the highest peak memory footprint and the rest of the network is just a repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (# MAC), number of parameters (# WEIGHT), and top-1 accuracy on their respective datasets.

4.2 Experimental Results

Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of the reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68x without any changes to the graph.

(1) TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc


[Figure 10: bar chart of the reduction in peak memory over TensorFlow Lite (higher is better) for each benchmark cell, comparing Dynamic Programming + Memory Allocator (geomean 1.68x) and Dynamic Programming + Graph Rewriting + Memory Allocator (geomean 1.86x).]

Figure 10: Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).

[Figure 11: bar chart of the reduction in off-chip memory communication over TensorFlow Lite for on-chip memory sizes of 32KB, 64KB, 128KB, and 256KB; bars are marked where only SERENITY fits on-chip or where SERENITY removes off-chip communication entirely, and N/A where the workload already fits on-chip.]

Figure 11: Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).

In addition, the proposed graph rewriting technique yields an average of 1.86x (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.

Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication, to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76x for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (N/A in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort of reducing the memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, hence the power consumption and inference speed.
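Because the schedule is known ahead of time, this measurement can be reproduced with Belady's clairvoyant policy: on a miss, evict the on-chip tensor whose next use is farthest in the future. The sketch below is our simplified illustration (tensor-granularity residency, fetch-only traffic accounting, and the `accesses`/`sizes` encoding are assumptions of this sketch, not the paper's measurement harness):

def offchip_traffic(accesses, sizes, capacity):
    """Belady's optimal (clairvoyant) replacement at tensor granularity.

    accesses: ordered list of tensor ids touched by the schedule
    sizes:    tensor id -> size in bytes
    capacity: on-chip memory size in bytes
    Returns the number of bytes fetched from off-chip memory.
    """
    def next_use(tensor, t):
        future = accesses[t + 1:]
        return future.index(tensor) if tensor in future else float("inf")

    on_chip, used, traffic = set(), 0, 0
    for t, tensor in enumerate(accesses):
        if tensor in on_chip:
            continue                                   # hit: no off-chip transfer
        traffic += sizes[tensor]                       # miss: fetch from off-chip
        while used + sizes[tensor] > capacity and on_chip:
            victim = max(on_chip, key=lambda x: next_use(x, t))
            on_chip.remove(victim)                     # evict the farthest next use
            used -= sizes[victim]
        on_chip.add(tensor)
        used += sizes[tensor]
    return traffic

Lower peak footprints give this policy more slack, which is why the scheduling gains translate into the off-chip traffic reductions reported in Figure 11.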

Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory allocation.

[Figure 12: memory footprint (KB) over time while running SwiftNet Cell A, comparing Dynamic Programming + Memory Allocator with Dynamic Programming + Graph Rewriting + Memory Allocator; graph rewriting reduces the peak by 25.1KB with the memory allocator and by 12.5KB without it.]

(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB).

(b) Memory footprint without the memory allocator.

Figure 12: Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes the reduction).

The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB to 250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB to 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. Then it shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB to 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.

Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.

[Figure 13: Scheduling time evaluation for SERENITY. Scheduling time in seconds (log scale) per benchmark cell for Dynamic Programming+Memory Allocator and Dynamic Programming+Graph Rewriting+Memory Allocator; the respective means are 40.6 secs and 48.8 secs.]

Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists the different combinations of algorithms, the number of nodes and partitions, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time, regardless of the graph rewriting. However, additionally applying the ② divide-and-conquer (① + ②) leads to a measurable scheduling time: 565.3 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying the ③ adaptive soft budgeting (① + ② + ③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.

Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.

GRAPH REWRITING    ALGORITHM      NODES AND PARTITIONS    SCHEDULING TIME
✗                  ①              62 = 62                 N/A
✗                  ① + ②          62 = 21+19+22           565.3 secs
✗                  ① + ② + ③      62 = 21+19+22           37.9 secs
✓                  ①              92 = 92                 N/A
✓                  ① + ②          92 = 33+28+29           7.29 hours
✓                  ① + ② + ③      92 = 33+28+29           111.9 secs
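For reference, a minimal sketch of the adaptive soft budgeting meta-search follows (our own rendering; `kahn_peak` and `dp_schedule_with_budget` are assumed helpers standing in for the Kahn's-algorithm baseline and for the dynamic programming scheduler that prunes any partial schedule whose peak exceeds the soft budget and reports 'solution', 'no solution', or 'timeout').

```python
# Minimal sketch of the adaptive soft budgeting meta-search (assumed helpers).
def adaptive_soft_budgeting(graph, step_time_limit):
    tau_max = kahn_peak(graph)       # hard budget: peak of a cheap heuristic schedule
    tau_old = tau_new = tau_max
    while True:
        flag, schedule = dp_schedule_with_budget(graph, tau_new, step_time_limit)
        if flag == 'solution':
            return schedule                              # pruned search found the optimum
        if flag == 'timeout':                            # too little pruning
            tau_old, tau_new = tau_new, tau_new / 2      # lower the soft budget
        elif flag == 'no solution':                      # pruned too aggressively
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2   # raise it back up
```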

5 RELATED WORKS

The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning almost from its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class of networks that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.

Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially under memory constraints, is an emerging problem, which is the focus of this paper.

Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.

Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.

6 CONCLUSION

As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, identity graph rewriting was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.


ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.
Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.
Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.
Bellman, R. Dynamic programming. Science, 1966.
Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.
Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.
Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.
Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.
Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.
Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.
Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.
Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.
Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.
Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.
Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
Kahn, A. B. Topological sorting of large networks. CACM, 1962.
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.
Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.
Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.
Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.


A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

[Figure 14: ImageNet top-1 accuracy vs. (a) number of multiply-and-accumulate operations and (b) number of parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks (top left is better).]

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

[Figure 15: Peak memory footprint (KB) of running irregularly wired neural networks on SERENITY and TensorFlow Lite (smaller is better):

NETWORK (CELL)               TENSORFLOW LITE   DP+MEMORY ALLOCATOR   DP+GRAPH REWRITING+MEMORY ALLOCATOR
DARTS ImageNet (Normal)      1656              903                   753
SwiftNet HPD (Cell A)        552               251                   226
SwiftNet HPD (Cell B)        194               82                    72
SwiftNet HPD (Cell C)        70                33                    20
RandWire CIFAR10 (Cell A)    645               459                   459
RandWire CIFAR10 (Cell B)    330               260                   260
RandWire CIFAR100 (Cell A)   605               359                   359
RandWire CIFAR100 (Cell B)   350               280                   280
RandWire CIFAR100 (Cell C)   160               115                   115]

C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here, we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i,\mu_{peak,i})$ in $S_{T_i}$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.

Proof. If $i=0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i\ge1$, assume that a (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^*\in z_i$, and achieves $\mu^*$. In such a case, let $v_i$ be replaced with the (optimal) $u_i^*$, which will result in $\mu_{peak}\leftarrow\min(\mu_i+\prod v_i.\mathrm{shape},\ \mu_i+\prod u_i^*.\mathrm{shape})$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\mathrm{shape}$ for all $p_i\in(u_i.\mathrm{preds}\cap\mathrm{zero\text{-}outdegree}(s_{i+1},G))$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime}\le\mu^*$ due to the $\min$ operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.
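As a small, self-contained sanity check of Theorem 1 (our own, on a toy graph, not part of the paper's evaluation), the snippet below compares brute-force enumeration of all topological orderings against a dynamic program memoized on the set of already-scheduled nodes, which corresponds one-to-one with the zero-indegree set used above; both report the same optimal peak footprint.

```python
from itertools import permutations

# A toy graph (assumed for illustration): node -> (activation size, successors).
G = {'A': (1, ['B', 'C']), 'B': (5, ['E']), 'C': (2, ['D']),
     'E': (1, ['D']), 'D': (1, [])}
preds = {n: [p for p in G if n in G[p][1]] for n in G}

def live_footprint(done):
    # a scheduled node's output stays live until all of its consumers are scheduled;
    # outputs with no consumers are kept
    return sum(G[n][0] for n in done
               if not G[n][1] or any(s not in done for s in G[n][1]))

def peak(order):
    done, worst = set(), 0
    for n in order:
        # peak is measured right after allocating n, before freeing dead inputs
        worst = max(worst, live_footprint(done) + G[n][0])
        done.add(n)
    return worst

def is_topological(order):
    seen = set()
    for n in order:
        if any(p not in seen for p in preds[n]):
            return False
        seen.add(n)
    return True

brute = min(peak(o) for o in permutations(G) if is_topological(o))

# dynamic program: memoize the best peak for each set of already-scheduled nodes
memo = {frozenset(): 0}
for _ in G:
    nxt = {}
    for done, best in memo.items():
        ready = [n for n in G if n not in done and all(p in done for p in preds[n])]
        for u in ready:
            new = frozenset(done | {u})
            cand = max(best, live_footprint(done) + G[u][0])
            nxt[new] = min(nxt.get(new, float('inf')), cand)
    memo = nxt
assert brute == memo[frozenset(G)]   # both find the same optimal peak footprint
```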

D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring $S_T$ and our dynamic programming-based scheduling. While both algorithms list candidate schedules and calculate their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we construct the graph G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.

[Figure 16: Topology of G used to demonstrate the upper bound complexity of each algorithm: a single entry node A, a single exit node Z, and independent branch nodes B, C, D, ..., W, X, Y in between.]

First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores $S_T$. Since there is a single entry node and a single exit node, there will be $|V|-2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes $(|V|-2)!$ and the overall complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield:

$$
\begin{aligned}
&\;1+1\times(|V|-2)+\binom{|V|-2}{1}\times(|V|-3)+\cdots+\binom{|V|-2}{|V|-2}\times 0+1\\
&=1+\binom{|V|-2}{0}\times(|V|-2)+\binom{|V|-2}{1}\times(|V|-3)+\cdots+\binom{|V|-2}{|V|-2}\times 0+1\\
&=2+\sum_{i=0}^{|V|-2}\binom{|V|-2}{i}\times(|V|-2-i)\\
&=2+(|V|-2)\times 2^{|V|-3}\\
&\le(|V|-2)\times 2^{|V|-2}\quad\text{for }|V|\ge 4\\
&\le|V|\times 2^{|V|}
\end{aligned}
$$

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V|\times 2^{|V|})$. By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
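To put the two bounds side by side numerically (an illustration of the gap, not part of the derivation), the snippet below evaluates the exhaustive candidate count $(|V|-2)!$ for the graph of Figure 16 against the dynamic programming bound $|V|\times 2^{|V|}$ for a few graph sizes.

```python
from math import factorial

# exhaustive candidate count for the graph of Figure 16 vs. the DP state bound
for v in (10, 20, 30):
    exhaustive = factorial(v - 2)   # (|V|-2)! orderings of the independent branch nodes
    dp_bound = v * 2 ** v           # at most |V| * 2^|V| memoized search states
    print(f"|V|={v}: exhaustive={exhaustive:.2e}, DP bound={dp_bound:.2e}")
```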

Improvement in off-chip memory communication Wealso show how SERENITY affects the off-chip memorycommunication which largely affects both power andinference speed (Chen et al 2016 Gao et al 2017 Sharmaet al 2018) To this end Figure 11 sweeps different on-chipmemory configurations to measure the reduction in off-chipcommunication on systems with multi-level memoryhierarchy Since we know the entire schedule a priori weuse Beladyrsquos optimal algorithm (Belady 1966) also referredto as the clairvoyant algorithm for measuring the off-chipmemory communication to distill the effects of the proposedscheduling The results show that SERENITY can reducethe off-chip memory communication by 176times for a devicewith 256KB on-chip memory In particular while there werefew cases where peak memory footprint was already smallenough to fit on-chip (NA in figure) there were some caseswhere SERENITY eradicated the off-chip communicationby successfully containing the activations in the on-chipmemory while TensorFlow Lite failed to do so (marked infigure) This suggests that SERENITYrsquos effort of reducingmemory footprint is also effective in reducing the off-chipmemory communication in systems with memory hierarchyhence the power consumption and inference speed

Improvement from dynamic programming-based sched-uler and identity graph rewriting To demonstrate wherethe improvement comes from Figure 12 plots the memoryfootprint while running Swiftnet Cell A Figure 12(a) showsthe memory footprint of SERENITY with the memory

0

50

100

150

200

250

Mem

ory

Foot

prin

t (K

B)

Time

Dynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

251KB reduction in peak memoryfootprint with Memory Allocator

Memory Footprint (KB)

Time

(a) Memory footprint with the memory allocator (peak mem-ory footprint of TensorFlow Lite = 5510KB)

0

50

100

150

200

Mem

ory

Foot

prin

t (K

B)

Time

Dynamic ProgrammingDynamic Programming+Graph Rewriting

125KB reductionin peak memory footprint

Time

Memory Footprint (KB)

(b) Memory footprint without the memory allocator

Figure 12 Memory footprint while running SwiftNet Cell A withand without the memory allocator (red arrow denotes reduction)

allocation The figure shows that SERENITYrsquos dynamicprogramming-based scheduler brings significant improve-ment to the peak memory footprint (5510KBrarr2509KB)and the graph rewriting further improves this by 251KB(2509KBrarr2258KB) by utilizing patterns that alleviateregions with large memory footprint In order to focus onthe effect of the scheduler and graph rewriting Figure 12(b)presents the memory footprint of SERENITY without thememory allocation the sum of the activations while runningthe network The figure shows that the proposed schedulerfinds a schedule with the optimal (minimum) peak memoryfootprint without changes to the graph Then it shows thatthe proposed graph rewriting can further reduce the peakmemory footprint by 125KB (2007KBrarr1882KB) The re-sults suggest that the significant portion of the improvementcomes from the proposed dynamic programming-basedscheduler and the graph rewriting

Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes from graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.

Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of algorithms, the number of nodes and partitions, and the corresponding scheduling time.


Figure 13. Scheduling time evaluation for SERENITY: per-cell scheduling time in seconds (log scale) across the DARTS (ImageNet), SwiftNet (Visual Wake Words), and RandWire (CIFAR10, CIFAR100) cells, comparing dynamic programming + memory allocator against dynamic programming + graph rewriting + memory allocator.

Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.

GRAPH REWRITING    ALGORITHM      NODES AND PARTITIONS    SCHEDULING TIME
✗                  ①              62 = 62                  N/A
✗                  ① + ②          62 = 21+19+22            56.5 secs
✗                  ① + ② + ③      62 = 21+19+22            37.9 secs
✓                  ①              92 = 92                  N/A
✓                  ① + ②          92 = 33+28+29            7.2 hours
✓                  ① + ② + ③      92 = 33+28+29            111.9 secs

Straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, additional application of the ② divide-and-conquer (① + ②) leads to a measurable scheduling time: 56.53 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (① + ② + ③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
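
To make the interaction of the two techniques concrete, the sketch below outlines the control flow they add around the core scheduler: the graph is divided into cells that are scheduled independently and concatenated, and each per-cell search is wrapped in a binary meta-search over the soft budget, which is tightened on a timeout and relaxed when no solution is found. The helpers partition_into_cells, kahn_schedule, peak_footprint, and dp_schedule stand in for components described earlier in the paper and are assumed here; this is an illustrative sketch, not SERENITY's implementation.

# Illustrative sketch (helper functions assumed) of divide-and-conquer plus
# adaptive soft budgeting around the dynamic programming-based scheduler.
def schedule_graph(graph, step_timeout):
    final_schedule = []
    for cell in partition_into_cells(graph):             # divide
        # Hard budget: the peak footprint of any valid topological order,
        # e.g. one produced by Kahn's algorithm.
        hard = peak_footprint(kahn_schedule(cell), cell)
        tau_old, tau = hard, hard
        while True:
            # dp_schedule prunes partial schedules whose peak exceeds tau and
            # reports 'timeout' if any search step exceeds step_timeout.
            flag, sched = dp_schedule(cell, budget=tau, timeout=step_timeout)
            if flag == "solution":
                final_schedule += sched                   # conquer
                break
            if flag == "timeout":                         # prune more aggressively
                tau_old, tau = tau, tau / 2
            else:                                         # 'no solution': relax
                tau_old, tau = tau, (tau + tau_old) / 2
    return final_schedule                                 # combine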

5 RELATED WORKS

The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). Moreover, these frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019)

bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention and aims to enable its execution on memory-constrained edge devices.

Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.

Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.

Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.

6 CONCLUSION

As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Moreover, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.


ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.

Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.

Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.

Bellman, R. Dynamic programming. Science, 1966.

Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.

Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.

Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.

Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.

Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.

Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.

Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.

Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.

Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.

Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.

Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.


Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.

Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.

Kahn, A. B. Topological sorting of large networks. CACM, 1962.

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.

Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.

Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.

Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.

NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.

Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.

Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.

Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.

Zhu, M. and Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.


A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

Figure 14. ImageNet top-1 accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks (top left is better). (a) ImageNet accuracy vs. number of multiply-and-accumulates (billions). (b) ImageNet accuracy vs. number of parameters (millions).

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

Figure 15. Peak memory footprint (KB) of running irregularly wired neural networks on SERENITY and TensorFlow Lite (smaller is better), comparing TensorFlow Lite, dynamic programming + memory allocator, and dynamic programming + graph rewriting + memory allocator across the DARTS, SwiftNet, and RandWire cells.

C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamic programming-based scheduling algorithm.

Theorem 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i, \mu_{\mathrm{peak},i})$ in $S_{T,i}$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.

Proof. If $i=0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i \ge 1$, assume that a (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^* \in z_i$, and achieves $\mu^*$. In such a case, let $v_i$ be replaced with the (optimal) $u_i^*$, which will result in $\mu_{peak} \leftarrow \min\big(\mu_i + \prod v_i.\mathrm{shape},\; \mu_i + \prod u_i^*.\mathrm{shape}\big)$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\mathrm{shape}$ for all $p_i \in \big(u_i.\mathrm{preds} \cap \mathrm{zero\text{-}outdegree}(s_{i+1}, G)\big)$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime} \le \mu^*$ due to the $\min$ operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. □
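
For readability, the per-step update that the $\min$ operator above refers to can be restated compactly; this is a restatement of the scheduler's bookkeeping in the notation of the proof, not an additional result:

$$
\mu_{\mathrm{peak},\,i+1} = \max\Big(\mu_{\mathrm{peak},\,i},\; \mu_i + \prod u_i.\mathrm{shape}\Big), \qquad
\mu_{i+1} = \mu_i + \prod u_i.\mathrm{shape} \;-\; \sum_{p_i \,\in\, u_i.\mathrm{preds} \,\cap\, \mathrm{zero\text{-}outdegree}(s_{i+1},\,G)} \prod p_i.\mathrm{shape},
$$

and, for each zero-indegree set $z_{i+1}$, the memoization retains only the candidate $(s_{i+1}, \mu_{i+1}, \mu_{\mathrm{peak},\,i+1})$ with the smallest $\mu_{\mathrm{peak},\,i+1}$.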

D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring $S_T$ and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we invent $G$ in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.

Figure 16. Topology of $G$ to demonstrate the upper bound complexity of each algorithm: a single entry node A and a single exit node Z, with all remaining nodes (B, C, D, ..., W, X, Y) forming independent branches between them.

First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores $S_T$. Since there is a single entry node and a single exit node, there will be $|V|-2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes $(|V|-2)!$ and the overall


complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield

$$
\begin{aligned}
&1 + 1\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 1 + \binom{|V|-2}{0}\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 2 + \sum_{i=0}^{|V|-2}\binom{|V|-2}{i}\times(|V|-2-i) \\
&= 2 + (|V|-2)\times 2^{|V|-3} \\
&\le (|V|-2)\times 2^{|V|-2} \quad \text{for } |V|\ge 4 \\
&\le |V|\times 2^{|V|}.
\end{aligned}
$$

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V|\times 2^{|V|})$. By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
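
As a quick numerical illustration of this gap (added here for exposition; the specific graph sizes are arbitrary), the short sketch below evaluates the exhaustive count $(|V|-2)!$, the exact memoized count $2 + (|V|-2)\times 2^{|V|-3}$, and the bound $|V|\times 2^{|V|}$ for a few values of $|V|$:

# Illustrative comparison (not part of the original analysis) of the counts
# derived above for the hourglass-shaped graph G of Figure 16.
from math import factorial

for v in (8, 16, 24, 32):
    exhaustive = factorial(v - 2)               # recursive topological sorting
    memoized = 2 + (v - 2) * 2 ** (v - 3)       # exact memoized count derived above
    dp_bound = v * 2 ** v                       # O(|V| x 2^|V|) upper bound
    print(f"|V|={v:2d}  (|V|-2)! = {float(exhaustive):.2e}  "
          f"memoized = {memoized}  |V|*2^|V| = {dp_bound}")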

Figure 11 Reduction in off-chip memory communication ofSERENITY against TensorFlow Lite (with memory hierarchy)

addition the proposed graph rewriting technique yields anaverage of 186times(extra 107) reduction in terms of peakmemory footprint The results suggest that SERENITY yieldssignificant reduction in terms of the peak memory footprintfor irregularly wired neural networks

Improvement in off-chip memory communication Wealso show how SERENITY affects the off-chip memorycommunication which largely affects both power andinference speed (Chen et al 2016 Gao et al 2017 Sharmaet al 2018) To this end Figure 11 sweeps different on-chipmemory configurations to measure the reduction in off-chipcommunication on systems with multi-level memoryhierarchy Since we know the entire schedule a priori weuse Beladyrsquos optimal algorithm (Belady 1966) also referredto as the clairvoyant algorithm for measuring the off-chipmemory communication to distill the effects of the proposedscheduling The results show that SERENITY can reducethe off-chip memory communication by 176times for a devicewith 256KB on-chip memory In particular while there werefew cases where peak memory footprint was already smallenough to fit on-chip (NA in figure) there were some caseswhere SERENITY eradicated the off-chip communicationby successfully containing the activations in the on-chipmemory while TensorFlow Lite failed to do so (marked infigure) This suggests that SERENITYrsquos effort of reducingmemory footprint is also effective in reducing the off-chipmemory communication in systems with memory hierarchyhence the power consumption and inference speed

Improvement from dynamic programming-based sched-uler and identity graph rewriting To demonstrate wherethe improvement comes from Figure 12 plots the memoryfootprint while running Swiftnet Cell A Figure 12(a) showsthe memory footprint of SERENITY with the memory

0

50

100

150

200

250

Mem

ory

Foot

prin

t (K

B)

Time

Dynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

251KB reduction in peak memoryfootprint with Memory Allocator

Memory Footprint (KB)

Time

(a) Memory footprint with the memory allocator (peak mem-ory footprint of TensorFlow Lite = 5510KB)

0

50

100

150

200

Mem

ory

Foot

prin

t (K

B)

Time

Dynamic ProgrammingDynamic Programming+Graph Rewriting

125KB reductionin peak memory footprint

Time

Memory Footprint (KB)

(b) Memory footprint without the memory allocator

Figure 12 Memory footprint while running SwiftNet Cell A withand without the memory allocator (red arrow denotes reduction)

allocation The figure shows that SERENITYrsquos dynamicprogramming-based scheduler brings significant improve-ment to the peak memory footprint (5510KBrarr2509KB)and the graph rewriting further improves this by 251KB(2509KBrarr2258KB) by utilizing patterns that alleviateregions with large memory footprint In order to focus onthe effect of the scheduler and graph rewriting Figure 12(b)presents the memory footprint of SERENITY without thememory allocation the sum of the activations while runningthe network The figure shows that the proposed schedulerfinds a schedule with the optimal (minimum) peak memoryfootprint without changes to the graph Then it shows thatthe proposed graph rewriting can further reduce the peakmemory footprint by 125KB (2007KBrarr1882KB) The re-sults suggest that the significant portion of the improvementcomes from the proposed dynamic programming-basedscheduler and the graph rewriting

Scheduling time of SERENITY Figure 13 summarizesthe (static) scheduling time taken for SERENITY to schedulethe networks Results show that the average schedulingtime is 406 secs without the graph rewriting and 488 secswith graph rewriting which the difference comes from theincrease in the number of nodes from graph rewriting Theresults show that all the above gains of SERENITY come atthe cost of less than one minute average extra compilationtime While the dynamic programming-based schedulingsuffers from an exponential time complexity SERENITYmanages to make the scheduling tractable through theproposed divide-and-conquer and adaptive soft budgeting

Speed up from divide-and-conquer and adaptive softbudgeting Table 2 summarizes the scheduling timeof SwiftNet (Zhang et al 2019) for different algorithmsto demonstrate the speed up from divide-and-conquerand adaptive soft budgeting techniques As such thetable lists different combination of algorithms number of

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

32s 57s

45s

278s 11

81s

151s

285s 744s

879s

406s

32s

421s

305s

393s 11

81s

151s

285s 744s

879s

488s

1

10

100

1000

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C Mean

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Sche

dulin

g Ti

me

(sec

onds

)

Dynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

Scheduling Time (seconds)

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell B MeanSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Figure 13 Scheduling time evaluation for SERENITY

Table 2 Comparison of the scheduling time for different algorithmsto schedule SwiftNet 1 2 and 3 represent dynamic program-ming divide-and-conquer and adaptive soft budgeting respectivelyNA denotes infeasible within practical time

GRAPH ALGORITHM NODES AND SCHEDULINGREWRITING PARTITIONS TIME

7 1 62 =62 NA7 1 + 2 62=211922 565 secs7 1 + 2 + 3 62=211922 379 secs

3 1 92=92 NA3 1 + 2 92=332829 72 hours3 1 + 2 + 3 92=332829 1119 secs

nodes and the corresponding scheduling time Straightfor-ward implementation of the aforementioned 1 dynamicprogramming-based scheduling leads to an immeasurablylarge scheduling time regardless of the graph rewritingHowever additional application of the 2 divide-and-conquer ( 1 + 2 ) leads to a measurable scheduling time5653 secs and 729 hours to schedule without and with thegraph rewriting respectively Furthermore we observethat further applying 3 adaptive soft budgeting ( 1 + 2 + 3 )significantly reduces the scheduling time 379 secs and 1119secs to schedule without and with the graph rewriting re-spectively Above results indicate that applying the proposedalgorithms leads to a scheduling time of practical utility

5 RELATED WORKS

The prevalence of neural networks has led to the developmentof several compilation frameworks for deep learning (Abadiet al 2016 Paszke et al 2019 Rotem et al 2018Cyphers et al 2018) However even industry grade toolsmostly focus on tiling and fine-grained scheduling ofmicro-operations on the conventional hardware (NVIDIA2017 Google) or accelerators (Chen et al 2016 2014 Hanet al 2016a Judd et al 2016 Jouppi et al 2017 Gao et al2017 Parashar et al 2017 Sharma et al 2018 Fowerset al 2018) However these framework are mostly designedfor the common regular patterns that have dominateddeep learning from almost its conception As such thesetools inherently had no incentive to deal with the form ofirregularities that the emerging NAS (Zoph amp Le 2017Cortes et al 2017 Zoph et al 2018 Liu et al 2019aCai et al 2019 Real et al 2019 Zhang et al 2019) andRandom Networks (Xie et al 2019 Wortsman et al 2019)

bring about This paper in contrast focuses on this emergentclass that breaks the regularity convention and aims to enabletheir execution on memory constrained edge devices

Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.

Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.

Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.

6 CONCLUSION

As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.


ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.

REFERENCES

Abadi M Barham P Chen J Chen Z Davis A Dean JDevin M Ghemawat S Irving G Isard M et al TensorflowA system for large-scale machine learning In OSDI 2016

Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.

Ahn B H Pilligundla P and Esmaeilzadeh H ChameleonAdaptive code optimization for expedited deep neuralnetwork compilation In ICLR 2020 URL httpsopenreviewnetforumid=rygG4AVFvH

Anwar S Hwang K and Sung W Structured pruning of deepconvolutional neural networks JETC 2017

Belady L A A study of replacement algorithms for a virtual-storage computer IBM Systems Journal 1966

Bellman R Dynamic programming Science 1966

Bellman R E Dynamic programming treatment of the travelingsalesman problem 1961

Bernstein D Rodeh M and Gertner I On the complexity ofscheduling problems for parallelpipelined machines TC 1989

Bruno J and Sethi R Code generation for a one-register machineJACM 1976

Cai H Zhu L and Han S ProxylessNAS Direct neural architec-ture search on target task and hardware In ICLR 2019 URL httpsopenreviewnetforumid=HylVB3AqYm

Chen T Moreau T Jiang Z Zheng L Yan E Shen H CowanM Wang L Hu Y Ceze L et al Tvm An automated end-to-end optimizing compiler for deep learning In OSDI 2018

Chen Y Luo T Liu S Zhang S He L Wang J Li LChen T Xu Z Sun N et al Dadiannao A machine-learningsupercomputer In MICRO 2014

Chen Y-H Krishna T Emer J S and Sze V EyerissAn energy-efficient reconfigurable accelerator for deepconvolutional neural networks JSSC 2016

Cortes C Gonzalvo X Kuznetsov V Mohri M and YangS AdaNet Adaptive structural learning of artificial neuralnetworks In ICML 2017

Courbariaux M Hubara I Soudry D El-Yaniv R and BengioY Binarized neural networks Training deep neural networkswith weights and activations constrained to +1 or -1 arXiv 2016URL httpsarxivorgpdf160202830pdf

Cyphers S Bansal A K Bhiwandiwalla A Bobba JBrookhart M Chakraborty A Constable W Convey CCook L Kanawi O et al Intel nGraph An intermediate repre-sentation compiler and executor for deep learning arXiv 2018URL httpsarxivorgpdf180108058pdf

Dean J Machine learning for systems and systems for machinelearning In NIPS Workshop on ML Systems 2017

Elthakeb A T Pilligundla P Yazdanbakhsh A Kinzer S andEsmaeilzadeh H Releq A reinforcement learning approachfor deep quantization of neural networks arXiv 2018 URLhttpsarxivorgpdf181101704pdf

Esser S K McKinstry J L Bablani D Appuswamy R andModha D S Learned step size quantization In ICLR 2020URL httpsopenreviewnetforumid=rkgO66VKDS

Feurer M Klein A Eggensperger K Springenberg J BlumM and Hutter F Efficient and robust automated machinelearning In NIPS 2015

Fowers J Ovtcharov K Papamichael M Massengill T LiuM Lo D Alkalay S Haselman M Adams L Ghandi Met al A configurable cloud-scale dnn processor for real-timeai In ISCA 2018

Gao M Pu J Yang X Horowitz M and Kozyrakis CTETRIS Scalable and efficient neural network acceleration with3d memory In ASPLOS 2017

Gauen K Rangan R Mohan A Lu Y-H Liu W and BergA C Low-power image recognition challenge In ASP-DAC2017

Google TensorFlow Lite URL httpswwwtensorfloworgmobiletflite

Han S Pool J Tran J and Dally W Learning both weights andconnections for efficient neural network In NIPS 2015

Han S Liu X Mao H Pu J Pedram A Horowitz M A andDally W J EIE efficient inference engine on compressed deepneural network In ISCA 2016a

Han S Mao H and Dally W J Deep compression Compressingdeep neural networks with pruning trained quantization andhuffman coding In ICLR 2016b

He Y Lin J Liu Z Wang H Li L-J and Han S AMCAutoML for model compression and acceleration on mobiledevices In ECCV 2018

Held M and Karp R M A dynamic programming approach tosequencing problems Journal of the SIAM 1962

Howard A G Zhu M Chen B Kalenichenko DWang W Weyand T Andreetto M and Adam HMobileNets Efficient convolutional neural networksfor mobile vision applications arXiv 2017 URLhttpsarxivorgpdf170404861pdf

Jain A Phanishayee A Mars J Tang L and Pekhimenko GGist Efficient data encoding for deep neural network trainingIn ISCA 2018


Jia Y Shelhamer E Donahue J Karayev S Long J GirshickR Guadarrama S and Darrell T Caffe Convolutionalarchitecture for fast feature embedding In MM 2014

Jia Z Lin S Qi C R and Aiken A Exploring hiddendimensions in parallelizing convolutional neural networks InICML 2018

Jia Z Thomas J Warszawski T Gao M Zaharia M andAiken A Optimizing dnn computation with relaxed graphsubstitutions In SysML 2019

Jouppi N P Young C Patil N Patterson D Agrawal GBajwa R Bates S Bhatia S Boden N Borchers A et alIn-datacenter performance analysis of a tensor processing unitIn ISCA 2017

Judd P Albericio J Hetherington T Aamodt T M andMoshovos A Stripes Bit-serial deep neural network computingIn MICRO 2016

Kahn A B Topological sorting of large networks CACM 1962

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo D Qin Y Schutze O and Sun J-Q Automaticmodel selection for neural networks arXiv 2019 URLhttpsarxivorgpdf190506010pdf

Lattner C and Adve V LLVM A compilation framework forlifelong program analysis amp transformation In CGO 2004

LeCun Y Denker J S and Solla S A Optimal brain damageIn NIPS 1990

Lee C Lee J K Hwang T and Tsai S-C Compileroptimization on vliw instruction scheduling for low powerTODAES 2003

Liu H Simonyan K and Yang Y DARTS Differen-tiable architecture search In ICLR 2019a URL httpsopenreviewnetforumid=S1eYHoC5FX

Liu Y Wang Y Yu R Li M Sharma V and Wang Y Opti-mizing CNN model inference on CPUs In USENIX ATC 2019b

Mireshghallah F Taram M Ramrakhyani P Jalali A TullsenD and Esmaeilzadeh H Shredder Learning noise distributionsto protect inference privacy In ASPLOS 2020

Mishra A and Marr D Apprentice Using knowl-edge distillation techniques to improve low-precisionnetwork accuracy In ICLR 2018 URL httpsopenreviewnetforumid=B1ae1lZRb

NVIDIA TensorRT Programmable inference accelerator 2017URL httpsdevelopernvidiacomtensorrt

Parashar A Rhu M Mukkara A Puglielli A Venkatesan RKhailany B Emer J Keckler S W and Dally W J ScnnAn accelerator for compressed-sparse convolutional neuralnetworks In ISCA 2017

Paszke A Gross S Massa F Lerer A Bradbury J Chanan GKilleen T Lin Z Gimelshein N Antiga L et al PyTorchAn imperative style high-performance deep learning library InNeurIPS 2019

Plump D Term graph rewriting In Handbook Of GraphGrammars And Computing By Graph Transformation Volume2 Applications Languages and Tools World Scientific 1999

Real E Aggarwal A Huang Y and Le Q V Regularizedevolution for image classifier architecture search In AAAI 2019

Rotem N Fix J Abdulrasool S Catron G Deng SDzhabarov R Gibson N Hegeman J Lele M Lev-enstein R et al Glow Graph lowering compilertechniques for neural networks arXiv 2018 URLhttpsarxivorgpdf180500907pdf

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma H Park J Suda N Lai L Chau B Chandra V and Es-maeilzadeh H Bit Fusion Bit-level dynamically composable ar-chitecture for accelerating deep neural networks In ISCA 2018

Sifre L and Mallat S Rigid-motion scattering for imageclassification PhD dissertation 2014

Vasilache N Zinenko O Theodoridis T Goyal P DeVitoZ Moses W S Verdoolaege S Adams A and CohenA Tensor Comprehensions Framework-agnostic high-performance machine learning abstractions arXiv 2018 URLhttpsarxivorgpdf180204730pdf

Wang K Liu Z Lin Y Lin J and Han S HAQ Hardware-aware automated quantization with mixed precision In CVPR2019

Wilken K Liu J and Heffernan M Optimal instructionscheduling using integer programming In PLDI 2000

Wortsman M Farhadi A and Rastegari M Discovering neuralwirings In NeurIPS 2019

Wu C-J Brooks D Chen K Chen D Choudhury S DukhanM Hazelwood K Isaac E Jia Y Jia B et al Machinelearning at facebook Understanding inference at the edge InHPCA 2019

Xie S Kirillov A Girshick R and He K Exploring randomlywired neural networks for image recognition In ICCV 2019

Zhang T Yang Y Yan F Li S Teague H Chen Y et alSwiftnet Using graph propagation as meta-knowledge to searchhighly representative neural architectures arXiv 2019 URLhttpsarxivorgpdf190608305pdf

Zhou S Wu Y Ni Z Zhou X Wen H and Zou YDoReFa-Net Training low bitwidth convolutional neuralnetworks with low bitwidth gradients arXiv 2016 URLhttpsarxivorgpdf160606160pdf

Zhu M and Gupta S To prune or not to prune ex-ploring the efficacy of pruning for model compres-sion In ICLR Workshop 2018 URL httpsopenreviewnetforumid=S1lN69AT-

Zoph B and Le Q V Neural architecture searchwith reinforcement learning ICLR 2017 URLhttpsopenreviewnetforumid=r1Ue8Hcxg

Zoph B Vasudevan V Shlens J and Le Q V Learning transfer-able architectures for scalable image recognition In CVPR 2018


A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

[Figure 14 panels: (a) top-1 ImageNet accuracy (%) vs. multiply-and-accumulate count (billions); (b) top-1 ImageNet accuracy (%) vs. number of parameters (millions). Both panels compare irregularly wired neural networks (RandWire, AmoebaNet, NASNet) against regular topology neural networks (Inception, ResNet, MobileNet, ShuffleNet, SENet, etc.); the irregularly wired networks sit toward the top left, showing better accuracy for the same amount of compute or the same number of parameters.]

Figure 14. ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

[Figure 15 data: peak memory footprint in KB (smaller is better). TensorFlow Lite: DARTS Normal 1656, SwiftNet A/B/C 552/194/70, RandWire-CIFAR10 A/B 645/330, RandWire-CIFAR100 A/B/C 603/350/160. Dynamic Programming + Memory Allocator: 903, 251/82/33, 459/260, 359/280/115. Dynamic Programming + Graph Rewriting + Memory Allocator: 753, 226/72/20, 459/260, 359/280/115.]

Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.

C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption µ*, it is sufficient to keep just one schedule-peak memory pair (s_i, µ_peak,i) in S_T,i for each zero-indegree set z_i, and to append subsequent nodes on top of s_i to get s_{i+1} in each search step.

Proof. If i = 0, the optimal s_0 is an empty sequence and µ_0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) v_i constitutes s*, substituting u*_i ∈ z_i, and achieves µ*. In such a case, let v_i be replaced with the (optimal) u*_i, which will result in µ_peak ← min(µ_i + ∏ v_i.shape, µ_i + ∏ u*_i.shape), and µ_{i+1} is calculated by deducting ∏ p_i.shape for all p_i ∈ (u_i.preds ∩ zero-outdegree(s_{i+1}, G)). By recursively applying u_k for the rest of the search steps k, the algorithm finds an alternative sequence s*′ with µ*′ ≤ µ* due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. ∎

D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring S_T and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we invent G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.

[Figure 16 graphic: graph G with a single entry node A, a single exit node Z, and all remaining nodes (B, C, D, ..., W, X, Y) forming independent branches between them.]

Figure 16. Topology of G to demonstrate the upper bound complexity of each algorithm.

First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores S_T. Since there is a single entry node and a single exit node, there will be |V| − 2 remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes (|V| − 2)! and the overall complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step i would be i − 1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 would explore |V| − 2 nodes, and the search steps 2 to |V| − 1 would iterate over |V| − 1 − i nodes. Summarizing, this would yield

1 + 1×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ··· + C(|V|−2, |V|−2)×0 + 1
= 1 + C(|V|−2, 0)×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ··· + C(|V|−2, |V|−2)×0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)×(|V|−2−i)
= 2 + (|V|−2)×2^{|V|−3}
≤ (|V|−2)×2^{|V|−2}   for |V| ≥ 4
≤ |V|×2^{|V|}

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V| × 2^|V|). By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
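As a quick sanity check of the closed form above, the following short Python snippet (ours, not part of the paper) verifies the identity 2 + Σ_i C(|V|−2, i)×(|V|−2−i) = 2 + (|V|−2)×2^{|V|−3} numerically and prints the resulting |V|×2^|V| bound next to the (|V|−2)! orders that the exhaustive search enumerates.

# Numerical sanity check of the closed form used in the derivation above.
from math import comb, factorial

for n in range(3, 12):                      # n stands for |V|
    lhs = 2 + sum(comb(n - 2, i) * (n - 2 - i) for i in range(n - 1))
    rhs = 2 + (n - 2) * 2 ** (n - 3)
    assert lhs == rhs
    # the DP bound |V| * 2**|V| grows much more slowly for large |V| than the
    # (|V|-2)! schedules enumerated by the exhaustive recursive topological sorting
    print(n, rhs, n * 2 ** n, factorial(n - 2))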


[Figure 6 graphic: graph G with nodes A-L. Starting from s_8 = A, B, C, D, E, F, I, J with µ_8 and µ_peak,8 (from M_8), and z_8 = {G, H}, step (1) schedules and allocates u_8 = H and records µ_peak,9 = max(µ_peak,8, µ_peak); step (2) deallocates D and E once their outdegree drops from 1 to 0, yielding s_9 = A, B, C, D, E, F, I, J, H and µ_9.]

Figure 6. Visualization of scheduling the node u_8 = H during the search step i = 8. Starting from s_8, µ_8, and µ_peak,8, the figure shows how the algorithm calculates s_9, µ_9, and µ_peak,9.

the dynamic programming-based topological ordering

Integrating the peak memory footprint constraint. On top of the dynamic programming formulation that shows potential for optimizing the search space significantly, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint µ_{i+1} and its corresponding peak µ_peak,i+1 in each search step i to select the optimal path s*_{i+1} for memoization. Here we clarify the process of a search step, explaining the details of calculating µ_peak,i+1 and saving s_{i+1} for each search step i. In each search step, we start with a number of unique zero-indegree sets z_i (signatures) saved in the i-th entry of the memoization M_i. For each z_i, we keep the schedule up to this point s_i, the sum of activations in the memory µ_i for the signature z_i, and the peak memory footprint of s_i, denoted µ_peak,i. Therefore, in each search step i, we start with s_i, µ_i, and µ_peak,i for s_i. Then, when we iterate over z_i to schedule a new node u_i, its output activation is appended to s_i to form s_{i+1} and is allocated in the memory. The size of u_i, the product ∏(u_i.shape), where shape is a property of the activation tensor that includes channels, height, width, and the precision (e.g., byte, float), is added to µ_i, so µ_{i+1} ← µ_i + ∏(u_i.shape). Then we use µ_{i+1} as µ_peak to update µ_peak,i+1 (the peak memory footprint for s_{i+1}). Since some predecessors of u_i will not be used anymore after allocating u_i, we update the outdegrees of these nodes by decrementing them. Having updated the outdegrees, we are left with a zero-outdegree set that denotes the nodes that are ready for deallocation. We deallocate the nodes in the set and update µ_{i+1} accordingly.

To demonstrate the scheduling of a node u_i, Figure 6 simulates scheduling the node u_8 = H in search step i = 8. In the figure, (1) H is appended to s_8 and allocated to memory as it is scheduled, and then the scheduler records the maximum of µ_peak,8 and the sum of all activations in the memory at this point as µ_peak,9. Then it recalculates the outdegrees of the predecessor nodes of H: D's and E's outdegrees are decremented from one to zero. (2) Then these nodes are deallocated, and the sum of the activation memory at this point is recorded as µ_9.

Algorithm 1 Dynamic Programming-based Scheduling
  Input: graph G
  Output: optimal schedule s*
  # initialize memoization
  s_0 ← []; µ_0, µ_peak,0 ← 0; z_0 ← zero-indegree(s_0, G)
  M_0[z_0] ← (s_0, µ_0, µ_peak,0)
  # iterate over search steps
  for i = 0 to n−1 do
    # iterate over (schedule, current memory, peak memory)
    for z_i, (s_i, µ_i, µ_peak,i) in M_i do
      for u_i in z_i do
        s_{i+1} ← s_i.append(u_i)                         # allocate
        z_{i+1} ← zero-indegree(s_{i+1}, G)
        µ_{i+1}, µ_peak ← µ_i + ∏(u_i.shape)
        µ_peak,i+1 ← max(µ_peak,i, µ_peak)
        for p_i in u_i.preds do
          if p_i is in zero-outdegree(s_{i+1}, G) then
            µ_{i+1} ← µ_{i+1} − ∏(p_i.shape)              # deallocate
          end if
        end for
        # memoize schedule with least peak memory
        if µ_peak,i+1 ≤ M_{i+1}[z_{i+1}].µ_peak,i+1 then
          M_{i+1}[z_{i+1}] ← (s_{i+1}, µ_{i+1}, µ_peak,i+1)
        end if
      end for
    end for
  end for
  s*, µ*_peak ← M_n[·].s_n, M_n[·].µ_peak,n               # solution


Finding the schedule with optimal peak memory footprint. After scheduling u_i, we save the new signature into M_{i+1} for the next search step i+1. Since the goal of this work is to minimize the overall µ_peak, we identify the corresponding optimal schedule s*_{i+1} for each z_{i+1} by only saving the s_{i+1} with the minimum µ_peak,i+1. We integrate the aforementioned step of scheduling u_i and updating M_{i+1} to complete the proposed dynamic programming-based scheduling algorithm. Algorithm 1 summarizes the algorithm. As a first step, the algorithm starts by initializing the memoization table M_0; then the algorithm iterates over the search steps. In each search step i, the algorithm performs the above-illustrated memory allocation for all u_i in z_i and saves s_{i+1}, µ_{i+1}, and µ_peak,i+1. After iterating over all search steps up to n−1, s* is saved in M_n with a unique entry, n being the number of nodes in G. We provide the proof for the optimality of the peak memory footprint in the supplementary material.

Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V| × 2^|V|), which is significantly lower than that of the exhaustive search of S_T with an upper bound complexity of O(|V|!).
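To make the mechanics of Algorithm 1 concrete, the following minimal Python sketch (our illustration, not the SERENITY implementation) memoizes one best partial schedule per zero-indegree signature; dp_schedule and its helper names are hypothetical, and activation sizes are supplied as a plain dictionary instead of tensor shapes.

# Minimal sketch of dynamic programming-based scheduling with zero-indegree signatures.
def dp_schedule(nodes, edges, size):
    """nodes: list of node ids; edges: list of (src, dst); size: dict node -> activation size."""
    preds = {v: set() for v in nodes}
    succs = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    def zero_indegree(scheduled):
        # unscheduled nodes whose predecessors have all been scheduled
        return frozenset(v for v in nodes if v not in scheduled and preds[v] <= scheduled)

    # memo maps a signature (zero-indegree set, scheduled set) -> (schedule, mu, mu_peak)
    memo = {(zero_indegree(frozenset()), frozenset()): ((), 0, 0)}
    for _ in range(len(nodes)):
        nxt = {}
        for (zset, scheduled), (sched, mu, mu_peak) in memo.items():
            for u in zset:
                new_scheduled = scheduled | {u}
                new_mu = mu + size[u]                       # allocate u's output
                new_peak = max(mu_peak, new_mu)
                for p in preds[u]:                          # deallocate dead predecessors
                    if all(s in new_scheduled for s in succs[p]):
                        new_mu -= size[p]
                key = (zero_indegree(new_scheduled), new_scheduled)
                if key not in nxt or new_peak < nxt[key][2]:
                    nxt[key] = (sched + (u,), new_mu, new_peak)
        memo = nxt
    (sched, _, peak), = memo.values()                       # one entry remains: all nodes scheduled
    return list(sched), peak

# toy usage: diamond graph A -> {B, C} -> D with unit-size activations
if __name__ == "__main__":
    n = ["A", "B", "C", "D"]
    e = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
    print(dp_schedule(n, e, {v: 1 for v in n}))             # e.g. (['A', 'B', 'C', 'D'], 3)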


[Figure 7 graphic: a graph with nodes A-H split into subgraphs g_1 (A, B, C, D) and g_2 (E, F, G, H); each subgraph is scheduled separately (s_g1, s_g2) and the sub-schedules are concatenated into the final schedule s = A ... H.]

Figure 7. Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).

Due to the space limitation, we present the derivation of the algorithm complexity in the supplementary material.

3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling

While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.

Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass shaped, because many NAS and Random Network Generators design cells with a single input and a single output and then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that during general purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach can reduce the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time will decrease from |V| × 2^|V| to |V| × 2^{|V|/N}, so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.

As such, Figure 7 shows that this insight can be extended to our problem setting, where we can first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage is partitioning the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule s_g (conquer). Considering that the number of nodes in the subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
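A corresponding sketch of the divide-and-conquer step, reusing the hypothetical dp_schedule above; it assumes the cells are already partitioned, connected head-to-tail, and given in topological order, and it ignores the single tensor passed between cells.

def schedule_by_cells(cells, sizes):
    """cells: list of (nodes, edges) per single-input/single-output subgraph; sizes: node -> size."""
    full_schedule, worst_peak = [], 0
    for nodes, edges in cells:
        sched, peak = dp_schedule(nodes, edges, {v: sizes[v] for v in nodes})  # conquer
        full_schedule += sched                                                 # combine
        worst_peak = max(worst_peak, peak)    # overall peak ~ max over cells in this simplification
    return full_schedule, worst_peak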

Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stage of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = µ*, while all other solutions can be discarded, setting some budget τ that is greater than or equal to µ* and pruning suboptimal schedules whose µ_peak exceeds τ can focus the search on a smaller search space S′_T ⊂ S_T while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increase τ) if a program fails due to stack overflow (= 'no solution' due to overly aggressive pruning) and selling off excess memory (decrease τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme that cuts down the overall number of explored schedules while scheduling. Figure 8 illustrates the overall idea, first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).

Figure 8(a) depicts a certain point while scheduling G where the nodes G, H, F, and J can be scheduled.

[Figure 8(a) graphic: a scheduling state of graph G (nodes A-L) with the partial schedule A, B, C, D, E, I at signature z and µ = 32; candidate nodes H, G (size 3) and F, J (size 6); scheduling H keeps µ_peak = 35 ≤ τ = 36, while scheduling F or J reaches µ_peak = 38 > τ, so those paths (s_2, s_3) are pruned.]

(a) While both paths s_1 and s_2 lead to the same z′, their µ and µ_peak vary, and we can prune schedules that yield a higher µ_peak than a given budget τ. Numbers next to a box or circle are µ, and numbers next to edges are µ_peak.

[Figure 8(b) graphic: number of explored schedules (proportional to scheduling time) as a function of the budget; too small a budget gives scheduling failure ('no solution'), too large a budget gives prohibitive scheduling time ('timeout'), and adaptive soft budgeting searches for a soft budget τ between the optimal budget τ* and the hard budget τ_max.]

(b) Adaptive soft budgeting starts by setting a hard budget τ_max as the maximum value for the soft budget τ. It then conducts a binary search for a τ higher than τ* so that it finds a solution, yet not too high, so that scheduling completes quickly.

Figure 8. Illustration of the adaptive soft budgeting. (a) shows how schedules are pruned, and (b) illustrates how the soft budget τ relates to the number of explored schedules.


In particular, the figure compares two possible solutions, s_1 and s_2, which schedule H → F and F → H, respectively, given τ = 36. While s_1 and s_2 both start from z with µ = 32, scheduling H leads to µ_peak = 32 + 3 (H) = 35, whereas scheduling F or J leads to µ_peak = 32 + 6 (F or J) = 38. Therefore, since we assume τ = 36, s_2 and s_3 will fail because their µ_peak = 38 exceeds 36. So, as long as we set the budget τ higher than µ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < µ* leads to no solution because the optimal path would be pruned away.

Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to µ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) (O(|V| + |E|)) to gain an idea of the range for τ. We use the peak memory footprint of this sequence as our hard budget τ_max, and in contrast we call the adaptively changing τ a soft budget. Since τ_max ≥ µ*, we know that any τ ≥ τ_max does not need to be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that runs the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τ_new ← (τ_new + τ_old)/2) if it finds 'no solution' and decreases τ (τ_new ← τ_new/2) if a search step returns 'timeout' (search step duration exceeds T).

Algorithm 2 Adaptive Soft Budgeting
  Input: graph G
  Output: optimal schedule s*
  τ_max ← µ(Kahn'sAlgorithm(G), G)                          # hard budget
  τ_old, τ_new ← τ_max
  flag ← 'no solution'
  repeat
    # run the scheduler with soft budget τ_new and per-step time limit T to obtain flag (and schedule),
    # then binary-search for τ: decrease τ if 'timeout', increase τ if 'no solution'
    if flag is 'timeout' then
      (simultaneously) τ_old ← τ_new; τ_new ← τ_new / 2
    else if flag is 'no solution' then
      (simultaneously) τ_old ← τ_new; τ_new ← (τ_new + τ_old) / 2
    end if
    if flag is 'solution' then
      s* ← schedule                                          # optimal schedule
    end if
  until flag is 'solution'

[Figure 9 graphic: before rewriting, concat + conv (or concat + depthwise conv) has µ_peak = Σ_i size(x_i) + size(y); after channel-wise partitioning (partial conv + add) or kernel-wise partitioning (partial depthwise conv + concat), µ_peak drops to a maximum over the partial computations, e.g., max_i(size(x_i) + size(w_i * x_i)). Here x_i denotes the i-th input branch, y the output, and w_ij the j-th channel of the i-th kernel.]

Figure 9. Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.

The binary search stops as soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with τ.
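A compact sketch of this meta-search (our illustration, with a hypothetical dp_schedule_budgeted that prunes partial schedules whose peak exceeds τ and reports 'solution', 'timeout', or 'no solution'):

def adaptive_soft_budget(graph, tau_max, dp_schedule_budgeted, T=10.0):
    """tau_max: peak memory of a Kahn's-algorithm schedule (hard budget); T: per-step time limit."""
    tau_old = tau_new = tau_max
    while True:
        flag, schedule = dp_schedule_budgeted(graph, tau_new, T)
        if flag == 'solution':
            return schedule
        if flag == 'timeout':                                 # too little pruning: halve the budget
            tau_old, tau_new = tau_new, tau_new / 2
        else:                                                 # 'no solution': move back toward tau_old
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2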

3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint

Reorganizing the computational graph of the irregularly wired neural networks may lead to a significant reduction in the peak memory footprint µ_peak during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce µ_peak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique:

Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: *). While executing such a pattern, the peak memory footprint µ_peak occurs when the output y ∈ R^n is being computed while the concatenated branches of the input x ∈ R^n are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat, so that each partitioned conv can be computed as soon as its input x_i becomes available.


Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates over and sums up the result of convolving the channels in conv. However, using the distributive property of Σ_i and *, these transform into a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes the concat from the graph, leading to a lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ_i x_i + y to max_i(w_i * x_i) + y, which becomes more effective when there are more incoming edges to the concat.

y = [Σ_i w_1i * x_i, ..., Σ_i w_mi * x_i]        (concat + conv)            (3)
  = Σ_i [w_1i * x_i, ..., w_mi * x_i]                                       (4)
  = Σ_i [w_1i, ..., w_mi] * x_i                                             (5)
  = Σ_i [w_i * x_i]                              (partial conv + add)       (6)
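A quick numerical check of Equations 3-6 (our toy example with a 1×1 convolution, so conv reduces to a matrix product along the channel dimension): convolving the concatenation of the inputs equals summing the channel-wise partitioned convolutions.

import numpy as np

rng = np.random.default_rng(0)
xs = [rng.standard_normal((4, 8, 8)) for _ in range(3)]      # three input branches, 4 channels each
W = rng.standard_normal((16, 12))                            # 16 output channels, 12 = 3*4 input channels

y_concat = np.einsum('oc,chw->ohw', W, np.concatenate(xs, axis=0))            # concat + conv
W_parts = np.split(W, 3, axis=1)                                              # channel-wise partition
y_partial = sum(np.einsum('oc,chw->ohw', Wi, xi) for Wi, xi in zip(W_parts, xs))  # partial conv + add

assert np.allclose(y_concat, y_partial)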

Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation yet achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat + conv case, the peak memory footprint µ_peak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among the different kernels to kernel-wise partition the depthconv that follows the concat, so that each input x_i is computed into a smaller feature map without residing in the memory too long. As such, Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce µ_peak significantly.

y = [w_1 * x_1, ..., w_n * x_n]                  (concat + depthconv)             (7)
  = [[w_1 * x_1], ..., [w_n * x_n]]              (partial depthconv + concat)     (8)
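An analogous check of Equations 7-8 with a toy 1×1 depthwise convolution (per-channel scaling): depthwise-convolving the concatenation equals concatenating the per-branch partial results.

import numpy as np

rng = np.random.default_rng(1)
xs = [rng.standard_normal((4, 8, 8)) for _ in range(3)]      # three input branches, 4 channels each
ws = [rng.standard_normal((4, 1, 1)) for _ in range(3)]      # one 1x1 kernel per channel

y_concat = np.concatenate(xs, axis=0) * np.concatenate(ws, axis=0)         # concat + depthconv
y_partial = np.concatenate([x * w for x, w in zip(xs, ws)], axis=0)        # partial depthconv + concat

assert np.allclose(y_concat, y_partial)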

Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted by an operation with lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.

Table 1. Specification of the networks used for evaluation.

NETWORK    TYPE  DATASET    MAC      WEIGHT    TOP-1 ACCURACY
DARTS      NAS   ImageNet   574.0M   4.7M      73.3
SwiftNet   NAS   HPD        57.4M    249.7K    95.1
RandWire   RAND  CIFAR10    111.0M   1.2M      93.6
RandWire   RAND  CIFAR100   160.0M   4.7M      74.5

4 EVALUATION

We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.

4.1 Methodology

Benchmarks and datasets. Table 1 lists the details of the networks, representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (MAC), number of parameters (WEIGHT), and top-1 accuracy on their respective dataset.

4.2 Experimental Results

Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of the reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph.

¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc


[Figure 10 data: reduction in peak memory over TensorFlow Lite (higher is better). Dynamic Programming + Memory Allocator: 1.83×, 2.20×, 2.39×, 2.09×, 1.40×, 1.27×, 1.68×, 1.25×, 1.39× (geomean 1.68×); Dynamic Programming + Graph Rewriting + Memory Allocator: 2.20×, 2.44×, 2.70×, 3.45×, 1.40×, 1.27×, 1.68×, 1.25×, 1.39× (geomean 1.86×), for DARTS Normal Cell, SwiftNet Cells A-C, RandWire-CIFAR10 Cells A-B, and RandWire-CIFAR100 Cells A-C, respectively.]

Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).

[Figure 11 data: reduction in off-chip memory communication over TensorFlow Lite for on-chip memory sizes of 32KB, 64KB, 128KB, and 256KB across the benchmark cells (N/A where the activations already fit on-chip). In several configurations only SERENITY fits entirely on-chip, removing off-chip communication altogether; the geometric-mean reduction reaches 1.76× at 256KB.]

Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).

In addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.

Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication, to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (N/A in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort in reducing the memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, hence the power consumption and inference speed.
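For illustration (ours, not the paper's measurement tooling), a minimal sketch of Belady's clairvoyant policy for a fixed access trace: it evicts the tensor reused farthest in the future, counts only fetch traffic, and uses hypothetical tensor ids and sizes.

def belady_offchip_traffic(trace, sizes, capacity):
    """trace: tensor ids in access order; sizes: id -> bytes; capacity: on-chip bytes."""
    on_chip, traffic = set(), 0
    for i, t in enumerate(trace):
        if t in on_chip:
            continue
        traffic += sizes[t]                                  # fetch t from off-chip
        future = trace[i + 1:]
        def next_use(x):                                     # distance to next use (inf if never reused)
            return future.index(x) if x in future else float("inf")
        while on_chip and sum(sizes[x] for x in on_chip) + sizes[t] > capacity:
            on_chip.remove(max(on_chip, key=next_use))       # evict farthest-future use
        if sizes[t] <= capacity:
            on_chip.add(t)
    return traffic

# toy usage: three tensors of 60/50/40 KB on a 100 KB scratchpad
print(belady_offchip_traffic(["a", "b", "a", "c", "b"], {"a": 60, "b": 50, "c": 40}, 100))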

Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory allocation.

[Figure 12(a) graphic: memory footprint (KB) over time for Dynamic Programming + Memory Allocator and Dynamic Programming + Graph Rewriting + Memory Allocator; graph rewriting lowers the peak by 25.1KB.]

(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB).

[Figure 12(b) graphic: memory footprint (KB) over time for Dynamic Programming and Dynamic Programming + Graph Rewriting, without the memory allocator; graph rewriting lowers the peak by 12.5KB.]

(b) Memory footprint without the memory allocator

Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrows denote reductions).

The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB → 250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB → 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB → 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.

Scheduling time of SERENITY Figure 13 summarizesthe (static) scheduling time taken for SERENITY to schedulethe networks Results show that the average schedulingtime is 406 secs without the graph rewriting and 488 secswith graph rewriting which the difference comes from theincrease in the number of nodes from graph rewriting Theresults show that all the above gains of SERENITY come atthe cost of less than one minute average extra compilationtime While the dynamic programming-based schedulingsuffers from an exponential time complexity SERENITYmanages to make the scheduling tractable through theproposed divide-and-conquer and adaptive soft budgeting

Speed up from divide-and-conquer and adaptive softbudgeting Table 2 summarizes the scheduling timeof SwiftNet (Zhang et al 2019) for different algorithmsto demonstrate the speed up from divide-and-conquerand adaptive soft budgeting techniques As such thetable lists different combination of algorithms number of

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

32s 57s

45s

278s 11

81s

151s

285s 744s

879s

406s

32s

421s

305s

393s 11

81s

151s

285s 744s

879s

488s

1

10

100

1000

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C Mean

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Sche

dulin

g Ti

me

(sec

onds

)

Dynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

Scheduling Time (seconds)

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell B MeanSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Figure 13 Scheduling time evaluation for SERENITY

Table 2 Comparison of the scheduling time for different algorithmsto schedule SwiftNet 1 2 and 3 represent dynamic program-ming divide-and-conquer and adaptive soft budgeting respectivelyNA denotes infeasible within practical time

GRAPH ALGORITHM NODES AND SCHEDULINGREWRITING PARTITIONS TIME

7 1 62 =62 NA7 1 + 2 62=211922 565 secs7 1 + 2 + 3 62=211922 379 secs

3 1 92=92 NA3 1 + 2 92=332829 72 hours3 1 + 2 + 3 92=332829 1119 secs

nodes and the corresponding scheduling time Straightfor-ward implementation of the aforementioned 1 dynamicprogramming-based scheduling leads to an immeasurablylarge scheduling time regardless of the graph rewritingHowever additional application of the 2 divide-and-conquer ( 1 + 2 ) leads to a measurable scheduling time5653 secs and 729 hours to schedule without and with thegraph rewriting respectively Furthermore we observethat further applying 3 adaptive soft budgeting ( 1 + 2 + 3 )significantly reduces the scheduling time 379 secs and 1119secs to schedule without and with the graph rewriting re-spectively Above results indicate that applying the proposedalgorithms leads to a scheduling time of practical utility

5 RELATED WORKS

The prevalence of neural networks has led to the developmentof several compilation frameworks for deep learning (Abadiet al 2016 Paszke et al 2019 Rotem et al 2018Cyphers et al 2018) However even industry grade toolsmostly focus on tiling and fine-grained scheduling ofmicro-operations on the conventional hardware (NVIDIA2017 Google) or accelerators (Chen et al 2016 2014 Hanet al 2016a Judd et al 2016 Jouppi et al 2017 Gao et al2017 Parashar et al 2017 Sharma et al 2018 Fowerset al 2018) However these framework are mostly designedfor the common regular patterns that have dominateddeep learning from almost its conception As such thesetools inherently had no incentive to deal with the form ofirregularities that the emerging NAS (Zoph amp Le 2017Cortes et al 2017 Zoph et al 2018 Liu et al 2019aCai et al 2019 Real et al 2019 Zhang et al 2019) andRandom Networks (Xie et al 2019 Wortsman et al 2019)

bring about This paper in contrast focuses on this emergentclass that breaks the regularity convention and aims to enabletheir execution on memory constrained edge devices

Scheduling and tiling for neural networks Whileprior works on scheduling (Lee et al 2003 Keszligler ampBednarski 2001 Wilken et al 2000) focus on classicalcomputing workloads there have been limited study aboutthe implications of scheduling in the neural networks domainThere is also a significant body of work on schedulingoperations on hardware accelerators (Abdelfattah et al2018) that also considers tiling (Chen et al 2018 Vasilacheet al 2018 Liu et al 2019b Ahn et al 2020) Howevergraph scheduling for irregularly wired neural networkspecially with memory constraints is an emerging problemwhich is the focus of this paper

Graph rewriting for neural networks It has been acommon practice to rewrite parts of the graph using rule-based (Abadi et al 2016 Paszke et al 2019 Rotem et al2018 Cyphers et al 2018 NVIDIA 2017) or systematicapproaches to expose parallelism and make models moretarget-aware (Jia et al 2018 2019 Schosser amp Geiszlig 2007)While these approaches may alleviate the complexity of thegraph and reduce the peak memory footprint as a side effectthese frameworks do not explore and are not concerned withscheduling Our work exclusively explores graph rewritingin the context of improving the peak memory footprint

Optimizing neural networks There are different opti-mization techniques that aim to simplify the neural networkindifferent dimensions Sparsificationcompression (LeCunet al 1990 Han et al 2015 Zhu amp Gupta 2018 Anwaret al 2017) quantization (Han et al 2016b Courbariauxet al 2016 Zhou et al 2016 Mishra amp Marr 2018 Esseret al 2020) activation compression (Jain et al 2018) andkernel modifications reduce the complexity of the individualoperations or remove certain computations However ourfocus the problem of memory-aware graph scheduling stillremains orthogonal to these inspiring efforts

6 CONCLUSION

As the new forms of connectivity emerges in neural networksthere is a need for system support to enable their effectiveuse specially for intelligence at the edge This paper tookan initial step toward orchestrating such network understringent physical memory capacity constraints We devisedsignatures to enable dynamic programming and adaptive softbudgeting to make the optimization tractable Even more anidentity graph writing was developed to further the potentialfor gains The encouraging results for a set of emergent net-works suggest that there is significant potential for compilertechniques that enables new forms of intelligent workloads

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightfulcomments We also thank Harris Teague and Jangho Kimfor the fruitful discussions and feedbacks on the manuscriptand Parham Noorzad for his help with the mathematicalformulations to calculate the complexity of the algorithms

REFERENCES

Abadi M Barham P Chen J Chen Z Davis A Dean JDevin M Ghemawat S Irving G Isard M et al TensorflowA system for large-scale machine learning In OSDI 2016

Abdelfattah M S Han D Bitar A DiCecco R OrsquoConnellS Shanker N Chu J Prins I Fender J Ling A C et alDLA Compiler and FPGA overlay for neural network inferenceacceleration In FPL 2018

Ahn B H Pilligundla P and Esmaeilzadeh H ChameleonAdaptive code optimization for expedited deep neuralnetwork compilation In ICLR 2020 URL httpsopenreviewnetforumid=rygG4AVFvH

Anwar S Hwang K and Sung W Structured pruning of deepconvolutional neural networks JETC 2017

Belady L A A study of replacement algorithms for a virtual-storage computer IBM Systems Journal 1966

Bellman R Dynamic programming Science 1966

Bellman R E Dynamic programming treatment of the travelingsalesman problem 1961

Bernstein D Rodeh M and Gertner I On the complexity ofscheduling problems for parallelpipelined machines TC 1989

Bruno J and Sethi R Code generation for a one-register machineJACM 1976

Cai H Zhu L and Han S ProxylessNAS Direct neural architec-ture search on target task and hardware In ICLR 2019 URL httpsopenreviewnetforumid=HylVB3AqYm

Chen T Moreau T Jiang Z Zheng L Yan E Shen H CowanM Wang L Hu Y Ceze L et al Tvm An automated end-to-end optimizing compiler for deep learning In OSDI 2018

Chen Y Luo T Liu S Zhang S He L Wang J Li LChen T Xu Z Sun N et al Dadiannao A machine-learningsupercomputer In MICRO 2014

Chen Y-H Krishna T Emer J S and Sze V EyerissAn energy-efficient reconfigurable accelerator for deepconvolutional neural networks JSSC 2016

Cortes C Gonzalvo X Kuznetsov V Mohri M and YangS AdaNet Adaptive structural learning of artificial neuralnetworks In ICML 2017

Courbariaux M Hubara I Soudry D El-Yaniv R and BengioY Binarized neural networks Training deep neural networkswith weights and activations constrained to +1 or -1 arXiv 2016URL httpsarxivorgpdf160202830pdf

Cyphers S Bansal A K Bhiwandiwalla A Bobba JBrookhart M Chakraborty A Constable W Convey CCook L Kanawi O et al Intel nGraph An intermediate repre-sentation compiler and executor for deep learning arXiv 2018URL httpsarxivorgpdf180108058pdf

Dean J Machine learning for systems and systems for machinelearning In NIPS Workshop on ML Systems 2017

Elthakeb A T Pilligundla P Yazdanbakhsh A Kinzer S andEsmaeilzadeh H Releq A reinforcement learning approachfor deep quantization of neural networks arXiv 2018 URLhttpsarxivorgpdf181101704pdf

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.

Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.

Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.

Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.

Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.

Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.

Kahn, A. B. Topological sorting of large networks. CACM, 1962.

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.

Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.

Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.

Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.

NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Plump, D. Term graph rewriting. In Handbook Of Graph Grammars And Computing By Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.

Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.

Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.

Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.

Zhu, M. and Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.


A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

(a) ImageNet top-1 accuracy vs. number of multiply-and-accumulate operations (billions)

(b) ImageNet top-1 accuracy vs. number of parameters (millions)

Figure 14: ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or the same number of parameters than regular topology neural networks (top left is better).

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

Peak memory footprint (KB), smaller is better:

Benchmark                     TensorFlow Lite   Dynamic Programming   Dynamic Programming + Graph Rewriting
                                                + Memory Allocator    + Memory Allocator
DARTS (ImageNet), Normal Cell      1656               903                  753
SwiftNet (Human Presence), Cell A   552               251                  226
SwiftNet (Human Presence), Cell B   194                82                   72
SwiftNet (Human Presence), Cell C    70                33                   20
RandWire (CIFAR10), Cell A          645               459                  459
RandWire (CIFAR10), Cell B          330               260                  260
RandWire (CIFAR100), Cell A         605               359                  359
RandWire (CIFAR100), Cell B         350               280                  280
RandWire (CIFAR100), Cell C         160               115                  115

Figure 15: Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.

C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption μ*, it is sufficient to keep just one schedule–peak memory pair (s_i, μ_i) in S_T^i for each zero-indegree set z_i, and to append subsequent nodes on top of s_i to get s_{i+1} in each search step.

Proof. If i = 0, the optimal s_0 is an empty sequence and μ_0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) v_i constitutes s*, substituting u*_i ∈ z_i, and achieves μ*. In such a case, let v_i be replaced with the (optimal) u*_i, which will result in μ_peak ← min(μ_i + ∏ v_i.shape, μ_i + ∏ u*_i.shape), and μ_{i+1} is calculated by deducting ∏ p_i.shape, ∀ p_i ∈ (u_i.preds ∩ zero-outdegree(s_{i+1}, G)). By recursively applying u_k for the rest of the search steps k, the algorithm should find an alternative sequence s*′ with μ*′ ≤ μ*, due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.

D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring S_T and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we invent G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.

Figure 16: Topology of G to demonstrate the upper bound complexity of each algorithm: a single entry node A, a single exit node Z, and independent branches (B, C, D, ..., W, X, Y) in between.

First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores S_T. Since there is a single entry node and a single exit node, there will be |V|−2 remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes (|V|−2)! and the overall complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that gets memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step i would be i−1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 would explore |V|−2 nodes, and the search steps 2 to |V|−1 would iterate over |V|−1−i nodes. Summarizing, this would yield:

1 + 1×(|V|−2) + C(|V|−2, 1)×(|V|−3) + … + C(|V|−2, |V|−2)×0 + 1
= 1 + C(|V|−2, 0)×(|V|−2) + C(|V|−2, 1)×(|V|−3) + … + C(|V|−2, |V|−2)×0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)×(|V|−2−i)
= 2 + (|V|−2)×2^(|V|−3)
≤ (|V|−2)×2^(|V|−2)   for |V| ≥ 4
≤ |V|×2^|V|

where the third equality uses the identity Σ_{i=0}^{n} C(n, i)×(n−i) = n×2^(n−1) with n = |V|−2.

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V|×2^|V|). By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
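As a quick numeric sanity check of these bounds (not taken from the paper), the snippet below compares the growth of the exhaustive count (|V|−2)! with the dynamic-programming bound |V|×2^|V| for a few graph sizes.

```python
from math import factorial

for v in (8, 16, 24, 32):
    exhaustive = float(factorial(v - 2))   # candidate topological orders of G in Figure 16
    dp_bound = float(v * 2 ** v)           # upper bound on dynamic-programming operations
    print(f"|V|={v:2d}  (|V|-2)! = {exhaustive:.3e}  |V|*2^|V| = {dp_bound:.3e}")
```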


Figure 7: Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).

Due to the space limitation, we present the derivation of the algorithm complexity in the supplementary material.

3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-Based Scheduling

While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.

Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass shaped, because many NAS and Random Network Generators design cells with a single input and a single output and then stack them to form an hourglass-shaped topology. Wilken et al. (2000) show that during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach can reduce the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time will decrease from |V|×2^|V| to |V|×2^(|V|/N), so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.

As such, Figure 7 shows that this insight can be extended to our problem setting, where we can first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage partitions the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule s_g (conquer). Considering that the number of nodes in a subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
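The following Python sketch illustrates this divide-and-conquer flow under the stated assumptions; `schedule_subgraph` stands in for an optimal per-subgraph scheduler (for instance, the dynamic-programming scheduler), and the cell encoding is hypothetical.

```python
def divide_and_conquer_schedule(cells, schedule_subgraph):
    """Divide: take single-input/single-output cells in topological order.
    Conquer: schedule each cell with an optimal scheduler.
    Combine: concatenate the per-cell schedules."""
    final_schedule, overall_peak = [], 0
    for cell in cells:
        sched, peak = schedule_subgraph(cell)   # e.g. the dynamic-programming sketch
        final_schedule.extend(sched)
        # each per-cell peak already accounts for the single activation carried
        # across the cell boundary (the cell's own input node)
        overall_peak = max(overall_peak, peak)
    return final_schedule, overall_peak
```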

Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stage of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ*, while all other solutions can be discarded, setting some budget τ that is greater than or equal to μ* and pruning suboptimal schedules whose μ_peak exceeds τ can focus the search on a smaller search space S′_T ⊂ S_T while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increase τ) if a program fails due to stack overflow (= 'no solution' due to an overly aggressive pruning) and selling off excess memory (decrease τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme that cuts down the overall number of explored schedules while scheduling. Figure 8 illustrates the overall idea by first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), and then the implication of different τ on scheduling time in Figure 8(b).

Figure 8(a) depicts a certain point while scheduling G, where nodes G, H, F, and J can be scheduled.

(a) While both paths s1 and s2 lead to the same z′, their μ and μ_peak vary, and we can prune schedules that yield a higher μ_peak than a given budget τ. Numbers next to boxes or circles are μ, and numbers next to edges are μ_peak.

(b) Adaptive soft budgeting starts by setting a hard budget τ_max as the maximum value for the soft budget τ. It then conducts a binary search for a τ higher than τ* so that it finds a solution, yet not too high so that scheduling completes quickly.

Figure 8: Illustration of the adaptive soft budgeting: (a) shows how schedules are pruned, and (b) illustrates how the soft budget τ relates to the number of explored schedules.


In particular, the figure compares two possible solutions, s1 and s2, which schedule H→F and F→H, respectively, given τ = 36. While s1 and s2 both start from z with μ = 32, scheduling H leads to μ_peak = 32+3 (H) = 35, whereas scheduling F or J leads to μ_peak = 32+6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 will fail because their μ_peak = 38 exceeds 36. So, as long as we set the budget τ higher than μ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < μ* leads to no solution, because the optimal path would be pruned away.

Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to μ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)) to adaptively gain an idea of the range for τ. We use the peak memory footprint from this sequence as our hard budget τ_max, and in contrast we call the adaptively changing τ a soft budget. Since τ_max ≥ μ*, we know that any τ ≥ τ_max does not need to be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that first runs the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τ_new ← (τ_new + τ_old)/2) if it finds 'no solution' and decreases τ (τ_new ← τ_new/2) if a search step returns 'timeout' (search step duration exceeds T).

Algorithm 2 Adaptive Soft Budgeting
1: Input: graph G
2: Output: optimal schedule s*
3: τ_max ← μ(Kahn'sAlgorithm(G), G)              ▷ hard budget
4: τ_old, τ_new ← τ_max
5: flag ← 'no solution'
6: repeat
7:   ▷ binary search for τ: decrease τ if 'timeout'
8:   ▷ and increase τ if 'no solution'
9:   if flag is 'timeout' then
10:    τ_old ← τ_new, τ_new ← τ_new/2            ▷ simultaneous
11:  else if flag is 'no solution' then
12:    τ_old ← τ_new, τ_new ← (τ_new+τ_old)/2    ▷ simultaneous
13:  end if
14:  if flag is 'solution' then
15:    s* ← schedule                             ▷ optimal schedule
16:  end if
17: until flag is 'solution'
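A Python rendering of this loop is sketched below. The `schedule_with_budget` and `kahn_peak` callables are assumptions standing in for the budgeted dynamic-programming scheduler and the Kahn-ordering peak measurement; the scheduler is invoked once per iteration with the current soft budget τ and the per-step time limit T, as described in the text.

```python
def adaptive_soft_budgeting(graph, schedule_with_budget, kahn_peak, step_time_limit):
    """Binary search for a workable soft budget tau (sketch of Algorithm 2)."""
    tau_max = kahn_peak(graph)              # hard budget: no tau >= tau_max is explored
    tau_old = tau_new = tau_max
    while True:
        flag, schedule = schedule_with_budget(graph, tau_new, step_time_limit)
        if flag == 'solution':
            return schedule                  # first schedule found under the soft budget
        if flag == 'timeout':                # too little pruning: shrink the budget
            tau_old, tau_new = tau_new, tau_new / 2
        elif flag == 'no solution':          # pruned the optimal path: grow the budget
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2
```

The simultaneous tuple assignment mirrors the "simultaneous" updates in Algorithm 2, so τ_new is always computed from the values of τ_new and τ_old before the update.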

Channel-wise partitioning (concat + conv → partial conv + add): μ_peak = Σ_i size(x_i) + size(y) becomes max_i(size(x_i) + size(w_i ∗ x_i)).

Kernel-wise partitioning (concat + depthconv → partial depthconv + concat): μ_peak = Σ_i size(x_i) + size(y) becomes max_i(size(x_i) + size(y_i)).

Figure 9: Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively (x_i: i-th input; y: output; w_{j,i}: j-th channel of the i-th kernel).

The binary search stops as soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with τ.

3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint

Reorganizing the computational graph of irregularly wired neural networks may lead to a significant reduction in the peak memory footprint μ_peak during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of input activations and to expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce μ_peak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.

Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint μ_peak occurs when the output y ∈ R^n is being computed while the concatenated branches of the input x ∈ R^n are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding, seemingly excessive, memory cost. To this end, we channel-wise partition the conv that follows the concat so that each partitioned conv can be computed as soon as its input x_i becomes available.


Equations 3–6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates and sums up the result of convolving channels in conv. However, using the distributive property of Σ_i and ∗, these transform to a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ_i size(x_i) + size(y) to max_i(size(w_i ∗ x_i)) + size(y), which becomes more effective when there are more incoming edges to concat.

y = [ Σ_i w_{1,i} ∗ x_i , … , Σ_i w_{m,i} ∗ x_i ]      (concat + conv)          (3)
  = Σ_i [ w_{1,i} ∗ x_i , … , w_{m,i} ∗ x_i ]                                   (4)
  = Σ_i [ w_{1,i} , … , w_{m,i} ] ∗ x_i                                         (5)
  = Σ_i w_i ∗ x_i                                      (partial conv + add)     (6)

where w_i = [w_{1,i}, …, w_{m,i}] denotes the channel-wise partition of the kernels corresponding to branch x_i.
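To see Equations 3–6 at work numerically, the following NumPy check (only an illustration, not the paper's code) verifies that a convolution over concatenated inputs equals the sum of partial convolutions over the branches, using a 1×1 convolution expressed as a matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
c1, c2, m, hw = 3, 5, 4, 7 * 7              # branch channels, kernels, spatial size
x1 = rng.standard_normal((c1, hw))          # branch 1 activations
x2 = rng.standard_normal((c2, hw))          # branch 2 activations
w = rng.standard_normal((m, c1 + c2))       # m kernels over the concatenated channels

# concat + conv: y = w * [x1; x2]
y_concat = w @ np.concatenate([x1, x2], axis=0)

# partial conv + add: split w channel-wise and never materialize the concat
y_partial = w[:, :c1] @ x1 + w[:, c1:] @ x2

assert np.allclose(y_concat, y_partial)
```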

Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat+conv case, the peak memory footprint μ_peak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among different kernels to kernel-wise partition the depthconv that follows the concat, so that each input x_i is computed into a smaller feature map without residing in the memory for too long. As such, Equations 7–8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce μ_peak significantly.

y = [ w_1 ∗ x_1 , … , w_n ∗ x_n ]              (concat + depthconv)             (7)
  = [ [w_1 ∗ x_1] , … , [w_n ∗ x_n] ]          (partial depthconv + concat)     (8)
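Similarly, a small NumPy check (again an illustration) confirms Equations 7–8: a depthwise convolution applied to a concatenation equals the concatenation of per-branch depthwise convolutions, shown here with 1×1 depthwise kernels, i.e., per-channel scaling.

```python
import numpy as np

rng = np.random.default_rng(1)
c1, c2, hw = 3, 5, 7 * 7
x1 = rng.standard_normal((c1, hw))
x2 = rng.standard_normal((c2, hw))
w = rng.standard_normal(c1 + c2)             # one 1x1 kernel per channel

# concat + depthwise conv
y_concat = w[:, None] * np.concatenate([x1, x2], axis=0)

# partial depthwise conv + concat
y_partial = np.concatenate([w[:c1, None] * x1, w[c1:, None] * x2], axis=0)

assert np.allclose(y_concat, y_partial)
```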

Implementation. Following the general practice of using pattern-matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph that can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
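A minimal sketch of such a pattern-matching pass is shown below; the node dictionary format and the single rule are assumptions made for illustration, and weight splitting plus dead-node elimination are left to later passes.

```python
def rewrite_concat_conv(graph):
    """Replace concat->conv patterns with partial convs followed by an add.

    `graph` maps node ids to dicts with 'op' and 'inputs'; the sketch only
    records the structural substitution.
    """
    new_nodes = {}
    for nid, node in graph.items():
        producer = graph.get(node['inputs'][0]) if node['inputs'] else None
        if node['op'] == 'conv' and producer and producer['op'] == 'concat':
            partials = []
            for i, branch in enumerate(producer['inputs']):
                pid = f'{nid}_partial{i}'
                new_nodes[pid] = {'op': 'partial_conv', 'inputs': [branch]}
                partials.append(pid)
            new_nodes[nid] = {'op': 'add', 'inputs': partials}
        else:
            new_nodes[nid] = node            # unmatched nodes are copied as-is;
    return new_nodes                         # dead concats go to a later DCE pass
```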

Table 1: Specification of the networks used for evaluation.

NETWORK    TYPE   DATASET    MAC       WEIGHT    TOP-1 ACCURACY
DARTS      NAS    ImageNet   574.0M    4.7M      73.3%
SwiftNet   NAS    HPD        57.4M     249.7K    95.1%
RandWire   RAND   CIFAR10    111.0M    1.2M      93.6%
RandWire   RAND   CIFAR100   160.0M    4.7M      74.5%

4 EVALUATION

We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.

4.1 Methodology

Benchmarks and datasets. Table 1 lists the details of the networks, representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, and only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (MAC), number of parameters (WEIGHT), and top-1 accuracy on their respective datasets.

4.2 Experimental Results

Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph.

¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc

Figure 10: Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy); dynamic programming + memory allocator averages 1.68×, and adding graph rewriting raises the average to 1.86× (higher is better).

Figure 11: Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy), swept over 32KB, 64KB, 128KB, and 256KB on-chip memories; N/A marks cases that already fit on-chip, and in some cases only SERENITY fits on-chip or removes off-chip communication entirely.

In addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.

Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (N/A in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort of reducing the memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, and hence the power consumption and inference speed.

Improvement from dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory allocator.

(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB); graph rewriting brings a 25.1KB reduction in peak memory footprint with the memory allocator.

(b) Memory footprint without the memory allocator; graph rewriting brings a 12.5KB reduction in peak memory footprint.

Figure 12: Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).

The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB → 250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB → 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB → 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
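The footprint curves of Figure 12(b) can be recomputed from a schedule by summing the sizes of live activations at each step; a minimal sketch, assuming an activation dies once all of its consumers have executed:

```python
def footprint_profile(schedule, succs, size):
    """Sum of live activation sizes over a schedule (cf. Figure 12(b))."""
    live, done, profile = set(), set(), []
    for node in schedule:
        live.add(node)                                   # node's output becomes live
        profile.append(sum(size[x] for x in live))       # footprint while node executes
        done.add(node)
        # an activation dies once all of its consumers have executed
        live = {x for x in live if not succs[x] or any(s not in done for s in succs[x])}
    return profile                                       # peak footprint = max(profile)
```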

Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.

Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists the different combinations of algorithms, the number of


Figure 13: Scheduling time evaluation for SERENITY, in seconds per cell, for dynamic programming + memory allocator and dynamic programming + graph rewriting + memory allocator (means of 40.6 secs and 48.8 secs, respectively).

Table 2: Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.

GRAPH REWRITING   ALGORITHM   NODES AND PARTITIONS   SCHEDULING TIME
✗                 ①           62 = 62                N/A
✗                 ①+②         62 = 21+19+22          565 secs
✗                 ①+②+③       62 = 21+19+22          37.9 secs
✓                 ①           92 = 92                N/A
✓                 ①+②         92 = 33+28+29          7.2 hours
✓                 ①+②+③       92 = 33+28+29          111.9 secs

nodes and partitions, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, the additional application of the ② divide-and-conquer (①+②) leads to a measurable scheduling time: 565.3 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (①+②+③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.

5 RELATED WORKS

The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are designed for the common regular patterns that have dominated deep learning almost from its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.

Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially under memory constraints, is an emerging problem, which is the focus of this paper.

Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.

Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.

6 CONCLUSION

As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.

ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.


Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

A COMPARISON BETWEEN IRREGULARLYWIRED NEURAL NETWORKSAND CONVENTIONAL REGULARTOPOLOGY NEURAL NETWORKS

Multiply-and-accumulate (Billions)

Top-

1 Im

ageN

et A

ccur

acy

()

85

65

70

75

80

200 10 30 40

DPN-131

Inception V1MobileNet

ShuffleNet

Inception V2

Inception V3Xception ResNet-152

SENet

AmoebaNet-A

ReNeXt-101PolyNetInception ResNet V2

Inception V4

NASNet-ANASNet-B

RandWire

AmoebaNet-A

AmoebaNet-B

RandWire

irregularly wired neural networksregular topology neural networks

irregularly wired neural networksshow better performance for

same amount of compute thanregular topology neural networks

top left means is better

(a) ImageNet accuracy vs number of multiply-and-accumulate

Number of Parameters (Millions)

Top-

1 Im

ageN

et A

ccur

acy

()

85

65

70

75

80

800 40 100 140

DPN-131

irregularly wired neural networks

Inception V1MobileNetShuffleNet

Inception V2

Inception V3Xception

ResNet-152

SENet

AmoebaNet-C

ReNeXt-101

PolyNetInception ResNet V2Inception V4

NASNet-A

NASNet-A

RandWire

AmoebaNet-A

RandWire

regular topology neural networks

irregularly wired neural networksshow better performance for

same number of parameters thanregular topology neural networks

top left means is better

6020 120

NASNet-A

(b) ImageNet accuracy vs number of parameters

Figure 14 ImageNet accuracy vs number of multiply-and-accumulate or parameters where irregularly wired neural networksshow higher performance for same amount of compute or numberof parameters than regular topology neural networks

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10Figure 15 provides the raw numbers of the peak memory foot-print for the benchmark irregularly wired neural networks

1656

552

194

70

645

330 60

5

350

160

903

251

82 33

459

260 359

280

115

753

226

72 20

459

260 359

280

115

0

500

1000

1500

2000

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Peak

Mem

ory

Foot

prin

t (K

B)

TensorFow LiteDynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

Smaller the better

Peak Memory Footprint (KB)

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell BSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Figure 15 Peak memory footprint of running irregularly wiredneural networks on SERENITY and TensorFlow Lite

C PROOF FOR OPTIMAL PEAK MEMORYFOOTPRINT FROM THE DYNAMICPROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamicprogramming-based scheduling algorithm

THEOREM 1 In order to find a schedule slowast with an optimalpeak memory consumption microlowast it is sufficient to keep justone schedule-peak memory pair (si zi) in STi for eachzero-indegree set zi and to append subsequent nodes on topof si to get si+1 in each search step

Proof If i=0 the optimal s0 is an empty sequence and micro0

must be 0 On the other hand if ige 1 assume that (subop-timal) vi constitutes slowast substituting ulowasti isinzi and achieves microlowastIn such case let vi be replaced with (optimal) ulowasti which willresult in micropeak larr min(microi +

prodvishapemicroi +

produlowasti shape)

and microi+1 is calculated by deductingprodpishape forallpi isin

(uipredscapzero-outdegree(si+1G)) By recursively apply-ing uk for rest of the search steps k the algorithm shouldfind an alternative sequence slowastprime with microlowastprimelemicrolowast due to the minoperator above contradicting the original assumption on theoptimality of slowast Therefore our algorithm finds a schedulewith an optimal peak memory consumption

D COMPLEXITY ANALYSIS OFTHE DYNAMIC PROGRAMMING-BASEDSCHEDULING AND PROOF

We compare the complexity of exhaustively exploring STand our dynamic programming-based scheduling Whilethe algorithm both lists candidate schedules and calculatestheir peak memory footprint we consider the peak memoryfootprint calculation as one operation while deriving thecomplexity In order to visualize the analysis we invent Gin Figure 16 to demonstrate the upper bound complexity ofeach algorithm It has a single entry node and a single exitnode A and Z respectively and all other nodes constituteindependent branches between the entry and the exit node

A

D W

Z

CB X Y

GGraph

hellip

Figure 16 Topology of G to demonstrate the upper boundcomplexity of each algorithm

First we demonstrate the complexity of the recursivetopological sorting that exhaustively explores ST Sincethere is a single entry node and a single exit node therewill be |V minus 2| remaining nodes and these nodes can bescheduled independently of one another thereby the numberof candidate schedules become 〈|V minus 2|〉 and the overall

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

complexity becomesO(|V |) where |V | denotes the numberof nodes On the other hand for the dynamic programmingwe calculate the number of candidates by utilizing the num-ber of schedules that gets memoized Our memoization takesadvantage of the zero-indegree sets z for each search step

For the first and the last search steps we assume that we havea single entry node and a single exit node On the other handsince the number of nodes scheduled in search step iwouldbe iminus1 the maximum number of entries for memoizationis(|V |minus2

iminus1) On top of this each step would make an iteration

over the set of candidate nodes to discover the next searchsteprsquos z Therefore search step 1 would explore |V | minus 2nodes and the search steps 2 to |V |minus 1 would iterate over|V |minus1minusi nodes Summarizing this would yield

1+1times(|V |minus2)+

(|V |minus2

1

)times(|V |minus3)+

+

(|V |minus2

|V |minus2

)times0+1

=1+

(|V |minus2

0

)times(|V |minus2)+

(|V |minus2

1

)times(|V |minus3)+

+

(|V |minus2

|V |minus2

)times0+1

=2+

|V |minus2sumi=0

(|V |minus2

i

)times(|V |minus2minusi)

=2+(|V |minus2)times2|V |minus3

le(|V |minus2)times2|V |minus2 for |V |ge4

le|V |times2|V |

As a result we can see that our dynamic programming-basedscheduling algorithm is bounded by O(|V | times 2|V |) Byusing Stirlingrsquos approximation on the complexity of therecursive topological sorting we can prove that the dynamicprogramming-based scheduling algorithm should besignificantly faster than the recursive topological ordering

Page 7: Ordering Chaos: Memory-Aware Scheduling of Irregularly ...ORDERING CHAOS: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS FOR EDGE DEVICES ByungHoonAhn1 yJinwonLee2 JamieMenjayLin

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

figure compares two possible solutions s1 and s2 whichschedules Hrarr F and Frarr H respectively given τ = 36While s1 and s2 both starts from z with micro=32 schedulingH leads to micropeak = 32+3 (H) = 35 whereas scheduling F

or J leads to micropeak =32+6 (F or J) =38 Therefore sincewe assume τ = 36 s2 and s3 will fail because micropeak = 38for s2 and s3 exceeds 36 So as long as we set the budgetτ higher than microlowast the scheduler still finds a single optimalsolution while avoiding many suboptimal paths On theother hand too small a τ ltmicrolowast leads to no solution becausethe optimal path would be pruned away

Having established the possibility of pruning our questionboils down to discovering τ that is greater or equal to microlowast

which we call an optimal budget τlowast yet close enough toshrink the search space effectively Figure 8(b) and Algo-rithm 2 summarizes the proposed adaptive soft budgetingSince we start with no information about the approximaterange for τ we resort to a commonly used topologicalordering algorithm called Kahnrsquos algorithm (Kahn 1962)(O(|V |+|E|)) to adaptively gain idea of the range for τ Weuse the peak memory footprint from this sequence and useit as our hard budget τmax and in contrast we call adaptivelychanging τ as a soft budget Since τmaxgemicrolowast we know thatany τgeτmax do not need to be explored Having this upperbound for the search adaptive soft budgeting implementsa binary search to first run the scheduling algorithm with τand T as input where T is an hyperparameter that limits thescheduling time per search step The binary search increasesτ (τnew larr (τnew + τold)2) if it finds rsquono solutionrsquo anddecreases τ (τnewlarrτnew2) if a search step returns rsquotimeoutrsquo(search step duration exceeds T ) The binary search stops as

Algorithm 2 Adaptive Soft Budgeting1 Input graph G2 Output optimal schedule slowast

3 τmaxlarrmicro(KahnrsquosAlgorithm(G)G) hard budget4 τoldτnewlarrτmax

5 flaglarr rsquono solutionrsquo6 repeat7 binary search for τ decrease τ if rsquotimeoutrsquo8 and increase τ if rsquono solutionrsquo9 if flag is rsquotimeoutrsquo then

10 simultaneous11 τoldlarrτnew τnewlarrτnew212 else if flag is rsquono solutionrsquo then13 simultaneous14 τoldlarrτnew τnewlarr(τnew+τold)215 end if16 if flag is rsquosolutionrsquo then17 slowastlarrschedule optimal schedule18 end if19 until flag is rsquosolutionrsquo

micropeak = Σsize(xi) + size(y) micropeak = max(size(xi) + size(wixi) )

micropeak = Σsize(xi) + size(y) micropeak = max(size(xi) + size(y) )

Channel-wisePartitioning

Kernel-wisePartitioning

=

=

wij

concat

conv

x1 x2 xn

w1hellipwm

y

add

partialconv w1

x1 x2xn

w2 wn

y

concat

depth-conv

x1 x2 xn

w1hellipwn

y

concat

partialdepth-conv w1

x1 x2xn

w2 wn

y

yxi ith Input Output jth Channel of ith Kernel

x

x

Figure 9 Illustration of the graph rewriting patterns channel-wisepartitioning and kernel-wise partitioning can reduce the memorycost of convolution and depthwise convolution respectively

soon as it finds a schedule (rsquosolutionrsquo) and this method usingbinary search is guaranteed to work due to the monotonicallyincreasing number of explored schedules with τ

33 Identity Graph Rewriting Improving theSearch Space for Better Peak Memory Footprint

Reorganizing the computational graph of the irregularlywired neural networks may lead to significant reduction in thepeak memory footprint micropeak during computation For exam-ple it is notable that large stream of NAS-based works (Liuet al 2019a Zhang et al 2019) rely on extensive use of con-catenation as a natural approach to merge information frommultiple branches of the input activations and expand thesearch space of the neural architectures However concatena-tion with many incoming edges may prolong the liveness ofthe input activation and increase the memory pressure whichis unfavorable especially for resource constrained scenariosTo address this issue we propose identity graph rewritingto effectively reduce micropeak around the concatenation whilekeeping the arithmetic outputs identical To this end wepresent two main examples of the graph patterns in irregularlywired neural networks that benefits from our technique

Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint µpeak occurs when the output y ∈ ℝⁿ is being computed while the concatenated branches of the input x ∈ ℝⁿ are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding, seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat, so that each partitioned conv can be computed as soon as its input xi becomes available.


Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates over and sums up the result of convolving the channels in conv. However, using the distributive property of Σ_i and ∗, these transform into a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes the concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ xi + y to max(wi ∗ xi) + y, which becomes more effective when there are more incoming edges to the concat.

y = \left[\sum_i w_{1i} * x_i, \; \dots, \; \sum_i w_{mi} * x_i\right]    (concat + conv)    (3)

  = \sum_i \left[w_{1i} * x_i, \; \dots, \; w_{mi} * x_i\right]    (4)

  = \sum_i \left[w_{1i}, \; \dots, \; w_{mi}\right] * x_i    (5)

  = \sum_i w_i * x_i    (partial conv + add)    (6)
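The identity in Equations 3-6 can be checked numerically. The sketch below is only an illustration: it models conv as a 1x1 convolution, so convolution over the concatenated channels reduces to a matrix product and the channel-wise partition of the kernel becomes a column split; it is not SERENITY's implementation.

import numpy as np

rng = np.random.default_rng(0)

# Three input branches with 2, 3, and 4 channels over 8x8 spatial positions.
branch_channels = [2, 3, 4]
xs = [rng.standard_normal((c, 8 * 8)) for c in branch_channels]

# A 1x1 conv with m = 5 output channels is an (m, total_channels) matrix.
m = 5
w = rng.standard_normal((m, sum(branch_channels)))

# Original pattern: concatenate the branches, then convolve (Equation 3).
y_concat_conv = w @ np.concatenate(xs, axis=0)

# Rewritten pattern: partition w channel-wise and add the partial convs (Equation 6).
ws = np.split(w, np.cumsum(branch_channels)[:-1], axis=1)
y_partial_add = sum(wi @ xi for wi, xi in zip(ws, xs))

# The rewrite is an identity: both orders of computation give the same output.
assert np.allclose(y_concat_conv, y_partial_add)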

Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the concat+conv case above, the peak memory footprint µpeak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among different kernels to kernel-wise partition the depthconv that follows the concat, so that each input xi is computed into smaller feature maps without residing in the memory too long. As such, Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce µpeak significantly.

y = \left[w_1 * x_1, \; \dots, \; w_n * x_n\right]    (concat + depthconv)    (7)

  = \left[[w_1 * x_1], \; \dots, \; [w_n * x_n]\right]    (partial depthconv + concat)    (8)
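The kernel-wise rewrite of Equations 7-8 can be sanity-checked in the same spirit, again with a 1x1 depthwise convolution (a per-channel scaling) as an illustrative simplification rather than the actual kernel.

import numpy as np

rng = np.random.default_rng(1)

# Two branches; a 1x1 depthwise conv scales each channel by its own weight.
xs = [rng.standard_normal((3, 8 * 8)), rng.standard_normal((4, 8 * 8))]
ws = [rng.standard_normal((3, 1)), rng.standard_normal((4, 1))]

# Original pattern: concat, then depthwise conv (Equation 7).
y_concat_depthconv = np.concatenate(ws, axis=0) * np.concatenate(xs, axis=0)

# Rewritten pattern: partial depthwise conv per branch, then concat (Equation 8).
y_partial_concat = np.concatenate([wi * xi for wi, xi in zip(ws, xs)], axis=0)

assert np.allclose(y_concat_depthconv, y_partial_concat)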

Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph that can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
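As a rough illustration of how such a pass might look, the toy rewriter below walks a dictionary-based graph IR and replaces every concat feeding a conv with partial convs followed by an add. The IR layout, the node names, and the partial_conv operator are hypothetical placeholders, not the representation used by SERENITY or any existing framework, and the channel-wise split of the conv weights (Equation 5) is assumed to happen elsewhere.

def rewrite_concat_conv(nodes):
    # nodes: name -> {'op': str, 'inputs': [names]}; assumes the concat
    # feeds only this conv, so it can be removed after the rewrite.
    rewritten = dict(nodes)
    for name, node in nodes.items():
        if node['op'] != 'conv':
            continue
        (src,) = node['inputs']                  # single activation input
        if nodes[src]['op'] != 'concat':
            continue
        partials = []
        for i, branch in enumerate(nodes[src]['inputs']):
            pname = f'{name}_partial{i}'         # one partial conv per branch
            rewritten[pname] = {'op': 'partial_conv', 'inputs': [branch]}
            partials.append(pname)
        # Accumulate the partial results; the concat and its long-lived
        # concatenated buffer disappear from the graph.
        rewritten[name] = {'op': 'add', 'inputs': partials}
        del rewritten[src]
    return rewritten

graph = {
    'x0': {'op': 'input', 'inputs': []},
    'x1': {'op': 'input', 'inputs': []},
    'x2': {'op': 'input', 'inputs': []},
    'cat': {'op': 'concat', 'inputs': ['x0', 'x1', 'x2']},
    'y': {'op': 'conv', 'inputs': ['cat']},
}
print(rewrite_concat_conv(graph))  # 'cat' is gone; 'y' becomes an add of partial convs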

Table 1. Specification of the networks used for evaluation.

NETWORK    TYPE   DATASET    # MAC     # WEIGHT   TOP-1 ACCURACY
DARTS      NAS    ImageNet   574.0M    4.7M       73.3%
SwiftNet   NAS    HPD        57.4M     249.7K     95.1%
RandWire   RAND   CIFAR10    111.0M    1.2M       93.6%
RandWire   RAND   CIFAR100   160.0M    4.7M       74.5%

4 EVALUATION

We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.

4.1 Methodology

Benchmarks and datasets. Table 1 lists the details of the networks used for evaluation, representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND): DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, and only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (# MAC), number of parameters (# WEIGHT), and top-1 accuracy on their respective datasets.

4.2 Experimental Results

Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of the reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph.

¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc


[Figure 10 omitted: per-benchmark bar chart of the reduction in peak memory (higher is better) for Dynamic Programming+Memory Allocator and Dynamic Programming+Graph Rewriting+Memory Allocator relative to TensorFlow Lite; the geometric mean is 1.68× with the scheduler alone and 1.86× with graph rewriting.]

Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).

[Figure 11 omitted: per-benchmark reduction in off-chip memory communication for on-chip memory sizes of 32KB, 64KB, 128KB, and 256KB; NA marks cases that already fit on-chip for both, and some cases are marked where only SERENITY fits on-chip and removes off-chip communication entirely.]

Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).

In addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.

Improvement in off-chip memory communication. We also show how SERENITY affects off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication, in order to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (NA in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort of reducing the memory footprint is also effective in reducing off-chip memory communication in systems with a memory hierarchy, and hence the power consumption and inference speed.
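For reference, Belady's clairvoyant policy is straightforward to simulate once the access trace implied by the schedule is known. The sketch below counts off-chip fetches for a fully associative on-chip buffer of equal-sized activation blocks, which is a simplification of the actual evaluation setup.

def belady_offchip_traffic(trace, capacity):
    # trace: block ids in schedule order (known a priori); capacity: number
    # of equal-sized blocks that fit on-chip. On a miss, evict the resident
    # block whose next use is farthest in the future (or never comes).
    on_chip, misses = set(), 0
    for t, block in enumerate(trace):
        if block in on_chip:
            continue
        misses += 1                               # off-chip fetch
        if len(on_chip) < capacity:
            on_chip.add(block)
            continue
        def next_use(b):
            for j in range(t + 1, len(trace)):
                if trace[j] == b:
                    return j
            return float('inf')
        on_chip.remove(max(on_chip, key=next_use))
        on_chip.add(block)
    return misses

# Example: with 2 on-chip slots, the clairvoyant policy misses 4 times here.
print(belady_offchip_traffic(['a', 'b', 'a', 'c', 'a', 'b'], capacity=2))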

Improvement from dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory allocation.

[Figure 12 omitted: memory footprint over time for Dynamic Programming+Memory Allocator vs. Dynamic Programming+Graph Rewriting+Memory Allocator.]

(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 55.10KB).
(b) Memory footprint without the memory allocator.

Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).

The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (55.10KB→25.09KB), and the graph rewriting further improves this by 2.51KB (25.09KB→22.58KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 1.25KB (20.07KB→18.82KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
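The footprint curves of Figure 12(b) are simply the sum of live activation sizes at each step of the schedule. Below is a minimal sketch of that bookkeeping, assuming every activation is produced by a node in the schedule and can be freed once all of its consumers have executed (no allocator involved).

def footprint_profile(schedule, consumers, size):
    # schedule:  node names in execution order
    # consumers: node -> list of nodes that read its output
    # size:      node -> size of its output activation (e.g., in KB)
    executed, live, profile = set(), {}, []
    for node in schedule:
        executed.add(node)
        live[node] = size[node]
        # Free every activation whose consumers have all executed.
        for producer in list(live):
            if consumers[producer] and all(c in executed for c in consumers[producer]):
                del live[producer]
        profile.append(sum(live.values()))
    return profile  # the peak memory footprint is max(profile)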

Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes from graph rewriting. The results show that all of the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.

Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of algorithms, the number of


[Figure 13 omitted: per-benchmark scheduling times on a log scale for Dynamic Programming+Memory Allocator and Dynamic Programming+Graph Rewriting+Memory Allocator; the mean is 40.6 secs without graph rewriting and 48.8 secs with graph rewriting.]

Figure 13. Scheduling time evaluation for SERENITY.

Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. NA denotes infeasible within practical time.

GRAPH REWRITING   ALGORITHM   NODES AND PARTITIONS   SCHEDULING TIME
✗                 ①           62 = 62                NA
✗                 ①+②         62 = 21+19+22          565 secs
✗                 ①+②+③       62 = 21+19+22          37.9 secs
✓                 ①           92 = 92                NA
✓                 ①+②         92 = 33+28+29          7.2 hours
✓                 ①+②+③       92 = 33+28+29          111.9 secs

nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, additional application of the ② divide-and-conquer (①+②) leads to a measurable scheduling time: 565.3 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (①+②+③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.

5 RELATED WORKS

The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.

Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.

Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.

Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, still remains orthogonal to these inspiring efforts.

6 CONCLUSION

As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.


ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.
Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.
Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.
Bellman, R. Dynamic programming. Science, 1966.
Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.
Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.
Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.
Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.
Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.
Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.
Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.
Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.
Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.
Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.


Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.
Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
Kahn, A. B. Topological sorting of large networks. CACM, 1962.
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.
Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.
Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.
Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.


A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

[Figure 14 omitted: scatter plots of top-1 ImageNet accuracy against compute and parameter count for irregularly wired and regular topology neural networks; top left is better.]

(a) ImageNet accuracy vs. number of multiply-and-accumulate operations.
(b) ImageNet accuracy vs. number of parameters.

Figure 14. ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

[Figure 15 omitted: per-benchmark peak memory footprints (KB) for TensorFlow Lite, Dynamic Programming+Memory Allocator, and Dynamic Programming+Graph Rewriting+Memory Allocator; smaller is better.]

Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.

C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here, we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption µ*, it is sufficient to keep just one schedule-peak memory pair (si, µi) in STi for each zero-indegree set zi, and to append subsequent nodes on top of si to get si+1 in each search step.

Proof. If i = 0, the optimal s0 is an empty sequence and µ0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) vi constitutes s*, substituting u*i ∈ zi, and achieves µ*. In such a case, let vi be replaced with the (optimal) u*i, which will result in µpeak ← min(µi + ∏ vi.shape, µi + ∏ u*i.shape), and µi+1 is calculated by deducting ∏ pi.shape, ∀pi ∈ (ui.preds ∩ zero-outdegree(si+1, G)). By recursively applying uk for the rest of the search steps k, the algorithm should find an alternative sequence s*′ with µ*′ ≤ µ*, due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.

D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring ST and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we invent G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.

[Figure 16 omitted: G has a single entry node A and a single exit node Z, with the remaining nodes (B, C, D, ..., W, X, Y) forming independent branches between them.]

Figure 16. Topology of G to demonstrate the upper bound complexity of each algorithm.

First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there will be |V|−2 remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes (|V|−2)! and the overall


complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that gets memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step i would be i−1, the maximum number of entries for memoization is \binom{|V|-2}{i-1}. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 would explore |V|−2 nodes, and the search steps 2 to |V|−1 would iterate over |V|−1−i nodes. Summarizing, this would yield:

1 + 1 \times (|V|-2) + \binom{|V|-2}{1} \times (|V|-3) + \cdots + \binom{|V|-2}{|V|-2} \times 0 + 1

= 1 + \binom{|V|-2}{0} \times (|V|-2) + \binom{|V|-2}{1} \times (|V|-3) + \cdots + \binom{|V|-2}{|V|-2} \times 0 + 1

= 2 + \sum_{i=0}^{|V|-2} \binom{|V|-2}{i} \times (|V|-2-i)

= 2 + (|V|-2) \times 2^{|V|-3}

\leq (|V|-2) \times 2^{|V|-2} \quad \text{for } |V| \geq 4

\leq |V| \times 2^{|V|}

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V| × 2^|V|). By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
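The closed form above can be checked numerically for small |V|; the short sketch below compares the memoized-candidate count against 2 + (|V|−2) × 2^(|V|−3) and the final bound.

from math import comb

def candidates_memoized(num_nodes):
    # 2 + sum_i C(|V|-2, i) * (|V|-2-i) for the graph G of Figure 16.
    n = num_nodes - 2   # independent nodes between the entry and exit node
    return 2 + sum(comb(n, i) * (n - i) for i in range(n + 1))

for v in range(4, 12):
    closed_form = 2 + (v - 2) * 2 ** (v - 3)
    assert candidates_memoized(v) == closed_form
    assert closed_form <= v * 2 ** v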

Page 8: Ordering Chaos: Memory-Aware Scheduling of Irregularly ...ORDERING CHAOS: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS FOR EDGE DEVICES ByungHoonAhn1 yJinwonLee2 JamieMenjayLin

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

Equation 3-6 detail the mathematical derivation of this sub-stitution Specifically as shown in Equation 3 each kerneliterates and sums up the result of convolving channels in convHowever using the distributive property of

sumi and lowast these

transform to summation of channel-wise partitioned convo-lution which we call partial conv This partial conv removesconcat from the graph leading to lower memory cost Asillustrated in Figure 9 the memory cost of same computationreduces from

sumxi+y tomax(wilowastxi)+y which becomes

more effective when there are more incoming edges to concat

y=[sum

i

w1ilowastxisumi

wmilowastxi]

(concat+conv) (3)

=sumi

[w1ilowastxiwmilowastxi

](4)

=sumi

[w1iwmi

]lowastxi (5)

=sumi

[wilowastxi

](partial conv+add) (6)

Kernel-wise partitioning (depthwise convolution)Depthwise convolution (depthconv) (Sifre amp Mallat 2014Howard et al 2017) has been shown to be effective inreducing computation yet achieve competitive performancehence its wide use in networks that target extreme efficiencyas its primary goal For concatenation (concat) followedby a depthwise convolution (depthconv) similar to aboveconcat+conv case peak memory footprint micropeak occurswhen the concatenated x is inside the memory and theresult y additionally gets saved to the memory before xis deallocated This time we leverage the independenceamong different kernels to kernel-wise partition thedepthconv that follows the concat so that each input xiis computed to smaller feature maps without residing inthe memory too long As such Equation 7-8 derives thissubstitution Equation 7 shows that every component in they is independent (different subscript index) and is viable forpartitioning In other words this rewriting simply exposesthe commutative property between depthconv and concatplus kernel-wise partitioning to reduce micropeak significantly

y=[w1lowastx1wnlowastxn

](concat+depthconv) (7)

=[[w1lowastx1][wnlowastxn]

](partial depthconv+concat)

(8)

Implementation Following the general practice of usingpattern matching algorithms in compilers (Lattner amp Adve2004 Rotem et al 2018 Jia et al 2019) we implementidentity graph rewriting using pattern matching to identifyregions of the graph which can be substituted to an operationwith lower computational cost Likewise we make use of thistechnique to identify regions that leads to lower memory cost

Table 1 Specification of the networks used for evaluationNETWORK TYPE DATASET MAC WEIGHT TOP-1

ACCURACY

DARTS NAS IMAGENET 5740M 47M 733SWIFTNET NAS HPD 574M 2497K 951RANDWIRE RAND CIFAR10 1110M 12M 936RANDWIRE RAND CIFAR100 1600M 47M 745

4 EVALUATION

We evaluate SERENITY with four representative irregularlywired neural networks graphs We first compare thepeak memory footprint of SERENITY against TensorFlowLite (Google) while using the same linear memory allocationscheme1 for both Furthermore we also experimentthe impact of such peak memory footprint reduction onoff-chip memory communication We also conduct anin-depth analysis of the gains from the proposed dynamicprogramming-based scheduler and graph rewriting usingSwiftNet Cell A (Zhang et al 2019) Lastly we study theimpact of adaptive soft budgeting on the scheduling time

41 Methodology

Benchmarks and datasets Table 1 lists the details ofthe networksndashrepresentative of the irregularly wired neuralnetworks from Neural Architecture Search (NAS) andRandom Network Generators (RAND)ndashused for evaluationDARTS (Liu et al 2019a) for ImageNet SwiftNet (Zhanget al 2019) for a dataset comprised of human presence or ab-sence (HPD) and RandWire (Xie et al 2019) for CIFAR10and CIFAR100 DARTS (Liu et al 2019a) is a gradient-based NAS algorithm In particular we focus on the learnednormal cell for image classification on ImageNet only thefirst cell because it has the highest peak memory footprint andthe reset of the network is just repeated stacking of the samecell following the practice in NASNet (Zoph et al 2018)SwiftNet (Zhang et al 2019) is network from NAS by target-ing human detection dataset RandWire (Xie et al 2019) arefrom Random Network Generators for image classificationon CIFAR10 and CIFAR100 The table also lists their datasetmultiply-accumulate count ( MAC) number of parameters( WEIGHT) and top-1 accuracy on their respective dataset

42 Experimental Results

Comparison with TensorFlow Lite Figure 10 evaluatesSERENITY over TensorFlow Lite on different cells of theaforementioned networks in terms of reduction in memoryfootprint The figures illustrate that SERENITYrsquos dynamicprogramming-based scheduler reduces the memory footprintby a factor of 168timeswithout any changes to the graph In

1TensorFlow Lite implements a linear memory allocator namedsimple memory arena httpsgithubcomtensorflowtensorflowblobmastertensorflowlitesimple memory arenacc

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

183x

220x

239x

209x

140x

127x 168x

125x

139x

168x220x

244x

270x 345x

140x

127x 168x

125x

139x 186x

000

100

200

300

400

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C Geomean

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Red

ucti

on i

n Pe

ak M

emor

yTensorFow LiteDynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

Higher the betterReduction in Peak Memory

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell B GeomeanSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Figure 10 Reduction in peak memory footprint of SERENITY

against TensorFlow Lite (no memory hierarchy)

192x 2

58x

251x

115x

108x

129x

108x

130x

152x192x 2

68x

125x

111x

131x

111x 161x

149x192x

356x

125x

119x

109x

108x 151x200x

135x

250x

182x

138x 176x

000

100

200

300

400

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C Geomean

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Redu

ctio

n in

Off

-chi

p

32KB 64KB 128KB 256KB

Reduction in Off-chip

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell B GeomeanSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Memory Communication

only

SERENITY fi

ts o

n-ch

ip

only

SERENITY fi

ts o

n-ch

ipon

ly SERENITY fi

ts o

n-ch

ip

only

SERENITY fi

ts o

n-ch

ipSERENITY removes off-chip communication

NA

NA

NA

NA

NA NA

Figure 11 Reduction in off-chip memory communication ofSERENITY against TensorFlow Lite (with memory hierarchy)

addition the proposed graph rewriting technique yields anaverage of 186times(extra 107) reduction in terms of peakmemory footprint The results suggest that SERENITY yieldssignificant reduction in terms of the peak memory footprintfor irregularly wired neural networks

Improvement in off-chip memory communication Wealso show how SERENITY affects the off-chip memorycommunication which largely affects both power andinference speed (Chen et al 2016 Gao et al 2017 Sharmaet al 2018) To this end Figure 11 sweeps different on-chipmemory configurations to measure the reduction in off-chipcommunication on systems with multi-level memoryhierarchy Since we know the entire schedule a priori weuse Beladyrsquos optimal algorithm (Belady 1966) also referredto as the clairvoyant algorithm for measuring the off-chipmemory communication to distill the effects of the proposedscheduling The results show that SERENITY can reducethe off-chip memory communication by 176times for a devicewith 256KB on-chip memory In particular while there werefew cases where peak memory footprint was already smallenough to fit on-chip (NA in figure) there were some caseswhere SERENITY eradicated the off-chip communicationby successfully containing the activations in the on-chipmemory while TensorFlow Lite failed to do so (marked infigure) This suggests that SERENITYrsquos effort of reducingmemory footprint is also effective in reducing the off-chipmemory communication in systems with memory hierarchyhence the power consumption and inference speed

Improvement from dynamic programming-based sched-uler and identity graph rewriting To demonstrate wherethe improvement comes from Figure 12 plots the memoryfootprint while running Swiftnet Cell A Figure 12(a) showsthe memory footprint of SERENITY with the memory

0

50

100

150

200

250

Mem

ory

Foot

prin

t (K

B)

Time

Dynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

251KB reduction in peak memoryfootprint with Memory Allocator

Memory Footprint (KB)

Time

(a) Memory footprint with the memory allocator (peak mem-ory footprint of TensorFlow Lite = 5510KB)

0

50

100

150

200

Mem

ory

Foot

prin

t (K

B)

Time

Dynamic ProgrammingDynamic Programming+Graph Rewriting

125KB reductionin peak memory footprint

Time

Memory Footprint (KB)

(b) Memory footprint without the memory allocator

Figure 12 Memory footprint while running SwiftNet Cell A withand without the memory allocator (red arrow denotes reduction)

allocation The figure shows that SERENITYrsquos dynamicprogramming-based scheduler brings significant improve-ment to the peak memory footprint (5510KBrarr2509KB)and the graph rewriting further improves this by 251KB(2509KBrarr2258KB) by utilizing patterns that alleviateregions with large memory footprint In order to focus onthe effect of the scheduler and graph rewriting Figure 12(b)presents the memory footprint of SERENITY without thememory allocation the sum of the activations while runningthe network The figure shows that the proposed schedulerfinds a schedule with the optimal (minimum) peak memoryfootprint without changes to the graph Then it shows thatthe proposed graph rewriting can further reduce the peakmemory footprint by 125KB (2007KBrarr1882KB) The re-sults suggest that the significant portion of the improvementcomes from the proposed dynamic programming-basedscheduler and the graph rewriting

Scheduling time of SERENITY Figure 13 summarizesthe (static) scheduling time taken for SERENITY to schedulethe networks Results show that the average schedulingtime is 406 secs without the graph rewriting and 488 secswith graph rewriting which the difference comes from theincrease in the number of nodes from graph rewriting Theresults show that all the above gains of SERENITY come atthe cost of less than one minute average extra compilationtime While the dynamic programming-based schedulingsuffers from an exponential time complexity SERENITYmanages to make the scheduling tractable through theproposed divide-and-conquer and adaptive soft budgeting

Speed up from divide-and-conquer and adaptive softbudgeting Table 2 summarizes the scheduling timeof SwiftNet (Zhang et al 2019) for different algorithmsto demonstrate the speed up from divide-and-conquerand adaptive soft budgeting techniques As such thetable lists different combination of algorithms number of

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

32s 57s

45s

278s 11

81s

151s

285s 744s

879s

406s

32s

421s

305s

393s 11

81s

151s

285s 744s

879s

488s

1

10

100

1000

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C Mean

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Sche

dulin

g Ti

me

(sec

onds

)

Dynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

Scheduling Time (seconds)

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell B MeanSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Figure 13 Scheduling time evaluation for SERENITY

Table 2 Comparison of the scheduling time for different algorithmsto schedule SwiftNet 1 2 and 3 represent dynamic program-ming divide-and-conquer and adaptive soft budgeting respectivelyNA denotes infeasible within practical time

GRAPH ALGORITHM NODES AND SCHEDULINGREWRITING PARTITIONS TIME

7 1 62 =62 NA7 1 + 2 62=211922 565 secs7 1 + 2 + 3 62=211922 379 secs

3 1 92=92 NA3 1 + 2 92=332829 72 hours3 1 + 2 + 3 92=332829 1119 secs

nodes and the corresponding scheduling time Straightfor-ward implementation of the aforementioned 1 dynamicprogramming-based scheduling leads to an immeasurablylarge scheduling time regardless of the graph rewritingHowever additional application of the 2 divide-and-conquer ( 1 + 2 ) leads to a measurable scheduling time5653 secs and 729 hours to schedule without and with thegraph rewriting respectively Furthermore we observethat further applying 3 adaptive soft budgeting ( 1 + 2 + 3 )significantly reduces the scheduling time 379 secs and 1119secs to schedule without and with the graph rewriting re-spectively Above results indicate that applying the proposedalgorithms leads to a scheduling time of practical utility

5 RELATED WORKS

The prevalence of neural networks has led to the developmentof several compilation frameworks for deep learning (Abadiet al 2016 Paszke et al 2019 Rotem et al 2018Cyphers et al 2018) However even industry grade toolsmostly focus on tiling and fine-grained scheduling ofmicro-operations on the conventional hardware (NVIDIA2017 Google) or accelerators (Chen et al 2016 2014 Hanet al 2016a Judd et al 2016 Jouppi et al 2017 Gao et al2017 Parashar et al 2017 Sharma et al 2018 Fowerset al 2018) However these framework are mostly designedfor the common regular patterns that have dominateddeep learning from almost its conception As such thesetools inherently had no incentive to deal with the form ofirregularities that the emerging NAS (Zoph amp Le 2017Cortes et al 2017 Zoph et al 2018 Liu et al 2019aCai et al 2019 Real et al 2019 Zhang et al 2019) andRandom Networks (Xie et al 2019 Wortsman et al 2019)

bring about This paper in contrast focuses on this emergentclass that breaks the regularity convention and aims to enabletheir execution on memory constrained edge devices

Scheduling and tiling for neural networks Whileprior works on scheduling (Lee et al 2003 Keszligler ampBednarski 2001 Wilken et al 2000) focus on classicalcomputing workloads there have been limited study aboutthe implications of scheduling in the neural networks domainThere is also a significant body of work on schedulingoperations on hardware accelerators (Abdelfattah et al2018) that also considers tiling (Chen et al 2018 Vasilacheet al 2018 Liu et al 2019b Ahn et al 2020) Howevergraph scheduling for irregularly wired neural networkspecially with memory constraints is an emerging problemwhich is the focus of this paper

Graph rewriting for neural networks It has been acommon practice to rewrite parts of the graph using rule-based (Abadi et al 2016 Paszke et al 2019 Rotem et al2018 Cyphers et al 2018 NVIDIA 2017) or systematicapproaches to expose parallelism and make models moretarget-aware (Jia et al 2018 2019 Schosser amp Geiszlig 2007)While these approaches may alleviate the complexity of thegraph and reduce the peak memory footprint as a side effectthese frameworks do not explore and are not concerned withscheduling Our work exclusively explores graph rewritingin the context of improving the peak memory footprint

Optimizing neural networks There are different opti-mization techniques that aim to simplify the neural networkindifferent dimensions Sparsificationcompression (LeCunet al 1990 Han et al 2015 Zhu amp Gupta 2018 Anwaret al 2017) quantization (Han et al 2016b Courbariauxet al 2016 Zhou et al 2016 Mishra amp Marr 2018 Esseret al 2020) activation compression (Jain et al 2018) andkernel modifications reduce the complexity of the individualoperations or remove certain computations However ourfocus the problem of memory-aware graph scheduling stillremains orthogonal to these inspiring efforts

6 CONCLUSION

As the new forms of connectivity emerges in neural networksthere is a need for system support to enable their effectiveuse specially for intelligence at the edge This paper tookan initial step toward orchestrating such network understringent physical memory capacity constraints We devisedsignatures to enable dynamic programming and adaptive softbudgeting to make the optimization tractable Even more anidentity graph writing was developed to further the potentialfor gains The encouraging results for a set of emergent net-works suggest that there is significant potential for compilertechniques that enables new forms of intelligent workloads

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightfulcomments We also thank Harris Teague and Jangho Kimfor the fruitful discussions and feedbacks on the manuscriptand Parham Noorzad for his help with the mathematicalformulations to calculate the complexity of the algorithms

REFERENCES

Abadi M Barham P Chen J Chen Z Davis A Dean JDevin M Ghemawat S Irving G Isard M et al TensorflowA system for large-scale machine learning In OSDI 2016

Abdelfattah M S Han D Bitar A DiCecco R OrsquoConnellS Shanker N Chu J Prins I Fender J Ling A C et alDLA Compiler and FPGA overlay for neural network inferenceacceleration In FPL 2018

Ahn B H Pilligundla P and Esmaeilzadeh H ChameleonAdaptive code optimization for expedited deep neuralnetwork compilation In ICLR 2020 URL httpsopenreviewnetforumid=rygG4AVFvH

Anwar S Hwang K and Sung W Structured pruning of deepconvolutional neural networks JETC 2017

Belady L A A study of replacement algorithms for a virtual-storage computer IBM Systems Journal 1966

Bellman R Dynamic programming Science 1966

Bellman R E Dynamic programming treatment of the travelingsalesman problem 1961

Bernstein D Rodeh M and Gertner I On the complexity ofscheduling problems for parallelpipelined machines TC 1989

Bruno J and Sethi R Code generation for a one-register machineJACM 1976

Cai H Zhu L and Han S ProxylessNAS Direct neural architec-ture search on target task and hardware In ICLR 2019 URL httpsopenreviewnetforumid=HylVB3AqYm

Chen T Moreau T Jiang Z Zheng L Yan E Shen H CowanM Wang L Hu Y Ceze L et al Tvm An automated end-to-end optimizing compiler for deep learning In OSDI 2018

Chen Y Luo T Liu S Zhang S He L Wang J Li LChen T Xu Z Sun N et al Dadiannao A machine-learningsupercomputer In MICRO 2014

Chen Y-H Krishna T Emer J S and Sze V EyerissAn energy-efficient reconfigurable accelerator for deepconvolutional neural networks JSSC 2016

Cortes C Gonzalvo X Kuznetsov V Mohri M and YangS AdaNet Adaptive structural learning of artificial neuralnetworks In ICML 2017

Courbariaux M Hubara I Soudry D El-Yaniv R and BengioY Binarized neural networks Training deep neural networkswith weights and activations constrained to +1 or -1 arXiv 2016URL httpsarxivorgpdf160202830pdf

Cyphers S Bansal A K Bhiwandiwalla A Bobba JBrookhart M Chakraborty A Constable W Convey CCook L Kanawi O et al Intel nGraph An intermediate repre-sentation compiler and executor for deep learning arXiv 2018URL httpsarxivorgpdf180108058pdf

Dean J Machine learning for systems and systems for machinelearning In NIPS Workshop on ML Systems 2017

Elthakeb A T Pilligundla P Yazdanbakhsh A Kinzer S andEsmaeilzadeh H Releq A reinforcement learning approachfor deep quantization of neural networks arXiv 2018 URLhttpsarxivorgpdf181101704pdf

Esser S K McKinstry J L Bablani D Appuswamy R andModha D S Learned step size quantization In ICLR 2020URL httpsopenreviewnetforumid=rkgO66VKDS

Feurer M Klein A Eggensperger K Springenberg J BlumM and Hutter F Efficient and robust automated machinelearning In NIPS 2015

Fowers J Ovtcharov K Papamichael M Massengill T LiuM Lo D Alkalay S Haselman M Adams L Ghandi Met al A configurable cloud-scale dnn processor for real-timeai In ISCA 2018

Gao M Pu J Yang X Horowitz M and Kozyrakis CTETRIS Scalable and efficient neural network acceleration with3d memory In ASPLOS 2017

Gauen K Rangan R Mohan A Lu Y-H Liu W and BergA C Low-power image recognition challenge In ASP-DAC2017

Google TensorFlow Lite URL httpswwwtensorfloworgmobiletflite

Han S Pool J Tran J and Dally W Learning both weights andconnections for efficient neural network In NIPS 2015

Han S Liu X Mao H Pu J Pedram A Horowitz M A andDally W J EIE efficient inference engine on compressed deepneural network In ISCA 2016a

Han S Mao H and Dally W J Deep compression Compressingdeep neural networks with pruning trained quantization andhuffman coding In ICLR 2016b

He Y Lin J Liu Z Wang H Li L-J and Han S AMCAutoML for model compression and acceleration on mobiledevices In ECCV 2018

Held M and Karp R M A dynamic programming approach tosequencing problems Journal of the SIAM 1962

Howard A G Zhu M Chen B Kalenichenko DWang W Weyand T Andreetto M and Adam HMobileNets Efficient convolutional neural networksfor mobile vision applications arXiv 2017 URLhttpsarxivorgpdf170404861pdf

Jain A Phanishayee A Mars J Tang L and Pekhimenko GGist Efficient data encoding for deep neural network trainingIn ISCA 2018

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.

Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.

Kahn, A. B. Topological sorting of large networks. CACM, 1962.

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.

Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.

Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.

Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.

NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Plump, D. Term graph rewriting. In Handbook Of Graph Grammars And Computing By Graph Transformation: Volume 2: Applications, Languages and Tools. World Scientific, 1999.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.

Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.

Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.

Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.

Zhu, M. and Gupta, S. To prune or not to prune: exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.

A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

[Figure 14 appears here. Panel (a) plots top-1 ImageNet accuracy (%) against the number of multiply-and-accumulate operations (billions); panel (b) plots top-1 ImageNet accuracy (%) against the number of parameters (millions). In both panels, irregularly wired neural networks (e.g., RandWire, AmoebaNet, NASNet) sit toward the top left relative to regular topology networks (e.g., MobileNet, ShuffleNet, Inception variants, ResNet-152, SENet), i.e., they show better performance for the same amount of compute or the same number of parameters.]

Figure 14: ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

[Figure 15 appears here; the raw peak memory footprints (KB) it plots are:]

                                        TensorFlow Lite   DP + Memory Allocator   DP + Graph Rewriting + Memory Allocator
DARTS (ImageNet), Normal Cell                  1656                903                        753
SwiftNet (Visual Wake Words), Cell A            552                251                        226
SwiftNet (Visual Wake Words), Cell B            194                 82                         72
SwiftNet (Visual Wake Words), Cell C             70                 33                         20
RandWire (CIFAR10), Cell A                      645                459                        459
RandWire (CIFAR10), Cell B                      330                260                        260
RandWire (CIFAR100), Cell A                     605                359                        359
RandWire (CIFAR100), Cell B                     350                280                        280
RandWire (CIFAR100), Cell C                     160                115                        115

Figure 15: Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite (smaller is better).

C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here, we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i, \mu_i)$ in $ST_i$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.

Proof. If $i=0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i \ge 1$, assume that a (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^* \in z_i$, and achieves $\mu^*$. In such case, let $v_i$ be replaced with the (optimal) $u_i^*$, which will result in $\mu_{\text{peak}} \leftarrow \min(\mu_i + \prod v_i.\text{shape},\ \mu_i + \prod u_i^*.\text{shape})$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\text{shape}$ for all $p_i \in (u_i.\text{preds} \cap \text{zero-outdegree}(s_{i+1}, G))$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime} \le \mu^*$ due to the $\min$ operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.
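To make the construction concrete, the following is a minimal sketch of the dynamic programming-based scheduler that Theorem 1 reasons about (illustrative Python, not the actual SERENITY implementation): each search step extends every memoized state by one zero-indegree node and keeps a single lowest-peak schedule-peak memory pair per signature. The graph encoding `preds` (node to predecessor set) and the activation sizes `size` are assumed inputs introduced here for illustration.

```python
# A minimal sketch of the dynamic programming-based scheduler of Theorem 1
# (illustrative only, not the authors' implementation).
# `preds` maps every node to the set of its predecessors; `size` gives the
# activation size of every node's output. Both names are assumptions.

def dp_schedule(preds, size):
    succs = {n: set() for n in preds}              # node -> set of successors
    for n, ps in preds.items():
        for p in ps:
            succs[p].add(n)

    # Signature: frozenset of already-scheduled nodes (it determines the
    # zero-indegree set). Value: one best (schedule, peak memory, live memory).
    best = {frozenset(): ((), 0, 0)}
    for _ in range(len(preds)):
        nxt = {}
        for done, (sched, peak, mem) in best.items():
            ready = [n for n in preds
                     if n not in done and preds[n] <= done]   # zero-indegree set
            for u in ready:
                new_done = frozenset(done | {u})
                grown = mem + size[u]                          # allocate u's output
                new_peak = max(peak, grown)
                # free predecessors whose consumers have now all been scheduled
                freed = sum(size[p] for p in preds[u] if succs[p] <= new_done)
                state = (sched + (u,), new_peak, grown - freed)
                # keep only the lowest-peak schedule per signature (Theorem 1)
                if new_done not in nxt or new_peak < nxt[new_done][1]:
                    nxt[new_done] = state
        best = nxt

    (schedule, peak, _), = best.values()           # single state: all nodes done
    return schedule, peak
```

Dropping the per-signature pruning and keeping every candidate instead recovers the exhaustive exploration whose complexity is analyzed in Appendix D.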

D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring ST and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we invent G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.

[Figure 16 appears here: graph G with a single entry node A, a single exit node Z, and independent single-node branches (B, C, D, ..., W, X, Y) between them.]

Figure 16: Topology of G to demonstrate the upper bound complexity of each algorithm.

First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there will be $|V|-2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby, the number of candidate schedules becomes $(|V|-2)!$ and the overall complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes.
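As a reference point, that recursive topological sorting can be sketched as follows (hypothetical Python, not the paper's code): every zero-indegree node is branched on at every step, so the branch nodes of G can appear in any of the $(|V|-2)!$ possible orders. The small example at the bottom instantiates Figure 16's topology with three branch nodes.

```python
# A minimal sketch of the recursive topological sorting that exhaustively
# explores ST (hypothetical code, not from the paper).

def enumerate_schedules(preds, scheduled=()):
    done = set(scheduled)
    ready = [n for n in preds if n not in done and preds[n] <= done]
    if not ready:                        # every node scheduled: one candidate
        yield scheduled
        return
    for u in ready:                      # branch on each schedulable node
        yield from enumerate_schedules(preds, scheduled + (u,))

# Tiny instance of Figure 16's topology: entry A, exit Z, branches B, C, D.
preds = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"A"}, "Z": {"B", "C", "D"}}
print(sum(1 for _ in enumerate_schedules(preds)))    # 3! = 6 candidate schedules
```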

On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that gets memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step. For the first and the last search steps, we assume that we have a single entry node and a single exit node. Since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield:

$$
\begin{aligned}
& 1 + 1\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 1 + \binom{|V|-2}{0}\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 2 + \sum_{i=0}^{|V|-2}\binom{|V|-2}{i}\times(|V|-2-i) \\
&= 2 + (|V|-2)\times 2^{|V|-3} \\
&\le (|V|-2)\times 2^{|V|-2} \qquad \text{for } |V|\ge 4 \\
&\le |V|\times 2^{|V|}
\end{aligned}
$$

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V|\times 2^{|V|})$. By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
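As a brief sketch of that comparison, using the standard form of Stirling's approximation $|V|! \approx \sqrt{2\pi|V|}\,(|V|/e)^{|V|}$, the ratio of the two bounds behaves as

$$
\frac{|V|!}{|V|\times 2^{|V|}} \;\approx\; \frac{\sqrt{2\pi |V|}}{|V|}\left(\frac{|V|}{2e}\right)^{|V|},
$$

which grows without bound once $|V| > 2e \approx 5.4$, so the exhaustive $O(|V|!)$ exploration falls behind the $O(|V|\times 2^{|V|})$ dynamic programming-based scheduling rapidly as the graph grows.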

Page 10: Ordering Chaos: Memory-Aware Scheduling of Irregularly ...ORDERING CHAOS: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS FOR EDGE DEVICES ByungHoonAhn1 yJinwonLee2 JamieMenjayLin

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

32s 57s

45s

278s 11

81s

151s

285s 744s

879s

406s

32s

421s

305s

393s 11

81s

151s

285s 744s

879s

488s

1

10

100

1000

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C Mean

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Sche

dulin

g Ti

me

(sec

onds

)

Dynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

Scheduling Time (seconds)

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell B MeanSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Figure 13 Scheduling time evaluation for SERENITY

Table 2 Comparison of the scheduling time for different algorithmsto schedule SwiftNet 1 2 and 3 represent dynamic program-ming divide-and-conquer and adaptive soft budgeting respectivelyNA denotes infeasible within practical time

GRAPH ALGORITHM NODES AND SCHEDULINGREWRITING PARTITIONS TIME

7 1 62 =62 NA7 1 + 2 62=211922 565 secs7 1 + 2 + 3 62=211922 379 secs

3 1 92=92 NA3 1 + 2 92=332829 72 hours3 1 + 2 + 3 92=332829 1119 secs

nodes and the corresponding scheduling time Straightfor-ward implementation of the aforementioned 1 dynamicprogramming-based scheduling leads to an immeasurablylarge scheduling time regardless of the graph rewritingHowever additional application of the 2 divide-and-conquer ( 1 + 2 ) leads to a measurable scheduling time5653 secs and 729 hours to schedule without and with thegraph rewriting respectively Furthermore we observethat further applying 3 adaptive soft budgeting ( 1 + 2 + 3 )significantly reduces the scheduling time 379 secs and 1119secs to schedule without and with the graph rewriting re-spectively Above results indicate that applying the proposedalgorithms leads to a scheduling time of practical utility

5 RELATED WORKS

The prevalence of neural networks has led to the developmentof several compilation frameworks for deep learning (Abadiet al 2016 Paszke et al 2019 Rotem et al 2018Cyphers et al 2018) However even industry grade toolsmostly focus on tiling and fine-grained scheduling ofmicro-operations on the conventional hardware (NVIDIA2017 Google) or accelerators (Chen et al 2016 2014 Hanet al 2016a Judd et al 2016 Jouppi et al 2017 Gao et al2017 Parashar et al 2017 Sharma et al 2018 Fowerset al 2018) However these framework are mostly designedfor the common regular patterns that have dominateddeep learning from almost its conception As such thesetools inherently had no incentive to deal with the form ofirregularities that the emerging NAS (Zoph amp Le 2017Cortes et al 2017 Zoph et al 2018 Liu et al 2019aCai et al 2019 Real et al 2019 Zhang et al 2019) andRandom Networks (Xie et al 2019 Wortsman et al 2019)

bring about This paper in contrast focuses on this emergentclass that breaks the regularity convention and aims to enabletheir execution on memory constrained edge devices

Scheduling and tiling for neural networks Whileprior works on scheduling (Lee et al 2003 Keszligler ampBednarski 2001 Wilken et al 2000) focus on classicalcomputing workloads there have been limited study aboutthe implications of scheduling in the neural networks domainThere is also a significant body of work on schedulingoperations on hardware accelerators (Abdelfattah et al2018) that also considers tiling (Chen et al 2018 Vasilacheet al 2018 Liu et al 2019b Ahn et al 2020) Howevergraph scheduling for irregularly wired neural networkspecially with memory constraints is an emerging problemwhich is the focus of this paper

Graph rewriting for neural networks It has been acommon practice to rewrite parts of the graph using rule-based (Abadi et al 2016 Paszke et al 2019 Rotem et al2018 Cyphers et al 2018 NVIDIA 2017) or systematicapproaches to expose parallelism and make models moretarget-aware (Jia et al 2018 2019 Schosser amp Geiszlig 2007)While these approaches may alleviate the complexity of thegraph and reduce the peak memory footprint as a side effectthese frameworks do not explore and are not concerned withscheduling Our work exclusively explores graph rewritingin the context of improving the peak memory footprint

Optimizing neural networks There are different opti-mization techniques that aim to simplify the neural networkindifferent dimensions Sparsificationcompression (LeCunet al 1990 Han et al 2015 Zhu amp Gupta 2018 Anwaret al 2017) quantization (Han et al 2016b Courbariauxet al 2016 Zhou et al 2016 Mishra amp Marr 2018 Esseret al 2020) activation compression (Jain et al 2018) andkernel modifications reduce the complexity of the individualoperations or remove certain computations However ourfocus the problem of memory-aware graph scheduling stillremains orthogonal to these inspiring efforts

6 CONCLUSION

As the new forms of connectivity emerges in neural networksthere is a need for system support to enable their effectiveuse specially for intelligence at the edge This paper tookan initial step toward orchestrating such network understringent physical memory capacity constraints We devisedsignatures to enable dynamic programming and adaptive softbudgeting to make the optimization tractable Even more anidentity graph writing was developed to further the potentialfor gains The encouraging results for a set of emergent net-works suggest that there is significant potential for compilertechniques that enables new forms of intelligent workloads

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

ACKNOWLEDGEMENT

We thank the anonymous reviewers for their insightfulcomments We also thank Harris Teague and Jangho Kimfor the fruitful discussions and feedbacks on the manuscriptand Parham Noorzad for his help with the mathematicalformulations to calculate the complexity of the algorithms

REFERENCES

Abadi M Barham P Chen J Chen Z Davis A Dean JDevin M Ghemawat S Irving G Isard M et al TensorflowA system for large-scale machine learning In OSDI 2016

Abdelfattah M S Han D Bitar A DiCecco R OrsquoConnellS Shanker N Chu J Prins I Fender J Ling A C et alDLA Compiler and FPGA overlay for neural network inferenceacceleration In FPL 2018

Ahn B H Pilligundla P and Esmaeilzadeh H ChameleonAdaptive code optimization for expedited deep neuralnetwork compilation In ICLR 2020 URL httpsopenreviewnetforumid=rygG4AVFvH

Anwar S Hwang K and Sung W Structured pruning of deepconvolutional neural networks JETC 2017

Belady L A A study of replacement algorithms for a virtual-storage computer IBM Systems Journal 1966

Bellman R Dynamic programming Science 1966

Bellman R E Dynamic programming treatment of the travelingsalesman problem 1961

Bernstein D Rodeh M and Gertner I On the complexity ofscheduling problems for parallelpipelined machines TC 1989

Bruno J and Sethi R Code generation for a one-register machineJACM 1976

Cai H Zhu L and Han S ProxylessNAS Direct neural architec-ture search on target task and hardware In ICLR 2019 URL httpsopenreviewnetforumid=HylVB3AqYm

Chen T Moreau T Jiang Z Zheng L Yan E Shen H CowanM Wang L Hu Y Ceze L et al Tvm An automated end-to-end optimizing compiler for deep learning In OSDI 2018

Chen Y Luo T Liu S Zhang S He L Wang J Li LChen T Xu Z Sun N et al Dadiannao A machine-learningsupercomputer In MICRO 2014

Chen Y-H Krishna T Emer J S and Sze V EyerissAn energy-efficient reconfigurable accelerator for deepconvolutional neural networks JSSC 2016

Cortes C Gonzalvo X Kuznetsov V Mohri M and YangS AdaNet Adaptive structural learning of artificial neuralnetworks In ICML 2017

Courbariaux M Hubara I Soudry D El-Yaniv R and BengioY Binarized neural networks Training deep neural networkswith weights and activations constrained to +1 or -1 arXiv 2016URL httpsarxivorgpdf160202830pdf

Cyphers S Bansal A K Bhiwandiwalla A Bobba JBrookhart M Chakraborty A Constable W Convey CCook L Kanawi O et al Intel nGraph An intermediate repre-sentation compiler and executor for deep learning arXiv 2018URL httpsarxivorgpdf180108058pdf

Dean J Machine learning for systems and systems for machinelearning In NIPS Workshop on ML Systems 2017

Elthakeb A T Pilligundla P Yazdanbakhsh A Kinzer S andEsmaeilzadeh H Releq A reinforcement learning approachfor deep quantization of neural networks arXiv 2018 URLhttpsarxivorgpdf181101704pdf

Esser S K McKinstry J L Bablani D Appuswamy R andModha D S Learned step size quantization In ICLR 2020URL httpsopenreviewnetforumid=rkgO66VKDS

Feurer M Klein A Eggensperger K Springenberg J BlumM and Hutter F Efficient and robust automated machinelearning In NIPS 2015

Fowers J Ovtcharov K Papamichael M Massengill T LiuM Lo D Alkalay S Haselman M Adams L Ghandi Met al A configurable cloud-scale dnn processor for real-timeai In ISCA 2018

Gao M Pu J Yang X Horowitz M and Kozyrakis CTETRIS Scalable and efficient neural network acceleration with3d memory In ASPLOS 2017

Gauen K Rangan R Mohan A Lu Y-H Liu W and BergA C Low-power image recognition challenge In ASP-DAC2017

Google TensorFlow Lite URL httpswwwtensorfloworgmobiletflite

Han S Pool J Tran J and Dally W Learning both weights andconnections for efficient neural network In NIPS 2015

Han S Liu X Mao H Pu J Pedram A Horowitz M A andDally W J EIE efficient inference engine on compressed deepneural network In ISCA 2016a

Han S Mao H and Dally W J Deep compression Compressingdeep neural networks with pruning trained quantization andhuffman coding In ICLR 2016b

He Y Lin J Liu Z Wang H Li L-J and Han S AMCAutoML for model compression and acceleration on mobiledevices In ECCV 2018

Held M and Karp R M A dynamic programming approach tosequencing problems Journal of the SIAM 1962

Howard A G Zhu M Chen B Kalenichenko DWang W Weyand T Andreetto M and Adam HMobileNets Efficient convolutional neural networksfor mobile vision applications arXiv 2017 URLhttpsarxivorgpdf170404861pdf

Jain A Phanishayee A Mars J Tang L and Pekhimenko GGist Efficient data encoding for deep neural network trainingIn ISCA 2018

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

Jia Y Shelhamer E Donahue J Karayev S Long J GirshickR Guadarrama S and Darrell T Caffe Convolutionalarchitecture for fast feature embedding In MM 2014

Jia Z Lin S Qi C R and Aiken A Exploring hiddendimensions in parallelizing convolutional neural networks InICML 2018

Jia Z Thomas J Warszawski T Gao M Zaharia M andAiken A Optimizing dnn computation with relaxed graphsubstitutions In SysML 2019

Jouppi N P Young C Patil N Patterson D Agrawal GBajwa R Bates S Bhatia S Boden N Borchers A et alIn-datacenter performance analysis of a tensor processing unitIn ISCA 2017

Judd P Albericio J Hetherington T Aamodt T M andMoshovos A Stripes Bit-serial deep neural network computingIn MICRO 2016

Kahn A B Topological sorting of large networks CACM 1962

Keszligler C and Bednarski A A dynamic programming approachto optimal integrated code generation In LCTES 2001

Laredo D Qin Y Schutze O and Sun J-Q Automaticmodel selection for neural networks arXiv 2019 URLhttpsarxivorgpdf190506010pdf

Lattner C and Adve V LLVM A compilation framework forlifelong program analysis amp transformation In CGO 2004

LeCun Y Denker J S and Solla S A Optimal brain damageIn NIPS 1990

Lee C Lee J K Hwang T and Tsai S-C Compileroptimization on vliw instruction scheduling for low powerTODAES 2003

Liu H Simonyan K and Yang Y DARTS Differen-tiable architecture search In ICLR 2019a URL httpsopenreviewnetforumid=S1eYHoC5FX

Liu Y Wang Y Yu R Li M Sharma V and Wang Y Opti-mizing CNN model inference on CPUs In USENIX ATC 2019b

Mireshghallah F Taram M Ramrakhyani P Jalali A TullsenD and Esmaeilzadeh H Shredder Learning noise distributionsto protect inference privacy In ASPLOS 2020

Mishra A and Marr D Apprentice Using knowl-edge distillation techniques to improve low-precisionnetwork accuracy In ICLR 2018 URL httpsopenreviewnetforumid=B1ae1lZRb

NVIDIA TensorRT Programmable inference accelerator 2017URL httpsdevelopernvidiacomtensorrt

Parashar A Rhu M Mukkara A Puglielli A Venkatesan RKhailany B Emer J Keckler S W and Dally W J ScnnAn accelerator for compressed-sparse convolutional neuralnetworks In ISCA 2017

Paszke A Gross S Massa F Lerer A Bradbury J Chanan GKilleen T Lin Z Gimelshein N Antiga L et al PyTorchAn imperative style high-performance deep learning library InNeurIPS 2019

Plump D Term graph rewriting In Handbook Of GraphGrammars And Computing By Graph Transformation Volume2 Applications Languages and Tools World Scientific 1999

Real E Aggarwal A Huang Y and Le Q V Regularizedevolution for image classifier architecture search In AAAI 2019

Rotem N Fix J Abdulrasool S Catron G Deng SDzhabarov R Gibson N Hegeman J Lele M Lev-enstein R et al Glow Graph lowering compilertechniques for neural networks arXiv 2018 URLhttpsarxivorgpdf180500907pdf

Schosser A and Geiszlig R Graph rewriting for hardware dependentprogram optimizations In AGTIVE 2007

Sharma H Park J Suda N Lai L Chau B Chandra V and Es-maeilzadeh H Bit Fusion Bit-level dynamically composable ar-chitecture for accelerating deep neural networks In ISCA 2018

Sifre L and Mallat S Rigid-motion scattering for imageclassification PhD dissertation 2014

Vasilache N Zinenko O Theodoridis T Goyal P DeVitoZ Moses W S Verdoolaege S Adams A and CohenA Tensor Comprehensions Framework-agnostic high-performance machine learning abstractions arXiv 2018 URLhttpsarxivorgpdf180204730pdf

Wang K Liu Z Lin Y Lin J and Han S HAQ Hardware-aware automated quantization with mixed precision In CVPR2019

Wilken K Liu J and Heffernan M Optimal instructionscheduling using integer programming In PLDI 2000

Wortsman M Farhadi A and Rastegari M Discovering neuralwirings In NeurIPS 2019

Wu C-J Brooks D Chen K Chen D Choudhury S DukhanM Hazelwood K Isaac E Jia Y Jia B et al Machinelearning at facebook Understanding inference at the edge InHPCA 2019

Xie S Kirillov A Girshick R and He K Exploring randomlywired neural networks for image recognition In ICCV 2019

Zhang T Yang Y Yan F Li S Teague H Chen Y et alSwiftnet Using graph propagation as meta-knowledge to searchhighly representative neural architectures arXiv 2019 URLhttpsarxivorgpdf190608305pdf

Zhou S Wu Y Ni Z Zhou X Wen H and Zou YDoReFa-Net Training low bitwidth convolutional neuralnetworks with low bitwidth gradients arXiv 2016 URLhttpsarxivorgpdf160606160pdf

Zhu M and Gupta S To prune or not to prune ex-ploring the efficacy of pruning for model compres-sion In ICLR Workshop 2018 URL httpsopenreviewnetforumid=S1lN69AT-

Zoph B and Le Q V Neural architecture searchwith reinforcement learning ICLR 2017 URLhttpsopenreviewnetforumid=r1Ue8Hcxg

Zoph B Vasudevan V Shlens J and Le Q V Learning transfer-able architectures for scalable image recognition In CVPR 2018

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

A COMPARISON BETWEEN IRREGULARLYWIRED NEURAL NETWORKSAND CONVENTIONAL REGULARTOPOLOGY NEURAL NETWORKS

Multiply-and-accumulate (Billions)

Top-

1 Im

ageN

et A

ccur

acy

()

85

65

70

75

80

200 10 30 40

DPN-131

Inception V1MobileNet

ShuffleNet

Inception V2

Inception V3Xception ResNet-152

SENet

AmoebaNet-A

ReNeXt-101PolyNetInception ResNet V2

Inception V4

NASNet-ANASNet-B

RandWire

AmoebaNet-A

AmoebaNet-B

RandWire

irregularly wired neural networksregular topology neural networks

irregularly wired neural networksshow better performance for

same amount of compute thanregular topology neural networks

top left means is better

(a) ImageNet accuracy vs number of multiply-and-accumulate

Number of Parameters (Millions)

Top-

1 Im

ageN

et A

ccur

acy

()

85

65

70

75

80

800 40 100 140

DPN-131

irregularly wired neural networks

Inception V1MobileNetShuffleNet

Inception V2

Inception V3Xception

ResNet-152

SENet

AmoebaNet-C

ReNeXt-101

PolyNetInception ResNet V2Inception V4

NASNet-A

NASNet-A

RandWire

AmoebaNet-A

RandWire

regular topology neural networks

irregularly wired neural networksshow better performance for

same number of parameters thanregular topology neural networks

top left means is better

6020 120

NASNet-A

(b) ImageNet accuracy vs number of parameters

Figure 14 ImageNet accuracy vs number of multiply-and-accumulate or parameters where irregularly wired neural networksshow higher performance for same amount of compute or numberof parameters than regular topology neural networks

B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10Figure 15 provides the raw numbers of the peak memory foot-print for the benchmark irregularly wired neural networks

1656

552

194

70

645

330 60

5

350

160

903

251

82 33

459

260 359

280

115

753

226

72 20

459

260 359

280

115

0

500

1000

1500

2000

Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C

DARTSImageNet

SwiftNetVisual Wake Words Dataset

RandWireCIFAR10

RandWireCIFAR100

Peak

Mem

ory

Foot

prin

t (K

B)

TensorFow LiteDynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator

Smaller the better

Peak Memory Footprint (KB)

Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell BSwiftNet

Human PresenceDARTSImageNet

RandWireCIFAR10

RandWireCIFAR100

Figure 15 Peak memory footprint of running irregularly wiredneural networks on SERENITY and TensorFlow Lite

C PROOF FOR OPTIMAL PEAK MEMORYFOOTPRINT FROM THE DYNAMICPROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamicprogramming-based scheduling algorithm

THEOREM 1 In order to find a schedule slowast with an optimalpeak memory consumption microlowast it is sufficient to keep justone schedule-peak memory pair (si zi) in STi for eachzero-indegree set zi and to append subsequent nodes on topof si to get si+1 in each search step

Proof If i=0 the optimal s0 is an empty sequence and micro0

must be 0 On the other hand if ige 1 assume that (subop-timal) vi constitutes slowast substituting ulowasti isinzi and achieves microlowastIn such case let vi be replaced with (optimal) ulowasti which willresult in micropeak larr min(microi +

prodvishapemicroi +

produlowasti shape)

and microi+1 is calculated by deductingprodpishape forallpi isin

(uipredscapzero-outdegree(si+1G)) By recursively apply-ing uk for rest of the search steps k the algorithm shouldfind an alternative sequence slowastprime with microlowastprimelemicrolowast due to the minoperator above contradicting the original assumption on theoptimality of slowast Therefore our algorithm finds a schedulewith an optimal peak memory consumption

D COMPLEXITY ANALYSIS OFTHE DYNAMIC PROGRAMMING-BASEDSCHEDULING AND PROOF

We compare the complexity of exhaustively exploring STand our dynamic programming-based scheduling Whilethe algorithm both lists candidate schedules and calculatestheir peak memory footprint we consider the peak memoryfootprint calculation as one operation while deriving thecomplexity In order to visualize the analysis we invent Gin Figure 16 to demonstrate the upper bound complexity ofeach algorithm It has a single entry node and a single exitnode A and Z respectively and all other nodes constituteindependent branches between the entry and the exit node

A

D W

Z

CB X Y

GGraph

hellip

Figure 16 Topology of G to demonstrate the upper boundcomplexity of each algorithm

First we demonstrate the complexity of the recursivetopological sorting that exhaustively explores ST Sincethere is a single entry node and a single exit node therewill be |V minus 2| remaining nodes and these nodes can bescheduled independently of one another thereby the numberof candidate schedules become 〈|V minus 2|〉 and the overall

Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices

complexity becomesO(|V |) where |V | denotes the numberof nodes On the other hand for the dynamic programmingwe calculate the number of candidates by utilizing the num-ber of schedules that gets memoized Our memoization takesadvantage of the zero-indegree sets z for each search step

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield:

\[
\begin{aligned}
&\;1 + 1\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1\\
&= 1 + \binom{|V|-2}{0}\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1\\
&= 2 + \sum_{i=0}^{|V|-2}\binom{|V|-2}{i}\times(|V|-2-i)\\
&= 2 + (|V|-2)\times 2^{|V|-3}\\
&\le (|V|-2)\times 2^{|V|-2} \quad \text{for } |V|\ge 4\\
&\le |V|\times 2^{|V|}
\end{aligned}
\]

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V| \times 2^{|V|})$. By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
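As an illustrative sanity check of this claim (not part of the original analysis), the short Python snippet below evaluates the exhaustive bound $(|V|-2)!$ against the dynamic-programming bound $|V|\times 2^{|V|}$ for a few graph sizes; the factorial term dominates rapidly, as Stirling's approximation predicts.

import math

for n in (10, 20, 30, 40):                  # n = |V|
    exhaustive = math.factorial(n - 2)      # (|V|-2)! candidate schedules
    dp_bound = n * (2 ** n)                 # |V| * 2^|V| upper bound for the DP
    print(f"|V|={n:2d}  (|V|-2)! = {exhaustive:.3e}  |V|*2^|V| = {dp_bound:.3e}")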
