ORDERING CHAOS: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS FOR EDGE DEVICES
Byung Hoon Ahn 1 †, Jinwon Lee 2, Jamie Menjay Lin 2, Hsin-Pai Cheng 3 †, Jilei Hou 2, Hadi Esmaeilzadeh 1
ABSTRACT
Recent advances in automating machine learning through Neural Architecture Search and Random Network Generators have yielded networks that deliver higher accuracy given the same hardware resource constraints (e.g., memory capacity, bandwidth, number of functional units). Many of these emergent networks, however, comprise irregular wirings (connections) that complicate their execution by deviating from the conventional regular patterns of layer/node connectivity and computation. The irregularity leads to a new problem space where the schedule and order of nodes significantly affect the activation memory footprint during inference. Concurrently, there is an increasing general demand to deploy neural models onto resource-constrained edge devices due to efficiency, connectivity, and privacy concerns. To enable such a transition from cloud to edge for the irregularly wired neural networks, we set out to devise a compiler optimization that caps and minimizes the footprint to the limitations of the edge device. This optimization is a search for the schedule of the nodes in an intractably large space of possible solutions. We offer and leverage the insight that partial schedules lead to repeated subpaths for search, and use the graph properties to generate a signature for these repetitions. These signatures enable the use of Dynamic Programming as a basis for the optimization algorithm. However, due to the sheer number of neurons and connections, the search space may remain prohibitively large. As such, we devise an Adaptive Soft Budgeting technique that, during dynamic programming, performs a lightweight meta-search to find the appropriate memory budget for pruning suboptimal paths. Nonetheless, schedules from any scheduling algorithm, including ours, are still bound to the topology of the neural graph under compilation. To alleviate this intrinsic restriction, we develop an Identity Graph Rewriting scheme that leads to even lower memory footprint without changing the mathematical integrity of the neural network. We evaluate our proposed algorithms and schemes using representative irregularly wired neural networks. Compared to TensorFlow Lite, a widely used framework for edge devices, the proposed framework provides a 1.86× reduction in memory footprint and a 1.76× reduction in off-chip traffic, with an average of less than one minute of extra compilation time.
1 INTRODUCTION
A growing body of work focuses on Automating Machine Learning (AutoML) using Neural Architecture Search (NAS) (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and now even Random Network Generators (Xie et al., 2019; Wortsman et al., 2019), which emit models with irregular wirings and show that such irregularly wired neural networks can significantly enhance classification performance. These networks that deviate from regular topology can even adapt to some of the constraints of the hardware (e.g., memory capacity, bandwidth, number of functional units), rendering themselves especially useful in target-
† Work done as intern at Qualcomm AI Research. 1 University of California, San Diego. 2 Qualcomm AI Research. 3 Duke University. Correspondence to: Byung Hoon Ahn <bhahn@eng.ucsd.edu>.
Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).
ing edge devices. Therefore, lifting the regularity condition provides significant freedom for NAS and expands the search space (Cortes et al., 2017; Zhang et al., 2019; Xie et al., 2019).
The general objective is to enable deployment of neural intelligence even on stringently constrained devices by trading off regular wiring of neurons for higher resource efficiency. Importantly, pushing neural execution to the edge is one way to address the growing concerns about privacy (Mireshghallah et al., 2020) and to enable effective use where connectivity to the cloud is restricted (Wu et al., 2019). However, a new challenge arises regarding orchestrating the execution of these irregularly wired neural networks on edge devices, as the working memory footprint during execution frequently surpasses the strict cap on the memory capacity of these devices. The lack of a multi-level memory hierarchy in these micro devices exacerbates the problem, because the network cannot even be executed if the footprint exceeds the capacity. To that end, despite the significant potential of irregularly wired neural networks, their complicated execution pattern,
in contrast to the previously streamlined execution of models with regular topology, renders conventional frameworks futile in taking these networks to the edge due to their large peak memory footprint. While peak memory footprint is largely dependent on the scheduling of neurons, current deep learning compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Abadi et al., 2016; Paszke et al., 2019; Jia et al., 2014) rely on basic topological ordering algorithms that are oblivious to peak memory footprint and instead focus on the orthogonal problems of tiling and kernel-level optimization. This paper is an initial step towards embedding peak memory footprint as a first-grade constraint in deep learning schedulers to unleash the potential of the emergent irregularly wired neural networks. As such, this paper makes the following contributions:
(1) Memory-aware scheduling for irregularly wired neural networks. Scheduling for these networks is a topological ordering problem, which enumerates an intractably large space of possible schedules. We offer and leverage the insight that partial schedules lead to repeated subpaths for search, and use the graph properties to generate a signature for these repetitions while embedding a notion of the running memory usage. These signatures enable the use of Dynamic Programming as a basis for the optimization algorithm.
(2) Adaptive soft budgeting for tractable compilation time. Even with dynamic programming as the base, due to the sheer number of neurons and connections, the search space may remain too large (exponentially large) in practice. As such, we devise an Adaptive Soft Budgeting technique that uses a lightweight meta-search mechanism to find the appropriate memory budget for pruning suboptimal paths. This technique aims to find an inflection point beyond which tighter budgets may lead to no solution, while looser budgets prolong the scheduling substantially, putting the optimization in a position of questionable utility.
(3) Identity graph rewriting for enabling higher potential in memory reduction. Any scheduling algorithm, including ours, is still bound to the topology of the neural graph under compilation. To relax this intrinsic restriction, we devise an Identity Graph Rewriting scheme that exchanges subgraphs for ones leading to a lower memory footprint without altering the mathematical integrity of the neural network.
Results show that our adaptive scheduling algorithm improves peak memory footprint for irregularly wired neural networks by 1.68× compared to TensorFlow Lite, the de facto framework for edge devices. Our graph rewriting technique provides an opportunity to lower the peak memory footprint by an additional 10.7%. Furthermore, our framework can even bring about a 1.76× reduction in off-chip traffic for devices with a multi-level memory hierarchy, and can even eliminate the traffic in some cases by confining the memory footprint below the on-chip memory capacity. These gains come at an average of less than one minute of extra compilation time.
(a) RandWire (b) SwiftNet
Figure 1. Architecture of network models from NAS and Random Network Generators. The topology of such networks includes distinctive irregular wirings between the nodes.
2 CHALLENGES AND OUR APPROACH
2.1 Irregularly Wired Neural Networks
Recent excitement in Automated Machine Learning (AutoML) (Feurer et al., 2015; Dean, 2017; He et al., 2018; Elthakeb et al., 2018; Wang et al., 2019; Laredo et al., 2019) aims to take the human out of the loop in developing machine learning systems. This includes Neural Architecture Search (NAS) (Zoph & Le, 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Network Generators (Xie et al., 2019; Wortsman et al., 2019) that focus on the automation of designing neural architectures. Figure 1 demonstrates that networks of this regime are characterized by their distinctive irregular graph topology, with much more irregular wirings (dataflow) compared to conventional networks with regular graph topology. This paper refers to these networks as irregularly wired neural networks.
[Figure 2 plot: Top-1 ImageNet Accuracy (%) vs. Multiply-and-accumulate (Billions), comparing irregularly wired neural networks (RandWire, NASNet-A/B, AmoebaNet-A/B) against regular topology neural networks (MobileNet, ShuffleNet, Inception V1-V4, Xception, ResNet-152, ReNeXt-101, PolyNet, Inception ResNet V2, SENet, DPN-131); top left is better, and irregularly wired neural networks show better performance for the same amount of compute than regular topology neural networks.]
Figure 2. ImageNet accuracy vs. number of multiply-and-accumulate operations, where irregularly wired neural networks show higher performance for the same compute than regular topology neural networks. A plot for the number of parameters displays a similar trend.
[Figure 3(a): SwiftNet Cell A, an irregularly wired cell of concat and conv nodes. Figure 3(b): cumulative distribution of schedules (%) vs. peak memory footprint (KB); under a 250 KB constraint, 4.1% of schedules satisfy the constraint and only 0.04% of schedules are optimal.]
Figure 3. CDF of the peak memory footprint for the different possible schedules of a given irregularly wired neural network.
From the performance perspective, these networks have been shown to outperform manually designed architectures in terms of accuracy while using fewer resources. In fact, the majority of winning neural architectures in competitions with the primary goal of reducing resources (Gauen et al., 2017) rely on NAS, suggesting its effectiveness in that respect. Figure 2 plots the accuracy of different models given their computation. The figure clearly shows that the Pareto frontier of irregularly wired neural networks from NAS and Random Network Generators is better than that of the hand-designed models with regular topology. This indicates that the efficiency in terms of accuracy for fixed resources is better with the irregularly wired neural networks.
2.2 Challenges
Many existing compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Paszke et al., 2019; Abadi et al., 2016; Jia et al., 2014) rely on basic topological ordering algorithms to schedule the graph. While the current approach may be sufficient to run conventional networks on server-class machines, such schemes may be unfit for running irregularly wired neural networks on resource-constrained edge devices. This is because, unlike running networks with regular topology, running irregular networks results in a varied range of memory footprints depending on the schedule. For instance, given the constraints of a representative edge device (SparkFun Edge: 250KB weight/activation memory and 60M MACs), Figure 3(b) shows that only 4.1% of the schedules meet the hard memory constraint, while only 0.04% would achieve the optimal peak memory. In reality, such a limitation will prevent further exploration regarding the diversity and innovation of network design, and in order to allow the edge computing regime to take full advantage of irregularly wired neural networks, this limitation should be alleviated, if not removed.
2.3 Design Objectives
Scheduling algorithm. To address this issue, our work aims to find a schedule of nodes s* from the search space S that minimizes the peak memory footprint μpeak. S enumerates all possible orderings of the nodes v ∈ V, where V is the set of all nodes within a graph G:
s* = argmin_{s ∈ S} μpeak(s, G)    (1)
The most straightforward way to schedule is a brute-force approach, which simply enumerates S and picks the ordering with the minimum peak memory footprint. While this extreme method may find an optimal solution, it is too costly in terms of time due to its immense complexity, Θ(|V|!), where |V| denotes the number of nodes in the graph. One way to improve on this is to narrow down the search space to focus only on the topological orderings ST ⊂ S. However, this will still suffer from a complexity with an upper bound of O(|V|!) (it takes days to schedule a DAG with merely 30 nodes). In fact, previous works (Bruno & Sethi, 1976; Bernstein et al., 1989) have already proven that optimal scheduling for DAGs is NP-complete. On the other extreme are heuristics for topological ordering, such as Kahn's algorithm (Kahn, 1962), with complexity O(|V|+|E|), where |V| and |E| are the numbers of nodes and edges. However, as demonstrated in Figure 3, such a method may yield a suboptimal schedule of nodes which will not run on the target hardware. To this end, we explore dynamic programming combined with adaptive soft budgeting for scheduling to achieve an optimal solution s* while keeping the graph constant, without adding too much overhead in terms of time. We explain our algorithms in depth in Sections 3.1 and 3.2.
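To make the baseline concrete, the following is a minimal sketch (Python, with an assumed adjacency-list graph representation) of Kahn's algorithm producing one topological order, together with a routine that evaluates the peak activation footprint of any given schedule under the allocate-on-schedule / free-after-last-use model used throughout the paper:

```python
from collections import deque

def kahn_topological_order(nodes, edges):
    """One valid topological order via Kahn's algorithm, O(|V|+|E|).
    `edges` maps each node to the list of its successors."""
    indegree = {v: 0 for v in nodes}
    for u in nodes:
        for v in edges.get(u, []):
            indegree[v] += 1
    ready = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in edges.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order

def peak_memory(schedule, edges, size):
    """Peak activation footprint of a schedule: allocate a node's output
    when it is scheduled, free it once all of its consumers have run."""
    consumers_left = {u: len(edges.get(u, [])) for u in schedule}
    preds = {v: [] for v in schedule}
    for u in schedule:
        for v in edges.get(u, []):
            preds[v].append(u)
    mu, peak = 0, 0
    for u in schedule:
        mu += size[u]          # allocate output of u
        peak = max(peak, mu)
        for p in preds[u]:
            consumers_left[p] -= 1
            if consumers_left[p] == 0:
                mu -= size[p]  # all consumers done: deallocate
    return peak
```

For a diamond graph A → {B, C} → D with sizes {A: 4, B: 2, C: 2, D: 1}, any topological order peaks at 8 here, since both B and C must coexist with A before A can be freed; on irregular graphs the spread across orders is much wider, which is exactly the effect shown in Figure 3.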
Graph rewriting. Any scheduling algorithm, including ours, is intrinsically bounded by the graph topology. Therefore, we explore transforming the search space through graph rewriting (Plump, 1999). Graph rewriting is generally concerned with substituting a certain pattern in the graph with a different pattern to achieve a certain objective. For a computational dataflow graph, leveraging distributive, associative, and commutative properties within the computation of the graph, graph rewriting can maintain the semantics while bringing significant improvements regarding some objective. For example, in general programs, Σ_i log(x_i) can be represented as Σ_{odd i} log(x_i) + Σ_{even i} log(x_i) or as log(Π_i x_i), while x + x can be translated to x × 2 or x << 1.
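These rewriting identities can be checked numerically; a quick sanity check in Python (illustrative values only):

```python
import math

# Sum of logs split over odd/even indices equals log of the product
# (associativity and commutativity preserve the semantics).
xs = [2.0, 3.0, 5.0]
odd = sum(math.log(x) for i, x in enumerate(xs, start=1) if i % 2 == 1)
even = sum(math.log(x) for i, x in enumerate(xs, start=1) if i % 2 == 0)
assert math.isclose(odd + even, math.log(math.prod(xs)))

# x + x can be rewritten as x * 2 or x << 1 for integers.
x = 21
assert x + x == x * 2 == (x << 1)
```

The same principle, applied to tensor operators instead of scalars, is what the identity graph rewriting of Section 3.3 exploits.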
Likewise, we bring this insight to neural networks to find a set of possible transformations X that can rewrite the original graph G to a new graph G′, thereby changing our search space S to one with a lower peak memory footprint:
X* = argmin_X μpeak(s*, X(G))    (2)
We identify a set of candidate patterns for transformation χ: g → g′ (g ∈ G and g′ ∈ G′), which constitutes X. While transforming the graph, our method keeps the mathematical integrity of the graph intact; thus it is not an approximation method. We embed this systematic way of improving the peak memory footprint and the search space as identity graph rewriting, and we address this technique in Section 3.3.
[Figure 4 block diagram: an input graph G passes through the Identity Graph Rewriter (rewrites the graph to alleviate the activation memory footprint of the graph), then the Dynamic Programming-based Scheduler (finds a memory-optimal schedule s given an input graph), assisted by Adaptive Soft Budgeting (adaptively manages the soft budget τ and time limit T to speed up scheduling, using the flags 'no solution' / 'timeout' / 'solution').]
Figure 4. Overall workflow of SERENITY: memory-aware scheduling of irregularly wired neural networks.
3 SERENITY: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS
As discussed in Section 2, the objective is reducing the peak memory footprint while executing irregularly wired neural networks. We propose SERENITY, memory-aware scheduling that targets devices with restricted resources (e.g., edge devices). Figure 4 summarizes the overall scheduling process, highlighting the major contributions of our approach. The input to SERENITY is a graph of an irregularly wired neural network G, which in fact acts as an intermediate representation (IR) during the scheduling process. We augment this IR with the metadata of the nodes, such as the operation type, input/output edges, input/output shapes, and memory cost. Then the graph rewriter transforms the graph G → G′ to relax the memory costs of memory-intensive patterns, with the goal of reducing the peak memory footprint μpeak of G. SERENITY schedules the graph to an optimal schedule s* using the dynamic programming-based scheduler. However, since the scheduling may be slow due to the complexity, we scale down the search space by leveraging divide-and-conquer, which partitions the graph into multiple subgraphs. Then, we augment the scheduler with adaptive soft budgeting, which prunes suboptimal paths by adaptively finding a budget for thresholding through a swift meta-search, to speed up the scheduling process. This section focuses on the innovations of SERENITY: dynamic programming-based scheduling, divide-and-conquer, adaptive soft budgeting, and graph rewriting, which are explained in detail in Sections 3.1, 3.2, and 3.3, respectively.
3.1 Dynamic Programming-based Scheduling: Achieving Optimal Peak Memory Footprint
Our goal for the scheduling algorithm is to minimize the peak memory footprint μpeak(s, G). As stated in Section 2.3, recursive algorithms that cover the entire search space S, or the subspace of all topological orderings ST ⊂ S, take an impractically long time. This is primarily due to the repetitive re-computation of subproblems that upper-bounds the algorithm by O(|V|!). Therefore, we leverage dynamic programming (Bellman, 1961; 1966; Held & Karp, 1962), which includes a memoization scheme that has been shown to be effective in reducing the complexity of time-intensive algorithms by reusing solutions from their subproblems, while still find-
ing an optimal solution by sweeping the entire search space.
Identifying a signature to enable dynamic programming. The first step in applying dynamic programming to a new problem is characterizing the structure of an optimal solution s* = s*_n (s*_n is an optimal solution for n nodes). Then it requires identifying a recursive relationship between the optimal solution of a subproblem s*_i and that of the original problem s*_{i+1}, and we do this by analyzing the straightforward recursive topological ordering, which, while inefficient, sweeps the entire search space. In essence, a topological ordering algorithm is a repeated process of identifying a set of nodes that are available for scheduling and iterating over that set for recursion. In graph theory, such a set of nodes available for scheduling is called a zero-indegree set z, where z is a set of nodes for which all incoming edges and the corresponding predecessor nodes (indegree) have been scheduled. Figure 5 demonstrates the recursion trees of the different topological ordering algorithms, where the height of the tree is the search step and every path from the root to a leaf is a topological ordering s ∈ ST. The figure highlights the redundant occurrences of z in the recursion tree of the recursive topological ordering, then merges these z to make them unique, identifying z as the signature for repetition and preventing the aforementioned re-computation. This makes the scheduling for each z a unique subproblem that constitutes
[Figure 5 illustration: a graph G with nodes A-L; the recursive topological ordering expands redundant zero-indegree sets z at each search step, while the dynamic programming-based topological ordering merges them into unique zero-indegree sets (squares) kept for memoization.]
Figure 5. Illustration of identifying redundant zero-indegree sets z and making z unique (squares) throughout the topological ordering algorithm to reduce re-computation.
[Figure 6 illustration: a graph G with nodes A-L; starting from s8 = A, B, C, D, E, F, I, J with z8 = {G, H} and u8 = H, step (1) schedules/allocates H and records μpeak9 = max(μpeak8, μpeak); step (2) deallocates D and E as their outdegrees drop from 1 to 0, yielding s9 = A, B, C, D, E, F, I, J, H and μ9.]
Figure 6. Visualization of scheduling the node u8 = H during the search step i = 8. Starting from s8, μ8, and μpeak8, the figure shows how the algorithm calculates s9, μ9, and μpeak9.
the dynamic programming-based topological ordering.
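A minimal sketch of this signature in Python (node and predecessor representations assumed): the zero-indegree set of a partial schedule, which different orderings of the same scheduled prefix share, making it usable as a memoization key:

```python
def zero_indegree_set(scheduled, preds):
    """The schedulable frontier: unscheduled nodes whose predecessors
    have all been scheduled. `preds` maps node -> list of predecessors."""
    done = set(scheduled)
    return frozenset(v for v in preds
                     if v not in done and all(p in done for p in preds[v]))

# Diamond graph: A -> {B, C} -> D. The two interleavings of B and C
# reach the same signature, so their subproblems can be merged.
preds = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B', 'C']}
z1 = zero_indegree_set(['A', 'B', 'C'], preds)
z2 = zero_indegree_set(['A', 'C', 'B'], preds)
assert z1 == z2 == frozenset({'D'})
```

Being a frozenset, the signature is hashable and can index the memoization table M directly.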
Integrating the peak memory footprint constraint. On top of the dynamic programming formulation that shows potential for optimizing the search space significantly, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint μ_{i+1} and its corresponding peak μpeak,{i+1} in each search step i to select the optimal path s*_{i+1} for memoization. Here we clarify the process of a search step, explaining the details of calculating μpeak,{i+1} and saving s_{i+1} for each search step i. In each search step, we start with the unique zero-indegree sets z_i (signatures) saved in the i-th entry of the memoization table M_i. For each z_i, we keep the schedule up to that point s_i, the sum of activations in memory μ_i for the signature z_i, and the peak memory footprint of s_i, denoted μpeak,i. Then, when we iterate over z_i to schedule a new node u_i, its output activation is appended to s_i to form s_{i+1} and is allocated in memory. The size of u_i, the product Π(u_i.shape), where shape is a property of the activation tensor that includes channels, height, width, and precision (e.g., byte, float), is added to μ_i, so μ_{i+1} ← μ_i + Π(u_i.shape). Then we use μ_{i+1} as μpeak to update μpeak,{i+1} (the peak memory footprint for s_{i+1}). Since some predecessors of u_i will not be used anymore after allocating u_i, we update the outdegrees of those nodes by decrementing them. Having updated the outdegrees, we are left with a zero-outdegree set that denotes the nodes that are ready for deallocation. We deallocate the nodes in the set and update μ_{i+1} accordingly.
To demonstrate the scheduling of a node u_i, Figure 6 simulates scheduling the node u8 = H at i = 8. In the figure, (1) H is appended to s8 and allocated in memory as it is scheduled, and the scheduler records the maximum of μpeak,8 and the sum of all activations in memory at this point as μpeak,9. Then it recalculates the outdegrees of the predecessor nodes of H: D's and E's outdegrees are decremented from one to zero. (2) Then these nodes are deallocated, and the sum of the activation memory at this point is recorded as μ9.

Algorithm 1 Dynamic Programming-based Scheduling
1: Input: graph G
2: Output: optimal schedule s*
3: # initialize memoization
4: s_0 ← [], μ_0, μpeak,0 ← 0, z_0 ← zero-indegree(s_0, G)
5: M_0[z_0] ← (s_0, μ_0, μpeak,0)
6: # iterate over search steps
7: for i = 0 to n−1 do
8:   # iterate over (schedule, current memory, peak memory)
9:   for z_i : (s_i, μ_i, μpeak,i) in M_i do
10:    for u_i in z_i do
11:      s_{i+1} ← s_i.append(u_i)  # allocate
12:      z_{i+1} ← zero-indegree(s_{i+1}, G)
13:      μ_{i+1}, μpeak ← μ_i + Π(u_i.shape)
14:      μpeak,{i+1} ← max(μpeak,i, μpeak)
15:      for p_i in u_i.preds do
16:        if p_i is in zero-outdegree(s_{i+1}, G) then
17:          μ_{i+1} ← μ_{i+1} − Π(p_i.shape)  # deallocate
18:        end if
19:      end for
20:      # memoize schedule with least peak memory
21:      if μpeak,{i+1} ≤ M_{i+1}[z_{i+1}].μpeak,{i+1} then
22:        M_{i+1}[z_{i+1}] ← (s_{i+1}, μ_{i+1}, μpeak,{i+1})
23:      end if
24:    end for
25:  end for
26: end for
27: s*, μ*peak ← M_n[·].s_n, M_n[·].μpeak,n  # solution
Finding the schedule with optimal peak memory footprint. After scheduling u_i, we save the new signature into M_{i+1} for the next search step i+1. Since the goal of this work is to minimize the overall μpeak, we identify the corresponding optimal schedule s*_{i+1} for each z_{i+1} by only saving the s_{i+1} with the minimum μpeak,{i+1}. We integrate the aforementioned steps of scheduling u_i and updating M_{i+1} to complete the proposed dynamic programming-based scheduling algorithm, which Algorithm 1 summarizes. As a first step, the algorithm starts by initializing the memoization table M_0; then the algorithm iterates over the search steps. In each search step i, the algorithm performs the above-illustrated memory allocation for all u_i in z_i, saving s_{i+1}, μ_{i+1}, and μpeak,{i+1}. After iterating over all search steps up to n−1, s* is saved in M_n with a unique entry, n being the number of nodes in G. We provide the proof of the optimality of the peak memory footprint in the supplementary material.
Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V| × 2^|V|), which is significantly faster than the exhaustive search of ST with an upper bound complexity of O(|V|!).
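A compact Python sketch of Algorithm 1 under simplifying assumptions (the graph given as predecessor lists, per-node activation sizes already reduced to scalar byte counts, and the zero-indegree set used as the memoization signature as in the paper):

```python
def dp_schedule(nodes, preds, size):
    """Dynamic programming-based scheduling keyed on the zero-indegree
    set z; keeps only the lowest-peak partial schedule per signature."""
    succs = {v: [] for v in nodes}
    for v in nodes:
        for p in preds[v]:
            succs[p].append(v)

    def frontier(done):
        # zero-indegree set: schedulable, not-yet-scheduled nodes
        return frozenset(v for v in nodes
                         if v not in done and all(p in done for p in preds[v]))

    # M maps signature z -> (partial schedule, live memory, peak memory)
    M = {frontier(set()): ((), 0, 0)}
    for _ in range(len(nodes)):
        next_M = {}
        for z, (s, mu, mu_peak) in M.items():
            done = set(s)
            for u in z:
                mu1 = mu + size[u]                 # allocate u's output
                peak1 = max(mu_peak, mu1)
                done1 = done | {u}
                for p in preds[u]:                 # free fully consumed preds
                    if all(c in done1 for c in succs[p]):
                        mu1 -= size[p]
                z1 = frontier(done1)
                # memoize the schedule with the least peak memory per z1
                if z1 not in next_M or peak1 < next_M[z1][2]:
                    next_M[z1] = (s + (u,), mu1, peak1)
        M = next_M
    (s_opt, _, peak_opt), = M.values()  # single entry once all nodes placed
    return list(s_opt), peak_opt
```

For example, with A → B (sizes A: 10, B: 1) plus an independent node C (size 5), the scheduler discovers that running C before B would keep the large output of A live longer, and returns the order A, B, C with peak 11 rather than 15.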
[Figure 7 illustration: a graph with nodes A-H divided into subgraphs g1 = A, B, C, D and g2 = E, F, G, H; each subgraph is scheduled separately into sg1 and sg2, which are then concatenated into s = A, B, C, D, E, F, G, H.]
Figure 7. Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).
Due to space limitations, we present the derivation of the algorithm complexity in the supplementary material.
3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling
While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.
Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass-shaped, because many NAS and Random Network Generators design cells with a single input and a single output, then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach can reduce the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time will decrease from |V| × 2^|V| to |V| × 2^(|V|/N), so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.
As such, Figure 7 shows this insight can be extended to our problem setting, where we first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage is partitioning the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule sg (conquer). Considering that the number of nodes in a subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
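Assuming the graph has already been cut at the single-activation boundaries between cells, the combine step is a plain concatenation; a minimal sketch (helper names assumed, and the single boundary activation carried between consecutive cells is ignored for brevity):

```python
def divide_and_conquer_schedule(cells, schedule_cell):
    """Conquer each cell independently, then combine by concatenation.
    `cells`: the subgraphs, listed in the order they are chained.
    `schedule_cell`: returns (optimal schedule, peak) for one cell,
    e.g. the dynamic programming-based scheduler."""
    full_schedule, overall_peak = [], 0
    for cell in cells:
        s, peak = schedule_cell(cell)   # conquer
        full_schedule.extend(s)         # combine
        overall_peak = max(overall_peak, peak)
    return full_schedule, overall_peak

# Toy usage with a stand-in per-cell scheduler.
cells = [['A', 'B'], ['C'], ['D', 'E']]
sched, peak = divide_and_conquer_schedule(cells, lambda c: (c, len(c)))
assert sched == ['A', 'B', 'C', 'D', 'E'] and peak == 2
```

Because the cells are independent apart from their single boundary activation, the overall peak is governed by the worst cell, while each DP invocation only pays for its own cell's 2^(|V|/N) signatures.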
Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stages of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ*, while all other solutions can be discarded, setting some budget τ greater than or equal to μ* and pruning suboptimal schedules whose μpeak exceeds τ can focus the search on a smaller search space S′T ⊂ ST while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increasing τ) if a program fails due to stack overflow (= 'no solution' due to overly aggressive pruning) and selling off excess memory (decreasing τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme that, while scheduling, cuts down the overall number of explored schedules. Figure 8 illustrates the overall idea, first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).
Figure 8(a) depicts a certain point while scheduling G where the nodes G, H, F, and J can be scheduled. In particular, the
[Figure 8(a) illustration: a graph G with nodes A-L and a recursion-tree snippet under budget τ = 36, showing candidate paths s1, s2, and s3 with output activation sizes annotated.]
(a) While paths s1 and s2 lead to the same z′, their μ and μpeak vary, and we can prune schedules that yield a higher μpeak than a given budget τ. Numbers next to boxes or circles are μ, and numbers next to edges are μpeak.
[Figure 8(b) illustration: number of explored schedules (∝ scheduling time) vs. budget, marking the region of scheduling failure ('no solution') below the optimal budget τ*, the soft budget τ, and the hard budget τmax, beyond which scheduling time becomes prohibitive ('timeout').]
(b) Adaptive soft budgeting starts by setting a hard budget τmax as the maximum value for the soft budget τ. It then conducts a binary search for a τ higher than τ*, so that it finds a solution, yet not too high, so that scheduling completes quickly.
Figure 8. Illustration of adaptive soft budgeting: (a) shows how schedules are pruned, and (b) illustrates how the soft budget τ relates to the number of explored schedules.
figure compares two possible solutions, s1 and s2, which schedule H → F and F → H, respectively, given τ = 36. While s1 and s2 both start from z with μ = 32, scheduling H leads to μpeak = 32 + 3 (H) = 35, whereas scheduling F or J leads to μpeak = 32 + 6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 will fail because their μpeak = 38 exceeds 36. So, as long as we set the budget τ higher than μ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < μ* leads to no solution, because the optimal path would be pruned away.
Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to μ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm, Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)), to adaptively gain an idea of the range for τ. We use the peak memory footprint of the resulting sequence as our hard budget τmax, and in contrast, we call the adaptively changing τ a soft budget. Since τmax ≥ μ*, we know that any τ ≥ τmax need not be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search, first running the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τnew ← (τnew + τold)/2) if it finds 'no solution' and decreases τ (τnew ← τnew/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as
Algorithm 2 Adaptive Soft Budgeting
1: Input: graph G
2: Output: optimal schedule s*
3: τmax ← μ(KahnsAlgorithm(G), G)  # hard budget
4: τold, τnew ← τmax
5: flag ← 'no solution'
6: repeat
7:   # binary search for τ: decrease τ if 'timeout'
8:   # and increase τ if 'no solution'
9:   if flag is 'timeout' then
10:    τold ← τnew, τnew ← τnew/2  # simultaneous
11:  else if flag is 'no solution' then
12:    τold ← τnew, τnew ← (τnew + τold)/2  # simultaneous
13:  end if
14:  flag, schedule ← Scheduling(G, τnew, T)  # run budget-pruned scheduler
15:  if flag is 'solution' then
16:    s* ← schedule  # optimal schedule
17:  end if
18: until flag is 'solution'
[Figure 9 omitted: illustrations of the two rewrites. Channel-wise partitioning replaces concat+conv with partial conv+add, reducing μ_peak = Σ size(x_i) + size(y) to max(size(x_i) + size(w_i ∗ x_i)); kernel-wise partitioning replaces concat+depthconv with partial depthconv+concat, reducing μ_peak = Σ size(x_i) + size(y) to max(size(x_i) + size(y_i)). x_i, y denote the i-th input and the output; w_ij denotes the j-th channel of the i-th kernel.]
Figure 9: Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.
soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with τ.
3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint
Reorganizing the computational graph of an irregularly wired neural network may lead to a significant reduction in the peak memory footprint μ_peak during computation. For example, a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce μ_peak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.
Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint μ_peak occurs when the output y ∈ R^n is being computed while the concatenated branches of the input x ∈ R^n are also mandated to reside in memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding, seemingly excessive, memory cost. To this end, we channel-wise partition the conv that follows the concat, so that each partitioned conv can be computed as soon as its input x_i becomes available.
Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates over and sums up the result of convolving the channels in conv. However, using the distributive property of Σ_i and ∗, these transform into a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes concat from the graph, leading to a lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ x_i + y to max(w_i ∗ x_i) + y, which becomes more effective when there are more incoming edges to the concat.
y = [Σ_i w_1i ∗ x_i, ..., Σ_i w_mi ∗ x_i]   (concat+conv)   (3)
  = Σ_i [w_1i ∗ x_i, ..., w_mi ∗ x_i]   (4)
  = Σ_i [w_1i, ..., w_mi] ∗ x_i   (5)
  = Σ_i [w_i ∗ x_i]   (partial conv+add)   (6)
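The equivalence in Equations 3-6 can be checked numerically on a toy case. This is a minimal sketch, not the paper's code: it models a 1×1 convolution over channels as a matrix multiply, with two input branches of 3 and 5 channels and m = 4 output kernels (all shapes chosen for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal((3, 8))   # branch 1: 3 channels x 8 pixels
x2 = rng.standard_normal((5, 8))   # branch 2: 5 channels x 8 pixels
W = rng.standard_normal((4, 8))    # 4 kernels over the 3+5 concatenated channels

# concat + conv (Eq. 3): all branches must be live simultaneously
y_concat = W @ np.concatenate([x1, x2], axis=0)

# partial conv + add (Eq. 6): each branch is consumed as soon as it is ready
y_partial = W[:, :3] @ x1 + W[:, 3:] @ x2

assert np.allclose(y_concat, y_partial)
```

The equality follows from block matrix multiplication, which is exactly the distributive property that Equations 4-6 exploit; the memory win comes from never materializing the concatenated input.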
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the concat+conv case above, the peak memory footprint μ_peak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among the different kernels to kernel-wise partition the depthconv that follows the concat, so that each input x_i is computed into a smaller feature map without residing in the memory for too long. Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce μ_peak significantly.
y = [w_1 ∗ x_1, ..., w_n ∗ x_n]   (concat+depthconv)   (7)
  = [[w_1 ∗ x_1], ..., [w_n ∗ x_n]]   (partial depthconv+concat)   (8)
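The commutation in Equations 7-8 can likewise be checked on a toy case. This is a minimal sketch under the same simplification as before: a 1×1 depthwise convolution reduces to a per-channel scaling, so each channel is multiplied by its own kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal((3, 8))   # branch 1: 3 channels x 8 pixels
x2 = rng.standard_normal((5, 8))   # branch 2: 5 channels x 8 pixels
w = rng.standard_normal((8, 1))    # one 1x1 kernel per channel (8 channels total)

# concat + depthconv (Eq. 7): the full concatenated x is live
y_concat = w * np.concatenate([x1, x2], axis=0)

# partial depthconv + concat (Eq. 8): each branch is convolved independently
y_partial = np.concatenate([w[:3] * x1, w[3:] * x2], axis=0)

assert np.allclose(y_concat, y_partial)
```

Because each output channel depends on exactly one input channel, the depthconv and the concat commute, which is all the rewrite relies on.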
Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph that can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to a lower memory cost.
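A pattern-matching rewrite of this kind can be sketched over a toy graph IR. This is an illustrative sketch, not the paper's implementation: the dict-based IR, the node names, and the `partial_conv` op are all hypothetical, and the now-dead concat node is left in place for a later dead-code-elimination pass.

```python
def rewrite_concat_conv(nodes):
    """Find conv(concat(x1..xn)) and rewrite it into add(partial_conv(xi)...),
    mirroring Eq. 6. `nodes` maps name -> {'op': str, 'inputs': [names]}."""
    new_nodes = {}
    for name, node in nodes.items():
        producer = nodes[node['inputs'][0]] if node['inputs'] else None
        if node['op'] == 'conv' and producer and producer['op'] == 'concat':
            partials = []
            for i, branch in enumerate(producer['inputs']):
                pname = f'{name}_partial{i}'   # hypothetical naming scheme
                new_nodes[pname] = {'op': 'partial_conv', 'inputs': [branch]}
                partials.append(pname)
            new_nodes[name] = {'op': 'add', 'inputs': partials}
        else:
            new_nodes[name] = node
    return new_nodes
```

A real pass would also split the conv's weights channel-wise (Eq. 5) and run the matcher to a fixed point; this sketch only shows the structural substitution.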
Table 1: Specification of the networks used for evaluation.

NETWORK   TYPE  DATASET   MAC     WEIGHT  TOP-1 ACCURACY
DARTS     NAS   ImageNet  574.0M  4.7M    73.3%
SwiftNet  NAS   HPD       57.4M   249.7K  95.1%
RandWire  RAND  CIFAR10   111.0M  1.2M    93.6%
RandWire  RAND  CIFAR100  160.0M  4.7M    74.5%
4 EVALUATION
We evaluate SERENITY on four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google), while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.
4.1 Methodology
Benchmarks and datasets. Table 1 lists the details of the networks, representative of irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, and only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. The RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (MAC), number of parameters (WEIGHT), and top-1 accuracy on their respective datasets.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
[Figure 10 omitted: bar chart of per-cell reductions in peak memory, ranging roughly from 1.25× to 3.45×; geomean 1.68× with the scheduler alone and 1.86× with graph rewriting added. Higher is better.]
Figure 10: Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).
[Figure 11 omitted: bar chart of reductions in off-chip memory communication for 32KB, 64KB, 128KB, and 256KB on-chip configurations; N/A marks cases that already fit on-chip for both, and several cases fit on-chip only with SERENITY, removing off-chip communication entirely.]
Figure 11: Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication, to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (N/A in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort to reduce the memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, and hence the power consumption and inference speed.
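Belady's clairvoyant policy is simple to sketch once the full access sequence is known, as it is here after a schedule is fixed: on a capacity miss, evict the resident tensor whose next use lies farthest in the future. The following is an illustrative sketch, not the paper's measurement harness.

```python
def belady_misses(accesses, capacity):
    """Count off-chip fetches under Belady's optimal (clairvoyant) replacement.

    `accesses` is the full sequence of tensor ids touched by the schedule;
    `capacity` is how many tensors fit on-chip (unit-size blocks for simplicity).
    """
    resident, misses = set(), 0
    for t, a in enumerate(accesses):
        if a in resident:
            continue                      # on-chip hit
        misses += 1                       # off-chip fetch
        if len(resident) == capacity:     # must evict: pick farthest next use
            def next_use(r):
                future = accesses[t + 1:]
                return future.index(r) if r in future else float('inf')
            resident.remove(max(resident, key=next_use))
        resident.add(a)
    return misses
```

Real tensors have varying sizes, so the paper's measurement would operate on bytes rather than unit blocks; the eviction rule is the same.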
Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
[Figure 12(a) omitted: memory footprint over time with the memory allocator; graph rewriting brings a 25.1KB reduction in peak memory footprint.]
(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB)
[Figure 12(b) omitted: memory footprint over time without the memory allocator; graph rewriting brings a 12.5KB reduction in peak memory footprint.]
(b) Memory footprint without the memory allocator
Figure 12: Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrows denote reductions).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB → 250.9KB), and the graph rewriting improves this by a further 25.1KB (250.9KB → 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the live activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB → 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
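The footprint-over-time curve of Figure 12(b) is just the sum of live activations after each scheduled node. The following is a simplified sketch of that bookkeeping, with hypothetical inputs: it frees a tensor once all of its consumers have executed and ignores the transient overlap of an operator's inputs and output during execution.

```python
def footprint_trace(schedule, sizes, consumers):
    """Return the sum of live activation sizes after each step of `schedule`.

    `sizes` maps node -> activation size; `consumers` maps node -> set of
    nodes that read its output (empty set for the graph's outputs).
    """
    live, trace = {}, []
    for i, node in enumerate(schedule):
        live[node] = sizes[node]           # node's output becomes live
        done = set(schedule[:i + 1])
        for p in list(live):               # free tensors with no pending reader
            if consumers[p] and consumers[p] <= done:
                del live[p]
        trace.append(sum(live.values()))
    return trace
```

Taking `max` over the returned trace gives the schedule's peak memory footprint, the quantity μ_peak that the scheduler minimizes.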
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speedup from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms, to demonstrate the speedup from the divide-and-conquer and adaptive soft budgeting techniques. The table lists the different combinations of algorithms, the number of
[Figure 13 omitted: per-cell scheduling times on a log scale; mean 40.6s without and 48.8s with graph rewriting.]
Figure 13: Scheduling time evaluation for SERENITY.
Table 2: Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.

GRAPH REWRITING  ALGORITHM  NODES AND PARTITIONS  SCHEDULING TIME
✗                ①          62 = 62               N/A
✗                ①+②        62 = 21+19+22         56.5 secs
✗                ①+②+③      62 = 21+19+22         3.79 secs
✓                ①          92 = 92               N/A
✓                ①+②        92 = 33+28+29         7.2 hours
✓                ①+②+③      92 = 33+28+29         11.19 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time, regardless of the graph rewriting. However, additionally applying ② divide-and-conquer (①+②) leads to a measurable scheduling time: 56.53 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (①+②+③) significantly reduces the scheduling time: 3.79 secs and 11.19 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Moreover, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi M Barham P Chen J Chen Z Davis A Dean JDevin M Ghemawat S Irving G Isard M et al TensorflowA system for large-scale machine learning In OSDI 2016
Abdelfattah M S Han D Bitar A DiCecco R OrsquoConnellS Shanker N Chu J Prins I Fender J Ling A C et alDLA Compiler and FPGA overlay for neural network inferenceacceleration In FPL 2018
Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar S Hwang K and Sung W Structured pruning of deepconvolutional neural networks JETC 2017
Belady L A A study of replacement algorithms for a virtual-storage computer IBM Systems Journal 1966
Bellman R Dynamic programming Science 1966
Bellman R E Dynamic programming treatment of the travelingsalesman problem 1961
Bernstein D Rodeh M and Gertner I On the complexity ofscheduling problems for parallelpipelined machines TC 1989
Bruno J and Sethi R Code generation for a one-register machineJACM 1976
Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen T Moreau T Jiang Z Zheng L Yan E Shen H CowanM Wang L Hu Y Ceze L et al Tvm An automated end-to-end optimizing compiler for deep learning In OSDI 2018
Chen Y Luo T Liu S Zhang S He L Wang J Li LChen T Xu Z Sun N et al Dadiannao A machine-learningsupercomputer In MICRO 2014
Chen Y-H Krishna T Emer J S and Sze V EyerissAn energy-efficient reconfigurable accelerator for deepconvolutional neural networks JSSC 2016
Cortes C Gonzalvo X Kuznetsov V Mohri M and YangS AdaNet Adaptive structural learning of artificial neuralnetworks In ICML 2017
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.
Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean J Machine learning for systems and systems for machinelearning In NIPS Workshop on ML Systems 2017
Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
Feurer M Klein A Eggensperger K Springenberg J BlumM and Hutter F Efficient and robust automated machinelearning In NIPS 2015
Fowers J Ovtcharov K Papamichael M Massengill T LiuM Lo D Alkalay S Haselman M Adams L Ghandi Met al A configurable cloud-scale dnn processor for real-timeai In ISCA 2018
Gao M Pu J Yang X Horowitz M and Kozyrakis CTETRIS Scalable and efficient neural network acceleration with3d memory In ASPLOS 2017
Gauen K Rangan R Mohan A Lu Y-H Liu W and BergA C Low-power image recognition challenge In ASP-DAC2017
Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.
Han S Pool J Tran J and Dally W Learning both weights andconnections for efficient neural network In NIPS 2015
Han S Liu X Mao H Pu J Pedram A Horowitz M A andDally W J EIE efficient inference engine on compressed deepneural network In ISCA 2016a
Han S Mao H and Dally W J Deep compression Compressingdeep neural networks with pruning trained quantization andhuffman coding In ICLR 2016b
He Y Lin J Liu Z Wang H Li L-J and Han S AMCAutoML for model compression and acceleration on mobiledevices In ECCV 2018
Held M and Karp R M A dynamic programming approach tosequencing problems Journal of the SIAM 1962
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.
Jain A Phanishayee A Mars J Tang L and Pekhimenko GGist Efficient data encoding for deep neural network trainingIn ISCA 2018
Jia Y Shelhamer E Donahue J Karayev S Long J GirshickR Guadarrama S and Darrell T Caffe Convolutionalarchitecture for fast feature embedding In MM 2014
Jia Z Lin S Qi C R and Aiken A Exploring hiddendimensions in parallelizing convolutional neural networks InICML 2018
Jia Z Thomas J Warszawski T Gao M Zaharia M andAiken A Optimizing dnn computation with relaxed graphsubstitutions In SysML 2019
Jouppi N P Young C Patil N Patterson D Agrawal GBajwa R Bates S Bhatia S Boden N Borchers A et alIn-datacenter performance analysis of a tensor processing unitIn ISCA 2017
Judd P Albericio J Hetherington T Aamodt T M andMoshovos A Stripes Bit-serial deep neural network computingIn MICRO 2016
Kahn A B Topological sorting of large networks CACM 1962
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schütze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner C and Adve V LLVM A compilation framework forlifelong program analysis amp transformation In CGO 2004
LeCun Y Denker J S and Solla S A Optimal brain damageIn NIPS 1990
Lee C Lee J K Hwang T and Tsai S-C Compileroptimization on vliw instruction scheduling for low powerTODAES 2003
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu Y Wang Y Yu R Li M Sharma V and Wang Y Opti-mizing CNN model inference on CPUs In USENIX ATC 2019b
Mireshghallah F Taram M Ramrakhyani P Jalali A TullsenD and Esmaeilzadeh H Shredder Learning noise distributionsto protect inference privacy In ASPLOS 2020
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.
Parashar A Rhu M Mukkara A Puglielli A Venkatesan RKhailany B Emer J Keckler S W and Dally W J ScnnAn accelerator for compressed-sparse convolutional neuralnetworks In ISCA 2017
Paszke A Gross S Massa F Lerer A Bradbury J Chanan GKilleen T Lin Z Gimelshein N Antiga L et al PyTorchAn imperative style high-performance deep learning library InNeurIPS 2019
Plump D Term graph rewriting In Handbook Of GraphGrammars And Computing By Graph Transformation Volume2 Applications Languages and Tools World Scientific 1999
Real E Aggarwal A Huang Y and Le Q V Regularizedevolution for image classifier architecture search In AAAI 2019
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser A and Geiszlig R Graph rewriting for hardware dependentprogram optimizations In AGTIVE 2007
Sharma H Park J Suda N Lai L Chau B Chandra V and Es-maeilzadeh H Bit Fusion Bit-level dynamically composable ar-chitecture for accelerating deep neural networks In ISCA 2018
Sifre L and Mallat S Rigid-motion scattering for imageclassification PhD dissertation 2014
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang K Liu Z Lin Y Lin J and Han S HAQ Hardware-aware automated quantization with mixed precision In CVPR2019
Wilken K Liu J and Heffernan M Optimal instructionscheduling using integer programming In PLDI 2000
Wortsman M Farhadi A and Rastegari M Discovering neuralwirings In NeurIPS 2019
Wu C-J Brooks D Chen K Chen D Choudhury S DukhanM Hazelwood K Isaac E Jia Y Jia B et al Machinelearning at facebook Understanding inference at the edge InHPCA 2019
Xie S Kirillov A Girshick R and He K Exploring randomlywired neural networks for image recognition In ICCV 2019
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph B Vasudevan V Shlens J and Le Q V Learning transfer-able architectures for scalable image recognition In CVPR 2018
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
[Figure 14 omitted: two scatter plots of top-1 ImageNet accuracy, (a) vs. multiply-and-accumulate count (billions) and (b) vs. number of parameters (millions), comparing irregularly wired networks (e.g., RandWire, AmoebaNet, NASNet) against regular-topology networks (e.g., MobileNet, Inception, ResNet-152, SENet); toward the top left is better, and the irregularly wired networks dominate.]
Figure 14: ImageNet accuracy vs. number of multiply-and-accumulates or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
[Figure 15 omitted: bar chart of absolute peak memory footprints in KB per cell for TensorFlow Lite, dynamic programming + memory allocator, and dynamic programming + graph rewriting + memory allocator; smaller is better.]
Figure 15: Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING
Here, we prove the optimality of the above dynamic programming-based scheduling algorithm.
THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption μ*, it is sufficient to keep just one schedule-peak memory pair (s_i, μ_i) in ST_i for each zero-indegree set z_i, and to append subsequent nodes on top of s_i to get s_{i+1} in each search step.
Proof. If i = 0, the optimal s_0 is an empty sequence and μ_0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) v_i constitutes s*, substituting u*_i ∈ z_i, and achieves μ*. In such a case, let v_i be replaced with the (optimal) u*_i, which will result in μ_peak ← min(μ_i + Π v_i.shape, μ_i + Π u*_i.shape), and μ_{i+1} is calculated by deducting Π p_i.shape, ∀p_i ∈ (u_i.preds ∩ zero-outdegree(s_{i+1}, G)). By recursively applying u_k for the rest of the search steps k, the algorithm should find an alternative sequence s*′ with μ*′ ≤ μ*, due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. ∎
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring ST and our dynamic programming-based scheduling. While both algorithms list candidate schedules and calculate their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we construct G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
Figure 16: Topology of G used to demonstrate the upper-bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there are |V|−2 remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes (|V|−2)! and the overall complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.
For the first and the last search steps, we assume a single entry node and a single exit node. On the other hand, since the number of nodes scheduled by search step i is i−1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step makes an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 explores |V|−2 nodes, and search steps 2 to |V|−1 iterate over |V|−1−i nodes each. Summarizing, this yields:

1 + 1·(|V|−2) + C(|V|−2, 1)·(|V|−3) + ··· + C(|V|−2, |V|−2)·0 + 1
= 1 + C(|V|−2, 0)·(|V|−2) + C(|V|−2, 1)·(|V|−3) + ··· + C(|V|−2, |V|−2)·0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)·(|V|−2−i)
= 2 + (|V|−2)·2^(|V|−3)
≤ (|V|−2)·2^(|V|−2)    for |V| ≥ 4
≤ |V|·2^|V|
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V| × 2^|V|). By using Stirling's approximation on the O(|V|!) complexity of the recursive topological sorting, we can show that the dynamic programming-based scheduling algorithm is significantly faster than the recursive topological ordering.
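For intuition about the size of this gap, the two bounds can be tabulated with a few lines of Python (an illustrative sketch, not part of the paper's artifact; the function names are ours):

```python
from math import factorial

def exhaustive_bound(n):
    """Schedules explored by recursive topological sorting, up to O(n!)."""
    return factorial(n)

def dp_bound(n):
    """Operations of the dynamic-programming scheduler, up to n * 2^n."""
    return n * 2 ** n

for n in (10, 20, 30):
    print(f"|V|={n}: exhaustive ~{exhaustive_bound(n):.3e}, DP ~{dp_bound(n):.3e}")
```

Already at |V| = 30 the exhaustive bound exceeds the dynamic-programming bound by more than twenty orders of magnitude.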
in contrast to the previously streamlined execution of models with regular topology, renders conventional frameworks futile in taking these networks to the edge due to their large peak memory footprint. While peak memory footprint is largely dependent on the scheduling of neurons, current deep learning compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Abadi et al., 2016; Paszke et al., 2019; Jia et al., 2014) rely on basic topological ordering algorithms that are oblivious to peak memory footprint and instead focus on the orthogonal problem of tiling and kernel-level optimization. This paper is an initial step towards embedding peak memory footprint as a first-grade constraint in deep learning schedulers to unleash the potential of the emergent irregularly wired neural networks. As such, this paper makes the following contributions:
(1) Memory-aware scheduling for irregularly wired neural networks. Scheduling for these networks is a topological ordering problem, which enumerates an intractably large space of possible schedules. We offer and leverage the insight that partial schedules lead to repeated subpaths during search, and use the graph properties to generate a signature for these repetitions while embedding a notion of the running memory usage. These signatures enable the use of Dynamic Programming as a basis for the optimization algorithm.

(2) Adaptive soft budgeting for tractable compilation time. Even with dynamic programming as the base, due to the sheer number of neurons and connections, the search space may remain too large (exponentially large) in practice. As such, we devise an Adaptive Soft Budgeting technique that uses a lightweight meta-search mechanism to find the appropriate memory budget for pruning suboptimal paths. This technique aims to find an inflection point beyond which tighter budgets may lead to no solution, while looser budgets prolong the scheduling substantially, putting the optimization in a position of questionable utility.

(3) Identity graph rewriting for enabling higher potential in memory reduction. Any scheduling algorithm, including ours, is still bound to the topology of the neural graph under compilation. To relax this intrinsic restriction, we devise an Identity Graph Rewriting scheme that exchanges subgraphs, leading to a lower memory footprint without altering the mathematical integrity of the neural network.
Results show that our adaptive scheduling algorithm improves peak memory footprint for irregularly wired neural networks by 1.68× compared to TensorFlow Lite, the de facto framework for edge devices. Our graph rewriting technique provides an opportunity to lower the peak memory footprint by an additional 10.7%. Furthermore, our framework can even bring about a 1.76× reduction in off-chip traffic for devices with a multi-level memory hierarchy, and can even eliminate the traffic in some cases by confining the memory footprint below the on-chip memory capacity. These gains come at an average of less than one minute of extra compilation time.
(a) RandWire (b) SwiftNet
Figure 1: Architecture of network models from NAS and Random Network Generators. The topology of such networks includes distinctive irregular wirings between the nodes.
2 CHALLENGES AND OUR APPROACH
2.1 Irregularly Wired Neural Networks
Recent excitement in Automated Machine Learning (AutoML) (Feurer et al., 2015; Dean, 2017; He et al., 2018; Elthakeb et al., 2018; Wang et al., 2019; Laredo et al., 2019) aims to take humans out of the loop in developing machine learning systems. This includes Neural Architecture Search (NAS) (Zoph & Le, 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Network Generators (Xie et al., 2019; Wortsman et al., 2019) that focus on the automation of designing neural architectures. Figure 1 demonstrates that networks of this regime are characterized by their distinctive irregular graph topology, with much more irregular wirings (dataflow) compared to conventional networks with regular graph topology. This paper refers to these networks as irregularly wired neural networks.
Figure 2: ImageNet accuracy vs. number of multiply-and-accumulate operations, where irregularly wired neural networks show higher performance for the same compute than regular topology neural networks. A plot against the number of parameters displays a similar trend.
(a) SwiftNet Cell A
(b) CDF of peak memory for different possible schedules: under a 250 KB constraint, 4.1% of schedules satisfy the constraint and only 0.04% of schedules are optimal
Figure 3: CDF of the peak memory footprint for the different possible schedules of a given irregularly wired neural network.
From the performance perspective, these networks have been shown to outperform manually designed architectures in terms of accuracy while using fewer resources. In fact, the majority of winning neural architectures in competitions whose primary goal is reducing resources (Gauen et al., 2017) rely on NAS, suggesting its effectiveness in that respect. Figure 2 plots the accuracy of different models given their computation. The figure clearly shows that the Pareto frontier of irregularly wired neural networks from NAS and Random Network Generators is better than that of the hand-designed models with regular topology. This indicates that the efficiency in terms of accuracy for fixed resources is better with the irregularly wired neural networks.
2.2 Challenges
Many existing compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Paszke et al., 2019; Abadi et al., 2016; Jia et al., 2014) rely on basic topological ordering algorithms to schedule the graph. While the current approach may be sufficient to run conventional networks on server-class machines, such a scheme may be unfit for running irregularly wired neural networks on resource-constrained edge devices. This is because, unlike running networks with regular topology, running irregular networks results in a varied range of memory footprints depending on the schedule. For instance, given the constraints of a representative edge device (SparkFun Edge: 250 KB weight/activation memory and 60M MACs), Figure 3(b) shows that only 4.1% of the schedules barely meet the hard memory constraint, while only 0.04% would achieve the optimal peak memory. In reality, such a limitation will prevent further exploration of the diversity and innovation of network design, and in order to allow the edge computing regime to take full advantage of irregularly wired neural networks, this limitation should be alleviated, if not removed.
2.3 Design Objectives
Scheduling algorithm. To address this issue, our work aims to find a schedule of nodes s* from the search space S that minimizes the peak memory footprint μpeak. S enumerates all possible orderings of the nodes v ∈ V, where V is the set of all nodes within a graph G:

s* = argmin_s μpeak(s, G),  for s ∈ S    (1)
The most straightforward way to schedule is a brute-force approach, which simply enumerates S and picks the ordering with the minimum peak memory footprint. While this extreme method may find an optimal solution, it is too costly in terms of time due to its immense complexity, Θ(|V|!), where |V| denotes the number of nodes in the graph. One way to improve on this is to narrow the search space down to only the topological orderings ST ⊂ S. However, this still suffers from a complexity with an upper bound of O(|V|!) (it takes days to schedule a DAG with merely 30 nodes). In fact, previous works (Bruno & Sethi, 1976; Bernstein et al., 1989) already prove that optimal scheduling for DAGs is NP-complete. At the other extreme are heuristics for topological ordering, such as Kahn's algorithm (Kahn, 1962), with complexity O(|V|+|E|), where V and E are the numbers of nodes and edges. However, as demonstrated in Figure 3, such a method may yield a suboptimal schedule of nodes which will not run on the target hardware. To this end, we explore dynamic programming combined with adaptive soft budgeting for scheduling to achieve an optimal solution s*, while keeping the graph constant, without adding too much overhead in terms of time. We explain our algorithms in depth in Sections 3.1 and 3.2.
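To make the brute-force cost concrete, the enumeration of topological orderings and their peak footprints can be sketched as follows (an illustrative toy with our own graph encoding: `graph` maps each node to its consumers, `sizes` gives each output activation's size; a node's output is freed once all its consumers have run). Even this 6-node toy already scans 720 permutations, and the count grows factorially:

```python
from itertools import permutations

def peak_memory(schedule, graph, sizes):
    """Peak activation footprint: a node's output is allocated when the
    node runs and freed once all of its consumers have run."""
    done, alive, mem, peak = set(), set(), 0, 0
    for u in schedule:
        mem += sizes[u]            # allocate u's output
        alive.add(u)
        done.add(u)
        peak = max(peak, mem)
        # free any live tensor whose consumers have all been scheduled
        for p in [p for p in alive if graph[p] and set(graph[p]) <= done]:
            alive.remove(p)
            mem -= sizes[p]
    return peak

def brute_force(graph, sizes):
    """Enumerate all topological orderings (Theta(|V|!) in the worst case)."""
    best = (float('inf'), None)
    for order in permutations(graph):
        pos = {v: i for i, v in enumerate(order)}
        if all(pos[u] < pos[v] for u in graph for v in graph[u]):
            best = min(best, (peak_memory(order, graph, sizes), order))
    return best

cell = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["F"], "E": ["F"], "F": []}
sz = {"A": 1, "B": 4, "C": 4, "D": 1, "E": 1, "F": 1}
print(brute_force(cell, sz))   # optimal peak for this toy cell is 6
print(peak_memory(("A", "B", "C", "D", "E", "F"), cell, sz))  # a naive order peaks at 9
```

The gap between 6 and 9 on a 6-node cell is exactly the schedule-dependent variation that Figure 3 shows at network scale.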
Graph rewriting. Any scheduling algorithm, including ours, is intrinsically bounded by the graph topology. Therefore, we explore transforming the search space through graph rewriting (Plump, 1999). Graph rewriting is generally concerned with substituting a certain pattern in the graph with a different pattern to achieve a certain objective. For a computational dataflow graph, by leveraging the distributive, associative, and commutative properties within the computation of the graph, graph rewriting can maintain the semantics while bringing significant improvements on some objective. For example, in general programs, Σ_i log x_i can be represented as Σ_{odd i} log x_i + Σ_{even i} log x_i or as log ∏_i x_i, while x + x can be translated to x × 2 or x ≪ 1.
Likewise, we bring this insight to neural networks to find a set of possible transformations X that can rewrite the original graph G to a new graph G′, which would also change our search space S to one with a lower peak memory footprint:

X* = argmin_X μpeak(s*, X(G))    (2)
We identify a set of candidate patterns for transformation, χ: g → g′ (g ∈ G and g′ ∈ G′), which constitutes X. While transforming the graph, our method keeps the mathematical integrity of the graph intact; thus it is not an approximation method. We embed this systematic way of improving the peak memory footprint and the search space as identity graph rewriting, and we address this technique in Section 3.3.
[Figure 4 diagram: input graph G → Identity Graph Rewriter (rewrites the graph to alleviate the activation memory footprint) → rewritten graph G′ → Dynamic Programming-based Scheduler (finds a memory-optimal schedule given an input graph) ⇄ Adaptive Soft Budgeting (adaptively manages the soft budget τ and per-step time limit T to speed up scheduling; flag = 'no solution' | 'timeout' | 'solution') → schedule s]
Figure 4: Overall workflow of SERENITY: memory-aware scheduling of irregularly wired neural networks.
3 SERENITY: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS
As discussed in Section 2, the objective is to reduce the peak memory footprint while executing irregularly wired neural networks. We propose SERENITY, memory-aware scheduling that targets devices with restricted resources (e.g., edge devices). Figure 4 summarizes the overall scheduling process, highlighting the major contributions of our approach. The input to SERENITY is a graph of an irregularly wired neural network G, which in fact acts as an intermediate representation (IR) during the scheduling process. We augment this IR with the metadata of the nodes, such as the operation type, input/output edges, input/output shapes, and memory cost. Then the graph rewriter transforms the graph G → G′ to relax the memory costs of memory-intensive patterns, with the goal of reducing the peak memory footprint μpeak of G. SERENITY schedules the graph to an optimal schedule s* using the dynamic programming-based scheduler. However, since the scheduling may be slow due to its complexity, we scale down the search space by leveraging divide-and-conquer, which partitions the graph into multiple subgraphs. Then, we augment the scheduler with adaptive soft budgeting, which prunes suboptimal paths by adaptively finding a budget for thresholding through a swift meta-search, to speed up the scheduling process. This section focuses on the innovations of SERENITY: dynamic programming-based scheduling, divide-and-conquer, adaptive soft budgeting, and graph rewriting, explained in detail in Sections 3.1, 3.2, and 3.3, respectively.
3.1 Dynamic Programming-based Scheduling: Achieving Optimal Peak Memory Footprint
Our goal for the scheduling algorithm is to minimize the peak memory footprint μpeak(s, G). As stated in Section 2.3, recursive algorithms that cover the entire search space S, or the subspace of all topological orderings ST ⊂ S, take impractically long. This is primarily due to the repetitive re-computation of subproblems, which upper-bounds the algorithm by O(|V|!). Therefore, we leverage dynamic programming (Bellman, 1961; 1966; Held & Karp, 1962), which includes a memoization scheme that has been shown to be effective in reducing the complexity of time-intensive algorithms by reusing solutions from their subproblems, while still finding the optimal solution by sweeping the entire search space.
Identifying a signature to enable dynamic programming. The first step in applying dynamic programming to a new problem is characterizing the structure of an optimal solution s* = s*n (s*n is an optimal solution for n nodes). Next, it requires identifying a recursive relationship between the optimal solution of a subproblem s*i and that of the original problem s*i+1, and we do this by analyzing the straightforward recursive topological ordering, which, while inefficient, sweeps the entire search space. In essence, a topological ordering algorithm is a repeated process of identifying a set of nodes available for scheduling and iterating over that set for recursion. In graph theory, such a set of nodes available for scheduling is called a zero-indegree set z, where z is a set of nodes for which all incoming edges and the corresponding predecessor nodes (indegree) have been scheduled. Figure 5 demonstrates the recursion tree of the different topological ordering algorithms, where the height of the tree is the search step and every path from the root to a leaf is a topological ordering s ∈ ST. The figure highlights the redundant occurrences of z in the recursion tree of the recursive topological ordering, then merges these z to make them unique, identifying each as the signature for repetition and preventing the aforementioned re-computation. This makes the scheduling for z a unique subproblem that constitutes
Figure 5: Illustration of identifying redundant zero-indegree sets z and making z unique (square) throughout the topological ordering algorithm to reduce re-computation.
Figure 6: Visualization of scheduling the node u8 = H during the search step i = 8. Starting from s8, μ8, and μpeak,8, the figure shows how the algorithm calculates s9, μ9, and μpeak,9.
the dynamic programming-based topological ordering
Integrating the peak memory footprint constraint. On top of the dynamic programming formulation, which shows potential for significantly reducing the search space, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint μi+1 and its corresponding peak μpeak,i+1 in each search step i to select the optimal path s*i+1 for memoization. Here we clarify the process of a search step, explaining the details of calculating μpeak,i+1 and saving si+1 for each search step i. In each search step we start with the unique zero-indegree sets zi (signatures) saved in the i-th entry of the memoization table Mi. For each zi we store the schedule up to that point si, the sum of the activations in memory μi for the signature zi, and the peak memory footprint of si, denoted μpeak,i. Therefore, in each search step i, we start with si, μi, and μpeak,i. Then, when we iterate over zi to schedule a new node ui, its output activation is appended to si to form si+1 and is allocated in memory. The size of ui is the product (∏) of ui.shape, where shape is a property of the activation tensor that includes the channels, height, width, and precision (e.g., byte, float); it is added to μi, so μi+1 ← μi + ∏(ui.shape). Then we use μi+1 as μpeak to update μpeak,i+1 (the peak memory footprint for si+1). Since some predecessors of ui will no longer be used after allocating ui, we update the outdegrees of those nodes by decrementing them. Having updated the outdegrees, we are left with a zero-outdegree set that denotes the nodes ready for deallocation. We deallocate the nodes in this set and update μi+1 accordingly.
To demonstrate the scheduling of a node ui, Figure 6 simulates scheduling u8 = H at search step i = 8. In the figure, (1) H is appended to s8 and allocated in memory as it is scheduled, and the scheduler records the maximum of μpeak,8 and the sum of all activations in memory at this point as μpeak,9. It then recalculates the outdegrees of the predecessor nodes of H: the outdegrees of D and E are decremented from one to zero. (2) These nodes are then deallocated, and the sum of the
Algorithm 1 Dynamic Programming-based Scheduling
1: Input: graph G
2: Output: optimal schedule s*
3: ▷ initialize memoization
4: s0 ← [], μ0, μpeak,0 ← 0, z0 ← zero-indegree(s0, G)
5: M0[z0] ← (s0, μ0, μpeak,0)
6: ▷ iterate over search steps
7: for i = 0 to n−1 do
8:   ▷ iterate over (schedule, current memory, peak memory)
9:   for zi, (si, μi, μpeak,i) in Mi do
10:    for ui in zi do
11:      si+1 ← si.append(ui)    ▷ allocate
12:      zi+1 ← zero-indegree(si+1, G)
13:      μi+1, μpeak ← μi + ∏(ui.shape)
14:      μpeak,i+1 ← max(μpeak,i, μpeak)
15:      for pi in ui.preds do
16:        if pi in zero-outdegree(si+1, G) then
17:          μi+1 ← μi+1 − ∏(pi.shape)    ▷ deallocate
18:        end if
19:      end for
20:      ▷ memoize schedule with least peak memory
21:      if μpeak,i+1 ≤ Mi+1[zi+1].μpeak,i+1 then
22:        Mi+1[zi+1] ← (si+1, μi+1, μpeak,i+1)
23:      end if
24:    end for
25:  end for
26: end for
27: s*, μ*peak ← Mn[·].sn, Mn[·].μpeak,n    ▷ solution
activation memory here is recorded as μ9.
Finding the schedule with optimal peak memory footprint. After scheduling ui, we save the new signature into Mi+1 for the next search step i+1. Since the goal of this work is to minimize the overall μpeak, we identify the corresponding optimal schedule s*i+1 for each zi+1 by only saving the si+1 with the minimum μpeak,i+1. We integrate the aforementioned steps of scheduling ui and updating Mi+1 to complete the proposed dynamic programming-based scheduling algorithm, which Algorithm 1 summarizes. As a first step, the algorithm initializes the memoization table M0; then the algorithm iterates over the search steps. In each search step i, the algorithm performs the above-illustrated memory allocation for all ui in zi, saving si+1, μi+1, and μpeak,i+1. After iterating through all search steps up to n−1, s* is saved in Mn with a unique entry, n being the number of nodes in G. We provide the proof of the optimality of the peak memory footprint in the supplementary material.
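As a concrete illustration, the search in Algorithm 1 can be sketched in a few dozen lines of Python (a minimal re-implementation for exposition, not the authors' code; the names `dp_schedule`, `graph`, and `sizes` are ours, and the set of already-scheduled nodes serves as the memoization signature in place of the zero-indegree set z):

```python
def dp_schedule(graph, sizes):
    """Memory-optimal topological schedule via dynamic programming.

    graph: dict node -> list of consumer nodes (a DAG)
    sizes: dict node -> size of the node's output activation
    A node's output is allocated when it is scheduled and freed once all
    of its consumers have been scheduled; sink outputs stay resident.
    """
    nodes = list(graph)
    preds = {v: {u for u in nodes if v in graph[u]} for v in nodes}

    # memo: signature (set of scheduled nodes) -> (schedule, mem, peak)
    memo = {frozenset(): ([], 0, 0)}
    for _ in nodes:  # one search step per node
        nxt = {}
        for done, (sched, mem, peak) in memo.items():
            # zero-indegree set z: unscheduled nodes with all preds scheduled
            for u in (v for v in nodes if v not in done and preds[v] <= done):
                mem_u = mem + sizes[u]              # allocate u's output
                peak_u = max(peak, mem_u)
                done_u = done | {u}
                for p in preds[u]:                  # deallocate dead inputs
                    if all(c in done_u for c in graph[p]):
                        mem_u -= sizes[p]
                # memoize only the schedule with the least peak per signature
                if peak_u < nxt.get(done_u, (None, None, float('inf')))[2]:
                    nxt[done_u] = (sched + [u], mem_u, peak_u)
        memo = nxt
    (sched, _, peak), = memo.values()               # single complete entry
    return sched, peak

cell = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["F"], "E": ["F"], "F": []}
sz = {"A": 1, "B": 4, "C": 4, "D": 1, "E": 1, "F": 1}
order, peak = dp_schedule(cell, sz)
print(order, peak)  # finds a schedule with the optimal peak of 6 for this toy
```

Exhaustive enumeration of this toy visits every topological ordering, while the memoization above collapses orderings that reach the same set of scheduled nodes into one entry.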
Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V| × 2^|V|), which is significantly faster than the exhaustive search of ST with its upper-bound complexity of O(|V|!).
Figure 7: Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).
Due to space limitations, we present the derivation of the algorithm complexity in the supplementary material.
3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling
While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.
Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass shaped, because many NAS and Random Network Generators design cells with a single input and a single output and then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach reduces the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time decreases from |V| × 2^|V| to |V| × 2^(|V|/N), so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.
As such, Figure 7 shows that this insight can be extended to our problem setting, where we can first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage partitions the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule sg (conquer). Considering that the number of nodes in a subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
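Under the single-input/single-output cell assumption, the divide step can be implemented by locating "waist" nodes whose activation is the only one crossing between cells. The sketch below is illustrative, not the authors' implementation: it partitions at such nodes, schedules each segment with a pluggable per-segment scheduler (a plain Kahn ordering here, where SERENITY would plug in the budgeted DP scheduler), and concatenates the sub-schedules:

```python
def ancestors(graph, v):
    """All nodes with a path to v; graph maps node -> consumers."""
    preds = {n: [u for u in graph if n in graph[u]] for n in graph}
    seen, stack = set(), list(preds[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(preds[u])
    return seen

def cut_nodes(graph):
    """Hourglass 'waists': nodes u such that every other node is an
    ancestor or a descendant of u, so no activation crosses u."""
    anc = {v: ancestors(graph, v) for v in graph}
    cuts = [v for v in graph
            if all(u == v or u in anc[v] or v in anc[u] for u in graph)]
    return sorted(cuts, key=lambda v: len(anc[v]))  # in topological order

def kahn(graph):
    """Any topological order; stand-in for the per-segment scheduler."""
    indeg = {v: 0 for v in graph}
    for v in graph:
        for s in graph[v]:
            indeg[s] += 1
    ready = sorted(v for v in graph if indeg[v] == 0)
    order = []
    while ready:
        u = ready.pop(0)
        order.append(u)
        for s in graph[u]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

def divide_and_conquer(graph, schedule_fn=kahn):
    """Divide at cut nodes, conquer each segment, combine sub-schedules."""
    anc = {v: ancestors(graph, v) for v in graph}
    schedule, covered = [], set()
    for c in cut_nodes(graph):
        members = (anc[c] - covered) | {c}
        seg = {v: [s for s in graph[v] if s in members] for v in members}
        schedule += schedule_fn(seg)
        covered |= members
    return schedule

stacked = {"A": ["B", "C"], "B": ["D"], "C": ["D"],           # cell 1: A..D
           "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}  # cell 2: D..G
print(cut_nodes(stacked))         # the waists A, D, G separate the two cells
print(divide_and_conquer(stacked))
```

Because only the cut node's activation survives across a segment boundary, scheduling each segment independently leaves the concatenated schedule's peak unchanged up to that constant offset.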
Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stages of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ*, while all other solutions can be discarded, setting some budget τ that is greater than or equal to μ* and pruning suboptimal schedules whose μpeak exceeds τ can focus the search on a smaller search space S′T ⊂ ST while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increasing τ) if a program fails due to stack overflow (= 'no solution' due to an overly aggressive pruning) and selling off excess memory (decreasing τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme while scheduling, to cut down the overall number of explored schedules. Figure 8 illustrates the overall idea, first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).
Figure 8(a) depicts a certain point while scheduling G where the nodes G, H, F, and J can be scheduled. In particular, the
(a) While both paths s1 and s2 lead to the same z′, their μ and μpeak vary, and we can prune schedules that yield a higher μpeak than a given budget τ. Numbers next to a box or circle are μ, and numbers next to edges are μpeak.
[Plot: number of explored schedules (∝ scheduling time) vs. budget, marking scheduling failure ('no solution') below the optimal budget τ*, prohibitive scheduling time ('timeout') for loose budgets, and the soft budget τ between τ* and the hard budget τmax]
(b) Adaptive soft budgeting starts by setting a hard budget τmax as the maximum value for the soft budget τ. It then conducts a binary search for a τ higher than τ*, so that it finds a solution, yet not so high that scheduling cannot complete quickly.
Figure 8: Illustration of adaptive soft budgeting: (a) shows how schedules are pruned and (b) illustrates how the soft budget τ relates to the number of explored schedules.
figure compares two possible solutions, s1 and s2, which schedule H→F and F→H, respectively, given τ = 36. While s1 and s2 both start from z with μ = 32, scheduling H leads to μpeak = 32 + 3 (H) = 35, whereas scheduling F or J leads to μpeak = 32 + 6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 will fail because their μpeak = 38 exceeds 36. So as long as we set the budget τ higher than μ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < μ* leads to no solution, because the optimal path would be pruned away.
Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to μ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm, Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)), to adaptively get an idea of the range for τ. We use the peak memory footprint of this sequence as our hard budget τmax; in contrast, we call the adaptively changing τ a soft budget. Since τmax ≥ μ*, we know that any τ ≥ τmax need not be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that runs the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τnew ← (τnew + τold)/2) if it finds 'no solution' and decreases τ (τnew ← τnew/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as
Algorithm 2 Adaptive Soft Budgeting
1: Input: graph G
2: Output: optimal schedule s*
3: τmax ← μ(KahnsAlgorithm(G), G)    ▷ hard budget
4: τold, τnew ← τmax
5: flag ← 'no solution'
6: repeat
7:   ▷ binary search for τ: decrease τ if 'timeout'
8:   ▷ and increase τ if 'no solution'
9:   if flag is 'timeout' then
10:    τold, τnew ← τnew, τnew/2    ▷ simultaneous
11:  else if flag is 'no solution' then
12:    τold, τnew ← τnew, (τnew+τold)/2    ▷ simultaneous
13:  end if
14:  flag, schedule ← Scheduling(G, τnew, T)    ▷ budgeted scheduling (Section 3.1)
15:  if flag is 'solution' then
16:    s* ← schedule    ▷ optimal schedule
17:  end if
18: until flag is 'solution'
[Figure 9 diagram: channel-wise partitioning rewrites concat+conv (μpeak = Σ size(xi) + size(y)) into partial conv+add (μpeak = max(size(xi) + size(wi ∗ xi), …)); kernel-wise partitioning rewrites concat+depthconv (μpeak = Σ size(xi) + size(y)) into partial depthconv+concat (μpeak = max(size(xi) + size(yi), …)); xi: i-th input, y: output, wij: j-th channel of the i-th kernel]
Figure 9: Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.
soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with τ.
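The meta-search loop of Algorithm 2 can be sketched as follows (an illustrative sketch; `adaptive_soft_budget` and the mock budgeted scheduler are our hypothetical names, and the mock's timeout behavior merely imitates a real scheduler that explores too many paths under a loose budget):

```python
def adaptive_soft_budget(schedule_with_budget, tau_max):
    """Binary-search a workable soft budget tau (a sketch of Algorithm 2).

    schedule_with_budget(tau) must return one of:
      ('solution', schedule) - a schedule fitting within budget tau was found
      ('no solution', None)  - tau pruned away every path (budget too tight)
      ('timeout', None)      - too little pruning, a search step exceeded T
    """
    tau_old = tau_new = tau_max
    while True:
        flag, schedule = schedule_with_budget(tau_new)
        if flag == 'solution':
            return schedule, tau_new
        if flag == 'timeout':
            # budget too loose: halve it (simultaneous update)
            tau_old, tau_new = tau_new, tau_new / 2
        else:
            # 'no solution': budget too tight, move halfway back up
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2

def make_mock_scheduler(mu_star, loose_above):
    """Hypothetical stand-in for the budgeted DP scheduler: succeeds when
    mu_star <= tau <= loose_above, 'times out' when tau prunes too little."""
    def schedule_with_budget(tau):
        if tau < mu_star:
            return 'no solution', None
        if tau > loose_above:
            return 'timeout', None
        return 'solution', ['...']
    return schedule_with_budget

schedule, tau = adaptive_soft_budget(make_mock_scheduler(100, 150), tau_max=320)
print(tau)  # 320 -> timeout -> 160 -> timeout -> 80 -> no solution -> 120: solution
```

With μ* = 100 and a hard budget of 320, the search probes 320, 160, 80, and 120, settling on a budget of 120 inside the feasible window, mirroring the buy-more/sell-excess memory analogy above.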
3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint
Reorganizing the computational graph of irregularly wired neural networks may lead to a significant reduction in the peak memory footprint μpeak during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of input activations and expand the search space of neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce μpeak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.
Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat, [·]) that takes multiple branches of the input prior to a convolution (conv, ∗). While executing such a pattern, the peak memory footprint μpeak occurs when the output y ∈ R^n is being computed, while the concatenated branches of the input x ∈ R^n are also mandated to reside in memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat, so that each partitioned conv can be computed as soon as its input xi becomes available.
Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates over and sums up the result of convolving the channels in conv. However, using the distributive property of Σ_i and *, these transform to a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes the concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ_i x_i + y to max_i(w_i * x_i) + y, which becomes more effective when there are more incoming edges to the concat.
y = [Σ_i w_{1i} * x_i, ..., Σ_i w_{mi} * x_i]    (concat + conv)    (3)
  = Σ_i [w_{1i} * x_i, ..., w_{mi} * x_i]    (4)
  = Σ_i [w_{1i}, ..., w_{mi}] * x_i    (5)
  = Σ_i [w_i * x_i]    (partial conv + add)    (6)
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat+conv case, the peak memory footprint µpeak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among the different kernels to kernel-wise partition the depthconv that follows the concat, so that each input xi is computed into smaller feature maps without residing in the memory too long. Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript indices) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce µpeak significantly.
y = [w_1 * x_1, ..., w_n * x_n]    (concat + depthconv)    (7)
  = [[w_1 * x_1], ..., [w_n * x_n]]    (partial depthconv + concat)    (8)
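Equations 7-8 admit the same kind of numeric check. Since each depthwise kernel touches exactly one channel, convolving after the concat and concatenating after per-branch depthconvs are elementwise identical; again, names and shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two branches with 3 and 5 channels, 8 spatial positions each.
x1, x2 = rng.standard_normal((3, 8)), rng.standard_normal((5, 8))

# Depthwise "1x1" kernels: one scalar weight per channel.
w = rng.standard_normal((8, 1))

# Baseline (Eq. 7): concat, then depthwise-convolve all 8 channels.
y_concat = w * np.concatenate([x1, x2], axis=0)

# Rewritten (Eq. 8): kernel-wise partition the depthconv so each
# branch is convolved independently, then concat the partial results.
y_partial = np.concatenate([w[:3] * x1, w[3:] * x2], axis=0)

assert np.allclose(y_concat, y_partial)
```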
Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
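As an illustration of this pattern-matching step (a toy sketch, not SERENITY's matcher), the following scans an operator graph for a concat whose sole consumer is a conv, which is exactly the region the channel-wise rewrite of Equations 3-6 targets; the graph encoding is hypothetical:

```python
def match_concat_conv(nodes, edges):
    """Find (concat, conv) pairs eligible for channel-wise rewriting.

    nodes: {name: op_type}, edges: list of (src, dst) pairs.
    A concat qualifies only if the conv is its unique consumer,
    so the rewrite cannot change any other node's input."""
    consumers = {}
    for src, dst in edges:
        consumers.setdefault(src, []).append(dst)
    matches = []
    for name, op in nodes.items():
        outs = consumers.get(name, [])
        if op == "concat" and len(outs) == 1 and nodes[outs[0]] == "conv":
            matches.append((name, outs[0]))
    return matches
```

A real rewriter would then splice in the partial convs and an add in place of each matched pair; only the matching step is sketched here.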
Table 1: Specification of the networks used for evaluation.

NETWORK  | TYPE | DATASET  | MAC    | WEIGHT | TOP-1 ACCURACY
DARTS    | NAS  | IMAGENET | 574.0M | 4.7M   | 73.3
SWIFTNET | NAS  | HPD      | 57.4M  | 249.7K | 95.1
RANDWIRE | RAND | CIFAR10  | 111.0M | 1.2M   | 93.6
RANDWIRE | RAND | CIFAR100 | 160.0M | 4.7M   | 74.5
4 EVALUATION
We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme1 for both. Furthermore, we also experiment with the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.
4.1 Methodology
Benchmarks and datasets. Table 1 lists the details of the networks, representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet: only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (MAC), number of parameters (WEIGHT), and top-1 accuracy on their respective datasets.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figures illustrate that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
1 TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
Figure 10: Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).
Figure 11: Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication, to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (N/A in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort of reducing the memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, hence the power consumption and inference speed.
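Belady's policy is easy to state once the access trace is fixed, as it is here after scheduling: on a miss, evict the resident item whose next use lies farthest in the future. A small sketch (not the paper's measurement harness) that counts misses, i.e. off-chip transfers, for a given trace and on-chip capacity:

```python
def belady_misses(trace, capacity):
    """Count misses under Belady's clairvoyant replacement policy.

    trace: list of item names in access order (known a priori once
    the schedule is fixed); capacity: number of resident items."""
    cache, misses = set(), 0
    for i, item in enumerate(trace):
        if item in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            # Next use of each resident item; never-used-again -> infinity,
            # which makes it the preferred eviction victim.
            def next_use(c):
                try:
                    return trace.index(c, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses
```

Because the policy is provably optimal for a known trace, the miss count isolates the effect of the schedule itself from any particular online replacement heuristic.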
Improvement from dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB)
(b) Memory footprint without the memory allocator
Figure 12: Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB → 250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB → 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. Then, it shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB → 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
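The "without allocator" curve in Figure 12(b) is simply the sum of live activations over the schedule; a node's output materializes when it executes and a value is freed once its last consumer has run. A sketch of that bookkeeping, with hypothetical inputs (per-node output sizes and consumer sets):

```python
def peak_footprint(schedule, out_size, consumers):
    """Peak sum of live activation sizes along a fixed schedule.

    schedule: nodes in execution order; out_size: {node: bytes};
    consumers: {node: set of nodes that read its output}."""
    remaining, cur, peak = {}, 0, 0
    for node in schedule:
        cur += out_size[node]            # output must materialize
        peak = max(peak, cur)
        remaining[node] = len(consumers[node])
        # node just consumed its inputs; free any producer it exhausted
        for src in [s for s in remaining if node in consumers[s]]:
            remaining[src] -= 1
            if remaining[src] == 0:      # last consumer done: free it
                cur -= out_size[src]
    return peak
```

Running this over two schedules of the same graph reproduces, in miniature, the kind of gap between curves that Figure 12(b) shows.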
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes from graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of algorithms, the number of
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Figure 13: Scheduling time evaluation for SERENITY.
Table 2: Comparison of the scheduling time for different algorithms to schedule SwiftNet. (1), (2), and (3) represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.

GRAPH REWRITING | ALGORITHM   | NODES AND PARTITIONS | SCHEDULING TIME
no              | (1)         | 62 = 62              | N/A
no              | (1)+(2)     | 62 = 21+19+22        | 56.5 secs
no              | (1)+(2)+(3) | 62 = 21+19+22        | 37.9 secs
yes             | (1)         | 92 = 92              | N/A
yes             | (1)+(2)     | 92 = 33+28+29        | 7.2 hours
yes             | (1)+(2)+(3) | 92 = 33+28+29        | 111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned (1) dynamic programming-based scheduling leads to an immeasurably large scheduling time, regardless of the graph rewriting. However, the additional application of the (2) divide-and-conquer ((1)+(2)) leads to a measurable scheduling time: 56.53 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying (3) adaptive soft budgeting ((1)+(2)+(3)) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019)
bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention and aims to enable their execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there have been limited studies about the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, still remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.

Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.

Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.

Bellman, R. Dynamic programming. Science, 1966.

Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.

Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.

Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.

Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.

Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.

Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.

Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.

Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.

Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.

Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.

Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.

Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.

Kahn, A. B. Topological sorting of large networks. CACM, 1962.

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.

Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.

Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.

Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.

NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.

Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.

Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.

Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.

Zhu, M. and Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
(a) ImageNet accuracy vs number of multiply-and-accumulate
(b) ImageNet accuracy vs number of parameters
Figure 14: ImageNet accuracy vs. number of multiply-and-accumulates or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
Figure 15: Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORYFOOTPRINT FROM THE DYNAMICPROGRAMMING-BASED SCHEDULING
Here we prove the optimality of the above dynamicprogramming-based scheduling algorithm
THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption µ*, it is sufficient to keep just one schedule-peak memory pair (s_i, µ_i) in ST_i for each zero-indegree set z_i, and to append subsequent nodes on top of s_i to get s_{i+1} in each search step.
Proof. If i = 0, the optimal s_0 is an empty sequence and µ_0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) v_i constitutes s*, substituting u*_i ∈ z_i, and achieves µ*. In such a case, let v_i be replaced with the (optimal) u*_i, which will result in µ_peak ← min(µ_i + ∏(v_i.shape), µ_i + ∏(u*_i.shape)), and µ_{i+1} is calculated by deducting ∏(p_i.shape), ∀p_i ∈ (u_i.preds ∩ zero-outdegree(s_{i+1}, G)). By recursively applying u_k for the rest of the search steps k, the algorithm should find an alternative sequence s*′ with µ*′ ≤ µ*, due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. ∎
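A compact sketch of the search that Theorem 1 justifies: partial schedules are memoized by the set of already-scheduled nodes (which determines the zero-indegree set), and for each such signature only the lowest-peak entry survives. This is a simplification of the paper's algorithm with illustrative data structures, not its implementation:

```python
def dp_schedule(preds, succs, out_size):
    """preds/succs: {node: set of nodes}; out_size: {node: bytes}.
    Returns a schedule with minimum peak activation memory."""
    nodes = set(preds)
    # signature (scheduled-node set) -> (schedule, live bytes, peak bytes)
    frontier = {frozenset(): ([], 0, 0)}
    for _ in range(len(nodes)):
        nxt = {}
        for done, (sched, cur, peak) in frontier.items():
            # candidates: unscheduled nodes with all predecessors done
            for n in (n for n in nodes - done if preds[n] <= done):
                c = cur + out_size[n]          # n's output materializes
                p = max(peak, c)
                done2 = done | {n}
                # free predecessors whose consumers are now all scheduled
                c -= sum(out_size[q] for q in preds[n] if succs[q] <= done2)
                key = frozenset(done2)
                if key not in nxt or p < nxt[key][2]:
                    nxt[key] = (sched + [n], c, p)  # keep one pair per key
        frontier = nxt
    (sched, _, peak), = frontier.values()
    return sched, peak
```

By Theorem 1, discarding all but the best pair per signature loses no optimal completion, which is what shrinks the factorial space of orderings to at most 2^|V| signatures.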
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring ST and our dynamic programming-based scheduling. While both algorithms list candidate schedules and calculate their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we invent G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
Figure 16: Topology of G to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there will be |V|−2 remaining nodes, and these nodes can be scheduled independently of one another, thereby the number of candidate schedules becomes (|V|−2)! and the overall
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.
For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step i would be i−1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 would explore |V|−2 nodes, and the search steps 2 to |V|−1 would iterate over |V|−1−i nodes. Summarizing, this would yield:
1 + 1×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 1 + C(|V|−2, 0)×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)×(|V|−2−i)
= 2 + (|V|−2)×2^{|V|−3}
≤ (|V|−2)×2^{|V|−2}    for |V| ≥ 4
≤ |V|×2^{|V|}
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V|×2^{|V|}). By using Stirling's approximation on the complexity of the recursive topological sorting, we can see that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
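The closed form used in the derivation above can be spot-checked numerically; with n = |V|−2, the claim is that Σ_{i=0}^{n} C(n, i)·(n−i) = n·2^{n−1}:

```python
from math import comb

# Verify the binomial identity behind the O(|V| * 2^|V|) bound:
# sum_{i=0}^{n} C(n, i) * (n - i) == n * 2^(n - 1)
for n in range(1, 12):
    lhs = sum(comb(n, i) * (n - i) for i in range(n + 1))
    assert lhs == n * 2 ** (n - 1)
```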
[Figure 3: (a) SwiftNet Cell A (concat and conv nodes); (b) CDF of the peak memory footprint (200–400 KB) over the possible schedules. Under a 250 KB constraint, only 4.1% of schedules satisfy the constraint, and only 0.04% of schedules are optimal.]
Figure 3. CDF of the peak memory footprint for the different possible schedules of a given irregularly wired neural network.
From the performance perspective, these networks have been shown to outperform manually designed architectures in terms of accuracy while using fewer resources. In fact, the majority of winning neural architectures in competitions with the primary goal of reducing resources (Gauen et al., 2017) rely on NAS, suggesting its effectiveness in that respect. Figure 2 plots the accuracy of different models given their computation. The figure clearly shows that the Pareto frontier of irregularly wired neural networks from NAS and Random Network Generators is better than that of the hand-designed models with regular topology. This indicates that the efficiency in terms of accuracy given fixed resources is better with the irregularly wired neural networks.
2.2 Challenges
Many existing compilers (Chen et al., 2018; Vasilache et al., 2018) and frameworks (Paszke et al., 2019; Abadi et al., 2016; Jia et al., 2014) rely on basic topological ordering algorithms to schedule the graph. While the current approach may be sufficient to run conventional networks on server-class machines, such a scheme may be unfit for running irregularly wired neural networks on resource-constrained edge devices. This is because, unlike running networks with regular topology, running irregular networks results in a varied range of memory footprint depending on the schedule. For instance, given the constraints of a representative edge device (SparkFun Edge: 250KB weight/activation memory and 60M MACs), Figure 3(b) shows that only 4.1% of the schedules even meet the hard memory constraint, while only 0.04% would achieve the optimal peak memory. In reality, such a limitation will prevent further exploration regarding the diversity and innovation of network design, and in order to allow the edge computing regime to take full advantage of the irregularly wired neural networks, this limitation should be alleviated, if not removed.
2.3 Design Objectives
Scheduling algorithm. To address this issue, our work aims to find a schedule of nodes s* from the search space S that would minimize the peak memory footprint μpeak. S enumerates all possible orderings of the nodes v ∈ V, where V is the set of all nodes within a graph G:
s^* = \underset{s}{\operatorname{argmin}}\;\mu_{peak}(s, G)\quad\text{for } s \in S \qquad (1)
The most straightforward way to schedule is a brute-force approach, which just enumerates S and picks the one with the minimum peak memory footprint. While this extreme method may find an optimal solution, it is too costly in terms of time due to its immense complexity Θ(|V|!), where |V| denotes the number of nodes in the graph. One way to improve is to narrow down the search space to focus on only the topological orderings ST ⊂ S. However, this will still suffer from a complexity with an upper bound of O(|V|!) (it takes days to schedule a DAG with merely 30 nodes). In fact, previous works (Bruno & Sethi, 1976; Bernstein et al., 1989) already prove that optimal scheduling for DAGs is NP-complete. On the other extreme are heuristics for topological ordering, such as Kahn's algorithm (Kahn, 1962), with complexity of O(|V|+|E|), where |V| and |E| are the number of nodes and edges. However, as demonstrated in Figure 3, such a method may yield a suboptimal schedule of nodes which will not run on the target hardware. To this end, we explore dynamic programming combined with adaptive soft budgeting for scheduling to achieve an optimal solution s* while keeping the graph constant, without adding too much overhead in terms of time. We explain our algorithms in depth in Sections 3.1 and 3.2.
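For reference, a minimal sketch of Kahn's algorithm, the O(|V|+|E|) baseline heuristic mentioned above (the successor-list graph encoding is an assumption for illustration):

```python
from collections import deque

def kahn_topological_order(graph):
    """Kahn's algorithm: O(|V| + |E|), but oblivious to memory footprint.

    graph: dict mapping node -> list of successor nodes.
    """
    indegree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indegree[w] += 1
    ready = deque(v for v in graph if indegree[v] == 0)  # zero-indegree set
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return order

g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(kahn_topological_order(g))  # prints ['A', 'B', 'C', 'D']
```

The order it emits is always valid, but it picks ready nodes in arbitrary (FIFO) order, which is exactly why it can land anywhere in the wide footprint distribution of Figure 3.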
Graph rewriting. Any scheduling algorithm, including ours, is intrinsically bounded by the graph topology. Therefore, we explore transforming the search space through graph rewriting (Plump, 1999). Graph rewriting is generally concerned with substituting a certain pattern in the graph with a different pattern to achieve a certain objective. For a computational dataflow graph, leveraging distributive, associative, and commutative properties within the computation of the graph, graph rewriting can maintain the semantics while bringing significant improvements regarding some objective. For example, in general programs, $\sum_i \log x_i$ can be represented as $\sum_{\text{odd } i}\log x_i + \sum_{\text{even } i}\log x_i$ or $\log\prod_i x_i$, while $x+x$ can be translated to $x\times 2$ or $x \ll 1$.
Likewise, we bring this insight to neural networks to find a set of possible transformations X that can rewrite the original graph G to a new graph G′, which would also change our search space S to one with a lower peak memory footprint:
\mathcal{X}^* = \underset{\mathcal{X}}{\operatorname{argmin}}\;\mu_{peak}\big(s^*, \mathcal{X}(G)\big) \qquad (2)
We identify a set of candidate patterns for transformation, χ: g → g′ (g ∈ G and g′ ∈ G′), which constitutes X. While transforming the graph, our method keeps the mathematical integrity of the graph intact; thus it is not an approximation method. We embed this systematic way of improving the peak memory footprint and the search space as identity graph rewriting, and we address this technique in Section 3.3.
[Figure 4: graph G → Identity Graph Rewriter (rewrites the graph to alleviate the activation memory footprint) → rewritten graph G′ → Dynamic Programming-based Scheduler (finds a memory-optimal schedule given an input graph) ↔ Adaptive Soft Budgeting (adaptively manages the soft budget τ and time limit T to speed up scheduling; flag = 'no solution' / 'timeout' / 'solution') → schedule s.]
Figure 4. Overall workflow of SERENITY: memory-aware scheduling of irregularly wired neural networks.
3 SERENITY: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS
As discussed in Section 2, the objective is reducing the peak memory footprint while executing irregularly wired neural networks. We propose SERENITY, memory-aware scheduling that targets devices with restricted resources (e.g., edge devices). Figure 4 summarizes the overall scheduling process, highlighting the major contributions of our approach. The input to SERENITY is a graph of an irregularly wired neural network G, which in fact acts as an intermediate representation (IR) during the scheduling process. We augment this IR with the metadata of the nodes, such as the operation type, input/output edges, input/output shapes, and memory cost. Then, the graph rewriter transforms the graph G → G′ to relax the memory costs of memory-intensive patterns with the goal of reducing the peak memory footprint μpeak of G. SERENITY schedules the graph to an optimal schedule s* using the dynamic programming-based scheduler. However, since the scheduling may be slow due to the complexity, we scale down the search space by leveraging divide-and-conquer, which partitions the graph into multiple subgraphs. Then, we augment the scheduler with adaptive soft budgeting, which prunes suboptimal paths by adaptively finding a budget for thresholding through a swift meta-search, to speed up the scheduling process. This section focuses on the innovations of SERENITY: dynamic programming-based scheduling, divide-and-conquer, adaptive soft budgeting, and graph rewriting, which are explained in detail in Sections 3.1, 3.2, and 3.3, respectively.
3.1 Dynamic Programming-based Scheduling: Achieving Optimal Peak Memory Footprint
Our goal for the scheduling algorithm is to minimize the peak memory footprint μpeak(s, G). As stated in Section 2.3, recursive algorithms that cover the entire search space S, or the subspace of all topological orderings ST ⊂ S, take an impractically long time. This is primarily due to the repetitive re-computation of subproblems that upper bounds the algorithm by O(|V|!). Therefore, we leverage dynamic programming (Bellman, 1961; 1966; Held & Karp, 1962), which includes a memoization scheme that has been shown to be effective in reducing the complexity of time-intensive algorithms by reusing solutions from their subproblems, while still finding the optimal solution by sweeping the entire search space.
Identifying a signature to enable dynamic programming. The first step in applying dynamic programming to a new problem is characterizing the structure of an optimal solution s* = s*_n (s*_n is an optimal solution for n nodes). Then, it requires identifying a recursive relationship between the optimal solution of a subproblem s*_i and the original problem s*_{i+1}, and we do this by analyzing the straightforward recursive topological ordering, which, while inefficient, sweeps the entire search space. In essence, a topological ordering algorithm is a repeated process of identifying a set of nodes that are available for scheduling and iterating over that set for recursion. In graph theory, such a set of nodes available for scheduling is called a zero-indegree set z, where z is a set of nodes for which all incoming edges and the corresponding predecessor nodes (indegree) have been scheduled. Figure 5 demonstrates the recursion tree of the different topological ordering algorithms, where the height of the tree is the search step and every path from the root to a leaf is a topological ordering s ∈ ST. The figure highlights the redundant z in the recursion tree of the recursive topological ordering, then merges these z to make them unique, identifying them as the signature for repetition and preventing the aforementioned re-computation. This makes the scheduling for z a unique subproblem that constitutes
[Figure 5: graph G with nodes A–L. The recursive topological ordering expands redundant zero-indegree sets z across search steps, while the dynamic programming-based topological ordering memoizes a unique zero-indegree set z (signature) per subproblem.]
Figure 5. Illustration of identifying redundant zero-indegree sets z and making z unique (squares) throughout the topological ordering algorithm to reduce re-computation.
[Figure 6: graph G (nodes A–L) during search step i = 8, with s8 = A,B,C,D,E,F,I,J, z8 = {G, H}, and u8 = H. (0) Initial state with μpeak,8 from M8; (1) schedule/allocate H, so s9 = A,B,C,D,E,F,I,J,H and μpeak,9 = max(μpeak,8, μpeak); (2) the outdegrees of D and E drop 1 → 0, so they are deallocated, giving μ9.]
Figure 6. Visualization of scheduling the node u8 = H during the search step i = 8. Starting from s8, μ8, and μpeak,8, the figure shows how the algorithm calculates s9, μ9, and μpeak,9.
the dynamic programming-based topological ordering.
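The zero-indegree signature that keys the memoization can be computed directly from a partial schedule (a sketch; the successor-list graph encoding is an assumption for illustration):

```python
def zero_indegree_set(scheduled, graph):
    """Nodes available for scheduling: all their predecessors are scheduled.

    scheduled: set of already-scheduled nodes.
    graph: dict mapping node -> list of successor nodes.
    """
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for w in succs:
            preds[w].add(v)
    return frozenset(v for v in graph
                     if v not in scheduled and preds[v] <= scheduled)

g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# The partial schedules (A, B, C) and (A, C, B) differ, but they share the
# signature {D}: their remaining subproblem is memoized only once.
print(zero_indegree_set({"A", "B", "C"}, g))  # frozenset({'D'})
```

Returning a `frozenset` makes the signature hashable, so it can serve directly as the memoization key described above.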
Integrating the peak memory footprint constraint. On top of the dynamic programming formulation, which shows potential for optimizing the search space significantly, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint μi+1 and its corresponding peak μpeak,i+1 in each search step i to select the optimal path s*_{i+1} for memoization. Here, we clarify the process of a search step, explaining the details of calculating μpeak,i+1 and saving si+1 for each search step i. In each search step, we start with a number of unique zero-indegree sets zi (signatures) saved in the i-th entry of the memoization table Mi. For each zi, we append the schedule up to that point si, the sum of activations in memory μi for the signature zi, and the peak memory footprint of si, denoted μpeak,i. Therefore, in each search step i, we start with si, μi, and μpeak,i. Then, when we iterate over zi to schedule a new node ui, its output activation is appended to si to form si+1 and is allocated in memory. The size of ui is the product (∏) of ui.shape, where shape is a property of the activation tensor that includes channels, height, width, and precision (e.g., byte, float); it is added to μi, so μi+1 ← μi + ∏(ui.shape). Then, we use μi+1 as μpeak to update μpeak,i+1 (the peak memory footprint for si+1). Since some predecessors of ui will not be used anymore after allocating ui, we update the outdegrees of those nodes by decrementing them. Having updated the outdegrees, we are left with a zero-outdegree set that denotes the nodes ready for deallocation. We deallocate the nodes in the set and update μi+1 accordingly.
To demonstrate the scheduling of a node ui, Figure 6 simulates scheduling a node u8 = H at i = 8. In the figure, (1) H is appended to s8 and allocated to memory as it is scheduled, and then the scheduler records the maximum of μpeak,8 and the sum of all activations in memory at this point as μpeak,9. Then it recalculates the outdegrees of the predecessor nodes of H: D's and E's outdegrees are decremented from one to zero. (2) Then these nodes are deallocated, and the sum of the
Algorithm 1: Dynamic Programming-based Scheduling
1:  Input: graph G
2:  Output: optimal schedule s*
3:  # initialize memoization
4:  s0 ← [];  μ0, μpeak,0 ← 0;  z0 ← zero-indegree(s0, G)
5:  M0[z0] ← (s0, μ0, μpeak,0)
6:  # iterate over search steps
7:  for i = 0 to n−1 do
8:    # iterate over (schedule, current memory, peak memory)
9:    for zi, (si, μi, μpeak,i) in Mi do
10:     for ui in zi do
11:       si+1 ← si.append(ui)                       # allocate
12:       zi+1 ← zero-indegree(si+1, G)
13:       μi+1, μpeak ← μi + ∏(ui.shape)
14:       μpeak,i+1 ← max(μpeak,i, μpeak)
15:       for pi in ui.preds do
16:         if pi in zero-outdegree(si+1, G) then
17:           μi+1 ← μi+1 − ∏(pi.shape)              # deallocate
18:         end if
19:       end for
20:       # memoize schedule with least peak memory
21:       if μpeak,i+1 ≤ Mi+1[zi+1].μpeak,i+1 then
22:         Mi+1[zi+1] ← (si+1, μi+1, μpeak,i+1)
23:       end if
24:     end for
25:   end for
26: end for
27: s*, μ*peak ← Mn[·].sn, Mn[·].μpeak,n             # solution
activation memory at this point is recorded as μ9.

Finding the schedule with optimal peak memory footprint. After scheduling ui, we save the new signature into Mi+1 for the next search step i+1. Since the goal of this work is to minimize the overall μpeak, we identify the corresponding optimal schedule s*_{i+1} for each zi+1 by only saving the si+1 with the minimum μpeak,i+1. We integrate the aforementioned steps of scheduling ui and updating Mi+1 to complete the proposed dynamic programming-based scheduling algorithm, which Algorithm 1 summarizes. As a first step, the algorithm initializes the memoization table M0; then the algorithm iterates over the search steps. In each search step i, the algorithm performs the above-illustrated memory allocation for all ui in zi, saving si+1, μi+1, and μpeak,i+1. After iterating over all search steps up to n−1, s* is saved in Mn with a unique entry, for n being the number of nodes in G. We provide the proof for the optimality of the peak memory footprint in the supplementary material.
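A compact Python rendering of Algorithm 1 (a sketch under simplifying assumptions: each node carries a scalar activation size, the graph is a successor-list dict, and only predecessors are considered for deallocation, as in the listing):

```python
def dp_schedule(graph, size):
    """Memory-optimal topological schedule via memoized search.

    graph: dict mapping node -> list of successor nodes.
    size: dict mapping node -> scalar activation size.
    Returns (schedule, peak_memory).
    """
    preds = {v: [] for v in graph}
    for v, succs in graph.items():
        for w in succs:
            preds[w].append(v)

    def zero_indegree(done):
        # Signature: unscheduled nodes whose predecessors are all scheduled.
        return frozenset(v for v in graph
                         if v not in done and all(p in done for p in preds[v]))

    # signature -> (schedule, live_memory, peak_memory)
    memo = {zero_indegree(set()): ((), 0, 0)}
    for _ in range(len(graph)):
        nxt = {}
        for z, (s, mu, peak) in memo.items():
            done = set(s)
            for u in z:
                mu_next = mu + size[u]              # allocate u's output
                peak_next = max(peak, mu_next)
                done_next = done | {u}
                for p in preds[u]:                  # deallocate dead predecessors
                    if all(c in done_next for c in graph[p]):
                        mu_next -= size[p]
                z_next = zero_indegree(done_next)
                # keep only the schedule with the least peak per signature
                if z_next not in nxt or peak_next < nxt[z_next][2]:
                    nxt[z_next] = (s + (u,), mu_next, peak_next)
        memo = nxt
    (s, _, peak), = memo.values()                   # single entry remains
    return list(s), peak
```

On a small graph with one heavy branch (A→B→C, A→D, {C, D}→E, with size(B) = 10), the scheduler finishes the heavy B→C chain before allocating D, which is exactly the kind of order-dependent saving Figure 6 illustrates.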
Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V|×2^{|V|}), which is significantly faster than the exhaustive search of ST with an upper-bound complexity of O(|V|!).
[Figure 7: a graph with nodes A–H is divided into subgraphs g1 = {A,B,C,D} and g2 = {E,F,G,H}; each subgraph is scheduled separately (sg1 = A,B,C,D and sg2 = E,F,G,H), and the sub-schedules are concatenated into s = A,B,C,D,E,F,G,H.]
Figure 7. Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).
Due to the space limitation, we present the derivation of the algorithm complexity in the supplementary material.
3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling
While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.
Divide-and-conquer. We can observe from Figure 1 that the topologies of irregularly wired neural networks are hourglass-shaped, because many NAS and Random Network Generators design cells with a single input and a single output, then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach can reduce the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time will decrease from |V|×2^{|V|} to |V|×2^{|V|/N}, so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.
As such, Figure 7 shows this insight can be extended to our problem setting, where we can first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage is partitioning the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule sg (conquer). Considering that the number of nodes in a subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
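The combine step can be sketched as follows (illustrative; `schedule_cell` stands in for any optimal per-cell scheduler, such as the dynamic programming-based one):

```python
def divide_and_conquer_schedule(cells, schedule_cell):
    """cells: list of single-input single-output subgraphs in stacking order.

    Each cell is scheduled independently (conquer); since adjacent cells only
    meet at single cut nodes, concatenating the optimal sub-schedules
    (combine) yields a schedule for the whole hourglass-shaped graph.
    """
    schedule = []
    for cell in cells:
        schedule.extend(schedule_cell(cell))
    return schedule

# Usage with a stand-in scheduler that just sorts each cell's nodes:
cells = [["B", "A"], ["D", "C"]]
print(divide_and_conquer_schedule(cells, sorted))  # prints ['A', 'B', 'C', 'D']
```

The exponential cost is paid per cell rather than for the whole graph, which is where the |V|×2^{|V|} → |V|×2^{|V|/N} reduction comes from.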
Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stages of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ*, while all other solutions can be discarded, setting some budget τ that is greater than or equal to μ* and pruning suboptimal schedules whose μpeak exceeds τ can focus the search on a smaller search space S′T ⊂ ST while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increase τ) if a program fails due to stack overflow (= 'no solution' due to an overly aggressive pruning) and selling off excess memory (decrease τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme while scheduling, to cut down the overall number of explored schedules. Figure 8 illustrates the overall idea, first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).
Figure 8(a) depicts a certain point while scheduling G, where nodes G, H, F, and J can be scheduled. In particular, the
[Figure 8(a): a scheduling state on graph G with budget τ = 36, where the current zero-indegree set z has μ = 32; candidate continuations s1, s2, and s3 reach μpeak of 35 or 38 (numbers next to nodes are output activation sizes; numbers next to edges are μpeak).]
(a) While the paths s1 and s2 lead to the same z′, their μ and μpeak vary, and we can prune schedules that yield a higher μpeak than a given budget τ. Numbers next to boxes or circles are μ, and numbers next to edges are μpeak.
[Figure 8(b): number of explored schedules (∝ scheduling time) versus budget. Budgets below the optimal budget τ* cause scheduling failure ('no solution'); budgets near the hard budget τmax cause prohibitive scheduling time ('timeout'); adaptive soft budgeting searches for a soft budget τ in between.]
(b) Adaptive soft budgeting starts by setting a hard budget τmax as the maximum value for the soft budget τ. It then conducts a binary search for a τ higher than τ*, so that it finds a solution, yet not too high, so that scheduling completes quickly.

Figure 8. Illustration of adaptive soft budgeting: (a) shows how schedules are pruned, and (b) illustrates how the soft budget τ relates to the number of explored schedules.
figure compares two possible solutions, s1 and s2, which schedule H→F and F→H, respectively, given τ = 36. While s1 and s2 both start from z with μ = 32, scheduling H leads to μpeak = 32+3 (H) = 35, whereas scheduling F or J leads to μpeak = 32+6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 will fail, because their μpeak = 38 exceeds 36. So, as long as we set the budget τ higher than μ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < μ* leads to no solution, because the optimal path would be pruned away.
Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to μ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)) to adaptively gain an idea of the range for τ. We use the peak memory footprint from this sequence as our hard budget τmax, and in contrast, we call the adaptively changing τ a soft budget. Since τmax ≥ μ*, we know that any τ ≥ τmax does not need to be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search, first running the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τnew ← (τnew + τold)/2) if it finds 'no solution' and decreases τ (τnew ← τnew/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as
Algorithm 2: Adaptive Soft Budgeting
1:  Input: graph G
2:  Output: optimal schedule s*
3:  τmax ← μ(Kahn'sAlgorithm(G), G)      # hard budget
4:  τold, τnew ← τmax
5:  flag ← 'no solution'
6:  repeat
7:    # binary search for τ: decrease τ if 'timeout'
8:    # and increase τ if 'no solution'
9:    if flag is 'timeout' then
10:     τold ← τnew, τnew ← τnew/2         # simultaneous
11:   else if flag is 'no solution' then
12:     τold ← τnew, τnew ← (τnew+τold)/2  # simultaneous
13:   end if
14:   if flag is 'solution' then
15:     s* ← schedule                      # optimal schedule
16:   end if
17: until flag is 'solution'
[Figure 9: Channel-wise partitioning rewrites concat (inputs x1, x2, …, xn) + conv (kernels w1…wm) into partial conv + add, reducing μpeak = Σ size(xi) + size(y) to μpeak = max(size(xi) + size(wi ∗ xi), …). Kernel-wise partitioning rewrites concat + depthwise conv (kernels w1…wn) into partial depthwise conv + concat, reducing μpeak = Σ size(xi) + size(y) to μpeak = max(size(xi) + size(yi), …).]
Figure 9. Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.
soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with τ.
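The meta-search of Algorithm 2 can be sketched as follows (`try_schedule` is a stand-in for one budgeted run of the DP scheduler and is an assumption for illustration):

```python
def adaptive_soft_budgeting(try_schedule, tau_max):
    """Binary-search a soft budget tau, starting from the hard budget tau_max.

    try_schedule(tau) returns ('solution', schedule), ('timeout', None),
    or ('no solution', None), mirroring the flag in Algorithm 2.
    """
    tau_old = tau_new = tau_max
    while True:
        flag, schedule = try_schedule(tau_new)
        if flag == 'solution':
            return schedule, tau_new
        if flag == 'timeout':        # too little pruning: halve the budget
            tau_old, tau_new = tau_new, tau_new / 2
        else:                        # 'no solution': pruned too hard,
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2  # move back up

# A fake scheduler whose (hidden) optimal budget is 10 and which "times out"
# whenever the budget admits too many schedules:
def fake_scheduler(tau):
    if tau >= 30:
        return ('timeout', None)
    if tau < 10:
        return ('no solution', None)
    return ('solution', ['ok'])

print(adaptive_soft_budgeting(fake_scheduler, 100))  # (['ok'], 25.0)
```

The tuple assignments reproduce the "simultaneous" updates of lines 10 and 12 in Algorithm 2: both sides are evaluated with the old values of τold and τnew.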
3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint
Reorganizing the computational graph of irregularly wired neural networks may lead to a significant reduction in the peak memory footprint μpeak during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and to expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce μpeak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.
Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint μpeak occurs when the output y ∈ R^n is being computed while the concatenated branches of the input x ∈ R^n are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat, so that the partitioned conv can be computed as soon as the input xi becomes available.
Equations 3–6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates and sums up the result of convolving channels in conv. However, using the distributive property of $\sum_i$ and ∗, these transform to a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes the concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from $\sum_i x_i + y$ to $\max_i(w_i * x_i) + y$, which becomes more effective when there are more incoming edges to the concat.
\begin{align}
y &= \Big[\textstyle\sum_i w_{1i}*x_i,\; \dots,\; \textstyle\sum_i w_{mi}*x_i\Big] && \text{(concat+conv)} \tag{3}\\
&= \textstyle\sum_i \big[\,w_{1i}*x_i,\; \dots,\; w_{mi}*x_i\,\big] \tag{4}\\
&= \textstyle\sum_i \big[\,w_{1i},\; \dots,\; w_{mi}\,\big]*x_i \tag{5}\\
&= \textstyle\sum_i \big[\,w_i*x_i\,\big] && \text{(partial conv+add)} \tag{6}
\end{align}
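The identity behind Equations 3–6 can be checked numerically. The sketch below uses 1×1 convolutions, which reduce to plain matrix products, so that concat+conv and partial conv+add produce the same output (shapes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 5))   # branch 1: 4 channels, 5 spatial positions
x2 = rng.standard_normal((3, 5))   # branch 2: 3 channels
w = rng.standard_normal((2, 7))    # 2 output channels over 7 = 4 + 3 inputs

# concat + conv: every branch must stay live until y is produced
y_concat = w @ np.concatenate([x1, x2], axis=0)

# partial conv + add: each branch is consumed as soon as it is ready
w1, w2 = w[:, :4], w[:, 4:]        # channel-wise partition of the kernel
y_partial = w1 @ x1 + w2 @ x2

assert np.allclose(y_concat, y_partial)
```

The same block-matrix argument carries over to spatial convolutions, since Equation 5 only relies on distributing the sum over input channels.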
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the concat+conv case above, the peak memory footprint μpeak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among the different kernels to kernel-wise partition the depthconv that follows the concat, so that each input xi is computed into smaller feature maps without residing in the memory too long. As such, Equations 7–8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript indices) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce μpeak significantly.
\begin{align}
y &= \big[\,w_1*x_1,\; \dots,\; w_n*x_n\,\big] && \text{(concat+depthconv)} \tag{7}\\
&= \big[\,[w_1*x_1],\; \dots,\; [w_n*x_n]\,\big] && \text{(partial depthconv+concat)} \tag{8}
\end{align}
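Equations 7–8 can be checked the same way; per-channel scaling stands in for the depthwise convolution here (a simplification: a real depthwise conv slides a small kernel per channel, but its commutation with concat is the same):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal((4, 5))   # branch 1: 4 channels
x2 = rng.standard_normal((3, 5))   # branch 2: 3 channels
w = rng.standard_normal((7, 1))    # one kernel per channel, 7 = 4 + 3 channels

# concat + depthwise conv: the concatenated x stays live until y is done
y_concat = w * np.concatenate([x1, x2], axis=0)

# partial depthwise conv + concat: each branch is processed independently
w1, w2 = w[:4], w[4:]              # kernel-wise partition
y_partial = np.concatenate([w1 * x1, w2 * x2], axis=0)

assert np.allclose(y_concat, y_partial)
```

Because each output channel depends only on its own input channel, each branch can be computed and freed before the next branch is even loaded.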
Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
Table 1. Specification of the networks used for evaluation.

NETWORK   TYPE  DATASET   # MAC    # WEIGHT  TOP-1 ACCURACY
DARTS     NAS   ImageNet  574.0M   4.7M      73.3%
SwiftNet  NAS   HPD       57.4M    249.7K    95.1%
RandWire  RAND  CIFAR10   111.0M   1.2M      93.6%
RandWire  RAND  CIFAR100  160.0M   4.7M      74.5%
4 EVALUATION
We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.
4.1 Methodology
Benchmarks and datasets. Table 1 lists the details of the networks, representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, and only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (# MAC), number of parameters (# WEIGHT), and top-1 accuracy on their respective datasets.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
[Figure 10: bar chart of the reduction in peak memory over TensorFlow Lite for each cell of DARTS (ImageNet), SwiftNet (Human Presence), RandWire (CIFAR10), and RandWire (CIFAR100); higher is better. Dynamic programming + memory allocator achieves a 1.68× geomean reduction; adding graph rewriting raises it to 1.86×, with per-cell reductions ranging from 1.25× to 3.45×.]
Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).
[Figure 11: bar chart of the reduction in off-chip memory communication over TensorFlow Lite for on-chip memory sizes of 32KB, 64KB, 128KB, and 256KB, with a geomean reduction of 1.76× at 256KB. In several configurations only SERENITY fits on-chip, and in some cases SERENITY removes off-chip communication entirely; cells whose footprint already fits on-chip are marked N/A.]
Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint of irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects off-chip memory communication, which largely determines both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, to measure the off-chip memory communication and distill the effect of the proposed scheduling. The results show that SERENITY reduces the off-chip memory communication by 1.76× for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (N/A in the figure), there were also cases where SERENITY eradicated the off-chip communication entirely by containing the activations in on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's reduction of the memory footprint is also effective in reducing off-chip memory communication in systems with a memory hierarchy, and hence the power consumption and inference speed.
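To make the measurement methodology concrete, the following is a minimal sketch of Belady's clairvoyant (MIN) policy applied to a fixed tensor-access trace, as one might use it to count off-chip traffic once the schedule is known. This is not SERENITY's implementation; all names are illustrative, and charging a write-back on every eviction is a simplification (no clean/dirty tracking).

```python
# Sketch: off-chip traffic for a known schedule under Belady's MIN policy.
def offchip_traffic(accesses, sizes, capacity):
    """accesses: tensor ids in schedule order; sizes: id -> bytes;
    capacity: on-chip bytes. Returns bytes moved to/from off-chip memory."""
    onchip, traffic = set(), 0
    for i, t in enumerate(accesses):
        if t in onchip:
            continue                      # hit: no communication
        traffic += sizes[t]               # miss: fetch from off-chip
        # Evict until the new tensor fits, choosing the resident tensor
        # whose next use is farthest in the future (clairvoyant choice).
        while sum(sizes[r] for r in onchip) + sizes[t] > capacity:
            def next_use(r):
                for j in range(i + 1, len(accesses)):
                    if accesses[j] == r:
                        return j
                return float("inf")       # never used again: best victim
            victim = max(onchip, key=next_use)
            onchip.remove(victim)
            traffic += sizes[victim]      # simplification: always write back
        onchip.add(t)
    return traffic

trace = ["a", "b", "a", "c", "b", "d", "c"]
sizes = {"a": 4, "b": 4, "c": 4, "d": 4}
print(offchip_traffic(trace, sizes, capacity=8))  # → 24
```

With a capacity large enough to hold everything, the traffic drops to the four compulsory fetches, which is the sense in which a schedule that fits on-chip "removes" off-chip communication.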
Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
[Figure: memory footprint (KB) over time for Dynamic Programming + Memory Allocator vs. Dynamic Programming + Graph Rewriting + Memory Allocator; the annotation marks a 25.1KB reduction in peak memory footprint with the memory allocator.]
(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB)
[Figure: memory footprint (KB) over time for Dynamic Programming vs. Dynamic Programming + Graph Rewriting; the annotation marks a 12.5KB reduction in peak memory footprint.]
(b) Memory footprint without the memory allocator
Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement in peak memory footprint (551.0KB→250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB→225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. To focus on the effect of the scheduler and graph rewriting alone, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the live activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB→188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler, with the graph rewriting contributing further gains.
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time SERENITY takes to schedule the networks. The results show that the average scheduling time is 40.6 secs without graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. All the above gains of SERENITY therefore come at the cost of less than one minute of average extra compilation time. While dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY makes the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speedup from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms, to demonstrate the speedup from the divide-and-conquer and adaptive soft budgeting techniques. The table lists different combinations of algorithms, the number of
[Figure: scheduling time (seconds, log scale) per benchmark cell for Dynamic Programming + Memory Allocator and Dynamic Programming + Graph Rewriting + Memory Allocator; the means are 40.6s and 48.8s, respectively.]
Figure 13. Scheduling time evaluation for SERENITY.
Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.
GRAPH REWRITING | ALGORITHM | NODES AND PARTITIONS | SCHEDULING TIME
✗ | ① | 62 = 62 | N/A
✗ | ①+② | 62 = 21+19+22 | 565.3 secs
✗ | ①+②+③ | 62 = 21+19+22 | 37.9 secs
✓ | ① | 92 = 92 | N/A
✓ | ①+② | 92 = 33+28+29 | 7.29 hours
✓ | ①+②+③ | 92 = 33+28+29 | 111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the ① dynamic programming-based scheduling alone leads to an immeasurably large scheduling time regardless of graph rewriting. However, additionally applying the ② divide-and-conquer (①+②) leads to a measurable scheduling time: 565.3 secs and 7.29 hours to schedule without and with graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (①+②+③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs without and with graph rewriting, respectively. The above results indicate that applying the proposed algorithms yields a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are designed for the common regular patterns that have dominated deep learning almost from its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable its execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially under memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. Different optimization techniques aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Moreover, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi M Barham P Chen J Chen Z Davis A Dean JDevin M Ghemawat S Irving G Isard M et al TensorflowA system for large-scale machine learning In OSDI 2016
Abdelfattah M S Han D Bitar A DiCecco R OrsquoConnellS Shanker N Chu J Prins I Fender J Ling A C et alDLA Compiler and FPGA overlay for neural network inferenceacceleration In FPL 2018
Ahn B H Pilligundla P and Esmaeilzadeh H Chameleon Adaptive code optimization for expedited deep neural network compilation In ICLR 2020 URL https://openreview.net/forum?id=rygG4AVFvH
Anwar S Hwang K and Sung W Structured pruning of deepconvolutional neural networks JETC 2017
Belady L A A study of replacement algorithms for a virtual-storage computer IBM Systems Journal 1966
Bellman R Dynamic programming Science 1966
Bellman R E Dynamic programming treatment of the travelingsalesman problem 1961
Bernstein D Rodeh M and Gertner I On the complexity ofscheduling problems for parallelpipelined machines TC 1989
Bruno J and Sethi R Code generation for a one-register machineJACM 1976
Cai H Zhu L and Han S ProxylessNAS Direct neural architecture search on target task and hardware In ICLR 2019 URL https://openreview.net/forum?id=HylVB3AqYm
Chen T Moreau T Jiang Z Zheng L Yan E Shen H CowanM Wang L Hu Y Ceze L et al Tvm An automated end-to-end optimizing compiler for deep learning In OSDI 2018
Chen Y Luo T Liu S Zhang S He L Wang J Li LChen T Xu Z Sun N et al Dadiannao A machine-learningsupercomputer In MICRO 2014
Chen Y-H Krishna T Emer J S and Sze V EyerissAn energy-efficient reconfigurable accelerator for deepconvolutional neural networks JSSC 2016
Cortes C Gonzalvo X Kuznetsov V Mohri M and YangS AdaNet Adaptive structural learning of artificial neuralnetworks In ICML 2017
Courbariaux M Hubara I Soudry D El-Yaniv R and Bengio Y Binarized neural networks Training deep neural networks with weights and activations constrained to +1 or -1 arXiv 2016 URL https://arxiv.org/pdf/1602.02830.pdf
Cyphers S Bansal A K Bhiwandiwalla A Bobba J Brookhart M Chakraborty A Constable W Convey C Cook L Kanawi O et al Intel nGraph An intermediate representation compiler and executor for deep learning arXiv 2018 URL https://arxiv.org/pdf/1801.08058.pdf
Dean J Machine learning for systems and systems for machine learning In NIPS Workshop on ML Systems 2017
Elthakeb A T Pilligundla P Yazdanbakhsh A Kinzer S and Esmaeilzadeh H ReLeQ A reinforcement learning approach for deep quantization of neural networks arXiv 2018 URL https://arxiv.org/pdf/1811.01704.pdf
Esser S K McKinstry J L Bablani D Appuswamy R and Modha D S Learned step size quantization In ICLR 2020 URL https://openreview.net/forum?id=rkgO66VKDS
Feurer M Klein A Eggensperger K Springenberg J BlumM and Hutter F Efficient and robust automated machinelearning In NIPS 2015
Fowers J Ovtcharov K Papamichael M Massengill T LiuM Lo D Alkalay S Haselman M Adams L Ghandi Met al A configurable cloud-scale dnn processor for real-timeai In ISCA 2018
Gao M Pu J Yang X Horowitz M and Kozyrakis CTETRIS Scalable and efficient neural network acceleration with3d memory In ASPLOS 2017
Gauen K Rangan R Mohan A Lu Y-H Liu W and BergA C Low-power image recognition challenge In ASP-DAC2017
Google TensorFlow Lite URL https://www.tensorflow.org/mobile/tflite
Han S Pool J Tran J and Dally W Learning both weights andconnections for efficient neural network In NIPS 2015
Han S Liu X Mao H Pu J Pedram A Horowitz M A andDally W J EIE efficient inference engine on compressed deepneural network In ISCA 2016a
Han S Mao H and Dally W J Deep compression Compressingdeep neural networks with pruning trained quantization andhuffman coding In ICLR 2016b
He Y Lin J Liu Z Wang H Li L-J and Han S AMCAutoML for model compression and acceleration on mobiledevices In ECCV 2018
Held M and Karp R M A dynamic programming approach tosequencing problems Journal of the SIAM 1962
Howard A G Zhu M Chen B Kalenichenko DWang W Weyand T Andreetto M and Adam HMobileNets Efficient convolutional neural networksfor mobile vision applications arXiv 2017 URLhttpsarxivorgpdf170404861pdf
Jain A Phanishayee A Mars J Tang L and Pekhimenko GGist Efficient data encoding for deep neural network trainingIn ISCA 2018
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Jia Y Shelhamer E Donahue J Karayev S Long J GirshickR Guadarrama S and Darrell T Caffe Convolutionalarchitecture for fast feature embedding In MM 2014
Jia Z Lin S Qi C R and Aiken A Exploring hiddendimensions in parallelizing convolutional neural networks InICML 2018
Jia Z Thomas J Warszawski T Gao M Zaharia M andAiken A Optimizing dnn computation with relaxed graphsubstitutions In SysML 2019
Jouppi N P Young C Patil N Patterson D Agrawal GBajwa R Bates S Bhatia S Boden N Borchers A et alIn-datacenter performance analysis of a tensor processing unitIn ISCA 2017
Judd P Albericio J Hetherington T Aamodt T M andMoshovos A Stripes Bit-serial deep neural network computingIn MICRO 2016
Kahn A B Topological sorting of large networks CACM 1962
Keßler C and Bednarski A A dynamic programming approach to optimal integrated code generation In LCTES 2001
Laredo D Qin Y Schutze O and Sun J-Q Automatic model selection for neural networks arXiv 2019 URL https://arxiv.org/pdf/1905.06010.pdf
Lattner C and Adve V LLVM A compilation framework for lifelong program analysis & transformation In CGO 2004
LeCun Y Denker J S and Solla S A Optimal brain damageIn NIPS 1990
Lee C Lee J K Hwang T and Tsai S-C Compileroptimization on vliw instruction scheduling for low powerTODAES 2003
Liu H Simonyan K and Yang Y DARTS Differentiable architecture search In ICLR 2019a URL https://openreview.net/forum?id=S1eYHoC5FX
Liu Y Wang Y Yu R Li M Sharma V and Wang Y Opti-mizing CNN model inference on CPUs In USENIX ATC 2019b
Mireshghallah F Taram M Ramrakhyani P Jalali A TullsenD and Esmaeilzadeh H Shredder Learning noise distributionsto protect inference privacy In ASPLOS 2020
Mishra A and Marr D Apprentice Using knowledge distillation techniques to improve low-precision network accuracy In ICLR 2018 URL https://openreview.net/forum?id=B1ae1lZRb
NVIDIA TensorRT Programmable inference accelerator 2017 URL https://developer.nvidia.com/tensorrt
Parashar A Rhu M Mukkara A Puglielli A Venkatesan RKhailany B Emer J Keckler S W and Dally W J ScnnAn accelerator for compressed-sparse convolutional neuralnetworks In ISCA 2017
Paszke A Gross S Massa F Lerer A Bradbury J Chanan GKilleen T Lin Z Gimelshein N Antiga L et al PyTorchAn imperative style high-performance deep learning library InNeurIPS 2019
Plump D Term graph rewriting In Handbook Of GraphGrammars And Computing By Graph Transformation Volume2 Applications Languages and Tools World Scientific 1999
Real E Aggarwal A Huang Y and Le Q V Regularizedevolution for image classifier architecture search In AAAI 2019
Rotem N Fix J Abdulrasool S Catron G Deng S Dzhabarov R Gibson N Hegeman J Lele M Levenstein R et al Glow Graph lowering compiler techniques for neural networks arXiv 2018 URL https://arxiv.org/pdf/1805.00907.pdf
Schosser A and Geiß R Graph rewriting for hardware dependent program optimizations In AGTIVE 2007
Sharma H Park J Suda N Lai L Chau B Chandra V and Es-maeilzadeh H Bit Fusion Bit-level dynamically composable ar-chitecture for accelerating deep neural networks In ISCA 2018
Sifre L and Mallat S Rigid-motion scattering for imageclassification PhD dissertation 2014
Vasilache N Zinenko O Theodoridis T Goyal P DeVito Z Moses W S Verdoolaege S Adams A and Cohen A Tensor Comprehensions Framework-agnostic high-performance machine learning abstractions arXiv 2018 URL https://arxiv.org/pdf/1802.04730.pdf
Wang K Liu Z Lin Y Lin J and Han S HAQ Hardware-aware automated quantization with mixed precision In CVPR2019
Wilken K Liu J and Heffernan M Optimal instructionscheduling using integer programming In PLDI 2000
Wortsman M Farhadi A and Rastegari M Discovering neuralwirings In NeurIPS 2019
Wu C-J Brooks D Chen K Chen D Choudhury S DukhanM Hazelwood K Isaac E Jia Y Jia B et al Machinelearning at facebook Understanding inference at the edge InHPCA 2019
Xie S Kirillov A Girshick R and He K Exploring randomlywired neural networks for image recognition In ICCV 2019
Zhang T Yang Y Yan F Li S Teague H Chen Y et al SwiftNet Using graph propagation as meta-knowledge to search highly representative neural architectures arXiv 2019 URL https://arxiv.org/pdf/1906.08305.pdf
Zhou S Wu Y Ni Z Zhou X Wen H and Zou Y DoReFa-Net Training low bitwidth convolutional neural networks with low bitwidth gradients arXiv 2016 URL https://arxiv.org/pdf/1606.06160.pdf
Zhu M and Gupta S To prune or not to prune exploring the efficacy of pruning for model compression In ICLR Workshop 2018 URL https://openreview.net/forum?id=S1lN69AT-
Zoph B and Le Q V Neural architecture search with reinforcement learning ICLR 2017 URL https://openreview.net/forum?id=r1Ue8Hcxg
Zoph B Vasudevan V Shlens J and Le Q V Learning transfer-able architectures for scalable image recognition In CVPR 2018
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
[Figure: two scatter plots of Top-1 ImageNet accuracy (65%–85%) for irregularly wired neural networks (RandWire, AmoebaNet, NASNet, SwiftNet) vs. regular topology networks (Inception V1–V4, MobileNet, ShuffleNet, Xception, ResNet-152, ResNeXt-101, PolyNet, SENet, DPN-131); top-left is better, and the irregularly wired networks dominate at equal compute and at equal parameter count.]
(a) ImageNet accuracy vs. number of multiply-and-accumulate operations (billions)
(b) ImageNet accuracy vs. number of parameters (millions)
Figure 14. ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
[Figure: peak memory footprint (KB, smaller is better) per benchmark cell for TensorFlow Lite, Dynamic Programming + Memory Allocator, and Dynamic Programming + Graph Rewriting + Memory Allocator.]
Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING
Here we prove the optimality of the above dynamic programming-based scheduling algorithm.
THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption μ*, it is sufficient to keep just one schedule-peak memory pair (s_i, μ_peak,i) in ST_i for each zero-indegree set z_i, and to append subsequent nodes on top of s_i to get s_{i+1} in each search step.
Proof. If i = 0, the optimal s_0 is an empty sequence and μ_0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) v_i constitutes s*, substituting u*_i ∈ z_i, and achieves μ*. In such a case, let v_i be replaced with the (optimal) u*_i, which will result in μ_peak ← min(μ_i + Π(v_i.shape), μ_i + Π(u*_i.shape)), and μ_{i+1} is calculated by deducting Π(p_i.shape) for all p_i ∈ (u_i.preds ∩ zero-outdegree(s_{i+1}, G)). By recursively applying u_k for the rest of the search steps k, the algorithm will find an alternative sequence s*′ with μ*′ ≤ μ* due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring ST and our dynamic programming-based scheduling. While both algorithms list candidate schedules and calculate their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we construct G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
[Figure: graph G with entry node A, exit node Z, and independent branch nodes B, C, D, ..., W, X, Y between them.]
Figure 16. Topology of G to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there will be |V|−2 remaining nodes, and these nodes can be scheduled independently of one another. Thereby, the number of candidate schedules becomes (|V|−2)! and the overall
complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.

For the first and the last search steps, we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step i is i−1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step makes an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 explores |V|−2 nodes, and search steps 2 to |V|−1 iterate over |V|−1−i nodes. Summarizing, this yields:

1 + 1×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 1 + C(|V|−2, 0)×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)×(|V|−2−i)
= 2 + (|V|−2)×2^{|V|−3}
≤ (|V|−2)×2^{|V|−2}   for |V| ≥ 4
≤ |V|×2^{|V|}
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V|×2^{|V|}). By using Stirling's approximation on the complexity of the recursive topological sorting, we can see that the dynamic programming-based scheduling algorithm is significantly faster than the recursive topological ordering.
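The step counts above can be checked numerically. The following sketch (names are illustrative) verifies the closed form 2 + (|V|−2)·2^{|V|−3} against the summation, confirms the O(|V|·2^{|V|}) bound, and contrasts it with the (|V|−2)! candidate orderings of the exhaustive search:

```python
# Numeric check of the complexity derivation above (illustrative only).
from math import comb, factorial

def dp_steps(V):
    """2 + sum_i C(V-2, i) * (V-2-i): the dynamic programming step count."""
    n = V - 2
    return 2 + sum(comb(n, i) * (n - i) for i in range(n + 1))

for V in range(4, 12):
    n = V - 2
    # closed form derived above: 2 + (|V|-2) * 2^(|V|-3)
    assert dp_steps(V) == 2 + n * 2 ** (n - 1)
    # bounded by |V| * 2^|V|
    assert dp_steps(V) <= V * 2 ** V

print(dp_steps(10), factorial(8))  # → 1026 40320
```

For a 10-node instance of G, the memoized search visits on the order of a thousand states while the exhaustive ordering enumerates 8! = 40320 schedules, and the gap widens factorially with |V|.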
[Figure: overall workflow of SERENITY. An input graph G passes through the Identity Graph Rewriter (rewrites the graph to alleviate the activation memory footprint of the graph), the Dynamic Programming-based Scheduler (finds a memory-optimal schedule s* given an input graph), and Adaptive Soft Budgeting (adaptively manages a soft budget τ to speed up scheduling, with flag = 'no solution' | 'timeout' | 'solution').]
Figure 4. Overall workflow of SERENITY: memory-aware scheduling of irregularly wired neural networks.
3 SERENITY: MEMORY-AWARE SCHEDULING OF IRREGULARLY WIRED NEURAL NETWORKS
As discussed in Section 2, the objective is reducing the peak memory footprint while executing irregularly wired neural networks. We propose SERENITY, a memory-aware scheduling approach that targets devices with restricted resources (e.g., edge devices). Figure 4 summarizes the overall scheduling process, highlighting the major contributions of our approach. The input to SERENITY is a graph of an irregularly wired neural network G, which in fact acts as an intermediate representation (IR) during the scheduling process. We augment this IR with the metadata of the nodes, such as the operation type, input/output edges, input/output shapes, and memory cost. Then the graph rewriter transforms the graph G → G′ to relax the memory costs of memory-intensive patterns, with the goal of reducing the peak memory footprint μ_peak of G. SERENITY schedules the graph to an optimal schedule s* using the dynamic programming-based scheduler. However, since the scheduling may be slow due to the complexity, we scale down the search space by leveraging divide-and-conquer, which partitions the graph into multiple subgraphs. Then, we augment the scheduler with adaptive soft budgeting, which prunes suboptimal paths by adaptively finding a budget for thresholding through a swift meta-search, to speed up the scheduling process. This section focuses on the innovations of SERENITY: dynamic programming-based scheduling, divide-and-conquer, adaptive soft budgeting, and graph rewriting, explained in detail in Sections 3.1, 3.2, and 3.3, respectively.
3.1 Dynamic Programming-based Scheduling: Achieving Optimal Peak Memory Footprint
Our goal for the scheduling algorithm is to minimize the peak memory footprint μ_peak(s, G). As stated in Section 2.3, recursive algorithms that cover the entire search space S, or the subspace of all topological orderings ST ⊂ S, take impractically long. This is primarily due to the repetitive re-computation of subproblems that upper bounds the algorithm by O(|V|!). Therefore, we leverage dynamic programming (Bellman, 1961; 1966; Held & Karp, 1962), which includes a memoization scheme that has been shown to be effective in reducing the complexity of time-intensive algorithms by reusing solutions from their subproblems, while still finding the optimal solution by sweeping the entire search space.
Identifying a signature to enable dynamic programming. The first step in applying dynamic programming to a new problem is characterizing the structure of an optimal solution s* = s*_n (s*_n is an optimal solution for n nodes). Next, we identify a recursive relationship between the optimal solution of a subproblem s*_i and the original problem s*_{i+1}, and we do this by analyzing the straightforward recursive topological ordering, which, while inefficient, sweeps the entire search space. In essence, a topological ordering algorithm is a repeated process of identifying a set of nodes that are available for scheduling and iterating over that set for recursion. In graph theory, such a set of nodes available for scheduling is called a zero-indegree set z, where z is a set of nodes whose incoming edges and corresponding predecessor nodes (indegree) have all been scheduled. Figure 5 demonstrates the recursion tree of the different topological ordering algorithms, where the height of the tree is the search step and every path from the root to a leaf is a topological ordering s ∈ ST. The figure highlights the redundant z in the recursive topological ordering, then merges these z to make them unique, identifying z as the signature for repetition and preventing the aforementioned re-computation. This makes the scheduling for each z a unique subproblem that constitutes
[Figure: example graph G and the recursion trees of the recursive topological ordering, with redundant zero-indegree sets z, and of the dynamic programming-based topological ordering, where unique zero-indegree sets are kept for memoization.]
Figure 5. Illustration of identifying redundant zero-indegree sets z and making z unique (squares) throughout the topological ordering algorithm to reduce re-computation.
[Figure: worked example on graph G of scheduling node u_8 = H at search step i = 8, starting from s_8 = A,B,C,D,E,F,I,J with z_8 = {G, H}: (1) H is scheduled and its output allocated, updating μ_peak,9 = max(μ_peak,8, μ_peak); (2) the outdegrees of D and E drop from 1 to 0, so they are deallocated, yielding μ_9.]
Figure 6. Visualization of scheduling the node u_8 = H during search step i = 8. Starting from s_8, μ_8, and μ_peak,8, the figure shows how the algorithm calculates s_9, μ_9, and μ_peak,9.
the dynamic programming-based topological ordering
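The state collapse that the signature buys can be illustrated on a toy DAG (the graph, names, and counts below are illustrative, not from the paper): many distinct prefixes of topological orderings share the same zero-indegree set, so memoizing per signature visits far fewer states than enumerating orderings.

```python
# Sketch: prefixes of topological orderings collapse under the
# zero-indegree signature. Graph and names are illustrative.
from itertools import permutations

# toy DAG: A feeds B, C, D (independent branches), which all feed E
succs = {"A": ["B", "C", "D"], "B": ["E"], "C": ["E"], "D": ["E"], "E": []}
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"], "E": ["B", "C", "D"]}

def zero_indegree(scheduled):
    """Nodes whose predecessors are all scheduled: the signature."""
    done = set(scheduled)
    return frozenset(v for v in succs
                     if v not in done and all(p in done for p in preds[v]))

# enumerate all topological orderings by filtering permutations (toy scale)
orderings = [p for p in permutations(succs)
             if all(set(preds[v]) <= set(p[:i]) for i, v in enumerate(p))]
prefixes = {p[:i] for p in orderings for i in range(len(succs) + 1)}
signatures = {zero_indegree(pre) for pre in prefixes}
print(len(orderings), len(prefixes), len(signatures))  # → 6 23 10
```

Even on five nodes, 23 distinct prefixes reduce to 10 signatures; on irregularly wired graphs with hundreds of nodes, this collapse is what makes the dynamic programming feasible at all.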
Integrating the peak memory footprint constraint. On top of the dynamic programming formulation, which shows potential for significantly reducing the search space, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint μ_{i+1} and its corresponding peak μ_peak,i+1 in each search step i to select the optimal path s*_{i+1} for memoization. Here we clarify the process of a search step, explaining the details of calculating μ_peak,i+1 and saving s_{i+1} for each search step i. In each search step, we start with the unique zero-indegree sets z_i (signatures) saved in the i-th entry of the memoization M_i. For each z_i, we have the schedule up to this point s_i, the sum of the activations in memory μ_i for the signature z_i, and the peak memory footprint of s_i, denoted μ_peak,i. Then, when we iterate over z_i to schedule a new node u_i, its output activation is appended to s_i to form s_{i+1} and is allocated in memory. The size of u_i, the product Π(u_i.shape), where shape is a property of the activation tensor that includes channels, height, width, and precision (e.g., byte, float), is added to μ_i, so μ_{i+1} ← μ_i + Π(u_i.shape). Then we use μ_{i+1} as μ_peak to update μ_peak,i+1 (the peak memory footprint for s_{i+1}). Since some predecessors of u_i will not be used anymore after allocating u_i, we update the outdegrees of the nodes by decrementing them. Having updated the outdegrees, we are left with a zero-outdegree set that denotes the nodes ready for deallocation. We deallocate the nodes in the set and update μ_{i+1} accordingly.
To demonstrate the scheduling of a node u_i, Figure 6 simulates scheduling u_8 = H in step i = 8. In the figure, (1) H is appended to s_8 and allocated in memory as it is scheduled, and the scheduler records the maximum of μ_{peak,8} and the sum of all activations in memory at this point as μ_{peak,9}. Then it recalculates the outdegrees of the predecessor nodes of H: D's and E's outdegrees are decremented from one to zero. (2) Then these nodes are deallocated, and the sum of the
Algorithm 1 Dynamic Programming-based Scheduling
 1: Input: graph G
 2: Output: optimal schedule s*
 3: # initialize memoization
 4: s_0 ← []; μ_0, μ_{peak,0} ← 0; z_0 ← zero-indegree(s_0, G)
 5: M_0[z_0] ← (s_0, μ_0, μ_{peak,0})
 6: # iterate search steps
 7: for i = 0 to n−1 do
 8:   # iterate (schedule, current memory, peak memory)
 9:   for z_i, (s_i, μ_i, μ_{peak,i}) in M_i do
10:     for u_i in z_i do
11:       s_{i+1} ← s_i.append(u_i)                    # allocate
12:       z_{i+1} ← zero-indegree(s_{i+1}, G)
13:       μ_{i+1}, μ_peak ← μ_i + ∏(u_i.shape)
14:       μ_{peak,i+1} ← max(μ_{peak,i}, μ_peak)
15:       for p_i in u_i.preds do
16:         if p_i in zero-outdegree(s_{i+1}, G) then
17:           μ_{i+1} ← μ_{i+1} − ∏(p_i.shape)         # deallocate
18:         end if
19:       end for
20:       # memoize schedule with least peak memory
21:       if μ_{peak,i+1} ≤ M_{i+1}[z_{i+1}].μ_{peak,i+1} then
22:         M_{i+1}[z_{i+1}] ← (s_{i+1}, μ_{i+1}, μ_{peak,i+1})
23:       end if
24:     end for
25:   end for
26: end for
27: s*, μ*_peak ← M_n[·].s_n, M_n[·].μ_{peak,n}        # solution
activation memory at this point is recorded as μ_9.
Finding the schedule with optimal peak memory footprint. After scheduling u_i, we save the new signature into M_{i+1} for the next search step i+1. Since the goal of this work is to minimize the overall μ_peak, we identify the corresponding optimal schedule s*_{i+1} for each z_{i+1} by saving only the s_{i+1} with the minimum μ_{peak,i+1}. We integrate the aforementioned steps of scheduling u_i and updating M_{i+1} to complete the proposed dynamic programming-based scheduling algorithm. Algorithm 1 summarizes the algorithm. As a first step, the algorithm initializes the memoization table M_0, then iterates over the search steps. In each search step i, the algorithm performs the above-illustrated memory allocation for all u_i in z_i and saves s_{i+1}, μ_{i+1}, and μ_{peak,i+1}. After iterating through all search steps up to n−1, s* is saved in M_n as a unique entry, for n the number of nodes in G. We provide the proof of the optimality of the peak memory footprint in the supplementary material.
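A minimal executable sketch of Algorithm 1 follows (assuming a dictionary-based graph representation; the signatures are the zero-indegree sets described above, and all names are illustrative rather than from the authors' code):

```python
from math import prod

def optimal_schedule(graph, shapes):
    """Dynamic-programming scheduler over zero-indegree signatures.
    graph: dict node -> list of successors; shapes: dict node -> shape tuple.
    Returns (schedule, peak memory). A sketch of Algorithm 1, not SERENITY's code."""
    preds = {v: [] for v in graph}
    for u, succs in graph.items():
        for v in succs:
            preds[v].append(u)
    n = len(graph)

    def zero_indegree(done):
        return frozenset(v for v in graph
                         if v not in done and all(p in done for p in preds[v]))

    # memoization: signature -> (schedule, current memory, peak memory)
    memo = {zero_indegree(frozenset()): ([], 0, 0)}
    for _ in range(n):
        nxt = {}
        for z, (s, mu, peak) in memo.items():
            done = frozenset(s)
            for u in z:
                mu1 = mu + prod(shapes[u])          # allocate u's output
                peak1 = max(peak, mu1)
                done1 = done | {u}
                for p in preds[u]:                  # deallocate exhausted preds
                    if all(c in done1 for c in graph[p]):
                        mu1 -= prod(shapes[p])
                z1 = zero_indegree(done1)
                if z1 not in nxt or peak1 < nxt[z1][2]:
                    nxt[z1] = (s + [u], mu1, peak1)  # keep least-peak schedule
        memo = nxt
    [(s, mu, peak)] = memo.values()                 # unique entry: empty signature
    return s, peak
```

As a usage example, two independent 10-unit-to-1-unit chains are best scheduled one chain at a time (peak 12 rather than 20, since finishing a chain frees its large input).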
Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V|×2^|V|), which is significantly faster than the exhaustive search of S_T with an upper-bound complexity of O(|V|!).
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Figure 7: Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).
Due to space limitations, we present the derivation of the algorithm's complexity in the supplementary material.
3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling
While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.
Divide-and-conquer. We can observe from Figure 1 that the topologies of irregularly wired neural networks are hourglass-shaped, because many NAS and Random Network Generators design cells with a single input and a single output, then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach reduces the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time decreases from |V|×2^|V| to |V|×2^{|V|/N}, so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.
As Figure 7 shows, this insight extends to our problem setting, where we can first schedule each cell and then merge those solutions to form the final solution. The first stage partitions the original graph G into multiple subgraphs g (divide). Then, exploiting the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule s_g (conquer). Considering that the number of nodes in a subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
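The quoted bound can be made concrete in a few lines (a toy calculation of the text's |V|×2^|V| versus |V|×2^{|V|/N} bounds; the function names are ours):

```python
def dp_bound(n: int) -> int:
    """Upper bound on DP scheduling work for an n-node graph: n * 2^n."""
    return n * 2 ** n

def partitioned_bound(n: int, parts: int) -> int:
    """Bound after divide-and-conquer into `parts` equal subgraphs,
    per the text: n * 2^(n/parts)."""
    return n * 2 ** (n // parts)

# A 20-node hourglass graph split into 2 cells: a 2^10 = 1024x reduction.
speedup = dp_bound(20) // partitioned_bound(20, 2)
```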
Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stages of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ*, while all other solutions can be discarded, setting some budget τ greater than or equal to μ* and pruning suboptimal schedules whose μ_peak exceeds τ focuses the search on a smaller search space S'_T ⊂ S_T while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increase τ) if a program fails due to stack overflow (= 'no solution' due to overly aggressive pruning) and selling off excess memory (decrease τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight in an adaptive soft budgeting scheme that cuts down the overall number of explored schedules while scheduling. Figure 8 illustrates the overall idea, first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).
Figure 8(a) depicts a certain point while scheduling G, where nodes G, H, F, and J can be scheduled. In particular, the
(a) While both paths s1 and s2 lead to the same z′, their μ and μ_peak vary, and we can prune schedules that yield a higher μ_peak than a given budget τ. Numbers next to boxes or circles are μ, and numbers next to edges are μ_peak.

(b) Adaptive soft budgeting starts by setting a hard budget τ_max as the maximum value for the soft budget τ. It then conducts a binary search for a τ that is high enough (above τ*) to find a solution, yet low enough that scheduling completes quickly.

Figure 8: Illustration of adaptive soft budgeting: (a) shows how schedules are pruned, and (b) illustrates how the soft budget τ relates to the number of explored schedules and the scheduling time.
figure compares two possible solutions, s1 and s2, which schedule H→F and F→H, respectively, given τ = 36. While s1 and s2 both start from z with μ = 32, scheduling H leads to μ_peak = 32+3 (H) = 35, whereas scheduling F or J leads to μ_peak = 32+6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 fail because their μ_peak = 38 exceeds 36. So, as long as we set the budget τ higher than μ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < μ* leads to no solution, because the optimal path would be pruned away.
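This walk-through's pruning rule amounts to a one-line budget check (an illustrative sketch; the helper name is ours):

```python
def try_extend(mu, out_size, tau):
    """Return the new peak if scheduling a node with output `out_size`
    stays within budget tau, else None (the path is pruned)."""
    peak = mu + out_size
    return peak if peak <= tau else None

# The Figure 8(a) walk-through: mu = 32, tau = 36
assert try_extend(32, 3, 36) == 35      # scheduling H survives
assert try_extend(32, 6, 36) is None    # scheduling F (or J) is pruned
```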
Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to μ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm, Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)), to adaptively get an idea of the range for τ. We use the peak memory footprint of this sequence as our hard budget τ_max, and in contrast we call the adaptively changing τ a soft budget. Since τ_max ≥ μ*, we know that any τ ≥ τ_max need not be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that first runs the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τ_new ← (τ_new + τ_old)/2) if it finds 'no solution' and decreases τ (τ_new ← τ_new/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as
Algorithm 2 Adaptive Soft Budgeting
 1: Input: graph G
 2: Output: optimal schedule s*
 3: τ_max ← μ(Kahn'sAlgorithm(G), G)                 # hard budget
 4: τ_old, τ_new ← τ_max
 5: flag ← 'no solution'
 6: repeat
 7:   # binary search for τ: decrease τ if 'timeout'
 8:   # and increase τ if 'no solution'
 9:   if flag is 'timeout' then
10:     # simultaneous
11:     τ_old ← τ_new; τ_new ← τ_new/2
12:   else if flag is 'no solution' then
13:     # simultaneous
14:     τ_old ← τ_new; τ_new ← (τ_new + τ_old)/2
15:   end if
16:   if flag is 'solution' then
17:     s* ← schedule                                # optimal schedule
18:   end if
19: until flag is 'solution'
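The meta-search of Algorithm 2 can be sketched as a driver around any budgeted scheduler (illustrative; `schedule_with_budget` is an assumed callback that reports 'solution', 'no solution', or 'timeout'):

```python
def adaptive_soft_budget(schedule_with_budget, tau_max):
    """Sketch of Algorithm 2's meta-search. `schedule_with_budget(tau)` returns
    ('solution', schedule), ('no solution', None), or ('timeout', None).
    Assumes the callback eventually reports a solution for some budget."""
    tau_old = tau_new = tau_max
    while True:
        flag, schedule = schedule_with_budget(tau_new)
        if flag == 'solution':
            return schedule, tau_new
        if flag == 'timeout':            # too little pruning: shrink the budget
            tau_old, tau_new = tau_new, tau_new / 2
        else:                            # 'no solution': over-pruned, grow back
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2

# A fake scheduler: budgets below 30 over-prune, budgets of 40+ time out.
def fake(tau):
    if tau < 30:
        return ('no solution', None)
    if tau >= 40:
        return ('timeout', None)
    return ('solution', 'sched')
```

Running `adaptive_soft_budget(fake, 100)` walks 100 → 50 → 25 → 37.5 and stops at the first workable budget.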
Figure 9: Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively. (x_i: i-th input; y: output; w_ij: j-th channel of the i-th kernel. Channel-wise partitioning reduces μ_peak from Σ size(x_i) + size(y) to max(size(x_i) + size(w_i ∗ x_i), ...); kernel-wise partitioning reduces it from Σ size(x_i) + size(y) to max(size(x_i) + size(y_i), ...).)
soon as it finds a schedule ('solution'), and this binary search method is guaranteed to work because the number of explored schedules increases monotonically with τ.
3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint
Reorganizing the computational graph of irregularly wired neural networks may lead to a significant reduction in the peak memory footprint μ_peak during computation. For example, a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of input activations and to expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase memory pressure, which is unfavorable especially in resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce μ_peak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.
Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint μ_peak occurs when the output y ∈ R^n is being computed while the concatenated branches of the input x ∈ R^n are also mandated to reside in memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat, so that each partitioned conv can be computed as soon as its input x_i becomes available.
Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates over and sums up the results of convolving channels in conv. However, using the distributive property of Σ_i and ∗, these transform into a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ x_i + y to max(w_i ∗ x_i) + y, which becomes more effective when there are more incoming edges to concat.
y = [Σ_i w_{1i} ∗ x_i, ..., Σ_i w_{mi} ∗ x_i]    (concat+conv)        (3)
  = Σ_i [w_{1i} ∗ x_i, ..., w_{mi} ∗ x_i]                             (4)
  = Σ_i [w_{1i}, ..., w_{mi}] ∗ x_i                                   (5)
  = Σ_i w_i ∗ x_i                                 (partial conv+add)  (6)
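Equations 3-6 can be checked numerically by modeling a 1×1 convolution as a matrix multiply (a sketch for verification only; the shapes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
# three input branches with 2, 3, and 4 channels; a 1x1 conv with 5 kernels
xs = [rng.standard_normal(c) for c in (2, 3, 4)]
ws = [rng.standard_normal((5, c)) for c in (2, 3, 4)]

# concat + conv: all branches must be live at once
y_concat = np.concatenate(ws, axis=1) @ np.concatenate(xs)

# partial conv + add: each branch is consumed as soon as it is ready
y_partial = sum(w @ x for w, x in zip(ws, xs))

assert np.allclose(y_concat, y_partial)
```

The block-matrix identity behind the assertion is exactly the distributivity used in Equations 4-5.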
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the concat+conv case above, the peak memory footprint μ_peak occurs when the concatenated x is in memory and the result y is additionally saved to memory before x is deallocated. This time, we leverage the independence among the different kernels to kernel-wise partition the depthconv that follows the concat, so that each input x_i is computed into a smaller feature map without residing in memory for too long. Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript indices) and is therefore viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce μ_peak significantly.
y = [w_1 ∗ x_1, ..., w_n ∗ x_n]         (concat+depthconv)           (7)
  = [[w_1 ∗ x_1], ..., [w_n ∗ x_n]]     (partial depthconv+concat)   (8)
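Similarly, Equations 7-8 can be verified with 1-tap depthwise kernels, where each channel is simply scaled by its own weight (a simplification for illustration; names and shapes are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
# two branches of channel activations (one row of 8 values per channel)
x1, x2 = rng.standard_normal((3, 8)), rng.standard_normal((2, 8))
# 1-tap depthwise kernels: one scalar weight per channel
w1 = rng.standard_normal(3)[:, None]
w2 = rng.standard_normal(2)[:, None]

# concat + depthconv: the full concatenated x is live while y is produced
y_concat = np.concatenate([w1, w2]) * np.concatenate([x1, x2])

# partial depthconv + concat: each branch is processed independently
y_partial = np.concatenate([w1 * x1, w2 * x2])

assert np.allclose(y_concat, y_partial)
```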
Implementation. Following the general practice of using pattern-matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph that can be substituted with operations of lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
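A pattern-matching rewrite of the concat→conv pattern might look as follows on a toy graph IR (a sketch under the assumption of a single-consumer concat; the `(op, inputs)` representation and all names are ours, not SERENITY's):

```python
def rewrite_concat_conv(graph):
    """Replace every concat -> conv pair with per-branch partial convs
    followed by an add. Graph IR: name -> (op, list of input names)."""
    out = dict(graph)
    for name, (op, inputs) in graph.items():
        if op != "conv" or len(inputs) != 1:
            continue
        src = inputs[0]
        if graph.get(src, ("", []))[0] != "concat":
            continue
        branches = graph[src][1]
        partials = []
        for i, b in enumerate(branches):
            pname = f"{name}_partial{i}"     # hypothetical naming scheme
            out[pname] = ("partial_conv", [b])
            partials.append(pname)
        out[name] = ("add", partials)        # matches Equation 6
        del out[src]                          # concat assumed single-consumer
    return out
```

Usage on a two-branch example:

```python
g = {"x1": ("input", []), "x2": ("input", []),
     "c": ("concat", ["x1", "x2"]), "y": ("conv", ["c"])}
r = rewrite_concat_conv(g)
```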
Table 1: Specification of the networks used for evaluation.

NETWORK   TYPE  DATASET   MAC     WEIGHT   TOP-1 ACCURACY
DARTS     NAS   ImageNet  574.0M  4.7M     73.3%
SwiftNet  NAS   HPD       57.4M   249.7K   95.1%
RandWire  RAND  CIFAR10   111.0M  1.2M     93.6%
RandWire  RAND  CIFAR100  160.0M  4.7M     74.5%
4 EVALUATION
We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google), using the same linear memory allocation scheme1 for both. Furthermore, we also examine the impact of this peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.
4.1 Methodology
Benchmarks and datasets. Table 1 lists the details of the networks, representative of irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, and only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet is a network from NAS targeting a human presence detection dataset. The RandWire networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (MAC), number of parameters (WEIGHT), and top-1 accuracy on their respective datasets.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
1. TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
Figure 10: Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy), across the cells of DARTS (ImageNet), SwiftNet (Human Presence), RandWire (CIFAR10), and RandWire (CIFAR100); higher is better. Dynamic programming + memory allocator achieves a geomean reduction of 1.68×, and adding graph rewriting raises it to 1.86×.
Figure 11: Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy), sweeping on-chip memory sizes of 32KB, 64KB, 128KB, and 256KB. NA marks cells whose footprint already fits on-chip; in several configurations only SERENITY fits on-chip, removing off-chip communication entirely.
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication, to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (NA in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's reduction of the memory footprint is also effective in reducing off-chip memory communication in systems with a memory hierarchy, and hence the power consumption and inference speed.
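Belady's clairvoyant policy over a known schedule can be simulated in a few lines (a unit-cost sketch counting only fetch traffic; the names and cost model are ours, not the paper's measurement setup):

```python
def belady_offchip_traffic(trace, sizes, capacity):
    """Count bytes fetched from off-chip under Belady's clairvoyant policy:
    on capacity pressure, evict the resident tensor whose next use is
    farthest in the future (or that is never used again)."""
    resident, traffic = set(), 0
    for i, t in enumerate(trace):
        if t not in resident:
            traffic += sizes[t]               # fetch from off-chip
            resident.add(t)
        while sum(sizes[r] for r in resident) > capacity:
            def next_use(r):
                for j in range(i + 1, len(trace)):
                    if trace[j] == r:
                        return j
                return len(trace)             # never used again: best victim
            victim = max((r for r in resident if r != t), key=next_use)
            resident.remove(victim)
    return traffic
```

For a trace `["a", "b", "a", "c", "a"]` with unit-size tensors and a 2-unit cache, the policy evicts `b` (never reused) rather than `a`, so every tensor is fetched exactly once.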
Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB; graph rewriting yields a 25.1KB reduction at the allocator level).

(b) Memory footprint without the memory allocator (graph rewriting yields a 12.5KB reduction in peak memory footprint).

Figure 12: Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrows denote the reductions).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB→250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB→225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB→188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speed-up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms, to demonstrate the speed-up from the divide-and-conquer and adaptive soft budgeting techniques. The table lists the different combinations of algorithms, the number of
Figure 13: Scheduling time evaluation for SERENITY across the evaluated cells (log scale): dynamic programming + memory allocator averages 40.6 secs, and with graph rewriting the mean is 48.8 secs.
Table 2: Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. NA denotes infeasible within practical time.

GRAPH REWRITING   ALGORITHM     NODES AND PARTITIONS   SCHEDULING TIME
✗                 ①             62 = 62                NA
✗                 ① + ②         62 = 21+19+22          56.5 secs
✗                 ① + ② + ③     62 = 21+19+22          37.9 secs
✓                 ①             92 = 92                NA
✓                 ① + ②         92 = 33+28+29          7.2 hours
✓                 ① + ② + ③     92 = 33+28+29          111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, additionally applying the ② divide-and-conquer (① + ②) leads to a measurable scheduling time: 56.5 secs and 7.2 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (① + ② + ③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks were mostly designed for the common regular patterns that have dominated deep learning almost from its conception. As such, these tools inherently had no incentive to deal with the forms of irregularity that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable its execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are various optimization techniques that aim to simplify the neural network along different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming, and adaptive soft budgeting to make the optimization tractable. Moreover, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.

Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.

Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.

Bellman, R. Dynamic programming. Science, 1966.

Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.

Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.

Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.

Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.

Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.

Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.

Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.

Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.
Gauen K Rangan R Mohan A Lu Y-H Liu W and BergA C Low-power image recognition challenge In ASP-DAC2017
Google TensorFlow Lite URL httpswwwtensorfloworgmobiletflite
Han S Pool J Tran J and Dally W Learning both weights andconnections for efficient neural network In NIPS 2015
Han S Liu X Mao H Pu J Pedram A Horowitz M A andDally W J EIE efficient inference engine on compressed deepneural network In ISCA 2016a
Han S Mao H and Dally W J Deep compression Compressingdeep neural networks with pruning trained quantization andhuffman coding In ICLR 2016b
He Y Lin J Liu Z Wang H Li L-J and Han S AMCAutoML for model compression and acceleration on mobiledevices In ECCV 2018
Held M and Karp R M A dynamic programming approach tosequencing problems Journal of the SIAM 1962
Howard A G Zhu M Chen B Kalenichenko DWang W Weyand T Andreetto M and Adam HMobileNets Efficient convolutional neural networksfor mobile vision applications arXiv 2017 URLhttpsarxivorgpdf170404861pdf
Jain A Phanishayee A Mars J Tang L and Pekhimenko GGist Efficient data encoding for deep neural network trainingIn ISCA 2018
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.
Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
Kahn, A. B. Topological sorting of large networks. CACM, 1962.
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.
Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.
Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator. 2017. URL https://developer.nvidia.com/tensorrt.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.
Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS

[Figure 14 comprises two scatter plots of top-1 ImageNet accuracy (65%-85%) for networks including Inception V1-V4, MobileNet, ShuffleNet, Xception, ResNet-152, ReNeXt-101, PolyNet, DPN-131, SENet, NASNet, AmoebaNet, and RandWire: (a) accuracy vs. number of multiply-and-accumulate operations (billions), and (b) accuracy vs. number of parameters (millions). In both plots, top left is better, and the irregularly wired neural networks (NASNet, AmoebaNet, RandWire) show better performance for the same amount of compute or the same number of parameters than the regular topology neural networks.]

Figure 14. ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE

In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.

[Figure 15 is a bar chart of the peak memory footprint in KB (0-2000, smaller is better) for each benchmark cell under three configurations. The bar values are:

                                                  DARTS    SwiftNet        RandWire    RandWire
                                                  ImageNet Human Presence  CIFAR10     CIFAR100
                                                  Normal   A    B    C     A    B      A    B    C
TensorFlow Lite                                   1656     552  194  70    645  330    605  350  160
Dynamic Programming + Memory Allocator            903      251  82   33    459  260    359  280  115
Dynamic Programming + Graph Rewriting
  + Memory Allocator                              753      226  72   20    459  260    359  280  115
]

Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption μ*, it is sufficient to keep just one schedule-peak memory pair (si, μpeak,i) in STi for each zero-indegree set zi, and to append subsequent nodes on top of si to get si+1 in each search step.

Proof. If i = 0, the optimal s0 is an empty sequence and μ0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) vi constitutes s*, substituting u*i ∈ zi, and achieves μ*. In such a case, let vi be replaced with the (optimal) u*i, which will result in μpeak ← min(μi + ∏ vi.shape, μi + ∏ u*i.shape), and μi+1 is calculated by deducting ∏ pi.shape for all pi ∈ (ui.preds ∩ zero-outdegree(si+1, G)). By recursively applying uk for the rest of the search steps k, the algorithm should find an alternative sequence s*′ with μ*′ ≤ μ* due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring ST and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we invent G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
[Figure 16 depicts such a graph G: entry node A, exit node Z, and independent single-node branches (B, C, D, ..., W, X, Y) between them.]

Figure 16. Topology of G to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there will be |V|−2 remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes (|V|−2)! and the overall complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step i would be i−1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 would explore |V|−2 nodes, and the search steps 2 to |V|−1 would iterate over |V|−1−i nodes. Summarizing, this would yield:
1 + 1×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 1 + C(|V|−2, 0)×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)×(|V|−2−i)
= 2 + (|V|−2)×2^{|V|−3}
≤ (|V|−2)×2^{|V|−2}  for |V| ≥ 4
≤ |V|×2^{|V|}
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V|×2^|V|). By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
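The step from the summation to the closed form, Σ_{i=0}^{m} C(m, i)×(m−i) = m×2^{m−1} with m = |V|−2, and the two upper bounds that follow can be checked numerically; a minimal sketch:

```python
from math import comb

# Check 2 + sum_{i=0}^{m} C(m, i) * (m - i) == 2 + (|V|-2) * 2^(|V|-3),
# with m = |V| - 2, plus the two inequalities from the derivation.
for V in range(4, 21):
    m = V - 2
    total = 2 + sum(comb(m, i) * (m - i) for i in range(m + 1))
    assert total == 2 + m * 2 ** (m - 1)   # closed form
    assert total <= m * 2 ** m             # <= (|V|-2) * 2^(|V|-2) for |V| >= 4
    assert m * 2 ** m <= V * 2 ** V        # <= |V| * 2^|V|
```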
[Figure 6 depicts a graph G with nodes A-L part way through scheduling. At search step i = 8, the scheduler holds s8 = A, B, C, D, E, F, I, J with signature z8 = {G, H} and picks u8 = H: (0) the initial state has D, E, F, I, J in activation memory; (1) H is scheduled and allocated, and μpeak,9 = max(μpeak,8, μpeak); (2) the outdegrees of D and E drop from 1 to 0, so they are deallocated, yielding s9 = A, B, C, D, E, F, I, J, H and μ9.]

Figure 6. Visualization of scheduling the node u8 = H during the search step i = 8. Starting from s8, μ8, and μpeak,8, the figure shows how the algorithm calculates s9, μ9, and μpeak,9.
the dynamic programming-based topological ordering
Integrating the peak memory footprint constraint. On top of the dynamic programming formulation, which shows potential for optimizing the search space significantly, we overlay the problem-specific constraints to achieve the optimal solution. In particular, we calculate the memory footprint μi+1 and its corresponding peak μpeak,i+1 in each search step i to select the optimal path s*i+1 for memoization. Here we clarify the process of a search step, explaining the details of calculating μpeak,i+1 and saving si+1 for each search step i. Each search step starts with a number of unique zero-indegree sets zi (signatures) saved in the i-th entry of the memoization table Mi. For each zi, we store the schedule up to this point si, the sum of activations in the memory μi for the signature zi, and the peak memory footprint of si, denoted μpeak,i. Therefore, in each search step i, we start with si, μi, and μpeak,i. Then, when we iterate over zi to schedule a new node ui, its output activation is appended to si to form si+1 and is allocated in the memory. The size of ui is the product (∏) of ui.shape, where shape is a property of the activation tensor that includes channels, height, width, and the precision (e.g., byte, float); it is added to μi, so μi+1 ← μi + ∏(ui.shape). Then we use μi+1 as μpeak to update μpeak,i+1 (the peak memory footprint for si+1). Since some predecessors of ui will not be used anymore after allocating ui, we update the outdegrees of those nodes by decrementing them. Having updated the outdegrees, we are left with a zero-outdegree set that denotes the nodes that are ready for deallocation. We deallocate the nodes in the set and update μi+1 accordingly.
To demonstrate the scheduling of a node ui, Figure 6 simulates scheduling a node u8 = H at i = 8. In the figure, (1) H is appended to s8 and allocated to memory as it is scheduled, and then the scheduler records the maximum of μpeak,8 and the sum of all activations in the memory at this point as μpeak,9. Then it recalculates the outdegrees of the predecessor nodes of H: D's and E's outdegrees are decremented from one to zero. (2) Then these nodes are deallocated and the sum of the
Algorithm 1 Dynamic Programming-based Scheduling
1: Input: graph G
2: Output: optimal schedule s*
3: # initialize memoization
4: s0 ← [], μ0, μpeak,0 ← 0, z0 ← zero-indegree(s0, G)
5: M0[z0] ← (s0, μ0, μpeak,0)
6: # iterate search steps
7: for i = 0 to n−1 do
8:   # iterate (schedule, current memory, peak memory)
9:   for zi, (si, μi, μpeak,i) in Mi do
10:    for ui in zi do
11:      si+1 ← si.append(ui)            # allocate
12:      zi+1 ← zero-indegree(si+1, G)
13:      μi+1, μpeak ← μi + ∏(ui.shape)
14:      μpeak,i+1 ← max(μpeak,i, μpeak)
15:      for pi in ui.preds do
16:        if pi is in zero-outdegree(si+1, G) then
17:          μi+1 ← μi+1 − ∏(pi.shape)   # deallocate
18:        end if
19:      end for
20:      # memoize schedule with least peak memory
21:      if μpeak,i+1 ≤ Mi+1[zi+1].μpeak,i+1 then
22:        Mi+1[zi+1] ← (si+1, μi+1, μpeak,i+1)
23:      end if
24:    end for
25:  end for
26: end for
27: s*, μ*peak ← Mn[·].sn, Mn[·].μpeak,n  # solution
activation memory here is recorded as μ9.
Finding the schedule with optimal peak memory footprint. After scheduling ui, we save the new signature into Mi+1 for the next search step i+1. Since the goal of this work is to minimize the overall μpeak, we identify the corresponding optimal schedule s*i+1 for each zi+1 by only saving the si+1 with the minimum μpeak,i+1. We integrate the aforementioned step of scheduling ui and updating Mi+1 to complete the proposed dynamic programming-based scheduling algorithm, which Algorithm 1 summarizes. As a first step, the algorithm starts by initializing the memoization table M0; then the algorithm iterates over the search steps. In each search step i, the algorithm performs the above illustrated memory allocation for all ui in zi, saving si+1, μi+1, and μpeak,i+1. After iterating all search steps up to n−1, s* is saved in Mn with a unique entry, for n being the number of nodes in G. We provide the proof for the optimality of the peak memory footprint in the supplementary material.
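Algorithm 1 can be condensed into a short Python sketch. This is a minimal illustration, not SERENITY's implementation: activation sizes are scalars standing in for ∏(u.shape), and a tensor is freed as soon as its last consumer has been scheduled.

```python
def dp_schedule(nodes, preds, size):
    """Sketch of Algorithm 1: memory-optimal topological ordering via DP.

    nodes: iterable of node ids; preds: dict node -> set of predecessor ids;
    size: dict node -> scalar activation size (stands in for prod(u.shape)).
    """
    succs = {u: set() for u in nodes}
    for u in nodes:
        for p in preds[u]:
            succs[p].add(u)

    def zero_indegree(done):
        return frozenset(u for u in nodes if u not in done and preds[u] <= done)

    # memoization: signature (zero-indegree set) -> (schedule, live mem, peak)
    memo = {zero_indegree(set()): ((), 0, 0)}
    for _ in nodes:                              # one node scheduled per step
        nxt = {}
        for z, (s, mu, peak) in memo.items():
            for u in z:
                s1 = s + (u,)                    # schedule u, allocate output
                done = set(s1)
                mu1 = mu + size[u]
                peak1 = max(peak, mu1)
                for p in preds[u]:               # deallocate dead predecessors
                    if succs[p] <= done:
                        mu1 -= size[p]
                z1 = zero_indegree(done)
                if z1 not in nxt or peak1 < nxt[z1][2]:
                    nxt[z1] = (s1, mu1, peak1)   # keep least-peak per signature
        memo = nxt
    (s_best, _, peak_best), = memo.values()      # single entry: all nodes done
    return list(s_best), peak_best
```

Keeping only the least-peak schedule per zero-indegree signature is precisely what bounds the state count by roughly 2^|V| instead of the |V|! of exhaustive topological ordering.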
Complexity of the algorithm. The complexity of the proposed dynamic programming-based scheduling is O(|V|×2^|V|), which is significantly faster than the exhaustive search of ST with an upper bound complexity of O(|V|!).
[Figure 7 shows a graph with nodes A-H being split into two subgraphs, g1 = {A, B, C, D} and g2 = {E, F, G, H}; each subgraph is scheduled separately into sg1 and sg2, which are then concatenated into the final schedule s.]

Figure 7. Illustration of divide-and-conquer, which divides the graphs into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).
Due to the space limitation, we present the derivation of the algorithm complexity in the supplementary material.

3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling

While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.
Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass shaped, because many NAS and Random Network Generators design cells with a single input and a single output, then stack them to form an hourglass shaped topology. Wilken et al. (2000) show that during general purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach reduces the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time will decrease from |V|×2^|V| to |V|×2^(|V|/N), so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.

As such, Figure 7 shows that this insight can be extended to our problem setting, where we first perform scheduling on each cell and merge those solutions together to form the final solution. The first stage partitions the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule sg (conquer). Considering that the number of nodes in a subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
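Plugging numbers into the bound above, for a hypothetical 32-node graph split into four equal single-input/single-output cells, shows the scale of the savings:

```python
# Hypothetical example: |V| = 32 nodes partitioned into N = 4 equal cells,
# using the upper bounds quoted in the text.
V, N = 32, 4

whole_graph = V * 2 ** V          # O(|V| x 2^|V|) without partitioning
partitioned = V * 2 ** (V // N)   # O(|V| x 2^(|V|/N)) with divide-and-conquer

speedup = whole_graph // partitioned
print(speedup)                    # 2^24 = 16777216: about seven orders of magnitude
```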
Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stage of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ* while all other solutions can be discarded, setting some budget τ that is greater or equal to μ* and pruning suboptimal schedules whose μpeak exceeds τ can focus the search on a smaller search space S′T ⊂ ST while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increase τ) if a program fails due to stack overflow (= 'no solution' due to an overly aggressive pruning) and selling off excess memory (decrease τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme that cuts down the overall number of explored schedules while scheduling. Figure 8 illustrates the overall idea, first showing how some schedules are pruned with regard to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).

Figure 8(a) depicts a certain point while scheduling G where nodes G, H, F, and J can be scheduled. In particular, the
[Figure 8(a) shows a graph G with nodes A-L, a partial schedule A, B, C, D, E, I with μ = 32, and three candidate continuations against a budget τ = 36: s1 schedules H (output activation size 3, μpeak = 35), while s2 schedules F and s3 schedules J (each of size 6, μpeak = 38).]

(a) While the paths s1 and s2 lead to the same z′, their μ and μpeak vary, and we can prune schedules that yield a higher μpeak than a given budget τ. Numbers next to a box or circle are μ, and numbers next to edges are μpeak.
[Figure 8(b) sketches the number of explored schedules (proportional to scheduling time) against the budget: too small a budget causes scheduling failure ('no solution'), too large a budget causes prohibitive scheduling time ('timeout'), and adaptive soft budgeting binary-searches for a soft budget τ between the optimal budget τ* and the hard budget τmax.]

(b) Adaptive soft budgeting starts by setting a hard budget τmax as the maximum value for the soft budget τ. It then conducts a binary search for a τ that is high enough (above τ*) to find a solution, yet not so high that scheduling fails to complete quickly.

Figure 8. Illustration of the adaptive soft budgeting: (a) shows how schedules are pruned, and (b) illustrates how the soft budget τ relates to the number of explored schedules.
figure compares two possible solutions, s1 and s2, which schedule H→F and F→H respectively, given τ = 36. While s1 and s2 both start from z with μ = 32, scheduling H leads to μpeak = 32+3 (H) = 35, whereas scheduling F or J leads to μpeak = 32+6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 will fail because their μpeak = 38 exceeds 36. So, as long as we set the budget τ higher than μ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < μ* leads to no solution, because the optimal path would be pruned away.

Having established the possibility of pruning, our question boils down to discovering a τ that is greater or equal to μ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)) to adaptively gain an idea of the range for τ. We use the peak memory footprint of this sequence as our hard budget τmax, and in contrast, we call the adaptively changing τ a soft budget. Since τmax ≥ μ*, we know that any τ ≥ τmax need not be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that runs the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τnew ← (τnew + τold)/2) if it finds 'no solution' and decreases τ (τnew ← τnew/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as
Algorithm 2 Adaptive Soft Budgeting
1: Input: graph G
2: Output: optimal schedule s*
3: τmax ← μ(KahnsAlgorithm(G), G)  # hard budget
4: τold, τnew ← τmax
5: flag ← 'no solution'
6: repeat
7:   # binary search for τ: decrease τ if 'timeout'
8:   # and increase τ if 'no solution'
9:   if flag is 'timeout' then
10:    # simultaneous
11:    τold ← τnew, τnew ← τnew/2
12:  else if flag is 'no solution' then
13:    # simultaneous
14:    τold ← τnew, τnew ← (τnew + τold)/2
15:  end if
16:  if flag is 'solution' then
17:    s* ← schedule  # optimal schedule
18:  end if
19: until flag is 'solution'
[Figure 9 diagrams both rewrites. Channel-wise partitioning turns a concat of x1, ..., xn followed by a conv with kernels w1, ..., wm (μpeak = Σ size(xi) + size(y)) into partial convs wi ∗ xi followed by an add (μpeak = max(size(xi) + size(wi ∗ xi), ...)). Kernel-wise partitioning turns a concat followed by a depthwise conv with kernels w1, ..., wn (μpeak = Σ size(xi) + size(y)) into partial depthwise convs followed by a concat (μpeak = max(size(xi) + size(y), ...)). Here xi denotes the i-th input, y the output, and wij the j-th channel of the i-th kernel.]

Figure 9. Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.
soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with τ.
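The meta-search of Algorithm 2 can be sketched as below. Here schedule_with_budget is a hypothetical callback standing in for one bounded run of the Algorithm 1 scheduler with pruning budget τ and per-step time limit T; it returns a flag of 'solution', 'no solution', or 'timeout'.

```python
def adaptive_soft_budget(schedule_with_budget, tau_max):
    """Sketch of Algorithm 2: binary search for a workable soft budget tau.

    tau_max is the hard budget from a Kahn's-algorithm schedule, so a
    solution with tau = tau_max is known to exist (modulo timeouts).
    """
    tau_old, tau_new = tau_max, tau_max
    while True:
        flag, schedule = schedule_with_budget(tau_new)
        if flag == 'solution':
            return schedule, tau_new
        if flag == 'timeout':              # too little pruning: shrink budget
            tau_old, tau_new = tau_new, tau_new / 2
        else:                              # 'no solution': over-pruned, grow it
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2
```

The two updates are the simultaneous assignments of Algorithm 2, so after a timeout the next 'no solution' moves τ halfway back toward the last value that was known to be too large.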
3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint

Reorganizing the computational graph of the irregularly wired neural networks may lead to a significant reduction in the peak memory footprint μpeak during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce μpeak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.
Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint μpeak occurs when the output y ∈ R^n is being computed while the concatenated branches of the input x ∈ R^n are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat so that the partitioned conv can be computed as soon as the input xi becomes available.
Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates and sums up the result of convolving channels in conv. However, using the distributive property of Σi and ∗, these transform to a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ xi + y to max(wi ∗ xi) + y, which becomes more effective when there are more incoming edges to concat.
y = [Σi w1i ∗ xi, ..., Σi wmi ∗ xi]     (concat + conv)            (3)
  = Σi [w1i ∗ xi, ..., wmi ∗ xi]                                   (4)
  = Σi [w1i, ..., wmi] ∗ xi                                        (5)
  = Σi [wi ∗ xi]                        (partial conv + add)       (6)
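As a quick numerical sanity check of Equations 3-6, the sketch below models a 1×1 convolution as a matrix multiply over the channel dimension; this simplification (along with the branch shapes chosen) is purely illustrative, since the identity holds for general convolutions by the same distributive argument.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical input branches: 3 and 5 channels, 4 spatial positions each.
x1, x2 = rng.standard_normal((3, 4)), rng.standard_normal((5, 4))
# A 1x1 conv with m = 2 output channels over the 8 concatenated channels,
# written as a matrix multiply.
w = rng.standard_normal((2, 8))

# concat + conv: both x1 and x2 must stay live while y is computed.
y_concat = w @ np.concatenate([x1, x2], axis=0)

# Channel-wise partitioned conv: each branch is consumed as soon as it arrives.
w1, w2 = w[:, :3], w[:, 3:]
y_partial = w1 @ x1 + w2 @ x2

assert np.allclose(y_concat, y_partial)
```

The partial results are accumulated with add, so at no point do all of the xi and y need to coexist in memory.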
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat+conv case, the peak memory footprint μpeak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time, we leverage the independence among different kernels to kernel-wise partition the depthconv that follows the concat so that each input xi is computed into smaller feature maps without residing in the memory too long. As such, Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce μpeak significantly.
y = [w1 ∗ x1, ..., wn ∗ xn]             (concat + depthconv)           (7)
  = [[w1 ∗ x1], ..., [wn ∗ xn]]         (partial depthconv + concat)   (8)
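Equations 7-8 can likewise be checked numerically. The sketch models each depthwise kernel as a single scalar weight per channel (a 1×1 depthwise conv) to keep the example small; the branch shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical input branches: 3 and 5 channels, 4 spatial positions each.
x1, x2 = rng.standard_normal((3, 4)), rng.standard_normal((5, 4))
# Depthwise 1x1 conv: one scalar weight per channel of the concatenated input.
w = rng.standard_normal((8, 1))

# concat + depthconv: the full x must be live before y is produced.
y_concat = w * np.concatenate([x1, x2], axis=0)

# Kernel-wise partitioned depthconv + concat: each branch is processed alone.
y_partial = np.concatenate([w[:3] * x1, w[3:] * x2], axis=0)

assert np.allclose(y_concat, y_partial)
```

Because depthconv touches each channel independently, swapping it with concat changes only when tensors are live, never what is computed.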
Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted by an operation with lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
Table 1. Specification of the networks used for evaluation.

NETWORK    TYPE  DATASET   # MAC    # WEIGHT  TOP-1 ACCURACY
DARTS      NAS   ImageNet  574.0M   4.7M      73.3%
SwiftNet   NAS   HPD       57.4M    249.7K    95.1%
RandWire   RAND  CIFAR10   111.0M   1.2M      93.6%
RandWire   RAND  CIFAR100  160.0M   4.7M      74.5%
4 EVALUATION
We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of this peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.

4.1 Methodology

Benchmarks and datasets. Table 1 lists the details of the networks used for evaluation, representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND): DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet: only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. The RandWire networks (Xie et al., 2019) are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (# MAC), number of parameters (# WEIGHT), and top-1 accuracy on their respective dataset.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY against TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
1 TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
[Figure 10: per-cell reduction in peak memory over TensorFlow Lite for DARTS (ImageNet), SwiftNet (Visual Wake Words), and RandWire (CIFAR10/CIFAR100); dynamic programming + memory allocator achieves a 1.68× geomean, and adding graph rewriting raises it to 1.86×; higher is better.]
Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).
[Figure 11: per-cell reduction in off-chip memory communication for on-chip capacities of 32KB, 64KB, 128KB, and 256KB; NA marks cells whose activations already fit on-chip, several configurations fit on-chip only with SERENITY, and in some cases SERENITY removes off-chip communication entirely; the geomean reduction reaches 1.76× at 256KB.]
Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint of irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, to measure the off-chip memory communication and distill the effects of the proposed scheduling. The results show that SERENITY can reduce off-chip memory communication by 1.76× for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (NA in the figure), there were also cases where SERENITY eradicated the off-chip communication by successfully containing the activations in on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort to reduce the memory footprint is also effective in reducing off-chip memory communication in systems with a memory hierarchy, and hence the power consumption and inference speed.
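To make the measurement concrete, the sketch below counts off-chip fetches under Belady's clairvoyant policy for a known access sequence; it assumes unit-size tensors and a capacity in slots, a simplification of the variable-size activations handled in the paper.

```python
def belady_misses(accesses, capacity):
    """Count off-chip fetches under Belady's clairvoyant policy: on
    eviction, drop the resident item reused farthest in the future.
    Items are unit-size for simplicity (real activations vary in size)."""
    resident, misses = set(), 0
    for i, x in enumerate(accesses):
        if x in resident:
            continue                      # on-chip hit: no off-chip traffic
        misses += 1                       # off-chip fetch
        if len(resident) >= capacity:
            def next_use(y):
                for j in range(i + 1, len(accesses)):
                    if accesses[j] == y:
                        return j
                return float("inf")       # never reused: ideal eviction victim
            resident.discard(max(resident, key=next_use))
        resident.add(x)
    return misses
```

Because the full schedule is known before execution, this offline-optimal policy isolates the effect of the schedule itself from the replacement policy.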
Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
[Figure 12(a): memory footprint (KB) over time with the memory allocator; graph rewriting cuts peak memory by 25.1KB (peak memory footprint of TensorFlow Lite = 551.0KB).]
[Figure 12(b): memory footprint (KB) over time without the memory allocator; graph rewriting cuts peak memory by 12.5KB.]
Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement in the peak memory footprint (551.0KB → 250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB → 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. To focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the live activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting further reduces the peak memory footprint by 12.5KB (200.7KB → 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
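The footprint curves of Figure 12(b) can be reproduced in spirit with a few lines: walk a schedule, allocate each node's output, and free an activation once all of its consumers have executed. The toy graph below is hypothetical, with sizes in arbitrary units.

```python
def peak_footprint(schedule, sizes, consumers):
    """Peak sum of live activation sizes along a schedule (no allocator:
    the footprint is the sum of activations alive at each step)."""
    remaining = {n: len(cs) for n, cs in consumers.items()}
    live, cur, peak = set(), 0, 0
    for n in schedule:
        cur += sizes[n]                  # allocate n's output
        peak = max(peak, cur)            # inputs and output coexist here
        live.add(n)
        for p, cs in consumers.items():
            if n in cs:
                remaining[p] -= 1
                if remaining[p] == 0 and p in live:
                    cur -= sizes[p]      # last consumer done: free p
                    live.discard(p)
    return peak

# Toy graph: C consumes A and B; X is independent and stays live once made.
sizes = {"A": 1, "B": 1, "C": 1, "X": 5}
consumers = {"A": ["C"], "B": ["C"], "C": [], "X": []}
late_x = peak_footprint(["A", "B", "C", "X"], sizes, consumers)   # X last
early_x = peak_footprint(["X", "A", "B", "C"], sizes, consumers)  # X first
```

Scheduling X last keeps the footprint lower, which is exactly the kind of ordering freedom the scheduler exploits.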
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without graph rewriting and 48.8 secs with it, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speedup from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms, to demonstrate the speedup from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of algorithms, the number of
[Figure 13: per-network scheduling time in seconds (log scale), ranging from 3.2s to 118.1s per cell; mean 40.6s for dynamic programming + memory allocator and 48.8s with graph rewriting added.]
Figure 13. Scheduling time evaluation for SERENITY.
Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. NA denotes infeasible within practical time.
GRAPH REWRITING  ALGORITHM  NODES AND PARTITIONS  SCHEDULING TIME
✗                ①          62 = 62               NA
✗                ①+②        62 = 21+19+22         565.3 secs
✗                ①+②+③      62 = 21+19+22         37.9 secs
✓                ①          92 = 92               NA
✓                ①+②        92 = 33+28+29         7.29 hours
✓                ①+②+③      92 = 33+28+29         111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time, regardless of graph rewriting. However, additionally applying the ② divide-and-conquer (①+②) leads to a measurable scheduling time: 565.3 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (①+②+③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are designed for the common regular patterns that have dominated deep learning almost from its conception. As such, these tools inherently had no incentive to deal with the form of irregularity that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class of networks that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there have been only limited studies on the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming, and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.

Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.

Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.

Bellman, R. Dynamic programming. Science, 1966.

Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.

Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.

Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.

Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.

Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.

Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.

Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.

Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.

Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.

Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.

Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.

Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.

Kahn, A. B. Topological sorting of large networks. CACM, 1962.

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo, D., Qin, Y., Schütze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.

Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.

Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.

Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.

NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.

Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.

Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.

Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.

Zhu, M. and Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
[Figure 14(a): top-1 ImageNet accuracy vs. multiply-and-accumulates (billions); (b): top-1 ImageNet accuracy vs. number of parameters (millions). In both panels, irregularly wired networks (RandWire, AmoebaNet, NASNet) sit above regular-topology networks (Inception, ResNet, MobileNet, ShuffleNet, SENet, etc.); top-left is better.]
Figure 14. ImageNet accuracy vs. number of multiply-and-accumulates or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
[Figure 15: absolute peak memory footprint (KB) per cell for TensorFlow Lite, dynamic programming + memory allocator, and dynamic programming + graph rewriting + memory allocator; e.g., the DARTS Normal Cell drops from 1656KB to 903KB and 753KB; smaller is better.]
Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING
Here we prove the optimality of the above dynamic programming-based scheduling algorithm.

THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption μ*, it is sufficient to keep just one schedule-peak memory pair (si, μi) in STi for each zero-indegree set zi, and to append subsequent nodes on top of si to get si+1 in each search step.
Proof. If i = 0, the optimal s0 is an empty sequence and μ0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) vi constitutes s*, substituting u*i ∈ zi, and achieves μ*. In such a case, let vi be replaced with the (optimal) u*i, which will result in μpeak ← min(μi + ∏ vi.shape, μi + ∏ u*i.shape), and μi+1 is calculated by deducting ∏ pi.shape, ∀pi ∈ (ui.preds ∩ zero-outdegree(si+1, G)). By recursively applying uk for the rest of the search steps k, the algorithm should find an alternative sequence s*′ with μ*′ ≤ μ*, due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. ∎
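A minimal executable rendering of this dynamic programming (keeping one best state per set of scheduled nodes, a proxy for the paper's zero-indegree-set signature) might look as follows; the four-node graph at the end is a hypothetical example, not one of the benchmark networks.

```python
def dp_schedule(preds, succs, sizes):
    """For each memoized state, keep only the best (peak, footprint,
    schedule) triple; suboptimal paths reaching the same state are pruned."""
    nodes = frozenset(preds)
    states = {frozenset(): (0, 0, ())}   # scheduled-set -> (peak, cur, sched)
    for _ in range(len(nodes)):
        nxt = {}
        for done, (peak, cur, sched) in states.items():
            for n in nodes - done:
                if not preds[n] <= done:
                    continue                       # n is not yet schedulable
                cur2, done2 = cur + sizes[n], done | {n}
                peak2 = max(peak, cur2)            # output + live inputs coexist
                for p in preds[n]:                 # free fully consumed inputs
                    if succs[p] <= done2:
                        cur2 -= sizes[p]
                best = nxt.get(done2)
                if best is None or (peak2, cur2) < best[:2]:
                    nxt[done2] = (peak2, cur2, sched + (n,))
        states = nxt
    peak, _, sched = min(states.values())
    return list(sched), peak

# Toy graph: C consumes A and B; X is an independent large node.
preds = {"A": set(), "B": set(), "C": {"A", "B"}, "X": set()}
succs = {"A": {"C"}, "B": {"C"}, "C": set(), "X": set()}
sizes = {"A": 1, "B": 1, "C": 1, "X": 5}
sched, mu_star = dp_schedule(preds, succs, sizes)
```

Per the theorem, discarding all but the best triple per state preserves the optimal peak while collapsing the factorially many orderings.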
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring ST with that of our dynamic programming-based scheduling. While both algorithms list candidate schedules and calculate their peak memory footprint, we count each peak memory footprint calculation as one operation when deriving the complexity. To visualize the analysis, we construct G in Figure 16 to demonstrate the upper-bound complexity of each algorithm: it has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
[Figure 16: graph G with a single entry node A, a single exit node Z, and independent branches (B, C, D, ..., W, X, Y) between them.]
Figure 16. Topology of G to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there are |V|−2 remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes (|V|−2)! and the overall
complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.
For the first and the last search steps, we assume that we have a single entry node and a single exit node. Since the number of nodes scheduled by search step i is i−1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step makes an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 explores |V|−2 nodes, and search steps 2 to |V|−1 iterate over |V|−1−i nodes each. Summarizing, this yields:
1 + 1×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 1 + C(|V|−2, 0)×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ... + C(|V|−2, |V|−2)×0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)×(|V|−2−i)
= 2 + (|V|−2)×2^(|V|−3)
≤ (|V|−2)×2^(|V|−2)   for |V| ≥ 4
≤ |V|×2^|V|
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V|×2^|V|). By using Stirling's approximation on the O(|V|!) complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm is significantly faster than the recursive topological ordering.
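The two bounds are easy to compare numerically; the helper names below are ours, and the counts are the upper bounds derived above, not measured schedule counts.

```python
from math import factorial

def exhaustive_bound(v):
    """Upper bound on schedules explored by recursive topological sorting
    on the graph of Figure 16: (|V| - 2)! orderings of the free nodes."""
    return factorial(v - 2)

def dp_bound(v):
    """Upper bound for the dynamic programming-based scheduler: |V| * 2^|V|."""
    return v * 2 ** v

# Ratio between the two bounds for a few graph sizes.
gap = {v: exhaustive_bound(v) // dp_bound(v) for v in (10, 20, 30)}
```

Already at |V| = 20 the exhaustive count exceeds the DP bound by roughly eight orders of magnitude, matching the Stirling-approximation argument.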
[Figure 7: an eight-node graph is divided into subgraphs g1 = {A, B, C, D} and g2 = {E, F, G, H}; each subgraph is scheduled independently, and the sub-schedules s_g1 and s_g2 are concatenated into the final schedule s.]
Figure 7. Illustration of divide-and-conquer, which divides the graph into multiple subgraphs (divide), schedules each of them using the optimal scheduler (conquer), then concatenates the sub-schedules to get the final schedule (combine).
Due to the space limitation, we present the derivation of the algorithm's complexity in the supplementary material.
3.2 Optimizing Scheduling Speed: Speeding up the Dynamic Programming-based Scheduling
While the above scheduling algorithm improves the complexity of the search, the search space may still be intractable due to the immense irregularity. Therefore, we devise divide-and-conquer and adaptive soft budgeting to accelerate the search by effectively shrinking and pruning the search space.
Divide-and-conquer. We can observe from Figure 1 that the topology of irregularly wired neural networks is hourglass shaped, because many NAS and Random Network Generators design cells with a single input and a single output and then stack them to form an hourglass-shaped topology. (Wilken et al., 2000) shows that, during general-purpose code scheduling, graphs can be partitioned (divide) into multiple subgraphs, and the corresponding solutions (conquer) can be concatenated (combine) to form an optimal solution for the overall problem. While the complexity of the scheduling algorithm remains the same, this divide-and-conquer approach reduces the number of nodes in each subproblem, speeding up the overall scheduling time. For instance, for a graph that can be partitioned into N equal subgraphs, the scheduling time decreases from |V|×2^|V| to |V|×2^(|V|/N), so we can speed up scheduling by multiple orders of magnitude compared to the naive approach, depending on the size of the graph and the number of partitions.
As such, Figure 7 shows that this insight can be extended to our problem setting, where we can first perform scheduling on each cell and merge those solutions to form the final solution. The first stage partitions the original graph G into multiple subgraphs g (divide). Then, utilizing the independence among the subgraphs, each subgraph g can be scheduled separately for its corresponding optimal schedule sg (conquer). Considering that the number of nodes in a subgraph g is much smaller than in the entire graph G, the scheduling time decreases significantly. Finally, the schedules of the subgraphs are concatenated to give the optimal schedule s* of the entire graph (combine).
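In code, the combine step is just concatenation of independently computed sub-schedules; `schedule_cell` stands in for the optimal scheduler of Section 3.1, and the cost helpers restate the |V|×2^|V| vs. |V|×2^(|V|/N) comparison (all names here are illustrative).

```python
def dnc_schedule(cells, schedule_cell):
    """Conquer each single-input/single-output cell independently, then
    combine by concatenating sub-schedules in the cells' stacking order."""
    final = []
    for cell in cells:
        final.extend(schedule_cell(cell))
    return final

def dp_cost(v):
    """Scheduling-time bound without partitioning: |V| * 2^|V|."""
    return v * 2 ** v

def dnc_cost(v, n):
    """Bound with N equal partitions: |V| * 2^(|V|/N)."""
    return v * 2 ** (v // n)

# Toy use: three cells, scheduled here by a stand-in scheduler (sorted).
combined = dnc_schedule([["b", "a"], ["d", "c"], ["e"]], sorted)
speedup = dp_cost(60) // dnc_cost(60, 3)
```

For a 60-node graph split into three equal cells, the bound drops by a factor of 2^40, which is the multiple-orders-of-magnitude speedup cited above.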
Adaptive soft budgeting. While the divide-and-conquer approach scales down the number of nodes, the algorithm may still not be fast enough due to its exponential complexity. Therefore, we explore avoiding suboptimal solutions during the early stages of scheduling without affecting the optimality of the original algorithm. Since our goal is to find a single solution that can run within a given memory budget τ* = μ*, while all other solutions can be discarded, setting some budget τ that is greater than or equal to μ* and pruning suboptimal schedules whose μpeak exceeds τ can focus the search on a smaller search space S′T ⊂ ST while still achieving the optimal schedule s*. On top of this, we develop a meta-search for τ. This is inspired by engineers buying a larger memory (increasing τ) if a program fails due to stack overflow (= 'no solution' due to overly aggressive pruning) and selling off excess memory (decreasing τ) if the current budget is prohibitive (= 'timeout' due to lack of pruning). SERENITY takes advantage of this insight to develop an adaptive soft budgeting scheme that cuts down the overall number of explored schedules while scheduling. Figure 8 illustrates the overall idea, first showing how some schedules are pruned with respect to a given budget τ in Figure 8(a), then the implication of different τ on scheduling time in Figure 8(b).
Figure 8(a) depicts a certain point while scheduling G, where nodes G, H, F, and J can be scheduled.
32
23
35
τ = 36
35 38 38J
s1
A
B C
D E F
H I
K
L
G
GGraph
Scheduled SchedulableXX For memoizationXX
ABCDEI hellip
Sear
ch S
tep
FHG
hellip
J
D 6 E 6 F 6J 6I 3H 3G 3
ABCDEIFH helliphellip
C 6
23 32s2
gt τ
z
z
s3
35
output activation size
(a) While both path s1 and s2 schedules lead to same zprime theirmicro and micropeak varies and we can prune schedules that yieldhigher micropeak than a given budget τ Numbers next to box orcircle are micro and numbers next to edges are micropeak
(b) Adaptive soft budgeting starts by setting a hard budget τmax as the maximum value for the soft budget τ. It then conducts a binary search for a τ higher than τ*, so that it finds a solution, yet not so high that scheduling fails to complete quickly.
Figure 8. Illustration of adaptive soft budgeting: (a) shows how schedules are pruned and (b) illustrates how the soft budget τ relates to the number of explored schedules.
figure compares two possible solutions, s1 and s2, which schedule H→F and F→H, respectively, given τ = 36. While s1 and s2 both start from z with µ = 32, scheduling H leads to µpeak = 32+3 (H) = 35, whereas scheduling F or J leads to µpeak = 32+6 (F or J) = 38. Therefore, since we assume τ = 36, s2 and s3 will fail because their µpeak = 38 exceeds 36. So, as long as we set the budget τ higher than µ*, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a τ < µ* leads to no solution, because the optimal path would be pruned away.
Having established the possibility of pruning, our question boils down to discovering a τ that is greater than or equal to µ*, which we call an optimal budget τ*, yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for τ, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) (O(|V|+|E|)) to adaptively gain an idea of the range for τ. We use the peak memory footprint of this sequence as our hard budget τmax, and, in contrast, we call the adaptively changing τ a soft budget. Since τmax ≥ µ*, we know that any τ ≥ τmax need not be explored. Having this upper bound for the search, adaptive soft budgeting implements a binary search that runs the scheduling algorithm with τ and T as input, where T is a hyperparameter that limits the scheduling time per search step. The binary search increases τ (τnew ← (τnew + τold)/2) if it finds 'no solution' and decreases τ (τnew ← τnew/2) if a search step returns 'timeout' (search step duration exceeds T). The binary search stops as
Algorithm 2 Adaptive Soft Budgeting
1: Input: graph G
2: Output: optimal schedule s*
3: τmax ← µ(Kahn'sAlgorithm(G), G)  ▷ hard budget
4: τold, τnew ← τmax
5: flag ← 'no solution'
6: repeat
7:   ▷ binary search for τ: decrease τ if 'timeout'
8:   ▷ and increase τ if 'no solution'
9:   if flag is 'timeout' then
10:    ▷ simultaneous:
11:    τold ← τnew; τnew ← τnew/2
12:  else if flag is 'no solution' then
13:    ▷ simultaneous:
14:    τold ← τnew; τnew ← (τnew + τold)/2
15:  end if
16:  if flag is 'solution' then
17:    s* ← schedule  ▷ optimal schedule
18:  end if
19: until flag is 'solution'
Figure 9. Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning reduce the memory cost of convolution and depthwise convolution, respectively. Channel-wise partitioning lowers µpeak = Σ size(xi) + size(y) to µpeak = max(size(xi) + size(wi ∗ xi)); kernel-wise partitioning lowers µpeak = Σ size(xi) + size(y) to µpeak = max(size(xi) + size(y)). (xi: i-th input; y: output; wij: j-th channel of the i-th kernel.)
soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work because the number of explored schedules increases monotonically with τ.
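A compact sketch of this meta-search, with Kahn's algorithm supplying the hard budget τmax and a hypothetical `scheduler(tau, deadline)` callback (our own interface, not SERENITY's) reporting 'solution', 'no solution', or 'timeout':

```python
import time
from collections import deque

def kahn_peak(graph, sizes):
    """Peak footprint of a plain Kahn's topological order: the hard budget."""
    indeg = {v: 0 for v in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    q = deque(v for v in graph if indeg[v] == 0)
    done, peak = set(), 0
    while q:
        u = q.popleft()
        # activations produced so far that still have a pending consumer
        live = sum(sizes[w] for w in done
                   if any(x not in done for x in graph[w]))
        peak = max(peak, live + sizes[u])
        done.add(u)
        for v in graph[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return peak

def adaptive_soft_budget(graph, sizes, scheduler, step_timeout=1.0):
    """Meta-search for a workable soft budget tau (cf. Algorithm 2)."""
    tau_old = tau_new = kahn_peak(graph, sizes)
    while True:
        flag, sched, peak = scheduler(tau_new, time.time() + step_timeout)
        if flag == 'solution':
            return sched, peak
        if flag == 'timeout':   # budget too loose: halve it
            tau_old, tau_new = tau_new, tau_new / 2
        else:                   # 'no solution': pruned too hard, back off
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2
```

The simultaneous assignments mirror lines 10-14 of Algorithm 2; the loop terminates once a search step reports a solution.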
3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint
Reorganizing the computational graph of irregularly wired neural networks may lead to a significant reduction in the peak memory footprint µpeak during computation. For example, a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) relies on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially in resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce µpeak around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.
Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: [·]) that takes multiple branches of the input prior to a convolution (conv: ∗). While executing such a pattern, the peak memory footprint µpeak occurs when the output y ∈ ℝⁿ is being computed while the concatenated branches of the input x ∈ ℝⁿ are also mandated to reside in memory. Our objective is to achieve the same arithmetic results and logical effect as concat yet sidestep the corresponding, seemingly excessive, memory cost. To this end, we channel-wise partition the conv that follows the concat, so that each partitioned conv can be computed as soon as its input xi becomes available.
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates over and sums up the results of convolving the channels in conv. However, using the distributive property of Σi and ∗, these transform into a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes the concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σi xi + y to maxi(wi ∗ xi) + y, which becomes more effective when there are more incoming edges to the concat.
y = [Σi w1i ∗ xi, ..., Σi wmi ∗ xi]   (concat + conv)        (3)
  = Σi [w1i ∗ xi, ..., wmi ∗ xi]                              (4)
  = Σi [w1i, ..., wmi] ∗ xi                                   (5)
  = Σi wi ∗ xi                        (partial conv + add)    (6)
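The identity in Equations 3-6 can be checked numerically. The toy below uses a 1×1 convolution (a per-pixel dot product over channels) so that conv reduces to plain sums; the shapes and values are invented for illustration:

```python
def conv1x1(w, x):
    """Toy 1x1 convolution: w is [kernels][channels], x is [channels][pixels]."""
    return [[sum(wk[i] * x[i][p] for i in range(len(x)))
             for p in range(len(x[0]))] for wk in w]

def add(a, b):
    """Element-wise sum of two feature maps."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

# two input branches and m = 2 kernels over their concatenated channels
x1 = [[1.0, 2.0], [3.0, 4.0]]             # branch 1: 2 channels, 2 pixels
x2 = [[5.0, 6.0]]                         # branch 2: 1 channel
w = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]  # kernels over all 3 channels

# concat + conv (Equation 3)
y_concat = conv1x1(w, x1 + x2)

# channel-wise partition: partial conv + add (Equation 6), no concat buffer
w1 = [wk[:2] for wk in w]  # kernel channels facing branch 1
w2 = [wk[2:] for wk in w]  # kernel channels facing branch 2
y_partial = add(conv1x1(w1, x1), conv1x1(w2, x2))

assert y_concat == y_partial  # identical arithmetic, lower peak footprint
```

Because each partial conv consumes only its own branch, x1 can be freed (and x2 need not yet exist) while the first partial result is accumulated.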
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation while achieving competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the concat+conv case above, the peak memory footprint µpeak occurs when the concatenated x is in memory and the result y is additionally saved to memory before x is deallocated. This time, we leverage the independence among different kernels to kernel-wise partition the depthconv that follows the concat, so that each input xi is computed into a smaller feature map without residing in memory too long. Equations 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript indices) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce µpeak significantly.
y = [w1 ∗ x1, ..., wn ∗ xn]          (concat + depthconv)            (7)
  = [[w1 ∗ x1], ..., [wn ∗ xn]]      (partial depthconv + concat)    (8)
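The commutation in Equations 7-8 admits the same kind of numerical check. Again a 1×1 depthwise kernel is used so each kernel is a single per-channel scale; the data is invented for illustration:

```python
def depthconv1x1(w, x):
    """Toy 1x1 depthwise convolution: kernel w[i] only touches channel x[i]."""
    return [[wi * v for v in xi] for wi, xi in zip(w, x)]

x1 = [[1.0, 2.0], [3.0, 4.0]]  # branch 1: 2 channels, 2 pixels
x2 = [[5.0, 6.0]]              # branch 2: 1 channel
w = [0.5, -1.0, 2.0]           # one kernel per concatenated channel

# concat + depthconv (Equation 7)
y_concat = depthconv1x1(w, x1 + x2)

# kernel-wise partition: partial depthconv + concat (Equation 8)
y_partial = depthconv1x1(w[:2], x1) + depthconv1x1(w[2:], x2)

assert y_concat == y_partial  # commuting concat and depthconv is an identity
```

Since each partial depthconv reads exactly one branch, the branches never need to be resident simultaneously before the (now purely logical) final concat.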
Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph that can be substituted with an operation of lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
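As an illustration of this pattern-matching style, a rewriter for the concat+conv pattern might look like the sketch below. The graph representation (a dict of `node -> (op, inputs)`) and the node naming are our own inventions, not the actual compiler's IR:

```python
def rewrite_concat_conv(graph):
    """Find concat -> conv chains and replace them with partial convs + add.

    `graph` maps node id -> (op, list of input ids). Returns a rewritten
    copy; the original graph is left untouched.
    """
    new = dict(graph)
    for nid, (op, ins) in graph.items():
        if op != 'conv' or len(ins) != 1:
            continue
        src = ins[0]
        if src not in graph or graph[src][0] != 'concat':
            continue  # pattern does not match
        branches = graph[src][1]
        partials = []
        for i, b in enumerate(branches):
            pid = f'{nid}_partial{i}'
            # conv restricted to the i-th channel slice of the kernel
            new[pid] = ('partial_conv', [b])
            partials.append(pid)
        new[nid] = ('add', partials)  # sum of the partial results (Eq. 6)
        new.pop(src, None)            # the concat buffer is no longer needed
    return new
```

Running the rewriter on a matching region removes the concat node and replaces the conv with an add over per-branch partial convs, which is exactly the substitution of Equations 3-6 at the graph level.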
Table 1. Specification of the networks used for evaluation.

NETWORK  | TYPE | DATASET  | MAC    | WEIGHT | TOP-1 ACCURACY
DARTS    | NAS  | ImageNet | 574.0M | 4.7M   | 73.3%
SwiftNet | NAS  | HPD      | 57.4M  | 249.7K | 95.1%
RandWire | RAND | CIFAR10  | 111.0M | 1.2M   | 93.6%
RandWire | RAND | CIFAR100 | 160.0M | 4.7M   | 74.5%
4 EVALUATION
We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google), using the same linear memory allocation scheme¹ for both. Furthermore, we examine the impact of this peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.
4.1 Methodology
Benchmarks and datasets. Table 1 lists the details of the networks, representative of irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND), used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, and only on the first cell, because it has the highest peak memory footprint; the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. The RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (MAC), number of parameters (WEIGHT), and top-1 accuracy on their respective datasets.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figure illustrates that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy); higher is better. Dynamic programming + memory allocator alone achieves a geomean 1.68× reduction (1.25-2.39× across benchmarks); adding graph rewriting raises this to a geomean 1.86× (1.25-3.45×).
Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy), sweeping 32KB, 64KB, 128KB, and 256KB on-chip memory configurations. NA marks cases where the network already fits on-chip; in several configurations only SERENITY fits on-chip, and in some SERENITY removes the off-chip communication entirely.
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint of irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, to measure the off-chip memory communication and distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (NA in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort to reduce the memory footprint is also effective in reducing off-chip memory communication in systems with a memory hierarchy, and hence power consumption and inference speed.
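A minimal model of this measurement, under a unit-cost fetch model with invented names (the actual experiments account for allocator behavior we omit here): since the schedule fixes the access sequence in advance, Belady's farthest-next-use eviction can be applied directly.

```python
def belady_offchip_traffic(accesses, sizes, capacity):
    """Off-chip traffic under Belady's clairvoyant policy: on a miss, evict
    the resident tensor whose next use lies farthest in the future. The
    access sequence is known in advance because the schedule is static."""
    resident, traffic = set(), 0
    for t, tid in enumerate(accesses):
        if tid in resident:
            continue  # on-chip hit
        traffic += sizes[tid]  # fetched from off-chip

        def next_use(r):
            for j in range(t + 1, len(accesses)):
                if accesses[j] == r:
                    return j
            return float('inf')  # never used again: the ideal victim

        # evict until the incoming tensor fits on-chip
        while resident and sum(sizes[r] for r in resident) + sizes[tid] > capacity:
            resident.remove(max(resident, key=next_use))
        resident.add(tid)
    return traffic
```

Because the policy is optimal for a fixed access sequence, differences in measured traffic isolate the effect of the schedule itself rather than the replacement heuristic.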
Improvement from the dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB); graph rewriting brings a 25.1KB reduction in peak memory footprint.
(b) Memory footprint without the memory allocator; graph rewriting brings a 12.5KB reduction in peak memory footprint.
Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB → 250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB → 225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB → 188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speedup from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms, to demonstrate the speedup from the divide-and-conquer and adaptive soft budgeting techniques. The table lists the different combinations of algorithms, the number of
Figure 13. Scheduling time evaluation for SERENITY (seconds). Per-benchmark scheduling times are 3.2, 5.7, 4.5, 27.8, 118.1, 15.1, 28.5, 74.4, and 87.9 secs (mean 40.6 secs) for dynamic programming + memory allocator, and 3.2, 42.1, 30.5, 39.3, 118.1, 15.1, 28.5, 74.4, and 87.9 secs (mean 48.8 secs) with graph rewriting added, for DARTS Normal, SwiftNet Cells A-C, RandWire (CIFAR10) Cells A-B, and RandWire (CIFAR100) Cells A-C, respectively.
Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. NA denotes infeasible within practical time.

GRAPH REWRITING | ALGORITHM | NODES AND PARTITIONS | SCHEDULING TIME
✗ | ①         | 62 = 62       | NA
✗ | ① + ②     | 62 = 21+19+22 | 56.5 secs
✗ | ① + ② + ③ | 62 = 21+19+22 | 37.9 secs
✓ | ①         | 92 = 92       | NA
✓ | ① + ②     | 92 = 33+28+29 | 7.2 hours
✓ | ① + ② + ③ | 92 = 33+28+29 | 111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of graph rewriting. However, additionally applying the ② divide-and-conquer (① + ②) leads to a measurable scheduling time: 56.53 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (① + ② + ③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019)
bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Moreover, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.
Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.
Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.
Bellman, R. Dynamic programming. Science, 1966.
Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.
Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.
Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.
Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.
Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.
Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.
Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.
Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.
Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.
Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.
Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
Kahn, A. B. Topological sorting of large networks. CACM, 1962.
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.
Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.
Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.
Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
Multiply-and-accumulate (Billions)
Top-
1 Im
ageN
et A
ccur
acy
()
85
65
70
75
80
200 10 30 40
DPN-131
Inception V1MobileNet
ShuffleNet
Inception V2
Inception V3Xception ResNet-152
SENet
AmoebaNet-A
ReNeXt-101PolyNetInception ResNet V2
Inception V4
NASNet-ANASNet-B
RandWire
AmoebaNet-A
AmoebaNet-B
RandWire
irregularly wired neural networksregular topology neural networks
irregularly wired neural networksshow better performance for
same amount of compute thanregular topology neural networks
top left means is better
(a) ImageNet accuracy vs number of multiply-and-accumulate
[Figure 14(b): scatter plot of top-1 ImageNet accuracy (%, 65 to 85) against number of parameters (millions, 0 to 140). Irregularly wired neural networks (RandWire, AmoebaNet-A/C, NASNet-A) again sit above and to the left of regular topology neural networks, showing better performance for the same number of parameters; top left is better.]

(b) ImageNet accuracy vs. number of parameters
Figure 14. ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
[Figure 15: bar chart of peak memory footprint (KB) for TensorFlow Lite, Dynamic Programming + Memory Allocator, and Dynamic Programming + Graph Rewriting + Memory Allocator on each cell of DARTS (ImageNet), SwiftNet (Visual Wake Words), RandWire (CIFAR10), and RandWire (CIFAR100); smaller is better.]
Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING
Here we prove the optimality of the above dynamic programming-based scheduling algorithm.
THEOREM 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i, \mu_i)$ in $ST_i$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.
Proof. If $i = 0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i \ge 1$, assume that (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^* \in z_i$, and achieves $\mu^*$. In such a case, let $v_i$ be replaced with (optimal) $u_i^*$, which will result in $\mu_{peak} \leftarrow \min\big(\mu_i + \prod v_i.\mathrm{shape},\ \mu_i + \prod u_i^*.\mathrm{shape}\big)$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\mathrm{shape}$, $\forall p_i \in \big(u_i.\mathrm{preds} \cap \mathrm{zero\text{-}outdegree}(s_{i+1}, G)\big)$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime} \le \mu^*$ due to the $\min$ operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. ∎
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring $ST$ and our dynamic programming-based scheduling. While both algorithms list candidate schedules and calculate their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we construct $\mathcal{G}$ in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
[Figure 16: the graph $\mathcal{G}$ has a single entry node A and a single exit node Z, with all other nodes (B, C, D, ..., W, X, Y) forming independent single-node branches between them.]

Figure 16. Topology of $\mathcal{G}$ to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores $ST$. Since there is a single entry node and a single exit node, there will be $|V|-2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes $(|V|-2)!$ and the overall
complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that gets memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step.
For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield:

$$1 + 1 \times (|V|-2) + \binom{|V|-2}{1} \times (|V|-3) + \dots + \binom{|V|-2}{|V|-2} \times 0 + 1$$
$$= 1 + \binom{|V|-2}{0} \times (|V|-2) + \binom{|V|-2}{1} \times (|V|-3) + \dots + \binom{|V|-2}{|V|-2} \times 0 + 1$$
$$= 2 + \sum_{i=0}^{|V|-2} \binom{|V|-2}{i} \times (|V|-2-i)$$
$$= 2 + (|V|-2) \times 2^{|V|-3}$$
$$\le (|V|-2) \times 2^{|V|-2} \quad \text{for } |V| \ge 4$$
$$\le |V| \times 2^{|V|}$$
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V| \times 2^{|V|})$. By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
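The step from the summation to the closed form uses the identity $\sum_{i=0}^{n}\binom{n}{i}(n-i) = n\,2^{n-1}$ with $n = |V|-2$; a small script (ours, not from the paper) can sanity-check both the identity and the stated bound:

```python
# Numerical check of the combinatorial identity used in the derivation above:
# sum_{i=0}^{n} C(n, i) * (n - i) = n * 2^(n - 1), with n = |V| - 2.
from math import comb

def dp_step_count(num_nodes):
    """Total explored candidates, following the summation in the text."""
    n = num_nodes - 2  # exclude the single entry and the single exit node
    return 2 + sum(comb(n, i) * (n - i) for i in range(n + 1))

for v in range(4, 12):
    n = v - 2
    assert dp_step_count(v) == 2 + n * 2 ** (n - 1)   # closed form
    assert dp_step_count(v) <= v * 2 ** v             # stated O(|V| * 2^|V|) bound
```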
The figure compares two possible solutions $s_1$ and $s_2$, which schedule H→F and F→H respectively, given $\tau = 36$. While $s_1$ and $s_2$ both start from $z$ with $\mu = 32$, scheduling H leads to $\mu_{peak} = 32 + 3\ (H) = 35$, whereas scheduling F or J leads to $\mu_{peak} = 32 + 6\ (F\text{ or }J) = 38$. Therefore, since we assume $\tau = 36$, $s_2$ and $s_3$ will fail because their $\mu_{peak} = 38$ exceeds 36. So, as long as we set the budget $\tau$ higher than $\mu^*$, the scheduler still finds a single optimal solution while avoiding many suboptimal paths. On the other hand, too small a $\tau < \mu^*$ leads to no solution because the optimal path would be pruned away.

Having established the possibility of pruning, our question boils down to discovering a $\tau$ that is greater than or equal to $\mu^*$ (which we call an optimal budget $\tau^*$), yet close enough to shrink the search space effectively. Figure 8(b) and Algorithm 2 summarize the proposed adaptive soft budgeting. Since we start with no information about the approximate range for $\tau$, we resort to a commonly used topological ordering algorithm called Kahn's algorithm (Kahn, 1962) ($O(|V|+|E|)$) to adaptively gain an idea of the range for $\tau$. We take the peak memory footprint of this sequence as our hard budget $\tau_{max}$; in contrast, we call the adaptively changing $\tau$ a soft budget. Since $\tau_{max} \ge \mu^*$, we know that any $\tau \ge \tau_{max}$ does not need to be explored. Having this upper bound for the search, adaptive soft budgeting performs a binary search that first runs the scheduling algorithm with $\tau$ and $T$ as input, where $T$ is a hyperparameter that limits the scheduling time per search step. The binary search increases $\tau$ ($\tau_{new} \leftarrow (\tau_{new}+\tau_{old})/2$) if it finds 'no solution' and decreases $\tau$ ($\tau_{new} \leftarrow \tau_{new}/2$) if a search step returns 'timeout' (its duration exceeds $T$). The binary search stops as
Algorithm 2 Adaptive Soft Budgeting
1:  Input: graph G
2:  Output: optimal schedule s*
3:  τmax ← μ(Kahn'sAlgorithm(G), G)           ▷ hard budget
4:  τold, τnew ← τmax
5:  flag ← 'no solution'
6:  repeat
7:      ▷ binary search for τ: decrease τ if 'timeout'
8:      ▷ and increase τ if 'no solution'
9:      if flag is 'timeout' then
10:         τold ← τnew, τnew ← τnew/2         ▷ simultaneous
11:     else if flag is 'no solution' then
12:         τold ← τnew, τnew ← (τnew+τold)/2  ▷ simultaneous
13:     end if
14:     if flag is 'solution' then
15:         s* ← schedule                      ▷ optimal schedule
16:     end if
17: until flag is 'solution'
[Figure 9: channel-wise partitioning rewrites concat + conv, reducing $\mu_{peak} = \sum_i size(x_i) + size(y)$ to $\mu_{peak} = \max_i\big(size(x_i) + size(w_i * x_i)\big)$ (partial conv + add); kernel-wise partitioning rewrites concat + depthwise conv, reducing $\mu_{peak} = \sum_i size(x_i) + size(y)$ to $\mu_{peak} = \max_i\big(size(x_i) + size(y)\big)$ (partial depthwise conv + concat). Here $x_i$ denotes the $i$-th input, $y$ the output, and $w_{ij}$ the $j$-th channel of the $i$-th kernel.]

Figure 9. Illustration of the graph rewriting patterns: channel-wise partitioning and kernel-wise partitioning can reduce the memory cost of convolution and depthwise convolution, respectively.
soon as it finds a schedule ('solution'), and this method using binary search is guaranteed to work due to the monotonically increasing number of explored schedules with $\tau$.
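A compact Python sketch of this search loop (our illustration; `schedule_with_budget` and `kahns_peak` are hypothetical stand-ins for the budgeted scheduler and for measuring the peak of a Kahn's-algorithm ordering) might look as follows:

```python
# Sketch of Algorithm 2 (adaptive soft budgeting). `schedule_with_budget`
# must return ('timeout', None), ('no solution', None), or ('solution', s)
# for a given budget tau.
def adaptive_soft_budget(graph, schedule_with_budget, kahns_peak):
    tau_max = kahns_peak(graph)   # hard budget: peak of a valid topological order
    tau_old = tau_new = tau_max
    while True:
        flag, sched = schedule_with_budget(graph, tau_new)
        if flag == 'solution':
            return sched, tau_new
        if flag == 'timeout':
            # budget too loose: too many paths survive pruning; halve it
            tau_old, tau_new = tau_new, tau_new / 2
        else:
            # 'no solution': the budget pruned the optimum; move back up
            # (simultaneous update, so the average uses the previous tau_old)
            tau_old, tau_new = tau_new, (tau_new + tau_old) / 2
```

With a toy scheduler that times out above 48, finds no solution below 35, and succeeds in between, the loop walks 100 → 50 → 25 → 37.5 and returns at 37.5.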
3.3 Identity Graph Rewriting: Improving the Search Space for Better Peak Memory Footprint
Reorganizing the computational graph of the irregularly wired neural networks may lead to a significant reduction in the peak memory footprint $\mu_{peak}$ during computation. For example, it is notable that a large stream of NAS-based works (Liu et al., 2019a; Zhang et al., 2019) rely on extensive use of concatenation as a natural approach to merge information from multiple branches of the input activations and expand the search space of the neural architectures. However, concatenation with many incoming edges may prolong the liveness of the input activations and increase the memory pressure, which is unfavorable especially for resource-constrained scenarios. To address this issue, we propose identity graph rewriting to effectively reduce $\mu_{peak}$ around the concatenation while keeping the arithmetic outputs identical. To this end, we present two main examples of graph patterns in irregularly wired neural networks that benefit from our technique.
Channel-wise partitioning (convolution). One typical pattern in irregularly wired neural networks is a concatenation (concat: $[\cdot]$) that takes multiple branches of the input prior to a convolution (conv: $*$). While executing such a pattern, the peak memory footprint $\mu_{peak}$ occurs when the output $y \in \mathbb{R}^n$ is being computed while the concatenated branches of the input $x \in \mathbb{R}^n$ are also mandated to reside in the memory. Our objective is to achieve the same arithmetic results and logical effect as concat, yet sidestep the corresponding seemingly excessive memory cost. To this end, we channel-wise partition the conv that follows the concat so that the partitioned conv can be computed as soon as the input $x_i$ becomes available.
Equations 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates and sums up the result of convolving channels in conv. However, using the distributive property of $\sum_i$ and $*$, these transform into a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from $\sum_i x_i + y$ to $\max_i(w_i * x_i) + y$, which becomes more effective when there are more incoming edges to concat.
$$y = \Big[\sum_i w_{1i} * x_i,\ \dots,\ \sum_i w_{mi} * x_i\Big] \qquad (\text{concat} + \text{conv}) \quad (3)$$
$$= \sum_i \big[w_{1i} * x_i,\ \dots,\ w_{mi} * x_i\big] \quad (4)$$
$$= \sum_i \big[w_{1i},\ \dots,\ w_{mi}\big] * x_i \quad (5)$$
$$= \sum_i \big[w_i * x_i\big] \qquad (\text{partial conv} + \text{add}) \quad (6)$$
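A quick numerical check of this identity (our own sketch, not the paper's code, using a 1×1 convolution so that conv reduces to a matrix product over channels; shapes are arbitrary):

```python
# Sanity check of Equations 3-6: a conv over a channel-wise concat equals
# the sum of "partial convs", one per branch.
import numpy as np

rng = np.random.default_rng(0)
branches = [rng.normal(size=(c, 8, 8)) for c in (3, 5, 2)]  # x_i: C_i x H x W
W = rng.normal(size=(4, 10))        # m=4 output channels, n=3+5+2 input channels

# concat + conv: y[k] = sum over all input channels c of W[k, c] * x[c]
x = np.concatenate(branches, axis=0)
y_concat = np.einsum('kc,chw->khw', W, x)

# partial conv + add: slice the kernel channel-wise and sum the partials
y_partial = np.zeros_like(y_concat)
start = 0
for xi in branches:
    Wi = W[:, start:start + xi.shape[0]]          # w_i in Equation 6
    y_partial += np.einsum('kc,chw->khw', Wi, xi)
    start += xi.shape[0]

assert np.allclose(y_concat, y_partial)
```

Each partial product needs only one branch resident at a time, which is exactly why the rewrite lowers the footprint when the branches arrive at different points in the schedule.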
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation yet achieves competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For a concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat+conv case, the peak memory footprint $\mu_{peak}$ occurs when the concatenated $x$ is inside the memory and the result $y$ additionally gets saved to the memory before $x$ is deallocated. This time, we leverage the independence among different kernels to kernel-wise partition the depthconv that follows the concat, so that each input $x_i$ is computed into smaller feature maps without residing in the memory too long. As such, Equations 7-8 derive this substitution. Equation 7 shows that every component in $y$ is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce $\mu_{peak}$ significantly.
$$y = \big[w_1 * x_1,\ \dots,\ w_n * x_n\big] \qquad (\text{concat} + \text{depthconv}) \quad (7)$$
$$= \big[[w_1 * x_1],\ \dots,\ [w_n * x_n]\big] \qquad (\text{partial depthconv} + \text{concat}) \quad (8)$$
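The same kind of check for Equations 7-8 (again our own sketch, with 1×1 depthwise kernels so that depthconv reduces to per-channel scaling):

```python
# Sanity check of Equations 7-8: a depthwise conv after concat equals the
# concat of per-branch "partial" depthwise convs.
import numpy as np

rng = np.random.default_rng(1)
branches = [rng.normal(size=(c, 8, 8)) for c in (3, 5, 2)]
w = rng.normal(size=(10,))          # one 1x1 kernel per channel, n = 10

# concat + depthconv: each output channel depends on one input channel only
x = np.concatenate(branches, axis=0)
y_concat = w[:, None, None] * x

# partial depthconv + concat: run each branch with its slice of the kernels
parts, start = [], 0
for xi in branches:
    wi = w[start:start + xi.shape[0]]
    parts.append(wi[:, None, None] * xi)
    start += xi.shape[0]
y_partial = np.concatenate(parts, axis=0)

assert np.allclose(y_concat, y_partial)
```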
Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted by an operation with lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
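As an illustration of such pattern-driven rewriting (on a toy IR of our own invention, not SERENITY's internal representation), a concat feeding a conv can be matched and replaced by per-branch partial convs plus an add:

```python
# Toy pattern-matching rewrite: match concat -> conv and substitute the pair
# with one partial_conv per branch followed by a single add.
def rewrite_concat_conv(nodes):
    """nodes: topologically ordered list of {'id', 'op', 'inputs': [ids]}."""
    by_id = {n['id']: n for n in nodes}
    rewritten_concats = set()
    out = []
    for n in nodes:
        if n['op'] == 'conv':
            producer = by_id.get(n['inputs'][0])
            if producer and producer['op'] == 'concat':
                rewritten_concats.add(producer['id'])
                pids = []
                for i, branch in enumerate(producer['inputs']):
                    pid = f"{n['id']}_part{i}"
                    out.append({'id': pid, 'op': 'partial_conv',
                                'inputs': [branch]})
                    pids.append(pid)
                out.append({'id': n['id'], 'op': 'add', 'inputs': pids})
                continue
        out.append(n)
    # drop concat nodes that were absorbed into partial convs
    return [n for n in out if n['id'] not in rewritten_concats]
```

The rewritten graph computes the same value (per Equations 3-6) but frees each branch as soon as its partial conv has consumed it.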
Table 1. Specification of the networks used for evaluation.

NETWORK    TYPE  DATASET   MAC      WEIGHT   TOP-1 ACCURACY
DARTS      NAS   ImageNet  574.0M   4.7M     73.3%
SwiftNet   NAS   HPD       57.4M    249.7K   95.1%
RandWire   RAND  CIFAR10   111.0M   1.2M     93.6%
RandWire   RAND  CIFAR100  160.0M   4.7M     74.5%
4 EVALUATION
We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.
4.1 Methodology
Benchmarks and datasets. Table 1 lists the details of the networks used for evaluation, representative of irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND): DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet, and only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (MAC), number of parameters (WEIGHT), and top-1 accuracy on their respective datasets.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figures illustrate that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
[Figure 10: Dynamic Programming + Memory Allocator reduces the peak memory footprint by 1.83× (DARTS Normal Cell), 2.20×, 2.39×, 2.09× (SwiftNet Cells A-C), 1.40×, 1.27× (RandWire CIFAR10 Cells A-B), and 1.68×, 1.25×, 1.39× (RandWire CIFAR100 Cells A-C), for a geomean of 1.68×; adding Graph Rewriting raises the first four to 2.20×, 2.44×, 2.70×, 3.45×, for a geomean of 1.86×. Higher is better.]

Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).
[Figure 11: reduction in off-chip memory communication for on-chip memory capacities of 32KB, 64KB, 128KB, and 256KB, with geomean reductions of 1.52×, 1.49×, 1.51×, and 1.76×, respectively. NA marks cells whose peak memory footprint already fits on-chip for both frameworks; in several other configurations only SERENITY fits on-chip, removing off-chip communication entirely.]

Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint for irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication, to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB of on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (NA in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort of reducing memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, hence the power consumption and inference speed.
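Belady's clairvoyant policy itself is simple to state: on a miss with a full cache, evict the block whose next use lies farthest in the future. A small sketch (ours, treating activations as equally sized blocks for simplicity, which is a departure from variable-size tensors):

```python
# Count off-chip fetches under Belady's optimal (clairvoyant) replacement
# for a fully known access sequence.
def belady_misses(accesses, capacity):
    cache, misses = set(), 0
    for t, a in enumerate(accesses):
        if a in cache:
            continue
        misses += 1                      # off-chip fetch
        if len(cache) >= capacity:
            # evict the resident block whose next use is farthest (or never)
            def next_use(b):
                for u, fut in enumerate(accesses[t + 1:]):
                    if fut == b:
                        return u
                return float('inf')
            cache.remove(max(cache, key=next_use))
        cache.add(a)
    return misses
```

Because the schedule fixes the access sequence in advance, this gives a scheduler-only lower bound on communication, independent of any online replacement heuristic.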
Improvement from dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
[Figure 12(a): memory footprint (KB) over time with the memory allocator, comparing Dynamic Programming + Memory Allocator against Dynamic Programming + Graph Rewriting + Memory Allocator; graph rewriting reduces the peak by 25.1KB (peak memory footprint of TensorFlow Lite = 551.0KB).]

[Figure 12(b): memory footprint (KB) over time without the memory allocator, comparing Dynamic Programming against Dynamic Programming + Graph Rewriting; graph rewriting reduces the peak by 12.5KB.]

Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB→250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB→225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. Then, it shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB→188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of algorithms, the number of
[Figure 13: per-cell scheduling times of 3.2s (DARTS Normal Cell) and 5.7s, 4.5s, 27.8s (SwiftNet Cells A-C) with Dynamic Programming + Memory Allocator, rising to 3.2s, 42.1s, 30.5s, 39.3s when Graph Rewriting is added; the RandWire cells take 118.1s, 15.1s, 28.5s, 74.4s, 87.9s in both configurations. The means are 40.6s and 48.8s, respectively.]

Figure 13. Scheduling time evaluation for SERENITY.
Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. NA denotes infeasible within practical time.
GRAPH REWRITING  ALGORITHM    NODES AND PARTITIONS  SCHEDULING TIME
without          ①            62 = 62               NA
without          ① + ②        62 = 21+19+22         565.3 secs
without          ① + ② + ③    62 = 21+19+22         37.9 secs
with             ①            92 = 92               NA
with             ① + ②        92 = 33+28+29         7.29 hours
with             ① + ② + ③    92 = 33+28+29         111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, additional application of the ② divide-and-conquer (① + ②) leads to a measurable scheduling time: 565.3 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (① + ② + ③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019)
bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention and aims to enable their execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there have been limited studies of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore, and are not concerned with, scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.
Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.
Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.
Bellman, R. Dynamic programming. Science, 1966.
Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.
Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.
Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.
Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.
Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.
Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.
Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.
Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.
Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.
Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.
Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune or not to prune: exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
[Figure: two scatter plots of Top-1 ImageNet accuracy (%), ranging 65-85%, against (a) multiply-and-accumulates (billions) and (b) number of parameters (millions), for networks including MobileNet, ShuffleNet, Inception V1-V4, Xception, ResNet-152, ResNeXt-101, PolyNet, Inception ResNet V2, DPN-131, SENet, NASNet, AmoebaNet, and RandWire; irregularly wired neural networks sit toward the top left (better) of regular topology neural networks]

(a) ImageNet accuracy vs. number of multiply-and-accumulates

(b) ImageNet accuracy vs. number of parameters
Figure 14. ImageNet accuracy vs. number of multiply-and-accumulates or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
[Figure: bar chart of peak memory footprint (KB) for TensorFlow Lite, Dynamic Programming + Memory Allocator, and Dynamic Programming + Graph Rewriting + Memory Allocator, for each cell of DARTS (ImageNet), SwiftNet (Human Presence), RandWire (CIFAR10), and RandWire (CIFAR100); smaller is better]
Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamic programming-based scheduling algorithm.
THEOREM 1. In order to find a schedule s* with an optimal peak memory consumption μ*, it is sufficient to keep just one schedule-peak memory pair (s_i, μ_i) in ST_i for each zero-indegree set z_i, and to append subsequent nodes on top of s_i to get s_{i+1} in each search step.
Proof. If i = 0, the optimal s_0 is an empty sequence and μ_0 must be 0. On the other hand, if i ≥ 1, assume that a (suboptimal) v_i constitutes s*, substituting u*_i ∈ z_i, and achieves μ*. In such a case, let v_i be replaced with the (optimal) u*_i, which will result in μ_peak ← min(μ_i + ∏ v_i.shape, μ_i + ∏ u*_i.shape), and μ_{i+1} is calculated by deducting ∏ p_i.shape, ∀p_i ∈ (u_i.preds ∩ zero-outdegree(s_{i+1}, G)). By recursively applying u_k for the rest of the search steps k, the algorithm should find an alternative sequence s*′ with μ*′ ≤ μ*, due to the min operator above, contradicting the original assumption on the optimality of s*. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.
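The scheduler the proof describes can be sketched in a few lines. The code below is our simplified illustration, not the paper's implementation: activation memory is a single counter, an output is allocated when its node executes and freed once all of its consumers have executed, the zero-indegree set serves as the memoization signature, and only one best (schedule, peak) pair is kept per signature, exactly as Theorem 1 licenses. The function name `optimal_schedule` is ours.

```python
def optimal_schedule(graph, size):
    """graph: dict node -> list of successors; size: dict node -> activation size.
    Returns (schedule, peak) minimizing the peak activation memory footprint."""
    preds = {n: set() for n in graph}
    for u, succs in graph.items():
        for v in succs:
            preds[v].add(u)
    # one best (peak, current_mem, schedule) per zero-indegree signature
    frontier = {frozenset(n for n in graph if not preds[n]): (0, 0, ())}
    for _ in range(len(graph)):              # each step schedules one more node
        nxt = {}
        for zero, (peak, mem, sched) in frontier.items():
            done = set(sched)
            for u in zero:
                mem_u = mem + size[u]        # allocate u's output
                peak_u = max(peak, mem_u)
                done_u = done | {u}
                # free every tensor whose consumers have now all executed
                mem_u -= sum(size[p] for p in preds[u] | {u}
                             if all(c in done_u for c in graph[p]))
                zero_u = frozenset(v for v in graph
                                   if v not in done_u and preds[v] <= done_u)
                if zero_u not in nxt or peak_u < nxt[zero_u][0]:
                    nxt[zero_u] = (peak_u, mem_u, sched + (u,))
        frontier = nxt
    peak, _, sched = frontier[frozenset()]
    return list(sched), peak
```

On a diamond graph A→{B,C}→D with activation sizes {A: 2, B: 10, C: 1, D: 1}, every valid order peaks at 13, which the sketch reproduces.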
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring ST and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we construct G in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
[Figure: topology of graph G with entry node A, exit node Z, and independent branches B, C, D, ..., W, X, Y between them]

Figure 16. Topology of G to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores ST. Since there is a single entry node and a single exit node, there will be |V|−2 remaining nodes, and these nodes can be scheduled independently of one another; thereby, the number of candidate schedules becomes (|V|−2)! and the overall complexity becomes O(|V|!), where |V| denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that gets memoized. Our memoization takes advantage of the zero-indegree sets z for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step i would be i−1, the maximum number of entries for memoization is C(|V|−2, i−1). On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's z. Therefore, search step 1 would explore |V|−2 nodes, and the search steps 2 to |V|−1 would iterate over |V|−1−i nodes. Summarizing, this would yield:
1 + 1×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ··· + C(|V|−2, |V|−2)×0 + 1
= 1 + C(|V|−2, 0)×(|V|−2) + C(|V|−2, 1)×(|V|−3) + ··· + C(|V|−2, |V|−2)×0 + 1
= 2 + Σ_{i=0}^{|V|−2} C(|V|−2, i)×(|V|−2−i)
= 2 + (|V|−2)×2^(|V|−3)
≤ (|V|−2)×2^(|V|−2)    for |V| ≥ 4
≤ |V|×2^|V|
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by O(|V|×2^|V|). By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
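The derivation above is easy to sanity-check numerically. The short script below (ours, not from the paper) compares the dynamic-programming candidate count against its closed form, the stated bound, and the factorial cost of exhaustive topological sorting on the graph of Figure 16.

```python
from math import comb, factorial

def exhaustive_candidates(v):
    # recursive topological sorting on G of Figure 16: (|V|-2)! orderings
    return factorial(v - 2)

def dp_candidates(v):
    # 2 + sum_{i=0}^{|V|-2} C(|V|-2, i) * (|V|-2-i)
    n = v - 2
    return 2 + sum(comb(n, i) * (n - i) for i in range(n + 1))

for v in range(4, 16):
    assert dp_candidates(v) == 2 + (v - 2) * 2 ** (v - 3)   # closed form
    assert dp_candidates(v) <= v * 2 ** v                    # O(|V| * 2^|V|) bound
    if v >= 8:
        assert dp_candidates(v) < exhaustive_candidates(v)   # DP wins as |V| grows
```

For very small graphs the factorial is still the smaller number; the exponential gap opens up quickly as |V| grows.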
Equation 3-6 detail the mathematical derivation of this substitution. Specifically, as shown in Equation 3, each kernel iterates and sums up the result of convolving channels in conv. However, using the distributive property of Σ_i and ∗, these transform to a summation of channel-wise partitioned convolutions, which we call partial conv. This partial conv removes concat from the graph, leading to lower memory cost. As illustrated in Figure 9, the memory cost of the same computation reduces from Σ x_i + y to max(w_i ∗ x_i) + y, which becomes more effective when there are more incoming edges to concat.
y = [Σ_i w_{1i} ∗ x_i, ..., Σ_i w_{mi} ∗ x_i]    (concat + conv)    (3)
  = Σ_i [w_{1i} ∗ x_i, ..., w_{mi} ∗ x_i]    (4)
  = Σ_i [w_{1i}, ..., w_{mi}] ∗ x_i    (5)
  = Σ_i [w_i ∗ x_i]    (partial conv + add)    (6)
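For intuition, the algebra in Equations 3-6 can be checked numerically. The sketch below is our illustration, not SERENITY code: it models a 1×1 convolution as a matrix multiply over channels, so conv(concat(x1, x2), W) becomes W·[x1; x2], and verifies that this equals the sum of channel-wise partial convolutions, in which the concatenated tensor never has to exist in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal((3, 16))   # 3-channel input, 16 spatial positions
x2 = rng.standard_normal((5, 16))   # 5-channel input
W = rng.standard_normal((4, 8))     # 4 kernels over the 8 concatenated channels

# concat + conv: the 8-channel concatenated tensor must be materialized
y_concat = W @ np.concatenate([x1, x2], axis=0)

# partial conv + add: partition W channel-wise; no concat is needed
y_partial = W[:, :3] @ x1 + W[:, 3:] @ x2

assert np.allclose(y_concat, y_partial)
```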
Kernel-wise partitioning (depthwise convolution). Depthwise convolution (depthconv) (Sifre & Mallat, 2014; Howard et al., 2017) has been shown to be effective in reducing computation yet achieves competitive performance, hence its wide use in networks that target extreme efficiency as their primary goal. For concatenation (concat) followed by a depthwise convolution (depthconv), similar to the above concat+conv case, the peak memory footprint μ_peak occurs when the concatenated x is inside the memory and the result y additionally gets saved to the memory before x is deallocated. This time we leverage the independence among different kernels to kernel-wise partition the depthconv that follows the concat, so that each input x_i is computed to smaller feature maps without residing in the memory too long. As such, Equation 7-8 derive this substitution. Equation 7 shows that every component in y is independent (different subscript index) and is viable for partitioning. In other words, this rewriting simply exposes the commutative property between depthconv and concat, plus kernel-wise partitioning, to reduce μ_peak significantly.
y = [w_1 ∗ x_1, ..., w_n ∗ x_n]    (concat + depthconv)    (7)
  = [[w_1 ∗ x_1], ..., [w_n ∗ x_n]]    (partial depthconv + concat)    (8)
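Equations 7-8 admit an equally direct check. In the sketch below (our illustration), a 1×1 depthwise convolution is modeled as a per-channel scale, so the kernel-wise partitioned form concatenates independently computed pieces instead of convolving one large, long-lived concatenated tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal((3, 16))
x2 = rng.standard_normal((5, 16))
w = rng.standard_normal((8, 1))     # one depthwise kernel per channel

# concat + depthconv: the concatenated 8-channel tensor stays live
y = w * np.concatenate([x1, x2], axis=0)

# partial depthconv + concat: each x_i is consumed independently
y_part = np.concatenate([w[:3] * x1, w[3:] * x2], axis=0)

assert np.allclose(y, y_part)
```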
Implementation. Following the general practice of using pattern matching algorithms in compilers (Lattner & Adve, 2004; Rotem et al., 2018; Jia et al., 2019), we implement identity graph rewriting using pattern matching to identify regions of the graph which can be substituted by an operation with lower computational cost. Likewise, we make use of this technique to identify regions that lead to lower memory cost.
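As a flavor of what such a pattern-matching pass looks like, the toy rewriter below operates on a hypothetical IR of our own devising (not SERENITY's): it scans a node map for a conv fed by a concat and rewrites it into partial convolutions followed by an add, per Equations 3-6. Channel-wise slicing of the weights is assumed to happen offline, and the generated `_part` names are ours.

```python
def rewrite_concat_conv(nodes):
    """nodes: dict name -> (op, input_names). Rewrites conv(concat(xs)) into
    add(partial_conv(x) for x in xs); the dead concat node is left behind
    for a later dead-code-elimination pass."""
    out = {}
    for name, (op, ins) in nodes.items():
        feeder = nodes.get(ins[0]) if ins else None
        if op == "conv" and feeder is not None and feeder[0] == "concat":
            parts = []
            for i, x in enumerate(feeder[1]):
                part = f"{name}_part{i}"          # generated partial-conv node
                out[part] = ("partial_conv", [x])
                parts.append(part)
            out[name] = ("add", parts)            # sum replaces the big conv
        else:
            out[name] = (op, ins)
    return out
```

Running it on `{"c": ("concat", ["x1", "x2"]), "y": ("conv", ["c"])}` turns `y` into an add over two partial convolutions, each consuming one input directly.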
Table 1. Specification of the networks used for evaluation.

NETWORK   TYPE  DATASET   # MAC   # WEIGHT  TOP-1 ACCURACY (%)
DARTS     NAS   ImageNet  574.0M  4.7M      73.3
SwiftNet  NAS   HPD       57.4M   249.7K    95.1
RandWire  RAND  CIFAR10   111.0M  1.2M      93.6
RandWire  RAND  CIFAR100  160.0M  4.7M      74.5
4 EVALUATION
We evaluate SERENITY with four representative irregularly wired neural network graphs. We first compare the peak memory footprint of SERENITY against TensorFlow Lite (Google) while using the same linear memory allocation scheme¹ for both. Furthermore, we also examine the impact of such peak memory footprint reduction on off-chip memory communication. We also conduct an in-depth analysis of the gains from the proposed dynamic programming-based scheduler and graph rewriting using SwiftNet Cell A (Zhang et al., 2019). Lastly, we study the impact of adaptive soft budgeting on the scheduling time.
4.1 Methodology
Benchmarks and datasets. Table 1 lists the details of the networks–representative of the irregularly wired neural networks from Neural Architecture Search (NAS) and Random Network Generators (RAND)–used for evaluation: DARTS (Liu et al., 2019a) for ImageNet, SwiftNet (Zhang et al., 2019) for a dataset comprised of human presence or absence (HPD), and RandWire (Xie et al., 2019) for CIFAR10 and CIFAR100. DARTS (Liu et al., 2019a) is a gradient-based NAS algorithm. In particular, we focus on the learned normal cell for image classification on ImageNet: only the first cell, because it has the highest peak memory footprint and the rest of the network is just repeated stacking of the same cell, following the practice in NASNet (Zoph et al., 2018). SwiftNet (Zhang et al., 2019) is a network from NAS targeting a human presence detection dataset. RandWire (Xie et al., 2019) networks are from Random Network Generators for image classification on CIFAR10 and CIFAR100. The table also lists their dataset, multiply-accumulate count (# MAC), number of parameters (# WEIGHT), and top-1 accuracy on their respective datasets.
4.2 Experimental Results
Comparison with TensorFlow Lite. Figure 10 evaluates SERENITY over TensorFlow Lite on different cells of the aforementioned networks in terms of reduction in memory footprint. The figures illustrate that SERENITY's dynamic programming-based scheduler reduces the memory footprint by a factor of 1.68× without any changes to the graph. In
¹TensorFlow Lite implements a linear memory allocator named simple memory arena: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/simple_memory_arena.cc
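To make the footnote concrete, the sketch below shows the general shape of such a linear (offset-based) allocator: given tensor sizes and lifetimes, it greedily assigns each tensor the lowest offset that does not overlap any already-placed tensor whose lifetime intersects. This is our simplification in the spirit of TF Lite's simple memory arena, not its actual code (which also handles alignment and incremental deallocation).

```python
def linear_allocate(tensors):
    """tensors: list of (size, first_use, last_use) per tensor index.
    Returns (offsets, arena_size): offsets[i] is tensor i's offset in the
    arena, arena_size is the total memory needed."""
    offsets = {}
    # place larger tensors first, a common greedy heuristic
    for i in sorted(range(len(tensors)), key=lambda i: -tensors[i][0]):
        size, first, last = tensors[i]
        # already-placed tensors whose lifetimes overlap this one
        live = sorted((offsets[j], tensors[j][0]) for j in offsets
                      if not (last < tensors[j][1] or tensors[j][2] < first))
        off = 0
        for o, s in live:
            if off + size <= o:
                break                  # fits in the gap below this neighbor
            off = max(off, o + s)      # otherwise slide above it
        offsets[i] = off
    arena = max((offsets[i] + tensors[i][0] for i in offsets), default=0)
    return offsets, arena
```

With lifetimes [(10, 0, 1), (5, 1, 2), (10, 2, 3)], the two size-10 tensors share offset 0 because their lifetimes are disjoint, and the size-5 tensor, which overlaps both, lands at offset 10 for an arena of 15.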
[Figure: bar chart of reduction in peak memory over TensorFlow Lite for each cell of DARTS (ImageNet), SwiftNet (Human Presence), RandWire (CIFAR10), and RandWire (CIFAR100); Dynamic Programming + Memory Allocator achieves a geomean of 1.68×, and Dynamic Programming + Graph Rewriting + Memory Allocator a geomean of 1.86×; higher is better]
Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy).
[Figure: bar chart of reduction in off-chip memory communication for on-chip memory sizes of 32KB, 64KB, 128KB, and 256KB across the benchmark cells; NA marks cells whose footprint already fits on-chip, and several cells fit on-chip only with SERENITY, which removes their off-chip communication entirely]
Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy).
addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields significant reduction in terms of the peak memory footprint for irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, for measuring the off-chip memory communication to distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (NA in figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in figure). This suggests that SERENITY's effort of reducing memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, hence the power consumption and inference speed.
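For reference, Belady's clairvoyant policy is straightforward to state in code. The sketch below is our simplification (unit-size cache entries rather than variable-size tensors): it counts off-chip fetches by always evicting the cached entry whose next use lies farthest in the future, which is optimal when the full access trace is known in advance.

```python
def belady_misses(trace, capacity):
    """Count off-chip fetches for an access trace under Belady's clairvoyant
    policy: evict the cached entry whose next use is farthest away (or never)."""
    cache, misses = set(), 0
    for i, t in enumerate(trace):
        if t in cache:
            continue                       # on-chip hit, no traffic
        misses += 1                        # off-chip fetch
        if len(cache) >= capacity:
            def next_use(c):
                for j in range(i + 1, len(trace)):
                    if trace[j] == c:
                        return j
                return float("inf")        # never reused: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(t)
    return misses

# classic textbook trace with a 3-slot cache: 5 fetches
assert belady_misses(list("abcabdadbc"), 3) == 5
```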
Improvement from dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
[Figure: memory footprint (KB) over time for Dynamic Programming + Memory Allocator and Dynamic Programming + Graph Rewriting + Memory Allocator; graph rewriting yields a 25.1KB reduction in peak memory footprint with the memory allocator]
(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB)
[Figure: memory footprint (KB) over time for Dynamic Programming and Dynamic Programming + Graph Rewriting; graph rewriting yields a 12.5KB reduction in peak memory footprint]
(b) Memory footprint without the memory allocator
Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings significant improvement to the peak memory footprint (551.0KB→250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB→225.8KB) by utilizing patterns that alleviate regions with large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. Then it shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB→188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes from graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
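The adaptive soft budgeting loop can be pictured roughly as follows. This is only our reading of the idea described in the text, with invented names and a simple multiplicative relaxation schedule, not the paper's exact algorithm: a budgeted search prunes any partial schedule whose running peak exceeds a soft budget, and the meta-search relaxes the budget whenever the search comes back empty.

```python
def schedule_with_soft_budget(budgeted_search, hard_cap, start_ratio=0.5,
                              grow=1.2):
    """`budgeted_search(budget)` (assumed interface) returns a schedule whose
    peak fits `budget`, or None if every path was pruned. Start from an
    aggressive soft budget so pruning keeps the search cheap, then relax."""
    budget = hard_cap * start_ratio
    while budget < hard_cap:
        schedule = budgeted_search(budget)
        if schedule is not None:
            return schedule
        budget *= grow                    # meta-search: relax the soft budget
    return budgeted_search(hard_cap)      # fall back to the device's hard cap
```

The tighter the soft budget, the more aggressively suboptimal paths are pruned; the loop simply trades a few cheap failed searches for one tractable successful one.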
Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of algorithms, number of
[Figure: scheduling time (seconds, log scale) per cell; mean of 40.6 secs for Dynamic Programming + Memory Allocator and 48.8 secs for Dynamic Programming + Graph Rewriting + Memory Allocator]
Figure 13 Scheduling time evaluation for SERENITY
Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. NA denotes infeasible within practical time.

GRAPH REWRITING  ALGORITHM  # NODES (PARTITIONS)  SCHEDULING TIME
✗                ①          62 = 62               NA
✗                ①+②        62 = 21+19+22         565.3 secs
✗                ①+②+③      62 = 21+19+22         37.9 secs
✓                ①          92 = 92               NA
✓                ①+②        92 = 33+28+29         72.9 hours
✓                ①+②+③      92 = 33+28+29         111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time, regardless of the graph rewriting. However, additional application of the ② divide-and-conquer (①+②) leads to a measurable scheduling time: 565.3 secs and 72.9 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (①+②+③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable their execution on memory constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural networks domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, still remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.
Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.
Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.
Bellman, R. Dynamic programming. Science, 1966.
Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.
Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.
Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.
Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.
Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.
Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.
Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.
Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.
Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.
Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.
Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
Kahn, A. B. Topological sorting of large networks. CACM, 1962.
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.
Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.
Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Plump, D. Term graph rewriting. In Handbook Of Graph Grammars And Computing By Graph Transformation: Volume 2: Applications, Languages and Tools. World Scientific, 1999.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser A and Geiszlig R Graph rewriting for hardware dependentprogram optimizations In AGTIVE 2007
Sharma H Park J Suda N Lai L Chau B Chandra V and Es-maeilzadeh H Bit Fusion Bit-level dynamically composable ar-chitecture for accelerating deep neural networks In ISCA 2018
Sifre L and Mallat S Rigid-motion scattering for imageclassification PhD dissertation 2014
Vasilache N Zinenko O Theodoridis T Goyal P DeVitoZ Moses W S Verdoolaege S Adams A and CohenA Tensor Comprehensions Framework-agnostic high-performance machine learning abstractions arXiv 2018 URLhttpsarxivorgpdf180204730pdf
Wang K Liu Z Lin Y Lin J and Han S HAQ Hardware-aware automated quantization with mixed precision In CVPR2019
Wilken K Liu J and Heffernan M Optimal instructionscheduling using integer programming In PLDI 2000
Wortsman M Farhadi A and Rastegari M Discovering neuralwirings In NeurIPS 2019
Wu C-J Brooks D Chen K Chen D Choudhury S DukhanM Hazelwood K Isaac E Jia Y Jia B et al Machinelearning at facebook Understanding inference at the edge InHPCA 2019
Xie S Kirillov A Girshick R and He K Exploring randomlywired neural networks for image recognition In ICCV 2019
Zhang T Yang Y Yan F Li S Teague H Chen Y et alSwiftnet Using graph propagation as meta-knowledge to searchhighly representative neural architectures arXiv 2019 URLhttpsarxivorgpdf190608305pdf
Zhou S Wu Y Ni Z Zhou X Wen H and Zou YDoReFa-Net Training low bitwidth convolutional neuralnetworks with low bitwidth gradients arXiv 2016 URLhttpsarxivorgpdf160606160pdf
Zhu M and Gupta S To prune or not to prune ex-ploring the efficacy of pruning for model compres-sion In ICLR Workshop 2018 URL httpsopenreviewnetforumid=S1lN69AT-
Zoph B and Le Q V Neural architecture searchwith reinforcement learning ICLR 2017 URLhttpsopenreviewnetforumid=r1Ue8Hcxg
Zoph B Vasudevan V Shlens J and Le Q V Learning transfer-able architectures for scalable image recognition In CVPR 2018
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
A COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
(a) ImageNet accuracy vs. number of multiply-and-accumulate operations (annotation: irregularly wired neural networks show better performance for the same amount of compute than regular topology neural networks; top left is better)

(b) ImageNet accuracy vs. number of parameters (annotation: irregularly wired neural networks show better performance for the same number of parameters than regular topology neural networks; top left is better)

Figure 14. ImageNet accuracy vs. number of multiply-and-accumulate operations or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
Figure 15. Peak memory footprint (KB) of running irregularly wired neural networks on SERENITY and TensorFlow Lite; smaller is better. Bars compare TensorFlow Lite, Dynamic Programming + Memory Allocator, and Dynamic Programming + Graph Rewriting + Memory Allocator over the DARTS (ImageNet), SwiftNet (Visual Wake Words), RandWire (CIFAR-10), and RandWire (CIFAR-100) cells.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING

Here we prove the optimality of the above dynamic programming-based scheduling algorithm.
THEOREM 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i, \mu_i)$ in $ST_i$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.

Proof. If $i = 0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i \ge 1$, assume that a (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^* \in z_i$, and achieves $\mu^*$. In such a case, let $v_i$ be replaced with the (optimal) $u_i^*$, which will result in $\mu_{peak} \leftarrow \min(\mu_i + \prod v_i.\text{shape},\; \mu_i + \prod u_i^*.\text{shape})$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\text{shape}$, $\forall p_i \in (u_i.\text{preds} \cap \text{zero-outdegree}(s_{i+1}, G))$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime} \le \mu^*$ due to the min operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption.
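The state-keeping argument of Theorem 1 can be illustrated with a small sketch. The following Python is not the paper's implementation: it keys the memo on the frozenset of already-scheduled nodes (the paper's zero-indegree-set signature is a more compact key for the same idea) and omits graph rewriting and soft budgeting. The names `succs` (node to its consumers) and `size` (node to its output activation size) are illustrative.

```python
def optimal_peak_schedule(succs, size):
    """Find a topological order minimizing peak activation footprint.

    Simplified sketch of the dynamic programming behind Theorem 1: per
    memoized state we keep only one best (peak, footprint, schedule) triple.
    succs: node -> list of consumer nodes (a DAG); size: node -> output size.
    """
    preds = {n: set() for n in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)
    nodes = set(succs)
    # scheduled set -> (peak, current footprint, schedule)
    best = {frozenset(): (0, 0, ())}
    for _ in range(len(nodes)):
        nxt = {}
        for done, (peak, foot, sched) in best.items():
            for n in nodes - done:
                if not preds[n] <= done:
                    continue            # n is not yet schedulable
                new_done = done | {n}
                f = foot + size[n]      # allocate n's output
                p = max(peak, f)
                for q in preds[n]:      # free inputs consumed for the last time
                    if all(c in new_done for c in succs[q]):
                        f -= size[q]
                key = frozenset(new_done)
                if key not in nxt or (p, f) < nxt[key][:2]:
                    nxt[key] = (p, f, sched + (n,))
        best = nxt
    peak, _, sched = best[frozenset(nodes)]
    return list(sched), peak
```

On two chains A→B→E and C→D→E with large chain heads, the sketch correctly finishes one chain (freeing its head) before starting the other, rather than interleaving.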
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF

We compare the complexity of exhaustively exploring $ST$ and our dynamic programming-based scheduling. While each algorithm both lists candidate schedules and calculates their peak memory footprint, we regard one peak memory footprint calculation as a single operation when deriving the complexity. In order to visualize the analysis, we construct $G$ in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
Figure 16. Topology of $G$ to demonstrate the upper bound complexity of each algorithm: a single entry node A and a single exit node Z, with the remaining nodes (B, C, D, ..., W, X, Y) forming independent branches between them.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores $ST$. Since there is a single entry node and a single exit node, there will be $|V| - 2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes $(|V|-2)!$ and the overall complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that gets memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step.

For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield:
\[
\begin{aligned}
&1 + 1\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1\\
&= 1 + \binom{|V|-2}{0}\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1\\
&= 2 + \sum_{i=0}^{|V|-2}\binom{|V|-2}{i}\times(|V|-2-i)\\
&= 2 + (|V|-2)\times 2^{|V|-3}\\
&\le (|V|-2)\times 2^{|V|-2} \quad\text{for } |V|\ge 4\\
&\le |V|\times 2^{|V|}
\end{aligned}
\]

As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V| \times 2^{|V|})$. By using Stirling's approximation on the $O(|V|!)$ complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological sorting.
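As a quick numerical sanity check on the derivation above, the closed form can be compared against the summation form; this snippet is illustrative only, with hypothetical function names, and uses Python's `math.comb` for the binomial coefficients.

```python
from math import comb

def dp_upper_bound_closed(V):
    # closed form from the derivation: 2 + (|V| - 2) * 2^(|V| - 3)
    return 2 + (V - 2) * 2 ** (V - 3)

def dp_upper_bound_sum(V):
    # summation form: 2 + sum_{i=0}^{|V|-2} C(|V|-2, i) * (|V| - 2 - i)
    return 2 + sum(comb(V - 2, i) * (V - 2 - i) for i in range(V - 1))

for V in range(4, 16):
    assert dp_upper_bound_closed(V) == dp_upper_bound_sum(V)
    # and the chain of bounds used in the derivation holds
    assert dp_upper_bound_closed(V) <= (V - 2) * 2 ** (V - 2) <= V * 2 ** V
```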
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Figure 10. Reduction in peak memory footprint of SERENITY against TensorFlow Lite (no memory hierarchy); higher is better. Across the DARTS (ImageNet), SwiftNet (Visual Wake Words), RandWire (CIFAR-10), and RandWire (CIFAR-100) cells, Dynamic Programming + Memory Allocator achieves a geometric mean reduction of 1.68×, and Dynamic Programming + Graph Rewriting + Memory Allocator of 1.86×.
Figure 11. Reduction in off-chip memory communication of SERENITY against TensorFlow Lite (with memory hierarchy), swept over 32KB, 64KB, 128KB, and 256KB on-chip memory configurations; higher is better. NA marks configurations whose peak memory footprint already fits on-chip; in several configurations only SERENITY fits on-chip, removing off-chip communication entirely.
In addition, the proposed graph rewriting technique yields an average of 1.86× (an extra 10.7%) reduction in terms of peak memory footprint. The results suggest that SERENITY yields a significant reduction in the peak memory footprint of irregularly wired neural networks.
Improvement in off-chip memory communication. We also show how SERENITY affects the off-chip memory communication, which largely affects both power and inference speed (Chen et al., 2016; Gao et al., 2017; Sharma et al., 2018). To this end, Figure 11 sweeps different on-chip memory configurations to measure the reduction in off-chip communication on systems with a multi-level memory hierarchy. Since we know the entire schedule a priori, we use Belady's optimal algorithm (Belady, 1966), also referred to as the clairvoyant algorithm, to measure the off-chip memory communication and distill the effects of the proposed scheduling. The results show that SERENITY can reduce the off-chip memory communication by 1.76× for a device with 256KB on-chip memory. In particular, while there were a few cases where the peak memory footprint was already small enough to fit on-chip (NA in the figure), there were some cases where SERENITY eradicated the off-chip communication by successfully containing the activations in the on-chip memory while TensorFlow Lite failed to do so (marked in the figure). This suggests that SERENITY's effort of reducing the memory footprint is also effective in reducing the off-chip memory communication in systems with a memory hierarchy, hence the power consumption and inference speed.
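Because the complete schedule is known a priori, the off-chip traffic can be measured with Belady's clairvoyant policy: on a capacity miss, evict the resident item whose next use lies farthest in the future. A minimal sketch under simplifying assumptions (unit-size entries, one transfer per miss; `trace` and `capacity` are illustrative names, not the paper's interface):

```python
def belady_offchip_accesses(trace, capacity):
    """Count misses (off-chip transfers) under Belady's clairvoyant policy.

    trace: known sequence of accesses; capacity: entries that fit on-chip.
    """
    cache, misses = set(), 0
    for i, x in enumerate(trace):
        if x in cache:
            continue                      # on-chip hit
        misses += 1                       # off-chip transfer
        if len(cache) >= capacity:
            def next_use(y):
                # position of y's next access, or infinity if never used again
                for j in range(i + 1, len(trace)):
                    if trace[j] == y:
                        return j
                return float('inf')
            # evict the resident item reused farthest in the future
            cache.discard(max(cache, key=next_use))
        cache.add(x)
    return misses
```

On the textbook reference string 1 2 3 4 1 2 5 1 2 3 4 5 with capacity 3, the clairvoyant policy incurs 7 misses, the known optimum for that trace.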
Improvement from dynamic programming-based scheduler and identity graph rewriting. To demonstrate where the improvement comes from, Figure 12 plots the memory footprint while running SwiftNet Cell A. Figure 12(a) shows the memory footprint of SERENITY with the memory
(a) Memory footprint with the memory allocator (peak memory footprint of TensorFlow Lite = 551.0KB); graph rewriting brings a 25.1KB reduction in peak memory footprint with the memory allocator.

(b) Memory footprint without the memory allocator; graph rewriting brings a 12.5KB reduction in peak memory footprint.

Figure 12. Memory footprint while running SwiftNet Cell A with and without the memory allocator (red arrow denotes reduction).
allocation. The figure shows that SERENITY's dynamic programming-based scheduler brings a significant improvement to the peak memory footprint (551.0KB→250.9KB), and the graph rewriting further improves this by 25.1KB (250.9KB→225.8KB) by utilizing patterns that alleviate regions with a large memory footprint. In order to focus on the effect of the scheduler and graph rewriting, Figure 12(b) presents the memory footprint of SERENITY without the memory allocation: the sum of the activations while running the network. The figure shows that the proposed scheduler finds a schedule with the optimal (minimum) peak memory footprint without changes to the graph. It then shows that the proposed graph rewriting can further reduce the peak memory footprint by 12.5KB (200.7KB→188.2KB). The results suggest that a significant portion of the improvement comes from the proposed dynamic programming-based scheduler and the graph rewriting.
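A footprint-over-time curve of this kind can be reproduced from a schedule alone, assuming each activation is freed right after its last consumer executes and ignoring allocator fragmentation. A small illustrative helper (`succs` and `size` are hypothetical names, not the paper's API):

```python
def footprint_profile(schedule, succs, size):
    """Transient activation footprint at each step of a given schedule.

    At each step the node's output is allocated while its inputs are still
    live; an input is freed once all of its consumers have executed.
    """
    preds = {n: [] for n in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)
    done, live, profile = set(), 0, []
    for n in schedule:
        live += size[n]          # output of n becomes live
        profile.append(live)     # footprint while n executes
        done.add(n)
        for q in preds[n]:       # free inputs whose consumers are all done
            if all(c in done for c in succs[q]):
                live -= size[q]
    return profile
```

The peak of the returned profile is the quantity the scheduler minimizes; plotting it for two schedules of the same graph visualizes the kind of gap shown for the dynamic programming schedule versus the rewritten graph.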
Scheduling time of SERENITY. Figure 13 summarizes the (static) scheduling time taken for SERENITY to schedule the networks. Results show that the average scheduling time is 40.6 secs without the graph rewriting and 48.8 secs with graph rewriting, where the difference comes from the increase in the number of nodes due to graph rewriting. The results show that all the above gains of SERENITY come at the cost of less than one minute of average extra compilation time. While the dynamic programming-based scheduling suffers from an exponential time complexity, SERENITY manages to make the scheduling tractable through the proposed divide-and-conquer and adaptive soft budgeting.
Speed up from divide-and-conquer and adaptive soft budgeting. Table 2 summarizes the scheduling time of SwiftNet (Zhang et al., 2019) for different algorithms to demonstrate the speed up from the divide-and-conquer and adaptive soft budgeting techniques. As such, the table lists different combinations of algorithms, the number of
Figure 13. Scheduling time evaluation for SERENITY across the benchmark cells: Dynamic Programming + Memory Allocator averages 40.6 secs, and Dynamic Programming + Graph Rewriting + Memory Allocator averages 48.8 secs.
Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. NA denotes infeasible within practical time.

GRAPH REWRITING | ALGORITHM | NODES AND PARTITIONS | SCHEDULING TIME
✗ | ① | 62 = 62 | NA
✗ | ① + ② | 62 = 21 + 19 + 22 | 56.5 secs
✗ | ① + ② + ③ | 62 = 21 + 19 + 22 | 37.9 secs
✓ | ① | 92 = 92 | NA
✓ | ① + ② | 92 = 33 + 28 + 29 | 7.2 hours
✓ | ① + ② + ③ | 92 = 33 + 28 + 29 | 111.9 secs
nodes, and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, additional application of the ② divide-and-conquer (① + ②) leads to a measurable scheduling time: 56.53 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (① + ② + ③) significantly reduces the scheduling time: 37.9 secs and 111.9 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning from almost its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming, and adaptive soft budgeting to make the optimization tractable. Moreover, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.

Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.

Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.

Bellman, R. Dynamic programming. Science, 1966.

Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.

Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.

Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.

Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.

Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.

Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.

Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.

Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.

Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.

Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.

Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.

Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.

Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.

Kahn, A. B. Topological sorting of large networks. CACM, 1962.

Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.

Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.

Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.

Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.

Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.

Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.

NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.

Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.

Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.

Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.

Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.

Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.

Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.

Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.

Zhu, M. and Gupta, S. To prune or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
A COMPARISON BETWEEN IRREGULARLYWIRED NEURAL NETWORKSAND CONVENTIONAL REGULARTOPOLOGY NEURAL NETWORKS
[Figure 14: two scatter plots of Top-1 ImageNet accuracy (%) versus (a) multiply-and-accumulate operations (billions) and (b) number of parameters (millions). In both panels, irregularly wired neural networks (e.g., RandWire, AmoebaNet, NASNet) sit toward the top left, above regular topology networks (e.g., MobileNet, ShuffleNet, Inception, Xception, ResNet-152, SENet, PolyNet, DPN-131); top left is better.]
Figure 14. ImageNet accuracy vs. number of multiply-and-accumulates or parameters, where irregularly wired neural networks show higher performance for the same amount of compute or number of parameters than regular topology neural networks.
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
[Figure 15: grouped bar chart of peak memory footprint (KB; smaller is better), with bars annotated with raw KB values, comparing TensorFlow Lite, Dynamic Programming + Memory Allocator, and Dynamic Programming + Graph Rewriting + Memory Allocator across the benchmark cells of DARTS (ImageNet), SwiftNet (Visual Wake Words), RandWire (CIFAR-10), and RandWire (CIFAR-100).]
Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING
Here we prove the optimality of the above dynamic programming-based scheduling algorithm.
Theorem 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i, \mu_i)$ in $S^T_i$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.
Proof. If $i = 0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i \ge 1$, assume that (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^* \in z_i$, and achieves $\mu^*$. In such case, let $v_i$ be replaced with (optimal) $u_i^*$, which will result in $\mu_{peak} \leftarrow \min\bigl(\mu_i + \prod v_i.\mathrm{shape},\ \mu_i + \prod u_i^*.\mathrm{shape}\bigr)$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\mathrm{shape}$, $\forall p_i \in \bigl(u_i.\mathrm{preds} \cap \text{zero-outdegree}(s_{i+1}, G)\bigr)$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime} \le \mu^*$ due to the min operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. $\blacksquare$
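The memoized search that the theorem justifies can be sketched as follows. This is a minimal illustration, not the authors' SERENITY implementation: the per-node `size` map and the assumption that each node produces a single tensor, freed once all of its consumers have run, are simplifications. For each set of already-scheduled nodes (which fixes both the zero-indegree frontier and the live activation memory), only the schedule with the lowest peak is kept:

```python
def min_peak_schedule(nodes, edges, size):
    """Exact memory-aware scheduling by memoized search over partial schedules."""
    preds = {n: set() for n in nodes}
    succs = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    def live(done):
        # activations already produced but still awaiting at least one consumer
        return sum(size[u] for u in done if succs[u] - done)

    best = {frozenset(): (0, ())}  # scheduled set -> (peak, schedule)
    for _ in nodes:
        step = {}
        for done, (peak, sched) in best.items():
            # zero-indegree set: unscheduled nodes whose predecessors all ran
            ready = [n for n in nodes if n not in done and preds[n] <= done]
            for u in ready:
                done2 = done | {u}
                # scheduling u materializes its output on top of live tensors
                cand = (max(peak, live(done) + size[u]), sched + (u,))
                if done2 not in step or cand[0] < step[done2][0]:
                    step[done2] = cand
        best = step
    return best[frozenset(nodes)]


# two parallel branches with large intermediates: interleaving them is costly
nodes = ["A", "B", "C", "B2", "C2", "Z"]
edges = [("A", "B"), ("A", "C"), ("B", "B2"),
         ("C", "C2"), ("B2", "Z"), ("C2", "Z")]
size = {"A": 1, "B": 8, "C": 8, "B2": 1, "C2": 1, "Z": 1}
peak, sched = min_peak_schedule(nodes, edges, size)  # peak == 10 here
```

On this toy graph, finishing one branch before starting the other keeps the two size-8 intermediates from being live simultaneously, which is exactly the kind of ordering effect the scheduler exploits.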
D COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring $S^T$ and our dynamic programming-based scheduling. While the algorithm both lists candidate schedules and calculates their peak memory footprint, we consider the peak memory footprint calculation as one operation while deriving the complexity. In order to visualize the analysis, we construct the graph $G$ in Figure 16 to demonstrate the upper bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
[Figure 16: a graph G with a single entry node A, a single exit node Z, and independent single-node branches B, C, D, ..., W, X, Y between them.]
Figure 16. Topology of G to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores $S^T$. Since there is a single entry node and a single exit node, there will be $|V| - 2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby, the number of candidate schedules becomes $(|V| - 2)!$ and the overall complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that gets memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step.
For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i - 1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V| - 2$ nodes, and the search steps 2 to $|V| - 1$ would iterate over $|V| - 1 - i$ nodes. Summarizing, this would yield:
$$
\begin{aligned}
&1 + 1\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \dots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 1 + \binom{|V|-2}{0}\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \dots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
&= 2 + \sum_{i=0}^{|V|-2}\binom{|V|-2}{i}\times(|V|-2-i) \\
&= 2 + (|V|-2)\times 2^{|V|-3} \\
&\le (|V|-2)\times 2^{|V|-2} \quad \text{for } |V|\ge 4 \\
&\le |V|\times 2^{|V|}
\end{aligned}
$$
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V| \times 2^{|V|})$. By using Stirling's approximation on the complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological ordering.
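The derivation above can be checked numerically. The sketch below (the name `dp_states_on_g` is illustrative, not from the paper) evaluates the sum term by term and confirms it matches the closed form and stays within the stated bound, while the exhaustive search's $(|V|-2)!$ schedule count overtakes it as $|V|$ grows:

```python
from math import comb, factorial

def dp_states_on_g(num_nodes):
    """Candidate count from the derivation for the graph G of Figure 16:
    2 + sum_{i=0}^{|V|-2} C(|V|-2, i) * (|V|-2-i)."""
    m = num_nodes - 2
    return 2 + sum(comb(m, i) * (m - i) for i in range(m + 1))

for v in range(4, 20):
    # matches the closed form 2 + (|V|-2) * 2^(|V|-3)
    assert dp_states_on_g(v) == 2 + (v - 2) * 2 ** (v - 3)
    # and stays within the stated bound |V| * 2^|V|
    assert dp_states_on_g(v) <= v * 2 ** v
# the exhaustive recursive topological sort explores (|V|-2)! schedules,
# which exceeds the dynamic-programming bound already at |V| = 12
assert factorial(12 - 2) > 12 * 2 ** 12
```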
[Figure 13: bar chart of scheduling time in seconds (log scale, 1 to 1000) for Dynamic Programming + Memory Allocator and Dynamic Programming + Graph Rewriting + Memory Allocator, across the benchmark cells of DARTS (ImageNet), SwiftNet (Visual Wake Words), RandWire (CIFAR-10), and RandWire (CIFAR-100), along with their mean.]
Figure 13. Scheduling time evaluation for SERENITY.
Table 2. Comparison of the scheduling time for different algorithms to schedule SwiftNet. ①, ②, and ③ represent dynamic programming, divide-and-conquer, and adaptive soft budgeting, respectively. N/A denotes infeasible within practical time.
GRAPH REWRITING   ALGORITHM      NODES AND PARTITIONS   SCHEDULING TIME
without           ①              62 = 62                N/A
without           ① + ②          62 = 21 + 19 + 22      56.53 secs
without           ① + ② + ③      62 = 21 + 19 + 22      37.9 secs
with              ①              92 = 92                N/A
with              ① + ②          92 = 33 + 28 + 29      7.29 hours
with              ① + ② + ③      92 = 33 + 28 + 29      11.19 secs
nodes and the corresponding scheduling time. A straightforward implementation of the aforementioned ① dynamic programming-based scheduling leads to an immeasurably large scheduling time regardless of the graph rewriting. However, additional application of the ② divide-and-conquer (① + ②) leads to a measurable scheduling time: 56.53 secs and 7.29 hours to schedule without and with the graph rewriting, respectively. Furthermore, we observe that further applying ③ adaptive soft budgeting (① + ② + ③) significantly reduces the scheduling time: 37.9 secs and 11.19 secs to schedule without and with the graph rewriting, respectively. The above results indicate that applying the proposed algorithms leads to a scheduling time of practical utility.
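The paper describes adaptive soft budgeting as a lightweight meta-search for a memory budget that prunes suboptimal paths during the dynamic programming. One plausible way to realize such a meta-search (the actual policy in the paper may differ) is a bisection over candidate budgets; `run_scheduler` is an assumed callable that runs the budget-pruned pass and returns `None` when every partial schedule was pruned:

```python
def adaptive_budget_search(run_scheduler, lo, hi):
    """Bisect over candidate memory budgets (e.g., in KB).

    `run_scheduler(budget)` is assumed to run a budget-pruned scheduling
    pass and return a schedule, or None when the budget was infeasible.
    Returns the schedule found for the tightest feasible budget probed.
    """
    best = None
    while hi - lo > 1:
        mid = (lo + hi) // 2
        result = run_scheduler(mid)
        if result is None:
            lo = mid                 # budget too tight: relax it
        else:
            best, hi = result, mid   # feasible: try a tighter budget
    return best


# toy stand-in: schedules exist iff the budget is at least 350 KB
run = lambda budget: ("schedule", budget) if budget >= 350 else None
best = adaptive_budget_search(run, 0, 1024)  # converges to the 350 KB budget
```

Pruning against a budget keeps the dynamic programming from expanding partial schedules that already exceed the cap, which is what makes the combined ① + ② + ③ configuration tractable.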
5 RELATED WORKS
The prevalence of neural networks has led to the development of several compilation frameworks for deep learning (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018). However, even industry-grade tools mostly focus on tiling and fine-grained scheduling of micro-operations on conventional hardware (NVIDIA, 2017; Google) or accelerators (Chen et al., 2016; 2014; Han et al., 2016a; Judd et al., 2016; Jouppi et al., 2017; Gao et al., 2017; Parashar et al., 2017; Sharma et al., 2018; Fowers et al., 2018). These frameworks are mostly designed for the common regular patterns that have dominated deep learning almost from its conception. As such, these tools inherently had no incentive to deal with the form of irregularities that the emerging NAS (Zoph & Le, 2017; Cortes et al., 2017; Zoph et al., 2018; Liu et al., 2019a; Cai et al., 2019; Real et al., 2019; Zhang et al., 2019) and Random Networks (Xie et al., 2019; Wortsman et al., 2019) bring about. This paper, in contrast, focuses on this emergent class that breaks the regularity convention, and aims to enable their execution on memory-constrained edge devices.
Scheduling and tiling for neural networks. While prior works on scheduling (Lee et al., 2003; Keßler & Bednarski, 2001; Wilken et al., 2000) focus on classical computing workloads, there has been limited study of the implications of scheduling in the neural network domain. There is also a significant body of work on scheduling operations on hardware accelerators (Abdelfattah et al., 2018) that also considers tiling (Chen et al., 2018; Vasilache et al., 2018; Liu et al., 2019b; Ahn et al., 2020). However, graph scheduling for irregularly wired neural networks, especially with memory constraints, is an emerging problem, which is the focus of this paper.
Graph rewriting for neural networks. It has been a common practice to rewrite parts of the graph using rule-based (Abadi et al., 2016; Paszke et al., 2019; Rotem et al., 2018; Cyphers et al., 2018; NVIDIA, 2017) or systematic approaches to expose parallelism and make models more target-aware (Jia et al., 2018; 2019; Schosser & Geiß, 2007). While these approaches may alleviate the complexity of the graph and reduce the peak memory footprint as a side effect, these frameworks do not explore and are not concerned with scheduling. Our work exclusively explores graph rewriting in the context of improving the peak memory footprint.
Optimizing neural networks. There are different optimization techniques that aim to simplify the neural network in different dimensions. Sparsification/compression (LeCun et al., 1990; Han et al., 2015; Zhu & Gupta, 2018; Anwar et al., 2017), quantization (Han et al., 2016b; Courbariaux et al., 2016; Zhou et al., 2016; Mishra & Marr, 2018; Esser et al., 2020), activation compression (Jain et al., 2018), and kernel modifications reduce the complexity of the individual operations or remove certain computations. However, our focus, the problem of memory-aware graph scheduling, remains orthogonal to these inspiring efforts.
6 CONCLUSION
As new forms of connectivity emerge in neural networks, there is a need for system support to enable their effective use, especially for intelligence at the edge. This paper took an initial step toward orchestrating such networks under stringent physical memory capacity constraints. We devised signatures to enable dynamic programming and adaptive soft budgeting to make the optimization tractable. Even more, an identity graph rewriting scheme was developed to further the potential for gains. The encouraging results for a set of emergent networks suggest that there is significant potential for compiler techniques that enable new forms of intelligent workloads.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightful comments. We also thank Harris Teague and Jangho Kim for the fruitful discussions and feedback on the manuscript, and Parham Noorzad for his help with the mathematical formulations to calculate the complexity of the algorithms.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
Abdelfattah, M. S., Han, D., Bitar, A., DiCecco, R., O'Connell, S., Shanker, N., Chu, J., Prins, I., Fender, J., Ling, A. C., et al. DLA: Compiler and FPGA overlay for neural network inference acceleration. In FPL, 2018.
Ahn, B. H., Pilligundla, P., and Esmaeilzadeh, H. Chameleon: Adaptive code optimization for expedited deep neural network compilation. In ICLR, 2020. URL https://openreview.net/forum?id=rygG4AVFvH.
Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. JETC, 2017.
Belady, L. A. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 1966.
Bellman, R. Dynamic programming. Science, 1966.
Bellman, R. E. Dynamic programming treatment of the traveling salesman problem. 1961.
Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. TC, 1989.
Bruno, J. and Sethi, R. Code generation for a one-register machine. JACM, 1976.
Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2016.
Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016. URL https://arxiv.org/pdf/1602.02830.pdf.
Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., et al. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv, 2018. URL https://arxiv.org/pdf/1801.08058.pdf.
Dean, J. Machine learning for systems and systems for machine learning. In NIPS Workshop on ML Systems, 2017.
Elthakeb, A. T., Pilligundla, P., Yazdanbakhsh, A., Kinzer, S., and Esmaeilzadeh, H. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1811.01704.pdf.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. In ICLR, 2020. URL https://openreview.net/forum?id=rkgO66VKDS.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.
Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al. A configurable cloud-scale DNN processor for real-time AI. In ISCA, 2018.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017.
Gauen, K., Rangan, R., Mohan, A., Lu, Y.-H., Liu, W., and Berg, A. C. Low-power image recognition challenge. In ASP-DAC, 2017.
Google. TensorFlow Lite. URL https://www.tensorflow.org/mobile/tflite.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NIPS, 2015.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016a.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016b.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
Held, M. and Karp, R. M. A dynamic programming approach to sequencing problems. Journal of the SIAM, 1962.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017. URL https://arxiv.org/pdf/1704.04861.pdf.
Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In ISCA, 2018.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.
Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
Kahn, A. B. Topological sorting of large networks. CACM, 1962.
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schütze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.
Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.
Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.
Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
A COMPARISON BETWEEN IRREGULARLYWIRED NEURAL NETWORKSAND CONVENTIONAL REGULARTOPOLOGY NEURAL NETWORKS
Multiply-and-accumulate (Billions)
Top-
1 Im
ageN
et A
ccur
acy
()
85
65
70
75
80
200 10 30 40
DPN-131
Inception V1MobileNet
ShuffleNet
Inception V2
Inception V3Xception ResNet-152
SENet
AmoebaNet-A
ReNeXt-101PolyNetInception ResNet V2
Inception V4
NASNet-ANASNet-B
RandWire
AmoebaNet-A
AmoebaNet-B
RandWire
irregularly wired neural networksregular topology neural networks
irregularly wired neural networksshow better performance for
same amount of compute thanregular topology neural networks
top left means is better
(a) ImageNet accuracy vs number of multiply-and-accumulate
Number of Parameters (Millions)
Top-
1 Im
ageN
et A
ccur
acy
()
85
65
70
75
80
800 40 100 140
DPN-131
irregularly wired neural networks
Inception V1MobileNetShuffleNet
Inception V2
Inception V3Xception
ResNet-152
SENet
AmoebaNet-C
ReNeXt-101
PolyNetInception ResNet V2Inception V4
NASNet-A
NASNet-A
RandWire
AmoebaNet-A
RandWire
regular topology neural networks
irregularly wired neural networksshow better performance for
same number of parameters thanregular topology neural networks
top left means is better
6020 120
NASNet-A
(b) ImageNet accuracy vs number of parameters
Figure 14 ImageNet accuracy vs number of multiply-and-accumulate or parameters where irregularly wired neural networksshow higher performance for same amount of compute or numberof parameters than regular topology neural networks
B COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10Figure 15 provides the raw numbers of the peak memory foot-print for the benchmark irregularly wired neural networks
1656
552
194
70
645
330 60
5
350
160
903
251
82 33
459
260 359
280
115
753
226
72 20
459
260 359
280
115
0
500
1000
1500
2000
Normal Cell A Cell B Cell C Cell A Cell B Cell A Cell B Cell C
DARTSImageNet
SwiftNetVisual Wake Words Dataset
RandWireCIFAR10
RandWireCIFAR100
Peak
Mem
ory
Foot
prin
t (K
B)
TensorFow LiteDynamic Programming+Memory AllocatorDynamic Programming+Graph Rewriting+Memory Allocator
Smaller the better
Peak Memory Footprint (KB)
Normal Cell A Cell B Cell C Cell A Cell B Cell CCell A Cell BSwiftNet
Human PresenceDARTSImageNet
RandWireCIFAR10
RandWireCIFAR100
Figure 15 Peak memory footprint of running irregularly wiredneural networks on SERENITY and TensorFlow Lite
C PROOF FOR OPTIMAL PEAK MEMORYFOOTPRINT FROM THE DYNAMICPROGRAMMING-BASED SCHEDULING
Here we prove the optimality of the above dynamicprogramming-based scheduling algorithm
THEOREM 1 In order to find a schedule slowast with an optimalpeak memory consumption microlowast it is sufficient to keep justone schedule-peak memory pair (si zi) in STi for eachzero-indegree set zi and to append subsequent nodes on topof si to get si+1 in each search step
Proof If i=0 the optimal s0 is an empty sequence and micro0
must be 0 On the other hand if ige 1 assume that (subop-timal) vi constitutes slowast substituting ulowasti isinzi and achieves microlowastIn such case let vi be replaced with (optimal) ulowasti which willresult in micropeak larr min(microi +
prodvishapemicroi +
produlowasti shape)
and microi+1 is calculated by deductingprodpishape forallpi isin
(uipredscapzero-outdegree(si+1G)) By recursively apply-ing uk for rest of the search steps k the algorithm shouldfind an alternative sequence slowastprime with microlowastprimelemicrolowast due to the minoperator above contradicting the original assumption on theoptimality of slowast Therefore our algorithm finds a schedulewith an optimal peak memory consumption
D COMPLEXITY ANALYSIS OFTHE DYNAMIC PROGRAMMING-BASEDSCHEDULING AND PROOF
We compare the complexity of exhaustively exploring STand our dynamic programming-based scheduling Whilethe algorithm both lists candidate schedules and calculatestheir peak memory footprint we consider the peak memoryfootprint calculation as one operation while deriving thecomplexity In order to visualize the analysis we invent Gin Figure 16 to demonstrate the upper bound complexity ofeach algorithm It has a single entry node and a single exitnode A and Z respectively and all other nodes constituteindependent branches between the entry and the exit node
A
D W
Z
CB X Y
GGraph
hellip
Figure 16 Topology of G to demonstrate the upper boundcomplexity of each algorithm
First we demonstrate the complexity of the recursivetopological sorting that exhaustively explores ST Sincethere is a single entry node and a single exit node therewill be |V minus 2| remaining nodes and these nodes can bescheduled independently of one another thereby the numberof candidate schedules become 〈|V minus 2|〉 and the overall
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
complexity becomesO(|V |) where |V | denotes the numberof nodes On the other hand for the dynamic programmingwe calculate the number of candidates by utilizing the num-ber of schedules that gets memoized Our memoization takesadvantage of the zero-indegree sets z for each search step
For the first and the last search steps we assume that we havea single entry node and a single exit node On the other handsince the number of nodes scheduled in search step iwouldbe iminus1 the maximum number of entries for memoizationis(|V |minus2
iminus1) On top of this each step would make an iteration
over the set of candidate nodes to discover the next searchsteprsquos z Therefore search step 1 would explore |V | minus 2nodes and the search steps 2 to |V |minus 1 would iterate over|V |minus1minusi nodes Summarizing this would yield
1+1times(|V |minus2)+
(|V |minus2
1
)times(|V |minus3)+
+
(|V |minus2
|V |minus2
)times0+1
=1+
(|V |minus2
0
)times(|V |minus2)+
(|V |minus2
1
)times(|V |minus3)+
+
(|V |minus2
|V |minus2
)times0+1
=2+
|V |minus2sumi=0
(|V |minus2
i
)times(|V |minus2minusi)
=2+(|V |minus2)times2|V |minus3
le(|V |minus2)times2|V |minus2 for |V |ge4
le|V |times2|V |
As a result we can see that our dynamic programming-basedscheduling algorithm is bounded by O(|V | times 2|V |) Byusing Stirlingrsquos approximation on the complexity of therecursive topological sorting we can prove that the dynamicprogramming-based scheduling algorithm should besignificantly faster than the recursive topological ordering
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their insightfulcomments We also thank Harris Teague and Jangho Kimfor the fruitful discussions and feedbacks on the manuscriptand Parham Noorzad for his help with the mathematicalformulations to calculate the complexity of the algorithms
REFERENCES
Abadi M Barham P Chen J Chen Z Davis A Dean JDevin M Ghemawat S Irving G Isard M et al TensorflowA system for large-scale machine learning In OSDI 2016
Abdelfattah M S Han D Bitar A DiCecco R OrsquoConnellS Shanker N Chu J Prins I Fender J Ling A C et alDLA Compiler and FPGA overlay for neural network inferenceacceleration In FPL 2018
Ahn B H Pilligundla P and Esmaeilzadeh H ChameleonAdaptive code optimization for expedited deep neuralnetwork compilation In ICLR 2020 URL httpsopenreviewnetforumid=rygG4AVFvH
Anwar S Hwang K and Sung W Structured pruning of deepconvolutional neural networks JETC 2017
Belady L A A study of replacement algorithms for a virtual-storage computer IBM Systems Journal 1966
Bellman R Dynamic programming Science 1966
Bellman R E Dynamic programming treatment of the travelingsalesman problem 1961
Bernstein D Rodeh M and Gertner I On the complexity ofscheduling problems for parallelpipelined machines TC 1989
Bruno J and Sethi R Code generation for a one-register machineJACM 1976
Cai H Zhu L and Han S ProxylessNAS Direct neural architec-ture search on target task and hardware In ICLR 2019 URL httpsopenreviewnetforumid=HylVB3AqYm
Chen T Moreau T Jiang Z Zheng L Yan E Shen H CowanM Wang L Hu Y Ceze L et al Tvm An automated end-to-end optimizing compiler for deep learning In OSDI 2018
Chen Y Luo T Liu S Zhang S He L Wang J Li LChen T Xu Z Sun N et al Dadiannao A machine-learningsupercomputer In MICRO 2014
Chen Y-H Krishna T Emer J S and Sze V EyerissAn energy-efficient reconfigurable accelerator for deepconvolutional neural networks JSSC 2016
Cortes C Gonzalvo X Kuznetsov V Mohri M and YangS AdaNet Adaptive structural learning of artificial neuralnetworks In ICML 2017
Courbariaux M Hubara I Soudry D El-Yaniv R and BengioY Binarized neural networks Training deep neural networkswith weights and activations constrained to +1 or -1 arXiv 2016URL httpsarxivorgpdf160202830pdf
Cyphers S Bansal A K Bhiwandiwalla A Bobba JBrookhart M Chakraborty A Constable W Convey CCook L Kanawi O et al Intel nGraph An intermediate repre-sentation compiler and executor for deep learning arXiv 2018URL httpsarxivorgpdf180108058pdf
Dean J Machine learning for systems and systems for machinelearning In NIPS Workshop on ML Systems 2017
Elthakeb A T Pilligundla P Yazdanbakhsh A Kinzer S andEsmaeilzadeh H Releq A reinforcement learning approachfor deep quantization of neural networks arXiv 2018 URLhttpsarxivorgpdf181101704pdf
Esser S K McKinstry J L Bablani D Appuswamy R andModha D S Learned step size quantization In ICLR 2020URL httpsopenreviewnetforumid=rkgO66VKDS
Feurer M Klein A Eggensperger K Springenberg J BlumM and Hutter F Efficient and robust automated machinelearning In NIPS 2015
Fowers J Ovtcharov K Papamichael M Massengill T LiuM Lo D Alkalay S Haselman M Adams L Ghandi Met al A configurable cloud-scale dnn processor for real-timeai In ISCA 2018
Gao M Pu J Yang X Horowitz M and Kozyrakis CTETRIS Scalable and efficient neural network acceleration with3d memory In ASPLOS 2017
Gauen K Rangan R Mohan A Lu Y-H Liu W and BergA C Low-power image recognition challenge In ASP-DAC2017
Google TensorFlow Lite URL httpswwwtensorfloworgmobiletflite
Han S Pool J Tran J and Dally W Learning both weights andconnections for efficient neural network In NIPS 2015
Han S Liu X Mao H Pu J Pedram A Horowitz M A andDally W J EIE efficient inference engine on compressed deepneural network In ISCA 2016a
Han S Mao H and Dally W J Deep compression Compressingdeep neural networks with pruning trained quantization andhuffman coding In ICLR 2016b
He Y Lin J Liu Z Wang H Li L-J and Han S AMCAutoML for model compression and acceleration on mobiledevices In ECCV 2018
Held M and Karp R M A dynamic programming approach tosequencing problems Journal of the SIAM 1962
Howard A G Zhu M Chen B Kalenichenko DWang W Weyand T Andreetto M and Adam HMobileNets Efficient convolutional neural networksfor mobile vision applications arXiv 2017 URLhttpsarxivorgpdf170404861pdf
Jain A Phanishayee A Mars J Tang L and Pekhimenko GGist Efficient data encoding for deep neural network trainingIn ISCA 2018
Ordering Chaos Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In MM, 2014.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. In ICML, 2018.
Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In SysML, 2019.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
Kahn, A. B. Topological sorting of large networks. CACM, 1962.
Keßler, C. and Bednarski, A. A dynamic programming approach to optimal integrated code generation. In LCTES, 2001.
Laredo, D., Qin, Y., Schutze, O., and Sun, J.-Q. Automatic model selection for neural networks. arXiv, 2019. URL https://arxiv.org/pdf/1905.06010.pdf.
Lattner, C. and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NIPS, 1990.
Lee, C., Lee, J. K., Hwang, T., and Tsai, S.-C. Compiler optimization on VLIW instruction scheduling for low power. TODAES, 2003.
Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In ICLR, 2019a. URL https://openreview.net/forum?id=S1eYHoC5FX.
Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. Optimizing CNN model inference on CPUs. In USENIX ATC, 2019b.
Mireshghallah, F., Taram, M., Ramrakhyani, P., Jalali, A., Tullsen, D., and Esmaeilzadeh, H. Shredder: Learning noise distributions to protect inference privacy. In ASPLOS, 2020.
Mishra, A. and Marr, D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In ICLR, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
NVIDIA. TensorRT: Programmable inference accelerator, 2017. URL https://developer.nvidia.com/tensorrt.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Plump, D. Term graph rewriting. In Handbook of Graph Grammars and Computing by Graph Transformation, Volume 2: Applications, Languages and Tools. World Scientific, 1999.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv, 2018. URL https://arxiv.org/pdf/1805.00907.pdf.
Schosser, A. and Geiß, R. Graph rewriting for hardware dependent program optimizations. In AGTIVE, 2007.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V., and Esmaeilzadeh, H. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
Sifre, L. and Mallat, S. Rigid-motion scattering for image classification. PhD dissertation, 2014.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv, 2018. URL https://arxiv.org/pdf/1802.04730.pdf.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
Wilken, K., Liu, J., and Heffernan, M. Optimal instruction scheduling using integer programming. In PLDI, 2000.
Wortsman, M., Farhadi, A., and Rastegari, M. Discovering neural wirings. In NeurIPS, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
Xie, S., Kirillov, A., Girshick, R., and He, K. Exploring randomly wired neural networks for image recognition. In ICCV, 2019.
Zhang, T., Yang, Y., Yan, F., Li, S., Teague, H., Chen, Y., et al. SwiftNet: Using graph propagation as meta-knowledge to search highly representative neural architectures. arXiv, 2019. URL https://arxiv.org/pdf/1906.08305.pdf.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016. URL https://arxiv.org/pdf/1606.06160.pdf.
Zhu, M. and Gupta, S. To prune or not to prune: exploring the efficacy of pruning for model compression. In ICLR Workshop, 2018. URL https://openreview.net/forum?id=S1lN69AT-.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
A. COMPARISON BETWEEN IRREGULARLY WIRED NEURAL NETWORKS AND CONVENTIONAL REGULAR TOPOLOGY NEURAL NETWORKS
[Figure 14(a): scatter plot of Top-1 ImageNet accuracy (%), 65 to 85, versus multiply-and-accumulate operations (billions). Irregularly wired neural networks (RandWire, NASNet, AmoebaNet variants) lie toward the top left of regular topology networks (Inception V1-V4, MobileNet, ShuffleNet, Xception, ResNet-152, ResNeXt-101, PolyNet, DPN-131, SENet); top left is better.]
[Figure 14(b): the same comparison plotted against number of parameters (millions); irregularly wired neural networks again sit toward the top left.]
Figure 14. ImageNet accuracy versus number of multiply-and-accumulate operations or parameters. Irregularly wired neural networks achieve higher accuracy for the same amount of compute or the same number of parameters than regular topology neural networks.
B. COMPARISON WITH TENSORFLOW LITE
In addition to the relative reductions provided in Figure 10, Figure 15 provides the raw numbers of the peak memory footprint for the benchmark irregularly wired neural networks.
[Figure 15: bar chart of peak memory footprint (KB), 0 to 2000, for each benchmark cell of DARTS (ImageNet), SwiftNet (Visual Wake Words), RandWire (CIFAR-10), and RandWire (CIFAR-100), comparing TensorFlow Lite, Dynamic Programming + Memory Allocator, and Dynamic Programming + Graph Rewriting + Memory Allocator; smaller is better.]
Figure 15. Peak memory footprint of running irregularly wired neural networks on SERENITY and TensorFlow Lite.
C. PROOF FOR OPTIMAL PEAK MEMORY FOOTPRINT FROM THE DYNAMIC PROGRAMMING-BASED SCHEDULING
Here, we prove the optimality of the above dynamic programming-based scheduling algorithm.
THEOREM 1. In order to find a schedule $s^*$ with an optimal peak memory consumption $\mu^*$, it is sufficient to keep just one schedule-peak memory pair $(s_i, \mu_i)$ in $ST_i$ for each zero-indegree set $z_i$, and to append subsequent nodes on top of $s_i$ to get $s_{i+1}$ in each search step.
Proof. If $i = 0$, the optimal $s_0$ is an empty sequence and $\mu_0$ must be 0. On the other hand, if $i \geq 1$, assume that (suboptimal) $v_i$ constitutes $s^*$, substituting $u_i^* \in z_i$, and achieves $\mu^*$. In such a case, let $v_i$ be replaced with (optimal) $u_i^*$, which will result in $\mu_{\text{peak}} \leftarrow \min\left(\mu_i + \prod v_i.\text{shape},\ \mu_i + \prod u_i^*.\text{shape}\right)$, and $\mu_{i+1}$ is calculated by deducting $\prod p_i.\text{shape}$, $\forall p_i \in (u_i.\text{preds} \cap \text{zero-outdegree}(s_{i+1}, G))$. By recursively applying $u_k$ for the rest of the search steps $k$, the algorithm should find an alternative sequence $s^{*\prime}$ with $\mu^{*\prime} \leq \mu^*$ due to the $\min$ operator above, contradicting the original assumption on the optimality of $s^*$. Therefore, our algorithm finds a schedule with an optimal peak memory consumption. $\square$
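To make the memoization concrete, the following Python sketch implements a simplified version of this dynamic programming over zero-indegree frontiers (an illustrative reconstruction, not the authors' SERENITY implementation; the graph, the `succs`/`size` dictionaries, and the one-number-per-node memory model are assumptions). It keeps one best (peak, current memory, schedule) triple per set of scheduled nodes and frees a node's activation once all of its consumers are scheduled:

```python
def dp_schedule(succs, size):
    """Find a schedule minimizing peak activation memory on a DAG.

    succs: dict mapping node -> list of successor nodes.
    size:  dict mapping node -> size of the node's output activation.
    Memoizes one best (peak, current_mem, schedule) triple per set of
    scheduled nodes, i.e., per zero-indegree set.
    """
    preds = {v: [] for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)

    nodes = set(succs)
    states = {frozenset(): (0, 0, ())}  # scheduled-set -> (peak, cur, sched)
    for _ in range(len(nodes)):
        nxt = {}
        for done, (peak, cur, sched) in states.items():
            ready = [v for v in nodes - done
                     if all(p in done for p in preds[v])]
            for v in ready:
                cur2 = cur + size[v]          # allocate v's output
                peak2 = max(peak, cur2)
                done2 = done | {v}
                for p in preds[v]:            # free fully consumed inputs
                    if all(s in done2 for s in succs[p]):
                        cur2 -= size[p]
                if done2 not in nxt or peak2 < nxt[done2][0]:
                    nxt[done2] = (peak2, cur2, sched + (v,))
        states = nxt
    peak, _, sched = states[frozenset(nodes)]
    return list(sched), peak

# Two branches between entry 'A' and exit 'Z'; the B1->B2 chain holds a
# large (10-unit) intermediate, the C branch a medium (5-unit) one.
succs = {'A': ['B1', 'C'], 'B1': ['B2'], 'B2': ['Z'], 'C': ['Z'], 'Z': []}
size = {'A': 1, 'B1': 10, 'B2': 1, 'C': 5, 'Z': 1}
sched, peak = dp_schedule(succs, size)
print(sched, peak)  # ['A', 'B1', 'B2', 'C', 'Z'] 12
```

Finishing the large chain before starting the C branch avoids keeping the 10-unit and 5-unit tensors live at the same time; any interleaved order peaks at 16 instead of 12.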
D. COMPLEXITY ANALYSIS OF THE DYNAMIC PROGRAMMING-BASED SCHEDULING AND PROOF
We compare the complexity of exhaustively exploring $ST$ with that of our dynamic programming-based scheduling. While both algorithms list candidate schedules and calculate their peak memory footprint, we count each peak memory footprint calculation as one operation while deriving the complexity. To visualize the analysis, we construct the graph $G$ in Figure 16 to demonstrate the upper-bound complexity of each algorithm. It has a single entry node and a single exit node, A and Z respectively, and all other nodes constitute independent branches between the entry and the exit node.
[Figure 16: graph $G$ with entry node A, exit node Z, and independent single-node branches B, C, D, ..., W, X, Y between them.]
Figure 16. Topology of $G$ to demonstrate the upper bound complexity of each algorithm.
First, we demonstrate the complexity of the recursive topological sorting that exhaustively explores $ST$. Since there is a single entry node and a single exit node, there will be $|V|-2$ remaining nodes, and these nodes can be scheduled independently of one another; thereby the number of candidate schedules becomes $(|V|-2)!$ and the overall complexity becomes $O(|V|!)$, where $|V|$ denotes the number of nodes. On the other hand, for the dynamic programming, we calculate the number of candidates by utilizing the number of schedules that get memoized. Our memoization takes advantage of the zero-indegree sets $z$ for each search step.
For the first and the last search steps, we assume that we have a single entry node and a single exit node. On the other hand, since the number of nodes scheduled in search step $i$ would be $i-1$, the maximum number of entries for memoization is $\binom{|V|-2}{i-1}$. On top of this, each step would make an iteration over the set of candidate nodes to discover the next search step's $z$. Therefore, search step 1 would explore $|V|-2$ nodes, and the search steps 2 to $|V|-1$ would iterate over $|V|-1-i$ nodes. Summarizing, this would yield:
$$\begin{aligned}
&\ 1 + 1\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
=&\ 1 + \binom{|V|-2}{0}\times(|V|-2) + \binom{|V|-2}{1}\times(|V|-3) + \cdots + \binom{|V|-2}{|V|-2}\times 0 + 1 \\
=&\ 2 + \sum_{i=0}^{|V|-2} \binom{|V|-2}{i}\times(|V|-2-i) \\
=&\ 2 + (|V|-2)\times 2^{|V|-3} \\
\leq&\ (|V|-2)\times 2^{|V|-2} \quad \text{for } |V| \geq 4 \\
\leq&\ |V|\times 2^{|V|}
\end{aligned}$$
As a result, we can see that our dynamic programming-based scheduling algorithm is bounded by $O(|V| \times 2^{|V|})$. By using Stirling's approximation on the $O(|V|!)$ complexity of the recursive topological sorting, we can prove that the dynamic programming-based scheduling algorithm should be significantly faster than the recursive topological sorting.
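Both counts in this analysis are small enough to verify numerically. The short Python script below (illustrative; `count_schedules` and `branch_graph` are assumed helper names, not from the paper) enumerates all topological orders of a Figure 16-style graph by recursive topological sorting, confirming the $(|V|-2)!$ schedule count, and checks the binomial identity behind the $2+(|V|-2)\times 2^{|V|-3}$ closed form:

```python
from math import comb, factorial

def count_schedules(succs):
    """Count topological orders of a DAG by recursive enumeration
    (the exhaustive search whose cost the analysis bounds)."""
    preds = {v: set() for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)

    def rec(done):
        ready = [v for v in succs if v not in done and preds[v] <= done]
        if not ready:                 # all nodes scheduled: one order
            return 1
        return sum(rec(done | {v}) for v in ready)

    return rec(set())

def branch_graph(k):
    """Figure 16 topology: entry 'A', exit 'Z', k independent branch nodes."""
    succs = {'A': ['n%d' % i for i in range(k)], 'Z': []}
    for i in range(k):
        succs['n%d' % i] = ['Z']
    return succs

# |V| - 2 independent branch nodes give (|V|-2)! schedules.
for k in range(1, 6):
    assert count_schedules(branch_graph(k)) == factorial(k)

# Identity used to collapse the DP step count: sum_i C(n,i)(n-i) = n*2^(n-1).
for n in range(1, 12):
    assert sum(comb(n, i) * (n - i) for i in range(n + 1)) == n * 2 ** (n - 1)
print("all checks pass")
```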