
Metareasoning in Modular Software Systems: On-the-Fly Configuration using Reinforcement Learning with Rich Contextual Representations

Aditya Modi¹, Debadeepta Dey², Alekh Agarwal², Adith Swaminathan², Besmira Nushi², Sean Andrist², Eric Horvitz²

¹ University of Michigan, Ann Arbor
² Microsoft Research, Redmond

Abstract

Assemblies of modular subsystems are being pressed into service to perform sensing, reasoning, and decision making in high-stakes, time-critical tasks in such areas as transportation, healthcare, and industrial automation. We address the opportunity to maximize the utility of an overall computing system by employing reinforcement learning to guide the configuration of the set of interacting modules that comprise the system. The challenge of doing system-wide optimization is a combinatorial problem. Local attempts to boost the performance of a specific module by modifying its configuration often lead to losses in the overall utility of the system's performance as the distribution of inputs to downstream modules changes drastically. We present metareasoning techniques which consider a rich representation of the input, monitor the state of the entire pipeline, and adjust the configuration of modules on-the-fly so as to maximize the utility of a system's operation. We show significant improvement in both real-world and synthetic pipelines across a variety of reinforcement learning techniques.

1 INTRODUCTION

The lives of a large segment of the world's population are greatly influenced by complex software systems, be it the software that returns search results, enables the purchase of an airplane ticket, or runs a self-driving car. Software systems are inherently modular, i.e., they are composed of numerous distinct modules working together. As an example, a self-driving car has modules for sensors such as cameras and lidars, which poll the sensors and output sensor messages, and a mapping module that consumes sensor messages and creates a high-resolution map of the immediate environment. The output of the mapping module is then input to a planning module whose job is to create safe trajectories for the vehicle. These distinct modules often operate at different frequencies; the camera module may be producing images at 120Hz while the GPS module may be producing vehicle position readings at 1000Hz. Furthermore, they may each have their own set of free parameters which are set via access of a configuration file at startup. For example, the software serving as the driver of a camera in the self-driving pipeline may have a parameter setting for the rate at which images are polled from the camera and another parameter for the resolution of the images. Similarly, the function of the mapping module may be controlled by a parameter that specifies the maximum amount of memory it is allowed to consume, leading to the continual removal of information about more distant and thus less relevant map content.

Figure 1: Face detection and landmark detection modular system. The input is an image stream to the face detection module, which outputs locations of faces in the image; these are then input to the face landmark detection module, which outputs locations of eyes, nose, lips, brows, etc. on the detected faces. The metareasoning module receives the input stream of images along with intermediate outputs of the face detector to dynamically decide the configuration of the pipeline such that it optimizes the end system loss.

Large software systems typically are composed of a set of distinct modular components. The operating characteristics of all of the components are usually manually configured to achieve system performance targets or constraints like accuracy and/or latency of output. Configurations of parameters may result from the tedious and long-term tuning of one parameter at a time. Once such nominal configurations have been produced, they are then held constant during system execution. The reliance on such fixed policies in a dynamic world may often be suboptimal. As an example, modules may take different amounts of time depending on the specific contents of the inputs they receive.

As a running example, we illustrate a pipeline for extracting faces with keypoint annotations from images in Figure 1. A natural performance metric for the pipeline might blend the prediction latency and accuracy, where the latency of a face-detection module may vary dramatically based on the number of people in the camera view. In this case, one might prefer switching to a parameter setting which allows the face detector to sacrifice some accuracy but which is much faster, hence raising the overall utility of the entire pipeline. Also, modules which are upstream from the face detector, like the camera driver module, might ideally throttle back the rate at which they produce images, since most of these images will not get processed anyway due to a bottleneck at the face detector module. Attempts to separately optimize distinct modules can often lead to losses in utility (Bradley, 2010) because of unaccounted shifts in the distribution of outputs produced by upstream modules.

Revisiting the self-driving car example, a basic utility function is to simply navigate passengers to their destination safely and in a reasonable amount of time. Highlighting the contextuality again, the emphasis on driving time might be higher when trying to get to an important meeting or a flight than when going grocery shopping. Furthermore, the utility function will typically be deeply personal to the user and has to be inferred over time. Importantly, this is complex pipeline-level feedback which is hard to attribute to individual components.

Optimizing the configuration of large modular systems is challenging for the following reasons:

1. Changing the parameters of an upstream module can drastically change the distribution of inputs to downstream modules. Jointly choosing a configuration for each module leads to a combinatorial optimization problem where the space of assignments is the cross product of the action spaces of the parameters of each module.

2. Even if we solved the combinatorial optimization problem, a fixed configuration is not good across all inputs. Hence, we need to choose the configuration in an input-adaptive manner. The decision about a particular module's parameter assignment has to be made before input is passed through it.

3. There are challenges of credit assignment about how much each particular parameter assignment, for each module along the way, contributed to the final utility. For non-additive utility functions, this is especially challenging (Daume III et al., 2018).

4. Finally, the metareasoning process by itself should add negligible latency to the original system. If the cost of metareasoning is significant, it may be best to run the original pipeline with different configurations and select the best performing assignment.

In this work, we leverage advances in representation and reinforcement learning (RL) to develop metareasoning machinery that can optimize the configuration of modular software systems under changing inputs and compute environments. Specifically, we demonstrate that by having a metareasoner continuously monitor the entire system, we can switch parameters of each module on-the-fly to adapt to changing inputs and optimize a desired objective. We also study the distinction in attainable performance between choosing the best configuration for the entire pipeline as a function of just the initial input, versus further choosing the configuration of each module based on all the preceding actions and outputs. We experiment with a synthetic pipeline meant to require adaptivity to the inputs, and we find that by doing so at each module, we improve by roughly 50% or more over the best constant assignment, and typically by a similar margin over the choice of a configuration just as a function of the initial input. For the face and landmark detection pipeline (Figure 1), we use the activations of a pretrained neural network model as a contextual signal and leverage this rich representation of context in decisions about the configuration of each module before the module operates on its inputs. We characterize the boosts in utility provided via use of this contextual information, improving 9% or more across different utility functions as opposed to the best static configuration of the system. Overall, our experiments demonstrate the importance of online, adaptive configuration of each module.

2 RELATED WORK

RL to control software pipelines: Decisions about computation under uncertainties in time and context have been described in Horvitz and Lengyel (1997), which presented the use of metareasoning to guide graphics rendering under changing computational resources, considering probabilistic models of human attention so as to maximize the perceived quality of rendered content. The metareasoning guided tradeoffs in rendering quality under shifting content and time constraints in accordance with preferences encoded in a utility function. Principles for guiding proactive computation were formalized in Horvitz (2001). Raman et al. (2013) characterize a tradeoff between computation and performance in data processing and ML pipelines, and provide a message-passing algorithm (derived by viewing pipelines as graphical models) that allows a human operator to manually navigate this tradeoff. Our work focuses on the use of metareasoning to replace the operator by setting the best operating point for any pipeline automatically.

Bradley (2010) proposed using subgradient descent coupled with loss functions developed in imitation learning in order to jointly optimize modular robotics software pipelines, which often involve planning modules, when the modules are differentiable with respect to the overall utility function. This is not suited to most real-world pipelines, whose modules are described not by parameters but by lines of code. In this work we instead develop fully general methods, which only assume the ability to evaluate the pipeline. Another form of pipeline optimization is to pick or configure the machine where each module should be executed. Methods in this ambit (Mirhoseini et al., 2017) are complementary to this work, in that optimizing the pipeline configuration per se remains a problem even with optimal device placement.

RL in distributed system optimization: The use of machine learning for optimizing resource allocation in distributed systems for data center and cluster management has been very well studied (Lorido-Botran et al., 2014; Demirci, 2015; Delimitrou and Kozyrakis, 2013, 2014). Many of these techniques use supervised learning as well as collaborative filtering for resource assignment, which rely on the assumption of having a rich set of processes in the training data and might as a result suffer from eventual data bias for new workloads. Most recently, the use of reinforcement learning for learning policies which dynamically optimize resources such that service level agreements can be better satisfied has received a lot of attention, especially with the rise of reinforcement learning with neural networks as function approximators, colloquially termed 'deep reinforcement learning' (Li, 2017; Arulkumaran et al., 2017). Model-free methods (Mao et al., 2016) based on policy gradients (Williams, 1992; Sutton et al., 2000) and Q-learning (Watkins, 1989; Xu et al., 2012) have shown promise, as modeling such large-scale distributed systems is a challenge in itself. Similarly, RL has found impressive success in energy optimization for data centers (Gao, 2014; Memeti et al., 2018).

RL for scheduling in operating systems: Even at the single machine level, RL has found promise for thread scheduling and resource allocation in operating systems. For example, Fedorova et al. (2007) and Hanus (2013) use RL-based methods to learn adaptive policies which outperform the best statically optimal policy (found by solving a queuing model) as well as myopic reactive policies which greedily optimize for short-term outcomes. The problem of scheduling in operating systems however differs from pipeline optimization in two fundamental ways. First, the operating system (as well as the scheduler) is oblivious to accuracy dependencies between different processes or threads. Second, due to either architectural or generality constraints, schedulers do not optimize process-level parameters but mainly focus on machine configuration.

3 PROBLEM DEFINITION

3.1 FORMAL SETTING AND NOTATION

A pipeline of M modules can be viewed as a directed graph where each node j is a module and an edge from j to k represents module k consuming the output of j as its input. We assume the graph does not have any cycles. Without loss of generality, let the modules be numbered according to their topological sort; i.e., j refers to the index of a module in a linear ordering of the DAG. For each module j, we have a set of possible configurations—these are the actions that are available for the metareasoner to choose from. We denote this set by $A_j$. A module j can then be viewed as a mapping from its inputs $x \in S^j_{in}$ into outputs $z \in S^j_{out}$, and each configuration $a \in A_j$ implies a different mapping. As a running example, we will consider the face detection pipeline of Figure 1. The pipeline contains two modules, with module 1 having 4 choices and module 2 having 3 choices. The input space to the first module, $S^1_{in}$, is the space of images (possibly in a feature space). The output space $S^1_{out}$ is the same as $S^2_{in}$ and can encode the image, the locations of faces in the image, and the latency induced by the first module.

The quality of a pipeline's operation is measured using a loss function denoted by $L : S^M_{out} \mapsto \mathbb{R}$. In the example pipeline of Figure 1, the outputs from the landmark detector can be labeled by human evaluators to assess accuracy, and L can be a complex trade-off between the latency incurred by the overall pipeline in processing an image vs. the accuracy of the detected landmarks. If labels are not available, accuracy might be inferred from proxies such as an incorrect denial of authentication for a user based on the landmark detector output, which can be observed when the user authenticates via alternative means such as a password. Crucially, we only observe the value of this loss function for the specific outputs $z \in S^M_{out}$ that the pipeline generates based on a certain configuration of actions at each module in response to an input x. We highlight that the loss function L can be any function mapping the pipeline's final output and system state to a scalar value, such as a passenger's satisfaction with a ride in a self-driving car as discussed in Section 1.


A metareasoner can be represented as a collection of (possibly randomized) policies $\pi := \{\pi_1, \ldots, \pi_M\}$, where $\pi_j : S^j_{in} \mapsto \Delta(A_j)$ specifies a context-dependent configuration of the module and $\Delta(A_j)$ is the set of distributions over the action set $A_j$. We abuse the notation for $S^j_{in}$ here to denote any succinct representation of the preceding pipeline components' outputs, actions and system state variables which are needed to choose the appropriate action for module j. The pipeline receives a stream of inputs and we use t to index the inputs. At time t, the pipeline receives an initial input $x^1_t \in S^1_{in}$, based on which an action $a^1_t \sim \pi_1(x^1_t)$ is picked at the first module and it produces an intermediate output $z^1_t$. This induces the next input $x^2_t \in S^2_{in}$ at the second module, at which point the policy $\pi_2$ is used to pick the next action, and so on. At each intermediate module j, the input $x^j_t$ depends on the outputs of all its parents in the DAG corresponding to the pipeline, and we assume that the input spaces $S^j_{in}$ are chosen appropriately so that a good metareasoner policy for module j can depend solely on $x^j_t$ instead of having to depend explicitly on the outputs of its predecessors. Proceeding this way, the interaction between the metareasoner and the environment can be summarized as follows:

1. $x^1_t \in S^1_{in}$ is fed as input to the pipeline.

2. The metareasoner chooses actions for each module based on the output of its predecessors and induces a trajectory $(x^1_t, a^1_t, z^1_t, \ldots, x^M_t, a^M_t, z^M_t)$; the eventual output of the pipeline is $z^M_t$.

3. The loss $L(z^M_t)$ is observed.
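To make the protocol concrete, the following is a minimal Python sketch of one episode through a linear pipeline. The module callables, `policies`, and `loss_fn` names are illustrative stand-ins rather than the authors' implementation; the only assumption is that each module is a function of its input and its chosen configuration.

```python
import random
from typing import Callable, List

def run_episode(x1, modules: List[Callable], policies: List[Callable],
                loss_fn: Callable):
    """One episode: feed x1 through a linear pipeline, letting each policy
    pick its module's configuration just before the module runs.

    modules[j](x, a) -> output z^j of module j under configuration a
    policies[j](x)   -> sampled action a^j ~ pi_j(x^j)
    loss_fn(z)       -> scalar loss L(z^M), observed only at the end
    """
    x, trajectory = x1, []
    for module, policy in zip(modules, policies):
        a = policy(x)              # choose a configuration from the current context
        z = module(x, a)           # run the module under that configuration
        trajectory.append((x, a, z))
        x = z                      # output of module j is the input of module j+1
    return trajectory, loss_fn(x)  # sparse feedback: one loss per episode

# Toy usage: two numeric "modules" with binary actions.
modules = [lambda x, a: x + a, lambda x, a: x * (a + 1)]
policies = [lambda x: random.choice([0, 1])] * 2
trajectory, loss = run_episode(0.5, modules, policies, lambda z: (z - 1.0) ** 2)
print(trajectory, loss)
```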

Formulated this way, the task of the metareasoner can be viewed as an episodic fixed-horizon reinforcement learning problem, where the state transitions are deterministic (although the initial input can be highly stochastic, such as an image in the face detection example). Each input processed by the pipeline is an episode, the horizon is M, and actions chosen by policies for the upstream modules affect the state distribution seen by downstream policies. The feedback is extremely sparse, with the only loss being observed at the end of the pipeline. The goal of the metareasoner is to minimize its average loss $\frac{1}{T} \sum_{t=1}^T L(z^M_t)$, and the ideal metareasoner can be described as:

$$\arg\min_{\pi_1, \ldots, \pi_M} \sum_{t=1}^T \mathbb{E}_{\pi_1, \ldots, \pi_M}\left[ L(z^M_t) \mid x^1_t \right] =: J(\pi). \qquad (1)$$

Our goal is to learn a metareasoner during the live operation of the pipeline. Since we only observe pipeline losses for the current choices of the metareasoner's policies, we must balance exploration to discover new pipeline configurations, and exploitation of previously found performant configurations. In such explore-exploit problems, we measure the average loss accumulated by our adaptive learning strategy as a benchmark; a lower loss is better. A better learning strategy will quickly identify good context-dependent configurations and hence have lower average loss as T increases.

3.2 CHALLENGES

In this section we highlight the important challenges that a metareasoner needs to address.

Combinatorial action space: Viewing the entire pipeline as a monolithic entity, with an aim to find the best fixed assignment for each module with no input dependence, leaves the metareasoner with combinatorially many choices (every possible combination of module configurations) to consider. This can quickly become intractable even for modest pipelines (e.g., see Figure 3), despite the use of the simplest possible static policy class.

Adaptivity to inputs: Having a static action assignment per module is overly simplistic in general, and we typically need a policy for manipulating configurations that is context-sensitive. For example, in Figure 4, we observe that the number of faces in the input image implies a fundamentally different trade-off between latency and accuracy, implying a different optimal choice for the image processing algorithm.

Credit assignment: Since we only observe delayed episodic reward, we do not know which module was to blame for a bad pipeline loss.

Exploration: Pipeline optimization offers a fundamentally challenging domain for exploration. Though we employ ideas from contextual bandits here, we anticipate future directions that explore by using pipeline structure to derive better learning strategies.

4 METHODS

The methods we outline now each address some of the challenges in Section 3.2. The simplest strategy is a non-adaptive (i.e., insensitive to the context) approach that can, however, effectively handle combinatorial actions (Section 4.1) to search for a locally optimal static assignment. A simple context-sensitive strategy views the pipeline optimization problem as a monolithic contextual bandit, and is vulnerable to a combinatorial scaling of complexity with pipeline size (Section 4.2). Finally, the most sophisticated strategy we develop produces a context-adaptive policy, exploits pipeline structure to learn per-module policies and uses policy-gradient algorithms to quickly reach a locally optimal configuration policy (Section 4.3).


4.1 GREEDY HILL CLIMBING

The simplest (infeasible) strategy for pipeline optimization with input examples $x^1_1, \ldots, x^1_T$ is to brute-force try every possible configuration for each of the T inputs and pick the configuration that accumulates the lowest loss. This strategy will identify the best non-adaptive (i.e., context-insensitive) configuration, but needs $T \cdot \prod_{j=1}^M |A_j|$ executions of the pipeline to find this configuration. Since this is typically intractable even for modest values of T and $A_j$ (especially in real-time), we now describe a tractable alternative to find an approximately good configuration via random coordinate descent.

Rather than identifying the best configuration, suppose we aim to find a "locally optimal" configuration – that is, for every module, if we held all other module configurations fixed, then deviating from the current configuration can only worsen the pipeline loss. To achieve this, we begin by randomly picking an initial configuration for each module in the pipeline. In each epoch, we first sample K out of T examples uniformly with replacement from the dataset, where K is a hyperparameter that can be set based on the available computational budget. We then choose one of the modules $j \in \{1, 2, \ldots, M\}$ uniformly at random and keep the configurations of all other modules fixed. We cycle through every possible action for that module (using, for instance, $K/|A_j|$ examples for each choice of action at this module) and pick the configuration that achieves the lowest accumulated loss. We then repeat this process until our training budget of examples is exhausted, or we have cycled through every module without making a configuration change (which means we are at a local optimum). This is akin to a greedy hill-climbing strategy, and has been used in many diverse applications of combinatorial optimization as an approximate heuristic, for instance in page layout optimization (Hill et al., 2017). More sophisticated variants of this approach can use best-arm identification techniques during each epoch, but fundamentally, this strategy finds an approximately optimal context-insensitive policy.
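As a concrete illustration, here is a compact sketch of the hill-climbing procedure described above. It assumes a hypothetical `pipeline_loss(x, config)` helper that executes the pipeline on input x under a full per-module configuration and returns the observed loss; the helper and all names are illustrative, not the paper's code.

```python
import random

def greedy_hill_climb(examples, action_sets, pipeline_loss, epochs=100, K=120):
    """Random coordinate descent over static per-module configurations.

    action_sets[j] lists the actions A_j of module j; pipeline_loss(x, config)
    runs the pipeline on input x under the given full configuration and
    returns the observed loss (hypothetical helper).
    """
    M = len(action_sets)
    config = [random.choice(A) for A in action_sets]  # random initial assignment
    stable = 0  # consecutive module visits without a configuration change
    for _ in range(epochs):
        j = random.randrange(M)                # pick one module at random
        batch = random.choices(examples, k=K)  # K examples, with replacement
        per_action = max(1, K // len(action_sets[j]))
        best_a, best_loss = config[j], float("inf")
        for i, a in enumerate(action_sets[j]):  # cycle over this module's actions
            trial = config[:j] + [a] + config[j + 1:]
            xs = batch[i * per_action:(i + 1) * per_action] or batch[:per_action]
            avg = sum(pipeline_loss(x, trial) for x in xs) / len(xs)
            if avg < best_loss:
                best_a, best_loss = a, avg
        stable = stable + 1 if best_a == config[j] else 0
        config[j] = best_a
        if stable >= M:  # a full pass with no change: locally optimal
            return config
    return config
```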

4.2 GLOBAL BANDIT FROM INITIAL INPUT

For many real-world pipelines, the modules' operating characteristics are sensitive to the initial input, meaning that a context-insensitive configuration policy can be very sub-optimal w.r.t. the pipeline loss. This motivates our approach to find a context-adaptive policy using contextual bandit (henceforth CB) algorithms.

A CB algorithm receives a context $x_t$ in each round t, takes an action $a \in A$ and receives a reward $r_t$. The algorithm learns a policy $\pi : x \mapsto \Delta(A)$ that is context-sensitive and adaptively trades off exploration and exploitation to maximize $\sum_t r_t$. In our setting, $x_t$ is the input example to the pipeline, $A := A_1 \times A_2 \times \cdots \times A_M$ is the Cartesian product of all module-specific configurations, and the reward is simply the negative of the observed pipeline loss.
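For concreteness, this global action set can be materialized as the Cartesian product of the per-module action sets; its size grows as $\prod_j |A_j|$, which is exactly what makes this view intractable for longer pipelines (sizes below are the face pipeline's, used for illustration):

```python
from itertools import product

action_sets = [[0, 1, 2, 3], [0, 1, 2]]   # 4 face detectors x 3 landmark detectors
global_actions = list(product(*action_sets))
print(len(global_actions))                # 12 joint configurations for M = 2
```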

In our experiments, we use a simple CB algorithm that uses Boltzmann exploration (see, e.g., Kaelbling et al. (1996)). Concretely, the policy is represented by a parametrized scoring function $s_\theta : S^1_{in} \times A \mapsto \mathbb{R}$. The score for each global configuration is computed as $s_\theta(x, a)$ and the policy is a softmax distribution over these scores:

$$\pi_\theta(a \mid x) = \frac{\exp(\lambda s_\theta(x, a))}{\sum_{a'} \exp(\lambda s_\theta(x, a'))}, \qquad (2)$$

where $\lambda > 0$ is a hyperparameter that governs the trade-off between exploration and exploitation. The score function is typically updated using importance-weighted regression (Bietti et al., 2018) (henceforth IWR); that is, if we observe a reward $r_t$ after configuring the pipeline with action $a_t \sim \pi_\theta(a \mid x_t)$, then the score function is optimized to minimize $\frac{1}{\pi_\theta(a_t \mid x_t)} (r_t - s_\theta(x_t, a_t))^2$.
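A minimal PyTorch sketch of this global bandit, with the Boltzmann policy of Equation (2) and the IWR update, is shown below. The network shape, class name, and training step are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class GlobalCB(nn.Module):
    """One score s_theta(x, a) per joint configuration, softmax exploration."""
    def __init__(self, context_dim, num_actions, hidden=256, lam=1.0):
        super().__init__()
        self.scores = nn.Sequential(nn.Linear(context_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_actions))
        self.lam = lam  # inverse temperature lambda in Eq. (2)

    def act(self, x):
        probs = torch.softmax(self.lam * self.scores(x), dim=-1)  # Eq. (2)
        a = torch.multinomial(probs, 1).item()
        return a, probs[a].item()

    def iwr_loss(self, x, a, r, p_a):
        # importance-weighted regression: (r - s(x, a))^2 / pi(a | x)
        return (r - self.scores(x)[a]) ** 2 / p_a

policy = GlobalCB(context_dim=1000, num_actions=12)
opt = torch.optim.RMSprop(policy.parameters(), lr=1e-3)

x = torch.randn(1000)   # context, e.g. an image embedding
a, p_a = policy.act(x)  # configure the whole pipeline at once
r = -0.3                # reward = negative observed pipeline loss
loss = policy.iwr_loss(x, a, r, p_a)
opt.zero_grad()
loss.backward()
opt.step()
```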

These contextual bandit algorithms can very effectively find context-sensitive policies π and adaptively explore promising configurations. However, by viewing the entire pipeline as one monolithic object with combinatorially many actions, they cannot tractably scale to even moderate-sized pipelines.

4.3 PER-MODULE BANDIT: USING INTERMEDIATE OBSERVATIONS

The contextual bandit approach of Section 4.2 does not scale well with the size of the pipeline, but it does guarantee (under mild assumptions, like an appropriate schedule for λ; see, e.g., Singh et al. (2000)) that we will eventually find the best context-adaptive policy expressible by our scoring function $s_\theta$. It also does not capture the outputs of prior modules in choosing the configuration at a successor, which can be vital, such as when a previous module incurs a large latency. Suppose we again relax the goal to instead find an approximately good "locally optimal" policy. Our key insight is to now employ a CB algorithm for each module, so that the algorithm for module j only needs to reason about $A_j$ actions. Moreover, as inputs are processed by the pipeline, the metareasoner can use up-to-date information (e.g., about latencies introduced by upstream modules) as part of the context for the downstream bandit algorithm.

One can again perform a variant of randomized coordinate ascent as in Section 4.1, holding all but one module fixed and running a CB algorithm for that module. This ensures that each bandit algorithm faces a stationary environment and can reliably identify a good context-sensitive policy quickly. However, this can be very data-inefficient; we will next sketch an actor-critic based reinforcement learning algorithm that can apply simultaneous updates to all modules.

Suppose we consider stochastic policies of the form (2) for a module j, but where $x^j_t \in S^j_{in}$ and $a^j_t \in A_j$. A common approach to optimize the policy parameters θ is to directly perform stochastic gradient descent on the average loss, which results in the policy gradient algorithm. Specialized to our setting, an unbiased estimate of the gradient for the parameters $\theta_j$ of $\pi_j$, that is $\nabla_{\theta_j} J(\theta)$ (recall (1)), is given by $L(z^M_t) \nabla_{\theta_j} \log \pi_\theta(a^j_t \mid x^j_t)$, since the loss is only incurred at the end. Typically, policy gradient techniques use an additional trained critic $C(x^j_t)$ as a baseline to reduce the variance of the gradients (Konda and Tsitsiklis, 2000). We train the critic to minimize the mean squared error between the observed reward and the predicted reward, $(L(z^M_t) - C(x^j_t))^2$.
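The per-module actor-critic update can be sketched as follows, again as an illustrative assumption rather than the paper's code: each module holds its own softmax policy and critic, every module acts during an episode, and all modules are updated from the single end-of-pipeline loss.

```python
import torch
import torch.nn as nn

class ModuleAgent:
    """Actor-critic bandit for one module: softmax policy plus a critic baseline."""
    def __init__(self, ctx_dim, num_actions, hidden=256, lr=1e-3, ent_wt=0.1):
        self.policy = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_actions))
        self.critic = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))
        params = list(self.policy.parameters()) + list(self.critic.parameters())
        self.opt = torch.optim.RMSprop(params, lr=lr)
        self.ent_wt = ent_wt  # entropy regularization weight

    def act(self, ctx):
        dist = torch.distributions.Categorical(logits=self.policy(ctx))
        a = dist.sample()
        return a.item(), dist.log_prob(a), dist.entropy()

    def update(self, ctx, log_prob, entropy, final_loss):
        baseline = self.critic(ctx).squeeze()
        advantage = (-final_loss) - baseline.detach()  # reward = -pipeline loss
        pg_loss = -advantage * log_prob - self.ent_wt * entropy
        critic_loss = (baseline - (-final_loss)) ** 2   # MSE to observed reward
        self.opt.zero_grad()
        (pg_loss + critic_loss).backward()
        self.opt.step()

# One episode over a 2-module pipeline. The random vectors stand in for real
# contexts (e.g. an image embedding, plus upstream actions and latency).
agents = [ModuleAgent(1000, 4), ModuleAgent(1002, 3)]
records, ctx = [], torch.randn(1000)
for agent in agents:
    a, lp, ent = agent.act(ctx)
    records.append((ctx, lp, ent))
    ctx = torch.randn(1002)  # stand-in for the next module's context
final_loss = 0.7             # L(z^M_t), observed once at the pipeline's end
for agent, (c, lp, ent) in zip(agents, records):
    agent.update(c, lp, ent, final_loss)
```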

5 EXPERIMENTS

The algorithms discussed in the previous section are tested on two sets of pipelines: a synthetic pipeline with strong context dependence and a real-world perception pipeline. Our results show performance improvement from adaptively choosing the configuration of the pipeline. For all our experiments, we use a PyTorch-based implementation (Paszke et al., 2017) with RMSProp (Hinton et al., 2012) as the optimizer. For hyperparameter tuning, we perform a grid search over the possible choices. The common hyperparameters for both methods are:

• Learning rate ∈ {0.0001, 0.0004, 0.001, 0.005}
• Minibatch size ∈ {5, 10, 20, 50, 100}
• ℓ2-weight decay factor ∈ {0.01, 0.05, 0.1, 1}

All our plots include 5 different runs with 5 randomly chosen seeds, with standard error regions. The specific details for each algorithm are as follows:

Greedy hill-climbing: For finding the greedy step in each iteration, we use a minibatch of 1000 samples per action (K = |A_j| × 1000). The procedure is run until it converges to a fixed assignment. In the plots, we outline this as the non-adaptive baseline with which each method is compared. The final assignment obtained by the procedure is evaluated using Monte Carlo runs with a sufficiently large number of samples from the input distribution (synthetic pipeline) or using samples present in a holdout set (face detection pipeline).

Global contextual bandit: The policy parameters consist of a single policy that maps the input $x^1_t$ to a configuration for the entire pipeline, and the policy class is a neural network with a single hidden layer. The inverse temperature coefficient for Boltzmann exploration, λ, is considered to be a hyperparameter. We use the IWR loss with minibatches to perform updates to the policy. For hyperparameter tuning, we choose the setting with the minimum cumulative loss for the pipeline across the input stream.

Method | Hyperparameter choices
Global CB | λ ∈ {0.1, 0.3, 1, 5, 10}
Per-module CB | ent_wt ∈ {0.01, 0.03, 0.1, 0.3, 1}

Table 1: Algorithm-specific hyperparameter choices

Figure 2: Synthetic pipeline

Per-module contextual bandit: The policy function at each module is a single hidden layer neural network with a softmax layer at the end. We use the policy gradient update rule as discussed in Section 4.3. The context for each module is the concatenation of the sequence of actions chosen for previous modules, the current latency, and the initial input to the system. Additionally, for each module, we implement a critic which predicts the final loss of the pipeline for the given context as described in Section 4.3. The critic is again a single hidden layer neural network with a single output node and is trained using squared loss over the observed and predicted loss. We use the same learning rate for both networks. We use minibatches for training the networks for each module and these are concurrently updated for each minibatch. In addition, we also use entropy regularization weighted by ent_wt with the policy gradient loss function (Haarnoja et al., 2018). We tune hyperparameters using the best cumulative pipeline loss across the input stream.

At a high level, our experiments seek to uncover the importance of adaptivity to the inputs in configuring the pipeline. To capture practical trade-offs, we consider loss functions which combine the latency incurred while processing an input with the accuracy of the final prediction compared to ground truth annotations.

5.1 SYNTHETIC PIPELINES

We begin with an illustrative synthetic pipeline designed to highlight: (1) benefits of adaptivity to the input over a static assignment, and (2) infeasibility of the global CB approach for even modestly long pipelines. The structure of the synthetic pipeline with n modules is a linear chain of length n, as shown in Figure 2. Each module has two possible actions: 0 and 1 (cheap/expensive action), which incur a latency cost of 0 and 1 respectively. Inputs to the pipeline consist of uniformly sampled binary strings from $\{0, 1\}^n$, with the i-th bit encoding the preferred action for module i. If the i-th bit is set to 0, both actions give an accurate output, and if it is 1, only the expensive action gives an accurate output. If we make an incorrect prediction at module i, then the final prediction at the end of the pipeline is always incorrect. At each episode t, we provide an input to the pipeline by first sampling a random binary string as mentioned above, but then add uniform noise in the interval [−0.3, 0.3] to each entry; this perturbed input constitutes the initial context $x^1_t$ for the pipeline. The loss function for the final output of the pipeline is

$$\ell(a) := \frac{4}{n^2}(\text{latency} - n/2)^2 + \text{error}.$$

We center the latency term at n/2, which is the latency of the optimal policy that routes each input perfectly to the cheapest action that makes the correct prediction for it, and the normalization keeps this term in [0, 1]. The second term measures the error in the eventual prediction, which requires each module to make an accurate prediction. The value is set to 1 for an incorrect output and 0 otherwise. While the initial input encodes the optimal configuration, suited to global CB, there is further room to adapt: when module i makes an error in prediction, all modules j > i should pick the cheap action.
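A small sketch of this synthetic environment, under the stated input distribution and loss (variable names are our own), reads as follows:

```python
import random

def sample_input(n):
    """A binary string whose i-th bit is the preferred action of module i,
    observed through additive uniform noise in [-0.3, 0.3]."""
    bits = [random.randint(0, 1) for _ in range(n)]
    context = [b + random.uniform(-0.3, 0.3) for b in bits]
    return bits, context

def synthetic_loss(bits, actions):
    """Latency term centered at n/2 plus a 0/1 error term."""
    n = len(bits)
    latency = sum(actions)  # action 1 costs 1 unit of latency, action 0 is free
    # The output is correct only if every module whose bit is 1 used action 1.
    error = 0 if all(a == 1 for a, b in zip(actions, bits) if b == 1) else 1
    return (4 / n**2) * (latency - n / 2) ** 2 + error

bits, ctx = sample_input(8)
print(synthetic_loss(bits, [1] * 8))  # always-expensive static assignment
print(synthetic_loss(bits, bits))     # the perfectly adaptive policy
```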

We show results of our algorithms for n = 4, 8 and 11. For static assignments, we compute both the solution of the greedy hill climbing strategy and a brute force search over all assignments, which results in similar average losses under the input distribution. The context for each module for per-module CB contains the pipeline's input, a binary string to denote upstream actions, and the current latency. We use ReLU activations, with the number of hidden units for each network in our experiments set to the average of the input dimension and the output dimension. For instance, for global CB the number of hidden units for n = 4 is h = 10, and for per-module CB it is $h = d_{in}/2 + 1$.

We show the evolution of the average loss as a function of the number of examples for different values of n in Figure 3. Our results show significant gains for being adaptive over the constant assignment baseline in all the plots. For n = 4, the total number of assignments is 16, and it can be clearly seen that global CB is effective when compared to the per-module counterpart. However, global CB is slower in convergence than per-module CB. For n = 8, the difference between the two is more pronounced, as the per-module CB method converges rapidly. For n = 11, the total number of assignments for the pipeline is 2048 and global CB completely fails to learn a better adaptive policy. The per-module CB has slower convergence in this harder case, but still improves upon the best constant assignment extremely quickly.

5.2 FACE AND LANDMARK DETECTION

Pipeline and dataset: We use a two-module production-grade real-world perception pipeline service to empirically study the efficacy of our proposed methods (Figure 1). The first module is a face detection module which takes as input an image stream and outputs the locations of faces present in the image as a list of bounding box rectangles. This module has four different algorithms for detecting faces. The exact details of the algorithms are proprietary, and hence we only have black-box access to them. We benchmarked the latency and accuracy of the algorithms on 2689 images from the validation set of the 2017 keypoint detection task of the open source COCO dataset (Lin et al., 2014). COCO has ground truth annotations of up to 17 visible keypoints per person in an image. We notice that not only do the algorithm choices have large variation in latency and accuracy on average when compared to each other, but, more crucially, their latencies and accuracies vary drastically with the number of true faces present in the incoming images, i.e., they are context dependent. Specifically, we observe that latency drastically increases with the number of faces present in the image. Figure 4 shows the latencies of all four detection algorithm choices vs. the number of true faces present in the image. Note that different algorithms have different latencies on average, with Algorithm 0 being the fastest (∼ 0.2 seconds) and Algorithm 3 the slowest (∼ 2.5 seconds).

The second module is a face landmark detection module which takes as input the original image and the predicted face rectangles output by the face detection module, and computes the location of landmarks on the face like the nose, eyes, ears, mouth, etc. There are three different landmark detector algorithm choices: a 5-point, 27-point or 87-point landmark detector. Again, we observe in our benchmarking that the landmark detector which outputs 87 points takes the most time at 0.25 ms per image on average, vs. 0.17 ms per image for the 27-point algorithm and 0.08 ms per image for the 5-point one. Since the landmark detectors are applied on each face rectangle detected by the face detector, the computational time required goes up in proportion to the number of faces. Figure 5 shows example face detections and landmarks detected on images from the validation set of the COCO dataset.

Figure 3: Average loss as a function of the number of examples for the synthetic pipeline. The flat line corresponds to the expected loss of the best constant assignment. The shaded region represents one standard error over 5 runs.

Figure 4: Face detection algorithm choices vs. latency in seconds as a function of the true number of faces present in the image. Algorithms 0 and 2 are much faster than Algorithms 1 and 3. All algorithms exhibit increasing latencies as the number of faces in the image goes up.

Figure 5: Example face and landmark detections from the COCO validation set. (Left) Face detected (blue rectangle) and landmarks detected within the face (blue dots). The red dots represent ground truth face landmarks not detected. (Right) False face detections (blue rectangles) and wrong landmarks within the rectangles.

Accuracy calculation: For evaluating when a prediction by the face detection module is a true/false positive/negative, we closely follow the scheme laid out in the COCO keypoints evaluation page (Lin et al., 2014). Specifically, a rectangle location on the image is considered a true positive if it is within 30 pixels of a ground truth face annotation, which is quite conservative as the images we use are all resized to a constant size of 1280(W) × 960(H) pixels. Otherwise, it is marked as a false positive. Ground truth faces which are not "covered" by any of the predicted faces cause an entry in the false negative count. If an image contains no faces and the face detection module also predicts no faces, then we count such scenarios as true negatives.

Similarly, for the face landmark module, we mark a prediction as a true positive if it is within 5 pixels of the ground truth landmark annotation, and otherwise as a false positive. All landmarks not "covered" by any of the predicted faces are counted as false negatives. Note that since the COCO keypoint annotations include only 17 keypoint annotations on the entire human body, including only 5 face landmarks, we do not penalize predictions of the 27- or 87-point landmark detection algorithms which are not within threshold distance of any ground truth landmark, as that would unfairly count them as false positives (due to lack of ground truth annotations).[1]
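For illustration, a simplified version of the rectangle-matching logic under the 30-pixel rule might look like the following. We use greedy nearest-neighbor matching for brevity, whereas the paper's keypoint evaluation uses an optimal matching, so treat this only as a sketch:

```python
def match_detections(pred_centers, true_centers, thresh=30.0):
    """Count TP/FP/FN for one image: a prediction within `thresh` pixels of an
    as-yet-unmatched ground-truth face is a true positive; any leftover ground
    truth is a false negative. Greedy matching is an illustrative simplification."""
    unmatched = list(true_centers)
    tp = fp = 0
    for (px, py) in pred_centers:
        best, best_d = None, thresh
        for g in unmatched:
            d = ((px - g[0]) ** 2 + (py - g[1]) ** 2) ** 0.5
            if d <= best_d:
                best, best_d = g, d
        if best is not None:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)

print(match_detections([(100, 100), (400, 300)], [(110, 95), (800, 600)]))
# -> (1, 1, 1): one hit within 30 px, one spurious box, one missed face
```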

Results: The dataset of 2689 images is divided into train and test sets of size 2139 and 550 respectively. For training, we use minibatches randomly sampled from the training set, and test curves are plotted using the average loss over the complete test set.[2] We use the embedding from the penultimate layer of ResNet-50 (He et al., 2016) as the contextual representation for each image for both adaptive methods. Thus, the context is a 1000-dimensional real-valued vector. For per-module CB, the first module's policy network gets the embedding as input, whereas the second one additionally gets the concatenated values of the number of faces detected by module 1 and its latency. All networks here have a hidden layer with 256 units with ReLU activations. For evaluating the final loss function of the pipeline, we consider a combination of three metrics:

[1] Since we find an optimal matching between predicted and true keypoints, each false negative also results in a false negative

[2] Unlike the synthetic pipeline, we do not use average loss over the run of the algorithm here, as the number of episodes is much larger than the size of our data set, which means algorithms can overfit to the data set, unlike in the synthetic case where we have effectively an infinite data set. So we evaluate a proxy for average loss as the average test performance on held-out examples, following standard methodology.

Figure 6: Test performance curves for the face detection and landmark pipeline with t0 = 1.3 and t0 = 1.7. The Y-axis is the percentage performance improvement over the static global policy after every 200 episodes of learning on held-out examples. The plots use a latency-based loss (left), latency and false negative rate (middle), and latency, false negative rate and false discovery rate (right). The adaptive approaches significantly improve over the best fixed configuration in all cases. Shading represents standard error across 5 runs.

Figure 7: Action counts in module 2 for per-module CB.

Pure latency: Squared loss between the pipeline's latency and a threshold $t_0$: $\ell(a) := (\text{latency} - t_0)^2$.

Latency and accuracy: In addition to the squared distance, we now consider the false negative rate (FNR) of the pipeline for the landmarks detected in each image. Since the false negative rate is always in [0, 1], it is robust to a different number of landmarks in different images, as well as a different number of predicted landmarks from different actions (5, 27 and 87), unlike a direct classification error in landmark prediction. In this case, $\ell(a) := (\text{latency} - t_0)^2 + \text{FNR}$.

Latency, accuracy and false detection penalty: For the face detection module, in many cases there are non-zero false positives. This further increases the number of false positive landmarks for those cases, and therefore we add another penalty of the false discovery rate (FDR) for face detection.
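The three loss variants can be summarized in a few lines of code (a sketch with our own parameter names, not the experiment harness):

```python
def pipeline_loss(latency, fnr=None, fdr=None, t0=1.3):
    """The three losses used in the experiments: a squared latency target,
    optionally plus the landmark false-negative rate and the face-detection
    false-discovery rate."""
    loss = (latency - t0) ** 2   # pure latency: bell-shaped around target t0
    if fnr is not None:
        loss += fnr              # latency + accuracy
    if fdr is not None:
        loss += fdr              # + false detection penalty
    return loss

print(pipeline_loss(1.9, fnr=0.2, fdr=0.05, t0=1.7))  # ~ 0.04 + 0.2 + 0.05 = 0.29
```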

In our experiments, we choose values of t0 = 1.3 and t0 = 1.7 for all three loss functions for the pipeline. Note that if one tries to optimize the total latency of the pipeline, then the non-adaptive solution of choosing the cheapest action for both modules works well. Therefore, we choose the bell-shaped squared loss for latency, which reflects the specification of aiming for a target latency. Figure 6 shows the observed improvement of the adaptive methods over the static global policy for t0 = 1.3 and t0 = 1.7. Per-module CB and global CB show improvement for all loss functions against the constant assignment baseline found by greedy hill climbing. The numbers in Table 2 show the context-dependency of the pipeline. The benefit of algorithms which are able to effectively utilize context (global CB and per-module CB) is really highlighted in the parts of the dataset which contain more than 3, 4 or 5 faces. As the number of faces in an image increases, the percentage gain increases as well. The observed gains of approximately 15, 22 and 24 percent in respecting the utility function are arguably significant for sensitive mission-critical applications. Although these two methods are hard to distinguish on average, we think this is due to the small length of the pipeline and the intermediate context for the second module's policy not being very informative. In order to show that adaptivity to the final loss function influences the chosen actions, we compare the counts of actions 0 and 1 for the second module using the first two loss functions. We show these counts for per-module CB. It can be seen from Figure 7 that changing the loss function leads to a change in the chosen actions for the test set.

#Faces | #Images | Global CB | Per-module CB
≥3 | 124 | 11.82% | 15.23%
≥4 | 83 | 18.58% | 22.51%
≥5 | 63 | 23.04% | 24.12%

Table 2: Percentage performance improvement over the static global policy for the global contextual bandit and per-module contextual bandit, broken down by the number of true faces in the image. Numbers shown here are for the latency and accuracy loss with t0 = 1.3.

5.3 DISCUSSION

We observe that contextual optimization of software pipelines can provide drastic improvements in the average performance of the pipeline for any chosen loss function. Our experiments show that for small pipelines, both global CB and per-module CB can give potential improvement over a constant assignment. However, these experiments should only be considered as a controlled study of the power of contextual optimization, and there are additional caveats which we defer to future work:

Computational overhead: The loss functions we consider for the pipeline involve a combination of latency and accuracy. In addition to the pipeline's latency, any metareasoning module will add to the cost. In our experiments, the total time for inference and updates is less than 5-7 ms per input, which is orders of magnitude less than the pipeline's latency. Moreover, making the pipeline configurable in real-time might induce further communication/data re-configuration costs. We focus on the potential improvements from adaptivity in this paper and leave the engineering constraints for future work.

Non-stationarity during learning: For the per-module CB algorithm, the input given to each network is ideally the input for the corresponding module. Changing the configuration of these pipelines can vary the distribution of the inputs to these modules drastically, and a change in one action changes the inputs to downstream modules. The pipelines in our experiments do not showcase this issue. We ignore this aspect in our current exposition and leave a more involved study to future work.

6 CONCLUSION

We presented the use of reinforcement learning to perform real-time control of the configuration of a modular system for maximizing the system's overall utility. We employed contextual bandits and provided them with a holistic representation of a visual scene and with the ability to both sense and control the parameters of each module. We showed significant improvement with the use of the metareasoning methodology for both the face detection and synthetic pipelines. Future directions include studies of scaling up the mechanisms we have presented to more general systems of interacting modules and the use of different forms of contextual signals and their analyses, including the use of more flexible neural network inference methods.

Acknowledgements

This work was done while AM was at Microsoft Research. AM acknowledges the concurrent support in part by a grant from the Open Philanthropy Project to the Center for Human-Compatible AI, and in part by NSF grant CAREER IIS-1452099.

References

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.

Bietti, A., Agarwal, A., and Langford, J. (2018). A contextual bandit bake-off. arXiv preprint arXiv:1802.04064.

Bradley, D. M. (2010). Learning in modular systems. Technical report, Carnegie Mellon University, Robotics Institute, Pittsburgh, PA.

Daume III, H., Langford, J., and Sharaf, A. (2018). Residual loss prediction: Reinforcement learning with no incremental feedback.

Delimitrou, C. and Kozyrakis, C. (2013). Paragon: QoS-aware scheduling for heterogeneous datacenters. In ACM SIGPLAN Notices, volume 48, pages 77–88. ACM.

Delimitrou, C. and Kozyrakis, C. (2014). Quasar: Resource-efficient and QoS-aware cluster management. In ACM SIGARCH Computer Architecture News, volume 42, pages 127–144. ACM.

Demirci, M. (2015). A survey of machine learning applications for energy-efficient resource management in cloud computing environments. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pages 1185–1190. IEEE.

Fedorova, A., Vengerov, D., and Doucette, D. (2007). Operating system scheduling on heterogeneous core systems. In Proceedings of the Workshop on Operating System Support for Heterogeneous Multicore Architectures.

Gao, J. (2014). Machine learning applications for data center optimization.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1856–1865.

Hanus, D. (2013). Smart scheduling: Optimizing Tilera's process scheduling via reinforcement learning. PhD thesis, Massachusetts Institute of Technology.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Hill, D. N., Nassif, H., Liu, Y., Iyer, A., and Vishwanathan, S. (2017). An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 1813–1821.

Hinton, G., Srivastava, N., and Swersky, K. (2012). Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.

Horvitz, E. (2001). Principles and applications of continual computation. Artificial Intelligence, 126(1-2):159–196.

Horvitz, E. and Lengyel, J. (1997). Perception, attention, and resources: A decision-theoretic approach to graphics rendering. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 238–249. Morgan Kaufmann Publishers Inc.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014.

Li, Y. (2017). Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Lorido-Botran, T., Miguel-Alonso, J., and Lozano, J. A. (2014). A review of auto-scaling techniques for elastic applications in cloud environments. Journal of Grid Computing, 12(4):559–592.

Mao, H., Alizadeh, M., Menache, I., and Kandula, S. (2016). Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pages 50–56. ACM.

Memeti, S., Pllana, S., Binotto, A., Kołodziej, J., and Brandic, I. (2018). Using meta-heuristics and machine learning for software optimization of parallel computing systems: A systematic literature review. Computing, pages 1–44.

Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. (2017). Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2430–2439. JMLR.org.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS-W.

Raman, K., Swaminathan, A., Gehrke, J., and Joachims, T. (2013). Beyond myopic inference in big data pipelines. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 86–94.

Singh, S., Jaakkola, T., Littman, M. L., and Szepesvari, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Xu, C.-Z., Rao, J., and Bu, X. (2012). URL: A unified reinforcement learning approach for autonomic cloud management. Journal of Parallel and Distributed Computing, 72(2):95–105.

