
Auto-Predication of Critical Branches*

Adarsh Chauhan
Processor Architecture Research Lab
Intel Labs
Bengaluru, India
[email protected]

Jayesh Gaur
Processor Architecture Research Lab
Intel Labs
Bengaluru, India
[email protected]

Zeev Sperber
Intel Corporation
Haifa, Israel
[email protected]

Franck Sala
Intel Corporation
Haifa, Israel
[email protected]

Lihu Rappoport
Intel Corporation
Haifa, Israel
[email protected]

Adi Yoaz
Intel Corporation
Haifa, Israel
[email protected]

Sreenivas Subramoney
Processor Architecture Research Lab
Intel Labs
Bengaluru, India
[email protected]

Abstract—Advancements in branch predictors have allowed modern processors to aggressively speculate and gain significant performance with every generation of increasing out-of-order depth and width. Unfortunately, there are branches that are still hard-to-predict (H2P), and mis-speculation on these branches is severely limiting the performance scalability of future processors. One potential solution to mitigate this problem is to predicate branches by substituting control dependencies with data dependencies. Predication is very costly for performance as it inhibits instruction level parallelism. To overcome this limitation, prior works selectively applied predication at run-time on H2P branches that have low confidence of branch prediction. However, these schemes do not fully comprehend the delicate trade-offs involved in suppressing speculation and can suffer from performance degradation on certain workloads. Additionally, they need significant changes not just to the hardware but also to the compiler and the instruction set architecture, rendering their implementation complex and challenging.

In this paper, by analyzing the fundamental trade-offs between branch prediction and predication, we propose Auto-Predication of Critical Branches (ACB), an end-to-end hardware-based solution that intelligently disables speculation only on branches that are critical for performance. Unlike existing approaches, ACB uses a sophisticated performance monitoring mechanism to gauge the effectiveness of dynamic predication, and hence does not suffer from performance inversions. Our simulation results show that, with just 386 bytes of additional hardware and no software support, ACB delivers 8% performance gain over a baseline similar to the Skylake processor. We also show that ACB reduces pipeline flushes because of mis-speculations by 22%, thus effectively helping both power and performance.

Index Terms—Microarchitecture, Dynamic Predication, Control Flow Convergence, Run-time Throttling

I. INTRODUCTION

High accuracy of modern branch predictors [2]–[5] has allowed Out-of-Order (OOO) processors to speculate aggressively on branches and gain significant performance with every generation of increasing processor depth and width. Unfortunately, there still remains a class of branches that are Hard-to-Predict (H2P) for even the most sophisticated branch predictors [6]–[8]. These branches cost not only performance but also significant power overheads because of pipeline flush and re-execution upon wrong speculation.

* Concepts, techniques and implementations presented in this paper are subject matter of pending patent applications, which have been filed by Intel Corporation.

Fig. 1. Performance trends with scaling of OOO processor. The 1X point is similar in parameters to the Skylake processor [1]. Performance potential for future processors is bound by the problem of mis-speculation.

Figure 1 shows the performance improvements from an oracle perfect branch predictor with increasing processor depth and width¹. For these results, the baseline is similar in parameters to the Intel Skylake processor [1] and uses a branch predictor similar to TAGE [2], [3]. We show the performance impact of perfect branch prediction on a continuum of processors with varying OOO resources compared to Skylake.

As is evident from Figure 1, the performance potential of perfect speculation increases with OOO processor scaling. For instance, a three times wider and deeper machine than the Skylake baseline is almost two times more speculation bound than Skylake. These results clearly motivate the need for mitigating branch mis-speculations, especially since future OOO processors are expected to scale deeper and wider [9]. As it gets harder to improve branch prediction, there is an

¹ Simulation framework is described in Section IV.


2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)

978-1-7281-4661-4/20/$31.00 ©2020 IEEE. DOI: 10.1109/ISCA45697.2020.00019


urgent need to investigate solutions to address this problem. One possible solution is to limit speculation when an H2P branch is encountered. A classic approach to achieve this is predication [10], which allows fetching both the taken and not-taken portions of a conditional branch, but the execution is conditional based on the final branch outcome. Because predication inherently limits instruction level parallelism, it can be detrimental to overall performance. To overcome this, several prior techniques have tried to predicate only those instances of H2P branches which have low confidence of prediction [7], [11]–[14]. Policies like Diverge Merge Processor (DMP) [7], [15] use careful compiler profiling to select target H2P branches, and then throttle their application using run-time monitoring of branch prediction confidence. These techniques showed great promise in mitigating the problems with H2P branches. Unfortunately, for almost a decade no further advancement has been made in these policies, and as we will show in this paper, on modern OOO processors with accurate branch predictors these policies end up creating severe run-time bottlenecks for some applications, thereby limiting their applicability. Moreover, these techniques need significant changes to the compiler and the instruction set architecture (ISA), which makes their adoption challenging.

In this work, we first perform a thorough study of the performance trade-offs created by limiting speculation using predication. Based on this analysis, we propose Auto-Predication of Critical Branches (ACB), which intelligently tries to disable speculation only on branches critical for performance. ACB needs no compiler or ISA support and has a micro-architecture which is implementable in modern OOO processors. Specifically, we make the following new contributions.

1) We present an analysis of the fundamental cost-benefit trade-offs that come to the fore when branch prediction is replaced by predication, especially on how it impacts the program critical path. Guided by this understanding, we propose ACB, a light-weight mechanism that intelligently decides whether limiting speculation for a given critical branch is helpful or detrimental to performance. ACB is a holistic and complete solution that mitigates performance losses from wrong speculation, while ensuring that it does not create performance inversions.

2) We describe ACB's implementation in a modern OOO processor with no ISA changes or compiler support. ACB learns its targeted critical branch PCs (program counters) using simple heuristics, and uses a novel hardware mechanism to accurately detect control flow convergence using generic patterns of convergence. This is unlike previous approaches [7], [12]–[14] that were dependent upon compiler analysis and profiling. With small changes to the Fetch and OOO pipelines, ACB dynamically predicates critical branches, thereby reducing costly pipeline flushes and improving performance.

3) We also propose a unique throttling system (Dynamo) that monitors the run-time performance delivered by applying ACB on any targeted branch and promptly throttles ACB instances that are found to be degrading performance. This is in contrast to typical throttling mechanisms that rely on monitoring multiple local performance counters. Cost-benefit estimation is complex for predication based solutions as they influence performance, negatively or positively, in many different ways. By directly monitoring the dynamic performance, Dynamo makes holistic and informed decisions. With suitable adaptations, Dynamo's generic approach can be applied to control any performance feature that similarly requires balancing cost-benefit trade-offs.

Our simulation results show that with just 386 bytes of overall additional storage, ACB delivers an 8% performance improvement over a baseline processor similar to Intel Skylake [1]. Since ACB requires little additional hardware and saves 22% of the baseline mispredictions, it helps both power and performance. We also show that ACB overcomes some of the fundamental limitations of past compiler-based optimizations and scales seamlessly to future processors, which are expected to be even more bound by branch mis-speculations.

II. BACKGROUND AND MOTIVATION

Modern branch predictors use program history to predict future outcomes of a branch [2], [4], [5]. Decades of research have made them very accurate. However, there remains a class of branches that are still hard to predict. Many such branches are data dependent and are difficult to predict using just program history [6].

We characterized branch mispredictions on our selected workloads². We found that on average, in a given program phase, 64 branch PCs contribute more than 95% of all dynamic mispredictions. Analysing the types of H2P branches reveals that a majority of total mispredictions come from direct conditional branches, of which 72% come from convergent conditional branches. We define convergent branches as those branches whose taken and not-taken paths can converge at some later point in the program (using the same convergence criterion as DMP [7]). Loops are naturally converging and contribute another 13%. The remaining 13% of conditional branches exhibit non-converging control flow. These observations lead us to conclude that the majority of branch mis-speculations can be addressed by focusing on a small set of 64 convergent conditional H2P branches.

A. Program Criticality

The performance of any OOO processor is bound by the critical path of execution. The critical path can be conceptually understood as the sequence of (data/control) dependent instructions which determines the total execution cycles of a program. Fields et al. [16] presented a graph-based definition of the critical path, where the critical path is the maximum weighted path in the data-dependency graph (DDG). Instructions whose execution lies on this path are critical for performance.

² Study list used is described in Section IV.

Branch mis-speculation appears on the critical path as a control dependency between the mispredicting branch and the



correct target fetched after branch resolution. While most branch mispredictions usually lie on the critical path, not all instances are critical for performance. Some mispredictions lie in the shadow of other, more critical events (e.g. long latency loads that miss the LLC) and may not be critical.

B. Predication

One possible solution to the branch misprediction problem is to prevent speculation when an H2P branch is encountered. Static predication provides code for both the taken and not-taken directions of conditional hammocks, but the run-time execution is conditionally data-dependent on the branch outcome. Most ISAs have some support for static predication [17], [18]. Even though predication reduces critical path length by preventing pipeline flushes upon mispredictions, it substitutes control dependencies with data dependencies in the execution of the program. This limits instruction level parallelism and can elongate the critical path. To mitigate this, past approaches have dynamically applied predication only on branch instances having low confidence from branch prediction [7], [13], [14].

Wish Branches [12] relies on the compiler to provide predicated code for every branch PC. For every dynamic branch instance, branch prediction confidence is used to select between fetching the predicated code or speculating normally. However, this approach increases the compiled code footprint. Dynamic Hammock Predication (DHP [11]) uses the compiler to identify simple, short hammocks which can be predicated dynamically (and profitably), and fetches both directions of the hammock in hardware. The Diverge Merge Processor (DMP) [7] improves upon both Wish Branches and DHP. DMP uses compiler analysis and profiling to identify frequently mispredicting branch candidates and modifies the compiled binary to supply convergence information for frequently converging, complex control flow patterns. Using ISA support and changes to the processor front-end, DMP fetches both taken and not-taken paths of the conditional branch. The Register Alias Table (RAT) in the OOO is forked and both paths are renamed separately. Select micro-ops are injected to dynamically predicate the data outcome from both paths.

By predicting branch confidence separately at run-time, DMP tries to effectively predicate only those instances that are likely to mispredict, and delivers significant performance gains. However, as we will analyze in the following section, predication-based strategies like DMP can create new critical paths of execution, which are difficult to comprehend just by monitoring branch confidence. Also, training data-sets used by the compiler (for developing static/profiling-based branch selection criteria) can be very different from the actual testing data seen during execution. Since many H2P branches are data dependent, the efficacy of compiler analyses [15] is dependent on the quality of the profiled input. As a result, application of DMP and similar schemes may result in performance inversions on certain workloads. Moreover, such schemes need simultaneous changes to the hardware, compiler as well as ISA, which makes their practical implementation challenging. In Section V-C, we will quantitatively discuss the performance of DMP and contrast it with our proposal.

C. Effects of Predication on Critical Path

As mentioned above, predication has costs that must be weighed against the benefit of saving mispredictions by eliminating speculation on branches. An imbalance in this delicate trade-off can cause performance inversions. Hence, it is important to understand and consider the factors influencing this balance. Additionally, to encourage adoption on modern processors, we need techniques that are easy to implement completely in hardware, without needing support from the compiler or ISA. In this section, we will hence use program criticality to first develop an understanding of how predication changes the critical path of execution. Through this analysis, we will motivate the need for our feature.

1) Limiting Allocation: Predication, by fetching both the taken and not-taken paths of a branch, alters the critical path of execution. Figure 2(a) shows an example DDG (using notations from [16]) with and without predication. Without predication on a branch, a branch misprediction introduces the misprediction latency on the critical path. However, with predication, the critical path involves the latency of fetching the control dependent region on both directions and allocating them into the OOO (whereas the baseline speculates and fetches on only one direction).

Consider the misprediction rate for a given H2P branch as mispred_rate, and assume the taken path has T and the not-taken path has N instructions. Assume p to be the probability of the branch being taken. With predication, we need to fetch (T + N) instructions for every predicated instance. alloc_width is the maximum number of instructions that can be allocated in the OOO per cycle, and mispred_penalty is the penalty of misprediction, i.e. the total time taken to execute the mispredicting branch, signal the misprediction and the subsequent pipeline flush latency. For the baseline, misprediction increases the critical path of execution by (mispred_rate · mispred_penalty) cycles. On the other hand, with predication, the critical path increases by ((T + N) − (p · T + (1 − p) · N)) / alloc_width. Predication will be profitable if,

((1 − p) · T + p · N) / alloc_width ≤ (mispred_rate · mispred_penalty)    (1)

Equation 1 clearly shows the trade-off between higher allocations and saving the pipeline flushes caused by mispredictions. Let's assume that the allocation width (alloc_width) is 4, the pipeline flush latency (mispred_penalty) is 20 cycles, and we have equal probability of predicting taken and not-taken. If the misprediction rate is 10%, then predication will be beneficial only if the total instructions in the predicated branch body (taken and not-taken paths combined, (T + N)) are fewer than 16. On the other hand, if the branch body size is larger, say 32 instructions, then predication should be applied only for branches having a misprediction rate greater than 20%. Realistically, the actual penalty for a branch misprediction is higher than just the pipeline flush latency, since it includes the execution latency of



Fig. 2. (a) demonstrates change in the critical path due to extra allocation by predication through a Data-Dependency-Graph (defined by Fields et al. [16]). (b) gives an example of a perfectly correlating branch following a predicated branch. (c) shows an example where a critical long-latency load is dependent on a predicated branch outcome. (Instructions in (b) and (c) have the right-most logical register as destination.)

the branch-sources required for computing its outcome. Hence, equation 1 will have a higher value for mispred_penalty, and predication may be able to tolerate a somewhat larger number of extra allocations. Therefore, we can conclude that both the misprediction rate and the branch body size need to be considered to qualify any branch for predication. For those micro-architectures that allocate in the OOO in terms of micro-operations [19], this equation needs to be suitably adjusted.
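The Equation 1 trade-off and the worked example above can be sketched as a small helper. This is our own illustration, not code from the paper; the function name and parameter names are hypothetical, chosen to match the symbols in Equation 1.

```python
# Sketch of the Equation 1 trade-off: predication pays off when the extra
# allocation cost of fetching both paths is no larger than the expected
# misprediction penalty it saves.

def predication_profitable(T, N, p, mispred_rate, mispred_penalty, alloc_width):
    """Return True if Equation 1 favors predication.

    T, N            -- instructions on the taken / not-taken paths
    p               -- probability the branch is taken
    mispred_rate    -- fraction of dynamic instances mispredicted
    mispred_penalty -- cycles lost per misprediction (flush + resolution)
    alloc_width     -- instructions allocated into the OOO per cycle
    """
    # Extra allocation cycles: instructions fetched beyond what the baseline
    # would have fetched on the expected correct path.
    extra_alloc_cycles = ((1 - p) * T + p * N) / alloc_width
    # Expected cycles saved by never flushing on this branch.
    saved_flush_cycles = mispred_rate * mispred_penalty
    return extra_alloc_cycles <= saved_flush_cycles

# Worked example from the text: alloc_width = 4, 20-cycle flush, p = 0.5.
# A 10%-mispredicting branch with a 16-instruction body breaks even ...
print(predication_profitable(T=8, N=8, p=0.5, mispred_rate=0.10,
                             mispred_penalty=20, alloc_width=4))   # True
# ... while a 32-instruction body needs a misprediction rate above 20%.
print(predication_profitable(T=16, N=16, p=0.5, mispred_rate=0.10,
                             mispred_penalty=20, alloc_width=4))   # False
```

With p = 0.5 the left-hand side reduces to (T + N) / (2 · alloc_width), which is how the 16-instruction break-even point in the text falls out.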

2) Increasing Branch Mispredictions: Figure 2(b) shows a sample program where branch B1 frequently mispredicts. Since B1 is a small hammock, it should be very amenable to dynamic predication. However, there is another branch B2 that is perfectly correlated with B1, but is not amenable to predication. Interestingly, in the baseline, B2 usually does not see any misprediction since B1 is more likely to execute (and cause pipeline flushes) before B2 can be executed. Perfect correlation between them would mean that B2 will always be correctly predicted when it is re-fetched, since it knows the outcome of B1. This happens because the global branch predictor would repair the prediction of B1 when there is no predication (since global history is updated), and B2 will always learn the correlation with B1.

With predication, however, there is no update to global history from B1. Therefore, B2 will start mispredicting and the effective number of mis-speculations will not come down. In fact, because of predication on B1, B2 will now take a longer time to execute, thereby elongating the critical path. Hence, branches like B1 should not be predicated, unless B2 can also be predicated. This effect of increasing the baseline mispredictions is more pronounced in cases of dynamic predication on branches with complex control flow patterns and large control dependent regions. Since branch history update and resolution are separated in branch speculation, the branch history cannot be perfectly corrected to improve the prediction for branches following the predicated region.
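The Figure 2(b) effect can be reproduced with a toy simulation. This is our own illustration, not from the paper: B2's outcome perfectly correlates with a random (H2P) B1, and a simple history-indexed predictor for B2 stays accurate only while B1's outcome is recorded in global history.

```python
# Toy model: predicating B1 suppresses its global-history update, so a
# correlation-based predictor for B2 can no longer exploit B1's outcome.
import random

def run(predicate_b1, trials=10000, seed=0):
    rng = random.Random(seed)
    table = {}          # global-history pattern -> last observed B2 outcome
    history = 0         # 1-bit global history register (toy size)
    correct = 0
    for _ in range(trials):
        b1 = rng.randrange(2)          # B1 is effectively random (H2P)
        if not predicate_b1:
            history = b1               # speculation: B1 updates global history
        b2 = b1                        # B2 perfectly correlates with B1
        if table.get(history) == b2:
            correct += 1
        table[history] = b2            # train the predictor on B2's outcome
    return correct / trials

print(f"B1 speculated : B2 accuracy = {run(False):.2f}")  # near 1.0
print(f"B1 predicated : B2 accuracy = {run(True):.2f}")   # near 0.5
```

With B1 in the history, B2 is predicted almost perfectly; with B1 predicated (no history update), B2's accuracy collapses to chance, matching the argument above.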

3) Elongating Critical Paths: Figure 2(c) shows another example where the body of an H2P branch creates sources for a critical (long latency) load. Without predication, the load would still be launched, and may be correct if the branch prediction was correct. However, due to predication, this long latency load's dispatch is dependent upon the execution of the predicated branch. As a result, the critical path of execution may get elongated. If this H2P branch is very frequent, predication can result in a long chain of dependent instructions. In all such scenarios, resorting to normal branch speculation, even if the accuracy of branch prediction is low, may be a better solution than predication.

To summarize our learnings: we first need to detect our target branches and learn their convergence patterns. Secondly, the selection criteria for critical branches should take into account the size of the branch body and the misprediction rate. Thirdly, alterations to the critical path due to predication need to be detected and handled at run-time. Finally, predication needs to be dynamic and completely implementable in hardware. These problems motivate our proposal, which we describe in detail in the following section.

III. AUTO-PREDICATION OF CRITICAL BRANCHES (ACB)

The essential idea behind ACB is to eliminate speculation when the criteria discussed in Section II are satisfied. ACB first detects conditional critical branches and then uses a novel hardware mechanism to find out their point of reconvergence. Thereafter, a simple mechanism is used to fetch both taken and not-taken portions (up to the reconvergence point) of the conditional branch. After the ACB-branch executes in the OOO, the predicated-true path is executed, whereas small micro-architectural modifications in the pipeline make the predicated-false path transparent to program execution. Finally, a dynamic monitoring (Dynamo) scheme monitors the runtime performance and appropriately throttles ACB. We now describe the micro-architecture of ACB in more detail.

A. Learning Target Branches

As reasoned in Section II-A, not all mispredicting branch instances impact performance. However, branches that frequently mispredict invariably end up having several dynamic instances that lie on the critical path. We found that the frequency of misprediction for a given branch PC is a good measure of its criticality. Our scheme hence uses a simple criticality filter to filter out infrequently mispredicting branches (those with ≤16 mispredictions in a 200K retired instructions window). Once convergence is confirmed for a branch, we further ensure



Fig. 3. Three Types (left-most three) categorized by ACB's dynamic convergence detection algorithm. Other complex convergence patterns (right-most two) can also be condensed into the same set of Types.

by learning that it has a sufficient misprediction rate, using confidence counters in the later stages.

We also experimented with other criticality heuristics to improve the above qualification criteria. Offline analysis of data dependence graphs for different applications expectedly showed that some fraction of the branch misprediction instances are not on the critical path. However, segregating such instances on-the-fly, and with reasonable hardware, is very challenging. We considered the heuristic of counting a mis-speculation event as critical only if, at the time of misprediction, the branch is within a fourth of the ROB size from the head of the ROB (i.e. the oldest entry in the ROB). Those mispredictions which happen near retirement are more critical for performance, as they will cause a greater part of the ROB to be flushed and consequently more control-independent work to be wasted. This simple heuristic slightly improved the accuracy of the frequency based criticality filter. Such criticality heuristics can be improved by future research.
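The ROB-position heuristic above amounts to a single distance check. The sketch below is a hypothetical helper of our own (the paper does not give code), and the 224-entry ROB size is an assumption based on the Skylake-like baseline.

```python
# Hypothetical helper for the ROB-position criticality heuristic: count a
# misprediction as critical only when the branch sits in the oldest quarter
# of the ROB (a circular buffer) at resolution time.

def is_critical_misprediction(branch_pos, rob_head, rob_size=224):
    """branch_pos, rob_head: ROB slot indices in a circular buffer of
    rob_size entries. Returns True if the branch is within rob_size/4
    entries of the head (the oldest entry)."""
    distance = (branch_pos - rob_head) % rob_size   # age relative to head
    return distance < rob_size // 4

print(is_critical_misprediction(branch_pos=30, rob_head=0))    # True
print(is_critical_misprediction(branch_pos=100, rob_head=0))   # False
```

The modulo handles wrap-around of the circular ROB, so the check works regardless of where the head pointer currently sits.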

To track critical branches, ACB uses a direct-mapped Critical Table indexed by the PC of mispredicting conditional branches. Each table entry stores an 11-bit tag to prevent aliasing, a 2-bit utility counter for managing conflicts, and a 4-bit saturating critical counter. Every critical branch misprediction event (as defined by our heuristics) increments both the critical counter and the utility counter of its PC-entry. In case of conflict misses in the table, the utility counter is decremented. An old entry will be replaced by a new contending entry only if its utility counter is zero. As Section II suggested, our experimental sweeps over this table size show that a small 64-entry table provides sufficient coverage useful for performance.
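The Critical Table's update rules can be captured in a short behavioral sketch. This is our own illustration, not RTL from the paper; the index/tag extraction from the PC is an assumption (the paper does not specify the hash).

```python
# Behavioral sketch of ACB's direct-mapped Critical Table: 64 entries, each
# with an 11-bit tag, a 2-bit utility counter for conflict management, and a
# 4-bit saturating critical counter bumped on every critical misprediction.

class CriticalTable:
    ENTRIES = 64

    def __init__(self):
        # Each entry: [tag, utility (0..3), critical (0..15)] or None.
        self.table = [None] * self.ENTRIES

    @staticmethod
    def _index_tag(pc):
        idx = (pc >> 2) % CriticalTable.ENTRIES   # assumed index function
        tag = (pc >> 8) & 0x7FF                    # 11-bit tag
        return idx, tag

    def record_misprediction(self, pc):
        """Update the table for a critical misprediction at branch `pc`.
        Returns True once the 4-bit critical counter saturates (the branch
        becomes a candidate for the Learning Table)."""
        idx, tag = self._index_tag(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            entry[1] = min(entry[1] + 1, 3)        # bump utility
            entry[2] = min(entry[2] + 1, 15)       # bump critical count
            return entry[2] == 15
        if entry is None or entry[1] == 0:
            self.table[idx] = [tag, 1, 1]          # replace a dead entry
        else:
            entry[1] -= 1                          # conflict miss: decay utility
        return False

ct = CriticalTable()
saturated = [ct.record_misprediction(0x401a20) for _ in range(20)]
print(saturated.index(True) + 1)   # mispredictions needed to saturate
```

The utility counter implements the replacement policy from the text: a resident entry survives conflicts until its utility decays to zero, at which point a contending PC may take its slot.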

B. Learning Convergent Branches

The next step involves identifying convergent candidates among the identified critical branches. For this, ACB uses a single entry Learning Table (20 bytes) to detect convergence one branch at a time, which is sufficient for its functionality.

Types of Convergence: Through analysis of various control flow patterns in different workloads, we identified three generic cases by which conditional direct branches can converge. Figure 3 illustrates the three types, which we refer to as Type-1, Type-2 and Type-3. Type-1 convergence is characterized by the reconvergence point being identical to the ACB-branch target. The simplest form of Type-1 branches are IF-guarded hammocks that do not have an ELSE counterpart. Type-2 convergence is characterized by the not-taken path having some Jumper branch, which, when taken, has a branch-target that is ahead of the ACB-branch target. This naturally guarantees that the taken path, which starts from the ACB-branch target, will fall through to meet the Jumper branch target, making it the reconvergence point in this case. Type-2 covers conditional branches having a pair of IF-ELSE clauses. Finally, Type-3 convergence possesses a more complex control flow pattern (which can have either IF-only or IF-ELSE form). It is characterized by the taken path encountering a Jumper branch which takes the control flow to a target that is less than the ACB-branch target. This ensures that the not-taken path naturally falls through to meet the Jumper branch target. We have generalized these three types so that other complex cases (see Figure 3) can also be contained within this set.

However, the above description defines conditions that hold

true for only forward-going branches (where the ACB-branch

target PC is more than the branch PC). To cover the cases

of backward-going branches, we adapted our algorithm by

exploiting the commutative nature of convergence for back-

branches. We use an important observation that by simply

moving the original back-branch from the beginning of its Not-

Taken block to the beginning of its Taken block, and modifying

it accordingly to being a forward branch with target as its own

original PC, the program remains logically unchanged. Thus,

the reconvergence point detected in this modified scenario is

going to be the same as original. Figure 4 illustrates this idea

through an example.

Convergence detection mechanism is implemented during

fetch since it needs to track only the PCs of instructions being

fetched. When an entry in the critical table saturates its critical

count, we copy the branch PC into the Learning Table which is

occupied until we confirm convergence or divergence on both

its directions. The mechanism first tries to learn if the ACB-

branch is a Type-1 or Type-2 convergence. It begins by first


Fig. 4. By interchanging the perspective of branch and its target for backward-going branches, we classify among the same set of Types.

inspecting the Not-Taken path. We track the first N fetched

PCs following the ACB-branch. If we receive the target of

the ACB-branch within this interval, we classify it as Type-

1 and finish learning. Otherwise, if another taken branch is

observed whose target is ahead of the ACB-branch’s target,

then we record this branch’s target as the reconvergence point.

We then validate the occurrence of the same reconvergence

point on the next instance when the ACB-branch fetches the

Taken direction, within the same N instruction limit, before

confirming it as Type-2. If neither Type is confirmed, we leave

the ACB-branch as unclassified.

If still unclassified, we finally try to learn it as Type-3

by inspecting the Taken path. If, within N instructions, we

observe a taken branch whose target is before the ACB-branch,

then we record this branch’s target as the reconvergence point.

We then validate the occurrence of the same reconvergence

point on the next instance when the ACB-branch fetches the

Not-Taken direction. Upon success, we confirm it as Type-3.

At any stage, if we exhaust the N instruction counting

limit, we reset the Learning Table entry as a sign of non-

convergence. Upon confirmation of any Type, we copy the

branch PC to a new ACB Table entry, along with the learned

convergence information. We then vacate the corresponding

Critical Table entry and reset the Learning Table entry. Based

on the analysis in Section II-C1 and experimental sweeps, we

found N = 40 to be optimal to cover large-body convergences

that can be supported while being profitable with the given

misprediction rate thresholds.
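The first-pass learning on the Not-Taken path can be sketched as a small classifier over the fetched PC stream. The Type-1/Type-2 conditions and the N-instruction budget come from the text above; the `(pc, taken_target)` stream encoding and function name are illustrative assumptions.

```python
def classify_first_pass(branch_pc, branch_target, fetch_stream, n_limit=40):
    """First-pass convergence learning on the Not-Taken path (sketch).

    fetch_stream: iterable of (pc, taken_target) tuples for instructions
    fetched after the ACB-branch; taken_target is None for non-branches
    and not-taken branches. Returns ("type1", reconv_pc),
    ("type2-candidate", reconv_pc) or (None, None). The hardware then
    validates a Type-2 candidate on the Taken path; Type-3 learning uses
    the mirrored check (Jumper target below the ACB-branch) on the Taken
    path.
    """
    for i, (pc, taken_target) in enumerate(fetch_stream):
        if i >= n_limit:
            break  # exhausted the N-instruction budget: non-convergent
        if pc == branch_target:
            # Reached the ACB-branch target directly on the fall-through.
            return ("type1", pc)
        if taken_target is not None and taken_target > branch_target:
            # Jumper branch whose target lies beyond the ACB-branch
            # target: record it as the reconvergence-point candidate.
            return ("type2-candidate", taken_target)
    return (None, None)
```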

Criticality Confidence: We use a 32-entry, 2-way ACB Table (indexed by branch PCs) with a 6-bit saturating

probabilistic-counter. All the meta-data needed to fetch both

the paths upon ACB application on a targeted branch PC

is also stored in this table entry (detailed composition in

Table I). Before ACB can dynamically predicate, we need to

establish confidence in accordance with the trade-off described

by Equation 1. During learning, we record the combined

body size of both paths that need to be fetched (encoded in

2 bits) and proportionally set the required misprediction rate

m for this branch, using a static mapping of Body-Size-to-

Misprediction-Rate (refer Table I). The confidence counter in

the ACB table is incremented for every mis-predicting instance

of this branch that triggers a pipeline flush. It is decremented

probabilistically by 1/M (where M = 1/m − 1) on every correct

prediction. When this counter becomes higher than 32 (half

of its saturated value), we start applying ACB.
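This confidence update can be sketched as below. The counter width, the activation threshold, and the M values (16/8/4/2, from the Body-Size-to-Misprediction-Rate mapping in Table I) come from the paper; the `rng` argument is our stand-in for the hardware's pseudo-random source.

```python
import random

CONF_MAX = 63            # 6-bit saturating confidence counter
ACTIVATE_AT = 32         # half of the saturated value

def update_confidence(conf, mispredicted, M, rng=random.random):
    """Sketch of the ACB Table confidence update. M encodes the required
    misprediction rate m via M = 1/m - 1; Table I maps the combined body
    size of both paths to M (0-10 -> 16, ..., 31-40 -> 2)."""
    if mispredicted:                       # flush-triggering instance
        return min(conf + 1, CONF_MAX)
    if rng() < 1.0 / M:                    # probabilistic decrement by 1/M
        return max(conf - 1, 0)
    return conf
```

In steady state the counter drifts upward only when the branch mispredicts more often than m, so crossing `ACTIVATE_AT` approximates the profitability condition of Equation 1 for the learned body size.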

Convergence Confidence: While the confidence counter is less

than 32, we use a single-entry Tracking Table to monitor

the occurrence of the learned reconvergence point PC on

both taken and not-taken paths for every fetched branch

instance. If the learned convergence does not happen, we reset

its confidence counter. This way we exclude branches from

getting activated which tend to diverge more often. Despite

low-associativity of ACB Table, we did not observe any

major contention/thrashing issues. In our sensitivity studies,

increasing its size from 32 to 256 had negligible effect on

performance (since Learning Table acts as a filter for allocation

from Critical Table to ACB Table).

C. Run-Time Application

1) Fetching the Taken and Not-Taken Paths: After learning

branches that are candidates for ACB, we need to fetch

both directions for predicated branches at run-time. Upon

fetching every dynamic branch instance whose PC has reached

confidence in the ACB Table, we open an ACB Context that

records the target of the branch (from the Branch Target

Array), and the reconvergence point (from the ACB Table).

If the branch is Type-1 or Type-2, we override the branch

predictor decision to first fetch the Not-Taken direction. If it

is Type-3, we fetch the Taken direction first. If the convergence

was Type-1, then we will naturally reach the PC for the point

of convergence. For convergences of Type-2 and Type-3, we

wait for fetching the Jumper branch which is predicted taken

and whose target is our expected reconvergence point. One

should note that this Jumper is allowed to be a different branch

than what was seen during training. Having found the Jumper which will take us to the point of reconvergence, we now

override the target of this Jumper branch to be either ACB-

branch target (when first fetched direction is Not-Taken) or

next PC after the ACB-branch (when first fetched direction is

Taken). This step is needed to fetch the other path. Once the

convergence PC is reached, present ACB Context is closed

and we wait for another ACB-branch instance. The ACB-

branch, Jumper branch, Reconvergence point and ACB-body

instructions are all attached with a 3-bit identifier for OOO

to identify and associate every predicated region with the

corresponding ACB-branch.
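The fetch-direction choice and the Jumper override described above reduce to two small rules, sketched here. The convergence-type encoding and field names (mirroring BrTarget/BrNextPC in the ACB Context of Table I) are illustrative assumptions.

```python
def first_fetch_direction(conv_type):
    """Type-1 and Type-2 fetch the Not-Taken path first; Type-3 fetches
    the Taken path first (it falls through to the Jumper's target)."""
    return "taken" if conv_type == 3 else "not-taken"

def jumper_override_target(first_dir, br_target, br_next_pc):
    """Target forced onto the found Jumper so that the *other* path is
    fetched next: the ACB-branch target when the Not-Taken path went
    first, or the fall-through PC after the ACB-branch otherwise."""
    return br_target if first_dir == "not-taken" else br_next_pc
```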

Occasionally, the reconvergence point on either path may not be

reached. In such cases the front-end only waits for a certain

threshold (in terms of fetched instructions) beyond the allowed

convergence distance after the ACB-branch; if convergence

is not detected by then, we set the same 3-bit identifier to

indicate divergence for this instance. When the OOO receives

this signal, it forces a pipeline flush at the ACB-branch after

it resolves itself. It continues fetching from the correct target

normally thereafter. We also reset the confidence and the utility

bits in the ACB Table to make it re-train. Since we train for

convergence as well, divergence injected pipeline flushes are

rare and do not hurt performance.

2) Effective Predication in the OOO: OOO uses the ACB

identifiers set during fetch to handle the predicated region.

ACB-branch is stalled at scheduling for dispatch until ei-


ther the reconvergence-point or the divergence-identifier is

received. This stalling of ACB-branch is needed since a failure

in convergence implies incorrect fetching by ACB. To recover,

we force a pipeline flush on diverging ACB-branch instances

once their direction is known upon execution.

All instructions in the body of the ACB-branch are forced to

add the ACB-branch as a source, effectively stalling them from

execution until the ACB-branch has executed. Instructions post

the reconvergence point are free to execute. If they have true

data dependencies with any portion predicated by the ACB-

branch, they will be naturally stalled by the OOO. Once

ACB-branch executes, instructions on the predicated-true path

execute normally. However, since predicated-false path was

also allocated and OOO may have already added dependencies

for predicated-true path with predicated-false path, we need to

ensure Register Transparency beyond predicated-false path.

To achieve this aim, every instruction in the body of ACB

that is a producer of some logical register or flags, also tracks

the physical register corresponding to its logical destination.

For example, an instruction of the type mov RAX, RBX will

be tracking RAX (i.e. its destination) in the OOO. After ACB-

branch resolution, if an ACB-body instruction is identified

as belonging to the predicated-true path, we will execute it

normally as a move from RBX to RAX. If it instead turns out

as a predicated-false path instruction, then we will ignore the

original operation and it will act as a special move from RAX to RAX: it copies the last correctly produced value of RAX to

the register allocated to it for writing RAX. Since RAT provides

us with the last writer to a given logical register during OOO

allocation, we obtain the last written physical register ID from

the RAT during register renaming. Hence, the predicated-false

path is able to propagate the correct data for the live-outs it

produces, making it effectively transparent. Any instruction

on the predicated-false path, that does not produce register or

flags (like stores or branches), instantly releases its resources.
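The register-transparency fix-up at ACB-branch resolution can be sketched as a single decision per body producer. The instruction encoding, the `last_writer` map (standing in for the RAT snapshot taken at allocation), and the move-pair return value are illustrative assumptions.

```python
def resolve_acb_body_producer(inst, on_true_path, last_writer):
    """Sketch of ACB's register-transparency fix-up at resolution.

    `inst` describes a register/flags producer from the ACB body: its
    logical destination, the physical register allocated for it, and
    its source physical register. `last_writer` maps a logical register
    to the physical register that last wrote it *before* this
    instruction renamed, as read from the RAT at allocation. Returns
    the (source_preg, dest_preg) pair to execute as a move.
    """
    if on_true_path:
        # Predicated-true: execute the original operation normally.
        return (inst["src_preg"], inst["dest_preg"])
    # Predicated-false: ignore the original op; copy the last correctly
    # produced value of the logical destination into the allocated
    # physical register, making the instruction transparent.
    return (last_writer[inst["dest"]], inst["dest_preg"])
```

For the `mov RAX, RBX` example above: on the true path the move reads RBX's physical register; on the false path it instead reads RAX's previous physical register, so later consumers of RAX see the last correct value.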

Prior works [7], [11] have relied on select-micro-op

based approaches to handle correctness of data dependen-

cies after the predicated region. While using select-micro-

ops also allows the execution of the predicated region be-

fore the reconvergence point (unlike ACB which stalls it

until ACB-branch resolution), it requires complex RAT fork-

and-merge on every predicated instance. This also causes

frequent loss of performance-critical allocation bandwidth,

which becomes more significant in future wider processors.

ACB’s design choices included the relatively simpler logical-

destination tracking approach. Using these less intrusive

micro-architectural changes, we are able to achieve regis-

ter transparency without resorting to complex RAT recovery

mechanisms or re-execution as proposed in [7], [11].

3) Predicated-False Path Loads/Stores: All ACB body

loads and stores are stalled in the OOO-IQ until ACB resolves

its direction. Memory disambiguation logic [20] stalls on

stores since their addresses are not computed yet. When the

branch resolves, these are dispatched from IQ with predicated-

true/false path information. Predicated-false path loads/stores

are invalidated in Load-Store Queue (LSQ) and are excluded

from matching addresses with younger loads. These invali-

dated loads/stores deallocate (upon retirement) without dis-

patching to caches/memory. Predicated-true path loads/stores

participate in store-load forwarding within the LSQ and are

dispatched normally.
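The post-resolution handling of ACB-body memory operations reduces to the following sketch; `op` and `lsq` are our stand-ins for an LSQ entry and the queue itself.

```python
def dispatch_acb_memop(op, on_true_path, lsq):
    """Sketch of post-resolution dispatch for ACB-body loads/stores."""
    if not on_true_path:
        op["invalid"] = True     # excluded from address matching against
        return False             # younger loads; deallocates at retire
                                 # without dispatching to caches/memory
    lsq.append(op)               # participates in store-load forwarding
    return True                  # and is dispatched normally
```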

Fig. 5. Finite State Machine for Dynamo.

4) Run-Time Throttling using Dynamo: Like other pred-

ication strategies, ACB can have undesirable and dynam-

ically varying side-effects on performance as analyzed in

Section II-C. Hence, ACB requires run-time monitoring and

throttling to optimize for performance and prevent inversions.

However, performance can be affected by various diverse

phenomena which, by tracking limited local heuristics, cannot

be accurately evaluated. In fact, this is a generic problem

that affects many other features which involve balancing cost-

benefit trade-offs to maximize performance.

We propose a novel dynamic monitoring (Dynamo) algo-

rithm that monitors the run-time performance delivered by

ACB. Dynamo is a first-of-its-kind predictor that tracks ac-

tual performance and compares it with baseline performance.

Figure 5 describes the various elements of Dynamo and their

interactions. Dynamo assumes a 3-bit FSM-state for each entry

in the ACB Table, with the possible states being NEUTRAL,

GOOD, LIKELY-GOOD, LIKELY-BAD and BAD. FSM-state

transitions happen for all entries together at every W retired

instructions, which we call one epoch. Entries reaching the

final states (GOOD or BAD) do not undergo further transitions.

Choosing a very small epoch-length will be highly susceptible

to noisy IPC changes, whereas a very large observation

window will not correctly evaluate the performance impact

since a major program phase change falling within this window
might dominate the overall IPC. Through experimental

analysis, we found epoch-length of 8K to 32K instructions as

optimal (16K chosen for best performance).

Dynamo computes the cycles taken to complete a given

epoch using an 18 bit saturating counter. Allocation in the

ACB Table initializes each entry with NEUTRAL state. For

the odd-numbered epoch, Dynamo disables ACB for all the

branches except those in GOOD state. In this epoch, the base-

line performance would be observed. For the even-numbered

epoch, Dynamo enables ACB for all the branches except those

in BAD state. At the end of every odd-even pair of epochs,

Dynamo checks the difference in cycles between the two.


If the cycles have increased due to enabling ACB beyond a

thresholded factor, then it means that doing ACB for this set

of unconfirmed branches is likely bad and Dynamo transitions

the state of all the involved ACB-branches towards BAD. On

the other hand, if the cycles have improved due to ACB, then

Dynamo moves the state of all the involved ACB-branches

towards GOOD. We found this cycle-change factor to be

optimum at 1/8. Intuitively, a high threshold will be insensitive

to subtle performance degradation by ACB whereas a low

threshold will be susceptible to minor IPC changes because

of changing program execution behavior.

To identify the ACB-branches responsible for affecting IPC

in a given epoch, Dynamo also counts the per-instance activity

of each ACB-branch in a 4 bit saturating Involvement Counter,

which is incremented on every predicated dynamic instance.

State transitions of activated ACBs are allowed only if their

involvement counter is saturated. This prevents Dynamo from

associating unrelated IPC fluctuations (or natural program

phase changes) to its judgment of any activated ACB. To

make it even more robust, Dynamo does not directly transition

any branch to the final (GOOD or BAD) states. Instead it

relies on observing positive or negative impacts of the branch

consecutively to obtain a final decision regarding GOOD or

BAD. Branches in GOOD state will perform ACB while those

in BAD state are disabled henceforth. If the cycle-change factor

is within allowed thresholds, then we do not update states in

either direction and continue with the next epoch-pair.
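The per-entry FSM update at the end of each odd/even epoch pair can be sketched as below. The five states, the sticky final states, the involvement-counter gating, and the 1/8 cycle-change factor come from the text; the list encoding of states is an illustrative assumption.

```python
# Dynamo's per-entry FSM update at the end of an odd/even epoch pair.
STATES = ["BAD", "LIKELY-BAD", "NEUTRAL", "LIKELY-GOOD", "GOOD"]

def dynamo_update(state, base_cycles, acb_cycles, involvement_saturated):
    if state in ("GOOD", "BAD"):
        return state                      # final states do not transition
    if not involvement_saturated:
        return state                      # entry was not active enough to
                                          # be credited/blamed this pair
    threshold = base_cycles // 8          # cycle-change factor of 1/8
    i = STATES.index(state)
    if acb_cycles > base_cycles + threshold:
        return STATES[i - 1]              # ACB likely hurt: toward BAD
    if acb_cycles < base_cycles - threshold:
        return STATES[i + 1]              # ACB likely helped: toward GOOD
    return state                          # within noise: no update
```

Starting from NEUTRAL, a branch needs two consecutive beneficial (or harmful) epoch pairs to reach GOOD (or BAD), which is the consecutive-observation requirement described above.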

It must be noted that multiple ACBs may be learned and

simultaneously start getting applied in a given epoch. Dynamo

evaluates IPC changes with and without all the actively

working ACBs together since they eventually will be working

alongside each other. Also, since program phase changes can

potentially change the criticality of some branches, we wanted

to give a fair chance to the blocked candidates to re-learn

through Dynamo. So, we reset Dynamo state information for

all entries periodically (∼10 million retired instructions).

D. Storage Requirement

Table I lists all the tabular structures used by ACB.

Aggregate storage required by ACB is just 386 bytes.

IV. SIMULATION METHODOLOGY

We simulate an Out-of-Order x86-ISA core on a cycle-

accurate simulator that accurately models the wrong path on

branch mispredictions. The simulated core runs at 3.2 GHz, and
its micro-architecture parameters are similar to the Intel
Skylake [1] configuration. Detailed parameters are listed in Table II.

We experimented with 70 diverse, single-threaded work-

loads from different categories (details in Table III). The

performance is measured in instructions-per-cycle (IPC).

V. RESULTS

We first present the performance improvement by ACB on

our workloads in Section V-A. We then evaluate the effective-

ness of Dynamo as a throttling scheme in Section V-B. In

Section V-C, we contrast ACB with state-of-the-art dynamic

Critical Table (64 entries, 144B): Valid (1b), Tag (11b), Utility (2b), Critical Counter (4b)

ACB Table (32 entries, 188B): Valid (1b), Tag (11b), Utility (2b), Conv Type (2b), Reconv PC (16b), Confidence (6b), FSM State (3b), Involv Count (4b), Mispred Code (2b)

Learning Table (1 entry, 20B): Valid (1b), Candidate (64b), Fetch Dir (1b), Inst Counter (5b), BrTarget (32b), BrNextPC (32b), Tracking Active (1b), Flip Bit (1b), Detected Type (3b), Reconv PC (16b)

Tracking Table (1 entry, 11B): Valid (1b), Candidate (64b), Fetch Dir (1b), Inst Counter (5b), Reconv PC (16b)

ACB Context (1 entry, 21B): Valid (1b), Active ACB (64b), Conv Type (2b), Reconv PC (64b), BrTarget (32b), BrNextPC (32b), Found Jumper (1b), Inst Counter (5b)

Body-Size-Range-to-M Table (4 entries of 4b each, 2B): 0-10 → 16, 11-20 → 8, 21-30 → 4, 31-40 → 2; indexed by Mispred Code

TABLE I. DETAILS OF STRUCTURES USED BY ACB.

Front End: 4-wide fetch and decode, TAGE-ITTAGE branch predictors [2], [3], 20-cycle misprediction penalty, 4-wide rename into OOO with macro and micro fusion.

Execution: 224 ROB entries, 64 Load Queue entries, 60 Store Queue entries and 97 Issue Queue entries. 8 execution units (ports) including 2 load ports, 3 store-address ports (2 shared with load ports) and 1 store-data port. Support for vector ports (AVX). 8-wide retire with full support for bypass. Memory disambiguation predictor and out-of-order load scheduling.

Caches: 32 KB, 8-way L1 data cache with a latency of 5 cycles; 256 KB, 16-way private L2 cache with a round-trip latency of 15 cycles; 8 MB, 16-way shared LLC with a round-trip latency of 40 cycles. Aggressive multi-stream prefetching into L2 and LLC; PC-based stride prefetcher at L1.

Memory: Two DDR4-2133 channels, two ranks per channel, eight banks per rank, 64-bit data width per channel. 2 KB row buffer per bank with 15-15-15-39 (tCAS-tRCD-tRP-tRAS) timing.

TABLE II. CORE PARAMETERS USED IN OUR SIMULATOR.

predication approach. We evaluate ACB’s performance on

future OOO processors in Section V-D. Finally, we perform a

qualitative analysis of ACB’s effects on power in Section V-E.

A. Performance Summary of ACB

Figure 6 summarizes the performance benefits of apply-

ing ACB. ACB gives an overall performance gain of 8.0%

(geometric-mean) while providing an effective reduction in

branch mis-speculations by 22% on average. Figure 7 shows

a line graph correlating the performance improvement with

reduction in pipeline flushes for all our studied workloads. We

see that mis-speculation reduction correlates positively with

the observed performance gains. The largest positive outlier

(lammps) provides more than 2X speedup. Due to Dynamo’s

intervention, losses are contained within -5%. An interesting

observation comes from the analysis of outliers like soplex (on

the left-end of Figure 7), where despite significant reduction in

total mis-speculations, the performance gains are unexpectedly

low. Here, the accounted branch mispredictions are not on the

critical path of execution in the baseline itself. As seen in

Section II-A, such mispredictions are not important for perfor-


ISPEC [21]: perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk

FSPEC [21]: bwaves, gamess, milc, zeusmp, soplex, povray, calculix, gemsfdtd, tonto, lbm, wrf, sphinx3, gromacs, cactusADM, leslie3D, namd, dealII

SPEC17 [22]: cactuBSSN, lbm, cam4, pop2, imagick, nab, roms, perlbench, gcc, mcf, omnetpp, xalancbmk, x264, deepsjeng, leela, exchange, xz

SYSmark [23]: winzip, photoshop, sketchup, premiere

Client: tabletmark [24], geekbench [25], compression, 3dmark [26], eembc [27], chrome

Server: lammps [28], parsec [29]

TABLE III. ALL 70 WORKLOADS USED IN OUR EXPERIMENTS.

mance. Another side-effect of ACB is noticeable in the largest

negative outlier (omnetpp), where the mis-speculations slightly

increase after applying ACB. This relates to Section II-C2 as

ACB overrides the branch predictor decision consistently (to

fetch both paths), causing the branch history to get modified.

This starts affecting the BPU’s predictability for some other

branches due to correlation effects. These outliers represent

those scenarios where the newly manifested mispredictions

cannot be helped by ACB due to its selective coverage.

Fig. 6. All workloads (category-wise) results for ACB.

B. Analysis of Dynamo

Figure 8 compares ACB’s performance with and without

Dynamo for all workloads. Dynamo raises native ACB's
performance gain from 6.7% to 8.0%. Without Dynamo, the largest

negative outliers (eembc and SPEC-h264) suffer nearly 20%

performance loss, strongly exhibiting the negative impacts of

non-judicious predication. Dynamo helps throttle out harmful

ACB-able PCs in such cases helping recover performance.

Prior to Dynamo, we also experimented with a simpler metric
based on counting execution stalls (i.e., instructions waiting
for dispatch at the issue queue), since predication primarily
creates additional data dependencies. But in a few cases we
observed that, despite high stall counts, predication was
favorable because the saved pipeline flushes outweighed the
additional stalls incurred. This metric was also vulnerable to
bad tuning. Dynamo was designed to holistically evaluate this
trade-off for ACB.

Fig. 7. ACB’s mis-speculation and performance ratio over baseline.

C. Comparison with Prior Compiler-based Solutions

Fig. 8. Comparison of ACB against DMP.

In this section, we compare against Diverge-Merge Pro-

cessor (DMP) [7], which relies on changes to the compiler,

ISA and micro-architecture to perform selective predication on

low confidence branch predictions. We modeled the enhanced

DMP [15], which improved upon the DMP solution through

profile-assisted compiler techniques.

Figure 8 compares the performance of ACB (both with and

without Dynamo) and DMP. ACB and DMP both produce

impressive positive outliers (category A). Workloads marked

as B1 and B2 are the cases of DMP outperforming ACB. The

category B1 benefits from DMP’s multiple reconvergence point

support by compiler assisted convergence detection. ACB can

be enhanced to support the same by actively learning and

allocating multiple reconvergence points in ACB Table. For

category B2, ACB’s approach of stalling both the paths re-

duces its performance gains compared to DMP which eagerly

executes the predicated region before the branch resolves

itself. DMP achieves this with the help of select-micro-ops


based micro-architecture. We experimented with adding select-

micro-op support to ACB which improves ACB’s performance

gains by only about 0.2%. Since Dynamo already throttles

negative outliers, this scheme only helps the positive gain-

ers slightly. This trade-off justifies ACB’s logical-destination

tracking approach to save hardware complexity (RAT and fetch

forking).

Fig. 9. DMP and Oracle DMP (DMP-PBH) for Categories D and E.

Workloads marked as C in Figure 8 suffer from negative

performance impact for both DMP and ACB without Dynamo.

This clearly highlights the utility of dynamic performance

monitoring. In these workloads, both ACB and DMP qualify

similar set of branches to be predicated. Despite ACB’s

stricter qualification constraints, these branches incur more

costs by creating data-dependency based stalls in the OOO (as

explained in Section II-C3). With Dynamo we are able to iden-

tify and block such delinquent candidates. While enhanced-

DMP also has a detailed cost-benefit analysis through static

compiler-profiling, the work itself acknowledges its limitation

in being able to account for only fetch related costs and

not execution related costs [15]. Compiler-based techniques

are also susceptible to inefficiencies arising from differences

between input sets used for profiling and actual execution.

Figure 9 focuses only on Category D and E workloads,

showing a correlation between mis-speculation ratio and per-

formance over baseline. There is a significant increase in

branch mispredictions by applying DMP. This may seem

surprising at first, but as we had reasoned in Section II,

predication (DMP or ACB) changes the branch history that

is being learned by the branch predictor. It is well known

that speculative update of the branch history is very important

for branch predictor accuracy [30]. In the baseline, branch

history is always speculatively updated, assuming the previous

branch predictions were correct. When a branch mispredicts,

a pipeline flush happens and subsequent branches that are

re-fetched use the updated branch history. Hence, branch

history is always up-to-date for all valid predictions (except

on the wrong path, that is eventually flushed out). However,

when a branch is dynamically predicated, subsequently fetched

branches do not have any knowledge of the predicated branch’s

direction, since no real prediction happened for it. They will

know the direction only upon branch resolution of DMP-

ed branch in OOO. By that time branch predictions have

already happened and front-end has moved ahead, making it

impossible to correct predictions for these branches.

In case of ACB, we remove all ACB predicated instances

from the branch history, so the branch predictor adapts itself to

predict without the knowledge of the ACB-ed branch. However

in DMP, based on the branch confidence some instances are

predicated and others are not. This effectively means that many
more branch histories become possible, including wrong

branch history. Recall that the TAGE branch predictor [2]

allocates entries in longer branch-history prediction tables on every

misprediction, and the presence of unstable branch histories

results in severe thrashing of the tables. This badly affects

the baseline accuracy of the predictor, not just for the target

branch but also other branches. Also as described in Section II,

many branches are perfectly correlated with older branches,

and if the older branch is removed from the branch history by

predication, they lose their accuracy.

DMP uses compile-time profiling to pick the target H2P

branches. Unfortunately the application of DMP at runtime

changes the branch predictor behavior in some applications,

rendering the compile time profiling sub-optimal and causing

performance inversions in category D and E. To validate this

hypothesis, we ran category D and E workloads with an oracle

update of branch history and compared it to DMP and ACB

performance in Figure 9. As is evident from Figure 9, DMP-

PBH (oracle with perfect branch history), recovers most of

the losses for category D, and reduces mispredictions over

baseline. A similar observation was made by Klauser et al. [11]

for branch history update and dynamic predication.

Fig. 10. Allocation stalls comparison for Category E workloads.

Interestingly, Category E workloads are still not optimal

even with the perfect branch history. Figure 10 correlates their

performance with increase in allocation stalls of the OOO pro-

cessor. Even though these workloads reduce mispredictions, in

presence of perfect branch history, they suffer from allocation

stalls because of data dependencies with select-uops beyond

the reconvergence point. A throttling mechanism like Dynamo

is needed for such cases.


Comparison against DHP: Unlike DMP, DHP [11] per-

forms predication only on simple and short hammocks, target-

ing minimal cost of fetching the additional path as compared

to speculation. Limited by its simplicity of application, DHP

cannot cover complex, non-traditional control flows which

lead to convergence. On average, ACB delivers nearly double
the performance of DHP (8.0% vs. 4.3%). Figure 11 illustrates

this performance difference on a per-workload basis, clearly

highlighting the impact of difference in targeted coverage.

Fig. 11. Comparison of ACB against DHP. DHP has lower coverage and hence many workloads do not show sensitivity to it.

D. Effect of Core Scaling

Simulations on a scaled-up version of the present configura-

tion (8-wide with twice the execution/fetch resources) showed

that performance of ACB improves to 8.6% owing to a wider

and deeper processor amplifying the problem of inefficiency

due to mispredictions. This also highlights ACB’s improved

efficiency and robustness in tackling branch mis-speculations.

E. Qualitative Power Analysis

ACB reduces pipeline flushes by 22% leading to reduction

in the number of speculative OOO allocations. While ACB

also allocates additional instructions in OOO for the wrong

fetched path, our analysis reveals that ACB effectively reduces

the total number of OOO allocations by 5%, which naturally

translates to reduction in energy consumption.

Since tabular structures used by ACB are small and are

looked up only for branches, the front-end power increment

is insignificant. Additionally, one must note that mispredic-

tions cost power not just through pipeline flushes, but also

through re-execution of already executed (and correct) control-

independent instructions. Each eliminated misprediction con-

tributes to energy savings by preventing this wastage of work.

VI. RELATED WORK

Software predication has been studied extensively in the

past [10], [31], [32]. Popular ISAs support static predica-

tion [17], [18] but due to large overheads, the realistic benefits

are diminished [7], [12]. Wish Branches [12] rely on the

compiler to supply predicated code but applies predication

dynamically only on less predictable instances. Dynamic Ham-

mock Predication [11] targets only small, simple hammocks.

Hyperblock predication [33] uses compiler profiling to pred-

icate frequently occurring basic blocks. Generalized multi-

path execution was proposed in [34]–[36]. Diverge-Merge

Processor (DMP) [7], [15] uses branch prediction confidence

to selectively predicate conditional branches, while using the

compiler for convergence and branch selection information.

DMP outperformed previous schemes and was the focus of

our comparison. Joao et al. [37] extended dynamic predication

to indirect branches. Stephenson et al. [38] proposed another compiler-based approach that simplifies the prior hardware complexity needed to enforce correct dependence flow in predication. The hammocks they target are restricted to specific register-writing patterns in the predicated region, which the compiler supplies. As examined in Section II-C and comparatively analyzed in Section V-C, prior works do not fully comprehend the delicate performance trade-offs created by disabling speculation, causing performance inversions in certain scenarios. Additionally, they need significant changes to the hardware, compiler and ISA, making their implementation challenging. In contrast, ACB is a pure hardware solution.

Several mechanisms exploiting control independence [39], [40] also exist; they perform a selective flush on a branch mis-speculation, wherein only the control-dependent instructions are flushed and re-executed. In contrast to ACB, these techniques require complex hardware to remove, re-fetch and re-allocate the selectively flushed instructions, along with complicated methods to correct data dependencies after the pipeline flush. Skipper [41] proposed out-of-order fetch-and-execute of instructions past the control-flow convergence point to exploit control independence, but required a large area (about 6KB) to support its learning and application. SYRANT [42] simplified this approach by targeting only converging conditional branches and smarter reservation of OOO resources; however, its applicability is limited to consistently behaving branches. Control Flow Decoupling (CFD) [8] is a branch pre-computation based solution which uses the compiler to modify the targeted branches, separating the control-dependent and control-independent parts of the branch body. Hardware then resolves the control flow early, removing the need for branch prediction. The Store-Load-Branch (SLB) predictor [43] is an adjunct branch predictor that improves accuracy by targeting data-dependent branches whose associated loads are memory-dependent upon stores. It detects the dependencies between stores, loads and branches using the compiler and modifies the hardware to override branch prediction with available pre-computed outcomes. ACB is applicable on top of any baseline branch predictor, including SLB.

Rotenberg et al. [44] proposed hardware to detect only forward convergence scenarios. Collins et al. [45] proposed detecting any type of reconvergence. Their mechanism identifies the common patterns of convergence and adds dedicated hardware to the backend to simultaneously learn the different reconvergence points of different branches, all at once, by broadcasting the PCs of retiring instructions. As a result, it requires significant area (nearly 4KB) and a much more complex implementation. In contrast, ACB is extremely lightweight, with the overall mechanism needing just 386 bytes, including the reconvergence detection hardware.

VII. SUMMARY

In this paper, we have presented ACB, a lightweight mechanism, implementable completely in hardware, that intelligently disables speculation through dynamic predication of only selected critical branches, thereby mitigating some of the costly pipeline flushes caused by wrong speculation. ACB uses a combination of program-criticality-directed selection of hard-to-predict branches and runtime performance monitoring to overcome the undesirable side-effects of disabling speculation. The micro-architecture solutions invented for ACB, such as convergence detection and the dynamic performance monitor, can have far-reaching effects on future micro-architecture research. Our results on a diverse set of workloads show that ACB is a power-and-performance feature that delivers an 8% average performance gain while reducing power consumption. ACB also scales seamlessly to future out-of-order processors and continues to deliver high performance at lower power.

REFERENCES

[1] J. Doweck, W. Kao, A. K. Lu, J. Mandelblat, A. Rahatekar, L. Rappoport, E. Rotem, A. Yasin, and A. Yoaz, "Inside 6th-generation Intel Core: New microarchitecture code-named Skylake," IEEE Micro, vol. 37, no. 2, pp. 52–62, Mar 2017.

[2] A. Seznec, "A new case for the TAGE branch predictor," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: ACM, 2011, pp. 117–127. [Online]. Available: http://doi.acm.org/10.1145/2155620.2155635

[3] ——, "A 64-Kbytes ITTAGE indirect branch predictor," in Third Championship Branch Prediction (JWAC-2), 2011.

[4] A. Seznec, J. S. Miguel, and J. Albericio, "The inner most loop iteration counter: A new dimension in branch history," in 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2015, pp. 347–357.

[5] D. A. Jimenez and C. Lin, "Dynamic branch prediction with perceptrons," in Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, Jan 2001, pp. 197–206.

[6] C. Ozturk and R. Sendag, "An analysis of hard to predict branches," in 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2010, pp. 213–222.

[7] H. Kim, J. A. Joao, O. Mutlu, and Y. N. Patt, "Diverge-merge processor (DMP): Dynamic predicated execution of complex control-flow graphs based on frequently executed paths," in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06), Dec 2006, pp. 53–64.

[8] R. Sheikh, J. Tuck, and E. Rotenberg, "Control-flow decoupling," in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2012, pp. 329–340.

[9] S. Chaudhry, P. Caprioli, S. Yip, and M. Tremblay, "High-performance throughput computing," IEEE Micro, vol. 25, no. 3, pp. 32–45, May 2005.

[10] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, "Conversion of control dependence to data dependence," in Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, ser. POPL '83. New York, NY, USA: ACM, 1983, pp. 177–189. [Online]. Available: http://doi.acm.org/10.1145/567067.567085

[11] A. Klauser, T. Austin, D. Grunwald, and B. Calder, "Dynamic hammock predication for non-predicated instruction set architectures," in Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '98. Washington, DC, USA: IEEE Computer Society, 1998, pp. 278–. [Online]. Available: http://dl.acm.org/citation.cfm?id=522344.825698

[12] H. Kim, O. Mutlu, J. Stark, and Y. N. Patt, "Wish branches: Combining conditional branching and predication for adaptive predicated execution," in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 38. Washington, DC, USA: IEEE Computer Society, 2005, pp. 43–54. [Online]. Available: https://doi.org/10.1109/MICRO.2005.38

[13] T. Heil, M. Farrens, J. E. Smith, and G. Tyson, "Restricted dual path execution," 01 1999.

[14] T. H. Heil and J. E. Smith, "Selective dual path execution," 04 1998.

[15] H. Kim, J. A. Joao, O. Mutlu, and Y. N. Patt, "Profile-assisted compiler support for dynamic predication in diverge-merge processors," in International Symposium on Code Generation and Optimization (CGO'07), March 2007, pp. 367–378.

[16] B. Fields, S. Rubin, and R. Bodik, "Focusing processor policies via critical-path prediction," in Proceedings 28th Annual International Symposium on Computer Architecture, June 2001, pp. 74–85.

[17] "Intel 64 and IA-32 architectures optimization reference manual." [Online]. Available: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

[18] “Arm instruction set version 1.0 reference guide.” [Online].Available: https://static.docs.arm.com/100076/0100/arm instructionset reference guide 100076 0100 00 en.pdf

[19] A. Fog, "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers," Copenhagen University College of Engineering, pp. 02–29, 2012.

[20] S. Sethumadhavan, R. Desikan, D. Burger, C. R. Moore, and S. W. Keckler, "Scalable hardware memory disambiguation for high ILP processors," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 36. Washington, DC, USA: IEEE Computer Society, 2003, pp. 399–. [Online]. Available: http://dl.acm.org/citation.cfm?id=956417.956553

[21] J. L. Henning, "SPEC CPU2006 benchmark descriptions," SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006. [Online]. Available: http://doi.acm.org/10.1145/1186736.1186737

[22] A. Limaye and T. Adegbija, "A workload characterization of the SPEC CPU2017 benchmark suite," in 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2018, pp. 149–158.

[23] "SYSmark 2018 - BAPCo." [Online]. Available: http://bapco.com/wp-content/uploads/2018/08/SYSmark 2018 White Paper 1.0.pdf

[24] "TabletMark 2017 - white paper." [Online]. Available: https://bapco.com/wp-content/uploads/2017/02/TabletMark-2017-WhitePaper-1.0.pdf

[25] "Geekbench 4 CPU workloads." [Online]. Available: https://www.geekbench.com/doc/geekbench4-cpu-workloads.pdf

[26] "3DMark 11 - the gamer's benchmark for DirectX 11 - whitepaper." [Online]. Available: http://s3.amazonaws.com/download-aws.futuremark.com/3DMark 11 Whitepaper.pdf

[27] J. A. Poovey, T. M. Conte, M. Levy, and S. Gal-On, "A benchmark characterization of the EEMBC benchmark suite," IEEE Micro, vol. 29, no. 5, pp. 18–29, Sep. 2009.

[28] "A quick tour of LAMMPS." [Online]. Available: https://lammps.sandia.gov/workshops/Aug15/PDF/tutorial Plimpton.pdf

[29] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT), Oct 2008, pp. 72–81.

[30] E. Hao, Po-Yung Chang, and Y. N. Patt, "The effect of speculative updating branch history on branch prediction accuracy, revisited," in Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture, Nov 1994, pp. 228–232.

[31] P.-Y. Chang, E. Hao, Y. N. Patt, and P. P. Chang, "Using predicated execution to improve the performance of a dynamically scheduled machine with speculative execution," in Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques, ser. PACT '95. Manchester, UK, UK: IFIP Working Group on Algol, 1995, pp. 99–108. [Online]. Available: http://dl.acm.org/citation.cfm?id=224659.224698

[32] D. I. August, W. W. Hwu, and S. A. Mahlke, "A framework for balancing control flow and predication," in Proceedings of 30th Annual International Symposium on Microarchitecture, Dec 1997, pp. 92–103.

[33] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, "Effective compiler support for predicated execution using the hyperblock," in Proceedings of the 25th Annual International Symposium on Microarchitecture, ser. MICRO 25. Los Alamitos, CA, USA: IEEE Computer Society Press, 1992, pp. 45–54. [Online]. Available: http://dl.acm.org/citation.cfm?id=144953.144998

[34] P. S. Ahuja, K. Skadron, M. Martonosi, and D. W. Clark, "Multipath execution: Opportunities and limits," in Proceedings of the 12th International Conference on Supercomputing, ser. ICS '98. New York, NY, USA: ACM, 1998, pp. 101–108. [Online]. Available: http://doi.acm.org/10.1145/277830.277854

[35] A. Klauser and D. Grunwald, "Instruction fetch mechanisms for multipath execution processors," in MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, Nov 1999, pp. 38–47.

[36] A. Klauser, A. Paithankar, and D. Grunwald, "Selective eager execution on the polypath architecture," in Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235), July 1998, pp. 250–259.

[37] J. A. Joao, O. Mutlu, H. Kim, and Y. N. Patt, "Dynamic predication of indirect jumps," IEEE Computer Architecture Letters, vol. 7, no. 1, pp. 1–4, Jan 2008.

[38] M. Stephenson, L. Zhang, and R. Rangan, "Lightweight predication support for out of order processors," in 2009 IEEE 15th International Symposium on High Performance Computer Architecture, Feb 2009, pp. 201–212.

[39] V. R. Kothinti Naresh, R. Sheikh, A. Perais, and H. W. Cain, "SPF: Selective pipeline flush," in 2018 IEEE 36th International Conference on Computer Design (ICCD), Oct 2018, pp. 152–155.

[40] A. Gandhi, H. Akkary, and S. T. Srinivasan, "Reducing branch misprediction penalty via selective branch recovery," in Proceedings of the 10th International Symposium on High Performance Computer Architecture, ser. HPCA '04. USA: IEEE Computer Society, 2004, p. 254. [Online]. Available: https://doi.org/10.1109/HPCA.2004.10004

[41] Chen-Yong Cher and T. N. Vijaykumar, "Skipper: A microarchitecture for exploiting control-flow independence," in Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34, Dec 2001, pp. 4–15.

[42] N. Premillieu and A. Seznec, "SYRANT: Symmetric resource allocation on not-taken and taken paths," ACM Trans. Archit. Code Optim., vol. 8, no. 4, pp. 43:1–43:20, Jan. 2012. [Online]. Available: http://doi.acm.org/10.1145/2086696.2086722

[43] M. U. Farooq, Khubaib, and L. K. John, "Store-load-branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches," in 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Feb 2013, pp. 59–70.

[44] E. Rotenberg and J. Smith, "Control independence in trace processors," in MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, Nov 1999, pp. 4–15.

[45] J. D. Collins and D. M. Tullsen, "Control flow optimization via dynamic reconvergence prediction," in 37th International Symposium on Microarchitecture (MICRO-37'04), Dec 2004, pp. 129–140.
