Branch Mispredictions in Quicksort
K. Kaligosi (1), C. Martínez (2), P. Sanders (3)
(1) Max-Planck-Inst., Germany
(2) Univ. Politècnica de Catalunya, Spain
(3) Univ. Karlsruhe, Germany
AofA 2006
Alden Biesen, Belgium
Introduction
- Modern hardware executes several sequential instructions in a pipelined fashion
- Jump instructions pose a major challenge!
- So we try to predict which branch will be taken ...
- Branch mispredictions are expensive: we have to roll back the pipeline
Introduction
- In comparison-based algorithms, we want comparisons to yield as much information as possible => difficult to predict!
- In static branch prediction, jump instructions are statically predicted as TAKEN or NOT TAKEN
- In dynamic branch prediction, the hardware predicts what to do during execution, taking the past into account
  - 1-bit: We predict the instruction will take the same direction it took the last time it was executed
  - 2-bit: We must be wrong twice before we change the prediction
  - ...
2-bit Predictor
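[Figure: state diagram of a 2-bit predictor with four states, two that predict NOT TAKEN (PNT) and two that predict TAKEN (PT); transitions are labelled T (taken) and NT (not taken).]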
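The two dynamic schemes can be compared on a simple loop branch. The sketch below assumes the common saturating-counter variant of the 2-bit predictor (states 0 and 1 predict NOT TAKEN, states 2 and 3 predict TAKEN); the diagram on the slide may differ in detail. The function names, the initial predictor states, and the helper `loop_pattern` are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// Mispredictions of a 1-bit predictor on a sequence of branch outcomes
// (true = taken). The initial prediction is an assumption of this sketch.
int mispredictions_1bit(const std::vector<bool>& outcomes, bool initial = false) {
    bool last = initial;
    int misses = 0;
    for (bool taken : outcomes) {
        if (taken != last) ++misses;  // predict "same as last time"
        last = taken;
    }
    return misses;
}

// Mispredictions of a 2-bit saturating counter (assumed variant):
// states 0,1 predict NOT TAKEN; states 2,3 predict TAKEN.
int mispredictions_2bit(const std::vector<bool>& outcomes, int state = 0) {
    assert(state >= 0 && state <= 3);
    int misses = 0;
    for (bool taken : outcomes) {
        bool predicted_taken = state >= 2;
        if (predicted_taken != taken) ++misses;
        if (taken) { if (state < 3) ++state; }
        else       { if (state > 0) --state; }
    }
    return misses;
}

// Hypothetical helper: outcomes of a loop-closing branch that is taken
// iters-1 times and then not taken, repeated for `entries` loop entries.
std::vector<bool> loop_pattern(int iters, int entries) {
    std::vector<bool> out;
    for (int e = 0; e < entries; ++e) {
        for (int t = 0; t + 1 < iters; ++t) out.push_back(true);
        out.push_back(false);
    }
    return out;
}
```

For `loop_pattern(4, 10)` the 1-bit scheme misses twice per loop entry (20 in total), while the 2-bit counter settles to one miss per entry after a warm-up (12 in total).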
Partition
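#include <utility>
using std::swap;

// Partition A[i..j] around the pivot, which has already been placed in A[i].
// Assumes a sentinel A[j+1] >= A[i], so that Loop S cannot run past j.
// On return, k holds the final position of the pivot.
// (The signature is assumed; the slide shows only the body.)
template <typename Elem>
void partition(Elem A[], int i, int j, int& k) {
  int l = i; int u = j + 1; Elem pv = A[i];
  for ( ; ; ) {
    do ++l; while (A[l] < pv);  // Loop S: skip elements smaller than the pivot
    do --u; while (A[u] > pv);  // Loop G: skip elements greater than the pivot
    if (l >= u) break;
    swap(A[l], A[u]);
  }
  swap(A[i], A[u]); k = u;
}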
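A minimal sketch of how such a partition might sit inside a quicksort that picks its pivot from a sample, here the median of three elements. The names `partition_around_first` and `quicksort_median3`, the choice of sample positions, and the explicit bound check on `l` (used instead of a sentinel) are assumptions of this sketch, not taken from the slides.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Partition A[i..j] around the pivot stored in A[i]; returns the pivot's
// final index. The test "l <= j" replaces the sentinel and adds a branch
// that the analysis in the talk does not account for.
template <typename Elem>
int partition_around_first(std::vector<Elem>& A, int i, int j) {
    int l = i, u = j + 1;
    Elem pv = A[i];
    for (;;) {
        do ++l; while (l <= j && A[l] < pv);  // Loop S
        do --u; while (A[u] > pv);            // Loop G (stops at A[i] at the latest)
        if (l >= u) break;
        std::swap(A[l], A[u]);
    }
    std::swap(A[i], A[u]);
    return u;
}

// Quicksort on A[i..j] using the median of A[i], A[mid], A[j] as pivot.
template <typename Elem>
void quicksort_median3(std::vector<Elem>& A, int i, int j) {
    if (i >= j) return;
    int m = i + (j - i) / 2;
    // Sort the three sample elements so that the median ends up in A[m].
    if (A[m] < A[i]) std::swap(A[m], A[i]);
    if (A[j] < A[i]) std::swap(A[j], A[i]);
    if (A[j] < A[m]) std::swap(A[j], A[m]);
    std::swap(A[i], A[m]);  // move the pivot to the front
    int k = partition_around_first(A, i, j);
    quicksort_median3(A, i, k - 1);
    quicksort_median3(A, k + 1, j);
}
```

A call such as `quicksort_median3(v, 0, (int)v.size() - 1)` sorts the vector `v` in place.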
Setting up the Recurrences
- Probability that the chosen pivot is the kth smallest element out of the n: $\pi_{n,k}$
- Average number of branch mispredictions when partitioning an array of size n and the pivot is the kth: $b_{n,k}$
- Average number of branch mispredictions when partitioning an array of size n:
  $$b_n = \sum_{1 \le k \le n} \pi_{n,k} \cdot b_{n,k}$$
Setting up the Recurrences
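- Average number of branch mispredictions $B_n$ to sort n elements:
  $$B_n = b_n + \sum_{k=1}^{n} \pi_{n,k} \cdot (B_{k-1} + B_{n-k})$$
- We will later consider the total cost $T_n$, which satisfies the same recurrence with toll function
  $$t_n = n + \xi \cdot b_n + o(n),$$
  where $\xi$ is the relative cost of a branch misprediction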
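The recurrence for $B_n$ can be evaluated numerically. The sketch below assumes plain quicksort without sampling, so the pivot rank is uniform ($\pi_{n,k} = 1/n$), and takes as an example the toll $b_{n,k} = \min(k-1, n-k)$ from the static-prediction analysis of this talk; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns B_0..B_n for the recurrence
//   B_m = b_m + sum_{k=1..m} pi_{m,k} (B_{k-1} + B_{m-k}),
// with pi_{m,k} = 1/m (assumed: no sampling) and
// b_{m,k} = min(k-1, m-k) (static-prediction toll, used as an example).
std::vector<double> expected_mispredictions(int n) {
    assert(n >= 0);
    std::vector<double> B(n + 1, 0.0);
    double prefix = 0.0;  // B_0 + ... + B_{m-1}
    for (int m = 1; m <= n; ++m) {
        double toll = 0.0;  // sum over k of b_{m,k}
        for (int k = 1; k <= m; ++k) toll += std::min(k - 1, m - k);
        double b_m = toll / m;
        // With uniform ranks, sum_k (B_{k-1} + B_{m-k}) = 2 * prefix.
        B[m] = b_m + 2.0 * prefix / m;
        prefix += B[m];
    }
    return B;
}
```

The inner loop makes this quadratic in n, which is fine for checking small cases by hand; for instance, $B_3 = 1/3$ and $B_4 = 2/3$.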
Sampling
- It is well known that using samples to select the pivot of each recursive stage improves the average performance of quicksort and reduces the probability of worst-case behavior
- For quicksort with samples of size s from which we pick the (p+1)th element as the pivot, we have
  $$\pi_{n,k} = \frac{\binom{k-1}{p}\binom{n-k}{s-1-p}}{\binom{n}{s}}$$
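As a quick sanity check of the formula for $\pi_{n,k}$, the sketch below evaluates it with floating-point binomials; summing over all ranks k should give 1. The function names are illustrative, and the double-precision binomial is only intended for small arguments.

```cpp
#include <cassert>

// Binomial coefficient as a double; returns 0 outside 0 <= k <= n.
double binom(int n, int k) {
    if (k < 0 || k > n) return 0.0;
    double r = 1.0;
    for (int i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Probability that the (p+1)th smallest of a random sample of size s,
// drawn from n distinct elements, is the kth smallest of the n.
double pivot_rank_probability(int n, int k, int s, int p) {
    assert(1 <= s && s <= n);
    assert(0 <= p && p < s);
    assert(1 <= k && k <= n);
    return binom(k - 1, p) * binom(n - k, s - 1 - p) / binom(n, s);
}
```

With s = 1 and p = 0 this reduces to the uniform case 1/n, and with n = s = 3, p = 1 the pivot is the true median with probability 1.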
Sampling
- A typical case is to pick the median of the sample, with $s = 2t+1$ and $p = t$
- We can use variable-size samples with $s = s(n)$; then $s \to \infty$ as $n \to \infty$, but s must grow sublinearly, $s = o(n)$; we use $\alpha$ to denote the relative rank of the pivot within the sample => e.g., $\alpha = 1/2$ means choosing the median of the sample
General results
Theorem
The average number of branch mispredictions to sort n elements with quicksort using samples of size s and choosing the (p+1)th in the sample of each stage is
$$B_n = \frac{\beta(s,p)}{H(s,p)}\, n \ln n + O(n),$$
where
$$H(s,p) = H_{s+1} - \frac{p+1}{s+1} H_{p+1} - \frac{s-p}{s+1} H_{s-p}$$
and
$$\beta(s,p) = \lim_{n\to\infty} \frac{b_n}{n} = \lim_{n\to\infty} \frac{1}{n} \sum_{1 \le k \le n} \pi^{(s,p)}_{n,k}\, b_{n,k}$$
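The denominator $H(s,p)$ is easy to tabulate. The sketch below reads $H_m$ as the m-th harmonic number, which reproduces the classical $n \ln n$ coefficients for comparisons, $1/H(1,0) = 2$ without sampling and $1/H(3,1) = 12/7$ for median-of-3. The function names are hypothetical.

```cpp
#include <cassert>

// m-th harmonic number H_m = 1 + 1/2 + ... + 1/m (H_0 = 0).
double harmonic(int m) {
    double h = 0.0;
    for (int i = 1; i <= m; ++i) h += 1.0 / i;
    return h;
}

// H(s,p) = H_{s+1} - (p+1)/(s+1) H_{p+1} - (s-p)/(s+1) H_{s-p}
double H_sp(int s, int p) {
    assert(s >= 1 && 0 <= p && p < s);
    return harmonic(s + 1)
         - (p + 1.0) / (s + 1.0) * harmonic(p + 1)
         - static_cast<double>(s - p) / (s + 1.0) * harmonic(s - p);
}
```

For the median of a growing sample, `H_sp(2 * t + 1, t)` approaches $\ln 2$, the value of the entropy-like function $H(x)$ at $x = 1/2$ that appears in the variable-size case.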
General results
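Theorem
For variable-sized sampling, if $s \to \infty$ as $n \to \infty$ with $s = o(n)$, and $p/s \to \alpha$, then
$$B_n = \frac{\beta(\alpha)}{H(\alpha)}\, n \ln n + o(n \log n),$$
with $\beta(\alpha) = \lim_{n\to\infty} \beta(s, \alpha \cdot s + o(s))$ and
$$H(x) = -(x \ln x + (1-x) \ln(1-x))$$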
General results
Theorem
The total cost $T_n$ of quicksort is given by
$$T_n = \frac{1 + \xi \cdot \beta(s,p)}{H(s,p)}\, n \ln n + O(n), \qquad s = \Theta(1)$$
and
$$T_n = \frac{1 + \xi \cdot \beta(\alpha)}{H(\alpha)}\, n \ln n + o(n \log n), \qquad s = \omega(1),\ s = o(n)$$
General results
- In order to compute $\beta(s,p)$, we can use, under suitable conditions,
  $$\beta(s,p) = \frac{s!}{p!\,(s-1-p)!} \int_0^1 x^p (1-x)^{s-1-p}\, b(x)\, dx$$
  with
  $$b(x) = \lim_{n\to\infty} \frac{b_{n,x \cdot n}}{n}$$
- Computing $\beta(\alpha)$ is easier!
  $$\beta(\alpha) = b(\alpha)$$
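When no closed form is at hand, the integral for $\beta(s,p)$ can be approximated numerically. The sketch below uses a plain midpoint rule and computes the factorial prefactor through `std::lgamma`; the function name, the step count, and the choice of quadrature are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Midpoint-rule approximation of
//   beta(s,p) = s!/(p!(s-1-p)!) * integral_0^1 x^p (1-x)^(s-1-p) b(x) dx
double beta_sp(int s, int p, const std::function<double(double)>& b,
               int steps = 200000) {
    assert(s >= 1 && 0 <= p && p < s && steps > 0);
    // log of s!/(p!(s-1-p)!), using lgamma(m+1) = log(m!)
    double log_c = std::lgamma(s + 1.0) - std::lgamma(p + 1.0)
                 - std::lgamma(static_cast<double>(s - p));
    double h = 1.0 / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) {
        double x = (i + 0.5) * h;  // midpoint of the i-th subinterval
        double log_w = log_c + p * std::log(x) + (s - 1 - p) * std::log1p(-x);
        sum += std::exp(log_w) * b(x);
    }
    return sum * h;
}
```

With $b(x) = 2x(1-x)$ (the 1-bit case of this talk) and median-of-3, the result is close to 2/5; with $b(x) = \min(x, 1-x)$ and no sampling it is close to 1/4.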
General results
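- The optimal value $\alpha^*$ for $\alpha$ minimizes the total cost, i.e., minimizes
  $$\tau_\xi(\alpha) = \frac{1 + \xi \cdot \beta(\alpha)}{H(\alpha)}$$
  and depends on $\xi$
- It’s not difficult to prove that for any s and p,
  $$\frac{\beta(s,p)}{H(s,p)} > \frac{\beta(\alpha^*)}{H(\alpha^*)}$$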
General results
- In general, there exists a threshold value $\xi_c$ such that if $\xi \le \xi_c$ (branch mispredictions are not too expensive) then we have to take the median of the samples, i.e., $\alpha^* = 1/2$
- If $\xi > \xi_c$ (that can happen often in practice!) then $\alpha^* < 1/2$ and it is given by the unique solution in $[0, 1/2)$ of the equation
  $$\xi \cdot b'(\alpha)\, H(\alpha) = (1 + \xi \cdot b(\alpha))\, H'(\alpha)$$
  (provided that $b(x)$ is in $C^2[0, 1/2)$)
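Instead of solving the stationarity equation, the optimal relative rank can be located by minimizing the cost factor $(1 + \xi \cdot b(\alpha)) / H(\alpha)$ directly. The sketch below uses a ternary search on $(0, 1/2]$, which assumes the cost factor is unimodal there (decreasing, then possibly increasing); the function names and the iteration count are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// H(x) = -(x ln x + (1-x) ln(1-x)), for 0 < x < 1.
double entropy(double x) {
    return -(x * std::log(x) + (1.0 - x) * std::log1p(-x));
}

// Ternary search for the minimizer of (1 + xi*b(a)) / H(a) on (0, 1/2].
// Assumes the objective is unimodal on that interval.
double optimal_rank(double xi, const std::function<double(double)>& b) {
    assert(xi >= 0.0);
    auto cost = [&](double a) { return (1.0 + xi * b(a)) / entropy(a); };
    double lo = 1e-9, hi = 0.5;
    for (int it = 0; it < 200; ++it) {
        double m1 = lo + (hi - lo) / 3.0;
        double m2 = hi - (hi - lo) / 3.0;
        if (cost(m1) < cost(m2)) hi = m2; else lo = m1;
    }
    return 0.5 * (lo + hi);
}
```

With the 1-bit function $b(x) = 2x(1-x)$ from this talk, a cheap misprediction such as $\xi = 1$ gives the median (0.5), while $\xi = 20$ pushes the optimum down to a strongly skewed pivot, roughly 0.09.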
General results
- The threshold value $\xi_c$ is the solution of
  $$\left. \frac{d^2 \tau_\xi(\alpha)}{d\alpha^2} \right|_{\alpha = 1/2} = 0$$
- That is
  $$\xi_c = \frac{-4}{b''(1/2) \ln 2 + 4\, b(1/2)}$$
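The threshold formula only needs $b(1/2)$ and $b''(1/2)$, so it can be evaluated for any smooth $b$. The sketch below approximates the second derivative by a central difference; the function name and the step size `h` are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// xi_c = -4 / (b''(1/2) ln 2 + 4 b(1/2)), with b''(1/2) approximated by
// the central difference (b(1/2+h) - 2 b(1/2) + b(1/2-h)) / h^2.
double threshold_xi(const std::function<double(double)>& b, double h = 1e-4) {
    assert(h > 0.0 && h < 0.5);
    double b_half = b(0.5);
    double b2 = (b(0.5 + h) - 2.0 * b_half + b(0.5 - h)) / (h * h);
    return -4.0 / (b2 * std::log(2.0) + 4.0 * b_half);
}
```

For the 1-bit function $b(x) = 2x(1-x)$ this evaluates to about 5.177, the value $2/(2 \ln 2 - 1)$ quoted in the 1-bit section of the talk.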
Static branch prediction
- We analyze here optimal prediction: if the position of the pivot $k \le n/2$ then we predict Loop S not taken and Loop G taken, and the other way around otherwise
- If $k \le n/2$ we incur a branch misprediction every time there is an element which is smaller than the pivot; symmetrically, if $k > n/2$ then the number of branch mispredictions is $n - k$
- Hence, $b_{n,k} = \min(k-1, n-k)$, $b(\alpha) = \min(\alpha, 1-\alpha)$ and
  $$\tau_\xi(\alpha) = \frac{1 + \xi \cdot \min(\alpha, 1-\alpha)}{H(\alpha)}$$
Static branch prediction
[Figure: plot of the optimal relative rank $\alpha^*$ against $\xi$, for $\xi$ from 0 to 30.]
The value of $\alpha^*$ as a function of $\xi$
1-bit branch prediction
- The number of branch mispredictions is twice the number of exchanges: we incur a misprediction each time we abandon the loops S and G
- Hence, $b_{n,k} \sim 2(k-1)(n-k)/n$ and $b(\alpha) = 2\alpha(1-\alpha)$
1-bit branch prediction
- We can analyze in full detail the performance when using fixed-size samples. For example, for median-of-(2t+1) we have
  $$\beta(2t+1, t) = \frac{t+1}{2t+3}$$
- For variable-size samples, $\beta(\alpha) = 2\alpha(1-\alpha)$.
- The threshold is then at $\xi_c = 2/(2 \ln 2 - 1) \approx 5.177\ldots$ and $\alpha^*$ is the solution of
  $$\ln\alpha + 2\xi\alpha^2 \ln\alpha = \ln(1-\alpha) + 2\xi(1-\alpha)^2 \ln(1-\alpha)$$
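Both constants on this slide can be checked by hand from the general results, taking $b(x) = 2x(1-x)$. The derivation below uses the integral formula for $\beta(s,p)$ with $s = 2t+1$, $p = t$, the Beta integral $\int_0^1 x^a (1-x)^c\,dx = a!\,c!/(a+c+1)!$, and the threshold formula with $b(1/2) = 1/2$ and $b''(1/2) = -4$.

```latex
\begin{align*}
\beta(2t+1,t)
  &= \frac{(2t+1)!}{t!\,t!} \int_0^1 x^t (1-x)^t \cdot 2x(1-x)\,dx
   = \frac{2\,(2t+1)!}{(t!)^2} \cdot \frac{((t+1)!)^2}{(2t+3)!} \\
  &= \frac{2(t+1)^2}{(2t+2)(2t+3)} = \frac{t+1}{2t+3}, \\[4pt]
\xi_c
  &= \frac{-4}{b''(1/2)\ln 2 + 4\,b(1/2)}
   = \frac{-4}{-4\ln 2 + 2}
   = \frac{2}{2\ln 2 - 1} \approx 5.177 .
\end{align*}
```

As $t \to \infty$ the first expression tends to $1/2 = b(1/2)$, in line with $\beta(\alpha) = b(\alpha)$ for variable-size samples.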
1-bit branch prediction
[Figure: plot of the optimal relative rank $\alpha^*$ against $\xi$, for $\xi$ from 0 to 30.]
The value of $\alpha^*$ as a function of $\xi$
2-bit branch prediction
- In (Kaligosi, Sanders, 2006), an approximate model to compute $b_{n,k}$ is given, from which
  $$b(x) = \frac{2x^4 - 4x^3 + x^2 + x}{1 - x(1-x)}$$
  follows
- We are working on a more refined analysis of $b_{n,k}$ for this prediction scheme; once $b_{n,k}$ has been found, we should only have to apply the machinery shown here
Some real data
[Figure: running time / (n lg n) in ns plotted against lg n (from 10 to 26) for four pivot choices: random pivot, median of 3, exact median, skewed pivot n/10.]
Time vs. size on a Pentium 4 (from (Kaligosi, Sanders, 2006))
Some real data
[Figure: running time / (n lg n) in ns plotted against $1/\alpha$ (from 2 to 18) for $n = 2^{12}$, $n = 2^{19}$ and $n = 2^{26}$.]
Time vs. $1/\alpha$ on a Pentium 4
Some real data
[Figure: running time / (n lg n) in ns plotted against lg n (from 10 to 26) for four pivot choices: random pivot, median of 3, exact median, skewed pivot n/10.]
Time vs. size on an Athlon 64
Some real data
[Figure: running time / (n lg n) in ns plotted against lg n (from 10 to 26) for four pivot choices: random pivot, median of 3, exact median, skewed pivot n/10.]
Time vs. size on an Opteron
Some real data
[Figure: running time / (n lg n) in ns plotted against lg n (from 10 to 24) for four pivot choices: random pivot, median of 3, exact median, skewed pivot n/10.]
Time vs. size on a Sun
Future work
- Complete the analysis of static branch prediction with fixed-size samples (it’s not easy to obtain $\beta(s,p)$ for general s and p!)
- Analyze the 2-bit prediction scheme and possibly others
- Conduct additional experiments, compare theoretical analysis to real data
- Analyze branch mispredictions and their impact on the performance of other algorithms