June 10th 2006 | ISMM’06 Ottawa, Ontario, Canada © 2006 IBM Corporation
Improving Locality with Parallel Hierarchical Copying GC
David Siegwart, IBM Software Group
Martin Hirzel, IBM Watson Research Center
Talk Summary
Motivation
Background & Related Work
Hierarchical Copying GC, Parallelized.
Evaluation across wide range of benchmarks.
Conclusions
Motivation
Improving Locality:
– Commercial workloads spend 45% of their time stalled on memory requests. [Adl-Tabatabai et al, PLDI’04 - SPECjbb2000 on Itanium II]
– Object order in memory influences misses.
– Copying GC can relocate objects, changing object ordering.
– Objective: co-locate objects that are used together, on the same page or cache line.
Maintaining Scalability:
– Parallelism and workload balancing are essential for server workloads.
Related Objects are Used Together
Looked at consecutive field accesses:
– siblings
– child-parent
For SPECjbb2005:
– 29% siblings
– 14% child-parent
For a Trade6 primitive (J2EE benchmark):
– 36% siblings
– 8% child-parent
Copying GC should have:
– good locality for siblings
– good locality for child-parent.
Background
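[Figure: lineage of copying GC algorithms on a time axis. Nodes: Cheney, Moon, Wilson/Lam/Moher, Halstead, Imai/Tick, Parallel Hierarchical. Years on the axis: 1970, 1984, 1985, 1992, 1993, 2006. Edge labels: + hierarchical, – rescanning, + parallel, + load balancing.]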
Cheney Copying GC – Good for Siblings
[Figure: object tree copied breadth-first into to-space; animation frames show the scan and free pointers advancing.]
o1
o2 o3
o4 o5 o6 o7
o8 o9 o10 o11 o12 o13 o14 o15
[Legend: parent, child; copied; copied & scanned]
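The breadth-first order on this slide can be reproduced with a minimal sketch of Cheney's scan/free loop. It is an illustration only: to-space is modelled as a Python list, the integer i stands for object oi, and the tree shape (oi has children o2i and o2i+1) is an assumption read off the figure. The names `cheney_copy_order` and `tree` are hypothetical.

```python
def cheney_copy_order(root, children):
    """Return the order in which objects land in to-space (breadth-first)."""
    to_space = [root]      # copying the root advances the free pointer
    forwarded = {root}     # objects that already have a to-space copy
    scan = 0               # index of the next object to scan
    while scan < len(to_space):          # scan == free means the copy is done
        for child in children.get(to_space[scan], ()):
            if child not in forwarded:
                forwarded.add(child)
                to_space.append(child)   # copy the child at the free pointer
        scan += 1
    return to_space


# Assumed shape of the 15-object tree: o_i has children o_2i and o_2i+1.
tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
order = cheney_copy_order(1, tree)
print(order)
```

Siblings such as o4 to o7 end up next to each other, while a parent and its children drift further apart the deeper the tree gets.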
Cheney Copying GC – Bad for Parent-Child (SPECjbb2005)
[Figure: histogram for Cheney (breadth first). x-axis: scanned slot to copied object distance (log2), buckets 1 to 26; y-axis: proportion, 0% to 30%. Markers: 64 byte cache line, page size (4 kB).]
– Increases working set, hence TLB misses and L2 cache misses
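How a to-space layout turns into this kind of histogram can be sketched as follows. This is not the measurement used in the talk: it assumes every object is 16 bytes, approximates the scanned-slot-to-copied-object distance by the address difference between parent and child, and uses the hypothetical names `distance_histogram` and `OBJ_SIZE`. Bucket k holds distances d with 2^k <= d < 2^(k+1), so a 64 byte cache line corresponds to bucket 6 and a 4 kB page to bucket 12.

```python
from collections import Counter

OBJ_SIZE = 16  # assumed object size in bytes (hypothetical)


def distance_histogram(order, children, obj_size=OBJ_SIZE):
    """Count parent-to-child distances per floor(log2(distance in bytes))."""
    address = {obj: i * obj_size for i, obj in enumerate(order)}
    hist = Counter()
    for parent, kids in children.items():
        for child in kids:
            distance = abs(address[child] - address[parent])
            # bit_length() - 1 equals floor(log2(distance)) for distance >= 1
            hist[distance.bit_length() - 1] += 1
    return hist


# Assumed 15-object binary tree laid out in breadth-first order.
tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
breadth_first = list(range(1, 16))
hist = distance_histogram(breadth_first, tree)
for bucket in sorted(hist):
    print(bucket, hist[bucket])
```

With only 15 small objects every distance stays far below a page; the sketch only shows the bucketing, not the scale of a real heap.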
Depth-First Copying – Good for Parent-Child
[Figure: the same object tree, copied depth-first into to-space.]
o1
o2 o3
o4 o5 o6 o7
o8 o9 o10 o11 o12 o13 o14 o15
– Bad for Siblings (o4, o5, o6, o7 are on separate pages)
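A minimal sketch of depth-first copying shows why siblings get separated. It assumes the same 15-object binary tree (oi has children o2i and o2i+1) and uses an explicit stack; the name `depth_first_copy_order` is hypothetical, and real depth-first collectors differ in how they manage the stack.

```python
def depth_first_copy_order(root, children):
    """Return the order in which objects land in to-space (depth-first)."""
    to_space = []
    forwarded = set()
    stack = [root]
    while stack:
        obj = stack.pop()
        if obj in forwarded:
            continue
        forwarded.add(obj)
        to_space.append(obj)                 # copy at the free pointer
        # Push children in reverse so the first child is copied next.
        stack.extend(reversed(children.get(obj, ())))
    return to_space


# Assumed shape of the 15-object tree: o_i has children o_2i and o_2i+1.
tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
order = depth_first_copy_order(1, tree)
print(order)
print([order.index(o) for o in (4, 5, 6, 7)])  # positions of the siblings o4..o7
```

Each parent is followed directly by its first child, but the siblings o4 to o7 are spread across the whole layout.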
Moon’s Hierarchical Copying GC
[Figure: the object tree (o1; o2, o3; o4 to o7; o8 to o15) copied into to-space blocks A, B, C, D, E; animation frames show the scan, partial, and free pointers, with some slots marked re-scanned.]
Two scan pointers: scan, partial
Wilson, Lam & Moher’s Hierarchical Copying GC
[Figure: the object tree (o1; o2, o3; o4 to o7; o8 to o15) copied into to-space blocks A, B, C, D, E, each block with its own scan pointer. Animation frames alternate between "scan block = copy block" and "scan block ≠ copy block".]
– Scan pointer in each block: avoids re-scanning.
– Aliasing scan block to copy block reduces copy-scan distances.
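The block-based idea on this slide can be sketched in a few lines. This is a simplified single-threaded illustration, not Wilson, Lam & Moher's actual algorithm. Assumptions: objects are unit-sized, a block holds 3 objects (15 objects over blocks A to E), oi has children o2i and o2i+1, and when the copy block has nothing left to scan the collector falls back to the oldest block with unscanned slots. The name `hierarchical_copy_order` is hypothetical.

```python
def hierarchical_copy_order(root, children, block_size=3):
    """Return the to-space order produced by block-based hierarchical copying."""
    to_space = [root]
    forwarded = {root}
    next_obj = [0]   # per block: index in to_space of the object being scanned
    next_slot = [0]  # per block: next reference slot within that object

    def has_work(b):
        # A block has work while its scan pointer is behind the block end and free.
        return next_obj[b] < min((b + 1) * block_size, len(to_space))

    while True:
        copy_block = len(to_space) // block_size   # block containing the free pointer
        while len(next_obj) <= copy_block:         # lazily create per-block scan pointers
            next_obj.append(len(next_obj) * block_size)
            next_slot.append(0)
        if has_work(copy_block):
            b = copy_block                         # scan block = copy block
        else:
            b = next((i for i in range(len(next_obj)) if has_work(i)), None)
            if b is None:
                break                              # every block is fully scanned
        slots = children.get(to_space[next_obj[b]], ())
        if next_slot[b] >= len(slots):             # object finished: advance scan pointer
            next_obj[b] += 1
            next_slot[b] = 0
            continue
        child = slots[next_slot[b]]
        next_slot[b] += 1
        if child not in forwarded:
            forwarded.add(child)
            to_space.append(child)                 # copy at free, then re-check the copy block
    return to_space


tree = {i: (2 * i, 2 * i + 1) for i in range(1, 8)}
order = hierarchical_copy_order(1, tree)
blocks = [order[i:i + 3] for i in range(0, len(order), 3)]
print(blocks)
```

Because each block keeps its own scan position, scanning can leave a block mid-object and resume there later without re-scanning, and each block ends up holding a parent together with its children.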
Imai and Tick’s Parallel Copying GC
[Figure: to-space blocks and a work pool shared by Thread 1, Thread 2, ..., Thread n. Each thread holds a scan block and a copy block: either scan block ≠ copy block, or scan block = copy block (aliased).]
Recognising the Connection
[Figure: the work pool with Thread 1 and Thread 2, showing the two cases scan block ≠ copy block and scan block = copy block (aliased).]
Wilson, Lam & Moher (hierarchical, not parallel)
Imai & Tick (parallel, not hierarchical)
The immediacy of aliasing in WLM is what distinguishes it from Imai and Tick.
So immediate aliasing in Imai & Tick gives hierarchical copying.
Need to increase aliasing in Imai & Tick to improve locality.
Immediate Aliasing
Check for an aliasing opportunity immediately after each reference slot in each object has been scanned.
Interrupt scanning at this point, and restart with the aliased block.
Easier to see via a transition diagram.
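The per-slot check described above might look roughly like the sketch below. Everything here is hypothetical naming (`Block`, `GcThread`, `after_slot_scanned`), and the local `scanlist` is only a stand-in for the shared work pool; a real parallel collector would need synchronisation, which is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    scan: int = 0   # next slot to scan in this block
    free: int = 0   # next slot to copy into in this block

    def has_unscanned(self):
        return self.scan < self.free


@dataclass
class GcThread:
    scan_block: Block
    copy_block: Block
    scanlist: list = field(default_factory=list)  # stand-in for the shared work pool

    def after_slot_scanned(self):
        """Run after every reference slot; return True if the thread aliased."""
        if self.scan_block is self.copy_block:
            return False                          # already aliased
        if not self.copy_block.has_unscanned():
            return False                          # nothing to scan in the copy block
        if self.scan_block.has_unscanned():
            self.scanlist.append(self.scan_block) # hand back the unfinished scan block
        self.scan_block = self.copy_block         # alias: scan block = copy block
        return True


a = Block("A", scan=1, free=3)
b = Block("B", scan=0, free=1)
thread = GcThread(scan_block=a, copy_block=b)
first = thread.after_slot_scanned()
second = thread.after_slot_scanned()
print(first, second, thread.scan_block.name)
```

Checking after every slot, rather than only when a scan block is exhausted, is what makes the aliasing immediate in this sketch.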
Parallel Hierarchical – Block State Transitions
[Figure: block state transition diagram. States: freelist, copy, scan, scanlist, aliased, done. Legend: shared data.]
Parent-Child Distances for Parallel Hierarchical (SPECjbb2005)
[Figure: histogram comparing Breadth-First and Hierarchical. x-axis: scanned slot to copied object distance (log2), buckets 1 to 26; y-axis: proportion, 0% to 30%. Markers: 64 byte cache line, page size (4 kB).]
– fewer TLB misses, fewer L2 cache misses
Baseline GC
IBM J9 JVM; its GC has two generations:
Parallel copying for the young generation:
– two semi-spaces
– most GCs are of this type.
Concurrent mark for the old generation:
– stop-the-world phase (rare, compared to young collection).
Results – 26 Benchmark Suite
[Figure: bar chart of % speedups (1 - PH/BF) per benchmark, y-axis from -10% to 25%. Benchmarks on the x-axis: SPECjbb2005, db, javasrc, mtrt, jbytemark, javac, chart, jpat, banshee, javalex, jython, eclipse, mpegaudio, compress, fop, hsqldb, kawa, soot, batik, jack, antlr, jess, ps, bloat, pmd, ipsixql.]
heap size 10x min, except SPECjbb2005
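The axis label (1 - PH/BF) can be read as a simple formula. The sketch below assumes PH and BF are run times of the parallel hierarchical and breadth-first collectors on the same benchmark, so a positive value means hierarchical is faster. The function name and the sample times are hypothetical and are not results from the talk.

```python
def percent_speedup(ph_time, bf_time):
    """Percent speedup of PH over BF, computed as 100 * (1 - PH/BF)."""
    if bf_time <= 0:
        raise ValueError("bf_time must be positive")
    return 100.0 * (1.0 - ph_time / bf_time)


# Made-up run times in seconds, for illustration only.
faster = percent_speedup(90.0, 100.0)    # PH faster: roughly +10%
slower = percent_speedup(105.0, 100.0)   # PH slower: roughly -5%
print(round(faster, 2), round(slower, 2))
```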
Results – Scalability SPECjbb2005
Windows 2000 Advanced Server 5.0.2195 SP4
4x (1.6GHz HT Pentium 4 Xeon)
256kB L2 (64 byte cache line), 1MB L3, 2GB RAM
Base Build: J9 5.0 GA pwi32dev-20051104
[Figure: throughput (y-axis, 0 to 5) versus number of warehouses (x-axis, 0 to 16) for Hierarchical and Breadth-First.]
GC Scaling – SPECjbb2005
Windows 2000 Advanced Server 5.0.2195 SP4
4x (1.6GHz HT Pentium 4 Xeon)
256kB L2 (64 byte cache line), 1MB L3, 2GB RAM
Base Build: J9 5.0 GA pwi32dev-20051104
[Figure: normalized transactions / (GC time) (y-axis, 0 to 4) versus number of GC threads (x-axis, 0 to 8) for Breadth-First and Hierarchical.]
Mutator vs Collector - db
Linux
1x (3.06 GHz HT Pentium 4 Xeon)
512kB L2 (64 byte cache line), 1GB RAM
Base Build: J9 5.0 GA pxi32dev-20051104
[Figure, panel "Mutator Time": normalized mutator time (y-axis, 1 to 1.5) versus heap size relative to minimum heap size (x-axis, 1 to 10) for Hierarchical and Breadth-First.]
[Figure, panel "GC Time": normalized GC time (y-axis, 1 to 3) versus heap size relative to minimum heap size (x-axis, 1 to 10) for Hierarchical and Breadth-First.]
Cache & TLB Misses - db
Linux
1x (3.06 GHz HT Pentium 4 Xeon)
512kB L2 (64 byte cache line), 1GB RAM
Base Build: J9 5.0 GA pxi32dev-20051104
[Figure, first panel: normalized mutator L1 cache misses (y-axis, 1 to 1.5) versus heap size relative to minimum heap size (x-axis, 1 to 10) for Hierarchical and Breadth-First.]
[Figure, second panel: normalized mutator TLB misses (y-axis, 1 to 1.5) versus heap size relative to minimum heap size (x-axis, 1 to 10) for Hierarchical and Breadth-First.]
Conclusions
Introduced a new algorithm:
– Improves memory locality
– Maintains good scalability
Two technologies in one: hierarchical decomposition and parallel copying GC.
Requires no online profiling.
Evaluated across a wide range of benchmarks:
– Better locality: a dramatic reduction in TLB misses, and fewer L1 misses.
– Cost on the collector is outweighed by the benefit to the mutator.
– The majority of benchmarks show improvements.
Backup
Related Work
[Figure: related work arranged by memory-hierarchy level (L1, L2, TLB, paging), language (C/C++, Java, Lisp, …), and mechanism (OS, allocator, prefetching, moving GC). Works shown: Ch./La. ‘98, Huang ‘04, Shuf ‘02 (shown twice), Adl-T. ‘04, Lattner ‘04, La./Ad. ’05, Ch./Hi. ‘01, Cascaval ‘05, Moon ‘84, Kistler/Fra. ‘03, Wi/La/Mo. ’91.]
Results – 26 Benchmark Suite – other heap sizes
[Figure: bar chart of % speedups (1 - PH/BF) per benchmark, y-axis from -10% to 25%, with one series per heap size: 1.33x, 2x, 4x, 10x. Benchmarks on the x-axis: SPECjbb2005, db, javasrc, mtrt, jbytemark, javac, chart, jpat, banshee, javalex, jython, eclipse, mpegaudio, compress, fop, hsqldb, kawa, soot, batik, jack, antlr, jess, ps, bloat, pmd, ipsixql.]