
380C

• Where are we & where we are going
  – Managed languages
    • Dynamic compilation
    • Inlining
    • Garbage collection
    • What else can you do when you examine the heap a lot?
  – Why you need to care about workloads
  – Alias analysis
  – Dependence analysis
  – Loop transformations
  – EDGE architectures

380C lecture 18

• Garbage Collection
  – Why use garbage collection?
  – What is garbage?
    • Reachable vs live, stack maps, etc.
  – Allocators and their collection mechanisms
    • Semispace
    • Marksweep
    • Performance comparisons
    • Mark Region
  – Incremental age based collection
    • Write barriers: Friend or foe?
    • Generational
    • Beltway

Mark Region and Other Advances in Garbage Collection

Kathryn S. McKinley, University of Texas at Austin
Stephen M. Blackburn, Australian National University

PLDI’08: Immix: A Mark-Region Collector With Space Efficiency, Fast Collection, and Mutator Performance

Isn’t GC a bit retro?

“Languages without automated garbage collection are getting out of fashion. The chance of running into all kinds of memory problems is gradually outweighing the performance penalty you have to pay for garbage collection.”

Paul Jansen, managing director of TIOBE Software, in Dr Dobbs, April 2008

• Mark-Compact: Styger, 1967
• Mark-Sweep: McCarthy, 1960
• Semi-Space: Cheney, 1970

GC Fundamentals: The Time–Space Tradeoff

[Figure: time–space tradeoff plot, annotated "Our Goal"]

GC Fundamentals: Algorithmic Components

• Allocation
  – Bump Allocation
  – Free List
• Identification
  – Tracing (implicit)
  – Reference Counting (explicit)
• Reclamation
  – Sweep-to-Free
  – Compact
  – Evacuate

GC Fundamentals: Canonical Garbage Collectors

• Mark-Sweep [McCarthy 1960]: free-list + trace + sweep-to-free
• Mark-Compact [Styger 1967]: bump allocation + trace + compact
• Semi-Space [Cheney 1970]: bump allocation + trace + evacuate
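The two allocation components named above can be contrasted in a few lines. This is a minimal sketch, not code from the talk: the class names, the first-fit policy, and the 64-byte toy heap are all assumptions made for illustration.

```python
class BumpAllocator:
    """Bump allocation: hand out memory by advancing a cursor."""

    def __init__(self, start, size):
        self.cursor = start
        self.limit = start + size

    def alloc(self, nbytes):
        if self.cursor + nbytes > self.limit:
            return None  # out of space; a real collector would trigger a GC
        addr = self.cursor
        self.cursor += nbytes
        return addr


class FreeListAllocator:
    """Free-list allocation: first-fit search over (address, size) holes."""

    def __init__(self, start, size):
        self.holes = [(start, size)]

    def alloc(self, nbytes):
        for i, (addr, size) in enumerate(self.holes):
            if size >= nbytes:
                if size == nbytes:
                    del self.holes[i]
                else:
                    self.holes[i] = (addr + nbytes, size - nbytes)
                return addr
        return None  # no single hole is big enough

    def free(self, addr, nbytes):
        # Sweep-to-free hands dead cells back to the list (no coalescing here).
        self.holes.append((addr, nbytes))


bump = BumpAllocator(0, 64)
a, b = bump.alloc(16), bump.alloc(16)   # consecutive addresses: contiguous

fl = FreeListAllocator(0, 64)
x, y, z = fl.alloc(16), fl.alloc(16), fl.alloc(16)
fl.free(x, 16)
w = fl.alloc(24)   # 32 bytes are free in total, but split into two 16-byte holes
```

In this toy run the bump allocator places `a` and `b` back to back, while the free list fails the 24-byte request even though enough total space is free, which is the fragmentation and locality cost the later slides attribute to free-list designs.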

Mark-Sweep: Free List Allocation + Trace + Sweep-to-Free

✓ Space efficient
✓ Simple, very fast collection
✗ Poor locality

Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo

Mark-Compact: Bump Allocation + Trace + Compact

✓ Space efficient
✓ Good locality
✗ Expensive multi-pass collection

Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo

Semi-Space: Bump Allocation + Trace + Evacuation

✓ Good locality
✗ Space inefficient

Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo
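Trace-and-evacuate in a semi-space collector is usually described with Cheney's scan pointer. The sketch below is a simplified model under stated assumptions: objects are Python lists whose fields are indices into from-space (or `None`), and the function name and `forwarding` table are hypothetical, not taken from the slides.

```python
def cheney_collect(from_space, roots):
    """Copy everything reachable from roots into a fresh to-space.

    from_space: list of objects; each object is a list of fields, and each
    field is an index into from_space or None.
    Returns (to_space, new_roots).
    """
    to_space = []      # the bump pointer is simply len(to_space)
    forwarding = {}    # from-space index -> to-space index

    def forward(ref):
        if ref is None:
            return None
        if ref not in forwarding:
            forwarding[ref] = len(to_space)          # bump allocate in to-space
            to_space.append(list(from_space[ref]))   # copy; fields still point to from-space
        return forwarding[ref]

    new_roots = [forward(r) for r in roots]
    scan = 0
    while scan < len(to_space):        # to-space doubles as the work queue
        obj = to_space[scan]
        for i, field in enumerate(obj):
            obj[i] = forward(field)    # fix up each field, copying on first visit
        scan += 1
    return to_space, new_roots


# Objects 3 and 1 reference each other; objects 0 and 2 are garbage.
from_space = [[None], [3, None], [1], [1]]
to_space, roots = cheney_collect(from_space, roots=[3])
```

Survivors end up packed contiguously in to-space, which is where the good locality comes from; the price is that a whole second space must be held in reserve.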

Mark-Region with Sweep-to-Region

Reclamation: Sweep-to-Free, Compact, Evacuate, Sweep-to-Region

• Mark-Sweep: free-list + trace + sweep-to-free
• Mark-Compact: bump allocation + trace + compact
• Semi-Space: bump allocation + trace + evacuate
• Mark-Region: bump allocation + trace + sweep-to-region

Mark-Region: Bump Allocation + Trace + Sweep-to-Region

✓ Simple, very fast collection
✓ Space efficient
✓ Good locality
✓ Excellent performance

Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo

Naïve Mark-Region

• Contiguous allocation into regions
  – Excellent locality
  – For simplicity, objects cannot span regions
• Simple mark phase (like mark-sweep)
  – Mark objects and their containing region
• Unmarked regions can be freed
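The three bullets above can be sketched as a toy heap. Everything named here (`Obj`, `NaiveMarkRegionHeap`, the four-slot region size) is an assumption made for illustration, not the Immix implementation.

```python
REGION_SLOTS = 4  # hypothetical region capacity, in abstract "slots"


class Obj:
    def __init__(self, size, refs=()):
        self.size = size
        self.refs = list(refs)
        self.region = None


class NaiveMarkRegionHeap:
    def __init__(self, num_regions):
        self.free_regions = list(range(num_regions))
        self.used = {}        # region id -> slots handed out so far
        self.current = None   # region currently being bump-allocated into

    def alloc(self, obj):
        if obj.size > REGION_SLOTS:
            raise ValueError("objects cannot span regions")
        if self.current is None or self.used[self.current] + obj.size > REGION_SLOTS:
            if not self.free_regions:
                raise MemoryError("no free region; a collection is needed")
            self.current = self.free_regions.pop(0)
            self.used[self.current] = 0
        self.used[self.current] += obj.size   # bump the cursor
        obj.region = self.current
        return obj

    def collect(self, roots):
        seen, live_regions = set(), set()
        stack = list(roots)
        while stack:                      # mark phase, as in mark-sweep
            obj = stack.pop()
            if id(obj) in seen:
                continue
            seen.add(id(obj))
            live_regions.add(obj.region)  # mark the containing region too
            stack.extend(obj.refs)
        for region in sorted(self.used):  # sweep-to-region
            if region not in live_regions:
                del self.used[region]
                self.free_regions.append(region)
                if self.current == region:
                    self.current = None
        return sorted(live_regions)


heap = NaiveMarkRegionHeap(num_regions=3)
d = Obj(2)
a = heap.alloc(Obj(3, refs=[d]))   # region 0
b = heap.alloc(Obj(3))             # does not fit in region 0, so region 1
c = heap.alloc(Obj(2))             # region 2
heap.alloc(d)                      # region 2 (fills it)
live = heap.collect(roots=[a])     # a and d survive; b and c are garbage
```

Region 1 is reclaimed wholesale, but the dead object `c` keeps its slots because it shares region 2 with the live object `d`. That leftover space is the false-marking fragmentation that motivates finer-grained lines.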

Immix: Efficient Mark-Region Garbage Collection

Lines and Blocks

• Small regions
  ✗ Fragmentation (can’t fill blocks)
  ✗ Increased metadata o/h
  ✗ Constrained object sizes
• Large regions
  ✓ More contiguous allocation
  ✗ Fragmentation (false marking)
• Lines & blocks (block: N pages; line: approx. 1 cache line)
  ✓ Less fragmentation: objects span lines
  ✓ Fast common case: lines marked with objects
  – TLB locality, cache locality
  – Block > 4 × max object size

[Figure: block diagram labeling free and recyclable lines]

Allocation Policy (Recycling)

• Recycle partially marked blocks first
  – Minimizes fragmentation
  – Maximizes sharing of freed blocks
• Recycle in address order
  – We explored other options
• Allocate into free blocks last
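The policy above amounts to an ordering over blocks. A minimal sketch follows, assuming each block is summarized as a hypothetical `(address, marked_lines, total_lines)` tuple; ordering the free blocks by address as well is an assumption, since the slide only specifies address order for recycling.

```python
def allocation_order(blocks):
    """Return block addresses in the order the allocator should try them.

    blocks: iterable of (address, marked_lines, total_lines) tuples.
    Fully marked blocks have no reusable lines and are skipped.
    """
    recyclable = sorted(addr for addr, marked, total in blocks if 0 < marked < total)
    free = sorted(addr for addr, marked, total in blocks if marked == 0)
    return recyclable + free   # partially marked blocks first, free blocks last


blocks = [
    (0x3000, 0, 8),   # completely free
    (0x1000, 8, 8),   # fully marked: unavailable
    (0x4000, 3, 8),   # recyclable
    (0x2000, 5, 8),   # recyclable
]
order = allocation_order(blocks)
```

Keeping completely free blocks for last leaves them intact for as long as possible, so they stay available for sharing.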

Opportunistic Defragmentation

• Identify source and target blocks
  – (see paper for heuristics)
• Evacuate objects in source blocks
  – Allocate into target blocks
• Opportunistic
  – Leave in place if no space, or object pinned
• Opportunistically evacuate fragmented blocks
  – Lightweight, uses same allocation mechanism
  – No cost in common case (specialized GC)
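The "leave in place if no space, or object pinned" rule can be sketched as a single pass over the objects of a source block. The function name, the `(name, size)` object representation, and the byte budget standing in for target blocks are assumptions for illustration; the heuristics for choosing source and target blocks are in the paper and are not modeled here.

```python
def evacuate(source_objects, target_free_bytes, pinned=frozenset()):
    """Try to move each live object out of a fragmented source block.

    source_objects: list of (name, size_in_bytes) for live objects.
    target_free_bytes: space remaining in the target blocks.
    Returns (moved, left_in_place, remaining_target_bytes).
    """
    moved, stayed = [], []
    for name, size in source_objects:
        if name in pinned or size > target_free_bytes:
            stayed.append(name)        # opportunistic: mark in place instead
        else:
            target_free_bytes -= size  # bump allocate into the target space
            moved.append(name)
    return moved, stayed, target_free_bytes


result = evacuate(
    [("a", 32), ("b", 64), ("c", 16), ("d", 48)],
    target_free_bytes=100,
    pinned={"c"},
)
```

Here `a` and `b` are copied, `c` stays because it is pinned, and `d` stays because only 4 bytes of target space remain. No object is ever lost when evacuation fails; it is simply marked where it is.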

Other Optimizations

Implicit Marking

✓ Most objects small
  – Small objects implicitly mark next line
✓ V. fast common case
  – Large objects mark lines exactly

[Figure: line mark and implicit line mark]

Overflow Allocation

• Multi-line objects may skip many small holes
• Overflow allocation (used on failure)
✓ Large objects uncommon
✓ V. effective solution
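Implicit marking can be sketched as follows. This is one reading of the slide, not the Immix source: the 128-byte line size and both function names are assumptions. A small object marks only the line where it starts, and the allocator then treats the line after any marked line as unavailable, which covers any spill into that line.

```python
LINE = 128  # bytes per line; an assumed value for this sketch


def mark_lines(addr, size, marks):
    """Record line marks for a live object at byte address addr."""
    if size <= LINE:
        marks.add(addr // LINE)   # start line only; spill is covered implicitly
    else:
        first, last = addr // LINE, (addr + size - 1) // LINE
        marks.update(range(first, last + 1))   # larger objects mark lines exactly


def free_lines(marks, num_lines):
    """Lines the allocator may reuse: unmarked and not right after a marked line."""
    return [n for n in range(num_lines) if n not in marks and (n - 1) not in marks]


marks = set()
mark_lines(100, 64, marks)    # small object starting in line 0, spilling into line 1
mark_lines(512, 300, marks)   # larger object covering lines 4 to 6
reusable = free_lines(marks, num_lines=10)
```

The common case (a small object) costs a single mark, at the price of conservatively giving up one line after each marked run.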

Results

Complete data available at:

http://cs.anu.edu.au/~Steve.Blackburn/pubs

Evaluation

• 20 Benchmarks
  – DaCapo
  – SPECjvm98
  – SPEC jbb2000
• Hardware
  – Core 2 Duo: 2.4GHz, 32KB L1, 4MB L2, 2GB RAM
  – AMD Athlon 3500+: 2.2GHz, 64KB L1, 512KB L2, 2GB RAM
  – PowerPC 970: 1.6GHz, 32KB L1, 512KB L2, 2GB RAM
• Collectors
  – Full Heap: Immix, MarkSweep, MarkCompact, SemiSpace
  – Generational: GenIX, GenMS, GenCopy
  – Sticky: StickyIX, StickyMS
• Methodology
  – MMTk
  – Jikes RVM 2.9.3 (Perf ≈ HotSpot 1.5)
  – Replay compiler
  – Discard outliers
  – Report 95th %ile

Please see the paper for details.

Mutator Time

Geomean of DaCapo, jvm98 and jbb2000 on 2.4GHz Core 2 Duo
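The results are summarized with a geometric mean across benchmarks. As a reminder of what that computes, here is a minimal sketch; the input values are made-up normalized times, not data from the talk.

```python
import math


def geomean(values):
    """Geometric mean: the n-th root of the product, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark times, each normalized to some baseline.
normalized_times = [2.0, 8.0, 4.0]
summary = geomean(normalized_times)
```

Unlike the arithmetic mean, the geometric mean of normalized ratios does not depend on which system is chosen as the baseline, which is why it is a common choice for benchmark suites.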

Minimum Heap

GC Time

Geomean of DaCapo, jvm98 and jbb2000 on 2.4GHz Core 2 Duo

Total Performance

Geomean of DaCapo, jvm98 and jbb2000 on 2.4GHz Core 2 Duo

Generational Performance

Geomean of DaCapo, jvm98 and jbb2000 on 2.4GHz Core 2 Duo

Sticky Performance

Geomean of DaCapo, jvm98 and jbb2000 on 2.4GHz Core 2 Duo

PseudoJBB 2000

On 2.4GHz Core 2 Duo

Prior Work

• IBM product collector
  – http://www.ibm.com/developerworks/ibm/library/i-garbage1/
  – Mark-Region not characterized
  – Collector not evaluated
  – Product and basis for other research
• [Domani et al. 2000], [Kermany & Petrank 2006]

Mark-Region Collection

Reclamation: Sweep-to-Free, Compact, Evacuate, Sweep-to-Region

• Mark-Sweep: free-list + trace + sweep-to-free
• Mark-Compact: bump allocation + trace + compact
• Semi-Space: bump allocation + trace + evacuate
• Mark-Region: bump allocation + trace + sweep-to-region

Immix: Efficient Mark-Region Collection

✓ Simple, very fast collection
✓ Space efficient
✓ Good locality
✓ Excellent performance

Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo

Open Source

Code available in JikesRVM 2.9.3 onward.

http://www.jikesrvm.org

Complete data available at:

http://cs.anu.edu.au/~Steve.Blackburn/pubs


Research History

• PLDI 1997
  – Clinger & Hansen postulated the radioactive decay model for object lifetimes
• Genesis of Older-First
  – [Stefanovic, McKinley, Moss OOPSLA’99]
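In the radioactive decay model, objects die like decaying atoms: the chance of surviving a further interval does not depend on how old the object already is. The sketch below illustrates that property; the function names and the `half_life` parameter are assumptions chosen for illustration, with age measured in arbitrary units.

```python
def survival(age, half_life):
    """Fraction of objects still alive at the given age under exponential decay."""
    return 0.5 ** (age / half_life)


def conditional_survival(age, extra, half_life):
    """Chance that an object already of the given age survives `extra` longer."""
    return survival(age + extra, half_life) / survival(age, half_life)


young = conditional_survival(age=0, extra=5, half_life=10)
old = conditional_survival(age=100, extra=5, half_life=10)
```

Under this model `young` and `old` agree, so age alone does not predict death. That is the setting in which collecting the youngest objects first stops being the obvious choice and older-first becomes worth considering.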

Garbage Collection Hypotheses

• Generational hypothesis: younger objects die quickly, so collect them first
• Older-first hypothesis: the collector can collect less the longer it waits

[Figure: survival function s(v) for the object lifetime distribution, plotted over an age-ordered heap (younger to older) with axis ticks 0, 1/2 V, V]

Older-first Algorithm

Next Steps

• Beltway
  – [BJMM PLDI’02]
  – Increments
  – Belts
  – Combines generational and older-first
• Ulterior Reference Counting
  – [BM OOPSLA’03]
  – Reference count on a per-object basis
  – Responsiveness and throughput
• MMTk: [BCM SIGMETRICS’04 ICSE’04]
  – Toolkit for building & understanding GC
  – Motivated today’s work

Garbage Collection is the Answer to All Your Problems

• Improves data and code locality
  – [Huang et al. OOPSLA’02 ISMM’04, VEE’04]
• Cooperative GC optimizations
  – Colocation [Guyer OOPSLA’05]
  – Free-me [Guyer et al. PLDI’06]
• Finds leaks
  – [Bond ASPLOS’06, Jump POPL’07]
• Tolerates leaks
  – [Bond OOPSLA’08]
• Helps with dynamic software updating!
  – [Subramaniam, Hicks ??’08]
• DaCapo Benchmarks
  – [Blackburn et al. OOPSLA’06 CACM’08]

380C

• Where are we & where we are going
  – Why you need to care about workloads
  – Managed languages
    • Dynamic compilation
    • Inlining
    • Garbage collection
      – Opportunity to improve data locality on-the-fly
      – Read: X. Huang, S. M. Blackburn, K. S. McKinley, J. E. B. Moss, Z. Wang, and P. Cheng, The Garbage Collection Advantage: Improving Program Locality, ACM Conference on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp. 69-80, Vancouver, Canada, October 2004.
  – Alias analysis
  – Dependence analysis
  – Loop transformations
  – EDGE architectures