+ All Categories
Home > Documents > Www.intel.com/labs Just-In-Time Java Compilation for the Itanium Processor Tatiana Shpeisman...

Www.intel.com/labs Just-In-Time Java Compilation for the Itanium Processor Tatiana Shpeisman...

Date post: 26-Dec-2015
Category:
Upload: anissa-watkins
View: 219 times
Download: 2 times
Share this document with a friend
Popular Tags:
22
www.intel.com/labs Just-In-Time Java Compilation for the Itanium Processor Tatiana Shpeisman Tatiana Shpeisman Guei-Yuan Lueh Guei-Yuan Lueh Ali-Reza Adl-Tabatabai Ali-Reza Adl-Tabatabai Intel Labs Intel Labs
Transcript

www.intel.com/labs

Just-In-Time Java Compilation for the Itanium Processor

Just-In-Time Java Compilation for the Itanium Processor

Tatiana ShpeismanTatiana Shpeisman

Guei-Yuan LuehGuei-Yuan Lueh

Ali-Reza Adl-TabatabaiAli-Reza Adl-Tabatabai

Intel LabsIntel Labs

2 www.intel.com/labs

IntroductionIntroduction Itanium processor is statically scheduled machineItanium processor is statically scheduled machine

Aggressive compiler techniques to extract ILPAggressive compiler techniques to extract ILP

Just-In-Time (JIT) compiler must be fastJust-In-Time (JIT) compiler must be fast

Must consider time & space efficiency of optimizationsMust consider time & space efficiency of optimizations

Balance compilation time with code qualityBalance compilation time with code quality

Light-weight compilation techniques Light-weight compilation techniques

Use heuristics for modeling micro architectureUse heuristics for modeling micro architecture

Leverage semantics and meta data of JVMLeverage semantics and meta data of JVM

3 www.intel.com/labs

OutlineOutline IntroductionIntroduction

Compiler overviewCompiler overview

Register allocationRegister allocation

Code schedulingCode scheduling

Other optimizationsOther optimizations

ConclusionsConclusions

4 www.intel.com/labs

Compiler StructureCompiler Structure

PrepassPrepass

InliningInlining

Global optimizationsGlobal optimizations

IR constructionIR construction

Code SelectionCode Selection

Register AllocationRegister Allocation

Code EmissionCode Emission

GC SupportGC Support

Front-endFront-end Back-endBack-end

Code SchedulingCode Scheduling

PredicationPredication

5 www.intel.com/labs

Register AllocationRegister Allocation Compilation time vs. code quality tradeoffCompilation time vs. code quality tradeoff

IPF architecture has large register filesIPF architecture has large register files

128 integer, 128 floating-point, 64 predicate, 8 branch128 integer, 128 floating-point, 64 predicate, 8 branch

Register Stack Engine (RSE) provides 96 stack registers Register Stack Engine (RSE) provides 96 stack registers to each procedureto each procedure

Use linear scan register allocationUse linear scan register allocation

““Linear Scan Register AllocationLinear Scan Register Allocation” by Massimiliano Poletto ” by Massimiliano Poletto and Vivek Sarkarand Vivek Sarkar

6 www.intel.com/labs

Live Range vs. Live IntervalLive Range vs. Live Interval...

t1=t1=......

v =v = t1t1

...= v= v

B1

B2 B3

B4

t2=t2=......

v =v = t2t2

t1=t1=......

v =v = t1t1

t2=t2=......

v =v = t2 t2

...= v= v

...B1

B2

B4

B3

Live RangesLive Ranges Live IntervalsLive Intervals

7 www.intel.com/labs

Coalescing AlgorithmCoalescing Algorithm Coalesce Coalesce vv and and tt in in v =v = tt iff iff

Live interval of Live interval of tt ends at ends at v = tv = t

Live interval of Live interval of tt does not does not intersect with live range of intersect with live range of vv

Requires one additional Requires one additional reverse pass over IRreverse pass over IR

OO(N(NINSTINST + N + NVARVAR * N * NBBBB))

t1=t1=......

v =v = t1t1

t2=t2=......

v =v = t2 t2

...= v= v

...B1

B2

B4

B3

8 www.intel.com/labs

Coalescing SpeedupCoalescing Speedup

0

1

2

3

4

5

6

7%

spe

edup

ove

r no

coa

lesc

ing

9 www.intel.com/labs

Code SchedulingCode Scheduling Forward cycle-based list schedulingForward cycle-based list scheduling

Scheduling unit is extended basic blockScheduling unit is extended basic block

Middle exits are due to run-time exceptionsMiddle exits are due to run-time exceptions

(p6,p7) = cmp.eq r35, 0(p6,p7) = cmp.eq r35, 0

(p6) br ThrowNullPointerException(p6) br ThrowNullPointerException

r10 = r35 + 16r10 = r35 + 16

r11 = ld8 [r10]r11 = ld8 [r10]

10 www.intel.com/labs

Type-based memory disambiguationType-based memory disambiguation Use JVM meta data to disambiguate memory Use JVM meta data to disambiguate memory

locationslocations

Type Type

Integer, floating-point, object reference …Integer, floating-point, object reference …

Kind Kind

Object field, array element, virtual table address …Object field, array element, virtual table address …

Field idField id

putfield #10 vs. putfield #15putfield #10 vs. putfield #15

11 www.intel.com/labs

Type-Based DisambiguationType-Based Disambiguation

00.5

11.5

22.5

_201

_com

pre ss

_202

_jess

_209

_db

_222

_mpe

gau d

_227

_mtrt

_228

_jack

%

spee

dup o

ver n

o dis

ambig

uatio

n

12 www.intel.com/labs

Exception DependenciesException Dependencies Java exceptions are preciseJava exceptions are precise

Naive approachNaive approach

Exception checks end basic blocksException checks end basic blocks

Our approachOur approach

Instruction depends on exception check iffInstruction depends on exception check iff

Its destination is live at the exception handler, orIts destination is live at the exception handler, or

It is an exception check for different exception typeIt is an exception check for different exception type

It is a memory reference that may be guarded by It is a memory reference that may be guarded by checkcheck

13 www.intel.com/labs

Exception Dependency ExampleException Dependency Example

1:1: (p6, p0) = cmp.eq r16, 0(p6, p0) = cmp.eq r16, 0

2:2: (p6)(p6) brbr ThrowNullPointerExceptionThrowNullPointerException

6: 6: f8 = fld [r21]f8 = fld [r21] // load static// load static

5: 5: r21 = movl r21 = movl 0x000F14E320190000x000F14E32019000

4:4: r18 = ld [r17]r18 = ld [r17] // load field// load field

3:3: r17 = add r16, 8r17 = add r16, 8

14 www.intel.com/labs

Exception DependenciesException Dependencies

0

5

10

15

20

25%

sp

eed

up

ove

r n

aive

15 www.intel.com/labs

IPF ArchitectureIPF Architecture Execution (functional) unit type – M, I, F, B Execution (functional) unit type – M, I, F, B

Instruction (syllable type) – M, A, I, F, B, ILInstruction (syllable type) – M, A, I, F, B, IL

Bundles, templates Bundles, templates .mii .mi;;i .mil .mmi .m;;mi .mfi .mmf .mib .mbb .bbb .mmb ..mii .mi;;i .mil .mmi .m;;mi .mfi .mmf .mib .mbb .bbb .mmb .

mfbmfb

Instruction group – no WAR, WAW with some Instruction group – no WAR, WAW with some exceptionsexceptions

.mi;;i r10 = ld [r15]

r9 = add r8, 1 ;; // stop bit

r16 = shr r9, r32

16 www.intel.com/labs

Template SelectionTemplate Selection Pack instructions into bundlesPack instructions into bundles

Choose slot for each instructionChoose slot for each instruction

Insert NOP instructions Insert NOP instructions

Assign instructions to functional unitsAssign instructions to functional units

Problem: Problem:

Resource over subscription Resource over subscription

Inaccurate bypass latenciesInaccurate bypass latencies

17 www.intel.com/labs

Greedy slot assignmentGreedy slot assignment

Sort instruction by syllable typeSort instruction by syllable type

M < F < IL < I < A < BM < F < IL < I < A < B

I1: r20 = sxt r14 (I-type)

I2: r21 = movl ADDR (IL-type)

I3: f15 = fadd f10, f11 (F-type)

AlgorithmAlgorithm

NOP I1 NOP

NOP I2

NOP I3 NOP

Unsorted

NOP I3 I1

NOP I2

Sorted

18 www.intel.com/labs

Template Selection HeuristicsTemplate Selection Heuristics

0

1

2

3

4

5

6

% s

peed

up o

ver

gree

dy s

trat

egy

19 www.intel.com/labs

Bypass Latency AccuracyBypass Latency Accuracy

r17 = add r16, 8M-Unit

r17 = add r16, 8I-Unit

r18 = ld [r17]M-Unit1 2

Phase ordering of functional unit assignment

Code selection time is too early: underutilizes resources

Template selection time too late: inaccurate scheduling latencies

Solution: Assign to functional unit during scheduling

Assign to M-Unit if available, else

Assign to I-Unit and increment latency

20 www.intel.com/labs

Modeling of Address Computation LatencyModeling of Address Computation Latency

0

0.5

1

1.5

2

2.5

3

3.5

% s

pee

du

p o

ver

ign

ori

ng

byp

ass

21 www.intel.com/labs

Other optimizationsOther optimizations PredicationPredication

Profitability depends on a benchmarkProfitability depends on a benchmark

Performance variations within 2%Performance variations within 2%

Branch hintsBranch hints

Up to 50% speedup from using branch hintsUp to 50% speedup from using branch hints

Sign-extension eliminationSign-extension elimination

1% potential gain for our compiler1% potential gain for our compiler

22 www.intel.com/labs

ConclusionsConclusions Light-weight optimizations techniques for ItaniumLight-weight optimizations techniques for Itanium

Considering micro architecture is importantConsidering micro architecture is important

Cannot ignore bypass latenciesCannot ignore bypass latencies

Template selection should be resource sensitiveTemplate selection should be resource sensitive

Language semantics helps to improve ILPLanguage semantics helps to improve ILP

Type-based memory disambiguationType-based memory disambiguation

Exception dependency eliminationException dependency elimination


Recommended