Transcript

1

Controlled Resource and Data Sharing in Multi-Core Platforms*

Sandhya Dwarkadas
Department of Computer Science
University of Rochester

*Joint work with Arrvindh Shriraman, Hemayet Hossain, Xiao Zhang, Hongzhou Zhao, Rongrong Zhong, Michael L. Scott, Michael Huang, and Kai Shen

    2

    The University of Rochester

    • Small private research university

    • 4400 undergraduates

    • 2800 graduate students

    • Set on the Genesee River in Western NY State, near the south shore of Lake Ontario

    • 250km by road from Toronto; 590km from New York City

    3 4

The Computer Science Dept.

• 15 tenure-track faculty; 45 Ph.D. students
• Specializing in AI, theory, and parallel and distributed systems
• Among the best small departments in the US


5

The Hardware-Software Interface

(Diagram: research projects spanning the hardware-software interface. Areas: distributed systems, operating systems, memory systems, peer-to-peer systems, power-aware computing, resource-aware OS scheduling for multi-cores, and concurrency (coherence, synchronization, consistency). Projects: TreadMarks, Cashmere-2L, InterWeave (a server with IW library clients on handheld devices/Java, desktops/C/C++, and clusters/Fortran/C over the Internet), RTM, FlexTM, ARMCO, DDCache, LADLE, MCD, CAP, FCS, DT-CMT, Willow, RPPT, and Sentry (protection support).)

6

The Implications of Technology Scaling

• Many more transistors for compute power
• Energy constraints
• Large volumes of data
• High-speed communication
• Concurrency (parallel or distributed)
• Need support for
  – Scalable sharing
  – Reliability
  – Protection and security
  – Performance isolation

(Image source: http://weblog.infoworld.com/tech-bottom-line/archives/IT-In-The-Clouds_hp.jpg)

    7

Multi-Core Challenges

• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing

8

Current Projects

• CoSyn: Communication and Synchronization Mechanisms for Emerging Multi-Core Processors
  – Collaboration with Professors Michael Scott and Michael Huang
  – Arrvindh Shriraman, Hemayet Hossain, Hongzhou Zhao
• Operating System-Level Resource Management in the Multi-Core Era
  – Collaboration with Professor Kai Shen
  – Xiao Zhang and Rongrong Zhong

See http://www.cs.rochester.edu/research/cosyn
and http://www.cs.rochester.edu/~sandhya


    9

Multi-Core Challenges

• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing

    10

    Performance Isolation

11

Resource Sharing is (and will be) Ubiquitous!

• Floating point, integer units, state, and caches shared by multiple threads on a core
• The second-level cache shared by multiple cores on a chip
• Interconnect bandwidth shared on multiprocessors

Examples: Sun UltraSPARC T1, Intel's 6-core (12-thread) processors, AMD's 12-core processors, …

12

Resource Sharing on a Multicore Chip

• Memory bandwidth and the last-level cache are commonly shared by sibling cores sitting on the same chip

(Image source: http://download.intel.com/pressroom/kits/45nm/Penryn Die Photo_300.jpg)


13

Resource Management To Date

• Capitalistic: generating more requests results in more resource usage
  – Performance: resource contention can result in significantly reduced overall performance
  – Fairness: an equal time slice does not necessarily guarantee equal progress

    14

Poor Performance Due to Uncontrolled Resource Contention

Experiments were conducted on a 3 GHz Intel Core 2 Duo processor with a shared 4 MB L2 cache.

(Chart annotation: "Win-win situation")

    15

Fluctuating Performance Due to Uncontrolled Resource Contention

Performance of art when co-running with different applications on an Intel dual-core processor with a 4 MB shared L2 cache.

    16

    Fairness and Security Concerns

    • Priority inversion

    • Poor fairness among competing applications

    • Information leakage at chip level

    • Denial of service attack at chip level


    17

Big Picture

(Diagram: applications A, B, C, D, X, … flowing through two levels of control.)

• Resource-aware scheduling selects which applications run together [USENIX'10]
• Page coloring [EuroSys'09] or hardware throttling [USENIX'09] controls the resource usage of co-running applications

    18

Existing Mechanism (I): Software-Based Page Coloring

(Diagram: threads A and B share a cache with ways 1..n; thread A's footprint, memory pages A1-A5, maps onto its portion of the cache.)

• Classic technique to reduce cache misses, now used by the OS to manage cache partitioning (see the sketch below)
• Partitions the cache at coarse granularity
• No need for hardware support
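To make the mechanism concrete, here is a minimal sketch of the color computation that page coloring relies on. The associativity and page size below are illustrative assumptions rather than values from the talk; only the 4 MB shared L2 matches the evaluation platform.

```c
/* Page-color computation behind software page coloring (sketch).
 * A physically indexed cache of size C with associativity W and page
 * size P has C / (W * P) colors; physical pages whose frame numbers
 * are congruent modulo that count compete for the same cache sets, so
 * the OS can partition the cache by restricting each thread to a
 * subset of colors.
 */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE  (4u * 1024 * 1024)   /* 4 MB shared L2 */
#define CACHE_ASSOC 16u                  /* assumed associativity */
#define PAGE_SIZE   4096u                /* 4 KB pages */

#define NUM_COLORS  (CACHE_SIZE / (CACHE_ASSOC * PAGE_SIZE))   /* 64 */

/* Color of a physical page: low-order bits of the page frame number. */
static unsigned page_color(uint64_t paddr)
{
    return (unsigned)((paddr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    /* Frames 64 apart share a color, and therefore cache space. */
    printf("%u colors; color(0x100000)=%u, color(0x140000)=%u\n",
           NUM_COLORS, page_color(0x100000), page_color(0x140000));
    return 0;
}
```

Restricting a thread's pages to a subset of colors confines it to the corresponding fraction of the cache.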

    19

Drawbacks of Page Coloring

• Expensive re-coloring cost
  – Prohibitive in a dynamic environment where frequent re-coloring may be necessary
• Complex memory management
  – Introduces artificial memory pressure

(Diagram: threads A and B share the cache; thread A's footprint, memory pages A1-A5, again maps onto its portion of the cache.)

    20

Toward Practical Page Coloring

• Hotness-based page coloring
  – Efficiently find a small group of hot pages
  – Restrict page coloring or re-coloring to hot pages
  – Pay less re-coloring overhead while achieving most of the cache partitioning benefit (separate competing applications' most frequently accessed pages)
• Key challenge
  – An efficient way to track page hotness


    21

Methods to Track Page Hotness

• Using page protection
  – Capture page accesses by triggering page faults
  – Microseconds of overhead per page fault
• Using access bits
  – A single bit stored in each page table entry (PTE)
  – Generally available on x86; automatically set by hardware upon page access
  – Tens of cycles per page table entry check
  – Recycle spare bits in the PTE as a hotness counter (see the sketch below)
    • The counter is aged to reflect both recency and frequency
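A minimal sketch of this counter, assuming an 8-bit counter aged by shifting with the fresh accessed bit folded into the top position; the layout and names are illustrative, not the kernel code from the talk.

```c
/* Access-bit based page hotness tracking (sketch).
 * At the start of each sampling window the accessed bits are cleared;
 * after T milliseconds they are checked, and each page's counter is
 * aged (shifted right) with the fresh bit placed in the high-order
 * position, so the value reflects both recency and frequency.
 */
#include <stdint.h>

#define PTE_ACCESSED (1u << 5)   /* x86 PTE accessed bit */
#define COUNTER_BITS 8           /* assumed hotness counter width */

typedef struct {
    uint64_t pte;                /* hardware page table entry */
    uint8_t  hotness;            /* software counter kept in spare bits */
} page_info;

/* At time k*N: clear the bit so hardware can set it again on access. */
static void clear_access_bit(page_info *p)
{
    p->pte &= ~(uint64_t)PTE_ACCESSED;
}

/* At time k*N + T: age the counter and fold in the fresh bit. */
static void sample_access_bit(page_info *p)
{
    unsigned accessed = (p->pte & PTE_ACCESSED) ? 1u : 0u;
    p->hotness = (uint8_t)((p->hotness >> 1) |
                           (accessed << (COUNTER_BITS - 1)));
}
```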

    22

Sampling of Access Bits

• Decouple the sampling window from the sampling period
  – Hotness sampling accuracy is determined by the sampling time window T
  – Hotness sampling overhead is determined by the sampling period N

(Timeline: access bits are cleared at times 0, N, 2N, 3N, 4N, … and checked at times T, N+T, 2N+T, 3N+T, 4N+T, ….)

In our experiments, T = 2 milliseconds and N = 100 or 10 milliseconds.

    23

Miss-Ratio-Curve Driven Cache Partition Policy

(Figure: thread A's and thread B's miss ratios as a function of cache allocation, from 0 to 4 MB each, with the optimal partition point marked. The allocations satisfy Cache Size = ΣA,B Cache Allocation, and the partition point is chosen to optimize the system metric; a sketch of such a partition search follows.)
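A minimal sketch of such a partition search. The slide does not spell out the system optimization metric, so this sketch simply minimizes the sum of the two miss ratios; the granularity and the metric are assumptions for illustration.

```c
/* Miss-ratio-curve driven cache partitioning (sketch).
 * mrc_a[i] and mrc_b[i] give each thread's miss ratio when it is
 * allocated i units of cache; every split of the shared cache is
 * tried and the one optimizing the (assumed) metric is kept.
 */
#define NUM_UNITS 64                     /* allocation granularity */

static int best_partition(const double mrc_a[NUM_UNITS + 1],
                          const double mrc_b[NUM_UNITS + 1])
{
    int best_a = 0;
    double best_metric = mrc_a[0] + mrc_b[NUM_UNITS];

    for (int a = 0; a <= NUM_UNITS; a++) {
        /* Thread B receives whatever thread A does not. */
        double metric = mrc_a[a] + mrc_b[NUM_UNITS - a];
        if (metric < best_metric) {
            best_metric = metric;
            best_a = a;
        }
    }
    return best_a;                       /* units of cache for thread A */
}
```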

    24

Hot Page Coloring

• Budget control of page re-coloring overhead
  – A percentage of the time slice, e.g. 5%
• Recolor from the hottest page until the budget is reached
  – Maintain a set of hotness bins during sampling
    • bin[i][j] = number of pages in color i with normalized hotness in range [j, j+1]
  – Given a budget K, the K-th hottest page's hotness value is estimated in constant time by searching the hotness bins (see the sketch below)
  – Make sure hot pages are uniformly distributed among colors
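A minimal sketch of the constant-time threshold lookup over hotness bins; the bin dimensions and function name are assumptions for illustration rather than the implementation from the talk.

```c
/* Hotness-bin lookup for budgeted re-coloring (sketch).
 * bin[i][j] counts pages of color i whose normalized hotness falls in
 * bucket j.  Walking buckets from hottest to coldest and summing the
 * counts across colors finds the hotness level of the K-th hottest
 * page without touching per-page state, so the re-coloring pass can
 * stop once the budget of K pages is spent.
 */
#define NUM_COLORS 64
#define NUM_BINS   16

static int hotness_threshold(const unsigned bin[NUM_COLORS][NUM_BINS],
                             unsigned budget_pages)
{
    unsigned total = 0;

    for (int j = NUM_BINS - 1; j >= 0; j--) {      /* hottest bucket first */
        for (int i = 0; i < NUM_COLORS; i++)
            total += bin[i][j];
        if (total >= budget_pages)
            return j;   /* only pages at least this hot are recolored */
    }
    return 0;           /* budget covers every page */
}
```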


    25

Re-coloring Procedure

(Figure: for each color (red, blue, green, gray), pages are sorted in ascending order of their hotness counter values; with a budget of 3 pages, only the hottest pages are moved when a thread's cache share decreases.)

26

Performance Comparison

Four SPECcpu2k benchmarks (art, equake, mcf, and twolf) run on two sibling cores of an Intel Core 2 Duo that share a 4 MB L2 cache.

    27

Additional Benefit of Hotness-based Page Coloring

• Page coloring introduces artificial memory pressure
  – An app's footprint can be larger than the memory pages of its entitled colors even though the system still has an abundance of memory pages
• Allow the app to "steal" other colors, but have it preferentially copy cold pages into those memory colors

(Diagram: threads A and B share the cache; thread A's footprint, memory pages A1-A5.)

    28

Big Picture

(Diagram: applications A, B, C, D, X, … flowing through two levels of control.)

• Resource-aware scheduling selects which applications run together
• Page coloring or hardware throttling controls the resource usage of co-running applications


    29

Hardware Execution Throttling

• Instead of directly controlling resource allocation, throttle the execution speed of applications that overuse a resource
• Available throttling knobs
  – Duty-cycle modulation
  – Frequency/voltage scaling
  – Cache prefetchers

    30

Comparing Hardware Execution Throttling to Page Coloring

• Kernel code modification complexity
  – Code length: 40 lines in a single file; for reference, our page coloring implementation takes 700+ lines of code across 10+ files
• Runtime overhead of configuration
  – Less than 1 microsecond; for reference, re-coloring a page takes 3 microseconds

    31

Existing Mechanism (II): Scheduling Quantum Adjustment

• Shorten the time slice of an app that overuses the cache
• May leave a core idle if no other active thread is available

(Timeline: core 0 runs thread B continuously while core 1 alternates between thread A and idle periods.)

    32

Drawback of Scheduling Quantum Adjustment

Coarse-grained control at scheduling-quantum granularity may result in fluctuating service delays for individual transactions.


    33

New Mechanism: Hardware Execution Throttling [USENIX'09]

• Throttle the execution speed of an app that overuses the cache
  – Duty-cycle modulation (see the sketch below)
    • The CPU works only during duty cycles and stalls during non-duty cycles
    • Different from dynamic voltage/frequency scaling
      – Per-core vs. per-processor control
      – Thermal vs. power management
  – Enable/disable cache prefetchers
    • L1 prefetchers
      – IP: keeps track of the instruction pointer for load history
      – DCU: when it detects multiple loads from the same line within a time limit, prefetches the next line
    • L2 prefetchers
      – Adjacent line: prefetches the line adjacent to the requested data
      – Stream: looks at streams of data for regular patterns
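A minimal sketch of driving duty-cycle modulation from user space on Linux via the msr driver. It assumes the documented IA32_CLOCK_MODULATION layout (bit 4 enables on-demand modulation, bits 3:1 hold the duty level, where level L of 8 runs the core roughly L/8 of the time); it is illustrative, not the kernel modification from the talk.

```c
/* Per-core duty-cycle modulation via /dev/cpu/N/msr (sketch). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A

/* Set core `cpu` to duty level 1..7, or 8 for full speed (modulation off). */
static int set_duty_level(int cpu, unsigned level)
{
    char path[64];
    /* Level 8: clear the enable bit.  Otherwise: enable (bit 4) plus
     * the 3-bit duty level in bits 3:1. */
    uint64_t val = (level >= 8) ? 0 : ((1u << 4) | ((level & 0x7u) << 1));
    int fd;
    ssize_t n;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;              /* requires root and the msr module */

    n = pwrite(fd, &val, sizeof val, IA32_CLOCK_MODULATION);
    close(fd);
    return n == (ssize_t)sizeof val ? 0 : -1;
}

int main(void)
{
    /* Throttle core 1 to half speed (duty level 4). */
    return set_duty_level(1, 4) == 0 ? 0 : 1;
}
```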

    34

Comparison of Hardware Execution Throttling to the Other Two Mechanisms

• Compared to page coloring
  – Little added kernel complexity
    • Code length: 40 lines in a single file; for reference, our page coloring implementation takes 700+ lines of code across 10+ files
  – Lightweight to configure
    • Read plus write of a register: duty cycle 265 + 350 cycles, prefetcher 298 + 2065 cycles
    • Less than 1 microsecond; for reference, re-coloring a page takes 3 microseconds
• Compared to scheduling quantum adjustment
  – More fine-grained control

(Timeline: under quantum adjustment, core 1 runs thread A with idle gaps; under hardware execution throttling, thread A runs continuously at a throttled rate while core 0 runs thread B.)

35

Fairness Comparison

• On average, all three mechanisms are effective in improving fairness
• The case {swim, SPECweb} illustrates a limitation of page coloring
• Unfairness factor: the coefficient of variation (deviation-to-mean ratio, σ / μ) of the co-running apps' normalized performances (the normalization base is the execution time/throughput when the application monopolizes the whole chip)

36

Performance Comparison

• System efficiency: the geometric mean of the co-running apps' normalized performances
• On average, all three mechanisms achieve system efficiency comparable to default sharing
• Cases with severe inter-thread cache conflicts favor segregation, e.g. {swim, mcf}
• Cases with well-interleaved cache accesses favor sharing, e.g. {mcf, mcf}


    37

Policies for Hardware-Throttling-Based Multicore Management

• User-defined service level agreements (SLAs)
  – Proportional progress among competing threads
    • Unfairness metric: coefficient of variation of the threads' performance
  – Quality-of-service guarantee for high-priority application(s)
• Key challenge
  – The throttling configuration space grows exponentially as the number of cores increases
  – Quickly determining optimal or near-optimal throttling configurations is challenging

    38

Model-Driven Iterative Framework

• Customizable performance estimation model
• Reference configuration set and linear approximation
• Currently incorporates duty-cycle modulation and frequency/voltage scaling
• Iterative refinement
  – Prediction accuracy improves over time as more configurations are added to the reference set

    39

    Iterative Refinement Patterns

    40

Online Deployment: Hill-Climbing Search Acceleration

• For an m-throttling-level, n-core system, a brute-force search must evaluate m^n configurations to predict the "best" one
• Hill climbing searches along the best child at each step rather than all children (see the sketch below)
• This prunes the computation to (m-1)n^2 evaluations

(Diagram: search tree of throttling configurations rooted at (X,Y,Z,U); each step lowers one core's level by one, e.g. (X-1,Y,Z,U), (X,Y-1,Z,U), (X,Y,Z-1,U), (X,Y,Z,U-1), then their children (X-1,Y-1,Z,U), (X,Y-2,Z,U), and so on.)
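A minimal sketch of the hill-climbing search, with the performance model and SLA check as stand-in callbacks (assumed interfaces, not the framework's actual code). Starting from all cores at full speed, each step evaluates the n children obtained by lowering one core's level and follows the best one, so at most (m-1)n^2 configurations are evaluated.

```c
/* Hill-climbing over per-core duty-cycle levels (sketch). */
#include <stdbool.h>

#define NUM_CORES 4
#define MIN_LEVEL 4            /* half speed */
#define MAX_LEVEL 8            /* full speed */

/* Stand-ins for the performance estimation model and SLA predicate. */
double predicted_performance(const int levels[NUM_CORES]);
bool   sla_satisfied(const int levels[NUM_CORES]);

void hill_climb(int levels[NUM_CORES])
{
    for (int c = 0; c < NUM_CORES; c++)
        levels[c] = MAX_LEVEL;                    /* start unthrottled */

    while (!sla_satisfied(levels)) {
        int best_core = -1;
        double best = -1.0;

        for (int c = 0; c < NUM_CORES; c++) {     /* n children per step */
            if (levels[c] <= MIN_LEVEL)
                continue;
            levels[c]--;                          /* trial: lower core c */
            double perf = predicted_performance(levels);
            if (perf > best) {
                best = perf;
                best_core = c;
            }
            levels[c]++;                          /* undo trial move */
        }
        if (best_core < 0)
            break;                                /* nothing left to lower */
        levels[best_core]--;                      /* follow the best child */
    }
}
```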


    41

Accuracy Evaluation

• Test platform
  • A quad-core Nehalem processor with an 8 MB shared L3 cache
  • Search space from full CPU speed (duty-cycle level 8) down to half CPU speed (duty-cycle level 4), giving 369 configurations for each test
• Benchmarks: SPECCPU2k
  • Set-1: {mesa, art, mcf, equake}
  • Set-2: {swim, mgrid, mcf, equake}
  • Set-3: {swim, art, equake, twolf}
  • Set-4: {swim, applu, equake, twolf}
  • Set-5: {swim, mgrid, art, equake}

    42

Capability of Satisfying SLAs

• Service level agreements (SLAs)
  • Fairness-oriented: keep the unfairness below a threshold
  • QoS-oriented: keep the QoS core above a QoS threshold
• 4 different unfairness/QoS thresholds for the 5 sets
• Optimization goal: satisfy the SLAs while optimizing performance or power efficiency

           # Passing tests   Avg. num of samples   Avg. performance of picked configs that pass tests
  Oracle   39/40             0                     100%
  Model    39/40             4.1                   99.4%
  Random   25/40             15                    91.1%

(Recall that the search space has 369 configurations.)

    43

    Accuracy of Performance Estimation

    Error Rate = |Prediction – Actual| / Actual

    44

Big Picture

(Diagram: applications A, B, C, D, X, … flowing through two levels of control.)

• Resource-aware scheduling selects which applications run together
• Page coloring or hardware throttling controls the resource usage of co-running applications


    45

    Resource-aware Scheduling

• Scheduling decisions can significantly affect performance

    46

Similarity Grouping Scheduling

• Group applications with similar cache miss ratios on the same chip (see the sketch below)
  – Separate high and low miss-ratio apps onto different chips
• Benefits
  – Mitigates the cache thrashing effect
  – Avoids over-saturating memory bandwidth
  – Engages per-chip DVFS-based power savings
    • A single voltage setting applies to all sibling cores on existing multicore chips
    • The high-miss-ratio chip runs at a low frequency while the low-miss-ratio chip runs at a high frequency
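A minimal sketch of the grouping step, assuming one last-level-cache miss ratio per application and a fixed number of cores per chip; the struct and layout are illustrative, not the scheduler from the talk.

```c
/* Similarity grouping by cache miss ratio (sketch). */
#include <stdlib.h>

struct app {
    const char *name;
    double      miss_ratio;    /* measured last-level cache miss ratio */
    int         chip;          /* assigned chip id */
};

static int by_miss_ratio(const void *a, const void *b)
{
    double d = ((const struct app *)a)->miss_ratio -
               ((const struct app *)b)->miss_ratio;
    return (d > 0) - (d < 0);
}

/* Sort by miss ratio, then fill chips in order, so apps with similar
 * miss ratios land on the same chip and each chip can pick a DVFS
 * setting suited to its group. */
static void similarity_group(struct app *apps, int n, int cores_per_chip)
{
    qsort(apps, (size_t)n, sizeof *apps, by_miss_ratio);
    for (int i = 0; i < n; i++)
        apps[i].chip = i / cores_per_chip;
}
```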

    47

Frequency-to-Performance Model

• Objective: explore power savings with bounded performance loss
• Assumptions
  – An application's performance is linearly determined by cache and memory access latencies
  – Frequency scaling only affects on-chip accesses
  – The miss ratio does not vary across frequencies

Normalized performance at frequency f = T(F) / T(f), where T is execution time and F is the maximum frequency (see the sketch below).
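A minimal sketch of the model under these assumptions. Splitting execution time into an on-chip component (which scales with 1/f) and a memory component (which does not) is this sketch's reading of the assumptions; the helper names are illustrative.

```c
/* Frequency-to-performance model and frequency selection (sketch). */

/* T(f) = t_onchip(F) * (F / f) + t_mem, so normalized performance
 * at frequency f is T(F) / T(f). */
static double normalized_perf(double t_onchip, double t_mem,
                              double f_max, double f)
{
    double t_full   = t_onchip + t_mem;
    double t_scaled = t_onchip * (f_max / f) + t_mem;
    return t_full / t_scaled;
}

/* Lowest candidate frequency keeping performance >= 1 - max_loss
 * (freqs[] is assumed sorted in ascending order). */
static double pick_frequency(const double freqs[], int n,
                             double t_onchip, double t_mem,
                             double f_max, double max_loss)
{
    for (int i = 0; i < n; i++)
        if (normalized_perf(t_onchip, t_mem, f_max, freqs[i]) >= 1.0 - max_loss)
            return freqs[i];
    return f_max;                /* fall back to full speed */
}
```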

    48

    Model Accuracy

    Error Rate = (Prediction - Actual) / Actual


    49

Model-based Dynamic Frequency Setting

• Dynamically adjust CPU frequency based on the currently running application's behavior
  – Collect the cache miss ratio every 10 milliseconds
  – Calculate an appropriate frequency setting based on the performance estimation model
    • Guided by a performance degradation threshold (e.g. 10%)

    50

    Model-based Dynamic Frequency Setting

    51

Hardware Counter-based Power Containers: An OS Resource

• Cross-core activity influence
• Online calibration with actual measurement
• Application-transparent online request context tracking

    52

Power Conditioning Using Power Containers


    53

Power Conditioning Achieved Using Targeted Throttling

    54

Ongoing Work

• Variation-directed information and management
  – Using behavior fluctuation to trigger monitoring
  – Supporting fine-grain resource accounting
  – Developing policies to reshape behavior for high dependability and low jitter
  – Request-level power attribution, modeling, and management

    55

    See

    http://www.cs.rochester.edu/research/cosyn

    http://www.cs.rochester.edu/~sandhya

    56

Arch/App: Shared Memory ++: DIMM [TRANSACT'06, ISCA'07, ISCA'08, ICS'09]

• Data Isolation (DI)
  – Provide control over the propagation of writes
  – Buffer writes and allow group undo or propagation
  – Applications: sandboxing, transactional programming, speculation
• Memory Monitoring (MM)
  – Monitor memory at a summary or individual cache-line level
  – Applications: synchronization/event notification, reliability, security, watchpoints/debugging

http://www.cs.rochester.edu/research/cosyn


    57

Arch/App/OS: Protection: Separation of Privileges

Reality: today's programs often consist of multiple modules written by different programmers. Reliability and composability require developing access and interface conventions.

    58

Sentry: Light-Weight Auxiliary Memory Access Control [ISCA'10]

• Access checks are performed on an L1 miss
  – Saves 90x energy
  – Simplifies the implementation
• A metadata cache (M-cache) is accessed in parallel with the L2 to speed up the check

59

(Figure only; no recoverable text.)

60

The Indirection Problem [PACT 2008]

(Protocol diagram: nodes P0, P4, and P11. A Load A must first travel to A's home node, which forwards it (DG A) to the node holding A in M state (P11) before Data A is returned and the requester caches A in S state. Longer distance means longer latency.)


    61

Fine-Grain Data Sharing

Simultaneous access to the same data by more than one core while the data still resides in some L1 cache.

Key idea: fine-grain sharing can be leveraged to localize communication.

    62

Goal: Localize Shared Data Communication

(Protocol diagram: P0's Load A is satisfied by nearby P4, which already caches A, instead of being routed through A's home node. Data availability at P0: 2 vs. 10 physical hops, whether P4 holds A in M or S state.)

    63

Summary

• Harnessing 50B transistors requires a fresh look at conventional hardware-software boundaries and interfaces, with support for
  – Scalable coherence design
  – Controlled data sharing via architectural support for memory monitoring, isolation, and protection
  – Controlled resource sharing via operating system-level policies for performance isolation
• We have examined coherence protocol additions to allow
  – Fast event-based communication
  – Fine-grain access control
  – Programmable support for isolation
  – Low-latency access for fine-grain data sharing
  – Software to determine policy decisions in a flexible manner

A combined hardware/software approach supports concurrency with improved performance and scalability.

