Controlled Resource and Data Sharing in Multi-Core Platforms
Transcript
  • 1

    1

    Controlled Resource and Data Sharing in Multi-Core Platforms*

    Sandhya Dwarkadas

    Department of Computer Science

    University of Rochester

    *Joint work with Arrvindh Shriraman, Hemayet Hossain, Xiao Zhang, Hongzhou Zhao, Rongrong Zhong, Michael L. Scott, Michael Huang, Kai Shen

    2

    The University of Rochester

    • Small private research university

    • 4400 undergraduates

    • 2800 graduate students

    • Set on the Genesee River in Western NY State, near the south shore of Lake Ontario

    • 250km by road from Toronto; 590km from New York City

    3 4

    The Computer Science Dept.

    • 15 tenure-track faculty; 45 Ph.D. students

    • Specializing in AI, theory, and parallel and distributed systems

    • Among the best small departments in the US

  • 2

    5

    The Hardware-Software Interface

    [Figure: research projects spanning the hardware-software interface. InterWeave (a server plus IW libraries and caches linking a Java handheld device, a Fortran/C cluster, and a C/C++ desktop over the Internet), TreadMarks, Cashmere-2L, RTM, FlexTM, ARMCO, DDCache, LADLE, Willow, Sentry (protection support), RPPT, FCS, DT-CMT, CAP, MCD; areas include concurrency (coherence, synchronization, consistency), memory systems, peer-to-peer systems, power-aware computing, multi-cores, resource-aware OS scheduling, distributed systems, and operating systems]

    6

    The Implications of Technology Scaling

    • Many more transistors for compute power

    • Energy constraints

    • Large volumes of data

    • High-speed communication

    • Concurrency (parallel or distributed)

    • Need support for

    – Scalable sharing

    – Reliability

    – Protection and security

    – Performance isolation

    Source: http://weblog.infoworld.com/tech-bottom-line/archives/IT-In-The-Clouds_hp.jpg

    7

    Multi-Core Challenges

    • Ensuring performance isolation

    • Providing protected and controlled sharing across cores

    • Scaling support for data sharing

    8

    Current Projects

    • CoSyn: Communication and Synchronization Mechanisms for Emerging Multi-Core Processors

    – Collaboration with Professors Michael Scott and Michael Huang

    – Arrvindh Shriraman, Hemayet Hossain, Hongzhou Zhao

    • Operating System-Level Resource Management in the Multi-Core Era

    – Collaboration with Professor Kai Shen

    – Xiao Zhang and Rongrong Zhong

    See http://www.cs.rochester.edu/research/cosyn

    and http://www.cs.rochester.edu/~sandhya


  • 3

    9

    Multi-Core Challenges

    • Ensuring performance isolation

    • Providing protected and controlled sharing across cores

    • Scaling support for data sharing

    10

    Performance Isolation

    11

    Resource Sharing is (and will be) Ubiquitous!

    • Floating point, integer, state, cache with multiple threads on a core

    • Second-level cache with multiple cores on a chip

    • Interconnect bandwidth on multiprocessors

    Sun UltraSparc T1, … Intel’s 6-core (12-thread), … AMD’s 12-core, …

    12

    Resource Sharing on Multicore Chip

    • Memory bandwidth and last-level cache are commonly shared by sibling cores sitting on the same chip

    http://download.intel.com/pressroom/kits/45nm/Penryn Die Photo_300.jpg

  • 4

    13

    Resource Management To Date

    • Capitalistic - generation of more requests results in more resource usage

    – Performance: resource contention can result in significantly reduced overall performance

    – Fairness: equal time slice does not necessarily guarantee equal progress

    14

    Poor Performance Due to Uncontrolled Resource Contention

    Experiments were conducted on a 3GHz Intel Core 2 Duo processor with a shared 4MB L2 cache

    [Chart annotation: win-win situation]

    15

    Fluctuating Performance Due to Uncontrolled Resource Contention

    Performance of art when co-running with different applications on an Intel dual-core processor with a 4MB shared L2 cache

    16

    Fairness and Security Concerns

    • Priority inversion

    • Poor fairness among competing applications

    • Information leakage at chip level

    • Denial of service attack at chip level

  • 5

    17

    Big Picture

    [Figure: applications A, B, C, D, ..., X; first select which applications run together (resource-aware scheduling [USENIX'10]), then control the resource usage of co-running applications (page coloring [EuroSys'09] or hardware throttling [USENIX'09])]

    18

    Existing Mechanism (I): Software-based Page Coloring

    [Figure: threads A and B share an n-way cache; memory pages A1 through A5 form thread A's footprint and map to a subset of the cache]

    • Classic technique to reduce cache misses, now used by the OS to manage cache partitioning (see the color-mapping sketch below)

    • Partitions the cache at coarse granularity

    • No need for hardware support
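    The mapping from a physical page to a cache color is what makes this work: the set-index bits that lie above the page offset pick the color. Below is a minimal sketch, assuming the Core 2 Duo's 4MB shared L2 is 16-way with 64-byte lines and 4KB pages (the associativity and line size are assumptions, not stated on the slides).

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed cache geometry: 4 MB, 16-way, 64-byte lines, 4 KB pages. */
#define CACHE_SIZE  (4u << 20)
#define CACHE_WAYS  16u
#define PAGE_SIZE   4096u
#define NUM_COLORS  (CACHE_SIZE / (CACHE_WAYS * PAGE_SIZE))  /* 64 colors */

/* A page's color is given by the physical-frame-number bits that also
 * index the cache set; pages of the same color compete for the same
 * 1/NUM_COLORS slice of the cache. */
static unsigned page_color(uint64_t pfn)
{
    return (unsigned)(pfn % NUM_COLORS);
}

int main(void)
{
    printf("colors available: %u\n", NUM_COLORS);
    printf("pfn 0x1234      -> color %u\n", page_color(0x1234));
    /* Frames that differ by NUM_COLORS map to the same color. */
    printf("pfn 0x1234 + 64 -> color %u\n", page_color(0x1234 + NUM_COLORS));
    return 0;
}
```

    The OS partitions the cache by giving each application physical pages drawn only from its assigned colors.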

    19

    Drawbacks of Page Coloring

    • Expensive re-coloring cost

    – Prohibitive in a dynamic environment where frequent re-coloring may be necessary

    • Complex memory management

    – Introduces artificial memory pressure

    [Figure: threads A and B share an n-way cache; memory pages A1 through A5 form thread A's footprint, restricted to its assigned colors]

    20

    Toward Practical Page Coloring

    • Hotness-based Page Coloring

    – Efficiently find a small group of hot pages

    – Restrict page coloring or re-coloring to hot pages

    – Pay less re-coloring overhead while achieving most of the cache partitioning benefit (separate competing applications' most frequently accessed pages)

    • Key challenge

    – An efficient way to track page hotness

  • 6

    21

    Methods to Track Page Hotness

    • Using page protection

    – Capture page accesses by triggering page faults

    – Microseconds of overhead per page fault

    • Using access bits

    – A single bit stored in each Page Table Entry (PTE)

    – Generally available on x86; automatically set by hardware upon page access

    – Tens of cycles per page table entry check

    – Recycle spare bits in the PTE as a hotness counter

    • The counter is aged to reflect recency and frequency

    22

    Sampling of Access Bits

    • Decouple sampling frequency and window

    – Hotness sampling accuracy is determined by the sampling time window T

    – Hotness sampling overhead is determined by the sampling interval N

    [Figure: timeline; access bits are cleared at times 0, N, 2N, 3N, 4N and checked at T, N+T, 2N+T, 3N+T, 4N+T]

    In our experiments, T = 2 milliseconds and N = 100 or 10 milliseconds (a sampling sketch follows)
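    As a rough illustration of this schedule, here is a minimal user-level sketch in which plain arrays stand in for the PTE accessed bits and the spare-bit hotness counters; the real mechanism walks page tables in the kernel, and the names `accessed`, `hotness`, and the aging rule are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 1024  /* illustrative footprint size */

/* Stand-ins for per-page PTE state: the hardware-set accessed bit and a
 * small counter recycled from spare PTE bits. */
static uint8_t accessed[NPAGES];
static uint8_t hotness[NPAGES];

/* At time k*N: clear accessed bits so the hardware can mark fresh use. */
static void clear_access_bits(void)
{
    memset(accessed, 0, sizeof accessed);
}

/* At time k*N + T: fold the observed bits into an aged counter. Shifting
 * right decays old history; adding the fresh bit at the top weights
 * recency, so the counter reflects both recency and frequency. */
static void check_access_bits(void)
{
    for (int i = 0; i < NPAGES; i++)
        hotness[i] = (uint8_t)((hotness[i] >> 1) | (accessed[i] << 7));
}

/* One sampling round per interval N: a window T of observation, then idle. */
static void sample_hotness(int N_ms, int T_ms, int rounds)
{
    for (int r = 0; r < rounds; r++) {
        clear_access_bits();
        usleep((useconds_t)T_ms * 1000);        /* accessed bits get set here */
        check_access_bits();
        usleep((useconds_t)(N_ms - T_ms) * 1000);
    }
}

int main(void)
{
    sample_hotness(10, 2, 100);  /* T = 2 ms, N = 10 ms, as on the slide */
    return 0;
}
```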

    23

    Miss-Ratio-Curve Driven Cache Partition Policy

    [Figure: thread A's and thread B's miss ratios as functions of their cache allocations (0 to 4M); the optimal partition point is chosen where the two allocations together use the whole cache]

    Cache Size = ∑A,B Cache Allocation

    System optimization metric =
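    A sketch of the partition search under this policy: given each thread's miss-ratio curve, try every split of the colors and keep the best one. The slide's system optimization metric is not spelled out here, so the sum of the two miss ratios is used purely as an illustrative stand-in.

```c
#include <stdio.h>

#define NUM_COLORS 64  /* e.g. a 4 MB, 16-way L2 with 4 KB pages */

/* mrc_a[i] / mrc_b[i]: miss ratio of A / B when allocated i colors.
 * Returns the number of colors to give A; B receives the remainder. */
static int best_partition(const double mrc_a[NUM_COLORS + 1],
                          const double mrc_b[NUM_COLORS + 1])
{
    int best = 1;
    double best_metric = mrc_a[1] + mrc_b[NUM_COLORS - 1];
    for (int a = 2; a < NUM_COLORS; a++) {
        double metric = mrc_a[a] + mrc_b[NUM_COLORS - a];  /* stand-in metric */
        if (metric < best_metric) {
            best_metric = metric;
            best = a;
        }
    }
    return best;
}

int main(void)
{
    /* Synthetic curves: A benefits from more cache, B is streaming (flat). */
    double mrc_a[NUM_COLORS + 1], mrc_b[NUM_COLORS + 1];
    for (int i = 0; i <= NUM_COLORS; i++) {
        mrc_a[i] = 1.0 / (1.0 + i);
        mrc_b[i] = 0.5;
    }
    printf("give A %d of %d colors\n", best_partition(mrc_a, mrc_b), NUM_COLORS);
    return 0;
}
```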

    24

    Hot Page Coloring

    • Budget control of page re-coloring overhead

    – % of time slice, e.g. 5%

    • Recolor from the hottest page until the budget is reached

    – Maintain a set of hotness bins during sampling

    • bin[ i ][ j ] = # of pages in color i with normalized hotness in range [ j, j+1 ]

    – Given a budget K, the K-th hottest page's hotness value is estimated in constant time by searching the hotness bins (see the sketch below)

    – Make sure hot pages are uniformly distributed among colors
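    A sketch of that constant-time budget lookup: the bins are filled during access-bit sampling, and the hotness level of the K-th hottest page is found by walking the small, fixed number of buckets from hottest to coldest. The 16-bucket normalization is an assumed parameter.

```c
#include <stdio.h>

#define NUM_COLORS 64
#define NUM_BINS   16  /* normalized hotness range split into 16 buckets */

/* bin[i][j] = number of pages of color i whose normalized hotness falls
 * into bucket j (maintained during access-bit sampling). */
static unsigned bin[NUM_COLORS][NUM_BINS];

/* Estimate the hotness bucket of the K-th hottest page. The cost depends
 * only on the fixed number of colors and buckets, not on the page count,
 * so the budgeted recoloring threshold is found in constant time. */
static int kth_hottest_bucket(unsigned K)
{
    for (int j = NUM_BINS - 1; j >= 0; j--) {
        unsigned total = 0;
        for (int i = 0; i < NUM_COLORS; i++)
            total += bin[i][j];
        if (total >= K)
            return j;   /* pages at or above this bucket get re-colored */
        K -= total;
    }
    return 0;           /* budget exceeds the number of tracked pages */
}

int main(void)
{
    /* Toy fill: every color has 10 pages in bucket 2 and 2 pages in bucket 14. */
    for (int i = 0; i < NUM_COLORS; i++) {
        bin[i][2]  = 10;
        bin[i][14] = 2;
    }
    printf("the 100th hottest page falls in bucket %d\n", kth_hottest_bucket(100));
    return 0;
}
```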

  • 7

    25

    Re-coloring Procedure

    [Figure: budget = 3 pages; for each color (Red, Blue, Green, Gray), pages are sorted in ascending hotness order (hotness counter values shown range from 0 to 100), and as the cache share decreases only the hottest pages, up to the budget, are re-colored]

    Performance Comparison

    4 SPECcpu2k benchmarks (art, equake, mcf, and twolf) are running on 2 sibling cores (Intel Core 2 Duo) that share a 4MB L2 cache.

    27

    Additional Benefit of Hotness-based Page Coloring

    • Page coloring introduces artificial memory pressure

    – The app's footprint is larger than its entitled memory color pages, but the system still has an abundance of memory pages

    • Allow the app to “steal” other apps' colors, but preferentially copy cold pages into those colors

    [Figure: threads A and B share an n-way cache; memory pages A1 through A5 form thread A's footprint]

    28

    Big Picture

    [Figure (repeated): select which applications run together (resource-aware scheduling), then control the resource usage of co-running applications (page coloring or hardware throttling)]

  • 8

    29

    Hardware Execution Throttling • Instead of directly controlling resource allocation,

    throttle the execution speed of application that overuses resource

    • Available throttling knobs

    – Duty-cycle modulation

    – Frequency/voltage scaling

    – Cache prefetchers

    30

    Comparing Hardware Execution Throttling to Page Coloring

    • Kernel code modification complexity

    – Code length: 40 lines in a single file; as a reference, our page coloring implementation takes 700+ lines of code crossing 10+ files

    • Runtime overhead of configuration

    – Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds

    31

    Existing Mechanism (II): Scheduling Quantum Adjustment

    • Shorten the time slice of the app that overuses the cache

    • May leave a core idle if there is no other active thread available

    [Figure: timelines for Core 0 and Core 1; thread A runs with idle gaps on one core while thread B runs on the other]

    32

    Drawback of Scheduling Quantum Adjustment

    Coarse-grained control at scheduling quantum granularity may result in fluctuating service delays for individual transactions

  • 9

    33

    New Mechanism: Hardware Execution Throttling [USENIX'09]

    • Throttle the execution speed of the app that overuses the cache

    – Duty cycle modulation

    • The CPU works only in duty cycles and stalls in non-duty cycles (a configuration sketch follows this list)

    • Different from Dynamic Voltage and Frequency Scaling

    – Per-core vs. per-processor control

    – Thermal vs. power management

    – Enable/disable cache prefetchers

    • L1 prefetchers

    – IP: keeps track of the instruction pointer for load history

    – DCU: when detecting multiple loads from the same line within a time limit, prefetches the next line

    • L2 prefetchers

    – Adjacent line: prefetches the adjacent line of the required data

    – Stream: looks at streams of data for regular patterns
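    For concreteness, here is a hedged user-level sketch of the duty-cycle knob. The slides' mechanism lives in the kernel; this version only illustrates the idea by writing the IA32_CLOCK_MODULATION MSR (0x19A) through Linux's /dev/cpu/<n>/msr interface (requires root and the msr module). The function name and the 1..8 level encoding follow the slides' "duty cycle level 4 to 8" convention; the bit layout is the documented one for processors without extended clock modulation.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A

/* level 1..7 selects a duty cycle of level/8; level 8 disables modulation
 * (full speed). Returns 0 on success, -1 on failure. */
static int set_duty_cycle(int cpu, int level)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    uint64_t val = 0;                               /* bit 4 clear: modulation off */
    if (level >= 1 && level <= 7)
        val = (1u << 4) | ((uint64_t)level << 1);   /* enable bit + duty bits 3:1 */

    ssize_t n = pwrite(fd, &val, sizeof val, IA32_CLOCK_MODULATION);
    close(fd);
    return n == (ssize_t)sizeof val ? 0 : -1;
}

int main(void)
{
    /* Throttle core 1 to 4/8 of its duty cycles (half speed). */
    if (set_duty_cycle(1, 4) != 0)
        perror("set_duty_cycle");
    return 0;
}
```

    Because the MSR is per-core, this is a per-core control, which is what distinguishes duty-cycle modulation from chip-wide voltage/frequency scaling in the comparison above.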

    34

    Comparison of Hardware Execution Throttling to the Other Two Mechanisms

    • Comparison to page coloring

    – Little complexity added to the kernel

    • Code length: 40 lines in a single file; as a reference, our page coloring implementation takes 700+ lines of code crossing 10+ files

    – Lightweight to configure

    • Read plus write register: duty-cycle 265 + 350 cycles, prefetcher 298 + 2065 cycles

    • Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds

    • Comparison to scheduling quantum adjustment

    – More fine-grained control

    [Figure: timelines for Core 0 and Core 1 contrasting quantum adjustment (thread A idles part of the time) with hardware execution throttling (thread A runs throughout at reduced speed)]

    35

    Fairness Comparison

    • On average, all three mechanisms are effective in improving fairness

    • The case {swim, SPECweb} illustrates a limitation of page coloring

    • Unfairness factor: coefficient of variation (deviation-to-mean ratio, σ / μ) of the co-running apps' normalized performances, where the normalization base is the execution time/throughput when the application monopolizes the whole chip (see the sketch below)
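    The unfairness factor is straightforward to compute; a small sketch (the normalized performances would come from the measurements described above):

```c
#include <math.h>
#include <stdio.h>

/* Unfairness factor from the slide: coefficient of variation (sigma / mu)
 * of the co-running apps' normalized performances, where each value is
 * normalized to the run in which that app had the whole chip to itself. */
static double unfairness(const double norm_perf[], int n)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++)
        mean += norm_perf[i];
    mean /= n;
    for (int i = 0; i < n; i++)
        var += (norm_perf[i] - mean) * (norm_perf[i] - mean);
    var /= n;
    return sqrt(var) / mean;   /* 0 means perfectly proportional progress */
}

int main(void)
{
    /* Example: two co-runners at 90% and 60% of standalone performance. */
    double perf[] = { 0.9, 0.6 };
    printf("unfairness = %.3f\n", unfairness(perf, 2));  /* prints 0.200 */
    return 0;
}
```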

    36

    Performance Comparison

    • System efficiency: geometric mean of the co-running apps' normalized performances

    • On average, all three mechanisms achieve system efficiency comparable to default sharing

    • Cases with severe inter-thread cache conflicts favor segregation, e.g. {swim, mcf}

    • Cases with well-interleaved cache accesses favor sharing, e.g. {mcf, mcf}

  • 10

    37

    Policies for Hardware Throttling Based Multicore Management

    • User-defined service level agreements (SLAs)

    – Proportional progress among competing threads

    • Unfairness metric: coefficient of variation of threads' performance

    – Quality of service guarantee for high-priority application(s)

    • Key challenge

    – The throttling configuration space grows exponentially as the number of cores increases

    – Quickly determining optimal or close-to-optimal throttling configurations is challenging

    38

    Model-Driven Iterative Framework

    • Customizable performance estimation model

    • Reference configuration set and linear approximation

    • Currently incorporates duty cycle modulation and frequency/voltage scaling

    • Iterative refinement

    • Prediction accuracy improves over time as more configurations are added to the reference set

    39

    Iterative Refinement Patterns

    40

    Online Deployment: Hill-Climbing Search Acceleration

    • For an m-throttling-level, n-core system, m^n configurations would need to be evaluated to predict a "best" one

    • Hill climbing searches along the best child rather than all children (sketched below)

    • Prunes the computation space to (m-1)n^2

    [Figure: search tree over per-core throttling configurations; from (X,Y,Z,U) each child lowers one core's level by one, e.g. (X-1,Y,Z,U), (X,Y-1,Z,U), (X,Y,Z-1,U), (X,Y,Z,U-1), and the chosen child is expanded in the same way]
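    A minimal sketch of the hill-climbing acceleration, assuming four cores and duty-cycle levels 4 through 8 as in the evaluation. The scoring function is the hook where the model-driven estimate would plug in; toy_score below is an arbitrary stand-in, not the slides' model.

```c
#include <stdio.h>

#define NCORES    4
#define MAX_LEVEL 8   /* full speed */
#define MIN_LEVEL 4   /* half speed, the evaluation's lower bound */

/* Hill climbing over per-core duty-cycle levels: from the current
 * configuration, evaluate only the n children obtained by lowering one
 * core's level by one, move to the best child, and stop when no child
 * improves the score. At most (m-1)*n steps of n evaluations each,
 * i.e. (m-1)*n^2 evaluations, instead of scoring all m^n configurations. */
static void hill_climb(int level[NCORES], double (*score)(const int level[NCORES]))
{
    for (int c = 0; c < NCORES; c++)
        level[c] = MAX_LEVEL;              /* start at full speed */

    for (;;) {
        double best = score(level);
        int best_core = -1;

        for (int c = 0; c < NCORES; c++) {
            if (level[c] <= MIN_LEVEL)
                continue;
            level[c]--;                    /* tentatively lower this core */
            double s = score(level);
            level[c]++;
            if (s > best) {
                best = s;
                best_core = c;
            }
        }
        if (best_core < 0)
            break;                         /* no child improves: done */
        level[best_core]--;
    }
}

/* Toy stand-in for the model-driven estimate: pretend the SLA is met only
 * when the levels sum to at most 26, and prefer the fastest such config. */
static double toy_score(const int level[NCORES])
{
    int total = 0;
    for (int c = 0; c < NCORES; c++)
        total += level[c];
    return total <= 26 ? (double)total : -(double)total;
}

int main(void)
{
    int level[NCORES];
    hill_climb(level, toy_score);
    printf("chosen levels: %d %d %d %d\n", level[0], level[1], level[2], level[3]);
    return 0;
}
```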

  • 11

    41

    Accuracy Evaluation

    • Test platform

    • A quad-core Nehalem processor with an 8MB shared L3 cache

    • Search space from full CPU speed (duty cycle level 8) to half CPU speed (duty cycle level 4), so 369 configurations for each test

    • Benchmarks: SPECCPU2k

    • Set-1: {mesa, art, mcf, equake}

    • Set-2: {swim, mgrid, mcf, equake}

    • Set-3: {swim, art, equake, twolf}

    • Set-4: {swim, applu, equake, twolf}

    • Set-5: {swim, mgrid, art, equake}

    42

    Capability of Satisfying SLAs

    • Service Level Agreements (SLAs)

    • Fairness-oriented: keep the unfairness below a threshold

    • QoS-oriented: keep the QoS core above a QoS threshold

    • 4 different unfairness/QoS thresholds for 5 sets

    • Optimization goal: satisfy SLAs while optimizing performance or power efficiency

              # Passing tests   Avg. num of samples   Avg. performance of picked configs that pass tests
    Oracle    39/40             0                     100%
    Model     39/40             4.1                   99.4%
    Random    25/40             15                    91.1%

    Recall the search space has 369 configurations

    43

    Accuracy of Performance Estimation

    Error Rate = |Prediction – Actual| / Actual

    44

    Big Picture

    [Figure (repeated): select which applications run together (resource-aware scheduling), then control the resource usage of co-running applications (page coloring or hardware throttling)]

  • 12

    45

    Resource-aware Scheduling

    • Scheduling decision could significantly affect performance

    46

    Similarity Grouping Scheduling

    • Group applications with similar cache miss ratios on the same chip (see the sketch after this list)

    – Separate high and low miss ratio apps onto different chips

    • Benefits

    – Mitigates the cache thrashing effect

    – Avoids over-saturating memory bandwidth

    – Engages per-chip DVFS-based power savings

    • A single voltage setting applies to all sibling cores on existing multicore chips

    • The high-miss-ratio chip runs at a low frequency while the low-miss-ratio chip runs at a high frequency
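    A sketch of the grouping step referenced above, for two chips with an equal number of cores; the struct layout and the source of the miss ratios (hardware performance counters) are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Similarity grouping: sort applications by last-level-cache miss ratio,
 * then fill one chip with the lowest-miss-ratio apps and the other with
 * the highest, so each chip hosts apps with similar behavior. */
struct app {
    int id;
    double miss_ratio;   /* e.g. measured with hardware counters */
};

static int by_miss_ratio(const void *a, const void *b)
{
    double d = ((const struct app *)a)->miss_ratio -
               ((const struct app *)b)->miss_ratio;
    return (d > 0) - (d < 0);
}

/* Assign n apps to two chips of n/2 cores each; chip[i] receives the
 * chip index for apps[i] after sorting. */
static void similarity_group(struct app apps[], int chip[], int n)
{
    qsort(apps, n, sizeof apps[0], by_miss_ratio);
    for (int i = 0; i < n; i++)
        chip[i] = (i < n / 2) ? 0 : 1;   /* chip 0: low miss, chip 1: high miss */
}

int main(void)
{
    struct app apps[] = { {0, 0.05}, {1, 0.40}, {2, 0.02}, {3, 0.35} };
    int chip[4];
    similarity_group(apps, chip, 4);
    for (int i = 0; i < 4; i++)
        printf("app %d (miss ratio %.2f) -> chip %d\n",
               apps[i].id, apps[i].miss_ratio, chip[i]);
    return 0;
}
```

    The low-miss-ratio chip can then run at a high frequency and the high-miss-ratio chip at a low frequency, enabling the per-chip DVFS savings noted above.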

    47

    Frequency-to-Performance Model

    • Objective: explore power savings with bounded performance loss

    • Assumptions

    – An application's performance is linearly determined by cache and memory access latencies

    – Frequency scaling only affects on-chip accesses

    – The miss ratio does not vary across frequencies

    Normalized performance at frequency f = T(F) / T(f)
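    Under these assumptions the model reduces to a two-term time split: time that scales with frequency (on-chip work) and time that does not (memory). A minimal sketch, with an illustrative 70/30 split that would in practice be derived from the measured miss ratio:

```c
#include <stdio.h>

/* T(f) = t_on * (F / f) + t_mem: only the on-chip part stretches when the
 * clock slows from the full frequency F to f; memory time is unchanged. */
static double exec_time(double t_on, double t_mem, double F, double f)
{
    return t_on * (F / f) + t_mem;
}

/* Normalized performance at f, per the slide: T(F) / T(f). */
static double normalized_perf(double t_on, double t_mem, double F, double f)
{
    return exec_time(t_on, t_mem, F, F) / exec_time(t_on, t_mem, F, f);
}

int main(void)
{
    /* An app spending 70% of its full-speed time on-chip and 30% in memory:
     * dropping from 3.0 GHz to 2.0 GHz costs ~26%, not the 33% a
     * compute-only model would predict. */
    printf("normalized perf at 2.0 GHz = %.2f\n",
           normalized_perf(0.7, 0.3, 3.0, 2.0));
    return 0;
}
```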

    48

    Model Accuracy

    Error Rate = (Prediction - Actual) / Actual

  • 13

    49

    Model-based Dynamic Frequency Setting

    • Dynamically adjust the CPU frequency based on the currently running application's behavior

    – Collect the cache miss ratio every 10 milliseconds

    – Calculate an appropriate frequency setting based on the performance estimation model

    • Guided by a performance degradation threshold (e.g. 10%)

    50

    Model-based Dynamic Frequency Setting

    51

    Hardware Counter-based Power Containers: An OS Resource

    • Cross-core activity influence

    • Online calibration with actual measurement

    • Application-transparent online request context tracking

    52

    Power Conditioning Using Power Containers

  • 14

    53

    Power Conditioning Achieved Using Targeted Throttling

    54

    Ongoing Work

    • Variation-directed information and management

    – Using behavior fluctuation to trigger monitoring

    – Supporting fine-grain resource accounting

    – Developing policies to reshape behavior for high dependability and low jitter

    – Request-level power attribution, modeling, and management

    55

    See

    http://www.cs.rochester.edu/research/cosyn

    http://www.cs.rochester.edu/~sandhya

    56

    Arch/App: Shared Memory ++: DIMM [TRANSACT'06, ISCA'07, ISCA'08, ICS'09]

    • Data Isolation (DI)

    – Provide control over the propagation of writes

    – Buffer writes and allow group undo or propagation

    Applications: sand-boxing, transactional programming, speculation

    • Memory Monitoring (MM)

    – Monitor memory at summary or individual cache-line level

    Applications: synchronization/event notification, reliability, security, watchpoints/debugging

    http://www.cs.rochester.edu/research/cosyn

  • 15

    57

    Arch/App/OS: Protection: Separation of Privileges

    Reality: today's programs often consist of multiple modules written by different programmers

    Reliability and composability require developing access and interface conventions

    58

    Sentry: Light-Weight Auxiliary Memory Access Control [ISCA'10]

    • Access checks on an L1 miss

    – Saves 90x energy

    – Simplifies the implementation

    • A metadata cache (M-cache) is accessed in parallel with the L2 to speed up the check

    59


    60

    The Indirection Problem [PACT 2008]

    [Figure: cores P0, P4, and P11; a load of A from P0 goes first to A's home node, which forwards it to the core holding A in modified state before the data and a shared copy come back, so each miss pays several protocol hops]

    Longer distance means longer latency

  • 16

    61

    Fine-Grain Data Sharing

    Simultaneous access to the same data by more than one core while the data still resides in some L1 cache

    Key Idea: Fine-grain sharing can be leveraged to localize communication

    62

    Goal: Localize Shared Data Communication

    [Figure: cores P0, P4, and P11 with A's home node; P0's load of A is satisfied by nearby P4 directly rather than indirecting through Home A]

    Data availability at P0: 2 vs 10 physical hops (whether P4 holds A in M or S)

    63

    Summary

    • Harnessing 50B transistors requires a fresh look at conventional hardware-software boundaries and interfaces, with support for

    – Scalable coherence design

    – Controlled data sharing via architectural support for

    • Memory monitoring, isolation, and protection

    – Controlled resource sharing via operating system-level policies for

    • Performance isolation

    • We have examined coherence protocol additions to allow

    – Fast event-based communication

    – Fine-grain access control

    – Programmable support for isolation

    – Low-latency access for fine-grain data sharing

    – Software to determine policy decisions in a flexible manner

    A combined hardware/software approach to supporting concurrency with improved performance and scalability

