Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson

Department of Computer Sciences
* Department of Computer Science

Motivation

Parallel multithreaded programs are becoming prevalent
- web servers, search engines, database managers, etc.
- run on SMPs for high performance
- often embarrassingly parallel

Memory allocation is a bottleneck
- prevents scaling with the number of processors

Assessment Criteria for Multiprocessor Allocators

Speed
- competitive with uniprocessor allocators on one processor

Scalability
- performance linear with the number of processors

Fragmentation (= max allocated / max in use)
- competitive with uniprocessor allocators
- worst-case and average-case
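
The fragmentation ratio above can be tracked with two high-water marks. The following is a minimal sketch, not code from Hoard: the `frag_stats` type and the `stats_*` helpers are hypothetical names introduced for illustration, and the caller is assumed to report every system request and every object allocation.

```c
#include <stddef.h>

/* Hypothetical bookkeeping for the slide's definition:
   fragmentation = max allocated / max in use. */
typedef struct {
    size_t allocated;      /* bytes currently held from the system */
    size_t in_use;         /* bytes currently live in the program */
    size_t max_allocated;  /* high-water mark of `allocated` */
    size_t max_in_use;     /* high-water mark of `in_use` */
} frag_stats;

/* Record that the allocator obtained `bytes` from the system. */
static void stats_grow(frag_stats *s, size_t bytes) {
    s->allocated += bytes;
    if (s->allocated > s->max_allocated) s->max_allocated = s->allocated;
}

/* Record that the program allocated an object of `bytes` bytes. */
static void stats_malloc(frag_stats *s, size_t bytes) {
    s->in_use += bytes;
    if (s->in_use > s->max_in_use) s->max_in_use = s->in_use;
}

/* Record that the program freed an object of `bytes` bytes. */
static void stats_free(frag_stats *s, size_t bytes) {
    s->in_use -= bytes;
}

/* Ratio of the two high-water marks; 0.0 if nothing was ever in use. */
static double fragmentation(const frag_stats *s) {
    if (s->max_in_use == 0) return 0.0;
    return (double)s->max_allocated / (double)s->max_in_use;
}
```

A ratio of 1.0 means the allocator never held more memory than the program needed at its peak; larger values indicate wasted space.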

Uniprocessor Allocators on Multiprocessors

Fragmentation: Excellent
- Very low for most programs [Wilson & Johnstone]

Speed & Scalability: Poor
- Heap contention: a single lock protects the heap
- Can exacerbate false sharing: different processors can share cache lines

Allocator-Induced False Sharing

Allocators cause false sharing!
- Cache lines can end up spread across a number of processors
- Practically all allocators do this

[Figure: processor 1 and processor 2 each call malloc(s) (x1 = malloc(s); x2 = malloc(s);). Both objects fall in a single cache line, which thrashes between the processors.]
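
Whether two objects can falsely share depends only on their addresses and the line size. The check below is a rough sketch: the 64-byte `CACHE_LINE` value is an assumption (real hardware varies), and `line_of` and `share_cache_line` are hypothetical helper names. Object sizes are assumed to be nonzero.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; not taken from the slides. */
#define CACHE_LINE 64

/* Index of the cache line that contains address `addr`. */
static uintptr_t line_of(uintptr_t addr) {
    return addr / CACHE_LINE;
}

/* Returns 1 if the byte ranges [a, a + a_size) and [b, b + b_size)
   touch at least one common cache line, 0 otherwise. */
static int share_cache_line(uintptr_t a, size_t a_size,
                            uintptr_t b, size_t b_size) {
    uintptr_t a_first = line_of(a), a_last = line_of(a + a_size - 1);
    uintptr_t b_first = line_of(b), b_last = line_of(b + b_size - 1);
    return a_first <= b_last && b_first <= a_last;
}
```

If x1 and x2 from the figure are small objects handed out back to back from one heap, a check like this reports a shared line, and writes from the two processors then invalidate each other's cached copy.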

Existing Multiprocessor Allocators

Speed:
- One concurrent heap (e.g., concurrent B-tree): too expensive
  - too many locks/atomic updates
  - O(log n) cost per memory operation
- Fast allocators use multiple heaps

Scalability:
- Allocator-induced false sharing and other bottlenecks

Fragmentation:
- P-fold increase or even unbounded

Multiprocessor Allocator I: Pure Private Heaps

Pure private heaps: one heap per processor.
- malloc gets memory from the processor's heap or the system
- free puts memory on the processor's heap

Avoids heap contention

Examples: STL, ad hoc (e.g., Cilk 4.1)

[Figure: processor 1 and processor 2, each with a private heap, performing x1 = malloc(s), x2 = malloc(s), free(x1), free(x2), x3 = malloc(s), x4 = malloc(s). Legend: allocated by heap 1; free, on heap 2.]

How to Break Pure Private Heaps: Fragmentation

Pure private heaps: memory consumption can grow without bound!

Producer-consumer:
- processor 1 allocates
- processor 2 frees
- freed blocks pile up on processor 2's heap, where processor 1 never reuses them

[Figure: processor 1 performs x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 performs free(x1), free(x2), free(x3).]
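
The unbounded growth can be reproduced with a toy model. This is a sketch under simplifying assumptions, not any real allocator: all blocks have the same size, a heap is reduced to a count of free blocks, and the names `pure_private_heaps`, `pp_malloc`, `pp_free`, and `producer_consumer` are hypothetical. Processors are numbered 0 and 1 here rather than 1 and 2.

```c
#include <stddef.h>

#define NPROC 2

/* Toy model: equal-sized blocks, so each heap is a free-block count. */
typedef struct {
    size_t free_blocks[NPROC]; /* free blocks on each processor's heap */
    size_t from_system;        /* blocks ever requested from the system */
} pure_private_heaps;

/* malloc on processor p: reuse a block from p's heap, else ask the system. */
static void pp_malloc(pure_private_heaps *h, int p) {
    if (h->free_blocks[p] > 0) h->free_blocks[p]--;
    else h->from_system++;
}

/* free on processor p: the block lands on p's heap, whoever allocated it. */
static void pp_free(pure_private_heaps *h, int p) {
    h->free_blocks[p]++;
}

/* Producer-consumer: processor 0 allocates one block, processor 1 frees it.
   Returns the number of blocks requested from the system after `rounds`. */
static size_t producer_consumer(size_t rounds) {
    pure_private_heaps h = {{0, 0}, 0};
    for (size_t i = 0; i < rounds; i++) {
        pp_malloc(&h, 0);
        pp_free(&h, 1);
    }
    return h.from_system;
}
```

At most one block is live at any time, yet the model requests one new block from the system per round, so consumption grows linearly with the number of rounds.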

Multiprocessor Allocator II: Private Heaps with Ownership

Private heaps with ownership: free puts memory back on the originating processor's heap.

Avoids unbounded memory consumption

Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]

[Figure: processor 1 and processor 2 performing x1 = malloc(s), x2 = malloc(s), free(x1), free(x2).]

How to Break Private Heaps with Ownership: Fragmentation

Private heaps with ownership: memory consumption can blow up by a factor of P.

Round-robin producer-consumer:
- processor i allocates
- processor i+1 frees

This really happens (NDS).

[Figure: processor 1, processor 2, and processor 3 performing x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), free(x1), free(x2), free(x3).]
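
The factor-of-P blowup also shows up in a toy model. As before, this is a sketch under stated assumptions rather than the behavior of ptmalloc or LKmalloc: blocks are equal-sized, a heap is a free-block count, the batch size `k` and the names `owned_heaps`, `own_malloc`, `own_free`, and `round_robin` are hypothetical, and `nproc` must not exceed `MAX_PROC`.

```c
#include <stddef.h>

#define MAX_PROC 16

/* Toy model: equal-sized blocks, one free-block count per owning heap. */
typedef struct {
    size_t free_blocks[MAX_PROC]; /* free blocks on each owner's heap */
    size_t from_system;           /* blocks ever requested from the system */
} owned_heaps;

/* malloc on processor p: reuse from p's own heap, else ask the system. */
static void own_malloc(owned_heaps *h, int p) {
    if (h->free_blocks[p] > 0) h->free_blocks[p]--;
    else h->from_system++;
}

/* free of a block allocated by `owner`: it returns to the owner's heap,
   no matter which processor calls free. */
static void own_free(owned_heaps *h, int owner) {
    h->free_blocks[owner]++;
}

/* Round-robin producer-consumer: in each phase processor i allocates k
   blocks and processor i+1 frees them. Returns blocks taken from the system. */
static size_t round_robin(int nproc, size_t k, int sweeps) {
    owned_heaps h = {{0}, 0};
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < nproc; i++) {
            for (size_t j = 0; j < k; j++) own_malloc(&h, i);
            /* The frees are issued by processor i+1, but ownership
               sends the blocks back to heap i. */
            for (size_t j = 0; j < k; j++) own_free(&h, i);
        }
    }
    return h.from_system;
}
```

The program never has more than k blocks live, but each of the P heaps ends up holding k blocks that no other processor can use. Consumption is bounded, unlike pure private heaps, but it is P times what is needed.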

So What Do We Do Now?

The Hoard Multiprocessor Memory Allocator

Manages memory in page-sized superblocks of same-sized objects
- Avoids false sharing by not carving up cache lines
- Avoids heap contention: local heaps allocate & free small blocks from their set of superblocks

Adds a global heap that is a repository of superblocks

When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
- Avoids blowup in memory consumption

Hoard Example

Hoard: one heap per processor + a global heap
- malloc gets memory from a superblock on its heap.
- free returns memory to its superblock. If the heap is “too empty”, it moves a superblock to the global heap.

[Figure: processor 1's heap and the global heap, with empty fraction = 1/3. Sequence shown: x1 = malloc(s), …some mallocs, …some frees, free(x7).]
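
The “too empty” test can be sketched as a comparison of two byte counts per heap. This follows only the wording on the slides (free fraction exceeds the empty fraction) and is not Hoard's actual code: the `heap_stats` and `fraction` types and both functions are hypothetical, and the caller is assumed to pick a superblock that really belongs to the local heap.

```c
#include <stddef.h>

/* Hypothetical per-heap statistics. */
typedef struct {
    size_t in_use;    /* bytes in use on this heap */
    size_t allocated; /* bytes this heap holds in superblocks */
} heap_stats;

/* Empty fraction as a rational number, e.g. {1, 3} for 1/3. */
typedef struct {
    size_t num;
    size_t den;
} fraction;

/* Returns 1 if free / allocated > num / den. Cross-multiplies to stay
   in integers; assumes the products do not overflow size_t. */
static int too_empty(const heap_stats *h, fraction f) {
    size_t free_bytes = h->allocated - h->in_use;
    return free_bytes * f.den > h->allocated * f.num;
}

/* Called after a free: if the local heap is too empty, move one superblock
   (sb_size bytes, of which sb_in_use are live) to the global heap.
   Returns 1 if a superblock was moved, 0 otherwise. */
static int release_if_too_empty(heap_stats *local, heap_stats *global,
                                fraction f, size_t sb_size, size_t sb_in_use) {
    if (!too_empty(local, f)) return 0;
    local->allocated -= sb_size;
    local->in_use -= sb_in_use;
    global->allocated += sb_size;
    global->in_use += sb_in_use;
    return 1;
}
```

With an empty fraction of 1/3, a heap holding 12 units with 8 in use is exactly at the threshold and keeps its superblocks; once a further free drops it to 7 in use, one superblock moves to the global heap, where other processors can pick it up.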

Summary of Analytical Results

Worst-case memory consumption: O(n log(M/m) + P) [instead of O(P n log(M/m))]
- n = memory required
- M = biggest object size
- m = smallest object size
- P = number of processors

Best possible: O(n log(M/m)) [Robson]

Provably low synchronization in most cases

Experiments

Run on a dedicated 14-processor Sun Enterprise
- 300 MHz UltraSparc, 1 GB of RAM
- Solaris 2.7
- All programs compiled with g++ version 2.95.1

Allocators:
- Hoard version 2.0.2
- Solaris (system allocator)
- Ptmalloc (GNU libc, private heaps with ownership)
- mtmalloc (Sun’s “MT-hot” allocator)

Performance: threadtest

speedup(x,P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)

Performance: Larson

Server-style benchmark with sharing

Performance: false sharing

Each thread reads & writes heap data

Fragmentation Results

On most standard uniprocessor benchmarks, Hoard’s fragmentation was low:
- p2c (Pascal-to-C): 1.20
- espresso: 1.47
- LRUsim: 1.05
- Ghostscript: 1.15

Within 20% of Lea’s allocator

On the multiprocessor benchmarks and other codes:
- Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).

Hoard Conclusions

Speed: Excellent
- As fast as a uniprocessor allocator on one processor
  - amortized O(1) cost
  - 1 lock for malloc, 2 for free

Scalability: Excellent
- Scales linearly with the number of processors
- Avoids false sharing

Fragmentation: Very good
- Worst-case is provably close to ideal
- Actual observed fragmentation is low

Hoard Heap Details

“Segregated size class” allocator
- Size classes are logarithmically-spaced
- Superblocks hold objects of one size class
  - empty superblocks are “recycled”

Approximately radix-sorted:
- Allocate from mostly-full superblocks
- Fast removal of mostly-empty superblocks

[Figure: size-class bins (8, 16, 24, 32, 40, 48), each holding radix-sorted superblock lists (emptiest to fullest) of superblocks.]
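
The two lookups on this slide, size to size class and superblock to fullness list, might be sketched as follows. Every constant here is an assumption rather than Hoard's configuration: the sketch spaces classes by powers of two starting at 8 bytes (the figure labels its small bins in multiples of 8), uses four fullness groups, and the function names are hypothetical.

```c
#include <stddef.h>

#define MIN_SIZE 8    /* assumed smallest size class, in bytes */
#define NUM_GROUPS 4  /* assumed number of fullness groups per size class */

/* Size class under the assumed power-of-two spacing: class c serves
   requests of up to MIN_SIZE << c bytes. Assumes size <= SIZE_MAX / 2
   so the shift cannot overflow. */
static unsigned size_class(size_t size) {
    unsigned c = 0;
    size_t limit = MIN_SIZE;
    while (limit < size) {
        limit <<= 1;
        c++;
    }
    return c;
}

/* Fullness group of a superblock with `used` of `capacity` objects
   allocated: 0 is the emptiest group, NUM_GROUPS - 1 the fullest. */
static unsigned fullness_group(size_t used, size_t capacity) {
    if (used >= capacity) return NUM_GROUPS - 1;
    return (unsigned)(used * NUM_GROUPS / capacity);
}
```

Bucketing superblocks by fullness keeps the lists only approximately sorted, which is enough for both goals on the slide: malloc scans from the fullest group that still has room, and a mostly-empty superblock can be found in group 0 without searching.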

