Database Architectures for New Hardware a tutorial by Anastassia Ailamaki Database Group Carnegie Mellon University http://www.cs.cmu.edu/~natassa
Transcript
Page 1: Database Architectures for New Hardwarenatassa/aapubs/tutorials/dbarchVLDB04.pdf · 2004. 9. 10. · @Carnegie Mellon Databases ©2004 Anastassia Ailamaki 9 Breaking the Memory Wall

Database Architectures for New Hardware

a tutorial by Anastassia Ailamaki

Database Group, Carnegie Mellon University

http://www.cs.cmu.edu/~natassa

Page 2

@Carnegie MellonDatabases

2©2004 Anastassia Ailamaki

Focus of this tutorial
DB workload execution on a modern computer

[Figure: processor execution time (0% to 100%), split into BUSY and IDLE; bars labeled Ideal, seq. scan, index scan, DSS, OLTP]

DBMS can run MUCH faster if they use new hardware efficiently

Page 3

Trends in processor performance
Scaling # of transistors, innovative microarchitecture
Higher performance, despite technological hurdles!

Processor speed doubles every 18 months

Page 4

Trends in Memory (DRAM) Performance

Memory capacity increases exponentially
  DRAM fabrication primarily targets density
Speed increases linearly

Larger but not as much faster memories

[Figure: DRAM size by year, 1980 to 2005, log scale, with points from 64KB to 4GB]
[Figure: DRAM speed trends by year of introduction, 1980 to 1994: slowest RAS, fastest RAS, CAS, and cycle time (ns) for 64 Kbit to 64 Mbit chips]

Page 5

The Memory/Processor Speed Gap

Trip to memory = thousands of instructions!

[Figure: processor cycles per instruction (CPU) and cycles per access to DRAM (Memory), both on a log scale from 0.01 to 1000, for VAX/1980, PPro/1996, and 2010+]

Page 6

New Hardware

Caches trade off capacity for speed
Exploit instruction/data locality
Demand fetch/wait for data

[ADH99]:
  Running top 4 database systems
  At most 50% CPU utilization

But wait a minute… Isn’t I/O the bottleneck???

[Figure: memory hierarchy between CPU and memory: L1 64K, L2 2M, L3 32M, memory 4GB to 1TB, with latency labels of 1, 10, 100, and 1000 clk; a further label reads 100G]

Page 7

Modern storage managers

Several decades of work to hide I/O
Asynchronous I/O + Prefetch & Postwrite
  Overlap I/O latency with useful computation
Parallel data access
  Partition data on modern disk array [PAT88]
Smart data placement / clustering
  Improve data locality
  Maximize parallelism
  Exploit hardware characteristics
…and larger main memories fit more data
  1MB in the 80’s, 10GB today, TBs coming soon

DB storage mgrs efficiently hide I/O latencies

Page 8

Why should we (databasers) care?

[Figure: cycles per instruction for the theoretical minimum, Desktop/Engineering (SPECInt), Decision Support (TPC-H), and Online Transaction Processing (TPC-C); bar labels 0.33, 0.8, 1.4, and 4, with the last two marked DB]

Database workloads under-utilize hardware
New bottleneck: Processor-memory delays

Page 9

Breaking the Memory Wall

Wish for a Database Architecture:
  that uses hardware intelligently
  that won’t fall apart when new computers arrive
  that will adapt to alternate configurations

Efforts from multiple research communities
  Cache-conscious data placement and algorithms
  Instruction stream optimizations
  Novel database software architectures
  Novel hardware designs (covered briefly)

Page 10

Detailed Outline

Introduction and Overview
New Hardware
  Execution Pipelines
  Cache memories
Where Does Time Go?
  Measuring Time (Tools and Benchmarks)
  Analyzing DBs: Experimental Results
Bridging the Processor/Memory Speed Gap
  Data Placement
  Access Methods
  Query Processing Algorithms
  Instruction Stream Optimizations
  Staged Database Systems
  Newer Hardware
Hip and Trendy
  Query co-processing
  Databases on MEMStore
Directions for Future Research

Page 11

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 12

This Section’s Goals

Understand how a program is executed
  How new hardware parallelizes execution
  What are the pitfalls
Understand why database programs do not take advantage of microarchitectural advances
Understand memory hierarchies
  How they work
  What are the parameters that affect program behavior
  Why they are important to database performance

Page 13

Outline

Introduction and Overview
New Hardware
  Execution Pipelines
  Cache memories
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 14

Sequential Program Execution

Sequential Code
  i1: xxxx
  i2: xxxx
  i3: xxxx

Precedences: overspecifications
Sufficient, NOT necessary for correctness

Instruction-level Parallelism (ILP)
  pipelining
  superscalar execution

[Figure: two diagrams of instructions i1, i2, i3, one for sequential code and one for ILP]

Modern processors do both!

Page 15

Pipelined Program Execution

Stages: fetch, decode, execute, memory, write

        t0  t1  t2  t3  t4  t5
Inst1   F   D   E   M   W
Inst2       F   D   E   M   W
Inst3           F   D   E   M   W

Tpipeline = Tbase / 5

[Figure: instruction stream passing through FETCH, EXECUTE, and RETIRE (write results)]

Page 16

Pipeline Stalls (delays)

Reason: dependencies between instructions
  E.g., Inst1: r1 ← r2 + r3
        Inst2: r4 ← r1 + r2
  Read-after-write (RAW)

[Figure: pipeline diagrams (F D E M W over t0 to t5) for Inst1, Inst2, and Inst3, first without stalls and then with an E stall and a D stall caused by the dependency]

peak ILP = d
Peak instruction-per-cycle (IPC) = CPI = 1

DB programs: frequent data dependencies
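One rough way to see the effect of pipelining and stalls is to count cycles. The sketch below assumes that d is the number of pipeline stages (5 for F D E M W), that the pipeline starts one instruction per cycle, and that each stall simply adds one cycle. The function names and the stall count in the example are illustrative assumptions, not figures from the tutorial.

```python
def unpipelined_cycles(n_instructions: int, depth: int = 5) -> int:
    """Cycles when each instruction finishes all stages before the next starts."""
    return n_instructions * depth


def pipelined_cycles(n_instructions: int, depth: int = 5, stall_cycles: int = 0) -> int:
    """Cycles for an in-order pipeline that starts one instruction per cycle.

    The first instruction needs `depth` cycles to drain; each later one
    completes one cycle after its predecessor, plus any stall cycles.
    """
    if n_instructions == 0:
        return 0
    return depth + (n_instructions - 1) + stall_cycles


if __name__ == "__main__":
    n = 1000
    base = unpipelined_cycles(n)
    ideal = pipelined_cycles(n)
    stalled = pipelined_cycles(n, stall_cycles=300)  # hypothetical stall total
    print(f"speedup without stalls: {base / ideal:.2f}")
    print(f"speedup with stalls:    {base / stalled:.2f}")
    print(f"CPI with stalls:        {stalled / n:.2f}")
```

For long instruction streams the stall-free speedup approaches the pipeline depth, which matches Tpipeline = Tbase / 5; every stall cycle pushes CPI above the peak value of 1.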

Page 17

Higher ILP: Superscalar Out-of-Order

[Figure: pipeline diagram (F D E M W over t0 to t5) in which groups of at most n instructions, Inst1…n, Inst(n+1)…2n, and Inst(2n+1)…3n, move through the stages together]

peak ILP = d*n
Peak instruction-per-cycle (IPC) = n (CPI = 1/n)

Out-of-order (as opposed to “inorder”) execution:
  Shuffle execution of independent instructions
  Retire instruction results using a reorder buffer

DB: 1.5x faster than inorder [KPH98,RGA98]
Limited ILP opportunity

Page 18

Even Higher ILP: Branch Prediction

Which instruction block to fetch?
  Evaluating a branch condition causes pipeline stall
IDEA: Speculate branch while evaluating C!
  Record branch history in a buffer, predict A or B
  If correct, saved a (long) delay!
  If incorrect, misprediction penalty = flush pipeline, fetch correct instruction stream
Excellent predictors (97% accuracy!)
Mispredictions costlier in OOO
  1 lost cycle = >1 missed instructions!

     xxxx
     if C goto B
  A: xxxx ...    (C false: fetch A)
  B: xxxx ...    (C true: fetch B)

DB programs: long code paths => mispredictions
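To make "record branch history in a buffer" concrete, the sketch below uses a table of 2-bit saturating counters indexed by branch address. This is one common textbook scheme, chosen here as an assumption; the tutorial does not say which predictor it has in mind, and the table size, class name, and branch trace are made up for illustration.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters (0-1: not taken, 2-3: taken)."""

    def __init__(self, table_size: int = 1024):
        self.table_size = table_size
        self.counters = [1] * table_size  # start in the weak "not taken" state

    def predict(self, branch_address: int) -> bool:
        return self.counters[branch_address % self.table_size] >= 2

    def update(self, branch_address: int, taken: bool) -> None:
        index = branch_address % self.table_size
        if taken:
            self.counters[index] = min(3, self.counters[index] + 1)
        else:
            self.counters[index] = max(0, self.counters[index] - 1)


def accuracy(trace, predictor) -> float:
    """Fraction of (address, taken) outcomes in `trace` predicted correctly."""
    correct = 0
    for address, taken in trace:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)  # learn the real outcome afterwards
    return correct / len(trace)


if __name__ == "__main__":
    # Hypothetical loop branch at address 0x40: taken 99 times, then falls through.
    loop_trace = [(0x40, True)] * 99 + [(0x40, False)]
    print(f"loop branch accuracy: {accuracy(loop_trace, TwoBitPredictor()):.2f}")
```

A regular loop branch is mispredicted only while the counter warms up and at loop exit. Long code paths with many distinct, rarely repeated branches give such a table far less history to work with.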

Page 19

Outline

Introduction and Overview
New Hardware
  Execution Pipelines
  Cache memories
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 20

Memory Hierarchy

Make common case fast
  common: temporal & spatial locality
  fast: smaller, more expensive memory
Keep recently accessed blocks (temporal locality)
Group data into blocks (spatial locality)

[Figure: hierarchy of Registers, Caches, Memory, and Disks, with arrows labeled Faster and Larger]

DB programs: >50% load/store instructions

Page 21

Cache Contents

Keep recently accessed block in “cache line”
  cache line fields: address | state | data

On memory read
  if incoming address = a stored address tag then
    HIT: return data
  else
    MISS: choose & displace a line in use
          fetch new (referenced) block from memory into line
          return data

Important parameters:
cache size, cache line size, cache associativity
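The read path above can be sketched as a toy direct-mapped cache that tracks only address tags. The class name, the 4-line capacity, and the 64-byte line size are assumptions picked for the example; data and state bits are left out.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that stores only tags (no data, no state bits)."""

    def __init__(self, num_lines: int = 4, line_size: int = 64):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # stored address tag per cache line
        self.hits = 0
        self.misses = 0

    def read(self, address: int) -> bool:
        """Look up `address`; return True on a hit, False on a miss."""
        block = address // self.line_size  # which memory block is referenced
        index = block % self.num_lines     # the one line this block may occupy
        tag = block // self.num_lines      # identifies the block within that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # MISS: displace whatever is in the line and fetch the referenced block.
        self.tags[index] = tag
        self.misses += 1
        return False


if __name__ == "__main__":
    cache = DirectMappedCache()
    for address in (0, 8, 64, 0, 256, 0):
        print(address, "hit" if cache.read(address) else "miss")
```

In the usage loop, address 256 maps to the same line as address 0 and displaces it, so the final read of address 0 misses again even though three of the four lines are unused.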

Page 22

Cache Associativity

means # of lines a block can be in (set size)
Replacement: LRU or random, within set

Fully-associative: a block goes in any frame
Set-associative: a block goes in any frame in exactly one set
Direct-mapped: a block goes in exactly one frame

[Figure: an 8-frame cache drawn three ways: lines 0 to 7 (fully-associative), sets 0 to 3 with lines 0 and 1 each (set-associative), and lines 0 to 7 (direct-mapped)]

lower associativity ⇒ faster lookup
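The three placement policies differ only in how many frames a block may occupy. The helper below is a minimal sketch under two assumptions that the slide does not state: the set is chosen as block number modulo the number of sets, and the frames of a set are numbered consecutively. The function name is hypothetical.

```python
def candidate_frames(block: int, num_frames: int, ways: int) -> list[int]:
    """Frames that may hold `block` in a cache of `num_frames` frames.

    ways == 1 is direct-mapped, ways == num_frames is fully-associative,
    and anything in between is set-associative.
    """
    if num_frames % ways != 0:
        raise ValueError("ways must divide num_frames")
    num_sets = num_frames // ways
    set_index = block % num_sets  # assumed modulo placement
    first = set_index * ways
    return list(range(first, first + ways))


if __name__ == "__main__":
    for ways in (1, 2, 8):
        print(f"{ways}-way: block 12 may go in frames {candidate_frames(12, 8, ways)}")
```

A lookup has to compare the incoming tag against every candidate frame, which is why fewer candidates (lower associativity) allow a faster lookup.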

Page 23

Miss Classification (3+1 C’s)

compulsory (cold)
  “cold miss” on first access to a block
  defined as: miss in infinite cache
capacity
  misses occur because cache not large enough
  defined as: miss in fully-associative cache
conflict
  misses occur because of restrictive mapping strategy
  only in set-associative or direct-mapped cache
  defined as: not attributable to compulsory or capacity
coherence
  misses occur because of sharing among multiprocessors

Cold misses are unavoidable
Capacity, conflict, and coherence misses can be reduced
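The first three definitions suggest a way to classify misses by replaying one trace against three caches: an infinite one, a fully-associative one of the same capacity, and the actual organization. The sketch below assumes LRU replacement, modulo set selection, and a single processor (so coherence misses are not modeled); the function names are hypothetical.

```python
from collections import OrderedDict


def lru_misses(blocks, num_sets, ways):
    """Trace positions that miss in an LRU cache of num_sets x ways lines."""
    sets = [OrderedDict() for _ in range(num_sets)]
    missed = set()
    for position, block in enumerate(blocks):
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # refresh LRU position on a hit
        else:
            missed.add(position)
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return missed


def classify_misses(blocks, num_sets, ways):
    """Count compulsory, capacity, and conflict misses for a block-address trace."""
    seen, cold = set(), set()
    for position, block in enumerate(blocks):
        if block not in seen:              # first access: misses even in an infinite cache
            cold.add(position)
            seen.add(block)
    fully_assoc = lru_misses(blocks, 1, num_sets * ways)  # same capacity, one set
    actual = lru_misses(blocks, num_sets, ways)
    return {
        "compulsory": len(cold),
        "capacity": len(fully_assoc - cold),
        "conflict": len(actual - fully_assoc),
    }


if __name__ == "__main__":
    # Blocks 0 and 4 collide in a 4-line direct-mapped cache.
    print(classify_misses([0, 4, 0, 4], num_sets=4, ways=1))
```

Because LRU in a fully-associative cache can occasionally miss where the real cache hits, the three counts need not add up exactly to the real miss count on every trace.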

Page 24

Lookups in Memory Hierarchy

$\text{miss rate} = \dfrac{\#\,\text{misses}}{\#\,\text{references}}$

[Figure: EXECUTION PIPELINE, L1 I-CACHE and L1 D-CACHE, L2 CACHE, and MAIN MEMORY (marked $$$)]

L1: Split, 16-64K each. As fast as processor (1 cycle)
L2: Unified, 512K-8M. Order of magnitude slower than L1
(there may be more cache levels)
Memory: Unified, 512M-8GB. ~400 cycles (Pentium4)

Trips to memory are most expensive

Page 25

Miss penalty

means the time to fetch and deliver block

$\text{avg}(t_{\text{access}}) = t_{\text{hit}} + \text{miss rate} \times \text{avg}(\text{miss penalty})$

[Figure: EXECUTION PIPELINE, L1 I-CACHE and L1 D-CACHE, L2 CACHE, and MAIN MEMORY (marked $$$)]

Modern caches: non-blocking
L1D: low miss penalty, if L2 hit (partly overlapped with OOO execution)
L1I: In critical execution path. Cannot be overlapped with OOO execution.
L2: High penalty (trip to memory)

DB: long code paths, large data footprints
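As a worked example, the average access time formula can be applied level by level, treating the L1 miss penalty as the average access time of L2. The latencies follow the rough figures quoted in this tutorial (1 cycle for L1, an order of magnitude more for L2, about 400 cycles for memory); the miss rates are hypothetical, and the sketch ignores any overlap from out-of-order execution or non-blocking caches.

```python
def average_access_time(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """avg(t_access) = t_hit + miss rate * avg(miss penalty), all times in cycles."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    memory_latency = 400  # cycles, "~400 cycles (Pentium4)"
    l2_hit_time = 10      # cycles, an order of magnitude slower than L1
    l1_hit_time = 1       # cycle
    l1_miss_rate = 0.05   # hypothetical
    l2_miss_rate = 0.20   # hypothetical, over the accesses that reach L2

    # The L1 miss penalty is itself the average access time of L2.
    l2_time = average_access_time(l2_hit_time, l2_miss_rate, memory_latency)
    l1_time = average_access_time(l1_hit_time, l1_miss_rate, l2_time)
    print(f"average L2 access time: {l2_time:.1f} cycles")
    print(f"average access time seen by the pipeline: {l1_time:.1f} cycles")
```

Even with these modest miss rates, the memory term dominates the L2 average, which is the sense in which trips to memory are the most expensive.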

Page 26

Typical processor microarchitecture

[Figure: Processor containing I-Unit and E-Unit, Regs, I-TLB and D-TLB, L1 I-Cache and L1 D-Cache, and L2 Cache (SRAM on-chip); L3 Cache (SRAM off-chip) and Main Memory (DRAM)]

TLB: Translation Lookaside Buffer (page table cache)

Will assume a 2-level cache in this talk

Page 27

Summary: New Hardware

Fundamental goal in processor design: max ILP
  Pipelined, superscalar, speculative execution
  Out-of-order execution
  Non-blocking caches
  Dependencies in instruction stream lower ILP
Deep memory hierarchies
  Caches important for database performance
  Level 1 instruction cache in critical execution path
  Trips to memory most expensive
DB workloads on new hardware
  Too many load/store instructions
  Tight dependencies in instruction stream
  Algorithms not optimized for cache hierarchies
  Long code paths
  Large instruction and data footprints

Page 28

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 29

This Section’s Goals

Hardware takes time: how do we measure time?
Understand how to efficiently analyze microarchitectural behavior of database workloads
  Should we use simulators? When? Why?
  How do we use processor counters?
  Which tools are available for analysis?
  Which database systems/benchmarks to use?
Survey experimental results on workload characterization
  Discover what matters for database performance

Page 30

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
  Measuring Time (Tools and Benchmarks)
  Analyzing DBs: Experimental Results
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 31

Simulator vs. Real Machine

Real machine
  Limited to available hardware counters/events
  Limited to (real) hardware configurations
  Fast (real-life) execution
    Enables testing real (large and more realistic) workloads
  Sometimes not repeatable
  Tool: performance counters

Simulator
  Can measure any event
  Vary hardware configurations
  (Too) Slow execution
    Often forces use of scaled-down/simplified workloads
  Always repeatable
  Virtutech Simics, SimOS, SimpleScalar, etc.

Real-machine experiments to locate problems
Simulation to evaluate solutions

Page 32

Hardware Performance Counters

What are they?
  Special purpose registers that keep track of programmable events
  Non-intrusive counts “accurately” measure processor events
  Software APIs handle event programming/overflow
  GUI interfaces built on top of APIs to provide higher-level analysis
What can they count?
  Instructions, branch mispredictions, cache misses, etc.
  No standard set exists
Issues that may complicate life
  Provide only hard counts; analysis must be done by user or tools
  Made specifically for each processor
    even processor families may have different interfaces
  Vendors don’t like to support them because they do not contribute to profit

Page 33

Evaluating Behavior using HW Counters

Stall time (cycle) counters
  very useful for time breakdowns
  (e.g., instruction-related stall time)
Event counters
  useful to compute ratios
  (e.g., # misses in L1-Data cache)
Need to understand counters before using them
  Often not easy from documentation
  Best way: microbenchmark (run programs with pre-computed events)
    E.g., strided accesses to an array
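The "pre-computed events" side of such a microbenchmark can be worked out before anything is run. The sketch below computes how many distinct cache lines one pass over an array touches at a given stride; that number is what an L1 data miss counter would be expected to report for a cold cache. The 32-byte line size, the aligned array start, and the function name are assumptions; reading the actual counter requires a platform tool and is not shown.

```python
def expected_line_fetches(array_bytes: int, stride_bytes: int, line_bytes: int = 32) -> int:
    """Distinct cache lines touched by one pass over an array at a fixed stride.

    Assumes the array starts on a line boundary and was not cached beforehand,
    so every distinct line costs exactly one miss.
    """
    touched = {offset // line_bytes for offset in range(0, array_bytes, stride_bytes)}
    return len(touched)


if __name__ == "__main__":
    array_bytes = 1 << 20  # hypothetical 1 MB array
    for stride in (4, 16, 32, 64, 128):
        lines = expected_line_fetches(array_bytes, stride)
        accesses = len(range(0, array_bytes, stride))
        print(f"stride {stride:4d}: {accesses:7d} accesses, {lines:6d} expected line fetches")
```

If the measured count stays flat for strides up to the line size and then falls in step with the number of accesses, the counter most likely measures what its documentation claims.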

Page 34

Example: Intel PPRO/PIII

Measured quantity             Counter
Cycles (“time”)               CPU_CLK_UNHALTED
Instructions                  INST_RETIRED
L1 Data (L1D) accesses        DATA_MEM_REFS
L1 Data (L1D) misses          DCU_LINES_IN
L2 Misses                     L2_LINES_IN
Instruction-related stalls    IFU_MEM_STALL
Branches                      BR_INST_DECODED
Branch mispredictions         BR_MISS_PRED_RETIRED
TLB misses                    ITLB_MISS
L1 Instruction misses         IFU_IFETCH_MISS
Dependence stalls             PARTIAL_RAT_STALLS
Resource stalls               RESOURCE_STALLS

Lots more detail, measurable events, statistics
Often >1 ways to measure the same thing

Page 35

Producing time breakdowns

Determine benchmark/methodology (more later)
Devise formulae to derive useful statistics
Determine (and test!) software
  E.g., Intel Vtune (GUI, sampling), or emon
  Publicly available & universal (e.g., PAPI [DMM04])
Determine time components T1….Tn
  Determine how to measure each using the counters
  Compute execution time as the sum
Verify model correctness
  Measure execution time (in #cycles)
  Ensure measured time = computed time (or almost)
  Validate computations using redundant formulae
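The verification step can be sketched as a small check that sums the time components and compares the result with the measured cycle count. The component names, the 5% tolerance, and all cycle counts below are hypothetical; how each component is derived from raw counters depends on the processor and is not shown here.

```python
def check_breakdown(components: dict, measured_cycles: int, tolerance: float = 0.05):
    """Sum time components T1..Tn and compare the total with measured cycles.

    Returns (computed_cycles, within_tolerance, share_of_each_component).
    """
    computed = sum(components.values())
    relative_error = abs(computed - measured_cycles) / measured_cycles
    shares = {name: cycles / computed for name, cycles in components.items()}
    return computed, relative_error <= tolerance, shares


if __name__ == "__main__":
    # Hypothetical per-component cycle estimates derived from counters.
    components = {
        "computation": 400_000,
        "memory stalls": 450_000,
        "branch mispredictions": 100_000,
        "resource stalls": 50_000,
    }
    computed, ok, shares = check_breakdown(components, measured_cycles=1_020_000)
    print(f"computed {computed} cycles, model {'accepted' if ok else 'rejected'}")
    for name, share in shares.items():
        print(f"  {name}: {share:.0%}")
```

A model that fails this check usually points to a misunderstood counter or to components that overlap in time, both of which should be resolved before the breakdown is interpreted.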

Page 36

Execution Time Breakdown Formula [ADH99]

Components:
  Computation
  Stalls
    Memory
    Branch Mispredictions
    Hardware Resources

Overlap opportunity:
  Load A
  D=B+C
  Load E

Execution Time = Computation + Stalls
Execution Time = Computation + Stalls - Overlap

Page 37

Where Does Time Go (memory)? [ADH99]

Components: Computation, Memory, Branch Mispredictions, Hardware Resources

$\text{Memory Stalls} = \sum_{n} (\text{stalls at cache level } n)$

L1I: Instruction lookup missed in L1I, hit in L2
L1D: Data lookup missed in L1D, hit in L2
L2: Instruction or data lookup missed in L1, missed in L2, hit in memory

Page 38

What to measure?

Decision Support System (DSS: TPC-H)
  Complex queries, low-concurrency
  Read-only (with rare batch updates)
  Sequential access dominates
  Repeatable (unit of work = query)
On-Line Transaction Processing (OLTP: TPC-C, ODB)
  Transactions with simple queries, high-concurrency
  Update-intensive
  Random access frequent
  Not repeatable (unit of work = 5s of execution after rampup)

Often too complex to provide useful insight

Page 39

Microbenchmarks [KPH98,ADH99,KP00,SAF04]

What matters is basic execution loops
Isolate three basic operations:
  Sequential scan (no index)
  Random access on records (non-clustered index)
  Join (access on two tables)
Vary parameters:
  selectivity, projectivity, # of attributes in predicate
  join algorithm, isolate phases
  table size, record size, # of fields, type of fields
Determine behavior and trends
  Microbenchmarks can efficiently mimic TPC microarchitectural behavior!
  Widely used to analyze query execution

Excellent for microarchitectural analysis

Page 40

On which DBMS to measure? [ADH02]

Commercial DBMS are most realistic
  Difficult to set up, may need help from companies
Prototypes can evaluate techniques
  Shore [ADH01] (for PAX), PostgreSQL [TLZ97] (eval)
  Tricky: similar behavior to commercial DBMS?
  Shore: YES!

[Figure: execution time breakdown (% execution time: Computation, Memory, Branch mispr., Resource) for DBMS A, B, C, D, and Shore]
[Figure: memory stall time breakdown (%: L1 Data, L2 Data, L1 Instruction, L2 Instruction) for DBMS A, B, C, D, and Shore]

Page 41

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
  Measuring Time (Tools and Benchmarks)
  Analyzing DBs: Experimental Results
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 42

DB Execution Time Breakdown [ADH99,BGB98,BGN00,KPH98]

[Figure: execution time (0% to 100%) broken into Computation, Memory, Branch mispred., and Resource for seq. scan, TPC-D, index scan, and TPC-C]

PII Xeon, NT 4.0
Four DBMS: A, B, C, D

At least 50% cycles on stalls
Memory is major bottleneck
Branch mispredictions increase cache misses!

Page 43

DSS/OLTP basics: Cache Behavior [ADH99,ADH01]

PII Xeon running NT 4.0, used performance counters
Four commercial Database Systems: A, B, C, D

[Figure: memory stall time (%) split into L1 Data, L2 Data, L1 Instruction, and L2 Instruction; panels: table scan, clustered index scan, and Join (no index) for DBMS A, B, C, D, and non-clustered index for DBMS B, C, D]

Bottlenecks: data in L2, instructions in L1
Random access (OLTP): L1I-bound

Page 44

Why Not Increase L1I Size? [HA04]

Problem: a larger cache is typically a slower cache
Not a big problem for L2
L1I: in critical execution path
  slower L1I: slower clock

[Figure: trends in cache size (log scale, 10 KB to 10 MB) by year introduced, ‘96 to ‘04: L1-I cache and max on-chip L2/L3 cache]

L1I size is stable
L2 size increase: Effect on performance?

Page 45

Increasing L2 Cache Size [BGB98,KPH98]

DSS: Performance improves as L2 cache grows
Not as clear a win for OLTP on multiprocessors
  Reduce cache size ⇒ more capacity/conflict misses
  Increase cache size ⇒ more coherence misses

[Figure: % of L2 cache misses to dirty data in another processor's cache (0% to 25%) for 1P, 2P, and 4P, with L2 sizes of 256KB, 512KB, and 1MB]

Larger L2: trade-off for OLTP

Page 46

Summary: Where Does Time Go?

Goal: discover bottlenecks
  Hardware performance counters ⇒ time breakdown
  Tools available for access and analysis (+simulators)
  Run commercial DBMS and equivalent prototypes
  Microbenchmarks offer valuable insight
Database workloads: more than 50% stalls
  Mostly due to memory delays
  Cannot always reduce stalls by increasing cache size
Crucial bottlenecks
  Data accesses to L2 cache (esp. for DSS)
  Instruction accesses to L1 cache (esp. for OLTP)

Page 47

How to Address Bottlenecks

Memory
  D-cache (D): DBMS
  I-cache (I): DBMS + Compiler
Branch Mispredictions (B): Compiler + Hardware
Hardware Resources (R): Hardware

Next: Optimizing cache accesses

Page 48

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 49

This Section’s Goals

Survey techniques to improve locality
  Relational data
  Access methods
Survey new query processing algorithms
Present a new database system architecture
Briefly explain Instruction Stream Optimizations

Show how much good understanding of the platform can achieve

Page 50

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
  Data Placement
  Access Methods
  Query Processing
  Instruction Stream Optimizations
  Staged Database Systems
  Newer Hardware
Hip and Trendy
Directions for Future Research

Page 51

Current Database Storage Managers

Same layout on disk/memory
Multi-level storage hierarchy
  different devices at each level
  different “optimal” access on each device
Variable workloads and access patterns
  OLTP: Full-record access
  DSS: Partial-record access
  no optimal “universal” layout

[Figure: storage hierarchy of CPU cache, main memory, and non-volatile storage]

Goal: Reduce data traffic in memory hierarchy

Page 52

“Classic” Data Layout on Disk Pages

NSM (n-ary Storage Model, or Slotted Pages)

Table R:
RID  SSN   Name   Age
1    1237  Jane   30
2    4322  John   45
3    1563  Jim    20
4    7658  Susan  52
5    2534  Leon   43
6    8791  Dan    37

NSM page:
PAGE HEADER | RH1 1237 Jane 30 | RH2 4322 John 45 | RH3 1563 Jim 20 | RH4 7658 Susan 52 | •••

Records are stored sequentially
Attributes of a record are stored together

Page 53

NSM in Memory Hierarchy

select name
from R
where age > 50

Query accesses all attributes (full-record access)
Query evaluates attribute “age” (partial-record access)

[Figure: the same NSM page (PAGE HEADER, 1237 Jane 30, 4322 John 45, 1563 Jim 20, 7658 Susan 52) on DISK and in MAIN MEMORY; in the CPU CACHE it is split into Block 1 “30 4322 Jo”, Block 2 “hn 45 1563”, Block 3 “Jim 20 7658”, and Block 4 “Susan 52”; a label reads BEST]

Optimized for full-record access
Slow partial-record access
  Wastes I/O bandwidth (fixed page layout)
  Low spatial locality at CPU cache

Page 54:

Decomposition Storage Model (DSM) [CK85]

Relation R:

EID   Name   Age
1237  Jane   30
4322  John   45
1563  Jim    20
7658  Susan  52
2534  Leon   43
8791  Dan    37

Partition original table into n 1-attribute sub-tables

Page 55:

Decomposition Storage Model (DSM) [CK85]

Sub-table R1      Sub-table R2      Sub-table R3
RID  EID          RID  Name         RID  Age
1    1237         1    Jane         1    30
2    4322         2    John         2    45
3    1563         3    Jim          3    20
4    7658         4    Susan        4    52
5    2534         5    Leon         5    43
6    8791         6    Dan          6    37

[Figure: each sub-table is stored in its own 8KB page, e.g. PAGE HEADER | 1 1237 | 2 4322 | 3 1563 | 4 7658 for R1, PAGE HEADER | 1 Jane | 2 John | 3 Jim | 4 Susan for R2, and PAGE HEADER | 1 30 | 2 45 | 3 20 | 4 52 for R3]

Partition original table into n 1-attribute sub-tables
Each sub-table stored separately in NSM pages
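The decomposition can be sketched in a few lines of Python. This is only an illustration on the six example records: the names `decompose` and `reconstruct` are hypothetical, the sub-tables are plain lists rather than 8KB pages, and RIDs are assumed to be assigned 1..n in table order.

```python
# Hypothetical in-memory stand-in for relation R from the slides.
ATTRS = ("EID", "Name", "Age")
R = [
    (1237, "Jane", 30),
    (4322, "John", 45),
    (1563, "Jim", 20),
    (7658, "Susan", 52),
    (2534, "Leon", 43),
    (8791, "Dan", 37),
]


def decompose(rows, attrs):
    """Split an n-attribute table into n sub-tables of (RID, value) pairs."""
    return {
        name: [(rid, row[i]) for rid, row in enumerate(rows, start=1)]
        for i, name in enumerate(attrs)
    }


def reconstruct(sub_tables, attrs, rid):
    """Rebuild one full record by looking up the same RID in every sub-table."""
    return tuple(dict(sub_tables[name])[rid] for name in attrs)


sub = decompose(R, ATTRS)

# Partial-record access: the predicate on Age reads only the Age sub-table,
# and the projection on Name reads only the Name sub-table.
matching_rids = [rid for rid, age in sub["Age"] if age > 50]
names = [dict(sub["Name"])[rid] for rid in matching_rids]

# Full-record access: every sub-table must be visited for each record.
full_record = reconstruct(sub, ATTRS, matching_rids[0])
```

The sketch makes the trade-off visible: the scan on Age never touches EID or Name values, while rebuilding a single record needs one lookup per attribute.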

Page 56:

DSM in Memory Hierarchy

Query: select name from R where age > 50

[Figure: the EID, Name, and Age sub-table pages on DISK and in MAIN MEMORY; in the CPU CACHE, the Age page occupies block 1 (“1 30 2 45”) and block 2 (“3 20 4 52”)]

Query accesses all attributes (full-record access): Costly
Query accesses attribute “age” (partial-record access): BEST

Optimized for partial-record access
Slow full-record access
    Reconstructing full record may incur random I/O

Page 57:

Repairing NSM’s cache performance

We need a data placement that…
    Eliminates unnecessary memory accesses
    Improves inter-record locality
    Keeps a record’s fields together
    Does not affect NSM’s I/O performance

and, most importantly, is…

low-implementation-cost, high-impact

Page 58:

Partition Attributes Across (PAX) [ADH01]

[Figure: NSM PAGE: PAGE HEADER | RH1 1237 Jane 30 | RH2 4322 John 45 | RH3 1563 Jim 20 | RH4 7658 Susan 52 | ...]
[Figure: PAX PAGE: PAGE HEADER | minipage: 1237 4322 1563 7658 | minipage: Jane John Jim Susan | minipage: 30 45 20 52]

Idea: Partition data within page for spatial locality

Page 59:

PAX in Memory Hierarchy [ADH01]

Query: select name from R where age > 50

[Figure: the PAX page (PAGE HEADER | 1237 4322 1563 7658 | Jane John Jim Susan | 30 45 20 52) is copied unchanged from DISK to MAIN MEMORY; the CPU CACHE receives one block holding the four age values; the memory-to-cache transfer is labeled “cheap”]

Partial-record access in memory: BEST
Full-record access on disk: BEST

Optimizes CPU cache-to-memory communication
Retains NSM’s I/O (page contents do not change)
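To see why the minipage layout helps a scan on one attribute, the following sketch counts how many cache lines a scan of “age” touches under each layout. It is a back-of-the-envelope model, not the PAX implementation: the field widths, the 32-byte cache line, the absence of record headers, and the line-aligned minipage are all assumptions, and both function names are hypothetical.

```python
CACHE_LINE = 32                                  # bytes per cache line (assumed)
FIELD_SIZES = {"ssn": 4, "name": 8, "age": 4}    # fixed-width fields (assumed)


def nsm_lines_touched(n_records, attr):
    """Cache lines read when scanning one attribute of row-wise records."""
    record_size = sum(FIELD_SIZES.values())
    offset_in_record = 0
    for name, size in FIELD_SIZES.items():
        if name == attr:
            break
        offset_in_record += size
    lines = set()
    for r in range(n_records):
        start = r * record_size + offset_in_record
        end = start + FIELD_SIZES[attr] - 1
        lines.update(range(start // CACHE_LINE, end // CACHE_LINE + 1))
    return len(lines)


def pax_lines_touched(n_records, attr):
    """Cache lines read when the attribute's values sit in one minipage."""
    total_bytes = n_records * FIELD_SIZES[attr]
    return -(-total_bytes // CACHE_LINE)         # ceiling division


nsm_lines = nsm_lines_touched(64, "age")
pax_lines = pax_lines_touched(64, "age")
```

Under these assumptions each NSM cache line carries only two useful age values, while each PAX cache line carries eight; the page as a whole still holds the same records, which is why the I/O behavior matches NSM.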

Page 60:

PAX Performance Results (Shore) [ADH01]

[Figure: cache data stalls: stall cycles per record (0-160) for NSM vs. PAX page layout, split into L1 data stalls and L2 data stalls]
[Figure: execution time breakdown: clock cycles per record (0-1800) for NSM vs. PAX page layout, split into Computation, Memory, Branch Mispredict, and Hardware Resource]

Query: select avg(ai) from R where aj >= Lo and aj <= Hi
Platform: PII Xeon, Windows NT4, 16KB L1-I&D, 512KB L2, 512MB RAM

Validation with microbenchmarks:
    70% less data stall time (only compulsory misses left)
    Better use of processor’s superscalar capability
TPC-H performance: 15%-2x speedup in queries
    Experiments with/without I/O, on three different processors

PAX eliminates unnecessary trips to memory

Page 61:

Dynamic PAX: Data Morphing [HP03]

PAX random access: more cache misses in record
Store attributes accessed together contiguously
Dynamic partition updates with changing workloads
    Optimize total cost based on cache misses
    Partition algorithms: naïve & hill-climbing algorithms
Fewer cache misses
    Better projectivity and scalability for index scan queries
    Up to 45% faster than NSM & 25% faster than PAX
Same I/O performance as PAX and NSM
Future work: how to handle conflicts?

Page 62:

Alternatively: Repair DSM’s I/O behavior

We like DSM for partial-record access
We like NSM for full-record access

Solution: Fractured Mirrors [RDS02]

1. Get data placement right

[Figure: sparse B-Tree on ID over the stored attribute values]

2. Faster record reconstruction

[Figure: Lineitem (TPC-H), 1GB: time in seconds (0-180) vs. number of attributes (1-14) for NSM, Page-at-a-time, and Chunk-Merge]

Instead of record- or page-at-a-time… Chunk-based merge algorithm!
    1. Read in segments of M pages (a “chunk”)
    2. Merge segments in memory
    3. Requires (N*K)/M disk seeks
    4. For a memory budget of B pages, each partition gets B/N pages in a chunk
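A short calculation shows how the seek count falls as the memory budget grows. The slide does not define N and K; the sketch below assumes N is the number of DSM partitions being merged and K is the number of pages in each partition. The function name `chunk_merge_seeks` and the example numbers are hypothetical.

```python
import math


def chunk_merge_seeks(n_partitions, pages_per_partition, budget_pages):
    """Return (M, seeks) for a chunk-based merge under a memory budget.

    Assumes M = B/N pages per chunk and one seek per chunk read,
    which gives N * ceil(K / M) seeks, i.e. (N*K)/M when M divides K.
    """
    chunk_pages = budget_pages // n_partitions          # M = B / N
    if chunk_pages < 1:
        raise ValueError("budget must allow at least one page per partition")
    seeks = n_partitions * math.ceil(pages_per_partition / chunk_pages)
    return chunk_pages, seeks


# Page-at-a-time is the special case M = 1.
page_at_a_time = chunk_merge_seeks(4, 1000, 4)
chunked = chunk_merge_seeks(4, 1000, 400)
```

With four partitions of 1000 pages each, a 400-page budget gives 100-page chunks and cuts the seek count by two orders of magnitude relative to page-at-a-time merging.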

Page 63:

Fractured Mirrors

3. Smart mirroring

[Figure: two mirroring configurations of Disk 1 and Disk 2 holding the NSM copy and the DSM copy]

Achieves 2-3x speedups on TPC-H
Needs 2 copies of the database
Future work:
    A new optimizer
    Smart buffer pool management
    Updates

Page 64:

Summary (no replication)

Page layout  Cache-memory Performance                    Memory-disk Performance
             full-record access  partial-record access   full-record access  partial-record access
NSM          ☺                                           ☺
DSM                              ☺                                           ☺
PAX          ☺                   ☺                       ☺

Need new placement method:
    Efficient full- and partial-record accesses
    Maximize utilization at all levels of memory hierarchy
Difficult!!!
    Different devices/access methods
    Different workloads on the same database

Page 65:

The Fates Storage Manager [SAG03,SSS04,SSS04a]

IDEA: Decouple layout!

[Figure: Operators run on top of Clotho (main-memory manager with the buffer pool and page headers; CPU cache and main memory), Lachesis (storage manager), and Atropos (LV manager) over a disk array (disk 0, disk 1; non-volatile storage)]

Data directly placed via scatter/gather I/O
Memory does not need to store full NSM pages

Page 66:

Clotho: memory stores PAX minipages [SSS04]

Query: select EID from R where AGE>30

[Figure: DISK holds two PAX-like pages (EIDs 1237 4322 1563 7658 with names Jane John Jim Susan and ages 30 45 20 52; EIDs 2534 2865 1015 8791 with names Tom Jerry Jean Kate and ages 31 54 33 25); MAIN MEMORY holds one page with header “EID & AGE” that contains only the EID and AGE minipages of both disk pages]

On-disk page:
    PAX-like layout
    Block boundary aligned
In-memory “skeleton” (tailored to query)
    Just the data you need
    Query-specific pages!
    Great cache performance
Decoupled layout
    Fits different hardware
Low reconstruction cost
    Done at I/O level
    Guaranteed by Lachesis and Atropos [SSS04]

New buffer pool manager handles sharing

Page 67:

CSM: best-case performance of DSM and NSM [SSS04]

[Figure: table scan time: run time in seconds (0-250) vs. query payload (1-15 attributes) for NSM, DSM, PAX, and CSM]

Table: a1 … a15 (float)
Query: select a1, … from R where a1 < Hi

TPC-H: Outperforms DSM by 20% to 2x
TPC-C: Comparable to NSM (6% lower throughput)

Page layout  Cache-memory Performance                    Memory-disk Performance
             full-record access  partial-record access   full-record access  partial-record access
CSM          ☺                   ☺                       ☺                   ☺

Page 68:

Data Placement: Summary

Smart data placement increases spatial locality
    Research targets table (relation) data
    Goal: Reduce number of non-cold cache misses
Techniques focus on grouping attributes into cache lines for quick access
    PAX, Data Morphing: Cache optimization techniques
    Fractured Mirrors: Cache-and-disk optimization
    Fates DB Storage Manager: Independent data layout support across the entire memory hierarchy

Page 69:

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
    Data Placement
    Access Methods
    Query Processing
    Instruction Stream Optimizations
    Staged Database Systems
Newer Hardware
Hip and Trendy
Directions for Future Research

Page 70:

Main-Memory Tree Indexes

T Trees: proposed in mid-80s for MMDBs [LC86]
    Aim: balance space overhead with searching time
    Uniform memory access assumption (no caches)
Main-memory B+ Trees: better cache performance [RR99]
    Node width = cache line size (32-128 bytes)
    Minimize number of cache misses for search
    Much higher than traditional disk-based B-Trees

So now trees are too deep

How to make trees shallower?

Page 71:

Reducing Pointers for Larger Fanout [RR00]

Cache Sensitive B+ Trees (CSB+ Trees)
    Lay out child nodes contiguously
    Eliminate all but one child pointer
    Double fanout of nonleaf node

[Figure: B+ Trees vs. CSB+ Trees: a B+ Tree node stores keys with one pointer per child; a CSB+ Tree node stores keys K1 K2 K3 K4 and a single pointer to a contiguous group of child nodes]

35% faster tree lookups
Update performance is 30% worse (splits)
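The effect of dropping child pointers can be checked with simple arithmetic. The sketch below assumes a 64-byte node with 4-byte keys and 4-byte pointers and ignores per-node bookkeeping such as a key count; none of these sizes come from the slide, and the function names are hypothetical.

```python
def bplus_fanout(node_bytes, key_bytes, ptr_bytes):
    """Fanout of a B+ Tree inner node: k keys plus k+1 child pointers."""
    keys = (node_bytes - ptr_bytes) // (key_bytes + ptr_bytes)
    return keys + 1


def csb_fanout(node_bytes, key_bytes, ptr_bytes):
    """Fanout of a CSB+ Tree inner node: k keys plus one child-group pointer."""
    keys = (node_bytes - ptr_bytes) // key_bytes
    return keys + 1


def levels_needed(n_leaves, fanout):
    """Number of inner levels needed to reach n_leaves leaf nodes."""
    levels, reach = 0, 1
    while reach < n_leaves:
        reach *= fanout
        levels += 1
    return levels


b_fan = bplus_fanout(64, 4, 4)
c_fan = csb_fanout(64, 4, 4)
b_levels = levels_needed(1_000_000, b_fan)
c_levels = levels_needed(1_000_000, c_fan)
```

Under these assumptions the fanout doubles from 8 to 16, and an index over a million leaves needs 5 inner levels instead of 7, which means fewer cache misses per lookup.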

Page 72:

What do we do with cold misses?

Answer: hide latencies using prefetching
Prefetching enabled by
    Non-blocking cache technology
    Prefetch assembly instructions
        SGI R10000, Alpha 21264, Intel Pentium4

[Figure: the CPU issues prefetches (pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)) that travel through the L1 cache and the L2/L3 cache to main memory]

Prefetching hides cold cache miss latency
Efficiently used in pointer-chasing lookups!

Page 73:

Prefetching B+ Trees [CGM01]

(pB+ Trees) Idea: Larger nodes
Node size = multiple cache lines (e.g. 8 lines)
    Later corroborated by [HP03a]
Prefetch all lines of a node before searching it
    Cost to access a node only increases slightly
    Much shallower trees, no changes required

[Figure: timeline of cache misses when the lines of a node are prefetched]

>2x better search AND update performance
Approach complementary to CSB+ Trees!

Page 74:

Prefetching B+ Trees [CGM01]

Goal: faster range scan
    Leaf parent nodes contain addresses of all leaves
    Link leaf parent nodes together
    Use this structure for prefetching leaf nodes

[Figure: leaf parent nodes linked together above the leaf level]

pB+ Trees: 8X speedup over B+ Trees

Page 75:

Fractal Prefetching B+ Trees [CGM02]

What if B+-tree does not fit in memory?
(fpB+ Trees) Idea: Combine memory & disk trees
    Embed cache-optimized trees in disk tree nodes
    fpB+ Trees optimize both cache AND disk
    Key compression to increase fanout [BMR01]

Compared to disk-based B+ Trees, 80% faster in-memory searches with similar disk performance

Page 76:

Bulk lookups: Buffer Index Accesses [ZR03a]

Optimize data cache performance
Similar technique in [PMH02]
Idea: increase temporal locality by delaying (buffering) node probes until a group is formed
Example: NLJ probe stream: (r1, 10) (r2, 80) (r3, 15)

[Figure: tree with a root, children B and C, and lower nodes D and E; each child has a buffer of (RID, key) entries]
    Probe (r1, 10): (r1, 10) is buffered before accessing B
    Probe (r2, 80): (r2, 80) is buffered before accessing C
    Probe (r3, 15): B is accessed, buffer entries are divided among children

3x speedup with enough concurrency
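The probe example can be mimicked with a small recursive sketch. It is a simplification, not the algorithm of [ZR03a]: the whole probe stream is treated as one group, buffers are plain Python lists, and the `Node` class and `batched_lookup` function are hypothetical names. The point is only that each node is visited once per group rather than once per probe.

```python
import bisect


class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # separator keys (inner) or stored keys (leaf)
        self.children = children    # list of Node for inner nodes, else None
        self.values = values        # payloads for leaf nodes, else None


def batched_lookup(node, probes, out, visits):
    """Probe the subtree under `node` with a whole group of (rid, key) pairs."""
    visits.append(node)             # the node is touched once for the group
    if node.children is None:
        for rid, key in probes:
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                out.append((rid, node.values[i]))
        return
    # Divide the buffered probes among the children before descending.
    buffers = [[] for _ in node.children]
    for rid, key in probes:
        buffers[bisect.bisect_right(node.keys, key)].append((rid, key))
    for child, buf in zip(node.children, buffers):
        if buf:
            batched_lookup(child, buf, out, visits)


leaf_b = Node(keys=[10, 15], values=["b10", "b15"])
leaf_c = Node(keys=[80], values=["c80"])
root = Node(keys=[50], children=[leaf_b, leaf_c])

matches, visits = [], []
batched_lookup(root, [("r1", 10), ("r2", 80), ("r3", 15)], matches, visits)
```

For the three probes the tree nodes are visited three times in total, whereas probing one key at a time would visit the root three times and a leaf three times.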

Page 77:

Access Methods: Summary

Optimize B+ Tree pointer-chasing cache behavior
    Reduce node size to a few cache lines
    Reduce pointers for larger fanout (CSB+)
    “Next” pointers to lowest non-leaf level for easy prefetching (pB+)
    Simultaneously optimize cache and disk (fpB+)
    Bulk searches: Buffer index accesses
Additional work:
    Cache-oblivious B-Trees [BDF00]
        Optimal bound in number of memory transfers
        Regardless of # of memory levels, block size, or level speed
    Survey of techniques for B-Tree cache performance [GL01]
        Existing heretofore-folkloric knowledge
        Key normalization/compression, alignment, separating keys/pointers

Lots more to be done in this area: consider interference and scarce resources

Page 78:

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
    Data Placement
    Access Methods
    Query Processing
    Staged Database Systems
    Instruction Stream Optimizations
Newer Hardware
Hip and Trendy
Directions for Future Research

Page 79:

Query Processing Algorithms

Idea: Adapt query processing algorithms to caches
Related work includes:
    Improving data cache performance
        Sorting
        Join
    Improving instruction cache performance
        DSS applications

Page 80:

Sorting [NBC94]

In-memory sorting / generating runs
AlphaSort
    Use quick sort rather than replacement selection
        Sequential vs. random access
        No cache misses after sub-arrays fit in cache
    Sort (key-prefix, pointer) pairs rather than records
    3x CPU speedup for the Datamation benchmark

[Figure: Quick Sort vs. Replacement-selection working sets relative to the L1 and L2 caches]
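The (key-prefix, pointer) idea can be illustrated as follows. This is a sketch, not AlphaSort: Python's built-in sort is not quicksort, list indices stand in for record pointers, and the function name `key_prefix_sort` and the 4-character prefix are assumptions. What it does show is that most comparisons are decided on the small pairs, and the full record is consulted only when two prefixes tie.

```python
from functools import cmp_to_key


def key_prefix_sort(records, key_of, prefix_len=4):
    """Sort records by sorting small (key-prefix, index) pairs instead."""
    pairs = [(key_of(rec)[:prefix_len], idx) for idx, rec in enumerate(records)]

    def compare(a, b):
        # Common case: the prefixes differ, so the records are never touched.
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Tie on the prefix: follow the "pointers" and compare the full keys.
        ka, kb = key_of(records[a[1]]), key_of(records[b[1]])
        return (ka > kb) - (ka < kb)

    pairs.sort(key=cmp_to_key(compare))
    return [records[idx] for _, idx in pairs]


rows = [("pear-2", "x"), ("apple", "y"), ("pear-1", "z"), ("fig", "w")]
sorted_rows = key_prefix_sort(rows, key_of=lambda rec: rec[0])
```

Only the two “pear” records share a prefix here, so only that one comparison dereferences the full keys.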

Page 81:

Hash Join

Random accesses to hash table
    Both when building AND when probing!!!
Poor cache performance
    ≥ 73% of user time is CPU cache stalls [CAG04]
Approaches to improving cache performance
    Cache partitioning: maximizes locality
    Prefetching: hides latencies

[Figure: the Build Relation populates the Hash Table, which the Probe Relation then probes]

Page 82:

Reducing non-cold misses [SKN94]

Idea: Cache partitioning (similar to I/O partitioning)
    Divide relations into cache-sized partitions
    Fit build partition and hash table into cache
    Avoid cache misses for hash table visits

[Figure: Build and Probe partitions sized to fit in the cache]

1/3 fewer cache misses, 9.3% speedup
>50% misses due to partitioning overhead

Page 83:

Hash Joins in Monet [B02]

Monet main-memory database system [B02]
    Vertically partitioned tuples (DSM)
Join two vertically partitioned relations
    Join two join-attribute arrays [BMK99,MBK00]
    Extract other fields for output relation [MBN04]

[Figure: Build and Probe join-attribute arrays producing the Output]

Page 84:

Monet: Reducing Partition Cost [BMK99,MBK00]

Join two arrays of simple fields (8 byte tuples)
Original cache partitioning is single pass
    TLB thrashing if # partitions > # TLB entries
    Cache thrashing if # partitions > # lines in cache
Solution: multiple passes
    # partitions per pass is small
    Radix-cluster [BMK99,MBK00]
        Use different bits of hashed keys for different passes
        E.g. in figure, use 2 bits of hashed keys for each pass
Plus CPU optimizations
    XOR instead of %
    Simple assignments instead of memcpy

[Figure: 2-pass partition]

Up to 2.7X speedup on an Origin 2000
Results most significant for small tuples
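A multi-pass partitioning step along these lines can be sketched in Python. This is an illustration of the idea rather than Monet's code: the function name `radix_cluster`, the list-of-lists representation, and the choice to consume the highest-order bits first are assumptions, and the identity function stands in for the hash.

```python
def radix_cluster(keys, bits_per_pass, hash_fn):
    """Partition keys in several passes, each using a few bits of the hash.

    With bits_per_pass = [2, 2], every pass writes to only 4 output
    partitions, yet the end result is 16 clusters.
    """
    clusters = [list(keys)]
    shift = sum(bits_per_pass)
    for bits in bits_per_pass:
        shift -= bits
        mask = (1 << bits) - 1
        next_clusters = []
        for cluster in clusters:
            parts = [[] for _ in range(1 << bits)]
            for key in cluster:
                parts[(hash_fn(key) >> shift) & mask].append(key)
            next_clusters.extend(parts)
        clusters = next_clusters
    return clusters


def identity(key):
    return key


data = [13, 2, 7, 29, 18, 0]
two_pass = radix_cluster(data, [2, 2], identity)
one_pass = radix_cluster(data, [4], identity)
```

Both calls produce the same 16 clusters; the two-pass variant simply never has more than four partitions open at once, which is what keeps the number of concurrently written regions below the TLB and cache limits.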

Page 85:

Monet: Extracting Payload [MBN04]

Two ways to extract payload:
    Pre-projection: copy fields during cache partitioning
    Post-projection: generate join index, then extract fields
Monet: post-projection
    Radix-decluster algorithm for good cache performance
Post-projection good for DSM
    Up to 2X speedup compared to pre-projection
Post-projection is not recommended for NSM
    Copying fields during cache partitioning is better

Paper presented in this conference!

Page 86:

Optimizing non-DSM hash joins [CAG04]

Simplified probing algorithm:

    foreach probe tuple {
        (0) compute bucket number;
        (1) visit header;
        (2) visit cell array;
        (3) visit matching build tuple;
    }

[Figure: hash bucket headers point to hash cells (hash code, build tuple ptr), which point into the build partition; on the timeline, steps 0-3 of successive tuples run one after another, each exposed to cache miss latency]

Idea: Exploit inter-tuple parallelism

Page 87:

Group Prefetching [CAG04]

    foreach group of probe tuples {
        foreach tuple in group {
            (0) compute bucket number;
            prefetch header;
        }
        foreach tuple in group {
            (1) visit header;
            prefetch cell array;
        }
        foreach tuple in group {
            (2) visit cell array;
            prefetch build tuple;
        }
        foreach tuple in group {
            (3) visit matching build tuple;
        }
    }

[Figure: timeline in which step 0 runs for every tuple of a group, then step 1, step 2, and step 3, so the cache misses of a group overlap]

Page 88:

Software Pipelining [CAG04]

    prologue;
    for j = 0 to N-4 do {
        tuple j+3: (0) compute bucket number;
                   prefetch header;
        tuple j+2: (1) visit header;
                   prefetch cell array;
        tuple j+1: (2) visit cell array;
                   prefetch build tuple;
        tuple j:   (3) visit matching build tuple;
    }
    epilogue;

[Figure: timeline in which each iteration runs step 0 for tuple j+3 through step 3 for tuple j, with a prologue filling and an epilogue draining the pipeline]

Page 89:

Prefetching: Performance Results [CAG04]

Techniques exhibit similar performance
Group prefetching easier to implement
Compared to cache partitioning:
    Cache partitioning costly when tuples are large (>20 bytes)
    Prefetching about 50% faster than cache partitioning

[Figure: execution time in Mcycles (0-6000) at processor-to-memory latencies of 150 cycles and 1000 cycles for Baseline, Group Pref, and SP Pref]

9X speedups over baseline at 1000 cycles
Absolute numbers do not change!

Page 90:

DSS: Reducing I-misses [PMA01,ZR04]

Demand-pull execution model: one tuple at a time
    ABABABABABABABABAB…
    If A + B > L1 instruction cache size
    Poor instruction cache utilization!
Solution: multiple tuples at an operator
    ABBBBBAAAAABBBBB…
    Modify operators to support block of tuples [PMA01]
    Insert “buffer” operators between A and B [ZR04]
        “buffer” calls B multiple times
        Stores intermediate tuple pointers to serve A’s request
        No need to change original operators

[Figure: query plan with operator A above operator B]

12% speedup for simple TPC-H queries
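The buffer-operator idea maps naturally onto Python generators. The sketch below is illustrative only: `scan`, `select`, and `buffer_op` are hypothetical operators, the trace string stands in for instruction-cache behavior, and nothing here measures real cache misses. It shows how a buffer inserted between two unchanged operators turns the interleaved pattern into runs.

```python
def scan(rows, trace):
    """Operator B: produces tuples one at a time."""
    for row in rows:
        trace.append("B")           # B's code runs for this tuple
        yield row


def select(child, predicate, trace):
    """Operator A: consumes tuples from its child one at a time."""
    for row in child:
        trace.append("A")           # A's code runs for this tuple
        if predicate(row):
            yield row


def buffer_op(child, size):
    """Buffer operator: drains `size` tuples from the child in one burst,
    then serves the parent's requests from the stored batch."""
    batch = []
    for row in child:
        batch.append(row)
        if len(batch) == size:
            yield from batch
            batch = []
    yield from batch


def is_even(row):
    return row % 2 == 0


plain_trace, buffered_trace = [], []
plain = list(select(scan(range(6), plain_trace), is_even, plain_trace))
buffered = list(
    select(buffer_op(scan(range(6), buffered_trace), 3), is_even, buffered_trace)
)
```

Both plans return the same rows, and neither `scan` nor `select` had to change; only the order in which their code runs differs.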

Page 91:

Concurrency Control [CHK01]

Multiple CPUs share a tree
Lock coupling: too much cost
    Latching a node means writing
    True even for readers!!!
    Coherence cache misses due to writes from different CPUs
Solution: Optimistic approach for readers
    Updaters still latch nodes
    Updaters also set node versions
    Readers check version to ensure correctness

Search throughput: 5x (= no-locking case)
Update throughput: 4x
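One way to picture the reader-side version check is a sequence-counter scheme, sketched below. This is an assumption about how versions might be used, not the protocol of [CHK01]: the odd/even convention, the `VersionedNode` class, and its methods are hypothetical, and a real implementation would also need the appropriate memory barriers, which this Python sketch does not model.

```python
import threading


class VersionedNode:
    def __init__(self, keys):
        self.keys = list(keys)
        self.version = 0                  # even: stable, odd: update in progress
        self.latch = threading.Lock()

    def update(self, new_keys):
        with self.latch:                  # updaters still latch the node
            self.version += 1             # mark the node as being modified
            self.keys = list(new_keys)
            self.version += 1             # publish the new stable version

    def read(self):
        """Readers never write to the node; they retry on a version change."""
        while True:
            before = self.version
            if before % 2 == 1:           # an update is in progress, retry
                continue
            snapshot = list(self.keys)
            if self.version == before:    # nothing changed while reading
                return snapshot


node = VersionedNode([10, 20])
first = node.read()
node.update([10, 20, 30])
second = node.read()
```

Because `read` performs no stores on the node, concurrent readers on different CPUs do not invalidate each other's cached copy of it, which is the cost the slide attributes to reader latching.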

Page 92:

Query processing: summary

AlphaSort: use quicksort and key prefix-pointer
Monet: MM-DBMS uses aggressive DSM
    Optimize partitioning with hierarchical radix-clustering
    Optimize post-projection with radix-declustering
    Many other optimizations
Traditional hash joins: aggressive prefetching
    Efficiently hides data cache misses
    Robust performance with future long latencies
DSS I-misses: group computation (new operator)
B-tree concurrency control: reduce readers’ latching

Page 93:

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
    Data Placement
    Access Methods
    Query Processing
    Instruction Stream Optimizations
    Staged Database Systems
Newer Hardware
Hip and Trendy
Directions for Future Research

Page 94:

Instruction-Related Stalls

25-40% of execution time [KPH98, HA04]
Recall importance of instruction cache: in the critical execution path!

[Figure: the EXECUTION PIPELINE is fed by the L1 I-CACHE and the L1 D-CACHE, both backed by the L2 CACHE]

Impossible to overlap I-cache delays

Page 95:

Call graph prefetching for DB apps [APD03]

Goal: improve DSS I-cache performance
Idea: Predict next function call using small cache
Example: create_rec always calls find_page, lock_page, update_page, and unlock_page in the same order
Experiments: Shore on SimpleScalar Simulator
    Running Wisconsin Benchmark

Beneficial for predictable DSS streams

Page 96:

DB operators using SIMD [ZR02]

SIMD: Single Instruction, Multiple Data
    In modern CPUs, targets multimedia apps
    Example: Pentium 4, 128-bit SIMD register holds four 32-bit values

[Figure: lanes X3 X2 X1 X0 and Y3 Y2 Y1 Y0 are combined by one OP into X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0]

Assume data stored columnwise as contiguous array of fixed-length numeric values (e.g., PAX)
Scan example:

Original scan code:

    if x[n] > 10
        result[pos++] = x[n]

SIMD 1st phase: produce bitmap vector with 4 comparison results in parallel

    x[n+3]  x[n+2]  x[n+1]  x[n]
      8       12      6       5
      >       >       >       >
     10       10     10      10
      0       1       0       0

Page 97:

DB operators using SIMD [ZR02]

Scan example (cont’d)

SIMD 2nd phase:
    if bit_vector == 0, continue
    else copy all 4 results, increase pos when bit == 1

[Figure: bitmap vector 0 1 0 0; the value whose bit is 1 is marked “keep this result”]

Parallel comparisons, fewer branches ⇒ fewer mispredictions

[Figure: aggregation operation (1M records w/ 20% selectivity): elapsed time in ms (0-25) for SUM, COUNT, MAX, and MIN, each with and without SIMD, split into branch misprediction penalty and other cost]

Superlinear speedup relative to the degree of parallelism
Need to rewrite code to use SIMD
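The two phases of the scan can be emulated in plain Python to make the control flow concrete. This is an emulation, not SIMD code: the inner loop over lanes stands in for a single 4-wide compare instruction, and the name `simd_style_scan` and the scalar handling of leftover elements are assumptions.

```python
LANES = 4   # a 128-bit register holding four 32-bit values


def simd_style_scan(x, threshold):
    """Return the values of x greater than threshold, four lanes at a time."""
    result = [None] * len(x)
    pos = 0
    full = len(x) - len(x) % LANES
    for base in range(0, full, LANES):
        # Phase 1: four comparisons at once produce a 4-bit mask.
        mask = 0
        for lane in range(LANES):
            mask |= (x[base + lane] > threshold) << lane
        # Phase 2: one branch per group instead of one per value.
        if mask == 0:
            continue
        for lane in range(LANES):
            result[pos] = x[base + lane]      # copy unconditionally
            pos += (mask >> lane) & 1         # advance only when the bit is 1
    for value in x[full:]:                    # scalar tail for leftover values
        if value > threshold:
            result[pos] = value
            pos += 1
    return result[:pos]


hits = simd_style_scan([5, 6, 12, 8, 1, 2, 3, 4, 11, 20, 9], 10)
```

The first group reproduces the slide's example (only 12 exceeds 10), the second group is skipped entirely because its mask is zero, and the unconditional copy with a conditional increment of `pos` is what removes the per-value branch.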

Page 98:

STEPS: Cache-Resident OLTP [HA04]

Targets instruction-cache performance for OLTP
Exploits high transaction concurrency
Synchronized Transactions through Explicit Processor Scheduling: multiplex concurrent transactions to exploit common code paths

[Figure: before: the CPU executes thread A’s code, which exceeds the instruction cache capacity window, and then thread B’s; after: the CPU context-switches from thread A to thread B at points chosen so that the code between context-switch points fits in the I-cache]

All capacity/conflict I-cache misses gone!

Page 99:

STEPS: Cache-Resident OLTP [HA04]

STEPS implementation runs full OLTP workloads (TPC-C)
Groups threads per DB operator, then uses fast context-switch to reuse instructions in the cache

Full-system TPC-C implementation: 65% fewer L1-I misses, 40% speedup

STEPS minimizes L1-I cache misses without increasing cache size

Page 100:

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
    Data Placement
    Access Methods
    Query Processing
    Instruction Stream Optimizations
    Staged Database Systems
Newer Hardware
Hip and Trendy
Directions for Future Research

Page 101:

Thread-based concurrency pitfalls [HA03]

[Figure: Current: queries Q1, Q2, and Q3 share the CPU over TIME; component loading time is paid at the context-switch points]

Context loaded multiple times for each query
No means to exploit overlapping work

Page 102:

Thread-based concurrency pitfalls [HA03]

[Figure: Current: queries Q1, Q2, and Q3 share the CPU over TIME; component loading time is paid at the context-switch points]
[Figure: Desired: Q1, Q2, and Q3 run back to back within each component]

Context-switch at module boundary
Load context once for all queries

Page 103

Staged Database Systems [HA03]

Staged software design allows for
  Cohort scheduling of queries to amortize loading time
  Suspend at module boundaries to maintain context

Break DBMS into stages
Stages act as independent servers
Queries exist in the form of “packets”

Proposed query scheduling algorithms to address locality/wait time tradeoffs [HA02]
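A minimal sketch of cohort scheduling follows, assuming three hypothetical stage names and a simplified cost model in which only one stage's code and data fit in the cache at a time. It is an illustration of the amortization argument, not the scheduler proposed in [HA02] or [HA03].

```python
from collections import deque

STAGES = ["parse", "optimize", "execute"]    # hypothetical stage names

def thread_based_loads(num_queries):
    """Each query runs through every stage before the next query starts,
    so a stage's context is reloaded whenever the CPU returns to it."""
    loads = 0
    loaded = None                            # stage whose context is currently cached
    for _ in range(num_queries):
        for stage in STAGES:
            if loaded != stage:
                loads += 1
                loaded = stage
    return loads

def staged_loads(num_queries):
    """Each stage is a server with its own queue; it drains its whole cohort
    of packets before the CPU moves on, so its context is loaded once."""
    queues = {stage: deque() for stage in STAGES}
    for q in range(num_queries):
        queues[STAGES[0]].append({"query": q})   # a query travels as a packet
    loads = 0
    for i, stage in enumerate(STAGES):
        if queues[stage]:
            loads += 1                           # load this stage's context once
        while queues[stage]:
            packet = queues[stage].popleft()
            if i + 1 < len(STAGES):
                queues[STAGES[i + 1]].append(packet)  # hand off at the module boundary
    return loads

per_query = thread_based_loads(3)
per_cohort = staged_loads(3)
print(per_query, per_cohort)
```

The gap between the two counts grows linearly with the number of queries in the cohort. Larger cohorts also make early packets wait longer, which is the locality/wait-time tradeoff the slide mentions.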

Page 104

Staged Database Systems [HA03]

[Figure: staged DBMS with IN and OUT ends and stages connect, parser, optimizer, FSCAN, ISCAN, JOIN, SORT, AGGR, send results; two L1 / L2 / MEMORY hierarchies drawn alongside]

Optimize instruction/data cache locality
Naturally enable multi-query processing

Highly scalable, fault-tolerant, trustworthy

Page 105

Summary: Bridging the Gap

Cache-aware data placement
  Eliminates unnecessary trips to memory
  Minimizes conflict/capacity misses
  Fates: decouple memory from storage layout

What about compulsory (cold) misses?
  Can’t avoid, but can hide latency with prefetching
  Techniques for B-trees, hash joins

Staged Database Systems: a scalable future

Addressing instruction stalls
  DSS: Call Graph Prefetching, SIMD, group operator
  OLTP: STEPS, a promising direction for any platform
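To make "eliminates unnecessary trips to memory" concrete, the sketch below counts how many distinct cache lines a single-attribute scan touches under two layouts. The line size, field size, and table shape are assumed values, and the two layouts are generic row-wise and column-wise placements rather than any specific scheme from the tutorial.

```python
CACHE_LINE = 64       # bytes per cache line (assumed)
FIELD = 8             # bytes per attribute value (assumed)
NUM_ATTRS = 8         # attributes per record (assumed)
NUM_RECORDS = 1000    # records in the table (assumed)

def lines_touched(addresses):
    """Number of distinct cache lines covered by a list of byte addresses."""
    return len({addr // CACHE_LINE for addr in addresses})

def row_layout_addresses(attr):
    """Records stored one after another; the scanned attribute is strided."""
    record_size = FIELD * NUM_ATTRS
    return [r * record_size + attr * FIELD for r in range(NUM_RECORDS)]

def column_layout_addresses(attr):
    """Values of each attribute stored contiguously."""
    column_size = FIELD * NUM_RECORDS
    return [attr * column_size + r * FIELD for r in range(NUM_RECORDS)]

row_lines = lines_touched(row_layout_addresses(attr=2))
col_lines = lines_touched(column_layout_addresses(attr=2))
print(row_lines, col_lines)
```

With these numbers every record fills one cache line, so the row-wise scan fetches a full line per value while the column-wise scan packs eight values into each line.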

Page 106

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Newer Hardware
Hip and Trendy
Directions for Future Research

Page 107

Current/Near-future Multiprocessors

Typical platforms:
1. Chips with multiple cores
2. Servers with multiple chips
3. Memory shared across

Memory access:
  Traverse multiple hierarchies
  Large non-uniform latencies

[Figure: Multiprocessor Server, with groups of processors (P) attached to memories (Mem)]

Programmer/Software must Hide/Tolerate Latency

Page 108

Chip Multi-Processors (CMP)

Example: IBM Power4, Power5

[Figure: chip with two cores and a shared L2]

Highly variable memory latency
Speedup: OLTP 3x, DSS 2.3x on Piranha [BGM00]

Page 109

Simultaneous Multi-Threading (SMT)

Implements threads in a superscalar processor
Keeps hardware state for multiple threads
E.g.: Intel Pentium 4 (SMT), IBM Power5 (SMT & CMP)

[Figure label: 2 cores * 2 threads]

Speedup: OLTP 3x, DSS 0.5x (simulated) [LBE98]

Page 110

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
  Query co-processing
  Databases on MEMStore
Directions for Future Research

Page 111

Optimizing Spatial Operations [SAA03]

Spatial operations are computation-intensive
  Intersection, distance computation
  Cost grows with the number of vertices per object

Use graphics card to increase speed
  Idea: use color blending to detect intersection
    Draw each polygon in gray
    Intersected area is black because of the color-mixing effect
  Algorithms cleverly use hardware features

Intersection selection: up to 64% improvement using graphics card
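The color-blending idea can be mimicked in software. The sketch below is a CPU-only illustration, not the algorithm of [SAA03]: it assumes axis-aligned rectangles instead of general polygons, a small hypothetical frame buffer, and integer "darkness" in place of gray levels. Each shape darkens the pixels it covers by one step, and any pixel darkened twice signals an overlap.

```python
DARKNESS_PER_DRAW = 1    # one "gray" layer per drawn shape (assumed encoding)

def rasterize_rect(rect, width, height):
    """Pixels covered by a rectangle (x0, y0, x1, y1), half-open on x1 and y1."""
    x0, y0, x1, y1 = rect
    return {(x, y)
            for x in range(max(x0, 0), min(x1, width))
            for y in range(max(y0, 0), min(y1, height))}

def intersects(rect_a, rect_b, width=16, height=16):
    """Draw both shapes with blending, then look for a doubly darkened pixel."""
    frame = [[0] * width for _ in range(height)]
    for rect in (rect_a, rect_b):
        for x, y in rasterize_rect(rect, width, height):
            frame[y][x] += DARKNESS_PER_DRAW      # blending: layers accumulate
    return any(value > DARKNESS_PER_DRAW for row in frame for value in row)

overlapping = intersects((0, 0, 4, 4), (2, 2, 6, 6))
touching_corner = intersects((0, 0, 4, 4), (4, 4, 8, 8))
print(overlapping, touching_corner)
```

The per-pixel work here is independent of how many vertices each shape has, which is why offloading it to rasterization hardware is attractive. The answer is only as precise as the raster resolution chosen for the frame.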

Page 112

Fast Computation of DB Operations Using Graphics Processors [GLW04]

Exploit graphics features for database operations
  Predicate, Boolean operations, Aggregates

Examples:
  Predicate: attribute > constant
    Graphics: test a set of pixels against a reference value
    pixel = attribute value, reference value = constant
  Aggregations: COUNT
    Graphics: count number of pixels passing a test

Good performance: e.g. over 2X improvement for predicate evaluations

Promising! Peak performance of graphics processor increases 2.5-3 times a year
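The mapping from database operations to pixel tests can be sketched on the CPU. The code below uses no graphics API and is not the implementation of [GLW04]; the function names and sample columns are hypothetical. It treats each attribute value as a pixel, evaluates a predicate as a per-pixel test against a reference value, combines two tests with a Boolean AND, and implements COUNT as the number of passing pixels.

```python
def predicate_mask(pixels, reference):
    """Per-pixel test 'pixel > reference'; pixel = attribute value."""
    return [pixel > reference for pixel in pixels]

def mask_and(mask_a, mask_b):
    """Boolean AND of two per-pixel test results."""
    return [a and b for a, b in zip(mask_a, mask_b)]

def count_passing(mask):
    """COUNT aggregate: number of pixels that passed the test."""
    return sum(mask)

# Hypothetical columns of a five-row table.
salary = [30, 55, 70, 42, 90]
age = [25, 41, 38, 52, 47]

high_salary = predicate_mask(salary, 50)           # salary > 50
over_forty = predicate_mask(age, 40)               # age > 40
count_salary = count_passing(high_salary)
count_both = count_passing(mask_and(high_salary, over_forty))
print(count_salary, count_both)
```

Every pixel is tested independently of the others, so the same structure parallelizes naturally across the many pixel pipelines of a graphics processor.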

Page 113

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
  Query co-processing
  Databases on MEMStore
Directions for Future Research

Page 114

MEMStore (MEMS*-based storage)

On-chip mechanical storage - using MEMS for media positioning

[Figure: MEMStore device showing read/write tips, recording media (sled), and actuators]

* microelectromechanical systems

Page 115

MEMStore (MEMS*-based storage)

Conventional disk drive (single read/write head):
  60 - 200 GB capacity (4 - 40 GB portable)
  100 cm³ volume
  10’s MB/s bandwidth
  < 10 ms latency (10 - 15 ms portable)

MEMStore (many parallel heads):
  2 - 10 GB capacity
  < 1 cm³ volume
  ~100 MB/s bandwidth
  < 1 ms latency

So how can MEMS help improve DB performance?

* microelectromechanical systems

Page 116

Two-dimensional database access [SSA03,YAA03,YAA04]

[Figure: grid of numbered blocks (0 to 80) arranged with Records on one axis and Attributes on the other]

Exploit inherent parallelism

Page 117

Two-dimensional database access [SSA03]

[Figure: bar chart of scan time (s), axis 0 to 100, for NSM - Row order, MEMStore - Row order, NSM - Attribute order, and MEMStore - Attribute order; series: all, a1, a2, a3, a4]

Peak performance along both dimensions

Page 118

Outline

Introduction and Overview
New Hardware
Where Does Time Go?
Bridging the Processor/Memory Speed Gap
Hip and Trendy
Directions for Future Research

Page 119

Future research directions

Rethink Query Optimization: with increasing complexity, cost-based optimization not ideal

Multiprocessors and really new modular software architectures to fit new computers
  Current research in DB workloads only scratches surface
  Optimize execution on multiple-core chips
  Exploit multithreaded processors

Power-aware database systems
  On embedded processors, laptops, etc.

Automatic data placement and memory layer optimization: one level should not need to know what others do
  Auto-everything

Aggressive use of hybrid processors

Page 120

ACKNOWLEDGEMENTS

Page 121

Special thanks go to…

Shimin Chen, Minglong Shao, Stavros Harizopoulos, and Nikos Hardavellas for invaluable contributions to this talk
Steve Schlosser (MEMStore)
Ravi Ramamurthy (fractured mirrors)
Babak Falsafi and Chris Colohan (h/w architecture)

Page 122

REFERENCES (used in presentation)

Page 123

References: Where Does Time Go? (simulation only)

[ADS02] Branch Behavior of a Commercial OLTP Workload on Intel IA32 Processors. M. Annavaram, T. Diep, J. Shen. International Conference on Computer Design: VLSI in Computers and Processors (ICCD), Freiburg, Germany, September 2002.

[SBG02] A Detailed Comparison of Two Transaction Processing Workloads. R. Stets, L.A. Barroso, and K. Gharachorloo. IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, November 2002.

[BGN00] Impact of Chip-Level Integration on Performance of OLTP Workloads. L.A. Barroso, K. Gharachorloo, A. Nowatzyk, and B. Verghese. IEEE International Symposium on High-Performance Computer Architecture (HPCA), Toulouse, France, January 2000.

[RGA98] Performance of Database Workloads on Shared Memory Systems with Out-of-Order Processors. P. Ranganathan, K. Gharachorloo, S. Adve, and L.A. Barroso. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.

[LBE98] An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. J. Lo, L.A. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.

[EJL96] Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. R.J. Eickemeyer, R.E. Johnson, S.R. Kunkel, M.S. Squillante, and S. Liu. ACM International Symposium on Computer Architecture (ISCA), Philadelphia, Pennsylvania, May 1996.

Page 124

References: Where Does Time Go? (real-machine/simulation)

[RAD02] Comparing and Contrasting a Commercial OLTP Workload with CPU2000. J. Rupley II, M. Annavaram, J. DeVale, T. Diep and B. Black (Intel). IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, November 2002.

[CTT99] Detailed Characterization of a Quad Pentium Pro Server Running TPC-D. Q. Cao, J. Torrellas, P. Trancoso, J. Larriba-Pey, B. Knighten, Y. Won. International Conference on Computer Design (ICCD), Austin, Texas, October 1999.

[ADH99] DBMSs on a Modern Processor: Experimental Results. A. Ailamaki, D. J. DeWitt, M. D. Hill, D.A. Wood. International Conference on Very Large Data Bases (VLDB), Edinburgh, UK, September 1999.

[KPH98] Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads. K. Keeton, D.A. Patterson, Y.Q. He, R.C. Raphael, W.E. Baker. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.

[BGB98] Memory System Characterization of Commercial Workloads. L.A. Barroso, K. Gharachorloo, and E. Bugnion. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.

[TLZ97] The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors. P. Trancoso, J. Larriba-Pey, Z. Zhang, J. Torrellas. IEEE International Symposium on High-Performance Computer Architecture (HPCA), San Antonio, Texas, February 1997.

Page 125

References: Architecture-Conscious Data Placement

[SSS04] Clotho: Decoupling memory page layout from storage organization. M. Shao, J. Schindler, S.W. Schlosser, A. Ailamaki, G.R. Ganger. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.

[SSS04a] Atropos: A Disk Array Volume Manager for Orchestrated Use of Disks. J. Schindler, S.W. Schlosser, M. Shao, A. Ailamaki, G.R. Ganger. USENIX Conference on File and Storage Technologies (FAST), San Francisco, California, March 2004.

[YAA04] Declustering Two Dimensional

Page 126

References: Architecture-Conscious Access Methods

[ZR03a] Buffering Accesses to Memory-Resident Index Structures. J. Zhou and K.A. Ross. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.

[HP03a] Effect of node size on the performance of cache-conscious B+ Trees. R.A. Hankins and J.M. Patel. ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), San Diego, California, June 2003.

[CGM02] Fractal Prefetching B+ Trees: Optimizing Both Cache and Disk Performance. S. Chen, P.B. Gibbons, T.C. Mowry, and G. Valentin. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.

[GL01] B-Tree Indexes and CPU Caches. G. Graefe and P. Larson. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.

[CGM01] Improving Index Performance through Prefetching. S. Chen, P.B. Gibbons, and T.C. Mowry. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.

[BMR01] Main-memory index structures with fixed-size partial keys. P. Bohannon, P. McIlroy, and R. Rastogi. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.

[BDF00] Cache-Oblivious B-Trees. M.A. Bender, E.D. Demaine, and M. Farach-Colton. Symposium on Foundations of Computer Science (FOCS), Redondo Beach, California, November 2000.

[RR00] Making B+ Trees Cache Conscious in Main Memory. J. Rao and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Dallas, Texas, May 2000.

[RR99] Cache Conscious Indexing for Decision-Support in Main Memory. J. Rao and K.A. Ross. International Conference on Very Large Data Bases (VLDB), Edinburgh, the United Kingdom, September 1999.

[LC86] Query Processing in main-memory database management systems. T. J. Lehman and M. J. Carey. ACM International Conference on Management of Data (SIGMOD), 1986.

Page 127

References: Architecture-Conscious Query Processing

[MBN04] Cache-Conscious Radix-Decluster Projections. S. Manegold, P.A. Boncz, N. Nes, M.L. Kersten. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.

[GLW04] Fast Computation of Database Operations using Graphics Processors. N.K. Govindaraju, B. Lloyd, W. Wang, M. Lin, D. Manocha. ACM International Conference on Management of Data (SIGMOD), Paris, France, June 2004.

[CAG04] Improving Hash Join Performance through Prefetching. S. Chen, A. Ailamaki, P. B. Gibbons, and T.C. Mowry. International Conference on Data Engineering (ICDE), Boston, Massachusetts, March 2004.

[ZR04] Buffering Database Operations for Enhanced Instruction Cache Performance. J. Zhou, K. A. Ross. ACM International Conference on Management of Data (SIGMOD), Paris, France, June 2004.

[SAA03] Hardware Acceleration for Spatial Selections and Joins. C. Sun, D. Agrawal, A.E. Abbadi. ACM International Conference on Management of Data (SIGMOD), San Diego, California, June 2003.

[CHK01] Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. S. K. Cha, S. Hwang, K. Kim, and K. Kwon. International Conference on Very Large Data Bases (VLDB), Rome, Italy, September 2001.

[PMA01] Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. S. Padmanabhan, T. Malkemus, R.C. Agarwal, A. Jhingran. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.

[MBK00] What Happens During a Join? Dissecting CPU and Memory Optimization Effects. S. Manegold, P.A. Boncz, and M.L. Kersten. International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, September 2000.

[SKN94] Cache Conscious Algorithms for Relational Query Processing. A. Shatdal, C. Kant, and J.F. Naughton. International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, September 1994.

[NBC94] AlphaSort: A RISC Machine Sort. C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D.B. Lomet. ACM International Conference on Management of Data (SIGMOD), Minneapolis, Minnesota, May 1994.

Page 128

References: Instruction Stream Optimizations and DBMS Architectures

[HA04] STEPS towards Cache-resident Transaction Processing. S. Harizopoulos and A. Ailamaki. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.

[APD03] Call Graph Prefetching for Database Applications. M. Annavaram, J.M. Patel, and E.S. Davidson. ACM Transactions on Computer Systems, 21(4):412-444, November 2003.

[SAG03] Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics. J. Schindler, A. Ailamaki, and G. R. Ganger. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.

[HA02] Affinity Scheduling in Staged Server Architectures. S. Harizopoulos and A. Ailamaki. Carnegie Mellon University, Technical Report CMU-CS-02-113, March 2002.

[HA03] A Case for Staged Database Systems. S. Harizopoulos and A. Ailamaki. Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2003.

[B02] Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. P. A. Boncz. Ph.D. Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.

[PMH02] Computation Regrouping: Restructuring Programs for Temporal Data Cache Locality. V.K. Pingali, S.A. McKee, W.C. Hsieh, and J.B. Carter. International Conference on Supercomputing (ICS), New York, New York, June 2002.

[ZR02] Implementing Database Operations Using SIMD Instructions. J. Zhou and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.

Page 129

References: Newer Hardware

[BWS03] Improving the Performance of OLTP Workloads on SMP Computer Systems by Limiting Modified Cache Lines. J.E. Black, D.F. Wright, and E.M. Salgueiro. IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, October 2003.

[GH03] Technological impact of magnetic hard disk drives on storage systems. E. Grochowski and R. D. Halem. IBM Systems Journal 42(2), 2003.

[DJN02] Shared Cache Architectures for Decision Support Systems. M. Dubois, J. Jeong, A. Nanda. Performance Evaluation 49(1), September 2002.

[G02] Put Everything in Future (Disk) Controllers. Jim Gray, talk at the USENIX Conference on File and Storage Technologies (FAST), Monterey, California, January 2002.

[BGM00] Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. L.A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. International Symposium on Computer Architecture (ISCA). Vancouver, Canada, June 2000.

[AUS98] Active disks: Programming model, algorithms and evaluation. A. Acharya, M. Uysal, and J. Saltz. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.

[KPH98] A Case for Intelligent Disks (IDISKs). K. Keeton, D. A. Patterson, J. Hellerstein. SIGMOD Record, 27(3):42--52, September 1998.

[PGK88] A Case for Redundant Arrays of Inexpensive Disks (RAID). D. A. Patterson, G. A. Gibson, and R. H. Katz. ACM International Conference on Management of Data (SIGMOD), June 1988.

Page 130

References: Methodologies and Benchmarks

[DMM04] Accurate Cache and TLB Characterization Using Hardware Counters. J. Dongarra, S. Moore, P. Mucci, K. Seymour, H. You. International Conference on Computational Science (ICCS), Krakow, Poland, June 2004.

[SAF04] DBmbench: Fast and Accurate Database Workload Representation on Modern Microarchitecture. M. Shao, A. Ailamaki, and B. Falsafi. Carnegie Mellon University Technical Report CMU-CS-03-161, 2004.

[KP00] Towards a Simplified Database Workload for Computer Architecture Evaluations. K. Keeton and D. Patterson. IEEE Annual Workshop on Workload Characterization, Austin, Texas, October 1999.

Page 131

Useful Links

Info on Intel Pentium4 Performance Counters: ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf

AMD hardware performance counters: http://www.amd.com/us-en/Processors/DevelopWithAMD/

PAPI Performance Library: http://icl.cs.utk.edu/papi/

Intel® VTune™ Performance Analyzers: http://developer.intel.com/software/products/vtune/

