Performance evaluation and
benchmarking of DBMSs
INF5100 Autumn 2008
Jarle Søberg
Overview
• What is performance evaluation and benchmarking?
  • Theory
  • Examples
• Domain-specific benchmarks and benchmarking DBMSs
  • We focus on the most popular one: TPC
What is benchmarking?
1. Evaluation techniques and metrics
2. Workload
3. Workload characterization
4. Monitors
5. Representation
Evaluation techniques and metrics
• Examining systems with respect to one or more metrics
  • Speed in km/h
  • Accuracy
  • Availability
  • Response time
  • Throughput
  • Etc.
• An example: early processors were compared based on the speed of the addition instruction, since it was the most used instruction
• Metric selection is based on the evaluation technique (next slide); a small sketch of computing such metrics from a trace follows below
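As a concrete illustration (not from the slides), here is a minimal Python sketch deriving two of the metrics above, response time and throughput, from a list of invented request start/end timestamps:

```python
# Minimal sketch: deriving response time and throughput from
# (start, end) timestamps of requests. The timestamps are invented.
requests = [(0.0, 0.4), (0.5, 1.1), (1.0, 1.9), (2.0, 2.3)]  # seconds

response_times = [end - start for start, end in requests]
duration = max(end for _, end in requests) - min(start for start, _ in requests)

print(f"mean response time: {sum(response_times) / len(response_times):.2f}s")
print(f"throughput: {len(requests) / duration:.2f} requests/s")
```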
Three main evaluation techniques
(criteria to compare the performance)
• Analytical modeling
  • On paper
  • Formal proofs
  • Simplifications
  • Assumptions
• Simulation
  • Closer to reality
  • Still omits details
• Measurements
  • Investigates the real system
Evaluation techniques and metrics
• Three main evaluation techniques
  Criterion             Analytical modeling   Simulation           Measurement (concrete syst.)
  Stage                 Any                   Any                  Post-prototype
  Time required         Small                 Medium               Varies
  Tools                 Analysts              Computer languages   Instrumentation
  Accuracy              Low                   Moderate             Varies
  Trade-off evaluation  Easy                  Moderate             Difficult
  Cost                  Small                 Medium               High
  Saleability           Low                   Medium               High

  (© 1991, Raj Jain)
What is benchmarking?
• “benchmark v. trans. To subject (a system) to a
series of tests in order to obtain prearranged
results not available on competitive systems”
• S. Kelly-Bootle
The Devil’s DP Dictionary
In other words: benchmarks are measurements used to compare two or more systems.
Workload
• Must fit the systems that are benchmarked
  • Instruction frequencies for CPUs
  • Transaction frequencies
• Select a level of detail and use it as the workload:
  1. Most frequent request
  2. Most frequent request types
  3. Time-stamped sequence of requests (a trace)
     • From a real system, e.g. to perform measurements
  4. Average resource demand
     • For analytical modeling, rather than real resource demands
  5. Distribution of resource demands
     • When there is a large variance
     • Good for simulations (see the sketch below)
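A hedged sketch of option 5, generating a synthetic workload by drawing resource demands from a distribution; the lognormal choice and its parameters are my own illustrative assumptions, not from the lecture:

```python
import random

# Sketch: synthetic workload drawn from a distribution of resource
# demands (option 5 above). Distribution and parameters are assumptions.
random.seed(42)  # fixed seed keeps the workload repeatable

def synthetic_demands(n, mu=0.0, sigma=0.8):
    """Draw n service demands (e.g. CPU seconds per request)."""
    return [random.lognormvariate(mu, sigma) for _ in range(n)]

demands = synthetic_demands(10_000)
print(f"mean demand: {sum(demands) / len(demands):.2f}s")
```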
Workload
• Representativeness
  • Arrival rate
  • Resource demands
  • Resource usage profile
• Timeliness
  • The workload should represent usage patterns
Workload characterization
• Repeatability is important
• Observe real-user behavior and create a repeatable workload based on it
• One should only need to change workload parameters:
  • Transaction types
  • Instructions
  • Packet sizes
  • Sources/destinations of packets
  • Page reference patterns
• Generate new traces for each parameter setting? (see the sketch below)
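A minimal sketch of such a parameterized, repeatable workload generator; the parameter names and the read/write mix are hypothetical, not from the lecture:

```python
import random
from dataclasses import dataclass

# Sketch: repeatable, parameterized workload generation. Changing only
# the parameters yields a new trace; the fixed seed makes it repeatable.
@dataclass
class WorkloadParams:
    n_requests: int = 1000
    read_fraction: float = 0.8  # hypothetical transaction-type mix
    seed: int = 7

def generate_trace(params: WorkloadParams):
    rng = random.Random(params.seed)
    return ["SELECT" if rng.random() < params.read_fraction else "UPDATE"
            for _ in range(params.n_requests)]

trace = generate_trace(WorkloadParams(read_fraction=0.5))
print(trace[:5])
```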
Monitors
• How do we obtain the results from sending the workload into the system?
• A monitor observes the activities of the system:
  • Observes performance
  • Collects statistics
  • Analyzes data
  • Displays results
• Either monitor all activities or sample
  • E.g. top's periodic updates in Linux
• On-line
  • Continuously display the system state
• Batch
  • Collect data and analyze it later
Monitors
• In system
  • Put monitors inside the system
  • We need the source code
  • Gives great detail?
  • May add overhead?
• As a black box
  • Measure input and output; is that enough? (see the sketch below)
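An illustrative black-box monitor, timing each request from outside the system under test; `send_request` is a hypothetical stand-in, not a real API:

```python
import time

# Sketch of a black-box monitor: only inputs and outputs are observed,
# and each request is timed from outside the system under test.
def send_request(query):
    time.sleep(0.01)  # placeholder for the real system under test
    return "ok"

samples = []
for query in ["Q1", "Q2", "Q3"]:
    start = time.perf_counter()
    send_request(query)
    samples.append(time.perf_counter() - start)

print(f"mean response time: {1000 * sum(samples) / len(samples):.1f} ms")
```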
Common mistakes in benchmarking
• Only average behavior represented in the test workload
  • Variance is ignored
• Skewness of device demands ignored
  • I/O or network requests evenly distributed during the test, which might not be the case in real environments
• Loading level controlled inappropriately
  • Think time, i.e. the time between workload items, and the number of users increased/decreased inappropriately
• Caching effects ignored
  • Order of arrival for requests
  • Elements thrown out of the queues?
Common mistakes in benchmarking
• Buffer sizes not appropriate
  • Should represent the values used in production systems
• Inaccuracies due to sampling ignored
  • Make sure to use accurately sampled data
• Ignoring monitoring overhead
• Not validating measurements
  • Is the measured data correct?
• Not ensuring the same initial conditions
  • Disk space, starting time of monitors, things are run by hand …
Common mistakes in benchmarking
• Not measuring transient performance
  • This depends on the system, but if the system spends more time in transitions than in steady state, it has to be considered: know your system!
• Collecting too much data but doing very little analysis
  • In measurements, often all the time is used to obtain the data, and little time is left to analyze it
  • It is more fun to experiment than to analyze the data
  • It is hard to use statistical techniques to get significant results; "let's just show the average" (a sketch of reporting a confidence interval instead follows below)
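A small sketch of the analysis step the slide warns about skipping: reporting a confidence interval rather than a bare average. It uses a normal approximation, and the sample values are invented:

```python
import math
import statistics

# Sketch: report a confidence interval, not just the average.
# Normal approximation; assumes a reasonable number of samples.
samples = [12.1, 13.4, 11.8, 14.0, 12.7, 13.1, 12.5, 13.8]  # invented

mean = statistics.mean(samples)
sem = statistics.stdev(samples) / math.sqrt(len(samples))
half_width = 1.96 * sem  # ~95% confidence under normal approximation

print(f"mean = {mean:.2f} ± {half_width:.2f} (95% CI)")
```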
The art of data presentation
It is not what you say, but how you say it.
- A. Putt
• Results from performance evaluations aim to
help in decision making
• Decision makers do not have time to dig into
complex result sets
• Requires prudent use of words, pictures, and
graphs to explain the results and the analysis
Some glorious examples
INF5100 © 2008 Jarle Søberg 17
[Figures: availability and unavailability plotted against day of the week]
Some glorious examples (cont.)
[Figures: plots of response time, utilization, and throughput]
Overview
• What is performance evaluation and benchmarking?
  • Theory
  • Examples
• Domain-specific benchmarks and benchmarking DBMSs
  • We focus on the most popular one: TPC
Domain-specific benchmarks
• No single metric can measure the performance of computer systems on all applications
  • Simple update-intensive transactions for online databases
    vs.
  • Speed in decision-support queries
The key criteria for a domain-specific
benchmark
• Relevant
  • Performs typical operations within the problem domain
• Portable
  • Should be easy to implement and run on many different systems and architectures
• Scalable
  • Applies to larger systems or parallel systems as they evolve
• Simple
  • Should be understandable in order to maintain credibility
TPC: Transaction Processing Performance Council
• Background
  • IBM released an early benchmark, TP1, in the early 1980s
    • ATM transactions in batch mode
    • No user interaction
    • No network interaction
    • Originally used internally at IBM, and thus poorly defined
    • Exploited by many other commercial vendors
  • Anon (i.e. Gray) et al. released a more carefully specified benchmark, DebitCredit, in 1985
    • Total system cost published together with the performance rating
    • Test specified in terms of high-level functional requirements
    • A bank with several branches and ATMs connected to the branches
    • The benchmark workload had scale-up rules
    • The overall transaction rate was constrained by a response-time requirement
    • Vendors often deleted key requirements in DebitCredit to improve their performance results
TPC: Transaction Processing Performance
Council
• A need for a more standardized benchmark
• In 1988, eight companies came together and
formed TPC
• Started making benchmarks based on the
domains used in DebitCredit.
Early (and obsolete) TPCs
• TPC-A
  • 90 percent of transactions must complete in less than 2 seconds (see the sketch below)
  • 10 ATM terminals per system, and the cost of the terminals was included in the system price
  • Could be run in a local- or wide-area network configuration
    • DebitCredit specified only WANs
  • The ACID requirements were bolstered, and specific tests were added to ensure ACID viability
  • TPC-A specified that all benchmark testing data should be publicly disclosed in a Full Disclosure Report
• TPC-B
  • Vendors complained about all the extras in TPC-A
  • Server vendors were not interested in adding terminals and networks
  • TPC-B was a standardization of TP1 (stripped down to the core)
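A hedged sketch of checking a TPC-A-style response-time requirement; the response times below are invented for illustration:

```python
# Sketch: does a run meet a TPC-A-style requirement that 90 percent of
# transactions complete in under 2 seconds? Response times are invented.
def meets_requirement(response_times, limit=2.0, fraction=0.9):
    within = sum(1 for t in response_times if t < limit)
    return within / len(response_times) >= fraction

times = [0.4, 1.1, 0.9, 1.8, 2.5, 0.7, 1.3, 0.6, 1.0, 1.9]
print(meets_requirement(times))  # True: 9 of 10 finish under 2 seconds
```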
TPC-C
• On-line transaction processing (OLTP)
• More complex than TPC-A
• Handles orders in warehouses
  • 10 sales districts per warehouse
  • 3,000 customers per district
• Each warehouse must cooperate with the other warehouses to complete orders
• TPC-C measures how many complete business operations can be processed per minute (a minimal throughput sketch follows below)
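In that spirit, a trivial sketch of a per-minute throughput metric; the counts and the measurement interval are invented:

```python
# Sketch: throughput as completed business transactions per minute
# over a measurement interval. All numbers are invented.
completed_transactions = 126_000
measurement_minutes = 30.0

tpm = completed_transactions / measurement_minutes
print(f"{tpm:.0f} transactions per minute")
```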
TPC-C (results)
[Figure: TPC-C results, © 2007 TPC]
TPC-E
• Considered a successor of TPC-C
• Models a brokerage house
  • Customers
  • Accounts
  • Securities
• Pseudo-real data
• More complex than TPC-C, as the table below shows
  Characteristic          TPC-E                             TPC-C
  Tables                  33                                9
  Columns                 188                               92
  Min cols / table        2                                 3
  Max cols / table        24                                21
  Data type count         Many                              4
  Data types              UID, CHAR, NUM, DATE, BOOL, LOB   UID, CHAR, NUM, DATE
  Primary keys            33                                8
  Foreign keys            50                                9
  Tables w/ foreign keys  27                                7
  Check constraints       22                                0
  Referential integrity   Yes                               No

  (© 2007 TPC)
TPC-E (results)
[Figure: TPC-E results, © 2007 TPC]
TPC-H
• Decision support
• Simulates an environment in which users connected to the database system send individual queries that are not known in advance
• Metric: the Composite Query-per-Hour Performance Metric (QphH@Size), which reflects
  • The selected database size against which the queries are executed
  • The query processing power when queries are submitted by a single stream
  • The query throughput when queries are submitted by multiple concurrent users (a sketch of the composite calculation follows below)
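To the best of my knowledge, the composite is the geometric mean of the power and throughput metrics at a given database size; a hedged sketch with invented input values:

```python
import math

# Hedged sketch: QphH@Size as the geometric mean of the power and
# throughput metrics at a given size. Input values are invented.
power_at_size = 12_000.0       # single-stream query power (QppH@Size)
throughput_at_size = 9_000.0   # multi-stream throughput (QthH@Size)

qphh = math.sqrt(power_at_size * throughput_at_size)
print(f"QphH@Size = {qphh:.0f}")
```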
References
• Raj Jain: The Art of Computer Systems Performance Analysis. Wiley, 1991.
• Jim Gray (ed.): The Benchmark Handbook for Database and Transaction Processing Systems. Morgan Kaufmann, 1991.
• The TPC homepage: www.tpc.org
• M. Poess and C. Floyd: New TPC Benchmarks for Decision Support and Web Commerce. SIGMOD Record 29(4), Dec. 2000, pp. 64-71.