Practical Performance Management for Oracle RAC
Barb Lundhild, RAC Product Management
Michael Zoll, RAC Development, Performance
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
• Oracle RAC Fundamentals and Infrastructure
• Common Problems and Symptoms
• Application and Database Design
• Diagnostics and Problem Determination
• Summary: Practical Performance Analysis
• Appendix
OBJECTIVE
• Realize that Oracle RAC performance does not require “Black Magic”
• General system and SQL analysis and tuning experience is practically sufficient for Oracle RAC
• Problems can be identified with a minimum of metrics and effort
• Diagnostics framework and advisories are efficient
RAC Fundamentals and Infrastructure
Oracle RAC Architecture

[Architecture diagram: Each node (Node 1 … Node n) runs its operating system, Oracle Clusterware, a database instance, ASM, a virtual IP (VIP), and a listener; services are published to clients over the public network. All nodes attach to shared storage that holds the database and control files, the redo and archive logs of all instances, and the OCR and voting disks, which can be managed by ASM or placed on raw devices.]
Oracle Clusterware
[Diagram: On each node (Node 1 … Node n), Oracle Clusterware runs the EVMD, CRSD, OPROCD, ONS, and CSSD daemons and manages the node’s VIP; CSSD runs at real-time priority. The OCR and voting disks reside on shared storage (raw devices).]
Under the Covers
[Diagram: Each instance (Instance 1 … Instance n) has its own SGA (buffer cache, library cache, dictionary cache, log buffer, Global Resource Directory) and background processes (VKTM, LGWR, DBW0, SMON, PMON, LMON, LMD0, LMS0, DIAG); LMS0 runs at real-time priority. The instances communicate over the cluster’s private high-speed network. Each instance has its own redo log files; the data files and control files are shared by all.]
Global Cache Service (GCS)
• Manages coherent access to data in buffer caches of all instances in the cluster
• Minimizes access time to data which is not in the local cache
  • access to data in the global cache is faster than disk access
• Implements fast direct memory access over high-speed interconnects
  • for all data blocks and types
• Uses an efficient and scalable messaging protocol
  • never more than 3 hops
• New optimizations for read-mostly applications
Cache Hierarchy: Data in Remote Cache
[Diagram: local cache miss → data block requested → data block returned from the remote cache (remote cache hit).]
Cache Hierarchy: Data On Disk
[Diagram: local cache miss → data block requested → grant returned (remote cache miss) → disk read.]
Cache Hierarchy: Read Mostly
[Diagram: local cache miss → no message required → disk read.]
Performance of Cache Fusion
[Diagram: the requestor initiates a send (~200-byte message) and waits; LMS on the holding instance receives the message, processes the block, and sends it back (e.g. an 8K block); the requestor receives it. Wire times: 200 bytes / (1 Gb/sec) for the message, 8192 bytes / (1 Gb/sec) for the block.]

Total access time: e.g. ~360 microseconds (UDP over GbE). Network propagation delay (“wire time”) is a minor factor in the roundtrip time (approx. 6%, vs. 52% in the OS and network stack).
Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

Roundtrip time (ms) by block size:

            2K      4K      8K      16K
UDP/GbE     0.30    0.31    0.36    0.46
RDS/IB      0.12    0.13    0.16    0.20

(*) Roundtrip; blocks are not “busy”, i.e. no log flush, no serialization (“buffer busy”). AWR and Statspack reports show averages as if they were normally distributed; the session wait history, included in Statspack in 10.2 and AWR in 11g, shows the actual quantiles (a query sketch follows). The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but they can be taken as the expected values (i.e. 10 ms for a 2-way block would be very high).
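To look at actual wait times rather than averages, the recent per-session wait history can be inspected. A minimal sketch, assuming 10.2 or later; V$SESSION_WAIT_HISTORY keeps roughly the last 10 waits per session, and WAIT_TIME is in centiseconds:

-- Recent global cache waits per session; WAIT_TIME is in centiseconds,
-- 0 means the wait completed without the session actually waiting.
SELECT sid, event, wait_time
FROM   v$session_wait_history
WHERE  event LIKE 'gc%'
ORDER  BY wait_time DESC;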
Infrastructure: Private Interconnect
• The network between the nodes of a RAC cluster MUST be private
  • best practice is not to share the interconnect with iSCSI storage (a verification query follows)
• Supported links: GbE, IB (IPoIB: 10.2)
• Supported transport protocols: UDP, RDS (10.2.0.3)
• Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
• Large (Jumbo) Frames recommended for GbE
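A quick check of which interface each instance actually uses, and that it is not public; a sketch assuming 10g or later, where GV$CLUSTER_INTERCONNECTS is available:

-- Interconnect interface per instance: name, address, whether it is
-- public, and where the setting came from (OS, parameter, or OCR).
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects;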
Infrastructure: Interconnect Bandwidth
• Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
• Typical utilization is approx. 10-30% in OLTP
  • 10000-12000 8K blocks per second saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth; see the worked arithmetic below)
• Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP
• DSS/DW systems should be designed with > 1 Gb/sec capacity
• A sizing approach with rules of thumb is described in:
  • Project MegaGrid: Capacity Planning for Large Commodity Clusters (http://otn.oracle.com/rac)
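The saturation figure can be sanity-checked with back-of-the-envelope arithmetic, assuming 8K blocks and 75-80% of the nominal 1 Gb/sec being usable:

\[
\frac{0.75 \times 10^{9}\ \mathrm{bit/s}}{8192\ \mathrm{B} \times 8\ \mathrm{bit/B}} \approx 11{,}400\ \mathrm{blocks/s},
\qquad
\frac{0.80 \times 10^{9}\ \mathrm{bit/s}}{8192\ \mathrm{B} \times 8\ \mathrm{bit/B}} \approx 12{,}200\ \mathrm{blocks/s}
\]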
Infrastructure: IPC configuration
• Important settings:
  • negotiated top bit rate and full duplex mode
  • NIC ring buffers
  • Ethernet flow control settings
  • CPU(s) receiving network interrupts
• Verify your setup:
  • CVU does checking
  • load testing eliminates potential for problems
  • AWR and ADDM give estimations of link utilization (a query sketch follows)
• Buffer overflows, congested links and flow control can have severe consequences for performance
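The block traffic behind those estimations can be approximated from the global cache counters. A rough sketch only; the statistic names exist in V$SYSSTAT, an 8K block size is assumed, and message traffic (~200 bytes each, per the Cache Fusion slide) is not counted:

-- Approximate interconnect block traffic per instance since startup:
-- blocks received plus blocks served, converted to MB at 8K per block.
SELECT inst_id,
       SUM(value)                   AS gc_blocks,
       SUM(value) * 8192 / 1048576  AS approx_mb
FROM   gv$sysstat
WHERE  name IN ('gc cr blocks received', 'gc current blocks received',
                'gc cr blocks served',   'gc current blocks served')
GROUP  BY inst_id;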
Infrastructure: Operating System
• Block access latencies increase when CPU(s) are busy and run queues are long
• Immediate LMS scheduling is critical for predictable block access latencies when CPUs are > 80% busy
• Fewer and busier LMS processes may be more efficient
  • monitor their CPU utilization (see the sketch after this list)
  • caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time
  • the default is good for most requirements
• Higher priority for LMS is the default
  • the implementation is platform-specific
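A sketch for monitoring LMS CPU consumption from inside the database; the statistic 'CPU used by this session' is in centiseconds, and the LIKE pattern on PROGRAM is an assumption about how the platform names the background processes:

-- Cumulative CPU seconds consumed by each LMS process, per instance.
SELECT s.inst_id, s.program,
       ROUND(st.value / 100, 1) AS cpu_seconds
FROM   gv$session  s
JOIN   gv$sesstat  st ON st.sid = s.sid AND st.inst_id = s.inst_id
JOIN   gv$statname n  ON n.statistic# = st.statistic# AND n.inst_id = st.inst_id
WHERE  n.name = 'CPU used by this session'
AND    s.type = 'BACKGROUND'
AND    s.program LIKE '%LMS%'
ORDER  BY s.inst_id, s.program;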
Common Problems and Symptoms
• “Lost Blocks”: interconnect or switch problems
• Slow or bottlenecked disks
• System load and scheduling
• Contention
• Unexpectedly high latencies
Misconfigured or Faulty Interconnect Can Cause:

• Dropped packets/fragments
• Buffer overflows
• Packet reassembly failures or timeouts
• Ethernet flow control kicking in
• TX/RX errors

These show up as “lost blocks” at the RDBMS level and are responsible for 64% of escalations.
“Lost Blocks”: NIC Receive Errors
Db_block_size = 8K
ifconfig -a:
eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04
inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
…
“Lost Blocks”: IP Packet Reassembly Failures
netstat -s
Ip:
    84884742 total packets received
    …
    1201 fragments dropped after timeout
    …
    3384 packet reassembles failed
Top 5 Timed Events
Event               Waits      Time(s)   Avg wait (ms)  %Total Call Time  Wait Class
------------------  ---------  --------  -------------  -----------------  ----------
log file sync         286,038    49,872            174               41.7  Commit
gc buffer busy        177,315    29,021            164               24.3  Cluster
gc cr block busy      110,348     5,703             52                4.8  Cluster
gc cr block lost        4,272     4,953          1,159                4.1  Cluster
cr request retry        6,316     4,668            739                3.9  Other
Finding a Problem with the Interconnect or IPC
(gc cr block lost should never be among the top events)
Global Cache Lost block handling
• Detection time reduced in 11g
  • 500 ms (around 5 secs in 10g)
  • can be lowered if necessary
  • robust (no false positives)
  • no extra overhead
• The cr request retry event is related to lost blocks
  • it is highly likely to appear when gc cr blocks lost show up
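A quick check for lost blocks; a sketch using statistics that exist in V$SYSSTAT in 10g and 11g:

-- Blocks lost or corrupt in transit, per instance, since startup;
-- a steadily growing 'gc blocks lost' points at the interconnect.
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name IN ('gc blocks lost', 'gc blocks corrupt')
ORDER  BY inst_id, name;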
Interconnect Statistics: Automatic Workload Repository (AWR)

Target     Avg Latency  Stddev    Avg Latency  Stddev
Instance   500B msg     500B msg  8K msg       8K msg
-----------------------------------------------------
1          .79          .65       1.04         1.06
2          .75          .57        .95          .78
3          .55          .59        .53          .59
4          1.59         3.16      1.46         1.82
-----------------------------------------------------

• Latency probes for different message sizes
• Exact throughput measurements (not shown)
• Send and receive errors, dropped packets (not shown)
“Blocks Lost”: Solution
• Fix interconnect NICs and switches
• Tune IPC buffer sizes
Disk IO Performance Issues
• Log flush IO delays can cause “busy” buffers
• “Bad” queries on one node can saturate an interconnect link
• IO is issued from ALL nodes to shared storage
• Use Automatic Database Diagnostic Monitor (ADDM) / AWR
  • single system image of I/O across the cluster

Cluster-wide impact of IO or query plan issues is responsible for 23% of escalations.
Cluster-Wide I/O Impact
Node 1:

Top 5 Timed Events
Event               Waits      Time(s)   Avg wait (ms)  %Total Call Time
------------------  ---------  --------  -------------  -----------------
log file sync         286,038    49,872            174               41.7
gc buffer busy        177,315    29,021            164               24.3
gc cr block busy      110,348     5,703             52                4.8

Node 2 (expensive query):

Load Profile          Per Second
~~~~~~~~~~~~       ---------------
Redo size:              40,982.21
Logical reads:          81,652.41
Physical reads:         51,193.37

1. IO on the disk group containing the redo logs is bottlenecked
2. Block shipping for “hot” blocks is delayed by log flush IO
3. Serialization/queues build up
IO and/or Bad SQL problem fixed
Top 5 Timed Events
Event                    Waits      Time (s)  Avg wait (ms)  %Total Call Time  Wait Class
-----------------------  ---------  --------  -------------  -----------------  ----------
CPU time                               4,580                              65.4
log file sync              276,281     1,501              5              21.4  Commit
log file parallel write    298,045       923              3              13.2  System I/O
gc current block 3-way     605,628       631              1               9.0  Cluster
gc cr block 3-way          514,218       533              1               7.6  Cluster

1. Log file writes are normal
2. Global serialization has disappeared
Drill-down: An IO capacity problem
Symptom of full table scans and I/O contention:

Top 5 Timed Events
Event                      Waits       Time(s)   Avg wait (ms)  %Total Call Time  Wait Class
-------------------------  ----------  --------  -------------  -----------------  ----------
db file scattered read      3,747,683   368,301             98              33.3  User I/O
gc buffer busy              3,376,228   233,632             69              21.1  Cluster
db file parallel read       1,552,284   225,218            145              20.4  User I/O
gc cr multi block request  35,588,800   101,888              3               9.2  Cluster
read by other session       1,263,599    82,915             66               7.5  User I/O
IO issues: Solution
• Tune the IO layout
• Tune queries that do a lot of IO
CPU Saturation or Long Run Queues
Top 5 Timed Events
Event                       Waits      Time(s)  Avg wait (ms)  %Total Call Time  Wait Class
--------------------------  ---------  -------  -------------  -----------------  ----------
db file sequential read     1,312,840   21,590             16              21.8  User I/O
gc current block congested    275,004   21,054             77              21.3  Cluster
gc cr grant congested         177,044   13,495             76              13.6  Cluster
gc current block 2-way      1,192,113    9,931              8              10.0  Cluster
gc cr block congested          85,975    8,917            104               9.0  Cluster

“Congested”: LMS could not dequeue messages fast enough.
Cause: long run queue, CPU starvation.
High CPU Load: Solution
• Run LMS at higher priority (the default)
• Start more LMS processes
• Reduce the number of user processes
• Find the cause of the high CPU consumption
Contention
Event                   Waits    Time (s)  Avg (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062     5,767        18         19.0
gc current block 2-way  201,663     4,063        20         13.4
gc buffer busy          111,372     3,970        36         13.1
CPU time                            2,938                    9.7
gc cr block busy         40,688     1,670        41          5.5

Global contention on data: serialization. It is very likely that gc cr block busy and gc buffer busy are related.
Contention: Solution
• Identify “hot” blocks in the application (a query sketch follows)
• Reduce concurrency on hot blocks
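A sketch for finding the hot segments; V$SEGMENT_STATISTICS is available in 10g, and the LIKE pattern allows for the 11g split of the statistic into acquire/release variants:

-- Segments with the most global cache "buffer busy" contention.
SELECT * FROM (
  SELECT owner, object_name, object_type, statistic_name, value
  FROM   v$segment_statistics
  WHERE  statistic_name LIKE 'gc buffer busy%'
  ORDER  BY value DESC
) WHERE ROWNUM <= 10;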
High Latencies
Event                   Waits    Time (s)  Avg (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062     5,767        18         19.0
gc current block 2-way  201,663     4,063        20         13.4
gc buffer busy          111,372     3,970        36         13.1
CPU time                            2,938                    9.7
gc cr block busy         40,688     1,670        41          5.5

Tackle latency first, then tackle busy events.
Expected: 2-way and 3-way events.
Unexpected: averages > 1 ms (the avg should be around 1 ms).
High Latencies : Solution
• Check the network configuration
  • private
  • running at the expected bit rate
• Find the cause of high CPU consumption
  • runaway or spinning processes
Health Check
Look for:
• Unexpected events:
    gc cr block lost              1159 ms
• Unexpected “hints” of contention and serialization:
    gc cr/current block busy        52 ms
• Load and scheduling:
    gc current block congested      14 ms
• Unexpectedly high averages:
    gc cr/current block 2-way       36 ms

(A query to check the actual averages follows.)
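The same health check can be scripted against the system-wide event statistics. A sketch assuming 10g or later, where TIME_WAITED_MICRO exists in GV$SYSTEM_EVENT:

-- Average latency of global cache events per instance, in ms.
SELECT inst_id, event, total_waits,
       ROUND(time_waited_micro / 1000 / NULLIF(total_waits, 0), 1) AS avg_ms
FROM   gv$system_event
WHERE  event LIKE 'gc%'
ORDER  BY time_waited_micro DESC;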
Application and Database Design
General Principles
• No fundamentally different design and coding practices for RAC
• BUT: flaws in execution or design have a higher impact in RAC
  • performance and scalability in RAC are more sensitive to bad plans or bad schema design
  • serializing contention makes applications less scalable
• Standard SQL and schema tuning solves > 80% of performance problems
Scalability Pitfalls
• Serializing contention on a small set of data/index blocks
  • monotonically increasing keys
  • frequent updates of small cached tables
  • segments without Automatic Segment Space Management (ASSM) or Free List Groups (FLG)
• Full table scans
  • optimization for full scans in 11g can save CPU and latency
• Frequent invalidation and parsing of cursors
  • requires data dictionary lookups and synchronization
• Concurrent DDL (e.g. truncate/drop)
Health Check
Look for:
• Indexes with right-growing characteristics
  • eliminate indexes which are not needed
• Frequent updates and reads of “small” tables
  • “small” = fits into a single buffer cache
  • sparse blocks (PCTFREE 99) will reduce serialization (see the sketch below)
• SQL which scans large amounts of data
  • perhaps more efficient when parallelized
  • direct reads do not need to be globally synchronized (hence less CPU for the global cache)
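A minimal sketch of the sparse-block idea; the table and its columns are hypothetical. With PCTFREE 99, each block holds roughly one row, so concurrent updates of different rows rarely collide on the same block (at the cost of space):

-- Hypothetical small, frequently updated lookup table kept sparse.
CREATE TABLE app_counters (
  counter_id   NUMBER PRIMARY KEY,
  counter_val  NUMBER
) PCTFREE 99;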
Diagnostics and Problem Determination
MOST OF THE TIME, A PERFORMANCE PROBLEM IS NOT AN Oracle RAC PROBLEM
Checklist for the Skeptical Performance Analyst (AWR based)

• Check where most of the time in the database is spent (“Top 5”)
• Check whether gc events are “busy” or “congested”
• Check the avg wait time
• Drill down:
  • SQL with the highest cluster wait time (a query sketch follows)
  • segment statistics with the highest block transfers

or JUST USE ADDM with Oracle RAC 11g!
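A sketch of the SQL drill-down; V$SQL exposes CLUSTER_WAIT_TIME (in microseconds) from 10g onward:

-- Statements with the highest cluster wait time across all instances.
SELECT * FROM (
  SELECT inst_id, sql_id, executions, cluster_wait_time, elapsed_time
  FROM   gv$sql
  ORDER  BY cluster_wait_time DESC
) WHERE ROWNUM <= 10;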
Drill-down: An IO capacity problem
Symptom of full table scans and I/O contention:

Top 5 Timed Events
Event                      Waits       Time(s)   Avg wait (ms)  %Total Call Time  Wait Class
-------------------------  ----------  --------  -------------  -----------------  ----------
db file scattered read      3,747,683   368,301             98              33.3  User I/O
gc buffer busy              3,376,228   233,632             69              21.1  Cluster
db file parallel read       1,552,284   225,218            145              20.4  User I/O
gc cr multi block request  35,588,800   101,888              3               9.2  Cluster
read by other session       1,263,599    82,915             66               7.5  User I/O
Drill-down: SQL Statements
“Culprit”: Query that overwhelms IO subsystem on one node
Physical Reads  Executions  Reads per Exec  %Total
--------------  ----------  --------------  ------
   182,977,469       1,055       173,438.4    99.3

SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC

The same query reads from the interconnect:

Cluster         CWT % of      CPU
Wait Time (s)   Elapsed Time  Time (s)    Executions
--------------  ------------  ----------  ----------
    341,080.54          31.2   17,495.38       1,055
SELECT SHELL FROM ES_SHELL WHERE MSG_ID = :msg_id ORDER BY ORDER_NO ASC
Tablespace  Object    Subobject  Obj    GC Buffer  % of
Name        Name      Name       Type   Busy       Capture
----------  --------  ---------  -----  ---------  -------
ESSMLTBL    ES_SHELL  SYS_P537   TABLE    311,966     9.91
ESSMLTBL    ES_SHELL  SYS_P538   TABLE    277,035     8.80
ESSMLTBL    ES_SHELL  SYS_P527   TABLE    239,294     7.60
…
Drill-Down: Top Segments
Apart from being the table with the highest IO demand, it was also the table with the highest number of block transfers AND global serialization.
Findings Summary in EM
• Each finding type has a descriptive name
• Facilitates search / aggregation / directives etc.
Recommendations
• Most relevant data for analysis can be derived from the wait events
• Always use Enterprise Manager (EM) and ADDM reports for performance health checks and analysis
• Active Session History (ASH) can be used for session-based analysis of variation
• Export the AWR repository regularly to preserve all of the above (see the note below)
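For the export, the AWR extract script shipped under $ORACLE_HOME/rdbms/admin can dump the repository to a Data Pump file; prompts and details vary by version:

-- Run from SQL*Plus as a DBA; the script prompts for the DBID,
-- the snapshot range, and a directory object for the dump file.
@?/rdbms/admin/awrextr.sql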
ADDM Diagnosis for RAC
• Data sources:
  • wait events (especially Cluster class and buffer busy)
  • ASH
  • instance cache transfer data
  • interconnect statistics (throughput, usage by component, pings)
• ADDM analyzes both the entire database (DATABASE analysis mode) and each instance (INSTANCE analysis mode)
• Analysis of both database and instance resources is summarized in a single report
• Allows drill-down to a specific instance
What ADDM Diagnoses for RAC
• Latency problems in the interconnect
• Congestion (identifying top instances affecting the entire cluster)
• Contention (buffer busy, top objects, etc.)
• Top consumers of multiblock requests
• Lost blocks
• Reports information about interconnect devices; warns about the use of PUBLIC interfaces
• Reports the throughput of devices, and how much of it is used by Oracle and for what purpose (GC, locks, PQ)
Q & A
OTHER SESSIONS TO CHECK OUT

THURSDAY
10:00 AM  S291242  Demystifying Oracle RAC Internals (South 104)
 1:00 PM  S291662  Using Oracle RAC and Microsoft Windows 64-bit as the Foundation (with Intel and Talx) (South 309)
 4:00 PM  S291670  Oracle Database 11g: First Experiences with Grid Computing (with Mobiltel and BCF) (South 310)
For More Information
http://search.oracle.com
or
otn.oracle.com/rac
REAL APPLICATION CLUSTERS