Copyright 2008, Oracle. All rights reserved.
Oracle Clusterware and Private Network Considerations
Much of this presentation is based on work by Michael Zoll and the RAC Performance Development group
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Agenda
Architectural Overview
RAC and Cache Fusion Performance
Infrastructure
Common Problems and Resolution
Aggregation and VLANs
Oracle Clusterware

[Diagram: each cluster node runs EVMD, CRSD, OPROCD, ONS and CSSD on the OS, with CSSD running at real-time priority. VIPs (VIP1..VIPn) sit on the public network; shared storage holds the OCR and voting disks on raw devices; the nodes are joined by a cluster private high-speed network through an L2/L3 switch.]
Under the Covers

[Diagram: RAC instances 1..n, one per node. Each SGA contains a buffer cache, library cache, dictionary cache, log buffer and the Global Resource Directory; each instance runs LMON, LMD0, LMS0, DIAG, VKTM, LGWR, DBW0, SMON and PMON, with LMS at real-time priority. All instances share the redo log files, data files and control files, and communicate over the cluster private high-speed network through an L2/L3 switch.]
Global Cache Service (GCS)
Manages coherent access to data in the buffer caches of all instances in the cluster
Minimizes access time to data which is not in the local cache: access to data in the global cache is faster than disk access
Implements fast direct memory access over high-speed interconnects, for all data blocks and types
Uses an efficient and scalable messaging protocol: never more than 3 hops
New optimizations for read-mostly applications
Cache Hierarchy: Data in Remote Cache
Local cache miss -> data block requested -> remote cache hit -> data block returned
Cache Hierarchy: Data on Disk
Local cache miss -> data block requested -> remote cache miss -> grant returned -> disk read
Cache Hierarchy: Read Mostly
Local cache miss -> no message required -> disk read
11.1: CPU Optimizations for Read-Intensive Operations
Read-only access: no messages, direct reads
Read-mostly access: message reductions
Latency improvements
Significant gains: reductions of 50-70% measured
Performance of Cache Fusion
Message: ~200 bytes; block: e.g. 8K
Requester: initiate send and wait -> LMS: receive, process block, send -> requester: receive
Wire time: 200 bytes / (1 Gb/sec) for the message, plus 8192 bytes / (1 Gb/sec) for the block
Total access time: e.g. ~360 microseconds (UDP over GbE)
Network propagation delay ("wire time") is a minor factor in roundtrip time (approx. 6%, vs. 52% in the OS and network stack)
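The serialization times above follow from quick arithmetic; a minimal illustrative sketch (the 1 Gb/s rate and the message/block sizes come from the slide, the rest is for illustration only):

```shell
# Time on the wire at 1 Gb/s, in microseconds
awk 'BEGIN {
  gbps = 1e9                                    # 1 Gb/sec link
  printf "200-byte message: %.1f us\n",  200 * 8 / gbps * 1e6
  printf "8K block:         %.1f us\n", 8192 * 8 / gbps * 1e6
}'
```

Only a small slice of the ~360 microsecond roundtrip is spent on the wire; most of the latency sits in the OS and network stack, which is the slide's point.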
Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

RT (ms)    Block size:  2K     4K     8K     16K
UDP/GE                  0.30   0.31   0.36   0.46
RDS/IB                  0.12   0.13   0.16   0.20

(*) roundtrip; blocks are not busy, i.e. no log flush, no serialization (buffer busy)

AWR and Statspack reports show averages as if they were normally distributed; the session wait history (included in Statspack in 10.2 and in AWR in 11g) shows the actual quantiles.
The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but they can be assumed to be the expected values (i.e. 10 ms for a 2-way block would be very high).
Infrastructure: Network Packet Processing

[Diagram: a sending process (FG/LMS) writes through the socket layer (tx buffers), TCP/UDP and the IP layer to the interface layer; frames cross the L2/L3 switch; on the receiving side the interface layer, IP layer, TCP/UDP and socket layer (rx buffers) deliver the data to the peer process (FG/LMS).]
Infrastructure: Interconnect Bandwidth
Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application
Typical utilization is approx. 10-30% in OLTP
10000-12000 8K blocks per sec saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)
Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP
DSS/DW systems should be designed with > 1 Gb/sec capacity
A sizing approach with rules of thumb is described in "Project MegaGrid: Capacity Planning for Large Commodity Clusters" (http://otn.oracle.com/rac)
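The saturation figure can be reproduced with back-of-envelope arithmetic; an illustrative sketch (the 75-80% efficiency factor is the slide's, not a measurement):

```shell
awk 'BEGIN {
  link   = 125 * 1000 * 1000      # 1 Gb/s = 125 MB/s theoretical
  usable = link * 0.75            # ~75-80% of theoretical bandwidth
  printf "~%d 8K blocks/sec saturate one GbE link\n", usable / 8192
}'
```

which lands in the 10000-12000 blocks/sec range quoted above.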
Infrastructure: Private Interconnect
The network between the nodes of a RAC cluster MUST be private
Supported links: GbE, IB (IPoIB: 10.2)
Supported transport protocols: UDP, RDS (10.2.0.3)
Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding
Large (Jumbo) Frames for GbE are recommended if the global cache workload requires it: global cache block shipping versus small lock message passing
Network Packet Processing: Layers, Queues and Buffers

[Diagram: the same send/receive path annotated with its queues and buffers. Send side (user to kernel): socket buffers, TCP/UDP, IP TX queues, interface-layer TX rings. Receive side: hardware interrupts fill the RX queues and buffers, software interrupts drain the RX IP input queue into the socket queues, and the process picks the data up with recv(). The L2/L3 switch has ingress and egress buffers; backplane pressure and CPU load can cause drops at any of these stages.]
Infrastructure: IPC configuration
Important Settings:
Negotiated top bit rate and full duplex mode
NIC ring buffers
Ethernet flow control settings
CPU(s) receiving network interrupts
Verify your setup:
CVU does checking
Load testing eliminates potential for problems
AWR and ADDM give estimations of link utilization
Buffer overflows, congested links and flow control can
have severe consequences for performance
Infrastructure: Operating System
Block access latencies increase when CPU(s) are busy and run queues are long
Immediate LMS scheduling is critical for predictable block access latencies when CPU > 80% busy
Fewer and busier LMS processes may be more efficient: monitor their CPU utilization
Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time; the default is good for most requirements
Higher priority for LMS is the default; the implementation is platform-specific
Common Problems and Symptoms
Lost Blocks: Interconnect or Switch Problems
System load and scheduling
Contention
Unexpectedly high global cache latencies
A Misconfigured or Faulty Interconnect Can Cause:
Dropped packets/fragments
Buffer overflows
Packet reassembly failures or timeouts
Ethernet flow control kicking in
TX/RX errors
"Lost blocks" at the RDBMS level, responsible for 64% of escalations
Lost Blocks: NIC Receive Errors
db_block_size = 8K
ifconfig -a:
eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04
     inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
     RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
     TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
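Counters like these can be checked mechanically; a hypothetical sketch that scans saved ifconfig output for non-zero RX error counts (the sample file and values are illustrative, not from a real system):

```shell
# Sample ifconfig output saved to a file for parsing
cat <<'EOF' > /tmp/ifconfig.sample
eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04
RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
EOF
# Field 3 is "errors:N"; report any RX line where N > 0
awk '/RX packets/ { split($3, a, ":"); if (a[2] > 0) print "RX errors:", a[2] }' /tmp/ifconfig.sample
```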
Lost Blocks: IP Packet Reassembly Failures
netstat -s
Ip:
  84884742 total packets received
  1201 fragments dropped after timeout
  3384 packet reassembles failed
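Even small absolute counts matter at RAC latencies; a quick illustrative calculation with the sample counters above:

```shell
awk 'BEGIN {
  total = 84884742; failed = 3384    # sample counters from netstat -s
  printf "reassembly failures: %.4f%% of received packets\n", failed * 100 / total
}'
```

A rate that looks negligible, yet, per the AWR example later in this deck, each lost block can cost on the order of a second of wait time.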
Finding a Problem with the Interconnect or IPC

Top 5 Timed Events                       Avg wait  % Total
Event              Waits     Time(s)     (ms)      Call Time  Wait Class
------------------------------------------------------------------------
log file sync      286,038   49,872       174      41.7       Commit
gc buffer busy     177,315   29,021       164      24.3       Cluster
gc cr block busy   110,348    5,703        52       4.8       Cluster
gc cr block lost     4,272    4,953      1159       4.1       Cluster   <-- should never be here
cr request retry     6,316    4,668       739       3.9       Other
Global Cache Lost Block Handling
Detection time in 11g reduced to 500 ms (around 5 secs in 10g)
  can be lowered if necessary
  robust (no false positives)
  no extra overhead
The "cr request retry" event is related to lost blocks: it is highly likely to appear when "gc cr blocks lost" shows up
Interconnect Statistics
Automatic Workload Repository (AWR)

Target     Avg Latency  Stddev    Avg Latency  Stddev
Instance   500B msg     500B msg  8K msg       8K msg
------------------------------------------------------
1           .79          .65      1.04         1.06
2           .75          .57       .95          .78
3           .55          .59       .53          .59
4          1.59         3.16      1.46         1.82
------------------------------------------------------

Latency probes for different message sizes
Exact throughput measurements (not shown)
Send and receive errors, dropped packets (not shown)
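Outliers such as instance 4 above can be picked out mechanically; a hypothetical sketch over the same sample rows (the 1.2 ms threshold is an arbitrary illustration, not an Oracle recommendation):

```shell
# Columns: instance, 500B avg, 500B stddev, 8K avg, 8K stddev
printf '%s\n' \
  '1 .79 .65 1.04 1.06' \
  '2 .75 .57 .95 .78'   \
  '3 .55 .59 .53 .59'   \
  '4 1.59 3.16 1.46 1.82' |
awk '$4 > 1.2 { print "instance", $1, "8K avg latency", $4, "ms" }'
```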
Blocks Lost: Solution
Fix interconnect NICs and switches
Tune IPC buffer sizes
CPU Saturation or Long Run Queues

Top 5 Timed Events                                  Avg wait  % Total
Event                       Waits      Time(s)      (ms)      Call Time  Wait Class
-----------------------------------------------------------------------------------
db file sequential read     1,312,840  21,590        16       21.8       User I/O
gc current block congested    275,004  21,054        77       21.3       Cluster
gc cr grant congested         177,044  13,495        76       13.6       Cluster
gc current block 2-way      1,192,113   9,931         8       10.0       Cluster
gc cr block congested          85,975   8,917       104        9.0       Cluster

"Congested": LMS could not dequeue messages fast enough
Cause: long run queue, CPU starvation
High CPU Load: Solution
Run LMS at higher priority (default)
Start more LMS processes
Never use more LMS processes than CPUs
Reduce the number of user processes
Find the cause of the high CPU consumption
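The LMS count is controlled by the GCS_SERVER_PROCESSES initialization parameter; a hypothetical sketch (the value 2 is purely illustrative, and the parameter is static, so it takes effect only after an instance restart):

```shell
sqlplus / as sysdba <<'EOF'
-- Static parameter: set in the spfile, then restart the instances
ALTER SYSTEM SET gcs_server_processes = 2 SCOPE = SPFILE SID = '*';
EOF
```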
Contention

Event                    Waits    Time (s)  AVG (ms)  % Call Time
-----------------------  -------  --------  --------  -----------
gc cr block 2-way        317,062  5,767     18        19.0
gc current block 2-way   201,663  4,063     20        13.4
gc buffer busy           111,372  3,970     36        13.1
CPU time                          2,938               9.7
gc cr block busy          40,688  1,670     41         5.5
-----------------------------------------------------------------

Global contention on data serialization: it is very likely that CR BLOCK BUSY and GC BUFFER BUSY are related
Contention: Solution
Identify hot blocks in application
Reduce concurrency on hot blocks
High Latencies

Event                    Waits    Time (s)  AVG (ms)  % Call Time
-----------------------  -------  --------  --------  -----------
gc cr block 2-way        317,062  5,767     18        19.0
gc current block 2-way   201,663  4,063     20        13.4
gc buffer busy           111,372  3,970     36        13.1
CPU time                          2,938               9.7
gc cr block busy          40,688  1,670     41         5.5
-----------------------------------------------------------------

Tackle latency first, then tackle busy events
Expected: to see 2-way and 3-way events
Unexpected: to see AVG > 1 ms (the average should be around 1 ms)
High Latencies: Solution
Check the network configuration
  Private
  Running at the expected bit rate
Find the cause of high CPU consumption
  Runaway or spinning processes
Health Check
Look for:
Unexpected events:                       gc cr block lost            1159 ms
Hints of contention and serialization:   gc cr/current block busy      52 ms
Load and scheduling:                     gc current block congested    14 ms
Unexpectedly high averages:              gc cr/current block 2-way     36 ms
Gigabit Ethernet Definition
Max bandwidth 1000 Mbit = 125 MB per sec; excluding headers and pause frames, ~118 MB per sec
Equates to 85000 Clusterware/RAC messages or ~14000 8K blocks per second
A RAC workload has a mix of short messages of 256 bytes and long messages of db_block_size
For a real-life workload, only 60-70% of the bandwidth can be sustained
For a RAC-type workload, 40 MB per sec per interface is an optimal load
For additional bandwidth, more interfaces can be aggregated
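The 8K-block figure follows directly from the header-adjusted rate; an illustrative check:

```shell
awk 'BEGIN {
  usable = 118 * 1000 * 1000    # ~118 MB/s after headers and pause frames
  printf "~%d 8K blocks/sec\n", usable / 8192
}'
```

which roughly matches the ~14000 blocks per second quoted above.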
Aggregation: Active/Standby (single switch)

[Diagram: two NICs (ce2, ce4) on one node into a single switch; ce4 carries the virtual interface ce4:1.]

$ ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843 mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843 mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation: Active/Active (single switch)

[Diagram: two NICs (ce2, ce4) on one node into a single switch; ce4 carries the virtual interface ce4:1.]

$ ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843 mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843 mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation: Active/Standby (switch redundancy)

[Diagram: each node has four NICs (ce2, ce4, ce8, ce10) split across two switches; ce4 carries the virtual interface ce4:1.]

$ ifconfig -a
ce10: flags=69040843 mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843 mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843 mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255
Aggregation Solutions
Cisco EtherChannel, based on 802.3ad
AIX EtherChannel
HP-UX Auto Port Aggregation
Sun Trunking, IPMP, GLD
Linux bonding (only certain modes)
Windows NIC teaming
Aggregation Methods
Load balance / failover / load spreading
  spread on sends, serialize on receives
Active/Standby
Oracle interconnect requirements:
  load balancing on both the send and receive sides
  NIC and switch port failure detection
General Interconnect Recommendations
For OLTP workloads: normally 1 Gbit Ethernet with redundancy (active/standby or load-balance) is sufficient
For DW workloads: multiple GigE aggregated, 10 GigE, or InfiniBand
Oracle RAC Cluster Interconnect Network Selection
Oracle Clusterware: the IP address associated with the private hostname (provided during the install interview)
Oracle RAC Database: the private network specified during the install interview, or the IP address provided by the CLUSTER_INTERCONNECTS parameter
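Which network each layer actually selected can be verified after install; a hypothetical sketch against a live cluster (output shapes vary by version):

```shell
# Clusterware view: interfaces registered as public / cluster_interconnect
oifcfg getif
# Database view: the address each instance uses, and where it came from
sqlplus -s / as sysdba <<'EOF'
SELECT name, ip_address, source FROM v$cluster_interconnects;
EOF
```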
Jumbo Frames
Non-IEEE standard
Useful for NAS/iSCSI storage
Network device interoperability issues
Configure with care and test rigorously
Excerpt from alert.log:
  Maximum Transmission Unit (mtu) of the ether adapter is different on the node running instance 4 and this node. Ether adapters connecting the cluster nodes must be configured with identical mtu on all the nodes, for Oracle. Please ensure the mtu attribute of the ether adapter on all nodes [and switch ports] are identical, before running Oracle.
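A mismatch like the one in that alert can be caught before startup; a hypothetical sketch using a don't-fragment ping on Linux (interface and host names are examples only):

```shell
# Every node and every switch port must agree on the MTU
ip link show eth1 | grep -o 'mtu [0-9]*'    # expect "mtu 9000" on all nodes
# 8972 = 9000 bytes minus 28 bytes of IP + ICMP headers
ping -M do -s 8972 -c 3 node2-priv          # fails if any hop would fragment
```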
UDP Socket Buffers (rx)
Default settings are adequate for the majority of customers
The allocated buffer size may need to be increased when:
  the MTU size increases
  netstat reports fragmentation and/or reassembly errors
  ifconfig reports dropped packets or overflows
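On Linux the receive-buffer ceiling is a kernel tunable; a hypothetical config sketch (the values are illustrative only, not a recommendation):

```shell
# Raise the maximum and default socket receive buffer sizes
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.rmem_default=262144
```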
Cluster Interconnect NIC Settings
NIC driver dependent; DEFAULTS ARE GENERALLY SATISFACTORY
Changes can occur between OS versions
  Linux 2.4 => 2.6 kernels: flow control on e1000 drivers
  NAPI interrupt coalescence in 2.6
Confirm flow control: rx=on, tx=off
Confirm the full bit rate (1000) for the NICs
Confirm full-duplex auto-negotiation
Ensure NIC names/slots are identical on all nodes
Configure interconnect NICs on the fastest PCI bus
Ensure compatible switch settings
  802.3ad on NICs = 802.3ad on switch ports
  MTU=9000 on NICs = MTU=9000 on switch ports
FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING
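On Linux these checks map onto ethtool; a hypothetical sketch (the interface name is an example):

```shell
ethtool eth1        # speed: 1000Mb/s, duplex: full, auto-negotiation: on
ethtool -a eth1     # pause (flow control) parameters: expect RX on, TX off
ethtool -g eth1     # ring buffer sizes vs. their hardware maximums
```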
The Interconnect and VLANs
The interconnect should be a dedicated, non-routable subnet mapped to a single dedicated, non-shared VLAN
If VLANs are trunked, the interconnect VLAN traffic should not exceed the access switch layer
Minimize the impact of Spanning Tree events
Monitor the switch(es) for congestion
Avoid QoS definitions that may negatively impact interconnect performance
Questions & Answers