Application High Availability with Oracle
Aychin Gasimov 02/2014
Application High Availability
• Application must be able to provide uninterrupted service to its end users.
• Application must be able to handle the cases listed below:
– Failure of a member instance of the service
– Failure of all instances of the service
– Node/Site failure
– Planned downtimes
Required components
• Oracle Clusterware, Oracle Restart, Oracle Data Guard
• FAN
• ONS
• Services
• UCP
• LBA and different types of load balancing
• FCF
• TAF
FAN Fast Application Notification
• FAN is a notification mechanism that Oracle Clusterware uses to notify other processes
• FAN publishes service/instance/node state change events, like UP and DOWN
• FAN also publishes load balancing advisory events
• FAN events are published using Oracle Notification Service and Oracle Streams Advanced Queuing.
• Oracle Net Services listeners are integrated with FAN events
FAN Fast Application Notification
FAN publishes service/instance/node state change events, like UP and DOWN
FAN notifies about configuration and service level information that includes service status changes, such as UP or DOWN events. Applications can respond to FAN events and take immediate action. FAN UP and DOWN events can apply to instances, services, and nodes.
For cluster configuration changes, the Oracle RAC high availability framework publishes a FAN event immediately when a state change occurs in the cluster. Instead of waiting for the application to poll the database and detect a problem, applications can receive FAN events and react immediately. With FAN, in-flight transactions can be immediately terminated and the client notified when the instance fails.
FAN Fast Application Notification
FAN publishes load balancing advisory events
FAN also publishes load balancing advisory events. Applications can take advantage of the load balancing advisory FAN events to direct work requests to the instance in the cluster that is currently providing the best service quality.
Listeners are integrated with FAN events
Oracle Net Services listeners are integrated with FAN events, enabling the listener and CMAN to immediately de-register services provided by the failed instance and to avoid erroneously sending connection requests to failed instances.
SUBSCRIBE_FOR_NODE_DOWN_EVENT_listener_name=ON (default)
ONS Oracle Notification Service
• A publish and subscribe service for communicating information about all FAN events.
• Oracle Notification Service is included as part of the Oracle Clusterware and Client software (ons.jar).
• Managed as a Clusterware resource
• One ONS process per node
• Can communicate with ONS processes on other nodes and on the client side
Services
• A named representation of one or more database instances. The service name for an Oracle database is normally its global database name. Clients use the service name to connect to one or more database instances.
• Logical abstractions for managing workloads in Oracle Database
• The services are tightly integrated with Oracle Database and are maintained in the data dictionary.
• Connection requests can include a database service name.
• Services enable you to configure a workload, administer it, enable and disable it, and measure the workload as a single entity.
• AWR records service performance. Each service has quality-of-service thresholds for response time and CPU consumption.
• Database Resource Manager can map services to consumer groups. Therefore, you can automatically manage the priority of one service relative to others.
• Services can be created with the DBMS_SERVICE package or the srvctl utility
Services
[Figure: applications connect to the services Srv1_db, Srv2_db and Srv3_db, which run on Instance1, Instance2 and Instance3 (Node 1, Node 2, Node 3) of the Oracle cluster CLS1. The diagram labels show resource shares of 70%/30%, 40%/60% and 100%, and per-service threshold pairs: RTPC 0.5s / CPUPC 0.3s, RTPC 0.7s / CPUPC 0.5s, RTPC 0.8s / CPUPC 0.6s, RTPC 0.5s / CPUPC 0.3s, RTPC 0.3s / CPUPC 0.2s.]
• Using Resource Manager to distribute resources between services
• Setting thresholds on response time per call (RTPC) and CPU time per call (CPUPC) for the services
UCP Universal Connection Pool
• UCP for JDBC provides a connection pool implementation for caching JDBC connections. Java applications that are database-intensive use the connection pool to improve performance and better utilize system resources.
• A UCP JDBC connection pool can use any JDBC driver to create physical connections that are then maintained by the pool.
• The pool also leverages many high availability and performance features available through an Oracle Real Application Clusters (RAC) database. These features include Fast Connection Failover (FCF), run-time connection load balancing, and connection affinity.
• Documented in Oracle® Universal Connection Pool for JDBC Developer's Guide
Requirements for UCP
• JRE 1.5 or higher
• A JDBC driver or a connection factory class capable of returning a java.sql.Connection and javax.sql.XAConnection object
– Oracle drivers from releases 10.1 or higher are supported. Advanced Oracle Database features, such as Oracle RAC and Fast Connection Failover, require the Oracle Notification Service library (ons.jar) that is included with the Oracle Client software.
• The ucp.jar library must be included in the CLASSPATH of an application.
LBA Load Balancing Advisory
• The Load Balancing Advisory provides information to applications or clients about the current service levels that the Oracle RAC database instances are providing. (v$servicemetric.goodness)
• Load balancing advisory is integrated with the AWR. AWR measures response time and CPU consumption for each service
• The advice given by the LBA takes into account the power of the server and the current workload of the service
• Integrated with Oracle 11g JDBC, ODP.NET and OCI
• Applications can take advantage of the load balancing FAN events to direct work requests to the instance in the cluster that provides the best performance based on the workload management directives defined for that service.
• Configured by defining a service-level goal for the service. This enables the LBA for that service and the publication of FAN load balancing events.
• The listener can also use the load balancing advisory when it balances connection loads, if the LBA is enabled and clb_goal is set to SHORT for the service.
RLB Run-time Load Balancing
• RLB is a feature of Oracle connection pools that can distribute client work requests across the instances in an Oracle RAC, based on the LBA information. It allocates connections, based on the current performance levels. This provides load balancing at the transaction level.
• There are two types of service-level goals for Run-time Connection Load Balancing
– Service Time (SERVICE_TIME): Attempts to direct work requests to instances according to response time. Load balancing advisory data is based on elapsed time for work done in the service plus available bandwidth to the service. An example for the use of SERVICE_TIME is for workloads such as internet shopping where the rate of demand changes. (v$servicemetric.dbtimepercall)
– Throughput (THROUGHPUT): Attempts to direct work requests according to throughput. The load balancing advisory is based on the rate that work is completed in the service plus available bandwidth to the service. An example for the use of THROUGHPUT is for workloads such as batch processes, where the next job starts when the last job completes. (v$servicemetric.callspersec)
srvctl modify service -d DB -s app_srvc -B SERVICE_TIME -j SHORT
srvctl modify service -d DB -s batch_srvc -B THROUGHPUT -j LONG
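As a rough illustration of how a pool could act on load balancing advisory percentages, the sketch below picks an instance for the next work request with probability proportional to its advised share. The class name, the method name and the 70/30 split are hypothetical; UCP's real run-time load balancing algorithm is internal and is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch: weighted choice of an instance from LBA-style percentages. */
final class RlbSketch {

    /**
     * Picks an instance name with probability proportional to its advised share.
     * 'advice' maps instance name to a non-negative weight (for example a percentage).
     */
    static String pickInstance(Map<String, Integer> advice, Random rnd) {
        int total = 0;
        for (int weight : advice.values()) {
            if (weight < 0) {
                throw new IllegalArgumentException("negative weight");
            }
            total += weight;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("no usable advice");
        }
        int ticket = rnd.nextInt(total);            // 0 .. total-1
        for (Map.Entry<String, Integer> e : advice.entrySet()) {
            ticket -= e.getValue();
            if (ticket < 0) {
                return e.getKey();                  // the ticket fell into this instance's share
            }
        }
        throw new IllegalStateException("unreachable when total > 0");
    }

    public static void main(String[] args) {
        Map<String, Integer> advice = new LinkedHashMap<>();
        advice.put("Instance1", 70);                // assumed advice values
        advice.put("Instance2", 30);
        Random rnd = new Random();
        int toFirst = 0;
        for (int i = 0; i < 10_000; i++) {
            if (pickInstance(advice, rnd).equals("Instance1")) {
                toFirst++;
            }
        }
        System.out.println("Requests routed to Instance1: " + toFirst + " of 10000");
    }
}
```

Over many requests the share routed to each instance approaches its advised percentage, which is the effect the service-level goal is meant to produce at the transaction level.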
CLB Connection Load Balancing
• Provides load balancing at the time of the initial database connection
• Listener directs a connection request to the best instance currently providing the service
• For each service, you can define the method the listener uses for load balancing by setting the connection load balancing goal.
– SHORT -- Connection load balancing uses the Load Balancing Advisory when the Load Balancing Advisory is enabled (either goal_service_time or goal_throughput). When GOAL=NONE (LBA disabled), connection load balancing uses an abridged advice based on CPU utilization.
– LONG -- Balances the number of connections per instance using session count per service. This setting is recommended for applications with long connections such as forms.
• Controlled by clb_goal property of the Service
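The LONG goal can be pictured as "send the next connection to the instance with the fewest sessions for the service". The following is a minimal sketch of that idea only; the class and method names are hypothetical and the listener's real implementation is not shown in this deck.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of clb_goal=LONG: balance by session count per service. */
final class ClbLongSketch {

    /** Returns the instance with the fewest sessions; ties go to the first one in iteration order. */
    static String pickForLongGoal(Map<String, Integer> sessionsPerInstance) {
        if (sessionsPerInstance.isEmpty()) {
            throw new IllegalArgumentException("no instance offers the service");
        }
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : sessionsPerInstance.entrySet()) {
            if (best == null || e.getValue() < bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed starting session counts for one service on three instances.
        Map<String, Integer> sessions = new LinkedHashMap<>();
        sessions.put("Instance1", 4);
        sessions.put("Instance2", 1);
        sessions.put("Instance3", 2);
        for (int request = 1; request <= 6; request++) {
            String target = pickForLongGoal(sessions);
            sessions.put(target, sessions.get(target) + 1);   // the new connection lands there
            System.out.println("request " + request + " -> " + target + " " + sessions);
        }
    }
}
```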
Client-Side Load Balancing
• Client-side load balancing balances the connection requests across the listeners.
• Client-side load balancing is defined in client connection definition by setting the parameter LOAD_BALANCE=ON
• Oracle client randomly selects an address from the address list, and connects to that node's listener
• Client-side load balancing includes connection failover.
• LOAD_BALANCE is ON by default for DESCRIPTION_LIST only. This parameter is OFF by default for an address list within a DESCRIPTION. Setting it ON for a SCAN-based address implies that new connections will be randomly assigned to one of the 3 SCAN-based IP addresses resolved by DNS.
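The difference between LOAD_BALANCE=ON and OFF, combined with connect-time failover, can be modelled in a few lines. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, the random choice is modelled as a random permutation of the address list, and reachability is supplied by the caller. It is not Oracle Net code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

/** Hypothetical model of client-side address selection; not Oracle Net code. */
final class AddressOrderSketch {

    /** Listed order when LOAD_BALANCE is off, a random permutation when it is on. */
    static List<String> tryOrder(List<String> addresses, boolean loadBalance, Random rnd) {
        List<String> order = new ArrayList<>(addresses);
        if (loadBalance) {
            Collections.shuffle(order, rnd);
        }
        return order;
    }

    /** Connect-time failover: returns the first reachable address, or null if none answers. */
    static String firstReachable(List<String> order, Predicate<String> reachable) {
        for (String address : order) {
            if (reachable.test(address)) {
                return address;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> scans = Arrays.asList("scan1", "scan2", "scan3");
        // Assumption for the demo: scan1 is down, the other two answer.
        Predicate<String> up = a -> !a.equals("scan1");
        System.out.println("LOAD_BALANCE=off: " + firstReachable(tryOrder(scans, false, new Random()), up));
        System.out.println("LOAD_BALANCE=on:  " + firstReachable(tryOrder(scans, true, new Random()), up));
    }
}
```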
Client-Side Failover and Load Balancing
DB =
  (DESCRIPTION =
    (FAILOVER = on)
    (LOAD_BALANCE = off)
    (CONNECT_TIMEOUT = 5)
    (TRANSPORT_CONNECT_TIMEOUT = 2)
    (RETRY_COUNT = 2)
    (ADDRESS = (PROTOCOL = TCP)(HOST = scan1)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = scan2)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = scan3)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = myservice)
    )
  )
[Figure: each SCAN name resolves to 3 IP addresses, shown as 100.125.200.21, 100.125.200.22 and 100.125.200.23]
• Client-side load balancing and failover are handled on the client side.
• The FAILOVER option is set to ON, which is its default value.
• The connection is first attempted against the first SCAN address.
• Within the 3 IPs of that SCAN address, the first IP is tried; if it fails, the second IP is tried. Each try has a 2 sec TCP timeout (TRANSPORT_CONNECT_TIMEOUT) and a 5 sec overall connect timeout (CONNECT_TIMEOUT), and the 3 IPs are traversed 3 times: 1 time + 2 for RETRY_COUNT. This means that if all 3 SCAN IPs fail, it takes up to 2 * 3 * 3 = 18 sec before the next SCAN address is tried. If the TCP connection succeeds within 2 sec, the remaining time (CONNECT_TIMEOUT - TRANSPORT_CONNECT_TIMEOUT) is available to establish the connection to the instance.
• If the next SCAN address succeeds, the connection is established.
• If all subsequent addresses fail, all addresses are tried 2 more times. The overall number of tries is 3.
• Addresses are tried one by one in sequential order (LOAD_BALANCE=off). In this particular case there is also no load balancing between the 3 SCAN IPs; the client tries the first IP returned from DNS.
• To enable client-side load balancing, set LOAD_BALANCE=ON. The address is then chosen randomly from the 3 addresses, and the client also chooses randomly between the 3 SCAN IPs.
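The worst-case figure quoted above (2 * 3 * 3 = 18 sec) can be checked with a small calculation. The sketch below only restates the arithmetic from the slide; the class and method names are hypothetical, and the model assumes every TCP attempt runs into the full TRANSPORT_CONNECT_TIMEOUT.

```java
/** Restates the slide's timeout arithmetic; names are illustrative only. */
final class ConnectTimeSketch {

    /** Worst-case seconds spent on one SCAN name: TCP timeout x IPs x passes. */
    static int worstCaseSeconds(int transportConnectTimeoutSec, int ipsPerScanName, int retryCount) {
        int passes = 1 + retryCount;    // the first pass plus RETRY_COUNT retries
        return transportConnectTimeoutSec * ipsPerScanName * passes;
    }

    /** Seconds left to reach the instance if the TCP connect used its full timeout. */
    static int remainingAfterTcpSeconds(int connectTimeoutSec, int transportConnectTimeoutSec) {
        return connectTimeoutSec - transportConnectTimeoutSec;
    }

    public static void main(String[] args) {
        // Values from the connect descriptor: TRANSPORT_CONNECT_TIMEOUT=2, RETRY_COUNT=2, 3 SCAN IPs.
        System.out.println(worstCaseSeconds(2, 3, 2));        // prints 18
        // CONNECT_TIMEOUT=5 minus TRANSPORT_CONNECT_TIMEOUT=2.
        System.out.println(remainingAfterTcpSeconds(5, 2));   // prints 3
    }
}
```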
How it works together
[Figure: the Application uses UCP, which connects through the SCAN Listeners to service1 and service2 on Instance 1, Instance 2 and Instance 3. An ONS process on each node sends FAN LBA events to UCP.]
pds.setURL("jdbc:oracle:thin:@(DESCRIPTION= (LOAD_BALANCE=ON) (ADDRESS = …(host=db-scan)…) ... (CONNECT_DATA=(SERVICE_NAME=service1)))");
• UCP creates physical connections to the instances using the provided connect descriptor. Because LOAD_BALANCE=ON, client-side load balancing distributes new connection requests between the SCAN listeners (3 IPs).
• When a connection request arrives at the listener, the listener redirects it to the appropriate instance according to the service's clb_goal value; this is server-side connection load balancing. If clb_goal is SHORT and the LBA is enabled for the service, the listener uses the service's GOODNESS information, which it receives from the serving instances, to decide which instance should get the connection. If clb_goal is LONG, the listener balances connections by the number of sessions per service. If the number of physical connections in the pool is constant, clb_goal=LONG can be used with UCP; if this number is dynamic, clb_goal=SHORT must be used, because each new connection request from UCP must be redirected accurately according to the LBA advice and goal (the goal can be SERVICE_TIME or THROUGHPUT).
• ONS on each node periodically sends LBA FAN events to UCP. This way UCP, like the listener, is aware of the current service levels on each instance. Based on this information, the run-time load balancing mechanism distributes the workload between the instances throughout the life of the application.
How it works together
• Run-time load balancing and connection load balancing are related if the clb_goal of the service is set to SHORT:
– They both use the Load Balancing Advisory.
– They both use the same balancing goal, defined in the service definition with the -B option, i.e. SERVICE_TIME or THROUGHPUT.
• Using AWR data, the database calculates the GOODNESS for each service based on the run-time load balancing goal or the clb_goal for that service. The current GOODNESS number can be found in the V$SERVICEMETRIC.GOODNESS column.
• If clb_goal is:
– LONG, the LBA is not used for server-side load balancing; the GOODNESS column contains just the number of current sessions for this service in the current instance.
– SHORT, the LBA is used for server-side load balancing; GOODNESS is calculated based on the load balancing goal, SERVICE_TIME or THROUGHPUT.
• If clb_goal is SHORT and the LBA is not enabled (-B NONE), the listener considers the node load in order to equalize CPU usage when distributing connections.
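The combinations described above can be summarised as a small decision function. The enum and method names are hypothetical and serve only to restate the rules from the slide; they do not correspond to any Oracle API.

```java
/** Restates which input the listener balances on for each goal combination (illustrative names). */
final class ListenerBasisSketch {

    enum ClbGoal { SHORT, LONG }

    enum RlbGoal { NONE, SERVICE_TIME, THROUGHPUT }

    enum Basis { SESSION_COUNT, LOAD_BALANCING_ADVISORY, CPU_UTILIZATION }

    static Basis basisFor(ClbGoal clbGoal, RlbGoal rlbGoal) {
        if (clbGoal == ClbGoal.LONG) {
            return Basis.SESSION_COUNT;               // LBA is not used for connection balancing
        }
        if (rlbGoal == RlbGoal.NONE) {
            return Basis.CPU_UTILIZATION;             // SHORT without LBA: equalize CPU usage
        }
        return Basis.LOAD_BALANCING_ADVISORY;         // SHORT with SERVICE_TIME or THROUGHPUT
    }

    public static void main(String[] args) {
        for (ClbGoal clb : ClbGoal.values()) {
            for (RlbGoal rlb : RlbGoal.values()) {
                System.out.println("clb_goal=" + clb + ", goal=" + rlb + " -> " + basisFor(clb, rlb));
            }
        }
    }
}
```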
FCF Fast Connection Failover
• FCF is designed for fast instance and database failover and switchover with Oracle RAC and Oracle Data Guard.
• FCF receives FAN availability events and immediately clears affected connections from the pool.
• Requires an Oracle JDBC driver for Java applications and an Oracle RAC database or Oracle Restart.
• Can be used with session and connection pools of OCI applications
• It was introduced as part of the pooling feature "Implicit Connection Cache", which is available from JDBC 10g
• Starting from 11gR2, Implicit Connection Cache is deprecated in favor of UCP
• UCP must now be used to benefit from FCF and RLB.
• FCF supports planned (instance relocation or shutdown in a RAC database) and unplanned outages
• Application logic must be used to make outages transparent to the end users.
FCF
• Planned outage
– Stale borrowed connections are marked and removed after they are returned to the pool
– On-going transactions proceed to complete
• Unplanned outage
– Detect and remove stale connections from pool
– Borrowed connections are immediately aborted and closed
– On-going transactions immediately receive an exception
• FCF supports RAC database, Data Guard and Single Instance with Oracle Restart, they all can publish FAN messages
• Set oracle.net.ns.SQLnetDef.TCP_CONNTIMEOUT_STR property in milliseconds.
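The different treatment of planned and unplanned outages can be made concrete with a toy pool. This is a sketch of the behaviour listed above, not of UCP itself: all class, field and method names are hypothetical, and real connections are replaced by plain objects with flags.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of a pool reacting to FAN DOWN events; it does not use or mimic UCP classes. */
final class FcfPoolSketch {

    static final class Conn {
        final String instance;
        boolean borrowed;
        boolean markedStale;
        boolean closed;

        Conn(String instance, boolean borrowed) {
            this.instance = instance;
            this.borrowed = borrowed;
        }
    }

    private final List<Conn> conns = new ArrayList<>();

    Conn add(String instance, boolean borrowed) {
        Conn c = new Conn(instance, borrowed);
        conns.add(c);
        return c;
    }

    int size() {
        return conns.size();
    }

    /** Planned DOWN: idle connections to the instance are closed now, borrowed ones are only marked. */
    void onPlannedDown(String instance) {
        for (Iterator<Conn> it = conns.iterator(); it.hasNext();) {
            Conn c = it.next();
            if (!c.instance.equals(instance)) {
                continue;
            }
            if (c.borrowed) {
                c.markedStale = true;       // the running transaction may complete
            } else {
                c.closed = true;
                it.remove();
            }
        }
    }

    /** Unplanned DOWN: every connection to the instance is closed at once, borrowed or not. */
    void onUnplannedDown(String instance) {
        for (Iterator<Conn> it = conns.iterator(); it.hasNext();) {
            Conn c = it.next();
            if (c.instance.equals(instance)) {
                c.closed = true;            // the application sees an exception on next use
                it.remove();
            }
        }
    }

    /** The application returns a connection; a marked one is closed instead of being reused. */
    void giveBack(Conn c) {
        c.borrowed = false;
        if (c.markedStale && !c.closed) {
            c.closed = true;
            conns.remove(c);
        }
    }

    public static void main(String[] args) {
        FcfPoolSketch pool = new FcfPoolSketch();
        Conn inUse = pool.add("Instance1", true);
        pool.add("Instance1", false);
        pool.add("Instance2", false);
        pool.onPlannedDown("Instance1");
        System.out.println("after planned DOWN: size=" + pool.size() + ", borrowed closed=" + inUse.closed);
        pool.giveBack(inUse);
        System.out.println("after return:       size=" + pool.size() + ", borrowed closed=" + inUse.closed);
    }
}
```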
FCF planned outage
[Figure: the Application borrows connections from UCP; the pool holds connections to Service on Instance 1 (Node 1) and Instance 2 (Node 2). Each node runs ONS and the nodes are joined by the interconnect.]
• The application uses UCP; there are 9 physical connections in the pool
• Connections are distributed between 2 RAC instances
Now execute:
srvctl stop service -d DB -s Service -i Instance1
• Service on Instance 1 went down, evmd publishes service DOWN event
• ONS publishes FAN availability event about service DOWN on Instance 1
• UCP receives the FAN event and immediately marks the borrowed connections to Instance 1 to be cleared. Connections that are not borrowed are cleared and, if needed, reestablished to the available instance
• The physical connections are still there because borrowed connections are still in use. This is possible because a normal service shutdown does not disconnect already active connections; it is up to the client (UCP) to decide when to disconnect.
• As soon as the application closes a borrowed connection, UCP clears it
• If the pool falls to its minimum size, a new connection is established immediately to the available instance
• After the service on Node 1 is started again, new connections are placed on it by server-side load balancing (SLB)
FCF unplanned outage
[Figure: the Application borrows connections from UCP; the pool holds connections to Service on Instance 1 (Node 1) and Instance 2 (Node 2). Each node runs ONS and the nodes are joined by the interconnect.]
• The application uses UCP; there are 9 physical connections in the pool
• Connections are distributed between 2 RAC instances
• Node 1 fails, evmd publishes DOWN event
• Connections to Instance 1 fall into a TCP retransmission cycle and stay in that state until the TCP timeout expires, which can take several minutes, but …
• ONS will distribute DOWN event immediately
• UCP receives the DOWN event and immediately breaks the affected connections out of their TCP timeouts by disconnecting the physical connections
• The application immediately receives an error. All uncommitted work has already been rolled back by Instance 2, so the application does not need to execute a rollback.
• The application must:
– Retry the connection request, because the old one is no longer open
– Replay the transaction
Usage model of UCP/FCF
1. Get connection from the pool
2. Perform activity on it
3. Get exception from failure of some component
4. Check with the isValid() method whether the connection is still valid
5. If not, reconnect and recover lost actions
For information about how to configure UCP in your java app refer to: Oracle® Universal Connection Pool for JDBC Developer's Guide 11g Release 2 (11.2)
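The five steps of the usage model can be sketched against the standard javax.sql.DataSource interface, which keeps the example free of Oracle-specific classes. The class name, the Work interface, the attempt limit and the 2 second validity timeout are assumptions made for illustration; with UCP the pool argument would presumably be the application's PoolDataSource, which the developer's guide describes.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

/** Hypothetical retry wrapper following the five steps of the usage model. */
final class FcfRetrySketch {

    /** A unit of work that can be replayed from the start on a fresh connection. */
    interface Work<T> {
        T run(Connection connection) throws SQLException;
    }

    static <T> T runWithRetry(DataSource pool, Work<T> work, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Connection conn = pool.getConnection();             // 1. get connection from the pool
            try {
                conn.setAutoCommit(false);
                T result = work.run(conn);                      // 2. perform activity on it
                conn.commit();
                return result;
            } catch (SQLException e) {                          // 3. exception from a failed component
                last = e;
                boolean stillValid = false;
                try {
                    stillValid = conn.isValid(2);               // 4. is the connection still valid?
                } catch (SQLException ignored) {
                    // treat a failing validity check as "not valid"
                }
                if (stillValid) {
                    conn.rollback();                            // ordinary SQL error: do not retry
                    throw e;
                }
                // 5. connection is dead: loop to reconnect and replay the work
            } finally {
                try {
                    conn.close();                               // return the connection to the pool
                } catch (SQLException ignored) {
                    // a dead connection may fail to close cleanly
                }
            }
        }
        throw last;
    }
}
```

A call such as runWithRetry(pool, c -> doTransfer(c), 3), where doTransfer is an application method invented for this example, replays the whole unit of work on a new connection. This is only safe when the work is written so that it can be replayed.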
FCF with Data Guard failover
1. The primary site is lost! Connections fall into a hung state.
2. After the failover completes, the respective database services start and the DG Broker publishes a FAN availability event
3. FCF breaks the connections out of the TCP timeout, clears stale connections and throws an error to the application
4. The application retries the connection and replays lost transactions, if any
FCF not needed for DG switchover
• For a DG switchover FCF is not needed, because its primary role is to break connections out of TCP timeouts, which is not the case when a planned switchover occurs.
• Switchover steps:
1. The primary converts to a physical standby and disconnects all sessions
2. Client sessions receive ORA-3113 and begin going through their retry logic (TAF for OCI and application logic for JDBC)
3. The standby is converted to the primary database
4. As the new primary opens, the respective services are started; clients now see the services as available and connect, replaying lost actions if any.
TAF Transparent Application Failover
• Client side feature of the OCI driver
• Transparently fails over read-only sessions
• Can use FAN events distributed by Streams AQ
• Does not restore session state (ALTER SESSION)
• Does not support DML
• Provides callback functions to manage failover steps
• Can be configured on client as well as on the server side using database Services
TAF Transparent Application Failover
• To use FAN with OCI, the following conditions must be met:
– Initialize the OCI environment in OCI_EVENTS mode
– Connect to a service that has AQ HA notifications enabled
– Link with a thread library
• TAF has 2 failover types:
– SESSION: new sessions are reestablished by TAF, but select operations are not recovered
– SELECT: new sessions are reestablished and users with open cursors can continue fetching after failover. This involves overhead on the client side during normal select operations
TAF Transparent Application Failover
• Sessions with active update transactions (UPDATE, INSERT, DELETE) at the time of the failure:
– Will be reconnected to a new session
– Uncommitted transactions will be rolled back
– Error message will be returned to the application, stating that a rollback must be issued
– Application must rollback and reissue the transaction
• TAF also provides the ability, with the RETRIES and DELAY parameters, to automatically retry re-connecting on failover
• Example of TAF configured service creation:
srvctl add service -d DB -s taf_service -q TRUE -e SESSION -m BASIC -w 10 -z 50
• "-q TRUE" enables AQ HA notifications
• "-e SESSION" sets the failover type to SESSION
• "-m BASIC" sets the failover method to BASIC
• "-w 10" sets the failover delay to 10 sec
• "-z 50" sets the failover retries to 50
Oracle 12c Application Continuity
• Restores full session including all states, cursors, variables and last transaction if there was any.
• Supports planned and unplanned outages
• Performed automatically, minimal application change
• Supported for Oracle RAC, Data Guard, Active Data Guard and WebLogic Server in conjunction with the JDBC Thin Driver or the UCP.
• It applies only to JDBC Thin connections (JDBC OCI is not supported).
• Requires JDBC Replay driver
• Service properties: FAILOVER_TYPE=TRANSACTION, COMMIT_OUTCOME=TRUE, NOTIFICATION=TRUE
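Oracle 12c Application Continuity
[Figure: the Application borrows connections from UCP through the JDBC Replay Driver, which is labelled with a Replay Context and an LTXID. The pool connects to Service on Instance 1 (Node 1) and Instance 2 (Node 2); each node runs ONS, each instance is labelled with a Continuity Directory and an LTXID, and the nodes are joined by the interconnect.]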