HA, SRX Cluster & Redundancy Groups

Date post: 24-May-2015
Upload: kashif-latif
HA, SRX Cluster & Redundancy Groups Prepared By: Kashif Latif Muhammad Bilal
Transcript
Page 1: HA, SRX Cluster & Redundancy Groups

HA, SRX Cluster & Redundancy Groups

Prepared By: Kashif Latif, Muhammad Bilal

Page 2: HA, SRX Cluster & Redundancy Groups

High Availability Cluster

High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of down-time.

They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail.

Page 3: HA, SRX Cluster & Redundancy Groups

Uses of HA Cluster

HA clusters are often used for:

1. Critical Databases
2. File Sharing on a Network
3. Business Applications
4. Customer Services such as electronic commerce websites

Page 4: HA, SRX Cluster & Redundancy Groups

Cluster Monitoring

HA clusters usually use a heartbeat private network connection which is used to monitor the health and status of each node in the cluster.
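
Heartbeat monitoring can be sketched in a few lines. The following is a minimal illustration only, not how any particular cluster product implements it: the class name `HeartbeatMonitor`, the node names, and the 3-second timeout are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (hypothetical helper)."""

    timeout: float  # seconds of silence before a node is considered failed
    last_seen: dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list[str]:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node0", now=0.0)
monitor.record("node1", now=0.0)
monitor.record("node0", now=2.5)      # node1 stays silent after t=0
print(monitor.failed_nodes(now=4.0))  # only node1 has exceeded the timeout
```

Timestamps are passed in explicitly rather than read from the system clock so that the behaviour is easy to reason about and test.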

Page 5: HA, SRX Cluster & Redundancy Groups

Application Design Requirements

In order to run in a high-availability cluster environment, an application must satisfy at least the following technical requirements:

There must be a relatively easy way to start, stop, force-stop, and check the status of the application.

The application must be able to use shared storage.

Page 6: HA, SRX Cluster & Redundancy Groups

Cont…

Ability to restart on another node at the last state before failure using the saved state from the shared storage.

The application must not corrupt data if it crashes, or restarts from the saved state.

Page 7: HA, SRX Cluster & Redundancy Groups

Node Configurations

Cluster node configurations can usually be categorized into one of the following models:

1. Active/active
2. Active/passive
3. N+1
4. N+M
5. N-to-1
6. N-to-N

Page 8: HA, SRX Cluster & Redundancy Groups

Active/active

Traffic intended for the failed node is either passed onto an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes utilize a homogeneous software configuration.

Page 9: HA, SRX Cluster & Redundancy Groups

Active/passive

Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails. This configuration typically requires the most extra hardware.

Page 10: HA, SRX Cluster & Redundancy Groups

Node Reliability

HA clusters usually utilize all available techniques to make the individual systems and shared infrastructure as reliable as possible. These include:

1. Disk Mirroring
2. Redundant Network
3. Redundant Storage Area Network
4. Redundant Electrical Power
5. Redundant Power Supply Units

Page 11: HA, SRX Cluster & Redundancy Groups

Failover Strategies

Systems that handle failures in distributed computing use different strategies to recover from a failure. For instance, the Apache Cassandra API Hector defines three ways to configure a failover:

1. FAIL_FAST: The try fails if the first node cannot be reached.

2. ON_FAIL_TRY_ONE_NEXT_AVAILABLE: Tries one more host before giving up

3. ON_FAIL_TRY_ALL_AVAILABLE: Tries all existing nodes before giving up
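
The three policies above differ only in how many hosts are tried before giving up. The sketch below illustrates that idea in Python. It is not the original client API: the helper names `max_attempts`, `call_with_failover`, and `flaky`, and the host names, are hypothetical.

```python
from enum import Enum, auto
from typing import Callable, Sequence


class FailoverPolicy(Enum):
    FAIL_FAST = auto()
    ON_FAIL_TRY_ONE_NEXT_AVAILABLE = auto()
    ON_FAIL_TRY_ALL_AVAILABLE = auto()


def max_attempts(policy: FailoverPolicy, host_count: int) -> int:
    """How many hosts a policy allows us to try in total."""
    if policy is FailoverPolicy.FAIL_FAST:
        return min(1, host_count)
    if policy is FailoverPolicy.ON_FAIL_TRY_ONE_NEXT_AVAILABLE:
        return min(2, host_count)
    return host_count


def call_with_failover(
    hosts: Sequence[str],
    operation: Callable[[str], str],
    policy: FailoverPolicy,
) -> str:
    """Run `operation` against hosts in order until one succeeds or the policy gives up."""
    attempts = max_attempts(policy, len(hosts))
    last_error = None
    for host in hosts[:attempts]:
        try:
            return operation(host)
        except ConnectionError as exc:  # treat this as "node cannot be reached"
            last_error = exc
    raise ConnectionError(f"gave up after {attempts} attempt(s)") from last_error


def flaky(host: str) -> str:
    """Demo operation: only host 'c' is reachable."""
    if host != "c":
        raise ConnectionError(f"{host} unreachable")
    return f"served by {host}"


hosts = ["a", "b", "c"]
print(call_with_failover(hosts, flaky, FailoverPolicy.ON_FAIL_TRY_ALL_AVAILABLE))
```

With the same `hosts` and `flaky` operation, `FAIL_FAST` and `ON_FAIL_TRY_ONE_NEXT_AVAILABLE` both raise `ConnectionError`, because the only reachable host is the third one.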

Page 12: HA, SRX Cluster & Redundancy Groups

What is SRX Cluster…?

SRX Cluster provides network node redundancy by grouping a pair of the same kind of supported SRX Series devices or J Series devices into a cluster.

The devices must be running the same version of Junos OS.

Page 13: HA, SRX Cluster & Redundancy Groups

SRX Cluster Example

Page 14: HA, SRX Cluster & Redundancy Groups

SRX Planes

The SRX has separated planes. Depending on the SRX platform architecture, the separation varies from being separate processes running on separate cores to completely physically differentiated subsystems.

1. Control Plane
2. Data Plane

Page 15: HA, SRX Cluster & Redundancy Groups

Control Plane

The control plane is used in HA to synchronize the kernel state between the two Routing Engines (REs).

It also provides a path between the two devices to send hello messages between them.

The two devices’ control planes talk to each other over a control link. This link is reserved for control plane communication.

Page 16: HA, SRX Cluster & Redundancy Groups

Cont…

The control plane is always in an active/backup state. This means only one RE can be the master over the cluster’s configuration and state.

This ensures that there is only one ultimate truth over the state of the cluster. If the primary RE fails, the secondary takes over for it.

Creating an active/active control plane makes synchronization more difficult because many checks would need to be put in place to validate which RE is right.

Page 17: HA, SRX Cluster & Redundancy Groups

Control Plane States

Page 18: HA, SRX Cluster & Redundancy Groups

Data Plane

The data plane’s responsibility in the SRX is to pass data and process it based on the administrator’s configuration.

All session and service states are maintained on the data plane.

The REs and/or control plane are not responsible for maintaining state.

Page 19: HA, SRX Cluster & Redundancy Groups

Responsibilities of Data Plane

The data plane has a few responsibilities when it comes to HA implementation.

First and foremost is state synchronization. The state of sessions and services is shared between the two devices.

Sessions are the state of the current set of traffic that is going through the SRX, and services are other items such as:

1. VPN
2. IPS
3. ALGs

Page 20: HA, SRX Cluster & Redundancy Groups

Chassis Cluster

An SRX cluster implements a concept called chassis cluster. A chassis cluster takes the two SRX devices and represents them as a single device.

The interfaces are numbered consecutively across the cluster: the count starts on the first chassis and ends on the second chassis.

Page 21: HA, SRX Cluster & Redundancy Groups

Chassis Cluster Numbering

Page 22: HA, SRX Cluster & Redundancy Groups

Chassis Cluster Functionality

1. Resilient system architecture, with a single active control plane for the entire cluster and multiple Packet Forwarding Engines. This architecture presents a single device view of the cluster.

2. Synchronization of configuration and dynamic runtime states between nodes within a cluster.

3. Monitoring of physical interfaces, and failover if the failure parameters cross a configured threshold.

Page 23: HA, SRX Cluster & Redundancy Groups

States of Cluster

The different states that a cluster can be in at any given instant are as follows:

1. Hold
2. Primary
3. Secondary-Hold
4. Secondary
5. Ineligible
6. Disabled

A state transition can be triggered because of any event, such as interface monitoring, SPU monitoring, failures, and manual failovers.

Page 24: HA, SRX Cluster & Redundancy Groups

Chassis Cluster Formation

To form a chassis cluster, a pair of the same kind of supported SRX Series devices or J Series devices are combined to act as a single system that enforces the same overall security.

You can deploy up to 15 chassis clusters in a Layer 2 domain.

Page 25: HA, SRX Cluster & Redundancy Groups

Identification of Clusters

Clusters and nodes are identified in the following way:

A cluster is identified by a cluster ID (cluster-id) specified as a number from 1 through 15.

A cluster node is identified by a node ID (node) specified as a number from 0 to 1.

Page 26: HA, SRX Cluster & Redundancy Groups

Redundancy Groups

A redundancy group is an abstract construct that includes and manages a collection of objects. A redundancy group contains objects on both nodes.

A redundancy group is primary on one node and backup on the other at any time.

We can create up to 128 redundancy groups.

Page 27: HA, SRX Cluster & Redundancy Groups

Example of Redundancy Groups

Page 28: HA, SRX Cluster & Redundancy Groups

Primacy of Redundancy Group

Three things determine the primacy of a redundancy group:

1. The priority configured for the node
2. The node ID (in case of tied priorities)
3. The order in which the node comes up.

If a lower priority node comes up first, then it will assume the primacy for a redundancy group (and will stay as primary if preempt is not enabled).
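
The interplay of priority, node ID, boot order, and preempt can be sketched as follows. This is an illustrative model only, under two labelled assumptions: a higher priority value is preferred, and on tied priorities the lower node ID wins (the text does not say which node ID is favoured). The names `Node`, `preferred`, and `primary_after_boot` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int   # 0 or 1
    priority: int  # assumption: higher value is preferred


def preferred(a: Node, b: Node) -> Node:
    """Node favoured when both come up together: priority first, then node ID.

    Assumption: on tied priorities the lower node ID wins.
    """
    return max(a, b, key=lambda n: (n.priority, -n.node_id))


def primary_after_boot(first_up: Node, second_up: Node, preempt: bool) -> Node:
    """Node that holds primacy once both nodes are up, given the boot order."""
    # The first node to come up assumes primacy, whatever its priority.
    if preempt and second_up.priority > first_up.priority:
        return second_up  # preempt lets the higher-priority node take over
    return first_up


node0 = Node(node_id=0, priority=200)
node1 = Node(node_id=1, priority=100)

print(primary_after_boot(node1, node0, preempt=False).node_id)  # lower-priority node stays primary
print(primary_after_boot(node1, node0, preempt=True).node_id)   # higher-priority node takes over
```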

Page 29: HA, SRX Cluster & Redundancy Groups

Redundancy Group Monitoring

For a redundancy group to fail over automatically to the other node, it must monitor the following components of the chassis cluster:

1. Interface Monitoring
2. IP Address Monitoring
3. Monitoring of Global-Level Objects
   i. SPU Monitoring
   ii. Flowd Monitoring
   iii. Cold-Sync Monitoring

Page 30: HA, SRX Cluster & Redundancy Groups

Chassis Cluster Redundancy Group Failover

A redundancy group is a collection of objects that fail over as a group. Each redundancy group monitors a set of objects (physical interfaces), and each monitored object is assigned a weight.

Each redundancy group has an initial threshold of 255.

Page 31: HA, SRX Cluster & Redundancy Groups

Cont…

When a monitored object fails, the weight of the object is subtracted from the threshold value of the redundancy group.

When the threshold value reaches zero, the redundancy group fails over to the other node. As a result, all the objects associated with the redundancy group fail over as well.
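
The weight arithmetic can be illustrated with a small model. This is a sketch under stated assumptions, not Junos code: the class name `RedundancyGroupMonitor`, the interface names, and the weights of 130 are invented for the example, and clamping the threshold at zero is an assumption (the text only says failover happens when the value reaches zero).

```python
class RedundancyGroupMonitor:
    """Tracks the remaining threshold of one redundancy group (illustrative model)."""

    INITIAL_THRESHOLD = 255  # initial value stated in the text

    def __init__(self, weights: dict[str, int]) -> None:
        self.weights = dict(weights)   # monitored object -> configured weight
        self.failed: set[str] = set()  # objects currently marked as failed

    def mark_failed(self, name: str) -> None:
        if name not in self.weights:
            raise KeyError(f"{name} is not monitored")
        self.failed.add(name)

    @property
    def threshold(self) -> int:
        lost = sum(self.weights[name] for name in self.failed)
        return max(0, self.INITIAL_THRESHOLD - lost)  # clamp at zero (assumption)

    def should_fail_over(self) -> bool:
        return self.threshold == 0


rg1 = RedundancyGroupMonitor({"ge-0/0/1": 130, "ge-0/0/2": 130})
rg1.mark_failed("ge-0/0/1")
print(rg1.threshold, rg1.should_fail_over())  # 125 False
rg1.mark_failed("ge-0/0/2")
print(rg1.threshold, rg1.should_fail_over())  # 0 True
```

With these weights a single interface failure leaves the group in place, while losing both interfaces exhausts the threshold and triggers failover. A weight of 255 on one object would make that object alone sufficient.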

Page 32: HA, SRX Cluster & Redundancy Groups

Cont…

Because back-to-back redundancy group failovers that occur too quickly can cause a cluster to exhibit unpredictable behavior, a dampening time between failovers is needed.

The default dampening time for redundancy group 0 is 300 seconds (5 minutes) and is configurable up to 1800 seconds with the hold-down-interval statement.

Page 33: HA, SRX Cluster & Redundancy Groups

Cont…

Redundancy groups x (redundancy groups numbered 1 through 128) have a default dampening time of 1 second, with a range of 0 through 1800 seconds.

The hold-down interval affects manual failovers, as well as automatic failovers associated with monitoring failures.
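
The effect of the hold-down interval can be sketched as a simple timer check. The defaults (300 seconds for redundancy group 0, 1 second for the others) and the 1800-second upper bound come from the text. The class name `FailoverDampener` is hypothetical, and applying a lower bound of 0 to every group is an assumption, since the text states that lower bound only for redundancy groups x.

```python
from typing import Optional

MAX_HOLD_DOWN = 1800  # seconds, upper bound stated in the text


def default_hold_down(rg_id: int) -> int:
    """Default dampening time: 300 s for redundancy group 0, 1 s otherwise."""
    return 300 if rg_id == 0 else 1


class FailoverDampener:
    """Rejects failovers that follow the previous one too quickly (illustrative)."""

    def __init__(self, rg_id: int, hold_down: Optional[int] = None) -> None:
        interval = default_hold_down(rg_id) if hold_down is None else hold_down
        if not 0 <= interval <= MAX_HOLD_DOWN:  # lower bound of 0 is an assumption for RG0
            raise ValueError("hold-down interval must be between 0 and 1800 seconds")
        self.interval = interval
        self.last_failover: Optional[float] = None

    def try_failover(self, now: float) -> bool:
        """Return True and record the failover if the hold-down time has elapsed."""
        if self.last_failover is not None and now - self.last_failover < self.interval:
            return False
        self.last_failover = now
        return True


rg1 = FailoverDampener(rg_id=1)
print(rg1.try_failover(now=0.0))  # first failover is allowed
print(rg1.try_failover(now=0.5))  # too soon, still inside the 1-second hold-down
print(rg1.try_failover(now=1.0))  # hold-down has elapsed
```

The same check applies whether the failover was requested manually or triggered by a monitoring failure, which matches the statement that the interval affects both.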

Page 34: HA, SRX Cluster & Redundancy Groups

Chassis Cluster Redundancy Group Manual Failover

We can initiate a redundancy group x failover manually. A manual failover applies until a failback event occurs.

You can also initiate a redundancy group 0 failover manually if you want to change the primary node for redundancy group 0.

Page 35: HA, SRX Cluster & Redundancy Groups

State Transitions Cases

There are three transition cases:

1. Reboot case—The node in the secondary-hold state transitions to the primary state; the other node goes dead (inactive).

Page 36: HA, SRX Cluster & Redundancy Groups

Cont…

2. Control link failure case—The node in the secondary-hold state transitions to the ineligible state and then to a disabled state; the other node transitions to the primary state.

3. Fabric link failure case—The node in the secondary-hold state transitions directly to the disabled state.
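
The three cases can be written down as a lookup table from the triggering event to the states a node in secondary-hold passes through. This is only a restatement of the list above in code form. The event labels and the function name `path_from_secondary_hold` are hypothetical, and the behaviour of the other node is not modelled.

```python
from enum import Enum


class State(Enum):
    HOLD = "hold"
    PRIMARY = "primary"
    SECONDARY_HOLD = "secondary-hold"
    SECONDARY = "secondary"
    INELIGIBLE = "ineligible"
    DISABLED = "disabled"


# Event label (hypothetical) -> states visited after secondary-hold.
TRANSITIONS = {
    "reboot": [State.PRIMARY],
    "control-link-failure": [State.INELIGIBLE, State.DISABLED],
    "fabric-link-failure": [State.DISABLED],
}


def path_from_secondary_hold(event: str) -> list[State]:
    """Full state path of the node that was in secondary-hold when `event` occurred."""
    return [State.SECONDARY_HOLD, *TRANSITIONS[event]]


for event in TRANSITIONS:
    print(event, "->", " -> ".join(s.value for s in path_from_secondary_hold(event)))
```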

Page 37: HA, SRX Cluster & Redundancy Groups

SNMP Failover Traps

Chassis clustering supports SNMP traps, which are triggered whenever there is a redundancy group failover.

The trap message can help you troubleshoot failovers. It contains the following information:

1. The cluster ID and node ID
2. The reason for the failover
3. The redundancy group that is involved in the failover
4. The redundancy group’s previous state and current state

Page 38: HA, SRX Cluster & Redundancy Groups

Chassis Cluster Interfaces

A network device does not help a network unless it participates in traffic processing.

An SRX has two different interface types that it can use to process traffic:

1. Reth Interface
2. Local Interface

Page 39: HA, SRX Cluster & Redundancy Groups

Reth Interface

A Reth (redundant Ethernet) interface is a Junos aggregate Ethernet interface with special properties compared to a traditional aggregate Ethernet interface.

The Reth allows the administrator to add one or more child links per chassis.

Page 40: HA, SRX Cluster & Redundancy Groups

Reth MAC Address

The MAC address for the Reth is based on a combination of the cluster ID and the Reth number.

Page 41: HA, SRX Cluster & Redundancy Groups

Cont…

In the figure the first four of the six bytes are fixed. They do not change between cluster deployments.

The last two bytes vary based on the cluster ID and the Reth index.
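
Because the figure is not reproduced in this transcript, the exact byte layout is not available here. The sketch below shows one way such a derivation could work, and every concrete value in it is an assumption: the fixed prefix `00:10:db:ff`, the placement of the cluster ID in the upper four bits of the fifth byte, and the use of the Reth index as the sixth byte. The function name `reth_mac` is hypothetical.

```python
# Assumed fixed first four bytes; the original figure is not available in this transcript.
FIXED_PREFIX = bytes([0x00, 0x10, 0xDB, 0xFF])


def reth_mac(cluster_id: int, reth_index: int) -> str:
    """Build a Reth MAC address from the cluster ID and Reth index (assumed layout)."""
    if not 1 <= cluster_id <= 15:
        raise ValueError("cluster ID must be between 1 and 15")
    if not 0 <= reth_index <= 255:
        raise ValueError("Reth index must fit in one byte")
    fifth_byte = cluster_id << 4  # assumption: cluster ID in the upper 4 bits, lower bits zero
    mac = FIXED_PREFIX + bytes([fifth_byte, reth_index])
    return ":".join(f"{byte:02x}" for byte in mac)


print(reth_mac(cluster_id=1, reth_index=0))
print(reth_mac(cluster_id=2, reth_index=1))
```

Whatever the exact layout, two clusters in the same Layer 2 domain that share a cluster ID would produce identical Reth MAC addresses, which is why cluster IDs must be unique within that domain.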

Page 42: HA, SRX Cluster & Redundancy Groups

Local Interface

A local interface is an interface that is configured local to a specific node.

This method of configuration on an interface is the same method of configuration on a standalone device.

Page 43: HA, SRX Cluster & Redundancy Groups

Cont…

The significance of a local interface in an SRX cluster is that it does not have a backup interface on the other chassis, meaning that it is part of neither a Reth nor a redundancy group.

If this interface were to fail, its IP address would not fail over to the other node.

Page 44: HA, SRX Cluster & Redundancy Groups

Troubleshooting the Cluster

There are various methods that show the administrator how to troubleshoot a chassis cluster:

1. Identify the Cluster Status
2. Checking Interfaces
3. Verifying the Data Plane
4. Core Dumps
5. The Dreaded Priority Zero

Page 45: HA, SRX Cluster & Redundancy Groups

Thank You

From: Kashif Latif, Muhammad Bilal

