Cluster Failover

What is Failover?

Failover is a cluster redundancy operation that automatically occurs if a Cluster Member is not functional. When this occurs, other Cluster Members take over for the failed Cluster Member.

In a High Availability mode:

If the Active Cluster Member detects that it cannot function as a Cluster Member, it notifies the peer Standby Cluster Members that it must go down. One of the Standby Cluster Members (with the next highest priority) will promote itself to the Active state.
If one of the Standby Cluster Members stops receiving Cluster Control Protocol (CCP) packets from the current Active Cluster Member, that Standby Cluster Member can assume that the current Active Cluster Member failed. As a result, one of the Standby Cluster Members (with the next highest priority) will promote itself to the Active state.
If you do not use State Synchronization in the cluster, existing connections are interrupted when cluster failover occurs.
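
The High Availability behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not ClusterXL code: the `Member` class, the state strings, and the convention that a lower `priority` number means a higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    priority: int  # hypothetical convention: 1 is the highest priority
    state: str     # "ACTIVE", "STANDBY", or "DOWN"


def fail_over(members, failed_name):
    """Mark one member as DOWN and return the member that is Active afterwards."""
    for m in members:
        if m.name == failed_name:
            m.state = "DOWN"
    # If an Active member is still working, nothing needs to be promoted.
    for m in members:
        if m.state == "ACTIVE":
            return m
    # Otherwise the Standby member with the next highest priority takes over.
    standby = [m for m in members if m.state == "STANDBY"]
    if not standby:
        return None
    new_active = min(standby, key=lambda m: m.priority)
    new_active.state = "ACTIVE"
    return new_active
```

Calling `fail_over` with the name of the current Active member promotes the best remaining Standby member. Calling it with the name of a Standby member leaves the Active member unchanged.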

In a Load Sharing mode:

If a Cluster Member detects that it cannot function as a Cluster Member, it notifies the peer Cluster Members that it must go down. Traffic load will be redistributed between the working Cluster Members.
If the Cluster Members stop receiving Cluster Control Protocol (CCP) packets from one of their peer Cluster Members, those working Cluster Members can assume that their peer Cluster Member failed. As a result, traffic load will be redistributed between the working Cluster Members.
Because all Cluster Members are always synchronized by design, current connections are not interrupted when cluster failover occurs.
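
To make the idea of redistributing load concrete, here is a minimal Python sketch. It does not reproduce how ClusterXL actually decides which Cluster Member handles a connection. The modulo assignment, the function name `distribute`, and the integer connection IDs are assumptions chosen only to show that the same traffic is spread over fewer members after a failure.

```python
def distribute(connection_ids, working_members):
    """Assign each connection to one working member (illustrative scheme only)."""
    if not working_members:
        raise ValueError("no working Cluster Members left")
    shares = {name: [] for name in working_members}
    for cid in connection_ids:
        # Hypothetical rule: pick the owner by connection ID modulo member count.
        owner = working_members[cid % len(working_members)]
        shares[owner].append(cid)
    return shares


before = distribute(range(6), ["A", "B", "C"])  # three working members
after = distribute(range(6), ["A", "B"])        # member "C" has failed
```

In this model, `before` gives each of the three members two connections, and `after` gives each of the two remaining members three connections.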

To tell each Cluster Member that the other Cluster Members are alive and functioning, the ClusterXL Cluster Control Protocol (CCP) maintains a heartbeat between Cluster Members. If no CCP packets are received from a Cluster Member within a predefined time, that Cluster Member is assumed to be down. As a result, cluster failover can occur.
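
The heartbeat check can be modeled as a simple timeout test. The sketch below is a generic illustration, not the CCP implementation: the function name, the dictionary of last-seen timestamps, and the timeout value are assumptions, and the real predefined time is not stated here.

```python
def presumed_down(last_ccp_seen, now, timeout_seconds):
    """Return the members whose last CCP packet is older than the timeout.

    last_ccp_seen maps a member name to the time (in seconds) at which
    the most recent CCP packet from that member was received.
    """
    return sorted(
        name
        for name, seen_at in last_ccp_seen.items()
        if now - seen_at > timeout_seconds
    )


# Hypothetical values: "B" has been silent for 3.5 seconds, "A" for 0.5 seconds.
silent = presumed_down({"A": 10.0, "B": 7.0}, now=10.5, timeout_seconds=2.0)
```

With these example values, only member `B` exceeds the timeout and is treated as down.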

Note that more than one Cluster Member may encounter a problem that will result in a cluster failover event. In cases where all Cluster Members encounter such problems, ClusterXL will try to choose a single Cluster Member to continue operating. The state of the chosen member will be reported as Active Attention. This situation lasts until another Cluster Member fully recovers. For example, if a cross cable connecting the sync interfaces on Cluster Members malfunctions, both Cluster Members will detect an interface problem. One of them will change to the Down state, and the other to Active Attention state.

When Does a Failover Occur?

A failover takes place when one of the following occurs in a cluster:

Any Critical Device reports its state as "problem" (see Monitoring Critical Devices).
For example, the fwd process failed, or the Security Policy is uninstalled on a Cluster Member.
Cluster Members do not receive Cluster Control Protocol (CCP) packets from their peer Cluster Member.
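
The two conditions listed above combine with a logical OR. The following Python sketch expresses that rule under stated assumptions: the function name, the state strings `"OK"` and `"problem"`, and the device name `policy` are hypothetical and chosen for illustration (only `fwd` comes from the example above).

```python
def failover_triggered(critical_device_states, ccp_received_from_peer):
    """Return True if either failover condition holds.

    critical_device_states maps a Critical Device name to its reported state.
    ccp_received_from_peer is False when no CCP packets arrive from the peer.
    """
    any_problem = any(
        state == "problem" for state in critical_device_states.values()
    )
    return any_problem or not ccp_received_from_peer


# The fwd process failed, so a failover is triggered even though CCP is fine.
triggered = failover_triggered({"fwd": "problem", "policy": "OK"}, True)
```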

For more on failovers, see sk62570.

What Happens When a Cluster Member Recovers?

In a High Availability mode:

If the cluster object is configured as Maintain current active Cluster Member, any Cluster Member that becomes Active remains Active.
If the Cluster Member with the highest priority fails, cluster failover occurs. The Cluster Member with the next highest priority becomes Active.

If the Cluster Member with the highest priority recovers, cluster failover does not occur again, and that Cluster Member becomes Standby.
If the cluster object is configured as Switch to higher priority Cluster Member, the Cluster Member with the highest priority must always be Active.
The Cluster Member with the highest priority is the one that appears at the top of the list in the Cluster object > Cluster Members pane.

If the Cluster Member with the highest priority fails, cluster failover occurs. A peer Cluster Member in Standby state, with the next highest priority, becomes Active.

If the Cluster Member with the highest priority recovers, cluster failover occurs again. The Cluster Member with the highest priority becomes Active again. The Cluster Member with the next highest priority that was Active, returns to the Standby state.
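
The difference between the two High Availability recovery settings can be summarized in a short Python sketch. The mode strings and the function name are hypothetical labels for this example. `priority_order` models the Cluster Members list, with the highest priority member first.

```python
def active_after_recovery(mode, current_active, recovered, priority_order):
    """Return the name of the member that is Active once `recovered` is back."""
    if mode == "maintain_current_active":
        # Whoever is Active stays Active; the recovered member becomes Standby.
        return current_active
    if mode == "switch_to_higher_priority":
        # The member that appears earlier in the list wins.
        return min((current_active, recovered), key=priority_order.index)
    raise ValueError(f"unknown mode: {mode}")


order = ["A", "B"]  # "A" has the highest priority
keep = active_after_recovery("maintain_current_active", "B", "A", order)
switch = active_after_recovery("switch_to_higher_priority", "B", "A", order)
```

Here `keep` is `"B"`, because no second failover occurs, and `switch` is `"A"`, because the highest priority member takes the Active state back.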

In a Load Sharing mode:

When the failed Cluster Member recovers, all connections are redistributed between all Active Cluster Members.

How a Recovered Cluster Member Obtains the Security Policy

The Administrator installs the Security Policy on the cluster object, rather than separately on individual Cluster Members. The policy is automatically installed on all Cluster Members. The policy is sent to the IP addresses defined in the General Properties page of the cluster member object.

When a failed Cluster Member recovers, it first tries to fetch a policy from one of the peer Active Cluster Members. The assumption is that the other Cluster Members have a more up-to-date policy. If fetching a policy from a peer Cluster Member fails, the recovered Cluster Member compares its own local policy to the policy on its Management Server. If the policy on the Management Server is more up to date than the one on the recovered Cluster Member, the policy is fetched from the Management Server. If the Cluster Member does not have a local policy, it retrieves one from the Management Server. This ensures that all Cluster Members use the same policy at any given moment.
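
The order of these checks can be written as a small decision function. This is a sketch of the logic as described above, not of the actual fetch mechanism: the function name, the returned labels, and the use of comparable version numbers to represent "more up to date" are assumptions for illustration.

```python
def choose_policy_source(peer_fetch_ok, local_version, management_version):
    """Decide which policy a recovered Cluster Member ends up using.

    local_version is None when the member has no local policy.
    A larger version number stands for a more up-to-date policy.
    """
    if peer_fetch_ok:
        return "peer"          # fetched from an Active peer Cluster Member
    if local_version is None:
        return "management"    # nothing local, so fetch from the Management Server
    if management_version > local_version:
        return "management"    # the Management Server holds a newer policy
    return "local"             # the local policy is already current


source = choose_policy_source(peer_fetch_ok=False, local_version=3, management_version=5)
```

In this example the peer fetch failed and the Management Server has a newer policy, so `source` is `"management"`.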