Cluster Failover

What is Failover?

Failover is a cluster redundancy operation that automatically occurs if a Cluster Member is not functional. When this occurs, other Cluster Members take over for the failed Cluster Member.

In the High Availability mode:

If the Active Cluster Member detects that it cannot function as a Cluster Member, it notifies the peer Standby Cluster Members that it must go down. One of the Standby Cluster Members (with the next highest priority) will promote itself to the Active state.
If one of the Standby Cluster Members stops receiving Cluster Control Protocol (CCP) packets from the current Active Cluster Member, that Standby Cluster Member can assume that the current Active Cluster Member failed. As a result, one of the Standby Cluster Members (with the next highest priority) will promote itself to the Active state.
If you do not use State Synchronization in the cluster, existing connections are interrupted when cluster failover occurs.
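
The High Availability promotion rule described above can be illustrated with a minimal sketch. It is not ClusterXL code: the `Member` class, the `fail_over` function, and the convention that priority 1 is the highest are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    priority: int  # assumed convention: 1 is the highest priority
    state: str     # "ACTIVE", "STANDBY", or "DOWN"


def fail_over(members, failed_name):
    """Mark one member Down and return the name of the Active member afterwards."""
    for m in members:
        if m.name == failed_name:
            m.state = "DOWN"

    # If an Active member is still running, no promotion is needed.
    for m in members:
        if m.state == "ACTIVE":
            return m.name

    # Otherwise the Standby member with the next highest priority promotes itself.
    standbys = [m for m in members if m.state == "STANDBY"]
    if not standbys:
        return None
    new_active = min(standbys, key=lambda m: m.priority)
    new_active.state = "ACTIVE"
    return new_active.name
```

For example, in a hypothetical three-member cluster where `gw1` is Active, calling `fail_over(cluster, "gw1")` promotes `gw2` and leaves `gw3` in Standby.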

In Load Sharing modes:

If a Cluster Member detects that it cannot function as a Cluster Member, it notifies the peer Cluster Members that it must go down. Traffic load will be redistributed between the working Cluster Members.
If the Cluster Members stop receiving Cluster Control Protocol (CCP) packets from one of their peer Cluster Members, those working Cluster Members can assume that their peer Cluster Member failed. As a result, traffic load will be redistributed between the working Cluster Members.
Because all Cluster Members are always synchronized by design, current connections are not interrupted when cluster failover occurs.
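
As a rough illustration of how traffic load can be redistributed between the working Cluster Members, the sketch below assigns each connection to a member by hashing a connection key. The actual ClusterXL decision function is not described here, so the CRC32-modulo scheme, the `distribute` function, and the connection key strings are assumptions chosen only to make the idea concrete.

```python
import zlib


def distribute(connections, working_members):
    """Map each connection key to one of the working Cluster Members."""
    if not working_members:
        raise ValueError("no working Cluster Members")
    members = sorted(working_members)
    # Hypothetical scheme: a stable hash of the connection key picks the owner.
    return {
        conn: members[zlib.crc32(conn.encode()) % len(members)]
        for conn in connections
    }
```

Calling `distribute` a second time with the failed member removed from `working_members` yields a mapping in which every connection is owned by a member that is still working.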

To tell each Cluster Member that the other Cluster Members are alive and functioning, the ClusterXL Cluster Control Protocol (CCP) maintains a heartbeat between Cluster Members. If after a predefined time, no CCP packets are received from a Cluster Member, it is assumed that the Cluster Member is down. As a result, cluster failover can occur.
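
The heartbeat check can be sketched as a simple timeout comparison. The function name, the timestamp dictionary, and the timeout value are hypothetical. The text above only says that the time is predefined.

```python
def peers_assumed_down(last_ccp_seen, now, timeout_s):
    """Return the peers from which no CCP packet arrived within timeout_s seconds.

    last_ccp_seen maps a peer name to the time (in seconds) of its last CCP packet.
    """
    return sorted(
        name for name, seen_at in last_ccp_seen.items()
        if now - seen_at > timeout_s
    )
```

With `last_ccp_seen = {"gw1": 10.0, "gw2": 12.5}`, `now = 13.0`, and an assumed timeout of 2 seconds, only `gw1` is reported, because its last packet is 3 seconds old.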

Note that more than one Cluster Member may encounter a problem that will result in a cluster failover event. In cases where all Cluster Members encounter such problems, ClusterXL will try to choose a single Cluster Member to continue operating. The state of the chosen member will be reported as Active(!). This situation lasts until another Cluster Member fully recovers. For example, if a cross cable connecting the sync interfaces on Cluster Members malfunctions, both Cluster Members will detect an interface problem. One of them will change to the Down state, and the other to the Active(!) state.

When Does a Failover Occur?

A failover takes place when one of the following occurs in a cluster:

Any Critical Device reports its state as "problem" (see Viewing Critical Devices).

For example, the "fwd" process failed, or Security Policy is uninstalled on a Cluster Member.
A Cluster Member does not receive Cluster Control Protocol (CCP) packets from its peer Cluster Member.

For more on failovers, see sk62570.

What Happens When a Cluster Member Recovers?

In the High Availability mode:

If the cluster object is configured as Maintain current active Cluster Member, any Cluster Member that becomes Active remains Active.

If the Cluster Member with the highest priority fails, cluster failover occurs. A Cluster Member with the next highest priority becomes Active.

If the Cluster Member with the highest priority recovers, cluster failover does not occur again, and that Cluster Member becomes Standby.
If the cluster object is configured as Switch to higher priority Cluster Member, the Cluster Member with the highest priority must always be Active.

The Cluster Member with the highest priority is the Cluster Member that appears at the top of the list in the Cluster object > Cluster Members pane.

If the Cluster Member with the highest priority fails, cluster failover occurs. A peer Cluster Member in Standby state, with the next highest priority, becomes Active.

If the Cluster Member with the highest priority recovers, cluster failover occurs again. The Cluster Member with the highest priority becomes Active again. The Cluster Member with the next highest priority that was Active, returns to the Standby state.
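
The difference between the two High Availability recovery settings can be summarized in a short sketch. The mode strings and the `active_after_recovery` function are hypothetical names for illustration. `priority_order` stands for the member list with the highest priority first, as in the Cluster Members pane.

```python
def active_after_recovery(mode, current_active, recovered, priority_order):
    """Return which member is Active after a failed member recovers."""
    if mode == "maintain_current_active":
        # The member that became Active stays Active; the recovered one is Standby.
        return current_active
    if mode == "switch_to_higher_priority":
        # The member nearer the top of the list (lower index) must be Active.
        return min((current_active, recovered), key=priority_order.index)
    raise ValueError(f"unknown recovery mode: {mode}")
```

With `priority_order = ["gw1", "gw2"]`, `gw2` currently Active, and `gw1` recovering, the first mode keeps `gw2` Active and the second mode fails back to `gw1`.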

In the Load Sharing modes:

When the failed Cluster Member recovers, all connections are redistributed between all Active Cluster Members.

How a Recovered Cluster Member Obtains the Security Policy

The Administrator installs the Security Policy on the cluster object, rather than separately on individual Cluster Members. The policy is automatically installed on all Cluster Members. The policy is sent to the IP addresses defined in the General Properties page of the cluster member object.

When a failed Cluster Member recovers, it first tries to fetch a policy from one of the peer Active Cluster Members. The assumption is that the other Cluster Members have a more up to date policy. If fetching a policy from a peer Cluster Member fails, the recovered Cluster Member compares its own local policy to the policy on its Management Server. If the policy on the Management Server is more up to date than the one on the recovered Cluster Member, the policy is fetched from the Management Server. If the Cluster Member does not have a local policy, it retrieves one from the Management Server. This ensures that all Cluster Members use the same policy at any given moment.
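
The order in which a recovered Cluster Member looks for a policy can be written as a small decision function. This is a sketch of the logic in the paragraph above, not the actual fetch mechanism: the function name, the return labels, and the use of comparable timestamps to decide which policy is more up to date are assumptions.

```python
def choose_policy_source(peer_fetch_ok, local_policy_time, mgmt_policy_time):
    """Decide where a recovered member takes its policy from.

    local_policy_time is None when the member has no local policy.
    """
    if peer_fetch_ok:
        return "peer"        # fetched from a peer Active Cluster Member
    if local_policy_time is None:
        return "management"  # no local policy: retrieve from the Management Server
    if mgmt_policy_time > local_policy_time:
        return "management"  # the Management Server has the newer policy
    return "local"           # the local policy is already current
```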

General Failover Limitations

Some connections may not survive cluster failover:

Security Servers connections.
Connections that are handled by the Check Point services, in which the option Synchronize connections on cluster is disabled.
Connections initiated by the Cluster Member itself.
TCP connections handled by the Check Point Active Streaming (CPAS) or Passive Streaming Layer (PSL) mechanism.
Connections handled by Software Blades:
- If the IPS Software Blade in the cluster object R77.30 is configured to Prefer connectivity, and the Cluster Member that owns the connections is Down, then the connection is accepted without inspection.
  
  Otherwise, the Cluster Members drop the connection.
- For all other Software Blades:
  - If the destination Cluster Member is available, the connection is forwarded to the Cluster Member that owns the connection.
  - If the destination Cluster Member is not available, the Cluster Members drop the connection.
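
The rules for connections handled by Software Blades can be condensed into one function. The sketch follows the wording of the list above literally. The function name, the parameter names, and the returned labels are hypothetical.

```python
def handle_blade_connection(blade, owner_available, ips_prefer_connectivity=False):
    """Return what happens to a blade-handled connection after failover."""
    if blade == "IPS":
        # Accepted uninspected only if Prefer connectivity is set and the owner is Down.
        if ips_prefer_connectivity and not owner_available:
            return "accept_without_inspection"
        return "drop"
    # All other Software Blades: forward to the owning member if it is available.
    return "forward_to_owner" if owner_available else "drop"
```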