Monitoring and Troubleshooting Gateway Clusters

Related Topics

Verifying that a Cluster is Working Properly

Monitoring Cluster Status Using SmartConsole Clients

ClusterXL Configuration Commands

How to Initiate Failover

Monitoring Synchronization (fw ctl pstat)

Troubleshooting Synchronization

ClusterXL Error Messages

Member Fails to Start After Reboot

Verifying that a Cluster is Working Properly

The cphaprob Command

Use the cphaprob command to verify that the cluster and the cluster members are working properly, and to define critical devices. A critical device is a process running on a cluster member that enables the member to notify the other cluster members when it can no longer function as a member. The device reports its current state to the ClusterXL mechanism; if it fails to report, ClusterXL decides that a failover has occurred and another cluster member takes over. When a critical device (also known as a Problem Notification, or pnote) fails, the cluster member is considered to have failed.

There are a number of built-in critical devices, and the administrator can define additional critical devices. The default critical devices are:

  • The cluster interfaces on the cluster members.
  • Synchronization — full synchronization completed successfully.
  • Filter — the Security Policy, and whether it is loaded.
  • cphad — which follows the ClusterXL process called cphamcset.
  • fwd — the Security Gateway daemon.

These commands can be run automatically by including them in scripts.

To produce a usage printout for cphaprob that shows all the available commands, type cphaprob at the command line and press Enter. The meaning of each of these commands is explained in the following sections.

cphaprob -d <device> -t <timeout(sec)> -s <ok|init|problem> [-p] register

cphaprob -f <file> register

cphaprob -d <device> [-p] unregister

cphaprob -d <device> -s <ok|init|problem> report

cphaprob [-i[a]] [-e] list

cphaprob state

cphaprob [-a] if

 

Monitoring Cluster Status

To see the status of a single or multiple cluster members:

  • Run the following command:

cphaprob state

Run this command after setting up the cluster, and whenever you want to monitor the cluster status.

The following is an example of the output of cphaprob state:

cphaprob state

 

Cluster mode: Load sharing (Multicast)

 

Number Unique Address State

 

1 (local) 30.0.0.1 active

2 30.0.0.2 active

  • Cluster mode can be:
    • Load Sharing (Multicast).
    • Load Sharing (Unicast).
    • High Availability New Mode (Primary Up or Active Up).
    • High Availability Legacy Mode (Primary Up or Active Up).
    • For third-party clustering products, the mode is shown as "Service". Refer to Clustering Definitions and Terms for further information.
  • The number of the member indicates the member ID in Load Sharing configurations, and the priority in High Availability configurations.
  • In a Load Sharing configuration, all machines in a fully functioning cluster should be Active. In a High Availability configuration, only one machine in a properly functioning cluster should be Active, and the others must be in the Standby state.

    Third-party clustering products show Active/Active even if one of the members is in standby state. This is because this command only reports the status of the full synchronization process. For IPSO VRRP, this command shows the exact state of the Firewall, but not the cluster member (for example, the member may not be working properly but the state of the Firewall is active).
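
These checks lend themselves to scripting. The following is a minimal sketch (not part of the product) that flags any member whose state is not Active or Standby. It assumes the member lines in the output begin with the member number, as in the example above, and uses logger only as a placeholder for whatever alerting your site prefers:

#!/bin/sh
# Sketch only: alert on cluster members that are not active or standby.
cphaprob state | grep -E '^[0-9]' | while read -r line; do
    state=$(echo "$line" | awk '{print tolower($NF)}')
    case "$state" in
        active|standby) ;;                                  # healthy states
        *) echo "ClusterXL alert: $line" | logger -t clustermon ;;
    esac
done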

When examining the state of a cluster member, you need to consider whether it is forwarding packets, and whether it has a problem that is preventing it from forwarding packets. Each state reflects the result of a test on critical devices. The following table lists the possible cluster states, and whether or not they represent a problem.

Cluster States

State: Active
Meaning: Everything is OK.
Forwarding packets? Yes
Is this state a problem? No

State: Active Attention
Meaning: A problem has been detected, but the cluster member is still forwarding packets because it is the only machine in the cluster, or because there are no other Active machines in the cluster. In any other situation the state of the machine would be Down.
Forwarding packets? Yes
Is this state a problem? Yes

State: Down
Meaning: One of the critical devices is down.
Forwarding packets? No
Is this state a problem? Yes

State: Ready
Meaning: The machine recognizes itself as part of the cluster and is ready to go into action, but, by design, something prevents it from taking action. Possible reasons that the machine is not yet Active include:

  1. Not all required software components have been loaded and initialized, or not all configuration steps have finished successfully. Before a cluster member becomes Active, it sends a message to the rest of the cluster members, checking whether it can become Active. In High Availability mode it checks whether there is already an Active member, and in Load Sharing Unicast mode it checks whether there is already a Pivot member. The member remains in the Ready state until it receives the responses from the rest of the cluster members and decides which state to choose next (Active, Standby, Pivot, or non-Pivot).
  2. The software installed on this member has a higher version than the rest of the members in the cluster. For example, when a cluster is upgraded from one version of Check Point Security Gateway to another, and the cluster members have different versions, the members with the new version are in the Ready state and the members with the previous version are in the Active / Active Attention state.
  3. If the software installed on all cluster members includes CoreXL (installed by default in R70 and higher), a member in the Ready state may have a higher number of CoreXL instances than the other members. See sk42096 for a solution.

Forwarding packets? No
Is this state a problem? No

State: Standby
Meaning: Applies only to a High Availability configuration. The member is waiting for an Active machine to fail in order to start packet forwarding.
Forwarding packets? No
Is this state a problem? No

State: Initializing
Meaning: An initial and transient state of the cluster member. The cluster member is booting up, and the ClusterXL product is already running, but the Security Gateway is not yet ready.
Forwarding packets? No
Is this state a problem? No

State: ClusterXL inactive or machine is down
Meaning: The local machine cannot hear anything coming from this cluster member.
Forwarding packets? Unknown
Is this state a problem? Yes

Monitoring Cluster Interfaces

To see the state of the cluster member interfaces and the virtual cluster interfaces:

  • Run the following command on the cluster members:

cphaprob [-a] if

The output of this command must be identical to the configuration in the cluster object Topology page.

For example:

cphaprob -a if

 

Required interfaces: 4

Required secured interfaces: 1

 

qfe4      UP                       (secured, unique, multicast)

qfe5      UP                       (non secured, unique, multicast)

qfe6      DOWN (4810.2 secs)       (non secured, unique, multicast)

qfe7      UP                       (non secured, unique, multicast)

 

Virtual cluster interfaces: 2

qfe5 30.0.1.130

qfe6 30.0.2.130

The interfaces are ClusterXL critical devices. ClusterXL checks the number of good interfaces and sets a value of Required interfaces to the maximum number of good interfaces seen since the last reboot. If the number of good interfaces is less than the Required number, ClusterXL initiates failover. The same applies for secured interfaces, where only the good synchronization interfaces are counted.

An interface can be:

  • Non-secured or Secured. A secured interface is a synchronization interface.
  • Shared or unique. A shared interface applies only to High Availability Legacy mode.
  • Multicast or broadcast. The Cluster Control Protocol (CCP) mode used in the cluster. CCP can be changed to use broadcast instead. To toggle between these two modes use the command cphaconf set_ccp <broadcast|multicast>

For third-party clustering products, except in the case of IPSO IP Clustering,
cphaprob -a if should always show virtual cluster IP addresses.

When an interface is DOWN, it means that the interface cannot receive or transmit CCP packets, or both. This may happen when an interface is malfunctioning, is connected to an incorrect subnet, is unable to pick up multicast Ethernet packets, and so on. The interface may also be able to receive but not transmit CCP packets, in which case it is also reported as DOWN. The displayed time is the number of seconds that have elapsed since the interface was last able to receive or transmit a CCP packet.

See Defining Disconnected Interfaces for additional information.
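
As a hedged illustration (assuming the output format shown above, and using logger purely as a placeholder for site-specific alerting), a short script can flag any monitored interface that ClusterXL reports as DOWN:

#!/bin/sh
# Sketch only: log every cluster interface that cphaprob reports as DOWN.
cphaprob -a if | grep -w DOWN | while read -r line; do
    echo "ClusterXL interface problem: $line" | logger -t clustermon
done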

Monitoring Critical Devices

When a critical device fails, the cluster member is considered to have failed. To see the list of critical devices on a cluster member, and of all the other machines in the cluster, run the following command on the cluster member:

 cphaprob [-i[a]] [-e] list

There are a number of built-in critical devices, and the administrator can define additional critical devices. The default critical devices are:

  • The cluster interfaces on the cluster members.
  • Synchronization — full synchronization completed successfully.
  • Filter — the Security Policy, and whether it is loaded.
  • cphad — which follows the ClusterXL process called cphamcset.
  • fwd — the Security Gateway daemon.

For IPSO Clustering, the output is the same as for ClusterXL Load Sharing. For other third-party products, this command produces no output. The following example output shows that the fwd process is down: 

cphaprob list

 

Built-in Devices:

 

Device Name: Interface Active Check

Current state: OK

 

Registered Devices:

 

Device Name: Synchronization

Registration number: 0

Timeout: none

Current state: OK

Time since last report: 15998.4 sec

 

Device Name: Filter

Registration number: 1

Timeout: none

Current state: OK

Time since last report: 15644.4 sec

 

Device Name: fwd

Registration number: 3

Timeout: 2 sec

Current state: problem

Time since last report: 4.5 sec 
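
As a hedged sketch (relying on the "Device Name:" and "Current state:" layout shown above, which may vary slightly between versions), the output can be reduced to only the devices that are not reporting OK:

#!/bin/sh
# Sketch only: print critical devices (pnotes) whose state is not OK.
cphaprob list | awk '
    /^Device Name:/   { name = substr($0, index($0, ":") + 2) }
    /^Current state:/ { if ($NF != "OK") print name " is in state " $NF }
'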

Registering a Critical Device

cphaprob -d <device> -t <timeout(sec)> -s <ok|init|problem> [-p] register

It is possible to add a user defined critical device to the default list of critical devices. Use this command to register <device> as a critical process, and add it to the list of devices that must be running for the cluster member to be considered active. If <device> fails, then the cluster member is considered to have failed.

If <device> fails to contact the cluster member in <timeout> seconds, <device> will be considered to have failed. For no timeout, use the value 0.

Define the status of the <device> that will be reported to ClusterXL upon registration. This initial status can be one of:

  • ok — <device> is alive.
  • init — <device> is initializing. The machine is down. This state prevents the machine from becoming active.
  • problem — <device> has failed.

[-p] makes these changes permanent. After performing a reboot or after removing the Security Gateway (on Linux or IPSO for example) and re-attaching it, the status of critical devices that were registered with this flag will be saved.
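
For example (app_monitor is a hypothetical user-defined device name, and the 30-second timeout is an arbitrary value chosen for illustration):

# Register a user-defined critical device with a 30-second timeout,
# starting in the "init" state, and keep the registration after reboot:
cphaprob -d app_monitor -t 30 -s init -p register

# Once the monitored component is confirmed healthy, report it as alive:
cphaprob -d app_monitor -s ok report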

Registering Critical Devices Listed in a File

cphaprob -f <file> register

Register all the user defined critical devices listed in <file>. <file> must be an ASCII file, with each device on a separate line. Each line must list three parameters, which must be separated by at least a space or a tab, as follows:

<device> <timeout> <status>

  • <device> — The name of the critical device. It must have no more than 15 characters, and must not include white spaces.
  • <timeout> — If <device> fails to contact the cluster member in <timeout> seconds, <device> will be considered to have failed. For no timeout, use the value 0.
  • <status> — can be one of:
    • ok — <device> is alive.
    • init — <device> is initializing. The machine is down. This state prevents the machine from becoming active.
    • problem — <device> has failed.
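
For example, a definition file might look like this (the file path and device names are hypothetical placeholders):

# Contents of /var/tmp/pnote_devices.txt -- one device per line:
#   app_monitor    30   init
#   link_monitor   0    ok

# Register every device listed in the file:
cphaprob -f /var/tmp/pnote_devices.txt register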

Unregistering a Critical Device

cphaprob -d <device> [-p] unregister

Unregisters a user defined <device> as a critical process, which means that this device is no longer considered critical. If a critical device was reporting "problem" before running this command (and the cluster member was therefore considered to have failed), then after running this command the status of the cluster member will depend only on the remaining critical devices.

[-p] makes these changes permanent. This means that after performing a reboot or after removing the kernel (on Linux or IPSO for example) and re-attaching it, these critical devices remain unregistered.

Reporting Critical Device Status to ClusterXL

cphaprob -d <device> -s <ok|init|problem> report

Use this command to report the status of a user defined critical device to ClusterXL.

<device> is the device that must be running for the cluster member to be considered active. If <device> fails, then the cluster member is considered to have failed.

<status> is the status reported to ClusterXL. It can be one of:

  • ok — <device> is alive.
  • init — <device> is initializing. The machine is down. This state prevents the machine from becoming active.
  • problem — <device> has failed. If this status is reported to ClusterXL, the cluster immediately fails over to another cluster member.

If <device> fails to contact the cluster member within the timeout that was defined when the device was registered, the device, and hence the cluster member, will be considered to have failed. This is true only for critical devices registered with a timeout. If a critical device is registered with the -t 0 parameter, there is no timeout, and until the device reports otherwise, its status is considered to be the last reported status.
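
For example (app_monitor is again a hypothetical user-defined device name):

# Report that the device has recovered:
cphaprob -d app_monitor -s ok report

# Report a failure; this causes the cluster to fail over away from this member:
cphaprob -d app_monitor -s problem report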

Example cphaprob Script

Predefined cphaprob scripts are located in $FWDIR/bin. Two scripts are available:

clusterXL_monitor_ips

clusterXL_monitor_process

The clusterXL_monitor_ips script, listed in the appendix Example cphaprob Script, provides a way to check end-to-end connectivity to routers or other network devices and to cause failover if the ping fails. The clusterXL_monitor_process script monitors the existence of given processes and causes failover if the processes die; this script uses the normal pnote mechanism. A simplified sketch of the same idea follows.
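
The sketch below is not the predefined script itself; it only illustrates the pattern those scripts follow. The device name host_monitor, the target address 192.0.2.1 and the 5-second interval are placeholders:

#!/bin/sh
# Sketch only: register a pnote, ping a target, and report "problem"
# (causing failover) when the target stops answering.
TARGET=192.0.2.1
cphaprob -d host_monitor -t 0 -s ok register
while true; do
    if ping -c 1 "$TARGET" > /dev/null 2>&1; then
        cphaprob -d host_monitor -s ok report
    else
        cphaprob -d host_monitor -s problem report
    fi
    sleep 5
done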

Monitoring Cluster Status Using SmartConsole Clients

SmartView Monitor

SmartView Monitor displays a snapshot of all ClusterXL cluster members in the enterprise, enabling real-time monitoring and alerting. For each cluster member, state change and critical device problem notifications are displayed. SmartView Monitor allows you to specify the action to be taken if the status of a cluster member changes. For example, the Security Gateway can issue an alert notifying you of suspicious activity.

Starting and Stopping ClusterXL Using SmartView Monitor

To stop ClusterXL on the machine and cause failover to another machine, open SmartView Monitor, click the cluster object, select one of the member gateway branches, right click a cluster member, and select Down.

To initiate a restart of ClusterXL, open SmartView Monitor, click the cluster object, select one of the member gateway branches, right click a cluster member, and select Up.

Note - SmartView Monitor does not initiate full synchronization, so some connections may be lost. To initiate full synchronization, run cpstart.

SmartView Tracker

Every change in status of a cluster member is recorded in SmartView Tracker according to the choice in the Fail-Over Tracking option of the cluster object ClusterXL page.

ClusterXL Log Messages

The following conventions are used in this section:

  1. Square brackets are used to indicate placeholders, which are substituted by relevant data when an actual log message is issued (for example, [NUMBER] will be replaced by a numeric value).
  2. Angle brackets are used to indicate alternatives, one of which will be used in actual log messages. The different alternatives are separated with a vertical line (for example, <up|down> indicates that either "up" or "down" will be used).
  3. The following placeholders are frequently used:
  • ID: A unique cluster member identifier, starting from "1". This corresponds to the order in which members are sorted in the cluster object's GUI.
  • IP: Any unique IP address that belongs to the member.
  • MODE: The cluster mode (for example, New HA, LS Multicast, and so on).
  • STATE: The state of the member (for example, active, down, standby).
  • DEVICE: The name of a pnote device (for example, fwd, Interface Active Check).
General logs

Starting <ClusterXL|State Synchronization>.

Indicates that ClusterXL (or State Synchronization, for 3rd party clusters) was successfully started on the reporting member. This message is usually issued after a member boots, or after an explicit call to cphastart.

Stopping <ClusterXL|State Synchronization>.

Informs that ClusterXL (or State Synchronization) was deactivated on this machine. The machine will no longer be a part of the cluster (even if configured to be so), until ClusterXL is restarted.

Unconfigured cluster Machines changed their MAC Addresses. Please reboot the cluster so that the changes take affect.

This message is usually issued when a machine is shut down, or after an explicit call to cphastop.

State logs

Mode inconsistency detected: member [ID] ([IP]) will change its mode to [MODE]. Please re-install the security policy on the cluster.

This message should rarely happen. It indicates that another cluster member has reported a different cluster mode than is known to the local member. This is usually the result of a failure to install the security policy on all cluster members. To correct this problem, install the Security Policy again.

Note - The cluster will continue to operate after a mode inconsistency has been detected, by altering the mode of the reporting machine to match the other cluster members. However, it is highly recommended to re-install the policy as soon as possible.

State change of member [ID] ([IP]) from [STATE] to [STATE] was cancelled, since all other members are down. Member remains [STATE].

When a member needs to change its state (for example, when an active member encounters a problem and needs to bring itself down), it first queries the other members for their state. If all other members are down, this member cannot change its state to a non-active one (or else all members will be down, and the cluster will not function). Thus, the reporting member continues to function, despite its problem (and will usually report its state as "active attention").

member [ID] ([IP]) <is active|is down|is stand-by|is initializing> ([REASON]).

This message is issued whenever a cluster member changes its state. The log text specifies the new state of the member.

Pnote logs

PNote log messages are issued when a pnote device changes its state.

  • [DEVICE] on member [ID] ([IP]) status OK ([REASON]).

    The pnote device is working normally.

  • [DEVICE] on member [ID] ([IP]) detected a problem ([REASON]).

    Either an error was detected by the pnote device, or the device has not reported its state for a number of seconds (as set by the "timeout" option of the pnote).

  • [DEVICE] on member [ID] ([IP]) is initializing ([REASON]).

    Indicates that the device has registered itself with the pnote mechanism, but has not yet determined its state.

  • [DEVICE] on member [ID] ([IP]) is in an unknown state ([STATE ID]) ([REASON]).

    This message should not normally appear. Contact Check Point Support.

Interface logs
  • interface [INTERFACE NAME] of member [ID] ([IP]) is up.

    Indicates that this interface is working normally, meaning that it is able to receive and transmit packets on the expected subnet.

  • interface [INTERFACE NAME] of member [ID] ([IP]) is down (receive <up|down>, transmit <up|down>).

    This message is issued whenever an interface encounters a problem, either in receiving or transmitting packets. Note that in this case the interface may still be working properly, as far as the OS is concerned, but is unable to communicate with other cluster members due to a faulty cluster configuration.

  • interface [INTERFACE NAME] of member [ID] ([IP]) was added.

    Notifies users that a new interface was registered with the Security Gateway (meaning that packets arriving on this interface are filtered by the firewall). Usually this message is the result of activating an interface (such as issuing an ifconfig up command on Unix systems). The interface will now be included in the ClusterXL reports (such as in SmartView Monitor, or in the output of cphaprob -a if). Note that the interface may still be reported as "Disconnected", in case it was configured as such for ClusterXL.

  • interface [INTERFACE NAME] of member [ID] ([IP]) was removed.

    Indicates that an interface was detached from the Security Gateway, and is therefore no longer monitored by ClusterXL.

SecureXL logs
  • SecureXL device was deactivated since it does not support CPLS.

    This message is the result of an attempt to configure a ClusterXL in Load Sharing Multicast mode over Security Gateways using an acceleration device that does not support Load Sharing. As a result, acceleration will be turned off, but the cluster will work in Check Point Load Sharing mode (CPLS).

Reason Strings
  • member [ID] ([IP]) reports more interfaces up.

    This text can be included in a pnote log message describing the reasons for a problem report: Another member has more interfaces reported to be working, than the local member does. This means that the local member has a faulty interface, and that its counterpart can do a better job as a cluster member. The local member will therefore go down, leaving the member specified in the message to handle traffic.

  • member [ID] ([IP]) has more interfaces - check your disconnected interfaces configuration in the <discntd.if file|registry>.

    This message is issued when members in the same cluster have a different number of interfaces. A member that has fewer interfaces than the maximum number in the cluster (the reporting member) may not be working properly, as it is missing an interface required to operate against a cluster IP address, or a synchronization network. If some of the interfaces on the other cluster member are redundant and should not be monitored by ClusterXL, they should be explicitly designated as "Disconnected". This is done using the file $FWDIR/conf/discntd.if (under Unix systems), or the Windows Registry.

  • [NUMBER] interfaces required, only [NUMBER] up.

    ClusterXL has detected a problem with one or more of the monitored interfaces. This does not necessarily mean that the member will go down, as the other members may have fewer operational interfaces. In such a condition, the member with the highest number of operational interfaces will remain up, while the others will go down.

ClusterXL Configuration Commands

The cphaconf command

Description The cphaconf command configures ClusterXL.

Important - Running this command is not recommended. It should be run automatically, only by the Security Gateway or by Check Point support. The only exception to this rule is running this command with the set_ccp option, as described below.

Usage

cphaconf [-i <machine id>] [-p <policy id>] [-b <db_id>] [-n <cluster num>] [-c <cluster size>] [-m <service>]
[-t <secured IF 1>...] start

cphaconf [-t <secured IF 1>...] [-d <disconnected IF 1>...] add
cphaconf clear-secured
cphaconf clear-disconnected
cphaconf stop
cphaconf init
cphaconf forward <on/off>
cphaconf debug <on/off>
cphaconf set_ccp <broadcast/multicast>
cphaconf mc_reload
cphaconf debug_data
cphaconf stop_all_vs

Syntax

Parameter: set_ccp <broadcast|multicast>

Description: Sets whether Cluster Control Protocol (CCP) packets are sent with a broadcast or a multicast destination MAC address. The default behavior is multicast. The setting created using this command survives reboot.

Note - The same value (either broadcast or multicast) should be set on all cluster members.

Parameter: stop_all_vs

Description: Stops the cluster product on all Virtual Systems on a VSX Gateway.

The cphastart and cphastop Commands

Running cphastart on a cluster member activates ClusterXL on the member. It does not initiate full synchronization. cpstart is the recommended way to start a cluster member.

Running cphastop on a cluster member stops the cluster member from passing traffic. State synchronization also stops. It is still possible to open connections directly to the cluster member. In High Availability Legacy mode, running cphastop may cause the entire cluster to stop functioning.

These commands should only be run by the Security Gateway, and not directly by the user.

How to Initiate Failover

There are three ways to stop and start ClusterXL on a cluster member in order to initiate a failover.

Method 1 - cphaprob commands

To stop ClusterXL, run:

  • cphaprob -d faildevice -t 0 -s ok register
  • cphaprob -d faildevice -s problem report

Effect:

  • Disables ClusterXL
  • Does not disable synchronization

To start ClusterXL, run:

  • cphaprob -d faildevice -s ok report
  • cphaprob -d faildevice unregister

Effect:

  • Enables ClusterXL
  • Does not initiate full synchronization

Method 2 - clusterXL_admin (recommended method)

To stop ClusterXL, run:

  • clusterXL_admin down

Effect:

  • Disables ClusterXL
  • Does not disable synchronization

To start ClusterXL, run:

  • clusterXL_admin up

Effect:

  • Enables ClusterXL
  • Does not initiate full synchronization

Method 3 - SmartView Monitor

To stop ClusterXL:

  1. Click the Cluster object.
  2. Select one of the member gateway branches.
  3. Right-click the cluster member.
  4. Select Down.

Effect:

  • Disables ClusterXL
  • Disables synchronization

To start ClusterXL, follow the same steps and select Up.

Effect:

  • Enables ClusterXL
  • Does not initiate full synchronization

After a failover, in Load Sharing mode the cluster distributes the load between the remaining active members. In High Availability mode, the cluster fails over to the standby member with the highest priority.

For more on initiating manual failovers, see: sk55081
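
For example, a typical manual failover on the member that should stop handling traffic might look like this (the exact state shown by cphaprob depends on your configuration):

# On the member that should stop handling traffic:
clusterXL_admin down
cphaprob state        # the local member should now report a down state

# After maintenance, bring the member back:
clusterXL_admin up
cphaprob state        # the member returns to Active or Standby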

Monitoring Synchronization (fw ctl pstat)

To monitor the synchronization mechanism on ClusterXL or third-party OPSEC certified clustering products:

  • Run this command on a cluster member: fw ctl pstat

The output of this command is a long list of statistics for the Security Gateway. At the end of the list there is a section called "Synchronization" which applies per Gateway Cluster member. Many of the statistics are counters that can only increase. A typical output is as follows:

Version: new

Status: Able to Send/Receive sync packets

Sync packets sent:

total : 3976, retransmitted : 0, retrans reqs : 58, acks : 97

Sync packets received:

total : 4290, were queued : 58, dropped by net : 47

retrans reqs : 0, received 0 acks

retrans reqs for illegal seq : 0

Callback statistics: handled 3 cb, average delay : 1, max delay : 2

Delta Sync memory usage: currently using XX KB mem

Callback statistics: handled 322 cb, average delay : 2, max delay : 8

Number of Pending packets currently held: 1

Packets released due to timeout: 18 

The meaning of each line in this printout is explained below.

 Version: new 

This line must appear if synchronization is configured. It indicates that new sync is working (as opposed to old sync from version 4.1).

Status: Able to Send/Receive sync packets 

If sync is unable to either send or receive packets, there is a problem. Sync may be temporarily unable to send or receive packets during boot, but this should not happen during normal operation. When performing full sync, sync packet reception may be interrupted.

Sync packets sent:

 total : 3976,  retransmitted : 0, retrans reqs : 58,  acks : 97 

The total number of sync packets sent is shown. In a correctly working cluster, the total number of sync packets is non-zero and increasing.

The cluster member sends a retransmission request when a sync packet is received out of order. This number may increase when under load.

Acks are the acknowledgments sent for received sync packets, when an acknowledgment was requested by another cluster member.

Sync packets received:

  total : 4290,  were queued : 58, dropped by net : 47 

The total number of sync packets received is shown. The queued packets figure increases when a sync packet is received that complies with one of the following conditions:

  1. The sync packet is received with a sequence number that does not follow the previously processed sync packet.
  2. The sync packet is fragmented. This is done to solve MTU restrictions.

This figure never decreases. A non-zero value does not indicate a problem.

The dropped by net number may indicate network congestion. This number may increase slowly under load. If this number increases too fast, a networking error may be interfering with the sync protocol. In that case, check the network.

retrans reqs : 0, received 0 acks

 retrans reqs for illegal seq : 0

 Callback statistics: handled 3 cb, average delay : 1,  max delay : 2 

This message refers to the number of received retransmission requests, in contrast to the transmitted retransmission requests in the section above. When this number grows very fast, it may indicate that the load on the machine is becoming too high for sync to handle.

Acks refer to the number of acknowledgments received for the "cb request" sync packets, which are sync packets with requests for acknowledgments.

Retrans reqs for illegal seq displays the number of retransmission requests for packets which are no longer in this member's possession. This may indicate a sync problem.

Callback statistics relate to received packets that involve Flush and Ack. This statistic only appears for a non-zero value.

The callback average delay is how much the packet was delayed in this member until it was released when the member received an ACK from all the other members. The delay happens because packets are held until all other cluster members have acknowledged reception of that sync packet.

This figure is measured in terms of numbers of packets. Normally this number should be small (~1-5). Larger numbers may indicate an overload of sync traffic, which causes connections that require sync acknowledgments to suffer slight latency.

dropped updates as a result of sync overload: 0 

In a heavily loaded system, the cluster member may drop synchronization updates sent from another cluster member.

Delta Sync memory usage: currently using XX KB mem 

Delta Sync memory usage only appears for a non-zero value. Delta sync requires memory only while full sync is occurring. Full sync happens when the system goes up (after a reboot, for example). At other times, Delta sync requires no memory because Delta sync updates are applied immediately. For information about Delta sync, see How State Synchronization Works.

 Number of Pending packets currently held: 1

   Packets released due to timeout: 18 

Number of Pending packets currently held only appears for a non-zero value. ClusterXL prevents out-of-state packets in non-sticky connections. It does this by holding packets until a SYN-ACK is received from all other active cluster members. If for some reason a SYN-ACK is not received, the Security Gateway on the cluster member will not release the packet, and the connection will not be established.

Packets released due to timeout only appears for a non-zero value. If the Number of Pending Packets is large (more than 100 pending packets), and the number of Packets released due to timeout is small, you should take action to reduce the number of pending packets. To solve this problem, see Reducing the Number of Pending Packets.
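
To watch only the part of the output that is relevant to cluster members, the synchronization section can be extracted. The sketch below assumes the section header begins with "Sync", which may vary slightly between versions:

#!/bin/sh
# Sketch only: print the synchronization statistics from "fw ctl pstat".
fw ctl pstat | awk '/^[Ss]ync/ { found = 1 } found { print }'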

Troubleshooting Synchronization

Introduction to cphaprob [-reset] syncstat

Heavily loaded clusters and clusters with geographically separated members pose special challenges. High connection rates, and large distances between the members can lead to delays that affect the operation of the cluster.

The cphaprob [-reset] syncstat command is a tool for monitoring the operation of the State Synchronization mechanism in highly loaded and distributed clusters. It can be used for both ClusterXL and third-party OPSEC certified clustering products.

The troubleshooting process is as follows:

  1. Run the cphaprob syncstat command.
  2. Examine and understand the output statistics.
  3. Tune the relevant synchronization global configuration parameters.
  4. Rerun the command, resetting the statistics counters using the -reset option:

    cphaprob -reset syncstat

  5. Examine the output statistics to see if the problem is solved.

The section Output of cphaprob [-reset] syncstat explains each of the output parameters, and also explains when the output represents a problem.

Any identified problem can be solved by performing one or more of the tips described in Synchronization Troubleshooting Options.
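
For example, steps 4 and 5 of this process might look like the following (the 10-minute wait is only an example; use an interval that is representative of your traffic):

# Reset the statistics counters after a tuning change:
cphaprob -reset syncstat

# Let the cluster run under representative load, then re-examine:
sleep 600
cphaprob syncstat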

Output of cphaprob [-reset] syncstat

The output parameters of the cphaprob syncstat command are shown below. The values (not shown) give an insight into the state and characteristics of the synchronization network. Each parameter and the meaning of its possible values is explained in the following sections.

Parameters:

Sync Statistics (IDs of F&A Peers - 1)

Other Member Updates

Sent Retransmission Requests

Avg Missing Updates per Request

Old or too-new Arriving Updates

Unsynced Missing Updates

Lost Sync Connection (num of events)

Timed out Sync Connection

Local Updates

Total Generated Updates

Recv Retransmission requests

Recv Duplicate Retrans request

Blocking Scenarios

Blocked Packets

Max Length of Sending Queue

Avg Length of Sending Queue

Hold Pkts Events

Unhold Pkt Events

Not Held Due to no Members

Max Held Duration (ticks)

Avg Held Duration (ticks)

Timers

Sync tick (ms)

CPHA tick (ms)

Queues

Sending Queue Size

Receiving Queue Size

Sync Statistics (IDs of F&A Peers - 1)

These statistics relate to the state synchronization mechanism. The F&A (Flush and Ack) peers are the cluster members that this member recognizes as being part of the cluster. The IDs correspond to IDs and IP addresses generated by the cphaprob state command.

Other Member Updates

The statistics in this section relate to updates generated by other cluster members, or to updates that were not received from the other members. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.

Sent Retransmission Requests

The number of retransmission requests, which were sent by this member. Retransmission requests are sent when certain packets (with a specified sequence number) are missing, while the sending member already received updates with advanced sequences.

A high value can imply connectivity problems.

Note - Compare the number of retransmission requests to the Total Generated Updates of the other members (see Total Generated Updates).

If its value is unreasonably high (more than 30% of the Total Generated Updates of other members), contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Avg Missing Updates per Request

Each retransmission request can contain up to 32 missing consecutive sequences. The value of this field is the average number of requested sequences per retransmission request.

More than 20 missing consecutive sequences per retransmission request can imply connectivity problems.

Note - If this value is unreasonably high, contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Old or too-new Arriving Updates

The number of arriving sync updates where the sequence number is too low, which implies it belongs to an old transmission, or too high, to the extent that it cannot belong to a new transmission.

Large values imply connectivity problems.

Note - See Enlarging the Receiving Queue. If this value is unreasonably high (more than 10% of the total updates sent), contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Unsynced Missing Updates

The number of missing sync updates for which the receiving member stopped waiting. It stops waiting when the difference in sequence numbers between the newly arriving updates and the missing updates is larger than the length of the receiving queue.

This value should be zero. However, the loss of some updates is acceptable as long as the number of lost updates is less than 1% of the total generated updates.

Note - To decrease the number of lost updates, expand the capacity of the Receiving Queue. See Enlarging the Receiving Queue.

Lost Sync Connection (num of events)

The number of events in which synchronization with another member was lost and regained due to either Security Policy installation on the other member, or a large difference between the expected and received sequence number.

The value should be zero. A positive value indicates connectivity problems.

Note - Allow the sync mechanism to handle large differences in sequence numbers by expanding the Receiving Queue capacity. See Enlarging the Receiving Queue.

Timed out Sync Connection

The number of events in which the member declares another member as not connected. The member is considered as disconnected because no ACK packets were received from that member for a period of time (one second), even though there are Flush and Ack packets being held for that member.

The value should be zero. Even with a round trip time on the sync network as high as 100ms, one second should be enough time to receive an ACK. A positive value indicates connectivity problems.

Note - Try enlarging the Sync Timer (see Enlarging the Sync Timer). However, you may well have to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Local Updates

The statistics in this section relate to updates generated by the local cluster member. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.

Total Generated Updates

  • The number of sync update packets generated by the sync mechanism since the statistics were last reset. Its value is the same as the difference between the sequence number when applying the -reset option, and the current sequence number.
  • Can have any value.

Recv Retransmission requests

  • The number of received retransmission requests. A member requests retransmissions when it is missing specified packets with lower sequence numbers than the ones already received.
  • A large value can imply connectivity problems.

Note - If this value is unreasonably high (more than 30% of the Total Generated Updates) contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Recv Duplicate Retrans request

  • The number of duplicated retransmission requests received by the member. Duplicate requests were already handled, and so are dropped.
  • A large value may indicate a network problem or storms on the sync network.

Note - If this value is unreasonably high (more than 30% of the Total Generated Updates) contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Blocking Scenarios

Under extremely heavy load conditions, the cluster may block new connections. This parameter shows the number of times that the cluster member started blocking new connections due to sync overload.

The member starts to block connections when its Sending Queue has reached its capacity threshold. The capacity threshold is calculated as 80% of the difference between the current sequence number and the sequence number for which the member received an ACK from all the other operating members.

A positive value indicates heavy load. In this case, observe the Blocked Packets value to see how many packets were blocked. Each dropped packet means one blocked connection.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

  • Apply the fw ctl set int fw_sync_block_new_conns 0 command to all the cluster members.

Note - The best way to handle a severe blocking connections problem is to enlarge the sending queue. See Enlarging the Sending Queue.

Another possibility is to decrease the timeout after which a member initiates an ACK. See Reconfiguring the Acknowledgment Timeout. This updates the sending queue capacity more accurately, thus making the blocking process more precise.

Blocked Packets

The number of packets that were blocked because the cluster member was blocking all new connections (see Blocking Scenarios). The number of blocked packets is usually one packet per new connection attempt.

A value higher than 5% of the Sending Queue (see Avg Length of Sending Queue) can imply a connectivity problem, or that ACKs are not being sent frequently enough.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

  • Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - The best way to handle a severe blocking connections problem is to enlarge the sending queue. See Enlarging the Sending Queue.

Another possibility is to decrease the timeout after which a member initiates an ACK. See Reconfiguring the Acknowledgment Timeout. This updates the sending queue capacity more accurately, thus making the blocking process more precise.

Max Length of Sending Queue

The size of the Sending Queue is fixed. By default it is 512 sync updates. As newer updates with higher sequence numbers enter the queue, older updates with lower sequence numbers drop off the end of the queue. An older update could be dropped from the queue before the member receives an ACK about that update from all the other members.

This parameter is the difference between the current sync sequence number and the last sequence number for which the member received an ACK from all the other members. The value of this parameter can therefore be greater than 512.

The value of this parameter should be less than 512. If it is larger than 512, there is not necessarily a sync problem. However, the member will be unable to answer retransmission requests for updates which are no longer in its queue.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

  • Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - Enlarge the Sending Queue to a value larger than this value. See Enlarging the Sending Queue.

Avg Length of Sending Queue

The average value of the Max Length of Sending Queue parameter, since reboot or since the Sync statistics were reset.

The value should be up to 80% of the size of the Sending Queue.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

  • Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - Enlarge the Sending Queue so that this value is not larger than 80% of the new queue size. See Enlarging the Sending Queue.

Hold Pkts Events

The number of occasions where the sync update required Flush and Ack, and so was kept within the system until an ACK arrived from all the other functioning members.

Should be the same as the number of Unhold Pkt Events.

Note - Contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Unhold Pkt Events

The number of occasions when the member received all the required ACKs from the other functioning members.

Should be the same as the number of Hold Pkts Events.

Note - Contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Not Held Due to no Members

The number of packets which should have been held within the system, but were released because there were no other operating members.

When the cluster has at least two live members, the value should be 0.

Note - The cluster has a connectivity problem. Examine the values of the parameters: Lost Sync Connection (num of events) and Timed out Sync Connection to find out why the member thinks that it is the only cluster member.

You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Max Held Duration (ticks)

The maximum time in ticks (one tick equals 100ms) for which a held packet was delayed in the system for Flush and Ack purposes.

It should not be higher than 50 (5 seconds), because the pending timeout mechanism releases held packets after a certain timeout. By default, the release timeout is 50 ticks. A high value indicates a connectivity problem between the members.

Note - Optionally change the default timeout by changing the value of the fwldbcast_pending_timeout global variable. See Advanced Cluster Configuration and Reducing the Number of Pending Packets.

Also, examine the parameter Timed out Sync Connection to understand why packets were held for a long time.

You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Avg Held Duration (ticks)

The average duration in ticks (tick equals 100ms) that held packets were delayed within the system for Flush and Ack purposes.

The average duration should be about the round-trip time of the sync network. A larger value indicates a connectivity problem.

Note - If the value is high, contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration, in order to examine the cause of the problem.

Timers

The Sync and CPHA timers perform sync and cluster related actions every fixed interval.

Sync tick (ms)

The Sync timer performs sync related actions every fixed interval. By default, the Sync timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

CPHA tick (ms)

The CPHA timer performs cluster related actions every fixed interval. By default, the CPHA timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

Queues

Each cluster member has two queues. The Sending Queue and the Receiving Queue.

Sending Queue Size

The Sending Queue on the cluster member stores locally generated sync updates. Updates in the Sending Queue are replaced by more recent updates. In a highly loaded cluster, updates are therefore kept for less time. If a member is asked to retransmit an update, it can only do so if the update is still in its Sending Queue. The default (and minimum) size of this queue is 512. Each member has one sending queue.

Receiving Queue Size

The Receiving Queue on the cluster member keeps the updates from each cluster member until it has received a complete sequence of updates. The default (and minimum) size of this queue is 256. Each member keeps a Receiving Queue for each of the peer members.

Synchronization Troubleshooting Options

This section describes the available troubleshooting options. Each option involves editing a global system parameter to reconfigure the system with a different value than the default.

Enlarging the Sending Queue

The Sending Queue on the cluster member stores locally generated sync updates. Updates in the Sending Queue are replaced by more recent updates. In a highly loaded cluster, updates are therefore kept for less time. If a member is asked to retransmit an update, it can only do so if the update is still in its Sending Queue. The default (and minimum) size of this queue is 512. Each member has one sending queue.

To enlarge the sending queue size:

  1. Change the value of the global parameter fw_sync_sending_queue_size. See Advanced Cluster Configuration.
  2. You must also make sure that the required queue size survives boot. See How to Configure Gateway to Survive a Boot.

Enlarging this queue allows the member to save more updates from other members. However, be aware that each saved update consumes memory. When changing this variable you should consider carefully the memory implications. Changes will only take effect after reboot.
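
A hedged sketch of such a change follows. The value 1024 is an arbitrary example, and adding the assignment to $FWDIR/boot/modules/fwkern.conf is a common way to make firewall kernel parameters survive boot; follow the procedure in How to Configure Gateway to Survive a Boot for your platform:

# Example only: request a larger sending queue at the next boot by adding
# the parameter to the kernel parameters file (path assumed, see above):
echo 'fw_sync_sending_queue_size=1024' >> $FWDIR/boot/modules/fwkern.conf
# The change takes effect after the member is rebooted.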

Enlarging the Receiving Queue

The Receiving Queue on the cluster member keeps the updates from each cluster member until it has received a complete sequence of updates. The default (and minimum) size of this queue is 256. Each member keeps a Receiving Queue for each of the peer members.

To enlarge the receiving queue size:

  1. Change the value of the global parameter fw_sync_recv_queue_size. See Advanced Cluster Configuration.
  2. You must also make sure that the required queue size survives boot. See How to Configure Gateway to Survive a Boot.

Enlarging this queue means that the member can save more updates from other members. However, be aware that each saved update consumes memory. When changing this variable you should carefully consider the memory implications. Changes will only take effect after reboot.

Enlarging the Sync Timer

The sync timer performs sync related actions every fixed interval. By default, the sync timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is therefore the minimum value.

To enlarge the sync timer:

  • Change the value of the global parameter fwha_timer_sync_res. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.

By default, fwha_timer_sync_res has a value of 1, meaning that the sync timer operates every base time unit (every 100ms). If you configure this variable to n, the timer will be operated every n*100ms.
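
For example, to double the sync timer interval to 200ms (the value 2 is only an example), the parameter can be changed at runtime:

# Set the sync timer resolution to 2 ticks (2 * 100ms = 200ms):
fw ctl set int fwha_timer_sync_res 2

# Verify the current value:
fw ctl get int fwha_timer_sync_res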

Enlarging the CPHA Timer

The CPHA timer performs cluster related actions every fixed interval. By default, the CPHA timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

If the cluster members are geographically separated from each other, set the CPHA timer to be around 10 times the round-trip delay of the sync network.

Enlarging this value increases the time it takes to detect a failover. For example, if detecting interface failure takes 0.3 seconds, and the timer is doubled to 200ms, the time needed to detect an interface failure is doubled to 0.6 seconds.

To enlarge the CPHA timer:

  • Change the value of the global parameter fwha_timer_cpha_res. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.

By default, fwha_timer_cpha_res has a value of 1, meaning that the CPHA timer operates every base time unit (every 100ms). If you configure this variable to n, the timer will be operated every n*100ms.

Reconfiguring the Acknowledgment Timeout

A cluster member deletes updates from its Sending Queue (described in Sending Queue Size) on a regular basis. This frees up space in the queue for more recent updates.

The cluster member deletes updates from this queue if it receives an ACK about the update from the peer member.

The peer member sends an ACK in one of two circumstances — on condition that the Block New Connections mechanism (described in Blocking New Connections Under Load) is active:

  • After receiving a certain number of updates.
  • If it didn't send an ACK for a certain time. This is important if the sync network has a considerable line delay, which can occur if the cluster members are geographically separated from each other.

To reconfigure the timeout after which the member sends an ACK:

  • Change the value of the global parameter fw_sync_ack_time_gap. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.

The default value for this variable is 10 ticks (10 * 100ms). Thus, if a member didn't send an ACK for a whole second, it will send an ACK for the updates it received.

Contact Technical Support

If the other recommendations do not help solve the problem, contact Technical Support for further assistance.

ClusterXL Error Messages

This section lists the ClusterXL error messages. For other, less common error messages, see SecureKnowledge solution sk23642.

General ClusterXL Error Messages

  • FW-1: changing local mode from <mode1> to <mode2> because of ID <machine_id>

    This log message can happen if the working mode of the cluster members is not the same, for example, if one machine is running High Availability, and another Load Sharing Multicast or Unicast mode. In this case, the internal ClusterXL mechanism tries to synchronize the configuration of the cluster members, by changing the working mode to the lowest common mode. The order of priority of the working modes (highest to lowest) is: 1. Synchronization only 2. Load Sharing 3. High Availability (Active Up) 4. High Availability (Primary Up).

  • CPHA: Received confirmations from more machines than the cluster size

    This log message can occur during policy installation on the cluster. It means that a serious configuration problem exists in that cluster. Probably some other cluster has been configured with identical parameters and both of them have common networks.

  • fwldbcast_timer: peer X probably stopped...

    This is caused when the member that printed this message stops hearing certain types of messages from member X. Verify that cphaprob state shows all members as active, and that fw ctl pstat shows that sync is configured correctly and working properly on all members. In such a case it is fair to assume that there was a temporary connectivity problem that has since been fixed. Some connections may suffer from connectivity problems due to that temporary synchronization problem between the two members. On the other hand, this message can indicate that the other member is really down.

  • FW-1: fwha_notify_interface: there are more than 4 IPs on interface <interface name> notifying only the first ones

    A member of the same cluster as the reporting machine has more than three virtual IP addresses defined on the same interface. This is not a supported configuration and will harm ClusterXL functionality.

  • Sync could not start because there is no sync license

    This is a license error message: If you have a basic Security Gateway license then sync is also licensed. Check the basic Security Gateway license using cplic print and cplic check.

  • FW-1: h_slink: an attempt to link to a link
    kbuf id not found
    fw_conn_post_inspect: fwconn_init_links failed

    Several problems of this sort can happen during a full sync session when there are connections that are opened and closed during the full sync process. Full sync is automatic as far as possible, but it is not fully automatic for reasons of performance: a gateway continues to process traffic even when it is serving as a full sync server. This can cause some insignificant problems, such as a connection being deleted twice, a link to an existing link, and so forth. It should not affect connectivity or cause security issues.

  • Error SEP_IKE_owner_outbound: other cluster member packet in outbound

    The cluster is not synchronized. This usually happens in OPSEC certified third-party load sharing products for which Support non-sticky connections is unchecked in the cluster object 3rd Party Configuration page.

  • FW-1: fwha_pnote_register: too many registering members, cannot register

    The critical device (also known as Problem Notification, or pnote) mechanism can only store up to 16 different devices. An attempt to configure the 17th device (either by editing the cphaprob.conf file or by using the cphaprob -d ... register command) will result in this message.

  • FW-1: fwha_pnote_register: <NAME> already registered (# <NUMBER>)

    Each device registered with the pnote mechanism must have a unique name. This message may appear when registering a new pnote device, and means that the device <NAME> is already registered with pnote number <NUMBER>.

  • FW-1: fwha_pnote_unregister: attempting to unregister an unregistered device <DEVICE NAME>

    Indicates an attempt to unregister a device which is not currently registered.

  • FW-1: alert_policy_id_mismatch: failed to send a log

    A log indicating that two or more members have different policy IDs could not be sent. Verify that all cluster members have the same policy (using fw stat). It is recommended to re-install the policy.

  • FW-1: fwha_receive_fwhap_msg: received incomplete HAP packet (read <number> bytes)

    This message can be received when ClusterXL hears CCP packets of clusters of version 4.1. In that case it can be safely ignored.

SmartView Tracker Active Mode Messages

The following error messages can appear in SmartView Tracker Active mode. These errors indicate that some entries may not have been successfully processed, which may lead to missing synchronization information on a cluster member and inaccurate reports in SmartView Tracker.

  • FW-1: fwlddist_adjust_buf: record too big for sync. update Y for table <id> failed. fwlddist_state=<val>

    Indicates a configuration problem on a clustered machine. Either synchronization is misconfigured, or there is a problem with transmitting packets on the sync interface. To get more information about the source of the problem:

  • Run fw ctl pstat (described in Monitoring Synchronization (fw ctl pstat)).
  • In ClusterXL clusters, run cphaprob -a if to get the statuses of the interfaces (see Monitoring Cluster Interfaces).

    To solve this problem, see Working with SmartView Tracker Active Mode.

  • FW-1: fwldbcast_flush: active connections is currently enabled and due to high load it is making sync too slow to function properly. X active updates were dropped

    Indicates that a clustered machine has dropped SmartView Tracker Active mode updates in order to maintain sync functionality. To solve this problem, see Working with SmartView Tracker Active Mode.

Sync Related Error Messages

  • FW-1: fwldbcast_retreq: machine <MACHINE_ID> sent a retrans request for seq <SEQ_NUM> which is no longer in my possession (current seq <SEQ_NUM>)

    This message appears when the local member receives a retransmission request for a sequence number which is no longer in its sending window. This message can indicate a sync problem if the requesting member did not receive the requested sequence.

  • FW-1: fwlddist_save: WARNING: this member will not be fully synchronized !
    FW-1: fwlddist_save: current delta sync memory during full sync has reached the maximum of <MEM_SIZE> MB
    FW-1: fwlddist_save: it is possible to set a different limit by changing fw_sync_max_saved_buf_mem value

    These messages may appear only during full sync. While full sync is in progress, delta sync updates are saved and applied only after the full sync process has finished. The memory used for saving delta sync updates can be limited by setting the fw_sync_max_saved_buf_mem variable to the desired limit.

  • FW-1: fwldbcast_flush: fwlddist_buf_ldbcast_unread is not being reset fast enough (ur=<UNREAD_LOC>,fwlddist_buflen=<BUFFER_LEN>)

    This message may appear under high load, when the sync buffer is filled faster than it is read. A possible solution is to enlarge fwlddist_buf_size, as described in Working with SmartView Tracker Active Mode.

  • FW-1: fwlddist_mode_change: Failed to send trap requesting full sync

    This message may appear due to a problem starting the full sync process, and indicates a severe problem. Contact Technical Support.

  • FW-1: State synchronization is in risk. Please examine your synchronization network to avoid further problems!

    This message can appear under extremely high load, when a synchronization update was permanently lost. A synchronization update is considered permanently lost when it cannot be retransmitted because it is no longer in the transmit queue of the update originator. This scenario does not mean that the Security Gateway will malfunction, but rather that there is a potential problem. The potential problem is harmless if the lost sync update relates to a connection that runs only on a single member, as in the case of unencrypted (clear) connections (except in the case of a failover, when the other member needs this update).

    The potential problem can be harmful when the lost sync update refers to a connection that is non-sticky (see Non-Sticky Connections), as is the case with encrypted connections. In this case the other cluster member(s) may start dropping packets relating to this connection, usually with a TCP out of state error message (see TCP Out-of-State Error Messages). When this happens, it is important to block new connections under high load, as explained in Blocking New Connections Under Load.

    The following error message is related to this one.

  • FW-1: fwldbcast_recv: delta sync connection with member <MACHINE_ID> was lost and regained. <UPDATES_NUM> updates were lost.
    FW-1: fwldbcast_recv: received sequence <SEQ_NUM> (fragm <FRAG_NUM>, index <INDEX_NUM>), last processed seq <SEQ_NUM>

    These messages appear when there was a temporary sync problem and some of the sync updates were not synchronized between the members. As a result some of the connections might not survive a failover.

    The previous error message is related to this one.

  • FW-1: The use of the non_sync_ports table is not recommended anymore. Refer to the user guide for configuring selective sync instead

    Previous versions used a kernel table called non_sync_ports to implement selective sync, which is a method of choosing services that don't need to be synchronized. Selective sync can now be configured from SmartDashboard. See Choosing Services That Do Not Require Synchronization.

TCP Out-of-State Error Messages

When the synchronization mechanism is under load, TCP packet out-of-state error messages may appear in the Information column of SmartView Tracker. This section explains how to resolve each error.

  • TCP packet out of state - first packet isn't SYN tcp_flags: FIN-ACK
    TCP packet out of state - first packet isn't SYN tcp_flags: FIN-PUSH-ACK

    These messages occur when a FIN packet is retransmitted after the connection has been deleted from the connection table. To solve the problem, in the SmartDashboard Global Properties for Stateful Inspection, enlarge the TCP end timeout from 20 seconds to 60 seconds. If necessary, also enlarge the connection table so that it does not fill completely.

  • SYN packet for established connection

    This message occurs when a SYN is received on an established connection and the sequence verifier is turned off. The sequence verifier is turned off for non-sticky connections in a cluster (or in SecureXL). Some applications close connections with an RST packet (in order to reuse ports). To solve the problem, enable this behavior for specific ports, or for all ports if a single port is not enough. For example, run the command:
    fw ctl set int fw_trust_rst_on_port <port>
    which instructs the Security Gateway to trust an RST arriving on the specified port.
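    As an illustration only (the port number 443 is a placeholder; use the port of the application that closes connections with RST):

    fw ctl get int fw_trust_rst_on_port        # display the current value
    fw ctl set int fw_trust_rst_on_port 443    # trust RST packets for connections on port 443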

Platform Specific Error Messages

IPSO Specific Error Messages

  • FW-1: fwha_nok_get_mc_mac_by_ip: received a NULL query
    FW-1: fwha_nok_get_mc_mac_by_ip: nokcl_get_clustermac returned unknown type <TYPE>

    These messages mean that automatic proxy ARP entries for static NAT configuration might not be properly installed.

  • FW-1: fwha_nokcl_sync_rx_f: received NULL mbuf from ipso. Packet dropped.
    FW-1: fwha_nokcl_sync_rx_f: received packet with illegal flag=<FLAG>. drop packet.

    These messages mean that an illegal CPHA packet was received and will be dropped. If this happens more than a few times during boot, the cluster is malfunctioning.

  • FW-1: fwha_nokcl_reregister_rx: unregister old magic mac values with IPSO.
    FW-1: fwha_nokcl_reregister_rx: new magic mac values <MAC,FORWARD MAC> registered successfully with IPSO.

    A notification that the operation fw ctl set int fwha_magic_mac succeeded.

  • FW-1: fwha_nokcl_reregister_rx: error in de-registration to the sync_rx (<ERR NUM>) new magic macs values will not be applied

    A notification that the operation fw ctl set int fwha_magic_mac failed. Previous MAC values will be retained.

  • FW-1: fwha_nokcl_creation_f: error in registration …
    FW-1: fwha_nok_init: NOT calling nokcl_register_creation since did not de-register yet.
    FW-1: fwha_nok_fini: failed nokcl_deregister_creation with rc=<ERROR NUM>

    These messages mean that an internal error occurred while registering with the IPSO clustering mechanism. Verify that the IPSO version is supported by this Security Gateway version and that the IPSO IP Clustering or VRRP cluster is configured properly.

  • FW-1: successfully (dis)connected to IPSO Clustering

    A notification that should normally be received during Security Gateway initialization and removal.

  • FW-1: fwha_pnote_register: noksr_register_with_status failed
    FW-1: fwha_IPSO_pnote_expiration: mismatch between IPSO device to ckp device <DEVICE NAME>
    FW-1: fwha_nokia_pnote_expiration: cannot find the expired device
    FW-1: fwha_noksr_report_wrapper: attempting to report an unregistered device <DEVICE NAME>

    These messages may appear as a result of a problem in the interaction between the IPSO and ClusterXL device monitoring mechanisms. A reboot should solve this problem. If the problem recurs, contact Check Point Technical Support.

Member Fails to Start After Reboot

If a reboot (or cpstop followed by cpstart) is performed on a cluster member while the cluster is under severe load, the member may fail to start correctly. The starting member will attempt to perform a full sync with the existing active member(s) and may in the process use up all its resources and available memory. This can lead to unexpected behavior.

To overcome this problem, define the maximum amount of memory that the member may use when starting up for synchronizing its connections with the active member. By default this amount is not limited. Estimate the amount of memory required as follows:  

Memory required (MB) for Full Sync:

                              New connections/second
Number of open connections    100       1000      5000      10,000
1000                          1.1       6.9
10000                         11        69        329
20000                         21        138       657       1305
50000                         53        345       1642      3264

Note - These figures were derived for cluster members using the Windows platform, with Pentium 4 processors running at 2.4 GHz.

For example, if the cluster holds 10,000 connections and the connection rate is 1000 connections/second, you will need 69 MB of memory for full sync.

Define the maximum amount of memory using the gateway global parameter: fw_sync_max_saved_buf_mem.

The units are in megabytes. For details, see Advanced Cluster Configuration.
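A minimal sketch, assuming the parameter can be changed on the fly with fw ctl set int (the value 69 is only the figure from the example above; for the persistent configuration method, see Advanced Cluster Configuration):

fw ctl set int fw_sync_max_saved_buf_mem 69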

 