Heavily loaded clusters and clusters with geographically separated members pose special challenges. High connection rates and large distances between the members can lead to delays that affect the operation of the cluster.
The cphaprob [-reset] syncstat command is a tool for monitoring the operation of the State Synchronization mechanism in highly loaded and distributed clusters. It can be used for both ClusterXL and third-party OPSEC certified clustering products.
The troubleshooting process is as follows:
Resetting the statistics counters using the -reset option: cphaprob -reset syncstat
The section Output of cphaprob [-reset] syncstat explains each of the output parameters, and also explains when the output represents a problem.
Any identified problem can be solved by performing one or more of the tips described in Synchronization Troubleshooting Options.
The output parameters of the cphaprob syncstat command are shown below. The values (not shown) give an insight into the state and characteristics of the synchronization network. Each parameter and the meaning of its possible values are explained in the following sections.
Parameters:
These statistics relate to the state synchronization mechanism. The F&A (Flush and Ack) peers are the cluster members that this member recognizes as being part of the cluster. The IDs correspond to IDs and IP addresses generated by the cphaprob state command.
The statistics in this section relate to updates generated by other cluster members, or to updates that were not received from the other members. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.
The number of retransmission requests that this member sent. Retransmission requests are sent when certain packets (with a specified sequence number) are missing, while the sending member has already received updates with later sequence numbers.
A high value can imply connectivity problems.
Note - Compare the number of retransmission requests to the Total Regenerated Updates of the other members (see Total Generated Updates). If its value is unreasonably high (more than 30% of the Total Generated Updates of other members), contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
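The 30% comparison in the note above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the two counters are assumed to have been read manually from the cphaprob syncstat output of this member and of its peers.

```python
def retransmission_share_percent(sent_requests: int, peers_generated_updates: int) -> float:
    """Retransmission requests sent by this member, as a percentage of the
    Total Generated Updates reported by the other members."""
    if peers_generated_updates <= 0:
        raise ValueError("peers_generated_updates must be positive")
    return 100.0 * sent_requests / peers_generated_updates


def is_unreasonably_high(sent_requests: int, peers_generated_updates: int,
                         threshold_percent: float = 30.0) -> bool:
    """True when the share exceeds the threshold (30% in the text)."""
    return retransmission_share_percent(sent_requests, peers_generated_updates) > threshold_percent


if __name__ == "__main__":
    # Hypothetical counter values, not taken from a real cluster.
    print(retransmission_share_percent(250, 1000))
    print(is_unreasonably_high(400, 1000))
```

The same comparison applies to the other percentage thresholds in this section by passing a different threshold_percent.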
Each retransmission request can contain up to 32 missing consecutive sequences. The value of this field is the average number of requested sequences per retransmission request.
More than 20 missing consecutive sequences per retransmission request can imply connectivity problems.
Note - If this value is unreasonably high, contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.
The number of arriving sync updates where the sequence number is too low, which implies it belongs to an old transmission, or too high, to the extent that it cannot belong to a new transmission.
Large values imply connectivity problems.
Note - See Enlarging the Receiving Queue. If this value is unreasonably high (more than 10% of the total updates sent), contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.
The number of missing sync updates for which the receiving member stopped waiting. It stops waiting when the difference in sequence numbers between the newly arriving updates and the missing updates is larger than the length of the receiving queue.
This value should be zero. However, the loss of some updates is acceptable as long as the number of lost updates is less than 1% of the total generated updates.
Note - To decrease the number of lost updates, expand the capacity of the Receiving Queue. See Enlarging the Receiving Queue.
The number of events in which synchronization with another member was lost and regained due to either Security Policy installation on the other member, or a large difference between the expected and received sequence number.
The value should be zero. A positive value indicates connectivity problems.
Note - Allow the sync mechanism to handle large differences in sequence numbers by expanding the Receiving Queue capacity. See Enlarging the Receiving Queue.
The number of events in which the member declares another member as not connected. The member is considered as disconnected because no ACK packets were received from that member for a period of time (one second), even though there are Flush and Ack packets being held for that member.
The value should be zero. Even with a round trip time on the sync network as high as 100ms, one second should be enough time to receive an ACK. A positive value indicates connectivity problems.
Note - Try enlarging the Sync Timer (see Enlarging the Sync Timer). However, you may well have to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
The statistics in this section relate to updates generated by the local cluster member. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.
Note - If this value is unreasonably high (more than 30% of the Total Generated Updates) contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.
Note - If this value is unreasonably high (more than 30% of the Total Generated Updates) contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.
Under extremely heavy load conditions, the cluster may block new connections. This parameter shows the number of times that the cluster member started blocking new connections due to sync overload.
The member starts to block connections when its Sending Queue has reached its capacity threshold. The capacity threshold is calculated as 80% of the difference between the current sequence number and the sequence number for which the member received an ACK from all the other operating members.
A positive value indicates heavy load. In this case, observe the Blocked Packets to see how many packets were blocked. Each dropped packet means one blocked connection.
This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.
To activate the Block New Connections mechanism:
Apply the fw ctl set int fw_sync_block_new_conns 0 command to all the cluster members.
Note - The best way to handle a severe blocking connections problem is to enlarge the sending queue. See Enlarging the Sending Queue. Another possibility is to decrease the timeout after which a member initiates an ACK. See Reconfiguring the Acknowledgment Timeout. This updates the sending queue capacity more accurately, thus making the blocking process more precise.
The number of packets that were blocked because the cluster member was blocking all new connections (see Blocking Scenarios). The number of blocked packets is usually one packet per new connection attempt.
A value higher than 5% of the Sending Queue (see Avg Length of Sending Queue) can imply a connectivity problem, or that ACKs are not being sent frequently enough.
This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.
To activate the Block New Connections mechanism:
Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.
Note - The best way to handle a severe blocking connections problem is to enlarge the sending queue. See Enlarging the Sending Queue. Another possibility is to decrease the timeout after which a member initiates an ACK. See Reconfiguring the Acknowledgment Timeout. This updates the sending queue capacity more accurately, thus making the blocking process more precise.
The size of the Sending Queue is fixed. By default it is 512 sync updates. As newer updates with higher sequence numbers enter the queue, older updates with lower sequence numbers drop off the end of the queue. An older update could be dropped from the queue before the member receives an ACK about that update from all the other members.
This parameter is the difference between the current sync sequence number and the last sequence number for which the member received an ACK from all the other members. The value of this parameter can therefore be greater than 512.
The value of this parameter should be less than 512. If it is larger than 512, there is not necessarily a sync problem. However, the member will be unable to answer retransmission requests for updates that are no longer in its queue.
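The relationship between the sequence numbers and the fixed queue size can be sketched as below. This is an illustration under stated assumptions, not the product's implementation: the function names are hypothetical, the queue is assumed to hold exactly the newest queue_size updates, and sequence-number wraparound is ignored.

```python
DEFAULT_SENDING_QUEUE_SIZE = 512  # default size stated in the text


def unacked_span(current_seq: int, last_acked_by_all_seq: int) -> int:
    """Difference between the current sync sequence number and the last
    sequence number for which all other members sent an ACK."""
    if last_acked_by_all_seq > current_seq:
        raise ValueError("acknowledged sequence cannot be ahead of the current sequence")
    return current_seq - last_acked_by_all_seq


def can_retransmit(requested_seq: int, current_seq: int,
                   queue_size: int = DEFAULT_SENDING_QUEUE_SIZE) -> bool:
    """True if the requested update is still among the newest queue_size
    updates (assumption: older updates have dropped off the queue)."""
    return current_seq - queue_size < requested_seq <= current_seq


if __name__ == "__main__":
    # Hypothetical sequence numbers: the span exceeds the 512-entry queue.
    print(unacked_span(10_600, 10_000))
    print(can_retransmit(10_100, 10_600))  # still inside the queue window
    print(can_retransmit(10_050, 10_600))  # already dropped from the queue
```

In the hypothetical numbers above, the span is larger than the queue, so some unacknowledged updates can no longer be retransmitted even though sync itself has not failed.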
This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.
To activate the Block New Connections mechanism:
Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.
Note - Enlarge the Sending Queue to a value larger than this value. See Enlarging the Sending Queue.
The average value of the Max Length of Sending Queue parameter, since reboot or since the Sync statistics were reset.
The value should be up to 80% of the size of the Sending Queue.
This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.
To activate the Block New Connections mechanism:
Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.
Note - Enlarge the Sending Queue so that this value is not larger than 80% of the new queue size. See Enlarging the Sending Queue.
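As a rough sizing aid for the 80% rule above, the smallest queue size that keeps a measured average within the limit can be computed as follows. The function name is hypothetical; the 512 minimum is the default (and minimum) Sending Queue size stated in this document, and integer arithmetic is used to avoid rounding surprises.

```python
def min_sending_queue_size(avg_length: int, target_percent: int = 80,
                           minimum_size: int = 512) -> int:
    """Smallest queue size for which avg_length is at most target_percent
    of the queue, never below the documented minimum of 512."""
    if avg_length < 0:
        raise ValueError("avg_length must not be negative")
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range 1..100")
    # Integer ceiling of avg_length * 100 / target_percent.
    needed = (avg_length * 100 + target_percent - 1) // target_percent
    return max(minimum_size, needed)


if __name__ == "__main__":
    # Hypothetical averages read from the syncstat output.
    print(min_sending_queue_size(300))
    print(min_sending_queue_size(600))
```

A result equal to the minimum means the default queue already satisfies the rule and no enlargement is needed on that account.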
The number of occasions where the sync update required Flush and Ack, and so was kept within the system until an ACK arrived from all the other functioning members.
Should be the same as the number of Unhold Pkt Events.
Note - Contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
The number of occasions when the member received all the required ACKS from the other functioning members.
Should be the same as the number of Hold Pkts Events.
Note - Contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
The number of packets which should have been held within the system, but were released because there were no other operating members.
When the cluster has at least two live members, the value should be 0.
Note - The cluster has a connectivity problem. Examine the values of the parameters: Lost Sync Connection (num of events) and Timed out Sync Connection to find out why the member thinks that it is the only cluster member. You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
The maximum time in ticks (one tick equals 100ms) for which a held packet was delayed in the system for Flush and Ack purposes.
It should not be higher than 50 (5 seconds), because of the pending timeout mechanism which releases held packets after a certain timeout. By default, the release timeout is 50 ticks. A high value indicates a connectivity problem between the members.
Note - Optionally change the default timeout by changing the value of the fwldbcast_pending_timeout global variable. See Advanced Cluster Configuration and Reducing the Number of Pending Packets. Also, examine the parameter Timed out Sync Connection to understand why packets were held for a long time. You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
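The tick arithmetic used for the delay parameters can be sketched as below. The helper names are hypothetical; the 100ms tick and the 50-tick default release timeout are the values given in the text.

```python
TICK_MS = 100                       # one tick equals 100ms, as stated in the text
DEFAULT_PENDING_TIMEOUT_TICKS = 50  # default release timeout from the text


def ticks_to_seconds(ticks: int) -> float:
    """Convert a duration in ticks to seconds."""
    return ticks * TICK_MS / 1000


def exceeds_release_timeout(max_delay_ticks: int,
                            timeout_ticks: int = DEFAULT_PENDING_TIMEOUT_TICKS) -> bool:
    """True if a reported maximum delay is above the release timeout,
    which the pending timeout mechanism should normally prevent."""
    return max_delay_ticks > timeout_ticks


if __name__ == "__main__":
    print(ticks_to_seconds(DEFAULT_PENDING_TIMEOUT_TICKS))
    print(exceeds_release_timeout(50))
```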
The average duration in ticks (one tick equals 100ms) that held packets were delayed within the system for Flush and Ack purposes.
The average duration should be about the round-trip time of the sync network. A larger value indicates a connectivity problem.
Note - If the value is high, contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration in order to examine the cause of the problem.
The Sync and CPHA timers perform sync and cluster related actions every fixed interval.
The Sync timer performs sync related actions every fixed interval. By default, the Sync timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.
The CPHA timer performs cluster related actions every fixed interval. By default, the CPHA timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.
Each cluster member has two queues: the Sending Queue and the Receiving Queue.
The Sending Queue on the cluster member stores locally generated sync updates. Updates in the Sending Queue are replaced by more recent updates. In a highly loaded cluster, updates are therefore kept for less time. If a member is asked to retransmit an update, it can only do so if the update is still in its Sending Queue. The default (and minimum) size of this queue is 512. Each member has one sending queue.
The Receiving Queue on the cluster member keeps the updates from each cluster member until it has received a complete sequence of updates. The default (and minimum) size of this queue is 256. Each member keeps a Receiving Queue for each of the peer members.
The following sections describe the available troubleshooting options. Each option involves editing a global system configurable parameter to set a value other than the default.
The Sending Queue on the cluster member stores locally generated sync updates. Updates in the Sending Queue are replaced by more recent updates. In a highly loaded cluster, updates are therefore kept for less time. If a member is asked to retransmit an update, it can only do so if the update is still in its Sending Queue. The default (and minimum) size of this queue is 512. Each member has one sending queue.
To enlarge the sending queue size:
Enlarging this queue allows the member to save more updates from other members. However, be aware that each saved update consumes memory. When changing this variable you should consider carefully the memory implications. Changes will only take effect after reboot.
The Receiving Queue on the cluster member keeps the updates from each cluster member until it has received a complete sequence of updates. The default (and minimum) size of this queue is 256. Each member keeps a Receiving Queue for each of the peer members.
To enlarge the receiving queue size:
Enlarging this queue means that the member can save more updates from other members. However, be aware that each saved update consumes memory. When changing this variable you should carefully consider the memory implications. Changes will only take effect after reboot.
The sync timer performs sync related actions every fixed interval. By default, the sync timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is therefore the minimum value.
To enlarge the sync timer:
Change the value of the global parameter fwha_timer_sync_res. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.
By default, fwha_timer_sync_res has a value of 1, meaning that the sync timer operates every base time unit (every 100ms). If you configure this variable to n, the timer operates every n*100ms.
The CPHA timer performs cluster related actions every fixed interval. By default, the CPHA timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.
If the cluster members are geographically separated from each other, set the CPHA timer to be around 10 times the round-trip delay of the sync network.
Enlarging this value increases the time it takes to detect a failover. For example, if detecting interface failure takes 0.3 seconds, and the timer is doubled to 200ms, the time needed to detect an interface failure is doubled to 0.6 seconds.
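The two rules of thumb above (a timer of around 10 times the round-trip delay, and detection time that scales with the timer) can be sketched as follows. The function names are hypothetical, and rounding up to a whole number of 100ms ticks is an assumption, since the text only says "around 10 times".

```python
import math

BASE_TICK_MS = 100  # base time unit from the text; also the minimum interval


def suggested_timer_multiplier(sync_rtt_ms: float, factor: int = 10) -> int:
    """Smallest n such that n * 100ms is at least factor * RTT, never below 1.
    Rounding up is an assumption made for this sketch."""
    if sync_rtt_ms < 0:
        raise ValueError("sync_rtt_ms must not be negative")
    return max(1, math.ceil(factor * sync_rtt_ms / BASE_TICK_MS))


def scaled_detection_time_ms(baseline_ms: int, multiplier: int) -> int:
    """Failure detection time after multiplying the timer interval,
    assuming detection time scales linearly as in the text's example."""
    return baseline_ms * multiplier


if __name__ == "__main__":
    n = suggested_timer_multiplier(25)       # hypothetical 25ms round trip
    print(n, n * BASE_TICK_MS)               # multiplier and interval in ms
    print(scaled_detection_time_ms(300, 2))  # the 0.3s example, timer doubled
```

The multiplier corresponds to the n in the n*100ms interval described for the timer parameter; a larger n trades faster tolerance of sync delay for slower failure detection.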
To enlarge the CPHA timer:
Change the value of the global parameter fwha_timer_cpha_res. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.
By default, fwha_timer_cpha_res has a value of 1, meaning that the CPHA timer operates every base time unit (every 100ms). If you configure this variable to n, the timer operates every n*100ms.
A cluster member deletes updates from its Sending Queue (described in Sending Queue Size) on a regular basis. This frees up space in the queue for more recent updates.
The cluster member deletes updates from this queue if it receives an ACK about the update from the peer member.
Provided that the Block New Connections mechanism (described in Blocking New Connections Under Load) is active, the peer member sends an ACK in one of two circumstances:
To reconfigure the timeout after which the member sends an ACK:
Change the value of the global parameter fw_sync_ack_time_gap. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.
The default value for this variable is 10 ticks (10 * 100ms). Thus, if a member didn't send an ACK for a whole second, it will send an ACK for the updates it received.
If the other recommendations do not help solve the problem, contact Technical Support for further assistance.
In R76, Check Point added a new ClusterXL Pnote called routeD that works with Dynamic Routing for Gaia clusters. This Pnote makes sure that traffic is not assigned to a cluster member before it is ready to handle the traffic. The Gaia RouteD daemon handles all routing (static and dynamic) operations.
There can be an issue with Dynamic Routing that shows one or more of these symptoms:
The cphaprob list command shows:
Device Name: routed
Registration number: 4
Timeout: none
Current state: problem
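When many Pnotes are registered, the listing can be scanned with a short script. The sketch below assumes that the output consists of "Key: value" lines shaped like the four fields shown above, with each device starting at a "Device Name" line; the real output may contain additional lines, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_pnotes(text: str) -> List[Dict[str, str]]:
    """Group 'Key: value' lines into one dictionary per device.
    A new device is assumed to start at each 'Device Name' line."""
    devices: List[Dict[str, str]] = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or free-text lines
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Device Name":
            devices.append({})
        if devices:
            devices[-1][key] = value
    return devices


def devices_in_problem(text: str) -> List[str]:
    """Names of devices whose 'Current state' is 'problem'."""
    return [d["Device Name"] for d in parse_pnotes(text)
            if d.get("Current state", "").lower() == "problem"]


SAMPLE = """Device Name: routed
Registration number: 4
Timeout: none
Current state: problem
"""

if __name__ == "__main__":
    print(devices_in_problem(SAMPLE))
```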
These are some of the common causes of this issue:
Typically, the routed Pnote reports its state as Problem when:
The routed Pnote reports its state as Ok when:
Run cphaprob -a if to make sure that your cluster and member interfaces are configured correctly.
Run dbset routed:instance:default:traceoptions:traceoptions:Cluster to generate RouteD cluster messages. The messages are located at /var/log/routed/log.
Look for a router-id mismatch in the OSPF configuration.
For advanced troubleshooting procedures and more information, see sk92787.
For troubleshooting OSPF and the RouteD daemon, see sk84520.