Troubleshooting Synchronization

Introduction to cphaprob [-reset] syncstat

Heavily loaded clusters and clusters with geographically separated members pose special challenges. High connection rates and large distances between the members can lead to delays that affect the operation of the cluster.

The cphaprob [-reset] syncstat command is a tool for monitoring the operation of the State Synchronization mechanism in highly loaded and distributed clusters. It can be used for both ClusterXL and third-party OPSEC certified clustering products.

The troubleshooting process is as follows:

Run the cphaprob syncstat command.
Examine and understand the output statistics.
Tune the relevant synchronization global configuration parameters.
Rerun the command, resetting the statistics counters using the -reset option:
cphaprob -reset syncstat
Examine the output statistics to see if the problem is solved.

The section Output of cphaprob [-reset] syncstat explains each of the output parameters, and also explains when the output represents a problem.

Any identified problem can be solved by applying one or more of the tips described in Synchronization Troubleshooting Options.

Output of cphaprob [-reset] syncstat

The output parameters of the cphaprob syncstat command are shown below. The values (not shown) give an insight into the state and characteristics of the synchronization network. Each parameter and the meaning of its possible values is explained in the following sections.

Parameters:

Sync Statistics (IDs of F&A Peers - 1)

Other Member Updates

Sent Retransmission Requests

Avg Missing Updates per Request

Old or too-new Arriving Updates

Unsynchronized Missing Updates

Lost Sync Connection (num of events)

Timed out Sync Connection

Local Updates

Total Generated Updates

Recv Retransmission requests

Recv Duplicate Retrans request

Blocking Scenarios

Blocked Packets

Max Length of Sending Queue

Avg Length of Sending Queue

Hold Pkts Events

Unhold Pkt Events

Not Held Due to no Members

Max Held Duration (ticks)

Avg Held Duration (ticks)

Sync Statistics (IDs of F&A Peers - 1)

These statistics relate to the state synchronization mechanism. The F&A (Flush and Ack) peers are the cluster members that this member recognizes as being part of the cluster. The IDs correspond to the IDs and IP addresses shown by the cphaprob state command.

Other Member Updates

The statistics in this section relate to updates generated by other cluster members, or to updates that were not received from the other members. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.

Sent Retransmission Requests

The number of retransmission requests that were sent by this member. A member sends a retransmission request when updates with specific sequence numbers are missing, even though it has already received updates with higher sequence numbers.

A high value can imply connectivity problems.

Note - Compare the number of retransmission requests to the Total Generated Updates of the other members (see Total Generated Updates).

If its value is unreasonably high (more than 30% of the Total Generated Updates of other members), contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
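The 30% guideline above can be expressed as a simple ratio check. The following is a minimal sketch, not part of the product: the function names are hypothetical, and both counter values are assumed to have been read manually from the cphaprob syncstat output of the relevant members.

```python
def retransmission_ratio(sent_requests: int, peer_generated_updates: int) -> float:
    """Sent Retransmission Requests as a fraction of the peers' Total Generated Updates."""
    if peer_generated_updates <= 0:
        raise ValueError("peer_generated_updates must be positive")
    return sent_requests / peer_generated_updates


def is_unreasonably_high(sent_requests: int, peer_generated_updates: int,
                         threshold: float = 0.30) -> bool:
    """True when the ratio exceeds the 30% guideline given in the text."""
    return retransmission_ratio(sent_requests, peer_generated_updates) > threshold
```

For example, 400 retransmission requests against 1000 updates generated by the other members gives a ratio of 0.4, which is above the guideline.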

Avg Missing Updates per Request

Each retransmission request can contain up to 32 missing consecutive sequences. The value of this field is the average number of requested sequences per retransmission request.

More than 20 missing consecutive sequences per retransmission request can imply connectivity problems.

Note - If this value is unreasonably high, contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Old or too-new Arriving Updates

The number of arriving sync updates where the sequence number is too low, which implies it belongs to an old transmission, or too high, to the extent that it cannot belong to a new transmission.

Large values imply connectivity problems.

Note - See Enlarging the Receiving Queue. If this value is unreasonably high (more than 10% of the total updates sent), contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Unsynchronized Missing Updates

The number of missing sync updates for which the receiving member stopped waiting. It stops waiting when the difference in sequence numbers between the newly arriving updates and the missing updates is larger than the length of the receiving queue.

This value should be zero. However, the loss of some updates is acceptable as long as the number of lost updates is less than 1% of the total generated updates.

Note - To decrease the number of lost updates, expand the capacity of the Receiving Queue. See Enlarging the Receiving Queue.
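The rule by which a member stops waiting for a missing update can be illustrated as follows. This is a simplified sketch under stated assumptions: the function name is hypothetical, sequence numbers are treated as plain increasing integers (wraparound is ignored), and the default of 256 is the Receiving Queue size given in this document.

```python
DEFAULT_RECV_QUEUE_SIZE = 256  # default Receiving Queue size stated in this document


def stops_waiting(missing_seq: int, newest_seq: int,
                  recv_queue_size: int = DEFAULT_RECV_QUEUE_SIZE) -> bool:
    """True when the gap between the newest arriving update and the missing
    update is larger than the Receiving Queue, so the update counts as lost."""
    return (newest_seq - missing_seq) > recv_queue_size
```

With the default queue, an update with sequence 1000 is given up once update 1300 arrives (a gap of 300). With a queue of 512, the member would still be waiting for it, which is why enlarging the Receiving Queue reduces this counter.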

Lost Sync Connection (num of events)

The number of events in which synchronization with another member was lost and regained due to either Security Policy installation on the other member, or a large difference between the expected and received sequence number.

The value should be zero. A positive value indicates connectivity problems.

Note - Allow the sync mechanism to handle large differences in sequence numbers by expanding the Receiving Queue capacity. See Enlarging the Receiving Queue.

Timed out Sync Connection

The number of events in which the member declares another member as not connected. The member is considered as disconnected because no ACK packets were received from that member for a period of time (one second), even though there are Flush and Ack packets being held for that member.

The value should be zero. Even with a round trip time on the sync network as high as 100ms, one second should be enough time to receive an ACK. A positive value indicates connectivity problems.

Note - Try enlarging the Sync Timer (see Enlarging the Sync Timer). However, you may well have to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Local Updates

The statistics in this section relate to updates generated by the local cluster member. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.

Total Generated Updates

The number of sync update packets generated by the sync mechanism since the statistics were last reset. Its value is the same as the difference between the sequence number when applying the -reset option, and the current sequence number.
Can have any value.

Recv Retransmission requests

The number of received retransmission requests. A member requests retransmissions when it is missing specified packets with lower sequence numbers than the ones already received.
A large value can imply connectivity problems.

Note - If this value is unreasonably high (more than 30% of the Total Generated Updates) contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Recv Duplicate Retrans request

The number of duplicated retransmission requests received by the member. Duplicate requests were already handled, and so are dropped.
A large value may indicate a network problem or storms on the sync network.

Blocking Scenarios

Under extremely heavy load conditions, the cluster may block new connections. This parameter shows the number of times that the cluster member started blocking new connections due to sync overload.

The member starts to block connections when its Sending Queue has reached its capacity threshold. The capacity threshold is calculated as 80% of the difference between the current sequence number and the sequence number for which the member received an ACK from all the other operating members.

A positive value indicates heavy load. In this case, observe the Blocked Packets parameter to see how many packets were blocked. Each blocked packet means one blocked connection.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Apply the fw ctl set int fw_sync_block_new_conns 0 command to all the cluster members.

Note - The best way to handle a severe blocking connections problem is to enlarge the sending queue. See Enlarging the Sending Queue.

Another possibility is to decrease the timeout after which a member initiates an ACK. See Reconfiguring the Acknowledgment Timeout. This updates the sending queue capacity more accurately, thus making the blocking process more precise.

Blocked Packets

The number of packets that were blocked because the cluster member was blocking all new connections (see Blocking Scenarios). The number of blocked packets is usually one packet per new connection attempt.

A value higher than 5% of the Sending Queue (see Avg Length of Sending Queue) can imply a connectivity problem, or that ACKs are not being sent frequently enough.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - The best way to handle a severe blocking connections problem is to enlarge the sending queue. See Enlarging the Sending Queue.

Max Length of Sending Queue

The size of the Sending Queue is fixed. By default it is 512 sync updates. As newer updates with higher sequence numbers enter the queue, older updates with lower sequence numbers drop off the end of the queue. An older update could be dropped from the queue before the member receives an ACK about that update from all the other members.

This parameter is the difference between the current sync sequence number and the last sequence number for which the member received an ACK from all the other members. The value of this parameter can therefore be greater than 512.

The value of this parameter should be less than 512. If it is larger than 512, there is not necessarily a sync problem. However, the member will be unable to answer retransmission requests for updates that are no longer in its queue.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - Enlarge the Sending Queue to a value larger than this value. See Enlarging the Sending Queue.
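The relationship between this parameter and the fixed queue size can be sketched as below. This is an illustration only: the function names are hypothetical, and the code assumes that the queue always holds the most recent queue_size updates and that sequence numbers do not wrap around.

```python
DEFAULT_SENDING_QUEUE_SIZE = 512  # default Sending Queue size stated in this document


def unacked_span(current_seq: int, last_acked_by_all_seq: int) -> int:
    """Difference between the current sequence number and the last sequence
    number acknowledged by all other members (the value this parameter tracks)."""
    return current_seq - last_acked_by_all_seq


def can_retransmit(requested_seq: int, current_seq: int,
                   queue_size: int = DEFAULT_SENDING_QUEUE_SIZE) -> bool:
    """True if the requested update is still among the newest queue_size updates."""
    return current_seq - queue_size < requested_seq <= current_seq
```

If the current sequence number is 10600 and the last fully acknowledged one is 10000, the span is 600. An update with sequence 10000 has already dropped out of a 512-entry queue, but would still be available in a queue of 1024.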

Avg Length of Sending Queue

The average value of the Max Length of Sending Queue parameter, since reboot or since the Sync statistics were reset.

The value should be up to 80% of the size of the Sending Queue.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Apply the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - Enlarge the Sending Queue so that this value is not larger than 80% of the new queue size. See Enlarging the Sending Queue.
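The 80% rule can be turned into a lower bound for the new queue size. The sketch below is an assumption-laden illustration, not a sizing tool: the function name is hypothetical, the input is the Avg Length of Sending Queue value read from the command output, and the result does not account for the memory cost of a larger queue.

```python
MIN_SENDING_QUEUE_SIZE = 512  # default and minimum size stated in this document


def min_sending_queue_size(avg_queue_length: int) -> int:
    """Smallest queue size for which avg_queue_length is at most 80% of the queue."""
    if avg_queue_length < 0:
        raise ValueError("avg_queue_length must not be negative")
    # ceil(avg / 0.8) == ceil(avg * 5 / 4), computed in integer arithmetic
    needed = (avg_queue_length * 5 + 3) // 4
    return max(MIN_SENDING_QUEUE_SIZE, needed)
```

An average length of 500 calls for a queue of at least 625 updates, while an average of 100 is already satisfied by the default size of 512.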

Hold Pkts Events

The number of occasions where the sync update required Flush and Ack, and so was kept within the system until an ACK arrived from all the other functioning members.

Should be the same as the number of Unhold Pkt Events.

Note - If the two values differ, contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Unhold Pkt Events

The number of occasions when the member received all the required ACKS from the other functioning members.

Should be the same as the number of Hold Pkts Events.

Note - If the two values differ, contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Not Held Due to no Members

The number of packets which should have been held within the system, but were released because there were no other operating members.

When the cluster has at least two live members, the value should be 0.

Note - A nonzero value in a cluster with at least two live members means that the cluster has a connectivity problem. Examine the values of the parameters Lost Sync Connection (num of events) and Timed out Sync Connection to find out why the member thinks that it is the only cluster member.

You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Max Held Duration (ticks)

The maximum time in ticks (one tick equals 100ms) for which a held packet was delayed in the system for Flush and Ack purposes.

It should not be higher than 50 (5 seconds), because the pending timeout mechanism releases held packets after a certain timeout. By default, the release timeout is 50 ticks. A high value indicates a connectivity problem between the members.

Note - Optionally change the default timeout by changing the value of the fwldbcast_pending_timeout global variable. See Advanced Cluster Configuration and Reducing the Number of Pending Packets.

Also, examine the parameter Timed out Sync Connection to understand why packets were held for a long time.

You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.
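The tick values in this output convert to time as shown in the sketch below. The names are hypothetical, and the code assumes the default base unit of 100ms per tick and the default release timeout of 50 ticks described above.

```python
TICK_MS = 100                       # one tick equals 100ms (default base unit)
DEFAULT_PENDING_TIMEOUT_TICKS = 50  # default release timeout stated in the text


def ticks_to_seconds(ticks: int) -> float:
    """Convert a duration in ticks to seconds."""
    return ticks * TICK_MS / 1000


def held_too_long(max_held_ticks: int,
                  pending_timeout_ticks: int = DEFAULT_PENDING_TIMEOUT_TICKS) -> bool:
    """True when the maximum held duration exceeds the release timeout."""
    return max_held_ticks > pending_timeout_ticks
```

A Max Held Duration of 50 ticks therefore corresponds to 5 seconds, the default upper limit.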

Avg Held Duration (ticks)

The average duration in ticks (one tick equals 100ms) that held packets were delayed within the system for Flush and Ack purposes.

The average duration should be about the round-trip time of the sync network. A larger value indicates a connectivity problem.

Note - If the value is high, contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration in order to examine the cause of the problem.

Timers

The Sync and CPHA timers perform sync and cluster related actions every fixed interval.

Sync tick (ms)

The Sync timer performs sync related actions every fixed interval. By default, the Sync timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

CPHA tick (ms)

The CPHA timer performs cluster related actions every fixed interval. By default, the CPHA timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

Queues

Each cluster member has two kinds of queues: the Sending Queue and the Receiving Queue.

Sending Queue Size

The Sending Queue on the cluster member stores locally generated sync updates. Updates in the Sending Queue are replaced by more recent updates. In a highly loaded cluster, updates are therefore kept for less time. If a member is asked to retransmit an update, it can only do so if the update is still in its Sending Queue. The default (and minimum) size of this queue is 512. Each member has one sending queue.

Receiving Queue Size

The Receiving Queue on the cluster member keeps the updates from each cluster member until it has received a complete sequence of updates. The default (and minimum) size of this queue is 256. Each member keeps a Receiving Queue for each of the peer members.

Synchronization Troubleshooting Options

The following sections describe the available troubleshooting options. Each option involves editing a global system configurable parameter to reconfigure the system with a value different from the default.

Enlarging the Sending Queue

To enlarge the sending queue size:

Change the value of the global parameter fw_sync_sending_queue_size. See Advanced Cluster Configuration.
You must also make sure that the required queue size survives boot. See How to Configure a Security Gateway to Survive a Boot.

Enlarging this queue allows the member to keep more of its locally generated updates, so that it can answer retransmission requests for older updates. However, be aware that each saved update consumes memory. When changing this variable you should carefully consider the memory implications. Changes will only take effect after reboot.

Enlarging the Receiving Queue

To enlarge the receiving queue size:

Change the value of the global parameter fw_sync_recv_queue_size. See Advanced Cluster Configuration.
You must also make sure that the required queue size survives boot. See How to Configure a Security Gateway to Survive a Boot.

Enlarging this queue means that the member can save more updates from other members. However, be aware that each saved update consumes memory. When changing this variable you should carefully consider the memory implications. Changes will only take effect after reboot.

Enlarging the Sync Timer

The sync timer performs sync related actions every fixed interval. By default, the sync timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is therefore the minimum value.

To enlarge the sync timer:

Change the value of the global parameter fwha_timer_sync_res. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.

By default, fwha_timer_sync_res has a value of 1, meaning that the sync timer operates every base time unit (every 100ms). If you configure this variable to n, the timer will be operated every n*100ms.

Enlarging the CPHA Timer

The CPHA timer performs cluster related actions every fixed interval. By default, the CPHA timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

If the cluster members are geographically separated from each other, set the CPHA timer to be around 10 times the round-trip delay of the sync network.

Enlarging this value increases the time it takes to detect a failure. For example, if detecting an interface failure takes 0.3 seconds, and the timer is doubled to 200ms, the time needed to detect an interface failure is doubled to 0.6 seconds.

To enlarge the CPHA timer:

Change the value of the global parameter fwha_timer_cpha_res. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.

By default, fwha_timer_cpha_res has a value of 1, meaning that the CPHA timer operates every base time unit (every 100ms). If you configure this variable to n, the timer will be operated every n*100ms.
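The arithmetic behind choosing a value for fwha_timer_cpha_res can be sketched as follows. This is only an illustration: the function names are hypothetical, rounding up to a whole tick is an assumption (the text says only "around 10 times" the round-trip delay), and the detection-time scaling simply generalizes the 0.3 second example above.

```python
import math

BASE_TICK_MS = 100  # base time unit: 100ms, or 1 tick


def cpha_timer_res_for_rtt(rtt_ms: float, factor: int = 10) -> int:
    """Suggest n so that n * 100ms is at least factor times the sync RTT (minimum 1)."""
    return max(1, math.ceil(rtt_ms * factor / BASE_TICK_MS))


def timer_interval_ms(n: int) -> int:
    """Interval at which the timer operates when the variable is set to n."""
    return n * BASE_TICK_MS


def scaled_detection_time(base_detection_s: float, n: int) -> float:
    """Failure detection time after multiplying the default timer interval by n."""
    return base_detection_s * n
```

For a sync network with a 25ms round-trip delay, this suggests n = 3 (a 300ms interval), and a failure that takes 0.3 seconds to detect by default would then take about 0.9 seconds.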

Reconfiguring the Acknowledgment Timeout

A cluster member deletes updates from its Sending Queue (described in Sending Queue Size) on a regular basis. This frees up space in the queue for more recent updates.

The cluster member deletes updates from this queue if it receives an ACK about the update from the peer member.

Provided that the Block New Connections mechanism (described in Blocking New Connections Under Load) is active, the peer member sends an ACK in one of two circumstances:

After receiving a certain number of updates.
If it didn't send an ACK for a certain time. This is important if the sync network has a considerable line delay, which can occur if the cluster members are geographically separated from each other.

To reconfigure the timeout after which the member sends an ACK:

Change the value of the global parameter fw_sync_ack_time_gap. See Advanced Cluster Configuration. The value of this variable can be changed while the system is working. A reboot is not needed.

The default value for this variable is 10 ticks (10 * 100ms). Thus, if a member didn't send an ACK for a whole second, it will send an ACK for the updates it received.
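The two ACK conditions can be modeled as in the sketch below. This is a simplified illustration and not the actual implementation: the function and parameter names are hypothetical, and update_count_trigger stands in for the "certain number of updates", whose value this document does not state.

```python
DEFAULT_ACK_TIME_GAP_TICKS = 10  # default of fw_sync_ack_time_gap: 10 ticks (one second)


def should_send_ack(updates_since_last_ack: int, ticks_since_last_ack: int,
                    update_count_trigger: int,
                    ack_time_gap_ticks: int = DEFAULT_ACK_TIME_GAP_TICKS) -> bool:
    """Decide whether a member should acknowledge the updates it has received."""
    if updates_since_last_ack <= 0:
        return False  # nothing received, so there is nothing to acknowledge
    if updates_since_last_ack >= update_count_trigger:
        return True   # condition 1: enough updates have arrived
    # condition 2: no ACK has been sent for the configured time gap
    return ticks_since_last_ack >= ack_time_gap_ticks
```

Lowering ack_time_gap_ticks in this model makes the second condition fire sooner, which corresponds to decreasing fw_sync_ack_time_gap on a sync network with a long line delay.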

Contact Technical Support

If the other recommendations do not help solve the problem, contact Technical Support for further assistance.

Troubleshooting Dynamic Routing (routeD) Pnotes

In R76, Check Point added a new ClusterXL Pnote called routeD that works with Dynamic Routing for Gaia clusters. This Pnote makes sure that traffic is not assigned to a cluster member before it is ready to handle the traffic. The Gaia RouteD daemon handles all routing (static and dynamic) operations.

There can be an issue with Dynamic Routing that shows one or more of these symptoms:

Cluster IP address connectivity problems
Unexpected failovers
SmartView Tracker logs show that a member is down because a routeD Pnote is set to problem.
The cphaprob list command shows:
Device Name: routed
Registration number: 4
Timeout: none
Current state: problem

These are some of the common causes of this issue:

Cluster misconfiguration
Port 2010 is blocked by the Firewall
The routeD daemon did not get all of its routes
The routeD daemon did not start correctly

Standard RouteD Pnote Behavior

Typically, the routed Pnote reports its state as Problem when:

A cluster member fails over
A cluster member reboots
There is an inconsistency in the Dynamic Routing configuration on cluster members

The routed Pnote reports its state as Ok when:

A ClusterXL member tells the RouteD daemon that it is a Master
The RouteD daemon gets the entire routing state from the Master

Basic Troubleshooting Steps

Run cphaprob -a if to make sure that your cluster and member interfaces are configured correctly.
Run dbset routed:instance:default:traceoptions:traceoptions:Cluster to generate RouteD cluster messages. The messages are located at /var/log/routed/log.
Make sure that Firewall rules do not block TCP port 2010.
Make sure that the RouteD daemon is running on the Active member.
Look for a router-id mismatch in the OSPF configuration.
Make sure that the OSPF interface is up on the Standby member.

For advanced troubleshooting procedures and more information, see sk92787.

For troubleshooting OSPF and the RouteD daemon, see sk84520.