Print Download PDF Send Feedback

Previous

Next

Troubleshooting Synchronization

Introduction to cphaprob [-reset] syncstat

Heavily loaded clusters and clusters with geographically separated members pose special challenges. High connection rates, and large distances between the members can lead to delays that affect the operation of the cluster.

The cphaprob [-reset] syncstat command is a tool for monitoring the operation of the State Synchronization mechanism in highly loaded and distributed clusters.

The troubleshooting process is as follows:

  1. Run the cphaprob syncstat command.
  2. Examine and understand the output statistics.
  3. Tune the relevant synchronization global configuration parameters.
  4. Rerun the command, resetting the statistics counters using the -reset option:

    cphaprob -reset syncstat

  5. Examine the output statistics to see if the problem is solved.

The section Output of cphaprob [-reset] syncstat explains each of the output parameters, and also explains when the output represents a problem.

Any identified problem can be solved by performing one or more of the tips described in Synchronization Troubleshooting Options.

Output of cphaprob [-reset] syncstat

The output parameters of the cphaprob syncstat command are shown below. The values (not shown) give an insight into the state and characteristics of the synchronization network. Each parameter and the meaning of its possible values is explained in the following sections.

Parameters:

Sync Statistics (IDs of F&A Peers - 1)

Other Member Updates

Sent Retransmission Requests

Avg Missing Updates per Request

Old or too-new Arriving Updates

Unsynchronized Missing Updates

Lost Sync Connection (num of events)

Timed out Sync Connection

Local Updates

Total Generated Updates

Recv Retransmission requests

Recv Duplicate Retrans request

Blocking Scenarios

Blocked Packets

Max Length of Sending Queue

Avg Length of Sending Queue

Hold Pkts Events

Unhold Pkt Events

Not Held Due to no Members

Max Held Duration (ticks)

Avg Held Duration (ticks)

Timers

Sync tick (ms)

CPHA tick (ms)

Queues

Sending Queue Size

Receiving Queue Size

Sync Statistics (IDs of F&A Peers - 1)

These statistics relate to the state synchronization mechanism. The F&A (Flush and Ack) peers are the cluster members that this member recognizes as being part of the cluster. The IDs correspond to IDs and IP addresses generated by the cphaprob state command.

Other Member Updates

The statistics in this section relate to updates generated by other cluster members, or to updates that were not received from the other members. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.

Sent Retransmission Requests

The number of retransmission requests, which were sent by this member. Retransmission requests are sent when certain packets (with a specified sequence number) are missing, while the sending member already received updates with advanced sequences.

A high value can imply connectivity problems.

Note - Compare the number of retransmission requests to the Total Regenerated Updates of the other members (see Total Generated Updates).

If its value is unreasonably high (more than 30% of the Total Generated Updates of other members), contact Check Point Support equipped with the entire output and a detailed description of the network topology and configuration.

Avg Missing Updates per Request

Each retransmission request can contain up to 32 missing consecutive sequences. The value of this field is the average number of requested sequences per retransmission request.

More than 20 missing consecutive sequences per retransmission request can imply connectivity problems.

Note - If this value is unreasonably high, contact Technical Support, equipped with the entire output and a detailed description of the network topology and configuration.

Old or too-new Arriving Updates

The number of arriving sync updates where the sequence number is too low, which implies it belongs to an old transmission, or too high, to the extent that it cannot belong to a new transmission.

Large values imply connectivity problems.

Note - See Enlarging the Receiving Queue. If this value is unreasonably high (more than 10% of the total updates sent), contact Check Point Support, equipped with the entire output and a detailed description of the network topology and configuration.

Unsynchronized Missing Updates

The number of missing sync updates, for which the receiving member stopped waiting. It stops waiting when the difference in sequence numbers between the newly arriving updates and the missing updates is larger than the length of the receiving queue.

This value should be zero. However, the loss of some updates is acceptable as long as the number of lost updates is less than 1% of the total generated updates.

Note - To decrease the number of lost updates, expand the capacity of the Receiving Queue. See Enlarging the Receiving Queue.

Lost Sync Connection (num of events)

The number of events, in which synchronization with another member was lost and regained due to either Security Policy installation on the other member, or a large difference between the expected and received sequence number.

The value should be zero. A positive value indicates connectivity problems.

Note - Allow the sync mechanism to handle large differences in sequence numbers by expanding the Receiving Queue capacity. See Enlarging the Receiving Queue.

Timed out Sync Connection

The number of events, in which the member declares another member as not connected. The member is considered as disconnected because no ACK packets were received from that member for a period of time (one second), even though there are Flush and Ack packets being held for that member.

The value should be zero. Even with a round trip time on the sync network as high as 100ms, one second should be enough time to receive an ACK. A positive value indicates connectivity problems.

Note - Try enlarging the Sync Timer (see Enlarging the Sync Timer). However, you may well have to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Local Updates

The statistics in this section relate to updates generated by the local cluster member. Updates inform about changes in the connections handled by the cluster member, and are sent from and to members. Updates are identified by sequence numbers.

Total Generated Updates

Recv Retransmission requests

Recv Duplicate Retrans request

Blocking Scenarios

Under extremely heavy load conditions, the cluster may block new connections. This parameter shows the number of times that the cluster member started blocking new connections due to sync overload.

The member starts to block connections when its Sending Queue has reached its capacity threshold. The capacity threshold is calculated as 80% of the difference between the current sequence number and the sequence number, for which the member received an ACK from all the other operating members.

A positive value indicates heavy load. In this case, observe the Blocked Packets to see how many packets we blocked. Each dropped packet means one blocked connection.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Run the fw ctl set int fw_sync_block_new_conns 0 command to all the cluster members.

Note - The best way to handle a severe blocking connections problem is to enlarge the Sending Queue. See Enlarging the Sending Queue.

Another possibility is to decrease the timeout, after which a cluster member initiates an ACK. See Reconfiguring the Acknowledgment Timeout. This updates the sending queue capacity more accurately, thus making the blocking process more precise.

For more information, see sk43896: Blocking New Connections Under Load in ClusterXL.

Blocked Packets

The number of packets that were blocked because the cluster member was blocking all new connections (see Blocking Scenarios). The number of blocked packets is usually one packet per new connection attempt.

A value higher than 5% of the Sending Queue (see Avg Length of Sending Queue) can imply a connectivity problem, or that ACKs are not being sent frequently enough.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Run the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - The best way to handle a severe blocking connections problem is to enlarge the Sending Queue. See Enlarging the Sending Queue.

Another possibility is to decrease the timeout, after which a cluster member initiates an ACK. See Reconfiguring the Acknowledgment Timeout. This updates the sending queue capacity more accurately, thus making the blocking process more precise.

For more information, see sk43896: Blocking New Connections Under Load in ClusterXL.

Max Length of Sending Queue

The size of the Sending Queue is fixed. By default it is 512 sync updates. As newer updates with higher sequence numbers enter the queue, older updates with lower sequence numbers drop off the end of the queue. An older update could be dropped from the queue before the member receives an ACK about that update from all the other members.

This parameter is the difference between the current sync sequence number and the last sequence number, for which the member received an ACK from all the other members. The value of this parameter can therefore be greater than 512.

The value of this parameter should be less than 512. If larger than 512, there is not necessarily a sync problem. However, the member will be unable to answer retransmission request for updates, which are no longer in its queue.

This parameter is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Run the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - Enlarge the Sending Queue to value larger than this value. See Enlarging the Sending Queue.

For more information, see sk43896: Blocking New Connections Under Load in ClusterXL.

Avg Length of Sending Queue

The average value of the Max Length of Sending Queue parameter, since reboot or since the Sync statistics were reset.

The value should be up to 80% of the size of the Sending Queue.

This parameters is only measured if the Block New Connections mechanism (described in Blocking New Connections Under Load) is active.

To activate the Block New Connections mechanism:

Run the fw ctl set int fw_sync_block_new_conns 0 command on all the cluster members.

Note - Enlarge the Sending Queue so that this value is not larger than 80% of the new queue size. See Enlarging the Sending Queue.

For more information, see sk43896: Blocking New Connections Under Load in ClusterXL.

Hold Pkts Events

The number of occasions where the sync update required Flush and Ack, and so was kept within the system until an ACK arrived from all the other functioning members.

Should be the same as the number of Unhold Pkt Events.

Note - Contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Unhold Pkt Events

The number of occasions when the member received all the required ACKs from the other functioning members.

Should be the same as the number of Hold Pkts Events.

Note - Contact Check Point Support equipped with the entire output and a detailed description of the network topology and configuration.

Not Held Due to no Members

The number of packets, which should have been held within the system, but were released because there were no other operating members.

When the cluster has at least two live members, the value should be 0.

Note - The cluster has a connectivity problem. Examine the values of the parameters: Lost Sync Connection (num of events) and Timed out Sync Connection to find out why the member thinks that it is the only cluster member.

You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Max Held Duration (ticks)

The maximum time in ticks (one tick equals 100ms), for which a held packet was delayed in the system for Flush and Ack purposes.

It should not be higher than 50 (5 seconds), because of the pending timeout mechanism, which releases held packets after a certain timeout. By default, the release timeout is 50 ticks. A high value indicates connectivity problem between the members.

Note - Optionally change the default timeout by changing the value of the fwldbcast_pending_timeout global variable. See Advanced Cluster Configuration and Reducing the Number of Pending Packets.

Also, examine the parameter Timed out Sync Connection to understand why packets were held for a long time.

You may also need to contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration.

Avg Held Duration (ticks)

The average duration in ticks (tick equals 100ms) that held packets were delayed within the system for Flush and Ack purposes.

The average duration should be about the round-trip time of the sync network. A larger value indicates connectivity problem.

Note - If the value is high, contact Technical Support equipped with the entire output and a detailed description of the network topology and configuration in order to examine the cause to the problem.

Timers

The Sync and CPHA timers perform sync and cluster related actions every fixed interval.

Sync tick (ms)

The Sync timer performs cluster related actions every fixed interval. By default, the Sync timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

CPHA tick (ms)

The CPHA timer performs cluster related actions every fixed interval. By default, the CPHA timer interval is 100ms. The base time unit is 100ms (or 1 tick), which is also the minimum value.

Queues

Each cluster member has two queues. The Sending Queue and the Receiving Queue.

Sending Queue Size

The Sending Queue on the cluster member stores locally generated sync updates. Updates in the Sending Queue are replaced by more recent updates. In a highly loaded cluster, updates are therefore kept for less time. If a member is asked to retransmit an update, it can only do so if the update is still in its Sending Queue. The default (and minimum) size of this queue is 512. Each member has one sending queue.

Receiving Queue Size

The Receiving Queue on the cluster member keeps the updates from each cluster member until it has received a complete sequence of updates. The default (and minimum) size of this queue is 256. Each member keeps a Receiving Queue for each of the peer members.