Monitoring and Troubleshooting Clusters

In This Section:

Making Sure that a Cluster is Working

Monitoring Cluster Status Using SmartConsole Clients

Working with SNMP Traps

ClusterXL Configuration Commands

How to Initiate Failover

Monitoring Synchronization (fw ctl pstat)

Troubleshooting Synchronization

Troubleshooting Dynamic Routing (routeD) Pnotes

ClusterXL Error Messages

Member Fails to Start After Reboot

Making Sure that a Cluster is Working

The cphaprob Command

Use the cphaprob command to verify that the cluster and the cluster members are working properly, and to define critical devices. A critical device is a process running on a cluster member that enables the member to notify other cluster members that it can no longer function as a member. The device reports its current state to the ClusterXL mechanism. If it fails to report, ClusterXL treats the device as failed, a failover occurs, and another cluster member takes over. When a critical device (also known as a Problem Notification, or pnote) fails, the cluster member is considered to have failed.

There are a number of built-in critical devices, and the Administrator can define additional critical devices. The built-in critical devices are described in Monitoring Critical Devices.

The cphaprob commands can be run automatically by including them in scripts.

To produce a usage printout for cphaprob that shows all the available commands, type cphaprob at the command line and press Enter. The meaning of each of these commands is explained in the following sections.

cphaprob -d <device> -t <timeout(sec)> -s <ok|init|problem> [-p] register
cphaprob -f <file> register
cphaprob -d <device> [-p] unregister
cphaprob -d <device> -s <ok|init|problem> report
cphaprob [-i[a]] [-e] list
cphaprob state
cphaprob [-a] if

Monitoring Cluster Status

To see the status of one or more cluster members, run:

cphaprob state
 
Cluster mode:   Load Sharing (Multicast)
 
Number     Unique Address  State
 
1 (local)  30.0.0.1        active
2          30.0.0.2        active

When examining the state of the cluster member, you need to consider whether it is forwarding packets, and whether it has a problem that is preventing it from forwarding packets. Each state reflects the result of a test on critical devices. This is a list that explains the possible cluster states, and whether or not they represent a problem.

State: Active
Meaning: Everything is OK.
Forwarding packets? Yes
Is this state a problem? No

State: Active attention
Meaning: A problem has been detected, but the cluster member is still forwarding packets because it is the only member in the cluster or there are no other active members in the cluster. In any other situation the state of the member would be down.
Forwarding packets? Yes
Is this state a problem? Yes

State: Down
Meaning: One of the critical devices is down.
Forwarding packets? No
Is this state a problem? Yes

State: Ready
Meaning: State Ready means that the member recognizes itself as a part of the cluster and is literally ready to go into action, but, by design, something prevents the member from taking action. Possible reasons that the member is not yet Active include:
  • Not all required software components were loaded and initialized yet and/or not all configuration steps finished successfully yet. Before a cluster member becomes Active, it sends a message to the rest of the cluster members, checking whether it can become Active. In High Availability mode it will check if there is already an Active member and in Load Sharing Unicast mode it will check if there is a Pivot member already. The member remains in the Ready state until it receives the response from the rest of the cluster members and decides which state to choose next (Active, Standby, Pivot, or non-Pivot).
  • Software installed on this member has a higher version than the rest of the members in this cluster. For example, when a cluster is upgraded from one version of Check Point Security Gateway to another, and the cluster members have different versions of Check Point Security Gateway, the members with a new version have the Ready state and the members with the previous version have the Active / Active Attention state.
  • If the software installed on all cluster members includes CoreXL, which is installed by default in versions R70 and higher, a member in Ready state may have a higher number of CoreXL instances than other members. See sk42096 for a solution.
Forwarding packets? No
Is this state a problem? No

State: Standby
Meaning: Applies only to a High Availability configuration, and means the member is waiting for an active member to fail in order to start packet forwarding.
Forwarding packets? No
Is this state a problem? No

State: Initializing
Meaning: An initial and transient state of the cluster member. The cluster member is booting up, and ClusterXL product is already running, but the Security Gateway is not yet ready.
Forwarding packets? No
Is this state a problem? No

State: ClusterXL inactive or member is down
Meaning: Local member cannot hear anything coming from this cluster member.
Forwarding packets? Unknown
Is this state a problem? Yes
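The state list above can be applied to the output of cphaprob state in a script. The following is a minimal sketch, not a Check Point tool: the function name problem_members is hypothetical, and it assumes the output layout matches the example shown earlier (member lines start with a number, and the local member carries a "(local)" tag). Output layout may differ between versions, so treat the parsing as an assumption to verify on your own members.

```shell
#!/bin/sh
# Hypothetical helper: reads "cphaprob state" output on stdin and prints
# "number address state" for every member whose state is marked as a
# problem in the list above (anything other than active, standby, ready,
# or initializing).
problem_members() {
    awk '
        /^[0-9]+/ {
            i = 2
            if ($2 == "(local)") i = 3      # skip the "(local)" tag
            addr = $i
            state = ""
            for (j = i + 1; j <= NF; j++) { # states can be several words
                if (state != "") state = state " "
                state = state $j
            }
            s = tolower(state)
            if (s != "active" && s != "standby" && s != "ready" && s != "initializing")
                print $1, addr, state
        }
    '
}

# Only call cphaprob when it is present (that is, on a cluster member).
if command -v cphaprob >/dev/null 2>&1; then
    cphaprob state | problem_members
fi
```

An empty result means that no member reported a state that the list above marks as a problem.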

Monitoring Cluster Interfaces

To see the state of the cluster member interfaces and the virtual cluster interfaces:

Run this command on the cluster members:

cphaprob [-a] if

The output of this command must be identical to the configuration in the cluster object Topology page.

For example:

cphaprob -a if
 
Required interfaces: 4
Required secured interfaces: 1
 
qfe4      UP                       (secured, unique, multicast)
qfe5      UP                       (non secured, unique, multicast)
qfe6      DOWN (4810.2 secs)       (non secured, unique, multicast)
qfe7      UP                       (non secured, unique, multicast)
 
Virtual cluster interfaces: 2
qfe5           30.0.1.130
qfe6           30.0.2.130

Interfaces are ClusterXL critical devices. ClusterXL makes sure that interfaces can send and receive CCP packets. It also sets the required minimum number of functional interfaces to the largest number of functional interfaces seen since the last reboot. If the number of functional interfaces is less than the required number, ClusterXL starts a failover. The same applies for secured interfaces, where only good synchronization interfaces are counted.

Each interface is listed with its attributes in parentheses, for example secured or non secured, unique, and multicast.

For third-party clustering products, except in the case of IPSO IP Clustering, cphaprob -a if should always show virtual cluster IP addresses.

When an interface is DOWN, it means that the interface cannot receive or transmit CCP packets, or both. This may happen when an interface is malfunctioning, is connected to an incorrect subnet, is unable to pick up Multicast Ethernet packets and so on. The interface may also be able to receive but not transmit CCP packets, in which case the status field is read. The displayed time is the number of seconds that have elapsed since the interface was last able to receive/transmit a CCP packet.

See Defining Disconnected Interfaces for additional information.
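To pick out DOWN interfaces and the time shown next to them, the example output above can be filtered in a script. This is a sketch under the assumption that the output has the same column layout as the example (interface name, then UP or DOWN, then the elapsed seconds in parentheses). The function name down_interfaces is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "cphaprob -a if" output on stdin and prints
# "<interface> <seconds>" for each interface reported as DOWN.
down_interfaces() {
    awk '
        $2 == "DOWN" {
            secs = $3
            sub(/^\(/, "", secs)   # "(4810.2" -> "4810.2"
            print $1, secs
        }
    '
}

# Only call cphaprob when it is present (that is, on a cluster member).
if command -v cphaprob >/dev/null 2>&1; then
    cphaprob -a if | down_interfaces
fi
```

For the example output above, the only line printed would be the one for qfe6.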

Monitoring Critical Devices

When a critical device fails, the cluster member is considered to have failed. To see the list of critical devices on a cluster member, and of all the other members in the cluster, run the cphaprob list command on the cluster member. The available options are described in Syntax in Expert mode.

There are a number of built-in critical devices, and the Administrator can define additional critical devices.

The Critical Devices are:

Critical Device: Problem Notification
Description: Monitors all the Critical Devices.
Meaning of "OK" state: None of the Critical Devices reports its state as problem.
Meaning of "problem" state: At least one of the Critical Devices reports its state as problem.

Critical Device: Interface Active Check
Description: Monitors the state of cluster interfaces.
Meaning of "OK" state: All cluster interfaces are up (CCP packets are sent and received on all cluster interfaces).
Meaning of "problem" state: At least one of the cluster interfaces is down (CCP packets are not sent and/or received on time).

Critical Device: HA Initialization
Meaning of "OK" state: HA module was initialized successfully (see sk36372).

Critical Device: Load Balancing Configuration
Meaning of "OK" state: Pnote is currently not used (see sk36373).

Critical Device: Recovery Delay
Description: Monitors the state of a Virtual System (see sk92353). Recovery Delay mechanism is disabled by default on 3rd party clusters.
Meaning of "OK" state: State of a Virtual System can be changed.
Meaning of "problem" state: State of a Virtual System cannot be changed yet.

Critical Device: IPSO member status
Meaning of "OK" state: IPSO member joined the cluster, all interfaces are up.
Meaning of "problem" state: IPSO member left the cluster, fewer interfaces than expected in UP state.

Critical Device: Synchronization
Description: Monitors if Full Sync on this cluster member completed successfully.
Meaning of "OK" state: Full Sync has completed successfully.
Meaning of "problem" state: Full Sync has failed.

Critical Device: Filter
Description: Monitors if the Security Policy is loaded.
Meaning of "OK" state: Security Policy was installed successfully.
Meaning of "problem" state: Security Policy is not currently installed.

Critical Device: fwd
Description: Monitors the Security Gateway process called fwd.
Meaning of "OK" state: fwd daemon reported its state on time.
Meaning of "problem" state: fwd daemon did not report its state on time.

Critical Device: cphad
Description: Monitors the ClusterXL process called cphamcset. In R77.20 and higher, also see the $FWDIR/log/cphamcset.elg file.
Meaning of "OK" state: cphamcset daemon reported its state on time.
Meaning of "problem" state: cphamcset daemon did not report its state on time.

Critical Device: routed
Description: Monitors the Gaia process called routed. This pnote appears since R76.
Meaning of "OK" state: routed daemon reported its state on time.
Meaning of "problem" state: routed daemon did not report its state on time.

Critical Device: cvpnd
Description: Monitors the Mobile Access back-end process called cvpnd. This pnote appears if Mobile Access Software Blade is enabled.
Meaning of "OK" state: cvpnd daemon reported its state on time.
Meaning of "problem" state: cvpnd daemon did not report its state on time.

Critical Device: ted
Description: Monitors the Threat Emulation process called ted. This pnote appears since R77.
Meaning of "OK" state: ted daemon reported its state on time.
Meaning of "problem" state: ted daemon did not report its state on time.

Critical Device: FIB
Description: This pnote appears only on SecurePlatform Pro OS, when Advanced Dynamic Routing is enabled.
Meaning of "OK" state: fibmgrd daemon reported its state on time and it is able to send and receive its packets on TCP port 2010.
Meaning of "problem" state: fibmgrd daemon did not report its state on time, or it is not able to exchange its packets with peer members on TCP port 2010.

Critical Device: VSX
Description: Monitors all Virtual Systems in VSX cluster.
Meaning of "OK" state: On VS0, means that states of all Virtual Systems are not Down. On other Virtual Systems, means that VS0 is alive.
Meaning of "problem" state: Minimum of blocking states of all Virtual Systems is not "active" (the VSIDs will be printed on the line Problematic VSIDs:).

Critical Device: Instances
Description: This pnote appears in VSX HA mode (not VSLS) cluster.
Meaning of "OK" state: The number of CoreXL FW instances in the received CCP packet matches the number of loaded CoreXL FW instances on this VSX cluster member or this Virtual System.
Meaning of "problem" state: There is a mismatch between the number of CoreXL FW instances in the received CCP packet and the number of loaded CoreXL FW instances on this VSX cluster member or this Virtual System (see sk106912).

Critical Device: admin_down
Description: User ran the clusterXL_admin down command. See Appendix A - The clusterXL_admin Script.

Critical Device: host_monitor
Description: User executed the $FWDIR/bin/clusterXL_monitor_ips script. See Appendix B - The clusterXL_monitor_ips Script.
Meaning of "OK" state: All monitored IP addresses replied to pings.
Meaning of "problem" state: At least one of the monitored IP addresses did not reply to at least one ping.

Critical Device: name of a user space process
Description: User executed the $FWDIR/bin/clusterXL_monitor_process script. See Appendix C - The clusterXL_monitor_process Script.
Meaning of "OK" state: All monitored user space processes are running.
Meaning of "problem" state: At least one of the monitored user space processes is not running.

Syntax in Expert mode

cphaprob [-l] [-ia] [-e] list

Where:

Command: cphaprob -l
Description: Prints the list of all the "Built-in Devices" and the "Registered Devices".

Command: cphaprob -i list
Description: When there are no issues on the cluster member, shows:
There are no pnotes in problem state
When a critical device reports a problem, prints only the critical device that reports its state as "problem".

Command: cphaprob -ia list
Description: When there are no issues on the cluster member, shows:
There are no pnotes in problem state
When a critical device reports a problem, prints the device "Problem Notification" and the critical device that reports its state as "problem".

Command: cphaprob -e list
Description: When there are no issues on the cluster member, shows:
There are no pnotes in problem state
When a critical device reports a problem, prints only the critical device that reports its state as "problem".

Example

The following example output shows that the fwd process is down:

cphaprob list
 
Built-in Devices:
 
Device Name: Interface Active Check
Current state: OK
 
Registered Devices:
 
Device Name: Synchronization
Registration number: 0
Timeout: none
Current state: OK
Time since last report: 15998.4 sec
 
Device Name: Filter
Registration number: 1
Timeout: none
Current state: OK
Time since last report: 15644.4 sec
 
Device Name: fwd
Registration number: 3
Timeout: 2 sec
Current state: problem
Time since last report: 4.5 sec
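The example output above can also be reduced to just the failing devices in a script. The following sketch assumes the "Device Name:" and "Current state:" labels appear exactly as in the example. The function name failed_devices is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "cphaprob list" output on stdin and prints the
# name of every device whose "Current state" is "problem".
failed_devices() {
    awk '
        /^Device Name:/ {
            name = $0
            sub(/^Device Name:[ \t]*/, "", name)
            sub(/[ \t\r]+$/, "", name)
        }
        /^Current state:/ {
            st = $0
            sub(/^Current state:[ \t]*/, "", st)
            sub(/[ \t\r]+$/, "", st)
            if (st == "problem") print name
        }
    '
}

# Only call cphaprob when it is present (that is, on a cluster member).
if command -v cphaprob >/dev/null 2>&1; then
    cphaprob list | failed_devices
fi
```

For the example output above, this would print fwd.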

Registering a Critical Device

cphaprob -d <device> -t <timeout(sec)> -s <ok|init|problem> [-p] register

It is possible to add a user defined critical device to the default list of critical devices. Use this command to register <device> as a critical process, and add it to the list of devices that must be running for the cluster member to be considered active. If <device> fails, then the cluster member is considered to have failed.

If <device> fails to contact the cluster member in <timeout> seconds, <device> will be considered to have failed. For no timeout, use the value 0.

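Define the status of the <device> that will be reported to ClusterXL upon registration. This initial status can be one of ok, init, or problem. The meaning of each status is described in Reporting Critical Device Status to ClusterXL.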

[-p] makes these changes permanent. After performing a reboot or after removing the Security Gateway (on Linux or IPSO for example) and re-attaching it, the status of critical devices that were registered with this flag will be saved.

Registering Critical Devices Listed in a File

cphaprob -f <file> register

Register all the user defined critical devices listed in <file>. <file> must be an ASCII file, with each device on a separate line. Each line must list three parameters, which must be separated by at least a space or a tab, as follows:

<device> <timeout> <status>
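Because each line must carry exactly three parameters, it can help to check a file before registering it. The sketch below is one possible check, not part of ClusterXL. The function name check_pnote_file and the device names in the usage comment are hypothetical, and two rules in it are assumptions: blank lines are skipped, and the timeout must be a whole number of seconds.

```shell
#!/bin/sh
# Hypothetical helper: verifies that every non-empty line of the file has
# the form "<device> <timeout> <status>", with a numeric timeout and a
# status of ok, init, or problem. Prints each offending line and returns
# non-zero if any line is not valid.
check_pnote_file() {
    awk '
        NF == 0 { next }   # assumption: blank lines are ignored
        NF != 3 || $2 !~ /^[0-9]+$/ || ($3 != "ok" && $3 != "init" && $3 != "problem") {
            printf "line %d is not valid: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example use (file name and device names are illustrative only):
#   printf '%s\n' 'my_daemon 30 ok' 'my_probe 0 init' > /tmp/my_pnotes.txt
#   check_pnote_file /tmp/my_pnotes.txt && cphaprob -f /tmp/my_pnotes.txt register
```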

Unregistering a Critical Device

cphaprob -d <device> [-p] unregister

Unregisters a user-defined <device> as a critical process. This means that this device is no longer considered critical. If a critical device (and hence a cluster member) was registered as "problem" before running this command, then after running this command the status of the cluster will depend only on the remaining critical devices.

[-p] makes these changes permanent. This means that after performing a reboot or after removing the kernel (on Linux or IPSO for example) and re-attaching it, these critical devices remain unregistered.

Reporting Critical Device Status to ClusterXL

cphaprob -d <device> -s <ok|init|problem> report

Use this command to report the status of a user defined critical device to ClusterXL.

<device> is the device that must be running for the cluster member to be considered active. If <device> fails, then the cluster member is considered to have failed.

The status to be reported can be one of:

ok — <device> is alive

init — <device> is initializing. The member is down. This state prevents the member from becoming active.

problem — <device> has failed. If this status is reported to ClusterXL, the cluster member will immediately failover to another cluster member.

If <device> fails to contact the cluster member within the timeout that was defined when it was registered, <device>, and hence the cluster member, will be considered to have failed. This is true only for critical devices with timeouts. If a critical device is registered with the -t 0 parameter, there will be no timeout, and until the device reports otherwise, the status is considered to be the last reported status.
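The register, report, and unregister commands described above can be tied together in a small wrapper. This is a sketch only and is not the predefined clusterXL_monitor_process script. The function names, the CPHAPROB variable (used to swap in echo for a dry run), and the device name my_daemon in the usage comment are all hypothetical. It relies on pgrep being available on the member.

```shell
#!/bin/sh
# CPHAPROB can be overridden for a dry run, for example CPHAPROB=echo.
CPHAPROB=${CPHAPROB:-cphaprob}

# usage: register_device <device> <timeout> <ok|init|problem>
# A timeout of 0 registers the device with no timeout.
register_device() {
    "$CPHAPROB" -d "$1" -t "$2" -s "$3" register
}

# usage: report_status <device> <ok|init|problem>
report_status() {
    case "$2" in
        ok|init|problem) "$CPHAPROB" -d "$1" -s "$2" report ;;
        *) echo "unknown status: $2" >&2; return 1 ;;
    esac
}

# usage: unregister_device <device>
unregister_device() {
    "$CPHAPROB" -d "$1" unregister
}

# usage: check_once <device> <process name>
# Reports ok while a process with that exact name is running, and problem
# otherwise.
check_once() {
    if pgrep -x "$2" >/dev/null 2>&1; then
        report_status "$1" ok
    else
        report_status "$1" problem
    fi
}

# Example use on a cluster member (names are illustrative only):
#   register_device my_daemon 0 init
#   while sleep 5; do check_once my_daemon my_daemon; done
```

Registering with -t 0 in the example means the last reported status stays in effect until the next report, so the loop interval does not have to beat a timeout.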

Example cphaprob Script

Predefined cphaprob scripts are located in $FWDIR/bin. Two scripts are available: clusterXL_monitor_ips and clusterXL_monitor_process.

The clusterXL_monitor_ips script in the Appendix chapter Example cphaprob Script provides a way to check end-to-end connectivity to routers or other network devices and cause failover if the ping fails. The clusterXL_monitor_process script monitors the existence of given processes and causes failover if the processes die. This script uses the normal pnote mechanism.