Monitoring and Troubleshooting Clusters

In This Section:

Making Sure that a Cluster is Working

The cphaprob Command

Use the monitoring commands to make sure that the cluster and the cluster members work properly, and to define Critical Devices. A Critical Device (also known as a Problem Notification, or pnote) is a special software device on each cluster member, through which the critical aspects for cluster operation are monitored. When the critical monitored component on a cluster member fails to report its state on time, or when its state is reported as problematic, the state of that member is immediately changed to 'Down'.

You can run these commands automatically by including them in scripts. The next sections explain each command. Run the commands in Expert mode:

Monitoring Commands:

cphaprob [-vs <vsid>] state

cphaprob [-l] [-ia] [-e] list

cphaprob [-a][-m] if

cphaprob [-reset] syncstat

cphaprob igmp

cphaprob [-reset] ldstat

cphaprob tablestat

Configuration Commands:

cphaprob -d <device> -t <timeout(sec)> -s {ok|init|problem} [-p] [-g] register

cphaprob -d <device> [-p] [-g] unregister

cphaprob -f <file> [-g] register

cphaprob -a [-g] unregister

cphaprob -d <device> -s {ok|init|problem} [-g] report

Monitoring Cluster Status

Description

Run this command after you set up the cluster, and whenever you want to monitor the cluster status.

Syntax in Expert mode

cphaprob [-vs <VSID>] state

Example

Cluster mode: Load Sharing (Multicast)

Number Unique Address State

1 (local) 30.0.0.1 active

2 30.0.0.2 active

Cluster mode can be:
- Load Sharing (Multicast).
- Load Sharing (Unicast).
- High Availability New Mode (Primary Up).
- High Availability New Mode (Active Up).
- Virtual System Load Sharing.
- Service (for third-party clustering products). Refer to Clustering Definitions and Terms for further information.
The number of the member indicates the member ID for Load Sharing, and the Priority for High Availability.
In Load Sharing configuration, all members in a fully functioning cluster should be Active.
In High Availability configurations, only one member in a properly functioning cluster must be Active, and the others must be in the Standby state.
Third-party clustering products show Active/Active even if one of the members is in standby state. This is because this command only reports the status of the full synchronization process.

When examining the state of the cluster member, you need to consider whether it is forwarding packets, and whether it has a problem that is preventing it from forwarding packets. Each state reflects the result of a test on critical devices. This is a list that explains the possible cluster states, and whether or not they represent a problem.

State	Meaning	Forwarding packets?	Is this state a Problem?
Active	Everything is OK.	Yes	No
Active attention	A problem has been detected, but the cluster member is still forwarding packets because it is the only member in the cluster or there are no other active members in the cluster. In any other situation the state of the member would be down.	Yes	Yes
Down	One of the critical devices is down.	No	Yes
Ready	The member recognizes itself as part of the cluster and is ready to go into action, but, by design, something prevents it from becoming Active. Possible reasons: (1) Not all required software components are loaded and initialized yet, or not all configuration steps have finished successfully. Before a cluster member becomes Active, it sends a message to the other cluster members to check whether it can become Active. In High Availability mode, it checks whether there is already an Active member. In Load Sharing Unicast mode, it checks whether there is already a Pivot member. The member remains in the Ready state until it receives the response from the other cluster members and decides which state to choose next (Active, Standby, Pivot, or non-Pivot). (2) The software installed on this member is a higher version than on the other members of the cluster. For example, when a cluster is upgraded from one version of Check Point Security Gateway to another and the members run different versions, the members with the new version are in the Ready state and the members with the previous version are in the Active / Active Attention state. (3) If the software installed on all cluster members includes CoreXL, which is installed by default in versions R70 and higher, a member in Ready state may have a higher number of CoreXL instances than the other members. See sk42096 for a solution.	No	No
Standby	Applies only to a High Availability configuration, and means the member is waiting for an active member to fail in order to start packet forwarding.	No	No
Initializing	An initial and transient state of the cluster member. The cluster member is booting up, and ClusterXL product is already running, but the Security Gateway is not yet ready.	No	No
ClusterXL inactive or member is down	Local member cannot hear anything coming from this cluster member.	Unknown	Yes
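
The state table above can be applied mechanically to the output of `cphaprob state`. The following Python sketch is one possible way to do that. It assumes the output has the same layout as the example in this section. The sample text is the example output with the second member changed to `down` for illustration, and treating an unrecognized state as a problem is an assumption, not documented behavior.

```python
import re

# State -> (forwarding packets?, is this state a problem?), copied from the table above.
STATE_TABLE = {
    "active": ("Yes", False),
    "active attention": ("Yes", True),
    "down": ("No", True),
    "ready": ("No", False),
    "standby": ("No", False),
    "initializing": ("No", False),
    "clusterxl inactive or member is down": ("Unknown", True),
}

# Matches member rows such as "1 (local) 30.0.0.1 active".
MEMBER_ROW = re.compile(r"^(\d+)\s+(\(local\)\s+)?(\d+\.\d+\.\d+\.\d+)\s+(.+)$")


def parse_state(output):
    """Return one dict per cluster member found in `cphaprob state` output."""
    members = []
    for line in output.splitlines():
        match = MEMBER_ROW.match(line.strip())
        if not match:
            continue  # header, blank line, or the "Cluster mode:" line
        state = match.group(4).strip()
        # Assumption: a state that is not in the table is treated as a problem.
        forwarding, problem = STATE_TABLE.get(state.lower(), ("Unknown", True))
        members.append({
            "id": int(match.group(1)),
            "local": match.group(2) is not None,
            "address": match.group(3),
            "state": state,
            "forwarding": forwarding,
            "problem": problem,
        })
    return members


# Illustrative input: the example output with member 2 changed to "down".
SAMPLE = """Cluster mode: Load Sharing (Multicast)

Number Unique Address State

1 (local) 30.0.0.1 active
2 30.0.0.2 down
"""

for member in parse_state(SAMPLE):
    print(member["id"], member["address"], member["state"], member["problem"])
```

On a cluster member, the text passed to `parse_state` would come from running `cphaprob state` in Expert mode, for example through a script.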

Monitoring Critical Devices

When a critical device fails, the cluster member is considered to have failed. To see the list of critical devices on a cluster member, and of all the other members in the cluster, run the cphaprob command listed below on the cluster member.

There are a number of built-in Critical Devices, and the Administrator can define additional critical devices.

The Critical Devices are:

Critical Device	Description	Meaning of "OK" state	Meaning of "problem" state
`Problem Notification`	Monitors all the Critical Devices.	None of the Critical Devices on this cluster member reports its state as `problem`.	At least one of the Critical Devices on this cluster member reports its state as `problem`.
`HA Initialization`	Monitors if "HA module" was initialized successfully. See sk36372.	This cluster member receives cluster state information from peer cluster members.
`Interface Active Check`	Monitors the state of cluster interfaces.	All cluster interfaces on this cluster member are up (CCP packets are sent and received on all cluster interfaces).	At least one of the cluster interfaces on this cluster member is down (CCP packets are not sent and/or received on time).
`Load Balancing Configuration`	Pnote is currently not used (see sk36373).
`Recovery Delay`	Monitors the state of a Virtual System (see sk92353).	State of a Virtual System can be changed on this cluster member.	State of a Virtual System cannot be changed yet on this cluster member.
`Synchronization`	Monitors if Full Sync on this cluster member completed successfully.	This cluster member completed Full Sync successfully.	This cluster member was not able to complete Full Sync.
`Filter`	Monitors if the Security Policy is installed.	This cluster member successfully installed Security Policy.	Security Policy is not currently installed on this cluster member.
`fwd`	Monitors the Security Gateway process called `fwd`.	`fwd` daemon on this cluster member reported its state on time.	`fwd` daemon on this cluster member did not report its state on time.
`cphad`	Monitors the ClusterXL process called `cphamcset`. Also see the `$FWDIR/log/cphamcset.elg` file.	`cphamcset` daemon on this cluster member reported its state on time.	`cphamcset` daemon on this cluster member did not report its state on time.
`routed`	Monitors the Gaia process called `routed`.	`routed` daemon on this cluster member reported its state on time.	`routed` daemon on this cluster member did not report its state on time.
`cvpnd`	Monitors the Mobile Access back-end process called `cvpnd`. This pnote appears if Mobile Access Software Blade is enabled.	`cvpnd` daemon on this cluster member reported its state on time.	`cvpnd` daemon on this cluster member did not report its state on time.
`ted`	Monitors the Threat Emulation process called `ted`.	`ted` daemon on this cluster member reported its state on time.	`ted` daemon on this cluster member did not report its state on time.
`VSX`	Monitors all Virtual Systems in VSX cluster.	On VS0, means that states of all Virtual Systems are not `Down` on this cluster member. On other Virtual Systems, means that VS0 is alive on this cluster member.	Minimum of blocking states of all Virtual Systems is not "active" (the VSIDs will be printed on the line `Problematic VSIDs:`) on this cluster member.
`Instances`	This pnote appears in VSX HA mode (not VSLS) cluster.	The number of CoreXL FW instances in the received CCP packet matches the number of loaded CoreXL FW instances on this VSX cluster member or this Virtual System.	There is a mismatch between the number of CoreXL FW instances in the received CCP packet and the number of loaded CoreXL FW instances on this VSX cluster member or this Virtual System (see sk106912).
`admin_down`	Monitors the Critical Device `admin_down`.		User ran the `clusterXL_admin down` command on this cluster member. See Appendix A - The clusterXL_admin Script.
`host_monitor`	Monitors the Critical Device `host_monitor`. User executed the `$FWDIR/bin/clusterXL_monitor_ips` script. See Appendix B - The clusterXL_monitor_ips Script.	All monitored IP addresses on this cluster member replied to pings.	At least one of the monitored IP addresses on this cluster member did not reply to at least one ping.
a name of a user space process (except `fwd`, `routed`, `cvpnd`, `ted`)	User executed the `$FWDIR/bin/clusterXL_monitor_process` script. See Appendix C - The clusterXL_monitor_process Script.	All monitored user space processes on this cluster member are running.	At least one of the monitored user space processes on this cluster member is not running.

Syntax in Expert mode

cphaprob [-l] [-ia] [-e] list

Where:

Command	Description
`cphaprob -l list`	Prints the list of all the "Built-in Devices" and the "Registered Devices"
`cphaprob -i list`	When there are no issues on the cluster member, shows: `There are no pnotes in problem state` When a critical device reports a problem, prints only the critical device that reports its state as "problem".
`cphaprob -ia list`	When there are no issues on the cluster member, shows: `There are no pnotes in problem state` When a critical device reports a problem, prints the device "Problem Notification" and the critical device that reports its state as "problem"
`cphaprob -e list`	When there are no issues on the cluster member, shows: `There are no pnotes in problem state` When a critical device reports a problem, prints only the critical device that reports its state as "problem"

Example

The following example output shows that the fwd process is down:

[Expert@Member2:0]# cphaprob list

Built-in Devices:

Device Name: Interface Active Check

Current state: OK

Registered Devices:

Device Name: Synchronization

Registration number: 0

Timeout: none

Current state: OK

Time since last report: 15998.4 sec

Device Name: Filter

Registration number: 1

Timeout: none

Current state: OK

Time since last report: 15644.4 sec

Device Name: fwd

Registration number: 3

Timeout: 2 sec

Current state: problem

Time since last report: 4.5 sec
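
A script can pick out the failed devices from this kind of listing. The Python sketch below assumes the output consists of `key: value` lines, as in the example above, and that every device block starts with a `Device Name:` line. The function names are illustrative and are not part of ClusterXL.

```python
def parse_devices(output):
    """Split `cphaprob list` output into one dict per critical device."""
    devices = []
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Device Name":
            devices.append({"Device Name": value})  # start of a new device block
        elif devices and value:
            devices[-1][key] = value  # attribute of the current device
    return devices


def problem_devices(output):
    """Return the names of the devices whose current state is 'problem'."""
    return [
        device["Device Name"]
        for device in parse_devices(output)
        if device.get("Current state", "").lower() == "problem"
    ]


# The example output from this section, without the blank lines.
SAMPLE = """Built-in Devices:
Device Name: Interface Active Check
Current state: OK
Registered Devices:
Device Name: Synchronization
Registration number: 0
Timeout: none
Current state: OK
Time since last report: 15998.4 sec
Device Name: fwd
Registration number: 3
Timeout: 2 sec
Current state: problem
Time since last report: 4.5 sec
"""

print(problem_devices(SAMPLE))
```

Section headers such as `Built-in Devices:` have no value after the colon, so the parser skips them.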

Monitoring Cluster Interfaces

Description

This command lets you see the state of the cluster member interfaces and the virtual cluster interfaces. Interfaces are ClusterXL critical devices. ClusterXL makes sure that interfaces can send and receive CCP packets. It also sets the required minimum number of functional interfaces to the largest number of functional interfaces seen since the last reboot. If the number of functional interfaces is less than the required number, ClusterXL starts a failover. The same applies to secured interfaces, where only good synchronization interfaces are counted.

When an interface is DOWN, it means that the interface cannot receive or transmit CCP packets, or both. This happens when an interface malfunctions, is connected to an incorrect subnet, is unable to pick up Multicast Ethernet packets and so on. The interface may also be able to receive but not transmit CCP packets, in which case the status field is read. The displayed time is the number of seconds that elapsed since the interface was last able to receive or transmit a CCP packet.

See Defining Disconnected Interfaces for additional information.

Syntax in Expert mode

cphaprob [-a][-m] if

Where:

Command	Description
`cphaprob if`	Shows only cluster interfaces (Cluster and Sync) and their states: without Network Objective; without VLAN monitoring mode; without monitored VLAN interfaces.
`cphaprob -a if`	Shows full list of cluster interfaces and their states: including the number of required interfaces; including Network Objective; without VLAN monitoring mode; without monitored VLAN interfaces.
`cphaprob -a -m if` or `cphaprob -a-m if`	Shows full list of all cluster interfaces and their states: including the number of required interfaces; including Network Objective; including VLAN monitoring mode, or list of monitored VLAN interfaces.

Output

The output of this command must be identical to the configuration in the cluster object Topology page.

For example:

[Expert@Member2]# cphaprob -a if

Required interfaces: 4

Required secured interfaces: 1

eth1 UP (secured, unique, multicast)

eth2 UP (non secured, unique, multicast)

eth3 DOWN (4810.2 secs) (non secured, unique, multicast)

eth4 UP (non secured, unique, multicast)

Virtual cluster interfaces: 2

eth2 30.0.1.130

eth4 30.0.2.130

An interface can be:

Non-secured or Secured. A secured interface is a synchronization interface.
Multicast, or broadcast. The Cluster Control Protocol (CCP) mode used in the cluster. To toggle between the CCP modes, use the command cphaconf set_ccp {multicast|broadcast}. See sk20576.
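
The interface listing can also be checked by a script. The Python sketch below parses output in the format of the example above and compares the number of interfaces that are UP with the reported number of required interfaces, following the failover rule in the Description. The regular expression is based on this single example only, so other output layouts may need adjustments.

```python
import re

# Matches rows such as "eth3 DOWN (4810.2 secs) (non secured, unique, multicast)".
IF_ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<status>UP|DOWN)"
    r"(?:\s+\((?P<secs>[\d.]+) secs\))?"
    r"\s+\((?P<attrs>[^)]*)\)$"
)


def parse_interfaces(output):
    """Return (required interface count, list of interface dicts)."""
    required = None
    interfaces = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Required interfaces:"):
            required = int(line.split(":", 1)[1])
            continue
        match = IF_ROW.match(line)
        if not match:
            continue  # blank line, virtual interface row, or other header
        attrs = [attr.strip() for attr in match.group("attrs").split(",")]
        secs = match.group("secs")
        interfaces.append({
            "name": match.group("name"),
            "up": match.group("status") == "UP",
            "down_secs": float(secs) if secs else None,
            "secured": "secured" in attrs,  # "non secured" is a different entry
            "ccp_mode": next(
                (a for a in attrs if a in ("multicast", "broadcast")), None
            ),
        })
    return required, interfaces


SAMPLE = """Required interfaces: 4
Required secured interfaces: 1
eth1 UP (secured, unique, multicast)
eth2 UP (non secured, unique, multicast)
eth3 DOWN (4810.2 secs) (non secured, unique, multicast)
eth4 UP (non secured, unique, multicast)
Virtual cluster interfaces: 2
eth2 30.0.1.130
eth4 30.0.2.130
"""

required, interfaces = parse_interfaces(SAMPLE)
functional = sum(1 for interface in interfaces if interface["up"])
down = [interface["name"] for interface in interfaces if not interface["up"]]
print("required:", required, "functional:", functional, "down:", down)
print("fewer functional interfaces than required:", functional < required)
```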

Monitoring Bond Interfaces

Description

Shows the configuration of bond interfaces and their slave interfaces.

Syntax in Expert mode

cphaconf show_bond {-a | <bond_name>}

Where:

Command	Description
`cphaconf show_bond -a`	Shows configuration of all configured bond interfaces
`cphaconf show_bond <bond_name>`	Shows configuration of the specified bond interface

Example

[Expert@MemberB]# cphaconf show_bond bond0

Bond name: bond0

Bond mode: Load Sharing

Bond status: UP

Balancing mode: 802.3ad Layer3+4 Load Balancing

Configured slave interfaces: 4

In use slave interfaces: 4

Required slave interfaces: 2

Slave name | Status | Link

----------------+-----------------+-------

eth2 | Active | Yes

eth3 | Active | Yes

eth4 | Active | Yes

eth5 | Active | Yes

The output shows:

Configured slave interfaces
Required slave interfaces
Status of slave interface:
- Active - This slave interface is currently handling traffic.
- Backup - (Bond High Availability only) This slave interface is ready and can support internal bond failover.
- Not Available - (Bond High Availability only) The physical link on this slave interface is broken, or the Cluster member is in status down. The bond cannot failover in this state.
Status of link on slave interface - The status of the physical link on this slave interface (Yes or No).
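
As an illustration, the bond output can be split into its summary fields and its slave table. The Python sketch below assumes the layout of the example above. The health check it applies (the number of in-use slave interfaces is at least the number of required slave interfaces) is an assumption made for this sketch and is not stated in this section.

```python
def parse_bond(output):
    """Return (summary dict, list of slave dicts) from `cphaconf show_bond` output."""
    summary, slaves = {}, []
    for line in output.splitlines():
        line = line.strip()
        if "|" in line:
            cells = [cell.strip() for cell in line.split("|")]
            if len(cells) == 3 and cells[0] != "Slave name":
                slaves.append({
                    "name": cells[0],
                    "status": cells[1],
                    "link": cells[2] == "Yes",
                })
        elif ":" in line:
            key, value = line.split(":", 1)
            summary[key.strip()] = value.strip()
    return summary, slaves


def has_enough_slaves(summary):
    """Assumed check: in-use slave count is at least the required count."""
    in_use = int(summary["In use slave interfaces"])
    required = int(summary["Required slave interfaces"])
    return in_use >= required


SAMPLE = """Bond name: bond0
Bond mode: Load Sharing
Bond status: UP
Balancing mode: 802.3ad Layer3+4 Load Balancing
Configured slave interfaces: 4
In use slave interfaces: 4
Required slave interfaces: 2
Slave name | Status | Link
----------------+-----------------+-------
eth2 | Active | Yes
eth3 | Active | Yes
eth4 | Active | Yes
eth5 | Active | Yes
"""

summary, slaves = parse_bond(SAMPLE)
print(summary["Bond name"], summary["Bond status"], has_enough_slaves(summary))
print([slave["name"] for slave in slaves if slave["status"] != "Active"])
```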

Registering a Critical Device

cphaprob -d <device> -t <timeout in sec> -s <ok|init|problem> [-p] register

It is possible to add a user defined critical device to the default list of critical devices. Use this command to register <device> as a critical process, and add it to the list of devices that must be running for the cluster member to be considered active. If <device> fails, then the cluster member is considered to have failed.

If <device> fails to contact the cluster member in <timeout> seconds, <device> will be considered to have failed. For no timeout, use the value 0.

Define the status of the <device> that will be reported to ClusterXL upon registration. This initial status can be one of:

ok — <device> is alive.
init — <device> is initializing. The member is down. This state prevents the member from becoming active.
problem — <device> has failed.

The -p flag makes these changes permanent: after a reboot, or after the Security Gateway is removed (on Linux or IPSO, for example) and re-attached, the status of critical devices that were registered with this flag is preserved.

Restrictions:

Total number of critical devices (pnotes) on cluster member is limited to 16.
Name of any critical device (pnote) on cluster member is limited to 16 characters.
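
A script that registers devices can check these restrictions before it calls `cphaprob`. The Python sketch below only builds the argument list for the register command shown above and does not run it. The device name `my_daemon` is hypothetical, and rejecting white space in the name is borrowed from the rule for device names listed in a file.

```python
MAX_NAME_LENGTH = 16  # limit on the critical device name length, from the restrictions above
VALID_STATES = ("ok", "init", "problem")


def register_command(device, timeout, state, permanent=False):
    """Build the argument list for `cphaprob ... register` for one critical device."""
    if not device or len(device) > MAX_NAME_LENGTH:
        raise ValueError("device name must have 1 to %d characters" % MAX_NAME_LENGTH)
    if any(char.isspace() for char in device):
        raise ValueError("device name must not include white space")
    if timeout < 0:
        raise ValueError("timeout must be 0 (no timeout) or a positive number of seconds")
    if state not in VALID_STATES:
        raise ValueError("state must be one of: " + ", ".join(VALID_STATES))
    args = ["cphaprob", "-d", device, "-t", str(timeout), "-s", state]
    if permanent:
        args.append("-p")  # keep the registration after a reboot
    args.append("register")
    return args


# "my_daemon" is a hypothetical device name used only for illustration.
print(" ".join(register_command("my_daemon", 30, "ok", permanent=True)))
```

The returned list can be passed to `subprocess.run` on a cluster member. The limit of 16 critical devices per member is not checked here, because that requires the current device list.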

Registering Critical Devices Listed in a File

cphaprob -f <file> register

Register all the user defined critical devices listed in <file>. <file> must be an ASCII file, with each device on a separate line. Each line must list three parameters, which must be separated by at least a space or a tab, as follows:

<device> <timeout in sec> <status>

<device> — The name of the critical device. It must have no more than 15 characters, and must not include white spaces.
<timeout in sec> — If <device> fails to contact the cluster member in <timeout> seconds, <device> will be considered to have failed. For no timeout, use the value 0.
<status> — can be one of
ok — <device> is alive.
init — <device> is initializing. The member is down. This state prevents the member from becoming active.
problem — <device> has failed.
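
Before a file is passed to `cphaprob -f <file> register`, its contents can be checked against the rules above. The Python sketch below is one way to do that. The device names in the sample are hypothetical, and skipping blank lines is an assumption, because the text above does not say how `cphaprob` handles them.

```python
VALID_STATES = ("ok", "init", "problem")
MAX_FILE_NAME_LENGTH = 15  # limit for device names listed in a file, from the text above


def parse_register_file(text):
    """Validate register-file content and return (device, timeout, status) tuples."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # fields are separated by spaces or tabs
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 parameters, found %d" % (number, len(fields)))
        device, timeout, status = fields
        if len(device) > MAX_FILE_NAME_LENGTH:
            raise ValueError("line %d: device name is too long" % number)
        if not timeout.isdigit():
            raise ValueError("line %d: timeout must be a whole number of seconds" % number)
        if status not in VALID_STATES:
            raise ValueError("line %d: unknown status %r" % (number, status))
        entries.append((device, int(timeout), status))
    return entries


# Hypothetical file content: two user defined devices, one without a timeout.
SAMPLE_FILE = "my_daemon 30 ok\nlink_check\t0\tinit\n"

print(parse_register_file(SAMPLE_FILE))
```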

Unregistering a Critical Device

cphaprob -d <device> [-p] unregister

Use this command to unregister a user defined <device> as a critical process. The device is then no longer considered critical. If the critical device (and hence the cluster member) was in the "problem" state before you run this command, then after you run it the status of the cluster depends only on the remaining critical devices.

The -p flag makes these changes permanent. This means that after performing a reboot or after removing the kernel (on Linux or IPSO for example) and re-attaching it, these critical devices remain unregistered.

Reporting Critical Device Status to ClusterXL

cphaprob -d <device> -s <ok|init|problem> report

Use this command to report the status of a user defined critical device to ClusterXL.

<device> is the device that must be running for the cluster member to be considered active. If <device> fails, then the cluster member is considered to have failed.

The -s parameter sets the status to be reported. The status can be one of:

ok — <device> is alive

init — <device> is initializing. The member is down. This state prevents the member from becoming active.

problem — <device> has failed. If this status is reported to ClusterXL, the cluster member will immediately failover to another cluster member.

If <device> fails to contact the cluster member within the timeout that was defined when it was registered, <device>, and hence the cluster member, is considered to have failed. This is true only for critical devices with timeouts. If a critical device is registered with the -t 0 parameter, there is no timeout, and the status is considered to be the last reported status until the device reports otherwise.
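
The timeout rule can be summarized in a few lines of code. The Python sketch below is a simplified model of the behavior described above, not the ClusterXL implementation. It also builds the argument list for the report command. The device name `my_daemon` is hypothetical.

```python
VALID_STATES = ("ok", "init", "problem")


def effective_state(last_reported, seconds_since_report, timeout):
    """Model of the state assumed for a critical device, per the rule above."""
    if timeout > 0 and seconds_since_report > timeout:
        return "problem"  # the device missed its reporting deadline
    return last_reported  # no timeout (0), or the last report is still fresh


def report_command(device, state):
    """Build the argument list for `cphaprob -d <device> -s <state> report`."""
    if state not in VALID_STATES:
        raise ValueError("state must be one of: " + ", ".join(VALID_STATES))
    return ["cphaprob", "-d", device, "-s", state, "report"]


# A device with a 2 second timeout that last reported 4.5 seconds ago.
print(effective_state("ok", 4.5, 2))
# A device registered with -t 0 keeps its last reported state.
print(effective_state("ok", 15998.4, 0))
# "my_daemon" is a hypothetical user defined device.
print(" ".join(report_command("my_daemon", "ok")))
```

A monitoring script would typically send the report command at intervals shorter than the registered timeout, so that the device does not miss its deadline.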