H Troubleshooting Oracle Clusterware

This appendix introduces monitoring the Oracle Clusterware environment. It also explains how to enable dynamic debugging to troubleshoot Oracle Clusterware processing, and how to enable debugging and tracing for specific components and specific Oracle Clusterware resources to focus your troubleshooting efforts.

This appendix includes the following topics:

Monitoring Oracle Clusterware
Clusterware Log Files and the Unified Log Directory Structure
Diagnostics Collection Script
Oracle Clusterware Alerts

Monitoring Oracle Clusterware

You can use Oracle Enterprise Manager to monitor the Oracle Clusterware environment. When you log in to Oracle Enterprise Manager using a client browser, the Cluster Database Home page appears, where you can monitor the status of the Oracle Clusterware environment. Monitoring can include such things as:

Notification if there are any VIP relocations
Status of the Oracle Clusterware on each node of the cluster using information obtained through the Cluster Verification Utility (cluvfy)
Notification if node applications (nodeapps) start or stop
Notification of issues in the Oracle Clusterware alert log for the Oracle Cluster Registry, voting disk issues (if any), and node evictions

The Cluster Database Home page is similar to a single-instance Database Home page. However, on the Cluster Database Home page, Oracle Enterprise Manager displays the system state and availability. This includes a summary of alert messages and job activity, and links to all the database and Oracle Automatic Storage Management (Oracle ASM) instances. For example, you can track problems with services on the cluster, including when a service is not running on all of the preferred instances or when a service response time threshold is not being met.

You can use the Oracle Enterprise Manager Interconnects page to monitor the Oracle Clusterware environment. The Interconnects page shows the public and private interfaces on the cluster, the overall throughput on the private interconnect, individual throughput on each of the network interfaces, error rates (if any) and the load contributed by database instances on the interconnect, including:

Overall throughput across the private interconnect
Notification if a database instance is using the public interface due to misconfiguration
Throughput and errors (if any) on the interconnect
Throughput contributed by individual instances on the interconnect

All of this information also is available as collections that have a historic view. This is useful with cluster cache coherency, such as when diagnosing problems related to cluster wait events. You can access the Interconnects page by clicking the Interconnect tab on the Cluster Database home page.

Also, the Oracle Enterprise Manager Cluster Database Performance page provides a quick glimpse of the performance statistics for a database. Statistics are rolled up across all the instances in the cluster database in charts. Using the links next to the charts, you can get more specific information and perform any of the following tasks:

Identify the causes of performance issues.
Decide whether resources must be added or redistributed.
Tune your SQL plan and schema for better optimization.
Resolve performance issues.

The charts on the Cluster Database Performance page include the following:

Chart for Cluster Host Load Average: The Cluster Host Load Average chart in the Cluster Database Performance page shows potential problems that are outside the database. The chart shows maximum, average, and minimum load values for available nodes in the cluster for the previous hour.
Chart for Global Cache Block Access Latency: Each cluster database instance has its own buffer cache in its System Global Area (SGA). Using Cache Fusion, Oracle RAC environments logically combine each instance's buffer cache to enable the database instances to process data as if the data resided on a logically combined, single cache.
Chart for Average Active Sessions: The Average Active Sessions chart in the Cluster Database Performance page shows potential problems inside the database. Categories, called wait classes, show how much of the database is using a resource, such as CPU or disk I/O. Comparing CPU time to wait time helps to determine how much of the response time is consumed with useful work rather than waiting for resources that are potentially held by other processes.
Chart for Database Throughput: The Database Throughput charts summarize any resource contention that appears in the Average Active Sessions chart, and also show how much work the database is performing on behalf of the users or applications. The Per Second view shows the number of transactions compared to the number of logons, and the amount of physical reads compared to the redo size for each second. The Per Transaction view shows the amount of physical reads compared to the redo size for each transaction. Logons is the number of users that are logged on to the database.

In addition, the Top Activity drilldown menu on the Cluster Database Performance page enables you to see the activity by wait events, services, and instances. You can also see details about SQL and sessions at a prior point in time by moving the slider on the chart.

Cluster Health Monitor

The Cluster Health Monitor (CHM) stores real-time operating system metrics in the CHM Repository that you can use for later triage with the help of Oracle Support should you have cluster issues.

This section includes the following CHM topics:

CHM Services
OCLUMON Command Reference

CHM Services

CHM consists of the following services:

System Monitor Service
Cluster Logger Service

System Monitor Service

There is one system monitor service on every node. The system monitor service (osysmond) is the monitoring and operating system metric collection service that sends the data to the cluster logger service. The cluster logger service receives the information from all the nodes and persists it in the CHM Repository.

Cluster Logger Service

The cluster logger service (ologgerd) runs as master on only one node in a cluster, and the master chooses another node to house the standby cluster logger service. If the master cluster logger service fails (because the service is not able to come up after a fixed number of retries or the node where the master was running is down), the node where the standby resides takes over as master and selects a new node for standby. The master manages the operating system metric database in the CHM Repository and interacts with the standby to manage a replica of the master operating system metrics database.

CHM Repository

The CHM Repository, by default, resides within the Grid Infrastructure home and requires 1 GB of disk space per node in the cluster in order to support failover. You can adjust its size and location, and Oracle supports moving it to shared storage. You manage the CHM Repository with OCLUMON.

OCLUMON Command Reference

The OCLUMON command-line tool is included with CHM and you can use it to query the CHM repository to display node-specific metrics for a specified time period. You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period. These states are based on predefined thresholds for each resource metric and are denoted as red, orange, yellow, and green, indicating decreasing order of criticality. For example, you can query to show how many seconds the CPU on a node named node1 remained in the RED state during the last hour. You can also use OCLUMON to perform miscellaneous administrative tasks, such as changing the debug levels, querying the version of CHM, and changing the metrics database size.

This section details the following OCLUMON commands:

oclumon debug
oclumon manage
oclumon version

oclumon debug

Use the oclumon debug command to set the log level for the CHM services.

Syntax

oclumon debug [log daemon module:log_level] [version]

Parameters

Table H-1 oclumon debug Command Parameters


log daemon module:log_level

Use this option to change the log level of daemons and daemon modules. Supported daemons are:

osysmond
ologgerd
client
all

Supported daemon modules are:

osysmond: CRFMOND, CRFM, and allcomp
ologgerd: CRFLOGD, CRFLDBDB, CRFM, and allcomp
client: OCLUMON, CRFM, and allcomp
all: allcomp

Supported log_level values are 0, 1, 2, and 3.

version

Use this option to display the versions of the daemons.

Example

The following example sets the log level of the system monitor service (osysmond):

$ oclumon debug log osysmond CRFMOND:3
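
Because each daemon accepts only certain modules and log levels, it can help to check the three values before running the command. The following is a minimal sketch, not part of OCLUMON: the function name chm_debug_cmd is hypothetical, and it only prints the command line built from the values listed in Table H-1 rather than running it.

```shell
# Hypothetical helper (not shipped with CHM): validates daemon, module, and
# log level against the values listed in Table H-1, then prints the command.
chm_debug_cmd() {
    daemon=$1 module=$2 level=$3
    case "$daemon" in
        osysmond) modules="CRFMOND CRFM allcomp" ;;
        ologgerd) modules="CRFLOGD CRFLDBDB CRFM allcomp" ;;
        client)   modules="OCLUMON CRFM allcomp" ;;
        all)      modules="allcomp" ;;
        *) echo "unsupported daemon: $daemon" >&2; return 1 ;;
    esac
    # The module must be one of the modules supported by the chosen daemon.
    case " $modules " in
        *" $module "*) ;;
        *) echo "unsupported module for $daemon: $module" >&2; return 1 ;;
    esac
    case "$level" in
        0|1|2|3) ;;
        *) echo "unsupported log level: $level" >&2; return 1 ;;
    esac
    echo "oclumon debug log $daemon $module:$level"
}

# Prints: oclumon debug log osysmond CRFMOND:3
chm_debug_cmd osysmond CRFMOND 3
```

The helper prints the command instead of executing it so that you can review the result before running it on a cluster node.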

oclumon manage

Use the oclumon manage command to view and change CHM Repository settings, such as its size and location, and to identify the master and standby nodes.

Syntax

oclumon manage [[-repos {resize size | reploc new_location}] |
[-get key1 key2 ...]]

Parameters

Table H-2 oclumon manage Command Parameters


-repos {resize size | reploc new_location}

The -repos flag is required to resize (with the resize option) or relocate (with the reploc option) the CHM Repository.

Notes:

The size of the CHM Repository is given in number of seconds. The size must be more than 3600 (one hour) and less than 259200 (three days).
When using the reploc option, specify the path to a new CHM Repository location.

-get key1 key2 ...

Use this option to obtain CHM Repository information using the following keywords:

repsize: Current size of the CHM Repository
reppath: Directory path to the CHM Repository
master: Name of the master node
replica: Name of the standby node

You can specify any number of keywords in a space-delimited list following the -get flag.

-h

Displays online help for the oclumon manage command.

Usage Notes

Both the local system monitor service and the master cluster logger service must be running to resize the CHM Repository.

Example

The following examples show commands and sample output:

$ oclumon manage -repos reploc /shared/oracle/chm

The preceding example moves the CHM Repository to shared storage.

$ oclumon manage -get reppath
CHM Repository Path = /opt/oracle/grid/crf/db/node1
Done

$ oclumon manage -get master
Master = node1
done

$ oclumon manage -get repsize
CHM Repository Size = 86400
Done
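
The size bounds given in the notes above can be checked before you resize the repository. The following is a minimal sketch under that assumption: the function name chm_resize_cmd is hypothetical, and it prints the resize command rather than running it.

```shell
# Hypothetical helper (not shipped with CHM): checks a proposed CHM Repository
# size, given in seconds, against the documented bounds and prints the command.
chm_resize_cmd() {
    size=$1
    # Reject empty input and anything that is not a whole number.
    case "$size" in
        ''|*[!0-9]*) echo "size must be a whole number of seconds" >&2; return 1 ;;
    esac
    # The size must be more than 3600 (one hour) and less than 259200 (three days).
    if [ "$size" -le 3600 ] || [ "$size" -ge 259200 ]; then
        echo "size must be more than 3600 and less than 259200" >&2
        return 1
    fi
    echo "oclumon manage -repos resize $size"
}

# Prints: oclumon manage -repos resize 86400
chm_resize_cmd 86400
```

A value of 86400 corresponds to one day of retained metrics, which matches the repsize value shown in the sample output.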

oclumon version

Use the oclumon version command to obtain the version of CHM that you are using.

Syntax

oclumon version

Example

This command produces output similar to the following:

Cluster Health Monitor (OS), Version 2.00.20100622 - Production Copyright 2010
Oracle. All rights reserved.

Clusterware Log Files and the Unified Log Directory Structure

Oracle Database uses a unified log directory structure to consolidate the Oracle Clusterware component log files. This consolidated structure simplifies diagnostic information collection and assists during data retrieval and problem analysis.

Component log files are stored in the directory structures shown in Table H-3.

Table H-3 Locations of Oracle Clusterware Component Log Files

Component	Log File Location (see Footnote 1)
Cluster Health Monitor (CHM)	The system monitor service and cluster logger service record log information in the following locations, respectively: Grid_home/log/host_name/crfmond and Grid_home/log/host_name/crflogd
Oracle Database Quality of Service Management (DBQOS)	Oracle Database QoS Management Grid Operations Manager logs: Grid_home/oc4j/j2ee/home/log/dbwlm/auditing; Oracle Database QoS Management trace logs: Grid_home/oc4j/j2ee/home/log/dbwlm/logging
Cluster Ready Services Daemon (CRSD)	Grid_home/log/host_name/crsd
Cluster Synchronization Services (CSS)	Grid_home/log/host_name/cssd
Cluster Time Synchronization Service (CTSS)	Grid_home/log/host_name/ctssd
Grid Plug and Play	Grid_home/log/host_name/gpnpd
Multicast Domain Name Service Daemon (MDNSD)	Grid_home/log/host_name/mdnsd
Oracle Cluster Registry	Oracle Cluster Registry tools (OCRDUMP, OCRCHECK, OCRCONFIG) record log information in the following location (see Footnote 2): Grid_home/log/host_name/client; Cluster Ready Services records Oracle Cluster Registry log information in the following location: Grid_home/log/host_name/crsd
Oracle Grid Naming Service (GNS)	Grid_home/log/host_name/gnsd
Oracle High Availability Services Daemon (OHASD)	Grid_home/log/host_name/ohasd
Oracle Automatic Storage Management Cluster File System (Oracle ACFS)	Grid_home/log/host_name/acfsrepl, Grid_home/log/host_name/acfsreplroot, Grid_home/log/host_name/acfssec
Event Manager (EVM) information generated by `evmd`	Grid_home/log/host_name/evmd
Cluster Verification Utility (CVU)	Grid_home/log/host_name/cvu
Oracle RAC RACG	The Oracle RAC high availability trace files are located in the following two locations: Grid_home/log/host_name/racg and $ORACLE_HOME/log/host_name/racg. Core files are in subdirectories of the log directory. Each RACG executable has a subdirectory assigned exclusively for that executable. The name of the RACG executable subdirectory is the same as the name of the executable. Additionally, you can find logging information for the VIP in `Grid_home/log/host_name/agent/crsd/orarootagent_root` and for the database in `$ORACLE_HOME/log/host_name/racg`.
Server Manager (SRVM)	Grid_home/log/host_name/srvm
Disk Monitor Daemon (`diskmon`)	Grid_home/log/host_name/diskmon
Grid Interprocess Communication Daemon (GIPCD)	Grid_home/log/host_name/gipcd

Footnote 1: The directory structure is the same for Linux, UNIX, and Windows systems.

Footnote 2: To change the amount of logging, edit the path in the Grid_home/srvm/admin/ocrlog.ini file.
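
Most of the locations in Table H-3 follow the pattern Grid_home/log/host_name/component, so a small lookup can build the directory for a given component. The following is a minimal sketch: the function name cw_log_dir is hypothetical, and the Grid home path and host name in the example call are placeholders, not values from this appendix.

```shell
# Hypothetical helper: builds the log directory for a component that Table H-3
# places directly under Grid_home/log/host_name.
cw_log_dir() {
    grid_home=$1 host=$2 component=$3
    case "$component" in
        crfmond|crflogd|crsd|cssd|ctssd|gpnpd|mdnsd|client|gnsd|ohasd|\
        acfsrepl|acfsreplroot|acfssec|evmd|cvu|racg|srvm|diskmon|gipcd)
            echo "$grid_home/log/$host/$component" ;;
        *)
            echo "unknown component: $component" >&2
            return 1 ;;
    esac
}

# Placeholder Grid home and host name, for illustration only.
# Prints: /u01/app/11.2.0/grid/log/node1/crsd
cw_log_dir /u01/app/11.2.0/grid node1 crsd
```

The lookup deliberately excludes the Oracle Database QoS Management logs and the agent logs, which Table H-3 places under different directory structures.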

Diagnostics Collection Script

Every time an Oracle Clusterware error occurs, run the diagcollection.pl script to collect diagnostic information from Oracle Clusterware in trace files. The diagnostics provide additional information so My Oracle Support can resolve problems. Run this script from the following location:

Grid_home/bin/diagcollection.pl

Note:

You must run this script as the root user.

Oracle Clusterware Alerts

Oracle Clusterware posts alert messages when important events occur. The following example shows alerts from the CTSSD and CRSD processes:

2009-07-16 00:27:22.074
[ctssd(12817)]CRS-2403:The Cluster Time Synchronization Service on host stnsp014 is in observer mode.
2009-07-16 00:27:22.146
[ctssd(12817)]CRS-2407:The new Cluster Time Synchronization Service reference node is host stnsp013.
2009-07-16 00:27:22.753
[ctssd(12817)]CRS-2401:The Cluster Time Synchronization Service started on host stnsp014.
2009-07-16 00:27:43.754
[crsd(12975)]CRS-1012:The OCR service started on node stnsp014.
2009-07-16 00:27:46.339
[crsd(12975)]CRS-1201:CRSD started on node stnsp014.

On Linux, UNIX, and Windows systems, this alert log is located in the following directory, where Grid_home is the location where Oracle Grid Infrastructure is installed: Grid_home/log/host_name.

The following example shows the start of the Oracle Cluster Time Synchronization Service (OCTSS) after a cluster reconfiguration:

[ctssd(12813)]CRS-2403:The Cluster Time Synchronization Service on host stnsp014 is in observer mode.
2009-07-15 23:51:18.292
[ctssd(12813)]CRS-2407:The new Cluster Time Synchronization Service reference node is host stnsp013.
2009-07-15 23:51:18.961
[ctssd(12813)]CRS-2401:The Cluster Time Synchronization Service started on host stnsp014.
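
Alert entries like those shown above pair a timestamp line with a message line of the form [process(pid)]CRS-number:text. The following is a minimal sketch that lists the process name and message code for each entry. The function name crs_codes is hypothetical, and it assumes that every message line follows the layout of the samples in this section.

```shell
# Hypothetical helper: reads alert log text on standard input and prints
# "process code" for each message line. Timestamp lines are skipped because
# they do not start with "[".
crs_codes() {
    sed -n 's/^\[\([^(]*\)([0-9]*)]\(CRS-[0-9]*\):.*/\1 \2/p'
}

# Sample input copied from the alert example in this section.
# Prints "ctssd CRS-2401" and then "crsd CRS-1012".
crs_codes <<'EOF'
2009-07-16 00:27:22.753
[ctssd(12817)]CRS-2401:The Cluster Time Synchronization Service started on host stnsp014.
2009-07-16 00:27:43.754
[crsd(12975)]CRS-1012:The OCR service started on node stnsp014.
EOF
```

To summarize a real alert log, redirect the file into the function and pipe the result through sort and uniq -c.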

Alert Messages Using Diagnostic Record Unique IDs

Beginning with Oracle Database 11g release 2 (11.2), certain Oracle Clusterware messages contain a text identifier surrounded by "(:" and ":)". Usually, the identifier is part of the message text that begins with "Details in..." and includes an Oracle Clusterware diagnostic log file path and name similar to the following example. The identifier is called a DRUID, or Diagnostic Record Unique ID:

2009-07-16 00:18:44.472
[/scratch/11.2/grid/bin/orarootagent.bin(13098)]CRS-5822:Agent '/scratch/11.2/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) in /scratch/11.2/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log.

DRUIDs are used to relate external product messages to entries in a diagnostic log file and to internal Oracle Clusterware program code locations. They are not directly meaningful to customers and are used primarily by My Oracle Support when diagnosing problems.
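
When My Oracle Support asks for the log entry behind an alert, it can be convenient to pull the DRUID and the log file path out of the message. The following is a minimal sketch. The function name druid_ref is hypothetical, and it assumes the "Details at (:ID:) in path." wording of the sample message above, so messages worded differently are not matched.

```shell
# Hypothetical helper: reads alert messages on standard input and prints
# "DRUID path" for each line containing "Details at (:ID:) in path."
druid_ref() {
    sed -n 's/.*Details at \((:[A-Z0-9]*:)\) in \(.*\)\.$/\1 \2/p'
}

# Sample input copied from the alert example in this section.
druid_ref <<'EOF'
[/scratch/11.2/grid/bin/orarootagent.bin(13098)]CRS-5822:Agent '/scratch/11.2/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) in /scratch/11.2/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log.
EOF
```

For the sample message, the function prints the DRUID (:CRSAGF00117:) followed by the path to orarootagent_root.log, without the trailing period.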

Note:

Oracle Clusterware uses a file rotation approach for log files. If you cannot find the reference given in the file specified in the "Details in" section of an alert file message, then the file might have been rolled over to a rollover version, typically ending in *.lnumber, where number starts at 01 and increments up to the number of logs being kept. The number of logs kept can differ from one log to another. There is usually no need to follow the reference unless My Oracle Support asks you to do so, but you can check the path given for rollover versions of the file. Under the log retention policy, however, older logs are purged as required by the volume of logs generated.
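
Searching the current log file and its rollover versions for a DRUID can be scripted. The following is a minimal sketch. The function name find_in_rollovers is hypothetical, and it assumes that rollover versions sit in the same directory as the log file and replace the .log suffix with .l followed by a number, which may not hold for every log.

```shell
# Hypothetical helper: prints the name of each file (the current log or an
# assumed rollover version) that contains the given text.
find_in_rollovers() {
    pattern=$1 logfile=$2
    base=${logfile%.log}
    # Assumed naming: name.log rolls over to name.l01, name.l02, and so on.
    for f in "$logfile" "$base".l[0-9]*; do
        [ -f "$f" ] || continue
        # -F treats the DRUID as literal text rather than a regular expression.
        if grep -F -q -- "$pattern" "$f"; then
            echo "$f"
        fi
    done
}

# Example call, using the DRUID and path from an alert message:
# find_in_rollovers '(:CRSAGF00117:)' /scratch/11.2/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log
```

If the function prints nothing, the entry has probably been purged under the log retention policy.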