This appendix introduces monitoring the Oracle Clusterware environment and explains how you can enable dynamic debugging to troubleshoot Oracle Clusterware processing, and enable debugging and tracing for specific components and specific Oracle Clusterware resources to focus your troubleshooting efforts.
This appendix includes the following topics:
You can use various tools to monitor Oracle Clusterware. While Oracle recommends that you use Oracle Enterprise Manager to monitor the everyday operations of Oracle Clusterware, Cluster Health Monitor (CHM) monitors the complete technology stack, including the operating system, for the purpose of ensuring smooth cluster operations. Both tools are enabled, by default, for any Oracle cluster, and Oracle strongly recommends that you use them.
This section includes the following topics:
You can use Oracle Enterprise Manager to monitor the Oracle Clusterware environment. When you log in to Oracle Enterprise Manager using a client browser, the Cluster Database Home page appears, where you can monitor the status of both Oracle Database and Oracle Clusterware environments. Monitoring can include such things as:
Notification if there are any VIP relocations
Status of the Oracle Clusterware on each node of the cluster, using information obtained through the Cluster Verification Utility (cluvfy)
Notification if node applications (nodeapps) start or stop
Notification of issues in the Oracle Clusterware alert log for the Oracle Cluster Registry, voting file issues (if any), and node evictions
The Cluster Database Home page is similar to a single-instance Database Home page. However, on the Cluster Database Home page, Oracle Enterprise Manager displays the system state and availability. This includes a summary about alert messages and job activity, and links to all the database and Oracle Automatic Storage Management (Oracle ASM) instances. For example, you can track problems with services on the cluster including when a service is not running on all of the preferred instances or when a service response time threshold is not being met.
You can use the Oracle Enterprise Manager Interconnects page to monitor the Oracle Clusterware environment. The Interconnects page shows the public and private interfaces on the cluster, the overall throughput on the private interconnect, individual throughput on each of the network interfaces, error rates (if any) and the load contributed by database instances on the interconnect, including:
Overall throughput across the private interconnect
Notification if a database instance is using the public interface due to misconfiguration
Throughput and errors (if any) on the interconnect
Throughput contributed by individual instances on the interconnect
All of this information also is available as collections that have a historic view. This is useful with cluster cache coherency, such as when diagnosing problems related to cluster wait events. You can access the Interconnects page by clicking the Interconnect tab on the Cluster Database home page.
Also, the Oracle Enterprise Manager Cluster Database Performance page provides a quick glimpse of the performance statistics for a database. Statistics are rolled up across all the instances in the cluster database in charts. Using the links next to the charts, you can get more specific information and perform any of the following tasks:
Identify the causes of performance issues.
Decide whether resources must be added or redistributed.
Tune your SQL plan and schema for better optimization.
Resolve performance issues.
The charts on the Cluster Database Performance page include the following:
Chart for Cluster Host Load Average: The Cluster Host Load Average chart in the Cluster Database Performance page shows potential problems that are outside the database. The chart shows maximum, average, and minimum load values for available nodes in the cluster for the previous hour.
Chart for Global Cache Block Access Latency: Each cluster database instance has its own buffer cache in its System Global Area (SGA). Using Cache Fusion, Oracle RAC environments logically combine each instance's buffer cache to enable the database instances to process data as if the data resided on a logically combined, single cache.
Chart for Average Active Sessions: The Average Active Sessions chart in the Cluster Database Performance page shows potential problems inside the database. Categories, called wait classes, show how much of the database is using a resource, such as CPU or disk I/O. Comparing CPU time to wait time helps to determine how much of the response time is consumed with useful work rather than waiting for resources that are potentially held by other processes.
Chart for Database Throughput: The Database Throughput charts summarize any resource contention that appears in the Average Active Sessions chart, and also show how much work the database is performing on behalf of the users or applications. The Per Second view shows the number of transactions compared to the number of logons, and the amount of physical reads compared to the redo size for each second. The Per Transaction view shows the amount of physical reads compared to the redo size for each transaction. Logons is the number of users that are logged on to the database.
In addition, the Top Activity drop down menu on the Cluster Database Performance page enables you to see the activity by wait events, services, and instances. Plus, you can see the details about SQL/sessions by going to a prior point in time by moving the slider on the chart.
The Cluster Health Monitor (CHM) detects and analyzes operating system and cluster resource-related degradation and failures. CHM stores real-time operating system metrics in the Oracle Grid Infrastructure Management Repository that you can use for later triage with the help of My Oracle Support should you have cluster issues.
This section includes the following CHM topics:
CHM consists of the following services:
There is one system monitor service on every node. The system monitor service (osysmond) is a real-time monitoring and operating system metric collection service that sends the data to the cluster logger service. The cluster logger service receives the information from all the nodes and persists it in an Oracle Grid Infrastructure Management Repository database.
There is one cluster logger service (OLOGGERD) for every 32 nodes in a cluster. Another OLOGGERD is spawned for every additional 32 nodes (which can be a sum of Hub and Leaf Nodes). If the cluster logger service fails (because the service is not able to come up after a fixed number of retries or the node where it was running is down), then Oracle Clusterware starts OLOGGERD on a different node. The cluster logger service manages the operating system metric database in the Oracle Grid Infrastructure Management Repository.
Oracle Grid Infrastructure Management Repository
The Oracle Grid Infrastructure Management Repository:
Is an Oracle database that stores real-time operating system metrics collected by CHM. You configure the Oracle Grid Infrastructure Management Repository during an installation of or upgrade to Oracle Clusterware 12c on a cluster.
Note:
If you are upgrading Oracle Clusterware to Oracle Clusterware 12c and Oracle Cluster Registry (OCR) and the voting file are stored on raw or block devices, then you must move them to Oracle ASM or a shared file system before you upgrade your software.

Runs on one node in the cluster (this must be a Hub Node in an Oracle Flex Cluster configuration), and must support failover to another node in case of node or storage failure.
You can locate the Oracle Grid Infrastructure Management Repository on the same node as the OLOGGERD to improve performance and decrease private network traffic.
Communicates with any cluster clients (such as OLOGGERD and OCLUMON) through the private network. Oracle Grid Infrastructure Management Repository communicates with external clients over the public network, only.
Data files are located in the same disk group as the OCR and voting file.
If OCR is stored in an Oracle ASM disk group called +MYDG, then configuration scripts will use the same disk group to store the Oracle Grid Infrastructure Management Repository.
Oracle increased the Oracle Clusterware shared storage requirement to accommodate the Oracle Grid Infrastructure Management Repository, which can be a network file system (NFS), cluster file system, or an Oracle ASM disk group.
Size and retention are managed with OCLUMON.
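For example, the following OCLUMON commands (also shown later in this appendix) query the repository size and path and then change the retention time and repository size; the values here are illustrative only:

$ oclumon manage -get repsize reppath

$ oclumon manage -repos changeretentiontime 86400

$ oclumon manage -repos changerepossize 6000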
You can collect CHM data from any node in the cluster by running the Grid_home/bin/diagcollection.pl script on the node.
Notes:
Oracle recommends that, when you run the diagcollection.pl script to collect CHM data, you run the script on all nodes in the cluster to ensure that you gather all of the information needed for analysis.
You must run this script as a privileged user.
To run the data collection script on only the node where the cluster logger service is running:
Run the following command to identify the node running the cluster logger service:
$ Grid_home/bin/oclumon manage -get master
Run the following command from a writable directory outside the Grid home as a privileged user on the cluster logger service node to collect all the available data in the Oracle Grid Infrastructure Management Repository:
# Grid_home/bin/diagcollection.pl --collect
On Windows, run the following commands:
C:\Grid_home\perl\bin\perl.exe C:\Grid_home\bin\diagcollection.pl --collect
The diagcollection.pl script creates a file called chmosData_host_name_time_stamp.tar.gz, similar to the following:
chmosData_stact29_20121006_2321.tar.gz
To limit the amount of data you want collected, enter the following command on a single line:
# Grid_home/bin/diagcollection.pl --collect --chmos --incidenttime time --incidentduration duration
In the preceding command, the format for the --incidenttime argument is MM/DD/YYYY24HH:MM:SS and the format for the --incidentduration argument is HH:MM. For example:
# Grid_home/bin/diagcollection.pl --collect --crshome Grid_home --chmos --incidenttime 07/21/2013 01:00:00 --incidentduration 00:30
The OCLUMON command-line tool is included with CHM and you can use it to query the CHM repository to display node-specific metrics for a specified time period. You can also use OCLUMON to perform miscellaneous administrative tasks, such as changing the debug levels, querying the version of CHM, and changing the metrics database size.
This section details the following OCLUMON commands:
Use the oclumon debug command to set the log level for the CHM services.
oclumon debug [log daemon module:log_level] [version]
Table J-1 oclumon debug Command Parameters
Parameter | Description |
---|---|
log daemon module:log_level | Use this option to change the log level of daemons and daemon modules. Supported daemons are: osysmond, ologgerd, client, and all. Supported daemon modules are: osysmond: CRFMOND, CRFM, and allcomp; ologgerd: CRFLOGD, CRFLDREP, CRFM, and allcomp; client: OCLUMON, CRFM, and allcomp; all: allcomp. Supported log_level values are 0, 1, 2, and 3. |
version | Use this option to display the versions of the daemons. |
The following example sets the log level of the system monitor service (osysmond):
$ oclumon debug log osysmond CRFMOND:3
Use the oclumon dumpnodeview command to view log information from the system monitor service in the form of a node view.
A node view is a collection of all metrics collected by CHM for a node at a point in time. CHM attempts to collect metrics every five seconds on every node. Some metrics are static while other metrics are dynamic.
A node view consists of eight views when you display verbose output:
SYSTEM: Lists system metrics such as CPU COUNT, CPU USAGE, and MEM USAGE
TOP CONSUMERS: Lists the top consuming processes in the following format:
metric_name: 'process_name(process_identifier) utilization'
PROCESSES: Lists process metrics such as PID, name, number of threads, memory usage, and number of file descriptors
DEVICES: Lists device metrics such as disk read and write rates, queue length, and wait time per I/O
NICS: Lists network interface card metrics such as network receive and send rates, effective bandwidth, and error rates
FILESYSTEMS: Lists file system metrics, such as total, used, and available space
PROTOCOL ERRORS: Lists network protocol errors, such as IP header errors and UDP receive errors (cumulative values since system startup)
CPUS: Lists per-CPU statistics, such as system and user usage, the nice value, and I/O wait time
You can generate a summary report that only contains the SYSTEM and TOP CONSUMERS views.
"Metric Descriptions" lists descriptions for all the metrics associated with each of the views in the preceding list.
Note:
Metrics displayed in the TOP CONSUMERS view are described in Table J-4, "PROCESSES View Metric Descriptions".

Example J-1 shows an example of a node view.
oclumon dumpnodeview [[-allnodes] | [-n node1 node2 noden] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-i interval] [-v]] [-h]
Table J-2 oclumon dumpnodeview Command Parameters
Parameter | Description |
---|---|
-allnodes | Use this option to dump the node views of all the nodes in the cluster. |
-n node1 node2 ... noden | Specify one node (or several nodes in a space-delimited list) for which you want to dump the node view. |
-last "duration" | Use this option to specify a time, given in HH24:MM:SS format surrounded by double quotation marks (for example, "23:05:00"), to retrieve the last collected data. |
-s "time_stamp" -e "time_stamp" | Use the -s option to specify a start time and the -e option to specify an end time, both in "YYYY-MM-DD HH24:MM:SS" format surrounded by double quotation marks (for example, "2011-05-10 23:05:00"). Note: You must specify these two options together to obtain a range. |
-i interval | Specify a collection interval, in five-second increments. |
-v | Displays verbose node view output. |
-h | Displays online help for the dumpnodeview command. |
In certain circumstances, data can be delayed for some time before it is replayed by this command. For example, the crsctl stop cluster -all command can cause data delay. After running crsctl start cluster -all, it may take several minutes before oclumon dumpnodeview shows any data collected during the interval.
The default is to continuously dump node views. To stop continuous display, use Ctrl+C on Linux and Windows.
Both the local system monitor service (osysmond) and the cluster logger service (ologgerd) must be running to obtain node view dumps.
The following example dumps node views from node1, node2, and node3 collected over the last twelve hours:
$ oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00"
The following example displays node views from all nodes collected over the last fifteen minutes at a 30 second interval:
$ oclumon dumpnodeview -allnodes -last "00:15:00" -i 30
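You can also combine the -s and -e options from the preceding syntax to dump node views for a specific time range; the following is a sketch only, with hypothetical timestamps:

$ oclumon dumpnodeview -allnodes -s "2014-07-16 18:00:00" -e "2014-07-16 19:00:00" -v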
This section includes descriptions of the metrics in each of the seven views that comprise a node view listed in the following tables.
Table J-3 SYSTEM View Metric Descriptions
Metric | Description |
---|---|
#pcpus | The number of physical CPUs |
#vcpus | Number of logical compute units |
cpuht | CPU hyperthreading enabled (Y) or disabled (N) |
chipname | The name of the CPU vendor |
cpu | Average CPU utilization per processing unit within the current sample interval (%) |
cpuq | Number of processes waiting in the run queue within the current sample interval |
physmemfree | Amount of free RAM (KB) |
physmemtotal | Amount of total usable RAM (KB) |
mcache | Amount of physical RAM used for file buffers plus the amount of physical RAM used as cache memory (KB). On Windows systems, this is the number of bytes currently being used by the file system cache. Note: This metric is not available on Solaris. |
swapfree | Amount of swap memory free (KB) |
swaptotal | Total amount of physical swap memory (KB) |
hugepagetotal | Total size of huge pages in KB. Note: This metric is not available on Solaris or Windows systems. |
hugepagefree | Free size of huge pages in KB. Note: This metric is not available on Solaris or Windows systems. |
hugepagesize | Smallest unit size of huge page. Note: This metric is not available on Solaris or Windows systems. |
ior | Average total disk read rate within the current sample interval (KB per second) |
iow | Average total disk write rate within the current sample interval (KB per second) |
ios | Average time to serve an I/O operation request |
swpin | Average swap in rate within the current sample interval (KB per second). Note: This metric is not available on Windows systems. |
swpout | Average swap out rate within the current sample interval (KB per second). Note: This metric is not available on Windows systems. |
pgin | Average page in rate within the current sample interval (pages per second) |
pgout | Average page out rate within the current sample interval (pages per second) |
netr | Average total network receive rate within the current sample interval (KB per second) |
netw | Average total network send rate within the current sample interval (KB per second) |
procs | Number of processes |
procsoncpu | The current number of processes running on the CPU |
rtprocs | Number of real-time processes |
rtprocsoncpu | The current number of real-time processes running on the CPU |
#fds | Number of open file descriptors, or number of open handles on Windows |
#sysfdlimit | System limit on number of file descriptors. Note: This metric is not available on either Solaris or Windows systems. |
#disks | Number of disks |
#nics | Number of network interface cards |
nicErrors | Average total network error rate within the current sample interval (errors per second) |
Table J-4 PROCESSES View Metric Descriptions
Metric | Description |
---|---|
name | The name of the process executable |
pid | The process identifier assigned by the operating system |
#procfdlimit | Limit on number of file descriptors for this process. Note: This metric is not available on Windows, AIX, and HP-UX systems. |
cpuusage | Process CPU utilization (%). Note: The utilization value can be up to 100 times the number of processing units. |
privmem | Process private memory usage (KB) |
shm | Process shared memory usage (KB). Note: This metric is not available on Windows, Solaris, and AIX systems. |
workingset | Working set of a program (KB). Note: This metric is only available on Windows. |
#fd | Number of file descriptors open by this process, or number of open handles by this process on Windows |
#threads | Number of threads created by this process |
priority | The process priority |
nice | The nice value of the process. Note: This metric is not applicable to Windows systems. |
state | The state of the process. Note: This metric is not applicable to Windows systems. |
Table J-5 DEVICES View Metric Descriptions
Metric | Description |
---|---|
ior | Average disk read rate within the current sample interval (KB per second) |
iow | Average disk write rate within the current sample interval (KB per second) |
ios | Average disk I/O operation rate within the current sample interval (I/O operations per second) |
qlen | Number of I/O requests in wait state within the current sample interval |
wait | Average wait time per I/O within the current sample interval (msec) |
type | If applicable, identifies what the device is used for. Possible values are SWAP, SYS, OCR, ASM, and VOTING. |
Table J-6 NICS View Metric Descriptions
Metric | Description |
---|---|
netrr | Average network receive rate within the current sample interval (KB per second) |
netwr | Average network send rate within the current sample interval (KB per second) |
neteff | Average effective bandwidth within the current sample interval (KB per second) |
nicerrors | Average error rate within the current sample interval (errors per second) |
pktsin | Average incoming packet rate within the current sample interval (packets per second) |
pktsout | Average outgoing packet rate within the current sample interval (packets per second) |
errsin | Average error rate for incoming packets within the current sample interval (errors per second) |
errsout | Average error rate for outgoing packets within the current sample interval (errors per second) |
indiscarded | Average drop rate for incoming packets within the current sample interval (packets per second) |
outdiscarded | Average drop rate for outgoing packets within the current sample interval (packets per second) |
inunicast | Average packet receive rate for unicast within the current sample interval (packets per second) |
type | Whether PUBLIC or PRIVATE |
innonunicast | Average packet receive rate for multi-cast (packets per second) |
latency | Estimated latency for this network interface card (msec) |
Table J-7 FILESYSTEMS View Metric Descriptions
Metric | Description |
---|---|
total | Total amount of space (KB) |
mount | Mount point |
type | File system type, whether local file system, NFS, or other |
used | Amount of used space (KB) |
available | Amount of available space (KB) |
used% | Percentage of used space (%) |
ifree% | Percentage of free file nodes (%). Note: This metric is not available on Windows systems. |
Table J-8 PROTOCOL ERRORS View Metric Descriptions (see Footnote 1)
Metric | Description |
---|---|
IPHdrErr | Number of input datagrams discarded due to errors in their IPv4 headers |
IPAddrErr | Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity |
IPUnkProto | Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol |
IPReasFail | Number of failures detected by the IPv4 reassembly algorithm |
IPFragFail | Number of IPv4 datagrams discarded because of fragmentation failures |
TCPFailedConn | Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state |
TCPEstRst | Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state |
TCPRetraSeg | Total number of TCP segments retransmitted |
UDPUnkPort | Total number of received UDP datagrams for which there was no application at the destination port |
UDPRcvErr | Number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port |
Footnote 1 All protocol errors are cumulative values since system startup.
Table J-9 CPUS View Metric Descriptions
Metric | Description |
---|---|
cpuid | Virtual CPU |
sys-usage | CPU usage in system space |
user-usage | CPU usage in user space |
nice | The nice value for a specific CPU |
usage | CPU usage for a specific CPU |
iowait | CPU wait time for I/O operations |
----------------------------------------
Node: rwsak10 Clock: '14-04-16 18.47.25 PST8PDT' SerialNo:155631
----------------------------------------
SYSTEM:
#pcpus: 2 #vcpus: 24 cpuht: Y chipname: Intel(R) cpu: 1.23 cpuq: 0
physmemfree: 8889492 physmemtotal: 74369536 mcache: 55081824 swapfree: 18480404
swaptotal: 18480408 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 132
iow: 236 ios: 23 swpin: 0 swpout: 0 pgin: 131 pgout: 235 netr: 72.404
netw: 97.511 procs: 969 procsoncpu: 6 rtprocs: 62 rtprocsoncpu N/A #fds: 32640
#sysfdlimit: 6815744 #disks: 9 #nics: 5 nicErrors: 0
TOP CONSUMERS:
topcpu: 'osysmond.bin(30981) 2.40' topprivmem: 'oraagent.bin(14599) 682496'
topshm: 'ora_dbw2_oss_3(7049) 2156136' topfd: 'ocssd.bin(29986) 274'
topthread: 'java(32255) 53'
CPUS:
cpu18: sys-2.93 user-2.15 nice-0.0 usage-5.8 iowait-0.0 steal-0.0
.
.
.
PROCESSES:
name: 'osysmond.bin' pid: 30891 #procfdlimit: 65536 cpuusage: 2.40 privmem: 35808
shm: 81964 #fd: 119 #threads: 13 priority: -100 nice: 0 state: S
.
.
.
DEVICES:
sdi ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
sda1 ior: 0.000 iow: 61.495 ios: 629 qlen: 0 wait: 0 type: SYS
.
.
.
NICS:
lo netrr: 39.935 netwr: 39.935 neteff: 79.869 nicerrors: 0 pktsin: 25
pktsout: 25 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0
inunicast: 25 innonunicast: 0 type: PUBLIC
eth0 netrr: 1.412 netwr: 0.527 neteff: 1.939 nicerrors: 0 pktsin: 15
pktsout: 4 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0
inunicast: 15 innonunicast: 0 type: PUBLIC latency: <1
FILESYSTEMS:
mount: / type: rootfs total: 563657948 used: 78592012 available: 455971824
used%: 14 ifree%: 99 GRID_HOME
.
.
.
PROTOCOL ERRORS:
IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0
TCPFailedConn: 5197 TCPEstRst: 717163 TCPRetraSeg: 592 UDPUnkPort: 103306
UDPRcvErr: 70
Use the oclumon manage command to view and change configuration information from the system monitor service.
oclumon manage -repos {{changeretentiontime time} | {changerepossize memory_size}} | -get {key1 [key2 ...] | alllogger [-details] | mylogger [-details]}
Table J-10 oclumon manage Command Parameters
Parameter | Description |
---|---|
-repos {{changeretentiontime time} | {changerepossize memory_size}} | The -repos flag is required to specify the following CHM repository-related options: changeretentiontime time: Use this option to change the retention time of CHM repository data. changerepossize memory_size: Use this option to change the size of the CHM repository. |
-get key1 [key2 ...] | Use this option to obtain CHM repository information using the following keywords: repsize: Size of the CHM repository, in seconds; reppath: Directory path to the CHM repository; master: Name of the master node; alllogger: Special key to obtain a list of all nodes running the Cluster Logger Service; mylogger: Special key to obtain the node running the Cluster Logger Service which is serving the current node. You can specify any number of keywords in a space-delimited list following the -get flag. |
-h | Displays online help for the oclumon manage command. |
The local system monitor service must be running to change the retention time of the CHM repository.
The Cluster Logger Service must be running to change the retention time of the CHM repository.
The following examples show commands and sample output:
$ oclumon manage -get MASTER
Master = node1

$ oclumon manage -get alllogger -details
Logger = node1
Nodes = node1,node2

$ oclumon manage -repos changeretentiontime 86400

$ oclumon manage -repos changerepossize 6000
Oracle Clusterware uses Oracle Database fault diagnosability infrastructure to manage diagnostic data and its alert log. As a result, most diagnostic data resides in the Automatic Diagnostic Repository (ADR), a collection of directories and files located under a base directory that you specify during installation. This section describes clusterware-specific aspects of how Oracle Clusterware uses ADR.
See Also:
Oracle Database Administrator's Guide for more information on the fault diagnosability infrastructure and ADR
Oracle Database Utilities for information about the ADR Command Interpreter (ADRCI)
This section includes the following topics:
Oracle Clusterware ADR data is written under a root directory known as the ADR base. Because components other than ADR use this directory, it may also be referred to as the Oracle base. You specify the file system path to use as the base during Oracle Grid Infrastructure installation, and it can be changed only if you reinstall the Oracle Grid Infrastructure.
ADR files reside in an ADR home directory. The ADR home for Oracle Clusterware running on a given host always has this structure:
ORACLE_BASE/diag/crs/host_name/crs
In the preceding example, ORACLE_BASE is the Oracle base path you specified when you installed the Oracle Grid Infrastructure and host_name is the name of the host. On a Windows platform, this path uses backslashes (\) to separate directory names.
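For example, assuming a hypothetical Oracle base of /u01/app/grid on a host named node1, the ADR home would be:

/u01/app/grid/diag/crs/node1/crs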
Under the ADR home are various directories for specific types of ADR data. The directories of greatest interest are trace and incident. The trace directory contains all normal (non-incident) trace files written by Oracle Clusterware daemons and command-line programs, as well as the simple text version of the Oracle Clusterware alert log. This organization differs significantly from versions prior to Oracle Clusterware 12c release 1 (12.1.0.2), where diagnostic log files were written under distinct directories per daemon.
Starting with Oracle Clusterware 12c release 1 (12.1.0.2), diagnostic data files written by Oracle Clusterware programs are known as trace files, have a .trc file extension, and appear together in the trace subdirectory of the ADR home. The naming convention for these files generally uses the executable program name as the file name, possibly augmented with other data depending on the type of program.
Trace files written by Oracle Clusterware command-line programs incorporate the operating system process ID (PID) in the trace file name to distinguish data from multiple invocations of the same command program. For example, trace data written by CRSCTL uses this name structure: crsctl_PID.trc. In this example, PID is the operating system process ID displayed as decimal digits.
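For example, a CRSCTL invocation running as (hypothetical) operating system process 12345 would write its trace data to:

crsctl_12345.trc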
Trace files written by Oracle Clusterware daemon programs do not include a PID in the file name, and they also are subject to a file rotation mechanism that affects naming. Rotation means that when the current daemon trace file reaches a certain size, the file is closed, renamed, and a new trace file is opened. This occurs a fixed number of times, and then the oldest trace file from the daemon is discarded, keeping the rotation set at a fixed size.
Most Oracle Clusterware daemons use a file size limit of 10 MB and a rotation set size of 10 files, thus maintaining a total of 100 MB of trace data. The current trace file for a given daemon simply uses the program name as the file name; older files in the rotation append a number to the file name. For example, the trace file currently being written by the Oracle High Availability Services daemon (OHASD) is named ohasd.trc; the most recently rotated-out trace file is named ohasd_n.trc, where n is an ever-increasing decimal integer. The file with the highest n is actually the most recently archived trace, and the file with the lowest n is the oldest.
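As a sketch only, with hypothetical rotation numbers, the OHASD files in the trace directory might therefore look like the following after many rotations:

ohasd.trc
ohasd_56.trc
ohasd_55.trc
...
ohasd_47.trc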
Oracle Clusterware agents are daemon programs whose trace files are subject to special naming conventions that indicate the origin of the agent (whether it was spawned by the OHASD or the Cluster Ready Services daemon (CRSD)) and the Operating System user name with which the agent runs. Thus, the name structure for agents is:
origin_executable_user_name
Note:
The first two underscores (_) in the name structure are literal and are included in the trace file name. The underscore in user_name is not part of the file naming convention.

In the previous example, origin is either ohasd or crsd, executable is the executable program name, and user_name is the operating system user name. In addition, because they are daemons, agent trace files are subject to the rotation mechanism previously described, so files with an additional _n suffix are present after rotation occurs.
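For example, following this convention, an oraagent spawned by CRSD and running as a hypothetical operating system user named oracle would write to a trace file named:

crsd_oraagent_oracle.trc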
Besides trace files, the trace subdirectory in the Oracle Clusterware ADR home contains the simple text Oracle Clusterware alert log. It always has the name alert.log. The alert log is also written as an XML file in the alert subdirectory of the ADR home, but the text alert log is most easily read.
The alert log is the first place to look when a problem or issue arises with Oracle Clusterware. Unlike the Oracle Database instance alert log, messages in the Oracle Clusterware alert log are identified, documented, and localized (translated). Oracle Clusterware alert messages are written for most significant events or errors that occur.
Note:
Messages and data written to Oracle Clusterware trace files generally are not documented and translated and are used mainly by My Oracle Support for problem diagnosis.

Certain errors that occur in Oracle Clusterware programs raise an ADR incident. In most cases, these errors should be reported to My Oracle Support for diagnosis. The occurrence of an incident normally produces one or more descriptive messages in the Oracle Clusterware alert log.
In addition to alert messages, incidents also cause the affected program to produce a special, separate trace file containing diagnostic data related to the error. These incident-specific trace files are collected in the incident subdirectory of the ADR home rather than the trace subdirectory. Both the normal trace files and incident trace files are collected and submitted to Oracle when reporting the error.
See Also:
Oracle Database Administrator's Guide for more information on incidents and data collection

Besides ADR data, Oracle Clusterware collects or uses other data related to problem diagnosis. Starting with Oracle Clusterware 12c release 1 (12.1.0.2), this data resides under the same base path used by ADR, but in a separate directory structure with this form: ORACLE_BASE/crsdata/host_name. In this example, ORACLE_BASE is the Oracle base path you specified when you installed the Oracle Grid Infrastructure and host_name is the name of the host.
In this directory, on a given host, are several subdirectories. The two subdirectories of greatest interest if a problem occurs are named core and output. The core directory is where Oracle Clusterware daemon core files are written when the normal ADR location used for core files is not available (for example, before ADR services are initialized in a program). The output directory is where Oracle Clusterware daemons redirect their C standard output and standard error files. These files generally use a name structure consisting of the executable name with the characters OUT appended, plus a .trc file extension (like trace files). For example, the redirected standard output from the Cluster Time Synchronization Service daemon is named octssdOUT.trc. Typically, daemons write very little to these files, but in certain failure scenarios important data may be written there.
When an Oracle Clusterware error occurs, run the diagcollection.pl diagnostics collection script to collect diagnostic information from Oracle Clusterware into trace files. The diagnostics provide additional information so My Oracle Support can resolve problems. Run this script as root from the Grid_home/bin directory.
Use the diagcollection.pl script with the following syntax:
diagcollection.pl {--collect [--crs | --acfs | --all] [--chmos [--incidenttime time [--incidentduration time]]] [--adr location [--aftertime time [--beforetime time]]] [--crshome path] | --clean | --coreanalyze}
Note:
The diagcollection.pl script arguments are all preceded by two dashes (--).

Table J-11 lists and describes the parameters used with the diagcollection.pl script.
Table J-11 diagcollection.pl Script Parameters
Parameter | Description |
---|---|
--collect | Use this parameter with any of the following arguments: --crs (collect Oracle Clusterware diagnostic information), --acfs (collect Oracle ACFS diagnostic information), --all (collect all diagnostic information), --chmos (collect CHM diagnostic information), --incidenttime time (collect CHM data from the specified time), --incidentduration time (use with --incidenttime to limit how much CHM data is collected), --adr location (collect diagnostic information from the specified ADR location), --aftertime time (collect archives created after the specified time), --beforetime time (collect archives created before the specified time), and --crshome path (specify the location of the Oracle Clusterware home). |
--clean | Use this parameter to clean up the diagnostic information gathered by the diagcollection.pl script. Note: You cannot use this parameter with --collect. |
--coreanalyze | Use this parameter to extract information from core files and store it in a text file. Note: You can only use this parameter on UNIX systems. |
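For example, a minimal on-demand collection of only the Oracle Clusterware diagnostic information might look like the following sketch, where Grid_home is a placeholder for your Grid home path:

# Grid_home/bin/diagcollection.pl --collect --crs --crshome Grid_home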
During an upgrade, while running the Oracle Clusterware root.sh script, you may see the following messages:
ACFS-9427 Failed to unload ADVM/ACFS drivers. A system restart is recommended.
ACFS-9428 Failed to load ADVM/ACFS drivers. A system restart is recommended.
If you see these error messages during the upgrade of the initial (first) node, then do the following:
Complete the upgrade of all other nodes in the cluster.
Restart the initial node.
Run the root.sh script on the initial node again.

Run the Oracle_home/cfgtoollogs/configToolAllCommands script as root to complete the upgrade.
For nodes other than the initial node (the node on which you started the installation):
Restart the node where the error occurs.
Run the orainstRoot.sh script as root on the node where the error occurs.

Change directory to the Grid home, and run the root.sh script on the node where the error occurs.
See Also:
Appendix E, "Oracle Clusterware Control (CRSCTL) Utility Reference" for information about using the CRSCTL commands referred to in this procedureUse the following procedure to test zone delegation:
Start the GNS VIP by running the following command as root:
# crsctl start ip -A IP_name/netmask/interface_name
The interface_name should be the public interface, and netmask should be the netmask of the public network.
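For example, with a hypothetical GNS VIP address and public interface (substitute your own values):

# crsctl start ip -A 192.0.2.100/255.255.255.0/eth0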
Start the test DNS server on the GNS VIP by running the following command (you must run this command as root if the port number is less than 1024):
# crsctl start testdns -address address [-port port]
This command starts the test DNS server to listen for DNS forwarded packets at the specified IP and port.
Ensure that the GNS VIP is reachable from other nodes by running the following command as root:
crsctl status ip -A IP_name
Query the DNS server directly by running the following command:
crsctl query dns -name name -dnsserver DNS_server_address
This command fails with the following error:
CRS-10023: Domain name look up for name asdf.foo.com failed. Operating system error: Host name lookup failure
Look at Grid_home/log/host_name/client/odnsd_*.log to see if the query was received by the test DNS server. This validates that the DNS queries are not being blocked by a firewall.
Query the DNS delegation of GNS domain queries by running the following command:
crsctl query dns -name name
Note:
The only difference between this step and the previous step is that you are not giving the -dnsserver DNS_server_address option. This causes the command to query the name servers configured in /etc/resolv.conf. As in the previous step, the command fails with the same error. Again, look at odnsd*.log to ensure that odnsd received the queries. If step 5 succeeds but step 6 does not, then you must check the DNS configuration.

Stop the test DNS server by running the following command:
crsctl stop testdns -address address
Stop the GNS VIP by running the following command as root:
crsctl stop ip -A IP_name/netmask/interface_name
Note:
This feature is not supported on Windows operating systems.

The Oracle Trace File Analyzer (TFA) Collector is a tool for targeted diagnostic collection that simplifies diagnostic data collection for Oracle Clusterware, Oracle Grid Infrastructure, and Oracle RAC systems. TFA collects and packages diagnostic data and also has the ability to centralize and automate the collection of diagnostic information.

TFA is installed into the Oracle Grid Infrastructure home when you install, or upgrade to, Oracle Database 12c release 1 (12.1.0.2). The TFA daemon discovers relevant trace file directories and then analyzes the trace files in those directories to determine their file type (whether they are database trace or log files, or Oracle Clusterware trace or log files, for example) and the first and last timestamps of those files. TFA stores this data in a Berkeley database in the Grid home owner's ORACLE_BASE directory.
The TFA daemon periodically checks for new trace directories to add, such as when a new database instance is created, and also periodically checks whether the trace files metadata need updating. TFA uses this metadata when performing diagnostic data collections.
TFA can perform diagnostic collections in two ways, either on demand or automatically:
Use tfactl set to enable automatic collection of diagnostics upon discovery of a limited set of errors in trace files
See Also:
"tfactl set"
The tfactl diagcollect command collects trimmed trace files for any component and specific time range
This section includes the following topics:
TFA starts automatically whenever a node starts. You can manually start or stop TFA using the following commands:
/etc/init.d/init.tfa restart: Stops and then starts the TFA daemon

/etc/init.d/init.tfa shutdown: Stops the TFA daemon and removes entries from the appropriate operating system configuration
If the TFA daemon fails, then the operating system restarts it, automatically.
The TFA control utility, TFACTL, is the command-line interface for TFA, and is located in the Grid_home/tfa/bin directory. You must run TFACTL as root or sudo, because that gives access to trace files that normally allow only root access. Some commands, such as tfactl host add, require strict root access.
You can append the -help flag to any of the TFACTL commands to obtain online usage information.
This section lists and describes the following TFACTL commands:
Use the tfactl print command to print information from the Berkeley database.

tfactl print [status | config | directories | hosts | actions | repository | cookie]
Table J-12 tfactl print Command Parameters
Parameter | Description |
---|---|
status | Prints the status of TFA across all nodes in the cluster, and also prints the TFA version and the port on which it is running. |
config | Prints the current TFA configuration settings. |
directories | Lists all the directories that TFA scans for trace or log file data, and shows the location of the trace directories allocated for the database, Oracle ASM, and instance. |
hosts | Lists the hosts that are part of the TFA cluster, and that can receive clusterwide commands. |
actions | Lists all the actions submitted to TFA, such as diagnostic collection. By default, only actions that are running or that have completed in the last hour are listed. |
repository | Prints the current location and amount of used space of the repository directory. Initially, the maximum size of the repository directory is the smaller of either 10GB or 50% of available file system space. If the maximum size is exceeded or the file system space gets to 1GB or less, then TFA suspends operations and the repository is closed. Use the tfactl purge command to clear collections from the repository. |
cookie | Generates and prints an identification code for use by the tfactl set cookie=UID command. |
The tfactl print config command returns output similar to the following:
Configuration parameter                             Value
------------------------------------------------------------
TFA Version                                         2.5.1.5
Automatic diagnostic collection                     OFF
Trimming of files during diagcollection             ON
Repository current size (MB) in node1               526
Repository maximum size (MB) in node1               10240
Trace Level                                         1
In the preceding sample output:
Automatic diagnostic collection: When set to ON (the default is OFF), scanning an alert log and finding specific events in that log triggers diagnostic collection.

Trimming of files during diagcollection: Determines if TFA trims large files to contain only data that is within specified time ranges. When this is OFF, no trimming of trace files occurs for automatic diagnostic collection.

Repository current size in MB: How much space in the repository is currently used.

Repository maximum size in MB: The maximum size of storage space in the repository. Initially, the maximum size is set to the smaller of either 10GB or 50% of free space in the file system.

Trace Level: 1 is the default, and the values 2, 3, and 4 have increasing verbosity. While you can set the trace level dynamically for the running TFA daemon, increasing the trace level significantly impacts the performance of TFA, and should only be done at the request of My Oracle Support.
Use the tfactl purge command to delete diagnostic collections from the TFA repository that are older than a specific time.
tfactl purge -older number[h | d]
Specify a number followed by h or d (to specify a number of hours or days, respectively) to remove files older than the specified time constraint. For example:
# tfactl purge -older 30d
The preceding command removes files older than 30 days.
Use the tfactl directory command to add a directory to, or remove a directory from, the list of directories that will have their trace or log files analyzed. You can also use this command to change the directory permissions. When a directory is added by auto discovery, it is added as public, which means that any file in that directory can be collected by any user that has permission to run the tfactl diagcollect command. This is only important when sudo is used to allow users other than root to run TFACTL commands. If a directory is marked as private, then TFA determines which user is running TFACTL commands and verifies that the user has permissions to see the files in the directory before allowing any files to be collected.
Note:
A user can only add a directory to TFA to which they have read access. Note also that TFA auto collections, when configured, run as root, and so always collect all available files.

tfactl directory [add directory | remove directory | modify directory -private | -public]
Table J-13 tfactl directory Command Parameters
Parameter | Description |
---|---|
add directory | Adds a specific directory |
remove directory | Removes a specific directory |
modify directory -private | -public | Modifies a specific directory to either be private or public, where information can only be collected by users with specific operating system permissions (-private), or by any user able to run the tfactl diagcollect command (-public). |
You must add all trace directory names to the Berkeley database so that TFA will collect file metadata in that directory. The discovery process finds most directories, but if new or undiscovered directories are required, then you can add these manually using the tfactl directory command. When you add a directory using TFACTL, TFA attempts to determine whether the directory is for the database, Oracle Clusterware, operating system logs, or some other component, and for which database or instance. If TFA cannot determine this information, then it returns an error and requests that you enter the information, similar to the following:
# tfactl directory add /tmp
Failed to add directory to TFA. Unable to determine parameters for directory: /tmp
Please enter component for this Directory [RDBMS|CRS|ASM|INSTALL|OS|CFGTOOLS] : RDBMS
Please enter database name for this Directory :MYDB
Please enter instance name for this Directory :MYDB1
Use the tfactl host command to add hosts to, or remove hosts from, the TFA cluster.
tfactl host [add host_name | remove host_name]
Specify a host name to add or remove, as in the following example:
# tfactl host add myhost.domain.com
Using the tfactl host add command notifies the local TFA about other nodes on your network. When you add a host, TFA contacts that host and, if TFA is running on that host, then both hosts synchronize their host list. TFA authenticates that a host can be added using a cookie. If the host to be added does not have the correct cookie, then you must retrieve that cookie from an existing host in the cluster and set it on the host being added, similar to the following:
# tfactl host add node2
Failed to add host: node2 as the TFA cookies do not match.
To add the host successfully, try the following steps:
1. Get the cookie in node1 using: ./tfa_home/bin/tfactl print cookie
2. Set the cookie from Step 1 in node2 using: ./tfa_home/bin/tfactl set cookie=<COOKIE>
3. After Step 2, add host again: ./tfa_home/bin/tfactl host add node2
After you successfully add the host, all clusterwide commands will activate on all nodes registered in the Berkeley database.
Use the tfactl set command to adjust the manner in which TFA runs.
tfactl set [autodiagcollect=ON | OFF | cookie=UID | trimfiles=ON | OFF | tracelevel=1 | 2 | 3 | 4 | reposizeMB=number | repositorydir=directory] [-c]
Table J-14 tfactl set Command Parameters
Parameter | Description |
---|---|
autodiagcollect=ON | OFF | When set to OFF (the default), automatic diagnostic collection is disabled. When set to ON, TFA automatically collects diagnostic information when certain patterns occur in the alert logs it monitors. To set automatic collection for all nodes of the TFA cluster, you must specify the -c parameter. |
cookie=UID | Use the tfactl print cookie command to generate the UID, and then use this parameter to set that cookie on a host so that it can be added to the TFA cluster. |
trimfiles=ON | OFF | When set to ON (the default), files are trimmed to include only data from the specified time range when they are collected. Note: When using tfactl diagcollect, the time range for trimming is determined by the time parameters you specify with that command. |
tracelevel=1 | 2 | 3 | 4 | Do not change the tracing level unless you are directed to do so by My Oracle Support. |
-c | Specify this parameter to propagate these settings to all nodes in the TFA configuration. |
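For example, the following sketch (using only the parameters listed above) enables automatic diagnostic collection and file trimming on all nodes of the TFA cluster:

# tfactl set autodiagcollect=ON -c
# tfactl set trimfiles=ON -c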
Automatic Diagnostic Collection
After TFA initially gathers trace file metadata, the daemon monitors all files that are determined to be alert logs using tail, so that TFA can take action when certain strings occur.
By default, these logs are database alert logs, Oracle ASM alert logs, and Oracle Clusterware alert logs. When specific patterns occur in the logs saved to the Berkeley database, automatic diagnostic collection may take place.
Exactly what is collected depends on the pattern that is matched. TFA may just store information on the pattern matched or may initiate local diagnostic collection. Although TFA always monitors the logs and collects information into its database, automatic diagnostic collection only happens if it is enabled first using the tfactl set command.
Use the tfactl diagcollect command to perform on-demand diagnostic collection. You can configure a number of different parameters to determine how large or detailed a collection you want. You can specify a specific time of an incident or a time range for data to be collected, and determine whether whole files that have relevant data should be collected or just a time interval of data from those files.
Note:
If you specify no parameters, then the tfactl diagcollect command collects files from all nodes for all components where the file has been updated in the last four hours, and also trims excessive files. If an incident occurred prior to this period, then you can use the parameters documented in this section to target the correct data collection period.

tfactl diagcollect [-all | -database all | database_1,database_2,... | -asm | -crs | -os | -install] [-node all | local | node_1,node_2,...] [-tag description] [-z file_name] [-since numberh | d | -from "mmm/dd/yyyy hh:mm:ss" -to "mmm/dd/yyyy hh:mm:ss" | -for "mmm/dd/yyyy hh:mm:ss" [-nocopy] [-nomonitor]]
Table J-15 tfactl diagcollect Command Parameters
Parameter | Description |
---|---|
-all | -database all | database_1,database_2,... | -asm | -crs | -os | -install | You can choose one or more individual components from which to collect trace and log files, or choose -all to collect files for all components. |
-node all | local | node_1,node_2,... | You can specify a comma-delimited list of nodes from which to collect diagnostic information. Default is all. |
-tag description | Use this parameter to create a subdirectory for the resulting collection in the TFA repository. |
-z file_name | Use this parameter to specify an output file name. |
-since numberh | d | -from "mmm/dd/yyyy hh:mm:ss" -to "mmm/dd/yyyy hh:mm:ss" | -for "mmm/dd/yyyy hh:mm:ss" | Specify the -since parameter to collect files that have relevant data from the past specified number of hours (h) or days (d). Specify the -from and -to parameters (you must use these two parameters together) to collect files that have relevant data for the specified time interval. Specify the -for parameter to collect files that have relevant data for the specified date. Note: If you specify both date and time, then you must enclose both values in double quotation marks (""). |
-nocopy | Specify this parameter to stop the resultant trace file collection from being copied back to the initiating node. The file remains in the TFA repository on the executing node. |
-nomonitor | Specify this parameter to prevent the terminal on which you run the command from displaying the progress of the command. |
The following command trims and zips all files updated in the last four hours, including chmos and osw data, from across the cluster and collects it on the initiating node:
# tfactl diagcollect
The following command trims and zips all files updated in the last eight hours, including chmos and osw data, from across the cluster and collects it on the initiating node:

# tfactl diagcollect -all -since 8h
The following command trims and zips all files from databases hrdb and fdb updated in the last one day and collects it on the initiating node:
# tfactl diagcollect -database hrdb,fdb -since 1d -z foo
The following command trims and zips all Oracle Clusterware files, operating system logs, and chmos and osw data from node1 and node2 updated in the last six hours, and collects it on the initiating node:

# tfactl diagcollect -crs -os -node node1,node2 -since 6h
The following command trims and zips all Oracle ASM logs from node1 updated between July 4, 2014 and July 5, 2014 at 21:00, and collects it on the initiating node:
# tfactl diagcollect -asm -node node1 -from Jul/4/2014 -to "Jul/5/2014 21:00:00"
The following command trims and zips all log files updated on July 2, 2014 and collects it on the initiating node:
# tfactl diagcollect -for Jul/2/2014
The following command trims and zips all log files updated from 09:00 on July 2, 2014, to 09:00 on July 3, 2014, which is 12 hours before and after the time specified in the command, and collects it on the initiating node:
# tfactl diagcollect -for "Jul/2/2014 21:00:00"
You can hide sensitive data by replacing data in files or paths in file names. To use this feature, place a file called redaction.xml in the tfa_home/resources directory. TFA uses the data contained within the redaction.xml file to replace strings in file names and contents. The format of the redaction.xml file is as follows:
<replacements>
  <replace>
    <before>securestring</before>
    <after>nonsecurestring</after>
  </replace>
  <replace>
    <before>securestring2</before>
    <after>nonsecurestring2</after>
  </replace>
  ...
</replacements>
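For example, a hypothetical redaction.xml that replaces an internal host name and an IP address with generic strings might look like this:

<replacements>
  <replace>
    <before>prodhost01</before>
    <after>hostA</after>
  </replace>
  <replace>
    <before>192.0.2.10</before>
    <after>x.x.x.x</after>
  </replace>
</replacements>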
Oracle Clusterware writes messages to the ADR alert log file (as previously described) for various important events. Alert log messages generally are localized (translated) and carry a message identifier that can be used to look up additional information about the message. The alert log is the first place to look if there appears to be problems with Oracle Clusterware.
The following is an example of alert log messages from two different CRS daemon processes:
2014-07-16 00:27:22.074 [CTSSD(12817)]CRS-2403:The Cluster Time Synchronization Service on host stnsp014 is in observer mode.
2014-07-16 00:27:22.146 [CTSSD(12817)]CRS-2407:The new Cluster Time Synchronization Service reference node is host stnsp013.
2014-07-16 00:27:22.753 [CTSSD(12817)]CRS-2401:The Cluster Time Synchronization Service started on host stnsp014.
2014-07-16 00:27:43.754 [CRSD(12975)]CRS-1012:The OCR service started on node stnsp014.
2014-07-16 00:27:46.339 [CRSD(12975)]CRS-1201:CRSD started on node stnsp014.
The following example shows the start of the Oracle Cluster Time Synchronization Service (OCTSS) after a cluster reconfiguration:
2014-07-15 23:51:17.532 [CTSSD(12813)]CRS-2403:The Cluster Time Synchronization Service on host stnsp014 is in observer mode.
2014-07-15 23:51:18.292 [CTSSD(12813)]CRS-2407:The new Cluster Time Synchronization Service reference node is host stnsp013.
2014-07-15 23:51:18.961 [CTSSD(12813)]CRS-2401:The Cluster Time Synchronization Service started on host stnsp014.
Beginning with Oracle Database 11g release 2 (11.2), certain Oracle Clusterware messages contain a text identifier surrounded by "(:" and ":)". Usually, the identifier is part of the message text that begins with "Details in..." and includes an Oracle Clusterware diagnostic log file path and name similar to the following example. The identifier is called a DRUID, or diagnostic record unique ID:
2014-07-16 00:18:44.472 [ORAROOTAGENT(13098)]CRS-5822:Agent '/scratch/12.1/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) in /scratch/12.1/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log.
DRUIDs are used to relate external product messages to entries in a diagnostic log file and to internal Oracle Clusterware program code locations. They are not directly meaningful to customers and are used primarily by My Oracle Support when diagnosing problems.