A Troubleshooting the Oracle Grid Infrastructure Installation Process

This appendix provides troubleshooting information for installing Oracle Grid Infrastructure.

See Also:

The Oracle Database 11g Oracle Real Application Clusters (Oracle RAC) documentation set, which is included with the installation media in the Documentation directory.

This appendix contains the following topics:

General Installation Issues
Interpreting CVU "Unknown" Output Messages Using Verbose Mode
Interpreting CVU Messages About Oracle Grid Infrastructure Setup
About the Oracle Clusterware Alert Log
Oracle Clusterware Install Actions Log Errors and Causes
Performing Cluster Diagnostics During Oracle Grid Infrastructure Installations
Interconnect Configuration Issues
Storage Configuration Issues
Troubleshooting Windows Firewall Exceptions

A.1 General Installation Issues

The following are examples of errors that can occur during installation:

INS-32026 INSTALL_COMMON_HINT_DATABASE_LOCATION_ERROR
Failure to start network or VIP resources when Microsoft Failover Cluster is installed
Nodes unavailable for selection from the OUI Node Selection screen
Node nodename is unreachable
Shared disk access fails
Installation does not complete successfully on all nodes
INS-20802: Grid Infrastructure configuration failed
Timed out waiting for the CRS stack to start

INS-32026 INSTALL_COMMON_HINT_DATABASE_LOCATION_ERROR

Cause: The location selected for the Grid home for a cluster installation is located under an Oracle base directory.

Action: For Oracle Grid Infrastructure for a Cluster installations, the Grid home must not be placed under one of the Oracle base directories, or under Oracle home directories of Oracle Database installation owners, or in the home directory of an installation owner.
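
The placement rule for the Grid home can be checked before starting the installer. The following Python sketch is only an illustration: the function name is_under and the example directories are assumptions, not part of any Oracle tool. It compares two Windows paths using the standard ntpath module.

```python
import ntpath

def is_under(path, parent):
    """Return True if path equals parent or lies beneath it (Windows path rules)."""
    # normpath collapses "..", normcase lowercases and unifies the separators.
    p = ntpath.normcase(ntpath.normpath(path))
    q = ntpath.normcase(ntpath.normpath(parent))
    return p == q or p.startswith(q.rstrip("\\") + "\\")

# Hypothetical locations, for illustration only.
oracle_base = r"C:\app\oracle"
print(is_under(r"C:\app\oracle\product\11.2.0\grid", oracle_base))  # True
print(is_under(r"C:\app\11.2.0\grid", oracle_base))                 # False
```

A proposed Grid home for which the check returns True against any Oracle base, Oracle home, or installation owner home directory would be expected to trigger INS-32026.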

Failure to start network or VIP resources when Microsoft Failover Cluster is installed

Cause: If Microsoft Failover Cluster (MSFC) is installed on a Windows Server 2008 cluster (even if it is not configured) and you attempt to install Oracle Grid Infrastructure, the installation fails during the 'Configuring Grid Infrastructure' phase with an indication that it was unable to start the resource ora.net1.network or the VIP resources.

When MSFC is installed, it creates a virtual network adapter and places it at the top of the binding order. This change in the binding order can only be seen in the registry; it is not visible through 'View Network Connections' under Server Manager.

Action: Currently, the only solution is not to install MSFC and Oracle Grid Infrastructure or Oracle Clusterware on the same Windows Server 2008 cluster.

Nodes unavailable for selection from the OUI Node Selection screen

Cause: Oracle Grid Infrastructure is either not installed, or the Oracle Grid Infrastructure services are not up and running.

Action: Install Oracle Grid Infrastructure, or review the status of your installation. Consider restarting the nodes, because doing so may resolve the problem.

Node nodename is unreachable

Cause: Unavailable IP host.

Action: Attempt the following:

Run the command ipconfig /all. Compare the output of this command with the contents of the C:\WINDOWS\system32\drivers\etc\hosts file to ensure that the node IP is listed.
Run the command nslookup to verify that the host name resolves to the expected IP address.
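
The comparison against the hosts file can also be scripted. The following Python sketch is a minimal illustration, assuming the standard hosts file layout (an IP address followed by one or more names, with # starting a comment). The function name parse_hosts and the sample addresses are hypothetical.

```python
def parse_hosts(text):
    """Map each host name found in hosts-file text to its IP address."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and padding
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            entries[name.lower()] = ip
    return entries

# Hypothetical entries, for illustration only.
sample = """
# cluster nodes
192.0.2.11   node1 node1.example.com
192.0.2.12   node2
"""
hosts = parse_hosts(sample)
print(hosts.get("node2"))   # 192.0.2.12
print(hosts.get("node3"))   # None, so node3 is not listed
```

A node name that returns None here is missing from the hosts file and must resolve through DNS instead.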

Shared disk access fails

Cause: Windows Server 2003 R2 does not automount raw drives by default. This is a change from Windows 2000.

Action: Enable automounting. Refer to Section 3.5.2, "Enabling Automounting for Windows."

Installation does not complete successfully on all nodes

Cause: If a configuration issue prevents the Oracle Grid Infrastructure software from installing successfully on all nodes, you might see an error message such as "Timed out waiting for the CRS stack to start", or when you exit the installer you might notice that the Oracle Clusterware managed resources were not created on some nodes, or have a status other than ONLINE on those nodes.

Action: One solution to this problem is to deconfigure Oracle Clusterware on the nodes where the installation did not complete successfully, and then fix the configuration issue that caused the installation on that node to error out. After the configuration issue has been fixed, you can then rerun the scripts used during installation to configure Oracle Clusterware. See "Deconfiguring Oracle Clusterware Without Removing the Software" for details.

INS-20802: Grid Infrastructure configuration failed

Cause: If an error is encountered while running an Oracle Grid Infrastructure installation and the Deinstallation tool is used to remove the failed installation, the rootcrs.pl -deconfig command is not run.

Action: Run the rootcrs.pl -deconfig command manually after using the Deinstallation tool, then install Oracle Grid Infrastructure again.

Timed out waiting for the CRS stack to start

Cause: If a configuration issue prevents the Oracle Grid Infrastructure software from installing successfully on all nodes, then you may see error messages such as "Timed out waiting for the CRS stack to start," or you may notice that Oracle Clusterware-managed resources were not created on some nodes after you exit the installer. You also may notice that resources have a status other than ONLINE.

Action: Deconfigure the Oracle Grid Infrastructure installation without removing the software, and review the installation log files to determine the cause of the configuration issue. After you have fixed the configuration issue, rerun the scripts used during installation to configure Oracle Clusterware.

See Also:
Section 6.4, "Deconfiguring Oracle Clusterware Without Removing the Software"

A.2 Interpreting CVU "Unknown" Output Messages Using Verbose Mode

If you run Cluster Verification Utility (CVU) using the -verbose argument, and a Cluster Verification Utility command responds with UNKNOWN for a particular node, then this is because Cluster Verification Utility cannot determine if a check passed or failed. The following is a list of possible causes for an "Unknown" response:

The node is down
Common operating system command binaries required by Cluster Verification Utility are missing in the bin directory of the Oracle Grid Infrastructure home or Oracle home directory
The user account starting Cluster Verification Utility does not have privileges to run common operating system commands on the node
The node is missing an operating system patch, or a required package

A.3 Interpreting CVU Messages About Oracle Grid Infrastructure Setup

If the Cluster Verification Utility report indicates that your system fails to meet the requirements for Oracle Grid Infrastructure installation, then use the topics in this section to correct the problem or problems indicated in the report, and run Cluster Verification Utility again.

User Equivalence Check Failed

Cause: Failure to establish user equivalency across all nodes. This can be due to not creating the required users, the installation user not being the same on all nodes, or using a different password on the failed nodes.

Action: Cluster Verification Utility provides a list of nodes on which user equivalence failed. For each node listed as a failure node, review the installation owner user configuration to ensure that both the user configuration and user equivalence are set up correctly.

Check to ensure that:

You are using a domain user account or a local account that has been granted explicit membership in the Administrators group on each cluster node.
The user account has the same password on each node.
If using a domain user, then the domain for the user is the same on each node.
The user account has administrative privileges on each node.
The user can connect to the registry of each node from the local node.
If you are using Windows Server 2008 for your cluster, then you might have to change the Windows 2008 User Account Control settings on each node:
- Change the elevation prompt behavior for administrators to "Elevate without prompting". See http://technet.microsoft.com/en-us/library/cc709691.aspx
- Confirm that the Administrators group is listed under 'Manage auditing and security log'.

Node Reachability Check or Node Connectivity Check Failed

Cause: One or more nodes in the cluster cannot be reached using the TCP/IP protocol, through either the public or private interconnects.

Action: Use the command ping address to check each node address. When you find an address that cannot be reached, check your list of public and private addresses to make sure that you have them correctly configured. Ensure that the public and private network interfaces have the same interface names on each node of your cluster.

Do not use the names PUBLIC and PRIVATE (all capital letters) for your public and interconnect network adapters (NICs). You can use private, Private, public, and Public for the network interface names.
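
The two naming rules above (identical interface names on every node, and no use of the all-capitals names PUBLIC and PRIVATE) can be checked with a short script. The following Python sketch is illustrative only: the function name, the input structure, and the assumption that names are compared exactly as written are not taken from Oracle documentation.

```python
def check_interface_names(names_by_node):
    """Return a list of problems found in a {node: [interface names]} mapping."""
    problems = []
    for node, names in names_by_node.items():
        for name in names:
            if name in ("PUBLIC", "PRIVATE"):
                problems.append(f"{node}: interface name {name} is all capitals")
    # Every node is expected to expose the same set of interface names.
    if len({frozenset(names) for names in names_by_node.values()}) > 1:
        problems.append("interface names differ between nodes")
    return problems

# Hypothetical cluster, for illustration only.
print(check_interface_names({
    "node1": ["Public", "Private"],
    "node2": ["PUBLIC", "Private"],
}))
```

For the sample input, the function reports both the all-capitals name on node2 and the mismatch between the two nodes.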

A.4 About the Oracle Clusterware Alert Log

The Oracle Clusterware alert log is the first place to look for serious errors. In the event of an error, it can contain path information to diagnostic logs that can provide specific information about the cause of errors.

After installation, Oracle Clusterware posts alert messages when important events occur. For example, you might see alert messages from the Cluster Ready Services (CRS) daemon process when it starts, if it aborts, if the failover process fails, or if automatic restart of a CRS resource failed.

Oracle Enterprise Manager monitors the Oracle Clusterware alert log and posts an alert on the Cluster Home page if an error is detected. For example, if a voting disk is not available, then a CRS-1604 error is raised, and a critical alert is posted on the Cluster Home page. You can customize the error detection and alert settings on the Metric and Policy Settings page.

The location of the Oracle Clusterware log file is Grid_home\log\hostname\alerthostname.log, where Grid_home is the directory in which Oracle Grid Infrastructure was installed and hostname is the host name of the local node.
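
As a minimal sketch of how that path is assembled, the following Python fragment builds the alert log location from a Grid home and a host name. The function name and the example values are hypothetical.

```python
import ntpath

def alert_log_path(grid_home, hostname):
    """Build Grid_home\\log\\hostname\\alerthostname.log using Windows path rules."""
    return ntpath.join(grid_home, "log", hostname, f"alert{hostname}.log")

# Hypothetical Grid home and host name, for illustration only.
print(alert_log_path(r"C:\app\11.2.0\grid", "node1"))
# C:\app\11.2.0\grid\log\node1\alertnode1.log
```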

A.5 Oracle Clusterware Install Actions Log Errors and Causes

During installation of the Oracle Grid Infrastructure software, a log file named installActions<Date_Timestamp>.log is written to the %TEMP%\OraInstall<Date_Timestamp> directory.
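
Because the directory and file names both carry a timestamp, it can be convenient to locate the most recent log with a script. The following Python sketch assumes only the naming pattern described above. The function name is hypothetical, and the most recently modified file is taken to be the log of the latest installation attempt.

```python
import glob
import os

def newest_install_log(temp_dir):
    """Return the most recently modified installActions log under temp_dir, or None."""
    pattern = os.path.join(temp_dir, "OraInstall*", "installActions*.log")
    logs = glob.glob(pattern)
    return max(logs, key=os.path.getmtime) if logs else None

# Falls back to the current directory when TEMP is not set.
print(newest_install_log(os.environ.get("TEMP", ".")))
```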

The following is a list of potential errors in the installActions.log:

PRIF-10: failed to initialize the cluster registry
Configuration assistant "Oracle Private Interconnect Configuration Assistant" failed
KFOD-0311: Error scanning device device_path_name
Step 1: checking status of Oracle Clusterware cluster
Step 2: configuring OCR repository
ignoring upgrade failure of ocr(-1073740972)
failed to configure Oracle Cluster Registry with CLSCFG, ret -1073740972
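
A log can be searched for these messages with a short script. The following Python sketch is an illustration under stated assumptions: the marker strings are shortened forms of the messages listed above, the function name is hypothetical, and the sample text reuses lines from this section instead of a real log.

```python
# Distinctive fragments of the messages listed in this section.
KNOWN_ERRORS = [
    "PRIF-10",
    "Oracle Private Interconnect Configuration Assistant",
    "KFOD-0311",
    "ignoring upgrade failure of ocr",
    "failed to configure Oracle Cluster Registry with CLSCFG",
]

def find_known_errors(log_text):
    """Return (line number, marker) pairs for each known message found."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for marker in KNOWN_ERRORS:
            if marker in line:
                hits.append((number, marker))
    return hits

sample = """Step 2: configuring OCR repository
ignoring upgrade failure of ocr(-1073740972)
failed to configure Oracle Cluster Registry with CLSCFG, ret -1073740972"""
for number, marker in find_known_errors(sample):
    print(number, marker)
```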

Each of these error messages can be caused by one of the following issues:

A.5.1 The OCFS for Windows format is not recognized on a remote cluster node

If you are using Oracle Cluster File System for Windows (OCFS for Windows) for your Oracle Cluster Registry (OCR) and Voting disk partitions, then:

Leave the Oracle Universal Installer (OUI) window in place.
Restart the second node, and any additional nodes.
Retry the assistants.

A.5.2 You have a Windows 2003 system and Automount of new drives is not enabled

If this is true, then:

For Oracle RAC on Windows Server 2003, you must issue the following commands on all nodes:

C:\> diskpart
DISKPART> automount enable

If you did not enable automounting of disks before attempting to install Oracle Grid Infrastructure, and the configuration assistants fail during installation, then you must clean up your Oracle Clusterware install, enable automounting on all nodes, reboot all nodes, and then start the Oracle Clusterware install again.

A.5.3 Symbolic links for disks were not removed

When you stamp a disk with ASMTOOL, it creates symbolic links for the disks. If these links are not removed when the disk is deleted or reconfigured, then errors can occur when attempting to access the disks.

To correct the problem, you can try stamping the disks again with ASMTOOL.

A.5.4 Discovery string used by Oracle Automatic Storage Management is incorrect

When specifying Oracle Automatic Storage Management (Oracle ASM) for storage, you have the option of changing the default discovery string used to locate the disks. If the discovery string is set incorrectly, then Oracle ASM cannot locate the disks.
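
To illustrate how a mistyped discovery string can hide every disk, the following Python sketch approximates wildcard matching with the standard fnmatch module. This is an assumption made for illustration only: Oracle ASM performs its own discovery, and the disk names and discovery strings shown are hypothetical values modeled on the stamped name format used elsewhere in this appendix.

```python
from fnmatch import fnmatchcase

# Hypothetical stamped disk names, for illustration only.
stamped = [r"\\.\ORCLDISKDATA0", r"\\.\ORCLDISKDATA1"]

def matching_disks(discovery_string, disk_names):
    """Return the disk names that the wildcard pattern would match."""
    return [d for d in disk_names if fnmatchcase(d, discovery_string)]

print(len(matching_disks(r"\\.\ORCLDISK*", stamped)))  # 2
print(len(matching_disks(r"\\.\ORCLDSK*", stamped)))   # 0, a one-letter typo matches nothing
```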

A.5.5 You used a period in a node name during Oracle Clusterware install

Periods (.) are not permitted in node names. Instead, use a hyphen (-).

To resolve a failed installation, remove traces of the Oracle Grid Infrastructure installation, and reinstall with a supported node name.
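
Node names can be screened for periods before the installation starts. The following Python sketch is a minimal illustration, and the function name and sample names are hypothetical.

```python
def invalid_node_names(names):
    """Return the node names that contain a period."""
    return [name for name in names if "." in name]

# Hypothetical node names, for illustration only.
print(invalid_node_names(["rac-node1", "rac.node2"]))  # ['rac.node2']
```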

A.5.6 Ignoring upgrade failure of ocr(-1073740972)

This error indicates that the user who is performing the installation does not have Administrator privileges.

A.6 Performing Cluster Diagnostics During Oracle Grid Infrastructure Installations

If the installer does not display the Node Selection page, then use the following command syntax to check the integrity of the Cluster Manager:

cluvfy comp clumgr -n node_list -verbose

In the preceding syntax example, the variable node_list is the list of nodes in your cluster, separated by commas.
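
As a sketch of how the node list is formed, the following Python fragment assembles the command from a list of node names. The function name and node names are hypothetical, and the fragment only builds the command line without running it.

```python
def cluvfy_clumgr_command(nodes):
    """Build the argument list for the Cluster Manager integrity check."""
    return ["cluvfy", "comp", "clumgr", "-n", ",".join(nodes), "-verbose"]

# Hypothetical node names, for illustration only.
print(" ".join(cluvfy_clumgr_command(["node1", "node2"])))
# cluvfy comp clumgr -n node1,node2 -verbose
```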

Note:

If you encounter unexplained installation errors during or after a period when scheduled tasks are run, then your scheduled task may have deleted temporary files before the installation is finished. Oracle recommends that you complete the installation before scheduled tasks are run, or disable scheduled tasks that perform cleanup until after the installation is completed.

A.7 Interconnect Configuration Issues

If you plan to use multiple network interface cards (NICs) for the interconnect, then you should use a third party solution to bond the interfaces at the operating system level. Otherwise, the failure of a single NIC will affect the availability of the cluster node.

If you install Oracle Grid Infrastructure and Oracle RAC, then they must use the same bonded or teamed NICs for the interconnect. If you use bonded or teamed NICs, then they must be on the same subnet.

If you encounter errors, then perform the following system checks:

Verify with your network providers that they are using the correct cables (length, type) and software on their switches. In some cases, to avoid bugs that cause disconnects under load, or to support additional features such as Jumbo Frames, you may need a firmware upgrade on interconnect switches, or you may need a newer NIC driver or firmware at the operating system level. Running without such fixes can cause later instability in Oracle RAC databases, even though the initial installation seems to work.
Review virtual local area network (VLAN) configurations, duplex settings, and auto-negotiation in accordance with vendor and Oracle recommendations.

A.8 Storage Configuration Issues

The following is a list of issues involving storage configuration:

Recovery from Losing a Node File System or Grid Home
Oracle ASM Storage Issues

A.8.1 Recovery from Losing a Node File System or Grid Home

With Oracle Clusterware release 11.2 and later, if you remove a file system by mistake, or encounter another storage configuration issue that results in losing the Oracle Local Registry or otherwise corrupting a node, you can recover the node in one of two ways:

Restore the node from an operating system level backup (preferred)
Remove the node from the cluster, and then add the node to the cluster. With Oracle Clusterware release 11.2 and later clusters, profile information for the cluster is copied to the node, and the node is restored.

The feature that enables cluster nodes to be removed and added again, so that they can be restored from the remaining nodes in the cluster, is called Grid Plug and Play (GPnP). Grid Plug and Play eliminates per-node configuration data and the need for explicit add and delete nodes steps. This allows a system administrator to take a template system image and run it on a new node with no further configuration. This removes many manual operations, reduces the opportunity for errors, and encourages configurations that can be changed easily. Removal of the per-node configuration makes the nodes easier to replace, because they do not need to contain individually-managed state.

Grid Plug and Play reduces the cost of installing, configuring, and managing database nodes by making their per-node state disposable. It allows nodes to be easily replaced with regenerated state.

Initiate recovery of a node using the addNode.bat command, similar to the following, where lostnode is the node that you are adding back to the cluster:

If you are using Grid Naming Service (GNS):

C:\Grid_home\oui\bin> addNode.bat -silent "CLUSTER_NEW_NODES=lostnode"

If you are not using GNS:

C:\Grid_home\oui\bin> addNode.bat -silent "CLUSTER_NEW_NODES={lostnode}" 
"CLUSTER_NEW_VIRTUAL_HOSTNAMES={lostnode-vip}"

You must run the addNode.bat command as an Administrator user on the node that you are restoring, to recreate OCR keys and to perform other configuration tasks.
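
The two command forms shown above differ only in their parameters. The following Python sketch assembles the arguments for either case and mirrors the examples in this section exactly as they are written. The function name is hypothetical, and the fragment builds the argument list without running addNode.bat.

```python
def addnode_args(node, vip=None):
    """Build addNode.bat arguments; pass vip only when GNS is not in use."""
    args = ["addNode.bat", "-silent"]
    if vip is None:
        # Form shown for clusters that use GNS.
        args.append(f"CLUSTER_NEW_NODES={node}")
    else:
        # Form shown for clusters without GNS: braces plus a virtual host name.
        args.append(f"CLUSTER_NEW_NODES={{{node}}}")
        args.append(f"CLUSTER_NEW_VIRTUAL_HOSTNAMES={{{vip}}}")
    return args

print(addnode_args("lostnode"))
print(addnode_args("lostnode", "lostnode-vip"))
```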

After the addNode.bat command finishes, run the following command on the node being added to the cluster:

C:\>Grid_home\crs\config\gridconfig.bat

A.8.2 Oracle ASM Storage Issues

This section describes Oracle ASM storage error messages, and how to address these errors.

O/S-Error: (OS-2) The system cannot find the file specified.

Cause: If a disk is disabled at the operating system level and enabled again, some of the Oracle ASM operations such as CREATE DISKGROUP, MOUNT DISKGROUP, ADD DISK, ONLINE DISK, or querying V$ASM_DISK fail with the error:

OS Error: (OS-2) The system cannot find the file specified.

This happens when a previously mounted disk is assigned a new volume ID by the operating system. When Oracle ASM uses the old volume ID, it fails to open the disk and signals the above error.

Action: Use ASMTOOL to restamp the disk and update the volume ID used by Oracle ASM.

Unable to mount disk group; ASM discovered an insufficient number of disks for diskgroup

Cause: You performed an Oracle Grid Infrastructure software-only installation, and want to configure a disk group for storing the OCR and voting disk files during the postinstallation configuration of Oracle Clusterware. You used ASMTOOL to stamp the disks, and then used ASMCMD to create a disk group using the stamped disks. You then updated the crsconfig_params file with the disk device or disk partition names that constitute the Oracle ASM disk group. During the configuration of Oracle Clusterware, errors are displayed such as ORA-15017: diskgroup "DATA" cannot be mounted.

Action: Change the crsconfig_params file to use the stamped names generated by ASMTOOL instead of the disk partition names, for example:

"\\.\ORCLDISKDATA0"

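A list of disk names intended for the crsconfig_params file can be screened for entries that do not look like stamped names. The following Python sketch is illustrative only. The prefix is inferred from the single stamped name example in this section, which is an assumption, and the function name and the sample partition name are hypothetical.

```python
# Assumption: stamped names start with this prefix, as in the example above.
STAMP_PREFIX = r"\\.\ORCLDISK"

def unstamped_names(names):
    """Return the names that do not start with the assumed stamp prefix."""
    return [n for n in names if not n.upper().startswith(STAMP_PREFIX)]

# Hypothetical entries, for illustration only.
candidates = [r"\\.\ORCLDISKDATA0", r"\Device\Harddisk1\Partition1"]
print(unstamped_names(candidates))
```

Any name returned by the function is a candidate for replacement with the stamped name that ASMTOOL generated for that disk.
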
A.9 Troubleshooting Windows Firewall Exceptions

If you cannot establish certain connections even after granting exceptions to the executables listed in Chapter 5, "Oracle Grid Infrastructure Postinstallation Procedures," then follow these steps to troubleshoot the installation:

Examine Oracle configuration files (such as *.conf files), the Oracle key in the Windows registry, and network configuration files in %ORACLE_HOME%\network\admin.
Pay particular attention to any executable listed in %ORACLE_HOME%\network\admin\listener.ora in a PROGRAM= clause. Each of these must be granted an exception in the Windows Firewall, because a connection can be made through the TNS listener to that executable.
Examine Oracle trace files, log files, and other sources of diagnostic information for details on failed connection attempts. Log and trace files on the database client computer may contain useful error codes or troubleshooting information for failed connection attempts. The Windows Firewall log file on the server may contain useful information as well.
If the preceding troubleshooting steps do not resolve a specific configuration issue on Windows, then provide the output from the following command to Oracle Support for diagnosis and problem resolution:
```
netsh firewall show state verbose=enable
```
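
The review of PROGRAM= clauses in listener.ora described in the steps above can be partly automated. The following Python sketch extracts the value of each PROGRAM parameter with a regular expression. It is a simplified illustration that does not parse the full listener.ora syntax, and the function name and sample file content are hypothetical.

```python
import re

# Matches "(PROGRAM = value)" with optional spaces, in any letter case.
PROGRAM_RE = re.compile(r"\(\s*PROGRAM\s*=\s*([^)]+?)\s*\)", re.IGNORECASE)

def listener_programs(listener_ora_text):
    """Return the executables named in PROGRAM= clauses."""
    return PROGRAM_RE.findall(listener_ora_text)

# Hypothetical listener.ora content, for illustration only.
sample = r"""
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = PLSExtProc)
      (ORACLE_HOME = C:\app\oracle\product\11.2.0\dbhome_1)
      (PROGRAM = extproc)
    )
  )
"""
print(listener_programs(sample))  # ['extproc']
```

Each executable returned by the function is a candidate for a Windows Firewall exception.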

See Also:

Section 5.1.2, "Configure Exceptions for the Windows Firewall"
http://www.microsoft.com/downloads/details.aspx?FamilyID=a7628646-131d-4617-bf68-f0532d8db131&displaylang=en for information on Windows Firewall troubleshooting
http://support.microsoft.com/default.aspx?scid=kb;en-us;875357 for more information on Windows Firewall configuration