Oracle® Grid Infrastructure Installation Guide
11g Release 2 (11.2) for Solaris Operating System

Part Number E17213-07

A Troubleshooting the Oracle Grid Infrastructure Installation Process

This appendix provides troubleshooting information for installing Oracle Grid Infrastructure.

See Also:

The Oracle Database 11g Oracle RAC documentation set in the Documentation directory.

This appendix contains the following topics:

  • General Installation Issues

  • Interpreting CVU "Unknown" Output Messages Using Verbose Mode

  • Interpreting CVU Messages About Oracle Grid Infrastructure Setup

  • About the Oracle Clusterware Alert Log

  • Performing Cluster Diagnostics During Oracle Grid Infrastructure Installations

  • Interconnect Configuration Issues

  • Storage Configuration Issues

A.1 General Installation Issues

The following is a list of errors that can occur during installation:

An error occurred while trying to get the disks
Cause: There is an entry in /var/opt/oracle/oratab pointing to a non-existent Oracle home. The OUI log file should show the following error: "java.io.IOException: /home/oracle/OraHome/bin/kfod: not found"
Action: Remove the entry in /var/opt/oracle/oratab that points to the non-existent Oracle home.
Could not execute auto check for display colors using command /usr/X11R6/bin/xdpyinfo
Cause: Either the DISPLAY variable is not set, or the user running the installation is not authorized to open an X window. This can occur if you run the installation from a remote terminal, or if you use an su command to change from a user that is authorized to open an X window to a user account that is not authorized to open an X window on the display, such as a lower-privileged user opening windows on the root user's console display.
Action: Run the command echo $DISPLAY to ensure that the variable is set to the correct visual or to the correct host. If the display variable is set correctly then either ensure that you are logged in as the user authorized to open an X window, or run the command xhost + to allow any user to open an X window.

If you are logged in locally on the server console as root, and used the su - command to change to the Oracle Grid Infrastructure installation owner, then log out of the server, and log back in as the grid installation owner.

CRS-5823:Could not initialize agent framework.
Cause: Installation of Oracle Grid Infrastructure fails when you run root.sh. Oracle Grid Infrastructure fails to start because the local host entry is missing from the hosts file.

The Oracle Grid Infrastructure alert.log file shows the following:

[/oracle/app/grid/bin/orarootagent.bin(11392)]CRS-5823:Could not initialize
agent framework. Details at (:CRSAGF00120:) in
/oracle/app/grid/log/node01/agent/crsd/orarootagent_root/orarootagent_root.log
2010-10-04 12:46:25.857
[ohasd(2401)]CRS-2765:Resource 'ora.crsd' has failed on server 'node01'.

You can verify this as the cause by checking the crsdOUT.log file and finding the following:

Unable to resolve address for localhost:2016
ONS runtime exiting
Fatal error: eONS: eonsapi.c: Aug 6 2009 02:53:02
Action: Add the local host entry in the hosts file.
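
For example, on Oracle Solaris the hosts file is /etc/inet/hosts (with /etc/hosts as a link to it). A minimal sketch of the required entries is shown below; the node address and names are placeholders, so substitute your own:

127.0.0.1       localhost
192.0.2.10      node01  node01.example.com
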
Failed to connect to server, Connection refused by server, or Can't open display
Cause: These are typical of X Window display errors on Windows or UNIX systems, where xhost is not properly configured, or where you are running as a user account that is different from the account you used with the startx command to start the X server.
Action: In a local terminal window, log in as the user that started the X Window session, and enter the following command:

$ xhost fullyqualifiedRemoteHostname

For example:

$ xhost somehost.example.com

Then, enter the following commands, where workstationname is the host name or IP address of your workstation.

Bourne, Bash, or Korn shell:

$ DISPLAY=workstationname:0.0
$ export DISPLAY

To determine whether X Window applications display correctly on the local system, enter the following command:

$ xclock

The X clock should appear on your monitor. If this fails to work, then use of the xhost command may be restricted.

If you are using a VNC client to access the server, then ensure that you are accessing the visual assigned to the user that you are trying to use for the installation. For example, if you used the su command to become the installation owner while connected to another user's visual, and use of the xhost command is restricted, then you cannot use the xhost command to change the display. If you use the visual assigned to the installation owner, then the correct display is available, and entering the xclock command displays the X clock.

When the X clock appears, close it and start the installer again.

Failed to initialize ocrconfig
Cause: You have the wrong options configured for NFS in the /etc/vfstab file.

You can confirm this by checking the ocrconfig.log file, located in the path Grid_home/log/nodenumber/client, and finding the following:

2007-10-30 11:23:52.101: [ OCROSD][3085960896]utopen:6'': OCR location
/u02/app/grid/clusterregistry, ret -1, errno 75, os err string Value too large
for defined data type
Action: For file systems mounted on NFS, provide the correct mount configuration for NFS mounts in the /etc/vfstab file:
rw,sync,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0

Note:

You should not have netdev in the mount instructions, or vers=2. The netdev option is only required for OCFS file systems, and vers=2 forces the kernel to mount NFS using the older version 2 protocol.

After correcting the NFS mount information, remount the NFS mount point, and run the root.sh script again. For example, with the mount point /u02, and with the Oracle Grid Infrastructure home set to $Grid_home:

# umount /u02
# mount -a -F nfs
# cd $Grid_home
# sh root.sh
INS-32026 INSTALL_COMMON_HINT_DATABASE_LOCATION_ERROR
Cause: The location selected for the Grid home for a cluster installation is located under an Oracle base directory.
Action: For Oracle Grid Infrastructure for a Cluster installations, the Grid home must not be placed under one of the Oracle base directories, or under Oracle home directories of Oracle Database installation owners, or in the home directory of an installation owner. During installation, ownership of the path to the Grid home is changed to root. This change causes permission errors for other installations. In addition, the Oracle Clusterware software stack may not come up under an Oracle base path.
Nodes unavailable for selection from the OUI Node Selection screen
Cause: Oracle Grid Infrastructure is either not installed, or the Oracle Grid Infrastructure services are not up and running.
Action: Install Oracle Grid Infrastructure, or review the status of your installation. Consider restarting the nodes, as doing so may resolve the problem.
Node nodename is unreachable
Cause: Unavailable IP host.
Action: Attempt the following:
  1. Run the shell command ifconfig -a. Compare the output of this command with the contents of the /etc/hosts file to ensure that the node IP is listed.

  2. Run the shell command nslookup to see if the host is reachable, as shown in the example following this list.
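
For example, the following commands, run on a node that reports the error, show the configured interfaces, confirm that the node appears in the hosts file, and test name resolution. The name nodename is a placeholder for the unreachable node:

$ ifconfig -a
$ grep nodename /etc/hosts
$ nslookup nodename
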

projadd: Duplicate project name "user.grid"
Cause: The fixup script failed during an earlier run, and the project names it created still exist. You cannot run the fixup script again until you delete the projects created by the unsuccessful run.
Action: Do the following:
  1. Log in as root.

  2. Use a command similar to the following to delete the project that the fixup script created (in this case user.grid):

    # /usr/sbin/projdel "user.grid"
    
  3. Run the fixup script again.

PROT-8: Failed to import data from specified file to the cluster registry
Cause: Insufficient space in an existing Oracle Cluster Registry device partition, which causes a migration failure while running rootupgrade.sh. To confirm, look for the error "utopen:12:Not enough space in the backing store" in the log file Grid_home/log/hostname/client/ocrconfig_pid.log, where Grid_home is the Oracle Grid Infrastructure home path and hostname is the name of the server.
Action: Identify a storage device that has 280 MB or more available space. Oracle recommends that you allocate the entire disk to Oracle ASM.
PRVE-0038 : The SSH LoginGraceTime setting, or fatal: Timeout before authentication
Cause: PRVE-0038: The SSH LoginGraceTime setting on node "nodename" may result in users being disconnected before login is completed. This error can occur if the default timeout value for SSH connections is too low, or if the LoginGraceTime parameter is commented out.
Action: Oracle recommends uncommenting the LoginGraceTime parameter in the OpenSSH configuration file /etc/ssh/sshd_config, and setting it to a value of 0 (unlimited).
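
For example, a minimal sketch of the change is the following line in /etc/ssh/sshd_config:

LoginGraceTime 0

After saving the file, restart the SSH daemon as root so that the new setting takes effect. The following command assumes the standard Oracle Solaris SMF service name for SSH:

# svcadm restart svc:/network/ssh:default
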
Timed out waiting for the CRS stack to start
Cause: If a configuration issue prevents the Oracle Grid Infrastructure software from installing successfully on all nodes, then you may see error messages such as "Timed out waiting for the CRS stack to start," or you may notice that Oracle Clusterware-managed resources were not created on some nodes after you exit the installer. You also may notice that resources have a status other than ONLINE.
Action: Deconfigure the Oracle Grid Infrastructure installation without removing binaries, and review log files to determine the cause of the configuration issue. After you have fixed the configuration issue, rerun the scripts used during installation to configure Oracle Clusterware.
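
For example, assuming a Grid home of /u01/app/11.2.0/grid (an example path), a sketch of the deconfigure step, run as root on each affected node, is:

# /u01/app/11.2.0/grid/crs/install/rootcrs.pl -deconfig -force

After you have fixed the configuration issue, run the root.sh script again:

# /u01/app/11.2.0/grid/root.sh
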
YPBINDPROC_DOMAIN: Domain not bound
Cause: This error can occur during postinstallation testing when the public network interconnect for a node is pulled out, and the VIP does not fail over. Instead, the node hangs, and users are unable to log in to the system. This error occurs when the Oracle home, listener.ora, Oracle log files, or any action scripts are located on an NAS device or NFS mount, and the name service cache daemon nscd has not been activated.
Action: Start the nscd name service cache daemon on all nodes in the cluster. On Oracle Solaris, nscd is managed by the Service Management Facility (SMF); for example, on Solaris 10 you can enable it as root with the following command:
# svcadm enable svc:/system/name-service-cache:default

A.2 Interpreting CVU "Unknown" Output Messages Using Verbose Mode

If you run Cluster Verification Utility using the -verbose argument, and a Cluster Verification Utility command responds with UNKNOWN for a particular node, then this is because Cluster Verification Utility cannot determine if a check passed or failed. The following is a list of possible causes for an "Unknown" response:

A.3 Interpreting CVU Messages About Oracle Grid Infrastructure Setup

If the Cluster Verification Utility report indicates that your system fails to meet the requirements for Oracle Grid Infrastructure installation, then use the topics in this section to correct the problem or problems indicated in the report, and run Cluster Verification Utility again.

User Equivalence Check Failed
Cause: Failure to establish user equivalency across all nodes. This can be due to not creating the required users, or failing to complete secure shell (SSH) configuration properly.
Action: Cluster Verification Utility provides a list of nodes on which user equivalence failed.

For each node listed as a failure node, review the installation owner user configuration to ensure that the user configuration is properly completed, and that SSH configuration is properly completed. The user that runs the Oracle Clusterware installation must have permissions to create SSH connections.

Oracle recommends that you use the SSH configuration option in OUI to configure SSH. You can use Cluster Verification Utility before installation if you configure SSH manually, or after installation, when SSH has been configured for installation.

For example, to check user equivalency for the user account oracle, use the command su - oracle and check user equivalence manually by running the ssh command on the local node with the date command argument using the following syntax:

$ ssh nodename date

The output from this command should be the timestamp of the remote node identified by the value that you use for nodename. If you are prompted for a password, then you need to configure SSH. If ssh is in the default location, the /usr/bin directory, then use ssh to configure user equivalence. You can also use rsh to confirm user equivalence.

If you see a message similar to the following when entering the date command with SSH, then this is the probable cause of the user equivalence error:

The authenticity of host 'node1 (140.87.152.153)' can't be established.
RSA key fingerprint is 7z:ez:e7:f6:f4:f2:4f:8f:9z:79:85:62:20:90:92:z9.
Are you sure you want to continue connecting (yes/no)?

Enter yes, and then run Cluster Verification Utility to determine if the user equivalency error is resolved.

If ssh is in a location other than the default, /usr/bin, then Cluster Verification Utility reports a user equivalence check failure. To avoid this error, navigate to the directory Grid_home/cv/admin, open the file cvu_config with a text editor, and add or update the key ORACLE_SRVM_REMOTESHELL to indicate the ssh path location on your system. For example:

# Locations for ssh and scp commands
ORACLE_SRVM_REMOTESHELL=/usr/local/bin/ssh
ORACLE_SRVM_REMOTECOPY=/usr/local/bin/scp

Note the following rules for modifying the cvu_config file:

  • Key entries have the syntax name=value

  • Each key entry and the value assigned to the key defines one property only

  • Lines beginning with the number sign (#) are comment lines, and are ignored

  • Lines that do not follow the syntax name=value are ignored

When you have changed the path configuration, run Cluster Verification Utility again. If ssh is in a location other than the default, then you must also start OUI with additional arguments to specify a different location for the remote shell and remote copy commands. Enter runInstaller -help to obtain information about how to use these arguments.

Note:

When you or OUI run ssh or rsh commands, including any login or other shell scripts they start, you may see errors about invalid arguments or standard input if the scripts generate any output. You should correct the cause of these errors.

To stop the errors, remove all commands from the oracle user's login scripts that generate output when you run ssh or rsh commands.

If you see messages about X11 forwarding, then complete the task "Setting Display and X11 Forwarding Configuration" to resolve this issue.

You might see errors similar to the following:

stty: standard input: Invalid argument
stty: standard input: Invalid argument

These errors are produced if hidden files on the system (for example, .bashrc or .cshrc) contain stty commands. If you see these errors, then refer to Chapter 2, "Preventing Installation Errors Caused by Terminal Output Commands" to correct the cause of these errors.
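
One common fix, shown here as a sketch for a Bourne or Bash login script such as .bashrc or .profile, is to run stty and similar terminal commands only when the shell is attached to a terminal:

if [ -t 0 ]; then     # true only when standard input is a terminal
    stty intr ^C
fi
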

Node Reachability Check or Node Connectivity Check Failed
Cause: One or more nodes in the cluster cannot be reached using TCP/IP protocol, through either the public or private interconnects.
Action: Use the command /bin/ping address to check each node address. When you find an address that cannot be reached, check your list of public and private addresses to make sure that you have them correctly configured. If you use third-party vendor clusterware, then refer to the vendor documentation for assistance. Ensure that the public and private network interfaces have the same interface names on each node of your cluster.
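
For example, a minimal reachability check run from one node (and repeated from every node) might look like the following, where the host names are placeholders for your public and private node names:

for host in node1 node1-priv node2 node2-priv
do
    /bin/ping $host
done

On Oracle Solaris, ping typically reports either that a host "is alive" or that there is "no answer" from it.
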
User Existence Check or User-Group Relationship Check Failed
Cause: The administrative privileges for users and groups required for installation are missing or incorrect.
Action: Use the id command on each node to confirm that the installation owner user (for example, grid or oracle) is created with the correct group membership. Ensure that you have created the required groups, and create or modify the user account on affected nodes to establish required group membership.
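
For example, the following commands verify and then correct the group membership of a grid installation owner on a node; the user and group names are examples only, so substitute the names you created for your installation, and run usermod as root:

$ id grid
# /usr/sbin/usermod -g oinstall -G asmadmin,asmdba grid
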

See Also:

Section 2.4, "Creating Groups, Users and Paths for Oracle Grid Infrastructure" in Chapter 2 for instructions about how to create required groups, and how to configure the installation owner user

A.4 About the Oracle Clusterware Alert Log

The Oracle Clusterware alert log is the first place to look for serious errors. In the event of an error, it can contain path information to diagnostic logs that can provide specific information about the cause of errors.

After installation, Oracle Clusterware posts alert messages when important events occur. For example, you might see alert messages from the Cluster Ready Services (CRS) daemon process when it starts, if it aborts, if the failover process fails, or if automatic restart of a CRS resource fails.

Oracle Enterprise Manager monitors the Clusterware log file and posts an alert on the Cluster Home page if an error is detected. For example, if a voting disk is not available, a CRS-1604 error is raised, and a critical alert is posted on the Cluster Home page. You can customize the error detection and alert settings on the Metric and Policy Settings page.

The location of the Oracle Clusterware log file is CRS_home/log/hostname/alerthostname.log, where CRS_home is the directory in which Oracle Clusterware was installed and hostname is the host name of the local node.
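
For example, with a Grid home of /u01/app/11.2.0/grid (an example path) on a node named node01, you can review the most recent entries in the alert log with a command similar to the following:

$ tail -100 /u01/app/11.2.0/grid/log/node01/alertnode01.log
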

A.5 Performing Cluster Diagnostics During Oracle Grid Infrastructure Installations

If the installer does not display the Node Selection page, then use the following command syntax to check the integrity of the Cluster Manager:

cluvfy comp clumgr -n node_list -verbose

In the preceding syntax example, the variable node_list is the list of nodes in your cluster, separated by commas.
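
For example, for a two-node cluster with nodes named node1 and node2 (example names), you would enter:

cluvfy comp clumgr -n node1,node2 -verbose
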

Note:

If you encounter unexplained installation errors during or after a period when cron jobs are run, then your cron job may have deleted temporary files before the installation is finished. Oracle recommends that you complete installation before daily cron jobs are run, or disable daily cron jobs that perform cleanup until after the installation is completed.

A.6 Interconnect Configuration Issues

If you plan to use multiple network interface cards (NICs) for the interconnect, and you do not configure them during installation or after installation with Redundant Interconnect Usage, then you should use a third-party solution to aggregate the interfaces at the operating system level. Otherwise, the failure of a single NIC will affect the availability of the cluster node.

A.6.1 IP Network Multipathing (IPMP) Issues

On Solaris, if you use IP network multipathing (IPMP) to aggregate multiple interfaces for the public or the private networks, then during installation of Oracle Grid Infrastructure, ensure you identify all interface names aggregated into an IPMP group as interfaces that should be used for the public or private network.

A.6.2 Aggregated NIC Card Issues

If you install Oracle Grid Infrastructure and Oracle RAC, then they must use the same NIC or aggregated NIC cards for the interconnect.

If you use aggregated NIC cards, then they must be on the same subnet.

If you encounter errors, then carry out the following system checks:

  • Verify with your network providers that they are using correct cables (length, type) and software on their switches. In some cases, to avoid bugs that cause disconnects under loads, or to support additional features such as Jumbo Frames, you may need a firmware upgrade on interconnect switches, or you may need a newer NIC driver or firmware at the operating system level. Running without such fixes can cause later instability in Oracle RAC databases, even though the initial installation seems to work.

  • Review VLAN configurations, duplex settings, and auto-negotiation in accordance with vendor and Oracle recommendations.
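
As a starting point for these checks on Oracle Solaris, commands similar to the following display link state, speed, and duplex settings for the network interfaces (dladm show-dev applies to Solaris 10; on Solaris 11, dladm show-phys provides similar output):

$ dladm show-link
$ dladm show-dev
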

A.6.3 Highly Available IP Address (HAIP) Issues

AGFW Could not find the resource type [ ora.haip.type ]
Cause: The private interconnect interfaces are IPMP group members, but HAIP is not supported on Solaris 11 for use with the private interconnect.
Action: No action needed. If you want to have HAIP support, then you must reinstall Oracle Grid Infrastructure and designate interfaces that are not IPMP group members for private interconnect use.

This error can occur with Solaris 11 configurations.

A.7 Storage Configuration Issues

The following is a list of issues involving storage configuration:

A.7.1 Recovery from Losing a Node Filesystem or Grid Home

With Oracle Clusterware release 11.2 and later, if you remove a filesystem by mistake, or encounter another storage configuration issue that results in losing the Oracle Local Registry or otherwise corrupting a node, you can recover the node in one of two ways:

  1. Restore the node from an operating system level backup (preferred)

  2. Remove the node, and then add it again. With 11.2 and later clusters, the profile information for the node is copied to it during the add operation, and the node is restored.

The feature that enables cluster nodes to be removed and added again, so that they can be restored from the remaining nodes in the cluster, is called Grid Plug and Play (GPnP). Grid Plug and Play eliminates per-node configuration data and the need for explicit add and delete nodes steps. This allows a system administrator to take a template system image and run it on a new node with no further configuration. This removes many manual operations, reduces the opportunity for errors, and encourages configurations that can be changed easily. Removal of the per-node configuration makes the nodes easier to replace, because they do not need to contain individually-managed state.

Grid Plug and Play reduces the cost of installing, configuring, and managing database nodes by making their per-node state disposable. It allows nodes to be easily replaced with regenerated state.

Initiate recovery of a node using addnode syntax similar to the following, where lostnode is the node that you are adding back to the cluster:

If you are using Grid Naming Service (GNS):

$ ./addNode.sh -silent "CLUSTER_NEW_NODES=lostnode"

If you are not using GNS:

$ ./addNode.sh -silent "CLUSTER_NEW_NODES={lostnode}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={lostnode-vip}"

Note that you must have root access to run the root.sh script on the node that you restore, to re-create OCR keys and to perform other configuration tasks. When you see prompts to overwrite existing files in /usr/local/bin, accept the default (n):

The file "dbhome" already exists in /usr/local/bin. Overwrite it? (y/n) [n]:
The file "oraenv" already exists in /usr/local/bin. Overwrite it? (y/n) [n]:
The file "coraenv" already exists in /usr/local/bin. Overwrite it? (y/n) [n]: