A Troubleshooting the Oracle Grid Infrastructure Installation Process

This appendix provides troubleshooting information for installing Oracle Grid Infrastructure.

See Also:

The Oracle Database 11g Oracle RAC documentation set included with the installation media in the Documentation directory:

Oracle Clusterware Administration and Deployment Guide
Oracle Real Application Clusters Administration and Deployment Guide

This appendix contains the following topics:

General Installation Issues
Interpreting CVU "Unknown" Output Messages Using Verbose Mode
Interpreting CVU Messages About Oracle Grid Infrastructure Setup
About the Oracle Grid Infrastructure Alert Log
Troubleshooting Issues on AIX
Performing Cluster Diagnostics During Oracle Grid Infrastructure Installations
About Using CVU Cluster Healthchecks After Installation
Interconnect Configuration Issues
SCAN VIP and SCAN Listener Issues
Storage Configuration Issues
Failed or Incomplete Installations and Upgrades

A.1 Best Practices for Contacting Oracle Support

If you need to contact Oracle Support to report an issue, then Oracle recommends that you follow these guidelines when you enter your service request:

Provide a clear explanation of the problem, including exact error messages.
Provide an explanation of any steps you have taken to troubleshoot issues, and the results of these steps.
Provide exact versions (major release and patch release) of the affected software.
Provide a step-by-step procedure of what actions you carried out when you encountered the problem, so that Oracle Support can reproduce the problem.
Provide an evaluation of the effect of the issue, including affected deadlines and costs.
Provide screen shots, logs, Remote Diagnostic Agent (RDA) output, or other relevant information.

A.2 General Installation Issues

The following is a list of examples of types of errors that can occur during installation. It contains the following issues:

root.sh failed to complete with error messages such as: Start of resource "ora.cluster_interconnect.haip" failed...
An error occurred while trying to get the disks
Could not execute auto check for display colors using command /usr/X11R6/bin/xdpyinfo
CRS-5823:Could not initialize agent framework.
Failed to connect to server, Connection refused by server, or Can't open display
Nodes unavailable for selection from the OUI Node Selection screen
Node nodename is unreachable
PROT-8: Failed to import data from specified file to the cluster registry
PRVE-0038 : The SSH LoginGraceTime setting
Time stamp is in the future
Timed out waiting for the CRS stack to start

root.sh failed to complete with error messages such as: Start of resource "ora.cluster_interconnect.haip" failed...

Cause: When configuring public and private network interfaces for Oracle RAC, you must enable ARP. Highly Available IP (HAIP) addresses do not require ARP on the public network, but for VIP failover, you will need to enable ARP. Do not configure NOARP.

Action: Configure the hsi0 (or eth) device to use ARP by running the following command:

# ifconfig hsi0 arp

An error occurred while trying to get the disks

Cause: There is an entry in /etc/oratab pointing to a non-existent Oracle home. The OUI log file should show the following error: "java.io.IOException: /home/oracle/OraHome//bin/kfod: not found"

Action: Remove the entry in /etc/oratab pointing to the non-existent Oracle home.
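To find the offending entry, you can compare each Oracle home listed in /etc/oratab with the directories that exist on the node. The following is a minimal sketch, not an Oracle-supplied tool: the function name find_stale_oratab_entries is hypothetical, and it assumes the usual SID:ORACLE_HOME:FLAG line layout.

```shell
# List oratab entries whose Oracle home directory does not exist.
# Assumes the common SID:ORACLE_HOME:FLAG layout; the function name is hypothetical.
find_stale_oratab_entries() {
    oratab_file=${1:-/etc/oratab}
    # Drop comment lines and blank lines before splitting fields on ":".
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$oratab_file" |
    while IFS=: read -r sid home flag; do
        if [ -n "$home" ] && [ ! -d "$home" ]; then
            printf '%s:%s\n' "$sid" "$home"
        fi
    done
}
```

Running find_stale_oratab_entries with no argument reads /etc/oratab and prints one SID:home pair for each entry to review. It does not modify the file.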

Could not execute auto check for display colors using command /usr/X11R6/bin/xdpyinfo

Cause: Either the DISPLAY variable is not set, or the user running the installation is not authorized to open an X window. This can occur if you run the installation from a remote terminal, or if you use an su command to change from a user that is authorized to open an X window to a user account that is not authorized to open an X window on the display, such as a lower-privileged user opening windows on the root user's console display.

Action: Run the command echo $DISPLAY to ensure that the variable is set to the correct visual or to the correct host. If the display variable is set correctly then either ensure that you are logged in as the user authorized to open an X window, or run the command xhost + to allow any user to open an X window.

If you are logged in locally on the server console as root, and used the su - command to change to the Oracle Grid Infrastructure installation owner, then log out of the server, and log back in as the grid installation owner.

CRS-5823:Could not initialize agent framework.

Cause: Installation of Oracle Grid Infrastructure fails when you run root.sh. Oracle Grid Infrastructure fails to start because the local host entry is missing from the hosts file.

The Oracle Grid Infrastructure alert.log file shows the following:

[/oracle/app/grid/bin/orarootagent.bin(11392)]CRS-5823:Could not initialize
agent framework. Details at (:CRSAGF00120:) in
/oracle/app/grid/log/node01/agent/crsd/orarootagent_root/orarootagent_root.log
2010-10-04 12:46:25.857
[ohasd(2401)]CRS-2765:Resource 'ora.crsd' has failed on server 'node01'.

You can verify this as the cause by checking the crsdOUT.log file for the following:

Unable to resolve address for localhost:2016
ONS runtime exiting
Fatal error: eONS: eonsapi.c: Aug 6 2009 02:53:02

Action: Add the localhost entry in the hosts file.
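Before you rerun root.sh, you can confirm that the hosts file names localhost. The following is a minimal sketch under stated assumptions: the function name has_localhost_entry is hypothetical, and the check only looks for localhost as a host name or alias on a non-comment line.

```shell
# Succeed if the hosts file maps some address to the name "localhost".
# The function name is hypothetical; the file defaults to /etc/hosts.
has_localhost_entry() {
    hosts_file=${1:-/etc/hosts}
    # Strip comments, then look for "localhost" in any field after the address.
    sed 's/#.*//' "$hosts_file" |
    awk '{ for (i = 2; i <= NF; i++) if ($i == "localhost") found = 1 }
         END { exit !found }'
}
```

For example, has_localhost_entry || echo "localhost entry is missing" reports the condition described in the cause.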

Failed to connect to server, Connection refused by server, or Can't open display

Cause: These are typical of X Window display errors on Windows or UNIX systems, where xhost is not properly configured, or where you are running as a user account that is different from the account you used with the startx command to start the X server.

Action: In a local terminal window, log in as the user that started the X Window session, and enter the following command:

$ xhost fully_qualified_remote_host_name

For example:

$ xhost somehost.example.com

Then, enter the following commands, where workstation_name is the host name or IP address of your workstation.

Bourne, Bash, or Korn shell:

$ DISPLAY=workstation_name:0.0
$ export DISPLAY

To determine whether X Window applications display correctly on the local system, enter the following command:

$ xclock

The X clock should appear on your monitor. If xclock is not available, then install it on your system and repeat the test. If xclock is installed on your system, but the X clock fails to open on your display, then use of the xhost command may be restricted.

If you are using a VNC client to access the server, then ensure that you are accessing the visual that is assigned to the user that you are trying to use for the installation. For example, if you used the su command to become the installation owner on another user visual, and the xhost command use is restricted, then you cannot use the xhost command to change the display. If you use the visual assigned to the installation owner, then the correct display is available, and entering the xclock command results in the X clock starting on your display.

When the X clock appears, close the X clock, and start the installer again.

Failed to initialize ocrconfig

Cause: You have the wrong options configured for NFS in the /etc/fstab file.

You can confirm this by checking ocrconfig.log files located in the path Grid_home/log/nodenumber/client and finding the following:

/u02/app/crs/clusterregistry, ret -1, errno 75, os err string Value too large
for defined data type
2007-10-30 11:23:52.101: [ OCROSD][3085960896]utopen:6'': OCR location

Action: For file systems mounted on NFS, provide the correct mount configuration for NFS mounts in the /etc/fstab file:

rw,sync,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0

Note:

You should not have netdev in the mount instructions, or vers=2. The netdev option is only required for OCFS file systems, and vers=2 forces the kernel to mount NFS using the earlier version 2 protocol.
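The options named in this note can be checked mechanically before you remount. The following is a sketch only: the function name check_nfs_options is hypothetical, and treating nfsvers=2 and the _netdev spelling as equivalent to vers=2 and netdev is an assumption about how the options may be written in your /etc/fstab file.

```shell
# Flag the NFS mount options that the note warns about.
# Takes the comma-separated option string from the fstab entry as its argument.
check_nfs_options() {
    opts=",$1,"
    status=0
    case $opts in
        *,vers=2,*|*,nfsvers=2,*)
            echo "vers=2 forces the NFS version 2 protocol"
            status=1 ;;
    esac
    case $opts in
        *,netdev,*|*,_netdev,*)
            echo "netdev option is present"
            status=1 ;;
    esac
    return $status
}
```

The function returns 0 and prints nothing for the recommended option string shown above.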

After correcting the NFS mount information, remount the NFS mount point, and run the root.sh script again. For example, with the mount point /u02:

# umount /u02
# mount -a -t nfs
# cd $GRID_HOME
# sh root.sh

INS-32026 INSTALL_COMMON_HINT_DATABASE_LOCATION_ERROR

Cause: The location selected for the Grid home for a Cluster installation is located under an Oracle base directory.

Action: For Oracle Grid Infrastructure for a cluster installations, the Grid home must not be placed under one of the Oracle base directories, or under Oracle home directories of Oracle Database installation owners, or in the home directory of an installation owner. During installation, ownership of the path to the Grid home is changed to root. This change causes permission errors for other installations. In addition, the Oracle Clusterware software stack may not come up under an Oracle base path.

LSRSC-444: Run root.sh command on the Node with OUI session

Cause: If this message appears listing a node that is not the one where you are running OUI, then the likely cause is that the named node shut down during or before the root.sh script completed its run.

Action: Complete running the root.sh script on all other cluster member nodes, and do not attempt to run the root script on the node named in the error message. After you complete the Oracle Grid Infrastructure installation on all or part of the set of planned cluster member nodes, start OUI and deinstall the failed Oracle Grid Infrastructure installation on the node named in the error. When you have deinstalled the failed installation on the node, add that node manually to the cluster.

See Also:
Oracle Clusterware Administration and Deployment Guide for information about how to add a node

Nodes unavailable for selection from the OUI Node Selection screen

Cause: Oracle Grid Infrastructure is either not installed, or the Oracle Grid Infrastructure services are not up and running.

Action: Install Oracle Grid Infrastructure, or review the status of your Oracle Grid Infrastructure installation. Consider restarting the nodes, as doing so may resolve the problem.

Node nodename is unreachable

Cause: Unavailable IP host

Action: Attempt the following:

Run the shell command ifconfig -a. Compare the output of this command with the contents of the /etc/hosts file to ensure that the node IP is listed.
Run the shell command nslookup to see if the host name resolves to the expected IP address.
As the oracle user, attempt to connect to the node with ssh or rsh. If you are prompted for a password, then user equivalence is not set up properly.

PROT-8: Failed to import data from specified file to the cluster registry

Cause: Insufficient space in an existing Oracle Cluster Registry device partition, which causes a migration failure while running rootupgrade.sh. To confirm, look for the error "utopen:12:Not enough space in the backing store" in the log file Grid_home/log/hostname/client/ocrconfig_pid.log.

Action: Identify a storage device that has 280 MB or more available space. Locate the existing raw device name from /var/opt/oracle/srvConfig.loc, and copy the contents of this raw device to the new device using the command dd.

PRVE-0038 : The SSH LoginGraceTime setting

Cause: PRVE-0038: The SSH LoginGraceTime setting on node "nodename" may result in users being disconnected before login is completed. This error occurs because the default timeout value for SSH connections on AIX is too low, if the LoginGraceTime parameter is commented out.

Action: Oracle recommends uncommenting the LoginGraceTime parameter in the OpenSSH configuration file /etc/ssh/sshd_config, and setting it to a value of 0 (unlimited).

Time stamp is in the future

Cause: One or more nodes have a different clock time than the local node. If this is the case, then you may see output similar to the following:

time stamp 2005-04-04 14:49:49 is 106 s in the future

Action: Ensure that all member nodes of the cluster have the same clock time.
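To see how far each node's clock is from the local clock, you can compare epoch timestamps. The following is a sketch only: the function names are hypothetical, and it assumes that SSH user equivalence is already configured and that the date command on every node supports the +%s format, which may not hold on every platform.

```shell
# Absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    delta=$(( $1 - $2 ))
    if [ "$delta" -lt 0 ]; then
        delta=$(( 0 - delta ))
    fi
    echo "$delta"
}

# Compare the local clock with each node named on the command line.
# Assumes passwordless ssh and a date command that supports +%s.
report_clock_skew() {
    for node in "$@"; do
        remote=$(ssh -o BatchMode=yes "$node" date +%s) || {
            echo "$node: could not read remote time"
            continue
        }
        echo "$node: $(clock_skew "$(date +%s)" "$remote") s"
    done
}
```

For example, report_clock_skew node1 node2 prints one line for each node. The reported value includes the time taken by the ssh round trip, so treat a difference of a second or two as noise.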

Timed out waiting for the CRS stack to start

Cause: If a configuration issue prevents the Oracle Grid Infrastructure software from installing successfully on all nodes, then you may see error messages such as "Timed out waiting for the CRS stack to start," or you may notice that Oracle Grid Infrastructure-managed resources were not created on some nodes after you exit the installer. You also may notice that resources have a status other than ONLINE.

Action: Unconfigure the Oracle Grid Infrastructure installation without removing binaries, and review log files to determine the cause of the configuration issue. After you have fixed the configuration issue, rerun the scripts used during installation to configure Oracle Grid Infrastructure.

See Also:
Unconfiguring Oracle Clusterware Without Removing Binaries

YPBINDPROC_DOMAIN: Domain not bound

Cause: This error can occur during postinstallation testing when a network interconnect for a node is pulled out, and the VIP does not fail over. Instead, the node hangs, and users are unable to log in to the system. This error occurs when the Oracle home, listener.ora, Oracle log files, or any action scripts are located on an NAS device or NFS mount, and the name service cache daemon nscd has not been activated.

Action: Enter the following command on all nodes in the cluster to start the nscd service:

/sbin/service nscd start

A.2.1 Other Installation Issues and Errors

For additional help in resolving error messages, see My Oracle Support. For example, the note with Doc ID 1367631.1 contains some of the most common installation issues for Oracle Grid Infrastructure and Oracle Clusterware.

A.3 Interpreting CVU "Unknown" Output Messages Using Verbose Mode

If you run Cluster Verification Utility using the -verbose argument, and a Cluster Verification Utility command responds with UNKNOWN for a particular node, then this is because Cluster Verification Utility cannot determine if a check passed or failed. The following is a list of possible causes for an "Unknown" response:

The node is down
Common operating system command binaries required by Cluster Verification Utility are missing in the /bin directory in the Oracle Grid Infrastructure home or Oracle home directory
The user account starting Cluster Verification Utility does not have privileges to run common operating system commands on the node
The node is missing an operating system patch, or a required package
The node has exceeded the maximum number of processes or maximum number of open files, or there is a problem with IPC segments, such as shared memory or semaphores

A.4 Interpreting CVU Messages About Oracle Grid Infrastructure Setup

If the Cluster Verification Utility report indicates that your system fails to meet the requirements for Oracle Grid Infrastructure installation, then use the topics in this section to correct the problem or problems indicated in the report, and run Cluster Verification Utility again.

User Equivalence Check Failed
Node Reachability Check or Node Connectivity Check Failed
User Existence Check or User-Group Relationship Check Failed

User Equivalence Check Failed

Cause: Failure to establish user equivalency across all nodes. This can be due to not creating the required users, or failing to complete secure shell (SSH) configuration properly.

Action: Cluster Verification Utility provides a list of nodes on which user equivalence failed.

For each node listed as a failure node, review the installation owner user configuration to ensure that the user configuration is properly completed, and that SSH configuration is properly completed. The user that runs the Oracle Grid Infrastructure installation must have permissions to create SSH connections.

Oracle recommends that you use the SSH configuration option in OUI to configure SSH. You can run Cluster Verification Utility before installation if you configure SSH manually, or after installation, when SSH has been configured for installation.

For example, to check user equivalency for the user account oracle, use the command su - oracle and check user equivalence manually by running the ssh command on the local node with the date command argument using the following syntax:

$ ssh nodename date

The output from this command should be the timestamp of the remote node identified by the value that you use for nodename. If you are prompted for a password, then you need to configure SSH. If ssh is in the default location, the /usr/bin directory, then use ssh to configure user equivalence. You can also use rsh to confirm user equivalence.

If you see a message similar to the following when entering the date command with SSH, then this is the probable cause of the user equivalence error:

The authenticity of host 'node1 (140.87.152.153)' can't be established.
RSA key fingerprint is 7z:ez:e7:f6:f4:f2:4f:8f:9z:79:85:62:20:90:92:z9.
Are you sure you want to continue connecting (yes/no)?

Enter yes, and then run Cluster Verification Utility to determine if the user equivalency error is resolved.
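The manual ssh check can be repeated for every node in one pass. The following is a minimal sketch, not a replacement for Cluster Verification Utility: the function name check_user_equivalence and the SSH_CMD override variable are hypothetical. It relies on the OpenSSH BatchMode option, which makes ssh fail instead of prompting, so a node that would ask for a password or for host key confirmation is reported as failed.

```shell
# Run "date" on each node over ssh without allowing any interactive prompt.
# SSH_CMD is a hypothetical override for testing or a non-default ssh path.
check_user_equivalence() {
    ssh_cmd=${SSH_CMD:-ssh}
    failed=0
    for node in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for input.
        if $ssh_cmd -o BatchMode=yes "$node" date >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return $failed
}
```

Run it as the installation owner, for example check_user_equivalence node1 node2. A nonzero return status means that at least one node needs further SSH configuration.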

If ssh is in a location other than the default, /usr/bin, then Cluster Verification Utility reports a user equivalence check failure. To avoid this error, navigate to the directory Grid_home/cv/admin, open the file cvu_config with a text editor, and add or update the key ORACLE_SRVM_REMOTESHELL to indicate the ssh path location on your system. For example:

# Locations for ssh and scp commands
ORACLE_SRVM_REMOTESHELL=/usr/local/bin/ssh
ORACLE_SRVM_REMOTECOPY=/usr/local/bin/scp

Note the following rules for modifying the cvu_config file:

Key entries have the syntax name=value
Each key entry and the value assigned to the key defines one property only
Lines beginning with the number sign (#) are comment lines, and are ignored
Lines that do not follow the syntax name=value are ignored
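These rules can be mirrored in a small reader to confirm which value a key resolves to after you edit the file. The following is a sketch under stated assumptions: the function name get_cvu_property is hypothetical, and letting the last definition of a key win is an assumption, not documented Cluster Verification Utility behavior.

```shell
# Print the value assigned to one key in a cvu_config-style file.
# Comment lines and lines without "=" are ignored, as the rules describe.
get_cvu_property() {
    key=$1
    file=$2
    awk -v key="$key" '
        /^[ \t]*#/ { next }            # comment line
        index($0, "=") == 0 { next }   # not name=value
        {
            name = substr($0, 1, index($0, "=") - 1)
            value = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == key) result = value   # assumption: last entry wins
        }
        END { if (result != "") print result }
    ' "$file"
}
```

For example, get_cvu_property ORACLE_SRVM_REMOTESHELL Grid_home/cv/admin/cvu_config prints the configured ssh path, where Grid_home stands for your Grid home directory.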

When you have changed the path configuration, run Cluster Verification Utility again. If ssh is in another location than the default, you also need to start OUI with additional arguments to specify a different location for the remote shell and remote copy commands. Enter runInstaller -help to obtain information about how to use these arguments.

Note:

When you or OUI run ssh or rsh commands, including any login or other shell scripts they start, you may see errors about invalid arguments or standard input if the scripts generate any output. You should correct the cause of these errors.

To stop the errors, remove all commands from the oracle user's login scripts that generate output when you run ssh or rsh commands.

If you see messages about X11 forwarding, then complete the task in Section 5.2.4, "Setting Display and X11 Forwarding Configuration" to resolve this issue.

You may also see errors similar to the following:

stty: standard input: Invalid argument
stty: standard input: Invalid argument

These errors are produced if hidden files on the system (for example, .bashrc or .cshrc) contain stty commands. If you see these errors, then refer to Section 5.2.5, "Preventing Installation Errors Caused by Terminal Output Commands" to correct the cause of these errors.

Node Reachability Check or Node Connectivity Check Failed

Cause: One or more nodes in the cluster cannot be reached using TCP/IP protocol, through either the public or private interconnects.

Action: Use the command /bin/ping address to check each node address. When you find an address that cannot be reached, check your list of public and private addresses to make sure that you have them correctly configured. If you use third-party vendor clusterware, then refer to the vendor documentation for assistance. Ensure that the public and private network interfaces have the same interface names on each node of your cluster.

User Existence Check or User-Group Relationship Check Failed

Cause: The administrative privileges for users and groups required for installation are missing or incorrect.

Action: Use the id command on each node to confirm that the installation owner user (for example, grid or oracle) is created with the correct group membership. Ensure that you have created the required groups, and create or modify the user account on affected nodes to establish required group membership.

See Also:
Section 5.1, "Creating Groups, Users and Paths for Oracle Grid Infrastructure" for instructions about how to create required groups, and how to configure the installation owner user

A.5 About the Oracle Grid Infrastructure Alert Log

Oracle Clusterware uses Oracle Database fault diagnosability infrastructure to manage diagnostic data and its alert log. As a result, most diagnostic data resides in the Automatic Diagnostic Repository (ADR), a collection of directories and files located under a base directory that you specify during installation. Starting with Oracle Clusterware 12c release 1 (12.1.0.2), diagnostic data files written by Oracle Clusterware programs are known as trace files and have a .trc file extension, and appear together in the trace subdirectory of the ADR home. Besides trace files, the trace subdirectory in the Oracle Clusterware ADR home contains the simple text Oracle Clusterware alert log. It always has the name alert.log. The alert log is also written as an XML file in the alert subdirectory of the ADR home, but the text alert log is most easily read.

The Oracle Clusterware alert log is the first place to look for serious errors. In the event of an error, it can contain path information to diagnostic logs that can provide specific information about the cause of errors.

After installation, Oracle Clusterware posts alert messages when important events occur. For example, you may see alert messages from the Cluster Ready Services daemon process (CRSD) when it starts, if it aborts, if the failover process fails, or if automatic restart of an Oracle Clusterware resource fails.

Oracle Enterprise Manager monitors the Oracle Clusterware log file and posts an alert on the Cluster Home page if an error is detected. For example, if a voting file is not available, a CRS-1604 error is raised, and a critical alert is posted on the Cluster Home page. You can customize the error detection and alert settings on the Metric and Policy Settings page.

The location of the Oracle Clusterware log file is ORACLE_BASE/diag/crs/hostname/crs/trace/alert.log, where ORACLE_BASE is the Oracle base path you specified when you installed Oracle Grid Infrastructure and hostname is the name of the host.
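The path pattern above can be assembled and searched from the command line. The following is a minimal sketch: both function names are hypothetical, and the assumption that messages of interest carry a CRS-nnnn code comes from the examples in this appendix, not from a complete description of the alert log format.

```shell
# Build the alert log path from the Oracle base path and the host name.
alert_log_path() {
    printf '%s/diag/crs/%s/crs/trace/alert.log\n' "$1" "$2"
}

# Show the most recent lines that carry a CRS message code (default: last 20).
recent_crs_messages() {
    grep -E 'CRS-[0-9]+' "$1" | tail -n "${2:-20}"
}
```

For example, recent_crs_messages "$(alert_log_path /u01/app/grid node01)" 50 lists the last 50 CRS messages, where /u01/app/grid and node01 are placeholder values for your Oracle base path and host name.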

See Also:

Oracle Clusterware Administration and Deployment Guide for information about Oracle Clusterware troubleshooting

A.6 Troubleshooting Issues on AIX

The following issues can occur on IBM AIX release 6.1:

Oracle Universal Installer error INS-13001, "Environment does not meet minimum requirements" and CLUVFY Reports "Reference data is not available for verifying prerequisites on this operating system distribution"

Cause: The operating system is at a supported level, as verified by the following commands:

/bin/oslevel
6.1.4.0
 
/bin/oslevel -s
6100-04-01-0944

This issue is caused by AIX incorrectly reporting the operating system level. In this example, the value returned by /bin/oslevel should be 6.1.0.0.

Action: Install AIX patch IZ64508 to fix the oslevel bug. This action may not be appropriate if the minimum operating system level required is higher than the operating system level for Oracle Grid Infrastructure installation. In this case, you must upgrade the AIX operating system to the required AIX operating system level or higher.

A.7 Performing Cluster Diagnostics During Oracle Grid Infrastructure Installations

If Oracle Universal Installer (OUI) does not display the Node Selection page, then perform clusterware diagnostics by running the olsnodes -v command from the binary directory in your Oracle Grid Infrastructure home (Grid_home/bin on Linux and UNIX-based systems, and Grid_home\BIN on Windows-based systems) and analyzing its output. Refer to your clusterware documentation if the detailed output indicates that your clusterware is not running.

In addition, use the following command syntax to check the integrity of the Cluster Manager:

cluvfy comp clumgr -n node_list -verbose

In the preceding syntax example, the variable node_list is the list of nodes in your cluster, separated by commas.

A.8 About Using CVU Cluster Healthchecks After Installation

Starting with Oracle Grid Infrastructure 11g release 2 (11.2.0.3), you can use the CVU healthcheck command option to check your Oracle Grid Infrastructure and Oracle Database installations for their compliance with mandatory requirements and best practices guidelines, and to ensure that they are functioning properly.

Use the following syntax to run the healthcheck command option:

cluvfy comp healthcheck [-collect {cluster|database}] [-db db_unique_name] [-bestpractice|-mandatory] [-deviations] [-html] [-save [-savedir directory_path]]

For example:

$ cd /home/grid/cvu_home/bin
$ ./cluvfy comp healthcheck -collect cluster -bestpractice -deviations -html

The options are:

-collect [cluster|database]

Use this flag to specify that you want to perform checks for Oracle Grid Infrastructure (cluster) or Oracle Database (database). If you do not use the collect flag with the healthcheck option, then cluvfy comp healthcheck performs checks for both Oracle Grid Infrastructure and Oracle Database.

-db db_unique_name

Use this flag to specify checks on the database unique name that you enter after the db flag.

CVU uses JDBC to connect to the database as the user cvusys to verify various database parameters. For this reason, if you want checks to be performed for the database you specify with the -db flag, then you must first create the cvusys user on that database, and grant that user the CVU-specific role, cvusapp. You must also grant members of the cvusapp role select permissions on system tables.

A SQL script is included in CVU_home/cv/admin/cvusys.sql to facilitate the creation of this user. Use this SQL script to create the cvusys user on all the databases that you want to verify using CVU.

If you use the db flag but do not provide a database unique name, then CVU discovers all the Oracle Databases on the cluster. If you want to perform best practices checks on these databases, then you must create the cvusys user on each database, and grant that user the cvusapp role with the select privileges needed to perform the best practice checks.

[-bestpractice | -mandatory] [-deviations]

Use the bestpractice flag to specify best practice checks, and the mandatory flag to specify mandatory checks. Add the deviations flag to specify that you want to see only the deviations from either the best practice recommendations or the mandatory requirements. You can specify either the -bestpractice or -mandatory flag, but not both flags. If you specify neither -bestpractice nor -mandatory, then both best practices and mandatory requirements are displayed.

-html

Use the html flag to generate a detailed report in HTML format.

If you specify the html flag, and a browser CVU recognizes is available on the system, then the browser is started and the report is displayed on the browser when the checks are complete.

If you do not specify the html flag, then the detailed report is generated in a text file.

-save [-savedir dir_path]

Use the -save or -save -savedir flags to save validation reports (cvucheckreport_timestamp.txt and cvucheckreport_timestamp.htm), where timestamp is the time and date of the validation report.

If you use the save flag by itself, then the reports are saved in the path CVU_home/cv/report, where CVU_home is the location of the CVU binaries.

If you use the flags -save -savedir, and enter a path where you want the CVU reports saved, then the CVU reports are saved in the path you specify.
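Because -bestpractice and -mandatory cannot be combined, a wrapper script can reject that combination before it calls CVU. The following is a sketch only: the function name healthcheck_command is hypothetical, and it prints the command line for review instead of running cluvfy.

```shell
# Print a cluvfy healthcheck command line after checking the one rule
# stated above: -bestpractice and -mandatory are mutually exclusive.
healthcheck_command() {
    seen_bestpractice=0
    seen_mandatory=0
    for arg in "$@"; do
        case $arg in
            -bestpractice) seen_bestpractice=1 ;;
            -mandatory)    seen_mandatory=1 ;;
        esac
    done
    if [ "$seen_bestpractice" -eq 1 ] && [ "$seen_mandatory" -eq 1 ]; then
        echo "specify -bestpractice or -mandatory, not both" >&2
        return 1
    fi
    echo "cluvfy comp healthcheck $*"
}
```

For example, healthcheck_command -collect cluster -bestpractice -deviations -html prints the same command that appears in the example earlier in this section. The sketch does not validate any other option.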

A.9 Interconnect Configuration Issues

If you plan to use multiple network interface cards (NICs) for the interconnect, and you do not configure them during installation or after installation with Redundant Interconnect Usage, then you should use a third party solution to aggregate the interfaces at the operating system level. Otherwise, the failure of a single NIC will affect the availability of the cluster node.

If you use aggregated NICs with the Oracle Clusterware Redundant Interconnect Usage feature, then the interfaces should be on different subnets. If you use a third-party vendor method of aggregation, then follow the directions for that vendor's product.

If you encounter errors, then carry out the following system checks:

Verify with your network providers that they are using correct cables (length, type) and software on their switches. In some cases, to avoid bugs that cause disconnects under loads, or to support additional features such as Jumbo Frames, you may need a firmware upgrade on interconnect switches, or you may need newer NIC driver or firmware at the operating system level. Running without such fixes can cause later instabilities to Oracle RAC databases, even though the initial installation seems to work.
Review VLAN configurations, duplex settings, and auto-negotiation in accordance with vendor and Oracle recommendations.

A.10 SCAN VIP and SCAN Listener Issues

If your installation reports errors related to the SCAN VIP addresses or listeners, then check the following items to make sure your network is configured correctly:

Check the file /etc/resolv.conf, and verify that the contents are the same on each node.
Verify that there is a DNS entry for the SCAN, and that it resolves to three valid IP addresses. Use the command nslookup scan-name; this command should return the DNS server name and the three IP addresses configured for the SCAN.
Use the ping command to test the IP addresses assigned to the SCAN; you should receive a response for each IP address.

Note:
If you do not have a DNS configured for your cluster environment, then you can create an entry for the SCAN in the /etc/hosts file on each node. However, using the /etc/hosts file to resolve the SCAN results in having only one SCAN available for the entire cluster instead of three. Only the first entry for SCAN in the hosts file is used.

Ensure the SCAN VIP uses the same netmask that is used by the public interface.
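The DNS check in this list can be reduced to a count of the addresses that nslookup returns for the SCAN. The following is a sketch under stated assumptions: the function name count_scan_addresses is hypothetical, and it assumes the common nslookup layout in which each answer appears as an Address line after a Name line. Other nslookup versions format their output differently.

```shell
# Count the Address lines that follow the first "Name:" line of
# nslookup output read from standard input.
count_scan_addresses() {
    awk '/^Name:/ { seen = 1 }
         seen && /^Address/ { n++ }
         END { print n + 0 }'
}
```

For example, nslookup scan-name | count_scan_addresses should print 3 for a SCAN that is configured as described above. The Address line that belongs to the DNS server itself is not counted, because it appears before the first Name line.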

If you need additional assistance troubleshooting errors related to the SCAN, SCAN VIP or listeners, then refer to My Oracle Support. For example, the note with Doc ID 1373350.1 contains some of the most common issues for the SCAN VIPs and listeners.

A.11 Storage Configuration Issues

The following is a list of issues involving storage configuration:

Recovery from Losing a Node Filesystem or Grid Home

A.11.1 Recovery from Losing a Node Filesystem or Grid Home

If you remove a filesystem by mistake, or encounter another storage configuration issue that results in losing the Oracle Local Registry or otherwise corrupting a node, you can recover the node in one of two ways:

Restore the node from an operating system level backup
Remove the node, and then add the node, using Grid_home/addnode/addnode.sh. Profile information is copied to the node, and the node is restored.

Using addnode.sh enables cluster nodes to be removed and added again, so that they can be restored from the remaining nodes in the cluster. If you add nodes in a GNS configuration, then that is called Grid Plug and Play (GPnP). GPnP uses profiles to configure nodes, which eliminates configuration data requirements for nodes and the need for explicit add and delete nodes steps. GPnP allows a system administrator to take a template system image and run it on a new node with no further configuration. GPnP removes many manual operations, reduces the opportunity for errors, and encourages configurations that can be changed easily. Removal of individual node configuration makes the nodes easier to replace, because nodes do not need to contain individually-managed states.

GPnP reduces the cost of installing, configuring, and managing database nodes by making their node state disposable. It allows nodes to be easily replaced with a regenerated state.

See Also:

Oracle Clusterware Administration and Deployment Guide for information about how to add nodes manually or with GNS

A.11.2 Oracle ASM Issues After Upgrading Oracle Grid Infrastructure

The following section explains an error that can occur when you upgrade Oracle Grid Infrastructure, and how to address it:

CRS-0219: Could not update resource 'ora.node1.asm1.inst'

Cause: After upgrading Oracle Grid Infrastructure, Oracle ASM client databases prior to Oracle Database 12c are unable to obtain the Oracle ASM instance aliases on the ora.asm resource through the ALIAS_NAME attribute.

Action: You must use Local ASM or set the cardinality for Flex ASM to ALL, instead of the default of 3. Use the following command to modify the Oracle ASM resource (ora.asm):
$ srvctl modify asm -count ALL

This setting changes the cardinality so that Flex ASM instances run on all nodes.

See Also:
Section 8.3.3, "Making Oracle ASM Available to Earlier Oracle Database Releases" for information about making Oracle ASM available to Oracle Database releases earlier than 12c Release 1

A.11.3 Oracle ASM Issues After Downgrading Oracle Grid Infrastructure for Standalone Server (Oracle Restart)

The following section explains an error that can occur when you downgrade Oracle Grid Infrastructure for standalone server (Oracle Restart), and how to address it:

CRS-2529: Unable to act on 'ora.cssd' because that would require stopping or relocating 'ora.asm'

Cause: After downgrading Oracle Grid Infrastructure for a standalone server (Oracle Restart) from 12.1.0.2 to 12.1.0.1, the ora.asm resource does not contain the Server Parameter File (SPFILE) parameter.

Action: When you downgrade Oracle Grid Infrastructure for a standalone server (Oracle Restart) from 12.1.0.2 to 12.1.0.1, you must explicitly specify the Server Parameter File (SPFILE) value taken from the 12.1.0.2 ora.asm resource when you add the Oracle ASM resource for 12.1.0.1.

Follow these steps when you downgrade Oracle Restart from 12.1.0.2 to 12.1.0.1:

In your 12.1.0.2 Oracle Restart installed configuration, query the SPFILE parameter from the Oracle ASM resource (ora.asm) and remember it:
```
$ srvctl config asm
```

Deconfigure the 12.1.0.2 release Oracle Restart:

Grid_home/crs/install/roothas.pl -deconfig -force

Install the 12.1.0.1 release Oracle Restart by running root.sh:
```
$ Grid_home/root.sh
```
Add the listener resource:
```
$ Grid_home/bin/srvctl add LISTENER
```
Add the Oracle ASM resource and provide the SPFILE parameter for the 12.1.0.2 Oracle Restart configuration obtained in step 1:
```
$ Grid_home/bin/srvctl add asm [-spfile <spfile>] [-diskstring <asm_diskstring>]
```
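Step 1 of this procedure depends on keeping the SPFILE value until step 5, after the 12.1.0.2 configuration has been removed. One way to capture it is sketched below. It assumes that the `srvctl config asm` output contains a line of the form `Spfile: <path>`; confirm the label against your own output. The function name `extract_asm_spfile` and the file `/tmp/asm_spfile.txt` are hypothetical.

```shell
# Read `srvctl config asm` output on standard input and print the SPFILE value.
# Assumption: the value appears on a line that starts with "Spfile:"
# (matched case-insensitively). Prints nothing if no such line exists.
extract_asm_spfile() {
    awk 'tolower($0) ~ /^spfile:/ {
        sub(/^[^:]*:[ \t]*/, "")   # strip the label and following blanks
        print
        exit
    }'
}

# Example, run before deconfiguring the 12.1.0.2 Oracle Restart:
#   srvctl config asm | extract_asm_spfile > /tmp/asm_spfile.txt
# Later, when adding the Oracle ASM resource for 12.1.0.1:
#   Grid_home/bin/srvctl add asm -spfile "$(cat /tmp/asm_spfile.txt)"
```

If the function prints nothing, treat that as a reason to read the `srvctl config asm` output by hand rather than continuing with an empty value.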

See Also:

Oracle Database Installation Guide for information about installing and deconfiguring Oracle Restart

A.12 Failed or Incomplete Installations and Upgrades

During installations or upgrades of Oracle Grid Infrastructure, the following actions take place:

Oracle Universal Installer (OUI) accepts inputs to configure Oracle Grid Infrastructure software on your system.
You are instructed to run either the orainstRoot.sh or root.sh script or both.
You run the scripts either manually or through root automation.
OUI runs configuration assistants. The Oracle Grid Infrastructure software installation completes successfully.

If OUI exits before the root.sh or rootupgrade.sh script runs, or if OUI exits before the installation or upgrade session is completed successfully, then the Oracle Grid Infrastructure installation or upgrade is incomplete. If your installation or upgrade does not complete, then Oracle Clusterware does not work correctly. If you are performing an upgrade, then an incomplete upgrade can result in some nodes being upgraded to the latest software and other nodes not being upgraded at all. If you are performing an installation, then an incomplete installation can result in some nodes not being part of the cluster.

Additionally, in Oracle Grid Infrastructure release 11.2.0.3 and later, you may see the following messages during installation or upgrade:

ACFS-9427 Failed to unload ADVM/ACFS drivers. A system reboot is recommended

ACFS-9428 Failed to load ADVM/ACFS drivers. A system reboot is recommended

CLSRSC-400: A system reboot is required to continue installing

To resolve these errors, you must reboot the server, and then follow the steps for completing an incomplete installation or upgrade as documented in the following sections:

Completing Failed or Interrupted Upgrades
Completing Failed or Interrupted Installations
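When you review the output of a failed root script, it can help to check for the three reboot-related messages listed above in one pass. The following is a minimal sketch; the function name `needs_reboot` is hypothetical, and the log file you pass to it is whichever file holds the root script output on your system.

```shell
# Return 0 if the given log contains one of the reboot-related messages
# (ACFS-9427, ACFS-9428, or CLSRSC-400); return nonzero otherwise.
# Usage: needs_reboot <log_file>
needs_reboot() {
    grep -Eq 'ACFS-942[78]|CLSRSC-400' "$1"
}

# Example (the log path is a placeholder):
#   if needs_reboot /path/to/root_script.log; then
#       echo "Reboot the server, then complete the installation or upgrade."
#   fi
```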

A.12.1 Completing Failed or Interrupted Upgrades

If OUI exits on the node from which you started the upgrade, or the node reboots before you confirm that the rootupgrade.sh script was run on all nodes, the upgrade remains incomplete. In an incomplete upgrade, configuration assistants still need to run, and the new Grid home still needs to be marked as active in the central Oracle inventory. You must complete the upgrade on the affected nodes manually.

This section contains the following tasks:

Continuing Upgrade When Force Upgrade in Rolling Upgrade Mode Fails
Continuing Upgrade When Upgrade Fails on the First Node
Continuing Upgrade When Upgrade Fails on Nodes Other Than the First Node

A.12.1.1 Continuing Upgrade When Force Upgrade in Rolling Upgrade Mode Fails

If you attempt to force upgrade cluster nodes in the rolling upgrade mode, you may see the following error:

CRS 1137 - Rejecting the rolling upgrade mode change because the cluster was forcibly upgraded.

Cause: The rolling upgrade mode change was rejected because the cluster was forcibly upgraded.

Action: Delete the nodes that were not upgraded using the procedure documented in Oracle Clusterware Administration and Deployment Guide. You can then retry the rolling upgrade process using the crsctl start rollingupgrade command as documented in Section B.8, "Performing Rolling Upgrades of Oracle Grid Infrastructure".

A.12.1.2 Continuing Upgrade When Upgrade Fails on the First Node

When the first node cannot be upgraded, do the following:

If the root script failure indicated a need to reboot, through the message CLSRSC-400, then reboot the first node (the node where the upgrade was started). Otherwise, manually fix or clear the error condition, as reported in the error output. Run the rootupgrade.sh script on that node again.
Complete the upgrade of all other nodes in the cluster.
Configure a response file, and provide passwords for the installation. See Section C.5, "Postinstallation Configuration Using a Response File" for information about how to create the response file.
To complete the upgrade, log in as the Grid installation owner, and run the script configToolAllCommands, located in the path Grid_home/cfgtoollogs/configToolAllCommands, specifying the response file that you created. For example, where the response file is gridinstall.rsp:
```
[grid@node1]$ cd /u01/app/12.1.0/grid/cfgtoollogs
[grid@node1]$ ./configToolAllCommands RESPONSE_FILE=gridinstall.rsp
```

A.12.1.3 Continuing Upgrade When Upgrade Fails on Nodes Other Than the First Node

For nodes other than the first node (the node on which the upgrade was started):

If the root script failure indicated a need to reboot, through the message CLSRSC-400, then reboot the affected node. Otherwise, manually fix or clear the error condition, as reported in the error output.
If root automation is being used, click Retry on the OUI instance on the first node.
If root automation is not being used, log into the affected node as root. Change directory to the Grid home, and run the rootupgrade.sh script on that node. For example:
```
[root@node6]# cd /u01/app/12.1.0/grid
[root@node6]# ./rootupgrade.sh
```
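The choice described above (reboot when CLSRSC-400 appears, otherwise fix the reported error, then run the root script again) can be sketched as a small wrapper. This is an illustration under stated assumptions, not part of the Oracle tooling: the function name `run_root_script`, its exit codes, and the log path in the example are all hypothetical.

```shell
# Run a root script, save its output, and report the next action.
# Usage: run_root_script <log_file> <command> [args...]
# Exit status: 0 = command succeeded
#              2 = command failed and CLSRSC-400 was reported (reboot needed)
#              1 = command failed for another reason
run_root_script() {
    log=$1
    shift
    if "$@" > "$log" 2>&1; then
        echo "completed"
        return 0
    fi
    if grep -q 'CLSRSC-400' "$log"; then
        echo "reboot this node, then rerun: $*"
        return 2
    fi
    echo "fix the error reported in $log, then rerun: $*"
    return 1
}

# Example, as root on the affected node (log path is a placeholder):
#   cd /u01/app/12.1.0/grid
#   run_root_script /tmp/rootupgrade.out ./rootupgrade.sh
```

Keeping the output in a file also gives you the exact error text to include in a service request if the rerun fails again.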

A.12.2 Completing Failed or Interrupted Installations

If OUI exits on the node from which you started the installation, or the node reboots before you confirm that the orainstRoot.sh and root.sh scripts were run on all nodes, the installation remains incomplete. In an incomplete installation, configuration assistants still need to run, and the new Grid home still needs to be marked as active in the central Oracle inventory. You must complete the installation on the affected nodes manually.

This section contains the following tasks:

Continuing Incomplete Installations on First Node
Continuing Installation on Nodes Other Than the First Node

A.12.2.1 Continuing Incomplete Installations on First Node

The first node must finish installation before the rest of the clustered nodes. To continue an incomplete installation on the first node:

If the root script failure indicated a need to reboot, through the message CLSRSC-400, then reboot the first node (the node where the installation was started). Otherwise, manually fix or clear the error condition, as reported in the error output.
If necessary, log in as root to the first node. Run the orainstRoot.sh script on that node again. For example:
```
$ sudo -s
[root@node1]# cd /u01/app/oraInventory
[root@node1]# ./orainstRoot.sh
```
Change directory to the Grid home on the first node, and run the root script on that node again. For example:
```
[root@node1]# cd /u01/app/12.1.0/grid
[root@node1]# ./root.sh
```
Complete the installation on all other nodes.
Configure a response file, and provide passwords for the installation. See Section C.5, "Postinstallation Configuration Using a Response File" for information about how to create the response file.
To complete the installation, log in as the Grid installation owner, and run the script configToolAllCommands, located in the path Grid_home/cfgtoollogs/configToolAllCommands, specifying the response file that you created. For example, where the response file is gridinstall.rsp:
```
[grid@node1]$ cd /u01/app/12.1.0/grid/cfgtoollogs
[grid@node1]$ ./configToolAllCommands RESPONSE_FILE=gridinstall.rsp
```
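The order of the steps above matters: the root scripts run on the first node before any other node, and configToolAllCommands runs last, as the Grid installation owner on the first node. The following sketch only prints that sequence as a checklist and runs nothing. The function name `print_install_plan` is hypothetical, and the paths and node names in the example are the sample values used in this section.

```shell
# Print, in order, the commands needed to finish an interrupted installation.
# Nothing is executed; the output is a checklist to review.
# Usage: print_install_plan <inventory_dir> <grid_home> <response_file> \
#                           <first_node> [other_node...]
print_install_plan() {
    inventory_dir=$1
    grid_home=$2
    response_file=$3
    first_node=$4
    shift 4
    # First node first, then the remaining nodes in the order given.
    for node in "$first_node" "$@"; do
        echo "root@$node: $inventory_dir/orainstRoot.sh"
        echo "root@$node: $grid_home/root.sh"
    done
    echo "grid@$first_node: $grid_home/cfgtoollogs/configToolAllCommands RESPONSE_FILE=$response_file"
}

# Example:
#   print_install_plan /u01/app/oraInventory /u01/app/12.1.0/grid \
#       gridinstall.rsp node1 node6
```

The checklist lists orainstRoot.sh for every node; skip that line on nodes where the script already completed.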

A.12.2.2 Continuing Installation on Nodes Other Than the First Node

For nodes other than the first node (the node on which the installation was started):

If the root script failure indicated a need to reboot, through the message CLSRSC-400, then reboot the affected node. Otherwise, manually fix or clear the error condition, as reported in the error output.
If root automation is being used, click Retry on the OUI instance on the first node.
If root automation is not being used, follow these steps:
1. Log into the affected node as root, and run the orainstRoot.sh script on that node. For example:
```
$ sudo -s
[root@node6]# cd /u01/app/oraInventory
[root@node6]# ./orainstRoot.sh
```
2. Change directory to the Grid home, and run the root.sh script on the affected node. For example:
```
[root@node6]# cd /u01/app/12.1.0/grid
[root@node6]# ./root.sh
```
Continue the installation from the OUI instance on the first node.