Oracle ASM Fast Mirror Resync

Restoring the redundancy of an Oracle ASM disk group after a transient disk path failure can be time-consuming. This is especially true if the recovery process requires rebuilding an entire Oracle ASM disk group. Oracle ASM fast mirror resync significantly reduces the time to resynchronize a failed disk in such situations. When you replace the failed disk, Oracle ASM can quickly resynchronize the Oracle ASM disk extents.

Note:

To use this feature, the disk group compatibility attributes must be set to 11.1 or higher. For more information, refer to "Disk Group Compatibility".
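
As a quick check before relying on fast mirror resync, the compatibility settings can be read from V$ASM_DISKGROUP and raised if needed. The following is a minimal sketch that assumes a disk group named data, the name used in the examples in this section. Advancing disk group compatibility cannot be reversed, so confirm the target values for your environment first.

```sql
-- Show the current compatibility settings for every disk group.
SELECT name, compatibility, database_compatibility
  FROM V$ASM_DISKGROUP;

-- Raise the attributes to 11.1 if they are lower (this cannot be undone).
ALTER DISKGROUP data SET ATTRIBUTE 'compatible.asm' = '11.1';
ALTER DISKGROUP data SET ATTRIBUTE 'compatible.rdbms' = '11.1';
```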

Any problems that make a failure group temporarily unavailable are considered transient failures that can be recovered by the Oracle ASM fast mirror resync feature. For example, transient failures can be caused by disk path malfunctions, such as cable failures, host bus adapter failures, controller failures, or disk power supply interruptions.

Oracle ASM fast resync keeps track of pending changes to extents on an offline disk during an outage. The extents are resynced when the disk is brought back online.

By default, Oracle ASM drops a disk in 3.6 hours after it is taken offline. You can set the DISK_REPAIR_TIME disk group attribute to delay the drop operation by specifying a time interval to repair the disk and bring it back online. The time can be specified in units of minutes (m or M) or hours (h or H). If you omit the unit, then the default unit is hours. The DISK_REPAIR_TIME disk group attribute can only be set with the ALTER DISKGROUP SQL statement and is only applicable to normal and high redundancy disk groups.

If the attribute is not set explicitly, then the default value (3.6h) applies to disks that have been set to OFFLINE mode without an explicit DROP AFTER clause. Disks taken offline due to I/O errors do not have a DROP AFTER clause.

The default DISK_REPAIR_TIME attribute value is an estimate that should be adequate for most environments. However, ensure that the attribute value is set to the amount of time that you think is necessary in your environment to fix any transient disk error, and during which you are able to tolerate reduced data redundancy.
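
To see which repair time is currently in effect before deciding whether to change it, you can read the attribute from V$ASM_ATTRIBUTE. This sketch assumes the disk groups are mounted and have compatible.asm set to 11.1 or higher, because attributes are only displayed for such disk groups.

```sql
-- List the DISK_REPAIR_TIME value for each mounted disk group.
SELECT dg.name AS diskgroup_name, a.name AS attribute_name, a.value
  FROM V$ASM_DISKGROUP dg
  JOIN V$ASM_ATTRIBUTE a ON a.group_number = dg.group_number
 WHERE a.name = 'disk_repair_time';
```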

The elapsed time (since the disk was set to OFFLINE mode) is incremented only when the disk group containing the offline disks is mounted. The REPAIR_TIMER column of V$ASM_DISK shows the amount of time left (in seconds) before an offline disk is dropped. After the specified time has elapsed, Oracle ASM drops the disk. You can override this attribute with the ALTER DISKGROUP OFFLINE DISK statement and the DROP AFTER clause.
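
The remaining time for each offline disk can be checked with a query such as the following. It is a minimal sketch; the division by 60 is added only for readability, since REPAIR_TIMER is reported in seconds.

```sql
-- Show offline disks and the time left before Oracle ASM drops them.
SELECT group_number, name, mode_status,
       repair_timer,                           -- seconds remaining
       ROUND(repair_timer / 60) AS minutes_left
  FROM V$ASM_DISK
 WHERE mode_status = 'OFFLINE';
```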

Note:

If a disk is offlined by Oracle ASM because of an I/O (write) error or is explicitly offlined using the ALTER DISKGROUP... OFFLINE statement without the DROP AFTER clause, then the value specified for the DISK_REPAIR_TIME attribute for the disk group is used.

Altering the DISK_REPAIR_TIME attribute has no effect on disks that are already offline. The new value is used for any disks that go offline after the attribute is updated. You can confirm this behavior by viewing the Oracle ASM alert log.

If an offline disk is taken offline for a second time, then the elapsed time is reset and restarted. If another time is specified with the DROP AFTER clause for this disk, the first value is overridden and the new value applies. A disk that is in OFFLINE mode cannot be dropped with an ALTER DISKGROUP DROP DISK statement; an error is returned if this is attempted. If the disk must be dropped before the repair time has expired (for example, because the disk cannot be repaired), you can drop it immediately by issuing a second OFFLINE statement with a DROP AFTER clause specifying 0h or 0m.
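
The sequence described above can be sketched as follows. The disk group data and disk DATA_001 match the names used elsewhere in this section; the 2h and 6h values are illustrative assumptions.

```sql
-- Take the disk offline with a two-hour repair window.
ALTER DISKGROUP data OFFLINE DISK DATA_001 DROP AFTER 2h;

-- A second OFFLINE statement resets the timer and overrides the earlier value.
ALTER DISKGROUP data OFFLINE DISK DATA_001 DROP AFTER 6h;

-- If the disk cannot be repaired, drop it immediately.
ALTER DISKGROUP data OFFLINE DISK DATA_001 DROP AFTER 0m;
```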

You can use ALTER DISKGROUP to set the DISK_REPAIR_TIME attribute to a specified hour or minute value, such as 4.5 hours or 270 minutes. For example:

ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '4.5h';
ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '270m';

After you repair the disk, run the SQL statement ALTER DISKGROUP ONLINE DISK. This statement brings the repaired disk back online to enable writes so that no new writes are missed. This statement also starts a procedure to copy all of the extents that are marked as stale on their redundant copies.

If a disk goes offline when the Oracle ASM instance is in rolling upgrade mode, the disk remains offline until the rolling upgrade has ended, and the timer for dropping the disk is stopped until the Oracle ASM cluster is out of rolling upgrade mode. See "Upgrading and Patching Oracle ASM".

Examples of taking disks offline and bringing them online follow.

The following example takes disk DATA_001 offline and drops it after five minutes:

ALTER DISKGROUP data OFFLINE DISK DATA_001 DROP AFTER 5m;

The next example takes the disk DATA_001 offline and drops it after the time period designated by DISK_REPAIR_TIME elapses:

ALTER DISKGROUP data OFFLINE DISK DATA_001;

This example takes all of the disks in failure group FG2 offline and drops them after the time period designated by DISK_REPAIR_TIME elapses. If you used a DROP AFTER clause, then the disks would be dropped after the specified time:

ALTER DISKGROUP data OFFLINE DISKS IN FAILGROUP FG2;

The next example brings all of the disks in failure group FG2 online:

ALTER DISKGROUP data ONLINE DISKS IN FAILGROUP FG2;

This example brings only disk DATA_001 online:

ALTER DISKGROUP data ONLINE DISK DATA_001;

This example brings all of the disks in disk group DATA online:

ALTER DISKGROUP data ONLINE ALL;

Querying the V$ASM_OPERATION view while you run ALTER DISKGROUP ONLINE statements displays the name and state of the current operation that you are performing. For example, the following SQL query shows values in the PASS column during an online operation.

SQL> SELECT GROUP_NUMBER, PASS, STATE FROM V$ASM_OPERATION;
 
GROUP_NUMBER PASS      STAT
------------ --------- ----
           1 RESYNC    RUN
           1 REBALANCE WAIT
           1 COMPACT   WAIT

An offline operation does not generate a display in a V$ASM_OPERATION view query.

You can set the FAILGROUP_REPAIR_TIME and CONTENT.TYPE disk group attributes. The FAILGROUP_REPAIR_TIME disk group attribute specifies a default repair time for the failure groups in the disk group. The CONTENT.TYPE disk group attribute specifies the type of data expected to be stored in a disk group. You can set these attributes with ASMCA, ASMCMD mkdg, or SQL CREATE and ALTER DISKGROUP statements. For information about disk group attributes, refer to "Managing Disk Group Attributes".
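
With SQL, these attributes can be set in the same way as DISK_REPAIR_TIME. The following is a minimal sketch, assuming the disk group data and illustrative values: 24h for the failure group repair time, and data as the content type (the other accepted content types are recovery and system).

```sql
-- Set a default repair time for the failure groups in the disk group.
ALTER DISKGROUP data SET ATTRIBUTE 'failgroup_repair_time' = '24h';

-- Declare the type of data expected to be stored in the disk group.
ALTER DISKGROUP data SET ATTRIBUTE 'content.type' = 'data';
```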

The ASMCMD lsop command shows the resync time estimate. There are separate rows in the V$ASM_OPERATION view for different phases of rebalance: disk resync, rebalance, and data compaction.
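
A comparable estimate is available from SQL. This sketch assumes a release in which V$ASM_OPERATION exposes the PASS column, as in the query shown earlier in this section. It returns one row per phase, with the work done so far and the estimated minutes remaining.

```sql
-- One row per phase (RESYNC, REBALANCE, COMPACT) of the running operation.
SELECT group_number, operation, pass, state,
       sofar, est_work, est_minutes
  FROM V$ASM_OPERATION
 ORDER BY group_number;
```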

The ASMCMD online command has a power option to specify the power for the online operation. The SQL ALTER DISKGROUP REPLACE DISK statement also has the power option.
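
For the SQL form, a replace operation with an explicit power setting might look like the following. The device path and the power value of 3 are hypothetical and must be adapted to your system.

```sql
-- Replace disk DATA_001 with a new device, using power level 3.
-- '/devices/diskc18' is a placeholder path, not a real device.
ALTER DISKGROUP data REPLACE DISK DATA_001 WITH '/devices/diskc18' POWER 3;
```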

The ASMCMD chdg command provides the replace option in addition to the add and drop tags. The ASMCMD mkdg command has an additional time parameter (-t) to specify the time to offline a failure group.