This chapter applies to all replication schemes, including active standby pairs. However, TimesTen integration with Oracle Clusterware is the best way to monitor active standby pairs. See Chapter 7, "Using Oracle Clusterware to Manage Active Standby Pairs".
A fundamental element in the design of a highly available system is the ability to quickly recover from a failure. Failures may be related to hardware problems such as system failures or network failures. Software failures include operating system failure, application failure, database failure and operator error.
Your replicated system must employ a cluster manager or custom software to detect such failures and, in the event of a failure involving a master database, redirect the user load to one of its subscribers. TimesTen does not provide a cluster manager or make any assumptions about how one operates, so the focus of this discussion is on the TimesTen mechanisms that an application or cluster manager can use to recover from failures.
Unless the replication scheme is configured to use the return twosafe service, TimesTen replicates updates only after the original transaction commits to the master database. If a subscriber database is inoperable or communication to a subscriber database fails, updates at the master are not impeded. During outages at subscriber systems, updates intended for the subscriber are saved in the TimesTen transaction log.
Note:
The procedures described in this section require the ADMIN privilege.
The procedures for managing failover and recovery depend primarily on:
The replication scheme
Whether the failure occurred on a master or subscriber database
Whether the threshold for the transaction log on the master is exhausted before the problem is resolved and the databases reconnected
In a default asynchronous replication scheme, if a subscriber database becomes inoperable or communication to a subscriber database fails, updates at the master are not impeded and the cluster manager does not have to take any immediate action.
Note:
If the failed subscriber is configured to use a return service, you must first disable return service blocking, as described in "Managing return service timeout errors and replication state changes".
During outages at subscriber systems, updates intended for the subscriber are saved in the transaction log on the master. If the subscriber agent reestablishes communication with its master before the master reaches its FAILTHRESHOLD, the updates held in the log are automatically transferred to the subscriber and no further action is required. See "Setting the log failure threshold" for details on how to establish the FAILTHRESHOLD value for the master database.
If the FAILTHRESHOLD is exceeded, the master sets the subscriber to the Failed state and it must be recovered, as described in "Recovering a failed database". Any application that connects to the failed subscriber receives a tt_ErrReplicationInvalid (8025) warning indicating that the database has been marked Failed by a replication peer.
An application can use the ODBC SQLGetInfo function to check if the subscriber database it is connected to has been set to the Failed state. The SQLGetInfo function includes a TimesTen-specific infotype, TT_REPLICATION_INVALID, that returns a 32-bit integer value of '1' if the database is failed, or '0' if not failed. Since the infotype TT_REPLICATION_INVALID is specific to TimesTen, all applications using it need to include the timesten.h file in addition to the other ODBC include files.
The cluster manager plays a more central role if a failure involves the master database. If a master database fails, the cluster manager must detect this event and redirect the user load to one of its surviving databases. This surviving subscriber then becomes the master, which continues to accept transactions and replicates them to the other surviving subscriber databases. If the failed master and surviving subscriber are configured in a bidirectional manner, transferring the user load from a failed master to a subscriber does not require that you make any changes to your replication scheme. However, when using unidirectional replication or complex schemes, such as those involving propagators, you may have to issue one or more ALTER REPLICATION statements to reconfigure the surviving subscriber as the "new master" in your scheme. See "Replacing a master database" for an example.
When the problem is resolved, if you are not using the bidirectional configuration or the active standby pair described in "Automatic catch-up of a failed master database", you must recover the master database as described in "Recovering a failed database".
After the database is back online, the cluster manager can either transfer the user load back to the original master or reestablish it as a subscriber for the "acting master."
The master catch-up feature automatically restores a failed master database from a subscriber database without the need to invoke the ttRepAdmin -duplicate operation described in "Recovering a failed database".
The master catch-up feature needs no configuration, but it can be used only in the following types of configurations:
A single master replicated in a bidirectional manner to a single subscriber
An active standby pair that is configured with RETURN TWOSAFE
For replication schemes that are not active standby pairs, the following must be true:
The ELEMENT type is DATASTORE.
TRANSMIT NONDURABLE or RETURN TWOSAFE must be enabled.
All replicated transactions must be committed nondurably. They must be transmitted to the remote database before they are committed on the local database. For example, if the replication scheme is configured with RETURN TWOSAFE BY REQUEST and any transaction is committed without first enabling RETURN TWOSAFE, master catch-up may not occur after a failure of the master.
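The configuration requirements above can be sketched as a small predicate that a custom cluster manager might evaluate before relying on master catch-up. This is only an illustration: the scheme_info structure and the master_catchup_applies function are hypothetical names, not part of any TimesTen API, and the caller is assumed to fill in the fields from its own knowledge of the replication scheme.

```c
#include <stdbool.h>

/* Hypothetical summary of a replication scheme. These names are
   illustrative only and do not come from any TimesTen header. */
typedef struct {
    bool active_standby_pair;   /* scheme is an active standby pair */
    bool bidirectional_pair;    /* single master replicated bidirectionally
                                   to a single subscriber */
    bool element_is_datastore;  /* ELEMENT type is DATASTORE */
    bool transmit_nondurable;   /* TRANSMIT NONDURABLE is enabled */
    bool return_twosafe;        /* RETURN TWOSAFE is enabled */
} scheme_info;

/* Returns true when the scheme matches one of the configurations in
   which master catch-up can be used. It does not check the runtime
   condition that every replicated transaction was transmitted to the
   remote database before being committed locally. */
bool master_catchup_applies(const scheme_info *s)
{
    if (s->active_standby_pair) {
        /* Active standby pairs qualify only with RETURN TWOSAFE. */
        return s->return_twosafe;
    }
    return s->bidirectional_pair
        && s->element_is_datastore
        && (s->transmit_nondurable || s->return_twosafe);
}
```

A predicate like this only mirrors the static configuration rules. A scheme that passes it can still miss catch-up if, for example, a transaction was committed without RETURN TWOSAFE enabled under RETURN TWOSAFE BY REQUEST.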
When the master replication agent is restarted after a crash or invalidation, any lost transactions that originated on the master are automatically reapplied from the subscriber to the master (or from the standby to the active in an active standby pair). No connections are allowed to the master database until it has completely caught up with the subscriber. Applications attempting to connect to a database during the catch-up phase receive an error that indicates a catch-up is in progress. The only exception is if you connect to a database with the ForceConnect first connection attribute set in the DSN.
When the catch-up phase is complete, your application can connect to the database. An SNMP trap and message to the system log indicate the completion of the catch-up phase.
If one of the databases is invalidated or crashes during the catch-up process, the catch-up phase is resumed when the database comes back up.
Master catch-up can fail under these circumstances:
The failed database is offline long enough for the failure threshold to be exceeded on the subscriber or standby database.
Dynamic load operations are taking place on the active database in an active standby pair when the failure occurs. RETURN TWOSAFE is not enabled for dynamic load operations even though it is enabled for the active database. The database failure causes the dynamic load transactions to be trapped and RETURN TWOSAFE to fail.
You can distribute the workload over multiple bidirectionally replicated databases, each of which serves as both master and subscriber. When recovering a master/subscriber database, the log on the failed database may present problems when you restart replication. See "Bidirectional distributed workload scheme".
If a database in a distributed workload scheme fails and work is shifted to a surviving database, the information in the surviving database becomes more current than that in the failed database. If replication is restarted at the failed system before the log failure threshold has been reached on the surviving database, then both databases attempt to update one another with the contents of their transaction logs. In this case, the older updates in the transaction log on the failed database may overwrite more recent data on the surviving system.
There are two ways to recover in such a situation:
If the timestamp conflict resolution rules described in Chapter 15, "Resolving Replication Conflicts" are sufficient to guarantee consistency for your application, then you can restart the failed system and allow the updates from the failed database to propagate to the surviving database. The conflict resolution rules prevent more recent updates from being overwritten.
Re-create the failed database, as described in "Recovering a failed database".
Note:
If the database must be re-created, the updates in the log on the failed database that were not received by the surviving database cannot be identified or restored. In the case of several surviving databases, you must select which of the surviving databases is to be used to re-create the failed database. It is possible that, at the time the failed database is re-created, the selected surviving database has not received all updates from the other surviving databases. This results in diverging databases. The only way to prevent this situation is to re-create the other surviving databases from the selected surviving database.
In the event of a temporary network failure, you need not perform any specific action to continue replication. The replication agents that were in communication attempt to reconnect every few seconds. If the agents reconnect before the master database runs out of log space, the replication protocol makes sure they neither miss nor repeat any replication updates. If the network is unavailable for a longer period and the log failure threshold has been exceeded for the master log, you need to recover the subscriber as described in "Recovering a failed database".
After a link failure, if replication is allowed to recover by replaying queued logs, you do not need to take any action.
However, if the failed node was down for a significant amount of time, you must use the ttRepAdmin -duplicate command to repopulate the database on the failed node with transactions from the surviving node, as sequences are not rolled back during failure recovery. In this case, the ttRepAdmin -duplicate command copies the sequence definitions from one node to the other.
If the databases are configured in a bidirectional replication scheme, a failed master database is automatically brought up to date from the subscriber. See "Automatic catch-up of a failed master database". Automatic catch-up also applies to recovery of master databases in active standby pairs.
If a restarted database cannot be recovered from its master's transaction log so that it is consistent with the other databases in the replicated system, you must re-create the database from one of its replication peers. Use command line utilities or the TimesTen Utility C functions. See "Recovering a failed database from the command line" and "Recovering a failed database from a C program".
Note:
It is not necessary to re-create the DSN for the failed database.
In the event of a subscriber failure, if any tables are configured with a return service, commits on those tables in the master database are blocked until the return service timeout period expires. To avoid this, you can establish a return service failure and recovery policy in your replication scheme, as described in "Managing return service timeout errors and replication state changes". If you are using the RETURN RECEIPT service, an alternative is to use ALTER REPLICATION and set the NO RETURN attribute to disable return receipt until the subscriber is restored and caught up. Then you can submit another ALTER REPLICATION statement to re-establish RETURN RECEIPT.
If the databases are fully replicated, you can use the ttDestroy utility to remove the failed database from memory and ttRepAdmin -duplicate to re-create it from a surviving database. If the database contains any cache groups, you must also use the -keepCG option of ttRepAdmin. See "Duplicating a database".
Example 12-2 Recovering a failed database
To recover a failed database, subscriberds, from a master, named masterds on host system1, enter:
> ttDestroy /tmp/subscriberds
> ttRepAdmin -dsn subscriberds -duplicate -from masterds -host "system1" -uid ttuser
You will be prompted for the password of ttuser.
Note:
ttRepAdmin -duplicate is supported only between identical and patch TimesTen releases. The major and minor release numbers must be the same.
After re-creating the database with ttRepAdmin -duplicate, the first connection to the database reloads it into memory. To improve performance when duplicating large databases, you can avoid the reload step by using the ttRepAdmin -ramLoad option to keep the database in memory after the duplicate operation.
Example 12-3 Keeping a database in memory when recovering it
To recover a failed database, subscriberds, from a master, named masterds on host system1, and to keep the database in memory and restart replication after the duplicate operation, enter:
> ttDestroy /tmp/subscriberds
> ttRepAdmin -dsn subscriberds -duplicate -ramLoad -from masterds -host "system1" -uid ttuser -setMasterRepStart
You will be prompted for the password of ttuser.
You can use the C functions provided in the TimesTen utility library to recover a failed database programmatically.
If the databases are fully replicated, you can use the ttDestroyDataStore function to remove the failed database and the ttRepDuplicateEx function to re-create it from a surviving database.
Example 12-4 Recovering and starting a failed database
To recover and start a failed database, named subscriberds on host system2, from a master, named masterds on host system1, enter:
int                 rc;
ttUtilHandle        utilHandle;   /* must first be allocated with ttUtilAllocEnv (not shown) */
ttRepDuplicateExArg arg;
memset( &arg, 0, sizeof( arg ) );
arg.size = sizeof( ttRepDuplicateExArg );
/* Restart replication and keep the database in memory after the duplicate */
arg.flags = TT_REPDUP_REPSTART | TT_REPDUP_RAMLOAD;
arg.uid = "ttuser";
arg.pwd = "ttuser";
arg.localHost = "system2";
/* Remove the failed database, waiting up to 30 seconds */
rc = ttDestroyDataStore( utilHandle, "subscriberds", 30 );
/* Re-create it from masterds on host system1 */
rc = ttRepDuplicateEx( utilHandle, "DSN=subscriberds", "masterds", "system1", &arg );
In this example, the timeout for the ttDestroyDataStore operation is 30 seconds. The last parameter of the ttRepDuplicateEx function is an argument structure containing two flags:
TT_REPDUP_REPSTART to set the subscriberds database to the start state after the duplicate operation is completed
TT_REPDUP_RAMLOAD to set the RAM policy to manual and keep the database in memory
Note:
When the TT_REPDUP_RAMLOAD flag is used with ttRepDuplicateEx, the RAM policy for the duplicate database is manual until explicitly reset by the ttRamPolicy function or ttAdmin -ramPolicy.
See "TimesTen Utility API" in Oracle TimesTen In-Memory Database C Developer's Guide for the complete list of the functions provided in the TimesTen C language utility library.
If your database is configured with the TRANSMIT NONDURABLE option in a bidirectional configuration, you do not need to take any action to recover a failed master database. See "Automatic catch-up of a failed master database".
For other types of configurations, if the master database configured with the TRANSMIT NONDURABLE option fails, you must use ttRepAdmin -duplicate or ttRepDuplicateEx to re-create the master database from the most current subscriber database. If the application attempts to reconnect to the master database without first performing the duplicate operation, the replication agent recovers the database, but any attempt to connect results in an error that advises you to perform the duplicate operation. To avoid this error, the application must reconnect with the ForceConnect first connection attribute set to 1.
Upon detecting a failure, the cluster manager should invoke a script that effectively executes the procedure shown by the pseudocode in Example 12-5.
Example 12-5 Failure recovery pseudocode
Detect problem {
   if (Master == unavailable) {
      FailedDatabase = Master
      FailedDSN = Master_DSN
      SurvivorDatabase = Subscriber
      switch users to SurvivorDatabase
   } else {
      FailedDatabase = Subscriber
      FailedDSN = Subscriber_DSN
      SurvivorDatabase = Master
   }
}
Fix problem....
If (Problem resolved) {
   Get state for FailedDatabase
   if (state == "failed") {
      ttDestroy FailedDatabase
      ttRepAdmin -dsn FailedDSN -duplicate -from SurvivorDatabase
                 -host SurvivorHost -setMasterRepStart
                 -uid ttuser -pwd ttuser
   } else {
      ttAdmin -repStart FailedDSN
   }
   while (backlog != 0) {
      wait
   }
}
Switch users back to Master.
This applies to either the master or subscriber databases. If the master fails, you may lose some transactions.
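The state check in the recovery pseudocode can be sketched in C as a helper that picks the command a recovery script would run. This is a minimal illustration under stated assumptions: build_recovery_command is a hypothetical function, not a TimesTen API, and the state string, DSN, database, and host names are supplied by the caller. It only formats the command line, using the utility options shown in this chapter, and does not execute anything.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of TimesTen: formats the recovery command
   for a database whose replication state is `state`.
   - "failed": the database must be re-created with ttRepAdmin -duplicate
     (the caller is assumed to have run ttDestroy on it first).
   - anything else: restarting the replication agent is enough.
   Returns 0 on success, -1 if the command does not fit in `out`. */
int build_recovery_command(const char *state, const char *failed_dsn,
                           const char *survivor_db, const char *survivor_host,
                           char *out, size_t out_len)
{
    int n;

    if (strcmp(state, "failed") == 0) {
        n = snprintf(out, out_len,
                     "ttRepAdmin -dsn %s -duplicate -from %s -host %s "
                     "-setMasterRepStart -uid ttuser",
                     failed_dsn, survivor_db, survivor_host);
    } else {
        n = snprintf(out, out_len, "ttAdmin -repStart %s", failed_dsn);
    }

    /* snprintf reports the length it wanted to write; treat truncation
       as an error so a partial command is never used. */
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

Separating the decision from the execution keeps the script easy to test. A cluster manager could log the resulting string before running it, and would still need to wait for the replication backlog to drain before switching users back.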