2 High Availability and Data Protection – Getting From Requirements to Architecture

This chapter provides a framework to effectively evaluate the high availability requirements of an enterprise. It contains the following sections:

High Availability Requirements
A Methodology for Documenting High Availability Requirements
Mapping Requirements to Architectures

2.1 High Availability Requirements

Any effort to design and implement a high availability strategy for Oracle Database begins by performing a thorough business impact analysis to identify the consequences to the enterprise of downtime and data loss, whether caused by unplanned or planned outages. The term "business impact" is intended to be agnostic of whether the enterprise is a commercial venture, government agency, or not-for-profit institution. In all cases, data loss and downtime can seriously impact the ability of any enterprise to perform its function. Implementing high availability may involve critical tasks such as:

Retiring legacy systems
Investing in more capable and robust systems and facilities
Redesigning the overall IT architecture and operations to adapt to this high availability model
Modifying existing applications to take full advantage of high availability infrastructures
Redesigning business processes
Hiring and training personnel

By combining your business analysis with an understanding of the level of investment required to implement different high availability solutions, you can develop a high availability architecture that achieves both business and technical objectives.

Figure 2-1 Planning and Implementing a Highly Available Enterprise

Description of "Figure 2-1 Planning and Implementing a Highly Available Enterprise"

2.2 A Methodology for Documenting High Availability Requirements

The elements of this analysis framework are:

Business Impact Analysis
Cost of Downtime
Recovery Time Objective
Recovery Point Objective
Manageability Goal
Total Cost of Ownership and Return on Investment

2.2.1 Business Impact Analysis

A rigorous business impact analysis:

Identifies the critical business processes in an organization
Calculates the quantifiable loss risk for unplanned and planned IT outages affecting each of these business processes
Outlines the effects of these outages
Considers essential business functions, people and system resources, government regulations, and internal and external business dependencies
Is based on objective and subjective data gathered from interviews with knowledgeable and experienced personnel
Reviews business practice histories, financial reports, IT systems logs, and so on

The business impact analysis categorizes the business processes based on the severity of the impact of IT-related outages. For example, consider a semiconductor manufacturer with chip fabrication plants located worldwide. Semiconductor manufacturing is an intensely competitive business requiring a huge financial investment that is amortized over high production volumes. The human resource applications used by plant administration are unlikely to be considered as mission-critical as the applications that control the manufacturing process in the plant. Failure of the applications that support manufacturing affects production levels and have a direct impact on the financial results of the company.

Similarly, an internal knowledge management system is likely to be considered mission-critical for a management consulting firm, because the business of a client-focused company is based on internal research accessibility for its consultants and knowledge workers. The cost of downtime of such a system is extremely high for this business.

2.2.2 Cost of Downtime

A complete business impact analysis provides the insight needed to quantify the cost of unplanned and planned downtime. Understanding this cost is essential because it helps prioritize your high availability investment and directly influences the high availability technologies that you choose to minimize the downtime risk.

Various reports have been published, documenting the costs of downtime in different industries. Examples include costs that range from millions of dollars for each hour of brokerage operations and credit card sales, to tens of thousands of dollars for each hour of package shipping services.

These numbers are staggering. The Internet can connect the business directly to millions of customers. Application downtime can disrupt this connection, cutting off a business from its customers. In addition to lost revenue, downtime can negatively affect customer relationships, competitive advantages, legal obligations, industry reputation, and shareholder confidence.

2.2.3 Recovery Time Objective

The business impact analysis determines your tolerance to downtime, also known as recovery time objective (RTO). A RTO is defined as the maximum amount of time that an IT-based business process can be down before the organization starts suffering unacceptable consequences (financial losses, customer dissatisfaction, reputation, and so on). RTO indicates the downtime tolerance of a business process or an organization in general.

The RTO requirements are driven by the mission-critical nature of the business. Thus, for a system running a stock exchange, the RTO is zero or near to zero.

An organization is likely to have varying RTO requirements across its various business processes. Thus, for a high volume e-commerce website, for which there is an expectation of rapid response times and for which customer switching costs are very low, the web-based customer interaction system that drives e-commerce sales is likely to have an RTO of zero or close to zero. However, the RTO of the systems that support back-end operations, such as shipping and billing, can be higher. If these back-end systems are down, then the business may resort to manual operations temporarily without a significant visible impact.

The ability to take orders through the e-commerce website immediately (the RTO) may be more important than the RPO, because lost data can be reloaded later.

2.2.4 Recovery Point Objective

The business impact analysis also determines your tolerance to data loss, also known as recovery point objective (RPO). RPO is the maximum amount of data that an IT-based business process can lose without harm to the organization. RPO measures the data-loss tolerance of a business process or an organization in general. This data loss is often measured in terms of time, for example, zero, seconds, hours, or days of data loss.

A stock exchange where millions of dollars worth of transactions occur every minute cannot afford to lose any data. Thus, its RPO must be zero. The web-based sales system in the e-commerce example does not require an RPO of zero, although a low RPO is essential for customer satisfaction. However, its back-end merchandising and inventory update system can have a higher RPO because lost data can be reentered.

2.2.5 Manageability Goal

A manageability goal is more subjective than either the RPO or the RTO. It results from an objective evaluation of the skill sets, management resources, and tools available in an organization and the degree to which the organization can successfully manage all elements of a high availability architecture. Just as RPO and RTO measure an organization's tolerance for downtime or data loss, your manageability goal measures the organization's tolerance for complexity in the IT environment. When less complexity is a requirement, simpler methods of achieving high availability are preferred over methods that may be more complex to manage, even if the latter could attain more aggressive RTO and RPO objectives. Understanding manageability goals helps organizations differentiate between what is possible and what is practical to implement.

2.2.6 Total Cost of Ownership and Return on Investment

Understanding total cost of ownership (TCO) and objectives for return on investment (ROI) are essential to selecting a high availability architecture that also achieves the business goals of your organization. TCO includes all costs (such as acquisition, implementation, systems, networks, facilities, staff, training, and support), over the useful life of the solution chosen. Likewise, the ROI calculation captures all of the financial benefits that accrue to a given high availability architecture.

For example, consider a high availability architecture in which IT systems and storage at a remote standby site remain idle with no other business use that can be served by the standby systems. The only return on investment for the standby site is the costs related to downtime avoided by its use in a failover scenario. Contrast this with a different high availability architecture that enables IT systems and storage at the standby site to be used productively while in the standby role (for example, for reports or for off-loading the primary system of the overhead of user queries or distributing read-write workload). The return on investment of such an architecture includes both the cost of downtime avoided and the financial benefits that accrue to its productive use while it also provides for high availability and data protection.

2.3 Mapping Requirements to Architectures

The business impact analysis will document what you already know. Different applications and the databases that support them represent varying degrees of importance to the enterprise. A high level of investment in high availability infrastructure may not make sense for an application that if down, would not have an immediate impact on the enterprise. So where do you start?

The outcome of the business impact analysis enables databases within an enterprise to be grouped together with other databases having similar RTO and RPO objectives. The groups can then be mapped to a controlled set of high availability reference architectures that most closely addresses the required service levels. Note that in the case where there are dependencies between databases, they are grouped with the database having the most stringent high availability requirement.

2.3.1 Oracle MAA Reference Architectures

Oracle MAA best practices define four high availability reference architectures that address the complete range of availability and data protection required by enterprises of all sizes and lines of business. The architectures, or tiers, are designated Platinum, Gold, Silver, and Bronze. They deliver the service levels described in Figure 2-2.

Figure 2-2 Oracle MAA Reference Architectures

Description of "Figure 2-2 Oracle MAA Reference Architectures"

Each tier uses a different MAA reference architecture to deploy the optimal set of Oracle high availability capabilities that reliably achieve a given service level at the lowest cost and complexity. The tiers explicitly address all types of unplanned outages including data corruption, component failure, system and site outages, as well as planned outages due to maintenance, migrations, or other purposes.

2.3.2 Bronze Reference Architecture

The Bronze tier is appropriate for databases where simple restart or restore from backup is "HA enough." The Bronze tier is based upon a single instance Oracle Database with MAA best practices that use the many capabilities for data protection and high availability included with every Oracle Enterprise Edition license. Oracle-optimized backups using Oracle Recovery Manager (RMAN) provide data protection and are used to restore availability should an outage prevent the database from restarting.

2.3.3 Silver Reference Architecture

The Silver tier provides an additional level of high availability for databases that require minimal or zero downtime in the event of database instance or server failure, as well as many types of planned maintenance. The Silver tier adds clustering technology – either Oracle RAC or Oracle RAC One Node. RMAN provides database-optimized backups to protect data and restore availability should an outage prevent the cluster from restarting.

2.3.4 Gold Reference Architecture

The Gold tier raises the stakes substantially for business critical applications that cannot accept vulnerability to single points-of-failure. This tier adds database-aware replication technologies, Oracle Active Data Guard and Oracle GoldenGate, which synchronize one or more replicas of the production database to provide real time data protection and availability. Database-aware replication substantially enhances high availability and data protection beyond what is possible with storage replication technologies. It also reduces cost while improving return on investment by actively utilizing all replicas at all times.

2.3.5 Platinum Reference Architecture

The Platinum tier introduces several new Oracle Database 12c capabilities and previously available products that have been enhanced with the latest release. These capabilities include Application Continuity, for reliable replay of in-flight transactions that masks outages from users; Oracle Active Data Guard Far Sync, for zero data loss protection at any distance; Oracle GoldenGate enhancements for zero downtime upgrades and migrations; and Global Data Services for automated service management and workload balancing in replicated database environments. Each of these technologies requires additional effort to implement, but they deliver substantial value for the most critical applications where downtime and data loss are not an option.

2.3.6 High Availability and Data Protection Attributes by Tier

Table 2-1 summarizes the high availability and data protection attributes inherent to each tier. Each tier includes all of the capabilities of the previous tier and builds upon the architecture to handle an expanded fault domain. The various components included and the service levels achieved by each architecture are described in other topics.

Table 2-1 High Availability and Data Protection Attributes by Tier

Outage Class/HA Tier	Unplanned Outages (Local Site)	Planned Maintenance	Data Protection	Unrecoverable Local Outages and Disaster Recovery
Platinum	Zero application outage for Platinum ready applications	Zero application outage	Comprehensive runtime validation combined with manual checks	Zero application outage for Platinum ready applications, in-flight transactions are preserved, zero data loss
Gold	Comprehensive HA and DR	All rolling or online	Comprehensive runtime validation combined with manual checks	Real-time failover, zero or near-zero data loss
Silver	HA with automatic failover	Some rolling, some online, some offline	Basic runtime validation combined with manual checks	Restore from backup, potential to lose data generated since last backup
Bronze	Single Instance, Auto restart for recoverable instance and server failures	Some online, most off-line	Basic runtime validation combined with manual checks	Restore from backup, potential to lose data generated since last backup