Data and System Security
Chapter 32
Disaster Recovery, Business Continuity, Backups, and High Availability
Copyright © 2014 by McGraw-Hill Education.
Introduction
Disaster recovery and business continuity planning are separate but related concepts. In fact, disaster recovery is part of business continuity.
Disaster recovery (DR) concerns the recovery of the technical components of your business, such as computers, software, the network, data, and so on.
Business continuity planning (BCP) includes disaster recovery, along with procedures to restore business operations, the underlying business infrastructure that supports them, and the daily work of the people in your workplace. Business continuity planning is vital to keeping your business running and to providing a return to “business as usual” during a disaster.
What Constitutes a Disaster?
A disaster is defined as a “sudden, unplanned calamitous event causing great damage or loss” or “any event that creates an inability on an organization’s part to provide critical business functions for some predetermined period of time.”
With this general definition in mind, the disaster recovery planner or business continuity professional would sit down with all the principals in the organization and map out what would constitute a disaster for that organization. This is the initial stage of creating a business impact analysis (BIA).
Service Assurance Methods
DR and BCP professionals work together to ensure the recoverability and continuity of all aspects of an organization that are affected by an outage or security event. This chapter analyzes the best practices and methodologies for DR and BCP.
We also give close consideration to backups, which are necessary for disaster recovery as well as recovery from less severe incidents. Tape backups, which have traditionally been a key component of DR strategies to move data from the primary data center to the backup site, are giving way to online, real-time data replication strategies to keep data synchronized.
We consider high availability (HA) in the final section of this chapter. All three of these components (DR/BCP, backups, and HA) form the core of a resiliency strategy for services and data.
Disaster Recovery
When you put together a disaster recovery plan, you need to understand how your organization’s information technology (IT) infrastructure, applications, and network support the business functions of the enterprise you are recovering.
For example, a particular business unit may claim not to need a certain application or function until day three of a disaster, but the technology process may dictate that the application should be available on day one, due to technological interdependencies. In this example, the DR planner should work with (and educate) the business unit to help them understand why they need to pay for a day-one recovery as opposed to a day-three recovery. The business unit’s budget will typically include a sizeable expense for the IT department, and this may cause the business unit to think that any disaster recovery or business continuity efforts will be cost prohibitive. In working with the IT subject matter experts (SMEs), you can sometimes figure out a way to bypass a particular electronic feed or file dependency that may be needed to continue the recovery of your system.
Determining What to Recover
All of this will work well if you know what you are recovering and who to consult with. The responsible business continuity or disaster recovery professional should work with the IT group and the business unit to achieve one purpose—to operate a fine, productive, and lucrative organization.
You can come to know what you are recovering and who is involved by gathering experts, such as the programmer, business analyst, system architect, or any other necessary SME. These experts will prove to be invaluable when it comes to creating your DR plan. They know what it takes to technically run the business systems in question and can explain why a certain disaster recovery process will cost a certain amount. This information is important for the manager of the business unit, so that she can make informed decisions.
Business Continuity Planning
The business continuity professional is more concerned with the business functions that the employees perform than with the underlying technologies. To figure out how the business can resume normal operations during a disaster, the business continuity professional needs to work with each business unit as closely as possible. This means they need to meet with the people who make the decisions, the people who carry out the decisions in the management team, and finally the “worker bees” who actually do the work.
You can think of the “worker bees” as power users who know an application intimately. They know the nuances and idiosyncrasies of the business function—they are looking at the trees as opposed to the forest. This is important when it comes to preparing the business unit’s business continuity plan. The power users should participate in your disaster recovery rehearsals and business continuity tabletop exercises.
Management Team
The business unit management team is vital because its members see the business unit from a business perspective—at a higher level—and will help in determining the importance of the application, as they are acquainted with the mission of the business unit. The business unit also needs to keep in mind the need for a disaster recovery plan as it introduces new or upgraded program applications. The disaster recovery and/or business continuity professional should be kept informed about such changes.
For example, a member of management in a business unit might talk to a vendor about a product that could make a current business function quicker, smarter, and better. Being the diligent manager, he would bring the vendor in to meet with upper management, and the decision would be made to buy the product, all without informing the IT department or the disaster recovery or business continuity professional.
As you can see, the business continuity professional needs to have a relationship with every principal within the business unit so that, should a new product be brought into the organization, the knowledge and ability to recover the product will be taken into consideration.
The Four Components of Business Continuity Planning
There are four main components of business continuity planning, each of which is essential to the whole BCP initiative:
Plan initiation
Business impact analysis or assessment
Development of the recovery strategies
Rehearsal or exercise of the disaster recovery and business continuity plans
Each business unit should have its own plan. The organization as a whole needs to have a global plan, encompassing all the business units. There should be two plans that work in tandem: a business continuity plan (recovery of the people and business function) and a disaster recovery plan (technological and application recovery).
Initiating a Plan
Plan initiation puts everyone on the same page at the beginning of the creation of the plan. A disaster or event is defined from the perspective of the specific business unit or entire organization. What one business unit or organization considers a disaster may not be considered a disaster by another business unit or organization, and vice versa.
A BIA is important for several reasons. It provides an organization or business unit with a dollar value impact for an unexpected event. This indicates how long an organization can have its business interrupted before it will go out of business completely.
Events
Here are three examples of possible events that could impact your business and compel you to implement your disaster recovery or business continuity plan, along with some possible responses:
Hurricane: Because a hurricane can be predicted a reasonable amount of time before it strikes, you have time to inform employees to prepare their homes and other personal effects. You also have the time to alert your technology group so that they can initiate their preparation strategy procedures.
Blackout: You can ensure that your enterprise is attached to a backup generator or an uninterruptible power supply (UPS). You can conduct awareness programs and perhaps give away small flashlights that employees can keep in their desks.
Illness outbreak: You can provide an offsite facility where your employees can relocate during the outbreak and investigation.
Analyzing the Business Impact
With a BIA, you must first establish what the critical business function is. This can be determined only by the critical members of the business unit.
The BIA should be completed and reviewed by the business unit, including upper management, since the financing of the business continuity plan and disaster recovery project will ultimately come from the business unit’s coffers.
Developing Recovery Strategies
The next step is to develop your recovery strategy. The business unit will be paying for the recovery, so they need to know what their options are for different types of recoveries.
You can provide anything from a no-frills recovery to an instantaneous recovery. It all depends on the business functions that have to be recovered and on how long the business unit can go without the function.
The question is essentially how much insurance the business unit wants to buy. If it is your business, you are the only one who can make that decision. Someone who does not have as large a stake in the growth of the business cannot look at the business from the same perspective.
Procedures and Contacts
In a business recovery situation, there must be written procedures that all employees in your business unit can quickly access, understand, and follow. Information needs to be readily available about the business function that has to be performed. The procedures should be stored in multiple, accessible locations to ensure they are available in a disaster scenario.
You also need to make readily available a list of people to contact, along with their contact information. This list must be of the current employees to contact, and it should include members of the Human Resources, Facilities, Risk Management, and Legal departments. The list of contacts should also include the local fire and rescue department, police department, and emergency operations center.
Rehearsing Disaster Recovery and Business Continuity Plans
The fourth BCP component, and the most crucial, is to rehearse, exercise, or test the plan. This is “where the rubber meets the road.”
Having the other three components in place is important, but the plan is inadequate if you’re not sure whether it will work. It is vital to test your plan. If the plan has not been tested and it fails during a disaster, all the work you put into developing it is for naught. If the plan fails during a test, though, you can improve on it and test again.
Third-Party Vendor Issues
Most organizations make use of various third-party vendors (Enterprise Resource Planning [ERP], Application Service Provider [ASP], etc.) in their recovery efforts. In such cases, the information about the third-party vendor is just as critical in your business or technology recovery. When you need to make use of such resources, it is beneficial, if not crucial, to make inquiries into the third party’s operations prior to the implementation of its product or services.
In the real world, the disaster recovery and/or business continuity professional has to integrate the vendor’s information into the business unit’s continuity plan. If a critical path in your DR plan depends on the involvement of a third-party vendor, you can’t get your operation up and running if that third-party vendor isn’t prepared to assist you. For example, suppose that processing loans is the bread and butter of your business, and your business relies on credit bureau reports to process loans. In this scenario, you need to ensure that if your organization experiences an outage, you will still receive these reports so that your company can continue to conduct business.
The vendor’s ability to recover from a failure will also affect how robust your recovery is. Although your recovery may be technically sound, you must be sure that you can conduct business. The same standards you apply to your own organization should apply to third-party vendors you do business with. They should be available to you to conduct business. The disaster recovery or business continuity coordinator should make the appropriate inquiries with vendors to ensure that they can support a DR scenario.
Awareness and Training Programs
Another important element of disaster recovery and business continuity planning is an awareness program. The business continuity or disaster recovery professional can meet with each business for tabletop exercises. These exercises are important, because they actually get the members of the business unit to sit down and think about a particular event and how first to prevent or mitigate it and then how to recover from it.
The event can be anything from a category 3 hurricane to workplace violence. Any work stoppage can potentially impede the progress of an organization’s recovery or resumption of services, and it is up to the management team to design or develop a plan of action or a business continuity plan. The business continuity or disaster recovery professional must facilitate this process and make the business unit aware that there are events that can bring the business to a grinding halt.
Backups
Backups may be used for complete system restoration, but they can also allow you to recover the contents of a mailbox, for example, or an “accidentally” deleted document. Backups can be extended to saving more than just digital data. Backup processes can include the backup of specifications and configurations, policies and procedures, equipment, and data centers.
However, if the backup is not good or is too old, or the backup media is damaged, it will not fix the problem. Just having a backup procedure in place does not always offer adequate protection.
Many organizations can no longer depend on traditional backup processes—doing an offline backup is unacceptable, doing an online backup would unacceptably degrade system performance, and restoring from a backup would take so much time that the organization could not recover. Such organizations are using alternatives to traditional backups, such as redundant systems and cloud services.
Backup systems and processes, therefore, reflect the availability needs of an organization as well as its recovery needs.
Traditional Backup Methods
In the traditional backup process, data is copied to backup media, primarily tape, in a predictable and orderly fashion for secure storage both onsite and offsite.
Backup media can thus be made available to restore data to new or repaired systems after failure. In addition to data, modern operating systems and application configurations are also backed up.
This provides faster restore capabilities and occasionally may be the only way to restore systems where applications that support data are intimately integrated with a specific system.
Backup Types
There are several standard types of backups:
Full
Copy
Incremental
Differential
Full Backups
A full backup backs up all selected data, whether or not it has changed since the last backup. The definition of a full backup varies across systems. On some systems it includes critical operating system files needed to rebuild a system completely; on other systems it backs up only the user data.
Copy Backups
Data is copied from one disk to another.
Incremental Backups
When data is backed up, the archive bit on a file is turned off. When changes are made to the file, the archive bit is set again. An incremental backup uses this information to back up only files that have changed since the last backup. This backup turns the archive bit off again, and the next incremental backup backs up only the files that have changed since the last incremental backup. This backup type saves time, but it means that the restore process will involve restoring the last full backup and every incremental backup made after it.
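The archive-bit behavior described above can be simulated in a few lines. This is only a minimal sketch: the `File` class, the helper functions, and the file names are hypothetical illustrations, not the interface of any real backup product.

```python
from dataclasses import dataclass


@dataclass
class File:
    name: str
    archive_bit: bool = True  # set when a file is new or has been modified


def full_backup(files):
    """Copy every file, then clear every archive bit."""
    saved = [f.name for f in files]
    for f in files:
        f.archive_bit = False
    return saved


def incremental_backup(files):
    """Copy only files whose archive bit is set, then clear the bit."""
    saved = [f.name for f in files if f.archive_bit]
    for f in files:
        f.archive_bit = False
    return saved


def modify(f):
    """Changing a file sets its archive bit again."""
    f.archive_bit = True


files = [File("a.doc"), File("b.xls"), File("c.txt")]
sunday = full_backup(files)          # every file
modify(files[0])
monday = incremental_backup(files)   # only a.doc
modify(files[1])
tuesday = incremental_backup(files)  # only b.xls

# A complete restore needs the full backup plus every incremental after it.
restore_chain = [sunday, monday, tuesday]
```

Each incremental set stays small, but a restore has to walk the whole chain in order.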
Restoring from an incremental backup requires that all backups be applied.
The circle encloses all the backups that must be restored.
Differential Backups
Like an incremental backup, a differential backup backs up only files with the archive bit set, that is, files that have changed since the last backup that reset the bit. Unlike an incremental backup, however, a differential backup does not reset the archive bit.
Each differential backup backs up all files that have changed since the last backup that reset the bits. Using this strategy, a full backup is followed by differential backups.
A restore consists of restoring the full backup and then only the last differential backup made. This saves time during the restore, but, depending on your system, creating differential backups takes longer than creating incremental backups.
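A short simulation can show why each differential backup grows while the restore stays at two steps. This is a sketch under simple assumptions: the dictionary of archive bits and the file names are hypothetical.

```python
def differential_backup(archive_bits):
    """Copy files whose archive bit is set, leaving every bit unchanged."""
    return sorted(name for name, bit in archive_bits.items() if bit)


# State immediately after a full backup: every archive bit has been cleared.
archive_bits = {"a.doc": False, "b.xls": False, "c.txt": False}

archive_bits["a.doc"] = True                  # a.doc changes on Monday
monday = differential_backup(archive_bits)

archive_bits["b.xls"] = True                  # b.xls changes on Tuesday
tuesday = differential_backup(archive_bits)   # still includes a.doc

# A restore needs only the full backup plus the most recent differential.
files_from_last_differential = tuesday
```

Because the bits are never cleared, Tuesday's set contains Monday's changes as well, which is why only the latest differential is needed at restore time.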
Restoring from a differential backup requires applying only the full backup and the last differential backup.
The circle encloses all of the backups that must be restored.
Backup Rotation Strategies
In the traditional backup process, old backups are usually not immediately replaced by the new backup. Instead, multiple previous copies of backups are kept. This ensures recovery should one backup tape set be damaged or otherwise be found not to be good. Two traditional backup rotation strategies are Grandfather-Father-Son (GFS) and Tower of Hanoi.
GFS Backup Strategy
In the GFS rotation strategy, a backup is made to separate media each day.
Each Sunday a full backup is made, and on each of the other days of the week an incremental backup is made.
The Sunday backups are kept for a month, and the current week’s incremental backups are also kept.
On the first Sunday of the month, a new tape or disk is used to make a full backup. The previous full backup becomes the last full backup of the prior month and is re-labeled as a monthly backup.
Weekly and daily tapes are rotated as needed, with the oldest being used for the current backup.
Thus, on any given day of the month, that week’s backup is available, as well as the previous four or five weeks’ full backups, along with the incremental backups taken each day of the preceding week. If the backup scheme has been in use for a while, prior months’ backups are also available.
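One way to picture the GFS labels is a function that classifies a calendar date. This is a hedged sketch: the function name and label strings are hypothetical, and it assumes the "monthly" backup is the last Sunday full backup of each month, as described above.

```python
from datetime import date, timedelta


def gfs_label(day):
    """Classify the backup made on a given date under the GFS scheme."""
    if day.weekday() != 6:  # date.weekday(): Monday is 0, Sunday is 6
        return "daily incremental"
    # If the next Sunday falls in a different month, this is the month's
    # last full backup, which is later re-labeled as the monthly backup.
    if (day + timedelta(days=7)).month != day.month:
        return "monthly full"
    return "weekly full"


labels = {
    "2014-06-22": gfs_label(date(2014, 6, 22)),  # a Sunday in mid-month
    "2014-06-23": gfs_label(date(2014, 6, 23)),  # a Monday
    "2014-06-29": gfs_label(date(2014, 6, 29)),  # last Sunday of June
}
```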
Note:
No backup strategy is complete without plans to test backup media and backups by doing a restore. If a backup is unusable, it’s worse than having no backup at all, because it has lured users into a sense of security. Be sure to add the testing of backups to your backup strategy, and do this on a test system.
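A test restore can be checked automatically by comparing checksums of the original files against the restored copies. The following is a minimal sketch, not a complete verification tool: `verify_restore` and the directory names are hypothetical, and a plain directory copy stands in for the real backup and restore steps.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ after a test restore."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if src.is_file():
            rel = src.relative_to(source_dir)
            dst = restored_dir / rel
            if not dst.is_file() or sha256_of(src) != sha256_of(dst):
                problems.append(str(rel))
    return problems


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    restored = Path(tmp) / "restored"
    source.mkdir()
    (source / "report.txt").write_text("quarterly numbers")
    shutil.copytree(source, restored)  # stands in for backup + restore
    clean = verify_restore(source, restored)

    (restored / "report.txt").write_text("corrupted")  # simulate bad media
    damaged = verify_restore(source, restored)
```

Running a check like this on a test system, as the note recommends, catches unusable backups before a real restore depends on them.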
The Tower of Hanoi Backup Strategy
The Tower of Hanoi strategy is based on a game played with three poles and a number of rings. The object is to move the rings from their starting point on one pole to another pole.
However, the rings are of different sizes, and you are not allowed to have a ring on top of one that is smaller than itself. To accomplish the task, a certain order must be followed.
Consider a simple version of the Tower of Hanoi, in which you are given three pegs, one of which has three rings stacked on it from largest at the bottom to smallest at the top. Call these rings A (small), B (medium), and C (large). You need to move the rings to the right-hand peg. How do you solve this puzzle?
Tower of Hanoi Solution
The solution is to move
A to the right-hand peg,
then B to the middle peg,
A on top of B on the middle peg,
then C to the right-hand peg,
then A to the now-empty left-hand peg,
B on top of C on the right-hand peg,
and finally A on top of B to complete the stack on the right-hand peg.
The rings were moved in this order: A B A C A B A. If you solve this puzzle with four rings labeled A through D, your moves would be A B A C A B A D A B A C A B A.
Five rings are solved with the sequence A B A C A B A D A B A C A B A E A B A C A B A D A B A C A B A.
As you can see, there is a recursive pattern here that looks complicated but is actually very repetitive. Small children solve this puzzle all the time.
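The recursive pattern can be generated directly: the sequence for n rings is the sequence for n-1 rings, then the new largest ring, then the n-1 sequence again. A minimal sketch follows, with `ring_sequence` as a hypothetical helper name.

```python
from string import ascii_uppercase


def ring_sequence(rings):
    """Return the order in which rings move, with A as the smallest ring."""
    if rings == 0:
        return []
    smaller = ring_sequence(rings - 1)
    # Solve for the smaller stack, move the largest ring once, then repeat.
    return smaller + [ascii_uppercase[rings - 1]] + smaller


three = " ".join(ring_sequence(3))
four = " ".join(ring_sequence(4))
five_moves = len(ring_sequence(5))
```

The length doubles plus one with each added ring, so n rings take 2^n - 1 moves.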
Tower of Hanoi for Backups
Using the same strategy with backup tapes requires rotating multiple tapes in this same order. Each backup is a full backup, and multiple backups are made to each tape. Since each tape’s backups are not sequential, the chance that the loss of or damage to one tape will destroy the backups for the current period is very small. A fairly current backup is always available on another tape. This backup method gives you as many different restore options as you have tapes.
Consider a three-tape Tower of Hanoi backup scheme and its similarity to the sequence of the game. On day one, you perform a full backup to tape A. On day two, your full backup goes to tape B. On day three, you back up to tape A again, and on day four you introduce tape C, which hasn’t been used yet. At this point, you now have three tapes containing full backups for the last three days. That’s pretty good coverage. On days 5, 6, and 7, you use tapes A, B, and A again, respectively. This gives you three tapes containing full backups that you can rely on, even if one tape is damaged.
Use More Tapes
For additional coverage, you can use a four-tape or five-tape Tower of Hanoi scheme.
You would perform the same rotation as in the game, either A B A C A B A D A B A C A B A in a four-tape system or A B A C A B A D A B A C A B A E A B A C A B A D A B A C A B A in a five-tape system.
Higher numbers of tapes can be used as well, but the system is complicated enough that human error can become a concern. Backup software can assist by prompting the backup operator for the correct tape if it is configured for a Tower of Hanoi scheme.
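The tape to use on a given day can be computed rather than looked up, which is roughly what backup software would do when prompting the operator. This is a sketch, not any product's actual logic: `tape_for_day` is a hypothetical name, and reusing the last tape once the game sequence runs out (day 8 in a three-tape scheme) is an assumption the text above does not cover.

```python
from string import ascii_uppercase


def tape_for_day(day, tapes):
    """Return the tape letter for a 1-based day number."""
    # The number of trailing zero bits in the day number reproduces the
    # A B A C A B A ... pattern: odd days use A, every 2nd day B, and so on.
    trailing_zeros = (day & -day).bit_length() - 1
    # Assumption: days past the end of the game sequence reuse the last tape.
    return ascii_uppercase[min(trailing_zeros, tapes - 1)]


three_tape_week = "".join(tape_for_day(d, 3) for d in range(1, 8))
four_tape_cycle = "".join(tape_for_day(d, 4) for d in range(1, 16))
```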
Backup Alternatives and Newer Methodologies
Many backup strategies are available for use today as alternatives to traditional tape backups:
Hierarchical Storage Management (HSM)
Windows shadow copy
Online backup or data vaulting
Dedicated backup networks
Disk-to-disk (D2D) technology
Hierarchical Storage Management (HSM)
HSM is more of an archiving system than a strict “backup” strategy, but it is a valid way of preserving data that can be considered as part of a data retention strategy. Long available for mainframe systems, it is also available on Windows.
HSM is an automated process that moves the least-used files to progressively more remote data storage. In other words, frequently used and changed data is stored online on high speed, local disks. As data ages (as it is not accessed and is not changed), it is moved to more remote storage locations, such as disk appliances or even tape systems.
However, the data is still cataloged and appears readily available to the user. If accessed, it can be automatically made available—it can be moved to local disks, it can be returned via network access, or, in the case of offline storage, operators can be prompted to load the data. Online services or cloud storage can be used for the more remote data storage, and this approach is commonly found in e-mail archiving solutions.
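The aging rule behind HSM can be pictured as a simple tiering policy. This is an illustrative sketch only: the 30-day and 365-day thresholds and the tier names are hypothetical and would be set by an organization's own retention policy.

```python
def storage_tier(days_since_last_access):
    """Pick a storage tier for a file based on how long it has been idle."""
    if days_since_last_access <= 30:    # hypothetical threshold
        return "local disk"
    if days_since_last_access <= 365:   # hypothetical threshold
        return "disk appliance"
    return "tape"


tiers = [storage_tier(days) for days in (5, 90, 800)]
```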
Windows Shadow Copy
This Windows service takes a snapshot of a working volume, and then a normal data backup can be made that includes open files. The shadow copy service doesn’t make a copy; it just fixes a point in time and then places subsequent changes in a hidden volume.
When a backup is made, closed files and disk copies of open files are stored along with the changes. When files are stored on a Windows system, the service runs in the background, constantly recording file changes.
If a special client is loaded, previous versions of a file can be accessed and restored by any user who has authorization to read the file. Imagine that Alice deletes a file on Monday, or Bob makes a mistake in a complex spreadsheet design on Friday. On the following Tuesday, each can obtain their old versions of the file on their own, without a call to the help desk, and without IT getting involved.
Online Backup or Data Vaulting
An individual or business can contract with an online service that automatically and regularly connects to a host or hosts and copies identified data to an online server.
Typically, arrangements can be made to back up everything, data only, or specific data sets.
Payment plans are based both on volume of data backed up and on the number of hosts, ranging up to complete data backups of entire data centers.
Dedicated Backup Networks
An Ethernet LAN can become a backup bottleneck if disk and tape systems are provided in parallel and exceed the LAN’s throughput capacity. Backups also consume bandwidth and thus degrade performance for other network operations.
Dedicated backup networks are often implemented using a Fibre Channel storage area network (SAN) or Gigabit Ethernet network and Internet Small Computer Systems Interface (iSCSI). iSCSI and Gigabit Ethernet can provide wire-speed data transfer. Backup is to servers or disk appliances on the SAN.
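A rough estimate of the transfer time shows why network throughput matters for the backup window. This is a back-of-the-envelope sketch: `transfer_hours` and the `efficiency` factor are hypothetical, and real throughput depends on protocol overhead, disk speed, and competing traffic.

```python
def transfer_hours(data_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate the hours needed to move data over a network link."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


# 2 TB over Gigabit Ethernet at full wire speed, then at half of wire speed.
ideal_hours = transfer_hours(2 * 10**12, 10**9)
shared_hours = transfer_hours(2 * 10**12, 10**9, efficiency=0.5)
```

Under these assumptions the ideal case takes about 4.4 hours, and sharing the link with other traffic doubles it.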
Disk-to-Disk (D2D) Technology
A slow tape backup system may be a bottleneck, as servers may be able to provide data faster than the tape system can record it. D2D servers don’t wait for a tape drive, and disks can be provided over high-speed dedicated backup networks, so both backups and restores can be faster.
D2D can use traditional network-attached storage (NAS) systems supported by Ethernet connectivity and either the Network File System (NFS on Unix) protocol or Common Internet File System (CIFS on Windows) protocol, or dedicated backup networks can be provided for D2D.
Backup Benefits
Many benefits can be obtained from backing up as a regular part of IT operations:
Cost savings: It takes many people-hours to reproduce digitally stored data. The cost of backup software and hardware is a fraction of this cost.
Productivity: Users cannot work without data. When data can be restored quickly, productivity is maintained.
Increased security: When backups are available, the impact of an attack that destroys or corrupts data is lessened. Data can be replaced or compared to ensure its integrity.
Simplicity: When centralized backups are used, no user needs to make a decision about what to back up.
Backup Policy
The way to ensure that backups are made and protected is to have an enforceable and enforced backup policy.
The policy should identify the goals of the process, such as frequency, the necessity of onsite and offsite storage, and requirements for formal processes, authority, and documentation.
Procedures can then be developed, approved, and used that interpret policy in light of current applications, data sets, equipment, and the availability of technologies. Several topics should be specifically detailed in the policy.
Administrative Authority
Designate who has the authority to physically start the backup, transport and check out backup media, perform restores, sign off on activity, and approve changes in procedures. This should also include guidelines for how individuals are chosen.
Recommendations should include separating duties between backing up and restoring, between approval and activity, and even between systems. (For example, those authorized to back up directory services and password databases should be different from those given authority to back up databases.) This allows for role separation, a critical security requirement, and the delegation of many routine duties to junior IT employees.
What to Back Up
Designate which information should be backed up.
Should system data or only application data be backed up?
What about configuration information, patch levels, and version levels?
How will applications and operating systems be replaced?
Are original and backup copies of their installation disks provided for?
These details should be specified.
Scheduling
Identify how often backups should be performed.
Monitoring
Specify how to ensure the completion and retention of backups.
Storage for Backup Media
Specify which of the many ways to store backup media are appropriate.
Is media stored both onsite and offsite?
What are the requirements for each type of storage? For example, are fireproof vaults or cabinets available? Are they kept closed? Where are they located?
Onsite backup media needs to be available, but storing backups near the original systems may be counterproductive. A disaster that damages the original system might take out the backup media as well.
Type of Media and Process Used
Specify how backups are made.
How many backups are made, and of what type?
How often are they made, and how long are they kept?
How often is backup media replaced?
High Availability
Not too long ago, most businesses closed at 5 p.m. Many were not open on the weekends, holidays were observed by closings or shortened hours, and few of us worried when we couldn’t read the latest news at midnight or shop for bath towels at 3 a.m. That’s not true anymore. Even ordinary businesses maintain computer systems around the clock, and their customers expect instant gratification at any hour. Somehow, since computers and networks are devices and not people, we expect them just to keep working without breaks, or sleep.
Of course, they do break. Procedures, processes, software, and hardware that enable system and network redundancy are a necessary part of operations. However, they serve another purpose as well. Redundancy ensures the integrity and availability of information.
Redundancy
What effect does system redundancy have? Calculations including the mean time to repair (how long it takes to replace a failed component) and uptime (the percentage of time a system is operational) can show the results of having versus not having redundancy built into a computer system or a network. However, the importance of these figures depends on the needs and requirements of the system.
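One common form of this calculation combines mean time to repair with mean time between failures (MTBF, a figure not mentioned above and introduced here as an assumption) to get uptime, then estimates the effect of redundant components. The sketch below assumes components fail independently, which real systems only approximate; the function names and numbers are illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time a single component is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single, copies):
    """Availability when any one of several independent copies suffices."""
    return 1 - (1 - single) ** copies


# Hypothetical component: fails every 990 hours, takes 10 hours to repair.
single = availability(mtbf_hours=990, mttr_hours=10)
paired = redundant_availability(single, copies=2)
```

Under these assumptions a single component is up about 99% of the time and a redundant pair about 99.99%, which cuts expected downtime by a factor of 100.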
Most desktop systems, for example, do not require built-in redundancy; if one fails and our work is critical, we simply obtain another desktop system. The need for redundancy is met by another system. In most cases, though, we do something else while the system is fixed. Other systems are critical to the survival of a business or perhaps even of a life. These systems need built-in hardware redundancy, support alternatives that can keep their functions intact, or both.
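The effect of redundancy on uptime can be sketched with a short calculation. The following is a minimal illustration, not a formula taken from this chapter: it assumes the common steady-state definition of availability as MTBF / (MTBF + MTTR), where mean time between failures (MTBF) is an added input, and it assumes redundant components fail independently. All figures are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a single component is operational (assumed model)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent components suffices."""
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical component: fails every 990 hours, takes 10 hours to repair.
    single = availability(mtbf_hours=990.0, mttr_hours=10.0)
    pair = redundant_availability(single, copies=2)
    print(f"single: {single:.4f}, {downtime_hours_per_year(single):.1f} h/year down")
    print(f"pair:   {pair:.4f}, {downtime_hours_per_year(pair):.1f} h/year down")
```

Whether the difference between the two results matters depends, as noted above, on the needs and requirements of the system.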
Note:
Critical systems are those systems a business must have, and without which it would be critically damaged, or whose failure might be life-threatening. Which systems are critical to a business must be determined by the business. For some it will be their e-commerce site, for others the billing system, and for others their customer information databases. Everyone recognizes the critical nature of air traffic control systems and life support systems used in hospitals.
Two methods can be used to evaluate where and how much redundancy is needed. The first, more traditional method is to weigh the cost of providing redundancy against the cost of downtime without redundancy. These costs can be calculated and compared directly. (Is the cost of downtime greater or less than the cost of redundancy?) The second method, which is harder to calculate but is increasingly easy to justify, is to decide based on the likelihood that customers will gravitate to the organization that can provide the best availability of service. This, in turn, is based on the increasing demands that online services, unlike traditional services, be available 24×7×365. High availability can be a selling point that directly leads to more business. Indeed, some customers will demand it.
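The first method, a direct cost comparison, can be sketched as follows. This is an illustrative calculation only: the function names, the hourly cost of downtime, and the annual cost of redundancy are all hypothetical inputs that each organization would have to estimate for itself.

```python
def expected_downtime_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime, given an assumed hourly cost."""
    return hours_down_per_year * cost_per_hour


def redundancy_pays_off(
    hours_down_without: float,
    hours_down_with: float,
    cost_per_hour: float,
    annual_redundancy_cost: float,
) -> bool:
    """True when the downtime cost avoided exceeds the cost of redundancy."""
    avoided_hours = hours_down_without - hours_down_with
    avoided_cost = expected_downtime_cost(avoided_hours, cost_per_hour)
    return avoided_cost > annual_redundancy_cost


if __name__ == "__main__":
    # Hypothetical figures: redundancy cuts downtime from 87.6 to 0.9 hours a year.
    print(redundancy_pays_off(87.6, 0.9, cost_per_hour=5000.0,
                              annual_redundancy_cost=100000.0))
```

The second method resists this kind of calculation because lost or gained customers are harder to estimate than hours of downtime.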
There are automated methods for providing system redundancy, such as hardware fault tolerance, clustering, and network routing, and there are operational methods, such as component hot-swapping and standby systems.
Automated Redundancy Methods
It has become commonplace to expect significant hardware redundancy and fault tolerance in server systems. A wide range of components are either duplicated within the systems or effectively duplicated by linking systems into a cluster. Typical components and techniques include:
Clustering
Fault tolerance
Redundant System Slot (RSS)
Cluster in a box
High-availability design
Internet network routing
Clustering
Entire computers or systems are duplicated. If a system fails, operation automatically transfers to the other systems.
Clusters may be set up as active-standby, in which case one system is live and the other is idle, or active-active, in which case multiple systems are kept in sync, and even dynamic load sharing is possible.
Active-active is ideal, as no system stands idle and the total capacity of all systems can always be utilized. If there is a system failure, fewer systems carry the load. When the failed system is replaced, load balancing readjusts.
Clustering does have its downside. When active-standby is used, duplication of systems is expensive. These active-standby systems may also take seconds for the failover to occur, which is a long time when systems are under heavy loads. Active-active systems, however, may require specialized hardware and additional, specialized administrative knowledge and maintenance.
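The active-standby arrangement can be sketched in a few lines. This is a simplified model under stated assumptions, not a description of any particular clustering product: the Node and ActiveStandbyCluster names are hypothetical, health is a simple flag rather than a real heartbeat, and failover is instantaneous, whereas real failover may take seconds.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActiveStandbyCluster:
    """One node serves requests; the other sits idle until a failure."""

    def __init__(self, primary: Node, standby: Node) -> None:
        self.active = primary
        self.standby = standby

    def handle(self, request: str) -> str:
        # Fail over automatically if the active node is no longer healthy.
        if not self.active.healthy:
            if not self.standby.healthy:
                raise RuntimeError("no healthy node available")
            self.active, self.standby = self.standby, self.active
        return f"{self.active.name} served {request}"


if __name__ == "__main__":
    cluster = ActiveStandbyCluster(Node("node-a"), Node("node-b"))
    print(cluster.handle("request-1"))
    cluster.active.healthy = False  # simulate a failure of the live system
    print(cluster.handle("request-2"))
```

Note that the standby node does no useful work until the failure, which is the duplication cost described above.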
Fault Tolerance
Components may have backup systems or parts of systems that allow them to recover from errors or to survive in spite of them.
For example, fault-tolerant CPUs use multiple CPUs running in lockstep, each using the same processing logic. In the typical case, three CPUs are used and the results from all CPUs are compared. If one CPU produces results that don’t match those of the other two, it is considered to have failed and is no longer consulted until it is replaced.
Another example is the fault tolerance built into Microsoft’s NTFS file system. If the system detects a bad spot on a disk during a write, it automatically marks it as bad and writes the data elsewhere. The logic to both these strategies is to isolate failure and continue on. Meanwhile, the system can raise alerts and record error messages to prompt maintenance.
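The three-CPU comparison described above amounts to a majority vote. The sketch below illustrates that voting logic in software only; real lockstep processors do this in hardware, and the majority_vote function is a hypothetical name introduced for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def majority_vote(results: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return the majority result and the indices of units that disagreed."""
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority: results cannot be trusted")
    # Units whose output differs are isolated until they are replaced.
    failed_units = [i for i, r in enumerate(results) if r != value]
    return value, failed_units


if __name__ == "__main__":
    # Hypothetical outputs from three units running the same logic.
    value, failed = majority_vote([42, 42, 41])
    print(f"accepted result: {value}, failed units: {failed}")
```

As in the text, the failure is isolated and processing continues, while the list of failed units can drive alerts that prompt maintenance.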
Redundant System Slot (RSS)
Entire hot-swappable computer units are provided in a single unit.
Each system has its own operating system and bus, but all systems are connected and share other components.
Like clustered systems, RSS systems can be either active-standby or active-active. RSS systems exist as a unit, and systems cannot be removed from their unit and continue to operate.
Cluster in a Box
Two or more systems are combined in a single unit.
The difference between these systems and RSS systems is that each unit has its own CPU, bus, peripherals, operating system, and applications.
Components can be hot-swapped, and therein lies its advantage over a traditional cluster.
High Availability Design
Two or more complete components are placed on the network, with one component serving either as a standby system (with traffic being routed to the standby system if the primary fails) or as an active node (with load balancing being used to route traffic to multiple systems sharing the load, and if one fails, traffic is routed only to the other functional systems).
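The active-node variant, in which traffic is routed only to functional systems, can be sketched as a round-robin balancer that skips failed nodes. This is a minimal illustration under assumptions: the LoadBalancer class and node names are hypothetical, and node health is set by hand rather than by real health checks.

```python
class LoadBalancer:
    """Round-robin routing across nodes, skipping any marked as down."""

    def __init__(self, nodes: list) -> None:
        self.nodes = list(nodes)
        self.down = set()
        self._next = 0

    def mark_down(self, node: str) -> None:
        self.down.add(node)

    def mark_up(self, node: str) -> None:
        self.down.discard(node)

    def route(self) -> str:
        # Try each node at most once per request, starting where we left off.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node not in self.down:
                return node
        raise RuntimeError("no functional nodes available")


if __name__ == "__main__":
    lb = LoadBalancer(["web1", "web2", "web3"])
    print([lb.route() for _ in range(3)])
    lb.mark_down("web2")  # simulate a node failure
    print([lb.route() for _ in range(4)])
```

When the failed node is repaired, calling mark_up returns it to the rotation and the load is shared across all nodes again.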
A High Availability Network Design Supporting a Web Site
Multiple ISP backbones are available, and duplicate firewalls, load-balancing systems, application servers, and database servers support a single web site.
Internet Network Routing
In an attempt to achieve redundancy for Internet-based systems similar to that of the Public Switched Telephone Network (PSTN), new architectures for Internet routing are adding or proposing a variety of techniques, such as these:
Reserve capacity
System and geographic diversity
Size limits
Dynamic restoration switching
Self-healing protection switching
Fast rerouting (which reverses traffic at the point of failure so that it can be directed to an alternative route)
RSVP-based backup tunnels (where a node adjacent to a failed link signals failure to upstream nodes, and traffic is thus rerouted around the failure)
Two-path protection (in which sophisticated engineering algorithms develop alternative paths between every node)
Two examples of such architectures are Multiprotocol Label Switching (MPLS), which integrates IP and data-link layer technologies to introduce sophisticated routing control, and Automatic Protection Switching (APS), which provides the fast restoration times that modern technologies, such as voice and streaming media, require.
Operational Redundancy Methods
In addition to technologies that provide automated redundancy, there are many processes that help you to quickly get your systems up and running if a problem occurs. These include:
Standby systems
Hot-swappable components
Standby Systems
Complete or partial systems are kept ready. Should a system, or one of its subsystems, fail, the standby system can be put into service. There are many variations on this technique.
Some clusters are deployed in active-standby state, so the clustered system is ready to go but idle. To recover from a CPU or other major system failure quickly, a hard drive might be moved to another, duplicate, online system.
To recover quickly from the failure of a database system, a duplicate system complete with database software may be kept ready. The database is periodically updated by replication or by export and import functions. If the main system fails, the standby system can be placed online, though it may be lacking some recent transactions.
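The gap between the last replication and the moment of failure determines which transactions the standby lacks. The following sketch models that gap with plain lists; the ReplicatedStandby class is hypothetical and stands in for whatever replication or export/import mechanism the database actually uses.

```python
class ReplicatedStandby:
    """Toy model of a primary database and a periodically updated standby."""

    def __init__(self) -> None:
        self.primary_log = []
        self.standby_log = []

    def commit(self, txn: str) -> None:
        """Record a transaction on the primary only."""
        self.primary_log.append(txn)

    def replicate(self) -> None:
        """Periodic update: copy everything committed so far to the standby."""
        self.standby_log = list(self.primary_log)

    def fail_over(self):
        """Return (transactions on the standby, transactions lost)."""
        lost = self.primary_log[len(self.standby_log):]
        return list(self.standby_log), lost


if __name__ == "__main__":
    db = ReplicatedStandby()
    db.commit("t1")
    db.commit("t2")
    db.replicate()
    db.commit("t3")  # committed after the last replication
    available, lost = db.fail_over()
    print(f"available: {available}, lost: {lost}")
```

Replicating more often shrinks the set of lost transactions, at the cost of more replication traffic.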
Hot-Swappable Components
Many hardware components can now be replaced without shutting down systems. Hard drives, network cards, and memory are examples of hardware components that can be added or replaced while the system is running.
Modern operating systems detect the addition of these devices on the fly, and operations continue with minor, if any, service outages.
In a RAID array, for example, drive failure may be compensated for by the built-in redundancy of the array. If the failed drive can be replaced without shutting down the system, the array will return to its prefailure state. Interruptions in service will be nil, though performance may suffer depending on the current load.
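One way an array can compensate for a failed drive is XOR parity. The sketch below assumes a parity-based layout (as used in RAID 5, for example); mirrored arrays recover differently, and real controllers work on striped disk blocks rather than short byte strings. The function names are hypothetical.

```python
from functools import reduce
from typing import List, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks: Sequence[bytes], parity: bytes) -> bytes:
    """Recover the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    # Hypothetical data blocks, one per drive.
    data: List[bytes] = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)
    # Simulate losing the second drive and rebuilding it onto a replacement.
    rebuilt = rebuild_missing([data[0], data[2]], parity)
    print(rebuilt == data[1])
```

The rebuild has to read every surviving block, which is one reason performance may suffer while a replaced drive is being restored.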
Summary
In this chapter, we covered the four related business resumption strategies that are all necessary for recovery from incidents, outages, and disasters that result in service or data loss: disaster recovery, business continuity planning, backups, and high availability. Together, these form the core of a strategy to keep the organization’s information infrastructure operational.
Here in summary are the principal points, roles, and responsibilities of a good disaster recovery and business continuity program:
Develop and maintain disaster recovery and business continuity plans for all your organization’s enterprise technologies.
Schedule and oversee disaster recovery rehearsals for all enterprise systems.
Ensure disaster awareness by planning and conducting awareness programs, hazard fairs, lunch-and-learn sessions, and other informative events and materials.
Activate the plan.
Ensure community involvement by participating in local community disaster mitigation and planning initiatives and professional groups.
The disaster recovery and business continuity process is cyclical and must be maintained for it to stay current with the needs of the organization and the technologies in the environment. Your plans must be updated and rehearsed regularly. Disaster recovery is vital to everyone.
Backups can be an important part of a recovery strategy. They play a role in the disaster recovery process, moving data from the primary site to the DR site, although real-time data replication approaches are replacing traditional tape shipments in modern DR plans. Backups are also necessary for recovering data in a traditional data center.
High availability architectures are the fourth leg of the table supporting service resiliency, to ensure that failure of one system or component of a service doesn’t cause that service to fail.