Data and System Security


Chapter 32

Disaster Recovery, Business Continuity, Backups, and High Availability

Introduction

Disaster recovery and business continuity planning are separate but related concepts. In fact, disaster recovery is part of business continuity.

Disaster recovery (DR) concerns the recovery of the technical components of your business, such as computers, software, the network, data, and so on.

Business continuity planning (BCP) includes disaster recovery along with procedures to restore business operations and the underlying functionality of the business infrastructure needed to support the business, along with the resumption of the daily work of the people in your workplace. Business continuity planning is vital to keeping your business running and to providing a return to “business as usual” during a disaster.

What Constitutes a Disaster?

A disaster is defined as a “sudden, unplanned calamitous event causing great damage or loss” or “any event that creates an inability on an organization’s part to provide critical business functions for some predetermined period of time.”

With this general definition in mind, the disaster recovery planner or business continuity professional would sit down with all the principals in the organization and map out what would constitute a disaster for that organization. This is the initial stage of creating a business impact analysis (BIA).

Service Assurance Methods

DR and BCP professionals work together to ensure the recoverability and continuity of all aspects of an organization that are affected by an outage or security event. This chapter analyzes the best practices and methodologies for DR and BCP.

We also give close consideration to backups, which are necessary for disaster recovery as well as recovery from less severe incidents. Tape backups, which have traditionally been a key component of DR strategies to move data from the primary data center to the backup site, are giving way to online, real-time data replication strategies to keep data synchronized.

We consider high availability in the final section of this chapter. All three of these components (DR/BCP, backups, and HA) form the core of a resiliency strategy for services and data.

Disaster Recovery

When you put together a disaster recovery plan, you need to understand how your organization’s information technology (IT) infrastructure, applications, and network support the business functions of the enterprise you are recovering.

For example, a particular business unit may claim not to need a certain application or function until day three of a disaster, but the technology process may dictate that the application should be available on day one, due to technological interdependencies. In this example, the DR planner should work with (and educate) the business unit to help them understand why they need to pay for a day-one recovery as opposed to a day-three recovery. The business unit’s budget will typically include a sizeable expense for the IT department, and this may cause the business unit to think that any disaster recovery or business continuity efforts will be cost-prohibitive. In working with the IT subject matter experts (SMEs), you can sometimes figure out a way to bypass a particular electronic feed or file dependency that would otherwise be needed to continue the recovery of your system.

Determining What to Recover

All of this will work well if you know what you are recovering and who to consult with. The responsible business continuity or disaster recovery professional should work with the IT group and the business unit to achieve one purpose—to operate a fine, productive, and lucrative organization.

You can come to know what you are recovering and who is involved by gathering experts, such as the programmer, business analyst, system architect, or any other necessary SME. These experts will prove to be invaluable when it comes to creating your DR plan. They know what it takes to technically run the business systems in question and can explain why a certain disaster recovery process will cost a certain amount. This information is important for the manager of the business unit, so that she can make informed decisions.

Business Continuity Planning

The business continuity professional is more concerned with the business functions that the employees perform than with the underlying technologies. To figure out how the business can resume normal operations during a disaster, the business continuity professional needs to work with each business unit as closely as possible. This means they need to meet with the people who make the decisions, the people who carry out the decisions in the management team, and finally the “worker bees” who actually do the work.

You can think of the “worker bees” as power users who know an application intimately. They know the nuances and idiosyncrasies of the business function—they are looking at the trees as opposed to the forest. This is important when it comes to preparing the business unit’s business continuity plan. The power users should participate in your disaster recovery rehearsals and business continuity tabletop exercises.

Management Team

The business unit management team is vital because its members see the business unit from a business perspective—at a higher level—and will help in determining the importance of the application, as they are acquainted with the mission of the business unit. The business unit also needs to keep in mind the need for a disaster recovery plan as it introduces new or upgraded program applications. The disaster recovery and/or business continuity professional should be kept informed about such changes.

For example, a member of management in a business unit might talk to a vendor about a product that could make a current business function quicker, smarter, and better. Being the diligent manager, he would bring the vendor in to meet with upper management, and the decision would be made to buy the product, all without informing the IT department or the disaster recovery or business continuity professional.

As you can see, the business continuity professional needs to have a relationship with every principal within the business unit so that, should a new product be brought into the organization, the knowledge and ability to recover the product will be taken into consideration.

The Four Components of Business Continuity Planning

There are four main components of business continuity planning, each of which is essential to the whole BCP initiative:

Plan initiation

Business impact analysis or assessment

Development of the recovery strategies

Rehearsal or exercise of the disaster recovery and business continuity plans

Each business unit should have its own plan. The organization as a whole needs to have a global plan, encompassing all the business units. There should be two plans that work in tandem: a business continuity plan (recovery of the people and business function) and a disaster recovery plan (technological and application recovery).

Initiating a Plan

Plan initiation puts everyone on the same page at the beginning of the creation of the plan. A disaster or event is defined from the perspective of the specific business unit or entire organization. What one business unit or organization considers a disaster may not be considered a disaster by another business unit or organization, and vice versa.

A BIA is important for several reasons. It provides an organization or business unit with a dollar value impact for an unexpected event. This indicates how long an organization can have its business interrupted before it will go out of business completely.
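The dollar-value reasoning can be sketched as a small calculation. This is only an illustration: the function name and the figures (cash reserve, daily loss) are hypothetical assumptions, and a real BIA would account for many more factors than a flat daily loss.

```python
def days_until_insolvent(cash_reserve: float, daily_loss: float) -> float:
    """Return how many days of interruption the reserve can absorb."""
    if daily_loss <= 0:
        raise ValueError("daily_loss must be positive")
    return cash_reserve / daily_loss

# Hypothetical figures: $250,000 in reserves, $20,000 lost per day of outage.
print(days_until_insolvent(250_000, 20_000))  # 12.5
```

A result like this gives the business unit a rough upper bound on how long an interruption it can tolerate.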

Events

Here are three examples of possible events that could impact your business and compel you to implement your disaster recovery or business continuity plan, along with some possible responses:

Hurricane: Because a hurricane can be predicted a reasonable amount of time before it strikes, you have time to inform employees to prepare their homes and other personal effects. You also have the time to alert your technology group so that they can initiate their preparation strategy procedures.

Blackout: You can ensure that your enterprise is attached to a backup generator or an uninterruptible power supply (UPS). You can conduct awareness programs and perhaps give away small flashlights that employees can keep in their desks.

Illness outbreak: You can provide an offsite facility where your employees can relocate during the outbreak and investigation.

Analyzing the Business Impact

With a BIA, you must first establish what the critical business function is. This can be determined only by the critical members of the business unit.

The BIA should be completed and reviewed by the business unit, including upper management, since the financing of the business continuity plan and disaster recovery project will ultimately come from the business unit’s coffers.

Developing Recovery Strategies

The next step is to develop your recovery strategy. The business unit will be paying for the recovery, so they need to know what their options are for different types of recoveries.

You can provide anything from a no-frills recovery to an instantaneous recovery. It all depends on the business functions that have to be recovered and on how long the business unit can go without the function.

The question is essentially how much insurance the business unit wants to buy. If it is your business, you are the only one who can make that decision. Someone who does not have as large a stake in the growth of the business cannot look at the business from the same perspective.

Procedures and Contacts

In a business recovery situation, there must be written procedures that all employees in your business unit can quickly access, understand, and follow. Information needs to be readily available about the business function that has to be performed. The procedures should be stored in multiple, accessible locations to ensure they are available in a disaster scenario.

You also need to make readily available a list of people to contact, along with their contact information. This list must be of the current employees to contact, and it should include members of the Human Resources, Facilities, Risk Management, and Legal departments. The list of contacts should also include the local fire and rescue department, police department, and emergency operations center.

Rehearsing Disaster Recovery and Business Continuity Plans

The fourth BCP component, and the most crucial, is to rehearse, exercise, or test the plan. This is “where the rubber meets the road.”

Having the other three components in place is important, but the plan is inadequate if you’re not sure whether it will work. It is vital to test your plan. If the plan has not been tested and it fails during a disaster, all the work you put into developing it is for naught. If the plan fails during a test, though, you can improve on it and test again.

Third-Party Vendor Issues

Most organizations make use of various third-party vendors (Enterprise Resource Planning [ERP], Application Service Provider [ASP], etc.) in their recovery efforts. In such cases, the information about the third-party vendor is just as critical in your business or technology recovery. When you need to make use of such resources, it is beneficial, if not crucial, to make inquiries into the third party’s operations prior to the implementation of its product or services.

In the real world, the disaster recovery and/or business continuity professional has to integrate the vendor’s information into the business unit’s continuity plan. If a critical path in your DR plan depends on the involvement of a third-party vendor, you can’t get your operation up and running if that third-party vendor isn’t prepared to assist you. For example, suppose that processing loans is the bread and butter of your business, and your business relies on credit bureau reports to process loans. In this scenario, you need to ensure that if your organization experiences an outage, you will still receive these reports so that your company can continue to conduct business.

The vendor’s ability to recover from a failure will also affect how robust your recovery is. Although your recovery may be technically sound, you must be sure that you can conduct business. The same standards you apply to your own organization should apply to third-party vendors you do business with. They should be available to you to conduct business. The disaster recovery or business continuity coordinator should make the appropriate inquiries with vendors to ensure that they can support a DR scenario.

Awareness and Training Programs

Another important element of disaster recovery and business continuity planning is an awareness program. The business continuity or disaster recovery professional can meet with each business for tabletop exercises. These exercises are important, because they actually get the members of the business unit to sit down and think about a particular event and how first to prevent or mitigate it and then how to recover from it.

The event can be anything from a category 3 hurricane to workplace violence. Any work stoppage can potentially impede the progress of an organization’s recovery or resumption of services, and it is up to the management team to design or develop a plan of action or a business continuity plan. The business continuity or disaster recovery professional must facilitate this process and make the business unit aware that there are events that can bring the business to a grinding halt.

Backups

Backups may be used for complete system restoration, but they can also allow you to recover the contents of a mailbox, for example, or an “accidentally” deleted document. Backups can be extended to saving more than just digital data. Backup processes can include the backup of specifications and configurations, policies and procedures, equipment, and data centers.

However, if the backup is not good or is too old, or the backup media is damaged, it will not fix the problem. Just having a backup procedure in place does not always offer adequate protection.

Many organizations can no longer depend on traditional backup processes—doing an offline backup is unacceptable, doing an online backup would unacceptably degrade system performance, and restoring from a backup would take so much time that the organization could not recover. Such organizations are using alternatives to traditional backups, such as redundant systems and cloud services.

Backup systems and processes, therefore, reflect the availability needs of an organization as well as its recovery needs.

Traditional Backup Methods

In the traditional backup process, data is copied to backup media, primarily tape, in a predictable and orderly fashion for secure storage both onsite and offsite.

Backup media can thus be made available to restore data to new or repaired systems after failure. In addition to data, modern operating systems and application configurations are also backed up.

This provides faster restore capabilities and occasionally may be the only way to restore systems where applications that support data are intimately integrated with a specific system.

Backup Types

There are several standard types of backups:

Full

Copy

Incremental

Differential

Full Backups

A full backup backs up all selected data, whether or not it has changed since the last backup. The definition of a full backup varies on different systems. On some systems it includes critical operating system files needed to rebuild a system completely; on other systems it backs up only the user data.

Copy Backups

Data is copied from one disk to another.

Incremental Backups

When data is backed up, the archive bit on a file is turned off. When changes are made to the file, the archive bit is set again. An incremental backup uses this information to back up only files that have changed since the last backup. This backup turns the archive bit off again, and the next incremental backup backs up only the files that have changed since the last incremental backup. This backup type saves time, but it means that the restore process will involve restoring the last full backup and every incremental backup made after it.

[Figure: Restoring from an incremental backup requires applying the last full backup and every incremental backup made after it.]
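The archive-bit behavior described above can be simulated in a few lines. This is a minimal sketch, not how any particular backup product works: the in-memory `files` dictionary and all function names are hypothetical, and file deletions are ignored.

```python
def modify(files, name, data):
    """Changing (or creating) a file sets its archive bit."""
    files[name] = {"data": data, "archive": True}

def full_backup(files):
    """Copy every file and clear every archive bit."""
    saved = {name: f["data"] for name, f in files.items()}
    for f in files.values():
        f["archive"] = False
    return saved

def incremental_backup(files):
    """Copy only files whose archive bit is set, then clear the bit."""
    saved = {}
    for name, f in files.items():
        if f["archive"]:
            saved[name] = f["data"]
            f["archive"] = False
    return saved

def restore(full, incrementals):
    """Apply the full backup, then every incremental in order."""
    state = dict(full)
    for inc in incrementals:
        state.update(inc)
    return state

files = {}
modify(files, "a.txt", "v1")
modify(files, "b.txt", "v1")
sunday = full_backup(files)
modify(files, "a.txt", "v2")
monday = incremental_backup(files)    # contains only a.txt
modify(files, "b.txt", "v2")
tuesday = incremental_backup(files)   # contains only b.txt
print(restore(sunday, [monday, tuesday]))  # {'a.txt': 'v2', 'b.txt': 'v2'}
```

Leaving out Monday's incremental would restore the stale `a.txt`, which is why every incremental in the chain is needed.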

Differential Backups

Like an incremental backup, a differential backup backs up only files with the archive bit set, that is, files that have changed since the last backup that reset the bit. Unlike an incremental backup, however, a differential backup does not reset the archive bit.

Each differential backup backs up all files that have changed since the last backup that reset the bits. Using this strategy, a full backup is followed by differential backups.

A restore consists of restoring the full backup and then only the last differential backup made. This saves time during the restore, but, depending on your system, creating differential backups takes longer than creating incremental backups.

[Figure: Restoring from a differential backup requires applying only the last full backup and the last differential backup.]
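The difference in restore effort between the two strategies can be sketched as a function that lists which backups must be applied. The function name and the list-of-strings representation are hypothetical, and the sketch assumes that only one strategy (all incremental or all differential) is used after each full backup.

```python
def backups_to_restore(history):
    """Return the indexes of the backups needed to restore the latest state.

    history lists backup types in the order taken, for example
    ["full", "differential", "differential"].
    """
    # Position of the most recent full backup.
    last_full = len(history) - 1 - history[::-1].index("full")
    later = list(range(last_full + 1, len(history)))
    if not later:
        return [last_full]
    if history[later[-1]] == "differential":
        # A differential holds everything changed since the full backup.
        return [last_full, later[-1]]
    # Incrementals must all be applied, in order.
    return [last_full] + later

print(backups_to_restore(["full", "differential", "differential", "differential"]))  # [0, 3]
print(backups_to_restore(["full", "incremental", "incremental", "incremental"]))     # [0, 1, 2, 3]
```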

Backup Rotation Strategies

In the traditional backup process, old backups are usually not immediately replaced by the new backup. Instead, multiple previous copies of backups are kept. This ensures recovery should one backup tape set be damaged or otherwise be found not to be good. Two traditional backup rotation strategies are Grandfather-Father-Son (GFS) and Tower of Hanoi.

GFS Backup Strategy

In the GFS rotation strategy, a backup is made to separate media each day.

Each Sunday a full backup is made, and on each other day of the week an incremental backup is made.

The Sunday backups are kept for a month, and the current week’s incremental backups are also kept.

On the first Sunday of the month, a new tape or disk is used to make a full backup. The previous full backup becomes the last full backup of the prior month and is re-labeled as a monthly backup.

Weekly and daily tapes are rotated as needed, with the oldest being used for the current backup.

Thus, on any given day of the month, that week’s backup is available, as well as the previous four or five weeks’ full backups, along with the incremental backups taken each day of the preceding week. If the backup scheme has been in use for a while, prior months’ backups are also available.
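A rough sketch of the GFS schedule as a date lookup follows. The media labels (`weekly-1`, `daily-Tue`, and so on) and the function name are hypothetical; real backup software keeps its own media catalog and labeling scheme.

```python
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def gfs_plan(day: date) -> tuple:
    """Return (backup type, media label, note) for the given day."""
    if day.weekday() == 6:  # Sunday: full backup
        n = (day.day - 1) // 7 + 1  # 1st to 5th Sunday of the month
        if n == 1:
            note = "use new media; relabel the previous full backup as monthly"
        else:
            note = "reuse the oldest weekly media"
        return ("full", f"weekly-{n}", note)
    # Any other day: incremental backup on that weekday's media.
    label = f"daily-{DAY_NAMES[day.weekday()]}"
    return ("incremental", label, "reuse this weekday's media")

print(gfs_plan(date(2024, 9, 1))[:2])   # ('full', 'weekly-1')
print(gfs_plan(date(2024, 9, 3))[:2])   # ('incremental', 'daily-Tue')
```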

Note:

No backup strategy is complete without plans to test backup media and backups by doing a restore. If a backup is unusable, it’s worse than having no backup at all, because it has lured users into a sense of security. Be sure to add the testing of backups to your backup strategy, and do this on a test system.
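One way to automate such a test is to restore into a scratch location and compare checksums. The sketch below uses only the Python standard library; the file names are hypothetical, and the plain file copy stands in for whatever restore step your backup software actually provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file's contents so two copies can be compared."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(original: Path, backup: Path, scratch_dir: Path) -> bool:
    """Restore `backup` into scratch_dir and compare it with `original`."""
    restored = scratch_dir / original.name
    shutil.copy2(backup, restored)  # stand-in for the real restore step
    return sha256(restored) == sha256(original)

def demo() -> tuple:
    """Check a good backup, then a damaged one, in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        scratch = root / "test-system"
        scratch.mkdir()
        original = root / "ledger.csv"
        original.write_text("id,amount\n1,100\n")
        backup = root / "ledger.csv.bak"
        shutil.copy2(original, backup)
        good = verify_restore(original, backup, scratch)
        backup.write_text("id,amount\n1,1")  # simulate damaged media
        bad = verify_restore(original, backup, scratch)
        return good, bad

print(demo())  # (True, False)
```

In practice the original may no longer exist when you need the backup, so comparing against a checksum recorded at backup time is a reasonable variation.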

The Tower of Hanoi Backup Strategy

The Tower of Hanoi strategy is based on a game played with three poles and a number of rings. The object is to move the rings from their starting point on one pole to another pole.

However, the rings are of different sizes, and you are not allowed to have a ring on top of one that is smaller than itself. To accomplish the task, a certain order must be followed.

Consider a simple version of the Tower of Hanoi, in which you are given three pegs, one of which has three rings stacked on it from largest at the bottom to smallest at the top. Call these rings A (small), B (medium), and C (large). You need to move the rings to the right-hand peg. How do you solve this puzzle?

Tower of Hanoi Solution

The solution is to move

A to the right-hand peg,

then B to the middle peg,

A on top of B on the middle peg,

then C to the right-hand peg,

then A to the now-empty left-hand peg,

B on top of C on the right-hand peg,

and finally A on top of B to complete the stack on the right-hand peg.

The rings were moved in this order: A B A C A B A. If you solve this puzzle with four rings labeled A through D, your moves would be A B A C A B A D A B A C A B A.

Five rings are solved with the sequence A B A C A B A D A B A C A B A E A B A C A B A D A B A C A B A.

As you can see, there is a recursive pattern here that looks complicated but is actually very repetitive. Small children solve this puzzle all the time.
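The recursion is short enough to write out: the move list for n rings is the move list for n - 1 rings, then the largest ring, then the n - 1 list again. The function name below is our own.

```python
def hanoi_moves(n: int) -> list:
    """Return the ring moved at each step for n rings (A is the smallest)."""
    if n == 0:
        return []
    smaller = hanoi_moves(n - 1)
    ring = chr(ord("A") + n - 1)  # the largest ring in this subproblem
    return smaller + [ring] + smaller

print(" ".join(hanoi_moves(3)))  # A B A C A B A
print(len(hanoi_moves(5)))       # 31
```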

Tower of Hanoi for Backups

Using the same strategy with backup tapes requires rotating multiple tapes in this same order. Each backup is a full backup, and multiple backups are made to each tape. Since each tape’s backups are not sequential, the chance that the loss of or damage to one tape will destroy all backups for the current period is small. A fairly current backup is always available on another tape. This backup method gives you as many different restore options as you have tapes.

Consider a three-tape Tower of Hanoi backup scheme and its similarity to the sequence of the game. On day one, you perform a full backup to tape A. On day two, your full backup goes to tape B. On day three, you back up to tape A again, and on day four you introduce tape C, which hasn’t been used yet. At this point, you now have three tapes containing full backups for the last three days. That’s pretty good coverage. On days 5, 6, and 7, you use tapes A, B, and A again, respectively. This gives you three tapes containing full backups that you can rely on, even if one tape is damaged.
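The tape for any given day can be computed rather than looked up. The sketch below assumes a common convention: the tape index is the number of trailing zero bits in the day number, and the last tape also takes every day that would otherwise call for a tape beyond the set. The function names are hypothetical.

```python
def tape_for_day(day: int, tapes: int = 3) -> str:
    """Return the tape letter for a 1-based day number."""
    if day < 1:
        raise ValueError("day numbers start at 1")
    trailing_zeros = (day & -day).bit_length() - 1
    # Assumption: the last tape absorbs any higher positions.
    return chr(ord("A") + min(trailing_zeros, tapes - 1))

def latest_backups(days: int, tapes: int = 3) -> dict:
    """Map each tape to the most recent day it was written."""
    latest = {}
    for day in range(1, days + 1):
        latest[tape_for_day(day, tapes)] = day
    return latest

print("".join(tape_for_day(d) for d in range(1, 8)))  # ABACABA
print(latest_backups(7))  # {'A': 7, 'B': 6, 'C': 4}
```

After day 7, the three tapes hold full backups from days 7, 6, and 4, so losing any one tape still leaves a recent restore point.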

Use More Tapes

For additional coverage, you can use a four-tape or five-tape Tower of Hanoi scheme.

You would perform the same rotation as in the game, either A B A C A B A D A B A C A B A in a four-tape system or A B A C A B A D A B A C A B A E A B A C A B A D A B A C A B A in a five-tape system.

Higher numbers of tapes can be used as well, but the system is complicated enough that human error can become a concern. Backup software can assist by prompting the backup operator for the correct tape if it is configured for a Tower of Hanoi scheme.

Backup Alternatives and Newer Methodologies

Many backup strategies are available for use today as alternatives to traditional tape backups:

Hierarchical Storage Management (HSM)

Windows shadow copy

Online backup or data vaulting

Dedicated backup networks

Disk-to-disk (D2D) technology

Hierarchical Storage Management (HSM)

HSM is more of an archiving system than a strict “backup” strategy, but it is a valid way of preserving data that can be considered as part of a data retention strategy. Long available for mainframe systems, it is also available on Windows.

HSM is an automated process that moves the least-used files to progressively more remote data storage. In other words, frequently used and changed data is stored online on high speed, local disks. As data ages (as it is not accessed and is not changed), it is moved to more remote storage locations, such as disk appliances or even tape systems.

However, the data is still cataloged and appears readily available to the user. If accessed, it can be automatically made available—it can be moved to local disks, it can be returned via network access, or, in the case of offline storage, operators can be prompted to load the data. Online services or cloud storage can be used for the more remote data storage, and this approach is commonly found in e-mail archiving solutions.
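The aging rule can be sketched as a simple tiering function. The thresholds (30 and 180 days) and the tier names are hypothetical assumptions chosen for illustration; actual HSM products apply their own configurable policies.

```python
def storage_tier(days_since_access: int,
                 thresholds=((30, "local disk"), (180, "disk appliance"))) -> str:
    """Pick a storage tier from how long a file has gone unused."""
    for limit, tier in thresholds:
        if days_since_access < limit:
            return tier
    return "tape"  # the most remote tier in this sketch

for age in (3, 90, 400):
    print(age, storage_tier(age))
```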

Windows Shadow Copy

This Windows service takes a snapshot of a working volume, and then a normal data backup can be made that includes open files. The shadow copy service doesn’t make a copy; it just fixes a point in time and then places subsequent changes in a hidden volume.

When a backup is made, closed files and disk copies of open files are stored along with the changes. When files are stored on a Windows system, the service runs in the background, constantly recording file changes.

If a special client is loaded, previous versions of a file can be accessed and restored by any user who has authorization to read the file. Imagine that Alice deletes a file on Monday, or Bob makes a mistake in a complex spreadsheet design on Friday. On the following Tuesday, each can obtain their old versions of the file on their own, without a call to the help desk, and without IT getting involved.

Online Backup or Data Vaulting

An individual or business can contract with an online service that automatically and regularly connects to a host or hosts and copies identified data to an online server.

Typically, arrangements can be made to back up everything, data only, or specific data sets.

Payment plans are based both on volume of data backed up and on the number of hosts, ranging up to complete data backups of entire data centers.

Dedicated Backup Networks

An Ethernet LAN can become a backup bottleneck if disk and tape systems are provided in parallel and exceed the LAN’s throughput capacity. Backups also consume bandwidth and thus degrade performance for other network operations.

Dedicated backup networks are often implemented using a Fibre Channel storage area network (SAN) or Gigabit Ethernet network and Internet Small Computer Systems Interface (iSCSI). iSCSI and Gigabit Ethernet can provide wire-speed data transfer. Backup is to servers or disk appliances on the SAN.

Disk-to-Disk (D2D) Technology

A slow tape backup system may be a bottleneck, as servers may be able to provide data faster than the tape system can record it. D2D servers don’t wait for a tape drive, and disks can be provided over high-speed dedicated backup networks, so both backups and restores can be faster.
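The bottleneck effect can be estimated with simple arithmetic: the backup runs no faster than its slowest stage. The throughput figures below are hypothetical round numbers, not measurements of any real device, and decimal units (1 GB = 1000 MB) are assumed.

```python
def backup_hours(data_gb: float, stage_mb_per_s: list) -> float:
    """Estimate backup duration from the slowest stage in the data path."""
    bottleneck = min(stage_mb_per_s)        # slowest of source, network, target
    seconds = data_gb * 1000 / bottleneck   # assumes 1 GB = 1000 MB
    return seconds / 3600

# Hypothetical path: 400 MB/s source disks, 110 MB/s shared LAN, 80 MB/s tape drive.
print(round(backup_hours(2000, [400, 110, 80]), 2))   # 6.94
# Hypothetical D2D path over a dedicated network where every stage sustains 400 MB/s.
print(round(backup_hours(2000, [400, 400, 400]), 2))  # 1.39
```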

D2D can use traditional network-attached storage (NAS) systems supported by Ethernet connectivity and either the Network File System (NFS on Unix) protocol or Common Internet File System (CIFS on Windows) protocol, or dedicated backup networks can be provided for D2D.

Backup Benefits

Many benefits can be obtained from backing up as a regular part of IT operations:

Cost savings: It takes many people-hours to reproduce digitally stored data. The cost of backup software and hardware is a fraction of this cost.

Productivity: Users cannot work without data. When data can be restored quickly, productivity is maintained.

Increased security: When backups are available, the impact of an attack that destroys or corrupts data is lessened. Data can be replaced or compared to ensure its integrity.

Simplicity: When centralized backups are used, no user needs to make a decision about what to back up.

Backup Policy

The way to ensure that backups are made and protected is to have an enforceable and enforced backup policy.

The policy should identify the goals of the process, such as frequency, the necessity of onsite and offsite storage, and requirements for formal processes, authority, and documentation.

Procedures can then be developed, approved, and used that interpret policy in light of current applications, data sets, equipment, and the availability of technologies. Several topics should be specifically detailed in the policy.

Administrative Authority

Designate who has the authority to physically start the backup, transport and check out backup media, perform restores, sign off on activity, and approve changes in procedures. This should also include guidelines for how individuals are chosen.

Recommendations should include separating duties between backing up and restoring, between approval and activity, and even between systems. (For example, those authorized to back up directory services and password databases should be different from those given authority to back up databases.) This allows for role separation, a critical security requirement, and the delegation of many routine duties to junior IT employees.

What to Back Up

Designate which information should be backed up.

Should system data or only application data be backed up?

What about configuration information, patch levels, and version levels?

How will applications and operating systems be replaced?

Are original and backup copies of their installation disks provided for?

These details should be specified.

Scheduling

Identify how often backups should be performed.

Monitoring

Specify how to ensure the completion and retention of backups.

Storage for Backup Media

Specify which of the many ways to store backup media are appropriate.

Is media stored both onsite and offsite?

What are the requirements for each type of storage? For example, are fireproof vaults or cabinets available? Are they kept closed? Where are they located?

Onsite backup media needs to be available, but storing backups near the original systems may be counterproductive. A disaster that damages the original system might take out the backup media as well.

Type of Media and Process Used

Specify how backups are made.

How many backups are made, and of what type?

How often are they made, and how long are they kept?

How often is backup media replaced?

High Availability

Not too long ago, most businesses closed at 5 p.m. Many were not open on the weekends, holidays were observed by closings or shortened hours, and few of us worried when we couldn’t read the latest news at midnight or shop for bath towels at 3 a.m. That’s not true anymore. Even ordinary businesses maintain computer systems around the clock, and their customers expect instant gratification at any hour. Somehow, since computers and networks are devices and not people, we expect them just to keep working without breaks, or sleep.

Of course, they do break. Procedures, processes, software, and hardware that enable system and network redundancy are a necessary part of operations. However, they serve another purpose as well. Redundancy ensures the integrity and availability of information.

Redundancy

What effect does system redundancy have? Calculations including the mean time to repair (how long it takes to replace a failed component) and uptime (the percentage of time a system is operational) can show the results of having versus not having redundancy built into a computer system or a network. However, the importance of these figures depends on the needs and requirements of the system.
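A commonly used form of this calculation expresses availability as the share of time a component is working. The sketch below brings in mean time between failures (MTBF), which the text above does not mention, and assumes that redundant copies fail independently; the figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a single component is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def with_redundancy(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent units suffices."""
    return 1 - (1 - single) ** copies

one = availability(1000, 10)   # hypothetical: fails every 1000 h, 10 h to repair
two = with_redundancy(one, 2)
print(f"{one:.4%}  {two:.4%}")
```

With these figures a single unit is available about 99.0% of the time, while a redundant pair is available about 99.99% of the time.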

Most desktop systems, for example, do not require built-in redundancy; if one fails and our work is critical, we simply obtain another desktop system. The need for redundancy is met by another system. In most cases, however, we do something else while the system is fixed. Other systems, however, are critical to the survival of a business or perhaps even of a life. These systems need either built-in hardware redundancy, support alternatives that can keep their functions intact, or both.

Note:

Critical systems are those systems a business must have, and without which it would be critically damaged, or whose failure might be life-threatening. Which systems are critical to a business must be determined by the business. For some it will be their e-commerce site, for others the billing system, and for others their customer information databases. Everyone recognizes the critical nature of air traffic control systems and life support systems used in hospitals.

Two methods can be used to evaluate where and how much redundancy is needed. The first, more traditional method is to weigh the cost of providing redundancy against the cost of downtime without redundancy. These costs can be calculated and compared directly. (Is the cost of downtime greater or less than the cost of redundancy?) The second method, which is harder to calculate but is increasingly easy to justify, is to decide based on the likelihood that customers will gravitate to the organization that can provide the best availability of service. This, in turn, is based on the increasing demands that online services, unlike traditional services, be available 24×7×365. High availability can be a selling point that directly leads to more business. Indeed, some customers will demand it.
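The first method amounts to a direct comparison, sketched below. The function name, parameters, and dollar figures are hypothetical, and the sketch ignores the harder-to-quantify effect of lost customers covered by the second method.

```python
def redundancy_pays_off(annual_redundancy_cost: float,
                        downtime_hours_per_year: float,
                        cost_per_downtime_hour: float,
                        downtime_hours_with_redundancy: float = 0.0) -> bool:
    """True when the downtime cost avoided exceeds the cost of redundancy."""
    hours_avoided = downtime_hours_per_year - downtime_hours_with_redundancy
    return hours_avoided * cost_per_downtime_hour > annual_redundancy_cost

# Hypothetical: redundancy costs $50,000 per year; an hour of downtime costs $5,000.
print(redundancy_pays_off(50_000, 20, 5_000))  # True: $100,000 of downtime avoided
print(redundancy_pays_off(50_000, 4, 5_000))   # False: only $20,000 avoided
```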

There are automated methods for providing system redundancy, such as hardware fault tolerance, clustering, and network routing, and there are operational methods, such as component hot-swapping and standby systems.

Automated Redundancy Methods

It has become commonplace to expect significant hardware redundancy and fault tolerance in server systems. A wide range of components are either duplicated within the systems or effectively duplicated by linking systems into a cluster. Typical components and techniques include the following:

Clustering

Fault tolerance

Redundant System Slot (RSS)

Cluster in a box

High-availability design

Internet network routing

Clustering

Entire computers or systems are duplicated. If a system fails, operation automatically transfers to the other systems.

Clusters may be set up as active-standby, in which case one system is live and the other is idle, or active-active, in which case multiple systems are kept in sync and dynamic load sharing is possible.

Active-active is ideal, as no system stands idle and the total capacity of all systems can always be utilized. If there is a system failure, fewer systems carry the load. When the failed system is replaced, load balancing readjusts.

Clustering does have its downside. When active-standby is used, duplication of systems is expensive. These active-standby systems may also take seconds for the failover to occur, which is a long time when systems are under heavy loads. Active-active systems, however, may require specialized hardware and additional, specialized administrative knowledge and maintenance.
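The difference between the two cluster modes can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular cluster product works: the `Node` class, the priority-order failover rule, and the even split of load are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def active_standby_owner(nodes: list[Node]) -> str:
    """Active-standby: the first healthy node in priority order does all the work."""
    for node in nodes:
        if node.healthy:
            return node.name
    raise RuntimeError("no healthy node left in the cluster")


def active_active_shares(nodes: list[Node], total_load: float) -> dict[str, float]:
    """Active-active: the load is split evenly across every healthy node."""
    healthy = [node for node in nodes if node.healthy]
    if not healthy:
        raise RuntimeError("no healthy node left in the cluster")
    share = total_load / len(healthy)
    return {node.name: share for node in healthy}


cluster = [Node("a"), Node("b"), Node("c")]
print(active_standby_owner(cluster))        # node "a" carries everything
print(active_active_shares(cluster, 300))   # each node carries a third

cluster[0].healthy = False                  # simulate a failure of node "a"
print(active_standby_owner(cluster))        # work moves to the next node
print(active_active_shares(cluster, 300))   # survivors each carry more
```

In the active-active case the surviving nodes must each absorb a larger share, so the cluster needs enough headroom that the remaining systems are not overloaded when one fails.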

Fault Tolerance

Components may have backup systems or parts of systems that allow them to recover from errors or to survive in spite of them.

For example, fault-tolerant CPUs use multiple CPUs running in lockstep, each using the same processing logic. In the typical case, three CPUs are used and the results from all CPUs are compared. If one CPU produces results that don’t match those of the other two, it is considered to have failed and is no longer consulted until it is replaced.
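The comparison step can be illustrated in software with a simple majority voter. This is only a sketch of the voting idea; real lockstep CPUs perform the comparison in hardware, and the function name and return shape here are hypothetical.

```python
from collections import Counter


def majority_vote(results: list[int]) -> tuple[int, list[int]]:
    """Return the majority result and the indexes of units that disagreed.

    Raises RuntimeError when no value is produced by more than half the units.
    """
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority: cannot tell which unit has failed")
    failed_units = [i for i, result in enumerate(results) if result != value]
    return value, failed_units


# Three units compute the same operation; unit 2 returns a wrong answer.
value, failed = majority_vote([42, 42, 41])
print(value, failed)
```

With three units, a single faulty result is outvoted and the faulty unit is identified. If all three disagree, the voter cannot pick a winner, which is why the function raises an error instead of guessing.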

Another example is the fault tolerance built into Microsoft’s NTFS file system. If the system detects a bad spot on a disk during a write, it automatically marks it as bad and writes the data elsewhere. The logic of both strategies is to isolate the failure and continue operating. Meanwhile, the system can raise alerts and record error messages to prompt maintenance.

Redundant System Slot (RSS)

Entire hot-swappable computer units are provided in a single unit.

Each system has its own operating system and bus, but all systems are connected and share other components.

Like clustered systems, RSS systems can be either active-standby or active-active. RSS systems exist as a unit, and systems cannot be removed from their unit and continue to operate.

Cluster in a Box

Two or more systems are combined in a single unit.

The difference between these systems and RSS systems is that each unit has its own CPU, bus, peripherals, operating system, and applications.

Components can be hot-swapped, which is the advantage of a cluster in a box over a traditional cluster.

High Availability Design

Two or more complete components are placed on the network. One component may serve as a standby system, in which case traffic is routed to it if the primary fails. Alternatively, each component may serve as an active node, in which case load balancing routes traffic to multiple systems sharing the load, and if one fails, traffic is routed only to the remaining functional systems.
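The load-balancing case can be sketched as a round-robin router that skips failed systems. The `LoadBalancer` class, its method names, and the backend names are hypothetical; production load balancers also perform health checks themselves, which this sketch leaves to the caller.

```python
class LoadBalancer:
    """Round-robin router that sends traffic only to backends marked healthy."""

    def __init__(self, backends: list[str]) -> None:
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def route(self) -> str:
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


balancer = LoadBalancer(["web1", "web2", "web3"])
print([balancer.route() for _ in range(3)])  # all three share the load

balancer.mark_down("web2")                   # simulate a failed system
print([balancer.route() for _ in range(4)])  # traffic goes only to web1 and web3
```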

A High Availability Network Design Supporting a Web Site

Multiple ISP backbones are available, and duplicate firewalls, load-balancing systems, application servers, and database servers support a single web site.

Internet Network Routing

In an attempt to achieve redundancy for Internet-based systems similar to that of the Public Switched Telephone Network (PSTN), new architectures for Internet routing are adding or proposing a variety of techniques, such as these:

Reserve capacity

System and geographic diversity

Size limits

Dynamic restoration switching

Self-healing protection switching

Fast rerouting (which reverses traffic at the point of failure so that it can be directed to an alternative route)

RSVP-based backup tunnels (where a node adjacent to a failed link signals failure to upstream nodes, and traffic is thus rerouted around the failure)

Two-path protection (in which sophisticated engineering algorithms develop alternative paths between every node)

Two examples of such architectures are Multiprotocol Label Switching (MPLS), which integrates IP and data-link layer technologies to introduce sophisticated routing control, and Automatic Protection Switching (APS), which provides the fast restoration times that modern applications, such as voice and streaming media, require.
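The common idea behind these rerouting techniques, finding an alternative path once a link is known to have failed, can be shown with a small graph search. This is a teaching sketch only: it uses a plain breadth-first search over a hypothetical five-node topology and does not implement MPLS, RSVP signaling, or any real routing protocol.

```python
from collections import deque


def shortest_path(links, source, target, failed=frozenset()):
    """Return the shortest hop-count path from source to target, or None.

    `links` is an iterable of undirected (a, b) pairs; any pair listed in
    `failed` (in either direction) is treated as unusable.
    """
    adjacency = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology with two routes from A to D.
links = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "E"), ("E", "D")]
print(shortest_path(links, "A", "D"))                       # primary route
print(shortest_path(links, "A", "D", failed={("B", "D")}))  # route around the failure
```

Because the topology contains two independent routes, the failure of one link leaves a working path. With no alternative route the function returns `None`, which is the situation path diversity is intended to prevent.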

Operational Redundancy Methods

In addition to technologies that provide automated redundancy, there are many processes that help you quickly get your systems up and running if a problem occurs. These include:

Standby systems

Hot-swappable components

Standby Systems

Complete or partial systems are kept ready. Should a system, or one of its subsystems, fail, the standby system can be put into service. There are many variations on this technique.

Some clusters are deployed in active-standby state, so the clustered system is ready to go but idle. To recover from a CPU or other major system failure quickly, a hard drive might be moved to another, duplicate, online system.

To recover quickly from the failure of a database system, a duplicate system complete with database software may be kept ready. The database is periodically updated by replication or by export and import functions. If the main system fails, the standby system can be placed online, though it may be lacking some recent transactions.
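The gap in recent transactions can be made concrete with a short sketch. It assumes, hypothetically, that each transaction has a commit timestamp and that the time of the last successful replication is known; the function name and the sample times are illustrative only.

```python
from datetime import datetime


def transactions_lost_on_failover(transaction_times, last_sync, failure_time):
    """Return commit times that reached the primary but not the standby.

    A transaction is missing from the standby if it committed after the
    last replication and no later than the moment the primary failed.
    """
    return [t for t in transaction_times if last_sync < t <= failure_time]


# Hypothetical day: the standby was last refreshed at 12:00, the primary failed at 12:45.
commits = [
    datetime(2024, 1, 15, 11, 50),
    datetime(2024, 1, 15, 12, 10),
    datetime(2024, 1, 15, 12, 40),
]
lost = transactions_lost_on_failover(
    commits,
    last_sync=datetime(2024, 1, 15, 12, 0),
    failure_time=datetime(2024, 1, 15, 12, 45),
)
print(len(lost), "transactions missing from the standby")
```

The worst-case loss is bounded by the interval between updates, so shortening the replication interval reduces the number of transactions that must be re-entered after the standby is placed online.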

Hot-Swappable Components

Many hardware components can now be replaced without shutting down systems. Hard drives, network cards, and memory are examples of hardware components that can be added or replaced while the system is running.

Modern operating systems detect the addition of these devices on the fly, and operations continue with minor, if any, service outages.

In a RAID array, for example, drive failure may be compensated for by the built-in redundancy of the array. If the failed drive can be replaced without shutting down the system, the array will return to its prefailure state. Interruptions in service will be nil, though performance may suffer depending on the current load.
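One common form of this built-in redundancy is parity, in which an extra block stores the XOR of the data blocks so that any single missing block can be recomputed. The sketch below illustrates only that arithmetic, on hypothetical four-byte blocks; it assumes a parity-based layout and is not a model of any specific RAID level or controller.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three data drives plus one parity drive (hypothetical contents).
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Drive 1 fails: rebuild its contents from the surviving drives and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

While the failed drive is missing, every read of its data requires this extra computation across the surviving drives, which is one reason performance may suffer until the replacement drive has been rebuilt.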

Summary

In this chapter, we covered the four related business resumption strategies that are all necessary for recovery from incidents, outages, and disasters that result in service or data loss: disaster recovery, business continuity planning, backups, and high availability. Together, these form the core of a strategy to keep the organization’s information infrastructure operational.

Here in summary are the principal points, roles, and responsibilities of a good disaster recovery and business continuity program:

Develop and maintain disaster recovery and business continuity plans for all your organization’s enterprise technologies.

Schedule and oversee disaster recovery rehearsals for all enterprise systems.

Ensure disaster awareness by planning and conducting awareness programs, hazard fairs, lunch-and-learn sessions, and other informative events and materials.

Activate the plan.

Ensure community involvement by participating in local community disaster mitigation and planning initiatives and professional groups.

The disaster recovery and business continuity process is cyclical and must be maintained for it to stay current with the needs of the organization and the technologies in the environment. Your plans must be updated and rehearsed regularly. Disaster recovery is vital to everyone.

Backups can be an important part of a recovery strategy. They play a role in the disaster recovery process by moving data from the primary site to the DR site, although real-time data replication approaches are replacing traditional tape shipments in modern DR plans. Backups are also necessary for recovering data in a traditional data center.

High availability architectures are the fourth leg of the table supporting service resiliency, to ensure that failure of one system or component of a service doesn’t cause that service to fail.