This is a simple runbook to help your IT team get systems up and running in
the event of a disaster. It’s modular, so you can use the sections that are
useful for you and delete sections that aren’t relevant. This should be a
working document that covers your exact IT recovery practices. If you have
any specific or unique systems or business continuity practices, list them in
additional sections.
If you intend to complete recovery internally, fill in the runbook with instructions
designed to be understood by an IT professional who has no previous experience
of your systems or organisation.
Alternatively, if you have outsourced your IT disaster recovery operations,
consider writing for a non-technical senior management audience, such as an
MD or CEO.
This is intended to supplement, rather than replace, your full Business Continuity
Plan (BCP), which should be a larger and more comprehensive document that
includes:
Business Impact Analysis
Potential disruption types and the likelihood of them affecting you
Identification of the BCP Controller
The Crisis Management Team (including authority levels)
Evacuation of premises
Communication with customers and the media
Executive Summary
[Fill in with company information where appropriate]
[COMPANY NAME] currently has [NUMBER OF] virtual servers and
[NUMBER OF] physical servers all hosted from HQ at [LOCATION]. In the
event of a disaster and a loss of these systems the IT disaster recovery plan
is to recover these servers to [LOCATION] using “[THIRD PARTY]”.
The key IT staff required for this recovery are: [INTERNAL STAFF] and
“[THIRD PARTY]”. The servers will be recovered within [NUMBER OF
MINUTES/HOURS] of invocation.
Contents
Executive Summary
Invoking Disaster Recovery
Key Contacts (internal)
Key Contacts (external)
DR Call Tree – who calls who
Diagrams
Basic IT and Telecoms Diagram
Network Diagram (VISIO, Dia etc.)
Networking information
Communication
Communication with the rest of the workforce / business
Alternative Premises / Recovery Location
Software License and Registration Keys (if required)
Recovery
Recovery Types
How to Recover
Disaster Recovery Event Log
Disaster Recovery Runbook
Invoking Disaster Recovery
If a crisis or disaster has occurred, you must decide your course of action.
Begin by recording your cascading decision-making hierarchy. Who is
authorised to invoke the DR plan, and from whom must they get approval
to do so?
1. CIO
2. IT Director
3. IT Manager
4. IT Administrator
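The cascading hierarchy above can be sketched as a simple lookup: walk the list from the most senior role down and stop at the first person who can actually be reached. This is only an illustration. The role names are copied from the list above, and the function name and the idea of passing in a set of reachable roles are assumptions, not part of the runbook.

```python
from typing import Iterable, Optional

# Escalation order from the list above, most senior first.
AUTHORISATION_ORDER = ["CIO", "IT Director", "IT Manager", "IT Administrator"]


def first_available_authoriser(reachable: Iterable[str]) -> Optional[str]:
    """Return the most senior role that can currently be contacted, or None."""
    reachable_set = set(reachable)
    for role in AUTHORISATION_ORDER:
        if role in reachable_set:
            return role
    return None


if __name__ == "__main__":
    # Hypothetical situation: the CIO and IT Director cannot be reached.
    print(first_available_authoriser({"IT Manager", "IT Administrator"}))
```

If the function returns None, nobody on the list could be reached, which is a gap your own plan should cover explicitly.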
In a crisis situation, a common mistake is to spend too much time trying to fix
the minor issues that caused the IT outage at the expense of actually invoking DR.
Avoid this by specifying exactly what kinds of outage merit DR invocation, and
then follow those standards rigidly.
For instance, a communications supplier may promise that an issue will be fixed
“within 1 hour”. After 45 minutes, they may continue to say the issue will be resolved
“within 1 hour”. Unless pre-defined plans are followed, these rolling estimates can
delay the invocation of the DR plan and therefore the time until the business is up
and running again.
In this example, if an issue is not resolved within 1 hour, you should invoke your
DR plan and begin restoration. It is always possible to stop the restoration if
systems are available and working again.
Include both specific incidents that are likely to occur, such as flooding (if you are
in an at-risk area), and general incidents that cover the other issues you might
encounter.
Incident | Who is affected | Action
IT systems down | Individual departments | If not resolved within 30 mins, invoke DR plan
Flooding | Access to offices | Relocate to alternative office
SAN failure, the SAN attempts to repair itself but loses all data | All IT systems | Restore from backups
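One way to make the time limits above hard to argue with in the moment is to write them down as data and check them mechanically. The sketch below is a minimal illustration, not a prescribed tool. The incident names and limits are examples based on the text above, and the function and variable names are assumptions. Note that the supplier's latest estimate is deliberately not an input.

```python
from datetime import datetime, timedelta

# Example thresholds only; replace with the limits agreed in your own plan.
INVOCATION_THRESHOLDS = {
    "IT systems down": timedelta(minutes=30),
    "Communications outage": timedelta(hours=1),
}


def should_invoke_dr(incident: str, started_at: datetime, now: datetime) -> bool:
    """True once the outage has lasted at least as long as its agreed threshold."""
    threshold = INVOCATION_THRESHOLDS[incident]
    return now - started_at >= threshold


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    # 45 minutes into a communications outage: still inside the 1 hour limit.
    print(should_invoke_dr("Communications outage", start, start + timedelta(minutes=45)))
    # 60 minutes in: the limit has been reached, so DR should be invoked.
    print(should_invoke_dr("Communications outage", start, start + timedelta(minutes=60)))
```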
Key Contacts (internal)
Crisis Management Team

Name:
Role: Managing Director
Mobile telephone number:
Home telephone number:
Personal email address:

Name:
Role: Financial Director
Mobile telephone number:
Home telephone number:
Personal email address:

Name:
Role: Business Continuity Manager
Mobile telephone number:
Home telephone number:
Personal email address:

Department Heads

Name:
Role: Sales Manager
Mobile telephone number:
Home telephone number:
Personal email address:

Name:
Role: Production Manager
Mobile telephone number:
Home telephone number:
Personal email address:

Name:
Role: HR Director
Mobile telephone number:
Home telephone number:
Personal email address:

Name:
Role: Backup Manager
Mobile telephone number:
Home telephone number:
Personal email address:
In this section, list all staff with the specific skills for the recovery of IT. In larger
IT departments, this might be the ‘Infrastructure Team’. For smaller departments,
this might include all members of the IT team (as well as any 3rd party support
contacts).
Amend the points below to produce an exhaustive list of all the skills required
by key recovery personnel, including:
The skills to recover systems
Login details and encryption keys for recovery
    Knowledge of where encryption keys are stored securely (this
    may include an offsite 3rd party)
Key Personnel for IT Recovery

Name:
Role: CIO
Mobile telephone number:
Home telephone number:
Personal email address:

Name:
Role: IT Director
Mobile telephone number:
Home telephone number:
Personal email address:

Name:
Role: Backup Manager
Mobile telephone number:
Home telephone number:
Personal email address:
Key Contacts (external)
Record contact details (including email) for each of the following:
Disaster recovery service provider
    General support information
    Account manager
    Service delivery manager
Power supplier
Telecoms supplier
Telecoms supplier 2
Internet supplier
Internet supplier 2
Hardware / software / support supplier
Insurance
DR Call Tree – who calls who
[Diagram: call tree. Each box records a name and number.]
Basic IT and Telecoms Diagram
[Diagram: HQ (200 users) and three satellite offices (20 users each), with labels for Hong Kong, Paris and Databarracks. Sites are linked by MPLS connections and internet break-out connections. Site risks annotated on the diagram: flight path, flood potential, terrorist attack.]
Network Diagram (VISIO, Dia etc.)
Networking information
S IP A R
Email
Web Server
Terminal Services
MPLS
VoIP
Monitoring Services
Communication with the rest of the workforce / business
Cards for distribution to staff:
Create emergency cards like the examples below with instructions for staff to
follow in the event of a crisis situation. Include an emergency telephone number
and clear actions to report an emergency if discovered.
Underneath, record your emergency telephone number, the recorded emergency
instructions, and a schedule for how often the instructions should be updated.
Emergency telephone number and recorded message:
How often to update message:
Alternative Premises / Recovery Location
Software License and Registration Keys (if required)
A C Details D
S V S P
Features:
Users:
S V S P
Features:
Users:
Recovery Types
Below are some stock examples of common IT failures that warrant DR
invocation. The list is not exhaustive and you should add more examples
pertinent to the risk profile of your organisation.
Recovery Type | Actions | Who is responsible?
Single server loss | Create new VM – recover from disk-based onsite backups | IT Admin
SAN loss | Invoke disaster recovery. Recover servers at DR site and send staff to alternate location | IT Admin
Satellite site lost | Recover site at HQ and provide remote access | Crisis Management Team
HQ / major site / primary data centre lost | Invoke disaster recovery. Recover servers at DR site and send staff to alternate location | Crisis Management Team
IBM iSeries / AS400 / Power Systems | Contact service provider | Crisis Management Team and specialist 3rd party
Connectivity / power loss | Contact ISP and relocate to 2nd site / failover to secondary power source | Crisis Management Team

How to Recover
Use this table to outline the order in which you need to recover individual
servers, according to the business priority.

Priority Order | Server Name | IP Address | Description
1 | DC01 | 101.10.10.101 | Domain Controller
2 | Ex01 | 101.10.10.102 | Exchange server
3 | CRM01 | 101.10.10.103 | CRM Cluster
3 | CRM02 | 101.10.10.104 | CRM Cluster
3 | CRM03 | 101.10.10.105 | CRM Cluster
4 | Accounts01 | 101.10.10.106 | Accounts server
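Servers that share an order number, such as the three CRM nodes, can be recovered together as one wave. The sketch below shows one possible way to turn the recovery order table into waves. The server names and addresses are the sample values from the table, and the function name and tuple layout are assumptions made for illustration.

```python
from itertools import groupby

# Sample inventory from the table above: (order, server name, IP address).
# Deliberately listed out of order to show that the function sorts it.
SERVERS = [
    (3, "CRM02", "101.10.10.104"),
    (1, "DC01", "101.10.10.101"),
    (4, "Accounts01", "101.10.10.106"),
    (3, "CRM01", "101.10.10.103"),
    (2, "Ex01", "101.10.10.102"),
    (3, "CRM03", "101.10.10.105"),
]


def recovery_waves(servers):
    """Group servers into waves by order number, lowest number first."""
    ordered = sorted(servers, key=lambda s: (s[0], s[1]))
    return [
        (order, [name for _, name, _ in group])
        for order, group in groupby(ordered, key=lambda s: s[0])
    ]


if __name__ == "__main__":
    for order, names in recovery_waves(SERVERS):
        print(f"Wave {order}: {', '.join(names)}")
```

Recovering the domain controller first matters because later steps, such as logging on to the domain, depend on it being available.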
Recovery Templates
Below are four templates outlining the steps to recover from example disasters.
Use these templates to create your own recovery plans in accordance with your
own recovery processes.
Example 1 – DR invocation by recovering servers using an outsourced DR
service. This example outlines the steps to take when working with a 3rd party
disaster recovery provider.

Step | Actions | Who is responsible?
Step 1 | Identify issue and report to Crisis Management Team | IT Admin, IT Manager
Step 2 | Inform DR service provider of need to recover | Crisis Management Team
Step 3 | Authorise recovery (this may include third party verifying the recovery with key staff, determining the correct backup/snapshot to recover from and input of credentials/encryption keys for recovery) | Crisis Management Team
Step 4 | Start restoration (if not using a service provider, you may want to include specific step by step instructions here with screenshots or more advanced, recovery software specific instructions) | Service Provider
Step 5 | Confirm recovery of server(s) | Service Provider
Step 6 | Test recovered servers | Crisis Management Team
Step 7 | Confirm the recovery was successful with IT team (i.e. do servers start?) | Crisis Management Team
Step 8 | Functional testing with key users (small samples of users to test systems are working as expected and data is up to date) | Crisis Management Team
Step 9 | Confirm user testing successful | Crisis Management Team
Step 10 | Release DR systems to all users | Crisis Management Team
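For the "do servers start?" check in Step 7, a quick first test is whether each recovered server accepts a TCP connection on the port its main service listens on. The sketch below uses only the Python standard library. The function names are assumptions, and the host and port values in the usage example are placeholders. A successful connection shows that something is listening, not that the application is healthy, so it does not replace the functional testing in Step 8.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_servers(expected):
    """Map each server name to whether its expected port accepts connections.

    `expected` maps a server name to a (host, port) pair.
    """
    return {name: port_is_open(host, port) for name, (host, port) in expected.items()}


if __name__ == "__main__":
    # Placeholder values: 3389 is the default RDP port, 1433 the default SQL Server port.
    results = check_servers({
        "DC01": ("101.10.10.101", 3389),
        "CRM01": ("101.10.10.103", 1433),
    })
    for name, ok in results.items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```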
Example 2 – Invoking DR by restoring servers from backups. This
example outlines the steps to take when recovering from a disaster using
server backups.

Step | Actions | Who is responsible?
Step 1 | Set up backup software and media for restoration | IT Manager, IT Admin
Step 2 | Confirm correct backup version to recover from | IT Director
Step 3 | Restore server | IT Admin
Step 4 | Reboot server | IT Admin
Step 5 | Log on to domain | IT Admin
Step 6 | Check IP address | IT Admin
Step 7 | Reboot VM | IT Admin
Step 8 | Start database (restore databases separately if required) | IT Admin
Step 9 | Check domain and AD authentication | IT Admin
Step 10 | Reboot server | IT Admin
Step 11 | Check SQL instance is running using SQL Management Studio | IT Admin
Step 12 | Functional testing with key users (small samples of users to test systems are working as expected and data is up to date) | IT Manager, Operations Manager, Sales Manager
Step 13 | Confirm user testing | IT Manager
Step 14 | Release DR systems to all users | IT Director

Example 3 – Operating after loss of primary power. This example outlines
the steps to take when maintaining business continuity throughout a major
power outage.

Step | Actions | Who is responsible?
Step 1 | Identify scope of affected systems | IT Manager, IT Admin
Step 2 | Confirm issue and invoke DR procedures | Crisis Management Team
Step 3 | Notify users and update emergency telephone message | Crisis Management Team
Step 4 | Safely power down non-essential systems or entire infrastructure until resolution | Crisis Management Team
Step 5 | Verify interruptions to service delivery & resolve issues | Crisis Management Team
Step 6 | Restore primary power source | Crisis Management Team and Service Provider
Example 4 – Operating after loss of primary connectivity. This example
outlines the steps to take when maintaining business continuity throughout
the loss of your primary internet connection.
Step | Actions | Who is responsible?
Step 1 | Receive alert from automated monitoring | IT Manager, IT Admin
Step 2 | Confirm issue | IT Manager, IT Director
Step 3 | Contact ISP / Supplier | Crisis Management Team
Step 4 | Notify users and update emergency telephone message | Crisis Management Team
Step 5 | Send skeleton operation team to DR site | Crisis Management Team
Step 6 | Authorise home / remote working for non-essential users | Crisis Management Team
Step 7 | Make changes to bandwidth intensive applications / services | Crisis Management Team
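Steps 1 and 2 separate receiving an alert from confirming the issue. One common way to reduce false alarms is to escalate only after several consecutive failed connectivity probes. The sketch below illustrates that rule. The function name and the default of three failures are assumptions, not recommendations from this runbook, and the probe results would come from whatever monitoring tool you already use.

```python
from typing import Sequence


def needs_escalation(probe_results: Sequence[bool], failures_required: int = 3) -> bool:
    """True if the most recent `failures_required` probes all failed.

    Each entry in `probe_results` is True for a successful probe and False for
    a failed one, oldest first.
    """
    if len(probe_results) < failures_required:
        return False
    return not any(probe_results[-failures_required:])


if __name__ == "__main__":
    # One success followed by three failures: escalate to Step 2.
    print(needs_escalation([True, False, False, False]))
    # A single failed probe between successes: do not escalate yet.
    print(needs_escalation([True, False, True, True]))
```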
Disaster Recovery Event Log
As you work through the runbook, you should record every action in the event
log. Following a successful recovery, you can then use the event log as the
basis for the IT DR Report and use any issues encountered to improve
your plan.
Description of disaster:
Commencement date:
Date Disaster Recovery team mobilised:
Key activities by recovery team (brief description) | Date and time | Outcome | Follow-up action
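If you keep the event log electronically, a minimal approach is to append timestamped rows to a CSV file as each action is taken. The sketch below uses only the Python standard library. The column names and the function name are assumptions chosen for illustration, so adapt them to the fields in your own event log. Make sure the log is stored somewhere that stays available when the primary systems are down.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column names; align these with your own event log.
FIELDS = ["timestamp_utc", "activity", "recorded_by"]


def log_event(stream, activity, recorded_by, when=None):
    """Append one event to an open text stream, writing a header if it is empty."""
    when = when or datetime.now(timezone.utc)
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if stream.tell() == 0:
        writer.writeheader()
    writer.writerow({
        "timestamp_utc": when.isoformat(timespec="seconds"),
        "activity": activity,
        "recorded_by": recorded_by,
    })


if __name__ == "__main__":
    # An in-memory buffer stands in for a real file opened with open(path, "a", newline="").
    buffer = io.StringIO()
    log_event(buffer, "DR plan invoked", "IT Manager")
    log_event(buffer, "Restore of DC01 started", "IT Admin")
    print(buffer.getvalue())
```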