DTGOV's Leasing Agreement Case Study
Case Study Example
The standard SLA for virtual servers in DTGOV’s leasing agreements defines a minimum IT
resource availability of 99.95%, which is tracked using two SLA monitors: one based on a
polling agent and the other based on a regular monitoring agent implementation.
SLA Monitor Polling Agent
DTGOV’s polling SLA monitor runs in the external perimeter network to detect physical server
timeouts. It is able to identify data center network, hardware, and software failures (with minute-
granularity) that result in physical server non-responsiveness. Three consecutive timeouts of 20-
second polling periods are required to declare IT resource unavailability.
Three types of events are generated:
• PS_Timeout – the physical server polling has timed out • PS_Unreachable – the physical server polling has consecutively timed out three times • PS_Reachable – the previously unavailable physical server becomes responsive to polling
again
SLA Monitoring Agent
The VIM’s event-driven API implements the SLA monitor as a monitoring agent to generate the
following three events:
• VM_Unreachable – the VIM cannot reach the VM • VM Failure – the VM has failed and is unavailable • VM_Reachable – the VM is reachable
The events generated by the polling agent have timestamps that are logged into an SLA event log
database and used by the SLA management system to calculate IT resource availability.
Complex rules are used to correlate events from different polling SLA monitors and the affected
virtual servers, and to discard any false positives for periods of unavailability.
Figures 8.8 and 8.9 show the steps taken by SLA monitors during a data center network failure
and recovery.
Figure 8.8 At timestamp = t1, a firewall cluster has failed and all of the IT resources in the data
center become unavailable (1). The SLA monitor polling agent stops receiving responses from
physical servers and starts to issue PS_timeout events (2). The SLA monitor polling agent starts
issuing PS_unreachable events after three successive PS_timeout events. The timestamp is now
t2 (3).
Figure 8.9 The IT resource becomes operational at timestamp = t3 (4). The SLA monitor polling
agent receives responses from the physical servers and issues PS_reachable events. The
timestamp is now t4 (5). The SLA monitoring agent did not detect any unavailability since the
communication between the VIM platform and physical servers was not affected by the failure
(6).
The SLA management system uses the information stored in the log database to calculate the
period of unavailability as t4 – t3, which affected all of the virtual servers in the data center.
Figures 8.10 and 8.11 illustrate the steps that are taken by the SLA monitors during the failure
and subsequent recovery of a physical server that is hosting three virtual servers (VM1, VM2,
VM3).
Figure 8.10 At timestamp = t1, the physical host server has failed and becomes unavailable (1).
The SLA monitoring agent captures a VM_unreachable event that is generated for each virtual
server in the failed host server (2a). The SLA monitor polling agent stops receiving responses
from the host server and issues PS_timeout events (2b). At timestamp = t2, the SLA monitoring
agent captures a VM_failure event that is generated for each of the failed host server’s three
virtual servers (3a). The SLA monitor polling agent starts to issue PS_unavailable events after
three successive PS_timeout events at timestamp = t3 (3b).
Figure 8.11 The host server becomes operational at timestamp = t4 (4). The SLA monitor polling
agent receives responses from the physical server and issues PS_reachable events at timestamp =
t5 (5a). At timestamp = t6, the SLA monitoring agent captures a VM_reachable event that is
generated for each virtual server (5b). The SLA management system calculates the unavailability
period that affected all of the virtual servers as t6 – t2.