Computing (2019) 101:989–1014 https://doi.org/10.1007/s00607-018-0626-5
Instance launch-time analysis of OpenStack virtualization technologies with control plane network errors
Jawad Ahmed1,3 · Aqsa Malik1 · Muhammad U. Ilyas1,2 · Jalal S. Alowibdi2
Received: 9 October 2017 / Accepted: 8 May 2018 / Published online: 28 May 2018 © Springer-Verlag GmbH Austria, part of Springer Nature 2018
Abstract We analyzed the performance of a multi-node OpenStack cloud amid different types of controlled and self-induced network errors between controller and compute nodes on the control plane network. These errors included limited bandwidth, delays and packet losses of varying severity. This study compares the effects of network errors on spawning times of batches of instances created using three different virtualization technologies supported by OpenStack, i.e., Docker containers, Linux containers and KVM virtual machines. We identified minimum/maximum thresholds for bandwidth, delay and packet-loss rates below/beyond which instances fail to launch. To the authors’ best knowledge, this is the first comparative measurement study of its kind on OpenStack. The results will be of particular interest to designers and administrators of distributed OpenStack deployments.
Keywords OpenStack · Containers · Performance measurement
Muhammad U. Ilyas usman.ilyas@seecs.edu.pk; usman@ieee.org; milyas@uj.edu.sa
Jawad Ahmed jawad.ahmed@seecs.edu.pk; j.ahmed@unsw.edu.au
Aqsa Malik 13mseeamalik@seecs.edu.pk
Jalal S. Alowibdi jalowibdi@uj.edu.sa
1 Department of Electrical Engineering, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan
2 Department of Computer Science, Faculty of Computing and Information Technology, University of Jeddah, Jeddah, Saudi Arabia
3 School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney NSW 2052, Australia
123
990 J. Ahmed et al.
Mathematics Subject Classification 68M01 · 68M20 · 68M15
1 Introduction
1.1 Motivation and problem statement
Cloud computing is a paradigm in which pools of physical resources are shared by multiple users using virtualization technologies. The shared resources include storage, compute, bandwidth, software, development platforms, services, etc., and are available to users on demand. OpenStack is an open source cloud management system (CMS) that provides infrastructure-as-a-service (IaaS), i.e., it is used to launch and control virtual machine (VM) instances deployed on shared physical servers. The telecommunication industry is looking at virtualizing network functions in cloud environments. However, the layers of software abstraction needed to virtualize resources create overhead and reduce performance. Another factor is isolation between virtual network functions (VNFs) co-located on the same machine. VM-based virtualization requires a hypervisor that loads a separate operating system for each VM. In contrast, Linux containers (LXC) do not require loading another OS, but use kernel namespaces and provide an alternative to Linux VMs with a much smaller overhead in terms of CPU and memory resources. Linux containers are therefore an attractive technology compared with virtual machines, given their low resource footprint.
Physical resources in cloud data centers are usually oversubscribed by a certain factor, i.e., users are collectively allocated more resources than are physically available. This means that on occasion resource utilization demand on a server may exceed physical capacity. In this situation, practically imperfect tenant isolation may lead a VM’s operation to be affected by the actions of other tenants, which contravenes the principle of perfect tenant isolation in cloud data centers. Imperfect tenant isolation on the CPU, memory, disk and network plane has also been studied in the context of covert channel communication between two co-located VMs without the explicit exchange of messages, e.g., Wu et al. [38], Xu et al. [39] and Ristenpart et al. [31]. While covert channels are not the subject of this study, we study the effects of different network conditions on the launch time of VMs, LXCs and Docker containers. These network conditions may be caused by spikes in demand or errors in data center networks.
Another motivation for this study stems from the fact that an increasing number of clouds are now deployed as distributed clouds. Inter-data center communication between geographically separated data centers is at the mercy of the Internet, where link conditions may see greater variation.
The objective of this project is to quantify and compare the effect of network errors of varying severity levels on the resilience of virtualization technologies (KVM, LXC and Docker containers) in OpenStack.
1.2 Limitations of prior art
Every day more cloud data centers are being deployed globally. Cloud service providers, such as Amazon, HP, Rackspace, etc., offer different cloud service models to customers. However, they provide little in the way of service guarantees for launch times of VMs/containers in the face of network connectivity issues. Most cloud performance analysis studies are straightforward benchmark studies, such as the ones on Amazon EC2 measuring CPU performance and disk I/O [1,25]. Garfinkel [9] measured service performance of a number of Amazon software-as-a-service offerings.
1.3 Proposed approach
For this study we deployed a cloud using the OpenStack cloud management system (CMS) and developed a re-usable benchmarking tool in Python. The benchmarking tool measures the resilience of an OpenStack deployment by inducing delays, packet losses and bandwidth bottlenecks of various levels into the control plane network. OpenStack deployments using different instance virtualization technologies, i.e., hypervisors, Linux containers and Docker containers, are benchmarked by the time it takes to launch batches of instances of different sizes.1 The functional and architectural differences of these three virtualization technologies will be explained later.
1.4 Experimental results and findings
We deployed OpenStack on four physical servers, i.e., one controller and three compute nodes. We present the results of three different virtualization technologies supported by OpenStack. The results show that the Linux containers in nova-lxd offer rapid deployment and outperform other virtualization technologies due to their light resource overhead. The results also show that the spawning times increase linearly at first and then increase at a greater rate when network errors exceed a certain threshold.
1.5 Key contributions
The principal contributions of this research work are two-fold:
– We provide first measurements of spawning times in the face of network turbulence for three OpenStack virtualization technologies.
– This is among the first comparative measurement studies to include all kinds of network faults covering three different virtualization technologies.
1 OpenStack documentation uses the term instance to refer to a VM. We will use the terms VM and instance interchangeably.
2 Related work
2.1 Software defined networking
Computer networks consist of a complex mix of devices, i.e., routers, switches, firewalls etc. These devices can only be configured using their own vendor proprietary software. The proprietary and vendor controlled nature of the software severely restricts customer freedom and in turn limits the degree to which they can innovate. Software defined networking (SDN) has revolutionized the way networks and new services are designed, operated and managed [7]. SDNs are fundamentally different from traditional networks in that they decouple their control and data planes. A central SDN controller is used to control the data plane which is distributed across switches and routers of a network. Several different SDN controllers have been developed over the last several years, e.g., Beacon [6], Floodlight [5], NOX [11], ONIX [18], etc.
2.2 Network virtualization
SDN is an enabling technology for network virtualization. SDN network virtualization is an abstraction that decouples the physical from the logical data plane. Network virtualization enables quick (re)configuration of logical networks [24]. Mininet [13,19] is an emulator for networks of OpenFlow switches, hosts and controllers. A single physical server running Mininet can instantiate hundreds of hosts with as many controllers on a virtual network. In legacy networks, virtualization of network devices such as firewalls and routers is difficult, but SDN has made it possible to create virtual instances of these devices as needed.
2.3 Cloud computing
Cloud computing uses virtualization for compute, storage and networking resources to create pools of resources and makes them available to users on demand. Network virtualization in particular has given cloud service providers the flexibility to deploy complex network configurations orders of magnitude faster. Vendors offering cloud services include Amazon EC2, Microsoft Azure, Google Cloud, etc. Sefraoui et al. [34] conducted a comparative study of different IaaS providers and of how these cloud service providers achieve better resource utilization than legacy technologies.
Network-as-a-Service (NaaS) [4] is a relatively new cloud service model. This service can help users create network topologies, use custom routing protocols, and emulate network devices such as switches and routers. SDN technologies have had a major impact on cloud computing. They allow vendors to automate the provisioning and revocation of complex network infrastructure and resources in an instant by (re)programming, which has made it possible for clouds to scale.
Cloud computing is a paradigm shift where we have a large pool of computing resources working together or separately depending upon the requirements. Cloud users are allocated resources without needing to know the underlying details of how that is achieved (unlike distributed computing where resource allocation is more simplistic, and in some cases, even requires occasional human intervention).
2.4 OpenStack CMS
The OpenStack cloud management system (CMS) provides IaaS and is composed of multiple components that perform specific tasks. OpenStack is an open source project, which means that its use is free of cost, unlike commercially available cloud systems such as Amazon EC2, Rackspace, HP, etc. Communication between OpenStack components is handled via the advanced message queuing protocol (AMQP) and a REST API. Key services provided by different OpenStack components include compute (Nova), image storage (Glance), storage (Swift), networking (Neutron) and authentication (Keystone), among many other optional components. Nova is a core component of OpenStack that performs creation and termination of VMs. It communicates with other OpenStack components through the nova-api. Glance stores OS images used to boot various instances. Swift is an object storage service and is responsible for storing user data on the OpenStack cloud. Neutron, the networking component, provides network virtualization as a service. Virtual networks built from virtual links, virtual switches, virtual routers and other virtual network elements function the same way as an equivalent physical network infrastructure. Finally, all users accessing OpenStack resources, and all their requests, need to be authenticated. This is the function of Keystone, i.e., to generate authentication tokens and keypairs to authenticate users and access requests for resources (Fig. 1).
2.5 OpenStack virtualization technologies
OpenStack’s Nova component supports different compute virtualization technologies, including hypervisors and container technologies. The three Nova compute virtualization technologies considered in this study, KVM, LXD and Docker, are illustrated in Fig. 2.
2.5.1 Docker
Docker is open source container management software. It enables the deployment and bundling of applications inside a software container. Like VMs, Docker containers are portable and can be exported from one system to another. A container does not require a full operating system. Docker uses cgroups and kernel namespaces to isolate applications running in containers and avoids the overhead of booting a separate operating system and its management overhead. It has a self-contained filesystem and base image. Isolation gives the impression that the container is running like a single process on the host. Docker uses the libcontainer library to communicate with the kernel. The nova-docker driver is a hypervisor driver for OpenStack Nova and has been extended to spawn Docker containers. In order to spawn containers, the Nova driver is pointed to the Docker driver. The nova-docker driver talks to the Docker agent
Fig. 1 OpenStack architecture [27]
Fig. 2 Hypervisor VM versus LXD versus Docker [32]
using HTTP API calls. Docker images are stored in the Docker registry and images are exported to Glance from the Docker registry which Nova uses to create containers.
2.5.2 Nova-LXD
LXD is a container hypervisor and is only available on Linux. It comprises three components:
– System-wide daemon (lxd)
– Command line client (lxc)
– OpenStack Nova LXD plugin (nova-compute-lxd).
LXD containers are machine containers, also called infrastructure containers, that run a full Linux system. The same management and deployment tools used for VMs and physical machines can also be used for LXD containers. In contrast, Docker supports stateless, minimal containers that cannot be upgraded or re-configured but must instead be replaced entirely. This makes Docker a software mechanism rather than a machine management tool, and Docker containers are hence called application containers. LXD containers are full Linux systems, and Docker can be installed inside an LXD container to run applications or other container applications.
2.6 Cloud benchmark suites
Benchmarking suites evaluate system performance for certain tasks or workloads and the results are compared against some predefined/standard metrics. Several benchmark suites, some specifically for OpenStack, are available for performance testing of cloud environments [3]. Sobel et al. [35] developed Cloudstone, a CMS agnostic benchmarking suite demonstrated on Amazon EC2 that aims to simulate loads more characteristic of Web 2.0 applications, consisting of database queries and user queries. Luo et al. [21] present another cloud benchmark suite that performs tests typical of (big) data processing applications. Netflix, a commercial video streaming service, is hosted on Amazon’s cloud infrastructure. Netflix developed its own test suite called Chaos Monkey that deliberately introduces a variety of faults into the system (e.g., randomly shutting off VMs, disks, network interface cards, etc.) and assesses the service’s fragility in the face of such failures [37]. Jackson et al. [17] measured the performance of Amazon EC2 and compared it with that of a cluster using the same benchmarks routinely used to test high performance computing systems. The variation in measurements from Amazon EC2 was considerably higher, which was confirmed by several other studies [30]. Schad et al. [33] studied performance consistency of Amazon EC2 using the LINPACK benchmark and concluded that the variance of measurements is very high.
Cloud computing is prevalent, but service failures are inevitable and still occur from time to time. The massive hardware and the immense task of managing it can result in failures which can lead to service outages. Osterman et al. [28] also conducted an early performance study of Amazon EC2 with an eye on scientific computing applications. They concluded that cloud computing still required an order of magnitude improvement before it could become useful for scientific applications.
The wide deployment of cloud services across organizations has brought about a need for benchmarking clouds.
Amazon is among the earliest cloud service providers and has been the subject of several benchmarking studies including Schad et al. [33], Iosup et al. [16], Jackson et al. [17], Yigitbasi et al. [40], and O’Loughlin and Gillam [26]. A few cloud specific benchmarks are available in the literature, e.g., Folkerts et al. [8], Binning et al. [2] and Rak et al. [29].
The performance of similar cloud systems may diverge significantly. Li et al. [20] developed CloudProphet, a systematic end-to-end benchmark suite that helps customers better predict an application’s running time when it is moved to a cloud. Variations are observed in virtual instances, storage services, and network transfers against some specific metrics for four different clouds which dominate the market. Getting comparable results from a cloud benchmark can be deceptively difficult, due to differences between measures, networks, and the definitions of benchmarks. More recently, Google launched PerfKit [10]. PerfKit calculates the end-to-end time to provision resources in clouds and can be run on Amazon AWS, Microsoft Azure and Google Cloud platforms.
2.7 Instance launch time
Dynamic scalability of a cloud is only meaningful when additional/replacement resources are available in a timely manner. Instance launch time is an important parameter during service upscaling and migration, particularly for time-sensitive applications. Several recent studies have focused on instance launch time, e.g., Osterman et al. [28] measured launch time of single and multiple instances on Amazon EC2. Hill et al. [14] analyzed launch time for WebRole and WorkRole on the Microsoft Azure cloud. Mao et al. [23] studied the relationship between instance launch time and other factors such as type of instance acquired. In contrast, we have studied the effect on instance launch time of network conditions resulting from temporary (excess load) or permanent (faulty network equipment) causes. Steinmetz et al. [36] compared performance test results of OpenStack and Eucalyptus cloud platforms. Although they used instance launch time as a metric, their cloud infrastructure consisted of only two computers: a 16 core server and a Pentium 4 PC. Furthermore, the work does not consider scalable deployment or any fault injection mechanism to test performance.
3 Experimental methodology
Cloud computing enables users to scale up or scale down resources based on workload and application requirements, i.e., users can acquire more virtual resources as needed during periods of high demand, and release virtual resources during periods of low demand. Such dynamic provisioning of virtual resources is meaningful only when it can be achieved fast enough. An instance can be requested at any time by a user; however, it may take up to several minutes for it to become ready for use. This is the time it takes the CMS to match the resource request to available physical resources
and allocate them. The launch time may vary depending on a combination of factors including, but not limited to, type and size of image, copy/boot of OS image, number of cores, etc.
3.1 OpenStack infrastructure
The OpenStack IaaS model includes compute (Nova), image storage (Glance), persistent storage (Swift), networking (Neutron) and authentication (Keystone) services. The OpenStack project was first released in 2010, through a collaboration between Rackspace and NASA. The OpenStack development community launches new releases every six months with specific milestones. We deployed the stable Liberty release, the 12th in the line.
OpenStack is highly scalable, meaning it can be deployed on as few as one node or as many as hundreds of nodes. In a single-node deployment, one computer hosts all OpenStack services, making it a complete cloud. In a multi-node OpenStack deployment, services are distributed across two or more nodes, usually with one node acting as the controller node and all other nodes acting as compute nodes.
– The controller node runs all the core services of OpenStack, and is a key element of the control plane of OpenStack environment.
– The compute node, on the other hand, is a collection of services which runs the hypervisor portion of compute. It receives requests from the controller node to allocate/deallocate instances. The cloud scales horizontally by increasing the number of compute nodes.
All these nodes are connected to an internal network (or management network), as shown in Fig. 3. An internal network is a separate network used by cloud providers consisting of separate switches, network and NICs, and provides internal communication between the OpenStack components. This segregated network is reachable only from within a data center and prevents service disruption by traffic generated by tenants.
3.1.1 Instance launch sequence
In order to measure the instance launch time in OpenStack, it is important to understand the sequence of steps involved. Various OpenStack components interact through a series of inter-component requests to successfully launch an instance, as shown in Fig. 4. We present a simplified version of the steps involved in the process; a more elaborate description is given in [22]. All OpenStack components communicate with each other using REST, while intra-service communication takes place through remote procedure calls (RPCs) using the relevant APIs. The instance launch request is made via the CLI or Dashboard and is translated to the nova-boot command by the Nova API server. The nova-api service interacts with Keystone for authentication (1, 2). Following successful authentication, the nova-db service is used to create the initial entry for the new instance (4, 5, 6, 7). The nova-api then interacts with nova-scheduler to get information about the host where the instance has to be launched (8, 9). After filtering and
Fig. 3 Multi-node deployment
weighing, the nova-scheduler selects an appropriate host and sends a launch request to nova-compute (10, 11, 12, 13). The nova-compute then makes an RPC call to nova-conductor to fetch information such as host ID, flavor, disk and vCPU (14, 15, 16, 17, 18). Nova-compute then makes a REST call to Glance to retrieve the image from the image database and upload it to the selected host (19, 20). This uploaded image is cached for future use. Subsequently, nova-compute calls Neutron to retrieve network allocation and configuration information so that a fixed IP is assigned to the new instance (21, 22). Finally, nova-compute forwards all information to the virtualization driver, which executes the request to spawn an instance on the hypervisor (23). It is worth mentioning here that the volume storage backend (i.e., Cinder or Ceph) is not enabled in our experimental setup; therefore, we have skipped the steps that involve contacting Cinder during the process. The corresponding instance states visible during the various stages of the provisioning process are: Scheduling > Networking > Spawning > Running [12].
3.2 Testing infrastructure
We have set up a control plane network with a single subnet for communication between all OpenStack nodes. The testing infrastructure consisted of one controller node with the following specifications: Intel Core i5 3.2GHz, 10GB RAM, and a 500GB, 7200rpm SATA hard disk. We also set up three compute nodes, each with the same specifications as the controller node described above. OpenStack relies
Fig. 4 Instance launch sequence
on inter-service communication between different nodes for proper operation. This communication requires the network to be fault-resilient. Considering how critical inter-service communication is to the proper functioning of OpenStack, our fault injection mechanism targets service communications (on the control plane network).
3.3 Fault injection mechanism
Considering the critical nature of inter-service communication and the various types of network errors inherent to computer networks, our fault injection mechanism targets inter-service communication on the control plane network. In the OpenStack setup shown in Fig. 3, this is marked as the Management network. In our case the management network carries only the inter-service communication traffic, and not the VM traffic, which isolates the test-case scenario under study. We used the tc qdisc utility in Linux [15] to simulate various network errors and control their degree of severity. tc qdisc consumes very few system resources and therefore has a negligible effect on measurements, making it a good approximation of actual network failures. We used tc qdisc to induce three kinds of faults in the management subnet: (1) limited bandwidth, (2) link delay, and (3) packet losses, using the following command lines:
– To limit the bandwidth to 100kbps: sudo tc qdisc add dev eth1 root tbf rate 100kbit burst 1600 limit 3000. Here the burst parameter specifies the buffer or max burst size of the bucket (bytes per cell). The limit parameter specifies the number of bytes that can be queued waiting for tokens to become available.
– To add a delay of 100ms: sudo tc qdisc add dev eth1 root netem delay 100ms.
– To set the packet loss rate on the link to 10%: sudo tc qdisc add dev eth1 root netem loss 10%.
– To remove all modifications made by tc qdisc we called, sudo tc qdisc del dev eth1 root.
These tc qdisc commands take effect at the physical NIC of the host machines of the testbed (see Fig. 5).
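The command lines listed above can be generated programmatically when a benchmark needs to step through many fault levels. The following is a minimal sketch, not the authors' tool: the function names `tc_fault_command` and `tc_clear_command` are hypothetical, the interface name `eth1` and the `burst`/`limit` values are copied from the commands quoted above, and the functions only build argument lists without executing them.

```python
import shlex


def tc_fault_command(dev, kind, value):
    """Build the tc qdisc argument list for one fault type (hypothetical helper).

    kind is "bandwidth" (value in kbit/s), "delay" (value in ms)
    or "loss" (value in percent). The tbf burst and limit values are
    the ones quoted in the text, not tuned recommendations.
    """
    base = ["sudo", "tc", "qdisc", "add", "dev", dev, "root"]
    if kind == "bandwidth":
        return base + ["tbf", "rate", f"{value}kbit",
                       "burst", "1600", "limit", "3000"]
    if kind == "delay":
        return base + ["netem", "delay", f"{value}ms"]
    if kind == "loss":
        return base + ["netem", "loss", f"{value}%"]
    raise ValueError(f"unknown fault kind: {kind}")


def tc_clear_command(dev):
    """Build the command that removes the root qdisc added above."""
    return ["sudo", "tc", "qdisc", "del", "dev", dev, "root"]


if __name__ == "__main__":
    # Print the commands instead of running them; executing them
    # requires root privileges and changes live network behavior.
    for kind, value in [("bandwidth", 100), ("delay", 100), ("loss", 10)]:
        print(shlex.join(tc_fault_command("eth1", kind, value)))
    print(shlex.join(tc_clear_command("eth1")))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to `subprocess.run` without shell quoting concerns, should one choose to execute the commands.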
3.3.1 Packet loss
Packet losses can have noticeable effects on a communication network. We increased the packet loss rate in a controlled manner from 0% in uniform intervals until instances failed to launch altogether.
3.3.2 Bandwidth
Bandwidth is another important parameter to consider when shaping the network traffic between the nodes. Bandwidth was reduced in uniform steps until the instances failed to launch.
Fig. 5 System under test: turn around time
3.3.3 Delay
Finally, we consider various levels of delay on the control plane network. We begin by adding no delay, then gradually increase it by uniform increments up until instances fail to launch.
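The procedure shared by Sects. 3.3.1 to 3.3.3, stepping a fault level in uniform increments until instances no longer launch, can be sketched as a small loop. This is an illustrative sketch under stated assumptions: `find_failure_threshold` and the `launch_batch` callable are hypothetical names, and `launch_batch` is assumed to apply the fault, launch a batch, and return the launch time in seconds or `None` when the batch fails.

```python
def find_failure_threshold(levels, launch_batch):
    """Walk fault levels from mild to severe until a launch fails.

    levels: iterable of fault levels ordered by increasing severity
        (e.g. increasing delay, or decreasing bandwidth).
    launch_batch: hypothetical callable; takes a level and returns the
        batch launch time in seconds, or None if the batch failed.
    Returns (results, first_failed_level), where results maps each
    successful level to its launch time and first_failed_level is
    None if every level succeeded.
    """
    results = {}
    for level in levels:
        elapsed = launch_batch(level)
        if elapsed is None:
            return results, level
        results[level] = elapsed
    return results, None
```

For a bandwidth sweep the levels would be passed in decreasing order, since lower bandwidth is the more severe condition.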
3.4 Performance metric
The purpose of this experiment is to measure the time it takes to launch batches of varying sizes of VM, LXD and Docker container instances. In real-world scenarios these launch times are of critical importance when scaling out a VNF or meeting resource demand during peak traffic hours. For this purpose, we developed a Python script using the python-novaclient bindings of OpenStack. This script repeatedly executes a series of tests. Each test consists of launching a batch of VMs, LXD containers or Docker containers of a given size, with the image pre-cached on the compute node prior to the tests.
Next, the script measures the time it takes for all instances to become active. An instance is considered to have reached the active state as soon as it becomes accessible via its virtual NIC. Measuring the launch time is of prime importance because it captures the system’s overall performance. We use the time it takes to receive a successful ping response as a proxy indicator that an instance has launched, because in OpenStack the time it takes for an instance to go from the active state to the ready-to-use state is negligible. An instance is in the active state if there are no ongoing compute API calls (running tasks), while an instance is ready to use when its operating system has booted and all internal infrastructure components, such as networks and volumes, are attached and ready for use.
Finally, the script terminates all instances to return the system to its initial state ahead of the next test. For consistency of results, all KVM instances are launched using the Ubuntu 14.04 cloud image and the m1.small flavor.2 All Docker containers are launched using the Ubuntu 14.04 base image from the Docker registry, while the LXD containers are spawned using an Ubuntu 14.04 image. The difference in launch time between a VM and a container, which we will see in the next section, can be visualized by the turnaround of the ping in both cases, as shown in Fig. 5. In the case of a VM, the ping packet has to traverse a guest OS layer, while in the case of containers there is no additional OS layer.
When designing this series of experiments we encountered situations where instance launch requests would fail. After thorough exploration we observed that the longest it took any single instance or batch of instances to launch was 450s. Further deterioration in networking conditions would result in complete failure to launch. Even waiting for up to 30min would not yield a successful launch. This series of experiments is repeated for various levels of packet loss, delay and bandwidth limitation.
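The measurement described in this section, polling an instance until it answers and treating a long wait as a failed launch, can be sketched as follows. This is not the authors' script: `wait_until_reachable` and the `probe` callable are hypothetical, `probe` stands in for a single ping attempt, and using 450 s as the default timeout is an assumption based on the longest successful launch reported above. The `clock` and `sleep` parameters are injectable only so the sketch can be exercised without real waiting.

```python
import time


def wait_until_reachable(probe, timeout_s=450.0, interval_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    probe: hypothetical callable returning True once the instance
        responds (e.g. one successful ping), False otherwise.
    Returns the elapsed time in seconds on success, or None if the
    instance did not respond within timeout_s (treated as a failed
    launch).
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

A monotonic clock is used so that wall-clock adjustments on the measuring host do not distort the measured launch time.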
2 Flavor is the term used in OpenStack documentation to refer to various configurations of VM instances.
Fig. 6 Spawning time versus bandwidth for Docker containers (x-axis: bandwidth in kbps, 100 to 300; y-axis: spawning time in seconds; series: batches of 1, 10, 20, 30, 40 and 50 Docker containers)
Fig. 7 Spawning time versus bandwidth for LXD containers (x-axis: bandwidth in kbps, 100 to 300; y-axis: spawning time in seconds; series: batches of 1, 10, 20, 30, 40 and 50 LXD containers)
4 Results and analysis
4.1 Launch time versus fault levels
4.1.1 Variable bandwidth
Figures 6, 7 and 8 depict the average launch times for Docker containers, LXDs and VMs, respectively, at various bandwidth levels in the control plane network. Note that all plotted data points are average launch times of 10 runs. All three plots
Fig. 8 Spawning time versus bandwidth for VMs (x-axis: bandwidth in kbps, 120 to 300; y-axis: spawning time in seconds; series: batches of 1, 5 and 10 VMs)
in this group are for bandwidth ranging from 90 to 300kbps depending upon the virtualization technology used. It is worth noting that LXD containers perform slightly better than Docker containers and VMs because the bandwidth threshold for LXD containers is as low as 90kbps, while it is 100kbps for Docker containers and 110kbps for VMs. Below these bandwidth thresholds, instances of the respective virtualization technology failed to launch. Not surprisingly, the figures also show that the launch times increase with larger batch sizes for all three virtualization technologies.
We also observed that for similar parameters (batch size and bandwidth), LXD offers faster launch times than Docker containers and much faster launch times than VMs.
Furthermore, we observe that for each virtualization technology launch times increase linearly with decrease in bandwidth, until an inflexion point is reached beyond which launch times increase at a much higher rate. For Docker containers that inflexion point appears at 105kbps, and for LXD at 115kbps. Interestingly, the inflexion point remains the same regardless of the batch size. This inflexion point is less prominent in the case of VMs. However, for the 10 VM case there appears to be greater increase in launch times for bandwidths less than 185kbps.
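One simple way to locate such an inflexion point in a measured curve is to look for the sample at which the slope between neighboring segments changes the most. The sketch below illustrates that heuristic only; it is not necessarily how the inflexion points reported here were determined, the name `find_knee` is hypothetical, and any input used to try it should be treated as synthetic rather than as the measurements of this study.

```python
def find_knee(xs, ys):
    """Return the x value where the slope changes most between segments.

    xs must be strictly increasing (or strictly decreasing) and contain
    at least three points; ys holds the corresponding launch times.
    This is a plain largest-slope-change heuristic and is sensitive to
    measurement noise.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) points")
    # Slope of each segment between consecutive samples.
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
              for i in range(len(xs) - 1)]
    # Absolute slope change at each interior sample.
    changes = [abs(slopes[i + 1] - slopes[i])
               for i in range(len(slopes) - 1)]
    knee_index = changes.index(max(changes)) + 1
    return xs[knee_index]
```

Because each plotted point is already an average over several runs, the heuristic is applied to averaged values; with noisier data a fitted two-segment linear model would be a more robust choice.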
Most obviously and unsurprisingly, the same testing infrastructure was able to launch LXD/Docker containers approximately an order of magnitude faster than VMs. This is explained by the smaller resource overhead of containers relative to VMs.
4.1.2 Variable delay
Figures 9, 10 and 11 depict the average launch times for Docker containers, LXDs and VMs, respectively, at various delays in the control plane network. As before, all plotted data points are average launch times over 10 runs. The three plots cover delays from 0s up to 5–5.3s, with the upper end of the range depending on the virtualization technology. LXD containers again perform slightly better than Docker containers and VMs: the maximum delay at which LXD containers still manage to launch is 5.3s, compared with 5.2s for Docker containers and 5s for VMs. Above these delay thresholds, instances of the respective virtualization technology failed to launch. Not surprisingly, the figures also show that launch times increase with larger batch sizes for all three virtualization technologies.
Fig. 9 Spawning time versus delay for Docker containers (x-axis: delay in s; y-axis: spawning time in s; curves for batches of 1, 10, 20, 30, 40 and 50 Docker containers)
Fig. 10 Spawning time versus delay for LXD containers (x-axis: delay in s; y-axis: spawning time in s; curves for batches of 1, 10, 20, 30, 40 and 50 LXD containers)
Fig. 11 Spawning time versus delay for VMs (x-axis: delay in s; y-axis: spawning time in s; curves for batches of 1, 5 and 10 VMs)
Fig. 12 Spawning time versus packet loss rate for Docker containers (x-axis: packet loss rate, 0 to 0.6; y-axis: spawning time in s; curves for batches of 1, 10, 20, 30, 40 and 50 Docker containers)
We also tried to identify inflexion points on the delay axis beyond which launch times began increasing at a higher rate. For LXDs and VMs that inflexion point appears at around the 5s delay level. However, we do not see a clear inflexion point for the corresponding plot for Docker containers in Fig. 9.
In absolute terms, the launch times of equally sized batches of Docker containers and LXDs are about the same. However, launch times of equally sized batches of VMs are about one order of magnitude higher.
Fig. 13 Spawning time versus packet loss rate for LXD containers (x-axis: packet loss rate, 0 to 0.6; y-axis: spawning time in s; curves for batches of 1, 10, 20, 30, 40 and 50 LXD containers)
Fig. 14 Spawning time versus packet loss rate for VMs (x-axis: packet loss rate, 0 to 0.5; y-axis: spawning time in s; curves for batches of 1, 5 and 10 VMs)
4.1.3 Variable PLR
Figures 12, 13 and 14 depict the average launch times for Docker containers, LXDs and VMs, respectively, at various packet loss rates (PLR) in the control plane network. As before, all plotted data points are average launch times over 10 runs. The three plots cover PLRs from 0% up to 50–60%, with the upper end of the range depending on the virtualization technology. Once again, Docker containers and LXD perform better than VMs: the maximum PLR at which Docker containers and LXD are still able to launch is 60%, while VMs were not able to launch when the PLR exceeded 50%. Above these PLR thresholds, instances of the respective virtualization technology failed to launch. Not surprisingly, the figures also show that launch times increase with larger batch sizes for all three virtualization technologies.
Fig. 15 Spawning time versus No of Docker containers (panel title: Effect of bandwidth; x-axis: number of Docker containers, 0 to 50; y-axis: spawning time in s; curves for 300, 250, 200, 150 and 100kbps)
We also observed that for similar parameters (batch size and PLR), LXD offers faster launch times than Docker containers and much faster launch times than VMs.
Furthermore, we observe that for each virtualization technology launch times increase linearly with increasing PLR, until an inflexion point is reached beyond which launch times grow at a much faster rate. For Docker containers and LXD that inflexion point appears at about 45%, regardless of batch size. The inflexion point is less prominent in the case of VMs. However, for the 10 VM case there appears to be a greater increase in launch times for PLRs above 35%.
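The launch thresholds reported in Sects. 4.1.1 to 4.1.3 can be collected into a small lookup that answers whether an instance would be expected to launch under a given impairment. This is a minimal sketch, not part of the study: the class and function names are hypothetical, and treating the three limits as independent is an assumption, since the results above vary one impairment at a time on a four-host testbed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchLimits:
    min_bandwidth_kbps: float  # below this, instances failed to launch
    max_delay_s: float         # above this, instances failed to launch
    max_plr: float             # packet loss rate as a fraction (0.6 == 60%)


# Thresholds as reported in Sects. 4.1.1, 4.1.2 and 4.1.3.
LIMITS = {
    "lxd": LaunchLimits(min_bandwidth_kbps=90, max_delay_s=5.3, max_plr=0.60),
    "docker": LaunchLimits(min_bandwidth_kbps=100, max_delay_s=5.2, max_plr=0.60),
    "kvm": LaunchLimits(min_bandwidth_kbps=110, max_delay_s=5.0, max_plr=0.50),
}


def can_launch(tech, bandwidth_kbps=None, delay_s=0.0, plr=0.0):
    """Return True if every given impairment is within the reported limits.

    Assumption: limits are checked independently of one another; combined
    impairments are not covered by the measurements summarized here.
    """
    limits = LIMITS[tech]
    if bandwidth_kbps is not None and bandwidth_kbps < limits.min_bandwidth_kbps:
        return False
    if delay_s > limits.max_delay_s:
        return False
    if plr > limits.max_plr:
        return False
    return True


for tech in LIMITS:
    print(tech, can_launch(tech, bandwidth_kbps=95), can_launch(tech, plr=0.55))
```

A check of this kind could serve as a coarse pre-deployment sanity test for a control plane link, with the caveat that the limits are specific to the authors' test environment.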
4.2 Spawning time versus instance batch size
4.2.1 Effect of bandwidth
Figures 15, 16 and 17 plot average launch times versus batch sizes at a variety of bandwidths for Docker containers, LXD and VMs, respectively. These plots also show what the previous figures could not, i.e., whether the relationship between batch size and launch time is linear. Figures 15 and 16 show that launch times increase roughly linearly with the number of Docker containers and LXDs, but rise faster than linearly for (A) batch sizes greater than 40 Docker containers/LXDs, and (B) bandwidths below 150kbps. Figure 17 is a similar plot for VMs and shows that launch times remain linear from 1 to 10 VMs at all bandwidths.
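A simple way to quantify "roughly linear up to a batch size" is to fit a line to the smallest batches and report the largest batch size whose launch time still lies close to that line. The sketch below uses hypothetical function names, an assumed 10% tolerance and synthetic launch times (not values read from Figs. 15 to 17); it is one possible check, not the procedure used by the authors.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def linear_range(batch_sizes, times, fit_points=3, rel_tol=0.10):
    """Largest batch size whose time stays within rel_tol of the line
    fitted to the first fit_points samples (inputs sorted by batch size)."""
    slope, intercept = fit_line(batch_sizes[:fit_points], times[:fit_points])
    largest = batch_sizes[fit_points - 1]
    for n, t in zip(batch_sizes[fit_points:], times[fit_points:]):
        predicted = slope * n + intercept
        if abs(t - predicted) > rel_tol * predicted:
            break  # first batch size that departs from the linear trend
        largest = n
    return largest


# Synthetic example (NOT measurements from the paper): linear up to 40
# instances, then a faster-than-linear jump at 50.
batch_sizes = [1, 10, 20, 30, 40, 50]
launch_time_s = [7, 25, 45, 65, 85, 140]
limit = linear_range(batch_sizes, launch_time_s)
print(f"launch time is roughly linear up to {limit} instances")
```

The tolerance and the number of points used for the fit both influence the answer, so they should be reported alongside any linear range derived this way.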
Fig. 16 Spawning time versus No of LXD containers (panel title: Effect of bandwidth; x-axis: number of LXD containers, 0 to 50; y-axis: spawning time in s; curves for 300, 250, 200, 150 and 100kbps)
Fig. 17 Spawning time versus No of VMs (panel title: Effect of bandwidth; x-axis: number of VMs, 1 to 10; y-axis: spawning time in s; curves for 300, 250, 200, 150 and 110kbps)
4.2.2 Effect of delay
Figures 18, 19 and 20 plot average launch times versus batch sizes for different delay values for Docker containers, LXDs and VMs, respectively. We observe that launch times increase linearly at first, but as the batch size increases beyond a threshold value (which is different for different delays) the launch times grow at a faster rate. This is more clearly visible for Docker containers (Fig. 18) and LXD (Fig. 19), but less so for the case of VMs. For the cases of Docker and LXD containers, for batch sizes of 50 instances, each 1s delay increases the launch time by approximately 50s.
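The last observation amounts to a rough linear rule for batches of 50 containers: about 50s of additional launch time per 1s of control plane delay, or roughly 1s per instance per second of delay. The sketch below merely restates that arithmetic; the function name is hypothetical, and `baseline_s` (the launch time with no added delay) is an assumed input because its value is not stated in this section.

```python
def launch_time_batch_of_50(delay_s, baseline_s, seconds_per_delay_second=50.0):
    """Approximate launch time (s) for a batch of 50 Docker/LXD containers.

    Uses the reported rule of thumb of ~50 s extra per 1 s of delay.
    baseline_s is a hypothetical zero-delay launch time supplied by the caller.
    """
    if not 0 <= delay_s <= 5:
        # The rule of thumb is only described for the plotted 0-5 s range.
        raise ValueError("delay outside the range covered by the plots")
    return baseline_s + seconds_per_delay_second * delay_s


# Example with an assumed (not measured) baseline of 10 s:
print(launch_time_batch_of_50(delay_s=2, baseline_s=10))   # 2 s of delay
print(50.0 / 50)  # extra seconds per instance per second of delay
```

The rule applies only to batches of 50 instances; smaller batches show a weaker dependence on delay in Figs. 18 and 19.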
Fig. 18 Spawning time versus No of Docker containers (panel title: Effect of delay; x-axis: number of Docker containers, 0 to 50; y-axis: spawning time in s; curves for delays of 0, 1, 2, 3, 4 and 5s)
Fig. 19 Spawning time versus No of LXD containers (panel title: Effect of delay; x-axis: number of LXD containers, 0 to 50; y-axis: spawning time in s; curves for delays of 0, 1, 2, 3, 4, 5 and 5.6s)
4.2.3 Effect of PLR
Figures 21, 22 and 23 plot the average launch times versus batch sizes for different PLR values for Docker containers, LXDs and VMs, respectively. The PLR was varied over a range of 0–60%, with launch times generally increasing with increasing PLR. Launch times of batches of Docker containers and LXDs grow linearly up to about 30 instances for PLRs up to 40%. For higher PLRs, launch times remain linear only for batches of up to about 20 instances. For VMs, launch times in Fig. 23 appear to remain linear over the full range of PLRs and batch sizes.
Fig. 20 Spawning time versus No of VMs (panel title: Effect of delay; x-axis: number of VMs, 1 to 10; y-axis: spawning time in s; curves for delays of 0, 1, 2, 3, 4, 5 and 5.4s)
Fig. 21 Spawning time versus No of Docker containers (panel title: Effect of % packet-loss; x-axis: number of Docker containers, 0 to 50; y-axis: spawning time in s; curves for 0, 10, 20, 30, 40, 50 and 60% packet loss)
5 Conclusions
In this measurement study, we evaluated the ability of three OpenStack virtualization technologies, KVM, LXD and Docker containers, to continue to provide useful services in the face of network errors (limited bandwidth, delay and packet losses) of varying degrees of severity in the control plane network. This performance analysis was performed using OpenStack’s 12th release, named Liberty.
Overall, for most tests LXD exhibited the fastest launch times, followed very closely by Docker containers. We also observed that an OpenStack cloud is generally able to launch an order of magnitude more container instances than VMs on the same infrastructure. In terms of limited bandwidth, delay and PLR, we consistently observed container virtualization technologies to be a little more resilient than VMs. Our measurement results show that there is a lower limit on the bandwidth of the control plane network, approximately 110kbps, below which instances will not launch. Differences between virtualization technologies in this regard were very slight. In terms of delay, containers can bear a delay of up to 5.5s, while KVM continues working for up to 5s. In terms of packet losses, containers continue to launch even when the PLR is as high as 60%, while KVM VMs can only weather PLRs up to 50%.
Fig. 22 Spawning time versus No of LXD containers (panel title: Effect of % packet-loss; x-axis: number of LXD containers, 0 to 50; y-axis: spawning time in s; curves for 0, 10, 20, 30, 40, 50 and 60% packet loss)
Fig. 23 Spawning time versus No of VMs (panel title: Effect of % packet-loss; x-axis: number of VMs, 1 to 10; y-axis: spawning time in s; curves for 0, 10, 20, 30, 40 and 50% packet loss)
We also observed various batch sizes, specific to our test environment, for which instance launch times grow linearly with batch size, depending on the severity level of network errors.
In conclusion, we would like to acknowledge that all results presented in this paper were produced from experiments conducted on an OpenStack testbed comprising four physical host machines. We realize that the size of this setup is not representative of a production data center. Although all reported results are averages of multiple runs, future studies investigating the effects of network errors on instance launch times should be conducted on larger, more representative testbeds.
References
1. Akioka S, Muraoka Y (2010) HPC benchmarks on Amazon EC2. In: 2010 IEEE 24th international conference on advanced information networking and applications workshops (WAINA). IEEE, pp 1029–1034
2. Binnig C, Kossmann D, Kraska T, Loesing S (2009) How is the weather tomorrow?: towards a bench- mark for the cloud. In: Proceedings of the second international workshop on testing database systems. ACM, p 9
3. Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM symposium on Cloud computing. ACM, pp 143–154
4. Costa P, Migliavacca M, Pietzuch P, Wolf AL (2012) Naas: network-as-a-service in the cloud. In: Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services, Hot-ICE, vol 12, p 1
5. Erickson D (2012) Floodlight java based openflow controller. Last accessed, Ago 6. Erickson D (2013) The beacon openflow controller. In: Proceedings of the second ACM SIGCOMM
workshop on Hot topics in software defined networking. ACM, pp 13–18 7. Feamster N, Rexford J, Zegura E (2013) The road to SDN: an intellectual history of programmable
networks 8. Folkerts E, Alexandrov A, Sachs K, Iosup A, Markl V, Tosun C (2013) Benchmarking in the cloud:
what it should, can, and cannot be. In: Selected topics in performance evaluation and benchmarking. Springer, pp 173–188
9. Garfinkel SL (2007) An evaluation of Amazon’s grid computing services: EC2, S3, and SQS. In: Center for. Citeseer
10. Google PerfKit tool for cloud benchmarking. http://www.infoworld.com/article/2884196/cloud- computing/google-whips-up-perfkit-tools-to-make-cloud-benchmarking-easier.html
11. Gude N, Koponen T, Pettit J, Pfaff B, Casado M, McKeown N, Shenker S (2008) NOX: towards an operating system for networks. ACM SIGCOMM Comput Commun Rev 38(3):105–110
12. Gupta R (2016) Request flow for provisioning instance in openstack. http://ilearnstack.com/2013/04/ 26/request-flow-for-provisioning-instance-in-openstack.html. Last Accessed May 2016
13. Handigol N, Heller B, Jeyakumar V, Lantz B, McKeown N (2012) Reproducible network experiments using container-based emulation. In: Proceedings of the 8th international conference on Emerging networking experiments and technologies. ACM, pp 253–264
14. Hill Z, Li J, Mao M, Ruiz-Alvarez A, Humphrey M (2010) Early observations on the performance of windows azure. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 367–376
15. Hubert B (2016) Tc manpage—linux advanced routing & traffic control. http://lartc.org/manpages/tc. txt. Last Accessed 8 Oct 2016
16. Iosup A, Ostermann S, Yigitbasi MN, Prodan R, Fahringer T, Epema DHJ (2011) Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Trans Parallel Distri Syst 22(6):931–945
123
Instance launch-time analysis of OpenStack virtualization… 1013
17. Jackson KR, Ramakrishnan L, Muriki K, Canon S, Cholia S, Shalf J, Wasserman HJ, Wright NJ (2010) Performance analysis of high performance computing applications on the Amazon web services cloud. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, pp 159–168
18. Koponen T, Casado M, Gude N, Stribling J, Poutievski L, Zhu M, Ramanathan R, Iwata Y, Inoue H, Hama T et al (2010) Onix: a distributed control platform for large-scale production networks. OSDI 10:1–6
19. Lantz B, Heller B, McKeown N (2010) A network in a laptop: rapid prototyping for software-defined networks. In: Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks. ACM, p 19
20. Li A, Zong X, Kandula S, Yang X, Zhang M (2011) Cloudprophet: towards application performance prediction in cloud. In: ACM SIGCOMM Computer Communication Review, vol 41. ACM, pp 426– 427
21. Luo C, Zhan J, Jia Z, Wang L, Gang L, Zhang L, Cheng-Zhong X, Sun N (2012) Cloudrank-d: benchmarking and ranking cloud computing systems for data processing applications. Front Comput Sci 6(4):347–362
22. Malik A, Ahmed J, Qadir J, Ilyas MU (2017) A measurement study of open source SDN layers in openstack under network perturbation. Comput Commun 102:139–149
23. Mao M, Humphrey M (2012) A performance study on the vm startup time in the cloud. In: 2012 IEEE 5th international conference on cloud computing (CLOUD). IEEE, pp 423–430
24. Mendonca M, Astuto BN, Nguyen XN, Obraczka K, Turletti T et al (2013) A survey of software-defined networking: past, present, and future of programmable networks
25. Moreno-Vozmediano R, Montero RS, Llorente IM (2009) Elastic management of cluster-based services in the cloud. In: Proceedings of the 1st workshop on Automated control for datacenters and clouds. ACM, pp 19–24
26. O’Loughlin J, Gillam L (2013) Towards performance prediction for public infrastructure clouds: an EC2 case study. In: 2013 IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom), vol 1. IEEE, pp 475–480
27. OpenStack. Chapter 1. [a]rchitecture. http://docs.openstack.org/juno/install-guide/install/apt/content/ choverview.html
28. Ostermann S, Iosup A, Yigitbasi N, Prodan R, Fahringer T, Epema D (2010) A performance analysis of EC2 cloud computing services for scientific computing. In: Cloud computing. Springer, pp 115–131
29. Rak M, Aversano G (2012) Benchmarks in the cloud: the mosaic benchmarking framework. In: 2012 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Comput- ing (SYNASC). IEEE, pp 415–422
30. Rehr JJ, Vila Fernando D, Gardner JP, Svec L, Prange M (2010) Scientific computing in the cloud. Comput Sci Eng 12(3):34–43
31. Ristenpart T, Tromer E, Shacham H, Savage S (2009) Hey, you, get off of my cloud: exploring informa- tion leakage in third-party compute clouds. In: Proceedings of the 16th ACM conference on Computer and communications security. ACM, pp 199–212
32. Russell B (2015) Passive benchmarking with docker LXC, KVM & OpenStack 33. Schad J, Dittrich J, Quiané-Ruiz J-A (2010) Runtime measurements in the cloud: observing, analyzing,
and reducing variance. Proc VLDB Endow 3(1–2):460–471 34. Sefraoui O, Aissaoui M, Eleuldj M (2012) Openstack: toward an open-source solution for cloud
computing. Int J Comput Appl 55:38–42 35. Sobel W, Subramanyam S, Sucharitakul A, Nguyen J, Wong H, Klepchukov A, Patil , Fox A, Patterson
D (2008) Cloudstone: multi-platform, multi-language benchmark and measurement tools for web 2.0. In: Proc. of CCA, vol 8
36. Steinmetz D, Perrault BW, Nordeen R, Wilson J, Wang X (2012) Cloud computing performance benchmarking and virtual machine launch time. In: Proceedings of the 13th annual conference on Information technology education. ACM, pp 89–90
37. Tseitlin A (2013) The antifragile organization. Commun ACM 56(8):40–44 38. Wu Z, Xu Z, Wang H (2012) Whispers in the hyper-space: high-speed covert channel attacks in the
cloud. In: Presented as part of the 21st USENIX Security Symposium (USENIX Security 12), pp 159–173
123
1014 J. Ahmed et al.
39. Xu Y, Bailey M, Jahanian F, Joshi K, Hiltunen M, Schlichting R (2011) An exploration of L2 cache covert channels in virtualized environments. In: Proceedings of the 3rd ACM workshop on Cloud computing security workshop. ACM, pp 29–40
40. Yigitbasi N, Iosup A, Epema D, Ostermann S (2009) C-meter: a framework for performance analysis of computing clouds. In: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. IEEE Computer Society, pp 472–477
123