discussion 4
CompTIA Cloud+ CV0-002 Study Guide
Chapter 8: Cloud Management Baselines, Performance, and SLAs
Chapter 8 Objectives
4.5 Given a scenario, analyze deployment results to confirm they meet the baseline.
• Procedures to confirm results
• CPU usage
• RAM usage
• Storage utilization
• Patch versions
• Network utilization
• Application version
• Auditing enable
• Management tool compliance
Chapter 8 Objectives (cont.)
4.6 Given a specific environment and related data (e.g., performance, capacity, trends), apply appropriate changes to meet expected criteria.
• Analyze performance trends.
• Refer to baselines.
• Refer to SLAs.
• Tuning of cloud target objects
• Compute
• Network
• Storage
• Service/application resources
• Recommend changes to meet expected performance/capacity.
• Scale up/down (vertically)
• Scale in/out (horizontally)
Chapter 8 Objectives (cont.)
4.7 Given SLA requirements, determine the appropriate metrics to report.
• Chargeback/showback models
• Reporting based on company policies
• Reporting based on SLAs
• Dashboard and reporting
• Elasticity usage
• Connectivity
• Latency
• Capacity
• Overall utilization
• Cost
• Incidents
• Health
• System availability
o Uptime
o Downtime
Baselines
In this section, you will learn about the importance of measuring the performance of your cloud deployments and how to go about determining what you consider to be a normal operating condition.
Once a good baseline is determined, then you can track operations and determine whether your services are operating beyond your parameters and take corrective actions.
This is done by setting parameters, or metrics, on measurable components.
These components are called objects, and you can define the measurements that are sent to monitoring systems to collect data trends over time.
With this information at hand, you have solid data on which to base your assessment of the health of your deployment.
Measuring Your Deployment
To determine whether your cloud services (servers, storage, security, databases, load balancers, or any of the many other services offered) are performing as expected, you must know what normal is.
Set up your baselines so you know what is considered normal operation and what falls outside your expectations.
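As a rough illustration of "knowing what normal is," the sketch below summarizes historical readings as a mean and standard deviation and flags values that stray too far from them. The function names, the sample readings, and the three-standard-deviation tolerance are assumptions chosen for the example, not something the study guide prescribes.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical samples as a mean and a sample standard deviation."""
    return {"mean": mean(samples), "stdev": stdev(samples)}

def is_out_of_baseline(value, baseline, tolerance=3.0):
    """Return True when value is more than `tolerance` standard deviations from the mean."""
    return abs(value - baseline["mean"]) > tolerance * baseline["stdev"]

# Illustrative CPU utilization readings (percent); real data would come from monitoring.
history = [40, 42, 38, 41, 39, 43, 40, 37]
baseline = build_baseline(history)
```

A new reading can then be checked with `is_out_of_baseline(reading, baseline)` before deciding whether corrective action is needed.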
CPU Usage
Many applications are CPU bound, which is to say their performance depends on the amount of CPU resources available. One of the most commonly tracked cloud objects is the percentage of CPU utilization, since CPU utilization has a direct impact on system performance.
This metric is available at the hypervisor level and can be reported to management systems automatically, with no scripting required.
CPU usage can be tracked over time to identify trends, peak usage, and any anomalies that can provide invaluable information to you for troubleshooting and capacity planning.
RAM Usage
When RAM utilization reaches 100 percent on a server, the operating system will begin to access the swap file and cause a serious performance slowdown that affects all processes running on the server.
Memory usage is one of the most critical objects to monitor and collect baseline data on.
Memory usage should be constantly tracked and measured against the baseline.
If memory utilization begins to approach the amount of memory available on the cloud server instances, you should immediately take action to remedy the problem before you take a performance hit.
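One way to act before memory is exhausted is to classify utilization against warning and critical thresholds. This is a minimal sketch; the function name and the 80 and 95 percent thresholds are assumptions and should be set from your own baseline.

```python
def memory_status(used_mib, total_mib, warn_pct=80.0, critical_pct=95.0):
    """Classify memory utilization as 'ok', 'warning', or 'critical'."""
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    pct = 100.0 * used_mib / total_mib
    if pct >= critical_pct:
        return "critical"
    if pct >= warn_pct:
        return "warning"
    return "ok"
```

For example, `memory_status(7000, 8192)` reports a warning at roughly 85 percent utilization, well before the server would start swapping.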
Storage Utilization
Cloud storage systems offer a wide array of options and features to select from.
What is important here is that you configure managed objects for storage volumes and their utilization.
Storage requirements will continue to grow, and by monitoring your storage operations, you can allocate additional capacity or migrate stored data to lower lifecycle tiers to take advantage of lower-cost storage options.
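To decide when to allocate more capacity, a simple projection can estimate how long a volume has before it fills. The sketch below assumes one usage sample per day and linear growth between the first and last sample; both are simplifying assumptions, and the function name is illustrative.

```python
def days_until_full(daily_used_gib, capacity_gib):
    """Estimate days until the volume is full, assuming linear growth.

    daily_used_gib: one usage reading per day, oldest first.
    Returns None when usage is flat or shrinking.
    """
    if len(daily_used_gib) < 2:
        raise ValueError("need at least two daily samples")
    growth_per_day = (daily_used_gib[-1] - daily_used_gib[0]) / (len(daily_used_gib) - 1)
    if growth_per_day <= 0:
        return None
    remaining = max(capacity_gib - daily_used_gib[-1], 0)
    return remaining / growth_per_day
```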
Patch Versions
Versions are often included in the system’s metadata or can be requested with API calls to the monitored device.
VM monitoring scripts can be created to collect versioning data on the local machine and store it on the management server.
Some cloud providers will offer instances or installable code that allows the VMs to collect local metrics and send them to a management server.
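Once version data has been collected, it can be compared with the versions recorded in the baseline. The sketch below is one possible comparison; the package names and version strings are made up for illustration, and how the reported versions are gathered is left to your monitoring tooling.

```python
def find_version_drift(expected, reported):
    """Return {package: (expected_version, reported_version)} for every mismatch.

    A package missing from `reported` is listed with a reported version of None.
    """
    drift = {}
    for package, want in expected.items():
        have = reported.get(package)
        if have != want:
            drift[package] = (want, have)
    return drift

# Hypothetical baseline and collected data.
baseline_versions = {"openssl": "3.0.13", "kernel": "5.15.0"}
collected_versions = {"openssl": "3.0.11", "kernel": "5.15.0"}
```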
Network Utilization
Congestion across the network can cause major performance degradation. High network utilization leads to dropped data packets and retransmissions that cause high network latency and poor response times.
It is important that you consider network infrastructure performance as a critical part of your metrics and benchmark process.
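Network utilization is commonly derived from interface byte counters read at two points in time. The following sketch shows the arithmetic; the function name is an assumption, and counter wraparound is deliberately not handled.

```python
def link_utilization_pct(bytes_start, bytes_end, interval_s, link_bps):
    """Return average link utilization (percent) over an interval.

    bytes_start/bytes_end: interface byte counter at the start and end.
    link_bps: link speed in bits per second.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    bits_transferred = (bytes_end - bytes_start) * 8
    return 100.0 * bits_transferred / (interval_s * link_bps)
```

Transferring 75,000,000 bytes in 60 seconds on a 100 Mbps link works out to 10 percent average utilization.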
Application Versions
For proper baseline documentation, it is important that you make valid and meaningful comparisons between baselines.
Part of this is making sure you track the versions of your applications and, if needed, of operating systems and device drivers.
If there are significant internal performance differences between application versions, they may render your baselines invalid and require that you create a new, or updated, baseline measurement to account for the new version of your application.
Enabling the Audit Process
For regulatory or corporate compliance requirements, you may be required to implement an ongoing auditing process and retain the data for record retention requirements.
This process will be specific to the application or cloud provider.
Most providers offer, as part of the metric-monitoring applications in their cloud configuration dashboard, a reporting application that meets the regulatory requirements they comply with.
Management Tool Compliance
The cloud providers will offer their own management tools and also make accommodations for you to implement your own tools or those of a managed service provider.
The cloud provider’s documentation will outline which compliance requirements its management tools meet.
If there are special requirements for HIPAA, SOX, PCI, or other industry regulations, it is your responsibility to make sure that the chosen management tools meet what is required by these regulations.
Applying Changes to the Cloud
As part of your ongoing cloud maintenance plans, you will be tracking your operations against your baselines and adjusting or troubleshooting the variations to these trend lines.
To keep your operations in range of your established baseline, you will need to make changes to your cloud deployments, some frequent and some infrequent.
Some changes may be minor, while others will be disruptive and incur system downtime in some cases.
Performance Trending
Once you have your baseline established, you will have solid data for what is considered to be normal operations of your cloud servers and services.
Using your monitoring application (whether you own it or it is provided by the cloud company), you can now continue to collect performance metrics and compare that data against your baseline.
Baseline Validation
Is your baseline actually realistic or valid? That can be a critical question as your operations will be measured against that benchmark.
The best way to validate your baseline measurements is to collect them over a long period of time to smooth out isolated events or short-term variations.
It may be helpful to compare your baselines against others that are in a similar operational state or use case.
This can help to validate that your readings are what is expected and that there are no outlier measurements.
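Smoothing over a longer window is one simple way to keep isolated events from distorting a baseline. The sketch below uses a plain moving average; the function name and the choice of technique are assumptions, and other smoothing methods would serve equally well.

```python
def moving_average(samples, window):
    """Return the moving average of `samples` using a fixed-size window."""
    if window <= 0 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    return [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]
```

The wider the window, the less a single short-term spike shifts the smoothed value.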
Service Level Agreement Attainment
Objects to be tracked should align with the SLA metrics. By collecting actual data, you can compare to the offered service levels outlined in the SLA and ensure that the guaranteed metrics are being met.
It is up to the cloud customer to track the SLA metrics to ensure the guarantees are being met by the service provider. Effective monitoring allows you to accomplish this.
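A customer-side SLA check can be as simple as comparing each measured value with its guaranteed limit. In this sketch the metric names and target values are hypothetical; actual targets come from your provider's SLA.

```python
# Hypothetical SLA targets: ("min", x) means the value must be at least x,
# ("max", x) means it must not exceed x.
SLA_TARGETS = {
    "availability_pct": ("min", 99.9),
    "latency_ms": ("max", 200.0),
}

def sla_violations(measured, targets):
    """Return the names of metrics whose measured value breaks its SLA target."""
    violations = []
    for metric, (kind, limit) in targets.items():
        value = measured[metric]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            violations.append(metric)
    return violations
```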
Compute Tuning
Should there be CPU starvation, you can be assured that your deviation from the baseline will be very noticeable! The only available solutions are to lower the load on the instance CPUs or to add computing resources.
If you have tuned your server for optimal CPU resource consumption and are still suffering from high CPU usage, then you may have no other choice but to upgrade your instance either vertically or horizontally to add CPU power.
This will involve replacing your current machine image with that of a larger offering that has additional CPU capacity or adding additional compute instances in a horizontal scaling arrangement.
Network Changes
When making deployment changes to address networking performance issues, you need to understand what you can control and what the cloud service provider controls.
Many of the networking parameters are global to the cloud and out of your area of authority.
However, it is important to understand that making sure the network performs inside your baseline and SLA is your responsibility.
Cloud companies do offer solutions for network-intensive requirements.
These include server images with 10Gbps network adapters, placement options that put all of your servers on the same hypervisor for high-speed, low-latency interconnections, and the ability to group all of the servers in the same zone and subnet.
Storage Tuning
Baselines can be used to ensure that the storage systems meet your performance requirements and allow you to track changes over time. Storage systems are often monitored for I/O utilization from the host bus adapter to the SAN.
Should there be excessive utilization, disk read/write performance will suffer, and the application’s performance will degrade.
The solution is to increase the bandwidth of the adapter on the VM.
Using metrics that define what will be considered to be high utilization of storage network bandwidth over a defined period of time, automated actions can be performed to remedy the issue, or you can upgrade your machine image to one that is optimized for storage operations.
Service/Application Changes
Changing or upgrading services and applications may be required to add new capabilities and features that are beneficial or to maintain your cloud deployment compliance with changing regulations or corporate policy.
If you are implementing a SaaS solution, then the cloud provider will be responsible for any changes made to the application.
Services such as load balancers, firewalls, DNS, identity management, and virtual private clouds also undergo frequent upgrades to keep pace with the competition and to add features.
These services are largely the responsibility of the cloud provider and to your benefit as you get to take advantage of the new capabilities and offerings with little effort on your part.
Meeting Expected Performance/Capacity Requirements
As your cloud operations grow and evolve, so may your requirements for additional capacity and higher levels of performance.
This will mean that you will be making changes to your deployment over time to meet your current and future capacity needs.
Vertical Scaling
Some applications, such as many types of databases, are not designed to scale horizontally.
You will need to scale to a larger machine image to increase resources for the application.
When vertical scaling occurs, the existing machine is replaced with a larger instance type that you define.
Depending on your requirements, you may add CPUs, memory, or network bandwidth for storage or LAN traffic.
Horizontal Scaling
Horizontal scaling is the process of adding cloud capacity by adding systems to your current server fleet. Vertical scaling, by comparison, replaces servers with a larger instance that meets your new requirements.
Horizontal scaling works well for applications that are designed to work in parallel such as web servers.
You keep your existing server instances and add more to increase capacity.
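For horizontal scaling, the fleet size can be estimated from the expected load and the capacity of one instance. This is a sketch under stated assumptions: load and capacity are expressed in the same unit (for example, requests per second), and the 70 percent target utilization is an illustrative headroom figure.

```python
import math

def instances_needed(total_load, per_instance_capacity, target_utilization=0.7):
    """Return how many identical instances are needed to carry total_load
    while keeping each instance at or below target_utilization."""
    if per_instance_capacity <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    usable_capacity = per_instance_capacity * target_utilization
    return max(1, math.ceil(total_load / usable_capacity))
```

With a load of 1000 and instances rated at 200 each, eight instances keep every server under the 70 percent target.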
Cloud Accounting, Chargeback, and Reporting
Cloud management includes accounting mechanisms for measuring consumption, billing, and generating management reports.
In this section, you will learn about these nontechnical but important aspects of cloud operational management.
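A chargeback report boils down to multiplying measured consumption by a rate and totaling it per consumer. The sketch below assumes usage records of the form (department, instance type, hours) and uses made-up rates in cents; real rates and record formats depend on your provider and company policy.

```python
from collections import defaultdict

# Hypothetical hourly rates in cents per instance type.
RATES_CENTS_PER_HOUR = {"small": 5, "large": 20}

def chargeback_by_department(usage_records, rates=RATES_CENTS_PER_HOUR):
    """Total the cost in cents per department from (department, type, hours) records."""
    totals = defaultdict(int)
    for department, instance_type, hours in usage_records:
        totals[department] += rates[instance_type] * hours
    return dict(totals)

records = [
    ("sales", "small", 100),
    ("sales", "large", 10),
    ("engineering", "large", 50),
]
```

In a showback model the same totals would be reported to each department without actually billing them.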
Company Policy Reporting
Companies will publish and manage IT policies; these policies cover a wide range of subjects including, but not limited to, how cloud services are consumed and accounted for.
To effectively measure compliance, you will need to collect the required data and be able to process the information into effective reports.
Cloud providers are aware of policy reporting and offer services to assist you in collecting and presenting reports.
These services are cloud based and can be remarkably customizable. Reports are presented graphically in a web browser dashboard.
Reporting Based on SLAs
There is a close relationship between collecting data into baselines and then measuring them against your SLA to ensure compliance.
Management services allow you to compare these two metrics and generate reports that can be used to analyze trends, identify problems, and store data for regulatory or corporate compliance purposes.
Cloud Dashboards
Cloud dashboards are incredibly useful and informative. It is common to display dashboards in operations centers or overhead in office environments to give an easy-to-read overview of operations.
Dashboards are usually graphical and color-coded for quick notification of potential issues.
Dashboards are offered by the cloud providers, your internal monitoring and management applications, and any outside monitoring services you have contracted with.
They allow you to define what you want to display and in what format. Dashboards are completely customizable and rather easy to configure.
Elasticity Usage
One of the great benefits cloud computing offers is elasticity and the flexibility automation offers in adding and removing capacity.
Elasticity events often incur charges and are also important to monitor to ensure that your cloud operations are scaled correctly.
Management applications can generate usage reports on a large number of events including elasticity.
Metrics such as the event time and duration are recorded as well as details of the added capacity and utilization metrics that were collected after the scale-up or scale-down events occurred.
Connectivity
Reports and graphical presentations can be created to show connections over time, location, new or returning visitors, what was performed (did they buy anything on your e-commerce site?), and how long they were visiting.
This is valuable data for sales, marketing, and accounting.
Latency
Network delays and slowdowns can have an extremely negative effect on cloud operations.
Latency in the network can come from many different sources; however, individual cloud bottlenecks all add up to latency, and the end result is frustrated employees and customers.
Metrics, benchmarks, SLAs, and proactive maintenance all come together to keep latency low and performance high.
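Latency is often reported as a percentile rather than an average, so that a few slow requests are not hidden. The sketch below uses the nearest-rank method; the function name is an assumption, and monitoring tools may use a different interpolation.

```python
import math

def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples or not 0 < pct <= 100:
        raise ValueError("need samples and 0 < pct <= 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

Calling `percentile(latencies_ms, 95)` gives a value that 95 percent of the measured requests met or beat.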
Capacity and Utilization
Capacity and utilization reporting can include a wide range of metrics including storage, CPU, RAM, network, and so on.
These reports are helpful in managing usage trends and change requirements.
Accounting will be interested to see that the capacity purchased is being used effectively.
As with the other measurements, capacity reports are customizable and offered in a variety of formats.
Incident and Health Reports
Tracking support services and impairments will give you insight into the overall reliability of operations, and the collected data can be compared to your SLA to ensure compliance.
Incidents can be defined by your company or the cloud provider as required.
Incidents and health reports include trouble tickets opened, support engagements, and any event that causes degradation of your services.
Uptime and Downtime Reporting
A critical and bottom-line metric of any SLA is that of downtime. If you cannot access your cloud deployment, that is a critical event and must be tracked.
Both the cloud provider and your operations center should track downtime and identify the root cause of each outage.
These reports can be analyzed to ensure SLA metrics are being met and to see if you have to change your architecture to design for higher reliability and less downtime.
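Uptime reporting usually reduces to two calculations: the availability actually achieved and the downtime an SLA permits. The sketch below shows both; the function names are assumptions, and the 30-day month is only an example reporting period.

```python
def availability_pct(total_minutes, downtime_minutes):
    """Return achieved availability as a percentage of the reporting period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def allowed_downtime_minutes(total_minutes, sla_pct):
    """Return the downtime budget implied by an availability target."""
    return total_minutes * (100.0 - sla_pct) / 100.0

MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes
```

For a 99.9 percent target over 30 days, the downtime budget is about 43.2 minutes; anything beyond that is an SLA miss worth investigating.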