LectureWeek1-2.pdf

BUS308 – Week 1 Lecture 2

What this lecture covers

This lecture focuses on describing data, and how these descriptions can be used in an analysis. It also introduces and defines some specific descriptive statistical tools and results.

Descriptive Statistics

OK, let’s start making sense out of what appears to be a big, messy, and overwhelming amount of data. Generally, the first thing we want to do with our data – our clues as to what the data is hiding from us – is to summarize the data into descriptive statistics. In general, descriptive statistics provide information in four areas:

• Location: these show central tendencies and include mean, mode, and median; • Consistency: these show the variability in the data and include range, variation, and

standard deviation; • Position: these measures show relative placement of data, where a particular data point

lies within the data set, and include measures such as z-score, percentile, quartile; and • Likelihood: these show how common or rare a particular outcome is, and involve

probability estimates such as, empirical, theoretical, subjective probabilities.

Note that these are not the complete list of possible descriptive statistics. Excel’s Descriptive Statistics function includes a couple of measures that focus on data distribution shape. These have some specialized uses that we will not be getting into.

Location Measures

Perhaps the most often asked question about data sets is what is the average? Unfortunately, average is a somewhat imprecise term that could mean all three of our measures of location (AKA central tendency) identified above. So, as analysts we tend to be more precise and use mean, median, and mode.

While these all tell us something about where the data might be clustered, they can provide very different views of the data. An example of this comes from an example the author heard back in High School. At that time, the mean per capita income for citizens of Kuwait was about $25,000; the median income was around $125; and the mode was $25! The very high (due to oil revenues) income of the Royal family accounted for much of this difference, but just look at the different impressions we get about the country depending on which value we look at.

• Mean, AKA the most typical meaning of average, is the sum of all the values divided by the count. For example, the mean of 1, 2, 3, 4, and 5 = 1+2+3+4+5/5 = 15/5 = 3. The mean is generally the best measure for any data set as it uses all of the data values, and requires interval or ratio level data.

• The median is the middle value in an ordered (listed from low to high) data set. For example, the median of 1, 2, 3, 4, and 5 = 3, the middle value. If we have an even

number of values, the median is the average of the middle two values. Medians can be found on ordinal, interval, or ratio level data.

• The mode is the most frequently occurring value. A data set may have no modes or one or more. Modes may occur with any level of data. The data set 1,1,2,2,2,2,3,8,8,9 has a primary mode of 2, and two secondary modes of 1 and 8.

Consistency/Variation Measures

While they do not have the popularity of their location cousins, knowing the consistency or variation within the data is as important, some say even more important, as knowing the central tendency for us to understand what the data is trying to tell us. Very consistent data, with little variation, has a mean that is very representative of the data and is unlikely to change much if we resample the population. Data with a large amount of variation tends to have unstable means, meaning that these values would change a lot with multiple samples. Inconsistent data (having large variation) is often a problem for businesses, particularly for manufacturing operations, as it means the results they produce differ, and might often not meet the quality specifications. Predictions based on data with large variations are rarely useful. Consider attempting to estimate how long it would take you to get to work if your route had frequent traffic accidents that made the travel time different every day.

The key measures of variation are:

• Range, which equals the maximum value minus the minimum value. For our example data set of 1, 2, 3, 4, and 5, the range is 5 – 1 = 4.

• Variance, which is the average of the square of sum of the differences between each value in the data set from the mean. To get the variance, find the mean of the data, subtract this value from each of the data points, square this result (to get rid of the negative differences), add them up and divide by the total count. For our example data set, this would look like:

Value Mean Difference Squared

1 3 -2 4 2 3 -1 1 3 3 0 0 4 3 1 1 5 3 2 4 Sum = 10

Variance = 10/5 = 2 The problem with variance is that it expressed as units squared. So, if our data set were dollars, the variance would be 2 dollars squared – how should we interpret dollars squared?

• Standard Deviation is the (positive) square root of the variance. It returns the dispersion measure back to one that is in the same units as the original data, so we can compare it to the data values. For our example, the standard deviation is the square root of 2 dollars squared, or 1.4 dollars. This much easier to understand measure

means that the average difference from the mean is 1.4 dollars (in our example above having a mean or average value of 3 dollars).

• Important point about the variance and standard deviation. When we find these values for a population, the entire group we are interested in, we divide the numerator by the sample size. However, when we have a sample of the entire group (and want to use this sample to estimate the population value for either variance or standard deviation), we divide the numerator by the (count – 1). This is an adjustment that increases the estimate to take into account we most likely do not have the extreme low and extreme high value from the population in our sample, so its variation is less than the group we are using the sample to describe.

Applying the Information

Equal Pay Questions

OK, we can now start looking at our clues to see what they are hiding. As with all analysis, we start with questions, then identify the tools to use for those questions, and finally apply those tools to the data. Our initial question is, do males and females get equal pay for equal work? We also said we needed to start with the question of whether or not we had some measures that showed pay comparisons between males and females. Let’s take a look at some of the group and sub-group data. A couple of measures that might answer this question are:

• What are the group averages for each variable? • What are the average male and female compa-ratios? Salary? • How consistent are the compa-ratios for each? How consistent are the salary measures?

Note that we will be focusing on the compa-ratio data in the lectures, while you will focus on the same questions using salary in the weekly homework assignments. As described, compa-ratio is the result of dividing an employee’s salary by their grade midpoint. It generally ranges from about 0.80 to 1.20 in most pay plans. The value of this measure is it removes the impact of different grades (each of which we are assuming are different levels of work from other grades, and contain equal work for all the jobs within the grade). While not a perfect measure, it is the start of measuring what is paid for equal work. Side note: a grade’s midpoint is generally pegged to the average market pay needed to hire new employees into a job.

How do we start?

If you have not already done so, please download the Excel Student assignment file to your computer. This file is in the Required Resources link. If you have not loaded the analysis toolpak, please do that now as well. (See the BUS308 material sent out before the course started for instructions on how to do this.)

The lectures will use the data set provided in Lecture 1. As mentioned, this is similar to, but not identical to the data set you will use in the weekly homework assignments. A note about your homework. All the information we need to answer the weekly assignments – and, eventually find the answer to our mystery – is located in the Student Assignment file, in the Data tab. You should download this file, and save it on your computer for best results. The other best

practice is to NEVER manipulate the original data set found in the data tab. Doing so runs the risk of “messing” up the data relationships – the links between each of the variables to specific employees. For example, if you sort the gender variable by gender, we lose the link between which employee are male and female – an important piece of information. What we should do is to copy and paste the appropriate variable columns to the worksheets for each weekly assignment; even a couple of times if we need to.

Setting up the data

The first step in describing data, for both descriptive and inferential statistics, involves setting up the data so that Excel can access only the data variables you want to consider for each question. This is quite important for, as an old computing expression goes, GIGO (garbage in, garbage out). One key rule: Never mess with the original data set. Copy and paste it to the work sheet rather than doing any work with the original data set. This way, if something goes wrong and the data gets lost or mixed up, we can always go back to the original.

In Excel, we can copy and paste data using either the CTRL-C and CTRL-V keys or use the mouse to highlight the range and to cut and then paste. For those not familiar with these approaches, here is a short (8:10 minutes) video on cutting and pasting data. (Note: for purposes of this video example, only a portion of each data range is used – this allows the sorting example to be shown more clearly. In the examples below and in your homework, use all 50 values.)

Video Links:

Here is a link to a video on copying, pasting, and sorting in Excel (8:10 minutes):

https://screencast-o-matic.com/watch/cbQX3iII2K.

(You might have to click on the arrow on the video to get it started and/or enlarge the viewing window)

Examples

Note: The following discussion will mirror the questions and activities that you will be asked to do in the week 1 assignment. However, while you will be asked to analyze salary in the assignments, this example (and the others in the succeeding weeks) will focus on the second measure of pay, the compa-ratio. Remember, the data used in the lectures is from a different sample and therefore is not exactly the same as the data you will use in the assignments. However, the conclusions reached in each lecture example should be considered when you make your conclusions about what the data tells us each week.

Question 1

A suggestion, before starting with any question read all the parts and identify the variables needed. The first question for the week 1 assignment asks for some descriptive statistics on the compa-ratio variable in the overall group, and then, in part b, for the male and female subgroups.

So, since we need the compa-ratio and gender variables (working with Gender1 is easier) for this question, we need to copy and paste these two columns (compa-ratio and Gender1) from the data tab to the right of our answer space, such as columns Q and R – don’t forget to include the labels in row 1 when you copy. Note: It is generally preferable to set up data for each question separately. This prevents our answers from changing if we re-sort data for a different question.

Next sort the two columns using the Gender1 column and the Custom sort option in the Sort option in the editing box. Be sure NOT to include the labels in your highlighted sort range. The completed sort will look like this (this example shows compa-ratio and gender rather than your assignment variables of salary and gender).

Now that our data is ready, we can move on to finding our desired values.

Part 1 asks for the descriptive statistics for the entire compa-ratio group, and asks that you use the Analysis Toolpak function Descriptive Statistics to do find these values.

Video Link: Here is a link to a video on using the Analysis Tool Pak: https://screencast-o- matic.com/watch/cbQX35IID7

Click on the Data tab in the main ribbon, then in the Analysis tab on the right click on Data Analysis, scroll down to Descriptive Statistics.

Once Descriptive Statistics is highlighted, left click on OK, and the data entry box will open. Enter the data range for the variable of interest. See the screenshot example below.

Below is a completed example of descriptive statistics for the single variable compa-ratio for our Excel file. (In your homework you should have Salary in your column Q, for this exact set-up to give you the asked for result in part a.)

Clicking on OK, will produce the outcome. Part a asks for us to highlight 3 of the statistics, and the following shows the outcome. (Note that we changed the alignment of cell

K25 to align right to show the full cell name.)

Part b asks for the same information, but asks us to use the Fx functions to find each value separately. One important part of this question is how we show the results. Whereas the data from the descriptive statistics output shows actual numerical values, the outputs using the Fx formulas need to show the formula in the cells (example: =average(Q2:Q26)) rather than simply the numerical outcome; this is because part of the assignment requires us to show that we can use these functions rather than copying the values. The values for part b will not be the same as in part a since we are asking for the statistics by gender group rather than the entire sample as in part a. In the following Excel formulas the range would be the values in column Q that relate to the gender being asked for, example for females the range would be Q2:Q26 (a range of the full 25 values).

The mean or average is found in Excel using =average(range)

The Sample standard deviation (a measure of variation or spread within the data) is found using: =stdev.s(range)

The range is found by: =max(range)-min(range)

Video Link: Here is a link to a video on using the functions in Fx: https://screencast-o- matic.com/watch/cbQlnxIIUf

Pease ask your instructor if you have any questions about this material.

When you have finished with this lecture, please respond to Discussion thread 2 for this week with your initial response and responses to others before reading the third lecture for the week.