problem set

profileNei
L3.pdf

BUS308 – Week 1 Lecture 3

Setting Up Excel: Descriptive Data

Expected Outcomes

After reading this lecture, the student should be familiar with:

1. How to copy and paste data into columns on the assignment pages. 2. Excel’s Analysis Toolpak. 3. How to use the Descriptive Statistics function within the Analysis Toolpak. 4. Excel’s FX (or Formulas) functions. 5. How to use Excel formulas in cells to obtain numerical results.

What this lecture covers

This lecture focuses on using Excel’s functions and tools to develop numerical answers to specific data related questions. Our data detectives started this week with the whiff of a possible “crime” or issue for our company. We looked at some of the logic and reasons for using statistics to uncover hidden meaning in data as well as clues in the first lecture. We then moved on to look at how to examine and interpret meaning behind summary statistics (for location, consistency, probability, and location) and found that most value comes from comparing measures rather than simply examining them without any reference points. We are now moving to close the loop and provide you with some tools to summarize and condense data sets into meaningful summaries using Excel.

This lecture covers the detective tools available through Excel. If you are not familiar with Excel, you will be amazed at what it can do for us. If you are familiar with Excel, you might not know the statistical analytical power it has (at least that is what past classes have said). Side note – much of what we will cover with Excel provides an excellent starting point for classes in accounting, finance, and operations management (to name a few).

How do we start?

If you have not already done so, please download the Excel Student assignment file to your computer. This file is in the Required Resources link. If you have not loaded the analysis toolpak, please do that now as well. (See the BUS308 material sent out before the course started for instructions on how to do this. This material was sent to your school email address.)

Students in previous sections recommend this video on loading the Toolpak (note, this video continues on to show the use of descriptive statistics as well – not yet needed):

Video Link: http://www.youtube.com/watch?v=4_9vGqQaCFk

The lectures will use the data set provided in Lecture 1. As mentioned, this is similar to, but not identical to the data set you will use in the weekly homework assignments. A note about your homework. All the information we need to answer the weekly assignments – and, eventually find the answer to our mystery – is located in the Student Assignment file on the Data tab. You should download this file and save it on your computer for best results. The other best

practice is to NEVER manipulate the original data set found in the data tab. Doing so runs the risk of “messing” up the data relationships – the links between each of the variables to specific employees. For example, if you sort the gender variable by gender, we lose the link between which employee are male and female – an important piece of information. What we should do is to copy and paste the appropriate variable columns to the worksheets for each weekly assignment; even a couple of times if we need to.

Setting up the data

The first step in describing data, for both descriptive and inferential statistics, involves setting up the data so that Excel can access only the data variables you want to consider for each question. This is quite important for, as an old computing expression goes, GIGO (garbage in, garbage out). One key rule: Never mess with the original data set. Again, always copy and paste it to the work sheet rather than doing any work with the original data set. This way, if something goes wrong and the data gets lost or mixed up, we can always go back to the original.

In Excel, we can copy and paste data using either the CTRL-C and CTRL-V keys. First use the mouse to highlight the range you want to copy, then use CTRL-C (the CTRL and C key depressed at the same time). Then go to the worksheet and cell you want to place the data and use CTROL-VB to paste. For those not familiar with these approaches, here is a short (8:10 minutes) video on cutting and pasting data. (Note: for purposes of this video example, only a portion of each data range is used – this allows the sorting example to be shown more clearly. In the examples below and in your homework, use all 50 values.)

Video Link: https://screencast-o-matic.com/watch/cbQX3iII2K

(You might have to click on the arrow on the video to get it started and/or enlarge the viewing window)

Examples

Note: The following discussion will mirror the questions and activities that you will be asked to do in the Week 1 assignment. However, while you will be asked to analyze salary in the assignments, this example (and the others in the succeeding weeks) will focus on the second measure of pay, the compa-ratio. Remember, the data used in the lectures is from a different sample and therefore is not exactly the same as the data you will use in the assignments. However, the conclusions reached in each lecture example should be considered when you make your conclusions about what the data tells us each week.

Question 1: Summary Statistics

A suggestion, before starting with any question read all the parts and identify the variables needed. The first question asks for some descriptive statistics on the SALARY variable in the overall group, and then, in Part b, for male and female subgroup specific descriptive statistics.

Remember, your assignments work with Salary and these lecture examples work with Compa-ratios. So, whenever the lecture examples mentions “compa-ratio” you will be doing the

same thing only using “salary” in your assignments. (I know this is beginning to sound repetitive, but it is an issue with some students who use the compa-ratio in the homework and lose points for doing this.)

Lecture Example. Question 1 asks for some descriptive statistics, so we first need to copy and paste the data into our worksheet. Since we need the compa-ratio and gender variables (working with Gender1 is easier) for this question, we need to copy and paste these two columns (compa-ratio and Gender1) from the data tab to the right of our answer space, such as columns Q and R – don’t forget to include the labels in Row 1 when you copy. Note: It is generally preferable to set up data for each question separately. This prevents our answers from changing if we re-sort data for a different question.

Next, sort the two columns using the Gender1 column and the Custom sort option in the Sort option in the editing box. Be sure NOT to include the labels in your highlighted sort range. The completed sort will look something like the following example on the right (this example shows compa-ratio and gender rather than your assignment variables of salary and gender).

Now that our data is ready, we can move on to finding our desired values.

Part 1 asks for the descriptive statistics for the entire compa-ratio group and asks that you use the Analysis Toolpak function Descriptive Statistics to do find these values.

Video Link: Here is a link to a video on using the Analysis Tool Pak: https://screencast-o- matic.com/watch/cbQX35IID7

Click on the Data tab in the main ribbon, then in the Analysis tab (If this is not visible, the Analysis Toolpak has not been loaded.) on the right click on Data Analysis, scroll down to Descriptive Statistics.

Once Descriptive Statistics is highlighted, left click on OK, and the data entry box will open. Enter the data range for the variable of interest. See the screenshot example below.

Below is a completed example of descriptive statistics for the single variable compa-ratio for our Excel file. (In your homework you should have Salary in your column Q, for this exact set-up to give you the asked for result in Part a.) In the input range, place the range that contains the variable you want to describe. Either highlight the range or enter the numbers manually. Make sure you have clicked on the grouped by “column” button.

It is generally a good idea to include the label in input ranges, if you do – and ONLY if you include the label – be sure and click the Labels in first row box. (Note, if this box is checked and the label is not included, the first data value will show up as a label. This is a strong hint that the input range was not correct.)

Then click on the button in front of Output Range and enter the cell where you want the table to start. For most questions, this cell will be given to you. For this question, enter K19 in the box. Then look at your output options. This question only asks for the summary statistics button to be checked for the output, but you are welcome to select the others as well.

Clicking on OK, will produce the outcome. Part a asks for us to highlight three (3) of the statistics, and the following shows the outcome. (Note that we changed the alignment of cell K25 to “align right” to show the full cell name. You do not have to do this.)

Part b asks for the same information but asks us to use the Fx functions to find each value separately. One important part of this question is how we show the results. Whereas the data from the descriptive statistics output shows actual numerical values, the outputs using the Fx

formulas need to show the formula in the cells (example: =average(Q2:Q26)) rather than simply the numerical outcome. This is because part of the assignment requires showing mastery using Excel functions rather than copying the values. The values for part b will not be the same as in part a since we are asking for the statistics by gender group rather than the entire sample as in part a. In the following Excel formulas, the range would be the values in column Q that relate to the gender being asked for, example for females the range would be Q2:Q26 (a range of the full 25 values).

The mean or average is found in Excel using =average(range)

The Sample standard deviation (a measure of variation or spread within the data) is found using: =stdev.s(range)

The range is found by: =max(range)-min(range)

The range of cells will be the same in each of the three (3) questions for the Female and Male columns (each column having a different range of cells that relate to their salary values.)

Video Link: Here is a link to a video on using the functions in Fx: https://screencast-o- matic.com/watch/cbQlnxIIUf

Position Issues

Question 2. This question moves from descriptive statistics to location measures – telling us where within a data set we will find a value. These are good for comparison activities. Question 2 asks for a 5-number summary for the entire salary range, as well as for each gender (male and female).

The values for each element in the 5-number summary are easily developed using the Fx (or Formulas) function – under the Statistical group.

The Maximum value (Max) is simply; =MAX(Range), for example =MAX(Q2:Q51).

The Third Quartile (3rd Q) is: =PERCENTILE.EXC(Range, 0.75)

The Midpoint or median is: =MEDIAN(Range)

The First Quartile (1st Q) is: =PERCENTILE.EXC(Range, 0.25)

The minimum value (Min) is” =MIN(Range).

Replacing “Range” in each formula with the appropriate salary data range for the columns labeled “Overall,” “Males,” and “Females” will complete the table.

Here is an example using compa-ratio of setting up the 5-number summary. The female column shows the Excel formulas and ranges used to find the respective values. The only difference existing in the other columns is the range – adjusted for overall or male values.

One use of these two questions is to compare values – who has the highest values, the lowest, where are midpoints relative to each group, etc. One interesting issue is seeing where the values are located within the range.

Question 3. The third question asks us to examine a specific value, in this case the male and female midpoints, and see where they are located within the entire salary data range. This gives us a feel for where the two groups are distributed within the entire data set. Here is a screen shot showing the cell formulas used for the female values.

Excel’s PERCENTRANK.EXC(range, specific data value) provides the rank of any specific data point within a specified data range. The rank is shown as a 3-digit decimal that ranges from 0 at the bottom and 1.000 at the top.

Z-Score. The second row in this question asks for the z-score or Z-value. This is a relative measure telling us (in standard deviations) how far from the mean a particular value lies. It involves the (a) value we are interested in, the mean of the distribution, and (c) standard deviation. These values for the entire data range were developed in Question 1 with the

descriptive statistics function. We can find a z-value in several ways. Letting Excel do the math for us, we could enter into the cell for Male z-value of 1.149 the following formula: =(1.149 – 1.056)/0.08379; this gives us a value of 1.11 (rounded to two (2) decimal places, traditional for z scores). We could also have used cell values and entered =(K35 – D29)/D30 to get a value of 1.11 rounded. However, if we look at more decimal places we will find a slight difference. This difference is due to Excel’s rounding of the compa-ratio values, the actual mean and standard deviation are slightly larger than the rounded values shown in the table output. The preference is to use cell references in formulas.

The easier way, and the approach asked for in the homework, to find z-values is to use the Fx function =standardize(x, mean, standard_deviation);

The STANDARDIZE(value, mean, standard deviation) provides a z score for a specific value with a given data set mean and standard deviation. Only the range and cell location of the female compa-ratio information would be changed to get the female results.

Question 4

Part a in the next question is about the empirical probability that a randomly selected salary would exceed either the male and female midpoint our value found in Part a. Empirical probably is a simple count outcome, the number of successful outcomes divided by the total number of possible outcomes.

Our lecture example uses compa-ratios. For each gender, we have 50 possible salary values that could be selected at random. What we need to know is, for each gender, how many employees have a compa-ratio as large or larger than then cut-off value we found are using (the gender midpoint). The labor-intensive approach is to look at the sorted data and count (that is a pain in the neck). Excel will do this for us.

Empirical Probability. Excel has several counting functions found in the statistical list of Fx functions. The one we want is =countif (range, “criteria”). We know the range for the entire salary list, what we need is the criteria. Countif lets us use =, >, or < when we are asking about a specific value, greater than (>) than a specific value, or less than (<) a specific value); it does not let us use => (equal to or greater than) or <= (less than or equal to). When we want these, we must find the next smallest or largest value and use one of the simpler criteria. Finding the next smaller number could be complicated in long unsorted lists.

If we use a slightly more complicated screening criteria, we can use a cell reference in a COUNTIF function. Here is a screen shot of how we can get an empirical probability and the related normal curve probability (a theoretical probability) for the male and female midpoints.

Note how the COUNTIF function is presented. It needs to be laid out in this format if we use cell references. Note that we are using the entire data range (T2:T51) as we are looking for each midpoint in relation to all salaries. The second part is a technical requirement (“>=”&) for COUNTIF to work with a cell reference (the G40 in this example, the Female midpoint location). This gives us a count of how many values equal or exceed our key value (Female midpoint is located in G40). We are dividing this count by the total of 50, to find the empirical probability (count/total) for our value.

The second row finds our normal curve probability. NORM.S.DIST(z-score, 1) provides the area or probability under the normal curve for a given z-score. Since our z-score represents the location of the midpoints within the normal curve (found in Q3 and cell J60 for females), NORM.S.DIST gives us the probability of equaling or being less than this value. So, to get the probability of equaling or exceeding this score we want the area to the right of our score which is 1- probability of below. Note, since the normal curve is considered to range from minus infinity to plus infinity, the probability of a single point is so infinitesimally small to be 0 for all practical purpose. This allows us to say Probability of (“at or above”) = 1-Probability of (at or below”) and use that point in both formulas.

Question 5

The final question this week, as with every week, asks us to consider the meaning of our findings. It asks for a review of the findings from the lecture’s review of the compa-ratio data, a review of your findings on the same questions using the salary data, and then a conclusion that incorporates both findings as they relate to our question about equal pay for equal work.

When we look at our descriptive statistics, we see that all the measure outcomes; means, standard deviations, ranges, and probabilities, are fairly close together. This suggests that the compa-ratio distributions might be similar and we have no equal pay issue overall. So, at first pass we see that, when grade differences are held constant (or eliminated) as the compa-ratio does; the females seem to be paid slightly more relative to their midpoint than males are. At the

same time, they are a bit less spread out, or more consistent, within their data group than the males are. We can also see that the overall average for each measure falls between the group values. The range, high, and low values also show that the males generally have less consistency as well as the smaller low and high values compared to females.

Now, while possibly insightful, as with any analyst we need to withhold making a final determination until we find some additional information. Two other initial questions pop up at this point. The first, are these differences meaningful or just sample errors, meaning if we took another sample would we get values that were closer together or even reversed? This issue will be looked at next week. The second question is are we sure we have a measure that measures pay by equal work? If not, then average compa-ratio might not be telling us anything. This issue is examined throughout the course.

So, what can we say at this point? It appears that males and females have about the same range and standard deviations for compa-ratios, but that females appear to average a bit higher than the males. However, at this point, we cannot say anything about our equal pay for equal work question as the compa-ratios may not be the best measure of equal work. So, at this point we have some interesting information but no conclusive results yet.

Summary

This lecture focused on the Excel tools needed to produce the results we examined in Lecture 2. The Data Analysis function located on the Data ribbon was presented. Additionally, several descriptive statistics tools in the Fx link or the Formulas ribbon were illustrated. These included:

• Descriptive statistics • Average • Median • Mode • Max and Min • Standard Deviation • Quartile • Percentrank.exc • Standardize.

These tools can be incorporated into the review sheets easily.

and

Please ask your instructor if you have any questions about this material.

When you have finished with this lecture, please respond to Discussion Thread 3 for this week with your initial response and responses to others over a couple of days.