
BUS308 – Week 1 Lecture 2

Describing Data

Expected Outcomes

After reading this lecture, the student should be familiar with:

1. Basic descriptive statistics for data location
2. Basic descriptive statistics for data consistency
3. Basic descriptive statistics for data position
4. Basic approaches for describing likelihood
5. Difference between descriptive and inferential statistics

What this lecture covers

This lecture focuses on describing data and how these descriptions can be used in an analysis. It also introduces and defines some specific descriptive statistical tools and results. Even if we never become data detectives or run statistical tests, we will be exposed to, and bombarded with, statistics and statistical outcomes. We need to understand what they are telling us and how they help uncover what the data means for the “crime,” AKA the research question or issue.

How we obtain these results will be covered in lecture 1-3.

Detecting

In our favorite detective shows, starting out always seems difficult. They have a crime, but no real clues or suspects, no idea of what happened, no “theory of the crime,” etc. This is much where we stand at this point with our question on equal pay for equal work.

The process followed is remarkably similar across the different shows. First, a case or situation presents itself. The heroes start by understanding the background of the situation and those involved. They move on to collecting clues and following hints, some of which do not pan out to be helpful. They then start to build relationships between and among clues and facts, tossing out ideas that seemed good but lead to dead-ends or non-helpful insights (false leads, etc.). Finally, a conclusion is reached and the initial question of “who done it” is solved.

Data analysis, and specifically statistical analysis, is done in much the same way, as we will see.

Descriptive Statistics

Week 1 Clues

We are interested in whether or not males and females are paid the same for doing equal work. So, how do we go about answering this question? The “victim” in this question could be considered the difference in pay between males and females, specifically when they are doing equal work. An initial examination (Doc, was it murder or an accident?) involves obtaining basic information to see if we even have cause to worry.

The first action in any analysis involves collecting the data. This generally involves conducting a random sample from the population of employees so that we have a manageable data set to operate from. In this case, our sample, presented in Lecture 1, gave us 25 males and 25 females spread throughout the company. A quick look at the sample by HR provided us with assurance that the group looked representative of the company workforce we are concerned with as a whole. Now we can confidently collect clues to see if we should be concerned or not.

As with any detective, the first issue is to understand the “who” and “what” about the victim. In this case, we need to use our sample to understand basic information about how males and females are paid. Understanding data sets typically involves looking at several characteristics. These descriptive measures describe the data set. Typical descriptive measures include:

• Measures of location such as the average (AKA mean), the median (middle point), and mode (most often occurring value if it exists).

• Measures of consistency such as range (largest value minus the smallest value), variance, and standard deviation.

• Measures of position showing where a single data point is within the data set, such as percentile and rank.

• Measures of likelihood showing the probability of obtaining specific outcomes.

Note: Descriptive statistics describe a particular data set and can only be used for that data set. However, often we want to use a sample to “infer” back to a larger population. In this case, we would use inferential statistics. Most measures, except for variance and standard deviation, are calculated the same way. We will see the specific difference for those two later in this lecture.

The key to whether we have descriptive statistics or inferential statistics lies with the group we are taking the measures on. If we are only concerned with that group, we use descriptive statistics. If, however, we want to use that group to make inferences, claims, and conclusions about a larger population, then we take a random sample from the population and use inferential statistics (allowing us to infer back to the population). Our class data sets – both the lecture and homework – are random samples from a larger population, so we will basically be using inferential statistical measures.

Note that this is not the complete list of possible descriptive statistics. Excel’s Descriptive Statistics function (described in Lecture 3 for this week) includes a couple of measures that focus on data distribution shape. These have some specialized uses that we will not be getting into.

Location Measures

Perhaps the most often asked question about data sets is what is the average? The intent is to get a measure that shows us the center of the data. Unfortunately, average is a somewhat imprecise term that could mean all three of our measures of location identified above. So, as analysts we tend to be more precise and use mean, median, and mode.

While these all tell us something about where the data might be clustered, they can provide very different views of the data. An illustration of this comes from an example the author heard back in high school. At that time, the mean per capita income for citizens of Kuwait was about $25,000; the median income was around $125; and the mode was $25! The very high (due to oil revenues) income of the Royal family accounted for much of this difference, but just look at the different impressions we get about the country depending on which value we look at.

• Mean, AKA average, is the sum of all the values divided by the count. This can be considered the “weighted center” of the data set. For example, the mean of 1, 2, 3, 4, and 5 = (1+2+3+4+5)/5 = 15/5 = 3. The mean is generally the preferred measure because it uses all of the data values, but it requires interval or ratio level data. Thus, while we can average salary, compa-ratio, seniority, etc., we cannot average gender or gender1 (even if one is coded in numbers) or grade in our data set.

• The median is the middle value in an ordered (listed from low to high) data set. This is the “physical center” of a data set. For example, the median of 1, 2, 3, 4, and 5 = 3, the middle value. If we have an even number of values, the median is the average of the middle two values. Medians can be found on ordinal, interval, or ratio level data.

• The mode is the most frequently occurring value. This is more or less the “popular center” as it is where most numbers group together. A data set may have no modes or one or more. Modes may occur with any level of data. The data set 1,1,2,2,2,2,3,8,8,9 has a primary mode of 2, and two secondary modes of 1 and 8.
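The three location measures can be checked with a few lines of code. The following is a minimal sketch using Python's standard statistics module rather than Excel (the variable names are our own); it reuses the small example data sets from the bullets above.

```python
import statistics

# Example data set used throughout this lecture
values = [1, 2, 3, 4, 5]
mean_value = statistics.mean(values)      # sum of the values divided by the count
median_value = statistics.median(values)  # middle value of the ordered list

# With an even number of values, the median is the average of the middle two
even_values = [1, 2, 3, 4]
even_median = statistics.median(even_values)

# Mode example from the text: 2 occurs most often
mode_values = [1, 1, 2, 2, 2, 2, 3, 8, 8, 9]
primary_mode = statistics.mode(mode_values)

print("mean:", mean_value)
print("median:", median_value)
print("median (even count):", even_median)
print("primary mode:", primary_mode)
```

For the example data the mean and median are both 3, which is what we expect from evenly spaced values; the Kuwait income example shows how far apart they can be in skewed data.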

Consistency/Variation Measures

While they do not have the popularity of their location cousins, knowing the consistency or variation within the data is as important, some say even more important, as knowing the central tendency for us to understand what the data is trying to tell us. Very consistent data, with little variation, has a mean that is very representative of the data and is unlikely to change much if we resample the population. Data with a large amount of variation tends to have unstable means, meaning that these values would change a lot with multiple samples. Inconsistent data (having large variation) is often a problem for businesses, particularly for manufacturing operations, as it means the results they produce differ and might often not meet the quality specifications. Predictions based on data with large variations are rarely useful. Consider attempting to estimate how long it would take you to get to work if your route had frequent traffic accidents that made the travel time different every day.

The key measures of variation are:

• Range, which equals the maximum value minus the minimum value. For our example data set of 1, 2, 3, 4, and 5, the range is 5 – 1 = 4.

• Variance, which is the average of the squared differences between each value in the data set and the mean. To get the variance, find the mean of the data, subtract this value from each of the data points, square each result (to get rid of the negative differences), add the squared results up, and divide by the total count. For our example data set, this would look like:

Value   Mean   Difference   Squared
1       3      -2           4
2       3      -1           1
3       3       0           0
4       3       1           1
5       3       2           4
                    Sum =   10

Variance = 10/5 = 2. The problem with variance is that it is expressed in units squared. So, if our data set were dollars, the variance would be 2 dollars squared. How should we interpret dollars squared? In general, we do not, and we use the next measure instead.

• Standard Deviation is the (positive) square root of the variance. It returns the dispersion measure back to one that is in the same units as the original data, so we can compare it to the data values. For our example, the standard deviation is the square root of 2 dollars squared, or about 1.4 dollars. This measure is much easier to understand: it tells us that the typical difference from the mean is about 1.4 dollars (in our example above having a mean or average value of 3 dollars).

• Important point about the variance and standard deviation. When we find these values for a population, the entire group we are interested in, we divide the numerator by the count of the population. However, when we have a sample of the entire group (and want to use this sample to estimate the population value for either variance or standard deviation), we create the inferential estimate by dividing the numerator by the (count – 1). This is an adjustment that increases the estimate to take into account that we most likely do not have the extreme low and extreme high values from the population in our sample, so the sample's variation is less than that of the group we are using the sample to describe.
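The difference between the descriptive (population) and inferential (sample) versions is easiest to see side by side. The sketch below is written in Python rather than Excel, with variable names of our own choosing; it repeats the hand calculation from the table above and then divides by the count and by (count – 1).

```python
import math
import statistics

values = [1, 2, 3, 4, 5]

# Range: maximum value minus minimum value
data_range = max(values) - min(values)

# Sum of squared differences from the mean (the "Squared" column in the table)
mean_value = statistics.mean(values)
sum_of_squares = sum((v - mean_value) ** 2 for v in values)

# Descriptive (population) version: divide by the count
population_variance = sum_of_squares / len(values)
population_sd = math.sqrt(population_variance)

# Inferential (sample) version: divide by (count - 1)
sample_variance = sum_of_squares / (len(values) - 1)
sample_sd = math.sqrt(sample_variance)

print("range:", data_range)
print("population variance and sd:", population_variance, round(population_sd, 2))
print("sample variance and sd:", sample_variance, round(sample_sd, 2))

# The statistics module offers both versions directly
print(statistics.pvariance(values), statistics.variance(values))
```

Dividing by (count – 1) always gives the larger value, which is the adjustment described in the last bullet.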

Just as detectives want to know what victims typically did and how consistent they were in their behavior around the time of the crime (For example: Was he usually in this area, and if not, why last night?), examining location and consistency measures provides a similar perspective on data variables and how they behave.

Applying the Information: Equal Pay Questions

OK, we can now start looking at our data set to see what the numbers are hiding, and develop some clues. As with all analysis, we start with questions, then identify the tools to use for those questions, and finally apply those tools to the data. Our initial question is, do males and females get equal pay for equal work? We also said we needed to start with the question of whether or not we had some measures that showed pay comparisons between males and females. Let’s take a look at some of the group and sub-group data. A couple of measures that might answer this question are:

• What are the group averages for each variable?

• What are the average male and female compa-ratios? (Remember, you will work with the Salary variable in the homework.)

• How consistent are the compa-ratios for each?

Note that we will be focusing on the compa-ratio data in the lectures, while you will focus on the same questions using salary in the weekly homework assignments. As described, compa-ratio is the result of dividing an employee’s salary by their grade midpoint. It generally ranges from about 0.80 to 1.20 in most pay plans. The value of this measure is it removes the impact of different grades (each of which we are assuming are different levels of work from other grades and contain equal work for all the jobs within the grade). While not a perfect measure, it is the start of measuring what is paid for equal work. Side note: a grade’s midpoint is generally pegged to the average market pay needed to hire new employees into a job.

Week 1 Question 1

Question 1 asks for some summary statistics. Part A asks you to use the Excel Descriptive Statistics function (more on this in the third lecture), while part B asks for some specific statistics using the Fx function list (again, how to do this is covered in lecture 3). The purpose of these specific requests is to let you show mastery in using these two Excel tools.

For part A, the mean, standard deviation, and range of the entire compa-ratio data set are highlighted. This shows us that the mean is 1.062, the standard deviation is 0.077, and the range is 0.34. As interesting as these values are, they do not really tell us anything on their own. Measures generally need to be compared to provide information.

This is where part b comes in. We see that the male and female averages (1.056 and 1.069 (rounded) respectively) appear relatively close and are on opposite sides of the overall mean. The standard deviations are also close at 0.084 and 0.07 and surround the standard deviation from the entire data set. The ranges are both smaller than the overall range – meaning that neither gender has both the smallest and largest value. The female compa-ratios appear to be slightly more clustered (less variation, more consistent) than the male values from both the range and standard deviation results.

Two things stand out. First, perhaps surprisingly, the females appear to be paid more relative to their grade midpoints than the males. Second, measures of dispersion appear fairly close with males being slightly more spread out than females. So far, nothing seems to create any concerns as we expect sample results to be a bit different than the overall population values. These differences seem to be small enough that they might be simple sampling errors – if we resampled (such as the data set you will be working with) the male and female results might switch.

Remember, when you do this problem in the homework, use the salary data. As practice, you can copy the data set into a practice Excel file and try to replicate the answers shown in the lectures. Ask a question if you are unsure of how to do this or do not get the same results using the lecture dataset.

Position Measures

Often, we are interested in where within a data set a particular measure falls. This opens up the idea of distributions, how the data values are spread across the range of values. Our detectives would be looking at where victims typically went and where they spent their time – the pattern of their normal behavior.

Distributions. Location and consistency measures are important for summarizing the data set. Important as they are, they do not always give us all the information we need. At times we want to know how specific values fit within the data set. For example, we might want to compare the 10th highest male and female value to get a sense of how relative positions within the data range differ. This often means we need to examine the distribution, or shape, of the data. This shows us how all the data values relate to all of the other values within the sample.

One important tool in analyzing data sets that we will not cover (we cannot cover everything, alas) is graphical analysis – looking at how data sets are distributed when graphed. One example will show how powerful these techniques can be. One very common graph is a histogram – a count of how many times a certain value occurs. For example, if you tossed a pair of dice 50 times, you might get the following results. The table shows the results we got. The histogram shows the distribution or shape of the data, with the x-axis, horizontal, showing the sum of the numbers on the two faces and the y-axis, vertical, showing how often we observed a particular result in our 50 tosses.

Outcomes from tossing a pair of dice

Count showing     2   3   4   5   6   7   8   9  10  11  12
Frequency seen    1   2   4   3   9  12   7   5   4   1   2

[Histogram: x-axis shows the count showing (2 through 12); y-axis shows the frequency seen (0 through 14).]

A couple of things we can do with distributions can be easily shown with this histogram. First, we can find the center, in this case 7. We can see that there are two tails around the center, one to the left showing counts for values less than the middle value of 7, and one to the right showing how often we got values greater than 7. Visually, we can see that the further away from the center we get, the less often – or less likely – we are to get any particular outcome. Ways to quantify these observations are discussed below.
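The observations about the center and the two tails can be checked directly from the frequency table. This is a small sketch in Python (not part of the Excel workflow used in the course); the dictionary simply restates the table, and the text-based bars are a rough stand-in for the histogram.

```python
# Frequency table from the 50 dice tosses: sum showing -> times seen
frequencies = {2: 1, 3: 2, 4: 4, 5: 3, 6: 9, 7: 12,
               8: 7, 9: 5, 10: 4, 11: 1, 12: 2}

total_tosses = sum(frequencies.values())

# The center of the histogram is the most frequently seen sum
most_common_sum = max(frequencies, key=frequencies.get)

# Mean of the sums, weighting each sum by how often it was seen
mean_sum = sum(s * count for s, count in frequencies.items()) / total_tosses

# Counts in the left tail (below the center) and right tail (above the center)
left_tail = sum(count for s, count in frequencies.items() if s < most_common_sum)
right_tail = sum(count for s, count in frequencies.items() if s > most_common_sum)

# A text version of the histogram: one '#' per observation
for s, count in frequencies.items():
    print(f"{s:>2} | {'#' * count}")

print("tosses:", total_tosses)
print("center:", most_common_sum, "mean:", mean_sum)
print("left tail:", left_tail, "right tail:", right_tail)
```

In this particular set of tosses the two tails happen to hold the same number of observations, which is part of why the histogram looks roughly balanced around 7.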

Our detectives use this logic when they attempt to find out where all the persons of interest were at the critical times. These approaches provide more detailed information about how the data looks than the summary measures of location and dispersion examined earlier.

Position Measures. Central tendency and variation are group descriptive measures – particularly the mean and standard deviation, which use all the values in the data set in their calculation. At times, however, we are concerned with specific values within the distribution, such as:

• Quartiles,

• Percentiles, or centiles,

• The 5-number summary, or

• Z-score.

Quartiles and Percentiles. These measures divide the data into groups, four with the quartile and 100 with the percentile. One example that many of you might be familiar with is percentile (AKA percentile rank). This is often used when doctors describe a child as in the 80th percentile in height or weight for his/her age. This means that 80% of other children at this age are at or below this particular child’s measure. Percentiles range from 1 to 100%-tile, meaning the lowest score would be at the first (or 1%-tile) and the highest score would be at the 100%-tile. Percentiles are very useful for comparing groups.

The general percentile formula lets us find percentiles, deciles (the 10% divisions), and/or quartiles, although Excel will do this for us. The formula is:

Lp = (n+1) * P/100; where

Lp is the location (the position in the ordered list) of the desired percentile (L25 would be the location of the first quartile, for example)

n is the size/count of the data set

P is the desired percentile; using 25, 50, or 75 gives the quartile points, while using 10, 20, etc. would give the decile points.

Example: if we wanted to find the cut-off for the first (or lowest) quartile of the data, also known as the 25th percentile in a data set of 50, we would use (50+1)*25/100 = 12.75, or the 13th value from the bottom in an ordered list. By convention, we always round up to the next whole value.
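The location formula and the round-up convention are easy to try out for different data set sizes. The following is a minimal sketch in Python; the function names are our own and are not Excel functions.

```python
import math

def percentile_location(n, p):
    """Lp = (n + 1) * P / 100, the raw location of the P-th percentile."""
    return (n + 1) * p / 100

def rounded_location(n, p):
    """Apply the lecture's convention: round up to the next whole position."""
    return math.ceil(percentile_location(n, p))

# First quartile in a data set of 50 values
print(percentile_location(50, 25), "->", rounded_location(50, 25))

# Quartile locations for a group of 25 and for the full sample of 50
for n in (25, 50):
    print(n, [rounded_location(n, p) for p in (25, 50, 75)])
```

For n = 50 and P = 25 this gives 12.75, rounded up to the 13th value, matching the example above.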

5-Number Summary. As its name suggests, the 5-number summary identifies five key values in a data set: minimum value, 1st quartile, median or 2nd quartile, 3rd quartile, and maximum value. For the compa-ratio data set used in the lectures, the 5-number summary can be found from the following table results. The 1st quartile, for either gender group of 25, is located at (25+1) * 25/100 = 6.5, or the 7th value in a rank ordered list. The 3rd quartile is located at 19.5. For the entire sample of 50, these values are located at the 13th and 39th rank ordered places, respectively. Here is a 5-number summary for the overall compa-ratio values in the sample:

Compa-ratio 5-number summary: 0.870, 1.013, 1.051, 1.134, 1.210.

More on this shortly.
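To see the 5-number summary built from raw values, we can apply the same location rule to a data set we already have in full: the 50 dice tosses from the histogram example. This is only a sketch in Python under the lecture's round-up convention (Excel's quartile functions interpolate and may give slightly different values); the helper names are our own.

```python
import math
import statistics

def value_at_percentile(ordered, p):
    """Value at location ceil((n + 1) * P / 100) in an ordered list."""
    location = math.ceil((len(ordered) + 1) * p / 100)
    location = min(location, len(ordered))  # guard for very small data sets
    return ordered[location - 1]            # positions are counted from 1

def five_number_summary(data):
    """Minimum, 1st quartile, median, 3rd quartile, maximum."""
    ordered = sorted(data)
    return (ordered[0],
            value_at_percentile(ordered, 25),
            statistics.median(ordered),
            value_at_percentile(ordered, 75),
            ordered[-1])

# Rebuild the 50 individual tosses from the frequency table
frequencies = {2: 1, 3: 2, 4: 4, 5: 3, 6: 9, 7: 12,
               8: 7, 9: 5, 10: 4, 11: 1, 12: 2}
tosses = [s for s, count in frequencies.items() for _ in range(count)]

summary = five_number_summary(tosses)
print(summary)
```

For the dice data, the 13th ordered value is 6 and the 39th is 9, so the summary is 2, 6, 7, 9, 12.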

Z-score. What is often of more value is looking at where specific measures lie within each range. The z-score shows how far from the mean a specific data point lies, measured in standard deviation units. (I know that sounds strange but keep reading.) The Z-score provides a measure of how many standard deviations a particular score lies from the mean, and in what direction (above or below). The Z-score formula is:

Z = (individual score – mean) / (standard deviation).

Looking at this formula we can see that a score above the mean would give us a positive z-score, a score below the mean would give us a negative z-score, and a score that exactly equals the mean would give us a z-score of 0. For most data sets, the z-score ranges from -3.0 to +3.0.

For example, in our example data set (1, 2, 3, 4, and 5) (see above for descriptive statistics on this data set), the z score for 2 would be (2-3)/1.4 = -1/1.4 = -0.71. The negative value means that 2 is below (or less than) the mean and is 0.71 standard deviation units away from the mean (0.71 times the standard deviation of 1.4 = 1).

Using this measure, we can easily examine relative placement of scores. For example, a compa-ratio of 1.06 would have Z-scores of 0.048 for males, -0.129 for females, and -0.03 for the overall group. (We will see how we got these values shortly.) Thus, we can see that a person with this compa-ratio is slightly above average for males, but below average for the overall group and for females.
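The Z-scores quoted above can be reproduced from the rounded means and standard deviations reported for Question 1. This is a minimal sketch in Python; the function name and dictionary layout are our own, and because the inputs are rounded the results match the lecture values only approximately.

```python
def z_score(score, mean, standard_deviation):
    """Z = (individual score - mean) / (standard deviation)."""
    return (score - mean) / standard_deviation

# (mean, standard deviation) of the compa-ratio for each group, from Question 1
groups = {
    "male": (1.056, 0.084),
    "female": (1.069, 0.07),
    "overall": (1.062, 0.077),
}

compa_ratio = 1.06
z_by_group = {name: z_score(compa_ratio, mean, sd)
              for name, (mean, sd) in groups.items()}

for name, z in z_by_group.items():
    print(f"{name}: {z:.3f}")

# The small example data set (1, 2, 3, 4, 5): z-score for the value 2
print(round(z_score(2, 3, 1.4), 2))
```

The signs tell the story: positive for males (above the male mean) and negative for females and for the overall group (below those means).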

Applying the information

Week 1 Question 2

Question 2 asks for a 5-number summary for the overall compa-ratio data set as well as for the male and female sub-groups within the data.

Note: Lecture 1-3 will show the same screen shot with the cell formulas displayed.

One of the first observations that confirms an earlier observation is that neither the male nor female data set has both the largest and smallest values. The males appear to have a slightly lower overall range of values than do the females. Some other interesting observations include the relatively similar 3rd quartile values for all three groups and the lower midpoint for females, meaning that more females are lower in the overall range than males. More males are in the first quartile than females. What other observations can you make about how employees are distributed within their respective compa-ratio ranges?

Week 1 Question 3

Often looking at how a single point lies within a data range is helpful to get some insight into how the distributions are positioned. Question 3 asks us to examine where the midpoint of each gender’s dataset fits within the entire compa-ratio data set. The PERCENTRANK.EXC function returns a percentile rank, the percent of data values that fall at or below a given value. For example, the PERCENTRANK.EXC of the median would be the 50%-tile, as half the values are above and half below the median (as expected).
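To make the idea of a percentile rank concrete, here is a rough sketch in Python. It assumes the "exclusive" convention, in which a value that appears in the data set gets the rank (count of smaller values + 1) divided by (n + 1); this is our understanding of how Excel's exclusive function treats values found in the data, and the sketch does not attempt the interpolation Excel performs for values that are not in the data. The function name is hypothetical.

```python
def percent_rank_exclusive(data, value):
    """Rank of a value that appears in the data, as a fraction between 0 and 1.

    Assumed convention: (number of smaller values + 1) / (n + 1).
    Values not present in the data are not handled in this sketch.
    """
    if value not in data:
        raise ValueError("this sketch only handles values present in the data")
    smaller = sum(1 for v in data if v < value)
    return (smaller + 1) / (len(data) + 1)

# The median of the small example data set lands at the 50%-tile
example = [1, 2, 3, 4, 5]
print(percent_rank_exclusive(example, 3))

# The lowest and highest values never reach 0% or 100% under this convention
print(percent_rank_exclusive(example, 1), percent_rank_exclusive(example, 5))
```

With this convention the median of 1, 2, 3, 4, 5 comes out at exactly 0.5, which agrees with the expectation that the median sits at the 50%-tile.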

When we look at the male median, we see it falls at the 51st %-tile, meaning it is slightly above the overall median. The female median (half of the female compa-ratios are below this value remember) falls at the 33rd %-tile! This means that most of the females are in the bottom half of the distribution, even though (from Question 2), females have the “higher” range. Interesting.

The z score is a measure of relative placement based on the mean rather than the median. A value that equals the mean would have a z score of 0, a value that is greater than the mean would have a positive z score, while a value less than the mean would have a negative z score.

Both the male and female medians fall below the overall compa-ratio mean, with the female median being relatively lower in the distribution. This is consistent with what the percentile scores suggested. Overall, these two questions are suggesting that males and females are not distributed the same within the compa-ratio data set.

Likelihood Measures

Likelihood, or probability, focuses on how often we can expect to see an outcome. In statistics, many decisions are made based upon how likely, or more accurately, how unlikely it is to see an outcome.

Probability

Probability is the likelihood that an event will happen. For example, if we toss a fair coin, we have a 50/50 chance, or a probability of .5, of getting a head. If we pick a date between 1 and 7, we have a 1 out of 7 chance (or a probability of 1/7 = .14 or 14%) that it will be a Wednesday in the current month. Statisticians recognize three types of probabilities:

• Theoretical – based on a theory. For example, since a die (half of a pair of dice) has 6 sides, and our theory says each face is equally likely to show up when we toss it, we expect that we will see a 1 on 1/6th of the tosses (assuming we toss it a lot).

• Empirical – count based; if we see that an accident happens on our way to work 5 times (days) within every 4 weeks, we can say the probability of an accident today is 5/20 or 25% since there are 20 work days within a 4-week period. An empirical probability equals the number of successes we see divided by the number of times we could have seen the outcome.

• Subjective – a guess based on some experience or feeling.
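The difference between a theoretical and an empirical probability can be illustrated with the pair-of-dice results from the histogram example earlier in this lecture. The sketch below is in Python; it counts equally likely outcomes for the theoretical value and uses the observed 50 tosses for the empirical one. Variable names are our own.

```python
from fractions import Fraction
from itertools import product

# Theoretical: list all 36 equally likely outcomes for two dice
outcomes = list(product(range(1, 7), repeat=2))
ways_to_get_seven = sum(1 for a, b in outcomes if a + b == 7)
theoretical_seven = Fraction(ways_to_get_seven, len(outcomes))

# Empirical: successes seen divided by the number of chances to see them
frequencies = {2: 1, 3: 2, 4: 4, 5: 3, 6: 9, 7: 12,
               8: 7, 9: 5, 10: 4, 11: 1, 12: 2}
empirical_seven = Fraction(frequencies[7], sum(frequencies.values()))

# Empirical example from the text: 5 accident days in 20 work days
accident_probability = Fraction(5, 20)

print("theoretical P(7):", theoretical_seven, float(theoretical_seven))
print("empirical P(7):", empirical_seven, float(empirical_seven))
print("empirical P(accident):", accident_probability, float(accident_probability))
```

The two values for a sum of 7 differ (about 0.17 in theory against 0.24 observed), which is the kind of gap we expect from only 50 tosses.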

There are some basic probability rules that will be helpful during the course. The probability

• of something (an event) happening is called P(event),

• of two things happening together – called joint probability: P(A and B),

• of either one or the other (or both) of two events occurring – P(A or B),

• of something occurring given that something else has occurred, conditional probability: P(A|B) (read as probability of A given B).

• Complement rule: P(not A) = 1 – P(A).

Two other ideas are needed. Mutually exclusive means that the elements of one data set do not belong to another – for example, males and pregnant are mutually exclusive data sets. The other term we frequently hear with probability is collectively exhaustive – this simply means that all members of the data set are listed.

Some rules, which apply for both theoretical and empirical based probabilities, for dealing with these different probability situations include:

• P(event) = (number of successes)/(number of attempts or possible outcomes)

• P(A and B) = P(A)*P(B) for independent events or P(A)*P(B|A) for dependent events. (P(B|A) is the conditional probability: the probability of B occurring given that A has occurred.)

• P(A or B) = P(A) + P(B) – P(A and B); if A and B cannot occur together (such as the example of male and pregnant) then P(A and B) = 0

• P(A|B) = P(A and B)/P(B).

One of the more interesting uses of probabilities (other than forecasting the likelihood of rain on our days off) is the comparing of outcome likelihoods for different groups.

• The probability of randomly picking a female [P(F)] is the same as randomly picking a male [P(M)] from the group = 25 specified outcomes/50 possible outcomes. This is a simple empirical probability – counts divided by counts.

• We can get a bit more complicated, such as the probability of picking a female from a specific grade such as B – P(F|B), probability of picking a female given (from) only grade B. Again, this is empirical – we have 7 employees in grade B, and 4 of these are females, so P(F|B) = 4/7.

• Now the probability of picking a female who is also in grade B (from the entire data set) is 4 females out of 50 = 4/50 = 0.08, empirically. We can find this using the P(A and B) formula referenced above. P(F and B) = P(F)*P(B|F), since the events of female and grade B are not independent. So, we know P(F) = .5, and P(B|F) = 4/25 (4 females out of 25 are in grade B), so by theory, P(Female and grade B) = .5 * .16 = 0.08, the same result.

• The complement rule is often helpful. If we want to find the probability of picking anyone EXCEPT a female in grade B, we could figure out the probability for every other gender and grade combination and add them together, OR we could simply say that the probability of not picking a female in grade B is 1 – P(Female and grade B), or 1 – 0.08, or 0.92. (Note that this is not the same as the probability of picking a female who is outside grade B, which is P(F) – P(F and B) = 0.50 – 0.08 = 0.42.) We will use this property of probabilities a lot in the rest of the class.
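The counts used in these examples are enough to check each probability rule. The following sketch, in Python with exact fractions, uses only the counts given above (50 employees, 25 females, 7 employees in grade B, 4 of them female); the variable names are our own.

```python
from fractions import Fraction

total = 50          # employees in the sample
females = 25        # females in the sample
grade_b = 7         # employees in grade B
females_in_b = 4    # females in grade B

# Simple empirical probabilities: counts divided by counts
p_f = Fraction(females, total)
p_b = Fraction(grade_b, total)
p_f_and_b = Fraction(females_in_b, total)

# Conditional probabilities: P(A|B) = P(A and B) / P(B)
p_f_given_b = p_f_and_b / p_b   # female, given grade B
p_b_given_f = p_f_and_b / p_f   # grade B, given female

# Multiplication rule for dependent events: P(F and B) = P(F) * P(B|F)
joint_from_rule = p_f * p_b_given_f

# Addition rule: P(F or B) = P(F) + P(B) - P(F and B)
p_f_or_b = p_f + p_b - p_f_and_b

# Complement rule: P(not (F and B)) = 1 - P(F and B)
p_not_f_and_b = 1 - p_f_and_b

print("P(F|B) =", p_f_given_b)
print("P(B|F) =", p_b_given_f)
print("P(F and B) =", p_f_and_b, "=", float(joint_from_rule))
print("P(F or B) =", p_f_or_b)
print("1 - P(F and B) =", float(p_not_f_and_b))
```

Working with fractions keeps the results exact, so the multiplication rule reproduces 4/50 = 0.08 without rounding.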

As we can see, probabilities can show us a lot and can be somewhat complex in determining their values. The nice thing is that this is about as complicated as it gets.

Applying the information

Week 1 Question 4

Question 4 gives us some probability values: how likely are we to exceed the respective gender midpoints in the entire data set? We are looking at the empirical and normal curve probabilities. If the data set is normally distributed, the probabilities should be fairly close; if not, we have a clue that the data might not be normally distributed over the entire data range.

The male probability of exceeding the midpoint in the entire data set is 50% empirically (close to the 51st percentile value we got above) and 55% assuming normality – fairly close. The female probabilities are 68% and 60% respectively; again not too far off.
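The comparison between an empirical probability and a normal curve probability can be sketched in Python with the standard library. The example below is an illustration only: since the gender medians are not listed in this lecture, it uses the overall median (1.051) from the 5-number summary together with the overall mean (1.062) and standard deviation (0.077) from Question 1. The homework itself uses Excel for this step.

```python
from statistics import NormalDist

# Overall compa-ratio mean and standard deviation from Question 1
overall = NormalDist(mu=1.062, sigma=0.077)

# Overall median from the 5-number summary
overall_median = 1.051

# Empirical: by definition, half of the values lie above the median
p_above_empirical = 0.5

# Normal curve: area to the right of the median under the fitted normal curve
p_above_normal = 1 - overall.cdf(overall_median)

print(f"empirical: {p_above_empirical:.0%}")
print(f"normal curve: {p_above_normal:.1%}")
```

Because the median lies a little below the mean, the normal curve puts somewhat more than half of the values above it, a gap of the same general size as the male results described above.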

The data again support the idea that a lot of females are at the lower end of the compa-ratio distribution.

Drawing Conclusions: Week 1 Question 5

As interesting as the numbers are themselves, they mean very little unless we can interpret their meaning and apply that insight to the question(s) at hand.

Recapping our results, we see that while female overall average compa-ratio is somewhat higher than the males, the probability and distribution outcomes suggest that males and females are not distributed in the same fashion and that more of the females are relatively lower in their range than the males.

While we have not yet accounted for equal work, it appears that there are some issues suggesting that males and females are not paid the same within the company. At least enough for more investigation.

On our detective shows, we might say that we have some evidence, but not enough to take it to the grand jury for an indictment yet.

Summary

This lecture looked at descriptive statistics and what they can tell us about the data set. We reviewed the questions that are asked in the Week 1 assignment and the answers for each question using the COMPA-RATIO variable. The focus of this lecture was on interpreting presented results, as that is a more frequent activity for professionals than actually developing the measures.

Specifically, we looked at developing the following information.

Note that this was created by listing the tool as we introduced it, the data requirements, and then a typical question that would require this tool. By copying this information to a second Excel sheet and sorting the columns, we can create a guide as to when to use each tool, as shown below.

Now, we move on to some specific ways to set up Excel to provide the results that we just looked at.

Before we do, however, please respond to Discussion Thread 2 for this week with your initial response and responses to others over a couple of days before moving on to reading the second lecture for the week.

Please ask your instructor if you have any questions about this material.