Statics - Final Paper

profileAEIOUsometimesY
Lectureweek1-3.pdf

BUS308 – Week 1 Lecture 3

What this lecture covers

This week’s first lecture introduced some of the reasons for studying and understanding statistics. Even if we never become a data detective or do statistical tests, we will be exposed and bombarded with statistics and statistical outcomes. We need to understand what they are telling us and how they help uncover what the data means on the AKA research question/issue.

Our data was also introduced in lecture 1. Lecture 2 covered basic information that underlies a successful statistical analysis and introduced some statistical outcomes and tools. It worked thru an example of how to develop some descriptive statistics using Excel tools.

This lecture continues our description of the data set by looking at positional and probability concerns and continues our look at how we can get Excel to do the things we need. If you are not familiar with Excel, you will be amazed at what it can do for us. If you are familiar with Excel, you might not know the statistical analytical power it has (at least that is what past classes have said). Side note – much of what we will cover with Excel provides an excellent starting point for classes in accounting, finance, and operations management (to name a few).

Position Issues

Distributions. Locational and consistency measures are important for summarizing the data set. Important as they are, they do not always give us all the information we need. At times we want to know how specific values fit within the data set. For example, we might want to compare the 10th highest (found using =large(range, 10) in the fx list) the male and female value to get a sense of how relative positions within the data range differ. This often means we need to examine the distribution, or shape, of the data. This shows us how all the data values relate to all of the other values with the sample.

One important tool in analyzing data sets that we will not cover (we cannot cover everything, alas) is graphical analysis – looking at how data sets are distributed when graphed. One example will show how powerful these techniques can be. One very common graph is a histogram – a count of how many times a certain value occurs. For example, if you tossed a pair of dice 50 times, you might get the following results. The table shows the results we got. The Histogram shows the distribution or shape of the data, with the x-axis, horizontal, showing the sum of the numbers on the two faces and the y-axis, vertical, showing how often we observed

Outcomes from tossing a pair of dice Count showing 2 3 4 5 6 7 8 9 10 11 12 Frequency seen 1 2 4 3 9 12 7 5 4 1 2

0

2

4

6

8

10

12

14

2 3 4 5 6 7 8 9 10 11 12

This specific count/result in our 50 tosses.

A couple of things we can do with distributions can be easily shown with this histogram. First, we can find the center, in this case 7. We can see that there are two tails around the center, one to the left showing counts for values less than the middle value of 7, and one to the right showing how often we got values greater than 7. Visually, we can see that the further away from the center we get, the less often – or less likely – we are to get any particular outcome. Ways to quantify these observations are discussed below.

Position Measures. Central tendency and variation are group descriptive measures – particularly the mean and standard deviation, which use all the values in the data set in their calculation. At times; however, we are concerned with specific values with in the distribution, such as:

• Quartiles, • Percentiles, or centiles, • The 5-number summary, or • Z-score.

Quartiles and Percentiles. These measures divide the data into groups, four with the quartile and 100 with the percentile. The general percentile formula lets us find percentiles, deciles (the 10% divisions), and/or quartiles, although Excel will do this for us. The formula is:

Lp = (n+1) * P/100; where

Lp is the count of the desired percentile (25 would be the location of the first quartile, for example)

n is the size/count of the data set

P is the desired percentile; using 25, 50, or 75 gives the quartile points, while using 10, 20, etc. would give the decile points.

Example: if we wanted to find the cut-off for the first (or lowest) quartile of the data, also known as the 25th percentile in a data set of 50, we would use (50+1)*25/100 = 12.75, or the 13th value from the bottom in an ordered list. By convention, we always round up to the next whole value.

5-Number Summary. as its name suggests, identifies five key values in a data set: minimum value, 1st quartile, median or 2nd quartile, 3rd quartile, and maximum values. For the compa-ratio data set used in the lectures, the 5-number summary can be found from the

following table results. The 1st quartile, for either group of 25 is (25+1) * 25/100 = 6.5, or the 7th values in a rank ordered list. The 3rd quartile is located at 19.5. For the entire sample, these values are 13 and 39, respectively. (The interested student is invited to develop these for the data set used in this class.)

(Note: Excel rounds the quartile results to the nearest whole number, so 6.5 would give the 7th value automatically.

We can now compare the three data distributions with the 5-number summary:

• Males: 0.871, 1.018, 1.057, 1.134, 1.175 • Females: 0.957, 1.025, 1.069, 1.129, 1.211 • Overall: 0.871, 1.018, 1.063, 1.133, 1.211

In comparing these, we see (again) that the males start and end a bit lower than do the females.

Z-score. What is often of more value is when analysts want to look at where specific measures lie within each range. It measures showing how far from the mean a specific data point lies, measured in standard deviation units. These are specific measures that use the data distribution. One of the more commonly observed distribution in real life is the normal, or bell- shaped curve. The mean = mode = median in the middle, and values away from the mean fall of in a similar pattern in each tail. A picture of 3 normal curves that have data ranges that overlap is shown below. Many naturally occurring outcomes fit this distribution.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

-5 -4 -3 -2 -1 0 1 2 3 4 5

Males Females Overall Min. 0.871 0.957 0.871 1st Q 1.018 1.025 1.018 Mean 1.057 1.069 1.063 3rd Q 1.134 1.129 1.133 Max 1.175 1.211 1.211

A z-score for a specific data value within a normal curve distribution is found by subtracting the mean from that score, and dividing the result by the standard deviation of the data set. For example, in our example data set (1, 2, 3, 4, and 5), the z score for 2 would be (2-3)/1.4 = -1/1.4 = -0.71. The negative value means that 2 is below (or less than) the mean, and is 0.71 standard deviation units away from the mean (0.71 times the standard deviation of 1.4 = 1).

The Z-score provides a measure of how many standard deviations a particular score lies from the mean, and in what direction (above or below). The Z-score formula is:

Z = (individual score – mean) / (standard deviation)

Using this measure, we can easily examine relative placement of scores. For example, a compa- ratio of 1.06 would have Z-scores of 0.04 for males, -0.13 for females, and -0.03 for the overall group. Thus, we can see that a person with this compa-ratio is slightly above average for males, but below average for the overall group and for females.

Applying the information

Question 2

This question moves from descriptive statistics to summary measures – telling us where within a data set we will find a value. These are good for comparison activities. Question 2 asks for us to develop the 5-number summary for the overall, male, and female salary data range. Locating a specific value within a data set is often useful. In Excel, this can be done in a couple of ways. If we are using the Analysis ToolPak Descriptive Statistics function, we can simply enter the rank (for example 1) in the Kth largest and Kth smallest box at the bottom of the data input box to obtain several of the values needed.

If we do not need all the information provided by Descriptive Statistics, we can simply use several functions found in the fx (or Formulas) statistical list (which is how the question wants us to find the answer). The setup is straight forward. To have Excel automatically find

• The maximum value use =max(range) • The 3rd quartile use =percentile.exc(range, .75) • The median is found with =median(range) • The first quartile uses =percentile.exc(range, .25) • The minimum value uses =min(range)

Note, for full credit each cell should show a data value, but when the cursor is placed in the cell the formula should be seen showing the data range where you got your answer.

Question 3

In looking at question 3, we see that you will be still working with the salary and gender1 variables. We are looking to find some descriptive information about single data points using the midpoints found in question 2. This is often useful.

Part a asks for the percentile rank (the percentile gives us the percent of scores at or below a specific score, such as 35th percentile) for the male and female midpoints based on where they are in the entire salary data set. (If we asked for this results within each gender’s data range we would get 50%, as that is the definition of midpoint.) Use the PERCENTRANK.EXC(Range, Value) function found in the fx list for this question.

Z-Score. Part b asks for the z-score or Z-value. This is a relative measure telling us (in standard deviations) how far from the mean a specific value lies. It involves the (a) value we are interested in, the mean of the distribution, and (c) standard deviation. We have all the values needed from question 1 (remember, the examples are using the compa-ratio values while your assignment will use the Salary variable values):

Male Female Mean: 1.056 1.069

Sample Standard

Deviation: 0.08379 0.070 Range: 0.305 0.254

One other element of these results is the cells they are in, as formulas can use either actual values or cell references. For our z values, we will show how to use both of these.

We can find a z-value in several ways. Letting Excel do the math for us, we could enter into the cell for Male z-value of 1.149 the following formula: =(1.149-1.056)/0.08379; this gives us a value of 1.11 (rounded to 2 decimal places – traditional for z scores). We could also have used cell values and entered =(K35-D29)/D30 to get a value of 1.11 rounded. However, if we look at more decimal places we will find a slight difference. This difference is due to Excel’s rounding of the compa-ratio values – the actual mean and standard deviation are slightly larger than the rounded values shown in the table output. The preference is to use cell references in formulas.

The second way, and the approach asked for in the homework, to find z-values is to use the Fx function =standardize(x, mean, standard.deviation); for females this would be =standardize(k35, E29, E30) giving us a value of 1.14 (rounded). This function would be used in part b for question 3.

Probability

Probability is the likelihood that an event will happen. For example, if we toss a fair coin, we have a 50/50 chance, or a probability of .5 of getting a head. If we pick a date between 1 and 7, we have a 1 out of 7 chances (or a probability of 1/7 = .14 or 14%) that it will be a Wednesday in the current month. Statisticians recognize three types of probabilities:

• Theoretical – based on a theory, for example – since a die (half of a pair of dice) has 6 sides, and our theory says each face is equally likely to show up when we toss it; we therefore expect that will see a 1 1/6th of the number of times we toss it (assuming we toss it a lot).

• Empirical – count based; if we see that an accident happens on our way to work 5 times(days) within every 4 weeks, we can say the probability of an accident today is 5/20 or 25% since there are 20 work days within a 4-week period. An empirical probability equals the number of successes we see divided by the number of times we could have seen the outcome.

• Subjective – a guess based on some experience or feeling.

There are some basic probability rules that will be helpful during the course. The probability

• of something (an event) happening is called P(event), • of two things happening together – called joint probability: P(A and B), • of either one or the other but not both events occurring – P(A or B), • of something occurring given that something else has occurred, conditional probability:

P(A|B) (read as probability of A given B). • Compliment rule: P(not A) = 1- p(A).

Two other issues are needed, the idea of mutually exclusive means that the elements of one data set do not belong to another – for example, males and pregnant are mutually exclusive data sets. The other term we frequently hear with probability is collectively exhaustive – this simply means that all members of the data set are listed.

Some rules, which apply for both theoretical and empirical based probabilities, for dealing with these different probability situations include:

• P(event) = (number of success)/(number of attempts or possible outcomes) • P(A and B) = P(A)*P(B) for independent events or P(A)*P(B|A) for dependent events

(This last is called conditional probability the probability of B occurring given that A has occurred).

• P(A or B) = P(A) + P(B) – P(A and B); if A and B cannot occur together (such as the example of male and pregnant) then P(A and B) = 0

• P(A|B) = P(A and B)/P(B).

One of the more interesting uses of probabilities (other than forecasting the likelihood of rain on our days off) is the comparing of outcome likelihoods for different groups.

• The probability of randomly picking a female [P(F)] is the same as randomly picking a male [P(M)] from the group = 25 specified outcomes/50 possible outcomes. This is a simple empirical probability – counts divided by counts.

• We can get a bit more complicated, such as the probability of picking a female from a specific grade such as B – P(F|B), probability of picking a female given (from) only grade B. Again, this is empirical – we have 7 employees in grade B, and 4 of these are females, so P(F|B) = 4/7.

• Now the probability of picking a Female who is also in grade B (from the entire data set is 4 females out of 50 = 4/50 = 0.08, empirically. We can find this using the P(A

and B) formula referenced above. P(F and B) = P(F)*P(B|F), since the events of female and grade E are not independent. So, we know P(F) = .5, and P(B/F) = 4/25 (4 females out of 25 are in grade B), so by theory, P(Female and grade B) = .5 * .16 = 0.08, the same results.

• The compliment rule is often helpful, if we want to find the probability of picking any female EXCEPT those in grade B, we could figure out the probability for each of the grades and add them together, OR we could simply say that the probability of Female and not grade B is simply 1 – P(Female and grade B), or 1 -0.08, or 0.92. We will use this property of probabilities a lot in the rest of the class.

As we can see, probabilities can show us a lot and can be somewhat complex in determining their values. The nice thing is that this is about as complicated as it gets.

Question 4

Part a in the next question is about the empirical probability that a male and female (separate groups) exceeds our value found in part a. Empirical probably is a simple count outcome, the number of successful outcomes divided by the total number of possible outcomes. For each gender, we have 25 possible employees that could be selected at random. What we need to know is, for each group how many employees have a compa-ratio as large or larger than then cut-off value we found we found in part a. The labor-intensive approach is to look at the sorted data and count – pain in the neck. Excel will do this for us.

Empirical Probability. Excel has several counting functions found in the statistical list of Fx functions. The one we want is =countif (range, “criteria”). We know the range for the males and females, what we need is the criteria. Countif lets us use =, >, or < (as in = a specific value, greater than (>) than a specific value, or less than (<) a specific value); and we use >= for equal to or greater than and <= for (less than or equal to).

Therefore, asking “as large or larger” as a compa-ratio value of 1.1490 would be =countif(range, “>=1.1490.) And, recalling that the empirical probability is a count of successes over the total count, for each gender we can create a cell formula of =countif(range, “>=1.1490”)/25 to find the empirical probability of having a value of 1.1490 or larger in a data set of 25 values.

For males, this formula is =countif(Q27:Q51,”>=1.1490”)/25.

For females, change the range to (Q02:Q26).

It turns out, that for both males and females, we have 4 employees exceeding our criteria, give us an empirical probability of 4/25 = 0.16 for each.

This question asks us to find the probability of picking from each gender group, a person having a compa-ratio equal to or larger than our 1.149 value. In the homework, we would be looking for how many equal or exceed the male or female midpoints within the entire data range of all 50 values rather than within each gender’s range.

The Fx function =NORM.S.DIST(Z, Cumulative) gives us the normal curve probability of getting a value up to our z score. Since we want the probability of exceeding our z-score, we would subtract this probability from 1.0 (total probability of getting any value), with the cell formula =1-norm.s.dist(J42, 1) for the female answer. Note, the cumulative value of 1 (or true) means to provide a probability value, if we used 0 or false, we would get the height of the curve for each z-value (helpful if you want to generate a normal curve graph). Just as with the standardize function, we can use either cell references or actual values for the z score. Our results, when rounded to two decimal places are both 0.13 – or a 13% chance of exceeding that compa-ratio value regardless of gender.

Question 5

The final question this week, as with every week, asks us to consider the meaning of our findings – first between the data sets and then towards our equal pay question.

When we look at our descriptive statistics, we see that all the measures – means, standard deviations, ranges, probabilities are fairly close together. This suggests that the compa-ratio distributions might be similar and we have no equal pay issue overall. So, at first pass we see that, when grade differences are held constant (or eliminated) as the compa-ratio does; the females seem to be paid slightly more relative to their midpoint than males are. At the same time, they are a bit less spread out, or more consistent, within their data group than the males are. We can also see, that the overall average for each measure falls between the group values. The range, high, and low values also show that the males generally have less consistency as well as the smaller low and high values compared to females.

Now, while possibly insightful, as with any analyst we need to withhold making a final determination until we find some additional information. Two other initial questions pop up at this point. The first, are these differences meaningful or just sample errors, meaning if we took another sample would we get values that were closer together or even reversed? This issue will be looked at next week. The second, are we sure we have a measure that measures pay by equal work? If not, then average compa-ratio might not be telling us anything. This issue is examined throughout the course.

So, what can we say at this point? It appears that males and females have about the same range and standard deviations for compa-ratios, but that females appear to average a bit higher than the males. However, at this point, we cannot say anything about our equal pay for equal work question as the compa-ratios may not be the best measure of equal work. So, at this point we have some interesting information, but no conclusive results yet.

Please ask your instructor if you have any questions about this material.

When you have finished with this lecture, please respond to Discussion thread 3 for this week with your initial response and responses to others over a couple of days.