Assignment: Measures of Variability
second interval is 8 (0-?, 8-?). This must mean that the last value to be included in our first interval is one less than 8, or 7. Our first interval, therefore, includes the range of values 0-7. If you count the number of different values in this interval, you will find that it includes eight different values (0, 1, 2, 3, 4, 5, 6, 7). This is our interval width of 8.
Step 4. After Your First Interval Is Determined, the Next Intervals Are Easy. They must be the same width and must not overlap (that is, they must be mutually exclusive). You must make enough intervals to include the last value in your variable distribution. The highest value in our data is 80 hours per week, so we construct the grouped frequency distribution as follows:
0-7
8-15
16-23
24-31
32-39
40-47
48-55
56-63
64-71
72-79
80-87
Notice that in order to include the highest value in our data (80 hours) we had to make 11 intervals instead of the 10 we originally decided upon in Step 1. No problem. Remember, the number of intervals is arbitrary, and this is as much art as science.
Step 5. Count the Number or Frequency of Cases That Appear in Each Interval and Their Percentage of the Total. The completed grouped frequency distribution is shown in Exhibit 14.8. Notice that this grouped frequency distribution conveys the important features of the distribution of these data. Most of the data cluster at the low end of the number of hours studied. In fact, more than two thirds of these youths studied less than 8 hours per week. Notice also that the frequency of cases thins out at each successive interval. In other words, there is a long right tail to this distribution, indicating a positive skew because fewer youths studied a high number of hours. Notice also that the distribution was created in such a way that the interval widths are all the same, and each case falls into one and only one interval (i.e., the intervals are exhaustive and mutually exclusive). We would have run into trouble if we had intervals such as 0-7 and 7-14, because we would not know where to place those youths who spent 7 hours a week studying. Should we put them in the first or second interval? If the intervals are mutually exclusive, as they are here, you will not run into these problems.
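The steps for building a grouped frequency distribution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `grouped_frequencies` and the small `hours` list are hypothetical and are not the data behind Exhibit 14.8. Only the interval width of 8 and the starting value of 0 come from the text.

```python
from collections import Counter

def grouped_frequencies(values, width=8, start=0):
    """Group whole-number values into equal-width, mutually exclusive intervals.

    Returns a list of (low, high, count, percent) tuples, one for every
    interval from `start` up to the interval that holds the largest value.
    """
    # Integer division places each value in exactly one interval index.
    counts = Counter((v - start) // width for v in values)
    last_index = (max(values) - start) // width
    total = len(values)
    table = []
    for i in range(last_index + 1):
        low = start + i * width
        high = low + width - 1  # 0-7, 8-15, 16-23, ...
        n = counts.get(i, 0)
        table.append((low, high, n, 100 * n / total))
    return table

# Hypothetical hours-studied values, invented for illustration only.
hours = [0, 1, 3, 5, 7, 7, 8, 12, 20, 80]
for low, high, n, pct in grouped_frequencies(hours):
    print(f"{low:>2}-{high:<2}  {n:>2}  {pct:5.1f}%")
```

Because the largest value is 80, the sketch produces 11 intervals, the last being 80-87, and a value of 7 can only ever land in the 0-7 interval.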
SUMMARIZING UNIVARIATE DISTRIBUTIONS
Summary statistics, sometimes called descriptive statistics, focus attention on particular aspects of a distribution and facilitate comparison among distributions. For example, suppose you wanted to report the rate of violent crimes for each city in the United States with over 100,000 in population. You could report each city's violent crime rate, but it is unlikely that two cities would have the same rate, and you would have to report approximately 200 rates, one for each city. This would be a frequency distribution that many, if not most, people would find difficult to comprehend. One way to interpret your data for your audience would be to provide a summary measure that indicates what the average violent crime rate is in large U.S. cities. That is the purpose of the set of summary statistics called measures of central tendency. You would also want to provide another summary measure that shows the variability or heterogeneity in your data, in other words, a measure that shows how different the scores are from each other or from the central tendency. That is the purpose of the set of summary statistics called measures of variation or dispersion. We will discuss each type of measurement in turn.
416 SECTION V. AFTER THE DATA ARE COLLECTED
Exhibit 14.8 Example of a Grouped Frequency Distribution From Hours Studied
Note: Total may not equal 100.0% due to rounding error.
Measures of Central Tendency
Central tendency is usually summarized with one of three statistics: the mode, the median, or the mean. For any particular application, one of these statistics may be preferable, but each has a role to play in data analysis. To choose an appropriate measure of central tendency, the analyst must consider a variable's level of measurement, the skewness of a quantitative variable's distribution, and the purpose for which the statistic is used. In addition, the analyst's personal experiences and preferences inevitably will play a role.
Mode
Mode:
The most frequent value in a distribution; also termed the probability average.
Unimodal distribution:
A distribution of a variable in which there is only one value that is the most frequent.
Bimodal distribution:
A distribution that has two nonadjacent categories with about the same number of cases, and these categories have more cases than any other categories.
Median:
The position average, or the point that divides a distribution in half (the 50th percentile).
CHAPTER 14. ANALYZING QUANTITATIVE DATA 417
Exhibit 14.9 Frequency Distribution of Offense for 1,000 Convicted Offenders
The mode is the most frequent value in a distribution. For example, refer to the data in Exhibit 14.8, which shows the grouped frequency distribution for the number of hours studied. The value with the greatest frequency in those data is the interval 0-7 hours; this is the mode of that distribution. Notice that the mode is the most frequently occurring value; it is not the frequency of that value. In other words, the mode in Exhibit 14.8 is 0-7 hours; the mode is not 881, which is the frequency of the modal category. To show how the mode can also be thought of as the value with the highest probability, refer again to Exhibit 14.8. Suppose you had this grouped frequency distribution but knew nothing else about each of the 1,272 youths in the study. If you were to pick a case at random from the distribution of 1,272 youths and were asked how many hours the youth studied per week, what would your best guess be? Well, since 881 of the 1,272 youths fall into the first interval of 0-7 hours studied, the probability that a randomly selected youth studied from 0 to 7 hours would be .693 (881 / 1,272). This is higher than the probability of any other interval. It is the interval with the highest probability because it is the interval with the greatest frequency, that is, the mode of the distribution.
When a variable distribution has one case or interval that occurs more often than the others, it is called a unimodal distribution. The ordinal variable of "parents knowing kids' whereabouts" in Exhibit 14.3 is also unimodal. The category with the highest percentage is "usually."
Sometimes a distribution has more than one mode because there are two values that have the highest frequency. This distribution would be called bimodal. Some distributions are trimodal in that there are three distinctively high frequency values. When no frequency is much higher than the others, it is possible to have a distribution without a mode. In saying that there is no mode, though, you are communicating something very important about the data: No case is more common than the others. Another potential problem with the mode is that it might happen to fall far from the main clustering of cases in a distribution. It would be misleading in this case, then, to say simply that the variable's central tendency was the same as the modal value.
Nevertheless, there are occasions when the mode is very appropriate. Most importantly, the mode is the only measure of central tendency that can be used to characterize the central tendency of variables measured at the nominal level. In Exhibit 14.9, we have the frequency distribution of the conviction offense for 1,000 offenders convicted in a criminal court. The central tendency of the distribution is property offense, because more of the 1,000 offenders were convicted of a property crime than any other crime. For the variable type of offense, the most common value is property crime. The mode also is often referred to in descriptions of the shape of a distribution. The terms unimodal and bimodal appear frequently, as do descriptive statements such as "The typical (most probable) respondent was in his 30s." Of course, when the issue is determining the most probable value, the mode is the appropriate statistic.
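The reading of the mode as the most probable value can be sketched in Python. This is a minimal illustration, not the authors' procedure: the helper name `mode_and_probability` is hypothetical, and only the 881 cases in the 0-7 interval and the total of 1,272 come from the text. The remaining 391 cases are lumped into a single placeholder category because their breakdown is not given here. The same function works for nominal categories such as offense type, since it only counts cases.

```python
from collections import Counter

def mode_and_probability(counts):
    """Return (modal category, its relative frequency) from a dict of counts."""
    total = sum(counts.values())
    category, frequency = Counter(counts).most_common(1)[0]
    return category, frequency / total

# Only 881 and the total of 1,272 come from the text; "8 or more" is a
# placeholder that lumps together every other interval (an assumption).
hours_studied = {"0-7": 881, "8 or more": 391}
modal_interval, p = mode_and_probability(hours_studied)
print(modal_interval, round(p, 3))
```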
Median
The median is the score in the middle of a rank-ordered distribution. It is, then, the score or point that divides the distribution in half (the 50th percentile). The median is inappropriate for variables measured at the nominal level because their values cannot be put in ranked order (remember, there is no order to nominal-level data), and so there is no meaningful middle position. To determine the median, we simply need to do the following: First, rank-order the values from lowest to highest. Because the median is a positional measure, we then have to find the position of the median in the rank order of scores by using the following simple formula:
Median position = (N + 1) / 2
where N is equal to the total number of cases. In Exhibit 14.10, we first list a sample of 17 U.S. cities and their rate of violent crime. We are going to calculate the median from two samples taken from this list, one sample of nine cities and another sample of eight cities.
The first sample of nine cities is shown in Exhibit 14.10a.
In this sample of nine cities, we first must find the median position, which is determined by (9 + 1) / 2 = 10 / 2 = 5. The median violent crime rate, then, is in the fifth position in this rank order. Starting either at the top of the scores and counting down to the fifth position or at the bottom and counting up, we find that in the fifth position is the score 1,861 violent crimes per 100,000, which is the median violent crime rate for these nine U.S. cities. Now let us find the median in the second list, which has only eight cities that are rank ordered in Exhibit 14.10b.
Now our median position is: (8 + 1) / 2 = 9 / 2 = 4.5. Because we now have to find the value of the median between the fourth and fifth positions, we have to find the average of the values that fall in these two positions. The score at the fourth position is 1,861, and the score at the fifth is 1,887. The value of the median can now be found by adding these two scores and dividing by 2. The median rate of violent crime for this sample of eight cities, then, is equal to (1,861 + 1,887) / 2 = 1,874 violent crimes per 100,000 population. This tells us that 50% of the cities have violent crime rates lower than 1,874 and 50% of the cities have violent crime rates higher than 1,874.
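The positional rule for the median can be sketched in Python as follows. Two assumptions are worth flagging. First, the nine rates are the ones listed in the calculation of the mean, with the second value taken as 1,461, which is the value consistent with the stated mean of 1,910.7. Second, the eight-city list is hypothetical: it simply drops the lowest of the nine rates so that the fourth and fifth scores match the 1,861 and 1,887 quoted in the text. The function name `median_by_position` is illustrative.

```python
def median_by_position(scores):
    """Median found with the positional formula (N + 1) / 2."""
    ranked = sorted(scores)              # step 1: rank-order the scores
    n = len(ranked)
    position = (n + 1) / 2               # step 2: 1-based median position
    if position.is_integer():            # odd N: one middle score
        return ranked[int(position) - 1]
    lower = ranked[int(position) - 1]    # even N: average the two scores
    upper = ranked[int(position)]        # on either side of the position
    return (lower + upper) / 2

# Rates per 100,000; the second value (1,461) is an assumption, see above.
nine_cities = [1322, 1461, 1530, 1589, 1861, 1887, 1916, 2059, 3571]
# Hypothetical eight-city sample built by dropping the lowest rate.
eight_cities = nine_cities[1:]

print(median_by_position(nine_cities))
print(median_by_position(eight_cities))
```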
Because the median is the score at the 50th percentile, we can also identify it in a frequency distribution by finding the value corresponding to a cumulative percentage of 50. We show you how to do this in Exhibit 14.11. These data are a repeat of the data in Exhibit 14.7 and show the number of hours studied for the youths in the delinquency dataset.
To find the 50th percentile, we simply added a new column to these data, labeled "cumulative percentage." Cumulative percentages are found by adding the percentage for a value to the percentages of all the values below it. So, the first value (3.0%) would be entered as the first cumulative percentage, because there are no other values below the first. This cumulative percentage simply means that 3% of the youths studied for 0 hours per week. Then we add the percentage in the next value (10.4%) to this to arrive at a cumulative percentage of 13.4%. This means that 13.4% of the youths studied for 1 hour per week or less. This becomes the second entry in the cumulative percentage column. We continue adding each adjacent percentage value until we reach 50%. The cumulative percentage first reaches 50% at the value of 5 hours per week, where it is 56.3%. The median number of hours studied per week, then, is 5 hours. Of the respondents, 50% studied less than 5 hours per week, and 50% studied more than 5 hours per week.
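Finding the median from a cumulative percentage column can be sketched as follows. Only three figures come from the text: 3.0% at 0 hours, 10.4% at 1 hour, and a cumulative 56.3% at 5 hours. The percentages shown for 2 through 5 hours and the lumped "6 or more" entry are hypothetical values chosen to be consistent with those figures. The function name is illustrative.

```python
from itertools import accumulate

def median_from_percentages(values, percentages):
    """Return the first value whose cumulative percentage reaches 50."""
    for value, cumulative in zip(values, accumulate(percentages)):
        if cumulative >= 50:
            return value
    raise ValueError("cumulative percentage never reaches 50")

# 3.0 and 10.4 come from the text; the other entries are assumptions that
# add up to the cumulative 56.3% reported at 5 hours.
hours = [0, 1, 2, 3, 4, 5, "6 or more"]
percent = [3.0, 10.4, 12.0, 11.5, 9.4, 10.0, 43.7]

print(median_from_percentages(hours, percent))
```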
Exhibit 14.10 Hypothetical Rate of Violent Crime for Selected U.S. Cities
Exhibit 14.10a Sample of Nine Cities From Exhibit 14.10
Exhibit 14.10b Sample of Eight Cities From Exhibit 14.10
Mean
The mean is simply the arithmetic average of all scores in a distribution. It is computed by adding up the value of all the cases and dividing by the total number of cases, thereby taking into account the value of each case in the distribution:
Mean = Sum of the values of all cases / Number of cases
The symbol for the mean is X̄ (pronounced "X-bar"). In algebraic notation, the equation is
$$\bar{X} = \frac{\sum_{i=1}^{N} x_i}{N}$$
where x_i is a symbol for each ith score, i goes from 1 to N, and N is the total number of cases. What the algebraic equation says to do is to sum all scores, starting at the first score and continuing until the last, or Nth, score; then divide this sum by the total number of cases (N).
Mean:
The arithmetic, or weighted, average, computed by adding up the value of all the cases and dividing by the total number of cases.
We will calculate the mean rate of violent crime for the nine U.S. cities listed in Exhibit 14.10a:
$$\bar{X} = \frac{1{,}322 + 1{,}461 + 1{,}530 + 1{,}589 + 1{,}861 + 1{,}887 + 1{,}916 + 2{,}059 + 3{,}571}{9} = \frac{17{,}196}{9} = 1{,}910.7$$
The mean rate of violent crime for these nine U.S. cities, then, is 1,910.7 violent crimes per 100,000 population. When calculating the mean, we do not have to first rank-order the scores. The mean takes every score into account, so it does not matter if we add 3,571 first, in the middle, or last.
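The calculation of the mean, and the point that the order of the scores does not matter, can be checked with a short Python sketch. The nine rates are taken from the calculation above, with the second value assumed to be 1,461 because that is the value that reproduces the stated mean of 1,910.7. The function name `mean` is illustrative.

```python
def mean(scores):
    """Arithmetic mean: the sum of all scores divided by the number of scores."""
    return sum(scores) / len(scores)

# Violent crime rates per 100,000 for the nine cities.
# The second value (1,461) is an assumption, see the note above.
rates = [1322, 1461, 1530, 1589, 1861, 1887, 1916, 2059, 3571]

print(round(mean(rates), 1))

# Adding the scores in a different order gives the same mean.
shuffled = [3571, 1530, 1322, 2059, 1861, 1461, 1916, 1589, 1887]
print(mean(shuffled) == mean(rates))
```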
Exhibit 14.11 Frequency Distribution With Continuous Quantitative Data: Hours Studied per Week
Cumulative percentages (legible entries): 0 hours, 3.0; 1 hour, 13.4; 5 hours, 56.3 (includes 50th percentile)
Hours studied   Frequency   Percentage
12              40          3.1
13              7           0.6
15              32          2.5
19              1           0.1
24              4           0.3
25              5           0.4
30              8           0.6
35              1           0.1
37              1           0.1
42              1           0.1
50              1           0.1
60              1           0.1
65              1           0.1
70              1           0.1
75              1           0.1
Total           1,272       100.0
(Rows for the remaining values are not legible in the source.)
Computing the mean requires adding up the values of the cases, so it makes sense to compute a mean only if the values of the cases can be treated as actual quantities, that is, if they reflect an interval or ratio level of measurement, or if they are ordinal and we assume that ordinal measures can be treated as intervals. It would make no sense, however, to calculate the mean for the variable racial or ethnic status. Imagine a group of four people in which there were two Caucasians, one African American, and one Hispanic. To calculate the mean you would need to solve the equation (Caucasian + Caucasian + African American + Hispanic) / 4 = ? Even if you decide that Caucasian = 1, African American = 2, and Hispanic = 3 for data entry purposes, it still does not make sense to add these numbers, because they do not represent real numerical quantities. In other words, just because you code Caucasian as "1" and African American as "2," that does not mean that African Americans possess twice the race or ethnicity that Caucasians possess. To see how numerically silly this is, note that we could just as easily have coded African Americans as "1" and Caucasians as "2." Now, with one arbitrary flip of our coding scheme, Caucasians have twice as much race or ethnicity as African Americans. Thus, neither the median nor the mean is an appropriate measure of central tendency for variables measured at the nominal level.
Median or Mean?
Both the median and the mean are used to summarize the central tendency of quantitative variables, but their suitability for a particular application must be carefully assessed.
The key issues to be considered in this assessment are the variable's level of measurement, the shape of its distribution, and the purpose of the statistical summary. Consideration of these issues will sometimes result in a decision to use both the median and the mean and will sometimes result in neither measure being seen as preferable. But in many other situations, the choice between the mean and median will be clear-cut as soon as the researcher takes the time to consider these three issues.
Level of measurement is a key concern, because to calculate the mean, we must add up the values of all the cases, a procedure that assumes the variable is measured at the interval or ratio level. So even though we know that coding Agree as 2 and Disagree as 3 does not really mean that Disagree is one unit more of disagreement than Agree, the mean assumes this evaluation to be true. Calculation of the median requires only that we order the values of cases, so we do not have to make this assumption. Technically speaking, then, the mean is an inappropriate statistic for variables measured at the ordinal level (and you already know that it is completely meaningless for nominal variables). In practice, however, many social researchers use the mean to describe the central tendency of variables measured at the ordinal level, for the reasons outlined earlier.
The shape of a variable's distribution should also be taken into account when deciding whether to use the median or the mean. When a distribution is perfectly symmetric (for example, when the distribution is bell shaped), the distribution of values below the median is a mirror image of the distribution of values above the median, and the mean and median will be the same. But the values of the mean and median are affected differently by skewness, the presence of cases with extreme values on one side of the distribution but not the other side. The median takes into account only the number of cases above and below the median point, not the value of these cases, so it is not affected in any way by extreme values. The mean is based on adding the value of all the cases, so it will be pulled in the direction of exceptionally high (or low) values. When the value of the mean is larger than the median, we know that the distribution is skewed in a positive direction, with proportionately more cases with lower values. When the mean is smaller than the median, the distribution is skewed in a negative direction.
The differential impact of skewness and/or outliers on the median and the mean can be illustrated with a simple thought exercise. Let's assume your class has 20 people and we ask you each to tell us your family's income for the past year. We determine that the mean income for the families of your class members is $72,000. We also find that the median income is $54,000, which tells us that 50% of the families make less than $54,000 and 50% of families make more. Now imagine one of Bill Gates's kids enrolls in the class. Bill Gates is estimated to make over $3.5 billion annually. Wow! That makes the mean income for the class $166,735,238. Clearly, this figure does not represent the typical family income any longer. Notice that despite Bill Gates's child entering the class, the median family income would still remain $54,000. As you can see, the median now becomes a much better measure to use when describing the typical family income!
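The arithmetic behind this thought exercise can be reproduced with a short Python sketch. The class size of 20, the mean of $72,000, the median of $54,000, and the $3.5 billion income come from the text. The five-family `incomes` list is hypothetical: it was invented so that its mean is $72,000 and its median is $54,000, and it only illustrates why the median does not move. The function name is illustrative.

```python
from statistics import median

def mean_after_adding(old_mean, n, new_value):
    """Mean of n + 1 cases, given the mean of the first n and one new value."""
    return (old_mean * n + new_value) / (n + 1)

# Mean for the class of 20 once the 21st family is added.
new_mean = mean_after_adding(72_000, 20, 3_500_000_000)
print(f"${new_mean:,.0f}")

# Hypothetical incomes with mean 72,000 and median 54,000 (an assumption).
incomes = [30_000, 42_000, 54_000, 54_000, 180_000]
before = median(incomes)
after = median(incomes + [3_500_000_000])
print(before, after)
```

In the hypothetical list the extreme value changes only which positions count as the middle, not the scores found there, so the median stays put while the mean is pulled far to the right.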
Measures of Variation
You have learned that central tendency is only one aspect of the shape of a distribution. Although the measure of center is the most important aspect for many purposes, it is still only a piece of the total picture. A summary of distributions based only on their central tendency can be very incomplete, even misleading. For example, three towns might have the same mean and median crime rate but still be very different in their social character due to the shape of the crime distributions. We show three distributions of community crime rates for three different towns in Exhibit 14.12. If you calculate the mean and median crime rate for each town, you will find that the mean and median crime rate is the same for all three. In terms of its crime rate, then, each community has the same central tendency.
As you can see, however, there is something very different about these towns. Town A is a very heterogeneous town; crime rates in its neighborhoods are neither very homogeneous nor clustered at either the low or high end. Rather, the crime rates in its neighborhoods are spread out from one another. Crime rates in these neighborhoods are, then, very diverse. Town B is characterized by neighborhoods with very homogeneous crime rates; there are no real high or low crime areas, because the rate in each neighborhood is not far from the overall mean of 62.4 crimes per 1,000. Town C is characterized by neighborhoods with either very low crime rates or very high crime rates. Crime rates in the first four neighborhoods are much lower than the mean (62.4 crimes per 1,000), whereas those in the last four neighborhoods are much higher than the mean. Although they share identical measures of central tendency, these three towns have neighborhood crime rates that are very different.
The way to capture these differences is with statistical measures of variation. Four popular measures of variation are the range, the interquartile range, the variance, and the standard deviation (which is the most popular measure of variability). To calculate each of these measures, the variable must be at the interval or ratio level. Statistical measures of variation are used infrequently with qualitative variables, so these measures will not be presented here.
Exhibit 14.12 Neighborhood Crime Rates in Three Different Towns
Range:
The true upper limit in a distribution minus the true lower limit (or the highest rounded value minus the lowest rounded value, plus one).
Outlier:
An exceptionally high or low value in a distribution.
Interquartile range:
The range in a distribution between the end of the first quartile and the beginning of the third quartile.
Quartiles:
The points in a distribution corresponding to the first 25% of the cases, the first 50% of the cases, and the last 25% of the cases.
Variance:
A statistic that measures the variability of a distribution as the average squared deviation of each score from the mean of all scores.
Range
The range is a simple measure of variation, calculated as the highest value in a distribution minus the lowest value:
Range = Highest value - Lowest value
It often is important to report the range of a distribution to identify the whole range of possible values that might be encountered. However, because the range can be drastically altered by one exceptionally high or low value (called an outlier), it does not do an adequate job of summarizing the extent of variability in a distribution. For our three towns in Exhibit 14.12, the range in crime rates for Town A is 89.9 (109.4 - 19.5), for Town B it is 6.9 (65.0 - 58.7), and for Town C it is 106.4 (115.3 - 8.9).
lnterquartile Range
A version of the range statistic, the interquartile range, avoids the problem created by unusually high or low scores in a distribution. It is the difference between the scores at the first and third quartiles. Quartiles are the points in a distribution corresponding to the first 25% of the cases (the first quartile), the first 50% of the cases (the second quartile), and the first 75% of the cases (the third quartile). You already know how to determine the second quartile, corresponding to the point in the distribution covering half of the cases; it is another name for the median. The first and third quartiles are determined in the same way, but by finding the points corresponding to 25% and 75% of the cases, respectively.
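How one outlier affects the range and the interquartile range can be sketched in Python. Because the neighborhood rates in Exhibit 14.12 are not reproduced here, the sketch reuses the nine city crime rates from the calculation of the mean, with the second value assumed to be 1,461. It also relies on `statistics.quantiles` with its default settings, which is one of several quartile conventions and may give slightly different cut points than a hand calculation. The function names are illustrative.

```python
from statistics import quantiles

def value_range(scores):
    """Range: highest value minus lowest value."""
    return max(scores) - min(scores)

def interquartile_range(scores):
    """Difference between the third and first quartiles."""
    q1, _q2, q3 = quantiles(scores, n=4)  # three cut points: Q1, median, Q3
    return q3 - q1

# Nine city rates per 100,000; the value 1,461 is an assumption, see above.
rates = [1322, 1461, 1530, 1589, 1861, 1887, 1916, 2059, 3571]
without_outlier = rates[:-1]  # drop the exceptionally high rate of 3,571

print(value_range(rates), value_range(without_outlier))
print(interquartile_range(rates), interquartile_range(without_outlier))
```

Dropping the single high score cuts the range to about a third of its original size, while the interquartile range changes far less, which is the sense in which it avoids the problem created by unusually high or low scores.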
Variance
If the mean is a good measure of central tendency, then it would seem that a good measure of variability would be the distance each score is away from the mean. Unfortunately, we cannot simply take the average distance of each score from the mean. One property of the mean is that it exactly balances negative and positive distances from it, so if we were to sum the difference between each score in a distribution and the mean of that distribution, it would always sum to zero. What we can do, though, is to square the difference of each score from the mean so the distance retains its value. This is the notion behind the variance as a measure of variability.
The variance is the average squared deviation of each case from the mean, so it takes into account the amount by which each case differs from the mean. The equation to calculate the variance is:
$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}$$
In words, this formula says to take each score and subtract the mean, then square this difference, then sum all these squared differences, and then divide this sum by N - 1, that is, the total number of scores minus 1. Calculations for the variance for the crime rate data from Town A in Exhibit 14.12 are shown in the table that follows.
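[Table: squared deviations from the mean (62.4) for the Town A neighborhood crime rates. Legible crime rates include 19.5, 35.7, 68.2, 92.0, and 109.4; the remaining entries are not legible in the source.]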
We can now determine that the variance is
$$s^2 = \frac{8{,}517.44}{8} = 1{,}064.68$$
The variance of these data, then, is 1,064.68. In squared deviation units, the variance tells us the amount of variation the distribution has around its mean. We had to square the original deviation units before summing them, because the unsquared deviations always sum to zero: $\sum (x_i - \bar{X}) = 0$. For most people, however, it is difficult to grasp squared deviation units. For this reason, we typically take the square root of this value, called the standard deviation, to bring the variable back to its original units of measurement.
Standard Deviation
The standard deviation is simply the square root of the variance. It is the square root of the average squared deviation of each case from the mean:
$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}}$$
To find the standard deviation, then, simply calculate the variance and take the square root. For our example, the standard deviation is
$$s = \sqrt{1{,}064.68} = 32.63$$
This value tells us that, on average, the neighborhood crime rates in Town A vary 32.63 around their mean of 62.4.
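The variance and standard deviation calculations can be sketched in Python. Two parts are kept separate on purpose. The first part defines general functions that follow the formula with N - 1 in the denominator and tries them on a small hypothetical list of scores. The second part only rechecks the arithmetic reported for Town A, using the sum of squared deviations (8,517.44) and the divisor (8) given in the text, because the full list of Town A crime rates is not reproduced here. All function and variable names are illustrative.

```python
import math

def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(scores)
    mean = sum(scores) / n
    squared_deviations = [(x - mean) ** 2 for x in scores]
    return sum(squared_deviations) / (n - 1)

def standard_deviation(scores):
    """Square root of the variance, back in the original units."""
    return math.sqrt(sample_variance(scores))

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(scores), standard_deviation(scores))

# Recheck of the Town A arithmetic using the figures quoted in the text.
town_a_variance = 8517.44 / 8
town_a_sd = math.sqrt(town_a_variance)
print(round(town_a_variance, 2), round(town_a_sd, 2))
```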
The standard deviation has mathematical properties that make it the preferred measure of variability in many cases. In particular, the calculation of confidence intervals
Standard deviation:
The square root of the average squared deviation of each case from the mean.